Chapter 7 Probability Proportional to Size Sampling

We can use auxiliary information in a variety of ways to improve estimates over SRS. In particular we note that since the most influential units in an estimate are often the largest, it is often in our interest to make sure we select the largest units with higher probability than the smaller units. This idea leads naturally to probability proportional to size sampling, where each unit has a distinct probability of selection $\psi_i$ on any given draw, which is related to its size.

7.1 Cumulative selection method

Let’s say we want to estimate the number of calves born in spring on farms supplying milk for Town Milk Supply. We have information on the number of milking cows on each such farm and indeed each cow is registered on a computer database. Farms vary in size, and we want to make sure we give larger farms (those with more cows) a greater chance of selection, since they will contribute more variability to our estimates. Assume that there are $N$ farms, labelled $i=1,\ldots,N$ , and that there are $X_i$ cows on farm $i$ .

We give farm $i$ a probability of selection of $\psi_i=X_i/X$ on each draw from the population. Here $X=\sum_i X_i$ is the total number of cows on all the farms. We want to have a sample of $n$ farms. To achieve this we put the $N$ farms into a list, and then number all of the cows on all of the farms as $1,\ldots,X$ .

There are $X_1$ cows on the first farm, so they are numbered $1,\ldots,X_1$ . On the second farm where there are $X_2$ cows we number them $X_1+1,\ldots,X_1+X_2$ . The cows on the third farm are numbered $X_1+X_2+1,\ldots,X_1+X_2+X_3$ and so on up to the $N^$ farm and the $X^$ cow.

To choose our random sample we:

Choose a random number $r\in\<1,\ldots,X\>$
If cow number $r$ is in farm $i$ then select that farm into the sample
Repeat until we have $n$ farms in the sample

Note that this sampling is probability proportional to size with replacement (PPSWR). We don’t remove a farm (or a cow) from the list once it has been selected. This is necessary since there is no (known? possible?) way to select a probability proportional to size (PPS) sample without replacement. Inevitably, therefore, we will end up with some farms being selected more than once. What do we do with them? We do not drop the repeated observations – we must use the observations for those farms as many times as they appear.

We only collect the data once of course, it’s just that if a unit appears more than once in the sample, we must use its data repeatedly in our estimator.

7.2 Rejection Method

An alternative way of drawing a PPSWR which is simpler to implement than the cumulative method given above is the Rejection Method. Here again we have $N$ units in the population, and a size measure $X_i$ for each one. The largest unit is of size $M_X=(X_i)$

Draw a random number $i\in\<1,\ldots,N\>$ , which means we are considering whether or not to select unit $i$ .
Draw another random number $r\in\<1,\ldots,M_X\>$ .
If $r\leq X_i$ then select unit $i$ in the sample, otherwise go back to Step 1.
Repeat until $n$ units have been selected.

This is called a rejection method, since sometimes we consider a unit for inclusion in the sample, but then reject it. In general it takes more than $n$ passes through the sequence of steps to select $n$ sample members.

Example

Consider the following population of $N=6$ university classes:

Table 7.1: University Classes

Unit, $i$	Size, $X_i$	Cumulative size, $_i=\sum_^iX_j$
1	10	10
2	15	25
3	40	65
4	40	105
5	80	185
6	100	285

We have a size measure for each unit $X_i$ =the number of student in class $i$ . We have calculated the cumulative running total $_i$ of $X_i$ for all units up to an including unit $i$ . The total of the sizes is $X=_N=285$ . Now let’s select a sample of size using the cumulative method. We generate four random numbers in the range $1,\ldots,285$ : say these numbers are $\$ . The $165^$ student is in class 5, so we select that class into the sample. The $205^$ student is in class 6, and so is the $197^$ , so we select class 6 twice into the sample. Finally the $12^$ student is in class 2. Thus our final sample is $\$ .

How does the rejection method work in this situation? We first note that the largest class has size $M_X=100$ . We select a random number $i$ in the range $1,\ldots,6$ , say $i=4$ : thus we are considering class 4 for selection, which has size 40. We then draw another random number $r$ in the range $1,\ldots,100$ : say $r=2$ . We see that $r\leq X_4$ (i.e. $2\leq 40$ ), so we select class 4 into the sample as the first unit.

Next we select another $i$ value – say we get $i=4$ again. We draw $r$ again, but this time get $r=99$ , which is greater than $X_4=40$ . So we reject class 4 as the second unit.

We continue in this way, drawing pairs of random numbers $(i,r)$ and testing to see if $r\leq X_i$ .

Table 7.2: Rejection method for selecting University classes by PPSWR

$i$	$r$	$X_i$	$r\leq X_i$ ?	Action
4	2	40	Yes	Acecpt class 4
4	99	40	No	Reject and continue
6	57	100	Yes	Accept class 6
5	63	80	Yes	Accept class 5
2	86	15	No	Reject and continue
2	57	15	No	Reject and continue
6	33	100	Yes	Accept class 6

So we end up with the sample $\$ .

7.3 Inclusion probabilities

We have seen that the probability that a unit is selected on any given PPSWR draw is $\psi_i=X_i/X$ . What is the probability $\pi_i$ that a unit is selected into sample of size $n$ ?

The probability that a unit is selected is one minus the probability that on every one of the $n$ draws it is not selected: \[\begin \pi_i = 1 - (1-\psi_i)^n = 1-\left(1-\frac\right)^n \tag \end\] By noting that \[\begin p(\text) &=& p(i) + p(j) - p(\text)\\ &=& p(i) + p(j) - [1-p(\text)] \end\] it can easily be shown that the 2nd order inclusion probabilities are: \[\begin \pi_ = 1-(1-\psi_i)^n-(1-\psi_j)^n+(1-\psi_i-\psi_j)^n \end\]

We have already noted that in with replacement sampling, there is a chance that a unit may be selected multiple times. In general unit $i$ is selected $Q_i$ times, where $Q_i$ is a Binomial random variable: \[\begin Q_i \sim (n,\psi_i) \end\] and the full set of $N$ $Q_i$ values is distributed according to a multinomial distribution \[\begin \sim (n,\psi) \end\]

7.4 Estimators in PPSWR

7.4.1 Total

The Horwitz-Thompson estimator of the total is, as always \[\begin \widehat = \sum_ \frac <\pi_k>\end\] where the probabilities of selection $\pi_i$ are given by Equation (7.1). However we can use the following approximation which holds for small $\psi_i$ : \[\begin \pi_i = 1-\left(1-\frac>\right)^ \approx n\frac> = n\psi_i \end\] This suggests another estimator for the total called the Hansen-Hurwitz estimator, where we replace the $\pi_i$ values with $n\psi_i$ : \[\begin \widehat_ = \frac \sum_ \frac <\psi_k>\end\] This estimator is unbiased, and has variance \[\begin \bfa<\widehat_> = \frac\sum_^\left(\frac><\psi_i>-Y\right)^\psi_ \end\] An unbiased estimator of this variance from the sample is \[\begin \bfa<\widehat><\widehat_> = \frac \sum_\left(\frac<\psi_k>-\widehat_\right)^ \end\] It is easily seen that the variance of the HH estimator is zero if all the values of the variable of interest are proportional to their size measure. Indeed, this is the motivation for such designs. However, in practice, this situation never holds.

The question as to whether the HH estimator or the HT estimator has the smaller variance is not straightforward and in fact depends on the particular configuration of the population values: how correlated are the variable of interest and the size measure, what are the coefficients of variation of the variable of interest and the size measure, etc.

7.4.2 Mean

The Hansen-Hurwitz estimator of the mean follows straightforwardly from that for the total: \[\begin \widehat>_ = \frac_> = \frac \sum_ \frac <\psi_k>\end\] This estimator is unbiased and has variance \[\begin \bfa<\widehat>_> = \frac\sum_^\left(\frac><\psi_i>-Y\right)^\psi_ \end\] which can be estimated from the sample by \[\begin \bfa<\widehat>_> = \frac \sum_\left(\frac<\psi_k>-\widehat_\right)^ \end\]

7.5 Example

We have a dataset of the 2020 population, GDP and military expenditure of $N=150$ countries.

## [1] 150

2020 Military Expenditure

Figure 7.1: 2020 Military Expenditure

The distribution is very strongly right skewed - with the USA dominating expenditure. The total value of all military expenditure in 2020 was Bn$2000.07

If we take a SRSWOR of $n=30$ countries we can create an estimate of the total world military expenditure.

The data in the sample are listed in Table 7.3 and displayed in Figure 7.2.

Table 7.3: SRS of 30 countries

Country	Population	GDP (Bn$)	Military Expenditure (Bn$)
Angola	32866268	104.128681	0.9935944
Belarus	9379952	58.482353	0.8445129
Brazil	212559409	1749.104722	19.7363478
Canada	38005238	1600.331195	22.7548471
Central African Republic	4829764	2.001438	0.0413036
Congo, Dem. Rep.	89561404	45.259707	0.3620916
Gambia, The	2416664	1.672833	0.0148050
Ghana	31072945	62.724595	0.2398872
Guinea-Bissau	1967998	1.218760	0.0233067
Haiti	11402533	14.956795	0.0002638
Iraq	40222503	170.857728	7.0155588
Italy	59554023	1744.731952	28.9213428
Jamaica	2961161	13.440715	0.2444328
Korea, Rep.	51780579	1623.895081	45.7353926
Kosovo	1775378	7.144368	0.0789650
Liberia	5057677	3.115543	0.0169370
Malta	525285	12.886270	0.0806132
Moldova	2620495	8.517410	0.0445338
Montenegro	621306	4.046328	0.1020905
Morocco	36910558	105.726172	4.8309564
Mozambique	31255435	17.959222	0.1537419
Norway	5379475	403.779725	7.1125385
Paraguay	7132530	40.446809	0.3643422
Peru	32971846	190.979129	2.6331234
Romania	19286123	208.838847	5.7268442
Slovenia	2100126	48.124693	0.5748319
Somalia	15893219	5.151914	0.0983850
Tunisia	11818618	44.681504	1.1573724
Turkey	84339067	1015.326663	17.7246321
Zambia	18383956	23.418946	0.2121424

Military Expenditure of a SRSWOR of 30 countries

Figure 7.2: Military Expenditure of a SRSWOR of 30 countries

The mean expenditure in the sample is Bn$5.59 with standard deviation Bn$10.79. This leads to an estimate of \[ \widehat_ = N\bar = (158)(5.59) = 839.2 \] with standard error \[\begin \bfa<\widehat_> &=& \sqrt \\ &=& \sqrt<(150)^2\left(1-\frac\right)\frac> \\ &=& 264.34 \end\] and RSE \[ \bfa<\widehat_> = \frac<\bfa<\widehat_> ><\widehat_> = \frac = 0.315 \] This is clearly a worthless estimate. The extreme skewness of the data has led to an RSE which is completely unacceptable.

In a situation where we have a variable on the frame that is likely to correlate with the size of military expenditure, we can use PPSWR to do a lot better. Gross Domestic Product (GDP) does correlate strongly with military expenditure - as can be seen in the diagram below.

GDP and Military Expenditure

Figure 7.3: GDP and Military Expenditure

Note that we could use population size, but GDP is a better indicator of the size of the economy, which is what we would expect to correlate best with military expenditure).

If we select a sample of $n=30$ countries by ppswr we get, for example, the units

8, 8, 11, 13, 31, 31, 31, 31, 31, 50, 50, 54, 65, 65, 73, 77, 98, 103, 133, 134, 147, 148, 148, 148, 148, 148, 148, 148, 148, 150

There are a few units which have been selected multiple times – most notably Unit 31 (China) and Unit 148 (USA), which respectively occur 5 and 8 times out of 30.

An estimate of the population total from this sample is \[\begin \widehat_ &=& \frac \sum_ \frac <\psi_k>\\ &=& 2032.41 \end\] This is a very good estimate (recall that the true value is 2000.07).

7.6 Simple Random Sampling with Replacement

If we are carrying out a simple random sample with replacement (SRSWR), the probability a unit is selected on each draw is the same for every unit: \[ \psi_i= \frac \] So the HH estimator of the population total from a SRSWR of size $n$ is \[\begin \widehat_ &=& \frac\sum_^n \frac<\psi_k>\\ &=& \frac\sum_^n y_k\\ &=& N\bar \end\] which is exactly the same as the HT estimator from SRSWOR. The weights are just the same: $w_k=N/n$ .

However the variance estimator is different: \[\begin \bfa>_> &=& \frac \sum_^n \left(\frac <\psi_k>- \widehat_\right)^2\\ &=& \frac \sum_^n \left(Ny_k - N\bar\right)^2\\ &=& \frac \sum_^n \left(y_k - \bar\right)^2\\ &=& N^2\frac \end\] which is almost the same as the HT estimator: except that it lacks the finite population correction: \[\begin \bfa>_> &=& N^2\left(1-\frac\right)\frac \end\] This means that even as $n\rightarrow N$ the variance of the estimator does not decrease to zero. That’s because is a with-replacement design, a sample of size $n=N$ is not guaranteed to be a census.

That makes the variance of the SRSWR estimator a bit larger than that of the SRSWOR design – and the estimator is less efficient. i.e. for the same sample size $n$ we get a larger variance.

7.7 Remarks

The lack of efficiency due to the lack of the fpc is typical of with-replacement designs. However if $n$ is much smaller than $N$ the loss of efficiency isn’t important.

In PPSWR designs where the size measure $X_i$ is strongly correlated with the characteristic of interest $Y_i$ then (even without the fpc) PPSWR can have dramatic gains over SRSWOR.

Also note that if the fpc is ignored in any survey sampling analysis (and it often is) – then the analysis effectively treats the sampling as if it were with replacement. Moreover, if we are given data from a complex design – but are only given the survey weights (without enough information to calculate the joint selection probabilities) then we are forced into treating the sample as if it were drawn with replacement.