DEV Community

Jérôme Parent-Lévesque for Potloc


Optimally Taking Out Extra Survey Respondents

Sometimes when analysing the results of a survey, one needs to remove some respondents from their sample. This is something we do fairly commonly at Potloc in order to obtain a more representative sample of the target population in our surveys. In other words, we use this as a way of performing stratified sampling.

We use a system of quotas to keep track of every agreement on the respondent sample we make with our clients. We have three types of quotas, each corresponding to a different way of assessing whether their target is met or not.

To match targets exactly (as we want for stratified sampling), we use the strict quota type. For example, our clients might want exactly 50 respondents who work as electricians. Whether we have 49 or 51 respondents in this category, our quota is not achieved.

The second type of quota we use is the minimum type. This type, as the name suggests, simply indicates that we must have at least as many respondents of a specific category as the target number.

The third and final type of quota is the weighted type. As we often use a weighting process to obtain a more representative sample of our population, we make sure to communicate with our clients where survey responses may be weighted. This communication in turn gets converted into quotas of type weighted, which behave similarly to minimum quotas but with more flexibility. The targets don't need to be matched exactly and will instead be achieved through an independent weighting process (don't worry, this will be explained in more detail in Step 2 below). The "minimum" for this type of quota is (arbitrarily) set to 50% of the target as a way to limit the scale of the weights (this way, weights should rarely be more than 2).
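As a rough illustration, the three quota types could be modelled as follows. This is a minimal sketch rather than Potloc's implementation: the `Quota` class, the `is_met` function and the kind strings are hypothetical names, and the weighted check simply applies the 50% floor described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quota:
    name: str
    kind: str    # "strict", "minimum" or "weighted" (hypothetical labels)
    target: int


def is_met(quota: Quota, count: int) -> bool:
    """Return True when `count` respondents satisfy the quota."""
    if quota.kind == "strict":
        return count == quota.target
    if quota.kind == "minimum":
        return count >= quota.target
    if quota.kind == "weighted":
        # Floor of 50% of the target, so that weights rarely exceed 2.
        return count >= 0.5 * quota.target
    raise ValueError(f"unknown quota kind: {quota.kind}")


electricians = Quota("electricians", "strict", 50)
print(is_met(electricians, 50), is_met(electricians, 51))  # True False
```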

[Image: quotas]

The image above shows an example of a combination of quotas we could have. Here, we want to end up with a minimum of one respondent who has a cat, exactly one respondent who is a doctor, and, after weighting, one effective respondent whose name contains the letter 'a' and two effective respondents whose name is shorter than 7 letters.

Imagine now that we have received the following responses to our survey:

[Image: respondents (Alice, Bob and Catherine)]

In this initial state, we have:

  • 1 respondent who has a cat (Alice)
  • 2 doctors (Bob and Catherine)
  • 2 respondents whose name contains the letter 'a' (Alice and Catherine)
  • 2 respondents whose name is shorter than 7 letters (Alice and Bob)

The minimum quota is therefore satisfied (but Alice cannot be removed without breaking it) and there is one doctor too many. Both weighted quotas have at least 50% of their target number of respondents. As we will see later, the weighted quotas will be useful in determining the optimal respondents to take out.

We will keep referring to these quotas and respondents throughout this article to provide a practical example of how we select respondents to take out.

We set out to find the optimal selection of respondents to take out given a set of quotas such as this one. Below is the full step-by-step explanation of the algorithm we use to perform this and an example of how it is applied to this fictional set of quotas and respondents.

Step 1

We first identified that by determining which respondents belonged to which quotas, we could split the respondents into 3 different categories:

Respondents that cannot be taken out are respondents that belong to quotas for which the target is not exceeded. For example, our minimum quota "has a cat" has a target of 1 and only Alice fits into this category. Therefore, Alice cannot be taken out as otherwise the "has a cat" quota would be broken. The same goes for quotas with fewer respondents than the target, for example if the target was to have 2 respondents who own a cat.

Respondents that should be taken out as a priority are respondents that belong to a strict quota for which the target is exceeded. Since for this type of quota we want to end up with exactly the target number of respondents, we have to take out respondents belonging to this quota until that target is matched. This group takes priority over the Respondents that cannot be taken out: a respondent who qualifies for both is treated as a priority removal, since exceeded strict quotas must be brought down to their target.

Respondents that may be taken out are all remaining respondents. These respondents may or may not belong to any quota. If they do, then that quota's target has to be exceeded — otherwise they would be in the Respondents that cannot be taken out category. Note that these respondents logically cannot belong to any strict quota since those belonging to this type of quota must fit in one of the first 2 categories.

Going back to our example, our three respondents would fall into the following groups:

[Image: buckets]

The respondents that should be taken out as a priority group includes both doctors (Bob and Catherine) as there is one too many to satisfy the strict quota. Alice cannot be taken out because she is the only respondent who has a cat. The last group is empty as all respondents already belong to other groups.
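The categorisation above can be sketched in a few lines. The data model (`QUOTAS`, `RESPONDENTS`) and the function names are hypothetical, and the sketch assumes that a quota is "not exceeded" whenever removing one more respondent would drop it below its minimum (the target itself, or 50% of the target for weighted quotas).

```python
# Hypothetical data model: quota name -> (kind, target, members).
QUOTAS = {
    "has a cat": ("minimum", 1, {"Alice"}),
    "is a doctor": ("strict", 1, {"Bob", "Catherine"}),
    "name contains 'a'": ("weighted", 1, {"Alice", "Catherine"}),
    "name shorter than 7 letters": ("weighted", 2, {"Alice", "Bob"}),
}
RESPONDENTS = ["Alice", "Bob", "Catherine"]


def floor_of(kind, target):
    """Smallest acceptable count: 50% of the target for weighted quotas."""
    return 0.5 * target if kind == "weighted" else target


def categorise(respondents, quotas):
    """Split respondents into the three Step 1 categories."""
    kept = set(respondents)
    priority, cannot, may = [], [], []
    for r in respondents:
        # (kind, target, current count) for every quota r belongs to
        mine = [(kind, target, len(members & kept))
                for kind, target, members in quotas.values() if r in members]
        if any(kind == "strict" and count > target
               for kind, target, count in mine):
            priority.append(r)  # checked first: it overrides "cannot"
        elif any(count - 1 < floor_of(kind, target)
                 for kind, target, count in mine):
            cannot.append(r)    # removing r would break one of its quotas
        else:
            may.append(r)
    return priority, cannot, may


priority, cannot, may = categorise(RESPONDENTS, QUOTAS)
print(priority, cannot, may)  # ['Bob', 'Catherine'] ['Alice'] []
```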

Step 2

Now that we have a categorisation of each respondent, we are almost ready to start taking out respondents. However, since our objective is to optimally take out respondents, we need to compute one more piece of data related to weighted-type quotas.

First, we need to define a bit better what we mean by optimally here.
Our survey results are usually calculated on weighted data in order to better match the target population demographics.

In other words, as part of our survey workflow, we compute a weight for each respondent and use it as a multiplicative factor to scale the "importance" of each survey response. This is a process called weighting.

The weights can be interpreted as a measure of the quality of our respondent sample by looking at their distribution. The further away from the value of 1 the weights are, the worse the quality. Indeed, a small weight indicates that we have too many similar respondents and a large weight indicates that we are missing respondents with similar characteristics.
For more details on the weighting process I invite you to read my previous blog post on Generalized Weighting.

Thus, when taking out extra respondents, we would like to ensure that our weighting quality will be unaffected. This is the key to our notion of optimality — we want not only to satisfy all quotas but also to obtain the highest possible weighting quality as a result.

To achieve this, we compute weights for each respondent based on the targets of the weighted quotas. Using a raked weighting algorithm, we use all weighted quota numbers as "targets" to obtain respondent weights.

In our example, we obtain the following weights by using the targets from the two weighted quotas:

[Image: weights (Alice 0.59, Bob 1.41, Catherine 0.41)]

Notice that by multiplying each respondent's (numerical) answer by their weight, the weighted quota targets are matched perfectly! The count of respondents whose name contains the letter 'a' becomes 1 (from 2) as it is now the sum of 0.59 and 0.41. Meanwhile, the count of respondents whose name is shorter than 7 letters stays 2 (the sum of 0.59 and 1.41), although the weight of each respondent differs.
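To make the weighting step concrete, here is a minimal raking sketch. It is not the implementation described in the Generalized Weighting post: it assumes a plain iterative proportional fitting loop that repeatedly rescales the members of each weighted quota until their weights sum to the quota's target. The `rake` function and its parameters are hypothetical names. On this example, it reproduces the weights quoted above.

```python
def rake(respondents, weighted_quotas, iterations=100):
    """Iteratively rescale weights until each weighted quota hits its target.

    `weighted_quotas` is a list of (target, members) pairs.
    """
    weights = {r: 1.0 for r in respondents}
    for _ in range(iterations):
        for target, members in weighted_quotas:
            present = [r for r in members if r in weights]
            total = sum(weights[r] for r in present)
            if total == 0:
                continue  # nobody left in this quota, nothing to scale
            factor = target / total
            for r in present:
                weights[r] *= factor
    return weights


weights = rake(
    ["Alice", "Bob", "Catherine"],
    [
        (1, {"Alice", "Catherine"}),  # name contains 'a'
        (2, {"Alice", "Bob"}),        # name shorter than 7 letters
    ],
)
print({name: round(w, 2) for name, w in weights.items()})
# {'Alice': 0.59, 'Bob': 1.41, 'Catherine': 0.41}
```

A fixed number of iterations keeps the sketch short; a real implementation would more likely stop once the weighted counts are within some tolerance of their targets.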

In the next step, we will be removing respondents with the smallest weights first whenever we cannot decide who to take out!

Step 3

Our respondents now belong to one of the 3 categories presented in Step 1 and each has a weight resulting from the raked weighting computation from Step 2. It is now time to start taking out respondents.

The core of the strategy here is to take out respondents one-by-one. After each respondent that is taken out, our quotas and weighting need to be updated, meaning that steps 1 and 2 need to be performed again! We therefore perform this step in a loop where in each iteration we recompute the first 2 steps before choosing and taking out 1 respondent.

This respondent is chosen according to the given priority list:

  • Select a pool of respondents to pick from:
    • If there are any respondents in the Respondents that should be taken out as a priority category, then limit our selection to this group only
    • Otherwise, if there are any Respondents that may be taken out, select this group
  • From this pool, select the optimal respondent to be taken out:
    • The optimal respondent corresponds to the respondent with the smallest weight, since a small weight indicates that we have many similar respondents
    • In the case of a tie, or if there are no weighted quotas, we remove the last respondent to have answered the survey. (Note: the statistically correct thing to do here would be to remove a random respondent from the pool, but we choose this approach as it is deterministic: we can re-run the algorithm and the selected respondents will be the same. Additionally, this replicates the behaviour of traditional sampling tools that have quota-stops.)
  • Take out the selected respondent and repeat from Step 1 until all strict quotas are met!
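The selection rule in this list can be sketched as a single function. The names (`pick_respondent`, `answer_order`) are hypothetical, the answer order used in the example is an assumption, and weights are rounded before comparison so that floating-point noise does not hide a tie.

```python
def pick_respondent(priority, may, weights, answer_order):
    """Return the next respondent to take out, or None if no pool is available.

    `answer_order` maps each respondent to the position at which they
    answered the survey (larger means later).
    """
    pool = priority if priority else may
    if not pool:
        return None
    # Smallest weight first; on a tie, the latest answer is taken out.
    # Missing weights default to 1.0, which covers "no weighted quotas".
    return min(
        pool,
        key=lambda r: (round(weights.get(r, 1.0), 6), -answer_order[r]),
    )


order = {"Alice": 0, "Bob": 1, "Catherine": 2}  # assumed answer order
weights = {"Alice": 0.59, "Bob": 1.41, "Catherine": 0.41}
print(pick_respondent(["Bob", "Catherine"], [], weights, order))  # Catherine
```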

Here's how this would play out in our fictional example:

  • We have 2 respondents in the Respondents that should be taken out as a priority category (Bob and Catherine) and thus only these respondents are taken into consideration
  • To determine who to remove from these two respondents, we take a look at their data:

[Image: decision-respondents]

  • Since Catherine has the smallest weight (0.41 vs. 1.41), she is taken out. Intuitively, this makes sense as we had one respondent too many whose name contains the letter 'a' to satisfy the weighted quota without even applying a weighting.

We are now left with one respondent whose name contains the letter 'a' and two respondents whose name is shorter than 7 letters, meaning that our final weights will be exactly 1 — the optimal value for weights!

Additionally, now that respondent "Catherine" has been taken out, all of our quotas are satisfied and we can stop the algorithm here.

Conclusion

Using this process, we are able to remove respondents so as to match our quotas as closely as we can, while also obtaining a better survey weighting. Indeed, since we always remove respondents with the smallest weights, our weighting gets progressively better as the minimum weight gets closer and closer to 1 (the optimal value). This means that the final data presented for this survey, after the weighting step, will be more representative of the target population: a win for both Potloc and our clients!

Interested in what we do at Potloc? Come join us! We are hiring 🚀

Appendix - The case of multiple overlapping strict quotas

We might sometimes have respondents that correspond to multiple different strict quotas. In this scenario, it is more complicated to select the optimal respondents to take out, as it is not always obvious what the smallest possible set of respondents to remove is in order to satisfy all such quotas. For example, we might take out a respondent (let's call them respondent A) because they belong to a strict quota that has exceeded its target. We might then take out other respondents who also belong to this quota because they belong to another strict quota that was exceeded too. If the first quota had already met its target exactly, these later removals push it under its target. In this scenario, respondent A can now be reinstated, as the strict quota they belonged to is now under its target.

To alleviate this problem while avoiding the complicated and expensive decision-process solutions from the field of operations research, we employ two mechanisms.

First, we try to make a better choice of which strict quota respondents to take out first. To do this, we also consider the following factors:

  • Whether the respondent can be disqualified or not (that is, whether all of their quotas are exceeded)
  • The number of exceeded strict quotas a respondent is a part of (more = higher priority)
  • The total number of strict quotas a respondent is a part of (fewer = higher priority)
    • This is used to minimise the impact of taking out respondents on other quotas
  • The minimum difference between the current count and the target count of strict quotas that are exceeding their target (bigger = higher priority, as there is more room to remove respondents)
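One way to combine these factors is a sort key. The post lists the factors but does not say how they are weighed against each other, so this sketch assumes they are applied lexicographically in the order listed. The `removal_order` function and the quota triples are hypothetical.

```python
def removal_order(candidates, quotas):
    """Sort candidates from most to least preferable to take out.

    `quotas` is a list of (kind, target, members) triples, where `members`
    is the set of currently kept respondents in that quota.
    """
    def exceeded(kind, target, members):
        floor = 0.5 * target if kind == "weighted" else target
        return len(members) - 1 >= floor  # one removal keeps the quota valid

    def key(r):
        mine = [q for q in quotas if r in q[2]]
        strict = [q for q in mine if q[0] == "strict"]
        over = [q for q in strict if exceeded(*q)]
        removable = all(exceeded(*q) for q in mine)
        slack = min((len(m) - t for _, t, m in over), default=0)
        # Tuples sort ascending, so "higher priority" factors are negated.
        return (not removable, -len(over), len(strict), -slack)

    return sorted(candidates, key=key)


quotas = [
    ("strict", 1, {"r1", "r2", "r3"}),
    ("strict", 1, {"r2", "r4"}),
    ("minimum", 1, {"r3"}),
]
print(removal_order(["r1", "r2", "r3", "r4"], quotas))
# ['r2', 'r1', 'r4', 'r3']
```

Here `r2` comes first because it sits in two exceeded strict quotas, while `r3` comes last because removing it would break the minimum quota.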

Second, we add a final step at the end of the process in which we restore respondents that can be restored without breaking any quota. This solves the issue highlighted in the example above.
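A minimal sketch of such a restore pass is shown below. It assumes that removed respondents are tried in the order they were taken out and that only strict quotas can be broken by adding someone back. The `restore` function and its arguments are hypothetical.

```python
def restore(removed, kept, strict_quotas):
    """Reinstate removed respondents while every strict quota stays within target.

    `strict_quotas` is a list of (target, members) pairs, where `members`
    contains every respondent in the quota, kept or removed.
    """
    kept = list(kept)
    restored = []
    for r in removed:
        # r fits if each of its strict quotas still has room for one more.
        fits = all(
            sum(1 for k in kept if k in members) < target
            for target, members in strict_quotas
            if r in members
        )
        if fits:
            kept.append(r)
            restored.append(r)
    return kept, restored


# A was removed first, then B was removed for the second quota,
# which left the first quota one short of its target of 2.
strict_quotas = [(2, {"A", "B", "C"}), (1, {"B", "D"})]
kept, restored = restore(["A", "B"], ["C", "D"], strict_quotas)
print(kept, restored)  # ['C', 'D', 'A'] ['A']
```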

Using these two approximations we are able to get a result that is close to optimal, for a minimal cost.
