Method specification

1 Meta

  • Support Area: Editing and Imputation

  • Method Theme: Editing and Imputation

  • Last reviewed: 2023-08-03

2 Terminology

  • Auxiliary variable: One of a set of variables that will be used to determine the similarity of potential donor records to records requiring imputation

  • Distance: A numerical measure of how similar two records are to each other

  • Distance function: One of six pre-defined functions used to calculate the distance between two records

  • Donor pool: The set of values from which records requiring imputation may draw. The donor pool is the result of processing carried out on the potential donor pool.

  • IGroup: ‘Imputation Group’, records requiring imputation whose auxiliary variables have the same values.

  • Imputation variable: The variable for which values are being imputed.

  • Impute: To suggest likely values for missing data based on other existing data.

  • Potential donor pool: The set of potential donor records that will be processed into the donor pool for an IGroup.

  • Potential donor record: A record not requiring imputation that may, therefore, contribute the value of its imputation variable.

3 Summary

RBEIS is a method originally developed for imputing categorical data in relatively small social surveys with the intention of minimising conditional imputation variance. It is derived from CANCEIS, which is better suited to large datasets such as the Census. RBEIS differs from CANCEIS by constructing donor pools for each IGroup, rather than for each record. These donor pools are then further processed to be the same size as the IGroup. This implementation of RBEIS expects a Pandas DataFrame as input.

4 Assumptions

  • Distance functions expect arguments to be in the order (record, IGroup)

  • Distance functions assume that categorical data is ordered, i.e. category 1 is somehow ‘less’ than category 2

5 Method input and output

5.1 Input records

impute is the function that runs RBEIS on a Pandas DataFrame. It takes four required positional arguments, and has several optional keyword arguments. An example call to impute follows:

from rbeis.pandas import impute, RBEISDistanceFunction
impute(pd.read_csv("my_data.csv"),
       "genre",
       ["jungle", "acid house", "UK garage"],
       {"artist": RBEISDistanceFunction(1, weight = 5),
        "bpm": RBEISDistanceFunction(5, custom_map = {(170, 180): 0, (140, 160): 100}, threshold = 5),
        "length": RBEISDistanceFunction(2, threshold = 1.25)},
       ratio = 1.5,
       in_place = False,
       keep_intermediates = True)

The first positional argument is a Pandas DataFrame, the dataset upon which to perform imputation.

The second is a string corresponding to the variable to be imputed. This should match the name of one of the columns in the DataFrame.

The third is a list of possible values that the imputation variable can take. In the above example, the imputation variable "genre" can take possible values from the list ["jungle", "acid house", "UK garage"].

The fourth positional argument represents the auxiliar variables that will be used in the imputation. This is a dictionary whose keys are strings corresponding to the columns in the DataFrame to be used as auxiliary variables. Its values are RBEISDistanceFunction objects representing the functions to be used to compare values within each auxiliary variable. The RBEISDistanceFunction object is explained in greater depth below.

The three optional keyword arguments available to impute are ratio, in_place and keep_intermediates.

ratio is a numeric argument that allows RBEIS to accept donors with distances greater than the minimum. If set, it allows RBEIS to accept values less than or equal to the value of ratio multiplied by the minimum distance. It should be set to a number greater than 1, as values less than 1 would result in RBEIS looking for distances less than the minimum. If a value less than 1 is given, an RBEISInputException will be thrown. ratio has no default value, reverting to the original RBEIS behaviour of choosing only the minimum distance if it is not specified.

in_place is a Boolean argument (default True) which instructs impute to overwrite the original DataFrame with the results of the imputation. If in_place is False, impute returns a new DataFrame with the imputed data included.

keep_intermediates is a Boolean argument (default False) which instructs impute not to delete the temporary columns it creates in the process of imputation. Its primary use is for debugging.

Distance functions

Distance functions are encapsulated in RBEISDistanceFunction objects, which specify which of the six RBEIS distance functions to use and contain any required threshold values, custom mappings and weights. They can be constructed in several ways:

  1. Standard distance function: RBEISDistanceFunction(1)

    • This creates a distance function exactly as in the RBEIS specification. In this case, this is the RBEIS distance function 1.

  2. Weighted distance function: RBEISDistanceFunction(1, weight = 2)

    • This creates a distance function with its output scaled by a given factor. In this case, this is the RBEIS distance function 1 with its output scaled by a factor of 2.

  3. Distance function with threshold: RBEISDistanceFunction(3, threshold = 5)

    • This creates one of the RBEIS distance functions (2, 3, 5 or 6) that requires a threshold value. In this case, we create RBEIS distance function 3 with a threshold value of 5.

  4. Distance function with overrides: RBEISDistanceFunction(4, custom_map = {(1, 2): 10, (2, 1): 200})

    • This creates one of the RBEIS distance functions (4, 5, or 6) that overrides certain pairs of values. These overrides are represented as a dictionary whose keys are the pair of values to be overridden and whose values are the distances to return when those pairs of values are passed to the RBEISDistanceFunction. Note that argument order matters here; in the example above the function as applied to (1, 2) returns a distance of 10, whereas the function applied to the (2, 1) returns a distance of 200.

The keyword arguments weight, threshold and custom_map can be used in combination with each other when necessary. In the following example, RBEIS distance function 6 is constructed with its requisite threshold and overrides. It is also given a weight of 4.

RBEISDistanceFunction(6, custom_map = {(1, 2): 10, (5, 3): 400}, threshold = 2.5, weight = 4)

It is possible to provide redundant arguments, for example providing a threshold value to distance function 1. These will not be used, and a warning will be displayed upon creation.

RBEISDistanceFunctions are callable, meaning that they can be used as any other function:

myDF = RBEISDistanceFunction(6, custom_map = {(1, 2): 10, (5, 3): 400}, threshold = 2.5, weight = 4)
myDF(2,3)

5.2 Output records

impute modifies its input DataFrame by adding a new column containing the imputed values for a given variable, named <variable>_imputed. If in_place is set to False, a new DataFrame containing this column is returned.

5.3 Error handling

RBEIS performs extensive checks on its inputs. Where checks find incompatible or nonsensical inputs, an exception is raised and impute is terminated. TypeErrors are raised when inputs of inappropriate type are supplied. RBEISInputExceptions are raised where inputs do not make sense for RBEIS (e.g. choosing distance function 7 when RBEIS only provides six) or where required data is missing. Where checks find unnecessary inputs (for example redundant optional arguments), a warning of class RBEISInputWarning is raised.

5.4 Metadata

Currently, RBEIS collects no metadata by default. If keep_intermediates is set to True, the temporary values calculated by the impute function are able to be inspected post-imputation. These values are IGroup assignments, calculated distances and records to which each record may be a donor.

6 Method

  1. Assign each record requiring imputation to an imputation group (“IGroup”) based on the values of its auxiliary variables.

  2. For each potential donor record (those not requiring imputation), calculate the distance between the values of its auxiliary variables and those of each IGroup. Each auxiliary variable should have a distance function assigned to it, and may also be assigned a weight. The total distance is calculated thus: \(\sum_{v \in A} w_{v}{f_{v}(v_{record},v_{IGroup})}\) where \(A\) is the set of auxiliary variables, \(w_v\) is the weight assigned to the auxiliary variable \(v\), \(f_v\) is the distance function assigned to the auxiliary variable \(v\), \(v_{record}\) is the value of the auxiliary variable for the current record, and \(v_{IGroup}\) is the value of the auxiliary variable for the current IGroup.

  3. Calculate which IGroup or IGroups for which each potential donor record could be used for imputation. This is done by finding the minimum distance between any potential donor and each IGroup, and assigning those potential donors to that IGroup. An optional behaviour is to specify a factor to multiply the minimum distance by, below which to accept donors (as opposed to solely accepting donors with the minimum distance). These assigned records constitute the ‘potential donor pool’ for each IGroup.

  4. Calculate the frequency distribution of the values of the imputation variable occurring within each IGroup’s potential donor pool. Convert these frequencies to the expected numbers of occurrences for a group with the same frequency distribution, but with a number of members equal to the size of the IGroup. This is calculated thus: \(\frac{f_v}{n_{pool}}n_{IGroup}\) where \(f_v\) is the frequency with which value \(v\) occurs, \(n_{pool}\) is the number of records in the potential donor pool, and \(n_{IGroup}\) is the number of records in the IGroup.

  5. For each IGroup, insert into a final donor pool each potential value of the imputation variable a number of times equal to the integer part of its expected number of occurrences.

  6. For each IGroup, using the fractional parts of each expected value as probabilities, draw a value at random and insert it into the donor pool. Remove that value, recalculate the relative probabilities of the remaining possible values, and repeat the step until the donor pool is the same size as the IGroup.

  7. For each IGroup, assign the values in the donor pool randomly to members of the IGroup.