Method specification¶
1 Meta¶
Support Area: Editing and Imputation
Method Theme: Editing and Imputation
Last reviewed: 2023-08-03
2 Terminology¶
Auxiliary variable: One of a set of variables that will be used to determine the similarity of potential donor records to records requiring imputation
Distance: A numerical measure of how similar two records are to each other
Distance function: One of six pre-defined functions used to calculate the distance between two records
Donor pool: The set of values from which records requiring imputation may draw. The donor pool is the result of processing carried out on the potential donor pool.
IGroup: ‘Imputation Group’, records requiring imputation whose auxiliary variables have the same values.
Imputation variable: The variable for which values are being imputed.
Impute: To suggest likely values for missing data based on other existing data.
Potential donor pool: The set of potential donor records that will be processed into the donor pool for an IGroup.
Potential donor record: A record not requiring imputation that may, therefore, contribute the value of its imputation variable.
3 Summary¶
RBEIS is a method originally developed for imputing categorical data in relatively small social surveys with the intention of minimising conditional imputation variance. It is derived from CANCEIS, which is better suited to large datasets such as the Census. RBEIS differs from CANCEIS by constructing donor pools for each IGroup, rather than for each record. These donor pools are then further processed to be the same size as the IGroup. This implementation of RBEIS expects a Pandas DataFrame as input.
4 Assumptions¶
Distance functions expect arguments to be in the order (record, IGroup)
Distance functions assume that categorical data is ordered, i.e. category 1 is somehow ‘less’ than category 2
5 Method input and output¶
5.1 Input records¶
impute
is the function that runs RBEIS on a Pandas DataFrame. It takes four required positional arguments, and has several optional keyword arguments. An example call to impute
follows:
from rbeis.pandas import impute, RBEISDistanceFunction
impute(pd.read_csv("my_data.csv"),
"genre",
["jungle", "acid house", "UK garage"],
{"artist": RBEISDistanceFunction(1, weight = 5),
"bpm": RBEISDistanceFunction(5, custom_map = {(170, 180): 0, (140, 160): 100}, threshold = 5),
"length": RBEISDistanceFunction(2, threshold = 1.25)},
ratio = 1.5,
in_place = False,
keep_intermediates = True)
The first positional argument is a Pandas DataFrame, the dataset upon which to perform imputation.
The second is a string corresponding to the variable to be imputed. This should match the name of one of the columns in the DataFrame.
The third is a list of possible values that the imputation variable can take. In the above example, the imputation variable "genre"
can take possible values from the list ["jungle", "acid house", "UK garage"]
.
The fourth positional argument represents the auxiliar variables that will be used in the imputation. This is a dictionary whose keys are strings corresponding to the columns in the DataFrame to be used as auxiliary variables. Its values are RBEISDistanceFunction
objects representing the functions to be used to compare values within each auxiliary variable. The RBEISDistanceFunction
object is explained in greater depth below.
The three optional keyword arguments available to impute
are ratio
, in_place
and keep_intermediates
.
ratio
is a numeric argument that allows RBEIS to accept donors with distances greater than the minimum. If set, it allows RBEIS to accept values less than or equal to the value of ratio
multiplied by the minimum distance. It should be set to a number greater than 1, as values less than 1 would result in RBEIS looking for distances less than the minimum. If a value less than 1 is given, an RBEISInputException
will be thrown. ratio
has no default value, reverting to the original RBEIS behaviour of choosing only the minimum distance if it is not specified.
in_place
is a Boolean argument (default True
) which instructs impute
to overwrite the original DataFrame with the results of the imputation. If in_place
is False
, impute
returns a new DataFrame with the imputed data included.
keep_intermediates
is a Boolean argument (default False
) which instructs impute
not to delete the temporary columns it creates in the process of imputation. Its primary use is for debugging.
Distance functions¶
Distance functions are encapsulated in RBEISDistanceFunction
objects, which specify which of the six RBEIS distance functions to use and contain any required threshold values, custom mappings and weights. They can be constructed in several ways:
Standard distance function:
RBEISDistanceFunction(1)
This creates a distance function exactly as in the RBEIS specification. In this case, this is the RBEIS distance function 1.
Weighted distance function:
RBEISDistanceFunction(1, weight = 2)
This creates a distance function with its output scaled by a given factor. In this case, this is the RBEIS distance function 1 with its output scaled by a factor of 2.
Distance function with threshold:
RBEISDistanceFunction(3, threshold = 5)
This creates one of the RBEIS distance functions (2, 3, 5 or 6) that requires a threshold value. In this case, we create RBEIS distance function 3 with a threshold value of 5.
Distance function with overrides:
RBEISDistanceFunction(4, custom_map = {(1, 2): 10, (2, 1): 200})
This creates one of the RBEIS distance functions (4, 5, or 6) that overrides certain pairs of values. These overrides are represented as a dictionary whose keys are the pair of values to be overridden and whose values are the distances to return when those pairs of values are passed to the
RBEISDistanceFunction
. Note that argument order matters here; in the example above the function as applied to(1, 2)
returns a distance of 10, whereas the function applied to the(2, 1)
returns a distance of 200.
The keyword arguments weight
, threshold
and custom_map
can be used in combination with each other when necessary. In the following example, RBEIS distance function 6 is constructed with its requisite threshold and overrides. It is also given a weight
of 4.
RBEISDistanceFunction(6, custom_map = {(1, 2): 10, (5, 3): 400}, threshold = 2.5, weight = 4)
It is possible to provide redundant arguments, for example providing a threshold value to distance function 1. These will not be used, and a warning will be displayed upon creation.
RBEISDistanceFunction
s are callable, meaning that they can be used as any other function:
myDF = RBEISDistanceFunction(6, custom_map = {(1, 2): 10, (5, 3): 400}, threshold = 2.5, weight = 4)
myDF(2,3)
5.2 Output records¶
impute
modifies its input DataFrame by adding a new column containing the imputed values for a given variable, named <variable>_imputed
. If in_place
is set to False
, a new DataFrame containing this column is returned.
5.3 Error handling¶
RBEIS performs extensive checks on its inputs. Where checks find incompatible or nonsensical inputs, an exception is raised and impute
is terminated. TypeError
s are raised when inputs of inappropriate type are supplied. RBEISInputException
s are raised where inputs do not make sense for RBEIS (e.g. choosing distance function 7 when RBEIS only provides six) or where required data is missing. Where checks find unnecessary inputs (for example redundant optional arguments), a warning of class RBEISInputWarning
is raised.
5.4 Metadata¶
Currently, RBEIS collects no metadata by default. If keep_intermediates
is set to True
, the temporary values calculated by the impute
function are able to be inspected post-imputation. These values are IGroup assignments, calculated distances and records to which each record may be a donor.
6 Method¶
Assign each record requiring imputation to an imputation group (“IGroup”) based on the values of its auxiliary variables.
For each potential donor record (those not requiring imputation), calculate the distance between the values of its auxiliary variables and those of each IGroup. Each auxiliary variable should have a distance function assigned to it, and may also be assigned a weight. The total distance is calculated thus: \(\sum_{v \in A} w_{v}{f_{v}(v_{record},v_{IGroup})}\) where \(A\) is the set of auxiliary variables, \(w_v\) is the weight assigned to the auxiliary variable \(v\), \(f_v\) is the distance function assigned to the auxiliary variable \(v\), \(v_{record}\) is the value of the auxiliary variable for the current record, and \(v_{IGroup}\) is the value of the auxiliary variable for the current IGroup.
Calculate which IGroup or IGroups for which each potential donor record could be used for imputation. This is done by finding the minimum distance between any potential donor and each IGroup, and assigning those potential donors to that IGroup. An optional behaviour is to specify a factor to multiply the minimum distance by, below which to accept donors (as opposed to solely accepting donors with the minimum distance). These assigned records constitute the ‘potential donor pool’ for each IGroup.
Calculate the frequency distribution of the values of the imputation variable occurring within each IGroup’s potential donor pool. Convert these frequencies to the expected numbers of occurrences for a group with the same frequency distribution, but with a number of members equal to the size of the IGroup. This is calculated thus: \(\frac{f_v}{n_{pool}}n_{IGroup}\) where \(f_v\) is the frequency with which value \(v\) occurs, \(n_{pool}\) is the number of records in the potential donor pool, and \(n_{IGroup}\) is the number of records in the IGroup.
For each IGroup, insert into a final donor pool each potential value of the imputation variable a number of times equal to the integer part of its expected number of occurrences.
For each IGroup, using the fractional parts of each expected value as probabilities, draw a value at random and insert it into the donor pool. Remove that value, recalculate the relative probabilities of the remaining possible values, and repeat the step until the donor pool is the same size as the IGroup.
For each IGroup, assign the values in the donor pool randomly to members of the IGroup.