API Reference
This part of the project documentation focuses on an information-oriented approach. Use it as a
reference for the technical implementation of the gdhi-adj codebase.
Main Pipeline
gdhi_adj.pipeline
Run each module of the pipeline based on config parameters.
run_pipeline(config_path)
Run the GDHI adjustment pipeline.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config_path | str | Path to the configuration file. | required |
Preprocessing
gdhi_adj.preprocess.calc_preprocess
Module for calculations to preprocess data in the gdhi_adj project.
calc_iqr(df: pd.DataFrame, iqr_prefix: str, group_col: str, val_col: str, iqr_lower_quantile: float = 0.25, iqr_upper_quantile: float = 0.75, iqr_multiplier: float = 3.0) -> pd.DataFrame
Calculates the interquartile range (IQR) for each LSOA in the DataFrame.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input DataFrame. | required |
| iqr_prefix | str | Prefix for the IQR column names. | required |
| group_col | str | The column to group by for IQR calculation. | required |
| val_col | str | The column containing values to calculate IQR. | required |
| iqr_lower_quantile | float | The lower quantile for IQR calculation. | 0.25 |
| iqr_upper_quantile | float | The upper quantile for IQR calculation. | 0.75 |
| iqr_multiplier | float | The multiplier for the IQR to determine outlier bounds. | 3.0 |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: The DataFrame with additional columns for IQR, outlier bounds and 'threshold' columns, indicating which threshold the value breached. |
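The grouped IQR flagging described above can be sketched with pandas `groupby` and `transform`. The column names (`lsoa_code`, `gdhi`) and output columns here are illustrative assumptions, not the codebase's actual schema:

```python
import pandas as pd

def flag_iqr_outliers(df, group_col, val_col, lower_q=0.25, upper_q=0.75, multiplier=3.0):
    """Flag rows whose value falls outside group-level IQR bounds."""
    grouped = df.groupby(group_col)[val_col]
    q1 = grouped.transform(lambda s: s.quantile(lower_q))
    q3 = grouped.transform(lambda s: s.quantile(upper_q))
    iqr = q3 - q1
    out = df.copy()
    out["iqr_lower"] = q1 - multiplier * iqr
    out["iqr_upper"] = q3 + multiplier * iqr
    out["iqr_outlier"] = (out[val_col] < out["iqr_lower"]) | (out[val_col] > out["iqr_upper"])
    return out

df = pd.DataFrame({
    "lsoa_code": ["A"] * 5,
    "gdhi": [100, 102, 98, 101, 500],  # 500 is an obvious outlier
})
flagged = flag_iqr_outliers(df, "lsoa_code", "gdhi")
```

With the default multiplier of 3.0, the bounds for the single group are 94 and 108, so only the 500 row is flagged.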
calc_lad_mean(df: pd.DataFrame) -> pd.DataFrame
Calculates the mean GDHI for each non-outlier LSOA in the DataFrame.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input DataFrame. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: The DataFrame with an added 'mean_non_out_gdhi' column. |
calc_rate_of_change(df: pd.DataFrame, ascending: bool, sort_cols: list, group_col: str, val_col: str) -> pd.DataFrame
Calculate the rate of change going forward and backwards in time in the DataFrame.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input DataFrame. | required |
| ascending | bool | If True, calculates forward rate of change; otherwise, backward. | required |
| sort_cols | list | Columns to sort by before calculating rate of change. | required |
| group_col | str | The column to group by for rate of change calculation. | required |
| val_col | str | The column for which the rate of change is calculated. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: A DataFrame containing the rate of change values. |
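A plausible sketch of the forward/backward rate-of-change calculation, using pandas `pct_change` within each group; the column names and the `pct_change_fwd`/`pct_change_bwd` output names are assumptions for illustration:

```python
import pandas as pd

def rate_of_change(df, ascending, sort_cols, group_col, val_col):
    """Percent change within each group, forward (ascending=True) or backward in time."""
    out = df.sort_values(sort_cols, ascending=ascending).copy()
    direction = "fwd" if ascending else "bwd"
    out[f"pct_change_{direction}"] = out.groupby(group_col)[val_col].pct_change()
    return out

df = pd.DataFrame({
    "lsoa_code": ["A", "A", "A"],
    "year": [2019, 2020, 2021],
    "gdhi": [100.0, 110.0, 121.0],
})
fwd = rate_of_change(df, True, ["lsoa_code", "year"], "lsoa_code", "gdhi")
```

The first row in each group has no previous year, so its rate of change is NaN; the remaining rows here both show a 10% year-on-year increase.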
calc_zscores(df: pd.DataFrame, score_prefix: str, group_col: str, val_col: str, zscore_upper_threshold: float = 3.0, zscore_lower_threshold: float = -3.0) -> pd.DataFrame
Calculates the z-scores for percent changes and raw data in DataFrame.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input DataFrame. | required |
| score_prefix | str | Prefix for the zscore column names. | required |
| group_col | str | The column to group by for z-score calculation. | required |
| val_col | str | The column values to calculate zscores. | required |
| zscore_upper_threshold | float | The upper threshold for z-score flag. | 3.0 |
| zscore_lower_threshold | float | The lower threshold for z-score flag. | -3.0 |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: The DataFrame with additional 'zscore' and 'threshold' columns, indicating which threshold the zscore breached. |
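Groupwise z-scoring with a threshold flag can be sketched as below; the `zscore`/`threshold` column names follow the Returns note above, while the input column names and flag labels are illustrative assumptions:

```python
import pandas as pd

def calc_group_zscores(df, group_col, val_col, upper=3.0, lower=-3.0):
    """Groupwise z-scores plus a 'threshold' column naming any breached bound."""
    g = df.groupby(group_col)[val_col]
    out = df.copy()
    out["zscore"] = (out[val_col] - g.transform("mean")) / g.transform("std")
    out["threshold"] = "none"
    out.loc[out["zscore"] > upper, "threshold"] = "upper"
    out.loc[out["zscore"] < lower, "threshold"] = "lower"
    return out

df = pd.DataFrame({"lsoa_code": ["A"] * 11, "gdhi": [100.0] * 10 + [200.0]})
# Thresholds of +/-2 used here so this small sample can actually breach them;
# the pipeline default is +/-3
scored = calc_group_zscores(df, "lsoa_code", "gdhi", upper=2.0, lower=-2.0)
```

Note that with a sample standard deviation, a single point among n values can never exceed a z-score of (n-1)/sqrt(n), which is why large thresholds like 3 need a reasonable number of years per group to ever trigger.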
gdhi_adj.preprocess.flag_preprocess
Module for flagging preprocessing data in the gdhi_adj project.
create_master_flag(df: pd.DataFrame, zscore_calculation: bool, iqr_calculation: bool) -> pd.DataFrame
Creates a master flag based on z-score and IQR flag columns.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input DataFrame. | required |
| zscore_calculation | bool | Whether z-score calculation is performed. | required |
| iqr_calculation | bool | Whether IQR calculation is performed. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: The DataFrame with an additional 'master_flag' column. |
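Combining the optional flags into one master flag might look like the sketch below; the `zscore_flag` and `iqr_flag` column names are assumptions, since the reference above does not state the codebase's exact flag column names:

```python
import pandas as pd

def create_master_flag(df, zscore_calculation, iqr_calculation):
    """Combine whichever outlier flags were computed into one master flag."""
    flag_cols = []
    if zscore_calculation:
        flag_cols.append("zscore_flag")
    if iqr_calculation:
        flag_cols.append("iqr_flag")
    out = df.copy()
    # True if any of the selected flag columns is True; False if neither check ran
    out["master_flag"] = out[flag_cols].any(axis=1) if flag_cols else False
    return out

df = pd.DataFrame({"zscore_flag": [True, False, False],
                   "iqr_flag": [False, True, False]})
flagged = create_master_flag(df, True, True)
```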
extract_start_end_years(df: pd.DataFrame) -> pd.DataFrame
Extracts the start and end years from the column headings.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input DataFrame with years as headers. | required |

Returns:

| Type | Description |
|---|---|
| Tuple[int, int] | A tuple containing the start and end years. |
flag_rollback_years(df: pd.DataFrame, rollback_year_start: int, rollback_year_end: int) -> pd.DataFrame
Flags years where the GDHI has rolled back from future years. Typically 2010-2014 has 2015 data copied to them as it is missing.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input DataFrame. | required |
| rollback_year_start | int | The start year for rollback flagging. | required |
| rollback_year_end | int | The end year for rollback flagging. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: DataFrame with an additional 'rollback_flag' column. |
gdhi_adj.preprocess.join_preprocess
Module for joining preprocessing data in the gdhi_adj project.
concat_wide_dataframes(df_wide_outlier: pd.DataFrame, df_wide_mean: pd.DataFrame) -> pd.DataFrame
Concatenates two wide dataframes to create a final wide DataFrame.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df_wide_outlier | DataFrame | The DataFrame containing outlier data. | required |
| df_wide_mean | DataFrame | The DataFrame containing mean data. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: The concatenated DataFrame in wide format. |
constrain_to_reg_acc(df: pd.DataFrame, reg_acc: pd.DataFrame, transaction_name: str) -> pd.DataFrame
Calculate constrained and unconstrained values for each outlier case.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input DataFrame with outliers to be constrained. | required |
| reg_acc | DataFrame | The regional accounts DataFrame. | required |
| transaction_name | str | Transaction code to filter regional accounts. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: The constrained DataFrame. |
gdhi_adj.preprocess.pivot_preprocess
Module for pivoting data in the gdhi_adj project.
pivot_output_long(df: pd.DataFrame, uncon_gdhi: str, con_gdhi: str) -> pd.DataFrame
Pivots the output DataFrame to long format.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input DataFrame in wide format. | required |
| uncon_gdhi | str | The column name for unconstrained GDHI. | required |
| con_gdhi | str | The column name for constrained GDHI. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: The pivoted DataFrame in long format. |
pivot_wide_dataframe(df: pd.DataFrame) -> pd.DataFrame
Pivots the DataFrame from long to wide format.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input DataFrame in long format. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: The pivoted DataFrame in wide format. |
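A minimal sketch of a long-to-wide pivot with pandas, with one row per LSOA and one column per year; the column names are illustrative assumptions:

```python
import pandas as pd

df_long = pd.DataFrame({
    "lsoa_code": ["A", "A", "B", "B"],
    "year": [2019, 2020, 2019, 2020],
    "gdhi": [100, 110, 200, 210],
})

# Each unique (lsoa_code, year) pair becomes one cell in the wide frame
df_wide = df_long.pivot(index="lsoa_code", columns="year", values="gdhi").reset_index()
df_wide.columns.name = None  # drop the leftover 'year' axis label
```

`pivot` raises if the index/columns pair is not unique, which doubles as a useful sanity check before exporting wide data.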
pivot_years_long_dataframe(df: pd.DataFrame, new_var_col: str, new_val_col: str) -> pd.DataFrame
Pivots the DataFrame based on specified index, columns, and values.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input DataFrame. | required |
| new_var_col | str | The name for the column containing old column names. | required |
| new_val_col | str | The name for the column containing values. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: The pivoted DataFrame. |
gdhi_adj.preprocess.run_preprocess
Module for pre-processing data in the gdhi_adj project.
run_preprocessing(config: dict) -> None
Run the preprocessing steps for the GDHI adjustment project.
This function performs the following steps:

1. Load the configuration settings.
2. Load the input data.
3. Pivot the DataFrame to long format.
4. Calculate percentage rate of change and flag rollback years.
5. Calculate z-scores and IQRs if desired as per config.
6. Create master flags.
7. Save interim data with all calculated values.
8. Calculate LAD mean GDHI.
9. Constrain outliers to regional accounts.
10. Pivot the DataFrame back to wide format.
11. Save the preprocessed data ready for PowerBI analysis.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict | Configuration dictionary containing user settings and pipeline settings. | required |

Returns:

| Type | Description |
|---|---|
| None | The function does not return any value. It saves the processed DataFrame to a CSV file. |
Adjustment
gdhi_adj.adjustment.apportion_adjustment
Module for apportioning values from adjustment in the gdhi_adj project.
apportion_adjustment(df: pd.DataFrame, imputed_df: pd.DataFrame) -> pd.DataFrame
Apportion the adjustment values to all years for each LSOA.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame containing data to adjust. | required |
| imputed_df | DataFrame | DataFrame containing outlier imputed values. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: DataFrame with outlier values imputed and adjustment values apportioned across all years within LSOA. |
apportion_negative_adjustment(df: pd.DataFrame) -> pd.DataFrame
Change negative values to 0 and apportion negative adjustment values to all LSOAs within an LAD/year group.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame containing data to adjust. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: DataFrame with negative adjustment values apportioned across all years within LSOA. |
apportion_rollback_years(df: pd.DataFrame) -> pd.DataFrame
Continue to apportion the adjustments for years that are flagged as rollback years.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame containing all data including adjusted and rollback years. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: DataFrame with reapportioned values for rollback years. |
calc_non_outlier_proportions(df: pd.DataFrame) -> pd.DataFrame
Calculate the proportion of a non-outlier LSOA to the LAD for each year.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame containing all GDHI data. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: DataFrame with LAD totals and proportions for non-outlier LSOAs calculated per year/LAD group. |
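The per-year/LAD proportion calculation can be sketched with a grouped `transform`; the column names (`lad_code`, `gdhi`, `outlier`, `non_out_proportion`) are illustrative assumptions rather than the codebase's actual schema:

```python
import pandas as pd

def calc_non_outlier_proportions(df):
    """Per year/LAD group, each non-outlier LSOA's share of the non-outlier total."""
    out = df.copy()
    # Zero out outlier rows so they contribute nothing to the group total
    non_out = out["gdhi"].where(~out["outlier"], 0.0)
    lad_total = non_out.groupby([out["lad_code"], out["year"]]).transform("sum")
    # Proportion is undefined (NaN) for the outlier rows themselves
    out["non_out_proportion"] = (non_out / lad_total).where(~out["outlier"])
    return out

df = pd.DataFrame({
    "lad_code": ["L"] * 3, "year": [2020] * 3, "lsoa_code": list("ABC"),
    "gdhi": [100.0, 300.0, 999.0], "outlier": [False, False, True],
})
props = calc_non_outlier_proportions(df)
```

These proportions are what allow a negative adjustment to be spread across the remaining LSOAs in an LAD, which is also why the validation check `check_lsoas_flagged` insists that not every LSOA in an LAD is flagged.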
check_no_negative_values_col(df: pd.DataFrame, col: str) -> None
Check that adjusted_con_gdhi has no negative values.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame with adjusted_con_gdhi column. | required |
| col | str | The column to check for negative values. | required |

Raises:

| Type | Description |
|---|---|
| ValueError | If negative values are found. |
gdhi_adj.adjustment.calc_adjustment
Module for calculations to adjust data in the gdhi_adj project.
extrapolate_imputed_val(df: pd.DataFrame, imputed_df: pd.DataFrame) -> pd.DataFrame
Calculate the imputed value for a given LSOA code where the year that has been flagged as an outlier to adjust only has one valid safe year either side.
The imputed value is extrapolated from the nearest safe year and the year 4 years after. This is to avoid short term fluctuations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame with full data for lookup. | required |
| imputed_df | DataFrame | DataFrame to calculate imputed value. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: DataFrame containing outlier imputed values. |
interpolate_imputed_val(df: pd.DataFrame) -> pd.DataFrame
Calculate the imputed value for a given LSOA code where the year that has been flagged as an outlier to adjust has a valid safe year either side.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame with data to calculate imputed value. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: DataFrame containing outlier imputed values. |
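The exact interpolation used here is not spelled out above; assuming a straight line between the flanking safe years, the core arithmetic would look like this (function name and signature are illustrative):

```python
def interpolate_value(prev_year, prev_val, next_year, next_val, target_year):
    """Linearly interpolate a value for target_year between two safe years."""
    weight = (target_year - prev_year) / (next_year - prev_year)
    return prev_val + weight * (next_val - prev_val)

# An outlier 2019 value bracketed by safe years 2018 and 2020
imputed = interpolate_value(2018, 100.0, 2020, 120.0, 2019)
```

With safe values of 100.0 (2018) and 120.0 (2020), the 2019 outlier would be imputed as 110.0.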
gdhi_adj.adjustment.filter_adjustment
Module for filtering adjustment data in the gdhi_adj project.
filter_adjust(df: pd.DataFrame) -> pd.DataFrame
Filter data to keep only LSOAs for adjustment and subset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame containing LSOA data. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: Filtered DataFrame with only relevant columns and rows. |
filter_component(df: pd.DataFrame, sas_code_filter: str, cord_code_filter: str, credit_debit_filter: str) -> pd.DataFrame
Filter DataFrame by component codes.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Constrained DataFrame with component code data. | required |
| sas_code_filter | str | SAS code to filter by. | required |
| cord_code_filter | str | CORD code to filter by. | required |
| credit_debit_filter | str | Credit/Debit code to filter by. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: Filtered DataFrame containing only rows matching the specified component codes. |
filter_year(df: pd.DataFrame, start_year: int, end_year: int) -> pd.DataFrame
Filter DataFrame by a range of years inclusively.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame containing year data. | required |
| start_year | int | Start year for filtering (inclusive). | required |
| end_year | int | End year for filtering (inclusive). | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: Filtered DataFrame containing only rows within the year range. |
gdhi_adj.adjustment.flag_adjustment
Module for flagging data to adjust data in the gdhi_adj project.
identify_safe_years(df: pd.DataFrame, start_year: int = 1900, end_year: int = 2100) -> pd.DataFrame
Identify safe years for each LSOA where no adjustment is needed.
For sequential years flagged for adjustment, the previous and next safe years are located at the end of the sequence of years.
For end of range years flagged for adjustment, it will return one safe year in the range and one outside, which will return NaN for con_gdhi.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input DataFrame. | required |
| start_year | int | The starting year for the data range. | 1900 |
| end_year | int | The ending year for the data range. | 2100 |

Returns:

| Name | Type | Description |
|---|---|---|
| df | DataFrame | DataFrame with additional columns for safe years. |
| safe_years_df | DataFrame | DataFrame containing only the rows that need adjustment with non-outlier year values either side of outlier years. |
gdhi_adj.adjustment.join_adjustment
Module for joining adjustment data in the gdhi_adj project.
join_analyst_constrained_data(df_constrained: pd.DataFrame, df_analyst: pd.DataFrame) -> pd.DataFrame
Join analyst data to constrained data based on LSOA code and LAD code.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df_constrained | DataFrame | DataFrame containing constrained data. | required |
| df_analyst | DataFrame | DataFrame containing analyst data. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: Joined DataFrame with relevant columns. |
join_analyst_unconstrained_data(df_unconstrained: pd.DataFrame, df_analyst: pd.DataFrame) -> pd.DataFrame
Join analyst data to unconstrained data based on LSOA code and LAD code.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df_unconstrained | DataFrame | DataFrame with unconstrained data. | required |
| df_analyst | DataFrame | DataFrame containing analyst data. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: Joined DataFrame with relevant columns. |
gdhi_adj.adjustment.pivot_adjustment
Module for pivoting adjustment data in the gdhi_adj project.
pivot_adjustment_long(df: pd.DataFrame) -> pd.DataFrame
Un-pivot (melt) the adjustment DataFrame from wide to long format.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame containing data to be adjusted. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: Pivoted DataFrame in long format. |
pivot_wide_final_dataframe(df: pd.DataFrame) -> pd.DataFrame
Pivots the DataFrame from long to wide format.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input DataFrame in long format. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: The pivoted DataFrame in wide format. |
gdhi_adj.adjustment.reformat_adjustment
Module for reformatting adjustment data in the gdhi_adj project.
reformat_adjust_col(df: pd.DataFrame) -> pd.DataFrame
Reformat data within the adjust column.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame to be reformatted. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: DataFrame with reformatted columns. |
reformat_year_col(df: pd.DataFrame, start_year: int, end_year: int) -> pd.DataFrame
Reformat data within the year column.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame to be reformatted. | required |
| start_year | int | The start year for the data range. | required |
| end_year | int | The end year for the data range. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: DataFrame with reformatted columns. |
gdhi_adj.adjustment.run_adjustment
Module for adjusting data in the gdhi_adj project.
run_adjustment(config: dict) -> None
Run the adjustment steps for the GDHI adjustment project.
This function performs the following steps:

1. Load the configuration settings.
2. Load the input data.
3. Reformat adjust and year columns.
4. Filter data for adjustment.
5. Join analyst output with constrained and unconstrained data.
6. Pivot the DataFrame to long format for manipulation.
7. Filter data by the specified year range.
8. Calculate the imputed gdhi values for outlier years.
9. Calculate adjustment values based on imputed gdhi.
10. Apportion adjustment values to all years.
11. Save interim data with all calculated values.
12. Pivot data to wide format for PowerBI QA reiteration.
13. Pivot final DataFrame to wide format for exporting.
14. Save the final adjusted data.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict | Configuration dictionary containing user settings and pipeline settings. | required |

Returns:

| Type | Description |
|---|---|
| None | The function does not return any value. It saves the processed DataFrame to a CSV file. |
gdhi_adj.adjustment.validation_adjustment
Module for adjustment data validation in the gdhi_adj project.
check_adjust_year_not_empty(df: pd.DataFrame) -> pd.DataFrame
Check that for LSOAs marked for adjustment, the year column is not empty.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame to be checked. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: The original DataFrame, unchanged. This allows the function to be used in method chaining (e.g., .pipe()). |

Raises:

| Type | Description |
|---|---|
| ValueError | If an LSOA marked for adjustment does not have a year specified to adjust. |
check_lsoas_flagged(df: pd.DataFrame) -> pd.DataFrame
Check that not all LSOAs within an LAD are flagged for adjustment.
This is so that there are some non-outlier LSOAs to calculate non-outlier proportions of the total GDHI within an LAD.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame to be checked. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: The original DataFrame, unchanged. This allows the function to be used in method chaining (e.g., .pipe()). |

Raises:

| Type | Description |
|---|---|
| ValueError | If every 'lsoa_code' within an lad_code is marked for adjustment. |
check_years_flagged(df: pd.DataFrame) -> pd.DataFrame
Check that not all years within an LSOA are flagged for adjustment.
This is so that there are some non-outlier LSOAs to interpolate/extrapolate from.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame to be checked. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: The original DataFrame, unchanged. This allows the function to be used in method chaining (e.g., .pipe()). |

Raises:

| Type | Description |
|---|---|
| ValueError | If every year within an 'lsoa_code' is marked for adjustment. |
CORD Preparation
gdhi_adj.cord_preparation.mapping_cord_prep
Module for local authority units mapped to LADs.
aggregate_lad(df)
Aggregate values on LADs and other identifiers.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame containing adjusted data with LAD codes joined. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: DataFrame containing value columns now aggregated by sum on identifier columns. |
clean_validate_mapper(mapper_df)
Subset the mapper and get a unique DataFrame of values.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| mapper_df | DataFrame | DataFrame containing data used to join. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: DataFrame with unique values for LADs and LAUs. |
join_mapper(df, mapper_df)
Join mapper containing a lookup of LAU and LAD values, to adjusted data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame containing adjusted data. | required |
| mapper_df | DataFrame | DataFrame containing LAU to LAD lookup. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: DataFrame with LADs joined on LAU codes. |
map_S30_to_S12(config: dict, df: pd.DataFrame) -> pd.DataFrame
Run the mapping steps for the GDHI adjustment pipeline.
This function performs the following steps:

1. Rename column containing S30 values to LAU and verify mapping is required.

If mapping is required:

2. Load in and clean mapper containing LAU to LAD lookup.
3. Join LAU-LAD mapper to adjusted data.
4. Aggregate to LAD if specified in config.
5. Reformat output.
reformat(df, original_columns)
Rename LAD columns for end format.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame containing adjusted data and LAD codes. | required |
| original_columns | list | List of columns from original DataFrame. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: Renamed DataFrame with desired columns. |
rename_s30_to_lau(config, df)
Rename column containing S30 area codes to lau_
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict | Configuration dictionary containing user settings and pipeline settings. | required |
| df | DataFrame | DataFrame containing adjusted data. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| | DataFrame | pd.DataFrame: DataFrame with area code column renamed if S30 codes have been found, otherwise returns original DataFrame. |
| need_mapping | Boolean | Returned boolean to show mapping is needed. |
gdhi_adj.cord_preparation.transform_cord_prep
Module for imputing values ready for CORD in the gdhi_adj project.
append_all_sub_components(config: dict) -> pd.DataFrame
Append all DataFrames that contain separate sub-components together so that each LSOA has all sub-components present in one DataFrame.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict | Pipeline configuration dictionary containing filepaths for the location of sub-component data. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: DataFrame with all sub-components appended for each LSOA. |
impute_suppression_x(df: pd.DataFrame, target_cols: List[str], transaction_col: str = 'transaction', lsoa_col: str = 'lsoa_code', transaction_value: str = 'D623', lsoa_val: List[str] = ['95', 'S']) -> pd.DataFrame
Set cells in target_cols to "X" where both conditions are met:

- The value in transaction_col equals transaction_value.
- The value in lsoa_col starts with any value in the lsoa_val list.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame. | required |
| target_cols | List[str] | List of column names to modify. | required |
| transaction_col | str | Name of the transaction column. | 'transaction' |
| lsoa_col | str | Name of the LSOA column. | 'lsoa_code' |
| transaction_value | str | Transaction value to match. | 'D623' |
| lsoa_val | List[str] | List of starting strings for LSOA codes to match (case-sensitive). | ['95', 'S'] |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: DataFrame with suppressed values. |
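The two-condition suppression above maps naturally onto a boolean mask with `.loc`; a minimal sketch, assuming the year columns hold string values:

```python
import pandas as pd

def impute_suppression_x(df, target_cols, transaction_col="transaction",
                         lsoa_col="lsoa_code", transaction_value="D623",
                         lsoa_val=("95", "S")):
    """Overwrite cells in target_cols with 'X' where both match conditions hold."""
    mask = (df[transaction_col] == transaction_value) & \
        df[lsoa_col].str.startswith(tuple(lsoa_val))
    out = df.copy()
    out.loc[mask, target_cols] = "X"
    return out

df = pd.DataFrame({"transaction": ["D623", "D623", "D1"],
                   "lsoa_code": ["95AA01", "E00001", "S00001"],
                   "2020": ["1200", "1300", "1400"]})
suppressed = impute_suppression_x(df, ["2020"])
```

Only the first row matches both conditions (transaction D623 and a "95" prefix); the second fails the prefix check and the third the transaction check.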
gdhi_adj.cord_preparation.validation_cord_prep
Module for validation checks prior to CORD in the gdhi_adj project.
check_lsoa_consistency(df: pd.DataFrame) -> pd.DataFrame
Performs an internal consistency check on the DataFrame to ensure 'lsoa_code' uniqueness matches the total row count.
This function verifies that the number of unique values in the 'lsoa_code' column is exactly equal to the number of rows in the DataFrame.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input pandas DataFrame containing an 'lsoa_code' column. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: The original DataFrame, unchanged. This allows the function to be used in method chaining (e.g., .pipe()). |

Raises:

| Type | Description |
|---|---|
| ValueError | If the number of unique 'lsoa_code' values does not match the total number of rows in the DataFrame. |
| KeyError | If the 'lsoa_code' column is missing from the DataFrame. |
check_lsoa_count(df: pd.DataFrame, df_unconstrained: pd.DataFrame) -> pd.DataFrame
Perform a validation check to ensure that the unique lsoa_codes in the constrained DataFrame matches that in the unconstrained DataFrame.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input pandas DataFrame containing an 'lsoa_code' column. | required |
| df_unconstrained | DataFrame | The unconstrained DataFrame to compare against. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: The original DataFrame, unchanged. This allows the function to be used in method chaining (e.g., .pipe()). |

Raises:

| Type | Description |
|---|---|
| ValueError | If the number of unique 'lsoa_code' values in the constrained DataFrame does not match the number of unique 'lsoa_code' values in the unconstrained DataFrame. |
| KeyError | If the 'lsoa_code' column is missing from the DataFrame. |
check_no_negative_values_df(df: pd.DataFrame) -> pd.DataFrame
Checks all numeric columns in the DataFrame to ensure no values are less than 0.
This function isolates numeric columns (integers and floats) and verifies that all values are non-negative. It ignores non-numeric columns (e.g., strings).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input DataFrame to be validated. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: The original DataFrame, unchanged, for method chaining. |

Raises:

| Type | Description |
|---|---|
| ValueError | If any negative values are found in numeric columns. |
check_no_nulls(df: pd.DataFrame) -> pd.DataFrame
Checks the entire DataFrame to ensure it contains no Null, NaN, or None values.
This function scans all cells in the DataFrame. It detects standard numpy NaNs, Python None objects, and pandas pd.NA values.
If any such value is found, it raises a ValueError.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input DataFrame to be validated. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: The original DataFrame, unchanged, for method chaining. |

Raises:

| Type | Description |
|---|---|
| ValueError | If any null/NaN/None values are found in the DataFrame. |
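Because these validators return the input unchanged, they compose with `.pipe()` as the Returns notes describe. A minimal sketch of that pattern with two of the checks:

```python
import pandas as pd

def check_no_nulls(df):
    """Raise if any cell is null; otherwise return df unchanged for chaining."""
    if df.isna().any().any():
        raise ValueError("Null values found in DataFrame.")
    return df

def check_no_negative_values_df(df):
    """Raise if any numeric cell is negative; otherwise return df unchanged."""
    numeric = df.select_dtypes(include="number")
    if (numeric < 0).any().any():
        raise ValueError("Negative values found in numeric columns.")
    return df

clean = pd.DataFrame({"lsoa_code": ["A"], "2020": [10.0]})
validated = clean.pipe(check_no_nulls).pipe(check_no_negative_values_df)
```

Each check either raises or hands the same DataFrame to the next step, so a whole validation suite reads as one chained expression.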
check_subcomponent_lookup(df: pd.DataFrame, lookup_df: pd.DataFrame) -> pd.DataFrame
This function verifies that each unique combination of values in the 'transaction' and 'account_entry' columns from the subcomponent lookup is present in the DataFrame.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input pandas DataFrame containing subcomponent data. | required |
| lookup_df | DataFrame | The lookup DataFrame containing all combinations of subcomponents that should be present. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: The original DataFrame, unchanged. This allows the function to be used in method chaining (e.g., .pipe()). |

Raises:

| Type | Description |
|---|---|
| ValueError | If all combinations of 'transaction' and 'account_entry' values from the lookup are not present in the DataFrame. |
check_year_column_completeness(df: pd.DataFrame) -> pd.DataFrame
Verifies that the DataFrame contains a complete set of consecutive year columns.
This function automatically identifies numeric column names (integers or strings representing integers), determines the minimum and maximum years, and checks if every year between that minimum and maximum exists as a column.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input DataFrame. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: The original DataFrame, unchanged, for method chaining. |

Raises:

| Type | Description |
|---|---|
| ValueError | If any year between the detected minimum and maximum is missing from the columns. |
gdhi_adj.cord_preparation.run_cord_prep
Module for preparing data for CORD in the gdhi_adj project.
run_cord_preparation(config: dict) -> None
Run the CORD preparation steps for the GDHI adjustment project.
This function performs the following steps:

1. Load the configuration settings.
2. Load the input data and append all subcomponents together.
3. Map LAU S30 codes to LAD S12 codes.
4. Perform validation checks on the input data.
5. Apply CORD-specific transformations.
6. Save the prepared CORD data for further processing.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict | Configuration dictionary containing user settings and pipeline settings. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| None | None | The function does not return any value. It saves the processed DataFrame to a CSV file. |
Utilities
gdhi_adj.utils.helpers
Define helper functions that wrap regularly-used functions.
convert_column_types(df: pd.DataFrame, schema: dict, logger: logging.Logger) -> pd.DataFrame
Convert DataFrame columns data types as specified in the schema.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to convert column types. | required |
| schema | dict | The schema containing column names and their expected types. | required |
| logger | Logger | Logger for logging conversion actions. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: The DataFrame with converted column types. |

Raises:

| Type | Description |
|---|---|
| warning | If a column's type conversion fails. |
load_schema_from_toml(schema_path: str) -> dict
Load a schema from a TOML file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| schema_path | str | Path to the TOML schema file. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| dict | dict | A dictionary representation of the schema. |
load_toml_config(path: Union[str, pathlib.Path]) -> dict | None
Load a .toml file from a path, with logging and safe error handling.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Union[str, Path] | The path to load the .toml file from. | required |

Returns:

| Type | Description |
|---|---|
| dict \| None | The loaded toml file as a dictionary, or None on error. |
read_with_schema(input_file_path: str, input_schema_path: str) -> pd.DataFrame
Reads in a CSV file and compares it to a data dictionary schema.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
input_file_path
|
string
|
Filepath to the csv file to be read in. |
required |
input_schema_path
|
string
|
Filepath to the schema file in TOML format. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
df |
DataFrame
|
Formatted DataFrame containing data from the CSV file. |
rename_columns(df: pd.DataFrame, schema: dict, logger: logging.Logger) -> pd.DataFrame
Rename columns in the DataFrame based on the schema. Schema should be a dict where keys are new column names and values are dicts with 'old_name'.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
df
|
DataFrame
|
The DataFrame to rename columns in. |
required |
schema
|
dict
|
The schema containing old and new column names. |
required |
logger
|
Logger
|
Logger for logging renaming actions. |
required |
Returns:
| Type | Description |
|---|---|
DataFrame
|
pd.DataFrame: The DataFrame with renamed columns. |
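A sketch of `rename_columns` based on the documented schema shape: keys are new column names and values are dicts carrying the `'old_name'`. Details of the real implementation may differ.

```python
import logging
import pandas as pd

logger = logging.getLogger(__name__)

def rename_columns(df: pd.DataFrame, schema: dict,
                   logger: logging.Logger) -> pd.DataFrame:
    # Build an old_name -> new_name mapping from the schema.
    mapping = {
        spec["old_name"]: new_name
        for new_name, spec in schema.items()
        if spec.get("old_name") in df.columns
    }
    logger.info("Renaming columns: %s", mapping)
    return df.rename(columns=mapping)

df = pd.DataFrame({"LSOA11CD": ["E01000001"], "val": [1.0]})
schema = {"lsoa_code": {"old_name": "LSOA11CD"}, "gdhi": {"old_name": "val"}}
renamed = rename_columns(df, schema, logger)
```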
validate_schema(df: pd.DataFrame, schema: dict)
Validate the DataFrame against the schema.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
df
|
DataFrame
|
The DataFrame to validate. |
required |
schema
|
dict
|
The schema sourced from a TOML file to validate against. |
required |
Raises:
| Type | Description |
|---|---|
ValueError
|
If a required column from the schema is missing in the DataFrame. |
TypeError
|
If a column's type does not match the expected type in the schema. |
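An illustrative sketch matching the documented behaviour: `ValueError` for a missing column, `TypeError` for a dtype mismatch. The assumption that each schema entry carries a `"type"` string is hypothetical.

```python
import pandas as pd

def validate_schema(df: pd.DataFrame, schema: dict) -> None:
    for col, spec in schema.items():
        if col not in df.columns:
            raise ValueError(f"Required column '{col}' is missing.")
        if str(df[col].dtype) != spec["type"]:
            raise TypeError(
                f"Column '{col}' has dtype {df[col].dtype}, "
                f"expected {spec['type']}."
            )

df = pd.DataFrame({"year": [2020], "gdhi": [1.5]})
schema = {"year": {"type": "int64"}, "gdhi": {"type": "float64"}}
validate_schema(df, schema)  # passes silently

try:
    validate_schema(df, {"missing_col": {"type": "int64"}})
    raised = False
except ValueError:
    raised = True
```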
write_with_schema(df: pd.DataFrame, output_schema_path: str, output_dir: str, new_filename=None)
Writes a DataFrame to a CSV file, renaming columns and validating against a schema.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
df
|
DataFrame
|
The final output DataFrame to write to CSV. |
required |
output_schema_path
|
str
|
Path to the output schema file in TOML |
required |
output_dir
|
str
|
Directory where the CSV file will be saved. |
required |
new_filename
|
str
|
New filename for the output CSV. If None, uses the original name. |
None
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If the DataFrame does not match the schema. |
Returns:
| Name | Type | Description |
|---|---|---|
None |
None
|
Writes the DataFrame to a CSV file after validating against the schema. |
gdhi_adj.utils.logger
CustomFormatter
Bases: Formatter
Define logging formatter with colors for different log levels.
format(record)
Set color formatting for logger.
GDHI_adj_logger(name)
Custom logging class for use throughout the GDHI_adj pipeline.
Parameters
name : str The name of the file the logger is being created from.
Initialise the logger class.
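A sketch of a colour-coded formatter and factory in the spirit of `CustomFormatter` and `GDHI_adj_logger`; the exact colours, format string, and handler setup used in `gdhi_adj` may differ.

```python
import logging

class CustomFormatter(logging.Formatter):
    """Wrap each record's message in an ANSI colour for its log level."""
    COLOURS = {
        logging.DEBUG: "\x1b[38m",    # grey
        logging.INFO: "\x1b[32m",     # green
        logging.WARNING: "\x1b[33m",  # yellow
        logging.ERROR: "\x1b[31m",    # red
    }
    RESET = "\x1b[0m"
    FMT = "%(asctime)s | %(name)s | %(levelname)s | %(message)s"

    def format(self, record):
        colour = self.COLOURS.get(record.levelno, "")
        return logging.Formatter(colour + self.FMT + self.RESET).format(record)

def GDHI_adj_logger(name: str) -> logging.Logger:
    """Return a logger with the colour formatter attached once."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(CustomFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.DEBUG)
    return logger

log = GDHI_adj_logger("demo")
```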
gdhi_adj.utils.transform_helpers
Define helper functions that wrap regularly-used functions.
ensure_list(x: any) -> list
Ensure the input is returned as a list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
x
|
any
|
Input value to be converted to a list. |
required |
Returns: list: The input value wrapped in a list if it was not already a list.
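The likely shape of `ensure_list` is trivial but shown for completeness; this is a sketch, not the confirmed implementation.

```python
from typing import Any

def ensure_list(x: Any) -> list:
    # Wrap scalars; pass lists through unchanged.
    return x if isinstance(x, list) else [x]

wrapped = ensure_list("2020")
unchanged = ensure_list([2019, 2020])
```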
increment_until_not_in(year: int, adjust_years: list, limit_year: int, is_increasing: bool = True)
Increase or decrease year until it is not in a list of adjust_years.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
year
|
int
|
The starting year. |
required |
adjust_years
|
list
|
List of years to avoid. |
required |
limit_year
|
int
|
The limit year to stop at. |
required |
is_increasing
|
bool
|
If True, increase year; if False, decrease year. |
True
|
Returns:
| Type | Description |
|---|---|
int
|
int: The first year not in the adjust_years list. |
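A sketch of this search, assuming `limit_year` acts as a hard stop; the real stopping behaviour may differ.

```python
def increment_until_not_in(year: int, adjust_years: list,
                           limit_year: int, is_increasing: bool = True) -> int:
    # Step up or down until the year falls outside adjust_years,
    # never moving past limit_year.
    step = 1 if is_increasing else -1
    while year in adjust_years and year != limit_year:
        year += step
    return year

next_clear = increment_until_not_in(2019, [2019, 2020], limit_year=2025)
prev_clear = increment_until_not_in(2020, [2019, 2020], 2015, is_increasing=False)
```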
sum_match_check(df: pd.DataFrame, grouping_cols: list, unadjusted_col: str, adjusted_col: str, sum_tolerance: float = 1e-06)
Check that the sums of the adjusted column match those of the unadjusted column for the same groupings.
If the difference exceeds a specified tolerance, raise an error.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
df
|
DataFrame
|
DataFrame containing data for sums. |
required |
grouping_cols
|
list
|
List of columns to group by for the sums. |
required |
unadjusted_col
|
str
|
Unadjusted column. |
required |
adjusted_col
|
str
|
Adjusted column. |
required |
sum_tolerance
|
float
|
Tolerance for the sums to match, based on floating point error. |
1e-06
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If adjusted and unadjusted sums do not match within the tolerance. |
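A sketch of `sum_match_check`: it compares grouped sums of the adjusted and unadjusted columns within a floating-point tolerance. How the real implementation reports failures is an assumption.

```python
import pandas as pd

def sum_match_check(df: pd.DataFrame, grouping_cols: list,
                    unadjusted_col: str, adjusted_col: str,
                    sum_tolerance: float = 1e-06) -> None:
    # Sum both columns within each group and compare the totals.
    sums = df.groupby(grouping_cols)[[unadjusted_col, adjusted_col]].sum()
    diff = (sums[unadjusted_col] - sums[adjusted_col]).abs()
    if (diff > sum_tolerance).any():
        bad = diff[diff > sum_tolerance].index.tolist()
        raise ValueError(f"Adjusted and unadjusted sums differ for: {bad}")

df = pd.DataFrame({
    "lad": ["A", "A", "B"],
    "gdhi": [10.0, 20.0, 5.0],
    "gdhi_adj": [12.0, 18.0, 5.0],  # group A redistributes but totals match
})
sum_match_check(df, ["lad"], "gdhi", "gdhi_adj")  # passes

try:
    bad_df = df.assign(gdhi_adj=[12.0, 18.0, 6.0])
    sum_match_check(bad_df, ["lad"], "gdhi", "gdhi_adj")
    check_raised = False
except ValueError:
    check_raised = True
```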
to_int_list(cell: Any) -> List[int]
Convert a cell to a list of ints.
Accepts:
- a comma-separated string like "2010,2011, 2012"
- a list/tuple of strings or numbers
- NaN/None -> returns []
Raises ValueError if an item cannot be converted to int.
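A sketch following the documented accepted inputs; the implementation details are an assumption.

```python
import math
from typing import Any, List

def to_int_list(cell: Any) -> List[int]:
    # NaN/None become an empty list.
    if cell is None or (isinstance(cell, float) and math.isnan(cell)):
        return []
    if isinstance(cell, str):
        # Split a comma-separated string, ignoring surrounding whitespace.
        items = [s.strip() for s in cell.split(",") if s.strip()]
    elif isinstance(cell, (list, tuple)):
        items = cell
    else:
        items = [cell]
    try:
        return [int(i) for i in items]
    except (ValueError, TypeError) as err:
        raise ValueError(f"Cannot convert {cell!r} to a list of ints") from err

from_str = to_int_list("2010,2011, 2012")
from_list = to_int_list([2010, "2011"])
from_nan = to_int_list(float("nan"))
```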