
API Reference

This part of the project documentation focuses on an information-oriented approach. Use it as a reference for the technical implementation of the gdhi-adj codebase.

Main Pipeline

gdhi_adj.pipeline

Run each module of the pipeline based on config parameters.

run_pipeline(config_path)

Run the GDHI adjustment pipeline.

Parameters:

Name Type Description Default
config_path str

Path to the configuration file.

required

Preprocessing

gdhi_adj.preprocess.calc_preprocess

Module for calculations to preprocess data in the gdhi_adj project.

calc_iqr(df: pd.DataFrame, iqr_prefix: str, group_col: str, val_col: str, iqr_lower_quantile: float = 0.25, iqr_upper_quantile: float = 0.75, iqr_multiplier: float = 3.0) -> pd.DataFrame

Calculates the interquartile range (IQR) for each LSOA in the DataFrame.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame.

required
iqr_prefix str

Prefix for the IQR column names.

required
group_col str

The column to group by for IQR calculation.

required
val_col str

The column containing values to calculate IQR.

required
iqr_lower_quantile float

The lower quantile for IQR calculation.

0.25
iqr_upper_quantile float

The upper quantile for IQR calculation.

0.75
iqr_multiplier float

The multiplier for the IQR to determine outlier bounds.

3.0

Returns:

Type Description
DataFrame

pd.DataFrame: The DataFrame with additional columns for the IQR, outlier bounds, and a 'threshold' column indicating which threshold was breached.
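As an illustration of the grouped IQR approach described above, here is a minimal pandas sketch. The function and column names are hypothetical, not the library's implementation:

```python
import pandas as pd

def iqr_bounds(df, group_col, val_col, lower_q=0.25, upper_q=0.75, multiplier=3.0):
    # Compute per-group quartiles, broadcast back to rows, and flag values
    # outside [Q1 - multiplier*IQR, Q3 + multiplier*IQR].
    grouped = df.groupby(group_col)[val_col]
    q1 = grouped.transform(lambda s: s.quantile(lower_q))
    q3 = grouped.transform(lambda s: s.quantile(upper_q))
    iqr = q3 - q1
    out = df.copy()
    out["iqr_lower_bound"] = q1 - multiplier * iqr
    out["iqr_upper_bound"] = q3 + multiplier * iqr
    out["iqr_flag"] = (df[val_col] < out["iqr_lower_bound"]) | (df[val_col] > out["iqr_upper_bound"])
    return out

df = pd.DataFrame({
    "lsoa_code": ["A"] * 6,
    "gdhi": [10.0, 11.0, 12.0, 11.0, 10.0, 100.0],  # 100 is an extreme value
})
flagged = iqr_bounds(df, "lsoa_code", "gdhi")
```

With the conservative default multiplier of 3.0, only extreme values fall outside the bounds.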

calc_lad_mean(df: pd.DataFrame) -> pd.DataFrame

Calculates the mean GDHI for each non-outlier LSOA in the DataFrame.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The DataFrame with an added 'mean_non_out_gdhi' column.

calc_rate_of_change(df: pd.DataFrame, ascending: bool, sort_cols: list, group_col: str, val_col: str) -> pd.DataFrame

Calculate the rate of change going forward and backwards in time in the DataFrame.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame.

required
ascending bool

If True, calculates forward rate of change; otherwise, backward.

required
sort_cols list

Columns to sort by before calculating rate of change.

required
group_col str

The column to group by for rate of change calculation.

required
val_col str

The column for which the rate of change is calculated.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame containing the rate of change values.
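The forward/backward rate-of-change idea can be sketched with a grouped `pct_change` after sorting. This is illustrative only; the function and column names here are hypothetical:

```python
import pandas as pd

def rate_of_change(df, sort_cols, group_col, val_col, ascending=True):
    # Sort first, then take the percentage change within each group;
    # ascending=True gives the forward rate, ascending=False the backward rate.
    out = df.sort_values(sort_cols, ascending=ascending).copy()
    out["rate_of_change"] = out.groupby(group_col)[val_col].pct_change()
    return out

df = pd.DataFrame({
    "lsoa_code": ["A", "A", "A"],
    "year": [2019, 2020, 2021],
    "gdhi": [100.0, 110.0, 99.0],
})
fwd = rate_of_change(df, ["lsoa_code", "year"], "lsoa_code", "gdhi")
```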

calc_zscores(df: pd.DataFrame, score_prefix: str, group_col: str, val_col: str, zscore_upper_threshold: float = 3.0, zscore_lower_threshold: float = -3.0) -> pd.DataFrame

Calculates the z-scores for percent changes and raw data in DataFrame.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame.

required
score_prefix str

Prefix for the zscore column names.

required
group_col str

The column to group by for z-score calculation.

required
val_col str

The column values to calculate zscores.

required
zscore_upper_threshold float

The upper threshold for z-score flag.

3.0
zscore_lower_threshold float

The lower threshold for z-score flag.

-3.0

Returns:

Type Description
DataFrame

pd.DataFrame: The DataFrame with additional 'zscore' and 'threshold' columns, indicating which threshold the z-score breached.
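A per-group z-score with threshold flags might look like the following sketch (hypothetical names, not the library's code; the example uses tighter thresholds than the 3.0 defaults so the small sample triggers a flag):

```python
import pandas as pd

def zscore_flags(df, group_col, val_col, upper=3.0, lower=-3.0):
    # Standardise values within each group, then record which threshold
    # (if any) each z-score breached in a 'threshold' column.
    g = df.groupby(group_col)[val_col]
    out = df.copy()
    out["zscore"] = (df[val_col] - g.transform("mean")) / g.transform("std")
    out["threshold"] = "none"
    out.loc[out["zscore"] > upper, "threshold"] = "upper"
    out.loc[out["zscore"] < lower, "threshold"] = "lower"
    return out

df = pd.DataFrame({"lsoa_code": ["A"] * 5, "gdhi": [10.0, 10.0, 10.0, 10.0, 30.0]})
flags = zscore_flags(df, "lsoa_code", "gdhi", upper=1.5, lower=-1.5)
```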

gdhi_adj.preprocess.flag_preprocess

Module for flagging preprocessing data in the gdhi_adj project.

create_master_flag(df: pd.DataFrame, zscore_calculation: bool, iqr_calculation: bool) -> pd.DataFrame

Creates a master flag based on z score and IQR flag columns.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame.

required
zscore_calculation bool

Whether z-score calculation is performed.

required
iqr_calculation bool

Whether IQR calculation is performed.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The DataFrame with an additional 'master_flag' column.

extract_start_end_years(df: pd.DataFrame) -> pd.DataFrame

Extracts the start and end years from the column headings.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame with years as headers.

required

Returns:

Type Description
Tuple[int, int]

A tuple containing the start and end years.

flag_rollback_years(df: pd.DataFrame, rollback_year_start: int, rollback_year_end: int) -> pd.DataFrame

Flags years where the GDHI has rolled back from future years. Typically, 2010-2014 have the 2015 data copied to them because those years are missing.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame.

required
rollback_year_start int

The start year for rollback flagging.

required
rollback_year_end int

The end year for rollback flagging.

required

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame with an additional 'rollback_flag' column.

gdhi_adj.preprocess.join_preprocess

Module for joining preprocessing data in the gdhi_adj project.

concat_wide_dataframes(df_wide_outlier: pd.DataFrame, df_wide_mean: pd.DataFrame) -> pd.DataFrame

Concatenates two wide dataframes to create a final wide DataFrame.

Parameters:

Name Type Description Default
df_wide_outlier DataFrame

The DataFrame containing outlier data.

required
df_wide_mean DataFrame

The DataFrame containing mean data.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The concatenated DataFrame in wide format.

constrain_to_reg_acc(df: pd.DataFrame, reg_acc: pd.DataFrame, transaction_name: str) -> pd.DataFrame

Calculate constrained and unconstrained values for each outlier case.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame with outliers to be constrained.

required
reg_acc DataFrame

The regional accounts DataFrame.

required
transaction_name str

Transaction code to filter regional accounts.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The constrained DataFrame.
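Constraining typically means scaling values so that each LAD/year group sums to the regional-accounts total. A minimal sketch of that idea, with hypothetical column and frame names:

```python
import pandas as pd

# Hypothetical LSOA-level values and a regional-accounts total per LAD/year.
df = pd.DataFrame({
    "lad_code": ["L1", "L1", "L1"],
    "year": [2020, 2020, 2020],
    "uncon_gdhi": [40.0, 40.0, 20.0],
})
reg_acc = pd.DataFrame({"lad_code": ["L1"], "year": [2020], "total": [120.0]})

# Scale each LSOA so the LAD/year group sums to the regional-accounts total.
merged = df.merge(reg_acc, on=["lad_code", "year"], how="left")
group_sum = merged.groupby(["lad_code", "year"])["uncon_gdhi"].transform("sum")
merged["con_gdhi"] = merged["uncon_gdhi"] * merged["total"] / group_sum
```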

gdhi_adj.preprocess.pivot_preprocess

Module for pivoting data in the gdhi_adj project.

pivot_output_long(df: pd.DataFrame, uncon_gdhi: str, con_gdhi: str) -> pd.DataFrame

Pivots the output DataFrame to long format.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame in wide format.

required
uncon_gdhi str

The column name for unconstrained GDHI.

required
con_gdhi str

The column name for constrained GDHI.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The pivoted DataFrame in long format.

pivot_wide_dataframe(df: pd.DataFrame) -> pd.DataFrame

Pivots the DataFrame from long to wide format.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame in long format.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The pivoted DataFrame in wide format.

pivot_years_long_dataframe(df: pd.DataFrame, new_var_col: str, new_val_col: str) -> pd.DataFrame

Pivots the DataFrame based on specified index, columns, and values.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame.

required
new_var_col str

The name for the column containing old column names.

required
new_val_col str

The name for the column containing values.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The pivoted DataFrame.
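The long pivot described here corresponds to a pandas `melt`; a small illustrative example (column names hypothetical):

```python
import pandas as pd

wide = pd.DataFrame({
    "lsoa_code": ["A", "B"],
    "2020": [10, 20],
    "2021": [11, 21],
})
# Melt the year columns into rows; var_name/value_name play the role of
# new_var_col/new_val_col in the function above.
long_df = wide.melt(id_vars="lsoa_code", var_name="year", value_name="gdhi")
```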

gdhi_adj.preprocess.run_preprocess

Module for pre-processing data in the gdhi_adj project.

run_preprocessing(config: dict) -> None

Run the preprocessing steps for the GDHI adjustment project.

This function performs the following steps:

1. Load the configuration settings.
2. Load the input data.
3. Pivot the DataFrame to long format.
4. Calculate percentage rate of change and flag rollback years.
5. Calculate z-scores and IQRs if desired as per config.
6. Create master flags.
7. Save interim data with all calculated values.
8. Calculate LAD mean GDHI.
9. Constrain outliers to regional accounts.
10. Pivot the DataFrame back to wide format.
11. Save the preprocessed data ready for PowerBI analysis.

Parameters:

Name Type Description Default
config dict

Configuration dictionary containing user settings and pipeline settings.

required

Returns:

Name Type Description
None None

The function does not return any value. It saves the processed DataFrame to a CSV file.

Adjustment

gdhi_adj.adjustment.apportion_adjustment

Module for apportioning values from adjustment in the gdhi_adj project.

apportion_adjustment(df: pd.DataFrame, imputed_df: pd.DataFrame) -> pd.DataFrame

Apportion the adjustment values to all years for each LSOA.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing data to adjust.

required
imputed_df DataFrame

DataFrame containing outlier imputed values.

required

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame with outlier values imputed and adjustment values apportioned across all years within the LSOA.

apportion_negative_adjustment(df: pd.DataFrame) -> pd.DataFrame

Change negative values to 0 and apportion negative adjustment values to all LSOAs within an LAD/year group.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing data to adjust.

required

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame with negative adjustment values apportioned across all LSOAs within an LAD/year group.

apportion_rollback_years(df: pd.DataFrame) -> pd.DataFrame

Continue to apportion the adjustments for years that are flagged as rollback years.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing all data including adjusted and rollback years.

required

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame with reapportioned values for rollback years.

calc_non_outlier_proportions(df: pd.DataFrame) -> pd.DataFrame

Calculate the proportion of a non-outlier LSOA to the LAD for each year.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing all GDHI data.

required

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame with LAD totals and proportions for non-outlier LSOAs calculated per year/LAD group.
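The proportion calculation can be sketched as follows, assuming a boolean `master_flag` marks outliers (all names hypothetical, not the library's code):

```python
import pandas as pd

df = pd.DataFrame({
    "lad_code": ["L1", "L1", "L1"],
    "year": [2020, 2020, 2020],
    "con_gdhi": [30.0, 60.0, 10.0],
    "master_flag": [False, False, True],  # last LSOA is an outlier
})
# Mask outliers, total the remaining LSOAs per LAD/year, then take proportions.
non_out = df["con_gdhi"].where(~df["master_flag"])
df["non_out_lad_total"] = non_out.groupby([df["lad_code"], df["year"]]).transform("sum")
df["non_out_proportion"] = non_out / df["non_out_lad_total"]
```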

check_no_negative_values_col(df: pd.DataFrame, col: str) -> None

Check that adjusted_con_gdhi has no negative values.

Parameters:

Name Type Description Default
df DataFrame

DataFrame with adjusted_con_gdhi column.

required

Raises:

Type Description
ValueError

If negative values are found.

gdhi_adj.adjustment.calc_adjustment

Module for calculations to adjust data in the gdhi_adj project.

extrapolate_imputed_val(df: pd.DataFrame, imputed_df: pd.DataFrame) -> pd.DataFrame

Calculate the imputed value for a given LSOA code where the year flagged as an outlier to adjust has a valid safe year on only one side.

The imputed value is extrapolated from the nearest safe year and the year 4 years after. This is to avoid short term fluctuations.

Parameters:

Name Type Description Default
df DataFrame

DataFrame with full data for lookup.

required
imputed_df DataFrame

DataFrame to calculate imputed value.

required

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame containing outlier imputed values.

interpolate_imputed_val(df: pd.DataFrame) -> pd.DataFrame

Calculate the imputed value for a given LSOA code where the year flagged as an outlier to adjust has a valid safe year on either side.

Parameters:

Name Type Description Default
df DataFrame

DataFrame with data to calculate imputed value.

required

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame containing outlier imputed values.
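Linear interpolation between the two safe years can be sketched as follows (illustrative only, not the library's exact formula):

```python
def interpolate_between_safe_years(prev_year, prev_val, next_year, next_val, target_year):
    # Weight the two safe-year values by the target year's position between them.
    weight = (target_year - prev_year) / (next_year - prev_year)
    return prev_val + weight * (next_val - prev_val)

imputed = interpolate_between_safe_years(2018, 100.0, 2020, 120.0, 2019)
```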

gdhi_adj.adjustment.filter_adjustment

Module for filtering adjustment data in the gdhi_adj project.

filter_adjust(df: pd.DataFrame) -> pd.DataFrame

Filter data to keep only LSOAs for adjustment and subset.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing LSOA data.

required

Returns:

Type Description
DataFrame

pd.DataFrame: Filtered DataFrame with only relevant columns and rows.

filter_component(df: pd.DataFrame, sas_code_filter: str, cord_code_filter: str, credit_debit_filter: str) -> pd.DataFrame

Filter DataFrame by component codes.

Parameters:

Name Type Description Default
df DataFrame

Constrained DataFrame with component code data.

required
sas_code_filter str

SAS code to filter by.

required
cord_code_filter str

CORD code to filter by.

required
credit_debit_filter str

Credit/Debit code to filter by.

required

Returns:

Type Description
DataFrame

pd.DataFrame: Filtered DataFrame containing only rows matching the specified component codes.

filter_year(df: pd.DataFrame, start_year: int, end_year: int) -> pd.DataFrame

Filter DataFrame by a range of years, inclusive.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing year data.

required
start_year int

Start year for filtering (inclusive).

required
end_year int

End year for filtering (inclusive).

required

Returns:

Type Description
DataFrame

pd.DataFrame: Filtered DataFrame containing only rows within the year range.

gdhi_adj.adjustment.flag_adjustment

Module for flagging data to adjust data in the gdhi_adj project.

identify_safe_years(df: pd.DataFrame, start_year: int = 1900, end_year: int = 2100) -> pd.DataFrame

Identify safe years for each LSOA where no adjustment is needed.

For a run of sequential years flagged for adjustment, the previous and next safe years are located at either end of the sequence.

For years flagged for adjustment at the edge of the data range, it will return one safe year inside the range and one outside it, which will have NaN for con_gdhi.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame.

required
start_year int

The starting year for the data range.

1900
end_year int

The ending year for the data range.

2100

Returns:

Name Type Description
df DataFrame

DataFrame with additional columns for safe years.

safe_years_df DataFrame

DataFrame containing only the rows that need adjustment, with non-outlier year values either side of the outlier years.

gdhi_adj.adjustment.join_adjustment

Module for joining adjustment data in the gdhi_adj project.

join_analyst_constrained_data(df_constrained: pd.DataFrame, df_analyst: pd.DataFrame) -> pd.DataFrame

Join analyst data to constrained data based on LSOA code and LAD code.

Parameters:

Name Type Description Default
df_constrained DataFrame

DataFrame containing constrained data.

required
df_analyst DataFrame

DataFrame containing analyst data.

required

Returns:

Type Description
DataFrame

pd.DataFrame: Joined DataFrame with relevant columns.

join_analyst_unconstrained_data(df_unconstrained: pd.DataFrame, df_analyst: pd.DataFrame) -> pd.DataFrame

Join analyst data to unconstrained data based on LSOA code and LAD code.

Parameters:

Name Type Description Default
df_unconstrained DataFrame

DataFrame with unconstrained data.

required
df_analyst DataFrame

DataFrame containing analyst data.

required

Returns:

Type Description
DataFrame

pd.DataFrame: Joined DataFrame with relevant columns.

gdhi_adj.adjustment.pivot_adjustment

Module for pivoting adjustment data in the gdhi_adj project.

pivot_adjustment_long(df: pd.DataFrame) -> pd.DataFrame

Un-pivot (melt) the adjustment DataFrame from wide to long format.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing data to be adjusted.

required

Returns:

Type Description
DataFrame

pd.DataFrame: Pivoted DataFrame in long format.

pivot_wide_final_dataframe(df: pd.DataFrame) -> pd.DataFrame

Pivots the DataFrame from long to wide format.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame in long format.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The pivoted DataFrame in wide format.

gdhi_adj.adjustment.reformat_adjustment

Module for reformatting adjustment data in the gdhi_adj project.

reformat_adjust_col(df: pd.DataFrame) -> pd.DataFrame

Reformat data within the adjust column.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame to be reformatted.

required

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame with reformatted columns.

reformat_year_col(df: pd.DataFrame, start_year: int, end_year: int) -> pd.DataFrame

Reformat data within the year column.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame to be reformatted.

required
start_year int

Start year for the year range.

required
end_year int

End year for the year range.

required

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame with reformatted columns.

gdhi_adj.adjustment.run_adjustment

Module for adjusting data in the gdhi_adj project.

run_adjustment(config: dict) -> None

Run the adjustment steps for the GDHI adjustment project.

This function performs the following steps:

1. Load the configuration settings.
2. Load the input data.
3. Reformat adjust and year columns.
4. Filter data for adjustment.
5. Join analyst output with constrained and unconstrained data.
6. Pivot the DataFrame to long format for manipulation.
7. Filter data by the specified year range.
8. Calculate the imputed gdhi values for outlier years.
9. Calculate adjustment values based on imputed gdhi.
10. Apportion adjustment values to all years.
11. Save interim data with all calculated values.
12. Pivot data to wide format for PowerBI QA reiteration.
13. Pivot final DataFrame to wide format for exporting.
14. Save the final adjusted data.

Parameters:

Name Type Description Default
config dict

Configuration dictionary containing user settings and pipeline settings.

required

Returns:

Name Type Description
None None

The function does not return any value. It saves the processed DataFrame to a CSV file.

gdhi_adj.adjustment.validation_adjustment

Module for adjustment data validation in the gdhi_adj project.

check_adjust_year_not_empty(df: pd.DataFrame) -> pd.DataFrame

Check that for LSOAs marked for adjustment, the year column is not empty.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame to be checked.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The original DataFrame, unchanged. This allows the function to be used in method chaining (e.g., .pipe()).

Raises:

Type Description
ValueError

If an LSOA marked for adjustment does not have a year specified to adjust.

check_lsoas_flagged(df: pd.DataFrame) -> pd.DataFrame

Check that not all LSOAs within an LAD are flagged for adjustment.

This is so that there are some non-outlier LSOAs to calculate non-outlier proportions of the total GDHI within an LAD.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame to be checked.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The original DataFrame, unchanged. This allows the function to be used in method chaining (e.g., .pipe()).

Raises:

Type Description
ValueError

If every 'lsoa_code' within an 'lad_code' is marked for adjustment.

check_years_flagged(df: pd.DataFrame) -> pd.DataFrame

Check that not all years within an LSOA are flagged for adjustment.

This is so that there are some non-outlier years to interpolate/extrapolate from.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame to be checked.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The original DataFrame, unchanged. This allows the function to be used in method chaining (e.g., .pipe()).

Raises:

Type Description
ValueError

If every year within an 'lsoa_code' is marked for adjustment.

CORD Preparation

gdhi_adj.cord_preparation.mapping_cord_prep

Module for mapping local authority units (LAUs) to LADs.

aggregate_lad(df)

Aggregate values on LADs and other identifiers.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing adjusted data with LAD codes joined.

required

Returns:

Type Description

pd.DataFrame: DataFrame containing value columns now aggregated by sum on identifier columns.

clean_validate_mapper(mapper_df)

Subset the mapper and get a unique DataFrame of values.

Parameters:

Name Type Description Default
mapper_df DataFrame

DataFrame containing lookup data used to join LAUs to LADs.

required

Returns:

Type Description

pd.DataFrame: DataFrame with unique values for LADs and LAUs.

join_mapper(df, mapper_df)

Join mapper containing a lookup of LAU and LAD values, to adjusted data.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing adjusted data.

required
mapper_df DataFrame

DataFrame containing LAU to LAD lookup.

required

Returns:

Type Description

pd.DataFrame: DataFrame with LADs joined on LAU codes.

map_S30_to_S12(config: dict, df: pd.DataFrame) -> pd.DataFrame

Run the mapping steps for the GDHI adjustment pipeline.

This function performs the following steps:

1. Rename column containing S30 values to LAU and verify mapping is required.

If mapping is required:

2. Load in and clean mapper containing LAU to LAD lookup.
3. Join LAU-LAD mapper to adjusted data.
4. Aggregate to LAD if specified in config.
5. Reformat output.

reformat(df, original_columns)

Rename LAD columns for end format.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing adjusted data and LAD codes.

required
original_columns list

List of columns from original DataFrame.

required

Returns:

Type Description

pd.DataFrame: Renamed DataFrame with desired columns.

rename_s30_to_lau(config, df)

Rename column containing S30 area codes to lau_

Parameters:

Name Type Description Default
config dict

Configuration dictionary containing user settings and pipeline settings.

required
df DataFrame

DataFrame containing adjusted data.

required

Returns:

Name Type Description

pd.DataFrame: DataFrame with the area code column renamed if S30 codes have been found; otherwise the original DataFrame.

need_mapping bool

Boolean indicating whether mapping is needed.

gdhi_adj.cord_preparation.transform_cord_prep

Module for imputing values ready for CORD in the gdhi_adj project.

append_all_sub_components(config: dict) -> pd.DataFrame

Append all DataFrames that contain separate sub-components together so that each LSOA has all sub-components present in one DataFrame.

Parameters:

Name Type Description Default
config dict

Pipeline configuration dictionary containing filepaths for the location of sub-component data.

required

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame with all sub-components appended for each LSOA.

impute_suppression_x(df: pd.DataFrame, target_cols: List[str], transaction_col: str = 'transaction', lsoa_col: str = 'lsoa_code', transaction_value: str = 'D623', lsoa_val: List[str] = ['95', 'S']) -> pd.DataFrame

Set cells in target_cols to "X" where both conditions are met:

  • The value in transaction_col equals transaction_value.
  • The value in lsoa_col starts with any of the values in the lsoa_val list.

Parameters:

Name Type Description Default
df DataFrame

input DataFrame

required
target_cols List[str]

list of column names to modify.

required
transaction_col str

name of the transaction column.

'transaction'
lsoa_col str

name of the LSOA column.

'lsoa_code'
transaction_value str

transaction value to match.

'D623'
lsoa_val List[str]

list of starting strings for LSOA codes to match (case-sensitive).

['95', 'S']

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame with suppressed values.
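The suppression rule described above can be reproduced with a boolean mask; a small sketch using the documented defaults (the DataFrame contents are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "lsoa_code": ["95AA0001", "E01000001", "S01000001"],
    "transaction": ["D623", "D623", "D61"],
    "2020": [5, 6, 7],
})
# Both conditions must hold: the transaction matches AND the LSOA code starts
# with one of the given prefixes.
mask = df["transaction"].eq("D623") & df["lsoa_code"].str.startswith(("95", "S"))
out = df.astype({"2020": "object"})  # allow the string "X" in a numeric column
out.loc[mask, "2020"] = "X"
```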

gdhi_adj.cord_preparation.validation_cord_prep

Module for validation checks prior to CORD in the gdhi_adj project.

check_lsoa_consistency(df: pd.DataFrame) -> pd.DataFrame

Performs an internal consistency check on the DataFrame to ensure 'lsoa_code' uniqueness matches the total row count.

This function verifies that the number of unique values in the 'lsoa_code' column is exactly equal to the number of rows in the DataFrame.

Parameters:

Name Type Description Default
df DataFrame

The input pandas DataFrame containing an 'lsoa_code' column.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The original DataFrame, unchanged. This allows the function to be used in method chaining (e.g., .pipe()).

Raises:

Type Description
ValueError

If the number of unique 'lsoa_code' values does not match the total number of rows in the DataFrame.

KeyError

If the 'lsoa_code' column is missing from the DataFrame.
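The `.pipe()`-friendly validation pattern used throughout this module can be sketched as follows (illustrative, not the library's code):

```python
import pandas as pd

def check_lsoa_unique(df):
    # Raise if lsoa_code is not unique; return df unchanged so the check
    # can sit inside a .pipe() chain.
    if "lsoa_code" not in df.columns:
        raise KeyError("'lsoa_code' column is missing")
    if df["lsoa_code"].nunique() != len(df):
        raise ValueError("duplicate lsoa_code values found")
    return df

ok = pd.DataFrame({"lsoa_code": ["A", "B"]}).pipe(check_lsoa_unique)
try:
    pd.DataFrame({"lsoa_code": ["A", "A"]}).pipe(check_lsoa_unique)
    raised = False
except ValueError:
    raised = True
```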

check_lsoa_count(df: pd.DataFrame, df_unconstrained: pd.DataFrame) -> pd.DataFrame

Perform a validation check to ensure that the number of unique lsoa_codes in the constrained DataFrame matches that in the unconstrained DataFrame.

Parameters:

Name Type Description Default
df DataFrame

The input pandas DataFrame containing an 'lsoa_code' column.

required
df_unconstrained DataFrame

The unconstrained DataFrame to compare against.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The original DataFrame, unchanged. This allows the function to be used in method chaining (e.g., .pipe()).

Raises:

Type Description
ValueError

If the number of unique 'lsoa_code' values in the constrained DataFrame does not match the number of unique 'lsoa_code' values in the unconstrained DataFrame.

KeyError

If the 'lsoa_code' column is missing from the DataFrame.

check_no_negative_values_df(df: pd.DataFrame) -> pd.DataFrame

Checks all numeric columns in the DataFrame to ensure no values are less than 0.

This function isolates numeric columns (integers and floats) and verifies that all values are non-negative. It ignores non-numeric columns (e.g., strings).

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be validated.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The original DataFrame, unchanged, for method chaining.

Raises:

Type Description
ValueError

If any negative values are found in numeric columns.

check_no_nulls(df: pd.DataFrame) -> pd.DataFrame

Checks the entire DataFrame to ensure it contains no Null, NaN, or None values.

This function scans all cells in the DataFrame. It detects standard numpy NaNs, Python None objects, and pandas pd.NA values.

If any such value is found, it raises a ValueError.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be validated.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The original DataFrame, unchanged, for method chaining.

Raises:

Type Description
ValueError

If any null/NaN/None values are found in the DataFrame.

check_subcomponent_lookup(df: pd.DataFrame, lookup_df: pd.DataFrame) -> pd.DataFrame

This function verifies that each unique value combination in the 'transaction' and 'account_entry' columns from the subcomponent lookup is present in the DataFrame.

Parameters:

Name Type Description Default
df DataFrame

The input pandas DataFrame containing subcomponent data.

required
lookup_df DataFrame

The lookup DataFrame containing all combinations of subcomponents that should be present.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The original DataFrame, unchanged. This allows the function to be used in method chaining (e.g., .pipe()).

Raises:

Type Description
ValueError

If any combination of 'transaction' and 'account_entry' values from the lookup is missing from the DataFrame.

check_year_column_completeness(df: pd.DataFrame) -> pd.DataFrame

Verifies that the DataFrame contains a complete set of consecutive year columns.

This function automatically identifies numeric column names (integers or strings representing integers), determines the minimum and maximum years, and checks if every year between that minimum and maximum exists as a column.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The original DataFrame, unchanged, for method chaining.

Raises:

Type Description
ValueError
  • If no numeric/year columns are found.
  • If there are gaps in the sequence of years detected.
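The year-gap detection can be sketched as follows (illustrative, not the library's code):

```python
def missing_years(columns):
    # Keep integer-like column names, then report any gaps between the
    # minimum and maximum year found.
    years = sorted(int(c) for c in columns if str(c).isdigit())
    if not years:
        raise ValueError("no numeric/year columns found")
    present = set(years)
    return [y for y in range(years[0], years[-1] + 1) if y not in present]

gaps = missing_years(["lsoa_code", "2019", "2020", "2022"])
```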

gdhi_adj.cord_preparation.run_cord_prep

Module for preparing CORD data in the gdhi_adj project.

run_cord_preparation(config: dict) -> None

Run the CORD preparation steps for the GDHI adjustment project.

This function performs the following steps:

1. Load the configuration settings.
2. Load the input data and append all subcomponents together.
3. Map LAU S30 codes to LAD S12 codes.
4. Perform validation checks on the input data.
5. Apply CORD-specific transformations.
6. Save the prepared CORD data for further processing.

Parameters:

Name Type Description Default
config dict

Configuration dictionary containing user settings and pipeline settings.

required

Returns:

Name Type Description
None None

The function does not return any value. It saves the processed DataFrame to a CSV file.

Utilities

gdhi_adj.utils.helpers

Define helper functions that wrap regularly-used functions.

convert_column_types(df: pd.DataFrame, schema: dict, logger: logging.Logger) -> pd.DataFrame

Convert DataFrame columns data types as specified in the schema.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame to convert column types.

required
schema dict

The schema containing column names and their expected types.

required
logger Logger

Logger for logging conversion actions.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The DataFrame with converted column types.

Warns:

Type Description
Warning

If a column's type conversion fails, a warning is logged.

load_schema_from_toml(schema_path: str) -> dict

Load a schema from a TOML file.

Parameters:

Name Type Description Default
schema_path str

Path to the TOML schema file.

required

Returns:

Name Type Description
dict dict

A dictionary representation of the schema.

load_toml_config(path: Union[str, pathlib.Path]) -> dict | None

Load a .toml file from a path, with logging and safe error handling.

Parameters:

Name Type Description Default
path Union[str, Path]

The path to load the .toml file from.

required

Returns:

Type Description
dict | None

dict | None: The loaded toml file as a dictionary, or None on error.

read_with_schema(input_file_path: str, input_schema_path: str) -> pd.DataFrame

Reads in a csv file and compares it to a data dictionary schema.

Parameters:

Name Type Description Default
input_file_path string

Filepath to the csv file to be read in.

required
input_schema_path string

Filepath to the schema file in TOML format.

required

Returns:

Name Type Description
df DataFrame

Formatted DataFrame containing data from the csv file.

rename_columns(df: pd.DataFrame, schema: dict, logger: logging.Logger) -> pd.DataFrame

Rename columns in the DataFrame based on the schema. Schema should be a dict where keys are new column names and values are dicts with 'old_name'.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame to rename columns in.

required
schema dict

The schema containing old and new column names.

required
logger Logger

Logger for logging renaming actions.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The DataFrame with renamed columns.

validate_schema(df: pd.DataFrame, schema: dict)

Validate the DataFrame against the schema.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame to validate.

required
schema dict

The schema sourced from a TOML file to validate against.

required

Raises:

Type Description
ValueError

If a required column from the schema is missing in the DataFrame.

TypeError

If a column's type does not match the expected type in the schema.

write_with_schema(df: pd.DataFrame, output_schema_path: str, output_dir: str, new_filename=None)

Writes a DataFrame to a CSV file, renaming columns and validating against a schema.

Parameters:

Name Type Description Default
df DataFrame

The final output DataFrame to write to CSV.

required
output_schema_path str

Path to the output schema file in TOML

required
output_dir str

Directory where the CSV file will be saved.

required
new_filename str

New filename for the output CSV. If None, uses the original name.

None

Raises:

Type Description
ValueError

If the DataFrame does not match the schema.

Returns:

Name Type Description
None

Writes the DataFrame to a CSV file after validating against the schema.

gdhi_adj.utils.logger

CustomFormatter

Bases: Formatter

Define logging formatter with colors for different log levels.

format(record)

Set color formatting for logger.

GDHI_adj_logger(name)

Custom logging class for use throughout the GDHI_adj pipeline.

Parameters

name : str
    The name of the file the logger is being created from.

Initialise the logger class.

gdhi_adj.utils.transform_helpers

Define helper functions that wrap regularly-used functions.

ensure_list(x: any) -> list

Ensure the input is returned as a list.

Parameters:

Name Type Description Default
x any

Input value to be converted to a list.

required

Returns:

Type Description
list

The input value wrapped in a list if it was not already a list.

increment_until_not_in(year: int, adjust_years: list, limit_year: int, is_increasing: bool = True)

Increase or decrease a year until it is not in a list of adjust_years.

Parameters:

Name Type Description Default
year int

The starting year.

required
adjust_years list

List of years to avoid.

required
limit_year int

The limit year to stop at.

required
is_increasing bool

If True, increase the year; if False, decrease it.

True

Returns:

Type Description
int

The first year not in the adjust_years list.
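A minimal implementation consistent with this docstring (the library's version may differ, e.g. in how the limit year is enforced):

```python
def increment_until_not_in(year, adjust_years, limit_year, is_increasing=True):
    # Step the year by +1 or -1 until it is no longer in adjust_years,
    # stopping early if the limit year is reached.
    step = 1 if is_increasing else -1
    while year in adjust_years and year != limit_year:
        year += step
    return year

safe = increment_until_not_in(2015, [2015, 2016], 2021)
```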

sum_match_check(df: pd.DataFrame, grouping_cols: list, unadjusted_col: str, adjusted_col: str, sum_tolerance: float = 1e-06)

Check that the sums of the adjusted column match those of the unadjusted column for the same groupings.

If the difference exceeds a specified tolerance, raise an error.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing data for sums.

required
grouping_cols list

List of columns to group by for the sums.

required
unadjusted_col str

Unadjusted column.

required
adjusted_col str

Adjusted column.

required
sum_tolerance float

Tolerance for the sums to match, based on floating point error.

1e-06

Raises:

Type Description
ValueError

If adjusted and unadjusted sums do not match.
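The sum check can be sketched as follows (hypothetical names; not the library's code):

```python
import pandas as pd

def sums_match(df, grouping_cols, unadjusted_col, adjusted_col, tol=1e-06):
    # Compare group sums of the two columns and fail loudly if they diverge
    # beyond the tolerance; return df unchanged for chaining.
    sums = df.groupby(grouping_cols)[[unadjusted_col, adjusted_col]].sum()
    diff = (sums[unadjusted_col] - sums[adjusted_col]).abs()
    if (diff > tol).any():
        raise ValueError("adjusted and unadjusted sums do not match")
    return df

df = pd.DataFrame({
    "lad_code": ["L1", "L1"],
    "gdhi": [50.0, 50.0],
    "adjusted_gdhi": [60.0, 40.0],  # redistributed, but the total is preserved
})
checked = sums_match(df, ["lad_code"], "gdhi", "adjusted_gdhi")
```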

to_int_list(cell: Any) -> List[int]

Convert a cell to a list of ints.

Accepts:

  • a comma-separated string like "2010,2011, 2012"
  • a list/tuple of strings or numbers
  • NaN/None, which returns []

Raises ValueError if an item cannot be converted to int.
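A minimal implementation consistent with this docstring (illustrative only):

```python
import math

def to_int_list(cell):
    # Normalise a cell to a list of ints; NaN/None become an empty list.
    if cell is None or (isinstance(cell, float) and math.isnan(cell)):
        return []
    if isinstance(cell, str):
        items = [p.strip() for p in cell.split(",") if p.strip()]
    elif isinstance(cell, (list, tuple)):
        items = cell
    else:
        items = [cell]
    return [int(x) for x in items]  # int() raises ValueError on bad items

years = to_int_list("2010,2011, 2012")
```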