
API Reference

This part of the project documentation focuses on an information-oriented approach. Use it as a reference for the technical implementation of the gdhi-adj codebase.

Main Pipeline

gdhi_adj.pipeline

Run each module of the pipeline based on config parameters.

run_pipeline(config_path)

Run the GDHI adjustment pipeline.

Parameters:

Name Type Description Default
config_path str

Path to the configuration file.

required

Preprocessing

gdhi_adj.preprocess.calc_preprocess

Module for calculations to preprocess data in the gdhi_adj project.

calc_iqr(df: pd.DataFrame, iqr_prefix: str, group_col: str, val_col: str, iqr_lower_quantile: float = 0.25, iqr_upper_quantile: float = 0.75, iqr_multiplier: float = 3.0) -> pd.DataFrame

Calculates the interquartile range (IQR) for each LSOA in the DataFrame.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame.

required
iqr_prefix str

Prefix for the IQR column names.

required
group_col str

The column to group by for IQR calculation.

required
val_col str

The column containing values to calculate IQR.

required
iqr_lower_quantile float

The lower quantile for IQR calculation.

0.25
iqr_upper_quantile float

The upper quantile for IQR calculation.

0.75
iqr_multiplier float

The multiplier for the IQR to determine outlier bounds.

3.0

Returns:

Type Description
DataFrame

pd.DataFrame: The DataFrame with additional columns for the IQR, outlier bounds, and a 'threshold' column indicating which threshold was breached.
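As an illustration of the grouped IQR approach described above, here is a minimal pandas sketch. The function and column names are hypothetical, not the library's implementation:

```python
import pandas as pd

def iqr_bounds(df, group_col, val_col, lower_q=0.25, upper_q=0.75, multiplier=3.0):
    # Compute per-group quartiles, broadcast back to rows, and flag values
    # outside [Q1 - multiplier*IQR, Q3 + multiplier*IQR].
    grouped = df.groupby(group_col)[val_col]
    q1 = grouped.transform(lambda s: s.quantile(lower_q))
    q3 = grouped.transform(lambda s: s.quantile(upper_q))
    iqr = q3 - q1
    out = df.copy()
    out["iqr_lower_bound"] = q1 - multiplier * iqr
    out["iqr_upper_bound"] = q3 + multiplier * iqr
    out["iqr_flag"] = (df[val_col] < out["iqr_lower_bound"]) | (df[val_col] > out["iqr_upper_bound"])
    return out

df = pd.DataFrame({
    "lsoa_code": ["A"] * 6,
    "gdhi": [10.0, 11.0, 12.0, 11.0, 10.0, 100.0],  # 100 is an extreme value
})
flagged = iqr_bounds(df, "lsoa_code", "gdhi")
```

With the conservative default multiplier of 3.0, only extreme values fall outside the bounds.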

calc_lad_mean(df: pd.DataFrame) -> pd.DataFrame

Calculates the mean GDHI for each non-outlier LSOA in the DataFrame.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The DataFrame with an added 'mean_non_out_gdhi' column.

calc_rate_of_change(df: pd.DataFrame, ascending: bool, sort_cols: list, group_col: str, val_col: str) -> pd.DataFrame

Calculate the rate of change going forward and backwards in time in the DataFrame.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame.

required
ascending bool

If True, calculates forward rate of change; otherwise, backward.

required
sort_cols list

Columns to sort by before calculating rate of change.

required
group_col str

The column to group by for rate of change calculation.

required
val_col str

The column for which the rate of change is calculated.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame containing the rate of change values.
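The forward/backward rate-of-change idea can be sketched with a grouped `pct_change` after sorting. This is illustrative only; the function and column names here are hypothetical:

```python
import pandas as pd

def rate_of_change(df, sort_cols, group_col, val_col, ascending=True):
    # Sort first, then take the percentage change within each group;
    # ascending=True gives the forward rate, ascending=False the backward rate.
    out = df.sort_values(sort_cols, ascending=ascending).copy()
    out["rate_of_change"] = out.groupby(group_col)[val_col].pct_change()
    return out

df = pd.DataFrame({
    "lsoa_code": ["A", "A", "A"],
    "year": [2019, 2020, 2021],
    "gdhi": [100.0, 110.0, 99.0],
})
fwd = rate_of_change(df, ["lsoa_code", "year"], "lsoa_code", "gdhi")
```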

calc_zscores(df: pd.DataFrame, score_prefix: str, group_col: str, val_col: str, zscore_upper_threshold: float = 3.0, zscore_lower_threshold: float = -3.0) -> pd.DataFrame

Calculates the z-scores for percent changes and raw data in DataFrame.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame.

required
score_prefix str

Prefix for the zscore column names.

required
group_col str

The column to group by for z-score calculation.

required
val_col str

The column values to calculate zscores.

required
zscore_upper_threshold float

The upper threshold for z-score flag.

3.0
zscore_lower_threshold float

The lower threshold for z-score flag.

-3.0

Returns:

Type Description
DataFrame

pd.DataFrame: The DataFrame with additional 'zscore' and 'threshold' columns, indicating which threshold the z-score breached.
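A per-group z-score with threshold flags might look like the following sketch (hypothetical names, not the library's code; the example uses tighter thresholds than the 3.0 defaults so the small sample triggers a flag):

```python
import pandas as pd

def zscore_flags(df, group_col, val_col, upper=3.0, lower=-3.0):
    # Standardise values within each group, then record which threshold
    # (if any) each z-score breached in a 'threshold' column.
    g = df.groupby(group_col)[val_col]
    out = df.copy()
    out["zscore"] = (df[val_col] - g.transform("mean")) / g.transform("std")
    out["threshold"] = "none"
    out.loc[out["zscore"] > upper, "threshold"] = "upper"
    out.loc[out["zscore"] < lower, "threshold"] = "lower"
    return out

df = pd.DataFrame({"lsoa_code": ["A"] * 5, "gdhi": [10.0, 10.0, 10.0, 10.0, 30.0]})
flags = zscore_flags(df, "lsoa_code", "gdhi", upper=1.5, lower=-1.5)
```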

gdhi_adj.preprocess.flag_preprocess

Module for flagging preprocessing data in the gdhi_adj project.

create_master_flag(df: pd.DataFrame, zscore_calculation: bool, iqr_calculation: bool) -> pd.DataFrame

Creates a master flag based on z score and IQR flag columns.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame.

required
zscore_calculation bool

Whether z-score calculation is performed.

required
iqr_calculation bool

Whether IQR calculation is performed.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The DataFrame with an additional 'master_flag' column.

extract_start_end_years(df: pd.DataFrame) -> pd.DataFrame

Extracts the start and end years from the column headings.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame with years as headers.

required

Returns:

Type Description
Tuple[int, int]

A tuple containing the start and end years.

flag_rollback_years(df: pd.DataFrame, rollback_year_start: int, rollback_year_end: int) -> pd.DataFrame

Flags years where the GDHI has rolled back from future years. Typically, 2010-2014 have the 2015 data copied to them because those years are missing.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame.

required
rollback_year_start int

The start year for rollback flagging.

required
rollback_year_end int

The end year for rollback flagging.

required

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame with an additional 'rollback_flag' column.

gdhi_adj.preprocess.join_preprocess

Module for joining preprocessing data in the gdhi_adj project.

concat_wide_dataframes(df_wide_outlier: pd.DataFrame, df_wide_mean: pd.DataFrame) -> pd.DataFrame

Concatenates two wide dataframes to create a final wide DataFrame.

Parameters:

Name Type Description Default
df_wide_outlier DataFrame

The DataFrame containing outlier data.

required
df_wide_mean DataFrame

The DataFrame containing mean data.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The concatenated DataFrame in wide format.

constrain_to_reg_acc(df: pd.DataFrame, reg_acc: pd.DataFrame, transaction_name: str) -> pd.DataFrame

Calculate constrained and unconstrained values for each outlier case.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame with outliers to be constrained.

required
reg_acc DataFrame

The regional accounts DataFrame.

required
transaction_name str

Transaction code to filter regional accounts.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The constrained DataFrame.
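Constraining typically means scaling values so that each LAD/year group sums to the regional-accounts total. A minimal sketch of that idea, with hypothetical column and frame names:

```python
import pandas as pd

# Hypothetical LSOA-level values and a regional-accounts total per LAD/year.
df = pd.DataFrame({
    "lad_code": ["L1", "L1", "L1"],
    "year": [2020, 2020, 2020],
    "uncon_gdhi": [40.0, 40.0, 20.0],
})
reg_acc = pd.DataFrame({"lad_code": ["L1"], "year": [2020], "total": [120.0]})

# Scale each LSOA so the LAD/year group sums to the regional-accounts total.
merged = df.merge(reg_acc, on=["lad_code", "year"], how="left")
group_sum = merged.groupby(["lad_code", "year"])["uncon_gdhi"].transform("sum")
merged["con_gdhi"] = merged["uncon_gdhi"] * merged["total"] / group_sum
```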

gdhi_adj.preprocess.pivot_preprocess

Module for pivoting data in the gdhi_adj project.

pivot_output_long(df: pd.DataFrame, uncon_gdhi: str, con_gdhi: str) -> pd.DataFrame

Pivots the output DataFrame to long format.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame in wide format.

required
uncon_gdhi str

The column name for unconstrained GDHI.

required
con_gdhi str

The column name for constrained GDHI.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The pivoted DataFrame in long format.

pivot_wide_dataframe(df: pd.DataFrame) -> pd.DataFrame

Pivots the DataFrame from long to wide format.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame in long format.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The pivoted DataFrame in wide format.

pivot_years_long_dataframe(df: pd.DataFrame, new_var_col: str, new_val_col: str) -> pd.DataFrame

Pivots the DataFrame based on specified index, columns, and values.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame.

required
new_var_col str

The name for the column containing old column names.

required
new_val_col str

The name for the column containing values.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The pivoted DataFrame.
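The long pivot described here corresponds to a pandas `melt`; a small illustrative example (column names hypothetical):

```python
import pandas as pd

wide = pd.DataFrame({
    "lsoa_code": ["A", "B"],
    "2020": [10, 20],
    "2021": [11, 21],
})
# Melt the year columns into rows; var_name/value_name play the role of
# new_var_col/new_val_col in the function above.
long_df = wide.melt(id_vars="lsoa_code", var_name="year", value_name="gdhi")
```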

gdhi_adj.preprocess.run_preprocess

Module for pre-processing data in the gdhi_adj project.

run_preprocessing(config: dict) -> None

Run the preprocessing steps for the GDHI adjustment project.

This function performs the following steps:

1. Load the configuration settings.
2. Load the input data.
3. Pivot the DataFrame to long format.
4. Calculate percentage rate of change and flag rollback years.
5. Calculate z-scores and IQRs if desired as per config.
6. Create master flags.
7. Save interim data with all calculated values.
8. Calculate LAD mean GDHI.
9. Constrain outliers to regional accounts.
10. Pivot the DataFrame back to wide format.
11. Save the preprocessed data ready for PowerBI analysis.

Parameters:

Name Type Description Default
config dict

Configuration dictionary containing user settings and pipeline settings.

required

Returns:

Name Type Description
None None

The function does not return any value. It saves the processed DataFrame to a CSV file.

Adjustment

gdhi_adj.adjustment.apportion_adjustment

Module for apportioning values from adjustment in the gdhi_adj project.

apportion_adjustment(df: pd.DataFrame, imputed_df: pd.DataFrame) -> pd.DataFrame

Apportion the adjustment values to all years for each LSOA.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing data to adjust.

required
imputed_df DataFrame

DataFrame containing outlier imputed values.

required

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame with outlier values imputed and adjustment values apportioned across all years within the LSOA.

apportion_negative_adjustment(df: pd.DataFrame) -> pd.DataFrame

Change negative values to 0 and apportion negative adjustment values to all LSOAs within an LAD/year group.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing data to adjust.

required

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame with negative adjustment values apportioned across all LSOAs within an LAD/year group.

apportion_rollback_years(df: pd.DataFrame) -> pd.DataFrame

Continue to apportion the adjustments for years that are flagged as rollback years.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing all data including adjusted and rollback years.

required

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame with reapportioned values for rollback years.

calc_non_outlier_proportions(df: pd.DataFrame) -> pd.DataFrame

Calculate the proportion of a non-outlier LSOA to the LAD for each year.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing all GDHI data.

required

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame with LAD totals and proportions for non-outlier LSOAs calculated per year/LAD group.
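The proportion calculation can be sketched as follows, assuming a boolean `master_flag` marks outliers (all names hypothetical, not the library's code):

```python
import pandas as pd

df = pd.DataFrame({
    "lad_code": ["L1", "L1", "L1"],
    "year": [2020, 2020, 2020],
    "con_gdhi": [30.0, 60.0, 10.0],
    "master_flag": [False, False, True],  # last LSOA is an outlier
})
# Mask outliers, total the remaining LSOAs per LAD/year, then take proportions.
non_out = df["con_gdhi"].where(~df["master_flag"])
df["non_out_lad_total"] = non_out.groupby([df["lad_code"], df["year"]]).transform("sum")
df["non_out_proportion"] = non_out / df["non_out_lad_total"]
```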

check_no_negative_values_col(df: pd.DataFrame, col: str) -> None

Check that adjusted_con_gdhi has no negative values.

Parameters:

Name Type Description Default
df DataFrame

DataFrame with adjusted_con_gdhi column.

required

Raises:

Type Description
ValueError

If negative values are found.

gdhi_adj.adjustment.calc_adjustment

Module for calculations to adjust data in the gdhi_adj project.

extrapolate_imputed_val(df: pd.DataFrame, imputed_df: pd.DataFrame) -> pd.DataFrame

Calculate the imputed value for a given LSOA code where the year flagged as an outlier to adjust has a valid safe year on only one side.

The imputed value is extrapolated from the nearest safe year and the year 4 years after. This is to avoid short term fluctuations.

Parameters:

Name Type Description Default
df DataFrame

DataFrame with full data for lookup.

required
imputed_df DataFrame

DataFrame to calculate imputed value.

required

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame containing outlier imputed values.

interpolate_imputed_val(df: pd.DataFrame) -> pd.DataFrame

Calculate the imputed value for a given LSOA code where the year flagged as an outlier to adjust has a valid safe year on either side.

Parameters:

Name Type Description Default
df DataFrame

DataFrame with data to calculate imputed value.

required

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame containing outlier imputed values.
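Linear interpolation between the two safe years can be sketched as follows (illustrative only, not the library's exact formula):

```python
def interpolate_between_safe_years(prev_year, prev_val, next_year, next_val, target_year):
    # Weight the two safe-year values by the target year's position between them.
    weight = (target_year - prev_year) / (next_year - prev_year)
    return prev_val + weight * (next_val - prev_val)

imputed = interpolate_between_safe_years(2018, 100.0, 2020, 120.0, 2019)
```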

gdhi_adj.adjustment.filter_adjustment

Module for filtering adjustment data in the gdhi_adj project.

filter_adjust(df: pd.DataFrame) -> pd.DataFrame

Filter data to keep only LSOAs for adjustment and subset.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing LSOA data.

required

Returns:

Type Description
DataFrame

pd.DataFrame: Filtered DataFrame with only relevant columns and rows.

filter_component(df: pd.DataFrame, sas_code_filter: str, cord_code_filter: str, credit_debit_filter: str) -> pd.DataFrame

Filter DataFrame by component codes.

Parameters:

Name Type Description Default
df DataFrame

Constrained DataFrame with component code data.

required
sas_code_filter str

SAS code to filter by.

required
cord_code_filter str

CORD code to filter by.

required
credit_debit_filter str

Credit/Debit code to filter by.

required

Returns:

Type Description
DataFrame

pd.DataFrame: Filtered DataFrame containing only rows matching the specified component codes.

filter_year(df: pd.DataFrame, start_year: int, end_year: int) -> pd.DataFrame

Filter DataFrame by a range of years, inclusive.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing year data.

required
start_year int

Start year for filtering (inclusive).

required
end_year int

End year for filtering (inclusive).

required

Returns:

Type Description
DataFrame

pd.DataFrame: Filtered DataFrame containing only rows within the year range.

gdhi_adj.adjustment.flag_adjustment

Module for flagging data to adjust data in the gdhi_adj project.

identify_safe_years(df: pd.DataFrame, start_year: int = 1900, end_year: int = 2100) -> pd.DataFrame

Identify safe years for each LSOA where no adjustment is needed.

For a run of sequential years flagged for adjustment, the previous and next safe years are located at either end of the sequence.

For years flagged for adjustment at the edge of the data range, it will return one safe year inside the range and one outside it, which will have NaN for con_gdhi.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame.

required
start_year int

The starting year for the data range.

1900
end_year int

The ending year for the data range.

2100

Returns:

Name Type Description
df DataFrame

DataFrame with additional columns for safe years.

safe_years_df DataFrame

DataFrame containing only the rows that need adjustment, with non-outlier year values either side of the outlier years.

gdhi_adj.adjustment.join_adjustment

Module for joining adjustment data in the gdhi_adj project.

join_analyst_constrained_data(df_constrained: pd.DataFrame, df_analyst: pd.DataFrame) -> pd.DataFrame

Join analyst data to constrained data based on LSOA code and LAD code.

Parameters:

Name Type Description Default
df_constrained DataFrame

DataFrame containing constrained data.

required
df_analyst DataFrame

DataFrame containing analyst data.

required

Returns:

Type Description
DataFrame

pd.DataFrame: Joined DataFrame with relevant columns.

join_analyst_unconstrained_data(df_unconstrained: pd.DataFrame, df_analyst: pd.DataFrame) -> pd.DataFrame

Join analyst data to unconstrained data based on LSOA code and LAD code.

Parameters:

Name Type Description Default
df_unconstrained DataFrame

DataFrame with unconstrained data.

required
df_analyst DataFrame

DataFrame containing analyst data.

required

Returns:

Type Description
DataFrame

pd.DataFrame: Joined DataFrame with relevant columns.

gdhi_adj.adjustment.pivot_adjustment

Module for pivoting adjustment data in the gdhi_adj project.

pivot_adjustment_long(df: pd.DataFrame) -> pd.DataFrame

Un-pivot (melt) the adjustment DataFrame from wide to long format.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing data to be adjusted.

required

Returns:

Type Description
DataFrame

pd.DataFrame: Pivoted DataFrame in long format.

pivot_wide_final_dataframe(df: pd.DataFrame) -> pd.DataFrame

Pivots the DataFrame from long to wide format.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame in long format.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The pivoted DataFrame in wide format.

gdhi_adj.adjustment.reformat_adjustment

Module for reformatting adjustment data in the gdhi_adj project.

reformat_adjust_col(df: pd.DataFrame) -> pd.DataFrame

Reformat data within the adjust column.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame to be reformatted.

required

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame with reformatted columns.

reformat_year_col(df: pd.DataFrame, start_year: int, end_year: int) -> pd.DataFrame

Reformat data within the year column.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame to be reformatted.

required
start_year int

Start year for the year range.

required
end_year int

End year for the year range.

required

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame with reformatted columns.

gdhi_adj.adjustment.run_adjustment

Module for adjusting data in the gdhi_adj project.

run_adjustment(config: dict) -> None

Run the adjustment steps for the GDHI adjustment project.

This function performs the following steps:

1. Load the configuration settings.
2. Load the input data.
3. Reformat adjust and year columns.
4. Filter data for adjustment.
5. Join analyst output with constrained and unconstrained data.
6. Pivot the DataFrame to long format for manipulation.
7. Filter data by the specified year range.
8. Calculate the imputed gdhi values for outlier years.
9. Calculate adjustment values based on imputed gdhi.
10. Apportion adjustment values to all years.
11. Save interim data with all calculated values.
12. Pivot data to wide format for PowerBI QA reiteration.
13. Pivot final DataFrame to wide format for exporting.
14. Save the final adjusted data.

Parameters:

Name Type Description Default
config dict

Configuration dictionary containing user settings and pipeline settings.

required

Returns:

Name Type Description
None None

The function does not return any value. It saves the processed DataFrame to a CSV file.

gdhi_adj.adjustment.validation_adjustment

Module for adjustment data validation in the gdhi_adj project.

check_adjust_year_not_empty(df: pd.DataFrame) -> pd.DataFrame

Check that for LSOAs marked for adjustment, the year column is not empty.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame to be checked.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The original DataFrame, unchanged. This allows the function to be used in method chaining (e.g., .pipe()).

Raises:

Type Description
ValueError

If an LSOA marked for adjustment does not have a year specified to adjust.

check_lsoas_flagged(df: pd.DataFrame) -> pd.DataFrame

Check that not all LSOAs within an LAD are flagged for adjustment.

This is so that there are some non-outlier LSOAs to calculate non-outlier proportions of the total GDHI within an LAD.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame to be checked.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The original DataFrame, unchanged. This allows the function to be used in method chaining (e.g., .pipe()).

Raises:

Type Description
ValueError

If every 'lsoa_code' within an 'lad_code' is marked for adjustment.

check_years_flagged(df: pd.DataFrame) -> pd.DataFrame

Check that not all years within an LSOA are flagged for adjustment.

This is so that there are some non-outlier years to interpolate/extrapolate from.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame to be checked.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The original DataFrame, unchanged. This allows the function to be used in method chaining (e.g., .pipe()).

Raises:

Type Description
ValueError

If every year within an 'lsoa_code' is marked for adjustment.

CORD Preparation

gdhi_adj.cord_preparation.mapping_cord_prep

Module for mapping local authority units (LAUs) to LADs.

aggregate_lad(df)

Aggregate values on LADs and other identifiers.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing adjusted data with LAD codes joined.

required

Returns:

Type Description

pd.DataFrame: DataFrame containing value columns now aggregated by sum on identifier columns.

clean_validate_mapper(mapper_df)

Subset the mapper and get a unique DataFrame of values.

Parameters:

Name Type Description Default
mapper_df DataFrame

DataFrame containing lookup data used to join LAUs to LADs.

required

Returns:

Type Description

pd.DataFrame: DataFrame with unique values for LADs and LAUs.

join_mapper(df, mapper_df)

Join mapper containing a lookup of LAU and LAD values, to adjusted data.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing adjusted data.

required
mapper_df DataFrame

DataFrame containing LAU to LAD lookup.

required

Returns:

Type Description

pd.DataFrame: DataFrame with LADs joined on LAU codes.

map_S30_to_S12(config: dict, df: pd.DataFrame) -> pd.DataFrame

Run the mapping steps for the GDHI adjustment pipeline.

This function performs the following steps:

1. Rename column containing S30 values to LAU and verify mapping is required.

If mapping is required:

2. Load in and clean mapper containing LAU to LAD lookup.
3. Join LAU-LAD mapper to adjusted data.
4. Aggregate to LAD if specified in config.
5. Reformat output.

reformat(df, original_columns)

Rename LAD columns for end format.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing adjusted data and LAD codes.

required
original_columns list

List of columns from original DataFrame.

required

Returns:

Type Description

pd.DataFrame: Renamed DataFrame with desired columns.

rename_s30_to_lau(config, df)

Rename column containing S30 area codes to lau_

Parameters:

Name Type Description Default
config dict

Configuration dictionary containing user settings and pipeline settings.

required
df DataFrame

DataFrame containing adjusted data.

required

Returns:

Name Type Description

pd.DataFrame: DataFrame with the area code column renamed if S30 codes have been found; otherwise the original DataFrame.

need_mapping bool

Boolean indicating whether mapping is needed.

gdhi_adj.cord_preparation.transform_cord_prep

Module for imputing values ready for CORD in the gdhi_adj project.

append_all_sub_components(config: dict) -> pd.DataFrame

Append all DataFrames that contain separate sub-components together so that each LSOA has all sub-components present in one DataFrame.

Parameters:

Name Type Description Default
config dict

Pipeline configuration dictionary containing filepaths for the location of sub-component data.

required

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame with all sub-components appended for each LSOA.

impute_suppression_x(df: pd.DataFrame, target_cols: List[str], transaction_col: str = 'transaction', lsoa_col: str = 'lsoa_code', transaction_value: str = 'D623', lsoa_val: List[str] = ['95', 'S']) -> pd.DataFrame

Set cells in target_cols to "X" where both conditions are met:

  • The value in transaction_col equals transaction_value.
  • The value in lsoa_col starts with any of the values in the lsoa_val list.

Parameters:

Name Type Description Default
df DataFrame

input DataFrame

required
target_cols List[str]

list of column names to modify.

required
transaction_col str

name of the transaction column.

'transaction'
lsoa_col str

name of the LSOA column.

'lsoa_code'
transaction_value str

transaction value to match.

'D623'
lsoa_val List[str]

list of starting strings for LSOA codes to match (case-sensitive).

['95', 'S']

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame with suppressed values.
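The suppression rule described above can be reproduced with a boolean mask; a small sketch using the documented defaults (the DataFrame contents are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "lsoa_code": ["95AA0001", "E01000001", "S01000001"],
    "transaction": ["D623", "D623", "D61"],
    "2020": [5, 6, 7],
})
# Both conditions must hold: the transaction matches AND the LSOA code starts
# with one of the given prefixes.
mask = df["transaction"].eq("D623") & df["lsoa_code"].str.startswith(("95", "S"))
out = df.astype({"2020": "object"})  # allow the string "X" in a numeric column
out.loc[mask, "2020"] = "X"
```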

gdhi_adj.cord_preparation.validation_cord_prep

Module for validation checks prior to CORD in the gdhi_adj project.

check_lsoa_consistency(df: pd.DataFrame) -> pd.DataFrame

Performs an internal consistency check on the DataFrame to ensure 'lsoa_code' uniqueness matches the total row count.

This function verifies that the number of unique values in the 'lsoa_code' column is exactly equal to the number of rows in the DataFrame.

Parameters:

Name Type Description Default
df DataFrame

The input pandas DataFrame containing an 'lsoa_code' column.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The original DataFrame, unchanged. This allows the function to be used in method chaining (e.g., .pipe()).

Raises:

Type Description
ValueError

If the number of unique 'lsoa_code' values does not match the total number of rows in the DataFrame.

KeyError

If the 'lsoa_code' column is missing from the DataFrame.
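The `.pipe()`-friendly validation pattern used throughout this module can be sketched as follows (illustrative, not the library's code):

```python
import pandas as pd

def check_lsoa_unique(df):
    # Raise if lsoa_code is not unique; return df unchanged so the check
    # can sit inside a .pipe() chain.
    if "lsoa_code" not in df.columns:
        raise KeyError("'lsoa_code' column is missing")
    if df["lsoa_code"].nunique() != len(df):
        raise ValueError("duplicate lsoa_code values found")
    return df

ok = pd.DataFrame({"lsoa_code": ["A", "B"]}).pipe(check_lsoa_unique)
try:
    pd.DataFrame({"lsoa_code": ["A", "A"]}).pipe(check_lsoa_unique)
    raised = False
except ValueError:
    raised = True
```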

check_lsoa_count(df: pd.DataFrame, df_unconstrained: pd.DataFrame) -> pd.DataFrame

Perform a validation check to ensure that the number of unique lsoa_codes in the constrained DataFrame matches that in the unconstrained DataFrame.

Parameters:

Name Type Description Default
df DataFrame

The input pandas DataFrame containing an 'lsoa_code' column.

required
df_unconstrained DataFrame

The unconstrained DataFrame to compare against.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The original DataFrame, unchanged. This allows the function to be used in method chaining (e.g., .pipe()).

Raises:

Type Description
ValueError

If the number of unique 'lsoa_code' values in the constrained DataFrame does not match the number of unique 'lsoa_code' values in the unconstrained DataFrame.

KeyError

If the 'lsoa_code' column is missing from the DataFrame.

check_no_negative_values_df(df: pd.DataFrame) -> pd.DataFrame

Checks all numeric columns in the DataFrame to ensure no values are less than 0.

This function isolates numeric columns (integers and floats) and verifies that all values are non-negative. It ignores non-numeric columns (e.g., strings).

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be validated.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The original DataFrame, unchanged, for method chaining.

Raises:

Type Description
ValueError

If any negative values are found in numeric columns.

check_no_nulls(df: pd.DataFrame) -> pd.DataFrame

Checks the entire DataFrame to ensure it contains no Null, NaN, or None values.

This function scans all cells in the DataFrame. It detects standard numpy NaNs, Python None objects, and pandas pd.NA values.

If any such value is found, it raises a ValueError.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be validated.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The original DataFrame, unchanged, for method chaining.

Raises:

Type Description
ValueError

If any null/NaN/None values are found in the DataFrame.

check_subcomponent_lookup(df: pd.DataFrame, lookup_df: pd.DataFrame) -> pd.DataFrame

This function verifies that each unique value combination in the 'transaction' and 'account_entry' columns from the subcomponent lookup is present in the DataFrame.

Parameters:

Name Type Description Default
df DataFrame

The input pandas DataFrame containing subcomponent data.

required
lookup_df DataFrame

The lookup DataFrame containing all combinations of subcomponents that should be present.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The original DataFrame, unchanged. This allows the function to be used in method chaining (e.g., .pipe()).

Raises:

Type Description
ValueError

If any combination of 'transaction' and 'account_entry' values from the lookup is missing from the DataFrame.

check_year_column_completeness(df: pd.DataFrame) -> pd.DataFrame

Verifies that the DataFrame contains a complete set of consecutive year columns.

This function automatically identifies numeric column names (integers or strings representing integers), determines the minimum and maximum years, and checks if every year between that minimum and maximum exists as a column.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The original DataFrame, unchanged, for method chaining.

Raises:

Type Description
ValueError
  • If no numeric/year columns are found.
  • If there are gaps in the sequence of years detected.
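The year-gap detection can be sketched as follows (illustrative, not the library's code):

```python
def missing_years(columns):
    # Keep integer-like column names, then report any gaps between the
    # minimum and maximum year found.
    years = sorted(int(c) for c in columns if str(c).isdigit())
    if not years:
        raise ValueError("no numeric/year columns found")
    present = set(years)
    return [y for y in range(years[0], years[-1] + 1) if y not in present]

gaps = missing_years(["lsoa_code", "2019", "2020", "2022"])
```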

gdhi_adj.cord_preparation.run_cord_prep

Module for preparing CORD data in the gdhi_adj project.

run_cord_preparation(config: dict) -> None

Run the CORD preparation steps for the GDHI adjustment project.

This function performs the following steps:

1. Load the configuration settings.
2. Load the input data and append all subcomponents together.
3. Map LAU S30 codes to LAD S12 codes.
4. Perform validation checks on the input data.
5. Apply CORD-specific transformations.
6. Save the prepared CORD data for further processing.

Parameters:

Name Type Description Default
config dict

Configuration dictionary containing user settings and pipeline settings.

required

Returns:

Name Type Description
None None

The function does not return any value. It saves the processed DataFrame to a CSV file.

Utilities

gdhi_adj.utils.helpers

Define helper functions that wrap regularly-used functions.

convert_column_types(df: pd.DataFrame, schema: dict, logger: logging.Logger) -> pd.DataFrame

Convert DataFrame columns data types as specified in the schema.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame to convert column types.

required
schema dict

The schema containing column names and their expected types.

required
logger Logger

Logger for logging conversion actions.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The DataFrame with converted column types.

Warns:

Type Description
Warning

If a column's type conversion fails, a warning is logged.

load_schema_from_toml(schema_path: str) -> dict

Load a schema from a TOML file.

Parameters:

Name Type Description Default
schema_path str

Path to the TOML schema file.

required

Returns:

Name Type Description
dict dict

A dictionary representation of the schema.

load_toml_config(path: Union[str, pathlib.Path]) -> dict | None

Load a .toml file from a path, with logging and safe error handling.

Parameters:

Name Type Description Default
path Union[str, Path]

The path to load the .toml file from.

required

Returns:

Type Description
dict | None

dict | None: The loaded toml file as a dictionary, or None on error.

read_with_schema(input_file_path: str, input_schema_path: str) -> pd.DataFrame

Reads in a csv file and compares it to a data dictionary schema.

Parameters:

Name Type Description Default
input_file_path string

Filepath to the csv file to be read in.

required
input_schema_path string

Filepath to the schema file in TOML format.

required

Returns:

Name Type Description
df DataFrame

Formatted DataFrame containing data from the csv file.

rename_columns(df: pd.DataFrame, schema: dict, logger: logging.Logger) -> pd.DataFrame

Rename columns in the DataFrame based on the schema. Schema should be a dict where keys are new column names and values are dicts with 'old_name'.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame to rename columns in.

required
schema dict

The schema containing old and new column names.

required
logger Logger

Logger for logging renaming actions.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The DataFrame with renamed columns.

validate_schema(df: pd.DataFrame, schema: dict)

Validate the DataFrame against the schema.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame to validate.

required
schema dict

The schema sourced from a TOML file to validate against.

required

Raises:

Type Description
ValueError

If a required column from the schema is missing in the DataFrame.

TypeError

If a column's type does not match the expected type in the schema.

write_with_schema(df: pd.DataFrame, output_schema_path: str, output_dir: str, new_filename=None)

Writes a DataFrame to a CSV file, renaming columns and validating against a schema.

Parameters:

Name Type Description Default
df DataFrame

The final output DataFrame to write to CSV.

required
output_schema_path str

Path to the output schema file in TOML

required
output_dir str

Directory where the CSV file will be saved.

required
new_filename str

New filename for the output CSV. If None, uses the original name.

None

Raises:

Type Description
ValueError

If the DataFrame does not match the schema.

Returns:

Name Type Description
None

Writes the DataFrame to a CSV file after validating against the schema.

gdhi_adj.utils.logger

CustomFormatter

Bases: Formatter

Define logging formatter with colors for different log levels.

format(record)

Set color formatting for logger.

GDHI_adj_logger(name)

Custom logging class for use throughout the GDHI_adj pipeline.

Parameters

name : str
    The name of the file the logger is being created from.

Initialise the logger class.

gdhi_adj.utils.transform_helpers

Define helper functions that wrap regularly-used functions.

ensure_list(x: any) -> list

Ensure the input is returned as a list.

Parameters:

Name Type Description Default
x any

Input value to be converted to a list.

required

Returns:

Type Description
list

The input value wrapped in a list if it was not already a list.

increment_until_not_in(year: int, adjust_years: list, limit_year: int, is_increasing: bool = True)

Increase or decrease a year until it is not in a list of adjust_years.

Parameters:

Name Type Description Default
year int

The starting year.

required
adjust_years list

List of years to avoid.

required
limit_year int

The limit year to stop at.

required
is_increasing bool

If True, increase the year; if False, decrease it.

True

Returns:

Type Description
int

The first year not in the adjust_years list.
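A minimal implementation consistent with this docstring (the library's version may differ, e.g. in how the limit year is enforced):

```python
def increment_until_not_in(year, adjust_years, limit_year, is_increasing=True):
    # Step the year by +1 or -1 until it is no longer in adjust_years,
    # stopping early if the limit year is reached.
    step = 1 if is_increasing else -1
    while year in adjust_years and year != limit_year:
        year += step
    return year

safe = increment_until_not_in(2015, [2015, 2016], 2021)
```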

sum_match_check(df: pd.DataFrame, grouping_cols: list, unadjusted_col: str, adjusted_col: str, sum_tolerance: float = 1e-06)

Check that the sums of the adjusted column match those of the unadjusted column for the same groupings.

If the difference exceeds a specified tolerance, raise an error.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing data for sums.

required
grouping_cols list

List of columns to group by for the sums.

required
unadjusted_col str

Unadjusted column.

required
adjusted_col str

Adjusted column.

required
sum_tolerance float

Tolerance for the sums to match, based on floating point error.

1e-06

Raises:

Type Description
ValueError

If adjusted and unadjusted sums do not match.
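The sum check can be sketched as follows (hypothetical names; not the library's code):

```python
import pandas as pd

def sums_match(df, grouping_cols, unadjusted_col, adjusted_col, tol=1e-06):
    # Compare group sums of the two columns and fail loudly if they diverge
    # beyond the tolerance; return df unchanged for chaining.
    sums = df.groupby(grouping_cols)[[unadjusted_col, adjusted_col]].sum()
    diff = (sums[unadjusted_col] - sums[adjusted_col]).abs()
    if (diff > tol).any():
        raise ValueError("adjusted and unadjusted sums do not match")
    return df

df = pd.DataFrame({
    "lad_code": ["L1", "L1"],
    "gdhi": [50.0, 50.0],
    "adjusted_gdhi": [60.0, 40.0],  # redistributed, but the total is preserved
})
checked = sums_match(df, ["lad_code"], "gdhi", "adjusted_gdhi")
```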

to_int_list(cell: Any) -> List[int]

Convert a cell to a list of ints.

Accepts:

  • a comma-separated string like "2010,2011, 2012"
  • a list/tuple of strings or numbers
  • NaN/None, which returns []

Raises ValueError if an item cannot be converted to int.
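A minimal implementation consistent with this docstring (illustrative only):

```python
import math

def to_int_list(cell):
    # Normalise a cell to a list of ints; NaN/None become an empty list.
    if cell is None or (isinstance(cell, float) and math.isnan(cell)):
        return []
    if isinstance(cell, str):
        items = [p.strip() for p in cell.split(",") if p.strip()]
    elif isinstance(cell, (list, tuple)):
        items = cell
    else:
        items = [cell]
    return [int(x) for x in items]  # int() raises ValueError on bad items

years = to_int_list("2010,2011, 2012")
```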