rdsa-utils
This site contains the project documentation for rdsa-utils, a suite of PySpark, Pandas,
and general pipeline utils for Reproducible Data Science and Analysis (RDSA) projects.
Table Of Contents
Use the pages of this site to quickly find what you need for your use case.
Prerequisites
The following prerequisites are required for rdsa-utils:
- Python 3.8 or higher
Dependency Update: PySpark
To optimise the installation process and accommodate users with pre-installed environments,
pyspark is now classified as a development dependency. This adjustment avoids
potential conflicts in environments where pyspark is already available,
such as Cloudera Data Platform.
For Users:
- If your environment does not have pyspark pre-installed: You will need to manually install pyspark to utilise features dependent on it. This can be done by running pip install pyspark==<version> when setting up your environment, replacing <version> with the specific version required for your project.
- If pyspark is pre-installed in your environment: No additional action is required. This change ensures seamless integration without overwriting or conflicting with the existing pyspark installation.
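To decide which of the two cases above applies, you can check whether pyspark is importable before installing anything. The following is a minimal sketch using only the Python standard library; the helper name describe_package is hypothetical and is not part of rdsa-utils.

```python
import importlib.util
from importlib import metadata


def describe_package(package: str) -> str:
    """Report whether `package` is importable and, if known, its version.

    Hypothetical helper for illustration; not part of rdsa-utils.
    """
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(package) is None:
        return f"{package} not found: install it manually"

    # A module can be importable without pip metadata, for example when a
    # platform puts it on the path directly, so the version may be unknown.
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = "unknown version"
    return f"{package} found ({version}): no action needed"


if __name__ == "__main__":
    print(describe_package("pyspark"))
```

If the script reports that pyspark was not found, install it with pip install pyspark==<version> as described above. The check relies on import discovery rather than pip metadata alone because a platform-provided pyspark may not be registered with pip.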
This change simplifies installing rdsa-utils and lets it work alongside
whichever pyspark installation your environment already provides.
📬 Contact
For questions, support, or feedback about rdsa-utils, please email RDSA.Support@ons.gov.uk.