
NOTE

These pages are primarily for internal audiences rather than users of the Data Platform; user-facing documentation will be hosted separately.

NOTE

Data transformations are not currently captured as part of the Data Product metadata. We plan to capture this information, though the structure and content of these metadata may substantially change. This data may be captured as part of the lineage information for data processed using dbt-like tools.

Describing data transformations

When describing what has happened to your data before users interact with it, include any data transformations you have applied. Document them in the transformations section of your Data Product metadata.

See also our guidance on describing any data cleaning that may have been applied to your data.

Types of data transformation

These have been derived from the contributions made to the Data Management Wiki.

Please use the identifier (ID) for the transformation types when populating your transformations.yml section.

| ID | Description |
| --- | --- |
| clustering | Clustering is a classic data mining technique based on machine learning that divides groups of abstract objects into classes of similar objects. Clustering helps to split data into several subsets. |
| denormalisation | Denormalisation is a formal technique that increases data redundancy but can improve usability in some circumstances. Typically this refers to replacing reference codes or IDs with their actual value. The opposite of normalisation. |
| enhancement | Enhancement is the process that expands existing data with data from other sources (enrichment). Here, additional data is added to close existing information gaps. |
| harmonisation | Data harmonisation is the process of bringing together your data of varying file formats, naming conventions, and columns, and transforming it into one cohesive data set. |
| normalisation | Normalisation is a formal technique that eliminates data redundancy in a number of steps (the normal forms) by splitting the data according to fixed rules. The opposite of denormalisation. |
| merging | The merging of two or more datasets because they are fundamentally similar, or make more logical sense to a user when combined. |
| standardisation | Standardisation transforms data into a standard form. For example, if the same information is represented slightly differently across different systems, you may wish to standardise those values so they match with other data products. Standardisation is also permissible as a cleaning method. |
| statistical-methods | Statistics is the science and technique of collecting, processing, interpreting and presenting data based on rules of mathematics and the laws of logic. Statistical methods include regression, correlation, min, max, standard deviation, mean and clustering. |
| selection | Selection is when you choose to only provide a subset of your data. Please clearly describe what is and isn't included, and your reasoning for this. |
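
As a rough illustration of using these IDs consistently, the sketch below checks a list of transformation entries against the IDs in the table above. The entry layout (a list of dicts with `table` and `type` keys) and the function name `invalid_transformation_types` are assumptions made for this example only; the metadata structure is not yet defined and may change.

```python
# The nine transformation type IDs listed in the table on this page.
TRANSFORMATION_TYPES = frozenset({
    "clustering",
    "denormalisation",
    "enhancement",
    "harmonisation",
    "normalisation",
    "merging",
    "standardisation",
    "statistical-methods",
    "selection",
})

# The value this page suggests for data with no transformations applied.
NO_TRANSFORMATION = "none"


def invalid_transformation_types(entries):
    """Return the `type` values in `entries` that are not recognised IDs.

    `entries` is a hypothetical list of dicts, each expected to carry a
    `type` key, as might be loaded from a transformations.yml file.
    A missing `type` is reported as None.
    """
    allowed = TRANSFORMATION_TYPES | {NO_TRANSFORMATION}
    return [e.get("type") for e in entries if e.get("type") not in allowed]


# Hypothetical entries, for illustration only.
entries = [
    {"table": "cases", "type": "denormalisation"},
    {"table": "offices", "type": "Normalization"},  # wrong case and spelling
    {"table": "staff"},                             # no type given
]
print(invalid_transformation_types(entries))  # ['Normalization', None]
```

The check is an exact, case-sensitive match, so anything other than the IDs as written in the table is flagged.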

Data with no transformations applied

If your data has not undergone any transformation, we suggest you state this explicitly by adding type: none for each of your tables.
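
One way to make sure every table carries an explicit entry is sketched below. It assumes, purely for illustration, that declared transformations are held as a mapping from table name to a list of entries; the function name `with_explicit_none` and the `description` key are hypothetical and not part of any agreed metadata schema.

```python
def with_explicit_none(table_names, declared):
    """Return a transformation entry list for every table.

    `declared` maps a table name to its list of transformation entries.
    Tables that are missing, or that have an empty list, receive a single
    `{"type": "none"}` entry so the absence of transformations is explicit.
    """
    result = {}
    for name in table_names:
        entries = declared.get(name)
        result[name] = entries if entries else [{"type": "none"}]
    return result


# Hypothetical data product with two tables, only one of them transformed.
declared = {
    "cases": [{"type": "selection", "description": "Closed cases only"}],
}
print(with_explicit_none(["cases", "offices"], declared))
# {'cases': [{'type': 'selection', 'description': 'Closed cases only'}],
#  'offices': [{'type': 'none'}]}
```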

Considerations

Consider carefully whether your transformations enhance the data product for general use, rather than for a single use case.

Suggesting changes

If you wish to suggest additions or improvements to the transformation types, please follow our guidance on submitting a pull request.

Further reading

Index of documentation for data product definition

Example data product

This page was last reviewed on 19 October 2023. It needs to be reviewed again on 19 April 2024 by the page owner #data-platform-notifications.