NOTE
These pages are primarily for internal audiences rather than users of the Data Platform; we will host user-facing documentation separately.
NOTE
Data transformations are not currently captured as part of the Data Product metadata. We plan to capture this information, though the structure and content of these metadata may substantially change. This data may be captured as part of the lineage information for data processed using dbt-like tools.
Describing data transformations
When describing what has happened to your data before users get to interact with it, you should describe any data transformations you have used. Transforms should be documented in the transformations
section of your Data Product metadata.
See also our guidance on describing any data cleaning that may have been applied to your data.
Types of data transformation
These have been derived from the contributions made to the Data Management Wiki. <!-- Some of these contain US spellings - we also accept the UK equivalent (for example "normalisation" and "normalization" are both accepted). -->
Please use the identifier (ID) for the transformation types when populating your transformations.yml
section.
ID | Description |
---|---|
clustering | Clustering is a classic data mining technique based on machine learning that divides groups of abstract objects into classes of similar objects. Clustering helps to split data into several subsets. |
denormalisation | Denormalisation is a formal technique that increases data redundancy but can improve usability in some circumstances. Typically this refers to replacing reference codes or IDs with their actual value. The opposite of normalisation. |
enhancement | Enhancement is the process that expands existing data with data from other sources (enrichment). Here, additional data is added to close existing information gaps. |
harmonisation | Data harmonisation is the process of bringing together your data of varying file formats, naming conventions, and columns, and transforming it into one cohesive data set. |
normalisation | Normalisation is a formal technique that eliminates the data redundancy in a number of steps (= normal forms) by splitting the data according to fixed rules. The opposite of denormalisation. |
merging | The merging of two or more datasets because they are fundamentally similar, or make more logical sense to a user when combined. |
standardisation | Standardisation transforms data into a standard form - for example if the same information is represented slightly differently across different systems, you may wish to standardise those values so they match with other data products. Standardisation is also permissible as a cleaning method. |
statistical-methods | Statistics is the science and technique of collecting, processing, interpreting and presenting data based on rules of mathematics and the laws of logic. Statistical methods include regression, correlation, min, max, standard deviation, mean and clustering. |
selection | Selection is when you choose to only provide a subset of your data - please clearly describe what is and isn’t included, and your reasoning for this. |
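
As a rough illustration of how the IDs above might be used, the sketch below shows two hypothetical entries. Only the `type` key and its ID values are taken from this page; the `transformations` and `table` keys, the `description` key, and the table names are assumptions made for the example, and the real structure may differ because these metadata are not yet finalised.

```yaml
# Illustrative sketch only: the layout below is an assumption, not a confirmed schema.
transformations:
  - table: example_table_a          # hypothetical table name
    type: denormalisation           # ID taken from the table of transformation types
    description: "Reference codes in the status column were replaced with their text values."
  - table: example_table_b          # hypothetical table name
    type: selection                 # describe what is and is not included, and why
    description: "Only records from the most recent reporting year are included, to keep the product focused on current data."
```

Each entry uses exactly one ID from the table, with a free-text description carrying the detail a user would need to understand what was changed.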
Data with no transformations applied
If your data has not undergone any transformation, we suggest you state this explicitly by adding type: none
for each of your tables.
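
A minimal sketch of what this could look like, assuming a per-table list layout. The `transformations` and `table` keys and the table names are hypothetical; only `type: none` comes from this guidance.

```yaml
# Illustrative sketch only: the surrounding keys are assumptions, not a confirmed schema.
transformations:
  - table: example_table_c   # hypothetical table name
    type: none               # no transformation has been applied to this table
  - table: example_table_d   # hypothetical table name
    type: none
```

In YAML, an unquoted `none` is read as the plain string "none" rather than a null value, so it can sit alongside the other type IDs without special handling.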
Considerations
Carefully consider if transformations are enhancing the data product for general use, rather than for a single use case.
Suggesting changes
If you wish to suggest additions or improvements to the transformation types, please follow our guidance on submitting a pull request.
Further reading
Index of documentation for data product definition