ADR-004 Data should be pushed into Data Products

Status

✅ Accepted

Context

The Data Platform is built on the availability of Data Products. To be complete, a Data Product needs to contain the data that is to be made available to consumers. This ADR covers the mechanisms that the Data Platform supports out of the box to ingest data into a Data Product.

Analysis

Two different approaches were initially proposed and analysed: push and pull.

  • In a push approach, data is sent to the platform by a system. The Data Platform is passive and receives the data.
  • In a pull approach, the Data Product knows how to fetch the raw data from a data source (e.g. a database).
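
The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DataProduct` class, its `receive` and `pull_from` methods, and the `read_source` and `producer_job` functions are hypothetical names invented for this example and are not part of the Data Platform.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable

Record = dict[str, object]


@dataclass
class DataProduct:
    """Hypothetical stand-in for the raw data held by a Data Product."""

    name: str
    records: list[Record] = field(default_factory=list)

    def receive(self, batch: Iterable[Record]) -> int:
        """Push entry point: the platform is passive and stores what it is sent."""
        rows = list(batch)
        self.records.extend(rows)
        return len(rows)

    def pull_from(self, fetch: Callable[[], Iterable[Record]]) -> int:
        """Pull entry point: the Data Product itself calls out to the data source."""
        return self.receive(fetch())


def read_source() -> list[Record]:
    """Hypothetical data source, standing in for e.g. a database query."""
    return [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]


def producer_job(product: DataProduct) -> int:
    """Push: a separately hosted system reads the source and sends the data in."""
    return product.receive(read_source())
```

In the push case, `producer_job` runs outside the platform and only the `receive` call crosses the boundary. In the pull case, `pull_from` runs inside the Data Product, so the platform needs network access and credentials for the source.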

Both approaches have advantages and disadvantages:

Push advantages:

  • Reduced complexity around networking. Data sources are likely to be in networks that may not be accessible from the Data Platform, or that would require too many hops to make them reachable from the Platform.
  • Standardised approach for all data producers. There will be no need for customisation to access each data source network, making the Data Platform more self-service.

Push disadvantages:

  • Data Product developers will need to manage another system that fetches data from data sources and then pushes it to the Data Platform.
  • Data Product developers need to make sure that the system that extracts and pushes the data into a Data Product is kept in sync with the Data Product transformation code, so that modifications to the structure of the data are handled correctly. As these systems are deployed and managed independently, there is a higher risk of mismatch.

Pull advantages:

  • Fewer systems to manage. The Data Platform could offer a complete running environment for the Data Product to fetch the required data from a data source, eliminating the need for an extra hosting environment.
  • Guarantee that the entire data transformation process is captured and hosted by the platform, avoiding ad hoc transformations that may happen in temporary systems created to push data into a Data Product only once.

Pull disadvantages:

  • Risk of developers accidentally publishing database credentials in the Data Platform.
  • Security: a compromised Data Platform role could potentially have access to several data sources.

Decision

The current choice is to support data ingestion through a push approach.

Consequences

Although the platform provides the infrastructure for data to be sent into a Data Product, a pull model is not forbidden. Data Product developers may choose to fetch their data from within the Data Product. That process, however, is neither natively supported by the Data Platform nor currently encouraged.

We recognise that many target user groups of the Data Platform (e.g. HMCTS) have an existing pull approach for bringing their datasets into MoJ systems, which may be difficult to adapt to a push model. Existing processes that pull data from these systems will need to be adapted to onboard data into a Data Product. While the Data Platform teams are not expected to build these adaptations, some guidance may be provided.

This page was last reviewed on 4 January 2024. It needs to be reviewed again on 4 April 2024 by the page owner #data-platform-notifications.