The growing number of data-driven projects is having a major impact on companies across industrial and technological sectors. These projects are driven by the need to improve process automation, optimize resources, and extract valuable information that improves decision making.

The technical and business teams that develop these solutions focus on exploiting the information available in their “Data Lakes”: building dashboards that reflect relevant business KPIs, ingesting and transforming large volumes of data, or implementing ML models that enable inference for specific use cases.

This technological transformation brings challenges for the rapid development and scaling of data projects. A common difficulty is the bottleneck produced by centralized Data and ML teams, which must collaborate both with functional teams closer to the data domain and with the consumers of the information themselves.

When starting a new project, the Data team usually performs a feasibility study, exploring the new datasets provided and trying to determine the target business metrics to be covered. It is at this point that valuable time is invested in understanding the data, transforming it, and capturing requirements, which often delays projects in this initial phase and demands an extraordinary effort to acquire the business knowledge associated with the data.

Faced with this situation, in which those responsible for a data domain do not take into account how their data may be used, the need arises to move away from data projects as we traditionally conceive them and toward the development of Data Products.

A Data Product must be implemented, developed and maintained by a team responsible for a data domain. It therefore belongs to exactly one domain.

A Data Product can be a published dataset, a dashboard reflecting different KPIs, or an ML model, accessible from other data domains through an interface or API. It must provide not only the data but also the information necessary to understand it: structure, metadata, interfaces to consume it, and maintenance or life cycle.
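As a minimal sketch of what such an interface could bundle together (the class and field names below are our own illustration, not a standard), the “output port” of a Data Product might look like this:

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical sketch of a Data Product's output port: the access
# point plus the metadata a consumer needs to understand and trust it.
# All names and fields here are illustrative, not a standard.
@dataclass
class DataProductPort:
    name: str              # e.g. "customer-orders"
    domain: str            # the single owning data domain
    version: str           # version of the data contract
    schema: dict           # structure: column name -> type
    description: str       # business meaning of the data
    endpoint: str          # interface/API where the data is served
    refresh_cadence: str   # life-cycle information, e.g. "daily"
    owner_team: str        # team accountable for the product
    last_updated: date = field(default_factory=date.today)

# A consumer in another domain can discover everything needed to use it:
orders = DataProductPort(
    name="customer-orders",
    domain="sales",
    version="1.2.0",
    schema={"order_id": "int", "amount": "float", "order_date": "date"},
    description="Confirmed customer orders, one row per order",
    endpoint="https://data.example.com/sales/customer-orders",
    refresh_cadence="daily",
    owner_team="sales-data",
)
```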

The objective of a Data Product is to be a reusable asset defined to provide reliable data for a specific purpose aligned with business needs.

Zhamak Dehghani, in her book “Data Mesh: Delivering Data-Driven Value at Scale”, describes the main characteristics that define a Data Product; they can be summarized as follows.

For a Data Product to be useful, it requires at least the following qualities:

  • Designed to be upgradable: it must be versionable or extensible, so new functionality can be added in the future (see the versioning sketch after this list).
  • Designed to scale: to cope with the growing volume of available data, the number of data sources in a domain, and the diversity of users.
  • Designed to provide value: focused on providing consumers with reliable, high-quality data in an understandable way.
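To illustrate the “upgradable” quality, here is a hedged sketch, assuming the product is exposed as a REST API built with FastAPI (the routes, payloads, and helper functions are invented for the example), of how two contract versions could be served side by side while consumers migrate:

```python
from fastapi import FastAPI

app = FastAPI()

def load_orders_v1():
    # Placeholder for reading from the domain's storage.
    return [{"order_id": 1, "amount": 100.0}]

def load_orders_v2():
    # v2 extends each record with a new field.
    return [{"order_id": 1, "amount": 100.0, "currency": "EUR"}]

# v1 keeps serving existing consumers unchanged.
@app.get("/v1/customer-orders")
def orders_v1():
    return {"schema_version": "1.0", "orders": load_orders_v1()}

# v2 adds the new field; both versions run side by side until
# v1 consumers have migrated at their own pace.
@app.get("/v2/customer-orders")
def orders_v2():
    return {"schema_version": "2.0", "orders": load_orders_v2()}
```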

To better understand this concept, let’s look at some examples.

Is Gmail a Data Product? It is not, since its primary purpose is to enable asynchronous written communication between users. However, its spam detection, which is based on the application of natural language processing techniques, can itself be considered one.

Another example is Instagram, which likewise cannot be considered a Data Product itself; however, it is composed of them, such as notifications, search, or the browse option.

Finally, is Google Analytics a Data Product? Yes, it is a product whose purpose is to provide information about user behavior on websites.

In the same way, Google’s search engine and Netflix’s recommender system are highly scalable Data Products.

Developing new Data Products is not trivial for a company currently implementing traditional data projects. It requires transforming the operational strategy to build an environment with standardized templates and data pipelines that can accelerate the launch of new products.

It also requires teams that take ownership of the different data domains in which these products will be developed.

Many aspects must be taken into account when defining new Data Products, among others:

  • Defining the product’s metadata.
  • Establishing the requirements that new data incorporated into the domain must meet (see the ingestion sketch after this list).
  • Determining the different ways in which the data will be accessible.
  • Establishing data profiling, versioning, and the life cycle of the data.
  • Deciding the level of granularity at which applications, domains, or components will be separated.
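As an example of the second point, here is a rough sketch, assuming a tabular “customer-orders” product ingested with pandas (the columns and thresholds are invented for illustration), of how requirements on incoming data could be checked at ingestion time:

```python
import pandas as pd

# Illustrative ingestion checks for a hypothetical "customer-orders"
# product; column names and thresholds are invented for the example.
REQUIRED_COLUMNS = {"order_id", "customer_id", "amount", "order_date"}

def validate_incoming(df: pd.DataFrame) -> list:
    """Return a list of violations; an empty list means the batch passes."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    violations = []
    if df["order_id"].duplicated().any():
        violations.append("duplicate order_id values")
    if (df["amount"] < 0).any():
        violations.append("negative order amounts")
    # Simple profiling rule: reject batches with too many null customers.
    null_ratio = df["customer_id"].isna().mean()
    if null_ratio > 0.01:
        violations.append(f"customer_id null ratio {null_ratio:.2%} exceeds 1%")
    return violations
```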

Keepler has built its offering around a full-stack analytics service on public cloud infrastructure, applying best practices in data engineering, cloud, data governance, data science, and data visualization. This approach, together with an Agile methodology, allows for the efficient identification, definition, development, and deployment of new data products for its customers.

Our Data Products proposal involves the creation or evolution of Data Lakes focused on extracting value from the descriptive analysis of information. Additionally, it incorporates AI/ML capabilities that enable more sophisticated analysis and the generation of new, relevant information to improve decision making and reduce uncertainty.

Author

  • Javier Pacheco

Data Scientist at Keepler Data Tech: "'Live full, die empty' defines my state. It has become my lifestyle, taking me out of my comfort zone and driving my voracious appetite for learning about different aspects of Data Science. I love learning by teaching and am always open to new challenges that push my comprehension further."