Zhamak Dehghani coined the term data mesh in 2019. Since then, the internet has been flooded with information about its advantages and all the problems it would solve. Many articles compare data lakes with data mesh.
This article does not explain in great detail what a data mesh or a data lake is. Instead, it looks at how you can adapt to these systems. We’ll also dive into why organizations must put some abstraction on top of these data management systems to make the right decisions based on their organization-specific scenarios.
If you are not already familiar with these terms, you can refer to the articles below before joining me back here.
1. Data Mesh Principles and Logical Architecture
(https://martinfowler.com/articles/data-mesh-principles.html)
2. Data Lake Principles and Logical Architecture
(https://www.guru99.com/data-lake-architecture.html)
3. Data Mesh is not a Data Lake
(https://www.linkedin.com/pulse/data-mesh-lake-jeffrey-t-pollock/)
If you have skipped the above three articles, then here is a high-level summary.
What is a data lake?
Data lakes are systems used for organizing and managing data centrally. This data can be in any form: structured, semi-structured, or unstructured. Data lakes typically follow a schema-on-read approach, meaning the structure is applied when the data is read rather than when it is stored. Historically, data lakes came into the picture with the rise of big data, which was triggered by the advent of new data-generating sources and more users getting acquainted with the internet.
What is a data mesh?
Data mesh is a system that focuses on decentralizing data and data ownership in an organization. This means anyone in an organization can use and manage data as long as they have the proper credentials for that access. Data mesh advocates that data can stay where it is, even if it lies in different databases. It focuses more on serving this data as a data product that can be made accessible to all the authorized stakeholders. Historically, data mesh came into the picture to overcome the challenges people faced with centralized data management systems.
Why organizations should have abstraction
Ever-changing design patterns
From a 30,000-foot view, data lakes and data mesh are design patterns. That is, they are ways to organize and implement data so that it is easier to access, manage, and maintain. They do not bring in any revolution in computation. These patterns focus primarily on the ownership of data, its maintenance, and its distribution up to the top of the organizational hierarchy.
Now, the moment we treat these as design patterns, it becomes clear that there is no rule of thumb, and we cannot say one will fit all forever. These design patterns are bound to evolve. As with every new implementation in a production environment, fresh sets of challenges lead to further brainstorming and give birth to new design patterns.
The challenge in catching up with these design patterns is the threat of obsolescence. By the time you analyze a pattern, adapt it for your organization, migrate your data, and overcome all the challenges, a new approach arrives that does the job more efficiently.
Another challenge is functionality. With a new design for your data management system, you don’t know in advance whether it will be more efficient for you than your existing design. A search around the internet shows that some early adopters of data mesh have already started posting articles advising enterprises on how to avoid common pitfalls of a data mesh setup.
This ultimately becomes our first reason to abstract this entire process. We cannot always spend a lot of development effort to come to a single conclusion. Instead, we can think of a system that lets us try out a new approach quickly and make informed decisions. The abstraction should be such that you can easily change your source and destination without any engineering changes to your core processing logic.
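To make this concrete, here is a minimal Python sketch of such an abstraction. Every name in it (Source, Sink, InMemorySource, CsvTextSource, InMemorySink, keep_active, run_pipeline) is hypothetical and exists only for illustration. The in-memory and CSV classes stand in for whatever a data lake or data mesh would actually expose in your organization.

```python
import csv
import io
from typing import Callable, Dict, Iterable, Iterator, List, Protocol

Record = Dict[str, object]


class Source(Protocol):
    """Anything that can hand records to the pipeline."""

    def read(self) -> Iterable[Record]: ...


class Sink(Protocol):
    """Anything that can accept records from the pipeline."""

    def write(self, records: Iterable[Record]) -> None: ...


class InMemorySource:
    """Hypothetical stand-in for one backend, e.g. a table in a data lake."""

    def __init__(self, records: List[Record]) -> None:
        self._records = records

    def read(self) -> Iterable[Record]:
        return iter(self._records)


class CsvTextSource:
    """Hypothetical stand-in for another backend that serves CSV text."""

    def __init__(self, text: str) -> None:
        self._text = text

    def read(self) -> Iterator[Record]:
        for row in csv.DictReader(io.StringIO(self._text)):
            yield {"id": int(row["id"]), "active": row["active"] == "true"}


class InMemorySink:
    """Collects records in a list so the result is easy to inspect."""

    def __init__(self) -> None:
        self.records: List[Record] = []

    def write(self, records: Iterable[Record]) -> None:
        self.records.extend(records)


def keep_active(records: Iterable[Record]) -> Iterator[Record]:
    """Core processing logic: it never learns where the data came from."""
    for record in records:
        if record.get("active"):
            yield record


def run_pipeline(
    source: Source,
    sink: Sink,
    transform: Callable[[Iterable[Record]], Iterable[Record]],
) -> None:
    sink.write(transform(source.read()))


# Same core logic, two different sources.
first_sink = InMemorySink()
run_pipeline(
    InMemorySource([{"id": 1, "active": True}, {"id": 2, "active": False}]),
    first_sink,
    keep_active,
)

second_sink = InMemorySink()
run_pipeline(CsvTextSource("id,active\n3,true\n4,false\n"), second_sink, keep_active)
```

In this sketch, moving from one backend to another means writing a new Source or Sink class. The keep_active function, which represents the core processing logic, stays untouched.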
Changes in ownership of data
When learning about data lake and data mesh, one of the crucial topics is data ownership. Data mesh brings the perspective that the maintenance and ownership of data should sit with the team that understands that data best. Data owners should be responsible for serving that data in a consumable manner to end-users in the organization.
This ownership distribution brings up the need for the data owners to capture and maintain feedback from the end-users. If the feedback contains an enhancement request, the owners need to act on it and expose the enhanced data to the end-users again.
If the individual owners start tracking this independently, they will duplicate a lot of effort brainstorming and building their own automation. The story does not end here. What if, tomorrow, a new design pattern comes up with a different view on data ownership?
In that case, we should abstract the access control part and develop data management systems that facilitate data access management, data discovery, data tagging, and feedback capture. This can help organizations better organize their data access hierarchy and create a sophisticated, efficient way to adapt to new design patterns.
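As a rough illustration, the four capabilities named above (access management, discovery, tagging, and feedback capture) could sit behind one shared catalog rather than being rebuilt by every data owner. The Python sketch below is only an assumption about what that might look like. DatasetEntry, DataCatalog, the role names, and the sample dataset are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DatasetEntry:
    """Metadata a data owner registers for one dataset (hypothetical shape)."""

    name: str
    owner: str
    tags: Set[str] = field(default_factory=set)
    allowed_roles: Set[str] = field(default_factory=set)
    feedback: List[str] = field(default_factory=list)


class DataCatalog:
    """One shared place for tagging, discovery, access checks, and feedback."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def discover(self, tag: str) -> List[str]:
        """Return the names of all datasets carrying the given tag."""
        return sorted(
            name for name, entry in self._entries.items() if tag in entry.tags
        )

    def can_access(self, name: str, role: str) -> bool:
        entry = self._entries.get(name)
        return entry is not None and role in entry.allowed_roles

    def add_feedback(self, name: str, role: str, comment: str) -> None:
        """Record end-user feedback, but only from roles that have access."""
        if not self.can_access(name, role):
            raise PermissionError(f"role {role!r} cannot access {name!r}")
        self._entries[name].feedback.append(comment)

    def feedback_for(self, name: str) -> List[str]:
        return list(self._entries[name].feedback)


catalog = DataCatalog()
catalog.register(
    DatasetEntry(
        name="loans",
        owner="lending-team",
        tags={"finance"},
        allowed_roles={"analyst"},
    )
)
catalog.add_feedback("loans", "analyst", "Please add a currency column.")
```

Because owners and end-users only talk to the catalog, a future design pattern with a different view on ownership would change what is registered, not how every team tracks access and feedback.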
Shifts in the thinking process
If we try to make sense of current design patterns, we see a constant effort to move data closer to the end-user so that the end-user can question the data as needed. Again, let’s take the example of the data-as-a-product approach described in data mesh. It shows that the current thinking leans toward abstracting away the technical nitty-gritty of data transformations and serving consumable data.
However, what if we have an abstracted data management system where domain experts can create formulas or expressions based on their expertise, and engineers can serve the data by applying those formulas to the underlying data structure? It makes more sense to maintain these expressions in a central portal so that anyone unfamiliar with them can study them and apply them to their own data. For example, suppose someone does not know how to calculate simple interest in the banking domain. That person can simply look up the already created expression and use it on their data.
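The simple interest example can be sketched in a few lines of Python. This is only an illustration of the idea, not a reference design. ExpressionRegistry, its methods, and the column names (principal, annual_rate_percent, years) are hypothetical. The formula itself is the standard one: interest = principal × rate × time / 100, with the rate given in percent per year.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]
Expression = Callable[[Row], float]


class ExpressionRegistry:
    """Hypothetical central portal where domain experts publish expressions."""

    def __init__(self) -> None:
        self._expressions: Dict[str, Expression] = {}
        self._descriptions: Dict[str, str] = {}

    def register(self, name: str, description: str, expression: Expression) -> None:
        self._expressions[name] = expression
        self._descriptions[name] = description

    def describe(self, name: str) -> str:
        """Let someone new to the domain read what an expression does."""
        return self._descriptions[name]

    def apply(self, name: str, rows: List[Row]) -> List[float]:
        """Apply a published expression to each row of the caller's own data."""
        expression = self._expressions[name]
        return [expression(row) for row in rows]


registry = ExpressionRegistry()

# Published once by a banking domain expert.
registry.register(
    "simple_interest",
    "principal * annual_rate_percent * years / 100",
    lambda row: row["principal"] * row["annual_rate_percent"] * row["years"] / 100,
)

# Reused later by someone who does not know the formula.
my_loans = [{"principal": 1000.0, "annual_rate_percent": 5.0, "years": 2.0}]
interest = registry.apply("simple_interest", my_loans)
print(interest)  # [100.0]
```

The domain expert owns the formula and its description, while the engineer or analyst only needs the expression's name and data with matching column names.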