
Data Lakehouse: main features, advantages and disadvantages

After hearing about data lakes and data warehouses for quite some time, a third concept related to information storage is being heard more and more: the data lakehouse. It is a new approach to data storage and management, born as a hybrid architecture that combines the most notable characteristics of the data lake and the data warehouse. By its nature, fully understanding what it is requires first being clear about the concepts of data lake and data warehouse.

A data lake is a centralized repository responsible for storing large amounts of data in its original, raw format, much of it unstructured. A data warehouse, on the other hand, is a store of structured and semi-structured data, drawn from various sources, that is usually used for analysis and reporting.

What is a data lakehouse

A data lakehouse combines both types of storage, covering the gap between the two approaches. To do so, it pairs the flexibility, scale and low cost of the data lake with the performance and ACID transactions (atomicity, consistency, isolation and durability) characteristic of a data warehouse. By bringing together the best of both concepts, it allows analytics and intelligence on all types of data in a single platform.

The characteristics of a data lakehouse allow companies and organizations that use this architecture to store large amounts of data without enforcing a strict schema or supporting only certain formats. To this it adds several functions that a data lake lacks: governance, organization, and performance capabilities for analytics and reporting.

Although this brings data lakehouses closer to data warehouses, there are still differences between them. A data warehouse uses ETL (extract, transform, load) or ELT (extract, load, transform) processes to load structured data into a relational database infrastructure, and supports enterprise data analytics and business intelligence applications. It has its limits, though: it is not particularly efficient at managing unstructured and semi-structured data, and its costs can grow quickly as the amount of stored data and the number of sources increase.
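To make the ETL/ELT distinction concrete, here is a minimal sketch (all names hypothetical) of the three steps as plain Python functions, with an in-memory list standing in for the warehouse table:

```python
# Minimal ETL sketch: extract rows from a CSV-like source, transform
# them into typed records, then load them into a "warehouse" table
# (here just an in-memory list).
import csv
import io

raw = io.StringIO("order_id,amount\n1,19.99\n2,5.50\n")

def extract(source):
    """Extract: read raw records from the source."""
    return list(csv.DictReader(source))

def transform(rows):
    """Transform: cast types and normalize fields before loading."""
    return [
        {"order_id": int(r["order_id"]), "amount": float(r["amount"])}
        for r in rows
    ]

def load(rows, table):
    """Load: append structured rows to the target table."""
    table.extend(rows)

warehouse_table = []
load(transform(extract(raw)), warehouse_table)
print(warehouse_table)
# ELT would simply reorder the pipeline: load the raw rows first,
# then run the transformation inside the warehouse itself.
```

The only difference between the two patterns is where the transform step runs: before loading (ETL) or inside the target system after loading (ELT).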

A data lakehouse addresses these drawbacks of data warehouses as well, integrating the flexibility and cost-effectiveness of data lakes with the governance, organization and performance of data warehouses. By solving the problems of both architectures, it becomes attractive for professionals and specialists working in very different areas.

Thus, data scientists can use it for machine learning, business intelligence, SQL analytics and data science. Business analysts, for their part, can take advantage of its features to explore and analyze various data sources and work across different business areas.

Other professionals who can take advantage of it include product managers, marketing experts, and managers in general. All of them can use it to monitor different trends and indicators.

Architecture of a data lakehouse

The architecture of a data lakehouse, according to eWeek, is made up of five layers. The first is the ingestion layer. Its mission is to collect data from various sources and deliver it either to the storage layer or to a data processing system, using different protocols to connect to internal and external sources.

These sources include database management systems, software-as-a-service applications, NoSQL databases, social networks, CRM applications, Internet of Things sensors and file systems. The ingestion layer can extract data in a single batch or in several passes, depending on the size of the data set and the source where it resides.
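The ingestion layer's job can be sketched as a set of connectors that pull records from different source types and hand them on unchanged. All names here (the connector functions, the `landing_zone` list) are hypothetical illustrations, not a real lakehouse API:

```python
# Hypothetical ingestion-layer sketch: each connector pulls records from
# one source type; ingest() batches them toward the storage layer as-is.
def from_iot_sensor():
    """Stand-in for an IoT sensor feed."""
    yield {"sensor": "temp-01", "value": 21.5}

def from_crm():
    """Stand-in for a CRM application export."""
    yield {"customer": "ACME", "status": "active"}

connectors = {"iot": from_iot_sensor, "crm": from_crm}

def ingest(storage):
    # Pull from every registered connector in one batch; a real ingestion
    # layer could also stream continuously, depending on source and volume.
    for name, connector in connectors.items():
        for record in connector():
            storage.append({"source": name, "payload": record})

landing_zone = []
ingest(landing_zone)
print(len(landing_zone))
```

Note that the records keep their original shape; no schema is imposed at ingestion time, which is what lets the layer serve such heterogeneous sources.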

The second layer of a data lakehouse architecture is the storage layer, which accepts all types of data as objects in object storage systems such as Amazon S3. This layer allows structured, unstructured and semi-structured data to be stored in open file formats, such as Apache Avro, Parquet or ORC (Optimized Row Columnar). In addition to the cloud, it can also be deployed on-premises, using a distributed file system such as Hadoop's HDFS.
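The key property of this layer is that everything, structured or not, is just an object under a path-like key. The sketch below uses an in-memory dictionary as a stand-in for an object store like S3 or HDFS, and JSON instead of Parquet/ORC/Avro purely to keep it dependency-free:

```python
# Sketch of the storage layer as a key/value object store (an in-memory
# stand-in for something like Amazon S3 or HDFS): every object, whatever
# its internal structure, is stored as raw bytes under a path-like key.
import json

object_store = {}

def put_object(key: str, data: bytes):
    object_store[key] = data

def get_object(key: str) -> bytes:
    return object_store[key]

# Structured data, serialized to an open format (JSON here for
# simplicity; a lakehouse would typically use Parquet, ORC, or Avro).
put_object("sales/2024/orders.json", json.dumps([{"id": 1}]).encode())

# Unstructured data goes in exactly as-is.
put_object("logs/app.log", b"2024-01-01 INFO started\n")

print(json.loads(get_object("sales/2024/orders.json")))
```

Because the store is indifferent to content, the intelligence about what each object contains has to live elsewhere, which is precisely the role of the metadata layer described next.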

The third layer of the architecture, the metadata layer, is the defining element of a data lakehouse and therefore one of its most important. It contains information about the other data: a unified catalog that includes the metadata of the objects in the data lake.

In addition, it offers users of a data lakehouse various functions: ACID transactions, which ensure the atomicity, consistency, isolation and durability of changes to the data; file caching options, which optimize access by keeping frequently used files available in memory; indexing, which speeds up queries and allows rapid data retrieval; and data versioning, which allows specific versions of the data to be saved. This layer also makes it possible to apply predefined schemas to improve data governance.
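Two of these functions, the unified catalog and data versioning, can be sketched together. The structure below is a hypothetical simplification (real metadata layers such as those behind lakehouse table formats are far richer), but it shows the idea of tracking a schema plus a version history per object:

```python
# Hypothetical metadata-catalog sketch: for each object in the lake it
# tracks the current schema and a version history, so earlier snapshots
# of the data remain addressable (a simplified take on data versioning).
catalog = {}

def commit(path, schema, snapshot_id):
    """Register a new version of an object; earlier versions are kept."""
    entry = catalog.setdefault(path, {"schema": schema, "versions": []})
    entry["schema"] = schema  # schema may evolve between commits
    entry["versions"].append(snapshot_id)

def latest_version(path):
    return catalog[path]["versions"][-1]

commit("sales/orders", {"id": "int", "amount": "float"}, "v1")
commit("sales/orders", {"id": "int", "amount": "float"}, "v2")
print(latest_version("sales/orders"))  # v2
```

A query engine consulting this catalog can resolve either the latest snapshot or any past one, which is the basis for reproducible analytics over changing data.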

The last two layers are the API layer and the consumption layer. The API layer is also one of the most relevant in a data lakehouse, as it allows engineers, data scientists and analysts to access the data stored in the structure and manipulate it for analytics, reporting and other applications. The consumption layer, the last in the architecture, hosts the tools and applications with which users access the data to query, analyze and process it.
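The API layer's role can be illustrated with a small query function that consumers (BI tools, notebooks, reports) call instead of touching storage directly. The function and sample data are invented for illustration; it mimics a SELECT-with-WHERE over rows:

```python
# Sketch of an API layer: a single query entry point that different
# consumers can call, so none of them needs to know the storage details.
table = [
    {"region": "EU", "amount": 120.0},
    {"region": "US", "amount": 80.0},
    {"region": "EU", "amount": 40.0},
]

def query(rows, where=None, select=None):
    """Filter and project rows, roughly SELECT <select> WHERE <where>."""
    out = [r for r in rows if where is None or where(r)]
    if select:
        out = [{k: r[k] for k in select} for r in out]
    return out

eu_sales = query(
    table,
    where=lambda r: r["region"] == "EU",
    select=["amount"],
)
print(eu_sales)  # [{'amount': 120.0}, {'amount': 40.0}]
```

Centralizing access behind one interface like this is also what makes the governance and role-based access controls discussed later enforceable in practice.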

Keys, advantages and disadvantages of a data lakehouse

Among the most notable points of a data lakehouse architecture, based on what we have seen, are the following: ACID transaction support, to achieve data consistency and reliability in a distributed environment; cost-efficient storage for all types of data; support for structured, unstructured and semi-structured data; and support for open formats.

This means that a data lakehouse has numerous advantages which, in many cases, make this architecture preferable to a data lake or a data warehouse. It delivers a unified data platform and eliminates data silos. It is also compatible with real-time processing, so information can be obtained immediately, and it supports batch processing for large-scale analysis.

Maintaining both a data warehouse and a data lake can be expensive, but with a data lakehouse, data management teams only need to deploy and manage one platform. The costs of managing, deploying and maintaining two systems are therefore reduced. Consolidating resources and data sources also improves data governance.

This improves control over security, metrics, role-based access and all other aspects of information management. Finally, it also reduces data duplication: since there is a single storage system rather than two, there is no need for more than one copy of the information. It offers a single source of data that can be shared across different areas of a company.

In view of all this, it might seem that the data lakehouse is destined to become the only suitable architecture for data storage in business and research. But that is not always the case, because it also has its drawbacks. It is a relatively new concept, and its capabilities, and therefore its weaknesses, are still being explored.

To begin with, it is a complex system to develop from scratch. To adopt one, you must either choose a ready-to-use data lakehouse solution, whose performance will vary with the types of queries and the processing engine used, or invest time and resources in developing and maintaining a custom solution.

As a new data management concept, it is not a drop-in replacement for a data lake or a conventional data warehouse, but rather a combination of both. Furthermore, it is not designed to be fail-safe, and in particular it is not intrusion-proof.

Those who decide to implement one will have to be proactive about protecting it and take steps to manage the security risks associated with data storage and its complexity. Data quality and architectural problems may also arise with this type of architecture. Therefore, when considering which architecture best suits an enterprise's data storage, it is worth examining all the available options carefully.
