The term “data lake” – a centralized repository containing vast amounts of raw data in its native format – has only been around for a decade. Despite being a relatively new term, data lakes are expected to reach an annual market volume of $20.1 billion by 2025, according to Research and Markets.
A data lake typically hosts data from many sources in multiple formats, all of which require analysis to gain business insights. The terms "data lake" and "big data" are increasingly heard together, and for good reason: big data analytics needs large volumes of data to produce insights.
Because data lakes aggregate data from multiple sources, they can quickly scale to petabytes and beyond. This volume of data exceeds the capacity of traditional database technologies, such as relational database management systems (RDBMS), which were primarily designed to handle structured data.
Not only is there a potential capacity issue, but data lakes amass structured, semi-structured, and unstructured data. To manage these different types of data flexibly and at scale, new storage systems such as the Hadoop Distributed File System (HDFS) have been adopted for data lakes. But like any technology, HDFS has its limitations.
One of the main drawbacks of HDFS is that its compute and storage resources are tightly coupled as the system scales, because the file system runs on the same machines as the applications. Compute capacity and storage must grow together, which can carry a high economic cost.
To take full advantage of the business insights found in these massive data lakes, organizations rely on both the analytics tools and the underlying storage repository; the latter is arguably the more important of the two.
Why? Because the repository must process data from multiple sources with adequate performance, as well as be able to grow in both performance and capacity so that data is widely available to applications, tools, and users.
In the quest for greater scalability, flexibility, and lower cost, object storage is rapidly emerging as the storage standard for data lakes.
With object storage, there is no limit on the volume of data. Another key advantage is that it accommodates all types of data without predefined "schemas" (unlike an RDBMS, where you must define the structure of tables and the relationships between them up front in order to run complex queries), which increases flexibility.
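To make the contrast concrete, the following is a minimal sketch, not tied to any particular product, of the object-storage model: objects of any format are stored and retrieved by key, with no schema declared in advance. The `ObjectStore` class and the keys below are purely illustrative; real systems expose the same put/get semantics through APIs such as Amazon S3.

```python
# Toy illustration of the object-storage model: a flat namespace of keys
# mapping to opaque byte payloads plus user-defined metadata.
# No schema is declared up front -- JSON, CSV, and binary data coexist.

class ObjectStore:
    """Minimal in-memory object store (illustrative only)."""

    def __init__(self):
        self._objects = {}  # key -> (payload bytes, metadata dict)

    def put(self, key, data, metadata=None):
        # Any payload is accepted as-is; the store never inspects it.
        self._objects[key] = (data, metadata or {})

    def get(self, key):
        return self._objects[key]


store = ObjectStore()
store.put("logs/2020/app.json", b'{"level": "info"}', {"type": "json"})
store.put("sensors/readings.csv", b"id,temp\n1,21.5\n", {"type": "csv"})
store.put("images/scan.bin", bytes([0xFF, 0xD8]), {"type": "binary"})

payload, meta = store.get("sensors/readings.csv")
```

Note that the store imposes no structure on its contents: adding a new data format requires no migration or table redesign, only a new key.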
In addition, modern object storage systems like ours support independent scaling of capacity and performance, an important advantage for large analytics projects. The ability to scale independently provides adequate compute performance for data analytics—on demand—and substantially lowers the total cost of a data lake solution.
Object storage has also been embraced by application vendors as they try to meet the challenges of growing data volumes for customers. Solutions like Splunk now support object storage through the SmartStore feature (which leverages the Amazon S3 API), and Micro Focus Vertica offers Eon mode (which also takes advantage of S3).
These solutions decouple the compute (search) tier from the persistent capacity tier, giving users more flexibility and cost efficiency, while enabling much larger data volumes for more efficient analytics. In addition, the Apache Spark ecosystem of tools, which traditionally used HDFS for storage, also supports S3 object storage through Hadoop's S3A file system connector, which leverages the S3 API.
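As an illustration of how this works in practice, a Spark deployment can point the S3A connector at an S3-compatible endpoint through a handful of Hadoop properties, for example in spark-defaults.conf. The endpoint and credential values below are placeholders, not values from any specific product:

```properties
# spark-defaults.conf -- hypothetical values for illustration
spark.hadoop.fs.s3a.endpoint           https://s3.example.internal
spark.hadoop.fs.s3a.access.key         EXAMPLE_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key         EXAMPLE_SECRET_KEY
spark.hadoop.fs.s3a.path.style.access  true
```

Once configured, jobs read and write through s3a:// URIs (for example, spark.read.parquet("s3a://datalake/events/")) instead of hdfs:// paths, so the storage tier can grow independently of the compute cluster.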
Signed: Israel Serrano, head of Scality for Spain and Portugal