As the volume, velocity and variety of their data grow, businesses increasingly depend on data lakes for data storage, governance, blending and analysis. A data lake is a system that stores data in its raw format. It enables businesses to collect a larger volume and variety of data without the rigidity and overhead of traditional data warehouse architectures. Additionally, data lakes provide a place for data-focused users to experiment with datasets and find value without involving IT or spinning up a large project.
Traditional enterprise data warehouses (EDW) and data marts require days or weeks to plan, design, model and develop before data is made visible to end-users. During this period, key elements in the business may have changed, requiring re-design and protracting time-to-value. The rigidity and rigor of an EDW often push end-users to build their own solutions using spreadsheets, local databases and other proprietary tools. This inevitably creates data silos, shadow IT and a fragmented data landscape. Furthermore, when few business data resources are cataloged, the business has less data available to answer its questions, and decision makers end up acting on incomplete information.
A well-designed data lake balances the structure provided by a data warehouse with the flexibility of a file system. It’s important to understand that a data lake is different from a file system: raw data cannot simply be added to the lake and made available to the business. Process and design are still required to enable end-users, and security and governance still apply. Applying a specific architecture enables the business to be nimble while retaining control.
While the beneficiaries of a data lake solution span the entirety of the business, the users that access the lake directly are limited. Not all business users will be interested in accessing the data lake directly, nor should they spend their time on data blending and analysis. Instead, the whole of the lake is made available to data scientists and analysts, while a vetted and curated dataset is made available to the business at large.
The overall architecture and flow of a data lake can be categorized into three primary pillars: operations, discovery and organization.
Data movement involves the ingestion and extraction of data in the data lake. Data might be pushed or pulled depending on the technology chosen and purpose for the movement of data. Ingestion is the most critical component as the source systems can produce a variety of data streams and formats.
Data processing, as it relates to data lakes, involves both real-time and batch processing, and it can occur both inside and outside the data lake. Both are included because a streaming pipeline, in most cases, routes to both a real-time dashboard and the data lake for later batch processing. Data volume, velocity and business requirements help determine the ultimate pattern for processing data.
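The dual routing described above can be sketched in a few lines of Python. This is a minimal, in-memory illustration only: the event shape, the sink functions and the names fan_out, update_dashboard, append_to_lake and batch_average are all assumptions made for this example, not part of any particular streaming product.

```python
from collections import defaultdict
from typing import Callable, Iterable

Event = dict  # one record from a hypothetical stream


def fan_out(events: Iterable[Event], sinks: list[Callable[[Event], None]]) -> int:
    """Deliver every event to every sink and return the number of events seen."""
    count = 0
    for event in events:
        for sink in sinks:
            sink(event)
        count += 1
    return count


# Real-time path: keep a running total per sensor for a live dashboard.
dashboard_totals: dict[str, float] = defaultdict(float)


def update_dashboard(event: Event) -> None:
    dashboard_totals[event["sensor"]] += event["value"]


# Lake path: store the raw event unchanged for later batch processing.
lake_landing: list[Event] = []


def append_to_lake(event: Event) -> None:
    lake_landing.append(event)


def batch_average(raw_events: list[Event]) -> dict[str, float]:
    """Batch job that runs later over the raw events kept in the lake."""
    sums: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for event in raw_events:
        sums[event["sensor"]] += event["value"]
        counts[event["sensor"]] += 1
    return {sensor: sums[sensor] / counts[sensor] for sensor in sums}


# Made-up sample events, for illustration only.
stream = [
    {"sensor": "a", "value": 2.0},
    {"sensor": "b", "value": 5.0},
    {"sensor": "a", "value": 4.0},
]
seen = fan_out(stream, [update_dashboard, append_to_lake])
averages = batch_average(lake_landing)
```

In a real deployment the two sinks would typically be a dashboard service and a storage writer; the point of the sketch is only that the raw events reach the lake untouched, so a batch job can recompute anything later.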
Orchestration is essentially cloud automation: it gives us the ability to execute multiple processes on multiple systems within a data pipeline and to schedule those pipelines.
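As a rough illustration of what an orchestrator does, the sketch below runs a set of dependent steps in a valid order using Python's standard graphlib module. The task names and the run_pipeline helper are hypothetical, and scheduling (running the pipeline at set times) is deliberately left out; a real orchestration service would also handle retries, logging and execution across systems.

```python
from graphlib import TopologicalSorter
from typing import Callable


def run_pipeline(
    tasks: dict[str, Callable[[], None]],
    depends_on: dict[str, set[str]],
) -> list[str]:
    """Run each task after all of its dependencies; return the execution order."""
    # Map every task to its predecessors (an empty set if it has none).
    graph = {name: depends_on.get(name, set()) for name in tasks}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        tasks[name]()
    return order


run_log: list[str] = []

# Hypothetical pipeline steps; each one only records that it ran.
tasks = {
    "ingest": lambda: run_log.append("ingest"),
    "clean": lambda: run_log.append("clean"),
    "catalog": lambda: run_log.append("catalog"),
    "publish": lambda: run_log.append("publish"),
}
depends_on = {
    "clean": {"ingest"},
    "catalog": {"ingest"},
    "publish": {"clean"},
}

order = run_pipeline(tasks, depends_on)
```

TopologicalSorter raises graphlib.CycleError if the dependencies form a loop, which is a useful guard when pipelines are defined by hand.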
Tagging drives data discovery in a business. A data lake could potentially hold petabytes of data, much of which might not yet feed production systems because its value has not been identified. Tagging is both an automatic and a manual business process, and is unique to each business.
Metadata is data about data. It ranges from descriptive details such as the format and schema to lineage information such as the data source, the date and time the data was captured and the data location. Metadata capture and tagging are closely related and support data exploration and governance within the business.
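To make metadata capture and tagging concrete, here is a minimal sketch that records the format, schema, source, capture time and location of a CSV file, then applies automatic and manual tags. The DatasetMetadata fields, the capture_metadata and auto_tag helpers, the keyword rules and the sample file are all assumptions for illustration; a production lake would normally rely on a dedicated data catalog.

```python
import csv
import tempfile
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class DatasetMetadata:
    location: str          # where the data lives
    file_format: str       # taken from the file extension
    schema: list[str]      # column names from the header row
    source_system: str     # where the data came from
    captured_at: datetime  # when the metadata was captured (UTC)
    tags: set[str] = field(default_factory=set)


def capture_metadata(path: Path, source_system: str) -> DatasetMetadata:
    """Read the CSV header and record basic metadata about the file."""
    with path.open(newline="") as handle:
        header = next(csv.reader(handle), [])
    return DatasetMetadata(
        location=str(path),
        file_format=path.suffix.lstrip(".").lower(),
        schema=header,
        source_system=source_system,
        captured_at=datetime.now(timezone.utc),
    )


def auto_tag(meta: DatasetMetadata, rules: dict[str, str]) -> None:
    """Automatic tagging: add a tag when a column name contains a keyword."""
    for column in meta.schema:
        for keyword, tag in rules.items():
            if keyword in column.lower():
                meta.tags.add(tag)


# Made-up sample file and rules, for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "orders.csv"
    sample.write_text("customer_id,email,order_total\n1,ada@example.com,9.50\n")
    meta = capture_metadata(sample, source_system="web-shop")

auto_tag(meta, {"email": "pii", "total": "finance"})
meta.tags.add("q1-campaign")  # manual tag added by an analyst
```

The keyword rules stand in for whatever business-specific tagging logic applies; the manual tag shows how an analyst's knowledge can sit alongside the automatic ones.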
The design patterns of our data lake are going to be the foundation of our future development and operations. At the highest level, you need a place to land, stage and surface data. The biggest difference from traditional systems is that once we land the data, it will live there indefinitely. Although Azure Data Lake is specifically designed for large-scale analytics, and is usually housed on an HDFS platform, a data lake can also be built on Azure Blob Storage or even SQL Server, depending on the specific use case.
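One common way to express the land, stage and surface pattern is as separate zones in the storage layout. The sketch below uses a local temporary directory as a stand-in for the lake; the zone names, the folder convention (zone/source/year/month/day/file) and the helpers zone_path, land and promote are assumptions for illustration rather than a prescribed layout.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path

ZONES = ("landing", "staging", "surfaced")  # hypothetical zone names


def zone_path(root: Path, zone: str, source: str, day: date, filename: str) -> Path:
    """Build a path of the form <root>/<zone>/<source>/YYYY/MM/DD/<filename>."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return root / zone / source / f"{day:%Y/%m/%d}" / filename


def land(root: Path, source: str, day: date, filename: str, payload: bytes) -> Path:
    """Write raw data into the landing zone exactly as received."""
    target = zone_path(root, "landing", source, day, filename)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def promote(root: Path, landed: Path, to_zone: str) -> Path:
    """Copy a landed file into a later zone; the landed original is never removed."""
    if to_zone not in ZONES or to_zone == "landing":
        raise ValueError(f"cannot promote to zone: {to_zone}")
    target = root / to_zone / landed.relative_to(root / "landing")
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(landed, target)
    return target


# Made-up source and file, for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    landed = land(root, "crm", date(2024, 1, 15), "contacts.csv", b"id,name\n1,Ada\n")
    staged = promote(root, landed, "staging")
    landed_still_exists = landed.exists()
    staged_relative = staged.relative_to(root).as_posix()
```

In practice the staging step would clean or reshape the data rather than copy it verbatim; the sketch only shows that promotion leaves the landed copy in place, matching the idea that landed data lives in the lake indefinitely.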
Security is necessary not only to enforce compliance but also to keep the lake from becoming a "data swamp," a collection of unorganized data within the data lake. Azure Data Lake allows users to control enterprise security through Azure Active Directory, alongside security features specific to Azure Data Lake that control access.
As a business continues to grow, the need to collect and quickly access a large volume and variety of data increases. Because traditional EDWs require significant time to develop and carry the risk of data silos, shadow IT and fragmented data, businesses should look to developing a data lake. A well-designed data lake allows a business to capitalize effectively and efficiently on its incoming data.
At Baker Tilly Digital, we help our clients leverage modern cloud platforms and warehousing techniques to derive new value from their data. We work with you to structure your data so it’s understandable, accurate and beneficial to the continued growth of your business. Having the right data lake in place to capture and analyze data will allow your business to make the best decisions possible with the most accurate and up-to-date information available. To learn more about how Baker Tilly Digital can help you get started developing a data lake, contact one of our professionals today.