When should we create a data lake?

AWS Lake Formation

AWS Lake Formation is a service that enables you to set up a secure data lake in a matter of days. A data lake is a centralized, managed, and secured repository that stores all of your data, both in its original form and prepared for analysis. A data lake enables you to break down data silos and combine different types of analysis to gain insights and make better business decisions.

However, setting up and managing a data lake today involves many manual, complicated, and time-consuming tasks. This work includes loading data from various sources, monitoring those data flows, setting up partitions, enabling encryption and managing keys, defining transformation jobs and monitoring their operation, reorganizing data into a columnar format, configuring access control settings, deduplicating redundant data, matching linked records, granting access to data sets, and auditing access over time.

Creating a data lake with Lake Formation is as simple as defining your data sources and the data access and security policies you want to apply. Lake Formation then helps you collect and catalog data from databases and object stores, move the data into your new Amazon S3 data lake, clean and classify your data using machine learning algorithms, and secure access to your sensitive data. Your users can access a centralized data catalog that describes the available data sets and their appropriate use. Users then use these data sets with their choice of analytics and machine learning services, such as Amazon Redshift, Amazon Athena, and (in beta) Amazon EMR for Apache Spark. Lake Formation builds on the capabilities available in AWS Glue.
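
As a rough illustration of the two inputs described above (a data source and an access policy), the following Python sketch builds the arguments for the boto3 Lake Formation calls register_resource and grant_permissions. It is a minimal sketch, not a complete setup: the bucket ARN, role ARN, database name, and table name are hypothetical placeholders, and the helper functions are invented for this example. It also assumes the table already exists in the data catalog and that the caller has Lake Formation administrator rights.

```python
from typing import Any, Dict, List


def build_register_request(bucket_arn: str) -> Dict[str, Any]:
    """Arguments for registering an S3 location as data lake storage."""
    return {
        "ResourceArn": bucket_arn,
        # Let Lake Formation use its service-linked role to reach the bucket.
        "UseServiceLinkedRole": True,
    }


def build_grant_request(
    principal_arn: str, database: str, table: str, permissions: List[str]
) -> Dict[str, Any]:
    """Arguments for granting a principal permissions on a catalog table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }


def apply_requests(register_req: Dict[str, Any], grant_req: Dict[str, Any]) -> None:
    """Send both requests to AWS. Requires boto3 and valid credentials."""
    import boto3  # imported here so the builders above work without AWS access

    client = boto3.client("lakeformation")
    client.register_resource(**register_req)
    client.grant_permissions(**grant_req)


# Hypothetical names, used only for illustration.
register_req = build_register_request("arn:aws:s3:::example-data-lake-bucket")
grant_req = build_grant_request(
    principal_arn="arn:aws:iam::111122223333:role/ExampleAnalystRole",
    database="example_sales_db",
    table="orders",
    permissions=["SELECT"],
)
```

Calling apply_requests(register_req, grant_req) would send both requests to AWS. Keeping the request builders separate from the network call makes it possible to review or log the policy before anything is changed in the account.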