Big data has been a growing topic for a while now, and it is obvious that data is powerful. Data is indeed the new oil, and businesses everywhere are investing in data research. There are many terms nowadays that describe data and how it is organized. A data lake is one of them. So, what is it?
In simple words, a Data Lake is a centralized repository that collects, stores and organizes huge collections of data, including structured and semi-structured data. It also allows multiple organizational units (OUs) to explore and investigate their current business state in minutes. It gives users the ability to do ad-hoc analysis over diverse processing engines: serverless, in-memory processing, interactive queries and batch jobs.
In this blog I will explain how I translated MVP core services for a large e-commerce company into Infrastructure-as-Code (IaC) using CloudFormation scripts, to allow for fast and repeatable deployments, efficient testing, and a shorter recovery time in case of an unplanned event. This Data Lake architecture used the following services:

- AWS CloudFormation
- Amazon EC2
- Amazon S3
- Amazon Kinesis Data Firehose
- AWS Glue
- Amazon Athena
- Amazon Redshift Spectrum
- Amazon QuickSight
Each of these services is a huge topic in its own ecosystem, so throughout this article I will highlight how they work and how I integrated them.
CloudFormation (CF) is an AWS tool that allows you to build up resources effortlessly. CF uses a template file, in either YAML or JSON format, to bring up a collection of resources together as a single stack. It also works well with the AWS SDK, allowing users to securely deploy full stacks directly from the CLI without needing to use the AWS console. An easy example of a CLI command to build up a collection of resources is:
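As a minimal sketch (the template and stack names here are placeholders, and the command assumes your AWS CLI is already configured with credentials), deploying a template as a single stack looks like this:

```shell
# Deploy (create or update) a CloudFormation stack from a local template.
# "datalake.yaml" and "datalake-stack" are hypothetical names for illustration.
aws cloudformation deploy \
  --template-file datalake.yaml \
  --stack-name datalake-stack \
  --capabilities CAPABILITY_NAMED_IAM
```

The `--capabilities` flag is only needed because this stack creates named IAM resources.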
Data lake CloudFormation diagram
A very useful companion tool for your CF template is cfn-lint. This command-line linter inspects your code, looking for syntax glitches or even bugs that could lead to errors when you deploy. This tool will easily save a good 10 minutes of your valuable time. How to install it is shown below.
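Assuming you have Python and pip available, cfn-lint installs like any other Python package:

```shell
# Install the CloudFormation linter from PyPI.
pip install cfn-lint
```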
Once the linter is installed on your OS, it can be run from the command line with a basic one-liner like the one shown below:
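Pointing it at a template (the filename here is a placeholder) is all it takes:

```shell
# Lint a template; prints any warnings or errors and exits non-zero on failure.
cfn-lint datalake.yaml
```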
Amazon Elastic Compute Cloud (EC2)
EC2 is the backbone of this infrastructure, as it is dedicated to holding the e-commerce company's large data logs while business analysis takes place. It also provides resizable compute capacity for this environment. You can spin up a new server optimized for your workload in minutes and rapidly scale it up or down as your computing requirements change.
Amazon Simple Storage Service (S3)
S3 is the biggest and most performant data lake storage solution because of its cost-effective, secure data storage with 11 9s of durability and its virtually unlimited scalability. It makes sense to store your vast data logs in S3, don't you think?
The goal for individuals or businesses using this data lake solution is to integrate Amazon S3 with Amazon Kinesis, Amazon Athena, Amazon Redshift Spectrum, and AWS Glue so that data scientists or engineers can query and process a large amount of data.
Kinesis plays a dual role within this infrastructure. First, the Kinesis Firehose delivery stream captures data from a server log generated on your Amazon EC2 instance and delivers it to the data lake landing zone in your Amazon S3 bucket. Second, the Amazon Kinesis Agent running on the instance publishes data directly ("direct put") into that Firehose delivery stream.
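For reference, the Kinesis Agent is configured through its `agent.json` file; a minimal sketch (the log path and stream name below are placeholders) looks like this:

```json
{
  "flows": [
    {
      "filePattern": "/var/log/ecommerce/app.log*",
      "deliveryStream": "datalake-delivery-stream"
    }
  ]
}
```

Each entry in `flows` tails the files matching `filePattern` and puts new records into the named Firehose delivery stream.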
A powerful mechanism that Kinesis offers is the ability to configure how your data is stored in S3, based on buffer size and buffer interval. For the purpose of this project I selected a buffer size of 5 megabytes, meaning incoming data from the Firehose is split into files of roughly five megabytes each. For the buffer interval I set the lowest allowed value, which is 60 seconds. Tip to remember: Kinesis Firehose is "near real-time" and cannot go lower than that.
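In a CloudFormation template this buffering behavior is expressed through the `BufferingHints` block of the delivery stream. A sketch, with hypothetical resource and stream names:

```yaml
DeliveryStream:
  Type: AWS::KinesisFirehose::DeliveryStream
  Properties:
    DeliveryStreamName: datalake-delivery-stream   # hypothetical name
    ExtendedS3DestinationConfiguration:
      BucketARN: !GetAtt LandingZoneBucket.Arn     # assumes an S3 bucket resource in the same stack
      RoleARN: !GetAtt FirehoseRole.Arn            # assumes an IAM role granting Firehose access to the bucket
      BufferingHints:
        SizeInMBs: 5           # flush once 5 MB have accumulated...
        IntervalInSeconds: 60  # ...or every 60 seconds, whichever comes first
```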
Now you might be thinking about how to securely protect these services and the data either at rest or in transit.
A simple and secure way to let Kinesis Firehose deliver the data into S3 is to create an IAM role with a properly scoped policy attached, such as the example below:
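A sketch of such a role in CloudFormation YAML (the role, policy and bucket resource names are placeholders):

```yaml
FirehoseRole:
  Type: AWS::IAM::Role
  Properties:
    # Only the Firehose service may assume this role.
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: firehose.amazonaws.com
          Action: sts:AssumeRole
    Policies:
      - PolicyName: FirehoseS3Delivery
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            # The minimum S3 permissions Firehose needs to deliver objects.
            - Effect: Allow
              Action:
                - s3:AbortMultipartUpload
                - s3:GetBucketLocation
                - s3:GetObject
                - s3:ListBucket
                - s3:PutObject
              Resource:
                - !GetAtt LandingZoneBucket.Arn
                - !Sub "${LandingZoneBucket.Arn}/*"
```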
Voila! Now you have your delivery stream set up. Very exciting, right?
Up to this point you are halfway to completing your data lake foundation. Let's recap what you have achieved so far: you created individual templates for your services, tested them, added security policies, and streamed your data from EC2 to your S3 bucket landing zone using Kinesis. Now you need a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load the data for analytics. Glue is perfect for this job. With AWS Glue you can complete this task in two ways: manually, or with an AWS Glue crawler.
Once your database is ready you can run the Glue crawler, which after a minute or two extracts the metadata from your S3 bucket into a nice table schema. Note that Glue will not assign headers or partition names to this schema, so you will need to edit those manually. Nonetheless, Glue is able to recognize the data types in the schema (e.g. string, bigint, double).
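The crawler itself can be declared in the same CloudFormation template. A sketch, with hypothetical names (it assumes a Glue database already exists and a role carrying the AWSGlueServiceRole managed policy):

```yaml
DataLakeCrawler:
  Type: AWS::Glue::Crawler
  Properties:
    Name: datalake-crawler             # hypothetical name
    Role: !GetAtt GlueCrawlerRole.Arn  # assumes a role with Glue + S3 read permissions
    DatabaseName: datalake_db          # the Glue database created beforehand
    Targets:
      S3Targets:
        - Path: s3://my-datalake-landing-zone/logs/  # hypothetical bucket path to crawl
```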
Now that you have a structured table in AWS Glue for the data storage in your S3 bucket you can start treating S3 as a data lake database. Next stop Athena.
Athena is a serverless service, so you do not need to worry about managing anything. This is super cool, personally speaking. Athena is also an interactive query service for S3 which offers you a console to query S3 data with standard SQL. Athena also supports a variety of data formats such as:
Athena interface example shown below:
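A typical query against the crawled table might look like the sketch below (the database, table and column names are hypothetical; Athena reads the underlying data directly from S3):

```sql
-- Top status codes in the 2020 partition of the hypothetical server_logs table.
SELECT status_code, COUNT(*) AS requests
FROM datalake_db.server_logs
WHERE year = '2020'
GROUP BY status_code
ORDER BY requests DESC
LIMIT 10;
```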
With Athena there is no need to load your data out of S3. What do I mean by this? Well, the data actually lives in S3, and Athena is smart enough to know how to interpret it and query it interactively. For those who are familiar with Presto: Athena uses Presto under the hood.
Amazon Redshift is AWS's massively parallel processing data warehouse, designed for large-scale data sets. A very useful feature that Redshift offers is Amazon Redshift Spectrum. In this section I will give a high-level summary of why it is so powerful.
Redshift Parallel Processing
Let's start by reiterating that Amazon Redshift Spectrum is a serverless query processing engine that allows you to analyze data stored in Amazon S3 using standard Structured Query Language (SQL) without ETL processing. What do I mean by analyzing data that is sitting in Amazon S3? I mean that you do not need to use Amazon Redshift storage for any of the data, because the data always lives in Amazon S3. You can pull, aggregate and filter all sorts of data using Amazon Redshift Spectrum. Remember, Redshift Spectrum is serverless, so you do not need to worry about anything other than your code.
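In practice you point Redshift Spectrum at the Glue Data Catalog through an external schema, then query the S3-resident tables like ordinary ones. A sketch with hypothetical names and a placeholder role ARN:

```sql
-- Map the Glue database into Redshift as an external schema.
CREATE EXTERNAL SCHEMA spectrum_logs
FROM DATA CATALOG
DATABASE 'datalake_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole';

-- The data stays in S3; only the query runs through Redshift.
SELECT COUNT(*) FROM spectrum_logs.server_logs;
```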
Amazon QuickSight is a business intelligence application, similar to tools like Tableau, Qlik and Microsoft Power BI, that allows you to run interactive queries on large datasets. It is powered by SPICE (Super-fast, Parallel, In-memory Calculation Engine).
You can create visualizations and dashboards in Amazon QuickSight for the data stored in your data lake. Visualizations are key to giving individuals and business owners a simpler, more intuitive view of the current and historical states of your business.
This introductory article is for teams and people who are about to explore data lakes. One fact that I want you to take away after reading this is that creating a data lake removes compute overload and allows individuals and businesses to downscale their most expensive database clusters. This, in turn, lets them focus less on their infrastructure and more on the powerful insights that come from their data analysis.