Data Lake best practices in AWS


Many businesses are looking to enable analytics across many different types of data sources and gain insights that guide them to better business decisions. A data lake is one way of doing that: a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. Data analysts can then leverage the data in the lake with their choice of analytics and machine learning services, such as Amazon EMR for Apache Spark, Amazon Redshift, Amazon Athena, and more.

 

AWS Lake Formation

AWS Lake Formation is a new service that makes it easier for businesses to set up a data lake – something that was previously a big undertaking taking months can now be broken down into just a few days of work. Lake Formation automatically crawls, cleans and prepares the data, which you can in turn use to train machine learning models that deduplicate records based on what you want the data to look like. The most interesting functionality in Lake Formation may be the centralized dashboard for securing access at table and column level across all tools in the data lake – something that has previously been quite complicated and required third-party tooling.


Data lake best practices

Best practices for running a data lake optimized for performance, security and data processing were discussed during the AWS Lake Formation session at AWS re:Invent 2018. The session was split into three main categories: ingestion, organisation and preparation of data for the data lake. Your current bottleneck may lie in any or all of these three categories, as they often interlink – so make sure to look into all of them when optimizing your data lake.

 

Ingestion

The main takeaway from the session was that S3 should be used as the single source of truth, where ingested data is preserved in its original form. No transformation of data should happen in the ingestion S3 storage; if you transform the data, write the results to another S3 bucket.

To avoid paying for a bucket full of stale data, you should also utilize object lifecycle policies so that data you aren't using gets moved to a cheaper storage class such as Amazon S3 Glacier. This especially makes sense for data that falls outside your analysis time window and is no longer interesting for analytics.
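As an illustration, here is a minimal boto3 sketch that transitions objects under a raw ingestion prefix to Glacier after 90 days; the bucket name, prefix and retention period are assumptions for the example.

```python
import boto3

s3 = boto3.client("s3")

# Move raw objects to Glacier 90 days after creation (hypothetical bucket/prefix)
s3.put_bucket_lifecycle_configuration(
    Bucket="ingestion-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```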

Getting data in from databases can be a pain, especially if you are trying to use replicas of on-premise databases. Instead of managing database replicas, AWS recommends using AWS Database Migration Service (DMS), which makes it easier to replicate the data without having to manage yet another database. If you use an AWS Glue ETL job to transform, merge and prepare the data ingested from the database, you can also optimize the resulting data for analytics and take daily snapshots to preserve the database view of the records.
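A minimal sketch of what such a Glue ETL job could look like, reading a table that DMS has replicated into the raw bucket (and a Glue crawler has catalogued) and writing an analytics-friendly, partitioned Parquet copy to a curated bucket; the database, table and bucket names are assumptions.

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the replicated table from the Glue Data Catalog (hypothetical names)
orders = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders"
)

# Write a partitioned Parquet snapshot to the curated bucket
glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={
        "path": "s3://curated-bucket/orders/",
        "partitionKeys": ["year", "month", "day"],
    },
    format="parquet",
)
job.commit()
```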

 

Organisation

Organisation of the data is a strategy that usually comes far too late in a data lake project. Already at the beginning of the project you should look into organizing the data into partitions in S3, choosing partition keys that align with common query filters.

For example, it is sometimes better to create multiple S3 buckets and partition each bucket on year/month/day/ instead of trying to fit all of your data into one S3 bucket with even more granular partitions. In reality this depends on what your most common queries look like – you may need to partition on month instead of year, depending on your usage.
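To see why the partition keys matter, here is a minimal boto3 sketch of an Athena query whose WHERE clause matches the year/month/day partition keys, so only the relevant S3 prefixes are scanned; the database, table and result bucket names are assumptions.

```python
import boto3

athena = boto3.client("athena")

# Filters on the partition columns prune the scan to a single day's prefix
athena.start_query_execution(
    QueryString="""
        SELECT order_id, amount
        FROM orders
        WHERE year = '2019' AND month = '06' AND day = '01'
    """,
    QueryExecutionContext={"Database": "curated_db"},
    ResultConfiguration={"OutputLocation": "s3://athena-query-results-bucket/"},
)
```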

 

Preparation

For mutable data, use a database such as Amazon Redshift or Apache HBase, but make sure to offload the data to S3 once it becomes immutable. You can also append delta files to the partitions and compact them with a scheduled job that keeps the most recent version of each record and deletes the rest.
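A minimal PySpark sketch of such a compaction job, assuming hypothetical paths, an order_id business key and an updated_at timestamp column; it keeps the newest version of each record and rewrites the partition as a small number of large files.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("compact-partition").getOrCreate()

# Base data plus the delta files appended to the same partition
df = spark.read.parquet("s3://curated-bucket/orders/year=2019/month=06/day=01/")

# Keep only the newest row per business key, ordered by the update timestamp
w = Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())
latest = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .drop("rn")
)

# Write the compacted partition to a separate prefix as a few large files
(
    latest.coalesce(4)
          .write.mode("overwrite")
          .parquet("s3://curated-bucket/orders_compacted/year=2019/month=06/day=01/")
)
```

Writing to a separate prefix (rather than overwriting the files being read) avoids Spark reading and writing the same path in one job; a follow-up step can then swap in the compacted partition and clean up the old one.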

Remember to compact the data from the source before you do analytics – the optimal file size is between 256 and 1000 MB. If you need faster ingestion than periodically pulling the data from S3, you can stream the data to Kinesis Data Streams, process it with Apache Flink and push the processed data to S3.
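For the streaming path, here is a minimal boto3 sketch of a producer putting records onto a Kinesis data stream; the stream name and record fields are assumptions, and a Flink application would consume the stream and write the processed output to S3.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

record = {"device_id": "sensor-42", "temperature": 21.7, "ts": "2019-06-01T12:00:00Z"}

kinesis.put_record(
    StreamName="ingest-stream",                 # hypothetical stream name
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=record["device_id"],           # spreads records across shards
)
```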

 

If you’d like some help with AWS Lake Formation, please feel free to contact us.

 









GDPR: The drought for your data lake


Data lake – we’ve seen it mentioned in the IT news headlines

It is the new hope of IT organisations: enabling their business units with actual content and valuable insights, rather than just offering servers and empty storage. Almost all companies of size and renown have embarked on this new journey and are building data lakes to sail upon. Or maybe not?

Recent concerns raised around this data lake use case, especially since the dawn of GDPR, have made people rethink the share-everything-with-everyone mindset behind these lakes. They have also raised the matter of data ownership, retention, deletion and correction. Most data lake scenarios are viewed from a primarily technical perspective, because that is where the idea comes from. Inevitably – and luckily not after the actual release of many of these lakes into production – the legal and compliance departments have woken up.

As we are involved in quite a number of these projects, we wanted to share the main aspects to keep in mind when building your data lake under GDPR. So here we go:

Employee Data

You can of course argue that once you work for a company, your data belongs to it. But it’s not that easy. First of all, this is a concept that may or may not apply in some countries. Secondly, storing employee data is one thing; using it for analytical purposes may require the employee’s consent. And that is where you run into challenges. In Germany, for example, companies all have one thing in common: they have extensive employee data and are rarely allowed to use it to their advantage because of the current legislation. With the introduction of GDPR, this type of scrutiny will be imposed on all EU countries and hence become a challenge for many more businesses.

Customer Data

This should be the most traditional use case in data privacy and protection, and it is one of the key reasons why the GDPR debate is so viral and vibrant these days: it concerns almost all companies. There’s a lot to discuss around this particular point, but one specific aspect worth noting is the “Right to Explanation”. If you use machine learning on user data, GDPR states that “meaningful information about the logic” behind machine learning models must be made available to users.

Many machine learning models are black boxes, but the type of data used to train them should be made clear to users so that they can make an informed decision to opt out. Users should, at all times, be offered the option not to have their data used as part of machine learning and artificial intelligence applications.

Device Data

With IoT and Connected-X, we all feel like we can’t really participate in modern society without sacrificing some of the privacy tied to our devices and gadgets. From a legal perspective, service providers ask for your consent when you install mobile apps or sign up for a SaaS-type service. This is the easy part. Now imagine you are a car manufacturer who could gain plenty of insights and competitive advantage by collecting device/car data in the field, and has all the technology to make that happen, but is not allowed to do it.

In actual fact, this is an issue. People used to buy cars without signing a data privacy agreement; recently, privacy agreements have become a necessity even to operate the connected car services. As a business, you always have to keep in mind that just because it is device data does not mean you can harvest and use it to your advantage. There is a human being or an organisation behind the device using it. You need their consent; otherwise, no data can be legally processed.

Prevent the Drought

So does that mean there is a chance your data lake could dry out very soon? Don’t worry, here are some relatively easy ways to address this challenge:

Anonymisation of data is one way to solve this. The data is stripped of all potential identifiers of human beings and actual end-user-facing devices, so that only statistical data for very specific use cases is collected. If that isn’t possible in your given use case, it’s a different story. But anonymisation must become an inherent part of all the data processing in the solution you design, and it isn’t bound to the data lake at all – it sits within your application.
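As an illustration, here is a minimal Python sketch of stripping direct identifiers and replacing keys with a salted hash before a record lands in the lake; the field names and salt are assumptions, and salted hashing is strictly speaking pseudonymisation rather than full anonymisation.

```python
import hashlib

IDENTIFIER_FIELDS = ["name", "email", "phone"]     # dropped entirely
PSEUDONYM_FIELDS = ["customer_id", "device_id"]    # replaced by a salted hash


def anonymise(record: dict, salt: str) -> dict:
    # Drop direct identifiers
    clean = {k: v for k, v in record.items() if k not in IDENTIFIER_FIELDS}
    # Replace keys with a salted hash so records can still be grouped
    for field in PSEUDONYM_FIELDS:
        if field in clean:
            clean[field] = hashlib.sha256((salt + str(clean[field])).encode()).hexdigest()
    return clean


# Only the statistical payload and a pseudonymous key remain
print(anonymise(
    {"name": "Jane Doe", "email": "jane@example.com", "customer_id": "C-1001", "purchases": 7},
    salt="example-salt",
))
```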

Encryption of data can be a very easy and elegant way to address the challenge without even building much of a solution into your cloud platform. Most of the public cloud platforms provide several mechanisms that allow encryption on various layers of the platform at no additional cost. The great thing is that you can automate remediation actions based on alerts whenever any kind of data is stored unencrypted in the cloud, which makes non-compliance with this standard practically impossible.
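A minimal boto3 sketch of such a remediation step, checking whether a bucket has default encryption configured and enabling SSE-S3 if not; the bucket name is an assumption, and in practice this would be triggered by an alert (for example from AWS Config) rather than run by hand.

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
bucket = "curated-bucket"   # hypothetical bucket name

try:
    s3.get_bucket_encryption(Bucket=bucket)
    print(f"{bucket}: default encryption already enabled")
except ClientError as err:
    if err.response["Error"]["Code"] == "ServerSideEncryptionConfigurationNotFoundError":
        # Enable SSE-S3 (AES-256) as the bucket's default encryption
        s3.put_bucket_encryption(
            Bucket=bucket,
            ServerSideEncryptionConfiguration={
                "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
            },
        )
        print(f"{bucket}: default encryption enabled")
    else:
        raise
```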

Setting up a Data Management Practice is a general requirement in order to make sure you have full visibility and (access) control over all the data your company holds, manages or has access to. It is also important to maintain a metadata scheme across all data types, as complete as possible, so that the data is searchable and can be clustered.

There are many more use cases in the Big Data field that require your attention, but I hope we’ve made our point. Just because you have data (in your lake) does not necessarily mean you can actually use it. GDPR demands that you have customer and employee consent before using any form of data collected. At Nordcloud, we combine strong expertise in the Big Data, Machine Learning and IoT fields with years of AWS and Azure project delivery, all wrapped up in a deep awareness of data protection and security.

Please feel free to reach out to us if you think the above sounds familiar but perhaps too complex to tackle on your own. We’re here to help.
