Foreword
The most important thing, even if we forget all the other, more technical details, is to know the difference between a data lake and a data warehouse!
data lake: vast pool of raw data, e.g. S3
data warehouse: a repository for structured, filtered data, e.g. Redshift
For a quick overview of all AWS data-related services, visit this site!
Exam Notes
In the section below, I present my notes for the AWS Data Analytics Specialty Exam.
Amazon Athena
· an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL
· You can use a JDBC (Java Database Connectivity) connection to connect Athena to business intelligence tools and other applications, such as SQL Workbench. ODBC (Open Database Connectivity) is also supported in Amazon Athena
· AWS recommends using Athena workgroups to isolate queries for teams, applications, or different workloads
Athena: CREATE TABLE AS SELECT (CTAS) query
· used to perform the conversion to columnar formats, such as Parquet and ORC, in one step
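A minimal sketch of running such a CTAS conversion through boto3; the database, table, bucket, and workgroup names are placeholders:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# CTAS query that rewrites raw CSV data as partitioned, Snappy-compressed Parquet.
# Partition columns must come last in the SELECT list.
ctas_sql = """
CREATE TABLE sales_parquet
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://my-analytics-bucket/sales_parquet/',
    partitioned_by = ARRAY['sale_date']
) AS
SELECT customer_id, amount, sale_date
FROM sales_raw_csv;
"""

response = athena.start_query_execution(
    QueryString=ctas_sql,
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://my-analytics-bucket/athena-results/"},
    WorkGroup="primary",
)
print(response["QueryExecutionId"])
```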
Amazon Elasticsearch
· Extracting structured data from documents and creating a smart index using Amazon Elasticsearch Service (Amazon ES) allows you to search through millions of documents quickly
· Selecting the number of shards is something of an art, but a “JVMMemoryPressure” error signals unbalanced shard allocation across nodes
· i.e. there are too many shards
· AWS recommends adding three dedicated master nodes to each production Amazon ES domain
· allows you to visualize the performance metrics in real-time using Kibana dashboards
· buffer interval (when loading into Amazon ES via Kinesis Data Firehose) of 60 to 900 seconds
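A sketch of where that buffering interval is configured when creating a Firehose delivery stream with an Amazon ES destination via boto3; the ARNs, domain, index, and bucket names are placeholders:

```python
import boto3

firehose = boto3.client("firehose")

# BufferingHints.IntervalInSeconds must be between 60 and 900.
firehose.create_delivery_stream(
    DeliveryStreamName="logs-to-es",
    DeliveryStreamType="DirectPut",
    ElasticsearchDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "DomainARN": "arn:aws:es:us-east-1:123456789012:domain/my-domain",
        "IndexName": "app-logs",
        "IndexRotationPeriod": "OneDay",
        "BufferingHints": {"IntervalInSeconds": 300, "SizeInMBs": 5},
        "S3BackupMode": "FailedDocumentsOnly",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
            "BucketARN": "arn:aws:s3:::my-backup-bucket",
        },
    },
)
```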
AWS Glue
· A fully-managed, pay-as-you-go, extract, transform, and load (ETL) service
· AWS Glue can crawl data in different AWS Regions!
· A data processing unit (DPU) is a relative measure of processing power that consists of vCPUs and memory. To improve the job execution time, you can enable job metrics in AWS Glue to estimate the number of data processing units (DPUs) that can be used to scale out an AWS Glue job
· If the job execution time is too long, modify the job properties and enable job metrics to evaluate the required number of DPUs, then set the maximum capacity parameter to a higher value
· Standard worker type: The Standard worker type has a 50 GB disk and 2 executors.
· AWS Glue tracks data that has already been processed during a previous run of an ETL job by having a job bookmark. Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data. With job bookmarks, you can process new data when rerunning on a scheduled interval
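A minimal sketch of creating a bookmark-enabled Glue job with boto3; the job name, role, and script location are placeholders:

```python
import boto3

glue = boto3.client("glue")

# '--job-bookmark-option' controls bookmark behaviour:
#   job-bookmark-enable | job-bookmark-disable | job-bookmark-pause
glue.create_job(
    Name="daily-etl",  # hypothetical job name
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-scripts-bucket/daily_etl.py",
        "PythonVersion": "3",
    },
    DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"},
    WorkerType="Standard",  # Standard worker: 50 GB disk, 2 executors
    NumberOfWorkers=10,
)
```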
AWS Glue Data Catalog
· AWS Glue Data Catalog can serve as the Apache Hive metastore
· AWS Glue resource policies (and IAM roles) can be used to control access to Data Catalog resources
· The following are other reasons why you might want to manually create catalog tables and specify catalog tables as the crawler source:
· You want to choose the catalog table name and not rely on the catalog table naming algorithm.
· You want to prevent new tables from being created in the case where files with a format that could disrupt partition detection are mistakenly saved in the data source path
AWS Glue Data Catalog is stale
· implement an automated task that will allow the AWS Glue crawler to update the schema in real-time
· Set up an Amazon S3 s3:ObjectCreated:* event trigger and an AWS Lambda function that invokes the AWS Glue crawler.
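A minimal Lambda handler sketch for that trigger, assuming a crawler named raw-data-crawler (hypothetical):

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    """Triggered by the s3:ObjectCreated:* event; starts the crawler so the
    Data Catalog schema stays in sync with newly arrived objects."""
    try:
        glue.start_crawler(Name="raw-data-crawler")  # hypothetical crawler name
    except glue.exceptions.CrawlerRunningException:
        # Crawler is already running; the new objects are picked up on its next run.
        pass
    return {"status": "ok"}
```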
Amazon Redshift
· Doesn’t support upsert (can’t upsert to remove duplicates)
· Instead, use a staging table and the DynamicFrameWriter class in AWS Glue to replace the existing rows in the Redshift table before persisting the new data (see the staging-table sketch at the end of this section)
· Amazon Redshift gives you the flexibility to execute queries within the console or connect SQL client tools, libraries, or Business Intelligence tools
· workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won’t get stuck in queues behind long-running queries
· Can use a temporary staging table for transformation
· UNLOAD command: saves Redshift query results to Amazon S3 for occasional analysis
· The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from files in an Amazon S3 bucket. Split your data into files so that the number of files is a multiple of the number of slices in your cluster, and compress the files by using Gzip
· Classic resize provisions a new cluster and transfers data to it, while elastic resize does not. Both are manual methods; currently there is no automatic resizing
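A sketch of the staging-table pattern via the Redshift Data API (boto3 redshift-data); the cluster, table, bucket, and role names are placeholders, and the batch runs as a single transaction:

```python
import boto3

rsd = boto3.client("redshift-data")

# Staging-table merge: COPY gzip'd, pre-split files into a staging table, then
# delete the rows being replaced and insert the new ones.
statements = [
    "CREATE TEMP TABLE stage (LIKE target_table);",
    """COPY stage
       FROM 's3://my-bucket/incoming/part'
       IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
       GZIP
       FORMAT AS CSV;""",
    "DELETE FROM target_table USING stage WHERE target_table.id = stage.id;",
    "INSERT INTO target_table SELECT * FROM stage;",
]

rsd.batch_execute_statement(
    ClusterIdentifier="analytics-cluster",  # hypothetical cluster name
    Database="dev",
    DbUser="awsuser",
    Sqls=statements,
)
```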
Redshift Spectrum
· to join external Amazon S3 tables with tables that reside on the Amazon Redshift cluster.
· used to efficiently query and retrieve structured or semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables
· common practice is to partition the data based on time
Kinesis Agent / Kinesis Producer Library (KPL)
· Kinesis Agent: a stand-alone Java application you install on EC2 to collect and send data to a Kinesis data stream
· KPL: an easy-to-use, highly configurable library that helps you write to a Kinesis data stream
· Agent/KPL → Kinesis Data Streams → Kinesis Data Firehose → Elasticsearch
Amazon Kinesis Data Firehose
· is the easiest way to reliably load streaming data into data lakes, data stores, and analytics services. It can capture, transform, and deliver streaming data to:
· Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, generic HTTP endpoints, and service providers like Datadog, New Relic, MongoDB, and Splunk.
· Enrich logs with data from DynamoDB in near-real-time: use a Lambda transformation function (see the sketch below)
· Kinesis Data Firehose → (Lambda for transformation) → S3/Redshift/Elasticsearch
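A minimal sketch of a Firehose transformation Lambda; the enrichment step (e.g. the DynamoDB lookup) is only hinted at and the field names are hypothetical:

```python
import base64
import json

def lambda_handler(event, context):
    """Kinesis Data Firehose transformation Lambda: enrich each record and
    return it in the format Firehose expects (recordId / result / data)."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Hypothetical enrichment step, e.g. a DynamoDB lookup keyed on payload["user_id"]
        payload["enriched"] = True
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # Ok | Dropped | ProcessingFailed
            "data": base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```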
Kinesis Data Stream
· custom applications or streaming needs
· is a real-time data streaming service, which is to say that your applications should assume that data is flowing continuously through the shards in your stream. Resharding enables you to increase or decrease the number of shards in a stream to adapt to changes in the stream’s data flow rate
· ProvisionedThroughputExceededException error: increase the number of shards
· PutRecords operation sends multiple records to your stream per HTTP request, and the singular PutRecord operation sends records to your stream one at a time (a separate HTTP request is required for each record). You should prefer using PutRecords for most applications because it achieves higher throughput per data producer (see the sketch after this list)
· enhanced fan-out allows developers to scale up the number of stream consumers (applications reading data from a stream in real-time) by offering each stream consumer its own read throughput (good when dedicated throughput is needed!)
· Can use multiple AWS Lambda functions to process the Kinesis data stream using the Parallelization Factor feature. It allows for faster stream processing without the need to over-scale the number of shards while still guaranteeing the order of records processed.
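A minimal PutRecords sketch with boto3; the stream name and record shape are hypothetical, and PutRecords accepts up to 500 records per call:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

events = [{"user_id": i, "action": "click"} for i in range(100)]

# One HTTP request for the whole batch, instead of one PutRecord call per record.
response = kinesis.put_records(
    StreamName="clickstream",  # hypothetical stream name
    Records=[
        {
            "Data": json.dumps(e).encode("utf-8"),
            "PartitionKey": str(e["user_id"]),  # spreads records across shards
        }
        for e in events
    ],
)
print("Failed records:", response["FailedRecordCount"])
```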
Kinesis is throttling the write requests
· Use random partition keys and adjust accordingly to distribute the hash key space evenly across shards.
· Use the UpdateShardCount API in Amazon Kinesis to increase the number of shards in the data stream
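A one-call sketch of the UpdateShardCount API via boto3; the stream name and target count are placeholders, and UNIFORM_SCALING is currently the only supported scaling type:

```python
import boto3

kinesis = boto3.client("kinesis")

# Increase the number of shards to relieve write throttling.
kinesis.update_shard_count(
    StreamName="clickstream",  # hypothetical stream name
    TargetShardCount=8,
    ScalingType="UNIFORM_SCALING",
)
```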
Amazon Kinesis Data Analytics
· is the easiest way to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real-time
· Amazon MSK/Kinesis Data Streams → Kinesis Data Analytics
· To detect the anomalies in your data stream, you can use the RANDOM_CUT_FOREST function
· Staggered windows are the recommended way to aggregate data
Kinesis Client Library (KCL)
· For developing custom consumer applications
Amazon QuickSight
· To successfully connect Amazon QuickSight to the Amazon S3 buckets used by Athena, make sure that you authorized Amazon QuickSight to access the S3 account. It’s not enough that you, the user, are authorized. Amazon QuickSight must be authorized separately.
· ML-powered anomaly detection
· ML-powered forecasting
· Autonarratives
· cluster security group that contains an inbound rule authorizing access from the appropriate IP address range for Amazon QuickSight in the needed region
· QuickSight console allows for the refresh of SPICE data on a schedule e.g. daily
· not capable of providing near-real-time data. For that, use ES with Kibana
· Amazon QuickSight Enterprise edition:
· supports both AWS Directory Service for Microsoft AD and AD Connector (AD Connector is a directory gateway with which you can redirect directory requests to your on-premises Microsoft Active Directory without caching any information in the cloud)
· can restrict access to a dataset by configuring row-level security (RLS) on it
Manifest file
· Faster loading because (1) files not listed in the manifest are excluded and (2) the COPY command loads the data in parallel from multiple files
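A sketch of writing a COPY manifest to S3 with boto3; the bucket, keys, and table names are placeholders:

```python
import json
import boto3

s3 = boto3.client("s3")

# A COPY manifest lists exactly the files to load; 'mandatory' makes the load
# fail if a listed file is missing.
manifest = {
    "entries": [
        {"url": "s3://my-bucket/incoming/part-0000.gz", "mandatory": True},
        {"url": "s3://my-bucket/incoming/part-0001.gz", "mandatory": True},
    ]
}

s3.put_object(
    Bucket="my-bucket",
    Key="manifests/load.manifest",
    Body=json.dumps(manifest).encode("utf-8"),
)

# Then, in Redshift:
#   COPY target_table FROM 's3://my-bucket/manifests/load.manifest'
#   IAM_ROLE '...' GZIP FORMAT AS CSV MANIFEST;
```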
S3DistCp
· is an extension of DistCp with optimizations to work with AWS, notably Amazon S3
· on Amazon EMR
· to Move Data Between HDFS and Amazon S3
· By adding S3DistCp as a step in a job flow, you can efficiently copy large amounts of data from Amazon S3 into HDFS, where subsequent steps in your EMR clusters can process it
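A sketch of adding S3DistCp as an EMR step with boto3; the cluster ID and the S3/HDFS paths are placeholders:

```python
import boto3

emr = boto3.client("emr")

# Add an S3DistCp step that copies data from S3 into the cluster's HDFS.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster id
    Steps=[{
        "Name": "copy-raw-data-to-hdfs",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "s3://my-bucket/raw/",
                "--dest", "hdfs:///raw/",
            ],
        },
    }],
)
```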
EMR
· a highly scalable big data platform that supports open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi (Incubating), and Presto
· Instance fleets don’t have an automatic scaling policy; only instance groups have this feature
· Block Public Access configuration is an account-level configuration that helps you centrally manage public network access to EMR clusters in a region
· EMRFS vs HDFS: HDFS is an implementation of the Hadoop FileSystem API that models POSIX file system behavior, while EMRFS stores data in Amazon S3, which is an object store, not a file system
Presto
· is a SQL query engine designed for interactive analytic queries over large datasets from multiple sources.
· supports both non-relational sources and relational databases (RDBMS)
Encryption
· If an audit trail of key usage is required, use customer master keys (CMKs) in AWS KMS
Amazon Managed Streaming for Apache Kafka (Amazon MSK)
· can be used to deliver the messages with very low latency
Apache Spark
· The speed advantages of Apache Spark come from loading data into immutable Spark DataFrames, which can be accessed repeatedly in memory.
· Spark DataFrames organize distributed data into columns. This makes summaries and aggregates much quicker to calculate.
Amazon S3
· Load only the data needed from an object using Amazon S3 Select
· S3 Select operates on only one object, while Athena runs queries across multiple paths and the files within them (see the sketch at the end of this section)
· Amazon S3 Glacier with expedited retrieval: 1 to 5 minutes
· can’t query a GZIP-compressed CSV file using Amazon S3 Glacier Select
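A sketch of S3 Select with boto3 (this is S3 Select, not Glacier Select, so GZIP-compressed CSV works); the bucket, key, and column names are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# S3 Select returns only the matching rows/columns of a single object, so the
# application never downloads the full file.
resp = s3.select_object_content(
    Bucket="my-bucket",
    Key="logs/2021/01/events.csv.gz",
    ExpressionType="SQL",
    Expression="SELECT s.user_id, s.amount FROM s3object s WHERE CAST(s.amount AS FLOAT) > 100.0",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "GZIP"},
    OutputSerialization={"CSV": {}},
)

# The response payload is an event stream; collect the 'Records' events.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))
```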
AppFlow
· fully managed integration service that enables you to securely transfer data between Software-as-a-Service (SaaS) applications like Salesforce, Zendesk, Slack, and ServiceNow, and AWS services like Amazon S3 and Amazon Redshift