06 - Analytics
Amazon Athena
- Amazon Athena is an interactive query service that utilizes Schema-On-Read, allowing you to run ad-hoc SQL like queries on data from a range of sources
- Athena is used to query large dataset (structured, semi-structured, and unstructured) stored in S3 with infrequent access pattern
- You are charged for compute time only. You don’t need to maintain separate dataset for Athena, it can directly access S3 bucket
Amazon EMR
- EMR is tool for large-scale parallel processing of big data and other large data workloads
- It is based on the Apache Hadoop framework and is delivered as a managed cluster using EC2 instances
- It is used for huge-scale log analysis, indexing, machine learning, financial analysis, simulations, bio-informatics and many other large-scale applications
- EMR cluster have zero or more core nodes, which are managed by the master node. They run tasks and manage data for HDFS
- Data can be input from and output to S3. Intermediate data can be stored using HDFS in the cluster or EMRFS using S3.
Amazon Kinesis
- Streaming service designed to ingest large amounts of data from hundreds, thousands or even millions of producers
- Scalable and Resilient
- Consumers can access a rolling window of that data, or it can be stored in persistent storage of database products
Kinesis Data Stream
- A Kinesis data stream can be used to collect, process, and analyze a large amount of incoming data
- Storage for all incoming data within a 24 hour default window, which can be increased to seven days for an additional charge
- Kinesis Data records (The basis entity written to and read from Kinesis stream, a data record can be up to 1 MB in size) are added by producers and read by consumers
Kinesis Data Firehose
- Reliably load streaming data into data lakes, data stores and analytics tools
- It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk
- It enables near real-time analytics with existing business intelligence tools and dashboards you’re already using today
- Kinesis Data Streams can be used as the source(s) to Kinesis Data Firehose
- Pay for only the data ingested
Kinesis Data Analytics
- Process and analyze real-time, streaming data
- Can use standard SQL queries to process Kinesis data streams
- A Kinesis Data Analytics application consists of three components:
- Input – the streaming source for your application
- Application code – a series of SQL statements that process input and produce output
- Output – one or more in-application streams to hold intermediate results
Kinesis Video Analytics
- Securely ingests and stores video and audio encoded data to consumers such as SageMaker, Rekognition or other services to apply Machine Learning and Video processing