Kinesis 101
What is streaming data
Streaming data is data that is generated continuously by thousands of data sources, which typically send the data records simultaneously and in small sizes (on the order of kilobytes).
- Purchases from online stores (think amazon.com)
- Stock Prices
- Game data (as the gamer plays)
- Social Network Data
- Geospatial data (think uber.com)
- IoT sensor data
What is Kinesis
Amazon Kinesis is a platform on AWS to which you send your streaming data. Kinesis makes it easy to load and analyze streaming data, and it also gives you the ability to build your own custom applications for your business needs.
What are the core Kinesis Services
- Kinesis Streams
- Kinesis Firehose
- Kinesis Analytics
On the exam, be able to tell which service to use in which scenario
Kinesis Streams
- Producers of data
- EC2 instances, phones, laptops, IoT devices
- Producers send data to Kinesis Streams
- Stores the data by default for 24 hours
- Can be increased to 7 days
- Increase using the IncreaseStreamRetentionPeriod operation
- Decrease using the DecreaseStreamRetentionPeriod operation
- The request syntax for both operations includes the stream name and the retention period in hours
- Check the current retention period of a stream by calling the DescribeStream operation (see the boto3 sketch after this list)
- Stored in shards
- Once data is stored in shards, you have a fleet of EC2 instances called consumers
- Consumers take data from the shards and turn it into something useful
- Aggregating
- Sentiment analysis on social media feeds
- Predicting the stock market/cost of commodities
- Once the data is converted, store it somewhere
- DynamoDB, S3, EMR, Redshift
- Kinesis Streams consist of shards
- Each shard supports 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second
- Each shard supports up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys)
- The data capacity of your stream is a function of the number of shards that you specify for your stream
- The total capacity of the stream is the sum of the capacities of its shards (see the producer sketch below)
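A minimal sketch of managing the retention window from Python with boto3; the stream name my-stream is a placeholder:

```python
import boto3

kinesis = boto3.client("kinesis")

# Check the current retention period (defaults to 24 hours)
description = kinesis.describe_stream(StreamName="my-stream")
current_hours = description["StreamDescription"]["RetentionPeriodHours"]
print(f"Current retention: {current_hours} hours")

# Increase retention to 7 days (168 hours)
kinesis.increase_stream_retention_period(
    StreamName="my-stream",
    RetentionPeriodHours=168,
)

# Later, drop it back down to the 24-hour default
kinesis.decrease_stream_retention_period(
    StreamName="my-stream",
    RetentionPeriodHours=24,
)
```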
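On the producer side, a hedged sketch of writing a record with boto3. The partition key determines which shard a record is hashed to, which is why the read/write limits above are per shard; the stream name, payload, and key scheme here are illustrative only:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Each record is routed to a shard by hashing its partition key.
# A high-cardinality key (e.g. a user ID) spreads writes evenly across shards.
response = kinesis.put_record(
    StreamName="my-stream",  # placeholder stream name
    Data=json.dumps({"item": "book", "price": 12.99}).encode("utf-8"),
    PartitionKey="user-4711",
)
print(response["ShardId"], response["SequenceNumber"])
```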
Kinesis Firehose
- Producers
- EC2 instances, phones, laptops, IoT devices
- Send data into Kinesis Firehose
- Don't have to worry about shards or managing streams
- Completely automated
- Can analyze data using Lambda in real time (optional)
- Once data is analyzed, send it over directly to S3
- No automatic data retention window
- When data comes in, it is either analyzed using Lambda (optional) or sent directly to S3 or other destinations
- Redshift
- Data goes to S3 first, then is copied into Redshift
- Elasticsearch cluster
- Amazon Kinesis Data Firehose is the easiest way to load streaming data into data stores and analytics tools. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk.
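A hedged sketch of a producer putting a record into a Firehose delivery stream with boto3; it assumes a delivery stream (with its S3/Redshift/Elasticsearch destination and optional Lambda transform) already exists, and the name is a placeholder:

```python
import json
import boto3

firehose = boto3.client("firehose")

# Firehose handles buffering, batching, and delivery to the destination;
# there are no shards or consumers to manage.
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",  # placeholder name
    Record={
        "Data": json.dumps({"event": "purchase", "amount": 42}).encode("utf-8") + b"\n"
    },
)
```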
Kinesis Analytics
- Producers send to streams or firehose
- Allows us to use SQL queries to analyze data in Kinesis Streams or Firehose
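A hedged sketch of creating a Kinesis Analytics (SQL) application with boto3; the application name, column names, and SQL are illustrative, and the input/output wiring that attaches the application to an actual stream or Firehose destination is omitted:

```python
import boto3

analytics = boto3.client("kinesisanalytics")

# The application code is plain SQL run against the attached input stream
# (SOURCE_SQL_STREAM_001 is the default in-application input stream name).
# The Inputs/Outputs parameters that wire up real streams are omitted here.
analytics.create_application(
    ApplicationName="purchase-passthrough",  # placeholder name
    ApplicationCode=(
        'CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" ("item" VARCHAR(16), "price" DOUBLE);\n'
        'CREATE OR REPLACE PUMP "STREAM_PUMP" AS\n'
        '  INSERT INTO "DESTINATION_SQL_STREAM"\n'
        '  SELECT STREAM "item", "price" FROM "SOURCE_SQL_STREAM_001";\n'
    ),
)
```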
Exam Tips
- Know the difference between Kinesis Streams and Kinesis Firehose
- Choose the most relevant Service
- Questions about shards: Kinesis Streams
- Questions about analyzing data automatically using Lambda and not worrying about data consumers: Firehose
- Understand what Kinesis Analytics is
Extras
- If you want to make sure your Kinesis stream can scale over time due to increased volume:
- Add shards
- Use a partition key that takes a greater number of distinct values, so records spread evenly across the shards
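A minimal sketch of adding shards with boto3's UpdateShardCount; the stream name and target count are placeholders:

```python
import boto3

kinesis = boto3.client("kinesis")

# Double the stream's capacity by scaling from (say) 2 shards to 4.
# UNIFORM_SCALING splits/merges shards evenly across the hash key space.
kinesis.update_shard_count(
    StreamName="my-stream",  # placeholder name
    TargetShardCount=4,
    ScalingType="UNIFORM_SCALING",
)
```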
Kinesis Adapter
Using the Kinesis Adapter is the recommended way to consume streams from DynamoDB. The DynamoDB Streams API is intentionally similar to that of Kinesis Streams, a service for real-time processing of streaming data at massive scale. You can write applications for Kinesis Streams using the Kinesis Client Library (KCL). The KCL simplifies coding by providing useful abstractions above the low-level Kinesis Streams API. As a DynamoDB Streams user, you can leverage the design patterns found within the KCL to process DynamoDB Streams shards and stream records. To do this, you use the DynamoDB Streams Kinesis Adapter. The Kinesis Adapter implements the Kinesis Streams interface, so that the KCL can be used for consuming and processing records from DynamoDB Streams.
Using Lambda is not recommended for real-time data analytics; the Kinesis Adapter is better suited for that.
Using the DynamoDB Streams Kinesis Adapter to Process Stream Records
Lambda and Kinesis
You can use an AWS Lambda function to process records in an Amazon Kinesis data stream. With Kinesis, you can collect data from many sources and process them with multiple consumers. Lambda supports standard data stream iterators and HTTP/2 stream consumers. Lambda reads records from the data stream and invokes your function synchronously with an event that contains stream records. Lambda reads records in batches and invokes your function to process records from the batch.
For Lambda functions that process Kinesis or DynamoDB streams, the number of shards is the unit of concurrency. If your stream has 100 active shards, there will be at most 100 Lambda function invocations running concurrently, because Lambda processes each shard's events in sequence.
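A minimal sketch of a Python Lambda handler for a Kinesis event source; the fields inside the decoded payload are assumptions, but the overall event shape (a Records list with base64-encoded kinesis.data) is what Lambda passes for Kinesis triggers:

```python
import base64
import json

def lambda_handler(event, context):
    """Invoked synchronously by Lambda with a batch of Kinesis records."""
    for record in event["Records"]:
        # Kinesis record data arrives base64-encoded
        payload = base64.b64decode(record["kinesis"]["data"])
        item = json.loads(payload)
        # Placeholder processing: in practice, aggregate, enrich, or forward
        print(record["kinesis"]["partitionKey"], item)
    return {"recordsProcessed": len(event["Records"])}
```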