Time Series Insights: A real case capacity analysis

Introduction

I’ve been blogging about Time Series Insights (TSI), and I’ll continue to do so. Let me start with a quick introduction to this Azure service.

Time Series Insights (TSI) is a fully fledged Azure service built specifically for IoT scenarios. It includes storage (so it’s a database) and visualization (it’s a ready-to-use dashboard), and it’s near real time. It’s an end-to-end solution that lets you analyze data from storage through analytics, offering query capabilities together with a powerful and flexible user interface.

Now, enough of these marketing sentences (true as they are)!

Some enterprises and system integrators join efforts to build customized solutions for similar scenarios (e.g., predictive maintenance). They typically use Redis, Hadoop HDFS, InfluxDB, Elasticsearch, or other storage and database technologies (often specialized per industry). These solutions also require data cleansing and standardization, in addition to support for time series data streams. Now the best part: TSI automatically infers the schema of your incoming data, which means it requires no upfront data preparation. It also supports querying data across assets and time frames, with a retention period of up to 400 days.

Sweet! This is a tool that is just perfect for IoT solutions, specifically tackling large-scale historian-like solutions.

Capacity

As with any IoT project, especially a large-scale one, planning is key. With TSI, capacity planning should be done early and be based on your expected data ingress rate.

This means that understanding data ingress and retention is very important. Let’s dive into it:

Capacity \ SKU                | S1                             | S2
Storage per unit [1]          | 30 GB or 30 million events [2] | 300 GB or 300 million events [2]
Daily ingress per unit [1][3] | 1 GB or 1 million events [2]   | 10 GB or 10 million events [2]
Maximum retention [4][5]      | 13 months [4]                  | 13 months [4]
Maximum number of units [6]   | 10                             | 10

[1] Ingress and total storage are measured by the number of events or data size, whichever comes first.
[2] An event is a single unit of data with a timestamp. For billing purposes, we count events in 1-KB blocks. For example, a 0.8-KB actual event is billed as one event, but a 2.6-KB event is billed as three events. The maximum size of an actual event is 32 KB.
[3] Ingress is measured per minute. S1 can ingress up to 720 events/minute/unit and S2 can ingress 7,200 events/minute/unit.
[4] The data is retained in Time Series Insights based on the selected retention days or maximum limits.
[5] Retention is configurable in the Azure portal. The longest allowable retention period is a rolling year of 12 months + 1 month, which is defined as 400 days.
[6] An environment can be scaled up to 10 times by adding more units.
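
The 1-KB block rule in the footnotes is easy to get wrong when estimating capacity, so here is a minimal Python sketch of it. The function name `billed_events` is my own for illustration and is not part of any TSI API. It simply rounds the event size up to the next whole kilobyte and rejects events above the 32 KB maximum.

```python
import math

MAX_EVENT_KB = 32  # maximum size of a single event, per the footnotes above


def billed_events(size_kb: float) -> int:
    """Return how many billable events one actual event of size_kb counts as."""
    if size_kb <= 0:
        raise ValueError("event size must be positive")
    if size_kb > MAX_EVENT_KB:
        raise ValueError("event exceeds the 32 KB maximum")
    # Events are counted in 1-KB blocks, so round up to the next whole KB.
    return math.ceil(size_kb)


print(billed_events(0.8))  # 1
print(billed_events(2.6))  # 3
```

When sizing an environment, it is this billed count, not the raw number of messages, that should be compared with the per-unit limits.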

After you configure your event source (IoT Hub or Event Hub), TSI always starts ingesting from the oldest events in the event source (FIFO). TSI can ingest approximately 1 million events per day for every S1 unit provisioned, and 10 million events per day for every S2 unit provisioned.

If you plan to upload historical data to TSI, it’s useful to increase the number of units provisioned for a brief period of time to allow TSI to ingest this historical data. The supported event sources can store data for up to 7 days.

Remember! TSI will always start ingress from the oldest event in the event source.

When the amount of incoming data exceeds your environment’s configuration, you may experience latency or throttling in TSI.

Throttling

Let’s start with a simple example: if you have five million events in an event source when you connect it to a single-unit S1 TSI environment, TSI will read approximately one million events per day. At first glance, it might look as though TSI is experiencing five days of latency. In reality, the TSI environment is being throttled.

If you have old events in your event source, you can take one of two approaches:

  • Change your event source’s retention limits so that old events you don’t want to appear in TSI are removed;
  • Provision a larger environment size (in terms of number of units) to increase the throughput of old events.

Using the example above, if you increased that same S1 environment to five units for one day, the environment should catch up to the present within a day. If your steady-state event production is 1M events/day or fewer, you can then reduce the capacity of the environment back down to one unit after it has caught up.
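
The catch-up reasoning above can be sketched as a small calculation. This is a simplified model, not how TSI schedules ingestion internally: it assumes the per-unit daily limits quoted earlier (1M events for S1, 10M for S2) and a constant rate of newly arriving events. The name `days_to_catch_up` is hypothetical.

```python
import math

# Approximate daily ingress per provisioned unit, as quoted in this article.
DAILY_EVENTS_PER_UNIT = {"S1": 1_000_000, "S2": 10_000_000}


def days_to_catch_up(backlog_events, sku, units, new_events_per_day=0):
    """Estimate the days needed to drain a backlog of old events."""
    capacity = DAILY_EVENTS_PER_UNIT[sku] * units
    # Only the capacity left after absorbing new events drains the backlog.
    spare = capacity - new_events_per_day
    if spare <= 0:
        return math.inf  # the environment never catches up
    return backlog_events / spare


print(days_to_catch_up(5_000_000, "S1", 1))  # 5.0
print(days_to_catch_up(5_000_000, "S1", 5))  # 1.0
```

Under this model, if 1M new events per day keep arriving while the backlog drains, five S1 units would need 1.25 days rather than one, and a single unit would never catch up.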

The throttling limit is enforced based on the environment’s SKU type and capacity. All event sources in the environment share this capacity. If your IoT Hub or Event Hub event source pushes data beyond the enforced limits, you will see throttling and lag.

The best way to identify and monitor throttling is through ingress metrics in both TSI and your event source. If you see a value for Ingress Received Message Time Lag or Ingress Received Message Count Lag, there’s a throttling problem. TSI’s environment metrics include:

TSI Metric                         | Description
Ingress Received Bytes             | Count of raw bytes read from event sources. Raw count usually includes the property name and value.
Ingress Received Invalid Messages  | Count of invalid messages read from all Azure Event Hubs or Azure IoT Hub event sources.
Ingress Received Messages          | Count of messages read from all Event Hubs or IoT Hubs event sources.
Ingress Stored Bytes               | Total size of events stored and available for query. Size is computed only on the property value.
Ingress Stored Events              | Count of flattened events stored and available for query.
Ingress Received Message Time Lag  | Difference between the time that the message is enqueued in the event source and the time it is processed in Ingress.
Ingress Received Message Count Lag | Difference between the sequence number of last enqueued message in the event source partition and sequence number of message being processed in Ingress.
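
As a rough illustration of how the two lag metrics could be turned into a check, here is a small Python sketch. It assumes you have already read the metric values from Azure (for example, by copying them from the portal); fetching them is not shown. The helper names `looks_throttled` and `lag_is_growing` are hypothetical.

```python
def looks_throttled(time_lag_seconds, count_lag):
    """Flag throttling when either lag metric reports a non-zero value."""
    return time_lag_seconds > 0 or count_lag > 0


def lag_is_growing(samples):
    """Return True when every lag sample is larger than the one before it."""
    return len(samples) > 1 and all(b > a for a, b in zip(samples, samples[1:]))


print(looks_throttled(0, 0))         # False
print(looks_throttled(120, 0))       # True
print(lag_is_growing([10, 20, 35]))  # True
```

A lag that keeps growing across samples suggests the environment is falling further behind, whereas a shrinking lag suggests it is catching up after a burst or a backlog.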

Real Case scenario

To put what we just went through to use, nothing beats a real-case scenario. You can use it as an example when assessing or planning your TSI environment.

The business scenario involves an enterprise collecting huge amounts of data about vessels navigating around the world. They have the usual requirements around storage and near-real-time querying. Working with maritime operators, they need to analyze the data to detect and prevent anomalies.

Once we provisioned Time Series Insights, we noticed that data was missing or didn’t match up correctly, specifically the precise positioning of vessels. We started the capacity analysis immediately:

The analysis covered a total of 15 days (360 hours). The provisioned TSI environment was an S1 SKU with 1 capacity unit, the minimum possible.

After looking at the new metrics available in Azure for the TSI resource, the best approach was to write them down:

TOTAL (15 Days = 360 hours)

  • Ingress Received Messages: 9.13M = 9,130,000
  • Ingress Stored Events: 31.71M = 31,710,000
  • Ingress Received Bytes: 11.8 GB = 11,800,000 KB
  • Ingress Stored Bytes: 5 GB = 5,000,000 KB

AVERAGE /hour

  • Ingress Received Messages: 9,130,000 / 360 = 25,361.1 messages
  • Ingress Received Bytes: 11,800,000 KB / 360 = 32,777.8 KB
  • Message Size = 32,777.8 KB / 25,361.1 messages ≈ 1.292 KB
  • Ingress Stored Events: 31,710,000 / 360 = 88,083.3 events
  • Ingress Stored Bytes: 5,000,000 KB / 360 = 13,888.9 KB
  • Event Size = 13,888.9 KB / 88,083.3 events ≈ 0.158 KB
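
To make the arithmetic easy to re-run with your own numbers, here is a short Python sketch that derives the hourly averages and the average sizes from the 15-day totals. The totals come from the metrics above. The variable names are mine, and the events-per-message ratio is an extra derived figure that was not part of the original notes.

```python
HOURS = 15 * 24  # 360 hours in the analysis window

# 15-day totals read from the TSI metrics (sizes in KB, 1 GB = 1,000,000 KB).
totals = {
    "received_messages": 9_130_000,
    "stored_events": 31_710_000,
    "received_kb": 11_800_000,
    "stored_kb": 5_000_000,
}

per_hour = {name: value / HOURS for name, value in totals.items()}
message_size_kb = totals["received_kb"] / totals["received_messages"]
event_size_kb = totals["stored_kb"] / totals["stored_events"]
# Each received message is flattened into several stored events.
events_per_message = totals["stored_events"] / totals["received_messages"]

for name, value in per_hour.items():
    print(f"{name}: {value:,.1f} per hour")
print(f"message size: {message_size_kb:.3f} KB")
print(f"event size: {event_size_kb:.3f} KB")
print(f"events per message: {events_per_message:.2f}")
```

Note that the sizes come out in KB, not bytes: an average message is a little over 1 KB and is flattened into roughly three and a half stored events.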

Starting with storage: it was not an issue, as storage usage was well below the maximum storage capacity of a single S1 unit: 30 GB (30 million events) per month.

[Figure: Ingress Received Bytes and Ingress Stored Bytes (sum)]

Shifting the analysis to ingestion throughput, consider the volume of messages and events:

[Figure: Message / Events ingest metrics (sum)]

Looking at the TSI thresholds for 1 capacity unit of the S1 SKU:

  • 1 million events per day
  • or about 41,666 events per hour
  • or about 694 events per minute
  • or about 11.6 events per second
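
The thresholds above are just the daily limit divided down, which the following Python sketch reproduces. It also compares the observed hourly average of Ingress Stored Events with the hourly limit. That comparison assumes stored events are the right quantity to hold against the limit, which is my reading rather than something the metrics state. The function name `per_unit_limits` is hypothetical.

```python
S1_DAILY_EVENTS_PER_UNIT = 1_000_000  # approximate S1 limit quoted above


def per_unit_limits(daily_events):
    """Break a daily event limit down into smaller time windows."""
    return {
        "per_day": daily_events,
        "per_hour": daily_events / 24,
        "per_minute": daily_events / (24 * 60),
        "per_second": daily_events / (24 * 60 * 60),
    }


limits = per_unit_limits(S1_DAILY_EVENTS_PER_UNIT)
for window, value in limits.items():
    print(f"{window}: {value:,.1f}")

# Observed average from the 15-day analysis (31.71M stored events / 360 h).
observed_stored_events_per_hour = 31_710_000 / 360
ratio = observed_stored_events_per_hour / limits["per_hour"]
print(f"observed / limit: {ratio:.2f}")
```

A ratio above 1 means that, on average, the environment was asked to store more events per hour than a single S1 unit is rated for.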

Always check your event source’s data volume and ingestion throughput. In this real scenario, Event Hub was the single event source for TSI. For a complete analysis, we must also evaluate the Event Hub’s data volume and its retention capacity.

Event Hub Metric  | Description
Incoming Messages | The number of events or messages sent to Event Hubs over a specified period.
Outgoing Messages | The number of events or messages retrieved from Event Hubs over a specified period.
Incoming Bytes    | The number of bytes sent to the Azure Event Hubs service over a specified period.
Outgoing Bytes    | The number of bytes retrieved from the Azure Event Hubs service over a specified period.

This particular Event Hub was provisioned with the Standard SKU and 16 throughput units, with auto-inflate enabled and an upper limit of 20.

The main goal is always to measure the volume of data being sent into TSI. Here are the metrics collected from the Event Hub Metrics tab in the Azure portal:

TOTAL (15 Days = 360 hours)

  • Incoming Messages: 26.5M = 26,500,000
  • Incoming Bytes: 60.6 GB = 60,600,000 KB
  • Outgoing Messages: 343.96M = 343,960,000
  • Outgoing Bytes: 750.6 GB = 750,600,000 KB

AVERAGE /hour

  • Incoming Messages: 26,500,000 / 360 = 73,611.1 messages/hour = 20.4 messages/sec
  • Incoming Bytes: 60,600,000 KB / 360 = 168,333.3 KB/hour (about 168.3 MB/hour) = 46.8 KB/sec
  • Outgoing Messages: 343,960,000 / 360 = 955,444.4 messages/hour = 265.4 messages/sec
  • Outgoing Bytes: 750,600,000 KB / 360 = 2,085,000 KB/hour (2,085 MB/hour) = 579.2 KB/sec

[Figure: Message metrics - Incoming Messages]

[Figure: Message metrics - Outgoing Messages]

[Figure: Message metrics - Outgoing Bytes]

[Figure: Request metrics]

Looking at the Event Hub thresholds for the Standard SKU:

  • Max event size: 256 KB
  • Throughput units: 16 x (1,000 events/sec or 1 MB/sec ingress, 2,000 events/sec or 2 MB/sec egress) = (16K events/sec or 16 MB/sec ingress, 32K events/sec or 32 MB/sec egress)
  • Maximum retention: 7 days (1 day included)
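
To see how far the Event Hub was from these thresholds, the per-second averages can be set against the aggregate limits. The Python sketch below does that using the per-throughput-unit figures quoted in this article and the 15-day totals collected earlier. The variable names are mine, and 1 MB is treated as 1,000 KB to match the unit conversions used above.

```python
THROUGHPUT_UNITS = 16

# Per-throughput-unit limits as quoted in this article (1 MB taken as 1,000 KB).
PER_TU = {
    "ingress_events_per_sec": 1_000,
    "ingress_kb_per_sec": 1_000,
    "egress_events_per_sec": 2_000,
    "egress_kb_per_sec": 2_000,
}

SECONDS = 15 * 24 * 3600  # length of the analysis window

# Per-second averages derived from the 15-day Event Hub totals.
observed = {
    "ingress_events_per_sec": 26_500_000 / SECONDS,
    "ingress_kb_per_sec": 60_600_000 / SECONDS,
    "egress_events_per_sec": 343_960_000 / SECONDS,
    "egress_kb_per_sec": 750_600_000 / SECONDS,
}

for name, value in observed.items():
    limit = PER_TU[name] * THROUGHPUT_UNITS
    print(f"{name}: {value:,.1f} of {limit:,} ({value / limit:.2%})")
```

These are averages over the whole window, so they say nothing about short bursts, but they give a first indication of whether the event source itself is anywhere near its limits.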

Conclusions

The perceived latency in TSI’s data ingestion was due to the high volume of data throughput and the existence of historical data when TSI was provisioned. This caused throttling and an actual ingestion throughput of 2.5M distinct data points when ~10M were expected.

The recommended TSI capacity for the expected data ingress rate would have been 2x capacity S2. However, because there was historical data, the best decision would have been to temporarily configure TSI with a 2x capacity S2.

Note: For now, you cannot dynamically change the provisioned SKU in TSI. If you provisioned an S1 SKU and require an S2 SKU, a new TSI environment must be created, and data ingestion will begin again (from the oldest event in your event source). You can, however, dynamically change the capacity of your SKU from 1 to 10 units (max). We expect to bring the option of changing SKUs in the future, as it offers great flexibility to customers.