Data Monitoring In StreamSets

Author(s): Daksh Trehan, Sachin Shrikant Gotal, and Sumit Chahal


StreamSets is a dataflow performance management tool. It is a form of Message queue category that helps in end-to-end data integration to conquer the motive of building, monitoring, and managing smart data pipelines to deliver continuous data for DataOps.


Data is like oxygen for technology giants, and hence, data flow feeds vital business processes and applications. Data pipelines are like veins that help carry data from source to destination and to perform insightful operations. And the functionality of the veins must not hinder and rather be monitored frequently.


For constant monitoring and easy amendments in our pipelines, StreamSets provides us with certain kinds of rules:

  • Metric Rules: It helps the consumer by easy access to statistical properties of the pipeline such as record error count, successful transition rate, and pipeline idle time.

  • Data Rules: It provides information regarding data flow from one stage to another.

  • Data Drift Rules: It helps to track the changes in data and target relationships over time. Hence, making our model future-proof and robust!


Metric Rules

Metric Rules provide the user with real-time analytics regarding data pipelines. StreamSets provides the user with a real-time WebUI that helps to monitor the data flow and also to get alerts when these stats reach a certain custom threshold.


To get notified or to set alerts, we can enable default rules for certain conditions. And StreamSets will notify you once the conditions are met or exceeded.


You can create many default data rules for your pipelines. Also, users have the capabilities to create their own custom rules but from limited metric types.


Metric types determine which alert to trigger in custom conditions. For each metric ID, there is a metric element bound with it that determines what the metric is measuring. The metric element could be count, min, max, percentage, etc.


Some of the default metric types are:

Gauge: Gauge metric helps the user to understand input, output, or error records for the previous processed batch.

The gauge metric particularly works upon the following metric elements:

  • Last Batch Input, Output, or Error Records Counts

  • Last Batch Error Messages Count

  • Current Batch Age

  • Time in Current Stage

  • Time of Last Received Record

Counter: It helps the understand about input, output, or error records for the pipeline or a stage in the pipeline.

The counter metric particularly works upon:

  • Pipeline batch count.

  • The number of input records, output records, error records, or stage errors for the pipeline or a stage in the pipeline.

Histograms: It helps to determine different record types and stage errors in the pipeline.

It provides updates for the record per batch in Monitor mode.

Timer: It helps you provide batch processing timers for the pipeline or a stage in the pipeline.


Metric Conditions

Metric conditions help combine metric rules with a particular threshold so that the user could be notified about the change in data/relationship or find any anomaly in the pipeline.


Value(): It is useful to find a constant value about the current metric and put a threshold on it.

e.g. ${value () > 1000}, this will alert the user once the number of records exceed 1000 count.


time.now(): It returns the time(in ms) for current pipeline/machine. It usually helps to understand when and for what time our pipeline is sitting idle.

e.g. $ {time.now() -value() > 100000), this will alert once the pipeline is idle for 100000ms.


jvm:maxMemoryMB(): It returns with the memory(in MB) used by SDC for current batch. It helps to understand how much data is being consumed by our Data collector.

e.g. $ {value() > jvm:maxMemoryMB () * 0.80) }, this will alert the user when pipeline will consume more than 80% of Java Heap size allocated to the Data Collector.


Configuring a Metric Rule and Alert

There are some metric rules which are provided by StreamSets for each pipeline with some default values. We can select the rule and edit default values.



We can also create new rules by clicking on ‘CREATE NEW RULE’ in the top right corner.




Alert Text- Text that we want to receive once a specific condition meets.

Metric Type- Select the type of metric (discussed above) from the drop-down menu.

Metric ID- Select the type of ID from the drop-down menu. (List of ID’s depends on Metric Type)

Metric Element- Select the type of Element from the drop-down menu. (List of Elements depends on Metric Type)

Condition- We can set the values for the Rule.

Send Email- If we want to send the alert over email, we can select this check-box.

Once we set all the parameters, we can save the rule.



After saving the rule, it will appear in the list, and we can enable it by selecting it, as shown in the image above.


Data Rules

It passes the information regarding the data that is flowing from one stage to another. These are applicable for any link in the whole pipeline and can be incorporated into the above metrics and rules. In order to create a data rule, there is some familiarity with the data and data flow is required.


Configuring a Data Rule and Alert:

Step 1: Go to rules → Data rules → Create New Rule



Step 2: Choose the stream or stage on which you want the rule to apply. Enter the threshold value in percentage or in the count and enable alerts and email notifications to get the alerts.


Step 3: Run the pipeline. If the data rules are violated, and the threshold value is exceeded user will get an alert and a notification on email.



Data Drift Rules

Data Drift can be defined as a change in the distribution of data over time. The change could range from a baseline dataset to precise amends.

e.g., you are a Machine learning engineer and have developed a model that predicts the airfare for a particular route. For starters, let’s assume your model is able to attain a whopping accuracy of 95%. But, unfortunately, after a few months of successful deployment, the model started aging and lost its accuracy to 75%. This could be due to various reasons like delays in approval, sudden pandemic outrage, changes in target value. But that doesn’t mean that the model developed back was bad, a shift just happens (this is known as data drift)


Types of Data Drifts:

Covariate Shift: It defines the shift in independent variables. This could happen when the X (independent) and Y (dependent) variables relationship stays the same, but due to some environment filters/changes, the distribution of X changes.

e.g., the data inflow happened to contain an extra column due to the latest update in the application.


Prior Probability Shift: This happens when there is a shift in the dependent/target variable. This could happen when the X (independent) and Y (dependent) variables relationship stays the same, but due to some environment filters/changes, the distribution of Y changes.


Concept Shift: This is the case when the relationship between X(independent variable) and Y(dependent variable) changes. The amendments in the relationship have nothing to do with dependent and independent variables but mostly deal with the business idea.

These kinds of shifts are more likely to appear in data depending on times, such as time series models and sequential data models.

e.g., the company that was willing to develop a model on predicting airfare prices might drop the plan and go with the model predicting the airplane accuracies resulting in sudden target shift.

Different data drift rules available in StreamSets are:


drift:name(): It is used when the field name varies between two subsequent JSON records. It is valid on map and list-map data types.


drift:size(): It notifies when the number of fields varies between two subsequent JSON records. This could be used upon map, list-map, and list data typed variables.


drift:order(): It alerts the user when the order of fields varies between two subsequent JSON records. It could only be used on the list-map data type.


Configuring a Data Drift Rule and alert:

Create new rules by clicking on ‘CREATE NEW RULE’ in the top right corner.





Stream: Here, we select the component for which we want to apply the data drift rule.

Label: It is for the purpose of displaying the data drift rule.

Condition: Here, we feed the condition which we want to apply on the component with the help of the data drift rule.

Sampling Percentage: This is the percentage of total records used to apply the data drift rule.

Sampling records to retain: This is the number of records needed to keep in memory for displaying results for the data drift rule.

Alert text: This is the text that we want to get from the data drift rule.


Alerts in StreamSets

To stay informed about Data Rules/Metrics, we can enable alerts that will help us to stay informed about sudden discrepancies in the data flow/pipeline and easy pipeline management.

StreamSets can alert the user in the following ways:


Pipeline monitoring: This is an alert type that could be invoked from monitoring mode.


Webhooks: A webhook can be defined as a custom HTTP callback request that can be triggered once a certain condition is met. We can automate Webhooks to tasks as simple as sending us a message once the violation occurs.


Email: StreamSets allows the pipeline to send emails once a certain condition is met. Whenever there is any violation of data rules, the pipeline will send a custom email to every email address that could help to manage the pipeline optimally as soon as possible.


References:

[1] Rules & Alerts— StreamSets Docs

[2] Automated Data Drift Detection in StreamSets DataOps Platform

[3] Data Quality Checks with StreamSets using Drift Rules

[4] Data Quality Checks With StreamSets Using Drift Rules — DZone Big Data

14 views0 comments

Recent Posts

See All