{"id":6434,"date":"2022-06-14T05:09:19","date_gmt":"2022-06-14T05:09:19","guid":{"rendered":"https:\/\/tdengine.com\/?p=6434"},"modified":"2025-03-30T20:08:25","modified_gmt":"2025-03-31T03:08:25","slug":"tdengine-and-tremor-solution-for-system-monitoring-and-alerting","status":"publish","type":"post","link":"https:\/\/tdengine.com\/tdengine-and-tremor-solution-for-system-monitoring-and-alerting\/","title":{"rendered":"TDengine and Tremor Solution for System Monitoring and Alerting"},"content":{"rendered":"\n<p>Working with high-volume metrics is a challenging endeavor, but one that is becoming more and more essential in a world of microservices, serverless workflows, and IoT devices.<\/p>\n\n\n\n<p>Both TDengine and Tremor touch parts of this space, and together they create a highly effective solution and share a lot of design principles.<\/p>\n\n\n\n<p>Without diving too deep into what either of them does, TDengine is a high-performance <a href=\"https:\/\/tdengine.com\/what-is-a-time-series-database\/\">time-series database<\/a> (TSDB) using SQL as its query language. <a href=\"https:\/\/tremor.rs\" rel=\"noopener\">Tremor<\/a> is a high-performance event processing engine using, you might have guessed it, a SQL dialect for its configuration.<\/p>\n\n\n\n<p>There are many uses where these two technologies complement each other, including filtering, normalization, pre-aggregation, and dual writing, but the most interesting one, and the one we&#8217;re going to look at in this post, is alerting.<\/p>\n\n\n\n<p>Before we dive into the how, let&#8217;s talk about why this is such a challenging task. Two things are needed for alerting to work: real-time processing for triggering the alert and fast persistence to get the context for the alert.<\/p>\n\n\n\n<p>Without real-time processing of incoming events, alerting requires periodically fetching the state from the store, checking against the alerting condition, and then triggering the alert. 
This introduces latency and, with scale, will put a significant tax on the database.<\/p>\n\n\n\n<p>On the other hand, without a fast storage system, alerting is limited to a &#8220;point in time&#8221; observation, either lacking the context that makes an alert actionable or requiring an unreasonable amount of memory to keep the context in scope just in case of the rare event of an alert.<\/p>\n\n\n\n<p>You can see how TDengine and Tremor complement each other here, so let&#8217;s dive in and build something. We&#8217;ll start with the basic setup from Tremor&#8217;s metrics guide. <a href=\"https:\/\/github.com\/tremor-rs\/tremor-www\/tree\/main\/docs\/guides\/code\/metrics\/05_tdengine\" rel=\"noopener\">Click here to view the code.<\/a> This sets us up with a quick runnable combination of TDengine, Tremor, Grafana, and Telegraf as a data collector.<\/p>\n\n\n\n<p>Let&#8217;s look at one of our aggregated data points:<\/p>\n\n\n\n<pre class=\"wp-block-code language-json\"><code class=\"\" data-line=\"\">{\n  &quot;timestamp&quot;:1652101700000000000,\n  &quot;tags&quot;:{\n    &quot;host&quot;:&quot;998fb3b53ea2&quot;,\n    &quot;cpu&quot;:&quot;cpu-total&quot;,\n    &quot;window&quot;:&quot;1min&quot;\n  },\n  &quot;field&quot;:&quot;usage_idle&quot;,\n  &quot;measurement&quot;:&quot;cpu&quot;,\n  &quot;stats&quot;:{\n    &quot;var&quot;:0.14,\n    &quot;min&quot;:94,\n    &quot;mean&quot;:94,\n    &quot;percentiles&quot;:{\n      &quot;0.5&quot;:95,\n      &quot;0.99&quot;:95,\n      &quot;0.9&quot;:95,\n      &quot;0.999&quot;:95\n    },\n    &quot;stdev&quot;:0.35,\n    &quot;count&quot;:6,\n    &quot;max&quot;:95\n  }\n}\n<\/code><\/pre>\n\n\n\n<p>To turn this into an alert, we first need to figure out what we want to alert on. 
So let&#8217;s come up with a condition: &#8220;the system&#8217;s CPU idle is less than 95% for at least a minute&#8221;.<\/p>\n\n\n\n<p>Using the mean of the one-minute aggregation window, this can be turned into a SELECT statement like this:<\/p>\n\n\n\n<pre class=\"wp-block-code language-sql\"><code class=\"\" data-line=\"\">select event from normalize\nwhere match event of\n  case %{measurement == &quot;cpu&quot;, field == &quot;usage_idle&quot;, tags ~= %{cpu == &quot;cpu-total&quot;, `window` == &quot;1min&quot;}, stats ~= %{mean &lt; 95}} =&gt; true\n  case _ =&gt; false\nend into alert;\n<\/code><\/pre>\n\n\n\n<p>One problem this leaves is that the alert will be triggered over and over again as long as the condition is met, which is not helpful for the operations team looking after the systems. To solve this, we can add a simple deduplication script:<\/p>\n\n\n\n<pre class=\"wp-block-code language-tremor\"><code class=\"\" data-line=\"\">  # Script body only; `swap_after` is defined outside this excerpt\n  # Initialize state\n  match state of\n    # If state wasn&#039;t set, set it\n    case null =&gt;\n      let state = {&quot;last&quot;: {}, &quot;this&quot;: {}, &quot;ingest_ns&quot;: ingest_ns}\n    # If more than twice `swap_after` has passed, we can just re-initialize\n    case _ when ingest_ns - state.ingest_ns &gt; swap_after * 2  =&gt;\n      let state = {&quot;last&quot;: {}, &quot;this&quot;: {}, &quot;ingest_ns&quot;: ingest_ns}\n    # If we&#039;re one time over the `swap_after` we:\n    #  * move this -&gt; last\n    #  * re-initialize `this`\n    #  * set `ingest_ns` for the next round\n    case _ when ingest_ns - state.ingest_ns &gt; swap_after  =&gt;\n      let state.ingest_ns = ingest_ns;\n      let state.last = state.this;\n      let state.this = {}\n    case _ =&gt; null\n  end;\n\n  # If we have seen this alert before, drop it\n  match present state.this&#091;event.tags.host] of\n    case true =&gt; drop\n    case _ =&gt; null\n  end;\n\n  # We didn&#039;t see this event before, so remember it\n  let state.this&#091;event.tags.host] = 
true;\n\n  # If we saw it in last, we also drop it\n  match present state.last&#091;event.tags.host] of\n    case true =&gt; drop\n    case _ =&gt; event\n  end\nend;\n<\/code><\/pre>\n\n\n\n<p>Last but not least, we can format the alert into something that <a href=\"https:\/\/alerta.io\/\" rel=\"noopener\">alerta<\/a> can understand:<\/p>\n\n\n\n<pre class=\"wp-block-code language-sql\"><code class=\"\" data-line=\"\">select\n{\n  &quot;environment&quot;: &quot;Production&quot;,\n  &quot;event&quot;: &quot;CPU&quot;,\n  &quot;group&quot;: &quot;Host&quot;,\n  &quot;origin&quot;: &quot;telegraf&quot;,\n  &quot;resource&quot;: event.tags.host,\n  &quot;service&quot;: &#091;&quot;host&quot;],\n  &quot;severity&quot;: &quot;major&quot;,\n  &quot;text&quot;: &quot;#{ event.measurement } #{ event.field } is below the minimum of 95%&quot;,\n  &quot;type&quot;: &quot;exceptionAlert&quot;,\n  &quot;value&quot;: &quot;idle &lt; 95%&quot;\n} from dedup into alert;\n<\/code><\/pre>\n\n\n\n<p>Here is a screenshot of the system monitoring dashboard:<\/p>\n\n\n\n<figure class=\"gb-block-image gb-block-image-3e50be76\"><img decoding=\"async\" width=\"889\" height=\"2048\" class=\"gb-image gb-image-3e50be76 is-style-default\" src=\"https:\/\/eujqw4hwudm.exactdn.com\/wp-content\/uploads\/22.043-01-dashboard.png?strip=all&sharp=1\" alt=\"\" srcset=\"https:\/\/eujqw4hwudm.exactdn.com\/wp-content\/uploads\/22.043-01-dashboard.png?strip=all&amp;sharp=1 889w, https:\/\/eujqw4hwudm.exactdn.com\/wp-content\/uploads\/22.043-01-dashboard-130x300.png?strip=all&amp;sharp=1 130w, https:\/\/eujqw4hwudm.exactdn.com\/wp-content\/uploads\/22.043-01-dashboard-445x1024.png?strip=all&amp;sharp=1 445w, https:\/\/eujqw4hwudm.exactdn.com\/wp-content\/uploads\/22.043-01-dashboard-768x1769.png?strip=all&amp;sharp=1 768w, https:\/\/eujqw4hwudm.exactdn.com\/wp-content\/uploads\/22.043-01-dashboard-667x1536.png?strip=all&amp;sharp=1 667w, 
https:\/\/eujqw4hwudm.exactdn.com\/wp-content\/uploads\/22.043-01-dashboard.png?strip=all&amp;sharp=1&amp;w=355 355w, https:\/\/eujqw4hwudm.exactdn.com\/wp-content\/uploads\/22.043-01-dashboard.png?strip=all&amp;sharp=1&amp;w=533 533w\" sizes=\"(max-width: 889px) 100vw, 889px\" \/><\/figure>\n\n\n\n<p>In the demo, the alert is sent to alerta, but we could equally send it to any other HTTP endpoint, for example to show alerts in Slack or Discord or to forward them to PagerDuty.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>See how you can create an effective alerting system for high-volume metrics by leveraging TDengine and Tremor.<\/p>\n","protected":false},"author":116,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[21],"tags":[],"ppma_author":[192,100],"class_list":["post-6434","post","type-post","status-publish","format-standard","hentry","category-engineering"],"authors":[{"term_id":192,"user_id":116,"is_guest":0,"slug":"hgies","display_name":"Heinz Gies (Tremor)","avatar_url":{"url":"https:\/\/tdengine.com\/wp-content\/uploads\/29.03-12-tremor.png","url2x":"https:\/\/tdengine.com\/wp-content\/uploads\/29.03-12-tremor.png"},"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""},{"term_id":100,"user_id":48,"is_guest":0,"slug":"sangshuduo","display_name":"Shuduo 
Sang","avatar_url":{"url":"https:\/\/tdengine.com\/wp-content\/uploads\/29.04-28-sdsang.jpg","url2x":"https:\/\/tdengine.com\/wp-content\/uploads\/29.04-28-sdsang.jpg"},"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/tdengine.com\/wp-json\/wp\/v2\/posts\/6434","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/tdengine.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/tdengine.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/tdengine.com\/wp-json\/wp\/v2\/users\/116"}],"replies":[{"embeddable":true,"href":"https:\/\/tdengine.com\/wp-json\/wp\/v2\/comments?post=6434"}],"version-history":[{"count":8,"href":"https:\/\/tdengine.com\/wp-json\/wp\/v2\/posts\/6434\/revisions"}],"predecessor-version":[{"id":24631,"href":"https:\/\/tdengine.com\/wp-json\/wp\/v2\/posts\/6434\/revisions\/24631"}],"wp:attachment":[{"href":"https:\/\/tdengine.com\/wp-json\/wp\/v2\/media?parent=6434"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/tdengine.com\/wp-json\/wp\/v2\/categories?post=6434"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/tdengine.com\/wp-json\/wp\/v2\/tags?post=6434"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/tdengine.com\/wp-json\/wp\/v2\/ppma_author?post=6434"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}