<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Monitoring-Infrastructure on K-Life Hack | Systems Architecture &amp; DevOps</title><link>https://klifehack.com/en/tags/monitoring-infrastructure/</link><description>Recent content in Monitoring-Infrastructure on K-Life Hack | Systems Architecture &amp; DevOps</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Sun, 07 Jun 2026 14:07:57 +0900</lastBuildDate><atom:link href="https://klifehack.com/en/tags/monitoring-infrastructure/index.xml" rel="self" type="application/rss+xml"/><item><title>Designing Alert Control and SMTP Integration with Prometheus and Alertmanager</title><link>https://klifehack.com/en/p/prometheus-alertmanager-smtp-routing/</link><pubDate>Sun, 07 Jun 2026 14:07:57 +0900</pubDate><guid>https://klifehack.com/en/p/prometheus-alertmanager-smtp-routing/</guid><description>&lt;h1 id="advancing-monitoring-and-notification-infrastructure-with-prometheus-and-alertmanager-decoupled-design-of-anomaly-detection-and-notification-control"&gt;Advancing Monitoring and Notification Infrastructure with Prometheus and Alertmanager: Decoupled Design of Anomaly Detection and Notification Control
&lt;/h1&gt;&lt;p&gt;In infrastructure monitoring, anomaly detection and notification control are design domains that should be clearly separated. This article explains the decoupled architecture of Prometheus alert evaluation and Alertmanager notification routing, control features to suppress alert storms, and the implementation specifications of notification paths using Naver SMTP.&lt;/p&gt;
&lt;h2 id="1-decoupled-architecture-of-evaluation-and-routing"&gt;1. Decoupled Architecture of Evaluation and Routing
&lt;/h2&gt;&lt;p&gt;In the monitoring pipeline, Prometheus and Alertmanager divide responsibilities as follows. This separation is based on the Single Responsibility Principle.&lt;/p&gt;
&lt;table&gt;
	&lt;thead&gt;
			&lt;tr&gt;
					&lt;th style="text-align: left"&gt;Component&lt;/th&gt;
					&lt;th style="text-align: left"&gt;Role&lt;/th&gt;
					&lt;th style="text-align: left"&gt;Specific Processing&lt;/th&gt;
					&lt;th style="text-align: left"&gt;Output&lt;/th&gt;
			&lt;/tr&gt;
	&lt;/thead&gt;
	&lt;tbody&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;&lt;b&gt;Prometheus&lt;/b&gt;&lt;/td&gt;
					&lt;td style="text-align: left"&gt;&lt;b&gt;Evaluation Engine&lt;/b&gt;&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Evaluates rules defined in &lt;code&gt;rule_files&lt;/code&gt; at each evaluation interval (e.g., 30 seconds).&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Generates alerts in the &amp;ldquo;firing&amp;rdquo; state when conditions are met and sends them to Alertmanager via HTTP POST.&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;&lt;b&gt;Alertmanager&lt;/b&gt;&lt;/td&gt;
					&lt;td style="text-align: left"&gt;&lt;b&gt;Routing Engine&lt;/b&gt;&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Applies grouping, inhibition, and silence processing to received alerts.&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Delivers organized notifications to external notification channels (Email, Slack, etc.).&lt;/td&gt;
			&lt;/tr&gt;
	&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;💡 &lt;b&gt;Why Separation is Necessary&lt;/b&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;b&gt;Engine Specialization&lt;/b&gt;: Prometheus specializes in read/write performance as a time-series database (TSDB). Keeping delivery concerns such as SMTP or webhook integration, retry logic, rate limiting, and notification state management out of Prometheus keeps the core engine simple and stable.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;b&gt;Ensuring High Availability&lt;/b&gt;: Multiple Prometheus servers can send their alerts to a redundant Alertmanager cluster, which removes the single point of failure (SPOF) in the notification path.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
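&lt;p&gt;On the Prometheus side, this division of responsibilities could be wired up roughly as follows. This is a minimal sketch of &lt;code&gt;prometheus.yml&lt;/code&gt;, not a configuration taken from a real environment: the rule file path and the two Alertmanager addresses (&lt;code&gt;alertmanager-1:9093&lt;/code&gt;, &lt;code&gt;alertmanager-2:9093&lt;/code&gt;) are hypothetical placeholders.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# prometheus.yml (sketch; paths and host names are assumptions)
global:
  evaluation_interval: 30s   # how often alert rules are evaluated

rule_files:
  - &amp;#34;rules/*.yml&amp;#34;            # hypothetical location of the alert rule files

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Prometheus sends each alert to every instance listed here;
            # a clustered Alertmanager deduplicates the notifications.
            - &amp;#34;alertmanager-1:9093&amp;#34;
            - &amp;#34;alertmanager-2:9093&amp;#34;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Listing every Alertmanager instance directly, rather than placing a load balancer in front of them, is what allows the cluster to deduplicate notifications.&lt;/p&gt;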
&lt;hr&gt;
&lt;h2 id="2-alert-rule-components-and-state-transitions"&gt;2. Alert Rule Components and State Transitions
&lt;/h2&gt;&lt;p&gt;Alert rules in Prometheus are defined in YAML format. The following rule detects a sustained rise in GPU temperature:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;- &lt;span style="color:#f92672"&gt;alert&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;GpuHighTemperature&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;expr&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;gpu_temperature_celsius &amp;gt; 80&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;for&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;5m&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;labels&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;severity&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;warning&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;component&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;gpu&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;annotations&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;summary&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;GPU temp on {{ $labels.host }}/{{ $labels.gpu }} = {{ $value }}°C&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;description&lt;/span&gt;: |&lt;span style="color:#e6db74"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;      GPU {{ $labels.gpu }} on {{ $labels.host }} has been &amp;gt; 80°C for 5 minutes.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;      Threshold: 80°C / Critical: 85°C.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;      Check: nvidia-smi -q -d TEMPERATURE&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The functions of the core parameters are as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;b&gt;&lt;code&gt;expr&lt;/code&gt;&lt;/b&gt;: The PromQL expression that serves as the evaluation condition. If this expression returns a result (time-series data), the alert condition is considered met.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;code&gt;for&lt;/code&gt;&lt;/b&gt;: How long the condition must hold continuously before the alert transitions to the &amp;ldquo;firing&amp;rdquo; state. This filters out false positives caused by brief spikes.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;code&gt;labels&lt;/code&gt;&lt;/b&gt;: Metadata attached to the alert. Used as criteria for routing and grouping in Alertmanager.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;code&gt;annotations&lt;/code&gt;&lt;/b&gt;: Templates used for notification text. Dynamic information can be embedded using variables such as &lt;code&gt;{{ $labels.host }}&lt;/code&gt; and &lt;code&gt;{{ $value }}&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Because of the &lt;code&gt;for&lt;/code&gt; parameter, an alert moves through the following three states:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;               [ expr returns data ]
+------------+ --------------------&amp;gt; +------------+
|  inactive  |                       |  pending   |
+------------+ &amp;lt;-------------------- +------------+
      ^        [ expr result is empty ]    |
      |                                    | [ &amp;#39;for&amp;#39; duration elapsed ]
      |                                    v
      |                              +------------+
      +------------------------------|   firing   |
        [ expr result is empty ]     +------------+
        (RESOLVED notification sent)
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;&lt;b&gt;&lt;code&gt;inactive&lt;/code&gt;&lt;/b&gt;: Normal state. The PromQL evaluation result is empty.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;code&gt;pending&lt;/code&gt;&lt;/b&gt;: Anomaly detected, but the period specified by &lt;code&gt;for&lt;/code&gt; has not elapsed yet (under validation). Notifications are not sent at this stage.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;code&gt;firing&lt;/code&gt;&lt;/b&gt;: The anomalous state has persisted for the full &lt;code&gt;for&lt;/code&gt; period, so the alert is confirmed and forwarded to Alertmanager. Once the condition clears, the alert is marked as resolved, and Alertmanager sends a &lt;code&gt;RESOLVED&lt;/code&gt; notification if the receiver has &lt;code&gt;send_resolved: true&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
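&lt;p&gt;In a rule file, rules such as &lt;code&gt;GpuHighTemperature&lt;/code&gt; sit inside a &lt;code&gt;groups&lt;/code&gt; list. The sketch below shows that wrapper together with a possible companion rule for the 85°C critical threshold mentioned in the description above. The alert name &lt;code&gt;GpuCriticalTemp&lt;/code&gt; matches the one referenced by the inhibition rules later in this article; the file name, the group name, and the &lt;code&gt;for: 2m&lt;/code&gt; value are assumptions chosen for illustration.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# gpu_rules.yml (sketch; file name, group name and the 2m duration are assumptions)
groups:
  - name: gpu-temperature
    rules:
      - alert: GpuCriticalTemp
        expr: gpu_temperature_celsius &amp;gt; 85
        for: 2m                 # shorter than the warning rule so that it escalates faster
        labels:
          severity: critical
          component: gpu
        annotations:
          summary: &amp;#34;GPU temp on {{ $labels.host }}/{{ $labels.gpu }} = {{ $value }}°C (critical)&amp;#34;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Because both rules read the same metric, a GPU above 85°C satisfies both expressions, which is why the warning-level alert is a natural candidate for inhibition.&lt;/p&gt;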
&lt;hr&gt;
&lt;h2 id="3-three-control-features-to-suppress-alert-storms"&gt;3. Three Control Features to Suppress Alert Storms
&lt;/h2&gt;&lt;p&gt;When a large-scale failure occurs, an &amp;ldquo;alert storm&amp;rdquo;, in which a massive volume of notifications is sent at once, increases the cognitive load on operators and can cause critical failures to be overlooked. Alertmanager provides three control features to prevent this.&lt;/p&gt;
&lt;h3 id="-grouping-group_by"&gt;① Grouping (&lt;code&gt;group_by&lt;/code&gt;)
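&lt;/h3&gt;&lt;p&gt;Aggregates similar alerts into a single notification. Alerts that share the same values for the labels listed in &lt;code&gt;group_by&lt;/code&gt; form one group. With the setting below, for example, &lt;code&gt;GpuHighTemperature&lt;/code&gt; warnings that fire on several hosts at the same time are delivered as one notification rather than one per alert. To batch notifications per host instead, group by &lt;code&gt;host&lt;/code&gt;.&lt;/p&gt;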
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;route&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;group_by&lt;/span&gt;: [&lt;span style="color:#e6db74"&gt;&amp;#39;alertname&amp;#39;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#39;severity&amp;#39;&lt;/span&gt;]
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;group_wait&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;30s&lt;/span&gt;     &lt;span style="color:#75715e"&gt;# How long to buffer after the first alert of a new group arrives&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;group_interval&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;5m&lt;/span&gt;  &lt;span style="color:#75715e"&gt;# How long to wait before notifying about new alerts in an existing group&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="-inhibition-inhibit_rules"&gt;② Inhibition (&lt;code&gt;inhibit_rules&lt;/code&gt;)
&lt;/h3&gt;&lt;p&gt;Suppresses notifications for related &amp;ldquo;dependent alerts&amp;rdquo; when a specific &amp;ldquo;trigger alert&amp;rdquo; has already occurred. For example, if the host itself is down (&lt;code&gt;HostDown&lt;/code&gt;), monitoring alerts for individual processes or GPUs on that host are unnecessary, so their notifications are muted.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;inhibit_rules&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  - &lt;span style="color:#f92672"&gt;source_matchers&lt;/span&gt;: [&lt;span style="color:#ae81ff"&gt;alertname=&amp;#34;HostDown&amp;#34;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;target_matchers&lt;/span&gt;: [&lt;span style="color:#ae81ff"&gt;severity=~&amp;#34;warning|info&amp;#34;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;equal&lt;/span&gt;: [&lt;span style="color:#e6db74"&gt;&amp;#39;host&amp;#39;&lt;/span&gt;]
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The inhibition rule is applied only when the source and target alerts have the same value for every label listed in &lt;code&gt;equal&lt;/code&gt; (in this case, &lt;code&gt;host&lt;/code&gt;).&lt;/p&gt;
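&lt;p&gt;For this rule to take effect, the &lt;code&gt;HostDown&lt;/code&gt; alert has to carry the same &lt;code&gt;host&lt;/code&gt; label as the alerts it is meant to mute. The rule below is a rough sketch of how such a source alert might be defined on the Prometheus side. It assumes a scrape job named &lt;code&gt;node&lt;/code&gt; whose targets are given a &lt;code&gt;host&lt;/code&gt; target label; the group name, the job name, and the &lt;code&gt;for&lt;/code&gt; duration are likewise illustrative and not taken from an actual setup.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# host_rules.yml (sketch; group name, job name and duration are assumptions)
groups:
  - name: host-availability
    rules:
      - alert: HostDown
        # `up` is 0 when the last scrape of a target failed.
        # Assumes the targets of job &amp;#34;node&amp;#34; carry a `host` label.
        expr: up{job=&amp;#34;node&amp;#34;} == 0
        for: 2m
        labels:
          severity: critical    # not matched by target_matchers, so it is never muted by this rule
        annotations:
          summary: &amp;#34;Host {{ $labels.host }} is unreachable&amp;#34;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If the source alert only exposes &lt;code&gt;instance&lt;/code&gt; while the GPU alerts expose &lt;code&gt;host&lt;/code&gt;, the values compared by &lt;code&gt;equal&lt;/code&gt; will not line up as intended, so it is worth checking the labels of both alerts in the Prometheus Web UI.&lt;/p&gt;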
&lt;h3 id="-resend-control-repeat_interval"&gt;③ Resend Control (&lt;code&gt;repeat_interval&lt;/code&gt;)
&lt;/h3&gt;&lt;p&gt;Controls the interval for repeating the same notification for unresolved alerts. This reduces the risk of alerts being left unaddressed while preventing frequent resending.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;route&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;repeat_interval&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;4h&lt;/span&gt;  &lt;span style="color:#75715e"&gt;# Resend interval for alerts that are still unresolved&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id="4-port-465-behavior-and-countermeasures-in-naver-smtp-integration"&gt;4. Port 465 Behavior and Countermeasures in Naver SMTP Integration
&lt;/h2&gt;&lt;p&gt;When using Naver SMTP (&lt;code&gt;smtp.naver.com&lt;/code&gt;) as a notification path, attention must be paid to specific behaviors in protocol negotiation.&lt;/p&gt;
&lt;p&gt;⚠️ &lt;b&gt;Port 465 (Implicit SSL) Connection Issue&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;Naver SMTP supports Port 465 (implicit SSL/TLS) and Port 587 (explicit STARTTLS). Port 465 expects a TLS handshake from the very beginning of the connection, and Alertmanager opens connections to this port with TLS directly. However, &lt;code&gt;smtp_require_tls&lt;/code&gt; defaults to &lt;code&gt;true&lt;/code&gt;, which makes Alertmanager additionally insist on a STARTTLS upgrade. STARTTLS does not apply to a session that is already encrypted, so this protocol mismatch causes delivery to fail, for example with a hang or a &lt;code&gt;connection unexpectedly closed&lt;/code&gt; error.&lt;/p&gt;
&lt;p&gt;🛠️ &lt;b&gt;Solution&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;Explicitly specify &lt;code&gt;smtp_require_tls: false&lt;/code&gt; in the Alertmanager configuration. This skips the STARTTLS step, and the implicit SSL/TLS connection on Port 465 is established successfully; the session is still encrypted. Additionally, for authentication, you must use a 16-digit &amp;ldquo;App Password&amp;rdquo; generated from Naver&amp;rsquo;s security settings instead of your regular login password.&lt;/p&gt;
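&lt;p&gt;Since Naver also offers Port 587 with explicit STARTTLS, the same &lt;code&gt;global&lt;/code&gt; section could alternatively be written as in the sketch below, which leaves &lt;code&gt;smtp_require_tls&lt;/code&gt; at its default. This variant has not been verified against Naver. The address &lt;code&gt;YOUR_ID@naver.com&lt;/code&gt; and the password file path are placeholders, and &lt;code&gt;smtp_auth_password_file&lt;/code&gt; is assumed to be available in the Alertmanager release in use (older releases only accept &lt;code&gt;smtp_auth_password&lt;/code&gt;).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# alertmanager.yml, global section (sketch of the Port 587 alternative)
global:
  smtp_smarthost: &amp;#39;smtp.naver.com:587&amp;#39;
  smtp_from: &amp;#39;YOUR_ID@naver.com&amp;#39;            # placeholder address
  smtp_auth_username: &amp;#39;YOUR_ID@naver.com&amp;#39;   # placeholder address
  # Read the App Password from a file instead of embedding it in the config.
  # The path is a hypothetical example.
  smtp_auth_password_file: &amp;#39;/etc/alertmanager/naver_app_password&amp;#39;
  smtp_require_tls: true                    # the default; STARTTLS is used on Port 587
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Keeping the password in a separate file with restricted permissions makes it easier to commit the main configuration file to version control.&lt;/p&gt;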
&lt;hr&gt;
&lt;h2 id="5-configuration-file-for-implementation-alertmanageryml"&gt;5. Configuration File for Implementation (&lt;code&gt;alertmanager.yml&lt;/code&gt;)
&lt;/h2&gt;&lt;p&gt;The following Alertmanager configuration file combines the alert controls and the Naver SMTP integration described above:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;global&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;resolve_timeout&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;5m&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;smtp_smarthost&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#39;smtp.naver.com:465&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;smtp_from&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#39;neogle@naver.com&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;smtp_auth_username&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#39;neogle@naver.com&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;smtp_auth_password&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#39;YOUR_16_DIGIT_APP_PASSWORD&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;smtp_require_tls&lt;/span&gt;: &lt;span style="color:#66d9ef"&gt;false&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;route&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;receiver&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#39;default-email&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;group_by&lt;/span&gt;: [&lt;span style="color:#e6db74"&gt;&amp;#39;alertname&amp;#39;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#39;severity&amp;#39;&lt;/span&gt;]
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;group_wait&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;30s&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;group_interval&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;5m&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;repeat_interval&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;4h&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  &lt;span style="color:#f92672"&gt;routes&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    - &lt;span style="color:#f92672"&gt;matchers&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        - &lt;span style="color:#ae81ff"&gt;severity=&amp;#34;critical&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      &lt;span style="color:#f92672"&gt;receiver&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#39;critical-email&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      &lt;span style="color:#f92672"&gt;repeat_interval&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;1h&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    - &lt;span style="color:#f92672"&gt;matchers&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        - &lt;span style="color:#ae81ff"&gt;severity=&amp;#34;info&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      &lt;span style="color:#f92672"&gt;repeat_interval&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;24h&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;inhibit_rules&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  - &lt;span style="color:#f92672"&gt;source_matchers&lt;/span&gt;: [&lt;span style="color:#ae81ff"&gt;alertname=&amp;#34;HostDown&amp;#34;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;target_matchers&lt;/span&gt;: [&lt;span style="color:#ae81ff"&gt;severity=~&amp;#34;warning|info&amp;#34;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;equal&lt;/span&gt;: [&lt;span style="color:#e6db74"&gt;&amp;#39;host&amp;#39;&lt;/span&gt;]
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  - &lt;span style="color:#f92672"&gt;source_matchers&lt;/span&gt;: [&lt;span style="color:#ae81ff"&gt;alertname=&amp;#34;GpuCriticalTemp&amp;#34;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;target_matchers&lt;/span&gt;: [&lt;span style="color:#ae81ff"&gt;alertname=&amp;#34;GpuHighTemperature&amp;#34;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;equal&lt;/span&gt;: [&lt;span style="color:#e6db74"&gt;&amp;#39;host&amp;#39;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#39;gpu&amp;#39;&lt;/span&gt;]
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;receivers&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  - &lt;span style="color:#f92672"&gt;name&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#39;default-email&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;email_configs&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#f92672"&gt;to&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#39;neogle@naver.com&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#f92672"&gt;send_resolved&lt;/span&gt;: &lt;span style="color:#66d9ef"&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  - &lt;span style="color:#f92672"&gt;name&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#39;critical-email&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#f92672"&gt;email_configs&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;      - &lt;span style="color:#f92672"&gt;to&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#39;neogle@naver.com&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#f92672"&gt;send_resolved&lt;/span&gt;: &lt;span style="color:#66d9ef"&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#f92672"&gt;headers&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;          &lt;span style="color:#f92672"&gt;Subject&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#39;🚨 [CRITICAL] {{ .CommonLabels.alertname }} on {{ .CommonLabels.host }}&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id="6-troubleshooting-guide"&gt;6. Troubleshooting Guide
&lt;/h2&gt;&lt;table&gt;
	&lt;thead&gt;
			&lt;tr&gt;
					&lt;th style="text-align: left"&gt;Issue&lt;/th&gt;
					&lt;th style="text-align: left"&gt;Probable Cause&lt;/th&gt;
					&lt;th style="text-align: left"&gt;Solution&lt;/th&gt;
			&lt;/tr&gt;
	&lt;/thead&gt;
	&lt;tbody&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;Alert conditions are met but no notification is sent&lt;/td&gt;
					&lt;td style="text-align: left"&gt;The time specified in &lt;code&gt;for&lt;/code&gt; has not elapsed, or it matches an inhibition rule (&lt;code&gt;inhibit_rules&lt;/code&gt;).&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Check if the target alert is in the &lt;code&gt;pending&lt;/code&gt; state in the Prometheus Web UI. Also, verify if a higher-level alert (such as &lt;code&gt;HostDown&lt;/code&gt;) has been triggered on the same host.&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;&lt;code&gt;connection unexpectedly closed&lt;/code&gt; occurs during SMTP connection&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Attempting to use STARTTLS on Port 465.&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Verify if &lt;code&gt;smtp_require_tls: false&lt;/code&gt; is configured.&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;SMTP authentication error occurs&lt;/td&gt;
					&lt;td style="text-align: left"&gt;The regular login password is used, or the App Password has expired.&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Verify that the POP3/SMTP usage setting is enabled in Naver&amp;rsquo;s mail settings, and regenerate and apply a new 16-digit App Password.&lt;/td&gt;
			&lt;/tr&gt;
	&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2 id="configuration-notes"&gt;Configuration Notes
&lt;/h2&gt;&lt;p&gt;The most critical aspect of alert design is that &amp;ldquo;every alert must lead to a concrete action for the recipient.&amp;rdquo; Notifications that do not require action not only lead to operations team fatigue but also delay the detection of truly critical failures.&lt;/p&gt;
&lt;p&gt;By properly combining the grouping, inhibition, and resend controls demonstrated in this article, it is possible to build a highly reliable monitoring and notification infrastructure with minimized noise. Please tune each interval value and inhibition condition step-by-step according to the requirements and operational structure of your actual environment.&lt;/p&gt;</description></item></channel></rss>