Quickly Implementing Anomaly Inspection in SLS - 白红宇

Published: 2019-05-10

1. Survey of related algorithms

1.1 Common open-source algorithms

Yahoo: EGADS

Facebook: Prophet

Baidu: Opprentice

Twitter: Anomaly Detection

Red Hat: Hawkular

Alibaba + Tsinghua: Donut

Tencent: Metis

Numenta: HTM

CMU: SPIRIT

Microsoft: YADING

LinkedIn: an improved version of SAX

Netflix: Argos

NEC: CloudSeer

NEC + Ant: LogLens

Moogsoft: a startup whose work in this area is quite good; listed here for reference.

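1.2 Anomaly detection based on statistical methods

Statistical methods compute indicators over the time series (mean, variance, dispersion, kurtosis, and so on) and raise an alert when an indicator crosses a threshold set from operational experience. Historical data can also be brought in through period-over-period and year-over-year comparisons, again with experience-based thresholds.

Different indicators suit different anomaly shapes. Changes in the window mean and window variance handle the anomalous points in panels (1, 2, 5) of the accompanying figure reasonably well; local extrema detect the spike in panel (4); and a time-series forecasting model captures the trend in panels (3, 6) and flags points that break the pattern.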
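The two threshold strategies described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SLS implementation: the function names `window_mean_shift` and `period_over_period`, the window size, and both thresholds are hypothetical values chosen for the example.

```python
from statistics import mean

def window_mean_shift(series, window, threshold):
    """Return indices i where the mean of series[i:i+window] differs from the
    mean of the preceding window by more than a hand-set threshold."""
    alerts = []
    for i in range(window, len(series) - window + 1):
        prev_mean = mean(series[i - window:i])
        curr_mean = mean(series[i:i + window])
        if abs(curr_mean - prev_mean) > threshold:
            alerts.append(i)
    return alerts

def period_over_period(series, period, ratio_threshold):
    """Return indices whose value deviates from the value one period earlier
    by more than ratio_threshold (relative change)."""
    alerts = []
    for i in range(period, len(series)):
        base = series[i - period]
        if base != 0 and abs(series[i] - base) / abs(base) > ratio_threshold:
            alerts.append(i)
    return alerts

# A level shift from 10 to 20 at index 10 (assumed toy data).
level_shift = [10.0] * 10 + [20.0] * 10
shift_points = window_mean_shift(level_shift, window=5, threshold=8.0)

# Two periods of length 4; the second period has one abnormal value.
two_periods = [100, 102, 98, 101, 101, 150, 99, 100]
pop_alerts = period_over_period(two_periods, period=4, ratio_threshold=0.2)
```

Both thresholds still have to be tuned by hand, which is exactly the limitation of this family of methods.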

How do we decide whether a point is anomalous?

N-sigma

Boxplot

Grubbs' Test

Extreme Studentized Deviate (ESD) Test

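Notes:

N-sigma: in a normal distribution, 99.73% of the data lies within three standard deviations of the mean. If the data follows a known distribution, the probability of observing the current value can be inferred from the distribution curve.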
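As a rough illustration of the N-sigma rule, the sketch below flags values more than `n` standard deviations from the mean and also checks the 99.73% figure against the standard normal CDF. The function name `n_sigma_outliers` and the sample data are assumptions made for this example.

```python
from statistics import NormalDist, mean, pstdev

def n_sigma_outliers(values, n=3.0):
    """Return indices of values farther than n standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # constant series: nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mu) > n * sigma]

# Probability mass within three standard deviations of a normal distribution.
coverage = NormalDist().cdf(3) - NormalDist().cdf(-3)

# Nineteen values around 10 and one obvious spike (assumed toy data).
data = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 9.9,
        10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 10.2, 9.8, 30.0]
outliers = n_sigma_outliers(data)
```

Note that the spike itself inflates both the mean and the standard deviation, so on short windows a large outlier can mask smaller ones.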

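Grubbs' test: a hypothesis test commonly used to detect a single outlier in a normally distributed data set.

ESD test: extends Grubbs' Test to detect up to k outliers.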
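Unlike the tests above, the boxplot rule from the list does not assume normality. A minimal sketch, assuming the conventional fence multiplier of 1.5 and the "inclusive" quartile method of Python's `statistics.quantiles` (other quartile conventions give slightly different fences); the function name `boxplot_outliers` is hypothetical:

```python
from statistics import quantiles

def boxplot_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Assumed toy data: values between 12 and 15 plus one spike.
samples = [12, 13, 12, 14, 13, 15, 14, 13, 12, 40]
flagged = boxplot_outliers(samples)
```

Because quartiles are barely affected by a single extreme value, this rule is less prone to the masking effect than a mean-and-variance rule.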

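1.3 Anomaly detection based on unsupervised methods

What is an unsupervised method? Whether learning is supervised depends on whether the data being modeled carries labels. If the input data is labeled, it is supervised learning; if not, it is unsupervised learning.

Why introduce unsupervised methods? In the early stage of building monitoring, user feedback is scarce and valuable. Unsupervised methods make it possible to establish a reliable monitoring strategy quickly without that feedback.

For single-dimensional metrics

Regression methods (Holt-Winters, ARMA) learn a predicted series from the original observed series; anomalies are then identified by analyzing the residuals between the two.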
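The predict-then-inspect-residuals idea can be sketched as follows. To stay short, this example swaps in simple exponential smoothing for Holt-Winters or ARMA, so it has no trend or seasonal terms; the smoothing factor `alpha`, the multiplier `k`, and the MAD-based scale are assumptions made for illustration.

```python
from statistics import median

def ses_forecast(series, alpha=0.3):
    """One-step-ahead forecasts from simple exponential smoothing.
    forecasts[t] is computed only from series[:t] (forecasts[0] is series[0])."""
    level = series[0]
    forecasts = [level]
    for value in series[:-1]:
        level = alpha * value + (1 - alpha) * level
        forecasts.append(level)
    return forecasts

def residual_anomalies(series, alpha=0.3, k=5.0):
    """Flag indices whose forecast residual is more than k robust standard
    deviations (1.4826 * MAD) away from the median residual."""
    forecasts = ses_forecast(series, alpha)
    residuals = [y - f for y, f in zip(series, forecasts)]
    med = median(residuals)
    mad = median(abs(r - med) for r in residuals)
    scale = 1.4826 * mad
    if scale == 0:
        return []
    return [i for i, r in enumerate(residuals) if abs(r - med) > k * scale]

# Assumed toy data: a small oscillation around 10 with one spike at index 9.
observed = [10, 11, 10, 9, 10, 11, 10, 9, 10, 25, 10, 11, 10, 9, 10]
anomalies = residual_anomalies(observed)
```

A robust scale is used here because the spike also disturbs the forecasts that follow it; a plain standard deviation of the residuals would be inflated by the very point being detected.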

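For multi-dimensional metrics
- What multi-dimensional means: several metrics observed together, for example (time, cpu, iops, flow)
- iForest (Isolation Forest) is an ensemble-based anomaly detection method
  - Suited to continuous data, with linear time complexity and high accuracy
  - Definition of an anomaly: an outlier that is easy to isolate, that is, a point in a sparse region far from the high-density clusters
- Some remarks
  - The more trees, the more stable the result; the trees are independent of one another, so the algorithm can be deployed on large-scale distributed systems
  - The algorithm is not well suited to very high-dimensional data, because noisy dimensions and sensitive dimensions cannot be filtered out automatically
  - The original iForest algorithm is sensitive only to global outliers and much less sensitive to points that are sparse only relative to their local neighborhood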
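To make the "easy to isolate" definition concrete, here is a compact pure-Python sketch of the isolation forest idea: random axis-aligned splits, with points that need fewer splits to isolate receiving higher scores. It is a teaching sketch rather than a production implementation, it assumes at least two input points, and the names `build_tree`, `path_length`, and `isolation_scores` as well as the default parameters are assumptions for this example.

```python
import math
import random

def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random dimension at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(point, node, depth=0):
    """Depth at which the point lands, plus a correction for unsplit leaves."""
    if "split" not in node:
        return depth + c_factor(node["size"])
    child = node["left"] if point[node["dim"]] < node["split"] else node["right"]
    return path_length(point, child, depth + 1)

def isolation_scores(points, n_trees=100, sample_size=256, seed=0):
    """Return one score in (0, 1] per point; scores near 1 mean easy to isolate."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(sample_size)
    return [2.0 ** (-sum(path_length(p, t) for t in trees) / n_trees / norm)
            for p in points]

# Assumed toy data: 100 two-dimensional points near the origin plus one far outlier.
gen = random.Random(42)
points = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(100)]
points.append((100.0, 100.0))
scores = isolation_scores(points)
```

Each tree is built independently from its own subsample, which is what makes the method straightforward to distribute.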

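1.4 Anomaly detection based on deep learning

Paper: "Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications" (WWW 2018)

Problem addressed: anomaly detection on seasonal time-series monitoring data that contains missing points and anomalous points

The model training structure is as follows

At detection time, MCMC imputation is used to handle the known missing points in the observation window. The core idea is to use the trained model to iteratively approximate the marginal distribution (the figure below illustrates one iteration of MCMC imputation).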

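1.5 Anomaly detection using supervised methods

Why is labeling anomalies inherently complicated?
- Users usually define and label anomalies from the system or service perspective. The underlying metrics and call-chain metrics involved are numerous and tangled, and cannot be captured from a few dimensions (what is labeled is closer to a snapshot of the whole system)
- Architecture design usually includes service self-healing, so an anomaly in the lower layers may never affect the business layer above
- Tracing an anomaly to its source is complicated. In many cases a single monitoring metric only reflects the consequence of an anomaly rather than the anomaly itself
- Labeled samples are few and anomaly types are diverse; learning from such small samples still needs improvement

Common supervised machine learning methods
- XGBoost, GBDT, LightGBM, and similar
- DNN-based classification networks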

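2. Algorithm capabilities provided in SLS

Time series analysis
- Forecasting: fit a baseline from historical data
- Anomaly detection, change point detection, and turning point detection: find anomalous points
- Multi-period detection: discover periodic patterns in data access
- Time series clustering: find time series whose shape differs from the rest

Pattern analysis
- Frequent pattern mining
- Differential pattern mining

Intelligent clustering of massive text logs
- Supports logs in any format: Log4J, JSON, single-line (syslog)
- Logs can be reduced after filtering by arbitrary conditions; for a pattern produced by Reduce, the original data can be looked up by its signature
- Compare patterns across different time ranges
- Dynamically adjust Reduce precision
- Results within seconds on hundreds of millions of log entries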

3. Hands-on analysis for a traffic scenario

3.1 Visualizing multi-dimensional monitoring metrics

The SQL logic is as follows:

* | select
      time,
      buffer_cnt,
      log_cnt,
      buffer_rate,
      failed_cnt,
      first_play_cnt,
      fail_rate
    from (
      select
        date_trunc('minute', time) as time,
        sum(buffer_cnt) as buffer_cnt,
        sum(log_cnt) as log_cnt,
        case
          when is_nan(sum(buffer_cnt)*1.0 / sum(log_cnt)) then 0.0
          else sum(buffer_cnt)*1.0 / sum(log_cnt)
        end as buffer_rate,
        sum(failed_cnt) as failed_cnt,
        sum(first_play_cnt) as first_play_cnt,
        case
          when is_nan(sum(failed_cnt)*1.0 / sum(first_play_cnt)) then 0.0
          else sum(failed_cnt)*1.0 / sum(first_play_cnt)
        end as fail_rate
      from log
      group by time
      order by time
    )
    limit 100000
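For readers less familiar with the query syntax, the aggregation above (sum the counters per minute, then compute a ratio with a guard for empty denominators) can be mirrored in Python. This is only an illustration of the logic: the record layout and the name `per_minute_buffer_rate` are assumptions, and the guard here returns 0.0 for any zero denominator, whereas the `is_nan` check in the query only covers the 0/0 case.

```python
from collections import defaultdict

def per_minute_buffer_rate(records):
    """records: iterable of dicts with keys time (epoch seconds), buffer_cnt, log_cnt.
    Returns {minute_start: buffer_cnt_sum / log_cnt_sum}, or 0.0 when log_cnt_sum is 0."""
    buckets = defaultdict(lambda: [0, 0])
    for rec in records:
        minute = rec["time"] - rec["time"] % 60  # truncate to the minute
        buckets[minute][0] += rec["buffer_cnt"]
        buckets[minute][1] += rec["log_cnt"]
    return {minute: (buf / logs if logs else 0.0)
            for minute, (buf, logs) in sorted(buckets.items())}

# Assumed toy records: two in the first minute, one empty record in the second.
records = [
    {"time": 0, "buffer_cnt": 1, "log_cnt": 10},
    {"time": 30, "buffer_cnt": 3, "log_cnt": 10},
    {"time": 60, "buffer_cnt": 0, "log_cnt": 0},
]
rates = per_minute_buffer_rate(records)
```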

3.2 Period-over-period time-series charts for each metric

The SQL logic is as follows:

* | select
      time,
      log_cnt_cmp[1] as log_cnt_now,
      log_cnt_cmp[2] as log_cnt_old,
      case when is_nan(buffer_rate_cmp[1]) then 0.0 else buffer_rate_cmp[1] end as buf_rate_now,
      case when is_nan(buffer_rate_cmp[2]) then 0.0 else buffer_rate_cmp[2] end as buf_rate_old,
      case when is_nan(fail_rate_cmp[1]) then 0.0 else fail_rate_cmp[1] end as fail_rate_now,
      case when is_nan(fail_rate_cmp[2]) then 0.0 else fail_rate_cmp[2] end as fail_rate_old
    from (
      select
        time,
        ts_compare(log_cnt, 86400) as log_cnt_cmp,
        ts_compare(buffer_rate, 86400) as buffer_rate_cmp,
        ts_compare(fail_rate, 86400) as fail_rate_cmp
      from (
        select
          date_trunc('minute', time - time % 120) as time,
          sum(buffer_cnt) as buffer_cnt,
          sum(log_cnt) as log_cnt,
          sum(buffer_cnt)*1.0 / sum(log_cnt) as buffer_rate,
          sum(failed_cnt) as failed_cnt,
          sum(first_play_cnt) as first_play_cnt,
          sum(failed_cnt)*1.0 / sum(first_play_cnt) as fail_rate
        from log
        group by time
        order by time
      )
      group by time
    )
    where time is not null
    limit 1000000
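The query above pairs each value with the value from 86400 seconds (one day) earlier, reading element [1] as the current value and [2] as the old one. The Python sketch below mirrors that pairing for illustration only; it does not reproduce the internals of `ts_compare`, and the function name `day_over_day` and the input layout are assumptions.

```python
import math

def day_over_day(points, offset=86400):
    """points: dict mapping epoch-second timestamps to values.
    Returns (ts, now, old, ratio) rows for every timestamp that also has a
    value `offset` seconds earlier. NaN values are replaced by 0.0, as in the query."""
    rows = []
    for ts in sorted(points):
        old_ts = ts - offset
        if old_ts not in points:
            continue
        now, old = points[ts], points[old_ts]
        now = 0.0 if math.isnan(now) else now
        old = 0.0 if math.isnan(old) else old
        ratio = now / old if old != 0 else float("nan")
        rows.append((ts, now, old, ratio))
    return rows

# Assumed toy series sampled every 120 seconds over two days.
series = {0: 100.0, 120: float("nan"), 86400: 150.0, 86520: 80.0}
rows = day_over_day(series)
```

Plotting the `now` and `old` columns on the same axis gives the kind of comparison chart this subsection describes.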

3.3 Dynamic visualization of each metric

The SQL logic is as follows:

* | select
      time,
      case when is_nan(buffer_rate) then 0.0 else buffer_rate end as show_index,
      isp as index
    from (
      select
        date_trunc('minute', time) as time,
        sum(buffer_cnt)*1.0 / sum(log_cnt) as buffer_rate,
        sum(failed_cnt)*1.0 / sum(first_play_cnt) as fail_rate,
        sum(log_cnt) as log_cnt,
        sum(failed_cnt) as failed_cnt,
        sum(first_play_cnt) as first_play_cnt,
        isp
      from log
      group by time, isp
      order by time
    )
    limit 200000

3.4 Dashboard page for monitoring the set of anomalies

The SQL logic behind the charts for the anomaly monitoring items:

* | select
      res.name
    from (
      select
        ts_anomaly_filter(province, res[1], res[2], res[3], res[6], 100, 0) as res
      from (
        select
          t1.province as province,
          array_transpose( ts_predicate_arma(t1.time, t1.show_index, 5, 1, 1) ) as res
        from (
          select
            province,
            time,
            case when is_nan(buffer_rate) then 0.0 else buffer_rate end as show_index
          from (
            select
              province,
              time,
              sum(buffer_cnt)*1.0 / sum(log_cnt) as buffer_rate,
              sum(failed_cnt)*1.0 / sum(first_play_cnt) as fail_rate,
              sum(log_cnt) as log_cnt,
              sum(failed_cnt) as failed_cnt,
              sum(first_play_cnt) as first_play_cnt
            from log
            group by province, time
          )
        ) t1
        inner join (
          select DISTINCT province
          from (
            select province, time, sum(log_cnt) as total
            from log
            group by province, time
          )
          where total > 200
        ) t2 on t1.province = t2.province
        group by t1.province
      )
    )
    limit 100000