不对全文内容进行索引的 Loki 到底优秀在哪里-技术圈

引导关注

总结下 loki 的优点

低索引开销

loki 和 es 最大的不同是 loki 只对标签进行索引而不对内容索引
这样做可以大幅降低索引资源开销 (es 无论你查不查，巨大的索引开销必须时刻承担)

并发查询 + 使用 cache

同时为了弥补没有全文索引带来的查询降速使用，Loki 将把查询分解成较小的分片，可以理解为并发的 grep

和 prometheus 采用相同的标签，对接 alertmanager

Loki 和 Prometheus 之间的标签一致是 Loki 的超级能力之一

使用 grafana 作为前端，避免在 kibana 和 grafana 来回切换

架构说明

组件说明

promtail 作为采集器，类比 filebeat

loki 相当于服务端，类比 es

loki 进程包含四种角色

querier 查询器
ingester 日志存储器
query-frontend 前置查询器
distributor 写入分发器

可以通过 loki 二进制的 -target 参数指定运行角色

read path

查询器接收 HTTP / 1 数据请求。
查询器将查询传递给所有 ingesters 请求内存中的数据。
接收器接收读取的请求，并返回与查询匹配的数据（如果有）。
如果没有接收者返回数据，则查询器会从后备存储中延迟加载数据并对其执行查询。
查询器将迭代所有接收到的数据并进行重复数据删除，从而通过 HTTP / 1 连接返回最终数据集。

write path

分发服务器收到一个 HTTP / 1 请求，以存储流数据。
每个流都使用散列环散列。
分发程序将每个流发送到适当的 inester 和其副本（基于配置的复制因子）。
每个实例将为流的数据创建一个块或将其追加到现有块中。每个租户和每个标签集的块都是唯一的。
分发服务器通过 HTTP / 1 连接以成功代码作为响应。

使用本地化模式安装

下载 promtail 和 loki 二进制

$ wget  https://github.com/grafana/loki/releases/download/v2.2.1/loki-linux-amd64.zip
$ wget https://github.com/grafana/loki/releases/download/v2.2.1/promtail-linux-amd64.zip

找一台 linux 机器做测试

安装 promtail

$ mkdir /opt/app/{promtail,loki} -pv

# promtail配置文件
$ cat <<EOF> /opt/app/promtail/promtail.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/log/positions.yaml # This location needs to be writeable by promtail.

client:
  url: http://localhost:3100/loki/api/v1/push

scrape_configs:
 - job_name: system
   pipeline_stages:
   static_configs:
   - targets:
      - localhost
     labels:
      job: varlogs  # A `job` label is fairly standard in prometheus and useful for linking metrics and logs.
      host: yourhost # A `host` label will help identify logs from this machine vs others
      __path__: /var/log/*.log  # The path matching uses a third party library: https://github.com/bmatcuk/doublestar
EOF

# service文件
$ cat <<EOF >/etc/systemd/system/promtail.service
[Unit]
Description=promtail server
Wants=network-online.target
After=network-online.target

[Service]
ExecStart=/opt/app/promtail/promtail -config.file=/opt/app/promtail/promtail.yaml
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=promtail
[Install]
WantedBy=default.target
EOF

$ systemctl daemon-reload
$ systemctl restart promtail
$ systemctl status promtail

安装 loki

$ mkdir /opt/app/{promtail,loki} -pv

# promtail配置文件
$ cat <<EOF> /opt/app/loki/loki.yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

ingester:
  wal:
    enabled: true
    dir: /opt/app/loki/wal
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 1h       # Any chunk not receiving new logs in this time will be flushed
  max_chunk_age: 1h           # All chunks will be flushed when they hit this age, default is 1h
  chunk_target_size: 1048576  # Loki will attempt to build chunks up to 1.5MB, flushing first if chunk_idle_period or max_chunk_age is reached first
  chunk_retain_period: 30s    # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m)
  max_transfer_retries: 0     # Chunk transfers disabled

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /opt/app/loki/boltdb-shipper-active
    cache_location: /opt/app/loki/boltdb-shipper-cache
    cache_ttl: 24h         # Can be increased for faster performance over longer query periods, uses more disk space
    shared_store: filesystem
  filesystem:
    directory: /opt/app/loki/chunks

compactor:
  working_directory: /opt/app/loki/boltdb-shipper-compactor
  shared_store: filesystem

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s

ruler:
  storage:
    type: local
    local:
      directory: /opt/app/loki/rules
  rule_path: /opt/app/loki/rules-temp
  alertmanager_url: http://localhost:9093
  ring:
    kvstore:
      store: inmemory
  enable_api: true
EOF

# service文件

$ cat <<EOF >/etc/systemd/system/loki.service
[Unit]
Description=loki server
Wants=network-online.target
After=network-online.target

[Service]
ExecStart=/opt/app/loki/loki -config.file=/opt/app/loki/loki.yaml
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=loki
[Install]
WantedBy=default.target
EOF

$ systemctl daemon-reload
$ systemctl restart loki
$ systemctl status loki

grafana 上配置 loki 数据源

在 grafana explore 上配置查看日志

查看日志 rate({job="message"} |="kubelet"

算 qps rate({job="message"} |="kubelet" [1m])

只索引标签

之前多次提到 loki 和 es 最大的不同是 loki 只对标签进行索引而不对内容索引下面我们举例来看下

静态标签匹配模式

以简单的 promtail 配置举例

配置解读

scrape_configs:
 - job_name: system
   pipeline_stages:
   static_configs:
   - targets:
      - localhost
     labels:
      job: message
      __path__: /var/log/messages

上面这段配置代表启动一个日志采集任务
这个任务有 1 个固定标签job="syslog"
采集日志路径为 /var/log/messages , 会以一个名为 filename 的固定标签
在 promtail 的 web 页面上可以看到类似 prometheus 的 target 信息页面

查询的时候可以使用和 prometheus 一样的标签匹配语句进行查询

{job="syslog"}

scrape_configs:
 - job_name: system
   pipeline_stages:
   static_configs:
   - targets:
      - localhost
     labels:
      job: syslog
      __path__: /var/log/syslog
 - job_name: system
   pipeline_stages:
   static_configs:
   - targets:
      - localhost
     labels:
      job: apache
      __path__: /var/log/apache.log

如果我们配置了两个 job，则可以使用{job=~”apache|syslog”} 进行多 job 匹配
同时也支持正则和正则非匹配

标签匹配模式的特点

原理

和 prometheus 一致，相同标签对应的是一个流 prometheus 处理 series 的模式
prometheus 中标签一致对应的同一个 hash 值和 refid(正整数递增的 id)，也就是同一个 series
时序数据不断的 append 追加到这个 memseries 中
当有任意标签发生变化时会产生新的 hash 值和 refid，对应新的 series

loki 处理日志的模式 - 和 prometheus 一致，loki 一组标签值会生成一个 stream - 日志随着时间的递增会追加到这个 stream 中，最后压缩为 chunk - 当有任意标签发生变化时会产生新的 hash 值，对应新的 stream

查询过程

所以 loki 先根据标签算出 hash 值在倒排索引中找到对应的 chunk?
然后再根据查询语句中的关键词等进行过滤，这样能大大的提速
因为这种根据标签算哈希在倒排中查找 id，对应找到存储的块在 prometheus 中已经被验证过了
属于开销低
速度快

动态标签和高基数

所以有了上述知识，那么就得谈谈动态标签的问题了

两个概念

何为动态标签：说白了就是标签的 value 不固定
何为高基数标签：说白了就是标签的 value 可能性太多了，达到 10 万，100 万甚至更多

promtail 支持在 pipline_stages 中用正则匹配动态标签

比如 apache 的 access 日志

11.11.11.11 - frank [25/Jan/2000:14:00:01 -0500] "GET /1986.js HTTP/1.1" 200 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"

在 promtail 中使用 regex 想要匹配 action和status_code两个标签

- job_name: system
   pipeline_stages:
      - regex:
        expression: "^(?P<ip>\\S+) (?P<identd>\\S+) (?P<user>\\S+) \\[(?P<timestamp>[\\w:/]+\\s[+\\-]\\d{4})\\] \"(?P<action>\\S+)\\s?(?P<path>\\S+)?\\s?(?P<protocol>\\S+)?\" (?P<status_code>\\d{3}|-) (?P<size>\\d+|-)\\s?\"?(?P<referer>[^\"]*)\"?\\s?\"?(?P<useragent>[^\"]*)?\"?$"
    - labels:
        action:
        status_code:
   static_configs:
   - targets:
      - localhost
     labels:
      job: apache
      env: dev
      __path__: /var/log/apache.log

那么对应 action=get/post 和 status_code=200/400 则对应 4 个流

11.11.11.11 - frank [25/Jan/2000:14:00:01 -0500] "GET /1986.js HTTP/1.1" 200 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
11.11.11.12 - frank [25/Jan/2000:14:00:02 -0500] "POST /1986.js HTTP/1.1" 200 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
11.11.11.13 - frank [25/Jan/2000:14:00:03 -0500] "GET /1986.js HTTP/1.1" 400 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
11.11.11.14 - frank [25/Jan/2000:14:00:04 -0500] "POST /1986.js HTTP/1.1" 400 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"