Grafana Loki 学习之踩坑记-技术圈

Grafana 出品的 loki 日志框架完美地与 kubernetes 的 label 理念结合，相对于 EFK 来说更加轻量级，非常适合不需要日志聚合的场景。目前新上集群考虑都采用 loki 做为基础工具，直接在 grafana 中展示，在这里记录下使用 Loki 踩过的一些坑。

踩过的坑

LOKI 启动时提示 panic: invalid page type: 11:10

原因: 对应的 index table 文件已经损坏

解决: 删除相应的 index 文件即可解决

日志的 label 不对

原因: promtail 中的 scrape_config 存在问题.

参考: https://izsk.me/2022/05/15/Loki-log-with-wrong-labels/

grafana 中开启实时日志时提示 Query error

原因: 官方的解释是 Note that live tailing relies on two websocket connections: one between the browser and the Grafana server, and another between the Grafana server and the Loki server. If you run any reverse proxies, please configure them accordingly.

也就是说，如果在 web 与 grafana,grafana 与 loki 之间存在如 nginx 类的 proxy,则需要开启 websocket 特性，恰好作者的 grafana 是在 nginx 后的

解决: nginx 添加 websocket 配置, [详见] https://www.nginx.com/blog/websocket-nginx/

参考: https://github.com/grafana/grafana/blob/b5d8cb25e18fc73f37b3546246363464c9298684/docs/sources/features/datasources/loki.md

Loki: file size too small\nerror creating index client

解决: 删除 loki 的持久化目录下的 boltdb-shipper-active/index_18xxx 目录

参考: https://github.com/grafana/loki/issues/3219

protail: context deadline exceeded

原因: promtail 无法连接 loki 所致

promtail cpu 使用过高

原因: 由于集群中存在大量的 job 类 pod，这会对 loki 的服务发现会有很大的压力，需要调整 promtail 的配置，查看官方的 issue，后续可能会将 ds 由 promtail 转到服务端来做，promtail 需要调整的配置主要为

target_config:
  sync_period: 30s
positions:
  filename: /run/promtail/positions.yaml
  sync_period: 30s

将 sync_period 由默认的 10s 换成 30s

可以使用以下的命令获取到 pprof 文件分析性能

curl localhost:3100/debug/pprof/profile\?seconds\=20

参考: https://github.com/grafana/loki/issues/1315

Maximum active stream limit exceeded

原因：同下，需要调整 limit config 中的 max_streams_per_user，设置为 0 即可

server returned HTTP status 429 Too Many Requests

原因: limit config 中的参数: ingestion_burst_size 默认值太小，调整即可

参考: https://github.com/grafana/loki/issues/1923

Please verify permissions

原因: 这条其实是 warn,不影响 promtail 的正常工作，如果调整过日志的路径的话要确认 promtail 挂载的路径是否正常

loki: invalid schema config

原因: loki 的配置文件格式错误.

promtail: too many open files

原因: /var/log/pods 下面的文件数量太多，导致超过内核参数(fs.inotify.max_user_instances)设置配置.

解决

# 先查看当前机器设置的配置
cat /proc/sys/fs/inotify/max_user_instances
# 再查看promtail启动时watch的文件数
cat /run/promtail/positions.yaml | wc -l
# 如果这个值比max_user_instances要大，则会出现上面的错误，可以通过修改内核参数进行调整
sysctl -w fs.inotify.max_user_instances=1024
# 生效
sysctl -p

参考: https://github.com/grafana/loki/issues/1153

promtail: no such file ro directory

原因：promtail daemonset 启动时会自动挂载好几个 hostpath,如果 docker containers 的配置调整过，则需要 volume 跟 volumemount 都需要对应上。

参考文章

https://github.com/grafana/loki/issues/429
https://github.com/grafana/loki/issues/3219
https://github.com/grafana/loki/issues/1153
https://github.com/grafana/loki/issues/1923
https://blog.csdn.net/weixin_44997607/article/details/108419144
https://github.com/grafana/loki/issues/1315
https://www.nginx.com/blog/websocket-nginx/

原文地址：https://izsk.me/2022/05/15/Loki-Prombles