Disadvantages of Sensu, the "solution for multi-cloud monitoring".
You will learn in 5 minutes:
- Why Sensu monitoring is not suitable for production use.
RabbitMQ certificate expired.
Sensu often relies on RabbitMQ as its transport, and you may have configured RabbitMQ as a secured transport.
If your Sensu suddenly stopped working, check for the following message:
/var/log/rabbitmq/rabbit@`hostname -a`.log
    =ERROR REPORT==== X-Mon-XXXX::06:43:42 ===
    SSL: certify: ssl_handshake.erl:1387:Fatal error: certificate expired
In this case use sensu_ssl_tool to re-generate SSL certificates and put them into the following places:
    cp sensu_ssl_tool/sensu_ca/cacert.pem /etc/rabbitmq/ssl/cacert.pem
    cp sensu_ssl_tool/server/cert.pem /etc/rabbitmq/ssl/cert.pem
    cp sensu_ssl_tool/server/key.pem /etc/rabbitmq/ssl/key.pem
    cp sensu_ssl_tool/client/cert.pem /etc/sensu/ssl/cert.pem
    cp sensu_ssl_tool/client/key.pem /etc/sensu/ssl/key.pem
    cat /etc/rabbitmq/ssl/cacert.pem >> /etc/ssl/certs/ca-certificates.crt
    systemctl restart rabbitmq-server.service
    cat /usr/share/ca-certificates/extra/foo.crt >> /etc/ssl/certs/ca-certificates.crt
    systemctl restart rabbitmq-server.service
Now your certificates are up to date.
Suggestions
Tune sensu_ssl_tool and set the expiration time as long as you need.
Let's say, for 100 years.
It is better not to wait until the problem appears, but to do it right after you install the Sensu environment.
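Whatever lifetime you choose, it may help to check the remaining validity yourself instead of waiting for the handshake error. The following is a minimal sketch in plain Ruby that uses only the standard openssl library. The function names, the 30-day threshold, and the idea of pointing it at /etc/rabbitmq/ssl/cert.pem are illustrative assumptions; none of this is part of Sensu or sensu_ssl_tool.

```ruby
require "openssl"

SECONDS_PER_DAY = 86_400

# Whole days left until the PEM-encoded certificate expires
# (negative once it has expired). Hypothetical helper name.
def days_until_expiry(pem, now = Time.now)
  cert = OpenSSL::X509::Certificate.new(pem)
  ((cert.not_after - now) / SECONDS_PER_DAY).floor
end

# Classify a certificate as :ok, :warning or :expired.
# warn_days is an assumed threshold, tune it to your needs.
def expiry_status(pem, warn_days: 30, now: Time.now)
  days = days_until_expiry(pem, now)
  return :expired if days < 0
  return :warning if days < warn_days
  :ok
end

# Usage (path taken from the copy commands in this section):
#   puts expiry_status(File.read("/etc/rabbitmq/ssl/cert.pem"))
```

Run from cron against both the server and the client certificate, something like this should give you a warning well before RabbitMQ starts rejecting connections.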
Sensu repository not accessible.
If you are deploying, upgrading, or checking configuration (usually with ansible / puppet / chef), you may find that your playbooks/cookbooks stop working and you can't proceed with other tasks, with an error like this:
    [XXXX-XX-XXT17:20:10+03:00] ERROR: Server returned error 503 for http://repositories.sensuapp.org/apt/pubkey.gpg, retrying 2/5 in 7s
    [XXXX-XX-XXT17:20:17+03:00] ERROR: Server returned error 503 for http://repositories.sensuapp.org/apt/pubkey.gpg, retrying 3/5 in 11s
    [XXXX-XX-XXT17:20:28+03:00] ERROR: Server returned error 503 for http://repositories.sensuapp.org/apt/pubkey.gpg, retrying 4/5 in 27s
    [XXXX-XX-XXT17:20:55+03:00] ERROR: Server returned error 503 for http://repositories.sensuapp.org/apt/pubkey.gpg, retrying 5/5 in 54s
    [XXXX-XX-XXT17:21:49+03:00] WARN: remote_file[/var/chef/cache/pubkey.gpg] cannot be downloaded from http://repositories.sensuapp.org/apt/pubkey.gpg: 503 "Service Unavailable"
For example, if you use chef with chef-server, this will make the clients on all nodes stop executing simultaneously.
All nodes will report errors in this case.
For the monitoring team (without additional digging) it will look like a global problem.
They declare: Monitoring for mission-critical systems. ;)
Suggestions
Tune your software provisioning system to ignore errors on the Sensu side.
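In Chef this is typically done with the ignore_failure resource property and in Ansible with the ignore_errors task keyword; check your tool's documentation for the exact form. The idea itself can be sketched in plain Ruby as below. The helper name run_optional_step and the step names in the usage comment are hypothetical, this is not an API of any provisioning tool.

```ruby
# Run an optional provisioning step. If it fails, log a warning and
# carry on instead of aborting the whole run.
# Hypothetical helper, not part of Chef, Ansible or Puppet.
def run_optional_step(name, log: $stderr)
  yield
  true
rescue StandardError => e
  log.puts("WARN: optional step '#{name}' failed: #{e.class}: #{e.message}")
  false
end

# Usage (fetch_sensu_pubkey and configure_everything_else are placeholders):
#   run_optional_step("sensu repo key") { fetch_sensu_pubkey }  # may hit a 503
#   configure_everything_else                                   # still runs
```

The return value lets the caller decide whether to skip only the Sensu-related steps that depend on the failed one.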
Sensu stops writing to the local log on network failure.
    {"timestamp":"XXXX-XX-XXT20:13:20.695784+0300","level":"error","message":"[amqp] Detected missing amqp heartbeats"}
    {"timestamp":"XXXX-XX-XXT20:13:20.696023+0300","level":"warn","message":"reconnecting to transport"}
    {"timestamp":"XXXX-XX-XXT20:13:25.698631+0300","level":"error","message":"[amqp] Detected TCP connection failure: Errno::ETIMEDOUT"}
    {"timestamp":"XXXX-XX-XXT20:13:29.699197+0300","level":"error","message":"[amqp] Detected TCP connection failure: Errno::ETIMEDOUT"}
When the client loses its connection to the RabbitMQ server, it stops writing checks/metrics even to the log file.
If your RabbitMQ is killed or stops responding for whatever reason, you will lack even local statistics in the log file.
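Since the client log is one JSON object per line, you can at least detect this state from outside Sensu. Here is a minimal sketch using only Ruby's standard json library. The function names are hypothetical, and the log path in the usage comment is an assumption about your installation.

```ruby
require "json"

# Parse one log line; return nil for anything that is not a JSON object.
def parse_entry(line)
  entry = JSON.parse(line)
  entry.is_a?(Hash) ? entry : nil
rescue JSON::ParserError
  nil
end

# Return the log entries that report an AMQP transport error,
# i.e. level "error" with a message starting with "[amqp]".
def amqp_failures(lines)
  lines.map { |line| parse_entry(line) }
       .compact
       .select { |e| e["level"] == "error" && e["message"].to_s.start_with?("[amqp]") }
end

# Usage (assumed log location, adjust to your setup):
#   failures = amqp_failures(File.readlines("/var/log/sensu/sensu-client.log"))
#   puts "transport errors: #{failures.size}"
```

This does not bring the missing check results back, it only tells you that the gap exists.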
Suggestions
Do not use Sensu in any production environment or mission-critical system.
Inconsistent state of Sensu plugins.
Sensu plugins can be in a broken dependency state.
This will cause errors when you are deploying Sensu on a server.
You also cannot rely on any given plugin being available and working at any time.
    root@web003-vps945514:~# sensu-install -vvvp raid-checks
    [SENSU-INSTALL] installing Sensu plugins ...
    [SENSU-INSTALL] provided Sensu plugins: ["raid-checks"]
    [SENSU-INSTALL] compiled Sensu plugin gems: ["sensu-plugins-raid-checks"]
    [SENSU-INSTALL] determining if Sensu gem 'sensu-plugins-raid-checks' is already installed ...
    [SENSU-INSTALL] gem list -i sensu-plugins-raid-checks
    false
    [SENSU-INSTALL] Sensu gem 'sensu-plugins-raid-checks' has not been installed
    [SENSU-INSTALL] Sensu plugin gems to be installed: ["sensu-plugins-raid-checks"]
    [SENSU-INSTALL] installing Sensu gem 'sensu-plugins-raid-checks'
    [SENSU-INSTALL] gem install sensu-plugins-raid-checks --no-document --verbose
    HEAD https://api.rubygems.org/api/v1/dependencies
    200 OK
    GET https://api.rubygems.org/api/v1/dependencies?gems=sensu-plugins-raid-checks
    200 OK
    Getting SRV record failed: DNS result has no information for _rubygems._tcp.api.rubygems.org
    GET https://api.rubygems.org/api/v1/dependencies?gems=english,sensu-plugin
    200 OK
    ERROR: Could not find a valid gem 'english' (= 0.6.3) in any repository
    GET https://api.rubygems.org/latest_specs.4.8.gz
    304 Not Modified
    ERROR: Possible alternatives: english
    [SENSU-INSTALL] failed to install Sensu gem 'sensu-plugins-raid-checks'
    [SENSU-INSTALL] please take note of any failure messages above
    [SENSU-INSTALL] make sure you have build tools installed (e.g. gcc)
    [SENSU-INSTALL] trying to determine the Sensu plugin homepage for sensu-plugins-raid-checks ...
    homepage: https://github.com/sensu-plugins/sensu-plugins-raid-checks
    root@web003-vps945514:~# echo $?
    2
    root@web003-vps945514:~#
You can see the same problem with pure ruby too:
    root@web003-vps945514:~# gem install sensu-plugins-raid-checks
    ERROR: Could not find a valid gem 'english' (= 0.6.3) in any repository
    ERROR: Possible alternatives: english
    root@web003-vps945514:~#
You cannot rely on Sensu when installing plugins with the official sensu-install from the official repository.
Suggestions
Use your own plugin code.
Running out of disk space makes Sensu unusable.
If your server runs out of free disk space:
- You will fail to start or restart sensu-client.
- sensu-client will not work and will not send any data to server.
Suggestions
Using a separate /var may be a solution (not tested).
Installed plugins can suddenly stop working.
Please note that the plugin itself was not upgraded.
    root@big32:~# /etc/sensu/plugins/checks/check-disk.rb
    Traceback (most recent call last):
    2: from /etc/sensu/plugins/checks/check-disk.rb:29:in `'
    1: from /usr/local/rvm/rubies/ruby-2.7.0/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:92:in `require'
    /usr/local/rvm/rubies/ruby-2.7.0/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:92:in `require': cannot load such file -- sensu-plugin/check/cli (LoadError)
    root@big32:~# echo $?
    1
    root@big32:~#
Suggestions
sensu-install -vvvp check-disk is not enough to fix the problem.
Try to reinstall the whole Sensu client.
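To notice this breakage before a check silently starts failing, you can try to load the libraries your plugins require. A minimal sketch follows. The function name is hypothetical, and the result is only meaningful if you run it with the same Ruby interpreter that executes your checks.

```ruby
# Return the subset of library names that cannot be loaded with `require`
# by the Ruby interpreter running this script. Hypothetical helper.
def missing_libraries(names)
  names.reject do |name|
    begin
      require name
      true
    rescue LoadError
      false
    end
  end
end

# Usage (library name taken from the traceback in this section):
#   missing = missing_libraries(["sensu-plugin/check/cli"])
#   warn "cannot load: #{missing.join(', ')}" unless missing.empty?
```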
Third-party code on your servers.
- Sensu has many plugins (Ruby gems) that install many other gems as dependencies.
- All this code comes from the Internet.
- Those gems are developed by many different individuals and groups.
- Those gems may contain intentionally malicious code.
- A large amount of redundant dependency code may slow down your servers significantly.
Suggestions
- It is better not to update plugins automatically.
- It is also better to write your plugin code on your own.
- It is great to have your code "native" - without any dependencies on other Ruby gems.
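As an illustration of the last point, here is a minimal sketch of a disk usage check written in plain Ruby with no gems at all. It parses the output of df -P and maps the result to the usual check exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names and the 85/95 percent thresholds are assumptions made for this example; it is not a replacement for any official plugin.

```ruby
# Exit codes conventionally used by monitoring checks.
EXIT_OK = 0
EXIT_WARNING = 1
EXIT_CRITICAL = 2
EXIT_UNKNOWN = 3

# Parse `df -P` output into { mount_point => used_percent }.
# Lines that do not look like a filesystem row are skipped.
def parse_df(output)
  output.lines.drop(1).each_with_object({}) do |line, usage|
    fields = line.split
    next if fields.size < 6
    percent = fields[4].delete("%")
    next unless percent.match?(/\A\d+\z/)
    usage[fields[5..-1].join(" ")] = percent.to_i
  end
end

# Return [exit_code, message] for the given usage map.
# warn and crit are assumed thresholds in percent.
def disk_status(usage, warn: 85, crit: 95)
  worst = usage.values.max
  return [EXIT_UNKNOWN, "no filesystems found"] if worst.nil?
  full = usage.select { |_, pct| pct >= warn }
  return [EXIT_OK, "all filesystems below #{warn}%"] if full.empty?
  code = worst >= crit ? EXIT_CRITICAL : EXIT_WARNING
  [code, full.map { |mount, pct| "#{mount} #{pct}%" }.join(", ")]
end

# Usage as a standalone check script:
#   code, message = disk_status(parse_df(`df -P`))
#   puts message
#   exit code
```

Keeping the parsing and the decision in separate functions means you can test the check with canned df output, without touching a real server.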