Nagios

I just started playing with Nagios, an open-source monitoring software package (GPL). I used to use monit instead, but there are two limitations of monit that made me switch:
  • It can only do port/protocol checks on remote hosts
  • It has no tolerance setting for check failures (it sends a warning as soon as there is one failure)
On the other hand, Nagios has tools that allow a Nagios server to perform "local" checks on remote servers over the network (check_snmp, check_nt, check_nrpe and check_by_ssh). As a side effect, it can monitor Windows servers quite well. The web interface is enough for my needs.

Note: I'm still using monit for process checks, as Nagios can't do that as well as monit does (monit uses the information in the lockfile to see if the process is still in memory, and uses user-defined commands to restart the process if it is not in memory).

Here is how Nagios works, basically:
  • The tools that do the checks are called plugins
  • Objects have to be defined (timeperiods, hosts, contacts, services, commands), and groups of objects can be created.
  • It uses smart checking algorithms so that your server doesn't do 1000 checks at 13:48 and 3 checks at 13:56
  • It can run a command in some situations (to restart Apache, for example)
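To make the plugin idea concrete: as far as I understand the plugin convention, a plugin is just a program that prints one status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). Here is a minimal sketch in Python of a load check. The function names and the thresholds (4.0 and 8.0) are my own picks for illustration; this is not one of the official plugins.

```python
import os

# Exit codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value to a plugin state using two thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def check_load(warn=4.0, crit=8.0):
    """Return (exit_code, status_line) for the 1-minute load average.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    try:
        load1 = os.getloadavg()[0]
    except (OSError, AttributeError):
        # getloadavg() is missing on some platforms (Windows, for example).
        return UNKNOWN, "LOAD UNKNOWN - load average not available"
    state = evaluate(load1, warn, crit)
    return state, "LOAD %s - load1=%.2f" % (LABELS[state], load1)


def main():
    """Print the status line and return the exit code for the caller."""
    code, line = check_load()
    print(line)
    return code
```

A real plugin script would end with `sys.exit(main())`, so that the exit code reaches Nagios as the check result.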
Nagios also has other very nice features. Here is how I configured my server:
  • It does several port/protocol checks on remote servers
  • It uses nrpe (check_nrpe) to perform "local" checks on remote servers
  • When a check fails 4 times, it sends a notification (email)
  • When it has sent 3 notifications and the problem is still neither solved nor acknowledged, it escalates the notifications (they go to my cell as well).
  • For some servers, I allow Nagios to wake me up. Others can only send messages to my cell between 7:30 and 22:00 (using timeperiods)
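The notification rules above can be sketched as a toy model in Python. This is not Nagios code or Nagios configuration (in Nagios itself these rules live in the object definitions); the names `should_notify`, `recipients` and `may_wake_me` are made up for the example, and the numbers are the ones from the list.

```python
from datetime import time

MAX_CHECK_ATTEMPTS = 4   # failed checks in a row before the first notification
ESCALATE_AFTER = 3       # notifications sent before my cell is added
DAYTIME = (time(7, 30), time(22, 0))   # window for "daytime only" servers


def in_timeperiod(now, period=DAYTIME):
    """True if the time of day falls inside the period (bounds included)."""
    start, end = period
    return start <= now <= end


def should_notify(consecutive_failures, acknowledged=False):
    """Notify only after enough failures, and never once I have acknowledged."""
    return consecutive_failures >= MAX_CHECK_ATTEMPTS and not acknowledged


def recipients(notification_number, now, may_wake_me):
    """Who receives notification number N (1-based) of an unresolved problem."""
    targets = ["email"]
    escalated = notification_number > ESCALATE_AFTER
    if escalated and (may_wake_me or in_timeperiod(now)):
        targets.append("cell")
    return targets
```

For example, `recipients(4, time(23, 0), may_wake_me=False)` only returns the email target, because the fourth notification is escalated but 23:00 is outside the 7:30 to 22:00 window.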
Nagios' configuration is a lot less painful than I thought. To make it easier, I created one configuration file for each organization whose servers I monitor.

Future plans: Failover

Using nsca, it is possible to have a "standby" Nagios server that detects when the "master" Nagios server is down and takes over the checks and notifications during the downtime. nsca allows this "standby" server to have all the information about previous checks.

BTW, I used the rpm packages from Dag Wieers, and they work fine on my CentOS 4 system. I tried installing Nagios on a Fedora Core 4 machine, but the rpm packages for Nagios in Fedora Extras are confusing...
