Monit is a popular and well-known tool for monitoring “mission-critical” services and applications. It is open source, free, and relatively simple to install and use. Today, we’ll look at a lesser-known side of Monit, which I call “smart” monitoring.
This is not a “how to install” post but rather a reflection on how you can monitor a service and still miss a warning when it fails. My road to this started with some, as usual, unusual problems with custom applications that must always run on some servers with redundancy. These (legacy) applications were originally monitored by custom monitoring software that did only basic monitoring, i.e. it only checked whether the application existed in the process list with a simple ps and grep command. If the result contained the name of the application, the monitoring software assumed that everything was fine and that there was nothing to report: move along…

This description also fits most of the Monit installations that I’ve seen in the field. I never thought about it too much until I started getting support calls for these legacy applications, which would crash (thankfully not often) without the monitoring alerting us. What was going on here? The monitoring system was failing us and I couldn’t let that happen. In order to have more power, I decided to use Monit instead to monitor these applications. Here’s a very basic monitoring script you could have with Monit:
check process postfix with pidfile /var/spool/postfix/pid/master.pid
    start program = "/etc/init.d/postfix start"
    stop program = "/etc/init.d/postfix stop"
This Monit script makes sure that the Postfix process ID exists; if the process does not exist, Monit starts Postfix automatically. I call this “dumb monitoring”.
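The legacy ps and grep check described earlier has a close equivalent in Monit, which can look a process up by name instead of by pidfile. Here is a minimal sketch, assuming a hypothetical application called legacy_app with hypothetical init script paths (neither comes from the original setup):

```
# Hypothetical legacy application; the name and paths are placeholders.
# "matching" makes Monit search the process list for a pattern,
# much like ps piped into grep.
check process legacy_app matching "legacy_app"
    start program = "/etc/init.d/legacy_app start"
    stop program = "/etc/init.d/legacy_app stop"
```

Like the pidfile version, this only proves that a process with that name exists, so it is just as “dumb”.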
The thing is, an application or service being loaded into memory doesn’t mean it’s fully working. Let me give you an easy (and stupid) example. Let’s say that we have an SMTP service, Postfix, running and working properly; it’s able to receive and deliver email without any problem. You have enabled monitoring of the service with Monit: you check whether the process ID of Postfix/Sendmail exists on the system, and each time that check is done, the result is positive. The service is running fine from Monit’s point of view. But alas, you have forgotten to open port 25 on your external firewall, and the SMTP service, even though it’s running properly, cannot receive, send or process any email at all. This is a “fail scenario”. You might say that the service is running, but from my point of view it is not, since no email can be processed by the SMTP service. Piece of cake: we simply need to add another line to our monitoring script, which gives us:
check process postfix with pidfile /var/spool/postfix/pid/master.pid
    start program = "/etc/init.d/postfix start"
    stop program = "/etc/init.d/postfix stop"
    if failed port 25 protocol smtp then alert
Now this script checks that the Postfix process ID exists AND that the service accepts connections on port 25. If Postfix is not running, Monit will try to start it, and if there is no answer on port 25, Monit will send an email to let us know about it. End of story! Or is it?
Let’s push this further and say that, on this same server, the partition that Postfix writes to is full, meaning that Postfix cannot process email since it cannot write to the disk. Will the previous monitoring script warn us about this? The Postfix service is running and also answers when you speak to it on port 25, but the reality is that it won’t be able to process email because there is no space left on the disk to let it through. You could of course check the available space with another monitoring script, but that is not the point of my example. I warned you, it’s a stupid example. Fortunately for us, Monit also supports generic protocol tests with two really useful statements: send and expect. We can modify our initial script to send an email through our monitored SMTP service with them. Of course, this will also spam a mailbox of our choice, so we would need to add a rule on the receiving mailbox to automatically delete the email coming from Monit in order to keep it clean.
check process postfix with pidfile /var/spool/postfix/pid/master.pid
    start program = "/etc/init.d/postfix start"
    stop program = "/etc/init.d/postfix stop"
    # Walk through a complete SMTP dialogue and alert if any reply is unexpected
    if failed port 25 and
        expect "^220.*"
        send "HELO localhost.localdomain\r\n"
        expect "^250.*"
        send "mail from:monit@localhost\r\n"
        expect "^250.*"
        send "rcpt to:email@example.com\r\n"
        expect "^250.*"
        send "data\r\n"
        expect "^354.*"
        send "test\r\n"
        send ".\r\n"
        # 250 here means the message was accepted and queued
        expect "^250.*"
        send "QUIT\r\n"
    then alert
The last expect “^250.*” in this script is the confirmation that the Postfix service has accepted the email and put it in its queue for delivery. In this case, the “problem” of the full partition would then be fully tested. Now, a small warning: do not use this script on a serious production system; it is only here for the demonstration purposes of this post. Even I do not use this on my monitored production systems.
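As mentioned above, the more conventional way to catch a full partition is a separate filesystem check. Here is a minimal sketch, assuming the Postfix spool lives on a partition mounted at /var and that 90% is an acceptable threshold (the check name, the mount point and the threshold are my assumptions, so adjust them to your system):

```
# Assumed mount point and thresholds; point this at the partition
# that actually holds /var/spool/postfix.
check filesystem postfix_spool with path /var
    if space usage > 90% then alert
    if inode usage > 90% then alert
```

The inode test is there because a partition can refuse new files when it runs out of inodes, even if free space remains.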
The point here is that it’s easy to miss a failing service with basic monitoring. When you implement a system like this, you need to put some thought into it in order to make sure that you cover a wide range of possible outcomes when a system is failing. Of course, it’s almost impossible to cover every possible cause, because our imagination isn’t big enough to think of all the ways a system can fail. Only experience can show us more possibilities.