Monitoring UBC failcounts with Zabbix: the efficient way
A couple of times I searched for an efficient way to monitor the UBC failcounts of OpenVZ containers on a hardware node with Zabbix. Most solutions I found on the net used one item per UBC and container and watched for changes in the failcounts. But I never liked having that many items: each item has to read /proc/user_beancounters, parse it and send its value back to the Zabbix server.
For the last few months I used a rather sub-optimal setup: I had one item per UBC and per container on the OpenVZ hardware node. The advantage was that once a UBC failcount increased, a trigger raised an event telling me exactly which UBC of which container was affected, so I could start investigating right away. But that meant many items with many checks, and I felt more and more uncomfortable wasting resources on monitoring rather than on real processing.
Recently I upgraded my Zabbix server from 1.8 to 2.0 and took the opportunity to renew my whole Zabbix setup, including new, more modular templates and reviewed items and triggers, and finally came up with something I quite like. The last missing piece was UBC monitoring.
Then the idea finally came: three years ago I wrote a very simple Python script (VzUbcMon) that periodically reads /proc/user_beancounters and prints a summary of all UBC failcounts that increased since the last check, including the UBC name and the container ID. At that time I didn’t yet use Zabbix for server monitoring, so this was my little poor man’s UBC monitoring :).
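To illustrate what such a check involves, here is a minimal Python sketch. It is not the VzUbcMon source; it assumes the usual layout of /proc/user_beancounters (a Version line, a column header starting with uid, then one block per container whose first row is prefixed with the container ID and a colon, with failcnt as the last column). The function names parse_failcounts and increased_failcounts are hypothetical.

```python
# Hypothetical helpers illustrating the idea; not taken from VzUbcMon itself.
# Assumed layout of /proc/user_beancounters:
#   Version: 2.5
#          uid  resource      held  maxheld  barrier  limit  failcnt
#         101:  kmemsize      ...                              0
#               lockedpages   ...                              0


def parse_failcounts(text):
    """Return {(ctid, resource): failcnt} from /proc/user_beancounters text."""
    counts = {}
    ctid = None
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] in ("Version:", "uid"):
            continue  # skip blank lines and the two header lines
        if fields[0].endswith(":"):
            # The first row of a container block starts with "<ctid>:"
            ctid = fields[0][:-1]
            fields = fields[1:]
        if ctid is None or len(fields) < 6:
            continue  # not a resource row we understand
        # fields: resource, held, maxheld, barrier, limit, failcnt
        counts[(ctid, fields[0])] = int(fields[-1])
    return counts


def increased_failcounts(old, new):
    """List (ctid, resource, before, after) for every failcnt that grew.

    A (ctid, resource) pair missing from `old` counts as 0, so a newly
    seen container is reported only if it already has failures.
    """
    changes = []
    for key, value in sorted(new.items()):
        before = old.get(key, 0)
        if value > before:
            changes.append((key[0], key[1], before, value))
    return changes
```

Keeping the whole file in one dictionary is what makes a single check sufficient in this sketch: one read covers every UBC of every container, instead of one read per item.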
Now I can reuse this script with the Zabbix agent on the hardware node to check for increased UBC failcounts and let a trigger create an event whenever that happens.
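Because the agent starts the check as a fresh process every time the item is polled, the previous failcounts have to be kept somewhere between runs. The following sketch, again an assumption rather than VzUbcMon's actual mechanism, stores them in a JSON file (the path and the function check_and_update are hypothetical) and returns an empty string when nothing increased. On the very first run every non-zero failcnt is reported, since there is no saved state yet.

```python
import json
import os
import tempfile

# Hypothetical location of the failcounts saved between two agent calls.
STATE_FILE = os.path.join(tempfile.gettempdir(), "vzubcmon-state.json")


def check_and_update(current, state_file=STATE_FILE):
    """Compare {(ctid, resource): failcnt} with the saved state, save the new
    state and return one line per increased failcnt ('' if nothing increased)."""
    previous = {}
    if os.path.exists(state_file):
        with open(state_file) as fh:
            previous = json.load(fh)  # JSON keys are stored as "ctid/resource"

    lines = []
    for (ctid, resource), value in sorted(current.items()):
        before = previous.get(f"{ctid}/{resource}", 0)
        if value > before:
            lines.append(f"CT {ctid}: {resource} failcnt {before} -> {value}")

    # Write to a temporary file first, then rename it over the old state,
    # so an interrupted run cannot leave a truncated state file behind.
    tmp_path = state_file + ".tmp"
    with open(tmp_path, "w") as fh:
        json.dump({f"{c}/{r}": v for (c, r), v in current.items()}, fh)
    os.replace(tmp_path, state_file)
    return "\n".join(lines)
```

On the agent side, such a script would typically be exposed through a UserParameter line in zabbix_agentd.conf, for example UserParameter=vzubcmon.check,/usr/local/bin/vzubcmon (key and path made up here), with the item's type of information set to Text. A trigger can then fire whenever the returned text is non-empty, for instance using the strlen() trigger function.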
In README.zabbix I describe in detail how it works and how to install it; a Zabbix template is also included.
The idea is to have a single item on the Zabbix side that calls VzUbcMon (plus a little helper script that lets the unprivileged zabbix user read /proc/user_beancounters). VzUbcMon evaluates the UBC failcounts and returns a summary of the changes to the Zabbix server, which fires an event if anything changed, so the admin gets informed. That is one item for all UBC failcounts, and since VzUbcMon returns text to the Zabbix server, the event includes the details: the increased failcount, the UBC and the container.
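The helper script bridges a permission gap: the agent runs as the unprivileged zabbix user, which cannot read /proc/user_beancounters directly. One possible way, sketched below, is to let the zabbix user run a single read-only helper through sudo. The helper path /usr/local/bin/read-beancounters, the sudo setup and the function read_beancounters are assumptions for illustration; README.zabbix describes the real installation.

```python
import subprocess

# Hypothetical helper command; the zabbix user would be allowed to run
# exactly this one command via a sudoers rule (an assumption, see above).
HELPER_CMD = ("sudo", "-n", "/usr/local/bin/read-beancounters")


def read_beancounters(cmd=HELPER_CMD, timeout=10):
    """Run the privileged helper and return the raw /proc/user_beancounters text.

    "sudo -n" never prompts for a password, so a missing sudoers rule makes
    the command fail (CalledProcessError) instead of hanging the agent.
    """
    result = subprocess.run(
        cmd,
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode the output to str
        timeout=timeout,      # raise TimeoutExpired rather than block the agent
        check=True,           # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Keeping the helper this small means that, in this sketch, the only thing running with root privileges is a plain read of one file, while all parsing and state handling stays in the unprivileged process.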