Hello,
My first post here on the forum, so I'll start with a nice problem that I have been tripping over.
I run a small Nagios installation with a bit more than 15000 services and more than 400 hosts. I have been running Nagios on our systems for more than 10 years now.
Now to the problem. I have one service group with 2893 services attached to it. Recently all services in this group started generating strange errors like this (I have masked out some sensitive information):
Mar 22 06:39:30 XXX nagios3: Warning: Unable to move file '/var/lib/nagios3/spool/checkresults/checkRLYSDN' to check results queue.
Mar 22 06:39:33 XXX nagios3: Error: Unable to rename file '/var/lib/nagios3/spool/checkresults/checkzInUQ2' to '/var/lib/nagios3/spool/checkresults/c485Wid': No such file or directory
and this:
Mar 22 06:48:50 XXX nagios3: Warning: Attempting to execute the command "/usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: PROBLEM\n\nService:XXXXX\nHost: XXXXXX\nState: UNKNOWN for 0d 0h 0m 11s\nAddress: XXXX\n\nInfo:\n\nXXXXXX\n\nDate/Time: Thu Mar 22 06:48:50 CET 2012\n\nACK by: \nComment: \n" | /usr/bin/mail -s "PROBLEM -XXXX: XXXX is UNKNOWN" [email protected]" resulted in a return code of 127. Make sure the script or binary you are trying to execute actually exists...
The same notification commands and service checks run fine on services which are not members of this large service group. Also, if I remove the member association to the service group, the problem goes away. This started occurring without any changes other than adding new services to the service group. This makes it look like some sort of limit in either Nagios or the OS (Debian/Linux).
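One way to test the "OS limit" idea is to estimate how large the group's member list gets if it is passed to a command as a single string. The sketch below rests on assumptions nobody in this thread has confirmed: that the membership ends up as one comma-separated "host,service" string handed to the executed command, and that the relevant cap is the Linux per-string exec limit of 32 pages (131072 bytes with 4 KiB pages). The host and service names are made up, so substitute the real ones to get a meaningful number.

```python
import os

# Linux caps a single argv/envp string at 32 pages (MAX_ARG_STRLEN).
# 4 KiB pages are assumed here, which gives 131072 bytes.
LINUX_MAX_ARG_STRLEN = 32 * 4096


def joined_size(members):
    """Byte length of a comma-separated "host,service" list."""
    joined = ",".join(f"{host},{service}" for host, service in members)
    return len(joined.encode("utf-8"))


def exceeds_single_string_limit(members, limit=LINUX_MAX_ARG_STRLEN):
    """True if the joined list plus its terminating NUL would not fit in one exec string."""
    return joined_size(members) + 1 > limit


if __name__ == "__main__":
    # Hypothetical data: 2893 members with invented host and service names.
    members = [
        (f"host{i % 400:03d}.example.org", f"Service check number {i:04d}")
        for i in range(2893)
    ]
    print("joined member list:", joined_size(members), "bytes")
    print("assumed per-string limit:", LINUX_MAX_ARG_STRLEN, "bytes")
    print("total exec limit (ARG_MAX):", os.sysconf("SC_ARG_MAX"), "bytes")
    print("exceeds per-string limit:", exceeds_single_string_limit(members))
```

If the real member list comes out near or above the per-string limit, that would at least be consistent with commands failing to execute only for services in the large group, but it is a hypothesis to check rather than a diagnosis.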
Any suggestions or hints are welcome, thanks!
/Nico
Limit to number of services in a service group?
Re: Limit to number of services in a service group?
I can't say that I've ever seen behaviour like that... but I've also never tried to put 3000 services in a single service group before. What version are you running? What's the load on the server like? I've seen similar errors when I had the misfortune of encountering high disk latency... so maybe those are some places to start.
It might also be worth asking on the nagios-devel mailing list; they would know if there's any kind of soft limit on what Nagios can handle.
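To rule disk latency in or out, a rough probe like the one below can time how long it takes to create a small file and rename it, which is the kind of operation the "Unable to rename file" error points at. This is only a sketch: the function name and the number of rounds are arbitrary, and it deliberately runs against a scratch directory rather than the live spool directory. For a meaningful reading, point it at a scratch directory on the same filesystem as /var/lib/nagios3/spool/checkresults.

```python
import os
import tempfile
import time


def time_create_and_rename(directory, rounds=100):
    """Return (slowest, mean) time in seconds to write and rename a small file."""
    timings = []
    for _ in range(rounds):
        start = time.perf_counter()
        # Create a uniquely named temporary file, write to it, then rename it.
        fd, tmp_path = tempfile.mkstemp(prefix="probe", dir=directory)
        with os.fdopen(fd, "w") as handle:
            handle.write("probe\n")
        final_path = tmp_path + ".done"
        os.rename(tmp_path, final_path)
        timings.append(time.perf_counter() - start)
        # Clean up outside the timed section.
        os.remove(final_path)
    return max(timings), sum(timings) / len(timings)


if __name__ == "__main__":
    # The default temporary directory is used here only as a stand-in.
    with tempfile.TemporaryDirectory() as scratch:
        worst, mean = time_create_and_rename(scratch)
        print(f"slowest: {worst * 1000:.3f} ms, mean: {mean * 1000:.3f} ms")
```

Consistently slow or spiky results would point toward the I/O path; fast, flat results would make disk latency an unlikely cause.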
Re: Limit to number of services in a service group?
I'm running version 3.2.3 and the load on the server is ~0.20; check latency and execution times typically average under 0.5s. It's running on a dual quad-core server with 24GB RAM and an Areca hardware RAID controller.
Re: Limit to number of services in a service group?
Yeah, barely breaking a sweat. As I said, the only time I encountered a similar issue was when my disk I/O latency was through the roof, but I don't think that's in any way related to your problem. You might need to ask on the nagios-devel mailing list for this one, I'm afraid.