Limit to number of services in a service group?
Posted: Thu Mar 22, 2012 6:50 am
Hello,
My first post here on the forum, so I'll start with a nice problem that I have been tripping over.
I run a small Nagios installation with a bit more than 15000 services and more than 400 hosts; I have been running Nagios on our systems for more than 10 years now.
Now to the problem. I have one service group with 2893 services attached to it. Recently, all services in this group started generating strange errors like this (I have masked out some sensitive information):
Mar 22 06:39:30 XXX nagios3: Warning: Unable to move file '/var/lib/nagios3/spool/checkresults/checkRLYSDN' to check results queue.
Mar 22 06:39:33 XXX nagios3: Error: Unable to rename file '/var/lib/nagios3/spool/checkresults/checkzInUQ2' to '/var/lib/nagios3/spool/checkresults/c485Wid': No such file or directory
and this:
Mar 22 06:48:50 XXX nagios3: Warning: Attempting to execute the command "/usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: PROBLEM\n\nService:XXXXX\nHost: XXXXXX\nState: UNKNOWN for 0d 0h 0m 11s\nAddress: XXXX\n\nInfo:\n\nXXXXXX\n\nDate/Time: Thu Mar 22 06:48:50 CET 2012\n\nACK by: \nComment: \n" | /usr/bin/mail -s "PROBLEM -XXXX: XXXX is UNKNOWN" [email protected]" resulted in a return code of 127. Make sure the script or binary you are trying to execute actually exists...
The same notification commands and service checks run fine on services which are not members of this large service group. Also, if I remove the services' membership in the service group, the problem goes away. This started occurring without any changes other than adding new services to the service group, which makes it look like some sort of limit in either Nagios or the OS (Debian/Linux).
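To make the OS-limit guess a bit more concrete, here is a rough sketch of how I would estimate it. It rests on assumptions I have not verified on this system: that Nagios passes the group's member list to each check and notification command as one environment variable (I use the name NAGIOS_SERVICEGROUPMEMBERS, holding comma-separated "host,service" pairs), and that Linux caps a single environment string at 32 pages, i.e. 128 KiB with 4 KiB pages. The host and service names are made up, and the function names are my own.

```python
# Rough estimate of how large a service group member list becomes when it is
# exported as a single environment string.
# Assumption: Linux limits one argv/env string to 32 pages (128 KiB at 4 KiB pages).
MAX_ENV_STRING_BYTES = 32 * 4096


def members_macro_size(members):
    """Byte length of a comma-separated 'host,service,host,service,...' list."""
    joined = ",".join(f"{host},{service}" for host, service in members)
    return len(joined.encode("utf-8"))


def exceeds_env_limit(members, var_name="NAGIOS_SERVICEGROUPMEMBERS"):
    """True if 'NAME=value' plus its terminating NUL is over the assumed limit."""
    total = len(var_name) + 1 + members_macro_size(members) + 1
    return total > MAX_ENV_STRING_BYTES


if __name__ == "__main__":
    # Hypothetical names; only the count (2893) comes from my setup.
    members = [
        (f"host{i:04d}.example.com", f"Service check number {i:04d}")
        for i in range(2893)
    ]
    print("value size in bytes:", members_macro_size(members))
    print("over assumed limit:", exceeds_env_limit(members))
```

With names of roughly this length the value ends up above 128 KiB, which would fit the observation that the errors only appeared once the group grew. With shorter real names the threshold would be reached at a higher member count, so this is only an estimate.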
Any suggestions or hints are welcome, thanks!
/Nico