


Title: Nagios is hogging memory or causing the server to swap heavily
FAQ ID: F0115
Submitted By: Ethan Galstad
Last Updated: 06/02/2009

Description:

User is experiencing major load issues stemming from Nagios. Indications of this include:

  • Nagios and its child processes are using a large amount of memory
  • The server on which Nagios is running is swapping heavily
  • Many child processes remain running long after the service checks they perform are completed

 

Solution:

A likely cause is that the service_reaper_frequency directive in the main config file is set too high. Try reducing its value and see whether the problem goes away.

The service_reaper_frequency directive determines how often (in seconds) Nagios processes the results of service checks that were performed by child processes. Child processes communicate with the parent process using a pipe, which has a limited buffer size (typically 4K or less). Each child must send a message of approximately 512 bytes to the parent process before it exits. This means that the pipe buffer can hold at most 8 messages from different child processes (assuming a 4K buffer). Once the buffer fills up, child processes that have not yet sent a message to the parent process "block" (wait, hang around, etc.) until there is enough free space in the buffer to send their message.

Nagios periodically (at the interval you specify with the service_reaper_frequency directive) reads messages from the pipe, thereby freeing space for other child processes to send messages. If Nagios does not read from this pipe often enough, child processes hang around waiting, and memory usage rises. If the problem is serious enough, the server may start swapping these processes to disk, causing load issues.
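To see how many ~512-byte messages a pipe on your own system holds before a writer would block, a rough sketch like the following can help. It is not Nagios code: the function name messages_until_pipe_full is made up for illustration, the 512-byte size is only the approximate figure quoted above, and the script assumes a Unix-like system with Python 3.5 or later. It puts the write end of a pipe into non-blocking mode, so that a full buffer raises an error instead of hanging the way a Nagios child process would.

```python
import os

MESSAGE_BYTES = 512  # approximate check-result message size quoted in this FAQ


def messages_until_pipe_full(message_bytes=MESSAGE_BYTES):
    """Count how many fixed-size messages fit in an unread pipe.

    Hypothetical helper for illustration only; Nagios does not ship this.
    """
    read_fd, write_fd = os.pipe()
    # Non-blocking mode: a full pipe raises BlockingIOError instead of
    # making the writer wait, which is what a blocked child would do.
    os.set_blocking(write_fd, False)
    payload = b"x" * message_bytes
    count = 0
    try:
        while True:
            try:
                os.write(write_fd, payload)
            except BlockingIOError:
                break  # buffer is full; a blocking writer would stall here
            count += 1
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return count


print("Messages that fit before a writer would block:",
      messages_until_pipe_full())
```

The number printed depends on your operating system. Many current systems use a pipe buffer larger than 4K, so the result may be well above the 8 messages assumed in this FAQ.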

How do you calculate what value to use for the service_reaper_frequency directive? Here's an example:

Let's assume that on average Nagios is executing 100 checks per minute (about 1.67 checks per second). Assuming your OS has a 4K pipe buffer, you can fit a maximum of 8 check results (~512 bytes each) into it. Now for some simple math (assuming an empty pipe buffer to begin with): 8 messages / 1.67 messages per second = about 4.8 seconds before the buffer fills up. This means that Nagios (under average conditions) has to read messages from children roughly every 5 seconds. To be on the safe side, I'd round down and set the service_reaper_frequency directive to 4.
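The same arithmetic can be sketched as a small script so you can plug in your own numbers. The function name suggest_reaper_frequency and its parameters are made up for this illustration, and the 4K buffer and 512-byte message size are the assumptions used in the example above. With the unrounded rate of 100/60 checks per second, the buffer fills in about 4.8 seconds, and rounding down gives 4.

```python
import math


def suggest_reaper_frequency(checks_per_minute,
                             pipe_buffer_bytes=4096,
                             message_bytes=512):
    """Estimate seconds until the pipe fills, and a rounded-down setting.

    Hypothetical helper; the defaults are this FAQ's assumptions,
    not values read from Nagios or from the operating system.
    """
    messages_in_buffer = pipe_buffer_bytes // message_bytes  # 8 with defaults
    checks_per_second = checks_per_minute / 60
    seconds_to_fill = messages_in_buffer / checks_per_second
    # Round down so results are read before the buffer fills,
    # but never suggest less than 1 second.
    suggested = max(1, math.floor(seconds_to_fill))
    return seconds_to_fill, suggested


seconds_to_fill, suggested = suggest_reaper_frequency(100)
print(f"Buffer fills in about {seconds_to_fill:.1f} seconds")
print(f"Suggested service_reaper_frequency: {suggested}")
```

This only models average load starting from an empty buffer. If your checks tend to finish in bursts, a lower value than the one suggested may be needed.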

 

Keywords: service_reaper_frequency swapping memory processes load