performance issues, fork error

pkarr · Post by **pkarr** » Thu Oct 17, 2013 5:02 pm

We are currently having performance issues with our Production server that we would like your help in resolving.
Some of the problems were:
- Servers were incorrectly reporting down
- Socket timeout errors (over 2,000 in a 12-hour span)
- /tmp was filled with check entries
- Many zombie processes, mostly defunct nagios processes that I cleared with a 'killall -9 nagios'
- Performance graphing is running about 4 hours 30 minutes behind real time

I checked the event log and got the following error message:

"Warning: The check of host '[hostname]' could not be performed due to a fork() error: 'Resource temporarily unavailable'"

Several months ago I increased the ulimit maximums to the values suggested in your Nagios XI FAQ, so I do not want to increase them again without getting your advice first.

Pity me folks, I work with the one and only benhank

thank you,

Penny Karr | IT Infrastructure Monitoring
Harvard Vanguard Medical Associates, an Affiliate of Atrius Health
254 Second Avenue | Needham, MA 02494
P (781) 292-1853 | F (781 292-1980 | http://www.harvardvanguard.org
Email: [email protected]

slansing · Post by **slansing** » Fri Oct 18, 2013 10:29 am

Hmm, this may need to be moved to a ticket sooner rather than later, but let's start by getting some basic information. What version of Nagios XI is this occurring on? Is this your only XI server, or do you have others that are not displaying this? Can you correlate these performance issues with anything that recently happened on the server? Any software or hardware changes? Even networking issues? Can you send us a screenshot of your Admin > Monitoring Engine Status page? Thanks!

PS: Did you let Benhank get his mitts on your server?

pkarr · Post by **pkarr** » Fri Oct 18, 2013 4:47 pm

The production server is running the following.
Nagios XI Version : 2012R1.6
2.6.32-220.el6.x86_64 x86_64
CentOS release 6.2 (Final)
Gnome is not installed
PHP Version: 5.3.3

No hardware/software updates. The Nagios plugins do vary slightly between our two servers; one is a test instance and the other is production. We are waiting for the arrival of our new servers, then we will upgrade to your latest and greatest version of Nagios XI! The network team says there were no networking issues yesterday.

The problem also occurred last weekend (10/12) on my test server, which is a copy of the production server (different hardware, but the same OS, Nagios XI version, and PHP). On the test server I noticed the host/service checks had stopped running at 12:30 or so on Saturday, but Nagios wasn't showing that the monitoring engine had stopped or that anything else was amiss. When I checked the event log, I found a bunch of those fork error messages occurring at ~13:00, ½ hr after the host/service checks stopped. Running a configuration update fixed the problem on this server. Ack, but it was temporary. I notice now that it's back: the host/service checks stopped running yesterday at 10/17/2013 12:39:59. This time there were no fork errors, and /tmp was filled with those check file droppings. I've left it as is for now, in case you want a system snapshot or anything.

My colleague has created a spreadsheet of the Nagios errors he noticed on our Production server over the last few days, which I can forward to you. It's an Excel spreadsheet, so I’ll have to email it to you separately.

OK, I’ve saved the weirdest part for last. We are still getting the error ‘socket timeout after 30 seconds’, but there have been no fork errors now or last night. There are no false host down alerts, and the load average has never been so low. The zombie count is very low, the Total Processes service check (the RSZDT error) is down (it too had been up), and the performance graphing has caught up, but there are gaps.

Here is a current screenshot of the Monitoring Engine Status page. It didn’t look anything like this yesterday.

I would still like to pursue this issue with you. After all, we don’t want it to come back!

slansing · Post by **slansing** » Mon Oct 21, 2013 11:24 am

Were you able to see any issues in the system log or nagios log at the time the checks stopped being processed, other than the fork errors? Could you copy one of those fork errors from an archived log and show us its exact output?

The check files that are sitting in tmp should be removed as the checks they were attached to likely never returned in a timely manner, or they were not picked up after the checks returned. It sounds like there could have been an issue with NDO2DB at that time. Did you see anything in the logs regarding this?

pkarr · Post by **pkarr** » Tue Oct 29, 2013 1:06 pm

I didn't see any issues on either the test or prod servers when I checked the log files for errors at the times of the fork errors.

Here's a cut and paste from the archived log /var/log/messages - it was the same on the production server

Oct 12 13:01:19 lkennagiost02 nagios: Warning: The check of service 'NT: CPU Usage' on host 'WKENAHPREST01.Healthone.org' could not be performed due to a fork() error: 'Resource temporarily unavailable'. The check will be rescheduled.

I couldn't find any error messages for ndo2db in any of the log files - I looked in /var/log/mysql.log, archived messages, cron and nagios logs for that date/time on either server.

sreinhardt · Post by **sreinhardt** » Tue Oct 29, 2013 1:44 pm

What interval do you have most of your host/service checks set to? Also, if you could run ulimit -u, what is the present max number of user processes? Did you happen to get an idea of the current number of processes on the system when this was happening?

pkarr · Post by **pkarr** » Thu Nov 14, 2013 11:34 am

Argh! It's back!
We had a repeat of the fork errors last night. There were 4,100 of them in a 24-hour period, twice what we had last time. An apply config got things running again.
Same error as before, but here's a fresh cut and paste:

Nov 13 22:28:57 LkennagiosP01 nagios: Warning: The check of service 'APC UPS Load' on host 'POS-UPS-PBX-10-2' could not be performed due to a fork() error: 'Resource temporarily unavailable'. The check will be rescheduled.
Nov 13 22:28:57 LkennagiosP01 nagios: Warning: The check of service 'APC UPS Load' on host 'BTR-UPS-MDF-B-1' could not be performed due to a fork() error: 'Resource temporarily unavailable'. The check will be rescheduled.
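
To see when the errors cluster, rather than only how many there were in total, the log can be bucketed by hour. The following is a minimal sketch, assuming the syslog lines look like the two samples above; the regular expression and the function name `fork_errors_per_hour` are illustrative and not part of Nagios.

```python
import re
from collections import Counter

# Pattern inferred from the sample lines in this thread (an assumption, not a
# documented Nagios log format). The optional [pid] covers syslog variants.
FORK_ERROR = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<hour>\d{2}):\d{2}:\d{2}\s+\S+\s+"
    r"nagios(?:\[\d+\])?:.*could not be performed due to a fork\(\) error"
)


def fork_errors_per_hour(lines):
    """Return a Counter keyed by (month, day, hour) for each fork() error line."""
    counts = Counter()
    for line in lines:
        match = FORK_ERROR.match(line)
        if match:
            key = (match.group("month"), int(match.group("day")), int(match.group("hour")))
            counts[key] += 1
    return counts


def print_summary(path):
    """Print one row per hour that contains at least one fork() error."""
    with open(path, errors="replace") as log:
        counts = fork_errors_per_hour(log)
    # Sorting is alphabetical on the month name, which is fine within one month.
    for (month, day, hour), total in sorted(counts.items()):
        print(f"{month} {day:2d} {hour:02d}:00  {total}")
```

Calling `print_summary("/var/log/messages")` (the path mentioned earlier in the thread) would list the busy hours, which can then be compared against the timestamps of the socket timeouts.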

There were no errors regarding ndo2db, but there were lots of check file droppings in /tmp and several unconfigured objects. I have removed both of these things.

As you would expect, we have a renewed interest in fixing this. So, to answer your questions, Spenser:
Most of our checks run on a 5 min interval

[root@LkennagiosP01 ~]# ulimit -u
4096
Should I increase it? No, I don't know how many procs were running at the time this occurred. I will be keeping a watch on it today, though.

thank you,

abrist · Post by **abrist** » Thu Nov 14, 2013 11:56 am

Yeah, increase it to 8000 or so. And let us know if this happens again.
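
One thing that may be worth verifying after the change: `ulimit -u` typed at a root prompt reports the limit of that shell, which is not necessarily the limit the already-running nagios daemon inherited when it started. On Linux kernels that expose `/proc/<pid>/limits`, the enforced value can be read per process. A minimal sketch, assuming the usual column layout of that file (the helper names are made up for illustration):

```python
def parse_max_processes(limits_text):
    """Return (soft, hard) from the 'Max processes' row of /proc/<pid>/limits.

    Each value is an int, or None where the kernel reports 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max processes"):
            # Columns: "Max", "processes", soft limit, hard limit, units
            soft, hard = line.split()[2:4]
            return tuple(None if value == "unlimited" else int(value)
                         for value in (soft, hard))
    raise ValueError("no 'Max processes' row found")


def read_process_limit(pid):
    """Read the limit currently enforced for a running process."""
    with open(f"/proc/{pid}/limits") as handle:
        return parse_max_processes(handle.read())
```

`read_process_limit(pid)` takes the PID of the running nagios daemon, however you look it up. If it still shows the old value after the limit was raised, the daemon most likely needs a restart, because limits are inherited when a process starts.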

pkarr · Post by **pkarr** » Thu Nov 14, 2013 2:19 pm

Thank you. I have already done so on my test server and will schedule it on the production server.

We have some additional questions.

My boss, who is becoming an avid reader of the event logs, noticed that there were a lot (~1,500) of socket timeouts right before the fork errors. Do you think the spike in socket timeouts contributed to the fork errors? Also, once the fork errors start, are the checks rescheduled as the error message says? It seems like it would compound the problem.

OK, now here's my question. Is the max number of user processes as shown by ulimit -u reflected in the check_procs command that we run on the nagios server?

[root@LkennagiosP01 libexec]# ./check_procs
PROCS OK: 674 processes

[root@LkennagiosP01 libexec]# ./check_procs -w 750 -c 1000 -s RSZDT
PROCS OK: 591 processes with STATE = RSZDT
[root@LkennagiosP01 libexec]#

thank you,
Penny

abrist · Post by **abrist** » Thu Nov 14, 2013 3:45 pm

pkarr wrote:Do you think the spike in socket timeouts contributed to the fork errors?

Yes, because a timing out check will stay open longer than a successful check.
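
A rough way to put numbers on that is Little's law: the average number of checks in flight equals the rate at which checks start multiplied by how long each one stays open. The sketch below uses the 5-minute interval and 30-second socket timeout mentioned in this thread; the total of 6,000 checks and the 1-second normal runtime are purely hypothetical placeholders, so substitute your own figures.

```python
def concurrent_checks(total_checks, interval_s, normal_s, timeout_s, timeout_fraction):
    """Estimate average in-flight checks as start rate * mean check duration."""
    rate = total_checks / interval_s  # checks started per second
    mean_duration = (1 - timeout_fraction) * normal_s + timeout_fraction * timeout_s
    return rate * mean_duration


# Hypothetical load: 6,000 checks on a 5-minute interval, 1 s per healthy check.
healthy = concurrent_checks(6000, 300, normal_s=1, timeout_s=30, timeout_fraction=0.0)
degraded = concurrent_checks(6000, 300, normal_s=1, timeout_s=30, timeout_fraction=0.25)
print(healthy, degraded)  # 20.0 165.0
```

Under these assumed figures, a quarter of the checks timing out raises the number of simultaneously open checks roughly eightfold, and each open check holds at least one process slot for the user running it.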

pkarr wrote:Also once the fork errors start, are the checks rescheduled as the error message says? It seems like it would compound the problem.

They are rescheduled, which can snowball into most of your checks getting scheduled at nearly the same time, compounding the number of fork errors.

pkarr wrote: Is the max number of user processes as shown by ulimit -u reflected in the check_procs command that we run on the nagios server?

They should be relatively close.
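
A caveat on comparing the two figures: as far as I know, `check_procs` with no user filter counts every process on the system, while the `ulimit -u` limit applies per user, and on Linux it counts threads rather than only processes. A closer comparison is to count tasks per user and set the nagios user's count against its own limit. A minimal sketch, assuming a procps-style `ps` that accepts `-eLo user=` (one output row per thread, no header); the function names are illustrative only:

```python
import subprocess
from collections import Counter


def tasks_by_user(ps_output):
    """Count rows per user in the output of `ps -eLo user=`."""
    return Counter(line.strip() for line in ps_output.splitlines() if line.strip())


def headroom(limit, used):
    """Fraction of the per-user limit still free (0.25 means 25% left)."""
    return (limit - used) / limit


def current_tasks_by_user():
    """Run ps and return the live per-user task counts."""
    result = subprocess.run(["ps", "-eLo", "user="],
                            capture_output=True, text=True, check=True)
    return tasks_by_user(result.stdout)
```

For example, `headroom(4096, current_tasks_by_user()["nagios"])` would show how much of a 4096 limit is left for the nagios user at that moment. Sampling it every minute or so would give the process count that was missing when the fork errors hit.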
