performance issues, fork error

pkarr
Posts: 58
Joined: Fri Oct 05, 2012 1:01 pm

performance issues, fork error

Post by pkarr »

We are currently having performance issues with our Production server that we would like your help in resolving.
Some of the problems were:
- Servers were incorrectly reported as down
- Socket timeout errors (over 2000 in a 12-hour span)
- /tmp was filled with check entries
- Many zombie processes, mostly defunct nagios processes that I cleared with a 'killall -9 nagios'
- Performance graphing was running about 4.5 hours behind real time

I checked the event log and got the following error message:

"Warning: The check of host '[hostname]' could not be performed due to a fork() error: 'Resource temporarily unavailable'"


Several months ago I increased the ulimit maximums to the values suggested in your Nagios XI FAQ, so I do not want to increase them again without getting your advice first.

Pity me folks, I work with the one and only benhank :D

thank you,

Penny Karr | IT Infrastructure Monitoring
Harvard Vanguard Medical Associates, an Affiliate of Atrius Health
254 Second Avenue | Needham, MA 02494
P (781) 292-1853 | F (781) 292-1980 | http://www.harvardvanguard.org
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: performance issues, fork error

Post by slansing »

Hmm, this may need to be moved to a ticket sooner rather than later. Let's start by getting some basic information. What version of Nagios XI is this occurring on? Is this your only XI server, or do you have others that are not displaying this? Can you correlate these performance issues with anything that recently happened on the server? Any software or hardware changes? Even networking issues? Can you send us a screenshot of your Admin > Monitoring Engine Status page? Thanks!

PS: Did you let Benhank get his mitts on your server? :)
pkarr

Re: performance issues, fork error

Post by pkarr »

The production server is running the following.
Nagios XI Version : 2012R1.6
2.6.32-220.el6.x86_64 x86_64
CentOS release 6.2 (Final)
Gnome is not installed
PHP Version: 5.3.3


No hardware/software updates. The Nagios plugins do vary slightly between our two servers, since one is a test instance and the other is production. We are waiting for the arrival of our new servers, then we will upgrade to your latest and greatest version of Nagios XI! Network team says there were no networking issues yesterday.

The problem also occurred last weekend (10/12) on my test server, which is a copy of the production server (different HW, but same OS, Nagios XI version, PHP). On the test server I noticed the host/service checks had stopped running at 12:30 or so on Saturday, but Nagios wasn’t showing that the monitoring engine had stopped or that anything else was amiss. When I checked the event log, I found a bunch of those fork error messages occurring at ~13:00, ½ hr after the host/service checks stopped. Running a configuration update fixed the problem on this server. Ack, but it was temporary. I notice now that it’s back: the host/service checks stopped running yesterday at 10/17/2013 12:39:59. This time there were no fork errors, and /tmp was filled with those check file droppings. I've left it as is for now, in case you want a system snapshot or anything.

My colleague has created a spreadsheet of Nagios errors he noticed on our Production server over the last few days, which I can forward to you. It’s an Excel spreadsheet, so I’ll have to email it to you separately.

OK, I’ve saved the weirdest part for last. We are still getting the error ‘socket timeout after 30 seconds’ but no fork errors now or last night. There are no false host down alerts, and the load average has never been so low. The zombie count is very low, the Total Processes service check (RSZDT error) is down (it too had been up), and the performance graphing has caught up, although there are gaps.

Here is a current screenshot of the Monitoring Engine Status page. It didn’t look anything like this yesterday.

I would still like to pursue this issue with you. After all, we don’t want it to come back!

slansing

Re: performance issues, fork error

Post by slansing »

Were you able to see any issues in the system log or nagios log at the time the checks stopped being processed, other than the fork errors? Could you copy one of those fork errors from an archived log and show us its exact output?

The check files that are sitting in tmp should be removed as the checks they were attached to likely never returned in a timely manner, or they were not picked up after the checks returned. It sounds like there could have been an issue with NDO2DB at that time. Did you see anything in the logs regarding this?
pkarr

Re: performance issues, fork error

Post by pkarr »

I didn't see any issues on either the test or prod servers when I checked the log files for errors at the times of the fork errors.

Here's a cut and paste from the archived log /var/log/messages. It was the same on the production server:

Oct 12 13:01:19 lkennagiost02 nagios: Warning: The check of service 'NT: CPU Usage' on host 'WKENAHPREST01.Healthone.org' could not be performed due to a fork() error: 'Resource temporarily unavailable'. The check will be rescheduled.

I couldn't find any error messages for ndo2db in any of the log files - I looked in /var/log/mysql.log, archived messages, cron and nagios logs for that date/time on either server.

sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: performance issues, fork error

Post by sreinhardt »

What interval do you have most of your host/service checks set to? Also, if you could run ulimit -u, what is the present max number of user processes? Did you happen to get an idea of the current number of processes on the system when this was happening?
Nagios-Plugins maintainer exclusively, unless you have other C language bugs with open-source nagios projects, then I am happy to help! Please pm or use other communication to alert me to issues as I no longer track the forum.
pkarr

Re: performance issues, fork error

Post by pkarr »

Argh! It's back!
We had a repeat of the fork errors last night. There were 4,100 of them in a 24-hour period, twice what we had last time. An apply config got things running again.
Same error as before, but here's a fresh cut and paste:

Nov 13 22:28:57 LkennagiosP01 nagios: Warning: The check of service 'APC UPS Load' on host 'POS-UPS-PBX-10-2' could not be performed due to a fork() error: 'Resource temporarily unavailable'. The check will be rescheduled.
Nov 13 22:28:57 LkennagiosP01 nagios: Warning: The check of service 'APC UPS Load' on host 'BTR-UPS-MDF-B-1' could not be performed due to a fork() error: 'Resource temporarily unavailable'. The check will be rescheduled.
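
To see how a burst like this builds up over time, the warnings can be tallied by hour. A minimal Python sketch, assuming the classic syslog layout shown in the two lines above (`Mon DD HH:MM:SS host ...`); the function names and the bucket label format are illustrative and not part of Nagios.

```python
import re
from collections import Counter

# Start of a classic syslog line, e.g. "Nov 13 22:28:57 host ..."
STAMP = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}):\d{2}:\d{2}\s")


def fork_errors_per_hour(lines):
    """Count 'fork() error' warnings per 'Mon DD HH:00' bucket."""
    counts = Counter()
    for line in lines:
        if "fork() error" not in line:
            continue  # not one of the warnings we are interested in
        match = STAMP.match(line)
        if match:
            month, day, hour = match.groups()
            counts[f"{month} {day} {hour}:00"] += 1
    return counts


def fork_errors_in_file(path):
    """Convenience wrapper: tally a whole log file (path is an assumption)."""
    with open(path, errors="replace") as handle:
        return fork_errors_per_hour(handle)
```

Calling `fork_errors_in_file("/var/log/messages")` and sorting the result should show whether the errors arrive in one spike or ramp up gradually.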


There were no errors regarding ndo2db, but there were lots of check file droppings in /tmp and several unconfigured objects. I have removed both of these things.

As you would expect, we have a renewed interest in fixing this. So, to answer your questions, Spenser:
Most of our checks run on a 5 min interval.

[root@LkennagiosP01 ~]# ulimit -u
4096
Should I increase it? No, I don't know how many procs were running at the time this occurred. I will be keeping a watch on it today though.
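
One thing worth checking alongside this: `ulimit -u` run from a root shell reports the limit of that shell, which is not necessarily the limit the running nagios daemon inherited. On Linux, the limits a process actually runs under are listed in `/proc/<pid>/limits`. A minimal Python sketch for reading the "Max processes" row, assuming that file's usual column layout; the function names are illustrative, and the PID of the nagios process has to be looked up separately.

```python
def max_processes(limits_text):
    """Return (soft, hard) from the 'Max processes' row of /proc/<pid>/limits.

    Each value is an int, or None where the kernel reports 'unlimited'.
    """
    label = "Max processes"
    for line in limits_text.splitlines():
        if line.startswith(label):
            soft, hard = line[len(label):].split()[:2]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max processes' row found")


def max_processes_for_pid(pid):
    """Read the limits table of a running process (Linux /proc only)."""
    with open(f"/proc/{pid}/limits") as handle:
        return max_processes(handle.read())
```

If the soft value returned for the nagios process is lower than what the root shell shows, raising the limit only for root would not help the monitoring engine.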

thank you,

abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: performance issues, fork error

Post by abrist »

Yeah, increase it to 8000 or so. And let us know if this happens again . . .
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
pkarr

Re: performance issues, fork error

Post by pkarr »

Thank you. I have already done so on my test server and will schedule it on the production server.

We have some additional questions.

My boss, who is becoming an avid reader of the event logs, noticed that there were a lot of socket timeouts (~1,500) right before the fork errors. Do you think the spike in socket timeouts contributed to the fork errors? Also, once the fork errors start, are the checks rescheduled as the error message says? It seems like it would compound the problem.

OK, now here's my question. Is the max number of user processes as shown by ulimit -u reflected in the check_procs command that we run on the nagios server?

[root@LkennagiosP01 libexec]# ./check_procs
PROCS OK: 674 processes

[root@LkennagiosP01 libexec]# ./check_procs -w 750 -c 1000 -s RSZDT
PROCS OK: 591 processes with STATE = RSZDT
[root@LkennagiosP01 libexec]#
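
When comparing those figures with `ulimit -u`, keep in mind that they may not measure quite the same thing: `check_procs` with no filters counts processes for all users, while the max-user-processes limit applies per user and, on Linux, counts threads as well. A rough Python sketch that tallies tasks per user, assuming it is fed the output of `ps -eLo user=` (one line per thread, no header); the function names and the limit passed in are illustrative.

```python
from collections import Counter


def tasks_per_user(ps_output):
    """Tally lines of `ps -eLo user=` output: one line per thread."""
    names = (line.strip() for line in ps_output.splitlines())
    return Counter(name for name in names if name)


def headroom(ps_output, user, limit):
    """Tasks `user` could still create before reaching `limit`."""
    return limit - tasks_per_user(ps_output)[user]
```

Capturing the `ps` output every minute or so and logging `headroom(output, "nagios", 4096)` would show how close the nagios user gets to the limit before the fork errors start.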

thank you,
Penny
abrist

Re: performance issues, fork error

Post by abrist »

pkarr wrote: Do you think the spike in socket timeouts contributed to the fork errors?
Yes, because a timing out check will stay open longer than a successful check.
pkarr wrote: Also once the fork errors start, are the checks rescheduled as the error message says? It seems like it would compound the problem.
They are rescheduled, which can snowball into most of your checks getting scheduled at nearly the same time, compounding the number of fork errors.
pkarr wrote: Is the max number of user processes as shown by ulimit -u reflected in the check_procs command that we run on the nagios server?
They should be relatively close.
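
The first two answers can be made concrete with a back-of-the-envelope estimate based on Little's law (average number in flight = arrival rate x time each item stays open). A minimal Python sketch: the 5-minute interval and the 30-second timeout come from this thread, while the total of 6000 checks and the 2-second normal runtime are made-up figures for illustration only.

```python
def checks_in_flight(total_checks, interval_s, avg_duration_s):
    """Little's law: average concurrency = arrival rate * time in system."""
    arrival_rate = total_checks / interval_s  # checks started per second
    return arrival_rate * avg_duration_s


# Hypothetical installation: 6000 checks, each run once per 300 seconds.
healthy = checks_in_flight(6000, 300, 2)      # every check answers in 2 s
timing_out = checks_in_flight(6000, 300, 30)  # every check hits the 30 s timeout
print(healthy, timing_out)
```

With those assumed numbers, about 40 checks are open at any moment when everything answers quickly, versus 600 if every check ran into the timeout. Each open check holds at least one process, so a wave of timeouts alone can eat a large share of a per-user process limit.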