
Re: Performance Issues / fork() errors

Posted: Mon Feb 18, 2013 12:46 pm
by abrist
chrisp wrote: I'll let you know tomorrow, when we put it "live"...
I have to chuckle, as you have used the base CentOS install to more or less set up a netboot environment for the network install. I still think you were close to having the previous install (with new kernel) working, but sometimes fixing is not faster. Let us know how it goes, and don't hesitate to ask for help if needed.

Re: Performance Issues / fork() errors

Posted: Mon Feb 18, 2013 1:02 pm
by chrisp
Well, the PXE boot wasn't referring to any dodgy partitions, so it went fine. I agree, I was SO close, but the number of things I tried just to get it to auto-boot from its disk was mental. I think the kernel panics were related to /etc/fstab not being right, but hey ho, I'm back on top now.

Re: Performance Issues / fork() errors

Posted: Mon Feb 18, 2013 1:06 pm
by abrist
Fantastic. And with a real kernel to boot!

Re: Performance Issues / fork() errors

Posted: Wed Feb 20, 2013 7:45 am
by chrisp
# uname -a
Linux Nagios 2.6.32-279.22.1.el6.x86_64 #1 SMP Wed Feb 6 03:10:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

We've backed up our old server & restored it onto the new one. The system is up and running (in parallel), with just a few service & host checks not working, but that's only because the monitored hosts don't yet trust the new server's IP, or DNS needs a tweak... Here's a humorous image to show the difference between the old host (16GB RAM, 4 CPU Cores & Software RAID1 HDDs) & the new host (32GB RAM, 8 CPU Cores & Software RAID1 SSDs): -
[Image: OldNagiosVsNewNagios.png]
However, I've seen some issues with ndo2db & nagios starting up & rrdcached segfaulting on boot...

This is how it looks on clean boot: -

Code: Select all

root@Nagios:~# for SERVICE in nagios ndo2db mysqld postgresql rrdcached npcd ; do echo -n "${SERVICE}: " ; service ${SERVICE} status ; done
nagios: No lock file found in /usr/local/nagios/var/nagios.lock
ndo2db: ndo2db is not running but subsystem locked
mysqld: mysqld (pid  2066) is running...
postgresql: postmaster (pid  2103) is running...
rrdcached: rrdcached is stopped
npcd: NPCD running (pid 2277).
They're all set to start mostly as I'd expect: -

Code: Select all

# for SERVICE in nagios ndo2db mysqld postgresql rrdcached npcd ; do echo -n "${SERVICE}: " ; chkconfig --list ${SERVICE} ; done          
nagios: nagios          0:off   1:off   2:on    3:on    4:on    5:on    6:off
ndo2db: ndo2db          0:off   1:off   2:on    3:on    4:on    5:on    6:off
mysqld: mysqld          0:off   1:off   2:on    3:on    4:on    5:on    6:off
postgresql: postgresql          0:off   1:off   2:on    3:on    4:on    5:on    6:off
rrdcached: rrdcached            0:off   1:off   2:on    3:on    4:on    5:on    6:off
npcd: npcd              0:off   1:off   2:off   3:on    4:off   5:on    6:off
rrdcached, on boot:

Code: Select all

Feb 19 16:23:25 Nagios abrtd: Directory 'ccpp-2013-02-19-16:23:25-2255' creation detected
Feb 19 16:23:25 Nagios abrtd: Executable '/usr/bin/rrdcached' doesn't belong to any package
Feb 19 16:23:25 Nagios abrtd: 'post-create' on '/var/spool/abrt/ccpp-2013-02-19-16:23:25-2255' exited with 1
Feb 19 16:23:25 Nagios abrtd: Corrupted or bad directory /var/spool/abrt/ccpp-2013-02-19-16:23:25-2255, deleting
on "service rrdcached restart"

Code: Select all

Feb 19 16:57:06 Nagios rrdcached[19201]: starting up
Feb 19 16:57:06 Nagios rrdcached[19201]: checking for journal files
Feb 19 16:57:06 Nagios rrdcached[19201]: journal processing complete
Feb 19 16:57:06 Nagios rrdcached[19201]: listening for connections
If I just manually restart rrdcached, ndo2db & nagios, they all start OK. Maybe it's some sort of start-order issue? Any clues welcome.
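
One way to test the start-order theory is to look at the S links in the runlevel directory, since SysV init runs them in name order (S, then a two-digit priority, then the service name). The following is only a minimal sketch, assuming a standard CentOS 6 style layout under /etc/rc.d/rc3.d; `start_order` is a hypothetical helper written for this illustration, not part of chkconfig or Nagios.

```shell
#!/bin/sh
# start_order RC_DIR SERVICE...
# Print the start links (S<NN><service>) found in RC_DIR for the named
# services, sorted in the order init would run them.
# NOTE: hypothetical helper for illustration only.
start_order() {
    rc_dir=$1
    shift
    for service in "$@"; do
        # An unmatched glob stays literal, so skip anything that does not exist.
        for link in "${rc_dir}"/S[0-9][0-9]"${service}"; do
            [ -e "${link}" ] || [ -L "${link}" ] || continue
            basename "${link}"
        done
    done | LC_ALL=C sort
}

# Example call (the runlevel 3 path is an assumption about this host):
# start_order /etc/rc.d/rc3.d nagios ndo2db mysqld postgresql rrdcached npcd
```

If nagios or ndo2db turn out to sort ahead of mysqld, that would be consistent with a start-order problem, although it would not prove it.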

Re: Performance Issues / fork() errors

Posted: Wed Feb 20, 2013 11:30 am
by scottwilkerson
That is quite the performance increase!

Here's what we have by default; it is slightly different:

Code: Select all

nagios: nagios          0:off   1:off   2:off   3:on    4:off   5:on    6:off
ndo2db: ndo2db          0:off   1:off   2:off   3:on    4:off   5:on    6:off
mysqld: mysqld          0:off   1:off   2:off   3:on    4:off   5:on    6:off
postgresql: postgresql          0:off   1:off   2:off   3:on    4:off   5:on    6:off
rrdcached: rrdcached            0:off   1:off   2:off   3:on    4:off   5:on    6:off
npcd: npcd              0:off   1:off   2:off   3:on    4:off   5:on    6:off

Re: Performance Issues / fork() errors

Posted: Wed Feb 20, 2013 11:43 am
by chrisp
The performance increase is even more impressive when you know that the old host has "interval_length=180" in nagios.cfg, in order to cope at all.

I did "chkconfig <service> on", for postgresql, ndo2db & nagios, just in case that was an issue, so that explains the difference I think.

Re: Performance Issues / fork() errors

Posted: Wed Feb 20, 2013 12:49 pm
by abrist
If you make the rc changes, do you still experience the race conditions?

Re: Performance Issues / fork() errors

Posted: Wed Feb 20, 2013 2:18 pm
by chrisp
Yes.

I did: -

Code: Select all

for SERVICE in nagios ndo2db postgresql rrdcached mysqld npcd ; do echo -n "${SERVICE}: " ; chkconfig ${SERVICE} off ; done
then: -

Code: Select all

for SERVICE in nagios ndo2db postgresql rrdcached mysqld npcd ; do echo -n "${SERVICE}: " ; chkconfig --levels 35 ${SERVICE} on ; done
After reboot: -

Code: Select all

# for SERVICE in nagios ndo2db postgresql rrdcached mysqld npcd ; do echo -n "${SERVICE}: " ; service ${SERVICE} status ; done      
nagios: No lock file found in /usr/local/nagios/var/nagios.lock
ndo2db: ndo2db is not running but subsystem locked
postgresql: postmaster (pid  2098) is running...
rrdcached: rrdcached is stopped
mysqld: mysqld (pid  2061) is running...
npcd: NPCD running (pid 2272).

Re: Performance Issues / fork() errors

Posted: Wed Feb 20, 2013 2:25 pm
by abrist
Hmmm. If we cannot get these conditions resolved, you may have to scrape together a custom init script to fire things up in the proper order.

Re: Performance Issues / fork() errors

Posted: Wed Feb 20, 2013 4:33 pm
by chrisp
You're right, I need to get this implementation live sooner rather than later!

I knocked up a script to kick the 3 problem services after a guesstimated safe time, run from /etc/rc.local:

Code: Select all

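#!/bin/sh
# weirdNagiosStartupProblemFix
# Run from /etc/rc.local: restart the three services that fail on a clean boot.

# Give the rest of the boot sequence time to finish (15 seconds is a guess).
sleep 15s

for SERVICE in rrdcached ndo2db nagios
do
    /sbin/service "${SERVICE}" restart
done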
A reboot later and the system is up and running unattended and without manual intervention.
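
A fixed sleep works, but it is a guess. A possible refinement is to poll until a dependency reports ready and only then restart the three services. The sketch below assumes the failures come from something like mysqld not being up yet, which the thread does not confirm; `wait_for` is a hypothetical helper, and a "running" status does not guarantee the database is already accepting connections.

```shell
#!/bin/sh
# wait_for TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-run COMMAND about once per second until it succeeds (return 0)
# or TIMEOUT_SECONDS have elapsed (return 1).
# NOTE: hypothetical helper for illustration only.
wait_for() {
    timeout=$1
    shift
    elapsed=0
    until "$@" >/dev/null 2>&1; do
        if [ "${elapsed}" -ge "${timeout}" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Possible use from /etc/rc.local (the choice of mysqld and the 60 second
# limit are assumptions, not something established in this thread):
# if wait_for 60 /sbin/service mysqld status; then
#     for SERVICE in rrdcached ndo2db nagios; do
#         /sbin/service "${SERVICE}" restart
#     done
# fi
```

The restart loop is the same as in the script above; only the waiting strategy changes, so the boot is held up no longer than necessary and still gives up after the timeout.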