I have to chuckle, as you have used the base CentOS install to more or less set up a netboot environment for the network install. I still think you were close to having the previous install (with new kernel) working, but sometimes fixing is not faster. Let us know how it goes, and don't hesitate to ask for help if needed.
chrisp wrote: I'll let you know tomorrow, when we put it "live"...
Performance Issues / fork() errors
Re: Performance Issues / fork() errors
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
Re: Performance Issues / fork() errors
Well, the PXE boot wasn't referring to any dodgy partitions, so it went fine. I agree, I was SO close, but the number of things I tried just to get it to auto-boot onto its disk was mental. I think the kernel panics were related to /etc/fstab not being right, but hey ho, I'm back on top now.
Re: Performance Issues / fork() errors
Fantastic. And with a real kernel to boot!
Former Nagios employee
Re: Performance Issues / fork() errors
# uname -a
Linux Nagios 2.6.32-279.22.1.el6.x86_64 #1 SMP Wed Feb 6 03:10:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
We've backed up our old server & restored it onto the new one. The system is up and running (in parallel), with just a few service & host checks not working, but that's just because they don't yet trust the new server's IP, or DNS needs a tweak... Here's a humorous image to show the difference between the old host (16GB RAM, 4 CPU Cores & Software RAID1 HDDs) & the new host (32GB RAM, 8 CPU Cores & Software RAID1 SSDs): -
However, I've seen some issues with ndo2db & nagios starting up & rrdcached segfaulting on boot...
This is how it looks on clean boot: -
Code: Select all
root@Nagios:~# for SERVICE in nagios ndo2db mysqld postgresql rrdcached npcd ; do echo -n "${SERVICE}: " ; service ${SERVICE} status ; done
nagios: No lock file found in /usr/local/nagios/var/nagios.lock
ndo2db: ndo2db is not running but subsystem locked
mysqld: mysqld (pid 2066) is running...
postgresql: postmaster (pid 2103) is running...
rrdcached: rrdcached is stopped
npcd: NPCD running (pid 2277).
They're all set to start mostly as I'd expect: -
Code: Select all
# for SERVICE in nagios ndo2db mysqld postgresql rrdcached npcd ; do echo -n "${SERVICE}: " ; chkconfig --list ${SERVICE} ; done
nagios: nagios 0:off 1:off 2:on 3:on 4:on 5:on 6:off
ndo2db: ndo2db 0:off 1:off 2:on 3:on 4:on 5:on 6:off
mysqld: mysqld 0:off 1:off 2:on 3:on 4:on 5:on 6:off
postgresql: postgresql 0:off 1:off 2:on 3:on 4:on 5:on 6:off
rrdcached: rrdcached 0:off 1:off 2:on 3:on 4:on 5:on 6:off
npcd: npcd 0:off 1:off 2:off 3:on 4:off 5:on 6:off
rrdcached
on boot
Code: Select all
Feb 19 16:23:25 Nagios abrtd: Directory 'ccpp-2013-02-19-16:23:25-2255' creation detected
Feb 19 16:23:25 Nagios abrtd: Executable '/usr/bin/rrdcached' doesn't belong to any package
Feb 19 16:23:25 Nagios abrtd: 'post-create' on '/var/spool/abrt/ccpp-2013-02-19-16:23:25-2255' exited with 1
Feb 19 16:23:25 Nagios abrtd: Corrupted or bad directory /var/spool/abrt/ccpp-2013-02-19-16:23:25-2255, deleting
on "service rrdcached restart"
Code: Select all
Feb 19 16:57:06 Nagios rrdcached[19201]: starting up
Feb 19 16:57:06 Nagios rrdcached[19201]: checking for journal files
Feb 19 16:57:06 Nagios rrdcached[19201]: journal processing complete
Feb 19 16:57:06 Nagios rrdcached[19201]: listening for connections
If I just manually restart rrdcached, ndo2db & nagios, they all start OK. Maybe it's some sort of start-order issue? Any clues welcome.
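One way to look into the start-order theory is to compare the start priorities of the init links in the runlevel directory (the NN in SNNname; lower numbers start first). The following is only a rough sketch: start_order is a made-up helper name, and it assumes the usual SysV layout where /etc/rc3.d holds the S links for runlevel 3.

```shell
#!/bin/sh
# start_order RCDIR SERVICE...   (hypothetical helper, not part of chkconfig)
# Prints "NN service" for every SNNservice link found in RCDIR, sorted so the
# earliest-starting service comes first. Services with no S link are printed
# as "-- service".
start_order() {
    rcdir=$1
    shift
    for svc in "$@"; do
        found=no
        for link in "$rcdir"/S[0-9][0-9]"$svc"; do
            # An unmatched glob stays literal, so skip anything that
            # is neither an existing file nor a (possibly dangling) symlink.
            [ -e "$link" ] || [ -L "$link" ] || continue
            name=${link##*/}      # e.g. S98rrdcached
            prio=${name#S}        # e.g. 98rrdcached
            prio=${prio%"$svc"}   # e.g. 98
            echo "$prio $svc"
            found=yes
        done
        [ "$found" = yes ] || echo "-- $svc"
    done | LC_ALL=C sort
}

# Example for runlevel 3:
# start_order /etc/rc3.d nagios ndo2db mysqld postgresql rrdcached npcd
```

If two services share the same number, they are started in alphabetical order of the link name, which may or may not be the order they actually need.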
scottwilkerson (DevOps Engineer, Nagios Enterprises)
Re: Performance Issues / fork() errors
That is quite the performance increase!
Here's what we have by default, it is slightly different
Code: Select all
nagios: nagios 0:off 1:off 2:off 3:on 4:off 5:on 6:off
ndo2db: ndo2db 0:off 1:off 2:off 3:on 4:off 5:on 6:off
mysqld: mysqld 0:off 1:off 2:off 3:on 4:off 5:on 6:off
postgresql: postgresql 0:off 1:off 2:off 3:on 4:off 5:on 6:off
rrdcached: rrdcached 0:off 1:off 2:off 3:on 4:off 5:on 6:off
npcd: npcd 0:off 1:off 2:off 3:on 4:off 5:on 6:off
Re: Performance Issues / fork() errors
The performance increase is even more impressive when you know that the old host has "interval_length=180" in nagios.cfg, in order to cope at all.
I did "chkconfig <service> on", for postgresql, ndo2db & nagios, just in case that was an issue, so that explains the difference I think.
Re: Performance Issues / fork() errors
If you make the rc changes, do you still experience the race conditions?
Former Nagios employee
Re: Performance Issues / fork() errors
Yes.
I did: -
Code: Select all
for SERVICE in nagios ndo2db postgresql rrdcached mysqld npcd ; do echo -n "${SERVICE}: " ; chkconfig ${SERVICE} off ; done
then: -
Code: Select all
for SERVICE in nagios ndo2db postgresql rrdcached mysqld npcd ; do echo -n "${SERVICE}: " ; chkconfig --levels 35 ${SERVICE} on ; done
After reboot: -
Code: Select all
# for SERVICE in nagios ndo2db postgresql rrdcached mysqld npcd ; do echo -n "${SERVICE}: " ; service ${SERVICE} status ; done
nagios: No lock file found in /usr/local/nagios/var/nagios.lock
ndo2db: ndo2db is not running but subsystem locked
postgresql: postmaster (pid 2098) is running...
rrdcached: rrdcached is stopped
mysqld: mysqld (pid 2061) is running...
npcd: NPCD running (pid 2272).
Re: Performance Issues / fork() errors
Hmmm. If we cannot get these conditions resolved, you may have to scrape together a custom init script to fire things up in the proper order.
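As a rough idea of what such a custom init script could look like, here is a minimal sketch. The script name (nagios-stack), the chkconfig priorities, and the service order are all assumptions for illustration, not something shipped with Nagios, and the individual services would presumably need to be switched off in chkconfig so they are not started twice.

```shell
#!/bin/sh
#
# nagios-stack   Start the three problem services in a fixed order.
#
# chkconfig: 35 99 01
# description: Hypothetical wrapper that starts rrdcached, ndo2db and \
#              nagios one after another. Name and priorities are assumptions.

# Overridable so the ordering can be tried out without touching real
# services, e.g. SERVICE_CMD=echo.
SERVICE_CMD=${SERVICE_CMD:-/sbin/service}
STACK="rrdcached ndo2db nagios"

# run_stack ACTION "svc1 svc2 ...": run ACTION on each service in turn.
# Returns 1 if any of them failed, but still tries the rest.
run_stack() {
    action=$1
    order=$2
    rc=0
    for svc in $order; do
        $SERVICE_CMD "$svc" "$action" || rc=1
    done
    return $rc
}

# reversed "a b c" prints "c b a", so stop runs in the opposite order.
reversed() {
    out=
    for svc in $1; do
        out="$svc $out"
    done
    echo "${out% }"
}

main() {
    case "$1" in
        start)   run_stack start "$STACK" ;;
        stop)    run_stack stop "$(reversed "$STACK")" ;;
        restart) main stop; main start ;;
        status)  run_stack status "$STACK" ;;
        *)       echo "Usage: $0 {start|stop|restart|status}" >&2; return 2 ;;
    esac
}

# Dispatch only when an action was passed, so the file can also be sourced.
[ $# -eq 0 ] || main "$@"
```

Running it with SERVICE_CMD=echo prints the commands it would issue, which is a cheap way to check the order before registering it with chkconfig --add.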
Former Nagios employee
Re: Performance Issues / fork() errors
You're right, I need to get this implementation live sooner rather than later!
I knocked up a script to kick the 3 problem services after a guesstimated safe time, run from /etc/rc.local
Code: Select all
#!/bin/sh
# weirdNagiosStartupProblemFix
# Give the normal init sequence time to settle before kicking the services.
sleep 15s
# Restart the three services that do not come up cleanly on boot.
for SERVICE in rrdcached ndo2db nagios
do
    /sbin/service ${SERVICE} restart
done
A reboot later and the system is up and running unattended and without manual intervention.
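A possible refinement of the fixed sleep is to poll for some readiness condition and give up after a timeout. This is only a sketch: wait_for is a made-up helper, and the condition shown in the usage comment (mysqld reporting as running) is an assumption, since the thread never establishes what the three services are actually waiting on.

```shell
#!/bin/sh
# wait_for TIMEOUT COMMAND [ARGS...]   (hypothetical helper)
# Re-runs COMMAND about once a second until it succeeds or TIMEOUT seconds
# have been waited. Returns 0 on success, 1 on timeout.
wait_for() {
    timeout=$1
    shift
    waited=0
    until "$@" >/dev/null 2>&1; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Possible use from /etc/rc.local (the mysqld check is an assumed condition):
# if wait_for 60 /sbin/service mysqld status; then
#     for SERVICE in rrdcached ndo2db nagios; do
#         /sbin/service ${SERVICE} restart
#     done
# fi
```

Compared with a plain sleep, this returns as soon as the condition holds and fails visibly if it never does, at the cost of having to pick a condition that really reflects readiness.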