<<<R3.3-Monitoring of hosts nodes has not been stable>>>

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
uhiadmin
Posts: 85
Joined: Sat Jan 15, 2011 9:01 am

<<<R3.3-Monitoring of hosts nodes has not been stable>>>

Post by uhiadmin »

Linux distribution and version:
1. CentOS 6.2, 64-bit
2. Manual install
3. GNOME installed
4. Update version: R3.3

To the Nagios XI Tech Support Specialist:
I applied the R3.3 update, and monitoring of host nodes has not been stable since. This is what I have observed since the update:

Total Processes Critical 1/4 2012-8-28: 2848 processes with a STATE=RSZDT


Hopefully the attached config snapshot will reveal the issue.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: <<<R3.3-Monitoring of hosts nodes has not been stable>>>

Post by scottwilkerson »

That's a lot of processes...

Can you run the following:

Code:

ps -ef > /tmp/procs.txt
and then attach /tmp/procs.txt
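
If the count climbs again, a per-state breakdown may help show which kind of process is piling up. The alert text appears to count processes in any of the states R, S, Z, D and T, so a rough sketch along these lines could be run alongside the `ps -ef` dump. It assumes the procps `ps` found on Linux; `count_by_state` is a hypothetical helper name, not something shipped with Nagios XI.

```shell
# Count processes per state letter.
# Reads `ps` STAT values (e.g. "Ss", "R+", "Z") on stdin, one per line,
# and prints "<state> <count>" sorted by state letter.
count_by_state() {
    awk 'NF { n[substr($1, 1, 1)]++ }
         END { for (s in n) print s, n[s] }' | LC_ALL=C sort
}

# Example usage on a live system (assumes procps ps):
#   ps -eo stat= | wc -l          # total number of processes
#   ps -eo stat= | count_by_state # breakdown by first state letter
```

A large number under Z (zombie) or D (uninterruptible sleep) would point in a different direction than a large number under S.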

thanks
Former Nagios employee
Creator:
Human Design Website
Get Your Human Design Chart
uhiadmin
Posts: 85
Joined: Sat Jan 15, 2011 9:01 am

Re: <<<R3.3-Monitoring of hosts nodes has not been stable>>>

Post by uhiadmin »

Here is the file that you asked me to attach.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: <<<R3.3-Monitoring of hosts nodes has not been stable>>>

Post by scottwilkerson »

The problem must have cleared up because there are only 267 processes in that file...
uhiadmin
Posts: 85
Joined: Sat Jan 15, 2011 9:01 am

Re: <<<R3.3-Monitoring of hosts nodes has not been stable>>>

Post by uhiadmin »

This message popped up after applying the upgrade:


[root@fldmon ~]# service ndo2db stop
Stopping ndo2db: head: cannot open `/usr/local/nagios/var/ndo2db.lock' for reading: No such file or directory
done.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: <<<R3.3-Monitoring of hosts nodes has not been stable>>>

Post by scottwilkerson »

Was ndo2db running when you ran this stop command?
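
One way to check this before the next stop is to look at the lock file and the process list together. The sketch below is only an illustration: it assumes the first line of the lock file holds the daemon's PID (the `head` in the error message suggests the init script reads it that way), and `lock_status` is a hypothetical helper name. Run it as root so that `kill -0` can probe the process.

```shell
# Report the state of a PID-style lock file.
# Prints one of: missing, stale, running.
lock_status() {
    lock_file=$1
    if [ ! -f "$lock_file" ]; then
        echo missing
        return
    fi
    # Assumption: the first line of the lock file is the daemon's PID.
    pid=$(head -n 1 "$lock_file")
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo running
    else
        echo stale
    fi
}

# Example usage (path taken from the error message above):
#   lock_status /usr/local/nagios/var/ndo2db.lock
#   ps -ef | grep '[n]do2db'
```

If this prints "missing" while `ps` still lists ndo2db processes, the daemon was running without its lock file, which would explain the `head` error during the stop.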
uhiadmin
Posts: 85
Joined: Sat Jan 15, 2011 9:01 am

Re: <<<R3.3-Monitoring of hosts nodes has not been stable>>>

Post by uhiadmin »

Scott,
I believe ndo2db was running. I had to stop all services in order to perform a MySQL repair. The reason for the repair is that we are experiencing high CPU load. Critical notifications on the services side go beyond 2. This is when we see a spike in down host nodes and the "I/O Wait" status goes into the red. My understanding is that we need to offload the MySQL database. We have 3,364 host nodes and 75 services. Hopefully I am explaining this clearly.
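
For reference, the I/O wait figure can be cross-checked from the shell. This is a minimal sketch, assuming a Linux `/proc/stat` whose aggregate `cpu` line lists user, nice, system, idle, iowait, irq, softirq and steal in that order; `iowait_percent` is a hypothetical helper name. The number it prints is the average since boot, not the current value, so it is only a coarse indicator.

```shell
# Print the I/O wait share (percent of CPU time since boot).
# Reads /proc/stat-style text on stdin and uses the aggregate "cpu" line.
iowait_percent() {
    awk '$1 == "cpu" {
        total = 0
        # Fields 2..9: user nice system idle iowait irq softirq steal
        for (i = 2; i <= 9 && i <= NF; i++) total += $i
        if (total > 0) printf "%.1f\n", 100 * $6 / total
    }'
}

# Example usage on a live system:
#   iowait_percent < /proc/stat
#   uptime    # load averages, for comparison with the Current Load check
```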
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: <<<R3.3-Monitoring of hosts nodes has not been stable>>>

Post by lmiltchev »

Is it stable now, after the mysql repair? Anything interesting in the system log?

Code:

tail /var/log/messages
If you decide to offload mysql, here is the document you need to review:

http://assets.nagios.com/downloads/nagi ... Server.pdf

Hope this helps.
Be sure to check out our Knowledgebase for helpful articles and solutions!
uhiadmin
Posts: 85
Joined: Sat Jan 15, 2011 9:01 am

Re: <<<R3.3-Monitoring of hosts nodes has not been stable>>>

Post by uhiadmin »

This is all that I am getting from the command:
Last login: Sun Sep 2 17:14:00 2012
[root@fldmon ~]# tail /var/log/messages
Sep 5 17:54:17 fldmon nagios: HOST ALERT: 747054SW01;DOWN;SOFT;1;CRITICAL - 10.47.54.2: rta nan, lost 100%
Sep 5 17:54:27 fldmon nagios: SERVICE ALERT: localhost;Current Load;CRITICAL;HARD;4;CRITICAL - load average: 6.00, 5.21, 4.29
Sep 5 17:54:29 fldmon nagios: HOST ALERT: 747054SW01;UP;SOFT;2;OK - 10.47.54.2: rta 46.917ms, lost 0%
Sep 5 17:55:26 fldmon nagios: HOST ALERT: 747054SW01;DOWN;SOFT;1;CRITICAL - 10.47.54.2: rta nan, lost 100%
Sep 5 17:55:26 fldmon nagios: HOST ALERT: 747054;DOWN;SOFT;1;CRITICAL - 10.47.54.1: rta nan, lost 100%
Sep 5 17:55:29 fldmon nagios: HOST ALERT: 747054;UP;SOFT;2;OK - 10.47.54.1: rta 45.857ms, lost 0%
Sep 5 17:55:29 fldmon nagios: HOST ALERT: 747054SW01;UP;SOFT;2;OK - 10.47.54.2: rta 46.944ms, lost 0%
Sep 5 17:55:29 fldmon nagios: HOST FLAPPING ALERT: 747054SW01;STARTED; Host appears to have started flapping (23.2% change > 20.0% threshold)
Sep 5 17:55:50 fldmon nagios: HOST ALERT: 747054SW01;DOWN;SOFT;1;CRITICAL - 10.47.54.2: rta nan, lost 100%
Sep 5 17:56:02 fldmon nagios: HOST ALERT: 747054SW01;UP;SOFT;2;OK - 10.47.54.2: rta 46.542ms, lost 20%
[root@fldmon ~]#
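
Ten lines only cover about two minutes, so it may be more useful to tally DOWN alerts per host over the whole log. The sketch below assumes the syslog line format shown above (`nagios: HOST ALERT: <host>;<state>;...`); `count_down_alerts` is a hypothetical helper name.

```shell
# Count DOWN host alerts per host.
# Reads syslog lines on stdin and prints "<count> <host>", busiest host first.
count_down_alerts() {
    awk -F';' '/nagios: HOST ALERT: / && $2 == "DOWN" {
        host = $1
        # Strip the timestamp and prefix, leaving only the host name.
        sub(/.*HOST ALERT: /, "", host)
        n[host]++
    }
    END { for (h in n) print n[h], h }' | sort -rn
}

# Example usage:
#   count_down_alerts < /var/log/messages | head -n 20
```

On the ten lines posted above this would report 3 DOWN alerts for 747054SW01 and 1 for 747054. If the same few hosts dominate the full log, the flapping is more likely a network path issue than load on the monitoring server.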
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: <<<R3.3-Monitoring of hosts nodes has not been stable>>>

Post by scottwilkerson »

Has the machine stabilized since the repair and re-start of ndo2db?