<<<R3.3-Monitoring of hosts nodes has not been stable>>>
Linux Distribution and version?
1. Cent OS 6.2 / 64 Bit
2. Manual Install
3. Gnome installed
4. Update Version R3.3
To Nagios XI Tech Support Specialist:
I applied the R3.3 update, and monitoring of host nodes has not been stable since. This is what I have observed since the R3.3 update:
Total Processes Critical 1/4 2012-8-28: 2848 processes with a STATE=RSZDT
Hopefully, the attached config snapshot might reveal the issue.
-
scottwilkerson
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: <<<R3.3-Monitoring of hosts nodes has not been stable>>>
That's a lot of processes...
Can you run the following
    ps -ef > /tmp/procs.txt
and then attach /tmp/procs.txt
thanks
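As a complement to the full listing, here is a minimal sketch for tallying processes by state, which may show whether a single state (for example Z, zombie) accounts for most of the count reported by the Total Processes check. The helper name count_by_state is hypothetical and not part of Nagios XI; it assumes one STAT value per line on standard input, as printed by the procps version of ps.

```shell
# Hypothetical helper: tally processes by the first letter of their state.
# Reads one STAT value per line (e.g. "Ss", "R+", "Z") from standard input
# and prints "<state> <count>" lines, sorted by state letter.
count_by_state() {
    awk '{ count[substr($1, 1, 1)]++ }
         END { for (s in count) print s, count[s] }' | sort
}
```

On the monitoring server it could be fed with: ps -eo stat= | count_by_state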
Re: <<<R3.3-Monitoring of hosts nodes has not been stable>>>
Here is the file that you asked me to attach.
scottwilkerson
Re: <<<R3.3-Monitoring of hosts nodes has not been stable>>>
The problem must have cleared up because there are only 267 processes in that file...
Re: <<<R3.3-Monitoring of hosts nodes has not been stable>>>
This message popped up after applying the upgrade:
[root@fldmon ~]# service ndo2db stop
Stopping ndo2db: head: cannot open `/usr/local/nagios/var/ndo2db.lock' for reading: No such file or directory
done.
scottwilkerson
Re: <<<R3.3-Monitoring of hosts nodes has not been stable>>>
Was ndo2db running when you ran this stop command?
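One way to answer that before issuing a stop is sketched below. It rests on an assumption inferred only from the head error above: that the init script reads the daemon's PID from the first line of the lock file. The function name ndo2db_lock_state is hypothetical; the default path is the one shown in the error message.

```shell
# Hypothetical helper, not part of the ndo2db init script.
# Assumption: the first line of the lock file holds the daemon's PID.
# Prints "missing", "running" or "stale". Run as root, since kill -0
# fails for processes owned by another user.
ndo2db_lock_state() {
    lock_file=${1:-/usr/local/nagios/var/ndo2db.lock}
    if [ ! -f "$lock_file" ]; then
        echo "missing"
        return 1
    fi
    pid=$(head -n 1 "$lock_file")
    # kill -0 sends no signal; it only tests that the PID exists.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "running"
    else
        echo "stale"
        return 2
    fi
}
```

A result of "missing" would be consistent with the message above: either ndo2db was not running, or its lock file had been removed while it was.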
Re: <<<R3.3-Monitoring of hosts nodes has not been stable>>>
Scott,
I believe ndo2db was running. I had to stop all services in order to perform a mysql repair. The reason for the repair is that we are experiencing high CPU load. Critical notifications on the services side go beyond 2. That is when we see a spike in down host nodes, and the "I/O Wait" status goes into the red. My understanding is that we have to offload the mysql database. We have 3,364 host nodes and 75 services. Hopefully, I am explaining this clearly to you.
Re: <<<R3.3-Monitoring of hosts nodes has not been stable>>>
Is it stable now, after the mysql repair? Anything interesting in the system log?
    tail /var/log/messages
If you decide to offload mysql, here is the document you need to review:
http://assets.nagios.com/downloads/nagi ... Server.pdf
Hope this helps.
Re: <<<R3.3-Monitoring of hosts nodes has not been stable>>>
This is all that I am getting from the command:
Last login: Sun Sep 2 17:14:00 2012
[root@fldmon ~]# tail /var/log/messages
Sep 5 17:54:17 fldmon nagios: HOST ALERT: 747054SW01;DOWN;SOFT;1;CRITICAL - 10.47.54.2: rta nan, lost 100%
Sep 5 17:54:27 fldmon nagios: SERVICE ALERT: localhost;Current Load;CRITICAL;HARD;4;CRITICAL - load average: 6.00, 5.21, 4.29
Sep 5 17:54:29 fldmon nagios: HOST ALERT: 747054SW01;UP;SOFT;2;OK - 10.47.54.2: rta 46.917ms, lost 0%
Sep 5 17:55:26 fldmon nagios: HOST ALERT: 747054SW01;DOWN;SOFT;1;CRITICAL - 10.47.54.2: rta nan, lost 100%
Sep 5 17:55:26 fldmon nagios: HOST ALERT: 747054;DOWN;SOFT;1;CRITICAL - 10.47.54.1: rta nan, lost 100%
Sep 5 17:55:29 fldmon nagios: HOST ALERT: 747054;UP;SOFT;2;OK - 10.47.54.1: rta 45.857ms, lost 0%
Sep 5 17:55:29 fldmon nagios: HOST ALERT: 747054SW01;UP;SOFT;2;OK - 10.47.54.2: rta 46.944ms, lost 0%
Sep 5 17:55:29 fldmon nagios: HOST FLAPPING ALERT: 747054SW01;STARTED; Host appears to have started flapping (23.2% change > 20.0% threshold)
Sep 5 17:55:50 fldmon nagios: HOST ALERT: 747054SW01;DOWN;SOFT;1;CRITICAL - 10.47.54.2: rta nan, lost 100%
Sep 5 17:56:02 fldmon nagios: HOST ALERT: 747054SW01;UP;SOFT;2;OK - 10.47.54.2: rta 46.542ms, lost 20%
[root@fldmon ~]#
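Since ten lines of tail output cover only a couple of minutes, a rough sketch for summarizing a longer stretch of the log may help show which hosts change state most often. The function name alerts_per_host is hypothetical; it assumes the "nagios: HOST ALERT: host;state;..." line format shown in the output above and ignores SERVICE ALERT and HOST FLAPPING ALERT lines.

```shell
# Hypothetical helper: count HOST ALERT lines per host name.
# Reads syslog lines from standard input and prints "<count> <host>",
# highest count first.
alerts_per_host() {
    awk '/ HOST ALERT: / {
             sub(/.*HOST ALERT: /, "")   # drop the syslog prefix
             split($0, field, ";")       # field[1] is the host name
             count[field[1]]++
         }
         END { for (h in count) print count[h], h }' | sort -rn
}
```

For example: alerts_per_host < /var/log/messages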
scottwilkerson
Re: <<<R3.3-Monitoring of hosts nodes has not been stable>>>
Has the machine stabilized since the repair and re-start of ndo2db?