NagVis - NDO claims that nagios did not status update
Posted: Thu Aug 24, 2017 1:16 pm
by ssoliveira
NagVis reports that NDO claims Nagios did not update its status for more than 180 seconds.
NagVis often has this problem, and the service has to be restarted to clear it.
How can I investigate the cause?
Code: Select all
[root@st-dc3a-nagios-n01 ~]# ps -ef | grep ndo2db
nagios 27757 1 0 15:08 ? 00:00:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
nagios 28902 27757 0 15:08 ? 00:00:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
nagios 28907 28902 12 15:08 ? 00:00:02 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
root 30027 13067 0 15:08 pts/0 00:00:00 grep ndo2db
Are there any logs I can analyze?
service ndo2db stop
service ndo2db start
Thank you
Re: NagVis - NDO claims that nagios did not status update
Posted: Thu Aug 24, 2017 1:27 pm
by scottwilkerson
A search of our forums for this error shows that the problem has been seen before. Can you try the commands in this post?
https://support.nagios.com/forum/viewto ... 317#p18314
Re: NagVis - NDO claims that nagios did not status update
Posted: Thu Aug 24, 2017 1:48 pm
by ssoliveira
I have already read that topic. It suggests restarting the services and mentions NTP time synchronization. All of that is OK here.
My problem is that the error occurs frequently, and each time I need to restart the services.
I would like a way to analyze the problem and identify the cause. There is nothing useful in /var/log/messages.
Re: NagVis - NDO claims that nagios did not status update
Posted: Thu Aug 24, 2017 2:31 pm
by scottwilkerson
What version of Nagios XI are you running?
Re: NagVis - NDO claims that nagios did not status update
Posted: Thu Aug 24, 2017 2:35 pm
by scottwilkerson
Here's a solution for the current XI version
Run the following from the CLI
Code: Select all
sed -i "s/maxtimewithoutupdate=180/maxtimewithoutupdate=86400/g" /usr/local/nagvis/etc/nagvis.ini.php
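To see what this substitution does before touching the live file, it can be tried on a scratch copy. The sketch below is only an illustration: `raise_ndo_threshold` is a hypothetical helper name, and the scratch file stands in for /usr/local/nagvis/etc/nagvis.ini.php. It writes through a temporary file instead of using `sed -i`, but applies the same expression.

```shell
#!/bin/sh
# Hypothetical helper: apply the substitution from the post to a given file.
# $1 = path to an ini file, $2 = new threshold in seconds
raise_ndo_threshold() {
    sed "s/maxtimewithoutupdate=180/maxtimewithoutupdate=$2/g" "$1" > "$1.new" &&
        mv "$1.new" "$1"
}

# Scratch file standing in for nagvis.ini.php (assumed contents)
demo=$(mktemp)
printf '%s\n' '; maximum delay of the NDO Database in seconds' \
              'maxtimewithoutupdate=180' > "$demo"

raise_ndo_threshold "$demo" 86400
grep -n 'maxtimewithoutupdate' "$demo"   # prints: 2:maxtimewithoutupdate=86400
rm -f "$demo"
```

Note that the expression only rewrites the literal text `maxtimewithoutupdate=180`. It does not remove a leading `;`, so a line that is commented out stays commented out after the substitution.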
Re: NagVis - NDO claims that nagios did not status update
Posted: Wed Aug 30, 2017 5:48 pm
by ssoliveira
What is the behavior after changing this parameter?
; maximum delay of the NDO Database in seconds
;maxtimewithoutupdate=180
I have observed high CPU utilization by the Apache processes.
Code: Select all
Tasks: 529 total, 7 running, 521 sleeping, 0 stopped, 1 zombie
Cpu(s): 78.0%us, 15.8%sy, 0.0%ni, 5.2%id, 0.3%wa, 0.1%hi, 0.7%si, 0.0%st
Mem: 49283936k total, 38816824k used, 10467112k free, 262040k buffers
Swap: 4194300k total, 4332k used, 4189968k free, 17681732k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13646 apache 20 0 459m 39m 8712 R 38.8 0.1 0:22.52 httpd
12861 apache 20 0 459m 39m 8464 R 33.2 0.1 0:43.63 httpd
29545 apache 20 0 458m 38m 8704 S 30.3 0.1 0:05.25 httpd
24815 apache 20 0 459m 38m 8636 R 27.7 0.1 0:13.86 httpd
15329 apache 20 0 457m 37m 8544 S 23.1 0.1 0:06.12 httpd
26758 apache 20 0 440m 27m 5116 S 22.5 0.1 0:03.84 httpd
29183 apache 20 0 442m 29m 4300 S 17.9 0.1 0:03.89 httpd
3580 apache 20 0 446m 34m 5800 S 16.9 0.1 0:14.58 httpd
14789 nagios 20 0 126m 5208 1956 R 15.6 0.0 0:00.57 process_perfdat
6901 apache 20 0 456m 36m 8712 S 15.0 0.1 0:21.75 httpd
14783 nagios 20 0 125m 4784 1956 R 14.3 0.0 0:00.55 process_perfdat
30373 apache 20 0 451m 31m 8652 S 13.7 0.1 0:18.44 httpd
15326 apache 20 0 440m 27m 5320 S 12.4 0.1 0:00.38 httpd
28442 apache 20 0 458m 37m 8416 S 10.8 0.1 0:14.80 httpd
1156 apache 20 0 456m 36m 8680 S 10.1 0.1 0:22.91 httpd
13931 apache 20 0 442m 29m 4284 S 10.1 0.1 0:00.78 httpd
I need help with more in-depth troubleshooting to find out what the problem is.
Is it possible to move NagVis processing to a separate server?
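One way to put a number on the Apache load is to total %CPU per command name instead of reading individual rows. The following is a minimal sketch, not an official Nagios tool: `cpu_by_command` is a hypothetical helper, and the sample rows are the first two `httpd` and both `process_perfdat` values from the `top` output above.

```shell
#!/bin/sh
# Hypothetical helper: read "%CPU COMMAND" lines on stdin and print
# the total %CPU per command, highest first.
cpu_by_command() {
    awk '{ cpu = $1; $1 = ""; sub(/^ +/, ""); sum[$0] += cpu }
         END { for (c in sum) printf "%.1f %s\n", sum[c], c }' | sort -nr
}

# Sample rows taken from the top output in this thread
printf '%s\n' '38.8 httpd' '33.2 httpd' \
              '15.6 process_perfdat' '14.3 process_perfdat' | cpu_by_command
# prints:
# 72.0 httpd
# 29.9 process_perfdat

# On a live system the same helper could be fed from ps:
#   ps -e -o pcpu= -o comm= | cpu_by_command | head
```

Running the live variant a few times while NagVis maps are open and again while they are closed would show whether the `httpd` total follows NagVis usage.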
Re: NagVis - NDO claims that nagios did not status update
Posted: Thu Aug 31, 2017 8:48 am
by scottwilkerson
ssoliveira wrote: What is the behavior after changing this parameter?
Changing this wouldn't affect NagVis or the load at all. The only difference is the threshold: NagVis was doing an arbitrary check of when NDO last updated a time in a table, and it gave the error if that was more than xxx seconds ago.
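The kind of check described here can be sketched as a simple timestamp comparison. This is an illustration only, not NagVis's actual code: `is_stale` and its arguments are hypothetical names, and the strict "greater than" comparison is an assumption.

```shell
#!/bin/sh
# Hypothetical sketch of a staleness check.
# is_stale LAST_UPDATE NOW MAX_AGE
#   exit status 0 when LAST_UPDATE is more than MAX_AGE seconds before NOW
is_stale() {
    last_update=$1
    now=$2
    max_age=$3
    [ $(( now - last_update )) -gt "$max_age" ]
}

now=$(date +%s)
last=$(( now - 200 ))   # pretend the last update happened 200 seconds ago

if is_stale "$last" "$now" 180; then
    echo "with a 180s limit: error would be raised"
fi
if ! is_stale "$last" "$now" 86400; then
    echo "with an 86400s limit: no error"
fi
```

With the same 200-second delay, only the limit decides whether an error is shown, which is why raising the limit hides the message without changing how quickly the table is updated.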
Re: NagVis - NDO claims that nagios did not status update
Posted: Thu Aug 31, 2017 1:18 pm
by ssoliveira
I understand. So this modification would not solve the problem. It would only stop NagVis from raising the alarm when communication with the system takes longer than normal.
This problem is becoming critical here in the company.
How can I investigate why the environment is having these problems?
* Can this CPU consumption by Apache be the problem?
* Do I need to add more CPU?
* Do I need to add more memory?
Our infrastructure currently monitors only a few servers, but we will soon add many more.
We run the database separately from the core, and each server works with Gearman to offload work.
What information about my environment do I need to post here to help with this analysis?
The Core server has:
* 48GB of RAM
* 8 CPUs
* Disk = LUNs on VMAX storage (~ 10000 IOPS)
IO
Code: Select all
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 5.33 0.00 0.67 0.00 48.00 72.00 0.01 7.50 0.00 7.50 7.50 0.50
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 0.00 6.00 0.00 48.00 8.00 0.06 10.61 0.00 10.61 0.83 0.50
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 1.67 0.00 6.67 4.00 0.00 2.40 0.00 2.40 2.40 0.40
sdh 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdi 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdk 0.00 0.33 11.67 104.33 112.00 842.00 8.22 0.07 0.64 1.54 0.54 0.60 6.93
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdl 0.00 0.00 0.00 0.33 0.00 2.33 7.00 0.00 1.00 0.00 1.00 1.00 0.03
sdj 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 1.67 0.00 12.00 7.20 0.00 2.20 0.00 2.20 2.20 0.37
sdn 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdp 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdq 0.00 0.00 12.00 110.00 109.33 942.67 8.62 0.06 0.51 1.44 0.41 0.47 5.73
sdr 0.00 0.00 0.00 0.33 0.00 4.67 14.00 0.00 0.00 0.00 0.00 0.00 0.00
sdm 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdo 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
VxVM23000 0.00 0.00 23.67 214.67 221.33 1784.67 8.42 0.14 0.59 1.52 0.48 0.50 11.83
sds 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdt 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdu 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdv 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
VxVM11000 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
VxVM23001 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
VxVM23002 0.00 0.00 0.00 0.67 0.00 7.00 10.50 0.00 0.50 0.00 0.50 0.50 0.03
VxVM23003 0.00 0.00 0.00 3.33 0.00 18.67 5.60 0.01 2.30 0.00 2.30 1.20 0.40
VxVM23004 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Re: NagVis - NDO claims that nagios did not status update
Posted: Thu Aug 31, 2017 2:35 pm
by ssoliveira
We contacted Nagios Brasil to ask for help.
We were asked to disable one of the brokers, leaving only one running.
We were also given the procedure for enabling debugging in NDO, shown below.
We are reviewing whether the issue continues after these changes.
==========================================
/usr/local/nagios/etc/ndo2db.cfg
debug_level=-1
==========================================
tail -f /usr/local/nagios/var/ndo2db.debug
tail -f /usr/local/nagios/var/nagios.log
==========================================
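The procedure above could be scripted roughly as follows. This is a sketch under assumptions: `set_cfg_value` is a hypothetical helper, the demo runs against a scratch file and not the real ndo2db.cfg, and it is assumed (not confirmed in this thread) that ndo2db has to be restarted before a new `debug_level` takes effect.

```shell
#!/bin/sh
# Hypothetical helper: set key=value in a simple "key=value" config file,
# keeping a backup copy next to it as FILE.bak.
# $1 = config file, $2 = key, $3 = value
set_cfg_value() {
    cfg=$1
    key=$2
    value=$3
    cp -p "$cfg" "$cfg.bak"
    if grep -q "^$key=" "$cfg"; then
        # Key already present: rewrite that line
        sed "s|^$key=.*|$key=$value|" "$cfg.bak" > "$cfg"
    else
        # Key missing: append it
        printf '%s=%s\n' "$key" "$value" >> "$cfg"
    fi
}

# Demo on a scratch file with assumed contents
demo=$(mktemp)
printf '%s\n' 'debug_level=0' > "$demo"
set_cfg_value "$demo" debug_level -1
cat "$demo"                      # prints: debug_level=-1
rm -f "$demo" "$demo.bak"

# On the server from this thread the equivalent steps would be:
#   set_cfg_value /usr/local/nagios/etc/ndo2db.cfg debug_level -1
#   service ndo2db stop
#   service ndo2db start
#   tail -f /usr/local/nagios/var/ndo2db.debug
```

The debug file can grow quickly with `debug_level=-1`, so it is worth restoring the previous value from the `.bak` copy once enough output has been captured.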
Re: NagVis - NDO claims that nagios did not status update
Posted: Thu Aug 31, 2017 4:01 pm
by cdienger
Thank you. Please keep us posted with your progress after making this change.