Acknowledgements suddenly stopped working....
Posted: Wed Aug 19, 2015 6:52 pm
by sandsdenver
I inherited this monster to try to support. It's a Nagios system with 38000+ devices and over 1 million service checks every day. Normally I am a CCNA network tech; that's just my background.
So first when I ack something, all I would get back was this (starting yesterday):
nagios_ack add '0554-RTR-01' 'CA:997325'
...One moment please, analyzing the STATUSFILE and your selections...
And that's all I would get back on the screen.
I restarted the Nagios service, and now we get back the proper lines when doing acknowledgements, example:
nagios_ack add '0324-' 'CA1002000'
...One moment please, analyzing the STATUSFILE and your selections...
...Comparison results follow...
Acknowledge HOST Alarm: 0324-RTR-01 CA1002000
Acknowledge HOST Alarm: 0324-SW-ER-01 CA1002000
Acknowledge HOST Alarm: 0324-SW-ER-03 CA1002000
...Comparison complete. Nagios now processing any commands issued...
All was fine and dandy, but then I looked at the host and services page, and we never see the check mark (see picture). What could I look into further? Screenshots attached of the check marks we are supposed to see.
If it matters:
Version: Nagios® Core™ 4.0.8
CentOS Linux release 7.0.1406 (Core)
Thank you!!!
Re: Acknowledgements suddenly stopped working....
Posted: Wed Aug 19, 2015 7:06 pm
by Box293
Yes that is a lot of devices being monitored.
sandsdenver wrote:So first when I ack something, all I would get back was this (starting yesterday):
nagios_ack add '0554-RTR-01' 'CA:997325'
...One moment please, analyzing the STATUSFILE and your selections...
And that's all I would get back on the screen.
I restarted the Nagios service, and now we get back the proper lines when doing acknowledgements, example:
nagios_ack add '0324-' 'CA1002000'
...One moment please, analyzing the STATUSFILE and your selections...
...Comparison results follow...
Acknowledge HOST Alarm: 0324-RTR-01 CA1002000
Acknowledge HOST Alarm: 0324-SW-ER-01 CA1002000
Acknowledge HOST Alarm: 0324-SW-ER-03 CA1002000
...Comparison complete. Nagios now processing any commands issued...
I'm not familiar with this. Can you post some screenshots of these steps so we can get a better idea?
Let's check some basics. Please run the following and post back the output.
I'll get you to check the amount of free disk space on your Nagios server. Type the following at the command prompt:
df -h
Also please run this one:
df -i
Run these commands:
Code: Select all
tail /var/log/messages -n 100 > /tmp/messages_log.txt
tail /usr/local/nagios/var/nagios.log -n 100 > /tmp/nagios_log.txt
Send us these files:
/tmp/messages_log.txt
/tmp/nagios_log.txt
Please post the file
/usr/local/nagios/etc/nagios.cfg
Re: Acknowledgements suddenly stopped working....
Posted: Thu Aug 20, 2015 2:55 pm
by sandsdenver
Thank you for your time, here is the information requested.
Code: Select all
root@ccsd-lx-noc03 ~> df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/centos-root 45G 24G 21G 54% /
devtmpfs 5.8G 0 5.8G 0% /dev
tmpfs 5.8G 0 5.8G 0% /dev/shm
tmpfs 5.8G 576K 5.8G 1% /run
tmpfs 5.8G 0 5.8G 0% /sys/fs/cgroup
tmpfs 500M 203M 298M 41% /var/nagiosramdisk
/dev/sdb1 55G 27G 28G 50% /data
/dev/sda1 497M 214M 284M 43% /boot
10.50.5.2:/scripts 1.8T 835G 907G 48% /scripts
root@ccsd-lx-noc03 ~> df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/centos-root 46669824 109818 46560006 1% /
devtmpfs 1515443 357 1515086 1% /dev
tmpfs 1517624 1 1517623 1% /dev/shm
tmpfs 1517624 461 1517163 1% /run
tmpfs 1517624 13 1517611 1% /sys/fs/cgroup
tmpfs 1517624 6 1517618 1% /var/nagiosramdisk
/dev/sdb1 57670656 11338 57659318 1% /data
/dev/sda1 512000 351 511649 1% /boot
10.50.5.2:/scripts 122101760 112788 121988972 1% /scripts
root@ccsd-lx-noc03 ~> top -n 1
top - 12:40:07 up 247 days, 2:37, 1 user, load average: 3.31, 5.74, 5.74
Tasks: 200 total, 8 running, 191 sleeping, 0 stopped, 1 zombie
%Cpu(s): 24.3 us, 13.3 sy, 0.0 ni, 60.2 id, 0.9 wa, 0.0 hi, 1.3 si, 0.0 st
KiB Mem: 12140992 total, 7997432 used, 4143560 free, 0 buffers
KiB Swap: 5242876 total, 79584 used, 5163292 free. 5803376 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5253 apache 20 0 296724 189268 816 R 26.5 1.6 0:21.84 status.cgi
7772 root 20 0 7408 648 552 R 13.3 0.0 0:00.08 nagiostats
7768 nagios 20 0 0 0 0 R 11.6 0.0 0:00.07 php
31415 nagios 20 0 1415396 599336 2408 R 11.6 4.9 286:04.42 nagios
21 root 20 0 0 0 0 S 5.0 0.0 1344:01 ksoftirqd/1
3895 nagios 20 0 0 0 0 Z 5.0 0.0 0:00.08 mod_gearma+
7737 root 20 0 123656 1520 1092 R 5.0 0.0 0:00.48 top
7766 nagios 20 0 143892 3252 2048 R 5.0 0.0 0:00.03 mod_gearma+
28500 apache 20 0 327208 8592 1948 S 5.0 0.1 0:00.50 httpd
322 root 20 0 21260 1656 1088 S 1.7 0.0 24:03.74 cvfwd
5285 nagios 20 0 143604 3348 2128 S 1.7 0.0 0:00.14 mod_gearma+
5580 nagios 20 0 143604 3348 2128 S 1.7 0.0 0:00.04 mod_gearma+
5835 nagios 20 0 143604 3348 2128 S 1.7 0.0 0:00.05 mod_gearma+
6474 nagios 20 0 143084 2740 1904 S 1.7 0.0 0:00.02 mod_gearma+
7621 root 20 0 51596 17436 2296 S 1.7 0.1 0:00.23 mrtg
7764 nagios 20 0 116452 716 588 S 1.7 0.0 0:00.03 check_icmp
1 root 20 0 197616 4576 2412 S 0.0 0.0 225:09.96 systemd
root@ccsd-lx-noc03 ~> tail /var/log/messages -n 100 > /tmp/messages_log.txt
Attached.
root@ccsd-lx-noc03 ~> tail /usr/local/nagios/var/nagios.log -n 100 > /tmp/nagios_log.txt
Attached.
root@ccsd-lx-noc03 ~> more /usr/local/nagios/etc/nagios.cfg
Attached.
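One thing worth checking in the nagios.log excerpt is whether Nagios ever received the acknowledgements. When log_external_commands=1 is set in nagios.cfg, each command read from the command file is logged as an EXTERNAL COMMAND line. The following is a minimal sketch, assuming that setting is enabled and that nagios_ack submits the standard ACKNOWLEDGE_HOST_PROBLEM or ACKNOWLEDGE_SVC_PROBLEM commands (neither is confirmed in this thread). The helper name ack_commands_in_log is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list acknowledgement commands that Nagios logged.
# Usage: ack_commands_in_log <path to nagios.log> [host name prefix]
ack_commands_in_log() {
    log_file=$1
    host_prefix=${2:-}
    [ -r "$log_file" ] || { echo "cannot read $log_file" >&2; return 1; }
    # External commands are logged in the form:
    #   [epoch] EXTERNAL COMMAND: ACKNOWLEDGE_HOST_PROBLEM;<host>;...
    # The optional prefix narrows the match to hosts starting with it.
    grep -E "EXTERNAL COMMAND: ACKNOWLEDGE_(HOST|SVC)_PROBLEM;${host_prefix}" "$log_file"
}
```

For example, ack_commands_in_log /usr/local/nagios/var/nagios.log 0324- would show whether the three 0324- acknowledgements reached Nagios. No matching lines would point at the front end or the command file rather than at the status display.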
Re: Acknowledgements suddenly stopped working....
Posted: Fri Aug 21, 2015 12:54 pm
by tmcdonald
Box293 wrote:I'm not familiar with this. Can you post some screenshots of these steps so we can get a better idea?
I am also a bit unsure of what you were referencing in your original post. Could you please post the screenshots of the
Code: Select all
...One moment please, analyzing the STATUSFILE and your selections...
...Comparison results follow...
Acknowledge HOST Alarm: 0324-RTR-01 CA1002000
Acknowledge HOST Alarm: 0324-SW-ER-01 CA1002000
Acknowledge HOST Alarm: 0324-SW-ER-03 CA1002000
...Comparison complete. Nagios now processing any commands issued...
etc. etc. information?
Re: Acknowledgements suddenly stopped working....
Posted: Tue Aug 25, 2015 12:07 pm
by sandsdenver
It's probably just a front-end GUI we have used for the past few years; here is a screenshot. The issue went away for about a week and has come back today.
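To take the front end out of the picture, an acknowledgement can be submitted by hand through the Nagios external command file. The sketch below only builds the command line in the documented ACKNOWLEDGE_HOST_PROBLEM format. The helper name host_ack_line and the author value are hypothetical, and the sticky, notify and persistent values are example choices rather than anything taken from this installation.

```shell
#!/bin/sh
# Hypothetical helper: print one host acknowledgement in external command format:
#   [time] ACKNOWLEDGE_HOST_PROBLEM;<host_name>;<sticky>;<notify>;<persistent>;<author>;<comment>
# Usage: host_ack_line <host> <author> <comment> [epoch seconds]
host_ack_line() {
    host=$1
    author=$2
    comment=$3
    # Default to the current time when no timestamp is supplied.
    now=${4:-$(date +%s)}
    # sticky=2, notify=1, persistent=1 are example values, not site policy.
    printf '[%s] ACKNOWLEDGE_HOST_PROBLEM;%s;2;1;1;%s;%s\n' \
        "$now" "$host" "$author" "$comment"
}
```

The output would be redirected into the file named by command_file in nagios.cfg (commonly /usr/local/nagios/var/rw/nagios.cmd on a source install, which is an assumption here), for example host_ack_line 0324-RTR-01 noc CA1002000 > /usr/local/nagios/var/rw/nagios.cmd. If the check mark appears after a manual submission, the front-end script is the more likely culprit.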
Re: Acknowledgements suddenly stopped working....
Posted: Tue Aug 25, 2015 4:47 pm
by Box293
Is this a physical server or a VM?
If it's a VM, can you look at the VM's performance stats through the hypervisor? I'm particularly interested to see if the VM's memory is being exhausted.
sandsdenver wrote:All was fine and dandy, but then I looked at the host and services page, and we never see the check mark (see picture). What could I look into further? Screenshots attached of the check marks we are supposed to see.
Do these check marks eventually appear?
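Whether the acknowledgement has registered can also be read straight from the status file, independently of the web pages. In that file each hoststatus block carries a problem_has_been_acknowledged flag. The following is a minimal sketch; the helper name acked_hosts is hypothetical, and the status file path is whatever status_file is set to in nagios.cfg (it may sit on the ramdisk on this server, but that is a guess).

```shell
#!/bin/sh
# Hypothetical helper: print the names of hosts whose problem is acknowledged.
# Usage: acked_hosts <path to status.dat>
acked_hosts() {
    status_file=$1
    [ -r "$status_file" ] || { echo "cannot read $status_file" >&2; return 1; }
    awk '
        # Start of a host block: reset the per-block state.
        /^hoststatus [{]/ { in_host = 1; name = ""; acked = 0; next }
        # Remember the host name for the current block.
        in_host && /^[ \t]*host_name=/ { sub(/^[ \t]*host_name=/, ""); name = $0; next }
        # Note the acknowledgement flag when it is set.
        in_host && /^[ \t]*problem_has_been_acknowledged=1[ \t]*$/ { acked = 1; next }
        # End of the block: report the host if it was acknowledged.
        in_host && /^[ \t]*[}]/ { if (acked) print name; in_host = 0 }
    ' "$status_file"
}
```

Running it shortly after an acknowledgement and again a few minutes later would show whether the flag is never set or is only slow to appear.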
Re: Acknowledgements suddenly stopped working....
Posted: Wed Aug 26, 2015 10:46 pm
by sandsdenver
Nagios is a beast; it's hard to untangle the spider web of what goes where. It was built by one guy over 8 years, and he is no longer here. Let the fun begin! lol
Yes, they did start working again when we did a build. Yes these are VMs.
Just for info, here are the latest numbers:
Checking objects...
Checked 79160 services.
Checked 8048 hosts.
Checked 2133 host groups.
Checked 1002 service groups.
Checked 91 contacts.
Checked 59 contact groups.
Checked 564 commands.
Checked 16 time periods.
Checked 9155 host escalations.
Checked 13192 service escalations.
Checking for circular paths...
Checked 8048 hosts
Checked 3552 service dependencies
Checked 0 host dependencies
Checked 16 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...
So for now this thread can be clos... wait! One more unrelated question.
What does something like this do?
define hostgroup {
# 36 Hosts in this group
hostgroup_name DataCenter.CIS
alias DataCenter.CIS
}
Does that create a hostgroup and name it, with the members added later?
Oh, and when you define a host, is it automatically checked via a ping command, or do you need a service check to do this? Can you turn that off (the check to see whether it is up)?
Re: Acknowledgements suddenly stopped working....
Posted: Thu Aug 27, 2015 12:24 am
by Box293
Creating a hostgroup allows you to do things like assign one service to the group, so that all hosts in that group get the service. It's a configuration technique. This might explain it better:
http://sites.box293.com/nagios/guides/c ... n-services
If you define a host, you would define a check_command to use, like check-host-alive, OR use a template that has it defined. A host doesn't need a check command; however, without one Nagios cannot detect that the host is down, so when it goes down its services won't have their notifications suppressed. A host check command is different from a service: a host can only have one check command but can have many services.