no alert triggered on service down

nky1986 · Post by **nky1986** » Tue Jan 09, 2018 1:45 am

Hello Team,

We monitor one service, ‘Active-Devices-AT-P’, on all our servers. This service was down for 21 hours on server ‘mwm-at16001.mojonetworks.com’ and we didn’t receive any alert from Nagios.

Timeline of events from the Nagios graph:

- Jan 04, 4:26 PM
  - Total devices: 338, active devices: dropped from 317 to 0
- Jan 05, 6:27 AM
  - Total devices: 338, active devices: 160
- Jan 05, 1:26 PM
  - Total devices: 338, active devices: 317

The alert is set to trigger when the active device count goes below 40% of the previous active count.

Checks are configured as (intervals in minutes):
- Check interval: 140
- Retry interval: 70
- Max check attempts: 3

Based on these settings, we expected the alert 140 minutes after Jan 04, 4:26 PM.
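
For reference, the timing implied by these settings can be sketched as below. This assumes the intervals are in minutes (the Nagios default `interval_length` of 60 seconds) and that a notification is sent only once the service reaches a HARD state, that is, after `max_check_attempts` consecutive failed checks. The variable names are illustrative, not Nagios directives.

```shell
#!/bin/sh
# Values from the check configuration above (assumed to be minutes).
check_interval=140
retry_interval=70
max_check_attempts=3

# After the first failed check, rechecks run at the retry interval until
# max_check_attempts is reached and the state becomes HARD.
soft_to_hard=$(( (max_check_attempts - 1) * retry_interval ))

# Worst case: the outage begins right after a passing check, so the first
# failed check is a full check interval away.
worst_case=$(( check_interval + soft_to_hard ))

echo "first failed check to HARD state: ${soft_to_hard} min"
echo "outage start to HARD state (worst case): ${worst_case} min"
```

Under these assumptions the notification could arrive up to 280 minutes after the drop rather than 140, which is still far short of the 21 hours observed.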

Please find attached:
- Log for Jan 4th
- Graph screenshot
- Incident report

This was a very serious issue for our organization, and because Nagios failed to send an alert, we faced significant customer impact. Please look into this issue at the earliest and let me know if any further information is required.

Regards,
Narender

nky1986 · Post by **nky1986** » Tue Jan 09, 2018 1:47 am

nagios events pdf

nky1986 · Post by **nky1986** » Tue Jan 09, 2018 1:48 am

nagios graph

nky1986 · Post by **nky1986** » Tue Jan 09, 2018 1:50 am

nagios log

kyang · Post by **kyang** » Tue Jan 09, 2018 11:05 am

Thanks for the information.

Could you send us the whole profile?

Nagios XI Profile --> On the XI Home Page click "Admin" > "System Profile" --> "Download Profile" button
Save the profile.zip file and upload it here or PM me.

If you receive a PROFILE BUILD FAILED error, please follow this article:

https://support.nagios.com/kb/article.p ... ategory=44

After you PM the profile please update this thread so we know you sent it, unless you post the profile on here. Thanks

nky1986 · Post by **nky1986** » Wed Jan 10, 2018 9:02 am

uploading the profile here

dwhitfield · Post by **dwhitfield** » Wed Jan 10, 2018 10:28 am

Please run through https://assets.nagios.com/downloads/nag ... tabase.pdf and report any errors. If you stop at any point, please note the point at which you stopped.

If the repair script and other instructions in the document do not work, please continue.

Regarding the instructions below, if you do not have killall, you can install it via the following command:
# yum install psmisc

If psmisc is not in your repos, then instead you can check to make sure nagios is not running with
# ps -aef | grep nagios

If that document does not resolve your issue, please run the following commands in order and report any errors. You ***must*** use mariadb instead of mysqld in the commands below, ***if*** you have mariadb.
# service nagios stop
# service ndo2db stop
# service mysqld stop
# service crond stop
# service httpd stop
# killall -9 nagios
# killall -9 ndo2db
# rm -f /usr/local/nagios/var/rw/nagios.cmd
# rm -f /usr/local/nagios/var/nagios.lock
# rm -f /usr/local/nagios/var/ndo.sock
# rm -f /usr/local/nagios/var/ndo2db.lock
# rm -f /usr/local/nagiosxi/var/reconfigure_nagios.lock
# for i in `ipcs -q | grep nagios |awk '{print $2}'`; do ipcrm -q $i; done
# service mysqld start
# service ndo2db start
# service nagios start
# service httpd start
# service crond start
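
The sequence above could be wrapped in a small script so the database service name is set in one place. This is only a sketch: `DB_SERVICE`, `DRY_RUN`, `run`, and `restart_sequence` are hypothetical names, and the script defaults to dry-run mode, printing the commands instead of executing them.

```shell
#!/bin/sh
# Hypothetical wrapper around the restart sequence from this post.
DB_SERVICE="${DB_SERVICE:-mysqld}"   # set to mariadb if that is what you run
DRY_RUN="${DRY_RUN:-1}"              # 1 = print commands only, 0 = execute

# Print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

restart_sequence() {
    # Stop services in the order given above.
    for svc in nagios ndo2db "$DB_SERVICE" crond httpd; do
        run service "$svc" stop
    done

    # Kill leftover processes and remove stale lock/socket files.
    run killall -9 nagios
    run killall -9 ndo2db
    for f in /usr/local/nagios/var/rw/nagios.cmd \
             /usr/local/nagios/var/nagios.lock \
             /usr/local/nagios/var/ndo.sock \
             /usr/local/nagios/var/ndo2db.lock \
             /usr/local/nagiosxi/var/reconfigure_nagios.lock; do
        run rm -f "$f"
    done

    # Clear kernel message queues that mention nagios.
    if [ "$DRY_RUN" = "1" ]; then
        echo "ipcrm -q <each nagios queue id from ipcs -q>"
    else
        for id in $(ipcs -q | grep nagios | awk '{print $2}'); do
            ipcrm -q "$id"
        done
    fi

    # Start services again, database first.
    for svc in "$DB_SERVICE" ndo2db nagios httpd crond; do
        run service "$svc" start
    done
}

restart_sequence
```

Running it as `DB_SERVICE=mariadb sh script.sh` would print the mariadb variant for review; setting `DRY_RUN=0` as root would execute it.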

nky1986 · Post by **nky1986** » Wed Jan 10, 2018 2:09 pm

Before running these commands, could you please let me know what we are trying to do here and why?

dwhitfield · Post by **dwhitfield** » Wed Jan 10, 2018 3:34 pm

Your system is old enough that I don't have everything in the profile I'd want. What's the output of ipcs -q?

Ultimately, the emails are scheduled through the database, but the mysql log does not show every issue. Essentially, experience shows that doing a db repair can resolve notification issues.

As for the commands, the main thing there is for the kernel queue, but kicking the database services, crond (which runs dbmaint), and nagios (which checks the warning/critical) might resolve the issue. Kicking httpd isn't likely to do anything in this case, but if you are bringing everything else down what's two more commands?

nky1986 · Post by **nky1986** » Wed Jan 10, 2018 11:45 pm

This is the output of ipcs -q:

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0xbb010002 16842752   nagios     600        0            0
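
As a quick way to read this kind of output, the sketch below sums the `messages` column for queues owned by a given user. It assumes the six-column layout shown above; the function name `queued_messages` is illustrative. A count that keeps growing is commonly read as a sign that the queue is not being drained, while here it is 0.

```shell
#!/bin/sh
# Sum the "messages" column (6th field) for queues owned by a given user.
# Reads `ipcs -q` style output on stdin.
queued_messages() {
    owner="$1"
    awk -v owner="$owner" '$3 == owner { total += $6 } END { print total + 0 }'
}

# Sample copied from the output above.
sample='------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0xbb010002 16842752   nagios     600        0            0'

printf '%s\n' "$sample" | queued_messages nagios
```

On a live system the same function could be fed directly with `ipcs -q | queued_messages nagios`.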

Nagios Support Forum
