no alert triggered on service down
Posted: Tue Jan 09, 2018 1:45 am
by nky1986
Hello Team,
There is one service that we monitor on all our servers: ‘Active-Devices-AT-P’. This service was down for 21 hours on server ‘mwm-at16001.mojonetworks.com’ and we did not receive any alert from Nagios.
Timeline of events from the Nagios graph:
- Jan 04, 4:26 PM
  o Total devices: 338, Active devices: dropped from 317 to 0
- Jan 05, 6:27 AM
  o Total devices: 338, Active devices: 160
- Jan 05, 1:26 PM
  o Total devices: 338, Active devices: 317
The alert is set to trigger when the active device count goes below 40% from the previous active count.
Checks are configured as:
- Check interval: 140
- Retry interval: 70
- Max check attempts: 3
As per this, the alert was expected 140 mins after Jan 04, 4:26 PM.
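The 140-minute figure can be reproduced with a little arithmetic. The following is only a sketch, assuming the intervals are in minutes (the Nagios default when interval_length is 60), that the 4:26 PM data point was the first failed check, and that each further attempt is scheduled one retry interval later. The function names are hypothetical, not part of Nagios.

```shell
#!/bin/sh
# hard_state_delay (hypothetical helper): minutes between the first failed
# check and the final attempt, assuming the first failure counts as attempt 1
# and each later attempt runs retry_interval minutes after the previous one.
hard_state_delay() {
    retry_interval=$1
    max_attempts=$2
    echo $(( retry_interval * (max_attempts - 1) ))
}

# add_minutes (hypothetical helper): add a delay in minutes to a 24-hour
# HH:MM clock time, wrapping at midnight.
add_minutes() {
    hh=${1%%:*}
    mm=${1##*:}
    # Strip one leading zero so values like "08" are not read as octal.
    hh=${hh#0}
    mm=${mm#0}
    total=$(( hh * 60 + mm + $2 ))
    printf '%02d:%02d\n' $(( total / 60 % 24 )) $(( total % 60 ))
}

delay=$(hard_state_delay 70 3)   # retry interval 70, max check attempts 3
echo "$delay"                    # prints 140
add_minutes 16:26 "$delay"       # prints 18:46
```

Under these assumptions the service would reach its third failed attempt at about 6:46 PM on Jan 04, which is when a notification would normally be expected.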
PFA:
- Log for Jan 4th
- Graph screenshot
- Incident report
This was a very serious issue for our organization, and because Nagios failed to send an alert, we faced a huge customer impact. Please look into this at the earliest and let me know if any further information is required.
Regards,
Narender
Re: no alert triggered on service down
Posted: Tue Jan 09, 2018 1:47 am
by nky1986
nagios events pdf
Re: no alert triggered on service down
Posted: Tue Jan 09, 2018 1:48 am
by nky1986
nagios graph
Re: no alert triggered on service down
Posted: Tue Jan 09, 2018 1:50 am
by nky1986
nagios log
Re: no alert triggered on service down
Posted: Tue Jan 09, 2018 11:05 am
by kyang
Thanks for the information.
Could you send us the whole profile?
To download the Nagios XI profile: on the XI home page, click "Admin" > "System Profile", then click the "Download Profile" button.
Save the profile.zip file and upload it here or PM me.
If you receive a PROFILE BUILD FAILED error, please follow this article:
https://support.nagios.com/kb/article.p ... ategory=44
After you PM the profile, please update this thread so we know you sent it (unless you post the profile here). Thanks.
Re: no alert triggered on service down
Posted: Wed Jan 10, 2018 9:02 am
by nky1986
uploading the profile here
Re: no alert triggered on service down
Posted: Wed Jan 10, 2018 10:28 am
by dwhitfield
Please run through
https://assets.nagios.com/downloads/nag ... tabase.pdf and report any errors. If you stop at any point, please let us know at which point you stopped.
If the repair script and the other instructions in the document do not work, please continue with the steps below.
Regarding the instructions below, if you do not have killall, you can install it via the following command:
# yum install psmisc
If psmisc is not in your repos, then instead you can check to make sure nagios is not running with
# ps -aef | grep nagios
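One caveat with the ps/grep check: the grep process itself has "nagios" on its command line, so it can appear in the output even when nothing else is running. A minimal sketch of an alternative, assuming pgrep (from the procps package) is available; the helper name is hypothetical:

```shell
#!/bin/sh
# is_running (hypothetical helper): succeeds if a process with exactly this
# name exists. pgrep -x matches the whole process name, so the check does
# not match itself.
is_running() {
    pgrep -x "$1" >/dev/null 2>&1
}

if is_running nagios; then
    echo "nagios is still running"
else
    echo "nagios is not running"
fi
```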
If that document does not resolve your issue, please run the following commands in order and report any errors. You ***must*** use mariadb instead of mysqld in the commands below, ***if*** you have mariadb.
# service nagios stop
# service ndo2db stop
# service mysqld stop
# service crond stop
# service httpd stop
# killall -9 nagios
# killall -9 ndo2db
# rm -f /usr/local/nagios/var/rw/nagios.cmd
# rm -f /usr/local/nagios/var/nagios.lock
# rm -f /usr/local/nagios/var/ndo.sock
# rm -f /usr/local/nagios/var/ndo2db.lock
# rm -f /usr/local/nagiosxi/var/reconfigure_nagios.lock
# for i in $(ipcs -q | awk '/nagios/ {print $2}'); do ipcrm -q "$i"; done
# service mysqld start
# service ndo2db start
# service nagios start
# service httpd start
# service crond start
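For the mariadb/mysqld substitution mentioned above, the choice of service name can be sketched as a small helper. This is only an illustration: pick_db_service is a hypothetical function, and it is demonstrated on fixed lists of service names rather than on a live system.

```shell
#!/bin/sh
# pick_db_service (hypothetical helper): given a list of service names, one
# per line, print "mariadb" if it is present, otherwise fall back to "mysqld".
pick_db_service() {
    if printf '%s\n' "$1" | grep -qx 'mariadb'; then
        echo mariadb
    else
        echo mysqld
    fi
}

# Demo with fixed lists instead of querying the running system.
pick_db_service "$(printf 'httpd\nmariadb\ncrond')"   # prints mariadb
pick_db_service "$(printf 'httpd\nmysqld\ncrond')"    # prints mysqld
```

The printed name would then replace mysqld in the "service mysqld stop" and "service mysqld start" lines. How the list of installed services is obtained depends on the distribution and is left out here.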
Re: no alert triggered on service down
Posted: Wed Jan 10, 2018 2:09 pm
by nky1986
Before running these commands, could you please let me know what we are trying to do here and why?
Re: no alert triggered on service down
Posted: Wed Jan 10, 2018 3:34 pm
by dwhitfield
Your system is old enough that I don't have everything in the profile I'd want. What's the output of ipcs -q?
Ultimately, the emails are scheduled through the database, but the mysql log does not show every issue. Essentially, experience shows that doing a db repair can resolve notification issues.
As for the commands, the main thing there is for the kernel queue, but kicking the database services, crond (which runs dbmaint), and nagios (which checks the warning/critical) might resolve the issue. Kicking httpd isn't likely to do anything in this case, but if you are bringing everything else down what's two more commands?
Re: no alert triggered on service down
Posted: Wed Jan 10, 2018 11:45 pm
by nky1986
this is the output of ipcs -q
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0xbb010002 16842752   nagios     600        0            0
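Output like this can be summarized per queue with a short awk filter. The sketch below assumes the six-column layout shown above (key, msqid, owner, perms, used-bytes, messages); nagios_queue_summary is a hypothetical helper name, and the demo feeds it the posted output rather than running ipcs.

```shell
#!/bin/sh
# nagios_queue_summary (hypothetical helper): read "ipcs -q" output on stdin
# and print one line per message queue owned by the nagios user.
nagios_queue_summary() {
    awk '$3 == "nagios" { printf "msqid=%s used_bytes=%s messages=%s\n", $2, $5, $6 }'
}

# Demo using the output posted in this thread.
printf '%s\n' \
    '------ Message Queues --------' \
    'key msqid owner perms used-bytes messages' \
    '0xbb010002 16842752 nagios 600 0 0' | nagios_queue_summary
# prints: msqid=16842752 used_bytes=0 messages=0
```

On a live system the same filter could be fed with "ipcs -q | nagios_queue_summary".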