Hello Team,
We monitor one service, ‘Active-Devices-AT-P’, on all our servers. This service was down for 21 hours on server ‘mwm-at16001.mojonetworks.com’, and we did not receive any alert from Nagios.
Timeline of events from the Nagios graph:
- Jan 04, 4:26 PM
o Total devices: 338, Active devices: went down from 317 to 0
- Jan 05, 6:27 AM
o Total devices: 338, Active devices count: 160
- Jan 05, 1:26 PM
o Total devices: 338, Active devices count: 317
The alert is set to trigger when the active device count goes below 40% of the previous active count.
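To make the threshold concrete, here is a minimal sketch of such a comparison. It assumes the rule means "current count is less than 40% of the previous count"; the function name `should_alert` is hypothetical and this is not the actual check plugin.

```shell
#!/bin/sh
# Hypothetical helper, not the real plugin.
# Succeeds (exit 0) when current < pct% of previous; pct defaults to 40.
should_alert() {
    previous=$1
    current=$2
    pct=${3:-40}
    # Integer arithmetic: compare current*100 against previous*pct.
    [ $((current * 100)) -lt $((previous * pct)) ]
}

# The drop reported on Jan 04: 317 active devices down to 0.
if should_alert 317 0; then
    echo "ALERT: active devices dropped from 317 to 0"
fi
```

Under this reading, the drop from 317 to 0 crosses the threshold, so the check itself should have returned a non-OK state.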
Checks are configured as:
- Check interval: 140
- Retry interval: 70
- Max check attempts: 3
Based on these settings, we expected the alert 140 minutes after Jan 04, 4:26 PM.
Please find attached:
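The timing can be worked out as follows. This is a sketch that assumes the intervals are in minutes (the default interval_length of 60 seconds) and the usual Nagios soft/hard state handling, where a notification goes out once the last check attempt fails. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes intervals are expressed in minutes.

# Minutes from the first failed check to the HARD state: the remaining
# (max_attempts - 1) checks are run at the retry interval.
hard_state_delay() {
    retry_interval=$1
    max_attempts=$2
    echo $(( (max_attempts - 1) * retry_interval ))
}

# Worst case measured from the start of the outage: the outage begins just
# after a passing check, so one full check interval passes before the
# first failed check.
worst_case_delay() {
    check_interval=$1
    retry_interval=$2
    max_attempts=$3
    echo $(( check_interval + $(hard_state_delay "$retry_interval" "$max_attempts") ))
}

echo "after first failed check: $(hard_state_delay 70 3) min"
echo "worst case after outage:  $(worst_case_delay 140 70 3) min"
```

With check interval 140, retry interval 70 and 3 attempts, this gives 140 minutes after the first failed check and at most 280 minutes after the outage began, both far short of the 21 hours observed.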
- Log for Jan 4th
- Graph screenshot
- Incident report
This was a very serious issue for our organization, and because Nagios failed to send an alert, we faced a huge customer impact. Please look into this issue at the earliest and let me know if you need any further information.
Regards,
Narender
no alert triggered on service down
Re: no alert triggered on service down
nagios events pdf
Re: no alert triggered on service down
nagios graph
Re: no alert triggered on service down
nagios log
-
kyang
Re: no alert triggered on service down
Thanks for the information.
Could you send us the whole profile?
Nagios XI Profile --> On the XI Home Page click "Admin" > "System Profile" --> "Download Profile" button
Save the profile.zip file and upload it here or PM me.
If you receive a PROFILE BUILD FAILED error, please follow this article:
https://support.nagios.com/kb/article.p ... ategory=44
After you PM the profile please update this thread so we know you sent it, unless you post the profile on here. Thanks
Re: no alert triggered on service down
Uploading the profile here.
-
dwhitfield
- Former Nagios Staff
- Posts: 4583
- Joined: Wed Sep 21, 2016 10:29 am
- Location: NoLo, Minneapolis, MN
- Contact:
Re: no alert triggered on service down
Please run through https://assets.nagios.com/downloads/nag ... tabase.pdf and report any errors. If you stop at any point, please note the point at which you stopped.
If the repair script and other instructions in the document do not work, please continue.
Regarding the instructions below, if you do not have killall, you can install it via the following command:
# yum install psmisc
If psmisc is not in your repos, then instead you can check to make sure nagios is not running with
# ps -aef | grep nagios
If that document does not resolve your issue, please run the following commands in order and report any errors. You ***must*** use mariadb instead of mysqld in the commands below, ***if*** you have mariadb.
# service nagios stop
# service ndo2db stop
# service mysqld stop
# service crond stop
# service httpd stop
# killall -9 nagios
# killall -9 ndo2db
# rm -f /usr/local/nagios/var/rw/nagios.cmd
# rm -f /usr/local/nagios/var/nagios.lock
# rm -f /usr/local/nagios/var/ndo.sock
# rm -f /usr/local/nagios/var/ndo2db.lock
# rm -f /usr/local/nagiosxi/var/reconfigure_nagios.lock
# for i in `ipcs -q | grep nagios |awk '{print $2}'`; do ipcrm -q $i; done
# service mysqld start
# service ndo2db start
# service nagios start
# service httpd start
# service crond start
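Since the database service name depends on whether MariaDB is installed, one way to avoid typing the wrong name is to detect it first. This is only a sketch: the function name `db_service_name` is hypothetical, and the detection (an init script at /etc/init.d/mariadb or a mariadb.service unit) is an assumption that may not match every system. It only prints the commands and does not stop or start anything.

```shell
#!/bin/sh
# Hypothetical helper: prints "mariadb" if a MariaDB service appears to be
# installed, otherwise falls back to "mysqld".
db_service_name() {
    if [ -e /etc/init.d/mariadb ]; then
        echo mariadb
    elif command -v systemctl >/dev/null 2>&1 &&
         systemctl list-unit-files 2>/dev/null | grep -q '^mariadb\.service'; then
        echo mariadb
    else
        echo mysqld
    fi
}

DB=$(db_service_name)
# Print the database commands to use in the sequence above; nothing is run.
echo "service $DB stop"
echo "service $DB start"
```

Check the printed name against what is actually installed before substituting it into the stop and start steps.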
Re: no alert triggered on service down
Before running these commands, could you please let me know what we are trying to do here and why?
-
dwhitfield
Re: no alert triggered on service down
Your system is old enough that I don't have everything in the profile I'd want. What's the output of ipcs -q?
Ultimately, the emails are scheduled through the database, but the mysql log does not show every issue. Essentially, experience shows that doing a db repair can resolve notification issues.
As for the commands, the main thing there is for the kernel queue, but kicking the database services, crond (which runs dbmaint), and nagios (which checks the warning/critical) might resolve the issue. Kicking httpd isn't likely to do anything in this case, but if you are bringing everything else down what's two more commands?
Re: no alert triggered on service down
This is the output of ipcs -q:
------ Message Queues --------
key msqid owner perms used-bytes messages
0xbb010002 16842752 nagios 600 0 0
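For reading this output, a small awk filter can pull out the queue ID and message count for queues owned by nagios. This is a sketch that assumes the six-column layout shown above (key, msqid, owner, perms, used-bytes, messages); the function name `summarize_queues` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: summarizes nagios-owned message queues from
# `ipcs -q` output read on stdin. Header lines are skipped because
# their first field does not start with "0x".
summarize_queues() {
    awk '$1 ~ /^0x/ && $3 == "nagios" {
        printf "msqid=%s used_bytes=%s messages=%s\n", $2, $5, $6
    }'
}

# Fed with the output posted above:
summarize_queues <<'EOF'
------ Message Queues --------
key msqid owner perms used-bytes messages
0xbb010002 16842752 nagios 600 0 0
EOF
```

On a live system the same filter could be used as `ipcs -q | summarize_queues`. For the output posted here it reports a single nagios queue holding 0 messages.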