Nagios XI Notifications Stopped working
It has now been 2 days since we received any notification from Nagios. It was all working fine before that. Even a forced check that fails does not produce a notification.
Here is the Profile info for our nagios environment.
System:
Nagios XI Version : 2014R1.5
2.6.32-358.2.1.el6.x86_64 x86_64
CentOS release 6.4 (Final)
Gnome is not installed
Apache Information
PHP Version: 5.3.3
Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36
Server Name:
Server Address:
Server Port: 80
Date/Time
PHP Timezone: America/New_York
PHP Time: Wed, 18 Jan 2017 12:02:57 -0500
System Time: Wed, 18 Jan 2017 12:02:57 -0500
Nagios XI Data
License ends in: NTOSNM
nagios (pid 18438) is running...
NPCD running (pid 30792).
ndo2db (pid 29780) is running...
CPU Load 15: 0.67
Total Hosts: 196
Total Services: 2933
Function 'get_base_uri' returns: http://../nagiosxi/
Function 'get_base_url' returns: http://../nagiosxi/
Function 'get_backend_url(internal_call=false)' returns:../nagiosxi/includes/components/profile/profile.php
Function 'get_backend_url(internal_call=true)' returns: http://localhost/nagiosxi/backend/
Ping Test localhost
Running:
/bin/ping -c 3 localhost 2>&1
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.055 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.033 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.035 ms
--- localhost ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.033/0.041/0.055/0.009 ms
Test wget To localhost
WGET From URL: http://localhost/nagiosxi/includes/components/ccm/
Running:
/usr/bin/wget http://localhost/nagiosxi/includes/components/ccm/
--2017-01-18 12:02:59-- http://localhost/nagiosxi/includes/components/ccm/
Resolving localhost... ::1, 127.0.0.1
Connecting to localhost|::1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: "/usr/local/nagiosxi/tmp/ccm_index.tmp"
0K ......... 877K=0.01s
2017-01-18 12:02:59 (877 KB/s) - "/usr/local/nagiosxi/tmp/ccm_index.tmp" saved [9666]
Re: Nagios XI Notifications Stopped working
Do I need to call this in?
-
dwhitfield
- Former Nagios Staff
- Posts: 4583
- Joined: Wed Sep 21, 2016 10:29 am
- Location: NoLo, Minneapolis, MN
- Contact:
Re: Nagios XI Notifications Stopped working
What's the output of tail -50 /var/log/maillog? This file logs email sent through sendmail. Only applicable to core contact "notify-*-by-email" notification handlers and sendmail tests.
Also, what's the output of tail -50 /usr/local/nagios/var/nagios.log, so we can see checks, notifications, external commands, and events.
Last but not least, what's the output of tail -50 /usr/local/nagiosxi/tmp/phpmailer.log?
Were there any changes to your mailing 2 days ago? Did you run out of disk space? Power outage? I'm guessing by Cent 6.4 and 2014R1.5 that you didn't run any updates 2 days ago. Also, two days ago was Monday. Did you get notifications over the weekend or is it possible the problem started on Friday?
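If it is easier to gather all three outputs in one go, here is a minimal Python sketch that does roughly what `tail -50` does for each of the files named above. The function names `tail` and `collect` are made up for illustration; only the three paths come from this post. It assumes the account running it can read those files (root, most likely).

```python
from collections import deque

# Paths quoted in the post above.
LOG_FILES = [
    "/var/log/maillog",
    "/usr/local/nagios/var/nagios.log",
    "/usr/local/nagiosxi/tmp/phpmailer.log",
]


def tail(path, count=50):
    """Return the last `count` lines of `path`, similar to `tail -<count>`."""
    with open(path, "r", errors="replace") as handle:
        # A bounded deque keeps only the most recent `count` lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


def collect(paths=LOG_FILES, count=50):
    """Map each path to its last lines, or to an error string if unreadable."""
    report = {}
    for path in paths:
        try:
            report[path] = tail(path, count)
        except OSError as exc:
            # Missing file or insufficient permissions: record it, keep going.
            report[path] = "unreadable: {}".format(exc)
    return report


if __name__ == "__main__":
    for path, lines in collect().items():
        print("==> {} <==".format(path))
        print(lines if isinstance(lines, str) else "\n".join(lines))
```

A file reported as unreadable is worth mentioning too, since a missing phpmailer.log or maillog narrows things down.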
Re: Nagios XI Notifications Stopped working
Looks like there are some errors in these logs. I am working with our Exchange team and other teams to make sure that no changes were made.
Thanks
Sujith
Here are the mail logs, followed by the PHP mailer log and the Nagios log entries.
Code: Select all
Jan 17 07:00:34 localhost postfix/qmgr[1680]: 22B58100F: removed
Jan 17 08:00:02 localhost postfix/pickup[19070]: 1055F3337: uid=0 from=<root>
Jan 17 08:00:02 localhost postfix/cleanup[30274]: 1055F3337: message-id=<[email protected]>
Jan 17 08:00:02 localhost postfix/qmgr[1680]: 1055F3337: from=<[email protected]>, size=2626, nrcpt=1 (queue active)
Jan 17 08:00:02 localhost postfix/local[30276]: 1055F3337: to=<[email protected]>, orig_to=<root@localhost>, relay=local, delay=0.05, delays=0.02/0.01/0/0.02, dsn=5.2.2, status=bounced (cannot update mailbox /var/mail/root for user root. error writing message: File too large)
Jan 17 08:00:02 localhost postfix/cleanup[30274]: 19CCB3338: message-id=<[email protected]>
Jan 17 08:00:02 localhost postfix/bounce[30278]: 1055F3337: sender non-delivery notification: 19CCB3338
Jan 17 08:00:02 localhost postfix/qmgr[1680]: 19CCB3338: from=<>, size=4569, nrcpt=1 (queue active)
Jan 17 08:00:02 localhost postfix/qmgr[1680]: 1055F3337: removed
Jan 17 08:00:02 localhost postfix/local[30276]: 19CCB3338: to=<[email protected]>, relay=local, delay=0.05, delays=0.01/0/0/0.04, dsn=5.2.2, status=bounced (cannot update mailbox /var/mail/root for user root. error writing message: File too large)
Jan 17 08:00:02 localhost postfix/qmgr[1680]: 19CCB3338: removed
Jan 18 07:00:33 localhost postfix/pickup[30894]: 167A14368: uid=0 from=<root>
Jan 18 07:00:33 localhost postfix/cleanup[5526]: 167A14368: message-id=<[email protected]>
Jan 18 07:00:33 localhost postfix/qmgr[1680]: 167A14368: from=<[email protected]>, size=3700, nrcpt=1 (queue active)
Jan 18 07:00:33 localhost postfix/local[5528]: 167A14368: to=<[email protected]>, orig_to=<root>, relay=local, delay=0.36, delays=0.19/0.09/0/0.08, dsn=5.2.2, status=bounced (cannot update mailbox /var/mail/root for user root. error writing message: File too large)
Jan 18 07:00:33 localhost postfix/cleanup[5526]: 5EA084369: message-id=<[email protected]>
Jan 18 07:00:33 localhost postfix/qmgr[1680]: 5EA084369: from=<>, size=5623, nrcpt=1 (queue active)
Jan 18 07:00:33 localhost postfix/bounce[5529]: 167A14368: sender non-delivery notification: 5EA084369
Jan 18 07:00:33 localhost postfix/qmgr[1680]: 167A14368: removed
Jan 18 07:00:33 localhost postfix/local[5528]: 5EA084369: to=<[email protected]>, relay=local, delay=0.01, delays=0/0/0/0, dsn=5.2.2, status=bounced (cannot update mailbox /var/mail/root for user root. error writing message: File too large)
Jan 18 07:00:33 localhost postfix/qmgr[1680]: 5EA084369: removed
Jan 18 08:00:01 localhost postfix/pickup[30894]: DD640F8D: uid=0 from=<root>
Jan 18 08:00:01 localhost postfix/cleanup[18471]: DD640F8D: message-id=<[email protected]>
Jan 18 08:00:01 localhost postfix/qmgr[1680]: DD640F8D: from=<[email protected]>, size=2642, nrcpt=1 (queue active)
Jan 18 08:00:01 localhost postfix/local[18478]: DD640F8D: to=<[email protected]>, orig_to=<root@localhost>, relay=local, delay=0.06, delays=0.03/0.01/0/0.02, dsn=5.2.2, status=bounced (cannot update mailbox /var/mail/root for user root. error writing message: File too large)
Jan 18 08:00:01 localhost postfix/cleanup[18471]: E758DF8E: message-id=<[email protected]>
Jan 18 08:00:01 localhost postfix/qmgr[1680]: E758DF8E: from=<>, size=4577, nrcpt=1 (queue active)
Jan 18 08:00:01 localhost postfix/bounce[18479]: DD640F8D: sender non-delivery notification: E758DF8E
Jan 18 08:00:01 localhost postfix/qmgr[1680]: DD640F8D: removed
Jan 18 08:00:01 localhost postfix/local[18478]: E758DF8E: to=<[email protected]>, relay=local, delay=0.01, delays=0/0/0/0, dsn=5.2.2, status=bounced (cannot update mailbox /var/mail/root for user root. error writing message: File too large)
Jan 18 08:00:01 localhost postfix/qmgr[1680]: E758DF8E: removed
Jan 18 12:51:56 localhost postfix/pickup[4792]: 4BE7BF11: uid=500 from=<nagios>
Jan 18 12:51:56 localhost postfix/cleanup[27909]: 4BE7BF11: message-id=<[email protected]>
Jan 18 12:51:56 localhost postfix/qmgr[1680]: 4BE7BF11: from=<[email protected]>, size=831, nrcpt=1 (queue active)
Jan 18 12:51:56 localhost postfix/smtp[27911]: 4BE7BF11: to=<[email protected]>, relay=smtp.priv.aglrsc.com[65.243.68.157]:25, delay=0.08, delays=0.03/0.04/0.01/0.01, dsn=2.6.0, status=sent (250 2.6.0 <[email protected]> Queued mail for delivery)
Jan 18 12:51:56 localhost postfix/qmgr[1680]: 4BE7BF11: removed
Here is the PHP mailer log
Code: Select all
SMTP Error: Could not connect to SMTP host. (method=smtp;host=smtp.priv.aglrsc.com ;port=25;security=none)
SMTP Error: Could not connect to SMTP host. (method=smtp;host=smtp.priv.aglrsc.com ;port=25;security=none)
Here are the Nagios log entries
Code: Select all
[1484768693] SERVICE ALERT: Field-GAATLP727W;Memory Usage;CRITICAL;HARD;1;CRITICAL - [Triggered by _MemUsed%>85] - Physical Memory: Total: 3.989GB - Used: 3.787GB (95%) - Free: 0.202GB (5%)
[1484768693] SERVICE NOTIFICATION: clicktechsupport;Field-GAATLP727W;Memory Usage;CRITICAL;xi_service_notification_handler;CRITICAL - [Triggered by _MemUsed%>85] - Physical Memory: Total: 3.989GB - Used: 3.787GB (95%) - Free: 0.202GB (5%)
[1484768705] SERVICE ALERT: poseidon24;CM-CCCB001;WARNING;HARD;1;kill 644
[1484768723] Warning: The results of service 'WMS : WMS SSIS Package failures Audit' on host 'agasc35u' are stale by 0d 0h 0m 40s (threshold=0d 0h 0m 20s). I'm forcing an immediate check of the service.
[1484768723] Warning: The results of service 'Click Outgoing Messages Count' on host 'agasc41u' are stale by 0d 0h 0m 40s (threshold=0d 0h 0m 20s). I'm forcing an immediate check of the service.
[1484768784] Warning: The results of service 'WMS : WMS SSIS Package failures Audit' on host 'agasc35u' are stale by 0d 0h 0m 41s (threshold=0d 0h 0m 20s). I'm forcing an immediate check of the service.
[1484768784] Warning: The results of service 'Click Outgoing Messages Count' on host 'agasc41u' are stale by 0d 0h 0m 41s (threshold=0d 0h 0m 20s). I'm forcing an immediate check of the service.
[1484768785] SERVICE ALERT: poseidon24;CM-CCCB001;OK;HARD;1;env
[1484768844] Auto-save of retention data completed successfully.
[1484768844] Warning: The results of service 'WMS : WMS SSIS Package failures Audit' on host 'agasc35u' are stale by 0d 0h 0m 40s (threshold=0d 0h 0m 20s). I'm forcing an immediate check of the service.
[1484768844] Warning: The results of service 'Click Outgoing Messages Count' on host 'agasc41u' are stale by 0d 0h 0m 40s (threshold=0d 0h 0m 20s). I'm forcing an immediate check of the service.
[1484768854] SERVICE NOTIFICATION: sukumar;sycbttsta01.nigas.com;nrpe_winprocess;CRITICAL;xi_service_notification_handler;PROCESS CRITICAL - 111 process(es)
[1484768904] Warning: The results of service 'WMS : WMS SSIS Package failures Audit' on host 'agasc35u' are stale by 0d 0h 0m 40s (threshold=0d 0h 0m 20s). I'm forcing an immediate check of the service.
[1484768904] Warning: The results of service 'Click Outgoing Messages Count' on host 'agasc41u' are stale by 0d 0h 0m 40s (threshold=0d 0h 0m 20s). I'm forcing an immediate check of the service.
[1484768945] SERVICE ALERT: poseidon24;CM-CCCB001;WARNING;HARD;1;kill 1515
[1484768956] SERVICE NOTIFICATION: NGASITFieldTeam;NGAS-Field ILNAPP96W;Check D: drive space;UNKNOWN;xi_service_notification_handler;UNKNOWN - The WMI query had problems. You might have your username/password wrong or the user's access level is too low. Wmic error text on the next line.
[1484768963] Warning: The results of service 'WMS : WMS SSIS Package failures Audit' on host 'agasc35u' are stale by 0d 0h 0m 39s (threshold=0d 0h 0m 20s). I'm forcing an immediate check of the service.
[1484768963] Warning: The results of service 'Click Outgoing Messages Count' on host 'agasc41u' are stale by 0d 0h 0m 39s (threshold=0d 0h 0m 20s). I'm forcing an immediate check of the service.
[1484768969] SERVICE NOTIFICATION: nagioswms;WMS Biztalk2;BizTalk - BizTalk Server 2010 Error;CRITICAL;xi_service_notification_handler;CRITICAL - [Triggered by _ItemCount>1] - 2 event(s) of at least Severity Level "Error", were recorded in the last 2 hours from the Application Event Log. (List is on next line. Fields shown are - Logfile:TimeGenerated:Type:SourceName:Message)
[1484769015] SERVICE NOTIFICATION: nagiosadmin;Ganetp27w.net.aglrsc.com;Drive C: Disk Usage;CRITICAL;xi_service_notification_handler;CRITICAL - Socket timeout after 10 seconds
[1484769023] Warning: The results of service 'WMS : WMS SSIS Package failures Audit' on host 'agasc35u' are stale by 0d 0h 0m 40s (threshold=0d 0h 0m 20s). I'm forcing an immediate check of the service.
[1484769023] Warning: The results of service 'Click Outgoing Messages Count' on host 'agasc41u' are stale by 0d 0h 0m 40s (threshold=0d 0h 0m 20s). I'm forcing an immediate check of the service.
[1484769024] SERVICE ALERT: poseidon24;CM-CCCB001;OK;HARD;1;env
-
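The bracketed numbers in nagios.log are Unix epoch seconds, which makes it hard to line the entries up with the maillog timestamps by eye. Here is a minimal Python sketch that converts them and pulls out the SERVICE NOTIFICATION entries. The helper names are hypothetical, and the field order (contact;host;service;state;command;output) is read off the entries above rather than taken from any Nagios API. In this excerpt every notification names xi_service_notification_handler, which suggests (but does not prove) that Nagios itself is still generating notifications and the failure is further downstream.

```python
from collections import Counter
from datetime import datetime, timezone


def parse_line(line):
    """Split '[epoch] TYPE: detail' into (UTC datetime, type, detail).

    Returns None for lines that do not start with a bracketed epoch.
    Lines without a 'TYPE: ' prefix get an empty type.
    """
    if not line.startswith("["):
        return None
    stamp, sep, rest = line[1:].partition("] ")
    if not sep or not stamp.isdigit():
        return None
    kind, sep, detail = rest.partition(": ")
    if not sep:
        kind, detail = "", rest
    when = datetime.fromtimestamp(int(stamp), tz=timezone.utc)
    return when, kind, detail


def service_notifications(lines):
    """Yield one dict per SERVICE NOTIFICATION entry."""
    names = ("contact", "host", "service", "state", "command", "output")
    for line in lines:
        parsed = parse_line(line)
        if parsed is None or parsed[1] != "SERVICE NOTIFICATION":
            continue
        # Limit the split so semicolons inside the plugin output survive.
        fields = parsed[2].split(";", 5)
        entry = dict(zip(names, fields))
        entry["when"] = parsed[0]
        yield entry


def handlers_used(lines):
    """Count how often each notification command appears."""
    return Counter(entry.get("command") for entry in service_notifications(lines))
```

For example, `handlers_used(open("/usr/local/nagios/var/nagios.log"))` would show whether any notifications went through a handler other than the XI one.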
dwhitfield
Re: Nagios XI Notifications Stopped working
Can you ping smtp.priv.aglrsc.com from the Nagios server? Is smtp.priv.aglrsc.com resolving to the proper IP address on the Nagios server?
I know you said you were working with your Exchange team, but I will go ahead and ask...did the name of the smtp server change?
Perhaps a new shiny firewall appliance arrived Monday morning?
I don't know how much experience you have looking at maillogs, but I have seen those File too large errors be false positives. It's worth making sure your server isn't out of space though.
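For the name resolution question, here is a small Python sketch that could be run on the Nagios server to see which addresses a name resolves to. The function names are made up for illustration. It uses the system resolver, so it should reflect /etc/hosts as well as DNS.

```python
import socket


def resolve(hostname):
    """Return the sorted, de-duplicated IP addresses `hostname` resolves to."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); the address
    # string is the first element of sockaddr for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


def matches_expected(hostname, expected_ip):
    """True if `expected_ip` is among the addresses `hostname` resolves to."""
    try:
        return expected_ip in resolve(hostname)
    except socket.gaierror:
        # The name did not resolve at all.
        return False
```

One way to use it: the maillog entry from Jan 18 12:51:56 shows the relay as smtp.priv.aglrsc.com[65.243.68.157], so `matches_expected("smtp.priv.aglrsc.com", "65.243.68.157")` would tell you whether the name still resolves to that address.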
Re: Nagios XI Notifications Stopped working
Ping works fine. But when I send a test email from Nagios, it does NOT work. Locally, admins are able to send email from the server with no problem.
Re: Nagios XI Notifications Stopped working
This error here indicates that it can't even connect:
Code: Select all
SMTP Error: Could not connect to SMTP host.
What is the output of this command on the XI server:
Code: Select all
nmap -p25 X.X.X.X
- You may need to change 25 to whatever port you guys are using
- Change X.X.X.X to your mailserver address
Thank you
Re: Nagios XI Notifications Stopped working
Code: Select all
PORT STATE SERVICE
25/tcp open smtp
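An open port in nmap only shows that something accepts TCP connections on 25. As a possible follow-up, here is a minimal Python sketch that connects and reads the greeting line, which an SMTP server normally starts with code 220. The function names are hypothetical, and this is only a rough check under the assumption that it runs on the XI server as the same user that sends notifications.

```python
import socket


def smtp_banner(host, port=25, timeout=10.0):
    """Connect to host:port and return the first line the server sends.

    Returns None if the connection fails, times out, or nothing is received.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            data = conn.recv(1024)
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return None
    if not data:
        return None
    return data.decode("ascii", errors="replace").splitlines()[0]


def looks_like_smtp(banner):
    """True if the greeting starts with the SMTP 'service ready' code 220."""
    return banner is not None and banner.startswith("220")
```

Calling `smtp_banner("X.X.X.X", 25)` with the mail server address in place of X.X.X.X would return the greeting, or None if the connection could not be made.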
-
dwhitfield
Re: Nagios XI Notifications Stopped working
Did the Exchange team change the SMTP port # on you?