Nagios XI Notifications Stopped working

sujitt · Post by **sujitt** » Wed Jan 18, 2017 12:05 pm

It has now been 2 days since we received any notifications from Nagios. Everything was working fine before that. Even a forced check that fails does not produce a notification.

Here is the profile info for our Nagios environment.

System:

Nagios XI Version : 2014R1.5
2.6.32-358.2.1.el6.x86_64 x86_64
CentOS release 6.4 (Final)
Gnome is not installed

Apache Information

PHP Version: 5.3.3
Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36
Server Name:
Server Address:
Server Port: 80

Date/Time

PHP Timezone: America/New_York
PHP Time: Wed, 18 Jan 2017 12:02:57 -0500
System Time: Wed, 18 Jan 2017 12:02:57 -0500

Nagios XI Data

License ends in: NTOSNM

nagios (pid 18438) is running...
NPCD running (pid 30792).
ndo2db (pid 29780) is running...
CPU Load 15: 0.67
Total Hosts: 196
Total Services: 2933
Function 'get_base_uri' returns: http://../nagiosxi/
Function 'get_base_url' returns: http://../nagiosxi/
Function 'get_backend_url(internal_call=false)' returns:../nagiosxi/includes/components/profile/profile.php
Function 'get_backend_url(internal_call=true)' returns: http://localhost/nagiosxi/backend/

Ping Test localhost

Running:
/bin/ping -c 3 localhost 2>&1
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.055 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.033 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.035 ms

--- localhost ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.033/0.041/0.055/0.009 ms

Test wget To localhost

WGET From URL: http://localhost/nagiosxi/includes/components/ccm/
Running:
/usr/bin/wget http://localhost/nagiosxi/includes/components/ccm/
--2017-01-18 12:02:59-- http://localhost/nagiosxi/includes/components/ccm/
Resolving localhost... ::1, 127.0.0.1
Connecting to localhost|::1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: "/usr/local/nagiosxi/tmp/ccm_index.tmp"

0K ......... 877K=0.01s

2017-01-18 12:02:59 (877 KB/s) - "/usr/local/nagiosxi/tmp/ccm_index.tmp" saved [9666]

sujitt · Post by **sujitt** » Wed Jan 18, 2017 1:50 pm

Do I need to call this in?

dwhitfield · Post by **dwhitfield** » Wed Jan 18, 2017 2:22 pm

What's the output of tail -50 /var/log/maillog? This file logs email sent through sendmail, so it only applies to core contact "notify-*-by-email" notification handlers and sendmail tests.

Also, what's the output of tail -50 /usr/local/nagios/var/nagios.log? That will let us see checks, notifications, external commands, and events.

Last but not least, what's the output of tail -50 /usr/local/nagiosxi/tmp/phpmailer.log?

Were there any changes to your mail setup 2 days ago? Did you run out of disk space? Was there a power outage? I'm guessing from CentOS 6.4 and 2014R1.5 that you didn't run any updates 2 days ago. Also, two days ago was Monday. Did you get notifications over the weekend, or is it possible the problem started on Friday?
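
The three log requests above, plus the disk space question, can be gathered in one pass. This is only a sketch: tail_logs is a made-up helper name, not a Nagios XI tool, and the paths are the ones quoted in this post. The maillog usually needs root to read.

```shell
# tail_logs is a hypothetical helper (not a Nagios XI tool): print the last
# N lines of each file given, under a header naming the file.
tail_logs() {
    lines="$1"
    shift
    for f in "$@"; do
        echo "==== $f ===="
        if [ -r "$f" ]; then
            tail -n "$lines" "$f"
        else
            echo "(not readable or not present)"
        fi
    done
}

# Paths as quoted in this post; unreadable files are reported, not fatal.
tail_logs 50 /var/log/maillog \
    /usr/local/nagios/var/nagios.log \
    /usr/local/nagiosxi/tmp/phpmailer.log

# Free space per filesystem, for the disk space question.
df -h
```

Redirecting the output to a file makes it easier to attach to a reply.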

sujitt · Post by **sujitt** » Wed Jan 18, 2017 2:51 pm

It looks like there are some errors in these logs. I am working with our Exchange team and other teams to make sure that no changes were made.

Thanks
Sujith
Here are the mail logs

Code: Select all

Jan 17 07:00:34 localhost postfix/qmgr[1680]: 22B58100F: removed
Jan 17 08:00:02 localhost postfix/pickup[19070]: 1055F3337: uid=0 from=<root>
Jan 17 08:00:02 localhost postfix/cleanup[30274]: 1055F3337: message-id=<[email protected]>
Jan 17 08:00:02 localhost postfix/qmgr[1680]: 1055F3337: from=<[email protected]>, size=2626, nrcpt=1 (queue active)
Jan 17 08:00:02 localhost postfix/local[30276]: 1055F3337: to=<[email protected]>, orig_to=<root@localhost>, relay=local, delay=0.05, delays=0.02/0.01/0/0.02, dsn=5.2.2, status=bounced (cannot update mailbox /var/mail/root for user root. error writing message: File too large)
Jan 17 08:00:02 localhost postfix/cleanup[30274]: 19CCB3338: message-id=<[email protected]>
Jan 17 08:00:02 localhost postfix/bounce[30278]: 1055F3337: sender non-delivery notification: 19CCB3338
Jan 17 08:00:02 localhost postfix/qmgr[1680]: 19CCB3338: from=<>, size=4569, nrcpt=1 (queue active)
Jan 17 08:00:02 localhost postfix/qmgr[1680]: 1055F3337: removed
Jan 17 08:00:02 localhost postfix/local[30276]: 19CCB3338: to=<[email protected]>, relay=local, delay=0.05, delays=0.01/0/0/0.04, dsn=5.2.2, status=bounced (cannot update mailbox /var/mail/root for user root. error writing message: File too large)
Jan 17 08:00:02 localhost postfix/qmgr[1680]: 19CCB3338: removed
Jan 18 07:00:33 localhost postfix/pickup[30894]: 167A14368: uid=0 from=<root>
Jan 18 07:00:33 localhost postfix/cleanup[5526]: 167A14368: message-id=<[email protected]>
Jan 18 07:00:33 localhost postfix/qmgr[1680]: 167A14368: from=<[email protected]>, size=3700, nrcpt=1 (queue active)
Jan 18 07:00:33 localhost postfix/local[5528]: 167A14368: to=<[email protected]>, orig_to=<root>, relay=local, delay=0.36, delays=0.19/0.09/0/0.08, dsn=5.2.2, status=bounced (cannot update mailbox /var/mail/root for user root. error writing message: File too large)
Jan 18 07:00:33 localhost postfix/cleanup[5526]: 5EA084369: message-id=<[email protected]>
Jan 18 07:00:33 localhost postfix/qmgr[1680]: 5EA084369: from=<>, size=5623, nrcpt=1 (queue active)
Jan 18 07:00:33 localhost postfix/bounce[5529]: 167A14368: sender non-delivery notification: 5EA084369
Jan 18 07:00:33 localhost postfix/qmgr[1680]: 167A14368: removed
Jan 18 07:00:33 localhost postfix/local[5528]: 5EA084369: to=<[email protected]>, relay=local, delay=0.01, delays=0/0/0/0, dsn=5.2.2, status=bounced (cannot update mailbox /var/mail/root for user root. error writing message: File too large)
Jan 18 07:00:33 localhost postfix/qmgr[1680]: 5EA084369: removed
Jan 18 08:00:01 localhost postfix/pickup[30894]: DD640F8D: uid=0 from=<root>
Jan 18 08:00:01 localhost postfix/cleanup[18471]: DD640F8D: message-id=<[email protected]>
Jan 18 08:00:01 localhost postfix/qmgr[1680]: DD640F8D: from=<[email protected]>, size=2642, nrcpt=1 (queue active)
Jan 18 08:00:01 localhost postfix/local[18478]: DD640F8D: to=<[email protected]>, orig_to=<root@localhost>, relay=local, delay=0.06, delays=0.03/0.01/0/0.02, dsn=5.2.2, status=bounced (cannot update mailbox /var/mail/root for user root. error writing message: File too large)
Jan 18 08:00:01 localhost postfix/cleanup[18471]: E758DF8E: message-id=<[email protected]>
Jan 18 08:00:01 localhost postfix/qmgr[1680]: E758DF8E: from=<>, size=4577, nrcpt=1 (queue active)
Jan 18 08:00:01 localhost postfix/bounce[18479]: DD640F8D: sender non-delivery notification: E758DF8E
Jan 18 08:00:01 localhost postfix/qmgr[1680]: DD640F8D: removed
Jan 18 08:00:01 localhost postfix/local[18478]: E758DF8E: to=<[email protected]>, relay=local, delay=0.01, delays=0/0/0/0, dsn=5.2.2, status=bounced (cannot update mailbox /var/mail/root for user root. error writing message: File too large)
Jan 18 08:00:01 localhost postfix/qmgr[1680]: E758DF8E: removed
Jan 18 12:51:56 localhost postfix/pickup[4792]: 4BE7BF11: uid=500 from=<nagios>
Jan 18 12:51:56 localhost postfix/cleanup[27909]: 4BE7BF11: message-id=<[email protected]>
Jan 18 12:51:56 localhost postfix/qmgr[1680]: 4BE7BF11: from=<[email protected]>, size=831, nrcpt=1 (queue active)
Jan 18 12:51:56 localhost postfix/smtp[27911]: 4BE7BF11: to=<[email protected]>, relay=smtp.priv.aglrsc.com[65.243.68.157]:25, delay=0.08, delays=0.03/0.04/0.01/0.01, dsn=2.6.0, status=sent (250 2.6.0  <[email protected]> Queued mail for delivery)
Jan 18 12:51:56 localhost postfix/qmgr[1680]: 4BE7BF11: removed

Here is the PHPMailer log

Code: Select all

SMTP Error: Could not connect to SMTP host. (method=smtp;host=smtp.priv.aglrsc.com ;port=25;security=none)
SMTP Error: Could not connect to SMTP host. (method=smtp;host=smtp.priv.aglrsc.com ;port=25;security=none)

Here are the Nagios log entries

Code: Select all

[1484768693] SERVICE ALERT: Field-GAATLP727W;Memory Usage;CRITICAL;HARD;1;CRITICAL - [Triggered by _MemUsed%>85] - Physical Memory: Total: 3.989GB - Used: 3.787GB (95%) - Free: 0.202GB (5%)
[1484768693] SERVICE NOTIFICATION: clicktechsupport;Field-GAATLP727W;Memory Usage;CRITICAL;xi_service_notification_handler;CRITICAL - [Triggered by _MemUsed%>85] - Physical Memory: Total: 3.989GB - Used: 3.787GB (95%) - Free: 0.202GB (5%)
[1484768705] SERVICE ALERT: poseidon24;CM-CCCB001;WARNING;HARD;1;kill 644
[1484768723] Warning: The results of service 'WMS : WMS SSIS Package failures Audit' on host 'agasc35u' are stale by 0d 0h 0m 40s (threshold=0d 0h 0m 20s).  I'm forcing an immediate check of the service.
[1484768723] Warning: The results of service 'Click Outgoing Messages Count' on host 'agasc41u' are stale by 0d 0h 0m 40s (threshold=0d 0h 0m 20s).  I'm forcing an immediate check of the service.
[1484768784] Warning: The results of service 'WMS : WMS SSIS Package failures Audit' on host 'agasc35u' are stale by 0d 0h 0m 41s (threshold=0d 0h 0m 20s).  I'm forcing an immediate check of the service.
[1484768784] Warning: The results of service 'Click Outgoing Messages Count' on host 'agasc41u' are stale by 0d 0h 0m 41s (threshold=0d 0h 0m 20s).  I'm forcing an immediate check of the service.
[1484768785] SERVICE ALERT: poseidon24;CM-CCCB001;OK;HARD;1;env
[1484768844] Auto-save of retention data completed successfully.
[1484768844] Warning: The results of service 'WMS : WMS SSIS Package failures Audit' on host 'agasc35u' are stale by 0d 0h 0m 40s (threshold=0d 0h 0m 20s).  I'm forcing an immediate check of the service.
[1484768844] Warning: The results of service 'Click Outgoing Messages Count' on host 'agasc41u' are stale by 0d 0h 0m 40s (threshold=0d 0h 0m 20s).  I'm forcing an immediate check of the service.
[1484768854] SERVICE NOTIFICATION: sukumar;sycbttsta01.nigas.com;nrpe_winprocess;CRITICAL;xi_service_notification_handler;PROCESS CRITICAL - 111 process(es)
[1484768904] Warning: The results of service 'WMS : WMS SSIS Package failures Audit' on host 'agasc35u' are stale by 0d 0h 0m 40s (threshold=0d 0h 0m 20s).  I'm forcing an immediate check of the service.
[1484768904] Warning: The results of service 'Click Outgoing Messages Count' on host 'agasc41u' are stale by 0d 0h 0m 40s (threshold=0d 0h 0m 20s).  I'm forcing an immediate check of the service.
[1484768945] SERVICE ALERT: poseidon24;CM-CCCB001;WARNING;HARD;1;kill 1515
[1484768956] SERVICE NOTIFICATION: NGASITFieldTeam;NGAS-Field ILNAPP96W;Check D: drive space;UNKNOWN;xi_service_notification_handler;UNKNOWN - The WMI query had problems. You might have your username/password wrong or the user's access level is too low. Wmic error text on the next line.
[1484768963] Warning: The results of service 'WMS : WMS SSIS Package failures Audit' on host 'agasc35u' are stale by 0d 0h 0m 39s (threshold=0d 0h 0m 20s).  I'm forcing an immediate check of the service.
[1484768963] Warning: The results of service 'Click Outgoing Messages Count' on host 'agasc41u' are stale by 0d 0h 0m 39s (threshold=0d 0h 0m 20s).  I'm forcing an immediate check of the service.
[1484768969] SERVICE NOTIFICATION: nagioswms;WMS Biztalk2;BizTalk - BizTalk Server 2010 Error;CRITICAL;xi_service_notification_handler;CRITICAL - [Triggered by _ItemCount>1] - 2 event(s) of at least Severity Level "Error", were recorded in the last 2 hours from the Application Event Log. (List is on next line. Fields shown are - Logfile:TimeGenerated:Type:SourceName:Message)
[1484769015] SERVICE NOTIFICATION: nagiosadmin;Ganetp27w.net.aglrsc.com;Drive C: Disk Usage;CRITICAL;xi_service_notification_handler;CRITICAL - Socket timeout after 10 seconds
[1484769023] Warning: The results of service 'WMS : WMS SSIS Package failures Audit' on host 'agasc35u' are stale by 0d 0h 0m 40s (threshold=0d 0h 0m 20s).  I'm forcing an immediate check of the service.
[1484769023] Warning: The results of service 'Click Outgoing Messages Count' on host 'agasc41u' are stale by 0d 0h 0m 40s (threshold=0d 0h 0m 20s).  I'm forcing an immediate check of the service.
[1484769024] SERVICE ALERT: poseidon24;CM-CCCB001;OK;HARD;1;env

dwhitfield · Post by **dwhitfield** » Wed Jan 18, 2017 3:06 pm

Can you ping smtp.priv.aglrsc.com from the Nagios server? Is smtp.priv.aglrsc.com resolving to the proper IP address on the Nagios server?
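
One way to answer the resolution question is to compare what the resolver returns with the relay address postfix logged on its one successful delivery in the maillog above (65.243.68.157). A minimal sketch, assuming getent is available on the server; ip_in_list and check_relay are hypothetical helper names.

```shell
# ip_in_list is a hypothetical helper: succeed if the expected IP is the
# first field of any line on stdin (the layout `getent ahosts` prints).
ip_in_list() {
    awk -v ip="$1" '$1 == ip { found = 1 } END { exit !found }'
}

# check_relay HOST EXPECTED_IP: resolve HOST and compare with EXPECTED_IP.
check_relay() {
    if getent ahosts "$1" | ip_in_list "$2"; then
        echo "$1 resolves to $2"
    else
        echo "$1 does NOT resolve to $2"
    fi
}
```

On the Nagios server this would be run as: check_relay smtp.priv.aglrsc.com 65.243.68.157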

I know you said you were working with your Exchange team, but I will go ahead and ask...did the name of the smtp server change?

Perhaps a new shiny firewall appliance arrived Monday morning?

I don't know how much experience you have looking at maillogs, but I have seen those File too large errors be false positives. It's worth making sure your server isn't out of space though.
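
For what it's worth, with postfix a "File too large" bounce on local delivery can also mean the mailbox file has reached mailbox_size_limit (51200000 bytes by default, with 0 meaning no limit) rather than the disk being full. A rough sketch for checking both; mailbox_over_limit is a hypothetical helper name, and the mailbox path is the one from the maillog above.

```shell
# mailbox_over_limit MBOX LIMIT: compare the size of a mailbox file with a
# byte limit. A LIMIT of 0 is treated as "no limit", as postfix does.
mailbox_over_limit() {
    mbox="$1"
    limit="$2"
    size=$(wc -c < "$mbox" | tr -d ' ')
    if [ "$limit" -gt 0 ] && [ "$size" -ge "$limit" ]; then
        echo "at limit: $mbox is $size bytes (limit $limit)"
    else
        echo "under limit: $mbox is $size bytes (limit $limit)"
    fi
}

# Only runs where the mailbox and postconf exist (i.e. on the Nagios server).
if [ -r /var/mail/root ] && command -v postconf >/dev/null 2>&1; then
    mailbox_over_limit /var/mail/root "$(postconf -h mailbox_size_limit)"
    df -h /var/mail
fi
```

A mailbox just under the limit can still bounce a message that would push it over, so a size close to the limit is worth treating the same way. Either way, those bounces are for root's local mailbox and are separate from the SMTP notifications.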

sujitt · Post by **sujitt** » Wed Jan 18, 2017 4:09 pm

Ping works fine, but when I send a test email from Nagios it does NOT work. Locally, admins are able to send email from the server with no problem.

ssax · Post by **ssax** » Wed Jan 18, 2017 5:26 pm

This error here indicates that it can't even connect:

Code: Select all

SMTP Error: Could not connect to SMTP host.

What is the output of this command on the XI server:
- You may need to change 25 to whatever port you guys are using
- Change X.X.X.X to your mailserver address

Code: Select all

nmap -p25 X.X.X.X

Thank you

sujitt · Post by **sujitt** » Wed Jan 18, 2017 5:31 pm

Code: Select all

PORT   STATE SERVICE
25/tcp open  smtp

dwhitfield · Post by **dwhitfield** » Wed Jan 18, 2017 5:35 pm

Did the Exchange team change the SMTP port # on you?
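
One detail in the PHPMailer log above may also be worth ruling out: the entry reads "host=smtp.priv.aglrsc.com ;port=25", with a space before the semicolon. That could simply be how the line was logged or pasted, but a stray space in the configured SMTP host is cheap to check for. A small sketch that flags it; check_host_field is a hypothetical helper name, and the log path is the one quoted earlier in the thread.

```shell
# check_host_field is a hypothetical helper: read PHPMailer log lines on
# stdin, pull out each distinct host= value, and flag any value that begins
# or ends with a space.
check_host_field() {
    sed -n 's/.*host=\([^;]*\);.*/\1/p' | sort -u |
    while IFS= read -r host; do
        case "$host" in
            " "*|*" ") echo "whitespace around host: [$host]" ;;
            *) echo "clean host: [$host]" ;;
        esac
    done
}

# Only runs where the log exists (i.e. on the XI server).
if [ -r /usr/local/nagiosxi/tmp/phpmailer.log ]; then
    check_host_field < /usr/local/nagiosxi/tmp/phpmailer.log
fi
```

If a value is flagged, retyping the SMTP host in the XI mail settings and sending another test email would show whether it matters.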
