
Nagios Xi email alerts stopped to work

Posted: Thu Apr 21, 2016 5:30 am
by caf_infra
Hello Nagios Support Team,

We are experiencing problems with email alerting on one of our Nagios XI servers. Email alerting worked as expected for a while, but then it stopped working.
The server is set to send notifications using SMTP and is pointed at a relay on the standard port 25 without authentication.
The email settings have not been changed since they were first set. Sending a test email from the web interface works and I receive the test email; however, host and service notifications are not being sent.
Ping and telnet to the relay are successful, so the Nagios server can communicate with the relay. Alert settings on hosts/services are set to send notifications 24x7, and contacts are in place and configured to receive notifications.
Please assist with this email problem, as this is critical for our infrastructure monitoring.
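
Since the test email arrives but host and service notifications do not, one way to narrow this down is to check whether Nagios Core is logging notification events at all. The sketch below is a minimal example that counts `HOST NOTIFICATION` and `SERVICE NOTIFICATION` entries in log text read from stdin. The function name `count_notifications` is hypothetical, and the log path shown in the usage comment is an assumption based on the usual default location; adjust it to the actual installation.

```shell
# Count host and service notification entries in Nagios log text on stdin.
# Notification lines start with an epoch timestamp in brackets followed by
# "HOST NOTIFICATION:" or "SERVICE NOTIFICATION:".
count_notifications() {
  # grep -c prints 0 and exits non-zero when nothing matches,
  # so "|| true" keeps the function usable under "set -e".
  grep -c -E '^\[[0-9]+\] (HOST|SERVICE) NOTIFICATION:' || true
}

# Usage (assumed default log path, not confirmed in this thread):
#   count_notifications < /usr/local/nagios/var/nagios.log
```

If the count is non-zero, Nagios is probably handing notifications to the notification command and the problem is more likely in delivery. If it is zero, the notifications are likely being suppressed before any mail is attempted.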

System profile:
Nagios XI Installation Profile

System:

Nagios XI Version : 5.2.5
nagxiliv02.caf.org.uk 3.10.0-327.10.1.el7.x86_64 x86_64
CentOS Linux release 7.2.1511 (Core)
Gnome Installed
Apache Information

PHP Version: 5.4.16
Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36
Server Name: nagxiliv02.caf.org.uk
Server Address: 10.120.0.22
Server Port: 443
Date/Time

PHP Timezone: Europe/London
PHP Time: Thu, 21 Apr 2016 11:27:48 +0100
System Time: Thu, 21 Apr 2016 11:27:48 +0100
Nagios XI Data

License ends in: RPSNPT

nagios (pid 21419) is running...
NPCD running (pid 956).
ndo2db (pid 1198) is running...
CPU Load 15: 0.44
Total Hosts: 21
Total Services: 190
Function 'get_base_uri' returns: https://nagxiliv02.caf.org.uk/nagiosxi/
Function 'get_base_url' returns: https://nagxiliv02.caf.org.uk/nagiosxi/
Function 'get_backend_url(internal_call=false)' returns: https://nagxiliv02.caf.org.uk/nagiosxi/ ... rofile.php
Function 'get_backend_url(internal_call=true)' returns: https://localhost/nagiosxi/backend/
Ping Test localhost

Running:
/bin/ping -c 3 localhost 2>&1
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.104 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.124 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.127 ms

--- localhost ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.104/0.118/0.127/0.013 ms
Test wget To localhost

WGET From URL: https://localhost/nagiosxi/includes/components/ccm/
Running:
/usr/bin/wget https://localhost/nagiosxi/includes/components/ccm/
--2016-04-21 11:27:50-- https://localhost/nagiosxi/includes/components/ccm/
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:443... connected.
ERROR: cannot verify localhost's certificate, issued by '/C=UK/ST=Kent/L=West Malling/O=CAF/OU=IT/CN=nagxiliv01/emailAddress=[email protected]':
Self-signed certificate encountered.
ERROR: certificate common name 'nagxiliv01' doesn't match requested host name 'localhost'.
To connect to localhost insecurely, use `--no-check-certificate'.
Network Settings

1: lo: mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens160: mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:50:56:8a:4a:d2 brd ff:ff:ff:ff:ff:ff
    inet 10.120.0.22/16 brd 10.120.255.255 scope global ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:fe8a:4ad2/64 scope link
       valid_lft forever preferred_lft forever

default via 10.120.0.3 dev ens160 proto static metric 100
10.120.0.0/16 dev ens160 proto kernel scope link src 10.120.0.22 metric 100

Re: Nagios Xi email alerts stopped to work

Posted: Thu Apr 21, 2016 9:53 am
by rkennedy
Can you run /usr/local/nagiosxi/scripts/repair/repair_databases.sh and see if that helps? Usually, if notifications stop out of nowhere, it's related to SQL.

If that's not it, can you PM over your profile for me to take a look at? (Admin -> System Profile -> Download Profile)

Re: Nagios Xi email alerts stopped to work

Posted: Mon Apr 25, 2016 6:24 am
by caf_infra
I ran the DB repair script on Friday and left Nagios running over the weekend. The script ran OK; only one message at the end reported an ERROR (see attachment). Today I checked my email for alerts and there were some, but they were for non-existent (old) hosts/services that were deleted from Nagios a long time ago. It looks like Nagios is using a rolled-back (repaired) DB rather than the current DB, which is really confusing. That raises a question: where is the current DB, and where is new data stored? Are there any other ways to repair Nagios alerts?

Re: Nagios Xi email alerts stopped to work

Posted: Mon Apr 25, 2016 6:31 am
by caf_infra
I'm posting our Nagios Xi System Profile zip for troubleshooting.

Re: Nagios Xi email alerts stopped to work

Posted: Mon Apr 25, 2016 10:53 am
by rkennedy

160406  9:31:14 [ERROR] /usr/libexec/mysqld: Table './nagios/nagios_servicestatus' is marked as crashed and should be repaired
160406  9:31:14 [ERROR] /usr/libexec/mysqld: Table './nagios/nagios_servicestatus' is marked as crashed and should be repaired
160406  9:31:14 [Note] /usr/libexec/mysqld: Normal shutdown
It looks like your tables have indeed crashed.
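
To see which tables mysqld has flagged, the error lines quoted above can be reduced to a unique list of table paths. This is a minimal sketch that matches only the exact message wording shown in the excerpt. The function name `crashed_tables` is hypothetical, and the log file location in the usage comment is an assumption that varies by distribution and MySQL/MariaDB package.

```shell
# Read mysqld error-log text on stdin and print each table that is
# reported as "marked as crashed", one path per line, without duplicates.
crashed_tables() {
  grep -o "Table '[^']*' is marked as crashed" \
    | sed "s/^Table '//; s/' is marked as crashed\$//" \
    | sort -u
}

# Usage (log path is an assumption; check the server's own configuration):
#   crashed_tables < /var/log/mysqld.log
```

The resulting list shows which tables to verify after running the repair script.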

How many CPUs do you have allocated to this machine? Your load looks abnormally high.

Additionally, I noticed -


nagios    3970  2.2  0.0 144884 10828 ?        S    11:45   0:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_ifoperstatus -H 172.18.21.1 -C ######## -k 10625
Is this the right community string? I had a recent ticket that had a high load because a lot of SNMP checks did not have the correct community string.

Re: Nagios Xi email alerts stopped to work

Posted: Mon Apr 25, 2016 11:28 am
by caf_infra
Currently we have 2 vCPUs and 4GB of memory allocated to the Nagios host. If CPU is the problem, we can increase it to 4 at any time. I will repair the MySQL tables tomorrow, as office hours in London are over. In terms of SNMP, we are not using any SNMP checks at this point (I have to double-check). This Nagios instance is monitoring only 21 hosts, so it shouldn't be a CPU issue because there is not much load on the server. Apart from the MySQL table repair, would you suggest adding more CPU, memory, or disk space?
I will check your reply tomorrow as soon as I am in the office, carry out the necessary repair work, and post the results.

Many Thanks!
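
Whether a load figure is high depends on the number of CPUs, so a rough way to judge it is to divide the load average by the CPU count. The sketch below does that for two values passed as arguments. The function name `load_per_cpu` is hypothetical, and the common rule of thumb that a sustained value well above 1 per CPU suggests saturation is a general heuristic, not something stated in this thread.

```shell
# Print the load average divided by the number of CPUs, to two decimals.
#   $1 = load average (for example the 15-minute value)
#   $2 = number of CPUs
load_per_cpu() {
  awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# Usage on a live Linux system (third field of /proc/loadavg is the
# 15-minute average; nproc reports the number of available CPUs):
#   load_per_cpu "$(cut -d' ' -f3 /proc/loadavg)" "$(nproc)"
```

With the figures posted in this thread (a 15-minute load of 0.44 in the system profile and 2 vCPUs), `load_per_cpu 0.44 2` gives a per-CPU figure well below 1.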

Re: Nagios Xi email alerts stopped to work

Posted: Mon Apr 25, 2016 12:35 pm
by hsmith
I would make sure you repair the tables before you do anything. I've seen this cause major performance issues.

Re: Nagios Xi email alerts stopped to work

Posted: Mon Apr 25, 2016 8:40 pm
by Box293
What is the output of:

free -m

Re: Nagios Xi email alerts stopped to work

Posted: Tue Apr 26, 2016 3:35 am
by caf_infra
The output from free -m:

Re: Nagios Xi email alerts stopped to work

Posted: Tue Apr 26, 2016 4:43 am
by caf_infra
I have run the repair script, which seems to have repaired the database, but email alerts are still not being sent.
Also, I have performed a manual table repair as described in the Repairing_The_Nagios_XI_Database.pdf document, and still no luck.

Attaching Profile zip file.

OK. Let's narrow it down a bit. I found an error when running the repairmysql.sh script for the nagiosxi database, as follows: