mail problem

MPIvan · Post by **MPIvan** » Tue Dec 17, 2013 8:05 am

Hi,

As I have mentioned many times, I'm using Nagios 4.0.2 on CentOS in a virtual machine managed by VMware vCenter, and I take a snapshot before making any changes. Nagios was working fine and sending mail as it should. After I tried something that didn't work, I reverted the machine to its previous state using the VMware snapshot manager. I have done this many times, and after reverting, the machine (and Nagios with it) has always worked fine. This time, however, I have stopped receiving mail for one type of host (a type, not a group; I use groups, but not for contacts): the routers. I still receive mail for other hosts, but not for these, and I can't see what the problem is. Any suggestions?

I have also noticed this in the log file:

[12-17-2013 14:40:03] wproc: Core Worker 12150: job 54509 (pid=16002): Dormant child reaped
[12-17-2013 14:39:58] wproc: Core Worker 12150: Failed to reap child with pid 16002. Next attempt @ 1387287603.462683
[12-17-2013 14:39:58] wproc: Core Worker 12150: tv.tv_sec is currently 1387287598
[12-17-2013 14:39:58] Warning: Check of host 'Anten05' timed out after 30.01 seconds
[12-17-2013 14:39:58] wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[12-17-2013 14:39:58] wproc: host=Anten05; service=(null);
[12-17-2013 14:39:58] wproc: command: /usr/local/nagios/libexec/check_ping -H 192.168.1.101 -w 3000.0,80% -c 5000.0,100% -p 5
[12-17-2013 14:39:58] wproc: CHECK job 54509 from worker Core Worker 12150 timed out after 30.01s
[12-17-2013 14:39:58] wproc: Core Worker 12150: job 54509 (pid=16002) timed out. Killing it

I should also mention that the last modification I made on the routers was:

check_command check_dummy!0
parents localhost

abrist · Post by **abrist** » Tue Dec 17, 2013 10:58 am

You have some checks that are timing out. Additionally, your change to check_dummy will always report the router as up (though I presume you are aware of this). Nothing in your post helps with the mail troubleshooting. Could you post a tail of your maillog?

tail -25 /var/log/maillog

MPIvan · Post by **MPIvan** » Tue Dec 17, 2013 12:08 pm

Here it is

[root@mpnagios objects]# tail -25 /var/log/maillog
Dec 17 17:33:26 mpnagios postfix/pickup[6722]: DCEAC2A03A0: uid=500 from=<nagios>
Dec 17 17:33:26 mpnagios postfix/cleanup[18662]: DCEAC2A03A0: message-id=<20131217163326.DCEAC2A03A0@mpnagios>
Dec 17 17:33:26 mpnagios postfix/qmgr[1357]: DCEAC2A03A0: from=<nagios@makpetrol.com.mk>, size=659, nrcpt=1 (queue active)
Dec 17 17:33:26 mpnagios postfix/local[18664]: DCEAC2A03A0: to=<nagios@localhost.localdomain>, orig_to=<nagios@localhost>, relay=local, delay=0.08, delays=0.06/0.02/0/0.01, dsn=2.0.0, status=sent (delivered to mailbox)
Dec 17 17:33:26 mpnagios postfix/qmgr[1357]: DCEAC2A03A0: removed
Dec 17 17:47:29 mpnagios postfix/pickup[6722]: 0B21F2A03A0: uid=500 from=<nagios>
Dec 17 17:47:29 mpnagios postfix/cleanup[21396]: 0B21F2A03A0: message-id=<20131217164729.0B21F2A03A0@mpnagios>
Dec 17 17:47:29 mpnagios postfix/qmgr[1357]: 0B21F2A03A0: from=<nagios@makpetrol.com.mk>, size=651, nrcpt=1 (queue active)
Dec 17 17:47:29 mpnagios postfix/local[21398]: 0B21F2A03A0: to=<nagios@localhost.localdomain>, orig_to=<nagios@localhost>, relay=local, delay=0.05, delays=0.04/0.01/0/0, dsn=2.0.0, status=sent (delivered to mailbox)
Dec 17 17:47:29 mpnagios postfix/qmgr[1357]: 0B21F2A03A0: removed
Dec 17 17:57:20 mpnagios postfix/pickup[6722]: 17D172A03A1: uid=500 from=<nagios>
Dec 17 17:57:20 mpnagios postfix/cleanup[23350]: 17D172A03A1: message-id=<20131217165720.17D172A03A1@mpnagios>
Dec 17 17:57:20 mpnagios postfix/qmgr[1357]: 17D172A03A1: from=<nagios@makpetrol.com.mk>, size=691, nrcpt=1 (queue active)
Dec 17 17:57:20 mpnagios postfix/local[23352]: 17D172A03A1: to=<nagios@localhost.localdomain>, orig_to=<nagios@localhost>, relay=local, delay=0.05, delays=0.04/0.01/0/0, dsn=2.0.0, status=sent (delivered to mailbox)
Dec 17 17:57:20 mpnagios postfix/qmgr[1357]: 17D172A03A1: removed
Dec 17 18:03:32 mpnagios postfix/pickup[6722]: 1711F2A03A1: uid=500 from=<nagios>
Dec 17 18:03:32 mpnagios postfix/cleanup[24679]: 1711F2A03A1: message-id=<20131217170332.1711F2A03A1@mpnagios>
Dec 17 18:03:32 mpnagios postfix/qmgr[1357]: 1711F2A03A1: from=<nagios@makpetrol.com.mk>, size=659, nrcpt=1 (queue active)
Dec 17 18:03:32 mpnagios postfix/local[24681]: 1711F2A03A1: to=<nagios@localhost.localdomain>, orig_to=<nagios@localhost>, relay=local, delay=0.16, delays=0.11/0.04/0/0.01, dsn=2.0.0, status=sent (delivered to mailbox)
Dec 17 18:03:32 mpnagios postfix/qmgr[1357]: 1711F2A03A1: removed
Dec 17 18:04:19 mpnagios postfix/pickup[6722]: DE3772A03A1: uid=500 from=<nagios>
Dec 17 18:04:19 mpnagios postfix/cleanup[24679]: DE3772A03A1: message-id=<20131217170419.DE3772A03A1@mpnagios>
Dec 17 18:04:19 mpnagios postfix/qmgr[1357]: DE3772A03A1: from=<nagios@makpetrol.com.mk>, size=704, nrcpt=1 (queue active)
Dec 17 18:04:19 mpnagios postfix/local[24681]: DE3772A03A1: to=<nagios@localhost.localdomain>, orig_to=<nagios@localhost>, relay=local, delay=0.03, delays=0.02/0/0/0, dsn=2.0.0, status=sent (delivered to mailbox)
Dec 17 18:04:19 mpnagios postfix/qmgr[1357]: DE3772A03A1: removed
[root@mpnagios objects]#

abrist wrote:your change to check_dummy will always report the router as up (though I presume you are aware of this)

No, I didn't realize this (I wasn't paying attention to that). So I guess this is the problem, or is there another reason this is happening?

slansing · Post by **slansing** » Tue Dec 17, 2013 12:18 pm

It looks like mail is being routed off your Nagios server just fine. If the routers in question are the ones set up with dummy checks, then you will not be notified, as abrist stated, since they will always be in an UP state.
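To make that reasoning concrete, here is a toy Python model (not Nagios code) of why a host whose check always returns 0 never triggers a notification. The function names and the simplified return-code mapping are assumptions made for illustration. Real Nagios also applies soft/hard states, max_check_attempts, and reachability logic, all of which this sketch ignores.

```python
def host_state(return_code):
    """Simplified mapping assumed for this sketch: 0 means UP, anything else DOWN."""
    return "UP" if return_code == 0 else "DOWN"


def notifications(return_codes, options=("d", "r")):
    """List the notifications a sequence of check results would produce.

    'd' enables DOWN notifications and 'r' enables recovery notifications,
    mirroring the 'notification_options d,r' style of setting.
    """
    events = []
    previous = "UP"  # assume the host starts out UP
    for return_code in return_codes:
        current = host_state(return_code)
        if current == "DOWN" and previous == "UP" and "d" in options:
            events.append("DOWN")
        elif current == "UP" and previous == "DOWN" and "r" in options:
            events.append("RECOVERY")
        previous = current
    return events


if __name__ == "__main__":
    # check_dummy!0 always exits with 0, so the state never changes.
    print(notifications([0, 0, 0, 0, 0]))
    # A real check that fails and later recovers.
    print(notifications([0, 2, 2, 0]))
```

With a constant return code of 0 the list stays empty, which matches the symptom in this thread: the mail system works, but there is never a state change to report.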

MPIvan · Post by **MPIvan** » Tue Dec 17, 2013 12:25 pm

Well, yes. Here is what I have for now in my template and router cfg file:

define host{
use bp-rt
host_name PE002
alias Router002
display_name Router PE002
address 172.10.20.1
_SNMPCOMMUNITY imnottellingyou:)
contacts Ivan
notes Tel:000000000000
}

define host{
name bp-rt
use generic-host
check_period 24x7
check_interval 5
retry_interval 1
max_check_attempts 10
check_command check-host-alive
notification_options d,r
notification_interval 0
hostgroups router-bp
register 0
icon_image cisco.png
statusmap_image my_router.png
check_command check_dummy!0
parents localhost
# 2d_coords 120,270
# 3d_coords 100.0,50.0,75.0
}
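Note that the bp-rt template above sets check_command twice (check-host-alive and check_dummy!0). A small Python sketch like the following could flag such repeats in a definition block. The function name and the parsing rules are assumptions for illustration, not part of Nagios. Which of two repeated directives Nagios honours is not confirmed here, although the later outcome in this thread (notifications returned once check_dummy!0 was removed) suggests the later one took effect.

```python
TEMPLATE = """\
define host{
name bp-rt
use generic-host
check_period 24x7
check_interval 5
retry_interval 1
max_check_attempts 10
check_command check-host-alive
notification_options d,r
notification_interval 0
hostgroups router-bp
register 0
icon_image cisco.png
statusmap_image my_router.png
check_command check_dummy!0
parents localhost
# 2d_coords 120,270
# 3d_coords 100.0,50.0,75.0
}
"""


def duplicate_directives(definition):
    """Return {directive: [values]} for directives that occur more than once."""
    seen = {}
    for raw in definition.splitlines():
        # ';' starts an inline comment in Nagios object files.
        line = raw.split(";", 1)[0].strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("define") or line == "}":
            continue
        parts = line.split(None, 1)
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        seen.setdefault(name, []).append(value)
    return {name: values for name, values in seen.items() if len(values) > 1}


if __name__ == "__main__":
    for name, values in duplicate_directives(TEMPLATE).items():
        print(name, "is set", len(values), "times:", values)
```

The sketch handles one definition block at a time; a real config file would need to be split into blocks first.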

slansing · Post by **slansing** » Tue Dec 17, 2013 2:13 pm

Okay, this host actually has a valid check attached to it that will cause it to change states. Are you able to look at the state history report and narrow down the time that the alert should have been sent, and then look in your maillog archives to see if an alert was in fact sent out?

MPIvan · Post by **MPIvan** » Wed Dec 18, 2013 8:07 am

I guess this was the problem. I removed the "check_command check_dummy!0" line and now it is OK.

HOST NOTIFICATION: Ivan;Router PE025;DOWN;notify-host-by-email;CRITICAL - Time to live exceeded (172.10.25.1)

I would like to know what these wproc log messages that I am getting mean:

wproc: Core Worker 12003: job 40879 (pid=4381) timed out. Killing it
[1387371888] wproc: CHECK job 40879 from worker Core Worker 12003 timed out after 30.01s
[1387371888] wproc: command: /usr/local/nagios/libexec/check_ping -H 172.10.6.224 -w 3000.0,80% -c 5000.0,100% -p 5
[1387371888] wproc: host=Router PE248; service=(null);
[1387371888] wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1387371888] Warning: Check of host 'Router PE248' timed out after 30.01 seconds
[1387371888] wproc: Core Worker 12003: tv.tv_sec is currently 1387371888
[1387371888] wproc: Core Worker 12003: Failed to reap child with pid 4381. Next attempt @ 1387371893.935136
[1387371893] wproc: Core Worker 12003: job 40879 (pid=4381): Dormant child reaped
[1387372488] wproc: Core Worker 12004: job 41250 (pid=6686) timed out. Killing it
[1387372488] wproc: CHECK job 41250 from worker Core Worker 12004 timed out after 30.01s
[1387372488] wproc: command: /usr/local/nagios/libexec/check_ping -H 172.10.6.224 -w 3000.0,80% -c 5000.0,100% -p 5
[1387372488] wproc: host=Router PE248; service=(null);
[1387372488] wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1387372488] Warning: Check of host 'Router PE248' timed out after 30.01 seconds
[1387372488] wproc: Core Worker 12004: tv.tv_sec is currently 1387372488
[1387372488] wproc: Core Worker 12004: Failed to reap child with pid 6686. Next attempt @ 1387372493.895019
[1387372493] wproc: Core Worker 12004: job 41250 (pid=6686): Dormant child reaped
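The bracketed numbers in these lines are Unix epoch seconds. A minimal Python sketch for turning them into readable UTC times follows. The regular expression and function names are illustrative assumptions, and the conversion deliberately uses UTC rather than guessing the server's time zone.

```python
import re
from datetime import datetime, timezone

# Matches lines such as "[1387371888] wproc: ..." (epoch seconds in brackets).
LINE_RE = re.compile(r"^\[(\d+)\]\s+(.*)$")


def parse_line(line):
    """Return (UTC datetime, message) for an epoch-stamped line, else None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
    return stamp, match.group(2)


def seconds_between(first_line, second_line):
    """Seconds elapsed between two epoch-stamped log lines."""
    first, second = parse_line(first_line), parse_line(second_line)
    if first is None or second is None:
        raise ValueError("both lines need a leading [epoch] timestamp")
    return int((second[0] - first[0]).total_seconds())


if __name__ == "__main__":
    a = "[1387371888] Warning: Check of host 'Router PE248' timed out after 30.01 seconds"
    b = "[1387372488] Warning: Check of host 'Router PE248' timed out after 30.01 seconds"
    print(parse_line(a)[0].isoformat())
    print(seconds_between(a, b), "seconds between the two timeouts")
```

Comparing the converted times against the system clock is one way to see whether the log timestamps still line up after reverting a snapshot.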

abrist · Post by **abrist** » Wed Dec 18, 2013 11:25 am

MPIvan wrote:Warning: Check of host 'Router PE248' timed out after 30.01 seconds

MPIvan wrote:CHECK job 41250 from worker Core Worker 12004 timed out after 30.01s

Looks like this particular check is timing out. Either the router cannot be checked by ICMP, the IP address is wrong for the router, or you are experiencing a lot of load/io wait and the check is not completing before the 30 sec timeout.
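One way to tell these causes apart is to run the same check by hand and time it. The following Python sketch wraps a command in a timer with the same 30 second limit. The helper name time_command is an assumption for illustration, and the check_ping command line is copied from the log above. The built-in demo only runs a harmless Python one-liner, since the plugin path may not exist on the machine where this is tried.

```python
import subprocess
import sys
import time

# Command line copied from the wproc log in this thread (not executed here).
CHECK_PING = [
    "/usr/local/nagios/libexec/check_ping",
    "-H", "172.10.6.224",
    "-w", "3000.0,80%",
    "-c", "5000.0,100%",
    "-p", "5",
]


def time_command(argv, timeout_s=30.0):
    """Run argv and return (return_code, elapsed_seconds, first_output_line).

    return_code is None if the command did not finish within timeout_s.
    Raises FileNotFoundError if the executable does not exist.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None, time.monotonic() - start, ""
    elapsed = time.monotonic() - start
    lines = proc.stdout.strip().splitlines()
    return proc.returncode, elapsed, lines[0] if lines else ""


if __name__ == "__main__":
    # Harmless demonstration; on the Nagios server, pass CHECK_PING instead.
    print(time_command([sys.executable, "-c", "print('ok')"]))
```

If the manual run finishes quickly while Nagios still reports timeouts, load on the server becomes the more likely suspect. If it also hangs, the address or ICMP reachability is worth checking first.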

MPIvan · Post by **MPIvan** » Thu Dec 19, 2013 9:36 am

I have the following iptables rules:

[root@mpnagios /]# cat /etc/sysconfig/iptables
# Firewall configuration written by system-config-firewall
# Manual customization of this file is not recommended.
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT
-A INPUT -m state --state NEW -m udp -p udp --dport 161:162 -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
COMMIT
[root@mpnagios /]#

And you said "you are experiencing a lot of load/io wait and the check is not completing before the 30 sec timeout." How can I check this?
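As a rough sketch of how load and I/O wait could be inspected on a Linux host, the Python below reads the load average and estimates the I/O wait share from two samples of the first line of /proc/stat. The function names are assumptions for illustration, and the calculation is an approximation that only uses the first eight CPU counters (user through steal).

```python
import os


def load_per_cpu():
    """1, 5 and 15 minute load averages divided by the number of CPUs."""
    cpus = os.cpu_count() or 1
    return tuple(avg / cpus for avg in os.getloadavg())


def cpu_times(stat_line):
    """Parse the aggregate 'cpu' line of /proc/stat into its first 8 counters."""
    fields = stat_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in fields[1:9]]


def iowait_fraction(before, after):
    """Share of CPU time spent in iowait between two 'cpu' line samples."""
    deltas = [b - a for a, b in zip(cpu_times(before), cpu_times(after))]
    total = sum(deltas)
    # Index 4 is the iowait counter (user, nice, system, idle, iowait, ...).
    return deltas[4] / total if total > 0 else 0.0


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line (Linux only)."""
    with open(path) as handle:
        return handle.readline()
```

To use it, call read_cpu_line(), wait a few seconds, call it again, and pass both lines to iowait_fraction(). A per-CPU load well above 1.0, or a large iowait share while checks are timing out, would point toward the load explanation.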

I ask because I still have the problem: yesterday I removed the dummy host check, and today it is the same. Since I have been reverting the virtual machine to previous snapshot states, could that somehow confuse Nagios, for example its logs, its sense of time, or other things?

I should mention that I was recently changing the names of the routers, and I restarted Nagios after the changes. I didn't change them all at the same time, though (10 to 20 routers per change).

sreinhardt · Post by **sreinhardt** » Thu Dec 19, 2013 4:12 pm

Are the IPs of those routers changing, or just the display name/host name? Could you post the check definitions that you have changed them to?

Nagios Support Forum