mail problem

MPIvan
Posts: 213
Joined: Thu Nov 22, 2012 6:09 am

mail problem

Post by MPIvan »

Hi,

As I have mentioned many times, I'm using Nagios 4.0.2 on CentOS, on a virtual machine managed through VMware vCenter, and I take a snapshot before making any changes. Nagios was working fine and sending mail as it should. After I tried something that didn't work, I reverted the machine to its previous state using the VMware snapshot manager. I have done this many times, and after reverting, the machine has always worked fine (and Nagios with it). This time, however, I have stopped receiving mail for one type of host, the routers (a type, not a group; I do use groups, but not for contacts). I still receive mail for other hosts, but not for these, and I can't see what the problem is. Any suggestions?

I have also noticed this in the log file:

[12-17-2013 14:40:03] wproc: Core Worker 12150: job 54509 (pid=16002): Dormant child reaped
[12-17-2013 14:39:58] wproc: Core Worker 12150: Failed to reap child with pid 16002. Next attempt @ 1387287603.462683
[12-17-2013 14:39:58] wproc: Core Worker 12150: tv.tv_sec is currently 1387287598
[12-17-2013 14:39:58] Warning: Check of host 'Anten05' timed out after 30.01 seconds
[12-17-2013 14:39:58] wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[12-17-2013 14:39:58] wproc: host=Anten05; service=(null);
[12-17-2013 14:39:58] wproc: command: /usr/local/nagios/libexec/check_ping -H 192.168.1.101 -w 3000.0,80% -c 5000.0,100% -p 5
[12-17-2013 14:39:58] wproc: CHECK job 54509 from worker Core Worker 12150 timed out after 30.01s
[12-17-2013 14:39:58] wproc: Core Worker 12150: job 54509 (pid=16002) timed out. Killing it
I should also say that the last modification I made to the routers was:
check_command check_dummy!0
parents localhost
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: mail problem

Post by abrist »

You have some checks that are timing out. Additionally, your change to check_dummy will always report the router as up (though I presume you are aware of this). Nothing in your post helps with the mail troubleshooting. Could you post a tail of your maillog:

tail -25 /var/log/maillog
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
MPIvan
Posts: 213
Joined: Thu Nov 22, 2012 6:09 am

Re: mail problem

Post by MPIvan »

Here it is
[root@mpnagios objects]# tail -25 /var/log/maillog
Dec 17 17:33:26 mpnagios postfix/pickup[6722]: DCEAC2A03A0: uid=500 from=<nagios>
Dec 17 17:33:26 mpnagios postfix/cleanup[18662]: DCEAC2A03A0: message-id=<20131217163326.DCEAC2A03A0@mpnagios>
Dec 17 17:33:26 mpnagios postfix/qmgr[1357]: DCEAC2A03A0: from=<nagios@makpetrol.com.mk>, size=659, nrcpt=1 (queue active)
Dec 17 17:33:26 mpnagios postfix/local[18664]: DCEAC2A03A0: to=<nagios@localhost.localdomain>, orig_to=<nagios@localhost>, relay=local, delay=0.08, delays=0.06/0.02/0/0.01, dsn=2.0.0, status=sent (delivered to mailbox)
Dec 17 17:33:26 mpnagios postfix/qmgr[1357]: DCEAC2A03A0: removed
Dec 17 17:47:29 mpnagios postfix/pickup[6722]: 0B21F2A03A0: uid=500 from=<nagios>
Dec 17 17:47:29 mpnagios postfix/cleanup[21396]: 0B21F2A03A0: message-id=<20131217164729.0B21F2A03A0@mpnagios>
Dec 17 17:47:29 mpnagios postfix/qmgr[1357]: 0B21F2A03A0: from=<nagios@makpetrol.com.mk>, size=651, nrcpt=1 (queue active)
Dec 17 17:47:29 mpnagios postfix/local[21398]: 0B21F2A03A0: to=<nagios@localhost.localdomain>, orig_to=<nagios@localhost>, relay=local, delay=0.05, delays=0.04/0.01/0/0, dsn=2.0.0, status=sent (delivered to mailbox)
Dec 17 17:47:29 mpnagios postfix/qmgr[1357]: 0B21F2A03A0: removed
Dec 17 17:57:20 mpnagios postfix/pickup[6722]: 17D172A03A1: uid=500 from=<nagios>
Dec 17 17:57:20 mpnagios postfix/cleanup[23350]: 17D172A03A1: message-id=<20131217165720.17D172A03A1@mpnagios>
Dec 17 17:57:20 mpnagios postfix/qmgr[1357]: 17D172A03A1: from=<nagios@makpetrol.com.mk>, size=691, nrcpt=1 (queue active)
Dec 17 17:57:20 mpnagios postfix/local[23352]: 17D172A03A1: to=<nagios@localhost.localdomain>, orig_to=<nagios@localhost>, relay=local, delay=0.05, delays=0.04/0.01/0/0, dsn=2.0.0, status=sent (delivered to mailbox)
Dec 17 17:57:20 mpnagios postfix/qmgr[1357]: 17D172A03A1: removed
Dec 17 18:03:32 mpnagios postfix/pickup[6722]: 1711F2A03A1: uid=500 from=<nagios>
Dec 17 18:03:32 mpnagios postfix/cleanup[24679]: 1711F2A03A1: message-id=<20131217170332.1711F2A03A1@mpnagios>
Dec 17 18:03:32 mpnagios postfix/qmgr[1357]: 1711F2A03A1: from=<nagios@makpetrol.com.mk>, size=659, nrcpt=1 (queue active)
Dec 17 18:03:32 mpnagios postfix/local[24681]: 1711F2A03A1: to=<nagios@localhost.localdomain>, orig_to=<nagios@localhost>, relay=local, delay=0.16, delays=0.11/0.04/0/0.01, dsn=2.0.0, status=sent (delivered to mailbox)
Dec 17 18:03:32 mpnagios postfix/qmgr[1357]: 1711F2A03A1: removed
Dec 17 18:04:19 mpnagios postfix/pickup[6722]: DE3772A03A1: uid=500 from=<nagios>
Dec 17 18:04:19 mpnagios postfix/cleanup[24679]: DE3772A03A1: message-id=<20131217170419.DE3772A03A1@mpnagios>
Dec 17 18:04:19 mpnagios postfix/qmgr[1357]: DE3772A03A1: from=<nagios@makpetrol.com.mk>, size=704, nrcpt=1 (queue active)
Dec 17 18:04:19 mpnagios postfix/local[24681]: DE3772A03A1: to=<nagios@localhost.localdomain>, orig_to=<nagios@localhost>, relay=local, delay=0.03, delays=0.02/0/0/0, dsn=2.0.0, status=sent (delivered to mailbox)
Dec 17 18:04:19 mpnagios postfix/qmgr[1357]: DE3772A03A1: removed
[root@mpnagios objects]#
abrist wrote:your change to check_dummy will always report the router as up (though I presume you are aware of this)
No, I didn't catch that :) (I wasn't paying attention to it). So I guess this is the problem, or is there another reason this is happening?
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: mail problem

Post by slansing »

It looks like mail is being routed off your Nagios server just fine. If the routers in question are the ones set up with dummy checks, then you will not be notified, as abrist stated, since they will always be in an UP state.
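A minimal sketch for reading the maillog excerpt above: it pulls the recipient, relay, and status out of each Postfix delivery line, which makes it easier to see where each message actually went. The function name `summarize_deliveries` and the regular expression are illustrative assumptions, not part of Postfix or Nagios, and the pattern is only fitted to the line format shown in this thread.

```python
import re

# Fitted to delivery lines like the postfix/local entries in the excerpt above.
DELIVERY = re.compile(
    r"postfix/(?P<agent>\w+)\[\d+\]: (?P<queue_id>[0-9A-F]+): "
    r"to=<(?P<to>[^>]*)>.*?relay=(?P<relay>[^,]+),.*?status=(?P<status>\w+)"
)


def summarize_deliveries(lines):
    """Return (recipient, relay, status) for every delivery line found."""
    results = []
    for line in lines:
        match = DELIVERY.search(line)
        if match:
            results.append((match["to"], match["relay"], match["status"]))
    return results


# Two lines copied from the maillog posted earlier in the thread.
sample = [
    "Dec 17 17:33:26 mpnagios postfix/qmgr[1357]: DCEAC2A03A0: "
    "from=<nagios@makpetrol.com.mk>, size=659, nrcpt=1 (queue active)",
    "Dec 17 17:33:26 mpnagios postfix/local[18664]: DCEAC2A03A0: "
    "to=<nagios@localhost.localdomain>, orig_to=<nagios@localhost>, "
    "relay=local, delay=0.08, delays=0.06/0.02/0/0.01, dsn=2.0.0, "
    "status=sent (delivered to mailbox)",
]

for recipient, relay, status in summarize_deliveries(sample):
    print(recipient, relay, status)
```

Run against the full excerpt, every delivery it finds has the recipient nagios@localhost.localdomain and relay=local, so those particular messages were delivered to the local mailbox on the Nagios server.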
MPIvan
Posts: 213
Joined: Thu Nov 22, 2012 6:09 am

Re: mail problem

Post by MPIvan »

Well, yes. Here is what I have for now in my template and router cfg file:
define host{
    use                     bp-rt
    host_name               PE002
    alias                   Router002
    display_name            Router PE002
    address                 172.10.20.1
    _SNMPCOMMUNITY          imnottellingyou:)
    contacts                Ivan
    notes                   Tel:000000000000
}


define host{
    name                    bp-rt
    use                     generic-host
    check_period            24x7
    check_interval          5
    retry_interval          1
    max_check_attempts      10
    check_command           check-host-alive
    notification_options    d,r
    notification_interval   0
    hostgroups              router-bp
    register                0
    icon_image              cisco.png
    statusmap_image         my_router.png
    check_command           check_dummy!0
    parents                 localhost
#   2d_coords               120,270
#   3d_coords               100.0,50.0,75.0
}
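The bp-rt template above sets check_command twice, once to check-host-alive and once to check_dummy!0. A small sketch like the following could flag that kind of repetition in an object file. The helper `find_duplicate_directives` is hypothetical and not part of Nagios, and it assumes simple `define type{ ... }` blocks with one directive per line and no `}` inside values.

```python
import re


def find_duplicate_directives(config_text):
    """Map each define block's name to the directives it sets more than once."""
    duplicates = {}
    # Each block is assumed to look like: define host{ ... }
    for block in re.finditer(r"define\s+(\w+)\s*\{(.*?)\}", config_text, re.S):
        seen = {}
        for raw in block.group(2).splitlines():
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and commented-out directives
            parts = line.split(None, 1)
            value = parts[1].strip() if len(parts) > 1 else ""
            seen.setdefault(parts[0], []).append(value)
        repeated = {name: vals for name, vals in seen.items() if len(vals) > 1}
        if repeated:
            label = (seen.get("host_name") or seen.get("name") or ["?"])[0]
            duplicates[label] = repeated
    return duplicates


# Shortened copy of the bp-rt template from this post.
template = """
define host{
name bp-rt
use generic-host
check_command check-host-alive
notification_options d,r
check_command check_dummy!0
parents localhost
}
"""

print(find_duplicate_directives(template))
```

For the shortened template this reports check_command under bp-rt with both values, which is the line to review when a host never leaves the UP state.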
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: mail problem

Post by slansing »

Okay, this host actually has a valid check attached to it that will cause it to change states. Are you able to look at the state history report, narrow down the time the alert should have been sent, and then look in your maillog archives to see whether an alert was in fact sent out?
MPIvan
Posts: 213
Joined: Thu Nov 22, 2012 6:09 am

Re: mail problem

Post by MPIvan »

I guess this was the problem. I removed the "check_command check_dummy!0" line and now it is OK:
HOST NOTIFICATION: Ivan;Router PE025;DOWN;notify-host-by-email;CRITICAL - Time to live exceeded (172.10.25.1)
I would also like to know what these wproc log messages are:
wproc: Core Worker 12003: job 40879 (pid=4381) timed out. Killing it
[1387371888] wproc: CHECK job 40879 from worker Core Worker 12003 timed out after 30.01s
[1387371888] wproc: command: /usr/local/nagios/libexec/check_ping -H 172.10.6.224 -w 3000.0,80% -c 5000.0,100% -p 5
[1387371888] wproc: host=Router PE248; service=(null);
[1387371888] wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1387371888] Warning: Check of host 'Router PE248' timed out after 30.01 seconds
[1387371888] wproc: Core Worker 12003: tv.tv_sec is currently 1387371888
[1387371888] wproc: Core Worker 12003: Failed to reap child with pid 4381. Next attempt @ 1387371893.935136
[1387371893] wproc: Core Worker 12003: job 40879 (pid=4381): Dormant child reaped
[1387372488] wproc: Core Worker 12004: job 41250 (pid=6686) timed out. Killing it
[1387372488] wproc: CHECK job 41250 from worker Core Worker 12004 timed out after 30.01s
[1387372488] wproc: command: /usr/local/nagios/libexec/check_ping -H 172.10.6.224 -w 3000.0,80% -c 5000.0,100% -p 5
[1387372488] wproc: host=Router PE248; service=(null);
[1387372488] wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1387372488] Warning: Check of host 'Router PE248' timed out after 30.01 seconds
[1387372488] wproc: Core Worker 12004: tv.tv_sec is currently 1387372488
[1387372488] wproc: Core Worker 12004: Failed to reap child with pid 6686. Next attempt @ 1387372493.895019
[1387372493] wproc: Core Worker 12004: job 41250 (pid=6686): Dormant child reaped
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: mail problem

Post by abrist »

MPIvan wrote:Warning: Check of host 'Router PE248' timed out after 30.01 seconds
MPIvan wrote:CHECK job 41250 from worker Core Worker 12004 timed out after 30.01s
Looks like this particular check is timing out. Either the router cannot be checked by ICMP, the IP address is wrong for the router, or you are experiencing a lot of load/IO wait and the check is not completing before the 30-second timeout.
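One rough way to narrow these possibilities down is to count timeouts per host in the Nagios log. This is a sketch only, and `timeouts_per_host` is an illustrative name. If the timeouts cluster on one or two hosts, that may point to those hosts (ICMP blocked or a wrong address); if they are spread over many hosts, load on the Nagios server becomes a more likely suspect.

```python
import re
from collections import Counter

# Matches lines such as:
#   [1387371888] Warning: Check of host 'Router PE248' timed out after 30.01 seconds
TIMEOUT = re.compile(r"Warning: Check of host '(?P<host>[^']+)' timed out")


def timeouts_per_host(log_lines):
    """Count host check timeout warnings per host name."""
    counts = Counter()
    for line in log_lines:
        match = TIMEOUT.search(line)
        if match:
            counts[match["host"]] += 1
    return counts


# Lines copied from the log excerpts posted in this thread.
sample = [
    "[12-17-2013 14:39:58] Warning: Check of host 'Anten05' timed out after 30.01 seconds",
    "[1387371888] Warning: Check of host 'Router PE248' timed out after 30.01 seconds",
    "[1387371888] wproc: host=Router PE248; service=(null);",
    "[1387372488] Warning: Check of host 'Router PE248' timed out after 30.01 seconds",
]

for host, count in timeouts_per_host(sample).most_common():
    print(host, count)
```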
MPIvan
Posts: 213
Joined: Thu Nov 22, 2012 6:09 am

Re: mail problem

Post by MPIvan »

I have the following iptables rules:

[root@mpnagios /]# cat /etc/sysconfig/iptables
# Firewall configuration written by system-config-firewall
# Manual customization of this file is not recommended.
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT
-A INPUT -m state --state NEW -m udp -p udp --dport 161:162 -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
COMMIT
[root@mpnagios /]#
You said "you are experiencing a lot of load/io wait and the check is not completing before the 30 sec timeout." How can I check this?

I still have the problem: yesterday I removed the dummy host check, and today it is the same. Since I reverted the virtual machine to a previous snapshot, could that have somehow confused Nagios, its logs, the time, or anything else?

I should mention that I was renaming the routers recently, and I restarted Nagios after the changes. I didn't change them all at the same time, though (10 to 20 routers per change).
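On the question of how to check for load, a minimal sketch assuming a Linux host with Python available: it compares the 1-minute load average with the number of CPUs. The helper `load_per_cpu` is a made-up name, and treating a ratio that stays well above 1 as "high load" is only a common rule of thumb. The sketch does not measure I/O wait separately.

```python
import os


def load_per_cpu(load_1min, cpu_count):
    """Return the 1-minute load average divided by the number of CPUs."""
    return load_1min / max(cpu_count, 1)


try:
    # os.getloadavg() returns the 1, 5 and 15 minute load averages.
    load_1min, load_5min, load_15min = os.getloadavg()
except OSError:
    print("load average is not available on this system")
else:
    cpus = os.cpu_count() or 1
    ratio = load_per_cpu(load_1min, cpus)
    print(f"1-min load {load_1min:.2f} on {cpus} CPU(s): {ratio:.2f} per CPU")
```

Running it a few times around the moments when checks time out gives a first impression of whether the server is busy at those times.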
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: mail problem

Post by sreinhardt »

Are the IPs of those routers changing, or just the display name\host name? Could you post the check definitions that you have changed to?
Nagios-Plugins maintainer exclusively, unless you have other C language bugs with open-source nagios projects, then I am happy to help! Please pm or use other communication to alert me to issues as I no longer track the forum.