Monitoring engine won't start after adding hosts to hostgrou

daveinvb · Post by **daveinvb** » Wed Jun 24, 2015 10:12 am

I added approximately 260 hosts to a host group, and now the Monitoring Engine won't start. I verified the config files through the CCM and got only warnings, no errors. I know this has previously been caused by a bad configuration after a bulk add, but this time everything was added through the GUI. Where can I find what's preventing it from starting?

daveinvb · Post by **daveinvb** » Wed Jun 24, 2015 10:24 am

I deactivated the Host Group, applied configuration changes, activated it, applied configuration changes, and now the engine has started.

Post by **tgriep** » Wed Jun 24, 2015 10:29 am

If the issue is resolved, can we close this post?

daveinvb · Post by **daveinvb** » Wed Jun 24, 2015 12:50 pm

The engine actually stopped again without making any changes.
Is there anywhere I can find out what is causing it to stop?

abrist · Post by **abrist** » Wed Jun 24, 2015 12:57 pm

I would start by looking at nagios.log and the system messages:

Code: Select all

tail -50 /var/log/messages
tail -50 /usr/local/nagios/var/nagios.log

Also, do you use mod_gearman or mk_livestatus?
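
On a busy system the last 50 lines of either log are mostly routine check results, so the reason the engine stopped can scroll out of view. A minimal sketch of a filter that drops the alert and notification lines and keeps lines that might relate to the engine stopping is below. The function name and the match patterns are assumptions chosen for illustration, not an official list of Nagios messages, so adjust them to what your log actually contains.

```shell
#!/bin/sh
# Hypothetical helper: read a Nagios log on stdin, drop routine
# check-result lines, and keep lines that hint at the engine stopping.
# The patterns below are guesses, not a complete list of engine messages.
filter_engine_events() {
    grep -Ev '(HOST|SERVICE) (ALERT|NOTIFICATION|FLAPPING ALERT):' \
        | grep -Ei 'error|shutting down|shutdown|caught sig|segfault|killed'
}

# Path taken from the tail command above; skipped if the file is not readable.
LOG=/usr/local/nagios/var/nagios.log
if [ -r "$LOG" ]; then
    filter_engine_events < "$LOG" | tail -50
fi
```

The same function can be pointed at /var/log/messages. If it prints nothing, the engine may have exited without logging, and the system log around the time it stopped is the next place to look.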

daveinvb · Post by **daveinvb** » Wed Jun 24, 2015 1:12 pm

I do not use those. I am unfamiliar with them; is either something you'd suggest? Currently I just use the System Status icons in the top right corner of XI.

abrist · Post by **abrist** » Wed Jun 24, 2015 1:14 pm

Could you post the tails (in code wraps) requested in my previous post?

daveinvb · Post by **daveinvb** » Wed Jun 24, 2015 1:19 pm

It is currently running, but here are the logs.

Code: Select all

Jun 24 14:17:00 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:00 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...
Jun 24 14:17:00 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:00 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...
Jun 24 14:17:00 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:00 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...
Jun 24 14:17:00 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:00 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...
Jun 24 14:17:00 ip-10-222-2-32 nagios: SERVICE NOTIFICATION: serverteam;FFIPWPA1;Memory Usage;CRITICAL;xi_service_notification_handler;connect to address 10.40.203.22 and port 12489: Connection refused
Jun 24 14:17:00 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:00 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...
Jun 24 14:17:00 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:00 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...
Jun 24 14:17:00 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:00 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...
Jun 24 14:17:00 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:00 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...
Jun 24 14:17:00 ip-10-222-2-32 nagios: HOST ALERT: 172.16.20.20;DOWN;SOFT;1;CRITICAL - 172.16.20.20: Time to live exceeded in transit @ 172.16.19.157. rta nan, lost 100%
Jun 24 14:17:00 ip-10-222-2-32 nagios: SERVICE ALERT: FILESFENG01;Uptime;WARNING;SOFT;2;could not fetch information from server
Jun 24 14:17:00 ip-10-222-2-32 nagios: SERVICE ALERT: PTX-SFIDC02;Memory Usage;UNKNOWN;HARD;5;could not fetch information from server
Jun 24 14:17:01 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:01 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...
Jun 24 14:17:01 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:01 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...
Jun 24 14:17:01 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:01 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...
Jun 24 14:17:01 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:01 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...
Jun 24 14:17:01 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:01 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...
Jun 24 14:17:01 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:01 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...
Jun 24 14:17:01 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:01 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...
Jun 24 14:17:01 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:01 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...
Jun 24 14:17:01 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:01 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...
Jun 24 14:17:01 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:01 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...
Jun 24 14:17:02 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:02 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...
Jun 24 14:17:02 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:02 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...
Jun 24 14:17:02 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:02 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...
Jun 24 14:17:02 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:02 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...
Jun 24 14:17:02 ip-10-222-2-32 ndo2db: Message sent to queue.
Jun 24 14:17:02 ip-10-222-2-32 ndo2db: Warning: queue send error, retrying...

Code: Select all

[1435169758] SERVICE NOTIFICATION: serverteam;FFIACMILPA1;CPU Usage;CRITICAL;xi_service_notification_handler;connect to address 10.30.10.22 and port 12489: Connection refused
[1435169758] SERVICE ALERT: SFPACCLISQ;Memory Usage;WARNING;HARD;5;could not fetch information from server
[1435169758] SERVICE NOTIFICATION: serverteam;SFPACCLISQ;Memory Usage;WARNING;xi_service_notification_handler;could not fetch information from server
[1435169758] SERVICE ALERT: PTX-CLINTON02;Uptime;WARNING;SOFT;2;could not fetch information from server
[1435169758] SERVICE NOTIFICATION: serverteam;FFIDCMARPA2;Drive C: Disk Usage;CRITICAL;xi_service_notification_handler;connect to address 10.51.253.50 and port 12489: Connection refused
[1435169761] HOST ALERT: 172.16.20.38;DOWN;SOFT;1;CRITICAL - 172.16.20.38: Time to live exceeded in transit @ 172.16.19.157. rta nan, lost 100%
[1435169771] SERVICE ALERT: DFSCLINTON01;CPU Usage;WARNING;SOFT;1;could not fetch information from server
[1435169771] SERVICE ALERT: PTX-SFIDC02;Uptime;WARNING;SOFT;2;could not fetch information from server
[1435169775] SERVICE ALERT: USSNFSPACPW01;Drive C: Disk Usage;UNKNOWN;HARD;5;Free disk space : Invalid drive
[1435169775] SERVICE ALERT: SFPACSNFSQ;CPU Usage;WARNING;SOFT;1;could not fetch information from server
[1435169777] HOST ALERT: 172.16.127.2;DOWN;SOFT;1;CRITICAL - 172.16.127.2: rta nan, lost 100%
[1435169777] HOST ALERT: 172.16.13.215;DOWN;SOFT;1;CRITICAL - 172.16.13.215: rta nan, lost 100%
[1435169777] SERVICE ALERT: SFPACSNFSQ;Uptime;WARNING;SOFT;2;could not fetch information from server
[1435169778] SERVICE NOTIFICATION: serverteam;FFIDCDENPA1;Memory Usage;CRITICAL;xi_service_notification_handler;connect to address 10.10.31.231 and port 12489: Connection refused
[1435169782] SERVICE ALERT: DCSNF01;CPU Usage;WARNING;SOFT;1;could not fetch information from server
[1435169787] SERVICE ALERT: SFPACCLIPA1;Drive C: Disk Usage;UNKNOWN;HARD;5;Free disk space : Invalid drive
[1435169792] HOST ALERT: 172.16.20.38;UP;SOFT;2;OK - 172.16.20.38: rta 31.003ms, lost 0%
[1435169792] HOST FLAPPING ALERT: 172.16.20.38;STARTED; Host appears to have started flapping (22.5% change > 20.0% threshold)
[1435169795] HOST ALERT: 10.35.48.11;DOWN;SOFT;1;CRITICAL - 10.35.48.11: rta nan, lost 100%
[1435169796] HOST ALERT: 10.35.48.11;DOWN;SOFT;1;CRITICAL - 10.35.48.11: rta nan, lost 100%
[1435169797] SERVICE NOTIFICATION: serverteam;FFIACCARPA3;Drive C: Disk Usage;CRITICAL;xi_service_notification_handler;connect to address 10.10.35.118 and port 12489: Connection refused
[1435169797] SERVICE NOTIFICATION: serverteam;FNAAFFITCM11;CPU Usage;CRITICAL;xi_service_notification_handler;connect to address 10.10.17.223 and port 12489: Connection refused
[1435169797] SERVICE NOTIFICATION: serverteam;FFIACLINPA1;Drive C: Disk Usage;CRITICAL;xi_service_notification_handler;connect to address 10.50.253.22 and port 12489: Connection refused
[1435169797] SERVICE NOTIFICATION: serverteam;FFIDCMARPA2;Memory Usage;CRITICAL;xi_service_notification_handler;connect to address 10.51.253.50 and port 12489: Connection refused
[1435169797] SERVICE ALERT: PTX-SFIDC02;Uptime;WARNING;SOFT;2;could not fetch information from server
[1435169799] SERVICE ALERT: NORSPCSS05;Memory Usage;WARNING;HARD;5;could not fetch information from server
[1435169800] SERVICE ALERT: FILESFENG01;Drive C: Disk Usage;WARNING;HARD;5;could not fetch information from server
[1435169805] SERVICE ALERT: SFPACSNFPA2;CPU Usage;OK;SOFT;2;CPU Load 0% (5 min average)
[1435169806] SERVICE ALERT: NORSPCSS05;Uptime;WARNING;SOFT;4;could not fetch information from server
[1435169807] SERVICE ALERT: USSNFSPACPW02;Drive C: Disk Usage;UNKNOWN;HARD;5;Free disk space : Invalid drive
[1435169807] SERVICE ALERT: SFPCECLIPA2;Drive C: Disk Usage;WARNING;HARD;5;could not fetch information from server
[1435169817] SERVICE ALERT: NORSPCSS05;Uptime;OK;SOFT;2;System Uptime - 0 day(s) 0 hour(s) 0 minute(s)
[1435169820] SERVICE NOTIFICATION: serverteam;FFIPWPA1;Memory Usage;CRITICAL;xi_service_notification_handler;connect to address 10.40.203.22 and port 12489: Connection refused
[1435169820] HOST ALERT: 172.16.20.20;DOWN;SOFT;1;CRITICAL - 172.16.20.20: Time to live exceeded in transit @ 172.16.19.157. rta nan, lost 100%
[1435169820] SERVICE ALERT: FILESFENG01;Uptime;WARNING;SOFT;2;could not fetch information from server
[1435169820] SERVICE ALERT: PTX-SFIDC02;Memory Usage;UNKNOWN;HARD;5;could not fetch information from server
[1435169828] Auto-save of retention data completed successfully.
[1435169830] HOST ALERT: 10.35.48.11;UP;HARD;1;OK - 10.35.48.11: rta 126.023ms, lost 0%
[1435169830] HOST NOTIFICATION: Matt Douglas;10.35.48.11;UP;xi_host_notification_handler;OK - 10.35.48.11: rta 126.023ms, lost 0%
[1435169830] HOST NOTIFICATION: Network Team;10.35.48.11;UP;xi_host_notification_handler;OK - 10.35.48.11: rta 126.023ms, lost 0%
[1435169835] SERVICE ALERT: SFPACSNFSQ;CPU Usage;WARNING;SOFT;2;could not fetch information from server
[1435169836] SERVICE ALERT: SFPACSNFSQ;Uptime;OK;SOFT;3;System Uptime - 0 day(s) 0 hour(s) 0 minute(s)
[1435169837] SERVICE NOTIFICATION: serverteam;FFIILPA1;CPU Usage;CRITICAL;xi_service_notification_handler;connect to address 10.10.34.226 and port 12489: Connection refused
[1435169837] SERVICE NOTIFICATION: serverteam;Kansas City Domain Controller;Drive C: Disk Usage;CRITICAL;xi_service_notification_handler;connect to address 10.8.16.42 and port 12489: Connection refused
[1435169837] SERVICE ALERT: SFPCECLIPA1;CPU Usage;WARNING;SOFT;2;could not fetch information from server
[1435169837] SERVICE NOTIFICATION: serverteam;FFIACCDCPA1;Memory Usage;CRITICAL;xi_service_notification_handler;connect to address 10.40.253.32 and port 12489: Connection refused
[1435169837] SERVICE NOTIFICATION: serverteam;FNAAFFITCM06;Memory Usage;CRITICAL;xi_service_notification_handler;connect to address 10.30.253.223 and port 12489: Connection refused
[1435169840] SERVICE ALERT: DCSNF01;CPU Usage;WARNING;SOFT;2;could not fetch information from server
[1435169850] SERVICE ALERT: USSNFSPACPW02;CPU Usage;WARNING;SOFT;1;could not fetch information from server
[1435169850] SERVICE ALERT: DFSCLINTON01;Uptime;OK;SOFT;2;System Uptime - 0 day(s) 0 hour(s) 0 minute(s)

daveinvb · Post by **daveinvb** » Wed Jun 24, 2015 2:09 pm

It has now stopped.

Code: Select all

tail -50 /usr/local/nagios/var/nagios.log
[1435172822] HOST NOTIFICATION: serverteam;FNAFFITCM02;UP;xi_host_notification_handler;OK - 10.43.253.223: rta 149.202ms, lost 0%
[1435172823] HOST ALERT: 172.16.100.247;DOWN;SOFT;1;CRITICAL - 172.16.100.247: rta nan, lost 100%
[1435172823] SERVICE ALERT: RSVIEWK201;Uptime;OK;SOFT;2;System Uptime - 0 day(s) 0 hour(s) 0 minute(s)
[1435172824] HOST ALERT: FNAFFITCM02;UP;HARD;1;OK - 10.43.253.223: rta 165.258ms, lost 0%
[1435172824] HOST NOTIFICATION: Matt Douglas;FNAFFITCM02;UP;xi_host_notification_handler;OK - 10.43.253.223: rta 165.258ms, lost 0%
[1435172824] HOST NOTIFICATION: serverteam;FNAFFITCM02;UP;xi_host_notification_handler;OK - 10.43.253.223: rta 165.258ms, lost 0%
[1435172825] SERVICE ALERT: SFPACSNFSQ;Uptime;WARNING;SOFT;1;could not fetch information from server
[1435172827] SERVICE ALERT: DCSNF01;CPU Usage;WARNING;SOFT;1;could not fetch information from server
[1435172831] SERVICE FLAPPING ALERT: SFPACCLIPA2;Drive C: Disk Usage;STOPPED; Service appears to have stopped flapping (3.8% change < 5.0% threshold)
[1435172831] SERVICE ALERT: SFPACSNFSQ;Drive C: Disk Usage;UNKNOWN;HARD;5;Free disk space : Invalid drive
[1435172832] HOST ALERT: 10.43.254.16;UP;HARD;1;OK - 10.43.254.16: rta 163.490ms, lost 0%
[1435172832] HOST NOTIFICATION: Matt Douglas;10.43.254.16;UP;xi_host_notification_handler;OK - 10.43.254.16: rta 163.490ms, lost 0%
[1435172832] HOST NOTIFICATION: Network Team;10.43.254.16;UP;xi_host_notification_handler;OK - 10.43.254.16: rta 163.490ms, lost 0%
[1435172836] SERVICE ALERT: USSNFSPACPW01;Drive C: Disk Usage;UNKNOWN;HARD;5;Free disk space : Invalid drive
[1435172839] HOST ALERT: 10.35.48.11;DOWN;SOFT;2;CRITICAL - 10.35.48.11: rta nan, lost 100%
[1435172841] HOST ALERT: 172.16.100.247;DOWN;SOFT;1;CRITICAL - 172.16.100.247: rta nan, lost 100%
[1435172843] HOST ALERT: 10.35.32.10;UP;SOFT;2;OK - 10.35.32.10: rta 34.372ms, lost 20%
[1435172844] SERVICE ALERT: PTX-SFIDC02;Uptime;WARNING;SOFT;1;could not fetch information from server
[1435172846] SERVICE ALERT: PTX-SFIDC02;CPU Usage;WARNING;SOFT;1;could not fetch information from server
[1435172847] SERVICE ALERT: USSNFSPACPW01;Memory Usage;WARNING;HARD;5;could not fetch information from server
[1435172848] SERVICE ALERT: USSNFSPACPW01;Uptime;WARNING;SOFT;1;could not fetch information from server
[1435172848] SERVICE ALERT: FILESFENG01;CPU Usage;WARNING;SOFT;1;could not fetch information from server
[1435172848] SERVICE ALERT: DCSNF01;Memory Usage;WARNING;HARD;5;could not fetch information from server
[1435172848] SERVICE ALERT: USWILSPACPW01;Proc: Sql Server Buff Hit;CRITICAL;SOFT;1;1
[1435172849] HOST ALERT: 10.35.48.11;DOWN;SOFT;2;CRITICAL - 10.35.48.11: rta nan, lost 100%
[1435172851] HOST ALERT: ARNPMSFDC01;UP;SOFT;2;PING OK - Packet loss = 16%, RTA = 182.69 ms
[1435172860] SERVICE ALERT: PTX-SFIDC02;Drive C: Disk Usage;WARNING;HARD;5;could not fetch information from server
[1435172860] HOST ALERT: 10.35.48.11;DOWN;SOFT;2;CRITICAL - 10.35.48.11: rta nan, lost 100%
[1435172861] SERVICE ALERT: FILEK201;Uptime;OK;SOFT;2;System Uptime - 0 day(s) 0 hour(s) 0 minute(s)
[1435172861] HOST ALERT: 172.16.100.247;DOWN;SOFT;2;CRITICAL - 172.16.100.247: rta nan, lost 100%
[1435172865] SERVICE ALERT: PTX-SFIDC02;Drive C: Disk Usage;WARNING;HARD;5;could not fetch information from server
[1435172866] SERVICE ALERT: SFPACSNFSQ;Drive C: Disk Usage;WARNING;HARD;5;could not fetch information from server
[1435172866] SERVICE ALERT: USSNFSPACPW02;Drive C: Disk Usage;WARNING;HARD;5;could not fetch information from server
[1435172869] SERVICE ALERT: NORSPCSS05;CPU Usage;WARNING;SOFT;2;could not fetch information from server
[1435172877] HOST ALERT: 10.35.47.65;DOWN;SOFT;1;CRITICAL - Network Unreachable (10.35.47.65)
[1435172877] SERVICE ALERT: USSNFSPACPW02;Uptime;WARNING;SOFT;1;could not fetch information from server
[1435172880] HOST ALERT: 10.35.32.10;UP;SOFT;2;OK - 10.35.32.10: rta 43.767ms, lost 0%
[1435172881] SERVICE ALERT: SFPACSNFSQ;Memory Usage;WARNING;HARD;5;could not fetch information from server
[1435172888] HOST ALERT: 10.35.32.10;UP;SOFT;2;OK - 10.35.32.10: rta 29.753ms, lost 0%
[1435172890] SERVICE ALERT: USSNFSPACPW01;Drive C: Disk Usage;UNKNOWN;HARD;5;Free disk space : Invalid drive
[1435172890] SERVICE ALERT: NORSPCSS05;Drive C: Disk Usage;WARNING;HARD;5;could not fetch information from server
[1435172894] SERVICE ALERT: PTX-SFIDC02;Drive C: Disk Usage;UNKNOWN;HARD;5;Free disk space : Invalid drive
[1435172894] SERVICE NOTIFICATION: serverteam;FFIACCDCPA2;Drive C: Disk Usage;CRITICAL;xi_service_notification_handler;connect to address 10.40.253.33 and port 12489: Connection refused
[1435172896] SERVICE ALERT: DCSNF01;Uptime;WARNING;SOFT;1;could not fetch information from server
[1435172899] SERVICE ALERT: DCSNF01;Drive C: Disk Usage;WARNING;HARD;5;could not fetch information from server
[1435172905] SERVICE ALERT: DCSNF01;Memory Usage;WARNING;HARD;5;could not fetch information from server
[1435172909] SERVICE ALERT: FILESFENG01;Uptime;WARNING;SOFT;1;could not fetch information from server
[1435172911] HOST ALERT: 172.16.20.72;DOWN;SOFT;1;CRITICAL - 172.16.20.72: rta 37.824ms, lost 50%
[1435172912] HOST ALERT: 172.16.20.42;DOWN;SOFT;1;CRITICAL - 172.16.20.42: Time to live exceeded in transit @ 172.16.19.157. rta nan, lost 100%
[1435172914] HOST ALERT: ARNPMSFDC01;UP;SOFT;2;PING WARNING - System call sent warnings to stderr Packet loss = 16%, RTA = 265.97 ms

jolson · Post by **jolson** » Wed Jun 24, 2015 2:19 pm

This issue could be related to ulimits. Please run the following on your CLI:

Code: Select all

ulimit -a

And check the following post:
http://support.nagios.com/wiki/index.ph ... 3.x_Issues
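
Note that `ulimit -a` reports the limits of the shell you run it in, which may differ from those of the running daemon. On Linux the limits of a live process can be read from /proc. The sketch below assumes the engine's process is named `nagios` and that `pgrep` is available; the function name is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print selected resource limits for a given PID
# by reading /proc/<pid>/limits (Linux only).
show_limits_for() {
    pid="$1"
    if [ -r "/proc/$pid/limits" ]; then
        grep -E '^(Limit|Max open files|Max processes|Max core file size)' "/proc/$pid/limits"
    else
        echo "no readable /proc/$pid/limits" >&2
        return 1
    fi
}

# Assumption: the monitoring engine runs as a process named exactly "nagios".
nagios_pid=$(pgrep -o -x nagios || true)
if [ -n "$nagios_pid" ]; then
    show_limits_for "$nagios_pid"
else
    echo "nagios is not running"
fi
```

If the engine is currently stopped, start it first and then run the check, since the limits only exist while the process is alive.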

Nagios Support Forum
