Nagiosxi OOM issue/ kills monitoring engine

dms0522
Posts: 7
Joined: Wed Feb 03, 2016 2:10 pm

Nagiosxi OOM issue/ kills monitoring engine

Post by dms0522 »

I have been having issues this past week with several OOM errors. When they occur, my monitoring service also stops and I have to restart it manually, which is bad when it happens overnight and no one gets any alerts. Could someone please help me? Please let me know what extra information you may need.
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Nagiosxi OOM issue/ kills monitoring engine

Post by ssax »

What version of XI?

How much RAM do you have?

Please PM me a copy of your profile. You can download it from Admin > System Profile by clicking the Download Profile button.

If that occurs again, please get the output of these commands before fixing it:

Code: Select all

top -n1
ps aux
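
Since the failure tends to happen overnight, it may help to save both outputs to files before restarting anything. The following is a minimal sketch, not an XI tool: the function name `capture_snapshot` and the default directory `/tmp/oom-snapshots` are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: save the output of the two commands above to
# timestamped files. Directory and file names are assumptions.
capture_snapshot() {
    outdir="${1:-/tmp/oom-snapshots}"
    stamp="$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$outdir" || return 1
    # -b (batch mode) makes top emit plain text suitable for a file.
    top -b -n1 > "$outdir/top-$stamp.txt" 2>&1
    ps aux     > "$outdir/ps-$stamp.txt" 2>&1
    # Print the paths so they can be attached to a post or PM.
    echo "$outdir/top-$stamp.txt"
    echo "$outdir/ps-$stamp.txt"
}
```

Calling `capture_snapshot` with no argument writes to the assumed default directory; pass a different path as the first argument if `/tmp` is cleaned on reboot.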
dms0522
Posts: 7
Joined: Wed Feb 03, 2016 2:10 pm

Re: Nagiosxi OOM issue/ kills monitoring engine

Post by dms0522 »

It's XI 5.7.3 and it currently has 8 GB of RAM. It only had 1 GB, but I added more last Wednesday and the system seems to recognize it, so I'm not sure what's going on.

I PM'd you my profile.

OK, I'll run the commands when it stops again.
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Nagiosxi OOM issue/ kills monitoring engine

Post by ssax »

Still seeing these:

Code: Select all

Oct  8 15:44:10 nagiosxi nagios: Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/1602186250.perfdata.host" - errno: Cannot allocate memory
I'm also seeing the OOM in /var/log/messages. Is this a physical system? If you just added more RAM, I'm wondering whether some of the memory is bad. Did you reboot after adding it?
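
For anyone wanting to find the same entries on their own system, here is a minimal sketch. The function name `find_oom_events` is hypothetical, and the patterns are assumptions based on the fork() error above and on typical kernel OOM messages, whose wording can vary by kernel version.

```shell
#!/bin/sh
# Hypothetical helper: print log lines that point to memory exhaustion.
# The default path matches the log file mentioned in this thread.
find_oom_events() {
    logfile="${1:-/var/log/messages}"
    # Case-insensitive match on common OOM and fork-failure wording.
    grep -E -i 'out of memory|oom-killer|cannot allocate memory' "$logfile"
}
```

Run `find_oom_events` as root, since /var/log/messages is usually not world-readable. Running `free -m` alongside it shows how much RAM the OS currently sees.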

You have crashed DB tables, please repair them:

Code: Select all

cd /usr/local/nagiosxi/scripts
./repair_databases.sh
Let's check the size of your DB tables after you've repaired them. Please send the output of these commands (as root):
- NOTE: You may need to adjust -h 127.0.0.1, -uroot, and -pnagiosxi in the first command if your DB is offloaded to another server and/or you've changed the root MySQL password.

Code: Select all

echo "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql', 'nagiosxi');" | mysql -h 127.0.0.1 -uroot -pnagiosxi --table
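
The output of that query is long, so a small filter can help when you only care about the biggest tables. This is a rough sketch that assumes the two-column `--table` layout produced by the command above; the function name `largest_tables` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "mysql --table" output on stdin and print
# the N largest tables as "<size in MB> <table name>".
largest_tables() {
    n="${1:-5}"
    awk -F'|' '
        NF >= 4 {
            gsub(/ /, "", $2)   # table name
            gsub(/ /, "", $3)   # size in MB
            # Skip the header row, whose size column is not numeric.
            if ($3 ~ /^[0-9]+(\.[0-9]+)?$/) print $3, $2
        }
    ' | LC_ALL=C sort -rn | head -n "$n"
}
```

Append `| largest_tables 5` to the mysql command above to see the five largest tables. Border lines are ignored because they contain no `|` separators.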
This next command may fail, and that's okay because not all systems run PostgreSQL. Send the output anyway:

Code: Select all

echo "SELECT relname as Table, pg_size_pretty(pg_total_relation_size(relid)) As Size, pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as ExternalSize FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC;" | psql nagiosxi nagiosxi
What is the output of this command:

Code: Select all

dmesg
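
If the dmesg output is long, a tally of which processes the OOM killer terminated can narrow things down. This is a minimal sketch that assumes the kernel logs lines of the form `Killed process <pid> (<name>) ...`; the exact wording depends on the kernel version, and the function name `oom_victims` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and count how
# often each process name appears in "Killed process" lines.
oom_victims() {
    sed -n -E 's/.*Killed process [0-9]+ \(([^)]*)\).*/\1/p' |
        sort | uniq -c | sort -rn |
        awk '{ $1 = $1; print }'   # normalize uniq's leading spaces
}
```

For example, `dmesg | oom_victims` prints one line per process name, most frequently killed first. An empty result means no lines matched the assumed format.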
dms0522
Posts: 7
Joined: Wed Feb 03, 2016 2:10 pm

Re: Nagiosxi OOM issue/ kills monitoring engine

Post by dms0522 »

To answer your first question: no, it is completely virtual, running on VMware, and yes, I did reboot after adding the RAM.
dms0522
Posts: 7
Joined: Wed Feb 03, 2016 2:10 pm

Re: Nagiosxi OOM issue/ kills monitoring engine

Post by dms0522 »

First command output:

+--------------------------------------------+------------+
| Table | Size in MB |
+--------------------------------------------+------------+
| nagios_acknowledgements | 0.10 |
| nagios_commands | 0.02 |
| nagios_commenthistory | 1.04 |
| nagios_comments | 0.01 |
| nagios_configfiles | 0.01 |
| nagios_configfilevariables | 0.01 |
| nagios_conninfo | 0.29 |
| nagios_contact_addresses | 0.00 |
| nagios_contact_notificationcommands | 0.01 |
| nagios_contactgroup_members | 0.00 |
| nagios_contactgroups | 0.00 |
| nagios_contactnotificationmethods | 0.12 |
| nagios_contactnotifications | 0.12 |
| nagios_contacts | 0.01 |
| nagios_contactstatus | 0.00 |
| nagios_customvariables | 0.10 |
| nagios_customvariablestatus | 0.11 |
| nagios_dbversion | 0.00 |
| nagios_downtimehistory | 0.03 |
| nagios_eventhandlers | 0.00 |
| nagios_externalcommands | 0.01 |
| nagios_flappinghistory | 0.98 |
| nagios_host_contactgroups | 0.01 |
| nagios_host_contacts | 0.03 |
| nagios_host_parenthosts | 0.00 |
| nagios_hostchecks | 0.11 |
| nagios_hostdependencies | 0.00 |
| nagios_hostescalation_contactgroups | 0.00 |
| nagios_hostescalation_contacts | 0.00 |
| nagios_hostescalations | 0.00 |
| nagios_hostgroup_members | 0.00 |
| nagios_hostgroups | 0.00 |
| nagios_hosts | 0.06 |
| nagios_hoststatus | 0.16 |
| nagios_instances | 0.00 |
| nagios_logentries | 9.69 |
| nagios_notifications | 0.13 |
| nagios_objects | 0.34 |
| nagios_processevents | 0.06 |
| nagios_programstatus | 0.00 |
| nagios_runtimevariables | 0.00 |
| nagios_scheduleddowntime | 0.00 |
| nagios_service_contactgroups | 0.03 |
| nagios_service_contacts | 0.09 |
| nagios_service_parentservices | 0.00 |
| nagios_servicechecks | 0.45 |
| nagios_servicedependencies | 0.00 |
| nagios_serviceescalation_contactgroups | 0.00 |
| nagios_serviceescalation_contacts | 0.00 |
| nagios_serviceescalations | 0.00 |
| nagios_servicegroup_members | 0.00 |
| nagios_servicegroups | 0.00 |
| nagios_services | 0.18 |
| nagios_servicestatus | 0.75 |
| nagios_statehistory | 26.40 |
| nagios_systemcommands | 0.03 |
| nagios_timedeventqueue | 0.00 |
| nagios_timedevents | 0.00 |
| nagios_timeperiod_timeranges | 0.02 |
| nagios_timeperiods | 0.00 |
| tbl_command | 0.04 |
| tbl_contact | 0.01 |
| tbl_contactgroup | 0.01 |
| tbl_contacttemplate | 0.01 |
| tbl_domain | 0.01 |
| tbl_host | 0.05 |
| tbl_hostdependency | 0.00 |
| tbl_hostescalation | 0.00 |
| tbl_hostextinfo | 0.00 |
| tbl_hostgroup | 0.01 |
| tbl_hosttemplate | 0.01 |
| tbl_info | 0.13 |
| tbl_lnkContactToCommandHost | 0.00 |
| tbl_lnkContactToCommandService | 0.00 |
| tbl_lnkContactToContactgroup | 0.00 |
| tbl_lnkContactToContacttemplate | 0.00 |
| tbl_lnkContactToVariabledefinition | 0.00 |
| tbl_lnkContactgroupToContact | 0.00 |
| tbl_lnkContactgroupToContactgroup | 0.00 |
| tbl_lnkContacttemplateToCommandHost | 0.00 |
| tbl_lnkContacttemplateToCommandService | 0.00 |
| tbl_lnkContacttemplateToContactgroup | 0.00 |
| tbl_lnkContacttemplateToContacttemplate | 0.00 |
| tbl_lnkContacttemplateToVariabledefinition | 0.00 |
| tbl_lnkHostToContact | 0.01 |
| tbl_lnkHostToContactgroup | 0.00 |
| tbl_lnkHostToHost | 0.00 |
| tbl_lnkHostToHostgroup | 0.00 |
| tbl_lnkHostToHosttemplate | 0.01 |
| tbl_lnkHostToVariabledefinition | 0.01 |
| tbl_lnkHostdependencyToHost_DH | 0.00 |
| tbl_lnkHostdependencyToHost_H | 0.00 |
| tbl_lnkHostdependencyToHostgroup_DH | 0.00 |
| tbl_lnkHostdependencyToHostgroup_H | 0.00 |
| tbl_lnkHostescalationToContact | 0.00 |
| tbl_lnkHostescalationToContactgroup | 0.00 |
| tbl_lnkHostescalationToHost | 0.00 |
| tbl_lnkHostescalationToHostgroup | 0.00 |
| tbl_lnkHostgroupToHost | 0.00 |
| tbl_lnkHostgroupToHostgroup | 0.00 |
| tbl_lnkHosttemplateToContact | 0.00 |
| tbl_lnkHosttemplateToContactgroup | 0.00 |
| tbl_lnkHosttemplateToHost | 0.00 |
| tbl_lnkHosttemplateToHostgroup | 0.00 |
| tbl_lnkHosttemplateToHosttemplate | 0.00 |
| tbl_lnkHosttemplateToVariabledefinition | 0.00 |
| tbl_lnkServiceToContact | 0.05 |
| tbl_lnkServiceToContactgroup | 0.01 |
| tbl_lnkServiceToHost | 0.03 |
| tbl_lnkServiceToHostgroup | 0.00 |
| tbl_lnkServiceToServicegroup | 0.00 |
| tbl_lnkServiceToServicetemplate | 0.03 |
| tbl_lnkServiceToVariabledefinition | 0.02 |
| tbl_lnkServicedependencyToHost_DH | 0.00 |
| tbl_lnkServicedependencyToHost_H | 0.00 |
| tbl_lnkServicedependencyToHostgroup_DH | 0.00 |
| tbl_lnkServicedependencyToHostgroup_H | 0.00 |
| tbl_lnkServicedependencyToService_DS | 0.00 |
| tbl_lnkServicedependencyToService_S | 0.00 |
| tbl_lnkServicedependencyToServicegroup_DS | 0.02 |
| tbl_lnkServicedependencyToServicegroup_S | 0.02 |
| tbl_lnkServiceescalationToContact | 0.00 |
| tbl_lnkServiceescalationToContactgroup | 0.00 |
| tbl_lnkServiceescalationToHost | 0.00 |
| tbl_lnkServiceescalationToHostgroup | 0.00 |
| tbl_lnkServiceescalationToService | 0.00 |
| tbl_lnkServiceescalationToServicegroup | 0.02 |
| tbl_lnkServicegroupToService | 0.00 |
| tbl_lnkServicegroupToServicegroup | 0.00 |
| tbl_lnkServicetemplateToContact | 0.00 |
| tbl_lnkServicetemplateToContactgroup | 0.00 |
| tbl_lnkServicetemplateToHost | 0.00 |
| tbl_lnkServicetemplateToHostgroup | 0.00 |
| tbl_lnkServicetemplateToServicegroup | 0.00 |
| tbl_lnkServicetemplateToServicetemplate | 0.01 |
| tbl_lnkServicetemplateToVariabledefinition | 0.00 |
| tbl_lnkTimeperiodToTimeperiod | 0.00 |
| tbl_logbook | 0.00 |
| tbl_mainmenu | 0.00 |
| tbl_permission | 0.02 |
| tbl_permission_inactive | 0.02 |
| tbl_service | 0.14 |
| tbl_servicedependency | 0.00 |
| tbl_serviceescalation | 0.00 |
| tbl_serviceextinfo | 0.00 |
| tbl_servicegroup | 0.01 |
| tbl_servicetemplate | 0.02 |
| tbl_session | 0.00 |
| tbl_session_locks | 0.00 |
| tbl_settings | 0.00 |
| tbl_submenu | 0.00 |
| tbl_timedefinition | 0.02 |
| tbl_timeperiod | 0.01 |
| tbl_user | 0.01 |
| tbl_variabledefinition | 0.07 |
+--------------------------------------------+------------+

Second command output:

table | size | externalsize
--------------------------+------------+--------------
xi_meta | 15 MB | 3296 kB
xi_auditlog | 1512 kB | 1232 kB
xi_events | 1312 kB | 1304 kB
xi_usermeta | 632 kB | 424 kB
xi_auth_tokens | 296 kB | 288 kB
xi_options | 120 kB | 96 kB
xi_sysstat | 112 kB | 72 kB
xi_users | 88 kB | 72 kB
xi_commands | 80 kB | 72 kB
xi_mibs | 72 kB | 64 kB
xi_eventqueue | 64 kB | 56 kB
xi_cmp_nagiosbpi_backups | 64 kB | 56 kB
xi_sessions | 40 kB | 40 kB
xi_cmp_trapdata | 24 kB | 24 kB
xi_cmp_favorites | 16 kB | 16 kB
xi_deploy_jobs | 16 kB | 16 kB
xi_cmp_trapdata_log | 16 kB | 16 kB
xi_deploy_agents | 16 kB | 16 kB
xi_incidents | 8192 bytes | 8192 bytes
xi_cmp_ccm_backups | 8192 bytes | 8192 bytes
(20 rows)
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Nagiosxi OOM issue/ kills monitoring engine

Post by ssax »

Have you experienced the OOM kill since repairing the databases?
dms0522
Posts: 7
Joined: Wed Feb 03, 2016 2:10 pm

Re: Nagiosxi OOM issue/ kills monitoring engine

Post by dms0522 »

Nope. I checked Nagios this morning and the monitoring engine did not stop over the weekend, and when I checked the database I did not find any OOM killer events. That seems to have fixed it. Thank you so much!
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Nagiosxi OOM issue/ kills monitoring engine

Post by ssax »

That's great to hear. I'm glad that resolved your issue! Are we okay to lock this up and mark it as resolved?