
Getting a random crash, some log files attached

Posted: Thu Sep 02, 2021 9:32 am
by theo481
We're getting what seems to be a random crash on our Nagios XI server. The following log entries appear to be the start of the problem:

Code:

Aug 31 20:14:18 nagios-prod nagios: wproc: Core Worker 19749: job 13714 (pid=65141) timed out. Killing it
Aug 31 20:14:18 nagios-prod nagios: wproc: CHECK job 13714 from worker Core Worker 19749 timed out after 124.73s
Aug 31 20:14:21 nagios-prod systemd: Started Session c80451 of user root.
Aug 31 20:14:52 nagios-prod journal: Suppressed 334 messages from /system.slice/nagios.service
Aug 31 20:15:07 nagios-prod nagios: job 13755 (pid=748): read() returned error 11
Aug 31 20:15:13 nagios-prod nagios: job 13785 (pid=512): read() returned error 11
Aug 31 20:15:14 nagios-prod nagios: job 13756 (pid=749): read() returned error 11
Aug 31 20:15:14 nagios-prod nagios: job 13885 (pid=852): read() returned error 11
Aug 31 20:15:14 nagios-prod nagios: job 13757 (pid=750): read() returned error 11
Aug 31 20:15:14 nagios-prod nagios: job 13758 (pid=751): read() returned error 11
Aug 31 20:15:14 nagios-prod nagios: job 13758 (pid=751): read() returned error 11
Aug 31 20:15:14 nagios-prod nagios: job 13758 (pid=751): read() returned error 11
Aug 31 20:15:14 nagios-prod nagios: job 13758 (pid=751): read() returned error 11
Aug 31 20:15:14 nagios-prod nagios: job 13759 (pid=752): read() returned error 11
Aug 31 20:15:14 nagios-prod nagios: job 13755 (pid=973): read() returned error 11
Aug 31 20:15:15 nagios-prod nagios: job 13760 (pid=753): read() returned error 11
Aug 31 20:15:15 nagios-prod nagios: job 13761 (pid=754): read() returned error 11
Aug 31 20:15:16 nagios-prod nagios: job 13762 (pid=755): read() returned error 11
Aug 31 20:15:16 nagios-prod nagios: job 13763 (pid=756): read() returned error 11
Aug 31 20:15:16 nagios-prod nagios: job 13764 (pid=757): read() returned error 11
Aug 31 20:15:16 nagios-prod nagios: job 13765 (pid=758): read() returned error 11
Aug 31 20:15:16 nagios-prod nagios: job 13766 (pid=759): read() returned error 11
Aug 31 20:15:16 nagios-prod nagios: job 13767 (pid=760): read() returned error 11
Aug 31 20:15:16 nagios-prod nagios: job 13768 (pid=761): read() returned error 11
Aug 31 20:15:16 nagios-prod nagios: job 13769 (pid=762): read() returned error 11
Aug 31 20:15:16 nagios-prod nagios: job 13770 (pid=763): read() returned error 11
Aug 31 20:15:16 nagios-prod nagios: job 13771 (pid=764): read() returned error 11
Aug 31 20:15:16 nagios-prod nagios: job 13772 (pid=765): read() returned error 11
Aug 31 20:15:17 nagios-prod nagios: job 13773 (pid=766): read() returned error 11
Aug 31 20:15:17 nagios-prod nagios: job 13774 (pid=767): read() returned error 11
Aug 31 20:15:17 nagios-prod nagios: job 13775 (pid=768): read() returned error 11
Aug 31 20:15:17 nagios-prod nagios: job 13776 (pid=769): read() returned error 11
Aug 31 20:15:17 nagios-prod nagios: job 13777 (pid=770): read() returned error 11
Aug 31 20:15:17 nagios-prod nagios: job 13778 (pid=771): read() returned error 11
Aug 31 20:15:17 nagios-prod nagios: job 13779 (pid=772): read() returned error 11
Aug 31 20:15:17 nagios-prod nagios: job 13780 (pid=773): read() returned error 11
Aug 31 20:15:17 nagios-prod nagios: job 13756 (pid=975): read() returned error 11
Aug 31 20:15:17 nagios-prod nagios: job 13757 (pid=976): read() returned error 11
Aug 31 20:15:17 nagios-prod nagios: job 13758 (pid=977): read() returned error 11
Aug 31 20:15:17 nagios-prod nagios: job 13759 (pid=978): read() returned error 11
Aug 31 20:15:17 nagios-prod nagios: job 13760 (pid=979): read() returned error 11
Aug 31 20:15:17 nagios-prod nagios: job 13761 (pid=980): read() returned error 11
Aug 31 20:15:17 nagios-prod nagios: job 13762 (pid=981): read() returned error 11
Aug 31 20:15:17 nagios-prod nagios: job 13763 (pid=982): read() returned error 11
Aug 31 20:15:17 nagios-prod nagios: job 13764 (pid=983): read() returned error 11
Aug 31 20:15:18 nagios-prod nagios: job 13765 (pid=987): read() returned error 11
Aug 31 20:15:18 nagios-prod nagios: job 13766 (pid=988): read() returned error 11
Aug 31 20:15:18 nagios-prod nagios: job 13767 (pid=989): read() returned error 11
Aug 31 20:15:18 nagios-prod nagios: job 13768 (pid=991): read() returned error 11
Aug 31 20:15:18 nagios-prod nagios: job 13769 (pid=992): read() returned error 11
Aug 31 20:15:18 nagios-prod nagios: job 13770 (pid=993): read() returned error 11
Aug 31 20:15:18 nagios-prod nagios: job 13771 (pid=995): read() returned error 11
Aug 31 20:15:18 nagios-prod nagios: job 13772 (pid=996): read() returned error 11
Aug 31 20:15:18 nagios-prod nagios: job 13773 (pid=997): read() returned error 11
Aug 31 20:15:18 nagios-prod nagios: job 13774 (pid=998): read() returned error 11
Aug 31 20:15:18 nagios-prod nagios: job 13775 (pid=999): read() returned error 11
Aug 31 20:15:18 nagios-prod nagios: job 13776 (pid=1000): read() returned error 11
Aug 31 20:15:18 nagios-prod nagios: job 13777 (pid=1003): read() returned error 11
Aug 31 20:15:18 nagios-prod nagios: job 13778 (pid=1004): read() returned error 11
Aug 31 20:15:18 nagios-prod nagios: job 13779 (pid=1005): read() returned error 11
Aug 31 20:15:18 nagios-prod nagios: job 13780 (pid=1006): read() returned error 11
Aug 31 20:15:18 nagios-prod nagios: job 13781 (pid=1007): read() returned error 11
Aug 31 20:15:19 nagios-prod nagios: job 13781 (pid=465): read() returned error 11
Aug 31 20:15:19 nagios-prod nagios: job 13783 (pid=471): read() returned error 11
Aug 31 20:15:19 nagios-prod nagios: job 13781 (pid=774): read() returned error 11
Aug 31 20:15:19 nagios-prod nagios: job 13781 (pid=774): read() returned error 11
Aug 31 20:15:19 nagios-prod nagios: job 13782 (pid=775): read() returned error 11
Aug 31 20:15:19 nagios-prod nagios: job 13782 (pid=775): read() returned error 11
Aug 31 20:15:19 nagios-prod nagios: job 13782 (pid=775): read() returned error 11
Aug 31 20:15:19 nagios-prod nagios: job 13782 (pid=775): read() returned error 11
Aug 31 20:15:19 nagios-prod nagios: job 13782 (pid=775): read() returned error 11
Aug 31 20:15:19 nagios-prod nagios: job 13782 (pid=508): read() returned error 11
Aug 31 20:15:19 nagios-prod nagios: job 13783 (pid=509): read() returned error 11
Aug 31 20:15:20 nagios-prod nagios: job 13786 (pid=513): read() returned error 11
Aug 31 20:15:20 nagios-prod nagios: job 13788 (pid=515): read() returned error 11
Aug 31 20:15:20 nagios-prod nagios: job 13789 (pid=516): read() returned error 11
Aug 31 20:15:20 nagios-prod nagios: job 13790 (pid=517): read() returned error 11
Aug 31 20:15:20 nagios-prod nagios: job 13791 (pid=518): read() returned error 11
Aug 31 20:15:20 nagios-prod nagios: job 13792 (pid=519): read() returned error 11
Aug 31 20:15:20 nagios-prod nagios: job 13793 (pid=520): read() returned error 11
Aug 31 20:15:20 nagios-prod nagios: job 13796 (pid=524): read() returned error 11
After this repeats for a long time, the log switches over to:

Code:

Aug 31 20:24:01 nagios-prod systemd: Started Session 1649639 of user nagios.
Aug 31 20:24:01 nagios-prod systemd: Started Session 1649632 of user nagios.
Aug 31 20:24:01 nagios-prod systemd: Started Session 1649640 of user nagios.
Aug 31 20:24:01 nagios-prod systemd: Started Session 1649633 of user nagios.
Aug 31 20:24:01 nagios-prod systemd: Started Session 1649636 of user nagios.
Aug 31 20:24:01 nagios-prod systemd: Started Session 1649641 of user nagios.
Aug 31 20:24:01 nagios-prod systemd: Started Session 1649642 of user nagios.
Aug 31 20:24:01 nagios-prod systemd: Started Session 1649638 of user nagios.
Aug 31 20:25:01 nagios-prod systemd: Started Session 1649647 of user nagios.
Aug 31 20:25:01 nagios-prod systemd: Started Session 1649645 of user nagios.
Aug 31 20:25:01 nagios-prod systemd: Started Session 1649649 of user nagios.
Aug 31 20:25:01 nagios-prod systemd: Started Session 1649643 of user nagios.
Aug 31 20:25:01 nagios-prod systemd: Started Session 1649650 of user nagios.
Aug 31 20:25:01 nagios-prod systemd: Started Session 1649648 of user nagios.
Aug 31 20:25:01 nagios-prod systemd: Started Session 1649646 of user nagios.
Aug 31 20:25:01 nagios-prod systemd: Started Session 1649652 of user root.
Aug 31 20:25:01 nagios-prod systemd: Started Session 1649644 of user nagios.
Aug 31 20:25:01 nagios-prod systemd: Started Session 1649651 of user nagios.
Aug 31 20:25:01 nagios-prod systemd: Started Session 1649654 of user nagios.
Aug 31 20:25:01 nagios-prod systemd: Started Session 1649655 of user nagios.
Aug 31 20:25:01 nagios-prod systemd: Started Session 1649653 of user nagios.
Aug 31 20:26:01 nagios-prod systemd: Started Session 1649657 of user nagios.
Aug 31 20:26:01 nagios-prod systemd: Started Session 1649658 of user nagios.
Aug 31 20:26:01 nagios-prod systemd: Started Session 1649659 of user nagios.
Aug 31 20:26:01 nagios-prod systemd: Started Session 1649663 of user nagios.
Aug 31 20:26:01 nagios-prod systemd: Started Session 1649661 of user nagios.
Aug 31 20:26:01 nagios-prod systemd: Started Session 1649665 of user nagios.
Aug 31 20:26:01 nagios-prod systemd: Started Session 1649656 of user nagios.
Aug 31 20:26:01 nagios-prod systemd: Started Session 1649660 of user nagios.
Aug 31 20:26:01 nagios-prod systemd: Started Session 1649664 of user nagios.
Aug 31 20:26:01 nagios-prod systemd: Started Session 1649662 of user nagios.
Aug 31 20:26:01 nagios-prod systemd: Started Session 1649666 of user nagios.
Aug 31 20:27:01 nagios-prod systemd: Started Session 1649667 of user nagios.
Aug 31 20:27:01 nagios-prod systemd: Started Session 1649669 of user nagios.
Aug 31 20:27:01 nagios-prod systemd: Started Session 1649671 of user nagios.
Aug 31 20:27:01 nagios-prod systemd: Started Session 1649673 of user nagios.
Aug 31 20:27:01 nagios-prod systemd: Started Session 1649674 of user nagios.
Aug 31 20:27:01 nagios-prod systemd: Started Session 1649676 of user nagios.
Aug 31 20:27:01 nagios-prod systemd: Started Session 1649675 of user nagios.
Nagios then appears to stay down until the monitoring process is restarted.

Any ideas?
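To see how quickly the `read() returned error 11` messages pile up (on Linux, errno 11 is EAGAIN, "Resource temporarily unavailable"), a short pipeline can bucket them by minute. This is only a sketch, not a Nagios tool: `count_error11_per_minute` is a hypothetical name, and it assumes syslog lines in the same format as the excerpt above. The here-document simply replays a few of those lines.

```shell
# Hypothetical helper: count "read() returned error 11" lines per minute.
# Expects syslog-style lines on stdin ("Aug 31 20:15:14 host nagios: ...").
# awk keeps fields 1-3 (month, day, time) and trims the time to HH:MM.
count_error11_per_minute() {
  grep -F 'read() returned error 11' |
    awk '{ split($3, t, ":"); print $1, $2, t[1] ":" t[2] }' |
    sort | uniq -c
}

# Replay a few lines from the excerpt above (the timeout line is not counted).
count_error11_per_minute <<'EOF'
Aug 31 20:14:18 nagios-prod nagios: wproc: CHECK job 13714 from worker Core Worker 19749 timed out after 124.73s
Aug 31 20:15:07 nagios-prod nagios: job 13755 (pid=748): read() returned error 11
Aug 31 20:15:13 nagios-prod nagios: job 13785 (pid=512): read() returned error 11
Aug 31 20:15:14 nagios-prod nagios: job 13756 (pid=749): read() returned error 11
EOF
# prints "3 Aug 31 20:15" (uniq -c left-pads the count)
```

On a live server you might run `count_error11_per_minute < /var/log/messages`; that log path is an assumption (it is the usual syslog file on CentOS/RHEL, but yours may differ).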

Re: Getting a random crash, some log files attached

Posted: Thu Sep 02, 2021 3:48 pm
by dchurch
If you PM me a system profile, I can diagnose further. To get one, go to Admin (top menu) => System Profile (left menu), then click the blue button.

If you're unable to generate the profile through the web interface, please try generating it from the command line by running these commands as root:

Code:

rm -rf /usr/local/nagiosxi/var/components/profile*
/usr/local/nagiosxi/scripts/components/getprofile.sh SUPPORT
Then send me the resulting /usr/local/nagiosxi/var/components/profile.zip file.
If the profile script fails, please include the ENTIRE output.
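Since a failed profile run needs its ENTIRE output, one way to make sure nothing gets lost is to send both stdout and stderr to a file. The sketch below wraps the two commands above in a hypothetical `run_profile` function; the output file `/tmp/getprofile-output.txt` is an arbitrary choice, not something Nagios XI expects.

```shell
# Hypothetical helper: rerun the Nagios XI profile script and keep its full output.
run_profile() {
  local script=/usr/local/nagiosxi/scripts/components/getprofile.sh
  local out="${1:-/tmp/getprofile-output.txt}"   # arbitrary log location (assumption)
  if [ ! -x "$script" ]; then
    echo "$script not found or not executable" >&2
    return 2
  fi
  # Same cleanup step as above: remove any old profile files first.
  rm -rf /usr/local/nagiosxi/var/components/profile*
  # Capture stdout and stderr together, then show them on the terminal.
  "$script" SUPPORT >"$out" 2>&1
  local rc=$?
  cat "$out"
  if [ "$rc" -ne 0 ]; then
    echo "getprofile.sh exited with status $rc; full output saved in $out" >&2
  fi
  return "$rc"
}
```

Run it as root with `run_profile`; if it fails, the saved file holds everything the script printed.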

Re: Getting a random crash, some log files attached

Posted: Fri Sep 03, 2021 9:27 am
by dchurch
Create the file `/etc/my.cnf.d/nagios.cnf` with the following lines:

Code:

[mysqld]
innodb_log_buffer_size = 32M
innodb_buffer_pool_size = 1G
innodb_log_file_size = 256M
max_allowed_packet = 67108864
Then run the following commands as root:

Code:

/usr/local/nagiosxi/scripts/manage_services.sh stop nagios
/usr/local/nagiosxi/scripts/manage_services.sh stop mysqld
rm /var/lib/mysql/ib_logfile0
rm /var/lib/mysql/ib_logfile1
/usr/local/nagiosxi/scripts/manage_services.sh restart mysqld
/usr/local/nagiosxi/scripts/manage_services.sh restart nagios
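For reference, MySQL reads the K/M/G suffixes in that file as powers of 1024, so `max_allowed_packet = 67108864` is simply 64M written in bytes. The sketch below does that conversion and, after the restart, compares the recreated `ib_logfile0` with `innodb_log_file_size`. It assumes the server rebuilds the deleted log files at the configured size on startup and that the data directory is `/var/lib/mysql`; `to_bytes` is a hypothetical helper.

```shell
# Hypothetical helper: convert a my.cnf size (plain bytes or K/M/G suffix) to bytes.
# MySQL/MariaDB treat K, M and G as powers of 1024.
to_bytes() {
  local v="$1"
  case $v in
    *[Kk]) echo $(( ${v%?} * 1024 )) ;;
    *[Mm]) echo $(( ${v%?} * 1024 * 1024 )) ;;
    *[Gg]) echo $(( ${v%?} * 1024 * 1024 * 1024 )) ;;
    *)     echo "$v" ;;
  esac
}

# 67108864 bytes is exactly 64M.
echo "max_allowed_packet = $(( 67108864 / 1024 / 1024 ))M"

# After the restart, the recreated redo log should match innodb_log_file_size
# (assumes the default datadir; GNU stat -c %s prints the size in bytes).
expected=$(to_bytes 256M)
logfile=/var/lib/mysql/ib_logfile0
if [ -f "$logfile" ]; then
  actual=$(stat -c %s "$logfile")
  if [ "$actual" -eq "$expected" ]; then
    echo "$logfile is $actual bytes, as configured"
  else
    echo "$logfile is $actual bytes, expected $expected" >&2
  fi
else
  echo "$logfile not found (different datadir, or not running as root?)" >&2
fi
```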

Re: Getting a random crash, some log files attached

Posted: Fri Sep 03, 2021 10:05 am
by theo481
dchurch wrote: If you PM me a system profile I can diagnose further. Get one by going to Admin (top menu) => System Profile (in the left menu), then clicking the blue button.

Profile sent.

Re: Getting a random crash, some log files attached

Posted: Fri Sep 03, 2021 4:50 pm
by dchurch
I guess I was too quick to reply after receiving your profile.

Did you try my suggestion above? Did that fix your issue?

Re: Getting a random crash, some log files attached

Posted: Thu Sep 09, 2021 8:39 am
by theo481
I will try this today; I was out with COVID.

Re: Getting a random crash, some log files attached

Posted: Thu Sep 09, 2021 4:39 pm
by benjaminsmith
theo481 wrote: I will try this today; I was out with COVID.
Sorry to hear that, and hope you are feeling much better.

Let us know how it goes. Dan is out on vacation, so please send any private messages to my account. Thanks, Benjamin

Re: Getting a random crash, some log files attached

Posted: Mon Sep 20, 2021 10:14 am
by theo481
Seems to be running better, but now my backups are failing.

Code:

Last output from the system: Error backing up MySQL database 'nagios' - check the password in this script!

Re: Getting a random crash, some log files attached

Posted: Mon Sep 20, 2021 11:02 am
by benjaminsmith
Hi,

The last profile you sent showed crashed tables in the database log. Try running the repair script below as root from the CLI:

Code:

/usr/local/nagiosxi/scripts/repair_databases.sh
Then run another backup job as a test. If that doesn't fix it, verify that the correct passwords are set using the guide below:
https://assets.nagios.com/downloads/nag ... ios-XI.pdf
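The backup error points at the credentials for the `nagios` database. A quick way to test a given user/password pair is to have the `mysql` client run a trivial query against it. This is a sketch: `check_db_login` is a hypothetical name, and it assumes the `mysql` client is installed on the XI server. The password is passed through the `MYSQL_PWD` environment variable so it does not appear on the command line.

```shell
# Hypothetical helper: check that a MySQL user/password can open a database.
# Usage: check_db_login USER PASSWORD [DATABASE]   (DATABASE defaults to nagios)
check_db_login() {
  local user="$1" pass="$2" db="${3:-nagios}"
  if ! command -v mysql >/dev/null 2>&1; then
    echo "mysql client not found" >&2
    return 2
  fi
  # MYSQL_PWD supplies the password via the environment instead of argv.
  if MYSQL_PWD=$pass mysql --user="$user" --batch --skip-column-names \
       -e 'SELECT 1' "$db" >/dev/null 2>&1; then
    echo "login OK: $user on $db"
  else
    echo "login FAILED: $user on $db (wrong password, missing grant, or server down)" >&2
    return 1
  fi
}
```

For example, as root: `check_db_login nagios 'the-password-you-expect'`. If that login fails, the password the backup uses is the next thing to compare against the guide.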

--Benjamin