Frequent database connection error

Post by mejokj »

I have a Nagios XI 2014R2.4 setup to monitor around 250 hosts and 1500 services. Everything works well but I get the following error multiple times a day. It disappears by itself after a while and then comes back later.

Message: A database connection error has been detected, we are attempting to repair the server, if the repair does not resolve the issue, please contact Nagios support. Run the following from the CLI as root to attempt to repair the DB
/usr/local/nagiosxi/scripts/repair_databases.sh

I followed the instructions mentioned in the Repairing_The_Nagios_XI_Database.pdf document and I think truncating tables fixes the issue, but only temporarily.

Any suggestions on why this is reoccurring?

System info:
CentOS release 6.5
32 Bit
Manual install on a VMware virtual machine
Last edited by mejokj on Wed Apr 08, 2015 4:13 pm, edited 1 time in total.

Post by jdalrymple »

Database errors are generally the result of a disk getting filled or a system having an improper shutdown (crash). Has your server experienced either of those?

Typically after the db is properly repaired the issue doesn't recur though.

Do you see anything weird in your mysqld.log?

Post by mejokj »

There is plenty of disk space available, and there have been no improper shutdowns of the server either.

In mysqld.log I see a lot of the following:

141210 13:43:28 [ERROR] /usr/libexec/mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed

and

141211 16:49:01 [ERROR] /usr/libexec/mysqld: Table './nagios/nagios_hoststatus' is marked as crashed and should be repaired

Any strings I should be grepping for?
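
For anyone following along, one possible starting point is to tally how often each table is reported as crashed, which shows whether the corruption is concentrated in one or two tables. This is only a rough sketch: the function name crashed_tables is made up for illustration, and the log path in the usage comment is an assumption (the usual CentOS 6 default), not something confirmed in this thread.

```shell
# Count "marked as crashed" errors per table in a MySQL error log.
# Reads log text on stdin; prints "<count> <table>", most frequent first.
# crashed_tables is a hypothetical helper name, not part of Nagios XI or MySQL.
crashed_tables() {
    grep -F 'is marked as crashed' \
        | sed -n "s/.*Table '\([^']*\)' is marked as crashed.*/\1/p" \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}

# Assumed usage; adjust the path if my.cnf sets log-error elsewhere:
# crashed_tables < /var/log/mysqld.log
```

Matching on the fixed phrase catches both variants quoted above ("should be repaired" and "last (automatic?) repair failed").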

Post by abrist »

How fast is the logentries table growing?

ls -lha /var/lib/mysql/nagios/nagios_logentries.*

Is this server experiencing hard reboots?
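
A single ls only shows the current size, so to answer "how fast" the file has to be sampled twice. The following is a minimal sketch of that idea; file_bytes and growth_per_hour are hypothetical helper names, and the ten-minute interval is an arbitrary choice.

```shell
# Print the size of a file in bytes (portable: wc instead of GNU stat).
file_bytes() {
    wc -c < "$1" | tr -d ' '
}

# Given two sizes in bytes and the number of seconds between the samples,
# print the growth rate in bytes per hour (integer arithmetic).
growth_per_hour() {
    echo $(( ($2 - $1) * 3600 / $3 ))
}

# Assumed usage: sample the MyISAM data file twice, ten minutes apart.
# f=/var/lib/mysql/nagios/nagios_logentries.MYD
# a=$(file_bytes "$f"); sleep 600; b=$(file_bytes "$f")
# growth_per_hour "$a" "$b" 600
```

A longer interval gives a steadier number, since a short sample can easily land between bursts of writes.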

Post by mejokj »

I don't think it's growing that fast but if it is, is there a way to slow it down?

[root@NMSAPPSERVER1 ~]# ls -lha /var/lib/mysql/nagios/nagios_logentries.*
-rw-rw----. 1 mysql mysql 8.8K Aug 26 2014 /var/lib/mysql/nagios/nagios_logentries.frm
-rw-rw---- 1 mysql mysql 41M Apr 9 01:30 /var/lib/mysql/nagios/nagios_logentries.MYD
-rw-rw---- 1 mysql mysql 39M Apr 9 01:30 /var/lib/mysql/nagios/nagios_logentries.MYI


No reboots at all in the last 6 days:
[root@NMSAPPSERVER1 ~]# uptime
01:31:44 up 6 days, 16:39, 3 users, load average: 1.16, 1.38, 1.66

Post by abrist »

I have seen issues with "wproc" and "Worker" errors spamming the logentries table. Let's check whether that is happening here:

echo "select * from nagios.nagios_logentries limit 20;" | mysql -pnagiosxi
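
Twenty rows with no ORDER BY will most likely just be the oldest entries, so a tally over the whole table may be more telling. Here is a rough sketch that groups messages by their leading token (for example "wproc:" or "ndomod:"). The name tally_by_prefix is invented for this example, and the mysql invocation in the comment reuses the credentials shown above as an assumption.

```shell
# Tally log messages by their first whitespace-separated token.
# Reads one logentry_data value per line on stdin;
# prints "<count> <token>", most frequent first.
# tally_by_prefix is a hypothetical helper name.
tally_by_prefix() {
    awk 'NF { count[$1]++ } END { for (p in count) print count[p], p }' \
        | sort -rn
}

# Assumed usage (-N drops the column header, -B gives plain batch output):
# mysql -pnagiosxi -N -B -e "SELECT logentry_data FROM nagios.nagios_logentries" | tally_by_prefix
```

If "wproc:" dominates the output, that would point at the worker spam described above.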

Post by mejokj »

[root@NMSAPPSERVER1 ~]# echo "select * from nagios.nagios_logentries limit 20;" | mysql -pnagiosxi
logentry_id instance_id logentry_time entry_time entry_time_usec logentry_type logentry_data realtime_data inferred_data_extracted
1 1 2015-04-02 08:51:58 2015-04-02 08:51:58 571541 262144 ndomod: Error writing to data sink! Some output may get lost... 1 1
2 1 2015-04-02 08:51:58 2015-04-02 08:51:58 571607 262144 ndomod: Please check remote ndo2db log, database connection or SSL Parameters 1 1
3 1 2015-04-02 08:51:58 2015-04-02 08:51:58 571463 64 Caught SIGTERM, shutting down... 1 1
4 1 2015-04-02 08:51:58 2015-04-02 08:51:58 640846 1 wproc: 'Core Worker 26355' seems to be choked. ret = -1; bufsize = 120: errno = 32 (Broken pipe) 1 1
5 1 2015-04-02 08:51:58 2015-04-02 08:51:58 640943 1 Unable to run check for service '2/2 Status' on host '457-SWITCH' 1 1
6 1 2015-04-02 08:51:58 2015-04-02 08:51:58 715572 64 Successfully shutdown... (PID=26347) 1 1
7 1 2015-04-02 08:52:31 2015-04-02 08:52:31 114173 262144 ndomod registered for log data' 1 1
8 1 2015-04-02 08:52:31 2015-04-02 08:52:31 114240 262144 ndomod registered for system command data' 1 1
9 1 2015-04-02 08:52:31 2015-04-02 08:52:31 114270 262144 ndomod registered for event handler data' 1 1
10 1 2015-04-02 08:52:31 2015-04-02 08:52:31 114308 262144 ndomod registered for notification data' 1 1
11 1 2015-04-02 08:52:31 2015-04-02 08:52:31 114335 262144 ndomod registered for comment data' 1 1
12 1 2015-04-02 08:52:31 2015-04-02 08:52:31 114360 262144 ndomod registered for downtime data' 1 1
13 1 2015-04-02 08:52:31 2015-04-02 08:52:31 114397 262144 ndomod registered for flapping data' 1 1
14 1 2015-04-02 08:52:31 2015-04-02 08:52:31 114423 262144 ndomod registered for program status data' 1 1
15 1 2015-04-02 08:52:31 2015-04-02 08:52:31 114481 262144 ndomod registered for host status data' 1 1
16 1 2015-04-02 08:52:31 2015-04-02 08:52:31 114509 262144 ndomod registered for service status data' 1 1
17 1 2015-04-02 08:52:31 2015-04-02 08:52:31 114545 262144 ndomod registered for adaptive program data' 1 1
18 1 2015-04-02 08:52:31 2015-04-02 08:52:31 114573 262144 ndomod registered for adaptive host data' 1 1
19 1 2015-04-02 08:52:31 2015-04-02 08:52:31 114610 262144 ndomod registered for adaptive service data' 1 1
20 1 2015-04-02 08:52:31 2015-04-02 08:52:31 114637 262144 ndomod registered for external command data' 1 1

Post by lmiltchev »

"1 1 2015-04-02 08:51:58 2015-04-02 08:51:58 571541 262144 ndomod: Error writing to data sink! Some output may get lost... 1 1"

Is the MySQL database local or offloaded to a remote server? Can you connect to it manually from the CLI? What is the output of the following command?

sysctl -p

FYI, it sometimes takes several runs of the database repair script to fix all of the errors. Have you checked mysqld.log for errors after you ran the "repair_databases.sh" script?
If MySQL errors keep showing up in the log after running the database repair script, you could try running the following command to see if the errors go away:

mysqlcheck -r -f -u root -pnagiosxi --databases nagios nagiosql

Post by mejokj »

MySQL is on the same server.

[root@NMSAPPSERVER1 ~]# sysctl -p
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
error: "net.bridge.bridge-nf-call-ip6tables" is an unknown key
error: "net.bridge.bridge-nf-call-iptables" is an unknown key
error: "net.bridge.bridge-nf-call-arptables" is an unknown key
kernel.msgmnb = 131072000
kernel.msgmax = 131072000
kernel.shmmax = 4294967295
kernel.shmall = 268435456
net.ipv4.neigh.default.gc_interval = 3600
net.ipv4.neigh.default.gc_stale_time = 3600
net.ipv4.neigh.default.gc_thresh3 = 4096
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.conf.eth0.rp_filter = 0
net.ipv4.conf.eth1.rp_filter = 0

Found the following errors in the mysqld.log:
150401 20:08:17 [ERROR] Got error 127 when reading table './nagios/nagios_hoststatus'
150401 20:08:18 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_hoststatus.MYI'; try to repair it


The following command doesn't make any difference:
mysqlcheck -r -f -u root -pnagiosxi --databases nagios nagiosql


My priority here is not fixing the errors, since they go away after some time, but finding out why they keep coming back.

I did a quick lookup online and found this: http://themanbehindthecode.com/2011/08/ ... ing-table/
"In my case I am using a MyISAM table with concurrent writes enabled. At any given time I had 10 – 15 processes attempting to read and write 1000+ rows to the table every second. The MySQL database could handle this for about 10 – 15 minutes and then it would mark the table as crashed, causing all of my processing to fail. My fix was to cut back on the processing and stop abusing the table so much. Table corruptions solved."

Post by cmerchant »

Could you show us the output of the following:

df -h
df -i

"My priority here is not fixing the errors, since they go away after some time, but finding out why they keep coming back."

We would hope that we can find what causes the corruption in the first place. It can be a host of things: lack of space, running out of processes, corrupted memory, bad MySQL code, or commits that fail.
That was an interesting link that you provided describing dialing back the number of processes hitting the database, but the author was not specific as to how he implemented it.
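
Since the errors come and go, a one-off df may miss a brief spike in space or inode usage. As a rough sketch, the check could be wrapped in a small filter and run from cron; over_threshold is a hypothetical helper name, the 90 percent limit is an arbitrary example, and the filter assumes mount points without spaces.

```shell
# Print mount points whose usage is at or above a percentage limit.
# Reads POSIX-format df output (df -P or df -iP) on stdin:
# column 5 is the use percentage, column 6 the mount point.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit) print $6, use "%"
    }'
}

# Assumed usage: block usage, then inode usage.
# df -P  | over_threshold 90
# df -iP | over_threshold 90
```

No output means nothing is at or above the limit at the moment the check runs.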