I have a hunch it may be due to a sudo change. (When we upgraded a box to 2.7, the nagios user didn't have permissions.) This box, however, is still running 2.5, and I'm wondering whether that change negatively affected it. I'll paste the error first and the sudo change below it. Then again, we have another box running 2.5 that is working fine :/
/usr/local/nagiosxi/scripts/backup_xi.sh
Backing up Core Config Manager (NagiosQL)...
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
Backing up Nagios Core...
tar: Removing leading `/' from member names
tar: /usr/local/nagios/share/perfdata/esu1l384: file changed as we read it
tar: /usr/local/nagios/share/perfdata/esu2v775: file changed as we read it
tar: /usr/local/nagios/var/ndo.sock: socket ignored
tar: /usr/local/nagios/var/rw/nagios.qh: socket ignored
tar: /usr/local/nagios/var: file changed as we read it
Backing up Nagios XI...
tar: Removing leading `/' from member names
Backing up MRTG...
tar: Removing leading `/' from member names
Backing up NRDP...
tar: Removing leading `/' from member names
Backing up MySQL databases...
mysqldump: Got error: 144: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed when using LOCK TABLES
Error backing up MySQL database 'nagios' - check the password in this script!
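Note that the mysqldump failure above is error 144 (a table marked as crashed), not a password problem, despite what the script prints. Here is a minimal sketch that pulls the database and table name out of that message and prints a repair command to review before running. The helper names `crashed_table` and `suggest_repair` are made up for illustration, and it assumes `mysqlcheck` is installed and that you add whatever credentials your setup needs:

```shell
#!/bin/sh
# Hypothetical helper: extract "db table" from an error line of the form
#   Table './db/table' is marked as crashed
crashed_table() {
    sed -n "s|.*Table '\./\([^/]*\)/\([^']*\)' is marked as crashed.*|\1 \2|p"
}

# Print (do not run) one repair command per crashed table found on stdin.
# Assumption: credentials (e.g. -u root -p) are added by hand before running.
suggest_repair() {
    crashed_table | while read -r db table; do
        echo "mysqlcheck --repair $db $table"
    done
}
```

Usage would be something like `/usr/local/nagiosxi/scripts/backup_xi.sh 2>&1 | suggest_repair`, which only prints the suggested command.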
Sudo file change: per my Unix admin, sudo here is managed from a master file, so the entry on the Nagios box doesn't really do anything:
Aug 19 15:53:00 esu1l268 nagios: wproc: CHECK job 876 from worker Core Worker 3152 timed out after 60.01s
Aug 19 15:53:00 esu1l268 nagios: wproc: host=s96d3z0; service=check_FS_space_Solaris_by_sshpass;
Aug 19 15:53:00 esu1l268 nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Aug 19 15:53:00 esu1l268 ndo2db: Error: Could not connect to MySQL database: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
Aug 19 15:53:00 esu1l268 ndo2db: Error: Could not connect to MySQL database: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
Aug 19 15:53:00 esu1l268 nagios: Warning: Check of service 'check_FS_space_Solaris_by_sshpass' on host 's96d3z0' timed out after 60.006s!
Aug 19 15:53:00 esu1l268 ndo2db: Error: Could not connect to MySQL database: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
Aug 19 15:53:00 esu1l268 ndo2db: Error: Could not connect to MySQL database: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
Aug 19 15:53:00 esu1l268 nagios: wproc: Core Worker 3152: job 876 (pid=17007): Dormant child reaped
Aug 19 15:53:00 esu1l268 rsyslogd-2177: imuxsock begins to drop messages from pid 3169 due to rate-limiting
Aug 19 15:53:02 esu1l268 snmpd[2649]: Connection from UDP: [11.48.116.70]:47248->[11.48.4.85]
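For what it's worth, the `(2)` at the end of the ndo2db errors is the system errno 2 (no such file or directory), so the socket file was missing at that moment, which usually means mysqld was down or restarting. A rough sketch for checking that; the function name `check_mysql_socket` is hypothetical, and the default path is simply the one from the log above:

```shell
#!/bin/sh
# Hypothetical helper: report whether a MySQL UNIX socket exists.
# The default path is taken from the ndo2db errors in the log above.
check_mysql_socket() {
    sock="${1:-/var/lib/mysql/mysql.sock}"
    if [ -S "$sock" ]; then
        echo "socket present: $sock"
    elif [ -e "$sock" ]; then
        echo "not a socket: $sock"
        return 1
    else
        echo "socket missing: $sock (is mysqld running?)"
        return 1
    fi
}
```

Running `check_mysql_socket` with no argument checks the default path; a missing socket points at mysqld itself rather than at ndo2db.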
So when I run that MySQL repair, things look good for a bit. I'm not sure exactly how long, but after what seemed like 30 minutes, none of my services are listed in Nagios XI or Core.
Things seem to be running in the background okay though as we're still alerting.
What it's saying when we try to pull up the services page in XI:
We are currently running 2,500 host checks and close to 10,000 service checks on the box (it's a massive box, and the hardware isn't being taxed). But I'm wondering if the engine just isn't keeping up?
Any ideas for diagnosis?
For the record, I don't care so much about the backup running at the moment as I do about getting the database error fixed!
service nagios stop
service ndo2db stop
killall -9 nagios
service mysqld restart
service ndo2db start
service nagios start
Try that and see if the errors are gone.
Sorry, to answer the question: it is running locally.
And when I tail /var/log/messages:
Aug 24 15:00:41 esu1l268 ndo2db: Warning: queue send error, retrying...
Aug 24 15:00:57 esu1l268 sshd[32543]: Did not receive identification string from 11.48.23.75
Aug 24 15:01:01 esu1l268 ndo2db: Error: max retries exceeded sending message to queue. Kernel queue parameters may neeed to be tuned. See README.
Aug 24 15:01:01 esu1l268 ndo2db: Warning: queue send error, retrying...
Aug 24 15:01:21 esu1l268 ndo2db: Error: max retries exceeded sending message to queue. Kernel queue parameters may neeed to be tuned. See README.
Aug 24 15:01:21 esu1l268 ndo2db: Warning: queue send error, retrying...
Aug 24 15:01:41 esu1l268 ndo2db: Error: max retries exceeded sending message to queue. Kernel queue parameters may neeed to be tuned. See README.
Aug 24 15:01:41 esu1l268 ndo2db: Warning: queue send error, retrying...
Aug 24 15:02:01 esu1l268 ndo2db: Error: max retries exceeded sending message to queue. Kernel queue parameters may neeed to be tuned. See README.
Aug 24 15:02:01 esu1l268 ndo2db: Warning: queue send error, retrying...
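The "queue send error" lines refer to the kernel message queue that ndo2db writes into, so the limits being hit are most likely the `kernel.msg*` sysctls. Below is a small sketch for counting how often the retries are exhausted and for viewing the current limits. The function names are hypothetical, and I'm deliberately not guessing at target values; the README mentioned in the log message is the place to look for those:

```shell
#!/bin/sh
# Hypothetical helper: count ndo2db "max retries exceeded" lines on stdin.
count_queue_overruns() {
    grep -c 'ndo2db: Error: max retries exceeded sending message to queue'
}

# Show the current kernel message queue limits (Linux, read-only).
show_msg_limits() {
    sysctl kernel.msgmnb kernel.msgmax kernel.msgmni
}
```

For example, `count_queue_overruns < /var/log/messages` gives the number of overruns logged so far, and `show_msg_limits` together with `ipcs -q` shows the limits and how full the queues currently are.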
Last edited by JakeHatMacys on Mon Aug 24, 2015 2:03 pm, edited 1 time in total.