NdoUtils stop working
Posted: Tue Jun 28, 2016 3:31 pm
by algomas123
Hi everybody.
I have had Nagios 4.1.1 and ndoutils 2.0 running for some months without problems. I also use ndo2db to write all information to a MySQL database on the same server. I have about 30 hosts and about 300 services.
A few days ago, after installing NagiosQL on the same MySQL database (I am not sure this is the cause, as it would not make sense), I started having problems:
After running for some hours (3-4), the Linux message queue starts growing rapidly and /var/log/messages is flooded with:
ndo2db: Error: queue recv error.
and no more data is inserted into the database! Both services keep running, but it looks like the socket is closed (even though the file is still there).
My kernel parameters are set as recommended (131072000), and the MySQL database has enough RAM and disk space for good performance.
I would appreciate any clue or help; I don't know what else to try!
Thanks in advance.
Re: NdoUtils stop working
Posted: Tue Jun 28, 2016 4:30 pm
by ssax
What is the output of this command:
If you see more than one nagios message queue, run these commands:
Code: Select all
service nagios stop
killall -9 nagios
service ndo2db stop
for i in `ipcs -q | grep nagios |awk '{print $2}'`; do ipcrm -q $i; done
service ndo2db start
service nagios start
Then validate again:
See if that resolves the issue.
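The "more than one nagios message queue" check can also be scripted. This is only a minimal sketch: it assumes the usual util-linux `ipcs -q` layout, where the owner is the third column, and `count_queues` is a hypothetical helper name, not something shipped with Nagios or ndoutils.

```shell
# Count SysV message queues owned by a given user (default: nagios).
# Reads `ipcs -q` output on stdin; assumes the owner is the third column.
count_queues() {
    owner="${1:-nagios}"
    awk -v owner="$owner" '$3 == owner { n++ } END { print n + 0 }'
}

# Usage: ipcs -q | count_queues nagios
```

A result above 1 would correspond to the "more than one queue" case described above.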
Re: NdoUtils stop working
Posted: Wed Jun 29, 2016 3:06 am
by algomas123
Thanks for your reply!
This is my queue
but it increases by 20 every second!
ndo2db is not querying any more.
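To put numbers on how fast the queue grows, the message count can be sampled at intervals. A rough sketch, assuming the util-linux `ipcs -q` layout (owner in column 3, message count in column 6); `queued_messages` is a made-up helper name for illustration.

```shell
# Print the total number of queued messages across queues owned by a user.
# Reads `ipcs -q` output on stdin; assumes owner is column 3 and the
# message count is column 6.
queued_messages() {
    owner="${1:-nagios}"
    awk -v owner="$owner" '$3 == owner { total += $6 } END { print total + 0 }'
}

# Usage (one sample every 5 seconds until interrupted):
#   while sleep 5; do
#       printf '%s %s\n' "$(date +%T)" "$(ipcs -q | queued_messages nagios)"
#   done
```

A count that only ever rises would suggest that ndo2db has stopped reading from the queue while Nagios keeps writing to it.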
Re: NdoUtils stop working
Posted: Wed Jun 29, 2016 12:28 pm
by tgriep
Is the MySQL server local to the Nagios server or is it remote?
Are you seeing any errors in the MySQL logs?
What OS and version is the Core system running on?
Can you post this file so we can view it?
Re: NdoUtils stop working
Posted: Wed Jun 29, 2016 1:47 pm
by algomas123
Hello!
Yes, it is a local database.
No, there are no errors in the MySQL logs, and it has enough RAM and disk.
It is running Nagios 4.1.1 on CentOS 7.
Code: Select all
# Controls the default maximum size of a message queue, in bytes
kernel.msgmnb = 131072000
# Controls the maximum size of a single message, in bytes
kernel.msgmax = 131072000
# Controls the maximum shared memory segment size, in bytes
kernel.shmmax = 4294967295
# Controls the total amount of shared memory, in pages
kernel.shmall = 268435456
# Controls the maximum number of message queue identifiers
kernel.msgmni = 256000
# Controls the maximum number of shared memory segments
kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
fs.file-max = 6815744
# net.ipv4.ip_local_port_range = 9000 65500
net.core.rmem_default = 262144
net.core.wmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_max = 1048576
fs.aio-max-nr = 1048576
Now it is crashing again, and ndo2db is using a lot of CPU (as is systemd-journal, I guess because it is writing lots of "queue recv error" log entries).
Thanks for your help
Re: NdoUtils stop working
Posted: Wed Jun 29, 2016 2:17 pm
by tgriep
One thing you can try is to edit the /etc/sysctl.conf file and change this from
to
Run this to apply the change
See if that helps out.
You could also update ndoutils to the master branch, which may help too.
Here is the process for doing that.
Code: Select all
cd /tmp
wget https://github.com/NagiosEnterprises/ndoutils/archive/master.tar.gz
tar xvfz master.tar.gz
cd ndoutils-master
./configure
make
make install
service nagios stop
service ndo2db restart
service nagios start
Let us know if either of these changes works for you.
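When editing /etc/sysctl.conf, it can help to confirm what the file actually sets and compare that with what the kernel is currently using. The following is a sketch only; `sysctl_conf_value` is a hypothetical helper, and it assumes simple `key = value` lines like the ones posted above.

```shell
# Print the last value assigned to a key in a sysctl.conf-style file.
# Comment lines (starting with # or ;) are ignored.
sysctl_conf_value() {
    key="$1"
    file="${2:-/etc/sysctl.conf}"
    awk -v key="$key" '
        /^[ \t]*[#;]/ { next }                 # skip comment lines
        {
            k = $0
            sub(/=.*/, "", k)                  # text before the first "="
            gsub(/[ \t]/, "", k)
            if (k == key && index($0, "=") > 0) {
                v = $0
                sub(/^[^=]*=[ \t]*/, "", v)    # text after the first "="
                sub(/[ \t]+$/, "", v)
                found = v
            }
        }
        END { if (found != "") print found }
    ' "$file"
}

# Usage: compare the configured value with the running kernel value
#   sysctl_conf_value kernel.msgmnb
#   cat /proc/sys/kernel/msgmnb
```

If the two values differ, the file has probably not been reloaded since it was edited; `sysctl -p` is the usual way to reload /etc/sysctl.conf.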
Re: NdoUtils stop working
Posted: Wed Jun 29, 2016 3:10 pm
by algomas123
OK, I just tried both solutions. I will report back in a few hours (it usually fails after 4-5 hours).
Many thanks for your help.
Re: NdoUtils stop working
Posted: Wed Jun 29, 2016 3:15 pm
by tgriep
No problem, let us know how it works out.
Re: NdoUtils stop working
Posted: Thu Jun 30, 2016 3:00 am
by algomas123
Hello,
Now it is even stranger!
I did what you said with a minor modification:
I added code to the queue.c file, at the point where it fails, to print the reason for the error.
Code: Select all
char* pop_from_queue(void) {
    struct ndo2db_queue_msg msg;
    char *buf;
    ssize_t received;
    size_t buf_size;

    received = msgrcv(queue_id, &msg, NDO_MAX_MSG_SIZE, NDO_MSG_TYPE, MSG_NOERROR);
    if (received < 0) {
        int errno_save = errno;
        // Added strerror() so the log shows why msgrcv() failed
        syslog(LOG_ERR, "Error: queue recv error: %s", strerror(errno_save));
        errno = errno_save;
        received = 0;
    }

    // On error, received is 0, so an empty string is returned
    buf_size = strnlen(msg.text, (size_t)received);
    buf = malloc(buf_size+1);
    strncpy(buf, msg.text, buf_size);
    buf[buf_size] = '\0';
    return buf;
}
OK, so after some hours, /var/log/messages shows this error:
Code: Select all
ndo2db: Error: queue recv error: Invalid argument
The strange thing is that I no longer have any message queue and ndo2db is not querying, BUT NAGIOS IS WORKING! (Maybe only in memory, I don't know.)
EDIT: OK, obviously the invalid argument is because the queue is not there anymore...but why?
Re: NdoUtils stop working
Posted: Thu Jun 30, 2016 10:37 am
by tgriep
You may want to enable debugging in the ndo2db.cfg file and see what error shows up there when the issue happens again.
We might get more details on what is failing, which the developers could use.
Nagios Core only uses the MySQL database to store its information / status for other 3rd party tools to use; it doesn't use it to run.
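For the debugging suggestion above, the change in ndo2db.cfg is normally the `debug_level` line (the sample config documents 0 as nothing and -1 as everything). A minimal sketch of making that edit while keeping a backup follows; `set_ndo2db_debug_level` is a hypothetical helper, and the config path in the usage comment is an assumption for a source install, so adjust it to your system.

```shell
# Set debug_level in an ndo2db.cfg-style file, keeping a .bak copy.
# Assumes the file already contains a "debug_level=" line.
set_ndo2db_debug_level() {
    cfg="$1"
    level="$2"
    cp -p "$cfg" "$cfg.bak" &&
    sed "s/^debug_level=.*/debug_level=$level/" "$cfg.bak" > "$cfg"
}

# Usage (path is an assumption; restart ndo2db afterwards):
#   set_ndo2db_debug_level /usr/local/nagios/etc/ndo2db.cfg -1
#   service ndo2db restart
```

The debug output goes to whatever `debug_file` points at in the same config file. Remember to set the level back to 0 once the problem has been captured, since full debugging can produce a lot of output.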