
Problems with UNKNOWN messages.

Posted: Mon Oct 29, 2018 10:12 am
by nagiosEngie
Hello Nagios Crew,
I'm having huge amounts of UNKNOWN messages due to:

UNKNOWN: Execution exceeded timeout threshold of 60s
UNKNOWN: Error occurred while running the plugin. Use the verbose flag for more details.

Most of these are alarms from "SWAP" and "Uptime" checks done on NCPA agents.
Some stats in attached file: "unkonwn messages stats.docx"

The stats are generated from this month's (October) alerts.
Some servers generate a huge amount of unknown messages.
Do you have any suggestions to limit this problem?

Thanks
Sandro

Re: Problems with UNKNOWN messages.

Posted: Mon Oct 29, 2018 11:54 am
by benjaminsmith
Hi @nagiosEngie,

If this just started happening in October and it's intermittent, you might be experiencing some type of network connectivity or quality of service issues.

You could try raising the timeout beyond 60 seconds on the NCPA check command. Go to CCM > Commands, edit the command ($USER1$/check_ncpa.py -H $HOSTADDRESS$ -T 120 $ARG1$), and if that reduces the number of UNKNOWN messages, it's most likely a network issue.

Let me know what you find out.

Re: Problems with UNKNOWN messages.

Posted: Tue Nov 06, 2018 9:10 am
by nagiosEngie
Hello,
I kept an eye on the messages related to UPTIME and SWAP. They still all time out:

From event log:

2018-11-06 15:01:10 Warning: Check of service 'Uptime' on host 'EILIBWEBITMI01' timed out after 120.007s!
Runtime Error 2018-11-06 15:01:10 wproc: host=EILIBWEBITMI01; service=Uptime;

2018-11-06 15:00:29 Warning: Check of service 'Swap Usage' on host 'EILIBWEBITMI01' timed out after 120.006s!
Runtime Error 2018-11-06 15:00:29 wproc: host=EILIBWEBITMI01; service=Swap Usage;

Sandro

Re: Problems with UNKNOWN messages.

Posted: Tue Nov 06, 2018 10:29 am
by lmiltchev
Can you show us a few examples of "failing" checks, run from the command line, along with their output? Please use the verbose flag (-v).

Example:

Code:

/usr/local/nagios/libexec/check_ncpa.py -H <ip address> -t '<token>' -P 5693 -M memory/swap/percent -w 50 -c 80 -v

Also, run the following command and show the output:

Code:

nmap <server ip> -p 5693
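
If nmap is not installed on the Nagios server, the same reachability test can be approximated with a plain TCP connect. This is only a minimal sketch: the function name `port_is_open` is made up for illustration, and 5693 is simply the port used in the commands in this thread. It only tells you whether the port accepts connections, not whether the NCPA API answers correctly.

```python
import socket


def port_is_open(host, port=5693, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration; it does not talk to the NCPA API,
    it only checks that something is listening on the port.
    """
    try:
        # create_connection resolves the host and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling `port_is_open("<server ip>")` from the Nagios server should return True when the listener is reachable.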

Re: Problems with UNKNOWN messages.

Posted: Wed Nov 07, 2018 4:01 am
by nagiosEngie
Hi,
That is the problem: if I run it from the command line, I am unable to reproduce the timeout, even if I repeat the command several times in a row.
I launched this 10 times in a row and the command gave the correct output with no problem.

/usr/local/nagios/libexec/check_ncpa.py -v -H <IP> -T 120 -t 'xxxxxx' -P 5693 -M memory/swap -u Gi -w 95 -c 98
Connecting to: https://<IP>:5693/api/memory/swap/?token=xxxxx&warning=95&critical=98&units=Gi&check=1
File returned contained:
{
"returncode": 0,
"stdout": "OK: Used swap was 56.90 % (Total: 9.25 GiB, Used: 5.26 GiB, Free: 3.99 GiB) | 'total'=9.25GiB;9;9; 'used'=5.26GiB;9;9; 'free'=3.99GiB;9;9;"
}
OK: Used swap was 56.90 % (Total: 9.25 GiB, Used: 5.26 GiB, Free: 3.99 GiB) | 'total'=9.25GiB;9;9; 'used'=5.26GiB;9;9; 'free'=3.99GiB;9;9;

NMAP command output:

nmap <IP> -p 5693
Starting Nmap 6.47 ( http://nmap.org ) at 2018-11-07 09:56 CET
Nmap scan report for <HOSTNAME FQDN> (<IP>)
Host is up (0.00078s latency).
PORT STATE SERVICE
5693/tcp open unknown

Nmap done: 1 IP address (1 host up) scanned in 0.06 seconds

I checked CPU usage and SWAP on the monitored server and it is OK.
Thanks
Sandro

Re: Problems with UNKNOWN messages.

Posted: Wed Nov 07, 2018 11:21 am
by lmiltchev
"That is the problem if I do it via command line I am unable to generate the timeout, even if I repeat the command one after the other."

This is strange. Does the timeout happen at about the same time? Perhaps the server that you are monitoring is very busy at that time performing updates, backups, etc.?
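
One way to answer the "same time?" question is to bucket the timeout warnings by hour. The sketch below assumes the event log lines look exactly like the ones pasted earlier in this thread ("YYYY-MM-DD HH:MM:SS Warning: Check of service '...' on host '...' timed out after ...s!"); other log formats would need a different pattern. The names `TIMEOUT_RE` and `count_timeouts` are made up for illustration.

```python
import re
from collections import Counter

# Matches lines in the format pasted in this thread (an assumption about the export format).
TIMEOUT_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<hour>\d{2}):\d{2}:\d{2} "
    r"Warning: Check of service '(?P<service>[^']+)' on host '(?P<host>[^']+)' "
    r"timed out after (?P<secs>[\d.]+)s!"
)


def count_timeouts(lines):
    """Count timeout warnings per hour of day and per (host, service) pair."""
    by_hour = Counter()
    by_check = Counter()
    for line in lines:
        m = TIMEOUT_RE.match(line.strip())
        if not m:
            continue  # skip "Runtime Error" lines and anything else
        by_hour[m["hour"]] += 1
        by_check[(m["host"], m["service"])] += 1
    return by_hour, by_check
```

If most of the counts fall into one or two hours, that points at a scheduled job (backup, updates) on the monitored server or on the Nagios server itself.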

What is the version of the NCPA agent and check_ncpa.py plugin that you are currently using?

Code:

/usr/local/nagios/libexec/check_ncpa.py -H <ip address> -t '<token>' -P 5693 -M system/agent_version
/usr/local/nagios/libexec/check_ncpa.py -V

Can you show us the 'Uptime' and 'Swap Usage' service configs on host 'EILIBWEBITMI01'? Please obfuscate sensitive data.

Re: Problems with UNKNOWN messages.

Posted: Thu Nov 08, 2018 4:00 am
by nagiosEngie
Hello lmiltchev,
The situation is getting worse. I am getting timeouts on more and more servers.
I had a look at the event log and found 3300 timeout messages in the last week alone.
This is now happening on 5 different servers.

The NCPA agent I am using is 2.1.3.
Nagios has been upgraded to the latest release, 5.5.6.
check_ncpa.py, Version 1.1.3
Do you think this could be related to high load on the Nagios server? Stats are in the attached image.

Thanks

Sandro

Re: Problems with UNKNOWN messages.

Posted: Thu Nov 08, 2018 10:50 am
by lmiltchev
Can you PM me your latest profile (Admin > System Config > System Profile > Download Profile)? We will need to review your configs and various logs.

Re: Problems with UNKNOWN messages.

Posted: Fri Nov 09, 2018 1:13 pm
by lmiltchev
For some reason nagios.log is missing from the profile. Can you PM me the log?

Also, send the ncpa_listener.log and win32service_ncpalistener.log from the Windows machine.

How long does it usually take to run these NCPA commands from the command line? Try running them several times, and time the check.

Example:

Code:

time /usr/local/nagios/libexec/check_ncpa.py -v -H <IP> -T 120 -t 'xxxxxx' -P 5693 -M memory/swap -u Gi -w 95 -c 98
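
To time the check several times without retyping it, something like the following could be used. It is a rough sketch, not part of the plugin: the helper name `time_command` and its defaults (10 runs, 130 second cap) are assumptions chosen to sit just above the 120 second plugin timeout used in this thread.

```python
import subprocess
import time


def time_command(argv, runs=10, timeout=130):
    """Run `argv` `runs` times and return a list of (elapsed_seconds, returncode).

    returncode is None when a run exceeded `timeout` seconds and was killed.
    Hypothetical helper for illustration only.
    """
    results = []
    for _ in range(runs):
        start = time.monotonic()
        try:
            proc = subprocess.run(
                argv,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            code = proc.returncode
        except subprocess.TimeoutExpired:
            code = None
        results.append((time.monotonic() - start, code))
    return results
```

Pass the check as a list, for example `["/usr/local/nagios/libexec/check_ncpa.py", "-H", "<IP>", "-T", "120", "-t", "xxxxxx", "-P", "5693", "-M", "memory/swap"]`, and compare the slowest run against the average.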

Re: Problems with UNKNOWN messages.

Posted: Tue Nov 13, 2018 10:10 am
by lmiltchev
There are errors in ncpa_listener.log such as this one:

2018-10-22 04:15:33,447:ERROR:database:database is locked
Traceback (most recent call last):
File "C:\ncpa\agent\listener\database.py", line 67, in add_check
OperationalError: database is locked

which means that the db is most probably corrupt.

Do the following:

1. Stop both the NCPA Listener and NCPA Passive services on the Windows machine.

2. Delete the db file - C:\Program Files (x86)\Nagios\NCPA\var\ncpa.db. It will be recreated when the services start.

3. Disable check logging in C:\Program Files (x86)\Nagios\NCPA\etc\ncpa.cfg by changing the check_logging value to zero:

Code:

check_logging = 0

Save and exit.

4. Start the NCPA Listener and NCPA Passive services.
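
If the same change has to be made on several servers, step 3 could be scripted. The sketch below only rewrites the text of the config; the function name `disable_check_logging` is hypothetical, and it assumes the file already contains a `check_logging = ...` line. Reading and writing ncpa.cfg, and stopping and starting the services, are left out on purpose.

```python
import re

# Matches a whole "check_logging = <value>" line, but not "check_logging_time = ...".
_CHECK_LOGGING_RE = re.compile(
    r"^([ \t]*check_logging[ \t]*=[ \t]*)[^\r\n]*", re.MULTILINE
)


def disable_check_logging(cfg_text):
    """Return `cfg_text` with the check_logging value set to 0.

    Line endings and all other settings are left untouched.
    Raises ValueError if no check_logging line is present.
    """
    new_text, count = _CHECK_LOGGING_RE.subn(r"\g<1>0", cfg_text)
    if count == 0:
        raise ValueError("no check_logging setting found")
    return new_text
```

Keep a backup copy of ncpa.cfg before writing the result back, and make the change while the services are stopped.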

Let us know if this helped.