bizarre statusjson.cgi error with config update

jjosephson
Posts: 4
Joined: Mon Jun 27, 2016 7:53 pm

bizarre statusjson.cgi error with config update

Post by jjosephson »

This is going to be tough to explain, but bear with me.

I've been using Nagios 4.1.1 for quite a while to monitor some servers. Recently I added a couple of new servers to the config files, and the Nagios website started reporting that Nagios wasn't running, even though it clearly was. Attempting to view anything on the website gave an "unable to read config files" error. The error logs were largely clean, but I eventually traced it back to the statusjson.cgi programstatus query failing. The error stated that it was unable to open a necessary file, but running the same command with strace showed all files opening successfully. Permissions are correct for all files statusjson.cgi needs to open, though I did notice that when the command failed, statusjson.cgi wouldn't attempt to read status.dat, even though it existed.

After a lot of troubleshooting I discovered that the error only occurs if the hostname of the new host I added starts with a number or a, b, or c. If the hostname starts with any character after c in the alphabet, the website works just fine and the statusjson.cgi command succeeds. I've tried retyping the config lines, renaming an existing host, and adding the host in a new config file, but nothing seems to matter other than the first character in the hostname.

This is seriously the most bizarre error I've ever seen in all my time as a sys engineer. Anyone ever seen something like this before? Is there some sort of weird race condition I'm managing to hit?

An example of a failing host config is:

define host{
    use             linux-server
    host_name       c.domain.com
    alias           aserver
    display_name    aserver
    address         123.456.123.456
# parents
    hostgroups      +Group1, Group2
    notes           My Notes
#notes_url
#action_url
    contact_groups  Us
}

If anyone has any ideas, I'd love to hear them.
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: bizarre statusjson.cgi error with config update

Post by rkennedy »

I haven't heard of something like this happening in 4.1.1 so far.

Can you post your nagios.log file for us to look at? It'll really help to see everything at play, and the actual log files so we can try to replicate this. Any other debugging you can provide will be helpful as well.

Are you using any additional addons in conjunction with Nagios?
Former Nagios Employee
jjosephson
Posts: 4
Joined: Mon Jun 27, 2016 7:53 pm

Re: bizarre statusjson.cgi error with config update

Post by jjosephson »

Sorry for the long delay; I had the server reimaged to make sure it wasn't a filesystem thing. The same error is still occurring after a fresh imaging.

Nagios.log contents:

Code:

$ cat nagios.log
[1467833213] Nagios 4.1.1 starting... (PID=4348)
[1467833213] Local time is Wed Jul 06 12:26:53 PDT 2016
[1467833213] LOG VERSION: 2.0
[1467833213] qh: Socket '/usr/local/nagios/var/rw/nagios.qh' successfully initialized
[1467833213] qh: core query handler registered
[1467833213] nerd: Channel hostchecks registered successfully
[1467833213] nerd: Channel servicechecks registered successfully
[1467833213] nerd: Channel opathchecks registered successfully
[1467833213] nerd: Fully initialized and ready to rock!
[1467833213] wproc: Successfully registered manager as @wproc with query handler
[1467833213] wproc: Registry request: name=Core Worker 4353;pid=4353
[1467833213] wproc: Registry request: name=Core Worker 4354;pid=4354
[1467833213] wproc: Registry request: name=Core Worker 4352;pid=4352
[1467833213] wproc: Registry request: name=Core Worker 4351;pid=4351
[1467833213] Successfully launched command file worker with pid 4355
[1467833304] wproc: Core Worker 4354: job 16 (pid=4498) timed out. Killing it
[1467833304] wproc: CHECK job 16 from worker Core Worker 4354 timed out after 60.01s
[1467833304] wproc:   host=stashtest; service=Check listing a GIT repo;
[1467833304] wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1467833304] wproc: Core Worker 4354: job 16 (pid=4498): Dormant child reaped

Additional info:

This is only a problem with the website. Nagios itself is running normally and performing checks as expected, and the config files load correctly in Nagios itself. The API is also working normally. The problem is only that statusjson.cgi fails to identify that Nagios is running, and consequently fails to load the cfg files, when I add a new server whose name starts with 0-9, a, b, or c.

We first saw this on Nagios 4.0.8 and upgraded to 4.1.1 with no change.

Running statusjson.cgi manually on the server gives an "Unable to Open File for Reading" error, but running it with strace does not show any errors opening files:

Code:

$ (export REQUEST_METHOD=GET; ./statusjson.cgi)
Cache-Control: no-store
Pragma: no-cache
Last-Modified: Wed, 06 Jul 2016 19:51:38 GMT
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-type: application/json; charset=utf-8

{
  "format_version": 0,
  "result": {
    "query_time": 1467834698000,
    "cgi": "statusjson.cgi",
    "program_start": 0,
    "type_code": 2,
    "type_text": "Unable to Open File for Reading",
    "message": "Error: Could not read some or all object configuration data!"
  },
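
The manual run above can also be checked from a script. This is a minimal sketch, not part of the original report: it splits captured CGI output into headers and body and summarises the "result" object. The function names are made up for illustration, the sample response is hand-closed because the paste above is cut off, and treating type_code 0 as success is an assumption.

```python
import json


def parse_cgi_response(raw):
    """Return the JSON body of a CGI response (headers, blank line, body)."""
    text = raw.replace("\r\n", "\n")
    # Everything before the first blank line is HTTP-style headers.
    _headers, _sep, body = text.partition("\n\n")
    return json.loads(body)


def describe_result(payload):
    """Summarise the "result" object seen in the statusjson.cgi output."""
    result = payload.get("result", {})
    code = result.get("type_code")
    if code == 0:  # assumption: 0 means the query succeeded
        return "ok"
    return "error {}: {} ({})".format(
        code, result.get("type_text"), result.get("message")
    )


if __name__ == "__main__":
    # Hypothetical, hand-closed version of the truncated output above.
    sample = (
        "Content-type: application/json; charset=utf-8\n"
        "\n"
        '{"format_version": 0, "result": {"cgi": "statusjson.cgi", '
        '"type_code": 2, "type_text": "Unable to Open File for Reading", '
        '"message": "Error: Could not read some or all object '
        'configuration data!"}, "data": {}}'
    )
    print(describe_result(parse_cgi_response(sample)))
```
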

With strace:

Code:

$ export REQUEST_METHOD=GET; strace -e open,access /nagios/sbin/statusjson.cgi
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
open("/lib64/libc.so.6", O_RDONLY)      = 3
Cache-Control: no-store
Pragma: no-cache
open("/etc/localtime", O_RDONLY)        = 3
Last-Modified: Wed, 06 Jul 2016 20:02:10 GMT
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-type: application/json; charset=utf-8

open("/usr/lib/locale/locale-archive", O_RDONLY) = 3
open("/usr/local/nagios/etc/cgi.cfg", O_RDONLY) = 3
open("/usr/local/nagios/etc/nagios.cfg", O_RDONLY) = 3
open("/usr/local/nagios/var/objects.cache", O_RDONLY) = 3
open("/usr/lib64/gconv/gconv-modules.cache", O_RDONLY) = 3

If I modify the config file to add a "d" in front of the hostname for the new server I added, strace now looks like this, and the website now works as expected:

Code:

$ export REQUEST_METHOD=GET; strace -e open,access /nagios/sbin/statusjson.cgi
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
open("/lib64/libc.so.6", O_RDONLY)      = 3
Cache-Control: no-store
Pragma: no-cache
open("/etc/localtime", O_RDONLY)        = 3
Last-Modified: Wed, 06 Jul 2016 20:00:07 GMT
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-type: application/json; charset=utf-8

open("/usr/lib/locale/locale-archive", O_RDONLY) = 3
open("/usr/local/nagios/etc/cgi.cfg", O_RDONLY) = 3
open("/usr/local/nagios/etc/nagios.cfg", O_RDONLY) = 3
open("/usr/local/nagios/var/objects.cache", O_RDONLY) = 3
open("/usr/local/nagios/var/status.dat", O_RDONLY) = 3
open("/usr/lib64/gconv/gconv-modules.cache", O_RDONLY) = 3
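
The only difference between the two traces is the status.dat open. For longer traces, a rough sketch like the following can do the comparison; it assumes the `strace -e open,access` line format shown above, and the function names are made up for illustration.

```python
import re

# Matches lines such as: open("/path/to/file", O_RDONLY) = 3
OPEN_RE = re.compile(r'^open\("([^"]+)",[^)]*\)\s*=\s*(-?\d+)')


def opened_files(strace_text):
    """Return the set of paths that were opened successfully."""
    files = set()
    for line in strace_text.splitlines():
        match = OPEN_RE.match(line.strip())
        # A negative return value means the open() call failed.
        if match and int(match.group(2)) >= 0:
            files.add(match.group(1))
    return files


def missing_opens(failing_trace, working_trace):
    """Paths opened in the working run but never opened in the failing run."""
    return opened_files(working_trace) - opened_files(failing_trace)


if __name__ == "__main__":
    failing = (
        'open("/usr/local/nagios/etc/nagios.cfg", O_RDONLY) = 3\n'
        'open("/usr/local/nagios/var/objects.cache", O_RDONLY) = 3\n'
    )
    working = failing + 'open("/usr/local/nagios/var/status.dat", O_RDONLY) = 3\n'
    print(sorted(missing_opens(failing, working)))
```

Lines that are not open() calls, such as the CGI's own header output mixed into the trace, are simply ignored.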
jjosephson
Posts: 4
Joined: Mon Jun 27, 2016 7:53 pm

Re: bizarre statusjson.cgi error with config update

Post by jjosephson »

Never mind, I just figured this out.

We had a service description with a comma in it. The new server I added was the first server alphabetically for this service as long as its name started with a digit or a, b, or c. For some reason the CGIs work fine with commas in service descriptions as long as the FIRST server in the object cache doesn't have a service with a comma in its description. If the first server does, the CGIs fail to load the configs.

I believe the config validator would normally have caught this, but one of our other engineers had tweaked our setup to ignore commas, so the validator did not fail.

Removing the comma from the service description has resolved our problem.
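
For anyone chasing the same symptom, here is a rough sketch for finding service descriptions that contain a comma in objects.cache. It assumes the usual "define service {" blocks with one whitespace-separated directive per line; the function name is made up, and the sample data reuses the placeholder hostnames from this thread.

```python
import re

# Match "define service {" but not servicedependency or serviceescalation.
SERVICE_START = re.compile(r"^define\s+service\s*\{$")


def services_with_commas(cache_text):
    """Return sorted (host_name, service_description) pairs with a comma."""
    hits = []
    fields = None  # None while outside a service block
    for raw in cache_text.splitlines():
        line = raw.strip()
        if SERVICE_START.match(line):
            fields = {}
        elif fields is not None and line == "}":
            desc = fields.get("service_description", "")
            if "," in desc:
                hits.append((fields.get("host_name", ""), desc))
            fields = None
        elif fields is not None and line:
            parts = line.split(None, 1)  # directive name, then its value
            fields[parts[0]] = parts[1] if len(parts) > 1 else ""
    return sorted(hits)


if __name__ == "__main__":
    sample = (
        "define service {\n"
        "\thost_name\tc.domain.com\n"
        "\tservice_description\tDisk /, /var\n"
        "\t}\n"
        "define service {\n"
        "\thost_name\td.domain.com\n"
        "\tservice_description\tPING\n"
        "\t}\n"
    )
    for host, desc in services_with_commas(sample):
        print(host, "->", desc)
```

The result is sorted by host name so the alphabetically first affected host is listed first, which is the case that broke the CGIs here.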
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: bizarre statusjson.cgi error with config update

Post by ssax »

So after it works can you just change it back to a through c and it fails again?

What does /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg show when it's broken?
jjosephson
Posts: 4
Joined: Mon Jun 27, 2016 7:53 pm

Re: bizarre statusjson.cgi error with config update

Post by jjosephson »

Yes, after removing the comma in the service description, everything in Nagios is working normally. The new server starting with c is in the configs and being checked and reported as expected.

Running nagios -v nagios.cfg never threw any errors while the CGIs were broken, but this is because one of our other engineers had removed "," as an illegal character.

It does seem like there's still a bug in the CGIs: if a service description includes a comma and the first server alphabetically includes that service, the CGIs will fail to load. But using the default config validation settings should prevent this.
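
A quick way to confirm whether a setup has been tweaked this way is to look at the illegal_object_name_chars directive in nagios.cfg. The sketch below assumes that is the setting that was changed (the thread does not name it), and the function names are made up for illustration.

```python
def illegal_chars(cfg_text):
    """Return the illegal_object_name_chars value, or None if it is not set."""
    value = None
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only: the value itself may contain "=".
        key, _sep, val = line.partition("=")
        if key.strip() == "illegal_object_name_chars":
            value = val  # the last occurrence in the file wins in this sketch
    return value


def comma_is_rejected(cfg_text):
    """True if the directive is present and lists the comma."""
    chars = illegal_chars(cfg_text)
    return chars is not None and "," in chars


if __name__ == "__main__":
    stock = "illegal_object_name_chars=`~!$%^&*|'\"<>?,()=\n"
    tweaked = "illegal_object_name_chars=`~!$%^&*|'\"<>?()=\n"
    print(comma_is_rejected(stock), comma_is_rejected(tweaked))
```

The sketch returns False when the directive is missing altogether; it makes no claim about what Nagios itself does in that case.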
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: bizarre statusjson.cgi error with config update

Post by tmcdonald »

Interesting find. Do you have a GitHub profile? If so, it would be greatly appreciated if you filed this on our Core page:

https://github.com/NagiosEnterprises/nagioscore

Otherwise, let me know and we can get it filed; we just need to replicate it first.
Former Nagios employee