bizarre statusjson.cgi error with config update

jjosephson · Post by **jjosephson** » Mon Jun 27, 2016 8:07 pm

This is going to be tough to explain, but bear with me.

I've been using Nagios 4.1.1 for quite a while to monitor some servers. Recently I added a couple of new servers to the config files, and the Nagios website started reporting that Nagios wasn't running, even though it clearly was. Attempting to view anything on the website gave an "unable to read config files" error. The error logs were largely clean, but I eventually traced it back to the statusjson.cgi programstatus query failing. The error stated that it was unable to open a necessary file, but running the same command under strace showed every file opening successfully. Permissions are correct for all files statusjson.cgi needs to open, though I did notice that when the command failed, statusjson.cgi never attempted to read status.dat, even though it existed.

After a lot of troubleshooting I discovered that the error only occurs if the hostname of the new host I added starts with a digit or with a, b, or c. If the hostname starts with any letter after c in the alphabet, the website works just fine and the statusjson.cgi command succeeds. I've tried retyping the config lines, renaming an existing host, and adding the host in a new config file, but nothing seems to matter other than the first character of the hostname.

This is seriously the most bizarre error I've ever seen in all my time as a sys engineer. Anyone ever seen something like this before? Is there some sort of weird race condition I'm managing to hit?

An example of a failing host config is:

define host{
    use linux-server
    host_name c.domain.com
    alias aserver
    display_name aserver
    address 123.456.123.456
    # parents
    hostgroups +Group1, Group2
    notes My Notes
    #notes_url
    #action_url
    contact_groups Us
}

If anyone has any ideas, I'd love to hear them.

rkennedy · Post by **rkennedy** » Tue Jun 28, 2016 9:52 am

I haven't heard of something like this happening in 4.1.1 so far.

Can you post your nagios.log file for us to look at? It'll really help to see everything at play, and the actual log files so we can try to replicate this. Any other debugging you can provide will be helpful as well.

Are you using any additional addons in conjunction with Nagios?

jjosephson · Post by **jjosephson** » Wed Jul 06, 2016 3:04 pm

Sorry for the long delay; I had the server reimaged to make sure it wasn't a filesystem issue. The same error is still occurring after a fresh image.

Nagios.log contents:

$ cat nagios.log
[1467833213] Nagios 4.1.1 starting... (PID=4348)
[1467833213] Local time is Wed Jul 06 12:26:53 PDT 2016
[1467833213] LOG VERSION: 2.0
[1467833213] qh: Socket '/usr/local/nagios/var/rw/nagios.qh' successfully initialized
[1467833213] qh: core query handler registered
[1467833213] nerd: Channel hostchecks registered successfully
[1467833213] nerd: Channel servicechecks registered successfully
[1467833213] nerd: Channel opathchecks registered successfully
[1467833213] nerd: Fully initialized and ready to rock!
[1467833213] wproc: Successfully registered manager as @wproc with query handler
[1467833213] wproc: Registry request: name=Core Worker 4353;pid=4353
[1467833213] wproc: Registry request: name=Core Worker 4354;pid=4354
[1467833213] wproc: Registry request: name=Core Worker 4352;pid=4352
[1467833213] wproc: Registry request: name=Core Worker 4351;pid=4351
[1467833213] Successfully launched command file worker with pid 4355
[1467833304] wproc: Core Worker 4354: job 16 (pid=4498) timed out. Killing it
[1467833304] wproc: CHECK job 16 from worker Core Worker 4354 timed out after 60.01s
[1467833304] wproc:   host=stashtest; service=Check listing a GIT repo;
[1467833304] wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1467833304] wproc: Core Worker 4354: job 16 (pid=4498): Dormant child reaped

Additional info:

This is only a problem with the website. Nagios itself is running normally and performing checks as expected, and the config file works correctly with Nagios itself. The API is also working normally. The problem is only that statusjson.cgi fails to detect that Nagios is running, and consequently fails to load the cfg files, when I add a new server whose name starts with 0-9, a, b, or c.

We first saw this on Nagios 4.0.8 and upgraded to 4.1.1 with no change.

Running statusjson.cgi manually on the server gives an "Unable to Open File for Reading" error, but running it under strace does not show any errors opening files:

$ (export REQUEST_METHOD=GET; ./statusjson.cgi)
Cache-Control: no-store
Pragma: no-cache
Last-Modified: Wed, 06 Jul 2016 19:51:38 GMT
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-type: application/json; charset=utf-8

{
  "format_version": 0,
  "result": {
    "query_time": 1467834698000,
    "cgi": "statusjson.cgi",
    "program_start": 0,
    "type_code": 2,
    "type_text": "Unable to Open File for Reading",
    "message": "Error: Could not read some or all object configuration data!"
  },
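For anyone repeating this test, here is a rough Python sketch of the same manual invocation that also pulls the error fields out of the reply. It assumes the CGI prints headers, a blank line, and then a complete JSON document with a "result" object like the one above. The helper names (split_cgi_response, cgi_error, run_cgi) are made up for illustration, and passing QUERY_STRING=query=programstatus is an addition that the command above does not set.

```python
import json
import os
import subprocess


def split_cgi_response(raw):
    """Split raw CGI output into (headers dict, body text)."""
    # CGI output separates the headers from the body with one blank line.
    head, _, body = raw.replace("\r\n", "\n").partition("\n\n")
    headers = {}
    for line in head.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers, body


def cgi_error(raw):
    """Return (type_code, type_text, message) from the JSON "result" object."""
    _, body = split_cgi_response(raw)
    result = json.loads(body)["result"]
    return result.get("type_code"), result.get("type_text"), result.get("message")


def run_cgi(path, query="query=programstatus"):
    """Run the CGI outside the web server by faking the CGI environment."""
    # QUERY_STRING is an assumption of this sketch; the post only sets REQUEST_METHOD.
    env = dict(os.environ, REQUEST_METHOD="GET", QUERY_STRING=query)
    done = subprocess.run([path], env=env, capture_output=True, text=True)
    return done.stdout
```

Something like cgi_error(run_cgi("./statusjson.cgi")) would then give the type_code, type_text, and message in one tuple, which is easier to compare between a failing and a working config than eyeballing the full output.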

With strace:

$ export REQUEST_METHOD=GET; strace -e open,access /nagios/sbin/statusjson.cgi
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
open("/lib64/libc.so.6", O_RDONLY)      = 3
Cache-Control: no-store
Pragma: no-cache
open("/etc/localtime", O_RDONLY)        = 3
Last-Modified: Wed, 06 Jul 2016 20:02:10 GMT
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-type: application/json; charset=utf-8

open("/usr/lib/locale/locale-archive", O_RDONLY) = 3
open("/usr/local/nagios/etc/cgi.cfg", O_RDONLY) = 3
open("/usr/local/nagios/etc/nagios.cfg", O_RDONLY) = 3
open("/usr/local/nagios/var/objects.cache", O_RDONLY) = 3
open("/usr/lib64/gconv/gconv-modules.cache", O_RDONLY) = 3

If I modify the config file to add a "d" in front of the hostname for the new server I added, strace now looks like this, and the website now works as expected:

$ export REQUEST_METHOD=GET; strace -e open,access /nagios/sbin/statusjson.cgi
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
open("/lib64/libc.so.6", O_RDONLY)      = 3
Cache-Control: no-store
Pragma: no-cache
open("/etc/localtime", O_RDONLY)        = 3
Last-Modified: Wed, 06 Jul 2016 20:00:07 GMT
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-type: application/json; charset=utf-8

open("/usr/lib/locale/locale-archive", O_RDONLY) = 3
open("/usr/local/nagios/etc/cgi.cfg", O_RDONLY) = 3
open("/usr/local/nagios/etc/nagios.cfg", O_RDONLY) = 3
open("/usr/local/nagios/var/objects.cache", O_RDONLY) = 3
open("/usr/local/nagios/var/status.dat", O_RDONLY) = 3
open("/usr/lib64/gconv/gconv-modules.cache", O_RDONLY) = 3

jjosephson · Post by **jjosephson** » Wed Jul 06, 2016 5:03 pm

Never mind, I just figured this out.

We had a service description with a comma in it. The new server I added was the first server alphabetically for this service whenever its name started with 0-9, a, b, or c. For some reason the CGIs work fine with commas in service descriptions as long as the FIRST server in the object cache doesn't include a comma. If the first server has a comma in the description, the CGIs fail to load the configs.

I believe the config validator would have normally caught this, but one of our other engineers had tweaked our setup to ignore commas, thus the validator would not fail.

Removing the comma from the service description has resolved our problem.
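A quick way to look for the same condition elsewhere is to scan the object definitions for service descriptions containing a comma. The Python sketch below assumes the usual define <type> { ... } block layout with one directive per line, separated from its value by whitespace, as in the host example earlier in the thread. The function names are made up, and it does not follow templates or handle inline ; comments.

```python
import re

# Accepts both "define service{" and "define service {".
DEFINE_RE = re.compile(r"^define\s+(\w+)\s*\{")


def parse_objects(text):
    """Yield (object_type, directives dict) for each define block in text."""
    obj_type, fields = None, {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        m = DEFINE_RE.match(line)
        if m:
            obj_type, fields = m.group(1), {}
        elif line == "}" and obj_type is not None:
            yield obj_type, fields
            obj_type = None
        elif obj_type is not None:
            # Directive name, then whitespace, then the rest of the line as value.
            parts = line.split(None, 1)
            fields[parts[0]] = parts[1] if len(parts) > 1 else ""


def services_with_commas(text):
    """Return (host_name, service_description) pairs whose description has a comma."""
    return [
        (fields.get("host_name", ""), fields["service_description"])
        for obj_type, fields in parse_objects(text)
        if obj_type == "service" and "," in fields.get("service_description", "")
    ]
```

Running services_with_commas over the contents of objects.cache (the file the CGIs read in the traces above) should list every host and service pair that could trigger the failure, regardless of where it sorts.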

ssax · Post by **ssax** » Wed Jul 06, 2016 5:05 pm

So once it works, if you change the hostname back to one starting with a through c, does it fail again?

What does /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg show when it's broken?

jjosephson · Post by **jjosephson** » Thu Jul 07, 2016 12:42 pm

Yes, after removing the comma in the service description, everything in Nagios is working normally. The new server starting with c is in the configs and being checked and reported as expected.

Running nagios -v nagios.cfg never threw any errors while the CGIs were broken, but that is because one of our other engineers had removed "," from the list of illegal characters.

It does seem like there's still a bug in the CGIs: if a service description includes a comma and the first server alphabetically includes that service, the CGIs fail to load. Using the default config validation settings should prevent this, though.
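To check whether a given install has the same tweak, something along these lines could inspect nagios.cfg. This is a sketch that assumes the setting in question is the illegal_object_name_chars directive, written as a plain key=value line. The function names are hypothetical, and when the directive is absent the sketch returns None instead of guessing what Nagios falls back to.

```python
def illegal_chars(nagios_cfg_text):
    """Return the illegal_object_name_chars value, or None if it is not set."""
    value = None
    for raw in nagios_cfg_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only; the value itself may contain "=".
        key, _, val = line.partition("=")
        if key.strip() == "illegal_object_name_chars":
            value = val  # in this sketch the last definition wins
    return value


def comma_is_rejected(nagios_cfg_text):
    """True or False if the directive is set, None if it is missing."""
    chars = illegal_chars(nagios_cfg_text)
    if chars is None:
        return None
    return "," in chars
```

A False result from comma_is_rejected on the text of nagios.cfg would mean the validator is letting commas through, which is the situation described in this post.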

tmcdonald · Post by **tmcdonald** » Thu Jul 07, 2016 3:58 pm

Interesting find. Do you have a GitHub profile? If so, we would really appreciate it if you filed this on our Core page:

https://github.com/NagiosEnterprises/nagioscore

Otherwise, let me know and we can get it filed. We just need to replicate it first.

Nagios Support Forum
