
[Gearman] Timeout shows RC 255 and full command (password exposed)

Posted: Wed Jun 25, 2025 10:20 am
by igae1
Hi everyone,

Not 100% sure if this is a Nagios XI or a Nagios Core issue.

I’m seeing an odd combination of return-code 255 and full-command echo when a check times out, and the command line contains a clear-text password.

I’m running a distributed check setup with Nagios Core 4.5.3 on the master using Mod Gearman:

Environment
  • Nagios Core 4.5.3 (single master)
  • mod-gearman NEB: 1.0.1 on the master
  • mod-gearman worker: 3.3.0 on a separate node
  • NCPA 2.4.4 on the target host, launching a custom Bash wrapper → pytest → Selenium
What happens
A service check that ultimately calls NCPA + a custom plugin sometimes exceeds its internal timeout.

Timeout settings
  • nagios.cfg service_check_timeout = 60 s
  • worker.conf job_timeout = 60 s (I know this is supposed to affect only event-handlers)
  • Plugin invoked with -T 59 s (internal timeout)
So the worker kills the process after ~60 s, returns the state code I configured (2 = CRITICAL), but also prepends the hard-wired “RC 255 out of bounds” phrase and — worst of all — prints the entire command, with the -p password expanded from a $USERn$ macro.

CRITICAL: Return code of 255 is out of bounds. (worker: <worker-host>)
Error: Plugin command (/bin/sh /usr/local/ncpa/plugins/launcher_test_az.sh -n Argument -h xvfb -u User -p Password -T 59) timed out. (59 sec)
The problem is that the entire command line is echoed back, including the -p argument that contains a production password (expanded from a $USER macro). Exposing credentials in plain text is a security concern.
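
To make the desired behaviour concrete, here is a minimal sketch of the kind of masking being asked for. It is not a mod-gearman feature, and nothing in this thread confirms that the worker offers a hook for it. The names `redact_command` and `SENSITIVE_FLAGS` are hypothetical, and the flag list is an assumption based on the `-p` argument shown above.

```python
import re

# Hypothetical list of flags whose values must never be displayed.
# "-p" comes from the command above; "--password" is an assumed long form.
SENSITIVE_FLAGS = ("-p", "--password")


def redact_command(text, flags=SENSITIVE_FLAGS, mask="****"):
    """Replace the value that follows each sensitive flag with a mask."""
    for flag in flags:
        # The flag must start a token (preceded by whitespace or start of text),
        # followed by whitespace or "=", then a quoted or bare value.
        # A bare value stops at whitespace or ")" because the worker wraps
        # the command in parentheses; a bare password containing ")" would
        # therefore only be partly masked.
        pattern = re.compile(
            r"(?<!\S)(" + re.escape(flag) + r")(\s+|=)"
            r"(\"[^\"]*\"|'[^']*'|[^\s)]+)"
        )
        text = pattern.sub(lambda m: m.group(1) + m.group(2) + mask, text)
    return text


line = (
    "Error: Plugin command (/bin/sh /usr/local/ncpa/plugins/launcher_test_az.sh "
    "-n Argument -h xvfb -u User -p Password -T 59) timed out. (59 sec)"
)
print(redact_command(line))
```

With the sample line above, the `-p` value is replaced by `****` and the rest of the message is left untouched.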

What I’ve tried / know so far
  • job_timeout in worker.conf is also 60 s, but docs say it only affects event-handlers.
  • Changing timeout_return alters the exit code, but the worker still prints the full command.
  • Yes, I could refactor the plugin to read the password from a file or env-var, but I’d prefer a Gearman-side fix.
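
For reference, the plugin-side refactor mentioned in the last point could look roughly like the sketch below. It assumes a Python entry point (the real wrapper is Bash), and the function name `load_password`, the environment variable `CHECK_PASSWORD`, and the credentials-file path are all hypothetical.

```python
import os
from pathlib import Path
from typing import Optional


def load_password(env_var: str = "CHECK_PASSWORD",
                  file_path: Optional[str] = None) -> str:
    """Return the password from the environment, else from a file.

    Because the secret never appears on the command line, an echoed
    command cannot leak it.
    """
    value = os.environ.get(env_var)
    if value:
        return value
    if file_path:
        # Drop the trailing newline that editors usually append.
        return Path(file_path).read_text(encoding="utf-8").rstrip("\r\n")
    raise RuntimeError(f"no password found in ${env_var} or a credentials file")
```

A credentials file used this way would normally be readable only by the account that runs the check.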
Questions
  1. Is there a worker.conf option (or patch) that suppresses or sanitises the command the worker prints when it times out?
  2. If not, has anyone found a workaround besides rewriting the plugin (e.g. show_error_output=no)?
  3. Could upgrading from mod-gearman to nagios-mod-gearman in the workers solve this problem?
  4. Would upgrading the NEB module to a newer nagios-mod-gearman release change this behaviour?
Any pointers would be greatly appreciated.
Thanks in advance!

Re: [Gearman] Timeout shows RC 255 and full command (password exposed)

Posted: Wed Jun 25, 2025 8:18 pm
by kg2857
Please forgive my dumb question.
Is gearman really needed? I dumped it and the ramdisk (both inherited from previous admins) many years ago.
Maybe look into reducing gearman's logging?

Re: [Gearman] Timeout shows RC 255 and full command (password exposed)

Posted: Thu Jun 26, 2025 12:33 am
by igae1
Hi, thanks for the suggestion!

In our setup Gearman is definitely required, but not because of the old “ramdisk trick.”
The sole reason is the number of checks we must schedule and process.

Current architecture
  • One Nagios XI master that hands out roughly 30 000 active service checks every five minutes
  • Six dedicated worker servers running mod-gearman-worker 3.3.0
  • nagios-mod-gearman 1.0.1 on the master to dispatch the jobs
With that load the master alone can’t cope; Gearman spreads the work across the six workers.
We never relied on any tmpfs/ramdisk tuning, so dropping that trick wouldn’t eliminate our need for Gearman.

Logging level
Both the NEB module and the workers are already set to log_level=0 (errors only), so verbosity isn’t the source of the problem.

What remains an issue
When a check times out, the worker echoes the entire command line—including a clear-text password expanded from a $USER macro. I’m looking for a way to suppress or mask those sensitive arguments while keeping the distributed setup.
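
One possible direction, sketched under assumptions: a thin wrapper that enforces its own deadline and prints a timeout message without the command line. The idea is that if the wrapper always finishes before the worker's limit is reached, the timeout text comes from the wrapper and not from the worker. The name `run_with_deadline` and the 55-second value are hypothetical, and whether this really keeps the worker from printing its own message would need testing in this setup.

```python
import subprocess
import sys

CRITICAL = 2  # standard Nagios plugin exit code for CRITICAL


def run_with_deadline(argv, deadline_s):
    """Run argv; on timeout return CRITICAL and a message that omits argv."""
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=deadline_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills only the direct child on timeout; grandchildren
        # (e.g. a browser started by the test runner) may need extra cleanup.
        return CRITICAL, f"CRITICAL: check timed out after {deadline_s} s"
    return proc.returncode, proc.stdout.strip()


if __name__ == "__main__" and len(sys.argv) > 1:
    # 55 s is an assumed value chosen to stay below the 60 s limits above.
    code, message = run_with_deadline(sys.argv[1:], deadline_s=55)
    print(message)
    sys.exit(code)
```

This does not remove the password from the arguments. It only changes which process reports the timeout, so it works best combined with keeping the secret off the command line.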

Any pointers are much appreciated—thanks again!

Re: [Gearman] Timeout shows RC 255 and full command (password exposed)

Posted: Thu Jun 26, 2025 11:39 pm
by kg2857
XI knows nothing about gearman.
My ramdisk mention was just an aside because it is/was a performance trick. Apologies for the confusion.
I'd ask how many hosts for those 30k checks and guess maybe 30k is too many due to a design issue. What are the 30k checks checking?

Re: [Gearman] Timeout shows RC 255 and full command (password exposed)

Posted: Fri Jun 27, 2025 3:22 am
by BrianKnight
I am replying here so that I can keep track of this thread.

Re: [Gearman] Timeout shows RC 255 and full command (password exposed)

Posted: Thu Jul 03, 2025 12:38 am
by igae1
kg2857 wrote: Thu Jun 26, 2025 11:39 pm
XI knows nothing about gearman.
My ramdisk mention was just an aside because it is/was a performance trick. Apologies for the confusion.
I'd ask how many hosts for those 30k checks and guess maybe 30k is too many due to a design issue. What are the 30k checks checking?

Over 2k hosts.

Re: [Gearman] Timeout shows RC 255 and full command (password exposed)

Posted: Thu Jul 03, 2025 2:11 am
by kg2857
What is the mix of unique host types, for example Windows, Linux, SNMP, others?
How many CPU usage services are there for Linux hosts, for example?
Perhaps the sheer number of checks (services) isn't helping. It's just a guess, but it would be interesting to know whether Gearman or Nagios performance is related to the number of services.
Nagios XI sort of steers (inexperienced) folks toward a wizard thought process rather than a Nagios Core one. Wizards define very host-oriented services. For example, each host gets its own CPU check rather than all Linux hosts sharing a single service.
Nagios Core (which XI runs on), being a config-file-based system, favors things like templates and hostgroups to keep things generic.