execvp(/bin/sh, ...) failed. errno is 7: Argument list too long
Posted: Mon Nov 13, 2023 9:56 pm
Hi
Recently we ran into an issue with Nagios: it intermittently failed to send notifications. Going through nagios.log, we found the following errors:
[1699879778] wproc: NOTIFY job 92385 from worker Core Worker 5369 is a non-check helper but exited with return code 7
[1699879778] wproc: host=sv960-lbp5.eq; service=Keepalived State; contact=opsgenie_xxx_team
[1699879778] wproc: early_timeout=0; exited_ok=1; wait_status=1792; error_code=0;
[1699879778] wproc: stderr line 01: execvp(/bin/sh, ...) failed. errno is 7: Argument list too long
However, nagios.debug indicated that the argument was only 208 bytes, well under the ARG_MAX limit. We then dug into the source code in lib/runcmd.c and added explain_execvp() and explain_message_execvp() after the execvp() calls, which gave us a more detailed error message:
failed, Argument list too long (7, E2BIG) because the total number of bytes in the argument list (argv) plus the environment (envp) is too large (143372 > 5242880)
This indicated that the arguments plus all the environment variables came to about 140k, which was much larger than we expected, since we never see all those environment variables in any of our notifications.
Further reading on execvp() suggested that it is limited to 128k (MAX_ARG_STRLEN), which seemed to explain why it failed with ~140k. At that point, however, we still had no idea where the extra bytes were coming from.
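For anyone who wants to check the same thing on their own system, here is a minimal sketch of how the size of argv or envp could be measured just before the execvp() call. The struct and function names are hypothetical and not part of Nagios. It reports both the total and the largest single string because, as far as the Linux execve(2) man page describes it, MAX_ARG_STRLEN bounds one string while the overall total is bounded separately.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Nagios: summarises the size of a
 * NULL-terminated string vector such as argv or envp. */
struct vec_size {
    size_t total_bytes;   /* sum of strlen() + 1 over all strings */
    size_t largest_bytes; /* strlen() + 1 of the longest single string */
    size_t count;         /* number of strings in the vector */
};

static struct vec_size measure_vector(char *const vec[])
{
    struct vec_size s = {0, 0, 0};

    if (vec == NULL)
        return s;

    for (size_t i = 0; vec[i] != NULL; i++) {
        size_t n = strlen(vec[i]) + 1; /* include the terminating NUL */

        s.total_bytes += n;
        if (n > s.largest_bytes)
            s.largest_bytes = n;
        s.count++;
    }
    return s;
}
```

Calling measure_vector(environ) (with `extern char **environ;` declared) in the child right before execvp() and logging the three fields should show whether the total or one oversized variable is the problem. The result can be compared against sysconf(_SC_ARG_MAX) for the total.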
After investigating the environment variables that Nagios populates, we found that many of the variables set during a notification were not used, and that one of them, NAGIOS_SERVICEGROUPMEMBERS, took up 44k+. After we added a few lines of code to runcmd_setenv() to filter out NAGIOS_SERVICEGROUPMEMBERS, Nagios sent out notifications without trouble and the error above no longer appeared.
I understand that many Nagios users work at companies much bigger than ours and monitor far more devices. I would like to know whether anyone has encountered the same problem and how they addressed it. Is there a better solution than hacking the code? Most importantly, how does Nagios address the execvp() limitation?
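The filter was roughly along these lines. This is only a sketch of the idea and not the actual patch: the deny list and the function name are hypothetical, and only the NAGIOS_SERVICEGROUPMEMBERS entry comes from our case.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical deny list of environment variable names to skip.
 * Only NAGIOS_SERVICEGROUPMEMBERS is taken from the case above. */
static const char *const blocked_env_names[] = {
    "NAGIOS_SERVICEGROUPMEMBERS",
    NULL
};

/* Returns 1 if the variable should NOT be exported, 0 otherwise.
 * A NULL name is treated as blocked so callers never export it. */
static int env_name_is_blocked(const char *name)
{
    if (name == NULL)
        return 1;

    for (size_t i = 0; blocked_env_names[i] != NULL; i++) {
        if (strcmp(name, blocked_env_names[i]) == 0)
            return 1;
    }
    return 0;
}
```

The code that sets up the child's environment would call env_name_is_blocked(name) for each variable and skip the setenv() call when it returns 1. More names can be added to the list if other large, unused variables turn up.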
Thanks in advance.
Sherman