no alert triggered on service down

nky1986 · Post by **nky1986** » Wed Jan 10, 2018 11:48 pm

one more point

if you look at the nagios logs or events for that day, the service never went through the soft 1, 2, 3 states. that also suggests the service check was missed and, as a result, no alert was sent

but surprisingly, the graphs were plotted properly

Post by **tgriep** » Thu Jan 11, 2018 2:35 pm

I saw the same thing. I did not see any entries on the service being down that day.
One possible cause for what happened is a stuck nagios process running.
When 2 nagios daemons are running, they conflict with each other and can cause the issue you had. Without access to the server on that day, though, it is hard to determine whether that was the case.
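
For anyone who wants to check for a duplicate daemon while the problem is happening, here is a minimal sketch. It assumes the usual install path /usr/local/nagios/bin/nagios and that a master daemon is a process started with -d whose parent is PID 1; the function name is made up for illustration.

```shell
#!/bin/sh
# Count Nagios master daemons in `ps -eo pid,ppid,args` style input.
# Assumption: a master daemon has parent PID 1 and runs the nagios
# binary in daemon mode (-d). Worker processes have the master as parent.
count_nagios_masters() {
    awk '$2 == 1 && $3 ~ /\/nagios$/ && $0 ~ / -d / { n++ } END { print n + 0 }'
}
```

On a live system this would be run as `ps -eo pid,ppid,args | count_nagios_masters`. A result above 1 would point to a stuck second daemon.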

nky1986 · Post by **nky1986** » Thu Jan 11, 2018 11:59 pm

what is the next step?

we need to find out RCA of this issue and fix it.

npolovenko · Post by **npolovenko** » Fri Jan 12, 2018 12:22 pm

@nky1986, let's try this. Can you open the Service Status Detail page for the service in question and go to the Advanced tab? After that, please submit a critical passive check result 3 times so that the service gets into a hard critical state. How long does it take until you receive an email alert?

nky1986 · Post by **nky1986** » Tue Jan 16, 2018 1:29 pm

when i try to do that, there are four options under check result i.e. OK, WARNING, UNKNOWN, CRITICAL. what shall i choose here?

dwhitfield · Post by **dwhitfield** » Tue Jan 16, 2018 1:37 pm

npolovenko wrote: submit a critical passive check result 3 times

Critical is what was suggested, but ultimately it doesn't matter exactly. You just need to select one that will send notifications. Thus, if it's OK now, sending another OK is not going to trigger a notification. Critical is certainly the most likely to cause a notification.
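
The same passive result can also be submitted from the command line through the Nagios external command file. This is only a sketch: the helper name is hypothetical, and the command file path in the usage note is the common default, which may differ on your install.

```shell
#!/bin/sh
# Build one PROCESS_SERVICE_CHECK_RESULT external command line.
# Arguments: epoch timestamp, host name, service description,
# return code (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN), plugin output.
passive_result_line() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' "$1" "$2" "$3" "$4" "$5"
}
```

For example, `passive_result_line "$(date +%s)" myhost "My Service" 2 "manual test" > /usr/local/nagios/var/rw/nagios.cmd` would submit one critical result, where myhost and "My Service" are placeholders for the real host and service names.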

nky1986 · Post by **nky1986** » Tue Jan 16, 2018 1:51 pm

i have submitted 3 critical passive checks and below are the current settings:

Checks are configured as:
- Check interval: 140
- Retry interval: 70
- Max check attempts: 3

so when should I expect the alert notification?
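
For reference, the settings above can be turned into a rough delay for regular active checks. This is a sketch that assumes the intervals are in minutes (the Nagios default when interval_length is 60) and that one retry interval passes between attempts; the function name is hypothetical. Passive results submitted by hand each count as one attempt, so they do not wait on these intervals.

```shell
#!/bin/sh
# Minutes from the first failed active check to the HARD state,
# assuming one retry interval between each of the remaining attempts.
# Arguments: retry interval (minutes), max check attempts.
minutes_to_hard_state() {
    echo $(( ($2 - 1) * $1 ))
}
```

With a retry interval of 70 and 3 max check attempts, `minutes_to_hard_state 70 3` prints 140.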

dwhitfield · Post by **dwhitfield** » Tue Jan 16, 2018 2:16 pm

If you submit 3 (and they show up in the nagios.log), it should come out immediately unless you have a notification delay. I see now that you have postgres so the repair I had you run earlier is less likely to have resolved the issue. Let's try the following.

The commands are different if you are using a version of PostgreSQL before v9. To determine which version you have, execute the following command:

postgres -V

Based on that output, execute the commands specific to your version:

Versions BEFORE 9

echo "vacuum;vacuum analyze;"|psql nagiosxi postgres
service postgresql restart

Versions 9 onwards

echo "vacuum;vacuum analyze;vacuum full;"|psql nagiosxi postgres
service postgresql restart
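
The version check and the choice between the two command sets above could be scripted roughly as follows. This is a sketch, not an official procedure: both function names are hypothetical, and it assumes `postgres -V` prints a line such as "postgres (PostgreSQL) 8.4.20".

```shell
#!/bin/sh
# Extract the major version from `postgres -V` output on stdin,
# e.g. "postgres (PostgreSQL) 8.4.20" -> 8
pg_major() {
    sed -n 's/^.*) \([0-9][0-9]*\)\..*$/\1/p'
}

# Print the vacuum statements listed above for the given major version.
vacuum_sql_for() {
    if [ "$1" -lt 9 ]; then
        echo 'vacuum;vacuum analyze;'
    else
        echo 'vacuum;vacuum analyze;vacuum full;'
    fi
}
```

Usage would look like `vacuum_sql_for "$(postgres -V | pg_major)" | psql nagiosxi postgres`, followed by the service restart.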

To log in to PostgreSQL manually, run:

psql nagiosxi nagiosxi

To view the tables, run:

\d

and to exit:

\q

You may see the following error message when you run the vacuum or try to log in to the database manually:

psql: FATAL:  database is not accepting commands to avoid wraparound data loss in database "postgres"
HINT:  Stop the postmaster and use a standalone backend to vacuum database "postgres".

You may also notice either high CPU usage for the postmaster process or a repeated error message in the logs under /var/lib/pgsql/data/pg_log:

transaction ID wrap limit is 2147484146
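
To see how close each database is to the wrap limit, the transaction ID age can be queried and filtered. This is a sketch under assumptions: the function name is hypothetical, and the 2000000000 threshold in the usage note is an illustrative value a little below the roughly 2.1 billion limit, not an official cutoff.

```shell
#!/bin/sh
# Read "datname|age" rows (unaligned psql output) on stdin and print the
# databases whose transaction ID age is above the limit given as $1.
dbs_over_xid_age() {
    awk -F'|' -v limit="$1" '$2 + 0 > limit + 0 { print $1 }'
}
```

It could be fed with `psql -At -c "SELECT datname, age(datfrozenxid) FROM pg_database" nagiosxi postgres | dbs_over_xid_age 2000000000`. Any database it prints is a candidate for the vacuum steps that follow.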

You can try to fix the issue by running the following commands on the command line.

Important: Run the commands one at a time (do not run them all in one go).

Versions BEFORE PostgreSQL 9

service postgresql stop
su postgres
echo "VACUUM;" > /tmp/fix.sql
postgres -D /var/lib/pgsql/data nagiosxi < /tmp/fix.sql
postgres -D /var/lib/pgsql/data postgres < /tmp/fix.sql
postgres -D /var/lib/pgsql/data template1 < /tmp/fix.sql
exit
service postgresql start

Note: The commands listed above may not work with some versions of PostgreSQL. If you see the following error:

postgres: invalid argument: "nagiosxi"

You will need to run the following commands instead:

service postgresql stop
su postgres
echo "VACUUM;" > /tmp/fix.sql
postgres --single -D /var/lib/pgsql/data nagiosxi < /tmp/fix.sql
postgres --single -D /var/lib/pgsql/data postgres < /tmp/fix.sql
postgres --single -D /var/lib/pgsql/data template1 < /tmp/fix.sql
exit
service postgresql start

Versions 9 Onwards

service postgresql stop
su postgres
echo "VACUUM FULL;" > /tmp/fix.sql
postgres -D /var/lib/pgsql/data nagiosxi < /tmp/fix.sql
postgres -D /var/lib/pgsql/data postgres < /tmp/fix.sql
postgres -D /var/lib/pgsql/data template1 < /tmp/fix.sql
exit
service postgresql start

Note: The commands listed above may not work with some versions of PostgreSQL. If you see the following error:

postgres: invalid argument: "nagiosxi"

You will need to run the following commands instead:

service postgresql stop
su postgres
echo "VACUUM FULL;" > /tmp/fix.sql
postgres --single -D /var/lib/pgsql/data nagiosxi < /tmp/fix.sql
postgres --single -D /var/lib/pgsql/data postgres < /tmp/fix.sql
postgres --single -D /var/lib/pgsql/data template1 < /tmp/fix.sql
exit
service postgresql start

nky1986 · Post by **nky1986** » Tue Jan 16, 2018 2:40 pm

we are running with postgres 8.4.20 and i performed below operations

[root@Spectraguard ~]# echo "vacuum;vacuum analyze;"|psql nagiosxi postgres
VACUUM
VACUUM

[root@Spectraguard ~]# service postgresql restart
Stopping postgresql service: [ OK ]
Starting postgresql service: [ OK ]

[root@Spectraguard ~]# psql nagiosxi nagiosxi
\d
\q

i didn't face any error during all that.

shall i submit critical passive checks now?

dwhitfield · Post by **dwhitfield** » Tue Jan 16, 2018 4:42 pm

nky1986 wrote:
shall i submit critical passive checks now?

Yes.

When you did this before, did they show up in the nagios.log? What happens if you change your notification command? Have your notifications ever worked on this server? Was there an upgrade in between it working and not working?
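
One way to check whether the submitted results reached the log is to filter nagios.log for the service. This is a rough sketch: the function name is hypothetical, the log path in the usage note is the common default, and PASSIVE SERVICE CHECK lines only appear when passive check logging is enabled. Host and service names are used as regular expressions here, so names with special characters would need escaping.

```shell
#!/bin/sh
# Print log entries for one host/service pair that show a passive result,
# a state change, or a notification. Reads nagios.log text on stdin.
# Arguments: host name, service description.
service_events() {
    grep -E "(PASSIVE SERVICE CHECK|SERVICE ALERT|SERVICE NOTIFICATION): .*$1;$2;"
}
```

For example, `service_events myhost "My Service" < /usr/local/nagios/var/nagios.log`, with placeholder names. Seeing the passive checks and a HARD alert but no SERVICE NOTIFICATION line would narrow the problem down to the notification step.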

Nagios Support Forum
