[Nagios-devel] Re: [netsaint] RFC - advice actually - on availability reporting

Post by **Guest** » Sun Oct 20, 2002 4:35 pm

Hi Stanley

This is a good direction for capabilities. (I've also copied the
nagios-devel list, as the discussion would probably be more relevant
there.)

I think the SLA info, while it could be entered into the config files,
should be ignored by the Nagios monitor itself.

The report might want to include the start/stop time (duration) of the
event and any acknowledgement/comment(s) entered into Nagios about the
host/service.

If we start looking at the 2.0 interface, it might very easily be possible
for the reporting/event daemon to update the log with the downtime duration
when writing out the UP log entry. The event daemon would have to keep a
state flag and a couple of timestamp entries in the struct.
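
As a rough illustration of the state flag and timestamps mentioned above, here is a minimal Python sketch. It is not Nagios code: the `ServiceState` record and the `record_event` function are hypothetical names, and the real daemon would keep these fields in its own C struct. The idea is only that the downtime can be computed at the moment the recovery entry is written.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceState:
    """Hypothetical per-host/service record: a state flag plus timestamps."""
    is_down: bool = False
    down_since: Optional[int] = None   # epoch seconds when the outage began
    last_change: Optional[int] = None  # epoch seconds of the latest state change


def record_event(state: ServiceState, timestamp: int, status: str) -> Optional[int]:
    """Update the record; return the downtime in seconds on recovery, else None."""
    if status != "OK":
        # Only the first non-OK event starts the outage clock.
        if not state.is_down:
            state.is_down = True
            state.down_since = timestamp
            state.last_change = timestamp
        return None
    if state.is_down:
        # Recovery: the duration is known exactly when the OK/UP entry is written.
        duration = timestamp - state.down_since
        state.is_down = False
        state.down_since = None
        state.last_change = timestamp
        return duration
    return None


# Timestamps taken from the first CRITICAL/OK pair in the quoted log below.
state = ServiceState()
first = record_event(state, 1034949832, "CRITICAL")
second = record_event(state, 1034950132, "OK")
```

With those two events the second call returns 300, i.e. a five-minute outage that could be appended to the recovery log entry.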

-sg

On Sat, 19 Oct 2002, Stanley Hopcroft wrote:

> Dear Ladies and Gentlemen,
>
> I am writing with a plea for advice, or perhaps to point out an
> opportunity for Nag/Netsaint development.
>
> The opportunity is that while Netsaint (and AFAIK Nagios) availability
> reporting is magnificent, it doesn't meet management requirements of
> being able to report against SLA.
>
> The basic reason for this is that Netsaint is probably quite justifiably
> ignorant of SLA factors such as
>
> 1. what elements (routers, servers, network nodes) are in a service
>
> 2. what constitutes the agreed level of service
>
> (For example, if one is providing a LAN service
>
> - all the switches, routers, SLBs serving the clients and servers are in
> the service; as are the DNS, WINS, LDAP/AD and DHCP servers
>
> - the agreement may be specified by completely arbitrary functions such
> as
>
> Service is OK if { DHCP, DNS and WINS servers are up 100%
> { some proportion of client network nodes (switches
> etc) are up 100%
> { all server network nodes are up 100%
> )
>
> It seems to me that the second requirement - the specification of the
> SLA function/agreement is completely arbitrary or site dependent and
> therefore has no relationship with Nag/Netsaint whatsoever. It belongs to
> the reporting package - the bit that takes the SLA function, the
> host/service downtime and produces the report.
>
> (OTOH, Netsaint seems to adopt a simple SLA of agreement per host or
> service based on the proportion of host/service up time.
>
> Then again, it could be said that the reporting function _could_ employ
> the existing host/service (node) downtime provided by avail.cgi
> [the CSV report of node availability]
>
> However, my experience is that even mailing the output of avail.cgi as
> an Excel attachment [set MIME type of attachment] has failed to satisfy
> the local PHBs).
>
> The approach my colleagues and I would therefore like to adopt is to
> store in an ODBC-accessible database (MySQL) records of node downtimes,
> e.g.
>
> host_name, service_description, downtime, time_date (prob at the end of
> the downtime)
>
> and let folks report how they like from that using the reporting tools
> they choose. They can see the downs [which they may already know about]
> so rather than gawk at %UP/OKs they see the list of downs for that
> node.
>
> Please would you comment on
>
> 1 How helpful or otherwise you think this approach may be
>
> 2 How to update the DB with the interval for which the node was
> unavailable (yes, the DB is really only acting as a file store but one
> that is accessible to authorised users from their Win desktop, and that
> provides simple queries).
>
> In regard to 2, the global service handlers seem to be a means of
> responding to HARD state changes. Logging them could produce a list
> like
>
> 1034949832 mvs;Logon to production database;CRITICAL;HARD;3 Menu .. not
> found: FWP use this service either press ENTER for guest access type
> valid Userid and pres s ENTER For more information press N20205 SELECTED
> APPLICA
> 1034950132 mvs;Logon to production database;OK;HARD;3 Logon to
> Production database Ok.
> 1034968553 mvs;FTP;CRITICAL;HARD;Socket timeout after 10 seconds
> 1034968843 mvs;FTP;OK;HARD;FTP ok - 1 second response time
>
> that could be batch processed to update t

...[email truncated]...
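
The batch processing that the quoted message starts to describe could look roughly like the sketch below. It rests on several assumptions: the log format is inferred from the four sample lines only (epoch, then host, service, state and state type separated by semicolons), the first sample line's plugin output is abbreviated here, the table name `node_downtime` is made up, and `sqlite3` stands in for the MySQL/ODBC database so that the example is self-contained.

```python
import sqlite3

# Sample lines modelled on the quoted log; plugin output on line 1 is abbreviated.
SAMPLE_LOG = """\
1034949832 mvs;Logon to production database;CRITICAL;HARD;3 Menu .. not found
1034950132 mvs;Logon to production database;OK;HARD;3 Logon to Production database Ok.
1034968553 mvs;FTP;CRITICAL;HARD;Socket timeout after 10 seconds
1034968843 mvs;FTP;OK;HARD;FTP ok - 1 second response time
"""


def downtimes(lines):
    """Yield (host, service, downtime_seconds, recovery_epoch) for each outage."""
    down_since = {}  # (host, service) -> epoch of the first HARD non-OK event
    for line in lines:
        line = line.strip()
        if not line:
            continue
        stamp, rest = line.split(" ", 1)
        fields = rest.split(";", 4)
        if len(fields) < 4:
            continue  # not a state-change line in the assumed format
        host, service, status, state_type = fields[:4]
        if state_type != "HARD":
            continue
        key = (host, service)
        if status == "OK":
            start = down_since.pop(key, None)
            if start is not None:
                yield host, service, int(stamp) - start, int(stamp)
        else:
            # Keep the earliest timestamp if several non-OK events arrive in a row.
            down_since.setdefault(key, int(stamp))


def store(conn, rows):
    """Write outage rows using the column names proposed in the quoted message."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node_downtime ("
        "host_name TEXT, service_description TEXT, "
        "downtime INTEGER, time_date INTEGER)"
    )
    conn.executemany("INSERT INTO node_downtime VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
store(conn, list(downtimes(SAMPLE_LOG.splitlines())))
rows = conn.execute(
    "SELECT host_name, service_description, downtime "
    "FROM node_downtime ORDER BY time_date"
).fetchall()
```

For the sample lines this stores two outages, 300 seconds for the database logon and 290 seconds for FTP, each stamped with the time of recovery as suggested in the quoted message.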

This post was automatically imported from historical nagios-devel mailing list archives
Original poster: sghosh@sghosh.org