Nagios and EDAC support

Professor Balthazar
Posts: 15
Joined: Wed Feb 26, 2014 6:53 am
Location: Stockholm, Sweden.

Nagios and EDAC support

Post by Professor Balthazar »

I'm a system manager at Ericsson and I'm new to Nagios. We started with a trial version to see how it works together with our HW.

I would like to build a demonstration unit with Nagios support. The network monitoring works fine!
On a couple of blades (x86-based, 6-12 cores, 24-256 GB RAM, 3 x 400/600 GB HDD), I also want the agents to provide integrated monitoring and warnings for HW components (primary memory and SSD/HDD).

The SUSE Linux (SLES) distribution on the blades has EDAC support. I have tried to use components from Nagios and add-ons to Nagios.
For me/us, HW monitoring and notification is a priority. It's very important to have warnings for memory DIMMs and for read/write errors on media (SSD/HDD).

The main memory (24 GB) consists of 3 DIMMs, and the edac (bash) script, run with a simulated/injected test fault, returns:

gep-eqmmgr:/home/qstenli # bash edac
<<<edac.mem>>>
mc0 csrow0 8192 Registered-DDR3 5 0 S8ECD8ED
mc0 csrow1 8192 Registered-DDR3 0 0 S8ECD8ED
mc0 csrow2 8192 Registered-DDR3 0 0 S8ECD8ED

I have attached a file (.jpg) below with the Nagios notification. So far so good! But I would like the notification to include the number of errors, in this case: (5) Correctable ECC errors found!
I have tried emailing support but have had no answer.

/BR
Sten-Åke Lindell
Attachments
The notification on Nagios!
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Nagios and EDAC support

Post by scottwilkerson »

Can you point us to the plugin you are using?
Former Nagios employee
Professor Balthazar
Posts: 15
Joined: Wed Feb 26, 2014 6:53 am
Location: Stockholm, Sweden.

Re: Nagios and EDAC support

Post by Professor Balthazar »

I have tried NRPE and also check_mk, with the same result.

If I use the combination, the message reads 'WARN - WARNING ...'. A plain 'WARNING' would be the proper form, but this mismatch is built in somewhere.

I used this plugin
https://bitbucket.org/darkfader/nagios/ ... ugins/edac
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Nagios and EDAC support

Post by tmcdonald »

What is the output if you run that plugin from the command line? I have a feeling it might have to do with trimming the newline, but I can't say for sure. Can you run it and show the output?
Former Nagios employee
Professor Balthazar
Posts: 15
Joined: Wed Feb 26, 2014 6:53 am
Location: Stockholm, Sweden.

Re: Nagios and EDAC support

Post by Professor Balthazar »

The output was shown in my first post. I'm going to switch back to a pure Nagios NRPE setup for this plugin because the inventory function in cmk does not work.
I have made changes to other plugins and got them going, but this one just won't work the way I want. I think it is a common issue for all plugins once they have been changed.

Command line:
gep-eqmmgr:/home/qstenli # bash edac

Plugin output:
<<<edac.mem>>>
mc0 csrow0 8192 Registered-DDR3 5 0 S8ECD8ED
mc0 csrow1 8192 Registered-DDR3 0 0 S8ECD8ED
mc0 csrow2 8192 Registered-DDR3 0 0 S8ECD8ED
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Nagios and EDAC support

Post by abrist »

This plugin is very small and basic:

Code: Select all

#!/bin/bash
# 
# This agent plugin for Check_MK / Nagios is supposed to read data provided by the linux EDAC driver
# Right now, the driver can read status of memory modules and pci busses and we'll try to monitor that.

# The best stop for all things EDAC is http://buttersideup.com/edacwiki/ and the edac.txt in the kernel doc.

EDAC=false
 
if [ -d /sys/devices/system/edac/mc ]; then 
    EDAC=true
fi
if [ -d /sys/devices/system/edac/pci ] && [ "$(cat /sys/devices/system/edac/pci/check_pci_errors 2>/dev/null)" = 1 ]; then
    EDAC=true
fi

if [ $EDAC = "true" ]; then
    echo '<<<edac>>>'
fi


# EDAC memory reporting
# Example output:
# mc0 csrow0 8192 Registered-DDR3 0 0 S4ECD4ED 
# EDAC memory driver running
# Iterate all memory controllers, print memory info and EDAC mode. Maybe later add DIMM label support.
for mc in /sys/devices/system/edac/mc/* ; do 
    test -d  $mc || break
    for csrow in $mc/csrow* ; do 
        echo "$(basename $mc) $(basename $csrow) $(cat $csrow/{size_mb,mem_type,ce_count,ue_count,edac_mode} | tr '\n' ' ')" 
    done
done

# EDAC pci error reporting
# By default, Linux does ignore PCI crc/ecc. Only do checks if the admin enabled it.
#   echo "Not yet supported"
Have you made any changes to it to suit your needs?
Or are you planning on keeping check_mk so you can use:

Code: Select all

#!/usr/bin/python
# -*- encoding: utf-8; py-indent-offset: 4 -*-
# +------------------------------------------------------------------+
# |             ____ _               _        __  __ _  __           |
# |            / ___| |__   ___  ___| | __   |  \/  | |/ /           |
# |           | |   | '_ \ / _ \/ __| |/ /   | |\/| | ' /            |
# |           | |___| | | |  __/ (__|   <    | |  | | . \            |
# |            \____|_| |_|\___|\___|_|\_\___|_|  |_|_|\_\           |
# |                                                                  |
# | Copyright Mathias Kettner 2010             mk@mathias-kettner.de |
# +------------------------------------------------------------------+
#
# This is a free addon for check_mk.
# The download page http://exchange.check-mk.org/ and the current source
# can be found at http://bitbucket.org/darkfader/nagios/check_mk/edac
#
# check_mk is free software;  you can redistribute it and/or modify it
# under the  terms of the  GNU General Public License  as published by
# the Free Software Foundation in version 2.  check_mk is  distributed
# in the hope that it will be useful, but WITHOUT ANY WARRANTY;  with-
# out even the implied warranty of  MERCHANTABILITY  or  FITNESS FOR A
# PARTICULAR PURPOSE. See the  GNU General Public License for more
# details.  You should have  received  a copy of the  GNU  General Public
# License along with GNU Make; see the file  COPYING.  If  not,  write
# to the Free Software Foundation, Inc., 51 Franklin St,  Fifth Floor,
# Boston, MA 02110-1301 USA.



# Author: Florian Heigl <fh@mathias-kettner.de>
# Check for edac error states: PCI and Memory 


# Example agent output:
_agent_output="""<<<edac_mem>>>
mc0 csrow0 8192 Registered-DDR3 0 0 S4ECD4ED
mc0 csrow1 8192 Registered-DDR3 0 0 S4ECD4ED
mc1 csrow0 8192 Registered-DDR3 0 0 S4ECD4ED
mc2 csrow0 8192 Registered-DDR3 0 0 S4ECD4ED"""


def inventory_edac_mem(checkname, info):
    inventory = []
    for line in info:
        if len(line) == 7:
            mc, csrow, size_mb, dimm_type, ce_count, ue_count, edac_mode = line
            inventory.append((("%s %s") % (mc, csrow), None))
            
    return inventory


def check_edac_mem(item, _no_params, info):
    for line in info:
        mc, csrow, size_mb, dimm_type, ce_count, ue_count, edac_mode = line
        if (("%s %s") % (mc, csrow)) == item:
            dimmdescr = ("%s MB %s DIMM") % (size_mb, dimm_type)
            # check correctable and uncorrectable error counters
            # tie in the EDAC mode here to modify results when using chipkill memory
            if saveint(ue_count)   > 0:
                return ((2, "CRITICAL - %s, Uncorrectable ECC errors found!!" % dimmdescr))
            elif saveint(ce_count) > 0:
                return ((1,  "WARNING - %s, Correctable ECC errors found!" % dimmdescr))
            else:
                return ((0, "OK - %s, no ECC errors found" % dimmdescr))

    return (3, "UNKNOWN - %s not found in agent output" % item)


check_info['edac.mem'] = (check_edac_mem, "EDAC Memory %s", 0, inventory_edac_mem)
If you are going to use check_mk, you will need to edit the python script for verbosity. If you are going to just use the EDAC plugin, you will need to add logic for checking the result and exiting correctly.
I will do what I can to help, but I guess we need to narrow the scope here.
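
For the "basic check" route, here is a minimal sketch of what "checking the result and exiting correctly" could look like as a standalone Nagios/NRPE plugin. It reads the same sysfs counters the bash agent prints (ce_count and ue_count under each csrow) and maps them to the standard Nagios exit codes. The function names, the message wording and the choice to sum counters over all csrows are my assumptions, not part of the posted plugin.

```python
#!/usr/bin/env python3
"""Sketch of a standalone Nagios/NRPE check for EDAC memory error counters."""
import glob
import os
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def read_int(path):
    """Return the integer stored in a sysfs file, or 0 if it cannot be read."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return 0


def check_edac(root="/sys/devices/system/edac/mc"):
    """Return (exit_code, message) summarising all csrow counters under root."""
    rows = sorted(glob.glob(os.path.join(root, "mc*", "csrow*")))
    if not rows:
        return UNKNOWN, "UNKNOWN - no EDAC csrow entries found under %s" % root

    total_ce = total_ue = 0
    details = []
    for row in rows:
        ce = read_int(os.path.join(row, "ce_count"))
        ue = read_int(os.path.join(row, "ue_count"))
        total_ce += ce
        total_ue += ue
        if ce or ue:
            mc_name = os.path.basename(os.path.dirname(row))
            details.append("%s/%s ce=%d ue=%d" % (mc_name, os.path.basename(row), ce, ue))

    if total_ue > 0:
        state, label = CRITICAL, "CRITICAL"
    elif total_ce > 0:
        state, label = WARNING, "WARNING"
    else:
        return OK, "OK - no ECC errors found on %d csrows" % len(rows)

    message = "%s - %d correctable, %d uncorrectable ECC errors (%s)" % (
        label, total_ce, total_ue, "; ".join(details))
    return state, message


def main():
    """Print the status line and exit with the matching Nagios code."""
    code, message = check_edac()
    print(message)
    sys.exit(code)
```

To use it as a plugin, call main() at the bottom of the file and point an NRPE command definition at the script. Passing a different root to check_edac() makes it easy to test against a fake directory tree.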
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
Professor Balthazar
Posts: 15
Joined: Wed Feb 26, 2014 6:53 am
Location: Stockholm, Sweden.

Re: Nagios and EDAC support

Post by Professor Balthazar »

There you have two plugins for edac. The first is a 'small & basic' one written as a Bash script, and the second is written in Python for full integration with check_mk (cmk).

My local test host is 10.35.40.242. This is its autochecks file, /var/lib/check_mk/autochecks/10.35.40.242.mk:

[
("10.35.40.242", "cpu.loads", None, cpuload_default_levels),
("10.35.40.242", "cpu.threads", None, threads_default_levels),
("10.35.40.242", "df", '/', {}),
("10.35.40.242", "diskstat", 'SUMMARY', diskstat_default_levels),
("10.35.40.242", "edac.mem", 'mc0 csrow0', None),
("10.35.40.242", "edac.mem", 'mc0 csrow1', None),
("10.35.40.242", "edac.mem", 'mc0 csrow2', None),
("10.35.40.242", "kernel", 'Context Switches', kernel_default_levels),
("10.35.40.242", "kernel", 'Major Page Faults', kernel_default_levels),
("10.35.40.242", "kernel", 'Process Creations', kernel_default_levels),
("10.35.40.242", "kernel.util", None, kernel_util_default_levels),
("10.35.40.242", "lnx_if", '2', {'state': ['1'], 'speed': 1000000000}),
("10.35.40.242", "lnx_if", '5', {'state': ['1'], 'speed': 1000000000}),
("10.35.40.242", "lnx_if", '6', {'state': ['1'], 'speed': 1000000000}),
("10.35.40.242", "mem.used", None, memused_default_levels),
("10.35.40.242", "mounts", '/', ['acl', 'barrier=1', 'data=ordered', 'errors=continue', 'relatime', 'rw', 'user_xattr']),
("10.35.40.242", "postfix_mailq", None, postfix_mailq_default_levels),
("10.35.40.242", "tcp_conn_stats", None, tcp_conn_stats_default_levels),
("10.35.40.242", "uptime", None, {}),
]

The rows below are the ones I write into this file by hand when the cmk inventory function fails.

("10.35.40.242", "edac.mem", 'mc0 csrow0', None),
("10.35.40.242", "edac.mem", 'mc0 csrow1', None),
("10.35.40.242", "edac.mem", 'mc0 csrow2', None),

I have tried the inventory function 25-30 times and it succeeded one (1) time, writing these rows into 10.35.40.242.mk under autochecks. So I'm not so happy about the inventory function.
The one time it worked, it resulted in the same output on the Nagios monitor (see the attached .jpg in my first post).
It also works when you edit this file and add those three rows for edac by hand.
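
One way to separate "the inventory logic is wrong" from "cmk never receives the section" is to run the inventory logic by hand on the agent output. The sketch below copies the logic of the posted inventory_edac_mem (without the unused checkname argument) and formats the result like the autochecks rows above. It assumes that check_mk passes the section to the function as a list of whitespace-split lines; autocheck_rows is a hypothetical helper for illustration only.

```python
# The agent section as shown earlier in the thread (header line removed).
AGENT_LINES = """\
mc0 csrow0 8192 Registered-DDR3 5 0 S8ECD8ED
mc0 csrow1 8192 Registered-DDR3 0 0 S8ECD8ED
mc0 csrow2 8192 Registered-DDR3 0 0 S8ECD8ED
"""


def inventory_edac_mem(info):
    """Same logic as the posted inventory function: one item per 7-field line."""
    inventory = []
    for line in info:
        if len(line) == 7:
            mc, csrow = line[0], line[1]
            inventory.append(("%s %s" % (mc, csrow), None))
    return inventory


def autocheck_rows(host, info):
    """Format inventory results the way they appear in the autochecks file."""
    return ['("%s", "edac.mem", %r, None),' % (host, item)
            for item, _params in inventory_edac_mem(info)]


# Assumption: each line of the section arrives as a list of tokens.
info = [line.split() for line in AGENT_LINES.splitlines()]
for row in autocheck_rows("10.35.40.242", info):
    print(row)
```

If this prints the three edac.mem rows, the inventory logic itself handles the data, and the problem is more likely in how the section reaches cmk.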

If you run cmk -vP create edac and edit /var/lib/check_mk/packages/edac to add agents: 'plugins/edac', checkman: 'edac.mem', checks: 'edac', you are following the instructions.
Then cmk -vP install edac-1.0.mkp installs the package, including the second (Python) plugin shown here. After that you follow the documentation: cmk -O, cmk -II 10.35.40.242, cmk -R.

The inventory doesn't update. When you edit your host.mk file (adding the three rows above) and run cmk -R again, the result is in the attached .jpg file. It stays pending forever.
We did not make any changes to the downloaded edac-1.0.mkp file or to any of the plugins for edac, smart.stats and smart.temp. It's all the same: none of them work!

I could not attach the result picture from the Nagios display.
Last edited by Professor Balthazar on Thu Apr 03, 2014 3:58 am, edited 1 time in total.
Professor Balthazar
Posts: 15
Joined: Wed Feb 26, 2014 6:53 am
Location: Stockholm, Sweden.

Re: Nagios and EDAC support

Post by Professor Balthazar »

A first attempt could be to change the line

"dimmdescr = ("%s MB %s DIMM") % (size_mb, dimm_type)"

to

"dimmdescr = ("%s MB %s DIMM %s Errors") % (size_mb, dimm_type, ce_count)"

But first of all, I want to get the edac, smart.stats and smart.temp plugins up and running!
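
A slightly fuller sketch of the same idea: instead of appending a count to dimmdescr, put the relevant counter into each status message. This is a variant of the posted check_edac_mem, not the original. The local saveint is a stand-in for the check_mk helper of the same name (assumed to return 0 for unparsable input), and the 7-field guard is an addition of mine.

```python
def saveint(value):
    """Stand-in for check_mk's saveint helper: int(value), or 0 on failure."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def check_edac_mem(item, _no_params, info):
    """Variant of the posted check that reports the error counts."""
    for line in info:
        if len(line) != 7:
            continue  # skip lines that do not have the expected 7 fields
        mc, csrow, size_mb, dimm_type, ce_count, ue_count, edac_mode = line
        if "%s %s" % (mc, csrow) != item:
            continue
        dimmdescr = "%s MB %s DIMM" % (size_mb, dimm_type)
        ce, ue = saveint(ce_count), saveint(ue_count)
        if ue > 0:
            return (2, "CRITICAL - %s, %d uncorrectable ECC errors found!" % (dimmdescr, ue))
        if ce > 0:
            return (1, "WARNING - %s, %d correctable ECC errors found!" % (dimmdescr, ce))
        return (0, "OK - %s, no ECC errors found" % dimmdescr)
    return (3, "UNKNOWN - %s not found in agent output" % item)


sample = [
    "mc0 csrow0 8192 Registered-DDR3 5 0 S8ECD8ED".split(),
    "mc0 csrow1 8192 Registered-DDR3 0 0 S8ECD8ED".split(),
]
print(check_edac_mem("mc0 csrow0", None, sample))
```

With the injected test fault from the first post, the csrow0 message would carry the count of 5.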
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Nagios and EDAC support

Post by abrist »

I am aware that the Python script is for check_mk integration. In case you are not aware, check_mk is *not* developed by Nagios Enterprises, nor is it truly supported by us; the check_mk developers are the ones to ask about issues with mk's inventory options and the like. I asked whether you want to use check_mk for these checks or just "the basic check", because I can help you add logic to the basic check to meet your requirements. These forums are the wrong place to dig into check_mk issues; the proper place for those questions is the check_mk mailing list: http://mathias-kettner.com/check_mk_lists.html
Professor Balthazar
Posts: 15
Joined: Wed Feb 26, 2014 6:53 am
Location: Stockholm, Sweden.

[SOLVED] Nagios and EDAC support

Post by Professor Balthazar »

This issue is solved. I have EDAC working for 4 x 16 GB DIMMs and for 2 x 8 GB + 2 x 16 GB DIMMs.

The sb_edac driver is updated and I have written new edac plugins (scripts) for Sandy Bridge and Ivy Bridge. It's not the same as for Core i7.

<<<edac>>>
mc0 dimm0 16384 Registered-DDR3 S4ECD4ED 0 0
mc0 dimm3 16384 Registered-DDR3 S4ECD4ED 0 0
mc0 dimm6 8192 Registered-DDR3 S4ECD4ED 0 0
mc0 dimm9 8192 Registered-DDR3 S4ECD4ED 0 0
0 0 0 0 49152 Sandy Bridge Socket#0

<<<edac>>>
mc0 dimm0 16384 Registered-DDR3 S4ECD4ED 0 0
mc0 dimm3 16384 Registered-DDR3 S4ECD4ED 0 0
mc0 dimm6 8192 Registered-DDR3 S4ECD4ED 0 0
mc0 dimm9 8192 Registered-DDR3 S4ECD4ED 0 0
0 0 0 0 49152 Ivy Bridge Socket#0
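
A small sanity check on this output, as a sketch: it splits a section into per-DIMM rows and the trailing summary row, then compares the summed DIMM sizes with the fifth field of the summary row. That the third field is the size in MB and that the summary's fifth field is the socket total are assumptions read off the numbers; the meaning of the other summary fields is not given here and is not interpreted. split_section is a hypothetical helper.

```python
SECTION = """\
mc0 dimm0 16384 Registered-DDR3 S4ECD4ED 0 0
mc0 dimm3 16384 Registered-DDR3 S4ECD4ED 0 0
mc0 dimm6 8192 Registered-DDR3 S4ECD4ED 0 0
mc0 dimm9 8192 Registered-DDR3 S4ECD4ED 0 0
0 0 0 0 49152 Sandy Bridge Socket#0
"""


def split_section(text):
    """Separate per-DIMM rows (first field starts with 'mc') from the summary row."""
    dimms, summary = [], None
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue
        if fields[0].startswith("mc"):
            dimms.append(fields)
        else:
            summary = fields
    return dimms, summary


dimms, summary = split_section(SECTION)
total_mb = sum(int(fields[2]) for fields in dimms)
print(total_mb, int(summary[4]))  # both are 49152 for the section above
```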