Nagios Network Analyzer - General Overview

High Level Overview

Nagios Network Analyzer is a network flow data analyzer. It works with the Netflow and sFlow flow types, along with others such as jFlow. The application shows you where network data is coming from and where it is going, which helps network administrators better understand the flow of data on their network. Network Analyzer aggregates the data it collects into easily readable reports, queries, graphs, and other diagrams such as the chord diagram. It also performs statistical analysis of bandwidth data over time to let administrators know when there may be a problem on the network.

This network data is normally referred to as "flows", "network flows", or sometimes "netflows". It comes into Network Analyzer as either Netflow or sFlow through a program called nfcapd (or sfcapd if using sFlow), via a port that is set when you create a Source. Sources also support other flow types: most other flow types carry less data than Netflow, so they can be stored and queried using a Source that has been set up to receive Netflow. A Source can also receive flow data from multiple devices as long as they all send to the same Source port. This allows users to create Sources for specific sections of their environment.

Once the data is collected, it is stored in binary files that can be read using a program called nfdump. The Network Analyzer backend uses this program to reap flow data files every 5 minutes. On each run, the backend loops through each of the Sources you have created, consolidates bandwidth data into an RRD file, and pulls out any data for the Views associated with each Source. Views allow longer storage of raw data. Raw data is the flow data described in the previous paragraph; it shows exactly which devices sent data where. The processed RRD file only shows the total amount of data sent per 5 minutes. By default, all Sources (and Views) store only 24 hours of raw data, since raw flow data can grow very quickly in larger environments. On the same 5 minute cycle, the backend also checks whether it needs to send out alerts. There are multiple alerting mechanisms, and if a check results in an OK, WARNING, or CRITICAL state, it is sent out via the alerting method selected in the Alerting section of the interface.
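To make the difference between raw data and the consolidated RRD data concrete, here is a minimal sketch of rolling individual flow records up into one total per 5 minute bucket. It is not the backend's actual code: the function name and the (timestamp, bytes) record shape are assumptions made for illustration only.

```python
from collections import defaultdict

BUCKET_SECONDS = 300  # the 5 minute cycle described above


def bandwidth_per_bucket(flows):
    """Sum bytes per 5 minute bucket.

    `flows` is an iterable of (unix_timestamp, byte_count) pairs, a
    hypothetical stand-in for records read from the raw flow files.
    """
    totals = defaultdict(int)
    for timestamp, byte_count in flows:
        # Floor the timestamp to the start of its 5 minute bucket.
        bucket_start = timestamp - (timestamp % BUCKET_SECONDS)
        totals[bucket_start] += byte_count
    return dict(totals)


if __name__ == "__main__":
    sample = [(1000, 500), (1100, 250), (1250, 100)]
    print(bandwidth_per_bucket(sample))
```

Note how the per-device detail is lost in the result: only the total per bucket survives, which is why raw data is kept separately.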

All raw data that is associated with a Source (including the Views on that Source) is stored within a directory in the main Network Analyzer directory. The directory is always named after the Source. So a Source named "Nagios Backup Network" would be located in /usr/local/nagiosna/var/NagiosBackupNetwork and would contain a folder called "flows" that holds the raw flow data and an RRD file named NagiosBackupNetwork that holds aggregated bandwidth data. The data stored in this directory makes up everything you see in the web interface.
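As a rough illustration of the naming shown in the example above, the following sketch maps a Source name to its data directory. It assumes that only whitespace is removed from the name; the actual rules Network Analyzer applies (for example to punctuation) are not stated here, and the function name is hypothetical.

```python
from pathlib import PurePosixPath

# Base directory taken from the example path above.
BASE_DIR = PurePosixPath("/usr/local/nagiosna/var")


def source_directory(source_name):
    """Return the assumed data directory for a Source name.

    Assumption: whitespace is stripped and nothing else is changed.
    """
    dir_name = "".join(source_name.split())
    return BASE_DIR / dir_name


if __name__ == "__main__":
    directory = source_directory("Nagios Backup Network")
    print(directory)            # directory for the Source
    print(directory / "flows")  # raw flow data folder inside it
```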

Architecture Diagram

The diagram below explains how the flow data comes into Network Analyzer and the Source's association with it. A more detailed explanation will follow.

Incoming Flow Data and Sources

A Source is just a collection of flow data and settings. When you create a Source in Network Analyzer, you set the port on which it listens for flow data; this is what we call the "Listener" in the diagram above. You also select the flow type. As stated above, we offer Netflow or sFlow as the flow type. If you have a different flow type that is not listed, you can just choose Netflow. Each Source can have multiple devices sending flow data to it. When the Source is online in the web interface, it means that the nfcapd (or sfcapd) collector is currently running on the Network Analyzer machine and will start collecting flow data right away. The "Listener" creates the raw flow data files. It takes 5 minutes for a flow data file to be fully written, so flow data may not show up in the web interface for several minutes in smaller environments. Every 5 minutes, aggregated bandwidth data is saved for each Source and View in the system. This is automatic and doesn't change per Source.
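The 5 minute delay is easier to see with a small sketch of how a capture window lines up with the clock. The code below assumes the collector rotates files on 5 minute boundaries and uses nfcapd's usual file naming of nfcapd.YYYYMMDDhhmm; both the helper names and the exact naming used on your install are assumptions to verify.

```python
from datetime import datetime, timedelta

ROTATION = timedelta(minutes=5)  # the 5 minute file interval described above


def capture_window(moment):
    """Return (start, end) of the 5 minute window that contains `moment`."""
    start = moment.replace(
        minute=moment.minute - moment.minute % 5, second=0, microsecond=0
    )
    return start, start + ROTATION


def capture_file_name(moment):
    """Return the assumed flow file name for the window containing `moment`."""
    start, _ = capture_window(moment)
    return "nfcapd." + start.strftime("%Y%m%d%H%M")


if __name__ == "__main__":
    seen = datetime(2024, 1, 1, 12, 7, 30)
    start, end = capture_window(seen)
    print(capture_file_name(seen), "is complete at", end)
```

A flow seen at 12:07:30 lands in the window that starts at 12:05 and is only fully written at 12:10, so it cannot appear in the interface before then.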

Network Analyzer Backend

There is a backend to Network Analyzer that does a lot of the background processing of flow data. The backend runs every 5 minutes, when the nfcapd or sfcapd process finishes writing a new flow data file. The backend is written in Python and is located in /usr/local/nagiosna with the following directory structure. For this example, assume there is only one Source, named "My Source", in the web interface.

/bin - Location of the reaper script, the scripts that generate graphing data, and the scripts that run the alerting every 5 minutes when needed.
/var - Stores the data for the Sources and Views. Also contains backend.log, which holds backend log data, and cmdsubsys.log, which holds all data pertaining to running commands.
- /MySource
  - 5566.pid - The pid file for the nfcapd or sfcapd process that is running for this Source. This is named the port number. So this Source is listening for flow data on port 5566.
  - bandwidth.rrd - Stores the Source's aggregated bandwidth data in RRD format.
  - /flows - Stores the binary flow data files for the Source (raw Source data). Can be read using nfdump. Uses nfexpire to automatically keep only the amount of raw data that is defined in the Source settings.
  - /views
    - /SomeView - Folder that stores the binary nfcapd or sfcapd files. Can be read using nfdump. Uses nfexpire to keep only the set amount of raw data based on the lifetime set in the View settings.
    - SomeView-bandwidth.rrd - Stores the view's aggregated bandwidth data in RRD format.
/scripts - Contains scripts that can be helpful if there are problems with your install. There are also scripts in this folder that are used in the web interface to do certain actions.
/etc - Empty for now. May contain backend config files in the future.
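
The layout above can be walked with a few lines of Python. This is only a sketch based on the listing: the helper names are hypothetical, and it assumes the pid file is always named after the port, as described for 5566.pid.

```python
import tempfile
from pathlib import Path


def listening_port(source_dir):
    """Return the port taken from a '<port>.pid' file name, or None."""
    for pid_file in sorted(Path(source_dir).glob("*.pid")):
        if pid_file.stem.isdigit():
            return int(pid_file.stem)
    return None


def source_layout(source_dir):
    """Return the paths described in the listing for one Source directory."""
    root = Path(source_dir)
    return {
        "rrd": root / "bandwidth.rrd",  # aggregated bandwidth data
        "flows": root / "flows",        # raw flow data files
        "views": root / "views",        # per-View data
    }


if __name__ == "__main__":
    # Build a throwaway directory that mimics /var/MySource.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "5566.pid").touch()
        print(listening_port(tmp))
        print(source_layout(tmp)["flows"])
```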

Processing Alerts

The backend also processes the alerts. Each time it runs, it looks up the alerts currently in the database and checks them against the last 5 minutes of data. If the data matches any of the alert states (OK, WARNING, or CRITICAL), it then checks the database for the notification methods it should use. You can set up notifications via email, NRDP to Nagios Core/Nagios XI servers, SNMP Traps, or even the execution of local scripts. Alerts are only run once for each 5 minute cycle.
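
The check step can be pictured with a small sketch. It assumes a simple "higher is worse" comparison against a warning and a critical threshold; the real backend's comparison rules are not described here, and every name below is hypothetical.

```python
def classify(value, warning, critical):
    """Map a measured value to OK, WARNING, or CRITICAL.

    Assumption: larger values are worse and critical >= warning.
    """
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


def run_alert(value, warning, critical, notifiers):
    """Classify once per cycle and pass the state to each notifier."""
    state = classify(value, warning, critical)
    for notify in notifiers:
        notify(state, value)
    return state


if __name__ == "__main__":
    # A print-based notifier stands in for email, NRDP, SNMP traps, or scripts.
    def to_console(state, value):
        print(f"{state}: {value} bytes in the last 5 minutes")

    run_alert(1_500_000, warning=1_000_000, critical=5_000_000,
              notifiers=[to_console])
```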

Processing Views

When the 5 minute cycle begins, the backend also checks whether any Views are associated with the Source whose bandwidth data it is writing to the RRD. If there is one View or many, it runs nfdump with the query that each View is defined by. It then exports that data into the View's own flow data directory and runs nfexpire to make sure that no data in the directory is older than the View's raw data lifetime. Finally, it updates the bandwidth for the View using an aggregated nfdump query. This allows users to use a View much like a Source, but for a more specific set of data, such as the data for a certain port or IP address.
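
The extract and expire steps could be sketched as the two command lines below. This is an illustration, not the backend's actual invocation: it only builds the argument lists and does not run anything. The flags used (nfdump -r/-w with a trailing filter, nfexpire -e/-t) come from the nfdump tool suite, but the function name, the paths, and the lifetime format are assumptions, so check them against the man pages on your install.

```python
import posixpath


def view_commands(source_flow_file, view_dir, view_filter, lifetime):
    """Build the assumed nfdump and nfexpire commands for one View.

    source_flow_file: a flow file in the Source's flows directory
    view_dir:         the View's own flow data directory
    view_filter:      the nfdump filter that defines the View
    lifetime:         raw data lifetime in nfexpire's syntax (assumed)
    """
    out_file = posixpath.join(view_dir, posixpath.basename(source_flow_file))
    # Read the Source's file, keep matching flows, write them to the View.
    extract = ["nfdump", "-r", source_flow_file, "-w", out_file, view_filter]
    # Remove View data older than the configured lifetime.
    expire = ["nfexpire", "-e", view_dir, "-t", lifetime]
    return extract, expire


if __name__ == "__main__":
    extract, expire = view_commands(
        "/usr/local/nagiosna/var/MySource/flows/nfcapd.202401011205",
        "/usr/local/nagiosna/var/MySource/views/SomeView",
        "port 443",
        "24H",
    )
    print(" ".join(extract))
    print(" ".join(expire))
```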

Web Interface

The web interface is located in /var/www/nagiosna and contains all the CodeIgniter files and the web interface files.

/application - This is the main portion of the web interface. Without getting into how CodeIgniter works, it has all the files that actually produce everything you see in the browser.
/www - The only folder that is accessible from outside.

index.php - The main file that accesses the rest of the application.
/media - Stores everything from stylesheets to javascript and images. These are then linked into pages you see in the web interface.

/system - These are the CodeIgniter framework system files. These aren't edited and shouldn't be edited.

The web interface is written in PHP and uses CodeIgniter as its base framework. Graphs, however, are created in JavaScript using Highcharts and D3. The graph data is generated by Python scripts that the web interface runs in the background when a report or query is run. All flow data (raw flow data and graphed bandwidth data) for reports and queries is retrieved by making calls to the backend. All other information in the web interface comes from the MySQL database.

MySQL Database

The application uses MySQL as its database, or MariaDB on CentOS 7. The database isn't included in the architecture diagram because it is only used to store information about Sources, such as the name, port, and other information that may be needed in the web interface. It also stores information on the alerts that the backend processes and the locations that alerts are sent to, as well as the user and API authentication information for the web interface. No flow data is stored in the database at any time. All flow data is either in RRD format or in flat files in the flows/views directories explained above.

Source Groups

Source Groups are simply groups of Sources. Source Groups do not store any data. Instead, each one is a list of Sources; when a query, report, or graph is created for a Source Group, it uses the data from all the Sources inside the group to produce its results. Source Groups are not shown in the architecture diagram because they do not interfere with the gathering of flow data and are used purely to organize devices in Network Analyzer, which may belong to multiple groups.
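
Since a Source Group holds no data of its own, a group result is just a combination of its members' data. The sketch below shows the idea for bandwidth totals; the data shape (a mapping of bucket start time to bytes per Source) and the function name are assumptions for illustration.

```python
def group_bandwidth(per_source):
    """Combine per-Source bandwidth into one series for a Source Group.

    per_source: dict of Source name -> dict of bucket start -> bytes.
    Returns a dict of bucket start -> total bytes across all Sources.
    """
    combined = {}
    for series in per_source.values():
        for bucket, byte_count in series.items():
            combined[bucket] = combined.get(bucket, 0) + byte_count
    return combined


if __name__ == "__main__":
    group = {
        "Office": {900: 750, 1200: 100},
        "Datacenter": {900: 250, 1500: 40},
    }
    print(group_bandwidth(group))
```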

Abnormal Behavior (in Sources)

One of the most common questions is how abnormal behavior is calculated. Abnormal behavior is calculated using the bandwidth data in the RRD file. RRDtool has a feature that reports "failures" for a given time period, where the failures correspond to statistical abnormalities in the most recent entries. We check the last 30 minutes, in 5 minute samples, for abnormal behavior. RRDtool calculates the failures automatically using a Holt-Winters algorithm. It takes two time periods before the last 30 minutes and uses them to predict what the last 30 minutes should have been. If the actual data for the last 30 minutes differs enough from the prediction, RRDtool considers that a failure. Those failures are then passed to the web interface, which shows them in the graph on the main page for admins to see when they log in.
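
The comparison between prediction and reality can be sketched as a confidence band check over the six 5 minute samples that make up 30 minutes. This is a simplified illustration and not RRDtool's implementation: RRDtool derives the predicted values and deviations itself with Holt-Winters, whereas here they are passed in as hypothetical inputs, and the band width `delta` is an illustrative value.

```python
def flag_abnormal(actual, predicted, deviation, delta=2.0):
    """Flag samples that fall outside predicted +/- delta * deviation.

    All three sequences hold one value per 5 minute sample.
    Returns a list of booleans, True where a sample is abnormal.
    """
    flags = []
    for observed, expected, spread in zip(actual, predicted, deviation):
        flags.append(abs(observed - expected) > delta * spread)
    return flags


if __name__ == "__main__":
    # Six samples cover the last 30 minutes.
    actual = [100, 105, 98, 300, 102, 99]
    predicted = [100, 100, 100, 100, 100, 100]
    deviation = [10, 10, 10, 10, 10, 10]
    print(flag_abnormal(actual, predicted, deviation))
```

Only the fourth sample is far enough from its prediction to be flagged; small fluctuations stay inside the band.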

Final Thoughts

For any support related questions please visit the Nagios Support Forums at:

http://support.nagios.com/forum/