https://github.com/SteveBeauchemin/nfdumpWrapper
Definitely in the concept stages.
The initial problem was that I could not figure out why creating files in my desired location
did not work. The httpd protected mode in RH7 killed me for a while; I had no idea that protected mode was there.
Now that I have a grasp of how to turn that off, I will go back to the method I was trying to use.
Getting the smaller jobs to run their own process using their own CPU was next.
Currently GNU Parallel is working for me. I am also looking at the possibility of using Perl's IPC::Run3.
The intent is to use nfdump with the -w parameter, and store binary files in an intermediate location.
Basically I will take one command line that would grab all the data in one big gulp, and instead turn it into
many command lines, each grabbing a small piece. The binary outputs will be stored in an intermediate, isolated location.
Then the smaller binary files can be queried in a single nfdump operation to provide the 'final answer', sorta.
The -w parameter only applies when combined with 4 other parameters, so it only works for a specific scenario. No Top Talker stuff.
There are 2 places in the command line I will exploit.
One is the directory location where the flows are stored, with the sources broken up by the ports they listen on.
If you have 20 flow directories, you could get a very long complicated line listing all 20 locations.
Take the -M parameter and break it into 20 separate nfdump calls.
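A minimal sketch of that split. nfdump's -M flag takes one /base/dir1:dir2:... argument listing all the flow source directories; here that single argument is broken into one call per directory. The base path, directory names, and time window are made-up examples, and the commands are printed rather than executed:

```shell
#!/bin/sh
# Take a single multi-source -M argument and fan it out into one
# nfdump invocation per source directory. Paths are hypothetical.
M='/var/cache/nfdump/port9995:port9996:port9997'
T='2024/01/01.00:00:00-2024/01/01.23:59:59'
BASE=${M%/*}       # base path:  /var/cache/nfdump
DIRS=${M##*/}      # dir list:   port9995:port9996:port9997
count=0
for d in $(echo "$DIRS" | tr ':' ' '); do
    # Print each small job instead of running it, so the split is visible.
    echo "nfdump -M $BASE/$d -t $T"
    count=$((count + 1))
done
```

With 20 flow directories the same loop would emit 20 small, independent command lines instead of one long -M list.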
The other parameter to examine is the -t time period, where a [start time]-[end time] is provided.
Read that and break it into one-hour intervals. Since the flow data is stored in 5-minute chunks, over
one hour you will have 12 flow files. So break up long time periods into shorter ones, and have many of them.
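A sketch of the hour-window split, assuming GNU date is available; the start time and 24-hour span are illustrative stand-ins for whatever a real -t argument would contain. Each emitted line is a -t window suitable for one small nfdump job:

```shell
#!/bin/sh
# Break one long [start]-[end] time range into one-hour -t windows.
# Epoch arithmetic avoids GNU date's tricky relative-date parsing.
START='2024-01-01 00:00'   # hypothetical start of the requested range
HOURS=24                   # hypothetical length of the range
base=$(date -d "$START" +%s)   # start time as epoch seconds
count=0
first=''
while [ "$count" -lt "$HOURS" ]; do
    s=$(date -d "@$((base + count * 3600))" '+%Y/%m/%d.%H:%M:%S')
    e=$(date -d "@$((base + (count + 1) * 3600 - 1))" '+%Y/%m/%d.%H:%M:%S')
    [ -z "$first" ] && first="-t $s-$e"
    echo "-t $s-$e"
    count=$((count + 1))
done
```

Each window ends one second before the next begins, so the pieces cover the original range without overlapping.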
Example...
Looking at the number of flow files to process:
if we have 10 flow sources, basically listening on 10 ports, and we want data for a 24 hour period, we need to process 2880 flow files.
(10 dir x 12 files x 24 hours = 2880 flow files.)
We could either have one nfdump at 100% on one CPU for a long time...
Or, we make many smaller nfdump jobs, run them as separate processes, and use all the CPU available.
(10 dir x 24 hours = 240 small nfdump jobs)
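The arithmetic above can be sketched as a job generator: 10 source directories times 24 one-hour windows gives 240 command lines, one per line of output, which could then be piped into GNU Parallel (e.g. `sh gen-jobs.sh | parallel -j "$(nproc)"`). Every path, directory name, and date here is an assumption, not the real layout:

```shell
#!/bin/sh
# Emit the 240 small nfdump jobs (10 dirs x 24 hours), one per line.
# Each job writes its binary result with -w into an intermediate dir.
BASE=/var/cache/nfdump          # hypothetical flow-store base path
INTER=/tmp/nfdump-intermediate  # hypothetical intermediate location
DAY='2024/01/01'                # hypothetical day being queried
n=0
for d in 0 1 2 3 4 5 6 7 8 9; do          # 10 hypothetical source dirs
    h=0
    while [ "$h" -lt 24 ]; do             # 24 one-hour windows
        hh=$(printf '%02d' "$h")
        echo "nfdump -M $BASE/src$d -t $DAY.$hh:00:00-$DAY.$hh:59:59 -w $INTER/src$d-$hh.nf"
        n=$((n + 1))
        h=$((h + 1))
    done
done
```

Because every line is an independent process, GNU Parallel can keep all CPUs busy instead of pinning one core at 100%.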
YMMV (your math may vary)
So basically, take the 500GB of stored flow data, and reduce it to some MB of stored data.
Then run the 'real' command against that to return data to NNA.
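A sketch of that final pass: once the small jobs have written their binary pieces, nfdump's -R flag (read all files under a directory) can sweep the intermediate location in a single query. The path and filter are assumptions, and the command is printed rather than run:

```shell
#!/bin/sh
# Build the single 'real' query that reads all intermediate binaries.
INTER=/tmp/nfdump-intermediate  # hypothetical intermediate location
FILTER='proto tcp'              # hypothetical filter expression
final="nfdump -R $INTER '$FILTER'"
echo "$final"
```

Sorting, ordering, and output-format options for the final answer would be tacked onto this one command, not onto the 240 small jobs.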
I will come up with similar but different ideas for the Top Talkers. If I split them to many small jobs,
I still have to capture the text output, sort, merge, aggregate, average, and output the data.
Like I said: work in progress, still in the conceptual stage, doing alpha testing.
I apologize if this seems like I am rambling. But I am rambling... Still in the 'what the heck am I doing' stage of things.
Steve B