Network Analyzer Slow
CFT6Server
- Posts: 506
- Joined: Wed Apr 15, 2015 4:21 pm
Re: Network Analyzer Slow
I appreciate the feedback and the explanation of your thought process. It helps to see what troubleshooting steps you are taking to test and attempt to resolve this issue.
I agree regarding pruning the data; if we can keep the full view, that would of course be better.
I will increase the vCPU count on the NNA to see if it at least handles things better. From my observations, each time I query something an nfdump process is spawned and uses one core.
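As a side note, the one-process-per-query behavior described above can be sketched roughly like this. The flows directory path and the exact invocation are assumptions for illustration, not NNA's actual internals; nfdump does have `-R` (read a directory of capture files) and `-t` (time window) options, but verify them against your nfdump version.

```python
import shutil
import subprocess

# Assumed flows location; NNA's real path may differ.
FLOWS_DIR = "/usr/local/nagiosna/var"

def build_query(time_window, flt="any"):
    """Build one nfdump invocation. Each UI query becomes one of these,
    and since nfdump is single-threaded, each one pins exactly one core."""
    return ["nfdump", "-R", FLOWS_DIR, "-t", time_window, flt]

cmd = build_query("2015/04/15.00:00:00-2015/04/15.23:59:59")

# Only actually run it if nfdump is installed on this machine.
if shutil.which("nfdump"):
    subprocess.run(cmd)
```

So two simultaneous queries mean two such processes, which matches the two-cores-pinned observation.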
Thanks and I'll be watching the thread for updates!
Re: Network Analyzer Slow
CFT6Server wrote: "I will increase the vCPU on the NNA to see if it will at least handle things better."

Please let us know if this made any difference.
Be sure to check out our Knowledgebase for helpful articles and solutions!
CFT6Server
Re: Network Analyzer Slow
No difference here, and I didn't expect any, since nfdump is single-threaded. Testing a query or clicking on a source creates one nfdump process, which usually pins a single core. When clicking the NNA tab in XI, it looks like two nfdump queries run, so two processes pin two cores.
Re: Network Analyzer Slow
In terms of the CPU pinning, we will need to wait and see what the nfdump team does about multi-threading.
I asked the devs and they're saying the network-based disks are the bottleneck, and that switching away from network-based storage may be the answer.
We'll need to continue the discussion on our end. Clearly there are some improvements that could be made, we just need to identify the best way to go about making them.
I will update the thread when I know more.
Former Nagios employee
CFT6Server
Re: Network Analyzer Slow
Thanks. These are VMs, so in terms of network-based disks I don't think there are other options. I could get an NFS LUN provisioned so the netflow files have more dedicated network storage, but I don't think that would make much difference in this case.
The last mention of multi-threading for nfdump that my Google searches turn up is from a while ago. Do you have direct communication with that team?
I just haven't seen much activity on it since 2013, so I am not very hopeful.
Having a better backend might ultimately be the answer, but unfortunately that could take a while. A database or Elasticsearch backend would make queries much quicker than reading all the files. (The flat-file approach doesn't seem to scale well, especially at the volume we are running.)
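The advantage of a database or Elasticsearch backend over flat files can be shown with a toy comparison (this is illustrative Python, not NNA code): the flat-file model must touch every record on every query, while an index built once at ingest time jumps straight to the matches.

```python
from collections import defaultdict

# A handful of fake flow records standing in for the on-disk capture files.
records = [
    {"src": "10.0.0.1", "bytes": 500},
    {"src": "10.0.0.2", "bytes": 1500},
    {"src": "10.0.0.1", "bytes": 700},
]

# "Flat file" model: every query scans every record (what nfdump does).
def scan_query(src):
    return [r for r in records if r["src"] == src]

# "Indexed" model: pay the cost once at ingest, then query the index.
index = defaultdict(list)
for r in records:
    index[r["src"]].append(r)

def indexed_query(src):
    return index[src]

# Both approaches return the same results; only the work per query differs.
assert scan_query("10.0.0.1") == indexed_query("10.0.0.1")
```

The scan is O(total records) per query; the index lookup is roughly O(matches). As noted in the thread, though, an index only helps if the matching data can also be read back quickly.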
Re: Network Analyzer Slow
This is what I got from the devs:
Options:
1. Filter the incoming netflow using Views
2. Use small timeframes when running queries
3. Get faster I/O

"Basically, switching to another DB would be great, particularly because it would benefit running queries. But when we go back to a multi-gigabyte query, even with a DB/ES it will still take a long time because of the I/O. You just can't get the data faster; the only way is if it were in RAM, and even then the DB won't hold every bit of netflow data in RAM for fast access. So if he wants to run a query on 20GB of data, he has to have a way to read that data that is faster than 30MB/s."

Would any of these options be something you could pursue? I can reach out to the nfdump folks, but we do not have any sort of official affiliation with them that I could lean on.
Former Nagios employee
CFT6Server
Re: Network Analyzer Slow
Some comments below.
1. Filtering the incoming netflow using Views is an option. However, it will not resolve the delays in the XI integration. Also, if a user clicks on a source, which kicks off the query process, they are stuck until it finishes before they can navigate away. I am not sure Views will help with these points.
2. Smaller time frames, say a few hours, will not help. I have already reduced the retained data to 2 days, and frankly, if the product we purchased cannot handle even a single day, something is wrong. It will be hard to explain why competing products can do full netflow without any issues. (We do have other netflow analyzers in the environment as well.)
3. Since this is a VM in a virtualized environment, I am not sure how we can increase the I/O. Purchasing a physical box with SSDs just to make a product work better is not an option.
It sounds like a real fix here would require a complete rewrite of NNA. I am just surprised that NNA has trouble handling about 20GB worth of data.
If Log Server can handle billions of logs, I would think there is a way to integrate it with NNA to provide faster queries. Basically, the incoming data would go through Logstash first and then into Elasticsearch, which means NNA would have to be fully integrated with Log Server. Just a thought.
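To make the suggested pipeline concrete, here is a minimal sketch of what shipping a flow record into Elasticsearch could look like at the wire level. The newline-delimited action/document format is Elasticsearch's real `_bulk` API shape, but the index name and field names are illustrative assumptions, not NNA's or Log Server's actual schema.

```python
import json

def flow_to_bulk_lines(flow, index="netflow"):
    """Serialize one flow record as an Elasticsearch _bulk request body:
    an action line naming the target index, then the document itself."""
    action = {"index": {"_index": index}}
    return json.dumps(action) + "\n" + json.dumps(flow) + "\n"

# Hypothetical flow record; field names are made up for this sketch.
flow = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "bytes": 1234}
payload = flow_to_bulk_lines(flow)
# `payload` is the newline-delimited body the _bulk endpoint expects;
# in a real pipeline Logstash would do this translation and batching.
```

In practice Logstash (or a netflow collector feeding it) would batch many such pairs per request rather than sending one record at a time.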
Re: Network Analyzer Slow
Hello CFT6Server,
To start off, let me point out that it is possible for us to add a setting that disables the default queries from running when you select a source, if that would help make it more usable. That doesn't help with the XI integration, but we could look at a fix for that as well.
However, the I/O really is what is causing the problem. I'm not sure how your other netflow products work, but even if the data were stored in a database, it would still need to read the 20GB it queries from somewhere. Unless that database already has the 20GB cached and ready to pull, it really wouldn't be much faster. Other products may aggregate the data and change it; we store the entire netflow stream in binary form, recording every netflow record that comes in to the flows folder in NNA. This is by design, to get the most out of queries. We chose nfdump and nfcapd because they seemed like the most helpful tools for gathering and querying netflow data.
As far as the speed of the NAS goes: at 30MB/s, transferring 20GB takes roughly 20,000MB / 30MB/s ≈ 667 seconds (about 11 minutes) just to read the data. nfdump reads 5MB chunks at a time, applies the filters, and moves on, and the chunked reading slows the effective read rate further. Caching on the NAS is most likely negligible, because the 20GB of data is probably not accessed often enough to stay in the cache.
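The back-of-the-envelope I/O math can be checked directly (decimal units assumed, i.e. 20GB = 20,000MB):

```python
# Reading 20GB of flow files over a 30MB/s NAS link, in 5MB chunks.
data_mb = 20 * 1000        # 20GB expressed in MB
rate_mb_s = 30             # observed NAS throughput
chunk_mb = 5               # nfdump's read-chunk size

transfer_s = data_mb / rate_mb_s      # raw transfer time, seconds
chunks = data_mb // chunk_mb          # number of chunk reads per full query

print(f"{transfer_s:.0f} s (~{transfer_s / 60:.0f} min), {chunks} chunks")
```

That is roughly 667 seconds of raw transfer time, before filtering overhead and the per-chunk read latency are added on top, which is why even a perfect query engine is bounded by the storage link here.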
It is correct that Log Server can handle billions of logs, in the right environment. However, people have set up Log Server on VMs with NAS-based storage and hit the same kind of issues. They push in over 50GB of logs per day, and the web interface grinds to a halt because Elasticsearch spends all of its time rearranging shards of data between the servers, and the slow reads and writes keep it from replicating the 50GB of data at a decent rate. So essentially the same problem exists even with Elasticsearch. Once the data is written, though, read speeds would definitely be much higher in a physical-disk environment: with 2 Log Server instances you could pull half the data from each, essentially doubling your read throughput. On a NAS, however, those 2 instances may be competing for each other's read bandwidth.
As you stated in your response, speeding up these types of large queries on our end would require a rewrite of NNA, and that won't be able to happen any time soon.
CFT6Server
Re: Network Analyzer Slow
Hi There,
Thanks for the detailed response. In our case, disabling the default queries, or being able to set their default time interval (say, 1 hour) when a source is selected, would be helpful. If this can tie into the XI integration to allow a quicker response, I think it would solve 50% of the issue. We can live with larger queries taking a while, as long as the quick links and normal operations are not slowed down. I appreciate that Nagios is helping to solve this issue and providing suggestions.
I understand the data/speed considerations with Log Server. However, in my test environment, a 2-node build, I have 150GB worth of data (ingesting anywhere from 30GB to 40GB of logs per day) that I am querying without any hiccups. The underlying technology is much more efficient at running queries, and the test systems perform perfectly on the same infrastructure. We have no problem writing that data to Elasticsearch, and reads and queries are fast. Hence my suggestion of something similar for NNA.
Can we start by looking at those default settings and then expand from there?
Re: Network Analyzer Slow
I can certainly put in a feature request for the defaults. Is there any specific wording you want to use or would "Default to 1 hour for source query interval" suffice?
Former Nagios employee