scaling issues/too many indices

Posted: Mon Jun 08, 2015 2:24 pm
by jvestrum
We have been trying to pull in five years' worth of old logs and keep running into scaling issues. We've worked past some of them: the memlock ulimit, PHP running out of memory, backups getting stuck. The current issue is Elasticsearch running out of file descriptors. We've raised the "nofile" ulimit to 262144 and Elasticsearch still exhausts its file descriptors and crashes, so we had to drop the past indices to get it to respond again.

It seems like all these problems lead back to having too many indices. Is there some way we can switch to one index per month instead of one per day? Has anyone else been successful in pulling in several years of logs? It's not all that much data, only tens of GB in total.
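A rough sketch of why the one-index-per-day scheme blows through the nofile limit. All of the per-shard numbers here are assumptions for illustration, not measurements from any real cluster; actual segment and file counts vary with merge settings and data volume:

```python
# Back-of-envelope: open files needed for daily vs. monthly indices.
# Per-shard figures below are assumptions, not measured values.

DAYS = 5 * 365                 # ~5 years of one index per day
SHARDS_PER_INDEX = 5           # Elasticsearch 1.x default primary shard count
SEGMENTS_PER_SHARD = 15        # assumed; depends on merge policy and data size
FILES_PER_SEGMENT = 10         # assumed; a Lucene segment is roughly this many files

daily_total = DAYS * SHARDS_PER_INDEX * SEGMENTS_PER_SHARD * FILES_PER_SEGMENT

MONTHS = 5 * 12                # same span, one index per month
MONTHLY_SHARDS = 2             # fewer primaries per index, since each index is larger

monthly_total = MONTHS * MONTHLY_SHARDS * SEGMENTS_PER_SHARD * FILES_PER_SEGMENT

print(daily_total)    # well past a 262144 nofile limit
print(monthly_total)  # comfortably under it
```

Even with generous error bars on the assumed constants, the daily scheme lands an order of magnitude above the raised ulimit, while monthly indices stay far below it.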

Re: scaling issues/too many indices

Posted: Mon Jun 08, 2015 3:49 pm
by jolson
How are you importing this information? Would you be all right with one index with today's date holding all of your data, or does it need to be sorted by date?

Re: scaling issues/too many indices

Posted: Mon Jun 08, 2015 4:43 pm
by jvestrum
jolson wrote:How are you importing this information? Would you be all right with one index with today's date holding all of your data, or does it need to be sorted by date?
We are using a logstash agent to ship the logs into elasticsearch, applying some custom grok filters along the way.

It should be okay having all the old logs go into one index, but the dates/timestamps on each record do need to be preserved. I think that will break date-based filtering from the NLS web interface, though, because of how it builds its queries. Since we mostly plan to mine this past data directly with our own Elasticsearch queries, that might be acceptable.
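For example, something like this date-range query, POSTed to /logstash-2012.*/_search. This is only a sketch against the 1.x filtered-query DSL, and the @timestamp field name is an assumption based on logstash defaults:

```json
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "range": {
          "@timestamp": { "gte": "2012-01-01", "lt": "2013-01-01" }
        }
      }
    }
  }
}
```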

Re: scaling issues/too many indices

Posted: Mon Jun 08, 2015 5:01 pm
by jolson
According to Elastic:
The date filter is especially important for sorting events and for backfilling old data. If you don’t get the date correct in your event, then searching for them later will likely sort out of order.
In the absence of this filter, logstash will choose a timestamp based on the first time it sees the event (at input time), if the timestamp is not already set in the event. For example, with file input, the timestamp is set to the time of each read.
I am assuming that your timestamp field is currently being set by a 'date' filter. In the absence of a date filter, your logs will be stamped with their arrival time rather than the date present in the log line. Alternatively, you could write some grok that parses the date out into a separate field for sorting purposes.
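A minimal sketch of what that filter section could look like. The grok pattern here is just a placeholder for your own custom patterns, and the date formats would need to match whatever your logs actually contain:

```conf
filter {
  grok {
    # Placeholder pattern -- substitute your own custom grok here.
    match => [ "message", "%{SYSLOGTIMESTAMP:logdate} %{GREEDYDATA:logmessage}" ]
  }
  date {
    # Parse the event time out of the log line so @timestamp reflects
    # the original date rather than the time logstash read the file.
    match => [ "logdate", "MMM  d HH:mm:ss", "MMM dd HH:mm:ss" ]
  }
}
```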

If that doesn't work for you, let us know and we can try to come up with a different way to approach this.

Re: scaling issues/too many indices

Posted: Tue Jun 09, 2015 3:33 pm
by jvestrum
Ideally I'd like our past, present, and future data to all "look the same": standard timestamps, arranged into consistent, logical indices. What we're trying now is re-indexing the data into one-per-month indices named logstash-YYYY.MM. I found the config option in the Dashboard interface where I can set the timestamping interval to monthly, and that seems to work: queries are hitting the new indices. I've also lowered the primary shards per index to 2 (with 1 replica). I'll report back on how the reindexing goes.
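On the shipping side, the equivalent would be pointing the elasticsearch output at a monthly index pattern. This is just a sketch (we set ours through the Dashboard option instead); the date in the index name is filled in from each event's @timestamp, so the date filter has to be stamping events correctly first:

```conf
output {
  elasticsearch {
    # One index per month instead of logstash's per-day default
    # of logstash-%{+YYYY.MM.dd}.
    index => "logstash-%{+YYYY.MM}"
  }
}
```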

Re: scaling issues/too many indices

Posted: Tue Jun 09, 2015 3:44 pm
by jolson
Sounds good - be sure to let us know. Thank you!

For anyone reading this thread later, the option jvestrum is using can be found under 'Dashboards -> Configure Dashboard -> Index':
(screenshot of the 'Index' pane in Configure Dashboard)