Efficient large file search (webapp/service)

Hi,
I have a large text file (or several large text files), a few GBs total but growing over time.
I would like to create a webapp that offers users the service of searching for specific strings (one at a time) in the text file(s). The service returns true or false depending on whether the string was found in the file(s).
I’d like to do this efficiently so as to allow many users to search for many different strings at the same time, without overloading the server.
Where do I begin with building something like this?
Thank you!

The best scenario would be to use a database backend instead of text files.

Consider Elasticsearch. It is very simple to design, build, deploy, and manage. This way your data can be ingested into Elasticsearch indices and subjected to just about any kind of searching and analytics you might need. It is lightning fast and extremely flexible, it scales horizontally, and, being built on Apache Lucene, it can be made highly performant.
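
To make the true/false service concrete, here is a rough Go sketch of a search endpoint that asks Elasticsearch whether a string occurs anywhere in the indexed data. The index name "lines", the field name "line", and the local cluster address are assumptions on my part; it talks to the _count REST endpoint over plain HTTP rather than any particular client library.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// countResponse holds the only part of the _count reply we care about.
type countResponse struct {
	Count int `json:"count"`
}

// found asks Elasticsearch whether any document in the (hypothetical)
// "lines" index matches the given string in its "line" field.
func found(s string) (bool, error) {
	query := map[string]interface{}{
		"query": map[string]interface{}{
			"match_phrase": map[string]interface{}{
				"line": s,
			},
		},
	}
	body, err := json.Marshal(query)
	if err != nil {
		return false, err
	}

	resp, err := http.Post("http://localhost:9200/lines/_count",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var cr countResponse
	if err := json.NewDecoder(resp.Body).Decode(&cr); err != nil {
		return false, err
	}
	return cr.Count > 0, nil
}

func main() {
	// The webapp endpoint: GET /search?q=some+string -> "true" or "false".
	http.HandleFunc("/search", func(w http.ResponseWriter, r *http.Request) {
		q := r.URL.Query().Get("q")
		ok, err := found(q)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		fmt.Fprintf(w, "%t", ok)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Asking for a count instead of fetching hits keeps each response tiny, which helps when many users are querying at the same time.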

With something like Filebeat (written in Go) and/or Logstash, your data can be constantly flowing into the Elastic cluster and available for searching, often within a few seconds of origin, depending on your design.

https://www.elastic.co/webinars/getting-started-elasticsearch?elektra=home&storm=sub1

Bleve is a full-text search engine and might be a good start, although I cannot find anything about its suitability for GB-sized texts. Primarily it seems to be tailored for Go data structs rather than for large, flat text files (but I have only skimmed the docs, so I could be wrong).
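
For what it is worth, feeding flat text into Bleve mostly means choosing your own document granularity, for example one document per line. A minimal sketch under that assumption, with made-up file and index names, might look like this:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"

	"github.com/blevesearch/bleve/v2"
)

func main() {
	// Create an on-disk Bleve index; the path is arbitrary and
	// bleve.New will fail if it already exists.
	mapping := bleve.NewIndexMapping()
	index, err := bleve.New("lines.bleve", mapping)
	if err != nil {
		log.Fatal(err)
	}
	defer index.Close()

	// Index the big text file one line per document.
	f, err := os.Open("big.txt") // hypothetical input file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Note: very long lines may need a larger scanner buffer.
	scanner := bufio.NewScanner(f)
	for i := 0; scanner.Scan(); i++ {
		doc := struct{ Line string }{Line: scanner.Text()}
		if err := index.Index(fmt.Sprintf("line-%d", i), doc); err != nil {
			log.Fatal(err)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}

	// Ask whether a phrase occurs anywhere: true/false.
	query := bleve.NewMatchPhraseQuery("some specific string")
	req := bleve.NewSearchRequest(query)
	res, err := index.Search(req)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("found:", res.Total > 0)
}
```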

Thanks for the suggestions. Although Go has been my go-to language recently, I think I will go with a Node.js + Elasticsearch solution for this project. Now I just need to figure out how to index a large list of strings in ES.

Now I just need to figure out how to index a large list of strings in ES.

Easy. Once you get the hang of it, you will find Logstash to be a massively powerful and easy-to-use tool for slicing and dicing your strings. There is next to nothing it can't do; you even get the full power of Ruby right in the Logstash runtime if none of the other filters do what you need.

If you want to PM me a small sample of your strings, I would be happy to provide you with a starter Logstash config file showing how to deconstruct, manipulate, and ultimately ship them to Elasticsearch for ingestion.

Also, although the newer versions of the ELK stack offer a lot, you may not need v5 or v6. Elasticsearch 2.x with Logstash 2.4.1 is a formidable combo for full/partial-text search and analytics.
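
If you ever want to do a one-off load without a Logstash pipeline, a manual bulk load against the _bulk REST endpoint also works. This is only a rough Go sketch assuming a reasonably recent Elasticsearch version, with placeholder index and field names ("lines" and "line"):

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
)

// bulkFlush sends one _bulk request containing the buffered actions.
// Note: older Elasticsearch versions (e.g. 2.x) also expect a document
// type in the action metadata or the URL.
func bulkFlush(buf *bytes.Buffer) error {
	if buf.Len() == 0 {
		return nil
	}
	resp, err := http.Post("http://localhost:9200/lines/_bulk",
		"application/x-ndjson", bytes.NewReader(buf.Bytes()))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("bulk request failed: %s", resp.Status)
	}
	buf.Reset()
	return nil
}

func main() {
	f, err := os.Open("big.txt") // hypothetical input file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var buf bytes.Buffer
	scanner := bufio.NewScanner(f)
	for n := 0; scanner.Scan(); n++ {
		// Each bulk action is two NDJSON lines: the action and the document.
		buf.WriteString(`{"index":{}}` + "\n")
		doc, err := json.Marshal(map[string]string{"line": scanner.Text()})
		if err != nil {
			log.Fatal(err)
		}
		buf.Write(doc)
		buf.WriteByte('\n')

		// Flush every few thousand lines to keep requests a reasonable size.
		if n%5000 == 4999 {
			if err := bulkFlush(&buf); err != nil {
				log.Fatal(err)
			}
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	if err := bulkFlush(&buf); err != nil {
		log.Fatal(err)
	}
}
```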

