How to search multiple custom files?

Gabbie · December 3, 2016, 6:41am

So I’m a newbie to Meteor, and I’ve only really had some experience working with PHP/MySQL/MySQLi/HTML on the projects I’ve done over the last few months as a hobby. I’m now in the process of learning Meteor/React so I can start making apps that are more responsive and easier to scale/tinker with, while using less code and being able to do a lot more.

My main issue right now is that I can’t find tutorials on how to search through external files and show the entire line whenever the search term is found in a file (file sizes are anywhere between 1 MB and 5 GB, and strings will probably be ~4 KB at most).

Example:

User searches for “John”

file1.txt contains “John 1” and should be outputted
file2.txt contains “Kimmy 2” and shouldn’t be shown
file3.txt contains “John 2” and “John 3”, and both should be output, etc.

aido179 · December 4, 2016, 5:22pm

Firstly, what you are describing is classically called an Information Retrieval (IR) problem, which is one of the classic hard problems of computer science.

Using JS to do this is probably not your best bet. Meteor (and Node in general) is built specifically as a web server (I use this term loosely) that connects to a structured database and displays lots of small chunks of data. It’s not designed to do heavy computation like text search over gigabytes of data.

That being said, it’s not impossible. And if you are learning…why not give it a try.

To do a text search on an unstructured file, the basic process would be as follows:

1. Open a file.
2. Read it character by character (or word by word, line by line, whatever).
3. Compare your search term to the contents of what you have read.
4. If there is a match, return the position of the match and continue from step 2.
5. Repeat for all the files you have.
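
The steps above can be sketched roughly as follows. This is a minimal illustration in Python rather than Meteor/Node, just to keep it short; the function name `search_files` is made up for this example, and it assumes the files are plain text and that a simple case-sensitive substring match is what you want. It reads line by line, so it returns the whole matching line, which is what the question asks for.

```python
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def search_files(paths: Iterable[str], term: str) -> Iterator[Tuple[str, int, str]]:
    """Yield (file name, line number, line) for every line containing term."""
    for path in paths:
        # Iterating over the file object reads one line at a time,
        # so a multi-gigabyte file never has to fit in memory at once.
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                # Plain case-sensitive substring test (an assumption of this sketch).
                if term in line:
                    yield Path(path).name, line_number, line.rstrip("\n")
```

Calling `search_files(["file1.txt", "file2.txt", "file3.txt"], "John")` on the files from the question would yield one entry for file1.txt and two for file3.txt. Every search re-reads every file from disk, so expect this to be slow on the 5 GB end of the range.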

You can improve this process by doing some pre-calculation. If you have the files you want to search, you can read them word by word and build a database of the occurrences of each word in each file. Then when the user searches, you can just query the database and return the necessary info which will be faster due to the indexing on the database.
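
As a rough sketch of that pre-calculation, here is one way to build such a word database with Python's built-in `sqlite3` module. The table name `occurrences`, its columns, and the helper names `build_index` and `lookup` are all invented for this example, and it assumes whole-word, case-insensitive matching is good enough (it will not find "John" inside "Johnson").

```python
import re
import sqlite3

WORD = re.compile(r"\w+")


def build_index(paths, db_path=":memory:"):
    """Record (word, file, line number) for every word in every file."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS occurrences "
        "(word TEXT, file TEXT, line_number INTEGER)"
    )
    # The index on `word` is what makes lookups fast.
    db.execute("CREATE INDEX IF NOT EXISTS idx_word ON occurrences (word)")
    for path in paths:
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                # Store each distinct word of the line once, lower-cased.
                words = {w.lower() for w in WORD.findall(line)}
                db.executemany(
                    "INSERT INTO occurrences VALUES (?, ?, ?)",
                    [(w, str(path), line_number) for w in words],
                )
    db.commit()
    return db


def lookup(db, word):
    """Return (file, line number) pairs where word occurs, in sorted order."""
    rows = db.execute(
        "SELECT file, line_number FROM occurrences "
        "WHERE word = ? ORDER BY file, line_number",
        (word.lower(),),
    )
    return rows.fetchall()
```

The lookup only tells you where the matches are; you would still open the file and read that line to display it. Also keep in mind that an index like this can end up larger than the files themselves.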

The next challenge you should think about is what happens if the user uses more than one word in their search query. What if the two words never occur together? What if they occur in separate files? You could think about building a matrix of documents (files), the words they contain, and the frequency of each word within each document… (warning: this is a rabbit hole… see TF-IDF, TW-IDF and many other IR algorithmic approaches)
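
To give a feel for the rabbit hole, here is a small sketch of one common TF-IDF variant (term frequency divided by document length, times log(N / number of documents containing the term)). The function name `tf_idf_rank` is made up, and for simplicity it assumes the documents are small enough to pass in as a dict of name to text; for big files you would compute the counts from a prebuilt index instead.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def tf_idf_rank(documents, query):
    """Rank documents (name -> text) by the summed TF-IDF of the query words."""
    # Word counts per document, lower-cased.
    counts = {
        name: Counter(w.lower() for w in WORD.findall(text))
        for name, text in documents.items()
    }
    total_docs = len(counts)
    terms = WORD.findall(query.lower())
    scores = {}
    for name, words in counts.items():
        length = sum(words.values()) or 1  # avoid dividing by zero on empty docs
        score = 0.0
        for term in terms:
            containing = sum(1 for c in counts.values() if term in c)
            if containing == 0:
                continue  # term appears nowhere, contributes nothing
            tf = words[term] / length
            idf = math.log(total_docs / containing)
            score += tf * idf
        scores[name] = score
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A document that contains both query words scores higher than one that contains only one of them, and a word that appears in every document contributes nothing, because its IDF is log(1) = 0.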

Any precomputed index relies on the source files never changing, though, which might not be true. If they change, you have to perform the pre-calculation again.
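
One cheap way to notice that a file has changed is to remember its size and modification time from when it was indexed and compare them later. A minimal sketch, with made-up helper names `file_signature` and `files_needing_reindex`, and assuming you store the signatures yourself somewhere (here just a dict):

```python
import os


def file_signature(path):
    """Return (size in bytes, modification time in ns) as a cheap fingerprint."""
    info = os.stat(path)
    return info.st_size, info.st_mtime_ns


def files_needing_reindex(paths, known_signatures):
    """Return the paths whose current signature differs from the stored one.

    known_signatures maps path -> signature recorded at index time;
    files that were never indexed are returned as well.
    """
    return [p for p in paths if known_signatures.get(p) != file_signature(p)]
```

This can miss an edit that leaves both the size and the timestamp untouched. Hashing the contents would be more robust, but it means reading the whole file again, which is not cheap at 5 GB.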

If you were building a production system, I would recommend checking out Apache Lucene and other open source information retrieval and search systems.

robfallows · December 5, 2016, 12:03pm

@Gabbie : If you’re really interested in pursuing this (it’s something I did way back in the day), may I recommend that you get hold of Knuth’s The Art of Computer Programming. Volume 3 (sorting and searching) is gold for this kind of project.