Best practices for optimization of regex searches on server-side?

ffxsam · October 12, 2016, 8:02am

I’m implementing a search where a user can type a partial text string and it’ll use that to search both names of things, and tag names. The code in the publication would just be, for example:

  return Whatever.find({ name: new RegExp(searchString, 'i') });

Besides indexing this field via Whatever._ensureIndex({ name: 1 }), is there anything else I should do to reduce the server load?

robfallows · October 12, 2016, 9:22am

You may not get the performance you expect: https://scalegrid.io/blog/mongodb-regular-expressions-indexes-performance/

MongoDB supports regular expressions using the $regex operator. However these regular expression queries have a downside, all but one type of regex makes poor use of indexes and results in performance problems. For a production server with large amounts of data, a bad regex query can bring your server to its knees.

and from the MongoDB documentation:

For case sensitive regular expression queries, if an index exists for the field, then MongoDB matches the regular expression against the values in the index, which can be faster than a collection scan. Further optimization can occur if the regular expression is a “prefix expression”, which means that all potential matches start with the same string. This allows MongoDB to construct a “range” from that prefix and only match against those values from the index that fall within that range.

ffxsam · October 12, 2016, 5:09pm

Hmm, that’s a bummer. I’ll have to do some more reading up on how to best handle this. One idea off the top of my head is to break the name of the item into pieces, so for example, when they add a new item called “My First Item” it results in this document:

{
  "name": "My First Item is Cool",
  "searchSegments": [
    "My",
    "First",
    "Item",
    "is",
    "Cool"
  ]
}

And then when the user performs a search:

const partialMatch = new RegExp(`^${searchText}`, 'i');
return Items.find({ searchSegments: partialMatch });

EDIT: Actually, even better… store searchSegments as all lowercase, then lowercase the user’s search query, that way the regex i flag is not necessary. I also wonder if MongoDB text indexes help significantly.

ffxsam · October 15, 2016, 10:42pm

BTW, was just thinking: what if I subscribe to all data on the server, and do regex searches on the client side using .find()? I wonder if that’s less expensive, and just shifts more of the processing power to the client?

robfallows · October 17, 2016, 2:17pm

Unless you have (or can publish) a small^* number of documents, any off-loading of regex searching on the client will be swamped by the network traffic as your collection is shipped over. If your collection is large enough, this would also eat up all your browser memory.

* As in fewer than 100.

ffxsam · October 17, 2016, 6:51pm

I read and re-read your post and tried to wrap my head around it. This part trips me up:

Do you mean that if a client-side regex search is performed as the DDP data is coming in over the wire, it will slow the client down? In my case, the regex search is only done as the user types in a search box and hits enter, so the data will have already been cached on the client.

I guess with today’s memory standards, I didn’t even think of this. If I subscribe to 200-300 MongoDB documents, obviously that data is then stored in MiniMongo on the client.

For kicks, I dumped out the data in JSON format. It’s about 109KB, and that’s 251 records.

EDIT: I previously tried Object.bsonsize(db.tracks.find()) but it returned 1KB, which can’t be right.

robfallows · October 18, 2016, 9:00am

No, I mean that in order to perform any minimongo operation on the client, the documents first have to be shipped across. Once they’re on the client you’re fine.

I’d assumed that you were talking about a collection size of thousands or millions of documents. However, 250 docs totalling 109kB (will be a lot more when stored as objects) is not that many. If you need client responsiveness (like filter as you type) then it’s probably managable.

avalanche1 · April 14, 2017, 12:20pm

As far as I can gather - mongo regex search is very slow; mongo $text search doesnt support partial word search. If you have a large collection (10s of thousands and more) I recommend storing it in Elasticsearch instead of Mongo.

ffxsam · April 14, 2017, 2:19pm

Ultimately, my solution was to create word fragments for every item saved/updated. In other words, if someone updates the item’s name to “MongoDB Cheat Sheet”, an array is made of these words and stored on the object:

{
  "name": "MongoDB Cheat Sheet",
  "searchFragments": [
    "mongodb",
    "cheat",
    "sheet"
  ]
}

And then the server-side search is performed like this:

  if (options.searchText) {
    const regex = new RegExp(`^${options.searchText.toLowerCase()}`);
    selector['properties.searchFragments'] = regex;
  }

  return Items.find(selector, { ... });

SkyRooms · April 14, 2017, 3:04pm

So, what if I did this.

Name: “im TOTaly goING to ScrEW with YOU”.

How do we solve that? PHP and MySQL would be super easy. Node… bummer right?

ffxsam · April 14, 2017, 3:52pm

How is that a problem in Node? That stores the fragments as:

["im", "totally", "going", "to", "screw", "with", "you"]

And then when the user searches for “TOTAL” (even in all caps), it gets lower-cased and matches against the array:

Items.find({
  'properties.searchFragments': /^total/
});

And it would match.

SkyRooms · April 14, 2017, 4:42pm

Ahhh I see. So you could also loop through and combine the fragments.

Nice idea. I wish node searching didn’t suck so bad

ffxsam · April 14, 2017, 5:39pm

It’s not necessarily Node as much as it is a MongoDB thing.

SkyRooms · April 14, 2017, 6:08pm

Ah sorry, yes, Mongo. New terms lol.

But surely there are enterprise level systems using MongoDB that have a solution to a simple search?!

ffxsam · April 14, 2017, 6:20pm

Someone else mentioned Elasticsearch… I imagine for text-heavy content, other solutions might be used. I’m not really sure though.

praveen007 · May 17, 2018, 6:01am

"name " : “Anand vihar”;
const partialMatch = new RegExp(^${name}, ‘i’);
user.find({$or : [{“name” : partialMatch},{“username” : partialMatch}]})

in this RegExp, I searched for the first word only(means Anand).coming search data.
If I search with the Vihar I was unable to get the result which I required.
please can anyone help me over here.

robfallows · May 17, 2018, 8:52am

The ^ character means “at the start of the string”. So, Anand is at the start of the string, but Vihar isn’t.

praveen007 · May 17, 2018, 9:18am

Thanks for the reply. can you please give me some example, where i need to change the character(^) to get the result when i search with vihar

robfallows · May 17, 2018, 12:19pm

Just remove it.

nikhilgoswami · September 17, 2018, 4:40pm

Using something like Items.find({"description": { '$regex': new RegExp(variable, "i") }}).fetch();

should help. I am not sure about performance though. Reference here