Slow mongo query on createUser

kimar · January 4, 2021, 7:23pm

When calling createUser, accounts-password does a case insensitive query on the email address to check if its already taken (see here).

The query looks like this:

db.getCollection('users').find({
    "$and": [
      {
        "$or": [
          {
            "emails.address": {
              "$regex": "^mari",
              "$options": ""
            }
          },
          {
            "emails.address": {
              "$regex": "^marI",
              "$options": ""
            }
          },
          {
            "emails.address": {
              "$regex": "^maRi",
              "$options": ""
            }
          },
          {
            "emails.address": {
              "$regex": "^maRI",
              "$options": ""
            }
          },
          {
            "emails.address": {
              "$regex": "^mAri",
              "$options": ""
            }
          },
          {
            "emails.address": {
              "$regex": "^mArI",
              "$options": ""
            }
          },
          {
            "emails.address": {
              "$regex": "^mARi",
              "$options": ""
            }
          },
          {
            "emails.address": {
              "$regex": "^mARI",
              "$options": ""
            }
          },
          {
            "emails.address": {
              "$regex": "^Mari",
              "$options": ""
            }
          },
          {
            "emails.address": {
              "$regex": "^MarI",
              "$options": ""
            }
          },
          {
            "emails.address": {
              "$regex": "^MaRi",
              "$options": ""
            }
          },
          {
            "emails.address": {
              "$regex": "^MaRI",
              "$options": ""
            }
          },
          {
            "emails.address": {
              "$regex": "^MAri",
              "$options": ""
            }
          },
          {
            "emails.address": {
              "$regex": "^MArI",
              "$options": ""
            }
          },
          {
            "emails.address": {
              "$regex": "^MARi",
              "$options": ""
            }
          },
          {
            "emails.address": {
              "$regex": "^MARI",
              "$options": ""
            }
          }
        ]
      },
      {
        "emails.address": {
          "$regex": "^mariXXXX@XXX\\.com$",
          "$options": "i"
        }
      }
    ]
  })

My app has about 400K users, and this query can take up to 6000ms (especially when the email address starts with a common first name).

My DB is hosted on Atlas, and of course has an index on the emails.address field.

Has anybody else experienced this? Am I missing something?

paulishca · January 4, 2021, 8:06pm

I think you are not missing anything. I did a test on an app I have with 6 users and I got 160ms response time on a user create, most of it async…probably waiting for the conditional search.
I am afraid this is just a bad and outdated design and I am really curious to see what others say.

kimar · January 4, 2021, 8:34pm

Thanks for chiming in @paulishca

My plan is to just fork the package to have a case sensitive lookup, since I already lowercase email addresses inside the onCreateUser hook. Are there any downsides to this approach?

Would it make sense to patch the original package to allow a case sensitive query? For instance, adding a flag in Accounts.config({'caseSensitiveEmails': true}) ?

paulishca · January 4, 2021, 8:40pm

I think case sensitive emails make no sense since they … don’t exist. All data that goes in Mongo should be validated, trimmed, cleaned before writing to DB.
I’d really be interested to know your benchmarking with the email index sorted alphabetically, case insensitive (all low case).

paulishca · January 4, 2021, 8:45pm

on onCreateUser perhaps we could save the domain as well and index by it. But with 500.000 users you would need a lot of RAM just for these all these indexes.
Then find for {domain: ‘domain.com’, email: ‘dj.bobo@domain.com’}. I am thinking emails of a domain would only be searched within the emails that contain that domain which may result in massive improvements especially for uncommon addresses like business emails.
Not two indexes but one index that includes both domain and email.

kimar · January 4, 2021, 9:02pm

By “case sensitive lookup”, I mean we assume all emails are lowercase (which they are) so we can just do Meteor.users.find({'emails.address': ... }) which results in the query taking 60ms instead of 6000ms.

paulishca · January 4, 2021, 9:07pm

that is also what I understood :). Was saying two replies back that yes, data should be already sanitized before hitting into MongoDB. Will it take 60ms? I rally don’t see an issue with the concept, unless I am in that spot where I don’t know what I don’t know.

paulishca · January 4, 2021, 9:16pm

This is email index on the Users collection. So already has unique: true which means in theory you should not be able to insert a record with the same email. Would that result in a error that can be handled and notified back to the user?!
What I mean, maybe you don’t even search for the email, just attempt to insert it and if it fails based on uniqueness then … it exists …
But if your index is corrupted … big drama …

rjdavid · January 4, 2021, 11:38pm

We do not use this package but we do lowercase all emails before saving to optimize the query to a simple match.

That existing algorithm to check for existing email looks like a very complex solution to a simple problem

peterfkruger · January 5, 2021, 11:41am

I think the only reasonable solution would be to fix this once and for all directly in accounts:password. Email would henceforth be stored lowercase only, period.

paulishca · January 5, 2021, 1:07pm

Just re-copied the word because I agree with it more than the average of my general agreements

paulishca · January 6, 2021, 2:46pm

I will open an github issue for this. Could we agree this is a performance bug and not a feature request?
The argument here would be that usernames and emails should only be low case enforced by the server and in UI the developer is free to deal with it in the input fields.
Let me know please what/how you think.

rjdavid · January 6, 2021, 2:53pm

The fix will be a breaking change or will require a database update script. Not sure if it has been done before

paulishca · January 6, 2021, 3:00pm

Perhaps a second package should do it? So that the user can do the DB migration and switch the packages or start with the new one if new project… Would that be too cumbersome?!

peterfkruger · January 6, 2021, 3:19pm

I consider a second package for the data migration a good solution. If needed, it can be added temporarily and invoked in Meteor.startup on the server. Subsequently both the invocation and the package itself can be removed.

Even if both stay wouldn’t normally do much harm. If the migration script is properly written, it wouldn’t do any actual update in the Users collection if all email is already lowercase.

On that note, the migration script shall ideally use bulk update for the best performance, which is crucial for systems that may have thousands of users.

paulishca · January 6, 2021, 3:35pm

I was thinking to have a new accounts package and leave it up to the user to manage the DB migration, perhaps provide a best practice document.
Let’s say:

“accounts-password” old. (already in use).
Perform DB migration (within Meteor or with Mongo scripts) - developer’s deal
“accounts-password” old still works ok.
Switch to “accounts-password-new”.

peterfkruger · January 6, 2021, 3:47pm

You’re right, that should work.

rjdavid · January 6, 2021, 3:57pm

I remember something was done with react-meteor-data before wherein version 2 was a breaking change. The user must explicitly upgrade to version 2 and would not be automatically upgraded with the usual meteor update

Can that work in this case?

I think the breaking change was the version of react supported

mrrk · January 6, 2021, 11:38pm

I think the root cause is that the original query doesn’t make use of indexes. Using a regex will result in a collscan, even if the property is indexed. Instead you need to specify an index with a collation, and then specify the collation when querying; there’s info in the Mongo docs about it here:

In our app we have a separate collection for user profiles, and we perform a case-insensitive lookup on the email by specifying an index with the following collation:

// when creating the indexes 
const collation = {
    locale: 'en', 
    strength: 1, // case-insensitive
};

And then we query like this:

Profiles.rawCollection.find({ email }).collation(collation)

So to fix this you’d need to change the default index on the Users collection to use a collation, and then you’d need to modify the query to use the collation when searching so the search is case-insensitive.

peterfkruger · January 6, 2021, 11:40pm

That’s a brilliantly elegant solution! Plus it works without data migration (save the changing of the default index)