Speeding up DB queries on server

cspecter · December 26, 2017, 10:13pm

I have a server-side method which computes various statistics by comparing two failry large collections (100k documents each), one a collection of sales data and the other contains a calendar. The method runs a loop about 3000 times (a user selected set of days). Each loop it might need to do 5-6 fetch and 4-5 count queries and while each query doesn’t take that much time, in aggregate they slow down the whole process. Currently a fetch is taking about 6ms and a count 4ms, making each loop take around 55ms in total or almost 3 minutes for the whole method. While these seem quick it’s not quick enough, especially with multiple users asking for data. I’m looking for advice on how speed this whole process up.

I’m running Meteor 1.6.1 hosted on a Digital Ocean droplet using a a 3-member MongoDB deployment also hosted on Digital Ocean (droplets with 2GB ram each).

I have tried the following:

Created an index for each query
Returning only needed fields in each query
Reduced number of queries to minimum needed for the various operations
Ran a $explain on each query to make sure it’s hitting the relevant index

If I run each query in the Mongo shell, they all report less than 1ms response time with a fetch, but in my Meteor app it’s about 5-7ms for the fetch. I’ve tried pre-loading some of the data into an in-memory array which definitely runs quite a bit faster (only taking around 15 seconds to complete the whole method), but is impractical if I have to load everything in both collections. I’ve also tried MongoDB hosting on Compose and MLab just to see if it’s my DO deployment, but both has worse response times).

Is this something where Redis could speed things up, or is Redis mostly used for client-server interactions? Is there an in-memory DB solution for server-side use? Would Grapher help in this case? Help appreciated.

//Data schema example (simplified)
Locations = {
    location: {
        id: "ID",
        date: "2017-01-01"
    },
    items: 1234,
    cost: 1234,
    expenses: 1234
}

Calendar = {
    date: "2017-01-01",
    open: true
}

//Query examples
let costOnDay = Locations.findOne({"location.id": "ABCD1234", "location.date": "2017-01-01"}, {fields: {store: 1, cost: 1}});

let expenses = Locations.find({"location.id": "ABCD1234", "location.date": {$lte: "2017-01-01"}}, {sort: {"location.date": -1}, limit: 120, fields: {store: 1, expenses: 1}}).fetch().reverse();

let daysOpen = Calendar.find({date: {$gte: "2017-01-01", $lte: "2017-02-01"}, open: true}, {sort: {date: 1}}).count();

distalx · December 27, 2017, 7:43am

Hi,

In your case it will require a more details to pin point the exact cause of the delay.

I would suggest, You should give following two things a try before adding a Redis as an overhead.

Tinker with your data model

In mongodb half of the time performance can be improved by just modifying your data model.

And what data model use is depends on the nature of your application, like would it dealing with the heavy writes or heavy reads!

With Normalization (relational data model) You get good performance for your writes, and with Distributed Data Model You get good performance for your reads.

Also Sometimes Distributing the Data among collections reduce the steps of the computation.

Please checkout the following articles for more details

https://www.mongodb.com/blog/post/thinking-documents-part-2

If possible make use of aggregation framework

I find Aggregation Pipeline extremely useful when it comes to aggregation and transformations of documents. I have used meteor-aggregate package for that in few projects.

https://docs.mongodb.com/manual/aggregation/

raphaelarias · December 27, 2017, 10:55am

Do you believe you could fairly the same, or at least minimise queries using the aggregation pipeline of Mongo?

nlammertyn · December 27, 2017, 11:22am

I’m assuming you are running your loops in series, as in one-by-one?

If each loop can be isolated from the rest (no dependencies on previous loops), it’s probably going to speed up the process a lot if you try to run the loop in a concurrent manner.

The reasoning is that if you run them in a series, you’re compounding various latencies (network, database lookup, etc.).

If however you run your loops in parallel with let’s say a concurrency of 20-100 (you’ll have to experiment to get a good number that works for you) you’re not blocking your loops as much, as soon as one finishes another can start. It’s essentially the same async concept as the event loop which is at the core of how nodejs works.

the async npm package has a built-in queue method that makes working with concurrency a lot easier: https://caolan.github.io/async/docs.html#queue

And yeah, reducing the number of db queries, perhaps with the use of aggregation, will likely have a large impact as well.

cspecter · December 27, 2017, 3:51pm

My method is processing a set of data and returning an array of objects for display in various graphs to the user. The loops are running in series as each loop depends on some computed data from the previous loop. For example, over a certain range of dates it might load in some financial data then compute various derivative analytics, in subsequent loops it then might compute the number of days with certain flags since those values crossed some threshold. I am already running different instances of each method in their own workers, so that I’m not blocking out different users from running the same method, but I don’t think I could parallelize the method itself though because of the branching nature of some of these calculations.

Now that I look at the logs, I think that I am running up against latency with all these calls hitting the DB at the same time. I will look into denormaizing my collections even more, maybe pre-processing the data so I can make less queries. If that doesn’t work I will probably need to look at managing some Redis deployment.

Thanks for the tips!

cspecter · January 30, 2018, 10:40pm

In case it helps anyone else, here is where I netted out on this issue. I tried several different solutions. Aggregations proved to be too unwieldy. I then tried loading all my data into a JS array and performing all my calculations using Lodash. As expected this had a dramatic speed increase, at least until the array had a few thousand members, then performance was no better that my Mongo calls. After about 10,000 docs array speed was much worse. The obvious answer would have been to use Minimongo, but as is well know, Minimongo doesn’t include indexes. Speed with Minimongo was about 10x worse than my normal queries.

I ended us using LokiJs. LokiJs is an in-memory DB written in node compatable JS. It uses mongo syntax for queries and includes indexing. It’s also blazing fast. I wrote an interpreter that converts my Meteor call syntax into a Loki query, so I didn’t have to change any of the logic in my method. The collection I’ve been working with is around 200k documents and around 49mb in size, so could easily sit in memory. LokiJs v1 does have some limitations I had to work around. You can only have basic indexes, no compound indexes and you can’t index on nested keys. It will also only use one index per query, so I had to reformat a couple calls to use the fastest index. LokiJs2 looks like it will fix many of these issues.

Speed was dramatically increased. My original method using regular Mongo queries took about 180s to complete. With LokiJs it takes about 40s and I could probably get that down by being a little smarter with what data I’m asking for and how I optimize each query. I’m looking into using it as a Minimongo replacement on the client side as well. I highly recommend checking it out.