Updating a large number of documents often - MongoDB high CPU


#1

Hello,

What is the best way to update a large number of small documents that change often?
Right now I am updating around 15k documents every 5 minutes.

I have a document structure like this:

> db.streams.findOne()
{
	"_id" : "8S5d2w5YZBe9oCWL4",
	"channel" : "ECTVLoL",
	"title" : "Bienvenue sur l'Eclypsia TV LOL",
	"game" : "League of Legends",
	"followers" : 1663,
	"channel_url" : "http://hitbox.tv/ectvlol",
	"viewers" : 748,
	"avatar" : "http://edge.vie.hitbox.tv/static/img/channel/ECTVLoL_550fe222af2c4_small.png",
	"timestamp" : ISODate("2015-10-27T13:12:01.362Z"),
	"thumbnail" : "http://edge.vie.hitbox.tv/static/img/media/live/ectvlol_mid_000.jpg",
	"service" : "hitbox",
	"online" : true
}
> 

They change quite frequently, and I can identify a document by matching its channel and service properties.
So for now I am updating them like this:

data.forEach(function (item) {
  Streams.upsert(
    {
      service: 'twitch',
      channel: item.channel.display_name
    },
    {
      $set: {
        title: item.channel.status,
        game: item.game,
        avatar: item.channel.logo,
        followers: Number(item.channel.followers),
        viewers: Number(item.viewers),
        timestamp: moment().toDate(),
        thumbnail: item.preview.medium,
        channel_url: item.channel.url,
        online: true
      }
    }
  );
});

It seems to go better after adding an index:

> db.streams.createIndex({ "service": 1, "channel": 1})
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 1,
	"numIndexesAfter" : 2,
	"ok" : 1
}

But CPU on the 1 GB DO droplet is still at 100%.
And I have one more index for full-text search:

> db.streams.createIndex({ "channel": "text", "title": "text", "game": "text"})
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 2,
	"numIndexesAfter" : 3,
	"ok" : 1
}
> 

I am a little scared to also add a YouTube grabbing package, given the already high CPU usage by MongoDB at the moment.
Am I missing an index, or is there a more effective way to update?
Should I fetch each document first and update only the changes, or is upsert already nicely optimized for that?
BTW, basic full-text search as it is now is up at http://shocki.tv
It indexes twitch.tv, hitbox.tv and livecoding.tv; there are some “meteor” or “meteorjs” streams from time to time.
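
One thing that might be worth trying for the update loop above is sending the upserts as a single unordered batch instead of one call per item. The sketch below only builds the operation list; `buildStreamOps` and the `makeId` parameter are hypothetical names, and it assumes the MongoDB driver bundled with the Meteor version in use exposes `bulkWrite` through `rawCollection()`. Because the raw driver would otherwise generate ObjectID values on insert, the sketch sets a string `_id` via `$setOnInsert` to stay close to the documents shown earlier.

```javascript
// Build one batch of upsert operations from the fetched Twitch items.
// `makeId` is a hypothetical id generator (for example Random.id in Meteor).
function buildStreamOps(service, data, now, makeId) {
  return data.map(function (item) {
    return {
      updateOne: {
        // Same selector as the per-item upsert, served by the
        // { service: 1, channel: 1 } index.
        filter: { service: service, channel: item.channel.display_name },
        update: {
          $set: {
            title: item.channel.status,
            game: item.game,
            avatar: item.channel.logo,
            followers: Number(item.channel.followers),
            viewers: Number(item.viewers),
            timestamp: now,
            thumbnail: item.preview.medium,
            channel_url: item.channel.url,
            online: true
          },
          // Only applied when the upsert inserts a new document.
          $setOnInsert: { _id: makeId() }
        },
        upsert: true
      }
    };
  });
}

// Assumed usage on the server (not verified against a specific Meteor release):
// var ops = buildStreamOps('twitch', data, new Date(), Random.id);
// Streams.rawCollection().bulkWrite(ops, { ordered: false });
```

Whether this lowers CPU on a small droplet would need measuring; it mainly cuts the number of round trips, not the index maintenance each write causes.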


#2

Wouldn’t it be a better strategy to query the services directly on the client for viewers/avatar/timestamps etc. and poll the results into a client-only collection for sorting etc., instead of keeping a server cache?


#3

In an ideal world where the external APIs let you sort by your own criteria and return only the fields you ask for, maybe :smiley: But these APIs don't have such features.
Also, I would have to send my API key with every request, and if every user fetched the whole list for himself, it would spam the APIs quite a lot.


#4

This may not be a useful reply, but have you considered switching to CouchDB? With my (limited) understanding, CouchDB should be less CPU intensive under a lot of queries.

This is just a thought, to be fair. It may not have the desired effect on a small DO droplet such as yours.


#5

Right… pesky API keys… BTW, I am not talking about sorting with the API. On the client you can do new Mongo.Collection(null) to create a client-side collection, which you can treat the same as any other collection but which is not synced to the server.

What are the limits on the API? Are there no public APIs available?


#6

I was only asking whether this can be optimised somehow in MongoDB. If not, I would push the data into Elasticsearch and keep only the service+channel reference in Mongo, so I can map follows/likes/notifications to particular streams.


#7

minimongo does not support text indexes, which are the main part of the functionality.
And there are no known exact numbers, just a note that there are rate limits :smiley:


#8

There is no need for a text index in minimongo. What I am proposing is to cache only the “static fields” on the server for searchability. On the client you could then fetch the rest of the data, which I guess is constantly changing (viewers, online, followers etc.). But if API keys are in use, that is not a viable solution. Maybe an RPC with this.unblock() so it won't block. Since you would be running it only for displayed items, it would even lower the strain on the external services.


#9

Well, online is managed by me: I update the timestamp for every channel I get from the external API, and channels that have not been updated for ~15 minutes are marked as offline.
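
The offline sweep described here could be done as a single multi-document update rather than per document. A minimal sketch, assuming the `online` and `timestamp` fields from the document shown in post #1; `buildOfflineSweep` is a hypothetical helper name, and the 15-minute window is the approximate value mentioned above.

```javascript
// Build the selector/modifier pair that marks stale channels as offline.
function buildOfflineSweep(now, maxAgeMs) {
  // Anything last seen before this moment counts as stale.
  var cutoff = new Date(now.getTime() - maxAgeMs);
  return {
    // Only touch documents that are still flagged online.
    selector: { online: true, timestamp: { $lt: cutoff } },
    modifier: { $set: { online: false } },
    options: { multi: true }
  };
}

// Assumed usage on the Meteor server:
// var sweep = buildOfflineSweep(new Date(), 15 * 60 * 1000);
// Streams.update(sweep.selector, sweep.modifier, sweep.options);
```

Restricting the selector to `online: true` keeps the sweep from rewriting channels that are already offline. An index covering `online` and `timestamp` might help here, but that is an assumption to verify against the real workload.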
For followers I don't need to update every time, that is correct; viewers are good to have up to date.
Titles change from time to time too.
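
Since only some fields change between polls, one way to explore the "update just the changes" idea from post #1 is to compare each incoming item with the stored document and `$set` only what differs. This is a rough sketch; `changedFields` is a hypothetical helper, and it assumes the existing documents have already been loaded (for example with one `find` per service), which has its own cost.

```javascript
// Return only the fields of `incoming` whose values differ from `existing`.
// Dates are compared by time value, everything else with strict equality.
function changedFields(existing, incoming) {
  var changes = {};
  Object.keys(incoming).forEach(function (key) {
    var before = existing ? existing[key] : undefined;
    var after = incoming[key];
    var same = (before instanceof Date && after instanceof Date)
      ? before.getTime() === after.getTime()
      : before === after;
    if (!same) {
      changes[key] = after;
    }
  });
  return changes;
}

// Assumed usage: skip the write entirely when nothing changed.
// var changes = changedFields(storedDoc, newValues);
// if (Object.keys(changes).length > 0) {
//   Streams.update(storedDoc._id, { $set: changes });
// }
```

Note that with the current scheme the timestamp changes on every poll, so every document would still be written; the diff only shrinks each update unless the timestamp is handled separately.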