Updating a large number of documents often - mongo high CPU



What is the best way to frequently update a large number of small, often-changing documents?
For now I am updating around 15k every 5 minutes.

I have a document structure like this:

> db.streams.findOne()
{
	"_id" : "8S5d2w5YZBe9oCWL4",
	"channel" : "ECTVLoL",
	"title" : "Bienvenue sur l'Eclypsia TV LOL",
	"game" : "League of Legends",
	"followers" : 1663,
	"channel_url" : "http://hitbox.tv/ectvlol",
	"viewers" : 748,
	"avatar" : "http://edge.vie.hitbox.tv/static/img/channel/ECTVLoL_550fe222af2c4_small.png",
	"timestamp" : ISODate("2015-10-27T13:12:01.362Z"),
	"thumbnail" : "http://edge.vie.hitbox.tv/static/img/media/live/ectvlol_mid_000.jpg",
	"service" : "hitbox",
	"online" : true
}

They change quite frequently, and I can identify a document by matching the channel and service properties.
So for now I am updating them like this:

// Streams is the Meteor collection backed by db.streams
data.forEach(function (item) {
  Streams.update(
    { service: 'twitch', channel: item.channel.display_name },
    {
      $set: {
        title: item.channel.status,
        game: item.game,
        avatar: item.channel.logo,
        followers: Number(item.channel.followers),
        viewers: Number(item.viewers),
        timestamp: moment().toDate(),
        thumbnail: item.preview.medium,
        channel_url: item.channel.url,
        online: true
      }
    },
    { upsert: true }
  );
});
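Since each pass touches ~15k documents, one thing worth trying is batching the per-item updates into a single `bulkWrite` call instead of 15k individual update commands. This is only a sketch, assuming the Node.js MongoDB driver (reachable in Meteor via `Streams.rawCollection()`, where `Streams` is a placeholder name for the collection) and the same `data` array as above:

```javascript
// Build one updateOne op per stream; Number() coercion and field names
// mirror the update loop above. `timestamp` uses new Date() in place of
// moment().toDate() so the sketch has no external dependency.
function buildBulkOps(data) {
  return data.map(function (item) {
    return {
      updateOne: {
        filter: { service: 'twitch', channel: item.channel.display_name },
        update: {
          $set: {
            title: item.channel.status,
            game: item.game,
            avatar: item.channel.logo,
            followers: Number(item.channel.followers),
            viewers: Number(item.viewers),
            timestamp: new Date(),
            thumbnail: item.preview.medium,
            channel_url: item.channel.url,
            online: true
          }
        },
        upsert: true
      }
    };
  });
}

// Usage sketch (raw driver):
// streamsCollection.bulkWrite(buildBulkOps(data), { ordered: false });
```

`ordered: false` lets the server apply the ops in any order and keep going past individual failures, which is usually what you want for independent upserts.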

It seems to go better after adding an index:

> db.streams.createIndex({ "service": 1, "channel": 1})
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 1,
	"numIndexesAfter" : 2,
	"ok" : 1

But CPU on a 1 GB DO droplet is still at 100%.
And I have one more index, for full-text search:

> db.streams.createIndex({ "channel": "text", "title": "text", "game": "text"})
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 2,
	"numIndexesAfter" : 3,
	"ok" : 1

I am a little scared to also add a YouTube grabbing package, due to MongoDB's already high CPU usage atm.
Am I missing an index, or is there a more efficient way to update?
Should I fetch each document first and update just the changes, or is upsert already well optimized for that?
BTW, the basic full-text search as it is now is up at http://shocki.tv
It indexes twitch.tv, hitbox.tv and livecoding.tv; there are some “meteor” or “meteorjs” streams from time to time.
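The “fetch the document first and update just the changes” idea from the question can be sketched as a small diff helper. This is only an illustration (the function name is made up): keep a cached copy of each document, compute the fields that actually differ, and skip the write entirely when nothing changed, so unchanged streams cost no write at all:

```javascript
// Build a minimal $set from only the fields whose values differ between
// the cached document and the freshly fetched data. Shallow comparison
// is enough here because all the changing fields are scalars.
function changedFields(existing, incoming) {
  var set = {};
  Object.keys(incoming).forEach(function (key) {
    if (existing[key] !== incoming[key]) {
      set[key] = incoming[key];
    }
  });
  return set;
}

// Usage sketch:
// var set = changedFields(cachedDoc, freshFields);
// if (Object.keys(set).length > 0) {
//   Streams.update({ _id: cachedDoc._id }, { $set: set });
// }
```

Whether this wins depends on the workload: you trade ~15k writes for ~15k cached reads plus only the writes that matter, which helps most when the majority of streams don't change between polls.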


Wouldn’t it be a better strategy to query the services directly for viewers/avatar/timestamps etc. on the client, polling the results into a client-only collection for sorting etc., instead of keeping a server-side cache?


In an ideal world where the external APIs let you sort by your own criteria and return only the fields you ask for, maybe :smiley: But these APIs don’t have such features.
Also, I have to add my API key to every request, and if every user fetched the whole list himself, it would spam the APIs quite a lot.


This may not be a useful reply, but have you considered switching to CouchDB? With my (limited) understanding, CouchDB should be less CPU intensive under a lot of queries.

This is just a thought, to be fair. It may not have the desired effect on a small DO droplet such as yours.


Right… pesky API keys… BTW, I am not talking about sorting with the API: right on the client you can do new Mongo.Collection(null) to create a client-side collection, which you can treat the same as any other collection except that it is not synced to the server.

What are the limits on the API? No public APIs available?


I was asking only whether this can be optimised somehow in MongoDB. If not, I would push the data into ElasticSearch and keep just the service+channel reference in Mongo, so I can map follows/likes/notifications to particular streams.


Minimongo does not support text indexes, which are the main part of the functionality.
And there are no exact numbers known, just a note that there are rate limits :smiley:


There is no need for a text index in minimongo. What I am proposing is to cache only the “static” fields on the server, for searchability, while on the client you just fetch the rest of the data, which is constantly changing I guess (viewers, online, followers etc.). But if API keys are in use, it is not a viable solution. Maybe an RPC with this.unblock() so it won’t block. Since you would be running it only for the displayed items, it would even lower the strain on the external services.


Well, online is managed by me: I update the timestamp for every channel I get from the external API, and if a channel has not been updated for ~15 minutes it is marked as offline.
For followers I don’t need to update every time, that is correct; viewers are good to have up to date.
Titles change from time to time too.
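The “mark stale channels offline” pass described above can be done in one multi-update instead of per-document writes. A sketch, assuming the collection and field names from the question (the helper name is made up):

```javascript
// Build the selector for channels that are still flagged online but whose
// timestamp is older than the cutoff (now - maxAgeMs). Keeping it as a
// function makes the cutoff arithmetic easy to test in isolation.
function offlineQuery(now, maxAgeMs) {
  return {
    online: true,
    timestamp: { $lt: new Date(now.getTime() - maxAgeMs) }
  };
}

// Usage sketch, run periodically (15 minutes = 15 * 60 * 1000 ms):
// db.streams.update(
//   offlineQuery(new Date(), 15 * 60 * 1000),
//   { $set: { online: false } },
//   { multi: true }
// );
```

Because the selector filters on `online: true`, already-offline documents are never touched, so the pass stays cheap between polls.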