What happens to Meteor during a MongoDB failover event?

I’m connecting my Meteor app to a 3-member replica set: one primary and two secondaries. I have readPreference=primaryPreferred and w=majority in my MONGO_URL (following these guidelines).

If the primary goes away, will the app automatically start sending writes to whichever secondary is elected primary? How long might this take? Up to a minute? Could I lose data during the failover?

Can Meteor do anything with secondaries?

My MONGO_URL includes all three set members. Does the order of these three hosts matter?
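For reference, my connection string is roughly this shape (hosts, credentials, and database name below are placeholders, not my real values):

MONGO_URL=mongodb://user:password@mongo1.example.com:27017,mongo2.example.com:27017,mongo3.example.com:27017/mydb?readPreference=primaryPreferred&w=majority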

Failover seems unlikely, but I just saw it happen when I used the MongoDB Cloud Manager to upgrade my replica set to 3.0.6.

3 Likes

I’m also interested in this, as I also just updated my Mongo version (to 3.0.8). The upgrade process causes the primary to fail over: it updates the primary and then restarts it. During the process my logs filled up with exceptions from observe. Not finding the primary is hardly cause for an exception, is it? And what happens if the primary fails for any longer period of time?
In my case I’m using Kadira, so I’m not sure whether their wrapper around it has anything to do with it. It certainly doesn’t look like it.

Internal Server Error - Exception while polling query {"collectionName":"drinks","selector":{"_id":"sPwqQevE4EPgs6xLb"},"options":{"transform":null,"limit":1,"fields":{"createdAt":0,"updatedAt":0}}}: Error: No replica set primary available for query with ReadPreference PRIMARY
  at Object.Future.wait (/opt/emp/app/programs/server/node_modules/fibers/future.js:395:16)
  at [object Object]._.extend.nextObject (packages/mongo/mongo_driver.js:986:1)
  at [object Object]._.extend.forEach (packages/mongo/mongo_driver.js:1020:1)
  at [object Object]._.extend.getRawObjects (packages/mongo/mongo_driver.js:1069:1)
  at [object Object]._.extend._pollMongo (packages/mongo/polling_observe_driver.js:147:1)
  at [object Object].proto._pollMongo (packages/meteorhacks_kadira/lib/hijack/wrap_observers.js:63:1)
  at Object.task (packages/mongo/polling_observe_driver.js:85:1)
  at [object Object]._.extend._run (packages/meteor/fiber_helpers.js:147:1)
  at packages/meteor/fiber_helpers.js:125:1
  - - - - -
  at [object Object].ReplSet.checkoutReader (/opt/emp/app/programs/server/npm/npm-mongo/node_modules/mongodb/lib/mongodb/connection/repl_set/repl_set.js:619:14)
  at Cursor.nextObject (/opt/emp/app/programs/server/npm/npm-mongo/node_modules/mongodb/lib/mongodb/cursor.js:748:48)
  at [object Object].fn [as synchronousNextObject] (/opt/emp/app/programs/server/node_modules/fibers/future.js:89:26)
  at [object Object]._.extend.nextObject (packages/mongo/mongo_driver.js:986:1)
  at [object Object]._.extend.forEach (packages/mongo/mongo_driver.js:1020:1)
  at [object Object]._.extend.getRawObjects (packages/mongo/mongo_driver.js:1069:1)
  at [object Object]._.extend._pollMongo (packages/mongo/polling_observe_driver.js:147:1)
  at [object Object].proto._pollMongo (packages/meteorhacks_kadira/lib/hijack/wrap_observers.js:63:1)
  at Object.task (packages/mongo/polling_observe_driver.js:85:1)
  at [object Object]._.extend._run (packages/meteor/fiber_helpers.js:147:1)

1 Like

+1. I was also wondering what happens in such a case; I’m running a replica set on Compose.

My Meteor apps now connect to MongoDB 3.2.1, using the WiredTiger storage engine, on a 3-member replica set with no sharding. Each replica set member is an AWS EC2 instance. I monitor the replica set using MongoDB Cloud Manager.

I did some tests during low traffic. Failover seems to take a few seconds. I see some of these errors in my Meteor app logs:

Got exception while polling query: MongoError: not master and slaveOk=false

These errors abate after ~10 seconds from the time of failover.
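For anyone who wants to reproduce this, the simplest test I know of is to force a step-down from the mongo shell and watch the app logs (this is just the standard replica set command, run while connected to the current primary):

rs.stepDown(60)   // the primary steps down and won't seek re-election for up to 60 seconds while a new primary is elected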

I don’t have answers to the other questions in my original post.

1 Like

+1. I am running MongoDB 3.2 on Compose, which provides two mongos routers with no replica set exposed, and handling failover does not seem to be possible with Meteor at the moment.

Also having problems with this on mLab, anyone managed to make the failover work?

Did anyone ever get to the bottom of this? I get hit by this occasionally and have to restart Meteor so that it can reconnect to the database which is far from ideal.

I’m using Atlas. When AWS was unstable and for some reason we kept losing the connection with the primary, Mongo elected one of the secondaries. At times this happened many times within 10 minutes.

In the server log I would sometimes see an error along the lines of: “Connection 15 with DB lost”.

In general the app worked just fine, but there were two cases where I had to kill the container: it was unresponsive, so while it was online you couldn’t actually use the app.

Overall, it worked just fine.

PS: No sharding, 3-member replica set.

I’m trying a change to our database connection string on our test server so that it doesn’t use a replica set. We blocked access to Compose and saw the usual timeout errors appear in the logs, and when we unblocked the IP address the Meteor application reconnected to the database.
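So on the test server the connection string is now just a single host with no replica set options, something like this (placeholder values, not our real credentials):

MONGO_URL=mongodb://user:password@mongo1.example.com:27017/database_name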

It’s not ideal, because Compose do occasionally have failovers, but at least I get emailed when that happens, versus, for example, Linode having a brief outage which breaks the app silently.

I’ll be doing more testing before we apply this to the production system but it seems like a good compromise until a proper solution is (hopefully) found.

I could use some forum help on an issue in Meteor 1.10.2 that I’m running into.

I use Mongo Atlas and whenever they do a maintenance update, my app’s UI starts failing to do real-time synchronization. Their maintenance update seems to take down the master, which causes problems in my app.

Users can still do things in the app, but Meteor stops returning updates from the database to the UI, which of course causes functionality errors and confusion. Upon doing a browser refresh, any database updates are then shown, so it’s not a full disconnection from the database.

In my APM logs I see an UnhandledPromiseRejectionWarning: MongoError: not master error that’s rooted in Meteor’s Mongo code.

In my MONGO_URL, I follow Mongo Atlas’s example, but I don’t use the replicaSet parameter. They don’t show it in their example. The various references I’ve seen online say to use the replicaSet parameter in the MONGO_OPLOG_URL, which I’ve disabled because I use redis-oplog.

My MONGO_URL looks like this:

"MONGO_URL": "mongodb+srv://user:password@server.mongodb.net/database?retryWrites=true&w=majority"

I’ve also read that with some Mongo versions/drivers you don’t need to specify the replicaSet parameter. I also found StackExchange posts that say you need to use replicaSet in MONGO_OPLOG_URL so that Meteor knows about it, which could solve the failover issue.
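If I understand those posts correctly, the non-SRV form with an explicit replicaSet parameter would look something like this (all hostnames, credentials, and the set name below are placeholders, since I haven’t actually tried it yet):

"MONGO_URL": "mongodb://user:password@host1:27017,host2:27017,host3:27017/database?replicaSet=myReplSet&readPreference=primaryPreferred&retryWrites=true&w=majority",
"MONGO_OPLOG_URL": "mongodb://oplogUser:password@host1:27017,host2:27017,host3:27017/local?replicaSet=myReplSet&authSource=admin"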

The Meteor Guide doesn’t seem to have any real documentation on the proper current Mongo URL format. I know it depends on the versions of Meteor and Mongo, but there should be more documentation on this, for example which Mongo URL parameters work and which ones don’t.

My mongodb url:

MONGO_URL=mongodb://someuser:somepassword@instance1:27017,instance2:27017,instance3:27017/database_name?replicaSet=rs0

When I call the rs.stepDown() function to switch the primary to a secondary, the app hangs for half a minute and then works again.

Using the mongodb+srv URI format is the standard way to get topology auto-discovery, so you don’t need replicaSet or the full node list in the URI.

This behavior is handled by the mongodb driver itself, and for failover events you need to be prepared with readPreference=primaryPreferred in your URI.

We need to understand that the election can take some time, and in that window there is no primary, so writes will indeed fail momentarily. Your app can still stay up, though, serving reads.
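As a rough sketch of what that means in practice (the collection name, attempt count, and delay below are made up for illustration, not taken from any real app), a server-side write attempted during the election window can be retried instead of letting the "not master" exception bubble up:

import { Meteor } from 'meteor/meteor';

// Hypothetical helper: retry a synchronous server-side insert a few times while there is no primary.
function insertWithRetry(collection, doc, attempts = 3) {
  for (let i = 0; i < attempts; i++) {
    try {
      return collection.insert(doc);        // throws a "not master" style error mid-election
    } catch (err) {
      if (i === attempts - 1) throw err;    // out of attempts, give up and rethrow
      Meteor._sleepForMs(2000);             // server-only pause; give the election time to finish
    }
  }
}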

2 Likes

Something strange: I set up a test database on Atlas and connected to it from a local version of my app, with my above URL in the same format as always.

I then did a Test Failover in Atlas. Everything worked fine: no error messages on my app’s console like there are in production, and the app worked totally fine the whole way through the failover.

I had assumed that when Atlas did its maintenance update to my production database, which lines up with the exact time the errors started occurring (and has happened the same way in the past), the maintenance was triggering a failover. Is the maintenance perhaps causing some other kind of disconnect?

This just happened to my production app again. I updated my Mongo URLs with readPreference=primaryPreferred and have since gone through several Mongo Atlas maintenance updates seemingly without issue. However, one happened Saturday night and it hosed my app all day yesterday with the same error as above. Going to take this up with Atlas and see what exact maintenance they did.

The only fix is to manually restart all my containers and have them connect fresh.

1 Like

@evolross what error message are you seeing when this happens? “Pool was force destroyed?”

We had some severe lag issues in the past, as well as something that sounds similar to what you’ve described.

The solution for us came when we worked with Atlas support and received MongoDB connection strings WITHOUT the +srv.

It seems the extra DNS step was causing intermittent issues. Since then, absolutely no problems on Atlas.
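For anyone else who wants to try it, the non-+srv form is essentially the resolved seed list plus the replica set name; ours looks roughly like this (hostnames, credentials, and the set name here are placeholders for our real cluster):

mongodb://user:password@cluster0-shard-00-00.xxxxx.mongodb.net:27017,cluster0-shard-00-01.xxxxx.mongodb.net:27017,cluster0-shard-00-02.xxxxx.mongodb.net:27017/database?ssl=true&replicaSet=atlas-xxxxx-shard-0&authSource=admin&retryWrites=true&w=majority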

Same errors as above:

My connection URLs are in the format:

mongodb+srv://user:password@server.server.server/database?readPreference=primaryPreferred&retryWrites=true&w=majority

Here’s the response I received from official paid support from MongoDB Atlas:

So they said that, as a starting point, the Node Mongo driver compatibility is off. I do indeed run an older version of Meteor (1.10.2), and according to the changelog, Meteor 1.11 introduces the Mongo 3.6 driver. So that’s a place to start; I’ll update Meteor ASAP in my production app.
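If it helps anyone else verify the same thing, you can print the bundled driver version from the server (for example in meteor shell); as far as I know, MongoInternals.NpmModules is exposed by Meteor’s mongo package for exactly this kind of check:

import { MongoInternals } from 'meteor/mongo';

// Logs the version of the mongodb npm driver that this Meteor build bundles.
console.log(MongoInternals.NpmModules.mongodb.version);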