MongoDB common Q&A notes

Posted 2017-03-07 · Updated 2017-07-08 · Category: mongodb

Upgrades

cluster upgrade from 2.6.9 to 3.0.3 error: no such cmd: _getUserCacheGeneration

Reference

This happens because the config server only gained this command in 2.6.10.


configdb error: bad serverID set in setShardVersion and none

Upgrading to 3.0 resolves this problem.


2.6 primary cannot connect to a 3.0 secondary: errno:111 Connection refused

Reference

The cause was that mongod had not been shut down cleanly. Deleting everything under the dbpath and restarting mongod resolved the problem.


mongorestore no reachable servers

This happens when the node was started with a replica set name (--replSet) but the replica set was never properly initialized.
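
A minimal sketch of initializing the replica set from the mongo shell before retrying mongorestore; the set name and host below are placeholders:

// The _id must match the name that was passed to --replSet
rs.initiate({
  _id: "rs0",
  members: [{ _id: 0, host: "127.0.0.1:27017" }]
})
rs.status()   // confirm the node has become PRIMARY before running mongorestore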


replSet info Couldn’t load config yet. Sleeping 20sec and will try again

Kill the process that keeps retrying and start it again.


the collection lacks a unique index on _id

This was one of the stranger problems I ran into; here is what I found:

  • Reference 1
  • Reference 2
  • Reference 3
  • Reference 4

The discussions attribute it to data damage caused by a sudden server crash. After digging for a long time without finding a real fix, I ended up deleting the instance's dbpath and letting the replica set mechanism resync the data from scratch.


rename replset name (rename a replica set; tested and working)

Reference


MongoDB Assertion: 10334:BSONObj size: 1852142352 (0x1073656E) is invalid

This too can be caused by an abrupt server shutdown (for example a power failure). The fixes, sketched after the reference below:

  • Clear the oplog
  • Repair the database

Reference
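
A rough sketch of what those two steps can look like, assuming the node has been restarted standalone (without --replSet); the database name is a placeholder, and you should read the linked reference before attempting this:

// 1) Clear the damaged oplog; it is recreated when the node rejoins the set
use local
db.oplog.rs.drop()

// 2) Repair the data files (needs enough free disk space)
use mydb
db.repairDatabase()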


How to silence the 3.0 startup warnings

Reference 1
Reference 2


Sharding

chunk move failed

moveChunk failed to engage TO-shard in the data transfer: can't accept new chunks because there are still 59 deletes from previous migration

This means the shard that is supposed to receive the new chunk is still deleting data left over from the previous migration, so it cannot accept new chunks and this migration fails. The message is only a warning in the log, but sometimes a shard's deletes drag on for more than ten days without finishing, and the log shows the same chunk deletion being retried over and over. Restarting every shard that cannot accept new chunks fixes this.
Reference:
http://stackoverflow.com/questions/26640861/movechunk-failed-to-engage-to-shard-in-the-data-transfer-cant-accept-new-chunk
If you rely on the balancer for automatic balancing, you can add the _waitForDelete parameter, for example:

{ "_id" : "balancer", "activeWindow" : { "start" : "12:00", "stop" : "19:30" }, "stopped" : false, "_waitForDelete" : true }

This prevents accumulated deletes from causing later migrations to fail. Of course, you need to consider whether the blocking will affect the normal operation of your application; in practice, use _waitForDelete with caution. I found that migration performance becomes very poor with it enabled, sometimes stalling for more than ten hours: a client held a cursor on the chunk being migrated, the delete could not run, and all subsequent migrations were blocked behind it.
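
A minimal sketch (mongo shell, connected to a mongos) of turning _waitForDelete on in the balancer settings; adjust or drop the upsert to fit your existing settings document:

use config
db.settings.update(
  { _id: "balancer" },
  { $set: { _waitForDelete: true } },   // wait for range deletes before the next migration
  { upsert: true }
)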
Log output when an open cursor keeps the migrated-away data from being deleted promptly:

2015-03-07T10:21:20.118+0800 [RangeDeleter] rangeDeleter waiting for open cursors in: cswuyg_test.cswuyg_test, min: { _id: -6665031702664277348 }, max: { _id: -6651575076051867067 }, elapsedSecs: 6131244, cursors: [ 220477635588 ]

This can stay stuck for tens of hours, or even indefinitely, blocking subsequent moveChunk operations and leaving the data unbalanced.
The fix, once again: restart.


Collection is sharded but disk space is not released

Sharding is done, the chunks have been moved, and the other shards now hold their share of this shard's data, yet this shard's disk usage is as large as before. The reason is this:

The file size, once allocated, does not go down – that space will be re-used by Mongo if your shard continues to grow. So if you truly want to get disk space back, repairDatabase is the way to go. You can take your secondary out of the replica set, run the repairDatabase, and bring it back into the replica set.
In other words, once disk space has been allocated, MongoDB does not give it back to the OS; it reuses the space as new data arrives. If you need the disk space back immediately, run repairDatabase, or delete the data directory and let the replica set mechanism resync the node, which frees the space.
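
A minimal sketch of the route described in the quote, assuming the member has been taken out of the replica set; the database name is a placeholder:

// On the secondary that was taken out of rotation
use mydb
db.repairDatabase()   // rewrites the data files compactly; needs free disk space
// add the member back to the replica set, then repeat on the next one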


Sharding a large collection gets stuck

When sharding a very large collection, the operation hangs and database reads and writes are blocked.
https://jira.mongodb.org/browse/SERVER-10853
I have not found a good workaround for this; make sure to shard large collections during low-traffic periods.


Indexes

Notes on adding indexes in MongoDB

Reference: adding an index in MongoDB
By default, MongoDB builds indexes in the foreground, which prevents all read and write operations to the database while the index builds. Also, no operation that requires a read or write lock on all databases (e.g. listDatabases) can occur during a foreground index build.
Background index construction allows read and write operations to continue while building the index.
By default createIndex blocks all reads and writes on the current database, as well as commands that need a read or write lock on all databases (e.g. listDatabases). Building the index with the background option avoids these read/write failures.
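
A minimal sketch (mongo shell); the collection and field names are placeholders:

// Build the index in the background so reads and writes can continue
db.mycoll.createIndex({ userId: 1 }, { background: true })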


Replica sets

The replica set protocolVersion problem

[ReplicationExecutor] Error in heartbeat request to 127.0.0.1:27017; BadValue Unexpected field protocolVersion in replica set configuration
Before MongoDB 3.2 the default replica set protocolVersion is 0; from 3.2 on the default is 1. When the mongod versions in a replica set differ, make the protocolVersion consistent across the set, like this:

var cfg = rs.conf();
cfg.protocolVersion = 1;
rs.reconfig(cfg);


Renaming a replica set (requires stopping the database)

  • Shut down every node in the replica set (turn off authentication first)
  • Start all nodes without the --replSet option
  • Update the local database on each node with the following code:

    use local
    var doc = db.system.replset.findOne()
    doc._id = 'NewReplicaSetName'
    db.system.replset.save(doc)
    db.system.replset.remove({_id:'OldReplicaSetName'})
  • Shut down all nodes again

  • Change the replica set name in each node's configuration file
  • Start all nodes

No common protocol found when add shard in local network

This error shows up when a replica set contains both MongoDB 3.2 and 3.0 nodes; the fix is to run the same version on every node.


Why primary and secondary data sizes differ in a replica set

Brief: Because of different amount of not reclaimed memory space on secondary and different padding factor on secondary and primary.

Long: It could be the case if you have long running primary node where some documents were deleted and inserted, and no compact operation was run. This space would not be reclaimed, and would be counted in dataSize, avgObjSize and storageSize. Secondary could be fully resynced from primary, but only operations from current oplog would be replayed. In this case secondary could have lower values for dataSize, avgObjSize and storageSize. If after that secondary is elected as primary, you could see described difference in sizes. In addition each server has its own padding factor, that is why you see difference in dataSize.

Concrete scenario could be different, but there are two main causes: amount of not reclaimed memory space and different padding factor.

Reference 1
Reference 2
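
A minimal sketch of the compact operation mentioned above (mongo shell; database and collection names are placeholders). It blocks operations on that database while it runs, so prefer running it on a secondary taken out of rotation:

use mydb
db.runCommand({ compact: "mycoll" })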


Fatal assertion 15915 OplogOperationUnsupported: Applying renameCollection not supported in initial sync

This happens when initial sync of a replica set member runs into a renameCollection operation; initial sync aborts on it, and you simply restart the sync manually.

Import and export

Impact on read/write performance

Mongodump does not lock the db. It means other read and write operations will continue normally.
Actually, both mongodump and mongorestore are non-blocking. So if you want to mongodump/mongorestore a db, then it's your responsibility to make sure that it is really a desired snapshot backup/restore. To do this, you must stop all other write operations while taking/restoring backups with mongodump/mongorestore. If you are running a sharded environment then it's recommended you stop the balancer also.

Neither mongodump nor mongorestore blocks the database, but both affect read/write performance.
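
A minimal sketch (mongo shell, on a mongos) of the recommendation above for sharded clusters, pausing the balancer around the dump or restore:

sh.stopBalancer()        // stop chunk migrations before running mongodump/mongorestore
sh.getBalancerState()    // should now return false
// run mongodump / mongorestore from the OS shell here
sh.startBalancer()       // resume the balancer afterwards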


Migrating data by copying the data directory

Note: make sure no mongod instance is using the data directory while you copy it; otherwise the copied directory will not start a working database.
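
A minimal sketch of shutting the node down cleanly first (mongo shell; the paths in the comment are placeholders, and the copy itself happens in the OS shell):

// Shut down the mongod whose dbpath is about to be copied
use admin
db.shutdownServer()
// then copy the dbpath from the OS shell, e.g. cp -a /data/db /backup/db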


Clusters

Inconsistent reads and writes through different mongos nodes

This can be caused by the movePrimary command; see this warning from the official docs:

WARNING
If you use the movePrimary command to move un-sharded collections, you must either restart all mongos instances, or use the flushRouterConfig command on all mongos instances before reading or writing any data to any unsharded collections that were moved. This action ensures that the mongos is aware of the new shard for these collections.
If you do not update the mongos instances’ metadata cache after using movePrimary, the mongos may miss data on reads, and may not write data to the correct shard. To recover, you must manually intervene.

In other words, after movePrimary completes, if you do not restart all mongos instances or manually run flushRouterConfig, data read and written through different mongos instances can end up on different shards. It does not always happen, but when it does you are in serious trouble, so be extremely careful with this command.
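
A minimal sketch of the flushRouterConfig route, to be run from the mongo shell against every mongos instance:

// Refresh this mongos's cached routing metadata after movePrimary
db.adminCommand({ flushRouterConfig: 1 })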


mongodb lockpinger

There are distributed locks used in a sharded environment. The balancer takes a lock (only one migration is active at a time), the shards will take out meta data locks when doing splits also. Those live in the config.locks collection.

As for the lock pinger, the config.lockpings collection keeps track of the active components in the cluster, so it is an informational collection. The lock pinger is what populates this collection and you have pasted what looks to be the result of a successful ping.

Note: please do not use these collections for anything in your application or elsewhere. These collections (as noted on the linked pages) are considered internal only and can (and will) be changed/removed/updated without notice.

Reference: dba.stackexchange.com/questions/58272/what-is-lock-pinger-in-mongodb

Miscellaneous

Incompatible to update capped collection after upgrade to mongo3.2

I found this while upgrading: replica set initial sync failed with "cannot change the size of a document in a capped collection". A quick search turned up the following:

MongoDB 3.2 behavior
MongoDB 3.2 added a new condition check when updating a capped collection, and it applies to all storage engines: users "Cannot change the size of a document in a capped collection".

MongoDB 3.0 behavior
In MongoDB 3.0, only mmapv1 had this limitation on updating a capped collection, namely "objects in a capped ns cannot grow" (the new document does not have to be the same size as the old one, it just cannot grow).

Reference: jira.mongodb.org/browse/SERVER-23981

Also, there is a command to convert a normal collection into a capped collection, but no command to convert a capped collection back, so be careful before converting. If you really have to convert back, something like this works:

db.createCollection("norm_coll");
var cur = db.cap_col.find();
while (cur.hasNext()) { obj = cur.next(); db.norm_coll.insert(obj); }
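
For the forward direction mentioned above, the convertToCapped command does the job (mongo shell; the size in bytes is an example value):

// Convert an existing collection into a capped collection of roughly 100 MB
db.runCommand({ convertToCapped: "norm_coll", size: 100 * 1024 * 1024 })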

More error round-up articles

  • MongoDB errors and usage notes, 2013-02-16 (continuously updated)
  • FAQ: MongoDB Storage
  • MongoDB usage notes: some less common lessons learned