Analysis of the Improvements in gh-ost 1.0.30, the MySQL Online Schema Change Tool
gh-ost, the MySQL online schema change tool, released version 1.0.30 on December 6. We use gh-ost mainly as a replacement for pt-osc and had filed a number of feature and bug issues against it. Then it occurred to us: why not patch the code ourselves and submit PRs upstream? So for the next two bugs we found, we tried exactly that, and both PRs were accepted. The changes were simple ones (we were picking up Go as we went, and the code is very readable), but it was a first step.
Of the 8 changes in 1.0.30, 2 are PRs we submitted and 1 is a feature improvement we proposed. The changelog reads:
- Fixed cut-over stall: when cut-over happened to start executing even as lag or another throttling condition emerged, a race condition would make the cut-over stall until the throttling reason went away. The fix is that cut-over does not wait and, at worst, rolls back.
- Fix to cut-over retries: it did not throttle properly and did not clean up properly between attempts.
- Cut-over timeout on "unknown" stalls. An example of an "unknown" stall was the throttling problem described above; whatever other unknowns exist, cut-over is now generically resilient to them.
- Support for --skip-foreign-key-checks, which can save inspection time when one is certain no FKs actually exist.
- Made column-rename detection heuristic more robust (it could wrongly infer a rename by some unfortunate comment or column-name)
- Reading and reporting replication lag before waiting on the initial replication event; this adds visibility into seemingly-hanging operations where the replica is lagging in the first place.
- Changelog table uses the same engine as the original table.
- On startup, verifying replication works all the way up to the master.
This post walks through updates 1, 2, 6, 7, and 8.
Updates 1 and 2
Updates 1 and 2 address retries and timeouts during the cut-over phase. Cut-over locks the table, and the application is blocked while the lock is held, so a timeout-and-retry mechanism is needed to keep any single lock from being held too long.
Rather than comparing before and after, here is the mechanism as it works after the change.
The Migrate method retries cutOver repeatedly; retryOperation swallows intermediate errors, so the whole cut-over is retried over and over:
migrator.go::Migrate
if err := this.hooksExecutor.onBeforeCutOver(); err != nil {
return err
}
if err := this.retryOperation(this.cutOver); err != nil {
return err
}
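As a minimal sketch of what such a retry wrapper does (the maxRetries parameter and the one-second sleep here are assumptions for illustration, not gh-ost's exact values):
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryOperation sketch: re-run the operation until it succeeds or the
// retry budget runs out; intermediate errors are swallowed, and only
// the last one is surfaced to the caller.
func retryOperation(operation func() error, maxRetries int) (err error) {
	for i := 0; i < maxRetries; i++ {
		if i != 0 {
			time.Sleep(1 * time.Second) // pause between attempts
		}
		if err = operation(); err == nil {
			return nil // success; earlier failures are irrelevant
		}
	}
	return err // budget exhausted, give up
}

func main() {
	attempts := 0
	err := retryOperation(func() error {
		attempts++
		if attempts < 3 {
			return errors.New("transient failure") // swallowed, retried
		}
		return nil
	}, 5)
	fmt.Println(attempts, err) // 3 <nil>
}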
migrator.go::cutOver
if this.migrationContext.CutOverType == base.CutOverAtomic {
// Atomic solution: we use low timeout and multiple attempts. But for
// each failed attempt, we throttle until replication lag is back to normal
err := this.atomicCutOver()
return err
}
if this.migrationContext.CutOverType == base.CutOverTwoStep {
return this.cutOverTwoStep()
}
migrator.go::atomicCutOver
defer func() {
okToUnlockTable <- true
this.applier.DropAtomicCutOverSentryTableIfExists()
}()
go func() {
if err := this.applier.AtomicCutOverMagicLock(lockOriginalSessionIdChan, tableLocked, okToUnlockTable, tableUnlocked); err != nil {
log.Errore(err)
}
}()
waitForEventsUpToLock()
The table-lock operation runs in its own goroutine; once the table is locked, that goroutine blocks on okToUnlockTable. In atomicCutOver, the defer guarantees that okToUnlockTable is always signaled, so even on failure paths (for example, timing out while waiting on the change log) the table lock is reliably released.
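The pattern is worth isolating. Below is a minimal self-contained sketch of it; the names are illustrative, not gh-ost's own types:
package main

import (
	"errors"
	"fmt"
)

// holdLock stands in for AtomicCutOverMagicLock: it acquires the lock,
// reports back, then blocks until the caller says it is safe to release.
func holdLock(okToUnlock <-chan bool, tableLocked chan<- error) {
	// ... `lock tables ... write` would happen here ...
	tableLocked <- nil // tell the caller the lock is held
	<-okToUnlock       // hold the lock until the caller signals
	// ... `unlock tables` and sentry-table cleanup would happen here ...
}

// cutOver mimics atomicCutOver's shape: the defer guarantees the goroutine
// holding the lock is released on every exit path, including errors.
func cutOver() (err error) {
	okToUnlock := make(chan bool, 1)
	tableLocked := make(chan error, 1)
	defer func() { okToUnlock <- true }() // always release the lock
	go holdLock(okToUnlock, tableLocked)
	if err := <-tableLocked; err != nil {
		return err
	}
	// ... waitForEventsUpToLock, table rename, etc. would run here ...
	return errors.New("simulated failure: the defer still releases the lock")
}

func main() {
	fmt.Println(cutOver())
}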
applier.go::AtomicCutOverMagicLock
//set a session-level lock_wait_timeout so that lock tables cannot block for too long
tableLockTimeoutSeconds := this.migrationContext.CutOverLockTimeoutSeconds * 2
log.Infof("Setting LOCK timeout as %d seconds", tableLockTimeoutSeconds)
query = fmt.Sprintf(`set session lock_wait_timeout:=%d`, tableLockTimeoutSeconds)
<-okToUnlockTable
//once okToUnlockTable is signaled: drop the magic (sentry) table, release the locks, and so on
//the following method reads from the binlog every change made up to the table lock, ensuring the ghost table and the original table end up consistent
migrator.go::waitForEventsUpToLock
//waitForEventsUpToLock arms a timer of migrationContext.CutOverLockTimeoutSeconds
//if the cut-over marker does not show up in the change log within that window, the wait times out
In other words, with the defaults the table lock may be held for at most 6 seconds (twice the cut-over lock timeout), while waitForEventsUpToLock waits for at most 3 seconds:
-cut-over-lock-timeout-seconds int
Max number of seconds to hold locks on tables while attempting to cut-over (retry attempted when lock exceeds timeout) (default 3)
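A hedged sketch of this kind of timer-bounded wait follows; the marker channel and the function name are illustrative, not gh-ost's:
package main

import (
	"fmt"
	"time"
)

// waitForMarker mirrors the timer in waitForEventsUpToLock: wait for the
// changelog marker event, but give up once cutOverLockTimeoutSeconds elapse.
func waitForMarker(marker <-chan struct{}, timeoutSeconds int64) error {
	timer := time.NewTimer(time.Duration(timeoutSeconds) * time.Second)
	defer timer.Stop()
	select {
	case <-marker:
		return nil // all events up to the lock were applied in time
	case <-timer.C:
		return fmt.Errorf("timeout while waiting for events up to lock")
	}
}

func main() {
	marker := make(chan struct{})
	go func() { // simulate the binlog reader seeing the marker after 1s
		time.Sleep(1 * time.Second)
		close(marker)
	}()
	fmt.Println(waitForMarker(marker, 3)) // <nil>: marker arrived in time
}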
Update 6
This is another problem we hit in production: if the slave that gh-ost reads the binlog from lags behind the master during a migration, gh-ost hangs and sits in the following state indefinitely:
2016-11-01 10:03:40 INFO Waiting for tables to be in place
The cause: after gh-ost creates the change log table and the ghost table on the master, it waits for those tables to appear in that slave's binlog, and it keeps waiting until they do.
To anyone unfamiliar with gh-ost internals the message is baffling, so it was changed to be more informative:
2016-11-02 12:50:36 INFO Waiting for ghost table to be migrated. Current lag is 6s
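For reference, replica lag of this kind can be measured by comparing now() against a heartbeat row that the master keeps updating. The sketch below assumes a gh-ost-style changelog table named test._test_ghc with hint and last_update columns; the query shape is illustrative, not gh-ost's exact one:
package main

import (
	"database/sql"
	"fmt"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

// replicationLag estimates how far the replica is behind by comparing
// now() on the replica against the heartbeat row the master keeps
// updating in the changelog table.
func replicationLag(db *sql.DB) (time.Duration, error) {
	var lagSeconds float64
	query := `select unix_timestamp(now()) - unix_timestamp(last_update)
		from test._test_ghc where hint = 'heartbeat'`
	if err := db.QueryRow(query).Scan(&lagSeconds); err != nil {
		return 0, err
	}
	return time.Duration(lagSeconds * float64(time.Second)), nil
}

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(replica-host:3306)/test") // placeholder DSN
	if err != nil {
		panic(err)
	}
	lag, err := replicationLag(db)
	if err != nil {
		panic(err)
	}
	fmt.Printf("Current lag is %v\n", lag)
}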
Update 7
This is a problem we hit while running gh-ost in production:
2016-12-01 10:06:29 INFO Creating magic cut-over table `test`.`_test_del`
2016-12-01 10:06:29 INFO Magic cut-over table created
2016-12-01 10:06:29 INFO Locking `test`.`test`, `test`.`_test_del`
2016-12-01 10:06:29 ERROR Error 1785: When @@GLOBAL.ENFORCE_GTID_CONSISTENCY = 1, updates to non-transactional tables can only be done in either autocommitted statements or single-statement transactions, and never in the same statement as updates to transactional tables.
2016-12-01 10:06:29 INFO Looking for magic cut-over table
Tracking down the cause: enabling GTID requires setting @@GLOBAL.ENFORCE_GTID_CONSISTENCY=1, and once it is set, the following four kinds of statements are rejected. The fourth is not spelled out in the MySQL official documentation, yet it is exactly the pattern gh-ost relies on:
1. CREATE TABLE ... SELECT statements
2. CREATE TEMPORARY TABLE statements inside transactions
3. Transactions or statements that update both transactional and nontransactional tables
4. LOCK TABLES test WRITE, test1 WRITE statements that mix transactional and non-transactional engines
Our exact scenario: GTID is on with @@GLOBAL.ENFORCE_GTID_CONSISTENCY=1 and the default engine is InnoDB, but MyISAM is much faster than InnoDB for LOAD DATA, so some bulk-loaded tables use MyISAM. gh-ost did not specify an engine when creating the magic table (used during cut-over), so the magic table got the default engine, InnoDB; the LOCK TABLES ... WRITE statement then mixed a MyISAM table with an InnoDB table, which triggers the error above.
The fix is simple: give the magic table the same engine as the original table; see the issue for details.
query := fmt.Sprintf(`create /* gh-ost */ table %s.%s (
id int auto_increment primary key
- ) comment='%s'
+ ) engine=%s comment='%s'
`,
+ this.migrationContext.TableEngine,
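A small self-contained demonstration of the fixed statement; the database/table names and the comment hint below are illustrative, not gh-ost's exact values:
package main

import "fmt"

func main() {
	// With the fix, the magic/sentry table inherits the original table's
	// engine, so `lock tables ... write` no longer mixes MyISAM with InnoDB.
	tableEngine := "MyISAM" // as inspected from the original table
	query := fmt.Sprintf(`create /* gh-ost */ table %s.%s (
		id int auto_increment primary key
	) engine=%s comment='%s'`,
		"`test`", "`_test_del`", tableEngine, "cut-over-sentry")
	fmt.Println(query)
}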
Update 8
This is also a problem we ran into in production; we submitted a pull request upstream and it was merged into master (see the issue for the specifics).
The scenario: after splitting or migrating a master, some DBAs do not run reset slave all on the new master; they run reset slave, or simply just stop slave. After either reset slave or stop slave, show slave status still returns Master_Host and Master_Port. Previously, gh-ost simply started from the host given by --host and walked upward, running show slave status at each hop to find the master.
The old logic was naive: at each hop, as long as show slave status returned a non-empty result, Master_Host and Master_Port were taken to be the master. In the reset slave and stop slave scenarios above, that picks up a stale master, and errors like "database not found" or "cannot connect to master" follow.
The fix is equally simple: while walking upward via show slave status, check whether Master_Log_File is empty. After reset slave, Master_Host is still non-empty but Master_Log_File is empty, so an empty Master_Log_File means the current host is actually the master.
The simplified flow:
if show slave status output is not empty:
    if Master_Log_File is empty:
        # stale reset slave / stop slave: this host is the master
        return this host as the master
    else if Slave_IO_Running or Slave_SQL_Running is not Yes:
        # replication is broken: abort
        exit with error
    else:
        # climb one level up via Master_Host / Master_Port and repeat
else:
    # no replication configured: this host is the master
    return this host as the master
For the full code, see inspect.go::getMasterConnectionConfig and utils.go::GetMasterConnectionConfigSafe.
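A hedged Go sketch of one hop of that walk; slaveStatusRow and nextHop are illustrative names rather than gh-ost's actual functions, and the generic row-to-map scan is an assumption:
package main

import (
	"database/sql"
	"fmt"
	"strconv"

	_ "github.com/go-sql-driver/mysql"
)

// slaveStatusRow returns `show slave status` as a column->value map,
// or nil when the result set is empty (illustrative helper).
func slaveStatusRow(db *sql.DB) (map[string]string, error) {
	rows, err := db.Query(`show slave status`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	cols, err := rows.Columns()
	if err != nil {
		return nil, err
	}
	if !rows.Next() {
		return nil, rows.Err() // empty: replication not configured
	}
	vals := make([]sql.RawBytes, len(cols))
	ptrs := make([]interface{}, len(cols))
	for i := range vals {
		ptrs[i] = &vals[i]
	}
	if err := rows.Scan(ptrs...); err != nil {
		return nil, err
	}
	m := make(map[string]string, len(cols))
	for i, c := range cols {
		m[c] = string(vals[i]) // NULL columns become ""
	}
	return m, nil
}

// nextHop applies the Master_Log_File check described above: it decides
// whether the current host is the master or which host to climb to next.
func nextHop(db *sql.DB, host string, port int) (masterHost string, masterPort int, isMaster bool, err error) {
	row, err := slaveStatusRow(db)
	if err != nil {
		return "", 0, false, err
	}
	if row == nil {
		return host, port, true, nil // no replication configured: this is the master
	}
	if row["Master_Log_File"] == "" {
		// leftover reset slave / stop slave: Master_Host is stale, trust the empty log file
		return host, port, true, nil
	}
	if row["Slave_IO_Running"] != "Yes" || row["Slave_SQL_Running"] != "Yes" {
		return "", 0, false, fmt.Errorf("replication on %s:%d is broken", host, port)
	}
	p, _ := strconv.Atoi(row["Master_Port"])
	return row["Master_Host"], p, false, nil // caller climbs one level up
}

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(some-host:3306)/information_schema") // placeholder DSN
	if err != nil {
		panic(err)
	}
	fmt.Println(nextHop(db, "some-host", 3306))
}
The caller loops on nextHop until isMaster is true, and should record visited hosts so a master-master ring cannot make the walk cycle forever.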