gh-ost Analysis


gh-ost is an online MySQL schema-change tool written in Go and open-sourced by GitHub's DBA team. It solves several of the problems we currently run into with pt-online-schema-change, and its approach is quite novel. The author is very capable and is also the author of the openark kit (a toolset written mostly in Python).

Why is it called gh-ost rather than gh-osc? It coins a new abbreviation: gh-ost can be pronounced as "ghost", and the trailing "t" can be read as Transmogrifier/Translator/Transformer/Transfigurator.

Related links:

  1. The GitHub repository
  2. The author's blog post introducing gh-ost
  3. "GitHub contributes gh-ost, a new online schema-change tool, to the MySQL community"

How It Works

Instead of using triggers to capture incremental changes as pt-online-schema-change does, gh-ost impersonates a replica and reads the row-format binlog to obtain them. The data flow is shown in the figure below.

Data flow diagram

The trigger-based approach has many drawbacks, which the author covers thoroughly in the triggerless-design document. For the concrete problems we hit in production, the main ones are MDL locks and deadlocks, and neither can be worked around when using pt-online-schema-change. We have had multiple incidents caused by MDL locks and deadlocks while running pt-osc online; those MDL-lock and deadlock problems are described in detail in other posts.

With triggers, the statement that fires the trigger and the triggered statement run in the same transaction. So my first reaction to the binlog-based approach was to wonder whether ordering problems could arise (between the binlog events and the row copy). After discussing it with friends and reasoning it through, the conclusion is that no problem exists.

Because the binlog records the full row image, the binlog data is authoritative. Binlog events are also rewritten as shown below before being applied, and the copy of old data uses insert ignore, so the binlog always has the highest priority and no inconsistency can occur.

Source type    Target type
insert         replace
update         update
delete         delete

For insert and update there is no problem: whatever the order of copying old rows and applying binlog events, if the binlog event is applied later it overwrites the copied row, and if it is applied earlier, the row copy is simply ignored because it uses insert ignore.

For delete, we can work through the cases. With three operations a, b, c, three orderings are possible (b always comes after a):

  • a. delete the row from the original table
  • b. apply the delete binlog event to the ghost table
  • c. copy the old row into the ghost table
  1. c, a, b: c copies the row into the ghost table, and b finally deletes it from the ghost table;
  2. a, c, b: c is a no-op (the row is already gone from the original table), and b is a no-op as well;
  3. a, b, c: b is a no-op, and c is a no-op as well;
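
To make the argument concrete, here is a toy Go simulation (illustrative only, not gh-ost code; all names are made up) that models insert ignore as "copy only if the row still exists in the original table and is not yet in the ghost table" and runs the three orderings. The ghost table ends up empty every time:

~~~go
package main

import "fmt"

// Toy model of the original and ghost tables, keyed by primary key, used to
// check that all three orderings of a delete converge to the same state.
func main() {
	type op = func(original, ghost map[int]string)

	deleteOldRow := func(original, ghost map[int]string) { delete(original, 1) }   // a
	applyDeleteBinlog := func(original, ghost map[int]string) { delete(ghost, 1) } // b
	copyOldRow := func(original, ghost map[int]string) {                           // c
		// insert ignore semantics: copy only if the row still exists in the
		// original table and is not already present in the ghost table.
		if v, ok := original[1]; ok {
			if _, exists := ghost[1]; !exists {
				ghost[1] = v
			}
		}
	}

	orderings := []struct {
		name string
		ops  []op
	}{
		{"c,a,b", []op{copyOldRow, deleteOldRow, applyDeleteBinlog}},
		{"a,c,b", []op{deleteOldRow, copyOldRow, applyDeleteBinlog}},
		{"a,b,c", []op{deleteOldRow, applyDeleteBinlog, copyOldRow}},
	}
	for _, o := range orderings {
		original := map[int]string{1: "old value"}
		ghost := map[int]string{}
		for _, apply := range o.ops {
			apply(original, ghost)
		}
		fmt.Printf("%s -> ghost table: %v\n", o.name, ghost) // always map[]
	}
}
~~~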

Detailed Process

Step 1: obtain the table's row count, either exactly via count(*) or approximately via explain, for progress calculation. This is controlled by the --exact-rowcount option; the default is the explain estimate.

-- Method 1: estimate the row count via explain
explain select /* gh-ost */ * from %s.%s where 1=1;
-- Method 2: get an exact count via count(*)
select /* gh-ost */ count(*) as rows from %s.%s;

Step 2: fetch the minimum and maximum values of the chunking key via ORDER BY.

SELECT /* gh-ost `test`.`test` */ `id` 
FROM   `test`.`test` 
ORDER  BY `id` ASC 
LIMIT  1; 

SELECT /* gh-ost `test`.`test` */ `id` 
FROM   `test`.`test` 
ORDER  BY `id` DESC 
LIMIT  1; 

Step 3: repeatedly probe with SQL like the following to compute the upper bound of the next chunk and to check whether any data remains to be copied.

SELECT /* gh-ost `test`.`faredb_detail` iteration:25 */ `faredb_detail_id` 
FROM   (SELECT `faredb_detail_id` 
        FROM   `test`.`faredb_detail` 
        WHERE  (( `faredb_detail_id` > 118887 )) 
               AND ( ( `faredb_detail_id` < '105070846' ) 
                      OR (( `faredb_detail_id` = '105070846' )) ) 
        ORDER  BY `faredb_detail_id` ASC 
        LIMIT  1000) select_osc_chunk 
ORDER  BY `faredb_detail_id` DESC 
LIMIT  1; 

Step 4: copy the original table's rows with SQL of the following form.

insert ignore into new_table (...)
    select ... from old_table force index (`PRIMARY`)
    where <chunk range from step 3>
    lock in share mode
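
As a sketch of how steps 2 through 4 fit together, the following Go program runs the chunk-probe and chunk-copy loop. It is illustrative only, not gh-ost's actual code: the table and ghost-table names, the id/k/c/pad columns (borrowed from the sbtest example in the log below), the chunk size of 1000 and the go-sql-driver/mysql driver are all assumptions.

~~~go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

// copyRows fetches the key range once (step 2), then repeatedly probes the next
// chunk boundary (step 3) and copies that chunk with INSERT IGNORE ... LOCK IN
// SHARE MODE (step 4). Names are illustrative; this is not gh-ost code.
func copyRows(db *sql.DB) error {
	var minID, maxID int64
	if err := db.QueryRow("SELECT id FROM test.test ORDER BY id ASC LIMIT 1").Scan(&minID); err != nil {
		return err // empty table or query error
	}
	if err := db.QueryRow("SELECT id FROM test.test ORDER BY id DESC LIMIT 1").Scan(&maxID); err != nil {
		return err
	}

	lower := minID - 1
	for {
		// Step 3: probe the upper bound of the next chunk (chunk-size = 1000).
		var upper int64
		err := db.QueryRow(`
			SELECT id FROM (
				SELECT id FROM test.test
				WHERE id > ? AND id <= ?
				ORDER BY id ASC LIMIT 1000
			) chunk ORDER BY id DESC LIMIT 1`, lower, maxID).Scan(&upper)
		if err == sql.ErrNoRows {
			return nil // nothing left to copy
		}
		if err != nil {
			return err
		}
		// Step 4: copy the chunk; INSERT IGNORE keeps binlog-applied rows authoritative.
		if _, err := db.Exec(`
			INSERT IGNORE INTO test._test_gho (id, k, c, pad)
			(SELECT id, k, c, pad FROM test.test FORCE INDEX (PRIMARY)
			 WHERE id > ? AND id <= ? LOCK IN SHARE MODE)`, lower, upper); err != nil {
			return err
		}
		lower = upper
	}
}

func main() {
	db, err := sql.Open("mysql", "admin:admin@tcp(test02:3306)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := copyRows(db); err != nil {
		log.Fatal(err)
	}
}
~~~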

Step 5: binlog events are read via go/binlog/gomysql_reader.go, then rewritten by go/logic/applier.go and applied to the ghost table.

The details of the rewrite can be seen in buildDMLEventQuery:

Source type    Target type
insert         replace
update         update
delete         delete
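
As an illustration of the rewrite, the mapping can be thought of as a switch on the event type. This is a simplified sketch under assumed names (dmlEvent, buildGhostDML); the real logic lives in buildDMLEventQuery in go/logic/applier.go and additionally handles shared columns, the shared unique key and prepared-statement arguments.

~~~go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// dmlEvent is a toy representation of a row-image binlog event (illustrative only).
type dmlEvent struct {
	kind  string            // "insert", "update" or "delete"
	ghost string            // ghost table name, e.g. "`test`.`_sbtest_gho`"
	row   map[string]string // column -> value from the full row image
	key   string            // shared unique key column, e.g. "id"
}

func buildGhostDML(ev dmlEvent) string {
	cols := make([]string, 0, len(ev.row))
	for c := range ev.row {
		cols = append(cols, c)
	}
	sort.Strings(cols)

	switch ev.kind {
	case "insert": // insert -> replace: the binlog event always wins over a copied row
		vals := make([]string, len(cols))
		for i, c := range cols {
			vals[i] = "'" + ev.row[c] + "'"
		}
		return fmt.Sprintf("REPLACE INTO %s (%s) VALUES (%s)",
			ev.ghost, strings.Join(cols, ", "), strings.Join(vals, ", "))
	case "update": // update -> update on the ghost table, matched on the shared unique key
		sets := make([]string, 0, len(cols))
		for _, c := range cols {
			sets = append(sets, fmt.Sprintf("%s = '%s'", c, ev.row[c]))
		}
		return fmt.Sprintf("UPDATE %s SET %s WHERE %s = '%s'",
			ev.ghost, strings.Join(sets, ", "), ev.key, ev.row[ev.key])
	case "delete": // delete -> delete on the ghost table
		return fmt.Sprintf("DELETE FROM %s WHERE %s = '%s'",
			ev.ghost, ev.key, ev.row[ev.key])
	}
	return ""
}

func main() {
	ev := dmlEvent{kind: "insert", ghost: "`test`.`_sbtest_gho`",
		row: map[string]string{"id": "42", "k": "7"}, key: "id"}
	fmt.Println(buildGhostDML(ev))
}
~~~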

Step 6: the cut-over phase (the table-rename phase). The author wrote three consecutive blog posts describing it; it is fairly involved and worth reading for anyone interested: 1. Solving the Facebook-OSC non-atomic table swap problem; 2. Solving the non-atomic table swap, Take II; 3. Solving the non-atomic table swap, Take III: making it atomic.
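
The following sketch shows the two-session lock/rename dance that makes the swap atomic, as seen in the log below. It is illustrative only, not gh-ost code: the sentinel-table definition, the mustExec helper, the sleep standing in for "waiting for events up to lock", and the go-sql-driver/mysql driver are all assumptions; only the table names and SQL statements mirror the log.

~~~go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func mustExec(ctx context.Context, c *sql.Conn, q string) {
	if _, err := c.ExecContext(ctx, q); err != nil {
		log.Fatal(err)
	}
}

func main() {
	ctx := context.Background()
	db, err := sql.Open("mysql", "admin:admin@tcp(test02:3306)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Session A: create the "magic" sentinel table, then lock both the original
	// table and the sentinel. LOCK TABLES is session-scoped, hence a pinned connection.
	lockConn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	mustExec(ctx, lockConn, "CREATE TABLE `_sbtest_del` (id int unsigned primary key)")
	mustExec(ctx, lockConn, "LOCK TABLES `sbtest` WRITE, `_sbtest_del` WRITE")

	// Session B: issue the atomic RENAME; it blocks behind session A's lock
	// ("Issuing and expecting this to block" in the log).
	renameDone := make(chan error, 1)
	go func() {
		renameConn, err := db.Conn(ctx)
		if err != nil {
			renameDone <- err
			return
		}
		_, err = renameConn.ExecContext(ctx,
			"RENAME TABLE `sbtest` TO `_sbtest_del`, `_sbtest_gho` TO `sbtest`")
		renameDone <- err
	}()

	// gh-ost waits here until all binlog events up to the lock are applied; a short
	// sleep stands in for that wait (gh-ost also verifies the RENAME is really blocked).
	time.Sleep(2 * time.Second)

	// Session A: drop the sentinel while still holding the lock, then unlock. Only now
	// can the pending RENAME proceed and swap the tables. Had session A died earlier,
	// the RENAME would have failed because `_sbtest_del` still existed, so a premature
	// swap is impossible.
	mustExec(ctx, lockConn, "DROP TABLE `_sbtest_del`")
	mustExec(ctx, lockConn, "UNLOCK TABLES")

	if err := <-renameDone; err != nil {
		log.Fatal(err)
	}
	log.Println("tables renamed")
}
~~~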

Step 7: the complete flow can be traced in the following log.

2016-08-12 12:00:31 INFO starting gh-ost 1.0.8
2016-08-12 12:00:31 INFO Migrating `test`.`sbtest`
2016-08-12 12:00:31 INFO connection validated on test02:5002
2016-08-12 12:00:31 INFO User has ALL privileges
2016-08-12 12:00:31 INFO binary logs validated on test02:5002
2016-08-12 12:00:31 INFO Restarting replication on test02:5002 to make sure binlog settings apply to replication thread
2016-08-12 12:00:31 INFO Table found. Engine=InnoDB
2016-08-12 12:00:31 INFO Estimated number of rows via EXPLAIN: 9936
2016-08-12 12:00:31 INFO Master found to be test01:5002
2016-08-12 12:00:31 INFO connection validated on test02:5002
2016-08-12 12:00:31 INFO Registering replica at test02:5002
2016-08-12 12:00:31 INFO Connecting binlog streamer at mysql-bin.000023:954323927
2016-08-12 12:00:31 INFO rotate to next log name: mysql-bin.000023
2016-08-12 12:00:31 INFO connection validated on test01:5002
2016-08-12 12:00:31 INFO connection validated on test01:5002
2016-08-12 12:00:31 INFO Droppping table `test`.`_sbtest_gho`
2016-08-12 12:00:31 INFO Table dropped
2016-08-12 12:00:31 INFO Droppping table `test`.`_sbtest_ghc`
2016-08-12 12:00:31 INFO Table dropped
2016-08-12 12:00:31 INFO Creating changelog table `test`.`_sbtest_ghc`
2016-08-12 12:00:32 INFO Changelog table created
2016-08-12 12:00:32 INFO Creating ghost table `test`.`_sbtest_gho`
2016-08-12 12:00:32 INFO Ghost table created
2016-08-12 12:00:32 INFO Altering ghost table `test`.`_sbtest_gho`
2016-08-12 12:00:32 INFO Ghost table altered
2016-08-12 12:00:32 INFO Chosen shared unique key is PRIMARY
2016-08-12 12:00:32 INFO Shared columns are id,k,c,pad
2016-08-12 12:00:32 INFO Listening on unix socket file: /tmp/gh-ost.test.sbtest.sock
2016-08-12 12:00:32 INFO As instructed, I'm issuing a SELECT COUNT(*) on the table. This may take a while
2016-08-12 12:00:32 INFO Exact number of rows via COUNT: 10000
2016-08-12 12:00:32 INFO Migration min values: [1]
2016-08-12 12:00:32 INFO Migration max values: [10000]
# Migrating `test`.`sbtest`; Ghost table is `test`.`_sbtest_gho`
# Migrating test01.dx.sankuai.com:5002; inspecting test02.dx.sankuai.com:5002; executing on dx-dba01.dx.sankuai.com
# Migration started at Fri Aug 12 12:00:31 +0800 2016
# chunk-size: 1000; max-lag-millis: 1500ms; max-load: Threads_running=25; critical-load: Threads_running=1000; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# postpone-cut-over-flag-file: /tmp/ghost.postpone.flag
# panic-flag-file: /tmp/ghost.panic.flag
# Serving on unix socket: /tmp/gh-ost.test.sbtest.sock
Copy: 0/10000 0.0%; Applied: 0; Backlog: 0/100; Time: 0s(total), 0s(copy); streamer: mysql-bin.000023:954326679; ETA: N/A
Copy: 0/10000 0.0%; Applied: 0; Backlog: 0/100; Time: 1s(total), 1s(copy); streamer: mysql-bin.000023:954328041; ETA: N/A
2016-08-12 12:00:33 INFO Row copy complete
Copy: 10000/10000 100.0%; Applied: 0; Backlog: 0/100; Time: 1s(total), 1s(copy); streamer: mysql-bin.000023:954831040; ETA: 0s
2016-08-12 12:00:33 INFO Grabbing voluntary lock: gh-ost.84092.lock
2016-08-12 12:00:33 INFO Setting LOCK timeout as 6 seconds
2016-08-12 12:00:33 INFO Looking for magic cut-over table
2016-08-12 12:00:33 INFO Creating magic cut-over table `test`.`_sbtest_del`
2016-08-12 12:00:33 INFO Magic cut-over table created
2016-08-12 12:00:33 INFO Locking `test`.`sbtest`, `test`.`_sbtest_del`
2016-08-12 12:00:33 INFO Tables locked
2016-08-12 12:00:33 INFO Session locking original & magic tables is 84092
2016-08-12 12:00:33 INFO Writing changelog state: AllEventsUpToLockProcessed
2016-08-12 12:00:33 INFO Waiting for events up to lock
Copy: 10000/10000 100.0%; Applied: 0; Backlog: 1/100; Time: 2s(total), 1s(copy); streamer: mysql-bin.000023:954959331; ETA: 0s
2016-08-12 12:00:34 INFO Done waiting for events up to lock; duration=971.748469ms
# Migrating `test`.`sbtest`; Ghost table is `test`.`_sbtest_gho`
# Migrating test01.dx.sankuai.com:5002; inspecting test02.dx.sankuai.com:5002; executing on dx-dba01.dx.sankuai.com
# Migration started at Fri Aug 12 12:00:31 +0800 2016
# chunk-size: 1000; max-lag-millis: 1500ms; max-load: Threads_running=25; critical-load: Threads_running=1000; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# postpone-cut-over-flag-file: /tmp/ghost.postpone.flag
# panic-flag-file: /tmp/ghost.panic.flag
# Serving on unix socket: /tmp/gh-ost.test.sbtest.sock
Copy: 10000/10000 100.0%; Applied: 0; Backlog: 0/100; Time: 2s(total), 1s(copy); streamer: mysql-bin.000023:954960101; ETA: 0s
2016-08-12 12:00:34 INFO Setting RENAME timeout as 3 seconds
2016-08-12 12:00:34 INFO Session renaming tables is 84090
2016-08-12 12:00:34 INFO Issuing and expecting this to block: rename /* gh-ost */ table `test`.`sbtest` to `test`.`_sbtest_del`, `test`.`_sbtest_gho` to `test`.`sbtest`
2016-08-12 12:00:34 INFO Found atomic RENAME to be blocking, as expected. Double checking the lock is still in place (though I don't strictly have to)
2016-08-12 12:00:34 INFO Checking session lock: gh-ost.84092.lock
2016-08-12 12:00:34 INFO Connection holding lock on original table still exists
2016-08-12 12:00:34 INFO Will now proceed to drop magic table and unlock tables
2016-08-12 12:00:34 INFO Dropping magic cut-over table
2016-08-12 12:00:34 INFO Releasing lock from `test`.`sbtest`, `test`.`_sbtest_del`
2016-08-12 12:00:34 INFO Tables unlocked
2016-08-12 12:00:34 INFO Tables renamed
2016-08-12 12:00:34 INFO Lock & rename duration: 981.435405ms. During this time, queries on `sbtest` were blocked
2016-08-12 12:00:34 INFO Looking for magic cut-over table
2016-08-12 12:00:34 INFO Droppping table `test`.`_sbtest_ghc`
2016-08-12 12:00:34 INFO Table dropped
2016-08-12 12:00:34 INFO Am not dropping old table because I want this operation to be as live as possible. If you insist I should do it, please add `--ok-to-drop-table` next time. But I prefer you do not. To drop the old table, issue:
2016-08-12 12:00:34 INFO -- drop table `test`.`_sbtest_del`
2016-08-12 12:00:34 INFO Done migrating `test`.`sbtest`
2016-08-12 12:00:34 INFO Done

Important Options and Typical Usage

Some important options:

-critical-load --max-load
        Comma delimited status-name=threshold, same format as --max-load. When status exceeds threshold, app panics and quits

Whether to drop the old table once the migration completes:
-ok-to-drop-table
        Shall the tool drop the old table at end of operation. DROPping tables can be a long locking operation, which is why I'm not doing it by default. I'm an online tool, yes?     
        
If an old table is found before execution starts, should it be dropped or should the run abort? The default is to abort:
-initially-drop-old-table
        Drop a possibly existing OLD table (remains from a previous run?) before beginning operation. Default is to panic and abort if such table exists           

Does the -port option apply only to -host? Everywhere else, host and port must be written out in full:
-port int
        MySQL port (preferably a replica, not the master) (default 3306)

Because gh-ost offers no equivalent of recursion-method=processlist, every slave whose lag should be checked must be listed via -throttle-control-replicas. Note that -port above applies only to -host, so each replica here must be written in full as host:port:
-throttle-control-replicas string
    	List of replicas on which to check for lag; comma delimited. Example: myhost1.com:3306,myhost2.com,myhost3.com:3307
    	
After copying each chunk, gh-ost sleeps for nice-ratio * (time spent copying that chunk); the default is 0, i.e. no sleep (a small sketch of this arithmetic follows the option list):
-nice-ratio float
    	force being 'nice', imply sleep time per chunk time; range: [0.0..100.0]. Example values: 0 is aggressive. 1.5: for every ms spend in a rowcopy chunk, spend 1.5ms sleeping immediately after    
  
By default the socket file is not deleted, so a second run fails with an error saying the socket file already exists:
-initially-drop-socket-file
    	Should gh-ost forcibly delete an existing socket file. Be careful: this might drop the socket file of a running migration!    	
-exact-rowcount
    	actually count table rows as opposed to estimate them (results in more accurate progress estimation)	
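
As referenced above, a minimal sketch of the nice-ratio arithmetic (illustrative only; copyChunk is a hypothetical stand-in for copying one chunk of rows):

~~~go
package main

import "time"

func main() {
	const niceRatio = 1.5 // e.g. --nice-ratio=1.5; 0 means no sleep (the default)

	// Hypothetical stand-in for copying one chunk of rows.
	copyChunk := func() { time.Sleep(20 * time.Millisecond) }

	for i := 0; i < 3; i++ {
		start := time.Now()
		copyChunk()
		busy := time.Since(start)
		// sleep time per chunk = nice-ratio * time spent copying that chunk
		time.Sleep(time.Duration(float64(busy) * niceRatio))
	}
}
~~~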

Typical usage:

./gh-ost \
--max-load=Threads_running=25 \
--critical-load=Threads_running=64 \
--chunk-size=1000 \
--throttle-control-replicas="test02:3306" \
--max-lag-millis=1500 \
--initially-drop-old-table \
--initially-drop-ghost-table \
--initially-drop-socket-file \
--ok-to-drop-table \
--conf="./my.cnf" \
--host="test02" \
--port=3306 \
--user="admin" \
--password="admin" \
--database="test" \
--table="test" \
--verbose \
--alter="drop index id1" \
--switch-to-rbr \
--allow-master-master \
--cut-over=default \
--default-retries=120 \
--panic-flag-file=/tmp/ghost.panic.flag \
--postpone-cut-over-flag-file=/tmp/ghost.postpone.flag \
--execute

Some shortcomings found:

  1. Compared with pt-osc there is no --check-interval option; the interval is currently hard-coded to 1s (issue link). Since 1s is small enough, the author considers the current behavior acceptable;
  2. A my.cnf file passed via --conf does not support prompt=[\h]\u@\d\r:\m:\s> (issue link); the author has already fixed this;
  3. There is no check when adding a unique index; adding a unique index can silently lose rows (pt-osc has the same problem). The author does not consider this worth fixing and it would be fairly hard to solve, so DBAs need to be aware of it before use;
  4. Compared with pt-osc there is no --check-replication-filters, i.e. no check of whether the table actually exists on the replicas (for example because of replication filters);
  5. Compared with pt-osc there is no --recursion-method=processlist; the replicas must be specified via an option instead (issue link).

Questions:

1. Given that pt-osc's --check-interval option is missing, how often does gh-ost actually check?

~~~go
// Migrate executes the complete migration logic. This is the major gh-ost function.
func (this *Migrator) Migrate() (err error) {
	// The load and slave-lag checks are started from here.
	go this.initiateThrottler()
	// ...
}

// The answer is clearly visible here: load and slave lag are checked once per second.
// time.Tick(1 * time.Second) creates a ticker that writes to the throttlerTick channel
// every second, and the for range loop below reads from it.
//
// initiateThrottler initiates the throttle ticker and sets the basic behavior of throttling.
func (this *Migrator) initiateThrottler() error {
	throttlerTick := time.Tick(1 * time.Second)

	throttlerFunction := func() {
		alreadyThrottling, currentReason := this.migrationContext.IsThrottled()
		shouldThrottle, throttleReason := this.shouldThrottle()
		if shouldThrottle && !alreadyThrottling {
			// New throttling
			this.applier.WriteAndLogChangelog("throttle", throttleReason)
		} else if shouldThrottle && alreadyThrottling && (currentReason != throttleReason) {
			// Change of reason
			this.applier.WriteAndLogChangelog("throttle", throttleReason)
		} else if alreadyThrottling && !shouldThrottle {
			// End of throttling
			this.applier.WriteAndLogChangelog("throttle", "done throttling")
		}
		this.migrationContext.SetThrottled(shouldThrottle, throttleReason)
	}
	throttlerFunction()
	for range throttlerTick {
		throttlerFunction()
	}

	return nil
}

~~~