整个问题提出和查验由 @gc @甘露寺的姑子@乙酉 完毕,文档记载由@gc @乙酉完毕。
我仅仅进行了问题剖析和文档收拾
欢迎重视我的《深化了解MySQL主从原理 32讲 》,如下:
一、事例描绘
MGR在遇到表不存在的状况下,节点没有退出节点而是爆出一个正告,而且节点状况也正常,正告如下:
2019-10-17T21:16:11.564211+08:00 10 [Warning] Slave SQL for channel
group_replication_applier': Worker 1 failed executing transaction 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:8' at master log , end_log_pos 220;
Error executing row event: 'Table 'test.a_1' doesn't exist', Error_code: 1146
集群状况如下:
[root@mysql.sock][test]>select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 9fd479bb-f0d8-11e9-9381-000c29105312 | mysql_1 | 3306 | ONLINE |
| group_replication_applier | a8833a96-f0d8-11e9-a9f4-000c291fd9a5 | mysql_2 | 3306 | ONLINE |
| group_replication_applier | b2968fe2-f0d8-11e9-a8ff-000c29c89e42 | mysql_3 | 3306 | ONLINE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)
其时觉得很古怪,咱们咱们都知道这种过失即便是在主从状况下也是报错的SQL线程退出的,MGR竟然还能在线,这样的一种状况数据现已不同步了,应该报错而且除掉节点才对。
二、问题剖析
随即一些感兴趣的同学立刻进行了查验,查验作用和上面不一致,查验作用是报错而不是出正告如下:
2019-10-17T09:16:34.317542Z 84 [ERROR] Slave SQL for channel
'group_replication_applier': Error executing row event:
'Table 'test.emp1' doesn't exist', Error_code: 1146
而且这样的一种状况表不存在的节点现已被除掉掉了。下面是一般的状况的节点状况:
secondary 1节点:
[root@mysql.sock][test]>select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | a8833a96-f0d8-11e9-a9f4-000c291fd9a5 | mysql_2 | 3306 | ERROR |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
1 row in set (0.00 sec)
secondary 2节点:
[root@mysql.sock][test]>select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | b2968fe2-f0d8-11e9-a8ff-000c29c89e42 | mysql_3 | 3306 | ERROR |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
1 row in set (0.00 sec)
那么疑问便是为什么同样是MGR一个是正告一个是过失呢,而且前者还能处于正常同步状况。不错看到标题就知道这儿和slave_skip_errors参数有关。
三、查验模仿
咱们咱们都知道再Master-Slave中假定遇到从库表不存在肯定是报错的,除非设置slave_skip_errors参数,当然我在线上重来没有设置过这个参数,而且经过这个事例咱们发现本参数对MGR也有影响,如下查验办法:
咱们在3个节点都翻开slave-skip-errors= ddl_exist_errors
如下图:
然后建立3节点single-primary方法的MGR集群。
集群建立正常。
然后实施如下操作:
[root@mysql.sock][(none)]>set sql_log_bin=0;
Query OK, 0 rows affected (0.00 sec)
[root@mysql.sock][(none)]>create table test.a_1(id bigint auto_increment primary key,name varchar(20));
Query OK, 0 rows affected (0.01 sec)
[root@mysql.sock][(none)]>set sql_log_bin=1;
Query OK, 0 rows affected (0.00 sec)
此刻primary节点是有a_1表的,可是由于binlog封闭的原因,两个secondary节点是不存在a_1表的。
然后咱们刺进数据:
[root@mysql.sock][test]>insert into test.a_1 values(null,'tom');
Query OK, 1 row affected (0.02 sec)
此刻,primary节点由于存在a_1表,所以可以刺进,可是两个secondary节点不存在a_1表,所以刺进是失利的。数据发生不一致。一般的状况下这种数据不一致会导致2个secondary节点被提出集群才对。可是实际上3个节点都是正常的,集群并没有失效。
[root@mysql.sock][test]>select * from test.a_1;
+----+------+
| id | name |
+----+------+
| 1 | tom |
+----+------+
1 row in set (0.00 sec)
[root@mysql.sock][test]>select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 9fd479bb-f0d8-11e9-9381-000c29105312 | mysql_1 | 3306 | ONLINE |
| group_replication_applier | a8833a96-f0d8-11e9-a9f4-000c291fd9a5 | mysql_2 | 3306 | ONLINE |
| group_replication_applier | b2968fe2-f0d8-11e9-a8ff-000c29c89e42 | mysql_3 | 3306 | ONLINE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)
此刻去2个secondary节点读取test.a_1表,表是不存在的。
secondary 1:
[root@mysql.sock][test]>select * from test.a_1;
ERROR 1146 (42S02): Table 'test.a_1' doesn't exist
[root@mysql.sock][test]>
secondary 2:
[root@mysql.sock][test]>select * from test.a_1;
ERROR 1146 (42S02): Table 'test.a_1' doesn't exist
error log输出信息:(set global log_error_verbosity = 3;)
2019-10-17T21:16:11.564211+08:00 10 [Warning] Slave SQL for channel
'group_replication_applier': Worker 1 failed executing transaction 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:8' at master log ,
end_log_pos 220; Error executing row event: 'Table 'test.a_1' doesn't exist', Error_code: 1146
四、slave_skip_errors源码收效点
这个设置在Rows_log_event::do_apply_event 函数中收效,也便是DML Event开始运用的时分收效,这是惯例的SQL线程(或许Worker线程)调用的。
#ifdef HAVE_REPLICATION
if (opt_slave_skip_errors)
add_slave_skip_errors(opt_slave_skip_errors);
#endif
if (open_and_lock_tables(thd, rli->tables_to_lock, 0))//翻开表
{
uint actual_error= thd->get_stmt_da()->mysql_errno();
if (thd->is_slave_error || thd->is_fatal_error)
{
if (ignored_error_code(actual_error)) //这儿遭到 slave_skip_errors 参数操控 ignored_error_code会将slave_skip_errors的参数设置读取出来
{
if (log_warnings > 1)
rli->report(WARNING_LEVEL, actual_error,
"Error executing row event: '%s'",
(actual_error ? thd->get_stmt_da()->message_text() :
"unexpected success or fatal error"));
thd->get_stmt_da()->reset_condition_info(thd);
clear_all_errors(thd, const_cast<Relay_log_info*>(rli));
error= 0;
goto end;
}
else
{
rli->report(ERROR_LEVEL, actual_error,
"Error executing row event: '%s'",
(actual_error ? thd->get_stmt_da()->message_text() :
"unexpected success or fatal error"));
thd->is_slave_error= 1;
const_cast<Relay_log_info*>(rli)->slave_close_thread_tables(thd);
DBUG_RETURN(actual_error);
}
}
```
可以正常的看到MGR的实施逻辑遭到了该参数的影响。
发表评论