Manually Deleting an HBase Table with Regions In Transition

Symptoms

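Creating a pre-split table hung partway through. After a forced restart of HBase, quite a few regions were stuck in the RIT (Region In Transition) state.

[Screenshot: regions in the RIT state]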

After HBase was started, the region servers kept stopping within a few minutes. The logs showed a problem opening the regions of one table:

2022-02-06T23:22:59,303 WARN  [RS_OPEN_REGION-regionserver/zookeeper-3:16020-0] handler.AssignRegionHandler: Failed to open region hb_data_2,e0000000,1670290921940.99fbb79e62c2182aab011931a36e210f., will report to master
2022-02-06T23:55:08,541 ERROR [RS_OPEN_REGION-regionserver/zookeeper-3:16020-1] regionserver.HRegionServer: ***** ABORTING region server zookeeper-3,16020,1670342087370: Replay of WAL required. Forcing server shutdown *****
org.apache.hadoop.hbase.DroppedSnapshotException: region: hb_data_2,60000000,1670311834912.9887284106bf70971f4a457e16558183.
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2903) ~[hbase-server-2.5.1.jar:2.5.1]
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2580) ~[hbase-server-2.5.1.jar:2.5.1]
        at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:5144) ~[hbase-server-2.5.1.jar:2.5.1]
        at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:1003) ~[hbase-server-2.5.1.jar:2.5.1]
        at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:941) ~[hbase-server-2.5.1.jar:2.5.1]
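
When many regions are involved, it can help to pull the affected region names out of the log instead of reading them by eye. The following is a minimal sketch, assuming the log lines have the same "Failed to open region ..., will report to master" shape as the WARN line above. failed_regions is a hypothetical helper name, not part of HBase.

```shell
# Print each region name that the region server log reports as
# "Failed to open region <name>, will report to master".
# Reads the log on stdin; output is sorted and de-duplicated.
# Assumption: region names contain no spaces.
failed_regions() {
  sed -n 's/.*Failed to open region \([^ ]*\), will report to master.*/\1/p' | sort -u
}
```

For example, `failed_regions < regionserver.log` (the file name is a placeholder) prints each affected region once, which makes it easy to see which tables are involved.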

Deleting the corrupted table

To avoid affecting production, I decided to first delete hb_data_2, the table whose creation had failed.

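> zkCli.sh
> rmr /hbase/table/hb_data_2

> hdfs dfs -rm -r /hbase/data/default/hb_data_2

(ZooKeeper 3.5 and later deprecate rmr in favor of deleteall. Likewise, hdfs dfs -rmr is the deprecated spelling of hdfs dfs -rm -r.)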

A little over 200 regions needed to be deleted. The method most articles describe is to delete them row by row from the hbase:meta table:

deleteall 'hbase:meta', '...'
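
If the row-by-row approach is used anyway, the deleteall statements can be generated rather than typed. This is a rough sketch, not the method used in this post: meta_delete_cmds is a hypothetical helper that only prints commands, and it assumes the region names contain no single quotes.

```shell
# Turn region row keys (one per line on stdin) into hbase shell
# deleteall statements. Nothing is sent to HBase; the commands are
# only printed so they can be reviewed first.
meta_delete_cmds() {
  while IFS= read -r region; do
    [ -n "$region" ] || continue   # skip blank lines
    printf "deleteall 'hbase:meta', '%s'\n" "$region"
  done
}
```

The output can be saved to a file, checked by hand, and then passed to hbase shell as a script.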

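With this many problem regions, however, the extraRegionsInMeta --fix command provided by HBCK2 can delete them in bulk. The command scans for regions that exist in hbase:meta but have no directory on the filesystem. With the --fix option, it automatically removes those regions from hbase:meta: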

> hbase hbck -j /usr/local/hbase-hbck2/hbase-hbck2-1.2.0.jar extraRegionsInMeta  --fix default

Regions that had no dir on the FileSystem and got removed from Meta: 64
Regions in Meta but having no equivalent dir, for each table:
        hb_data_2-> ec86493893d7742525919263abd01c2c e23d0e908125730a5e1095720bf13e50 ...
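
As far as I understand the HBCK2 documentation, running extraRegionsInMeta without --fix only reports the extra regions, so the report can be inspected before anything is removed. The sketch below counts the encoded region names listed per table. It assumes the report lines look like the "table-> id id ..." line above, and extra_region_counts is a hypothetical helper name.

```shell
# Count the 32-character hex encoded region names listed after
# "table->" on each report line read from stdin.
# Prints one "table count" pair per line.
extra_region_counts() {
  awk '/->/ {
    split($0, parts, "->")
    table = parts[1]
    gsub(/^[ \t]+|[ \t]+$/, "", table)      # trim the table name
    n = split(parts[2], ids, " ")
    count = 0
    for (i = 1; i <= n; i++)
      if (length(ids[i]) == 32 && ids[i] ~ /^[0-9a-f]+$/) count++
    print table, count
  }'
}
```

Comparing these counts with the number of regions expected for the table is a cheap sanity check before rerunning the command with --fix.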

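After the cleanup, restart HBase. The RIT regions are gone.

A few regions cannot be deleted

After the cleanup above, a few regions were still in the OPENING state. Even after they were deleted from hbase:meta, they came back once the HBase service was restarted. In this situation, the "Procedures & Locks" page of the HBase web UI shows the corresponding procedure and its procedure ID. HBCK2's bypass command can clear it. Be sure to pass the -o (override) option, otherwise the bypass may fail: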

> hbase hbck -j /usr/local/hbase-hbck2/hbase-hbck2-1.2.0.jar bypass -r -o 412150
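
The bypass command accepts several procedure IDs at once, so a stuck batch can be handled in one call. The sketch below only builds and prints the command line after checking that every argument is numeric. bypass_cmd is a hypothetical helper, and the jar path is the one used in this post.

```shell
# Path of the HBCK2 jar as used elsewhere in this post.
HBCK2_JAR=/usr/local/hbase-hbck2/hbase-hbck2-1.2.0.jar

# Print (do not run) an HBCK2 bypass command for the given procedure IDs.
# Fails if no IDs are given or if any argument is not a plain number.
bypass_cmd() {
  [ "$#" -gt 0 ] || { echo "usage: bypass_cmd PID..." >&2; return 1; }
  for pid in "$@"; do
    case "$pid" in
      ''|*[!0-9]*) echo "not a procedure id: $pid" >&2; return 1 ;;
    esac
  done
  echo "hbase hbck -j $HBCK2_JAR bypass -r -o $*"
}
```

Printing the command instead of executing it leaves a chance to compare the IDs with the "Procedures & Locks" page before anything is bypassed.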

Then delete the problem region's row from hbase:meta once more:

scan 'hbase:meta', {STARTROW=>'hb_data_3', LIMIT=>2}
deleteall 'hbase:meta','hb_data_3,f8000000,1670290921940.e4bb0233b1f0130e661a347127087e7d.'

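Installing HBCK2

HBase 2 ships with the old HBCK1 tool by default, so HBCK2 has to be installed first. Building it from source is recommended so that it matches the HBase version exactly. Running the prebuilt binary release directly is fairly likely to throw exceptions.

First download the source (a mirror in China is faster):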

> wget https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/hbase-operator-tools-1.2.0/hbase-operator-tools-1.2.0-src.tar.gz --no-check-certificate
> tar zxvf hbase-operator-tools-1.2.0-src.tar.gz

Then manually edit hbase.version in pom.xml to match the version in use (note that the log4j2 version also has to match the one HBase uses):

> vi pom.xml
<hbase.version>2.5.1</hbase.version>
<log4j2.version>2.17.2</log4j2.version>
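
The same edit can be scripted, which is handy when the build is repeated on several machines. This is a minimal sketch: set_pom_property is a hypothetical helper, and it assumes each property appears on one line as <name>value</name> and that the new value contains no "|" or "&" characters.

```shell
# Rewrite <NAME>...</NAME> in FILE so that it holds VALUE.
# usage: set_pom_property FILE NAME VALUE
set_pom_property() {
  file=$1; name=$2; value=$3
  tmp=$(mktemp) || return 1
  # Write to a temp file first, then copy back so the original
  # file keeps its permissions.
  sed "s|<$name>[^<]*</$name>|<$name>$value</$name>|" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return "$status"
}
```

With the versions from this post, the calls would be `set_pom_property pom.xml hbase.version 2.5.1` and `set_pom_property pom.xml log4j2.version 2.17.2`.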

Build the source. This step takes about an hour, mostly because downloading the dependencies is slow:

> mvn install

When hbck2 1.2.0 is built against HBase 2.5.1, two test cases fail:

[INFO] Results:
[INFO]
[ERROR] Errors:
[ERROR]   TestMissingTableDescriptorGenerator.shouldGenerateTableInfoBasedOnCachedTableDescriptor:91 » IllegalArgument
[ERROR]   TestMissingTableDescriptorGenerator.shouldGenerateTableInfoBasedOnFileSystem:121 » IllegalArgument
[INFO]
[ERROR] Tests run: 72, Failures: 0, Errors: 2, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache HBase Operator Tools 1.2.0:
[INFO]
[INFO] Apache HBase Operator Tools ........................ SUCCESS [09:14 min]
[INFO] Apache HBase - Table Reporter ...................... SUCCESS [13:53 min]
[INFO] Apache HBase - HBCK2 ............................... FAILURE [33:07 min]
[INFO] Apache HBase - HBase Tools ......................... SKIPPED
[INFO] Apache HBase Operator Tools - Assembly ............. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  56:15 min
[INFO] Finished at: 2022-02-06T21:53:23+08:00
[INFO] ------------------------------------------------------------------------

My guess is that hbck2 has not caught up with this HBase version. Skipping the test cases works around it for now:

> mvn install -Dmaven.test.skip=true
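
Once the build succeeds, the HBCK2 jar has to end up where the commands in this post expect it. The sketch below locates it, assuming Maven's usual layout in which each module writes its jar under its own target directory. find_hbck2_jar is a hypothetical helper name.

```shell
# Print the path of the first HBCK2 jar found under a Maven target
# directory, ignoring test and source jars.
# usage: find_hbck2_jar [SOURCE_DIR]   (defaults to the current directory)
find_hbck2_jar() {
  find "${1:-.}" -path '*/target/*' -name 'hbase-hbck2-*.jar' \
    ! -name '*-tests.jar' ! -name '*-sources.jar' | head -n 1
}
```

The jar it prints can then be copied to /usr/local/hbase-hbck2/, the location passed to -j in the commands above.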

References

https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2
https://zhuanlan.zhihu.com/p/373957937