Duplicate region problem:
- Inspect the table's regions in meta
- scan 'hbase:meta' , {LIMIT=>10,FILTER=>"PrefixFilter('INDEX_11')"}
-
- During a data migration we ran into two duplicate regions
- b0c8f08ffd7a96219f748ef14d7ad4f8,73ab00eaa7bab7bc83f440549b9749a3
-
- Delete the meta entries of the two duplicate regions
-
- delete 'hbase:meta','INDEX_11,4380_2431,1429757926776.b0c8f08ffd7a96219f748ef14d7ad4f8.','info:regioninfo'
-
- delete 'hbase:meta','INDEX_11,5479_0041431700000000040100004815E9,1429757926776.73ab00eaa7bab7bc83f440549b9749a3.','info:regioninfo'
-
- Delete the two duplicate region directories on HDFS
-
- /hbase/data/default/INDEX_11/b0c8f08ffd7a96219f748ef14d7ad4f8
- /hbase/data/default/INDEX_11/73ab00eaa7bab7bc83f440549b9749a3
-
- Restart the corresponding regionserver (only to refresh the RIS state reported on the HMaster)
-
- This is guaranteed to lose data: whatever was stored in the duplicate regions that were not online is lost
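The row deleted from hbase:meta and the directory removed from HDFS are tied together by the region name: its last component is the encoded region name, which is also the directory name. A minimal Python sketch of that relationship, assuming the `table,startkey,timestamp.encoded.` layout seen in the rows above and the `/hbase/data/default` root from the listed paths (the helper names `parse_region_name` and `hdfs_region_dir` are made up for illustration and are not part of HBase):

```python
from typing import NamedTuple


class RegionName(NamedTuple):
    table: str
    start_key: str
    timestamp: int
    encoded: str


def parse_region_name(name: str) -> RegionName:
    """Split 'table,startkey,timestamp.encoded.' into its parts."""
    if not name.endswith("."):
        raise ValueError("expected a trailing '.' after the encoded name")
    # The start key may itself contain commas, so peel the table off the
    # left and the 'timestamp.encoded' suffix off the right.
    table, rest = name[:-1].split(",", 1)
    start_key, suffix = rest.rsplit(",", 1)
    timestamp, encoded = suffix.split(".", 1)
    return RegionName(table, start_key, int(timestamp), encoded)


def hdfs_region_dir(name: str, root: str = "/hbase/data/default") -> str:
    """Directory holding the region's files (the root is an assumption)."""
    region = parse_region_name(name)
    return f"{root}/{region.table}/{region.encoded}"


row = "INDEX_11,4380_2431,1429757926776.b0c8f08ffd7a96219f748ef14d7ad4f8."
print(hdfs_region_dir(row))
# prints /hbase/data/default/INDEX_11/b0c8f08ffd7a96219f748ef14d7ad4f8
```

Deriving the path from the meta row key, rather than typing both by hand, makes it harder to delete the meta entry of one region and the directory of another.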
New hbase hbck
- The new version of hbck
-
- The new version of hbck can repair many kinds of errors. The repair options are:
- (1) -fix: kept for backward compatibility; superseded by -fixAssignments
- (2) -fixAssignments: repairs region assignment errors
- (3) -fixMeta: repairs problems in the meta table, provided the region info on HDFS exists and is correct
- (4) -fixHdfsHoles: repairs region holes (a key range not covered by any region)
- (5) -fixHdfsOrphans: repairs orphan regions (regions on HDFS that have no .regioninfo file)
- (6) -fixHdfsOverlaps: repairs region overlaps (overlapping key ranges)
- (7) -fixVersionFile: repairs a missing hbase.version file
- (8) -maxMerge <n> (default 5): overlapping regions have to be merged; a single merge combines at most n regions
- (9) -sidelineBigOverlaps: when repairing region overlaps, allows the regions that overlap with the most other regions to be left out (sidelined); after the repair, the sidelined data can be bulk loaded into the appropriate regions
- (10) -maxOverlapsToSideline <n> (default 2): when repairing region overlaps, the maximum number of regions per group that may be sidelined
- Because there are so many options, there are two shorthand options:
- (11) -repair: equivalent to -fixAssignments -fixMeta -fixHdfsHoles -fixHdfsOrphans -fixHdfsOverlaps -fixVersionFile -sidelineBigOverlaps
- (12) -repairHoles: equivalent to -fixAssignments -fixMeta -fixHdfsHoles -fixHdfsOrphans
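To see what a shorthand actually turns on before running it, the two expansions listed above can be written down as data. This is only an illustrative Python sketch: the table copies the equivalences from these notes, and `expand_options` is a hypothetical helper, not something hbck provides.

```python
# Shorthand expansions exactly as listed in the notes above.
SHORTHANDS = {
    "-repair": [
        "-fixAssignments", "-fixMeta", "-fixHdfsHoles", "-fixHdfsOrphans",
        "-fixHdfsOverlaps", "-fixVersionFile", "-sidelineBigOverlaps",
    ],
    "-repairHoles": [
        "-fixAssignments", "-fixMeta", "-fixHdfsHoles", "-fixHdfsOrphans",
    ],
}


def expand_options(args):
    """Replace shorthand flags with the flags they stand for.

    Repeated flags are dropped; plain values such as the number after
    -maxMerge are always kept.
    """
    expanded = []
    for arg in args:
        for item in SHORTHANDS.get(arg, [arg]):
            if item.startswith("-") and item in expanded:
                continue
            expanded.append(item)
    return expanded


print(" ".join(expand_options(["-repairHoles", "-fixVersionFile"])))
# prints -fixAssignments -fixMeta -fixHdfsHoles -fixHdfsOrphans -fixVersionFile
```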
-
-
-
- Problems the new version of hbck can fix
- (1) The hbase.version file is missing
- Fix with -fixVersionFile
- (2) A region is neither in the META table nor on HDFS, but is in a regionserver's set of online regions
- Fix with -fixAssignments
- (3) A region is in the META table and in a regionserver's set of online regions, but is missing from HDFS
- Fix with -fixAssignments -fixMeta (-fixAssignments tells the regionserver to close the region; -fixMeta deletes the region's row from the META table)
- (4) A region has no row in the META table and is not served by any regionserver, but exists on HDFS
- Fix with -fixMeta -fixAssignments (-fixMeta adds the region's row to the META table; -fixAssignments assigns the region)
- (5) A region has no row in the META table, exists on HDFS, and is being served by a regionserver
- Fix with -fixMeta: it adds the region's row to the META table, then undeploys the region and assigns it again
- (6) A region has a row in the META table, but is missing from HDFS and is not served by any regionserver
- Fix with -fixMeta, which deletes the row from the META table
- (7) A region has a row in the META table and exists on HDFS, the table is not disabled, but the region is not being served
- Fix with -fixAssignments, which assigns the region
- (8) A region has a row in the META table and exists on HDFS, the table is disabled, but the region is being served by a regionserver
- Fix with -fixAssignments, which undeploys the region
- (9) A region has a row in the META table and exists on HDFS, the table is not disabled, but the region is being served by more than one regionserver
- Fix with -fixAssignments, which tells every regionserver to close the region and then assigns it
- (10) A region is in the META table, exists on HDFS, and should be served, but the regionserver recorded in META is not the one actually serving it
- Fix with -fixAssignments
-
- (11) Region holes
- Add -fixHdfsHoles: it creates a new empty region to plug the hole, but neither assigns the region nor adds it to the META table
- (12) A region on HDFS has no .regioninfo file
- Fix with -fixHdfsOrphans
- (13) Region overlaps
- Add -fixHdfsOverlaps
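Cases (2) to (10) above amount to a lookup on three facts about a region: whether it has a row in META, whether it exists on HDFS, and how many regionservers are serving it. The Python sketch below encodes that lookup exactly as these notes describe it. It is a reading aid, not hbck's real logic; the function and parameter names are hypothetical.

```python
def hbck_options(in_meta, in_hdfs, deployed_on, table_disabled=False,
                 meta_server_matches=True):
    """Return the hbck options the notes recommend for one region's state.

    deployed_on is the number of regionservers currently serving the region.
    An empty list means the notes describe no problem for that state.
    """
    deployed = deployed_on > 0
    if not in_meta and not in_hdfs:
        return ["-fixAssignments"] if deployed else []            # case 2
    if in_meta and not in_hdfs:
        # case 3 (still deployed) or case 6 (not deployed)
        return ["-fixAssignments", "-fixMeta"] if deployed else ["-fixMeta"]
    if not in_meta and in_hdfs:
        # case 5 (deployed) or case 4 (not deployed)
        return ["-fixMeta"] if deployed else ["-fixMeta", "-fixAssignments"]
    # From here on the region is both in META and on HDFS.
    if table_disabled:
        return ["-fixAssignments"] if deployed else []            # case 8
    if deployed_on != 1 or not meta_server_matches:
        return ["-fixAssignments"]                                # cases 7, 9, 10
    return []


# Case 4: on HDFS only, nobody serves it.
print(hbck_options(in_meta=False, in_hdfs=True, deployed_on=0))
# prints ['-fixMeta', '-fixAssignments']
```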
-
-
- Notes:
- (1) When repairing region holes, -fixHdfsHoles only creates a new empty region to cover the missing key range. -fixAssignments (to assign the region) and -fixMeta (to add the region's row to the META table) are still needed to finish the job, hence the combined option -repairHoles, which is equivalent to -fixAssignments -fixMeta -fixHdfsHoles -fixHdfsOrphans
- (2) -fixAssignments repairs regions that are unassigned, that should not be assigned, or that are assigned more than once
- (3) -fixMeta: if the region is missing from HDFS, its row is deleted from the META table; if the region exists on HDFS, its row is added to the META table
- (4) -repair turns on all repair options; it is equivalent to -fixAssignments -fixMeta -fixHdfsHoles -fixHdfsOrphans -fixHdfsOverlaps -fixVersionFile -sidelineBigOverlaps
-
- The new version of hbck gathers table and region information from three places, (1) the HDFS directories, (2) META, and (3) the RegionServers, and uses it to diagnose and repair inconsistencies
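Repost: manually deleting a table via meta
- Disk space on the cluster was tight, so we decided to add the COMPRESSION=>LZO attribute to an existing table. But creating the table hung without any response for a long time. We then decided to drop the table, but the drop kept failing. After a cluster restart, the HBase web UI on port 60010 showed regions in transition: the regions of the table that had failed to create kept flipping between PENDING_OPEN and CLOSED. describe failed, enable failed, disable failed, and viewing the table from the 60010 UI failed too. Very frustrating.
- So we decided to force-delete the table. A Google search turned up an article that is mostly correct, but its last step is wrong. The command in that article is:
- delete 'TrojanInfo','TrojanInfo,,1361433390076.2636b5a2b3d3d08f23d2af9582f29bd8.','info:server'
- That looked wrong right away: it never touches the .META. table, so how could it update the META information?
- Two attempts at the delete both failed with errors, which confirmed the problem. To be safe, I searched for how to update META and changed the command to
- delete '.META.','TrojanInfo,,1361433390076.2636b5a2b3d3d08f23d2af9582f29bd8.','info:server'
- The command succeeded.
- After restarting the cluster the regions in transition were still there. The likely cause was that the meta table had not been refreshed, so I ran major_compact on the meta table and restarted the cluster again. That worked, and the errors were gone.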
-
- Below is a copy of the original article:
- Force-deleting a table:
- 1. Force-delete all of the table's files on HDFS (adjust the path to your setup):
-
- ./hadoop fs -rmr /hbase/TrojanInfo
- 2. Delete the table's rows from the HBase system table .META.:
- A. First find the rowkey of table TrojanInfo in .META., for example by running scan '.META.' and filtering the output by hand;
- B. Then delete the 3 columns under that rowkey (assuming the rowkey found is TrojanInfo,,1361433390076.2636b5a2b3d3d08f23d2af9582f29bd8.)
-
- delete 'TrojanInfo','TrojanInfo,,1361433390076.2636b5a2b3d3d08f23d2af9582f29bd8.','info:server'
- delete 'TrojanInfo','TrojanInfo,,1361433390076.2636b5a2b3d3d08f23d2af9582f29bd8.','info:serverstartcode'
- delete 'TrojanInfo','TrojanInfo,,1361433390076.2636b5a2b3d3d08f23d2af9582f29bd8.','info:reg
-
Repost: meta table repair, part 3
- I. Cause of the failure
- On August 1, 2013 the server at 10.191.135.3 rebooted, which stopped every service running on it, including NTP. With NTP down, the clocks on most machines in the HBase cluster fell out of sync with the host's time, and the regionserver processes aborted. After the restart, holes appeared in the region chain, so the data had to be repaired before inserts could be served normally again.
-
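- II. Recovery procedure
- 1. Of the cluster's 50 regionservers, 41 went down. The namenode machine 10.191.135.3 rebooted for unknown reasons (still being investigated), which took down the namenode, zookeeper, and time synchronization services running on it.
- 2. When restarting the HBase service, the remaining 9 regionservers could not be stopped cleanly, so their processes were killed by hand.
- 3. The hlogs were moved away on HDFS (to keep log splitting at startup from taking too long and affecting service), and HBase was restarted. The clock on 10.191.135.30 turned out to be out of sync with the time server 10.191.135.3; after syncing it by hand the restart succeeded, and HBase could serve queries normally.
- 4. A MapReduce job that puts data was run. It threw exceptions and the data could not be inserted.
- 5. Ran /opt/hbase/bin/hbase hbck -fixAssignments to try to reassign regions. The output showed that HBase had holes, i.e. the key ranges of adjacent regions were no longer contiguous.
- 6. From the above we concluded that data was lost while the regionservers went down and were restarted, so the holes had to be repaired. However, hbase hbck only ever reported three holes.
- 7. A tool we wrote, regionTest.jar, was used to find the regionnames around the holes. HBase was then stopped and regions were merged to close the holes.
- 8. The merge operation first reads the region's information from the .META. table. Because .META. was itself damaged when the regionservers went down, some regions had no .META. entry and the merge threw a NullPointerException. The only option was to move those regions out of HDFS, run regionTest.jar again to find the regionnames around the new holes, and merge to close them.
- 9. Region overlaps: a regionname still exists in .META., but its directory was wrongly moved out of HDFS and a region merge was then performed. In this case use regionTest.jar to find the overlapping regionnames and delete them from .META. by hand; .META. must be flushed after it is modified.
- 10. Finally, hbase hbck was run again and every table reported status OK.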
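The source of regionTest.jar is not shown in these notes, so the following is not that tool. It is a minimal Python sketch of the kind of check steps 7 to 9 depend on: sort the regions by start key and compare each region's end key with the next region's start key. Keys are compared as plain strings, which matches byte order only for ASCII keys, and the function name `check_region_chain` is hypothetical.

```python
def check_region_chain(regions):
    """Find holes and overlaps in a list of (start_key, end_key) pairs.

    An empty start key means "beginning of table" and an empty end key
    means "end of table". Only adjacent regions are compared.
    """
    holes, overlaps = [], []
    ordered = sorted(regions)
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if prev_end == next_start:
            continue  # contiguous: nothing to report
        if prev_end != "" and prev_end < next_start:
            # Nobody covers the range [prev_end, next_start).
            holes.append((prev_end, next_start))
        else:
            # The earlier region reaches past the start of the next one.
            overlaps.append((next_start, prev_end))
    return holes, overlaps


regions = [("", "b"), ("b", "d"), ("f", "h"), ("g", "")]
print(check_region_chain(regions))
# prints ([('d', 'f')], [('g', 'h')])
```

A hole is reported as the uncovered key range, and an overlap as the range covered twice, which is the same pair of values (first end key, second start key) that decides which regions need merging or removal.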
-
- III. Related commands and error output
- 1. Commands to sync the time by hand
service ntpd stop
ntpdate 192.168.1.20
service ntpd start
-
-
- 2. org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2 actions: WrongRegionException: 2 times, servers with issues: datanode10:60020,
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1641)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1409)
at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:949)
at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:826)
at org.apache.hadoop.hbase.client.HTable.put(HTable.java:801)
at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:123)
at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:84)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:533)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:88)
at o
-
- 3. 13/08/01 18:30:02 DEBUG util.HBaseFsck: There are 22093 region info entries
ERROR: There is a hole in the region chain between +8615923208069cmnet201303072132166264580 and +861592321. You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: There is a hole in the region chain between +8618375993383cmwap20130512235639430 and +8618375998629cmnet201305040821436779670. You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: There is a hole in the region chain between +8618725888080cmnet201212271719506311400 and +8618725889786cmnet201302131646431671140. You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table cqgprs
Summary:
-ROOT- is okay.
Number of regions: 1
Deployed on: datanode14,60020,1375330955915
.META. is okay.
Number of regions: 1
Deployed on: datanode21,60020,1375330955825
cqgprs is okay.
Number of regions: 22057
Deployed on: datanode1,60020,1375330955761 datanode10,60020,1375330955748 datanode11,60020,1375330955736 datanode12,60020,1375330955993 datanode13,60020,1375330955951 datanode14,60020,1375330955915 datanode15,60020,1375330955882 datanode16,60020,1375330955892 datanode17,60020,1375330955864 datanode18,60020,1375330955703 datanode19,60020,1375330955910 datanode2,60020,1375330955751 datanode20,60020,1375330955849 datanode21,60020,1375330955825 datanode22,60020,1375334479752 datanode23,60020,1375330955835 datanode24,60020,1375330955932 datanode25,60020,1375330955856 datanode26,60020,1375330955807 datanode27,60020,1375330955882 datanode28,60020,1375330955785 datanode29,60020,1375330955799 datanode3,60020,1375330955778 datanode30,60020,1375330955748 datanode31,60020,1375330955877 datanode32,60020,1375330955763 datanode33,60020,1375330955755 datanode34,60020,1375330955713 datanode35,60020,1375330955768 datanode36,60020,1375330955896 datanode37,60020,1375330955884 datanode38,60020,1375330955918 datanode39,60020,1375330955881 datanode4,60020,1375330955826 datanode40,60020,1375330955770 datanode41,60020,1375330955824 datanode42,60020,1375449245386 datanode43,60020,1375330955880 datanode44,60020,1375330955902 datanode45,60020,1375330955881 datanode46,60020,1375330955841 datanode47,60020,1375330955790 datanode48,60020,1375330955848 datanode49,60020,1375330955849 datanode5,60020,1375330955880 datanode50,60020,1375330955802 datanode6,60020,1375330955753 datanode7,60020,1375330955890 datanode8,60020,1375330955967 datanode9,60020,1375330955948
test1 is okay.
Number of regions: 1
Deployed on: datanode43,60020,1375330955880
test2 is okay.
Number of regions: 1
Deployed on: datanode21,60020,1375330955825
35 inconsistencies detected.
Status: INCONSISTENT
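With more than 22,000 regions, the hole reports are easy to lose in the hbck output. A small Python sketch for pulling them out, assuming the message wording shown in the log above and keys that contain no whitespace (`find_holes` is a hypothetical helper, not part of HBase):

```python
import re

# Matches the hole message exactly as it appears in the log above.
HOLE = re.compile(
    r"There is a hole in the region chain between (\S+) and (\S+)\. +You need"
)


def find_holes(hbck_output):
    """Return (left_key, right_key) for every hole reported in hbck output."""
    return HOLE.findall(hbck_output)


sample = (
    "ERROR: There is a hole in the region chain between "
    "+8615923208069cmnet201303072132166264580 and +861592321. "
    "You need to create a new .regioninfo and region dir in hdfs to plug the hole.\n"
    "ERROR: Found inconsistency in table cqgprs\n"
)
for left, right in find_holes(sample):
    print(left, "->", right)
# prints +8615923208069cmnet201303072132166264580 -> +861592321
```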
-
- 4. hadoop jar regionTest.jar com.region.RegionReaderMain /hbase/cqgprs finds the regionnames around the holes in the cqgprs table.
-
- 5.==================================
first endKey = +8615808059207cmnet201307102326567966800
second startKey = +8615808058578cmnet201212251545557984830
first regionNmae = cqgprs,+8615808058578cmnet201212251545557984830,1375241186209.0f8266ad7ac45be1fa7233e8ea7aeef9.
second regionNmae = cqgprs,+8615808058578cmnet201212251545557984830,1362778571889.3552d3db8166f421047525d6be39c22e.
==================================
first endKey = +8615808060140cmnet201303051801355846850
second startKey = +8615808059207cmnet201307102326567966800
first regionNmae = cqgprs,+8615808058578cmnet201212251545557984830,1362778571889.3552d3db8166f421047525d6be39c22e.
second regionNmae = cqgprs,+8615808059207cmnet201307102326567966800,1375241186209.09d489d3df513bc79bab09cec36d2bb4.
==================================
-
- 6.Usage: bin/hbase org.apache.hadoop.hbase.util.Merge [-Dfs.default.name=hdfs:
./hbase org.apache.hadoop.hbase.util.Merge -Dfs.defaultFS=hdfs:
-
- 7. 13/08/01 22:24:02 WARN util.HBaseFsck: Naming new problem group: +8618225125357cmnet201212290358070667800
ERROR: (regions cqgprs,+8618225123516cmnet201304131404096748520,1375363774655.b3cf5cc752f4427a4e699270dff9839e. and cqgprs,+8618225125357cmnet201212290358070667800,1364421610707.7f7038bfbe2c0df0998a529686a3e1aa.) There is an overlap in the region chain.
13/08/01 22:24:02 WARN util.HBaseFsck: reached end of problem group: +8618225127504cmnet201302182135452100210
13/08/01 22:24:02 WARN util.HBaseFsck: Naming new problem group: +8618285642723cmnet201302031921019768070
ERROR: (regions cqgprs,+8618285277826cmnet201306170027424674330,1375363962312.9d1e93b22cec90fd75361fa65b1d20d2. and cqgprs,+8618285642723cmnet201302031921019768070,1360873307626.f631cd8c6acc5e711e651d13536abe94.) There is an overlap in the region chain.
13/08/01 22:24:02 WARN util.HBaseFsck: reached end of problem group: +8618286275556cmnet201212270713444340110
13/08/01 22:24:02 WARN util.HBaseFsck: Naming new problem group: +8618323968833cmnet201306010239025175240
ERROR: (regions cqgprs,+8618323967956cmnet201306091923411365860,1375364143678.665dba6a14ebc9971422b39e079b00ae. and cqgprs,+8618323968833cmnet201306010239025175240,1372821719159.6d2fecc1b3f9049bbca83d84231eb365.) There is an overlap in the region chain.
13/08/01 22:24:02 WARN util.HBaseFsck: reached end of problem group: +8618323992353cmnet201306012336364819810
ERROR: There is a hole in the region chain between +8618375993383cmwap20130512235639430 and +8618375998629cmnet201305040821436779670. You need to create a new .regioninfo and region dir in hdfs to plug the hole.
13/08/01 22:24:02 WARN util.HBaseFsck: Naming new problem group: +8618723686187cmnet201301191433522129820
ERROR: (regions cqgprs,+8618723683087cmnet201301300708363045080,1375364411992.4ee5787217c1da4895d95b3b92b8e3a2. and cqgprs,+8618723686187cmnet201301191433522129820,1362003066106.70b48899cc753a0036f11bb27d2194f9.) There is an overlap in the region chain.
13/08/01 22:24:02 WARN util.HBaseFsck: reached end of problem group: +8618723689138cmnet201301051742388948390
13/08/01 22:24:02 WARN util.HBaseFsck: Naming new problem group: +8618723711808cmnet201301031139206225900
ERROR: (regions cqgprs,+8618723710003cmnet201301250809235976320,1375364586329.40eed10648c9a43e3d5ce64e9d63fe00. and cqgprs,+8618723711808cmnet201301031139206225900,1361216401798.ebc442e02f5e784bce373538e06dd232.) There is an overlap in the region chain.
13/08/01 22:24:02 WARN util.HBaseFsck: reached end of problem group: +8618723714626cmnet201302122009459491970
ERROR: There is a hole in the region chain between +8618725888080cmnet201212271719506311400 and +8618725889786cmnet201302131646431671140. You need to create a new .regioninfo and region dir in hdfs to plug the hole.
-
- 8. delete '.META.','regionname','info:serverstartcode'
- delete '.META.','regionname','info:server'
- delete '.META.','regionname','info:regioninfo'
-
- 9. flush '.META.'
major_compact '.META.'
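When several overlapping regions have to be removed from the catalog, typing these statements per region invites mistakes. The Python sketch below only builds the HBase shell statements as text so they can be reviewed before being pasted into the shell; nothing is executed. The default column names follow the `info:server`, `info:serverstartcode` and `info:regioninfo` columns used in the delete commands elsewhere in these notes, `meta_cleanup_commands` is a hypothetical helper, and region names containing a single quote are not handled.

```python
def meta_cleanup_commands(region_name, catalog=".META.",
                          columns=("info:server", "info:serverstartcode",
                                   "info:regioninfo")):
    """Build HBase shell statements that delete one region row's columns
    from the catalog table, then flush and major-compact that table."""
    commands = [
        f"delete '{catalog}','{region_name}','{column}'" for column in columns
    ]
    commands.append(f"flush '{catalog}'")
    commands.append(f"major_compact '{catalog}'")
    return commands


row = "TrojanInfo,,1361433390076.2636b5a2b3d3d08f23d2af9582f29bd8."
print("\n".join(meta_cleanup_commands(row)))
```

On versions where the catalog table is named hbase:meta, as in the first section of these notes, pass `catalog="hbase:meta"`.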