取重復(fù)記錄中的任一條的分析
?
??? 最近一直考慮一個問題,找出表中同一字段重復(fù)記錄中的任意一條,照理來說這樣的一個需求,在Oracle內(nèi)部進行實現(xiàn)是很方便的,而且不需要對表進行2遍的掃描。但是事實上我想了很久也向不出來有什么函數(shù)可以直接實現(xiàn)這一功能,基本上所有可以這樣做的方法都需要進行嵌套才能完成。不知道為什么Oracle沒有提供這個功能,也許是有什么邏輯矛盾我沒有想到。現(xiàn)在總結(jié)一下,模擬的環(huán)境如下:
?
??? create table t1_t(a int,b varchar2(50));
??? insert into t1_t(a,b) (select mod(rownum,50) rn,object_name from user_objects);
??? insert into t1_t(a,b) (select * from t1_t);
??? insert into t1_t(a,b) (select * from t1_t);
??? insert into t1_t(a,b) (select * from t1_t);
??? insert into t1_t(a,b) (select * from t1_t);
??? commit;
?
??? 至此t1_t表大致上有300萬條記錄左右,其中a字段的取值都是0-49之間的integer,現(xiàn)在的目標(biāo)就是取到50條記錄,a分別等于0~49,具體哪一條都可以,但需要同時取出b字段。
?
?
一、方法大致有兩種:
?
1、使用row_number() over函數(shù)找到首條記錄選出
?
??? select * from
??? (select a,b,row_number() over(partition by a order by 1) k from t1_t)
??? where k=1;
?
??? 實際執(zhí)行時間:在軟解析的情況下需要3s左右。執(zhí)行計劃如下:
?
Execution Plan
----------------------------------------------------------
?? 0????? SELECT STATEMENT Optimizer=ALL_ROWS (Cost=32068 Card=3632040 Bytes=192498120)
?? 1??? 0?? VIEW (Cost=32068 Card=3632040 Bytes=192498120)
?? 2??? 1???? WINDOW (SORT PUSHED RANK) (Cost=32068 Card=3632040 Bytes=101697120)
?? 3??? 2?????? TABLE ACCESS (FULL) OF 'T1_T' (TABLE) (Cost=3255 Card=3632040 Bytes=101697120)
?
Statistics
----------------------------------------------------------
??????? 165? recursive calls
???????? 22? db block gets
????? 17303? consistent gets
?????? 1937? physical reads
????? 75904? redo size
?????? 2682? bytes sent via SQL*Net to client
??????? 537? bytes received via SQL*Net from client
????????? 5? SQL*Net roundtrips to/from client
????????? 4? sorts (memory)
????????? 1? sorts (disk)
???????? 50? rows processed
?
2、兩層關(guān)聯(lián),用rowid別名進行匹配:
?
??? select t1.* from t1_t t1,(select min(rowid) rid from t1_t group by a) t2
??? where t1.rowid = t2.rid;
?
??? 實際執(zhí)行時間:軟解析情況下1.5s左右。執(zhí)行計劃如下,比上面的方法要好很多,主要是由于不需要進行排序的操作:
?
Execution Plan
----------------------------------------------------------
?? 0????? SELECT STATEMENT Optimizer=ALL_ROWS (Cost=3919 Card=50 Bytes=2000)
?? 1??? 0?? NESTED LOOPS (Cost=3919 Card=50 Bytes=2000)
?? 2??? 1???? VIEW (Cost=3869 Card=50 Bytes=600)
?? 3??? 2?????? HASH (GROUP BY) (Cost=3869 Card=50 Bytes=750)
?? 4??? 3???????? TABLE ACCESS (FULL) OF 'T1_T' (TABLE) (Cost=3243 Card=3632040 Bytes=54480600)
?? 5??? 1???? TABLE ACCESS (BY USER ROWID) OF 'T1_T' (TABLE) (Cost=1 Card=1 Bytes=28)
?
Statistics
----------------------------------------------------------
????????? 1? recursive calls
????????? 0? db block gets
????? 16278? consistent gets
????????? 0? physical reads
????????? 0? redo size
?????? 2638? bytes sent via SQL*Net to client
??????? 537? bytes received via SQL*Net from client
????????? 5? SQL*Net roundtrips to/from client
????????? 0? sorts (memory)
????????? 0? sorts (disk)
???????? 50? rows processed
?
?
二、考慮使用索引
?
?
??? 首先增加a、b字段的索引:
?
??? create
index
t1_t_a
on
t1_t(a);
??? create
index
t1_t_b
on
t1_t(b);
???
??? 要使用索引,需要將字段設(shè)置為not null,或在SQL中使用not null選項才可以:
?
???
alter
table
t1_t
modify
a
not
null
;
???
alter
table
t1_t
modify
b
not
null
;
?
?
1、對于第1種方法,執(zhí)行計劃完全沒變,執(zhí)行效率略有提高
?
Execution Plan
----------------------------------------------------------
?? 0????? SELECT STATEMENT Optimizer=ALL_ROWS (Cost=32068 Card=3632040 Bytes=192498120)
?? 1??? 0?? VIEW (Cost=32068 Card=3632040 Bytes=192498120)
?? 2??? 1???? WINDOW (SORT PUSHED RANK) (Cost=32068 Card=3632040 Bytes=101697120)
?? 3??? 2?????? TABLE ACCESS (FULL) OF 'T1_T' (TABLE) (Cost=3255 Card=3632040 Bytes=101697120)
?
Statistics
----------------------------------------------------------
????????? 2? recursive calls
???????? 22? db block gets
????? 16228? consistent gets
???????? 21? physical reads
????????? 0? redo size
?????? 2682? bytes sent via SQL*Net to client
??????? 537? bytes received via SQL*Net from client
????????? 5? SQL*Net roundtrips to/from client
????????? 0? sorts (memory)
????????? 1? sorts (disk)
???????? 50? rows processed
?
2、對于第2中方法,用INDEX FAST FULL SCAN 代替了 TABLE ACCESS FULL SCAN,執(zhí)行效率也略有提高
?
Execution Plan
----------------------------------------------------------
?? 0????? SELECT STATEMENT Optimizer=ALL_ROWS (Cost=2114 Card=50 Bytes=2000)
?? 1??? 0?? NESTED LOOPS (Cost=2114 Card=50 Bytes=2000)
?? 2??? 1???? VIEW (Cost=2064 Card=50 Bytes=600)
?? 3??? 2?????? HASH (GROUP BY) (Cost=2064 Card=50 Bytes=750)
?? 4??? 3???????? INDEX (FAST FULL SCAN) OF 'T1_T_A' (INDEX) (Cost=1439 Card=3632040 Bytes=54480600)
?? 5??? 1???? TABLE ACCESS (BY USER ROWID) OF 'T1_T' (TABLE) (Cost=1 Card=1 Bytes=28)
Statistics
----------------------------------------------------------
??????? 210? recursive calls
????????? 0? db block gets
?????? 7145? consistent gets
????????? 0? physical reads
????????? 0? redo size
?????? 2638? bytes sent via SQL*Net to client
??????? 537? bytes received via SQL*Net from client
????????? 5? SQL*Net roundtrips to/from client
????????? 5? sorts (memory)
????????? 0? sorts (disk)
???????? 50? rows processed
?
?
三、強制使用索引
?
1、對于方法1,使用索引t1_t_a
?
??? select
*
from
??? (
select
/*+INDEX(t1_t t1_t_a)*/
a,b,row_number() over(
partition
by
a
order
by
1
) k
from
t1_t)
??? where
k=
1
;
Execution Plan
----------------------------------------------------------
?? 0????? SELECT STATEMENT Optimizer=ALL_ROWS (Cost=844081 Card=3632040 Bytes=192498120)
?? 1??? 0?? VIEW (Cost=844081 Card=3632040 Bytes=192498120)
?? 2??? 1???? WINDOW (NOSORT) (Cost=844081 Card=3632040 Bytes=101697120)
?? 3??? 2?????? TABLE ACCESS (BY INDEX ROWID) OF 'T1_T' (TABLE) (Cost=815268 Card=3632040Bytes=101697120)
?
?? 4??? 3???????? INDEX (FULL SCAN) OF 'T1_T_A' (INDEX) (Cost=7154 Card=3632040)
Statistics
----------------------------------------------------------
????????? 0? recursive calls
????????? 0? db block gets
???? 814027? consistent gets
????????? 0? physical reads
????????? 0? redo size
?????? 2817? bytes sent via SQL*Net to client
??????? 537? bytes received via SQL*Net from client
????????? 5? SQL*Net roundtrips to/from client
????????? 0? sorts (memory)
????????? 0? sorts (disk)
???????? 50? rows processed
?
??? 相比沒有使用索引時的效率還要低的多,因為是全表掃描,所以使用全索引掃描造成了額外的開銷。
?
2、對于方法2:
?
??? 將INDEX FAST FULL SCAN 改成INDEX FULL SCAN 明顯沒有什么意義。
?
?
四、a,b的聯(lián)合索引:
?
??? 首先創(chuàng)建a,b的聯(lián)合索引:
??? createindex t1_t_ab on t1_t(a,b);
?
1、對方法1,使用索引t1_t_ab
?
?
??? select
*
from
??? (
select
/*+INDEX(t1_t t1_t_ab)*/
a,b,row_number() over(
partition
by
a
order
by
1
) k
from
t1_t)
??? where
k=
1
;
?
Execution Plan
----------------------------------------------------------
?? 0????? SELECT STATEMENT Optimizer=ALL_ROWS (Cost=48327 Card=3632040 Bytes=192498120)
?? 1??? 0?? VIEW (Cost=48327 Card=3632040 Bytes=192498120)
?? 2??? 1???? WINDOW (NOSORT) (Cost=48327 Card=3632040 Bytes=101697120)
?? 3??? 2?????? INDEX (FULL SCAN) OF 'T1_T_AB' (INDEX) (Cost=19514 Card=3632040 Bytes=1016
????????? 97120)
?
Statistics
----------------------------------------------------------
????????? 1? recursive calls
????????? 0? db block gets
????? 19368? consistent gets
????????? 0? physical reads
????????? 0? redo size
?????? 2783? bytes sent via SQL*Net to client
??????? 537? bytes received via SQL*Net from client
????????? 5? SQL*Net roundtrips to/from client
????????? 0? sorts (memory)
????????? 0? sorts (disk)
???????? 50? rows processed
?
??? 相比沒有使用索引的時候,性能有所降低,但是比使用t1_t_a索引要好的多了,但是如果對于a字段進行限制時,row_number() over函數(shù)會自動找到t1_t_ab索引,效果會很好
?
??? select
*
from
??? (
select
a,b,row_number() over(
partition
by
a
order
by
1
) k
from
t1_twhere a='1'
)
??? where
k=
1
;
?
Execution Plan
----------------------------------------------------------
?? 0????? SELECT STATEMENT Optimizer=ALL_ROWS (Cost=1119 Card=84004 Bytes=4452212)
?? 1??? 0?? VIEW (Cost=1119 Card=84004 Bytes=4452212)
?? 2??? 1???? WINDOW (NOSORT) (Cost=1119 Card=84004 Bytes=2352112)
?? 3??? 2?????? INDEX (RANGE SCAN) OF 'T1_T_AB' (INDEX) (Cost=453 Card=84004 Bytes=2352112)
?
Statistics
----------------------------------------------------------
????????? 1? recursive calls
????????? 0? db block gets
??????? 410? consistent gets
????????? 0? physical reads
????????? 0? redo size
??????? 537? bytes sent via SQL*Net to client
??????? 504? bytes received via SQL*Net from client
????????? 2? SQL*Net roundtrips to/from client
????????? 0? sorts (memory)
????????? 0? sorts (disk)
????????? 1? rows processed
?
?
五、總結(jié)
?
??? 基本上如果是要在全表中找到所有a的單條記錄,那么加不加索引的區(qū)別都不大,因為全表掃描是沒有必要走索引的,但是用方法2會比方法1要快速很多,資源占用也較少,原因是不需要排序的環(huán)節(jié)。
?
??? 如果只是想要取出單個a值的任意記錄,則可以選用方法1,并建立a,b字段的聯(lián)合索引,會大大提高效率。
?
?
?