This article was written by Hilton, who holds the copyright. To reprint it, please write to hitonyang@yahoo.com.cn

Although Apache considers Jakarta ORO the more complete regular expression package, Regexp is also very widely used, probably because it is so simple. These are my study notes on Regexp.
1. Download and installation
Download the source:
cvs -d :pserver:anoncvs@cvs.apache.org:/home/cvspublic login
password: anoncvs
cvs -d :pserver:anoncvs@cvs.apache.org:/home/cvspublic checkout jakarta-regexp
Or download a prebuilt package:
wget http://apache.linuxforum.net/dist/jakarta/regexp/binaries/jakarta-regexp-1.3.tar.gz
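2. Overview
1) Regexp is a 100% pure Java regular expression package that Jonathan Locke donated to the Apache Software Foundation. He first wrote it in 1996, and it has stood the test of time well. It comes with complete Javadoc documentation and a simple applet for visual debugging and compatibility testing.
2) RE is a very important class in the regexp package: an efficient, lightweight regular expression evaluator/matcher. RE is short for "regular expression". A regular expression is a pattern that can perform complex string matching, and when a string matches a pattern you can extract the matched parts, which is very useful for text parsing. The regular expression syntax is described below.
To compile a regular expression, simply construct an RE matcher object with the pattern as its argument. You can then call any of the RE.match methods to check a string against it; match returns true if the string matches and false otherwise. For example: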
RE r = new RE("a*b");
boolean matched = r.match("aaaab");
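RE.getParen returns the matched character sequence, or one part of it if the pattern contains the corresponding parentheses. The related methods getParenStart, getParenEnd and getParenLength return its position and length. For example: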
RE r = new RE("(a*)b"); // Compile expression
boolean matched = r.match("xaaaab"); // Match against "xaaaab"
String wholeExpr = r.getParen(0); // wholeExpr will be 'aaaab'
String insideParens = r.getParen(1); // insideParens will be 'aaaa'
int startWholeExpr = r.getParenStart(0); // startWholeExpr will be index 1
int endWholeExpr = r.getParenEnd(0); // endWholeExpr will be index 6
int lenWholeExpr = r.getParenLength(0); // lenWholeExpr will be 5
int startInside = r.getParenStart(1); // startInside will be index 1
int endInside = r.getParenEnd(1); // endInside will be index 5
int lenInside = r.getParenLength(1); // lenInside will be 4
RE supports backreferences in regular expressions. For example:
([0-9]+)=\1
matches strings of the form n=n (such as 0=0 or 2=2).
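The backreference pattern above can be tried out with a short sketch. Two assumptions to note: it uses the JDK's java.util.regex package rather than Jakarta Regexp, so that it runs without an extra jar, on the assumption that both engines treat this particular pattern the same way; and the class name BackrefDemo and the helper isRepeatedNumber are made up for the illustration.

```java
import java.util.regex.Pattern;

public class BackrefDemo {
    // \1 must repeat exactly the text captured by the first group.
    // (java.util.regex is used here as a stand-in for Jakarta Regexp.)
    private static final Pattern SAME_NUMBER = Pattern.compile("([0-9]+)=\\1");

    // Hypothetical helper: true only if the whole input has the form n=n.
    public static boolean isRepeatedNumber(String input) {
        return SAME_NUMBER.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isRepeatedNumber("0=0"));   // true
        System.out.println(isRepeatedNumber("12=12")); // true
        System.out.println(isRepeatedNumber("1=2"));   // false
    }
}
```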
3) The regular expression syntax supported by RE is as follows:
Characters
unicodeChar | Matches any identical Unicode character
\ | Used to quote a meta-character (like '*')
\\ | Matches a single '\' character
\0nnn | Matches a given octal character
\xhh | Matches a given 8-bit hexadecimal character
\uhhhh | Matches a given 16-bit hexadecimal character
\t | Matches an ASCII tab character
\n | Matches an ASCII newline character
\r | Matches an ASCII return character
\f | Matches an ASCII form feed character
Character classes
[abc] | Simple character class
[a-zA-Z] | Character class with ranges
[^abc] | Negated character class
Standard POSIX character classes
[:alnum:] | Alphanumeric characters.
[:alpha:] | Alphabetic characters.
[:blank:] | Space and tab characters.
[:cntrl:] | Control characters.
[:digit:] | Numeric characters.
[:graph:] | Characters that are printable and are also visible. (A space is printable, but not visible, while an 'a' is both.)
[:lower:] | Lower-case alphabetic characters.
[:print:] | Printable characters (characters that are not control characters).
[:punct:] | Punctuation characters (characters that are not letters, digits, control characters, or space characters).
[:space:] | Space characters (such as space, tab, and form feed, to name a few).
[:upper:] | Upper-case alphabetic characters.
[:xdigit:] | Characters that are hexadecimal digits.
Non-standard POSIX-style character classes
[:javastart:] | Start of a Java identifier
[:javapart:] | Part of a Java identifier
Predefined character classes
. | Matches any character other than newline
\w | Matches a "word" character (alphanumeric plus "_")
\W | Matches a non-word character
\s | Matches a whitespace character
\S | Matches a non-whitespace character
\d | Matches a digit character
\D | Matches a non-digit character
Boundary matchers
^ | Matches only at the beginning of a line
$ | Matches only at the end of a line
\b | Matches only at a word boundary
\B | Matches only at a non-word boundary
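As a rough illustration of what \b buys you, the sketch below counts whole-word occurrences of a word. It is written against the JDK's java.util.regex instead of Jakarta Regexp (an assumption made so the code runs on a plain JDK; \b is assumed to behave the same for ASCII text in both), and BoundaryDemo and countWholeWord are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryDemo {
    // Counts occurrences of word that have a word boundary on both sides.
    // Pattern.quote keeps any meta-characters in word literal.
    public static int countWholeWord(String word, String text) {
        Matcher m = Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "cat" and the "cat" in "cat's" count; "concat" and "category" do not.
        System.out.println(countWholeWord("cat", "cat concat cat's category")); // 2
    }
}
```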
Greedy quantifiers
A* | Matches A 0 or more times (greedy)
A+ | Matches A 1 or more times (greedy)
A? | Matches A 1 or 0 times (greedy)
A{n} | Matches A exactly n times (greedy)
A{n,} | Matches A at least n times (greedy)
Reluctant (non-greedy) quantifiers
A*? | Matches A 0 or more times (reluctant)
A+? | Matches A 1 or more times (reluctant)
A?? | Matches A 0 or 1 times (reluctant)
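The difference between greedy and reluctant quantifiers is easiest to see side by side. The following is a minimal sketch that assumes the JDK's java.util.regex as a stand-in for Jakarta Regexp (both are assumed to agree on these two patterns); QuantifierDemo and firstMatch are names invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Returns the first substring of input matched by regex, or null if there is none.
    public static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<b>bold</b>";
        // Greedy: .+ grabs as much as possible, so the match runs to the last '>'.
        System.out.println(firstMatch("<.+>", input));  // <b>bold</b>
        // Reluctant: .+? grabs as little as possible, so the match stops at the first '>'.
        System.out.println(firstMatch("<.+?>", input)); // <b>
    }
}
```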
Logical operators
AB | Matches A followed by B
A|B | Matches either A or B
(A) | Used for subexpression grouping
(?:A) | Used for subexpression clustering (just like grouping but no backrefs)
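To make the distinction between grouping and clustering concrete, here is a small sketch. It assumes java.util.regex from the JDK in place of Jakarta Regexp (the two are assumed to number groups the same way for these patterns), and GroupingDemo, groupCount and firstGroup are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupingDemo {
    // Number of capturing groups in regex; (?:...) clusters are not counted.
    public static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Text captured by group 1 when regex matches the whole input, else null.
    public static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(ab)+(c|d)"));            // 2
        System.out.println(groupCount("(?:ab)+(c|d)"));          // 1
        System.out.println(firstGroup("(ab)+(c|d)", "ababd"));   // ab
        System.out.println(firstGroup("(?:ab)+(c|d)", "ababd")); // d
    }
}
```

With clustering, the alternation becomes group 1 because the repeated "ab" part no longer takes up a group number.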
Backreferences
\1 | Backreference to 1st parenthesized subexpression
\2 | Backreference to 2nd parenthesized subexpression
\3 | Backreference to 3rd parenthesized subexpression
\4 | Backreference to 4th parenthesized subexpression
\5 | Backreference to 5th parenthesized subexpression
\6 | Backreference to 6th parenthesized subexpression
\7 | Backreference to 7th parenthesized subexpression
\8 | Backreference to 8th parenthesized subexpression
\9 | Backreference to 9th parenthesized subexpression
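The programs that RE runs are first compiled by the RECompiler class. For efficiency reasons, the RE matcher itself does not include the regular expression compiler. If you want to precompile one or more regular expressions, you can run the 'recompile' class from the command line, for example: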
java org.apache.regexp.recompile re1 "a*b"
This produces compiled output similar to the following (the last line is not part of the output):
// Pre-compiled regular expression "a*b"
char[] re1Instructions =
{
0x007c, 0x0000, 0x001a, 0x007c, 0x0000, 0x000d, 0x0041,
0x0001, 0x0004, 0x0061, 0x007c, 0x0000, 0x0003, 0x0047,
0x0000, 0xfff6, 0x007c, 0x0000, 0x0003, 0x004e, 0x0000,
0x0003, 0x0041, 0x0001, 0x0004, 0x0062, 0x0045, 0x0000,
0x0000,
};
REProgram re1 = new REProgram(re1Instructions);
RE r = new RE(re1);
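Building the RE matcher object from a precompiled REProgram avoids the cost of compiling at run time. If you need to construct regular expressions dynamically, you can create a single RECompiler object and use it to compile each expression. Note that RE and RECompiler are not thread-safe (for efficiency reasons), so in a multithreaded program you need to create a separate compiler and matcher for each thread.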
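The "compile once, one matcher per thread" arrangement can be sketched as follows. This is only an analogy under stated assumptions: it uses the JDK's java.util.regex, where Pattern plays the role of the precompiled program and Matcher plays the role of the non-thread-safe RE object. With Jakarta Regexp the ThreadLocal would hold an RE (and, if needed, an RECompiler) instead. PerThreadMatcherDemo and its members are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PerThreadMatcherDemo {
    // Compiled once and shared; java.util.regex.Pattern is immutable.
    private static final Pattern PATTERN = Pattern.compile("a*b");

    // One Matcher per thread, because a Matcher carries mutable match state.
    private static final ThreadLocal<Matcher> MATCHER =
            ThreadLocal.withInitial(() -> PATTERN.matcher(""));

    // True if the whole input matches; uses only the calling thread's matcher.
    public static boolean matches(String input) {
        return MATCHER.get().reset(input).matches();
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> System.out.println(
                Thread.currentThread().getName() + ": " + matches("aaaab"));
        Thread t1 = new Thread(task, "worker-1");
        Thread t2 = new Thread(task, "worker-2");
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
```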
3. Examples
1) The regexp package ships with a small demo applet, which can be run as follows:
java org.apache.regexp.REDemo
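2) Jeffrey Hunter has written a sample program that is available for download.
3) The test program bundled with regexp is also a valuable reference. It keeps all the regular expressions, the strings they are matched against, and the expected results in a single file, $REGEXPHOME/docs/RETest.txt. Because of this, the test has to be run from the $REGEXPHOME directory: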
cd $REGEXPHOME
java org.apache.regexp.RETest
References
1. Jeffrey Hunter's README_regular_expressions.txt
http://www.idevelopment.info/topics/topics.cgi?LEVEL=programming
2. The Jakarta Site - CVS Repository
http://jakarta.apache.org/site/cvsindex.html
Original author: Hilton
Source: http://hedong.3322.org/