Python exceptions and regular expressions
http://docs.python.org/library/re.html
http://docs.python.org/howto/regex.html#regex-howto
Example 6.1. Opening a non-existent file
>>> fsock = open("/notthere", "r")
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
IOError: [Errno 2] No such file or directory: '/notthere'
>>> try:
...     fsock = open("/notthere")
... except IOError:
...     print "The file does not exist, exiting gracefully"
... print "This line will always print"
The file does not exist, exiting gracefully
This line will always print
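The session above uses Python 2 syntax. As a rough sketch of the same pattern in Python 3 (the helper name read_or_none is hypothetical and not part of the original example), one could catch FileNotFoundError, the OSError subclass that open() raises for a missing path:

```python
def read_or_none(path):
    """Return the file's text, or None if the file does not exist."""
    try:
        with open(path, "r") as fsock:
            return fsock.read()
    except FileNotFoundError:
        # Raised by open() when the path does not exist.
        print("The file does not exist, exiting gracefully")
        return None


# Assumes "/notthere" really is absent on this machine.
result = read_or_none("/notthere")
print("This line will always print")
```

Returning None keeps the caller in control of what happens next, instead of letting the traceback end the program.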
# Bind the name getpass to the appropriate function
try:
    import termios, TERMIOS
except ImportError:
    try:
        import msvcrt
    except ImportError:
        try:
            from EasyDialogs import AskPassword
        except ImportError:
            getpass = default_getpass
        else:
            getpass = AskPassword
    else:
        getpass = win_getpass
else:
    getpass = unix_getpass
Example 6.10. Iterating through a dictionary
>>> import os
>>> for k, v in os.environ.items():
...     print "%s=%s" % (k, v)
USERPROFILE=C:\Documents and Settings\mpilgrim
OS=Windows_NT
COMPUTERNAME=MPILGRIM
USERNAME=mpilgrim
[...snip...]
>>> print "\n".join(["%s=%s" % (k, v)
...     for k, v in os.environ.items()])
USERPROFILE=C:\Documents and Settings\mpilgrim
OS=Windows_NT
COMPUTERNAME=MPILGRIM
Example 6.13. Using sys.modules
>>> import sys
>>> import fileinfo
>>> print '\n'.join(sys.modules.keys())
win32api
os.path
os
fileinfo
exceptions
>>> fileinfo
<module 'fileinfo' from 'fileinfo.pyc'>
>>> sys.modules["fileinfo"]
<module 'fileinfo' from 'fileinfo.pyc'>
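The next example shows how to combine the __module__ class attribute with the sys.modules dictionary to get a reference to the module in which a known class is defined.
Example 6.14. The __module__ class attribute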
>>> from fileinfo import MP3FileInfo
>>> MP3FileInfo.__module__
'fileinfo'
>>> sys.modules[MP3FileInfo.__module__]
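<module 'fileinfo' from 'fileinfo.pyc'>
Every Python class has a built-in class attribute __module__, which holds the name of the module in which the class is defined.
Combining this with the sys.modules dictionary, you can get a reference to the module in which a class is defined.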
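The fileinfo module belongs to the book and is not generally available, so the following sketch substitutes the standard library's fractions.Fraction as a stand-in class (an assumption made only for illustration). It is written in Python 3 syntax and shows the same lookup: class name to module name to module object.

```python
import sys
from fractions import Fraction

# __module__ holds the *name* of the defining module, as a string.
module_name = Fraction.__module__

# sys.modules maps the names of imported modules to the module objects.
module = sys.modules[module_name]

print(module_name)
# The module reached through sys.modules exposes the very same class object.
print(module.Fraction is Fraction)
```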
Example 6.16. Constructing pathnames
>>> import os
>>> os.path.join("c:\\music\\ap\\", "mahadeva.mp3")
'c:\\music\\ap\\mahadeva.mp3'
>>> os.path.join("c:\\music\\ap", "mahadeva.mp3")
'c:\\music\\ap\\mahadeva.mp3'
>>> os.path.expanduser("~")
'c:\\Documents and Settings\\mpilgrim\\My Documents'
>>> os.path.join(os.path.expanduser("~"), "Python")
'c:\\Documents and Settings\\mpilgrim\\My Documents\\Python'
Example 7.2. Matching whole words
>>> import re
>>> s = '100 BROAD'
>>> re.sub('ROAD$', 'RD.', s)
'100 BRD.'
>>> re.sub('\\bROAD$', 'RD.', s)
'100 BROAD'
>>> re.sub(r'\bROAD$', 'RD.', s)
'100 BROAD'
>>> s = '100 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD$', 'RD.', s)
'100 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD\b', 'RD.', s)
'100 BROAD RD. APT. 3'
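What I really wanted was to match 'ROAD' only when it appears at the end of the string and is a whole word on its own, not part of some longer word. To express this in a regular expression, you use \b, which means "a word boundary must occur right here". In Python, this is complicated by the fact that the '\' character in a string must itself be escaped. This is sometimes called the backslash plague, and it is one reason why regular expressions are somewhat easier in Perl than in Python. On the other hand, Perl mixes regular expressions with other syntax, so if you have a bug, it can be hard to tell whether it is a syntax error or an error in your regular expression.
To work around the backslash plague, you can use what is called a raw string, by prefixing the string with the letter r. This tells Python that nothing in the string should be escaped: '\t' is a tab character, but r'\t' is really the backslash character '\' followed by the letter 't'. I recommend always using raw strings when dealing with regular expressions; otherwise, things get confusing very quickly (and regular expressions get confusing quickly enough all by themselves).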
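The difference between escaped and raw strings can be checked directly. The following is a small sketch in Python 3 syntax; the variable name street is only illustrative.

```python
import re

# '\t' is one character (a tab); r'\t' is two: a backslash and the letter t.
assert len('\t') == 1
assert len(r'\t') == 2

# The doubled-backslash spelling and the raw spelling are the same string.
assert '\\bROAD$' == r'\bROAD$'

# Without escaping or the r prefix, '\b' is a backspace character,
# so the pattern would no longer contain a word-boundary assertion.
assert '\bROAD$' != r'\bROAD$'

# Only the standalone ROAD at the end of the string is replaced.
street = re.sub(r'\bROAD$', 'RD.', '100 BROAD ROAD')
print(street)
```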
Example 7.4. Checking for hundreds
>>> import re
>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'
>>> re.search(pattern, 'MCM')
<SRE_Match object at 01070390>
>>> re.search(pattern, 'MD')
<SRE_Match object at 01073A50>
>>> re.search(pattern, 'MMMCCC')
<SRE_Match object at 010748A8>
>>> re.search(pattern, 'MCMC')
>>> re.search(pattern, '')
<SRE_Match object at 01071D98>
Example 7.5. The old way: every character optional
>>> import re
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'M')
<_sre.SRE_Match object at 0x008EE090>
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'MM')
<_sre.SRE_Match object at 0x008EEB48>
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'MMM')
<_sre.SRE_Match object at 0x008EE090>
>>> re.search(pattern, 'MMMM')
>>>
Example 7.6. The new way: from n to m
>>> pattern = '^M{0,3}$'
>>> re.search(pattern, 'M')
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MM')
<_sre.SRE_Match object at 0x008EE090>
>>> re.search(pattern, 'MMM')
<_sre.SRE_Match object at 0x008EEDA8>
>>> re.search(pattern, 'MMMM')
>>>
The expression for the ones place follows the same pattern. I will skip the details and show the end result.
>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
So what does this regular expression look like in the alternate {n,m} syntax? This example shows the new syntax.
Example 7.8. Validating Roman numerals with the {n,m} syntax
>>> pattern = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
>>> re.search(pattern, 'MDLV')
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMDCLXVI')
<_sre.SRE_Match object at 0x008EEB48>
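As a sanity check, the two spellings of the Roman numeral pattern can be compared on a handful of inputs. This is a sketch in Python 3 syntax; the names old_style, new_style, samples and results are illustrative, and agreement on a few samples does not by itself prove that the patterns are equivalent.

```python
import re

old_style = re.compile(
    r'^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$')
new_style = re.compile(
    r'^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$')

samples = ['MDLV', 'MMDCLXVI', 'MMMDCCCLXXXVIII', 'I',
           'MMMM', 'IIII', 'MCMC', '']

results = {}
for s in samples:
    old_ok = old_style.search(s) is not None
    new_ok = new_style.search(s) is not None
    # Both spellings should accept or reject each sample together.
    assert old_ok == new_ok
    results[s] = new_ok
    print(repr(s), new_ok)
```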
Example 7.9. Regular expressions with inline comments
>>> pattern = """
    ^                   # beginning of string
    M{0,3}              # thousands - 0 to 3 M's
    (CM|CD|D?C{0,3})    # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
                        #            or 500-800 (D, followed by 0 to 3 C's)
    (XC|XL|L?X{0,3})    # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
                        #        or 50-80 (L, followed by 0 to 3 X's)
    (IX|IV|V?I{0,3})    # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
                        #        or 5-8 (V, followed by 0 to 3 I's)
    $                   # end of string
    """
>>> re.search(pattern, 'M', re.VERBOSE)
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCMLXXXIX', re.VERBOSE)
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMMDCCCLXXXVIII', re.VERBOSE)
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'M')
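The most important thing to remember when using verbose regular expressions is that you must pass an extra argument: re.VERBOSE, a constant defined in the re module that signals that the pattern should be treated as a verbose regular expression. As you can see, this pattern contains a lot of whitespace (all of which is ignored) and several comments (all of which are ignored too). Once you ignore the whitespace and the comments, it is exactly the same regular expression as in the previous section, but far more readable.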
>>> re.search(pattern, 'M')
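This does not match. Why? Because the re.VERBOSE flag is missing, the re.search function treats the pattern as a compact regular expression, in which whitespace and the # comments are significant. Python cannot auto-detect whether a regular expression is verbose or compact. It assumes every regular expression is compact unless you explicitly state that it is verbose.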
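The effect of the flag can be seen on a shortened pattern. This is a sketch in Python 3 syntax that keeps only the thousands place; the names verbose_pattern, with_flag and without_flag are illustrative.

```python
import re

verbose_pattern = r"""
    ^           # beginning of string
    M{0,3}      # thousands - 0 to 3 M's
    $           # end of string
"""

# With re.VERBOSE, whitespace and comments in the pattern are ignored.
with_flag = re.search(verbose_pattern, 'MM', re.VERBOSE)

# Without the flag, the newlines, spaces and '#' text are taken literally,
# so the pattern cannot match a plain 'MM'.
without_flag = re.search(verbose_pattern, 'MM')

print(with_flag is not None)
print(without_flag is not None)
```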
Example 7.16. Parsing phone numbers (final version)
>>> phonePattern = re.compile(r'''
                # don't match beginning of string, number can start anywhere
    (\d{3})     # area code is 3 digits (e.g. '800')
    \D*         # optional separator is any number of non-digits
    (\d{3})     # trunk is 3 digits (e.g. '555')
    \D*         # optional separator
    (\d{4})     # rest of number is 4 digits (e.g. '1212')
    \D*         # optional separator
    (\d*)       # extension is optional and can be any number of digits
    $           # end of string
    ''', re.VERBOSE)
>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212', '')
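Calling groups() directly on the result of search() raises AttributeError when nothing matches, because search() then returns None. One possible way to guard against that, sketched in Python 3 syntax with a hypothetical helper named parse_phone (not part of the original example):

```python
import re

phone_pattern = re.compile(r'''
    (\d{3})     # area code is 3 digits
    \D*         # optional separator
    (\d{3})     # trunk is 3 digits
    \D*         # optional separator
    (\d{4})     # rest of number is 4 digits
    \D*         # optional separator
    (\d*)       # optional extension
    $           # end of string
''', re.VERBOSE)


def parse_phone(text):
    """Return (area, trunk, rest, extension), or None if text does not match."""
    match = phone_pattern.search(text)
    # search() returns None on failure, so check before calling groups().
    return match.groups() if match else None


print(parse_phone('work 1-(800) 555.1212 #1234'))
print(parse_phone('no digits here'))
```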
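You should now be familiar with the following techniques:
^ matches the beginning of a string.
$ matches the end of a string.
\b matches a word boundary.
\d matches any numeric digit.
\D matches any non-numeric character.
x? matches an optional x character (in other words, it matches an x zero or one times).
x* matches x zero or more times.
x+ matches x one or more times.
x{n,m} matches an x character at least n times, but not more than m times.
(a|b|c) matches either a or b or c.
(x) in general is a remembered group. You can get the value of what it matched by calling the groups() method of the object returned by re.search.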
http://www.woodpecker.org.cn/diveintopython/regular_expressions/phone_numbers.html
Regular expression pattern syntax
.             Matches any character except \n (if DOTALL, also matches \n)
^             Matches start of string (if MULTILINE, also matches after \n)
$             Matches end of string (if MULTILINE, also matches before \n)
*             Matches zero or more cases of the previous regular expression; greedy (match as many as possible)
+             Matches one or more cases of the previous regular expression; greedy (match as many as possible)
?             Matches zero or one case of the previous regular expression; greedy (match one if possible)
*?, +?, ??    Non-greedy versions of *, +, and ? (match as few as possible)
{m,n}         Matches m to n cases of the previous regular expression (greedy)
{m,n}?        Matches m to n cases of the previous regular expression (non-greedy)
[...]         Matches any one of a set of characters contained within the brackets
|             Matches expression either preceding it or following it
(...)         Matches the regular expression within the parentheses and also indicates a group
(?iLmsux)     Alternate way to set optional flags; no effect on match
(?:...)       Like (...), but does not indicate a group
(?P<id>...)   Like (...), but the group also gets the name id
(?P=id)       Matches whatever was previously matched by group named id
(?#...)       Content of parentheses is just a comment; no effect on match
(?=...)       Lookahead assertion; matches if regular expression ... matches what comes next, but does not consume any part of the string
(?!...)       Negative lookahead assertion; matches if regular expression ... does not match what comes next, and does not consume any part of the string
(?<=...)      Lookbehind assertion; matches if there is a match for regular expression ... ending at the current position (... must match a fixed length)
(?<!...)      Negative lookbehind assertion; matches if there is no match for regular expression ... ending at the current position (... must match a fixed length)
\number       Matches whatever was previously matched by group numbered number (groups are automatically numbered from 1 up to 99)
\A            Matches an empty string, but only at the start of the whole string
\b            Matches an empty string, but only at the start or end of a word (a maximal sequence of alphanumeric characters; see also \w)
\B            Matches an empty string, but not at the start or end of a word
\d            Matches one digit, like the set [0-9]
\D            Matches one non-digit, like the set [^0-9]
\s            Matches a whitespace character, like the set [ \t\n\r\f\v]
\S            Matches a non-white character, like the set [^ \t\n\r\f\v]
\w            Matches one alphanumeric character; unless LOCALE or UNICODE is set, \w is like [a-zA-Z0-9_]
\W            Matches one non-alphanumeric character, the reverse of \w
\Z            Matches an empty string, but only at the end of the whole string
\\            Matches one backslash character
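A few of the less familiar entries in the table can be tried out as follows. This is only a sketch in Python 3 syntax; the sample strings and variable names are made up for illustration.

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Greedy '.*' runs to the last '>'; non-greedy '.*?' stops at the first.
greedy = re.search(r'<.*>', text).group()
lazy = re.search(r'<.*?>', text).group()

# (?P<id>...) names a group; (?P=id) matches the same text again.
doubled = re.search(r'(?P<word>\b\w+) (?P=word)', 'it was the the best')

# (?=...) checks what follows without consuming it.
prices = re.findall(r'\d+(?= USD)', '10 USD, 20 EUR, 30 USD')

# (?<!...) rejects a match that is preceded by the given text.
not_negative = re.findall(r'(?<!-)\b\d+', '5 -6 7')

print(greedy)
print(lazy)
print(doubled.group('word'))
print(prices)
print(not_negative)
```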
Posted on 2009-08-22 23:48 by Frank_Fang. Category: Python Learning