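1. string --- Text constants and templates
Purpose: contains constants and classes for working with text.
Python version: 1.4 and later
1.1 Functions
capwords(): capitalizes the first letter of every word in a string.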
    >>> import string
    >>> s = 'The quick brown fox jumped over the lazy dog'
    >>> string.capwords(s)
    'The Quick Brown Fox Jumped Over The Lazy Dog'

1. The same result using a list
    >>> s
    'The quick brown fox jumped over the lazy dog'
    >>> " ".join(map(lambda x: x[0].upper() + x[1:], s.split(" ")))
    'The Quick Brown Fox Jumped Over The Lazy Dog'
However, if the words are separated by more than one whitespace character, the list-based code fails: split(" ") then yields empty strings, and x[0] raises an IndexError on them. A revised version that walks the string by index:
    >>> ss
    'The quick brown fox jumped over the lazy dog'
    >>> for index in range(len(ss)):
    ...     # Capitalize a non-space character at the start of the string
    ...     # or immediately after a space.
    ...     if ss[index] != " " and (index == 0 or ss[index - 1] == " "):
    ...         ss = ss[:index] + ss[index].upper() + ss[index + 1:]
    ...
    >>> ss
    'The Quick Brown Fox Jumped Over The Lazy Dog'
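For comparison, here is a minimal Python 3 sketch (these notes otherwise use Python 2 syntax) that capitalizes each word while leaving runs of whitespace untouched. The helper name capitalize_words is made up for this illustration. Note that string.capwords() itself collapses repeated whitespace into single spaces when no separator is passed.

```python
import re
import string


def capitalize_words(text):
    """Upper-case the first character of every whitespace-delimited word.

    Hypothetical helper for illustration: unlike string.capwords(), it
    keeps the original spacing and does not lower-case the other letters.
    """
    return re.sub(r'\S+',
                  lambda m: m.group(0)[0].upper() + m.group(0)[1:],
                  text)


spaced = 'the quick   brown fox'
kept = capitalize_words(spaced)        # spacing preserved
collapsed = string.capwords(spaced)    # runs of spaces collapsed
print(repr(kept))
print(repr(collapsed))
```

The regular expression approach avoids index arithmetic entirely, which is where the hand-written loop is easiest to get wrong.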
maketrans(): builds a translation table that, together with the translate() method, maps one set of characters to another. This is better than calling replace() repeatedly.
    >>> import string
    >>> leet = string.maketrans('abegiloprstz', '463611092572')
    >>> s
    'The quick brown fox jumped over the lazy dog'
    >>> s.translate(leet)
    'Th3 qu1ck 620wn f0x jum93d 0v32 7h3 142y d06'

1. The same result by calling replace() repeatedly
    >>> s
    'The quick brown fox jumped over the lazy dog'
    >>> subStr = s
    >>> length = len('abegiloprstz')
    >>> for i in range(0, length):
    ...     subStr = subStr.replace('abegiloprstz'[i], '463611092572'[i])
    ...
    >>> subStr
    'Th3 qu1ck 620wn f0x jum93d 0v32 7h3 142y d06'
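In Python 3 the module-level string.maketrans() no longer exists; the table is built with str.maketrans() instead. The sketch below repeats the example in Python 3 syntax and also shows why the replace() loop happens to agree with translate() here.

```python
# Python 3: str.maketrans() builds the table that str.translate() consumes.
leet = str.maketrans('abegiloprstz', '463611092572')
s = 'The quick brown fox jumped over the lazy dog'
translated = s.translate(leet)

# The replace() loop gives the same answer only because none of the
# replacement digits is itself a key; otherwise one replacement could
# be rewritten again by a later one.
replaced = s
for old, new in zip('abegiloprstz', '463611092572'):
    replaced = replaced.replace(old, new)

print(translated)
print(replaced == translated)
```

translate() makes a single pass over the string, while the loop makes one pass per character pair.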
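1.2 Templates
When building strings with string.Template, a variable is marked by prefixing its name with $ (for example $var). If the name has to be set apart from the surrounding text, it can also be wrapped in curly braces (for example ${var}).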
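A simple example:
    import string

    values = {'var': 'foo'}

    # In a Template, $ is the delimiter; write $$ to get a literal $.
    t = string.Template("""Variable : $var
    Escape : $$
    Variable in text: ${var}iable""")

    print 'TEMPLATE:', t.substitute(values)

    # String formatting matches data by keyword; write %% to get a literal %.
    s = """Variable : %(var)s
    Escape : %%
    Variable in text: %(var)siable"""

    print 'INTERPOLATION:', s % values
The interpreter prints:
    TEMPLATE: Variable : foo
    Escape : $
    Variable in text: fooiable
    INTERPOLATION: Variable : foo
    Escape : %
    Variable in text: fooiable
One key difference between templates and standard string interpolation is that templates ignore the type of the arguments. Values are converted to strings, and the strings are inserted into the result. No formatting options are available. The safe_substitute() method avoids the exception that is otherwise raised when not all of the values the template needs are provided: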
    import string

    values = {'var': 'foo'}

    t = string.Template("$var is here but $missing is not provided")

    try:
        print 'substitute() :', t.substitute(values)
    except KeyError, err:
        print 'ERROR:', str(err)

    # If no value is provided, the placeholder is left unchanged
    print 'safe_substitute():', t.safe_substitute(values)
The interpreter prints:
    substitute() : ERROR: 'missing'
    safe_substitute(): foo is here but $missing is not provided
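The same behaviour can be checked in Python 3, where the exception clause is spelled "except KeyError as err". This is a small sketch rather than part of the original notes; the amount variable is invented to show that values are simply passed through str().

```python
import string

t = string.Template('$var is here but $missing is not provided')
values = {'var': 'foo'}

try:
    result = t.substitute(values)
except KeyError as err:          # Python 3 form of "except KeyError, err"
    result = 'ERROR: %s' % err

safe = t.safe_substitute(values)

# Values are converted with str(); there is no format specifier to apply,
# so a float is inserted with its full default representation.
total = string.Template('Total: $amount').substitute(amount=3.14159)

print(result)
print(safe)
print(total)
```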
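1.3 Advanced templates
The default syntax of string.Template can be changed by adjusting the regular expression patterns it uses to find variable names in the template body. A simple way to do that is to change the delimiter and idpattern class attributes.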
    import string

    template_text = """Delimiter : %%
    Replaced : %with_underscore
    Ignored : %notunderscored"""

    d = {'with_underscore': 'replaced',
         'notunderscored': 'not replaced',
         }

    # Change the delimiter to %.
    # Variable names must match '[a-z]+_[a-z]+', i.e. contain an underscore in the middle.
    class MyTemplate(string.Template):
        delimiter = '%'
        idpattern = '[a-z]+_[a-z]+'

    t = MyTemplate(template_text)
    print 'Modified ID pattern'
    print t.safe_substitute(d)
The interpreter prints:
    Modified ID pattern
    Delimiter : %
    Replaced : replaced
    Ignored : %notunderscored
For more complex changes, override the pattern attribute and define an entirely new regular expression. The pattern provided must contain four named groups for capturing the escaped delimiter, the named variable, a braced version of the variable name, and invalid delimiter patterns. string.Template expects these groups to be called escaped, named, braced and invalid.
    import re
    import string

    class MyTemplate(string.Template):
        delimiter = '{{'    # change the delimiter to '{{'
        pattern = r"""
        \{\{(?:
        (?P<escaped>\{\{)|
        (?P<named>[_a-z][_a-z0-9]*)\}\}|
        (?P<braced>[_a-z][_a-z0-9]*)\}\}|
        (?P<invalid>)
        )
        """

    t = MyTemplate("""{{{{
    {{var}}
    {{foo}}""")

    print 'MATCHES:', t.pattern.findall(t.template)
    print 'SUBSTITUTED:', t.safe_substitute(var='123replacement', foo='replacement')
The interpreter prints:
    MATCHES: [('{{', '', '', ''), ('', 'var', '', ''), ('', 'foo', '', '')]
    SUBSTITUTED: {{
    123replacement
    replacement
Note: I do not yet fully understand how the four groups of pattern are used.
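The role of the four groups can be explored directly. The Python 3 sketch below is not from the original notes. It lists the group names found in the default string.Template.pattern, then labels each match of a custom pattern with the group that fired. The class name BraceTemplate is invented for the illustration.

```python
import string

# The default pattern already uses the four group names that any
# replacement pattern has to keep.
default_groups = sorted(string.Template.pattern.groupindex)


class BraceTemplate(string.Template):   # hypothetical name for this sketch
    delimiter = '{{'
    # Template compiles this with re.VERBOSE, so whitespace and
    # "#" comments inside the pattern are ignored.
    pattern = r'''
    \{\{(?:
      (?P<escaped>\{\{)                  | # delimiter twice: literal delimiter
      (?P<named>[_a-z][_a-z0-9]*)\}\}    | # plain variable reference
      (?P<braced>[_a-z][_a-z0-9]*)\}\}   | # braced form, same syntax here
      (?P<invalid>)                        # anything else after the delimiter
    )
    '''


t = BraceTemplate('{{{{ {{var}} {{foo}}')
# lastgroup names the capturing group that matched for each occurrence.
roles = [m.lastgroup for m in t.pattern.finditer(t.template)]
out = t.safe_substitute(var='123replacement', foo='replacement')

print(default_groups)
print(roles)
print(out)
```

During substitution, an escaped match is replaced by the delimiter itself, named and braced matches are looked up in the mapping, and an invalid match makes substitute() raise ValueError (safe_substitute() leaves it in place).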
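2. textwrap --- Formatting text paragraphs
Purpose: formats text by adjusting where line breaks occur in a paragraph.
Python version: 2.5 and later
The textwrap module can be used to format text for output when pretty-printing is needed. It offers programmatic functionality similar to the paragraph wrapping or filling features found in many text editors.
2.1 Example data
    sample_text = """
        The textwrap module can be used to format text for output in
        situations where pretty-printing is desired. It offers
        programmatic functionality similar to the paragraph wrapping
        or filling features found in many text editors
        """
Save this in the module textwrap_example.py so that the later examples can import it.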
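2.2 Filling paragraphs
fill() takes the text and a width and produces wrapped output:
    >>> import textwrap
    >>> from textwrap_example import sample_text
    >>> print textwrap.fill(sample_text, width = 50)
         The textwrap module can be used to format
    text for output in     situations where pretty-
    printing is desired. It offers     programmatic
    functionality similar to the paragraph wrapping
    or filling features found in many text editors
The result shows that only the first line keeps its indentation. The other lines start at the left margin, and the indentation of the remaining source lines ends up as extra spaces in the middle of the paragraph.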
2.3 Removing existing indentation
dedent() removes the whitespace prefix that is common to all lines:
    >>> print textwrap.dedent(sample_text)

    The textwrap module can be used to format text for output in
    situations where pretty-printing is desired. It offers
    programmatic functionality similar to the paragraph wrapping
    or filling features found in many text editors

2.4 Combining dedent and fill
dedent() strips the existing indentation, and the result can then be passed through fill() with different widths:
    >>> dedented_text = textwrap.dedent(sample_text).strip()
    >>> for width in [45, 70]:
    ...     print '%d Columns:\n' % width
    ...     print textwrap.fill(dedented_text, width=width)
    ...     print
    ...
    45 Columns:

    The textwrap module can be used to format
    text for output in situations where pretty-
    printing is desired. It offers programmatic
    functionality similar to the paragraph
    wrapping or filling features found in many
    text editors

    70 Columns:

    The textwrap module can be used to format text for output in
    situations where pretty-printing is desired. It offers programmatic
    functionality similar to the paragraph wrapping or filling features
    found in many text editors

2.5 Hanging indents
The indentation of the first line can be controlled independently of the following lines. Here the first line is indented less than the rest, which produces a hanging indent:
    >>> dedented_text = textwrap.dedent(sample_text).strip()
    >>> print textwrap.fill(dedented_text,
    ...                     initial_indent='',
    ...                     subsequent_indent=' ' * 4,
    ...                     width = 50,
    ...                     )
    The textwrap module can be used to format text for
        output in situations where pretty-printing is
        desired. It offers programmatic functionality
        similar to the paragraph wrapping or filling
        features found in many text editors
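The dedent-then-fill steps above can be sketched in Python 3 as one self-contained script. The sample text is assumed to be indented by four spaces, as in the earlier example; nothing else is added beyond the calls already discussed.

```python
import textwrap

# Assumed sample: every content line is indented by four spaces.
sample_text = """
    The textwrap module can be used to format text for output in
    situations where pretty-printing is desired. It offers
    programmatic functionality similar to the paragraph wrapping
    or filling features found in many text editors
    """

# Remove the common indentation and the surrounding blank lines first.
dedented = textwrap.dedent(sample_text).strip()

# Wrap to 50 columns; continuation lines get a four-space hanging indent.
hanging = textwrap.fill(dedented, width=50,
                        initial_indent='',
                        subsequent_indent=' ' * 4)
lines = hanging.splitlines()
print(hanging)
```

The width passed to fill() includes the indent, so the continuation lines have only 46 columns left for text.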
3. re --- Regular expressions
3.1 Finding patterns in text
search() takes the pattern and the text to scan, and returns a Match object when the pattern is found. Otherwise it returns None.
Each Match object holds information about the nature of the match, including the original input string, the regular expression used, and the location within the original string where the pattern occurs:
    >>> import re
    >>> pattern = 'this'
    >>> text = 'Does this text match the pattern?'
    >>> match = re.search(pattern, text)
    >>> dir(match)
    ['__class__', '__copy__', '__deepcopy__', '__delattr__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'end', 'endpos', 'expand', 'group', 'groupdict', 'groups', 'lastgroup', 'lastindex', 'pos', 're', 'regs', 'span', 'start', 'string']
    >>> match.string
    'Does this text match the pattern?'
    >>> match.start()
    5
    >>> match.re
    <_sre.SRE_Pattern object at 0x0000000002A9E258>
    >>> match.re()
    Traceback (most recent call last):
      ...
    TypeError: '_sre.SRE_Pattern' object is not callable
    >>> match.re.pattern
    'this'
Note: it is well worth using dir() and help() to find out what each object can do. For example, match.start is a method and has to be called, while match.re is an attribute holding the compiled pattern, so calling it fails.
3.2 Compiling expressions
If an expression is used frequently, it is more efficient to compile it. The compile() function converts an expression string into a RegexObject.
    import re

    # Precompile the patterns
    regexes = [re.compile(p) for p in ['this', 'that']]
    text = 'Does this text match the pattern'

    print 'Text: %r\n' % text

    for regex in regexes:
        print 'Seeking "%s" ->' % regex.pattern,
        if regex.search(text):
            print 'match'
        else:
            print 'no match'
The interpreter prints:
    Text: 'Does this text match the pattern'

    Seeking "this" -> match
    Seeking "that" -> no match
    >>> type(regexes)
    <type 'list'>
    >>> regexes
    [<_sre.SRE_Pattern object at 0x0000000002BAE0E8>, <_sre.SRE_Pattern object at 0x0000000002BAE258>]
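3.3 Multiple matches
findall() returns all of the substrings of the input that match the pattern without overlapping. finditer() returns an iterator that produces Match objects instead of strings.
    import re

    text = 'abbaaabbbbaaaaa'
    pattern = 'ab'

    for match in re.findall(pattern, text):
        print 'Found "%s"' % match

    # re.finditer(pattern, text) is evaluated only once; the for loop then
    # steps through the Match objects it yields, one per iteration.
    for match in re.finditer(pattern, text):
        s = match.start()
        e = match.end()
        print 'Found "%s" at %d:%d' % (text[s:e], s, e)
The interpreter prints:
    Found "ab"
    Found "ab"
    Found "ab" at 0:2
    Found "ab" at 5:7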
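A short Python 3 sketch of what "without overlapping" means in practice. The look-ahead trick in the last line is a common idiom for finding overlapping occurrences; it is added here for contrast and is not part of the original notes.

```python
import re

text = 'abbaaabbbbaaaaa'
found = re.findall('ab', text)                        # plain strings
spans = [m.span() for m in re.finditer('ab', text)]   # Match objects carry positions

# Matches never overlap: 'aaaa' contains 'aa' at offsets 0, 1 and 2,
# but only the occurrences at 0 and 2 are reported.
non_overlapping = re.findall('aa', 'aaaa')

# A zero-width look-ahead consumes nothing, so every start offset is tried.
overlapping = re.findall('(?=(aa))', 'aaaa')

print(found, spans, non_overlapping, overlapping)
```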
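3.4 Pattern syntax
Regular expressions support more powerful patterns than simple literal text strings. Patterns can repeat, can be anchored to different logical locations within the input, and can be expressed in compact forms that do not require every literal character to be present in the pattern. All of these features are used by combining literal text values with metacharacters that are part of the regular expression pattern syntax implemented by re.
    import re

    def test_patterns(text, patterns=[]):
        # Look for each pattern in the text and print the matches.
        for pattern, desc in patterns:
            print 'Pattern %r (%s)\n' % (pattern, desc)
            print ' %r' % text
            for match in re.finditer(pattern, text):
                s = match.start()
                e = match.end()
                substr = text[s:e]
                # repr() doubles every backslash, so widen the prefix to keep
                # the match aligned under the printed text.
                n_backslashes = text[:s].count('\\')
                prefix = '.' * (s + n_backslashes)
                print ' %s%r|' % (prefix, substr),
            print
        return

    if __name__ == "__main__":
        test_patterns('abbaaabbbbaaaaa',
                      [('ab', "'a' followed by 'b'"),
                       ])
Save this in the file re_test_patterns.py.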
Repetition
There are five ways to express repetition in a pattern. A pattern followed by the metacharacter * is repeated zero or more times. With +, it must appear at least once. With ?, it appears zero or one time. {m} repeats it exactly m times, and {m,n} repeats it at least m and at most n times. Leaving out n ({m,}) means at least m times with no upper limit.
    from re_test_patterns import test_patterns

    test_patterns(
        'abbaabbba',
        [('ab*', 'a followed by zero or more b'),
         ('ab+', 'a followed by one or more b'),
         ('ab?', 'a followed by zero or one b'),
         ('ab{3}', 'a followed by three b'),
         ('ab{2,3}', 'a followed by two to three b'),
         ])
The interpreter prints:
    Pattern 'ab*' (a followed by zero or more b)

     'abbaabbba'
     'abb'|  ...'a'|  ....'abbb'|  ........'a'|
    Pattern 'ab+' (a followed by one or more b)

     'abbaabbba'
     'abb'|  ....'abbb'|
    Pattern 'ab?' (a followed by zero or one b)

     'abbaabbba'
     'ab'|  ...'a'|  ....'ab'|  ........'a'|
    Pattern 'ab{3}' (a followed by three b)

     'abbaabbba'
     ....'abbb'|
    Pattern 'ab{2,3}' (a followed by two to three b)

     'abbaabbba'
     'abb'|  ....'abbb'|
Normally, when processing a repetition instruction, re consumes as much of the input as possible while matching the pattern. This so-called "greedy" behaviour may result in fewer individual matches, or in matches that include more of the input text than intended. Greediness can be turned off by following the repetition instruction with "?":
    from re_test_patterns import test_patterns

    test_patterns(
        'abbaabbba',
        [('ab*?', 'a followed by zero or more b'),
         ('ab+?', 'a followed by one or more b'),
         ('ab??', 'a followed by zero or one b'),
         ('ab{3}?', 'a followed by three b'),
         ('ab{2,3}?', 'a followed by two to three b'),
         ])
The interpreter prints:
    Pattern 'ab*?' (a followed by zero or more b)

     'abbaabbba'
     'a'|  ...'a'|  ....'a'|  ........'a'|
    Pattern 'ab+?' (a followed by one or more b)

     'abbaabbba'
     'ab'|  ....'ab'|
    Pattern 'ab??' (a followed by zero or one b)

     'abbaabbba'
     'a'|  ...'a'|  ....'a'|  ........'a'|
    Pattern 'ab{3}?' (a followed by three b)

     'abbaabbba'
     ....'abbb'|
    Pattern 'ab{2,3}?' (a followed by two to three b)

     'abbaabbba'
     'abb'|  ....'abb'|
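The difference between greedy and non-greedy repetition is easiest to see on input with several possible end points. The Python 3 sketch below uses a made-up markup string, which is not from the original notes.

```python
import re

markup = '<b>bold</b> and <i>italic</i>'

# Greedy: ".*" runs to the last ">" in the text, giving one long match.
greedy = re.findall(r'<.*>', markup)

# Non-greedy: ".*?" stops at the first ">" that lets the pattern succeed.
lazy = re.findall(r'<.*?>', markup)

print(greedy)
print(lazy)
```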
Character sets
A character set is a group of characters, any one of which can match at that point in the pattern. For example, [ab] matches either a or b:
    from re_test_patterns import test_patterns

    test_patterns(
        'abbaabbba',
        [('[ab]', 'either a or b'),
         ('a[ab]+', 'a followed by 1 or more a or b'),
         ('a[ab]+?', 'a followed by 1 or more a or b, not greedy'),
         ])
The interpreter prints (note the effect of greedy matching):
    Pattern '[ab]' (either a or b)

     'abbaabbba'
     'a'|  .'b'|  ..'b'|  ...'a'|  ....'a'|  .....'b'|  ......'b'|  .......'b'|  ........'a'|
    Pattern 'a[ab]+' (a followed by 1 or more a or b)

     'abbaabbba'
     'abbaabbba'|
    Pattern 'a[ab]+?' (a followed by 1 or more a or b, not greedy)

     'abbaabbba'
     'ab'|  ...'aa'|
A character set can also be used to exclude specific characters. The caret (^) means to look for characters that are not in the set that follows it.
    from re_test_patterns import test_patterns

    test_patterns(
        'This is some text -- with punctuation',
        # Find every run of characters that contains no "-", "." or space
        [('[^-. ]+', 'sequences without -, ., or space'),
         ])
The interpreter prints:
    Pattern '[^-. ]+' (sequences without -, ., or space)

     'This is some text -- with punctuation'
     'This'|  .....'is'|  ........'some'|  .............'text'|  .....................'with'|  ..........................'punctuation'|
A character range defines a character set that includes all of the contiguous characters between a start point and a stop point:
    from re_test_patterns import test_patterns

    test_patterns(
        'This is some text -- with punctuation',
        [('[a-z]+', 'sequences of lowercase letters'),
         ('[A-Z]+', 'sequences of uppercase letters'),
         ('[a-zA-Z]+', 'sequences of lowercase or uppercase letters'),
         ('[A-Z][a-z]+', 'one uppercase followed by lowercase'),
         ])
The interpreter prints:
    Pattern '[a-z]+' (sequences of lowercase letters)

     'This is some text -- with punctuation'
     .'his'|  .....'is'|  ........'some'|  .............'text'|  .....................'with'|  ..........................'punctuation'|
    Pattern '[A-Z]+' (sequences of uppercase letters)

     'This is some text -- with punctuation'
     'T'|
    Pattern '[a-zA-Z]+' (sequences of lowercase or uppercase letters)

     'This is some text -- with punctuation'
     'This'|  .....'is'|  ........'some'|  .............'text'|  .....................'with'|  ..........................'punctuation'|
    Pattern '[A-Z][a-z]+' (one uppercase followed by lowercase)

     'This is some text -- with punctuation'
     'This'|
As a special case of a character set, the metacharacter "." indicates that the pattern should match any single character in that position.
    from re_test_patterns import test_patterns

    test_patterns(
        'abbaabbba',
        [('a.', 'a followed by any one character'),
         ('b.', 'b followed by any one character'),
         ('a.*b', 'a followed by anything, ending in b'),
         ('a.*?b', 'a followed by anything, ending in b'),
         ])
The interpreter prints:
    Pattern 'a.' (a followed by any one character)

     'abbaabbba'
     'ab'|  ...'aa'|
    Pattern 'b.' (b followed by any one character)

     'abbaabbba'
     .'bb'|  .....'bb'|  .......'ba'|
    Pattern 'a.*b' (a followed by anything, ending in b)

     'abbaabbba'
     'abbaabbb'|
    Pattern 'a.*?b' (a followed by anything, ending in b)

     'abbaabbba'
     'ab'|  ...'aab'|
Escape codes
The escape codes recognized by re are:
Code | Meaning |
\d | A digit |
\D | A non-digit |
\s | Whitespace (tab, space, newline, etc.) |
\S | Non-whitespace |
\w | Alphanumeric (letters, digits and the underscore) |
\W | Non-alphanumeric |
    from re_test_patterns import test_patterns

    test_patterns(
        'A prime #1 example!',
        [(r'\d+', 'sequence of digits'),
         (r'\D+', 'sequence of nondigits'),
         (r'\s+', 'sequence of whitespace'),
         (r'\S+', 'sequence of nonwhitespace'),
         (r'\w+', 'alphanumeric characters'),
         (r'\W+', 'nonalphanumeric'),
         ])
The interpreter prints:
    Pattern '\\d+' (sequence of digits)

     'A prime #1 example!'
     .........'1'|
    Pattern '\\D+' (sequence of nondigits)

     'A prime #1 example!'
     'A prime #'|  ..........' example!'|
    Pattern '\\s+' (sequence of whitespace)

     'A prime #1 example!'
     .' '|  .......' '|  ..........' '|
    Pattern '\\S+' (sequence of nonwhitespace)

     'A prime #1 example!'
     'A'|  ..'prime'|  ........'#1'|  ...........'example!'|
    Pattern '\\w+' (alphanumeric characters)

     'A prime #1 example!'
     'A'|  ..'prime'|  .........'1'|  ...........'example'|
    Pattern '\\W+' (nonalphanumeric)

     'A prime #1 example!'
     .' '|  .......' #'|  ..........' '|  ..................'!'|
To match characters that are part of the regular expression syntax, escape them in the search pattern:
    from re_test_patterns import test_patterns

    test_patterns(
        r'\d+ \D+ \s+',
        [(r'\\.\+', 'escape code'),
         ])
The interpreter prints:
    Pattern '\\\\.\\+' (escape code)

     '\\d+ \\D+ \\s+'
     '\\d+'|  .....'\\D+'|  ..........'\\s+'|
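When the text to search for comes from somewhere else, escaping by hand is error-prone. The standard library function re.escape() does it automatically. A small Python 3 sketch with an invented search string:

```python
import re

literal = '1+1'
haystack = '1+1=2, 11'

# Unescaped, "+" is a repetition operator: "one or more 1, then a 1".
unescaped = re.findall(literal, haystack)

# re.escape() backslash-escapes the metacharacters so "+" is matched literally.
escaped = re.findall(re.escape(literal), haystack)

print(unescaped)
print(escaped)
```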
Anchoring
Anchoring instructions specify the relative location in the input text where the pattern should appear.
Code | Meaning |
^ | Start of string, or line |
$ | End of string, or line |
\A | Start of string |
\Z | End of string |
\b | Empty string at the beginning or end of a word |
\B | Empty string not at the beginning or end of a word |
    from re_test_patterns import test_patterns

    test_patterns(
        'This is some text -- with punctuation.',
        [(r'^\w+', 'word at start of string'),
         (r'\A\w+', 'word at start of string'),
         (r'\w+\S*$', 'word near end of string, skip punctuation'),
         (r'\w+\S*\Z', 'word near end of string, skip punctuation'),
         (r'\w*t\w*', 'word containing t'),
         (r'\bt\w+', 't at start of word'),
         (r'\w+t\b', 't at end of word'),
         (r'\Bt\B', 't not start or end of word'),
         ])
The interpreter prints:
    Pattern '^\\w+' (word at start of string)

     'This is some text -- with punctuation.'
     'This'|
    Pattern '\\A\\w+' (word at start of string)

     'This is some text -- with punctuation.'
     'This'|
    Pattern '\\w+\\S*$' (word near end of string, skip punctuation)

     'This is some text -- with punctuation.'
     ..........................'punctuation.'|
    Pattern '\\w+\\S*\\Z' (word near end of string, skip punctuation)

     'This is some text -- with punctuation.'
     ..........................'punctuation.'|
    Pattern '\\w*t\\w*' (word containing t)

     'This is some text -- with punctuation.'
     .............'text'|  .....................'with'|  ..........................'punctuation'|
    Pattern '\\bt\\w+' (t at start of word)

     'This is some text -- with punctuation.'
     .............'text'|
    Pattern '\\w+t\\b' (t at end of word)

     'This is some text -- with punctuation.'
     .............'text'|
    Pattern '\\Bt\\B' (t not start or end of word)

     'This is some text -- with punctuation.'
     .......................'t'|  ..............................'t'|  .................................'t'|
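3.5 Constraining the search
If it is known in advance that only a subset of the full input should be searched, the match can be constrained further by telling re to limit the search range. For example, if the pattern must appear at the front of the input, using match() instead of search() anchors the search without having to include an anchor in the pattern explicitly.
    >>> import re
    >>> text = 'This is some text -- with punctuation.'
    >>> pattern = 'is'
    >>> m = re.match(pattern, text)
    >>> print m
    None
    >>> s = re.search(pattern, text)
    >>> print s
    <_sre.SRE_Match object at 0x0000000002C265E0>
The search() method of a compiled regular expression also accepts optional start and end position parameters, which limit the search to a substring of the input:
    import re

    text = 'This is some text -- with punctuation.'
    pattern = re.compile(r'\b\w*is\w*\b')

    print 'Text:', text
    print

    pos = 0
    while True:
        match = pattern.search(text, pos)
        if not match:
            break
        s = match.start()
        e = match.end()
        print ' %2d : %2d = "%s"' % (s, e - 1, text[s:e])
        # Move forward in the text for the next search
        pos = e
The interpreter prints:
    Text: This is some text -- with punctuation.

      0 :  3 = "This"
      5 :  6 = "is"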
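The start and end parameters can also be tried one call at a time. The Python 3 sketch below uses the same text and pattern as above; the variable names are chosen for the illustration.

```python
import re

text = 'This is some text -- with punctuation.'
word_with_is = re.compile(r'\b\w*is\w*\b')

first = word_with_is.search(text)                 # scans from offset 0
second = word_with_is.search(text, first.end())   # resumes after "This"
third = word_with_is.search(text, second.end())   # nothing left to find

# The third argument is the end position: only text[:3] ('Thi') is
# considered, so no word containing "is" can be found.
clipped = word_with_is.search(text, 0, 3)

print(first.span(), second.span(), third, clipped)
```

The reported offsets always refer to the full string, which is the practical advantage over slicing the text before searching.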
3.6 Dissecting matches with groups
Searching for pattern matches is the basis of the powerful capabilities provided by regular expressions. Adding groups to a pattern isolates parts of the matching text. Groups are defined by enclosing patterns in parentheses, "(" and ")":
    from re_test_patterns import test_patterns

    test_patterns(
        'abbaaabbbbaaaaa',
        [('a(ab)', 'a followed by literal ab'),
         ('a(a*b*)', 'a followed by 0-n a and 0-n b'),
         ('a(ab)*', 'a followed by 0-n ab'),
         ('a(ab)+', 'a followed by 1-n ab'),
         ])
The interpreter prints:
    Pattern 'a(ab)' (a followed by literal ab)

     'abbaaabbbbaaaaa'
     ....'aab'|
    Pattern 'a(a*b*)' (a followed by 0-n a and 0-n b)

     'abbaaabbbbaaaaa'
     'abb'|  ...'aaabbbb'|  ..........'aaaaa'|
    Pattern 'a(ab)*' (a followed by 0-n ab)

     'abbaaabbbbaaaaa'
     'a'|  ...'a'|  ....'aab'|  ..........'a'|  ...........'a'|  ............'a'|  .............'a'|  ..............'a'|
    Pattern 'a(ab)+' (a followed by 1-n ab)

     'abbaaabbbbaaaaa'
     ....'aab'|
To access the substrings matched by the individual groups within a pattern, use the groups() method of the Match object:
    import re

    text = 'This is some text -- with punctuation.'

    print text
    print

    patterns = [
        (r'^(\w+)', 'word at start of string'),
        (r'(\w+)\S*$', 'word at end, with optional punctuation'),
        (r'(\bt\w+)\W+(\w+)', 'word starting with t, another word'),
        (r'(\w+t)\b', 'word ending with t'),
        ]

    for pattern, desc in patterns:
        regex = re.compile(pattern)
        match = regex.search(text)
        print 'Pattern %r (%s)\n' % (pattern, desc)
        print ' ', match.groups()
        print
The interpreter prints:
    This is some text -- with punctuation.

    Pattern '^(\\w+)' (word at start of string)

      ('This',)

    Pattern '(\\w+)\\S*$' (word at end, with optional punctuation)

      ('punctuation',)

    Pattern '(\\bt\\w+)\\W+(\\w+)' (word starting with t, another word)

      ('text', 'with')

    Pattern '(\\w+t)\\b' (word ending with t)

      ('text',)

Python extends the basic grouping syntax with named groups. Referring to groups by name makes it easier to modify the pattern later without also having to modify the code that uses the match results. To set the name of a group, use the syntax (?P<name>pattern):
    import re

    text = 'This is some text -- with punctuation.'

    print text
    print

    patterns = [
        r'^(?P<first_word>\w+)',
        r'(?P<last_word>\w+)\S*$',
        r'(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)',
        r'(?P<ends_with_t>\w+t)\b',
        ]

    for pattern in patterns:
        regex = re.compile(pattern)
        match = regex.search(text)
        print 'Matching "%s"' % pattern
        print ' ', match.groups()
        print ' ', match.groupdict()
        print
The interpreter prints:
    This is some text -- with punctuation.

    Matching "^(?P<first_word>\w+)"
      ('This',)
      {'first_word': 'This'}

    Matching "(?P<last_word>\w+)\S*$"
      ('punctuation',)
      {'last_word': 'punctuation'}

    Matching "(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)"
      ('text', 'with')
      {'other_word': 'with', 't_word': 'text'}

    Matching "(?P<ends_with_t>\w+t)\b"
      ('text',)
      {'ends_with_t': 'text'}

Note: groupdict() returns a dictionary that maps group names to the matched substrings. Named patterns are also included in the ordered sequence returned by groups(). test_patterns() can therefore be updated to show both the numbered and the named groups matched by a pattern:
    import re

    def test_patterns(text, patterns=[]):
        # Look for each pattern in the text and print the matches
        # together with their numbered and named groups.
        for pattern, desc in patterns:
            print 'Pattern %r (%s)\n' % (pattern, desc)
            print ' %r' % text
            for match in re.finditer(pattern, text):
                s = match.start()
                e = match.end()
                prefix = ' ' * (s)
                print ' %s%r%s ' % (prefix, text[s:e], ' ' * (len(text) - e)),
                print match.groups()
                if match.groupdict():
                    print '%s%s' % (' ' * (len(text) - s), match.groupdict())
            print
        return

    if __name__ == "__main__":
        test_patterns('abbaabbba',
                      [(r'a((a*)(b*))', "'a' followed by 0-n a and 0-n b"),
                       ])
The interpreter prints:
    Pattern 'a((a*)(b*))' ('a' followed by 0-n a and 0-n b)

     'abbaabbba'
     'abb'        ('bb', '', 'bb')
        'aabbb'   ('abbb', 'a', 'bbb')
             'a'  ('', '', '')

Groups are also useful for specifying alternative patterns. Use the pipe symbol (|) to indicate that one pattern or another should match:
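    from re_test_patterns import test_patterns

    test_patterns(
        'abbaabbba',
        [(r'a((a+)|(b+))', 'a then seq. of a or seq. of b'),
         (r'a((a|b)+)', 'a then seq. of [ab]'),
         ])
The interpreter prints:
    Pattern 'a((a+)|(b+))' (a then seq. of a or seq. of b)

     'abbaabbba'
     'abb'        ('bb', None, 'bb')
        'aa'      ('a', 'a', None)

    Pattern 'a((a|b)+)' (a then seq. of [ab])

     'abbaabbba'
     'abbaabbba'  ('bbaabbba', 'a')

Defining a group that contains a subpattern is also useful when the string matching the subpattern is not part of what should be extracted from the full text. Such groups are called "non-capturing". Non-capturing groups can be used to describe repetition patterns or alternatives without isolating the matching portion of the string in the value returned. To create a non-capturing group, use the syntax (?:pattern):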
    from re_test_patterns import test_patterns

    test_patterns(
        'abbaabbba',
        [(r'a((a+)|(b+))', 'capturing form'),
         (r'a((?:a+)|(?:b+))', 'noncapturing'),
         ])
The interpreter prints:
    Pattern 'a((a+)|(b+))' (capturing form)

     'abbaabbba'
     'abb'        ('bb', None, 'bb')
        'aa'      ('a', 'a', None)

    Pattern 'a((?:a+)|(?:b+))' (noncapturing)

     'abbaabbba'
     'abb'        ('bb',)
        'aa'      ('a',)

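Named and non-capturing groups are often combined. The Python 3 sketch below parses a made-up timestamp string: the date parts are captured by name, while the optional time suffix is grouped only so that "?" can apply to it as a whole.

```python
import re

# The sample input and the group names are invented for this illustration.
date = re.compile(
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
    r'(?:T\d{2}:\d{2})?'      # grouped for "?", but not captured
)

m = date.search('released 2012-03-04T10:30 in beta')

whole = m.group(0)        # the full matched text
parts = m.groups()        # only the three capturing groups
named = m.groupdict()     # the same values keyed by group name

print(whole, parts, named)
```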
3.7 Search options
Option flags change the way the matching engine processes an expression. The flags can be combined with a bitwise OR operation and then passed to compile(), search(), match(), and the other functions that accept a pattern for searching.
Case-insensitive matching
IGNORECASE causes literal characters and character ranges in the pattern to match both uppercase and lowercase characters.
    import re

    text = 'This is some text -- with punctuation.'
    pattern = r'\bT\w+'
    with_case = re.compile(pattern)
    without_case = re.compile(pattern, re.IGNORECASE)

    print 'Text:\n %r' % text
    print 'Pattern:\n %s' % pattern
    print 'Case-sensitive:'
    for match in with_case.findall(text):
        print ' %r' % match
    print 'Case-insensitive:'
    for match in without_case.findall(text):
        print ' %r' % match
The interpreter prints:
    Text:
     'This is some text -- with punctuation.'
    Pattern:
     \bT\w+
    Case-sensitive:
     'This'
    Case-insensitive:
     'This'
     'text'
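Input with multiple lines
Two flags affect how searching in multi-line input works: MULTILINE and DOTALL. The MULTILINE flag controls how the pattern-matching code processes anchoring instructions for text containing newline characters. When multiline mode is turned on, the anchor rules for ^ and $ apply at the beginning and end of each line, in addition to the entire string:
    import re

    text = 'This is some text -- with punctuation.\nA second line.'
    pattern = r'(^\w+)|(\w+\S*$)'
    single_line = re.compile(pattern)
    multiline = re.compile(pattern, re.MULTILINE)

    print 'Text:\n %r' % text
    print 'Pattern:\n %s' % pattern
    print 'Single Line:'
    for match in single_line.findall(text):
        print ' %r' % (match,)
    print 'Multiline :'
    for match in multiline.findall(text):
        print ' %r' % (match,)
The interpreter prints:
    Text:
     'This is some text -- with punctuation.\nA second line.'
    Pattern:
     (^\w+)|(\w+\S*$)
    Single Line:
     ('This', '')
     ('', 'line.')
    Multiline :
     ('This', '')
     ('', 'punctuation.')
     ('A', '')
     ('', 'line.')
DOTALL is the other flag related to multi-line text. Normally, the dot character (.) matches everything in the input text except a newline character. The flag allows the dot to match newlines as well.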
    import re

    text = 'This is some text -- with punctuation.\nA second line.'
    pattern = r'.+'
    no_newlines = re.compile(pattern)
    dotall = re.compile(pattern, re.DOTALL)

    print 'Text:\n %r' % text
    print 'Pattern:\n %s' % pattern
    print 'No newlines:'
    for match in no_newlines.findall(text):
        print ' %r' % (match,)
    print 'Dotall :'
    for match in dotall.findall(text):
        print ' %r' % (match,)
The interpreter prints:
    Text:
     'This is some text -- with punctuation.\nA second line.'
    Pattern:
     .+
    No newlines:
     'This is some text -- with punctuation.'
     'A second line.'
    Dotall :
     'This is some text -- with punctuation.\nA second line.'
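The two flags are independent and easy to confuse, so a compact side-by-side check may help. This Python 3 sketch uses a short two-line string invented for the purpose.

```python
import re

text = 'first line\nsecond line'

# MULTILINE changes where ^ and $ may match.
starts_default = re.findall(r'^\w+', text)
starts_multiline = re.findall(r'^\w+', text, re.MULTILINE)

# DOTALL changes what "." may match.
runs_default = re.findall(r'.+', text)
runs_dotall = re.findall(r'.+', text, re.DOTALL)

print(starts_default, starts_multiline)
print(runs_default, runs_dotall)
```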
Verbose expression syntax
Verbose mode allows comments and extra whitespace to be embedded in the pattern.
    import re

    address = re.compile(
        '''
        [\w\d.+-]+       # username
        @
        ([\w\d.]+\.)+    # domain name prefix
        (com|org|edu)
        ''',
        re.UNICODE | re.VERBOSE)

    candidates = [
        u'first.last@example.com',
        u'first.last+category@gmail.com',
        u'valid-address@mail.example.com',
        u'not-valid@example.foo',
        ]

    for candidate in candidates:
        match = address.search(candidate)
        print '%-30s %s' % (candidate, 'Matches' if match else 'No match')
The interpreter prints:
    first.last@example.com         Matches
    first.last+category@gmail.com  Matches
    valid-address@mail.example.com Matches
    not-valid@example.foo          No match
This version can be extended to parse input that contains a person's name as well as an Email address.
    import re

    address = re.compile(
        '''
        # An optional name followed by an opening angle bracket
        ((?P<name>
           ([\w.,]+\s+)*[\w.,]+)
           \s*
           <
        )?

        (?P<email>
          [\w\d.+-]+       # username
          @
          ([\w\d.]+\.)+    # domain name prefix
          (com|org|edu)
        )

        >?
        ''',
        re.UNICODE | re.VERBOSE)

    candidates = [
        u'first.last@example.com',
        u'first.last+category@gmail.com',
        u'valid-address@mail.example.com',
        u'not-valid@example.foo',
        u'First Last <first.last@example.com>',
        u'No Brackets first.last@example.com',
        u'First Last',
        u'First Middle Last <first.last@example.com>',
        u'First M. Last <first.last@example.com>',
        u'<first.last@example.com>',
        ]

    for candidate in candidates:
        print 'Candidate:', candidate
        match = address.search(candidate)
        if match:
            print ' Name :', match.groupdict()['name']
            print ' Email:', match.groupdict()['email']
        else:
            print ' No match'
The interpreter prints:
    Candidate: first.last@example.com
     Name : None
     Email: first.last@example.com
    Candidate: first.last+category@gmail.com
     Name : None
     Email: first.last+category@gmail.com
    Candidate: valid-address@mail.example.com
     Name : None
     Email: valid-address@mail.example.com
    Candidate: not-valid@example.foo
     No match
    Candidate: First Last <first.last@example.com>
     Name : First Last
     Email: first.last@example.com
    Candidate: No Brackets first.last@example.com
     Name : None
     Email: first.last@example.com
    Candidate: First Last
     No match
    Candidate: First Middle Last <first.last@example.com>
     Name : First Middle Last
     Email: first.last@example.com
    Candidate: First M. Last <first.last@example.com>
     Name : First M. Last
     Email: first.last@example.com
    Candidate: <first.last@example.com>
     Name : None
     Email: first.last@example.com
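Embedding flags in patterns
If flags cannot be added when the expression is compiled, for example when the pattern is passed as an argument to a library function that will compile it later, the flags can be embedded inside the expression string itself. For example, to turn on case-insensitive matching, add (?i) to the beginning of the expression.
    import re

    text = 'This is some text -- with punctuation.'
    pattern = r'(?i)\bT\w+'
    regex = re.compile(pattern)

    print 'Text :', text
    print 'Pattern :', pattern
    print 'Matches :', regex.findall(text)
The interpreter prints:
    Text : This is some text -- with punctuation.
    Pattern : (?i)\bT\w+
    Matches : ['This', 'text']
The abbreviations for all of the flags are:
Flag | Abbreviation |
IGNORECASE | i |
MULTILINE | m |
DOTALL | s |
UNICODE | u |
VERBOSE | x |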
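Several abbreviations can be combined inside one group, such as (?im). The Python 3 sketch below checks that the embedded form gives the same result as passing the flags to the function. The two-line sample string is invented for the illustration.

```python
import re

text = 'This is some text -- with punctuation.'

# Embedded flag versus the flags argument: same matches either way.
embedded = re.findall(r'(?i)\bT\w+', text)
by_argument = re.findall(r'\bT\w+', text, re.IGNORECASE)

# Two flags in one group: ignore case and let ^ match at every line start.
combined = re.findall(r'(?im)^t\w+', 'This is\ntext')

print(embedded, by_argument, combined)
```

Placing the flag group at the very start of the pattern is the portable choice; recent Python 3 versions reject global inline flags that appear later in the expression.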
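3.8 Looking ahead or behind
In many cases it is useful to match one part of a pattern only if some other part will also match. For example, the Email expression above should match only when the angle brackets are paired: either both are present or neither is. The modified version below uses a positive look-ahead assertion to match the pair of brackets. The look-ahead assertion syntax is (?=pattern):
    import re

    address = re.compile(
        '''
        # The name is no longer optional
        ((?P<name>
           ([\w.,]+\s+)*[\w.,]+)
           \s+
        )

        # Look ahead: the rest is either wrapped in angle brackets
        # or has no brackets at all
        (?= (<.*>$)
            |
            ([^<].*[^>]$)
        )

        <?

        (?P<email>
          [\w\d.+-]+       # username
          @
          ([\w\d.]+\.)+    # domain name prefix
          (com|org|edu)
        )

        >?
        ''',
        re.UNICODE | re.VERBOSE)

    candidates = [
        u'first.last@example.com',
        u'No Brackets first.last@example.com',
        u'Open Bracket <first.last@example.com',
        u'Close Bracket first.last@example.com>',
        ]

    for candidate in candidates:
        print 'Candidate:', candidate
        match = address.search(candidate)
        if match:
            print ' Name :', match.groupdict()['name']
            print ' Email:', match.groupdict()['email']
        else:
            print ' No match'
The interpreter prints:
    Candidate: first.last@example.com
     No match
    Candidate: No Brackets first.last@example.com
     Name : No Brackets
     Email: first.last@example.com
    Candidate: Open Bracket <first.last@example.com
     No match
    Candidate: Close Bracket first.last@example.com>
     No match
A negative look-ahead assertion, (?!pattern), requires that the pattern does not match the text following the current position. For example, the Email recognition pattern can be modified to ignore the noreply addresses commonly used by automated systems: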
    import re

    address = re.compile(
        '''
        ^
        (?!noreply@.*$)
        [\w\d.+-]+       # username
        @
        ([\w\d.]+\.)+    # domain name prefix
        (com|org|edu)
        $
        ''',
        re.UNICODE | re.VERBOSE)

    candidates = [
        u'first.last@example.com',
        u'noreply@example.com',
        ]

    for candidate in candidates:
        print 'Candidate:', candidate
        match = address.search(candidate)
        if match:
            print ' Match:', candidate[match.start():match.end()]
        else:
            print ' No match'
The interpreter prints:
    Candidate: first.last@example.com
     Match: first.last@example.com
    Candidate: noreply@example.com
     No match
The corresponding negative look-behind assertion uses the syntax (?<!pattern):
    address = re.compile(
        '''
        ^
        [\w\d.+-]+       # username
        (?<!noreply)     # the username must not end in "noreply"
        @
        ([\w\d.]+\.)+    # domain name prefix
        (com|org|edu)
        $
        ''',
        re.UNICODE | re.VERBOSE)
A positive look-behind assertion, with the syntax (?<=pattern), finds text that follows a given pattern:
    import re

    twitter = re.compile('''(?<=@)([\w\d_]+)''', re.UNICODE | re.VERBOSE)

    text = '''This text includes two Twitter handles.
    One for @ThePSF, and one for the author, @doughellmann.
    '''

    print text
    for match in twitter.findall(text):
        print 'Handle:', match
The interpreter prints:
    This text includes two Twitter handles.
    One for @ThePSF, and one for the author, @doughellmann.

    Handle: ThePSF
    Handle: doughellmann
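All four assertion forms are zero-width: they test the surrounding text without consuming it. The Python 3 sketch below applies each to a made-up price string, so the differences show up in the results.

```python
import re

text = 'cost $30 or 45 EUR'

# Positive look-behind: digits that directly follow a dollar sign.
after_dollar = re.findall(r'(?<=\$)\d+', text)

# Negative look-behind: whole numbers that do not follow a dollar sign.
not_after_dollar = re.findall(r'(?<!\$)\b\d+', text)

# Positive look-ahead: digits directly followed by " EUR".
before_eur = re.findall(r'\d+(?= EUR)', text)

# Negative look-ahead: whole numbers not followed by " EUR".
not_before_eur = re.findall(r'\b\d+\b(?! EUR)', text)

print(after_dollar, not_after_dollar, before_eur, not_before_eur)
```

In Python's re module a look-behind pattern must match text of a fixed length, which is why the examples look behind for a single character.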
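3.9 Self-referencing expressions
Matched values can be used in later parts of an expression. The easiest way is to refer to a previously matched group by its id number, using \num:
    import re

    address = re.compile(
        r'''
        (\w+)               # first name
        \s+
        (([\w.]+)\s+)?      # optional middle name or initial
        (\w+)               # last name
        \s+
        <
        (?P<email>\1\.\4@([\w\d.]+\.)+(com|org|edu))
        >
        ''',
        re.UNICODE | re.VERBOSE | re.IGNORECASE)

    candidates = [
        u'First Last <first.last@example.com>',
        u'Different Name <first.last@example.com>',
        u'First Middle Last <first.last@example.com>',
        u'First M. Last <first.last@example.com>',
        ]

    for candidate in candidates:
        print 'Candidate:', candidate
        match = address.search(candidate)
        if match:
            print ' Match name:', match.group(1), match.group(4)
            print ' Match email:', match.group(5)
        else:
            print ' No match'
The interpreter prints:
    Candidate: First Last <first.last@example.com>
     Match name: First Last
     Match email: first.last@example.com
    Candidate: Different Name <first.last@example.com>
     No match
    Candidate: First Middle Last <first.last@example.com>
     Match name: First Last
     Match email: first.last@example.com
    Candidate: First M. Last <first.last@example.com>
     Match name: First Last
     Match email: first.last@example.com
Creating back-references by numerical id has two disadvantages. First, the groups have to be renumbered whenever the expression changes, which makes it hard to maintain. Second, at most 99 references can be created, and an expression that needs more than 99 groups will have far more serious maintenance problems anyway.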
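Python's expression parser therefore also provides (?P=name), which refers to the value of a named group matched earlier in the expression:
    address = re.compile(
        r'''
        (?P<first_name>\w+)     # first name
        \s+
        (([\w.]+)\s+)?          # optional middle name or initial
        (?P<last_name>\w+)      # last name
        \s+
        <
        (?P<email>(?P=first_name)\.(?P=last_name)@([\w\d.]+\.)+(com|org|edu))
        >
        ''',
        re.UNICODE | re.VERBOSE | re.IGNORECASE)
The other mechanism for using back-references in expressions chooses a different pattern based on whether a previous group matched. The Email pattern can be corrected so that the angle brackets are required when a name is present, but not when the Email address is by itself. The syntax is (?(id)yes-expression|no-expression), where id is the group name or number, yes-expression is the pattern to use if the group has a value, and no-expression is the pattern to use otherwise.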
    import re

    address = re.compile(
        r'''
        ^
        (?P<name>([\w.]+\s+)*[\w.]+)?
        \s*
        (?(name)
          (?P<brackets>(?=(<.*>$)))     # name present: the rest must be in brackets
          |
          (?=([^<].*[^>]$))             # no name: the rest must have no brackets
        )
        (?(brackets)<|\s*)
        (?P<email>[\w\d.+-]+@([\w\d.]+\.)+(com|org|edu))
        (?(brackets)>|\s*)
        $
        ''',
        re.UNICODE | re.VERBOSE)

    candidates = [
        u'First Last <first.last@example.com>',
        u'No Brackets first.last@example.com',
        u'Open Bracket <first.last@example.com',
        u'no.brackets@example.com',
        ]

    for candidate in candidates:
        print 'Candidate:', candidate
        match = address.search(candidate)
        if match:
            print ' Match name:', match.groupdict()['name']
            print ' Match email:', match.groupdict()['email']
        else:
            print ' No match'
The interpreter prints:
    Candidate: First Last <first.last@example.com>
     Match name: First Last
     Match email: first.last@example.com
    Candidate: No Brackets first.last@example.com
     No match
    Candidate: Open Bracket <first.last@example.com
     No match
    Candidate: no.brackets@example.com
     Match name: None
     Match email: no.brackets@example.com
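Both mechanisms can be tried on much smaller patterns. The Python 3 sketch below uses a named back-reference to find doubled words, and a conditional group to demand a closing bracket only when an opening one was seen. The sample strings and group names are invented for the illustration.

```python
import re

# (?P=word) must match exactly the text captured by the group "word".
doubled = re.compile(r'\b(?P<word>\w+) (?P=word)\b')
repeats = doubled.findall('this is is a test test')

# (?(open)>) requires ">" only if the optional group "open" matched.
bracketed = re.compile(r'^(?P<open><)?\w+@\w+\.com(?(open)>)$')
samples = ['<a@b.com>', 'a@b.com', '<a@b.com', 'a@b.com>']
results = [bool(bracketed.search(s)) for s in samples]

print(repeats)
print(results)
```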
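3.10 Modifying strings with patterns
Use sub() to replace all occurrences of a pattern with another string:
    import re

    bold = re.compile(r'\*{2}(.*?)\*{2}')
    text = 'Make this **bold**. This **too**.'

    print 'Text:', text
    print 'Bold:', bold.sub(r'\1', text)
The interpreter prints:
    Text: Make this **bold**. This **too**.
    Bold: Make this bold. This too.
To use named groups in the replacement, use the syntax \g<name>. The count argument limits the number of substitutions performed: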
    import re

    # The group name bold_text is an assumed stand-in.
    bold = re.compile(r'\*{2}(?P<bold_text>.*?)\*{2}', re.UNICODE)
    text = 'Make this **bold**. This **too**.'

    print 'Text:', text
    print 'Bold:', bold.sub(r'\g<bold_text>', text, count=1)
The interpreter prints:
    Text: Make this **bold**. This **too**.
    Bold: Make this bold. This **too**.
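The replacement can also be a function that receives each Match object, and subn() reports how many substitutions were made. A Python 3 sketch, using an invented group name body:

```python
import re

bold = re.compile(r'\*{2}(?P<body>.*?)\*{2}')
text = 'Make this **bold**. This **too**.'

# A callable replacement is invoked once per match.
shouted = bold.sub(lambda m: m.group('body').upper(), text)

# subn() returns the new string together with the substitution count.
once, n = bold.subn(r'[\g<body>]', text, count=1)

print(shouted)
print(once, n)
```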
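3.11 Splitting with patterns
str.split() is one of the most frequently used methods for breaking strings apart to parse them, but it only accepts literal values as separators. When the input consists of paragraphs separated by two or more newline characters, one option is findall() with a pattern such as (.+?)\n{2,}:
    import re

    text = '''Paragraph one
    on two lines.

    Paragraph two.


    Paragraph three.'''

    for num, para in enumerate(re.findall(r'(.+?)\n{2,}',
                                          text,
                                          flags=re.DOTALL)
                               ):
        print num, repr(para)
        print
The interpreter prints (note the {2,} in the pattern):
    0 'Paragraph one\non two lines.'

    1 'Paragraph two.'

This fails to report the last paragraph, because no newlines follow it. The pattern can be extended to accept the end of the input as well, or split() can be used instead: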
    import re

    text = '''Paragraph one
    on two lines.

    Paragraph two.


    Paragraph three.'''

    print 'With findall:'
    for num, para in enumerate(re.findall(r'(.+?)(\n{2,}|$)',
                                          text,
                                          flags=re.DOTALL)
                               ):
        print num, repr(para)
        print

    print
    print 'With split:'
    for num, para in enumerate(re.split(r'\n{2,}', text)):
        print num, repr(para)
        print
The interpreter prints:
    With findall:
    0 ('Paragraph one\non two lines.', '\n\n')

    1 ('Paragraph two.', '\n\n\n')

    2 ('Paragraph three.', '')


    With split:
    0 'Paragraph one\non two lines.'

    1 'Paragraph two.'

    2 'Paragraph three.'

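One more detail of re.split(): if the separator pattern is wrapped in a capturing group, the separators themselves are kept in the result. A Python 3 sketch using the same paragraph text:

```python
import re

text = 'Paragraph one\non two lines.\n\nParagraph two.\n\n\nParagraph three.'

# Plain separator pattern: only the paragraphs are returned.
paragraphs = re.split(r'\n{2,}', text)

# Capturing group around the separator: separators are interleaved
# with the paragraphs in the result list.
with_separators = re.split(r'(\n{2,})', text)

print(paragraphs)
print(with_separators)
```

Keeping the separators makes it possible to rebuild the original text exactly with ''.join(with_separators).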