Python Standard Library Study Notes 1: Text
Published: 2019-06-28


1. string --- Text Constants and Templates

Purpose: Contains constants and classes for working with text.

Python version: 1.4 and later

1.1 Functions

capwords(): capitalizes the first letter of every word in a string.

>>> import string
>>> s = 'The quick brown fox jumped over the lazy dog'
>>> string.capwords(s)
'The Quick Brown Fox Jumped Over The Lazy Dog'

The same result can be produced by hand with split() and join():

>>> s
'The quick brown fox jumped over the lazy dog'
>>> " ".join(map(lambda x: x[0].upper() + x[1:], s.split(" ")))
'The Quick Brown Fox Jumped Over The Lazy Dog'

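However, if words are separated by more than one space, s.split(" ") produces empty strings and x[0] raises an IndexError. A revised version walks the characters instead and keeps the original spacing:

>>> ss = 'The quick brown fox jumped over the lazy   dog'
>>> chars = list(ss)
>>> for index in range(len(chars)):
...     # A word starts at a non-space character that is first or follows a space
...     if chars[index] != ' ' and (index == 0 or chars[index - 1] == ' '):
...         chars[index] = chars[index].upper()
...
>>> ''.join(chars)
'The Quick Brown Fox Jumped Over The Lazy   Dog'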
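For comparison, string.capwords() itself collapses runs of whitespace, because it splits on whitespace and rejoins the words with single spaces. The sketch below is written for Python 3 (the listings in these notes use Python 2) and puts that behavior next to a regular-expression variant that keeps the spacing. The helper name capitalize_words is illustrative only and is not part of the standard library.

```python
import re
import string


def capitalize_words(text):
    """Uppercase the first character of each word, leaving whitespace untouched."""
    # (^|\s) captures the start of the string or the whitespace before a word;
    # (\S) captures the first non-whitespace character that follows it.
    return re.sub(r'(^|\s)(\S)',
                  lambda m: m.group(1) + m.group(2).upper(),
                  text)


ss = 'The quick brown fox jumped over the lazy   dog'
collapsed = string.capwords(ss)    # whitespace runs become single spaces
preserved = capitalize_words(ss)   # original spacing is kept
print(collapsed)
print(preserved)
```

Which of the two is appropriate depends on whether the spacing of the input carries meaning.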

maketrans(): builds a translation table that, combined with the translate() method, maps one set of characters to another. This is preferable to calling replace() repeatedly.

>>> import string
>>> leet = string.maketrans('abegiloprstz', '463611092572')
>>> s
'The quick brown fox jumped over the lazy dog'
>>> s.translate(leet)
'Th3 qu1ck 620wn f0x jum93d 0v32 7h3 142y d06'

The same result with repeated calls to replace():

>>> s
'The quick brown fox jumped over the lazy dog'
>>> subStr = s
>>> length = len('abegiloprstz')
>>> for i in range(0, length):
...     subStr = subStr.replace('abegiloprstz'[i], '463611092572'[i])
...
>>> subStr
'Th3 qu1ck 620wn f0x jum93d 0v32 7h3 142y d06'
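In Python 3 the string.maketrans() function no longer exists; the table is built with the static method str.maketrans() instead. A minimal sketch of the same translation, assuming Python 3:

```python
# Build a table mapping each character of the first string to the
# character at the same position in the second string.
leet = str.maketrans('abegiloprstz', '463611092572')

s = 'The quick brown fox jumped over the lazy dog'
translated = s.translate(leet)
print(translated)
```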

1.2 Templates

With string.Template, a variable is identified by prefixing its name with $ (as in $var). If it needs to be set off from the surrounding text, it can also be wrapped in curly braces (as in ${var}).

A simple example:

import string

values = {'var': 'foo'}

# Substitution with string.Template; a literal $ is written as $$
t = string.Template("""
Variable    : $var
Escape      : $$
Variable in text: ${var}iable
""")

print 'TEMPLATE:', t.substitute(values)

# String interpolation, matching values by keyword; a literal % is written as %%
s = """
Variable    : %(var)s
Escape      : %%
Variable in text: %(var)siable
"""

print 'INTERPOLATION:', s % values

Interpreter output:

TEMPLATE: 
Variable    : foo
Escape      : $
Variable in text: fooiable

INTERPOLATION: 
Variable    : foo
Escape      : %
Variable in text: fooiable
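One important difference between templates and standard string interpolation is that templates do not take the type of the arguments into account. Values are converted to strings, and the strings are inserted into the result. No formatting options are available.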
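The difference is easiest to see with non-string values. The following sketch (Python 3; the keys count and ratio are made up for illustration) passes the same mapping to a template and to %-interpolation:

```python
import string

values = {'count': 3, 'ratio': 0.5}

# Template calls str() on each value; padding or precision cannot be requested.
templated = string.Template('$count items, ratio $ratio').substitute(values)

# %-interpolation can apply a format specification to each value.
interpolated = '%(count)03d items, ratio %(ratio).2f' % values

print(templated)
print(interpolated)
```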
The safe_substitute() method avoids the exception that is otherwise raised when not all of the values a template needs are provided:

import string

values = {'var': 'foo'}

t = string.Template("$var is here but $missing is not provided")

try:
    print 'substitute() :', t.substitute(values)
except KeyError as err:
    print 'ERROR:', str(err)

# A placeholder with no value supplied is left in the output unchanged
print 'safe_substitute():', t.safe_substitute(values)

Interpreter output:

substitute() : ERROR: 'missing'
safe_substitute(): foo is here but $missing is not provided

1.3 Advanced Templates

The default syntax of string.Template can be changed by adjusting the regular expression patterns it uses to find variable names in the template body. A simple way to do that is to change the delimiter and idpattern class attributes.

import string

template_text = """
Delimiter : %%
Replaced : %with_underscore
Ignored : %notunderscored
"""

d = {'with_underscore' : 'replaced',
     'notunderscored' : 'not replaced',
}

# The delimiter is changed to %
# Variable names must match '[a-z]+_[a-z]+', so they need an underscore in the middle
class MyTemplate(string.Template):
    delimiter = '%'
    idpattern = '[a-z]+_[a-z]+'

t = MyTemplate(template_text)
print 'Modified ID pattern'
print t.safe_substitute(d)

Interpreter output:

Modified ID pattern

Delimiter : %
Replaced : replaced
Ignored : %notunderscored
For more complex changes, override the pattern attribute and define an entirely new regular expression. The pattern provided must contain four named groups, which capture the escaped delimiter, the named variable, the variable name wrapped in braces, and invalid delimiter patterns.

import re
import string

class MyTemplate(string.Template):
    delimiter = '{{'    # change the delimiter to '{{'
    pattern = r"""
    \{\{(?:
    (?P<escaped>\{\{)|
    (?P<named>[_a-z][_a-z0-9]*)\}\}|
    (?P<braced>[_a-z][_a-z0-9]*)\}\}|
    (?P<invalid>)
    )
    """

t = MyTemplate("""{{{{
{{var}}
{{foo}}""")

print 'MATCHES:', t.pattern.findall(t.template)
print 'SUBSTITUTED:', t.safe_substitute(var='123replacement', foo='replacement')

Interpreter output:

MATCHES: [('{{', '', '', ''), ('', 'var', '', ''), ('', 'foo', '', '')]
SUBSTITUTED: {{
123replacement
replacement

Note: I do not yet understand how the four named groups in pattern are used.
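On the four named groups: string.Template looks each one up by name when it processes a match. As far as I can tell from the standard library behavior, escaped matches the doubled delimiter and is replaced by a single delimiter, named and braced both yield a variable name to look up in the mapping, and invalid catches a delimiter followed by anything else. For an invalid match, substitute() raises ValueError while safe_substitute() leaves the text alone. The sketch below is for Python 3, and the class name BraceTemplate is hypothetical.

```python
import string


class BraceTemplate(string.Template):
    delimiter = '{{'
    # string.Template compiles this with re.VERBOSE, so comments are allowed.
    pattern = r'''
    \{\{(?:
      (?P<escaped>\{\{)                  |  # doubled delimiter: emit one delimiter
      (?P<named>[_a-z][_a-z0-9]*)\}\}    |  # a variable name closed by }}
      (?P<braced>[_a-z][_a-z0-9]*)\}\}   |  # same form reused for the braced slot
      (?P<invalid>)                         # delimiter followed by anything else
    )
    '''


t = BraceTemplate('{{{{ {{var}}')
substituted = t.substitute(var='x')      # escaped group, then named group

broken = BraceTemplate('{{ oops')        # only the invalid group can match
kept = broken.safe_substitute()          # text is returned unchanged
try:
    broken.substitute()
    raised = False
except ValueError:
    raised = True

print(substituted, kept, raised)
```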

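2. textwrap --- Formatting Text Paragraphs

Purpose: Formats text by adjusting where line breaks occur in a paragraph.

Python version: 2.5 and later

The textwrap module can be used to format text for output when pretty-printing is needed. It offers programmatic functionality similar to the paragraph wrapping or filling features found in many text editors.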

2.1 Example Data

sample_text = """
    The textwrap module can be used to format text for output in
    situations where pretty-printing is desired. It offers
    programmatic functionality similar to the paragraph wrapping
    or filling features found in many text editors"""

This is saved in the module textwrap_example.py so that the later examples can import it.

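2.2 Filling Paragraphs

fill() takes the text and wraps it to the given width:

>>> import textwrap
>>> from textwrap_example import sample_text
>>> print textwrap.fill(sample_text, width = 50)
     The textwrap module can be used to format
text for output in     situations where pretty-
printing is desired. It offers     programmatic
functionality similar to the paragraph wrapping
or filling features found in many text editors

The result is less than ideal: only the first line keeps its indent, and the leading spaces of the following source lines end up embedded in the middle of the paragraph.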

2.3 Removing Existing Indentation

dedent() removes the whitespace prefix that is common to all of the lines:

>>> print textwrap.dedent(sample_text)

The textwrap module can be used to format text for output in
situations where pretty-printing is desired. It offers
programmatic functionality similar to the paragraph wrapping
or filling features found in many text editors

2.4 Combining dedent and fill

dedent() can be used to strip the existing indentation, and the result can then be passed to fill() to wrap it at different widths:

>>> dedented_text = textwrap.dedent(sample_text).strip()
>>> for width in [45, 70]:
...     print '%d Columns:\n' % width
...     print textwrap.fill(dedented_text, width=width)
...     print
...
45 Columns:

The textwrap module can be used to format
text for output in situations where pretty-
printing is desired. It offers programmatic
functionality similar to the paragraph
wrapping or filling features found in many
text editors

70 Columns:

The textwrap module can be used to format text for output in
situations where pretty-printing is desired. It offers programmatic
functionality similar to the paragraph wrapping or filling features
found in many text editors

2.5 Hanging Indents

The indent of the first line can be controlled independently of the following lines. For a hanging indent, the first line is indented less than the lines after it, which sets it apart from them:

>>> dedented_text = textwrap.dedent(sample_text).strip()
>>> print textwrap.fill(dedented_text, initial_indent='', subsequent_indent=' ' * 4, width = 50,)
The textwrap module can be used to format text for
    output in situations where pretty-printing is
    desired. It offers programmatic functionality
    similar to the paragraph wrapping or filling
    features found in many text editors
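The dedent, strip, and fill steps from sections 2.3 to 2.5 can be wrapped in one small helper. This is only a sketch for Python 3; the function name reflow and its hanging parameter are hypothetical and do not come from textwrap.

```python
import textwrap

sample_text = """
    The textwrap module can be used to format text for output in
    situations where pretty-printing is desired. It offers
    programmatic functionality similar to the paragraph wrapping
    or filling features found in many text editors"""


def reflow(text, width, hanging=0):
    """Remove the common indent, then wrap with an optional hanging indent."""
    cleaned = textwrap.dedent(text).strip()
    return textwrap.fill(cleaned, width=width,
                         initial_indent='',
                         subsequent_indent=' ' * hanging)


wrapped = reflow(sample_text, width=50, hanging=4)
print(wrapped)
```

Note that the width passed to fill() includes the indent, so every output line, indented or not, stays within 50 columns here.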

3. re --- Regular Expressions

3.1 Finding Patterns in Text

search() takes the pattern and the text to scan, and returns a Match object when the pattern is found, or None otherwise.

Each Match object holds information about the nature of the match, including the original input string, the regular expression used, and the location within the original string where the pattern occurs:

>>> import re
>>> pattern = 'this'
>>> text = 'Does this text match the pattern?'
>>> match = re.search(pattern, text)
>>> dir(match)
['__class__', '__copy__', '__deepcopy__', '__delattr__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'end', 'endpos', 'expand', 'group', 'groupdict', 'groups', 'lastgroup', 'lastindex', 'pos', 're', 'regs', 'span', 'start', 'string']
>>> match.string
'Does this text match the pattern?'
>>> match.start()
5
>>> match.re
<_sre.SRE_Pattern object at 0x0000000002A9E258>
>>> match.re()
Traceback (most recent call last):
  ...
TypeError: '_sre.SRE_Pattern' object is not callable
>>> match.re.pattern
'this'

Note: the dir() and help() functions are very useful for finding out what each object can do.
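As a compact summary of the attributes explored above, the following Python 3 sketch collects the most commonly used ones and also shows the None result for a pattern that is not found:

```python
import re

text = 'Does this text match the pattern?'

match = re.search('this', text)
# start()/end() give the slice of the input covered by the match,
# group(0) the matched text, and re.pattern the original expression.
info = (match.start(), match.end(), match.group(0), match.re.pattern)

missing = re.search('that', text)   # no match, so search() returns None

print(info, missing)
```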

3.2 Compiling Expressions

If an expression is used frequently, it is more efficient to compile it. The compile() function converts an expression string into a RegexObject.

import re

# Precompile the patterns
regexes = [re.compile(p) for p in ['this', 'that']]
text = 'Does this text match the pattern'

print 'Text: %r\n' % text

for regex in regexes:
    print 'Seeking "%s" ->' % regex.pattern,
    if regex.search(text):
        print 'match'
    else:
        print 'no match'

Interpreter output:

Text: 'Does this text match the pattern'

Seeking "this" -> match
Seeking "that" -> no match
>>> type(regexes)
<type 'list'>
>>> regexes
[<_sre.SRE_Pattern object at 0x0000000002BAE0E8>, <_sre.SRE_Pattern object at 0x0000000002BAE258>]

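3.3 Multiple Matches

findall() returns all of the substrings of the input that match the pattern without overlapping. finditer() returns an iterator that produces Match objects instead of strings.

import re

text = 'abbaaabbbbaaaaa'
pattern = 'ab'

for match in re.findall(pattern, text):
    print 'Found "%s"' % match

# re.finditer(pattern, text) is called only once; the for loop then
# steps through the Match objects that the iterator yields
for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print 'Found "%s" at %d:%d' % (text[s:e], s, e)

Interpreter output:

Found "ab"
Found "ab"
Found "ab" at 0:2
Found "ab" at 5:7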

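3.4 Pattern Syntax

Regular expressions support more powerful patterns than simple literal text strings. Patterns can repeat, can be anchored to different logical locations within the input, and can be expressed in compact forms that do not require every literal character to be present in the pattern. All of these features are used by combining literal text values with metacharacters that are part of the regular expression pattern syntax implemented by re.

import re

def test_patterns(text, patterns=[]):
    # For each (pattern, description) pair, print every match in text,
    # prefixed with dots showing where the match starts
    for pattern, desc in patterns:
        print 'Pattern %r (%s)\n' % (pattern, desc)
        print '     %r' % text
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            substr = text[s:e]
            n_backslashes = text[:s].count('\\')
            prefix = '.' * (s + n_backslashes)
            print '     %s%r|' % (prefix, substr),
        print
    return

if __name__ == "__main__":
    test_patterns('abbaaabbbbaaaaa',
                  [('ab', "'a' followed by 'b'"),])

This is saved in the file re_test_patterns.py.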

Repetition

There are five ways to express repetition in a pattern. A pattern followed by the metacharacter * is repeated zero or more times. With +, it must appear at least once. With ?, it appears zero or one time. {m} asks for exactly m repetitions, and {m,n} for at least m and at most n. Leaving out n, as in {m,}, means at least m repetitions with no upper limit.

from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('ab*',    'a followed by zero or more b'),
     ('ab+',    'a followed by one or more b'),
     ('ab?',    'a followed by zero or one b'),
     ('ab{3}',  'a followed by three b'),
     ('ab{2,3}',   'a followed by two to three b'),
     ])

Interpreter output:

Pattern 'ab*' (a followed by zero or more b)

     'abbaabbba'
     'abb'|      ...'a'|      ....'abbb'|      ........'a'|
Pattern 'ab+' (a followed by one or more b)

     'abbaabbba'
     'abb'|      ....'abbb'|
Pattern 'ab?' (a followed by zero or one b)

     'abbaabbba'
     'ab'|      ...'a'|      ....'ab'|      ........'a'|
Pattern 'ab{3}' (a followed by three b)

     'abbaabbba'
     ....'abbb'|
Pattern 'ab{2,3}' (a followed by two to three b)

     'abbaabbba'
     'abb'|      ....'abbb'|
Normally, when processing a repetition instruction, re consumes as much of the input as possible while matching the pattern. This so-called "greedy" behavior may result in fewer individual matches, or in matches that include more of the input text than intended. Greediness can be turned off by following the repetition instruction with "?":

from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('ab*?',    'a followed by zero or more b'),
     ('ab+?',    'a followed by one or more b'),
     ('ab??',    'a followed by zero or one b'),
     ('ab{3}?',  'a followed by three b'),
     ('ab{2,3}?',   'a followed by two to three b'),
     ])

Interpreter output:

Pattern 'ab*?' (a followed by zero or more b)

     'abbaabbba'
     'a'|      ...'a'|      ....'a'|      ........'a'|
Pattern 'ab+?' (a followed by one or more b)

     'abbaabbba'
     'ab'|      ....'ab'|
Pattern 'ab??' (a followed by zero or one b)

     'abbaabbba'
     'a'|      ...'a'|      ....'a'|      ........'a'|
Pattern 'ab{3}?' (a followed by three b)

     'abbaabbba'
     ....'abbb'|
Pattern 'ab{2,3}?' (a followed by two to three b)

     'abbaabbba'
     'abb'|      ....'abb'|
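Greedy matching causes the most surprise when a pattern is delimited on both sides, and the open-ended {m,} form is not exercised by the listings above. The Python 3 sketch below covers both; the markup string is made up for illustration.

```python
import re

markup = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ runs to the last '>' in the input, giving one long match.
greedy = re.findall(r'<.+>', markup)

# Non-greedy: .+? stops at the first '>' that allows a match.
lazy = re.findall(r'<.+?>', markup)

# {2,} means "two or more"; adding ? takes as few repetitions as allowed.
at_least_two = re.findall(r'ab{2,}', 'abbaabbba')
at_least_two_lazy = re.findall(r'ab{2,}?', 'abbaabbba')

print(greedy, lazy, at_least_two, at_least_two_lazy)
```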

Character Sets

A character set is a group of characters, any one of which can match at that point in the pattern. For example, [ab] matches either a or b:

from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('[ab]', 'either a or b'),
     ('a[ab]+', 'a followed by 1 or more a or b'),
     ('a[ab]+?', 'a followed by 1 or more a or b, not greedy'),
     ])

Interpreter output (note the greedy matching):

Pattern '[ab]' (either a or b)

     'abbaabbba'
     'a'|      .'b'|      ..'b'|      ...'a'|      ....'a'|      .....'b'|      ......'b'|      .......'b'|      ........'a'|
Pattern 'a[ab]+' (a followed by 1 or more a or b)

     'abbaabbba'
     'abbaabbba'|
Pattern 'a[ab]+?' (a followed by 1 or more a or b, not greedy)

     'abbaabbba'
     'ab'|      ...'aa'|
A character set can also be used to exclude specific characters. The caret (^) means to look for characters that are not in the set that follows it.

from re_test_patterns import test_patterns

test_patterns(
    'This is some text -- with punctuation',
    # Find all sequences that do not contain "-", "." or a space
    [('[^-. ]+', 'sequences without -, ., or space'),
     ])

Interpreter output:

Pattern '[^-. ]+' (sequences without -, ., or space)

     'This is some text -- with punctuation'
     'This'|      .....'is'|      ........'some'|      .............'text'|      .....................'with'|      ..........................'punctuation'|
A character range defines a character set that includes all of the contiguous characters between a start point and an end point:

from re_test_patterns import test_patterns

test_patterns(
    'This is some text -- with punctuation',
    [('[a-z]+', 'sequences of lowercase letters'),
     ('[A-Z]+', 'sequences of uppercase letters'),
     ('[a-zA-Z]+', 'sequences of lowercase or uppercase letters'),
     ('[A-Z][a-z]+', 'one uppercase followed by lowercase'),
     ])

Interpreter output:

Pattern '[a-z]+' (sequences of lowercase letters)

     'This is some text -- with punctuation'
     .'his'|      .....'is'|      ........'some'|      .............'text'|      .....................'with'|      ..........................'punctuation'|
Pattern '[A-Z]+' (sequences of uppercase letters)

     'This is some text -- with punctuation'
     'T'|
Pattern '[a-zA-Z]+' (sequences of lowercase or uppercase letters)

     'This is some text -- with punctuation'
     'This'|      .....'is'|      ........'some'|      .............'text'|      .....................'with'|      ..........................'punctuation'|
Pattern '[A-Z][a-z]+' (one uppercase followed by lowercase)

     'This is some text -- with punctuation'
     'This'|
As a special case of a character set, the metacharacter "." indicates that the pattern should match any single character in that position.

from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('a.', 'a followed by any one character'),
     ('b.', 'b followed by any one character'),
     ('a.*b', 'a followed by anything, ending in b'),
     ('a.*?b', 'a followed by anything, ending in b'),
     ])

Interpreter output:

Pattern 'a.' (a followed by any one character)

     'abbaabbba'
     'ab'|      ...'aa'|
Pattern 'b.' (b followed by any one character)

     'abbaabbba'
     .'bb'|      .....'bb'|      .......'ba'|
Pattern 'a.*b' (a followed by anything, ending in b)

     'abbaabbba'
     'abbaabbb'|
Pattern 'a.*?b' (a followed by anything, ending in b)

     'abbaabbba'
     'ab'|      ...'aab'|

Escape Codes

The escape codes recognized by re are:

Escape code    Meaning
\d             a digit
\D             a non-digit
\s             whitespace (tab, space, newline, etc.)
\S             non-whitespace
\w             alphanumeric
\W             non-alphanumeric

from re_test_patterns import test_patterns

test_patterns(
    'A prime #1 example!',
    [(r'\d+', 'sequence of digits'),
     (r'\D+', 'sequence of nondigits'),
     (r'\s+', 'sequence of whitespace'),
     (r'\S+', 'sequence of nonwhitespace'),
     (r'\w+', 'alphanumeric characters'),
     (r'\W+', 'nonalphanumeric')
     ])

Interpreter output:

Pattern '\\d+' (sequence of digits)

     'A prime #1 example!'
     .........'1'|
Pattern '\\D+' (sequence of nondigits)

     'A prime #1 example!'
     'A prime #'|      ..........' example!'|
Pattern '\\s+' (sequence of whitespace)

     'A prime #1 example!'
     .' '|      .......' '|      ..........' '|
Pattern '\\S+' (sequence of nonwhitespace)

     'A prime #1 example!'
     'A'|      ..'prime'|      ........'#1'|      ...........'example!'|
Pattern '\\w+' (alphanumeric characters)

     'A prime #1 example!'
     'A'|      ..'prime'|      .........'1'|      ...........'example'|
Pattern '\\W+' (nonalphanumeric)

     'A prime #1 example!'
     .' '|      .......' #'|      ..........' '|      ..................'!'|
To match characters that are themselves part of the regular expression syntax, escape them in the search pattern:

from re_test_patterns import test_patterns

test_patterns(
    r'\d+ \D+ \s+',
    [(r'\\.\+', 'escape code'),
     ])

Interpreter output:

Pattern '\\\\.\\+' (escape code)

     '\\d+ \\D+ \\s+'
     '\\d+'|      .....'\\D+'|      ..........'\\s+'|
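When the literal text comes from a variable rather than being typed into the pattern, re.escape() can add the backslashes. A short Python 3 sketch using the same input string as above:

```python
import re

text = r'\d+ \D+ \s+'
literal = r'\d+'          # the three characters backslash, d, plus

# Used directly, the string is read as the regex "one or more digits".
unescaped = re.findall(literal, text)

# re.escape() returns a pattern that matches the string literally.
found = re.findall(re.escape(literal), text)

print(unescaped, found)
```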

Anchoring

Anchoring instructions specify the relative location in the input text where the pattern should appear.

Anchor code    Meaning
^              start of string, or line
$              end of string, or line
\A             start of string
\Z             end of string
\b             empty string at the beginning or end of a word
\B             empty string not at the beginning or end of a word

from re_test_patterns import test_patterns

test_patterns(
    'This is some text -- with punctuation.',
    [(r'^\w+', 'word at start of string'),
     (r'\A\w+', 'word at start of string'),
     (r'\w+\S*$', 'word near end of string, skip punctuation'),
     (r'\w+\S*\Z', 'word near end of string, skip punctuation'),
     (r'\w*t\w*', 'word containing t'),
     (r'\bt\w+', 't at start of word'),
     (r'\w+t\b', 't at end of word'),
     (r'\Bt\B', 't not start or end of word'),
     ])

Interpreter output:

Pattern '^\\w+' (word at start of string)

     'This is some text -- with punctuation.'
     'This'|
Pattern '\\A\\w+' (word at start of string)

     'This is some text -- with punctuation.'
     'This'|
Pattern '\\w+\\S*$' (word near end of string, skip punctuation)

     'This is some text -- with punctuation.'
     ..........................'punctuation.'|
Pattern '\\w+\\S*\\Z' (word near end of string, skip punctuation)

     'This is some text -- with punctuation.'
     ..........................'punctuation.'|
Pattern '\\w*t\\w*' (word containing t)

     'This is some text -- with punctuation.'
     .............'text'|      .....................'with'|      ..........................'punctuation'|
Pattern '\\bt\\w+' (t at start of word)

     'This is some text -- with punctuation.'
     .............'text'|
Pattern '\\w+t\\b' (t at end of word)

     'This is some text -- with punctuation.'
     .............'text'|
Pattern '\\Bt\\B' (t not start or end of word)

     'This is some text -- with punctuation.'
     .......................'t'|      ..............................'t'|      .................................'t'|

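3.5 Constraining the Search

If it is known in advance that only a subset of the full input needs to be searched, the match can be constrained further by telling re to limit the search range. For example, if the pattern must appear at the front of the input, using match() instead of search() anchors the search without having to include an anchor in the pattern explicitly.

>>> import re
>>> text = 'This is some text -- with punctuation.'
>>> pattern = 'is'
>>> m = re.match(pattern, text)
>>> print m
None
>>> s = re.search(pattern, text)
>>> print s
<_sre.SRE_Match object at 0x0000000002C265E0>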
The search() method of a compiled regular expression also accepts optional start and end position arguments (pos and endpos) that limit the search to a substring of the input:

import re

text = 'This is some text -- with punctuation.'
pattern = re.compile(r'\b\w*is\w*\b')

print 'Text:', text
print

pos = 0
while True:
    match = pattern.search(text, pos)
    if not match:
        break
    s = match.start()
    e = match.end()
    print ' %2d : %2d = "%s"' % (s, e - 1, text[s:e])
    # Continue searching after the end of this match
    pos = e

Interpreter output:

Text: This is some text -- with punctuation.

  0 :  3 = "This"
  5 :  6 = "is"
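The loop above only passes the start position. A small Python 3 sketch of both optional arguments, using the same pattern and text:

```python
import re

text = 'This is some text -- with punctuation.'
pattern = re.compile(r'\b\w*is\w*\b')

first = pattern.search(text)          # whole input: finds 'This'
second = pattern.search(text, 4)      # start at index 4: finds 'is'
limited = pattern.search(text, 5, 6)  # only index 5 is visible: no match

print(first.span(), second.span(), limited)
```

The end position behaves like the end of the string, so in the last call the "s" at index 6 is out of reach and the search fails.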

3.6 Dissecting Matches with Groups

Searching for pattern matches is the basis of the powerful capabilities provided by regular expressions. Adding groups to a pattern isolates parts of the matching text. Groups are defined by enclosing patterns in parentheses, "(" and ")":

from re_test_patterns import test_patterns

test_patterns(
    'abbaaabbbbaaaaa',
    [('a(ab)', 'a followed by literal ab'),
     ('a(a*b*)', 'a followed by 0-n a and 0-n b'),
     ('a(ab)*', 'a followed by 0-n ab'),
     ('a(ab)+', 'a followed by 1-n ab'),
    ])

Interpreter output:

Pattern 'a(ab)' (a followed by literal ab)

     'abbaaabbbbaaaaa'
     ....'aab'|
Pattern 'a(a*b*)' (a followed by 0-n a and 0-n b)

     'abbaaabbbbaaaaa'
     'abb'|      ...'aaabbbb'|      ..........'aaaaa'|
Pattern 'a(ab)*' (a followed by 0-n ab)

     'abbaaabbbbaaaaa'
     'a'|      ...'a'|      ....'aab'|      ..........'a'|      ...........'a'|      ............'a'|      .............'a'|      ..............'a'|
Pattern 'a(ab)+' (a followed by 1-n ab)

     'abbaaabbbbaaaaa'
     ....'aab'|
To access the substrings matched by the individual groups within a pattern, use the groups() method of the Match object (group() returns a single group):

import re

text = 'This is some text -- with punctuation.'

print text
print

patterns = [
    (r'^(\w+)', 'word at start of string'),
    (r'(\w+)\S*$', 'word at end, with optional punctuation'),
    (r'(\bt\w+)\W+(\w+)', 'word starting with t, another word'),
    (r'(\w+t)\b', 'word ending with t'),
    ]

for pattern, desc in patterns:
    regex = re.compile(pattern)
    match = regex.search(text)
    print 'Pattern %r (%s)\n' % (pattern, desc)
    print ' ', match.groups()
    print

Interpreter output:

This is some text -- with punctuation.

Pattern '^(\\w+)' (word at start of string)

  ('This',)

Pattern '(\\w+)\\S*$' (word at end, with optional punctuation)

  ('punctuation',)

Pattern '(\\bt\\w+)\\W+(\\w+)' (word starting with t, another word)

  ('text', 'with')

Pattern '(\\w+t)\\b' (word ending with t)

  ('text',)
Python extends the basic grouping syntax with named groups. Referring to groups by name makes it easier to modify the pattern later without also having to modify the code that uses the match results. To set the name of a group, use the syntax (?P<name>pattern):

import re

text = 'This is some text -- with punctuation.'

print text
print

patterns = [
    r'^(?P<first_word>\w+)',
    r'(?P<last_word>\w+)\S*$',
    r'(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)',
    r'(?P<ends_with_t>\w+t)\b',
    ]

for pattern in patterns:
    regex = re.compile(pattern)
    match = regex.search(text)
    print 'Matching "%s"' % pattern
    print ' ', match.groups()
    print ' ', match.groupdict()
    print

Interpreter output:

This is some text -- with punctuation.

Matching "^(?P<first_word>\w+)"
  ('This',)
  {'first_word': 'This'}

Matching "(?P<last_word>\w+)\S*$"
  ('punctuation',)
  {'last_word': 'punctuation'}

Matching "(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)"
  ('text', 'with')
  {'other_word': 'with', 't_word': 'text'}

Matching "(?P<ends_with_t>\w+t)\b"
  ('text',)
  {'ends_with_t': 'text'}

Note: groupdict() returns a dictionary that maps group names to the matched substrings. Named groups are also included in the ordered sequence returned by groups().
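The practical benefit of named groups is that a single group can be fetched by name, independent of its position in the pattern. A minimal Python 3 sketch with one of the patterns from above:

```python
import re

text = 'This is some text -- with punctuation.'
match = re.search(r'(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)', text)

by_position = match.group(1)        # first group, addressed by number
by_name = match.group('t_word')     # same group, addressed by name
as_tuple = match.groups()           # named groups still appear here, in order
as_dict = match.groupdict()         # name -> matched substring

print(by_position, by_name, as_tuple, as_dict)
```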
test_patterns() can now be updated to show the numbered and named groups matched by a pattern:

import re

def test_patterns(text, patterns=[]):
    # For each (pattern, description) pair, print every match in text
    # together with its numbered groups and, if any, its named groups
    for pattern, desc in patterns:
        print 'Pattern %r (%s)\n' % (pattern, desc)
        print '     %r' % text
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            prefix = ' ' * (s)
            print ' %s%r%s ' % (prefix, text[s:e], ' ' * (len(text) - e)),
            print match.groups()
            if match.groupdict():
                print '%s%s' % (' ' * (len(text) - s), match.groupdict())
        print
    return

if __name__ == "__main__":
    test_patterns('abbaabbba',
                  [(r'a((a*)(b*))', "'a' followed by 0-n a and 0-n b"),])

Interpreter output:

Pattern 'a((a*)(b*))' ('a' followed by 0-n a and 0-n b)

     'abbaabbba'
 'abb'        ('bb', '', 'bb')
    'aabbb'   ('abbb', 'a', 'bbb')
         'a'  ('', '', '')
Groups are also useful for specifying alternative patterns. Use the pipe symbol (|) to indicate that one pattern or another should match:

from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [(r'a((a+)|(b+))', 'a then seq. of a or seq. of b'),
     (r'a((a|b)+)', 'a then seq. of [ab]'),
     ])

Interpreter output:

Pattern 'a((a+)|(b+))' (a then seq. of a or seq. of b)

     'abbaabbba'
 'abb'        ('bb', None, 'bb')
    'aa'      ('a', 'a', None)

Pattern 'a((a|b)+)' (a then seq. of [ab])

     'abbaabbba'
 'abbaabbba'  ('bbaabbba', 'a')
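Defining a group around a subpattern is also useful when the string matching the subpattern is not something that should be extracted from the full text. These groups are called "non-capturing". Non-capturing groups can be used to describe repetition patterns or alternatives without isolating the matching portion of the string in the value returned. To create a non-capturing group, use the syntax (?:pattern).

from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [(r'a((a+)|(b+))', 'capturing form'),
     (r'a((?:a+)|(?:b+))', 'noncapturing'),
     ])

Interpreter output:

Pattern 'a((a+)|(b+))' (capturing form)

     'abbaabbba'
 'abb'        ('bb', None, 'bb')
    'aa'      ('a', 'a', None)

Pattern 'a((?:a+)|(?:b+))' (noncapturing)

     'abbaabbba'
 'abb'        ('bb',)
    'aa'      ('a',)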

3.7 Search Options

Option flags change the way the matching engine processes an expression. The flags can be combined with a bitwise OR operation and then passed to compile(), search(), match(), and the other functions that accept a pattern for searching.

Case-insensitive Matching

IGNORECASE causes literal characters and character ranges in the pattern to match both uppercase and lowercase characters.

import re

text = 'This is some text -- with punctuation.'
pattern = r'\bT\w+'
with_case = re.compile(pattern)
without_case = re.compile(pattern, re.IGNORECASE)

print 'Text:\n  %r' % text
print 'Pattern:\n   %s' % pattern
print 'Case-sensitive:'
for match in with_case.findall(text):
    print ' %r' % match
print 'Case-insensitive:'
for match in without_case.findall(text):
    print ' %r' % match

Interpreter output:

Text:
  'This is some text -- with punctuation.'
Pattern:
   \bT\w+
Case-sensitive:
 'This'
Case-insensitive:
 'This'
 'text'

Input with Multiple Lines

Two flags affect how searching works in multiline input: MULTILINE and DOTALL. The MULTILINE flag controls how the pattern matching code processes anchoring instructions for text containing newline characters. When multiline mode is turned on, the anchor rules for ^ and $ apply at the beginning and end of each line, in addition to the entire string:

import re

text = 'This is some text -- with punctuation.\nA second line.'
pattern = r'(^\w+)|(\w+\S*$)'
single_line = re.compile(pattern)
multiline = re.compile(pattern, re.MULTILINE)

print 'Text:\n  %r' % text
print 'Pattern:\n   %s' % pattern
print 'Single Line:'
for match in single_line.findall(text):
    print ' %r' % (match,)
print 'Multiline    :'
for match in multiline.findall(text):
    print ' %r' % (match,)

Interpreter output:

Text:
  'This is some text -- with punctuation.\nA second line.'
Pattern:
   (^\w+)|(\w+\S*$)
Single Line:
 ('This', '')
 ('', 'line.')
Multiline    :
 ('This', '')
 ('', 'punctuation.')
 ('A', '')
 ('', 'line.')
DOTALL is the other flag related to multiline text. Normally the dot character (.) matches everything in the input text except a newline character. This flag allows the dot to match newlines as well.

import re

text = 'This is some text -- with punctuation.\nA second line.'
pattern = r'.+'
no_newlines = re.compile(pattern)
dotall = re.compile(pattern, re.DOTALL)

print 'Text:\n  %r' % text
print 'Pattern:\n   %s' % pattern
print 'No newlines:'
for match in no_newlines.findall(text):
    print ' %r' % (match,)
print 'Dotall       :'
for match in dotall.findall(text):
    print ' %r' % (match,)

Interpreter output:

Text:
  'This is some text -- with punctuation.\nA second line.'
Pattern:
   .+
No newlines:
 'This is some text -- with punctuation.'
 'A second line.'
Dotall       :
 'This is some text -- with punctuation.\nA second line.'
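The flags above are plain integer-like constants, so several can be passed at once by joining them with |. A Python 3 sketch with a made-up three-line input:

```python
import re

text = 'first line\nSecond line\nsecond again'
pattern = r'^second'

# ^ only matches at the very start of the string, which is 'first'.
ignorecase_only = re.findall(pattern, text, re.IGNORECASE)

# ^ matches at each line start, but 'Second' differs in case.
multiline_only = re.findall(pattern, text, re.MULTILINE)

# Both behaviors together.
combined = re.findall(pattern, text, re.IGNORECASE | re.MULTILINE)

print(ignorecase_only, multiline_only, combined)
```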

Verbose Expression Syntax

Verbose expression syntax allows comments and extra whitespace to be embedded in the pattern.

import re

address = re.compile(
    '''
    [\w\d.+-]+       # username
    @
    ([\w\d.]+\.)+    # domain name prefix
    (com|org|edu)
    ''',
    re.UNICODE | re.VERBOSE)

candidates = [
    u'first.last@example.com',
    u'first.last+category@gmail.com',
    u'valid-address@mail.example.com',
    u'not-valid@example.foo'
    ]

for candidate in candidates:
    match = address.search(candidate)
    print '%-30s  %s' % (candidate, 'Matches' if match else 'No match')

Interpreter output:

first.last@example.com          Matches
first.last+category@gmail.com   Matches
valid-address@mail.example.com  Matches
not-valid@example.foo           No match
This version can be extended to parse input that includes a person's name as well as the email address.

import re

address = re.compile(
    '''
    # The name and the opening angle bracket are optional
    ((?P<name>
       ([\w.,]+\s+)*[\w.,]+)
       \s*
       <
    )?

    (?P<email>
      [\w\d.+-]+       # username
      @
      ([\w\d.]+\.)+    # domain name prefix
      (com|org|edu)
    )

    >?
    ''',
    re.UNICODE | re.VERBOSE)

candidates = [
    u'first.last@example.com',
    u'first.last+category@gmail.com',
    u'valid-address@mail.example.com',
    u'not-valid@example.foo',
    u'First Last <first.last@example.com>',
    u'No Brackets first.last@example.com',
    u'First Last',
    u'First Middle Last <first.last@example.com>',
    u'First M. Last <first.last@example.com>',
    u'<first.last@example.com>',
    ]

for candidate in candidates:
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print ' Name :', match.groupdict()['name']
        print ' Email:', match.groupdict()['email']
    else:
        print ' No match'

Interpreter output:

Candidate: first.last@example.com
 Name : None
 Email: first.last@example.com
Candidate: first.last+category@gmail.com
 Name : None
 Email: first.last+category@gmail.com
Candidate: valid-address@mail.example.com
 Name : None
 Email: valid-address@mail.example.com
Candidate: not-valid@example.foo
 No match
Candidate: First Last <first.last@example.com>
 Name : First Last
 Email: first.last@example.com
Candidate: No Brackets first.last@example.com
 Name : None
 Email: first.last@example.com
Candidate: First Last
 No match
Candidate: First Middle Last <first.last@example.com>
 Name : First Middle Last
 Email: first.last@example.com
Candidate: First M. Last <first.last@example.com>
 Name : First M. Last
 Email: first.last@example.com
Candidate: <first.last@example.com>
 Name : None
 Email: first.last@example.com

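Embedding Flags in Patterns

If flags cannot be added when compiling an expression, such as when the pattern is passed as an argument to a library function that will compile it later, the flags can be embedded inside the expression string itself. For example, to turn on case-insensitive matching, add (?i) to the beginning of the expression.

import re

text = 'This is some text -- with punctuation.'
pattern = r'(?i)\bT\w+'
regex = re.compile(pattern)

print 'Text     :', text
print 'Pattern  :', pattern
print 'Matches  :', regex.findall(text)

Interpreter output:

Text     : This is some text -- with punctuation.
Pattern  : (?i)\bT\w+
Matches  : ['This', 'text']

The abbreviations for all of the flags are:

Flag          Abbreviation
IGNORECASE    i
MULTILINE     m
DOTALL        s
UNICODE       u
VERBOSE       x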
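Several abbreviations can share one group, for example (?im). The Python 3 sketch below checks that the embedded form gives the same result as passing the flags explicitly; the input text is made up for illustration.

```python
import re

text = 'first line\nSecond line\nsecond again'

# Flags embedded at the start of the pattern string.
inline = re.findall(r'(?im)^second', text)

# The same flags passed as an argument.
explicit = re.findall(r'^second', text, re.IGNORECASE | re.MULTILINE)

print(inline, explicit)
```

The embedded flags are placed at the very start of the pattern here; recent Python 3 releases reject global flags that appear anywhere else in the expression.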

3.8 Looking Ahead, or Behind

In many cases it is useful to match one part of a pattern only if some other part also matches. For example, the email expression above can be changed to match only when the angle brackets are paired or both missing. The modified version below uses a positive look ahead assertion to match the pair. The look ahead assertion syntax is (?=pattern):

import re

address = re.compile(
    '''
    # The name is no longer optional
    ((?P<name>
       ([\w.,]+\s+)*[\w.,]+)
       \s+
    )

    # Look ahead: the rest is either wrapped in angle brackets or has none
    (?= (<.*>$)
        |
        ([^<].*[^>]$)
    )

    <?

    (?P<email>
      [\w\d.+-]+       # username
      @
      ([\w\d.]+\.)+    # domain name prefix
      (com|org|edu)
    )

    >?
    ''',
    re.UNICODE | re.VERBOSE)

candidates = [
    u'first.last@example.com',
    u'No Brackets first.last@example.com',
    u'Open Bracket <first.last@example.com>',
    u'Close Bracket first.last@example.com>',
    ]

for candidate in candidates:
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print ' Name :', match.groupdict()['name']
        print ' Email:', match.groupdict()['email']
    else:
        print ' No match'

Interpreter output:

Candidate: first.last@example.com
 No match
Candidate: No Brackets first.last@example.com
 Name : No Brackets
 Email: first.last@example.com
Candidate: Open Bracket <first.last@example.com>
 Name : Open Bracket
 Email: first.last@example.com
Candidate: Close Bracket first.last@example.com>
 No match
   
A negative look ahead assertion, (?!pattern), requires that the pattern not match the text following the current point. For example, the email recognition pattern can be modified to ignore the noreply addresses commonly used by automated systems:

import re

address = re.compile(
    '''
    ^
    (?!noreply@.*$)  # ignore noreply addresses
    [\w\d.+-]+       # username
    @
    ([\w\d.]+\.)+    # domain name prefix
    (com|org|edu)
    $
    ''',
    re.UNICODE | re.VERBOSE)

candidates = [
    u'first.last@example.com',
    u'noreply@example.com',
    ]

for candidate in candidates:
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print ' Match:', candidate[match.start():match.end()]
    else:
        print ' No match'

Interpreter output:

Candidate: first.last@example.com
 Match: first.last@example.com
Candidate: noreply@example.com
 No match
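The corresponding negative look behind assertion uses the syntax (?<!pattern). It is placed after the username, so that the match fails when the text just before the @ is noreply:

address = re.compile(
    '''
    ^
    [\w\d.+-]+       # username
    (?<!noreply)     # ignore noreply addresses
    @
    ([\w\d.]+\.)+    # domain name prefix
    (com|org|edu)
    $
    ''',
    re.UNICODE | re.VERBOSE)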
The syntax (?<=pattern) provides a positive look behind assertion for finding text that follows a given pattern:

import re

twitter = re.compile(
    '''(?<=@)([\w\d_]+)''',
    re.UNICODE | re.VERBOSE)

text = '''This text includes two Twitter handles.
One for @ThePSF, and one for the author, @doughellmann.
'''

print text
for match in twitter.findall(text):
    print 'Handle:', match

Interpreter output:

This text includes two Twitter handles.
One for @ThePSF, and one for the author, @doughellmann.

Handle: ThePSF
Handle: doughellmann

3.9 Self-referencing Expressions

Matched values can be used in later parts of an expression. The easiest way is to refer to a previously matched group by its id number, using \num:

import re

address = re.compile(
    r'''
    (\w+)               # first name
    \s+
    (([\w.]+)\s+)?      # optional middle name or initial
    (\w+)               # last name
    \s+
    <
    (?P<email>
      \1                # first name
      \.
      \4                # last name
      @
      ([\w\d.]+\.)+
      (com|org|edu)
    )
    >
    ''',
    re.UNICODE | re.VERBOSE | re.IGNORECASE)

candidates = [
    u'First Last <first.last@example.com>',
    u'Different Name <first.last@example.com>',
    u'First Middle Last <first.last@example.com>',
    u'First M. Last <first.last@example.com>',
    ]

for candidate in candidates:
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print ' Match name:', match.group(1), match.group(4)
        print ' Match email:', match.group(5)
    else:
        print ' No match'

Interpreter output:

Candidate: First Last <first.last@example.com>
 Match name: First Last
 Match email: first.last@example.com
Candidate: Different Name <first.last@example.com>
 No match
Candidate: First Middle Last <first.last@example.com>
 Match name: First Last
 Match email: first.last@example.com
Candidate: First M. Last <first.last@example.com>
 Match name: First Last
 Match email: first.last@example.com
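Creating back-references by numerical id has two disadvantages. First, the groups have to be renumbered whenever the expression changes, which makes it hard to maintain. Second, at most 99 references can be created this way, and an expression with more than 99 groups has more serious maintenance problems anyway.

Python's expression parser therefore also accepts (?P=name) to refer to the value of a named group matched earlier in the expression:

address = re.compile(
    r'''
    (?P<first_name>\w+)     # first name
    \s+
    (([\w.]+)\s+)?          # optional middle name or initial
    (?P<last_name>\w+)      # last name
    \s+
    <
    (?P<email>
      (?P=first_name)
      \.
      (?P=last_name)
      @
      ([\w\d.]+\.)+
      (com|org|edu)
    )
    >
    ''',
    re.UNICODE | re.VERBOSE | re.IGNORECASE)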
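A reduced version of the named back-reference pattern (without the optional middle name) can be exercised as follows. This is a Python 3 sketch, and the two candidate strings are the ones used earlier in this section.

```python
import re

address = re.compile(
    r'''
    (?P<first_name>\w+)
    \s+
    (?P<last_name>\w+)
    \s+
    <
    (?P<email>
      (?P=first_name)      # must repeat the first name
      \.
      (?P=last_name)       # must repeat the last name
      @
      ([\w.]+\.)+
      (com|org|edu)
    )
    >
    ''',
    re.VERBOSE | re.IGNORECASE)

ok = address.search('First Last <first.last@example.com>')
bad = address.search('Different Name <first.last@example.com>')

print(ok.group('email'), bad)
```

IGNORECASE matters here: without it, the back-reference to "First" would not match the lowercase "first" in the address.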
The other mechanism for using back-references in expressions chooses a different pattern depending on whether a previous group matched. The email pattern can be corrected so that the angle brackets are required if a name is present, but not if the email address is by itself. The syntax is (?(id)yes-expression|no-expression), where id is the group name or number, yes-expression is the pattern to use if the group has a value, and no-expression is the pattern to use otherwise.

import re

address = re.compile(
    r'''
    ^
    (?P<name>
       ([\w.]+\s+)*[\w.]+
    )?
    \s*
    (?(name)
      (?P<brackets>(?=(<.*>$)))
      |
      (?=([^<].*[^>]$))
    )
    (?(brackets)<|\s*)
    (?P<email>
      [\w\d.+-]+
      @
      ([\w\d.]+\.)+
      (com|org|edu)
    )
    (?(brackets)>|\s*)
    $
    ''',
    re.UNICODE | re.VERBOSE)

candidates = [
    u'First Last <first.last@example.com>',
    u'No Brackets first.last@example.com',
    u'Open Bracket <first.last@example.com',
    u'no.brackets@example.com',
    ]

for candidate in candidates:
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print ' Match name:', match.groupdict()['name']
        print ' Match email:', match.groupdict()['email']
    else:
        print ' No match'

Interpreter output:

Candidate: First Last <first.last@example.com>
 Match name: First Last
 Match email: first.last@example.com
Candidate: No Brackets first.last@example.com
 No match
Candidate: Open Bracket <first.last@example.com
 No match
Candidate: no.brackets@example.com
 Match name: None
 Match email: no.brackets@example.com

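3.10 Modifying Strings with Patterns

Use sub() to replace all occurrences of a pattern with another string. The text matched by a group can be inserted into the replacement with the \num syntax:

import re

bold = re.compile(r'\*{2}(.*?)\*{2}')
text = 'Make this **bold**. This **too**.'

print 'Text:', text
print 'Bold:', bold.sub(r'<b>\1</b>', text)

Interpreter output:

Text: Make this **bold**. This **too**.
Bold: Make this <b>bold</b>. This <b>too</b>.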
To use named groups in the replacement, use the syntax \g<name>. Pass a value for count to limit the number of substitutions performed:

import re

bold = re.compile(r'\*{2}(?P<bold_text>.*?)\*{2}', re.UNICODE)
text = 'Make this **bold**. This **too**.'

print 'Text:', text
print 'Bold:', bold.sub(r'<b>\g<bold_text></b>', text, count=1)

Interpreter output:

Text: Make this **bold**. This **too**.
Bold: Make this <b>bold</b>. This **too**.
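When the number of replacements is of interest, subn() can be used in place of sub(); it returns the new string together with the count. A Python 3 sketch with the same bold pattern, where the HTML-style replacement string is only an example:

```python
import re

bold = re.compile(r'\*{2}(.*?)\*{2}')
text = 'Make this **bold**. This **too**.'

# subn() returns a (new_string, number_of_substitutions) tuple.
result, count = bold.subn(r'<b>\1</b>', text)

print(result)
print(count)
```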

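3.11 Splitting with Patterns

str.split() is one of the most frequently used methods for breaking strings apart to parse them, but it only accepts a literal separator. When paragraphs are separated by two or more newline characters, a regular expression is needed. A first attempt uses findall() with the pattern (.+?)\n{2,}:

import re

text = '''Paragraph one
on two lines.

Paragraph two.


Paragraph three.'''

for num, para in enumerate(re.findall(r'(.+?)\n{2,}',
                                      text,
                                      flags=re.DOTALL)
                           ):
    print num, repr(para)
    print

Interpreter output (note the {2,} in the pattern):

0 'Paragraph one\non two lines.'

1 'Paragraph two.'
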
That pattern misses the final paragraph, because no blank line follows it. Extending the pattern to accept the end of the input fixes the problem, but re.split() handles the task more simply:

import re

text = '''Paragraph one
on two lines.

Paragraph two.


Paragraph three.'''

print 'With findall:'
for num, para in enumerate(re.findall(r'(.+?)(\n{2,}|$)',
                                      text,
                                      flags=re.DOTALL)
                           ):
    print num, repr(para)
    print

print
print 'With split:'
for num, para in enumerate(re.split(r'\n{2,}', text)):
    print num, repr(para)
    print

Interpreter output:

With findall:
0 ('Paragraph one\non two lines.', '\n\n')

1 ('Paragraph two.', '\n\n\n')

2 ('Paragraph three.', '')


With split:
0 'Paragraph one\non two lines.'

1 'Paragraph two.'

2 'Paragraph three.'

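If the separators themselves are needed, for example to rebuild the text later, enclosing the split pattern in parentheses makes re.split() return them as well. A Python 3 sketch on the same three paragraphs:

```python
import re

text = 'Paragraph one\non two lines.\n\nParagraph two.\n\n\nParagraph three.'

# Without a group, only the paragraphs are returned.
paragraphs = re.split(r'\n{2,}', text)

# With a capturing group, each separator appears between the paragraphs.
with_separators = re.split(r'(\n{2,})', text)

print(paragraphs)
print(with_separators)
```

Joining the second list with an empty string gives back the original text.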

Reposted from: https://my.oschina.net/voler/blog/380925
