Regular Expression (matching vs searching)

** Searching vs. matching
searching: looks for the pattern anywhere in the string - use the search() function
matching: tries the pattern only at the beginning of the string - use the match() function


** Common Regular Expression Symbols and Special Characters

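foo|bar : foo or bar
[^A-Za-z0-9_] : any single character NOT listed inside the [] (a ^ inside [] means negation, not start of string)
.*?[a-z] : the shortest possible run of characters followed by a letter a-z. A ? placed after *, +, ? or {} makes that quantifier non-greedy
f(oo|u)bar : foobar or fubar. Single-character alternatives such as (o|u) can be replaced by a character class such as [ou]

\d : [0-9] <---- \D is the opposite
\w : [A-Za-z0-9_] <---- \W is the opposite
\s : [ \n\t\r\v\f] <---- \S is the opposite
\b : word boundary
\N : backreference to the text matched by subgroup N, where a subgroup is a (...) in the pattern (e.g. \1, \2)
\c : matches the special character c literally. e.g. \., \\, \*
\A (\Z) : start (end) of the string, similar to ^ and $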

Examples:

[0-9]{15,16} : 15 or 16 digits in the range 0-9
\d{3}-\d{3}-\d{4} : 3 digits-3 digits-4 digits (fits phone numbers such as 800-555-1212)
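
A minimal sketch of how the two example patterns might be checked, assuming Python 3.4 or later. The helper name is_us_phone and the sample strings are illustrative assumptions, not part of the original notes; fullmatch() is used so that the whole string has to fit the pattern.

```python
import re

# Pattern from the notes: 3 digits, dash, 3 digits, dash, 4 digits.
PHONE_RE = re.compile(r'\d{3}-\d{3}-\d{4}')
# 15 or 16 consecutive digits.
CARD_RE = re.compile(r'[0-9]{15,16}')


def is_us_phone(text):
    """Return True if the whole string looks like 800-555-1212 (hypothetical helper)."""
    return PHONE_RE.fullmatch(text) is not None


print(is_us_phone('800-555-1212'))               # True
print(is_us_phone('800-555-121'))                # False: last block has only 3 digits
print(CARD_RE.fullmatch('1' * 15) is not None)   # True
print(CARD_RE.fullmatch('1' * 14) is not None)   # False: one digit short
```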


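** Using parentheses ()
: group part of the pattern into a subgroup, or save the matched text for later use

\d+(\.\d*)? : the subgroup (a dot followed by zero or more digits) is made optional by the ?, so it appears zero times or once. So 0.004, 2, 75. and so on all match.

(Mr?s?\. )?[A-Z][a-z]* [ A-Za-z-]+ : read as follows
(Mr?s?\. )? <----- the RE inside the () appears zero times or once
Mr?s?\. <---- M, then an optional r, then an optional s, then a literal dot (.) and a space
So Mrs. Mr. Ms. M. and so on match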
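
The two patterns above can be tried out with a short sketch like the following (Python 3.4+ assumed). The sample names are made up for illustration; group(1) shows what the optional subgroup captured, or None when it was skipped.

```python
import re

# Decimal number: digits, optionally followed by a dot and more digits.
NUMBER_RE = re.compile(r'\d+(\.\d*)?')
# Optional title, capitalized first name, then the rest of the name.
NAME_RE = re.compile(r'(Mr?s?\. )?[A-Z][a-z]* [ A-Za-z-]+')

for text in ['0.004', '2', '75.']:
    m = NUMBER_RE.fullmatch(text)
    # group(1) is the optional fractional part, or None when it is absent
    print(text, '->', repr(m.group(1)))    # '.004', None, '.'

for text in ['Mr. John Smith', 'Mrs. Ann Lee-Park', 'Bob Jones']:
    m = NAME_RE.fullmatch(text)
    # group(1) is the optional title including its trailing space
    print(text, '->', repr(m.group(1)))    # 'Mr. ', 'Mrs. ', None
```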


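** match()
- match() is the first re module function covered here; it is also a method of compiled regex objects.
- match() takes a pattern and a string and returns a match object if the pattern matches at the start of the string
- None is returned if it does not match
- When a match object is returned, use its group() method to get the matched text

>>> import re

>>> m = re.match('foo', 'food on the table') <------ match
>>> m
<_sre.SRE_Match object at 0x103635ed0> <------ object reference (Python 2 repr; Python 3.7+ prints <re.Match object; span=(0, 3), match='foo'>)
>>> m.group()
'foo'

>>> m = re.match('fooX', 'food on the table') <--------- no match
>>> m <-------- None is returned
>>> m.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group' <------ AttributeError is raised

Because match() returns an object, the calls can be chained as below (this raises AttributeError when there is no match):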

>>> re.match('foo', 'food on the table').group()
'foo'
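
The chained call raises AttributeError whenever nothing matches. A minimal sketch of a safer wrapper follows; first_match is a hypothetical helper name, not something provided by the re module.

```python
import re


def first_match(pattern, text, default=None):
    """Return the text matched at the start of `text`, or `default` (hypothetical helper)."""
    m = re.match(pattern, text)
    return m.group() if m is not None else default


print(first_match('foo', 'food on the table'))             # foo
print(first_match('fooX', 'food on the table'))            # None
print(repr(first_match('fooX', 'food on the table', '')))  # ''
```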


** search()

- If the string food is changed to seafood as below and match() is called, the pattern no longer matches.

>>> re.match('foo', 'seafood on the table').group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

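The reason is that match() tries the pattern from the beginning of the string. It fails because f and s do not match.
search() scans the whole string and returns the first place where the pattern matches.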

>>> re.search('foo', 'seafood on the table').group()
'foo'
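
The difference can be seen side by side in the sketch below. Python 3.4+ is assumed, because re.fullmatch (not covered in these notes) is included to show what matching the whole string would require.

```python
import re

text = 'seafood on the table'

print(re.match('foo', text))                     # None: 'foo' is not at position 0
print(re.search('foo', text).span())             # (3, 6): found inside 'seafood'
print(re.match('sea', text).group())             # sea: a matching prefix is enough for match()
print(re.fullmatch('sea', text))                 # None: the pattern must cover the whole string
print(re.fullmatch('sea.*table', text).group())  # seafood on the table
```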


##############################
# Regular Expression usage examples
##############################

** Matching more than one string pattern (|)

>>> bt = 'bat|bet|bit'
>>> m = re.match(bt, 'bat')
>>> if m is not None: m.group()
... 
'bat'


** Any single character (.)
: the dot (.) matches any character except \n; it needs an actual character, so it does not match an empty position

>>> anyend = '.end'
>>> re.match(anyend, 'bend').group() <-------- MATCH
>>> re.match(anyend, '\nend').group() <-------- NO MATCH: \n in front of end
>>> re.match(anyend, 'end').group() <-------- NO MATCH: no character in front of end


** Character classes ([])

- How do [cr][23][dp][o2] and r2d2|c3po differ?
- r2d2|c3po: matches only r2d2 or c3po
- [cr][23][dp][o2] matches 16 (2x2x2x2) different strings such as c2do, c2d2, ...; r2d2 and c3po are just 2 of them.
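
The count of 16 can be checked with a small sketch (Python 3.4+ assumed). It builds every 4-character candidate with itertools.product and tests each one against both patterns; the variable names are illustrative.

```python
import re
from itertools import product

class_re = re.compile(r'[cr][23][dp][o2]')
alt_re = re.compile(r'r2d2|c3po')

# Build every 4-character string the character-class pattern can produce.
candidates = [''.join(chars) for chars in product('cr', '23', 'dp', 'o2')]

class_hits = [s for s in candidates if class_re.fullmatch(s)]
alt_hits = [s for s in candidates if alt_re.fullmatch(s)]

print(len(class_hits))   # 16
print(alt_hits)          # ['c3po', 'r2d2']
```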


** Repetition and grouping

- Write an RE for an e-mail address

>>> patt = r'\w+@(\w+\.)?\w+\.com'
>>> re.match(patt, 'nobody@xxx.com').group()
'nobody@xxx.com'
>>> re.match(patt, 'nobody@www.xxx.com').group()
'nobody@www.xxx.com'

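The pattern above reads as follows:
\w+ : one or more word characters
(\w+\.)? : one or more word characters ending in a dot (.), appearing zero times or once
\w+\.com : one or more word characters followed by .com

- A pattern that matches strings such as abc-123

>>> m = re.match(r'\w\w\w-\d\d\d', 'abc-123')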
>>> if m is not None: m.group()
... 
'abc-123'

As shown above, \w and \d make the pattern easy to write.
To get only abc or only 123, use grouping.

>>> re.match(r'(\w\w\w)-(\d\d\d)', 'abc-123').group()
'abc-123'
>>> re.match(r'(\w\w\w)-(\d\d\d)', 'abc-123').group(1)
'abc'
>>> re.match(r'(\w\w\w)-(\d\d\d)', 'abc-123').group(2)
'123'
>>> re.match(r'(\w\w\w)-(\d\d\d)', 'abc-123').groups() <---- returned as a tuple
('abc', '123')

The pattern above can be written more compactly as follows.

>>> re.match(r'(\w{3})-(\d{3})', 'abc-123').group(1)
'abc'

A tuple with a single element, and grouping with nested parentheses

>>> re.match('(ab)', 'ab').groups() <------- groups() always returns a tuple
('ab',) <------ tuple with one element

>>> m = re.match('(a(b))', 'ab')
>>> m.group(1)
'ab'
>>> m.group(2)
'b'
>>> m.groups()
('ab', 'b')


** Word boundary (\b)

>>> re.search(r'\bthe', 'bite the dog').group() <------- boundary in front of the
'the'

>>> re.search(r'\bthe', 'bitethe dog').group() <------- no boundary in front of the
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

>>> re.search(r'\Bthe', 'bitethe dog').group() <------- no boundary in front of the (using \B)
'the'

Note. It is best to use raw strings for REs. The reason is explained later...
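
One reason can be sketched right away: in a normal Python string literal, '\b' is the backspace character, so the regex engine never sees a word-boundary escape. The snippet below is an illustration of that point, not the later explanation the note refers to.

```python
import re

# In a normal string literal, '\b' is the single backspace character (0x08).
print(len('\b'), len(r'\b'))                         # 1 2

# So the non-raw pattern looks for a backspace, not a word boundary.
print(re.search('\bthe', 'bite the dog'))            # None
print(re.search(r'\bthe', 'bite the dog').group())   # the

# Without a raw string the backslash has to be doubled.
print(re.search('\\bthe', 'bite the dog').group())   # the
```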

//////////////////////////////////////////////////////

A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular expressions are widely used in the UNIX world.

The module re provides full support for Perl-like regular expressions in Python. The re module raises the exception re.error if an error occurs while compiling or using a regular expression.

We will cover two important functions used to handle regular expressions. But a small thing first: various characters have a special meaning when they are used in a regular expression. To avoid confusion while dealing with regular expressions, we will use raw strings, written as r'expression'.

The match Function

This function attempts to match the RE pattern against the beginning of string, with optional flags.

Here is the syntax for this function:

re.match(pattern, string, flags=0)

Here is the description of the parameters:

Parameter : Description
pattern : The regular expression to be matched.
string : The string that is checked for a match of the pattern at its beginning.
flags : You can specify different flags using bitwise OR (|). These are the modifiers listed in the option flags table below.

The re.match function returns a match object on success, None on failure. We use the group(num) or groups() method of the match object to get the matched expression.

Match Object Method : Description
group(num=0) : Returns the entire match (or the specific subgroup num).
groups() : Returns all matching subgroups in a tuple (empty if there weren't any).

EXAMPLE:

#!/usr/bin/env python3
import re

line = "Cats are smarter than dogs"

# (.*) captures the text before " are"; (\.*) captures zero or more literal dots after it
matchObj = re.match(r'(.*) are(\.*)', line, re.M|re.I)

if matchObj:
   print("matchObj.group() : ", matchObj.group())
   print("matchObj.group(1) : ", matchObj.group(1))
   print("matchObj.group(2) : ", matchObj.group(2))
else:
   print("No match!!")

When the above code is executed, it produces the following result:

matchObj.group() :  Cats are
matchObj.group(1) :  Cats
matchObj.group(2) :  

The search Function

This function searches for the first occurrence of the RE pattern within string, with optional flags.

Here is the syntax for this function:

re.search(pattern, string, flags=0)

Here is the description of the parameters:

Parameter : Description
pattern : The regular expression to be matched.
string : The string that is searched for a match of the pattern anywhere in it.
flags : You can specify different flags using bitwise OR (|). These are the modifiers listed in the option flags table below.

The re.search function returns a match object on success, None on failure. We use the group(num) or groups() method of the match object to get the matched expression.

Match Object Method : Description
group(num=0) : Returns the entire match (or the specific subgroup num).
groups() : Returns all matching subgroups in a tuple (empty if there weren't any).

EXAMPLE:

#!/usr/bin/env python3
import re

line = "Cats are smarter than dogs"

# Same pattern as in the match example; here search() also finds it at position 0
matchObj = re.search(r'(.*) are(\.*)', line, re.M|re.I)

if matchObj:
   print("matchObj.group() : ", matchObj.group())
   print("matchObj.group(1) : ", matchObj.group(1))
   print("matchObj.group(2) : ", matchObj.group(2))
else:
   print("No match!!")

When the above code is executed, it produces the following result:

matchObj.group() :  Cats are
matchObj.group(1) :  Cats
matchObj.group(2) :  

Matching vs Searching:

Python offers two different primitive operations based on regular expressions: match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string (this is what Perl does by default).

EXAMPLE:

#!/usr/bin/env python3
import re

line = "Cats are smarter than dogs"

# match() only tries the pattern at the start of the string
matchObj = re.match(r'dogs', line, re.M|re.I)
if matchObj:
   print("match --> matchObj.group() : ", matchObj.group())
else:
   print("No match!!")

# search() scans the whole string
matchObj = re.search(r'dogs', line, re.M|re.I)
if matchObj:
   print("search --> matchObj.group() : ", matchObj.group())
else:
   print("No match!!")

When the above code is executed, it produces the following result:

No match!!
search --> matchObj.group() :  dogs

Search and Replace:

One of the most important re functions is sub.

SYNTAX:

re.sub(pattern, repl, string, count=0)

This function replaces all occurrences of the RE pattern in string with repl, unless count is given, in which case at most count occurrences are replaced. It returns the modified string.

EXAMPLE:

#!/usr/bin/env python3
import re

phone = "2004-959-559 #This is Phone Number"

# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print("Phone Num : ", num)

# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print("Phone Num : ", num)

When the above code is executed, it produces the following result:

Phone Num :  2004-959-559
Phone Num :  2004959559
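
As a further sketch (Python 3 assumed), the optional fourth argument, named count in the re module, limits the number of replacements. The second call is an illustrative variant of the comment-stripping pattern that also removes the space in front of the #; the variable names are made up for this example.

```python
import re

phone = "2004-959-559 #This is Phone Number"

# count=1 replaces only the first dash; the default count=0 replaces all of them.
first_dash_removed = re.sub(r'-', '', phone, count=1)
print(first_dash_removed)    # 2004959-559 #This is Phone Number

# Strip the comment together with the whitespace in front of it.
stripped = re.sub(r'\s*#.*$', '', phone)
print(repr(stripped))        # '2004-959-559'
```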

Regular-expression Modifiers - Option Flags

Functions in the re module accept an optional modifier that controls various aspects of matching. Modifiers are passed as the optional flags argument. You can combine several modifiers using bitwise OR (|), as shown previously; each one is one of these:

Modifier : Description
re.I : Performs case-insensitive matching.
re.L : Interprets words according to the current locale. This interpretation affects the alphabetic group (\w and \W), as well as word boundary behavior (\b and \B).
re.M : Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string).
re.S : Makes a period (dot) match any character, including a newline.
re.U : Interprets letters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, \B.
re.X : Permits "cuter" (verbose) regular expression syntax. It ignores whitespace (except inside a set [] or when escaped by a backslash), and treats unescaped # as a comment marker.
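
A short sketch of several of these flags in action (Python 3 assumed). The sample text is made up, and re.findall, which belongs to a different part of the re module than the functions covered here, is used only to list every match.

```python
import re

text = "first line\nSecond LINE"

# re.I: ignore case.
print(re.findall(r'line', text, re.I))                     # ['line', 'LINE']

# re.M: ^ also matches right after each newline.
print(re.findall(r'^\w+', text))                           # ['first']
print(re.findall(r'^\w+', text, re.M))                     # ['first', 'Second']

# re.S: the dot also matches the newline.
print(re.search(r'first.*LINE', text) is None)             # True
print(re.search(r'first.*LINE', text, re.S) is not None)   # True

# Flags are combined with bitwise OR; re.X allows whitespace and comments.
pattern = re.compile(r"""
    ^second     # word at the start of a line
    \s+ line    # whitespace, then the word line
""", re.I | re.M | re.X)
print(pattern.search(text).group())                        # Second LINE
```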

Regular-expression patterns:

Except for the metacharacters (+ ? . * ^ $ ( ) [ ] { } | \), all characters match themselves. You can escape a metacharacter by preceding it with a backslash.

The following table lists regular expression syntax. A few entries are Perl/Ruby syntax; they are marked where Python's re module behaves differently.

Pattern : Description
^ : Matches beginning of the string (and of each line with re.M).
$ : Matches end of the string or just before a trailing newline (and end of each line with re.M).
. : Matches any single character except newline. Using the s option (re.S) allows it to match newline as well.
[...] : Matches any single character in brackets.
[^...] : Matches any single character not in brackets.
re* : Matches 0 or more occurrences of preceding expression.
re+ : Matches 1 or more occurrences of preceding expression.
re? : Matches 0 or 1 occurrence of preceding expression.
re{n} : Matches exactly n occurrences of preceding expression.
re{n,} : Matches n or more occurrences of preceding expression.
re{n,m} : Matches at least n and at most m occurrences of preceding expression.
a|b : Matches either a or b.
(re) : Groups regular expressions and remembers matched text.
(?imx) : Turns on the i, m, or x options. In Python it applies to the whole pattern and should be placed at the start of the pattern.
(?-imx) : Turns off the i, m, or x options (Perl/Ruby syntax; Python's re only supports the scoped form (?-imx:re)).
(?:re) : Groups regular expressions without remembering matched text.
(?imx:re) : Turns on the i, m, or x options within the parentheses only (Python 3.6+).
(?-imx:re) : Turns off the i, m, or x options within the parentheses only (Python 3.6+).
(?#...) : Comment.
(?=re) : Lookahead: specifies a position using a pattern. Consumes no characters.
(?!re) : Negative lookahead: specifies a position using pattern negation. Consumes no characters.
(?>re) : Matches independent pattern without backtracking (atomic group; Python 3.11+).
\w : Matches word characters.
\W : Matches nonword characters.
\s : Matches whitespace. Equivalent to [ \t\n\r\f\v].
\S : Matches nonwhitespace.
\d : Matches digits. Equivalent to [0-9].
\D : Matches nondigits.
\A : Matches beginning of string.
\Z : Matches end of string. In Perl/Ruby it also matches just before a final newline; in Python it matches only at the very end.
\z : Matches end of string (Perl/Ruby syntax; in Python use \Z).
\G : Matches point where last match finished (Perl/Ruby syntax; not supported by Python's re module).
\b : Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets.
\B : Matches nonword boundaries.
\n, \t, etc. : Matches newlines, tabs, etc.
\1...\9 : Matches nth grouped subexpression.
\10 : Matches nth grouped subexpression if it matched already. Otherwise refers to the octal representation of a character code.

Regular-expression Examples

Literal characters:

Example : Description
python : Match "python".

Character classes:

Example : Description
[Pp]ython : Match "Python" or "python"
rub[ye] : Match "ruby" or "rube"
[aeiou] : Match any one lowercase vowel
[0-9] : Match any digit; same as [0123456789]
[a-z] : Match any lowercase ASCII letter
[A-Z] : Match any uppercase ASCII letter
[a-zA-Z0-9] : Match any of the above
[^aeiou] : Match anything other than a lowercase vowel
[^0-9] : Match anything other than a digit

Special Character Classes:

Example : Description
. : Match any character except newline
\d : Match a digit: [0-9]
\D : Match a nondigit: [^0-9]
\s : Match a whitespace character: [ \t\r\n\f]
\S : Match nonwhitespace: [^ \t\r\n\f]
\w : Match a single word character: [A-Za-z0-9_]
\W : Match a nonword character: [^A-Za-z0-9_]

Repetition Cases:

Example : Description
ruby? : Match "rub" or "ruby": the y is optional
ruby* : Match "rub" plus 0 or more ys
ruby+ : Match "rub" plus 1 or more ys
\d{3} : Match exactly 3 digits
\d{3,} : Match 3 or more digits
\d{3,5} : Match 3, 4, or 5 digits

Nongreedy repetition:

A ? after a quantifier makes it match the smallest number of repetitions:

Example : Description
<.*> : Greedy repetition: matches "<python>perl>"
<.*?> : Nongreedy: matches "<python>" in "<python>perl>"
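
The two rows can be reproduced with a few lines of Python. The second pair of calls applies the same idea to the .*?[a-z] pattern; the sample string '123abc' is an assumption made for illustration.

```python
import re

text = '<python>perl>'

greedy = re.match(r'<.*>', text).group()
lazy = re.match(r'<.*?>', text).group()
print(greedy)   # <python>perl>
print(lazy)     # <python>

# Greedy .* runs to the last letter, lazy .*? stops at the first one.
print(re.match(r'.*[a-z]', '123abc').group())    # 123abc
print(re.match(r'.*?[a-z]', '123abc').group())   # 123a
```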

Grouping with parentheses:

Example : Description
\D\d+ : No group: + repeats \d
(\D\d)+ : Grouped: + repeats \D\d pair
([Pp]ython(, )?)+ : Match "Python", "Python, python, python", etc.

Backreferences:

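A backreference matches a previously matched group again:

Example : Description
([Pp])ython&\1ails : Match python&pails or Python&Pails (\1 repeats the captured p or P)
(['"]).*?\1 : Single or double-quoted string. \1 matches whatever the 1st group matched. \2 matches whatever the 2nd group matched, etc. (A backreference does not work inside a character class such as [^\1], so a nongreedy .*? is used here.)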
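
A small sketch of backreferences in Python 3. Inside a character class Python treats \1 as an octal character escape rather than a backreference, so the quoted-string pattern below uses a lazy .*? between the quotes; the sample sentences are made up.

```python
import re

# \1 must repeat exactly what group 1 captured, including its case.
pattern = re.compile(r'([Pp])ython&\1ails')
print(pattern.fullmatch('python&pails') is not None)   # True
print(pattern.fullmatch('Python&Pails') is not None)   # True
print(pattern.fullmatch('Python&pails') is not None)   # False

# Quoted string: the closing quote has to be the same character as the opening one.
quoted = re.compile(r'''(['"])(.*?)\1''')
m = quoted.search('say "it\'s fine" now')
print(m.group(1))   # "
print(m.group(2))   # it's fine
```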

Alternatives:

Example : Description
python|perl : Match "python" or "perl"
rub(y|le) : Match "ruby" or "ruble"
Python(!+|\?) : "Python" followed by one or more ! or one ?

Anchors:

Anchors specify the match position:

Example : Description
^Python : Match "Python" at the start of a string, or of an internal line with re.M
Python$ : Match "Python" at the end of a string, or of a line with re.M
\APython : Match "Python" at the start of a string
Python\Z : Match "Python" at the end of a string
\bPython\b : Match "Python" at a word boundary
\brub\B : \B is a nonword boundary: match "rub" in "rube" and "ruby" but not alone
Python(?=!) : Match "Python", if followed by an exclamation point
Python(?!!) : Match "Python", if not followed by an exclamation point
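
The anchor rows can be checked with a sketch like this one (Python 3 assumed). The two-line sample text is an assumption chosen so that each anchor behaves differently; re.findall is used only to count matches.

```python
import re

text = "Python!\nPython"

# ^ matches only at the start of the string unless re.M is given.
print(len(re.findall(r'^Python', text)))          # 1
print(len(re.findall(r'^Python', text, re.M)))    # 2

# \A never matches at an internal line start, even with re.M.
print(len(re.findall(r'\APython', text, re.M)))   # 1

# Lookahead checks the next character without consuming it.
print(re.search(r'Python(?=!)', text).span())     # (0, 6)
print(re.search(r'Python(?!!)', text).span())     # (8, 14)

# \B requires that no word boundary follows 'rub'.
print(re.search(r'\brub\B', 'rub ruby').span())   # (4, 7)
```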

Special syntax with parentheses:

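Example : Description
R(?#comment) : Matches "R". All the rest is a comment
R(?i)uby : Case-insensitive while matching "uby" (Perl/Ruby behavior; in Python a bare (?i) applies to the whole pattern and recent versions reject it unless it is at the start, so use the next form)
R(?i:uby) : Same as above (Python 3.6+)
rub(?:y|le) : Group only, without creating a \1 backreference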
