Python code coverage for Lib/re.py

#
# Secret Labs' Regular Expression Engine
#
# re-compatible interface for the sre matching engine
#
# Copyright (c) 1998-2001 by Secret Labs AB. All rights reserved.
#
# This version of the SRE library can be redistributed under CNRI's
# Python 1.6 license. For any other use, please contact Secret Labs
# AB (info@pythonware.com).
#
# Portions of this engine have been developed in cooperation with
# CNRI. Hewlett-Packard provided funding for 1.6 integration and
# other compatibility work.
#

r"""Support for regular expressions (RE).

This module provides regular expression matching operations similar to
those found in Perl. It supports both 8-bit and Unicode strings; both
the pattern and the strings being processed can contain null bytes and
characters outside the US ASCII range.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like "A", "a", or "0", are the simplest
regular expressions; they simply match themselves. You can
concatenate ordinary characters, so last matches the string 'last'.

The special characters are:
    "."      Matches any character except a newline.
    "^"      Matches the start of the string.
    "$"      Matches the end of the string or just before the newline at
             the end of the string.
    "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
             Greedy means that it will match as many repetitions as possible.
    "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
    "?"      Matches 0 or 1 (greedy) of the preceding RE.
    *?,+?,?? Non-greedy versions of the previous three special characters.
    {m,n}    Matches from m to n repetitions of the preceding RE.
    {m,n}?   Non-greedy version of the above.
    "\\"     Either escapes special characters or signals a special sequence.
    []       Indicates a set of characters.
             A "^" as the first character indicates a complementing set.
    "|"      A|B, creates an RE that will match either A or B.
    (...)    Matches the RE inside the parentheses.
             The contents can be retrieved or matched later in the string.
    (?aiLmsux) Set the A, I, L, M, S, U, or X flag for the RE (see below).
    (?:...)  Non-grouping version of regular parentheses.
    (?P<name>...) The substring matched by the group is accessible by name.
    (?P=name)     Matches the text matched earlier by the group named name.
    (?#...)  A comment; ignored.
    (?=...)  Matches if ... matches next, but doesn't consume the string.
    (?!...)  Matches if ... doesn't match next.
    (?<=...) Matches if preceded by ... (must be fixed length).
    (?<!...) Matches if not preceded by ... (must be fixed length).
    (?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
                       the (optional) no pattern otherwise.

The special sequences consist of "\\" and a character from the list
below. If the ordinary character is not on the list, then the
resulting RE will match the second character.
    \number  Matches the contents of the group of the same number.
    \A       Matches only at the start of the string.
    \Z       Matches only at the end of the string.
    \b       Matches the empty string, but only at the start or end of a word.
    \B       Matches the empty string, but not at the start or end of a word.
    \d       Matches any decimal digit; equivalent to the set [0-9] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode digits.
    \D       Matches any non-digit character; equivalent to [^\d].
    \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode whitespace characters.
    \S       Matches any non-whitespace character; equivalent to [^\s].
    \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
             in bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the
             range of Unicode alphanumeric characters (letters plus digits
             plus underscore).
             With LOCALE, it will match the set [0-9_] plus characters defined
             as letters for the current locale.
    \W       Matches the complement of \w.
    \\       Matches a literal backslash.

This module exports the following functions:
    match     Match a regular expression pattern to the beginning of a string.
    fullmatch Match a regular expression pattern to all of a string.
    search    Search a string for the presence of a pattern.
    sub       Substitute occurrences of a pattern found in a string.
    subn      Same as sub, but also return the number of substitutions made.
    split     Split a string by the occurrences of a pattern.
    findall   Find all occurrences of a pattern in a string.
    finditer  Return an iterator yielding a match object for each match.
    compile   Compile a pattern into a RegexObject.
    purge     Clear the regular expression cache.
    escape    Backslash all non-alphanumerics in a string.

Some of the functions in this module take flags as optional parameters:
    A  ASCII       For string patterns, make \w, \W, \b, \B, \d, \D
                   match the corresponding ASCII character categories
                   (rather than the whole Unicode categories, which is the
                   default).
                   For bytes patterns, this flag is the only available
                   behaviour and needn't be specified.
    I  IGNORECASE  Perform case-insensitive matching.
    L  LOCALE      Make \w, \W, \b, \B, dependent on the current locale.
    M  MULTILINE   "^" matches the beginning of lines (after a newline)
                   as well as the string.
                   "$" matches the end of lines (before a newline) as well
                   as the end of the string.
    S  DOTALL      "." matches any character at all, including the newline.
    X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
    U  UNICODE     For compatibility only. Ignored for string patterns (it
                   is the default), and forbidden for bytes patterns.

This module also defines an exception 'error'.

"""

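The syntax summary in the module docstring can be illustrated with a small sketch. The sample strings and variable names below are made up for illustration; only documented `re` calls are used.

```python
import re

text = "<a><b>"

# "*" is greedy: it consumes as much as possible while still matching.
greedy = re.match(r"<.*>", text).group()

# "*?" is the non-greedy version: it stops at the first possible ">".
lazy = re.match(r"<.*?>", text).group()

# (?P<name>...) makes the matched substring accessible by name.
m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})", "2001-07")
year, month = m.group("year"), m.group("month")

# With MULTILINE, "^" also matches right after each newline.
line_starts = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)

# (?=...) is a lookahead: it must match next but consumes nothing.
before_bar = re.search(r"foo(?=bar)", "foobar").group()
```

Here `greedy` is `"<a><b>"` while `lazy` is `"<a>"`, which is the difference the docstring describes between `*` and `*?`.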
import enum
import sre_compile
import sre_parse
import functools
try:
    import _locale
except ImportError:
    _locale = None

# public symbols
__all__ = [
    "match", "fullmatch", "search", "sub", "subn", "split",
    "findall", "finditer", "compile", "purge", "template", "escape",
    "error", "A", "I", "L", "M", "S", "X", "U",
    "ASCII", "IGNORECASE", "LOCALE", "MULTILINE", "DOTALL", "VERBOSE",
    "UNICODE",
]

__version__ = "2.2.1"

class RegexFlag(enum.IntFlag):
    ASCII = sre_compile.SRE_FLAG_ASCII # assume ascii "locale"
    IGNORECASE = sre_compile.SRE_FLAG_IGNORECASE # ignore case
    LOCALE = sre_compile.SRE_FLAG_LOCALE # assume current 8-bit locale
    UNICODE = sre_compile.SRE_FLAG_UNICODE # assume unicode "locale"
    MULTILINE = sre_compile.SRE_FLAG_MULTILINE # make anchors look for newline
    DOTALL = sre_compile.SRE_FLAG_DOTALL # make dot match newline
    VERBOSE = sre_compile.SRE_FLAG_VERBOSE # ignore whitespace and comments
    A = ASCII
    I = IGNORECASE
    L = LOCALE
    U = UNICODE
    M = MULTILINE
    S = DOTALL
    X = VERBOSE
    # sre extensions (experimental, don't rely on these)
    TEMPLATE = sre_compile.SRE_FLAG_TEMPLATE # disable backtracking
    T = TEMPLATE
    DEBUG = sre_compile.SRE_FLAG_DEBUG # dump pattern after compilation
globals().update(RegexFlag.__members__)

# sre exception
error = sre_compile.error

# --------------------------------------------------------------------
# public interface

def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)

def fullmatch(pattern, string, flags=0):
    """Try to apply the pattern to all of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).fullmatch(string)

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

def sub(pattern, repl, string, count=0, flags=0):
    """Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl. repl can be either a string or a callable;
    if a string, backslash escapes in it are processed. If it is
    a callable, it's passed the match object and must return
    a replacement string to be used."""
    return _compile(pattern, flags).sub(repl, string, count)

def subn(pattern, repl, string, count=0, flags=0):
    """Return a 2-tuple containing (new_string, number).
    new_string is the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in the source
    string by the replacement repl. number is the number of
    substitutions that were made. repl can be either a string or a
    callable; if a string, backslash escapes in it are processed.
    If it is a callable, it's passed the match object and must
    return a replacement string to be used."""
    return _compile(pattern, flags).subn(repl, string, count)

def split(pattern, string, maxsplit=0, flags=0):
    """Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings. If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list. If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list."""
    return _compile(pattern, flags).split(string, maxsplit)

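As a rough illustration of the `sub`, `subn`, and `split` docstrings above, the following sketch uses made-up inputs. `count` and `maxsplit` are passed by keyword here, which the signatures above allow.

```python
import re

# repl as a string: backslash escapes such as \1 and \2 are processed.
swapped = re.sub(r"(\w+)@(\w+)", r"\2 at \1", "user@example")

# repl as a callable: it receives the match object and returns the
# replacement string.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2),
                 "3 apples, 10 pears")

# subn also reports how many substitutions were made; count caps them.
new_string, number = re.subn(r"a", "o", "banana", count=2)

# Capturing parentheses in the pattern put the separators in the result;
# maxsplit leaves the remainder as the final element.
parts = re.split(r"(,)", "a,b,c", maxsplit=1)
```

With these inputs `swapped` is `"example at user"`, `doubled` is `"6 apples, 20 pears"`, `subn` returns `("bonona", 2)`, and `parts` is `['a', ',', 'b,c']`.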
def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

def finditer(pattern, string, flags=0):
    """Return an iterator over all non-overlapping matches in the
    string. For each match, the iterator returns a match object.

    Empty matches are included in the result."""
    return _compile(pattern, flags).finditer(string)

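The way `findall` changes its return shape with the number of capturing groups is easy to misread, so here is a short sketch with an invented input string.

```python
import re

text = "x=1, y=22"

# No groups: a list of the full matches.
no_groups = re.findall(r"\w=\d+", text)

# One group: a list of that group's text only.
one_group = re.findall(r"\w=(\d+)", text)

# More than one group: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\w)=(\d+)", text)

# finditer yields match objects, so positions are available as well.
spans = [m.span() for m in re.finditer(r"\d+", text)]
```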
def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a pattern object."
    return _compile(pattern, flags)

def purge():
    "Clear the regular expression caches"
    _cache.clear()
    _compile_repl.cache_clear()

def template(pattern, flags=0):
    "Compile a template pattern, returning a pattern object"
    return _compile(pattern, flags|T)

_alphanum_str = frozenset(
    "_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890")
_alphanum_bytes = frozenset(
    b"_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890")

def escape(pattern):
    """
    Escape all the characters in pattern except ASCII letters, numbers and '_'.
    """
    if isinstance(pattern, str):
        alphanum = _alphanum_str
        s = list(pattern)
        for i, c in enumerate(pattern):
            if c not in alphanum:
                if c == "\000":
                    s[i] = "\\000"
                else:
                    s[i] = "\\" + c
        return "".join(s)
    else:
        alphanum = _alphanum_bytes
        s = []
        esc = ord(b"\\")
        for c in pattern:
            if c in alphanum:
                s.append(c)
            else:
                if c == 0:
                    s.extend(b"\\000")
                else:
                    s.append(esc)
                    s.append(c)
        return bytes(s)

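A minimal usage sketch for `escape`: the exact set of characters that gets a backslash may differ between Python versions, so the example only relies on the property that an escaped string matches its original text literally. The sample strings are invented.

```python
import re

literal = "a.b*c (1+1)?"

# Unescaped, "." and "*" are special, so the pattern matches other text too.
loose = re.search("a.b*c", "aXbbbc") is not None

# Escaped, the pattern matches only the literal text it was built from.
escaped = re.escape(literal)
exact = re.fullmatch(escaped, literal) is not None
rejects_other = re.search(re.escape("a.b*c"), "aXbbbc") is None

# Bytes patterns are handled as well and stay bytes.
escaped_bytes = re.escape(b"1+1")
bytes_ok = re.fullmatch(escaped_bytes, b"1+1") is not None
```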
# --------------------------------------------------------------------
# internals

_cache = {}

_pattern_type = type(sre_compile.compile("", 0))

_MAXCACHE = 512
def _compile(pattern, flags):
    # internal: compile pattern
    try:
        p, loc = _cache[type(pattern), pattern, flags]
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
    except KeyError:
        pass
    if isinstance(pattern, _pattern_type):
        if flags:
            raise ValueError(
                "cannot process flags argument with a compiled pattern")
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if not (flags & DEBUG):
        if len(_cache) >= _MAXCACHE:
            _cache.clear()
        if p.flags & LOCALE:
            if not _locale:
                return p
            loc = _locale.setlocale(_locale.LC_CTYPE)
        else:
            loc = None
        _cache[type(pattern), pattern, flags] = p, loc
    return p

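The argument checks in `_compile` are visible through the public functions. The sketch below exercises them from the outside only and does not touch the private cache. The variable names are illustrative.

```python
import re

compiled = re.compile(r"ab+c")

# An already compiled pattern is accepted as-is when no flags are given.
passthrough = re.match(compiled, "abbbc") is not None

# Combining a compiled pattern with a flags argument raises ValueError.
flags_error = None
try:
    re.match(compiled, "ABBC", re.IGNORECASE)
except ValueError as exc:
    flags_error = exc

# Anything that is neither a string nor a compiled pattern raises TypeError.
type_error = None
try:
    re.match(42, "abc")
except TypeError as exc:
    type_error = exc
```

To match case-insensitively with a precompiled pattern, the flag has to be given when calling `re.compile` rather than at match time.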
@functools.lru_cache(_MAXCACHE)
def _compile_repl(repl, pattern):
    # internal: compile replacement pattern
    return sre_parse.parse_template(repl, pattern)

def _expand(pattern, match, template):
    # internal: match.expand implementation hook
    template = sre_parse.parse_template(template, pattern)
    return sre_parse.expand_template(template, match)

def _subx(pattern, template):
    # internal: pattern.sub/subn implementation helper
    template = _compile_repl(template, pattern)
    if not template[0] and len(template[1]) == 1:
        # literal replacement
        return template[1][0]
    def filter(match, template=template):
        return sre_parse.expand_template(template, match)
    return filter

# register myself for pickling

import copyreg

def _pickle(p):
    return _compile, (p.pattern, p.flags)

copyreg.pickle(_pattern_type, _pickle, _compile)

# --------------------------------------------------------------------
# experimental stuff (see python-dev discussions for details)

class Scanner:
    def __init__(self, lexicon, flags=0):
        from sre_constants import BRANCH, SUBPATTERN
        self.lexicon = lexicon
        # combine phrases into a compound pattern
        p = []
        s = sre_parse.Pattern()
        s.flags = flags
        for phrase, action in lexicon:
            gid = s.opengroup()
            p.append(sre_parse.SubPattern(s, [
                (SUBPATTERN, (gid, 0, 0, sre_parse.parse(phrase, flags))),
                ]))
            s.closegroup(gid, p[-1])
        p = sre_parse.SubPattern(s, [(BRANCH, (None, p))])
        self.scanner = sre_compile.compile(p)
    def scan(self, string):
        result = []
        append = result.append
        match = self.scanner.scanner(string).match
        i = 0
        while True:
            m = match()
            if not m:
                break
            j = m.end()
            if i == j:
                break
            action = self.lexicon[m.lastindex-1][1]
            if callable(action):
                self.match = m
                action = action(self, m.group())
            if action is not None:
                append(action)
            i = j
        return result, string[i:]
377n/a action = action(self, m.group())
378n/a if action is not None:
379n/a append(action)
380n/a i = j
381n/a return result, string[i:]