source: trunk/libs/newlib/src/newlib/libc/posix/regex.3 @ 577

Last change on this file since 577 was 444, checked in by satin@…, 6 years ago

add newlib,libalmos-mkh, restructure shared_syscalls.h and mini-libc

File size: 16.9 KB
Line 
1.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
2.\" Copyright (c) 1992, 1993, 1994
3.\"     The Regents of the University of California.  All rights reserved.
4.\"
5.\" This code is derived from software contributed to Berkeley by
6.\" Henry Spencer.
7.\"
8.\" Redistribution and use in source and binary forms, with or without
9.\" modification, are permitted provided that the following conditions
10.\" are met:
11.\" 1. Redistributions of source code must retain the above copyright
12.\"    notice, this list of conditions and the following disclaimer.
13.\" 2. Redistributions in binary form must reproduce the above copyright
14.\"    notice, this list of conditions and the following disclaimer in the
15.\"    documentation and/or other materials provided with the distribution.
16.\" 3. All advertising materials mentioning features or use of this software
17.\"    must display the following acknowledgement:
18.\"     This product includes software developed by the University of
19.\"     California, Berkeley and its contributors.
20.\" 4. Neither the name of the University nor the names of its contributors
21.\"    may be used to endorse or promote products derived from this software
22.\"    without specific prior written permission.
23.\"
24.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
25.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
26.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
27.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
28.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
29.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
30.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
31.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
32.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
33.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
34.\" SUCH DAMAGE.
35.\"
36.\"     @(#)regex.3     8.4 (Berkeley) 3/20/94
37.\" $FreeBSD: src/lib/libc/regex/regex.3,v 1.9 2001/10/01 16:08:58 ru Exp $
38.\"
39.Dd March 20, 1994
40.Dt REGEX 3
41.Os
42.Sh NAME
43.Nm regcomp ,
44.Nm regexec ,
45.Nm regerror ,
46.Nm regfree
47.Nd regular-expression library
48.Sh LIBRARY
49.Lb libc
50.Sh SYNOPSIS
51.In sys/types.h
52.In regex.h
53.Ft int
54.Fn regcomp "regex_t *restrict preg" "const char *_restrictpattern" "int cflags"
55.Ft int
56.Fo regexec
57.Fa "const regex_t *_restrict preg" "const char *_restrict string"
58.Fa "size_t nmatch" "regmatch_t pmatch[_restrict]" "int eflags"
59.Fc
60.Ft size_t
61.Fo regerror
62.Fa "int errcode" "const regex_t *_restrict preg"
63.Fa "char *_restrict errbuf" "size_t errbuf_size"
64.Fc
65.Ft void
66.Fn regfree "regex_t *preg"
67.Sh DESCRIPTION
68These routines implement
69.St -p1003.2
70regular expressions
71.Pq Do RE Dc Ns s ;
72see
73.Xr re_format 7 .
74.Fn Regcomp
75compiles an RE written as a string into an internal form,
76.Fn regexec
77matches that internal form against a string and reports results,
78.Fn regerror
79transforms error codes from either into human-readable messages,
80and
81.Fn regfree
82frees any dynamically-allocated storage used by the internal form
83of an RE.
84.Pp
85The header
86.Aq Pa regex.h
87declares two structure types,
88.Ft regex_t
89and
90.Ft regmatch_t ,
91the former for compiled internal forms and the latter for match reporting.
92It also declares the four functions,
93a type
94.Ft regoff_t ,
95and a number of constants with names starting with
96.Dq Dv REG_ .
97.Pp
98.Fn Regcomp
99compiles the regular expression contained in the
100.Fa pattern
101string,
102subject to the flags in
103.Fa cflags ,
104and places the results in the
105.Ft regex_t
106structure pointed to by
107.Fa preg .
108.Fa Cflags
109is the bitwise OR of zero or more of the following flags:
110.Bl -tag -width REG_EXTENDED
111.It Dv REG_EXTENDED
112Compile modern
113.Pq Dq extended
114REs,
115rather than the obsolete
116.Pq Dq basic
117REs that
118are the default.
119.It Dv REG_BASIC
120This is a synonym for 0,
121provided as a counterpart to
122.Dv REG_EXTENDED
123to improve readability.
124.It Dv REG_NOSPEC
125Compile with recognition of all special characters turned off.
126All characters are thus considered ordinary,
127so the
128.Dq RE
129is a literal string.
130This is an extension,
131compatible with but not specified by
132.St -p1003.2 ,
133and should be used with
134caution in software intended to be portable to other systems.
135.Dv REG_EXTENDED
136and
137.Dv REG_NOSPEC
138may not be used
139in the same call to
140.Fn regcomp .
141.It Dv REG_ICASE
142Compile for matching that ignores upper/lower case distinctions.
143See
144.Xr re_format 7 .
145.It Dv REG_NOSUB
146Compile for matching that need only report success or failure,
147not what was matched.
148.It Dv REG_NEWLINE
149Compile for newline-sensitive matching.
150By default, newline is a completely ordinary character with no special
151meaning in either REs or strings.
152With this flag,
153.Ql [^
154bracket expressions and
155.Ql .\&
156never match newline,
157a
158.Ql ^\&
159anchor matches the null string after any newline in the string
160in addition to its normal function,
161and the
162.Ql $\&
163anchor matches the null string before any newline in the
164string in addition to its normal function.
165.It Dv REG_PEND
166The regular expression ends,
167not at the first NUL,
168but just before the character pointed to by the
169.Va re_endp
170member of the structure pointed to by
171.Fa preg .
172The
173.Va re_endp
174member is of type
175.Ft "const char *" .
176This flag permits inclusion of NULs in the RE;
177they are considered ordinary characters.
178This is an extension,
179compatible with but not specified by
180.St -p1003.2 ,
181and should be used with
182caution in software intended to be portable to other systems.
183.El
184.Pp
185When successful,
186.Fn regcomp
187returns 0 and fills in the structure pointed to by
188.Fa preg .
189One member of that structure
190(other than
191.Va re_endp )
192is publicized:
193.Va re_nsub ,
194of type
195.Ft size_t ,
196contains the number of parenthesized subexpressions within the RE
197(except that the value of this member is undefined if the
198.Dv REG_NOSUB
199flag was used).
200If
201.Fn regcomp
202fails, it returns a non-zero error code;
203see
204.Sx DIAGNOSTICS .
205.Pp
206.Fn Regexec
207matches the compiled RE pointed to by
208.Fa preg
209against the
210.Fa string ,
211subject to the flags in
212.Fa eflags ,
213and reports results using
214.Fa nmatch ,
215.Fa pmatch ,
216and the returned value.
217The RE must have been compiled by a previous invocation of
218.Fn regcomp .
219The compiled form is not altered during execution of
220.Fn regexec ,
221so a single compiled RE can be used simultaneously by multiple threads.
222.Pp
223By default,
224the NUL-terminated string pointed to by
225.Fa string
226is considered to be the text of an entire line, minus any terminating
227newline.
228The
229.Fa eflags
230argument is the bitwise OR of zero or more of the following flags:
231.Bl -tag -width REG_STARTEND
232.It Dv REG_NOTBOL
233The first character of
234the string
235is not the beginning of a line, so the
236.Ql ^\&
237anchor should not match before it.
238This does not affect the behavior of newlines under
239.Dv REG_NEWLINE .
240.It Dv REG_NOTEOL
241The NUL terminating
242the string
243does not end a line, so the
244.Ql $\&
245anchor should not match before it.
246This does not affect the behavior of newlines under
247.Dv REG_NEWLINE .
248.It Dv REG_STARTEND
249The string is considered to start at
250.Fa string
251+
252.Fa pmatch Ns [0]. Ns Va rm_so
253and to have a terminating NUL located at
254.Fa string
255+
256.Fa pmatch Ns [0]. Ns Va rm_eo
257(there need not actually be a NUL at that location),
258regardless of the value of
259.Fa nmatch .
260See below for the definition of
261.Fa pmatch
262and
263.Fa nmatch .
264This is an extension,
265compatible with but not specified by
266.St -p1003.2 ,
267and should be used with
268caution in software intended to be portable to other systems.
269Note that a non-zero
270.Va rm_so
271does not imply
272.Dv REG_NOTBOL ;
273.Dv REG_STARTEND
274affects only the location of the string,
275not how it is matched.
276.El
277.Pp
278See
279.Xr re_format 7
280for a discussion of what is matched in situations where an RE or a
281portion thereof could match any of several substrings of
282.Fa string .
283.Pp
284Normally,
285.Fn regexec
286returns 0 for success and the non-zero code
287.Dv REG_NOMATCH
288for failure.
289Other non-zero error codes may be returned in exceptional situations;
290see
291.Sx DIAGNOSTICS .
292.Pp
293If
294.Dv REG_NOSUB
295was specified in the compilation of the RE,
296or if
297.Fa nmatch
298is 0,
299.Fn regexec
300ignores the
301.Fa pmatch
302argument (but see below for the case where
303.Dv REG_STARTEND
304is specified).
305Otherwise,
306.Fa pmatch
307points to an array of
308.Fa nmatch
309structures of type
310.Ft regmatch_t .
311Such a structure has at least the members
312.Va rm_so
313and
314.Va rm_eo ,
315both of type
316.Ft regoff_t
317(a signed arithmetic type at least as large as an
318.Ft off_t
319and a
320.Ft ssize_t ) ,
321containing respectively the offset of the first character of a substring
322and the offset of the first character after the end of the substring.
323Offsets are measured from the beginning of the
324.Fa string
325argument given to
326.Fn regexec .
327An empty substring is denoted by equal offsets,
328both indicating the character following the empty substring.
329.Pp
330The 0th member of the
331.Fa pmatch
332array is filled in to indicate what substring of
333.Fa string
334was matched by the entire RE.
335Remaining members report what substring was matched by parenthesized
336subexpressions within the RE;
337member
338.Va i
339reports subexpression
340.Va i ,
341with subexpressions counted (starting at 1) by the order of their opening
342parentheses in the RE, left to right.
343Unused entries in the array (corresponding either to subexpressions that
344did not participate in the match at all, or to subexpressions that do not
345exist in the RE (that is,
346.Va i
347>
348.Fa preg Ns -> Ns Va re_nsub ) )
349have both
350.Va rm_so
351and
352.Va rm_eo
353set to -1.
354If a subexpression participated in the match several times,
355the reported substring is the last one it matched.
356(Note, as an example in particular, that when the RE
357.Ql "(b*)+"
358matches
359.Ql bbb ,
360the parenthesized subexpression matches each of the three
361.So Li b Sc Ns s
362and then
363an infinite number of empty strings following the last
364.Ql b ,
365so the reported substring is one of the empties.)
366.Pp
367If
368.Dv REG_STARTEND
369is specified,
370.Fa pmatch
371must point to at least one
372.Ft regmatch_t
373(even if
374.Fa nmatch
375is 0 or
376.Dv REG_NOSUB
377was specified),
378to hold the input offsets for
379.Dv REG_STARTEND .
380Use for output is still entirely controlled by
381.Fa nmatch ;
382if
383.Fa nmatch
384is 0 or
385.Dv REG_NOSUB
386was specified,
387the value of
388.Fa pmatch Ns [0]
389will not be changed by a successful
390.Fn regexec .
391.Pp
392.Fn Regerror
393maps a non-zero
394.Fa errcode
395from either
396.Fn regcomp
397or
398.Fn regexec
399to a human-readable, printable message.
400If
401.Fa preg
402is
403.No non\- Ns Dv NULL ,
404the error code should have arisen from use of
405the
406.Ft regex_t
407pointed to by
408.Fa preg ,
409and if the error code came from
410.Fn regcomp ,
411it should have been the result from the most recent
412.Fn regcomp
413using that
414.Ft regex_t .
415.No ( Fn Regerror
416may be able to supply a more detailed message using information
417from the
418.Ft regex_t . )
419.Fn Regerror
420places the NUL-terminated message into the buffer pointed to by
421.Fa errbuf ,
422limiting the length (including the NUL) to at most
423.Fa errbuf_size
424bytes.
425If the whole message won't fit,
426as much of it as will fit before the terminating NUL is supplied.
427In any case,
428the returned value is the size of buffer needed to hold the whole
429message (including terminating NUL).
430If
431.Fa errbuf_size
432is 0,
433.Fa errbuf
434is ignored but the return value is still correct.
435.Pp
436If the
437.Fa errcode
438given to
439.Fn regerror
440is first ORed with
441.Dv REG_ITOA ,
442the
443.Dq message
444that results is the printable name of the error code,
445e.g.\&
446.Dq Dv REG_NOMATCH ,
447rather than an explanation thereof.
448If
449.Fa errcode
450is
451.Dv REG_ATOI ,
452then
453.Fa preg
454shall be
455.No non\- Ns Dv NULL
456and the
457.Va re_endp
458member of the structure it points to
459must point to the printable name of an error code;
460in this case, the result in
461.Fa errbuf
462is the decimal digits of
463the numeric value of the error code
464(0 if the name is not recognized).
465.Dv REG_ITOA
466and
467.Dv REG_ATOI
468are intended primarily as debugging facilities;
469they are extensions,
470compatible with but not specified by
471.St -p1003.2 ,
472and should be used with
473caution in software intended to be portable to other systems.
474Be warned also that they are considered experimental and changes are possible.
475.Pp
476.Fn Regfree
477frees any dynamically-allocated storage associated with the compiled RE
478pointed to by
479.Fa preg .
480The remaining
481.Ft regex_t
482is no longer a valid compiled RE
483and the effect of supplying it to
484.Fn regexec
485or
486.Fn regerror
487is undefined.
488.Pp
489None of these functions references global variables except for tables
490of constants;
491all are safe for use from multiple threads if the arguments are safe.
492.Sh IMPLEMENTATION CHOICES
493There are a number of decisions that
494.St -p1003.2
495leaves up to the implementor,
496either by explicitly saying
497.Dq undefined
498or by virtue of them being
499forbidden by the RE grammar.
500This implementation treats them as follows.
501.Pp
502See
503.Xr re_format 7
504for a discussion of the definition of case-independent matching.
505.Pp
506There is no particular limit on the length of REs,
507except insofar as memory is limited.
508Memory usage is approximately linear in RE size, and largely insensitive
509to RE complexity, except for bounded repetitions.
510See
511.Sx BUGS
512for one short RE using them
513that will run almost any system out of memory.
514.Pp
515A backslashed character other than one specifically given a magic meaning
516by
517.St -p1003.2
518(such magic meanings occur only in obsolete
519.Bq Dq basic
520REs)
521is taken as an ordinary character.
522.Pp
523Any unmatched
524.Ql [\&
525is a
526.Dv REG_EBRACK
527error.
528.Pp
529Equivalence classes cannot begin or end bracket-expression ranges.
530The endpoint of one range cannot begin another.
531.Pp
532.Dv RE_DUP_MAX ,
533the limit on repetition counts in bounded repetitions, is 255.
534.Pp
535A repetition operator
536.Ql ( ?\& ,
537.Ql *\& ,
538.Ql +\& ,
539or bounds)
540cannot follow another
541repetition operator.
542A repetition operator cannot begin an expression or subexpression
543or follow
544.Ql ^\&
545or
546.Ql |\& .
547.Pp
548.Ql |\&
549cannot appear first or last in a (sub)expression or after another
550.Ql |\& ,
551i.e. an operand of
552.Ql |\&
553cannot be an empty subexpression.
554An empty parenthesized subexpression,
555.Ql "()" ,
556is legal and matches an
557empty (sub)string.
558An empty string is not a legal RE.
559.Pp
560A
561.Ql {\&
562followed by a digit is considered the beginning of bounds for a
563bounded repetition, which must then follow the syntax for bounds.
564A
565.Ql {\&
566.Em not
567followed by a digit is considered an ordinary character.
568.Pp
569.Ql ^\&
570and
571.Ql $\&
572beginning and ending subexpressions in obsolete
573.Pq Dq basic
574REs are anchors, not ordinary characters.
575.Sh SEE ALSO
576.Xr grep 1 ,
577.Xr re_format 7
578.Pp
579.St -p1003.2 ,
580sections 2.8 (Regular Expression Notation)
581and
582B.5 (C Binding for Regular Expression Matching).
583.Sh DIAGNOSTICS
584Non-zero error codes from
585.Fn regcomp
586and
587.Fn regexec
588include the following:
589.Pp
590.Bl -tag -width REG_ECOLLATE -compact
591.It Dv REG_NOMATCH
592.Fn regexec
593failed to match
594.It Dv REG_BADPAT
595invalid regular expression
596.It Dv REG_ECOLLATE
597invalid collating element
598.It Dv REG_ECTYPE
599invalid character class
600.It Dv REG_EESCAPE
601.Ql \e
602applied to unescapable character
603.It Dv REG_ESUBREG
604invalid backreference number
605.It Dv REG_EBRACK
606brackets
607.Ql "[ ]"
608not balanced
609.It Dv REG_EPAREN
610parentheses
611.Ql "( )"
612not balanced
613.It Dv REG_EBRACE
614braces
615.Ql "{ }"
616not balanced
617.It Dv REG_BADBR
618invalid repetition count(s) in
619.Ql "{ }"
620.It Dv REG_ERANGE
621invalid character range in
622.Ql "[ ]"
623.It Dv REG_ESPACE
624ran out of memory
625.It Dv REG_BADRPT
626.Ql ?\& ,
627.Ql *\& ,
628or
629.Ql +\&
630operand invalid
631.It Dv REG_EMPTY
632empty (sub)expression
633.It Dv REG_ASSERT
634can't happen - you found a bug
635.It Dv REG_INVARG
636invalid argument, e.g. negative-length string
637.El
638.Sh HISTORY
639Originally written by
640.An Henry Spencer .
641Altered for inclusion in the
642.Bx 4.4
643distribution.
644.Sh BUGS
645This is an alpha release with known defects.
646Please report problems.
647.Pp
648The back-reference code is subtle and doubts linger about its correctness
649in complex cases.
650.Pp
651.Fn Regexec
652performance is poor.
653This will improve with later releases.
654.Fa Nmatch
655exceeding 0 is expensive;
656.Fa nmatch
657exceeding 1 is worse.
658.Fn Regexec
659is largely insensitive to RE complexity
660.Em except
661that back
662references are massively expensive.
663RE length does matter; in particular, there is a strong speed bonus
664for keeping RE length under about 30 characters,
665with most special characters counting roughly double.
666.Pp
667.Fn Regcomp
668implements bounded repetitions by macro expansion,
669which is costly in time and space if counts are large
670or bounded repetitions are nested.
671An RE like, say,
672.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
673will (eventually) run almost any existing machine out of swap space.
674.Pp
675There are suspected problems with response to obscure error conditions.
676Notably,
677certain kinds of internal overflow,
678produced only by truly enormous REs or by multiply nested bounded repetitions,
679are probably not handled well.
680.Pp
681Due to a mistake in
682.St -p1003.2 ,
683things like
684.Ql "a)b"
685are legal REs because
686.Ql )\&
687is
688a special character only in the presence of a previous unmatched
689.Ql (\& .
690This can't be fixed until the spec is fixed.
691.Pp
692The standard's definition of back references is vague.
693For example, does
694.Ql "a\e(\e(b\e)*\e2\e)*d"
695match
696.Ql "abbbd" ?
697Until the standard is clarified,
698behavior in such cases should not be relied on.
699.Pp
700The implementation of word-boundary matching is a bit of a kludge,
701and bugs may lurk in combinations of word-boundary matching and anchoring.
Note: See TracBrowser for help on using the repository browser.