COHERENT manpages
This page displays the COHERENT manpage for regexp.h [Header file for regular-expression functions].
regexp.h -- Header File

Header file for regular-expression functions
#include <regexp.h>

Header file <regexp.h> is used with the regular-expression functions regcomp(), regexec(), and regsub(). These functions manipulate a regular expression, which is stored in structure regexp. <regexp.h> defines this structure as follows:

    typedef struct regexp {
        char *startp[NSUBEXP];
        char *endp[NSUBEXP];
        char regstart;
        char reganch;
        char *regmust;
        int regmlen;
        char program[1];
    } regexp;

Fields regstart through program are used internally, and should not be manipulated by a user's program. Fields startp[] and endp[] are arrays of pointers that mark the beginning and end of matched sub-strings. For details on how these pointers are used, see the Lexicon entry for regexec(). NSUBEXP gives the number of sub-strings that can be addressed at one time; as of this writing, it is set to ten.

Syntax of a Regular Expression

The following describes the rules with which the regexp functions define a regular expression.

A regular expression consists of zero or more branches. Branches are separated from each other by a pipe character `|'. A string matches an expression when it matches any branch within the expression.

A branch, in turn, consists of zero or more pieces, which are concatenated. Each piece is an atom, which can be followed by `*', `+', or `?'. An atom followed by `*' matches a sequence of zero or more matches of the atom. An atom followed by `+' matches a sequence of one or more matches of the atom. An atom followed by `?' matches either the atom or the null string.

An atom, in turn, is built from the following:

(expression)
    A regular expression between parentheses. This matches whatever the enclosed regular expression matches.

[string]
    Match any character within string. If string contains a hyphen `-', this represents a range of characters. For example, ``0-9'' represents all digits, and ``a-z'' represents all lower-case characters.
    To include a literal `-' within string, make it the first or last character within string. To include a literal `]' in the sequence, make it the first character, after a possible `^'.

[^string]
    Match any character that is not in string.

.
    Match any single character.

^
    Match the null string at the beginning of the input string.

$
    Match the null string at the end of the input string.

\c
    Match the single character c literally; ignore any special significance that c might have.

Ambiguity

A string can match more than one part of a regular expression. The following rules describe how to choose which part to match.

The basic rule is that if a regular expression could match two parts of a string, it matches the one that begins earlier. If both parts begin in the same place but match different lengths of the expression, or match the same length in different ways, life gets messier, as follows.

In general, the possibilities in a list of branches are considered in left-to-right order, the possibilities for `*', `+', and `?' are considered longest-first, nested constructs are considered from the outermost in, and concatenated constructs are considered leftmost-first. The match that is chosen is the one that uses the earliest possibility in the first choice that has to be made. If there is more than one choice, the next will be made in the same manner (earliest possibility) subject to the decision on the first choice.

For example, ``(ab|a)b*c'' could match ``abc'' in one of two ways. The first choice is between ``ab'' and `a'; since ``ab'' is earlier, and leads to a successful overall match, it is chosen. Since the `b' is already spoken for, the ``b*'' must match its last possibility, the empty string, because it must respect the earlier choice.

In the particular case where no `|'s are present and there is only one `*', `+', or `?', the net effect is that the longest possible match will be chosen. So ``ab*'', presented with ``xabbbby'', will match ``abbbb''.
Note that if ``ab*'' is tried against ``xabyabbbz'', it will match ``ab'' just after `x', due to the begins-earliest rule. In effect, the decision on where to start the match is the first choice to be made; hence, subsequent choices must respect it even if this leads them to less-preferred alternatives.

See Also

header files, regcomp(), regerror(), regexec(), regsub()

Notes

The code used for the regexp functions was written by Henry Spencer at the University of Toronto. It is copyright © 1986 by the University of Toronto. These routines are intended to be compatible with the Bell V8 (Eighth Edition UNIX) regexp() but are not derived from Bell code.

The above description of regular expressions is derived from the manual page written by Henry Spencer.
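
The begins-earliest and longest-first rules described under Ambiguity can be illustrated with a toy matcher. The following is a minimal sketch in C, and it is not the code in the regexp library. The names toy_match(), match_here(), and atom_matches() are hypothetical, and the sketch assumes a reduced syntax: literal characters, `.', and the `*', `+', and `?' suffixes only (no `|', parentheses, ranges, `^', `$', or `\c').

```c
#include <assert.h>
#include <stddef.h>

/* Does atom (a literal character or `.') match the character c? */
static int atom_matches(char atom, char c)
{
    return c != '\0' && (atom == '.' || atom == c);
}

/*
 * Try to match pat against text, starting exactly at text[0].
 * Return the length of the match, or -1 if there is none.
 */
static int match_here(const char *pat, const char *text)
{
    char atom, suffix;
    int n, n_min, n_max, rest;

    if (pat[0] == '\0')
        return 0;               /* nothing left to match */

    atom = pat[0];
    suffix = pat[1];

    if (suffix == '*' || suffix == '+' || suffix == '?') {
        n_min = (suffix == '+') ? 1 : 0;

        /* Count how many repetitions of the atom are available. */
        n_max = 0;
        while (atom_matches(atom, text[n_max])) {
            n_max++;
            if (suffix == '?')
                break;          /* `?' allows at most one */
        }

        /* Longest-first: back off one repetition at a time. */
        for (n = n_max; n >= n_min; n--) {
            rest = match_here(pat + 2, text + n);
            if (rest >= 0)
                return n + rest;
        }
        return -1;
    }

    /* A plain atom with no suffix must match exactly once. */
    if (!atom_matches(atom, text[0]))
        return -1;
    rest = match_here(pat + 1, text + 1);
    return (rest < 0) ? -1 : 1 + rest;
}

/*
 * Begins-earliest: try each starting offset in turn and stop at the
 * first one that yields a match. On success, store the offset and
 * length of the match and return 1; otherwise return 0.
 */
int toy_match(const char *pat, const char *text, int *start, int *len)
{
    int i, n;

    assert(pat != NULL && text != NULL && start != NULL && len != NULL);

    for (i = 0; ; i++) {
        n = match_here(pat, text + i);
        if (n >= 0) {
            *start = i;
            *len = n;
            return 1;
        }
        if (text[i] == '\0')
            return 0;           /* every starting offset has been tried */
    }
}
```

With this sketch, toy_match("ab*", "xabbbby", &start, &len) sets start to 1 and len to 5 (the sub-string ``abbbb''), whereas toy_match("ab*", "xabyabbbz", &start, &len) sets start to 1 and len to 2 (the sub-string ``ab''): the earlier starting point wins even though a longer match begins later in the string.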