COHERENT manpages

regexp.h -- Header File

Header file for regular-expression functions
#include <regexp.h>

Header file <regexp.h> is used with the regular-expression functions
regcomp(), regexec(), and regsub().  These functions manipulate a regular
expression, which is stored in structure regexp.  <regexp.h> defines
this structure as follows:

    typedef struct regexp {
        char *startp[NSUBEXP];
        char *endp[NSUBEXP];
        char regstart;
        char reganch;
        char *regmust;
        int regmlen;
        char program[1];
    } regexp;

Fields  regstart through  program are  used internally,  and should  not be
manipulated by a user's program.   Fields startp[] and endp[] are arrays of
pointers to  sub-strings within the  expression.  For details  on how these
pointers are  used, see the Lexicon entry for  regexec(). NSUBEXP gives the
number  of sub-strings  that  can be  addressed  at one  time;  as of  this
writing, it is set to ten.
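
The following is a minimal sketch of how a program might read one of the
sub-string slots.  It does not use the real header: demo_regexp is a local
copy of the structure shown above, and demo_substring() is a hypothetical
helper, not part of the library.  It assumes that startp[n] points to the
first character of sub-string n and that endp[n] points just past its last
character; the Lexicon entry for regexec() is the authority on that.

```c
#include <stddef.h>
#include <string.h>

#define NSUBEXP 10  /* the value the text above gives for NSUBEXP */

/* Local copy of the structure shown above, renamed so that it cannot
 * clash with a real <regexp.h>. */
typedef struct demo_regexp {
    char *startp[NSUBEXP];
    char *endp[NSUBEXP];
    char regstart;
    char reganch;
    char *regmust;
    int regmlen;
    char program[1];
} demo_regexp;

/* Copy sub-string n into buf (at most size-1 bytes, always terminated).
 * Returns the number of bytes copied, or -1 if n is out of range or the
 * slot is unused.  Assumption: endp[n] points one past the last
 * character of the sub-string that starts at startp[n]. */
int demo_substring(const demo_regexp *rp, int n, char *buf, size_t size)
{
    size_t len;

    if (n < 0 || n >= NSUBEXP || size == 0)
        return -1;
    if (rp->startp[n] == NULL || rp->endp[n] == NULL)
        return -1;
    len = (size_t)(rp->endp[n] - rp->startp[n]);
    if (len > size - 1)
        len = size - 1;
    memcpy(buf, rp->startp[n], len);
    buf[len] = '\0';
    return (int)len;
}
```

The helper reads only startp[] and endp[]; it leaves the internal fields
regstart through program alone, as the text above requires.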

Syntax of a Regular Expression

The following describes the rules  with which the regexp functions define a
regular expression.

A  regular expression  consists  of zero  or more  branches.  Branches  are
separated from  each other by  a pipe character  `|'.  A string  matches an
expression when it matches any branch within the expression.

A branch, in turn, consists of zero or more pieces, which are concatenated.
Each piece is a string, or atom, which can be followed by `*', `+', or `?'.
An atom  followed by  `*' can be  matched with a  sequence of zero  or more
matches  of the  atom.   An atom  followed  by `+'  can be  matched with  a
sequence of one  or more matches of the atom.   An atom followed by `?' can
be matched with either the atom or the null string.
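
The repetition rules above can be tried out with a short sketch.  Because
<regexp.h> may not be available on other systems, this example uses the
POSIX <regex.h> interface with extended syntax as a stand-in.  Its
regcomp() and regexec() take different arguments from the functions
documented here, and demo_matches() is a hypothetical helper.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if pattern matches somewhere in text, 0 if it does not, and
 * -1 if the pattern fails to compile.  Uses POSIX extended syntax as a
 * stand-in for the regexp functions described on this page. */
int demo_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this helper, demo_matches("^ab*c$", "ac") and
demo_matches("^ab?c$", "ac") both return 1, because `*' and `?' accept zero
occurrences of `b', whereas demo_matches("^ab+c$", "ac") returns 0, because
`+' requires at least one.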

An atom, in turn, is built from the following:

(expression)
     A regular expression between parentheses.  This matches a match for
     the regular expression.

[string]
     Match any  character within string.  If string contains  a hyphen `-',
     this  represents   a  range  of  characters.    For  example,  ``0-9''
     represents   all  digits;   or  ``a-z''   represents   all  lower-case
     characters.  To include a literal `-' within string, make it the first
     or  last character  within string.  To  include a  literal `]'  in the
     sequence, make it the first character, after a possible `^'.

[^string]
     Match any character that is not in string.

.    Match any single character.

^    Match the null string at the beginning of the input string.

$    Match the null string at the end of the input string.

\c   Match   the  single   character  c   literally;  ignore   any  special
     significance that c might have.
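
The atoms listed above can be checked with a small table of cases.  This is
a sketch only: it uses the POSIX <regex.h> interface with extended syntax as
a stand-in for the functions documented here, and the names demo_case,
demo_cases, demo_atom_match(), and demo_check_atoms() are hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* One illustrative case: should pattern match somewhere in text? */
struct demo_case {
    const char *pattern;
    const char *text;
    int expect;             /* 1 = should match, 0 = should not */
};

static const struct demo_case demo_cases[] = {
    { "[0-9]",  "a7",    1 },   /* range inside brackets */
    { "[^0-9]", "123",   0 },   /* negated range: no non-digit present */
    { "[-a]",   "-",     1 },   /* leading `-' is literal */
    { "[]x]",   "]",     1 },   /* leading `]' is literal */
    { "^ab",    "cab",   0 },   /* `^' anchors at the beginning */
    { "ab$",    "cab",   1 },   /* `$' anchors at the end */
    { "a\\*b",  "a*b",   1 },   /* `\*' is a literal asterisk */
    { "a\\*b",  "aab",   0 },
    { "(ab)+c", "ababc", 1 },   /* parenthesized expression as an atom */
};

/* Return 1 on match, 0 on no match, -1 if the pattern does not compile. */
int demo_atom_match(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Run every case; return the number whose result differs from expect. */
int demo_check_atoms(void)
{
    size_t i;
    int bad = 0;

    for (i = 0; i < sizeof demo_cases / sizeof demo_cases[0]; i++)
        if (demo_atom_match(demo_cases[i].pattern, demo_cases[i].text)
            != demo_cases[i].expect)
            bad++;
    return bad;
}
```

In a C string literal the backslash itself must be doubled, which is why
the pattern \* is written "a\\*b" in the table.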

Ambiguity

A string can match more than one part of a regular expression.  The
following rules describe how to choose which part to match.

The basic rule  is that if a regular expression  could match two parts of a
string, it matches the one that begins earlier.

If both  parts begin in the  same place but match  different lengths of the
expression, or match the same  length in different ways, life gets messier,
as follows.

In general, the possibilities in a list of branches are considered in left-
to-right  order, the  possibilities for  `*', `+',  and `?'  are considered
longest-first, nested constructs  are considered from the outermost in, and
concatenated constructs  are considered leftmost-first.  The  match that is
chosen is  the one that uses  the earliest possibility in  the first choice
that has  to be made.  If  there is more than one choice,  the next will be
made in  the same manner (earliest possibility) subject  to the decision on
the first choice.

For example, ``(ab|a)b*c'' could match ``abc'' in one of two ways.  The
first choice is between ``ab'' and `a'; since ``ab'' is earlier, and leads
to a successful overall match, it is chosen.  Since the `b' is already
spoken for, the ``b*'' must match its last possibility -- the empty string
-- because it must respect the earlier choice.

In the particular case where no `|'s are present and there is only one `*',
`+', or  `?', the  net effect  is that the  longest possible match  will be
chosen.   So ``ab*'',  presented  with ``xabbbby'',  will match  ``abbbb''.
Note that  if ``ab*'' is tried against ``xabyabbbz'',  it will match ``ab''
just after  `x', due to the begins-earliest rule.   In effect, the decision
on  where  to  start the  match  is  the first  choice  to  be made,  hence
subsequent  choices  must respect  it  even  if this  leads  them to  less-
preferred alternatives.
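
The two ``ab*'' examples above can be reproduced with a short sketch.  It
uses the POSIX <regex.h> interface with extended syntax as a stand-in, and
demo_find() is a hypothetical helper.  POSIX chooses the leftmost, longest
match, which agrees with the rules above for these examples; it is not
guaranteed to agree in every case that involves `|'.

```c
#include <regex.h>

/* Find the first match of pattern in text.  On success, store the
 * offset of the match in *start and its length in *len, and return 0.
 * Return 1 if there is no match, or -1 if the pattern does not compile. */
int demo_find(const char *pattern, const char *text, int *start, int *len)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 1;
    *start = (int)m[0].rm_so;
    *len = (int)(m[0].rm_eo - m[0].rm_so);
    return 0;
}
```

For ``ab*'' against ``xabbbby'' the helper reports offset 1 and length 5
(``abbbb'').  Against ``xabyabbbz'' it reports offset 1 and length 2
(``ab''), even though a longer match begins later in the string.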

See Also

header files,
regcomp(),
regerror(),
regexec(),
regsub()

Notes

The code used for the regexp functions was written by Henry Spencer at the
University of Toronto.  It is copyright © 1986 by the University of
Toronto.  These routines are intended to be compatible with the Bell
System-8 regexp() but are not derived from Bell code.  The above
description of regular expressions is derived from the manual page written
by Henry Spencer.