Regular Expressions in XLISP-STAT

Luke Tierney

1998/02/02

Introduction

This is a simple regular expression library provided as an illustration of the shared library and C wrapper generation mechanisms. This regular expression code is now part of the source snapshot. Makefiles for Borland C++ and VC++ are in the snapshot as well. Compiled DLL's and projects/makefiles for the Macintosh and for Win32 are also available. To use these compiled libraries you will also need the generated Lisp file.

Interface and Examples

A Higer-Level Interface

This higher-level interface provides two exported Lisp functions, regexp and regsub, based on the Tcl functions with the same names[cite welch97:_pract_progr_tcl_tk]. The package is REGULAR-EXPRESSIONS with nickname REGEXP.

The syntax for regexp is

regexp pattern string &key ignore-case extended index-only

pattern is a regular expression to be matched to string. If ignore-case is true, the case of characters is ignored in comparisons; by default it is false. If extended is true, then extended regular expressions are used (the REG_EXTENDED flag is given to regcomp); extend is true by default. If index-only is true, a list of (start . end) pairs for the matched substrings is returned. Otherwise, a list of the substrings is returned. The default value is false.

The following example, adapted slightly from [cite welch97:_pract_progr_tcl_tk, Example 11-2], uses regexp to decompose a URL into its components:

> (use-package "REGEXP")
T
> (regexp "^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)"
          "http://stat.umn.edu:80/xyz")
("http://stat.umn.edu:80/xyz"
 "http://"
 "http"               ;; protocol
 "stat.umn.edu"       ;; host
 ":80"
 "80"                 ;; port
 "/xyz")              ;; path

The return value has been edited and comments added for clarity.

The function regsub performs substitution based on pattern matching. The syntax is

regsub pattern string sub &key ignore-case extended all

A match of pattern in string is replaced using sub. If all is true then all occurrences are replaced; otherwise, only the first is replaced. By default, all is false. The substitution argument sub can be a string, which is used literally, or it can be a function. The function is called with the substrings produced by the match as arguments and should return a string to use as a replacement for the match.

An example [These examples are based on the URL Decoding example in [cite welch97:_pract_progr_tcl_tk, p. 128]] of the simple form of regsub:

> (regsub "\\+" "abc+def+ghi" " " :all t)
"abc def ghi"

To illustrate the more complex form, the following example replaces %xx hexadecimal encodings in URL's with the character they represent.

> (regsub "%([0-9a-hA-H][0-9a-hA-H])" "%7ewelc%68"
          #'(lambda (a b)
              (string (int-char (let ((*read-base* 16))
                                  (read-from-string b)))))
          :all t)
"~welch"

These two examples can be combined into a simple URL decoding function:

<url-decode definition>= (U->)
(defun url-decode (url)
  (flet ((parse-hex (s)
           (let ((*read-base* 16))
             (read-from-string s))))
    (regsub "%([0-9a-hA-H][0-9a-hA-H])"
            (regsub "\\+" url " " :all t)
            #'(lambda (a b) (string (int-char (parse-hex b))))
            :all t)))

Defines url-decode (links are to index).

An illustration:

> (url-decode "%7ewelc%68+book")
"~welch book"

Lower-Level Interface

The lower-level interface provides a closer interface to the POSIX regular expression functions. In particular regcomp can be used to compile a regular expression that is then reused repeatedly in calls to regexec. The functions are ``Lispified'' by returning results as Lisp objects and signaling errors when appropriate rather than returning error codes. Currently these are in the same package as the higher-level functions. [Perhaps this should be changed.]

There are six exported constants to be used as flags. These are integer values that can be combined with logior. The four flag values for regcomp are

REG_EXTENDED Use extended regular expressions
REG_NEWLINE Special handling of newline characters
REG_NOSUB Report only success/fail in regexec
REG_ICASE Ignore case in match.

The two flags for regexec are

REG_NOTBOL First character of string is not the beginning of the line
REG_NOTEOL Last character of string is not the end of the line

The two public functions are regcomp and regexec. The syntax for regcomp is

regcomp pattern flags

where pattern is a string containing a regular expression and flags is an integer flag constructed from the constants given above. regcomp returns a regular expression structure or signals an error.

The syntax for regexec is

regexec rex string flags

where rex is a regular expression structure created by regcomp, string is a string to be matched, and flags is an integer flag.

As an example, we can compile the URL pattern used above as

> (setf rex (regcomp "^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)" REG_EXTENDED))
#<Regular Expression "^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)">

and use it to match a URL with

> (regexec rex "http://stat.umn.edu:80/xyz" 0)
((0 . 26) (0 . 7) (0 . 4) (7 . 19) (19 . 22) (20 . 22) (22 . 26))

Implementation

The file containing the wrappers and supporting Lisp code is regexp.wrp.

<regexp.wrp>=
<package and module setup>
<lower-level component>
<higher-level component>

The package and module setup code is

<package and module setup>= (<-U U->)
(provide "regexp")

(defpackage "REGULAR-EXPRESSIONS"
  (:use "COMMON-LISP")
  (:nicknames "REGEXP"))

(in-package "REGEXP")

(export '(<exported variable symbols>
          <exported function symbols>))

The package does not :use use the C-WRAPPERS package since the wrappers are not needed at runtime. Instead, the wrapper macros are referenced with the package nickname wrap qualifier.

The Lower-Level Component

The lower-level component is a wrapper around the POSIX regular expression library.

<lower-level component>= (<-U)
(wrap:c-lines "#include <sys/types.h>"
              "#include <regex.h>")

<constant wrappers>
<compound data wrappers>
<function wrappers>
<Lisp regex data structure>
<support functions for the lower-level>
<public low-level functions>

The exported constants are

<constant wrappers>= (<-U) [D->]
(wrap:c-constant REG_EXTENDED "REG_EXTENDED" :integer)
(wrap:c-constant REG_NEWLINE "REG_NEWLINE" :integer)
(wrap:c-constant REG_NOSUB "REG_NOSUB" :integer)
(wrap:c-constant REG_ICASE "REG_ICASE" :integer)
(wrap:c-constant REG_NOTBOL "REG_NOTBOL" :integer)
(wrap:c-constant REG_NOTEOL "REG_NOTEOL" :integer)

Defines REG_EXTENDED, REG_ICASE, REG_NEWLINE, REG_NOSUB, REG_NOTBOL, REG_NOTEOL (links are to index).

<exported variable symbols>= (<-U U->)
REG_EXTENDED REG_NEWLINE REG_NOSUB REG_ICASE REG_NOTBOL REG_NOTEOL

In addition we need the no-match result constant for regexec,

<constant wrappers>+= (<-U) [<-D]
(wrap:c-constant REG_NOMATCH "REG_NOMATCH" :integer)

Defines REG_NOMATCH (links are to index).

but this does not need to be exported.

There are two compound C data types, regex_t for compiled regular expressions and regmatch_t for the match results.

<compound data wrappers>= (<-U)
(wrap:c-pointer "regex_t"
           (:make make-regex-t)
           (:get re-nsub "re_nsub" :integer))

(wrap:c-pointer "regmatch_t"
           (:make make-regmatch-t)
           (:get rm-so "rm_so" :integer)
           (:get rm-eo "rm_eo" :integer))

Defines make-regex-t, make-regmatch-t, re-nsub, rm-eo, rm-so (links are to index).

There are four functions. regcomp compiles and regexec executes regular expressions. The corresponding Lisp functions are base-regcomp and base-regexec.

<function wrappers>= (<-U) [D->]
(wrap:c-function base-regcomp "regcomp"
  ((:cptr "regex_t") :string :integer) :integer)

(wrap:c-function base-regexec "regexec"
  ((:cptr "regex_t") :string :integer (:cptr "regmatch_t") :integer) :integer)

Defines base-regcomp, base-regexec (links are to index).

The regfree function releases storage allocated for compiled expressions and regerror fills a string with an error message.

<function wrappers>+= (<-U) [<-D]
(wrap:c-function regfree "regfree" ((:cptr "regex_t")) :void)

(wrap:c-function regerror "regerror"
  (:integer (:cptr "regex_t") :string :integer) :integer)

For the public interface, regular expressions are represented by a Lisp structure with two slots, ptr for holding the allocated C data, and pat for holding the pattern string.

<Lisp regex data structure>= (<-U) [D->]
(defstruct (regexp (:print-function print-regexp)
                   (:constructor (make-regexp (ptr pat))))
  ptr pat)

Defines regexp (links are to index).

The printed representation is provided by

<Lisp regex data structure>+= (<-U) [<-D]
(defun print-regexp (x s d)
  (format s "#<Regular Expression ~s>" (regexp-pat x)))

Defines print-regexpr (links are to index).

The lower-level component exports two functions, the public versions of regcomp and regexec.

<exported function symbols>= (<-U U->) [D->]
regcomp regexec

The public version of regcomp takes a pattern and a flags argument and returns a regexp structure or signals an error. [perhaps it would be better to accept keywords or optional arguments instead of flags.]

<public low-level functions>= (<-U) [D->]
(defun regcomp (pat flags)
  (system:without-interrupts
    (let* ((rex (make-regex-t))
           (result (base-regcomp rex pat flags))
           (value (make-regexp rex pat)))
      (unless (= result 0)
              (raise-regex-error "regcomp" result rex))
      (system:register-finalizer value #'regexp-free)
      value)))

Defines regcomp (links are to index).

The regexp-free function is called by the finalization mechanism when a regexp object is garbage-collected.

<support functions for the lower-level>= (<-U) [D->]
(defun regexp-free (x)
  (without-interrupts
   (let ((rex (regexp-ptr x)))
     (unless (null rex)
             (setf (regexp-ptr x) nil)
             (regfree rex)))))

Defines regexp-free (links are to index).

Interrupts are disabled around the creation of the expression object to insure that it will be registered for garbage collection. If regexp-free is only called from the finalization mechanism then interrupts will already be suspended. The without-interrupts wrapper is used to prevent an interrupt between removing the reference and releasing the memory with regfree if regexp-free is ever called directly.

The public regexec function takes a regexp object, a string, and a flags argument and returns NIL for no match, a list of index pairs for a match, or signals an error.

<public low-level functions>+= (<-U) [<-D]
(defun regexec (r str flags)
  (let* ((rex (regexp-ptr r))
         (nmatch (+ (re-nsub rex) 1))
         (rm (make-regmatch-t nmatch))
         (result (base-regexec rex str nmatch rm flags)))
    (cond
      ((= result 0) <convert data in rm to list of index pairs>)
      ((= result REG_NOMATCH) nil)
      (t (raise-regex-error "regexec" result rex)))))

Defines regexec (links are to index).

The list of index pairs is constructed by

<convert data in rm to list of index pairs>= (U->)
(let ((val nil))
  (dotimes (i nmatch (nreverse val))
    (let ((so (rm-so rm i))
          (eo (rm-eo rm i)))
      (push (if (or (= so -1) (= eo -1))
                nil
              (cons so eo))
            val))))

Errors are raised by calling raise-regex-error. This function generates an error message by calling regerror. regerror is called twice, once with an empty string and buffer size zero to get the required buffer size, and once with a buffer of the correct size. The buffer is filled with a null-terminated string; the subseq trims off the null character.

<support functions for the lower-level>+= (<-U) [<-D]
(defun raise-regex-error (fun result rex)
  (let ((size (regerror result rex "" 0))
        (buf (make-string size)))
    (regerror result rex buf size)
    (error "~a in ~a" (subseq buf 0 (- size 1)) fun)))

The Higher-Level Component

The higher level component consist of Lisp code that builds on the low level.

<higher-level component>= (<-U)
<get-substrings definition>
<regexp definition>
<regsub definition>
<url-decode definition>

The higher-level component exports three functions.

<exported function symbols>+= (<-U U->) [<-D]
regexp regsub url-decode

The regexp function calls regcomp and regexpr and then turns the resulting list of index pairs into a list of substrings using get-substrings.

<regexp definition>= (<-U)
(defun regexp (pat str &key ignore-case (extended t) index-only)
  (let* ((rex <compile regular expression for pat>)
         (pairs (regexec rex str 0)))
    (if index-only pairs (get-substrings str pairs))))

Defines regexp (links are to index).

<compile regular expression for pat>= (<-U U->)
(let ((icase (if ignore-case REG_ICASE 0))
      (ext (if extended REG_EXTENDED 0)))
  (regcomp pat (logior icase ext)))

The function get-substrings is

<get-substrings definition>= (<-U)
(defun get-substrings (str val)
  (mapcar #'(lambda (x) (if x (subseq str (car x) (cdr x)) nil))
          val))

Defines get-substrings (links are to index).

The regsub function is defined as

<regsub definition>= (<-U)
(defun regsub (pat str sub &key ignore-case (extended t) all)
  (let ((rex <compile regular expression for pat>)
        (head "")
        (tail str))
    (loop
     (let ((val (regexec rex tail 0)))
       (unless val (return (concatenate 'string head tail)))
       (let* ((sval <string to substitute>)
              (start (car (first val)))
              (end (cdr (first val)))
              (tail-head (subseq tail 0 start)))
         (setf head (concatenate 'string head tail-head sval))
         (setf tail (subseq tail end))))
     (unless all (return (concatenate 'string head tail))))))

Defines regsub (links are to index).

If sub is itself a string then it is substituted directly. Otherwise it is assumed to be a function and is applied to the match's substrings.

<string to substitute>= (<-U)
(if (stringp sub)
    sub
  (apply sub (get-substrings tail val)))

Module Index File

The _autoidx.lsp file for the regular expression module is

<_autoidx.lsp>=
<package and module setup>

(system:define-autoload-module "regexp"
  (variable <exported variable symbols>)
  (function <exported function symbols>))

Test Code for C Regular Expressions

This is a little test file to make sure the C regular expression functions are working.

<rtest.c>=
#include <stdio.h>
#include <sys/types.h>
#include <regex.h>

int match(char *string, char *pattern)
{
  int i;
  regex_t re;
  char buf[256];

  i=regcomp(&re, pattern, REG_EXTENDED|REG_NOSUB);
  if (i != 0) {
    (void)regerror(i,&re,buf,sizeof buf);
    printf("%s\n",buf);
    return(0);                       /* report error */
  }
  i = regexec(&re, string, (size_t) 0, NULL, 0);
  regfree(&re);
  if (i != 0) {
    (void)regerror(i,&re,buf,sizeof buf);
    printf("%s\n",buf);
    return(0);                       /* report error */
  }
  return(1);
}

void main()
{
  printf("%s\n", match("ABCDE", "[A-Z]*") ? "success" : "failure");
}

References

[1] Brent B. Welch. Practical Programming in Tcl and Tk. Prentice-Hall, Upper Saddle River, NJ, 2nd edition, 1997.

Indices

Chunks

<_autoidx.lsp>: D1
<compile regular expression for pat>: U1, D2, U3
<compound data wrappers>: U1, D2
<constant wrappers>: U1, D2, D3
<convert data in rm to list of index pairs>: U1, D2
<exported function symbols>: U1, D2, D3, U4
<exported variable symbols>: U1, D2, U3
<function wrappers>: U1, D2, D3
<get-substrings definition>: U1, D2
<higher-level component>: U1, D2
<Lisp regex data structure>: U1, D2, D3
<lower-level component>: U1, D2
<package and module setup>: U1, D2, U3
<public low-level functions>: U1, D2, D3
<regexp definition>: U1, D2
<regexp.wrp>: D1
<regsub definition>: U1, D2
<rtest.c>: D1
<string to substitute>: U1, D2
<support functions for the lower-level>: U1, D2, D3
<url-decode definition>: D1, U2

Identifiers

base-regcomp: D1, U2
base-regexec: D1, U2
get-substrings: U1, D2, U3
make-regex-t: D1, U2
make-regmatch-t: D1, U2
print-regexpr: D1
regcomp: U1, U2, D3, U4, U5
regexec: U1, U2, D3, U4, U5, U6
regexp: U1, D2, U3, U4, U5, U6, U7, D8, U9
regexp-free: D1
REG_EXTENDED: D1, U2, U3, U4
REG_ICASE: D1, U2, U3
REG_NEWLINE: D1, U2
REG_NOMATCH: D1, U2
REG_NOSUB: D1, U2, U3
REG_NOTBOL: D1, U2
REG_NOTEOL: D1, U2
regsub: U1, U2, D3
re-nsub: D1, U2
rm-eo: D1, U2
rm-so: D1, U2
url-decode: D1, U2

`REG_EXTENDED`	Use extended regular expressions
`REG_NEWLINE`	Special handling of newline characters
`REG_NOSUB`	Report only success/fail in `regexec`
`REG_ICASE`	Ignore case in match.

`REG_NOTBOL`	First character of string is not the beginning of the line
`REG_NOTEOL`	Last character of string is not the end of the line