regexp
and regsub
, based on the Tcl functions with the same
names[cite welch97:_pract_progr_tcl_tk]. The package is
REGULAR-EXPRESSIONS
with nickname REGEXP
.
The syntax for regexp
is
regexp pattern string &key ignore-case extended index-only
pattern
is a regular expression to be matched to string
. If
ignore-case
is true, the case of characters is ignored in
comparisons; by default it is false. If extended
is true, then
extended regular expressions are used (the REG_EXTENDED
flag is
given to regcomp
); extend
is true by default. If
index-only
is true, a list of (start . end)
pairs for the
matched substrings is returned. Otherwise, a list of the substrings is
returned. The default value is false.
The following example, adapted slightly from [cite welch97:_pract_progr_tcl_tk, Example
11-2], uses regexp
to decompose a URL
into its components:
> (use-package "REGEXP") T > (regexp "^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)" "http://stat.umn.edu:80/xyz") ("http://stat.umn.edu:80/xyz" "http://" "http" ;; protocol "stat.umn.edu" ;; host ":80" "80" ;; port "/xyz") ;; pathThe return value has been edited and comments added for clarity.
The function regsub
performs substitution based on pattern
matching. The syntax is
regsub pattern string sub &key ignore-case extended allA match of
pattern
in string
is replaced using sub
. If
all
is true then all occurrences are replaced; otherwise, only the
first is replaced. By default, all
is false. The substitution
argument sub
can be a string, which is used literally, or it can
be a function. The function is called with the substrings produced by
the match as arguments and should return a string to use as a
replacement for the match.
An example [These examples are based on the URL Decoding
example in [cite welch97:_pract_progr_tcl_tk, p. 128]] of the simple
form of regsub
:
> (regsub "\\+" "abc+def+ghi" " " :all t) "abc def ghi"To illustrate the more complex form, the following example replaces
%xx
hexadecimal encodings in URL's with the character they
represent.
> (regsub "%([0-9a-hA-H][0-9a-hA-H])" "%7ewelc%68" #'(lambda (a b) (string (int-char (let ((*read-base* 16)) (read-from-string b))))) :all t) "~welch"These two examples can be combined into a simple URL decoding function:
<url-decode
definition>= (U->)
(defun url-decode (url)
(flet ((parse-hex (s)
(let ((*read-base* 16))
(read-from-string s))))
(regsub "%([0-9a-hA-H][0-9a-hA-H])"
(regsub "\\+" url " " :all t)
#'(lambda (a b) (string (int-char (parse-hex b))))
:all t)))
Definesurl-decode
(links are to index).
An illustration:
> (url-decode "%7ewelc%68+book") "~welch book"
regcomp
can be used to
compile a regular expression that is then reused repeatedly in calls
to regexec
. The functions are ``Lispified'' by returning results
as Lisp objects and signaling errors when appropriate rather than
returning error codes. Currently these are in the same package as the
higher-level functions. [Perhaps this should be changed.]
There are six exported constants to be used as flags. These are
integer values that can be combined with logior
. The four flag
values for regcomp
are
The two flags for
REG_EXTENDED
Use extended regular expressions REG_NEWLINE
Special handling of newline characters REG_NOSUB
Report only success/fail in regexec
REG_ICASE
Ignore case in match.
regexec
are
REG_NOTBOL
First character of string is not the beginning of the line REG_NOTEOL
Last character of string is not the end of the line
The two public functions are regcomp
and regexec
. The syntax
for regcomp
is
regcomp pattern flagswhere
pattern
is a string containing a regular expression and
flags
is an integer flag constructed from the constants given
above. regcomp
returns a regular expression structure or signals
an error.
The syntax for regexec
is
regexec rex string flagswhere
rex
is a regular expression structure created by
regcomp
, string
is a string to be matched, and flags
is an
integer flag.
As an example, we can compile the URL pattern used above as
> (setf rex (regcomp "^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)" REG_EXTENDED)) #<Regular Expression "^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)">and use it to match a URL with
> (regexec rex "http://stat.umn.edu:80/xyz" 0) ((0 . 26) (0 . 7) (0 . 4) (7 . 19) (19 . 22) (20 . 22) (22 . 26))
regexp.wrp
.
<regexp.wrp>= <package and module setup> <lower-level component> <higher-level component>
The package and module setup code is
<package and module setup>= (<-U U->) (provide "regexp") (defpackage "REGULAR-EXPRESSIONS" (:use "COMMON-LISP") (:nicknames "REGEXP")) (in-package "REGEXP") (export '(<exported variable symbols> <exported function symbols>))
The package does not :use
use the C-WRAPPERS
package since the
wrappers are not needed at runtime. Instead, the wrapper macros are
referenced with the package nickname wrap
qualifier.
<lower-level component>= (<-U)
(wrap:c-lines "#include <sys/types.h>"
"#include <regex.h>")
<constant wrappers>
<compound data wrappers>
<function wrappers>
<Lisp regex
data structure>
<support functions for the lower-level>
<public low-level functions>
<constant wrappers>= (<-U) [D->] (wrap:c-constant REG_EXTENDED "REG_EXTENDED" :integer) (wrap:c-constant REG_NEWLINE "REG_NEWLINE" :integer) (wrap:c-constant REG_NOSUB "REG_NOSUB" :integer) (wrap:c-constant REG_ICASE "REG_ICASE" :integer) (wrap:c-constant REG_NOTBOL "REG_NOTBOL" :integer) (wrap:c-constant REG_NOTEOL "REG_NOTEOL" :integer)
DefinesREG_EXTENDED
,REG_ICASE
,REG_NEWLINE
,REG_NOSUB
,REG_NOTBOL
,REG_NOTEOL
(links are to index).
<exported variable symbols>= (<-U U->) REG_EXTENDED REG_NEWLINE REG_NOSUB REG_ICASE REG_NOTBOL REG_NOTEOL
In addition we need the no-match result constant for regexec
,
<constant wrappers>+= (<-U) [<-D] (wrap:c-constant REG_NOMATCH "REG_NOMATCH" :integer)
DefinesREG_NOMATCH
(links are to index).
but this does not need to be exported.
There are two compound C data types, regex_t
for compiled regular
expressions and regmatch_t
for the match results.
<compound data wrappers>= (<-U) (wrap:c-pointer "regex_t" (:make make-regex-t) (:get re-nsub "re_nsub" :integer)) (wrap:c-pointer "regmatch_t" (:make make-regmatch-t) (:get rm-so "rm_so" :integer) (:get rm-eo "rm_eo" :integer))
Definesmake-regex-t
,make-regmatch-t
,re-nsub
,rm-eo
,rm-so
(links are to index).
There are four functions. regcomp
compiles and regexec
executes regular expressions. The corresponding Lisp functions are
base-regcomp
and base-regexec
.
<function wrappers>= (<-U) [D->] (wrap:c-function base-regcomp "regcomp" ((:cptr "regex_t") :string :integer) :integer) (wrap:c-function base-regexec "regexec" ((:cptr "regex_t") :string :integer (:cptr "regmatch_t") :integer) :integer)
Definesbase-regcomp
,base-regexec
(links are to index).
The regfree
function releases storage allocated for compiled
expressions and regerror
fills a string with an error message.
<function wrappers>+= (<-U) [<-D] (wrap:c-function regfree "regfree" ((:cptr "regex_t")) :void) (wrap:c-function regerror "regerror" (:integer (:cptr "regex_t") :string :integer) :integer)
For the public interface, regular expressions are represented by a
Lisp structure with two slots, ptr
for holding the allocated C
data, and pat
for holding the pattern string.
<Lisp regex
data structure>= (<-U) [D->]
(defstruct (regexp (:print-function print-regexp)
(:constructor (make-regexp (ptr pat))))
ptr pat)
Definesregexp
(links are to index).
The printed representation is provided by
<Lisp regex
data structure>+= (<-U) [<-D]
(defun print-regexp (x s d)
(format s "#<Regular Expression ~s>" (regexp-pat x)))
Definesprint-regexpr
(links are to index).
The lower-level component exports two functions, the public versions
of regcomp
and regexec
.
<exported function symbols>= (<-U U->) [D->] regcomp regexec
The public version of regcomp
takes a pattern and a flags argument
and returns a regexp
structure or signals an
error. [perhaps it would be better to accept keywords or
optional arguments instead of flags.]
<public low-level functions>= (<-U) [D->] (defun regcomp (pat flags) (system:without-interrupts (let* ((rex (make-regex-t)) (result (base-regcomp rex pat flags)) (value (make-regexp rex pat))) (unless (= result 0) (raise-regex-error "regcomp" result rex)) (system:register-finalizer value #'regexp-free) value)))
Definesregcomp
(links are to index).
The regexp-free
function is called by the
finalization
mechanism when a regexp
object is garbage-collected.
<support functions for the lower-level>= (<-U) [D->] (defun regexp-free (x) (without-interrupts (let ((rex (regexp-ptr x))) (unless (null rex) (setf (regexp-ptr x) nil) (regfree rex)))))
Definesregexp-free
(links are to index).
Interrupts are disabled around the creation of the expression object
to insure that it will be registered for garbage collection. If
regexp-free
is only called from the finalization mechanism then
interrupts will already be suspended. The without-interrupts
wrapper is used to prevent an interrupt between removing the reference
and releasing the memory with regfree
if regexp-free
is ever
called directly.
The public regexec
function takes a regexp
object, a string,
and a flags argument and returns NIL
for no match, a list of index
pairs for a match, or signals an error.
<public low-level functions>+= (<-U) [<-D]
(defun regexec (r str flags)
(let* ((rex (regexp-ptr r))
(nmatch (+ (re-nsub rex) 1))
(rm (make-regmatch-t nmatch))
(result (base-regexec rex str nmatch rm flags)))
(cond
((= result 0) <convert data in rm
to list of index pairs>)
((= result REG_NOMATCH) nil)
(t (raise-regex-error "regexec" result rex)))))
Definesregexec
(links are to index).
The list of index pairs is constructed by
<convert data in rm
to list of index pairs>= (U->)
(let ((val nil))
(dotimes (i nmatch (nreverse val))
(let ((so (rm-so rm i))
(eo (rm-eo rm i)))
(push (if (or (= so -1) (= eo -1))
nil
(cons so eo))
val))))
Errors are raised by calling raise-regex-error
. This function
generates an error message by calling regerror
. regerror
is
called twice, once with an empty string and buffer size zero to get
the required buffer size, and once with a buffer of the correct size.
The buffer is filled with a null-terminated string; the subseq
trims off the null character.
<support functions for the lower-level>+= (<-U) [<-D] (defun raise-regex-error (fun result rex) (let ((size (regerror result rex "" 0)) (buf (make-string size))) (regerror result rex buf size) (error "~a in ~a" (subseq buf 0 (- size 1)) fun)))
<higher-level component>= (<-U) <get-substrings
definition> <regexp
definition> <regsub
definition> <url-decode
definition>
The higher-level component exports three functions.
<exported function symbols>+= (<-U U->) [<-D] regexp regsub url-decode
The regexp
function calls regcomp
and regexpr
and then
turns the resulting list of index pairs into a list of substrings
using get-substrings
.
<regexp
definition>= (<-U) (defun regexp (pat str &key ignore-case (extended t) index-only) (let* ((rex <compile regular expression forpat
>) (pairs (regexec rex str 0))) (if index-only pairs (get-substrings str pairs))))
Definesregexp
(links are to index).
<compile regular expression for pat
>= (<-U U->)
(let ((icase (if ignore-case REG_ICASE 0))
(ext (if extended REG_EXTENDED 0)))
(regcomp pat (logior icase ext)))
The function get-substrings
is
<get-substrings
definition>= (<-U)
(defun get-substrings (str val)
(mapcar #'(lambda (x) (if x (subseq str (car x) (cdr x)) nil))
val))
Definesget-substrings
(links are to index).
The regsub
function is defined as
<regsub
definition>= (<-U) (defun regsub (pat str sub &key ignore-case (extended t) all) (let ((rex <compile regular expression forpat
>) (head "") (tail str)) (loop (let ((val (regexec rex tail 0))) (unless val (return (concatenate 'string head tail))) (let* ((sval <string to substitute>) (start (car (first val))) (end (cdr (first val))) (tail-head (subseq tail 0 start))) (setf head (concatenate 'string head tail-head sval)) (setf tail (subseq tail end)))) (unless all (return (concatenate 'string head tail))))))
Definesregsub
(links are to index).
If sub
is itself a string then it is substituted
directly. Otherwise it is assumed to be a function and is applied to
the match's substrings.
<string to substitute>= (<-U) (if (stringp sub) sub (apply sub (get-substrings tail val)))
_autoidx.lsp
file for the regular expression module is
<_autoidx.lsp
>=
<package and module setup>
(system:define-autoload-module "regexp"
(variable <exported variable symbols>)
(function <exported function symbols>))
<rtest.c>= #include <stdio.h> #include <sys/types.h> #include <regex.h> int match(char *string, char *pattern) { int i; regex_t re; char buf[256]; i=regcomp(&re, pattern, REG_EXTENDED|REG_NOSUB); if (i != 0) { (void)regerror(i,&re,buf,sizeof buf); printf("%s\n",buf); return(0); /* report error */ } i = regexec(&re, string, (size_t) 0, NULL, 0); regfree(&re); if (i != 0) { (void)regerror(i,&re,buf,sizeof buf); printf("%s\n",buf); return(0); /* report error */ } return(1); } void main() { printf("%s\n", match("ABCDE", "[A-Z]*") ? "success" : "failure"); }
[1] Brent B. Welch. Practical Programming in Tcl and Tk. Prentice-Hall, Upper Saddle River, NJ, 2nd edition, 1997.
_autoidx.lsp
>: D1
pat
>: U1, D2, U3
rm
to list of index pairs>: U1, D2
get-substrings
definition>: U1, D2
regex
data structure>: U1, D2, D3
regexp
definition>: U1, D2
regsub
definition>: U1, D2
url-decode
definition>: D1, U2