Formal Models for String Patterns
A. C. Fleck

Abstract

Two models are developed that formalize common aspects of string patterns as they appear in various scripting languages and text processing systems. These models substantially generalize regular expressions, and draw upon motivation from the SNOBOL4 programming language. One of the models utilizes the formal languages paradigm and provides characterizations of the collections of strings that will match a given pattern. Pattern elements are formally defined, building from certain primitive elements (e.g., string constants) whose matching strings are pre-determined. Then certain pattern composition operations are introduced and their matching characteristics are described based on their constituent operands. Natural families of pattern elements correspond to regular and context-free collections of matched strings. Extensions natural to the formalism show how to construct patterns whose descriptive capacity exceeds these traditional formal language families.

A second formal model is introduced that provides a closer connection to the matching activity associated with pattern processing. This model uses functions from strings to finite counted sets. The idea is to model the movement of a "cursor" through a candidate matching string. The counted sets provide for describing multiple matching paths (and backtracking) in a natural and meaningful way. This model is used to more deeply investigate the interaction of advanced pattern combinations and is found to possess desirable algebraic properties.