Patterns

Regular expressions have been used with great success in text editing and casual searching tools. They are an easy way of describing and matching patterns in texts. Regular expressions have infiltrated programming languages, becoming standard features in Perl and JavaScript. The regular expression notation itself has continued to grow past the point of regularity.

One of the critical activities in secure programming is input validation. Regular expressions are poorly suited to validation because the notation is so cryptic. If the validation rules are too confusing to read, how can we have confidence that they are correct?

As an alternative, the Misty Programming Language has patterns. A pattern is a special kind of object that can be used to match patterns in texts. Patterns are less terse than regular expressions, and much easier to read and write.

A pattern literal, written as ¶pilcrow followed by a pattern expression, produces an immutable pattern object.

An Example

We want to extract information from a url. We define an lib library record containing two patterns (ip4 and name) and a url pattern that uses the lib. The ¶pilcrow character introduces a pattern, much as ƒflorin introduces a function.

def lib: {
    ip4: ¶ {4 {"octet": 1-3 digit / "."}}
    name: ¶ {1- [letter digit "_-%"]}
}
def url: ¶ (lib) {
    - {"scheme": name ":"}
    "slash": 0-3 "/"
    "host":  {ip4, 1- {name / "."}}
    - {":" "port": 1-5 digit}
    - {"/" "path": 0- [/ "?#" Cc]}
    - {"?" "query": 0- [/ "#" Cc]}
    - {"#" "hash": 0- char}
}

We can use the extract function to match a pattern against a text.

var result: extract("http://www.mistysystem.com/index.html#top", url)

json.encode(result, true) is

{
    hash: "top"
    host: "www.mistysystem.com"
    path: "index.html"
    scheme: "http"
    slash: "//"
}

We can call match again with a different text:

set result: extract("216.168.224.111:9997?clearance=top", url)

json.encode(result, true) is

{
    host: "216.168.224.111"
    octet: ["216", "168", "224", "111"]
    port: "9997"
    query: "clearance=top"
    slash: ""
}

To contrast, this is a JavaScript regular expression that does a similar thing:

/^(?:([A-Za-z\-_%0-9]+):)?(\/{0,3})([A-Za-z\-_%0-9]+(?:\.[A-Za-z\-_%0-9]+)*)(?:\:(\d+))?
(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?$/

The regular expression is much more compact than the pattern, but it is also much harder to understand.

Let's take the pattern apart. First, we had a library pattern for matching a dotted IP addresses:

ip4: ¶ {4 {"octet": 1-3 digit / "."}}

It matches exactly four groups of one to three digits separated by .period. Each group of digits is stored in the result under the name octet. So

lib.id.extract("198.137.240.92")
    # {octet: ["198", "137", "240", "92"]}

The /slash operator allows us to specify a separator. The pattern could be written without the /slash operator but the pattern would be longer.

ip4: ¶ {"octet": 1-3 digit 3 {"." octet: 1-3 digit}}

If we attached the name octet: on an outer level

ip4: ¶ {"octet": 4 {1-3 digit / "."}}

then the result would have been

    # {octet: "198.137.240.92"}

The second pattern matches a name.

name: ¶ {1- [letter digit "_-%"]}

The definition of name in this case is pretty loose: any combination of letters, digits, and the characters _underbar, -minus sign, and %percent sign.

filename: lib.name.extract{"doom.dust"}
    # null because a name can't contain "."

The url pattern matches urls. It matches in seven steps.

It optionally matches a name followed by a :colon. If they match, the name is saved in the result record in the scheme field. A prefix of -minus sign means the same as 0-1. The item can occur once or not at all; it is optional.
```
- {"scheme": name ":"} 
```
If it had been written as
```
- "scheme": {name ":"} 
```
then the :colon would be saved as well as the name.
It matches between 0 and 3 slash characters and saves them under the name slash.
```
"slash": 0-3 "/"
```
It matches a host name or address. This component is not optional. It is saved under the name host. First we attempt to match using the ip pattern we defined earlier. If that does not match, then we'll match a dotted domain name (one or more names separated by dots).
```
"host": {ip4, 1- {name / "."}}
```
It optionally matches a port number. If it is found, it is saved under the name port.
```
- {":" "port": 1-5 digit}
```
It optionally matches a path. The path can contain any sequence of characters except control characters and "?" and "#". We are reserving "?" and "#" for the last two options. If a path is found, it is saved under the name path.
```
- {"/" "path": 0- [/ "?#" Cc]}
```
It optionally matches a query text. The query text can contain any sequence of characters except control characters and "#". We are saving "#" for the final option. If a query text is found, it is saved under the name query.
```
- {"?" "query": 0- [/ "#" Cc]}
```
It optionally matches a hash (or anchor name). If it is found, it is saved under the name hash.
```
- {"#" "hash": 0- char}
```

Patterns are made of the same tokens as the rest of the Misty Programming Language. That has benefits in development tools. Features like paren balancing are not imperiled by the mixing of language conventions. Patterns can contain comments, just as functions can.

The pattern? function can be used to sense patterns.

pattern?(url)       # true
pattern?(lib)       # false
pattern?(lib.ip4)   # true
pattern?("pattern") # false
stone?(url)         # true
record?(url)        # false

Pattern Library

A pattern library is a record containing patterns that can be used by other patterns. The lib record in the first example is a pattern library.

A pattern library can contain patterns, texts, and functions.

Texts

When an ordinary text is used as a pattern, it will match identical text.

Functions

A function in a pattern library should have this signature:

ƒ (text, from)

A function will be passed the complete text being matched and the index of the character that is currently being considered. If the function does not match, then it must return 0 or false. If the function does find a match, it must return the number of characters that are matched. It may also return true if it matched exactly one character.

When a function is called by a character class, then text will be the character under consideration, and from will be 0. It should return 1 or true if the character is in the class or null or false if the character is not in the class.

Functions make it possible to efficiently express a much larger class of patterns.

Example:

ƒ end(text, from) {

# Match a carriage return and/or linefeed.

    if text[from] = "\n"
        return 1
    fi
    if text[from] = "\r"
        if text[from + 1] = "\n"
            return 2
        fi
        return 1
    fi
    return 0
}

Pattern expressions

pattern_literal '¶' space optional_pattern_library pattern_group

optional_pattern_library "" '(' pattern_library_list ')' space

pattern_library_list indent expression pattern_library_list_open outdent expression pattern_library_list_closed

pattern_library_list_open "" linebreak expression pattern_library_list_open

pattern_library_list_closed "" ',' space expression pattern_library_list_closed

A pattern literal is the ¶pilcrow optionally followed by one or more pattern libraries wrapped in ( )parens, followed by a pattern group wrapped in { }braces.

Group

pattern_group '{' pattern_group_filling '}'

pattern_group_filling indent open_pattern_group outdent closed_pattern_group

open_pattern_group pattern_sequence optional_open_pattern_group

optional_open_pattern_group "" linebreak optional_open_pattern_separator

optional_open_pattern_separator open_pattern_group '/' space pattern_item

closed_pattern_group pattern_sequence optional_closed_pattern_group

optional_closed_pattern_group "" ',' space closed_pattern_group space '/' space pattern_item

pattern_sequence pattern_item space pattern_sequence pattern_item

pattern_item pattern_depot pattern_quantifier pattern_element pattern_quantifier pattern_depot pattern_element

pattern_element pattern_group pattern_class pattern_text name

A pattern group is wrapped in wrapped in ( )parens. A group matches if all of the pattern expressions in it match.

Alternatives are separated by line breaks or ,comma. A set of alternatives matches if any one of the alternatives matches.

Optionally, an item after a /slash indicates a separator that appears before all but the first repetition. This is allowed only if the maximum number of repetitions is at least 2.

Text literal

pattern_text text_literal '!' space text_literal

A text in quotes matches characters in the text. Normally in Misty, text comparisons are case sensitive. In patterns, text comparison is also case sensitive, except when a text is preceded by !exclamation point which allows lowercase letters to match uppercase letters.

"vorpal" matches "vorpal"

! "vorpal" matches "vorpal" and "Vorpal" and "VorPal" and "VORPAL"

! "Vorpal" matches "Vorpal" and "VorPal" and "VORPAL" but not "vorpal"

Built-in class	Matches
`ascii`	any of the 128 ASCII characters `["\u{0}" - "\u{7F}"]`
`char`	any character `[/]`
`digit`	a digit `["0"` - `"9"]`
`end`	carriage return and/or linefeed `{"\n", "\r" - "\n"}`
`hex`	any hexadecimal digit `["0" - "9" "A" - "F"]`
`letter`	any Unicode letter `[Ll Lu Lt Lm Lo]`
`lower`	any lower-case Unicode letter `[Ll]`
`space`	any Unicode spacing character `[Zs]`
`upper`	any upper-case Unicode letter `[Lu]`
`Lu`	Letter, uppercase
`Ll`	Letter, lowercase
`Lt`	Letter, titlecase
`Lm`	Letter, modifier
`Lo`	Letter, other
`Mn`	Mark, nonspacing
`Mc`	Mark, spacing combining
`Me`	Mark, enclosing
`Nd`	Number, decimal digit
`Nl`	Number, letter
`No`	Number, other
`Pc`	Punctuation, connector
`Pd`	Punctuation, dash
`Ps`	Punctuation, open
`Pe`	Punctuation, close
`Pi`	Punctuation, initial quote
`Pf`	Punctuation, final quote
`Po`	Punctuation, other
`Sm`	Symbol, math
`Sc`	Symbol, currency
`Sk`	Symbol, modifier
`So`	Symbol, other
`Zs`	Separator, space
`Zl`	Separator, line
`Zp`	Separator, paragraph
`Cc`	Other, control
`Cf`	Other, format
`Cs`	Other, surrogate
`Co`	Other, private use
`Cn`	Other, not assigned (including noncharacters)

Name

A name can refer to one of the system-supplied names. These include ascii, char, digit, end, hex, letter, lower, space, upper as well as the two character Unicode General Categories. The Unicode General Categories are only available in character classes.

A name can refer to a user-defined pattern in a library. This allows the breaking of a complex pattern into simpler subpatterns. It also allows for the recognition of context-free languages.

A name can refer to a user-defined text in a library.

A name matches if the thing it represents matches.

Character class

pattern_class '[' pattern_set ']'

pattern_set "" name optional_pattern_set pattern_text_range optional_pattern_set '/' optional_pattern_set

optional_pattern_set "" space pattern_set

pattern_text_range text_literal pattern_text_extent

pattern_text_extent "" space '-' space pattern_text_range

A character class specification is wrapped in brackets. A set is a sequence of zero or more names or texts, and contains the union of those characters.

[letter digit]   # matches any letter or digit.

Name

A name can refer to one of the built-in character classes, or to a library text, character class, or function.

Literal text

A text contains characters in the class. You can have one big text or lots of little texts.

Range

If there is a -minus sign between two texts, then an inclusive range is made from the last character of the first text thru the first character of the second text.

["0" - "9"] matches a digit, as does ["0123456789"].

Anticlass

The /slash signals characters to be excluded from the class.

["a" - "z" / "aeiou"]    # matches a lower case consonant.

[/ digit]                # matches any character that is not a digit.

A second /slash signals an intersection operation on the current set.

[ascii / / upper]            # matches upper case ASCII letters.

Repetition

Quantifier	Repeats
`-`	`0` or 1 (optional)
`number`	repeat count
`from` `-` `thru`	in the range
`from` `-`	at least

pattern_quantifier "" '-' space number_literal '-' number_literal space number_literal '-' space number_literal space

By default a pattern expression matches exactly once.

item

Any pattern expression can be preceded by a repetition factor. It determines the number of times that the item must match.

5 item

A range can be specified with two numbers separated by -minus. The item will match no fewer than the first number, and no more than the second number.

1-5 item

If the second number is omitted, then there is no limit to how many times the item can be matched.

1- item

An optional item is indicated by a prefix of -minus sign. This indicates that the item may match zero or one times.

- item

If the next to last item in a pattern group is /slash then the last item in the group is matched before every iteration except the first. This is a convenient way to specify a separator.

If an iteration matches emptiness, then the repetition count is satisfied, even if the minimum number of repetitions has not been reached yet.

Saving

pattern_depot "" text_literal ':' space name ':' space

Patterns can be used to extract information from texts. The extract method returns a record containing the saved matching material.

To save material, place a text or a name and a :colon immediately before any pattern expression. All of the text that is matched by that pattern expression will be saved in the result record, using the text or name as the key. If multiple matches are made (because of repetition or multiple uses of the same text), then all of the matches will be saved in an array under the text or name.

Pattern Functions

Patterns can be used with some of the built-in functions.

array(text, pattern)

extract(text, pattern, from, to)

replace(text, pattern, replacement, limit)

search(text, pattern, from)

For example, the array() method can be used to break the contents of a file into an array of lines. If the separator pattern is found in the text, then the subtext before the match is stored in the new array, and the match is repeated on the remainder until no more matches are possible. The remainder of the text is stored in the array and the array is result.

var lines: array(file, ¶ {end})

This will break on linefeed, carriage return, and carriage return linefeed.