Regular expressions have been used with great success in text editing and casual searching tools. They are an easy way of describing and matching patterns in texts. Regular expressions have infiltrated programming languages, becoming standard features in Perl and JavaScript. The regular expression notation itself has continued to grow past the point of regularity.
One of the critical activities in secure programming is input validation. Regular expressions are poorly suited to validation because the notation is so cryptic. If the validation rules are too confusing to read, how can we have confidence that they are correct?
As an alternative, the Misty Programming Language has patterns. A pattern is a special kind of object that can be used to match patterns in texts. Patterns are less terse than regular expressions, and much easier to read and write.
A pattern literal, written as pattern
followed by
a pattern expression, produces an immutable pattern object.
We want to extract information from a url. We define an lib
library record containing two patterns (ip4
and name
)
and a url
pattern that uses the lib
.
The word
introduces a pattern,
much as function
introduces a function. Patterns may not be created in a function that has a pattern variable or input in its scope.
def lib: { ip4: pattern {4 {"octet": 1-3 digit / "."}} name: pattern {1- [letter digit "_-%"]} } def url: pattern (lib) { - {"scheme": name ":"} "slash": 0-3 "/" "host": {ip4, 1- {name / "."}} - {":" "port": 1-5 digit} - {"/" "path": 0- [/ "?#" Cc]} - {"?" "query": 0- [/ "#" Cc]} - {"#" "hash": 0- char} }
We can use the extract
function to match a pattern against a text.
var result: extract("http://www.mistysystem.com/index.html#top", url)
json.encode(result, true)
is
{ hash: "top" host: "www.mistysystem.com" path: "index.html" scheme: "http" slash: "//" }
We can call match
again with a different text:
assign result: extract("216.168.224.111:9997?clearance=top", url)
json.encode(result, true)
is
{ host: "216.168.224.111" octet: ["216", "168", "224", "111"] port: "9997" query: "clearance=top" slash: "" }
To contrast, this is a JavaScript regular expression that does a similar thing:
/^(?:([A-Za-z\-_%0-9]+):)?(\/{0,3})([A-Za-z\-_%0-9]+(?:\.[A-Za-z\-_%0-9]+)*)(?:\:(\d+))? (?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?$/
The regular expression is much more compact than the pattern, but it is also much harder to understand.
Let's take the pattern apart. First, we had a library pattern for matching a dotted IP addresses:
ip4: pattern {4 {"octet": 1-3 digit / "."}}
It matches exactly four groups of one to three digits separated by .
period.
Each group of digits is stored in the result under the name octet
.
So
lib.id.extract("198.137.240.92") # {octet: ["198", "137", "240", "92"]}
The /
slash operator allows us to specify a separator.
The pattern could be written without the /
slash
operator but the pattern would be longer.
ip4: pattern {"octet": 1-3 digit 3 {"." octet: 1-3 digit}}
If we attached the name octet
: on an outer
level
ip4: pattern {"octet": 4 {1-3 digit / "."}}
then the result would have been
# {octet: "198.137.240.92"}
The second pattern matches a name.
name: pattern {1- [letter digit "_-%"]}
The definition of name
in this case is pretty loose: any combination
of letters, digits, and the characters _
underbar,
-
minus sign, and %
percent
sign.
filename: lib.name.extract{"doom.dust"} # null because a name can't contain "."
The url
pattern matches urls. It matches in seven steps.
It optionally matches a name
followed by a :
colon.
If they match, the name
is saved in the result record in the scheme
field. A prefix
of -
minus sign means the same as 0-1
.
The item can occur once or not at all; it is optional.
- {"scheme": name ":"}
If it had been written as
- "scheme": {name ":"}
then the :
colon would be saved as well as the name
.
slash
.
"slash": 0-3 "/"
host
. First
we attempt to match using the ip
pattern we defined earlier.
If that does not match, then we'll match a dotted domain name (one or
more names separated by dots).
"host": {ip4, 1- {name / "."}}
port
.
- {":" "port": 1-5 digit}
"?"
and "#"
. We are reserving "?"
and "#"
for the last two options. If a path is found, it is saved under the name path
.
- {"/" "path": 0- [/ "?#" Cc]}
"#"
. We are saving "#"
for the
final option. If a query text is found, it is saved under the name query
.
- {"?" "query": 0- [/ "#" Cc]}
hash
.
- {"#" "hash": 0- char}
Patterns are made of the same tokens as the rest of the Misty Programming Language. That has benefits in development tools. Features like paren balancing are not imperiled by the mixing of language conventions. Patterns can contain comments, just as functions can.
The pattern?
function can be used to sense patterns.
pattern?(url) # true pattern?(lib) # false pattern?(lib.ip4) # true pattern?("pattern") # false stone?(url) # true record?(url) # false
A pattern library is a record containing patterns that can be used by
other patterns. The lib
record in the first example is a pattern library.
A pattern library can contain patterns, texts, and functions.
When an ordinary text is used as a pattern, it will match identical text.
A function in a pattern library should have this signature:
(text, from)
A function
will be passed the complete text
being matched and the
index of the character that is currently being considered. If the function
does not match, then it must return 0
or false
. If the function
does find a match, it must return the number of characters that are
matched. It may also return true
if it matched exactly one character.
When a function is called by a character class, then text
will
be the character under consideration, and from
will be 0
.
It should return 1
or true
if the character is in the class or
null
or false
if the character is not in the class.
Functions make it possible to efficiently express a much larger class of patterns.
Example:
function end(text, from) { # Match a carriage return and/or linefeed. if text[from] = "\n" return 1 fi if text[from] = "\r" if text[from + 1] = "\n" return 2 fi return 1 fi return 0 }
pattern_literal
'pattern'
space optional_pattern_library pattern_group
optional_pattern_library
""
'('
pattern_library_list ')'
space
pattern_library_list indent expression pattern_library_list_open outdent expression pattern_library_list_closed
pattern_library_list_open
""
linebreak expression pattern_library_list_open
pattern_library_list_closed
""
','
space expression pattern_library_list_closed
A pattern literal is pattern
optionally followed
by one or more pattern libraries wrapped in ( )
parens, followed by a pattern group wrapped in { }
braces.
pattern_group
'{'
pattern_group_filling '}'
pattern_group_filling indent open_pattern_group outdent closed_pattern_group
open_pattern_group pattern_sequence optional_open_pattern_group
optional_open_pattern_group
""
linebreak optional_open_pattern_separator
optional_open_pattern_separator
open_pattern_group '/'
space pattern_item
closed_pattern_group pattern_sequence optional_closed_pattern_group
optional_closed_pattern_group
""
','
space closed_pattern_group
space '/'
space pattern_item
pattern_sequence pattern_item space pattern_sequence pattern_item
pattern_item pattern_depot pattern_quantifier pattern_element pattern_quantifier pattern_depot pattern_element
pattern_element pattern_group pattern_class pattern_text name
A pattern group is wrapped in wrapped in ( )
parens. A group
matches if all of the pattern expressions in it match.
Alternatives are separated by line breaks or ,
comma. A
set of alternatives matches if any one of the alternatives matches.
Optionally, an item after a /
slash indicates a separator
that appears before all but the first repetition. This is allowed only if the maximum number of repetitions is at least 2
.
pattern_text
text_literal
'!'
space text_literal
A text in quotes matches characters in the text. Normally in Misty, text
comparisons are case sensitive. In patterns, text comparison is also case
sensitive, except when a text is preceded by !
exclamation point
which allows lowercase letters to match uppercase letters.
"vorpal"
matches"vorpal"
! "vorpal"
matches"vorpal"
and"Vorpal"
and"VorPal"
and"VORPAL"
! "Vorpal"
matches"Vorpal"
and"VorPal"
and"VORPAL"
but not"vorpal"
Built-in class | Matches |
---|---|
ascii |
any of the 128 ASCII characters["\u{0}" - "\u{7F}"] |
char |
any character[/] |
digit |
a digit - "9"] |
end |
carriage return and/or linefeed
|
hex |
any hexadecimal digit["0" - "9" "A" - "F"] |
letter |
any Unicode letter[Ll Lu Lt Lm Lo] |
lower |
any lower-case Unicode letter[Ll] |
space |
any Unicode spacing character[Zs] |
upper |
any upper-case Unicode letter[Lu] |
Lu | Letter, uppercase |
Ll | Letter, lowercase |
Lt | Letter, titlecase |
Lm | Letter, modifier |
Lo | Letter, other |
Mn | Mark, nonspacing |
Mc | Mark, spacing combining |
Me | Mark, enclosing |
Nd | Number, decimal digit |
Nl | Number, letter |
No | Number, other |
Pc | Punctuation, connector |
Pd | Punctuation, dash |
Ps | Punctuation, open |
Pe | Punctuation, close |
Pi | Punctuation, initial quote |
Pf | Punctuation, final quote |
Po | Punctuation, other |
Sm | Symbol, math |
Sc | Symbol, currency |
Sk | Symbol, modifier |
So | Symbol, other |
Zs | Separator, space |
Zl | Separator, line |
Zp | Separator, paragraph |
Cc | Other, control |
Cf | Other, format |
Cs | Other, surrogate |
Co | Other, private use |
Cn | Other, not assigned (including noncharacters) |
A name can refer to one of the system-supplied names. These include ascii
,
char
,
digit
, end
, hex
, letter
, lower
, space
, upper
as well as the two character Unicode General Categories. The Unicode General Categories are only available in character classes.
A name can refer to a user-defined pattern in a library. This allows the breaking of a complex pattern into simpler subpatterns. It also allows for the recognition of context-free languages.
A name can refer to a user-defined text in a library.
A name matches if the thing it represents matches.
pattern_class
'['
pattern_set ']'
pattern_set
""
name optional_pattern_set
pattern_text_range optional_pattern_set
'/'
optional_pattern_set
optional_pattern_set
""
space pattern_set
pattern_text_range text_literal pattern_text_extent
pattern_text_extent
""
space '-'
space pattern_text_range
A character class specification is wrapped in brackets. A set is a sequence of zero or more names or texts, and contains the union of those characters.
[letter digit] # matches any letter or digit.
A name can refer to one of the built-in character classes, or to a library text, character class, or function.
A text contains characters in the class. You can have one big text or lots of little texts.
If there is a -
minus sign between two texts, then
an inclusive range is made from the last character of the first text thru the first character of the second text.
["0" - "9"]
matches a digit, as does ["0123456789"]
.
The /
slash signals characters to be excluded
from the class.
["a" - "z" / "aeiou"] # matches a lower case consonant. [/ digit] # matches any character that is not a digit.
A second /
slash signals an intersection
operation on the current set.
[ascii / / upper] # matches upper case ASCII letters.
Quantifier | Repeats |
---|---|
- |
0 or 1 (optional) |
number | repeat count |
from - thru |
in the range |
from - |
at least |
pattern_quantifier
""
'-'
space
number_literal '-'
number_literal space
number_literal '-'
space
number_literal space
By default a pattern expression matches exactly once.
item
Any pattern expression can be preceded by a repetition factor. It determines the number of times that the item must match.
5 item
A range can be specified with two numbers separated by -
minus.
The item will match no fewer than the first number, and no more than the
second number.
1-5 item
If the second number is omitted, then there is no limit to how many times the item can be matched.
1- item
An optional item is indicated by a prefix of -
minus sign.
This indicates that the item may match zero or one times.
- item
If the next to last item in a pattern group is /
slash
then the last item in the group is matched before every iteration except
the first. This is a convenient way to specify a separator.
If an iteration matches emptiness, then the repetition count is satisfied, even if the minimum number of repetitions has not been reached yet.
pattern_depot
""
text_literal ':'
space
name ':'
space
Patterns can be used to extract information from texts. The extract
method returns a record containing the saved matching material.
To save material, place a text or a name and a :
colon immediately
before any pattern expression. All of the text that is matched by that
pattern expression will be saved in the result record, using the text
or name as the key. If multiple matches are made (because of repetition or multiple
uses of the same text), then all of the matches will be saved
in an array under the text or name.
Patterns can be used with some of the built-in functions.
array(text, pattern)
extract(text, pattern, from, to)
replace(text, pattern, replacement, limit)
search(text, pattern, from)
For example, the array()
method can be used
to break the contents of a file into an array of lines. If the separator pattern is found in the text, then the subtext before the match
is stored in the new array, and the match is repeated on the remainder
until no more matches are possible. The remainder of the text is stored
in the array and the array is result.
var lines: array(file, pattern {end})
This will break on linefeed, carriage return, and carriage return linefeed.