Welcome to the surlex documentation.
Surlex is a domain-specific language designed for pattern matching and data capture using a minimal syntax. It is similar in concept to regex and is, in fact, a regex generator. It can be described as a language that embraces and simplifies a subset of regular expression features, keeping the power of regex available while prioritizing syntactic clarity.
Surlex was originally designed for matching URLs. Consider the following surlex that matches a URL:
/articles/<year:Y>/<slug:s>/(<page:#>/)
This surlex would match the following URL:
/articles/2009/people-like-simplicity/3/
This match would produce the following data dictionary:
{
'year': '2009',
'slug': 'people-like-simplicity',
'page': '3',
}
The page number is optional, so if it is left off the URL, the surlex still matches but extracts only the year and slug.
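As a rough illustration of the optional page segment, the sketch below uses Python's standard `re` module with the regex that this surlex translates to (the same regex appears in the `to_regex` output later in this document). The helper name `extract` and the choice to drop unmatched groups are assumptions made for this illustration; they are not part of surlex itself.

```python
import re

# Regex equivalent of /articles/<year:Y>/<slug:s>/(<page:#>/)
ARTICLE = re.compile(
    r"/articles/(?P<year>\d{4})/(?P<slug>[\w-]+)/((?P<page>\d+)/)?"
)


def extract(url):
    """Return the captured values for url, or None if it does not match.

    Hypothetical helper for this example only.
    """
    m = ARTICLE.match(url)
    if m is None:
        return None
    # Drop groups that took no part in the match (e.g. a missing page).
    return {k: v for k, v in m.groupdict().items() if v is not None}


with_page = extract("/articles/2009/people-like-simplicity/3/")
without_page = extract("/articles/2009/people-like-simplicity/")
print(with_page)
print(without_page)
```

With plain `re`, an optional group that does not take part in the match is reported as `None`, which is why the helper filters those entries out.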
This syntax diagram describes the entirety of surlex syntax:
The syntax of surlex is minimalistic and intended to provide the most concise form possible for the extraction of named patterns.
Like regular expressions, surlex expressions will match a given input based on two things: normal characters and specialized metacharacters.
Normal characters, such as alphanumeric characters, match the input string just like a simple search. This means that the surlex `surlex` will match the input string `Maybe surlex can do that.`
Certain characters have special meaning in a surlex expression.
A caret at the beginning of a surlex expression makes the surlex match only at the beginning of the input string. For example, `^surlex` will match `surlex finds slugs.` but not `Slugs will be found by surlex.` This is the same behavior that regex provides.
A dollar sign at the end of a surlex expression has the opposite effect of a caret: it matches only at the end of the input string. This means that `Surlex.$` will match `I found this thing called Surlex.` but not `Surlex might have no use case.`
An asterisk is a standard wildcard; it will match anything. It is the same as the regex `.*`.
By wrapping a section of a surlex expression in parentheses, you mark that section of the pattern as optional, so it is not required to match. This is equivalent to wrapping a regex in `(` and `)?`.
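The metacharacters described so far can be tried out against their regex equivalents with Python's standard `re` module. This is only a sketch: the regexes are written by hand rather than generated by surlex, the literal period is assumed to translate to `\.` (as in the table of examples in this document), and the `/files/*` and `colo(u)r` patterns are made up for illustration.

```python
import re


def found(regex, text):
    """True if regex matches anywhere in text."""
    return re.search(regex, text) is not None


checks = [
    # (hand-written regex equivalent, input, expected result)
    (r"^surlex", "surlex finds slugs.", True),               # surlex: ^surlex
    (r"^surlex", "Slugs will be found by surlex.", False),
    (r"Surlex\.$", "I found this thing called Surlex.", True),  # surlex: Surlex.$
    (r"Surlex\.$", "Surlex might have no use case.", False),
    (r"/files/.*", "/files/anything/at/all", True),          # surlex: /files/*
    (r"^colo(u)?r$", "color", True),                         # surlex: ^colo(u)r$
    (r"^colo(u)?r$", "colour", True),
    (r"^colo(u)?r$", "colouur", False),
]

results = [found(regex, text) for regex, text, _ in checks]
for (regex, text, expected), got in zip(checks, results):
    print(regex, repr(text), got)
```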
A pair of angle brackets containing some text is called a surlex tag and is handled specially. These tags are the most important feature of surlex in terms of its pattern-matching capabilities.
Surlex tags are special pattern-matching objects that fall into three categories:
- Simple tags
- Regex tags
- Macro tags
A simple tag contains a variable name between angle brackets. For example, the simple tag <name> is used in the following surlex:
My name is <name>.
This is equivalent to the following regex:
My name is (?P<name>.+)\.
A regex tag is the same as a simple tag with the addition of an equals sign (=) followed by a regex after the tag name. For example, the following surlex matches a 5-digit number and assigns it to the variable zipcode:
My house is in zip code <zipcode=[0-9]{5}>.
This is equivalent to the following regex:
My house is in zip code (?P<zipcode>[0-9]{5})\.
A macro tag is the same as a simple tag with the addition of a colon (:) followed by a macro name after the tag name. For example, the following surlex uses the Y macro to match a 4-digit year and assign it to the variable year:
It is <year:Y>.
This is equivalent to the following regex:
It is (?P<year>\d{4})\.
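To see what each kind of tag captures, the regex equivalents of the three example sentences can be run through Python's standard `re` module. This is a sketch, not surlex output: the input sentences are made up, and the macro example assumes that `Y` expands to `\d{4}` with the trailing period escaped, in line with the other regexes in this document.

```python
import re

examples = [
    # (tag kind, regex equivalent, made-up input sentence)
    ("simple", r"My name is (?P<name>.+)\.", "My name is Alice."),
    ("regex", r"My house is in zip code (?P<zipcode>[0-9]{5})\.",
     "My house is in zip code 90210."),
    ("macro", r"It is (?P<year>\d{4})\.", "It is 2009."),
]

captured = {}
for kind, regex, sentence in examples:
    m = re.match(regex, sentence)
    # groupdict() maps each named group (the tag name) to the text it matched.
    captured[kind] = m.groupdict() if m else None
    print(kind, captured[kind])

# A zip code with too few digits does not satisfy the regex tag.
short_zip = re.match(examples[1][1], "My house is in zip code 123.")
print(short_zip)
```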
Surlex | Regex equivalent | Matches | Extracts |
---|---|---|---|
/<product>/<option>.html | /(?P<product>.+)/(?P<option>.+)\.html | /shirt/green.html | {'product': 'shirt', 'option': 'green'} |
/<product>/<option>.* | /(?P<product>.+)/(?P<option>.+)\..* | /shirt/red.anything | {'product': 'shirt', 'option': 'red'} |
/things/edit/<slug:s>/ | /things/edit/(?P<slug>[\w-]+)/ | /things/edit/thing-1/ | {'slug': 'thing-1'} |
/real/regex/<=.*$> | /real/regex/.*$ | /real/regex/anything | {} |
/blog/(<year:Y>/) | /blog/((?P<year>\d{4})/)? | /blog/2009/ | {'year': '2009'} |
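The rows of the table can be checked mechanically. The sketch below feeds each "Regex equivalent" and "Matches" pair to Python's standard `re` module and compares the named groups with the "Extracts" column. The `extract` helper is a hypothetical name used only here; surlex is not imported.

```python
import re

# (regex equivalent, input, expected extraction), one entry per table row.
ROWS = [
    (r"/(?P<product>.+)/(?P<option>.+)\.html", "/shirt/green.html",
     {"product": "shirt", "option": "green"}),
    (r"/(?P<product>.+)/(?P<option>.+)\..*", "/shirt/red.anything",
     {"product": "shirt", "option": "red"}),
    (r"/things/edit/(?P<slug>[\w-]+)/", "/things/edit/thing-1/",
     {"slug": "thing-1"}),
    (r"/real/regex/.*$", "/real/regex/anything", {}),
    (r"/blog/((?P<year>\d{4})/)?", "/blog/2009/", {"year": "2009"}),
]


def extract(regex, text):
    """Return the named groups captured from text, or None on no match."""
    m = re.match(regex, text)
    return None if m is None else m.groupdict()


extracted = [extract(regex, text) for regex, text, _ in ROWS]
for (regex, text, expected), got in zip(ROWS, extracted):
    print(text, got, got == expected)
```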
The Surlex class exposes most of the module's functionality. It is initialized with a surlex expression:
>>> from surlex import Surlex
>>> surlex = Surlex('/articles/<year:Y>/<slug:s>/(<page:#>/)')
The equivalent regex is available through the to_regex property:
>>> print(surlex.to_regex)
/articles/(?P<year>\d{4})/(?P<slug>[\w-]+)/((?P<page>\d+)/)?
To match a given URL against the surlex object, run Surlex.match:
>>> print(surlex.match('/articles/2009/people-like-simplicity/3/'))
{'year': '2009', 'page': '3', 'slug': 'people-like-simplicity'}
The full parse tree is available for additional hacking at surlex.node_list:
>>> print(surlex.node_list)
[<TextNode "/articles/">, <MacroTagNode year: Y>, <TextNode "/">, <MacroTagNode slug: s>, <TextNode "/">, <OptionalNode: [<MacroTagNode page: #>, <TextNode "/">]>]
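To make the surlex-to-regex translation concrete, here is a toy translator written for this document. It is not the library's implementation and does not use its parser or node classes. The names `toy_surlex_to_regex`, `translate_text`, `translate_tag`, and `MACROS` are hypothetical, only the three macros that appear in this document are included, and backslash escapes or regex tags containing `>` are not handled.

```python
import re

# Macro expansions taken from the examples in this document; surlex itself
# may define more macros than these three.
MACROS = {"Y": r"\d{4}", "s": r"[\w-]+", "#": r"\d+"}

TAG = re.compile(r"<([^>]*)>")                    # a surlex tag: <...>
TAG_BODY = re.compile(r"(\w*)([=:]?)(.*)", re.S)  # name, separator, rest


def translate_text(text):
    """Translate the literal text between tags, one character at a time."""
    out = []
    for ch in text:
        if ch == "*":
            out.append(".*")        # wildcard
        elif ch == ")":
            out.append(")?")        # closes an optional section
        elif ch in ".+?[]{}|\\":
            out.append("\\" + ch)   # escape characters special to regex
        else:
            out.append(ch)          # includes "(", "^" and "$", passed through
    return "".join(out)


def translate_tag(body):
    """Translate the text found between < and >."""
    name, sep, rest = TAG_BODY.fullmatch(body).groups()
    if sep == "=":
        pattern = rest              # regex tag: use the regex as written
    elif sep == ":":
        pattern = MACROS[rest]      # macro tag: look up the macro name
    elif rest:
        raise ValueError("unsupported tag: <%s>" % body)
    else:
        pattern = ".+"              # simple tag: match one or more characters
    # A tag without a name (e.g. <=.*$>) matches but captures nothing.
    return "(?P<%s>%s)" % (name, pattern) if name else pattern


def toy_surlex_to_regex(surlex):
    out, pos = [], 0
    for tag in TAG.finditer(surlex):
        out.append(translate_text(surlex[pos:tag.start()]))
        out.append(translate_tag(tag.group(1)))
        pos = tag.end()
    out.append(translate_text(surlex[pos:]))
    return "".join(out)


regex = toy_surlex_to_regex("/articles/<year:Y>/<slug:s>/(<page:#>/)")
print(regex)
print(re.match(regex, "/articles/2009/people-like-simplicity/3/").groupdict())
```

For the surlex expressions used as examples in this document, the toy translator yields the same regex strings as those listed, which makes it a convenient way to experiment with the syntax without reading the real parser.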