Regular expressions provide an important foundation for learning systems. They are useful for quick and direct approaches to solving problems without creating mounds of training data or the infrastructure for deploying a model. While they are a common programming technique, and simple enough to employ, they tend to be used so infrequently that you must re-learn them each time you wish to apply them. This post summarizes the basic regex syntax, strategies, and workflow in hopes that it will decrease the time needed to implement them. A few different languages are used in the examples, for various scenarios. Happy re-learning!
Regex Basics
Operators
Character classes
abc, 123, \d, \D: matches exact characters, exact digits, any digit, any non-digit
\s, \S, \w, \W: matches whitespace, non-whitespace, word character (alphanumeric or underscore), non-word character
Boundaries
^, $, \b, \B: start of string, end of string, word boundary, not word boundary
Quantifiers
x*: matches zero or more x
x+: matches one or more x
x{m,n}: matches x repeated m to n times; a{4} matches aaaa
x?: optional - matches zero or one x
Groups, Ranges, and Capture
[xyz], (x|y|z): equals x or y or z
[^xyz]: not x or y or z
[x-z]: matches any character between x and z
[^x], [^xyz]: any character that is not x; any character that is not x, y, or z
(xyz), (xy(z)): capture group, capture group and sub-group
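As a quick check of the operators above, here is a minimal sketch using Python's re module. The sample strings are made up for illustration and are not taken from the example data.

```python
import re

# Character classes: \d matches a digit, \w a word character (letter, digit, or underscore)
digits = re.findall(r"\d", "room 42")
words = re.findall(r"\w+", "User_01 joined")

# Boundaries: \b anchors the match at a word boundary, so "concat" is skipped
whole_word = re.findall(r"\bcat\b", "cat concat cat")

# Quantifiers: {m,n} bounds the number of repetitions, ? makes an item optional
bounded = re.fullmatch(r"a{2,4}", "aaa") is not None
optional = re.findall(r"colou?r", "color colour")

# Ranges and negation: [x-z] is a range, [^xyz] excludes the listed characters
in_range = re.findall(r"[x-z]", "wxyz")
not_listed = re.findall(r"[^xyz]", "wxyz")

# Capture groups: nested groups are numbered by their opening parenthesis
nested = re.match(r"(xy(z))", "xyz").groups()

print(digits, words, whole_word, bounded, optional, in_range, not_listed, nested)
```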
Syntax patterns
RegEx libraries typically provide functionality and components using similar patterns, such as the following:
pattern - encapsulates the expression that is sought, using the above syntax (mostly language agnostic)
find - applies a pattern directly to text and returns nothing or a regex object
match - applies the pattern to text and returns a boolean indicating whether a match, or an exact match, exists
sub - substitutes matches of a pattern with a string - convenience functionality
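The four components above map roughly onto Python's re module as sketched below. The mapping is an illustration rather than a definitive equivalence, since each library names these operations a little differently; the sample text is one line from the example chat file.

```python
import re

# pattern: compile the expression once so it can be reused
pattern = re.compile(r"User_\d+")

text = "02/13/2019 08:42:15 User_01,has joined the room"

# find: search anywhere in the text; returns a match object, or None when nothing matches
found = pattern.search(text)
user = found.group(0) if found else None

# match: boolean tests for a partial match and for an exact (whole-string) match
has_match = pattern.search(text) is not None
is_exact = pattern.fullmatch(text) is not None

# sub: replace every match with a string
masked = pattern.sub("User_XX", text)

print(user, has_match, is_exact, masked)
```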
Common Solution Approaches
Example data
We will use the following file for example data.
%%bash
ls Data/Bloomberg_Chat
example_chat.txt
We will read-in and parse each line using a method similar to the following:
import java.io.InputStream
new File("./Data/Bloomberg_Chat/example_chat.txt").withInputStream { stream ->
    stream.eachLine { line ->
        println line
    }
}
Message#: 0
Message Sent: 02/13/2019 08:42:15
Subject: Instant Bloomberg Persistent
02/13/2019 08:42:15 User_01,has joined the room
02/19/2019 00:56:29 User_105,Says Cupiditate voluptas sunt velit. Accusantium aliquid expedita excepturi quis laborum autem. Quas occaecati et atque est repellat dolores. Laudantium in molestiae consequatur voluptate ipsa.
02/19/2019 00:55:35 User_68,has left the room
null
Walking
Walking is one of the most direct approaches. In the Walking method, you slowly move from the left to the right of your text, matching patterns along the way. Your target text will be everything at the end of the string.
import java.io.InputStream
new File("./Data/Bloomberg_Chat/example_chat.txt").withInputStream { stream ->
    stream.eachLine { line ->
        //find beginning
        trgt = line =~ ~/^Message\sSent:\s(.*)$/
        if(trgt){println trgt[0][1]}
    }
}
02/13/2019 08:42:15
null
Bracketing
The Bracketing method is taken from the similar technique used in field artillery to range your intended target. First, pattern the string that begins just before your target text. Next, pattern the string that ends just after your target text. Your target text will be in the middle.
import java.io.InputStream
new File("./Data/Bloomberg_Chat/example_chat.txt").withInputStream { stream ->
    stream.eachLine { line ->
        //find beginning
        tmp1 = line =~ ~/^Message\sSent:\s(.*)$/
        if(tmp1){
            //find end
            trgt = tmp1[0][1] =~ ~/.+?(?=\s\d{2}:\d{2}:\d{2})/
            if(trgt){
                println trgt[0]
            }
        }
    }
}
02/13/2019
null
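The same bracketing idea can also be written as a single pattern. The following is a minimal sketch in Python, assuming the line format shown in the example data: a lookbehind acts as the left bracket and a lookahead as the right bracket, so only the text in the middle is returned.

```python
import re

line = "Message Sent: 02/13/2019 08:42:15"

# Left bracket: the label just before the target (lookbehind).
# Right bracket: the time just after the target (lookahead).
bracketed = re.compile(r"(?<=Message Sent:\s).+?(?=\s\d{2}:\d{2}:\d{2})")

match = bracketed.search(line)
date = match.group(0) if match else None

# A line without both brackets yields no match
other = bracketed.search("Subject: Instant Bloomberg Persistent")

print(date, other)
```

Python requires a lookbehind to have a fixed width, which is why the left bracket is a literal label followed by exactly one whitespace character.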
Divide and conquer
Here, you have a few targets that you are interested in capturing. Create nested capture groups within the original capture.
import java.io.InputStream
new File("./Data/Bloomberg_Chat/example_chat.txt").withInputStream { stream ->
    stream.eachLine { line ->
        trgt = line =~ ~/^((\d{2}\S\d{2}\S\d{4}\s\d{2}:\d{2}:\d{2})\s([^,]+))(.*)$/
        if(trgt){
            println ("--------Begin line---------")
            println trgt[0][2] //dtv
            println trgt[0][3] //member
            println trgt[0][4] //content
            println ("---------End line----------")
        }
    }
}
--------Begin line---------
02/13/2019 08:42:15
User_01
,has joined the room
---------End line----------
--------Begin line---------
02/19/2019 00:56:29
User_105
,Says Cupiditate voluptas sunt velit. Accusantium aliquid expedita excepturi quis laborum autem. Quas occaecati et atque est repellat dolores. Laudantium in molestiae consequatur voluptate ipsa.
---------End line----------
--------Begin line---------
02/19/2019 00:55:35
User_68
,has left the room
---------End line----------
null
Parsing
In this approach, you want to parse all pieces of a record into their respective fields. This is often used when getting semi-structured data, such as log files, into a structured format, such as a table. The following example uses PySpark.
%python #METHOD-1: RegEx
from pyspark.sql import Row
import re
parts = [
    r'(?P<host>\S+)', # host
    r'\S+', # ident (unused)
    r'(?P<user>\S+)', # user
    r'\[(?P<time>.+)\]', # time
    r'"(?P<request>.*)"', # request
    r'(?P<status>[0-9]+)', # status
    r'(?P<size>\S+)', # size
    r'"(?P<referrer>.*)"', # referrer
    r'"(?P<agent>.*)"', # user agent
]
pattern = re.compile(r'\s+'.join(parts)+r'\s*\Z')
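# logs is assumed to be an RDD of raw log lines; match() returns None for a line that does not fit the pattern
prs = logs.map(lambda x: pattern.match(x).groupdict() )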
rows = prs.map(lambda x: Row(**x))
dfLogs = rows.toDF()
dfLogs.show()
+--------------------+-------------+--------------------+--------------------+-----+------+--------------------+----+
| agent| host| referrer| request| size|status| time|user|
+--------------------+-------------+--------------------+--------------------+-----+------+--------------------+----+
|Mozilla/5.0 (comp...| 66.249.69.97| -|GET /071300/24215...| 514| 404|24/Sep/2014:22:25...| -|
|Mozilla/5.0 (X11;...|71.19.157.174| -| GET /error HTTP/1.1| 505| 404|24/Sep/2014:22:26...| -|
|Mozilla/5.0 (X11;...|71.19.157.174| -|GET /favicon.ico ...| 1713| 200|24/Sep/2014:22:26...| -|
|Mozilla/5.0 (X11;...|71.19.157.174| -| GET / HTTP/1.1|18785| 200|24/Sep/2014:22:26...| -|
|Mozilla/5.0 (X11;...|71.19.157.174|http://www.holden...|GET /jobmineimg.p...| 222| 200|24/Sep/2014:22:26...| -|
|Mozilla/5.0 (X11;...|71.19.157.174| -|GET /error78978 H...| 505| 404|24/Sep/2014:22:26...| -|
+--------------------+-------------+--------------------+--------------------+-----+------+--------------------+----+
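The same pattern can be tried without a Spark cluster. The sketch below reuses the parts list from the example in plain Python and adds a guard for lines that do not match. The helper name parse_line and the sample log line are illustrative assumptions, not part of the original data.

```python
import re

parts = [
    r'(?P<host>\S+)',        # host
    r'\S+',                  # ident (unused)
    r'(?P<user>\S+)',        # user
    r'\[(?P<time>.+)\]',     # time
    r'"(?P<request>.*)"',    # request
    r'(?P<status>[0-9]+)',   # status
    r'(?P<size>\S+)',        # size
    r'"(?P<referrer>.*)"',   # referrer
    r'"(?P<agent>.*)"',      # user agent
]
pattern = re.compile(r'\s+'.join(parts) + r'\s*\Z')


def parse_line(line):
    """Return a dict of named fields, or None when the line does not fit the pattern."""
    m = pattern.match(line)
    return m.groupdict() if m else None


# Hypothetical log line in the same layout as the table above
sample = ('71.19.157.174 - - [24/Sep/2014:22:26:12 +0000] '
          '"GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64)"')

record = parse_line(sample)
bad = parse_line("not a log line")

print(record)
print(bad)
```

Returning None instead of raising lets the caller filter out malformed lines before building rows.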
Convenience structures
Languages can have syntax conveniences that make working with regex much easier. These include making patterns part of case statements, as is done in Groovy and Scala, and allowing raw string input, as in Groovy and Python.
In addition, programmers can make their life easier by creating specific data structures that can hold the output of target matches.
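One way to hold the output of target matches is a small record type filled from named capture groups. The following is a minimal sketch in Python; the ChatLine class, its field names, and the parse_chat_line helper are hypothetical and chosen only to fit the chat lines in the example data.

```python
import re
from dataclasses import dataclass
from typing import Optional


# Hypothetical container for one chat line; the field names are illustrative
@dataclass
class ChatLine:
    timestamp: str
    member: str
    content: str


# Named groups line up with the dataclass fields
CHAT_LINE = re.compile(
    r"^(?P<timestamp>\d{2}/\d{2}/\d{4}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<member>[^,]+),"
    r"(?P<content>.*)$"
)


def parse_chat_line(line: str) -> Optional[ChatLine]:
    """Build a ChatLine from a matching line, or return None for other lines."""
    m = CHAT_LINE.match(line)
    return ChatLine(**m.groupdict()) if m else None


parsed = parse_chat_line("02/13/2019 08:42:15 User_01,has joined the room")
header = parse_chat_line("Subject: Instant Bloomberg Persistent")

print(parsed)
print(header)
```

Because the group names and field names agree, groupdict() can be unpacked straight into the constructor, and the rest of the program works with attributes rather than group indexes.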
Language: Groovy
The following functionality is commonly used with Groovy:
~string - pattern operator
=~ - find operator
==~ - exact match operator
switch-case - convenience functionality
The pattern operator (~string)
import java.util.regex.Pattern
def pattern = ~/([Gg])roovy/
pattern.class == Pattern
true
//The slashy form of a Groovy string has a huge advantage over double (or single) quoted string - you don’t have to escape backslashes.
( /Version \d+\.\d+\.\d+/ == 'Version \\d+\\.\\d+\\.\\d+' )
true
p = ~/foo/
p = ~'foo'
p = ~"foo"
p = ~$/dollar/slashy $ string/$
//p = ~"${pattern}"
dollar/slashy $ string
The find operator (=~)
import java.util.regex.Matcher
def matcher = "My code is groovier and better when I use Groovy there" =~ /\S+er\b/
println matcher.find()
println matcher.size() == 2
matcher[0..-1] == ["groovier", "better"]
true
true
true
if ("My code is groovier and better when I use Groovy there" =~ /\S+er\b/) {
"At least one element matches the pattern"
}
At least one element matches the pattern
def (first,second) = "My code is groovier and better when I use Groovy there" =~ /\S+er\b/
first == "groovier" & second == "better"
true
// With grouping we get a multidimensional array
def group = ('groovy and grails, ruby and rails' =~ /(\w+) and (\w+)/)
println group.hasGroup()
println 2 == group.size()
println ['groovy and grails', 'groovy', 'grails'] == group[0]
println 'rails' == group[1][2]
The exact match operator (==~)
"My code is groovier and better when I use Groovy there" ==~ /\S+er\b/ //no exact match => only two words
false
"My code is groovier and better when I use Groovy there" ==~ /^My code .* there$/ //exact match of beginning and end of string
true
The pattern with switch case
def input = "test"
switch (input) {
case ~/\d{3}/:
"The number has 3 digits"
break
case ~/\w{4}/:
"The word has 4 letters"
break
default:
"Unrecognized..."
}
The word has 4 letters
Language: Python
Reading files
You can read-in a file with the following:
file_object = open("filename", "mode")
The mode argument defaults to r (read) if omitted. The modes are:
r – Read mode, which is used when the file is only being read
w – Write mode, which is used to edit and write new information to the file (any existing file with the same name will be erased when this mode is activated)
a – Append mode, which is used to add new data to the end of the file; that is, new information is automatically appended to the end
r+ – Special read and write mode, which is used to handle both actions when working with a file
By using the with statement, you ensure proper handling of the file, including closing it when work is completed.
%%python
with open("./Data/Bloomberg_Chat/example_chat.txt") as file:
    data = file.read()
print(data)
Message#: 0
Message Sent: 02/13/2019 08:42:15
Subject: Instant Bloomberg Persistent
02/13/2019 08:42:15 User_01,has joined the room
02/19/2019 00:56:29 User_105,Says Cupiditate voluptas sunt velit. Accusantium aliquid expedita excepturi quis laborum autem. Quas occaecati et atque est repellat dolores. Laudantium in molestiae consequatur voluptate ipsa.
02/19/2019 00:55:35 User_68,has left the room
Ordinary string manipulation
Before moving straight to regular expressions, users should take advantage of Python’s built-in string manipulation functionality. Bracketing textual start/end anchors with simple tools can keep your code clean and readable. The following is an example of adding all text as a string, then using string methods and list comprehensions to select targets.
%%python
with open("./Data/Bloomberg_Chat/example_chat.txt") as file:
    data = file.read()
content = (data.split(',has left the room')[0]).split('has joined the room')[1]
lines = content.replace('\n', ' ').split('.')
quote = [x.strip() for x in lines]
quote
['02/19/2019 00:56:29 User_105,Says Cupiditate voluptas sunt velit',
'Accusantium aliquid expedita excepturi quis laborum autem',
'Quas occaecati et atque est repellat dolores',
'Laudantium in molestiae consequatur voluptate ipsa',
'02/19/2019 00:55:35 User_68']
Using RegEx
Regular expressions are more powerful not only for their patterns, but also because they can be compiled and used over large amounts of data.
Use raw strings instead of regular Python strings. Raw strings are written with an r prefix, as in r"some text", and tell Python not to interpret backslashes and special metacharacters in the string. This allows you to pass them directly to the regular expression engine.
An example is using r"\n\w" instead of "\\n\\w".
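The difference is easy to see in a short sketch. The strings below are made up for illustration; the point is that Python processes escapes in a regular string before the regex engine ever sees the pattern.

```python
import re

# In a regular string "\n" is a single newline character;
# in a raw string r"\n" is two characters: a backslash and the letter n
regular_len = len("\n")
raw_len = len(r"\n")

# A raw string and its escaped equivalent hold the same pattern text
same_pattern = r"\n\w" == "\\n\\w"

# Without the raw prefix, "\b" is a backspace character rather than a word boundary
backspace_hits = re.findall("\bcat\b", "cat concat")
boundary_hits = re.findall(r"\bcat\b", "cat concat")

print(regular_len, raw_len, same_pattern, backspace_hits, boundary_hits)
```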
The following functions from the re module are commonly used; the example below uses re.compile() and re.search():
re.search() - stops at the first match
re.findall() / re.finditer() - search over the entire string and return a list (or iterator) of all captured data
re.compile() - compiles a pattern to speed up processing on larger data. This is especially useful with a big data framework, such as Apache Spark
re.sub() - find and replace
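To compare these functions side by side, here is a minimal sketch that runs each of them over two lines copied from the example chat file. The variable names are illustrative, and re.MULTILINE is used so that ^ matches at the start of every line.

```python
import re

# Two lines copied from the example chat file
data = (
    "02/13/2019 08:42:15 User_01,has joined the room\n"
    "02/19/2019 00:55:35 User_68,has left the room\n"
)

# Compile once; group 1 is the date and group 2 is the member
line_start = re.compile(
    r"^(\d{2}/\d{2}/\d{4})\s\d{2}:\d{2}:\d{2}\s([^,]+)", re.MULTILINE
)

# search: stops at the first match
first = line_start.search(data).group(2)

# findall: returns a list of tuples, one tuple of groups per match
pairs = line_start.findall(data)

# finditer: yields match objects lazily, which helps with large inputs
members = [m.group(2) for m in line_start.finditer(data)]

# sub: find and replace
masked = re.sub(r"User_\d+", "User_XX", data)

print(first, pairs, members)
print(masked)
```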
%%python
import re
regexDate = re.compile(r"Message#:\s(\d+)")
with open("./Data/Bloomberg_Chat/example_chat.txt") as file:
    data = file.read()
grp = (re.search(regexDate, data)).group(0)
print(grp)
Message#: 0
Language: JavaScript
/\w+\d+/ - literal notation; matches one or more word characters followed by one or more digits
RegExp("\\w+\\d+") - constructor notation
RegExp.test() - tests for a match
RegExp.exec() - returns matching results
inputStr.search() - finds a match and returns its index
inputStr.match() - returns an array of matches
inputStr.replace() - substitutes a match with a string
The flags are either appended after the regular expression in the literal notation, or passed into the constructor in the constructor notation.
g - allows you to run RegExp.exec() multiple times to find every match in the input string, until the method returns null
i - makes the pattern case insensitive so that it matches strings of different capitalizations
m - necessary if your input string has newline characters (\n); this flag allows the start and end metacharacters (^ and $ respectively) to match at the beginning and end of each line instead of at the beginning and end of the whole input string
u - interprets the regular expression as Unicode code points
%%javascript
var fs = require('fs');
fs.readFile( __dirname + '/Data/Bloomberg_Chat/example_chat.txt', function (err, data) {
    if (err) {
        throw err;
    }
    console.log(data.toString());
});
%%javascript
console.log('hi')
//process.stdout.write("hello: ");
process.stdout.write("Downloading ");