Monday, October 19, 2009

Save the Day with Regular Expressions

xkcd has a great comic about regular expressions:



It rings so true--if you have ever seen the look of amazement on people's faces after you use regular expressions, you definitely understand the comic.

For those of you who aren't familiar with regular expressions, have you ever received a data file as part of a Dynamics GP data migration or import that had horrible formatting, or had extra or messy data in it that made it nearly impossible for you to use?

Or, to pull a skeleton out of our own GP closet, have you ever had to extract just the data from a Dynamics GP report that was saved to a text file? Or attempted to get the GP report to import into Excel? Fun, huh? All those massive, useless page headers and the fields scattered across the multi-line detail data make that quite a chore.

I've seen this plenty over the years, and surprisingly, a typical reaction is, "Sorry, we can't work with that format." People just don't know how to deal with such messy looking or inconsistent data layouts.

Although they can't solve every data formatting problem, regular expressions can be a lifesaver in many situations, and using them is usually much cheaper (often free!) than buying a high-end data extraction application.

Regular expressions, or "regex" (also "reg ex" or "regexp") for short, are a technique for using special search strings, or patterns, to find specific, complex, or inconsistent and varying text in a file.
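To make the idea of a "pattern" a bit more concrete, here is a minimal sketch in Python (chosen only for illustration; any regex-capable tool would do). The report text is made up and is not real Dynamics GP output, and the names `sample_report` and `detail_line` are hypothetical. The pattern keeps only the lines that look like a pay code followed by an amount, and ignores the page headers.

```python
import re

# Made-up report text for illustration only; not actual Dynamics GP output.
sample_report = """\
System: 10/19/2009   Page: 1
PAY CODE REPORT
Code     Amount
C09900   1,250.00
H16500     300.50
System: 10/19/2009   Page: 2
R06000      75.25
"""

# A detail line here is assumed to be: one capital letter, five digits,
# some spaces or tabs, then an amount such as 1,250.00.
detail_line = re.compile(
    r"^([A-Z][0-9]{5})[ \t]+([0-9,]+\.[0-9]{2})[ \t]*$",
    re.MULTILINE,
)

# Collect (code, amount) pairs, dropping the thousands separators.
rows = [
    (code, float(amount.replace(",", "")))
    for code, amount in detail_line.findall(sample_report)
]

for code, amount in rows:
    print(code, amount)
```

The header lines never match because none of them starts with a letter followed by five digits, so only the three detail rows come through.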

One of my most basic and frequent uses of regex is to take a column of data and convert it for use in a SQL statement. I had to use this today for a client--they provided me a list of Dynamics GP Payroll pay codes, and I had to develop a query to find all payroll transactions that used the pay codes.

The list of 35 pay codes in Excel looked something like this:

C09900
C11500
H16500
H18000
H18750
H19500
H21000
H21750
R06000
R08000
R08550
...

And I needed to be able to use them in an "IN" clause in a SQL query.

I fired up UltraEdit, my preferred text editor, pasted in the column of data, and then ran a Replace with the following options:

Find What: ^p (new paragraph)
Replace With: ', '

I had instantly converted the list into this:

'C09900 ', 'C11500 ', 'H16500 ', 'H18000', 'H18750 ', 'H19500 ', 'H21000 ', 'H21750 ', 'R06000 ', 'R08000', 'R08550 '

Close, but the trailing spaces after some pay codes bugged me, so I wanted to get rid of them. Once again, regex to the rescue. Some of the pay codes had no trailing space, while others had one or more. So I went back to my original column of pay codes and ran this replace routine instead:

Find What: [space]*^p (find zero or more spaces followed by a new paragraph)
Replace With: ', '

(where [space] was a single space)

Voilà! A clean list that I can use in a SQL "IN" clause:

'C09900', 'C11500', 'H16500', 'H18000', 'H18750', 'H19500', 'H21000', 'H21750', 'R06000', 'R08000', 'R08550'

In just a few seconds, I was able to convert the data into exactly what I needed and paste it right into my SQL statement.
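If you don't have UltraEdit handy, the same idea can be sketched in a few lines of Python. This is only an illustration under some assumptions: the `pasted` string stands in for a few of the pay codes copied from Excel, and Python's `re` module uses `\r?\n` for a line break rather than UltraEdit's `^p`.

```python
import re

# A few of the pay codes from the list, with inconsistent trailing spaces
# and Windows-style line endings, as they might arrive when pasted from Excel.
pasted = "C09900 \r\nC11500  \r\nH16500 \r\nH18000\r\nH18750 \r\n"

# Replace any trailing spaces plus the line break with the ', ' separator.
# strip() first removes the whitespace after the last code so no empty entry is left.
joined = re.sub(r" *\r?\n", "', '", pasted.strip())

# Add the opening and closing quotes that the replace alone does not supply.
in_list = "'" + joined + "'"
print(in_list)
```

The result can be pasted between the parentheses of an IN clause, just like the UltraEdit output.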

Another common use of regular expressions is to validate or clean data input. For example, let's say you have to import social security numbers, credit card numbers, or other all-numeric data. How can you tell if an alpha character, symbol, or other non-numeric character exists in the data? And what if you wanted to strip out invalid characters? What if you needed to validate the format for a Visa vs. Amex credit card number? Or the proper format of a Dutch postal code?

All easily accomplished with regular expressions.
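As a rough sketch of what those checks might look like, here is a Python version. The patterns are simplified assumptions rather than complete validation rules: Visa is taken to be 13 or 16 digits starting with 4, Amex to be 15 digits starting with 34 or 37, and a Dutch postal code to be four digits (not starting with 0), an optional space, and two capital letters. The function and constant names are hypothetical.

```python
import re

def digits_only(value: str) -> bool:
    """Return True when the value contains nothing but the digits 0-9."""
    return re.fullmatch(r"[0-9]+", value) is not None

def strip_non_digits(value: str) -> str:
    """Remove every character that is not a digit 0-9."""
    return re.sub(r"[^0-9]", "", value)

# Simplified format checks (assumptions, not full card or address validation).
VISA = re.compile(r"4[0-9]{12}(?:[0-9]{3})?")        # 13 or 16 digits, starts with 4
AMEX = re.compile(r"3[47][0-9]{13}")                 # 15 digits, starts with 34 or 37
DUTCH_POSTAL_CODE = re.compile(r"[1-9][0-9]{3} ?[A-Z]{2}")  # e.g. 1234 AB

def card_type(number: str) -> str:
    """Guess the card type from the number's format after removing separators."""
    cleaned = strip_non_digits(number)
    if VISA.fullmatch(cleaned):
        return "Visa"
    if AMEX.fullmatch(cleaned):
        return "Amex"
    return "unknown"

print(digits_only("123-45-6789"))        # the dashes make this fail the check
print(strip_non_digits("123-45-6789"))   # dashes removed
print(card_type("4111 1111 1111 1111"))  # a commonly used Visa-format test number
print(DUTCH_POSTAL_CODE.fullmatch("1234 AB") is not None)
```

Note that these only check the format. A number can match the Visa pattern and still not be a real card, so a format check like this is a first filter, not proof that the data is valid.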

Here's the catch: unfortunately, there are multiple "standards" for regular expression syntax. When I say I "know" regular expressions, I mean that I understand the concept and how and when to use them, and that I happen to know the UltraEdit and .NET syntaxes. If I had to use a tool with a different syntax, I would have to refer to its help pages and learn that tool's specific expression syntax.

The examples I've given so far use the basic UltraEdit expression syntax. There is also Unix syntax. And Perl syntax. And of course Microsoft could never follow an existing standard, so .NET has its own syntax with its own corresponding quirks. (.NET also has dedicated validation classes, such as RegexStringValidator.)

Naturally, you can Google the topic and find quite a few resources for learning about regular expressions, how they work, the syntax, examples, and tricks. But, ultimately you will have to use an application, tool, or specific programming language that supports regular expressions, and that tool will use a particular regex syntax. So I would recommend picking a tool first, at least temporarily while you learn, and then use its documentation to get familiar with regex.

I found this free add-on that appears to let you use regex in Excel. I haven't tried it yet, but it would be a great feature to have in Excel. There are some things that make regex difficult in Excel (like needing to combine or search text that spans multiple cells), but it wouldn't hurt to have it for data within cells.

If you are new to regular expressions, here is one web site that may be a good starting point to better understand the concept and see some interesting examples:

http://www.regular-expressions.info/tutorial.html

After a few minutes of study, you'll have a new tool to save the day!

1 comment:

Jivtesh Singh said...

Hey Steve,

That's like a perfect post!
You've got XKCD, an example of the user world, and regex :)

Great going!