Friday, June 25, 2010

Parsing Addresses: Importing Unparsed Addresses Into Dynamics GP - Part 2

In Part 1 of this post, I discussed my challenge of trying to import unparsed addresses into Dynamics GP and the various options that I found to solve this age-old problem. I would have thought that there would be plenty of solutions available by now, but the ones I found had various limitations that prevented them from working for me.

Then, finally, I came across exactly what I was looking for:  An economical tool called RecogniContact, developed by LoquiSoft, that is specifically designed to perform full "contact" parsing.  It can be integrated into a custom application or web site, and allows you to programmatically parse and validate address information.

In addition to parsing addresses, it can also parse contact data, such as prefix (Mr., Mrs.), middle initial, suffix, title, department, phone number, and e-mail address.  It will even tell you the gender of the contact for over 20,000 first names.  But wait, there's more!  More, you say?

It uses international postal coding standards to intelligently parse addresses for 21 countries (currently US and Europe), parse addresses for another 21 countries based on address structure, and, as scary as this sounds, it can even process contact data in 13 different languages.  It can parse international phone numbers based on country-specific standards, and it can even determine if an area code is valid for a given country.  This actually thwarted my lazy test plans when I tried to use a fake phone number of (123) 456-7890.  It detected the invalid number and, sigh, I had to update my data to use a valid area code.

In addition to this exhaustive list, another very valuable feature is its ability to parse the fields that it can successfully recognize, but then output the specific values that could not be recognized.  This is actually a critical feature when parsing addresses, as it allows the user to skip past the contacts that imported fine, and makes it easier for them to specifically identify and manually review only the "partially" parsed addresses. 

I corresponded with Werner Noska, the owner of LoquiSoft, and he explained that the tool started as a research project for an Austrian federal agency, which resulted in ContactCopy, a desktop address parser.  Due to requests by companies looking to incorporate the technology into their software, he developed RecogniContact.

He explained some technical details about RecogniContact that were outside of my realm, such as structural parsing vs. content aware parsing, and its ability to recognize data elements even without separators.  I know that software developers are sometimes "ego enhanced", and therefore occasionally looking to prove something with their L337 development skillZ, but if you have ever tried to write code from scratch to parse addresses, especially international addresses, trust me, you just can't compete with the knowledge, experience, and accuracy that is built into RecogniContact.

RecogniContact is available as a web service subscription, or as a licensed COM component that can be integrated with .NET or other COM aware development tools.  I chose the COM option, since it was simpler for me to license for my client and their desktop customer import.  With the sample .NET code provided by LoquiSoft, I was able to quickly develop a C# prototype, and in a few hours, I had RecogniContact fully integrated into my customer import.

Based on the limited sample data that I was given by the client (only 330 records), I have been able to consistently achieve an 80% "successful parse rate" (i.e. contact is fully parsed) without manipulating the raw data at all, which I think is pretty good.  The majority of those successfully parsed addresses were in the US and Europe, which explains the high parse rate.  But RecogniContact also did a decent job with some messy addresses, and was able to at least partially parse all of the international addresses.  The 20% of the addresses that were not fully parsed were all partially parsed to some extent, and I was able to flag them for client review, so the results are far better than I could possibly hope to achieve writing my own code.
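The "successful parse rate" above is just the share of records that come back with nothing left over in the unrecognized values. Here is a minimal sketch of that calculation. It is in Python purely for illustration (the actual import was a C# application), and it assumes each parse result is a plain dictionary with a hypothetical "unrecognized" key, which is not RecogniContact's real object model:

```python
def parse_rate(results):
    """Return the fraction of results that were fully parsed.

    Each result is assumed to be a dict with a hypothetical "unrecognized"
    key holding whatever text the parser could not assign to a field.
    A record counts as fully parsed when that text is empty or blank.
    """
    if not results:
        return 0.0
    fully_parsed = sum(
        1 for r in results if not (r.get("unrecognized") or "").strip()
    )
    return fully_parsed / len(results)


# Illustrative data only (not the client's records): 4 of 5 parse cleanly.
sample = [
    {"unrecognized": ""},
    {"unrecognized": ""},
    {"unrecognized": "ATTN: Accounts Payable"},
    {"unrecognized": None},
    {"unrecognized": "  "},
]
rate = parse_rate(sample)
```

With the made-up sample above, four of the five records have no leftover text, so the rate comes out to 0.8.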

Here is one I thought was impressive. Even though the data includes the incorrect country, RecogniContact's knowledge of UK postal codes allows it to recognize the country:

Original Data:

Neal Hutchins/19 Colomberie, St Helier/Jersey/Channel Islands/JE5 7SY/United States/,

Parsed Data:

First: Neal
Last: Hutchins
Full Name:  Neal Hutchins
Gender: M
Company: United States
Address1: 19 Colomberie, St Helier
City: Jersey/Channel Islands
Postal: JE5 7SY
Country: United Kingdom
Country ISO: GB

Unrecognized Values:

In this example, there were two e-mail addresses, in which case it parses the first and considers the second e-mail address unrecognized. And having "United States" in the data is odd, so it does its best and puts it in the company field. This was actually a common issue with the data I received, so I can easily detect the country, and if it is not US, I can quickly scan the other contact fields to see if United States is present and then remove it, or move the value to the Unrecognized Values property.
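That cleanup step can be sketched roughly as follows. This is a Python illustration rather than the C# I actually wrote, and the dictionary keys ("country_iso", "company", "unrecognized", and so on) are hypothetical stand-ins for the parsed fields, not RecogniContact's property names:

```python
def clean_stray_country(contact, stray="United States", home_iso="US"):
    """Move a stray country name out of the other contact fields.

    `contact` is a plain dict standing in for a parsed result; all key
    names here are hypothetical. Returns a new dict and leaves the
    input untouched.
    """
    cleaned = dict(contact)
    if cleaned.get("country_iso") == home_iso:
        return cleaned  # the country really is the US, nothing to do
    for field in ("company", "address1", "address2", "city"):
        value = (cleaned.get(field) or "").strip()
        if value.lower() == stray.lower():
            cleaned[field] = ""
            existing = cleaned.get("unrecognized") or ""
            # Keep the removed value so a reviewer can still see it.
            cleaned["unrecognized"] = "; ".join(
                part for part in (existing, value) if part
            )
    return cleaned


# The Jersey example from above, reduced to the relevant fields.
jersey = {
    "company": "United States",
    "address1": "19 Colomberie, St Helier",
    "city": "Jersey/Channel Islands",
    "country_iso": "GB",
    "unrecognized": "",
}
fixed = clean_stray_country(jersey)
```

For the Jersey record, the company field ends up empty and "United States" lands in the unrecognized values instead. This version only matches a field whose entire value is the stray country name. A value with the country embedded in longer text is left alone for manual review.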

Even though it can't perform miracles with every address, RecogniContact's ability to return the unrecognized data values allows me to easily flag the customer records that need manual review and corrections, enabling the client to focus on the problems, and reducing the time required to review the imported customers.

One of the limitations of RecogniContact, and any parsing tool for that matter, is handling non-standard or ambiguous values. Things like unusual titles, mail stops, building numbers on a campus, or attention lines seemed to give it the most trouble, as these are often not standard postal address values. Here are a few examples:

A-105 Memphis, Y-11 Shasta Nagar, Lokhandwala, Complex, Andheri/Mumbai/400053/India

Stockton Network Services, Inc c/o First Fidelity Financial ATTN:  Anita Woodman T10/060/110 Riverside Blvd/Jacksonville/FL/32204 904 884-6830 SVP - Finance & Controller Stan Parkman/United States

Both of these are just too random to parse 100%, but RecogniContact usually gets at least 50% right, and it doesn't take long for a human to correct the remaining fields.

Another aspect of my sample data that posed a challenge for RecogniContact was symbols, such as mixtures of colons, semicolons, ampersands, dashes, commas, and slashes, all in the same address. This is understandable to me, as the tool needs some means of trying to identify field values, and mixed symbols in contact data pose quite a challenge, especially if they are embedded in a field value.

The Stockton example above is a good one. When I try to parse it, the "c/o", "ATTN:", and "T10/060" mail stop trip up the parser.

First: Bob
Last: Henson
Full Name:  Bob Henson
Gender: M
Company: SVP - Finance & Controller Stan Parkman/United States
Address1: 060/110 Riverside Blvd
City: Jacksonville
State: FL
Postal: 32204
Country: United States
Country ISO: US
Phone1: +1 (904) 884-6830

Unrecognized Values: Stockton Network Services, Inc c/o First Fidelity Financial ATTN:  Anita Woodman T10

But as you can see, it did a decent job of parsing a pretty difficult pile of data, getting most of it correct.

With all of that said, once you get the contact information in Dynamics GP, it's actually quite convenient to review the contacts that did not parse successfully.  To make the manual review easier, I store both the original source data and the unrecognized values in the Customer Note.  When reviewing the customer records, the user is able to open the Note window side-by-side with the customer window and edit the customer record.  It works very well and is a simple process.

Another thing I do in Dynamics GP is flag customers whose information did not parse fully--I set the Customer User Defined Field 2 to "Needs Review".  We then created a SmartList favorite that looks for this value, and the client then has a single place to see the customers that need manual review, and they can simply double click on each SmartList record to perform their review.
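The flag and the note text can be derived from the parse result in one small step. Here is a rough Python sketch (again, an illustration rather than my C# import code). The function name and the note layout are made up for the example, and it assumes the note is only written for records that need review:

```python
REVIEW_FLAG = "Needs Review"  # value placed in Customer User Defined Field 2


def prepare_review_fields(original, unrecognized):
    """Return (user_defined_2, note_text) for one imported customer.

    `original` is the raw source line and `unrecognized` is whatever the
    parser could not place. Both names and the note format are
    hypothetical. Fully parsed records get an empty flag and no note
    (an assumption made for this sketch).
    """
    leftover = (unrecognized or "").strip()
    if not leftover:
        return "", ""
    note = f"Original data: {original}\nUnrecognized values: {leftover}"
    return REVIEW_FLAG, note


flag, note = prepare_review_fields(
    "Acme Inc ATTN: J Smith/1 Main St/Springfield/IL/62701",  # made-up record
    "ATTN: J Smith",
)
```

Writing the two values into the customer record and note is left out here, since that part depends on how the import talks to Dynamics GP.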

One final interesting issue, which I found in two sample records, was people's names that resembled street names. So "Thomas Way" was interpreted as a street instead of a name. I can't think of any way around that, and since it was very rare, it wasn't an issue so much as an interesting coincidence.

My customer import with the RecogniContact parsing component has been deployed at the client site, and they will begin testing it this week. I'll soon see how well it works with fresh data, and I am eager to find out how well it meets their needs.

Go forth and parse!
