Wednesday, June 23, 2010

Parsing Addresses: Importing Unparsed Addresses Into Dynamics GP - Part 1

Several years ago, I worked on a large Dynamics GP implementation that involved the migration of over 30,000 customer records from an old, custom, AS400 system.  The client was able to export the customer data to text files for us, but even though we had all of the data, we still had a problem. 

The data in the AS400 was stored in free form text fields, such as ADDRESS1, ADDRESS2, ADDRESS3, NOTE1, NOTE2, etc.  The system did not have separate fields for city, state, or zip, and there was no validation on any of the fields.  Contact names and phone numbers were placed in different fields.  And of course, there were alot of data errors, missing information, random notes, inconsistent formatting, and other oddities in the data that made a clean import nearly impossible.

At the time, we looked for tools to try and parse and validate the data, but we couldn't find any good, and economical, options.  The client ultimately chose to develop their own routines to parse the data just enough to import it into Dynamics GP's separate address and contact fields.  It wasn't pretty, but we got through it.

Fast forward to last month, when I was asked to develop an integration that would import customers from Excel.  Pretty basic, except for some fun caveats:  All of the address data will be in one column in Excel, and company names, contacts, and e-mail addresses are stored inconsistently in other columns.  Oh, and did I mention that the customer information was international?  Yup, the data included addresses from Michigan to Mongolia (literally, as in Ulaanbaatar).  Once again, I was in address parsing purgatory.  (Virgil, is that you???)

Here's one example of an actual customer address that I received (details modified slightly to protect the innocent):

Neal Hutchins/19 Colomberie, St Helier/Jersey/Channel Islands/JE5 7SY/United States/neal.hutchins@telecom-solutions.com, Narissa.Atista@telecom-solutions.com

Try parsing that bad boy.  If you try hard enough, you'll find that the address is not in the United States, but is actually somewhere in the UK.

With this fun challenge, I did some fresh research to see if anything had changed in the world of address parsing.  Given the nature of the problem, there is no single magical solution, but I learned of a few new options.

First, of course, were recommendations to try parse the data "manually" through code.  In some cases, this may be feasible (i.e. US addresses only), but given the data that I needed to parse, I knew that I didn't have enough time to write an address parsing routine that would be particularly accurate.

The next recommendation was to utilize mapping web services to "geocode" the address information.  This is a very good article and code sample on SQLServerCentral.com with a nice overview of the process, as well as a great code sample.  While this is a very slick solution that might be a great option for certain situations, there were a few significant limitations for my project. 

The first is that the geocoding option seems to only handle address information, not "contact" information, such as company name, contact, phone numbers, and e-mail addresses.  The second issue is that the geocoding option does not seem to do a very good job of partially parsing an address.  Many of my test addresses simply failed, and returned no results. 

And finally, the deal killer, which is unfortunately not mentioned in the SQL Server Central article, is that the Google Terms of Service heavily restricts the use of the API for separate (non-map related) or commercial applications.  I'm not a lawyer, so reading the terms of service didn't enlighten me much, but this is what others have informed me.

http://groups.google.com/group/google-contacts-api/browse_thread/thread/65fc032082d5e374

http://code.google.com/apis/maps/documentation/geocoding/#Limits

http://code.google.com/apis/maps/faq.html

So with those two options being ruled out, I started to look for other options that would specifically address my requirements.

I then learned that there are services that you can use to manually parse, scrub, and validate addresses.  While potentially effective, these involve paying for a service (presumably expensive) and sending data in batches.  I needed a real-time programmatic parsing and validation solution for my integration.

After more searching, I finally came across a solution.  I'll discuss it in more detail in part two.

No comments: