Brizzled

... wherein I bloviate discursively

Brian Clapper, bmc@clapper.org

Simple Address Standardization


Introduction

Suppose you’re writing an application that stores and compares street addresses. Depending on how you obtain those addresses, you may need to standardize them.

  • Standardizing the addresses makes them easier to compare.
  • Standardizing the address can have the side effect of validating the address.
  • Standardizing an address also makes it more likely that you can send a physical piece of mail to the address and have it actually arrive.

This article explores one way to solve that problem. I’ll be showing examples in Ruby and Python, but the same general approach works for other languages.

The Approach

The simplest way to get started is to use the REST-based mapping APIs already on the Internet, such as Google Maps, Yahoo! Maps, Bing, and others. There are a number of language-specific libraries that make this task easier; the examples in this article use the Ruby geocoder gem and the Python py-googlemaps package.

Note: The Internet-based mapping APIs all have restrictions. The Google Maps API, for instance, contains this clause in its terms of service:

§ 10.1.1(g) No Use of Content without a Google Map. You must not use or display the Content without a corresponding Google map, unless you are explicitly permitted to do so in the Maps APIs Documentation, or through written permission from Google. For example, you must not use geocodes obtained through the Service except in conjunction with a Google map, but you may display Street View imagery without a corresponding Google map because the Maps APIs Documentation explicitly permits you to do so.

If you’re developing an application that only needs to do address normalization and geocoding, but won’t be displaying a map, these services are not suitable for production. You may be able to get away with using them during development (since your web site won’t be publicly accessible), though that’s really a question for your lawyer.

Let’s assume, however, that you’ve determined (by discussing it with your lawyer or by calling Google or Yahoo!) that it’s safe to use one of the Internet mapping services, but you also want to be able to move to a commercial service down the road. (See below for some possible commercial services.) The easiest way to do that is to write a very simple API of your own that hides the underlying address normalization API you’re using.

API Specification

To hide the underlying API being used (making it easier to switch to a different implementation, when necessary), let’s first define a higher-level API of our own.

Requirements:

  • Provide a generic NormalizedAddress object that contains the fields we need.
  • Provide a function or class that takes a raw address and returns a NormalizedAddress object.

Ruby

The Ruby specification for our API will look like the following. (There are, obviously, other ways to cast this API.)

Ruby API
module AddressNormalizer
  class NormalizedAddress
    attr_reader :address_line1, :address_line2, :city, :state_province,
                :postal_code, :country, :latitude, :longitude

    def initialize(...)
      @address_line1  = ...
      @address_line2  = ...
      @city           = ...
      @state_province = ...
      @postal_code    = ...
      @country        = ...
      @latitude       = ...
      @longitude      = ...
    end
  end

  def normalize_street_address(raw_address_string)
     ...
  end
end

Python

The Python specification for our API is similar.

Python API
class NormalizedAddress(object):
    def __init__(self, ... ):
        self.address_line1  = ...
        self.address_line2  = ...
        self.city           = ...
        self.state_province = ...
        self.postal_code    = ...
        self.country        = ...
        self.latitude       = ...
        self.longitude      = ...

def normalize_street_address(raw_address_string):
    ...
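
To make the idea of hiding the backend a little more concrete, here is a minimal sketch of one way callers could depend only on normalize_street_address while the backend is selected in a single place. The set_backend function and the fake_backend stub are hypothetical names introduced purely for illustration; they are not part of any mapping library, and the stub simply stands in for a real Google Maps or commercial implementation.

```python
# Sketch only: a tiny facade that lets the geocoding backend be swapped.
# set_backend() and fake_backend() are hypothetical names for illustration.

class NormalizedAddress(object):
    def __init__(self, address_line1=None, address_line2=None, city=None,
                 state_province=None, postal_code=None, country=None,
                 latitude=None, longitude=None):
        self.address_line1 = address_line1
        self.address_line2 = address_line2
        self.city = city
        self.state_province = state_province
        self.postal_code = postal_code
        self.country = country
        self.latitude = latitude
        self.longitude = longitude


_backend = None


def set_backend(func):
    """Register the callable that does the real normalization work."""
    global _backend
    _backend = func


def normalize_street_address(raw_address_string):
    """Return a NormalizedAddress, or None if the address is not found."""
    if _backend is None:
        raise RuntimeError('no address normalization backend configured')
    return _backend(raw_address_string)


def fake_backend(raw_address_string):
    # Stand-in for a real implementation: it "finds" everything except
    # the nonsense town used elsewhere in this article.
    if 'Foobar' in raw_address_string:
        return None
    line1 = raw_address_string.split(',')[0].strip()
    return NormalizedAddress(address_line1=line1)


set_backend(fake_backend)
addr = normalize_street_address('1600 Amphitheatre Parkway, Mountain View, CA')
print(addr.address_line1)
```

Switching to a different service would then mean writing another function with the same signature and passing it to set_backend; calling code would not need to change.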

Google Maps Implementation

Implementation Issues

Before diving into the implementations themselves, there are several problems we have to address (pun intended).

The Maps API doesn’t truly standardize addresses

The free services do not always standardize U.S. addresses properly, at least not from the United States Postal Service’s standpoint. In the U.S., neighboring towns often share a post office. For instance, consider the address of an office where I used to work:

1400 Liberty Ridge Drive, Chesterbrook, PA

The post office serving this address is actually Wayne, PA. The standardized address is:

1400 Liberty Ridge Dr, Wayne, PA 19087

Let’s use the Ruby geocoder gem to see what Google returns for the first address:

$ pry
[1] pry(main)> require 'geocoder'
=> true
[2] pry(main)> res = Geocoder.search('1400 Liberty Ridge Drive, Chesterbrook, PA')
=> [#<Geocoder::Result::Google:0x00000000cfa788
  @data=
...
[3] pry(main)> res[0].formatted_address
=> "1400 Liberty Ridge Dr, Chesterbrook, PA 19087, USA"

Note that the town name has not been standardized to the correct post office name. The Yahoo! Maps API exhibits similar behavior.

If you’re comparing many different addresses, you might need to ensure that they all use the canonical post office name. If you’re willing to make another connection to Google, you can try to correct this problem by taking the returned latitude and longitude values and reverse geocoding them:

[4] pry(main)> latitude = res[0].data["geometry"]["location"]["lat"]
=> 40.228934
[5] pry(main)> longitude = res[0].data["geometry"]["location"]["lng"]
=> -75.517588
[6] pry(main)> res = Geocoder.search([latitude, longitude])
=> [#<Geocoder::Result::Google:0x00000000cfa788
  @data=
  ...

[7] pry(main)> res[0].formatted_address
=> "1400 Liberty Ridge Dr, Wayne, PA 19087, USA"

This solution doesn’t work with every address (it doesn’t work with my home address, for instance), but it’s still worth trying.
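
The same two-step lookup can be sketched in Python. This is only an illustration of the flow, not a definitive implementation: the geocode and reverse_geocode arguments are hypothetical callables standing in for whatever geocoding client is in use, and the canned responses below are modeled on the pry session above so that the sketch runs without a network connection.

```python
# Sketch only: forward-geocode, then reverse-geocode the coordinates to
# pick up the canonical post office name. The two callables are
# hypothetical stand-ins for a real geocoding client.

def canonical_formatted_address(raw_address, geocode, reverse_geocode):
    """Return the reverse-geocoded address, falling back to the forward one.

    geocode(raw_address) returns a dict with 'formatted_address' and
    'geometry' -> 'location' -> 'lat'/'lng', or None if nothing was found.
    reverse_geocode(lat, lng) returns a dict with 'formatted_address',
    or None.
    """
    forward = geocode(raw_address)
    if forward is None:
        return None
    loc = forward['geometry']['location']
    reverse = reverse_geocode(loc['lat'], loc['lng'])
    if reverse is None:
        # The second lookup does not always help; keep the first answer.
        return forward['formatted_address']
    return reverse['formatted_address']


# Canned responses, so the sketch runs offline.
def fake_geocode(raw_address):
    return {
        'formatted_address':
            '1400 Liberty Ridge Dr, Chesterbrook, PA 19087, USA',
        'geometry': {'location': {'lat': 40.228934, 'lng': -75.517588}},
    }


def fake_reverse_geocode(lat, lng):
    return {'formatted_address': '1400 Liberty Ridge Dr, Wayne, PA 19087, USA'}


print(canonical_formatted_address(
    '1400 Liberty Ridge Drive, Chesterbrook, PA',
    fake_geocode, fake_reverse_geocode))
```

Falling back to the forward result keeps the function usable for the addresses where the reverse lookup returns nothing.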

The Maps API can “zoom out” if the address isn’t valid

If you give the Google Maps API a bad address, you can either get no data or “zoomed out” data. For instance, here’s what you get for the nonsense address “100 My Place, Foobar, XY”.

[1] pry(main)> require 'geocoder'
=> true
[2] pry(main)> results = Geocoder.search('100 My Place, Foobar, XY')
=> []

Bad address = no results. That makes sense: There’s no such state as “XY”. But, if I give the API a bad address with a valid state, I get “zoomed out” results:

[3] pry(main)> results = Geocoder.search('100 My Place, Foobar, PA')
=> [#<Geocoder::Result::Google:0x00000001083468
  @data=
   {"address_components"=>
     [{"long_name"=>"Pennsylvania",
       "short_name"=>"PA",
       "types"=>["administrative_area_level_1", "political"]},
      {"long_name"=>"United States",
       "short_name"=>"US",
       "types"=>["country", "political"]}],
    "formatted_address"=>"Pennsylvania, USA",
    "geometry"=>
     {"bounds"=>
       {"northeast"=>{"lat"=>42.26936509999999, "lng"=>-74.6895019},
        "southwest"=>{"lat"=>39.7197989, "lng"=>-80.5198949}},
      "location"=>{"lat"=>41.2033216, "lng"=>-77.1945247},
      "location_type"=>"APPROXIMATE",
      "viewport"=>
       {"northeast"=>{"lat"=>42.2690472, "lng"=>-75.1455745},
        "southwest"=>{"lat"=>40.11995350000001, "lng"=>-79.2434749}}},
    "types"=>["administrative_area_level_1", "political"]}>]

Note that with a valid address, we get back a “types” array (specifically, results[0].data["types"]) that contains the string “street_address”, meaning that the result is granular to the street address. But with the second example, we get “administrative_area_level_1”, which is Google Maps-speak for “state” in the U.S. In other words, the Maps API zoomed out to the nearest geographical designation it could identify, which, in this case, was the state of Pennsylvania.

This behavior makes sense for geolocation, but it isn’t very useful in an address normalizer.

Fortunately, it’s relatively easy to correct this problem. The various “types” values returned by the Google Maps API are documented at http://code.google.com/apis/maps/documentation/geocoding/#Types. For our purposes, if the top-level “types” value doesn’t contain one of the following values, then we can assume the address wasn’t found:

  • street_address indicates a precise street address, e.g., a house
  • subpremise is a “first-order entity below a named location, usually a singular building within a collection of buildings with a common name.” In practice, this is what Google Maps returns for addresses that contain, say, a suite number.
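
As a rough illustration of that check, here is a small Python sketch. It assumes a result dictionary shaped like the Google Maps data shown above, with a top-level "types" list; the is_street_level helper is a hypothetical name, and the two sample dictionaries are abbreviated versions of the results discussed earlier.

```python
# Sketch only: decide whether a geocoder result is granular enough to
# count as a real street address. is_street_level() is a hypothetical
# helper name.

ACCEPTABLE_TYPES = frozenset(['street_address', 'subpremise'])


def is_street_level(result):
    """True if the result's top-level types include an acceptable type."""
    return bool(ACCEPTABLE_TYPES & set(result.get('types', [])))


# Abbreviated sample results.
zoomed_out = {
    'formatted_address': 'Pennsylvania, USA',
    'types': ['administrative_area_level_1', 'political'],
}
precise = {
    'formatted_address': '1400 Liberty Ridge Dr, Chesterbrook, PA 19087, USA',
    'types': ['street_address'],
}

print(is_street_level(precise))
print(is_street_level(zoomed_out))
```

Using a set intersection keeps the check independent of the order in which the types are returned.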

The Code

Now we’re ready to write some code.

Ruby Google Maps Implementation

The Ruby Geocoder gem handles connecting to the Google Maps REST service, retrieving the JSON results, and decoding the JSON. So, let’s use it and save ourselves a little work. Note, however, that we still have to decode the results, mapping the Google Maps-specific data encoding into our more generic NormalizedAddress object.

Normalize Address via Google Maps, in Ruby (normalize_google.rb)
require 'geocoder'
require 'bigdecimal'
require 'set'

module AddressNormalizer

  class NormalizedAddress
    Geocoder::Configuration.lookup = :google

    Attrs = [:address_line1, :address_line2, :city, :state_province,
             :postal_code, :country, :latitude, :longitude,
             :formatted_address]

    attr_reader *Attrs

    def initialize(data)
      @address_line1  = nil
      @address_line2  = nil
      @city           = nil
      @postal_code    = nil
      @state_province = nil
      @country        = nil
      @latitude       = nil
      @longitude      = nil

      # The Google result consists of:
      #
      # - An array ("address_components") of hashes consisting of
      #   {"long_name" => "...", "short_name" => "...", "types" => [...]}
      # - A "geometry" hash, with the latitude and longitude
      # - A "partial_match" indicator (which we're ignoring)
      # - A "types" array (which we're also ignoring)

      data["address_components"].each do |hash|
        types = hash["types"]
        value = hash["long_name"]
        if types.include? "subpremise"
          @address_line2 = "##{value}"
        elsif types.include? "street_number"
          @house_number = value
        elsif types.include? "sublocality"
          @city = value
        elsif types.include? "locality"
          @city = value if @city.nil?
        elsif types.include? "country"
          @country = value
        elsif types.include? "postal_code"
          @postal_code = value
        elsif types.include? "route"
          @street = value
        elsif types.include? "administrative_area_level_1"
          @state_province = value
        end
      end

      @address_line1 = "#{@house_number} #{@street}"

      if data["formatted_address"]
        @formatted_address = data["formatted_address"]
      else
        @formatted_address = [
          @address_line1, @address_line2, @city, @state_province, @postal_code
        ].select {|s| ! (s.nil? || s.empty?)}.join(' ')
      end

      # Latitude and longitude

      if data["geometry"] and data["geometry"]["location"]
        loc = data["geometry"]["location"]
        # Kernel#BigDecimal also works on newer Rubies, where
        # BigDecimal.new has been removed.
        @latitude = BigDecimal(loc["lat"].to_s)
        @longitude = BigDecimal(loc["lng"].to_s)
      end
    end

    def to_s
      @formatted_address
    end

    def inspect
      Hash[ Attrs.map {|field| [field, self.send(field)]} ]
    end
  end

  ACCEPTABLE_TYPES = Set.new(["street_address", "subpremise"])

  def normalize_street_address(raw_address_string)

    normalized_address = nil

    # Geocoder.search() returns an array of results. Take the first one.
    geocoded = Geocoder.search(raw_address_string)
    if geocoded && (geocoded.length > 0)
      # Geocoder returns data that may or may not be granular enough. For
      # instance, attempting to retrieve information about nonexistent
      # address '100 Main St, XYZ, PA, US' still returns a value, but the
      # value's type is "administrative_area_level_1", which means the data
      # is granular to a (U.S.) state. If it's a valid address, we should
      # get data that's more granular than that. Of the codes listed at
      # http://code.google.com/apis/maps/documentation/geocoding/#Types
      # we're interested in "street_address" and "subpremise".
      data = geocoded[0].data
      types = Set.new(data["types"])
      if !(types & ACCEPTABLE_TYPES).empty?
        normalized_address = NormalizedAddress.new(data)
      end
    end

    normalized_address
  end
end

Here’s a sample console run, with a valid address (Google headquarters) and an invalid address (the Foobar, Pennsylvania, example from above):

Test Run
[1] pry(main)> require 'normalize_google'
=> true
[2] pry(main)> include AddressNormalizer
=> Object
[3] pry(main)> a = normalize_street_address '1600 Amphitheatre Parkway, Mountain View, CA'
=> #<AddressNormalizer::NormalizedAddress:0xd8e320>
[4] pry(main)> a.to_s
=> "1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA"
[5] pry(main)> a = normalize_street_address '100 My Place, Foobar, PA'
=> nil

Python Google Maps Implementation

For our Python implementation, we’ll use the py-googlemaps API. The results are somewhat different from the Ruby geocoder gem. For example:

Python Google Maps with Good Address
$ ipython
Python 2.7.1 (r271:86832, Mar 27 2011, 20:51:04)
...
In [1]: from googlemaps import GoogleMaps

In [2]: g = GoogleMaps()

In [3]: d = g.geocode('1600 Amphitheatre Parkway, Mountain View, CA')

In [4]: d  # reformatted slightly, for readability
Out [4]:
{
  u'Status': {
      u'code': 200,
      u'request': u'geocode'
  },
  u'Placemark': [{
    u'Point': {
      u'coordinates': [-122.0853032, 37.4211444, 0]
    },
    u'ExtendedData': {
      u'LatLonBox': {
        u'west': -122.0866522,
        u'east': -122.0839542,
        u'north': 37.4224934,
        u'south': 37.4197954
      }
    },
    u'AddressDetails': {
      u'Country': {
        u'CountryName': u'USA',
        u'AdministrativeArea': {
          u'AdministrativeAreaName': u'CA',
          u'Locality': {
            u'PostalCode': {u'PostalCodeNumber': u'94043'},
            u'Thoroughfare': {
              u'ThoroughfareName': u'1600 Amphitheatre Pkwy'
            },
            u'LocalityName': u'Mountain View'
          }
        },
        u'CountryNameCode': u'US'
      },
      u'Accuracy': 8
    },
    u'id': u'p1',
    u'address': u'1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA'
  }],
  u'name': u'1600 Amphitheatre Parkway, Mountain View, CA'
}

For a non-existent address that zooms out (“100 My Place, Foobar, PA”), here’s the result:

In [5]: d = g.geocode('100 My Place, Foobar, PA')

In [6]: d  # reformatted slightly, for readability
Out [6]:
{
  u'Status': {u'code': 200, u'request': u'geocode'},
  u'Placemark': [{
    u'Point': {
      u'coordinates': [-77.1945247, 41.2033216, 0]
    },
    u'ExtendedData': {
      u'LatLonBox': {
        u'west': -79.2434749,
        u'east': -75.1455745,
        u'north': 42.2690472,
        u'south': 40.1199535
      }
    },
    u'AddressDetails': {
      u'Country': {
        u'CountryName': u'USA',
        u'AdministrativeArea': {
          u'AdministrativeAreaName': u'PA'
        },
        u'CountryNameCode': u'US'
      },
      u'Accuracy': 2
    },
    u'id': u'p1',
    u'address': u'Pennsylvania, USA'
  }],
  u'name': u'100 My Place, Foobar, PA'
}

For a completely nonexistent address (“100 My Place, Foobar, XY”), the API raises an exception:

In [7]: d = g.geocode('100 My Place, Foobar, XY')
GoogleMapsError                           Traceback (most recent call last)

/home/bmc/<ipython console> in <module>()

/home/bmc/.pythonbrew/pythons/Python-2.7.1/lib/python2.7/site-packages/googlemaps.pyc in geocode(self, query, sensor, oe, ll, spn, gl)
    260         status_code = response['Status']['code']
    261         if status_code != STATUS_OK:
--> 262             raise GoogleMapsError(status_code, url, response)
    263         return response
    264

GoogleMapsError: Error 602: G_GEO_UNKNOWN_ADDRESS

The documentation for the API shows this example:

>>> gmaps = GoogleMaps(api_key)
>>> address = '350 Fifth Avenue New York, NY'
>>> result = gmaps.geocode(address)
>>> placemark = result['Placemark'][0]
>>> lng, lat = placemark['Point']['coordinates'][0:2]    # Note these are backwards from usual
>>> print lat, lng
40.6721118 -73.9838823
>>> details = placemark['AddressDetails']['Country']['AdministrativeArea']
>>> street = details['Locality']['Thoroughfare']['ThoroughfareName']
>>> city = details['Locality']['LocalityName']
>>> state = details['AdministrativeAreaName']
>>> zipcode = details['Locality']['PostalCode']['PostalCodeNumber']
>>> print ', '.join((street, city, state, zipcode))
350 5th Ave, Brooklyn, NY, 11215

It seems reasonable to adopt this strategy:

  • If we get an exception, we have a bad address.
  • If we can’t find details['Locality']['PostalCode']['PostalCodeNumber'] in the results, then the address isn’t granular enough, so treat it as a bad address.
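
The second rule can be sketched as a small lookup helper. This is a simplified illustration, assuming a response dictionary shaped like the py-googlemaps results shown above; postal_code_of is a hypothetical name, the sample dictionaries are trimmed-down versions of those results, and only the Locality path is checked here.

```python
# Sketch only: find the postal code in a geocode response, treating any
# missing level of the nested structure as "not granular enough".
# postal_code_of() is a hypothetical helper name.

def postal_code_of(response):
    """Return the postal code string, or None if it is not present."""
    try:
        placemark = response['Placemark'][0]
        area = placemark['AddressDetails']['Country']['AdministrativeArea']
        return area['Locality']['PostalCode']['PostalCodeNumber']
    except (KeyError, IndexError):
        return None


# Trimmed-down versions of the two responses shown above.
good = {'Placemark': [{'AddressDetails': {'Country': {'AdministrativeArea': {
    'AdministrativeAreaName': 'CA',
    'Locality': {'PostalCode': {'PostalCodeNumber': '94043'},
                 'LocalityName': 'Mountain View'}}}}}]}
zoomed_out = {'Placemark': [{'AddressDetails': {'Country': {
    'AdministrativeArea': {'AdministrativeAreaName': 'PA'}}}}]}

print(postal_code_of(good))
print(postal_code_of(zoomed_out))
```

Catching KeyError at a single point avoids a chain of nested presence checks, at the cost of not reporting which level was missing.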

Unfortunately, reverse-geocoding, with this Python API, doesn’t always return useful information, so that step is omitted here.

Normalize Address via Google Maps, in Python (normalize_google.py)
from googlemaps import GoogleMaps, GoogleMapsError
from decimal import Decimal

__all__ = ['NormalizedAddress', 'normalize_street_address']

class NormalizedAddress(object):

    def __init__(self, data):
        self.address_line1  = None
        self.address_line2  = None
        self.city           = None
        self.postal_code    = None
        self.state_province = None
        self.country        = None
        self.latitude       = None
        self.longitude      = None

        placemark = data['Placemark'][0]
        self.longitude, self.latitude = placemark['Point']['coordinates'][0:2]
        country_data = placemark['AddressDetails']['Country']
        details = country_data['AdministrativeArea']

        if details.get('SubAdministrativeArea') is not None:
            locality = details['SubAdministrativeArea']['Locality']
            self.city = locality['LocalityName']
        elif details.get('Locality') is not None:
            locality = details['Locality']
            self.city = locality['LocalityName']
        elif details.get('DependentLocality') is not None:
            locality = details['DependentLocality']
            self.city = locality['DependentLocalityName']
        else:
            raise GoogleMapsError(GoogleMapsError.G_GEO_UNKNOWN_ADDRESS)

        self.address_line1 = locality['Thoroughfare']['ThoroughfareName']

        # Break out any suite number, since it's not broken out in the data.
        try:
            i = self.address_line1.index('#')
            self.address_line2 = self.address_line1[i:].strip()
            # Slice up to (not past) the '#'; strip() removes any
            # trailing whitespace.
            self.address_line1 = self.address_line1[:i].strip()
        except ValueError:
            pass

        self.state_province = details['AdministrativeAreaName']
        self.postal_code = locality['PostalCode']['PostalCodeNumber']
        self.country = country_data['CountryName']
        self.formatted_address = placemark.get('AddressDetails', {}).get('address')
        if self.formatted_address is None:
            fields = [
                self.address_line1, self.address_line2, self.city,
                self.state_province, self.postal_code, self.country
            ]
            self.formatted_address = ' '.join([
                x for x in fields if (x is not None) and (x.strip() != '')
            ])

    def __str__(self):
        return self.formatted_address

    def __repr__(self):
        return str(self.__dict__)

def normalize_street_address(raw_address_string):

    gm = GoogleMaps() # Add API key for premium Google Maps service
    result = None

    try:
        data = gm.geocode(raw_address_string)
        if data is not None:
            result = NormalizedAddress(data)

    except KeyError:
        result = None

    except GoogleMapsError as ex:
        if ex.message != GoogleMapsError.G_GEO_UNKNOWN_ADDRESS:
            raise ex

    return result

Here’s a sample console run, with the same addresses as above:

Test Run
In [1]: from normalize_google import *

In [2]: a = normalize_street_address('1600 Amphitheatre Parkway, Mountain View, CA')

In [3]: a
Out[3]: {'address_line2': None, 'city': u'Mountain View', 'address_line1': u'1600 Amphitheatre Pkwy', 'state_province': u'CA', 'longitude': -122.0853032, 'postal_code': u'94043', 'country': u'USA', 'latitude': 37.4211444, 'formatted_address': u'1600 Amphitheatre Pkwy Mountain View CA 94043 USA'}

In [4]: str(a)
Out[4]: '1600 Amphitheatre Pkwy Mountain View CA 94043 USA'

In [5]: a = normalize_street_address('100 My Place, Foobar, PA')

In [6]: print(a)
None

In [7]: a = normalize_street_address('100 My Place, Foobar, ZZ')

In [8]: print(a)
None

Using a Commercial Service

You’ve completed your initial development, and you’re ready to launch. But, your slick new service will not be displaying maps, and your lawyer has told you, in no uncertain terms, that you cannot risk using Google Maps (or Yahoo! Maps) in violation of the terms of service. So, it’s time to switch to a commercial service.

Fortunately, the API is designed to hide its reliance on Google Maps from the outside world, so reimplementing it should be easy. There are a number of commercial solutions that provide REST APIs, and many are reasonably priced. Here are three examples:

  • Melissa Data provides both address verification and geocoding solutions. Its Address Verifier service provides a REST API and supports U.S., Canadian, and international addresses. Its subscription plans are based on yearly volume.
  • AddressDoctor provides a SOAP API for scrubbing addresses. Its plans are also based on yearly volume.
  • SmartyStreets provides a REST API, with monthly or yearly volume-based subscription plans. In addition, it has a free plan that allows 250 addresses to be scrubbed per month, and it has a GitHub repository that contains sample code for Java, Python, C#, Ruby, and JavaScript.

Another option is Gisgraphy. Gisgraphy is an open source geocoding product, which you can download and run locally. If you want to use the REST API on their server, they have a premium service; however, hosting the data locally, providing your own REST service, appears to be free. Of course, if you use this solution, you’re responsible for keeping the software and the data up-to-date, as well as providing a local server to run the software.

Let’s reimplement our service in terms of SmartyStreets. There are a few caveats to note:

  1. The SmartyStreets LiveAddress API does not do geocoding, so we won’t get latitude and longitude out of our queries. We’ll leave those fields empty in our API.
  2. The API requires pre-parsed addresses; it does not handle a single address string. That means we’ll have to figure out how to parse the address string ourselves.
  3. The resulting changes will be US-specific.

Update (4 August, 2012) Per a comment (see below), SmartyStreets now provides both geocoding and single-string address parsing, so caveats #1 and #2, above, no longer apply.

Ruby SmartyStreets Implementation

Let’s start with Ruby. Parsing the address is not a big deal in Ruby, because we can use the StreetAddress gem. The gem is fully documented at http://streetaddress.rubyforge.com.

$ gem install StreetAddress

The SmartyStreets JSON REST API is documented at http://wiki.smartystreets.com/liveaddress_api_users_guide#section-4rest-json-endpoint, though several fields are missing from the documentation. Like many such APIs, it’ll return multiple possible choices for an address. In this implementation, we’re just going to take the first one. However, for interactive services, where you’re scrubbing a user-supplied address, if there are several possible hits, you may want to display them and allow the user to choose which one seems like the right address.
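
For the interactive case, the selection step might look something like the following Python sketch. It assumes the decoded JSON response is a list of candidates, each carrying a "components" dictionary with keys such as primary_number and city_name (the same keys used by the implementation in this article). The choose_candidate and candidate_label helpers are hypothetical names, and the sample candidates are made-up data for illustration only.

```python
# Sketch only: pick one candidate from a list of possible matches.
# choose_candidate() and candidate_label() are hypothetical helpers, and
# the sample response below is invented for illustration.

LABEL_KEYS = ['primary_number', 'street_name', 'street_suffix',
              'city_name', 'state_abbreviation', 'zipcode']


def candidate_label(components):
    """Build a one-line description of a candidate for display."""
    return ' '.join(components[k] for k in LABEL_KEYS if components.get(k))


def choose_candidate(response, chooser=None):
    """Return the components of one candidate, or None if there are none.

    chooser, if given, takes a list of labels and returns the index of
    the one the user picked. Without it, the first candidate wins.
    """
    candidates = [c['components'] for c in (response or [])
                  if c.get('components')]
    if not candidates:
        return None
    if chooser is None or len(candidates) == 1:
        return candidates[0]
    return candidates[chooser([candidate_label(c) for c in candidates])]


# Made-up sample data with two possible matches.
sample = [
    {'components': {'primary_number': '100', 'street_name': 'Main',
                    'street_suffix': 'St', 'city_name': 'Springfield',
                    'state_abbreviation': 'PA', 'zipcode': '19064'}},
    {'components': {'primary_number': '100', 'street_name': 'Main',
                    'street_suffix': 'Ave', 'city_name': 'Springfield',
                    'state_abbreviation': 'PA', 'zipcode': '19064'}},
]

print(candidate_label(choose_candidate(sample)))
```

In a web application, the chooser would typically be replaced by rendering the labels and letting the user select one.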

The Ruby reimplementation of our address normalizer API is fairly straightforward:

Normalize Address via SmartyStreets, in Ruby (normalize_smartystreets.rb)
# Address Normalizer API using the SmartyStreets LiveAddress REST API.

require 'cgi'
require 'json'
require 'net/http'
require 'street_address'

module AddressNormalizer

  API_KEY    = 'your_api_key_goes_here'
  URL        = 'api.qualifiedaddress.com'
  AUTH_TOKEN = CGI::escape(API_KEY)

  class NormalizedAddress

    Attrs = [:address_line1, :address_line2, :city, :state_province,
             :postal_code, :country, :latitude, :longitude,
             :formatted_address]

    attr_reader *Attrs

    LINE1_KEYS = ['primary_number', 'street_name', 'street_suffix',
                  'street_postdirection']
    LINE2_KEYS = ['secondary_designator','secondary_number']
    ZIP_KEYS   = ['zipcode', 'plus4_code']

    def initialize(data)

      @address_line1  = assemble(LINE1_KEYS, data)
      @address_line2  = assemble(LINE2_KEYS, data)
      @city           = data['city_name']
      @postal_code    = assemble(ZIP_KEYS, data, '-')
      @state_province = data['state_abbreviation']
      @country        = 'US'
      @latitude       = nil
      @longitude      = nil

      formatted_fields = [@address_line1, @address_line2, @city, @state_province]
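      # Use cond_join for the ZIP code, too, so a missing (nil) postal
      # code doesn't raise a TypeError.
      @formatted_address = cond_join(
        [cond_join(formatted_fields, ', '), @postal_code], ' '
      )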
    end

    def to_s
      @formatted_address
    end

    def inspect
      Hash[ Attrs.map {|field| [field, self.send(field)]} ]
    end

    private

    def assemble(keys, data, sep=' ')
      s = cond_join(keys.map {|k| data[k]}, sep)
      s.strip.length == 0 ? nil : s
    end

    def cond_join(items, sep=' ')
      items.select {|i| (! i.nil?) && i.strip.length > 0}.join(sep).strip
    end
  end

  def normalize_street_address(raw_address_string)

    parsed = StreetAddress::US.parse_address(raw_address_string)
    raise "Cannot parse address #{raw_address_string}" if parsed.nil?

    address1 = "#{parsed.number} #{parsed.street} #{parsed.street_type}"
    params = {
      "candidates" => "1",
      "auth-token" => AUTH_TOKEN,
      "street"     => CGI::escape(address1),
      "city"       => CGI::escape(parsed.city),
      "state"      => CGI::escape(parsed.state)
    }

    # Escape the optional parameters, too, for consistency with the others.
    params["secondary"] = CGI::escape(parsed.unit) if parsed.unit
    params["zipcode"] = CGI::escape(parsed.postal_code) if parsed.postal_code

    query = "/street-address/?" + params.map {|k, v| "#{k}=#{v}"}.join('&')

    http = Net::HTTP.new(URL)
    request = Net::HTTP::Get.new(query)
    response = JSON.parse(http.request(request).body)
    if response and (response.length > 0) and response[0]['components']
      normalized_address = NormalizedAddress.new(response[0]['components'])
    else
      normalized_address = nil
    end

    normalized_address
  end
end

Before you can actually use that software, you have to register with SmartyStreets, get an API key, and plug it into the code by setting the API_KEY constant.

Here’s a sample console run.

Test Run
[1] pry(main)> require 'normalize_smartystreets'
=> true
[2] pry(main)> include AddressNormalizer
=> Object
[3] pry(main)> a = normalize_street_address '1600 Amphitheatre Parkway, Mountain View, CA'
=> #<AddressNormalizer::NormalizedAddress:0xd8e320>
[4] pry(main)> a.to_s
=> "1600 Amphitheatre Pkwy, Mountain View, CA 94043-1351"
[5] pry(main)> a = normalize_street_address '100 My Place, Foobar, PA'
=> nil

Python SmartyStreets Implementation

The reimplementation in Python is similar. However, there isn’t a handy Python library for parsing a single-string address. I’ve put together a quick hack, for the purposes of this blog post.

WARNING: This hack is not suitable for production use! It is far from robust and will fail on some addresses. It exists here solely for demonstration purposes. A better solution would be to port the Ruby StreetAddress gem to Python.

With that warning out of the way, here’s the address parsing code:

Quick and Dirty Address Parser, in Python (parse_addr.py)
# Quick and dirty street address parser, in Python.
#
# WARNING: THIS IS NOT ROBUST! A Python port of the Ruby StreetAddress gem
# would be a better solution.

import re

__all__ = ['parse_address']

STATE_NAMES = [
    'AL', 'Alabama',
    'AK', 'Alaska',
    'AS', 'American Samoa',
    'AZ', 'Arizona',
    'AR', 'Arkansas',
    'CA', 'California',
    'CO', 'Colorado',
    'CT', 'Connecticut',
    'DE', 'Delaware',
    'DC', 'District of Columbia',
    'FM', 'Micronesia',
    'FL', 'Florida',
    'GA', 'Georgia',
    'GU', 'Guam',
    'HI', 'Hawaii',
    'ID', 'Idaho',
    'IL', 'Illinois',
    'IN', 'Indiana',
    'IA', 'Iowa',
    'KS', 'Kansas',
    'KY', 'Kentucky',
    'LA', 'Louisiana',
    'ME', 'Maine',
    'MH', 'Marshall Islands',
    'MD', 'Maryland',
    'MA', 'Massachusetts',
    'MI', 'Michigan',
    'MN', 'Minnesota',
    'MS', 'Mississippi',
    'MO', 'Missouri',
    'MT', 'Montana',
    'NE', 'Nebraska',
    'NV', 'Nevada',
    'NH', 'New Hampshire',
    'NJ', 'New Jersey',
    'NM', 'New Mexico',
    'NY', 'New York',
    'NC', 'North Carolina',
    'ND', 'North Dakota',
    'OH', 'Ohio',
    'OK', 'Oklahoma',
    'OR', 'Oregon',
    'PW', 'Palau',
    'PA', 'Pennsylvania',
    'PR', 'Puerto Rico',
    'RI', 'Rhode Island',
    'SC', 'South Carolina',
    'SD', 'South Dakota',
    'TN', 'Tennessee',
    'TX', 'Texas',
    'UT', 'Utah',
    'VT', 'Vermont',
    'VI', 'Virgin Islands',
    'VA', 'Virginia',
    'WA', 'Washington',
    'WV', 'West Virginia',
    'WI', 'Wisconsin',
    'WY', 'Wyoming'
]

ADDR_PATTERN = \
r'^(?P<address1>(\d{1,5}\s+(\w+\s*)+\s+(road|rd|parkway|pky|dr|drive|st|street|ln|la|lane|place|pl))|(P\.O\.\s+Box\s+\d{1,5}))\s*' +\
r'\s*,?\s*(?P<address2>(apt|bldg|dept|fl|floor|lot|pier|rm|room|slip|ste|suite|trlr|unit)\s+\w{1,5}\s*)?' +\
r',\s*(?P<city>([A-Z][a-z]+\s*){1,3}),\s*' +\
r'(?P<state>(' + '|'.join(STATE_NAMES) + r')\s*)' +\
r'(?P<zipcode>\d{5}(-\d{4})?)?$'

ADDRESS_REGEX = re.compile(ADDR_PATTERN, re.IGNORECASE)

MIN_KEYS = frozenset(['address1', 'city', 'state'])

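def parse_address(address_string):
    """Parse a US street address into its parts.

    Returns a dict with the keys 'address1', 'city' and 'state', plus
    'address2' and 'zipcode' when present, or None if the string does
    not match ADDRESS_REGEX.
    """
    m = ADDRESS_REGEX.match(address_string)
    if m is None:
        return None

    result = {}
    for key in ('address1', 'address2', 'city', 'state', 'zipcode'):
        val = m.group(key)
        if val is not None:
            # Some groups capture trailing whitespace; trim it.
            result[key] = val.strip()

    # A usable address needs at least a street line, a city, and a state.
    if not MIN_KEYS.issubset(result):
        return None

    return result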
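
To get a feel for what the parser accepts, the sketch below runs a cut-down copy of the pattern against two inputs. It carries only a handful of state names (an assumption made to keep it short), it is written for Python 3, and the helper name parse is hypothetical. The second address fails only because "Avenue" is not in the list of street suffixes.

```python
import re

# Cut-down stand-in for the full STATE_NAMES list (assumption: a few
# entries are enough for illustration).
STATES = ['CA', 'California', 'NY', 'New York', 'PA', 'Pennsylvania']

PATTERN = (
    r'^(?P<address1>(\d{1,5}\s+(\w+\s*)+\s+'
    r'(road|rd|parkway|pky|dr|drive|st|street|ln|la|lane|place|pl))'
    r'|(P\.O\.\s+Box\s+\d{1,5}))\s*'
    r'\s*,?\s*(?P<address2>(apt|bldg|dept|fl|floor|lot|pier|rm|room|slip'
    r'|ste|suite|trlr|unit)\s+\w{1,5}\s*)?'
    r',\s*(?P<city>([A-Z][a-z]+\s*){1,3}),\s*'
    r'(?P<state>(' + '|'.join(STATES) + r')\s*)'
    r'(?P<zipcode>\d{5}(-\d{4})?)?$'
)
REGEX = re.compile(PATTERN, re.IGNORECASE)

def parse(address_string):
    """Hypothetical helper: return the named groups that matched, or None."""
    m = REGEX.match(address_string)
    if m is None:
        return None
    return {k: v for k, v in m.groupdict().items() if v is not None}

ok = parse('1600 Amphitheatre Parkway, Mountain View, CA')
bad = parse('10 Main Avenue, Foobar, PA')   # "Avenue" is not a known suffix
print(ok)
print(bad)
```

Adding more suffixes to the alternation widens what the parser accepts, but any regular expression of this kind will miss some legitimate addresses.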

Now that we have an address parser, however brittle, we can write the Python version of the SmartyStreets implementation.

Normalize Address via SmartyStreets, in Python (normalize_smartystreets.py)
import json
import urllib
from parse_addr import parse_address

__all__ = ['NormalizedAddress', 'normalize_street_address']

API_KEY  = r'your_api_key_goes_here'
LOCATION = 'https://api.qualifiedaddress.com/street-address/'

LINE1_KEYS = ['primary_number', 'street_name', 'street_suffix',
              'street_postdirection']
LINE2_KEYS = ['secondary_designator','secondary_number']
ZIP_KEYS   = ['zipcode', 'plus4_code']

class NormalizedAddress(object):
    """A normalized US address built from the 'components' dictionary
    of a SmartyStreets response."""

    def __init__(self, data):
        self.address_line1  = self._assemble(LINE1_KEYS, data)
        self.address_line2  = self._assemble(LINE2_KEYS, data)
        self.city           = data['city_name']
        self.postal_code    = self._assemble(ZIP_KEYS, data)
        self.state_province = data['state_abbreviation']
        self.country        = 'US'
        self.latitude       = None
        self.longitude      = None

        formatted_fields = [
            self.address_line1, self.address_line2, self.city,
            self.state_province
        ]
        self.formatted_address = (
            self._cond_join(formatted_fields, ', ') + ' ' + self.postal_code
        )

    def __str__(self):
        return self.formatted_address

    def __repr__(self):
        return str(self.__dict__)

    def _assemble(self, keys, data, sep = ' '):
        vals = [data[k] for k in keys if data.get(k) is not None]
        return self._cond_join(vals, sep)

    def _cond_join(self, items, sep = ' '):
        trimmed = [i for i in items if (i is not None) and (len(i.strip()) > 0)]
        return sep.join(trimmed)

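def normalize_street_address(raw_address_string):
    """Return a NormalizedAddress for the raw address string, or None if
    the string cannot be parsed or the service returns no match."""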
    result = None
    parsed = parse_address(raw_address_string)
    if parsed is not None:
        query_data = {
            'auth-token': API_KEY,
            'street':     parsed['address1'],
            'city':       parsed['city'],
            'state':      parsed['state']
        }

        if parsed.get('zipcode') is not None:
            query_data['zipcode'] = parsed['zipcode']

        if parsed.get('address2') is not None:
            query_data['secondary'] = parsed['address2']

        try:
            url = '%s?%s' % (LOCATION, urllib.urlencode(query_data))
            response = urllib.urlopen(url).read()
            data = json.loads(response)
            if (data is not None) and (len(data) > 0):
                result = NormalizedAddress(data[0]['components'])

        except (KeyError, ValueError, IOError):
            # Missing response field, malformed JSON, or a network failure.
            result = None

    return result
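
The listing above targets Python 2, where urlencode and urlopen live directly in the urllib module. On Python 3 those functions moved to urllib.parse and urllib.request. Here is a minimal sketch of the query-building step on Python 3, assuming the same endpoint and parameter names as the listing. The helper build_query_url is hypothetical, and it sends no request.

```python
from urllib.parse import urlencode

API_KEY = 'your_api_key_goes_here'   # placeholder, as in the listing
LOCATION = 'https://api.qualifiedaddress.com/street-address/'

def build_query_url(parsed):
    """Build the request URL from a parse_address()-style dict.

    Hypothetical helper: it only assembles the URL and sends nothing.
    """
    query_data = {
        'auth-token': API_KEY,
        'street': parsed['address1'],
        'city': parsed['city'],
        'state': parsed['state'],
    }
    # Optional fields are only sent when the parser found them.
    if parsed.get('zipcode') is not None:
        query_data['zipcode'] = parsed['zipcode']
    if parsed.get('address2') is not None:
        query_data['secondary'] = parsed['address2']
    return '%s?%s' % (LOCATION, urlencode(query_data))

url = build_query_url({
    'address1': '1600 Amphitheatre Parkway',
    'city': 'Mountain View',
    'state': 'CA',
})
print(url)
```

From there, the response could be read with urllib.request.urlopen(url).read() and passed to json.loads, just as in the Python 2 version.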

And here’s the obligatory console test run:

Test Run
In [1]: from normalize_smartystreets import *

In [2]: a = normalize_street_address('1600 Amphitheatre Parkway, Mountain View, CA')

In [3]: a
Out[3]: {'address_line2': '', 'city': u'Mountain View', 'address_line1': u'1600 Amphitheatre Pkwy', 'state_province': u'CA', 'longitude': None, 'postal_code': u'94043 1351', 'country': 'US', 'latitude': None, 'formatted_address': u'1600 Amphitheatre Pkwy, Mountain View, CA 94043 1351'}

In [4]: str(a)
Out[4]: '1600 Amphitheatre Pkwy, Mountain View, CA 94043 1351'

In [5]: a = normalize_street_address('100 My Place, Foobar, PA')

In [6]: print(a)
None

In [7]: a = normalize_street_address('100 My Place, Foobar, ZZ')

In [8]: print(a)
None

Further Enhancements

An even better implementation would be to keep both the SmartyStreets and Google implementations, allowing either one to be selected by a constant, a configuration parameter, or an environment variable. Since this task isn’t a difficult enhancement, it’s left as an exercise for the reader.
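
For readers who want a starting point, here is one possible sketch of that selection logic, written for Python 3. The environment variable name ADDRESS_BACKEND, the BACKENDS table, and the two stub functions are all assumptions made for illustration. In a real project the stubs would be replaced by the normalize_street_address functions from the Google and SmartyStreets modules.

```python
import os

def _google_stub(raw_address_string):
    # Stand-in for the Google-based normalize_street_address().
    return 'google: ' + raw_address_string

def _smartystreets_stub(raw_address_string):
    # Stand-in for the SmartyStreets-based normalize_street_address().
    return 'smartystreets: ' + raw_address_string

# Hypothetical registry mapping a backend name to its implementation.
BACKENDS = {
    'google': _google_stub,
    'smartystreets': _smartystreets_stub,
}

def normalize_street_address(raw_address_string, backend=None):
    """Dispatch to the backend named by the argument or, failing that,
    by the ADDRESS_BACKEND environment variable (hypothetical name)."""
    name = backend or os.environ.get('ADDRESS_BACKEND', 'smartystreets')
    try:
        implementation = BACKENDS[name.lower()]
    except KeyError:
        raise ValueError('Unknown address backend: %r' % name)
    return implementation(raw_address_string)

print(normalize_street_address('100 Main St, Foobar, PA', backend='google'))
```

Because callers only ever see normalize_street_address, switching services becomes a configuration change rather than a code change.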

References
