| Author |
Message |
![[Post New]](/gforum/templates/default/images/icon_minipost_new.gif) 27/03/2009 02:33:10
|
mhuggins
Joined: 20/06/2008 00:12:34
Messages: 18
Location: Downingtown, PA
Offline
|
Should there be some duplicate postal codes in the U.S. postal codes dump? I'm used to entering a postal code on Google and having it take me to one location, never listing two.
My reason for asking is that I was planning on using the postal code as my primary key for my postal codes table. This is fine for 43,666 of the postal codes; 44 others are duplicated, however, ruining my hopes.
Are these 44 duplicates intended to be there? Is there anything I can do to work around this?
Code:
SELECT t.postal_code, t.cnt AS "count" FROM (SELECT p.postal_code, count(p.postal_code) AS cnt FROM postal_codes p GROUP BY p.postal_code) t WHERE t.cnt > 1;
postal_code | count
-------------+-----
10804 | 2
15129 | 2
15714 | 2
19018 | 2
28470 | 2
28740 | 2
30598 | 2
32305 | 2
32309 | 2
32352 | 2
32461 | 2
32563 | 2
32566 | 2
33028 | 3
33326 | 2
33327 | 2
33777 | 2
33927 | 2
34609 | 2
34610 | 2
36203 | 2
39208 | 2
39218 | 2
39232 | 2
39272 | 2
39288 | 2
44334 | 2
46971 | 2
53235 | 2
54082 | 2
60527 | 2
66741 | 2
67221 | 2
75048 | 2
81066 | 2
82936 | 2
83250 | 2
83638 | 2
83714 | 2
85086 | 2
91701 | 2
91737 | 2
95304 | 2
95652 | 2
(44 rows)
|
|
|
 |
![[Post New]](/gforum/templates/default/images/icon_minipost_new.gif) 27/03/2009 04:43:06
|
mhuggins
Joined: 20/06/2008 00:12:34
Messages: 18
Location: Downingtown, PA
Offline
|
I guess I should elaborate on why I'm seeing this as an issue, and I'm trying to wrap my head around a solution.
When I insert this data into my database, I need to assign it a primary key. Using a serial/auto-incrementing value is simple enough. However, if I later want to do a refresh of this data with an updated CSV file here, then there's a chance that the postal codes will become associated with a different primary key value. This is not good for other tables containing foreign keys to my postal codes table.
Any suggestions?
|
|
|
 |
![[Post New]](/gforum/templates/default/images/icon_minipost_new.gif) 27/03/2009 06:15:31
|
marc
Joined: 08/12/2005 07:39:47
Messages: 4501
Offline
|
The United States Postal Service lists duplicates as well:
http://zip4.usps.com/zip4/citytown_zip.jsp
Example:
Actual City name in 10804
NEW ROCHELLE, NY
Acceptable City names in 10804
WYKAGYL, NY
Marc
|
 |
|
|
 |
![[Post New]](/gforum/templates/default/images/icon_minipost_new.gif) 27/03/2009 06:51:41
|
mhuggins
Joined: 20/06/2008 00:12:34
Messages: 18
Location: Downingtown, PA
Offline
|
Damn, that sucks. Thanks for the reply, marc.
Well on that note, does anyone have any suggestions on how to import the data into a database and have it maintain the same primary key no matter how often I refresh the data from a new CSV?
|
|
|
 |
![[Post New]](/gforum/templates/default/images/icon_minipost_new.gif) 05/06/2009 03:39:07
|
JoeMoyle
![[Avatar]](/gforum/images/avatar/5129a5ddcd0dcd755232baa04c231698.jpg)
Joined: 01/06/2009 23:34:08
Messages: 2
Offline
|
I use the following MS SQL 2008 code to get rid of what I consider duplicates.
Code:
WITH Dups AS
(
SELECT
Admin_Code1
,Admin_Code2
,Country_Code
,[Name]
,Postal_Code
,ROW_NUMBER() OVER(
PARTITION BY Admin_Code1,Admin_Code2, Country_Code, [Name], Postal_Code
ORDER BY Admin_Code1, Admin_Code2, Country_Code, [Name], Postal_Code) AS RN
FROM dbo.Tmp_PostalCodes
)
DELETE
FROM Dups
WHERE RN > 1;
As you can see it removes what I think are true duplicates but not duplicate postal codes.
What I have done to import the data and maintain a key will take a few moments to explain. First I have a staging table where I can do things like run the remove duplicates code I posted above. I have a table in the end database that makes use of the idea of super types and sub types. I have a super type table called geo_feature that contains information generic to any geo feature I load. It contains two columns used to track where the source of the data and the key used in that source data if it has one.
For data such as the postal code dump that has no geo_id I create one by concatinating the values of the natural key (the colums I use to determine duplicates above) and then hash that value. At least for this postal code data it results in a unique hash for every value once I've cleaned out the true duplicates.
Here is a portion of the code I use to pull from the staging table where you can see the creation of the hash I use for the key.
Code:
SELECT DISTINCT
pc.Admin_Code1 AS Admin_Code_1
,pc.Admin_Code2 AS Admin_Code_2
,pc.Country_Code AS Country_Code
,gf.Geo_Id AS Country_Geo_Id
,pc.[Name] AS [Name]
,pc.Postal_Code AS Postal_Code
,HASHBYTES('SHA1'
,pc.Admin_Code1
+ pc.Admin_Code2
+ pc.Country_Code
+ pc.[Name]
+ pc.Postal_Code
) AS [Hash_Bytes]
FROM Tmp_PostalCodes AS pc
JOIN Tmp_AllCountries AS a
ON pc.Country_Code = a.Country_Code
JOIN [Geo].[Geo_Feature] AS gf
ON CAST(a.GeoName_Code AS NVARCHAR(32)) = gf.Data_Source_Id
WHERE 1 = 1
AND a.Feature_Class = 'A'
AND a.Feature_Code = 'PCLI'
AND gf.Data_Source = @Data_Source
Hope this helps.
|
Joe Moyle
www.JoeMoyleEnterprises.com |
|
|
 |
|
|