GeoNames Forum

Should there be some duplicate postal codes in the U.S. postal codes dump? I'm used to entering a postal code on Google and having it take me to one location, never listing two.

My reason for asking is that I was planning on using the postal code as my primary key for my postal codes table. This is fine for 43,666 of the postal codes; 44 others are duplicated, however, ruining my hopes.

Are these 44 duplicates intended to be there? Is there anything I can do to work around this?

Code:

SELECT t.postal_code, t.cnt AS "count" FROM (SELECT p.postal_code, count(p.postal_code) AS cnt FROM postal_codes p GROUP BY p.postal_code) t WHERE t.cnt > 1;
 
  postal_code | count 
 -------------+-----
  10804       |   2
  15129       |   2
  15714       |   2
  19018       |   2
  28470       |   2
  28740       |   2
  30598       |   2
  32305       |   2
  32309       |   2
  32352       |   2
  32461       |   2
  32563       |   2
  32566       |   2
  33028       |   3
  33326       |   2
  33327       |   2
  33777       |   2
  33927       |   2
  34609       |   2
  34610       |   2
  36203       |   2
  39208       |   2
  39218       |   2
  39232       |   2
  39272       |   2
  39288       |   2
  44334       |   2
  46971       |   2
  53235       |   2
  54082       |   2
  60527       |   2
  66741       |   2
  67221       |   2
  75048       |   2
  81066       |   2
  82936       |   2
  83250       |   2
  83638       |   2
  83714       |   2
  85086       |   2
  91701       |   2
  91737       |   2
  95304       |   2
  95652       |   2
 (44 rows)

I guess I should elaborate on why I'm seeing this as an issue, and I'm trying to wrap my head around a solution.

When I insert this data into my database, I need to assign it a primary key. Using a serial/auto-incrementing value is simple enough. However, if I later want to do a refresh of this data with an updated CSV file here, then there's a chance that the postal codes will become associated with a different primary key value. This is not good for other tables containing foreign keys to my postal codes table.

Any suggestions?

The United States Postal Service lists duplicates as well:
http://zip4.usps.com/zip4/citytown_zip.jsp

Example:
Actual City name in 10804
NEW ROCHELLE, NY
Acceptable City names in 10804
WYKAGYL, NY

Marc

Damn, that sucks. Thanks for the reply, marc.

Well on that note, does anyone have any suggestions on how to import the data into a database and have it maintain the same primary key no matter how often I refresh the data from a new CSV?

I use the following MS SQL 2008 code to get rid of what I consider duplicates.
Code:

WITH Dups AS
 (
 	SELECT
 		Admin_Code1
 		,Admin_Code2
 		,Country_Code
 		,[Name]
 		,Postal_Code
 		,ROW_NUMBER() OVER(
 			PARTITION BY Admin_Code1,Admin_Code2, Country_Code, [Name], Postal_Code
 			ORDER BY Admin_Code1, Admin_Code2, Country_Code, [Name], Postal_Code) AS RN
 	FROM dbo.Tmp_PostalCodes
 )
 DELETE
 FROM	Dups
 WHERE	RN > 1;

As you can see it removes what I think are true duplicates but not duplicate postal codes.

What I have done to import the data and maintain a key will take a few moments to explain. First I have a staging table where I can do things like run the remove duplicates code I posted above. I have a table in the end database that makes use of the idea of super types and sub types. I have a super type table called geo_feature that contains information generic to any geo feature I load. It contains two columns used to track where the source of the data and the key used in that source data if it has one.

For data such as the postal code dump that has no geo_id I create one by concatinating the values of the natural key (the colums I use to determine duplicates above) and then hash that value. At least for this postal code data it results in a unique hash for every value once I've cleaned out the true duplicates.

Here is a portion of the code I use to pull from the staging table where you can see the creation of the hash I use for the key.

Code:

	SELECT	DISTINCT
 		pc.Admin_Code1		AS Admin_Code_1
 		,pc.Admin_Code2		AS Admin_Code_2
 		,pc.Country_Code	AS Country_Code
 		,gf.Geo_Id			AS Country_Geo_Id
 		,pc.[Name]			AS [Name]
 		,pc.Postal_Code		AS Postal_Code
 		,HASHBYTES('SHA1'
 				,pc.Admin_Code1
 				+ pc.Admin_Code2
 				+ pc.Country_Code
 				+ pc.[Name]
 				+ pc.Postal_Code
 				)			AS [Hash_Bytes]
 	FROM	Tmp_PostalCodes					AS pc
 	JOIN	Tmp_AllCountries				AS a
 	ON	pc.Country_Code = a.Country_Code
 	JOIN	[Geo].[Geo_Feature]				AS gf
 	ON	CAST(a.GeoName_Code AS NVARCHAR(32)) = gf.Data_Source_Id
 	WHERE	1 = 1
 	AND	a.Feature_Class = 'A'
 	AND	a.Feature_Code = 'PCLI'
 	AND	gf.Data_Source = @Data_Source

Hope this helps.