<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
	<channel>
		<title><![CDATA[Latest posts for the topic "Duplicate postal codes?"]]></title>
		<link>http://forum.geonames.org/gforum/posts/list/7.page</link>
		<description><![CDATA[Latest messages posted in the topic "Duplicate postal codes?"]]></description>
		<generator>JForum - http://www.jforum.net</generator>
			<item>
				<title>Duplicate postal codes?</title>
				<description><![CDATA[ Should there be some duplicate postal codes in the U.S. postal codes dump?  I'm used to entering a postal code on Google and having it take me to one location, never listing two.

My reason for asking is that I was planning on using the postal code as my primary key for my postal codes table.  This is fine for 43,666 of the postal codes; 44 others are duplicated, however, ruining my hopes.

Are these 44 duplicates intended to be there?  Is there anything I can do to work around this?

<span class="genmed"><b>Code:</b></span><br>
		<div style="overflow: auto; width: 100%;">
		<pre>SELECT t.postal_code, t.cnt AS "count" FROM &#40;SELECT p.postal_code, count&#40;p.postal_code&#41; AS cnt FROM postal_codes p GROUP BY p.postal_code&#41; t WHERE t.cnt &gt; 1;

 postal_code | count 
-------------+-----
 10804       |   2
 15129       |   2
 15714       |   2
 19018       |   2
 28470       |   2
 28740       |   2
 30598       |   2
 32305       |   2
 32309       |   2
 32352       |   2
 32461       |   2
 32563       |   2
 32566       |   2
 33028       |   3
 33326       |   2
 33327       |   2
 33777       |   2
 33927       |   2
 34609       |   2
 34610       |   2
 36203       |   2
 39208       |   2
 39218       |   2
 39232       |   2
 39272       |   2
 39288       |   2
 44334       |   2
 46971       |   2
 53235       |   2
 54082       |   2
 60527       |   2
 66741       |   2
 67221       |   2
 75048       |   2
 81066       |   2
 82936       |   2
 83250       |   2
 83638       |   2
 83714       |   2
 85086       |   2
 91701       |   2
 91737       |   2
 95304       |   2
 95652       |   2
&#40;44 rows&#41;</pre>
		</div>]]></description>
				<guid isPermaLink="true">http://forum.geonames.org/gforum/posts/list/1347.page#5630</guid>
				<link>http://forum.geonames.org/gforum/posts/list/1347.page#5630</link>
				<pubDate><![CDATA[Fri, 27 Mar 2009 02:33:10]]> GMT</pubDate>
				<author><![CDATA[ mhuggins]]></author>
			</item>
			<item>
				<title>Re:Duplicate postal codes?</title>
				<description><![CDATA[ I guess I should elaborate on why I'm seeing this as an issue, and I'm trying to wrap my head around a solution.

When I insert this data into my database, I need to assign it a primary key.  Using a serial/auto-incrementing value is simple enough.  However, if I later want to do a refresh of this data with an updated CSV file here, then there's a chance that the postal codes will become associated with a different primary key value.  This is not good for other tables containing foreign keys to my postal codes table.

Any suggestions?]]></description>
				<guid isPermaLink="true">http://forum.geonames.org/gforum/posts/list/1347.page#5631</guid>
				<link>http://forum.geonames.org/gforum/posts/list/1347.page#5631</link>
				<pubDate><![CDATA[Fri, 27 Mar 2009 04:43:06]]> GMT</pubDate>
				<author><![CDATA[ mhuggins]]></author>
			</item>
			<item>
				<title>Re:Duplicate postal codes?</title>
				<description><![CDATA[ The United States Postal Service lists duplicates as well:
http://zip4.usps.com/zip4/citytown_zip.jsp

Example:
Actual City name in 10804
NEW ROCHELLE, NY
Acceptable City names in 10804
WYKAGYL, NY


Marc]]></description>
				<guid isPermaLink="true">http://forum.geonames.org/gforum/posts/list/1347.page#5632</guid>
				<link>http://forum.geonames.org/gforum/posts/list/1347.page#5632</link>
				<pubDate><![CDATA[Fri, 27 Mar 2009 06:15:31]]> GMT</pubDate>
				<author><![CDATA[ marc]]></author>
			</item>
			<item>
				<title>Re:Duplicate postal codes?</title>
				<description><![CDATA[ Damn, that sucks.  Thanks for the reply, marc.

Well on that note, does anyone have any suggestions on how to import the data into a database and have it maintain the same primary key no matter how often I refresh the data from a new CSV?]]></description>
				<guid isPermaLink="true">http://forum.geonames.org/gforum/posts/list/1347.page#5633</guid>
				<link>http://forum.geonames.org/gforum/posts/list/1347.page#5633</link>
				<pubDate><![CDATA[Fri, 27 Mar 2009 06:51:41]]> GMT</pubDate>
				<author><![CDATA[ mhuggins]]></author>
			</item>
			<item>
				<title>Re:Duplicate postal codes?</title>
				<description><![CDATA[ I use the following MS SQL 2008 code to get rid of what I consider duplicates.
<span class="genmed"><b>Code:</b></span><br>
		<div style="overflow: auto; width: 100%;">
		<pre>WITH Dups AS
&#40;
	SELECT
		Admin_Code1
		,Admin_Code2
		,Country_Code
		,&#91;Name&#93;
		,Postal_Code
		,ROW_NUMBER&#40;&#41; OVER&#40;
			PARTITION BY Admin_Code1,Admin_Code2, Country_Code, &#91;Name&#93;, Postal_Code
			ORDER BY Admin_Code1, Admin_Code2, Country_Code, &#91;Name&#93;, Postal_Code&#41; AS RN
	FROM dbo.Tmp_PostalCodes
&#41;
DELETE
FROM	Dups
WHERE	RN &gt; 1;</pre>
		</div>
As you can see it removes what I think are true duplicates but not duplicate postal codes.

What I have done to import the data and maintain a key will take a few moments to explain.  First I have a staging table where I can do things like run the remove duplicates code I posted above.  I have a table in the end database that makes use of the idea of super types and sub types.  I have a super type table called geo_feature that contains information generic to any geo feature I load.  It contains two columns used to track where the source of the data and the key used in that source data if it has one.

For data such as the postal code dump that has no geo_id I create one by concatinating the values of the natural key (the colums I use to determine duplicates above) and then hash that value.  At least for this postal code data it results in a unique hash for every value once I've cleaned out the true duplicates.

Here is a portion of the code I use to pull from the staging table where you can see the creation of the hash I use for the key.

<span class="genmed"><b>Code:</b></span><br>
		<div style="overflow: auto; width: 100%;">
		<pre>	SELECT	DISTINCT
		pc.Admin_Code1		AS Admin_Code_1
		,pc.Admin_Code2		AS Admin_Code_2
		,pc.Country_Code	AS Country_Code
		,gf.Geo_Id			AS Country_Geo_Id
		,pc.&#91;Name&#93;			AS &#91;Name&#93;
		,pc.Postal_Code		AS Postal_Code
		,HASHBYTES&#40;'SHA1'
				,pc.Admin_Code1
				+ pc.Admin_Code2
				+ pc.Country_Code
				+ pc.&#91;Name&#93;
				+ pc.Postal_Code
				&#41;			AS &#91;Hash_Bytes&#93;
	FROM	Tmp_PostalCodes					AS pc
	JOIN	Tmp_AllCountries				AS a
	ON	pc.Country_Code = a.Country_Code
	JOIN	&#91;Geo&#93;.&#91;Geo_Feature&#93;				AS gf
	ON	CAST&#40;a.GeoName_Code AS NVARCHAR&#40;32&#41;&#41; = gf.Data_Source_Id
	WHERE	1 = 1
	AND	a.Feature_Class = 'A'
	AND	a.Feature_Code = 'PCLI'
	AND	gf.Data_Source = @Data_Source</pre>
		</div>

Hope this helps.]]></description>
				<guid isPermaLink="true">http://forum.geonames.org/gforum/posts/list/1347.page#5938</guid>
				<link>http://forum.geonames.org/gforum/posts/list/1347.page#5938</link>
				<pubDate><![CDATA[Fri, 5 Jun 2009 03:39:07]]> GMT</pubDate>
				<author><![CDATA[ JoeMoyle]]></author>
			</item>
	</channel>
</rss>