GeoNames Home | Postal Codes | Download / Webservice | About 

GeoNames Forum
  [Search] Search   [Recent Topics] Recent Topics   [Groups] Back to home page 
[Register] Register / 
[Login] Login 
mssql spatial import... problem solved  XML
Forum Index -> General
Author Message
JasonCoder



Joined: 11/03/2008 23:15:30
Messages: 2
Offline

During the process of fighting with the import of GeoNames into mssql 2008, along with several offerings to the gods, I swore if it worked to write up what I did to be of assistance to others.

First, the UTF-8 was a real pain. Neither BULK LOAD nor SSIS (read: DTS) liked that at all. So first order of business was to convert this to UTF-16, or what Windows/.NET just calls "Unicode".

There were a couple of online services which did a nice job on the smaller files but allCountries.txt was a different story. Obviously something local was needed. I found a win32 port of iconv but this gave me odd results. While it was quick, I had to specify little endian utf16 and even then it gave me garbage on every row.

I decided to write my own program to do this leveraging the conversion classes in the .NET framework. My code only ran about 20 lines, the System.IO authors really delivered. It's a small little program (attached). You will need the .NET framework installed.

After I had the file in UTF-16 SSIS worked just fine. If you created your table with the scema in the readme on this site and then make sure under advanced you specify the same datatypes, lengths and precision you'll be fine.

Edit: How to use the program would probably help, yeah? Utf8ToUtf16.exe inputFile.txt outputFile.txt


Regards,

JasonCoder
 Description converts utf8 to utf16 [Disk] Download
 Filesize 5 Kbytes
 Downloaded:  5008 time(s)


This geodetic stuff is pretty cool
PJonDevelopment



Joined: 06/03/2008 01:24:26
Messages: 14
Location: Rio de Janeiro, Brazil
Offline

Hi, Jason

I had no problem importing the tables to MSSQL 2008.

I just right clicked the database on the Enterprise Manager, selected Tasks, Import Data.

At the window that opened I selected Flat File, browse to the file I want to import.

And the catch (which I discovered the second time around) was to select UTF-8 at the file encoding drop down list.

After that, I clicked on the Columns to specify, size, and datatypes for each column of the file.

Then, Next, Next, Next... and fix some mappings to datatypes that were not found because of my poor selection,.

And MSSQL 2008 imported every single line with no problems.

On a technical note, UTF-8 and UTF-16 are just encoding formats for an Unicode text.

Regards,

Paulo Santos
http://pjondevelopment.50webs.com
[WWW]
JasonCoder



Joined: 11/03/2008 23:15:30
Messages: 2
Offline

Hi Paulo,

I went back and looked and on CTP6 there is no "file encoding" drop down list. There is indeed a "code page" drop down with a UTF-8 option buried at the end. Trouble is, this list gets grayed out if you check the "unicode" checkbox. Very confusing.

Maybe I'm just a noob that doesn't know that code page == file encoding. And why have a unicode checkbox that greys that list out? I guess because unicode to msft means utf-16? Sigh, whatever.

Anyways, glad that worked for you, for me it was less than intuitive. Hopefully that's something msft fixes by RTM.

My program might still come in handy for those not on 2008 or who find the import utility as unfriendly as I did.

This geodetic stuff is pretty cool
oyen



Joined: 27/09/2008 10:15:28
Messages: 5
Offline

THANK YOU! Your program was a great help. I was using the NAnt thing before but none of the columns registered, so I thought it wasn't parsing it in the proper manner. Your program saved me. Thank you!
AFKYBD



Joined: 19/02/2009 02:10:03
Messages: 6
Offline

Jason,

Would you mind posting the code you wrote to convert?
dovholuk



Joined: 01/06/2009 19:19:11
Messages: 1
Offline

Here is the code i used.. it's a tad 'brittle' (no checking that the file exists etc) but it'll do the job if you supply valid parameters...

usage: inputFile outputFile rowsBeforeStatus

Utf8ToUtf16.exe "c:\allCountries.txt" allAsUtf16.txt 100000

which produces:
100000 lines processed
200000 lines processed
300000 lines processed
400000 lines processed
500000 lines processed
600000 lines processed
700000 lines processed
800000 lines processed
...
... SNIP ...
...
6800000 lines processed
6900000 lines processed
6900956 lines processed

Code:
 using System;
 using System.Collections.Generic;
 using System.Linq;
 using System.Text;
 using System.IO;
 
 namespace Utf8ToUtf16
 {
   class Program
   {
     static void Main(string[] args)
     {
       Int64 rowsBeforeStatus = long.MaxValue;
       if (args.Length > 2)
       {
         rowsBeforeStatus = Convert.ToInt64(args[2]);
       }
 
       using (FileStream inputFile = File.Open(args[0], FileMode.Open, FileAccess.Read, FileShare.Write))
       {
         using (StreamReader inputStream = new StreamReader(inputFile, Encoding.UTF8))
         {
           using (FileStream outputFile = File.Open(args[1], FileMode.Create, FileAccess.Write, FileShare.Write))
           {
             using (StreamWriter outputStream = new StreamWriter(outputFile, Encoding.Unicode))
             {
               if (rowsBeforeStatus == long.MaxValue)
               {
                 while (!inputStream.EndOfStream)
                 {
                   outputStream.WriteLine(inputStream.ReadLine());
                 }
               }
               else
               {
                 Int64 totalLinesProcessed = 0;
                 while (!inputStream.EndOfStream)
                 {
                   Int64 i = 0;
                   while (i < rowsBeforeStatus && !inputStream.EndOfStream)
                   {
                     outputStream.WriteLine(inputStream.ReadLine());
                     i++;
                   }
                   totalLinesProcessed += i;
                   Console.WriteLine(totalLinesProcessed + " lines processed");
                 }
               }
             }
           }
         }
       }
     }
   }
 }
 


Then as per above:

BULK
INSERT GeoNames
FROM 'c:\allAsUtf16.txt'
WITH(
DATAFILETYPE = 'widechar',
FIELDTERMINATOR = '\t',
ROWTERMINATOR = '\n'
)
GO


The entire process (on a Core 2 duo machine with decent disks/ram) was something like 10-20 minutes total. I spent a TON more time just trying to figure out that sql server doesn't support UTF8 (cause I just couldn't imagine that was true - but it was).

Time to write code = 10 minutes
File process time = 10-30 seconds
Bulk Insert time = just over 5 1/2 minutes for me

Hope that helps everyone out there! :)
 
Forum Index -> General
Go to:   
Powered by JForum 2.1.5 © JForum Team