GeoNames Home | Postal Codes | Download / Webservice | About 

GeoNames Forum
  [Search] Search   [Recent Topics] Recent Topics   [Groups] Back to home page 
[Register] Register / 
[Login] Login 
creating additional hierarchy to places and linking postcodes with geoName  XML
Forum Index -> General
Author Message
jonny512379



Joined: 11/02/2017 20:41:32
Messages: 14
Offline

Hello All / Marc

I have used your data sets before and am using them again.
I am also using the postcode data + full GB postcodes

i have written some python script, in a flat file, which uses separate/standalone functions build your data into PostgreSQL
This uses mostly your data of
Code:
# URLs for GeoNames data
 FILES_TO_DOWNLOAD = {
     "allCountries.zip": "http://download.geonames.org/export/dump/allCountries.zip",
     "countryInfo.txt": "http://download.geonames.org/export/dump/countryInfo.txt",
     "admin1CodesASCII.txt": "http://download.geonames.org/export/dump/admin1CodesASCII.txt",
     "admin2Codes.txt": "http://download.geonames.org/export/dump/admin2Codes.txt",
     "postalCodes_allCountries.zip": "http://download.geonames.org/export/zip/allCountries.zip",
     "GB_full.csv.zip": "http://download.geonames.org/export/zip/GB_full.csv.zip",
     "hierarchy.zip": "http://download.geonames.org/export/dump/hierarchy.zip",
     "alternateNamesV2.zip": "https://download.geonames.org/export/dump/alternateNamesV2.zip",
     "adminCode5.zip": "http://download.geonames.org/export/dump/adminCode5.zip"  
 }


Part of this code is now attempting to link geoname_id with the postal code data set, using PostGIS + the lat/lon columns.

The main part of the code, basically gathers your data and recreates it in a database

I have contacted geoNames by email, about premium data, but as of yet, no reply. i am more than happy to pay for it.

What i am wondering, is "would you like me to publish the python script here, so others may use it".
The script is of course catered for my needs to create, recreate and update the overall results. However anyone OK with python (and/or ChatGPT) should be able to easily add in functions for additional data sets.

Best Regards
Jon
[Email]
jonny512379



Joined: 11/02/2017 20:41:32
Messages: 14
Offline

thank you for allowing this post.

I am of course not an expert in geoNames or areas/postcodes, they are dam complicated and do not always follow logic....

I believe my process/logic of linking the geoName and postlcode data table is sound (first attempt though!!!).

I am now seeing the problem... my script has been running on 24 xeon cores for over 6 hours.... and this is only for GB.... i will leave it running over night, but suspect i may need to relook at my indexing and my Postgre config to give it more recourses.

To attempt to do postalcodes worldwide, i may been to spin up an AWS server with 100s of cores....

Once done, i am happy to share the data and code that calculated this.

PS i am loving this challenge/puzzle! i will complete it!

Thanks Jon
[Email]
jonny512379



Joined: 11/02/2017 20:41:32
Messages: 14
Offline

If this is of help to anyone, it is a basic python code, intentionally written in modular style, so you can comment out calls to any functions you do not need.

It is written to import into a postgreSQL database, but should work for mySQL also (not tested). HOWEVER step 12 needs postgreSQL and PostGIS enabling, so this step will not work on mySQL. However this step is for my needs and not required to import the data.

Step 7 is also not required for UK townslist (i left it in though, in case it is of use, although i have not used it as of yet)

It should also create a full hierarchies structure for admin areas and places.
i *think* this gets it all correct, but do your own testing for your requirements.

once the main data is imported, it leaves this data intact and creates duplicate tables for editing, so you can re-run the script again for updates.

It only uses the main tables from geoNames, but you can add to the download list/functions and extract functions if you like. For the additional downloads, etc. table creation data you will need to refer to geoNames documentation.

in the __main__ uncomment or comment the functions you want to run or not


Code:
 import os
 import requests
 import zipfile
 import psycopg2
 import shutil
 from concurrent.futures import ThreadPoolExecutor
 
 # Database configuration
 DB_NAME = "your database name"
 DB_USER = "your user name"
 DB_PASS = "your password"
 DB_HOST = "your database location"
 DB_PORT = "your database port"
 
 # Local directory for data
 DATA_DIR = r"E:\work - website douments\dev\geoNames_data"
 
 # URLs for GeoNames data
 FILES_TO_DOWNLOAD = {
     "allCountries.zip": "http://download.geonames.org/export/dump/allCountries.zip",
     "countryInfo.txt": "http://download.geonames.org/export/dump/countryInfo.txt",
     "admin1CodesASCII.txt": "http://download.geonames.org/export/dump/admin1CodesASCII.txt",
     "admin2Codes.txt": "http://download.geonames.org/export/dump/admin2Codes.txt",
     "postalCodes_allCountries.zip": "http://download.geonames.org/export/zip/allCountries.zip",
     "GB_full.csv.zip": "http://download.geonames.org/export/zip/GB_full.csv.zip",
     "hierarchy.zip": "http://download.geonames.org/export/dump/hierarchy.zip",
     "alternateNamesV2.zip": "https://download.geonames.org/export/dump/alternateNamesV2.zip",
     "adminCode5.zip": "http://download.geonames.org/export/dump/adminCode5.zip"  
 }
 
 # ----------------------------------
 # Helper: Connect to database
 def connect_db():
     conn = psycopg2.connect(
         dbname=DB_NAME, user=DB_USER, password=DB_PASS, host=DB_HOST, port=DB_PORT
     )
     cur = conn.cursor()
     cur.execute("SELECT version();")
     print("Connected to DB:", cur.fetchone()[0])
     return conn, cur
 
 # Helper: ensure PostGIS is enabled in the current database
 def enable_postgis():
     """
     Enables the PostGIS extension in the current database.
     This ensures that the PostGIS types (e.g., geometry) and functions are available.
     """
     conn, cur = connect_db()
     try:
         cur.execute("CREATE EXTENSION IF NOT EXISTS postgis;")
         conn.commit()
         print("PostGIS extension enabled.")
     except Exception as e:
         conn.rollback()
         print("Error enabling PostGIS extension:", e)
     finally:
         cur.close()
         conn.close()
 
 
 
 # 1. Ensure Data Directory
 def ensure_data_directory():
     os.makedirs(DATA_DIR, exist_ok=True)
     os.chdir(DATA_DIR)
     print(f"Working directory: {DATA_DIR}")
 
 # 2. Download a file (independent function)
 def download_file(file_dict, overwrite=False):
     for fname, url in file_dict.items():
         if not overwrite and os.path.exists(fname):
             print(f"{fname} exists, skipping download.")
         else:
             print(f"Downloading {fname}...")
             r = requests.get(url)
             r.raise_for_status()
             with open(fname, "wb") as f:
                 f.write(r.content)
             print(f"Downloaded {fname}")
 
 # 3. Extract file from ZIP into a new file
 def extract_file(zip_mapping, final_filename):
     # zip_mapping: dict mapping zip file name -> expected internal file name
     for zip_file, internal_filename in zip_mapping.items():
         print(f"Extracting {zip_file}...")
         temp_folder = f"temp_{zip_file}"
         with zipfile.ZipFile(zip_file, 'r') as z:
             z.extractall(temp_folder)
             extracted_file = find_file(temp_folder, internal_filename)
             if extracted_file:
                 shutil.move(extracted_file, os.path.join(DATA_DIR, final_filename))
                 print(f"Extracted and renamed to {final_filename}.")
             else:
                 print(f"⚠️ {zip_file} did not contain expected file {internal_filename}.")
             shutil.rmtree(temp_folder, ignore_errors=True)
 
 # Helper: Find file in a directory tree
 def find_file(directory, target):
     for root, _, files in os.walk(directory):
         if target in files:
             return os.path.join(root, target)
     return None
 
 # 4. Create database tables (each function is independent)
 def create_table_country():
     conn, cur = connect_db()
     cur.execute("""
     CREATE TABLE country (
         iso_code VARCHAR(2) PRIMARY KEY, 
         name TEXT, 
         iso3 VARCHAR(3),
         population BIGINT, 
         capital TEXT, 
         continent VARCHAR(2), 
         geoname_id INTEGER
     );
     """)
     conn.commit()
     print("Created table: country.")
     cur.close()
     conn.close()
 
 def create_table_region():
     conn, cur = connect_db()
     cur.execute("""
     CREATE TABLE region (
         country_code VARCHAR(2), 
         admin1_code VARCHAR(20), 
         name TEXT, 
         geoname_id INTEGER,
         PRIMARY KEY(country_code, admin1_code)
     );
     """)
     conn.commit()
     print("Created table: region.")
     cur.close()
     conn.close()
 
 def create_table_subregion():
     conn, cur = connect_db()
     cur.execute("""
      CREATE TABLE subregion (
         country_code VARCHAR(2), 
         admin1_code VARCHAR(20), 
         admin2_code VARCHAR(80),
         name TEXT, 
         geoname_id INTEGER,
         PRIMARY KEY(country_code, admin1_code, admin2_code)
     );
     """)
     conn.commit()
     print("Created table: subregion.")
     cur.close()
     conn.close()
 
 def create_table_geoname():
     conn, cur = connect_db()
     cur.execute("""
      CREATE TABLE geoname (
         geoname_id INTEGER PRIMARY KEY, 
         name TEXT, 
         asciiname TEXT, 
         alternatenames TEXT,
         latitude DOUBLE PRECISION, 
         longitude DOUBLE PRECISION, 
         feature_class CHAR(1),
         feature_code VARCHAR(10), 
         country_code VARCHAR(2), 
         cc2 TEXT,
         admin1_code VARCHAR(20), 
         admin2_code VARCHAR(80), 
         admin3_code VARCHAR(20), 
         admin4_code VARCHAR(20),
         population BIGINT, 
         elevation INTEGER, 
         dem INTEGER, 
         timezone VARCHAR(40), 
         modification_date DATE
     );
     """)
     conn.commit()
     print("Created table: geoname.")
     cur.close()
     conn.close()
 
 def create_table_postalcode():
     conn, cur = connect_db()
     cur.execute("""
         CREATE TABLE postalcode (
             country_code VARCHAR(2),
             postal_code VARCHAR(20),
             place_name TEXT,
             admin1_name TEXT,
             admin1_code VARCHAR(20),
             admin2_name TEXT,
             admin2_code VARCHAR(80),
             admin3_name TEXT,
             admin3_code VARCHAR(20),
             latitude DOUBLE PRECISION,
             longitude DOUBLE PRECISION,
             accuracy INTEGER
         );
     """)
     cur.execute("""
         CREATE INDEX IF NOT EXISTS idx_postal_code
         ON postalcode(country_code, postal_code);
     """)
     conn.commit()
     print("Created table: postalcode and index idx_postal_code.")
     cur.close()
     conn.close()
 
 def create_table_postcodesectorcentroid():
     conn, cur = connect_db()
     cur.execute("""
      CREATE TABLE postcodesectorcentroid (
         country_code VARCHAR(2), 
         postcode_prefix VARCHAR(20),
         latitude DOUBLE PRECISION, 
         longitude DOUBLE PRECISION,
         PRIMARY KEY(country_code, postcode_prefix)
     );
     """)
     conn.commit()
     print("Created table: postcodesectorcentroid.")
     cur.close()
     conn.close()
 
 def create_table_hierarchy_raw():
     """
     Creates a raw table for the hierarchy data.
     Expected columns: parent_id, child_id, type
     """
     conn, cur = connect_db()
     cur.execute("DROP TABLE IF EXISTS hierarchy_raw;")
     cur.execute("""
         CREATE TABLE hierarchy_raw (
             parent_id INTEGER,
             child_id INTEGER,
             type TEXT
         );
     """)
     conn.commit()
     print("Created table: hierarchy_raw.")
     cur.close()
     conn.close()
 
 def create_table_alternate_names_v2_raw():
     """
     Creates a raw table for the alternateNamesV2 data.
     Expected columns:
       alternateNameId INTEGER,
       geonameId INTEGER,
       isoLanguage VARCHAR(10),
       alternateName TEXT,
       isPreferredName SMALLINT,
       isShortName SMALLINT,
       isColloquial SMALLINT,
       isHistoric SMALLINT,
       "from" TEXT,
       "to" TEXT
     """
     conn, cur = connect_db()
     cur.execute("DROP TABLE IF EXISTS alternate_names_v2_raw;")
     cur.execute("""
         CREATE TABLE alternate_names_v2_raw (
             alternateNameId INTEGER,
             geonameId INTEGER,
             isoLanguage VARCHAR(10),
             alternateName TEXT,
             isPreferredName SMALLINT,
             isShortName SMALLINT,
             isColloquial SMALLINT,
             isHistoric SMALLINT,
             "from" TEXT,
             "to" TEXT
         );
     """)
     conn.commit()
     print("Created table: alternate_names_v2_raw.")
     cur.close()
     conn.close()
 
 # --- Function: Create table for adminCode5 ---
 def create_table_admincode5():
     """
     Creates a table 'admincode5' to store the data from adminCode5.zip.
     Expected columns: geonameId (INTEGER) and adm5code (TEXT).
     """
     conn, cur = connect_db()
     try:
         cur.execute("DROP TABLE IF EXISTS admincode5;")
         cur.execute("""
             CREATE TABLE admincode5 (
                 geonameid INTEGER,
                 adm5code TEXT
             );
         """)
         conn.commit()
         print("Created table: admincode5.")
     except Exception as e:
         conn.rollback()
         print("Error creating table admincode5:", e)
     finally:
         cur.close()
         conn.close()
 
 # 5. Import data into tables
 def import_countries():
     conn, cur = connect_db()
     with open("countryInfo.txt", encoding="utf-8") as f:
         for line in f:
             if line.startswith("#"):
                 continue
             parts = line.strip().split("\t")
             if len(parts) >= 17:
                 # Use conditional expression to avoid ValueError
                 pop = int(parts[7]) if parts[7] else None
                 geoname_id = int(parts[16]) if parts[16] else None
                 cur.execute("INSERT INTO country VALUES (%s,%s,%s,%s,%s,%s,%s) ON CONFLICT DO NOTHING;",
                             (parts[0], parts[4], parts[1], pop, parts[5], parts[8], geoname_id))
     conn.commit()
     print("Imported countries.")
     cur.close()
     conn.close()
 
 def import_regions():
     conn, cur = connect_db()
     with open("admin1CodesASCII.txt", encoding="utf-8") as f:
         for line in f:
             parts = line.strip().split("\t")
             if len(parts) < 4:
                 continue
             code, name, _, geoid = parts
             country, admin1 = code.split('.')
             cur.execute("INSERT INTO region VALUES (%s,%s,%s,%s) ON CONFLICT DO NOTHING;",
                         (country, admin1, name, int(geoid)))
     conn.commit()
     print("Imported regions.")
     cur.close()
     conn.close()
 
 def import_subregions():
     conn, cur = connect_db()
     with open("admin2Codes.txt", encoding="utf-8") as f:
         for line in f:
             parts = line.strip().split("\t")
             if len(parts) == 4 and len(parts[0].split('.')) == 3:
                 country, admin1, admin2 = parts[0].split('.')
                 cur.execute("INSERT INTO subregion VALUES (%s,%s,%s,%s,%s) ON CONFLICT DO NOTHING;",
                             (country, admin1, admin2, parts[1], int(parts[3])))
     conn.commit()
     print("Imported subregions.")
     cur.close()
     conn.close()
 
 def import_geonames():
     conn, cur = connect_db()
     place_file = os.path.join(DATA_DIR, "allCountries.txt")
     print("Copying geoname data started.")
     with open(place_file, 'r', encoding='utf-8') as f:
         cur.copy_expert("COPY geoname FROM STDIN (FORMAT csv, DELIMITER E'\t', QUOTE E'\b', NULL '', ESCAPE '\\');", f)
     conn.commit()
     print("Imported geoname data.")
     cur.close()
     conn.close()
     
 
 def import_postalcode():
     conn, cur = connect_db()
     postal_file = os.path.join(DATA_DIR, "postalCodes_allCountries.txt")
     print("Copying postalcode data started.")
     with open(postal_file, 'r', encoding='utf-8') as f:
         cur.copy_expert(
             "COPY postalcode FROM STDIN WITH (FORMAT csv, DELIMITER E'\t', NULL '');", f)
     conn.commit()
     print("Imported postal codes.")
     cur.close()
     conn.close()
 
 def import_hierarchy_raw():
     """
     Imports the raw hierarchy data from the extracted hierarchy.txt file.
     """
     conn, cur = connect_db()
     hierarchy_file = os.path.join(DATA_DIR, "hierarchy.txt")
     print("Importing hierarchy data from", hierarchy_file)
     with open(hierarchy_file, 'r', encoding='utf-8') as f:
         cur.copy_expert(
             "COPY hierarchy_raw FROM STDIN (FORMAT csv, DELIMITER E'\t', NULL '');", f)
     conn.commit()
     print("Imported hierarchy data.")
     cur.close()
     conn.close()
 
 def import_alternate_names_v2():
     """
     Imports the raw alternateNamesV2 data from the extracted alternateNamesV2.txt file.
     """
     conn, cur = connect_db()
     alt_names_file = os.path.join(DATA_DIR, "alternateNamesV2.txt")
     print("Importing alternateNamesV2 data from", alt_names_file)
     with open(alt_names_file, 'r', encoding='utf-8') as f:
         cur.copy_expert(
             "COPY alternate_names_v2_raw FROM STDIN (FORMAT csv, DELIMITER E'\t', NULL '');", f)
     conn.commit()
     print("Imported alternateNamesV2 data.")
     cur.close()
     conn.close()
 
 # --- Function: Import adminCode5 data ---
 def import_admincode5():
     """
     Imports data from the extracted adminCode5.txt file into the 'admincode5' table.
     Assumes the file is located in DATA_DIR.
     """
     conn, cur = connect_db()
     admincode5_file = os.path.join(DATA_DIR, "adminCode5.txt")
     print("Importing adminCode5 data from", admincode5_file)
     try:
         with open(admincode5_file, 'r', encoding='utf-8') as f:
             cur.copy_expert(
                 "COPY admincode5 FROM STDIN WITH (FORMAT csv, DELIMITER E'\t', NULL '');", f)
         conn.commit()
         print("Imported adminCode5 data successfully.")
     except Exception as e:
         conn.rollback()
         print("Error importing adminCode5 data:", e)
     finally:
         cur.close()
         conn.close()
 
 
 # 6. Update UK postcodes with a full dataset (for GB)
 def update_with_full_UK_postcodes():
     conn, cur = connect_db()
     cur.execute("DELETE FROM postalcode WHERE country_code = 'GB';")
     conn.commit()
     print("Deleted GB post codes.")
     cur.close()
     conn.close()
 
     full_UK_postal_file = os.path.join(DATA_DIR, "GB_full.txt")
     conn, cur = connect_db()
     print("Copying full UK postalcode data started.")
     with open(full_UK_postal_file, 'r', encoding='utf-8') as f:
         cur.copy_expert(
             "COPY postalcode FROM STDIN WITH (FORMAT csv, DELIMITER E'\t', NULL '');", f)
     conn.commit()
     print("Imported full UK postal codes.")
     cur.close()
     conn.close()
 
 # 7. import uk_towns.sql
 def import_uk_towns_sql():
     """
     Imports the UK towns data from the SQL file.
     File for the SQL is purchased from www.townslist.co.uk
     This data is useful as it contains some UK data on area names that is missing from geoNames.
     It can also be useful for future hierarchical structure
     The file is expected at:
     E:/work - website douments/dev/townsList/uk-towns/sql/uk-towns.sql
     """
 
     # Use a raw string for the directory path to avoid escape sequence issues.
     sql_file_path = os.path.join(r"E:\work - website douments\dev\townsList\uk-towns\sql", "uk-towns.sql")
     print(f"Importing UK towns from {sql_file_path}...")
     
     # Read the entire SQL file
     with open(sql_file_path, 'r', encoding='utf-8') as f:
         sql_content = f.read()
     
     # Connect to the database using your helper function
     conn, cur = connect_db()
     
     try:
         # Split the content by semicolons to execute multiple statements
         statements = sql_content.split(';')
         for statement in statements:
             stmt = statement.strip()
             if stmt:
                 cur.execute(stmt)
         conn.commit()
         print("UK towns SQL import successful.")
     except Exception as e:
         conn.rollback()
         print("Error importing uk-towns.sql:", e)
     finally:
         cur.close()
         conn.close()
 
 # 8 create duplicate table of geonames for editing
 def create_and_copy_geonames_hierarchy_custom():
     """
     Creates a new table called 'geonames_hierarchy_custom' by copying the structure
     and all data from the 'geoname' table, then adds extra columns:
       - toDisplay (SMALLINT)
       - geoname_geonameid_parentID (INTEGER)
       - parentName (TEXT)
     The primary key is preserved, giving you manual control.
     """
     conn, cur = connect_db()
     try:
         # Drop the table if it already exists.
         cur.execute("DROP TABLE IF EXISTS geonames_hierarchy_custom;")
         conn.commit()
         
         # Create the table with the same structure as geoname.
         cur.execute("CREATE TABLE geonames_hierarchy_custom (LIKE geoname INCLUDING ALL);")
         conn.commit()
         
         # Copy all data from geoname into the new table.
         cur.execute("INSERT INTO geonames_hierarchy_custom SELECT * FROM geoname;")
         conn.commit()
         
         # Alter the table to add extra columns.
         cur.execute("ALTER TABLE geonames_hierarchy_custom ADD COLUMN toDisplay SMALLINT;")
         cur.execute("ALTER TABLE geonames_hierarchy_custom ADD COLUMN geoname_geonameid_parentID INTEGER;")
         cur.execute("ALTER TABLE geonames_hierarchy_custom ADD COLUMN parentName TEXT;")
         conn.commit()
         
         print("Created and populated table: geonames_hierarchy_custom.")
     except Exception as e:
         conn.rollback()
         print("Error creating and copying geonames_hierarchy_custom:", e)
     finally:
         cur.close()
         conn.close()
 
 # 9: start add parent areas to custom areas
 
 # helper funmction for admin4
 def get_parent_by_admin4(admin1, admin2, admin3, admin4, country):
     """Return the parent row from geoname using ADM4 criteria."""
     conn, cur = connect_db()
     try:
         cur.execute("""
             SELECT geoname_id, name 
             FROM geoname
             WHERE country_code = %s
               AND feature_class = 'A'
               AND feature_code = 'ADM4'
               AND admin1_code = %s
               AND admin2_code = %s
               AND admin3_code = %s
               AND admin4_code = %s
             LIMIT 1;
         """, (country, admin1, admin2, admin3, admin4))
         result = cur.fetchone()
     finally:
         cur.close()
         conn.close()
     return result
 
 # helper funmction for admin3
 def get_parent_by_admin3(admin1, admin2, admin3, country):
     """Return the parent row from geoname using ADM3 criteria."""
     conn, cur = connect_db()
     try:
         cur.execute("""
             SELECT geoname_id, name 
             FROM geoname
             WHERE country_code = %s
               AND feature_class = 'A'
               AND feature_code = 'ADM3'
               AND admin1_code = %s
               AND admin2_code = %s
               AND admin3_code = %s
             LIMIT 1;
         """, (country, admin1, admin2, admin3))
         result = cur.fetchone()
     finally:
         cur.close()
         conn.close()
     return result
 
 # helper funmction for admin2
 def get_parent_by_admin2(admin1, admin2, country):
     """Return the parent row from geoname using ADM2 criteria."""
     conn, cur = connect_db()
     try:
         cur.execute("""
             SELECT geoname_id, name 
             FROM geoname
             WHERE country_code = %s
               AND feature_class = 'A'
               AND feature_code = 'ADM2'
               AND admin1_code = %s
               AND admin2_code = %s
             LIMIT 1;
         """, (country, admin1, admin2))
         result = cur.fetchone()
     finally:
         cur.close()
         conn.close()
     return result
 
 # helper funmction for admin1
 def get_parent_by_admin1(admin1, country):
     """Return the parent row from geoname using ADM1 criteria."""
     conn, cur = connect_db()
     try:
         cur.execute("""
             SELECT geoname_id, name 
             FROM geoname
             WHERE country_code = %s
               AND feature_class = 'A'
               AND feature_code = 'ADM1'
               AND admin1_code = %s
             LIMIT 1;
         """, (country, admin1))
         result = cur.fetchone()
     finally:
         cur.close()
         conn.close()
     return result
 
 # helper funmction for country
 def get_parent_country(country):
     """Return the country-level row from geoname (using a PCL code)."""
     conn, cur = connect_db()
     try:
         cur.execute("""
             SELECT geoname_id, name 
             FROM geoname
             WHERE feature_code LIKE 'PCL%%'
               AND country_code = %s
             LIMIT 1;
         """, (country,))
         result = cur.fetchone()
     finally:
         cur.close()
         conn.close()
     return result
 
 # helper funmction to get area parent
 def get_parent_for_area(admin1, admin2, admin3, admin4, country, fcode):
     """
     Determines the parent for an area using its admin codes and feature code.
     If fcode contains 'ADM', it follows one logic; otherwise it uses an alternative order.
     """
     if 'ADM' in fcode:
         if admin4 and admin4.strip():
             return get_parent_by_admin3(admin1, admin2, admin3, country)
         elif admin3 and admin3.strip():
             return get_parent_by_admin2(admin1, admin2, country)
         elif admin2 and admin2.strip():
             return get_parent_by_admin1(admin1, country)
         elif admin1 and admin1.strip():
             return get_parent_country(country)
     else:
         if admin4 and admin4.strip():
             return get_parent_by_admin4(admin1, admin2, admin3, admin4, country)
         elif admin3 and admin3.strip():
             return get_parent_by_admin3(admin1, admin2, admin3, country)
         elif admin2 and admin2.strip():
             return get_parent_by_admin2(admin1, admin2, country)
         elif admin1 and admin1.strip():
             return get_parent_by_admin1(admin1, country)
         elif country and country.strip():
             return get_parent_country(country)
     return None
 
 # main function to call for adding all parents
 def add_area_parents():
     """
     Reads all rows from the 'geoname' table and, for each area, determines its parent based on admin codes and feature code.
     Inserts into 'geonames_hierarchy_custom' table the following fields:
       - geoname_id (from geoname)
       - name (from geoname)
       - geoname_geonameid_parentID (determined parent's geoname_id, or NULL)
       - parentName (determined parent's name, or NULL)
     """
     # Retrieve all areas from the geoname table
     conn, cur = connect_db()
     cur.execute("""
         SELECT geoname_id, name, admin1_code, admin2_code, admin3_code, admin4_code, country_code, feature_code
         FROM geoname;
     """)
     rows = cur.fetchall()
     cur.close()
     conn.close()
     
     for row in rows:
         geoname_id, name, admin1, admin2, admin3, admin4, country, fcode = row
         parent = get_parent_for_area(admin1, admin2, admin3, admin4, country, fcode)
         parent_id = parent[0] if parent else None
         parent_name = parent[1] if parent else None
 
         # Insert the record into geonames_hierarchy_custom
         conn, cur = connect_db()
         try:
             cur.execute("""
                 INSERT INTO geonames_hierarchy_custom (geoname_id, name, geoname_geonameid_parentID, parentName)
                 VALUES (%s, %s, %s, %s);
             """, (geoname_id, name, parent_id, parent_name))
             conn.commit()
         except Exception as e:
             conn.rollback()
             print(f"Error inserting geoname_id {geoname_id}: {e}")
         finally:
             cur.close()
             conn.close()
 
 # end of 9 add parent areas to custom areas
 def update_hierarchy_from_raw():
     """
     Updates the geonames_hierarchy_custom table using explicit parent-child relationships 
     from the hierarchy_raw table (imported from hierarchy.zip). For rows where hierarchy_raw.type = 'ADM',
     it sets:
       - geoname_geonameid_parentID to the parent_id from hierarchy_raw, and 
       - parentName to the parent's name (looked up from the geoname table).
     This function performs a bulk update using a single SQL statement.
     """
     conn, cur = connect_db()
     try:
         cur.execute("""
             UPDATE geonames_hierarchy_custom AS ghc
             SET geoname_geonameid_parentID = hr.parent_id,
                 parentName = (SELECT name FROM geoname WHERE geoname_id = hr.parent_id)
             FROM hierarchy_raw AS hr
             WHERE ghc.geoname_id = hr.child_id
               AND hr.type = 'ADM';
         """)
         conn.commit()
         print("Bulk update from hierarchy_raw completed successfully.")
     except Exception as e:
         conn.rollback()
         print("Error in bulk update from hierarchy_raw:", e)
     finally:
         cur.close()
         conn.close()
 
 
 # 10: add parents to cities of county or fall back to state
 def assign_parent_to_populated_places():
     """
     For records in geonames_hierarchy_custom with feature_class 'P'
     (populated places) that do not have a parent assigned (geoname_geonameid_parentID IS NULL),
     assign a parent based on matching administrative codes from the geoname table.
     
     1. First, attempt to assign the parent using the county level (ADM2) by matching
        country_code, admin1_code, and admin2_code.
     2. Then, for any remaining records, assign the parent using the state/region level (ADM1)
        by matching country_code and admin1_code.
     """
     conn, cur = connect_db()
     try:
         # Update using county-level data (ADM2)
         cur.execute("""
             UPDATE geonames_hierarchy_custom AS child
             SET geoname_geonameid_parentID = parent.geoname_id,
                 parentName = parent.name
             FROM geoname AS parent
             WHERE child.feature_class = 'P'
               AND child.geoname_geonameid_parentID IS NULL
               AND child.country_code = parent.country_code
               AND child.admin1_code = parent.admin1_code
               AND child.admin2_code = parent.admin2_code
               AND parent.feature_class = 'A'
               AND parent.feature_code = 'ADM2';
         """)
         conn.commit()
         print("Assigned parent to populated places using county-level (ADM2).")
 
         # For any remaining populated places, use state/region-level data (ADM1)
         cur.execute("""
             UPDATE geonames_hierarchy_custom AS child
             SET geoname_geonameid_parentID = parent.geoname_id,
                 parentName = parent.name
             FROM geoname AS parent
             WHERE child.feature_class = 'P'
               AND child.geoname_geonameid_parentID IS NULL
               AND child.country_code = parent.country_code
               AND child.admin1_code = parent.admin1_code
               AND parent.feature_class = 'A'
               AND parent.feature_code = 'ADM1';
         """)
         conn.commit()
         print("Assigned remaining populated places using state/region-level (ADM1).")
     except Exception as e:
         conn.rollback()
         print("Error assigning parent to populated places:", e)
     finally:
         cur.close()
         conn.close()
    
 # 11: add in and populate admin5 codes
 def add_and_populate_admin5_code():
     """
     Adds a new column 'admin5_code' (TEXT) to the geonames_hierarchy_custom table,
     then populates it by joining with the admincode5 table. 
     The admincode5 table has columns:
       - geonameid (INTEGER)
       - adm5code (TEXT)
     This function updates geonames_hierarchy_custom.admin5_code with admincode5.adm5code
     for rows where geoname_id = admincode5.geonameid.
     """
     conn, cur = connect_db()
     try:
         # Add the new column to the table
         cur.execute("ALTER TABLE geonames_hierarchy_custom ADD COLUMN admin5_code TEXT;")
         conn.commit()
         print("Added column 'admin5_code' to geonames_hierarchy_custom.")
 
         # Update the new column by joining with admincode5 table
         cur.execute("""
             UPDATE geonames_hierarchy_custom ghc
             SET admin5_code = ac.adm5code
             FROM admincode5 ac
             WHERE ghc.geoname_id = ac.geonameid;
         """)
         conn.commit()
         print("Populated 'admin5_code' from admincode5 table.")
     except Exception as e:
         conn.rollback()
         print("Error in add_and_populate_admin5_code:", e)
     finally:
         cur.close()
         conn.close()
 
 # 12: start of adding relationship to postcodes from geonames
 def create_and_copy_postalcode_custom():
     """
     Creates a new table 'postalcode_custom' as an identical copy of 'postalcode'
     (including all columns and data), and then adds a new column 'geoname_id'
     to store the related administrative area from geonames_hierarchy_custom.
     """
     conn, cur = connect_db()
     try:
         # Drop the table if it exists
         cur.execute("DROP TABLE IF EXISTS postalcode_custom;")
         conn.commit()
         
         # Create postalcode_custom as a copy of postalcode
         cur.execute("CREATE TABLE postalcode_custom AS TABLE postalcode;")
         conn.commit()
         print("Created table: postalcode_custom as a copy of postalcode.")
         
         # Add the new column for geoname_id
         cur.execute("ALTER TABLE postalcode_custom ADD COLUMN geoname_id INTEGER;")
         conn.commit()
         print("Added column 'geoname_id' to postalcode_custom.")
     except Exception as e:
         conn.rollback()
         print("Error in create_and_copy_postalcode_custom:", e)
     finally:
         cur.close()
         conn.close()
 
 
 def add_geometry_to_postalcode_custom():
     """
     Adds a geometry column 'geom' to the postalcode_custom table and populates it
     using the latitude and longitude columns. The geometry is created using SRID 4326.
     """
     conn, cur = connect_db()
     try:
         # Add the geometry column if it does not exist
         cur.execute("ALTER TABLE postalcode_custom ADD COLUMN geom geometry(Point,4326);")
         conn.commit()
         print("Added geometry column 'geom' to postalcode_custom.")
         
         # Populate the geometry column from longitude and latitude
         cur.execute("""
             UPDATE postalcode_custom
             SET geom = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326);
         """)
         conn.commit()
         print("Populated geometry column 'geom' from latitude and longitude.")
     except Exception as e:
         conn.rollback()
         print("Error in add_geometry_to_postalcode_custom:", e)
     finally:
         cur.close()
         conn.close()
 
 
 def create_spatial_index_postalcode_custom():
     """
     Creates a spatial index on the 'geom' column of postalcode_custom for fast spatial queries.
     """
     conn, cur = connect_db()
     try:
         cur.execute("CREATE INDEX IF NOT EXISTS idx_postalcode_custom_geom ON postalcode_custom USING GIST(geom);")
         conn.commit()
         print("Created spatial index on postalcode_custom.geom.")
     except Exception as e:
         conn.rollback()
         print("Error in create_spatial_index_postalcode_custom:", e)
     finally:
         cur.close()
         conn.close()
 
 
 def create_spatial_index_hierarchy_custom():
     """
     Creates a spatial index on the 'geom' column of geonames_hierarchy_custom.
     (Assumes that you have already added and populated a geometry column in that table.)
     """
     conn, cur = connect_db()
     try:
         cur.execute("CREATE INDEX IF NOT EXISTS idx_ghc_geom ON geonames_hierarchy_custom USING GIST(geom);")
         conn.commit()
         print("Created spatial index on geonames_hierarchy_custom.geom.")
     except Exception as e:
         conn.rollback()
         print("Error in create_spatial_index_hierarchy_custom:", e)
     finally:
         cur.close()
         conn.close()
 
 def create_indexes_postalcode_custom():
     """
     Creates a indexes on the 'country_code, postal_code' columns of country_code.
     (Assumes that you have already added and populated a geometry column in that table.)
     """
     conn, cur = connect_db()
     try:
         cur.execute("CREATE INDEX IF NOT EXISTS idx_postalcode_custom_country_postal ON postalcode_custom(country_code, postal_code);")
         conn.commit()
         print("Created indexes on index on country_code and postal_code for postalcode_custom.")
     except Exception as e:
         conn.rollback()
         print("Error in create_indexes_postalcode_custom:", e)
     finally:
         cur.close()
         conn.close()
 
 def create_index_on_country_code(table_name, index_name="idx_country_code"):
     """
     Creates an index on the country_code column for the specified table,
     if it does not already exist.
 
     Parameters:
       table_name (str): Name of the table on which to create the index.
       index_name (str): Desired name for the index (default: 'idx_country_code').
     """
     conn, cur = connect_db()
     try:
         query = f"CREATE INDEX IF NOT EXISTS {index_name} ON {table_name}(country_code);"
         cur.execute(query)
         conn.commit()
         print(f"Created index {index_name} on {table_name}(country_code).")
     except Exception as e:
         conn.rollback()
         print(f"Error creating index on {table_name}(country_code):", e)
     finally:
         cur.close()
         conn.close()
 
 
 
 def add_geometry_to_hierarchy_custom():
     """
     Adds a geometry column 'geom' to the geonames_hierarchy_custom table and populates it
     using the latitude and longitude columns from that table.
     """
     conn, cur = connect_db()
     try:
         cur.execute("ALTER TABLE geonames_hierarchy_custom ADD COLUMN geom geometry(Point,4326);")
         conn.commit()
         print("Added geometry column 'geom' to geonames_hierarchy_custom.")
         
         cur.execute("""
             UPDATE geonames_hierarchy_custom
             SET geom = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326);
         """)
         conn.commit()
         print("Populated geometry column 'geom' in geonames_hierarchy_custom.")
     except Exception as e:
         conn.rollback()
         print("Error in add_geometry_to_hierarchy_custom:", e)
     finally:
         cur.close()
         conn.close()
 
 import multiprocessing
 
 def get_gb_postcode_prefixes():
     """
     Queries the postalcode_custom table for country 'GB' and returns a sorted list
     of distinct two-character prefixes from the postal_code column.
     
     Returns:
         List[str]: A sorted list of two-character postal code prefixes.
     """
     conn, cur = connect_db()
     try:
         cur.execute("""
             SELECT DISTINCT LEFT(postal_code, 2) AS prefix
             FROM postalcode_custom
             WHERE country_code = 'GB'
               AND LENGTH(postal_code) >= 2;
         """)
         prefixes = [row[0] for row in cur.fetchall() if row[0]]
         return sorted(prefixes)
     except Exception as e:
         print("Error retrieving prefixes:", e)
         return []
     finally:
         cur.close()
         conn.close()
 
 def update_postalcode_custom_geoname_id_chunked_by_prefix(prefix, batch_size=1000, search_radius=100000):
     """
     Updates postalcode_custom.geoname_id for records in GB that match a given postal code prefix.
     Processes data in batches. Uses a spatial join with a bounding-box filter and the kNN operator.
     
     Parameters:
       prefix (str): Two-character prefix to filter postal_code values (e.g., 'ST').
       batch_size (int): Number of distinct postal codes to process per batch.
       search_radius (int): Maximum distance (in meters) to consider for candidates.
     """
     while True:
         conn, cur = connect_db()
         try:
             cur.execute("""
                 SELECT DISTINCT postal_code
                 FROM postalcode_custom
                 WHERE country_code = 'GB'
                   AND geoname_id IS NULL
                   AND postal_code LIKE %s
                 LIMIT %s;
             """, (f"{prefix}%", batch_size))
             batch = cur.fetchall()
         except Exception as e:
             conn.rollback()
             print(f"Error fetching batch for prefix {prefix}: {e}")
             cur.close()
             conn.close()
             break
         finally:
             cur.close()
             conn.close()
         
         if not batch:
             print(f"No more records to update for prefix {prefix}.")
             break
 
         postal_codes = [row[0] for row in batch]
         print(f"[{prefix}] Processing batch of {len(postal_codes)} postal codes...")
 
         conn, cur = connect_db()
         try:
             query = """
                 WITH nearest AS (
                   SELECT 
                     pc.postal_code,
                     pc.country_code,
                     ghc.geoname_id,
                     ROW_NUMBER() OVER (
                       PARTITION BY pc.postal_code
                       ORDER BY pc.geom <-> ghc.geom
                     ) AS rn
                   FROM postalcode_custom pc
                   JOIN geonames_hierarchy_custom ghc
                     ON pc.country_code = ghc.country_code
                   WHERE pc.postal_code = ANY(%s)
                     AND ST_DWithin(pc.geom::geography, ghc.geom::geography, %s)
                 )
                 UPDATE postalcode_custom pc
                 SET geoname_id = n.geoname_id
                 FROM nearest n
                 WHERE pc.postal_code = n.postal_code
                   AND n.rn = 1;
             """
             cur.execute(query, (postal_codes, search_radius))
             conn.commit()
             print(f"[{prefix}] Updated {len(postal_codes)} postal codes.")
         except Exception as e:
             conn.rollback()
             print(f"Error updating batch for prefix {prefix}: {e}")
         finally:
             cur.close()
             conn.close()
 
 def update_for_prefix(prefix):
     """
     Wrapper function to update postal codes for a given prefix.
     """
     update_postalcode_custom_geoname_id_chunked_by_prefix(prefix)
 
 
 # helper function, if a mistake was made to empty postalcode_custom for a country_code
 def reset_geoname_ids_for_country(country_code='GB'):
     """
     Resets the geoname_id column to NULL for all rows in postalcode_custom
     for the given country_code.
     """
     conn, cur = connect_db()
     try:
         cur.execute("UPDATE postalcode_custom SET geoname_id = NULL WHERE country_code = %s;", (country_code,))
         conn.commit()
         print(f"Reset geoname_id to NULL for country {country_code}.")
     except Exception as e:
         conn.rollback()
         print("Error resetting geoname_id:", e)
     finally:
         cur.close()
         conn.close()
 
 
 
 
 # end of step 12: start of adding relationship to postcodes from geonames
 
 
 
 # x. Compute medians for postcode sectors (for GB)
 def compute_medians():
     conn, cur = connect_db()
     cur.execute("""
         INSERT INTO postcodesectorcentroid(country_code, postcode_prefix, latitude, longitude)
         SELECT country_code, LEFT(postal_code, LENGTH(postal_code)-2),
                percentile_cont(0.5) WITHIN GROUP (ORDER BY latitude),
                percentile_cont(0.5) WITHIN GROUP (ORDER BY longitude)
         FROM postalcode
         WHERE country_code='GB'
         GROUP BY country_code, LEFT(postal_code, LENGTH(postal_code)-2);
     """)
     conn.commit()
     print("Computed centroids for postcode sectors.")
     cur.close()
     conn.close()
 
 # ----------------------------------
 # Main block: call each step independently as needed.
 if __name__ == "__main__":
     # Step 1: Ensure data directory
     #ensure_data_directory()
     # ensure PostGIS is enabled
     #enable_postgis()
 
     # Step 2: Download files (uncomment to run)
     #download_file({"allCountries.zip": FILES_TO_DOWNLOAD["allCountries.zip"]}, overwrite=False)
     #download_file({"countryInfo.txt": FILES_TO_DOWNLOAD["countryInfo.txt"]}, overwrite=False)
     #download_file({"admin1CodesASCII.txt": FILES_TO_DOWNLOAD["admin1CodesASCII.txt"]}, overwrite=False)
     #download_file({"admin2Codes.txt": FILES_TO_DOWNLOAD["admin2Codes.txt"]}, overwrite=False)
     #download_file({"postalCodes_allCountries.zip": FILES_TO_DOWNLOAD["postalCodes_allCountries.zip"]}, overwrite=False)
     #download_file({"GB_full.csv.zip": FILES_TO_DOWNLOAD["GB_full.csv.zip"]}, overwrite=False)
     #download_file({"hierarchy.zip": FILES_TO_DOWNLOAD["hierarchy.zip"]}, overwrite=False)
     #download_file({"alternateNamesV2.zip": FILES_TO_DOWNLOAD["alternateNamesV2.zip"]}, overwrite=False)
     #download_file({"adminCode5.zip": FILES_TO_DOWNLOAD["adminCode5.zip"]}, overwrite=False)
 
 
 
     # Step 3: Extract files (uncomment to run)
     #extract_file({"allCountries.zip": "allCountries.txt"}, "allCountries.txt")
     #extract_file({"postalCodes_allCountries.zip": "allCountries.txt"}, "postalCodes_allCountries.txt")
     #extract_file({"GB_full.csv.zip": "GB_full.txt"}, "GB_full.txt")
     #extract_file({"hierarchy.zip": "hierarchy.txt"}, "hierarchy.txt")
     #extract_file({"alternateNamesV2.zip": "alternateNamesV2.txt"}, "alternateNamesV2.txt")
     #extract_file({"adminCode5.zip": "adminCode5.txt"}, "adminCode5.txt")
 
     # Step 4: Create tables (uncomment to run)
     #create_table_country()
     #create_table_region()
     #create_table_subregion()
     #create_table_geoname()
     #create_table_postalcode()
     #create_table_postcodesectorcentroid()
     #create_table_hierarchy_raw()
     #create_table_alternate_names_v2_raw()
     #create_table_admincode5()
 
     # Step 5: Import data into tables (uncomment to run)
     #import_countries()
     #import_regions()
     #import_subregions()
     #import_geonames()
     #import_postalcode()
     #import_hierarchy_raw()
     #import_alternate_names_v2()
     #import_admincode5()
 
     # Step 6: Update postal codes for GB with full data (uncomment to run)
     #update_with_full_UK_postcodes()
 
     # Step 7: import uk_towns from <a href="https://www.townslist.co.uk/" target="_blank" rel="nofollow">https://www.townslist.co.uk/</a>
     #import_uk_towns_sql()
 
     # Step 8: create a custom table based on geoNames for modification
     #create_and_copy_geonames_hierarchy_custom()
 
     # Step 9: add parents to each area of the new custom table
     #update_hierarchy_from_raw()
     # add_area_parents() not needed, does the same as above
 
     # 10: add parents to cities of county or fall back to state
     #assign_parent_to_populated_places()
 
     # 11: add in and populate admin5 codes
     # add_and_populate_admin5_code()
 
     # 12: start of adding relationship to postcodes from geonames
     # a. Create postalcode_custom as a copy of postalcode and add new column.
     #create_and_copy_postalcode_custom()
     
     # b. Add a geometry column and populate it.
     # add_geometry_to_postalcode_custom()
     
     # c. Add a geometry column to geonames_hierarchy_custom (if not already done).
     # add_geometry_to_hierarchy_custom()
     
     # d. Create spatial indexes on both tables.
     #create_spatial_index_postalcode_custom()
     #create_spatial_index_hierarchy_custom()
     # other useful indexes
     #create_indexes_postalcode_custom()
     #create_index_on_country_code("postalcode_custom", "idx_postalcode_custom_country")
     #create_index_on_country_code("geonames_hierarchy_custom", "idx_ghc_country")
     
     # e. Perform the spatial join to update postalcode_custom.geoname_id.
  
 
     # e1. helper function, if you need to reset a countries postcodes to NULL
     # USE WITH GREAT CAUTION, THIS WILL REMOVE HOURS/DAYS/WEEKS OF WORK!!
     #reset_geoname_ids_for_country(country_code='GB'): 
 
     # Get the list of distinct two-character prefixes from postalcode_custom for GB.
     prefixes = get_gb_postcode_prefixes()
     print("Distinct GB postal code prefixes:", prefixes)
     
     # Create a multiprocessing pool (adjust number of processes as appropriate for your machine)
     pool = multiprocessing.Pool(processes=8)
     pool.map(update_for_prefix, prefixes)
     pool.close()
     pool.join()
 
     print("Parallel update for GB postal codes complete.")
 
     # Step x: Compute medians for postcode sectors
     # compute_medians()
 
     print("ߎɠGeoNames import complete!")
 


EDIT: running steps 1 to 11 will take at most up to an hour or so, depending on PC/Server spec.
UPDATE: step 12 has been updated/changed to use multiprocessing, as this step is computationally heavy! it has also been split into chinking in case of issues, so it commits every 1000 rows (change this as needed)
FYI: using the settings above, i have 24 Xeon cores running at around ~40% for this task, sometimes (when disk read/writing) peaking at ~75%. it does not seem to use much RAM, maybe 1GB....
It is still running (over 24 hours!), but early indications are it is linking the geoname_id to the GB postcodes correct

If geoNames want to add any input into this, it will be very welcomed
[Email]
jonny512379



Joined: 11/02/2017 20:41:32
Messages: 14
Offline

I would love to hear admin/techs opinions on step 12
Do you foresee any issues with the calculations?

It is to early for me to say 100%, but i have ran some inner join statements, and from the small amount i can manually check, it seems to be getting them correct, but.... in all fairness this is the 6th attempt at trying to build the data for my needs. it is the first time i am trying with geoNames data though.

I do have other data sets i have used before for the UK, that have quiet a few towns/villages/etc missing from geoNames, but if you need these, i would prefer to email them, as you will need to look into the licencing for the "other datasets"
[Email]
 
Forum Index -> General
Go to:   
Powered by JForum 2.1.5 © JForum Team