Data Science with OpenStreetMap and Python

Nikolai Janakiev @njanakiev

Working With OSM Data

OSM Key Amenity

taginfo

Loading Data from Taginfo

In [3]:
import requests
import pandas as pd

key = 'amenity'
url = "https://taginfo.openstreetmap.org/api/4/key/values"
response = requests.get(url, params={
                'key':key, 
                'page':0, 'rp':100,
                'sortname':'count', 'sortorder':'desc'
            })
data = response.json()['data']

df = pd.DataFrame(data).set_index('value')
df[['count', 'description']].head()
Out[3]:
count description
value
parking 2891252 A place for parking cars
place_of_worship 1023016 A place where religious services are conducted
school 969974 A primary or secondary school (pupils typicall...
bench 964951 A place for people to sit; allows room for sev...
restaurant 909128 A restaurant sells full sit-down meals with se...
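To get a feel for how skewed the distribution is, the counts shown above can be compared directly. This is a small sketch using only the five values from the table (hard-coded here for illustration), computing each value's share of the top-five total:

```python
import pandas as pd

# Counts for the top five amenity values, taken from the taginfo result above
top5 = pd.Series({
    'parking': 2891252,
    'place_of_worship': 1023016,
    'school': 969974,
    'bench': 964951,
    'restaurant': 909128,
})

# Share of each value within the top five, as percentages
share = (100 * top5 / top5.sum()).round(1)
print(share)
```

`parking` alone accounts for over 40% of the top-five counts, which is why the bar chart below is dominated by it.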

Distribution of Amenity Tags on Taginfo

In [4]:
df.iloc[:20][::-1]['count'].plot(kind='barh', color='C0');

Loading Data from OpenStreetMap

In [5]:
import requests

overpass_url = "http://overpass-api.de/api/interpreter"
overpass_query = """
    [out:json];
    area["ISO3166-1"="AT"][admin_level=2]->.search;
    node[amenity="restaurant"](area.search);
    out count;"""
response = requests.get(overpass_url, params={'data': overpass_query})
response.json()
Out[5]:
{'elements': [{'id': 0,
   'tags': {'areas': '0',
    'nodes': '11910',
    'relations': '0',
    'total': '11910',
    'ways': '0'},
   'type': 'count'}],
 'generator': 'Overpass API 0.7.55.4 3079d8ea',
 'osm3s': {'copyright': 'The data included in this document is from www.openstreetmap.org. The data is made available under ODbL.',
  'timestamp_areas_base': '2018-10-09T15:50:02Z',
  'timestamp_osm_base': '2018-10-09T16:31:02Z'},
 'version': 0.6}
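One detail worth noting: Overpass returns the counts as *strings* inside the `tags` dict, so they need casting before any arithmetic. A minimal sketch, with the response above hard-coded for illustration:

```python
# The Overpass "out count;" response shown above, hard-coded for illustration
result = {'elements': [{'id': 0,
                        'tags': {'areas': '0', 'nodes': '11910',
                                 'relations': '0', 'total': '11910',
                                 'ways': '0'},
                        'type': 'count'}],
          'version': 0.6}

# Counts come back as strings and need casting to int
counts = {k: int(v) for k, v in result['elements'][0]['tags'].items()}
print(counts['total'])  # 11910
```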

PostGIS

Spatial Types

  • geometry, geography

Spatial Indexes

  • r-tree, quad-tree, kd-tree

Spatial Functions

  • ST_Length(geometry), ST_X(geometry)
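`ST_Length` on a `geography` value returns meters along the Earth's surface rather than planar units. As a rough illustration of what that means, here is a pure-Python great-circle distance for a two-point line, assuming a spherical Earth (PostGIS uses a spheroid by default, so its values differ slightly):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lon1, lat1, lon2, lat2, r=6371000):
    """Great-circle distance in meters, assuming a spherical Earth."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    h = sin((lat2 - lat1) / 2)**2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2)**2
    return 2 * r * asin(sqrt(h))

# One degree of longitude along the equator is roughly 111 km
print(round(haversine_m(0, 0, 1, 0)))  # 111195
```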

Create a Table

CREATE TABLE osm_amenities (
    osm_id BIGINT PRIMARY KEY,
    country_code VARCHAR(2), 
    amenity VARCHAR,
    name VARCHAR,
    geom GEOGRAPHY);

Insert an Entry

INSERT INTO osm_amenities 
    VALUES (
        2415889355, 
        'AT', 
        'university', 
        'Department of Geoinformatics - Z_GIS', 
        ST_GeogFromText('Point (13.0610593 47.7885778)')
    );

Point (13.0610593 47.7885778) is in Well-known text (WKT) format
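In practice, WKT is usually parsed with a library such as Shapely (`shapely.wkt.loads`), but the point format is simple enough to show by hand. A minimal sketch with two hypothetical helper functions:

```python
def point_to_wkt(lon, lat):
    """Format a coordinate pair as a WKT point (x = longitude, y = latitude)."""
    return f'POINT ({lon} {lat})'

def wkt_to_point(wkt):
    """Parse a WKT point string back into (lon, lat) floats."""
    lon, lat = wkt.replace('POINT', '').strip(' ()').split()
    return float(lon), float(lat)

wkt = point_to_wkt(13.0610593, 47.7885778)
print(wkt)                # POINT (13.0610593 47.7885778)
print(wkt_to_point(wkt))  # (13.0610593, 47.7885778)
```

Note that WKT puts longitude (x) first, the opposite of the common "lat, lon" convention.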

GeoPandas

Loading a PostGIS Table as a GeoPandas GeoDataFrame

In [6]:
import psycopg2
import geopandas as gpd

with psycopg2.connect(database="osm_data_science_db", user="postgres", 
                      password='password', host='localhost') as connection:
    gdf = gpd.GeoDataFrame.from_postgis("""SELECT * FROM osm_amenities_areas""", connection, geom_col='geom')
gdf[['osm_id', 'amenity', 'state', 'geom']].head()
Out[6]:
osm_id amenity state geom
0 476 university Steiermark POINT (15.4498326391065 47.0777236761512)
1 11101 townhall Wien POINT (16.3573412415296 48.2108412144281)
2 11936 school Wien POINT (16.4103468335947 48.2842460993475)
3 14604 school Niederösterreich POINT (16.778381402358 48.0259891360606)
4 27945 school Wien POINT (16.2909158154218 48.1690868739775)

Visualize all Amenities in Austria

In [7]:
gdf.to_crs({'init': 'epsg:31287'}).plot(
    figsize=(20, 10), alpha=0.4, markersize=0.5, color='k');

What are the most common Amenities in Salzburg?

In [8]:
# Transform our data
df_counts = gdf.groupby(['state', 'amenity']).count().reset_index()
df_counts = df_counts.pivot(index='state', columns='amenity', values='osm_id')
df_counts = df_counts.fillna(0).astype('int')
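The groupby/pivot step above turns one row per amenity node into one row per state with a column per amenity type. The same transformation on a tiny made-up stand-in for `gdf`:

```python
import pandas as pd

# A toy stand-in for gdf: one row per amenity node, with state and amenity tags
toy = pd.DataFrame({
    'osm_id':  [1, 2, 3, 4, 5],
    'state':   ['Wien', 'Wien', 'Salzburg', 'Salzburg', 'Salzburg'],
    'amenity': ['school', 'cafe', 'school', 'school', 'cafe'],
})

# Count nodes per (state, amenity), then pivot amenities into columns
counts = toy.groupby(['state', 'amenity']).count().reset_index()
counts = counts.pivot(index='state', columns='amenity', values='osm_id')
counts = counts.fillna(0).astype('int')
print(counts)
```

The `fillna(0)` handles state/amenity combinations that never occur, which would otherwise be `NaN` after the pivot.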
In [9]:
df_counts.loc['Salzburg'].sort_values()[-20:].plot(
    kind='barh', color='C0', title='Amenities in Salzburg Land');

What is the most 'French' City?

  • Using simple data science methods and OpenStreetMap to model culture

Load and Prepare the Data Set

In [12]:
import psycopg2
import geopandas as gpd

with psycopg2.connect(database="osm_data_science_db", user="postgres", 
                      password='password', host='localhost') as connection:
    gdf = gpd.GeoDataFrame.from_postgis("""SELECT * FROM city_amenity_counts""", 
            connection, geom_col='geom')

# Unique country codes
gdf['country_code'].unique()
Out[12]:
array(['AT', 'CH', 'DE', 'FR'], dtype=object)
In [13]:
# Get only Germany and France
gdf_DE_FR = gdf[(gdf['country_code'] == 'DE') | \
                (gdf['country_code'] == 'FR')].copy()

Select the Feature Vectors

In [14]:
X = gdf_DE_FR.loc[:, 'parking':'veterinary'].values
y = gdf_DE_FR['country_code'].map({'DE': 0, 'FR': 1}).values
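The slice `.loc[:, 'parking':'veterinary']` selects a contiguous block of columns by label, with both endpoints included, so every amenity-count column between `parking` and `veterinary` ends up in the feature matrix. A small sketch with hypothetical column names:

```python
import pandas as pd

# Hypothetical columns standing in for the amenity-count table
df = pd.DataFrame([[3, 1, 7, 2, 9]],
                  columns=['city', 'parking', 'pub', 'veterinary', 'population'])

# Label slicing with .loc is inclusive on both ends
X = df.loc[:, 'parking':'veterinary'].values
print(X)  # [[1 7 2]]
```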

Train our Model

In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1000)

Logistic Regression

In [16]:
from sklearn.linear_model import LogisticRegressionCV

clf = LogisticRegressionCV(cv=5)
clf.fit(X_train, y_train)
Out[16]:
LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)
In [17]:
print('Training score : {:.4f}'.format(clf.score(X_train, y_train)))
print('Testing score  : {:.4f}'.format(clf.score(X_test, y_test)))
Training score : 0.8936
Testing score  : 0.8293

Most 'French' Austrian City?

In [18]:
# Get feature vectors
gdf_AT = gdf[gdf['country_code'] == 'AT']

X_AT = gdf_AT.loc[:, 'parking':'veterinary'].values

Calculate Probabilities of being classified as French

In [19]:
for city, prediction, probability in zip(gdf_AT['city'], clf.predict(X_AT), clf.predict_proba(X_AT)[:, 1]):
    prediction_label = 'German' if prediction == 0 else 'French'
    print('{:12s} {},  {:.3f}'.format(city, prediction_label, probability))
Wien         French,  0.561
Graz         German,  0.313
Linz         French,  0.970
Salzburg     German,  0.281
Innsbruck    German,  0.434
Klagenfurt   German,  0.464
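For a binary classifier like this, `predict` is effectively `predict_proba` thresholded at 0.5, which is why the labels and probabilities above line up. A sketch using the probabilities from the output:

```python
# Probabilities of the 'French' class for each city, from the output above
probs = {'Wien': 0.561, 'Graz': 0.313, 'Linz': 0.970,
         'Salzburg': 0.281, 'Innsbruck': 0.434, 'Klagenfurt': 0.464}

# predict() corresponds to thresholding predict_proba at 0.5
labels = {city: 'French' if p > 0.5 else 'German' for city, p in probs.items()}
print(labels)
```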

Visualize a Map of most 'French' Cities

Calculate Probability of being classified as French

In [20]:
gdf['frenchness'] = gdf.loc[:, 'parking':'veterinary'].apply(
    lambda row: clf.predict_proba([row.values])[:, 1][0], axis=1)

Normalize Probabilities between 0 and 1

In [21]:
f = gdf['frenchness']
gdf['frenchness'] = (f - f.min()) / (f.max() - f.min())
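Min-max scaling maps the smallest observed value to 0 and the largest to 1 (after this step, `frenchness` is a relative score rather than a probability). The same formula on a toy series:

```python
import pandas as pd

f = pd.Series([0.2, 0.5, 0.8])

# Min-max scaling: smallest value maps to 0, largest to 1
scaled = (f - f.min()) / (f.max() - f.min())
print(scaled.tolist())
```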

Visualize with Folium

In [22]:
import folium
import matplotlib

cmap = matplotlib.cm.get_cmap('cool', 5)

m = folium.Map(location=[49.5734053, 7.588576], zoom_start=5)
for i, row in gdf.iterrows():

    rgb = cmap(row['frenchness'])[:3]
    hex_color = matplotlib.colors.rgb2hex(rgb)
    
    folium.CircleMarker([row['geom'].y, row['geom'].x], 
                        radius=5,
                        popup=row['city'], 
                        fill_color=hex_color,
                        fill_opacity=1,
                        opacity=0.0).add_to(m)
m.save('maps/folium_map.html')
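The 'cool' colormap runs linearly from cyan to magenta, so the cmap/`rgb2hex` step above can be sketched without matplotlib. A minimal stand-in (hypothetical helper, continuous colors rather than the 5-step quantization that `get_cmap('cool', 5)` produces):

```python
def frenchness_to_hex(t):
    """Map t in [0, 1] to a hex color along a cyan-to-magenta ramp,
    similar to matplotlib's 'cool' colormap."""
    r, g, b = t, 1 - t, 1.0
    return '#{:02x}{:02x}{:02x}'.format(
        round(255 * r), round(255 * g), round(255 * b))

print(frenchness_to_hex(0.0))  # #00ffff (cyan, least 'French')
print(frenchness_to_hex(1.0))  # #ff00ff (magenta, most 'French')
```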
In [23]:
from IPython.display import IFrame

IFrame('maps/folium_map.html', width=1024, height=600)
Out[23]:

folium visualization