When dealing with large datasets for machine learning, have you ever come across an address column that looks like this?
Location data can be very messy and difficult to process.
It is difficult to encode addresses, since they are of very high cardinality. If you try to encode a column like this with a technique like one-hot encoding, it will lead to high dimensionality, and your machine learning model might not perform well.
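To make the dimensionality problem concrete, here is a minimal sketch of a naive one-hot encoding in plain Python. The addresses are invented for illustration and `one_hot` is a hypothetical helper, not part of any library; the point is only that almost every distinct address becomes its own column.

```python
# Toy addresses, invented for illustration; real address columns are far larger.
addresses = [
    "12 Hill Road, Floor 3",
    "48 Lake View Street",
    "7 Market Lane, Shop 2",
    "12 Hill Road, Floor 3",  # the only repeat
    "301 Station Avenue",
]


def one_hot(values):
    """Return (column_names, rows) for a naive one-hot encoding of values."""
    columns = sorted(set(values))
    rows = [[int(value == column) for column in columns] for value in values]
    return columns, rows


columns, rows = one_hot(addresses)
print(f"{len(addresses)} rows -> {len(columns)} one-hot columns")
# 5 rows -> 4 one-hot columns
```

With thousands of mostly unique addresses, the number of columns grows almost as fast as the number of rows.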
The easiest way to overcome this problem is to geocode these columns.
What is geocoding?
Geocoding is the process of converting addresses into geographical coordinates. This means that you'll be transforming raw addresses into latitude/longitude pairs.
Geocoding in Python
There are several services and libraries that can help you do this with Python. The fastest of the two covered here is the Google Maps API, which I recommend if you have more than 1,000 addresses you need to convert in a short period of time.
However, the Google Maps API isn't free. You will need to pay around $5 per 1,000 requests.
A free alternative to the Google Maps API is the OpenStreetMap API (more precisely, its Nominatim geocoding service). However, the OpenStreetMap API is a lot slower, and also slightly less accurate.
In this article, I will walk you through the geocoding process with these two APIs.
Method 1: Google Maps API
Let's first use the Google Maps API to convert addresses into lat/long pairs. To do this, you will need to create a Google Cloud account and enter your credit card information.
Although this is a paid service, Google gives you $200 in free credit when you first create a Google Cloud account. This means that you can make around 40,000 calls with their geocoding API before you get charged for it. As long as you don't hit this limit, your account will not be charged.
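The arithmetic behind that figure can be sketched as follows. The price and credit are the approximate numbers quoted above, not values read from Google's current pricing page, and both function names are made up for this example.

```python
PRICE_PER_1000_REQUESTS = 5.0  # USD, approximate figure quoted above
FREE_CREDIT = 200.0            # USD, sign-up credit quoted above


def free_requests(credit=FREE_CREDIT, price_per_1000=PRICE_PER_1000_REQUESTS):
    """Number of requests the credit covers at the given price."""
    return int(credit / price_per_1000 * 1000)


def estimated_charge(n_requests, credit=FREE_CREDIT,
                     price_per_1000=PRICE_PER_1000_REQUESTS):
    """Amount billed after the credit is used up (never negative)."""
    cost = n_requests / 1000 * price_per_1000
    return max(0.0, cost - credit)


print(free_requests())           # 40000
print(estimated_charge(9551))    # 0.0
print(estimated_charge(50_000))  # 50.0
```

Under these assumptions, a few thousand requests fit comfortably inside the credit.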
Once you've received your API key, you can start coding!
We are going to use the Zomato Restaurants Kaggle dataset for this tutorial. Make sure you have downloaded the dataset to your working directory. Then, install the googlemaps package with this command:
pip install -U googlemaps
Run the following lines of code to import the libraries you need to get started:
import csv
import pandas as pd
import googlemaps
Reading the dataset
Now, let's read the dataset and check the head of the dataframe:
data = pd.read_csv('zomato.csv', encoding="ISO-8859-1")
df = data.copy()
df.head()
This dataframe has 21 columns and 9551 rows.
We only need the address column for geocoding, so I'm going to drop all the other columns. Then, I am going to drop duplicates so we only get unique addresses:
df = df[['Address']]
df = df.drop_duplicates()
Taking a look at the head of the dataframe again, we can see only the address column:
Great! We can start geocoding now.
First, we need to create a Google Maps client with our API key. Run the following line of code to do this:
gmaps_key = googlemaps.Client(key="your_API_key")
Now, let's try geocoding one address first, and take a look at the output.
add_1 = df['Address'].iloc[0]  # take only the first address
g = gmaps_key.geocode(add_1)  # returns a list of matches, best first
lat = g[0]["geometry"]["location"]["lat"]
long = g[0]["geometry"]["location"]["lng"]
print('Latitude: '+str(lat)+', Longitude: '+str(long))
The output of the above code looks like this:
If you get the above output, great! Everything works.
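It helps to know the shape of what the client returns: `geocode()` gives back a list of candidate matches, each a dictionary with the coordinates nested under `geometry` and `location`, and an empty list when nothing is found. The sketch below uses a hand-written stand-in for that list (the values are made up and most fields are omitted), and `extract_lat_lng` is a hypothetical helper, so it runs without an API key.

```python
# Hand-written stand-in for the list returned by Client.geocode();
# the values are made up and most fields are omitted.
sample_response = [
    {
        "formatted_address": "Example Street 1, Example City",
        "geometry": {"location": {"lat": 14.5, "lng": 121.0}},
    }
]


def extract_lat_lng(results):
    """Return (lat, lng) from the first match, or (None, None) if there is none."""
    if not results:
        return (None, None)
    location = results[0]["geometry"]["location"]
    return (location["lat"], location["lng"])


print(extract_lat_lng(sample_response))  # (14.5, 121.0)
print(extract_lat_lng([]))               # (None, None)
```

Checking for an empty list matters once you move beyond a single address, because one unmatched address would otherwise raise an error and stop the whole run.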
We can now replicate this process for the entire dataframe:
# geocode the entire dataframe:
def geocode(add):
    g = gmaps_key.geocode(add)  # list of matches, best first
    if not g:  # no match found for this address
        return (None, None)
    lat = g[0]["geometry"]["location"]["lat"]
    lng = g[0]["geometry"]["location"]["lng"]
    return (lat, lng)

df['geocoded'] = df['Address'].apply(geocode)
Let's check the head of the dataframe again to see if this worked:
If your output looks like the screenshot above, congratulations! You have successfully geocoded addresses in an entire dataframe.
Method 2: OpenStreetMap API
The OpenStreetMap API is completely free, but it is slower and less accurate than the Google Maps API.
This API was unable to locate many of the addresses in the dataset, so we will be using the locality column this time instead.
Before we start with the tutorial, let's look at the difference between the address and locality columns. Run the following line of code to print the first row of each:
print('Address: '+data['Address'].iloc[0]+'\n\nLocality: '+data['Locality'].iloc[0])
Your output will look like this:
The address column is a lot more granular than the locality column, and it provides the exact location of the restaurant, including the floor number. This might be the reason the address isn't recognized by the OpenStreetMap API, but the locality is.
Let's geocode the first locality and take a look at the output.
Run the following lines of code:
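import requests

data = data[['Locality']]
# pass the query as a parameter and identify your app in the User-Agent
url = 'https://nominatim.openstreetmap.org/search'
params = {'q': data['Locality'].iloc[0], 'format': 'json'}
headers = {'User-Agent': 'geocoding-tutorial'}  # placeholder name
response = requests.get(url, params=params, headers=headers).json()
# the response is a list of matches; use the first one
print('Latitude: '+response[0]['lat']+', Longitude: '+response[0]['lon'])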
The output of the above code is very similar to the result generated by the Google Maps API:
Now, let's create a function to find the coordinates of the entire dataframe:
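Here is a minimal sketch of what such a function could look like, under a few assumptions. It uses the standard library's `urllib` instead of `requests` so that it has no extra dependencies, the `User-Agent` value is a placeholder you should replace with your own app name, and the helper names (`fetch_nominatim`, `geocode_locality`, `geocode_localities`) are made up for this example. It geocodes each distinct locality only once and pauses between requests, since the public Nominatim server asks for no more than one request per second.

```python
import json
import time
import urllib.parse
import urllib.request

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"
# Placeholder: Nominatim's usage policy asks for a User-Agent identifying your app.
HEADERS = {"User-Agent": "geocoding-tutorial-example"}


def fetch_nominatim(query):
    """Send one search request and return the parsed JSON list of matches."""
    params = urllib.parse.urlencode({"q": query, "format": "json"})
    request = urllib.request.Request(f"{NOMINATIM_URL}?{params}", headers=HEADERS)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def geocode_locality(locality, fetch=fetch_nominatim):
    """Return (lat, lon) as floats for the best match, or (None, None)."""
    results = fetch(locality)
    if not results:
        return (None, None)
    return (float(results[0]["lat"]), float(results[0]["lon"]))


def geocode_localities(localities, fetch=fetch_nominatim, delay=1.0):
    """Geocode each distinct locality once, pausing between requests."""
    coords = {}
    for locality in localities:
        if locality not in coords:
            coords[locality] = geocode_locality(locality, fetch)
            time.sleep(delay)  # stay within one request per second
    return coords


# Possible usage with the dataframe from this tutorial (makes network requests):
# lookup = geocode_localities(data['Locality'].unique())
# data['geocoded'] = data['Locality'].map(lookup)
```

The `fetch` parameter is there so you can swap in a fake function while testing, without sending real requests.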
Great! Now, let's take a look at the head of the dataframe:
Notice that this API was unable to come up with coordinates for many of the localities in the dataframe.
Although it's a great free alternative to the Google Maps API, you risk losing a lot of data if you geocode with OpenStreetMap.
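If you want to put a number on that loss, a small helper like the one below can report the share of rows without coordinates. This is only a sketch: it assumes failed lookups are stored as `(None, None)` pairs, the example values are made up, and `missing_share` is a hypothetical name.

```python
def missing_share(coords):
    """Fraction of (lat, lon) pairs where the geocoder found nothing."""
    coords = list(coords)
    if not coords:
        return 0.0
    missing = sum(1 for lat, lon in coords if lat is None or lon is None)
    return missing / len(coords)


# Made-up results for illustration
example = [(28.6, 77.2), (None, None), (14.5, 121.0), (None, None)]
print(f"{missing_share(example):.0%} of rows have no coordinates")
# 50% of rows have no coordinates
```

You could pass it a column of coordinate pairs from your dataframe to compare the two APIs on the same data.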
That's all for this tutorial! I hope you learned something new, and have a better understanding of how to deal with geospatial data.
Good luck with your data science journey, and thanks for reading!