I got my hands on a dataset from Crunchbase which contains a rich set of data about startups. Initially I looked at visualizing the time-related trends of total funding and funding per industry. Those ended up not being very engaging (everything is trending up), so I moved my focus to the geographical distribution and how to best represent the growth in funding in different regions.
After some experimentation I arrived at the animation above. Each frame shows the funding activity over a 2-year window which ends at the month shown in the title. It uses a log scale, with red marking the highest values, as you can see in the legend on the right. The scale is fixed and was calculated based on the largest 2-year window. The values on the map are generated by aggregating the funding per city and passing it through a Gaussian filter in order to get smoother contours for the heatmap. (You can find a map based on the raw data, without smoothing, near the end of this post.)
As expected, we see that right now (2014) Silicon Valley is the hottest region. Some other hotspots are New York, London, Israel, Moscow and Beijing. Looking at the bigger picture, the USA is clearly leading, followed by Western Europe and then East Asia. There are also a lot of small spots on the map, but remember that the scale is logarithmic. That means the small spots receive 1/1000th (or an even smaller fraction) of the funding that the red areas are getting.
In the process I also found out that Crunchbase publishes a somewhat similar graph showing monthly activity. The difference is that it gives a static view of only the last month, but you can hover over the circles to see which companies they correspond to.
The rest of this post explains how the map was generated and includes all the Python code that was used. Feel free to experiment with it.
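As a compact preview of that pipeline, here is a minimal sketch of how the values for a single frame could be computed: sum the funding per grid cell over a 2-year window, smooth it with a Gaussian filter and move to a log scale. The DataFrame df and its row/col grid indices are hypothetical placeholders; the real implementation, with the actual Crunchbase columns and a Basemap grid, is built up step by step below.

import numpy as np
import scipy.ndimage

def frame_values(df, begin, end, shape=(300, 600), sigma=0.6):
    '''Sketch of one animation frame: aggregate funding over a (begin, end]
    window into a lat/lon grid, smooth it and switch to a log scale.'''
    recent = df[(df.funded_at > begin) & (df.funded_at <= end)]
    grid = np.zeros(shape)
    for _, r in recent.iterrows():
        # 'row' and 'col' stand in for precomputed grid cell indices per city
        grid[int(r['row']), int(r['col'])] += r['raised_amount_usd']
    smoothed = scipy.ndimage.filters.gaussian_filter(grid, sigma)
    return np.log10(smoothed + 1)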
The visualization was generated using Crunchbase's database of startup activity including funding rounds, acquisitions, investors and so on. It is updated monthly and is available at http://info.crunchbase.com/about/crunchbase-data-exports/. The dataset comes as an Excel file which contains several sheets:
import pandas as pd
import numpy as np
import scipy.ndimage
import datetime
crunchbase = pd.ExcelFile('crunchbase_monthly_export.xlsx')
crunchbase.sheet_names
For this analysis we are going to focus on the Rounds sheet which contains data about companies and the rounds of funding they have taken:
rounds = crunchbase.parse('Rounds')
rounds.head()
Let's take a look at a summary of the funding round sizes:
rounds.raised_amount_usd.describe(percentiles=[0.01,0.5,0.99])
Notice the large gap between the 99th percentile and the max value. This prompted me to look at the biggest rounds and check for potential outliers:
biggest_rounds = rounds.dropna(subset=['raised_amount_usd']).sort('raised_amount_usd')
biggest_rounds.tail(20).ix[:,['company_name','funding_round_type','raised_amount_usd']]
The first few look alright, but towards the end we see companies that I wouldn't really consider startups. There doesn't seem to be a clear cutoff point. I looked at those companies in Crunchbase and made the subjective decision to exclude the last 6 rows from the list.
blacklist_companies = biggest_rounds.tail(6).company_permalink.values
city_rounds = pd.DataFrame(rounds[~rounds.company_permalink.isin(blacklist_companies)])
The next step towards the geographical distribution is to calculate the city-level funding totals:
city_rounds = city_rounds.groupby(['company_country_code', 'company_city']).raised_amount_usd.sum()
city_rounds.head()
Pandas makes the grouping columns the index of the resulting table. For convenience I'll reset the index and bring them back as regular columns:
city_rounds = city_rounds.reset_index()
city_rounds.head()
Let's normalize the city names from Unicode to ASCII:
import unicodedata
city_rounds.company_city = city_rounds.company_city.\
map(lambda s: unicodedata.normalize('NFKD', unicode(s)).encode('ascii','ignore'))
city_rounds.head()
In order to get the geographical distribution we'll need the locations (lat, lon) of the cities. One place which has this data is http://www.geonames.org/export/. Surprisingly, it is freely available and comes with good documentation. The biggest dataset contains 150K cities and claims to include every city with a population bigger than 1000 people.
import csv
geo = pd.read_table('cities15000.txt', sep='\t', header=None, quoting=csv.QUOTE_NONE)
geo.columns = ['geonameid','city','asciiname','alternatenames','latitude','longitude','featureclass',
'featurecode','countrycode','cc2','admin1code','admin2code','admin3code','admin4code',
'population','elevation','dem','timezone','modificationdate',]
geo.head()
Despite having a column called 'asciiname', it turns out that it contains some non-ASCII symbols. Instead we are going to use the 'city' column.
geo.city = geo.city.\
map(lambda s: unicodedata.normalize('NFKD', unicode(s, 'utf-8')).encode('ascii','ignore'))
geo.head()
The next step is to join the city rounds table with the city geo locations data. We want to join based on the country and city columns. One small detail is that the country codes in the two tables are different: one uses 2-letter codes, the other 3-letter codes. That's why I'm bringing in a third table which maps between the two formats:
country_codes = pd.read_csv('wikipedia-iso-country-codes.csv')
country_codes.rename(columns={'English short name lower case': 'country',
'Alpha-2 code': 'countrycode2',
'Alpha-3 code': 'countrycode3'},
inplace=True)
country_codes.head()
geo = pd.merge(geo, country_codes, left_on='countrycode', right_on='countrycode2')
geo.head()
Next let's check what will happen if we do the join based on the city name:
roundset = set(city_rounds.company_city.unique())
geoset = set(geo.city.unique())
print len(roundset), len(roundset - geoset)
More than 2000 cities wouldn't match. Here are some examples:
list(roundset - geoset)[1000:1010]
Some of those are important cities which we wouldn't want to lose in the process. To fix this issue I'll use a Python library for fuzzy string matching with the cute name 'fuzzywuzzy':
from fuzzywuzzy import process
print process.extract('Washington, D. C.', geoset)
The result is the top 5 matches and their similarity scores. In this example the top match seems to be the correct one. The algorithm goes through the rows of city_rounds in order and tries to match against the cities from the geo table. To speed up the string matching, we are going to group the geo table by country and only look for matches which share the same country code as the query city. I've also set a score threshold of 75 in order to exclude low-confidence matches.
geo_grouped = geo.groupby('countrycode3')
def find_match(i, city_rounds=city_rounds):
    '''Find the best match for the city in row i of city_rounds,
    searching in cities from geo_grouped.
    Returns a tuple of ((country, city), best_match) if a high
    quality match is found, otherwise returns None'''
    country = city_rounds.ix[i,'company_country_code']
    if country in geo_grouped.indices:
        cities = geo_grouped.city.get_group(country)
        best_match = process.extractOne(city_rounds.ix[i,'company_city'], cities)
        if best_match and best_match[1] > 75:
            return ((country, city_rounds.ix[i,'company_city']), best_match[0])
To further improve the speed of the matching step we are going to run the find_match function in parallel. Be warned that it still takes a while to finish (~15 minutes on my laptop):
from joblib import Parallel, delayed
results = Parallel(n_jobs=-1)(delayed(find_match)(i) for i in range(len(city_rounds)))
We end up with a list of tuples and Nones, from which we filter out the None values. We then convert the list to a dictionary which will be used to map a (country, city) pair to a city from the geo table. We also fix 'New York', which happens to match equally well with both 'New York City' and 'York' (and arbitrarily picks the second one).
replace = dict(filter(bool, results))
override = {
('USA', 'New York'): 'New York City',
}
replace.update(override)
Now we can add an extra column to city_rounds and populate it with the best matches:
city_rounds['city_match'] = [np.nan]*len(city_rounds)
for i in range(len(city_rounds)):
    t = tuple(city_rounds.ix[i,['company_country_code','company_city']].values)
    city_rounds.ix[i,'city_match'] = replace.get(t, np.nan)
We are ready to join the city_rounds and geo tables:
merged = pd.merge(geo, city_rounds, left_on=['countrycode3', 'city'], right_on=['company_country_code', 'city_match'])
merged.head()
And the cities with the largest sums of funding rounds are:
merged[merged.raised_amount_usd.notnull()].\
ix[:,['company_city','city_match','countrycode3','latitude','longitude','population','raised_amount_usd']].\
sort(columns='raised_amount_usd').\
tail(n=10)
Another interesting statistic is the top cities with the largest median round sizes (among cities with at least 50 funding rounds):
grouped = rounds.groupby(['company_country_code', 'company_city']).raised_amount_usd
grouped.median()[grouped.count() > 50].reset_index().dropna(subset=['raised_amount_usd']).sort('raised_amount_usd').tail(10)
Now it's finally time to map our data.
import mpl_toolkits.basemap as bm
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib import cm
%matplotlib inline
To map the geographical distribution of the funding rounds, I'll split the map into a 600x300 grid. The value for each cell of the grid comes from summing the values for all cities which fall within that cell. That's what the next function calculates.
def geo_distribution(table):
    world = bm.Basemap(resolution='l', projection='merc', area_thresh=10000,
                       llcrnrlon=-160, llcrnrlat=-50, urcrnrlon=180, urcrnrlat=70, ellps='WGS84')
    N = 300
    lons, lats = world.makegrid(2*N, N)  # get lat/lons of 2N by N evenly spaced grid.
    x, y = world(lons, lats)  # compute map proj coordinates.
    data = np.zeros((N, 2*N))
    # sum the funding of all cities falling into each grid cell
    for r in table[table.raised_amount_usd.notnull()].iterrows():
        xx = np.searchsorted(lons[0], r[1].longitude)
        yy = np.searchsorted(lats[:,0], r[1].latitude)
        data[yy, xx] += r[1].raised_amount_usd
    return x, y, data
x, y, data = geo_distribution(merged)
Plotting a histogram of the result shows that the values are extremely skewed, with almost all of the mass concentrated near zero and a long tail of very large values. We are going to plot those values colored by their magnitude, and with the current distribution we would be seeing mostly one color, which is not very interesting.
plt.hist(data[data>0].flatten())
To counter that we'll simply work on a log scale, which gives us a better-looking distribution:
log_data = np.log10(data + 1)
plt.hist(log_data[log_data>0].flatten())
Let's go ahead and create the plot:
fig = plt.figure(figsize=(20, 12))
ax = fig.add_axes([0.0, 0.0, 0.95, 1.0])
plt.title('Total startup funding (1999-2014)', fontsize=24)
# Initialize the map and configure the style
world = bm.Basemap(resolution='l',projection='merc', area_thresh=10000,
llcrnrlon=-160, llcrnrlat=-50, urcrnrlon=180, urcrnrlat=70, ellps='WGS84', ax=ax)
world.drawcoastlines(linewidth=0.1)
world.drawcountries(linewidth=0.1)
world.drawlsmask(land_color='#F4F3F2', ocean_color='#BFE2FF')
x, y, data = geo_distribution(merged)
log_data = np.log10(data + 1)
# Calculate the range for the contour values
min_level = np.percentile(log_data[log_data>1].flatten(), 40)
max_level = log_data.max()
clevs = np.linspace(min_level, max_level)
cmap = cm.Spectral_r
cs = world.contourf(x, y, log_data, clevs, cmap=cmap)
# Plot the colorbar with values matching the contours
ax1 = fig.add_axes([0.97, 0.2, 0.03, 0.6])
norm = mpl.colors.LogNorm(vmin=10**min_level, vmax=10**max_level)
cb = mpl.colorbar.ColorbarBase(ax1, cmap=cmap, norm=norm, orientation='vertical')
cb.set_ticks([1e5, 1e6, 1e7, 1e8, 1e9, 1e10])
cb.set_ticklabels(['$100K', '$1M', '$10M', '$100M', '$1B', '$10B'])
It looks a bit sparse, right? The problem is that the density of cities with startup activity is not high enough. The approach I took was to apply a Gaussian filter to the values in order to smooth the points and distribute some of their weight to their neighbouring points. Here's the result:
fig = plt.figure(figsize=(20, 12))
ax = fig.add_axes([0.0, 0.0, 0.95, 1.0])
plt.title('Total startup funding (1999-2014)', fontsize=24)
# Initialize the map and configure the style
world = bm.Basemap(resolution='l',projection='merc', area_thresh=10000,
llcrnrlon=-160, llcrnrlat=-50, urcrnrlon=180, urcrnrlat=70, ellps='WGS84', ax=ax)
world.drawcoastlines(linewidth=0.1)
world.drawcountries(linewidth=0.1)
world.drawlsmask(land_color='#F4F3F2', ocean_color='#BFE2FF')
x, y, data = geo_distribution(merged)
log_data = np.log10(scipy.ndimage.filters.gaussian_filter(data, 0.6) + 1)
min_level = np.percentile(log_data[log_data>1].flatten(), 40)
max_level = log_data.max()
clevs = np.linspace(min_level, max_level)
cmap = cm.Spectral_r
cs = world.contourf(x, y, log_data, clevs, cmap=cmap)
ax1 = fig.add_axes([0.97, 0.2, 0.03, 0.6])
norm = mpl.colors.LogNorm(vmin=10**min_level, vmax=10**max_level)
cb = mpl.colorbar.ColorbarBase(ax1, cmap=cmap, norm=norm, orientation='vertical')
cb.set_ticks([1e5, 1e6, 1e7, 1e8, 1e9, 1e10])
cb.set_ticklabels(['$100K', '$1M', '$10M', '$100M', '$1B', '$10B'])
Finally we can generate the animation from the beginning of this post. This is done by outputting a frame for each 2-year window, using code very similar to the one used for the map above. First we need some logic to slice the rounds data into windows defined by (begin, end) dates:
rounds2 = rounds[rounds.funded_at.map(lambda d: type(d)==pd.datetime)]
def merged_in_range(begin, end):
    city_rounds2 = pd.DataFrame(rounds2[(rounds2.funded_at>=begin) & (rounds2.funded_at<=end) &
                                        (~rounds2.company_permalink.isin(blacklist_companies))].
                                groupby(['company_country_code', 'company_city']).raised_amount_usd.sum()).reset_index()
    city_rounds2.company_city = city_rounds2.company_city.\
        map(lambda s: unicodedata.normalize('NFKD', unicode(s)).encode('ascii','ignore'))
    merged2 = pd.merge(merged, city_rounds2, how='left', on=['company_country_code', 'company_city'], suffixes=('_total', ''))
    return merged2
# find the last window in order to compute the levels for the colorscale
date = np.datetime64(datetime.datetime(2014, 6, 30))
time_window = np.timedelta64(2*365, 'D')
begin = pd.to_datetime(date - time_window)
end = pd.to_datetime(date)
merged2 = merged_in_range(begin, end)
x, y, data = geo_distribution(merged2)
log_data = np.log10(scipy.ndimage.filters.gaussian_filter(data, 0.6) + 1)
min_level = np.percentile(log_data[log_data>1].flatten(), 40)
max_level = log_data.max()
clevs = np.linspace(min_level, max_level)
norm = mpl.colors.LogNorm(vmin=10**min_level, vmax=10**max_level)
def plot(date):
    begin = pd.to_datetime(date - time_window)
    end = pd.to_datetime(date)
    merged2 = merged_in_range(begin, end)
    fig = plt.figure(figsize=(20, 12))
    ax = fig.add_axes([0.05, 0.05, 0.87, 0.9])
    plt.title('Startup funding %d/%02d-%d/%02d' % (begin.year, begin.month, end.year, end.month),
              fontsize=20, family='monospace')
    # Initialize the map and configure the style
    world = bm.Basemap(resolution='l', projection='merc', area_thresh=10000,
                       llcrnrlon=-160, llcrnrlat=-50, urcrnrlon=180, urcrnrlat=70, ellps='WGS84', ax=ax)
    world.drawcoastlines(linewidth=0.1)
    world.drawcountries(linewidth=0.1)
    world.drawlsmask(land_color='#F4F3F2', ocean_color='#BFE2FF')
    x, y, data = geo_distribution(merged2)
    log_data = np.log10(scipy.ndimage.filters.gaussian_filter(data, 0.6) + 1)
    cs = world.contourf(x, y, log_data, levels=clevs, cmap=cmap)
    ax1 = fig.add_axes([0.94, 0.2, 0.03, 0.6])
    cb = mpl.colorbar.ColorbarBase(ax1, cmap=cmap, norm=norm, orientation='vertical')
    cb.set_ticks([1e5, 1e6, 1e7, 1e8, 1e9, 1e10])
    cb.set_ticklabels(['$100K', '$1M', '$10M', '$100M', '$1B', '$10B'])
    plt.savefig('%s.png' % str(date)[:7])
rng = pd.date_range('1/1/1999','6/30/2014',freq='1M')
_ = Parallel(n_jobs=-1)(delayed(plot)(date) for date in rng)
Once you have generated the separate frames, you can use ImageMagick, or any other image processing software, to stitch them together into an animated GIF.
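For example, a minimal way to do the stitching from Python is to shell out to ImageMagick's convert tool; the frame pattern and the funding.gif output name below are just placeholders, -delay is in hundredths of a second, and -loop 0 makes the GIF repeat forever.

import glob
import subprocess

# assumes ImageMagick's `convert` is on the PATH;
# the 'YYYY-MM.png' filenames saved by plot() sort chronologically by name
frames = sorted(glob.glob('*.png'))
subprocess.check_call(['convert', '-delay', '20', '-loop', '0'] + frames + ['funding.gif'])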