I got my hands on a dataset from Crunchbase which contains a rich set of data about startups. Initially I looked at visualizing the time-related trends of total funding and funding per industry. Those ended up not being very engaging (everything is trending up), so I moved my focus to the geographical distribution and how to best represent the growth in funding in different regions.
After some experimentation I arrived at the animation above. Each frame shows the funding activity over a 2-year window which ends at the month shown in the title. It uses a log scale, with red marking the highest values, as you can see in the legend on the right. The scale is fixed and was calculated based on the largest 2-year window. The values on the map are generated by aggregating the funding per city and passing it through a Gaussian filter in order to get smoother contours for the heatmap. (You can find a map based on the raw data, without smoothing, near the end of this post.)
As expected, we see that right now (2014) Silicon Valley is the hottest region. Some other hotspots are New York, London, Israel, Moscow and Beijing. Looking at the bigger picture, the USA is clearly leading, followed by Western Europe and then East Asia. There are also a lot of small spots on the map, but remember that the scale is logarithmic. That means the small spots receive 1/1000th (or an even smaller fraction) of the funding that the red areas are getting.
In the process I also found out that Crunchbase publishes a somewhat similar graph showing monthly activity. The difference is that it gives a static view of only the last month, but you can hover over the circles to see which companies they correspond to.
The rest of this post explains how the map was generated and includes all the Python code that was used. Feel free to experiment with it.
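As a compact preview of that pipeline, here is a minimal sketch of how the values for a single frame could be computed: sum the funding per grid cell over a 2-year window, smooth it with a Gaussian filter and move to a log scale. The DataFrame df and its row/col grid indices are hypothetical placeholders; the real implementation, with the actual Crunchbase columns and a Basemap grid, is built up step by step below.

import numpy as np
import scipy.ndimage

def frame_values(df, begin, end, shape=(300, 600), sigma=0.6):
    '''Sketch of one animation frame: aggregate funding over a (begin, end]
    window into a lat/lon grid, smooth it and switch to a log scale.'''
    recent = df[(df.funded_at > begin) & (df.funded_at <= end)]
    grid = np.zeros(shape)
    for _, r in recent.iterrows():
        # 'row' and 'col' stand in for precomputed grid cell indices per city
        grid[int(r['row']), int(r['col'])] += r['raised_amount_usd']
    smoothed = scipy.ndimage.filters.gaussian_filter(grid, sigma)
    return np.log10(smoothed + 1)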
The visualization was generated using Crunchbase's database of startup activity including funding rounds, acquisitions, investors and so on. It is updated monthly and is available at http://info.crunchbase.com/about/crunchbase-data-exports/. The dataset comes as an Excel file which contains several sheets:
import pandas as pd
import numpy as np
import scipy.ndimage
import datetime
crunchbase = pd.ExcelFile('crunchbase_monthly_export.xlsx')
crunchbase.sheet_names
For this analysis we are going to focus on the Rounds sheet which contains data about companies and the rounds of funding they have taken:
rounds = crunchbase.parse('Rounds')
rounds.head()
Let's take a look at a summary of the funding round sizes:
rounds.raised_amount_usd.describe(percentiles=[0.01,0.5,0.99])
Notice the large gap between the 99th percentile and the max value. This prompted me to look at the biggest rounds and check for potential outliers:
biggest_rounds = rounds.dropna(subset=['raised_amount_usd']).sort('raised_amount_usd')
biggest_rounds.tail(20).ix[:,['company_name','funding_round_type','raised_amount_usd']]
The first few look alright, but towards the end we see companies that I wouldn't really consider startups. There doesn't seem to be a clear cutoff point. I looked at those companies in Crunchbase and made the subjective decision to exclude the last 6 rows from the list.
blacklist_companies = biggest_rounds.tail(6).company_permalink.values
city_rounds = pd.DataFrame(rounds[~rounds.company_permalink.isin(blacklist_companies)])
The next step towards the geographical distribution is to calculate the city-level funding totals:
city_rounds = city_rounds.groupby(['company_country_code', 'company_city']).raised_amount_usd.sum()
city_rounds.head()
Pandas makes the grouping columns the index of the resulting table. For convenience I'll reset the index and bring them back as regular columns:
city_rounds = city_rounds.reset_index()
city_rounds.head()
Let's normalize the city names from Unicode to ASCII:
import unicodedata
city_rounds.company_city = city_rounds.company_city.\
map(lambda s: unicodedata.normalize('NFKD', unicode(s)).encode('ascii','ignore'))
city_rounds.head()
In order to get the geographical distribution we'll need the locations (lat, lon) of the cities. One place which has this data is http://www.geonames.org/export/. Surprisingly, it is freely available and comes with good documentation. The biggest dataset contains 150K cities and claims to include every city with a population bigger than 1000 people.
import csv
geo = pd.read_table('cities15000.txt', sep='\t', header=None, quoting=csv.QUOTE_NONE)
geo.columns = ['geonameid','city','asciiname','alternatenames','latitude','longitude','featureclass',
'featurecode','countrycode','cc2','admin1code','admin2code','admin3code','admin4code',
'population','elevation','dem','timezone','modificationdate',]
geo.head()
Despite having a column called 'asciiname', it turns out that it contains some non-ASCII symbols. Instead we are going to use the 'city' column.
geo.city = geo.city.\
map(lambda s: unicodedata.normalize('NFKD', unicode(s, 'utf-8')).encode('ascii','ignore'))
geo.head()
The next step is to join the city rounds table with the city geo locations data. We want to join based on the country and city columns. One small detail is that the country codes in the two tables are different: one uses 2-letter codes, the other 3-letter codes. That's why I'm bringing in a third table which maps between the two formats:
country_codes = pd.read_csv('wikipedia-iso-country-codes.csv')
country_codes.rename(columns={'English short name lower case': 'country',
'Alpha-2 code': 'countrycode2',
'Alpha-3 code': 'countrycode3'},
inplace=True)
country_codes.head()
geo = pd.merge(geo, country_codes, left_on='countrycode', right_on='countrycode2')
geo.head()
Next let's check what will happen if we do the join based on the city name:
roundset = set(city_rounds.company_city.unique())
geoset = set(geo.city.unique())
print len(roundset), len(roundset - geoset)
More than 2000 cities wouldn't match. Here are some examples:
list(roundset - geoset)[1000:1010]
Some of those are important cities which we wouldn't want to lose in the process. To fix this issue I'll use a Python library for fuzzy string matching with the cute name 'fuzzywuzzy':
from fuzzywuzzy import process
print process.extract('Washington, D. C.', geoset)
The result is the top 5 matches and their similarity scores. In this example the top match seems to be the correct one. The algorithm goes through the rows of city_rounds in order and tries to match against the cities from the geo table. To speed up the string matching, we are going to group the geo table by country and only look for matches which share the same country code as the query city. I've also set a score threshold of 75 in order to exclude low-confidence matches.
geo_grouped = geo.groupby('countrycode3')
def find_match(i, city_rounds=city_rounds):
    '''Find the best match for the city in row i of city_rounds,
    searching in cities from geo_grouped.
    Returns a tuple of ((country, city), best_match) if a high
    quality match is found, otherwise returns None'''
    country = city_rounds.ix[i,'company_country_code']
    if country in geo_grouped.indices:
        cities = geo_grouped.city.get_group(country)
        best_match = process.extractOne(city_rounds.ix[i,'company_city'], cities)
        if best_match and best_match[1] > 75:
            return ((country, city_rounds.ix[i,'company_city']), best_match[0])
To further improve the speed of the matching step we are going to run the find_match function in parallel. Be warned that it still takes a while to finish (~15 minutes on my laptop):
from joblib import Parallel, delayed
results = Parallel(n_jobs=-1)(delayed(find_match)(i) for i in range(len(city_rounds)))
We end up with a list of tuples and Nones, from which we filter out the None values. We then convert the list to a dictionary which will be used to map a (country, city) pair to a city from the geo table. We also fix 'New York', which happens to match equally well with both 'New York City' and 'York' (and arbitrarily picks the second one).
replace = dict(filter(bool, results))
override = {
('USA', 'New York'): 'New York City',
}
replace.update(override)
Now we can add an extra column to city_rounds and populate it with the best matches:
city_rounds['city_match'] = [np.nan]*len(city_rounds)
for i in range(len(city_rounds)):
    t = tuple(city_rounds.ix[i,['company_country_code','company_city']].values)
    city_rounds.ix[i,'city_match'] = replace.get(t, np.nan)
We are ready to join the city_rounds and geo tables:
merged = pd.merge(geo, city_rounds, left_on=['countrycode3', 'city'], right_on=['company_country_code', 'city_match'])
merged.head()
And the cities with the largest sums of funding rounds are:
merged[merged.raised_amount_usd.notnull()].\
ix[:,['company_city','city_match','countrycode3','latitude','longitude','population','raised_amount_usd']].\
sort(columns='raised_amount_usd').\
tail(n=10)
Another interesting statistic is the top cities with the largest median round sizes (among cities with at least 50 funding rounds):
grouped = rounds.groupby(['company_country_code', 'company_city']).raised_amount_usd
grouped.median()[grouped.count() > 50].reset_index().dropna(subset=['raised_amount_usd']).sort('raised_amount_usd').tail(10)
Now it's finally time to map our data.
import mpl_toolkits.basemap as bm
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib import cm
%matplotlib inline
To map the geographical distribution of the funding rounds, I'll split the map into a 600x300 grid. The value for each cell of the grid comes from summing the values for all cities which fall within that cell. That's what the next function calculates.
def geo_distribution(table):
    world = bm.Basemap(resolution='l', projection='merc', area_thresh=10000,
                       llcrnrlon=-160, llcrnrlat=-50, urcrnrlon=180, urcrnrlat=70, ellps='WGS84')
    N = 300
    lons, lats = world.makegrid(2*N, N)  # get lat/lons of 2N by N evenly spaced grid.
    x, y = world(lons, lats)  # compute map proj coordinates.
    data = np.zeros((N, 2*N))
    # sum the funding of all cities falling into each grid cell
    for r in table[table.raised_amount_usd.notnull()].iterrows():
        xx = np.searchsorted(lons[0], r[1].longitude)
        yy = np.searchsorted(lats[:,0], r[1].latitude)
        data[yy, xx] += r[1].raised_amount_usd
    return x, y, data
x, y, data = geo_distribution(merged)
Plotting a histogram of the result shows that the values are extremely skewed, with almost all of the mass concentrated near zero and a long tail of very large values. We are going to plot those values colored by their magnitude, and with the current distribution we would be seeing mostly one color, which is not very interesting.
plt.hist(data[data>0].flatten())
To counter that we'll simply work on a log scale, which gives us a better-looking distribution:
log_data = np.log10(data + 1)
plt.hist(log_data[log_data>0].flatten())
Let's go ahead and create the plot:
fig = plt.figure(figsize=(20, 12))
ax = fig.add_axes([0.0, 0.0, 0.95, 1.0])
plt.title('Total startup funding (1999-2014)', fontsize=24)
# Initialize the map and configure the style
world = bm.Basemap(resolution='l',projection='merc', area_thresh=10000,
llcrnrlon=-160, llcrnrlat=-50, urcrnrlon=180, urcrnrlat=70, ellps='WGS84', ax=ax)
world.drawcoastlines(linewidth=0.1)
world.drawcountries(linewidth=0.1)
world.drawlsmask(land_color='#F4F3F2', ocean_color='#BFE2FF')
x, y, data = geo_distribution(merged)
log_data = np.log10(data + 1)
# Calculate the range for the contour values
min_level = np.percentile(log_data[log_data>1].flatten(), 40)
max_level = log_data.max()
clevs = np.linspace(min_level, max_level)
cmap = cm.Spectral_r
cs = world.contourf(x, y, log_data, clevs, cmap=cmap)
# Plot the colorbar with values matching the contours
ax1 = fig.add_axes([0.97, 0.2, 0.03, 0.6])
norm = mpl.colors.LogNorm(vmin=10**min_level, vmax=10**max_level)
cb = mpl.colorbar.ColorbarBase(ax1, cmap=cmap, norm=norm, orientation='vertical')
cb.set_ticks([1e5, 1e6, 1e7, 1e8, 1e9, 1e10])
cb.set_ticklabels(['$100K', '$1M', '$10M', '$100M', '$1B', '$10B'])
It looks a bit sparse, right? The problem is that the density of cities with startup activity is not high enough. The approach I took was to apply a Gaussian filter to the values in order to smooth the points and distribute some of their weight to their neighbouring points. Here's the result:
fig = plt.figure(figsize=(20, 12))
ax = fig.add_axes([0.0, 0.0, 0.95, 1.0])
plt.title('Total startup funding (1999-2014)', fontsize=24)
# Initialize the map and configure the style
world = bm.Basemap(resolution='l',projection='merc', area_thresh=10000,
llcrnrlon=-160, llcrnrlat=-50, urcrnrlon=180, urcrnrlat=70, ellps='WGS84', ax=ax)
world.drawcoastlines(linewidth=0.1)
world.drawcountries(linewidth=0.1)
world.drawlsmask(land_color='#F4F3F2', ocean_color='#BFE2FF')
x, y, data = geo_distribution(merged)
log_data = np.log10(scipy.ndimage.filters.gaussian_filter(data, 0.6) + 1)
min_level = np.percentile(log_data[log_data>1].flatten(), 40)
max_level = log_data.max()
clevs = np.linspace(min_level, max_level)
cmap = cm.Spectral_r
cs = world.contourf(x, y, log_data, clevs, cmap=cmap)
ax1 = fig.add_axes([0.97, 0.2, 0.03, 0.6])
norm = mpl.colors.LogNorm(vmin=10**min_level, vmax=10**max_level)
cb = mpl.colorbar.ColorbarBase(ax1, cmap=cmap, norm=norm, orientation='vertical')
cb.set_ticks([1e5, 1e6, 1e7, 1e8, 1e9, 1e10])
cb.set_ticklabels(['$100K', '$1M', '$10M', '$100M', '$1B', '$10B'])
Finally we can generate the animation from the beginning of this post. This is done by outputting a frame for each 2-year window, using code very similar to the one used for the map above. First we need some logic to slice the rounds data into windows defined by (begin, end) dates:
rounds2 = rounds[rounds.funded_at.map(lambda d: type(d)==pd.datetime)]
def merged_in_range(begin, end):
    city_rounds2 = pd.DataFrame(rounds2[(rounds2.funded_at>=begin) & (rounds2.funded_at<=end) &
                                        (~rounds2.company_permalink.isin(blacklist_companies))].
                                groupby(['company_country_code', 'company_city']).raised_amount_usd.sum()).reset_index()
    city_rounds2.company_city = city_rounds2.company_city.\
        map(lambda s: unicodedata.normalize('NFKD', unicode(s)).encode('ascii','ignore'))
    merged2 = pd.merge(merged, city_rounds2, how='left', on=['company_country_code', 'company_city'], suffixes=('_total', ''))
    return merged2
# find the last window in order to compute the levels for the colorscale
date = np.datetime64(datetime.datetime(2014, 6, 30))
time_window = np.timedelta64(2*365, 'D')
begin = pd.to_datetime(date - time_window)
end = pd.to_datetime(date)
merged2 = merged_in_range(begin, end)
x, y, data = geo_distribution(merged2)
log_data = np.log10(scipy.ndimage.filters.gaussian_filter(data, 0.6) + 1)
min_level = np.percentile(log_data[log_data>1].flatten(), 40)
max_level = log_data.max()
clevs = np.linspace(min_level, max_level)
norm = mpl.colors.LogNorm(vmin=10**min_level, vmax=10**max_level)
def plot(date):
    begin = pd.to_datetime(date - time_window)
    end = pd.to_datetime(date)
    merged2 = merged_in_range(begin, end)
    fig = plt.figure(figsize=(20, 12))
    ax = fig.add_axes([0.05, 0.05, 0.87, 0.9])
    plt.title('Startup funding %d/%02d-%d/%02d' % (begin.year, begin.month, end.year, end.month),
              fontsize=20, family='monospace')
    # Initialize the map and configure the style
    world = bm.Basemap(resolution='l', projection='merc', area_thresh=10000,
                       llcrnrlon=-160, llcrnrlat=-50, urcrnrlon=180, urcrnrlat=70, ellps='WGS84', ax=ax)
    world.drawcoastlines(linewidth=0.1)
    world.drawcountries(linewidth=0.1)
    world.drawlsmask(land_color='#F4F3F2', ocean_color='#BFE2FF')
    x, y, data = geo_distribution(merged2)
    log_data = np.log10(scipy.ndimage.filters.gaussian_filter(data, 0.6) + 1)
    cs = world.contourf(x, y, log_data, levels=clevs, cmap=cmap)
    ax1 = fig.add_axes([0.94, 0.2, 0.03, 0.6])
    cb = mpl.colorbar.ColorbarBase(ax1, cmap=cmap, norm=norm, orientation='vertical')
    cb.set_ticks([1e5, 1e6, 1e7, 1e8, 1e9, 1e10])
    cb.set_ticklabels(['$100K', '$1M', '$10M', '$100M', '$1B', '$10B'])
    plt.savefig('%s.png' % str(date)[:7])
rng = pd.date_range('1/1/1999','6/30/2014',freq='1M')
_ = Parallel(n_jobs=-1)(delayed(plot)(date) for date in rng)
Once you have generated the separate frames, you can use ImageMagick, or any other image processing software, to stitch them together into an animated GIF.
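For example, a minimal way to do the stitching from Python is to shell out to ImageMagick's convert tool; the frame pattern and the funding.gif output name below are just placeholders, -delay is in hundredths of a second, and -loop 0 makes the GIF repeat forever.

import glob
import subprocess

# assumes ImageMagick's `convert` is on the PATH;
# the 'YYYY-MM.png' filenames saved by plot() sort chronologically by name
frames = sorted(glob.glob('*.png'))
subprocess.check_call(['convert', '-delay', '20', '-loop', '0'] + frames + ['funding.gif'])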