Sunday, August 10, 2014

Easy Haskell Profiling

Would you rather look at:

or

The above are two presentations of the same information: profiling reports for a Haskell program. I want to show you a tool that I wrote, which generates the second type of report and makes it easy to find the hot spots in your programs. To quote the package description:

visual-prof profiles your Haskell program and generates a HTML file containing its source code with parts of the code highlighted in different colors depending on the fraction of the running time that they take

The example above uses an inefficient Project Euler solution which was posted on Stack Overflow. In order to run the profiling, the code needs a main function. You can find the exact code used in this post at https://github.com/djv/VisualProf/blob/master/eu13.hs

Profiling in Haskell

To get the standard Haskell profiling report, you have to compile with the -prof flag; -auto-all means that you want profiling information for all the top-level functions. For our example we would compile with:

ghc -O2 --make eu13.hs -prof -auto-all -fforce-recomp

and then run:

./eu13 +RTS -p

The report is written in eu13.prof and is what you saw in the beginning of this post.

The first problem with the default report is that you only see the top-level profiling information: in our case, 100% of the time is spent in numDivs. To drill down to the subexpression level, in order to find which part of the function definition is taking the most time, we need to add the so-called cost centres by hand. When you have big function definitions this becomes very tedious.
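For reference, a manual cost centre is just an SCC pragma wrapping the expression you want measured. This is a minimal illustrative sketch (the divisorCount name and expression are made up for this example, not taken from eu13.hs); with -prof, the annotated expression shows up under its label in the .prof report:

```haskell
-- Illustrative only: a manually annotated cost centre. GHC accepts the
-- pragma even without -prof (it is simply ignored then).
divisorCount :: Int
divisorCount = {-# SCC "divisorCount" #-} length [d | d <- [1..100], 100 `mod` d == 0]

main :: IO ()
main = print divisorCount
```

Doing this for every subexpression of a large definition is exactly the kind of tedium described above.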

Another problem is that even with the manually added cost centers, you need to go back and forth between your source code and the profiling report.

Profiling with VisualProf

While working on a school project, I found myself wasting a lot of time manually adding SCCs and deciphering the profiling reports. That’s why I wrote VisualProf, which produces an HTML profiling report with a single command.

To generate the visual report from the example above, we’ll start by cloning the GitHub repo:

git clone https://github.com/djv/VisualProf.git

In order to profile any Haskell code you need to have all the imported libraries installed with profiling enabled (cabal install -p libraryname). If you don’t have that turned on you’ll have to reinstall a lot of existing packages, which I don’t recommend. The better option is to work in a cabal sandbox:

cd VisualProf
cabal sandbox init

With or without a sandbox, the next step is to install the package:

cabal install visual-prof

To get the profiling report run:

./.cabal-sandbox/bin/visual-prof -px eu13.hs eu13

This will generate the HTML in eu13.hs.html and also the text report under eu13.prof.

Another example

If you are curious, I have included the pidigits benchmark as a second example in the GitHub repo. It contains a single function with a pretty long definition; the visual profiling report is:

How does it work?

VisualProf goes through the following steps:

  1. Parse the source code and wrap every subexpression in an SCC syntax tree element. The SCCs have consecutive numbers as identifiers.
  2. Compile using ghc with profiling turned on and run the executable.
  3. Parse the text profiling report and build a map from the SCC id to percentage of runtime used.
  4. Go back to the source code and pretty-print it, replacing every SCC with an HTML tag specifying the background color. The colors are calculated based on the values in the map from step 3.
  5. Open the HTML report in your favourite browser.

Thursday, July 31, 2014

Global Distribution of Startup Funding

crunchbase

Full size animation

I got my hands on a dataset from Crunchbase which contains a rich set of data about startups. Initially I looked at visualizing the time-related trends of total funding and funding per industry. Those ended up not being very engaging (everything is trending up). That's why I moved my focus to the geographical distribution and how to best represent the growth in funding in different regions.

After some experimentation I arrived at the animation above. Each frame shows the funding activity over a 2-year window which ends at the month in the title. It uses a log scale with red showing the highest values, as you can see in the legend on the right. The scale is fixed and was calculated based on the largest 2-year window. The values on the map are generated by aggregating the funding per city and passing it through a Gaussian filter in order to get smoother contours for the heatmap. (You can find a map based on the raw data, without smoothing, near the end of this post.)
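The per-frame aggregation could be sketched roughly like this (a minimal illustration with made-up data, not the actual animation code; the window_totals helper is hypothetical):

```python
import pandas as pd

# Toy stand-in for the Crunchbase rounds table used later in the post.
rounds = pd.DataFrame({
    'company_city': ['San Francisco', 'London', 'San Francisco', 'Beijing'],
    'funded_at': pd.to_datetime(['2012-03-01', '2012-06-15',
                                 '2013-01-10', '2013-11-05']),
    'raised_amount_usd': [5e6, 2e6, 1e7, 3e6],
})

def window_totals(rounds, end_month, years=2):
    """Sum funding per city over the `years`-year window ending at end_month."""
    end = pd.Timestamp(end_month) + pd.offsets.MonthEnd(0)
    start = end - pd.DateOffset(years=years)
    mask = (rounds.funded_at > start) & (rounds.funded_at <= end)
    return rounds[mask].groupby('company_city').raised_amount_usd.sum()

# One frame of the animation would aggregate a window like this one.
print(window_totals(rounds, '2013-12'))
```

Each frame of the animation corresponds to one such window, slid forward a month at a time.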

As expected, we see that right now (2014) Silicon Valley is the hottest region. Some other hotspots are New York, London, Israel, Moscow, and Beijing. Looking at the bigger picture, the USA is clearly leading, followed by Western Europe and then East Asia. There are also a lot of small spots on the map, but remember that the scale is logarithmic: the small spots receive 1/1000 (or an even smaller fraction) of the funding that the red areas are getting.

In the process I also found out that Crunchbase publishes a somewhat similar graph showing monthly activity. The difference is that they give a static view of only the last month, but you can hover over the circles to see which companies correspond to them.

The rest of this post explains how the map was generated and includes all the Python code that was used. Feel free to experiment with it.

The visualization was generated using Crunchbase's database of startup activity including funding rounds, acquisitions, investors and so on. It is updated monthly and is available at http://info.crunchbase.com/about/crunchbase-data-exports/. The dataset comes as an Excel file which contains several sheets:

In [1]:
import pandas as pd
import numpy as np
import scipy.ndimage
import datetime

crunchbase = pd.ExcelFile('crunchbase_monthly_export.xlsx')
crunchbase.sheet_names
Out[1]:
[u'License',
 u'Analysis',
 u'Companies',
 u'Rounds',
 u'Investments',
 u'Acquisitions',
 u'Additions']

For this analysis we are going to focus on the Rounds sheet which contains data about companies and the rounds of funding they have taken:

In [2]:
rounds = crunchbase.parse('Rounds')
rounds.head()
Out[2]:
company_permalink company_name company_category_list company_market company_country_code company_state_code company_region company_city funding_round_permalink funding_round_type funding_round_code funded_at funded_month funded_quarter funded_year raised_amount_usd
0 /organization/canal-do-credito Canal do Credito |Credit|Technology|Services|Finance| Credit BRA NaN Rio de Janeiro Belo Horizonte /funding-round/f7d9bcd086a4e90643e0df8f3da32a25 venture a 2010-01-01 00:00:00 2010-01 2010-Q1 2010 750000
1 /organization/waywire #waywire |Entertainment|Politics|Social Media|News| Entertainment USA NY New York City New York /funding-round/cc409188fa2b63482bd9008f682c2efa seed NaN 2012-06-30 00:00:00 2012-06 2012-Q2 2012 1750000
2 /organization/tv-communications &TV Communications |Games| Games USA CA Los Angeles Los Angeles /funding-round/86d22afc65107b6941e6c43c671ecbb8 unknown NaN 2010-06-04 00:00:00 2010-06 2010-Q2 2010 1000000
3 /organization/tv-communications &TV Communications |Games| Games USA CA Los Angeles Los Angeles /funding-round/59a3669a64e39360c2b939300bcda162 unknown NaN 2010-09-23 00:00:00 2010-09 2010-Q3 2010 3000000
4 /organization/rock-your-paper 'Rock' Your Paper |Publishing|Education| Education EST NaN Tallinn Tallinn /funding-round/f06b420775f7cb6c1541a9db526534bb seed NaN 2012-08-09 00:00:00 2012-08 2012-Q3 2012 40000

Let's take a look at the summary of the funding rounds sizes:

In [3]:
rounds.raised_amount_usd.describe(percentiles=[0.01,0.5,0.99])
Out[3]:
count    6.264300e+04
mean     1.042072e+07
std      3.300527e+08
min      0.000000e+00
1%       1.150000e+04
50%      1.907284e+06
99%      1.000000e+08
max      7.879506e+10
dtype: float64

Notice the large gap between the 99th percentile and the max value. This prompted me to inspect the biggest rounds for potential outliers:

In [4]:
biggest_rounds = rounds.dropna(subset=['raised_amount_usd']).sort('raised_amount_usd')
biggest_rounds.tail(20).ix[:,['company_name','funding_round_type','raised_amount_usd']]
Out[4]:
company_name funding_round_type raised_amount_usd
2837 Alibaba unknown 1000000000
66359 Verizon Communications debt_financing 1000000000
3928 AOL post_ipo_equity 1000000000
68109 Wave Broadband private_equity 1050000000
69772 Xerox post_ipo_equity 1100000000
64884 Uber venture 1200000000
61674 Terra-Gen Power debt_financing 1200000000
12908 Clearwire post_ipo_equity 1500000000
21573 Facebook private_equity 1500000000
11885 Charter Communications post_ipo_equity 1662513431
49719 Quad/Graphics post_ipo_debt 1900000000
10872 Carestream debt_financing 2400000000
55495 sigmacare private_equity 2600000000
12909 Clearwire post_ipo_equity 3200000000
13612 COFCO debt_financing 3200000000
66360 Verizon Communications debt_financing 3822518000
66356 Verizon Communications debt_financing 3835050000
53457 Sberbank post_ipo_debt 5800000000
66357 Verizon Communications debt_financing 21271935000
10714 Cardinal Health debt_financing 78795064652

The first few look alright but towards the end we see companies that I wouldn't really consider startups. There doesn't seem to be a clear cutoff point. I looked at those companies in Crunchbase and made the subjective decision to exclude the companies in the last 6 rows from the list.

In [5]:
blacklist_companies = biggest_rounds.tail(6).company_permalink.values
city_rounds = pd.DataFrame(rounds[~rounds.company_permalink.isin(blacklist_companies)])

The next step in order to get the geographical distribution is to calculate the city level funding sizes:

In [6]:
city_rounds = city_rounds.groupby(['company_country_code', 'company_city']).raised_amount_usd.sum()
city_rounds.head()
Out[6]:
company_country_code  company_city
ALB                   Huntsville       1900000
                      Prishtine         150000
ANT                   Amsterdam        6257320
                      Anthée         10871040
                      Eindhoven        5019299
Name: raised_amount_usd, dtype: float64

Pandas makes the grouping columns the index of the resulting table. For convenience I'll reset the index and bring them back as regular columns:

In [7]:
city_rounds = city_rounds.reset_index()
city_rounds.head()
Out[7]:
company_country_code company_city raised_amount_usd
0 ALB Huntsville 1900000
1 ALB Prishtine 150000
2 ANT Amsterdam 6257320
3 ANT Anthée 10871040
4 ANT Eindhoven 5019299

Let's normalize the city names from Unicode to ASCII.

In [8]:
import unicodedata

city_rounds.company_city = city_rounds.company_city.\
    map(lambda s: unicodedata.normalize('NFKD', unicode(s)).encode('ascii','ignore'))
city_rounds.head()
Out[8]:
company_country_code company_city raised_amount_usd
0 ALB Huntsville 1900000
1 ALB Prishtine 150000
2 ANT Amsterdam 6257320
3 ANT AnthAe 10871040
4 ANT Eindhoven 5019299

In order to get the geographical distribution we'll need the locations (lat, lon) of the cities. One place which has this data is http://www.geonames.org/export/. Surprisingly it is freely available and comes with good documentation. The biggest dataset contains 150K cities and claims to include every city with a population bigger than 1000 people.

In [9]:
import csv

geo = pd.read_table('cities15000.txt', sep='\t', header=None, quoting=csv.QUOTE_NONE)
geo.columns = ['geonameid','city','asciiname','alternatenames','latitude','longitude','featureclass',
               'featurecode','countrycode','cc2','admin1code','admin2code','admin3code','admin4code',
               'population','elevation','dem','timezone','modificationdate',]
geo.head()
Out[9]:
geonameid city asciiname alternatenames latitude longitude featureclass featurecode countrycode cc2 admin1code admin2code admin3code admin4code population elevation dem timezone modificationdate
0 3040051 les Escaldes les Escaldes Ehskal'des-Ehndzhordani,Escaldes,Escaldes-Engo... 42.50729 1.53414 P PPLA AD NaN 08 NaN NaN NaN 15853 NaN 1033 Europe/Andorra 2008-10-15
1 3041563 Andorra la Vella Andorra la Vella ALV,Ando-la-Vyey,Andora,Andora la Vela,Andora ... 42.50779 1.52109 P PPLC AD NaN 07 NaN NaN NaN 20430 NaN 1037 Europe/Andorra 2010-05-30
2 290594 Umm al Qaywayn Umm al Qaywayn Um al Quweim,Umm al Qaiwain,Umm al Qawain,Umm ... 25.56473 55.55517 P PPLA AE NaN 07 NaN NaN NaN 44411 NaN 2 Asia/Dubai 2012-11-03
3 291074 Ras al-Khaimah Ras al-Khaimah Julfa,Khaimah,RKT,Ra's al Khaymah,Ras al Khaim... 25.78953 55.94320 P PPLA AE NaN 05 NaN NaN NaN 115949 NaN 2 Asia/Dubai 2013-11-26
4 291696 Khawr Fakkān Khawr Fakkan Fakkan,Fakkān,Khawr Fakkan,Khawr Fakkān,Khawr ... 25.33132 56.34199 P PPL AE NaN 06 NaN NaN NaN 33575 NaN 20 Asia/Dubai 2013-10-25

Despite its name, the 'asciiname' column turns out to contain some non-ASCII symbols. Instead we are going to use the 'city' column.

In [10]:
geo.city = geo.city.\
    map(lambda s: unicodedata.normalize('NFKD', unicode(s, 'utf-8')).encode('ascii','ignore'))
geo.head()
Out[10]:
geonameid city asciiname alternatenames latitude longitude featureclass featurecode countrycode cc2 admin1code admin2code admin3code admin4code population elevation dem timezone modificationdate
0 3040051 les Escaldes les Escaldes Ehskal'des-Ehndzhordani,Escaldes,Escaldes-Engo... 42.50729 1.53414 P PPLA AD NaN 08 NaN NaN NaN 15853 NaN 1033 Europe/Andorra 2008-10-15
1 3041563 Andorra la Vella Andorra la Vella ALV,Ando-la-Vyey,Andora,Andora la Vela,Andora ... 42.50779 1.52109 P PPLC AD NaN 07 NaN NaN NaN 20430 NaN 1037 Europe/Andorra 2010-05-30
2 290594 Umm al Qaywayn Umm al Qaywayn Um al Quweim,Umm al Qaiwain,Umm al Qawain,Umm ... 25.56473 55.55517 P PPLA AE NaN 07 NaN NaN NaN 44411 NaN 2 Asia/Dubai 2012-11-03
3 291074 Ras al-Khaimah Ras al-Khaimah Julfa,Khaimah,RKT,Ra's al Khaymah,Ras al Khaim... 25.78953 55.94320 P PPLA AE NaN 05 NaN NaN NaN 115949 NaN 2 Asia/Dubai 2013-11-26
4 291696 Khawr Fakkan Khawr Fakkan Fakkan,Fakkān,Khawr Fakkan,Khawr Fakkān,Khawr ... 25.33132 56.34199 P PPL AE NaN 06 NaN NaN NaN 33575 NaN 20 Asia/Dubai 2013-10-25

The next step is to join the city rounds table with the city geo locations data. We want to join based on the country and city columns. One small detail is that the country codes in the two tables are different: one uses 2-letter codes, the other 3-letter codes. That's why I'm bringing in a third table which maps between those two formats:

In [11]:
country_codes = pd.read_csv('wikipedia-iso-country-codes.csv')
country_codes.rename(columns={'English short name lower case': 'country',
                              'Alpha-2 code': 'countrycode2',
                              'Alpha-3 code': 'countrycode3'},
                     inplace=True)
country_codes.head()
Out[11]:
country countrycode2 countrycode3 Numeric code ISO 3166-2
0 Afghanistan AF AFG 4 ISO 3166-2:AF
1 Åland Islands AX ALA 248 ISO 3166-2:AX
2 Albania AL ALB 8 ISO 3166-2:AL
3 Algeria DZ DZA 12 ISO 3166-2:DZ
4 American Samoa AS ASM 16 ISO 3166-2:AS
In [12]:
geo = pd.merge(geo, country_codes, left_on='countrycode', right_on='countrycode2')
geo.head()
Out[12]:
geonameid city asciiname alternatenames latitude longitude featureclass featurecode countrycode cc2 ... population elevation dem timezone modificationdate country countrycode2 countrycode3 Numeric code ISO 3166-2
0 3040051 les Escaldes les Escaldes Ehskal'des-Ehndzhordani,Escaldes,Escaldes-Engo... 42.50729 1.53414 P PPLA AD NaN ... 15853 NaN 1033 Europe/Andorra 2008-10-15 Andorra AD AND 20 ISO 3166-2:AD
1 3041563 Andorra la Vella Andorra la Vella ALV,Ando-la-Vyey,Andora,Andora la Vela,Andora ... 42.50779 1.52109 P PPLC AD NaN ... 20430 NaN 1037 Europe/Andorra 2010-05-30 Andorra AD AND 20 ISO 3166-2:AD
2 290594 Umm al Qaywayn Umm al Qaywayn Um al Quweim,Umm al Qaiwain,Umm al Qawain,Umm ... 25.56473 55.55517 P PPLA AE NaN ... 44411 NaN 2 Asia/Dubai 2012-11-03 United Arab Emirates AE ARE 784 ISO 3166-2:AE
3 291074 Ras al-Khaimah Ras al-Khaimah Julfa,Khaimah,RKT,Ra's al Khaymah,Ras al Khaim... 25.78953 55.94320 P PPLA AE NaN ... 115949 NaN 2 Asia/Dubai 2013-11-26 United Arab Emirates AE ARE 784 ISO 3166-2:AE
4 291696 Khawr Fakkan Khawr Fakkan Fakkan,Fakkān,Khawr Fakkan,Khawr Fakkān,Khawr ... 25.33132 56.34199 P PPL AE NaN ... 33575 NaN 20 Asia/Dubai 2013-10-25 United Arab Emirates AE ARE 784 ISO 3166-2:AE

5 rows × 24 columns

Next, let's check what would happen if we did the join based on the city name:

In [13]:
roundset = set(city_rounds.company_city.unique())
geoset = set(geo.city.unique())
print len(roundset), len(roundset - geoset)
4319 2351

More than 2000 cities wouldn't match. Here are some examples:

In [14]:
list(roundset - geoset)[1000:1010]
Out[14]:
['Malvern Wells',
 'Saint-martin-des-champs',
 'Brentford',
 'Lkan',
 'Saint Louis',
 'Santa Ynez',
 'Ramonville-saint-agne',
 'GuimarAes',
 'Washington, D. C.',
 'Muenchen']

Some of those are important cities that we wouldn't want to lose in the process. To fix this issue I'll use a Python library for fuzzy string matching with the cute name 'fuzzywuzzy':

In [15]:
from fuzzywuzzy import process

print process.extract('Washington, D. C.', geoset)
[('Washington, D.C.', 97), ('Wa', 90), ('Washington', 90), ('Fort Washington', 78), ('Port Washington', 78)]

The result is the top 5 matches with their similarity scores. In this example the top match is the correct one. The algorithm goes through the rows of city_rounds in order and tries to match each city against the cities from the geo table. To speed up the string matching, we group the geo table by country and only consider candidates that share the query city's country code. I've also set a score threshold of 75 in order to exclude low-confidence matches.

In [16]:
geo_grouped = geo.groupby('countrycode3')

def find_match(i, city_rounds=city_rounds):
    '''Find the best match for the city in row i of city_rounds
    searching in cities from geo_grouped
    
    Returns a tuple of ((country, city), best_match) if a high
    quality match is found
    otherwise returns None'''
    country = city_rounds.ix[i,'company_country_code']
    if country in geo_grouped.indices:
        cities = geo_grouped.city.get_group(country)
        best_match = process.extractOne(city_rounds.ix[i,'company_city'], cities)
        if best_match and best_match[1] > 75:
            return ((country, city_rounds.ix[i,'company_city']), best_match[0])

To further improve the speed of the matching step we are going to run the find_match function in parallel. Be warned that it still takes a while to finish (~15 minutes on my laptop):

In [17]:
from joblib import Parallel, delayed

results = Parallel(n_jobs=-1)(delayed(find_match)(i) for i in range(len(city_rounds)))

We end up with a list of tuples and Nones, from which we filter out the None values and build a dictionary mapping each (country, city) pair to a city from the geo table. We also fix 'New York', which happens to match equally well against both 'New York City' and 'York' (and arbitrarily picks the second one).

In [18]:
replace = dict(filter(bool, results))

override = {
('USA', 'New York'): 'New York City',
}
replace.update(override)

Now we can add an extra column to city_rounds and populate it with the best matches:

In [19]:
city_rounds['city_match'] = [np.nan]*len(city_rounds)

for i in range(len(city_rounds)):
    t = tuple(city_rounds.ix[i,['company_country_code','company_city']].values)
    city_rounds.ix[i,'city_match'] = replace.get(t, np.nan)

We are ready to join the city_rounds and geo tables:

In [20]:
merged = pd.merge(geo, city_rounds, left_on=['countrycode3', 'city'], right_on=['company_country_code', 'city_match'])
merged.head()
Out[20]:
geonameid city asciiname alternatenames latitude longitude featureclass featurecode countrycode cc2 ... modificationdate country countrycode2 countrycode3 Numeric code ISO 3166-2 company_country_code company_city raised_amount_usd city_match
0 292223 Dubai Dubai DXB,Dabei,Dibai,Dibay,Doubayi,Dubae,Dubai,Duba... 25.25817 55.30472 P PPLA AE NaN ... 2012-08-19 United Arab Emirates AE ARE 784 ISO 3166-2:AE ARE Dubai 543960250 Dubai
1 292968 Abu Dhabi Abu Dhabi AEbu Saby,AUH,Aboe Dhabi,Abou Dabi,Abu Dabi,Ab... 24.46667 54.36667 P PPLC AE NaN ... 2012-01-18 United Arab Emirates AE ARE 784 ISO 3166-2:AE ARE Abu Dhabi 31250000 Abu Dhabi
2 3428992 San Isidro San Isidro San Isidro -34.47145 -58.50776 P PPLA2 AR NaN ... 2014-03-07 Argentina AR ARG 32 ISO 3166-2:AR ARG San Isidro 600000 San Isidro
3 3430863 Mar del Plata Mar del Plata Gorad Mar-dehl'-Plata,MDQ,Mar de Plata,Mar del... -38.00228 -57.55754 P PPLA2 AR NaN ... 2014-03-07 Argentina AR ARG 32 ISO 3166-2:AR ARG Mar Del Plata 25000 Mar del Plata
4 3435910 Buenos Aires Buenos Aires BUE,Baires,Bonaero,Bonaeropolis,Bonaëropolis,B... -34.61315 -58.37723 P PPLC AR NaN ... 2012-01-12 Argentina AR ARG 32 ISO 3166-2:AR ARG Buenos Aires 159381442 Buenos Aires

5 rows × 28 columns

And the cities with the largest sums of funding rounds are:

In [21]:
merged[merged.raised_amount_usd.notnull()].\
    ix[:,['company_city','city_match','countrycode3','latitude','longitude','population','raised_amount_usd']].\
    sort(columns='raised_amount_usd').\
    tail(n=10)
Out[21]:
company_city city_match countrycode3 latitude longitude population raised_amount_usd
3716 Sunnyvale Sunnyvale USA 37.36883 -122.03635 140081 7760180546
884 London London GBR 51.50853 -0.12574 7556900 8146995105
3679 San Jose San Jose USA 37.33939 -121.89496 945942 8487200272
2971 Cambridge Cambridge USA 42.37510 -71.10561 105162 8992166712
3625 Mountain View Mountain View USA 37.38605 -122.08385 74066 9196358515
3672 San Diego San Diego USA 32.71533 -117.15726 1307402 9824868373
323 Beijing Beijing CHN 39.90750 116.39723 7480601 9964255702
3643 Palo Alto Palo Alto USA 37.44188 -122.14302 64403 10769991473
3305 New York New York City USA 40.71427 -74.00597 8175133 25886061459
3674 San Francisco San Francisco USA 37.77493 -122.41942 805235 33254924381

Another interesting statistic is the top cities with the largest median round sizes (among cities with at least 50 funding rounds):

In [22]:
grouped = rounds.groupby(['company_country_code', 'company_city']).raised_amount_usd
grouped.median()[grouped.count() > 50].reset_index().dropna(subset=['raised_amount_usd']).sort('raised_amount_usd').tail(10)
Out[22]:
company_country_code company_city raised_amount_usd
58 USA Bothell 7250000.0
142 USA San Jose 7383900.0
133 USA Redwood City 7499999.0
84 USA Fremont 7500000.0
13 CHN Beijing 8000000.0
161 USA Westborough 8000000.0
159 USA Watertown 8187500.0
86 USA Hayward 8813094.5
149 USA South San Francisco 9200000.0
107 USA Milpitas 12000000.0

Now it's finally time to map our data.

In [23]:
import mpl_toolkits.basemap as bm
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib import cm
%matplotlib inline

To map the geographical distribution of the funding rounds, I'll split the map into a 600x300 grid. The value for each cell of the grid comes from summing the values for all cities which fall within that cell. That's what the next function calculates.

In [24]:
def geo_distribution(table):
    world = bm.Basemap(resolution='l',projection='merc', area_thresh=10000,
                llcrnrlon=-160, llcrnrlat=-50, urcrnrlon=180, urcrnrlat=70, ellps='WGS84')

    N = 300
    lons, lats = world.makegrid(2*N, N) # get lat/lons of 2N by N evenly spaced grid.
    x, y = world(lons, lats) # compute map proj coordinates.
    data = np.zeros((N,2*N))

    for r in table[table.raised_amount_usd.notnull()].iterrows():
        xx = np.searchsorted(lons[0], r[1].longitude)
        yy = np.searchsorted(lats[:,0], r[1].latitude)
        data[yy,xx] += r[1].raised_amount_usd

    return x, y, data

x, y, data = geo_distribution(merged)

Plotting a histogram of the result shows that the values are extremely right-skewed: almost all cells are tiny compared to the few largest ones. We are going to plot those values colored by their magnitude, and with the current distribution we would be seeing mostly one color, which is not very interesting.

In [25]:
plt.hist(data[data>0].flatten())
Out[25]:
(array([1362,   10,    1,    1,    1,    1,    0,    1,    0,    1]),
 array([  2.00000000e+03,   5.71948829e+09,   1.14389746e+10,
          1.71584609e+10,   2.28779472e+10,   2.85974334e+10,
          3.43169197e+10,   4.00364060e+10,   4.57558923e+10,
          5.14753786e+10,   5.71948649e+10]),
 <a list of 10 Patch objects>)

To counter that we'll work on a log scale, which gives us a better-looking distribution:

In [26]:
log_data = np.log10(data + 1)
plt.hist(log_data[log_data>0].flatten())
Out[26]:
(array([  7,  54, 107, 193, 269, 305, 241, 148,  47,   7]),
 array([  3.30124709,   4.04685808,   4.79246908,   5.53808007,
          6.28369107,   7.02930206,   7.77491306,   8.52052405,
          9.26613505,  10.01174604,  10.75735704]),
 <a list of 10 Patch objects>)

Let's go ahead and create the plot:

In [27]:
fig = plt.figure(figsize=(20, 12))
ax = fig.add_axes([0.0, 0.0, 0.95, 1.0])
plt.title('Total startup funding (1999-2014)', fontsize=24)

# Initialize the map and configure the style
world = bm.Basemap(resolution='l',projection='merc', area_thresh=10000,
            llcrnrlon=-160, llcrnrlat=-50, urcrnrlon=180, urcrnrlat=70, ellps='WGS84', ax=ax)
world.drawcoastlines(linewidth=0.1)
world.drawcountries(linewidth=0.1)
world.drawlsmask(land_color='#F4F3F2', ocean_color='#BFE2FF')

x, y, data = geo_distribution(merged)
log_data = np.log10(data + 1)

# Calculate the range for the contour values
min_level = np.percentile(log_data[log_data>1].flatten(), 40)
max_level = log_data.max()
clevs = np.linspace(min_level, max_level)
cmap = cm.Spectral_r

cs = world.contourf(x, y, log_data, clevs, cmap=cmap)

# Plot the colorbar with values matching the contours
ax1 = fig.add_axes([0.97, 0.2, 0.03, 0.6])
norm = mpl.colors.LogNorm(vmin=10**min_level, vmax=10**max_level)
cb = mpl.colorbar.ColorbarBase(ax1, cmap=cmap, norm=norm, orientation='vertical')
cb.set_ticks([1e5, 1e6, 1e7, 1e8, 1e9, 1e10])
cb.set_ticklabels(['$100K', '$1M', '$10M', '$100M', '$1B', '$10B'])

It looks a bit sparse, right? The problem is that the density of cities with startup activity is not high enough. The approach I took was to apply a Gaussian filter to the values in order to smooth the points and distribute some of their weight to neighbouring points. Here's the result:

In [30]:
fig = plt.figure(figsize=(20, 12))
ax = fig.add_axes([0.0, 0.0, 0.95, 1.0])
plt.title('Total startup funding (1999-2014)', fontsize=24)

# Initialize the map and configure the style
world = bm.Basemap(resolution='l',projection='merc', area_thresh=10000,
            llcrnrlon=-160, llcrnrlat=-50, urcrnrlon=180, urcrnrlat=70, ellps='WGS84', ax=ax)
world.drawcoastlines(linewidth=0.1)
world.drawcountries(linewidth=0.1)
world.drawlsmask(land_color='#F4F3F2', ocean_color='#BFE2FF')

x, y, data = geo_distribution(merged)
log_data = np.log10(scipy.ndimage.filters.gaussian_filter(data, 0.6) + 1)

min_level = np.percentile(log_data[log_data>1].flatten(), 40)
max_level = log_data.max()
clevs = np.linspace(min_level, max_level)
cmap = cm.Spectral_r

cs = world.contourf(x, y, log_data, clevs, cmap=cmap)

ax1 = fig.add_axes([0.97, 0.2, 0.03, 0.6])
norm = mpl.colors.LogNorm(vmin=10**min_level, vmax=10**max_level)
cb = mpl.colorbar.ColorbarBase(ax1, cmap=cmap, norm=norm, orientation='vertical')
cb.set_ticks([1e5, 1e6, 1e7, 1e8, 1e9, 1e10])
cb.set_ticklabels(['$100K', '$1M', '$10M', '$100M', '$1B', '$10B'])