All my Butterfly observations around the world

Papilionoidea

Python

global

Papilionoidea

Author

José R. Ferrer-Paris

Published

July 22, 2025

Modified

February 24, 2026

My butterfly observations in iNaturalist

Butterflies are one of my favourite subjects for iNat exploration. I have a long history with them: during my biology studies I worked with endemic Brown butterflies (Satyridae) in the Páramos of Venezuela; in my PhD I monitored the distribution and abundance of White and yellow butterflies (Pieridae) across the country; and for my postdoc in South Africa I analysed the global relationships between butterfly species and their larval host plants. My research focus has since shifted, but I still like to record the butterflies that I detect on my nature walks.

Here I want to find out how many species I have recorded in each of the countries I have visited, and break this down by families and subfamilies.

Overview

Here I reproduce the minimum number of steps required to create a summary of observations for the selected taxonomic group and regions of interest.

Download iNaturalist observation and identification information for a specific query.
Download spatial data representing regions of interest.
Intersect observations with these boundaries.
Group and summarise information based on taxonomic and geographic information.

Tools and Libraries

I will be using a Python environment with a selection of my favourite libraries, as explained here.

We will be using the following libraries in this blog post:

from pyinaturalist import get_observations
import pandas as pd
import numpy as np
import geopandas as gpd
from IPython.display import display, HTML

Step-by-Step Guide

Step 1: Downloading iNaturalist observations

Let’s begin by fetching iNaturalist observations with pyinaturalist. We will need a selection of global observations so we select the user neomapas, and see where in the world they have been.

username = 'neomapas'
taxonid=47224
observations = get_observations(user_id=username, taxon_id=taxonid, per_page=0)
n_obs = observations['total_results']

Let’s print an overview of these total results:

print("User {} has {} observations of Papilionoidea (taxon id {}) in iNaturalist".format(username,n_obs,taxonid))

User neomapas has 498 observations of Papilionoidea (taxon id 47224) in iNaturalist

Updating this after my visit to Bali!

The maximum number of observations we can download in each query is 200, so we need to use pagination to get all results. For each query we will extract a selection of fields that we will use for summarising the observation records (coordinates, species guess, quality grade), and at the same time we will extract the taxonomic information from each identification.

records=list()
idrecords=list()
msg="Requesting observations from user _{}_: page {}, total of {} observations downloaded"
j=1
while len(records) < n_obs:
    print(msg.format(username,j,min(j*200,n_obs)))
    # Request one page at a time; 200 is the maximum page size
    observations = get_observations(
        user_id=username,
        taxon_id=taxonid,
        per_page=200,
        page=j)
    for obs in observations['results']:
        record = {
            'uuid': obs['uuid'],
            'quality': obs['quality_grade'],
            'description': obs['description'],
            'location': obs['place_guess'],
            'longitude': obs['location'][1],
            'latitude': obs['location'][0],
            'species guess': obs['species_guess'],
            'observed on': obs['observed_on'],
            'points': obs['faves_count'] * 10 + obs['comments_count'] + obs['identifications_count'] * 3,
        }
        for id in obs['identifications']:
            ca = id['category']
            fch = id['created_at']
            idrecord = {
                'uuid': obs['uuid'], 
                'quality_grade': obs['quality_grade'], 
                'id_category': ca, 
                'created': fch}
            for anc in id['taxon']['ancestors']:
                idrecord[anc['rank']] = anc['name']
            idrecord[id['taxon']['rank']] = id['taxon']['name']
            idrecords.append(idrecord)
        if len(obs['observation_photos'])>0:
            record['url'] = obs['observation_photos'][0]['photo']['url']
            record['attribution'] = obs['observation_photos'][0]['photo']['attribution']
        records.append(record)
    j=j+1

Requesting observations from user _neomapas_: page 1, total of 200 observations downloaded
Requesting observations from user _neomapas_: page 2, total of 400 observations downloaded
Requesting observations from user _neomapas_: page 3, total of 498 observations downloaded
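
The paging logic can also be written so that it stops as soon as a page comes back empty, which guards against an endless loop if the total changes between requests. The following is a minimal sketch, not the code used for this post: `fake_get_observations` is a hypothetical stand-in that only mimics the `total_results` and `results` keys of the real response, so the example runs offline.

```python
def fake_get_observations(page, per_page, total=498):
    """Hypothetical stand-in for the API call: returns one page of dummy records."""
    start = (page - 1) * per_page
    stop = min(start + per_page, total)
    results = [{'uuid': 'obs-{}'.format(i)} for i in range(start, stop)]
    return {'total_results': total, 'results': results}


def fetch_all(fetch_page, per_page=200):
    """Collect results page by page until a page is empty or the total is reached."""
    records = []
    page = 1
    while True:
        response = fetch_page(page=page, per_page=per_page)
        results = response['results']
        if not results:
            break
        records.extend(results)
        if len(records) >= response['total_results']:
            break
        page += 1
    return records


all_records = fetch_all(fake_get_observations)
print(len(all_records))
```

Passing the fetch function as an argument keeps the loop independent of the API client, so the same loop could wrap a real request function with the same `page` and `per_page` parameters.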

This example requires extracting some additional information that is nested within the json structure of the API response. I explain some of the details in this post.
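
Some of these nested fields can be missing. As far as I understand the response format, `location` can be empty for observations without coordinates and `observation_photos` can be an empty list, so a more defensive version of the extraction could look like the sketch below. The helper name `extract_record` and the sample dictionary are made up for illustration.

```python
def extract_record(obs):
    """Flatten one observation dictionary, tolerating missing nested fields."""
    location = obs.get('location') or [None, None]   # [latitude, longitude]
    photos = obs.get('observation_photos') or []
    record = {
        'uuid': obs['uuid'],
        'latitude': location[0],
        'longitude': location[1],
        'url': None,
        'attribution': None,
    }
    if photos:
        # Use the first photo only, as in the loop above
        record['url'] = photos[0]['photo']['url']
        record['attribution'] = photos[0]['photo']['attribution']
    return record


# Made-up observation without coordinates or photos
sample = {'uuid': 'abc-123', 'location': None, 'observation_photos': []}
print(extract_record(sample))
```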

We transform these sets of records into two data frames with pandas:

inat_obs=pd.DataFrame(records)
inat_ids=pd.DataFrame(idrecords)

Step 2: Merging observation and identification information

A tricky step is to transform species guess information into full taxonomic information. Here I am using the identification information included in the response from the get_observations function to reconstruct the taxonomic information. The problem is that there are multiple id suggestions per observation, and we have to filter out the unvalidated ids first.

In this case most of my butterfly observations are research grade:

inat_obs.groupby(['quality']).agg({"uuid": pd.Series.nunique})

	uuid
quality
casual	1
needs_id	201
research	296

In this dataset every research grade observation has at least one improving identification, and most of them also have supporting identifications:

inat_ids.groupby(['quality_grade','id_category']).agg({"uuid": pd.Series.nunique})

		uuid
quality_grade	id_category
casual	improving	1
casual	leading	1
needs_id	improving	124
	leading	193
	maverick	3
	supporting	11
research	improving	296
	leading	35
	maverick	45
	supporting	282

We can use this trick to select the taxonomic information from the best id of each observation:

ss=inat_ids.id_category.isin(['improving','supporting'])
cols=['uuid','family','subfamily']
best_ids = inat_ids.loc[ss,cols].drop_duplicates().dropna()
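
One caveat of this filter: if the improving and supporting identifications of an observation disagree at family or subfamily level, more than one row per `uuid` survives, and a later join would repeat that observation. The sketch below uses made-up identification records to show how such conflicts could be detected and, as one possible rule (an assumption, not what this post does), how to keep only the most recent identification.

```python
import pandas as pd

# Made-up identification records: observation 'a' has two conflicting ids
toy_ids = pd.DataFrame({
    'uuid': ['a', 'a', 'b', 'b'],
    'id_category': ['improving', 'supporting', 'improving', 'leading'],
    'created': ['2024-01-01', '2024-02-01', '2024-01-15', '2024-01-10'],
    'family': ['Nymphalidae', 'Nymphalidae', 'Pieridae', 'Pieridae'],
    'subfamily': ['Nymphalinae', 'Heliconiinae', 'Pierinae', 'Coliadinae'],
})

keep = toy_ids.id_category.isin(['improving', 'supporting'])
candidates = toy_ids.loc[keep].dropna(subset=['family', 'subfamily'])

# Observations that still have more than one distinct family/subfamily pair
pairs = candidates[['uuid', 'family', 'subfamily']].drop_duplicates()
conflicts = pairs.loc[pairs.uuid.duplicated(keep=False), 'uuid'].unique()
print(list(conflicts))

# One possible rule: keep the most recent identification per observation
latest = (
    candidates
    .sort_values('created')
    .drop_duplicates(subset='uuid', keep='last')
    [['uuid', 'family', 'subfamily']]
)
print(latest)
```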

And now merge these best ids back with the observation records.

inat_obs_ids = inat_obs.join(best_ids.set_index('uuid'), on='uuid')

Later in the code I will need an HTML string that defines a figure with a caption for each record, so let’s build it now:

inat_obs_ids['figure'] = [
    "<figure class='mini'><a href='https://www.inaturalist.org/observations/%s' target=_blank><img src='%s' height=50><figcaption class='mini'>%s: <i>%s</i></figcaption></a></figure>" % (
        record['uuid'],
        record['url'],
        record['family'],
        record['species guess'])
    for idx,record in inat_obs_ids.iterrows() 
]
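
Observations without a photo have no `url`, and names can contain characters with a special meaning in HTML. A more defensive variant of the same template could use `html.escape` from the standard library and leave out the image when there is no photo. This is only a sketch; the function name `make_figure` and the sample values are made up for illustration.

```python
from html import escape


def make_figure(uuid, url, family, species_guess):
    """Build the figure markup, leaving out the <img> tag when url is missing."""
    caption = "{}: <i>{}</i>".format(escape(str(family)), escape(str(species_guess)))
    # A missing url arrives as None or as a float NaN after the pandas join
    has_photo = isinstance(url, str) and url != ''
    img = "<img src='{}' height=50>".format(escape(url, quote=True)) if has_photo else ""
    return (
        "<figure class='mini'>"
        "<a href='https://www.inaturalist.org/observations/{}' target=_blank>"
        "{}<figcaption class='mini'>{}</figcaption></a></figure>"
    ).format(uuid, img, caption)


print(make_figure('abc-123', float('nan'), 'Pieridae', 'Whites & Yellows'))
```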

Step 3: Number of unique observations and species per family

Now we can summarise the information in this combined dataframe to get the unique number of observations (with their unique universal ids, or uuid) and species for each family and subfamily of Butterflies:

aggfuns = {
    "uuid": pd.Series.nunique,
    "species guess": pd.Series.nunique
    }
inat_obs_ids.groupby(['family','subfamily',]).agg(aggfuns)

		uuid	species guess
family	subfamily
Hesperiidae	Eudaminae	1	1
	Hesperiinae	3	2
	Pyrginae	5	4
	Tagiadinae	1	1
	Trapezitinae	4	3
Lycaenidae	Lycaeninae	4	3
	Polyommatinae	47	25
	Theclinae	4	4
Nymphalidae	Apaturinae	1	1
	Biblidinae	17	10
	Charaxinae	1	1
	Danainae	22	10
	Heliconiinae	22	13
	Limenitidinae	11	5
	Nymphalinae	51	30
	Satyrinae	34	15
Papilionidae	Papilioninae	28	17
Papilionidae	Parnassiinae	1	1
Pieridae	Coliadinae	18	15
Pieridae	Pierinae	44	29
Riodinidae	Euselasiinae	1	1
Riodinidae	Riodininae	3	3
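
The column headers in this table keep the original column names (`uuid`, `species guess`), which do not say what was counted. With pandas named aggregation the same kind of summary can carry descriptive names. A sketch with made-up records, where `n_observations` and `n_species` are names chosen for illustration:

```python
import pandas as pd

# Made-up observation records
toy = pd.DataFrame({
    'uuid': ['a', 'b', 'c', 'd'],
    'family': ['Pieridae', 'Pieridae', 'Pieridae', 'Nymphalidae'],
    'subfamily': ['Pierinae', 'Pierinae', 'Coliadinae', 'Danainae'],
    'species guess': ['Cabbage White', 'Cabbage White', 'Cloudless Sulphur', 'Monarch'],
})

# Count distinct observations and distinct species guesses per group
summary = (
    toy
    .groupby(['family', 'subfamily'])
    .agg(
        n_observations=('uuid', 'nunique'),
        n_species=('species guess', 'nunique'),
    )
)
print(summary)
```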

Step 4: Number of unique observations and species ids per country

In order to combine the observation records with external spatial information, we need to add proper geospatial information to this data frame using geopandas. For this, we first transform the numeric variables latitude and longitude into a geometry with an explicit Coordinate Reference System (CRS):

gs = gpd.points_from_xy(inat_obs_ids.longitude, inat_obs_ids.latitude, crs="EPSG:4326")
inat_obs_xy=gpd.GeoDataFrame(inat_obs_ids, geometry=gs)

Now we need to download the external geospatial data representing the country boundaries. This is very easy to do thanks to the geopandas.read_file function.

For global data I like to use the high-resolution World Bank Official Administrative Boundaries available from the World Bank Group Data Catalog. This service provides a link to a zip file with the spatial vector files, and we need to construct a remote path to read the shapefile inside the zip file:

zip_url='https://datacatalogfiles.worldbank.org/ddh-published-v2/0038272/3/DR0046659/wb_countries_admin0_10m.zip'
shp_file = "WB_countries_Admin0_10m/WB_countries_Admin0_10m.shp"
remote_path = 'zip+{}!/{}'.format(zip_url, shp_file)

In this way geopandas is able to download and read the file from the cloud:

WB0 = gpd.read_file(remote_path)

Now we are ready to overlay the iNaturalist observations onto the administrative boundaries using another geopandas function: sjoin.

inat_obs_world = inat_obs_xy.sjoin(WB0, how="left")
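
With `how="left"` every observation is kept, and points that fall outside all polygons get missing values in the joined columns, so they silently drop out of a later `groupby`. A minimal sketch with two made-up square regions standing in for countries and three made-up points shows how to spot such records:

```python
import geopandas as gpd
from shapely.geometry import box

# Two made-up square regions standing in for country polygons
regions = gpd.GeoDataFrame(
    {'WB_NAME': ['Region A', 'Region B']},
    geometry=[box(0, 0, 10, 10), box(20, 0, 30, 10)],
    crs="EPSG:4326",
)

# Three made-up observations; the last one lies outside both regions
points = gpd.GeoDataFrame(
    {'uuid': ['a', 'b', 'c']},
    geometry=gpd.points_from_xy([5, 25, 15], [5, 5, 5], crs="EPSG:4326"),
)

joined = points.sjoin(regions, how="left")
print(joined[['uuid', 'WB_NAME']])

# Observations without a matching region
unmatched = joined.loc[joined.WB_NAME.isna(), 'uuid'].tolist()
print(unmatched)
```

Comparing the number of unmatched records with the totals from Step 1 is a quick way to check whether any observation was lost in the overlay.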

And finally we can aggregate information by country name (column WB_NAME):

aggfuns = {
    "uuid": pd.Series.nunique,
    "species guess": pd.Series.nunique
    }
inat_obs_world.groupby("WB_NAME").agg(aggfuns)

	uuid	species guess
WB_NAME
Australia	115	51
Colombia	4	4
Costa Rica	2	2
Finland	5	5
Germany	1	1
Indonesia	46	28
Kenya	5	5
Mexico	120	59
Panama	3	3
Peru	7	4
Rwanda	7	6
Singapore	2	1
South Africa	33	22
Switzerland	1	1
Tajikistan	4	4
Uganda	2	2
United Arab Emirates	1	1
Venezuela, Republica Bolivariana de	100	80

Step 5: Displaying a sample of observations

Now we combine spatial and taxonomic information to get a wall of pictures showing the most interesting observations for each butterfly family in each of the countries I have visited.

These lines of code perform a couple of tricks. I group the data twice: first I select one record for each combination of country and family based on the points column, then I group by country and collect the figures in a list. Finally, I iterate over the countries, join the figures, and use the display and HTML functions to render the formatted text strings as HTML elements that organise the figures and captions on this webpage.

selection = (
    inat_obs_world
    .sort_values('points', ascending=False)  # highest score first, so first() keeps the top record
    .groupby(['WB_NAME','family'])
    .first()
    .groupby(['WB_NAME'])
    .agg({'figure':'unique'})
)


sections = list()
for idx,row in selection.iterrows():
    sectionfigures="&nbsp;".join(row['figure'])
    sectionname="<figure class='mini'><p class='figsection'>%s </p></figure>" % idx
    sections.append(sectionname + sectionfigures)

allsections="<div class='container'>%s</div>" % ("".join(sections))

display(HTML(allsections))
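
The selection step relies on the sort order: `first()` returns the first row of each group, so whether the highest or the lowest score is picked depends on the direction of `sort_values`. A small sketch with made-up scores, sorted in descending order so that the top-scoring record of each group comes first:

```python
import pandas as pd

# Made-up records with a score and a placeholder figure string
toy = pd.DataFrame({
    'WB_NAME': ['Mexico', 'Mexico', 'Mexico', 'Kenya'],
    'family': ['Pieridae', 'Pieridae', 'Nymphalidae', 'Pieridae'],
    'points': [3, 16, 6, 4],
    'figure': ['fig1', 'fig2', 'fig3', 'fig4'],
})

# Top-scoring record for each combination of country and family
top = (
    toy
    .sort_values('points', ascending=False)
    .groupby(['WB_NAME', 'family'])
    .first()
)
print(top)

# Collect the selected figures per country
per_country = top.groupby('WB_NAME').agg({'figure': 'unique'})
print(per_country)
```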