from pygbif import occurrences as occ
import altair as alt
import pandas as pd
import geopandas as gpd
I am planning a trip to Colombia, heading there with my cameras and iNaturalist app to add new observations of plants, birds and butterflies to my personal collection.
In preparation for my trip I want to review what are the popular destinations for iNaturalist users and what species to expect in those. I will explore some options with Python
to download and visualise a selection of biodiversity records.
Querying iNaturalist from Python
The first option is to use the pyinaturalist
library in python. But the iNaturalist API complained when I was trying to download large amounts of data: all records from a country for a single month.
Next option is to use the iNaturalist export function:
- Go to https://www.inaturalist.org/observations/export
- Query observations from Colombia for October 2023:
quality_grade=any&identifications=any&place_id=7196&month=10&year=2023
- Request data sent to email…
But this option is not very useful for a reproducible workflow.
Now, that page suggests using GBIF for big downloads, so maybe that is the best way…
Import the modules
First I use the import statements to load the modules I will use in this session. I will be using the occurrences
functions from PyGBIF for query and download, and tryout exploratory data visualisation with Altair. Also using some functions from GeoPandas and pandas for convenience in reading data as a data frame.
What and where
I want to explore iNaturalist records from Colombia for the month of october 2023.
After some trial and error and reading some similar questions online, I adjusted the search parameters for my query using the search options country
, datasetKey
, year
and month
:
= {
search_params 'country': 'CO', # Colombia
'datasetKey': '50c9509d-22c7-4a22-a47d-8c48425ef4a7', # iNaturalist dataset
'limit': 300, # occurrences per page
'year': 2022,
'month': 10 # October
}
And use the occurrences search
function from pygbif
:
= occ.search(**search_params) gbif_records
The query finds thousands of occurrences:
print(gbif_records['count'])
6985
But only a limited number is downloaded in each query (the search function uses pagination, we will solve that later).
= pd.DataFrame(gbif_records['results'])
gbif_df gbif_df.shape
(300, 90)
We can check the most frequent orders represented in this first query:
'order','orderKey']].value_counts().head() gbif_df[[
order orderKey
Passeriformes 729.0 69
Lepidoptera 797.0 55
Piciformes 724.0 12
Accipitriformes 7191147.0 11
Lamiales 408.0 10
Name: count, dtype: int64
Downloading large selections of records
In order to retrieve all occurrences we need a function that applies the same query multiple times using an offset until all records are downloaded. It was really easy to find such a function for python:
def get_all_occurrences(params):
= []
all_occurrences = 0
offset while True:
'offset'] = offset
params[= occ.search(**params)
occurrences = occurrences['results']
results if not results:
break
all_occurrences.extend(results)+= len(results)
offset print(f"{offset} occurrences downloaded...")
= pd.DataFrame(all_occurrences)
all_occurrences return all_occurrences
Now I will try this here to download all records from one order (Lepidoptera) using the orderKey
parameter (see value above in the table of most frequent orders). This will retrieve records for all butterflies and months for the same country, year and month selectred above.
'orderKey'] = 797
search_params[= get_all_occurrences(search_params) lepidoptera_occurrences
300 occurrences downloaded...
600 occurrences downloaded...
770 occurrences downloaded...
We can do the same for Passeriformes (aves), Asterales (plants), etc. Just need to adjust the orderKey
parameter accordingly.
Among the lepidoptera we will focus on the families of butterflies. Remember we have a pandas dataframe, so we can use the loc
function to subset the dataframe:
= lepidoptera_occurrences.family.isin(['Pieridae','Nymphalidae', 'Lycaenidae','Papilionidae','Riodinidae', 'Hedylidae', 'Hesperiidae'])
ss
= lepidoptera_occurrences.loc[ss] butterfly_occurrences
Visualisation of points in a map
I found a useful map of the administrative divisions of Colombia in simplemaps, we can download and read the json file with geopandas
:
= 'https://simplemaps.com/static/svg/country/co/admin1/co.json'
url = gpd.read_file(url) colombia
For visualisation I found Altair was a nice alternative.
First we will define the background using the mark_geoshape
function with a geopandas dataframe:
= alt.Chart(colombia).mark_geoshape(
background ='lightgray',
fill='white'
stroke'mercator').properties(
).project(=600,
width=700
height )
For the points we can use the previous dataframe with the columns for the coordinates and declaring another variable for the colours:
= alt.Chart(butterfly_occurrences).mark_circle().encode(
points ="decimalLongitude:Q", latitude="decimalLatitude:Q", color='family:N'
longitude )
Putting this together is as easy as:
+ points background
Conclusion
Here we use python, pygbif
and altair
to explore biodiversity records in one country and a selected time frame. Thanks to GBIF and iNaturalist portals for providing wonderful tools to access their data!
Here the basic recipe:
- Find the dataset key for iNaturalist,
- Query the GBIF database,
- Explore the data and select the orderKey for each order of interest
- Repeat the query for each order and iterate to download all records
- Plot the data with Altair
- Done!
That’s it for now. Now, if you excuse me, I need to go back to planning my trip!