class: middle
<style type="text/css"> .remark-slide-number { display: none; } </style> <br> <br> <center> # Creating open, reproducible workflows for ecological niche modeling ### Andrea Sánchez-Tapia, Sara Ribeiro Mortara, Felipe Sodré Barros ### FOSS4G Buenos Aires 2021 </center> <br> __[Diapositivas en Castellano](https://andreasancheztapia.github.io/FOSS4G_2021/espa%C3%B1ol#1)__ --- background-image: url("figs/logo_jbrj.png") background-position: 98% 2% background-size: 100px .left-column[ <img src="figs/andrea.jpg" title="Pictures of the team: Andrea Sánchez-Tapia, Sara Mortara, Felipe Sodré Barros, Guilherme Gall, Diogo Rocha, and Marinez Ferreira de Siqueira (PI)" alt="Pictures of the team: Andrea Sánchez-Tapia, Sara Mortara, Felipe Sodré Barros, Guilherme Gall, Diogo Rocha, and Marinez Ferreira de Siqueira (PI)" width="90" style="display: block; margin: auto auto auto 0;" /><img src="figs/SRM.jpg" title="Pictures of the team: Andrea Sánchez-Tapia, Sara Mortara, Felipe Sodré Barros, Guilherme Gall, Diogo Rocha, and Marinez Ferreira de Siqueira (PI)" alt="Pictures of the team: Andrea Sánchez-Tapia, Sara Mortara, Felipe Sodré Barros, Guilherme Gall, Diogo Rocha, and Marinez Ferreira de Siqueira (PI)" width="90" style="display: block; margin: auto auto auto 0;" /><img src="figs/Felipe_Barros.jpg" title="Pictures of the team: Andrea Sánchez-Tapia, Sara Mortara, Felipe Sodré Barros, Guilherme Gall, Diogo Rocha, and Marinez Ferreira de Siqueira (PI)" alt="Pictures of the team: Andrea Sánchez-Tapia, Sara Mortara, Felipe Sodré Barros, Guilherme Gall, Diogo Rocha, and Marinez Ferreira de Siqueira (PI)" width="90" style="display: block; margin: auto auto auto 0;" /><img src="figs/gui.jpeg" title="Pictures of the team: Andrea Sánchez-Tapia, Sara Mortara, Felipe Sodré Barros, Guilherme Gall, Diogo Rocha, and Marinez Ferreira de Siqueira (PI)" alt="Pictures of the team: Andrea Sánchez-Tapia, Sara Mortara, Felipe Sodré Barros, Guilherme Gall, Diogo Rocha, and Marinez Ferreira de Siqueira (PI)" width="90" style="display: block; margin: auto auto auto 0;" /><img src="figs/diogo.jpeg" title="Pictures of the team: Andrea Sánchez-Tapia, Sara Mortara, Felipe Sodré Barros, Guilherme Gall, Diogo Rocha, and Marinez Ferreira de Siqueira (PI)" alt="Pictures of the team: Andrea Sánchez-Tapia, Sara Mortara, Felipe Sodré Barros, Guilherme Gall, Diogo Rocha, and Marinez Ferreira de Siqueira (PI)" width="90" style="display: block; margin: auto auto auto 0;" /><img src="figs/mari.jpeg" title="Pictures of the team: Andrea Sánchez-Tapia, Sara Mortara, Felipe Sodré Barros, Guilherme Gall, Diogo Rocha, and Marinez Ferreira de Siqueira (PI)" alt="Pictures of the team: Andrea Sánchez-Tapia, Sara Mortara, Felipe Sodré Barros, Guilherme Gall, Diogo Rocha, and Marinez Ferreira de Siqueira (PI)" width="90" style="display: block; margin: auto auto auto 0;" /> ] .right-column[ #### Scientific Computation Laboratory - Rio de Janeiro Botanical Garden #### ¡liibre! Independent Biodiversity Informatics Laboratory + Scientific workflows based in R for data downloading and cleaning, taxonomic checking + Biodiversity informatics, ENM/SDM, open science, reproducibility + Support for data-intensive research projects (e.g., CNCFlora - IUCN authority) ] --- background-image: url("https://the-turing-way.netlify.app/_static/logo.jpg") background-position: 4% 4% background-size: 200px .right[ ### https://the-turing-way.netlify.app/welcome ] <br> <br> <br> <br> + [Open research](https://the-turing-way.netlify.app/reproducible-research/open.html): reproducible, transparent, reusable, collaborative, accountable, accessible to society + [Reproducibility](https://the-turing-way.netlify.app/reproducible-research/reproducible-research.html): data and code being available to fully rerun the analysis --- ## ecological niche modeling or species distribution modeling (ENM, SDM) + __Stastistical learning__ applied to the prediction of areas potentially suitable for the presence of a species -- + Species occurrence data (collection data bases, GBIF) -- + Environmental predictor variables (Worldclim, remote sensing) -- + Diferent algorithms (GLM, SVM, random forests, classification trees, GBM) -- + Model performance (data partition into training and test data + performance metrics) -- + Early implementations: hidden code ("black boxes"), point-and-click applications --- background-image: url("https://www.r-project.org/Rlogo.png") background-position: 4% 4% background-size: 200px <br> <br> <br> <br> - Statistical programming language -- - Libre software, open code, _lingua franca_ in Ecology -- - Several options for ENM/SDM in R -- - GIS with __raster__, __sp__, __maps__, __rgdal__, __sf__ -- - Established packages such as __dismo__ (Hijmans et al 2017), __BIOMOD2__ (Thuiller et al 2007) -- - Recent packages: **ENMeval** (Muscarella et al 2014), **sdm** (Naimi & Araujo 2016), **spThin** (Aiello-Lammens et al 2015), **zoon** (Golding et al. 2018), **wallace** (Kass et al 2018), **kuenm** (Cobos et al 2019), **occCite** (Lowens 2020) -- #### Difficulties with scalability, code repetition, decision documentation... .purple[non-reproducible] --- background-image: url("figs/modleR.png") background-position: 4% 4% background-size: 150px <br> <br> <br> <br> <br> - https://model-r.github.io/modleR/index.html -- A .purple[workflow] developed to __integrate and automatize__ some of the common steps in ecological niche modeling -- - a four-step workflow: -- + `setup_sdmdata()`: data setup, cleaning, partition, pseudoabsence selection -- + `do_any()` and `do_many()`: model fitting, projecting and evaluating -- + `final_model()`: joining partitions -- + `ensemble_model()`: algorithm consensus --- background-image: url("figs/modleR.png") background-position: 4% 4% background-size: 150px .right[ ## a four-step workflow <img src="figs/WORKFLOW.png" title="Workflow scheme. The first step cleans and partitions data into training and test sets, samples pseudoabsences and controls for the correlation of explanatory variables. The second step fits an algorithm per partition, for each train and test set. The third step joins partitions, creating a model per algorithm. The fourth step analyzes algorithm consensus -'ensemble modeling', to obtain a unique model per species" alt="Workflow scheme. The first step cleans and partitions data into training and test sets, samples pseudoabsences and controls for the correlation of explanatory variables. The second step fits an algorithm per partition, for each train and test set. The third step joins partitions, creating a model per algorithm. The fourth step analyzes algorithm consensus -'ensemble modeling', to obtain a unique model per species" width="1000" /> ] --- ## 1. `setup_sdmdata()`: data preparation + Data cleaning: exact duplicates, NAs and one occurrence per pixel (this does not exclude the need for previous careful data cleaning!) -- + Experimental design: bootstrap, cross-validation -- + Pseudo-absence sampling -- + Control of variable correlation up to a user-defined value (e.g., 0.8) --- ## 2. `do_(m)any()`: model fitting and projection + `do_any()` for one algorithm and partition (ex. `algo = "maxent"`) -- + `do_many()` calls `do_any()` to fit multiple algorithms (ex: `bioclim = TRUE`, `maxent = TRUE`) -- + Parametrization -- - Algorithms: dismo, maxnet, GLM, SVM, bosques aleatorios, boosted regression trees (BRT) -- + Projection to different datasets (in time or space) -- + Returns table with performance statistics -> TSS, AUC, pROC, FNR, Jaccard... --- ## 2. `do_(m)any()`: model fitting and projection <img src="figs/modleR/rf_cont_Abarema_langsdorffii_1_1.png" width="300" /><img src="figs/modleR/rf_cont_Abarema_langsdorffii_1_2.png" width="300" /><img src="figs/modleR/rf_cont_Abarema_langsdorffii_1_3.png" width="300" /> _Abarema langsdorffii_, three partitions, randomForests --- ## 3. `final_model()`: a model per algorithm per species + The basics: a central tendency measure and uncertainty between partitions -- + Which models to join? (the raw continuous model, the binary) -- + Some additional operations: consensus between binary models -- + Uncertainty: range (max - min) between partitions --- ## 3. `final_model()`: a model per algorithm per species <img src="figs/modleR/Abarema_langsdorffii_rf_raw_mean.png" width="300" /><img src="figs/modleR/Abarema_langsdorffii_rf_bin_consensus.png" width="300" /><img src="figs/modleR/Abarema_langsdorffii_rf_raw_uncertainty.png" width="300" /> 1. Raw mean 2. Binary consensus 3. Uncertainty (range) --- ## 4. `ensemble_model()` - algorithmic consensus + Mean between final models -- + Consensus -- + Best-performing algorithm -- + PCA between algorithms -- + Range uncertainty metrics -- + Ensemble models do not necessarily perform better than individual algorithms (Zhu & Peterson 2017) -- + Evaluate different ensemble model performance (WIP) --- class: center, middle # modleR was written to implement good practices in scientific computation --- ## folder structure and portability .pull-left[ + A single working directory per project + Different steps: different subfolders + A consistent subfolder structure + Relative rather than absolute paths and no `setwd()` ] .pull-right[ <img src="figs/fig03_folder.png" width="772" /> ] --- ## modularity + Each step saves its output -- + The next step reads the previous output -- + Using HD space rather than RAM: -- + The user may enter and exit the workflow at any step -- + Parallelization and use in high performance/high throughput computational frameworks (HPC/HTC) --- ## thorough metadata recording + Metadata: parametrization options -- + Session information -- + Environment: packages used and their version -- <img src="figs/metadata.png" width="600" /> --- ## interoperability We did not create new classes or methods: communication with other packages in the R environment <img src="figs/xkcd_standards.png" title="xkcd comic: title: how standards proliferate: (see a/c chargers, chracater encodings, instant messaging, etc) First frame: Situation: There are 14 competing standards. Second frame: two human figures. Person 1: '14?! Ridiculous! We need to develop one universal standard that covers everyone's use cases'. Person 2: 'Yeah!'. Third frame: Soon: Situation: There are 15 competing standards." alt="xkcd comic: title: how standards proliferate: (see a/c chargers, chracater encodings, instant messaging, etc) First frame: Situation: There are 14 competing standards. Second frame: two human figures. Person 1: '14?! Ridiculous! We need to develop one universal standard that covers everyone's use cases'. Person 2: 'Yeah!'. Third frame: Soon: Situation: There are 15 competing standards." width="719" /> --- ## reproducibility <img src="figs/feng.png" width="600" /> --- ## final remarks + Reproducibility should drive any ENM workflow -- + __Metadata__ are really useful - and necessary -- + __Scalability__: Any ENM workflow should be easily adapted to HPC -- + __Flexibility__ to start and leave at any step is essential to guarantee the communication with other packages and the requirements of different projects -- + Part of a larger group of R packages in the area: efforts for __integration__ --- class: middle, center # ¡Gracias! <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"/></svg> [@SanchezTapiaA](https://twitter.com/SanchezTapiaA) <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"/></svg> [@MortaraSara](https://twitter.com/MortaraSara) <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"/></svg> [@FelipeSMBarros](https://twitter.com/FelipeSMBarros) <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 496 512"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg> [modleR](https://model-r.github.io/modleR/index.html)