Much attention in recent years has been focused on making biodiversity data open and accessible to researchers. Yet ensuring the availability of these data is only the first step in preventing data waste. Here, we argue that researchers need to do a better job of using available datasets. We recommend that researchers search for existing data sources to serve their needs first, that they work to integrate multiple data sources when one alone will not suffice, and that they aim to explore research topics that will directly inform conservation action. We provide a roadmap with resources and examples to help guide conservation researchers towards better data-use practices. The vast quantities of biodiversity data, coupled with advanced techniques for using and integrating datasets, will play a key role in determining how to halt biodiversity declines. Making data open and accessible is only the start; we must also ensure that those existing data are used to conduct further research and inform decisions.
Resource allocation for conserving biodiversity depends on a balance between collecting data to make informed and effective decisions and acting on the best available information in a timely manner to prevent avoidable extinctions (Bennett et al., 2018). Indeed, gathering more information can help managers make decisions that more effectively benefit biodiversity (Bennett et al., 2018; Canessa et al., 2015) and reduce the likelihood of taking actions that are ineffectual or even harmful (Ricciardi and Simberloff, 2009; Cook et al., 2010). However, monitoring and data collection take time, which can delay the onset of urgently needed intervention. For example, the Christmas Island Pipistrelle (Pipistrellus murrayi) was monitored continuously without action until it was extinct, and the Orange-bellied parrot (Neophema chrysogaster) nearly suffered the same fate (Martin et al., 2012). Furthermore, monitoring and data collection can deplete already limited conservation budgets, siphoning resources that could otherwise be redirected towards conservation action, and without providing any real benefit for informing decisions (Bennett et al., 2018; Canessa et al., 2015). Evidence suggests that obtaining more data (which is often costly and challenging) may have little to no benefit over using sparse yet readily available data (Grantham et al., 2009). In many cases, experts agree that evidence synthesis is needed more than further data collection to conserve biodiversity (Buxton et al., 2021a), and yet research and monitoring still consume approximately half of conservation budgets (Buxton et al., 2020a).
To expedite science without compromising on quality, researchers can make better use of existing data. Rapid action does not need to be uninformed; there are extensive data already available that could be used to address many of the pressing questions we need answered to conserve biodiversity. The growing movement towards more open science means that public data are slowly but surely becoming the standard (Costello and Wieczorek, 2014; Roche et al., 2014). Data rescue (Bledsoe et al., 2022), better open data practices (Roche et al., 2014; Roche et al., 2021; Gerstner et al., 2017), and more applied research (Buxton et al., 2021b) have all been the subject of recent research advances. Consequently, researchers now have more data than ever at their disposal.
However, comparatively little emphasis has been placed on encouraging researchers to first seek out existing data before collecting more. The potential for scientific data re-use can end after the publication of a journal article if the data are not sought out to contribute to further research. By examining the availability of existing data, researchers may find opportunities to reduce data collection costs or to harmonize new data collection with existing datasets (e.g., Grenié et al., 2023). During the global pandemic in 2020, many scientists did exactly this when new safety protocols interrupted or prevented data collection (Buxton et al., 2020b; Howell et al., 2022). In many cases, multiple datasets can be used to complement one another and fill data gaps (e.g., Rosenberg et al., 2019) or can be integrated with professional monitoring efforts to improve the accuracy of inferences (e.g., Robinson et al., 2020).
The Big Data movement presents extensive opportunities for conservation researchers to reduce resource expenditure on data collection (Runting et al., 2020). Datasets are sometimes published as “data papers” (e.g., Soria et al., 2021; Naujokaitis-Lewis et al., 2022), collating information from a variety of sources and providing appropriate metadata for future analyses. Beyond data collected by researchers, alternative sources of biodiversity data are also available. Community science (also known as “citizen science”) data are widely available, and researchers are steadily overcoming the analytical challenges associated with these datasets (Binley and Bennett, 2023) and recognizing their capacity to contribute to conservation research (Binley et al., 2021; Chandler et al., 2017; Lin et al., 2022). Data infrastructures such as the Global Biodiversity Information Facility (GBIF) and Living Planet Project bring together multiple datasets, including professionally collected data, museum specimens and community science data, all in one location (GBIF: The Global Biodiversity Information Facility, 2022). More targeted projects have also been created, for example, collating information specifically on detectability estimates (Edwards et al., 2023) or on insect population trends (Grames et al., 2022). Furthermore, such databases often provide tools for working directly with the data, improving adherence to FAIR data principles (findability, accessibility, interoperability and reusability; FORCE11, 2014). Undoubtedly, these data may present unique analytical challenges, but novel methods that address these challenges can unlock their great potential for biodiversity research and monitoring (Johnston et al., 2022).
Here, we outline recommendations for making better use of existing data in conservation research, providing a roadmap to better guide researchers towards minimizing data waste. Our objective is to enhance the capacity of biodiversity research to effectively and efficiently inform conservation management. Of course, new or continued data collection is required in many circumstances, and existing data sources may not necessarily be appropriate for every study. We argue that looking for existing data should be the first step rather than a last resort in biodiversity research.
A roadmap for better data use
We present a roadmap designed to encourage researchers to look for available data at the outset of their project design and demonstrate how this can be incorporated into their workflow (Fig. 1; Table S1). We base this roadmap on structured decision-making frameworks, which are an established method of guiding conservation action by explicitly outlining the tools and options available while weighing the potential costs and consequences of each decision (Bower et al., 2018; Gregory et al., 2012; Hemming et al., 2021). Decisions regarding whether to use existing data are analogous to other decisions in conservation, in that they require an understanding of the advantages and disadvantages of available options, and, most importantly, a clear concept of the fundamental objectives at hand. This roadmap differs from the classic structured decision-making frameworks in conservation and environmental science in that we explore decisions related to data availability and suitability in greater depth, but our workflow fits within current guidelines for decision analysis (Hemming et al., 2021). While this roadmap is not a comprehensive collection of all the tools available, it is designed to guide researchers towards more efficient data use. We recommend that researchers search for existing data first; that they integrate datasets and target gaps for new data collection; and that they aim to conduct research that can contribute to informed conservation decisions.
Fig. 1. Roadmap to minimizing data waste, based on a structured decision-making framework. Although the steps are presented here in a linear fashion, any step can feed back into another. For example, once results are used to inform action, researchers and conservation managers can circle back to step 1. If the available data are found to be unsuitable in step 3, one can return to step 2 and search for more available data.
When we refer to “data” in this article, we are referring broadly to biotic and abiotic data relevant to the study of biodiversity and conservation. These data can be collected from a wide variety of sources, including academic research and community science programs. However, we do not assume that this roadmap framework encompasses the specific considerations required when working with Indigenous data and knowledges. Indigenous data and knowledges are valid in their own right and are crucial for addressing the biodiversity crisis (Latulippe and Klenk, 2020; McGregor et al., 2020; Reid et al., 2022; TallBear, 2014). Undoubtedly, Indigenous data and knowledges, along with the Indigenous nations who hold them, should play a key role in biodiversity research and conservation decision-making. However, even if they are accessible to researchers (for example, if they are published in peer-reviewed literature or are “findable” in existing databases), ethical research practices demand that Indigenous data be used only following Indigenous-led or co-led frameworks (Carroll et al., 2019, 2020; Simpson, 2004). For these reasons, Indigenous data and knowledges exceed the scope of the definition of “biodiversity data” for this paper.
Step 1: Define the problem and objectives. A fundamental first step in this roadmap is defining the problem. To make biodiversity conservation research more impactful, problems and objectives should be clearly defined in terms of both causal processes or threats and data needs. This includes the spatial and temporal extent of data needed and the taxon/taxa for which data are required. The aim of this stage should be to seek information that supports management decisions and/or investigates threats to biodiversity. While gathering information for the sake of improving general knowledge can be beneficial in certain contexts (Wintle et al., 2010), not all these data will be directly relevant to the decision-making process. Explicitly stating the objectives can therefore help distinguish which information needs are clearly linked with the management decisions at hand. All relevant stakeholders, partners and practitioners should be consulted when defining the problem and objectives.
Step 2: Search for available data. Once the problems and objectives are clearly identified, researchers should consult potential sources of existing data to assess how they might address their data needs. Potential data sources include but are not limited to: community science databases; large open data repositories; atlases; grassroots community programs; and published peer-reviewed datasets. Researchers should be aware that data sources outside of the peer-reviewed literature may hold critical information pertaining to the protection and recovery of species (Khorozyan, 2022). These sources all vary in their adherence to FAIR principles (FORCE11, 2014). An overview of these tools including examples is provided in Table S1.
We encourage conservation researchers to see if data needs can be met at least in part by pre-existing datasets, which are increasingly available to meet the needs of conservation scientists. Although some taxonomic groups have extensive data available, this availability does not always translate into use in the literature (Theobald et al., 2015). With the promising “open science” movement quickly gaining traction (Roche et al., 2021; Stodden, 2011) and community science participation increasing rapidly (Ruiz-Gutierrez et al., 2021), researchers should make every effort to use these developments to their advantage. We do, however, acknowledge that the availability of data may vary depending on geographic location and taxon (Binley et al., 2023a; Titley et al., 2017).
We also urge researchers to be open to alternative sources of information. There has been some resistance to using data sources such as community science due to perceived issues with accuracy (Lukyanenko et al., 2016) and biases (Geldmann et al., 2016). However, advances in statistics have allowed researchers to overcome many of the limitations of community science data collection (Johnston et al., 2023), and in some cases community science data have been found to be more reliable than professionally collected data (Swanson et al., 2016; Callaghan and Gawlik, 2015). At minimum, such data are a helpful baseline of diffuse prior information (e.g., to be examined for indicator species; Mair et al., 2017). Local community members may also provide valuable insights into ecosystems and environments with which they are intimately familiar (Etkin, 2002; Brook and McLachlan, 2008). Moreover, reaching out to managers of existing community science projects for potential collaboration can foster relationship-building between scientists and other key stakeholders, strengthening the network of professionals working on a given subject (Cooper et al., 2007; Decker, 2005; Cooke et al., 2020). When engaging with locally-based initiatives, researchers should also be open to expanding on or redirecting the research questions to better address local information needs (Decker, 2005). Of course, researchers should always carefully examine the quality of data before using them, regardless of data origins.
Step 3: Assess suitability. Researchers must then carefully examine the potential utility and limitations of existing data sources to confirm suitability for meeting data needs. For example, for a study on birds that covers their entire breeding range in North America, the North American Breeding Bird Survey (BBS; Hudson et al., 2017) will cover the spatial range during the breeding period, eBird (Sullivan et al., 2014) will cover the same range year round, and Atlas data (e.g., the Ontario Breeding Bird Atlas; Cadman et al., 2007) can provide more detailed information on breeding behaviour but only for a specific time window. Additionally, the data must be appropriate for answering the problems and objectives outlined in step 1. For example, the BBS is designed to measure population trends, but can do a poorer job estimating species richness given the protocol is not well suited to that purpose (Ankori-Karlinsky et al., 2022). Finally, researchers must establish what, if any, statistical approaches can and should be used to account for any potential biases present in the dataset. For example, observer bias is present in most data collection protocols (Farmer et al., 2012), but methods exist to account for variation in detectability (Sólymos et al., 2012), observer skill (Hudson et al., 2017), and spatial and temporal variation in effort (Sullivan et al., 2009).
Step 4: Process data. Pertinent data can then be accessed and manipulated using a wide range of openly available tools that are often designed specifically to improve the reusability of these data. R software packages (e.g., ebirdst, Auer et al., 2020; bbsBayes, Edwards and Smith, 2021) and Application Programming Interfaces (APIs) (e.g., eBird.org, iNaturalist.org, GBIF.org) have been developed for many large repositories and initiatives. Online dashboards are common for atlases and large data repositories and allow one to both download and visualize data (e.g., birdscanada.org; https://na-pops.org/#dashboard). These data often require substantial cleaning and processing before they can be used in analysis, but tools and workflows also exist for this purpose (e.g., Maitner et al., 2018; Mathew et al., 2014; Ribeiro et al., 2022). Additional processing such as taxonomic harmonization may be required if multiple datasets are to be integrated (Grenié et al., 2023; Ribeiro et al., 2022).
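As a minimal sketch of the kind of cleaning and taxonomic harmonization described above, the following Python example filters, standardizes and de-duplicates occurrence records drawn from two sources. The `Occurrence` record, the `SYNONYMS` lookup and the `clean_occurrences` function are illustrative assumptions rather than part of any cited tool; in practice, a curated taxonomic backbone and the dedicated workflows cited above would likely replace the hand-written synonym table.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Occurrence:
    """One species observation (hypothetical minimal schema)."""
    species: str
    lat: Optional[float]
    lon: Optional[float]
    date: str    # ISO 8601 date, e.g. "2021-06-01"
    source: str  # name of the contributing dataset


# Illustrative synonym table mapping outdated names to accepted names.
SYNONYMS: Dict[str, str] = {
    "Dendroica petechia": "Setophaga petechia",
    "Parus atricapillus": "Poecile atricapillus",
}


def clean_occurrences(records: Iterable[Occurrence],
                      synonyms: Dict[str, str]) -> List[Occurrence]:
    """Drop unusable records, harmonize names and remove duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        # Discard records with missing or impossible coordinates.
        if rec.lat is None or rec.lon is None:
            continue
        if not (-90 <= rec.lat <= 90 and -180 <= rec.lon <= 180):
            continue
        # Normalize whitespace and capitalization, then map synonyms.
        name = " ".join(rec.species.split()).capitalize()
        name = synonyms.get(name, name)
        # Treat the same species, place and date as one observation,
        # even when it was reported by more than one source.
        key = (name, round(rec.lat, 4), round(rec.lon, 4), rec.date)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(replace(rec, species=name))
    return cleaned


raw = [
    Occurrence("Dendroica petechia", 45.42, -75.69, "2021-06-01", "atlas"),
    Occurrence("setophaga  petechia", 45.42, -75.69, "2021-06-01", "community"),
    Occurrence("Parus atricapillus", None, -75.70, "2021-06-02", "community"),
    Occurrence("Poecile atricapillus", 45.40, 190.0, "2021-06-02", "atlas"),
    Occurrence("Poecile atricapillus", 45.40, -75.70, "2021-06-03", "atlas"),
]
cleaned = clean_occurrences(raw, SYNONYMS)
```

Whether records that share a species, location and date should be collapsed is itself a judgment call that depends on the objectives defined in Step 1; for abundance analyses, repeated counts may need to be retained rather than removed.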
We acknowledge that using existing data is not always easy. Conservation has long been plagued by the “file-drawer” phenomenon, whereby important work is “lost” in grey literature and government reports (Haddaway and Bayliss, 2015; Buxton et al., 2021b), and ineffectual solutions risk being tried repeatedly due to publication bias against null results (Wood, 2020). Community science and other similarly large and complex datasets often require extensive computational and technical abilities to use in a sensible and useful way. This can cost time and resources. However, efforts made now to overcome these challenges can serve to improve future conservation research. We recommend that scientists continue to follow emerging best-practice open data protocols (Roche et al., 2014; Costello and Wieczorek, 2014); that they look to existing, open resources first to answer their data needs; and that they take time to learn how alternative data sources may be able to contribute to informing conservation decisions.
Step 5: Explore methods for integration. Many datasets, particularly those that rely on opportunistic observations, are subject to spatial and temporal biases (Geldmann et al., 2016; Boakes et al., 2010). In areas that are harder to access, such as the northern Boreal regions of Canada, higher elevations, or simply areas with less road access, data can be scarce (Munson et al., 2010). These data can be harmonized with other datasets to fill geographic, temporal and taxonomic gaps in coverage. For example, Link and Sauer (2007) integrated winter bird counts from the Christmas Bird Count with breeding-season bird counts from the Breeding Bird Survey to determine the relative impacts of seasonal threats on overall population declines in Carolina Wrens, overcoming seasonal gaps in coverage in each program. Similarly, Rosenberg et al. (2019) used 13 different data sources, including community science programs, to estimate the alarming loss of avian abundance over the past few decades across North America. When multiple datasets are required to fit the spatial, temporal, or taxonomic needs of the study, researchers can use well-established methods of data integration to fill these gaps. Fletcher et al. (2019) review the main approaches to data integration: data pooling, independent models, auxiliary data, informed priors and integrated models. Additionally, offsets accounting for different probabilities of detection among surveys are increasingly being used for integrative purposes (Miller et al., 2021). With the advancement of computing capabilities and statistical approaches, substantial research over the last few decades has been focused on developing methods to use and integrate these large datasets in a reliable and sensible way (Feldman et al., 2021; Sullivan et al., 2014). These fields have developed enough that it is now time to put those data to work.
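To illustrate how detectability offsets can place counts from different surveys on a common scale, the sketch below pools counts from a structured and an unstructured survey under a shared Poisson model. It assumes that every site shares one underlying density, that each count is Poisson-distributed with mean equal to density × detection probability × effort, and that the detection probabilities are already known; the `Survey` record, the numbers and both function names are hypothetical. Under those assumptions, the maximum-likelihood estimate of density is the total count divided by the total detectability-weighted effort.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Survey:
    """One survey event (hypothetical schema)."""
    count: int        # individuals counted
    detection: float  # assumed-known probability of detecting an individual
    effort: float     # survey effort, e.g. hours or area covered
    protocol: str     # label for the contributing program


def naive_pooled_density(surveys: Sequence[Survey]) -> float:
    """Pool counts per unit effort, ignoring detectability differences."""
    total_effort = sum(s.effort for s in surveys)
    if total_effort <= 0:
        raise ValueError("total effort must be positive")
    return sum(s.count for s in surveys) / total_effort


def offset_pooled_density(surveys: Sequence[Survey]) -> float:
    """Maximum-likelihood density when count ~ Poisson(density * detection * effort).

    The product detection * effort acts as a per-survey offset, so the
    estimate is the total count divided by the summed offsets.
    """
    exposure = sum(s.detection * s.effort for s in surveys)
    if exposure <= 0:
        raise ValueError("total detectability-weighted effort must be positive")
    return sum(s.count for s in surveys) / exposure


surveys = [
    Survey(count=12, detection=0.8, effort=1.0, protocol="structured"),
    Survey(count=8, detection=0.8, effort=1.0, protocol="structured"),
    Survey(count=3, detection=0.4, effort=1.0, protocol="unstructured"),
    Survey(count=5, detection=0.4, effort=1.0, protocol="unstructured"),
]
naive = naive_pooled_density(surveys)
adjusted = offset_pooled_density(surveys)
```

In this toy case the offset-adjusted estimate is higher than the naive one because the unstructured surveys detect a smaller share of the individuals present. Real applications would typically estimate detectability jointly with abundance, or draw it from a dedicated source, rather than treating it as fixed.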
New data collection should be prioritized where existing data cannot be used to fill these gaps. For example, eBird community science data were used along with targeted professional monitoring to prioritize dynamic conservation action in central California, creating temporary wetlands for migratory shorebirds amid extensive agricultural landscapes (Reynolds et al., 2017). The eBird data were effective in producing estimates of avian abundance and occurrence across the broader landscape, but the professional monitoring improved the accuracy of estimates on the private properties where the conservation action was being implemented, and where community scientists were unable to survey (Robinson et al., 2020). Analytical tools such as Value of Information (VOI) analysis can be used to quantify the need for collecting more data to inform such prioritizations (Bennett et al., 2018). If deemed necessary, spatial data gaps can then be targeted for additional monitoring, either by professionals, community scientists, or both. Increasing the use of decision-science frameworks such as VOI can help researchers decide whether new monitoring is truly necessary to inform effective action.
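One common form of VOI analysis is the expected value of perfect information (EVPI): the expected outcome if uncertainty were resolved before acting, minus the expected outcome of the best action under current uncertainty. The sketch below computes it for a toy decision; the candidate threats, management actions, prior probabilities and benefit scores are invented purely for illustration and are not drawn from any cited study.

```python
from typing import Dict


def expected_value_of_perfect_information(
    prior: Dict[str, float],
    benefit: Dict[str, Dict[str, float]],
) -> float:
    """Return EVPI for a discrete decision problem.

    prior   maps each possible state of the system to its probability.
    benefit maps each action to its outcome under each state.
    """
    if abs(sum(prior.values()) - 1.0) > 1e-9:
        raise ValueError("prior probabilities must sum to 1")

    # Best single action when we must act under current uncertainty.
    best_under_uncertainty = max(
        sum(prior[state] * outcomes[state] for state in prior)
        for outcomes in benefit.values()
    )
    # Expected outcome if the true state were known before choosing.
    expected_with_certainty = sum(
        prior[state] * max(outcomes[state] for outcomes in benefit.values())
        for state in prior
    )
    return expected_with_certainty - best_under_uncertainty


# Hypothetical example: which threat is driving a population decline?
prior = {"predation": 0.6, "habitat_loss": 0.4}
benefit = {
    "control_predators": {"predation": 10.0, "habitat_loss": 2.0},
    "restore_habitat": {"predation": 4.0, "habitat_loss": 9.0},
}
evpi = expected_value_of_perfect_information(prior, benefit)
```

Because EVPI is an upper bound on what any additional information could be worth, monitoring that would cost more than this amount (expressed in the same units as the benefit scores) would be hard to justify in this toy analysis, and acting on current information would be favoured.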
Like finding and using existing data, integrating data from multiple sources is often easier said than done. Specifically, the variable quality and quantity of data can present substantial analytical challenges, as can data collected using different survey protocols (Pacifici et al., 2017; Miller et al., 2019). Furthermore, using more data is not guaranteed to improve parameter estimation (Simmonds et al., 2020); like collecting more data, integrative approaches should be used only when warranted. Despite these challenges, using multiple sources can limit our susceptibility to biases and incorrect inferences that may result from a single source alone (Feng and Che-Castaldo, 2021; Munson et al., 2010). Large, unstructured datasets are not always sufficient for research purposes (Bayraktarov et al., 2019; Knight et al., 2021), but they certainly merit consideration, especially if they can be combined with targeted data collection to fill gaps.
Step 6: Use information to inform decisions. In Step 1 of the roadmap, we established that problems and objectives should be directly linked to conservation issues. Both Step 1 and Step 6 should engage relevant stakeholders, policy makers and conservation managers. Biodiversity monitoring can result in wasted data if the needs of conservation practitioners, policy makers and other relevant parties are ignored, and if they cannot find or access the information necessary to inform their decisions. As such, biodiversity research can be vastly improved by including these partners at every stage of the decision-making process whenever possible.
The goal of data collection for biodiversity conservation should go beyond monitoring; ultimately, the data should be used to inform action (Buxton et al., 2021b). Many open data sources provide baseline surveillance monitoring, which can be critical for detecting emerging conservation issues (Lindenmayer et al., 2012; Dickinson et al., 2010; Wintle et al., 2010) and tracking population declines (Hudson et al., 2017). While detecting these declines is important, monitoring alone does nothing to prevent or reverse them. The Southern Resident Killer Whale has been monitored extensively since the 1960s and the threats contributing to its decline are relatively well understood (Ellis, 2018), yet little progress has been made on the species’ recovery since its listing in the early 2000s (Lacy et al., 2017). While the scientists themselves have provided important information to policymakers regarding actions such as fishing and boat traffic reductions and pollution mitigation, the impact of their monitoring has been limited by lack of political will.
One of the strengths of using existing data is that conservation practitioners and researchers can avoid the time and expense of collecting further information when doing so is not warranted. Although many existing datasets were designed for baseline surveillance monitoring (Dickinson et al., 2010), these data can and should be used to test hypotheses (Yoccoz et al., 2001) and inform decisions (Ruiz-Gutierrez et al., 2021). There are also many projects that are designed with a specific threat or issue in mind, presumably to be used for research in these areas (e.g., Fitzgerald et al., 2014).
Discussion
The potential for data waste exists at multiple stages of the scientific process. Data loss (Fig. 2) can occur when long-term research databases are not optimally maintained (Bledsoe et al., 2022). Furthermore, the open science movement has thus far focused on making data public and accessible (Bledsoe et al., 2022; Purgar et al., 2021), but accessing published data is not always feasible, even when the authors provide a statement that the data are freely available upon request (Gabelica et al., 2022). Published data can even be lost within a research team when data-owners (such as graduate students or research associates) change institutions and do not adhere to proper data storage and handover practices. To avoid loss, data should adhere to the FAIR principles: Findable, Accessible, Interoperable, and Reusable (FORCE11, 2014). An important aspect of FAIR principles is that they do not require data to be open access, which is a vital consideration in conservation research when working with sensitive data such as endangered species locations or with Indigenous communities and traditional knowledges. Ensuring that research data are managed according to FAIR principles allows the necessary safeguards to be placed around sensitive data, such as licensing, copyright, and access controls, without diminishing the opportunity for data re-use.
Fig. 2. Examples of inefficiencies that contribute to data waste in conservation research. The quantity and quality of data that are ultimately available to inform conservation action are depleted when they are not openly available and accessible, but also when they are neglected in favour of new data collection and when they are not used to inform tangible action. The quantity of available data that remain neglected due to being inaccessible, underused or ignored is currently unknown.
However, the problem of data waste is only partially alleviated by making data more FAIR. Open community science data, for example, have proliferated in recent years, but are often relegated to surveillance monitoring only (Dickinson et al., 2010; Nichols and Williams, 2006), despite their demonstrated capacity to inform decisions (Howell et al., 2022; Ruiz-Gutierrez et al., 2021). There are many opportunities to increase the use of community science data for applied research that have yet to be realized (Soroye et al., 2022). These data are underused if they are not substantially incorporated in actionable research (Buxton et al., 2021b).
We acknowledge the importance of biodiversity monitoring, and in particular long-term monitoring programs. Available data do not always suit the needs of conservation research and practice, and there are geographical and taxonomic gaps in our knowledge (Binley et al., 2023a; Titley et al., 2017) that must still be filled to inform effective conservation action. Furthermore, monitoring can be critical for discovering previously unknown conservation issues, engaging with the public (Dickinson et al., 2012), convincing managers to act (Venus and Sauer, 2022), and assessing the efficacy of actions. However, while some data gathering is valuable, it can also be expensive and time consuming, and so should not be undertaken lightly. Over half of data deficient species are suspected to be at risk of extinction (Borgelt et al., 2022), and are potentially in need of urgent action. Open databases of existing data are often underutilized (Binley et al., 2021) and viewed as a last resort (Buxton et al., 2020b; Howell et al., 2022). Increasing the use of existing data could help redistribute conservation resources so that more can be spent on action. For example, data on birds are widely available through community science programs (e.g., Butcher et al., 1990; Hudson et al., 2017; Sullivan et al., 2014). Seeing as birds receive more conservation funding than other taxonomic groups (Gordon et al., 2019) and avian monitoring and research make up approximately a quarter of bird conservation budgets (Buxton et al., 2020b), this represents a potential opportunity to minimize redundant efforts and spending and redistribute precious resources.
Conclusion
Overcoming the knowledge-action gap in conservation biology is an active area of research (Roche et al., 2021; Buxton et al., 2021a). A data-knowledge gap could compound the effects of the knowledge-action gap in hindering the preservation of biodiversity (Bayraktarov et al., 2019). Vast amounts of time and money are spent not only on collecting data but also on recovering lost data (Bledsoe et al., 2022). Given the time-sensitive nature of many conservation actions, and the extensive resources that go into data collection, one of our goals as conservation scientists should be to minimize data waste (Binley et al., 2023b). We can do so both by improving the availability of our own data and by first looking for what is already available before setting out to collect more.
Funding
ADB, JGV and JRB are funded through NSERC (Discovery and Alliance) and ECCC. ADB is also funded through Mitacs and Carleton University. PS is grateful for supporting funds from the University of Ottawa, and an NSERC Postgraduate Scholarships-Doctoral award.