Citizen science contributes enormously to biodiversity monitoring (Chandler et al., 2017) by providing data that are potentially as useful as those collected by professional scientists, especially for research over large spatial and temporal extents (Callaghan et al., 2020). However, there are still major gaps in the taxonomic, temporal and spatial coverage of biodiversity data to track changes in species' abundance and distribution (Amano et al., 2016; Feldman et al., 2021; Wetzel et al., 2018). Furthermore, biodiversity encompasses not only the diversity of organisms but also the diversity of interactions between them and with their environment, which has been given scant attention by citizen science (Chandler et al., 2017; Groom et al., 2021). Ecological interactions are the foundation of ecology and the architecture of ecosystems, but observing these relationships is challenging (Jordano, 2016). The sheer number and complexity of interactions, as well as their detection probability in the field, limit the number of species, locations and time periods that can be studied.
However, both occurrence and interaction data are needed to understand, address and mitigate the consequences of the five major drivers of biodiversity loss (Díaz et al., 2019). For instance, land-use change may affect host-vector dynamics (Spence Beaulieu et al., 2019), pollution may lead to adaptations in species traits (Rech et al., 2022), climate change can affect trophic cascades and distribution shifts (van Gils et al., 2016), overexploitation may change former mutualisms (Speziale et al., 2018) and biological invasions can result in novel plant–pollinator networks (Parra-Tabla & Arceo-Gómez, 2021). The variety of direct and indirect effects of biotic and abiotic interactions is difficult to study, which poses a challenge that can only be addressed through global collective effort (Díaz et al., 2019).
To achieve more accurate and reliable monitoring and interaction data, we need to improve and integrate current methods (Besson et al., 2022; Kühl et al., 2020; van Klink et al., 2022), to better analyse existing data (Probert et al., 2022), and to extract more information from already collected data (Johnston et al., 2022). In the latter case, one untapped source of abundant data is the corpus of digital images and other media, generated and shared on citizen science platforms. As of May 2023, four such websites alone (Artportalen, iNaturalist, Observation.org, and Pl@ntNet) had collectively published over 78 million images through the Global Biodiversity Information Facility (GBIF). Almost all these photos were taken of organisms or signs of their presence in situ (e.g. nests, faeces, tracks) and thus may capture ecologically relevant information as a by-product. This type of additional information has been termed ‘secondary data’ (Callaghan et al., 2021, see Box 1 for details). Having recognised the potential, a growing number of studies have examined ecological questions using incidentally captured information. Yet, so far, the scientific community has been relatively blind to the opportunities to extend our knowledge of biodiversity from secondary data. We therefore speak of a ‘biodiversity blindness’, drawing on the concept used to describe the public's lack of attention to the presence and diversity of plant and animal life (Moscoe & Hanes, 2019).
Secondary data in the context of citizen science in biodiversity research, as we define it, refers to a subset of information that is unintentionally captured alongside primary data. Primary citizen science data collected for a specific research focus, such as monitoring the distribution of a species, provide information on the location and date of record of the species in addition to evidence of its occurrence. These primary data of ‘what?’, ‘when?’ and ‘where?’ are the intended focus of many ad hoc observing portals. In contrast, secondary data are ancillary details that are also present in the materials collected but were not the intended subject of the study. Indeed, the observer may have been unaware of the secondary information they evidenced.
Secondary data can offer valuable opportunities for additional research and analysis, enriching our understanding of ecosystems' functioning, population dynamics, natural behaviours and environmental conditions. They represent any retrievable pieces of information that can be seen on an image or in a video, heard on an acoustic recording or be included in a descriptive text. The information they contain may relate to features of individuals or populations, biotic interactions (including human–nature interactions), landscape and environmental conditions, or any other biotic or abiotic features or their combination (Figure 1). We recognise, of course, that the subject of investigation may be something other than the mere detection of species. In complex citizen science programmes, the primary data can only be separated from secondary data with reference to the objectives of the project. For example, in the COASST project (Parrish et al., 2017), citizen scientists record bird carcasses on beaches, providing not only an image but also a wealth of information about the morphology of the carcass and the state of the environment. In addition to the primary information collected, the images of dead birds may contain even more information than the research project anticipated, such as the presence of necrophagous species. Thus, secondary data are data that the methodology was not intended to capture, though there are no sharp demarcation lines.
FIGURE 1. Examples of secondary data captured in images vouchering for species occurrences. (a) background habitat of a fire salamander (Salamandra salamandra), and the unintended record of another species (the fungus) (b) the interaction between human and red fox (Vulpes vulpes), (c) interaction between a leaf miner and the host plant, (d) predation by the Mexican grass-carrying wasp Isodontia mexicana. Image copyrights: © https://www.inaturalist.org/photos/251871116, http://creativecommons.org/publicdomain/zero/1.0/ (a) © juniper_likethetree https://www.inaturalist.org/observations/155228701, https://creativecommons.org/licenses/by/4.0/ (b) ©ahabo https://www.inaturalist.org/photos/257468598, https://creativecommons.org/licenses/by/4.0/ (c) © Mattia Menchetti https://www.inaturalist.org/observations/129293402, https://creativecommons.org/licenses/by/4.0/ (d).
Following our definition of secondary data (Box 1), we do not explicitly refer to the metadata (e.g. timestamp or geolocation) associated with the occurrence record, though they also represent potential secondary data sources and play a major role in selecting datasets and their analyses. An important element in utilising secondary data is the explorative character of this research method, as the nature of this unintentionally recorded information may be unclear. The additional information may document biotic interactions and co-occurrences, which could provide not only important ecological information of the observed species but also records of the by-catch for respective monitoring programmes. Secondary data may also include details on morphology, behaviour, habitat and various other aspects of a species' traits and ecology.
This paper explores the opportunities and pitfalls of extracting secondary data from multimedia records of biodiversity. We also examine how advances in artificial intelligence can and might accelerate data extraction. Our goal is to illuminate the hidden treasure of biodiversity data contained in citizen science multimedia records and available openly to the scientific community. While the efforts to explore and exploit the realm of secondary data are still in their infancy, they already demonstrate numerous opportunities to enrich and inform biodiversity research.
RESEARCH OPPORTUNITIES AND TYPES OF SECONDARY DATA
Extracting secondary data from existing citizen science sources helps to address universal challenges of biodiversity research, such as taxonomic bias, detectability of species and their interactions, and recognition of spatio-temporal dynamics.
Taxonomic bias towards charismatic or well-known species in citizen science data poses a challenge for researchers and limits the possibilities of launching projects that deal with less popular, cryptic or under-researched species. In addition, although simple, unstructured programmes generate high numbers of citizen science observations, information that is more complex to record, such as an individual's health condition, is not purposefully collected and therefore not formally documented. Similarly, studies that are less engaging, such as those that are time-consuming, physically challenging or in less attractive localities are discriminated against. Extracting secondary data from existing observations could be fruitful to fill such data needs. For example, diurnal or seasonal activity patterns or vocal characteristics of rare species could be retrieved from soundscapes or the background of audio recordings of focal species. From images, occurrences of less charismatic arthropods or pathogenic fungi living on photographed plants could be extracted. In the latter case, in 2010, citizen scientists monitored and scored leaf damage on horse-chestnut trees (Aesculus hippocastanum) caused by the leaf-miner Cameraria ohridella in Great Britain (Pocock & Evans, 2014). Today, the infestation could also be detected as secondary data in images on which horse-chestnut trees are the primary observation, thereby improving the data situation at low additional costs and resource expenditure.
Using secondary data to take advantage of the taxonomic bias towards well-documented species also brings additional research opportunities. When observations for a given species are widely available, one can extract data on multiple aspects of interest, such as morphological traits or biotic interactions, without spending time and resources on launching and running a new raw data collection campaign. For example, Putman et al. (2021) used images of the secretive but thoroughly documented lizard Elgaria multicarinata that were primarily collected for determining the species' distribution in Southern California. From the images, the authors assessed predation pressure and health condition by measuring the lizards' tails and by looking for ectoparasites in the animals' ear regions. Likewise, citizen science photos have been used to identify subtle morphological differences between two very similar species of grasshopper and to establish their distributions (Pélissié et al., 2023).
Coincidental evidence can mitigate low detection probabilities. By identifying a rare species in primary observations of other species, whether through a biotic interaction or an incidental co-occurrence, the pool of observations can be enlarged. A citizen science project in Australia has shown that co-occurrence of a common and a rare possum species can lead to more detections of the latter (Steven et al., 2021). Aside from potentially increasing sample sizes of monitoring data for the benefit of statistical analyses, we can improve our understanding of ecological impacts on other species, including people. We envisage applications in pollination dynamics, invasion impact or climate change research. For example, we can potentially study the preferred flower species and colour in a network of native and exotic bumblebees and host plants (Catron et al., 2023; Fontúrbel et al., 2023). Another example is hair loss in moose (Alces alces) and wapiti (Cervus canadensis) caused by the expanding distribution of winter tick (Dermacentor albipictus) due to climate warming in Yukon, Canada (Chenery, 2023). Serendipity also plays a role in revealing ecological interactions; Rosa et al. (2022) not only found new and supposedly extinct species as primary observations, but also novel predatory interactions that were accidentally captured in the iNaturalist images of marine snails. Such chance discoveries based on the background information could be especially useful in invasion science, where secondary data may reveal new or hidden invasions or previously undocumented ecological processes that facilitate or hinder invasions.
Extracting secondary data from a series of observations across space and time can also support efforts to move from a mere single-species snapshot (an occurrence record) to spatio-temporal biodiversity dynamics. Using timestamps and geolocation metadata of citizen science observations to investigate spatial and temporal dynamics has been successfully applied before (e.g. Feldman et al., 2021; Newson et al., 2016). Given a sufficient temporal span and frequency of observations, we suggest linking secondary data to such a stamp to obtain a variety of observable dynamics. For example, a series of landscape images would not only contribute to monitoring data (e.g. the abundance and distribution of species on the images), but can also be useful for studying phenological dynamics at the community level (Hofmeester et al., 2020).
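As a minimal sketch of linking secondary data to a timestamp, the following Python snippet bins annotated observations by month and reports the share of records showing a given phenological stage. The records, the stage labels and the function name `monthly_stage_share` are hypothetical and serve only as an illustration; they are not taken from any of the cited studies.

```python
from collections import Counter
from datetime import date

# Hypothetical annotations: (observation date, phenological stage seen in the photo).
records = [
    (date(2022, 4, 3), "budding"),
    (date(2022, 4, 20), "flowering"),
    (date(2022, 5, 2), "flowering"),
    (date(2022, 5, 15), "flowering"),
    (date(2022, 6, 1), "fruiting"),
]


def monthly_stage_share(records, stage):
    """Return {month: share of that month's records that show `stage`}."""
    totals = Counter(d.month for d, _ in records)
    hits = Counter(d.month for d, s in records if s == stage)
    # Counter returns 0 for months without any matching record.
    return {month: hits[month] / totals[month] for month in sorted(totals)}


flowering_by_month = monthly_stage_share(records, "flowering")
print(flowering_by_month)
```

With real data, the same grouping could be applied per region or per year, provided that the number of observations in each bin is large enough to be meaningful.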
Figure 2 illustrates how secondary data can add contextual dimensions to primary species observations, thereby mitigating ‘biodiversity blindness’ by expanding on the information in citizen science multimedia records beyond geographical locations and time. This context applies to different scales and scopes of consideration, specifically on the level of individuals, populations, communities, the surrounding environment and the human dimension. For each of these levels, Table 1 gives extensive lists of types of information contained in secondary data.
FIGURE 2. Illustrative relationship between primary and secondary data as sources for different types of information. Image copyright: © zilpzalp17, https://www.inaturalist.org/observations/148103702, https://creativecommons.org/licenses/by-nc/4.0/.
TABLE 1. Types of information extractable from secondary data from citizen science projects. As the literature on secondary data is scarce and obscure due to a missing common terminology, we also listed examples that used secondary information in combination with other sources and approaches (e.g. iEcology or literature). The table groups the publications by level (human–nature interactions, features of the individual, etc.) as described. For each study it gives the feature of interest (extracted data) and the data elements used to extract it as well as a short description (study example) of the content and the methods and sources used (source and extraction method).
Data element | Extracted data | Study example | Source and extraction method^a | Reference |
Observer/human–nature interactions | ||||
Interpretation of photo | Animal behaviour | Human–coyote encounter in urban environments, assessing timing, level of urbanisation and aggressiveness | iNaturalist; manual extraction by visual inspection and categorisation of coyote behaviour | Drake et al. (2021) |
Interpretation of photo | Association with human infrastructure | Investigations of wrens nesting in human stable or unstable structures by examining citizen science images | iNaturalist, eBird, WikiAves and field data; manual extraction by visual inspection of nesting sites | Alexandrino et al. (2022) |
Interpretation of photo | Cause of death | Taxa and seasonality of roadkills in the USA | Tagged iNaturalist images (‘roadkill’, ‘dead on road’); manual extraction by visual inspection, for example, species identification | Unger (2022) |
Features of the individual | ||||
Time and interpretation of photo | Diurnal patterns | Diurnal pattern of a lizard | iNaturalist, HerpMapper; manual extraction by visual inspection of daytime and light on the images | Blais and Shaw (2018) |
Date and interpretation of photo | Phenology | Fine-scale delimitation of the flowering phenology of Yucca species | iNaturalist; manual extraction by applying a scoring rubric and consensus method | Barve et al. (2020) |
 | | Progression of phenological stages of Alliaria petiolata | iNaturalist; deep learning algorithm to classify phenological stages | Reeb et al. (2022) |
Identification and interpretation of photo | Hybrids | Assessing hybridisation by colour variation among grass snakes across Europe | iNaturalist; manual extraction by visual inspection following an identification key for several coloured features | Fritz and Ihlow (2022) |
Measurement and interpretation of photo | Morphological traits | Pigmentation of wings in male damselflies across broad spatial scales | iNaturalist; half-automated extraction of size and place of pigmentation using visual inspection and ImageJ | Drury et al. (2019) |
 | | Melanisation of grey squirrels in relation to urbanisation and temperature | iNaturalist and SquirrelMapper; manual extraction of squirrel colour by visual inspection in the form of a citizen science project | Cosentino and Gibbs (2022) |
 | | Length ratio measurements to identify two closely related grasshopper species | iNaturalist; length ratios from photos using ImageJ | Pélissié et al. (2023) |
Interpretation of photo | Behavioural traits | Warming-up behaviour of differently coloured rattlesnakes and their corresponding survival rates | iNaturalist; manual extraction and classification of colour, living or dead and background | Rhodes et al. (2022) |
Date and interpretation of photo | Life history | Breeding biology of swallow-tailed hummingbirds | WikiAves, iNaturalist, eBird; manual extraction of classes of breeding stages (nesting, eggs, fledglings, etc.) and using dates for calculation of breeding timings | Turella et al. (2022) |
Interpretation of photo | Health | Body condition of seals | Applying images of the New Zealand Leopard Seal Photograph Library (NZLSPL) to a self-developed scoring system | Warren (2021) |
Measurement and interpretation of photo | Injuries/damage | Frequency of wing damage in migrating butterflies | Comparing images of butterfly wings with a reference image and estimating frequency/coverage of damage using Inkscape and MATLAB | Korkmaz et al. (2022) |
Biotic interactions and co-occurrence | ||||
Identification | Flower visitation/pollination | Identification of host flowers of the introduced wasp Isodontia mexicana with an iNaturalist-Pl@ntNet workflow | iNaturalist; automatic data exchange tool approach validated by visual inspection | Pernat et al. (2022) |
Identification and interpretation of photo | Species conflicts | Competition for nest cavities between bee species and wild honey bee colonies in its native and introduced ranges | iNaturalist and literature review; from images manual extraction of country, combs and entrance characteristics of nests | Saunders et al. (2021) |
Identification and interpretation of photo | Diet | Analysing prey items in correlation with the age and sex of sparrowhawks by images from different sources | Google Images, Macaulay Library, iNaturalist, BirdGuides, Facebook, and Twitter by automatic or manual retrieval; extraction of sparrowhawk age, sex, and prey species by visual inspection | Panter and Amar (2021) |
Identification and interpretation of photo | Host preferences | Host preferences and phenological peak of hairworms | iNaturalist and literature review; manual extraction by visual inspection of infected host species | Doherty et al. (2021) |
Interpretation of photo | Disease | The proportion of Caryophyllaceae plants infected with anther-smut disease in eastern USA, comparing herbarium and citizen science data | iNaturalist and herbarium data; manual extraction and classification—infected or not—by visual inspection | Kido and Hood (2020) |
Identification and interpretation of photo | Predation | Using images to investigate the spectrum of prey of assassin bugs (Hemiptera: Reduviidae) | Flickr, iSpot Nature, BugGuide, NatureWatch, Google Images; extraction of predation events and preyed species by visual inspection | Hernandez et al. (2019) |
Measurement and interpretation of photo | Parasitism | Assessing parasitism and predation of lizards in an urban environment by measuring tail/body ratio and number of ticks | iNaturalist, HerpMapper; manual extraction of tick number by visual inspection, half-automated extraction of relative body size with measurement via ImageJ | Putman et al. (2021) |
 | | Identification of black spot infections in fishes | iNaturalist; manual extraction of the proportion of infected fish species by visual inspection | Happel (2019) |
Measurement, identification and interpretation of photo | Basibionts | Occurrence and environmental conditions of epibiotic algae on gastropods detected from images of a corresponding citizen science project in Japan | Dedicated citizen science project collecting images via Google Form, e-mail and Twitter; manual extraction of substrate and condition by visual inspection, coverage of epibiotic algae measured with ImageJ | Kagawa et al. (2020) |
Identification and interpretation of photo | Cause of death | Quantity and causes of shark stranding on a global scale | Literature, iNaturalist, YouTube, Twitter, Facebook, Instagram; manual extraction of species, sex and potential cause of stranding | Wosnick et al. (2022) |
Features of the population and community | ||||
Measurement and interpretation of photo | Mating system | Evolution of mating-related wing ornamentation of dragonflies influenced by climate warming | iNaturalist; half-automated extraction of wing pigmentation with ImageJ (following Drury et al., 2019) | Moore et al. (2021) |
Quantification and identification | Population size | Decline of introduced llama population in Italy estimated by recognition of individuals on images | iNaturalist, Facebook, Twitter; manual extraction of coat pattern and age (by body size) by visual inspection | Gargioni et al. (2021) |
Quantification and identification | Sex ratio, age groups | Sex ratio and proportion of life stages of manta rays | Images of MantaMatcher (citizen science) combined with YouTube, Facebook, Instagram, Flickr and Vimeo and private collections; extraction of sex and life stages by visual inspection | Knochel et al. (2022) |
Environmental features | ||||
Interpretation of photo | Habitat type | Classification of marine biotic habitats from image background and evaluating results with professional reef life survey | iNaturalist; manual extraction and classification of biotic habitats by visual inspection | Bolt et al. (2022) |
Interpretation of photo | Cause of death | Correlating weather and catastrophic events with images of dead birds | iNaturalist project dedicated to bird mortality | Yang et al. (2021) |
Interpretation of photo and location | Substrate | Substrate choices of native oyster species Ostrea lurida | iNaturalist, fieldwork and literature; manual extraction of the substrate information by visual inspection of images in combination with observer notes and Google Earth | Kornbluth et al. (2022) |
^a Only the sources and methods used to obtain secondary data whose data type was the subject of the study (feature, interaction, etc.) are listed. In most cases, the analyses also used geo-locations, dates and metadata that were part of the primary observation.
SECONDARY DATA ARE SLOWLY DIFFUSING INTO THE SCIENTIFIC LITERATURE
Studies using secondary data (Table 1) have mostly focused on the extraction of morphological information, such as the pigmentation on wings of Calopterygidae damselflies (Drury et al., 2019), coloration patterns of grass snakes (Fritz & Ihlow, 2022), and intra- and interspecific variabilities in coloration of birds and plants (Laitly et al., 2021). Some studies also used secondary data to assess human–nature interactions such as bat handling during the COVID crisis (Van der Jeucht et al., 2021), to classify marine habitats using image backgrounds (Bolt et al., 2022), and to identify plants visited by hummingbirds (Marín-Gómez et al., 2022). Secondary data from citizen science are often combined with iEcology or culturomics data sources (Jarić et al., 2020) or museum collections (Box 2). Examples include a dietary study of African snakes (Maritz & Maritz, 2020), arthropod parasitism by hairworms (Doherty et al., 2021) and the distribution of anther-smut disease in the Caryophyllaceae plant family (Kido & Hood, 2020).
The use of secondary data in research is similar to the emerging areas of conservation culturomics and iEcology (Jarić et al., 2020). Culturomics seeks to understand human culture through the quantitative analysis of changes in word frequencies in large bodies of digital texts (Michel et al., 2011). In the context of biodiversity, the emergent area of ‘conservation culturomics’ focuses on the relationship between people and nature (Ladle et al., 2016), informed by contents of various types of online data. iEcology, on the other hand, is an umbrella term for analysing various types of digital data generated or collected for purposes other than ecological research to obtain insights into ecological questions. In contrast, in citizen science projects, people consciously contribute to the goal of a particular activity, such as biodiversity monitoring or invasive species detection (Marchante et al., 2023).
Citizen science has already proven useful in mapping and tracking biological invasions (Encarnação et al., 2021). The additional information that comes with secondary data could reveal even more aspects of the invasion process, thereby supporting invasive species management. For example, first approaches explored the host plants of introduced pollinators (Bila Dubaić et al., 2022; Guariento et al., 2019; Pernat et al., 2022) and cavity occupancy by wild honey bees (Apis mellifera) in Australia (Saunders et al., 2021).
Reanalysis of images has also been used for trait-based studies to characterise, for example particulate matter in the global oceans (Trudnowska et al., 2021) and the feeding habits of marine copepods (Vilgrain et al., 2021), although these studies did not use citizen science data. The potential for extracting functional traits from images, either directly measured or inferred by combining visible features with context metrics from the metadata, has been thoroughly considered for plankton (Orenstein et al., 2022).
SECONDARY DATA EXTRACTION COULD BE ACHIEVED ALONG A GRADIENT OF HUMAN AND ARTIFICIAL INTELLIGENCE
As approaches to obtain secondary data are just emerging, such data are still mainly extracted manually. This can be challenging when thousands of images need to be interpreted and evaluated. For example, the aforementioned study of anther-smut infection within the Caryophyllaceae examined 79,801 iNaturalist images (Kido & Hood, 2020). There is much to be gained from automation that could scale up the process to millions of images, particularly for pre-selecting images and recognising relevant image features. For example, computer vision could be used to extract and analyse information on colour in images, such as the greenness of plants (Yuke, 2019), and deep learning models could be used to detect, count and classify specific features of interest (Bjerge et al., 2023; Mann et al., 2022). Likewise, algorithms and pretrained dictionaries in Natural Language Processing could make use of the textual content in secondary data, such as image captions, commentaries and tags. Automated systems would also facilitate real-time analysis of biodiversity dynamics, making them particularly useful for informing decision-makers regarding effects of conservation efforts or as early warning tools (van Klink et al., 2022).
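To make the idea of extracting colour information concrete, the sketch below computes one common greenness index, the green chromatic coordinate G / (R + G + B), averaged over a set of pixels. This is an illustrative assumption on our part and not necessarily the measure used in the cited work; the pixel values and the function name are hypothetical, and a real workflow would first load and possibly mask the image with an image-processing library.

```python
def green_chromatic_coordinate(pixels):
    """Mean G / (R + G + B) over an iterable of (R, G, B) tuples."""
    values = []
    for r, g, b in pixels:
        total = r + g + b
        if total > 0:  # skip pure black pixels to avoid division by zero
            values.append(g / total)
    return sum(values) / len(values) if values else 0.0


# Hypothetical 2 x 2 image patch, flattened to a list of RGB tuples.
patch = [(30, 120, 30), (40, 100, 60), (0, 0, 0), (50, 50, 50)]
gcc = green_chromatic_coordinate(patch)
print(round(gcc, 3))
```

Because the index is a ratio, it is less sensitive to overall brightness than the raw green channel, although differences between cameras and lighting conditions would still need to be considered.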
Despite the obvious appeal of machine learning for automatic data extraction from citizen science sources, several obstacles must be overcome before its full potential can be realised. Developing robust models that effectively handle diverse and noisy datasets is challenging and resource-intensive. Nevertheless, for some tasks, existing tools may be customised or applied directly. Multiple trained deep learning models to screen multimedia for human or natural objects are freely available. For instance, object detection models, which are often pretrained and benchmarked on the COCO dataset (Lin et al., 2014) containing 80 different object categories (including birds and other animals), may already provide relevant secondary data output. Moreover, models and resources exist for specific groups of organisms and data types: Merlin Bird ID and BirdNET (Kahl et al., 2021) for bird detection based on sound (the former can also identify species from images); the Pl@ntNet API for plants; a test dataset for insects (Bjerge et al., 2023); FishID for fish species in images; MegaDetector or TrapTagger for animals in camera trap photos; and BatDetect2 and BatNet for bats in sound recordings (Aodha et al., 2022; Krivek et al., 2023). Additionally, customised models can be trained on open datasets, for example, FathomNet for marine organisms (Katija et al., 2022), Pl@ntNet for plants and iNaturalist for a range of different species. Importantly, even with readily available models, manual resources and expertise are required to ensure the anticipated model behaviour and performance on new data.
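The following sketch illustrates how the output of a generic COCO-pretrained object detector might be screened for candidate secondary data. The detection format (a list of dictionaries with `label` and `score`), the threshold and the function name are assumptions made for illustration and do not correspond to the interface of any particular library; the ten animal category names are those of the COCO label set.

```python
# The ten animal categories among the 80 COCO object classes.
COCO_ANIMAL_CLASSES = {
    "bird", "cat", "dog", "horse", "sheep",
    "cow", "elephant", "bear", "zebra", "giraffe",
}


def secondary_animal_detections(detections, primary_label, min_score=0.5):
    """Keep confident animal detections that are not the primary subject."""
    return [
        d for d in detections
        if d["label"] in COCO_ANIMAL_CLASSES
        and d["label"] != primary_label
        and d["score"] >= min_score
    ]


# Hypothetical detector output for one image whose primary subject is a horse.
detections = [
    {"label": "horse", "score": 0.97},
    {"label": "bird", "score": 0.81},
    {"label": "person", "score": 0.90},
    {"label": "dog", "score": 0.32},
]
hits = secondary_animal_detections(detections, primary_label="horse")
print(hits)
```

Such coarse categories would only flag images for closer inspection; identification to species level would still require a specialised model or a human expert.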
In other cases, models and analysis pipelines may need to be developed from scratch. Where models or training data are not available, the cost–benefit ratio of developing new artificial intelligence models should be weighed against the use of human-mediated approaches. For example, Mann et al. (2022) developed an approach to automatically detect flowering plants in images, which were then examined by citizen scientists for the rare presence of insects.
Efficient processing is relevant when dealing with large amounts of data, but it is critical to consider the resources needed. Developing custom automated methods, which may prove more broadly useful and applicable, and setting up and maintaining manual processing pipelines (e.g. citizen science projects or recruiting and managing volunteers) may differ in terms of time, costs and personnel demands as well as output quality. In any case, to address the uncertainty in exploratory analyses of secondary data variables, that is, to get an idea of what kind of additional information primary datasets contain, a subset of data would most often be analysed manually. This pre-processing can inform researchers about which methodologies to apply for larger scale extraction of information.
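A manually analysed subset of this kind could be drawn as a simple random sample. The sketch below shows one reproducible way of doing so in Python; the record identifiers, the sample size and the seed are hypothetical, and a stratified design (e.g. by region or season) may be preferable in practice.

```python
import random


def draw_pilot_sample(record_ids, n, seed=42):
    """Draw a reproducible random sample of record IDs for manual screening."""
    rng = random.Random(seed)  # fixed seed so that the sample can be re-created
    ids = sorted(record_ids)   # sort first so the input order does not matter
    return rng.sample(ids, min(n, len(ids)))


# Hypothetical identifiers of 1000 observations with images.
all_ids = range(1, 1001)
pilot = draw_pilot_sample(all_ids, 50)
print(len(pilot))
```

Documenting the seed and the sampling frame alongside the annotations allows others to reproduce the pilot subset.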
We expect that future secondary data extraction will be performed on a continuum between fully human and fully automated approaches with the respective advantages and disadvantages along this spectrum. Hybrid intelligence, that is, the combination of deep learning and human diligence (Mann, 2022; Rafner et al., 2021), can be effectively used to extract and analyse secondary data. Primary data can be filtered manually and, if necessary, annotated or immediately tested for relevant secondary data in the case of an existing algorithm. Conversely, one or more features can be selected from images (or other types of media) by an algorithm to be processed afterwards by a human (e.g. for annotation, validation or analysis; Figure 3).
FIGURE 3. Interaction and possible applications of methods to extract/retrieve secondary data. Our sample image of the primary observation of a raptor (a) also happened to include a black woodpecker and two plant species as secondary data (b). These exemplary secondary data could be extracted by humans or artificial intelligence only or by both in hybrid intelligence approaches (c). (image: © zilpzalp17, https://www.inaturalist.org/observations/148103702, https://creativecommons.org/licenses/by-nc/4.0/).
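One simple form of such a hybrid workflow is to accept model predictions above a confidence threshold and to route the remainder to human reviewers. The sketch below illustrates this under stated assumptions: the prediction format, the threshold of 0.9 and the function name are hypothetical and would need to be calibrated against validated data.

```python
def route_predictions(predictions, accept_at=0.9):
    """Split predictions into auto-accepted ones and ones needing human review."""
    accepted, needs_review = [], []
    for p in predictions:
        if p["score"] >= accept_at:
            accepted.append(p)
        else:
            needs_review.append(p)
    return accepted, needs_review


# Hypothetical model output for three images.
predictions = [
    {"image_id": "img_001", "label": "woodpecker", "score": 0.96},
    {"image_id": "img_002", "label": "woodpecker", "score": 0.55},
    {"image_id": "img_003", "label": "no secondary taxon", "score": 0.91},
]
auto, manual = route_predictions(predictions)
print(len(auto), len(manual))
```

Lowering the threshold reduces the human workload at the cost of more unverified records, so the choice reflects the trade-off between effort and data quality discussed above.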
Another challenge in applying artificial intelligence to secondary data studies is not knowing which data variables to look for and how to select or develop a suitable identification algorithm or, simply put, how to search for the unknown unknowns. The human eye is able to identify the unexpected, whereas an algorithm only recognises what is expected of it, that is, what it was trained for. In order to leverage the power of artificial intelligence for effective data extraction, data collection generally needs to be guided by precise research questions and must be based on a priori identification of variables of interest. As detection models trained to recognise an increasing number of objects, or segmentation models able to distinguish different areas in images (Kirillov et al., 2023), emerge, these may become increasingly relevant for exploration of secondary data without predetermined knowledge of what to look for. As artificial intelligence tools develop to analyse, visualise and synthesise multimodal data (audio, video, text, etc.), it is likely that they will be able to recognise multiple features of interest. Identifying patterns across media types, including insights overlooked by a researcher, and even suggesting mechanisms and hypotheses that could explain such patterns, is also within reach. These models should always be applied with careful consideration of potential model biases that can skew results.
WHY ARE WE NOT THERE YET?
Citizen science multimedia records are clearly more than meets the eye. To protect biodiversity, it is not only essential to inventory and monitor species, but also to understand the ecological networks they are part of. By giving many examples of current and possible future areas of application we demonstrated how secondary data offer the opportunity to extend and complement systematically collected interaction and monitoring data. Although we are convinced of the great potential in the untapped information, we still see some challenges to overcome and specific pitfalls to address.
Similar to the early days of the citizen science movement, the issue of bias can cast doubt on this new resource. Indeed, we suspect a similar bias in secondary data as in primary data (Isaac et al., 2014). Secondary data, however, would be less influenced by known recording behaviour (e.g. aesthetic preferences or charisma of observed target species) and more affected by previously less considered human actions. Staging of observed species in particular locations and environments, and cultural differences in what can be appropriately photographed, are imaginable examples. In these cases, scientists using secondary data can benefit from accelerated development and discussions in analysing opportunistic data to correct for bias (Johnston et al., 2022). Transparent handling of potential biases should be a given in both metadata and corresponding publications.
Of greater concern are biases from data generated by citizen science projects with unknown scientific goals. For example, when citizens in a project document a particular plant solely in forest habitat, a higher-level analysis of that plant's habitats based on image backgrounds would record an unrepresentative proportion of this plant in forests. Therefore, the source of data should be known; that is, in unclear cases, the project organisers would need to be contacted or the images excluded from the analysis. A thorough critique of data provenance may be particularly necessary if counterintuitive trends or patterns emerge during initial spatio-temporal visualisation or classification of data.
Major obstacles preventing the breakthrough of secondary data research concern methods for filtering and processing the vast amounts of data. Filtering primary data for the desired information is mainly a matter of metadata. More precise information provided by metadata for each image, text file, audio recording and other data types can speed up data acquisition and minimise errors. Citizen science platforms such as iNaturalist allow users to add more details about their records in corresponding observation fields or select specific features from the list of annotations. However, free-text fields are arbitrary, while feature lists are confining and may not include, for example, possible interactions between species.
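The difference between structured and free-text metadata can be illustrated with a small filtering sketch. It assumes records have already been exported as dictionaries. The field names (`interaction`, `description`), the keyword list and the example records are hypothetical and do not reflect the schema of any particular platform.

```python
# Hypothetical keywords that might hint at a species interaction in free text.
KEYWORDS = ("nectaring", "feeding on", "visiting")

# Invented example records; field names are illustrative only.
records = [
    {"id": 1, "interaction": "visits flower", "description": ""},
    {"id": 2, "interaction": None, "description": "Bee nectaring on lavender"},
    {"id": 3, "interaction": None, "description": "At the feeder"},
]


def interaction_source(record, keywords=KEYWORDS):
    """Return where an interaction hint was found, or None if there is none."""
    if record.get("interaction"):
        return "structured"  # explicit field: reliable and cheap to filter
    text = (record.get("description") or "").lower()
    if any(keyword in text for keyword in keywords):
        return "free_text"   # keyword match: needs human confirmation
    return None


# Keep only records with some hint, remembering how the hint was found.
candidates = {}
for record in records:
    source = interaction_source(record)
    if source is not None:
        candidates[record["id"]] = source
```

Record 3 is missed although a feeder implies foraging, which shows how keyword matching on arbitrary free text depends entirely on the wording chosen by the observer.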
Since improving metadata according to the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles is a worldwide effort, this problem will hopefully be solved with time. Similarly, the development of new and better machine-aided object recognition will allow large amounts of secondary data to be processed automatically in the future. In addition, approaches have been developed to not only recognise objects in multimedia data sources, but also to differentiate by anomalies or other species-specific features such as plant or animal colours (e.g. Hantak et al., 2022; Perez-Udell et al., 2023). Ultimately, the ever-improving models used to generate ecological networks also help to turn information into knowledge.
A more pressing issue is the legality and ethical defensibility of using millions of secondary data sources for purposes other than those intended when the primary observation was recorded and posted. Considerations of ethical and privacy issues are not exclusive to secondary data. They are also pertinent regarding the primary data used in iEcology and culturomics (Jarić et al., 2020), from which secondary data can be derived. While much of the primary data (e.g. online texts, images, videos or audio recordings) is publicly available, and in many cases people have given consent to its availability (e.g. by registering in citizen science or social media platforms), researchers are required to give careful consideration to how they collect, use and share these data (Di Minin et al., 2021; Thompson et al., 2021; Zimmer, 2010).
Ethics are particularly relevant when dealing with online data from social media, where work is often used or distributed without the owners' consent. In fact, most social media platforms allow posting nearly any content, as they are not able to automatically identify copyrighted material. This issue is less prominent on citizen science platforms, as the users have stronger control of posted data and media licensing. But the way such information is shared and scraped still opens various possibilities of copyright infringement in the digital space. As such, there is considerable uncertainty regarding situations in which acquiring permission and crediting authorship become mandatory. If media are not published under a Creative Commons licence, checking licences can become a time-consuming process. Especially when dealing with big data that are derived from multiple sources, it may be infeasible to directly contact the media owners to get permission for use (Leighton et al., 2016).
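A first-pass licence screen could separate media with a declared Creative Commons licence from media that need manual checking. The sketch below is only an illustration: the `licence` field, the licence strings and the matching rule are assumptions, since platforms encode licences in different ways, and a Creative Commons licence may still carry conditions (e.g. attribution or non-commercial use) that have to be respected.

```python
def licence_status(value):
    """Classify a licence string as creative_commons, check_manually or unknown.

    The matching rule is a simplifying assumption: strings starting with
    "cc" or containing a creativecommons.org URL count as Creative Commons.
    """
    if not value or not value.strip():
        return "unknown"
    normalised = value.strip().lower()
    if normalised.startswith("cc") or "creativecommons.org" in normalised:
        return "creative_commons"
    return "check_manually"


# Invented media records with a hypothetical "licence" field.
media = [
    {"id": "a", "licence": "CC-BY-NC 4.0"},
    {"id": "b", "licence": "https://creativecommons.org/licenses/by/4.0/"},
    {"id": "c", "licence": "All rights reserved"},
    {"id": "d", "licence": None},
]

# Group media identifiers by licence status.
by_status = {}
for item in media:
    by_status.setdefault(licence_status(item["licence"]), []).append(item["id"])
```

Only the `check_manually` and `unknown` groups would then require the time-consuming step of inspecting terms or contacting media owners.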
Ethical issues have to be carefully considered and are especially delicate when secondary data allow recognition of people, or allow the identification of contentious human interactions, such as illegal fisheries (Sbragaglia et al., 2021), poaching or trade in wild organisms (Di Minin et al., 2019; Zimmer, 2010). Di Minin et al. (2021) have suggested a set of guidelines that can help address ethical concerns in research when using such data. Likewise, while publicly sharing species location information is useful for research, disclosing the location and identification of rare or threatened species can become a threat to their conservation (Lindenmayer & Scheele, 2017). Although citizen science platforms such as iNaturalist already consider ‘taxon geoprivacy’ as a way to safeguard the locations of species ‘at risk’, a sensitive species as secondary data would still come with full coordinates if not recognised as such.
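The gap described above could in principle be closed by applying the same safeguard to every taxon identified in a record, not only the primary one. The following sketch assumes that secondary taxa have already been identified. The taxon names, the `SENSITIVE` set, the record fields and the rounding rule are all hypothetical and do not describe how any platform implements geoprivacy.

```python
# Hypothetical list of sensitive taxa; placeholder names only.
SENSITIVE = {"Species B"}


def coarsen_if_sensitive(record, sensitive=SENSITIVE, decimals=1):
    """Return a copy of the record with coarsened coordinates when ANY
    taxon in it, primary or secondary, is on the sensitive list.

    Rounding to one decimal degree is an illustrative assumption.
    """
    taxa = {record["primary_taxon"], *record.get("secondary_taxa", [])}
    if taxa & sensitive:
        return {
            **record,
            "lat": round(record["lat"], decimals),
            "lon": round(record["lon"], decimals),
            "obscured": True,
        }
    return {**record, "obscured": False}


# Invented record: the primary taxon is not sensitive, a secondary one is.
record = {
    "primary_taxon": "Species A",
    "secondary_taxa": ["Species B", "Species C"],
    "lat": 51.9634,
    "lon": 7.6131,
}
safe = coarsen_if_sensitive(record)
```

Because the function returns a copy, the precise coordinates remain available to the data holder while only the coarsened version is shared.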
Finally, it is most important that awareness of the existence and potential of secondary data grows among scientists. With our contribution, we aim to open the eyes of the scientific community to overcome ‘biodiversity blindness’ and acknowledge the wealth of information far beyond the location and date of a species observation in the millions of freely available multimedia files. Beyond this blindness to a treasure of data, studies and projects dedicated to the topic may also go unrecognised as such due to a lack of common terminology. Therefore, we would like to establish the term secondary data as proposed by Callaghan et al. (2021) or at least stimulate a discussion about terminology, so that a corresponding field of research can grow.
It should be clear to the community that this approach applies not only to data from citizen science, social media or webpages, but also to data collected by scientists in the field or in the laboratory. The multiple benefits demonstrated here should convince people to make the (raw) data available to the public according to the FAIR principles, be it via GBIF, GitHub or other openly accessible repositories. As with all new and innovative methods, a transition period will be necessary before this approach is fully integrated into the research toolkit. Again, we draw comparisons with iEcology, culturomics and citizen science in that secondary data are utilised in a complementary and supportive way to other data sources and verified with ground truthing. Efforts to explore and use secondary data, although still in their early stages, are already demonstrating many ways to enrich and inform biodiversity research.
AUTHOR CONTRIBUTIONS
Nadja Pernat conceived the ideas and planned and facilitated a 3-day workshop in November 2022 that was attended by all authors; Nadja Pernat, Yuval Itescu, Jasmijn Hillaert, Cristina Preda and Marina Golivets researched the reviewed articles; Nadja Pernat, Jasmijn Hillaert and Susan Canavan created the visualisations; Nadja Pernat, Quentin Groom and David M. Richardson led the writing of the manuscript. All authors contributed critically to the drafts and gave final approval for publication.
ACKNOWLEDGEMENTS
This work originated from a workshop supported by the COST Action Increasing understanding of alien species through citizen science (COST Action CA17122). DMR acknowledges support from the Centre for Invasion Biology, Stellenbosch University, Mobility 2020 project no. CZ.02.2.69/0.0/0.0/18_053/0017850 (Ministry of Education, Youth and Sports of the Czech Republic) and long-term research development project RVO 67985939 (Czech Academy of Sciences). ASV acknowledges support from the FCT (Portuguese Foundation for Science and Technology) through the program Stimulus for Scientific Employment, Individual Support (contract reference 2020.01175.CEECIND/CP1601/CT0009), and project ClimateMedia: Understanding climate change phenomena and impacts from digital technology and social media (contract reference 2022.06965.PTDC). IJ and PP acknowledge support from the Czech Science Foundation (project no. 23-07278S). HT acknowledges FCT/MCTES support to CESAM (UIDP/50017/2020+UIDB/50017/2020+LA/P/0094/2020). NP acknowledges the Animal Ecology Lab (Sascha Buchholz, Hilke Hollens-Kuhr) and the University of Münster for their support in running the workshop and Emily Brunner for co-producing the illustrations. We thank Tom August for helpful comments on the manuscript as well as Wouter Koch and an anonymous reviewer for their valuable advice on improving the article. Open Access funding enabled and organized by Projekt DEAL.
CONFLICT OF INTEREST STATEMENT
The authors declare no conflict of interest.
DATA AVAILABILITY STATEMENT
This work does not contain any original data.
© 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

1 Institute of Landscape Ecology, University of Münster, Münster, Germany; Centre for Integrative Biodiversity Research and Applied Ecology (CIBRA), University of Münster, Münster, Germany
2 Institute of Botany, Czech Academy of Sciences, Průhonice, Czech Republic; School of Natural Sciences, University of Galway, Galway, Ireland
3 Helmholtz Centre for Environmental Research – UFZ, Halle, Germany
4 Research Institute of Nature and Forest, Brussels, Belgium
5 Leibniz Institute of Freshwater Ecology and Inland Fisheries, Berlin, Germany; Freie Universität Berlin, Berlin, Germany; Department of Evolutionary and Environmental Biology, University of Haifa, Haifa, Israel
6 CNRS, AgroParisTech, Ecologie Systématique Evolution, Université Paris-Saclay, Gif-sur-Yvette, France; Biology Centre of the Czech Academy of Sciences, Institute of Hydrobiology, České Budějovice, Czech Republic
7 Department of Ecoscience, Aarhus University, Aarhus, Denmark
8 Institute of Botany, Czech Academy of Sciences, Průhonice, Czech Republic; Department of Ecology, Faculty of Science, Charles University, Prague, Czech Republic
9 Faculty of Natural and Agricultural Sciences, Ovidius University of Constanta, Constanta, Romania
10 Institute of Botany, Czech Academy of Sciences, Průhonice, Czech Republic; Centre for Invasion Biology, Department of Botany and Zoology, Stellenbosch University, Stellenbosch, South Africa
11 Centre for Environmental and Marine Studies and Department of Biology, University of Aveiro, Aveiro, Portugal
12 CIBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, InBIO Laboratório Associado, Universidade do Porto, Porto, Portugal; BIOPOLIS Program in Genomics, Biodiversity and Land Planning, CIBIO, Vairão, Portugal; NBI, Natural Business Intelligence, Tec Labs 1.2.1, Campus da Faculdade de Ciências da Universidade de Lisboa, Lisbon, Portugal
13 Meise Botanic Garden, Brussels, Belgium