Ecological and Geographic Influences on Cumacea Genetics in the Northern North Atlantic

by aPhyloGeo software

Abstract

Cumacea (crustaceans: Peracarida) are vital indicators of benthic health in marine ecosystems. This study investigated the influence of environmental (i.e., biological or ecosystemic), climatic (i.e., meteorological or atmospheric), and geographic (i.e., spatial or regional) attributes on their genetic variability in the Northern North Atlantic, focusing on Icelandic waters. We analyzed mitochondrial sequences of the 16S rRNA gene from 62 Cumacea specimens. Using the aPhyloGeo software, we compared these sequences with relevant parameters such as latitude (decimal degree) at the start of sampling, wind speed (m/s) at the start of sampling, O2 concentration (mg/L), and depth (m) at the start of sampling.

Our analyses revealed variability in most spatial and biological attributes, reflecting the diversity of ecological requirements and benthic habitats. The most common Cumacea families, Diastylidae and Leuconidae, suggest adaptations to various marine environments. Phylogeographic analysis showed a divergence between specific genetic sequences and two habitat attributes: wind speed (m/s) at the start of sampling and O2 concentration (mg/L). This indicates potential local adaptation to these fluctuating conditions.

These results reinforce the importance of further research into the relationship between Cumacea genetics and global environmental factors. Understanding these relationships is essential for interpreting the evolutionary dynamics and adaptation of deep-sea Cumacea. This study sheds much-needed light on invertebrate acclimatization to climate change, anthropomorphic pressures, and deep-water habitat management. It can contribute to the evolution of more efficient conservation strategies and inform policies that protect vulnerable marine ecosystems.

The aPhyloGeo Python package is freely and publicly available on GitHub and PyPi, providing an invaluable tool for future research.

Keywords:AtlanticArcticBioinformaticsBiologyCumaceaIcelandPhylogenyPhylogeography

Introduction

The North Atlantic and Subarctic regions, particularly the Icelandic waters, are of ecological interest due to their diverse water masses and unique oceanographic features Schnurr et al., 2014Meißner et al., 2014Uhlir et al., 2021. These areas form vital benthic habitats[1] Levin & Dayton, 2009 and enhance our understanding of deep-sea ecosystems and biodiversity patterns Rogers et al., 2007Danovaro et al., 2008Uhlir et al., 2021. The IceAGE project and its predecessors, BIOFAR and BIOICE, provide invaluable data for studying the impacts of climate change and seabed mining, especially in the Greenland, Iceland, and Norwegian (GIN) seas Meißner et al., 2018.

Cumacea, a crustacean taxon within Peracarida, provide major indicators of marine ecosystem health due to their sensitivity to environmental fluctuations Stransky & Svavarsson, 2010 and their contribution to benthic food webs Rehm, 2009. Despite their ecological importance, deep-sea benthic invertebrates’ evolutionary history remains uncharted, notably in the North Atlantic Jennings & Etter, 2014. Interpreting these deep-sea organisms’ genetic distribution and demography is central for predicting their response to climate change and anthropogenic pressures, such as seabed mining Jennings & Etter, 2014Meißner et al., 2018.

Given the urgency of the above factors, this study aims to analyze the influence of ecological (climatic and environmental) and geographic parameters on the genetic variability of Cumacea in the Northern North Atlantic. Specifically, we will examine whether genetic adaptation exists between the genetic structure of the 16S rRNA mitochondrial gene region of cumacean species sampled and their habitat attributes. If so, we will determine the attribute that diverges most from a specific gene sequence of this cumaceans gene (i.e., a window) and further explore the potential associated protein using bioinformatics tools to interpret its biological relevance. Our approach includes confirming different phylogeographic models[2] and updating a Python package (currently in beta), aPhyloGeo, to simplify these analyses.

This paper is organized as follows: Section Related Works reviews pertinent studies on the biodiversity and biogeography of deep-sea benthic invertebrates; Section Our Contribution summarizes the aims and contributions of this study, highlighting aspects relating to the conservation and adaptation of marine invertebrates to climate change; Section Materials and Methods describes the data collection, sampling procedures, and genetic analyses; Section Metrics describes the metrics used to evaluate the phylogeographic models; Section Results presents the results; finally, Section Conclusion discusses their implications for future research and conservation efforts.

Assessing and quantifying the biodiversity of deep-sea benthic invertebrates has become increasingly crucial since it was discovered that their species richness may be underestimated Grassle & Maciolek, 1992. Subsequent research has highlighted the need for large-scale distribution models to interpret the diversity of these organisms across their ecological and evolutionary contexts Rex et al., 1997. That is why recent efforts have focused on mapping, managing, and studying the seabed Brown et al., 2011. Advanced technologies such as acoustic detection are improving our knowledge of benthic ecosystem complexity Brown et al., 2011. Integrating genetic and habitat attributes gives a deeper understanding of how ecosystemic, meteorological, and spatial attributes influence the genetic differences, distribution, biodiversity, and resilience of deep-sea benthic organisms Vrijenhoek, 2009.

However, the relationship between genetics and the environment is complex, involving gene-environment interactions and natural selection factors, which makes it difficult to identify clear causal relationships Balkenhol et al., 2009. In addition, the distinction between the direct and indirect effects of the environment on genetics poses other challenges Manel et al., 2010Balkenhol et al., 2019. The restrictions of available methods for measuring genetic and ecological attributes, combined with logistical constraints, often limit the scope of such studies Manel et al., 2010Shafer & Wolf, 2013. This complexity may explain why the environment and genetics of Cumacea have been less studied, even though they are essential for interpreting how these deep-sea invertebrates adapt to fluctuating environmental conditions.

Our Contribution

Our study focuses on the genetic fluctuation of the 16S rRNA mitochondrial gene in cumacean populations in response to variations in their habitat, which has been little explored in previous studies Grassle & Maciolek, 1992Rex et al., 2000. We aim to refine the natural selection hypothesis by identifying the specific genetic region with the highest mutation rate and the potentially associated protein using bioinformatics tools, such as protein structure modeling and functional annotation databases, to expose the potential functions that this protein might have on the adaptation of cumaceans to habitat fluctuations. By linking mitochondrial sequences of the 16S rRNA gene to habitat parameters using phylogeographic analyses, we can better interpret the selection effect on this genetic sequence of cumaceans that confers survival advantages in the extreme environments of the northern North Atlantic. This approach enables us to detect adaptive mutations and their functional outcomes using bioinformatics tools, to gain a better insight into how natural selection proceeds at the molecular level in these challenging habitats. This represents a major advance over previous research, which has often struggled to integrate genetic and biological data in the context of deep-sea invertebrates Etter & Rex, 1990Vrijenhoek, 2009.

Using robust analytical methods such as dissimilarity calculations and phylogenetic reconstructions, we provide new insights into the genetic adaptation of marine Cumacea. Unlike previous studies, which have encountered difficulties establishing a link between genetics and the environment Manel et al., 2003Balkenhol et al., 2009, our results provide a better understanding of evolutionary dynamics in aquatic ecosystems.

Furthermore, our genetic and environmental data highlights critical habitats of high conservation interest, which can be considered for establishing marine protected areas Levin & Dayton, 2009. These results are essential for developing informed conservation strategies in the context of climate change. Finally, our study paves the way for further research on other invertebrate species across different geographic regions. By extending this research to diverse environments and taxonomic groups, scientists will gain a more complete understanding of the adaptation and resilience of marine invertebrates to changing conditions. This work contributes essential insights to the field and supports the development of informed conservation strategies.

Materials and Methods

This section describes our data and introduces the main stages of data pre-processing and the aPhyloGeo software. A flow chart, constructed with the diagram software draw.io, summarizes this section (Figure 1).

Flow chart summarizing the Materials and Methods section workflow. Six different colors highlight the blocks. The first block (blue) represents our database. The second block (red) is data pre-processing, where we remove attributes. The third and fourth blocks (orange) implement the aPhyloGeo software and its parameters for our phylogeographic analyses (see in the second step of the section ). The fifth block (grey) calculates phylogenetic tree comparison distances. The sixth block (yellow) compares the distances between the phylogenetic trees produced. The seventh block (purple) identifies regions with high mutation rates based on the results of the tree comparisons. *See YAML files on GitHub for more details on these parameters.

Figure 1:Flow chart summarizing the Materials and Methods section workflow. Six different colors highlight the blocks. The first block (blue) represents our database. The second block (red) is data pre-processing, where we remove attributes. The third and fourth blocks (orange) implement the aPhyloGeo software and its parameters for our phylogeographic analyses (see in the second step of the section aPhyloGeo software). The fifth block (grey) calculates phylogenetic tree comparison distances. The sixth block (yellow) compares the distances between the phylogenetic trees produced. The seventh block (purple) identifies regions with high mutation rates based on the results of the tree comparisons. *See YAML files on GitHub for more details on these parameters.

Description of the data

The study area was located in a northern region of the North Atlantic, including the Icelandic Sea, the Denmark Strait, and the Norwegian Sea. The specimens examined were collected as part of the IceAGE project (Icelandic marine Animals: Genetic and Ecology; Cruise ship M85/3 in 2011), which focused on the deep continental slopes and abyssal waters around Iceland Meißner et al., 2018. The sampling period for the included specimens was from August 30 to September 22, 2011, and they were collected at depths ranging from 316 m to 2568 m. Detailed protocols concerning the sampling plan, sample processing, DNA extraction steps, PCR amplification, sequencing, and aligned DNA sequences are available in Uhlir et al., 2021.

Data pre-processing

We used data from the article Uhlir et al., 2021, IceAGE project, and related data from the bold system’s database, as described in Uhlir et al., 2021. Given these databases’ enormous breadth of features, we applied a selective reduction procedure. Attributes that were not directly relevant to the analysis of correlations between Cumacea genetics and habitat properties, displayed little to no variability (non-numerical data), and had a large number of missing data (> 95%) were omitted from our study. Out of the 495 available in the IceAGE dataset, we considered 62 specimens for which mitochondrial DNA sequences of the 16S rRNA gene were available.

Next, we calculated the variance using the var()var() function in RStudio Desktop 4.3.2 for each of the selected numerical attributes. This step aimed to eliminate attributes with low variation, as they are unlikely to provide critical data to the analysis. We set a variance threshold of ≤ 0.1 to exclude uninformative attributes. The latter allows us to retain attributes whose variability is reasonably sufficient for our analyses while rejecting those with little variation. Only water salinity was eliminated based on this criterion (S2=0.02146629S^2 = 0.02146629). The formula (equation 1) and code (Program 1) used to calculate the variance of our final features, available in the data file on GitHub, are provided below:

S2=i=1n(xixˉ)2n1S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}

where S2S^2 is the variance of the attribute, xix_i represents each attribute value, xˉ\bar{x} is the average of the values for this attribute, and nn is the number of values for this attribute in the dataset.

# Import data from CSV file
# Read the dataset "Final_Data_Article.csv" and store it in an object called Data.
# The data is read with semicolon (;) as the delimiter.
Data <- read.csv(file = "Final_Data_Article.csv", header = TRUE, sep = ";")

# Calculate the variance of numerical attributes only from the dataset
# This snippet computes the variance for each numerical column.
# Non-numeric columns are excluded from the calculation.
variances <- sapply(Data, function(x) {
  # Check if the column is numeric
  if (is.numeric(x)) {
    # Compute variance, excluding NA values
    var(x, na.rm = TRUE)
  } else {
    NA  # Return NA for non-numeric columns
  }
})

# Display variances
# Print the calculated variances for each numerical attribute.
print(variances)

Program 1:RStudio script to calculate the variance of each numerical attributes in our final dataset

To refine our dataset, we calculated the Pearson correlation between attributes using the cor()cor() function in RStudio Desktop 4.3.2. Features exhibiting strong correlations with each other (threshold > 0.90) were removed to avoid repetition and guarantees attribute independence. We considered the threshold of > 0.90 to be an adequate compromise between preserving properties for our analyses and eliminating the repetition of information in our data. Since we have three missing data for O2 concentration (mg/L), we have used the "pairwise.complete.obs method". This method calculates the Pearson correlation matrix using all accessible pairs of observations, even if some data are missing. Using the above threshold, three attributes were discarded: latitude (DD) (Lat_start_end: r=0.99966582r = 0.99966582), longitude (DD) (Long_start_end: r=0.99999794r = 0.99999794), and depth (m) (Depth_start_end: r=0.99985791r = 0.99985791) at the end of sampling. The formula (equation 2) and code (Program 2) used to calculate the Pearson correlation coefficient between our final numerical attributes are shown below:

r=i=1n(xixˉ)(yiyˉ)i=1n(xixˉ)2i=1n(yiyˉ)2r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}

where rr is the Pearson correlation coefficient between two attributes, xix_i are the values of an attribute, yiy_i are the values of another attribute, xˉ\bar{x} and yˉ\bar{y} are respectively the averages of the two attributes, and nn is the sample size of the two attributes.

# Import data from CSV file
# Read the dataset "Final_Data_Article.csv" and store it in an object called Data.
# The data is read with semicolon (;) as the delimiter.
Data <- read.csv(file = "Final_Data_Article.csv", header = TRUE, sep = ";")

# Select numeric columns only from the dataset
# This code filters the dataset to include only numeric columns.
# Create a new dataframe that contains only the columns with numeric data.
numeric_Data <- Data[sapply(Data, is.numeric)]

# Calculate Pearson correlation matrix
# This code calculates the Pearson correlation matrix for the numeric columns.
# The “pairwise.complete.obs” argument ignores missing values.
correlation_matrix <- cor(numeric_Data, use = "pairwise.complete.obs")

# Display correlation matrix
# Print the resulting Pearson correlation matrix.
print(correlation_matrix)

Program 2:RStudio script to calculate the Pearson correlation coefficient between all the numerical attributes in our final dataset

This selection of attributes and data resulted in a table containing 62 rows (n=62n=62) and 17 columns (number of attributes).

Selected attributes in the IceAGE database

Geographic data

  • The latitude (Figure 2a) and longitude (Figure 2b) at the start of sampling, both in decimal degrees (DD), as they are intimately linked to the environmental gradients and historical mechanisms modeling genetic heterogeneity Gaither & Rocha, 2013.

  • The sectors across the seas around Iceland: the Denmark Strait (n=28n=28), the Iceland Basin (n=15n=15), the Irminger Basin (n=12n=12), the Norwegian Sea (n=4n=4), and the Norwegian Basin (n=3n=3).

Environmental data

  • Depth (m) at the start of sampling (Figure 2c), as well as water temperature (°C) (Figure 2e), and O2 concentration (mg/L) (Figure 2f), as these are vital elements of the marine ecosystem that have an impact on the distribution and evolutionary acclimatization of marine species Rex et al., 2006Danovaro et al., 2010.

  • The sampling sites’ sedimentary characteristics directly influence the distribution of Cumacea Uhlir et al., 2021. In this study, they are divided into six ecological niche categories: mud (n=30n=30), sandy mud (n=15n=15), sand (n=9n=9), forams (n=3n=3), muddy sand (n=3n=3), and gravel (n=2n=2).

Climatic data

Wind speed (m/s) (Figure 2d) and wind direction at the start and end of sampling were also included, giving the contribution of wind to benthic ecosystem dynamics and the restructuring of species distribution by wind currents and sediment transport Siedlecki et al., 2016Waga et al., 2020Saeedi et al., 2022. The wind direction at the start of sampling comprises six orientations: South-West (n=22n=22), South (n=15n=15), North-East (n=9n=9), South-South-East (n=9n=9), North-West (n=5n=5), and East (n=2n=2); while that at the end of sampling is composed of seven orientations: South (n=15n=15), South-West (n=15n=15), North-East (n=9n=9), West-South-West (n=7n=7), South-East (n=6n=6), North-North-West (n=5n=5), South-South-East (n=3n=3), and East (n=2n=2).

Selected attributes in the bold system’s database

Taxonomic data

The family, genus, and scientific name of the cumaceans sampled were integrated into our data to study evolutionary relationships and genetic variation to habitat attributes among the specimens in our dataset. These comprise seven families: Diastylidae (n=21n=21), Lampropidae (n=13n=13), Leuconidae (n=12n=12), Astacidae (n=7n=7), Bodotriidae (n=4n=4), Ceratocumatidae (n=3n=3), and Pseudocumatidae (n=2n=2). A total of 21 cumacean species were found in our sample (Figure 3). We have also included the sample identity (id) so that each sample remains unique. Some specimens were only identified to genus (n=1n=1) or family (n=5n=5) in our sample.

Selected attributes from article Uhlir et al. (2021)

Other environmental data

The habitat and water mass of the sampling points were the only water attributes taken directly from Table 1 of Uhlir et al., 2021, as they can give us insight into how they may affect Cumacea genetic diversity and the acclimatization of these species in the GIN seas around Iceland. Thus, the water masses definitions, as described in Uhlir et al., 2021, were used as a reference: Arctic Polar Water (APW, n=15n=15), Iceland Sea Overflow Water (ISOW, n=15n=15), North Atlantic Water (NAW, n=9n=9), Arctic Polar Water/Norwegian Sea Arctic Intermediate Water (APW/NSAIW, n=7n=7), warm Norwegian Sea Deep Water (NSDWw, n=8n=8), Labrador Sea Water (LSW, n=3n=3), cold Norwegian Sea Deep Water (NSDWc, n=3n=3), and Norwegian Sea Arctic Intermediate Water (NSAIW, n=2n=2) (Figure 4). In terms of habitat, we considered the three categories used in Uhlir et al., 2021: Deep Sea (n=38n=38), Shelf (n=15n=15), and Slope (n=9n=9) (Figure 5).

Genetic data

To better interpret benthic species’ relationship and evolutionary responses, genetic data are required Wilson & Hessler, 1987Uhlir et al., 2021. Thus, the aligned DNA sequence of the 16S rRNA mitochondrial gene region from each of the samples was included in our analyses. This region is standard in phylogeny and phylogeography studies Hugenholtz et al., 1998 and sufficiently conserved over time to guarantee exact alignments between different species or populations Saccone et al., 1999. We examined 62 of the 306 aligned DNA sequences used for phylogeographic analyses by Uhlir et al., 2021. As some specimens in our sample have their DNA sequence duplicated, or even quadruplicated with a difference of one or two nucleotides, we took into account the longest-aligned DNA sequence of each specimen.

aPhyloGeo software

We used the cross-platform Python software aPhyloGeo for our phylogeographic analyses, designed to analyze phylogenetic trees using ecological and geographic attributes (Program 3). Developed by My-Linh Luu, Georges Marceau, David Beauchemin, and Nadia Tahiri, aPhyloGeo offers tools to study and identify potential divergence between species genetics and habitat characteristics, enabling us to understand the evolution of species under different environmental conditions Koshkarov et al., 2022.

We selected this software for our analysis because, to our knowledge, it is the first phylogeographic tool capable of establishing similarity or dissimilarity between species genetics and environmental, climatic, and geographical attributes - precisely the objective of our study Koshkarov et al., 2022. The aPhyloGeo software offers several key functionalities:

  1. Phylogeographic tree evaluation: The software elucidates the evolutionary relationships between species based on their genetic sequences, which is essential for interpreting phylogeographic models that link the evolution of species to their spatial distribution and their biological and meteorological context Koshkarov et al., 2022.

  2. Ecological and regional dissimilarity analysis: It enables the detection of divergence and convergence between genetic sequences and the mentioned attributes to be tested and visualized. This software property is essential for identifying the effect of these attributes on the genetic fluctuations and evolutionary history of these cumacean species Koshkarov et al., 2022.

  3. Investigation of genetic diversity: aPhyloGeo measures genetic heterogeneity based on spatial variation, making it possible to recognize potential evolutionary processes (e.g., mutation, speciation, genetic drift, and selection pressures) and local adaptations Manel et al., 2010.

# This script outlines the main steps for processing genetic and attribute data using the aPhyloGeo package.
if __name__ == "__main__":

    # Load parameters from a configuration file
    # Load parameters necessary for the analysis from a specified file.
    Params.load_from_file()

    # Load the sequence file
    # Loads the genetic sequence file specified by the reference gene filepath.
    sequence_file = utils.load_sequence_file(Params.reference_gene_filepath)

    # Create an AlignSequences object
    # This object is responsible for aligning the genetic sequences.
    align_sequence = AlignSequences(sequence_file)

    # Load attribute data from a CSV file
    # This reads the attributes data into a pandas DataFrame.
    attribute_data = pd.read_csv(Params.file_name)

    # Perform the alignment of sequences
    # This method aligns the genetic sequences and stores the results.
    alignments = align_sequence.align()

    # Generate phylogenetic trees based on aligned sequences
    # Create phylogenetic trees from the multiple sequence alignments (MSA).
    genetic_trees = utils.genetic_pipeline(alignments.msa)
    
    # Create a GeneticTrees object
    # Represent the generated phylogenetic trees in Newick format.
    trees = GeneticTrees(trees_dict=genetic_trees, format="newick")
   
    # Generate attribute trees based on attribute data
    # Create trees representing the relationships between different attributes.
    attribute_trees = utils.attribute_pipeline(attribute_data)
    
    # Filter the results based on the generated trees
    # Filter the results to ensure they meet certain criteria.
    utils.filter_results(attribute_trees, genetic_trees, attribute_data)

Program 3:Main script for tutorial using the aPhyloGeo package.

The aPhyloGeo Python package is freely and publicly available on GitHub, and is also available on PyPi, to facilitate complex phylogeographic analyses. The software process has three main stages:

  1. The first step was to collect DNA sequences from Cumacea of sufficient quality for the needs of our results Koshkarov et al., 2022. In this study, 62 cumaceans samples were selected to represent 62 sequences of the 16S rRNA mitochondrial gene. We then included two climatic attributes, namely wind speed (m/s) at the start and end of the sampling; three environmental characteristics, such as depth (m) at the start of sampling, water temperature (°C), and O2 concentration (mg/L); and two geographic variables, latitude (DD) and longitude (DD) at the start of sampling.

  2. In the second step, trees were generated separately from biological, spatial, meteorological, and genetic data. Concerning spatial attributes, we calculated the dissimilarity between each pair of cumaceans from distinct spatial conditions Koshkarov et al., 2022. This produced a symmetrical square matrix Koshkarov et al., 2022. The neighbor-joining algorithm[3] was used to build the spatial tree from this matrix Koshkarov et al., 2022. Each geographic attribute gives rise to a geographic tree. If there are mm windows affected by this attribute, there will be mm geographic trees. The same approach was applied to biological, meteorological, and genetic data.

    For genetic data, phylogenetic reconstruction was reiterated to build genetic trees based on 62 mitochondrial 16S rRNA sequences, considering only data within a window that progresses along the alignment Koshkarov et al., 2022. This displacement can vary according to the steps and the size of the window defined by the user (their length is determined by the number of base pairs (bp)) Koshkarov et al., 2022.

    In our case, we set up the aPhyloGeo software as follows: pairwiseAlignerpairwiseAligner for sequence alignment; Hamming distance\text{Hamming distance} to measure simple dissimilarities between sequences of identical length; Wider Fit by elongating with Gap (starAlignment)\text{Wider Fit by elongating with Gap (starAlignment)} algorithm takes alignment gaps into account, which is often mandatory in the case of major deletions or insertions in the sequences; windows_size\text{windows\_size}: 1 nucleotide (nt); and finally, step_size\text{step\_size}: 10 nt. The last two configurations imply that for each 1 nt window, a phylogenetic tree is produced using the nucleotide of each cumacean, then the window is moved by 10 nt, creating a new tree. Each window in the alignment will give a genetic tree. If there are nn windows, there will be nn phylogenetic trees. Genetic trees will be used in an object called T1T_1, while spatial and ecological trees are used in another object called T2T_2.

  3. In the third step, the genetic trees constructed in each sliding window are compared with ecosystemic, atmospheric, and regional trees using Robinson-Foulds distance Robinson & Foulds, 1981, normalized Robinson-Foulds distance, Euclidean distance, and Least Squares distance. These contribute to understanding the correspondence between cumaceans genetic sequences and their habitat. The approach also takes bootstrapping into account Koshkarov et al., 2022. The results of these metrics were obtained using the functions least_square(tree1,tree2)least\_square(tree1, tree2), robinson_foulds(tree1,tree2)robinson\_foulds(tree1, tree2), euclidean_dist(tree1,tree2)euclidean\_dist(tree1, tree2) from the aPhyloGeo software and were organized by the main function (Program 3). Those for the normalized Robinson-Foulds distance were obtained with the function robinson_foulds(tree1,tree2)robinson\_foulds(tree1, tree2) (see the last line of code in Program 4). The metric output tells us which of our attributes have the greatest divergence of phylogenetic relationships in our samples, based on the magnitude of the metric distances (see Figure 6 and Figure 7).

    In addition to identifying the specific attribute, a sliding-window approach enables the precise localization of subtle sequences with high rates of genetic mutation Koshkarov et al., 2022. This method requires shifting a fixed-size window over the alignment of genetic sequences, allowing phylogenetic trees to be reconstructed for each part of the sequence. It therefore allows us to recognize changes in evolutionary relationships along the 16S rRNA mitochondrial gene region of cumacean species. This method is essential for determining whether cumaceans-specific gene sequences in this region of their genome may be affected by certain ecological or spatial attributes of their habitat (see Figure 6 and Figure 7).

Metrics

Our phylogeographic study used four distance metrics to quantify topological differences between phylogenetic trees. It also assesses dissimilarities between genetic sequences and ecological and regional attributes. This enables a comprehensive analysis of the evolutionary dynamics of cumacean populations in different environmental contexts.

The following section presents a more concise version of the functions mentioned in the second and third steps of section aPhyloGeo software:

Robinson-Foulds distance

The Robinson-Foulds (RF) distance calculates the distance between phylogenetic trees built in each sliding window (T1T_1) and the attributes trees (T2T_2) (see the list in the first step of the section aPhyloGeo software) Tahiri et al., 2018Koshkarov et al., 2022. This measurement is used to evaluate the topological differences between the two sets of trees (see Equation (3) and Program 4).

For example, it evaluates the number of division differences between phylogenetic trees built within certain user-defined sliding windows (see the second step of the section aPhyloGeo software) and geographic trees built with latitude data (DD) at the start of sampling Robinson & Foulds, 1981. A high distance between a specific window and other windows considered in the RF distance analysis implies that the habitat feature has little to no impact on this particular DNA sequence and that this attribute cannot explain the genetic divergences observed in this DNA sequence.

RF(T1,T2)=Σ(T1)ΔΣ(T2)\text{RF}(T_1, T_2) = | \Sigma(T_1) \Delta \Sigma(T_2) |

where RF(T1,T2)\text{RF}(T_1, T_2) is the Robinson-Foulds distance between the two sets of trees, Σ(T1)\Sigma(T_1) and Σ(T2)\Sigma(T_2) are the sets of divisions in trees T1T_1 and T2T_2 and Δ, the difference between these two sets.

import ete3

def robinson_foulds(tree1, tree2):
    """
    Calculate the Robinson-Foulds distance between the two sets of trees.

    Parameters:
    - tree1: Genetic trees in Newick format (represent phylogenetic trees in text form).
    - tree2: Ecological and spatial trees in Newick format (represent attributes trees in text form).

    Returns:
    - rf: The Robinson-Foulds distance between the two sets of trees.
    - rf_normalized: The normalized Robinson-Foulds distance.
    """
    # Initialize the Robinson-Foulds distance
    rf = 0

    # Convert trees from Newick format to ete3.Tree objects
    # This parsing allows us to use ete3 methods to compare trees.
    tree1_newick = ete3.Tree(tree1.format("newick"), format=1)
    tree2_newick = ete3.Tree(tree2.format("newick"), format=1)

    # Calculate the Robinson-Foulds distance
    # The method returns RFD, maximum possible RFD, and common leaves (i.e., a taxa) between the trees.
    rf, rf_max, common_leaves = tree1_newick.robinson_foulds(tree2_newick, unrooted_trees=True)

    # If there are no common leaves, set the RFD to 0
    # This makes it possible to deal with cases where the trees have no overlapping taxa.
    if len(common_leaves) == 0:
        rf = 0

    # Return the RFD and its normalized value
    # Normalization is done by dividing RFD by the maximum possible RFD.
    return rf, rf / rf_max

Program 4:Python script for calculating the Robinson-Foulds Distance using the ete3 package in the aPhyloGeo package.

Normalized Robinson-Foulds distance

The normalized Robinson-Foulds (nRF) distance scales the RF distance to account for the size variations in the trees (number of clades; i.e., a group of species with a common origin), allowing a more equitable comparison. It scales the distance to a range between 0 and 1. In our context, the distance has been normalized by 2n62n-6, where nn represents the number of taxa (see Equation (4) and the last line of code in Program 4).

Since the size of environmental trees constructed with O2 concentration data (mg/L) differs from that of other attributes due to missing data, this nRF distance allows us to compare its dissimilarity with the phylogenetic trees in a fairer way Tahiri et al., 2018Koshkarov et al., 2022. It reveals the relative influence of O2 concentration (mg/L) on cumacean phylogenetic relationships, independent of tree size Tahiri et al., 2018Koshkarov et al., 2022. A high value of this metric between a specific window and other windows considered in the nRF distance analysis does not allow us to conclude that there is a correlation between this DNA sequence and the attribute. It may indicate a topological dissimilarity between the habitat attribute tree and the gene trees at that position in the DNA sequence alignments.

RFnorm(T1,T2)=Σ(T1)ΔΣ(T2)Σ(T1)+Σ(T2)\text{RF}_{\text{norm}}(T_1, T_2) = \frac{| \Sigma(T_1) \Delta \Sigma(T_2) |}{| \Sigma(T_1) | + | \Sigma(T_2) |}

where RFnorm(T1,T2)\text{RF}_{\text{norm}}(T_1, T_2) is the normalized Robinson-Foulds distance between the two sets of trees, Σ(T1)\Sigma(T_1) and Σ(T2)\Sigma(T_2) are the sets of divisions in trees T1T_1 and T2T_2 and Δ, the difference between these two sets.

Euclidean distance

In our study, the Euclidean distance calculates the distance between two sets of points in a multidimensional space, which designates the divisions of the two sets of trees (T1T_1 and T2T_2). It is used to compare divisions between two respective sets of trees to assess the degree of divergence or similarity of their topologies (see Equation (5) and Program 5). Branches are weighted according to their length, which makes it possible to obtain quantitative dissimilarities between leaf (i.e., cumacean species) pairs (i.e., genetic distance) in the two sets of trees Choi & Gomez, 2009. Thus, for each pair of leaves, their distance in the genetic trees and the habitat attribute trees are compared Choi & Gomez, 2009.

By comparing the two sets of trees T1T_1 and T2T_2 using this metric, it is possible to measure the extent to which genetic divergences correspond to fluctuations in habitat attributes. This is crucial for interpreting evolutionary relationships with these factors. A high distance of this metric between a specific window and other windows considered in the Euclidean distance analysis reveals evolutionary divergences between members of the cumacean populations at the level of this DNA sequence (see Figure 6d and Figure 7d).

dEuclidean(T1,T2)=i=1n(T1iT2i)2d_{\text{Euclidean}}(T_1, T_2) = \sqrt{\sum_{i=1}^{n} (T1_i - T2_i)^2}

where dEuclidean(T1,T2)d_{\text{Euclidean}}(T_1, T_2) is the Euclidean distance between the two sets of trees (T1T_1 and T2T_2), and T1iT1_i and T2iT2_i represent the respective divisions of trees T1T_1 and T2T_2 for each i-th division.

import ete3
import dendropy

def euclidean_dist(tree1, tree2):
    """
    Calculate the Euclidean distance between the two sets of trees.

    Parameters:
    - tree1: Genetic trees in Newick format (represent phylogenetic trees in text form).
    - tree2: Meteorological, biological, and regional trees in Newick format (represent attributes trees in text form).

    Returns:
    - ed: The Euclidean distance between the two sets of trees.
    """

    # Create a TaxonNamespace object to handle taxon information
    tns = dendropy.TaxonNamespace()

    # Load the first tree from Newick format into a dendropy Tree object
    # Analyzes the string formatted by Newick and prepares the tree for comparison.
    tree1_tc = dendropy.Tree.get(
        data=tree1.format("newick"), 
        schema="newick", 
        taxon_namespace=tns
    )
    
    # Load the second tree from Newick format into a dendropy Tree object
    # Similar to the first tree, this step prepares the second tree for comparison.
    tree2_tc = dendropy.Tree.get(
        data=tree2.format("newick"), 
        schema="newick", 
        taxon_namespace=tns
    )

    # Encode the bipartitions of both trees
    # This step converts the trees into a format where the presence or absence of 
    # Each bipartition (split) is coded, which is necessary to calculate distances.
    tree1_tc.encode_bipartitions()
    tree2_tc.encode_bipartitions()

    # Calculate the Euclidean distance between the two trees
    # The distance is computed based on the differences in the encoded bipartitions.
    ed = dendropy.calculate.treecompare.euclidean_distance(tree1_tc, tree2_tc)

    return ed

Program 5:Python script for calculating the Euclidean distance using the ete3 and the dendropy packages in the aPhyloGeo package

Least Squares distance

The Least Squares (LS) distance measures the sum of the squares of the differences between the phylogenetic distances of the leaf pairs between the two sets of trees (T1T_1 and T2T_2) (see Equation (6) and Program 6). As with Euclidean distance, the distance between each pair of leaves in the genetic trees is compared with that of the habitat attribute trees Czarna et al., 2006Balaban et al., 2020. This metric allows us to measure the topological dissimilarity between the two sets of trees and to understand how these different habitat attributes influence the topological structure of phylogenetic trees. A high value between a specific window and other windows considered in the LS distance analysis indicates a structural discrepancy between this DNA sequence and the tree built from a habitat attribute. Furthermore, we cannot conclude with certainty that there is a correlation between them since genetic variations in this window are inconsistent with variations in this habitat parameter.

dLS(T1,T2)=i=1n1j=i+1n(dT1(i,j)dT2(i,j))2d_{\text{LS}}(T_1, T_2) = \sum_{i=1}^{n-1}\sum_{j=i+1}^{n}(d_T1(i,j) - d_T2(i,j))^2

where dLS(T1,T2)d_{\text{LS}}(T_1, T_2) is the Least Squares distance between the two sets of trees (T1T_1 and T2T_2), and dT1(i,j)d_T1(i,j) and dT2(i,j)d_T2(i,j), the distance between leaves ii and jj in T1T_1 and T2T_2, respectively.

import ete3

def least_square(tree1, tree2):
    """
    
    Parameters:
    - tree1: Genetic trees.
    - tree2: Atmospherical, ecosystemic, and spatial trees.

    Returns:
    - ls: The Least-Squares distance between the two sets of trees.
    """
    
    # Initialize the Least-Squares distance to zero
    ls = 0.0
    
    # Retrieve the list of terminal leaves (species) from the first tree
    leaves = tree1.get_terminals()
    
    # Extract the names of the terminal leaves
    leaves_name = [leaf.name for leaf in leaves]
    
    # Iterate over each pair of leaves in the trees
    for i in leaves_name:
        # Remove the first leaf from the list to avoid redundant comparisons
        leaves_name.pop(0)
        for j in leaves_name:
            # Calculate the distance between the pair of leaves in tree1
            d1 = tree1.distance(tree1.find_any(i), tree1.find_any(j))
            # Calculate the distance between the same pair of leaves in tree2
            d2 = tree2.distance(tree2.find_any(i), tree2.find_any(j))
            # Accumulate the absolute difference of distances into the LSD
            ls += abs(d1 - d2)
    
    return ls

Program 6:Python script for calculating the LSD using the ete3 package in the aPhyloGeo package

Interestingly, Euclidean distance and LS distance are more sensitive to quantitative differences in branch length and subtle tree topology, making them suitable for identifying detailed correlations between genetic fluctuations and those of habitat parameters Czarna et al., 2006Choi & Gomez, 2009. They can therefore be used to study fine divergences between trees, enabling nuanced identification of the effects of habitat attributes on the genetic structure of species Czarna et al., 2006Choi & Gomez, 2009.

As for Robinson-Foulds distance (normalized or not), although widely applied in evolutionary biology, they are less sensitive to slight topological dissimilarities, making them less accurate for identifying fine correlations between genetics and habitat parameters Smith, 2019Smith, 2020. This is due to their structural nature and the fact that they are not measured by branch length Smith, 2019Smith, 2020.

Creating Figures

Figure 2, Figure 3, Figure 6 and Figure 7 were made with Python 3.11, while Figure 4 and Figure 5 were made with RStudio Desktop 4.3.2.

Results

The violin diagrams shown in Figure 2 are used to display summary statistics similar to box plots, showing medians (white lines), interquartile ranges (thickened black bars), and the rest of the distributions (thin black lines), except the “extreme” points. Wider areas indicate a greater probability of the attributes taking a given value. They summarize the distribution of spatial (latitude and longitude at the start of sampling (DD)), atmospheric (wind speed (m/s) at the start of sampling), and ecosystemic (depth (m) at the start of sampling, water temperature (°C), and O2 concentration (mg/L)) data. These diagrams are essential for understanding habitat conditions and highlighting attributes that can potentially influence the genetic adaptability of cumaceans. In Table 1 and Figure 2, the parameters are designated by their names from the IceAGE database, except for latitude and longitude at the start of sampling (DD), for which the term “dec” has been removed at the end to avoid confusion.

Table 1:Table summarizing key statistics such as mean, median, standard deviation (Std Dev), 1st quartile (Q1) and 3rd quartile (Q3) of biological (depth (m) at the start of sampling, water temperature (°C), and O2 concentration (mg/L)), spatial (latitude (DD) at the start of sampling and longitude (DD) at the start of sampling) and atmospheric (wind speed (m/s) at the start and end of sampling) attributes for our phylogeographic analyses

Table summarizing key statistics such as mean, median, standard deviation (Std Dev), 1st quartile (Q1) and 3rd quartile (Q3) of biological (depth (m) at the start of sampling, water temperature (^\circC), and O2 concentration (mg/L)), spatial (latitude (DD) at the start of sampling and longitude (DD) at the start of sampling) and atmospheric (wind speed (m/s) at the start and end of sampling) attributes for our phylogeographic analyses
Violin diagrams of two regional attributes, one atmospheric attribute, and three ecosystemic attributes from our sample that provide essential information about the ecological and meteorological conditions of cumacean habitats. a) Latitude (DD) at the start of sampling (red) suggest that the samples come from two dominant latitudinal (DD) regions (around 61.64 DD and 67.64 DD); b) Longitude (DD) at the start of sampling (yellow) implies that the samples come from two dominant longitudinal (DD) regions (around -26.77 DD and -18.14 DD); c) Depth (m) at the start of sampling (green) suggest that the samples were mainly collected and concentrated at three different depths (m) (around 500 m, 1500 m and 2500 m); d) Wind speed (m/s) at start of sampling (light blue) indicate stable the wind conditions (m/s) at the start of sampling (around 6.00 m/s); e) Water temperature (^\circC) (dark blue) suggest that the samples were mostly collected and concentrated at two different water temperatures (^\circC) (around 0.07 ^\circC and 2.66 ^\circC); f) O2 concentration (mg/L) (pink) implies that the samples were primarily collected and concentrated at two different O2 concentrations (mg/L) (around 258.39 mg/L and 290.90 mg/L).

Figure 2:Violin diagrams of two regional attributes, one atmospheric attribute, and three ecosystemic attributes from our sample that provide essential information about the ecological and meteorological conditions of cumacean habitats. a) Latitude (DD) at the start of sampling (red) suggest that the samples come from two dominant latitudinal (DD) regions (around 61.64 DD and 67.64 DD); b) Longitude (DD) at the start of sampling (yellow) implies that the samples come from two dominant longitudinal (DD) regions (around -26.77 DD and -18.14 DD); c) Depth (m) at the start of sampling (green) suggest that the samples were mainly collected and concentrated at three different depths (m) (around 500 m, 1500 m and 2500 m); d) Wind speed (m/s) at start of sampling (light blue) indicate stable the wind conditions (m/s) at the start of sampling (around 6.00 m/s); e) Water temperature (°C) (dark blue) suggest that the samples were mostly collected and concentrated at two different water temperatures (°C) (around 0.07 °C and 2.66 °C); f) O2 concentration (mg/L) (pink) implies that the samples were primarily collected and concentrated at two different O2 concentrations (mg/L) (around 258.39 mg/L and 290.90 mg/L).

Our results revealed variability in most habitat attributes, as shown in Figure 2. For instance, the median of the latitude at the start of sampling (67.15 DD; Table 1) is higher than the mean (64.83 DD; Table 1), showing an asymmetric distribution skewed towards lower values. This trend is also observed for depth (m) at the start of sampling (Median: 1574.70 m; Mean: 1412.57 m; Figure 2c and Table 1) and O2 concentration (mg/L) (Median: 278.77 mg/L; Mean: 271.88 mg/L; Figure 2f and Table 1). The bimodal shape of the latitude distribution curve suggests that samples came from two dominant latitudinal regions at the start of sampling (around 61.64 DD and 67.64 DD; Figure 2a and Table 1). This bimodality is also observed in longitude (DD) at the start of sampling (around -26.77 DD and -18.14 DD; Figure 2b and Table 1), as well as for water temperature (°C) (around 0.07 °C and 2.66 °C; Figure 2e and Table 1), and O2 concentration (mg/L) (around 258.39 mg/L and 290.90 mg/L; Figure 2f and Table 1).

The median of the longitude (DD) at the start of sampling (-26.21 DD; Table 1) is lower than the mean (-23.12 DD; Table 1), indicating asymmetry on the higher sides (Figure 2b), as does the water temperature (°C) (Mean: 1.45 °C; Median: 0.71 °C; Figure 2e and Table 1). Unlike all the other diagrams in Figure 2, the curve of the depth (m) at the start of sampling (Figure 2c)) has a multimodal shape with three prominent peaks, suggesting that the samples were mainly collected and concentrated at three different depths (around 500 m, 1500 m and 2500 m; Figure 2c).

The mean (6.26 m/s; Table 1) and median (6.00 m/s; Table 1) of wind speed (m/s) at the start of sampling (Figure 2d) are fairly similar, with a high density of data around the median (6.00 m/s; Figure 2d and Table 1). This suggests stable wind conditions (m/s) at the start of sampling. The standard deviation of water temperature (°C) is relatively high (1.73 °C; Table 1) compared to the mean (1.45 °C; Table 1), suggesting acclimatization of cumaceans to a variety of habitat temperatures (-0.851 °C – 4.28 °C; Figure 2e and Table 1). The range of data for O2 concentration (mg/L) shows some variability (245.53 mg/L – 292.97 mg/L; Figure 2f and Table 1) in the environmental conditions. This reflects a diversity of requirements in terms of O2 concentration (mg/L), with cumaceans potentially affected by the heterogeneity of biogeochemical cycles, such as photosynthesis, respiration, and organic decomposition, which affect depth-dependent dissolved O2 concentration (mg/L).

Frequency distribution of cumacean species in our sample. The bars represent the number of individuals for each species. The percentages (%) displayed above the bars indicate the relative abundance of each species in the total sample. The mean and median values of the frequency distribution are shown in the top right-hand corner of the histogram. Unlike less common species, those that are abundant (such as Leptostylis ampullacea and Leucon pallidus) may have adaptive characteristics that enable them to exploit resources more easily, resist interspecific competition or withstand changing biological conditions.

Figure 3:Frequency distribution of cumacean species in our sample. The bars represent the number of individuals for each species. The percentages (%) displayed above the bars indicate the relative abundance of each species in the total sample. The mean and median values of the frequency distribution are shown in the top right-hand corner of the histogram. Unlike less common species, those that are abundant (such as Leptostylis ampullacea and Leucon pallidus) may have adaptive characteristics that enable them to exploit resources more easily, resist interspecific competition or withstand changing biological conditions.

The distribution and diversity of the various cumacean species found in our sample are shown in Figure 3. It shows that the most represented species are Leptostylis ampullacea (14.1%) and Leucon pallidus (12.5%). In contrast, species like Bathycuma brevirostre and Styloptocuma gracillimum are less represented (1.6%), implying that some species may have restricted ecological niches or face ecological forces that limit their distribution. The dominance of certain species (such as Leptostylis ampullacea and Leucon pallidus) suggests that they may have adaptive traits that enable them to make the most of the accessible resources, resist interspecific competition, or survive in fluctuating ecosystemic conditions, aligns with our study’s aim of relating genetic adaptation to habitat characteristics.

Distribution of cumacean families by water mass. This histogram represents the frequency of occurrence of the different cumacean families in our samples, classified according to the water mass in which they were collected. Eight water mass categories are represented: Arctic Polar Water (APW), Arctic Polar Water/North Sub-Arctic Intermediate Water (APW/NSAIW), Iceland Scotland Overflow Water (ISOW), Labrador Sea Water (LSW), North Atlantic Water (NAW), North Sub-Arctic Intermediate Water (NSAIW), cold North Sub-Atlantic Deep Water (NSDWc), and warm North Sub-Atlantic Deep Water (NSDWw). Seven families are represented: Astacidae (red), Bodotriidae (brown), Ceratocumatidae (green), Diastylidae (turquoise), Lampropidae (blue), Leuconidae (purple), and Pseudocumatidae (pink). The presence of the Diastylidae (turquoise) family in the majority of water bodies (APW, APW/NSAIW, ISOW, NSAIW, NSDWc, and NSDWw) accentuates the resilience and ecological acclimatization of this family to various ecological niches and conditions.

Figure 4:Distribution of cumacean families by water mass. This histogram represents the frequency of occurrence of the different cumacean families in our samples, classified according to the water mass in which they were collected. Eight water mass categories are represented: Arctic Polar Water (APW), Arctic Polar Water/North Sub-Arctic Intermediate Water (APW/NSAIW), Iceland Scotland Overflow Water (ISOW), Labrador Sea Water (LSW), North Atlantic Water (NAW), North Sub-Arctic Intermediate Water (NSAIW), cold North Sub-Atlantic Deep Water (NSDWc), and warm North Sub-Atlantic Deep Water (NSDWw). Seven families are represented: Astacidae (red), Bodotriidae (brown), Ceratocumatidae (green), Diastylidae (turquoise), Lampropidae (blue), Leuconidae (purple), and Pseudocumatidae (pink). The presence of the Diastylidae (turquoise) family in the majority of water bodies (APW, APW/NSAIW, ISOW, NSAIW, NSDWc, and NSDWw) accentuates the resilience and ecological acclimatization of this family to various ecological niches and conditions.

The following figure supports the objective of our study by showing the distribution of the various cumacean families in the different water bodies (Figure 4). The Diastylidae family, for example, is the most common in all water bodies (turquoise color in Figure 4), testifying to its resilience and ecological adaptability to a wide variety of habitat conditions, reminiscent of the dominance of Leptostylis ampullacea (Figure 3, 14.1%) which belongs to the Diastylidae family.

Distribution of cumacean families by habitat. This histogram represents the frequency of occurrence of the different cumacean families in our samples, classified according to the habitat in which they were collected. Three habitat categories are represented: Deep Sea, Shelf, and Slope. Seven families are represented: Astacidae (red), Bodotriidae (brown), Ceratocumatidae (green), Diastylidae (turquoise), Lampropidae (blue), Leuconidae (purple), and Pseudocumatidae (pink). The presence of cumacean families in more than one habitat, such as Diastylidae (turquoise), Lampropidae (blue), and Leuconidae (purple), may indicate the development of adaptations, whether morphological, physiological, or behavioral, that could favor their persistence in these habitats.

Figure 5:Distribution of cumacean families by habitat. This histogram represents the frequency of occurrence of the different cumacean families in our samples, classified according to the habitat in which they were collected. Three habitat categories are represented: Deep Sea, Shelf, and Slope. Seven families are represented: Astacidae (red), Bodotriidae (brown), Ceratocumatidae (green), Diastylidae (turquoise), Lampropidae (blue), Leuconidae (purple), and Pseudocumatidae (pink). The presence of cumacean families in more than one habitat, such as Diastylidae (turquoise), Lampropidae (blue), and Leuconidae (purple), may indicate the development of adaptations, whether morphological, physiological, or behavioral, that could favor their persistence in these habitats.

The distribution of samples of the different cumacean families according to the type of habitat where they were collected during sampling is shown in Figure 5. The deep-sea habitats show the greatest diversity of families, mainly Diastylidae and Lampropidae, suggesting they are well acclimatized to deep-sea conditions. In contrast, the slope has the lowest diversity, with Diastylidae again the most dominant, implying that some cumacean species have fewer ecological niches or are less adapted to this habitat. Although less diverse than the deep sea, the shelf is dominated by Leuconidae, indicating that this family may be specifically well-acclimated to shelf habitat. These patterns imply that certain cumacean families, such as the Diastylidae, Lampropidae, and Leuconidae, have developed distinct adaptations (physiological, behavioral, or morphological) to remain in particular ecological niches, reflecting the impact of habitat conditions on the genetic distribution of cumaceans.

Analysis of fluctuations in four distance metrics using multiple sequence alignment (MSA): a) Least Squares distance, b) Robinson-Foulds distance, c) normalized Robinson-Foulds distance, and d) Euclidean distance. These distance variations are studied to establish their dissimilarity with the variation in wind speed (m/s) at the start of sampling.

Figure 6:Analysis of fluctuations in four distance metrics using multiple sequence alignment (MSA): a) Least Squares distance, b) Robinson-Foulds distance, c) normalized Robinson-Foulds distance, and d) Euclidean distance. These distance variations are studied to establish their dissimilarity with the variation in wind speed (m/s) at the start of sampling.

Analysis of fluctuations in four distance metrics using multiple sequence alignment (MSA): a) Least Squares distance, b) Robinson-Foulds distance, c) normalized Robinson-Foulds distance, and d) Euclidean distance. These variations in distance are studied to establish their dissimilarity with the variation in Otextsubscript2 concentration (mg/L) at the sampling sites.

Figure 7:Analysis of fluctuations in four distance metrics using multiple sequence alignment (MSA): a) Least Squares distance, b) Robinson-Foulds distance, c) normalized Robinson-Foulds distance, and d) Euclidean distance. These variations in distance are studied to establish their dissimilarity with the variation in Otextsubscript2 concentration (mg/L) at the sampling sites.

The divergence between the genetic sequences and two attributes, one climatic (wind speed (m/s) at the start of sampling) and the other environmental (O2 concentration (mg/L)) is presented in Figure 6 and Figure 7. All the attributes given in the first step of the aPhyloGeo software section were analyzed and their script and figure will be soon available in the imgimg and scriptscript python file on GitHub. However, only these two attributes showed the most interesting mutation rate. Using the four metrics mentioned in the section Metrics, we noticed that the Euclidean distance is particularly sensitive to our data, manifesting considerable sequence variation at the position in MSA 520-529 amino acids (aa) (Euclidean distance: 0.8 < x < 0.9; Figure 6d) and 1190-199 aa (Euclidean distance: 1.2 < x < 1.3; Figure 7d). Unlike the other windows for this metric in the two figures (see Figure 6d and Figure 7d), the fluctuations in wind speed (m/s) at the start of sampling and in O2 concentration (mg/L) do not appear to explain the variations in these two specific sequences. This implies that these genetic sites are subject to selection pressures or evolutionary changes, due to biological (O2 concentration (mg/L)) and meteorological conditions (wind speed (m/s) at the start of sampling). These results align with our study’s aim to identify the genetic region of cumaceans with the highest mutation rate linked to a specific habitat attribute.

These results provide important insight into the genetic adaptation of cumaceans to their environment. These results need to be analyzed in greater depth to certify their involvement, especially in contrast with Uhlir et al., 2021, which investigated similar topics of environmental and climatic effects on cumaceans distribution and genetics. The aPhyloGeo package is still in the process of being updated. A more in-depth analysis of the results is available on GitHub in the supplementary file.

Conclusion

This study examines the effects of meteorological, regional, and ecosystemic attributes on the genetics of cumaceans in the waters surrounding Iceland. Our main objective is to determine whether there is a divergence between precise genetic information of the 16S rRNA mitochondrial gene region (i.e., a window) of cumacean species and their habitat attributes. In addition to data distribution representations (see Figure 2, Figure 3, Figure 4 and Figure 5), DNA sequence analyses have identified specific genetic windows with high mutation rates in response to atmospheric and biological attributes such as wind speed (m/s) at the start of sampling (Position in MSA: 520-529 aa; Euclidean distance: 0.8 < x < 0.9; Figure 6d) and O2 concentration (mg/L) (Position in MSA: 1190-1199 aa; Euclidean distance: 1.2 < x < 1.3; Figure 7d). These results imply variable genetic sites that could contribute to the evolutionary acclimatization of cumaceans to their fluctuating environments and that some cumacean species have diverged from other populations and have genetically adapted to both attributes.

The novelty in our research lies in the exhaustive divergence between habitat attributes and genetic mutability in cumaceans, particularly in identifying genetic windows associated with habitat fluctuations, which has not been widely investigated in previous studies Manel et al., 2003Vrijenhoek, 2009. In this case, our integrated method identifies specific genetic regions sensitive to ecosystemic and atmospheric variations. Thus, by seeking to determine which of these two attributes diverges most with the DNA sequences, the eventual identification of proteins linked to one of these variable DNA sequences will make it possible to represent its functional effects in responses to habitat changes. Our future research will focus on verifying the prediction of this protein and assessing its role in the physiological adaptation of cumaceans to fluctuating conditions, adding a link between genetic data and ecological function.

Interpreting how marine invertebrates genetically adapt to variations in their habitat can help us better predict their responses to climate change and advance conservation plans to protect them. Identifying the specific attributes that influence genetic variability of Cumacea can contribute to the designation and supervision of marine protected areas, assuring they include habitats crucial to the survival and acclimatization of these species. Thus, our results can inform the management of fishing and seabed mining companies by revealing ecologically vulnerable areas where these disturbances can seriously affect benthic biodiversity.

Furthermore, our results provide essential knowledge to guide future studies on the genetic adaptation of Cumacea and other invertebrates to ecological and regional variability. Based on these findings, future research should focus on additional ecosystemic and meteorological attributes, such as nutrient accessibility, water pH, ocean currents, and the degree of human disturbance, to further improve the interpretation of the complex interactions between genetics and the environment. Broadening the scope of application to other marine species, not just marine invertebrates, and diverse geographic regions would allow us to generalize the results more effectively. With this in mind, longitudinal study models on these species could reflect long-term climatic and biological fluctuations and improve our knowledge of the dynamics of genetic acclimatization.

However, it is important to recognize the limitations of our study. In particular, the three missing data points on O2 concentration (mg/L) and the relatively small sample size (n=62n=62) may have induced a bias, which could impact the validity of our interpretations and restrict the generalizability of our results. Moreover, these missing data could provide partial insight into the relationship between O2 concentration (mg/L) and genetic fluctuation in Cumacea, and our sample size may reduce the statistical power of our results. Future studies should address these gaps by incorporating larger sample sizes and more complete datasets to confirm and expand our conclusions. Additionally, as our research focuses solely on the mitochondrial sequence of the 16S rRNA gene, utilizing more elaborate genomic methods, such as whole-genome sequencing, could help us better understand the genetic variety and the global acclimatization mechanisms of marine species. This would provide more comprehensive genetic databases to improve our accuracy and knowledge in identifying existing (and new) marine invertebrate species using DNA barcoding (e.g., mitochondrial DNA cytochrome c oxidase I (COX1)). Finally, multidisciplinary collaborations between ecology, genetics, and oceanography would be essential to enhance knowledge sharing and its application in future research.

Acknowledgments

The authors thank the SciPy conference and reviewers for their valuable comments on this paper as well as Mansour Kebe for his technical support and Carolin Uhlir for her clarifications and advice on her study Uhlir et al., 2021. This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), the Fonds de recherche du Québec - Nature et technologies (FRQNT), the Université de Sherbrooke grant, and the Centre de recherche en écologie de l’Université de Sherbrooke (CREUS).

References
  1. Schnurr, S., Brandt, A., Brix, S., Fiorentino, D., Malyutina, M., & Svavarsson, J. (2014). Composition and distribution of selected munnopsid genera (Crustacea, Isopoda, Asellota) in Icelandic waters. Deep Sea Research Part I: Oceanographic Research Papers, 84, 142–155. 10.1016/j.dsr.2013.11.004
  2. Meißner, K., Brenke, N., & Svavarsson, J. (2014). Benthic habitats around Iceland investigated during the IceAGE expeditions. Polish Polar Research, 35(2), 177–202. 10.2478/popore-2014-0016
  3. Uhlir, C., Schwentner, M., Meland, K., Kongsrud, J. A., Glenner, H., Brandt, A., Thiel, R., Svavarsson, J., Lörz, A.-N., & Brix, S. (2021). Adding pieces to the puzzle: insights into diversity and distribution patterns of Cumacea (Crustacea: Peracarida) from the deep North Atlantic to the Arctic Ocean. PeerJ, 9, e12379. 10.7717/peerj.12379
  4. Levin, L. A., & Dayton, P. K. (2009). Ecological theory and continental margins: where shallow meets deep. Trends in Ecology & Evolution, 24(11), 606–617. 10.1016/j.tree.2009.04.012
  5. Rogers, A. D., Baco, A., Griffiths, H., Hart, T., & Hall-Spencer, J. M. (2007). Corals on seamounts. Seamounts: Ecology, Fisheries & Conservation, 141–169. 10.1002/9780470691953.ch8