A targeted sequence capture array for phylogenetics and population genomics in the Salicaceae

Although the cost of whole‐genome sequencing has continued to decrease dramatically over the past decade, the cost and complexity of whole‐genome analyses still limit their utility and accessibility for answering evolutionary questions in novel taxa (Richards, 2018). However, a polished genome assembly is not necessary to address many questions. In this context, several methods have been developed to reduce the cost and effort required to obtain genomic information in novel species (McKain et al., 2018). The recent development of targeted sequence capture presents an affordable method for consistently isolating specific, long, phylogenetically informative regions in the taxa of interest (Gnirke et al., 2009; Mamanova et al., 2010; Hale et al., 2020). Targeted sequence capture uses biotinylated RNA baits to target prepared sequencing library fragments. The baited library fragments can then be pulled out of the solution using streptavidin‐coated magnetic beads to selectively enrich the fragments that contain loci of interest, while discarding the majority of library fragments that do not. Two advantages of this method over other methods of genome sequence partitioning, such as genome skimming and restriction site–associated DNA sequencing (RAD‐seq), are (1) it does not necessarily depend on a highly polished, annotated reference genome, and (2) the same loci can be consistently sequenced at a high depth across individuals without requiring comprehensive, concurrent sequencing of all individuals (Mamanova et al., 2010; Grover et al., 2012; Jones and Good, 2016).

In this paper, we report on the design and implementation of a targeted sequence capture array to collect data for phylogenetic analysis within the Salicaceae, the plant family that includes poplars and willows. Understanding species relationships within this family, and in particular within the genus Salix L., has presented challenges to taxonomists as early as Linnaeus, who noted that “species of this genus are extremely difficult to clarify” (Linnaeus, 1753; Skvortsov, 1999). Salix species present challenges to classification due to their wide geographic ranges, hysteranthous phenology, extensive interspecific hybridization, polyploidy, and the lack of well‐defined flower characters for morphological circumscription of taxa (Raup, 1959; Skvortsov, 1999; Percy et al., 2014; Wang et al., 2020). Species of Salix exhibit holarctic distributions, and there are several classifications that differ among continents and are challenging to synthesize due to non‐overlapping taxonomic treatment of species (Dickmann and Kuzovkina, 2014). Past efforts to reconstruct the phylogeny of Salix using nuclear amplified fragment length polymorphism (AFLP) markers and plastid barcode sequences have resulted in a lack of clearly resolved species relationships, especially in the subgenus Vetrix (Dumort.) Dumort. (Trybush et al., 2008; Percy et al., 2014). A more recent study using a supermatrix approach with RAD‐seq data showed resolution within a subset of species of the subgenera Vetrix and Chamaetia (Dumort.) Nasarov, highlighting the potential of large‐scale molecular data to resolve this phylogenetically challenging group (Wagner et al., 2018).

The utility of RAD‐seq for collecting data for phylogeny, however, is limited by several issues. First, RAD‐seq does not consistently screen homologous regions across species and across different experiments, which limits its utility for adding species to a phylogeny at a later time. Second, because RAD‐seq assesses diversity in very short segments of the genome that contain little potential phylogenetic information independently, this type of data requires the concatenation of loci and the use of supermatrix phylogenetic analyses (de Queiroz and Gatesy, 2007), which do not allow the separate exploration of gene and species phylogenies using supertree methods (Sanderson et al., 1998). Additionally, concatenation approaches are likely to exacerbate problems associated with maximum‐likelihood methods for species with rapid diversification (Edwards et al., 2007; Edwards, 2009). Targeted sequence capture does not have these limitations, and thus may be a more appropriate genotyping platform for phylogenetics.

Species of Populus L. and Salix have been of great interest for the development of forestry and biofuel products, resulting in polished reference genomes for P. trichocarpa Torr. & A. Gray, P. tremula L., P. euphratica Olivier, S. purpurea L., and S. suchowensis W. C. Cheng, as well as shallow resequencing data for many additional species (Tuskan et al., 2018). Our design strategy leveraged this abundance of existing genomic information to quantify polymorphism and the distribution of insertion/deletion polymorphisms (indels) within and among species in order to maximize capture efficiency. Furthermore, because we consistently target exon regions, we are able to leverage information about nucleotide‐site degeneracy to quantify population genomic summary statistics. We demonstrate the utility of this resource for Populus and Salix species by presenting a fully resolved phylogenetic tree for six species and an outgroup, and by estimating the distribution of nucleotide diversity within species for our targeted genes.

METHODS

Probe design

Our goal was to identify regions that could be efficiently captured using RNA bait hybridization for diverse species across the family Salicaceae. The family Salicaceae is thought to have diverged from other clades approximately 92.5 mya (Zhang et al., 2018b). Our primary focus was on the genera Populus and Salix, which diverged approximately 48 mya, and on the species Idesia polycarpa Maxim., which diverged from other clades approximately 56 mya and which we use as an outgroup (Zhang et al., 2018b). Although we were interested in using these probes for phylogenetics with both Populus and Salix species, we focused on maximizing capture efficiency for the species in Salix, because the phylogeny for Populus is already much better resolved than that for Salix (Trybush et al., 2008; Percy et al., 2014; Wang et al., 2014, 2020; Liu et al., 2017). For this reason, the capture baits were designed to target regions in S. purpurea that also would have high capture efficiency across the Salicaceae. The efficiency of RNA bait binding, and thus capture efficiency, is reduced as target regions diverge due to sequence polymorphism (Lemmon and Lemmon, 2013). To improve capture efficiency, we quantified sequence polymorphism among whole‐genome resequencing data from a diverse array of Populus and Salix species (Appendix S1). The whole‐genome short reads of the Populus and Salix species were aligned to the P. trichocarpa genome assembly version 3 (Tuskan et al., 2006) using BWA MEM version 0.7.12 with default parameters (Li, 2013). We used the P. trichocarpa genome as our initial reference because it was the most polished and annotated genome in the genus. Variable sites and indels were identified using SAMtools mpileup (Li, 2011), and read depth for the variant calls was quantified using vcftools (Danecek et al., 2011). Custom Python scripts were used to identify variant and indel frequencies for all exons in the P. trichocarpa genome annotation (scripts available at https://github.com/BrianSanderson/phylo‐seq‐cap [see Data Availability]; Sanderson, 2020).

Orthologs for our candidate loci in the Salix purpurea 94006 genome assembly version 1 (DOE‐JGI, 2016; Carlson et al., 2017; Zhou et al., 2018) were identified from a list of orthologs shared by the P. trichocarpa and S. purpurea genomes prepared using a tree‐based approach by JGI using Phytozome version 12 software (Goodstein et al., 2012). We further screened candidate regions to exclude high‐similarity duplicated regions by accepting only loci with single BLAST (Camacho et al., 2009) hits against the highly contiguous assembly of S. purpurea 94006 version 5 (Zhou et al., 2020), which is less fragmented than the S. purpurea 94006 version 1 genome assembly. Genes from the Salicoid whole‐genome duplication were identified using MCScanX (Wang et al., 2012) using default parameters, and segments for which the average number of synonymous substitutions (K_s value) for paralogous genes was between 0.2 and 0.8 were selected. Genes for which at least 600 bp of exon sequence contained 2–12% polymorphism and fewer than two indels were selected for probe design by Arbor Biosciences (Ann Arbor, Michigan, USA). Probes were designed with 50% overlap across the targeted regions, so that each nucleotide position would potentially be captured by two probes. Finally, to ensure that loci with high divergence across the family would be captured, we identified targets with less than 95% identity (based on BLAST results) between S. purpurea and P. trichocarpa and designed supplementary probes from orthologs of these genes in the I. polycarpa genome.
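The selection thresholds and the 50% probe overlap described above can be illustrated with a short Python sketch. This is not the authors' design script (which is available in the repository cited above); the record fields, function names, and the use of the 120‐bp bait length mentioned in the Discussion are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ExonSummary:
    """Hypothetical per-gene summary of exon variation (illustrative fields)."""
    gene_id: str
    exon_bp: int          # total exon sequence length in bp
    variable_sites: int   # polymorphic sites among the resequenced species
    indels: int           # insertion/deletion polymorphisms in the exons


def passes_design_filter(gene, min_bp=600, min_poly=0.02, max_poly=0.12,
                         max_indels=1):
    """Apply the thresholds from the text: >=600 bp, 2-12% polymorphism,
    and fewer than two indels."""
    if gene.exon_bp < min_bp:
        return False
    poly = gene.variable_sites / gene.exon_bp
    return min_poly <= poly <= max_poly and gene.indels <= max_indels


def tile_probes(target_len, bait_len=120, overlap=0.5):
    """Return (start, end) bait coordinates tiled with a fractional overlap."""
    step = int(bait_len * (1 - overlap))
    if target_len <= bait_len:
        return [(0, target_len)]
    starts = list(range(0, target_len - bait_len + 1, step))
    # Add a final bait flush with the end so the last bases are covered.
    if starts[-1] + bait_len < target_len:
        starts.append(target_len - bait_len)
    return [(s, s + bait_len) for s in starts]
```

With these assumed settings, a 300‐bp target is tiled by four baits starting every 60 bp, so each interior position falls within two baits, which matches the intent of the 50% overlap.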

Library preparation and sequence capture

Libraries for two individuals from each of P. balsamifera L., P. tremula, P. mexicana Wesm., S. nigra Marshall, S. exigua Nutt., and S. phlebophylla Andersson (Appendix S2) were prepared using the NEBNext Ultra II DNA Prep Kit (New England Biolabs, Ipswich, Massachusetts, USA) following the manufacturer’s protocol, and quantified using an Agilent Bioanalyzer 2100 DNA 1000 kit (Agilent Technologies, Santa Clara, California, USA). Libraries were pooled at equimolar concentrations into two pools of six prior to probe hybridization following the Arbor Biosciences myBaits protocol version 3.0.1 and Hale et al. (2020). The hybridized samples were subsequently pooled at equimolar ratios and sequenced at the Texas Tech Center for Biotechnology and Genomics using a MiSeq with the v2 Micro kit and 150‐bp paired‐end reads (Illumina, San Diego, California, USA).

Analysis of sequence capture data

Read data were trimmed for primer sequences and low quality scores using Trimmomatic version 0.36 (Bolger et al., 2014). The trimmed read data, as well as the whole‐genome reads for I. polycarpa, were assembled into gene sequences using the HybPiper pipeline (Johnson et al., 2016). We estimated the depth of read coverage across all targeted genes as well as at off‐target sites in R version 3.3.0 (R Core Team, 2016). The assembled amino acid sequences were aligned with MAFFT version 7.310 using the parameters --localpair and --maxiterate 1000 (Katoh and Standley, 2013), converted into codon‐aligned nucleotide alignments with PAL2NAL version 14 (Suyama et al., 2006), and trimmed for quality and large gaps with trimAl version 1.4.rev15 with the parameter -gt 0.5 (Capella‐Gutiérrez et al., 2009).
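The conversion of an amino acid alignment into a codon‐aligned nucleotide alignment can be sketched in a few lines of Python. This is only an illustration of the idea, under the assumption that the coding sequence is in frame and exactly three times the number of aligned residues; the function name is hypothetical, and the sketch does not reproduce PAL2NAL's handling of frameshifts, stop codons, or mismatches.

```python
def thread_codons(aligned_protein, cds):
    """Map an unaligned coding sequence onto a gapped protein alignment row.

    Each amino acid consumes one codon (3 nt) from `cds`; each gap character
    '-' in the protein row becomes a three-base gap '---'.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) != 3 * n_residues:
        raise ValueError("CDS length must be three times the number of residues")
    out = []
    pos = 0
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")
        else:
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)
```

Because gaps are inserted in whole codons, the reading frame is preserved, which is what later allows synonymous and nonsynonymous sites to be distinguished.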

HybPiper provides warnings for genes that have multiple competing assemblies that are within 80% of the length of the target region, because the alternate alignments may indicate that those genes have paralogous copies in the genome. We estimated phylogenetic relationships using the full set of gene sequences recovered from our sequence capture data, as well as a restricted set of putatively single‐copy genes, based on our a priori list of shared paralogs between S. purpurea and P. trichocarpa, and supplemented by the list of paralog warnings from HybPiper.
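The logic behind this kind of paralog warning can be sketched as follows. The snippet is a simplified illustration that uses the 80% length criterion stated above; it is not HybPiper code, and the function name and input dictionaries are hypothetical.

```python
def flag_possible_paralogs(contig_lengths, target_lengths, min_fraction=0.8):
    """Return gene IDs with two or more assembled contigs that each span at
    least `min_fraction` of the target length."""
    flagged = []
    for gene, lengths in contig_lengths.items():
        cutoff = min_fraction * target_lengths[gene]
        long_contigs = [n for n in lengths if n >= cutoff]
        if len(long_contigs) >= 2:
            flagged.append(gene)
    return sorted(flagged)
```

A gene flagged in this way is not necessarily paralogous; the competing assemblies could also reflect divergent alleles, so flagged genes are best treated as candidates for closer inspection.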

We estimated gene trees using RAxML version 8.2.10 (Stamatakis, 2014), specifying a GTRΓ model of sequence evolution. A set of 250 bootstrap replicates was generated for each gene tree. We used ASTRAL‐III (Zhang et al., 2018a; Rabiee et al., 2019) to infer the species tree from the RAxML gene trees. Because all nodes are weighted equally during quartet decomposition in ASTRAL‐III, we used sumtrees in the Python package DendroPy version 4.4.0 (Sukumaran and Holder, 2010) to collapse nodes with less than 33% bootstrap support values prior to species tree estimation. A set of 100 multilocus bootstrap replicates was generated for the species tree. We used phyparts (Smith et al., 2015) to determine the extent of congruence among gene trees for each node in the species tree. Cladograms representing the gene tree congruence and alternate topologies were plotted with the scripts phypartspiecharts.py and minority_report.py (scripts written by Matt Johnson [Texas Tech University], available at https://github.com/mossmatters/phyloscripts).
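To make the node‐collapsing step concrete, the following Python sketch shows what it means to collapse branches with less than 33% bootstrap support into polytomies. It uses a small hypothetical tree class rather than DendroPy, so it illustrates the operation and is not the command that was run in this study.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: Optional[str] = None          # leaf label; None for internal nodes
    support: Optional[float] = None     # bootstrap support (%) of this clade
    children: List["Node"] = field(default_factory=list)


def collapse_low_support(node, min_support=33.0):
    """Return a copy of the tree in which internal clades with support below
    `min_support` are dissolved, leaving a polytomy at the parent."""
    new_children = []
    for child in node.children:
        collapsed = collapse_low_support(child, min_support)
        weak = (bool(collapsed.children)
                and collapsed.support is not None
                and collapsed.support < min_support)
        if weak:
            new_children.extend(collapsed.children)  # promote grandchildren
        else:
            new_children.append(collapsed)
    return Node(node.name, node.support, new_children)


def leaf_names(node):
    """List the leaf labels below a node, left to right."""
    if not node.children:
        return [node.name]
    return [n for c in node.children for n in leaf_names(c)]
```

Collapsing does not remove any taxa; it only removes poorly supported groupings, so that a gene tree contributes quartets to the species tree estimate only where it carries some signal.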

Finally, we used custom Python scripts to quantify nucleotide diversity at synonymous and nonsynonymous sites between the individuals of the same species, as well as correlations in values of per‐site nucleotide diversity between all species. The scripts described above, as well as the full details of these analyses, are available at https://github.com/BrianSanderson/phylo‐seq‐cap (Sanderson, 2020).
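As a rough illustration of the diversity calculation, nucleotide diversity can be computed as the average proportion of pairwise differences among aligned sequences. The sketch below is a simplified stand‐in for the custom scripts cited above; the function name, the optional list of site indices (for example, a precomputed list of synonymous positions), and the treatment of gaps and ambiguous bases are assumptions.

```python
from itertools import combinations


def nucleotide_diversity(seqs, sites=None):
    """Average pairwise differences per compared site (Nei's pi).

    `seqs` are equal-length aligned sequences; `sites` optionally restricts
    the calculation to a list of column indices. Columns where either
    sequence has a gap or an ambiguous base are skipped for that pair.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    columns = range(length) if sites is None else list(sites)
    valid = set("ACGT")
    per_pair = []
    for a, b in combinations(seqs, 2):
        compared = diffs = 0
        for i in columns:
            x, y = a[i].upper(), b[i].upper()
            if x in valid and y in valid:
                compared += 1
                diffs += x != y
        if compared:
            per_pair.append(diffs / compared)
    return sum(per_pair) / len(per_pair) if per_pair else float("nan")
```

Passing separate index lists for synonymous and nonsynonymous positions yields the two classes of estimates; how heterozygous genotypes within a diploid individual are handled is left out of this sketch.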

RESULTS

Sequence capture efficiency

The final capture kit targets 972 genes covered by 12,951 probes based on the S. purpurea reference genome, and an additional 7049 (redundant) probes based on the I. polycarpa genome that target genes with the highest divergence between S. purpurea and P. trichocarpa (<95% identity in BLAST results of probes against the P. trichocarpa v3 genome sequence). This included an average of 680 ± 309 (mean ± SD) probes on each S. purpurea chromosome (Appendix S3), with an average of 1098 ± 489 bp (mean ± SD) of exon sequence per gene. Of the 972 target genes, 593 are putatively single copy based on our identification of paralogs in the S. purpurea genome assembly, 142 represent pairs of paralogs from the shared Salicoid whole‐genome duplication (i.e., 71 pairs of genes), and 237 are genes that have known paralogs for which we were not able to design targets in this kit (i.e., each of these genes has one or more paralogs in the S. purpurea genome that is not targeted by probes). We included a total of 1219 genes in the target file used to assemble the capture data, which includes the 972 targeted genes as well as paralogous copies for which probes were not designed. Because the issues of paralogy become more complex when we add species other than S. purpurea and P. trichocarpa, we advise using the HybPiper warnings of multiple competing long assemblies to assess paralogy in novel species following guidance from Johnson (2017a). The capture probe sequences and the target reference file are accessible at https://github.com/BrianSanderson/phylo‐seq‐cap (Sanderson, 2020). The sequence capture kit is available from Arbor Biosciences (Ref #170424‐30 “Salicaceae”).

Sequence capture efficiency was high among the libraries. We recovered 805,820 ± 178,482 reads (mean ± SD) from our Populus and Salix target capture libraries, of which 86.7% ± 1.15% (mean ± SD) mapped to the target sequence reference (Table 1). An average of 94.48% ± 1.37% of targeted exon sequences were covered by ≥10 reads. The average read depth was 44.65 ± 1.61 for on‐target sites, and 14.48 ± 2.10 for off‐target sites (Appendix S4).

Table 1. Coverage summary statistics for sequence capture read data. For each library, values represent the number of reads in the sequenced library, the number of those reads that mapped to the reference file for the targeted genes, the proportion of mapped reads, the number of targeted genes (out of 972) that had read data mapped to them, and the number of genes that had 25%, 50%, 75%, and 100% of the targeted sequences covered with >10× reads.

					No. of genes with % targeted sequences
Name	No. of reads	Reads mapped	Proportion mapped	Genes mapped	25%	50%	75%	100%
I_polycarpa_WGS‐2^a	223,470,714	1,653,494	0.007	971	970	966	944	123
P_balsamifera_MGR‐01	614,093	523,321	0.852	972	971	960	884	122
P_balsamifera_MGR‐04	769,303	659,712	0.858	972	972	965	915	145
P_mexicana_PM3	843,032	739,728	0.878	972	972	964	917	140
P_mexicana_PM5	880,962	768,927	0.873	972	972	967	913	142
P_tremula_R01‐01	749,220	638,002	0.852	971	970	960	907	134
P_tremula_R04‐01	634,625	539,805	0.851	971	969	956	876	122
S_exigua_SE002	1,139,616	998,698	0.876	969	969	966	937	229
S_exigua_SE053	843,120	741,938	0.880	969	969	964	928	195
S_nigra_SG037	1,166,615	1,028,635	0.882	971	971	967	932	205
S_nigra_SG051	602,649	524,993	0.871	971	970	961	903	136
S_phlebophylla_SP15M	753,628	651,791	0.865	972	972	967	939	204
S_phlebophylla_SP7F	672,975	581,147	0.864	972	972	967	925	203

^aThe I_polycarpa_WGS‐2 data is from whole‐genome sequencing data, rather than targeted sequence capture, and thus the low percentage of read mapping reflects the lack of target enrichment (although the read coverage across targets was comparable to the sequence capture libraries [Appendix S4]).
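The "Proportion mapped" column in Table 1 is the number of mapped reads divided by the total number of reads in each library. The short Python check below recomputes it for three rows copied from the table; the dictionary layout and function name are illustrative only.

```python
# Read counts copied from Table 1: (total reads, reads mapped to targets).
libraries = {
    "P_balsamifera_MGR-01": (614_093, 523_321),
    "S_nigra_SG037": (1_166_615, 1_028_635),
    "I_polycarpa_WGS-2": (223_470_714, 1_653_494),
}


def proportion_mapped(total_reads, mapped_reads):
    """Fraction of sequenced reads that mapped to the target reference."""
    return mapped_reads / total_reads


# Round to three decimals, as reported in the table.
proportions = {name: round(proportion_mapped(*counts), 3)
               for name, counts in libraries.items()}
```

The contrast between the capture libraries and the whole‐genome library shows the effect of enrichment: most capture reads fall on targets, whereas fewer than 1% of unenriched whole‐genome reads do.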

Phylogenetics

The species tree estimated with putatively single‐copy genes correctly paired all individuals of the same species and revealed a fully resolved phylogeny for the Populus and Salix species with 100% multilocus bootstrap support for all nodes (Fig. 1A). At least 85% of gene trees support the topology of the species tree (Fig. 1B), with the exceptions of the bipartition that separates P. balsamifera and P. tremula, and the bipartition that separates S. phlebophylla from the other Salix species, which had dominant alternate topologies that were supported by a large number of gene trees (Appendices S5, S6). The topology of the species tree estimated with the full set of genes and known paralogs was nearly identical to the tree estimated with only the putatively single‐copy genes. The major difference between these trees was evident in the bipartition separating P. balsamifera and P. tremula, where there were a large number of alternative topologies supported by small numbers of gene trees (the top three were supported by 13, 11, and 10 gene trees; Appendix S7).

Figure 1. Species trees estimated for the 432 putatively single‐copy genes that did not have paralog warnings reported by HybPiper. (A) Species tree generated by ASTRAL‐III for the gene trees. Node values represent bootstrap support from 100 multilocus bootstrap replicates in ASTRAL‐III. Branch lengths represent coalescent units. (B) Cladogram showing the congruence of gene trees for all nodes in the ASTRAL‐III species tree. The numbers above each node represent the number of gene trees that support the displayed bipartition, and numbers below the node represent the number of gene trees that support all alternate bipartitions. Purple wedges represent the proportion of gene trees that support the displayed bipartition. Blue wedges represent the proportion of gene trees that support a single alternative bipartition (see Appendices S5, S6). Green wedges represent the proportion of gene trees that have multiple conflicting bipartitions. Yellow wedges represent the proportion of gene trees that have no supported bipartition. Plotting code and its interpretation were provided by Matt Johnson (for more detail, see Johnson, 2017b).

Population genomics

Patterns of nucleotide diversity, measured as Nei’s π (Nei and Li, 1979), varied among species, with the greatest variation at synonymous sites (Appendices S8, S9). Populus tremula had the highest average values of π at both synonymous and nonsynonymous sites (Fig. 2). The values of π among species were highly correlated for species within genera and exhibited lower correlations between genera (Fig. 3).
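The between‐species comparison rests on Pearson's r computed over per‐gene values of π. A minimal Python sketch of that statistic is given below; the function name and the idea of passing two equal‐length lists of per‐gene π values are assumptions, and the authors' own scripts may differ in how missing genes are handled.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)
```

In this setting, each list would hold π for the same genes in the same order for two species, so that a high r indicates that genes with high diversity in one species also tend to have high diversity in the other.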

Figure 2. Means and 95% confidence intervals of values of nucleotide diversity (Nei’s π) within each species at synonymous (yellow) and nonsynonymous (purple) sites.

Figure 3. Pairwise correlation (Pearson’s r) of values of Nei’s π between all species. Values above the diagonal represent the correlation of π at synonymous sites; values below the diagonal represent the correlation at nonsynonymous sites. Boxes outlined in black represent within‐genus comparisons.

DISCUSSION

The decreasing cost of obtaining genomic and transcriptomic sequence data holds great promise for unlocking our understanding of phylogenetic relationships and population genetic patterns within and among complex taxonomic groups. However, assembling complete genomes is still not a trivial task, and there exist relatively few polished plant reference genomes onto which genome skimming data can be mapped. Many methods have been developed to reduce the sequencing and analytical burdens associated with obtaining genome data. We believe that targeted sequence capture is one of the most promising contemporary methods of inexpensively generating genomic information.

The efficiency of our targeted sequence capture array was extremely high, which yielded abundant phylogenetic information for six species of Populus and Salix. Overall, the phylogeny was fully resolved and conformed to our general understanding of the relationships among the taxa (Wu et al., 2015; Wang et al., 2020). One strength of the sequence capture approach is that it provides sufficiently long contiguous segments of gene sequences to assemble gene trees, enabling the use of supertree methods, which can overcome the problems introduced by concatenation of multiple gene regions with divergent histories (Edwards et al., 2007; Edwards, 2009). The supertree approach also allowed for the identification of alternative evolutionary histories that are supported by different regions of the genome, as often occurs during historical hybridization and introgression (Zhang et al., 2018a; Rabiee et al., 2019). Our species tree identified three alternative gene tree relationships among the three Populus species (Appendix S5). Previous studies have provided evidence of historical introgression among these species, including a history of chloroplast capture and hybridization between P. mexicana and species in the section Tacamahaca Spach (including P. balsamifera; Wang et al., 2014, 2020; Liu et al., 2017). The second most‐supported alternative topology that we recovered placed P. mexicana and P. tremula as sister taxa, a pattern that does not support this hypothesis, likely due to incomplete lineage sorting (Wang et al., 2020). Populus tremula likely has a greater long‐term effective population size than P. balsamifera (Wang et al., 2016), and so coalescence times may be shorter on average in P. balsamifera. Among the Salix species, we identified three alternative gene tree relationships between the S. phlebophylla and S. exigua individuals, which may reflect the histories of rapid speciation and hybridization that have long vexed attempts at phylogenetic reconstruction in the genus Salix (Appendix S6; Trybush et al., 2008; Percy et al., 2014). Both of these patterns in Populus and Salix may be better understood once additional taxa are added to this phylogeny.

We have also shown that this sequence capture design can be applied to address questions related to population genomics in the Salicaceae. Many of the advantages of targeted sequence capture over competing methods are of particular relevance for population genomic studies, including specific knowledge of loci being sequenced; the ability to differentiate among synonymous, nonsynonymous, intronic, and intergenic loci; and the ability to collect data on the same set of loci across different experiments, either within species or across species, for comparative studies. In particular, synonymous sites, especially fourfold synonymous sites, are among the fastest‐evolving regions of the genome and the sites within genic regions least influenced by selection (Wright and Andolfatto, 2008), and are thus among the best regions for estimating patterns of historical demography. Our estimates of nucleotide diversity are similar to those that have been previously reported for P. balsamifera and P. tremula using Sanger sequencing data (Ingvarsson, 2005; Olson et al., 2010) and whole‐genome sequencing data (Wang et al., 2016). The high estimates of diversity in S. phlebophylla compared to the other two Salix species are curious and may result from a history with relatively little migration due to the absence of glaciation over a large portion of its Beringian distribution (Hultén, 1937).

The current study is based on a small sample size per species (n = 2), and so our ability to account for population structure or robustly perform population genomic inferences with these data is limited. Additionally, a potential limitation for using this sequence capture array for comparative population genomics is that we screened loci for a range of among‐species variability of 2–12%, which excludes loci that exhibit extremely high or low values of nucleotide diversity. This may bias estimates of nucleotide diversity arising from these probes toward greater evenness. The ability to identify synonymous sites, which are the closest to neutral among all classes of sites (Wright and Andolfatto, 2008), should partially address this bias. Another feature of sequence capture data is the recovery of “off‐target” sequences that result from the fact that the insert size of libraries is larger than the 120‐bp bait length, and so regions upstream and downstream of the target will be sequenced as well. These regions may include intronic and intergenic regions, as well as exonic sequences that deviate from the constraints we used for our design. The results we report here only incorporate the “on‐target” sites that we sequenced, but HybPiper implements methods to assemble intronic sequences as well. However, the potential effects of hitchhiking selection on synonymous site variation will likely remain apparent.

We also found that it was straightforward to integrate the targeted sequence capture data with whole‐genome sequence data using the HybPiper pipeline by simply including the FASTQ files from whole‐genome reads in the pipeline. This strategy was used to successfully incorporate whole‐genome sequencing data from I. polycarpa, to act as our outgroup. The proportion of gene coverage as well as the read depth for the I. polycarpa data was similar to the sequence capture libraries (Table 1).

A whole‐genome duplication occurred prior to the divergence of Salix and Populus, and there are at least 8000 known paralog pairs in the P. trichocarpa reference genome (Tuskan et al., 2006). Genes with paralogous copies in the genome can complicate gene assemblies, because sequence data from both copies may alternately align to the same target sequence. We identified paralogous sequences in the S. purpurea genome assembly using MCScanX, and used that information to assist in the design of the sequence capture array. The final array includes 593 putatively single‐copy genes, 142 genes that form 71 pairs of paralogs, and 237 genes that have paralogs but for which we were not able to include both paralogs in the kit due to our selection criteria. The target reference file we used to map the sequence capture data thus includes 1219 genes, including the single‐copy and known paralogs from S. purpurea. In addition to this, HybPiper provides warnings for genes that have multiple competing alignments that cover the majority of the target sequence, which may indicate the presence of multiple paralogous copies in the genome (Johnson et al., 2016). This will be particularly useful because the genes that have maintained paralogous copies are likely to differ among species throughout the diversification of willows. We estimated evolutionary relationships using both the full set of 1219 single‐copy and known paralog genes, as well as a limited set of only single‐copy genes that did not report paralog warnings. The results from both analyses were nearly the same, but this will likely not be true for a more complex phylogenetic analysis that includes more than six species and an outgroup. For those more complex phylogenetic analyses, the ability to compare trees constructed with single‐copy genes with those using paralogous copies may provide crucial information for reconciling evolutionary relationships.

This sequence capture array will provide the community with an excellent resource to consistently sequence a set of variable regions of the genome for phylogenetic and population genomic investigations in the Salicaceae. The rate of read mapping and coverage of target genes was remarkably consistent across both genera, despite the fact that the taxa were selected to maximize sampling of phylogenetic diversity within each genus. The Salicaceae are important plants in the Northern Hemisphere both ecologically and economically and have been the subjects of numerous population genetic and population genomic investigations of speciation, hybridization, introgression, selection, and historical population size and migration. This resource will allow phylogenetic and comparative population genomic studies to assess the same loci across different studies, which will allow us to build a worldwide diversity database and facilitate more precise comparative research questions. Our results demonstrate that the rate of gene capture is extremely high, such that it would be unnecessary to filter data and determine appropriate overlapping genotype thresholds, as is necessary with random genome partitioning methods such as RAD‐seq.

Acknowledgments

The authors thank M. Johnson (Texas Tech University) for help with data analysis, P. Ingvarsson (Sveriges lantbruksuniversitet, Uppsala, Sweden) for help collecting Populus tremula, J. Martinez (Instituto de Geología, UNAM, Hermosillo, Mexico) for help collecting P. mexicana, and J. Phillips (Phytozome) for help in identifying orthologs between P. trichocarpa and S. purpurea. Funding for this project was provided by the U.S. National Science Foundation (IOS‐1542509 and IOS‐1542599), Genome Canada (168BIO), and the National Natural Science Foundation of China (31561123001). The genetic material used to generate the genomic resources from the Agriculture Canada Balsam Poplar (AgCanBaP) collection belong to Her Majesty the Queen in Right of Canada as represented by the Minister of Agriculture and Agri‐Food. Agriculture and Agri‐Food Canada retains complete ownership of the resources presented here.

AUTHOR CONTRIBUTIONS

S.P.D. and M.S.O. conceived the study. S.P.D., Q.C.B.C., T.M., and M.S.O. secured funding to support the project. B.J.S. and S.P.D. designed the sequence capture array. Q.C.B.C. and T.M. provided whole‐genome sequence data. B.J.S. and M.S.O. prepared and sequenced the DNA samples, analyzed the data, interpreted the results, and wrote the manuscript. All authors edited drafts of the manuscript and approved the final version.

Data Availability

Accession numbers for all sequence data used to design the sequence capture array are presented in Appendix S1. The raw reads of targeted sequence capture data from the six species of Populus and Salix are available on the National Center for Biotechnology Information Sequence Read Archive (NCBI SRA) under the BioProject accession number PRJNA627181. The raw reads of the Idesia polycarpa whole‐genome sequence data are available in the Genome Warehouse of the Beijing Institute of Genomics (BIG), under the accession number PRJCA002959. The sequences of the probes that were designed, all of the custom Python scripts that were used for this study, and the full details of analyses summarized in notebooks are available at https://github.com/BrianSanderson/phylo‐seq‐cap (Sanderson, 2020).

© 2020. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Premise

The family Salicaceae has proved taxonomically challenging, especially in the genus Salix, which is speciose and features frequent hybridization and polyploidy. Past efforts to reconstruct the phylogeny with molecular barcodes have failed to resolve the species relationships of many sections of the genus.

Methods

We used the wealth of sequence data in the family to design sequence capture probes to target regions of 300–1200 bp of exonic regions of 972 genes.

Results

We recovered sequence data for nearly all of the targeted genes in three species of Populus and three species of Salix. We present a species tree, discuss concordance among gene trees, and present population genomic summary statistics for these loci.

Conclusions

Our sequence capture array has extremely high capture efficiency within the genera Populus and Salix, resulting in abundant phylogenetic information. Additionally, these loci show promise for population genomic studies.

Details

Title

A targeted sequence capture array for phylogenetics and population genomics in the Salicaceae

Author

Sanderson, Brian J¹; DiFazio, Stephen P²; Cronk, Quentin C B³; Ma, Tao⁴; Olson, Matthew S⁵

¹ Department of Biological Sciences, Texas Tech University, Lubbock, Texas, USA; Department of Biology, West Virginia University, Morgantown, West Virginia, USA
² Department of Biology, West Virginia University, Morgantown, West Virginia, USA
³ Department of Botany, University of British Columbia, Vancouver, British Columbia, Canada
⁴ Key Laboratory of Bio‐Resource and Eco‐Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, People’s Republic of China
⁵ Department of Biological Sciences, Texas Tech University, Lubbock, Texas, USA

Section

Genomic Resources Article

Publication year

2020

Publication date

Oct 2020

Publisher

John Wiley & Sons, Inc.

e-ISSN

21680450

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1002/aps3.11394

ProQuest document ID

2455943957
