TY - JOUR
T1 - High-quality genome assemblies for nine non-model North American insect species representing six orders (Insecta: Coleoptera, Diptera, Hemiptera, Hymenoptera, Lepidoptera, Neuroptera)
AU - Walden, Kimberly K.O.
AU - Cao, Yanghui
AU - Fields, Christopher J.
AU - Hernandez, Alvaro G.
AU - Rendon, Gloria A.
AU - Robinson, Gene E.
AU - Skinner, Rachel K.
AU - Stein, Jeffrey A.
AU - Dietrich, Christopher H.
N1 - Publisher Copyright: © 2024 The Author(s). Molecular Ecology Resources published by John Wiley & Sons Ltd.
PY - 2024/11
Y1 - 2024/11
N2 - Field-collected specimens were used to obtain nine high-quality genome assemblies from a total of 10 insect species native to prairies and savannas of central Illinois (USA): Mellilla xanthometata (Lepidoptera: Geometridae), Stenolophus ochropezus (Coleoptera: Carabidae), Forcipata loca (Hemiptera: Cicadellidae), Coelinius sp. (Hymenoptera: Braconidae), Thaumatomyia glabra (Diptera: Chloropidae), Brachynemurus abdominalis (Neuroptera: Myrmeleontidae), Catonia carolina (Hemiptera: Achilidae), Oncometopia orbona (Hemiptera: Cicadellidae), Flexamia atlantica (Hemiptera: Cicadellidae) and Stictocephala bisonia (Hemiptera: Membracidae). Sequencing library preparation from single specimens was successful despite extremely small DNA yields (<0.1 μg) for some samples. Additional sequencing and assembly workflows were adapted to each sample depending on the initial DNA yield. PacBio circular consensus (CCS/HiFi) or continuous long-read (CLR) libraries were used to sequence DNA fragments up to 50 kb in length, with Illumina-sequenced linked reads (TellSeq libraries) and Omni-C libraries used for scaffolding and gap-filling. Assembled genome sizes ranged from 135 Mb to 3.2 Gb. The number of assembled scaffolds ranged from 47 to >13,000, with the longest scaffold per assembly ranging from ~23 to 439 Mb. Genome completeness was high, with BUSCO scores ranging from 85.5% completeness for the largest genome (Stictocephala bisonia) to 98.8% completeness for the smallest genome (Coelinius sp.). The unique content, estimated using RepeatMasker and GenomeScope2, ranged from 50.7% to 75.8% and roughly decreased with increasing genome size. Structural annotation predicted a range of 19,281–72,469 protein models for sequenced species. Sequencing costs per genome at the time ranged from US$3–5k, and each genome averaged ~1600 CPU-hours on a high-performance cluster and required approximately 14 h of bioinformatics analyses for samples using PacBio HiFi data. Most assemblies would benefit from further manual curation to correct possible scaffold misjoins and translocations suggested by off-diagonal or depleted signals in Omni-C contact maps.
KW - annotation
KW - Earth BioGenome Project
KW - genome assembly
KW - Insecta
KW - non-model species
KW - sequencing
UR - http://www.scopus.com/inward/record.url?scp=85201366721&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85201366721&partnerID=8YFLogxK
U2 - 10.1111/1755-0998.14010
DO - 10.1111/1755-0998.14010
M3 - Article
C2 - 39155537
AN - SCOPUS:85201366721
SN - 1755-098X
VL - 24
JO - Molecular Ecology Resources
JF - Molecular Ecology Resources
IS - 8
M1 - e14010
ER -