The structure, function, and evolution of a complete human chromosome 8
The complete assembly of each human chromosome is essential for understanding human biology and evolution. Using complementary long-read sequencing technologies, we complete the first linear assembly of a human autosome, chromosome 8. Our assembly resolves the sequence of five previously long-standing gaps, including a 2.08 Mbp centromeric α-satellite array, a 644 kbp defensin copy number polymorphism important for disease risk, and an 863 kbp variable number tandem repeat at chromosome 8q21.2 that can function as a neocentromere. We show that the centromeric α-satellite array is generally methylated except for a 73 kbp hypomethylated region of diverse higher-order α-satellite enriched with CENP-A nucleosomes, consistent with the location of the kinetochore. Using a dual long-read sequencing approach, we complete the assembly of the orthologous chromosome 8 centromeric regions in chimpanzee, orangutan, and macaque for the first time to reconstruct its evolutionary history. Comparative and phylogenetic analyses show that the higher-order α-satellite structure evolved specifically in the great ape ancestor, and the centromeric region evolved with a layered symmetry, with more ancient higher-order repeats located at the periphery adjacent to monomeric α-satellites. We estimate that the mutation rate of centromeric satellite DNA is accelerated at least 2.2-fold, and this acceleration extends beyond the higher-order α-satellite into the flanking sequence.
A Tree of Human Gut Bacterial Species and its Applications to Metagenomics and Metaproteomics Data Analysis
Here we examined the impact of genome incompleteness on tree reconstruction, and we showed that phylogeny approaches including RAxML (which handles missing data explicitly) and FastTree generally performed well on a simulated collection of 400 genomes with missing information. As RAxML is computationally prohibitive for the much larger collections of gut genomes, we chose FastTree to build a unified tree of human gut-associated bacterial species (referred to as the gut tree), including more than 3000 genomes, most of which are incomplete. We developed two downstream applications of the gut tree: peptide-centric analysis of metaproteomics datasets, and taxonomic characterization of metagenomic sequences. In both applications, the gut tree provided the basis for quantification of species composition at various taxonomic resolutions. Conclusions: The gut tree presented in this study provides a useful framework for taxonomic profiling of the human gut microbiome. Including metagenome-assembled genomes (MAGs) in the tree provides a more comprehensive representation of the microbial species diversity associated with the human gut, which is important for studying the taxonomic composition of the gut microbiome.
Samplot: A Platform for Structural Variant Visual Validation and Automated Filtering
Visual validation is an essential step in structural variant (SV) detection to eliminate false positives. We present Samplot, a tool for quickly creating images that display the read depth and sequence alignments necessary to adjudicate purported SVs across multiple samples and sequencing technologies, including short, long, and phased reads. These simple images can be rapidly reviewed to curate large SV call sets. Samplot is easily applicable to many biological problems such as prioritization of potentially causal variants in disease studies, family-based analysis of inherited variation, or de novo SV review. Samplot also includes a trained machine learning package that dramatically decreases the number of false positives without human review. Samplot is available via the conda package manager or at https://github.com/ryanlayer/samplot.
nanoDoc: RNA modification detection using Nanopore raw reads with Deep One-Class Classification
Advances in Nanopore single-molecule direct RNA sequencing (DRS) have presented the possibility of detecting comprehensive post-transcriptional modifications (PTMs) as an alternative to experimental approaches combined with high-throughput sequencing. It has been shown that the DRS method can detect the change in the raw electric current signal of a PTM; however, the accuracy and reliability still require improvement. Here, we present new software, called nanoDoc, for detecting PTMs from DRS data using a deep neural network. Current signal deviations caused by PTMs are analyzed via Deep One-Class Classification with a convolutional neural network. Using a ribosomal RNA dataset, the software achieved an area under the curve (AUC) of 0.96 for the detection of 23 different kinds of modifications in Escherichia coli and Saccharomyces cerevisiae. We also demonstrated a tentative classification of PTMs using unsupervised clustering. Finally, we applied this software to severe acute respiratory syndrome coronavirus 2 data and identified commonly modified sites among three groups. nanoDoc is open source (GPLv3) and available at https://github.com/uedaLabR/nanoDoc
PanACoTA: A modular tool for massive microbial comparative genomics
The gene repertoires of microbial species, their pangenomes, evolve very fast. Their study facilitates the discrimination between lineages and reveals which genes drive their recent adaptation. It has therefore become a key topic of study in microbial evolution and genomics. Yet, the increase in the number of genomes available for certain species, now reaching many thousands, complicates the establishment of the basic building blocks of comparative genomics. Here, we present PanACoTA, a tool that allows users to download all genomes of a species, build a database with those passing quality and redundancy controls, define uniform annotation, and use them to build a pangenome, several variants of core or persistent genomes, their alignments, and a rapid but accurate phylogenetic tree. While many programs have become available in the last few years to build pangenomes, we have focused on a method that tackles all the key steps of the process, from download to phylogenetic inference. This was conceived in a modular way, i.e. while all steps are integrated, they can also be run separately and multiple times to allow rapid and extensive exploration of the space of parameters of interest. The software is built in Python 3 and includes features to facilitate its installation and its future development. We believe PanACoTA is an interesting addition to the current set of bioinformatics software for comparative genomics, since it will accelerate and standardize the more routine parts of the work, allowing microbial genomicists to more quickly tackle their specific questions.
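The distinction between a core and a persistent genome mentioned above can be illustrated with a minimal sketch. This is not PanACoTA's implementation or API: the function name `core_and_persistent`, the input layout (gene family name mapped to the set of genomes carrying it), and the `min_fraction` threshold are all assumptions made for illustration, and real tools typically apply stricter rules, such as limits on the number of copies per genome.

```python
from typing import Dict, List, Set, Tuple


def core_and_persistent(
    families: Dict[str, Set[str]],
    n_genomes: int,
    min_fraction: float = 0.95,  # hypothetical threshold, chosen for illustration
) -> Tuple[List[str], List[str]]:
    """Split gene families by how many genomes carry them.

    core: families present in every genome.
    persistent: families present in at least `min_fraction` of genomes
    (so every core family is also persistent).
    """
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    core: List[str] = []
    persistent: List[str] = []
    # Sort by family name so the output order is deterministic.
    for name, genomes in sorted(families.items()):
        if len(genomes) == n_genomes:
            core.append(name)
        if len(genomes) / n_genomes >= min_fraction:
            persistent.append(name)
    return core, persistent


# Toy input: three gene families across four genomes (made-up names).
toy_families = {
    "famA": {"g1", "g2", "g3", "g4"},
    "famB": {"g1", "g2", "g3"},
    "famC": {"g1"},
}
toy_core, toy_persistent = core_and_persistent(toy_families, n_genomes=4, min_fraction=0.75)
```

With this toy input, only `famA` is core, while `famA` and `famB` both meet the relaxed 75% threshold. Relaxing the threshold is what lets a persistent genome tolerate a few incomplete or divergent assemblies.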
6. [Roadkill] Researchers at the University of Montpellier (Université de Montpellier), France, sequence the genomes of two carnivorous mammals from roadkill samples
High-quality carnivore genomes from roadkill samples enable species delimitation in aardwolf and bat-eared fox
In a context of ongoing biodiversity erosion, obtaining genomic resources from wildlife is becoming essential for conservation. The thousands of yearly mammalian roadkill could potentially provide a useful source material for genomic surveys. To illustrate the potential of this underexploited resource, we used roadkill samples to sequence reference genomes and study the genomic diversity of the bat-eared fox (Otocyon megalotis) and the aardwolf (Proteles cristata) for which subspecies have been defined based on similar disjunct distributions in Eastern and Southern Africa. By developing an optimized DNA extraction protocol, we successfully obtained long reads using the Oxford Nanopore Technologies (ONT) MinION device. For the first time in mammals, we obtained two reference genomes with high contiguity and gene completeness by combining ONT long reads with Illumina short reads using hybrid assembly. Based on re-sequencing data from a few other roadkill samples, the comparison of the genetic differentiation between our two pairs of subspecies to that of pairs of well-defined species across Carnivora showed that the two subspecies of aardwolf might warrant species status (P. cristata and P. septentrionalis), whereas the two subspecies of bat-eared fox might not. Moreover, using these data, we conducted demographic analyses that revealed similar trajectories between Eastern and Southern populations of both species, suggesting that their population sizes have been shaped by similar environmental fluctuations. Finally, we obtained a well resolved genome-scale phylogeny for Carnivora with evidence for incomplete lineage sorting among the three main arctoid lineages. Overall, our cost-effective strategy opens the way for large-scale population genomic studies and phylogenomics of mammalian wildlife using roadkill.
Single-cell mapper (scMappR): using scRNA-seq to infer cell-type specificities of differentially expressed genes
RNA sequencing (RNA-seq) is widely used to identify differentially expressed genes (DEGs) and reveal biological mechanisms underlying complex biological processes. RNA-seq is often performed on heterogeneous samples and the resulting DEGs do not necessarily indicate the cell types where the differential expression occurred. While single-cell RNA-seq (scRNA-seq) methods solve this problem, technical and cost constraints currently limit their widespread use. Here we present single cell Mapper (scMappR), a method that assigns cell-type specificity scores to DEGs obtained from bulk RNA-seq by integrating cell-type expression data generated by scRNA-seq and existing deconvolution methods. After benchmarking scMappR using RNA-seq data obtained from sorted blood cells, we asked if scMappR could reveal known cell-type specific changes that occur during kidney regeneration. We found that scMappR appropriately assigned DEGs to cell types involved in kidney regeneration, including a relatively small proportion of immune cells. While scMappR can work with any user-supplied scRNA-seq data, we curated scRNA-seq expression matrices for ∼100 human and mouse tissues to facilitate its use with bulk RNA-seq data alone. Overall, scMappR is a user-friendly R package that complements traditional differential expression analysis and is available on CRAN.
Genome Methylation Predicts Age and Longevity of Bats
Bats hold considerable potential for understanding exceptional longevity because some species can live eight times longer than other mammals of similar size. Estimating their age or longevity is difficult because they show few signs of aging. DNA methylation (DNAm) provides a potential solution given its utility for estimating age [2-4] and lifespan [5-7] in humans. Here, we profile DNAm from wing biopsies of nearly 700 individuals representing 26 bat species and demonstrate that DNAm can predict chronological age accurately. Furthermore, the rate DNAm changes at age-informative sites is negatively related to longevity. To identify longevity-informative sites, we compared DNAm rates between three long-lived and two short-lived species. Hypermethylated age and longevity sites are enriched for histone and chromatin features associated with transcriptional regulation and preferentially located in the promoter regions of helix-turn-helix transcription factors (TFs). Predicted TF binding site motifs and enrichment analyses indicate that age-related methylation change is influenced by developmental processes, while longevity-related DNAm change is associated with innate immunity or tumorigenesis genes, suggesting that bat longevity results, in part, from augmented immune response and cancer suppression.
Clinical Course And Risk Factors For In-hospital Death In Critical COVID-19 In Wuhan, China
As the SARS-CoV-2 virus spreads through human populations, the unprecedented accumulation of viral genome sequences is ushering in a new era of "genomic contact tracing" – that is, using viral genome sequences to trace local transmission dynamics. However, because the viral phylogeny is already so large – and will undoubtedly grow manyfold – placing new sequences onto the tree has emerged as a barrier to real-time genomic contact tracing. Here, we resolve this challenge by building an efficient, tree-based data structure encoding the inferred evolutionary history of the virus. We demonstrate that our approach improves the speed of phylogenetic placement of new samples and data visualization by orders of magnitude, making it possible to complete the placements under real-time constraints. Our method also provides the key ingredient for maintaining a fully updated reference phylogeny. We make these tools available to the research community through the UCSC SARS-CoV-2 Genome Browser to enable rapid cross-referencing of information in new virus sequences with an ever-expanding array of molecular and structural biology data. The methods described here will empower research and genomic contact tracing for laboratories worldwide.
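To make the idea of placing a new sample on a tree-based data structure more concrete, here is a toy sketch. It assumes a tree whose nodes store the mutations on the branch leading to them, and it places a sample at the node whose accumulated mutation set differs least from the sample's. The abstract does not describe the authors' data structure or placement criterion in this detail, so the `Node` class, the `place_sample` function, and the mutation labels are hypothetical, and the sketch ignores complications such as back mutations and ambiguous bases.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional, Set, Tuple


@dataclass
class Node:
    """Tree node annotated with the mutations on the branch leading to it."""
    name: str
    mutations: Set[str] = field(default_factory=set)
    children: List["Node"] = field(default_factory=list)


def place_sample(root: Node, sample: Set[str]) -> Tuple[str, int]:
    """Return (node name, cost) for the best-matching node.

    The cost is the size of the symmetric difference between the sample's
    mutations and the mutations accumulated from the root down to the node.
    """
    best_name = root.name
    best_cost: Optional[int] = None
    # Iterative depth-first traversal carrying the inherited mutation set.
    stack: List[Tuple[Node, FrozenSet[str]]] = [(root, frozenset())]
    while stack:
        node, inherited = stack.pop()
        genotype = inherited | node.mutations
        cost = len(genotype ^ sample)
        if best_cost is None or cost < best_cost:
            best_name, best_cost = node.name, cost
        for child in node.children:
            stack.append((child, genotype))
    assert best_cost is not None  # the root is always visited
    return best_name, best_cost


# Toy tree with made-up node names; mutation labels are for illustration only.
leaf_b = Node("B", {"A23403G"})
node_a = Node("A", {"C241T"}, [leaf_b])
leaf_c = Node("C", {"G11083T"})
toy_root = Node("root", set(), [node_a, leaf_c])

placement = place_sample(toy_root, {"C241T", "A23403G", "C14408T"})
```

In this toy tree the sample shares two mutations with the path to node `B` and carries one extra, so `B` is the closest match with a cost of 1. Each placement visits every node once, so the cost of placing one sample grows linearly with tree size, without rebuilding the tree.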
The Ecological Impact of High-performance Computing in Astrophysics
The importance of computing in astronomy continues to increase, and so does its impact on the environment. When analyzing data or performing simulations, most researchers raise concerns about the time to reach a solution rather than its impact on the environment. Luckily, a reduced time-to-solution due to faster hardware or optimizations in the software generally also leads to a smaller carbon footprint. This is not the case when the reduced wall-clock time is achieved by overclocking the processor, or when using supercomputers. The increase in the popularity of interpreted scripting languages, and the general availability of high-performance workstations form a considerable threat to the environment. A similar concern can be raised about the trend of running single-core instead of adopting efficient many-core programming paradigms. In astronomy, computing is among the top producers of greenhouse gases, surpassing telescope operations. Here I hope to raise the awareness of the environmental impact of running non-optimized code on overpowered computer hardware.
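The relationship between time-to-solution and carbon footprint can be sketched with a back-of-the-envelope calculation: emissions are roughly average power draw times runtime times the carbon intensity of the electricity used. The function name `footprint_kg` and all numbers below are illustrative assumptions, not values from the paper; real power draw and grid intensity vary widely by hardware and location.

```python
def footprint_kg(power_watts: float, runtime_hours: float, intensity_kg_per_kwh: float) -> float:
    """Estimate emissions (kg CO2e) as energy used times grid carbon intensity."""
    energy_kwh = power_watts * runtime_hours / 1000.0  # W * h -> kWh
    return energy_kwh * intensity_kg_per_kwh


# Hypothetical numbers: overclocking shortens the run from 10 h to 8 h,
# but raises the average power draw from 200 W to 300 W.
GRID_INTENSITY = 0.5  # kg CO2e per kWh, an assumed value

baseline = footprint_kg(200.0, 10.0, GRID_INTENSITY)
overclocked = footprint_kg(300.0, 8.0, GRID_INTENSITY)
faster_wins_on_carbon = overclocked < baseline
```

Under these assumed numbers the overclocked run finishes sooner but uses more energy (2.4 kWh against 2.0 kWh), so the shorter wall-clock time does not translate into a smaller footprint.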