| Sample | Collection date | Days | Seq. length | Seq. GC (%) | Seq. gaps | Seq. ambiguities | Mapped reads | Covered bases (%) | Mean depth | Mean baseQ | Mean mapQ | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | MIP00022 | 1995-11-01 | 0 | 4411532 | 62.2 | 21505 | 203430 | 5866459 | 99.8 | 96.8 | 36.5 | 58.7 |
| 2 | MIP00023 | 1996-07-01 | 243 | 4411532 | 62.2 | 22777 | 203429 | 5599178 | 99.7 | 89.9 | 36.5 | 58.8 |
| 3 | MIP00024 | 1996-11-01 | 366 | 4411532 | 62.1 | 28904 | 203481 | 5618419 | 99.9 | 86.7 | 36.6 | 58.8 |
| 4 | MIP00025 | 1997-01-01 | 427 | 4411532 | 62.2 | 22593 | 203411 | 5868087 | 99.7 | 95.6 | 36.5 | 58.8 |
| 5 | MIP00026 | 1998-03-01 | 851 | 4411532 | 62.2 | 22872 | 203390 | 5848734 | 99.7 | 97.3 | 36.5 | 58.8 |
| 6 | MIP00027 | 1998-06-01 | 943 | 4411532 | 62.2 | 23082 | 203420 | 5988312 | 99.7 | 96.1 | 36.5 | 58.8 |
| 7 | MIP00028 | 1998-09-01 | 1035 | 4411532 | 62.2 | 22834 | 203423 | 5814763 | 99.6 | 94.9 | 36.5 | 58.8 |
| 8 | MIP00029 | 1999-02-01 | 1188 | 4411532 | 62.2 | 20731 | 203442 | 6068765 | 99.7 | 101.4 | 36.6 | 58.7 |
| 9 | MIP00030 | 1999-05-01 | 1277 | 4411532 | 62.2 | 22256 | 203432 | 5833507 | 99.7 | 97.1 | 36.5 | 58.7 |
Summary
Study characteristics
- Study name: HST00003
- Organism: M. tuberculosis
- Mapping reference: MTB_anc
- Variant calling: freebayes
Sample metadata is summarised in Table 1. Figure 1 shows the structure of the dataset.
List of nucleotide variants
To distinguish pre-existing variation from changes arising during the study period, the sequence of the most recent common ancestor (MRCA) of the target samples was built using ancestral sequence reconstruction (ASR) and compared to the mapping reference (Table 2). These variants represent fixed differences already present before the processes under study began. Nucleotide variants identified in each sample relative to the reconstructed ancestral sequence are listed in Table 3. These variants reflect changes that arose across the sampling period.
Genome alignments
Genomes have been aligned and compiled into a multiple sequence alignment against MTB_anc using Nextclade. The fraction of gaps, sequence ambiguities (non-ACTG or gap) and mismatches against the reference have been calculated across the whole alignment (after masking) using sliding 10000-bp windows (Figure 2).
Variants overview
A total of 78 different nucleotide variants were identified, of which 12 were ‘Frameshift’, 3 were ‘In-frame indel’, 19 were ‘Intergenic’, 34 were ‘Non-synonymous’, 1 were ‘Other’, and 9 were ‘Synonymous’ (Figure 3).
Allele frequency trajectories
The correlation of the allele frequency of each nucleotide variant with time since initial sampling has been calculated using Spearman’s correlation method (Figure 4). The graph shows which variants pass the correlation threshold (0.7), the p-value threshold (0.05), or both.
Selected correlated nucleotide variants are described in detail in Figure 5 and Figure 6. The visualizations are separated to differentiate variants with larger allele frequency amplitude (differences between their maximum and minimum alternate frequencies). In both cases, the plots show the alternative allele frequency across sampling days. Plot colors correspond to Figure 4, indicating whether each variant passes the correlation threshold, the p-value threshold, or both.
Temporal signal
To estimate the evolutionary rate based on all alleles (Figure 9), including those at low frequencies, a neighbor-joining (NJ) tree was constructed from pairwise allele frequency-weighted distances between study samples using afwdist (Figure 7). To estimate the evolutionary rate from majority alleles (Figure 10), a maximum likelihood (ML) tree was inferred under a GTR+F+I+G4 substitution model using IQ-TREE (Figure 8). Results are summarized in Table 6.
| Method | Alleles | Rate | Rate per site | R² | P |
|---|---|---|---|---|---|
| NJ | All (weighted) | 4.24 [3.53, 4.95] | 1.03e-06 [8.59e-07, 1.2e-06] | 0.966 | < 0.001 |
| ML | Majority | 2.15 [-1.57, 5.87] | 5.23e-07 [-3.81e-07, 1.43e-06] | 0.211 | 0.213 |
Evolutionary metrics
To track signatures of selection, the allele frequency-weighted rate of change per synonymous site (dS analog) and per non-synonymous site (dN analog) have been calculated for each sample relative to the reconstructed case ancestor (MRCA), together with their ratio ω (Figure 11). These metrics quantify frequency-weighted divergence from the ancestral sequence across all alleles and help identify selective pressures. Frequency-weighted πN and πS of biallelic variants are reported as a metric of standing diversity (Figure 12). Rate denominators are calculated using the Nei–Gojobori (1986) method. Frequency-weighted synonymous and non-synonymous counts are reported as raw values, relative to the first sample, and as a ratio (Figure 13).
The site frequency spectrum (SFS) has been calculated to help interpret the estimates above. Because alleles may occur at varying frequencies across samples, the SFS has been calculated using allele frequency bins, stratified by synonymous single-nucleotide variants (SNV), non-synonymous SNV, and all variants combined. To account for varying allele frequency trajectories across the dataset, the SFS is reported using two approaches: “exclusive”, which captures the overall distribution of within-host allele frequencies across the dataset (Figure 14), and “inclusive”, which identifies variants with globally stable frequency behavior (Figure 15).
Variants connectivity
Hierarchical clustering of pairwise AF correlations
To detect possible interactions between mutations, pairwise Spearman’s correlation between allele frequencies have been calculated (Figure 16). The heatmap is interactive and allows zooming into specific regions. Only the top 100 alleles with the highest overall median coefficient are represented.
Network communities of pairwise AF correlations
Studying how variants are connected allows us to identify behavioral patterns and relate annotated variants to those lacking annotation. In this analysis, all variants are connected to one another (Figure 17), rather than being examined in pairs as in the previous approach (Figure 16).
The network representation (Figure 17) of pairwise correlations between variant allele frequencies as edge weights. Graph communities have been identified using the Louvain algorithm. Each color represents a community with its corresponding number.
Custom annotation of variants
An overview of the presence of variants with all available DRUG annotation (colored by GRADING annotation) is shown in Figure 18. The variation in frequency of these variants across different samples in time is shown in Figure 19.