MIPERA report

Workflow version: v0.8.0 (v0.8.0)

Published

2026-06-17

Summary

Study characteristics

  • Study name: HST00003
  • Organism: M. tuberculosis
  • Mapping reference: MTB_anc
  • Variant calling: freebayes

Sample metadata is summarised in Table 1. Figure 1 shows the structure of the dataset.

Table 1: Target dataset summary.
  Sample Collection date Days Seq. length Seq. GC (%) Seq. gaps Seq. ambiguities Mapped reads Covered bases (%) Mean depth Mean baseQ Mean mapQ
1 MIP00022 1995-11-01 0 4411532 62.2 21505 203430 5866459 99.8 96.8 36.5 58.7
2 MIP00023 1996-07-01 243 4411532 62.2 22777 203429 5599178 99.7 89.9 36.5 58.8
3 MIP00024 1996-11-01 366 4411532 62.1 28904 203481 5618419 99.9 86.7 36.6 58.8
4 MIP00025 1997-01-01 427 4411532 62.2 22593 203411 5868087 99.7 95.6 36.5 58.8
5 MIP00026 1998-03-01 851 4411532 62.2 22872 203390 5848734 99.7 97.3 36.5 58.8
6 MIP00027 1998-06-01 943 4411532 62.2 23082 203420 5988312 99.7 96.1 36.5 58.8
7 MIP00028 1998-09-01 1035 4411532 62.2 22834 203423 5814763 99.6 94.9 36.5 58.8
8 MIP00029 1999-02-01 1188 4411532 62.2 20731 203442 6068765 99.7 101.4 36.6 58.7
9 MIP00030 1999-05-01 1277 4411532 62.2 22256 203432 5833507 99.7 97.1 36.5 58.7
Figure 1: MDS (PCoA) and MST visualization of pairwise dissimilarities between samples. Arrows indicate chronological sampling order. Percentages indicate the proportion of variation explained. Distances between points are approximately proportional to allele frequency-weighted dissimilarities between each pair of samples (see afwdist).

List of nucleotide variants

To distinguish pre-existing variation from changes arising during the study period, the sequence of the most recent common ancestor (MRCA) of the target samples was built using ancestral sequence reconstruction (ASR) and compared to the mapping reference (Table 2). These variants represent fixed differences already present before the processes under study began. Nucleotide variants identified in each sample relative to the reconstructed ancestral sequence are listed in Table 3. These variants reflect changes that arose across the sampling period.

Table 2: Variants in the reconstructed ancestral sequence relative to the mapping reference.
Loading ITables v2.7.0 from the init_notebook_mode cell... (need help?)
Table 3: Variants in each sample relative to the reconstructed ancestral sequence.
Loading ITables v2.7.0 from the init_notebook_mode cell... (need help?)

Genome alignments

Genomes have been aligned and compiled into a multiple sequence alignment against MTB_anc using Nextclade. The fraction of gaps, sequence ambiguities (non-ACTG or gap) and mismatches against the reference have been calculated across the whole alignment (after masking) using sliding 10000-bp windows (Figure 2).

Figure 2: Overview of the masked alignment of the study samples against the reference. The fraction of gaps, sequence ambiguities (non-ACTG or gap) and reference mismatches have been calculated across the alignment in 10000-bp windows.

Variants overview

A total of 78 different nucleotide variants were identified, of which 12 were ‘Frameshift’, 3 were ‘In-frame indel’, 19 were ‘Intergenic’, 34 were ‘Non-synonymous’, 1 were ‘Other’, and 9 were ‘Synonymous’ (Figure 3).

Figure 3: Genome-wide nucleotide variant landscape and regional polymorphism enrichment. The top panel represents nucleotide variants detected across samples (ordered chronologically, earliest bottom to latest top), called against the reconstructed ancestral sequence, plotted against genome position. Marker shape encodes predicted effect group; fill opacity scales with allele frequency. Subsequent panels represent posterior probabilities \(q_i = P(p_i > \mu \mid \text{data})\) of exceeding the dataset-wide polymorphism rate \(\mu\), estimated per gene (middle panel; Table 4) and per 1000-bp sliding window (bottom panel; Table 5), under a Bayesian hierarchical beta-binomial model. For each region \(i\), \(k_i\) polymorphic sites (union across timepoints) out of \(n_i\) callable sites are modelled as \(k_i \sim \mathrm{BetaBinomial}(\alpha, \beta, n_i)\), with priors centred on the empirical rate \(p_0\). Inference was performed using PyMC and nutpie using a marginalized parametrization over hyperparameters \(\mu\) and \(\phi\), with per-region posteriors recovered analytically via the conjugate beta update. Per-gene stats: \(p_0\) = 1.54e-05, \(\bar\mu\) = 0.000264 SCI 95.0 % [0.000134, 0.000409] & MCI 95.0 % [0.000152, 0.000444]. Per-window stats: \(p_0\) = 1.9e-05, \(\bar\mu\) = 0.000256 SCI 95.0 % [0.000144, 0.000384] & MCI 95.0 % [0.000157, 0.000406]. Only regions with a false discovery rate (FDR) \(\leq 0.05\) are displayed. Grey shading indicates masked regions excluded from the reconstructed ancestral sequence.
Table 4: Per-feature counts and posterior enrichment statistics from the Bayesian hierarchical beta-binomial model (FDR ≤ 0.05).
Loading ITables v2.7.0 from the init_notebook_mode cell... (need help?)
Table 5: Sliding-window counts and posterior enrichment statistics from the Bayesian hierarchical beta-binomial model (FDR ≤ 0.05).
Loading ITables v2.7.0 from the init_notebook_mode cell... (need help?)

Allele frequency trajectories

The correlation of the allele frequency of each nucleotide variant with time since initial sampling has been calculated using Spearman’s correlation method (Figure 4). The graph shows which variants pass the correlation threshold (0.7), the p-value threshold (0.05), or both.

Figure 4: Correlation coefficients and adjusted p-values of allele frequencies with time. The correction method applied is Benjamini-Hochberg. Lines indicate adjusted p-value threshold and correlation thresholds. Select points to view variant information.

Selected correlated nucleotide variants are described in detail in Figure 5 and Figure 6. The visualizations are separated to differentiate variants with larger allele frequency amplitude (differences between their maximum and minimum alternate frequencies). In both cases, the plots show the alternative allele frequency across sampling days. Plot colors correspond to Figure 4, indicating whether each variant passes the correlation threshold, the p-value threshold, or both.

Figure 5: Time series of relative allele frequencies for variants above the allele frequency amplitude (Δfreq) threshold. Each subplot shows the progression of allele frequencies over time for a specific genomic position with significant correlation with time. Gaps reflect time points where sequencing was unreliable. A frequency of zero indicates confirmed absence above quality/depth filters.
Figure 6: Time series of relative allele frequencies for variants under the allele frequency amplitude (Δfreq) threshold. Each subplot shows the progression of allele frequencies over time for a specific genomic position with significant correlation with time. Gaps reflect time points where sequencing was unreliable. A frequency of zero indicates confirmed absence above quality/depth filters.

Temporal signal

To estimate the evolutionary rate based on all alleles (Figure 9), including those at low frequencies, a neighbor-joining (NJ) tree was constructed from pairwise allele frequency-weighted distances between study samples using afwdist (Figure 7). To estimate the evolutionary rate from majority alleles (Figure 10), a maximum likelihood (ML) tree was inferred under a GTR+F+I+G4 substitution model using IQ-TREE (Figure 8). Results are summarized in Table 6.

Table 6: Evolutionary rate estimates and temporal signal statistics. Rates are expressed in number of changes per year; rates per site are expressed in number of changes per site and year. Both are accompanied by their 95 % confidence interval (CI).
Method Alleles Rate Rate per site P
NJ All (weighted) 4.24 [3.53, 4.95] 1.03e-06 [8.59e-07, 1.2e-06] 0.966 < 0.001
ML Majority 2.15 [-1.57, 5.87] 5.23e-07 [-3.81e-07, 1.43e-06] 0.211 0.213
Figure 7: NJ tree constructed from a pairwise allele frequency-weighted metric. Branch lengths are expressed in allele frequency-weighted substitutions. The reference MTB_anc was used to root the tree (omitted from the visualization for clarity).
Figure 8: Maximum Likelihood tree constructed with IQ-TREE (GTR+F+I+G4). Branch lengths are expressed in substitutions per site. The reference MTB_anc was used as ougroup and to root the tree (omitted from the visualization for clarity).
Figure 9: Root-to-tip distances versus days since the first sample. The line shows the linear model fit with a 95 % CI ribbon.
Figure 10: Root-to-tip distances versus days since the first sample for Maximum Likelihood tree. The line shows the linear model fit with a 95 % CI ribbon.

Evolutionary metrics

To track signatures of selection, the allele frequency-weighted rate of change per synonymous site (dS analog) and per non-synonymous site (dN analog) have been calculated for each sample relative to the reconstructed case ancestor (MRCA), together with their ratio ω (Figure 11). These metrics quantify frequency-weighted divergence from the ancestral sequence across all alleles and help identify selective pressures. Frequency-weighted πN and πS of biallelic variants are reported as a metric of standing diversity (Figure 12). Rate denominators are calculated using the Nei–Gojobori (1986) method. Frequency-weighted synonymous and non-synonymous counts are reported as raw values, relative to the first sample, and as a ratio (Figure 13).

Figure 11: Time series of frequency-weighted dN and dS analogs, and their ratio (ω). Each point corresponds to a different sample, sorted in chronological order.
Figure 12: Time series of frequency-weighted πN and πS analogs, and their ratio. Only biallelic variants are considered. Each point corresponds to a different sample, sorted in chronological order.
Figure 13: Time series of non-synonymous (N) and synonymous (S) weighted counts (sum of frequencies) shown as raw values, relative to the first sample, and as a ratio. Each point corresponds to a different sample, sorted in chronological order.

The site frequency spectrum (SFS) has been calculated to help interpret the estimates above. Because alleles may occur at varying frequencies across samples, the SFS has been calculated using allele frequency bins, stratified by synonymous single-nucleotide variants (SNV), non-synonymous SNV, and all variants combined. To account for varying allele frequency trajectories across the dataset, the SFS is reported using two approaches: “exclusive”, which captures the overall distribution of within-host allele frequencies across the dataset (Figure 14), and “inclusive”, which identifies variants with globally stable frequency behavior (Figure 15).

Figure 14: Exclusive binned site frequency spectrum. Each variant observation has been assigned independently to an allele frequency bin based on its frequency in that sample. Each panel shows the number of alleles observed across a given number of samples. Variants are binned per-observation rather than globally (a variant that segregates at different frequencies across samples will contribute to multiple bins simultaneously).
Figure 15: Inclusive binned site frequency spectrum. Each variant has been assigned a single global frequency class based on the consistency of its allele frequency across all samples in which it has been detected. Variants that fall within the same bin in every sample are assigned to that bin; variants whose frequency spans multiple bins across samples are classified as “Mixed”. For each class, the plot shows the number of alleles observed across a given number of samples

Variants connectivity

Hierarchical clustering of pairwise AF correlations

To detect possible interactions between mutations, pairwise Spearman’s correlation between allele frequencies have been calculated (Figure 16). The heatmap is interactive and allows zooming into specific regions. Only the top 100 alleles with the highest overall median coefficient are represented.

Figure 16: Interactive hierarchically clustered heatmap of pairwise correlation coefficients between time series of allele frequencies. Select heatmap squares to view information about related variants.

Network communities of pairwise AF correlations

Studying how variants are connected allows us to identify behavioral patterns and relate annotated variants to those lacking annotation. In this analysis, all variants are connected to one another (Figure 17), rather than being examined in pairs as in the previous approach (Figure 16).

The network representation (Figure 17) of pairwise correlations between variant allele frequencies as edge weights. Graph communities have been identified using the Louvain algorithm. Each color represents a community with its corresponding number.

Figure 17: Network visualization showing variant communities. Node fill colors represent community assignment (0-indexed), while border colors indicate annotation status. Edge opacity and color intensity reflect correlation coefficients. Hover over nodes for variant details and edges for exact weight values. Network indicators have been calculated: degree (number of nodes to which each node is connected), degree centrality (degree centrality indicating the proportion of nodes connected relative to the total network size), eigenvector centrality (relates a node’s importance considering the importance of neighboring nodes), betweenness centrality (number of times a node lies on the shortest paths between other nodes) and closeness centrality (how close a node is to all others, based on shortest path lengths).

Custom annotation of variants

An overview of the presence of variants with all available DRUG annotation (colored by GRADING annotation) is shown in Figure 18. The variation in frequency of these variants across different samples in time is shown in Figure 19.

Figure 18: Overview of the detected variants with custom annotation on the reconstructed ancestral sequence and each sample in the dataset in chronological order.
Figure 19: Allele frequency trajectories for annotation-associated variants over time. Each connected group of points represents a distinct variant; gaps reflect time points where sequencing was unreliable. A frequency of zero indicates confirmed absence given the applied variant calling filters.