MIPERA report

Summary

Study characteristics

Study name: HST00003
Organism: M. tuberculosis
Mapping reference: MTB_anc
Variant calling: freebayes

Sample metadata is summarised in Table 1. Figure 1 shows the structure of the dataset.

Table 1: Target dataset summary.

	Sample	Collection date	Days	Seq. length	Seq. GC (%)	Seq. gaps	Seq. ambiguities	Mapped reads	Covered bases (%)	Mean depth	Mean baseQ	Mean mapQ
1	MIP00022	1995-11-01	0	4411532	62.2	21505	203430	5866459	99.8	96.8	36.5	58.7
2	MIP00023	1996-07-01	243	4411532	62.2	22777	203429	5599178	99.7	89.9	36.5	58.8
3	MIP00024	1996-11-01	366	4411532	62.1	28904	203481	5618419	99.9	86.7	36.6	58.8
4	MIP00025	1997-01-01	427	4411532	62.2	22593	203411	5868087	99.7	95.6	36.5	58.8
5	MIP00026	1998-03-01	851	4411532	62.2	22872	203390	5848734	99.7	97.3	36.5	58.8
6	MIP00027	1998-06-01	943	4411532	62.2	23082	203420	5988312	99.7	96.1	36.5	58.8
7	MIP00028	1998-09-01	1035	4411532	62.2	22834	203423	5814763	99.6	94.9	36.5	58.8
8	MIP00029	1999-02-01	1188	4411532	62.2	20731	203442	6068765	99.7	101.4	36.6	58.7
9	MIP00030	1999-05-01	1277	4411532	62.2	22256	203432	5833507	99.7	97.1	36.5	58.7

Figure 1: MDS (PCoA) and MST visualization of pairwise dissimilarities between samples. Arrows indicate chronological sampling order. Percentages indicate the proportion of variation explained. Distances between points are approximately proportional to allele frequency-weighted dissimilarities between each pair of samples (see afwdist).

List of nucleotide variants

To distinguish pre-existing variation from changes arising during the study period, the sequence of the most recent common ancestor (MRCA) of the target samples was built using ancestral sequence reconstruction (ASR) and compared to the mapping reference (Table 2). These variants represent fixed differences already present before the processes under study began. Nucleotide variants identified in each sample relative to the reconstructed ancestral sequence are listed in Table 3. These variants reflect changes that arose across the sampling period.

Dataset ancestor variants (Table 2)

Table 2: Variants in the reconstructed ancestral sequence relative to the mapping reference.

init_notebook_modetrusted

Loading ITables v2.7.0 from the init_notebook_mode cell... (need help?)

Sample variants relative to the ancestor (Table 3)

Table 3: Variants in each sample relative to the reconstructed ancestral sequence.

Loading ITables v2.7.0 from the init_notebook_mode cell... (need help?)

Genome alignments

Genomes have been aligned and compiled into a multiple sequence alignment against MTB_anc using Nextclade. The fraction of gaps, sequence ambiguities (non-ACTG or gap) and mismatches against the reference have been calculated across the whole alignment (after masking) using sliding 10000-bp windows (Figure 2).

Genome alignment overview (Figure 2)

Figure 2: Overview of the masked alignment of the study samples against the reference. The fraction of gaps, sequence ambiguities (non-ACTG or gap) and reference mismatches have been calculated across the alignment in 10000-bp windows.

Variants overview

A total of 78 different nucleotide variants were identified, of which 12 were ‘Frameshift’, 3 were ‘In-frame indel’, 19 were ‘Intergenic’, 34 were ‘Non-synonymous’, 1 were ‘Other’, and 9 were ‘Synonymous’ (Figure 3).

Figure 3: Genome-wide nucleotide variant landscape and regional polymorphism enrichment. The top panel represents nucleotide variants detected across samples (ordered chronologically, earliest bottom to latest top), called against the reconstructed ancestral sequence, plotted against genome position. Marker shape encodes predicted effect group; fill opacity scales with allele frequency. Subsequent panels represent posterior probabilities \(q_i = P(p_i > \mu \mid \text{data})\) of exceeding the dataset-wide polymorphism rate \(\mu\), estimated per gene (middle panel; Table 4) and per 1000-bp sliding window (bottom panel; Table 5), under a Bayesian hierarchical beta-binomial model. For each region \(i\), \(k_i\) polymorphic sites (union across timepoints) out of \(n_i\) callable sites are modelled as \(k_i \sim \mathrm{BetaBinomial}(\alpha, \beta, n_i)\), with priors centred on the empirical rate \(p_0\). Inference was performed using PyMC and nutpie using a marginalized parametrization over hyperparameters \(\mu\) and \(\phi\), with per-region posteriors recovered analytically via the conjugate beta update. Per-gene stats: \(p_0\) = 1.54e-05, \(\bar\mu\) = 0.000264 SCI 95.0 % [0.000134, 0.000409] & MCI 95.0 % [0.000152, 0.000444]. Per-window stats: \(p_0\) = 1.9e-05, \(\bar\mu\) = 0.000256 SCI 95.0 % [0.000144, 0.000384] & MCI 95.0 % [0.000157, 0.000406]. Only regions with a false discovery rate (FDR) \(\leq 0.05\) are displayed. Grey shading indicates masked regions excluded from the reconstructed ancestral sequence.

Per-feature polymorphism enrichment results (Table 4)

Table 4: Per-feature counts and posterior enrichment statistics from the Bayesian hierarchical beta-binomial model (FDR ≤ 0.05).

Loading ITables v2.7.0 from the init_notebook_mode cell... (need help?)

Per-1000-bp window polymorphism enrichment results (Table 5)

Table 5: Sliding-window counts and posterior enrichment statistics from the Bayesian hierarchical beta-binomial model (FDR ≤ 0.05).

Loading ITables v2.7.0 from the init_notebook_mode cell... (need help?)

Allele frequency trajectories

The correlation of the allele frequency of each nucleotide variant with time since initial sampling has been calculated using Spearman’s correlation method (Figure 4). The graph shows which variants pass the correlation threshold (0.7), the p-value threshold (0.05), or both.

Figure 4: Correlation coefficients and adjusted p-values of allele frequencies with time. The correction method applied is Benjamini-Hochberg. Lines indicate adjusted p-value threshold and correlation thresholds. Select points to view variant information.

Selected correlated nucleotide variants are described in detail in Figure 5 and Figure 6. The visualizations are separated to differentiate variants with larger allele frequency amplitude (differences between their maximum and minimum alternate frequencies). In both cases, the plots show the alternative allele frequency across sampling days. Plot colors correspond to Figure 4, indicating whether each variant passes the correlation threshold, the p-value threshold, or both.

Figure 5: Time series of relative allele frequencies for variants above the allele frequency amplitude (Δfreq) threshold. Each subplot shows the progression of allele frequencies over time for a specific genomic position with significant correlation with time. Gaps reflect time points where sequencing was unreliable. A frequency of zero indicates confirmed absence above quality/depth filters.

Figure 6: Time series of relative allele frequencies for variants under the allele frequency amplitude (Δfreq) threshold. Each subplot shows the progression of allele frequencies over time for a specific genomic position with significant correlation with time. Gaps reflect time points where sequencing was unreliable. A frequency of zero indicates confirmed absence above quality/depth filters.

Temporal signal

To estimate the evolutionary rate based on all alleles (Figure 9), including those at low frequencies, a neighbor-joining (NJ) tree was constructed from pairwise allele frequency-weighted distances between study samples using afwdist (Figure 7). To estimate the evolutionary rate from majority alleles (Figure 10), a maximum likelihood (ML) tree was inferred under a GTR+F+I+G4 substitution model using IQ-TREE (Figure 8). Results are summarized in Table 6.

Table 6: Evolutionary rate estimates and temporal signal statistics. Rates are expressed in number of changes per year; rates per site are expressed in number of changes per site and year. Both are accompanied by their 95 % confidence interval (CI).

Method	Alleles	Rate	Rate per site	R²	P
NJ	All (weighted)	4.24 [3.53, 4.95]	1.03e-06 [8.59e-07, 1.2e-06]	0.966	< 0.001
ML	Majority	2.15 [-1.57, 5.87]	5.23e-07 [-3.81e-07, 1.43e-06]	0.211	0.213

Figure 7: NJ tree constructed from a pairwise allele frequency-weighted metric. Branch lengths are expressed in allele frequency-weighted substitutions. The reference MTB_anc was used to root the tree (omitted from the visualization for clarity).

Figure 8: Maximum Likelihood tree constructed with IQ-TREE (GTR+F+I+G4). Branch lengths are expressed in substitutions per site. The reference MTB_anc was used as ougroup and to root the tree (omitted from the visualization for clarity).

Figure 9: Root-to-tip distances versus days since the first sample. The line shows the linear model fit with a 95 % CI ribbon.

Figure 10: Root-to-tip distances versus days since the first sample for Maximum Likelihood tree. The line shows the linear model fit with a 95 % CI ribbon.

Evolutionary metrics

To track signatures of selection, the allele frequency-weighted rate of change per synonymous site (dS analog) and per non-synonymous site (dN analog) have been calculated for each sample relative to the reconstructed case ancestor (MRCA), together with their ratio ω (Figure 11). These metrics quantify frequency-weighted divergence from the ancestral sequence across all alleles and help identify selective pressures. Frequency-weighted πN and πS of biallelic variants are reported as a metric of standing diversity (Figure 12). Rate denominators are calculated using the Nei–Gojobori (1986) method. Frequency-weighted synonymous and non-synonymous counts are reported as raw values, relative to the first sample, and as a ratio (Figure 13).

Figure 11: Time series of frequency-weighted dN and dS analogs, and their ratio (ω). Each point corresponds to a different sample, sorted in chronological order.

Figure 12: Time series of frequency-weighted πN and πS analogs, and their ratio. Only biallelic variants are considered. Each point corresponds to a different sample, sorted in chronological order.

Figure 13: Time series of non-synonymous (N) and synonymous (S) weighted counts (sum of frequencies) shown as raw values, relative to the first sample, and as a ratio. Each point corresponds to a different sample, sorted in chronological order.

The site frequency spectrum (SFS) has been calculated to help interpret the estimates above. Because alleles may occur at varying frequencies across samples, the SFS has been calculated using allele frequency bins, stratified by synonymous single-nucleotide variants (SNV), non-synonymous SNV, and all variants combined. To account for varying allele frequency trajectories across the dataset, the SFS is reported using two approaches: “exclusive”, which captures the overall distribution of within-host allele frequencies across the dataset (Figure 14), and “inclusive”, which identifies variants with globally stable frequency behavior (Figure 15).

Figure 14: Exclusive binned site frequency spectrum. Each variant observation has been assigned independently to an allele frequency bin based on its frequency in that sample. Each panel shows the number of alleles observed across a given number of samples. Variants are binned per-observation rather than globally (a variant that segregates at different frequencies across samples will contribute to multiple bins simultaneously).

Figure 15: Inclusive binned site frequency spectrum. Each variant has been assigned a single global frequency class based on the consistency of its allele frequency across all samples in which it has been detected. Variants that fall within the same bin in every sample are assigned to that bin; variants whose frequency spans multiple bins across samples are classified as “Mixed”. For each class, the plot shows the number of alleles observed across a given number of samples

Variants connectivity

Hierarchical clustering of pairwise AF correlations

To detect possible interactions between mutations, pairwise Spearman’s correlation between allele frequencies have been calculated (Figure 16). The heatmap is interactive and allows zooming into specific regions. Only the top 100 alleles with the highest overall median coefficient are represented.

Figure 16: Interactive hierarchically clustered heatmap of pairwise correlation coefficients between time series of allele frequencies. Select heatmap squares to view information about related variants.

Network communities of pairwise AF correlations

Studying how variants are connected allows us to identify behavioral patterns and relate annotated variants to those lacking annotation. In this analysis, all variants are connected to one another (Figure 17), rather than being examined in pairs as in the previous approach (Figure 16).

The network representation (Figure 17) of pairwise correlations between variant allele frequencies as edge weights. Graph communities have been identified using the Louvain algorithm. Each color represents a community with its corresponding number.

Figure 17: Network visualization showing variant communities. Node fill colors represent community assignment (0-indexed), while border colors indicate annotation status. Edge opacity and color intensity reflect correlation coefficients. Hover over nodes for variant details and edges for exact weight values. Network indicators have been calculated: degree (number of nodes to which each node is connected), degree centrality (degree centrality indicating the proportion of nodes connected relative to the total network size), eigenvector centrality (relates a node’s importance considering the importance of neighboring nodes), betweenness centrality (number of times a node lies on the shortest paths between other nodes) and closeness centrality (how close a node is to all others, based on shortest path lengths).

Custom annotation of variants

An overview of the presence of variants with all available DRUG annotation (colored by GRADING annotation) is shown in Figure 18. The variation in frequency of these variants across different samples in time is shown in Figure 19.

Figure 18: Overview of the detected variants with custom annotation on the reconstructed ancestral sequence and each sample in the dataset in chronological order.

Figure 19: Allele frequency trajectories for annotation-associated variants over time. Each connected group of points represents a distinct variant; gaps reflect time points where sequencing was unreliable. A frequency of zero indicates confirmed absence given the applied variant calling filters.