About the NMD Lung Atlas
Interactive browser for long-read isoform expression and nonsense-mediated decay (NMD) response across four primary human lung cell types under SMG1 inhibition.
What's in the atlas
- 4 primary human lung cell types, 7 unique donors, 13 cell-type × donor pairings, 26 long-read samples total (paired PacBio Iso-Seq + short-read RNA-seq). AT2, FB, and MVE were derived from the same 3 donors; LAE was obtained from 4 additional donors:
  - AT2: alveolar type 2 cells (internal code AT; 3 donors × 2 conditions = 6 samples)
  - LAE: large airway epithelial, submerged culture (internal code DD; 4 donors × 2 = 8 samples)
  - FB: primary lung fibroblasts (3 donors × 2 = 6 samples)
  - MVE: microvascular endothelial cells (internal code MV; 3 donors × 2 = 6 samples)
- Treatment: 0.3 µM SMG1 inhibitor (SMG1i) or equivalent-volume DMSO vehicle for 6 hours at 37 °C before RNA extraction.
- Data content: per-cell-type DMSO-baseline CPM (mean across DMSO samples) for each detected isoform, mashr posterior-mean log₂ fold-change under SMG1i vs DMSO, mashr local false sign rate (lfsr) and limma-voom adjusted p-values, and NMD-susceptible calls (lfsr < 0.05 AND positive posterior mean).
- Universe: 162,800 isoforms tested for SMG1i-vs-DMSO differential expression (after edgeR filterByExpr). Of these, 160,081 have SQANTI3 structural annotations and are displayed here (16,894 detected genes; 2,719 fusion transcripts excluded). GENCODE v49 annotated transcripts are additionally overlaid across the remaining protein-coding and lncRNA space, so users can compare detected structures against the annotation.
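The counts quoted above follow from the donor design and the fusion-transcript exclusion. A minimal sketch in Python reproduces the arithmetic; the donor labels (D1 to D7) are hypothetical placeholders, and only the counts come from the atlas description.

```python
# Donor labels are hypothetical placeholders; only the counts come from the text.
shared_donors = ["D1", "D2", "D3"]         # AT2, FB and MVE share these three donors
lae_donors = ["D4", "D5", "D6", "D7"]      # LAE comes from four additional donors

donors_by_cell_type = {
    "AT2": shared_donors,
    "LAE": lae_donors,
    "FB": shared_donors,
    "MVE": shared_donors,
}
conditions = ["DMSO", "SMG1i"]

unique_donors = {d for donors in donors_by_cell_type.values() for d in donors}
pairings = [(ct, d) for ct, donors in donors_by_cell_type.items() for d in donors]
samples = [(ct, d, c) for ct, d in pairings for c in conditions]

print(len(unique_donors), len(pairings), len(samples))  # 7 13 26

# Displayed isoforms = tested isoforms minus excluded fusion transcripts.
tested_isoforms = 162_800
fusion_excluded = 2_719
displayed_isoforms = tested_isoforms - fusion_excluded
print(displayed_isoforms)  # 160081
```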
How to use the atlas
- Search a gene by HGNC symbol or Ensembl ID (e.g., SRSF8, ATF4, ENSG00000263465). Genes with at least one NMD-susceptible isoform carry a small orange badge.
- Filter which isoforms are shown by minimum CPM (slider) and whether to include GENCODE-annotated transcripts that weren't detected (checkbox).
- Click any isoform in the table (or a track in the stacked view below) to see its transcript structure, expression per cell type, and NMD response.
- Compare all isoforms for a gene in the modal view, which shows genomic coordinates and Isopair-style intron chevrons and supports click-to-select.
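The filtering controls can be pictured as a simple predicate over isoform rows. The sketch below is illustrative only: the `IsoformRow` fields, the function name, and the choice to compare the slider against the highest baseline CPM across cell types are all assumptions, not the atlas's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IsoformRow:
    isoform_id: str             # hypothetical identifier
    detected: bool              # False for GENCODE-only transcripts not seen in the data
    max_cpm: Optional[float]    # assumed: highest DMSO-baseline CPM across cell types


def visible_isoforms(rows: List[IsoformRow], min_cpm: float = 0.0,
                     include_undetected_gencode: bool = False) -> List[IsoformRow]:
    """Apply the CPM slider to detected isoforms and the checkbox to undetected ones."""
    shown = []
    for row in rows:
        if row.detected:
            if row.max_cpm is not None and row.max_cpm >= min_cpm:
                shown.append(row)
        elif include_undetected_gencode:
            shown.append(row)
    return shown


rows = [
    IsoformRow("isoform_A", True, 12.0),
    IsoformRow("isoform_B", True, 0.4),
    IsoformRow("annotated_only_C", False, None),
]
default_view = visible_isoforms(rows, min_cpm=1.0)
with_annotation = visible_isoforms(rows, min_cpm=1.0, include_undetected_gencode=True)
print([r.isoform_id for r in default_view])     # ['isoform_A']
print([r.isoform_id for r in with_annotation])  # ['isoform_A', 'annotated_only_C']
```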
Data version
This build: loading…. New versions may be released as our upstream pipeline is updated. The version tag can be cited alongside the paper for exact reproducibility.
Methods summary
Compact summary of the analytical pipeline. For the full paper Methods including donor characteristics, sequencing depth, quality control, and full statistical model specifications, see the paper.
Sample processing and treatment
Primary human lung cells were isolated from tissue not used for transplant (UNC IRB protocol 03-1396). AT2, MVE, and FB cells were obtained from the same three donors; LAE cells were derived from four additional donors. LAE cells were maintained in submerged culture; MVE cells were CD31-bead-enriched from peripheral lung; FB cells were outgrown from minced distal lung; AT2 cells were expanded as HT2-280-purified alveolospheres and then plated as a monolayer on 5% Matrigel. All cultures were then treated with 0.3 µM SMG1 inhibitor (SMG1i) or equivalent-volume DMSO vehicle for six hours before RNA extraction.
Long-read RNA sequencing
PacBio Kinnex full-length cDNA libraries were prepared from isolated RNA per PacBio protocol 103-238-700 REV07 with two modifications: SMRTbell cleanup used 1.0× beads (instead of 0.9×) to avoid enriching for long transcripts and obtain properly formed concatenated libraries, and prior to concatenation the eight Kinnex PCR reactions were pooled by equal mass (Qubit-based) rather than by fixed volume from each tube. Kinnex FL RNA libraries were sequenced on a PacBio Revio instrument with Revio SMRT cells and SPRQ chemistry to an average depth of 13 million FLNC reads. Long reads were aligned to GRCh38 with minimap2 -ax splice:hq --junc-bed known_junctions.bed --MD --cs=long -uf. Full-length transcript isoforms were identified and quantified with the PacBio isocall pipeline (--seq-tech pac-bio-hifi --filter-group no-filters --model-coverage) and classified with SQANTI3.
Short-read RNA sequencing
Paired-end stranded, polyA-selected libraries (NEBNext UltraExpress) were sequenced on an Illumina NovaSeq 6000 to a mean depth of ~59 M reads (mean RIN 9.3). Reads were processed with the nf-core RNA-seq pipeline v3.14.0 (Nextflow 24.04.4): Trim Galore 0.6.7 + cutadapt 4.6 for QC and trimming, STAR 2.7.10a for alignment to GRCh38 + GENCODE v49, and Salmon 1.10.1 with tximport for gene- and isoform-level quantification.
Differential isoform expression (SMG1i vs DMSO)
Isoform counts from the SQANTI3-classified isocall output were organised into an edgeR DGEList, filtered with filterByExpr (min.count = 5, min.total.count = 10) to 162,800 isoforms, TMM-normalised, and fit with limma-voom under the design ~ cell_type + treatment + cell_type:treatment, with LAE as the reference cell type. The paired-donor structure was handled by estimating the within-donor correlation with duplicateCorrelation and supplying it to lmFit with donor as the blocking factor. Cell-type-specific SMG1i effects were extracted with contrasts.fit followed by eBayes moderation. The four per-cell-type contrast estimates were then jointly shrunk across cell types with multivariate adaptive shrinkage (mashr): standard errors were derived from moderated-t ratios, the null correlation was estimated with estimate_null_correlation_simple from a random feature subset, and the covariance structures included both data-driven matrices (PCA-based, from features with univariate mash_1by1 lfsr < 0.05) and canonical matrices (cov_canonical). An isoform is called NMD-susceptible in a given cell type when its mashr local false sign rate (lfsr) is < 0.05 AND its posterior-mean log₂ fold-change is positive.
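mashr takes a matrix of effect estimates and a matching matrix of standard errors. One common reading of "standard errors derived from moderated-t ratios" is that each standard error is the effect estimate divided by its moderated t-statistic. The Python sketch below illustrates that arithmetic only; the function name and the example values are hypothetical, and the pipeline itself runs in R.

```python
import math


def se_from_moderated_t(log2fc: float, moderated_t: float) -> float:
    """Standard error implied by an effect estimate and its moderated t-statistic.

    Assumes t = log2fc / se, so se = |log2fc / t|.
    """
    if moderated_t == 0:
        # A zero t-statistic implies a zero effect; the ratio is undefined.
        return math.nan
    return abs(log2fc / moderated_t)


# Hypothetical per-cell-type contrast results for one isoform: (log2FC, moderated t).
contrast = {"AT2": (1.5, 3.0), "LAE": (2.0, 8.0), "FB": (-0.5, -1.0), "MVE": (1.0, 4.0)}
standard_errors = {ct: se_from_moderated_t(b, t) for ct, (b, t) in contrast.items()}
print(standard_errors)
```

The resulting effect and standard-error values, one column per cell type, are the kind of input that joint shrinkage across conditions expects.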
Transcript structures
Exon coordinates displayed in the atlas come from:
- SQANTI3-corrected long-read assemblies for novel isoforms (identifiers of the form ENSG00000###.#.novelN);
- GENCODE v49 primary_assembly annotation for annotated transcripts (identifiers of the form ENST00000###.#).
CDS provenance policy
Because CDS calls carry different provenance and different biases, every isoform in the atlas is labelled with its CDS source (shown as a coloured banner above the transcript visualisation):
- GENCODE v49: canonical annotation, used for annotated transcripts without any *_NF tag.
- GENCODE placeholder: GENCODE annotates a CDS but flags the start and/or end as "not found" (cds_start_NF, mRNA_start_NF, cds_end_NF, or mRNA_end_NF). We surface this as "no reliable CDS available"; the placeholder CDS is not drawn.
- Reference-AUG projection: for novel comparator isoforms paired to a canonical reference, we project the reference's start codon into the comparator's transcript coordinates and walk to the first in-frame stop (Isopair::traceReferenceAtg()). This approach is not biased against ORFs that contain a premature termination codon (PTC), and it is used as the paper's ground truth for PTC-determination analyses.
- TD2 (caution): fallback to the ORF called by SQANTI3 / TransDecoder2 (TD2). TD2 is known to avoid PTC-containing ORFs, so the CDS shown for NMD-substrate isoforms may not be the biologically active one. Displayed with a visible ⚠ badge.
- None: no CDS available (e.g., lncRNA-biotyped GENCODE transcripts, structurally short transcripts with no ORF).
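The reference-AUG projection described in the list above ends with a walk from the projected start codon to the first in-frame stop. The Python sketch below illustrates only that walking step on a toy sequence. It is not the Isopair::traceReferenceAtg() implementation; the function name, coordinate convention, and example sequence are assumptions made for illustration.

```python
from typing import Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def walk_to_first_stop(transcript_seq: str, start_index: int) -> Optional[Tuple[int, int]]:
    """Walk codon by codon from a projected start position to the first in-frame stop.

    start_index is the 0-based transcript position of the projected start codon.
    Returns (cds_start, cds_end) as 0-based, half-open transcript coordinates with
    the stop codon included, or None when no in-frame stop is reached.
    """
    seq = transcript_seq.upper()
    for pos in range(start_index, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOP_CODONS:
            return start_index, pos + 3
    return None


# Toy transcript: the out-of-frame "TGA" at positions 3-5 is skipped,
# and the in-frame "TAG" at positions 11-13 terminates the ORF.
toy_transcript = "GGATGAAACCCTAGTT"
print(walk_to_first_stop(toy_transcript, 2))   # (2, 14)
print(walk_to_first_stop("ATGAAACCC", 0))      # None
```

Because the walk starts from the reference start codon rather than searching for the longest ORF, an early stop introduced by the comparator's structure is reported as-is instead of being avoided.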
Cell-type DMSO-baseline CPM (mean across DMSO samples per cell type, TMM-normalised, from the mashr differential isoform expression pipeline) and NMD response (log₂FC, lfsr, NMD-susceptible flag) are shown per cell type. Undetected genes (no isoform reaching the mashr expression filter in any of the four cell types) are indexed but have no gene shard; a "not detected in our data" message with an Ensembl link is shown when such a gene is queried.
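The baseline CPM described above is a per-cell-type mean over DMSO samples only. A minimal Python sketch of that aggregation for one isoform follows; the sample identifiers, CPM values, and function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Tuple


def baseline_cpm(sample_cpm: Dict[str, float],
                 sample_info: Dict[str, Tuple[str, str]]) -> Dict[str, float]:
    """Mean CPM per cell type, using DMSO samples only.

    sample_cpm maps sample id -> CPM for one isoform;
    sample_info maps sample id -> (cell_type, treatment).
    """
    by_cell_type = defaultdict(list)
    for sample_id, cpm in sample_cpm.items():
        cell_type, treatment = sample_info[sample_id]
        if treatment == "DMSO":
            by_cell_type[cell_type].append(cpm)
    return {ct: mean(values) for ct, values in by_cell_type.items()}


# Hypothetical samples for one isoform.
sample_info = {
    "s1": ("FB", "DMSO"), "s2": ("FB", "SMG1i"), "s3": ("FB", "DMSO"),
    "s4": ("FB", "DMSO"), "s5": ("LAE", "DMSO"), "s6": ("LAE", "SMG1i"),
}
sample_cpm = {"s1": 10.0, "s2": 40.0, "s3": 14.0, "s4": 12.0, "s5": 3.0, "s6": 9.0}
baseline = baseline_cpm(sample_cpm, sample_info)
print(baseline)  # {'FB': 12.0, 'LAE': 3.0}
```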
Data conventions
- Baseline expression = DMSO samples only. The CPM values displayed reflect baseline (DMSO) abundance in each cell type.
- NMD-response direction: positive log₂FC = accumulates under SMG1i = substrate of NMD.
- Cell-type scope: 4 cell types (AT2, LAE, FB, MVE). Two additional culture conditions (DD_ALI, DO_ALI) exist in the source data but were excluded from the manuscript scope because their short-read vs long-read effect-size correlation was lower than in the four-cell-type set.
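The direction convention and the NMD-susceptible rule can be written as a one-line predicate. The Python sketch below applies it per cell type; the function name and the example values are hypothetical, while the 0.05 cutoff and the positive-sign requirement come from the atlas description.

```python
def is_nmd_susceptible(posterior_mean_log2fc: float, lfsr: float,
                       lfsr_cutoff: float = 0.05) -> bool:
    """True when the isoform confidently accumulates under SMG1i.

    Requires lfsr below the cutoff AND a positive posterior-mean log2 fold-change.
    """
    return lfsr < lfsr_cutoff and posterior_mean_log2fc > 0


# Hypothetical (posterior-mean log2FC, lfsr) values for one isoform.
effects = {
    "AT2": (1.8, 0.001),   # confident increase
    "LAE": (0.9, 0.20),    # increase, but sign not confident
    "FB": (-0.7, 0.01),    # confident decrease, so not an NMD substrate call
    "MVE": (1.1, 0.04),    # confident increase
}
calls = {ct: is_nmd_susceptible(pm, lfsr) for ct, (pm, lfsr) in effects.items()}
print(calls)  # {'AT2': True, 'LAE': False, 'FB': False, 'MVE': True}
```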
How to cite
If you use this atlas in a publication, please cite the accompanying paper and the specific data version you queried.
Paper citation
Leshem Y, Kasai Y, Thakur SM, Paul A, Ziniti J, Boueiz A, Saferali A, DeMarzio M, Laederach A, Randell SH, Castaldi PJ. Long-read characterization of nonsense-mediated decay across primary human lung cell types. [Journal TBD], 2026. [DOI pending]
Data-version citation
Include the atlas data version so a future reader can pull the exact same underlying dataset:
NMD Lung Atlas, data version loading…, generated loading…. Available at loading….
License
Data displayed in this atlas is released under CC BY 4.0 — you're free to reuse it with attribution to the paper citation above.
Analysis source code is available on request from the corresponding author; a public release location will be added here on publication.
Report an issue
Found something that looks wrong? Contact the corresponding author (Peter Castaldi, repjc@channing.harvard.edu).
Welcome
This atlas accompanies Leshem et al. "Long-read characterization of nonsense-mediated decay across primary human lung cell types" (2026). Search a gene above to view its isoforms, transcript structures, expression, and NMD response under SMG1 inhibition (a treatment that stabilises NMD substrates) across alveolar type 2 (AT2), large airway epithelial (LAE), fibroblast (FB), and microvascular endothelial (MVE) cells.
Featured genes: SRSF8, ATF4, SRSF3, SRSF7, CFTR, SMG1.