Questions, answered.
Short, direct answers. If your question is not here, write to us using the contact options at the bottom of the page.
Q · 01 What file formats does the server accept? input
Single-cell data: a standard .h5ad AnnData file with rows = cells and columns = genes, with genes
named by HGNC symbol. Raw counts, normalised values and log1p-transformed values are all fine; the server detects which.
Spatial data: a 10x Space Ranger ≥ 3.0 output folder (zipped), with the H&E hi-res image present. Spatial bins should be ≤ 50 μm; LingoCell does not currently aggregate larger bins.
Q · 02 What do I get back when a job finishes? output
Every job returns the requested embedding tensors (e_cell, e_niche,
e_combined) in .npy, plus a CSV of per-cell predictions and a small
HTML report with a UMAP or niche map.
If you opt in, raw attention matrices, cosine-similarity histograms and per-cell descriptions are saved alongside.
Q · 03 Is LingoCell a single model or a pipeline of models? model
One model with three trained stages. Stage 1 aligns gene expression and text via CLIP and a TinyLlama decoder. Stage 1.5 bridges the scRNA-seq ↔ Visium HD distribution gap. Stage 2 adds a spatial branch and learns niche embeddings while preserving Stage 1 capabilities.
The same checkpoint serves all four tasks on the server.
Q · 04 How big is the cell-type vocabulary? model
786 cell types, taken from a curated subset of CellxGene's ontology. Cells outside this set still
receive a valid e_cell vector; the predicted type is simply the nearest neighbour
within the vocabulary. See cell_type_code_mapping.json in Resources.
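The nearest-neighbour fallback can be pictured with a small sketch. This is an illustration only: it assumes cosine similarity as the distance measure and uses a toy three-dimensional vocabulary with made-up labels and vectors. The function names are hypothetical and are not part of LingoCell, and the code is written in plain Python for readability rather than speed.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def nearest_type(e_cell, type_embeddings):
    """Return (label, similarity) for the vocabulary entry closest to e_cell.

    type_embeddings is a hypothetical dict of label -> reference vector;
    the real vocabulary layout may differ.
    """
    scored = ((label, cosine(e_cell, vec)) for label, vec in type_embeddings.items())
    return max(scored, key=lambda pair: pair[1])


# Toy vocabulary (assumed labels and vectors, for illustration only).
toy_vocab = {
    "T cell": [1.0, 0.0, 0.0],
    "B cell": [0.0, 1.0, 0.0],
    "hepatocyte": [0.0, 0.0, 1.0],
}

label, similarity = nearest_type([0.9, 0.3, 0.1], toy_vocab)
```

A cell whose true type is missing from the vocabulary still gets a label this way, so a low similarity to the winning entry is worth inspecting before trusting the prediction.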
Q · 05 How far should I trust a retrieval hit or a cell description? output
For retrieval: top-1 cosine ≥ 0.55 is reliable for 50K-candidate indices; 0.40 – 0.55 is borderline and warrants manual review; below 0.40 should be treated as no match.
For descriptions: turn on save_attn and check that the top-attended genes are biologically
consistent with the claim. We document an attention-grounded reading protocol in the Describe walkthrough.
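The retrieval thresholds above can be applied mechanically when post-processing results. The sketch below is a minimal example; the function name and the returned labels are our own choices, and the cut-offs are the ones quoted for 50K-candidate indices, so they may not transfer to indices of a very different size.

```python
RELIABLE_MIN = 0.55    # top-1 cosine at or above this is treated as reliable
BORDERLINE_MIN = 0.40  # between this and RELIABLE_MIN warrants manual review


def classify_hit(top1_cosine):
    """Bucket a top-1 cosine similarity using the FAQ thresholds."""
    if top1_cosine >= RELIABLE_MIN:
        return "reliable"
    if top1_cosine >= BORDERLINE_MIN:
        return "borderline"
    return "no match"


# Example: triage a few hypothetical scores.
buckets = [classify_hit(score) for score in (0.72, 0.48, 0.31)]
```

Borderline hits are the ones to queue for manual review; anything labelled "no match" is best dropped rather than reported.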
Q · 06 How long does a typical job take? running
Cell-type annotation: ~ 26 ms / cell on an RTX 4090 (fp16). 100 K cells finish in ~ 45 s.
Spatial-niche on a 150 K-cell Visium HD slide: ~ 11 min including H&E feature extraction.
Description generation: ~ 180 ms / cell with greedy sampling.
Q · 07 Can I run LingoCell on my own GPU? running
Yes. Inference fits in 12 GB of VRAM for scRNA-seq and 24 GB for Visium HD with the default configuration. Training requires 4× H100, so we provide checkpoints rather than a one-click trainer.
CPU inference works for small tests; expect ~ 15 min per 10 K cells.
Q · 08 What if my gene symbols do not match HGNC exactly? input
The pipeline runs an alias-resolution step (HGNC + GENCODE) before tokenising. As long as 60 % or more of your symbols map cleanly, results stay within ±1 % of fully-mapped accuracy. Unmapped genes are zero-filled without raising an error, and the report lists them.
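To check your own data against the 60 % figure before uploading, you can estimate mapping coverage locally. The sketch below is not the server's resolver: the function name, the tiny approved-symbol set and the alias table are illustrative assumptions, and a real check would load the full HGNC and GENCODE tables.

```python
MIN_COVERAGE = 0.60  # coverage level quoted in the answer above


def resolve_symbols(symbols, approved, aliases):
    """Map input symbols to approved symbols.

    Returns (resolved, unmapped, coverage), where resolved maps each input
    symbol to its approved symbol, unmapped lists symbols with no match, and
    coverage is the fraction of input symbols that were resolved.
    """
    resolved, unmapped = {}, []
    for symbol in symbols:
        if symbol in approved:
            resolved[symbol] = symbol
        elif symbol in aliases:
            resolved[symbol] = aliases[symbol]
        else:
            unmapped.append(symbol)
    coverage = len(resolved) / len(symbols) if symbols else 0.0
    return resolved, unmapped, coverage


# Tiny illustrative tables (assumed subset, not the real HGNC export).
approved = {"KMT2A", "MS4A1", "CD3E", "GAPDH"}
aliases = {"MLL": "KMT2A", "CD20": "MS4A1"}

resolved, unmapped, coverage = resolve_symbols(
    ["CD3E", "MLL", "CD20", "GAPDH", "NOT_A_GENE"], approved, aliases
)
meets_threshold = coverage >= MIN_COVERAGE
```

Here four of five symbols resolve, so coverage is 0.8 and the input clears the threshold; the single entry in `unmapped` corresponds to what the report would list as zero-filled.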
Q · 09 What is the licence for the weights and code? ethics
Code is MIT-licensed. Pretrained weights are CC-BY-NC 4.0: free for academic use, with commercial use available by agreement (contact us). Generated cell descriptions inherit the same non-commercial restriction.
Q · 10 Does the server store my uploaded data? ethics
Uploaded files live in the server's job sandbox for 14 days and are then deleted. Embeddings and reports stay attached to your job link, not your account. Nothing you upload is used for retraining without your written opt-in.
Still stuck?
Open an issue on the lab page or email us — we triage Q&A within two working days.