GUIDE 04 · expression → text

Have LingoCell write a paragraph about a single cell.

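Stage 1 Phase 2 trains a TinyLlama-1.1B decoder whose cross-attention reads the gene token sequence directly instead of relying on language priors alone. This ties each generated description to the cell's actual expression profile.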

Guide spec

decoder: TinyLlama-1.1B-Chat
cross-attn: last 12 layers
input: cell expr (1200 HVG, binned)
output: ≤ 256 tokens of natural language
sampling: greedy / top-p
latency: ~ 180 ms / cell · 1× 4090
train loss: 2.31

01 ─
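How it works

Why not just “show GPT a cell-type label and let it make something up”?

The prompt LLaMA receives is fixed (system + user) and identical for every cell. The only thing that varies is the gene expression sequence it reads through cross-attention. This forces the language model to take its information from the genes rather than guess the answer from the question.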

PROMPT · system + user (fixed) · identical for every cell
XATTN · read gene tokens · (1201, 512) · last 12L
DECODE · autoregressive · greedy or top-p 0.9
TEXT · cell description · ≤ 256 tok
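
To make the XATTN step concrete, here is a minimal single-head cross-attention sketch in NumPy. It is not LingoCell's implementation: only the gene-token shape (1201, 512) comes from this guide, while the text width `d_text`, the head size `d_head`, the random projection weights, and the 8-position prompt are illustrative assumptions. The point it shows is that with the same prompt states, the output changes only because the gene tokens change.

```python
import numpy as np

def cross_attention(text_states, gene_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: text positions query the gene tokens."""
    q = text_states @ w_q                          # (tgt, d_head)
    k = gene_tokens @ w_k                          # (src, d_head)
    v = gene_tokens @ w_v                          # (src, d_head)
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (tgt, src)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over gene tokens
    return weights @ v, weights

rng = np.random.default_rng(0)
d_text, d_gene, d_head = 64, 512, 32               # d_text and d_head are illustrative
prompt_states = rng.normal(size=(8, d_text))       # fixed prompt: same for every cell
w_q = rng.normal(size=(d_text, d_head)) * 0.1
w_k = rng.normal(size=(d_gene, d_head)) * 0.1
w_v = rng.normal(size=(d_gene, d_head)) * 0.1

cell_a = rng.normal(size=(1201, d_gene))           # synthetic gene tokens for two cells
cell_b = rng.normal(size=(1201, d_gene))
out_a, attn_a = cross_attention(prompt_states, cell_a, w_q, w_k, w_v)
out_b, attn_b = cross_attention(prompt_states, cell_b, w_q, w_k, w_v)
```

Each row of `attn_a` is a distribution over the 1201 gene tokens, which is the kind of quantity the attention grid in section 02 visualises.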

Fixed prompt

# system
You are a scientific assistant specialized in cell description predictions. 
Given the cell expression embeddings, describe it clearly and concisely 
in professional language.

# user
Based on the gene expression profile provided through cross-attention, 
describe the cell type, tissue origin, and biological state.

02 ─
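Interactive generation

Pick a cell and let it speak.

Below are four real cells run from the 0429 checkpoint. After you click “Generate”, the description streams in token by token, and the 10 most-attended genes are shown at the bottom with green bars marking their attention weights.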

STEP

Select a cell

Cell IDs are taken from the training / validation sets; the expression profiles are already binned to 0–50.
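
This guide does not say how the binning is done, so the following is only one plausible sketch of mapping raw counts into the 0–50 range. It assumes log1p scaling relative to each cell's own maximum, with bin 0 reserved for zero expression; the function name `bin_expression` and the scheme itself are hypothetical, and the actual preprocessing may differ (for example, quantile binning).

```python
import math

def bin_expression(counts, n_bins=50):
    """Map raw counts to integer bins 0..n_bins (assumed scheme, see lead-in)."""
    if not counts:
        return []
    logged = [math.log1p(c) for c in counts]
    top = max(logged)
    if top == 0:
        return [0] * len(counts)
    # Zero stays in bin 0; nonzero values land in 1..n_bins,
    # scaled by the cell's own maximum.
    return [0 if x == 0 else max(1, math.ceil(x / top * n_bins)) for x in logged]

binned = bin_expression([0, 1, 3, 10, 250])
```

Under this scheme the most highly expressed gene in a cell always lands in bin 50, and ordering between genes within a cell is preserved.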

STEP

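Streaming generation + attention

The output text is an actual sample from the LLaMA decoder. The gene grid below it shows the cross-attention weights accumulated during generation (the darker the colour, the more attention).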
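
A rough sketch of how accumulated cross-attention could be reduced to a top-10 gene list. The exact aggregation used by the page is not documented here, so this assumes weights shaped (n_layer, heads, tgt, src), averaged over layers and heads and summed over generated tokens. The function name `top_attended_genes`, the synthetic weights, and the `GENE_i` names are all illustrative, and the sketch assumes every source position maps to a gene name.

```python
import numpy as np

def top_attended_genes(attn, gene_names, k=10):
    """Rank source genes by cross-attention accumulated over generation.

    attn: array of shape (n_layer, heads, tgt, src).
    """
    # Average over layers and heads, then sum over generated tokens -> (src,)
    per_gene = attn.mean(axis=(0, 1)).sum(axis=0)
    order = np.argsort(per_gene)[::-1][:k]
    return [(gene_names[i], float(per_gene[i])) for i in order]

rng = np.random.default_rng(0)
attn = rng.random((12, 4, 20, 1201))       # synthetic; 4 heads and 20 steps are illustrative
attn[..., 7] *= 50.0                       # make one gene stand out on purpose
attn /= attn.sum(axis=-1, keepdims=True)   # each row sums to 1, as after a softmax
gene_names = [f"GENE_{i}" for i in range(1201)]
top10 = top_attended_genes(attn, gene_names)
```

Summing over the target axis means longer generations accumulate more total weight, so scores are comparable within one description but not across descriptions of different lengths.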

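GENERATED · TinyLlama-1.1B

[Interactive output panel: after you click “Generate”, the description streams in token by token, with readouts for tokens, latency, avg logprob and sampling mode (greedy by default).]

CROSS-ATTENTION · TOP 10 GENES

[Attention grid, shown after a run.]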

STEP

Reproduce from the command line

The same cell, on your own machine:

# single-cell description with attention dump
python custom_tools/generate_description.py \
  --ckpt save/mask25/best_model_phase1.pt \
  --input data/test_200_cells.h5ad \
  --cell_id 88421 \
  --sampling greedy \
  --save_attn out/cell_88421_attn.npz

03 ─
Key parameters

When to use greedy and when to sample.

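Flag	Default	Notes
--sampling	`greedy`	More stable descriptions, suited to reports. `topp` 0.9 + T=0.7 suits describing differences and comparing multiple samples.
--max_new_tokens	96	Range 32–256. Marginal information drops sharply beyond 96 tokens.
--prompt_variant	v6.2	Strong cross-attn referencing. Can only be changed by retraining after editing the prompt; see §5.1.5. v6.2 → "Based on the gene expression profile provided through cross-attention …"
--save_attn	false	Whether to also export the cross-attention weights (n_layer × heads × tgt × src). About 14 MB per cell.
--batch_size	32	Up to 96 on an H100. Larger batches are faster, but attention memory grows quadratically with sequence length, which caps the usable batch size.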
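
To illustrate the difference between the two `--sampling` modes, here is a small self-contained sketch of greedy decoding versus top-p (nucleus) sampling with temperature, using p = 0.9 and T = 0.7 as in the table. It operates on a toy logit vector for a single decoding step; the function names and logits are illustrative and say nothing about how generate_description.py implements sampling internally.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Always pick the highest-scoring token: deterministic."""
    return max(range(len(logits)), key=lambda i: logits[i])

def top_p_sample(logits, p=0.9, temperature=0.7, rng=random):
    """Sample from the smallest set of tokens whose probability mass reaches p."""
    probs = softmax(logits, temperature)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in ranked:
        nucleus.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # Weighted choice renormalises the probabilities within the nucleus.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]

logits = [2.0, 1.5, 0.3, -1.0, -4.0]       # toy scores for a 5-token vocabulary
rng = random.Random(0)
greedy_token = greedy(logits)
sampled_tokens = {top_p_sample(logits, rng=rng) for _ in range(200)}
```

Greedy returns the same token on every call, which is why it suits reports. Top-p varies its choice across calls but never leaves the nucleus, so low-probability tokens are excluded entirely.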

04 ─
FAQ

On hallucination and trustworthiness.

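Does it hallucinate?

Yes, but rarely. With the v6.2 prompt and strong cross-attention referencing, 90% of descriptions explicitly mention at least 3 attention-top genes. We recommend turning on --save_attn and presenting the description together with the attention weights so readers can check for themselves.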
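
The “at least 3 attention-top genes mentioned” criterion can be checked mechanically. The sketch below is one simple way to do it, assuming whole-word, case-insensitive matching of gene symbols; the helper names, the gene list, and the example description are made up for illustration and are not output from the model.

```python
import re

def mentioned_genes(description, top_genes):
    """Return the attention-top genes that appear as whole words in the text."""
    words = set(re.findall(r"[A-Za-z0-9-]+", description.upper()))
    return [g for g in top_genes if g.upper() in words]

def is_grounded(description, top_genes, min_hits=3):
    """True if the description names at least min_hits attention-top genes."""
    return len(mentioned_genes(description, top_genes)) >= min_hits

# Illustrative inputs only (not real model output).
top_genes = ["CD3E", "CD8A", "GZMK", "CCL5", "NKG7",
             "IL7R", "TRAC", "CD2", "LCK", "CD69"]
text = ("A cytotoxic CD8+ T cell from peripheral blood, with high CD8A, "
        "GZMK and NKG7 expression indicating an effector-memory state.")
hits = mentioned_genes(text, top_genes)
grounded = is_grounded(text, top_genes)
```

Whole-word matching keeps "CD8+" from counting as a mention of CD8A. A stricter check would also need to handle gene aliases, which this sketch ignores.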

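Can it output Chinese?

Not directly. The training corpus is English, and forcing the model to translate collapses description quality (the loss doubles). The recommended approach is to generate the English description first and then pass it to an external translation model; do not change the prompt language.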

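How do I write one summary for a batch of cells?

Average the e_cell embeddings of the batch, normalize the result, and feed it to generate_pool_description.py, which treats the mean embedding as a single virtual cell. Quality is comparable to using a “real representative cell”.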
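
A minimal sketch of the pooling step, assuming “normalize” means rescaling the mean to unit L2 norm and that e_cell is one fixed-width vector per cell. The 512-dimensional width, the synthetic embeddings, and the function name `pool_embedding` are assumptions; how generate_pool_description.py expects its input is not specified here.

```python
import numpy as np

def pool_embedding(e_cells):
    """Average per-cell embeddings and rescale the mean to unit L2 norm."""
    mean = np.asarray(e_cells, dtype=np.float64).mean(axis=0)
    norm = np.linalg.norm(mean)
    if norm == 0:
        raise ValueError("mean embedding is zero; cannot normalise")
    return mean / norm

rng = np.random.default_rng(0)
e_cells = rng.normal(loc=1.0, size=(50, 512))   # 50 synthetic cells; 512 dims is illustrative
virtual_cell = pool_embedding(e_cells)          # one vector standing in for the batch
```

Averaging works best when the batch is reasonably homogeneous; the mean of a mixed population may not resemble any real cell in it.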
