Computational Immunology PhD Interview

Simple guide connecting your motivation-letter terms to what you already know from your thesis: single-cell data, GRN inference, KAN interpretability, perturbation validation, QC, normalization, HVG selection, and reproducible ML workflows.

Rule for tomorrow: do not pretend to be an immunology expert. Be honest, then connect the question to a computational workflow.

1. Start here: your safe position

Your honest position

My strongest background is computational: single-cell data, machine learning, statistics, and reproducible workflows. I am still learning the detailed immunology, but I can connect biological questions to analysis workflows.

If you do not know a biology term

Say: “I am still building deeper expertise in that specific biological area, so I do not want to overstate my knowledge. My current understanding is that ... Computationally, I would approach it by ...”

Thesis-to-lab bridge

In my thesis, I learned how to turn noisy single-cell expression data into interpretable regulatory hypotheses. In this PhD, the same idea can be applied to disease datasets: identify cell states, regulatory programs, spatial niches, and immune-stromal interactions.

What they need to hear

You are trainable, honest, computationally strong, careful with interpretation, and motivated to learn the immunology.

2. 5-minute slide script

Slide 1 — Project overview and fit
Today I will briefly present my master’s thesis project, where I developed an interpretable Kolmogorov-Arnold Network-based framework for gene regulatory network inference from single-cell gene expression data. The goal was to move from raw scRNA-seq data to ranked transcription factor-target relationships and testable biological hypotheses. This is relevant to this PhD because the position also needs high-dimensional biological data analysis, reproducible workflows, machine learning, and biological interpretation.
Slide 2 — Motivation
The biological motivation is that target-gene expression is not usually controlled by one regulator alone. It can depend on nonlinear or combinatorial effects of multiple transcription factors. Single-cell data makes this difficult because it is sparse, noisy, high-dimensional, and affected by dropout. Existing methods such as GRNBoost2 are scalable, but they mostly return importance scores rather than explicit regulatory functions. So my thesis asked whether KANs can capture nonlinear TF-target dependencies while also providing interpretable symbolic representations.
Slide 3 — Framework
The framework is target-specific. For each target gene, I used candidate transcription factors as input features and trained a KAN model to predict the target-gene expression. The model produced two useful outputs: feature-importance scores to rank TF-target relationships, and symbolic formulas to inspect the learned regulatory relationship. So the output was not only a prediction, but a ranked regulatory hypothesis set that could be interpreted biologically.
Slide 4 — Validation
I did not treat the inferred GRN as ground truth. I treated it as a ranked hypothesis set. I used synthetic validation, comparison with GRNBoost2, perturbation validation, and symbolic formula validation. For perturbation validation, I used the full KAN regression model. I performed in silico TF knockouts and compared the KAN-predicted expression with experimental CRISPRi perturbation data. For interpretability, I separately plugged TF expression values into the symbolic formulas and compared the formula output with observed gene expression.
Slide 5 — Key result
The main result is that the full KAN model recovered population-level perturbation responses much better than cell-level responses. At the single-cell level, correlations were moderate, which is expected because scRNA-seq has dropout and cell-to-cell variability. But after gene-level aggregation, the agreement was very strong, with Pearson correlations above 0.95 for all four transcription factors. I also checked log2 fold change. It showed the same trend: single-cell predictions had more under- or overestimation, while mean-expression predictions were much more accurate. The symbolic formulas helped interpret TF-target dependencies, but they were lower-fidelity than the full KAN model. So I view KAN-derived GRNs as interpretable ranked hypotheses for biological validation.

3. Motivation-letter terms mapped to your thesis

Fibroblast activation state
What it means

Fibroblasts are structural/stromal cells. In disease, they can become activated and start inflammatory, tissue-remodeling, or profibrotic programs.

Relate to thesis

In your thesis, you studied gene-expression states and regulatory programs. Here, instead of TF-target networks in general, you would identify regulatory programs that explain why fibroblasts become inflammatory or fibrotic.

Safe answer

Fibroblast activation means fibroblasts shift from structural cells into inflammatory, tissue-remodeling, or profibrotic states. Computationally, I would identify fibroblast subclusters, marker genes, pathway enrichment, disease-control differences, and candidate regulatory programs.

Inflammation vs fibrosis
What it means

Inflammation = immune activation. Fibrosis = excessive tissue scarring/remodeling.

Relate to thesis

Your thesis dealt with changes in gene expression after perturbation. Here, disease processes can also be seen as changes in gene programs: inflammatory programs or profibrotic extracellular-matrix programs.

Safe answer

Inflammation refers to immune activation and inflammatory signaling. Fibrosis refers to excessive tissue remodeling and extracellular matrix deposition, often involving activated fibroblasts. Chronic inflammation can push fibroblasts toward fibrotic programs.

Mesenchymal programs
What it means

Gene-expression programs in stromal cells, especially fibroblasts.

Relate to thesis

In your thesis, you looked for TF-target dependencies. In this lab, you may look for TFs/pathways driving mesenchymal programs such as extracellular matrix production or tissue remodeling.

Safe answer

By mesenchymal programs, I mean gene-expression programs active in stromal cells such as fibroblasts. These may involve extracellular matrix, tissue remodeling, inflammatory signaling, migration, or profibrotic differentiation.

Spatial neighborhoods
What it means

Which cells are physically close to each other in tissue.

Relate to thesis

Your thesis used single-cell expression without tissue location. Spatial analysis adds location: not only what a cell expresses, but where it is and which cells are nearby.

Safe answer

Spatial neighborhoods describe the local arrangement of cells in tissue. For example, an activated fibroblast may behave differently if it is near macrophages, T cells, endothelial cells, or other fibroblasts.

Immune-stromal communication
What it means

Immune cells and stromal cells signaling to each other.

Relate to thesis

Your thesis inferred regulatory relationships inside gene networks. This is similar in spirit, but now the relationships are between cell types, such as macrophages sending signals to fibroblasts.

Safe answer

Immune-stromal communication means signaling between immune cells and stromal cells such as fibroblasts. Computationally, I would study it using ligand-receptor inference, spatial co-localization, and pathway analysis, while treating results as hypotheses.

Resolving vs pathogenic tissue states
What it means

Resolving = tissue is moving toward repair. Pathogenic = tissue maintains inflammation or fibrosis.

Relate to thesis

Your perturbation validation compared predicted vs actual response. In disease data, you may compare resolving vs pathogenic states and ask which genes/pathways/regulators differ.

Safe answer

A resolving state moves tissue toward repair and reduced inflammation, while a pathogenic state maintains inflammation or fibrosis. Computationally, I would compare cell states, gene programs, pathway activity, and spatial organization between these conditions.

Disease propagation
What it means

How disease signals spread or are maintained across tissue or cell populations.

Relate to thesis

Your thesis asked whether a TF perturbation changes downstream gene expression. Disease propagation is similar conceptually: which cells/signals drive downstream changes in other cells or tissue areas.

Safe answer

By disease propagation, I mean how inflammatory or fibrotic programs may spread across tissue compartments or become maintained over time. Computationally, I would study this using spatial data, cell-cell communication analysis, trajectories, and regulatory models.

Candidate drivers
What it means

Genes, TFs, pathways, or cell-cell signals that may cause or maintain a disease state.

Relate to thesis

In your thesis, TF feature importance ranked candidate TF-target relationships. In this PhD, similar ranking can prioritize candidate drivers of fibroblast activation or immune-stromal communication.

Safe answer

A regulatory model can prioritize candidate transcription factors or pathways that may drive a disease-associated state. I would not call them causal immediately; I would treat them as hypotheses for validation.

Descriptive atlas vs mechanistic model
What it means

Atlas = what cell types/states are present. Mechanistic model = why they arise and what drives them.

Relate to thesis

Your thesis tried to go beyond prediction by extracting interpretable regulatory hypotheses. That is the same idea: go beyond describing clusters and ask what regulates them.

Safe answer

A descriptive atlas tells us which cell types and states are present. A mechanistic model tries to explain why those states arise, which regulators or signals drive them, and which candidates can be experimentally tested.

4. Workflow terms you mentioned

Quality control / QC

Meaning: remove or flag low-quality cells and technical artifacts. Thesis link: you checked detected genes, total counts, and mitochondrial percentage. Answer: I would use QC to remove technical artifacts before biological interpretation.

Integration

Meaning: combine datasets while reducing batch effects. Thesis link: like normalization/preprocessing, but for multiple patients/batches/technologies. Answer: I would correct technical variation while preserving true disease biology.

Annotation

Meaning: assign labels to cells/clusters. Thesis link: you worked with genes/TFs; here you also label cells as fibroblasts, macrophages, T cells, etc.

Differential expression

Meaning: find genes up/down between conditions. Thesis link: related to log2FC and perturbation response. Answer: I would compare disease vs control within the same cell type/state.
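
If they ask what this looks like in practice, here is a minimal sketch, assuming one gene, one cell type, and made-up normalized expression values. It only computes a log2 fold change of mean expression; the pseudocount of 1 is an illustrative choice, and a real analysis would add a statistical test with samples (not cells) as replicates.

```python
import math
from statistics import mean

# Toy normalized expression of one gene in fibroblasts only (made-up values).
control = [1.0, 1.2, 0.8, 1.0]
disease = [3.9, 4.1, 4.0, 4.0]

def log2_fold_change(case, reference, pseudocount=1.0):
    """log2 ratio of mean expression; the pseudocount keeps the ratio finite."""
    return math.log2((mean(case) + pseudocount) / (mean(reference) + pseudocount))

# Means are 4.0 (disease) and 1.0 (control), so this is log2(5 / 2).
lfc = log2_fold_change(disease, control)
print(round(lfc, 2))
```

A positive value means the gene is higher in disease within that cell type. Restricting the comparison to one cell type keeps changes in cell composition from being mistaken for expression changes.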

Differential abundance

Meaning: ask whether a cell type/state is more common in disease. Answer: disease may change both gene expression and the frequency of cell populations.
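
A minimal sketch of the idea, assuming made-up cell-state labels and hypothetical sample names: compute the fraction of activated fibroblasts per sample, then compare the groups. Real differential abundance methods add statistics on top of this; the sketch shows only the quantity being compared.

```python
from collections import Counter
from statistics import mean

# Toy cell-state labels per sample (made-up data; sample names are hypothetical).
samples = {
    "control_1": ["resting"] * 8 + ["activated"] * 2,
    "control_2": ["resting"] * 7 + ["activated"] * 3,
    "disease_1": ["resting"] * 4 + ["activated"] * 6,
    "disease_2": ["resting"] * 5 + ["activated"] * 5,
}

def fraction(labels, state):
    """Fraction of cells in one sample that carry the given state label."""
    return Counter(labels)[state] / len(labels)

activated = {name: fraction(labels, "activated") for name, labels in samples.items()}
control_mean = mean(v for k, v in activated.items() if k.startswith("control"))
disease_mean = mean(v for k, v in activated.items() if k.startswith("disease"))
print(control_mean, disease_mean)
```

Computing the fraction per sample first, rather than pooling all cells, keeps one large sample from dominating the comparison.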

Trajectory / cell-state transition

Meaning: infer a possible path from one cell state to another. Answer: useful for resting-to-activated fibroblast hypotheses, but static scRNA-seq does not directly observe time.

Ligand-receptor inference

Meaning: predict possible cell-cell signaling. Thesis link: like a network, but between cell types instead of TF-target genes. Answer: useful but hypothesis-generating, not proof.

Spatial neighborhood analysis

Meaning: study which cells are near each other in tissue. Thesis link: adds tissue location to expression analysis.
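
A minimal sketch of a neighborhood count, assuming made-up cell coordinates in arbitrary units and a hypothetical radius of 2.0. For each fibroblast it counts which cell types lie within that radius. Dedicated spatial tools do this at scale with neighbor graphs; this only illustrates the concept.

```python
import math
from collections import Counter

# Toy cells: (x, y, cell_type) in arbitrary distance units (made-up data).
cells = [
    (0.0, 0.0, "fibroblast"),
    (1.0, 0.0, "macrophage"),
    (0.0, 1.0, "macrophage"),
    (5.0, 5.0, "fibroblast"),
    (5.0, 6.0, "T cell"),
    (9.0, 9.0, "endothelial"),
]

def neighborhood(index, cells, radius):
    """Count cell types within `radius` of the cell at `index`, excluding itself."""
    x0, y0, _ = cells[index]
    counts = Counter()
    for j, (x, y, cell_type) in enumerate(cells):
        if j != index and math.dist((x0, y0), (x, y)) <= radius:
            counts[cell_type] += 1
    return counts

# Report the local neighborhood of every fibroblast.
for i, (_, _, cell_type) in enumerate(cells):
    if cell_type == "fibroblast":
        print(i, dict(neighborhood(i, cells, radius=2.0)))
```

Here the first fibroblast sits next to two macrophages and the second next to one T cell, which is the kind of contrast a neighborhood analysis is meant to surface.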

5. Technologies you mentioned

scRNA-seq

Gene expression for individual cells. This is closest to your thesis.

Visium

Spatial transcriptomics with tissue spots. Each spot may contain multiple cells. Think: expression + approximate location.

MERSCOPE

High-resolution imaging-based spatial transcriptomics for selected genes. Think: more spatially precise, but targeted.

IMC

Imaging mass cytometry: spatial protein profiling in tissue. Think: protein markers + location.

Clinical metadata

Patient/sample information: disease, treatment, severity, tissue, batch, donor. Helps avoid confusing technical or patient effects with disease biology.

6. Likely questions from your motivation letter

Your letter says fibroblast activation states. What do you mean?
Best answer

Fibroblast activation means fibroblasts shift from structural cells into inflammatory, tissue-remodeling, or profibrotic states. Computationally, I would identify fibroblast subclusters, marker genes, pathway enrichment, disease-control differences, and regulatory programs. I am still learning the detailed fibroblast biology, but this is how I understand the computational question.

What is the difference between inflammation and fibrosis?
Best answer

Inflammation is immune activation and inflammatory signaling. Fibrosis is excessive tissue remodeling or scarring, often involving activated fibroblasts and extracellular matrix deposition. Chronic inflammation can contribute to fibrosis.

What do you mean by spatial neighborhoods?
Best answer

Spatial neighborhoods refer to which cells are physically close to each other in tissue. For example, an activated fibroblast may behave differently if it is near macrophages, T cells, endothelial cells, or other stromal cells.

What is immune-stromal communication?
Best answer

It means signaling between immune cells and stromal cells such as fibroblasts. Computationally, I would study it using ligand-receptor inference, pathway analysis, and spatial co-localization. I would treat it as hypothesis-generating, not proof of signaling.

How can regulatory network models prioritize candidate drivers?
Best answer

A regulatory network model can rank TFs, pathways, or genes associated with a disease state. These are candidate drivers, not proven causal mechanisms. They should be validated using perturbation data, spatial evidence, protein data, or experiments.

What is ligand-receptor inference?
Best answer

It predicts possible cell-cell communication by checking whether one cell type expresses a ligand and another expresses the corresponding receptor. It is useful for hypothesis generation but does not prove active signaling.
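
If they push for how the check works, here is a minimal sketch, assuming made-up expression values and using TGFB1/TGFBR2 only as an illustrative ligand-receptor pair. The score (mean ligand expression in the sender times mean receptor expression in the receiver) is one simple choice for illustration, not the scoring of any specific tool; real tools add curated ligand-receptor databases and statistical testing.

```python
from statistics import mean

# Toy expression per cell type: gene -> values across cells (made-up values).
expression = {
    "macrophage": {"TGFB1": [2.0, 3.0, 4.0], "TGFBR2": [0.0, 0.5, 0.1]},
    "fibroblast": {"TGFB1": [0.2, 0.0, 0.1], "TGFBR2": [1.0, 2.0, 3.0]},
}

def lr_score(sender, receiver, ligand, receptor):
    """Mean ligand expression in the sender times mean receptor expression in the receiver."""
    return mean(expression[sender][ligand]) * mean(expression[receiver][receptor])

# Direction matters: macrophage -> fibroblast versus fibroblast -> macrophage.
forward = lr_score("macrophage", "fibroblast", "TGFB1", "TGFBR2")
reverse = lr_score("fibroblast", "macrophage", "TGFB1", "TGFBR2")
print(forward, reverse)
```

The forward score is high because macrophages express the ligand and fibroblasts express the receptor. Even so, a high score only says the interaction is possible at the RNA level.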

What are the limitations of ligand-receptor inference?
Best answer

It is based mainly on RNA expression, so it does not prove protein abundance, spatial contact, signaling activity, or causality. I would strengthen it using spatial proximity, protein data, perturbation evidence, and known biology.

How would you integrate scRNA-seq, Visium/MERSCOPE, IMC, and clinical metadata?
Best answer

I would use scRNA-seq to define cell types and states, spatial transcriptomics to locate them in tissue, IMC to validate protein-level spatial phenotypes, and clinical metadata to connect molecular patterns to disease group, severity, treatment, or outcome.

What is the difference between Visium and MERSCOPE?
Best answer

My understanding is that Visium gives spatial transcriptomic profiles across tissue spots, where each spot may contain multiple cells. MERSCOPE provides higher-resolution targeted spatial transcriptomics using imaging-based detection of selected genes.

What does moving from descriptive atlases to mechanistic models mean?
Best answer

A descriptive atlas tells us what cell types and states are present. A mechanistic model tries to explain why those states arise, which regulators or signals drive them, and what candidates can be tested experimentally.

Did you write this motivation letter yourself?
Safe honest answer

I used writing assistance to polish the wording, but the motivation reflects the direction I am genuinely interested in. Some biological areas are still new to me, and I am actively preparing them. My strongest contribution right now is computational: single-cell data, machine learning, reproducible workflows, and interpretable analysis.

7. Thesis-based technical questions

How did you handle high-dimensional omics data?

I handled high-dimensional omics data by first reducing technical noise and then reducing dimensionality in a biologically meaningful way. In my thesis, I worked with scRNA-seq data containing more than 36,000 genes. I applied QC, normalization, log1p transformation, and HVG selection to reduce the data to the top 1,000 highly variable genes. Then I used target-specific KAN models rather than one huge global model, and scaled training using GPU/HPC parallelization.
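
A minimal sketch of the HVG idea, assuming made-up log-normalized values and hypothetical gene names. It ranks genes by plain variance across cells and keeps the top ones. This is a simplification: standard pipelines also account for the dependence of variance on mean expression, and the sketch is not the exact selection method from the thesis.

```python
from statistics import pvariance

# Toy log-normalized expression: gene -> values across cells (made-up data).
expression = {
    "GENE_A": [0.0, 0.1, 0.0, 0.1],  # nearly constant
    "GENE_B": [0.0, 2.0, 0.0, 2.0],  # variable
    "GENE_C": [1.0, 1.0, 1.0, 1.0],  # constant
    "GENE_D": [0.0, 3.0, 1.0, 4.0],  # most variable
}

def top_variable_genes(expression, n_top):
    """Rank genes by variance across cells and keep the n_top highest."""
    ranked = sorted(expression, key=lambda g: pvariance(expression[g]), reverse=True)
    return ranked[:n_top]

hvgs = top_variable_genes(expression, n_top=2)
print(hvgs)
```

Genes that barely vary across cells carry little information about cell state, so dropping them shrinks the feature space without losing much signal.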

What kind of QC did you employ?

I used standard scRNA-seq QC. I inspected detected genes per cell, total transcript counts, and mitochondrial read percentage. Low detected genes can indicate poor-quality cells, very high counts can suggest doublets or multiplets, and high mitochondrial content can indicate stressed or dying cells. After QC inspection, I used median-depth normalization, log1p transformation, and HVG selection.
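
A minimal sketch of the three QC metrics, assuming a made-up count matrix and hypothetical thresholds (`MIN_GENES`, `MAX_MITO_PCT`). In practice the cutoffs are chosen after inspecting the distributions, not fixed in advance.

```python
# Toy count matrix: cell -> {gene: count} (made-up data).
# Genes prefixed "MT-" are treated as mitochondrial.
cells = {
    "cell_1": {"ACTB": 50, "CD99": 5, "GAPDH": 40, "MT-CO1": 5},
    "cell_2": {"ACTB": 2, "MT-CO1": 8},  # few genes, high mitochondrial share
    "cell_3": {"ACTB": 30, "CD99": 0, "GAPDH": 25, "MT-ND1": 3},
}

# Hypothetical thresholds for illustration only.
MIN_GENES = 3
MAX_MITO_PCT = 20.0

def qc_metrics(counts):
    """Total counts, number of detected genes, and mitochondrial percentage for one cell."""
    total = sum(counts.values())
    detected = sum(1 for c in counts.values() if c > 0)
    mito = sum(c for g, c in counts.items() if g.startswith("MT-"))
    pct_mito = 100.0 * mito / total if total else 0.0
    return {"total_counts": total, "n_genes": detected, "pct_mito": pct_mito}

metrics = {name: qc_metrics(counts) for name, counts in cells.items()}
passed = [
    name for name, m in metrics.items()
    if m["n_genes"] >= MIN_GENES and m["pct_mito"] <= MAX_MITO_PCT
]
print(passed)
```

In this toy data, `cell_2` fails on both criteria: only two detected genes and 80% mitochondrial counts.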

Why median-depth normalization and log1p transformation?

Median-depth normalization corrects library-size differences by scaling each cell's counts to the median library size, which is robust to extreme high-count cells. log1p, that is log(1 + x), compresses the skewed expression range and maps zero counts to zero instead of failing on log(0). This gave stable inputs for regression-based KAN models while preserving biologically meaningful variation.
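
A minimal sketch of both steps, assuming a made-up count matrix with one row per cell. Each cell is scaled so its total equals the median library size, then log1p is applied. It assumes every cell has nonzero depth, which holds after QC.

```python
import math
from statistics import median

# Toy raw counts: one row per cell, one column per gene (made-up data).
counts = [
    [10, 0, 10],   # depth 20
    [20, 20, 0],   # depth 40
    [30, 30, 20],  # depth 80
]

def normalize_log1p(counts):
    """Scale each cell to the median library size, then apply log(1 + x)."""
    depths = [sum(row) for row in counts]
    target = median(depths)  # typical library size, here 40
    return [
        [math.log1p(value * target / depth) for value in row]
        for row, depth in zip(counts, depths)
    ]

normalized = normalize_log1p(counts)
for row in normalized:
    print([round(v, 3) for v in row])
```

After scaling, all three cells have the same total (40) before the log step, and zero counts stay exactly zero afterwards.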

Why not mean-depth normalization?

Mean depth can be pulled upward by outlier cells with very high counts, such as doublets or technical artifacts. Median depth better represents the typical cell, so scaling to the median avoids over-amplifying noise caused by extreme cells.
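
A small worked example, assuming made-up library sizes where the last cell mimics a doublet-like outlier:

```python
from statistics import mean, median

# Toy library sizes (total counts per cell); the last value is an outlier.
depths = [900, 1000, 1000, 1100, 6000]

mean_depth = mean(depths)      # pulled up to 2000 by the single outlier
median_depth = median(depths)  # stays at 1000, the typical cell

print(mean_depth, median_depth)
```

Scaling to the mean would double the counts of every typical cell here, while scaling to the median leaves typical cells almost unchanged.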

How did you handle noise, sparsity, and dropout?

I did not explicitly remove dropout using imputation. I handled noise and sparsity indirectly through QC, median-depth normalization, log1p transformation, HVG selection, regularized KAN training, and evaluation at both single-cell and gene-mean levels. The moderate single-cell correlations show the noise remained challenging, while the high gene-mean correlations show the model captured the dominant biological perturbation signal.
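
To explain why aggregation helps, here is a minimal simulation, assuming made-up predicted values and Gaussian per-cell noise (not thesis data). The noise level, gene count, and cell count are illustrative choices. It compares the Pearson correlation at the single-cell level with the correlation after averaging over cells.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)
predicted_gene_level = [0.0, 1.0, 2.0, 3.0, 4.0]  # one predicted value per gene (toy)
n_cells = 400

# Observed single-cell values = predicted signal + heavy per-cell noise.
observed = [[p + rng.gauss(0.0, 2.0) for _ in range(n_cells)] for p in predicted_gene_level]

# Single-cell level: every (gene, cell) pair is one point.
pred_cells = [p for p in predicted_gene_level for _ in range(n_cells)]
obs_cells = [v for gene_values in observed for v in gene_values]
r_single_cell = pearson(pred_cells, obs_cells)

# Gene-mean level: average over cells first, then correlate.
obs_means = [mean(gene_values) for gene_values in observed]
r_gene_mean = pearson(predicted_gene_level, obs_means)

print(round(r_single_cell, 2), round(r_gene_mean, 2))
```

Averaging over many cells cancels most of the per-cell noise, so the gene-mean correlation is much higher than the single-cell one even though the underlying signal is identical.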

What is batch effect?

A batch effect is unwanted technical variation caused by sample processing, sequencing run, lab, machine, reagent lot, or processing date. It is dangerous because cells may cluster by batch rather than biology. Batch correction should remove technical variation without removing real disease signal.

What is causality?

Causality means that changing one variable produces a change in another. In GRN inference, correlation between TF A and gene B does not prove that A regulates B. The strongest causal evidence comes from perturbation experiments and wet-lab validation; time-course data and chromatin accessibility add supporting evidence.

Why was CD99 an outlier in log2FC?

CD99 was an outlier because the experimentally observed expression was near zero in several perturbation settings, while the model predicted a small nonzero value. Since log2FC is ratio-based, dividing by near-zero actual expression produces a very large positive fold change. I would interpret it as a small model overestimation amplified by the ratio, not as evidence of a real biological effect.
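
A small worked example of the ratio effect, assuming made-up values (not thesis data) and assuming the fold change is computed as predicted over actual expression. The pseudocount variant is shown only as one common way to damp the effect, not as what the thesis did.

```python
import math

def log2fc(predicted, actual, pseudocount=0.0):
    """log2 ratio of predicted to actual expression, with an optional pseudocount."""
    return math.log2((predicted + pseudocount) / (actual + pseudocount))

# Observed expression is near zero; the prediction is small but nonzero.
actual, predicted = 0.001, 0.05

raw = log2fc(predicted, actual)          # ratio of 50, so log2FC is above 5
damped = log2fc(predicted, actual, 1.0)  # pseudocount of 1 shrinks it close to 0

print(round(raw, 2), round(damped, 2))
```

The absolute error is only 0.049 expression units, yet the raw log2FC looks dramatic. That is why a near-zero denominator should be checked before interpreting an outlier.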

8. Memorize these simple lines tonight

1

My strongest background is computational: high-dimensional biological data, ML, statistics, and reproducible workflows. I am still deepening my immunology, but I can contribute technically while learning the disease biology.

2

Fibroblast activation means fibroblasts shifting from structural cells into inflammatory, tissue-remodeling, or profibrotic states.

3

Spatial neighborhoods describe which cells are physically near each other in tissue, and they matter because cell function depends on local context.

4

Immune-stromal communication means signaling between immune cells and stromal cells, such as macrophages or T cells interacting with fibroblasts.

5

Ligand-receptor inference predicts possible cell-cell signaling based on ligand expression in one cell type and receptor expression in another. It is hypothesis-generating.

6

A descriptive atlas tells us what cell types and states exist. A mechanistic model tries to explain why those states arise and what drives them.

7

A regulatory hypothesis is a candidate mechanism suggested by a model, such as a TF or pathway that may drive a disease-associated state. It is not proof of causality until validated.

8

For unknown biology questions, be honest: I am still learning the detailed immunology, but computationally I would approach it through QC, annotation, differential analysis, pathways, spatial neighborhoods, communication analysis, and regulatory hypotheses.