Hieratype Pipeline: Bug Fixes & Optimization

Jan 16, 2026 by Editorial Team 45 views

Hey folks, let's dive into some key areas for improvement within the hieratype pipeline. These notes are all about making the code more robust, efficient, and easier to work with. We'll be looking at potential bugs, memory concerns, and overall code quality to make sure everything runs smoothly. Let's get started!

Fixing the `run_pipeline()` Function's Field Name

The Issue: The run_pipeline() function is using the wrong field name. This is a common typo but can cause major problems.

The Problem: In the run_pipeline.R file, the code uses pipeline$markerslist when calling fit_metagene_scores. However, the make_pipeline() function actually creates a list named pipeline with a field called markerslists.

do.call(fit_metagene_scores,
  c(list(markerslist = pipeline$markerslist[[parnt]], ...))) # Incorrect

pipeline <- list(markerslists = markerslists, priors = priors, priors_category = priors_category) # Correct

So, when the code tries to access pipeline$markerslist, it's looking for something that doesn't exist, leading to an error.

The Fix: The solution is straightforward. Change all instances of pipeline$markerslist[[...]] to pipeline$markerslists[[...]]. This ensures the code is referencing the correct data structure, and the pipeline will run as expected. This seemingly small tweak is important for the whole system, so let's get it right!

Seed Handling in `cluster_metagenes()`: Ensuring Correctness and Safety

The Issue: The current approach to seed handling in the cluster_metagenes() function has some vulnerabilities.

The Problem: The current code attempts to save and restore the random seed using .Random.seed. However, .Random.seed might not exist in a fresh R session, and the method of restoring the seed isn't the cleanest. Writing back to the global environment (<<-) can lead to unexpected side effects.

orig_seed <- .Random.seed
on.exit({.Random.seed <<- orig_seed})

The code calls set.seed(seed) multiple times within a loop. The restoration process should be localized and robust.

The Fix: There are a couple of recommended approaches:

Use withr::with_seed(): This is the cleaner, more modern approach. withr is a package designed for managing temporary states, including the random seed. Using withr::with_seed(seed, { ... }) ensures the seed is set only within the specified block of code and is automatically restored afterward.
Implement a Safe Restore: This approach involves checking if .Random.seed exists before attempting to save it and restoring it properly, taking into account the possibility that .Random.seed might not exist in the first place.

orig_seed <- if (exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)) get(".Random.seed", envir = .GlobalEnv) else NULL
on.exit({
  if (is.null(orig_seed)) rm(".Random.seed", envir = .GlobalEnv) else assign(".Random.seed", orig_seed, envir = .GlobalEnv)
}, add = TRUE)

Important Note: The original use of .Random.seed caused issues with other software. It's essential to check if .Random.seed exists before attempting to use it.

Addressing Potential Errors in `cluster_metagenes()` with Zero Anchor Cells

The Issue: A potential error arises in cluster_metagenes() when some cell type models are not fitted.

The Problem: The function fits per-cell type models only if nfit[ii] > 0. However, it later iterates through all cell types and calls score_mixture_overfit(). If a cell type has no fitted model (because nfit[ii] == 0), score_mixture_overfit() will throw an error.

for(ii in names(models)){
  ll_scores[[ii]] <- score_mixture_overfit(models[[ii]], ...)
}

The Fix: The solution involves explicitly handling non-fitted classes. This could include setting the log-likelihood for non-fitted classes to -Inf and being extra cautious in the which.max logic to avoid errors. The column names should also be set based on what was actually scored (colnames(llmat)), and not the scores variable.

Posterior Column Naming: Avoiding Mislabeling

The Issue: There's a risk of mismatch in column names, which can lead to mislabeling of posterior probabilities.

The Problem: The code computes ll_scores only for the names(models), and then constructs post_probs using colnames(scores). If the order or set of scored classes differs from colnames(scores), the column names in post_probs will be incorrect, potentially leading to incorrect interpretations.

post_probs <- data.table::data.table(as.matrix(ppmat))
colnames(post_probs) <- colnames(scores)

The Fix: Set the column names from the object you've actually built, which is likely llmat. This ensures that the column names in post_probs accurately reflect the classes used in the log-likelihood calculations. Be sure to use these new names downstream to maintain consistency.

Enforcing Consistent Cell ID Alignment

The Issue: Inconsistent cell ID alignment across different functions can lead to errors.

The Problem: Several parts of the code assume that row names exist and match across different objects. For instance:

cluster_metagenes() uses rownames(scores) to align prior_prob_level.
combine_postprob_tables() merges data by cell_ID.
run_pipeline() passes initial_prior_weights without strict alignment checks.

Recommendations:

Establish an Invariant: Define a strict rule: