Hieratype Pipeline: Bug Fixes & Optimization
Hey folks, let's dive into some key areas for improvement within the hieratype pipeline. These notes are all about making the code more robust, efficient, and easier to work with. We'll be looking at potential bugs, memory concerns, and overall code quality to make sure everything runs smoothly. Let's get started!
Fixing the run_pipeline() Function's Field Name
The Issue: The run_pipeline() function is using the wrong field name. This is a common typo but can cause major problems.
The Problem: In the run_pipeline.R file, the code uses pipeline$markerslist when calling fit_metagene_scores. However, the make_pipeline() function actually creates a list named pipeline with a field called markerslists.
do.call(fit_metagene_scores,
c(list(markerslist = pipeline$markerslist[[parnt]], ...))) # Incorrect
pipeline <- list(markerslists = markerslists, priors = priors, priors_category = priors_category) # Correct
So, when the code tries to access pipeline$markerslist, it's looking for something that doesn't exist, leading to an error.
The Fix: The solution is straightforward. Change all instances of pipeline$markerslist[[...]] to pipeline$markerslists[[...]]. This ensures the code is referencing the correct data structure, and the pipeline will run as expected. This seemingly small tweak is important for the whole system, so let's get it right!
Seed Handling in cluster_metagenes(): Ensuring Correctness and Safety
The Issue: The current approach to seed handling in the cluster_metagenes() function has some vulnerabilities.
The Problem: The current code attempts to save and restore the random seed using .Random.seed. However, .Random.seed might not exist in a fresh R session, and the method of restoring the seed isn't the cleanest. Writing back to the global environment (<<-) can lead to unexpected side effects.
orig_seed <- .Random.seed
on.exit({.Random.seed <<- orig_seed})
The code calls set.seed(seed) multiple times within a loop. The restoration process should be localized and robust.
The Fix: There are a couple of recommended approaches:
-
Use
withr::with_seed(): This is the cleaner, more modern approach.withris a package designed for managing temporary states, including the random seed. Usingwithr::with_seed(seed, { ... })ensures the seed is set only within the specified block of code and is automatically restored afterward. -
Implement a Safe Restore: This approach involves checking if
.Random.seedexists before attempting to save it and restoring it properly, taking into account the possibility that.Random.seedmight not exist in the first place.
orig_seed <- if (exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)) get(".Random.seed", envir = .GlobalEnv) else NULL
on.exit({
if (is.null(orig_seed)) rm(".Random.seed", envir = .GlobalEnv) else assign(".Random.seed", orig_seed, envir = .GlobalEnv)
}, add = TRUE)
Important Note: The original use of .Random.seed caused issues with other software. It's essential to check if .Random.seed exists before attempting to use it.
Addressing Potential Errors in cluster_metagenes() with Zero Anchor Cells
The Issue: A potential error arises in cluster_metagenes() when some cell type models are not fitted.
The Problem: The function fits per-cell type models only if nfit[ii] > 0. However, it later iterates through all cell types and calls score_mixture_overfit(). If a cell type has no fitted model (because nfit[ii] == 0), score_mixture_overfit() will throw an error.
for(ii in names(models)){
ll_scores[[ii]] <- score_mixture_overfit(models[[ii]], ...)
}
The Fix: The solution involves explicitly handling non-fitted classes. This could include setting the log-likelihood for non-fitted classes to -Inf and being extra cautious in the which.max logic to avoid errors. The column names should also be set based on what was actually scored (colnames(llmat)), and not the scores variable.
Posterior Column Naming: Avoiding Mislabeling
The Issue: There's a risk of mismatch in column names, which can lead to mislabeling of posterior probabilities.
The Problem: The code computes ll_scores only for the names(models), and then constructs post_probs using colnames(scores). If the order or set of scored classes differs from colnames(scores), the column names in post_probs will be incorrect, potentially leading to incorrect interpretations.
post_probs <- data.table::data.table(as.matrix(ppmat))
colnames(post_probs) <- colnames(scores)
The Fix: Set the column names from the object you've actually built, which is likely llmat. This ensures that the column names in post_probs accurately reflect the classes used in the log-likelihood calculations. Be sure to use these new names downstream to maintain consistency.
Enforcing Consistent Cell ID Alignment
The Issue: Inconsistent cell ID alignment across different functions can lead to errors.
The Problem: Several parts of the code assume that row names exist and match across different objects. For instance:
cluster_metagenes()usesrownames(scores)to alignprior_prob_level.combine_postprob_tables()merges data bycell_ID.run_pipeline()passesinitial_prior_weightswithout strict alignment checks.
Recommendations:
- Establish an Invariant: Define a strict rule: