Chapter 3 Spotcalling

MERFISH data comprises fluorescent images across different channels and imaging cycles. The desired end result are spot coordinates and spot identities.

To that end, we need to perform several subtasks: (1) establish parameters and parse necessary metadata, (2) load and process relevant images, (3) consolidate information across images and decode intensity vectors using the MERFISH codebook, (3) filter out low quality spots, and finally (4) save the output. Subtasks 2-4 are repeatedly performed as we loop through FOVs.

library(masmr)
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.2

3.1 Pre-processing

Before running your pipeline proper, there are a series of pre-processing steps that each script needs to perform. These will be explained in detail here, but the same principles apply to cell segmentation and consolidation scripts.

3.1.1 Establishing parameters

All scripts should begin by establishing parameters: e.g. location of files to read, location to output files, file formats etc.

This is accomplished in masmr using the establishParams() function, which automatically creates an environment object called params in the global environment that will be repeatedly referenced by downstream functions.

data_dir = 'C:/Users/kwaje/Downloads/JIN_SIM/20190718_WK_T2_a_B2_R8_12_d50/'
establishParams(
  imageDirs = paste0(data_dir, 'S10/'),
  imageFormat = '.ome.tif',
  codebookFileName = list.files(paste0(data_dir, 'codebook/'), pattern='codebook', full.names = TRUE),
  codebookColumnNames = paste0(rep(c(0:3), 4), '_', rep(c('cy3', 'alexa594', 'cy5', 'cy7'), each=4) ),
  outDir = 'C:/Users/kwaje/Downloads/JYOUT/',
  dapiDir = paste0(data_dir, 'DAPI_1/'),
  pythonLocation = 'C:/Users/kwaje/AppData/Local/miniforge3/envs/scipyenv/',
  fpkmFileName = paste0(data_dir, "FPKM_data/Jin_FPKMData_B.tsv"),
  nProbesFileName = paste0(data_dir, "ProbesPerGene.csv"),
  verbose = FALSE #It is generally a good idea to keep verbose as True, but will turn it off for this showcase
)

## Assuming .ome.tif is extension for meta files...

As the bare minimum, users need to establish:

imageDirs: where the MERFISH image files are located.
imageFormat: what image file formats to read. Typically .ome.tif or .dax, but other formats compatible with Bioformats are acceptable too.
codebookFileName: location of MERFISH codebook.
codebookColumnNames: names of every bit in the MERFISH codebook. Each “bit name” should typically comprise the image index (e.g. image 1, 2, 3 etc) and channel (e.g. Cy3, Alexa-594 etc): so a bit titled 0_cy3 tells us that the image to be loaded is the Cy3 channel of the first image in the set (here, starting from 0). This is important because the order images are loaded in may not match up with the order of bits in the codebook.
outDir: where output files should be written to.

Other parameters such as dapiDir and pythonLocation will be discussed subsequently for Cell Segmentation.

There also exist some parameters such as fpkmFileName (file containing FPKM data) and nProbesFileName (file containing number of MERFISH encoding probes per gene), which are files to be read for quality control purposes. However, these parameters are designed to be optional, and your code should run fine even if they are not specified.

The information stored in the params object can be read for further details:

ls( envir = params )

##  [1] "codebook_colnames"         "codebook_file"            
##  [3] "dapi_dir"                  "dapi_format"              
##  [5] "dir_names"                 "fov_names"                
##  [7] "fpkm_file"                 "im_dir"                   
##  [9] "im_format"                 "inferred_codebookColnames"
## [11] "meta_format"               "n_probes_file"            
## [13] "nbits"                     "out_dir"                  
## [15] "parent_out_dir"            "python_location"          
## [17] "raw_codebook"              "resumeMode"               
## [19] "seed"                      "verbose"

Note the existence of the params$resumeMode object. This houses a boolean that determines if functions are skipped if there is evidence that they have been run before. As default, to save time, this is set to TRUE.

3.1.2 Getting parameters from image metadata

The next step is to obtain image metadata that accompanies most microscopy images. This is typically found in the same file as your image data (e.g. .ome.tif) but can sometimes be in a separate file (e.g. .xml that accompany .dax images). masmr will attempt to guess which meta data format is appropriate, but users can change values in params as needed.

We can check what values have been entered into params:

params$meta_format

## [1] ".ome.tif"

If unsatisfied, we can update as follows:

params$meta_format <- '.dax'
print(params$meta_format)

## [1] ".dax"

params$meta_format <- '.ome.tif'
print(params$meta_format)

## [1] ".ome.tif"

Having specified our meta data formats, we can then load additional meta data information with the readImageMetaData function.

readImageMetaData()

## Warning in readImageMetaData(): Assuming this data has been acquired by Triton

## Warning in readImageMetaData(): Updating params$fov_names

This adds additional information to params, such as resolution information:

print( params$resolutions )

## $per_pixel_microns
## [1] "0.0706" "0.0706"
## 
## $xydimensions_pixels
## [1] "2960" "2960"
## 
## $xydimensions_microns
## [1] "208.976" "208.976"
## 
## $channel_order
## [1] "cy7"      "cy5"      "alexa594" "cy3"     
## 
## $fov_names
##   [1] "MMStack_1-Pos_000_000" "MMStack_1-Pos_001_000" "MMStack_1-Pos_002_000"
##   [4] "MMStack_1-Pos_003_000" "MMStack_1-Pos_004_000" "MMStack_1-Pos_005_000"
##   [7] "MMStack_1-Pos_006_000" "MMStack_1-Pos_007_000" "MMStack_1-Pos_008_000"
##  [10] "MMStack_1-Pos_009_000" "MMStack_1-Pos_009_001" "MMStack_1-Pos_008_001"
##  [13] "MMStack_1-Pos_007_001" "MMStack_1-Pos_006_001" "MMStack_1-Pos_005_001"
##  [16] "MMStack_1-Pos_004_001" "MMStack_1-Pos_003_001" "MMStack_1-Pos_002_001"
##  [19] "MMStack_1-Pos_001_001" "MMStack_1-Pos_000_001" "MMStack_1-Pos_000_002"
##  [22] "MMStack_1-Pos_001_002" "MMStack_1-Pos_002_002" "MMStack_1-Pos_003_002"
##  [25] "MMStack_1-Pos_004_002" "MMStack_1-Pos_005_002" "MMStack_1-Pos_006_002"
##  [28] "MMStack_1-Pos_007_002" "MMStack_1-Pos_008_002" "MMStack_1-Pos_009_002"
##  [31] "MMStack_1-Pos_009_003" "MMStack_1-Pos_008_003" "MMStack_1-Pos_007_003"
##  [34] "MMStack_1-Pos_006_003" "MMStack_1-Pos_005_003" "MMStack_1-Pos_004_003"
##  [37] "MMStack_1-Pos_003_003" "MMStack_1-Pos_002_003" "MMStack_1-Pos_001_003"
##  [40] "MMStack_1-Pos_000_003" "MMStack_1-Pos_000_004" "MMStack_1-Pos_001_004"
##  [43] "MMStack_1-Pos_002_004" "MMStack_1-Pos_003_004" "MMStack_1-Pos_004_004"
##  [46] "MMStack_1-Pos_005_004" "MMStack_1-Pos_006_004" "MMStack_1-Pos_007_004"
##  [49] "MMStack_1-Pos_008_004" "MMStack_1-Pos_009_004" "MMStack_1-Pos_009_005"
##  [52] "MMStack_1-Pos_008_005" "MMStack_1-Pos_007_005" "MMStack_1-Pos_006_005"
##  [55] "MMStack_1-Pos_005_005" "MMStack_1-Pos_004_005" "MMStack_1-Pos_003_005"
##  [58] "MMStack_1-Pos_002_005" "MMStack_1-Pos_001_005" "MMStack_1-Pos_000_005"
##  [61] "MMStack_1-Pos_000_006" "MMStack_1-Pos_001_006" "MMStack_1-Pos_002_006"
##  [64] "MMStack_1-Pos_003_006" "MMStack_1-Pos_004_006" "MMStack_1-Pos_005_006"
##  [67] "MMStack_1-Pos_006_006" "MMStack_1-Pos_007_006" "MMStack_1-Pos_008_006"
##  [70] "MMStack_1-Pos_009_006" "MMStack_1-Pos_009_007" "MMStack_1-Pos_008_007"
##  [73] "MMStack_1-Pos_007_007" "MMStack_1-Pos_006_007" "MMStack_1-Pos_005_007"
##  [76] "MMStack_1-Pos_004_007" "MMStack_1-Pos_003_007" "MMStack_1-Pos_002_007"
##  [79] "MMStack_1-Pos_001_007" "MMStack_1-Pos_000_007" "MMStack_1-Pos_000_008"
##  [82] "MMStack_1-Pos_001_008" "MMStack_1-Pos_002_008" "MMStack_1-Pos_003_008"
##  [85] "MMStack_1-Pos_004_008" "MMStack_1-Pos_005_008" "MMStack_1-Pos_006_008"
##  [88] "MMStack_1-Pos_007_008" "MMStack_1-Pos_008_008" "MMStack_1-Pos_009_008"
##  [91] "MMStack_1-Pos_009_009" "MMStack_1-Pos_008_009" "MMStack_1-Pos_007_009"
##  [94] "MMStack_1-Pos_006_009" "MMStack_1-Pos_005_009" "MMStack_1-Pos_004_009"
##  [97] "MMStack_1-Pos_003_009" "MMStack_1-Pos_002_009" "MMStack_1-Pos_001_009"
## [100] "MMStack_1-Pos_000_009"
## 
## $zslices
## [1] "1"

It also creates a dataframe that stores file information for every bit. This will be referenced heavily by downstream functions.

knitr::kable( head( params$global_coords ), format = 'html', booktabs = TRUE )

x_microns	y_microns	z_microns	image_file	fov	bit_name_full	bit_name_alias	bit_name
-2645.753	3421.847	5673.78	C:/Users/kwaje/Downloads/JIN_SIM/20190718_WK_T2_a_B2_R8_12_d50/S10/hyb00_1/hyb00_1_MMStack_1-Pos_000_000.ome.tif	MMStack_1-Pos_000_000	hyb00_1_cy7	0_cy7	0_cy7
-2645.753	3421.847	5673.78	C:/Users/kwaje/Downloads/JIN_SIM/20190718_WK_T2_a_B2_R8_12_d50/S10/hyb00_1/hyb00_1_MMStack_1-Pos_000_000.ome.tif	MMStack_1-Pos_000_000	hyb00_1_cy5	0_cy5	0_cy5
-2645.753	3421.847	5673.78	C:/Users/kwaje/Downloads/JIN_SIM/20190718_WK_T2_a_B2_R8_12_d50/S10/hyb00_1/hyb00_1_MMStack_1-Pos_000_000.ome.tif	MMStack_1-Pos_000_000	hyb00_1_alexa594	0_alexa594	0_alexa594
-2645.753	3421.847	5673.78	C:/Users/kwaje/Downloads/JIN_SIM/20190718_WK_T2_a_B2_R8_12_d50/S10/hyb00_1/hyb00_1_MMStack_1-Pos_000_000.ome.tif	MMStack_1-Pos_000_000	hyb00_1_cy3	0_cy3	0_cy3
-2645.753	3421.847	5673.78	C:/Users/kwaje/Downloads/JIN_SIM/20190718_WK_T2_a_B2_R8_12_d50/S10/hyb01_1/hyb01_1_MMStack_1-Pos_000_000.ome.tif	MMStack_1-Pos_000_000	hyb01_1_cy7	1_cy7	1_cy7
-2645.753	3421.847	5673.78	C:/Users/kwaje/Downloads/JIN_SIM/20190718_WK_T2_a_B2_R8_12_d50/S10/hyb01_1/hyb01_1_MMStack_1-Pos_000_000.ome.tif	MMStack_1-Pos_000_000	hyb01_1_cy5	1_cy5	1_cy5

Note that as default, the above information is saved to a sub-directory in your chosen output folder.

print( params$parent_out_dir )

## [1] "C:/Users/kwaje/Downloads/JYOUT/"

print( params$out_dir )

## [1] "C:/Users/kwaje/Downloads/JYOUT//IM/"

3.1.3 Preparing the codebook

A codebook is a table of indicating the sequence of ON (1) and OFF (0) bits for each gene (example below). masmr expects this to be a table (e.g. .csv or .txt) that contains N rows for N genes/blanks, and M+1 columns for M bits (e.g. below is a 16-bit setup) and 1 gene column.

knitr::kable( head(data.table::fread( params$codebook_file, data.table = F )), format = 'html', booktabs = TRUE )

V2	V3	V4	V5	V6	V7	V9	V10	V11	V13	V14	V15	V17
0	1	1	0	1	0	0	1	0	0	0	0	Blank-1
0	1	0	0	1	0	0	0	0	1	1	0	Blank-2
0	1	0	0	0	1	0	0	1	1	0	0	Blank-3
1	0	1	1	0	0	0	0	0	0	0	1	Blank-4
0	0	1	0	0	0	0	1	0	1	1	0	Blank-5
1	1	0	1	0	0	1	0	0	0	0	0	ASCL1

Unique to spotcalling, users must “prepare” their codebooks. This typically involves ensuring that the order of bits matches the order of image loading that masmr performs. Additionally, masmr also provides the option to add more “blanks” beyond those already specified by the user.

prepareCodebook( exhaustiveBlanks = T )

The new ordered codebook is stored in the output directory, and users can refer to it as follows:

knitr::kable( head( params$ordered_codebook ), format = 'html', booktabs = TRUE )

	0_cy7	0_cy5	0_alexa594	1_cy7	1_cy5	1_alexa594	1_cy3	2_cy7	2_cy5	2_alexa594	2_cy3	3_cy3
Blank-1	0	0	0	0	1	1	0	0	0	0	1	1
Blank-2	1	0	0	1	0	1	0	0	0	0	1	0
Blank-3	1	0	0	0	0	0	0	0	1	1	1	0
Blank-4	0	0	1	0	0	0	1	1	0	0	0	1
Blank-5	1	0	0	1	1	0	0	0	0	0	0	1
ASCL1	0	1	1	0	0	0	1	0	0	0	1	0

3.1.4 House-keeping

Having performed all the necessary pre-processing, it may be useful for users to take a snapshot of the current environment. Unnecessary data can then be cleared from memory in subsequent steps to return the global environment to that snapshot. This is accomplished using our establishCleanSlate and cleanUp functions.

An example use-case is provided below:

establishCleanSlate()

## [1] "data_dir"   "full_res"   "n_channels" "params"

print(cleanSlate)

## [1] "data_dir"   "full_res"   "n_channels" "params"     "cleanSlate"

x = '12345'
exists('x')

## [1] TRUE

cleanUp()
exists('x')

## [1] FALSE

3.2 Preparing to loop through images

Having done the necessary pre-processing, users may now begin processing each FOV. This is done by looping through params$fov_names. It is also recommended that the current FOV be saved in params$current_fov.

head(params$fov_names)

## [1] "MMStack_1-Pos_000_000" "MMStack_1-Pos_001_000" "MMStack_1-Pos_002_000"
## [4] "MMStack_1-Pos_003_000" "MMStack_1-Pos_004_000" "MMStack_1-Pos_005_000"

params$current_fov <- params$fov_names[38]
print(params$current_fov)

## [1] "MMStack_1-Pos_002_003"

Users may also wish to check if they are ready to loop with our checkReadyLoop function, but this is entirely optional.

checkReadyLoop()

## Warning in checkReadyLoop(): params$brightness_min not found:
##       If desired, ensure getAnchorParams() has been run

## [1] FALSE

3.2.1 Image reading

To read images, we rely on the RBioformats package, which is able to read most image file formats. One exception is .dax files – however, these are binary files and can be read by base R. To account for these different types of images, we can use the readImage function in masmr.

Below, I provide an example of a .dax file:

file <- 'C:/Users/kwaje/Downloads/LIU_CONNECT_CORTICALORGANOID_240902/DATA/20240902/cont/Alexa594_0_10.dax'
im <- readImage( 
  file,
  imageDimensions = c(1,2048,2048,1),
  nchannels = 1, #The image only has 1 channel
  nzs = 1
  )

## Warning in readImage(file, imageDimensions = c(1, 2048, 2048, 1), nchannels =
## 1, : Unable to read with RBioFormats

print(length(im))

## [1] 1

imsub <- im[[1]]
plot(imager::as.cimg(imsub), main = paste0('Image dimensions: ', paste(dim(imsub), collapse = ' x ') ))

From above, note that the default is assumed to be: Number of Channels, Number of rows, Number of Columns, Number of Z slices. In the above case, since and are both 1, users could technically specify image dimensions using width an height alone. However, it is generally good practice to keep to a consistent format.

For .ome.tif files, the same function can be used. However, we need to specify an additional series parameter: otherwise, all FOVs will be loaded. For more details, we recommend reading the documentation for RBioformats::read.image. Below, I provide an example of reading the second channel for the current FOV (specified above).

file <- unique( params$global_coords[params$global_coords$fov==params$fov_names, 'image_file'] )[1]
series <- which( params$fov_names == params$current_fov )
im <- readImage( 
  file, 
  nchannels = 4, # The image has 4 channels total
  nzs = 1,
  series = series )
print( length(im) ) #4 channels

## [1] 4

imsub <- im[[2]]
plot(imager::as.cimg( imsub ), main = paste0('Image dimensions: ', paste(dim(imsub), collapse = ' x ') ) )

## Warning in as.cimg.array(imsub): Assuming third dimension corresponds to
## time/depth

Note that readImage as written above reads all four channels into memory, and we subset the second channel using im[,,2]. An alternative is to just load the second channel directly using the subset parameter in RBioformats::read.image, or the channelIndex parameter in readImage. This is useful for memory management.

im <- readImage( 
  file,
  nchannels = 4,
  nzs = 1,
  series = series, 
  channelIndex = 2 )
print(length(im)) #Only 1 channel loaded

## [1] 1

imsub <- im[[1]]
plot(imager::as.cimg(imsub), main = paste0('Image dimensions: ', paste(dim(imsub), collapse = ' x ') ) )

## Warning in as.cimg.array(imsub): Assuming third dimension corresponds to
## time/depth

For MERFISH, we typically have to handle multiple images at once. We have thus created an additional image reading function called readImageList that loads all images for a given FOV, by cross-referencing params$global_coords to find all relevant files. And because readImageList does this cross-referencing, parameters such as nchannels, nzs etc do not have to be specified.

imList <- readImageList( params$current_fov )
names(imList)

##  [1] "0_cy7"      "0_cy5"      "0_alexa594" "0_cy3"      "1_cy7"     
##  [6] "1_cy5"      "1_alexa594" "1_cy3"      "2_cy7"      "2_cy5"     
## [11] "2_alexa594" "2_cy3"      "3_cy7"      "3_cy5"      "3_alexa594"
## [16] "3_cy3"

3.2.2 Establishing global image parameters

As a first step, it is recommended that we determine the dynamic range of intensity values: i.e. the brightness value for below which (i.e. minimum brightness / floor) and above which (i.e. maximum brightness / ceiling) we can safely ignore gradation. A sensible minimum brightness would be intensities for background pixels where no tissue is imaged. For maximum brightness, we want a cap where all super-threshold intensities can effectively be considered ‘ON’. Ideally, these limits should be relatively consistent across FOVs.

We have created a function getAnchorParams that loads multiple FOVs and attempts to get global image parameters. As default, brightness_max and brightness_min are returned. For details, see the documentation for getAnchorParams.

getAnchorParams( nSamples = 6 )

These values are appended to params:

params$brightness_min

##  [1] 0.02665484 0.07892208 0.02916114 0.04901184 0.03023053 0.06347031
##  [7] 0.02397402 0.04414353 0.02355671 0.07126881 0.02574727 0.04234602
## [13] 0.02069601 0.06184499 0.02228692 0.04452783

Anchor FOVs are those used to derive global image parameters. These can be user-specified, or automatically determined with getAnchorParams. If automatic, nSamples number of FOVs are randomly selected. These can be visualised (in red) after running getAnchorParams:

print(params$anchors)

## [1] "MMStack_1-Pos_001_003" "MMStack_1-Pos_005_000" "MMStack_1-Pos_007_009"
## [4] "MMStack_1-Pos_008_009" "MMStack_1-Pos_009_001" "MMStack_1-Pos_002_003"

print(params$anchors_plot)

Random sampling should work for most cases. For more regular FOVs, users may instead opt for grid-based selection of FOVs: evenly spaced FOVs internal to the outermost ring of FOVs are selected. [Since we are re-obtaining anchor metrics, set freshAnchors to true.]

getAnchorParams( anchorMode = 'grid', returnTroubleShootPlots = T, freshAnchors = T )

And so this returns a different subset of anchors.

print(params$anchors)

##  [1] "MMStack_1-Pos_001_001" "MMStack_1-Pos_001_003" "MMStack_1-Pos_001_005"
##  [4] "MMStack_1-Pos_001_007" "MMStack_1-Pos_003_002" "MMStack_1-Pos_003_004"
##  [7] "MMStack_1-Pos_003_006" "MMStack_1-Pos_003_008" "MMStack_1-Pos_005_001"
## [10] "MMStack_1-Pos_005_003" "MMStack_1-Pos_005_005" "MMStack_1-Pos_005_007"
## [13] "MMStack_1-Pos_007_002" "MMStack_1-Pos_007_004" "MMStack_1-Pos_007_006"
## [16] "MMStack_1-Pos_007_008"

print(params$anchors_plot)

Users should ensure that sufficient anchors are chosen for representative brightness caps and floors to be obtained. Additionally, different brightness caps / floors may be required for different samples, depending on how consistent imaging parameters were across samples. Inconsistent imaging parameters may be expected if samples were e.g. imaged on different days.

Additionally, should users wish to obtain other global parameters, this can be achieved by altering the imageFunctions and summaryFunctions fields in getAnchorParams.

For instance, say we wish to also obtain the mean brightness per image (brightness_mean). We first write a function to perform this (custom_getMeanBrightness), which will then be looped over every anchor FOV. We include this function among the named list users_imageFunctions, which we then input to imageFunctions. This returns a vector of mean brightness values for every anchor image, which are then summarised by the corresponding function in summaryFunctions (as default, we recommend conservativeMean – explained later – or some summary function robust to outliers).

custom_getMeanBrightness <- function( im, ... ){
  meanBrightness <- mean( im )
  return( meanBrightness )
}
users_imageFunctions = list(
  'brightness_min' = imsBrightnessMin_MIP, #MIP for Z-stacks
  'brightness_max' = imsBrightnessMax_MIP,
  'brightness_mean' = custom_getMeanBrightness
)
users_summaryFunctions = list(
  'brightness_min' = conservativeMean,
  'brightness_max' = conservativeMean,
  'brightness_mean' = conservativeMean
)
getAnchorParams( 
  imageFunctions = users_imageFunctions, 
  summaryFunctions = users_summaryFunctions,
  anchorMode = 'grid'
)

On conservativeMean: this is a function that takes the mean within a specific range, delimited by a certain number of standard deviations (default = 0.9). This should return mean values less skewed by outliers. An alternative could be median, or some user-defined function.

set.seed(12345)
test <- rgamma( 1E4, 2, 1 )
hist( test, breaks=100, main='Histogram of values with mean (red) and conservativeMean (blue)', xlab='Gamma distribution' ); 
abline(v=mean(test), col='red'); 
abline(v=conservativeMean(test, nSDs = 0.9), col='blue')

To evaluate the suitability of chosen thresholds, users can refer to the troubleshootPlots environment returned when returnTroubleShootPlots is true.

metric = 'brightness_min'
bit_number = 1
print( troubleshootPlots[[ metric ]][[ bit_number ]] )

For each metric, we also report the observed thresholds as a quantile of each anchor image’s pixel intensity. These can be found in the ANCHOR_QUANTILEQC.csv file. We expect brightness floors to have a low quantile across FOVs (unless there are some empty FOVs included), and brightness ceilings to have high quantiles.

head(read.csv( paste0(params$out_dir, 'ANCHOR_QUANTILEQC.csv')))

##           metric bit MMStack_1.Pos_001_001 MMStack_1.Pos_001_003
## 1 brightness_min   1             0.9187895             0.1231919
## 2 brightness_max   1             0.9999800             0.9998100
## 3 brightness_min   2             0.9205798             0.1248851
## 4 brightness_max   2             0.9995589             0.9855621
## 5 brightness_min   3             0.9230520             0.1136280
## 6 brightness_max   3             1.0000000             1.0000000
##   MMStack_1.Pos_001_005 MMStack_1.Pos_001_007 MMStack_1.Pos_003_002
## 1             0.2534933             0.2073274             0.2458487
## 2             0.9997967             0.9998654             0.9998590
## 3             0.2626584             0.2286615             0.2431207
## 4             0.9769110             0.9885266             0.9930865
## 5             0.2583598             0.1959322             0.2110254
## 6             0.9999945             1.0000000             0.9999952
##   MMStack_1.Pos_003_004 MMStack_1.Pos_003_006 MMStack_1.Pos_003_008
## 1             0.5588561             0.5077519            0.10864865
## 2             0.9999914             0.9999982            0.99961480
## 3             0.5643622             0.5056256            0.13454552
## 4             0.9991272             0.9994388            0.99336617
## 5             0.4886617             0.4099523            0.09494156
## 6             1.0000000             1.0000000            0.99999726
##   MMStack_1.Pos_005_001 MMStack_1.Pos_005_003 MMStack_1.Pos_005_005
## 1            0.08566187             0.4722320             0.6327399
## 2            0.99916499             0.9999821             1.0000000
## 3            0.08949598             0.4169337             0.4555937
## 4            0.87119601             0.9991592             0.9993902
## 5            0.08018513             0.3478341             0.3117073
## 6            0.99997740             1.0000000             1.0000000
##   MMStack_1.Pos_005_007 MMStack_1.Pos_007_002 MMStack_1.Pos_007_004
## 1             0.1366108             0.1784223             0.4619367
## 2             0.9998951             0.9999053             0.9999768
## 3             0.1668553             0.1792000             0.5029208
## 4             0.9972692             0.9932534             0.9988799
## 5             0.1181132             0.1452180             0.4117146
## 6             0.9999995             1.0000000             1.0000000
##   MMStack_1.Pos_007_006 MMStack_1.Pos_007_008
## 1             0.1657847             0.9504158
## 2             0.9998435             0.9999966
## 3             0.1761314             0.9576270
## 4             0.9904303             0.9999356
## 5             0.1251103             0.9552260
## 6             0.9999977             1.0000000

3.3 Maximum intensity projection

For Z stacks, we default to maximum intensity projection (MIP) to collapse a 3D array into a 2D matrix. [If there are no other slices, then the 2D matrix is returned unchanged.]

toy_example <- array( c(
  matrix(0:11 / 12, nrow = 3, ncol = 4, byrow = T),
  matrix(11:0 / 12, nrow = 3, ncol = 4, byrow = T)
), dim = c(3, 4, 2)
)
mip <- maxIntensityProject(toy_example)
par(mfrow=c(1,3))
plot( imager::as.cimg(toy_example[,,1]), main = 'Z slice 1', interpolate = F, rescale = F)
plot( imager::as.cimg(toy_example[,,2]), main = 'Z slice 2', interpolate = F, rescale = F)
plot( imager::as.cimg(mip), main = 'MIP', interpolate = F, rescale = F)

par(mfrow=c(1,1))

Also note that similarly performs MIP (where possible) prior to finding minimum / maximum brightnesses.

masmr::imsBrightnessMin_MIP

## function (im, zDim = 3, ...) 
## {
##     if (length(dim(im)) != 2) {
##         im <- suppressWarnings(maxIntensityProject(im, zDim = zDim))
##     }
##     result <- imsBrightnessMin(im, ...)
##     return(result)
## }
## <bytecode: 0x00000223e915ce88>
## <environment: namespace:masmr>

Users wishing to process Z slices individually are encouraged refer to Troubleshooting (Chapter 8) for details.

3.4 Looping through images

After establishing global parameters (e.g. brightness floors and ceilings), we are now ready to loop through our image data. It is strongly recommended that the user keeps track of the current FOV being processed in params$current_fov as follows:

for( fovName in params$fov_names ){
  
  ## Report current FOV / iteration 
  cat( paste0('\n', fovName, ': ', 
  which(params$fov_names==fovName), ' of ', 
  length(params$fov_names), '...' ))
  
  ## Check if file has been created
  if( params$resumeMode ){
    checkFile <- file.exists( paste0(params$out_dir, 'SPOTCALL_', fovName, '.csv.gz') )
    if(checkFile){
      cat('FOV has been processed...Skipping...')
      next
    }
  }
  
  ## Save the current_fov
  params$current_fov <- fovName
  
  ## Load image
  imList <- readImageList( params$current_fov )
  
  ## If needed, MIP for each channel
  imList <- maxIntensityProject( imList )
  
  ## Image processing functions etc...
}

3.4.1 Default image processing

Designing custom image processing functions is a topic worthy of its own chapter (Chapter 4). For now, we will use the default processing provided in masmr:

getImageMetrics( imList )

##

The processed images are stored in a new imMetrics environment, and can be visualised as follows:

par(mfrow=c(2,3))
for( imMetricName in ls(imMetrics) ){
  imsub <- imMetrics[[imMetricName]][[1]]
  plot(imager::as.cimg(imsub[1000:1500, 1000:1500, 1] ), main = imMetricName)
}
par(mfrow=c(1,1))

3.4.2 Image registration

Images of the same FOV are rarely perfect overlaps of each other due to imprecise microscope movements. Additionally, optical distortions (e.g. chromatic aberration), may also cause images from different channels to become unaligned. It is thus highly recommended to align images before any downstream decoding (i.e. image registration).

It is up to the user to decide what image processing to perform to obtain images ideal for registration. We recommend a combination of low-pass and high-pass features: below, we combine images from COARSE and DECODE in imMetrics (i.e. band pass filtering).

forReg <- list()
for(i in 1:length(imList)){
  forReg[[i]] <- imNormalise( imMetrics$COARSE[[i]] + imMetrics$DECODE[[i]] )
}
names(forReg) <- names(imList) #Always keep the names from imList
norm <- forReg[[1]]
par(mfrow=c(1,2))
plot( imager::as.cimg(norm), main = 'Band pass' )

## Warning in as.cimg.array(norm): Assuming third dimension corresponds to
## time/depth

plot( imager::as.cimg(norm[1000:1500, 1000:1500, 1]), main = 'Band pass (Zoom in)' )

par(mfrow=c(1,1))

Registration is accomplished with our registerImages function. The function works by maximising cross-correlation between two images (in Fourier transformed space). [Strictly, cross-correlation is first passed through a Gaussian filter centred around the origin, to upweight small shifts. Gaussian filter parameters can be tweaked using relevant parameters – see documentation for registerImages. Additionally, when an image stack is detected, maximum intensity projection is performed before registration.]

registerImages(forReg, returnTroubleShootPlots = T)

## Warning in maxIntensityProject(imsub): 2D matrix provided...Skipping MIP...

This function adds the necessary coordinate offsets to the params environment object. Shift vectors (and coordinates in general) are typically handled as complex numbers in masmr; with the real component reflecting the x-coordinate and the imaginary, y.

print( params$shifts )

##      0_cy7      0_cy5 0_alexa594      0_cy3      1_cy7      1_cy5 1_alexa594 
##   0+0i   0+0i   0+0i   0-1i  -5+3i  -5+3i  -5+2i
##      1_cy3      2_cy7      2_cy5 2_alexa594      2_cy3      3_cy7      3_cy5 
##  -5+2i -12+5i -12+5i -11+4i -11+3i -14+2i -14+2i
## 3_alexa594      3_cy3 
## -14+3i -14+2i

These offsets are added to existing coordinates so as to better align images. To evaluate the registration, users can set returnTroubleShootPlots in registerImages to true (which returns a troubleshootPlots object), or use register_troubleshootPlots directly (see documentation).

p <- troubleshootPlots[['REGISTRATION_EVAL']][[1]][['PRE_REGISTER']]
print(
  p + ggtitle('Before registration')
)

The default is to plot registrations with respect to the reference FOV (red). A window centred on the brightest pixel in the reference FOV is selected as default.

p <- troubleshootPlots[['REGISTRATION_EVAL']][[1]][['POST_REGISTER']]
print(
  p + ggtitle('After registration')
)

Additionally, the registerImages function also saves the reference bit image as a REGIM_{ FOV Name }.png file as part of its default behaviour. This can be called on for subsequent stitching.

file.exists( paste0(params$out_dir, 'REGIM_', params$current_fov, '.png') )

## [1] TRUE

3.5 Working with pixel-wise metrics

Having aligned and processed our images, the next steps are to compile our images into a manageable data format, perform decoding, and successively filter out low quality pixels.

3.5.1 Moving from image lists to a dataframe

The first step is to convert our images into a dataframe for easier use. Hereafter, decoding and filtering will be performed on this dataframe. This is accomplished with consolidateImageMetrics. Additionally, as the previous image lists are no longer needed, it is recommended users clean their environment with cleanUp.

If troubleshooting, users may wish to keep the imMetrics environment for the time being. However, do note that this tends to be a big object and may stretch memory requirements. [One solution, shown below, is to just keep a subset of the objects in imMetrics.]

spotcalldf <- consolidateImageMetrics()

print( paste0('Size of entire imMetrics: ', sum(sapply(ls(imMetrics), function(x) as.numeric(object.size(imMetrics[[x]])))) * 1E-9, ' Gb') )

## [1] "Size of entire imMetrics: 5.04670648 Gb"

for( ix in ls(imMetrics) ){
  if( ix %in% c('DECODE') ){ next }
  remove( list=ix, envir=imMetrics )
}
print( paste0('Size of imMetrics after filtering: ', sum(sapply(ls(imMetrics), function(x) as.numeric(object.size(imMetrics[[x]])))) * 1E-9, ' Gb') )

## [1] "Size of imMetrics after filtering: 1.121489776 Gb"

cleanUp( c(
  'spotcalldf', #Necessary: must be specified, otherwise the dataframe will be deleted
  'imMetrics', #Optional: keep if want to make use of images in filterDF() later
  cleanSlate) #Other things to keep
)

The result is a dataframe with a unified coordinate system, in which pixels are rows and image metrics are columns.

knitr::kable( head( spotcalldf ), format = 'html', booktabs = TRUE )

	WX	WY	WZ	IDX	COARSE_01	DECODE_01	MASK_01	NORM_01	ORIG_01	COARSE_02	DECODE_02	MASK_02	NORM_02	ORIG_02	COARSE_03	DECODE_03	MASK_03	NORM_03	ORIG_03	COARSE_04	DECODE_04	MASK_04	ORIG_04	COARSE_05	DECODE_05	MASK_05	ORIG_05	COARSE_06	DECODE_06	MASK_06	NORM_06	ORIG_06	COARSE_07	DECODE_07	MASK_07	NORM_07	ORIG_07	COARSE_08	DECODE_08	MASK_08	ORIG_08	COARSE_09	DECODE_09	MASK_09	ORIG_09	COARSE_10	DECODE_10	MASK_10	ORIG_10	COARSE_11	DECODE_11	MASK_11	NORM_11	ORIG_11	COARSE_12	DECODE_12	MASK_12	ORIG_12	COARSE_13	DECODE_13	MASK_13	ORIG_13	COARSE_14	DECODE_14	MASK_14	ORIG_14	COARSE_15	DECODE_15	MASK_15	ORIG_15	COARSE_16	DECODE_16	MASK_16	ORIG_16
15055	255	6	1	14985	0.1685885	0.0666336	FALSE	0.0318499	0.0431527	0.1959525	0.0182777	FALSE	0.0874075	0.1050684	0.2910819	0.1395470	TRUE	0.0616752	0.1104383	0.1467471	0.0036333	FALSE	0.0532875	0.0949858	0.0023963	FALSE	0.0242018	0.1435644	0.0209040	FALSE	0.0195158	0.0593125	0.1903854	0.0108405	FALSE	0.0055703	0.0601619	0.0732994	0.0026951	FALSE	0.0213105	0.1354508	0.0018844	FALSE	0.0410899	0.1167777	0.0052897	FALSE	0.0486866	0.2663939	0.0435117	FALSE	0.0160442	0.1010937	0.1288667	0.0019179	FALSE	0.0423075	0.1124755	0.0024016	FALSE	0.0433106	0.0792933	0.0028786	FALSE	0.0405763	0.1350443	0.0000000	FALSE	0.0568347	0.0570399	0.0021209	FALSE	0.0223294
15056	256	6	1	14986	0.1753839	0.0860337	FALSE	0.0431405	0.0468669	0.2066871	0.0489090	FALSE	0.0966803	0.1079719	0.2955125	0.1514781	TRUE	0.0599694	0.1090731	0.1438671	0.0036322	FALSE	0.0511141	0.0923445	0.0023963	FALSE	0.0241780	0.1465902	0.0264190	FALSE	0.0257662	0.0613144	0.1964823	0.0234909	FALSE	0.0134243	0.0657939	0.0741845	0.0026951	FALSE	0.0200000	0.1455799	0.0018838	FALSE	0.0534705	0.1163260	0.0052897	FALSE	0.0529889	0.2757924	0.0654732	FALSE	0.0232150	0.1101432	0.1315909	0.0017231	FALSE	0.0464064	0.1207218	0.0024016	FALSE	0.0447938	0.0790790	0.0028778	FALSE	0.0397129	0.1441830	0.0000000	FALSE	0.0799452	0.0581769	0.0021209	FALSE	0.0198046
15057	257	6	1	14987	0.1767175	0.0890733	FALSE	0.0404089	0.0459683	0.2154453	0.0750645	FALSE	0.1129332	0.1130610	0.2975720	0.1588251	TRUE	0.0608004	0.1097382	0.1366940	0.0036309	FALSE	0.0454851	0.0879305	0.0023963	FALSE	0.0175016	0.1466302	0.0225324	FALSE	0.0132059	0.0572916	0.1976275	0.0270122	FALSE	0.0100954	0.0634068	0.0740865	0.0026951	FALSE	0.0204992	0.1562726	0.0018829	FALSE	0.0593196	0.1147730	0.0052897	FALSE	0.0534790	0.2846638	0.0895756	FALSE	0.0281234	0.1163376	0.1298415	0.0014963	FALSE	0.0440120	0.1246618	0.0024016	FALSE	0.0457579	0.0755643	0.0028770	FALSE	0.0333999	0.1513037	0.0000767	FALSE	0.0723273	0.0566595	0.0021209	FALSE	0.0192252
15058	258	6	1	14988	0.1725610	0.0745237	FALSE	0.0351278	0.0442310	0.2215627	0.0934877	TRUE	0.1196076	0.1151508	0.2923042	0.1414761	TRUE	0.0602755	0.1093181	0.1266107	0.0036295	FALSE	0.0360410	0.0870373	0.0023963	FALSE	0.0220558	0.1502949	0.0330563	FALSE	0.0283854	0.0621532	0.1963598	0.0262408	FALSE	0.0145165	0.0665772	0.0755687	0.0026951	FALSE	0.0216849	0.1567512	0.0018819	FALSE	0.0609281	0.1091272	0.0052897	FALSE	0.0472707	0.2918106	0.1114874	FALSE	0.0311911	0.1202091	0.1256519	0.0012436	FALSE	0.0408871	0.1283710	0.0024016	FALSE	0.0601454	0.0725193	0.0028761	FALSE	0.0309718	0.1566531	0.0003004	FALSE	0.0854233	0.0560639	0.0021209	FALSE	0.0190389
15059	259	6	1	14989	0.1633875	0.0446587	FALSE	0.0214699	0.0397380	0.2251548	0.1048259	TRUE	0.1336188	0.1195380	0.2782481	0.0944740	FALSE	0.0467602	0.0985018	0.1213048	0.0036276	FALSE	0.0373857	0.0840205	0.0023963	FALSE	0.0242257	0.1538012	0.0442957	FALSE	0.0371955	0.0649749	0.1897434	0.0167363	FALSE	0.0093672	0.0628846	0.0774059	0.0026951	FALSE	0.0231669	0.1526377	0.0018807	FALSE	0.0517645	0.1032195	0.0052897	FALSE	0.0440939	0.2937106	0.1167685	TRUE	0.0253241	0.1128049	0.1253053	0.0009766	FALSE	0.0451889	0.1312150	0.0024016	FALSE	0.0582171	0.0758693	0.0028751	FALSE	0.0411428	0.1587198	0.0004918	FALSE	0.0903877	0.0572023	0.0021209	FALSE	0.0208394
15060	260	6	1	14990	0.1521710	0.0107363	FALSE	0.0000000	0.0325093	0.2237427	0.0974893	TRUE	0.1340264	0.1196656	0.2608532	0.0406566	FALSE	0.0319765	0.0866704	0.1220588	0.0036248	FALSE	0.0383238	0.0772029	0.0023963	FALSE	0.0183838	0.1538350	0.0435356	FALSE	0.0308856	0.0629540	0.1776488	0.0024123	FALSE	0.0000000	0.0498676	0.0800322	0.0026951	FALSE	0.0247582	0.1538761	0.0018794	FALSE	0.0594170	0.0982393	0.0052897	FALSE	0.0443117	0.2927845	0.1152722	TRUE	0.0368280	0.1273229	0.1239065	0.0007128	FALSE	0.0407654	0.1332971	0.0024016	FALSE	0.0533966	0.0779187	0.0028741	FALSE	0.0433821	0.1544284	0.0006483	FALSE	0.0747240	0.0584646	0.0021209	FALSE	0.0199081

3.5.2 Decoding

Decoding involves calculating distances between image metric vectors and expected vectors from the codebook. As default, our decode function takes columns with DECODE_XX format and calculates

COS: cosine distances,
EUC: Euclidean distances,
L2E: L2-normalised Euclidean distances

between image and codebook vectors. Both the nearest match and next nearest match (prefixed with B) are returned.

spotcalldf <- decode(spotcalldf)

The relationship between the different distance metrics is illustrated below.

cowplot::plot_grid( plotlist = list(
  ggplot( spotcalldf ) +
    geom_hex(aes(x=COS, y=L2E), bins=100) +
    scale_fill_viridis_c(option = 'turbo', trans='log10') +
    theme_minimal(base_size=14) +
    theme(legend.position = 'top'),
  ggplot( spotcalldf ) +
    geom_hex(aes(x=COS, y=EUC), bins=100) +
    scale_fill_viridis_c(option = 'turbo', trans='log10') +
    theme_minimal(base_size=14) +
    theme(legend.position = 'top'),
  ggplot( spotcalldf ) +
    geom_hex(aes(x=EUC, y=L2E), bins=100) +
    scale_fill_viridis_c(option = 'turbo', trans='log10') +
    theme_minimal(base_size=14) +
    theme(legend.position = 'top')
), nrow=1 )

At this stage, pixels are assigned a preliminary gene label under the g column. As default, this is dependent on the nearest codebook match in cosine distance (column COSLAB in the dataframe).

knitr::kable( head(spotcalldf[,grepl('^g$|LAB$', colnames(spotcalldf))]), format = 'html', booktabs = TRUE )

	g	COSLAB	EUCLAB	L2ELAB	BCOSLAB	BEUCLAB	BL2ELAB
15055	BLANK-NEW-05	104	104	104	100	100	100
15056	BLANK-NEW-05	104	104	104	55	55	55
15057	MKI67	55	55	55	11	11	11
15058	MKI67	55	55	55	11	11	11
15059	MKI67	55	55	55	51	51	51
15060	MKI67	55	55	55	51	51	51

3.5.3 Annotating blanks

Users are recommended to filter out pixels with large distance metrics: i.e. those we cannot confidently decode. These thresholds can be determined a priori, or determined empirically by finding the threshold that separates blanks from non-blank calls.

There are several ways to define “blanks” that are provided in masmr:

gBlank: Whether the name of the gene assigned to a given pixel has the word “blank” in it.
bPossible: Whether the next closest gene label for selected distance metrics are possible.
gbPossible: Whether the next closest gene label for the default distance metric (COS) are possible.
consistentLabel: Whether the gene labels agree across distance metrics.

As default, gBlank, gbPossible, and consistentLabel tests are run. The number of tests failed is reported by scoreBlanks.

nBlankTestsFailed <- scoreBlanks( spotcalldf )
table(nBlankTestsFailed)

## nBlankTestsFailed
##       0       1       2       3 
## 2039428  616429  255396       2

spotcalldf$BLANK <- nBlankTestsFailed >0

Say we decide that “blanks” are those that fail even one test. We can then examine the range of distance metrics between blanks and non-blanks. In general, distance metrics should be larger for blanks.

ggplot( reshape2::melt(spotcalldf, measure.vars=c('COS', 'EUC', 'L2E'))  ) +
  geom_violin(aes(x=BLANK, y=value, colour=BLANK) ) +
  geom_boxplot(aes(x=BLANK, y=value, colour=BLANK), fill='transparent', width=0.25 ) +
  facet_wrap(~variable, scales='free') +
  ylab('') +
  theme_minimal(base_size=14) +
  scale_colour_manual(values=c('TRUE' = 'red', 'FALSE' = 'black')) +
  theme(legend.position = 'none')

A reasonable distance threshold might be the average of medians. This can be easily obtained with the findThreshold or thresh_Quantile function (see documentation for details).

filtout <- rep(0, nrow(spotcalldf))
for( distanceMetric in c('COS', 'EUC', 'L2E') ){
  thresh <- thresh_Quantile( spotcalldf[,distanceMetric], label = spotcalldf$BLANK, quantileFalse = 0.5, quantileTrue = 0.5 )
  cat(paste0( '\n', distanceMetric, ' threshold: ', thresh ))
  filtout <- filtout + spotcalldf[,distanceMetric] > thresh
}

## 
## COS threshold: 0.222254471692498
## EUC threshold: 3.30408867238758
## L2E threshold: 0.326703039218853

table(filtout > 0)

## 
##   FALSE    TRUE 
##  824288 2086967

3.5.4 Filtering and evaluating your choice of filter

How do we decide what is a good filter to apply to our data?

One idea is to use a priori knowledge of gene distributions in space. For instance, in our example dataset of an organoid, we expect dividing cells to be concentrated near the lumen. Accordingly, filtering should enrich lumenal MKI67 (a marker for cell division).

cowplot::plot_grid( plotlist = list(
  ggplot( spotcalldf[filtout>0 & spotcalldf$g=='MKI67',] ) +
    geom_hex(aes(x=WX, y=WY), bins=100) +
    scale_fill_viridis_c(option='turbo', trans='log10') +
    scale_y_reverse() +
    theme_void( base_size = 14 ) +
    coord_fixed() + ggtitle('Drop'),
  
  ggplot( spotcalldf[filtout==0  & spotcalldf$g=='MKI67',] ) +
    geom_hex(aes(x=WX, y=WY), bins=100) +
    scale_fill_viridis_c(option='turbo', trans='log10') +
    scale_y_reverse() +
    theme_void( base_size = 14 ) +
    coord_fixed() + ggtitle('Keep')
), nrow=1)

Another option is to consider how filtering changes quality control metrics: e.g. a good filter should generally increase correlation between count profiles and previously obtained FPKM data, should decrease the percentage of blanks relative to true genes, etc. We provide a filterDF function that reports how a given filter changes these QC metrics (should relevant files be specified in establishParams).

params$verbose <- TRUE ## To see output of filterDF, verbose is changed to True
spotcalldf <- filterDF(spotcalldf, filtout>0, returnTroubleShootPlots = T, plotWindowRadius = 5)

## 
## Applying filter: filtout > 0...

## N pixels: 2911255 ( 100% ) --> 824288 ( 28.31% )

## N blanks: 871827 ( 29.95% ) --> 207062 ( 25.12% )

## LogFPKM Corr: 0.58239 --> 0.59519

## LogN Probes Corr: 0.52337 --> 0.51121

params$verbose <- FALSE ## Turn it back off if want to avoid messages entirely (warnings and errors will still be given)

Perhaps the safest option – in that one that does not require a priori assumptions – is to perform some decoding by eye on a small FOV region, and ensure that one’s spotcalling pipeline returns results that agree with these manually obtained results. To attempt this, we provide a returnTroubleShootPlots parameter in filterDF. When set to TRUE (default FALSE), plots are returned showing pixel intensity images for each bit, overlaid with gene labels. [Do note that these plots are only available if users have kept their imMetrics environment, or have specified a list of images (imList).]

Plots show spots either BEFORE or AFTER filtering.

print( troubleshootPlots[['BEFORE']][[1]] )

print( troubleshootPlots[['AFTER']][[1]] )

Under the hood, filterDF makes use of the spotcall_troubleshootPlots function to return plots, and parameters for spotcall_troubleshootPlots can be input into filterDF to modify the plots returned. Alternatively, users may simply use spotcall_troubleshootPlots directly (see documentation).

## An example
plotList <- spotcall_troubleshootPlots( 
  spotcalldf, 
  plotWindowRadius = 5, 
  decodeMetric = 'DECODE', #Instead of DECODE, users may pick other objects in imMetrics
  chosenCoordinate = c(
    939 + 340i, #Specifying a custom coordinate
    sum(as.numeric(spotcalldf[1,c('WX', 'WY')]) * c(1, 1i)) #Specifying the first pixel in spotcalldf
    )
  )
for( i in 1:length(plotList) ){
  p <- plotList[[i]]
  print( p )
}

Users are encouraged to optimise their image processing pipelines on one FOV, before applying the same pipelines to all other FOVs.

3.5.5 Binarising continuous values

Note that the current image vector contains continuous values. Users may wish to binarise values, such that superthreshold values are unambiguously assigned as ON, and subthreshold as OFF. To that end, we provide the bitsFromDecode function that searches for the optimal threshold that maximises accuracy of ON/OFF calls per bit. There are several ways to assay accuracy:

precision: True positives / (True positives + False positives)
recall: True positives / (True positives + False negatives)
f1 (default): 2 * Precision * Recall / (Precision + Recall)
fb: (1 + $\beta^2$) * Precision * Recall / ($\beta^2$ * Precision + Recall)

After thresholding, we can also calculate Hamming Distance from codebook vectors, which is stored under the HD column.

spotcalldf <- bitsFromDecode(spotcalldf)
table(spotcalldf$HD)

## 
##      0      1      2      3      4      5      6      7      8      9 
##  46125 288300 296934 124675  46694  15838   4772    851     94      5

It is recommended to filter out pixels whose Hamming Distance to their assigned gene is >1.

spotcalldf <- filterDF(spotcalldf, spotcalldf$HD > 1)

Binarisation is recommended if there is a clear cutoff in intensities between 0 vs 1 bits in the DECODE images. This assumption should be true in most cases, but may be violated if there are very different numbers of encoding probes for different genes: e.g. if there is a gene A that has 40 probes and a gene B with only 5 probes, ON for gene B will be dimmer than ON for gene A (and OFF for gene A may be brighter than OFF for gene B, depending on how thorough bleaching / washing was between imaging cycles). So a single intensity threshold separating ON vs OFF may not be appropriate here. In this scenario, users may wish to skip binarisation entirely.

3.5.6 Obtaining other QC metrics

Beyond image-codebook vector distances, we also provide other functions that users may find useful for performing further quality control checks. We provide several options for thresholding:

thresh_Otsu: Otsu thresholding. Useful for separating a bimodal distribution.
thresh_Elbow: Identifies the elbow in a distribution. Useful for heavy tailed distributions.
thresh_Quantile: Finds the average between quantiles in group A vs B. As default, the chosen quantiles are the medians of A and B.

All thresholding functions are compatible with the findThreshold function.

set.seed(12345)
groups <- sample(c(F, T), 1E4, T, c(0.5, 0.5))
values <- rep(0, length(groups))
values[groups] <- rnorm(sum(groups), mean = 1, sd = 0.1)
values[!groups] <- rgamma(sum(!groups), shape = 1, scale = 0.5)
hist(values, breaks=1000, main = 'Thresholds: Otsu (red), Elbow (blue), Quantile (green)' ); 
abline(v=findThreshold(values, thresh_Otsu), col='red');
abline(v=findThreshold(values[values > median(values)], thresh_Elbow, rightElbow = T), col='blue');
abline(v=findThreshold(values, thresh_Quantile, labels = groups), col='green')

`scoreDeltaF`

This function calculates $\Delta F / F_0$ metrics across image metrics. As default, all image metrics generated by getImageMetrics (and whose names are saved in params$imageMetrics) will be processed. [NB When $F_0$ is 0, $\Delta F / F_0$ is default set to the highest valid value for that column.]

Below, we show the $\Delta F / F_0$ alone, but users should read the scoreDeltaF documentation for the full list of metrics returned (e.g. $F_1$, $F_0$ etc).

spotcalldf <- scoreDeltaF( spotcalldf )
knitr::kable( head( spotcalldf[,grepl('^DFF0_', colnames(spotcalldf))] ), format = 'html', booktabs = TRUE )

	DFF0_ORIG	DFF0_NORM	DFF0_COARSE	DFF0_DECODE	DFF0_MASK
15058	1.2802097	1.347051e+01	0.9599303	8.064703	11
15248	2.1340655	6.973976e+04	1.6488084	44.437034	11
15481	0.6950561	1.981764e+00	0.3761400	7.115631	8
15489	0.2565917	2.393276e+00	0.4356834	35.583974	11
15557	0.1342583	1.395903e-01	0.2187945	12.710667	11
15562	0.6546181	3.173345e+00	0.3342836	15.044078	11

We expect low confidence spots to on average have low $\Delta F / F_0$, i.e. their ON bits are less distinguishable from their OFF bits.

ggplot(spotcalldf[spotcalldf$F0_DECODE!=0,]) +
  geom_violin(aes(x=BLANK, y=DFF0_DECODE, colour=BLANK) ) +
  geom_boxplot(aes(x=BLANK, y=DFF0_DECODE, colour=BLANK), fill='transparent', width=0.25 ) +
  scale_y_log10() +
  theme_minimal(base_size=20) +
  scale_colour_manual(values=c('TRUE' = 'red', 'FALSE' = 'black')) +
  theme(legend.position = 'none')

`spatialIsAdjacent`

The spatialIsAdjacent function searches in the immediate surroundings of a set of query coordinates. For instance, below, we ask if a given pixel is next to a blank. [Unsurprisingly, blanks will be reported as being next to blanks.]

isNextToBlank <- spatialIsAdjacent( spotcalldf, 'BLANK' )
by(isNextToBlank, spotcalldf$BLANK, table)

## spotcalldf$BLANK: FALSE
## 
##  FALSE   TRUE 
## 240735  16688 
## ------------------------------------------------------------ 
## spotcalldf$BLANK: TRUE
## 
##  TRUE 
## 77002

spotcalldf$NEXTTOBLANK <- isNextToBlank

In general, we are less confident about non-blank spots that are next to blanks, as evidenced by their higher codebook distances below.

ggplot( reshape2::melt( data.frame(spotcalldf)[!spotcalldf$BLANK, ], measure.vars=c('COS', 'EUC', 'L2E'))  ) +
  geom_violin(aes(x=NEXTTOBLANK, y=value, colour=NEXTTOBLANK) ) +
  geom_boxplot(aes(x=NEXTTOBLANK, y=value, colour=NEXTTOBLANK), fill='transparent', width=0.25 ) +
  facet_wrap(~variable, scales='free') +
  ylab('') +
  theme_minimal(base_size=14) +
  scale_colour_manual(values=c('TRUE' = 'red', 'FALSE' = 'black')) +
  theme(legend.position = 'none') +
  ggtitle('Non-blank spots') + xlab('Is next to a blank')

Users may thus wish to “trim” pixel clusters, by removing non-blanks that are next to blanks: this is beneficial for clustering (described later). [Obviously, blanks should not be subject to the same filtering.]

spotcalldf <- filterDF(spotcalldf, spotcalldf$NEXTTOBLANK & !spotcalldf$BLANK)

`spatialHNNDist`

Another useful spatial metric is how far apart any two pixels with the same identity are from each other, here called “homotypic nearest neighbour (HNN) distance”. This can be calculated using spatialHNNDist.

HNNDist <- spatialHNNDist( spotcalldf )
spotcalldf$HNNDIST <- HNNDist

In general, blanks are less likely to be next to each other (possibly because of erroneous single pixel calls in noisy regions), while spots from the same gene tend to cluster - i.e. tend to have low HNN distances (below). Users typically do not have to perform thresholding on HNN distances given the later clustering step; though the distribution of distances can be helpful for determining minimum intracluster distances later on.

hnnDistBins <- factor(cut(spotcalldf$HNNDIST, breaks=c(0,1,2,3,4,5,Inf)))
ggplot( data.frame(spotcalldf, hnnDistBins)  ) +
  scale_fill_viridis_d(name = 'HNN distance range (pixels)', option='turbo') +
  geom_bar(aes(x=BLANK, fill=hnnDistBins), position='fill' ) +
  theme_minimal(base_size=14) +
  theme(legend.position = 'right') + ylab('Proportion')

`scoreClassifier`

Finally, we have the scoreClassifier function. Under the hood, this function builds a classifier (logistic regression as default) to distinguish one class of pixels from another, using pixel-wise metrics in our dataframe. Users can specify covariates to include in the model; otherwise, all columns in the dataframe will be used.

Below is an example of predicting blanks from non-blanks. We exclude bit-specific columns (e.g. DECODE_01), covariates that are likely to highly correlate with others (e.g. F1_DECODE and DFF0_DECODE are likely to highly correlate), coordinate information, and any covariate that is equivalent to BLANK.

forExclusion <- c('Xm', 'Ym', 'fov', 'g', 'WX', 'WY', 'IDX', colnames(spotcalldf)[grepl('_\\d+$|^DFF0_|^F1_|CV_|LAB$', colnames(spotcalldf))])
probabilityBLANK <- scoreClassifier(spotcalldf, 'BLANK', variablesExcluded = forExclusion)
spotcalldf$PBLANK <- probabilityBLANK

Accordingly the probability of being a blank is higher for blanks than non-blanks.

thresh <- thresh_Quantile(spotcalldf$PBLANK, spotcalldf$BLANK, quantileFalse = 0.5, quantileTrue = 0.25)
ggplot( spotcalldf  ) +
  geom_violin(aes(x=BLANK, y=PBLANK, colour=BLANK) ) +
  geom_boxplot(aes(x=BLANK, y=PBLANK, colour=BLANK), fill='transparent', width=0.25 ) +
  ylab('Probability of being a blank') +
  theme_minimal(base_size=14) +
  scale_colour_manual(values=c('TRUE' = 'red', 'FALSE' = 'black')) +
  theme(legend.position = 'none') +
  geom_hline(yintercept=thresh, linetype = 'dashed', colour='grey')

Thresholding and filtering can be performed using the classifier scores:

spotcalldf <- filterDF(spotcalldf, spotcalldf$PBLANK > thresh)

3.5.7 Clustering

Having performed QC to focus on pixels whose identity we are confident in, the next step is to cluster adjacent pixels with the same identity into a single punctum. This is accomplished with the spatialClusterLeiden function. Here, we want clusters at least 3 pixels big, and with a maximum distance between pixels of 2 pixels.

clusterInfo <- spatialClusterLeiden( spotcalldf, minNeighbours = 3, maxInterSpotDistancePixels = 2)
spotcalldf[,colnames(clusterInfo)] <- clusterInfo

Users are recommended to visualise clusters across a small window. Note how specifying centroids only returns a single pixel coordinate per cluster.

## Specify window to plot
cxWindow <- c(1200, 1300, 1200, 1300)

## Specify what genes assigned which colours
## Manually:
# genePalette <- setNames(
#   viridis::turbo(nrow(params$ordered_codebook)),
#   rownames(params$ordered_codebook) )
## Using previously established palette (established during prepareCodebook):
genePalette <- params$genePalette

## Plot of labelled pixels vs centroids   
cowplot::plot_grid( plotlist = list(
  ggplot( 
    spotcalldf[
      spotcalldf$WX > cxWindow[1] 
      & spotcalldf$WX < cxWindow[2] 
      & spotcalldf$WY > cxWindow[3] 
      & spotcalldf$WY < cxWindow[4], ]
    ) + 
    geom_text( aes(
      x=WX, y=WY, label=g, 
      colour=factor(g, levels=rownames(params$ordered_codebook))) 
      ) +
    scale_colour_manual( values=genePalette ) +
    theme_void(base_size=14) +
    theme(legend.position = 'none', 
          plot.title = element_text(hjust=0.5)) +
    scale_x_continuous( limits=c(cxWindow[1], cxWindow[2]) ) +
    scale_y_reverse( limits=c(cxWindow[4], cxWindow[3]) ) +
    ggtitle('All pixels'),
  
    ggplot( 
    spotcalldf[
      spotcalldf$CENTROID
      & spotcalldf$WX > cxWindow[1] 
      & spotcalldf$WX < cxWindow[2] 
      & spotcalldf$WY > cxWindow[3] 
      & spotcalldf$WY < cxWindow[4], ]
    ) + 
    geom_text( aes(
      x=WX, y=WY, label=g, 
      colour=factor(g, levels=rownames(params$ordered_codebook))) 
      ) +
    scale_colour_manual( values=genePalette ) +
    theme_void(base_size=14) +
    theme(legend.position = 'none', 
          plot.title = element_text(hjust=0.5)) +
    scale_x_continuous( limits=c(cxWindow[1], cxWindow[2]) ) +
    scale_y_reverse( limits=c(cxWindow[4], cxWindow[3]) ) +
    ggtitle('Centroids only')
  
), nrow=1 )

Depending on what the user wants, the non-centroid pixels can be filtered. Additionally, users may wish to examine cluster sizes to find an appropriate range of pixels per cluster.

maxClusterSize = 50
ggplot( spotcalldf[spotcalldf$CENTROID,] ) +
  geom_histogram( aes(x=CLUSTER_SIZE, fill=BLANK), binwidth = 1) +
  facet_wrap(~BLANK, ncol=1, scales='free_y', strip.position = 'right') +
  theme_minimal(base_size=14) +
  xlab('Cluster size') + ylab('N centroids') +
  geom_vline(xintercept = maxClusterSize, col='grey', linetype='dashed') +
  scale_fill_manual( values=c('FALSE' = 'black', 'TRUE' = 'red') )

As before, filtering is accomplished with filterDF:

spotcalldf <- filterDF(spotcalldf, !(spotcalldf$CENTROID & spotcalldf$CLUSTER_SIZE < maxClusterSize) )

3.6 Saving spot calls

If we are satisified with our quality control, the final step is to use the saveDF function to save the dataframe. The default output is a SPOTCALL_{ FOV Name }.csv.gz file; users that have set params$resumeMode to TRUE may wish to first check if this file exists before proceeding with image processing.

saveDF( spotcalldf )
file.exists( paste0(params$out_dir, 'SPOTCALL_', params$current_fov, '.csv.gz') )

## [1] TRUE

Individual files are thereby generated for every FOV.