Chapter 3 Spotcalling
MERFISH data comprises fluorescent images across different channels and imaging cycles. The desired end result are spot coordinates and spot identities.
To that end, we need to perform several subtasks: (1) establish parameters and parse necessary metadata, (2) load and process relevant images, (3) consolidate information across images and decode intensity vectors using the MERFISH codebook, (3) filter out low quality spots, and finally (4) save the output. Subtasks 2-4 are repeatedly performed as we loop through FOVs.
library(masmr)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
3.1 Pre-processing
Before running your pipeline proper, there are a series of pre-processing steps that each script needs to perform. These will be explained in detail here, but the same principles apply to cell segmentation and consolidation scripts.
3.1.1 Establishing parameters
All scripts should begin by establishing parameters: e.g. location of files to read, location to output files, file formats etc.
This is accomplished in masmr
using the establishParams()
function, which automatically creates an environment object called params
in the global environment that will be repeatedly referenced by downstream functions.
= 'C:/Users/kwaje/Downloads/JIN_SIM/20190718_WK_T2_a_B2_R8_12_d50/'
data_dir establishParams(
imageDirs = paste0(data_dir, 'S10/'),
imageFormat = '.ome.tif',
codebookFileName = list.files(paste0(data_dir, 'codebook/'), pattern='codebook', full.names = TRUE),
codebookColumnNames = paste0(rep(c(0:3), 4), '_', rep(c('cy3', 'alexa594', 'cy5', 'cy7'), each=4) ),
outDir = 'C:/Users/kwaje/Downloads/JYOUT/',
dapiDir = paste0(data_dir, 'DAPI_1/'),
pythonLocation = 'C:/Users/kwaje/AppData/Local/miniforge3/envs/scipyenv/',
fpkmFileName = paste0(data_dir, "FPKM_data/Jin_FPKMData_B.tsv"),
nProbesFileName = paste0(data_dir, "ProbesPerGene.csv"),
verbose = FALSE #It is generally a good idea to keep verbose as True, but will turn it off for this showcase
)
## Assuming .ome.tif is extension for meta files...
As the bare minimum, users need to establish:
imageDirs
: where the MERFISH image files are located.imageFormat
: what image file formats to read. Typically.ome.tif
or.dax
, but other formats compatible with Bioformats are acceptable too.codebookFileName
: location of MERFISH codebook.codebookColumnNames
: names of every bit in the MERFISH codebook. Each “bit name” should typically comprise the image index (e.g. image 1, 2, 3 etc) and channel (e.g. Cy3, Alexa-594 etc): so a bit titled0_cy3
tells us that the image to be loaded is the Cy3 channel of the first image in the set (here, starting from 0). This is important because the order images are loaded in may not match up with the order of bits in the codebook.outDir
: where output files should be written to.
Other parameters such as dapiDir
and pythonLocation
will be discussed subsequently for Cell Segmentation.
There also exist some parameters such as fpkmFileName
(file containing FPKM data) and nProbesFileName
(file containing number of MERFISH encoding probes per gene), which are files to be read for quality control purposes. However, these parameters are designed to be optional, and your code should run fine even if they are not specified.
The information stored in the params
object can be read for further details:
ls( envir = params )
## [1] "codebook_colnames" "codebook_file"
## [3] "dapi_dir" "dapi_format"
## [5] "dir_names" "fov_names"
## [7] "fpkm_file" "im_dir"
## [9] "im_format" "inferred_codebookColnames"
## [11] "meta_format" "n_probes_file"
## [13] "nbits" "out_dir"
## [15] "parent_out_dir" "python_location"
## [17] "raw_codebook" "resumeMode"
## [19] "seed" "verbose"
Note the existence of the params$resumeMode
object. This houses a boolean that determines if functions are skipped if there is evidence that they have been run before. As default, to save time, this is set to TRUE
.
3.1.2 Getting parameters from image metadata
The next step is to obtain image metadata that accompanies most microscopy images. This is typically found in the same file as your image data (e.g. .ome.tif
) but can sometimes be in a separate file (e.g. .xml
that accompany .dax
images). masmr
will attempt to guess which meta data format is appropriate, but users can change values in params
as needed.
We can check what values have been entered into params
:
$meta_format params
## [1] ".ome.tif"
If unsatisfied, we can update as follows:
$meta_format <- '.dax'
paramsprint(params$meta_format)
## [1] ".dax"
$meta_format <- '.ome.tif'
paramsprint(params$meta_format)
## [1] ".ome.tif"
Having specified our meta data formats, we can then load additional meta data information with the readImageMetaData
function.
readImageMetaData()
## Warning in readImageMetaData(): Assuming this data has been acquired by Triton
## Warning in readImageMetaData(): Updating params$fov_names
This adds additional information to params
, such as resolution information:
print( params$resolutions )
## $per_pixel_microns
## [1] "0.0706" "0.0706"
##
## $xydimensions_pixels
## [1] "2960" "2960"
##
## $xydimensions_microns
## [1] "208.976" "208.976"
##
## $channel_order
## [1] "cy7" "cy5" "alexa594" "cy3"
##
## $fov_names
## [1] "MMStack_1-Pos_000_000" "MMStack_1-Pos_001_000" "MMStack_1-Pos_002_000"
## [4] "MMStack_1-Pos_003_000" "MMStack_1-Pos_004_000" "MMStack_1-Pos_005_000"
## [7] "MMStack_1-Pos_006_000" "MMStack_1-Pos_007_000" "MMStack_1-Pos_008_000"
## [10] "MMStack_1-Pos_009_000" "MMStack_1-Pos_009_001" "MMStack_1-Pos_008_001"
## [13] "MMStack_1-Pos_007_001" "MMStack_1-Pos_006_001" "MMStack_1-Pos_005_001"
## [16] "MMStack_1-Pos_004_001" "MMStack_1-Pos_003_001" "MMStack_1-Pos_002_001"
## [19] "MMStack_1-Pos_001_001" "MMStack_1-Pos_000_001" "MMStack_1-Pos_000_002"
## [22] "MMStack_1-Pos_001_002" "MMStack_1-Pos_002_002" "MMStack_1-Pos_003_002"
## [25] "MMStack_1-Pos_004_002" "MMStack_1-Pos_005_002" "MMStack_1-Pos_006_002"
## [28] "MMStack_1-Pos_007_002" "MMStack_1-Pos_008_002" "MMStack_1-Pos_009_002"
## [31] "MMStack_1-Pos_009_003" "MMStack_1-Pos_008_003" "MMStack_1-Pos_007_003"
## [34] "MMStack_1-Pos_006_003" "MMStack_1-Pos_005_003" "MMStack_1-Pos_004_003"
## [37] "MMStack_1-Pos_003_003" "MMStack_1-Pos_002_003" "MMStack_1-Pos_001_003"
## [40] "MMStack_1-Pos_000_003" "MMStack_1-Pos_000_004" "MMStack_1-Pos_001_004"
## [43] "MMStack_1-Pos_002_004" "MMStack_1-Pos_003_004" "MMStack_1-Pos_004_004"
## [46] "MMStack_1-Pos_005_004" "MMStack_1-Pos_006_004" "MMStack_1-Pos_007_004"
## [49] "MMStack_1-Pos_008_004" "MMStack_1-Pos_009_004" "MMStack_1-Pos_009_005"
## [52] "MMStack_1-Pos_008_005" "MMStack_1-Pos_007_005" "MMStack_1-Pos_006_005"
## [55] "MMStack_1-Pos_005_005" "MMStack_1-Pos_004_005" "MMStack_1-Pos_003_005"
## [58] "MMStack_1-Pos_002_005" "MMStack_1-Pos_001_005" "MMStack_1-Pos_000_005"
## [61] "MMStack_1-Pos_000_006" "MMStack_1-Pos_001_006" "MMStack_1-Pos_002_006"
## [64] "MMStack_1-Pos_003_006" "MMStack_1-Pos_004_006" "MMStack_1-Pos_005_006"
## [67] "MMStack_1-Pos_006_006" "MMStack_1-Pos_007_006" "MMStack_1-Pos_008_006"
## [70] "MMStack_1-Pos_009_006" "MMStack_1-Pos_009_007" "MMStack_1-Pos_008_007"
## [73] "MMStack_1-Pos_007_007" "MMStack_1-Pos_006_007" "MMStack_1-Pos_005_007"
## [76] "MMStack_1-Pos_004_007" "MMStack_1-Pos_003_007" "MMStack_1-Pos_002_007"
## [79] "MMStack_1-Pos_001_007" "MMStack_1-Pos_000_007" "MMStack_1-Pos_000_008"
## [82] "MMStack_1-Pos_001_008" "MMStack_1-Pos_002_008" "MMStack_1-Pos_003_008"
## [85] "MMStack_1-Pos_004_008" "MMStack_1-Pos_005_008" "MMStack_1-Pos_006_008"
## [88] "MMStack_1-Pos_007_008" "MMStack_1-Pos_008_008" "MMStack_1-Pos_009_008"
## [91] "MMStack_1-Pos_009_009" "MMStack_1-Pos_008_009" "MMStack_1-Pos_007_009"
## [94] "MMStack_1-Pos_006_009" "MMStack_1-Pos_005_009" "MMStack_1-Pos_004_009"
## [97] "MMStack_1-Pos_003_009" "MMStack_1-Pos_002_009" "MMStack_1-Pos_001_009"
## [100] "MMStack_1-Pos_000_009"
##
## $zslices
## [1] "1"
It also creates a dataframe that stores file information for every bit. This will be referenced heavily by downstream functions.
::kable( head( params$global_coords ), format = 'html', booktabs = TRUE ) knitr
x_microns | y_microns | z_microns | image_file | fov | bit_name_full | bit_name_alias | bit_name |
---|---|---|---|---|---|---|---|
-2645.753 | 3421.847 | 5673.78 | C:/Users/kwaje/Downloads/JIN_SIM/20190718_WK_T2_a_B2_R8_12_d50/S10/hyb00_1/hyb00_1_MMStack_1-Pos_000_000.ome.tif | MMStack_1-Pos_000_000 | hyb00_1_cy7 | 0_cy7 | 0_cy7 |
-2645.753 | 3421.847 | 5673.78 | C:/Users/kwaje/Downloads/JIN_SIM/20190718_WK_T2_a_B2_R8_12_d50/S10/hyb00_1/hyb00_1_MMStack_1-Pos_000_000.ome.tif | MMStack_1-Pos_000_000 | hyb00_1_cy5 | 0_cy5 | 0_cy5 |
-2645.753 | 3421.847 | 5673.78 | C:/Users/kwaje/Downloads/JIN_SIM/20190718_WK_T2_a_B2_R8_12_d50/S10/hyb00_1/hyb00_1_MMStack_1-Pos_000_000.ome.tif | MMStack_1-Pos_000_000 | hyb00_1_alexa594 | 0_alexa594 | 0_alexa594 |
-2645.753 | 3421.847 | 5673.78 | C:/Users/kwaje/Downloads/JIN_SIM/20190718_WK_T2_a_B2_R8_12_d50/S10/hyb00_1/hyb00_1_MMStack_1-Pos_000_000.ome.tif | MMStack_1-Pos_000_000 | hyb00_1_cy3 | 0_cy3 | 0_cy3 |
-2645.753 | 3421.847 | 5673.78 | C:/Users/kwaje/Downloads/JIN_SIM/20190718_WK_T2_a_B2_R8_12_d50/S10/hyb01_1/hyb01_1_MMStack_1-Pos_000_000.ome.tif | MMStack_1-Pos_000_000 | hyb01_1_cy7 | 1_cy7 | 1_cy7 |
-2645.753 | 3421.847 | 5673.78 | C:/Users/kwaje/Downloads/JIN_SIM/20190718_WK_T2_a_B2_R8_12_d50/S10/hyb01_1/hyb01_1_MMStack_1-Pos_000_000.ome.tif | MMStack_1-Pos_000_000 | hyb01_1_cy5 | 1_cy5 | 1_cy5 |
Note that as default, the above information is saved to a sub-directory in your chosen output folder.
print( params$parent_out_dir )
## [1] "C:/Users/kwaje/Downloads/JYOUT/"
print( params$out_dir )
## [1] "C:/Users/kwaje/Downloads/JYOUT//IM/"
3.1.3 Preparing the codebook
A codebook is a table of indicating the sequence of ON (1) and OFF (0) bits for each gene (example below). masmr
expects this to be a table (e.g. .csv
or .txt
) that contains N rows for N genes/blanks, and M+1 columns for M bits (e.g. below is a 16-bit setup) and 1 gene column.
::kable( head(data.table::fread( params$codebook_file, data.table = F )), format = 'html', booktabs = TRUE ) knitr
V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Blank-1 |
0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | Blank-2 |
0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | Blank-3 |
0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | Blank-4 |
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | Blank-5 |
0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ASCL1 |
Unique to spotcalling, users must “prepare” their codebooks. This typically involves ensuring that the order of bits matches the order of image loading that masmr
performs. Additionally, masmr
also provides the option to add more “blanks” beyond those already specified by the user.
prepareCodebook( exhaustiveBlanks = T )
The new ordered codebook is stored in the output directory, and users can refer to it as follows:
::kable( head( params$ordered_codebook ), format = 'html', booktabs = TRUE ) knitr
0_cy7 | 0_cy5 | 0_alexa594 | 0_cy3 | 1_cy7 | 1_cy5 | 1_alexa594 | 1_cy3 | 2_cy7 | 2_cy5 | 2_alexa594 | 2_cy3 | 3_cy7 | 3_cy5 | 3_alexa594 | 3_cy3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Blank-1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
Blank-2 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
Blank-3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
Blank-4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Blank-5 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
ASCL1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
3.1.4 House-keeping
Having performed all the necessary pre-processing, it may be useful for users to take a snapshot of the current environment. Unnecessary data can then be cleared from memory in subsequent steps to return the global environment to that snapshot. This is accomplished using our establishCleanSlate
and cleanUp
functions.
An example use-case is provided below:
establishCleanSlate()
## [1] "data_dir" "full_res" "n_channels" "params"
print(cleanSlate)
## [1] "data_dir" "full_res" "n_channels" "params" "cleanSlate"
= '12345'
x exists('x')
## [1] TRUE
cleanUp()
exists('x')
## [1] FALSE
3.2 Preparing to loop through images
Having done the necessary pre-processing, users may now begin processing each FOV. This is done by looping through params$fov_names
. It is also recommended that the current FOV be saved in params$current_fov
.
head(params$fov_names)
## [1] "MMStack_1-Pos_000_000" "MMStack_1-Pos_001_000" "MMStack_1-Pos_002_000"
## [4] "MMStack_1-Pos_003_000" "MMStack_1-Pos_004_000" "MMStack_1-Pos_005_000"
$current_fov <- params$fov_names[38]
paramsprint(params$current_fov)
## [1] "MMStack_1-Pos_002_003"
Users may also wish to check if they are ready to loop with our checkReadyLoop
function, but this is entirely optional.
checkReadyLoop()
## Warning in checkReadyLoop(): params$brightness_min not found:
## If desired, ensure getAnchorParams() has been run
## [1] FALSE
3.2.1 Image reading
To read images, we rely on the RBioformats
package, which is able to read most image file formats. One exception is .dax
files – however, these are binary files and can be read by base R. To account for these different types of images, we can use the readImage
function in masmr
.
Below, I provide an example of a .dax
file:
<- 'C:/Users/kwaje/Downloads/LIU_CONNECT_CORTICALORGANOID_240902/DATA/20240902/cont/Alexa594_0_10.dax'
file <- readImage(
im
file,imageDimensions = c(1,2048,2048,1),
nchannels = 1, #The image only has 1 channel
nzs = 1
)
## Warning in readImage(file, imageDimensions = c(1, 2048, 2048, 1), nchannels =
## 1, : Unable to read with RBioFormats
print(length(im))
## [1] 1
<- im[[1]]
imsub plot(imager::as.cimg(imsub), main = paste0('Image dimensions: ', paste(dim(imsub), collapse = ' x ') ))
From above, note that the default is assumed to be: Number of Channels, Number of rows, Number of Columns, Number of Z slices. In the above case, since and are both 1, users could technically specify image dimensions using width an height alone. However, it is generally good practice to keep to a consistent format.
For .ome.tif
files, the same function can be used. However, we need to specify an additional series
parameter: otherwise, all FOVs will be loaded. For more details, we recommend reading the documentation for RBioformats::read.image
. Below, I provide an example of reading the second channel for the current FOV (specified above).
<- unique( params$global_coords[params$global_coords$fov==params$fov_names, 'image_file'] )[1]
file <- which( params$fov_names == params$current_fov )
series <- readImage(
im
file, nchannels = 4, # The image has 4 channels total
nzs = 1,
series = series )
print( length(im) ) #4 channels
## [1] 4
<- im[[2]]
imsub plot(imager::as.cimg( imsub ), main = paste0('Image dimensions: ', paste(dim(imsub), collapse = ' x ') ) )
## Warning in as.cimg.array(imsub): Assuming third dimension corresponds to
## time/depth
Note that readImage
as written above reads all four channels into memory, and we subset the second channel using im[,,2]
. An alternative is to just load the second channel directly using the subset
parameter in RBioformats::read.image
, or the channelIndex
parameter in readImage
. This is useful for memory management.
<- readImage(
im
file,nchannels = 4,
nzs = 1,
series = series,
channelIndex = 2 )
print(length(im)) #Only 1 channel loaded
## [1] 1
<- im[[1]]
imsub plot(imager::as.cimg(imsub), main = paste0('Image dimensions: ', paste(dim(imsub), collapse = ' x ') ) )
## Warning in as.cimg.array(imsub): Assuming third dimension corresponds to
## time/depth
For MERFISH, we typically have to handle multiple images at once. We have thus created an additional image reading function called readImageList
that loads all images for a given FOV, by cross-referencing params$global_coords
to find all relevant files. And because readImageList
does this cross-referencing, parameters such as nchannels
, nzs
etc do not have to be specified.
<- readImageList( params$current_fov )
imList names(imList)
## [1] "0_cy7" "0_cy5" "0_alexa594" "0_cy3" "1_cy7"
## [6] "1_cy5" "1_alexa594" "1_cy3" "2_cy7" "2_cy5"
## [11] "2_alexa594" "2_cy3" "3_cy7" "3_cy5" "3_alexa594"
## [16] "3_cy3"
3.2.2 Establishing global image parameters
As a first step, it is recommended that we determine the dynamic range of intensity values: i.e. the brightness value for below which (i.e. minimum brightness / floor) and above which (i.e. maximum brightness / ceiling) we can safely ignore gradation. A sensible minimum brightness would be intensities for background pixels where no tissue is imaged. For maximum brightness, we want a cap where all super-threshold intensities can effectively be considered ‘ON’. Ideally, these limits should be relatively consistent across FOVs.
We have created a function getAnchorParams
that loads multiple FOVs and attempts to get global image parameters. As default, brightness_max
and brightness_min
are returned. For details, see the documentation for getAnchorParams
.
getAnchorParams( nSamples = 6 )
These values are appended to params
:
$brightness_min params
## [1] 0.02665484 0.07892208 0.02916114 0.04901184 0.03023053 0.06347031
## [7] 0.02397402 0.04414353 0.02355671 0.07126881 0.02574727 0.04234602
## [13] 0.02069601 0.06184499 0.02228692 0.04452783
Anchor FOVs are those used to derive global image parameters. These can be user-specified, or automatically determined with getAnchorParams
. If automatic, nSamples
number of FOVs are randomly selected. These can be visualised (in red) after running getAnchorParams
:
print(params$anchors)
## [1] "MMStack_1-Pos_001_003" "MMStack_1-Pos_005_000" "MMStack_1-Pos_007_009"
## [4] "MMStack_1-Pos_008_009" "MMStack_1-Pos_009_001" "MMStack_1-Pos_002_003"
print(params$anchors_plot)
Random sampling should work for most cases. For more regular FOVs, users may instead opt for grid-based selection of FOVs: evenly spaced FOVs internal to the outermost ring of FOVs are selected. [Since we are re-obtaining anchor metrics, set freshAnchors
to true.]
getAnchorParams( anchorMode = 'grid', returnTroubleShootPlots = T, freshAnchors = T )
And so this returns a different subset of anchors.
print(params$anchors)
## [1] "MMStack_1-Pos_001_001" "MMStack_1-Pos_001_003" "MMStack_1-Pos_001_005"
## [4] "MMStack_1-Pos_001_007" "MMStack_1-Pos_003_002" "MMStack_1-Pos_003_004"
## [7] "MMStack_1-Pos_003_006" "MMStack_1-Pos_003_008" "MMStack_1-Pos_005_001"
## [10] "MMStack_1-Pos_005_003" "MMStack_1-Pos_005_005" "MMStack_1-Pos_005_007"
## [13] "MMStack_1-Pos_007_002" "MMStack_1-Pos_007_004" "MMStack_1-Pos_007_006"
## [16] "MMStack_1-Pos_007_008"
print(params$anchors_plot)
Users should ensure that sufficient anchors are chosen for representative brightness caps and floors to be obtained. Additionally, different brightness caps / floors may be required for different samples, depending on how consistent imaging parameters were across samples. Inconsistent imaging parameters may be expected if samples were e.g. imaged on different days.
Additionally, should users wish to obtain other global parameters, this can be achieved by altering the imageFunctions
and summaryFunctions
fields in getAnchorParams
.
For instance, say we wish to also obtain the mean brightness per image (brightness_mean
). We first write a function to perform this (custom_getMeanBrightness
), which will then be looped over every anchor FOV. We include this function among the named list users_imageFunctions
, which we then input to imageFunctions
. This returns a vector of mean brightness values for every anchor image, which are then summarised by the corresponding function in summaryFunctions
(as default, we recommend conservativeMean
– explained later – or some summary function robust to outliers).
<- function( im, ... ){
custom_getMeanBrightness <- mean( im )
meanBrightness return( meanBrightness )
}= list(
users_imageFunctions 'brightness_min' = imsBrightnessMin_MIP, #MIP for Z-stacks
'brightness_max' = imsBrightnessMax_MIP,
'brightness_mean' = custom_getMeanBrightness
)= list(
users_summaryFunctions 'brightness_min' = conservativeMean,
'brightness_max' = conservativeMean,
'brightness_mean' = conservativeMean
)getAnchorParams(
imageFunctions = users_imageFunctions,
summaryFunctions = users_summaryFunctions,
anchorMode = 'grid'
)
On conservativeMean
: this is a function that takes the mean within a specific range, delimited by a certain number of standard deviations (default = 0.9). This should return mean values less skewed by outliers. An alternative could be median
, or some user-defined function.
set.seed(12345)
<- rgamma( 1E4, 2, 1 )
test hist( test, breaks=100, main='Histogram of values with mean (red) and conservativeMean (blue)', xlab='Gamma distribution' );
abline(v=mean(test), col='red');
abline(v=conservativeMean(test, nSDs = 0.9), col='blue')
To evaluate the suitability of chosen thresholds, users can refer to the troubleshootPlots
environment returned when returnTroubleShootPlots
is true.
= 'brightness_min'
metric = 1
bit_number print( troubleshootPlots[[ metric ]][[ bit_number ]] )
For each metric, we also report the observed thresholds as a quantile of each anchor image’s pixel intensity. These can be found in the ANCHOR_QUANTILEQC.csv
file. We expect brightness floors to have a low quantile across FOVs (unless there are some empty FOVs included), and brightness ceilings to have high quantiles.
head(read.csv( paste0(params$out_dir, 'ANCHOR_QUANTILEQC.csv')))
## metric bit MMStack_1.Pos_001_001 MMStack_1.Pos_001_003
## 1 brightness_min 1 0.9187895 0.1231919
## 2 brightness_max 1 0.9999800 0.9998100
## 3 brightness_min 2 0.9205798 0.1248851
## 4 brightness_max 2 0.9995589 0.9855621
## 5 brightness_min 3 0.9230520 0.1136280
## 6 brightness_max 3 1.0000000 1.0000000
## MMStack_1.Pos_001_005 MMStack_1.Pos_001_007 MMStack_1.Pos_003_002
## 1 0.2534933 0.2073274 0.2458487
## 2 0.9997967 0.9998654 0.9998590
## 3 0.2626584 0.2286615 0.2431207
## 4 0.9769110 0.9885266 0.9930865
## 5 0.2583598 0.1959322 0.2110254
## 6 0.9999945 1.0000000 0.9999952
## MMStack_1.Pos_003_004 MMStack_1.Pos_003_006 MMStack_1.Pos_003_008
## 1 0.5588561 0.5077519 0.10864865
## 2 0.9999914 0.9999982 0.99961480
## 3 0.5643622 0.5056256 0.13454552
## 4 0.9991272 0.9994388 0.99336617
## 5 0.4886617 0.4099523 0.09494156
## 6 1.0000000 1.0000000 0.99999726
## MMStack_1.Pos_005_001 MMStack_1.Pos_005_003 MMStack_1.Pos_005_005
## 1 0.08566187 0.4722320 0.6327399
## 2 0.99916499 0.9999821 1.0000000
## 3 0.08949598 0.4169337 0.4555937
## 4 0.87119601 0.9991592 0.9993902
## 5 0.08018513 0.3478341 0.3117073
## 6 0.99997740 1.0000000 1.0000000
## MMStack_1.Pos_005_007 MMStack_1.Pos_007_002 MMStack_1.Pos_007_004
## 1 0.1366108 0.1784223 0.4619367
## 2 0.9998951 0.9999053 0.9999768
## 3 0.1668553 0.1792000 0.5029208
## 4 0.9972692 0.9932534 0.9988799
## 5 0.1181132 0.1452180 0.4117146
## 6 0.9999995 1.0000000 1.0000000
## MMStack_1.Pos_007_006 MMStack_1.Pos_007_008
## 1 0.1657847 0.9504158
## 2 0.9998435 0.9999966
## 3 0.1761314 0.9576270
## 4 0.9904303 0.9999356
## 5 0.1251103 0.9552260
## 6 0.9999977 1.0000000
3.3 Maximum intensity projection
For Z stacks, we default to maximum intensity projection (MIP) to collapse a 3D array into a 2D matrix. [If there are no other slices, then the 2D matrix is returned unchanged.]
<- array( c(
toy_example matrix(0:11 / 12, nrow = 3, ncol = 4, byrow = T),
matrix(11:0 / 12, nrow = 3, ncol = 4, byrow = T)
dim = c(3, 4, 2)
),
)<- maxIntensityProject(toy_example)
mip par(mfrow=c(1,3))
plot( imager::as.cimg(toy_example[,,1]), main = 'Z slice 1', interpolate = F, rescale = F)
plot( imager::as.cimg(toy_example[,,2]), main = 'Z slice 2', interpolate = F, rescale = F)
plot( imager::as.cimg(mip), main = 'MIP', interpolate = F, rescale = F)
par(mfrow=c(1,1))
Also note that similarly performs MIP (where possible) prior to finding minimum / maximum brightnesses.
::imsBrightnessMin_MIP masmr
## function (im, zDim = 3, ...)
## {
## if (length(dim(im)) != 2) {
## im <- suppressWarnings(maxIntensityProject(im, zDim = zDim))
## }
## result <- imsBrightnessMin(im, ...)
## return(result)
## }
## <bytecode: 0x00000223e915ce88>
## <environment: namespace:masmr>
Users wishing to process Z slices individually are encouraged refer to Troubleshooting (Chapter 8) for details.
3.4 Looping through images
After establishing global parameters (e.g. brightness floors and ceilings), we are now ready to loop through our image data. It is strongly recommended that the user keeps track of the current FOV being processed in params$current_fov
as follows:
for( fovName in params$fov_names ){
## Report current FOV / iteration
cat( paste0('\n', fovName, ': ',
which(params$fov_names==fovName), ' of ',
length(params$fov_names), '...' ))
## Check if file has been created
if( params$resumeMode ){
<- file.exists( paste0(params$out_dir, 'SPOTCALL_', fovName, '.csv.gz') )
checkFile if(checkFile){
cat('FOV has been processed...Skipping...')
next
}
}
## Save the current_fov
$current_fov <- fovName
params
## Load image
<- readImageList( params$current_fov )
imList
## If needed, MIP for each channel
<- maxIntensityProject( imList )
imList
## Image processing functions etc...
}
3.4.1 Default image processing
Designing custom image processing functions is a topic worthy of its own chapter (Chapter 4). For now, we will use the default processing provided in masmr
:
getImageMetrics( imList )
##
The processed images are stored in a new imMetrics
environment, and can be visualised as follows:
par(mfrow=c(2,3))
for( imMetricName in ls(imMetrics) ){
<- imMetrics[[imMetricName]][[1]]
imsub plot(imager::as.cimg(imsub[1000:1500, 1000:1500, 1] ), main = imMetricName)
}par(mfrow=c(1,1))
3.4.2 Image registration
Images of the same FOV are rarely perfect overlaps of each other due to imprecise microscope movements. Additionally, optical distortions (e.g. chromatic aberration), may also cause images from different channels to become unaligned. It is thus highly recommended to align images before any downstream decoding (i.e. image registration).
It is up to the user to decide what image processing to perform to obtain images ideal for registration. We recommend a combination of low-pass and high-pass features: below, we combine images from COARSE
and DECODE
in imMetrics
(i.e. band pass filtering).
<- list()
forReg for(i in 1:length(imList)){
<- imNormalise( imMetrics$COARSE[[i]] + imMetrics$DECODE[[i]] )
forReg[[i]]
}names(forReg) <- names(imList) #Always keep the names from imList
<- forReg[[1]]
norm par(mfrow=c(1,2))
plot( imager::as.cimg(norm), main = 'Band pass' )
## Warning in as.cimg.array(norm): Assuming third dimension corresponds to
## time/depth
plot( imager::as.cimg(norm[1000:1500, 1000:1500, 1]), main = 'Band pass (Zoom in)' )
par(mfrow=c(1,1))
Registration is accomplished with our registerImages
function. The function works by maximising cross-correlation between two images (in Fourier transformed space). [Strictly, cross-correlation is first passed through a Gaussian filter centred around the origin, to upweight small shifts. Gaussian filter parameters can be tweaked using relevant parameters – see documentation for registerImages
. Additionally, when an image stack is detected, maximum intensity projection is performed before registration.]
registerImages(forReg, returnTroubleShootPlots = T)
## Warning in maxIntensityProject(imsub): 2D matrix provided...Skipping MIP...
This function adds the necessary coordinate offsets to the params
environment object. Shift vectors (and coordinates in general) are typically handled as complex numbers in masmr
; with the real component reflecting the x-coordinate and the imaginary, y.
print( params$shifts )
## 0_cy7 0_cy5 0_alexa594 0_cy3 1_cy7 1_cy5 1_alexa594
## 0+0i 0+0i 0+0i 0-1i -5+3i -5+3i -5+2i
## 1_cy3 2_cy7 2_cy5 2_alexa594 2_cy3 3_cy7 3_cy5
## -5+2i -12+5i -12+5i -11+4i -11+3i -14+2i -14+2i
## 3_alexa594 3_cy3
## -14+3i -14+2i
These offsets are added to existing coordinates so as to better align images. To evaluate the registration, users can set returnTroubleShootPlots
in registerImages
to true (which returns a troubleshootPlots
object), or use register_troubleshootPlots
directly (see documentation).
<- troubleshootPlots[['REGISTRATION_EVAL']][[1]][['PRE_REGISTER']]
p print(
+ ggtitle('Before registration')
p )
The default is to plot registrations with respect to the reference FOV (red). A window centred on the brightest pixel in the reference FOV is selected as default.
<- troubleshootPlots[['REGISTRATION_EVAL']][[1]][['POST_REGISTER']]
p print(
+ ggtitle('After registration')
p )
Additionally, the registerImages
function also saves the reference bit image as a REGIM_{ FOV Name }.png
file as part of its default behaviour. This can be called on for subsequent stitching.
file.exists( paste0(params$out_dir, 'REGIM_', params$current_fov, '.png') )
## [1] TRUE
3.5 Working with pixel-wise metrics
Having aligned and processed our images, the next steps are to compile our images into a manageable data format, perform decoding, and successively filter out low quality pixels.
3.5.1 Moving from image lists to a dataframe
The first step is to convert our images into a dataframe for easier use. Hereafter, decoding and filtering will be performed on this dataframe. This is accomplished with consolidateImageMetrics
. Additionally, as the previous image lists are no longer needed, it is recommended users clean their environment with cleanUp
.
If troubleshooting, users may wish to keep the imMetrics
environment for the time being. However, do note that this tends to be a big object and may stretch memory requirements. [One solution, shown below, is to just keep a subset of the objects in imMetrics
.]
<- consolidateImageMetrics()
spotcalldf
print( paste0('Size of entire imMetrics: ', sum(sapply(ls(imMetrics), function(x) as.numeric(object.size(imMetrics[[x]])))) * 1E-9, ' Gb') )
## [1] "Size of entire imMetrics: 5.04670648 Gb"
for( ix in ls(imMetrics) ){
if( ix %in% c('DECODE') ){ next }
remove( list=ix, envir=imMetrics )
}print( paste0('Size of imMetrics after filtering: ', sum(sapply(ls(imMetrics), function(x) as.numeric(object.size(imMetrics[[x]])))) * 1E-9, ' Gb') )
## [1] "Size of imMetrics after filtering: 1.121489776 Gb"
cleanUp( c(
'spotcalldf', #Necessary: must be specified, otherwise the dataframe will be deleted
'imMetrics', #Optional: keep if want to make use of images in filterDF() later
#Other things to keep
cleanSlate) )
The result is a dataframe with a unified coordinate system, in which pixels are rows and image metrics are columns.
::kable( head( spotcalldf ), format = 'html', booktabs = TRUE ) knitr
WX | WY | WZ | IDX | COARSE_01 | DECODE_01 | MASK_01 | NORM_01 | ORIG_01 | COARSE_02 | DECODE_02 | MASK_02 | NORM_02 | ORIG_02 | COARSE_03 | DECODE_03 | MASK_03 | NORM_03 | ORIG_03 | COARSE_04 | DECODE_04 | MASK_04 | NORM_04 | ORIG_04 | COARSE_05 | DECODE_05 | MASK_05 | NORM_05 | ORIG_05 | COARSE_06 | DECODE_06 | MASK_06 | NORM_06 | ORIG_06 | COARSE_07 | DECODE_07 | MASK_07 | NORM_07 | ORIG_07 | COARSE_08 | DECODE_08 | MASK_08 | NORM_08 | ORIG_08 | COARSE_09 | DECODE_09 | MASK_09 | NORM_09 | ORIG_09 | COARSE_10 | DECODE_10 | MASK_10 | NORM_10 | ORIG_10 | COARSE_11 | DECODE_11 | MASK_11 | NORM_11 | ORIG_11 | COARSE_12 | DECODE_12 | MASK_12 | NORM_12 | ORIG_12 | COARSE_13 | DECODE_13 | MASK_13 | NORM_13 | ORIG_13 | COARSE_14 | DECODE_14 | MASK_14 | NORM_14 | ORIG_14 | COARSE_15 | DECODE_15 | MASK_15 | NORM_15 | ORIG_15 | COARSE_16 | DECODE_16 | MASK_16 | NORM_16 | ORIG_16 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15055 | 255 | 6 | 1 | 14985 | 0.1685885 | 0.0666336 | FALSE | 0.0318499 | 0.0431527 | 0.1959525 | 0.0182777 | FALSE | 0.0874075 | 0.1050684 | 0.2910819 | 0.1395470 | TRUE | 0.0616752 | 0.1104383 | 0.1467471 | 0.0036333 | FALSE | 0 | 0.0532875 | 0.0949858 | 0.0023963 | FALSE | 0 | 0.0242018 | 0.1435644 | 0.0209040 | FALSE | 0.0195158 | 0.0593125 | 0.1903854 | 0.0108405 | FALSE | 0.0055703 | 0.0601619 | 0.0732994 | 0.0026951 | FALSE | 0 | 0.0213105 | 0.1354508 | 0.0018844 | FALSE | 0 | 0.0410899 | 0.1167777 | 0.0052897 | FALSE | 0 | 0.0486866 | 0.2663939 | 0.0435117 | FALSE | 0.0160442 | 0.1010937 | 0.1288667 | 0.0019179 | FALSE | 0 | 0.0423075 | 0.1124755 | 0.0024016 | FALSE | 0 | 0.0433106 | 0.0792933 | 0.0028786 | FALSE | 0 | 0.0405763 | 0.1350443 | 0.0000000 | FALSE | 0 | 0.0568347 | 0.0570399 | 0.0021209 | FALSE | 0 | 0.0223294 |
15056 | 256 | 6 | 1 | 14986 | 0.1753839 | 0.0860337 | FALSE | 0.0431405 | 0.0468669 | 0.2066871 | 0.0489090 | FALSE | 0.0966803 | 0.1079719 | 0.2955125 | 0.1514781 | TRUE | 0.0599694 | 0.1090731 | 0.1438671 | 0.0036322 | FALSE | 0 | 0.0511141 | 0.0923445 | 0.0023963 | FALSE | 0 | 0.0241780 | 0.1465902 | 0.0264190 | FALSE | 0.0257662 | 0.0613144 | 0.1964823 | 0.0234909 | FALSE | 0.0134243 | 0.0657939 | 0.0741845 | 0.0026951 | FALSE | 0 | 0.0200000 | 0.1455799 | 0.0018838 | FALSE | 0 | 0.0534705 | 0.1163260 | 0.0052897 | FALSE | 0 | 0.0529889 | 0.2757924 | 0.0654732 | FALSE | 0.0232150 | 0.1101432 | 0.1315909 | 0.0017231 | FALSE | 0 | 0.0464064 | 0.1207218 | 0.0024016 | FALSE | 0 | 0.0447938 | 0.0790790 | 0.0028778 | FALSE | 0 | 0.0397129 | 0.1441830 | 0.0000000 | FALSE | 0 | 0.0799452 | 0.0581769 | 0.0021209 | FALSE | 0 | 0.0198046 |
15057 | 257 | 6 | 1 | 14987 | 0.1767175 | 0.0890733 | FALSE | 0.0404089 | 0.0459683 | 0.2154453 | 0.0750645 | FALSE | 0.1129332 | 0.1130610 | 0.2975720 | 0.1588251 | TRUE | 0.0608004 | 0.1097382 | 0.1366940 | 0.0036309 | FALSE | 0 | 0.0454851 | 0.0879305 | 0.0023963 | FALSE | 0 | 0.0175016 | 0.1466302 | 0.0225324 | FALSE | 0.0132059 | 0.0572916 | 0.1976275 | 0.0270122 | FALSE | 0.0100954 | 0.0634068 | 0.0740865 | 0.0026951 | FALSE | 0 | 0.0204992 | 0.1562726 | 0.0018829 | FALSE | 0 | 0.0593196 | 0.1147730 | 0.0052897 | FALSE | 0 | 0.0534790 | 0.2846638 | 0.0895756 | FALSE | 0.0281234 | 0.1163376 | 0.1298415 | 0.0014963 | FALSE | 0 | 0.0440120 | 0.1246618 | 0.0024016 | FALSE | 0 | 0.0457579 | 0.0755643 | 0.0028770 | FALSE | 0 | 0.0333999 | 0.1513037 | 0.0000767 | FALSE | 0 | 0.0723273 | 0.0566595 | 0.0021209 | FALSE | 0 | 0.0192252 |
15058 | 258 | 6 | 1 | 14988 | 0.1725610 | 0.0745237 | FALSE | 0.0351278 | 0.0442310 | 0.2215627 | 0.0934877 | TRUE | 0.1196076 | 0.1151508 | 0.2923042 | 0.1414761 | TRUE | 0.0602755 | 0.1093181 | 0.1266107 | 0.0036295 | FALSE | 0 | 0.0360410 | 0.0870373 | 0.0023963 | FALSE | 0 | 0.0220558 | 0.1502949 | 0.0330563 | FALSE | 0.0283854 | 0.0621532 | 0.1963598 | 0.0262408 | FALSE | 0.0145165 | 0.0665772 | 0.0755687 | 0.0026951 | FALSE | 0 | 0.0216849 | 0.1567512 | 0.0018819 | FALSE | 0 | 0.0609281 | 0.1091272 | 0.0052897 | FALSE | 0 | 0.0472707 | 0.2918106 | 0.1114874 | FALSE | 0.0311911 | 0.1202091 | 0.1256519 | 0.0012436 | FALSE | 0 | 0.0408871 | 0.1283710 | 0.0024016 | FALSE | 0 | 0.0601454 | 0.0725193 | 0.0028761 | FALSE | 0 | 0.0309718 | 0.1566531 | 0.0003004 | FALSE | 0 | 0.0854233 | 0.0560639 | 0.0021209 | FALSE | 0 | 0.0190389 |
15059 | 259 | 6 | 1 | 14989 | 0.1633875 | 0.0446587 | FALSE | 0.0214699 | 0.0397380 | 0.2251548 | 0.1048259 | TRUE | 0.1336188 | 0.1195380 | 0.2782481 | 0.0944740 | FALSE | 0.0467602 | 0.0985018 | 0.1213048 | 0.0036276 | FALSE | 0 | 0.0373857 | 0.0840205 | 0.0023963 | FALSE | 0 | 0.0242257 | 0.1538012 | 0.0442957 | FALSE | 0.0371955 | 0.0649749 | 0.1897434 | 0.0167363 | FALSE | 0.0093672 | 0.0628846 | 0.0774059 | 0.0026951 | FALSE | 0 | 0.0231669 | 0.1526377 | 0.0018807 | FALSE | 0 | 0.0517645 | 0.1032195 | 0.0052897 | FALSE | 0 | 0.0440939 | 0.2937106 | 0.1167685 | TRUE | 0.0253241 | 0.1128049 | 0.1253053 | 0.0009766 | FALSE | 0 | 0.0451889 | 0.1312150 | 0.0024016 | FALSE | 0 | 0.0582171 | 0.0758693 | 0.0028751 | FALSE | 0 | 0.0411428 | 0.1587198 | 0.0004918 | FALSE | 0 | 0.0903877 | 0.0572023 | 0.0021209 | FALSE | 0 | 0.0208394 |
15060 | 260 | 6 | 1 | 14990 | 0.1521710 | 0.0107363 | FALSE | 0.0000000 | 0.0325093 | 0.2237427 | 0.0974893 | TRUE | 0.1340264 | 0.1196656 | 0.2608532 | 0.0406566 | FALSE | 0.0319765 | 0.0866704 | 0.1220588 | 0.0036248 | FALSE | 0 | 0.0383238 | 0.0772029 | 0.0023963 | FALSE | 0 | 0.0183838 | 0.1538350 | 0.0435356 | FALSE | 0.0308856 | 0.0629540 | 0.1776488 | 0.0024123 | FALSE | 0.0000000 | 0.0498676 | 0.0800322 | 0.0026951 | FALSE | 0 | 0.0247582 | 0.1538761 | 0.0018794 | FALSE | 0 | 0.0594170 | 0.0982393 | 0.0052897 | FALSE | 0 | 0.0443117 | 0.2927845 | 0.1152722 | TRUE | 0.0368280 | 0.1273229 | 0.1239065 | 0.0007128 | FALSE | 0 | 0.0407654 | 0.1332971 | 0.0024016 | FALSE | 0 | 0.0533966 | 0.0779187 | 0.0028741 | FALSE | 0 | 0.0433821 | 0.1544284 | 0.0006483 | FALSE | 0 | 0.0747240 | 0.0584646 | 0.0021209 | FALSE | 0 | 0.0199081 |
3.5.2 Decoding
Decoding involves calculating distances between image metric vectors and expected vectors from the codebook. As default, our decode
function takes columns with DECODE_XX
format and calculates
COS
: cosine distances,EUC
: Euclidean distances,L2E
: L2-normalised Euclidean distances
between image and codebook vectors. Both the nearest match and next nearest match (prefixed with B
) are returned.
<- decode(spotcalldf) spotcalldf
The relationship between the different distance metrics is illustrated below.
::plot_grid( plotlist = list(
cowplotggplot( spotcalldf ) +
geom_hex(aes(x=COS, y=L2E), bins=100) +
scale_fill_viridis_c(option = 'turbo', trans='log10') +
theme_minimal(base_size=14) +
theme(legend.position = 'top'),
ggplot( spotcalldf ) +
geom_hex(aes(x=COS, y=EUC), bins=100) +
scale_fill_viridis_c(option = 'turbo', trans='log10') +
theme_minimal(base_size=14) +
theme(legend.position = 'top'),
ggplot( spotcalldf ) +
geom_hex(aes(x=EUC, y=L2E), bins=100) +
scale_fill_viridis_c(option = 'turbo', trans='log10') +
theme_minimal(base_size=14) +
theme(legend.position = 'top')
nrow=1 ) ),
At this stage, pixels are assigned a preliminary gene label under the g
column. As default, this is dependent on the nearest codebook match in cosine distance (column COSLAB
in the dataframe).
::kable( head(spotcalldf[,grepl('^g$|LAB$', colnames(spotcalldf))]), format = 'html', booktabs = TRUE ) knitr
g | COSLAB | EUCLAB | L2ELAB | BCOSLAB | BEUCLAB | BL2ELAB | |
---|---|---|---|---|---|---|---|
15055 | BLANK-NEW-05 | 104 | 104 | 104 | 100 | 100 | 100 |
15056 | BLANK-NEW-05 | 104 | 104 | 104 | 55 | 55 | 55 |
15057 | MKI67 | 55 | 55 | 55 | 11 | 11 | 11 |
15058 | MKI67 | 55 | 55 | 55 | 11 | 11 | 11 |
15059 | MKI67 | 55 | 55 | 55 | 51 | 51 | 51 |
15060 | MKI67 | 55 | 55 | 55 | 51 | 51 | 51 |
3.5.3 Annotating blanks
Users are recommended to filter out pixels with large distance metrics: i.e. those we cannot confidently decode. These thresholds can be determined a priori, or determined empirically by finding the threshold that separates blanks from non-blank calls.
There are several ways to define “blanks” that are provided in masmr
:
gBlank
: Whether the name of the gene assigned to a given pixel has the word “blank” in it.bPossible
: Whether the next closest gene label for selected distance metrics are possible.gbPossible
: Whether the next closest gene label for the default distance metric (COS
) are possible.consistentLabel
: Whether the gene labels agree across distance metrics.
As default, gBlank
, gbPossible
, and consistentLabel
tests are run. The number of tests failed is reported by scoreBlanks
.
<- scoreBlanks( spotcalldf )
nBlankTestsFailed table(nBlankTestsFailed)
## nBlankTestsFailed
## 0 1 2 3
## 2039428 616429 255396 2
$BLANK <- nBlankTestsFailed >0 spotcalldf
Say we decide that “blanks” are those that fail even one test. We can then examine the range of distance metrics between blanks and non-blanks. In general, distance metrics should be larger for blanks.
ggplot( reshape2::melt(spotcalldf, measure.vars=c('COS', 'EUC', 'L2E')) ) +
geom_violin(aes(x=BLANK, y=value, colour=BLANK) ) +
geom_boxplot(aes(x=BLANK, y=value, colour=BLANK), fill='transparent', width=0.25 ) +
facet_wrap(~variable, scales='free') +
ylab('') +
theme_minimal(base_size=14) +
scale_colour_manual(values=c('TRUE' = 'red', 'FALSE' = 'black')) +
theme(legend.position = 'none')
A reasonable distance threshold might be the average of medians. This can be easily obtained with the findThreshold
or thresh_Quantile
function (see documentation for details).
<- rep(0, nrow(spotcalldf))
filtout for( distanceMetric in c('COS', 'EUC', 'L2E') ){
<- thresh_Quantile( spotcalldf[,distanceMetric], label = spotcalldf$BLANK, quantileFalse = 0.5, quantileTrue = 0.5 )
thresh cat(paste0( '\n', distanceMetric, ' threshold: ', thresh ))
<- filtout + spotcalldf[,distanceMetric] > thresh
filtout }
##
## COS threshold: 0.222254471692498
## EUC threshold: 3.30408867238758
## L2E threshold: 0.326703039218853
table(filtout > 0)
##
## FALSE TRUE
## 824288 2086967
3.5.4 Filtering and evaluating your choice of filter
How do we decide what is a good filter to apply to our data?
One idea is to use a priori knowledge of gene distributions in space. For instance, in our example dataset of an organoid, we expect dividing cells to be concentrated near the lumen. Accordingly, filtering should enrich lumenal MKI67
(a marker for cell division).
::plot_grid( plotlist = list(
cowplotggplot( spotcalldf[filtout>0 & spotcalldf$g=='MKI67',] ) +
geom_hex(aes(x=WX, y=WY), bins=100) +
scale_fill_viridis_c(option='turbo', trans='log10') +
scale_y_reverse() +
theme_void( base_size = 14 ) +
coord_fixed() + ggtitle('Drop'),
ggplot( spotcalldf[filtout==0 & spotcalldf$g=='MKI67',] ) +
geom_hex(aes(x=WX, y=WY), bins=100) +
scale_fill_viridis_c(option='turbo', trans='log10') +
scale_y_reverse() +
theme_void( base_size = 14 ) +
coord_fixed() + ggtitle('Keep')
nrow=1) ),
Another option is to consider how filtering changes quality control metrics: e.g. a good filter should generally increase correlation between count profiles and previously obtained FPKM data, should decrease the percentage of blanks relative to true genes, etc. We provide a filterDF
function that reports how a given filter changes these QC metrics (should relevant files be specified in establishParams
).
$verbose <- TRUE ## To see output of filterDF, verbose is changed to True
params<- filterDF(spotcalldf, filtout>0, returnTroubleShootPlots = T, plotWindowRadius = 5) spotcalldf
##
## Applying filter: filtout > 0...
## N pixels: 2911255 ( 100% ) --> 824288 ( 28.31% )
## N blanks: 871827 ( 29.95% ) --> 207062 ( 25.12% )
## LogFPKM Corr: 0.58239 --> 0.59519
## LogN Probes Corr: 0.52337 --> 0.51121
$verbose <- FALSE ## Turn it back off if want to avoid messages entirely (warnings and errors will still be given) params
Perhaps the safest option – in that one that does not require a priori assumptions – is to perform some decoding by eye on a small FOV region, and ensure that one’s spotcalling pipeline returns results that agree with these manually obtained results. To attempt this, we provide a returnTroubleShootPlots
parameter in filterDF
. When set to TRUE
(default FALSE
), plots are returned showing pixel intensity images for each bit, overlaid with gene labels. [Do note that these plots are only available if users have kept their imMetrics
environment, or have specified a list of images (imList
).]
Plots show spots either BEFORE
or AFTER
filtering.
print( troubleshootPlots[['BEFORE']][[1]] )
print( troubleshootPlots[['AFTER']][[1]] )
Under the hood, filterDF
makes use of the spotcall_troubleshootPlots
function to return plots, and parameters for spotcall_troubleshootPlots
can be input into filterDF
to modify the plots returned. Alternatively, users may simply use spotcall_troubleshootPlots
directly (see documentation).
## An example
<- spotcall_troubleshootPlots(
plotList
spotcalldf, plotWindowRadius = 5,
decodeMetric = 'DECODE', #Instead of DECODE, users may pick other objects in imMetrics
chosenCoordinate = c(
939 + 340i, #Specifying a custom coordinate
sum(as.numeric(spotcalldf[1,c('WX', 'WY')]) * c(1, 1i)) #Specifying the first pixel in spotcalldf
)
)for( i in 1:length(plotList) ){
<- plotList[[i]]
p print( p )
}
Users are encouraged to optimise their image processing pipelines on one FOV, before applying the same pipelines to all other FOVs.
3.5.5 Binarising continuous values
Note that the current image vector contains continuous values. Users may wish to binarise values, such that superthreshold values are unambiguously assigned as ON, and subthreshold as OFF. To that end, we provide the bitsFromDecode
function that searches for the optimal threshold that maximises accuracy of ON/OFF calls per bit. There are several ways to assay accuracy:
precision
: True positives / (True positives + False positives)recall
: True positives / (True positives + False negatives)f1
(default): 2 * Precision * Recall / (Precision + Recall)fb
: (1 + \(\beta^2\)) * Precision * Recall / (\(\beta^2\) * Precision + Recall)
After thresholding, we can also calculate Hamming Distance from codebook vectors, which is stored under the HD
column.
<- bitsFromDecode(spotcalldf)
spotcalldf table(spotcalldf$HD)
##
## 0 1 2 3 4 5 6 7 8 9
## 46125 288300 296934 124675 46694 15838 4772 851 94 5
It is recommended to filter out pixels whose Hamming Distance to their assigned gene is >1.
<- filterDF(spotcalldf, spotcalldf$HD > 1) spotcalldf
Binarisation is recommended if there is a clear cutoff in intensities between 0 vs 1 bits in the DECODE
images. This assumption should be true in most cases, but may be violated if there are very different numbers of encoding probes for different genes: e.g. if there is a gene A that has 40 probes and a gene B with only 5 probes, ON for gene B will be dimmer than ON for gene A (and OFF for gene A may be brighter than OFF for gene B, depending on how thorough bleaching / washing was between imaging cycles). So a single intensity threshold separating ON vs OFF may not be appropriate here. In this scenario, users may wish to skip binarisation entirely.
3.5.6 Obtaining other QC metrics
Beyond image-codebook vector distances, we also provide other functions that users may find useful for performing further quality control checks. We provide several options for thresholding:
thresh_Otsu
: Otsu thresholding. Useful for separating a bimodal distribution.thresh_Elbow
: Identifies the elbow in a distribution. Useful for heavy tailed distributions.
thresh_Quantile
: Finds the average between quantiles in group A vs B. As default, the chosen quantiles are the medians of A and B.
All thresholding functions are compatible with the findThreshold
function.
set.seed(12345)
<- sample(c(F, T), 1E4, T, c(0.5, 0.5))
groups <- rep(0, length(groups))
values <- rnorm(sum(groups), mean = 1, sd = 0.1)
values[groups] !groups] <- rgamma(sum(!groups), shape = 1, scale = 0.5)
values[hist(values, breaks=1000, main = 'Thresholds: Otsu (red), Elbow (blue), Quantile (green)' );
abline(v=findThreshold(values, thresh_Otsu), col='red');
abline(v=findThreshold(values[values > median(values)], thresh_Elbow, rightElbow = T), col='blue');
abline(v=findThreshold(values, thresh_Quantile, labels = groups), col='green')
scoreDeltaF
This function calculates \(\Delta F / F_0\) metrics across image metrics. As default, all image metrics generated by getImageMetrics
(and whose names are saved in params$imageMetrics
) will be processed. [NB When \(F_0\) is 0, \(\Delta F / F_0\) is default set to the highest valid value for that column.]
Below, we show the \(\Delta F / F_0\) alone, but users should read the scoreDeltaF
documentation for the full list of metrics returned (e.g. \(F_1\), \(F_0\) etc).
<- scoreDeltaF( spotcalldf )
spotcalldf ::kable( head( spotcalldf[,grepl('^DFF0_', colnames(spotcalldf))] ), format = 'html', booktabs = TRUE ) knitr
DFF0_ORIG | DFF0_NORM | DFF0_COARSE | DFF0_DECODE | DFF0_MASK | |
---|---|---|---|---|---|
15058 | 1.2802097 | 1.347051e+01 | 0.9599303 | 8.064703 | 11 |
15248 | 2.1340655 | 6.973976e+04 | 1.6488084 | 44.437034 | 11 |
15481 | 0.6950561 | 1.981764e+00 | 0.3761400 | 7.115631 | 8 |
15489 | 0.2565917 | 2.393276e+00 | 0.4356834 | 35.583974 | 11 |
15557 | 0.1342583 | 1.395903e-01 | 0.2187945 | 12.710667 | 11 |
15562 | 0.6546181 | 3.173345e+00 | 0.3342836 | 15.044078 | 11 |
We expect low confidence spots to on average have low \(\Delta F / F_0\), i.e. their ON bits are less distinguishable from their OFF bits.
ggplot(spotcalldf[spotcalldf$F0_DECODE!=0,]) +
geom_violin(aes(x=BLANK, y=DFF0_DECODE, colour=BLANK) ) +
geom_boxplot(aes(x=BLANK, y=DFF0_DECODE, colour=BLANK), fill='transparent', width=0.25 ) +
scale_y_log10() +
theme_minimal(base_size=20) +
scale_colour_manual(values=c('TRUE' = 'red', 'FALSE' = 'black')) +
theme(legend.position = 'none')
spatialIsAdjacent
The spatialIsAdjacent
function searches in the immediate surroundings of a set of query coordinates. For instance, below, we ask if a given pixel is next to a blank. [Unsurprisingly, blanks will be reported as being next to blanks.]
<- spatialIsAdjacent( spotcalldf, 'BLANK' )
isNextToBlank by(isNextToBlank, spotcalldf$BLANK, table)
## spotcalldf$BLANK: FALSE
##
## FALSE TRUE
## 240735 16688
## ------------------------------------------------------------
## spotcalldf$BLANK: TRUE
##
## TRUE
## 77002
$NEXTTOBLANK <- isNextToBlank spotcalldf
In general, we are less confident about non-blank spots that are next to blanks, as evidenced by their higher codebook distances below.
ggplot( reshape2::melt( data.frame(spotcalldf)[!spotcalldf$BLANK, ], measure.vars=c('COS', 'EUC', 'L2E')) ) +
geom_violin(aes(x=NEXTTOBLANK, y=value, colour=NEXTTOBLANK) ) +
geom_boxplot(aes(x=NEXTTOBLANK, y=value, colour=NEXTTOBLANK), fill='transparent', width=0.25 ) +
facet_wrap(~variable, scales='free') +
ylab('') +
theme_minimal(base_size=14) +
scale_colour_manual(values=c('TRUE' = 'red', 'FALSE' = 'black')) +
theme(legend.position = 'none') +
ggtitle('Non-blank spots') + xlab('Is next to a blank')
Users may thus wish to “trim” pixel clusters, by removing non-blanks that are next to blanks: this is beneficial for clustering (described later). [Obviously, blanks should not be subject to the same filtering.]
<- filterDF(spotcalldf, spotcalldf$NEXTTOBLANK & !spotcalldf$BLANK) spotcalldf
spatialHNNDist
Another useful spatial metric is how far apart any two pixels with the same identity are from each other, here called “homotypic nearest neighbour (HNN) distance”. This can be calculated using spatialHNNDist
.
<- spatialHNNDist( spotcalldf )
HNNDist $HNNDIST <- HNNDist spotcalldf
In general, blanks are less likely to be next to each other (possibly because of erroneous single pixel calls in noisy regions), while spots from the same gene tend to cluster - i.e. tend to have low HNN distances (below). Users typically do not have to perform thresholding on HNN distances given the later clustering step; though the distribution of distances can be helpful for determining minimum intracluster distances later on.
<- factor(cut(spotcalldf$HNNDIST, breaks=c(0,1,2,3,4,5,Inf)))
hnnDistBins ggplot( data.frame(spotcalldf, hnnDistBins) ) +
scale_fill_viridis_d(name = 'HNN distance range (pixels)', option='turbo') +
geom_bar(aes(x=BLANK, fill=hnnDistBins), position='fill' ) +
theme_minimal(base_size=14) +
theme(legend.position = 'right') + ylab('Proportion')
scoreClassifier
Finally, we have the scoreClassifier
function. Under the hood, this function builds a classifier (logistic regression as default) to distinguish one class of pixels from another, using pixel-wise metrics in our dataframe. Users can specify covariates to include in the model; otherwise, all columns in the dataframe will be used.
Below is an example of predicting blanks from non-blanks. We exclude bit-specific columns (e.g. DECODE_01
), covariates that are likely to highly correlate with others (e.g. F1_DECODE
and DFF0_DECODE
are likely to highly correlate), coordinate information, and any covariate that is equivalent to BLANK
.
<- c('Xm', 'Ym', 'fov', 'g', 'WX', 'WY', 'IDX', colnames(spotcalldf)[grepl('_\\d+$|^DFF0_|^F1_|CV_|LAB$', colnames(spotcalldf))])
forExclusion <- scoreClassifier(spotcalldf, 'BLANK', variablesExcluded = forExclusion)
probabilityBLANK $PBLANK <- probabilityBLANK spotcalldf
Accordingly the probability of being a blank is higher for blanks than non-blanks.
<- thresh_Quantile(spotcalldf$PBLANK, spotcalldf$BLANK, quantileFalse = 0.5, quantileTrue = 0.25)
thresh ggplot( spotcalldf ) +
geom_violin(aes(x=BLANK, y=PBLANK, colour=BLANK) ) +
geom_boxplot(aes(x=BLANK, y=PBLANK, colour=BLANK), fill='transparent', width=0.25 ) +
ylab('Probability of being a blank') +
theme_minimal(base_size=14) +
scale_colour_manual(values=c('TRUE' = 'red', 'FALSE' = 'black')) +
theme(legend.position = 'none') +
geom_hline(yintercept=thresh, linetype = 'dashed', colour='grey')
Thresholding and filtering can be performed using the classifier scores:
<- filterDF(spotcalldf, spotcalldf$PBLANK > thresh) spotcalldf
3.5.7 Clustering
Having performed QC to focus on pixels whose identity we are confident in, the next step is to cluster adjacent pixels with the same identity into a single punctum. This is accomplished with the spatialClusterLeiden
function. Here, we want clusters at least 3 pixels big, and with a maximum distance between pixels of 2 pixels.
<- spatialClusterLeiden( spotcalldf, minNeighbours = 3, maxInterSpotDistancePixels = 2)
clusterInfo colnames(clusterInfo)] <- clusterInfo spotcalldf[,
Users are recommended to visualise clusters across a small window. Note how specifying centroids only returns a single pixel coordinate per cluster.
## Specify window to plot
<- c(1200, 1300, 1200, 1300)
cxWindow
## Specify what genes assigned which colours
## Manually:
# genePalette <- setNames(
# viridis::turbo(nrow(params$ordered_codebook)),
# rownames(params$ordered_codebook) )
## Using previously established palette (established during prepareCodebook):
<- params$genePalette
genePalette
## Plot of labelled pixels vs centroids
::plot_grid( plotlist = list(
cowplotggplot(
spotcalldf[$WX > cxWindow[1]
spotcalldf& spotcalldf$WX < cxWindow[2]
& spotcalldf$WY > cxWindow[3]
& spotcalldf$WY < cxWindow[4], ]
+
) geom_text( aes(
x=WX, y=WY, label=g,
colour=factor(g, levels=rownames(params$ordered_codebook)))
+
) scale_colour_manual( values=genePalette ) +
theme_void(base_size=14) +
theme(legend.position = 'none',
plot.title = element_text(hjust=0.5)) +
scale_x_continuous( limits=c(cxWindow[1], cxWindow[2]) ) +
scale_y_reverse( limits=c(cxWindow[4], cxWindow[3]) ) +
ggtitle('All pixels'),
ggplot(
spotcalldf[$CENTROID
spotcalldf& spotcalldf$WX > cxWindow[1]
& spotcalldf$WX < cxWindow[2]
& spotcalldf$WY > cxWindow[3]
& spotcalldf$WY < cxWindow[4], ]
+
) geom_text( aes(
x=WX, y=WY, label=g,
colour=factor(g, levels=rownames(params$ordered_codebook)))
+
) scale_colour_manual( values=genePalette ) +
theme_void(base_size=14) +
theme(legend.position = 'none',
plot.title = element_text(hjust=0.5)) +
scale_x_continuous( limits=c(cxWindow[1], cxWindow[2]) ) +
scale_y_reverse( limits=c(cxWindow[4], cxWindow[3]) ) +
ggtitle('Centroids only')
nrow=1 ) ),
Depending on what the user wants, the non-centroid pixels can be filtered. Additionally, users may wish to examine cluster sizes to find an appropriate range of pixels per cluster.
= 50
maxClusterSize ggplot( spotcalldf[spotcalldf$CENTROID,] ) +
geom_histogram( aes(x=CLUSTER_SIZE, fill=BLANK), binwidth = 1) +
facet_wrap(~BLANK, ncol=1, scales='free_y', strip.position = 'right') +
theme_minimal(base_size=14) +
xlab('Cluster size') + ylab('N centroids') +
geom_vline(xintercept = maxClusterSize, col='grey', linetype='dashed') +
scale_fill_manual( values=c('FALSE' = 'black', 'TRUE' = 'red') )
As before, filtering is accomplished with filterDF
:
<- filterDF(spotcalldf, !(spotcalldf$CENTROID & spotcalldf$CLUSTER_SIZE < maxClusterSize) ) spotcalldf
3.6 Saving spot calls
If we are satisified with our quality control, the final step is to use the saveDF
function to save the dataframe. The default output is a SPOTCALL_{ FOV Name }.csv.gz
file; users that have set params$resumeMode
to TRUE
may wish to first check if this file exists before proceeding with image processing.
saveDF( spotcalldf )
file.exists( paste0(params$out_dir, 'SPOTCALL_', params$current_fov, '.csv.gz') )
## [1] TRUE
Individual files are thereby generated for every FOV.