“Talk is cheap. Show me the code.”
-Linus Torvalds

Loading the data

Let’s begin by loading PreciseDist and the Buettner et al. cell cycle dataset that comes with the package, and then take a quick look at it.

data("data_cell_cycle")
library(tibble)
as_tibble(data_cell_cycle)
#> # A tibble: 182 x 8,990
#>    Cell_cycle   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`
#>  * <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 G1         1.74  2.59      0 0     1.13  2.53  0.912  2.44 1.87  0.218
#>  2 G1         0.735 0.994     0 0     0     0.276 0      2.55 0     0    
#>  3 G1         2.14  1.76      0 2.79  1.22  2.01  0      2.40 0     0.757
#>  4 G1         1.50  1.67      0 0.462 0.619 2.25  0      2.34 0.213 0    
#>  5 G1         2.81  0.739     0 0     0.601 2.38  0      2.32 0     0    
#>  6 G1         2.53  2.97      0 0     1.20  1.99  2.20   2.49 0     1.04 
#>  7 G1         2.71  2.74      0 1.37  0.381 0.492 0.717  1.61 1.01  0    
#>  8 G1         0.232 2.70      0 0.232 0     1.66  0      1.35 2.19  1.01 
#>  9 G1         3.03  2.73      0 0.397 0.844 2.40  0.928  2.57 0     0.844
#> 10 G1         2.87  2.84      0 0     0.752 0.752 2.27   2.57 0     1.10 
#> # ... with 172 more rows, and 8,979 more variables: `11` <dbl>,
#> #   `12` <dbl>, `13` <dbl>, `14` <dbl>, `15` <dbl>, `16` <dbl>,
#> #   `17` <dbl>, `18` <dbl>, `19` <dbl>, `20` <dbl>, `21` <dbl>,
#> #   `22` <dbl>, `23` <dbl>, `24` <dbl>, `25` <dbl>, `26` <dbl>,
#> #   `27` <dbl>, `28` <dbl>, `29` <dbl>, `30` <dbl>, `31` <dbl>,
#> #   `32` <dbl>, `33` <dbl>, `34` <dbl>, `35` <dbl>, `36` <dbl>,
#> #   `37` <dbl>, `38` <dbl>, `39` <dbl>, `40` <dbl>, `41` <dbl>,
#> #   `42` <dbl>, `43` <dbl>, `44` <dbl>, `45` <dbl>, `46` <dbl>,
#> #   `47` <dbl>, `48` <dbl>, `49` <dbl>, `50` <dbl>, `51` <dbl>,
#> #   `52` <dbl>, `53` <dbl>, `54` <dbl>, `55` <dbl>, `56` <dbl>,
#> #   `57` <dbl>, `58` <dbl>, `59` <dbl>, `60` <dbl>, `61` <dbl>,
#> #   `62` <dbl>, `63` <dbl>, `64` <dbl>, `65` <dbl>, `66` <dbl>,
#> #   `67` <dbl>, `68` <dbl>, `69` <dbl>, `70` <dbl>, `71` <dbl>,
#> #   `72` <dbl>, `73` <dbl>, `74` <dbl>, `75` <dbl>, `76` <dbl>,
#> #   `77` <dbl>, `78` <dbl>, `79` <dbl>, `80` <dbl>, `81` <dbl>,
#> #   `82` <dbl>, `83` <dbl>, `84` <dbl>, `85` <dbl>, `86` <dbl>,
#> #   `87` <dbl>, `88` <dbl>, `89` <dbl>, `90` <dbl>, `91` <dbl>,
#> #   `92` <dbl>, `93` <dbl>, `94` <dbl>, `95` <dbl>, `96` <dbl>,
#> #   `97` <dbl>, `98` <dbl>, `99` <dbl>, `100` <dbl>, `101` <dbl>,
#> #   `102` <dbl>, `103` <dbl>, `104` <dbl>, `105` <dbl>, `106` <dbl>,
#> #   `107` <dbl>, `108` <dbl>, `109` <dbl>, `110` <dbl>, …

Now, before we run anything on the cell cycle data, let’s remove the ground truth labels from the dataset, as sketched below. Note: this code uses the %>% operator heavily, so read up on it a little if it is unfamiliar to you.
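
A minimal sketch of that step, assuming data_cell_cycle behaves like the tibble shown above (cell_cycle_data is the object used below; cell_cycle_labels is a name introduced here for the plotting steps):

library(dplyr)

# Set the labels aside for plotting later, and keep only the expression values
cell_cycle_labels <- data_cell_cycle %>% pull(Cell_cycle)
cell_cycle_data <- data_cell_cycle %>% select(-Cell_cycle) %>% as.matrix()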

Setting the baseline

The goal of this analysis is to uncover the true structure that exists in the data. Before we begin, then, we will set a baseline to exceed: the Euclidean and correlation distances of cell_cycle_data.

First, we will calculate each distance (coercing the correlation similarity into a distance with as.dist()):

library(proxy)
euclidean_dist <- proxy::dist(cell_cycle_data, method = "euclidean", diag = TRUE)
correlation_dist <- proxy::simil(cell_cycle_data, method = "correlation", diag = TRUE) %>% as.dist()

Now, we will use each distance as input into umap(), a non-linear dimension-reduction technique similar to t-SNE in some ways but with some favorable properties (more about UMAP can be found in the uwot package documentation and the original UMAP paper):

library(uwot)
euclidean_map <- umap(euclidean_dist) %>% as.data.frame()
correlation_map <- umap(correlation_dist) %>% as.data.frame()

Lastly, we will plot the results with the true labels using plot_ly():
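
The plotting code itself is not reproduced in this rendered output; a hedged sketch of the kind of call meant here (uwot::umap() returns an unnamed two-column matrix, so as.data.frame() yields columns V1 and V2):

library(plotly)

# Attach the ground truth labels and colour each cell by its cell-cycle phase
euclidean_map$Cell_cycle <- cell_cycle_labels
correlation_map$Cell_cycle <- cell_cycle_labels

plot_ly(euclidean_map, x = ~V1, y = ~V2, color = ~Cell_cycle,
        type = "scatter", mode = "markers")
plot_ly(correlation_map, x = ~V1, y = ~V2, color = ~Cell_cycle,
        type = "scatter", mode = "markers")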

While we could tweak the umap() parameters or try another algorithm like Rtsne(), the immediate conclusion is that neither the Euclidean distance nor the correlation distance does a great job of capturing the true structure of the data.

The algorithm to beat

This dataset is a single-cell RNA-seq dataset of CD4+ T helper cells that were activated by IL-4 to create a set of asynchronously dividing cells representing different stages of the cell cycle. The data is also contained in the SIMLR package, and was used in the corresponding paper to show how their semi-supervised method almost perfectly captures the structure in the data. As we will see below, however, this dataset has some fairly obvious structure that many different methods can capture.

We will begin though by running SIMLR() and then plotting the results with plot_ly().

First, we run it as they did in the paper with c = 3.
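
The call is not shown in this rendered output; a sketch, assuming the SIMLR() interface from its Bioconductor documentation (check ?SIMLR for the orientation it expects for the input matrix):

library(SIMLR)

# c = 3 clusters, as in the SIMLR paper; SIMLR is stochastic, so set a seed
# (transpose cell_cycle_data first if your SIMLR version expects genes in rows)
set.seed(11111)
cell_cycle_simlr <- SIMLR(X = cell_cycle_data, c = 3)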

Second, we will extract the clusters from the data:

cell_cycle_simlr_clusters <- cell_cycle_simlr$y$cluster %>%
  as.character() %>%
  recode("1" = "cluster_1", "2" = "cluster_2", "3" = "cluster_3") %>%
  as_tibble() %>%
  select(SIMLR_clusters = value) %>%
  as.matrix()

Now we will extract the t-SNE results:
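
Assuming the SIMLR result stores its 2-D t-SNE embedding in the ydata component (as the SIMLR documentation describes):

cell_cycle_simlr_tsne <- cell_cycle_simlr$ydata %>% as.data.frame()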

And, now let’s plot it with the ground truth labels and the SIMLR() clusters:
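
The interactive plot is omitted here; the call is analogous to the earlier plot_ly() sketch, for example:

cell_cycle_simlr_tsne$Cell_cycle <- cell_cycle_labels
cell_cycle_simlr_tsne$SIMLR_cluster <- as.vector(cell_cycle_simlr_clusters)

plot_ly(cell_cycle_simlr_tsne, x = ~V1, y = ~V2,
        color = ~Cell_cycle, symbol = ~SIMLR_cluster,
        type = "scatter", mode = "markers")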

As we can see, SIMLR does an excellent job when we set c = 3. But what happens when we set c = 2 or c = 4? Note: the code is omitted here for readability; the only change was swapping c = 3 for c = 2 or c = 4 in the SIMLR() call.

First, c = 2:

Now, c = 4:

While the SIMLR t-SNE plot shows that the algorithm maintains the correct structure in the data when c = 2, the structure is already starting to fracture when c = 4. This, of course, is not to denigrate SIMLR - it is an interesting method that is well suited to some datasets, and one we will eventually incorporate as a fusion option in PreciseDist - but if we did not have the ground truth to guide us, it might take us a long time to discover that the data is best represented by c = 3.

Let’s see what PreciseDist gives us.

Experiment 1: Starting with PreciseDist

The first thing we would like to do is to see which distances we can apply to the data. Although any custom distance function can be input into precise_dist(), it is very likely that the package already contains a distance suitable for the data. Because data_cell_cycle is static data (i.e. data taken at a single point in time), we will exclude time series distances from our inquiry. This should not stop you, however, from trying time series or other distances on static data during your own data explorations. In fact, the underlying philosophy of PreciseDist is that we begin our analysis fundamentally ignorant of the true structure of the data. Therefore, we should not preclude the possibility that statically acquired data has a time-like ordering component to it.

Because we will run all 60 distances encompassed by “static_dists”, it is highly recommended that you run them in parallel if your computer has the memory and cores to do so. See the vignette A Parallel Future to learn more about PreciseDist’s parallel architecture; for now, note that the user needs to declare to PreciseDist that they will be running in parallel (parallel = TRUE) and then explicitly declare the back end they want to use:
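
As a sketch (the vignette title suggests a future-based back end, but treat the exact setup below as an assumption and consult A Parallel Future for the canonical version):

library(future)

# Declare a multisession back end; adjust the number of workers to your machine
plan(multisession, workers = 4)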

Now we run precise_dist():
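
The exact call is not reproduced in this rendered output; a hedged sketch using only the arguments named in the surrounding prose (dists, parallel, and file):

cell_cycle_static_dists <- precise_dist(
  cell_cycle_data,
  dists = "static_dists",
  parallel = TRUE,
  file = "/absolute_path/to_somewhere/with_full_name/including_the/file_extension.rds"
)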

At this point, maybe it takes our data an hour (or twelve) to run, so we go home and come back the next day. For some reason, though, our computer shut off sometime during the night. That is OK, because we originally set the file parameter, so we can load our data from file:

library(readr)
cell_cycle_static_dists <- read_rds("/absolute_path/to_somewhere/with_full_name/including_the/file_extension.rds")

Now, let’s take a quick look at the length of the data and its first five entries.
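
For example:

# The object read back from file is a list with one element per distance
length(cell_cycle_static_dists)
head(names(cell_cycle_static_dists), 5)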

The first thing to notice is that we input 60 distances and got back 55. The reason is that precise_dist() silently drops any distance that throws an error, for whatever reason. The other thing to notice is that PreciseDist writes a list to file. This is different from the format that the function actually returns. If we had captured the function’s return value and looked at it, it would have a structure like this (although the Time_Taken_Seconds column would not be NA):

This format is a tibble with a string column containing suffix + distance, a list column where each row contains the corresponding distance matrix, and a final column which shows us the time it took to calculate each distance. Note that relevant downstream PreciseDist functions will take both formats as input. If we wanted to turn the PreciseDist format back into a list we could simply run:
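
For example (cell_cycle_dist_tbl is a hypothetical name for the tibble-format return value, and the output argument below is a placeholder for whatever precise_transform() option selects list output):

cell_cycle_dist_list <- precise_transform(cell_cycle_dist_tbl, output = "list")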

Besides changing the precise_dist() output format, precise_transform() also performs a wide variety of filtering and mathematical transformations on the data. When first interrogating your data, though, it is recommended not to apply any complicated transformations. Despite its name, precise_dist() returns similarities and correlations in addition to distances, so it is very likely that at this stage you will want to convert all of the output matrices into distances:
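
A hedged sketch; the trans argument and the "sim_to_dist" option are placeholders for whatever precise_transform() option performs the similarity-to-distance conversion:

cell_cycle_dists <- precise_transform(cell_cycle_static_dists, trans = "sim_to_dist")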

Note that some precise_transform() options expect the data to be in a certain format, for example, non-negative or a similarity rather than a distance. Rather than have preconceived notions of what the user is trying to accomplish, however, PreciseDist generally does not force the input into a certain format. Therefore, the user should think carefully about what they are trying to accomplish before running a transformation function.

Now that the data is all in positive distance form, we can see how the distances are related to one another using precise_correlations(), which creates a correlation matrix from distance matrices. We can then take a look at some of the output with heatmaply():
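
Something along these lines (heatmaply() takes a numeric matrix directly; the precise_correlations() call is a sketch):

library(heatmaply)

cell_cycle_dist_cors <- precise_correlations(cell_cycle_dists)
heatmaply(cell_cycle_dist_cors)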

Let’s use each distance as input into umap():
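
Presumably via precise_umap(), which is referenced below; a minimal sketch:

cell_cycle_umaps <- precise_umap(cell_cycle_dists)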

Let’s also calculate heatmaps for each distance:
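
Again as a minimal sketch:

cell_cycle_heatmaps <- precise_heatmap(cell_cycle_dists)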

Now that we have lists of 55 UMAP and 55 heatmap results corresponding to the 55 distances precise_dist() returned, we will leverage the wonderful trelliscopejs library to view our results. First we will run trellis_plots() to view the output of precise_umap():
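
For example:

trellis_plots(cell_cycle_umaps)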


Next we will run trellis_heatmap() to view the output of precise_heatmap():
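
And similarly:

trellis_heatmap(cell_cycle_heatmaps)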



We learn a lot from these two views. First, some distances like canberra_phil, clark, divergence, minkowski_0.5, neyman, random_forest and wavehedges do a good job by themselves of separating the data. Input any of these distances into the right clustering algorithm, and there is a decent chance you will get back a worthy clustering. Without ground truth labels, however, we would hope for finer boundaries between the clusters. Although visual structure does not necessarily equate to real clusters, in our experience it often does. Which brings us to the kodama_pls distance:



This distance does a comparable job to SIMLR, and even if we did not have the ground truth labels, the clusters are so clearly defined I imagine we would look carefully at this distance using our domain expertise to validate the results. With that being said, however, this is one example, and this algorithm does not scale well. So, let’s pretend kodama_pls did not do such an excellent job and move on to the other distances we already identified. Specifically, if we look back at the results of precise_correlations() we will notice a highly correlated block which includes canberra_phil, clark, divergence and wavehedges. Let’s isolate those distances and see how it looks when we fuse them.

Experiment 2: Working with PreciseDist

First, extract those distances from the original list of 55 distances precise_dist() gave us:
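
A sketch, assuming the list elements are named after their distances:

promising_dists <- cell_cycle_static_dists[c("canberra_phil", "clark", "divergence", "wavehedges")]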

Now we will fuse the distances using the fuse method. Of course, we could also try distatis or SNF, but fuse is logical here considering how highly correlated the distances are:
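
The function that performs the fusion is not shown in this rendered output; as a placeholder, assuming the fusion option is passed to precise_transform() (a guess; check the package documentation):

cell_cycle_fused <- precise_transform(promising_dists, fusion = "fuse")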

At this point, we will use precise_graph() to build a graph we can then visualize using precise_viz(). We could also input the fusion distance directly into precise_viz(), but precise_graph() often returns far more pleasing visual results:
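
A minimal sketch:

cell_cycle_graph <- precise_graph(cell_cycle_fused)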

Now feed it into precise_viz():
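
For example:

precise_viz(cell_cycle_graph)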

Make sure you move the graph around using your mouse, and hover over it to see the labels. You can also zoom in and out using your scroll wheel:

As we can see, simply fusing four highly correlated distances that originally showed some structure did a fairly good job of distinguishing the true classes. What if we transform the input though? Using precise_transform() we will first coerce our fused distance into a similarity, and then we will run hglasso(), which estimates a sparse inverse covariance matrix with hub nodes using a Lasso penalty:
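
A placeholder sketch; the trans options below ("dist_to_sim" and "hglasso") are hypothetical stand-ins for the actual precise_transform() option names:

cell_cycle_hglasso <- cell_cycle_fused %>%
  precise_transform(trans = "dist_to_sim") %>%  # coerce the fused distance into a similarity
  precise_transform(trans = "hglasso")          # estimate the sparse hub network with hglasso()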

Now let’s visualize the results:

Not only do we have three clearly defined groups, our visualization also shows us the connections between nodes and - quite literally - the circle of life.

Experiment 3: The PreciseDist Function Factory

We will now try another course of action to see if we can recapitulate our results using a different method. If you remember from before, random_forest was in the group of distances we identified as promising. Random forest distances are interesting because they accept a second argument, mtry, the number of variables randomly sampled as candidates at each split. Thus, we will now introduce precise_func_fact(), which stands for the PreciseDist function factory:
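
The exact arguments are not reproduced here; a hypothetical sketch in which "random_forest" is the distance being templated and mtry_values is a made-up name for the grid of mtry settings:

cell_cycle_rf_funcs <- precise_func_fact("random_forest", mtry_values = c(100, 500, 1000, 2000, 4000))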

Let’s take a look at the first five entries of what we have created:
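
For example:

# A list of distance functions, one per mtry value
head(cell_cycle_rf_funcs, 5)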

We now have a list of functions that we can input into precise_dist(). Before, when dists = “static_dists”, we ignored the dist_funcs argument. Now, let’s ignore the dists argument and set dist_funcs = cell_cycle_rf_funcs:
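
A hedged sketch using only arguments named in the prose:

cell_cycle_rf_dists <- precise_dist(
  cell_cycle_data,
  dist_funcs = cell_cycle_rf_funcs,
  parallel = TRUE
)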

Let’s take a quick look at the results using fusion = “distatis” because in our experience random forest distances often respond well to that fusion method. We have seen all of the following functions before, so we are going to lump them into one call:

Let’s see how it looks using a 2D plot of the umap() output:

Hm, that does not look good at all. In fact, the fusion made it distinctly worse. There is, however, a simple trick we can apply that will make everything look considerably better. The fact that our distances look better separately than fused can mean a lot of things, but one obvious conclusion is that too many of our edges are noise. If we keep only our strongest edges and then fuse the graphs, the result will be much cleaner.

First, we sparsify and coerce our matrices into distances using precise_transform():
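
A placeholder sketch; "sparsify" and "sim_to_dist" are hypothetical option names:

cell_cycle_rf_sparse <- cell_cycle_rf_dists %>%
  precise_transform(trans = "sparsify") %>%     # keep only the strongest edges
  precise_transform(trans = "sim_to_dist")      # coerce the result back into distances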

Then we run everything like we just did, except we will choose networkD3 as our layout, which scales terribly but is fun for small datasets:

Much better!

Experiment 4: Random Walks around Partitions

So far, using PreciseDist we have found two very different ways of recovering the signal in the data. This time, let’s take advantage of PreciseDist’s partition parameter to create 10 copies of a single distance, with 10% noise added to each copy. We will use the neyman distance because we have already utilized most of the other distances we identified as promising in our first experiment:
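
A hedged sketch; partition is named in the prose, while the noise argument below is a hypothetical name for the 10% noise setting:

cell_cycle_neyman <- precise_dist(
  cell_cycle_data,
  dists = "neyman",
  partition = 10,   # 10 noisy copies of the neyman distance
  noise = 0.1,      # hypothetical argument for the 10% noise added to each copy
  parallel = TRUE
)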

As in our previous experiment, before we do anything else let’s quickly see how it looks using fuse as our fusion method and mds_3d_graph as our visualization:

The results are a bit messy, and without true labels it would be difficult to discern three distinct classes. Previously, we sparsified the matrix to improve the signal in the edge weights. This time we will try another trick, which is to graph the random walk transition matrix of the fused neyman partitions:
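
A placeholder sketch, reusing the fusion placeholder from Experiment 2 and a hypothetical "random_walk" option for the transition-matrix step:

cell_cycle_neyman_fused <- precise_transform(cell_cycle_neyman, fusion = "fuse")

cell_cycle_neyman_fused %>%
  precise_transform(trans = "random_walk") %>%  # random walk transition matrix of the fused partitions
  precise_graph() %>%
  precise_viz()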

Much cleaner results.

Experiment 5: Minkowski 100x

This final experiment uses the last distance we found promising: the Minkowski distance with p = 0.5. By this point we have spent a lot of time going through the code piecemeal, so we end by giving you the following to peruse without further comment:
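
The original chunk is not reproduced in this rendered output; a hedged reconstruction that follows the pattern of the earlier experiments (the 100 partitions are a guess from the experiment’s title, and the option names are the same placeholders as before):

cell_cycle_minkowski <- precise_dist(
  cell_cycle_data,
  dists = "minkowski_0.5",
  partition = 100,   # "Minkowski 100x"
  parallel = TRUE
)

cell_cycle_minkowski %>%
  precise_transform(fusion = "fuse") %>%
  precise_graph() %>%
  precise_viz()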

Bullet Point Conclusions

  • We hope this has been useful. Questions, comments, concerns, bugs, errors, feature requests etc can be directed to the PreciseDist github page issues section: https://github.com/bmuchmore/PreciseDist/issues

  • If the data has structure, there is no one best algorithm to unearth that structure. Thus, the best thing you can do with your data is to…

  • Iterate, iterate, iterate! The purpose of PreciseDist is to enable rapid prototyping of a large variety of algorithms. Think of your data as an unknown object and PreciseDist as a tool to capture many different views of that object in order to provide an accurate representation of its true form.

  • We overfit! There is no way around it: because we had access to the true labels, they could easily guide us through every choice we made.

  • Without true labels, you must have domain expertise to determine if your final visualization/clusters/whatever are correct.

  • However, you do not necessarily need domain expertise to arrive at the correct visualization/clusters/whatever, which brings us to a story of my father. He was an emergency room physician for 30 years, and he said that the true nature of the job was determining if a patient that walked through the door was sick or healthy. That is, were they about to die in the next 24 hours or not? While that is often a simple enough task, there are large repercussions for failing to uncover the hidden cases that are not immediately apparent. In addition, there are also negative consequences if you report a case is urgent and apparent when it is, in fact, not. Which brings us back to…

  • Iterate, iterate, iterate! Over time you will get a feeling for whether the structure you are uncovering in your data is likely real or imagined.

  • Good luck!