Clustering with PreciseDist

“Statistical results can never be better than the data, whose value depends on the ethnological competence with which they are collected and analyzed. To a large extent trait-count computations merely corroborate the ordinary ethnological findings made by students who know their field comparatively…On the other hand, statistical treatment usually expresses results more precisely and definitely. In all cases examined it also indicates greater or less corrections of the ethnological interpretations. These corrections we believe to be valid.”

-Driver and Kroeber, Quantitative expression of cultural relationships, p 51

Introduction

In many ways, the end goal of the PreciseDist framework is to cluster. Our philosophy, though, is that the clustering algorithm itself is the least important step in distance-based clustering, because the result can only be as good as the distances that go in (garbage in, garbage out). Although we are tempted to expound further on our philosophical and technical views concerning the art and science of clustering, we instead refer the reader to this insightful paper on cluster reasoning and this interesting Stack Exchange post on the merits and drawbacks of visual clustering (be sure to read the second answer as well as the first).

What we will briefly show below are some of the ways in which precise_cluster() can be used, and how valuable visualizations can be for validating clusters, exemplified here by the trellis_descriptors() function. Of course, the user can at any time pull out the results obtained with PreciseDist and cluster them with any algorithm they see fit. For convenience, we have provided a repository of code that takes distances or similarities as input to the clustering algorithm. Please see the Clustering with Other Algorithms vignette for more details.
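As a rough illustration of that last point, the sketch below clusters a distance matrix with base R's hclust() and cutree(), and shows one way to handle a similarity matrix by converting it to a dissimilarity first. It does not use PreciseDist. The toy data, the object names, the average linkage and the Gaussian-kernel similarity are all assumptions made for the example.

```r
# Toy data: two loose groups of five points each (illustrative only).
set.seed(1)
toy <- rbind(
  matrix(rnorm(10, mean = 0), ncol = 2),
  matrix(rnorm(10, mean = 6), ncol = 2)
)

# Distance input: hclust() expects an object of class "dist".
toy_dist <- dist(toy)
clusters_from_dist <- cutree(hclust(toy_dist, method = "average"), k = 2)

# Similarity input: convert to a dissimilarity first. The similarity here
# is a Gaussian kernel with values in (0, 1], so 1 - similarity is >= 0.
toy_sim <- exp(-as.matrix(toy_dist)^2 / 2)
clusters_from_sim <- cutree(
  hclust(as.dist(1 - toy_sim), method = "average"),
  k = 2
)

# Cross-tabulate the two sets of assignments.
table(clusters_from_dist, clusters_from_sim)
```

Any algorithm that accepts a "dist" object can be swapped in for hclust() in the same way.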

Data set-up

The data and set-up come from the Cell Cycle Vignette - Experiment 5: Minkowski 100x. See that vignette for more details.

library(PreciseDist)
data("data_cell_cycle")
str(data_cell_cycle[1:5])
library(dplyr)
cell_cycle_data <- data_cell_cycle %>%
  dplyr::select(-Cell_cycle) %>%
  as.matrix()
cell_cycle_labels <- data_cell_cycle %>%
  dplyr::select(Cell_cycle) %>%
  as.matrix()
cell_cycle_minkowski_params <- seq(0.45, 0.54, length.out = 10)
cell_cycle_minkowski_funcs <- precise_func_fact(
  func = "minkowski",
  params = cell_cycle_minkowski_params
)
library(future)
library(doFuture)
registerDoFuture()
# multiprocess is deprecated in recent versions of future; use multisession
plan(multisession, workers = 10)
cell_cycle_minkowski_dists <- cell_cycle_data %>%
  as.matrix() %>%
  precise_dist(
    dist_funcs = cell_cycle_minkowski_funcs,
    time_series = FALSE,
    partitions = 10,
    suffix = "cell_minkowski_",
    file = "/absolute_path/to_somewhere/with_full_name/including_the/file_extension.rds",
    parallel = TRUE,
    local_timeout = Inf,
    verbose = TRUE
  )
cell_cycle_minkowski_transformed <- cell_cycle_minkowski_dists  %>%
  precise_transform(transform = "laplacian")
cell_cycle_minkowski_fused <- precise_fusion(
  cell_cycle_minkowski_transformed,
  fusion = "fuse",
  verbose = TRUE
)
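To make the parameter grid in the set-up more concrete, here is a minimal base R sketch of how the Minkowski exponent changes the distance between two points. It is independent of PreciseDist: the helper minkowski() and the example vectors are hypothetical and written only for this illustration. Note that for exponents below 1, as used above, the Minkowski formula no longer satisfies the triangle inequality, so it is a dissimilarity rather than a true metric.

```r
# Minkowski distance between two numeric vectors for exponent p.
minkowski <- function(a, b, p) {
  sum(abs(a - b)^p)^(1 / p)
}

a <- c(0, 0)
b <- c(3, 4)

minkowski(a, b, p = 1)  # Manhattan distance
minkowski(a, b, p = 2)  # Euclidean distance

# The same grid of exponents as in the set-up: ten values, 0.01 apart.
p_grid <- seq(0.45, 0.54, length.out = 10)
d_grid <- sapply(p_grid, function(p) minkowski(a, b, p))

# One distance per exponent; the value shrinks as p grows.
data.frame(p = p_grid, distance = d_grid)
```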

Clustering the distance

We will begin by visualizing the distance matrix we just produced without any labels. It will become evident in a moment why we are choosing a 2D plot as our visualization technique rather than an (often) beautiful 3D graph:

viz_mm <- precise_viz(
  data = cell_cycle_minkowski_fused,
  plot_type = "fr_2d_plot",
  k = 5,
  color_vec = NULL,
  colors = NULL,
  size = 0.5,
  graphml = NULL,
  html = NULL,
  verbose = FALSE
)
viz_mm$visual_output

As we can see, the visualization of the original matrix without any augmentation does not show any obvious clusters. Let’s now cluster the original matrix and append the true labels for reference:

library(future)
library(doFuture)
registerDoFuture()
plan(multisession, workers = 10)
clusters_mm <- precise_cluster(
  cell_cycle_minkowski_fused,
  cluster_alg = c("all"),
  parallel = TRUE,
  verbose = FALSE
) %>%
  cbind(cell_cycle_labels)
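With the true labels sitting next to the cluster assignments, a quick numeric check is also possible before any plotting. The sketch below is a minimal base R example with hypothetical stand-in vectors (cluster_id and true_label are invented here and are not the columns returned by precise_cluster()); it cross-tabulates one clustering against the labels and computes a simple purity score.

```r
# Hypothetical stand-ins: one algorithm's cluster ids and the true labels.
cluster_id <- c(1, 1, 2, 2, 3, 3, 3, 1)
true_label <- c("G1", "G1", "S", "S", "G2M", "G2M", "G2M", "S")

# Rows are clusters, columns are labels.
confusion <- table(cluster = cluster_id, label = true_label)
confusion

# Purity: share of samples that carry their cluster's majority label.
purity <- sum(apply(confusion, 1, max)) / sum(confusion)
purity
```

A score like this summarizes agreement with known labels, but it says nothing about why samples were grouped together, which is where the visual mapping comes in.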

Now that we have our clusters, we will introduce the trellis_descriptors() function to map all of our clusters onto the visualization we just produced simultaneously. Note that while we use the results of precise_viz() as the input visualization here, we could use any two-column matrix as input, for example the results of Rtsne() or umap():

trellis_descriptors(
  data = viz_mm,
  descriptors = clusters_mm,
  path = "/absolute_path/to_somewhere/with_full_name/not_including_a/file_extension",
  name = "clusters",
  group = "clusters",
  size = 0.5,
  rank = FALSE,
  self_contained = FALSE,
  desc = "",
  md_desc = "",
  height = 500,
  width = 500,
  nrow = 1,
  ncol = 1,
  verbose = TRUE
)
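Since any two-column matrix can serve as the input layout, here is a minimal sketch of producing one without PreciseDist. It uses classical multidimensional scaling via base R's cmdscale() as a stand-in for Rtsne() or umap(); the toy data and object names are assumptions for the example.

```r
# Toy data: 50 samples with 5 features (illustrative only).
set.seed(42)
toy <- matrix(rnorm(50 * 5), nrow = 50)
toy_dist <- dist(toy)

# Classical MDS returns an n x k matrix of coordinates, one row per sample.
layout_2d <- cmdscale(toy_dist, k = 2)

dim(layout_2d)
head(layout_2d)
```

The rows of the layout are in the same order as the rows of the original data, so descriptors such as cluster assignments line up with it row by row.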

We can see that the clusters from the different algorithms map in what seems like a consistent and stable fashion, but with this view it is very hard to know whether the clusters are of any potential use. In this instance, there are three obvious options: improve the visualization, add extra descriptors that can explain the clusters, or both. First, we will improve the visualization by running the matrix through precise_graph():

library(future)
library(doFuture)
registerDoFuture()
plan(multisession, workers = 8)
graph_mm <- precise_graph(
  data = cell_cycle_minkowski_fused,
  method = 1,
  n_neighbors = 50,
  spread = 1,
  min_dist = 0.001,
  bandwidth = 10,
  parallel = TRUE,
  verbose = FALSE
)
viz_graph_mm <- precise_viz(
  data = graph_mm,
  plot_type = "fr_2d_plot",
  k = 5,
  color_vec = NULL,
  colors = NULL,
  size = 0.5,
  graphml = NULL,
  html = NULL,
  verbose = FALSE
)
viz_graph_mm$visual_output
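For intuition about why a neighbour graph can sharpen structure, the sketch below builds a symmetric k-nearest-neighbour adjacency matrix from a distance matrix in base R. This is not how precise_graph() works internally, which this vignette does not describe; knn_adjacency() is a hypothetical helper and the toy data are invented for the example.

```r
# Build a symmetric k-nearest-neighbour adjacency matrix from distances.
knn_adjacency <- function(d, k) {
  d <- as.matrix(d)
  diag(d) <- Inf                      # a point is never its own neighbour
  n <- nrow(d)
  adj <- matrix(0, nrow = n, ncol = n)
  for (i in seq_len(n)) {
    nearest <- order(d[i, ])[seq_len(k)]
    adj[i, nearest] <- 1
  }
  # Keep an edge if either endpoint lists the other as a neighbour.
  pmax(adj, t(adj))
}

set.seed(3)
toy <- matrix(rnorm(30 * 4), nrow = 30)   # 30 samples, 4 features
adj <- knn_adjacency(dist(toy), k = 5)

# Every sample keeps at least k edges; all other pairs have no edge.
range(rowSums(adj))
```

Keeping only local neighbourhoods discards the long-range distances, which are often the noisiest part of a dense distance matrix.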

Now let’s add the original clusters back on to see how they correlate with our new visualization:

trellis_descriptors(
  data = viz_graph_mm,
  descriptors = clusters_mm,
  path = "/absolute_path/to_somewhere/with_full_name/not_including_a/file_extension",
  name = "features",
  group = "features",
  size = 0.5,
  rank = FALSE,
  self_contained = FALSE,
  desc = "",
  md_desc = "",
  height = 500,
  width = 500,
  nrow = 1,
  ncol = 1,
  verbose = TRUE
)

Clustering the graph

There are, of course, other ways to cluster with precise_cluster(). Above, we clustered with the matrix, saw that the clusters perhaps made sense but that the visualization was probably sub-optimal, and then we mapped the clusters onto a superior visualization made by precise_graph() + precise_viz(). Rather than clustering the original matrix, however, we can extract the graph from the above call to precise_graph() and then cluster that:

library(future)
library(doFuture)
registerDoFuture()
plan(multisession, workers = 10)
clusters_graph <- precise_cluster(
  graph_mm$fused_dist,
  cluster_alg = c("all"),
  parallel = TRUE,
  verbose = FALSE
) %>%
  cbind(cell_cycle_labels)

Now we can map these clusters onto the same visualization we used before to see how the clusterings differ:

trellis_descriptors(
  data = viz_graph_mm,
  descriptors = clusters_graph,
  path = "/absolute_path/to_somewhere/with_full_name/not_including_a/file_extension",
  name = "features",
  group = "features",
  size = 0.5,
  rank = FALSE,
  self_contained = FALSE,
  desc = "",
  md_desc = "",
  height = 500,
  width = 500,
  nrow = 1,
  ncol = 1,
  verbose = TRUE
)

In this instance, some of the clusterings appear to make more sense than before, while others make less. The important point is that clustering the graph directly gives us additional clustering options, some of which may describe our data more usefully.
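To put a number on how much two clusterings differ, one option is the Rand index: the fraction of sample pairs on which the two clusterings agree (both place the pair together, or both place it apart). The sketch below is a minimal base R version; rand_index() is a hypothetical helper and the example vectors are invented, not output from precise_cluster().

```r
# Rand index: share of sample pairs treated the same way by both clusterings.
rand_index <- function(a, b) {
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  pairs <- upper.tri(same_a)          # each unordered pair once
  mean(same_a[pairs] == same_b[pairs])
}

from_matrix <- c(1, 1, 2, 2)
relabelled  <- c(2, 2, 1, 1)          # same partition, different ids
different   <- c(1, 2, 1, 2)

rand_index(from_matrix, relabelled)   # identical partitions agree on all pairs
rand_index(from_matrix, different)
```

Because only co-membership is compared, the index ignores how the cluster ids happen to be numbered.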

Clustering the visualization

The final clustering option that we will show here is to cluster the output of precise_viz() directly, which means we are literally clustering what we see:

library(future)
library(doFuture)
registerDoFuture()
plan(multisession, workers = 10)
clusters_viz <- precise_cluster(
  viz_graph_mm,
  cluster_alg = c("all"),
  parallel = TRUE,
  verbose = FALSE
) %>%
  cbind(cell_cycle_labels)

trellis_descriptors(
  data = viz_graph_mm,
  descriptors = clusters_viz,
  path = "/absolute_path/to_somewhere/with_full_name/not_including_a/file_extension",
  name = "features",
  group = "features",
  size = 0.5,
  rank = FALSE,
  self_contained = FALSE,
  desc = "",
  md_desc = "",
  height = 500,
  width = 500,
  nrow = 1,
  ncol = 1,
  verbose = TRUE
)

This last clustering gives fairly divergent results compared to the other two methods, but it is arguably the most flexible method, because all we have to do to change the clusters is change the visualization. Of course, this should be done with caution and a clear idea of what you are trying to accomplish, but there are times when it can be a useful option.
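The idea of clustering what we see can be sketched outside PreciseDist as well: take the two columns of layout coordinates and hand them to any algorithm that works on points. The example below uses base R's kmeans() on an invented two-blob layout; the data, object names and the choice of k-means are assumptions for illustration.

```r
# Hypothetical 2D layout: two compact, well-separated blobs of 20 points.
set.seed(7)
layout_2d <- rbind(
  cbind(rnorm(20, mean = 0, sd = 0.5), rnorm(20, mean = 0, sd = 0.5)),
  cbind(rnorm(20, mean = 10, sd = 0.5), rnorm(20, mean = 10, sd = 0.5))
)

# Cluster the coordinates themselves; nstart guards against a poor start.
km <- kmeans(layout_2d, centers = 2, nstart = 10)

# Compare the assignments with the blob each point was drawn from.
table(cluster = km$cluster, blob = rep(c("blob_a", "blob_b"), each = 20))
```

Because the clusters depend only on the coordinates, any change to the layout (different parameters, a different embedding) can change the result.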

The take-home points

There is no right way to cluster, and there is no best clustering method. There are only useful clusterings.
trellis_descriptors() provides a useful way of both visualizing the clusterings and, far more importantly, validating them.
Of course, there are numerous other ways to cluster. As mentioned at the beginning, take a look at the Clustering with Other Algorithms vignette for other potential options.

Brian Muchmore