Cluster Validation with PreciseDist

‘I said “I don’t think there are flying saucers”. So my antagonist said, “Is it impossible that there are flying saucers? Can you prove that it’s impossible?” “No”, I said, “I can’t prove it’s impossible. It’s just very unlikely”. At that he said, “You are very unscientific. If you can’t prove it impossible then how can you say that it’s unlikely?” But that is the way that is scientific. It is scientific only to say what is more likely and what less likely, and not to be proving all the time the possible and impossible.’

-Richard P. Feynman, Seeking New Truths, p 165

Introduction

As stated before, while PreciseDist includes clustering routines, the framework is mainly concerned with the before (i.e. defining the distance) and the after (i.e. validating the clusters) of distance-based clustering. Although the literature is rife with different internal and external cluster-validation indices, these indices output numbers, and numbers are only valuable within the cozy confines of context. So, PreciseDist strives to provide the context necessary for cluster interpretation because, in the end, a perfect cluster means nothing if it can not be reasonably explained as being more likely than not.

Data set-up

We are going to begin with the data and clusters we obtained in the Clustering with PreciseDist Vignette, so please see that vignette for details. The set-up code is being omitted, but below are the initial results when we clustered the original fused minkowski matrix:

The first point of we would like to make is that the clusters are visually mapping to the structure indicated by the visualization fairly well. Thus, the first step in clustering validation is a meaningful visualization. As we have said before, structure in your visualization does not always mean that there is structure in the data, but in our experience it often does. It remains to be seen, however, if the structure and clusters we see are useful or not.

Cluster proportions using trellis_pivot()

We will begin by introducing the trellis_pivot() function, so that we can get an accurate sense of the proportions that the trule labels make up of the clusters and vice-versa. Of course, we won’t always have the true labels to guide us, but using the default options of this function can still be useful to gauge the proportions of a single categorical variable in the context of all other categorical variables.

First, we will view the true label count as a fraction of each cluster by setting rows = “Cell_cycle” and cols = NULL:

trellis_pivot(
  data = clusters_mm,
  diagnostics = NULL,
  rows = "Cell_cycle",
  cols = NULL,
  path = "/absolute_path/to_somewhere/with_full_name/not_including_a/file_extension",
  name = "pivot",
  group = "pivot",
  aggregator = "Count as Fraction of Rows",
  renderer = "Table Barchart",
  state = NULL,
  self_contained = FALSE,
  desc = "",
  md_desc = "",
  height = 500,
  width = 500,
  nrow = 1,
  ncol = 1
)

From this view we can see, for example, that the G2M labels are encapsulated by both Cluster_3 (95.4%) and Cluster_1 (4.6%) of edge_betweenness. Now, we will view the cluster count as a fraction of true labels by setting rows = NULL and cols = “Cell_cycle”:

trellis_pivot(
  data = clusters_mm,
  diagnostics = NULL,
  rows = NULL,
  cols = "Cell_cycle",
  path = "/absolute_path/to_somewhere/with_full_name/not_including_a/file_extension",
  name = "pivot",
  group = "pivot",
  aggregator = "Count as Fraction of Rows",
  renderer = "Table Barchart",
  state = NULL,
  self_contained = FALSE,
  desc = "",
  md_desc = "",
  height = 500,
  width = 500,
  nrow = 1,
  ncol = 1
)

In this view we now see that Cluster_3 of edge_betweenness is 100% made up of G2M labels.

We should also note here that being a pivot table, this function is a very flexible way to view different proportions and statistics of your data for numerical as well categorical data, so try changing the aggregator and renderer arguments of trellis_pivot() or try changing these options with in the trelliscope visualization to get a sense of what is possible. See the rpivotTable vignette for more details.

Mapping features using trellis_descriptors()

In addition to seeing the proportions of true labels (or any other variable) contained within each cluster, it would also be helpful to map the original columns (i.e. the columns we clustered with) onto the visualization to see if specific attributes correspond with the different clusterings, and this is where trellis_descriptors() really shines:

trellis_descriptors(
  data = viz_graph_mm,
  descriptors = cell_cycle_data[, 1:500],
  diagnostics = NULL,
  path = "/absolute_path/to_somewhere/with_full_name/not_including_a/file_extension",
  name = "features",
  group = "features",
  size = NULL,
  rank = FALSE,
  self_contained = FALSE,
  desc = "",
  md_desc = "",
  height = 500,
  width = 500,
  nrow = 2,
  ncol = 2,
  verbose = TRUE
)

A few notes to consider about the above visualization:

trellis_descriptors() knows if the column is continuous or not, and thus will automatically provide the correct scale.
Note that we set size = NULL, which sets the size according to the value.
We used rank = FALSE here to leave the descriptors in their original scale, but we could try rank = TRUE to see if the results change (see below).
We only used the first 500 columns of our original data, but trellis_descriptors() can handle thousands of descriptors (if you can).
The other visualizations precise_viz() offers (i.e. 2D graph, 3D graph and 3D plot) could work with trellis_descriptors(), but they don’t have the option to show continuous scales or size, which is why they have been left out and a 2D plot is the only output option.

Determining feature relevance with precise_features()

OK, so we now have mapped all of our input features onto our visualization, which we can use to correspond to our clusters. It would be nice, however, if we could correlate what we see to the statistical significance of each feature to help guide us to the visualizations which explain our clusters. It is, after all, difficult to go through 500 visualizations manually, no matter how nicely it is laid out for the user. Thus, we now introduce precise_features() to help guide our search for relevant features that can explain our clustering.

The precise_features() function is a supervised feature selection function that utilizes both univariate feature rankings such has Chi-Square and ANOVA and tree-based feature selection methods like Boruta() and ranger(). Although PreciseDist is build for unsupervised learning, at this point we have clusters, which means we have explicit labels for each point. Thus, we can now run our original features against out cluster labels to get a better sense of which features are significantly driving the clusters. In this case, we will choose the Louvain clustering as our clustering of choice and as the input into the grouping_vec parameter:

library(future)
library(doFuture)
registerDoFuture()
plan(multiprocess, workers = 16)
feature_rank <- precise_features(
  data = cell_cycle_data[, 1:500],
  grouping_vec = clusters_mm$louvain,
  feature_alg = c("anova", "boruta", "ranger"),
  num_tree = 1001,
  runs_max = 1001,
  parallel = TRUE,
  verbose = TRUE
)
as_tibble(feature_rank)

#> # A tibble: 500 x 7
#>    Feature    Anova Anova_Bonferroni Boruta_decision Ranger_Importan…
#>    <chr>      <dbl>            <dbl> <fct>                      <dbl>
#>  1 1       1.42e- 3         7.11e- 1 Rejected                0.00396 
#>  2 2       6.56e- 6         3.28e- 3 Rejected                0.127   
#>  3 3       1.78e- 1         1.00e+ 0 Rejected               -0.000192
#>  4 4       8.27e- 2         1.00e+ 0 Rejected               -0.0117  
#>  5 5       1.20e- 1         1.00e+ 0 Rejected                0.0359  
#>  6 6       4.68e- 7         2.34e- 4 Rejected                0.195   
#>  7 7       7.53e- 2         1.00e+ 0 Rejected                0.0182  
#>  8 8       9.70e-20         4.85e-17 Confirmed               0.854   
#>  9 9       2.34e- 1         1.00e+ 0 Rejected               -0.00963 
#> 10 10      4.60e- 8         2.30e- 5 Confirmed               0.431   
#> # ... with 490 more rows, and 2 more variables: Ranger_Pvalue <dbl>,
#> #   Ranger_Bonferroni <dbl>

Now that we have a variety of cluster diagnostics, we can set diagnostics = feature_rank in trellis_descriptors() and re-run it. Make sure to play with the sort and filter options on the side panel to see which features are driving the clusters, and to validate these associations by visualizing them:

trellis_descriptors(
  data = viz_graph_mm,
  descriptors = cell_cycle_data[, 1:500],
  diagnostics = feature_rank,
  path = "/absolute_path/to_somewhere/with_full_name/not_including_a/file_extension",
  name = "features",
  group = "features",
  size = NULL,
  rank = FALSE,
  self_contained = FALSE,
  desc = "",
  md_desc = "",
  height = 500,
  width = 500,
  nrow = 2,
  ncol = 2,
  verbose = TRUE
)

In the above display, we set rank = FALSE. Let’s see what happens when we set rank = TRUE:

trellis_descriptors(
  data = viz_graph_mm,
  descriptors = cell_cycle_data[, 1:500],
  diagnostics = feature_rank,
  path = "/absolute_path/to_somewhere/with_full_name/not_including_a/file_extension",
  name = "features",
  group = "features",
  size = NULL,
  rank = TRUE,
  self_contained = FALSE,
  desc = "",
  md_desc = "",
  height = 500,
  width = 500,
  nrow = 2,
  ncol = 2,
  verbose = TRUE
)

If we compare and contrast the two the differences are not wildly different, but they are different. Generally, setting rank = TRUE is desirable when the differences between values is very large or very small, although as we just saw it is easy enough to simply run trellis_descriptors() with both rank = TRUE and rank = FALSE, which is what we typically end up doing in practice.

Brian Muchmore

2018-09-26

Introduction

Data set-up

Cluster proportions using trellis_pivot()

Mapping features using trellis_descriptors()

Determining feature relevance with precise_features()

Contents