This function takes a single numeric dataset as input and returns multiple distance matrices. While the output may be useful in its own right, the general purpose of the function is to return a list of matrices that can be fused into a single distance matrix that best represents the patterns inherent to the original dataset. Note: depending on the choice of dists or the input to dist_funcs, this function will return a list containing distances, similarities, correlations, or any combination of the three.

precise_dist(data, dists = NULL, dist_funcs = NULL, time_series = FALSE,
  partitions = 1, suffix = "", file = NULL, cores = 1,
  verbose = TRUE)

Arguments

data

A numeric data frame, matrix, or tibble of input data.

dists

NULL or a character vector of distance names. A full list of the available choices can be found by running precise_dist_list("all_dists").
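
For example (the exact names returned depend on the installed version):

precise_dist_list("all_dists")  # list every distance name accepted by dists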

dist_funcs

NULL or a list of custom distance functions.
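
As a minimal sketch, assuming each custom function takes the dataset and returns a square matrix (the expected signature is not stated here):

custom_funcs <- list(
  maximum  = function(x) as.matrix(dist(x, method = "maximum")),
  canberra = function(x) as.matrix(dist(x, method = "canberra"))
)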

time_series

TRUE or FALSE. Is the input data a time series? Only used when partitions > 1.

partitions

An integer value of the number of resampling partitions to create.

suffix

A string to be used as the suffix for the output distance name.

file

NULL or the absolute path to save the results as an RData file.

verbose

TRUE or FALSE. Should the function tell you what is happening internally?

cores

An integer, 1 or greater, giving the number of processor cores to use.

Value

A tibble with three columns: Distance, the name of the distance; Matrix, the nested matrix corresponding to the value in Distance; and Time_Taken_Seconds, the time in seconds it took to calculate each matrix.
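
For illustration, here is a mock of the documented return shape (not actual precise_dist output) and how to unpack it:

result <- tibble::tibble(
  Distance = "euclidean",
  Matrix = list(as.matrix(dist(replicate(5, rnorm(10))))),
  Time_Taken_Seconds = 0.02
)
result$Matrix[[1]]  # pull the first nested matrix out of the list column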

Details

This is a fairly complex function, so a number of details should be mentioned:

  • While this function is named "precise_dist" and most inputs to the dists parameter return proper distances, as noted in the Description above, similarities and correlations may be returned as well. This is important to keep in mind if you intend to use this function's output as input to precise_fusion. See precise_transform if you would like to convert the precise_dist output matrices into only distances or similarities before passing them to precise_fusion.

  • The partitions argument is described in more detail in the "Overfitting in unsupervised learning" section below.

  • If partitions > 1, then 100 / partitions is the percentage of the output matrix that will be resampled. For example, if partitions = 2 then 50% of the output matrix will be resampled, and if partitions = 10 then 10% will be resampled.

  • The time_series parameter has some important caveats to consider:

    1. While setting time_series = TRUE along with partitions > 1 can be useful for exposing algorithm overfitting, the constraints below mean that a time-series dataset may have its inherent pattern(s) destroyed by setting time_series = TRUE.

    2. The input data is expected to be ordered chronologically by row, so that row 1 represents a point in time before row 2, row 2 a point in time before row 3, and so on.

    3. The number of partitions should divide cleanly into the number of rows. Thus, if nrow(data) = 100 then setting partitions = 10 will work. However, if nrow(data) = 121 then partitions = 10 will fail; in this example, partitions = 11 is a logical choice (see the sketch after this list).

    4. For best results, typically the number of partitions should correspond to the periodicity of the data (if known). So, if nrow(data) = 100 and every 20 rows reflects the periodicity of the data, partitions = 5 could be a good choice.

  • The suffix parameter is useful if you plan to fuse distances from more than one dataset with different features but identical observations. By adding a suffix, there will be less chance of confusing which distance came from which dataset downstream in the workflow.
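
As a quick sketch of the partition arithmetic described above:

data <- as.data.frame(replicate(5, rnorm(121)))
nrow(data) %% 10 == 0   # FALSE: partitions = 10 fails for 121 rows
nrow(data) %% 11 == 0   # TRUE: partitions = 11 divides cleanly
100 / 11                # ~9% of the output matrix will be resampled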

Inputting file paths in Windows

The following insights were taken from here.

  1. Even on Windows, you have to use forward slashes in paths. In R, the backslash is reserved as the escape character, so a path in R looks like: C:/path/to/my/directory

  2. In newer versions of Windows, C:\ is protected from writes by user accounts. If you want to write to C:\, you must be an administrator; you can accomplish this by right-clicking the R icon in Windows and choosing "Run as administrator." The same applies when installing packages: on some Windows versions you may not have rights to install packages unless you run R as an administrator.

  3. If you don't want to run R as an administrator, and you want to write to files, you will by default have rights to the C:/Users/username/ directory.
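
Putting this together, a call that writes its results to a user-writable location might look like the following sketch ("username" is a placeholder, and "euclidean" is assumed to be among the names returned by precise_dist_list("all_dists")):

test_matrix <- replicate(10, rnorm(10))
results <- precise_dist(
  test_matrix,
  dists = "euclidean",
  file = "C:/Users/username/Documents/precise_dist_results.RData"
)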

Overfitting in unsupervised learning

While supervised learning has a long history devoted to the fight against overfitting, unsupervised learning workflows tend to have few available options. If we use neural nets as an example of supervised learning, we find that there are four general ways to avoid overfitting (as detailed in Deep Learning with R):

  1. Get more training data.

  2. Reduce the capacity of the network.

  3. Add weight regularization.

  4. Add dropout.

If we try to apply similar principles to unsupervised learning, however, we find the following:

  1. Getting more training data is often difficult.

  2. Reducing capacity corresponds to pruning the input features of the dataset before running an algorithm, which is often impractical in an unsupervised setting.

  3. While weight regularization is difficult to apply at the level of an individual distance algorithm, it is possible at the level of the distance fusion algorithm, and will become a PreciseDist feature in the future.

  4. Adding dropout is easy to do!

So, what does it mean to add dropout? This idea was first implemented in the neural network domain by Geoff Hinton, and here is how he describes it: “I went to my bank. The tellers kept changing and I asked one of them why. He said he didn’t know but they got moved around a lot. I figured it must be because it would require cooperation between employees to successfully defraud the bank. This made me realize that randomly removing a different subset of neurons on each example would prevent conspiracies and thus reduce overfitting.”

Thus, the way precise_dist seeks to avoid overfitting is to add dropout (i.e. introduce noise), so that spurious patterns that are weak enough to be disrupted but significant enough to mislead are destroyed. Briefly, here are the two ways precise_dist implements this (borrowed in part from http://www.stat.cmu.edu/~cshalizi/350/lectures/28/lecture-28.pdf):

  • First way: Add noise at the level of the distance algorithm

    1. Calculate the distance using the full input data.

    2. Remove a percentage of data determined by the value of the partitions argument.

    3. Calculate the variance of each column.

    4. Fill in the missing data with values drawn from the Gaussian kernel of each column, with bandwidth determined by the variance calculated in step 3.

    5. Coerce the final distance matrix back to symmetry by replacing the upper diagonal with the lower diagonal.

  • Second way: Add noise at the level of the distance fusion

    The same five steps are applied, but to the matrices entering the fusion step rather than to a single distance matrix.
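
To make the five steps concrete, here is a minimal sketch of the first way. It assumes the noise is injected into the distance matrix itself and that the Gaussian bandwidth comes from the column spread computed in step 3; the internal implementation of precise_dist may differ:

dropout_dist <- function(data, partitions = 10) {
  d <- as.matrix(dist(data))                 # step 1: distance on the full data
  drop_idx <- sample(length(d), floor(length(d) / partitions))
  d[drop_idx] <- NA                          # step 2: remove 100 / partitions percent
  col_means <- colMeans(d, na.rm = TRUE)
  col_sds <- apply(d, 2, sd, na.rm = TRUE)   # step 3: per-column spread
  for (j in seq_len(ncol(d))) {              # step 4: Gaussian fill per column
    miss <- which(is.na(d[, j]))
    d[miss, j] <- rnorm(length(miss), mean = col_means[j], sd = col_sds[j])
  }
  d[upper.tri(d)] <- t(d)[upper.tri(d)]      # step 5: restore symmetry
  diag(d) <- 0
  d
}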


Finally, how should this be applied in practice? Although the difference between pattern and noise can be difficult to discern, patterns of significance should generally survive the introduction of a small (e.g. 10%) amount of random noise. Thus, if the results of precise_dist(partitions = 1) and precise_dist(partitions = 10) largely agree, one can be fairly confident that the resulting patterns are inherent to the data.
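
In code, the comparison might look like the following sketch (with test_matrix as in the Examples below, and "euclidean" standing in for whichever distances you actually use):

clean <- precise_dist(test_matrix, dists = "euclidean", partitions = 1)
noisy <- precise_dist(test_matrix, dists = "euclidean", partitions = 10)
# Cluster or fuse both results; if the patterns largely agree, they are
# unlikely to be artifacts of overfitting.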

References

Muchmore, B., Muchmore, P., and Alarcón-Riquelme, M.E. (2018). Optimal Distance Matrix Construction with PreciseDist and PreciseGraph.

See also

precise_fusion, precise_transform, precise_dist_list

Examples

test_matrix <- replicate(10, rnorm(10))  # 10 x 10 matrix of standard normal values
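
# A minimal call (sketch): "euclidean" and "manhattan" are assumed to be
# among the names returned by precise_dist_list("all_dists").
test_distances <- precise_dist(test_matrix, dists = c("euclidean", "manhattan"))
test_distances$Matrix[[1]]  # inspect the first nested distance matrix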