`segmenTier`

is a dynamic programming solution to segmentation based on maximization of arbitrary similarity measures within segments as developed in Machné, Murray, and Stadler (2017).

In addition to the core algorithm, function `segmentClusters`

, the package provides time-series processing and clustering functions as described in the publication. These are generally applicable where a `k-means`

clustering yields meaningful results, and have been specifically developed for clustering of the Discrete Fourier Transform of periodic gene expression data (“circadian” or “yeast metabolic oscillations”).

This clustering approach is outlined in the supplemental material of Machné and Murray (2012), and here is used as a basis of segment similarity measures. Note, that the functions `processTimeseries`

and `clusterTimeseries`

, can also be used as stand-alone tools for periodic time-series analysis and clustering.

The ideas and theory behind the package are detailed in Machné, Murray, and Stadler (2017), here we provide a synopsis.

`segmenTier`

’s input is a clustering \(\mathcal{C}_{\alpha}\subseteq\mathbb{X}\), \(\alpha=1,\dots n\). `segmenTier`

then solves the recursion:

where \(s(j,k,\alpha)\) is a scoring function that measures the similarity of a segment, from positions \(j\) to \(k\), to cluster \(\mathcal{C}_\alpha\), and \(M\) is a penalty incurred by the use of a new segment, a fixed cost for each jump that allows to fine-tune minimal segment lengths. The algorithm maximizes scores for breakpoints at which maximal inter-segment similarities are reached when switching between two distinct clusters at positions \(j\) (“\(\max_{\beta\ne\alpha}\)”). This recursion is implemented in `Rcpp`

for efficiency, and directly available as function `calculateScore`

.

Back-tracing the maximal scores \(S_{k,\alpha}\), function `backtrace`

, then provides both segment borders and segment cluster associations.

The main interface function `segmentClusters`

wraps the recursion and back-tracing functions and handles scoring function selection and additional parameters.

Three scoring functions are available. They all sum up a similarity measure between positions \(j\) and \(k\) and clusters \(\mathcal{C}\).

The first two rely on a clustering of all positions \(x\)

\begin{equation} s(j,k,\alpha) = \sum_{i=j}^k Q(\mathcal{C}_i,\mathcal{C}_\alpha) \end{equation}where \(\mathcal{C}_i\) is the cluster label of data row \(x_i\), and \(Q(\mathcal{C},\mathcal{D})\) is an, in principle arbitrary, similarity measure for the two clusters. The most basic choice is \(Q(\mathcal{C},\mathcal{C})=1\) and \(Q(\mathcal{C},\mathcal{D})=a<0\) for \(\mathcal{C}\ne\mathcal{D}\). This case is available as scoring function “ccls” (argument `S="ccls"`

) with a default value \(a=-2\), and requires as sole input a vector of cluster labels.

For our application, an RNA-seq time-series, the Pearson correlation between the cluster centroids (mean values) proofed useful: \(Q(\mathcal{C},\mathcal{D})= \text{corr}(\bar x_{\mathcal{C}},\bar x_{\mathcal{D}})\) is pre-calculated as a cluster-cluster correlation matrix by `segmenTier`

’s interface to `kmeans`

clustering (`clusterTimeseries`

), and available as scoring function “ccor” (`S="ccor"`

).

The third option does not rely on a clustering of all positions and allows to use cluster centroids that are independently derived, eg. from only a subset of the data. The scoring function

\begin{equation} s(j,k,\alpha) = \sum_{i=j}^k \tilde\sigma(x_i,\mathcal{C}_{\alpha}) \end{equation}measures the similarity of each data row \(x_i\) to a cluster centroid, is again implemented as Pearson correlations, \(\tilde\sigma(x_i,\mathcal{C}_{\alpha})= \text{corr}(x_i,\bar x_{\mathcal{C}_{\alpha}})\), pre-calculated by the clustering interface, and available as scoring function “icor” (`S="icor"`

).

For efficiency, Pearson correlations are calculated in a custom function in `Rcpp`

. Note that it is not necessary to pre-calculate all values of \(s(i,k,\mathcal{C}\), since we can simply subtract sums; for \(i>1\):

In summary, the primitive scoring function “ccls” requires only a vector of cluster labels as input. Scoring function “ccor” requires both a vector of cluster labels and a cluster-cluster similarity matrix, and scoring function “icor” requires only a position-cluster similarity matrix. These matrices, internally called `csim`

, are provided by `segmenTier`

’s clustering function `clusterTimeseries`

using Pearson correlations to cluster centroids, but can also be provided by the user, as outlined in section User-Defined Similarities.

It proofed useful to further emphasize correlation-based similarities by an exponent \(\epsilon>1\), which further weakens moderate positive or negative correlations, and this is available as argument `E`

for all scoring functions. Signs are preserved, allowing for even-valued exponents.

Additionally, one can define a nuisance cluster in pre-processing of the data, eg. total data values or data-cluster correlations below a certain threshold, and enforce such segments by using a higher similarity in combination with a lower length penalty (argument `Mn`

). Argument `nui`

of `segmentClusters`

will be the self-similarity of the nuisance segment, and `-nui`

the nuisance similarity to all other clusters. The exponent `E`

will also be applied to nuisance similarity `nui`

.

The figure below shows the effects of `E>1`

on Pearson correlations and `nui`

>1, ie. on the socring function. Detailed analysis of the effects of these parameters and the length penalty `M`

on segmentation is provided in Machné, Murray, and Stadler (2017) and demonstrated in the R demo “segment_data”.

`segmenTier`

’s clustering interface (`clusterTimeseries`

) pre-calculates the cluster-cluster (“ccor”) and position-cluster (“icor”) similarity matrices and adds them to the “clustering” object, list item `csim`

in the object that is passed to the recursion function as a look-up table.

Advanced users can provide similarity measures themselves by constructing `csim`

where an input cluster labeling (argument `seq`

) must be integers that serve as column and row indices of this matrix, ie. similarity between a cluster “2” and a cluster “3” is stored in `csim[2,3]`

for scoring function “ccor”.

For scoring function “icor” columns are also indexed by cluster labels, and rows are the indices at positions \(i\), ie. `csim[345,4]`

is the similarity of data row 345 to the cluster labeled “4”. Scoring function “icor” in principle does not require cluster labels for all positions. However, in the current implementation a cluster label vector must still be supplied. It can consist of arbitrary values, EXCEPT for positions pre-determined as “nuisance” with fixed similarities `nui`

. At such position the cluster label should be “0”.

Scaling `E`

and nuisance similarities `nui`

can be applied to user-defined similarity matrices, but when choosing `E`

=1 (default) and simply NOT supplying any clusters labels “0” they will have no effect.

In summary, users that wish to define their own similarities are best advised to provide these similarities as a numeric matrix, argument `csim`

in the main function `segmentClusters`

. The reported cluster labels of segments are the column indices of `csim`

. For scoring function “ccor” each position must be assigned to a cluster label and `csim`

is a cluster-cluster similarity matrix. For scoring function “icor”, recommended for this custom use of the algorithm, `csim`

is a position-cluster similarity matrix and cluster labels in argument `seq`

allow to define a nuisance cluster. Construction of `csim`

is demonstrated in the R demo `segment_test`

.

Segmentation by custom similarities is simple, see section User-Defined Similarities. However, `segmenTier`

provides a pipeline of time-series processing and clustering functions that is to some extent specific to the data for which the algorithm was developed, but should also be generally applicable.

The function `processTimeseries`

prepares a time-series, with time points in columns, and individual measurements in rows, for segmentation. The function produces a list, an S3 object of class “timeseries”, that serves as input for the clustering function. It is generally applicable to provide the input for the clustering wrapper, and raw input data will be clustered and segmented when setting options `use.fft=FALSE`

and `trafo="raw"`

, and use `na2zero=TRUE`

to avoid interpretation of NA values as 0 (which is useful for positive valued signals where absence of data means absence of signal, eg. sequencing-based data (“read-counts”)).

The algorithm had been originally designed for periodic data, and `processTimeseries`

can also perform a Discrete Fourier Transform (DFT, `use.fft=TRUE`

) using the `stats`

package function `mvfft`

, including a permutation analysis (`perm>0`

) that provides p-values for all periodic components for the DFT. Clustering of the DFT of periodic data works well to distinguish time-courses with similar temporal profiles. This approach has been applied to microarray-based transcriptome data from “metabolic” or “respiratory oscillations” in budding yeast (Machné and Murray 2012) and “diurnal” or “circadian oscillations” of cyanobacteria (Lehmann et al. 2013, Beck et al. (2014)). This approach is implemented in the function `clusterTimeseries`

, using a simple `k-means`

clustering of the passed “timeseries” object.

In the original DFT-based clustering approach a better clustering of the DFT of periodic data was obtained by using a model-based clustering algorithm that allows for tailed distributions, as implemented in the BioConductor package `flowClust`

. This is available in the function `flowclusterTimeseries`

. However, this is much slower than `k-means`

and not recommended in the context of segmentation. Future implementations will fuse different clustering methods in one function, but likely be implemented in a distinct R package (see `clusterTimeseries2`

in github package `segmenTools`

).

However, within `segmenTier`

the output of `clusterTimeseries`

, S3 object of class “clustering”, serves as input for segmentation and provides the similarity matrices required for scoring functions “icor” and “ccor”. The correct matrix is automatically selected by `segmentClusters`

.

The next development cycle of this package should allow for more efficient sweeping of “long” data sets, eg. genome-wide data:

- More generic function to provide input for segmentation, eg.
- A function that takes cluster centers and data as inputs and provides a “clustering” object with
`csim`

similarity matrices for`segmentClusters`

, and - A maximal length parameter in the recursion, to sweep across long data-sets, removing the pre-segmentation requirement,
- A subset clustering utility, that samples a reasonable subset of the data for clustering, and supplies cluster centers.

From CRAN:

`install.packages("segmenTier")`

The development version can be obtained from github using `devtools`

:

```
library(devtools)
install_github("raim/segmenTier", subdir = "pkg")
```

```
library(segmenTier)
data(primseg436) # RNA-seq time-series data
# Fourier-transform and cluster time-series:
tset <- processTimeseries(ts=tsd, na2zero=TRUE, use.fft=TRUE,
dft.range=1:7, dc.trafo="ash", use.snr=TRUE)
cset <- clusterTimeseries(tset, K=12)
```

```
## Timeseries N 7624
## Used datapoints 7374
## Clusters K 12
```

```
# ... segment it:
segments <- segmentClusters(seq=cset, M=100, E=2, nui=3, S="icor")
```

```
## Scoring matrix 20190209 14:04:17
## parameters function:icor; scale:2; max/min:max
## Backtracing 20190209 14:04:24
## parameters multib:max
## elapsed, sec 6
## Done at 20190209 14:04:24
```

```
# and inspect results:
plotSegmentation(tset, cset, segments, cex=.5, lwd=2)
```

`print(segments)`

```
##
## Similarity-based segmentation by dynamic programming:
## Total length: 7624
## Segments:
## CL start end
## [1,] 2 52 439
## [2,] 9 440 712
## [3,] 8 713 4160
## [4,] 7 4306 4831
## [5,] 11 4832 6455
## [6,] 10 6456 6989
## [7,] 2 6990 7574
##
## Parameters:
## k S E M Mn a nui multi nextmax multib
## segments 1 icor 2 100 20 -2 3 max TRUE max
##
## Run time: 6.226 seconds
```

```
## and get segment border table for further processing
head(segments$segments)
```

```
## CL start end
## [1,] 2 52 439
## [2,] 9 440 712
## [3,] 8 713 4160
## [4,] 7 4306 4831
## [5,] 11 4832 6455
## [6,] 10 6456 6989
```

Usage of the package is further demonstrated in two R demos.

The main low level interface to the algorithm, function `segmentClusters`

, is demonstrated in the file `demo/segment_test.R`

. It produces Supplemental Figure S1 of Machné, Murray, and Stadler (2017).

To run it as a demo in R simply type:

`demo("segment_test", package = "segmenTier")`

A real-life data set is processed, clustered and segmented with varying parameters in `demo/segment_data.R`

.

This demo runs quite long, since it calculates many segmentations. It provides a comprehensive overview of the effects of segmentation parameters `E`

, `M`

and `nui`

, and produces (among others) Figure 3 and Supplemental Figures S4a and S4b of Machné, Murray, and Stadler (2017).

`demo("segment_data", package = "segmenTier")`

Beck, C., S. Hertel, A. Rediger, R. Lehmann, A. Wiegard, A. Kolsch, B. Heilmann, J. Georg, W.R. Hess, and I.M. Axmann. 2014. “Daily Expression Pattern of Protein-Encoding Genes and Small Noncoding RNAs in *Synechocystis* Sp. Strain PCC 6803.” *Appl Environ Microbiol* 80 (17): 5195–5206. doi:10.1128/AEM.01086-14.

Lehmann, R., R. Machné, J. Georg, M. Benary, I. M. Axmann, and R. Steuer. 2013. “How Cyanobacteria Pose New Problems to Old Methods: Challenges in Microarray Time Series Analysis.” *BMC Bioinformatics* 14 (1): 133. doi:10.1186/1471-2105-14-133.

Machné, R., and D.B. Murray. 2012. “The Yin and Yang of Yeast Transcription: Elements of a Global Feedback System Between Metabolism and Chromatin.” *PLoS One* 7 (6): e37906. doi:10.1371/journal.pone.0037906.

Machné, R., D.B. Murray, and P.F. Stadler. 2017. “Similarity-Based Segmentation of Multi-Dimensional Signals.” *Sci Rep* 7 (1): 12355. doi:10.1038/s41598-017-12401-8.