Checking whether stringy05 is suitable as a projection pursuit index

Published

16 Jun 2026

Goal

Before using stringy05 inside a guided tour, I first check whether it behaves like a useful projection pursuit index.

For data matrix \(X \in \mathbb{R}^{n \times p}\) and projection basis \(A \in \mathbb{R}^{p \times 2}\), the projected data are

\[ Y = XA. \]

The projection pursuit index is then calculated on \(Y\).

Here I compare two versions of the index:

  • stringy05(rescale = FALSE): the original stringy05 index.
  • stringy05(rescale = TRUE): a rescaled version designed to reduce the influence of values that commonly arise from Gaussian noise.

In previous work, I estimated the distribution of the stringy05 index under bivariate Gaussian noise across different sample sizes. Using the estimated upper tail of this noise distribution, I derived a sample-size-dependent threshold

\[ \ell(n) = 0.05 + \frac{3.86}{\sqrt{n}}. \]

The rescaled index is then defined as

\[ I_{\text{rescaled}} = \max \left( 0, \frac{I - \ell(n)} {1 - \ell(n)} \right), \]

where \(I\) is the original stringy05 value.

This transformation preserves the upper end of the index while shrinking values that are likely to arise from Gaussian noise. Values below the estimated noise threshold are mapped to zero, and values above the threshold are linearly rescaled to the interval \([0,1]\).
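
As a quick worked example of the two formulas above, the sketch below evaluates the threshold and the rescaling for \(n = 300\), the sample size used later in this post. The helper names noise_threshold() and rescale_index() and the example index values are hypothetical and chosen only for illustration.

```r
# Threshold l(n) = 0.05 + 3.86 / sqrt(n), as defined above
noise_threshold <- function(n) 0.05 + 3.86 / sqrt(n)

# Rescaled index: zero below the threshold, linear above it
rescale_index <- function(index_value, n) {
  lb <- noise_threshold(n)
  pmax(0, (index_value - lb) / (1 - lb))
}

n <- 300
threshold_300 <- noise_threshold(n)  # about 0.273

# Illustrative raw values: below, at, and above the threshold, plus the maximum
example_values <- c(0.10, threshold_300, 0.60, 1.00)
rescaled_values <- rescale_index(example_values, n)

print(round(threshold_300, 3))
print(round(rescaled_values, 3))
```

The first two values map to zero, a raw value of 1 stays at 1, and anything in between is stretched linearly over the remaining range.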

The motivation is that random noise projections can sometimes produce non-zero stringy values, potentially making optimisation more difficult. By incorporating an estimated noise threshold, the rescaled version aims to increase separation between structured projections and noise projections, allowing the optimiser to focus more strongly on projections containing genuine stringy structure.

Code
# Load packages
library(tourr)
library(cassowaryr)
library(spinebil)
library(ferrn)
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)

Define the two index functions

# Rescaling function for stringy05
rescale_stringy05 <- function(z, n) {
  lb <- 0.05 + 3.86 / sqrt(n)
  pmax(0, (z - lb) / (1 - lb))
}

stringy05_raw <- function(mat) {
  cassowaryr::sc_stringy05(mat[, 1], mat[, 2])
}

stringy05_rescaled <- function(mat) {
  z <- cassowaryr::sc_stringy05(mat[, 1], mat[, 2])
  rescale_stringy05(z, nrow(mat))
}

Generate polynomial data

I create data where only two variables contain the hidden polynomial structure. The signal is placed in variables 2 and 3, while variables 1 and 4 are Gaussian noise. I also add a small amount of noise to the polynomial variables so the structure is not perfectly deterministic.

Code
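make_poly_data <- function(n = 300, p = 4, degree = 2, seed = 1050,
                           signal_noise_sd = 0.005) {
  # Variables 2 and 3 carry the signal, so at least three columns and a
  # polynomial of degree two or more are required
  stopifnot(p >= 3, degree >= 2)
  set.seed(seed)

  # Raw polynomial terms: column 1 is t, column 2 is t^2
  t <- seq(-1, 1, length.out = n)
  signal <- poly(t, degree = degree, raw = TRUE)

  # Start with Gaussian noise in every column
  x <- matrix(rnorm(n * p), nrow = n, ncol = p)

  # Overwrite variables 2 and 3 with the signal plus a little noise
  x[, 2] <- signal[, 1] + rnorm(n, sd = signal_noise_sd)
  x[, 3] <- signal[, 2] + rnorm(n, sd = signal_noise_sd)

  colnames(x) <- paste0("V", seq_len(p))

  as.data.frame(x)
}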

dat4 <- make_poly_data(n = 300, p = 4, degree = 2, seed = 1050)

The true interesting projection is the plane spanned by variables 2 and 3.

Code
basis_true <- diag(ncol(dat4))[, c(2, 3)]
basis_noise <- diag(ncol(dat4))[, c(1, 4)]
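
Both bases are built from columns of the identity matrix, so they should be orthonormal and projecting onto them should simply select the corresponding variables. The sketch below checks this using base R only; is_orthonormal(), the _demo objects, and the small matrix x_demo are hypothetical names introduced for illustration.

```r
p <- 4
basis_true_demo  <- diag(p)[, c(2, 3)]
basis_noise_demo <- diag(p)[, c(1, 4)]

# A basis A is orthonormal when t(A) %*% A is the identity matrix
is_orthonormal <- function(basis, tol = 1e-10) {
  gram <- crossprod(basis)
  max(abs(gram - diag(ncol(basis)))) < tol
}

true_ok  <- is_orthonormal(basis_true_demo)
noise_ok <- is_orthonormal(basis_noise_demo)

# Cross products between the two bases: all zero if the planes are orthogonal
cross_terms <- crossprod(basis_true_demo, basis_noise_demo)

# Projecting onto the true basis selects columns 2 and 3 of the data
x_demo <- matrix(as.numeric(seq_len(12)), nrow = 3, ncol = p)
selects_columns <- isTRUE(all.equal(x_demo %*% basis_true_demo,
                                    x_demo[, c(2, 3)]))

print(c(true_ok, noise_ok, selects_columns))
print(cross_terms)
```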

Before running the guided tour, I first plot the two signal variables directly. This is the target structure that the guided tour should find.

Code
ggplot(dat4, aes(x = V2, y = V3)) +
  geom_point() +
  theme(aspect.ratio = 1,
        axis.ticks = element_blank(),
        axis.text = element_blank())

First direct check: signal versus noise

A useful projection pursuit index should give a higher value to the structured projection than to a noise projection.

Code
direct_check <- tibble(
  projection = c("true polynomial plane", "noise plane"),
  raw = c(
    stringy05_raw(as.matrix(dat4) %*% basis_true),
    stringy05_raw(as.matrix(dat4) %*% basis_noise)
  ),
  rescaled = c(
    stringy05_rescaled(as.matrix(dat4) %*% basis_true),
    stringy05_rescaled(as.matrix(dat4) %*% basis_noise)
  )
)

knitr::kable(direct_check)

projection              raw        rescaled
true polynomial plane   0.9880098  0.9835106
noise plane             0.2421718  0.0000000

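If stringy05 is useful as a PPI, the true polynomial plane should have a larger value than the noise plane. It does for both versions: the raw index is 0.988 on the true plane versus 0.242 on the noise plane, and the rescaled index is 0.984 versus 0, because the raw noise value falls below the threshold \(\ell(300) \approx 0.273\).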

Rotation invariance check with spinebil

For a 2D guided tour, the index should measure the plane, not the orientation inside the plane. This means rotating the projected 2D data should not substantially change the index value.

Here I use spinebil::profile_rotation() to rotate the true 2D polynomial projection and calculate both index versions at each rotation step.

Code
d_true <- as.matrix(dat4) %*% basis_true

index_list <- list(
  stringy05_raw,
  stringy05_rescaled
)

index_labels <- c(
  "stringy05 raw",
  "stringy05 rescaled"
)

rotation_values <- spinebil::profile_rotation(
  d = d_true,
  index_list = index_list,
  index_labels = index_labels,
  n = 200
)


spinebil::plot_rotation(rotation_values)

The rotation profile is almost flat for both the raw and rescaled versions of stringy05. This suggests that the index is approximately rotation invariant: rotating the same 2D projection does not substantially change the index value. This is a good property for a projection pursuit index because the guided tour should evaluate the projection plane itself, not the arbitrary orientation of the points inside the 2D display.
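
To make the idea of a rotation profile concrete, here is a minimal base-R sketch of the same kind of check done by hand: rotate the 2D data through a full turn and record the index at each angle. It is not how spinebil implements profile_rotation(). The two indexes are stand-ins chosen because their behaviour is known in advance, not the stringy05 index, and every name in the block is hypothetical.

```r
# 2 x 2 rotation matrix for angle theta (in radians)
rotation_matrix <- function(theta) {
  matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow = 2)
}

# Evaluate index_fun on the rotated data at n_steps angles in [0, 2 * pi]
profile_rotation_by_hand <- function(d, index_fun, n_steps = 50) {
  angles <- seq(0, 2 * pi, length.out = n_steps)
  values <- vapply(
    angles,
    function(theta) index_fun(d %*% rotation_matrix(theta)),
    numeric(1)
  )
  data.frame(angle = angles, value = values)
}

# Stand-in index 1: spread of distances from the origin (rotation invariant)
radial_spread <- function(mat) sd(sqrt(rowSums(mat^2)))

# Stand-in index 2: absolute correlation (depends on the orientation)
abs_correlation <- function(mat) abs(cor(mat[, 1], mat[, 2]))

# Noise-free parabola, similar in shape to the true projection in this post
t_demo <- seq(-1, 1, length.out = 200)
d_demo <- cbind(t_demo, t_demo^2)

flat_profile <- profile_rotation_by_hand(d_demo, radial_spread)
wavy_profile <- profile_rotation_by_hand(d_demo, abs_correlation)

# Range of each profile: near zero means rotation invariant
print(diff(range(flat_profile$value)))
print(diff(range(wavy_profile$value)))
```

A flat profile like the first one is the desired behaviour. Passing the plain stringy05_raw function defined earlier as index_fun (which requires cassowaryr) would give a hand-rolled version of the check above.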

Trace from noise projection to true projection

Next, I examine how the index changes along an interpolated planned tour path from a noise projection to the true polynomial projection.

A useful index should increase as the projection approaches the hidden structure. If the index increases only very close to the true projection, an optimiser may struggle to find the structure.
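
For intuition about what such a path looks like, the sketch below builds one valid rotation path between the two planes by hand. Because the noise and true bases here are mutually orthogonal, a cosine and sine blend of the two stays orthonormal at every step. This is an illustrative assumption about the path, not necessarily the interpolation that tourr or spinebil generate, and all names in the block are hypothetical.

```r
p <- 4
basis_noise_demo <- diag(p)[, c(1, 4)]
basis_true_demo  <- diag(p)[, c(2, 3)]

# Basis at angle theta: theta = 0 is the noise plane, pi / 2 is the true plane
path_basis <- function(theta) {
  cos(theta) * basis_noise_demo + sin(theta) * basis_true_demo
}

angles <- seq(0, pi / 2, length.out = 11)
bases <- lapply(angles, path_basis)

# Largest deviation of t(A) %*% A from the identity at each step
orthonormal_error <- vapply(
  bases,
  function(a) max(abs(crossprod(a) - diag(2))),
  numeric(1)
)

# Frobenius distance between each projection matrix and the true one
dist_to_true <- vapply(
  bases,
  function(a) sqrt(sum((tcrossprod(a) - tcrossprod(basis_true_demo))^2)),
  numeric(1)
)

print(round(max(orthonormal_error), 12))
print(round(dist_to_true, 3))
```

The distance shrinks steadily from the noise plane to the true plane. Evaluating an index on as.matrix(dat4) %*% path_basis(theta) at each angle and plotting it against this distance gives a hand-made version of the trace.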

Code
m <- list(
  basis_noise,
  basis_true
)

trace_values <- spinebil::get_trace(
  d = as.matrix(dat4),
  m = m,
  index_list = index_list,
  index_labels = index_labels
)

spinebil::plot_trace(trace_values)

Both the raw and rescaled versions remain relatively flat until very close to the true projection, where the index increases sharply. This suggests that stringy05 has relatively low squintability, so an optimiser starting from a random projection may struggle to locate the structure.

Squintability

Squintability checks how early the index starts improving when moving from random projections toward the best projection. Here the theoretical best projection is the plane spanned by variables 2 and 3.

Code
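# Redefine both indexes as generators: functions that return the index
# function. sample_bases() below is given these names as strings via idx.
# These definitions replace the plain versions used in the earlier checks.
stringy05_raw <- function() {
  function(mat) {
    cassowaryr::sc_stringy05(mat[, 1], mat[, 2])
  }
}

stringy05_rescaled <- function() {
  function(mat) {
    z <- cassowaryr::sc_stringy05(mat[, 1], mat[, 2])
    rescale_stringy05(z, nrow(mat))
  }
}

set.seed(1050)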

dat4_mat <- as.matrix(dat4)

basis_squint_raw <- ferrn::sample_bases(
  idx = "stringy05_raw",
  data = dat4_mat,
  n_basis = 20,
  best = basis_true,
  min_proj_dist = 0.5,
  step_size = 0.02,
  parallel = FALSE,
  seed = 1050
)

basis_squint_rescaled <- ferrn::sample_bases(
  idx = "stringy05_rescaled",
  data = dat4_mat,
  n_basis = 20,
  best = basis_true,
  min_proj_dist = 0.5,
  step_size = 0.02,
  parallel = FALSE,
  seed = 1050
)

squint_raw <- ferrn::calc_squintability(
  basis_squint_raw,
  method = "ks",
  bin_width = 0.02
)

squint_rescaled <- ferrn::calc_squintability(
  basis_squint_rescaled,
  method = "ks",
  bin_width = 0.02
)

squint_raw
# PP index:     stringy05_raw
# No. of bases: 20 -> 1454
# method:       ks
  max_x max_d squint
  <dbl> <dbl>  <dbl>
1 0.284  2.01  0.571

Code
squint_rescaled
# PP index:     stringy05_rescaled
# No. of bases: 20 -> 1454
# method:       ks
  max_x max_d squint
  <dbl> <dbl>  <dbl>
1 0.284  2.14  0.609

Conclusion

To be written …

Reuse

All Rights Reserved