Safe Correlate Tutorial

I wrote a piece of IDL code to more “safely” (i.e. conservatively) compute the probability that data are correlated (or, more accurately, the probability that they are not). I was later asked to share this code, so I wrote this tutorial on its use.

Operating principle and basic use

Frequently in my scientific endeavors I want to find out, quantitatively, whether some data, x, are correlated with another set of data, y. There are a host of statistical methods to this end. These take the x and y data and generally compute some coefficient of correlation along with the probability that uncorrelated data (the null hypothesis) could explain the arrangement of the data, given varying assumptions. (I’ll call that probability of no correlation the PTE, for probability to exceed.) Here’s an example of a perfect correlation, a line:

N = 10
x = findgen(N)+1
y = findgen(N)+1
p = plot(x, y, 'o', sym_size=0.5, /sym_filled, xtitle='x', ytitle='y')

[Figure: the perfectly correlated points, falling on a line]

(This and the rest of the code snippets in this tutorial will be in IDL, since that is what SAFE_CORRELATE is written in.) Unsurprisingly, the Spearman Rank Correlation Test indicates that these points are almost certainly correlated.

result = r_correlate(x, y)
pte = result[1]
print, pte
    0.000000

I use the Spearman test because, according to the wisdom of Wall and Jenkins (Practical Statistics for Astronomers, Cambridge University Press, 2003), it is the recommended fallback test when one is not sure exactly what type of relationship (e.g. y ~ x or y ~ exp(x)?) the data have. This test also forms the core of the SAFE_CORRELATE function that I wrote. However, rarely in science do the data have negligible uncertainty. What if the data have large uncertainties relative to their separations? For example, what if the error bars on the points in that line are huge?

err = N*0.25
xerrvec = replicate(err,N)
yerrvec = replicate(err,N)
ep = errorplot(x, y, xerrvec, yerrvec, 'o', sym_size=0.5, /sym_filled, xtitle='x', ytitle='y')

[Figure: the line with large error bars]

Are these data actually correlated, or do they just happen to land on a line? Based on their uncertainties, in another universe these data might have been arranged like this:

seed = 0
x1 = x + randomn(seed, N)*err
y1 = y + randomn(seed, N)*err
ep1 = errorplot(x1, y1, xerrvec, yerrvec, 'o', sym_size=0.5, /sym_filled, xtitle='x', ytitle='y')

[Figure: an alternative arrangement of the points, given the errors]

If this is the way the points are truly arranged, then they certainly don’t look as correlated. Indeed, now the Spearman test gives a >17% PTE for uncorrelated data.

result = r_correlate(x1,y1)
pte = result[1]
print,pte
    0.173939

This is where SAFE_CORRELATE comes in. It generates many possible realizations of the data given the uncertainties and computes the PTE for each. These are then averaged to estimate the overall PTE given the uncertainties.

Why average? There are a few ways to think of why averaging the PTEs is the proper choice. Perhaps the most rigorous is to think of this process as sampling the PDF of the Spearman rank correlation coefficients and computing the expectation value of the PTE over that distribution.

However, I prefer the following informal reasoning: if Nsim simulated datasets are generated, each has a “probability” of occurring of roughly 1/Nsim. Thus, the probability that a given simulated dataset is the true arrangement of the data and that the data are uncorrelated is its PTE/Nsim. We want to know the probability that the data are uncorrelated if any one of the simulated datasets is correct. In statistical terms, this is the probability that the data are arranged as in simulated dataset 1 and are uncorrelated, or as in dataset 2 and are uncorrelated, and so on: (dataset 1 & uncorrelated) | (dataset 2 & uncorrelated) | … | (dataset Nsim & uncorrelated). Because these possibilities are mutually exclusive, their probabilities add, so this amounts to summing all of the PTE/Nsim terms, which is the same as averaging the PTEs.
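
To make that reasoning concrete, here is a minimal sketch of the idea for the simple case of Gaussian errors given as scalars or vectors. This is not the actual SAFE_CORRELATE source, and the name MC_PTE_SKETCH is just for illustration: it perturbs the data within their errors, computes the Spearman PTE for each realization with R_CORRELATE, and returns the mean.

function mc_pte_sketch, x, y, xerr, yerr, nsim=nsim, seed=seed
  ;a sketch of the Monte Carlo averaging idea, not the real SAFE_CORRELATE:
  ;perturb the data within their (assumed Gaussian) errors, compute the
  ;Spearman PTE for each realization, and average the results
  if n_elements(nsim) eq 0 then nsim = 10000L
  n = n_elements(x)
  ptes = dblarr(nsim)
  for i = 0L, nsim-1 do begin
    xsim = x + randomn(seed, n)*xerr ;one possible "true" arrangement
    ysim = y + randomn(seed, n)*yerr
    result = r_correlate(xsim, ysim)
    ptes[i] = result[1] ;PTE for the null hypothesis of no correlation
  endfor
  return, mean(ptes) ;the average, i.e. total(ptes)/nsim
end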

SAFE_CORRELATE takes the data, their respective errors, and the optional keywords NSIM (the number of simulated datasets to generate) and SEED (used with IDL’s native random functions to ensure reproducible, pseudo-random output), and it returns the overall PTE for the null hypothesis of uncorrelated data. The errors can be single values, vectors of values, or arrays of points sampling the PDF or likelihood distribution of each datum (more on these more complicated cases in the next section). Does it work? Running SAFE_CORRELATE on the example above gives

print,safe_correlate(x,y,err,err,seed=seed)
    0.13172031

The probability that those data are uncorrelated in light of their uncertainties is >13%, not the negligible value that running R_CORRELATE on the data without errors suggested. What if the uncertainties are smaller?

err_small = N*0.1
xerrvec_small = replicate(err_small,N)
yerrvec_small = replicate(err_small,N)
ep_small = errorplot(x, y, xerrvec_small, yerrvec_small, 'o', sym_size=0.5, /sym_filled, xtitle='x', ytitle='y')
print,safe_correlate(x,y,err_small,err_small)
    0.0010552892

[Figure: the line with small error bars]

Now the probability that the data are uncorrelated is only about 0.1%. Alternatively, if the errors are huge

err_huge = N
xerrvec_huge = replicate(err_huge,N)
yerrvec_huge = replicate(err_huge,N)
ep_huge = errorplot(x, y, xerrvec_huge, yerrvec_huge, 'o', sym_size=0.5, /sym_filled, xtitle='x', ytitle='y')
print,safe_correlate(x,y,err_huge,err_huge,seed=seed)
    0.49747334

[Figure: the line with huge error bars]

With such massive uncertainties, the data are as likely to be uncorrelated as correlated, just as intuition would suggest.
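
To check that intuition across a range of error sizes, one could sweep the error scale and watch the PTE climb from near zero toward ~0.5 as the errors swamp the data. Here is a quick sketch, reusing the x, y, N, and seed variables from above:

;sweep the error scale and print the resulting PTE for the line example
scales = [0.05, 0.1, 0.25, 0.5, 1.0]
for i = 0, n_elements(scales)-1 do begin &$
  err_i = N*scales[i] &$
  print, scales[i], safe_correlate(x, y, err_i, err_i, seed=seed) &$
endfor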

More sophisticated examples

Non-uniform (but still Gaussian) uncertainties

If each point has a different uncertainty, these can simply be supplied as vector input.

seed = 0.0 ;for reproducible results
N = 20
errscale = N*0.2
;generate some nonuniform errors
xerr = randomu(seed,N)*errscale
yerr = randomu(seed,N)*errscale
;generate the data from a line with errors
x = findgen(N) + randomn(seed,N)*xerr
y = findgen(N) + randomn(seed,N)*yerr
ep = errorplot(x, y, xerr, yerr, 'o', sym_size=0.5, /sym_filled, xtitle='x', ytitle='y')
;let's run 10^5 simulations, just to demonstrate the NSIM keyword
print,safe_correlate(x, y, xerr, yerr, seed=seed, nsim=1e5)
    0.0042871421

[Figure: data with non-uniform errors]

Upper/lower limits (or non-Gaussian errors)

Statistical limits on data values can often significantly strengthen or weaken a correlation by eye, but these can’t be included in a standard statistical correlation test. If the user has (or can approximate) the likelihood distributions for these limits, then they can be included in a SAFE_CORRELATE analysis. The same is true for data where points have likelihood distributions that are significantly non-Gaussian. SAFE_CORRELATE will use the actual distributions to generate random datasets. Here’s a manufactured example of such a case.

seed = 0.0
N = 10
;generate some data with identical Gaussian errors
xerr = N*0.3
x = findgen(N) + randomn(seed,N)*xerr + 2.0

;the y-values are not necessary since we are providing pdfs
;but the function has to have an argument, so...
y = !null
;-----
;manufacture various likelihood distributions (PDFs) for the y data
;make an Nx2xM array to contain the PDFs
M = 1000 ;number of points sampling each PDF
yerr = fltarr(N,2,M)
ysig = fltarr(N) ;std deviations for plotting errors
;first some upper limits: let's use a steep sigmoid curve
uplim = [2.5, 3.5, 4.5]
for i = 0,2 do begin &$
  values = findgen(M)/(M-1)*(uplim[i] + 1.0) &$
  likelihoods = 1.0/(1.0 + exp(5.0*(values - uplim[i])))/uplim[i] &$
  yerr[i,0,*] = values &$
  yerr[i,1,*] = likelihoods &$
endfor
;(note that the PDFs do not have to be normalized -- SAFE_CORRELATE will
; normalize them)

;now some asymmetric PDFs. How about some gamma distributions?
maxval = [5,6,7]
t = 0.7 ;scale parameter
for i = 0,2 do begin &$
 values = findgen(M)/(M-1)*maxval[i]*3.0 &$
 k = maxval[i]/t + 1 &$
 likelihoods = values^(k-1)*exp(-values/t)/t^k/gamma(k) &$
 yerr[i+3,0,*] = values &$
 yerr[i+3,1,*] = likelihoods &$
 ysig[i+3] = sqrt(k)*t &$
endfor

;and finally some regular old gaussians
mu = [8,9,10,11]
err = xerr/2.0
for i = 0,3 do begin &$
  values = (findgen(M)/(M-1) - 0.5)*xerr*6.0 + mu[i] &$
  likelihoods = exp(-(values - mu[i])^2/(2*err^2))/err/sqrt(2*!pi) &$
  yerr[i+6,0,*] = values &$
  yerr[i+6,1,*] = likelihoods &$
  ysig[i+6] = err &$
endfor
;let's have a look at those PDFs
values = yerr[*,0,*]
likelihoods = yerr[*,1,*]
p = plot(values[*],likelihoods[*],/nodata,xtitle='y value',$
                ytitle='unnormalized likelihood')
for i = 0,N-1 do begin &$
  !null = plot(yerr[i,0,*], yerr[i,1,*], /overplot) &$
endfor

[Figure: the manufactured likelihood distributions for the y data]

;------
;let's plot the points and errors (y = !null, so for plotting purposes use
;the upper-limit values and the PDF modes as nominal y positions)
ynom = [uplim, maxval, mu]
p = errorplot(x[3:*], ynom[3:*], replicate(xerr,N-3), ysig[3:*], 'o', $
              sym_size=0.5, /sym_filled, xtitle='x', ytitle='y')
p2 = errorplot(x[0:2], ynom[0:2], replicate(xerr,3), ysig[0:2], ' ', /overplot)
u = text(x[0:2], ynom[0:2], replicate('$\downarrow$',3), /data, alignment=0.5, $
         vertical_alignment=1)

[Figure: the data points, with upper limits marked by downward arrows]
;------
;Finally, the analysis
print,safe_correlate(x, y, xerr, yerr, seed=seed)
    0.056156299

;and what if we hadn't included the upper limits?
print,safe_correlate(x[3:*], y[3:*], xerr, yerr[3:*,*,*], seed=seed)
    0.20411673

Intuitively, the upper limits strengthen the evidence that the data are correlated. SAFE_CORRELATE quantifies this, giving a PTE of 6% when the limits are included versus 20% when they are not.