Parke Loyd

Safe Correlate Tutorial

I wrote a piece of IDL code to more “safely” (i.e. conservatively) compute the probability that data are correlated (or, more accurately, the probability that they are not). I was later asked to share this code, so I wrote this tutorial on its use.

Operating principle and basic use

Frequently in my scientific endeavors I want to find out, quantitatively, whether some data, x, are correlated with another set of data, y. There are a host of statistical methods to this end. These generally take the x-y data and compute some coefficient of correlation along with the probability that uncorrelated data (the null hypothesis) could explain the arrangement of the data under varying assumptions. (I'll call that probability of no correlation the PTE, probability to exceed.) Here's an example of a perfect correlation, a line:

N = 10
x = findgen(N)+1
y = findgen(N)+1
p = plot(x, y, 'o', sym_size=0.5, /sym_filled, xtitle='x', ytitle='y')

[Figure: the perfectly correlated data form a line]

(This and the rest of the code snippets in this tutorial will be in IDL, since that is what SAFE_CORRELATE is written in.) Unsurprisingly, the Spearman Rank Correlation Test indicates that these points are almost certainly correlated.

result = r_correlate(x, y)
pte = result[1]
print, pte
    0.000000

I use the Spearman test because, according to the wisdom of Wall and Jenkins (Practical Statistics for Astronomers, Cambridge University Press, 2003), it is the recommended fallback when one is not sure exactly what type of relationship (e.g. y ~ x or y ~ exp(x)?) the data follow. This test also forms the core of the SAFE_CORRELATE function that I wrote. However, rarely in science is it the case that the data have negligible uncertainty. What if the data have large uncertainties relative to their separations? For example, what if the error bars on the points in that line are huge?

err = N*0.25
xerrvec = replicate(err,N)
yerrvec = replicate(err,N)
ep = errorplot(x, y, xerrvec, yerrvec, 'o', sym_size=0.5, /sym_filled, xtitle='x', ytitle='y')

[Figure: the same line with large error bars]

Are these data actually correlated, or do they just chance to land on a line? Given their uncertainties, in another universe these data might have been arranged like this:

seed = 0
x1 = x + randomn(seed, N)*err
y1 = y + randomn(seed, N)*err
ep1 = errorplot(x1, y1, xerrvec, yerrvec, 'o', sym_size=0.5, /sym_filled, xtitle='x', ytitle='y')

[Figure: an alternative realization of the data, given the error bars]

If this is the way the points are truly arranged, then they certainly don’t look as correlated. Indeed, now the Spearman test gives a >17% PTE for uncorrelated data.

result = r_correlate(x1,y1)
pte = result[1]
print,pte
    0.173939

This is where SAFE_CORRELATE comes in. It generates many possible realizations of the data given the uncertainties and computes the PTE for each. These are then averaged to estimate the overall PTE given the uncertainties.

Why average? There are a few ways to think of why averaging the PTEs is the proper choice. Perhaps the most rigorous is to think of this process as sampling the PDF of the Spearman rank correlation coefficients and computing the expectation value of the PTE over that distribution.

However, I prefer the following informal reasoning: if N simulated datasets are generated, each has a "probability" of occurring of roughly 1/N. Thus, the probability that a given simulated dataset is the true arrangement of the data and that the data are uncorrelated is its PTE/N. We want to know the probability that the data are uncorrelated if any one of the simulated datasets is correct. In statistical terms, this is the probability that the data are arranged as in simulated dataset 1 and are uncorrelated, or as in dataset 2 and are uncorrelated, and so on: (dataset 1 & uncorrelated) | (dataset 2 & uncorrelated) | … | (dataset N & uncorrelated). Mathematically, this amounts to summing all of the PTE/N values, which is the same as averaging the PTEs.
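
(For readers more comfortable in Python than IDL, here is a minimal sketch of this perturb-and-average idea using scipy.stats.spearmanr. It illustrates the principle only, not the IDL SAFE_CORRELATE implementation, and the function and keyword names below are my own.)

import numpy as np
from scipy.stats import spearmanr

def safe_correlate_sketch(x, y, xerr, yerr, nsim=10000, seed=None):
    """Average Spearman PTEs over simulated datasets drawn from Gaussian errors."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    ptes = np.empty(nsim)
    for i in range(nsim):
        # perturb each point according to its uncertainty
        xsim = x + rng.normal(0.0, xerr, size=x.shape)
        ysim = y + rng.normal(0.0, yerr, size=y.shape)
        # PTE of the no-correlation null hypothesis for this realization
        ptes[i] = spearmanr(xsim, ysim).pvalue
    # average the PTEs over all of the simulated datasets
    return ptes.mean()

# the noisy line from the example above
x = np.arange(1.0, 11.0)
y = np.arange(1.0, 11.0)
print(safe_correlate_sketch(x, y, 2.5, 2.5, seed=0))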

SAFE_CORRELATE takes the data, their respective errors, and the optional keywords NSIM (for number of simulated datasets to generate) and SEED (to use with IDL’s native random functions to ensure reproducible, pseudo-random output) and returns the overall PTE for the null-hypothesis of uncorrelated data. The errors can be single values, vectors of values, or points sampling the PDF or likelihood distribution for each point (more on these more complicated cases in the next section). Does it work? Running SAFE_CORRELATE on the example above gives

print,safe_correlate(x,y,err,err,seed=seed)
    0.13172031

The probability that those data are uncorrelated in light of their uncertainties is >13%, not negligible as running R_CORRELATE on the data without errors suggested. What if the uncertainties are smaller?

err_small = N*0.1
xerrvec_small = replicate(err_small,N)
yerrvec_small = replicate(err_small,N)
ep_small = errorplot(x, y, xerrvec_small, yerrvec_small, 'o', sym_size=0.5, /sym_filled, xtitle='x', ytitle='y')
print,safe_correlate(x,y,err_small,err_small)
    0.0010552892

[Figure: the line with small error bars]

Now the probability that the data are uncorrelated is only about 0.1%. Alternatively, if the errors are huge:

err_huge = N
xerrvec_huge = replicate(err_huge,N)
yerrvec_huge = replicate(err_huge,N)
ep_huge = errorplot(x, y, xerrvec_huge, yerrvec_huge, 'o', sym_size=0.5, /sym_filled, xtitle='x', ytitle='y')
print,safe_correlate(x,y,err_huge,err_huge,seed=seed)
    0.49747334

[Figure: the line with huge error bars]

With such massive uncertainties, the data are as likely to be uncorrelated as correlated, just as intuition would suggest.

More sophisticated examples

Non-uniform (but still Gaussian) uncertainties

If each point has a different uncertainty, these can simply be supplied as vector input.

seed = 0.0 ;for reproducible results
N = 20
errscale = N*0.2
;generate some nonuniform errors
xerr = randomu(seed,N)*errscale
yerr = randomu(seed,N)*errscale
;generate the data from a line with errors
x = findgen(N) + randomn(seed,N)*xerr
y = findgen(N) + randomn(seed,N)*yerr
ep = errorplot(x, y, xerr, yerr, 'o', sym_size=0.5, /sym_filled, xtitle='x', ytitle='y')
;let's run 10^5 simulations, just to demonstrate the NSIM keyword
print,safe_correlate(x, y, xerr, yerr, seed=seed, nsim=1e5)
    0.0042871421

[Figure: data with nonuniform error bars]

Upper/lower limits (or non-Gaussian errors)

Statistical limits on data values can often significantly strengthen or weaken a correlation by eye, but these can’t be included in a standard statistical correlation test. If the user has (or can approximate) the likelihood distributions for these limits, then they can be included in a SAFE_CORRELATE analysis. The same is true for data where points have likelihood distributions that are significantly non-Gaussian. SAFE_CORRELATE will use the actual distributions to generate random datasets. Here’s a manufactured example of such a case.
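
(Under the hood, drawing random datasets from tabulated value-likelihood pairs like these amounts to inverse-transform sampling. Here is a short Python sketch of that idea; it is my own illustration of the technique, not SAFE_CORRELATE's internals.)

import numpy as np

def sample_from_tabulated_pdf(values, likelihoods, nsamples, rng=None):
    """Draw samples from a PDF tabulated as (value, likelihood) pairs.

    The tabulated PDF does not need to be normalized."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, float)
    pdf = np.asarray(likelihoods, float)
    # build the cumulative distribution by trapezoidal integration, then normalize
    cdf = np.concatenate(([0.0], np.cumsum(0.5*(pdf[1:] + pdf[:-1])*np.diff(values))))
    cdf /= cdf[-1]
    # invert the CDF at uniform random deviates
    return np.interp(rng.uniform(size=nsamples), cdf, values)

# example: realizations consistent with a sigmoid-shaped upper limit near y = 2.5
vals = np.linspace(0.0, 3.5, 1000)
like = 1.0/(1.0 + np.exp(5.0*(vals - 2.5)))
draws = sample_from_tabulated_pdf(vals, like, 10000, rng=np.random.default_rng(0))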

seed = 0.0
N = 10
;generate some data with identical Gaussian errors
xerr = N*0.3
x = findgen(N) + randomn(seed,N)*xerr + 2.0

;the y-values are not necessary since we are providing pdfs
;but the function has to have an argument, so...
y = !null
;-----
;manufacture various likelihood distributions (PDFs) for the y data
;make an Nx2xM array to contain the PDFs
M = 1000;number of points sampling each PDF
yerr = fltarr(N,2,M)
ysig = fltarr(N) ;std deviations for plotting errors
;first some upper limits: let's use a steep sigmoid curve
uplim = [2.5, 3.5, 4.5]
for i = 0,2 do begin &$
  values = findgen(M)/(M-1)*(uplim[i] + 1.0) &$
  likelihoods = 1.0/(1.0 + exp(5.0*(values - uplim[i])))/uplim[i] &$
  yerr[i,0,*] = values &$
  yerr[i,1,*] = likelihoods &$
endfor
;(note that the PDFs do not have to be normalized -- SAFE_CORRELATE will
; normalize them)

;now some asymmetric PDFs. How about some gamma distributions?
maxval = [5,6,7]
t = 0.7;scale parameter
for i = 0,2 do begin &$
 values = findgen(M)/(M-1)*maxval[i]*3.0 &$
 k = maxval[i]/t + 1 &$
 likelihoods = values^(k-1)*exp(-values/t)/t^k/gamma(k) &$
 yerr[i+3,0,*] = values &$
 yerr[i+3,1,*] = likelihoods &$
 ysig[i+3] = sqrt(k)*t &$
endfor

;and finally some regular old gaussians
mu = [8,9,10,11]
err = xerr/2.0
for i = 0,3 do begin &$
  values = (findgen(M)/(M-1) - 0.5)*xerr*6.0 + mu[i] &$
  likelihoods = exp(-(values - mu[i])^2/(2*err^2))/err/sqrt(2*!pi) &$
  yerr[i+6,0,*] = values &$
  yerr[i+6,1,*] = likelihoods &$
  ysig[i+6] = err &$
endfor
;let's have a look at those PDFs
values = yerr[*,0,*]
likelihoods = yerr[*,1,*]
p = plot(values[*],likelihoods[*],/nodata,xtitle='y value',$
                ytitle='unnormalized likelihood')
for i = 0,N-1 do begin &$
  !null = plot(yerr[i,0,*], yerr[i,1,*], /overplot) &$
endfor

[Figure: the manufactured likelihood distributions (PDFs) for the y data]

;------
;let's plot the points and errors
;(y is !null, so compute nominal y values from the PDF means just for plotting)
ynom = reform(total(yerr[*,0,*]*yerr[*,1,*], 3)/total(yerr[*,1,*], 3))
ynom[0:2] = uplim ;show the upper-limit points at their limiting values
p = errorplot(x[3:*], ynom[3:*], replicate(xerr,N-3), ysig[3:*], 'o', $
              sym_size=0.5, /sym_filled, xtitle='x', ytitle='y')
p2 = errorplot(x[0:2], ynom[0:2], replicate(xerr,3), ysig[0:2], ' ', /overplot)
u = text(x[0:2], ynom[0:2], replicate('$\downarrow$',3), /data, alignment=0.5, $
         vertical_alignment=1)
[Figure: the data points, with Gaussian errors, asymmetric errors, and upper limits]
;------
;Finally, the analysis
print,safe_correlate(x, y, xerr, yerr, seed=seed)
    0.056156299

;and what if we hadn't included the upper limits?
print,safe_correlate(x[3:*], y, xerr, yerr[3:*,*,*], seed=seed)
    0.20411673

Intuitively, the upper limits strengthen the evidence that the data are correlated. SAFE_CORRELATE quantifies this, giving a PTE of 6% when the limits are included versus 20% when they are not.

My First Transit Spectrum Code

I recently submitted an HST Cycle 22 proposal to observe the hot super-Earth 55 Cnc e. To estimate the feasibility of the proposed observations, I relied on the spectra for 55 Cnc e generated by Hu and Seager (2014). However, Hu and Seager considered only atmospheres with an atomic makeup of at least 50% H. This is problematic, because the ~2000 K equilibrium temperature of 55 Cnc e means that H, either atomic or molecular, can escape from the surface on timescales of less than 100 years. Atomic C and O (and certainly the molecular combinations thereof), on the other hand, do not escape, at least so long as the exobase temperature does not much exceed 10,000 K. Thus, if 55 Cnc e has an atmosphere at all, it is liable to be dominated by CO or CO2, not the H-bearing species predicted in Hu and Seager's photochemical model.

55 Cnc e System

55 Cnc e and its host star: one of the few star-planet pairs that can be shown to scale! (This is a rough attempt to do so, at least.)

 

Unfortunately, I did not have enough time leading up to the proposal deadline to create my own tools for simulating the transit spectrum of a planet so that I could examine C-O dominated atmospheres. Yet it has been bugging me, and I would like to have this tool both for my understanding and for future research/proposals. These past few weeks, I finally got around to it.

I looked to the paper by Benneke and Seager (2011) for a general roadmap in the creation of my code. However, their code is way more sophisticated than anything I can create in a few weeks of tinkering (especially since I am still getting familiar with my new favorite coding language, Python). Even if I could replicate their code exactly, it's not my style to follow a recipe unless I really have no idea what I am doing. In the end, I settled on writing a pipeline that computes the transit depth for a planet with a simple isothermal, single-species atmosphere, but that employs extensive absorption line data. This code will thus allow legitimate, if crude, estimates of a transit spectrum for a planet. Below I briefly describe the creation of this code, and the trials and tribulations therein, in several parts.

Computing absorption coefficients

Following Benneke and Seager, I made use of the HITEMP database of high-temperature absorption lines. Another, more extensive, database, HITRAN, is available for low-temperature lines. The two databases have the same form, so the functions I created for reading in and processing HITEMP line data should be easily extensible to HITRAN. (These data can be accessed by anonymous ftp to ftp://cfa-ftp.harvard.edu/pub/.)

The data provided by HITRAN/HITEMP and their proper use are described in a 1996 paper (see especially the appendices). In brief, the database provides the wavenumber [cm^-1] of each line and its intensity [cm^-1/(molecule cm^-2)]. These two are the most important parameters, but the database also provides pressure broadening coefficients, a pressure shift parameter, the lower state energy, and so on.

Aside: What is absorption line intensity?

Given my minimal background in absorption spectroscopy, I was mystified by the “intensity” parameter. I’m used to intensity being a term to express the flux per unit solid angle of a radiation field. So what is it in this context? I haven’t found an explanation that I find entirely satisfactory. After mulling over how this parameter is used for a while, here is what I have decided. Disclaimer: I give no guarantees that this is the “right” way to conceptualize the intensity.

If I were going to try to make a record of a single absorption line, I would measure the cross section for absorption as a function of wavenumber (well, wavelength really, but HITRAN/HITEMP use wavenumber, so let's stick with that). However, the resulting line profile changes according to the strength of the relevant broadening mechanisms (natural, pressure, Doppler), so to generalize my measurement I would compute the integral of the absorption cross section over wavenumber. Even if the broadening mechanisms change, this integral should be constant.

This would allow me to compute the absorption cross section as a function of wavelength, but only for a molecule that happens to be in the lower ro-vibrational state of this transition. In reality, only some fraction of the molecules in the absorbing gas will be in this state. So, if I wanted to compute the optical depth of a column, I would have to use Boltzmann statistics (assuming local thermodynamic equilibrium, of course) to determine the fraction of the molecules in the right state to produce the absorption line in question, then compute the column density of just those particles, and finally multiply by the absorption cross section to get the optical depth. However, I could just lump the calculation of what fraction of the molecules are in the proper state into my computation of absorption coefficients. This would simply amount to taking the product of the integral I computed earlier and the fraction of molecules in the right state. This is convenient, though it hides some of the physics.

This product is the “intensity” that the HITRAN/HITEMP databases provide. Thus, to compute the absorption, one first uses Boltzmann statistics to compute the intensity at the temperature in question (referencing from the intensity at 296 K that is provided in the database). Then this is multiplied by the line profile to determine the absorption cross section as a function of wavelength, and finally by the column density of the relevant species to compute the optical depth. This is how I understand the intensity.
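
(As a concrete, if simplified, illustration of that recipe, here is a Python sketch that scales a line intensity from the 296 K reference value using the standard HITRAN formula and then converts it to an optical depth. The partition-sum ratio is taken as an input, for reasons discussed below, and the function names are my own.)

import numpy as np

C2 = 1.4387769  # second radiation constant, hc/k, in cm K

def scale_intensity(S_ref, nu0, E_lower, T, Q_ratio, T_ref=296.0):
    """Scale a HITRAN/HITEMP line intensity from T_ref to temperature T.

    nu0     : line wavenumber [cm^-1]
    E_lower : lower-state energy [cm^-1]
    Q_ratio : Q(T_ref)/Q(T), the partition-sum ratio (computed externally)
    """
    boltzmann = np.exp(-C2*E_lower/T)/np.exp(-C2*E_lower/T_ref)           # state population
    stimulated = (1.0 - np.exp(-C2*nu0/T))/(1.0 - np.exp(-C2*nu0/T_ref))  # stimulated emission
    return S_ref*Q_ratio*boltzmann*stimulated

def optical_depth(nu, nu0, S, profile, N_column):
    """tau(nu) = column density [molecule cm^-2] x intensity [cm molecule^-1] x line profile [cm]."""
    return N_column*S*profile(nu - nu0)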

I should note that, by my formulation, it makes sense to express intensity in units of cm^2 molecule^-1 cm^-1 = cm molecule^-1. However, others generally express it as cm^-1/(molecule cm^-2). These units nudge the reader to think of intensity as a "per column density" quantity, but I prefer "the integral of cross sections over wavenumber, with a correction for state fractions."

Back to computing absorption coefficients

I wrote the Python module hitemp.py to utilize the HITEMP data. Making a function to parse the text-file data, hitemp.readlines, was fairly trivial. There are a lot of lines, and I knew from the start I wouldn't have the computing power to include them all. Consequently, I wrote readlines such that it reads in data only for lines that are both within the user-defined wavelength range of interest and that have an intensity above some fraction of the strongest line's intensity.

The other function contained in the hitemp module, hitemp.grid_P, sums the absorption coefficients of all lines down to the chosen intensity on a grid of wavenumber and pressure. Since I assume an isothermal atmosphere, the code only considers a single temperature, although it would be easy to modify it to create a datacube over wavenumber, pressure, and temperature. Unfortunately, the code requires that the ratio of the partition sum at the desired temperature to the partition sum at the reference temperature (296 K) be supplied as input. A FORTRAN code, TIPS, is included in the HITRAN database that estimates these partition sums, but it operates via an interactive command-line prompt. I simply didn't have the time to make this code play nice with Python, so for now one must compute the desired ratio of partition sums externally.

This part of the code presented an interesting problem. For the coarse wavelength grid I wished to use, the line profiles are very narrow relative to the wavelength spacing of the grid. Thus, simply computing the value of the line profile at each point on the wavelength grid is very, very wrong. A narrow line will often fall between two points on the grid and, essentially, be lost. In reality, if absorption from that line were observed using a low-resolution spectrograph, the spectrograph pixels would measure the average value of the absorption over the width of the pixel, not just the value at the center of the pixel.

This made gridding the absorption coefficients a much more intense numerical task. To properly average the lines over the grid, I created the general function intergolate.py, so named because it, in essence, integrates and interpolates at the same time. The function, given the coordinates of a curve and the endpoints of a set of bins, computes the average value of the curve within each bin. Thus, the integral of the result is the same as the integral of the input. This is not the case for normal interpolation. Besides line profiles, I expect this function to be very useful when dealing with probability density functions.
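
(Here is a minimal Python sketch of that bin-averaging idea, written as my own illustration rather than a copy of intergolate.py: it integrates the piecewise-linear curve exactly within each bin and divides by the bin width, so the integral of the output matches the integral of the input.)

import numpy as np

def intergolate_sketch(bin_edges, x, y):
    """Average the piecewise-linear curve y(x) over each bin defined by bin_edges."""
    bin_edges = np.asarray(bin_edges, float)
    # merge the curve's abscissas with the bin edges so every segment is linear
    xall = np.union1d(x, bin_edges)
    yall = np.interp(xall, x, y)
    # cumulative integral of the curve at each merged point (trapezoid rule)
    cum = np.concatenate(([0.0], np.cumsum(0.5*(yall[1:] + yall[:-1])*np.diff(xall))))
    # integral up to each bin edge, then the average value within each bin
    cum_edges = np.interp(bin_edges, xall, cum)
    return np.diff(cum_edges)/np.diff(bin_edges)

# a narrow line is diluted, but not lost, when the bins are much wider than the line
nu = np.linspace(0.0, 10.0, 100001)
line = np.exp(-0.5*((nu - 5.0)/0.01)**2)           # narrow Gaussian profile
coarse = intergolate_sketch(np.arange(0.0, 11.0), nu, line)
print(coarse.sum()*1.0)                            # ~0.0251 = 0.01*sqrt(2*pi), the line's integral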

Because of this issue of missing narrow lines, the grid_P option computes the Voigt profile of each line at points specifically chosen to clearly resolve the structure of the line with a minimum of computations. Then, the absorption coefficients (the product of the profile and intensity) are intergolated (not interpolated) onto the supplied wavenumber grid. This is done successively for each line at each pressure and the absorption coefficients are summed. It takes a really long time. The code could probably be dramatically sped up for coarse wavelength grids by using analytical expressions (or even approximations) to directly compute the integral of the line over bins corresponding to each wavenumber grid point, especially for the line wings, but that's a future project.
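
(For reference, the Voigt profile itself is easy to evaluate with SciPy. This is a sketch with my own variable names and unit choices, not the adaptive sampling scheme grid_P actually uses.)

import numpy as np
from scipy.special import voigt_profile  # available in scipy >= 1.7

KB = 1.380649e-16   # Boltzmann constant [erg/K]
C = 2.99792458e10   # speed of light [cm/s]

def voigt_line(nu, nu0, T, mass, gamma_pressure):
    """Normalized Voigt profile [cm] versus wavenumber nu [cm^-1].

    nu0            : line-center wavenumber [cm^-1]
    mass           : molecular mass [g]
    gamma_pressure : Lorentzian (pressure-broadening) HWHM [cm^-1]
    """
    # standard deviation of the Gaussian (Doppler) core in wavenumber units
    sigma = nu0/C*np.sqrt(KB*T/mass)
    return voigt_profile(nu - nu0, sigma, gamma_pressure)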

The end result: I can compute the cumulative (all lines) absorption coefficients over a wavelength-pressure grid for later interpolation to different pressures. As a side note, I found that pressure, at least for coarse wavelength grids, does not greatly affect the results because even strong pressure broadening still produces narrow lines relative to grid spacings of 10 cm^-1. So when applied to coarse spectra and atmospheres of moderate to low pressure, a pressure grid of only a few points suffices.

Spectral Resolution

Absorption coefficients of a set of molecular absorption combs for CO at resolutions of 0.2, 2, and 10 cm^-1. The peak absorption coefficients get "diluted" as the resolution is coarsened.

 

Generating the transit spectrum (or “probing the onion”)

Computationally, generating the transit spectrum is surprisingly fast once the absorption coefficients over a wavelength-pressure grid are available. I created the module transit_spectrum.py with the single function compute_depth to compute the transit depth as a function of wavenumber. The function assumes an isothermal atmosphere of temperature T and base pressure P0 at r0, the radius of the planet's surface or cloud deck. The user sets the lower pressure limit to consider (i.e. the maximum altitude), and the code then divides the atmosphere up to that limit into N shells of even thickness. At each pressure, the absorption coefficients are interpolated from the provided wavenumber-pressure grid.
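
(A minimal Python sketch of that kind of shell setup, with my own variable names and the simplifying assumption of constant gravity; it is not the actual compute_depth.)

import numpy as np

KB = 1.380649e-23  # Boltzmann constant [J/K]

def isothermal_shells(r0, P0, T, mu, g, P_top, nshells):
    """Divide an isothermal atmosphere into shells of equal thickness.

    r0, P0 : radius [m] and pressure [Pa] at the surface or cloud deck
    mu     : mean molecular mass [kg]
    g      : gravitational acceleration [m/s^2], assumed constant
    P_top  : lowest pressure (maximum altitude) to consider [Pa]
    Returns the shell boundary radii, pressures, and number densities.
    """
    H = KB*T/(mu*g)                                # scale height
    z_top = H*np.log(P0/P_top)                     # altitude where P drops to P_top
    r = r0 + np.linspace(0.0, z_top, nshells + 1)  # evenly spaced shell boundaries
    P = P0*np.exp(-(r - r0)/H)                     # exponential (isothermal) pressure profile
    n = P/(KB*T)                                   # ideal-gas number density [m^-3]
    return r, P, n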

Atmospheres

If we could spatially resolve a planet transit at two wavelengths, one in the core of an absorption line and another well outside of the line, it would look something like this. The apparent size of the planet varies, thereby producing different transit depths.

 

With the profile of the atmosphere established, the code then considers lines of sight (LOSs) passing through the atmosphere at the radius of each shell. Thus, an LOS at the outermost shell passes only through that shell. The next LOS down passes through both the outermost shell and the one beneath. The next down passes through three shells… so on and so forth. For each LOS, the path length within each shell (it varies according to the geometry – see figure below) is computed and multiplied by the density of molecules to determine the column density. This is then multiplied by the absorption coefficients to compute the optical depth for just that shell, and finally the optical depths due to each shell intersected by the LOS are summed.
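
(In Python, that chord geometry boils down to a few lines. Again, this is my own simplified sketch: a single absorption cross section stands in for the gridded absorption coefficients.)

import numpy as np

def los_optical_depths(r, n, sigma):
    """Optical depth of the line of sight tangent to each shell boundary.

    r     : shell boundary radii, increasing
    n     : number density at each boundary
    sigma : absorption cross section (scalar here, for one wavenumber)
    """
    nshells = len(r) - 1
    tau = np.empty(nshells)
    for i in range(nshells):
        b = r[i]                                # impact parameter of this LOS
        # chord segment within each shell j >= i (both sides of the tangent point)
        outer = np.sqrt(r[i+1:]**2 - b**2)
        inner = np.sqrt(r[i:-1]**2 - b**2)
        path = 2.0*(outer - inner)
        dens = 0.5*(n[i:-1] + n[i+1:])          # mean density within each shell
        tau[i] = sigma*np.sum(path*dens)        # column density times cross section
    return tau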

The Onion

The geometry of computing transit depths with atmospheric absorption.

 

The optical depth of each LOS can be used to compute the opacity of an annulus around the planetary disk. The product of the opacity and the area of the annulus gives the equivalent, fully opaque area that would absorb the same amount of light. The equivalent areas of the shells are summed together with the geometric size of the planet to determine the overall "effective area" of the planet transit. Finally, this is divided by the area of the stellar disk to yield the transit depth.
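
(And the last step, under the same simplifications: convert the per-LOS optical depths into annulus opacities, sum the equivalent opaque areas, and divide by the stellar disk.)

import numpy as np

def transit_depth(r, tau, r_star):
    """Transit depth at one wavenumber from the per-LOS optical depths.

    r      : shell boundary radii (r[0] is the opaque planet/cloud-deck radius)
    tau    : optical depths of the lines of sight tangent to r[0]...r[-2]
    r_star : stellar radius, in the same units as r
    """
    annulus = np.pi*(r[1:]**2 - r[:-1]**2)      # area of each annulus
    effective = (1.0 - np.exp(-tau))*annulus    # equivalent fully opaque area
    area = np.pi*r[0]**2 + effective.sum()      # opaque disk plus atmosphere
    return area/(np.pi*r_star**2)

# example flow (hypothetical inputs), chaining the three sketches together:
# r, P, n = isothermal_shells(...)
# tau = los_optical_depths(r, n, sigma_at_this_wavenumber)
# depth = transit_depth(r, tau, r_star)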

 The Results: Possible Spectra of 55 Cnc e

At long last, I used my code to generate spectra of 55 Cnc e for three different extreme cases: 1 bar atmospheres of pure H2O, CO, and CO2. I only looked at the wavelengths covered by HST’s Wide Field Camera 3 instrument with the G140L grism, since this is the instrument I proposed to use. The spectra all show prominent features with amplitudes of at least 20 ppm. Unsurprisingly, water produces the clearest absorption signature.

55 Cnc e Spectra

Spectra of pure 1 bar atmospheres for 55 Cnc e (all assume a radius of 2 R_Earth and a mass of 8 M_Earth).

 

I then used a code I created when writing my proposal to evaluate the capacity of G140L data to distinguish these spectra from a flat spectrum. (This code is ugly, as it was written both early in my days of Python coding and rather rushed, so I won't post it.) I found that it takes several tens of orbits worth of HST data to rule out these spectra at the 2-sigma level or better. Doing the same with a millibar instead of a bar of atmosphere requires over 100 orbits to rule out an atmosphere. So, if no spectral structure is evident in the 26 orbits I proposed, we should be able to rule out an atmosphere of about a bar. Ultimately, it would be nice to get down to millibar levels, since then we could place a strong constraint on the sub-telluric density of 55 Cnc e, which in turn would be evidence for an exotic, carbon-rich core (Madhusudhan et al. 2012; ApJ 759:40).

Voilà! I'm excited to now have a tool in my quiver, made all by myself, to use in future transit spectroscopy endeavors.

I Believe in Doubt

One of my sisters asked that each of my family members write a “This I Believe” essay, following the guidelines of the 1950s radio show. Below is the resulting essay I wrote on doubt. Perhaps one day I will also elucidate my belief in smelling the roses, or other core beliefs still buried in my subconscious.


Doubt

I believe in doubt. Indeed, I am working to become a practitioner of doubt, a scientist. By this, I do not mean that I avoid placing trust in friends, that I ignore all instincts, or that I deny faith. I mean simply that I make an effort, a conscious effort, to accept little as certain.

I remember a soccer practice in second grade when someone on my team sent the ball soaring high into the air with an impressive kick. A boy some twice my age announced that the ball was “airborne.”  I was in awe. As a child, I was fascinated by the mystery of flight. How did this older boy know the ball had achieved it? Yet he was older, therefore right. The ball was airborne.

Flight so enraptured me that I constantly groped for ways to understand how it was possible. At one point, I concluded that planes must simply fly so fast as to escape the pull of gravity. I reasoned that the wings are there just to keep them flying straight. When my father gently corrected this view, my conviction was so strong I simply refused to believe him. Faster than gravity just made so much sense.

Perhaps around middle school, I read a book that said planes fly because the shape of their wings forces the air above to move faster than the air below. A scientist named Bernoulli figured out that slower air exerts more pressure, so the wing is pushed up. Eureka: lift! Things soar. How foolish of me to think a round soccer ball might fly or to imagine that anything can be faster than gravity. Here was the answer to the mystery of flight.

Ah but it turns out not to be the only answer. There are others: three, as I would later learn. All predict lift just as well, just as accurately. Not only that, an adept physicist can arrive at all three beginning only with an equation describing the random motions of the particles in a gas. These deeper, more comprehensive explanations humbled me. Even now, I will not claim to understand, absolutely, how heavy things fly.

Time and again, I have been lulled into certainty by adept orators and appealing façades. Invariably I feel the embarrassment of my foolish trust when that certainty betrays its flaws. From the lessons of flight and the countless other questions I once thought I had answered, I have come to a single conclusion: I am certain that I should never be certain.

Because of the tremendous worth I assign to doubt, I am not an outspoken person, I am seldom resolute, I am frequently anxious, and I am rarely the winner in arguments. Yet also because of this, I avoid harming others through choices made in false confidence. Furthermore, I expose myself to deeper, more beautiful truths because I am willing to admit imperfections in those to which I cling. This is why I embrace doubt.

Written October 2013