sburns.org

Miscellaneous Thoughts on the Academic STEM Brain Drain or, How I Learned to Stop Worrying and Leave The Lab

Thu, 10 Jul 2014 00:00:00 UTC

Many smart people have written pieces related to the academic brain drain. Please read the originals but they boil down to these ideas:

“Big Science” requires computational skills more lucratively rewarded in industry.
The current (read: abysmal, with no signs of improvement) funding situation requires scientists to devote an increasing proportion of their time to securing funding. Scientists get up in the morning hoping to explain the world, not write grants.

Having spent the better part of a decade in some kind of academic research role and with tomorrow my last day at Vanderbilt, here are some (required?) thoughts about the brain drain.

Engineering is not Science

My past research interests, especially neuroimaging, are computationally intensive. Not only are the data storage and compute time requirements for analyses high, but the need to verify that the computer is doing what you think (and publish!) is vital. This last part translates to writing analysis software that is both well-architected & correct. This requires building codebases not all at once but in a controlled, deliberate and iterative fashion. Eventually, people working in the computing-oriented sciences acquire a solid if not expert knowledge of modern computing systems, from the OS up through their application layers.

This process of software engineering has become ever more fascinating to me. In an effort to produce both state-of-the-art and reproducible scientific workflows, I’m continually searching for resources in the hope of applying sound software engineering principles to the generation of good science. However, deliberately engineering good science is not the same as scientific output, and that output alone is the differentiator between a successful academic scientist and a struggling grad student/adjunct faculty member.

Increasingly, universities care about the securing of funding (in no small part to pay for explosive otherwise-unrelated-to-research staff growth) and funding is generally routed to the testing of new hypotheses, not engineering better infrastructure to more efficiently run current projects.

Refactoring, unit-testing and version-control, pillars of professional software engineering, have little place in an environment where only results and not the maintainability of those results matter. I not only agree that computational skills nurtured in science education are more lucratively rewarded in industry, but that academia is a caustic environment to improve one’s software engineering ability. If you’re an engineer rather than scientist at heart, academia is a tough place to practice your trade.

Academic Classism

There is a very real class war in academia penalizing those who forgo the traditional undergrad-grad school-postdoc-tenure track faculty path. To a fault, I tend to get bored easily (see my desire above to constantly make existing things better) and I am positive that if I entered a Ph.D. program (assuming I could find one appealing enough), after five years I would have 3-4 mostly-finished projects and none on their own would lead to a dissertation.

The successful completion of a Ph.D. requires hard work, perseverance and tenacity; I’m not complaining about that. My beef lies in the bias toward awarding Ph.D.s to people who put their head down for 5 years and think about nothing other than their dissertation. I am not that person (especially now that I have a family) and I don’t think the skills required to complete a Ph.D. are necessarily required to be a good scientist. There is overlap but many good (or would-be) scientists don’t have the mindset to complete a Ph.D.

Herein lies the issue—in the university setting, people without a Ph.D. carry little to no political capital. They cannot run departments, hold tenure or execute their own research program. Perhaps the skills to successfully command these posts are fostered during Ph.D. work (I wouldn’t know). Without your Ph.D. you can’t be faculty and that makes you less useful in the eyes of the university because it’s much more difficult to generate income without a research or clinical program of your own. The staff designation places an upper bound on both career trajectory and compensation.

I find common threads between this and the disheartening trend of meritocracy in tech. As a white male of privilege, in no way do I intend to diminish the struggle minorities face in STEM but it’s not difficult to find analogies in the implicit biases generated within both the tech and academic communities. Both operate under the auspices of “the cream will rise to the top” without considering the diversity, makeup and ingredients of the cream.

Random Thoughts

I couldn’t congeal these into a narrative (not that there’s an underlying theme above!) and so in no particular order:

Universities face multiple, overlapping leaky pipelines due to many disparate internal and external forces.
Academia undervalues engineering talent. Said talent should no longer be afraid to leave academic jobs whose main advantage has always been stability. Current funding situations have removed or diminished this advantage.
Extremely smart people work in academia but university hiring practices (especially concerning little to no remote workforce) may diminish the chance for like-minded coworkers. I worked with brilliant people at all levels in Vanderbilt but very often felt like an orphan given my interests.
The need to write off capital depreciation for tax purposes may produce suboptimal computing environments in universities. I would much rather have deployed my bigger applications to EC2/Heroku/etc. but computing resources have to be spent on hardware. The need to own the hardware is silly.
Intellectual Property rights for staff at universities are generally not good. In particular, Vanderbilt considers books, papers & art to be owned by the creator while software and technology is owned by the institution. I’d love to hear a lawyer explain the difference with a straight face.
At least at Vanderbilt, faculty are given nearly a day a week to consult on external projects without university interference. Any and all external collaborations I worked on were negotiated through the institution with no addition to my bottom line.

Universities will hold a monopoly on interesting ideas for the near future, at least until those ideas become profitable. The need to better improve myself became too great and the call to a new line of work too loud.

Screencast for Motivation to use GitHub

Fri, 30 May 2014 00:00:00 UTC

I’m currently enrolled in Software Carpentry’s Online Instructor Training and our most recent assignment was to make a 2-3 minute screencast to motivate the usage of a tool or idea one might cover in a Software Carpentry Bootcamp. I chose to do mine on GitHub.

There are some things I don’t like about it (should have zoomed in on the browser as the editor & terminal become distracting, could have used some bullet points in the outro) but overall I’m happy with it.

If you’re at all interested in leading a Software Carpentry Bootcamp, I highly recommend the Instructor Training. I’ve learned very interesting ways to approach teaching & evaluating what I’ve (hopefully) taught.

A Recipe for Cortical Tractography Using Freesurfer Labels

Sat, 03 May 2014 00:00:00 UTC

In this post, I’m going to describe a method I’ve been working on for performing probabilistic tractography in a single subject using automatically generated cortical labels from Freesurfer. This isn’t new or groundbreaking; however, the description of such methods as found in a paper is generally a lossy transmission of the actual code used to generate this kind of analysis.

Scientific replicability & reproducibility require a working implementation and I’ve not seen such a description of this kind of analysis & processing on the internet, hence this article. I hope this is helpful to the community. It’s also taken a lot of time for me to get here so hopefully you won’t have to make the same mistakes I have (but mistakes are good so hopefully you make your own :)

I’m making certain assumptions & decisions in this analysis. If you don’t agree with them, that’s alright, you’re not going to hurt my feelings. It doesn’t make this any more or less correct. This is an immature technique and I don’t think there are hard & fast rules the field has yet approved for cortical tractography. But if you think something is really wrong, I’d love to get your opinion, discuss it and potentially update this post.

The goal of this analysis is to get a measure of structural connectivity between all cortical “regions” of the brain. There are an infinite number of ways to divvy up a brain into regions of interest. Here, I’m choosing to use the 2009 labels produced by Freesurfer’s Destrieux atlas. Using FSL’s probtrackx2 tool, we’ll use DTI data to perform “in-silico tractography” (hand-waving) from each region and measure to what degree each region “connects” to every other region. There are a lot of caveats in the above paragraph & a lot of underlying assumptions I’m making, up to and including:

Freesurfer generates accurate & reliable cortical labels from a clean T1 image.
Diffusion MRI captures the relative motion of water molecules in tissue.
Water most likely moves parallel to axons which carry action potentials from the neuron to their target.
Action potentials are the primary means neurons use to communicate with one another.
During development, the brain organizes itself in such a way that groups of neurons that communicate often with other groups will develop large axonal bundles between one another.
These bundles restrict the diffusion of water in a way that is detectable during a diffusion MR sequence.
Using the diffusion information, we can build probability density functions at every voxel to describe our best guess at which way water flows at that particular voxel. These PDFs also help us characterize the uncertainty inherent in the measurement.
Using these PDFs, we can step through the diffusion image, generating the most likely path water would flow beginning at some point A. We call this a tract, streamline or sample.
Generating lots of these potential tracts, we can make statistically sound inferences about whether we actually trust that the generated tracts represent the actual underlying anatomy. The actual anatomy is otherwise difficult to obtain from our subjects, which is why we’re imaging them.
Pulling this large amount of data together, we can generate a NxN connectivity matrix and plot the relative connectivity between all regions.

If I haven’t offended you with any of these statements, let’s get on with it.

Prerequisites

Data

This analysis requires the following MR data:

A Freesurfer-able T1 image. This should cover the entire brain at high-resolution (~1mm³ voxels). For this article, I’m going to call this file T1.nii.gz.
A DTI sequence with at least 30 directions, though depending on your SNR fewer directions can be acceptable in certain cases. The standard sequence I use is 60 directions. I’m going to refer to this file as dti.nii.gz.
The b-values & b-vectors for the gradients. These files should be generated in the process from converting your raw images (probably in DICOM format) to the standard research format (NIFTI). These are simple text files that I’ll refer to as bvals and bvecs.

We’re going to call this subject janedoe.

$ ls ./
T1.nii.gz      dti.nii.gz         bvecs
bvals
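
Before starting, it can be worth confirming that the gradient files agree with each other. The following is a minimal sketch, not part of the original recipe, and it assumes the usual FSL text layout (bvals as one row of N values, bvecs as three rows of N values). The function names read_table and check_gradients are made up for illustration.

```python
def read_table(path):
    """Read a whitespace-separated text file into a list of float rows."""
    with open(path) as f:
        return [[float(v) for v in line.split()] for line in f if line.strip()]


def check_gradients(bvals_path, bvecs_path):
    """Return the number of volumes described by the gradient files.

    Assumption: bvals holds one row of N values and bvecs holds three
    rows (x, y, z) of N values each. Raises ValueError otherwise.
    """
    bvals = read_table(bvals_path)
    bvecs = read_table(bvecs_path)
    if len(bvals) != 1:
        raise ValueError("expected bvals to have 1 row, found %d" % len(bvals))
    if len(bvecs) != 3:
        raise ValueError("expected bvecs to have 3 rows, found %d" % len(bvecs))
    n = len(bvals[0])
    for row in bvecs:
        if len(row) != n:
            raise ValueError(
                "bvecs row has %d entries but bvals has %d" % (len(row), n))
    return n
```

The returned count should match the number of volumes in dti.nii.gz (61 for a 60-direction sequence with a single non-diffusion-weighted volume, for example).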

Software

I’ll be using Freesurfer & FSL for this analysis, both of which are freely available. In particular, I use Freesurfer 5.1 & FSL 5.0.6 though I’m fairly certain this will work in the newest version of Freesurfer, 5.3.

Structural processing

The T1 image needs to be processed through Freesurfer’s standard recon-all pipeline. There are many resources for how to do this online, namely the Freesurfer wiki. I run the pipeline this way:

$ recon-all -s janedoe -i T1.nii.gz
$ recon-all -s janedoe -all \
    -qcache \
    -measure thickness \
    -measure curv \
    -measure sulc \
    -measure area \
    -measure jacobian_white
$ mri_annotation2label --subject janedoe \
    --hemi lh \
    --annotation $SUBJECTS_DIR/janedoe/label/lh.aparc.a2009s.annot \
    --outdir $SUBJECTS_DIR/janedoe/label \
    --surface white
$ mri_annotation2label --subject janedoe \
    --hemi rh \
    --annotation $SUBJECTS_DIR/janedoe/label/rh.aparc.a2009s.annot \
    --outdir $SUBJECTS_DIR/janedoe/label \
    --surface white

The first recon-all imports the data and creates the standard folder layout in $SUBJECTS_DIR/janedoe. The second call executes all three steps of the Freesurfer pipeline (note the -all flag). The two mri_annotation2label commands convert the Destrieux cortical annotation to individual labels across the two hemispheres. These labels are written into the label/ directory for the subject. Labels are simple text files that map Freesurfer vertices to particular cortical regions.

This process usually takes between 20 and 40 hours depending on the quality of data. Grab a cup of coffee or a nap.

After this completes, we need to do quality assurance. I’m sure there are more rigorous examples out there but this is what I do. For janedoe, I generate the following tcl scripts:

$ cat janedoe.tkmedit.tcl
for { set i 5 } { $i < 256 } { incr i 10 } {
SetSlice $i
RedrawScreen
SaveTIFF janedoe_screenshots/tkmedit-$i.tiff
}
exit
$ cat janedoe.tksurfer.lh.tcl
make_lateral_view;
redraw;
save_tiff janedoe_screenshots/lh-lateral.tiff;
rotate_brain_y 180;
redraw;
save_tiff janedoe_screenshots/lh-medial.tiff;
labl_import_annotation aparc.a2009s.annot;
redraw;
make_lateral_view;
redraw;
save_tiff janedoe_screenshots/lh-annot-lateral.tiff;
rotate_brain_y 180;
redraw;
save_tiff janedoe_screenshots/lh-annot-medial.tiff;
exit;
$ cat janedoe.tksurfer.rh.tcl
make_lateral_view;
redraw;
save_tiff janedoe_screenshots/rh-lateral.tiff;
rotate_brain_y 180;
redraw;
save_tiff janedoe_screenshots/rh-medial.tiff;
labl_import_annotation aparc.a2009s.annot;
redraw;
make_lateral_view;
redraw;
save_tiff janedoe_screenshots/rh-annot-lateral.tiff;
rotate_brain_y 180;
redraw;
save_tiff janedoe_screenshots/rh-annot-medial.tiff;
exit;

Given these files, I run the following commands:

$ mkdir -p janedoe_screenshots
$ tkmedit janedoe brain.finalsurfs.mgz -aseg -surfs -tcl ./janedoe.tkmedit.tcl
$ tksurfer janedoe lh inflated -gray -tcl ./janedoe.tksurfer.lh.tcl
$ tksurfer janedoe rh inflated -gray -tcl ./janedoe.tksurfer.rh.tcl

We make a folder and then run the tcl scripts in tkmedit and tksurfer. The tkmedit script loops through slices in brain.finalsurfs.mgz (colored by the automatic segmentation with the surfaces overlaid), taking a screenshot every centimeter. The tksurfer scripts make screenshots of the lateral & medial views with and without the Destrieux labels.
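
Since the two tksurfer scripts differ only in the hemisphere, they can be generated rather than hand-edited. This is a hedged sketch: it reuses the tcl commands from the scripts above, writes hemisphere-specific file names so the two runs cannot overwrite each other's screenshots, and the helper names tksurfer_script and tkmedit_slices are hypothetical.

```python
def tksurfer_script(subject, hemi, annot="aparc.a2009s.annot"):
    """Return the text of a tksurfer tcl script for one hemisphere."""
    shots = "%s_screenshots" % subject
    lines = [
        "make_lateral_view;",
        "redraw;",
        "save_tiff %s/%s-lateral.tiff;" % (shots, hemi),
        "rotate_brain_y 180;",
        "redraw;",
        "save_tiff %s/%s-medial.tiff;" % (shots, hemi),
        "labl_import_annotation %s;" % annot,
        "redraw;",
        "make_lateral_view;",
        "redraw;",
        "save_tiff %s/%s-annot-lateral.tiff;" % (shots, hemi),
        "rotate_brain_y 180;",
        "redraw;",
        "save_tiff %s/%s-annot-medial.tiff;" % (shots, hemi),
        "exit;",
    ]
    return "\n".join(lines) + "\n"


def tkmedit_slices(start=5, stop=256, step=10):
    """Slice indices visited by the tkmedit loop above."""
    return list(range(start, stop, step))


if __name__ == "__main__":
    for hemi in ("lh", "rh"):
        with open("janedoe.tksurfer.%s.tcl" % hemi, "w") as f:
            f.write(tksurfer_script("janedoe", hemi))
    # 26 volume screenshots plus 4 surface screenshots per hemisphere
    print(len(tkmedit_slices()) + 2 * 4)
```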

At this point we’ve got about 35 pictures to look at and check that segmentation & labeling proceeded normally. In the volume data we’re looking for the colors (the volumetric segmentations) to look accurate, with no coloring of what is obviously not brain. You also want to ensure the surfaces track well with the image.

Note that some skull has been left in the what-should-be skull-stripped image and that it’s been labeled as cortex. This is not good but bad results are more enlightening than good ones :)

On the surface images you’re looking for sharp points in the inflated surfaces; these are bad and are most likely to happen on the medial surface (where it’s not quite as important) or in the temporal pole (which is bad if you’re interested in language). With the labels overlaid, one of these screenshots looks like this:

Because Freesurfer makes 3D meshes of the white-matter and pial surfaces, we can do fancy things like inflate the brain by pushing out on those meshes. The picture above is the “inflated” view. Each color represents a different labeled region of cortex. Note the spikes in the temporal pole (lowest portion of this view). This is not ideal.

Freesurfer produces a lot of other data. Given the measure flags I passed above to recon-all, more than 2650 data points can be extracted from files in the $SUBJECTS_DIR/janedoe/stats/ folder. If you’re interested in that sort of thing, you might take a look at code I wrote to do just that.

Let’s assume Jane was a great participant and her T1 was very clean. Freesurfer is quite robust and most likely performed good segmentation & labels. On to the diffusion images.

Diffusion Processing

From our DTI data, we need to produce the following information:

The mask of the non-diffusion-weighted image.
A Fractional Anisotropy (FA) image for registration to the T1.
The motion-corrected DTI sequence.
PDFs characterizing the underlying diffusion process.

Non-diffusion-weighted mask

For probabilistic tractography, we need to generate a mask within which we constrain tractography. Assuming the first volume of the DTI sequence is the non-diffusion-weighted image, we can use bet2 to do this.

$ fslroi dti nodif 0 1
$ bet2 nodif nodif_brain -m -f .25

fslroi extracts the first time volume from the dti sequence and saves it as nodif.nii.gz. Note with FSL commands you do not need to add file extensions. We then use bet2 with a fractional intensity threshold of 0.25. This is generally a robust threshold to remove unwanted tissue from a non-diffusion weighted image. The -m option creates a binary nodif_brain_mask image.

Correcting for Motion

Because we’re collecting many volumes of diffusion-weighted data from our subject, there’s a very high probability of some motion between volumes. Also, because of how the gradient magnets are used to apply a direction of diffusion and then read out the image, lingering currents in the amplifiers can add artifacts to the data. FSL’s eddy_correct tool attempts to fix both issues.

$ eddy_correct dti dti_ecc 0
$ fdt_rotate_bvecs bvecs rot_bvecs dti_ecc.ecclog
$ mv bvecs old_bvecs && mv rot_bvecs bvecs

eddy_correct takes the raw filename, the output filename and which volume to register all of the gradients to (which is typically the non-diffusion-weighted image). This takes a few seconds per volume, so probably around a minute for the full data set.

fdt_rotate_bvecs takes the logfile of eddy_correct and applies the proper rotation to the gradient directions. Because we’ve registered every diffusion volume to the first, the original gradients are not accurate anymore. fdt_rotate_bvecs fixes this issue. Then we just rename the new gradient vectors to bvecs without overwriting the old ones.
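
To make the gradient correction concrete, here is a small sketch of the underlying idea: each volume's gradient direction is multiplied by the rotation that was applied to that volume. This is an illustration only, not the fdt_rotate_bvecs implementation; it assumes the 3x3 rotation for each volume is already in hand (the real tool extracts it from the eddy_correct log), and it stores gradients as one [x, y, z] triple per volume, which is the transpose of the bvecs file layout.

```python
def rotate_bvec(rotation, bvec):
    """Apply a 3x3 rotation (list of rows) to one gradient direction."""
    return [sum(rotation[i][j] * bvec[j] for j in range(3)) for i in range(3)]


def rotate_bvecs(rotations, bvecs):
    """Rotate each volume's gradient by that volume's rotation matrix.

    rotations: one 3x3 matrix per volume (assumed to be pure rotations)
    bvecs:     one [x, y, z] direction per volume
    """
    if len(rotations) != len(bvecs):
        raise ValueError("need exactly one rotation per gradient direction")
    return [rotate_bvec(r, v) for r, v in zip(rotations, bvecs)]


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    quarter_turn_z = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]  # 90 degrees about z
    print(rotate_bvecs([identity, quarter_turn_z], [[1, 0, 0], [1, 0, 0]]))
    # -> [[1, 0, 0], [0, 1, 0]]
```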

Generating the FA image

There are much better ways to create an FA image than this method, but as you’ll see below, we’re only using the FA image for registration purposes. I would probably not use this image in a whole-brain FA analysis. For something like that, you might consider using Camino.

$ dtifit -k dti_ecc \
    -o ./dtifit \
    -m nodif_brain_mask \
    -r bvecs \
    -b bvals

We’re interested in dtifit_FA.nii.gz. dtifit doesn’t implement the most state-of-the-art model of diffusion, but for our purposes it’s fast and good enough.

These screenshots were taken with fslview_bin. On the left is the fractional anisotropy (FA) image. In FA images, values vary between 0 & 1 where 0 represents purely isotropic water diffusion. Picture how a golf ball might bounce around inside a basketball: it could go anywhere. Where the FA image is brightest (towards 1), there is one specific direction that water is likely to diffuse (like rolling a golf ball down a paper-towel rod, there isn’t much place else for it to go). FA is typically highest in large fiber bundles such as the corpus callosum (the axonal bundles connecting the two hemispheres). The right image is color-coded by diffusion direction. Green is Anterior-Posterior, Red is Left-Right & Blue is Inferior-Superior (head-foot). The color intensity in this image is modulated by the FA value.

It’s very important to look at these images for quality assurance. In the right image, we want to make sure our gradient table is correct and that we haven’t flipped two dimensions. If this were the case, the image would look the same but colors would be switched. In the pure FA image, we’re looking for smearing and other ugly artifacts which typically denote large amounts of motion.
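
For reference, FA at a voxel can be computed from the three eigenvalues of the fitted diffusion tensor using the standard formula, and the direction-coded color is the absolute value of the principal direction scaled by FA. The sketch below illustrates both; it is not what dtifit runs internally, and it assumes x is Left-Right, y is Anterior-Posterior and z is Inferior-Superior, matching the color convention described above.

```python
import math


def fractional_anisotropy(l1, l2, l3):
    """FA from the three eigenvalues of the diffusion tensor.

    Returns 0 for perfectly isotropic diffusion and approaches 1 when
    diffusion happens along a single direction.
    """
    num = (l1 - l2) ** 2 + (l2 - l3) ** 2 + (l3 - l1) ** 2
    den = l1 ** 2 + l2 ** 2 + l3 ** 2
    if den == 0:
        return 0.0
    return math.sqrt(0.5 * num / den)


def direction_color(v1, fa):
    """(red, green, blue) for a unit principal direction v1 = (x, y, z).

    Assumption: x is Left-Right, y is Anterior-Posterior, z is
    Inferior-Superior; intensity is modulated by FA.
    """
    return tuple(abs(c) * fa for c in v1)


if __name__ == "__main__":
    print(fractional_anisotropy(1.0, 1.0, 1.0))   # the basketball: 0.0
    print(fractional_anisotropy(1.0, 0.0, 0.0))   # the paper-towel rod: 1.0
    print(direction_color((0, 1, 0), 0.5))        # mid-intensity green
```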

Generating PDFs

At this point we have a motion- & artifact-corrected image (dti_ecc.nii.gz), the corrected gradient table (bvecs), the gradient values (bvals), and a mask of the non-diffusion-weighted image (nodif_brain_mask.nii.gz). If we were smart, we’d use bedpostx out of the box to generate PDFs of the diffusion direction and get on with tractography. Unfortunately, bedpostx takes about 20 minutes per slice and typical datasets contain between 40 and 50 slices so this process takes about 15 hours of compute time. Fortunately, it’s an extremely parallelizable task. So the following script exactly mimics FSL 5’s bedpostx but can run with a linear speedup based on the number of processors on your machine.

$ cat ./bedpostx.sh

datadir=./

# Estimation parameters
nfibres=2
fudge=1
burnin=1000
njumps=1250
sampleevery=25


mkdir -p bedpostx
mkdir -p bedpostx/diff_slices
mkdir -p bedpostx/logs
mkdir -p bedpostx/logs/pid_${$}
mkdir -p bedpostx/xfms

echo "bedpostx_preproc begin `date`"
bedpostx_preproc.sh ${datadir} 0
echo "bedpostx_preproc end `date`"
echo

nslices=`${FSLDIR}/bin/fslval ./dti_ecc dim3`
[ -f bedpostx/commands.txt ] && rm bedpostx/commands.txt

slice=0
while [ $slice -lt $nslices ]
do
    slicezp=`$FSLDIR/bin/zeropad $slice 4`
    if [ -f bedpostx/diff_slices/data_slice_$slicezp/dyads2.nii.gz ];then
        echo "slice $slice has already been processed"
    else
        echo "${FSLDIR}/bin/bedpostx_single_slice.sh $datadir $slice --nfibres=$nfibres --fudge=$fudge --burnin=$burnin --njumps=$njumps --sampleevery=$sampleevery --model=1">> bedpostx/commands.txt
    fi
    slice=$(($slice + 1))
done

# parallel processing
echo "parallel processing begin `date`"
run_parallel.py bedpostx/commands.txt --ncpu 12
echo "parallel processing end `date`"
echo

# Clean things up
echo "bedpostx_postproc begin `date`"
bedpostx_postproc.sh ${datadir}
echo "bedpostx_postproc end `date`"
echo

$ cat run_parallel.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

""" run_parallel.py

Simple script to run a list of shell commands in parallel.

"""

import sys
import multiprocessing as mp


def create_parser():
    from argparse import ArgumentParser, FileType
    ap = ArgumentParser()
    ap.add_argument('infile', type=FileType('r'),
        help="file with shell commands to execute")
    ap.add_argument('-n', '--ncpu', type=int, default=0,
        help="Number of CPUs to use (default: %(default)s: all CPUs)")
    return ap


def cpus_to_use(ncpu):
    return ncpu if ncpu else mp.cpu_count()


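def run_cmd(cmd):
    """Run one command line through the shell and return its exit status."""
    from subprocess import call
    # shell=True is needed because each command is a single string
    return call(cmd, shell=True)


if __name__ == '__main__':
    ap = create_parser()
    args = ap.parse_args(sys.argv[1:])

    ncpu = cpus_to_use(args.ncpu)

    # Read commands from the already open file and close it
    commands = [c for c in args.infile.read().split('\n') if c]
    args.infile.close()

    # Create a pool and map run_cmd to the shell commands
    pool = mp.Pool(processes=ncpu)
    pool.map(run_cmd, commands)
    pool.close()
    pool.join()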

$ source ./bedpostx.sh

This script should finish in about 90 minutes or so. Much better than 15 hours :)
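
The arithmetic behind those two numbers can be sketched as follows. This is a back-of-the-envelope estimate only: it assumes every slice takes the same 20 minutes, uses 45 slices as a typical count and the 12 CPUs passed to run_parallel.py, and ignores the pre- and post-processing steps.

```python
import math


def bedpostx_minutes(n_slices, minutes_per_slice, n_cpu):
    """Return (serial, parallel) run-time estimates in minutes."""
    serial = n_slices * minutes_per_slice
    # Slices are processed in "waves" of at most n_cpu at a time
    waves = math.ceil(n_slices / n_cpu)
    parallel = waves * minutes_per_slice
    return serial, parallel


if __name__ == "__main__":
    serial, parallel = bedpostx_minutes(45, 20, 12)
    print(serial / 60)   # 15.0 hours when run one slice at a time
    print(parallel)      # 80 minutes across 12 CPUs
```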

bedpostx unfortunately doesn’t give us pretty pictures to look at :(

Preparing for Tractography

Structural constraints

We’re not there yet! But we can start to combine some of the data. We’ll first generate some images from the structural processing, notably a mask of the ventricles & white matter in both hemispheres.

$ mkdir -p anat
$ mri_convert $SUBJECTS_DIR/janedoe/mri/rawavg.mgz anat/str.nii.gz
$ mri_convert $SUBJECTS_DIR/janedoe/mri/orig.mgz anat/fs.nii.gz
$ mri_binarize --i $SUBJECTS_DIR/janedoe/mri/aparc+aseg.mgz --ventricles --o anat/ventricles.nii.gz
$ mri_binarize --i $SUBJECTS_DIR/janedoe/mri/aparc+aseg.mgz --match 2 --o anat/wm.lh.nii.gz
$ mri_binarize --i $SUBJECTS_DIR/janedoe/mri/aparc+aseg.mgz --match 41 --o anat/wm.rh.nii.gz
# Put binarized wm filenames into txt file
$ ls -1 anat/wm* > waypoints.txt
# also copy over label files & white surfaces
$ rsync $SUBJECTS_DIR/janedoe/label/*.label label/
$ rsync $SUBJECTS_DIR/janedoe/surf/{l,r}h.white surf/

We’re going to use these binarized files to constrain tractography.

Registrations

We need to be able to map labels in Freesurfer space to DTI space. Aligning a brain in one modality (T1) to another (DTI) is a process called registration. We’re going to perform a number of registrations to produce a transform matrix that places a Freesurfer label into DTI space.

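# input images generated above
$ fs=anat/fs.nii.gz
$ str=anat/str.nii.gz
$ fa=dtifit_FA.nii.gz
# transform filenames
$ fs2str=bedpostx/xfms/fs2str.mat
$ str2fs=bedpostx/xfms/str2fs.mat
$ fa2fs=bedpostx/xfms/fa2fs.mat
$ fs2fa=bedpostx/xfms/fs2fa.mat
$ fa2str=bedpostx/xfms/fa2str.mat
$ str2fa=bedpostx/xfms/str2fa.mat
# register Freesurfer space (mov) to structural (targ) using header information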
$ tkregister2 --mov $fs \
    --targ $str \
    --regheader \
    --reg /tmp/junk \
    --fslregout $fs2str \
    --noedit
# invert to create str2fs
$ convert_xfm -omat $str2fs -inverse $fs2str

# Now transforming FA to structural:
$ flirt -in $fa -ref $str -omat $fa2str -dof 6
# invert to create str2fa
$ convert_xfm -omat $str2fa -inverse $fa2str

# Concatenate and inverse
$ convert_xfm -omat $fa2fs -concat $str2fs $fa2str
$ convert_xfm -omat $fs2fa -inverse $fa2fs

At this point, we have a registration matrix bedpostx/xfms/fs2fa.mat that we’ll give to probtrackx2.
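
If the order of concatenation is confusing, it may help to see it as plain matrix algebra: convert_xfm -concat B A produces the product B·A, meaning "apply A first, then B". The sketch below demonstrates this with toy 4x4 matrices; the numbers are invented, and invert_rigid assumes a rotation-plus-translation transform (which is what a 6-dof flirt registration yields), not a general affine.

```python
def matmul(a, b):
    """Multiply two 4x4 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(m):
    """Invert a rigid-body transform (rotation + translation only)."""
    r_t = [[m[j][i] for j in range(3)] for i in range(3)]  # transposed rotation
    t = [m[i][3] for i in range(3)]
    new_t = [-sum(r_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_t[i]] for i in range(3)] + [[0, 0, 0, 1]]


def apply(m, point):
    """Apply a 4x4 transform to an (x, y, z) point."""
    p = list(point) + [1]
    return [sum(m[i][k] * p[k] for k in range(4)) for i in range(3)]


# Toy stand-ins for the matrices on disk (made-up values)
fa2str = [[0, -1, 0, 1], [1, 0, 0, 2], [0, 0, 1, 3], [0, 0, 0, 1]]
str2fs = [[1, 0, 0, 10], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

fa2fs = matmul(str2fs, fa2str)   # convert_xfm -concat $str2fs $fa2str
fs2fa = invert_rigid(fa2fs)      # convert_xfm -inverse $fa2fs

if __name__ == "__main__":
    p = (1, 0, 0)
    print(apply(fa2fs, p))                  # [11, 3, 3]
    print(apply(fs2fa, apply(fa2fs, p)))    # back to [1, 0, 0]
```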

Generating Seeds

Still more to do and we haven’t even started tractography yet! Now we need to convert the label files from Freesurfer into binary volume images that probtrackx2 can read. For this, I’m going to convert them with Freesurfer’s (appropriately labeled) mri_label2vol.

$ seed_list=seeds.txt
$ for hemi in lh rh
do
    for lab in `cat label_order.txt`
    do
        label=label/$hemi.$lab.label
        vol=${label/%.label/.nii.gz}
        echo converting $label to $vol
        mri_label2vol \
            --label $label \
            --temp anat/fs.nii.gz \
            --o $vol \
            --identity \
            --fillthresh 0.5 > /dev/null
        echo $vol >> $seed_list
    done
done
$ cat label_order.txt
G_rectus
G_subcallosal
S_suborbital
S_orbital_med-olfact
G_orbital
S_orbital-H_Shaped
G_and_S_transv_frontopol
G_and_S_cingul-Ant
G_and_S_frontomargin
G_front_sup
S_front_sup
S_front_middle
G_front_middle
S_orbital_lateral
S_front_inf
G_front_inf-Triangul
G_front_inf-Orbital
Lat_Fis-ant-Horizont
Lat_Fis-ant-Vertical
S_circular_insula_ant
S_circular_insula_sup
G_insular_short
G_Ins_lg_and_S_cent_ins
S_precentral-inf-part
G_front_inf-Opercular
S_precentral-sup-part
G_precentral
S_central
G_and_S_subcentral
Lat_Fis-post
S_circular_insula_inf
G_temp_sup-Plan_polar
Pole_temporal
G_temp_sup-G_T_transv
S_temporal_transverse
G_temp_sup-Plan_tempo
G_temp_sup-Lateral
S_temporal_sup
G_temporal_middle
S_temporal_inf
G_temporal_inf
S_collat_transv_ant
S_collat_transv_post
G_oc-temp_med-Parahip
S_oc-temp_lat
G_oc-temp_lat-fusifor
G_pariet_inf-Supramar
G_postcentral
S_postcentral
G_and_S_paracentral
G_parietal_sup
S_intrapariet_and_P_trans
S_interm_prim-Jensen
G_pariet_inf-Angular
S_occipital_ant
G_and_S_occipital_inf
G_occipital_middle
S_oc_sup_and_transversal
G_occipital_sup
G_cuneus
S_oc_middle_and_Lunatus
Pole_occipital
S_oc-temp_med_and_Lingual
G_oc-temp_med-Lingual
S_calcarine
S_parieto_occipital
G_precuneus
S_subparietal
G_cingul-Post-dorsal
G_cingul-Post-ventral
S_pericallosal
S_cingul-Marginalis
G_and_S_cingul-Mid-Post
G_and_S_cingul-Mid-Ant

I empirically determined this ordering of labels in this notebook but I removed BA*, cortex, entorhinal, MT and V* labels since they overlap with the 2009 atlas. seeds.txt now contains 148 (74 labels * 2 hemispheres) paths to our volumes of interest.
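
As a quick check on that count, the seed list can be rebuilt from the label names. This is a small sketch with a hypothetical helper, seed_paths; the folder and the .nii.gz extension are assumptions chosen to match the seed path used in the probtrackx2 call, and the hemisphere-then-label ordering mirrors the shell loop above.

```python
def seed_paths(label_names, hemis=("lh", "rh"), folder="label", ext=".nii.gz"):
    """Return one volume path per (hemisphere, label) pair, lh first."""
    return ["%s/%s.%s%s" % (folder, hemi, name, ext)
            for hemi in hemis
            for name in label_names]


if __name__ == "__main__":
    with open("label_order.txt") as f:
        names = [line.strip() for line in f if line.strip()]
    seeds = seed_paths(names)
    # With the 74 Destrieux labels listed above this prints 148
    print(len(seeds))
    # Every path should be unique, otherwise a label was listed twice
    print(len(set(seeds)) == len(seeds))
```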

Just for a sanity check, let’s overlay a few of these volumes on the T1 image in Freesurfer space:

Now we’re ready to begin tractography (if you’re still with me, I applaud you).

Tractography

At this point, you’re going to need a big computer. Each of these 148 seed regions can run independently, so if you have access to a compute cluster, by all means use it. For each probtracking run, we’re going to do this:

$ probtrackx2 -x label/lh.G_and_S_cingul-Ant.nii.gz \
    -s bedpostx/merged \
    -m bedpostx/nodif_brain_mask \
    -l \
    --usef \
    --s2tastext \
    --os2t \
    --onewaycondition \
    -c 0.2 \
    -S 2000 \
    --steplength=0.5 \
    -P 5000 \
    --fibthresh=0.01 \
    --distthresh=0.0 \
    --sampvox=0.0 \
    --xfm=bedpostx/xfms/fs2fa.mat \
    --avoid=anat/ventricles.nii.gz \
    --seedref=anat/fs.nii.gz \
    --forcedir \
    --opd \
    -V 1 \
    --omatrix1 \
    --dir=results/lh.G_and_S_cingul-Ant.nii.gz.probtrackx2/ \
    --waypoints=waypoints.txt \
    --waycond='OR' \
    --targetmasks=seeds.txt

Generating all 148 of these commands is left as an exercise for the reader. I suggest python :)
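
One possible way to do that exercise is sketched below. It only builds command strings from the options shown in the call above, one per line of seeds.txt; the names FIXED_OPTIONS, probtrackx2_command and commands_from_seed_file are invented for this example, and nothing is executed.

```python
import os

# Options copied from the probtrackx2 call above; only -x and --dir vary
FIXED_OPTIONS = [
    "-s bedpostx/merged",
    "-m bedpostx/nodif_brain_mask",
    "-l", "--usef", "--s2tastext", "--os2t", "--onewaycondition",
    "-c 0.2", "-S 2000", "--steplength=0.5", "-P 5000",
    "--fibthresh=0.01", "--distthresh=0.0", "--sampvox=0.0",
    "--xfm=bedpostx/xfms/fs2fa.mat",
    "--avoid=anat/ventricles.nii.gz",
    "--seedref=anat/fs.nii.gz",
    "--forcedir", "--opd", "-V 1", "--omatrix1",
    "--waypoints=waypoints.txt", "--waycond='OR'",
    "--targetmasks=seeds.txt",
]


def probtrackx2_command(seed, results_root="results"):
    """Build the probtrackx2 command line for one seed volume."""
    out_dir = "%s/%s.probtrackx2/" % (results_root, os.path.basename(seed))
    parts = ["probtrackx2", "-x", seed] + FIXED_OPTIONS + ["--dir=" + out_dir]
    return " ".join(parts)


def commands_from_seed_file(text):
    """One command per non-empty line of the seeds.txt contents."""
    return [probtrackx2_command(s) for s in text.split("\n") if s]


if __name__ == "__main__":
    with open("seeds.txt") as f:
        commands = commands_from_seed_file(f.read())
    with open("probtrackx2_commands.txt", "w") as f:
        f.write("\n".join(commands) + "\n")
```

The resulting text file has the same one-command-per-line shape that run_parallel.py reads, although on a cluster you would more likely submit each line as its own job.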

Let’s walk through the options because I’ve wasted many months of compute time generating crap results.

-x is the seed. This will differ for each run.
-s is the merged samples from bedpostx.
-m is the non-diffusion brain mask.
-l performs loop checks on paths.
--usef uses FA to constrain probabilistic tracking.
--s2tastext outputs text files for all generated tracts. You must set this along with --os2t to generate the proper output files.
--os2t Outputs seeds to target images. One per voxel in the seed. There can be quite a lot of these files.
--onewaycondition applies waypoint conditions to each half of the tract separately (see --waypoints).
-c 0.2 constrains curvature of the generated tracts. This is the default.
-S 2000 Each tract is created by at most 2000 steps.
--steplength=0.5 gives step length in mm. -S * --steplength is the maximum length a tract can be (here, 2000 * 0.5 mm = 1000 mm).
-P 5000 the number of sample tracts generated. This number above all else determines the time it takes to run this.
--fibthresh=0.01 thresholds the point at which probtrack will consider orientations of differing directions.
--distthresh=0.0 samples shorter than this length are discarded.
--sampvox=0.0 randomly sample the points with 0.0 mm sphere of the seed voxel. I set this to zero because I trust the incoming seed.
--xfm=bedpostx/xfms/fs2fa.mat this sets the linear transform from seed space to DTI space.
--meshspace=freesurfer We’re using Freesurfer meshes.
--avoid=anat/ventricles.nii.gz Generated samples that run into the --avoid image (or images) are flat-out rejected. Anatomically speaking, fiber tracts do not pass through the ventricles.
--seedref=anat/fs.nii.gz this merely gives the reference space for the seeds. Output images are of this size and shape.
--forcedir use the results directory given, don’t create a new one.
--opd output the distributions of the samples.
-V 1 Verbosity level
--omatrix1 We’re interested in the seed-to-targets matrix.
--dir=results/lh.G_and_S_cingul-Ant.nii.gz.probtrackx2/ directory for results.
--waypoints=waypoints.txt this file, which for me contains the paths to the white matter binarized images, requires that generated samples pass through these way points. Samples that don’t are rejected. I’m assuming that pathways I care about pass through white matter.
--waycond='OR' this determines the boolean logic for rejecting (or keeping) pathways given by --waypoints. In this instance, I only want the stream lines to pass through the left or right white matter, not necessarily both.
--targetmasks=seeds.txt This file gives path names to the seeds of interest. As I said before, this is every region from the brain.

The time it takes probtrackx2 to finish a region depends on the size of the region. As you can see above in the cortical parcellation picture, not all regions are the same size. Hence some runs don’t take very long (~90 minutes) and others will take a very long time, upwards of 96 hours on modern hardware. Having done this processing on a few subjects, I can say that all told these runs take about 30 days of compute time per subject. Grab a coffee and pillow.

Analysis

When everything is finished, we’d like to visualize the NxN connectivity matrix. We get two important outputs from our probtrackx2 runs. fdt_paths.nii.gz can be overlaid on the reference file and contains a count at every voxel of how many streamlines passed through that voxel. The other output of interest is matrix_seeds_to_all_targets.nii.gz. This is a 2D voxel-by-target matrix. By collapsing across all seed voxels and dividing by the total number of streamlines generated during the run, we generate a 1xN array of percentages representing the proportion of streamlines that reached each target. By doing this for all 148 output matrices and stacking the arrays, we generate a 148x148 connectivity matrix.
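
As a toy illustration of the collapsing step, here is a minimal sketch with made-up numbers (three seed voxels, four targets, and a hypothetical streamline total). None of these values come from a real probtrackx2 run.

```python
import numpy as np

# Made-up streamline counts: 3 seed voxels (rows) by 4 targets (columns).
voxel_by_target = np.array([
    [10, 0, 5, 0],
    [20, 5, 0, 0],
    [30, 5, 5, 0],
])
total_streamlines = 200  # hypothetical total for this seed's run

# Collapse across seed voxels, then express each target as a percentage.
row = voxel_by_target.sum(axis=0) / float(total_streamlines) * 100.0
print(row)
```

Each seed region contributes one such row, and stacking the rows gives the square connectivity matrix.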

Here is my implementation in python:

import os

import matplotlib.pyplot as plt
import nibabel
import numpy as np

def collapse_probtrack_results(waytotal_file, matrix_file):
    # waytotal holds the streamline total for this seed's run
    with open(waytotal_file) as f:
        waytotal = float(f.read())
    # get_fdata() on current nibabel (get_data() on older releases)
    data = nibabel.load(matrix_file).get_fdata()
    # sum over seed voxels (rows), then convert counts to percentages
    collapsed = data.sum(axis=0) / waytotal * 100.
    return collapsed

matrix_template = 'results/{roi}.nii.gz.probtrackx2/matrix_seeds_to_all_targets.nii.gz'
with open('seeds.txt') as f:
    processed_seed_list = [s.replace('.nii.gz', '').replace('label/', '')
                           for s in f.read().split('\n')
                           if s]
N = len(processed_seed_list)
conn = np.zeros((N, N))
rois = []
for idx, roi in enumerate(processed_seed_list):
    matrix_file = matrix_template.format(roi=roi)
    seed_directory = os.path.dirname(matrix_file)
    roi = os.path.basename(seed_directory).replace('.nii.gz.probtrackx2', '')
    waytotal_file = os.path.join(seed_directory, 'waytotal')
    rois.append(roi)
    try:
        # if this particular seed hasn't finished processing, you can still
        # build the matrix by catching the errors that pop up from trying
        # to open the non-existent files (its row simply stays at zero)
        conn[idx, :] = collapse_probtrack_results(waytotal_file, matrix_file)
    except (IOError, OSError):
        pass

# figure plotting
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(conn, interpolation='nearest')
cax.set_cmap('hot')

# one tick per ROI, in the same order as the rows/columns of conn
ax.set_xticks(range(len(rois)))
ax.set_yticks(range(len(rois)))

# label the ticks
ax.set_xticklabels(rois, rotation=90)
ax.set_yticklabels(rois, rotation=0)

# axes labels
ax.set_xlabel('Target ROI', fontsize=20)
ax.set_ylabel('Seed ROI', fontsize=20)

# Colorbar
cbar = fig.colorbar(cax)
cbar.set_label('% of streamlines from seed to target', rotation=-90, fontsize=20)

# title text
title_text = ax.set_title('Structural Connectivity with Freesurfer Labels & ProbtrackX2',
    fontsize=26)
title_text.set_position((.5, 1.10))

Link to full image

The diagonal is connectivity from the seed to itself, so it makes sense that it is very “hot”. Just off the diagonal we see the most connectivity, which also makes sense because I’ve constrained the list to be anatomically “nearby”. I’ve got a few ideas about what to do with these matrices, but I’ll save that for another day.

Using Conda For Quicker Travis Builds

Fri, 28 Mar 2014 00:00:00 UTC

Update: This post was spurred by a brief exchange on Twitter with Matt Davis (@jiffyclub). Apologies for not attributing this originally.

Like I’ve said in the past, it’s irresponsible to produce and share un-tested code. Travis-CI is a system that takes testing one step further. They’ve set up an integration with Github such that when commits on any branch are pushed to your repository, Travis will pull the branch to one of their testing servers and execute a series of tests on your behalf. This is fantastic because:

Users don’t have to take your word that this package is tested.
Potential contributors don’t need to set up their own testing infrastructure. If they submit a pull-request, Travis will pick it up and test it. Github will display the result of the Travis test on the pull-request page. Both the maintainer (you) and the contributor know this submission passes tests. If it doesn’t, discussions can ensue.
Because Travis always runs a test in a new environment, as a package owner you don’t have to worry about false positives when testing locally, i.e. some particular aspect of your local setup biasing test results.

Travis uses a .travis.yml file in your repository to set up, run & tear down tests. Setup often involves installing one or more dependencies, running usually invokes a test-runner like nose or py.test, and teardown may include notification of test results through email or other channels. Travis has lots of helpful documentation about writing a good .travis.yml file.

Bedrock packages such as numpy, scipy & pandas are the building blocks for many scientific and data analysis applications in python. For performance reasons, these packages often include C extensions, meaning they must either be compiled from source before use or be installed as a pre-built binary. pip, the go-to python package installer, does not install these packages as binaries. This is an issue if you’re testing on Travis because, while they build successfully on the platform, the build can take a long time relative to the time required to test your package.

Enter conda, an up-and-coming package manager from the folks at Continuum Analytics. Continuum provides pre-built binaries of many scientific & data analysis packages. conda will, instead of downloading the source and building as pip does, download the pre-built binaries from Continuum’s servers and simply move them into place. This makes for faster, more stable & deterministic builds of your go-to libraries.

How can you use conda on Travis? Here’s an example .travis.yml for python:

language: python
python:
  - "2.7"
  - "3.3"
install:
  # Install conda
  - sudo pip install conda
  # install deps
  - sudo conda init
  - sudo conda update conda --yes
  - deps='pip numpy pandas requests nose numpydoc sphinx'
  - conda create -p $HOME/py --yes $deps "python=$TRAVIS_PYTHON_VERSION"
  - export PATH=$HOME/py/bin:$PATH
  # install your own package into the environment
  - python setup.py install
script:
- "nosetests -w test/ -v"

The install key defines a set of steps Travis will take to set up your testing environment. The steps above download, initialize & update conda and then create a conda environment in $HOME/py into which dependencies are installed (set in the deps variable). Finally, python setup.py install installs your package into the conda environment. After that, the steps defined in script are executed. Here, we’re simply invoking nosetests to search for tests and execute them.
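
For readers who haven’t written tests for nose before, here is a minimal sketch of the kind of file it would pick up from the test/ directory. The add function is a hypothetical stand-in for something you would import from your own package; it is not part of PyCap.

```python
import unittest


def add(a, b):
    """Hypothetical stand-in for a function imported from your package."""
    return a + b


class TestAdd(unittest.TestCase):
    """nose collects unittest.TestCase subclasses like this one."""

    def test_adds_integers(self):
        self.assertEqual(add(2, 3), 5)

    def test_adds_strings(self):
        self.assertEqual(add('py', 'cap'), 'pycap')
```

Saved as something like test/test_add.py, this would be found and run by the nosetests command in the script step.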

My Travis builds of PyCap have gone from 5-6 minutes to about 90 seconds on the same test battery simply by using conda to install numpy and pandas.

Because Travis is a shared resource, developers should try to optimize build time so Travis is testing our code and not wasting time building the same packages over & over again. If your software requires scipy, numpy, pandas or other bedrock packages, I highly recommend taking the time to change your testing process on Travis.

Packaging Best Practices

Tue, 28 Jan 2014 00:00:00 UTC

You’ve got a brilliant idea you want to implement in software. Congratulations, ideas are energizing. You hope it’ll be good enough that others will want to use it. You might even want others to contribute to your code and make it better. These are all laudable goals. Unfortunately, an idea is nothing without an implementation and an implementation is nothing without proper structure around it. Here are some simple steps you can take to produce better, more professional software.

I hope this guide is both timeless and language agnostic but I’ll probably sprinkle in specific services & tools that I use to make my life easier.

Packaging Best Practices

Starting a new package is always a good time to re-orient yourself with best practices for your particular language & environment. Packaging techniques seem to change every 2-3 years in the python world. Look to popular, well-designed packages as a guide for packaging. In the python world, requests is your best bet for an easy-to-understand package. I always look at it when starting a new package.

Version control

This should go without saying, but version control is an absolute necessity. Consider it an infinite “undo” button. Modern version control systems like git and mercurial make it dead-simple to create branches off your main codebase for new features, bug-fixes and places for experimentation without affecting “production” code.

Learn a VCS and use it. Personally I prefer git because of GitHub and the ability to commit without an internet connection (looking at you Subversion). Remember though that tools are just tools. Don’t get lost in tribal wars.

README-driven development

README Driven Development says to write your README first. If you’re gung-ho about developing software for others to use, gather your thoughts about what problem(s) you’re solving and how you expect your users to interact with your software.

Consider your README the initial specification you’re trying to hit. Before there exists any code in your package, you can take the time to truly consider what you’re trying to do. The more code you’ve written, the more difficult it is to about-face on any particular idea, goal, design or implementation.

Setup testing infrastructure

Nobody gets into software development because unit- and functional-testing is fun. However, asking people to use un-tested software is careless, unprofessional and downright malevolent. Work hard to reduce the impediments to testing your code. Consider dumping your test command(s) into a Makefile. If your testing infrastructure requires separate packages, make it very easy to get those packages.

As an aside, py.test lets you generate a script that runs your tests. The script contains a copy of the py.test package and therefore contributors don’t need to download & install py.test. Brilliant if you ask me.

Continuous Integration

Continuous Integration (CI) is the practice of merging development branches often to reduce integration hell. Unless we’ve tested the development branches, they shouldn’t be considered correct or valid. Therefore try to set up some sort of continuous testing service. If your package is on GitHub, take the 5 minutes to write a .travis.yml file and set up the hooks to get pushed branches tested on TravisCI.

Documentation is king

It’s a lie to think you’re working alone on a software project. Six months from now you will have forgotten why you made a particular decision in some small function and you will be working with (hopefully not against :) a previous version of yourself at that point. Wouldn’t it be nice to find the piece of documentation detailing that decision?

If you’re working with others, documentation is an absolute necessity. Not only does it keep developers on the same page but also gives users some confidence that you care about your work enough to write about it. Writing prose about your software also invariably helps you form better ideas about why you’re writing this software, who should use this software & what your users will want to get out of this package.

Sphinx is the go-to tool for writing documentation for python packages. Couple it with ReadTheDocs, a service that automatically builds and hosts your documentation for open-source projects.

I have not found a better site devoted to documentation than WriteTheDocs. Read it and understand why documentation is so important to your software.

Conclusions

I hope these tips will help someone write more professional code. I just started working on something new and thinking about developing a more complete, user-friendly package spurred this post. I should have more to say about my new package soon.

Tools, Libraries & Applications

Sun, 26 Jan 2014 00:00:00 UTC

By training, I’m not a software engineer. I’m willing to bet the majority of those who develop software to produce science weren’t trained as such either, which is why Software Carpentry is so important.

At its core, we produce science by applying one or more methods against one or more datasets. The exact methods and datasets depend on the field and its current best practices. One can develop their own methods or use freely available tools. Likewise, one might collect their own data or use publicly-available datasets. Either way, science needs a method and data to apply it against. We then can make inferences about the world.

Typically, if you’re developing methods for others to use, you’ll test your tool(s) against a public dataset so your results can be compared with other, similar methods. If you’re collecting your own data (perhaps from a unique population), you’ll probably want to use a public toolchain for analysis. Any variance in the results should be wholly attributable to the collected data and not the toolset.

There is a relationship between our tools and data and the better we understand this, the more reliable and reproducible our analyses become. I think this relationship takes the form of a stack. At the bottom of the stack, we have low-level tools that do the nitty gritty work. We may abstract these tools into libraries so they can more easily be used both alone and together. Finally, we produce applications built upon common, shared libraries that ask specific questions using specific data.

The following should be obvious to any professional developer. I’ll use examples from neuroimaging but this should generalize to all fields of science.

Tools

Lots of fields have robust & mature toolsets for data analysis. In neuroimaging, some of these packages include Freesurfer, FSL and SPM. Lots of energy and work have been poured into the tools these packages expose and they’ve been tested against a wide variety of datasets. No reviewer will scoff at the usage of any of these tools.

However, no one would call these packages cutting-edge. Researchers across the world are continually developing novel algorithms for image processing and statistical analyses and as these procedures mature, they too become dependable tools for the rest of the community to use.

However, none of these tools will produce results for a scientific paper in and of themselves. The minimum work to be done is to apply these tools against your dataset. This is always done by developing an “application” that uses these tools.

Applications

Applications are highly dependent on the researcher’s data, hypotheses and available computing infrastructure. They can be as simple as a single script that passes acquired data to tools. They can also become quite complex, integrating many sources of data with many paths for analyses. Either way, it’s very hard to share applications across research groups because their requirements are so specific and tuned.

This isn’t necessarily bad though. There is much knowledge to be gained in developing these applications and students should seek this exposure as much as possible. Building an application exposes one to how tools should be used as well as confirming the research questions meant to be addressed are being answered.

Libraries

Especially within a research group, applications will often lean on the same or a similar set of tools. Libraries can play a useful layer between tools and applications. Libraries should expose tools in a common form such that disparate applications can independently use these tools to address specific research questions. These libraries serve to gather our accumulated knowledge of best practices for using a particular tool. This is a form of Don’t Repeat Yourself (DRY), a powerful methodology for building software.

These libraries have two requirements. First, they should present a unified interface for building commands to execute tools. Each tool requires different data in different forms so these libraries should try and minimize the differences across tools and simplify the use of said tools.

More importantly, these libraries should make no decisions about how incoming data should be organized or how the generated commands are ultimately executed. These decisions are at the discretion of the application layer. Not only does this make libraries more useful to separate research groups but also infinitely more testable because they don’t require any particular infrastructure when unit testing.
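
To make the two requirements concrete, here is a minimal sketch of what such a library function might look like. build_command is a hypothetical helper, not part of any existing package; the option names in the usage line are borrowed from the probtrackx2 call described in another post on this site, and results/example is a made-up path.

```python
def build_command(tool, *args, **options):
    """Return the argument list for a command-line tool without running it.

    Positional args are passed through as-is; keyword options become
    --name=value, or a bare --name flag when the value is True.
    """
    command = [tool]
    for name in sorted(options):
        value = options[name]
        if value is True:
            command.append('--%s' % name)
        elif value is not None and value is not False:
            command.append('--%s=%s' % (name, value))
    command.extend(str(a) for a in args)
    return command


# The application layer decides where data lives and how to execute this
# (subprocess, a cluster scheduler, IPython.parallel, ...).
cmd = build_command('probtrackx2', forcedir=True, opd=True,
                    waycond='OR', dir='results/example')
print(cmd)
```

Because the function only builds a list of strings, it can be unit tested without FSL installed and without any data on disk.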

As we develop better know-how about using particular tools, we should pour this knowledge into libraries, not applications. Walling off our improved logic into a single application means no other can reap the benefits. Instead, if we update our libraries with improved knowledge and maintain backwards compatibility, all of the applications built upon the library can share the improvements.

Abstraction

Implicit in the above discussion is the idea of abstraction. Tools abstract their advanced image processing algorithms. Libraries abstract the advanced knowledge of how best to use tools. Applications abstract all of the above so we can quickly apply methods against our particular dataset and ultimately answer questions about the world.

No layer is more important than the other. As research software engineers, we should be interested in all layers and keep in mind how changing any particular layer will affect the others. Only when we understand the relationship between our tools, libraries and applications can we begin to engineer better science.

TBSS & IPython.parallel

Tue, 31 Dec 2013 00:00:00 UTC

A brief primer on Diffusion Tensor Imaging

We collect a wide variety of MR data from our children. Along with a high-resolution T1 and functional MR, we also collect a high-angular resolution diffusion image (HARDI), which is a kind of diffusion tensor image (DTI) sequence. This type of MR sequence measures the relative motion of water at every voxel along a variety of gradients (directions) through the brain. The line of thinking goes that water molecules, during their random walk, are more likely to move parallel to axons than cross the myelin sheath. Therefore diffusion imaging is a way to get at structural connectivity because it (hopefully) reveals the white-matter pathways through the brain.

Much like functional MR, the math behind DTI is not complex. Simply put, we use the many collected images and the known gradient direction for each image to solve for the tensor at every voxel. This tensor encapsulates both the direction water will most likely flow in that voxel and the degree to which it will flow. For each direction (X, Y & Z), the tensor contains information about the relative component of the direction. From the three principal components (the eigenvalues of the tensor) we compute Fractional Anisotropy (FA), a normalized measure of how much they differ from one another, which varies between 0 and 1. (Their plain average is a different quantity, mean diffusivity.) For voxels that contain water moving in a very definite direction, FA is high and you might visualize it as a rod pointing in some particular direction. For areas of the brain in which water moves more equally in all three directions, FA is low. This typically occurs in gray matter where axons are much less organized than in white matter.
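
For the curious, here is a minimal numeric sketch of how FA is commonly computed from the three eigenvalues of the tensor: the square root of 3/2, times the spread of the eigenvalues around their mean, divided by their overall magnitude. This is the standard textbook formula rather than code from our pipeline, and the eigenvalues in the usage lines are made up.

```python
import numpy as np


def fractional_anisotropy(eigenvalues):
    """FA from the three eigenvalues of a diffusion tensor (standard formula)."""
    lam = np.asarray(eigenvalues, dtype=float)
    mean = lam.mean()
    numerator = np.sqrt(((lam - mean) ** 2).sum())
    denominator = np.sqrt((lam ** 2).sum())
    if denominator == 0:
        return 0.0  # no diffusion measured at all
    return float(np.sqrt(1.5) * numerator / denominator)


print(fractional_anisotropy([1.0, 1.0, 1.0]))  # equal in all directions: low FA
print(fractional_anisotropy([1.0, 0.0, 0.0]))  # a single direction only: high FA
print(fractional_anisotropy([1.7, 0.3, 0.2]))  # made-up, rod-like values
```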

Group modeling

We typically will examine FA maps and look for regions that covary with some behavioral measure. We collect all MR data in what is called subject or native space. In subject space, there is no relation between the actual brain region at any particular voxel in one subject and that same voxel in another subject. Because we collect data from many subjects for statistical power, we employ a method called registration to align every subject’s native space FA image to a template. This template is often in a space that has been labeled with specific anatomical locations. Therefore, once we have all subjects in a template space and find regions that correlate with a behavioral measure, we can look up the anatomical location of that region and make a brilliant scientific conclusion (hopefully).

Problems

FA maps are difficult to register to the template because white matter pathways vary widely between subjects, and especially so if the subjects as a whole have a different white matter structure from the template. Our child data is very difficult to register to any particular template space.

FSL’s Tract-Based Spatial Statistics (TBSS) tool provides what some believe to be the best way to register many FA maps together. In a nutshell, it registers every subject to each other, discovers the most “prototypical” subject and aligns that subject to the template. Stacking the inter-subject registration and the template registrations, each image is brought into template space.

So, if we have 100 subjects, that’s 100² or 10000 inter-subject registrations to perform. Each registration takes 3-5 minutes, so we’re looking at ~20 days of compute time at the very least. Fortunately, this problem is extremely parallelizable because no single registration depends on another. FSL provides a way to send this to a compute cluster. However, typical (especially shared) clusters perform better with long running tasks (> 1 hr wall time) due to the overhead required in submitting and maintaining running processes on each compute node. Submitting 10000 5-minute jobs would swamp the scheduler and generate some very nasty emails from your cluster system administrator, whom you don’t want to upset. We need some way to perform many jobs in parallel with less overhead than a full cluster.
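
The back-of-the-envelope arithmetic, as a small sketch. The subject count and per-registration times come from the paragraph above; the 16 engines are a hypothetical number chosen only for illustration, and scheduling overhead is ignored.

```python
n_subjects = 100
n_registrations = n_subjects ** 2  # every subject registered to every subject

serial_days = {}
for minutes_each in (3, 5):
    serial_days[minutes_each] = n_registrations * minutes_each / 60.0 / 24.0
    print('%d min per registration: %.1f days' % (minutes_each, serial_days[minutes_each]))

# Spread evenly over a hypothetical number of workers, ignoring overhead.
n_engines = 16
parallel_days = serial_days[3] / n_engines
print('%.1f days on %d engines' % (parallel_days, n_engines))
```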

Solution

Enter IPython.parallel. IPython provides the architecture and machinery to start engines (processes that will accept work) on separate computers and then submit work from one or more other processes. This is better than python’s multiprocessing library that parallelizes only across a single machine. If you only have a single machine at your disposal, I would absolutely use multiprocessing.

First, we need to preprocess our data. Assuming you’re in a directory with all of your FA images:

$ tbss_1_preproc *.nii

This takes a few seconds per image. Next, we’ll run TBSS’s registration step. The -n option tells TBSS to run the inter-subject registration:

$ tbss_2_reg -n

If you let this run as is, you’ll need to come back in 20 days. What tbss_2_reg -n first does is build a list of registrations to perform. It stores this in a text file at FA/.commands. If you keep running tbss_2_reg, it will simply begin executing these commands.

Instead, what I do is kill the tbss_2_reg command and jump into python. First, read the list of registration commands and define a function we’ll use to execute a single command.

with open('FA/.commands') as f:
    # one registration command per line: split on newlines, not on all whitespace
    commands = [c for c in f.read().splitlines() if c.strip()]

def execute(command):
    """ execute the command-line call given in `command`

    Note: imports should go in the function as imported
    modules are not global in the IPython cluster"""
    from subprocess import call
    from shlex import split

    parts = split(command)
    return call(parts)

At this point, we need to start up our “cluster”. You should make a new parallel IPython profile for this. I won’t dive into the details of configuring an IPython cluster as the docs are quite good. Back in a shell, let’s start our cluster:

$ screen -S cluster
...screen session...
$ ipcluster start --profile-dir=/path/to/your/parallel/profile_dir
...output from ipcluster...
...Ctrl-A d to detach your terminal...

I’m a big fan of screen for long running processes. ipcluster is the easy way to start an IPython cluster but it needs to run for as long as you want to do work on that cluster. Simply backgrounding the command (ipcluster ... &) isn’t enough as terminating a shell session with running jobs will kill the jobs (and your cluster). This method is also pretty explicit in that when you want to stop your cluster, just re-attach and kill the process. If you nohup ipcluster ... &, you’ll have to hunt around in top to find the right IPython process to kill.

With the cluster up and running, back into python:

from IPython.parallel import Client
c = Client(profile_dir='/path/to/your/parallel/profile_dir')
lview = c.load_balanced_view()

len(c) will be how many engines you have running in this cluster. The load balanced view is the primary method you should use to submit jobs to your cluster.

results = lview.map_async(execute, commands)

Much like python’s built-in map, lview.map_async takes a function and an iterable of arguments to pass to that function. Behind the scenes, it does the hard work of submitting jobs to the IPython cluster and grabbing results. Because we’re using the .map_async function, this will return very quickly. Rest assured, the engines are working very hard now.
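
Once the work is done you’ll want to know which registrations failed, since execute returns each command’s exit code. Here is a minimal sketch, assuming results offers the ready() and get() methods of an asynchronous map result. failed_commands is a hypothetical helper, and the command strings and exit codes below are made-up placeholders so the snippet runs without a cluster.

```python
def failed_commands(commands, return_codes):
    """Pair each command with its exit code and keep the non-zero ones."""
    return [(cmd, code) for cmd, code in zip(commands, return_codes) if code != 0]


# With a live cluster this would be something like:
#     if results.ready():
#         return_codes = results.get()
# Here, made-up values stand in for both lists.
example_commands = ['register A B', 'register A C', 'register B C']
example_codes = [0, 1, 0]
failures = failed_commands(example_commands, example_codes)
print(failures)
```

Anything that shows up in the list can simply be resubmitted through the same load balanced view.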

Caveats & Conclusions

This is neither the cleanest nor the easiest method to set up for parallelizing TBSS. However, I’ve found IPython’s parallel machinery to be bullet-proof and to provide the right level of control when I need to run lots of little jobs. This method is definitely overkill if you have a single machine and, like I said above, in that case definitely use multiprocessing (or better, joblib, a great wrapper around multiprocessing).
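
For the single-machine case, here is a minimal sketch using the thread-backed pool that ships in the multiprocessing package. Threads are enough here because each job just waits on an external process. The two commands are stand-ins that only exit with a known code so the snippet runs anywhere; in practice the list would be the lines read from FA/.commands.

```python
import shlex
import subprocess
import sys
from multiprocessing.pool import ThreadPool


def execute(command):
    """Run one command-line string and return its exit code."""
    return subprocess.call(shlex.split(command))


# Stand-in commands: each just starts python and exits with a given code.
python = shlex.quote(sys.executable)
commands = [
    python + ' -c "import sys; sys.exit(0)"',
    python + ' -c "import sys; sys.exit(3)"',
]

# Two workers for the example; a real run would use roughly one per core.
pool = ThreadPool(processes=2)
try:
    return_codes = pool.map(execute, commands)  # results keep the input order
finally:
    pool.close()
    pool.join()

print(return_codes)
```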

I think this is a great example of just how much IPython can help in your work. If the notebook feature sold you on IPython, the parallel tools are just icing on the cake.

Video of KC talk is up

Mon, 22 Jul 2013 00:00:00 UTC

I just finished my talk about advanced REDCap usage at the Kennedy Center. Many thanks to Jenny Gilbert for organizing it.

The fine folks at the Kennedy Center have already posted the video.

The PDF version of my final slides is here.

I think it went over well and I received some nice feedback afterwards. I worked hard to talk less about the exact mechanics of using the API and more about what types of problems we can better solve with the API and Data Entry Triggers.

Intro to the REDCap API

Mon, 22 Jul 2013 00:00:00 UTC

As far as I can tell, there isn’t a tutorial on the internet about how to use the REDCap API. So here goes…

REDCap is an advanced web-based application for securely storing and retrieving tabular data. In simple terms, it can be thought of as a web-based spreadsheet, though it is much more than that. It provides an Application Programming Interface, which means external software can programmatically download and upload data into REDCap Projects. This tutorial assumes working knowledge of REDCap. When all else fails, please consult your site’s API help page, which is here for Vanderbilt.

Because the API is based on simple HTTP requests, any programming language with an HTTP library can use the REDCap API. I’m going to demonstrate simple API usage in python using the wonderful requests library.

To use the REDCap API, you must know the following:

The API URL for your site’s REDCap installation. For Vanderbilt, this url is https://redcap.vanderbilt.edu/api/.
The API token for your Project. A token is generated by the REDCap administrators and connects your user account to a particular REDCap Project. Therefore, if you have API access to many Projects, you will have many tokens to manage.

Basic Usage

Every call to the REDCap API is an HTTP POST request with specific parameters in the payload. The token parameter is always required as this tells the API from which Project you’re requesting a response. Next, the content parameter is used to declare the type of request you’re making. Finally, you may want to include the format field as well, as this tells the API in what format you want the response. It defaults to returning a CSV string, but I generally prefer getting json-formatted responses as that format can be easily converted to actual in-memory objects like lists, dictionaries, strings, etc.

So let’s begin by making the simplest request, exporting the Project’s Metadata (AKA Data Dictionary).

from requests import post
# Two constants we'll use throughout
TOKEN = '8E66DB6844D58E990075AFB51658A002'
URL = 'https://redcap.vanderbilt.edu/api/'

payload = {'token': TOKEN, 'format': 'json', 'content': 'metadata'}

response = post(URL, data=payload)
print(response.status_code)
200

A few things to talk about here:

At least at Vanderbilt, don’t forget the trailing slash at the end of the API URL string. Your site may differ but if you mess up the URL, nothing will work and you’ll probably get 501 “Method Not Implemented” responses.
Under no circumstances should you ever publicize your Project token. This is like publishing the password you use to log in to REDCap, which you would never do. In this instance however, this token is from a dummy project I use to test things with. There’s no real data and definitely not any PHI in it, so I’m not super worried.

But just to be clear:

Under no circumstances should you publicize your project token(s)!

You’ve been warned. (If you do publicize them for whatever reason, don’t fret. Just delete those tokens through the web app ASAP and request new tokens).

With that out of the way, the API accepted our request and returned data with a ‘200’ status, which means “everything is peachy” in HTTP.

Now let’s examine our metadata a bit. The .json() method I’m going to use just decodes the response (every language’s JSON library will work a bit differently, though).

metadata = response.json()
print("This project has %d fields" % len(metadata))
print()
print("field_name (type) ---> field_label")
print("---------------------------")
for field in metadata:
    print("%s (%s) ---> %s" % (field['field_name'], field['field_type'], field['field_label']))
print()
print('Every field has these keys: %s' % ', '.join(sorted(metadata[0].keys())))

This project has 11 fields

field_name (type) ---> field_label
---------------------------
study_id (text) ---> Study ID
first_name (text) ---> First Name
last_name (text) ---> Last Name
dob (text) ---> Date of Birth
sex (dropdown) ---> Gender
address (notes) ---> Street, City, State, ZIP
phone_number (text) ---> Phone number
file (file) ---> File
foo_score (text) ---> Test score for Foo test
bar_score (text) ---> Test score for Bar test
image_path (text) ---> image_path

Every field has these keys: branching_logic, custom_alignment, field_label, field_name, field_note, field_type, form_name, identifier, matrix_group_name, question_number, required_field, section_header, select_choices_or_calculations, text_validation_max, text_validation_min, text_validation_type_or_show_slider_number

The returned json decodes to a list of dict objects (python’s name for hash tables). We see that there are 11 fields in this project, and we print out a mapping of the field_name (the “machine” name for a field) along with its type and the field_label (the human-readable description). Finally, I just print out all of the keys from the first field so we can look at all of the data that comes with each field.

For all intents and purposes, this data structure is what we get when we manually download the Data Dictionary from our project, just in a slightly more machine-readable format.

Data Export

Here’s the fun part. Just tweak the request payload a little and we’ll download all of the data from our project:

payload['content'] = 'record'
payload['type'] = 'flat' # we want each row to contain the entire record
response = post(URL, data=payload)
data = response.json()

Voilà, we’ve just downloaded all of the data from our project. Let’s examine it.

print("This project has %d records" % len(data))

print("Each record has the following keys: %s." % ', '.join(data[0].keys()))
print()
print("But our metadata structure has the following fields: %s!" % ', '.join(f['field_name'] for f in metadata))
print()

This project has 3 records
Each record has the following keys: phone_number, first_name, last_name, image_path, dob, demographics_complete, foo_score, sex, study_id, file, address, imaging_complete, testing_complete, bar_score.

But our metadata structure has the following fields: study_id, first_name, last_name, dob, sex, address, phone_number, file, foo_score, bar_score, image_path!

You’d be wrong to assume the fields we get from exporting the data match the field_names from the metadata structure. This is because the REDCap API also returns the status of all of the forms for a particular record. These fields are always called [form name]_complete where [form name] is the lowercased & underscore-replaced version of the forms you see in the web-application. (You would be correct to assume the fields from an export are a superset of the fields from the metadata structure.)
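
The difference between the two lists can be computed directly. A minimal sketch, with both lists of names copied from the output above rather than fetched from the API:

```python
# Field names copied from the export and metadata output shown earlier.
record_keys = ['phone_number', 'first_name', 'last_name', 'image_path', 'dob',
               'demographics_complete', 'foo_score', 'sex', 'study_id', 'file',
               'address', 'imaging_complete', 'testing_complete', 'bar_score']
metadata_fields = ['study_id', 'first_name', 'last_name', 'dob', 'sex', 'address',
                   'phone_number', 'file', 'foo_score', 'bar_score', 'image_path']

# Keys that appear in the export but not in the data dictionary.
extra = sorted(set(record_keys) - set(metadata_fields))
print(extra)

# Strip the suffix to recover the form names.
suffix = '_complete'
form_names = [key[:-len(suffix)] for key in extra if key.endswith(suffix)]
print(form_names)

# The export is a superset of the metadata fields.
is_superset = set(metadata_fields) <= set(record_keys)
print(is_superset)
```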

We can examine a particular record like so:

record = data[0]
for field_name, value in record.items():
    print "%s: %s" % (field_name, value)

phone_number: (615) 555-1234
first_name: Billy Bob
last_name: blah blah
image_path: /path/to/image
dob: 2000-01-01
demographics_complete: 2
foo_score: 100
sex: 1
study_id: 1
file: [document]
address: 123 Main Street, Anytown USA 23456
imaging_complete: 2
testing_complete: 2
bar_score: 2

Pretty neat. Within the payload that you send to the API, you can specify parameters that limit the response to specific records, fields, forms, or events (if your Project is longitudinal), and that choose between the raw value and the human-readable label for multiple-choice fields. Experimenting with these calls is left to the reader.
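As a starting point for that experimentation, here is a sketch of a narrower export payload. The parameter names records, fields and rawOrLabel, and passing several values as one comma-separated string, are assumptions to check against your own site's API documentation. The token is a placeholder and limit_export is a made-up helper:

```python
# A sketch only: verify the parameter names against your REDCap API docs.
payload = {
    'token': 'YOUR_API_TOKEN',  # placeholder, not a real token
    'format': 'json',
    'content': 'record',
    'type': 'flat',
}

def limit_export(base, records=None, fields=None, raw_or_label='raw'):
    """Return a copy of base restricted to the given records and fields."""
    limited = dict(base)  # copy so the original payload is left alone
    if records:
        limited['records'] = ','.join(records)
    if fields:
        limited['fields'] = ','.join(fields)
    limited['rawOrLabel'] = raw_or_label
    return limited

small = limit_export(payload,
                     records=['1', '2'],
                     fields=['study_id', 'foo_score'],
                     raw_or_label='label')
print(small['records'], small['fields'], small['rawOrLabel'])
```

You would then send it with post(URL, data=small), exactly like the full export.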

Importing new data

Even fancier than exporting current data from the Project is updating records through the API. This payload looks a little different, though. We’ve got to encode the data that we want to import and attach it to the payload.

from json import dumps # the function we'll need to make a json-string of our new data

updated_record = data[0]
# Update a particular field
updated_record['foo_score'] = '100'

# we have to pass a list of records to the redcap API, so we're going to dump our new record within a list
# and we need to specify how to format the json string
to_import_json = dumps([updated_record], separators=(',',':'))
payload['data'] = to_import_json

response = post(URL, data=payload)
print response.json()['count']

Real quickly:

We updated a field from the first record.
We made a json-formatted string of this data structure (after packing it into a list because that’s what the API wants).
We attached this data to the data field of the payload and made the request to the API.
By default when importing data, the API will respond with a dict with the key count. This number is how many records you imported. You can see here that we import one record.

You might be wondering to yourself, how did the API know which record to update? That information is specified in the study_id field because study_id is the primary key of the Project, which is by definition the first field in the metadata (take this opportunity to look back and see that study_id was in fact the first field).
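Since the primary key is what ties an import to a record, you don't necessarily have to send the whole record back. The sketch below builds a minimal import containing only study_id and the one changed field. It assumes REDCap leaves fields that are absent from the import untouched, which you should confirm on a test project first. The record values and the new score are made up:

```python
from json import dumps

# Assumption: the import only needs the primary key plus the changed fields.
primary_key = 'study_id'
record = {'study_id': '1', 'first_name': 'Billy Bob', 'foo_score': '100'}

# The only field we want to change (the value is invented for this example).
changes = {'foo_score': '95'}

# Start from the primary key, then layer the changes on top.
minimal_record = {primary_key: record[primary_key]}
minimal_record.update(changes)

# sort_keys just makes the printed string predictable.
to_import_json = dumps([minimal_record], separators=(',', ':'), sort_keys=True)
print(to_import_json)
```

The resulting string goes into payload['data'] just like the full record did.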

Note, we formatted the incoming data as json because that was the format we specified in the format parameter of the payload. You could just as easily import data formatted as CSV or XML if you change that parameter.
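For example, a CSV version of the same kind of import could be built with the standard library. This is a sketch written for Python 3 (io.StringIO plus the csv module). The record is made up, and putting the primary key in the first column is my own choice for readability rather than a documented requirement:

```python
import csv
import io

# A made-up record with just the primary key and one field.
record = {'study_id': '1', 'foo_score': '100'}
fieldnames = ['study_id', 'foo_score']  # primary key first

# Write a header row and one data row into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator='\n')
writer.writeheader()
writer.writerow(record)

to_import_csv = buffer.getvalue()
print(to_import_csv)

# With a live payload you would then set:
#   payload['format'] = 'csv'
#   payload['data'] = to_import_csv
```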

Exporting and Importing data are the two most important methods of the API. You can also download, upload and delete files stored in file fields per record but doing this is different for every HTTP library so I’ll let you figure it out for your programming language :)

That brings us to the end of how to use the REDCap API generally. I’ve implemented everything above in python, but you’re free to use whatever language you like as long as it has an HTTP library.

That being said, python is a fantastic language with great libraries for high-level data manipulation like pandas, low-level data structures like NumPy, and scientific computing like SciPy. Python is also very popular in web development communities so there are web frameworks like Django and Flask in case you want to build websites or applications. If you need to do some advanced task, there probably exists a python package to help you on your way. It’s a great platform to build all sorts of tools.

Using the REDCap API in Python Applications

To make it easier to use the REDCap API from within python scripts and applications, I wrote PyCap. This section assumes a Mac OS X or Linux environment (though all of this should work on Windows) and working knowledge of the shell and the python language.

First, we must install the package. In a shell:

$ pip install PyCap

If you don’t have pip installed, this will work (you really should though, easy_install is considered deprecated by much of the python community):

$ easy_install PyCap

You may notice another package, requests, is installed as well.

With installation out of the way, let’s start writing python. We’ll begin with importing the package. The two main classes your scripts and applications should use are the Project class and the RedcapError exception.

from redcap import Project, RedcapError

(As long as this import doesn’t fail, you installed PyCap correctly).

Connecting to REDCap Projects

Just like above, you’ll need to know your API token and URL for your site.

project = Project(URL, TOKEN)

for field in project.metadata:
    print "%s (%s) ---> %s" % (field['field_name'], field['field_type'], field['field_label'])

study_id (text) ---> Study ID
first_name (text) ---> First Name
last_name (text) ---> Last Name
dob (text) ---> Date of Birth
sex (dropdown) ---> Gender
address (notes) ---> Street, City, State, ZIP
phone_number (text) ---> Phone number
file (file) ---> File
foo_score (text) ---> Test score for Foo test
bar_score (text) ---> Test score for Bar test
image_path (text) ---> image_path

When you create a Project, PyCap automatically exports the metadata from your project. It does so partly to set up a few nice attributes on the object, but more importantly, if the metadata request works correctly, the URL and token are correct and can be trusted to work later on.

All of the methods the API provides are available. To demonstrate what we did above, consider the following:

metadata = project.export_metadata()
data = project.export_records()
data[0]['first_name'] = 'Billy Bob'
response = project.import_records(data)
print response['count']

In these 5 lines, we:

Made an export metadata request (by default in JSON format), then automatically decoded it.
Made a data export request (again, by default in JSON format) and decoded the returned data.
Tweaked a single field of the first record.
Imported the new data.
Printed how many records were imported.

All of the HTTP request machinery (making sure the payload is correct, encoding and decoding the JSON) is handled for you. I wrote PyCap because I think most people just want their data and shouldn’t have to know HTTP to make it happen. Trust me, I made a lot of mistakes in building this library. You should use it so you don’t have to waste your time.

File downloads/uploads/deletions

I didn’t really go through file actions above because every HTTP library is going to deal with files differently. If you use PyCap, file operations are super simple:

record = '1'
field = 'file'
contents, headers = project.export_file(record, field)
print contents
print headers['name']

Just some data, you know.
data.txt

Obviously, the most important returned data is the file contents. In the web application, the filename you see for this particular record/field is what comes through in headers['name']. So if you want to save it to your local hard drive, it’s easy to keep the same name.

with open(headers['name'], 'w') as f:
    f.write(contents)

Just FYI, if you download a stored PDF, contents will be the binary data string and you’ll want to open the file in the wb mode.
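A minimal sketch of the binary case, written for Python 3. Instead of calling export_file, it uses stand-in values for contents and headers (the bytes and the file name are made up), and it writes into a temporary directory so nothing in your working directory is touched:

```python
import os
import tempfile

# Stand-ins for what export_file would return; these bytes are made up.
contents = b'%PDF-1.4 not a real PDF, just example bytes'
headers = {'name': 'report.pdf'}

# Save under the original file name, inside a fresh temporary directory.
target = os.path.join(tempfile.mkdtemp(), headers['name'])
with open(target, 'wb') as f:  # 'wb' writes the bytes untouched
    f.write(contents)

# Read it back in binary mode to confirm nothing was altered.
with open(target, 'rb') as f:
    assert f.read() == contents
print(os.path.basename(target))
```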

Let’s say we want to upload a new file to that record. A little more complicated, but still pretty easy.

# First write a new file
with open(headers['name'], 'w') as f:
    f.write('Yeah, I decided to change the contents of the file')

new_fname = 'new_data.txt'
with open(headers['name'], 'r') as f:
    response = project.import_file(record, field, new_fname, f)

# just to check...
contents, headers = project.export_file(record, field)
print contents

Yeah, I decided to change the contents of the file

And if you really want to delete a file from REDCap, that too is possible. Warning: there is no undo button for this :)

response = project.delete_file(record, field)

There is more documentation for PyCap here.

Feedback/Questions/Comments

Any feedback about this tutorial is greatly appreciated. There isn’t much on the internet about this so I hope you find it helpful in your work with REDCap. Feel free to open an issue on this post on GitHub.

Legos & The Engineer's Brain

Wed, 17 Jul 2013 00:00:00 UTC

Recent research from Vanderbilt (also in the NY Times) finds that performance on early (age 13) spatial reasoning tests increases the power to predict the student’s scholarly publications and patents some 30 years later above and beyond SAT mathematical and verbal scores alone:

The study, conducted by David Lubinski and colleagues at Vanderbilt University Peabody College of education and human development, provides evidence that early spatial ability – the skill required to mentally manipulate 2D and 3D objects – predicts the development of new knowledge, and especially innovation in science, technology, engineering and mathematics (STEM) domains, above and beyond more traditional measures of mathematical and verbal ability.

Obviously, the SAT score can predict future performance, especially in academic areas such as research publications and patent applications. Being an engineer by training, I find it interesting that spatial reasoning can increase the accuracy of these predictions. Engineers reason about systems on a day-to-day basis. Mechanical, chemical and civil engineers get to touch their creations; software engineers like myself must reason about the slightly less physical applications and distributed systems they create.

I’ve theorized my more logical & analytical thought processes are directly (cor)related to my love for Legos as a child. I always built the designs described in the instructions first and only then would I get more creative with the pieces after I understood how they could fit together. The first page of the design only uses 3 or 4 pieces and it’s obvious how they join one another. Each successive step only differs by 3 or 4 pieces, but it’s not always obvious how (or why!) they assimilate into the previous state. Sometimes it was required to set down the current piece and start building a subsystem of the design. It was both magical and eye-opening when all of the pieces came together. Legos described a way for viewing and putting together the world different from any other toy in my childhood.

That process alone, of determining the current state of a system, what it should be and how to get there most efficiently, is a near perfect overview of the field of engineering. I cannot wait to introduce Legos to my daughter!

Without a doubt, working through this process over and over with Legos pushed my brain to hone these skills. Obviously, engineers face challenges that involve less plastic and more math, can be orders of magnitude more abstract and vary along more dimensions than just three; however, the foundation for thinking critically about systems both in their individual pieces and as a whole is critical to succeeding in the STEM fields.

Upcoming Kennedy Center Talk

Fri, 12 Jul 2013 00:00:00 UTC

I will be giving a talk for the Vanderbilt Kennedy Center’s Statistical and Methodology Core on Monday, July 22nd at 12:30p in One Magnolia Circle, Rm 241.

The official calendar event can be found here.

I’ve given a “beta” version of this talk to a smaller audience at Chris Fonnesbeck’s Statistical Computing Series. I will discuss using advanced REDCap interfaces and how they can be leveraged to solve problems commonly faced in research.

Because I’m expecting a very diverse audience, there are no code samples in the talk. I just finished a basic tutorial on using the REDCap API that can be found here. You can see a very-close-to-the-actual-version of my slides here.

There will be video at some point and I’ll post the link somewhere around here.

Freesurfer Stats in REDCap

Wed, 10 Jul 2013 00:00:00 UTC

Freesurfer is a fantastic software package for reconstructing the brain’s cortical surface from a high resolution structural MR image. Just give it a T1-weighted image (where gray matter is gray, white matter white) and a day of processing time (depending on image quality) and it produces 3D meshes of the white and gray matter surfaces. Given these models, it can then do all sorts of fancy calculations like area and average cortical thickness of particular regions. It’s a fantastic tool for researchers because manual delineation of the cortical surface requires expertise and is extremely time-intensive. It’s not an option when you’re collecting data from one or two subjects a day as we do in our lab.

The problem with Freesurfer is that it can be extremely verbose with the data it spits out. We want to put these statistics into our REDCap databases so we can better analyze them against the myriad of behavioral measures we collect. If you run Freesurfer and generate advanced labeling, it can produce upwards of 2700 measures per subject. It’s simply untenable to make some poor RA copy/paste all of these values into REDCap.

So, to make this sort of thing easier, I wrote a little tool that parses and flattens the stats files into a simple python dictionary. Usage goes a little something like this:

from recon_stats import Subject
s = Subject('SUBJECTID') # where SUBJECTID is an identifier for a subject living in SUBJECTS_DIR
s.get_measures()
data = s.upload_dict()

# Using my PyCap package, you can then import the data into a REDCap project

from redcap import Project
p = Project(URL, TOKEN)
data[p.def_field] = 'SUBJECTID'
response = p.import_records([data])

This isn’t going to work until you’ve added the ~2650 fields to your REDCap data dictionary, which is no task for the faint of heart. To make this a little easier for everyone, I created a REDCap Shared Library so anyone with a REDCap project can easily search for Freesurfer Reconstruction Stats and download it into their project’s data dictionary. Doing so will create a recon form. I don’t recommend viewing this form for any particular record in your project; REDCap takes quite a while to generate HTML for all 2655 fields :)
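If you're curious what "parses and flattens" can look like, here is a toy sketch. It is not the recon_stats code: the sample text is made up (loosely modeled on the layout of a Freesurfer stats table), and the function names, the aseg prefix and the key naming scheme are all invented for illustration.

```python
# Made-up sample text, loosely modeled on a Freesurfer stats table.
SAMPLE = """\
# Measure BrainSeg, BrainSegVol, Brain Segmentation Volume, 1193318.0, mm^3
# ColHeaders Index SegId NVoxels Volume_mm3 StructName normMean
1 4 7262 7796.2 Left-Lateral-Ventricle 30.1
2 5 310 330.5 Left-Inf-Lat-Vent 45.7
"""

def clean(name):
    """Lowercase a name and swap hyphens for underscores (field-name friendly)."""
    return name.lower().replace('-', '_')

def flatten(text, prefix='aseg'):
    """Turn header measures and table rows into one flat dict of strings."""
    flat = {}
    columns = []
    for line in text.splitlines():
        if line.startswith('# Measure'):
            # Comma-separated: long name, short name, description, value, units
            parts = [p.strip() for p in line[len('# Measure'):].split(',')]
            flat['%s_%s' % (prefix, clean(parts[1]))] = parts[3]
        elif line.startswith('# ColHeaders'):
            columns = line.split()[2:]  # drop '#' and 'ColHeaders'
        elif line.strip() and not line.startswith('#'):
            row = dict(zip(columns, line.split()))
            struct = clean(row['StructName'])
            # One key per structure/measure pair.
            for col in ('NVoxels', 'Volume_mm3', 'normMean'):
                flat['%s_%s_%s' % (prefix, struct, clean(col))] = row[col]
    return flat

flat = flatten(SAMPLE)
print(len(flat))
```

Every structure/measure pair becomes its own key, which is why a full reconstruction ends up needing thousands of fields in the data dictionary.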

Check out the repo on github. It’s on my to-do list to put it up on PyPI. Until then you can git clone the repo to your local machine. Happy recon-all & REDCap’ing!

There aren't unit tests for opinions

Thu, 24 Jan 2013 00:00:00 UTC

I spend the majority of my day attempting to convince my computer that I’m not an idiot. Some days I’m more successful than others.

One way to help improve this process is to write software in such a way that it can be unit tested. Unit testing involves ensuring the building blocks of your code operate as you expect on predetermined inputs and outputs. If you test well enough, you can rest assured that your code is doing what you think it is doing and more importantly what you want it to do. You might say untested code is inherently wrong because you can’t trust it.

Unfortunately, there is no unit testing framework for opinions. There is no magic to automatically check the veracity and intent of a thought or statement. This is made worse by the expanse to which we can share our ideas these days.

Just because you can type or say it doesn’t make it correct. We take other people being wrong to imply naiveté (but too often we jump straight to stupidity) in our society. I wish it meant something more akin to having a unit test fail and that they should try again.

Winter Pesto

Tue, 22 Jan 2013 00:00:00 UTC

I made my own pesto the other night and am kicking myself for waiting this long to do so:

a large handful of spinach, chopped coarsely
2/3 cup walnuts, preferably roasted
2 cloves garlic
salt and pepper to taste
1/3 cup extra virgin olive oil

Add the garlic, nuts, salt, pepper, and about a third of the spinach to a food processor and pulse to a coarse mix. Add the rest of the spinach (or in batches as much as your processor can handle) and chop. Add the oil in parts and if you have some liquid available (from cooking pasta perhaps) add some now. Note, the liquid will assist in chopping so the more liquid, the finer the pesto. A quarter cup makes for a thin pesto.

Toss with pasta, top with good parmesan and enjoy.

Hell is Other People's Spreadsheets

Mon, 21 Jan 2013 00:00:00 UTC

When I started at EBRL, we had begun slowly moving to capturing and recording all subject data in various REDCap projects. REDCap is a web-based data capture system created by the Vanderbilt Institute for Clinical and Translational Research. Briefly, REDCap allows clinical and scientific groups to build, operate and use databases all through a web application. Without REDCap, these types of groups have either captured data in difficult-to-maintain spreadsheets or hired/outsourced the development and administration of a database system. Both alternatives are fraught with danger (especially when capturing personal health information (PHI)) and specifically for our lab, moving to REDCap was advantageous for the following reasons:

No more emailing spreadsheets (especially those containing PHI).
One and only one place to monitor subject status.
One and only one place from which to start analyses.
No more emailing spreadsheets.

Slowly but surely, we’ve created enough databases such that no data captured in our lab is canonically stored in a spreadsheet. Our data is more safe, more accessible, and more pliable in that REDCap can export formats understood by many statistical packages.

These advantages only scratch the surface of what putting our data in REDCap meant. Because REDCap exposes an Application Programming Interface (API) for exporting and importing data, it doesn’t take much to realize that we could automate some of the more tedious and error-prone data workflows.

In future posts I will expound upon some of the technology and infrastructure I’ve developed for the lab, how it relates to REDCap, and ultimately how it has enabled us to produce more reliable and better science.

Take home message: From our perspective, if data is not in a globally and programmatically accessible place, it might as well not exist.