Statistical Methods Primer

Welcome to the statistical methods section of this website. We have put together a brief set of tutorials on the statistical methods we used in our paper. We hope this makes the article more accessible and helps anyone who needs to replicate some or all of our code.

Kallisto: read pseudo-alignment

We used Kallisto, an excellent piece of software from Lior Pachter’s group, to perform read pseudo-alignment for each mutant we analyzed. Although a single pseudo-alignment pass is not as accurate as a complete alignment, the algorithm is so fast that the quantification can be bootstrapped. The bootstrapping is what makes it really great (well, that and the way in which they compute k-classes). We really love this software.
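Here is a minimal sketch of a typical quantification call, made from Python for convenience (the index and FASTQ file names are placeholders; we assume paired-end reads and a prebuilt index):

    import subprocess

    # Quantify paired-end reads against a prebuilt transcriptome index,
    # drawing 100 bootstrap samples of the abundance estimates.
    # Every file name here is a placeholder.
    subprocess.run(
        [
            "kallisto", "quant",
            "-i", "transcripts.idx",           # prebuilt index
            "-o", "kallisto_out",              # output directory
            "-b", "100",                       # number of bootstraps
            "reads_1.fastq", "reads_2.fastq",  # paired-end reads
        ],
        check=True,
    )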

See the paper at Nature Biotechnology: Near-optimal probabilistic RNA-seq quantification

Sleuth: differential expression analysis

Sleuth is another great piece of software from Lior Pachter’s group. This beautiful library was developed to work optimally with Kallisto-processed reads, although it can also work with other alignment tools. Sleuth performs differential expression analysis by fitting a log-linear model to explain changes in expression between the different samples. Together, Kallisto and Sleuth make a fantastic toolkit for processing RNA-seq data. Equally important, they are succinct: Kallisto is run via a single command in the terminal, and Sleuth requires about 20 lines of fairly standardized code.
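For the curious: Sleuth is an R package, and those 20 lines are, roughly speaking, mostly calls to sleuth_prep, sleuth_fit, a Wald or likelihood-ratio test (sleuth_wt or sleuth_lrt), and sleuth_results.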

See their paper at bioRxiv: Differential analysis of RNA-seq incorporating quantification uncertainty

Bayesian robust regressions for interaction prediction

Jupyter Notebook on Bayesian Robust Regressions
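To give a flavor of the idea before you open the notebook: a standard way to make a regression robust is to swap the usual Gaussian likelihood for a heavier-tailed Student-t likelihood, so outliers pull on the line much less. Below is a minimal sketch of such a model written with PyMC; it is our own illustration with placeholder data and priors, not the exact model from the notebook:

    import numpy as np
    import pymc as pm

    # Placeholder data: a straight line plus noise, with one big outlier.
    rng = np.random.default_rng(42)
    x = np.linspace(0, 1, 30)
    y = 2.0 * x + 0.5 + rng.normal(0, 0.1, size=30)
    y[5] += 3.0  # the outlier

    with pm.Model():
        slope = pm.Normal("slope", mu=0, sigma=10)  # weak placeholder priors
        intercept = pm.Normal("intercept", mu=0, sigma=10)
        sigma = pm.HalfNormal("sigma", sigma=1)
        nu = pm.Exponential("nu", 1 / 30)           # t degrees of freedom
        # Student-t likelihood: its heavy tails let the fit ignore
        # outliers far more gracefully than a Gaussian would.
        pm.StudentT("obs", nu=nu, mu=slope * x + intercept,
                    sigma=sigma, observed=y)
        trace = pm.sample(1000, tune=1000)  # posterior samples for the line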

Orthogonal distance regression

Orthogonal distance regression (ODR) is a method for finding the line of best fit when your measurements have errors in both the x- and the y-coordinates. Usually, when we have measurements with no error bars, we use a method called Least Squares to find the line of best fit. When the measurements have errors along the y-axis, we can modify this method and use Weighted Least Squares, which makes points with large errors less important and points with small errors more important. ODR is similar in the sense that it takes into account errors in both x and y and weights points accordingly.

A major difference between ordinary Least Squares and ODR is the quantity being minimized. In Least Squares, we minimize the vertical distance of the points to the line: when we have minimized the sum of the squares of the residuals, we have found the line of best fit. However, if there are errors in both x and y, we need to minimize something else. In this case, it again makes sense to find the line that minimizes the distance between the points and the line, but now with the constraint that the ruler we use to measure this distance must always be at right angles to the line itself (hence the orthogonal).

The reasoning behind this choice has been explained elsewhere; for further information on the mathematics behind regressions, we refer you to David Hogg’s excellent article, Data analysis recipes: Fitting a model to data. In the Jupyter notebook below, we quickly show that ODR is considerably better than least squares when the errors in both coordinates are known.
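If you just want the mechanics, here is a minimal sketch of an ODR fit using scipy.odr; the data and the error bars below are made-up placeholders:

    import numpy as np
    from scipy import odr

    # Placeholder data: y = 2x + 1 with noise, and known error bars
    # in both coordinates.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 20)
    y = 2 * x + 1 + rng.normal(0, 0.5, size=20)
    x_err = np.full(20, 0.3)   # known errors in x
    y_err = np.full(20, 0.5)   # known errors in y

    def line(beta, x):
        """A straight line; beta = [slope, intercept]."""
        return beta[0] * x + beta[1]

    data = odr.RealData(x, y, sx=x_err, sy=y_err)
    fit = odr.ODR(data, odr.Model(line), beta0=[1.0, 0.0])
    result = fit.run()
    print(result.beta)     # fitted slope and intercept
    print(result.sd_beta)  # their standard errors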

Jupyter Notebook on Orthogonal Regression

Bootstrapping to test a hypothesis

Parametric Bootstrap

Jupyter Notebook on Parametric Bootstrapping
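In brief: a parametric bootstrap fits a model under the null hypothesis, simulates many synthetic datasets from that fitted model, and asks how often the simulated statistic is at least as extreme as the observed one. Here is a minimal sketch of the idea with placeholder data, testing whether a sample mean is consistent with zero under an assumed normal model:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(0.4, 1.0, size=50)  # placeholder data

    # Null hypothesis: the true mean is zero. Fit the one remaining
    # free parameter of the model (the standard deviation) to the data.
    sigma_hat = x.std(ddof=1)
    observed = x.mean()

    # Simulate many datasets from the fitted null model and record the
    # statistic of interest (here, the sample mean) for each one.
    sims = rng.normal(0.0, sigma_hat, size=(10_000, x.size)).mean(axis=1)

    # Two-sided p-value: how often is a simulated mean at least as
    # extreme as the observed one?
    p = np.mean(np.abs(sims) >= np.abs(observed))
    print(p)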

Non-parametric Bootstrap

Jupyter Notebook on Non-parametric Bootstrapping
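The non-parametric version drops the distributional assumption and instead resamples the data themselves with replacement. A minimal sketch with placeholder data, testing whether two samples share a common mean:

    import numpy as np

    rng = np.random.default_rng(2)
    a = rng.normal(0.0, 1.0, size=40)  # placeholder sample 1
    b = rng.normal(0.5, 1.0, size=40)  # placeholder sample 2
    observed = a.mean() - b.mean()

    # Under the null hypothesis both samples come from the same
    # distribution, so pool them and resample with replacement.
    pooled = np.concatenate([a, b])
    diffs = np.empty(10_000)
    for i in range(diffs.size):
        a_star = rng.choice(pooled, size=a.size, replace=True)
        b_star = rng.choice(pooled, size=b.size, replace=True)
        diffs[i] = a_star.mean() - b_star.mean()

    # Two-sided p-value.
    p = np.mean(np.abs(diffs) >= np.abs(observed))
    print(p)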

Model Selection

Jupyter Notebook on Model Selection
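As a taste of the topic: one common recipe for model selection is to fit competing models and compare an information criterion such as the AIC, which rewards goodness of fit but penalizes extra parameters (lower is better). Below is a minimal sketch, our own illustration rather than necessarily the notebook’s approach, comparing a line and a parabola on placeholder data:

    import numpy as np

    rng = np.random.default_rng(3)
    x = np.linspace(-1, 1, 40)
    y = 1.0 + 2.0 * x + rng.normal(0, 0.2, size=40)  # truly linear data

    def aic(y, y_hat, k):
        """AIC for a least-squares fit with k free parameters."""
        n = y.size
        rss = np.sum((y - y_hat) ** 2)
        # Gaussian log-likelihood up to an additive constant, plus penalty.
        return n * np.log(rss / n) + 2 * k

    for degree in (1, 2):
        coeffs = np.polyfit(x, y, degree)
        y_hat = np.polyval(coeffs, x)
        print(f"degree {degree}: AIC = {aic(y, y_hat, degree + 1):.1f}")
    # The lower AIC wins; on truly linear data, the straight line should
    # usually beat the parabola despite the parabola's extra flexibility.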

Contact us

If there are broken links, or if you have any questions, you can contact us at dangeles@caltech.edu or pws@caltech.edu.