In this notebook, I will perform a dominance analysis on the two classes that require it, namely the bx93-associated class and the sy622-associated class. As usual, I first load all the libraries I need.
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as scipy
from matplotlib import rc
rc('text', usetex=True)
rc('text.latex', preamble=r'\usepackage{cmbright}')
rc('font', **{'family': 'sans-serif', 'sans-serif': ['Helvetica']})
%matplotlib inline
# This enables SVG graphics inline.
%config InlineBackend.figure_formats = {'png', 'retina'}
# JB's favorite Seaborn settings for notebooks
rc = {'lines.linewidth': 2,
'axes.labelsize': 18,
'axes.titlesize': 18,
'axes.facecolor': 'DFDFE5'}
sns.set_context('notebook', rc=rc)
sns.set_style("dark")
mpl.rcParams['xtick.labelsize'] = 16
mpl.rcParams['ytick.labelsize'] = 16
mpl.rcParams['legend.fontsize'] = 14
Master variables
q = 0.1
tidy = pd.read_csv('../output/SI1_dpy_22_results.csv')
I will measure dominance class-wide by using a fairly standard trick. I will assume that all of the transcripts in the class are controlled via exactly the same mechanism. Then, it stands to reason that the dominance relationship between the two alleles should always be the same. Dominance will be defined in terms of expression level, and dominance is a continuous variable between 0 and 1. I will consider the expression of the heterozygote to be a linear combination of the expression levels of the two homozygotes, subject to the constraint that the mixing factors in the linear combination add up to unity. Mathematically, this can be expressed as: $$ \beta_{sy622/bx93, i} = \hat{D} \cdot \begin{bmatrix} \beta_{sy622/sy622, i} \\ \beta_{bx93/bx93, i} \\ \end{bmatrix}, $$ where $\hat{D}$ contains the mixing factors, $d$ and $1-d$. For our system, I will ask that the same matrix $\hat{D}$ hold for all the transcripts $i$, since we want to believe that this class is a single phenotype, not a combination of them, and that phenotypes arise by a single mechanism. In this case, the problem can be expressed as a linear regression, where a single parameter, $d$ must be found. I will solve this linear regression using Bayes theorem. We will assume that errors are normally distributed in $\beta$ space, and that $d$ has a uniform prior. Then, I will numerically solve for the $d$ value that minimizes the negative logarithm of the likelihood. This is the Maximume Likelihood Estimate of the dominance coefficient.
def log_likelihood(d, x, y, xy, xy_se):
"""
Assumes a Normal Distribution and returns the log-likelihood of the
events given a parameter X
Params:
d - a number between 0 and 1
x, y, xy - measurements
xy_se - std of each point xy
Output:
log-likelihood
"""
# calculate weighted average
model = d*x + (1-d)*y
chi2 = (xy - model)**2
sigma = xy_se
#return log model
return -(chi2/(2*sigma**2)).sum()
def neg_log_likelihood(d, x, y, xy, xy_se):
return -log_likelihood(d, x, y, xy, xy_se)
I will begin by finding the dominance coefficient of the bx93-associated class. First, I will extract all of the transcripts that are in this class, regardless of their q-values.
def pheno_class(phenotype):
"""Wrapper for column selection."""
return (tidy['phenotypic class'] == phenotype)
def fetch_vars(phenotypic_class):
# extract b values for each genotype:
x = tidy[pheno_class(phenotypic_class) &
(tidy.strain == 'PS4087')].b.values
y = tidy[pheno_class(phenotypic_class) &
(tidy.strain == 'PS4187')].b.values
xy = tidy[pheno_class(phenotypic_class) &
(tidy.strain == 'PS4176')].b.values
# extract standard errors as well:
se_xy = tidy[pheno_class(phenotypic_class) &
(tidy.strain == 'PS4176')].se_b.values
# return
return x, y, xy, se_xy
params = fetch_vars('bx93 associated')
x, y, xy, se_xy = params
For simplicity, I will also define a plotting function called dominance_distribution
. This distribution will plot the log-likelihood and show the $d$ value that maximizes it. It will also return the MLE estimate and the associated error of the MLE.
def dominance_distribution(x, y, xy, se_xy):
"""
Returns the dominance distribution for a set of measurements.
Params:
-------
x, y, xy -- numpy ndarrays of the same size
se_xy -- standard error for the entries in xy
Output:
-------
mean and stdev of the dominance distribution given the data
"""
# generate dominance coefficients from 0 to 1
D = np.linspace(0, 1, 1000)
# get the negative log likelihood at each dominance level
Y = np.array([neg_log_likelihood(di, x, y, xy, se_xy) for di in D])
# make sure the neg log likelihood isn't very large
if (np.abs(Y) > 2.5*10**3).any():
print('Warning, log-likelihood has very large absolute values.' + \
' Arbitrarily subtracting Y.mean() from Y')
Y = Y - Y.mean()
# prepare a plot
fig, ax = plt.subplots()
# decide whether to plot the log likelihood or the likelihood
if (np.abs(Y) > 10**3).any():
print('Warning, after renormalization, the likelihood ' +\
'magnitudes remain large. Plotting log-likelihood instead.')
# plot the log likelihood
plt.plot(D, -Y)
# label:
plt.xlabel(r'Fractional Dominance, $d$')
plt.ylabel(r'$\log{P(d | Data)}$')
# find the max log likelihood
mean = D[np.where(Y == Y.min())][0]
# estimate the standard deviation:
print('Estimating standard deviation. Estimation is not exact.')
# truncate Y to only the points near the peak:
Yprime = Y[np.where(Y < 0.8*Y.min())] - Y.min()
# estimate the standard deviation using the truncated Y
# first calculate P(y) = e^y/sum_y e^y
expY = np.exp(-Yprime)/np.exp(-Yprime).sum()
# find the standard deviation by approximating
# <sigma^2> = <(x - \bar{x})^2>
std = np.sqrt(((D[np.where(Y < 0.8*Y.min())] - mean)**2*expY).sum())
else:
# probability of Y = e^y/sum_y e^y
expY = np.exp(-Y)/np.exp(-Y).sum()
# get the mean D:
mean = (expY*D).sum()
# std of D:
std = np.sqrt(((D - mean)**2*expY).sum())
# plot
plt.plot(D, expY)
plt.xlabel(r'Fractional Dominance, $d$')
plt.ylabel(r'$P(d | Data)$')
plt.axvline(mean, color='red', ls='--')
return mean, std
Let's go ahead and calculate $d$ for the sy622 allele in the bx93-associated class:
mean, std = dominance_distribution(x, y, xy, se_xy)
print('dominance: {0:.2f} +/- {1:.2f}'.format(mean, std))
Next, I would also like to make a plot of the predicted expression level, $\hat{D} \cdot \hat{\beta}_{i/i}$, vs the observed expression level of the heterozygote, $\beta_{i/j}$. This plot is diagnostic, in the sense that the points should show linear behavior with no systematic residuals. If this is the case, we may be more inclined to believe there is a single dominance coefficient. For this plot, I will color the points according to the local density of the points in the graph (yellow for high density/ purple for low) and I will make the size of the point inversely proportional to some measurement of the uncertainty in the observed coefficients. The more certainty we have in a given point, the larger it will be.
def density_color(x, y):
"""A wrapper function to color points according to local density"""
points = np.vstack([x, y])
z = scipy.stats.gaussian_kde(points)(points)
return z
def dominance_plot(x, y, xy, se_xy, mixing_factor, s0=3, **kwargs):
"""
Given a set of measurements and a mixing factor,
make a dominance plot.
Params:
-------
x, y, xy -- numpy ndarrays of the same size
se_xy -- standard errors for the xy vector
mixing_factor -- the mixing factor for the linear combination of x and y
s0 -- size of the points when se_xy is 1
**kwargs -- matplotlib scatter kwargs
Output:
-------
ax - A matplotlib axis
"""
size = np.sqrt(se_xy)
fig, ax = plt.subplots()
# plot of heterozygote values vs model
ax.scatter(mixing_factor*x + (1-mixing_factor)*y, xy, s=s0/size, **kwargs)
# get limits
lims = [
np.min([ax.get_xlim(), ax.get_ylim()]), # min of both axes
np.max([ax.get_xlim(), ax.get_ylim()]), # max of both axes
]
# axis limits
ax.plot(lims, lims, 'k-', alpha=1, zorder=0)
# labels
sy_dom = '{0:.2g}'.format(mean)
bx_dom = '{0:.2g}'.format(1 - mean)
xlabel = sy_dom + r'$\cdot\beta_{sy622} + $' + bx_dom
xlabel += r'$\cdot\beta_{bx93}$'
plt.xlabel(xlabel)
plt.ylabel(r'$\beta_{{sy622/bx93}}$')
return ax
density = density_color(x, y)
ax = dominance_plot(x, y, xy, se_xy, 0.2, alpha=1,
cmap='viridis', c=density, s0=5)
plt.xticks([-6, -3, 0, 3, 6])
plt.yticks([-6, -3, 0, 3, 6])
plt.savefig('../output/bx93_associated_dominance.svg', bbox_inches='tight')
For the most part, it looks good! Couple of far off points, but those are quite small, indicating high variance and therefore high uncertainty. It seems there really might only be a single dominance behavior for this class.
params = fetch_vars('sy622 associated')
x, y, xy, se_xy = params
mean, std = dominance_distribution(x, y, xy, se_xy)
print('dominance: {0:.2f} +/- {1:.2f}'.format(mean, std))
density = density_color(x, y)
ax = dominance_plot(x, y, xy, se_xy, mean, alpha=1,
c=density, cmap='viridis', s0=5)
plt.xlim(-4, 4)
plt.ylim(-4, 4)
plt.xticks([-4, 0, 4])
plt.yticks([-4, 0, 4])
plt.savefig('../output/sy622_associated_dominance.svg', bbox_inches='tight')