Quantifying dominance behaviors

In this notebook, I will perform a dominance analysis on the two classes that require it, namely the bx93-associated class and the sy622-associated class. As usual, I first load all the libraries I need.

In [2]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats  # import the stats submodule explicitly for gaussian_kde below

from matplotlib import rc

rc('text', usetex=True)
rc('text.latex', preamble=r'\usepackage{cmbright}')
rc('font', **{'family': 'sans-serif', 'sans-serif': ['Helvetica']})

%matplotlib inline

# Render high-resolution (retina) PNG figures inline.
%config InlineBackend.figure_formats = {'png', 'retina'}

# JB's favorite Seaborn settings for notebooks
rc = {'lines.linewidth': 2, 
      'axes.labelsize': 18, 
      'axes.titlesize': 18, 
      'axes.facecolor': 'DFDFE5'}
sns.set_context('notebook', rc=rc)
sns.set_style("dark")

mpl.rcParams['xtick.labelsize'] = 16 
mpl.rcParams['ytick.labelsize'] = 16 
mpl.rcParams['legend.fontsize'] = 14

Master variables

In [3]:
# q-value significance threshold
q = 0.1
tidy = pd.read_csv('../output/SI1_dpy_22_results.csv')

I will measure dominance class-wide using a fairly standard trick. I will assume that all of the transcripts in a class are controlled by exactly the same mechanism. It then stands to reason that the dominance relationship between the two alleles should always be the same. I will define dominance in terms of expression level, as a continuous variable between 0 and 1. Specifically, I will model the expression of the heterozygote as a linear combination of the expression levels of the two homozygotes, subject to the constraint that the mixing factors in the linear combination add up to unity. Mathematically,

$$ \beta_{sy622/bx93, i} = \hat{D} \cdot \begin{bmatrix} \beta_{sy622/sy622, i} \\ \beta_{bx93/bx93, i} \\ \end{bmatrix}, $$

where $\hat{D} = [d, 1-d]$ contains the mixing factors. I will require that the same matrix $\hat{D}$ hold for all transcripts $i$, since we want to believe that each class is a single phenotype, not a combination of them, and that phenotypes arise by a single mechanism. Under this assumption, the problem reduces to a linear regression in which a single parameter, $d$, must be found. I will solve this regression using Bayes' theorem: assuming that errors are normally distributed in $\beta$ space and that $d$ has a uniform prior, I will numerically find the $d$ value that minimizes the negative logarithm of the likelihood. This is the Maximum Likelihood Estimate (MLE) of the dominance coefficient.
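Concretely, with normally distributed errors the quantity minimized by the code below is, up to an additive constant that does not depend on $d$,

$$ -\log L(d) = \sum_i \frac{\left(\beta_{sy622/bx93, i} - \left[d\,\beta_{sy622/sy622, i} + (1-d)\,\beta_{bx93/bx93, i}\right]\right)^2}{2\sigma_i^2}, $$

where $\sigma_i$ is the standard error of the heterozygote coefficient for transcript $i$.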

In [4]:
def log_likelihood(d, x, y, xy, xy_se):
    """
    Assumes normally distributed errors and returns the log-likelihood
    of the observations given a dominance parameter d (up to an additive
    constant that does not depend on d).
    
    Params:
    d - a number between 0 and 1
    x, y - homozygote expression measurements
    xy - heterozygote expression measurements
    xy_se - standard error of each point in xy
    
    Output:
    log-likelihood
    """
    # model the heterozygote as a weighted average of the homozygotes
    model = d*x + (1-d)*y
    
    # squared residuals, weighted by the measurement variance
    chi2 = (xy - model)**2
    sigma = xy_se
    
    # return the log-likelihood, dropping the constant normalization term
    return -(chi2/(2*sigma**2)).sum()

def neg_log_likelihood(d, x, y, xy, xy_se):
    """Convenience wrapper: the negative log-likelihood, for minimization."""
    return -log_likelihood(d, x, y, xy, xy_se)
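As a quick sanity check (not part of the original analysis; all values below are synthetic and made up for illustration), the likelihood should peak at the true mixing factor when heterozygote data are simulated directly from the model:

In [ ]:
# sanity check on synthetic data: recover a known dominance coefficient
rng = np.random.RandomState(0)
x_sim = rng.normal(size=200)
y_sim = rng.normal(size=200)
d_true = 0.3
se_sim = 0.1*np.ones(200)
xy_sim = d_true*x_sim + (1 - d_true)*y_sim + rng.normal(scale=0.1, size=200)

grid = np.linspace(0, 1, 1001)
nll = np.array([neg_log_likelihood(di, x_sim, y_sim, xy_sim, se_sim)
                for di in grid])
print('recovered d:', grid[nll.argmin()])  # should be close to 0.3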

Dominance behaviors of the bx93-associated class

I will begin by finding the dominance coefficient of the bx93-associated class. First, I will extract all of the transcripts that are in this class, regardless of their q-values.

In [5]:
def pheno_class(phenotype):
    """Wrapper for column selection."""
    return (tidy['phenotypic class'] == phenotype)

def fetch_vars(phenotypic_class):
    """Extract homozygote and heterozygote b values for a phenotypic class."""
    # extract b values for each genotype
    # (x: sy622 homozygote, y: bx93 homozygote, xy: sy622/bx93 heterozygote)
    x = tidy[pheno_class(phenotypic_class) &
             (tidy.strain == 'PS4087')].b.values
    y = tidy[pheno_class(phenotypic_class) &
             (tidy.strain == 'PS4187')].b.values
    xy = tidy[pheno_class(phenotypic_class) &
              (tidy.strain == 'PS4176')].b.values
    
    # extract the standard errors of the heterozygote as well:
    se_xy = tidy[pheno_class(phenotypic_class) &
                 (tidy.strain == 'PS4176')].se_b.values
    
    return x, y, xy, se_xy

params = fetch_vars('bx93 associated')
x, y, xy, se_xy = params

For convenience, I will also define a plotting function called dominance_distribution. This function will plot the likelihood (or the log-likelihood, if its magnitude is too large) and mark the $d$ value that maximizes it. It will also return the maximum likelihood estimate and its associated error.

In [6]:
def dominance_distribution(x, y, xy, se_xy):
    """
    Plots the posterior distribution of the dominance coefficient
    (or its log, if the magnitudes are too large) and returns the
    mean and standard deviation of that distribution.
    
    Params:
    -------
    x, y, xy -- numpy ndarrays of the same size
    se_xy -- standard errors for the entries in xy
    
    Output:
    -------
    mean and std of the dominance distribution given the data
    """
    # generate dominance coefficients from 0 to 1
    D = np.linspace(0, 1, 1000)
    # get the negative log likelihood at each dominance level 
    Y = np.array([neg_log_likelihood(di, x, y, xy, se_xy) for di in D])

    # make sure the neg log likelihood isn't very large
    if (np.abs(Y) > 2.5*10**3).any():
        print('Warning, log-likelihood has very large absolute values.' + \
              ' Arbitrarily subtracting Y.mean() from Y')
        Y = Y - Y.mean()

    # prepare a plot
    fig, ax = plt.subplots()

    # decide whether to plot the log likelihood or the likelihood
    if (np.abs(Y) > 10**3).any():
        print('Warning, after renormalization, the likelihood ' +\
              'magnitudes remain large. Plotting log-likelihood instead.')

        # plot the log likelihood
        plt.plot(D, -Y)

        # label:
        plt.xlabel(r'Fractional Dominance, $d$')
        plt.ylabel(r'$\log{P(d | Data)}$')

        # find the d value that maximizes the likelihood (minimizes Y)
        mean = D[np.where(Y == Y.min())][0]

        # estimate the standard deviation:
        print('Estimating standard deviation. Estimation is not exact.')
        # truncate Y to only the points near the peak (this selection
        # assumes Y.min() < 0, which holds after the mean subtraction above):
        Yprime = Y[np.where(Y < 0.8*Y.min())] - Y.min()

        # estimate the standard deviation using the truncated Y.
        # first convert to probabilities: P(d) = e^{-Y'}/sum_d e^{-Y'}
        expY = np.exp(-Yprime)/np.exp(-Yprime).sum()

        # approximate sigma^2 = <(d - mean)^2> under this distribution
        std = np.sqrt(((D[np.where(Y < 0.8*Y.min())] - mean)**2*expY).sum())
    else:
        # convert to probabilities: P(d) = e^{-Y}/sum_d e^{-Y}
        expY = np.exp(-Y)/np.exp(-Y).sum()
        # get the mean D:
        mean = (expY*D).sum()
        # std of D:
        std = np.sqrt(((D - mean)**2*expY).sum())
        
        # plot
        plt.plot(D, expY)
        plt.xlabel(r'Fractional Dominance, $d$')
        plt.ylabel(r'$P(d | Data)$')

    plt.axvline(mean, color='red', ls='--')
    
    return mean, std

Let's go ahead and calculate $d$ for the sy622 allele in the bx93-associated class:

In [7]:
mean, std = dominance_distribution(x, y, xy, se_xy)
print('dominance: {0:.2f} +/- {1:.2f}'.format(mean, std))
Warning, after renormalization, the likelihood magnitudes remain large. Plotting log-likelihood instead.
Estimating standard deviation. Estimation is not exact.
dominance: 0.18 +/- 0.00
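As an independent check on the grid search (a sketch, not part of the original notebook), the same MLE can be recovered by bounded minimization of the negative log-likelihood with scipy:

In [ ]:
from scipy.optimize import minimize_scalar

# minimize the negative log-likelihood over d in [0, 1]
res = minimize_scalar(neg_log_likelihood, bounds=(0, 1), method='bounded',
                      args=(x, y, xy, se_xy))
print('MLE of d: {0:.2f}'.format(res.x))  # should agree with the grid search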

Next, I would also like to plot the predicted expression level, $\hat{D} \cdot \hat{\beta}_{i/i}$, versus the observed expression level of the heterozygote, $\beta_{i/j}$. This plot is diagnostic, in the sense that the points should show linear behavior with no systematic residuals. If that is the case, we may be more inclined to believe there is a single dominance coefficient. For this plot, I will color the points according to their local density in the graph (yellow for high density, purple for low), and I will make the size of each point inversely proportional to the square root of the standard error of the observed coefficient. The more certain we are of a given point, the larger it will be.

In [12]:
def density_color(x, y):
    """A wrapper function to color points according to local density"""
    points = np.vstack([x, y])
    z = scipy.stats.gaussian_kde(points)(points)
    return z

def dominance_plot(x, y, xy, se_xy, mixing_factor, s0=3, **kwargs):
    """
    Given a set of measurements and a mixing factor,
    make a dominance plot.
    
    Params:
    -------
    x, y, xy -- numpy ndarrays of the same size
    se_xy -- standard errors for the xy vector
    mixing_factor -- the mixing factor for the linear combination of x and y
    s0 -- size of the points when se_xy is 1
    **kwargs -- matplotlib scatter kwargs
    
    Output:
    -------
    ax - A matplotlib axis
    
    """
    # point sizes will scale as 1/sqrt(standard error)
    size = np.sqrt(se_xy)
    
    fig, ax = plt.subplots()
    
    # plot heterozygote values vs the model prediction
    ax.scatter(mixing_factor*x + (1-mixing_factor)*y, xy, s=s0/size, **kwargs)

    # get limits
    lims = [
        np.min([ax.get_xlim(), ax.get_ylim()]),  # min of both axes
        np.max([ax.get_xlim(), ax.get_ylim()]),  # max of both axes
    ]
    
    # plot the identity line, y = x
    ax.plot(lims, lims, 'k-', alpha=1, zorder=0)
    
    # labels (use the mixing factor that was actually plotted)
    sy_dom = '{0:.2g}'.format(mixing_factor)
    bx_dom = '{0:.2g}'.format(1 - mixing_factor)
    xlabel = sy_dom + r'$\cdot\beta_{sy622} + $' + bx_dom
    xlabel += r'$\cdot\beta_{bx93}$'
    plt.xlabel(xlabel)
    plt.ylabel(r'$\beta_{sy622/bx93}$')
    
    return ax

density = density_color(x, y)
ax = dominance_plot(x, y, xy, se_xy, mean, alpha=1,
                    cmap='viridis', c=density, s0=5)

plt.xticks([-6, -3, 0, 3, 6])
plt.yticks([-6, -3, 0, 3, 6])

plt.savefig('../output/bx93_associated_dominance.svg', bbox_inches='tight')

For the most part, it looks good! There are a couple of far-off points, but those are quite small, indicating large standard errors and therefore high uncertainty. It seems there really might be only a single dominance behavior for this class.
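To make the "no systematic residuals" check more explicit, we could also plot the residuals against the predicted values (a sketch, not part of the original notebook; a flat, structureless band around zero supports a single dominance coefficient):

In [ ]:
# residuals between the observed heterozygote coefficients and the model
model = mean*x + (1 - mean)*y
residuals = xy - model

fig, ax = plt.subplots()
ax.scatter(model, residuals, s=5, c=density, cmap='viridis')
ax.axhline(0, color='red', ls='--')
plt.xlabel(r'$\hat{D}\cdot\hat{\beta}$ (predicted)')
plt.ylabel(r'Residual')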

Dominance behaviors of the sy622-associated class

In [13]:
params = fetch_vars('sy622 associated')
x, y, xy, se_xy = params
In [14]:
mean, std = dominance_distribution(x, y, xy, se_xy)
print('dominance: {0:.2f} +/- {1:.2f}'.format(mean, std))
Warning, after renormalization, the likelihood magnitudes remain large. Plotting log-likelihood instead.
Estimating standard deviation. Estimation is not exact.
dominance: 0.52 +/- 0.00
In [15]:
density = density_color(x, y)
ax = dominance_plot(x, y, xy, se_xy, mean, alpha=1,
                    c=density, cmap='viridis', s0=5)

plt.xlim(-4, 4)
plt.ylim(-4, 4)
plt.xticks([-4, 0, 4])
plt.yticks([-4, 0, 4])

plt.savefig('../output/sy622_associated_dominance.svg', bbox_inches='tight')