In this notebook, I will put forward the concept of a phenotypic class, and I will show how to find these classes in this data. As always, the first step is to load in the libraries
import pandas as pd
import numpy as np
# import own libraries
import epistasis as epi
tidy = pd.read_csv('../input/quantifications.csv')
# set statistical significance
q = 0.1
The method by which I defined the phenotypic classes was outlined in the main manuscript, so I will not go into depth on how to do it here. Suffice to say that the molecular lesions strongly suggest a null hypothesis for how the sets should look, and false-positive and false-negative analyses suggest that the Venn diagram, and therefore the phenotypic classes, follow the expected null hypothesis. Here, I assign the transcripts to those classes suggested by the null hypothesis. As a first step, I find those genes that are differentially expressed in any combination of two genotypes.
# find all pairwise overlaps
overlap_sy622_het = epi.find_overlap(tidy, ['PS4087', 'PS4176'], q=q)
overlap_bx93_het = epi.find_overlap(tidy, ['PS4187', 'PS4176'], q=q)
overlap_sy622_bx93 = epi.find_overlap(tidy, ['PS4187', 'PS4087'], q=q)
Next, I use the same definitions as in the main text to combine (or exclude) genes from each pairwise overlap into phenotypic classes.
# The following code finds the desired transcripts (isoforms) in each class and puts them into numpy arrays:
def target(selection):
"""Wrapper function for boolean logic."""
return tidy.target_id.isin(selection)
sig = (tidy.qval < q)
sy622_associated = tidy[target(overlap_sy622_het) &
(~target(overlap_sy622_bx93))].target_id.unique()
bx93_associated = tidy[target(overlap_bx93_het) |
(target(overlap_sy622_bx93))].target_id.unique()
bx93_specific = tidy[(~target(overlap_sy622_bx93)) &
(~target(overlap_bx93_het)) &
sig &
(tidy.strain == 'PS4187')].target_id.unique()
sy622_specific = tidy[(~target(overlap_sy622_bx93)) &
(~target(overlap_sy622_het)) &
sig &
(tidy.strain == 'PS4087')].target_id.unique()
transhet_specific = tidy[(~target(overlap_bx93_het)) &
(~target(overlap_sy622_het)) &
sig &
(tidy.strain == 'PS4176')].target_id.unique()
Make a dictionary where each class is a key for its numpy array.
# Make a dictionary
classes = {'sy622 associated': sy622_associated,
'bx93 associated': bx93_associated,
'sy622 specific': sy622_specific,
'bx93 specific': bx93_specific,
'transhet specific': transhet_specific}
We also need the inverse dictionary, where each transcript is the key for its class. This will allow us to annotate the dataframe with a new column, phenotypic class. While we're at it, let's also print out how many transcripts exist in each phenotypic class.
inv_classes = {}
for key, value in classes.items():
n = len(tidy[tidy.target_id.isin(value)].ens_gene.unique())
print(key, n)
for v in value:
inv_classes[v] = key
Annotate the dataframe with the information on phenotypic classes, and save it to a file.
# label transcripts with the phenotypic class they belong to
tidy['phenotypic class'] = tidy.target_id.map(inv_classes)
# save to a dataframe
tidy.to_csv('../output/SI1_dpy_22_results.csv')