Identifying phenotypic classes in transcriptomic data

In this notebook, I will put forward the concept of a phenotypic class, and I will show how to find these classes in this data. As always, the first step is to load in the libraries

In [1]:
import pandas as pd
import numpy as np

# import own libraries
import epistasis as epi

tidy  = pd.read_csv('../input/quantifications.csv')

# set statistical significance
q = 0.1

Defining the phenotypic classes

The method by which I defined the phenotypic classes was outlined in the main manuscript, so I will not go into depth on how to do it here. Suffice to say that the molecular lesions strongly suggest a null hypothesis for how the sets should look, and false-positive and false-negative analyses suggest that the Venn diagram, and therefore the phenotypic classes, follow the expected null hypothesis. Here, I assign the transcripts to those classes suggested by the null hypothesis. As a first step, I find those genes that are differentially expressed in any combination of two genotypes.

In [2]:
# find all pairwise overlaps
overlap_sy622_het = epi.find_overlap(tidy, ['PS4087', 'PS4176'], q=q)
overlap_bx93_het = epi.find_overlap(tidy, ['PS4187', 'PS4176'], q=q)
overlap_sy622_bx93 = epi.find_overlap(tidy, ['PS4187', 'PS4087'], q=q)

Next, I use the same definitions as in the main text to combine (or exclude) genes from each pairwise overlap into phenotypic classes.

In [3]:
# The following code finds the desired transcripts (isoforms) in each class and puts them into numpy arrays:

def target(selection):
    """Wrapper function for boolean logic."""
    return tidy.target_id.isin(selection)


sig = (tidy.qval < q)

sy622_associated = tidy[target(overlap_sy622_het) &
                        (~target(overlap_sy622_bx93))].target_id.unique()

bx93_associated = tidy[target(overlap_bx93_het) |
                       (target(overlap_sy622_bx93))].target_id.unique()

bx93_specific = tidy[(~target(overlap_sy622_bx93)) & 
                     (~target(overlap_bx93_het)) &
                      sig &
                     (tidy.strain == 'PS4187')].target_id.unique()

sy622_specific = tidy[(~target(overlap_sy622_bx93)) &
                      (~target(overlap_sy622_het)) &
                      sig &
                      (tidy.strain == 'PS4087')].target_id.unique()

transhet_specific = tidy[(~target(overlap_bx93_het)) &
                         (~target(overlap_sy622_het)) & 
                         sig &
                         (tidy.strain == 'PS4176')].target_id.unique()

Make a dictionary where each class is a key for its numpy array.

In [4]:
# Make a dictionary
classes = {'sy622 associated': sy622_associated,
           'bx93 associated': bx93_associated,
           'sy622 specific': sy622_specific,
           'bx93 specific': bx93_specific,
           'transhet specific': transhet_specific}

We also need the inverse dictionary, where each transcript is the key for its class. This will allow us to annotate the dataframe with a new column, phenotypic class. While we're at it, let's also print out how many transcripts exist in each phenotypic class.

In [5]:
inv_classes = {}

for key, value in classes.items():
    n = len(tidy[tidy.target_id.isin(value)].ens_gene.unique())
    print(key, n)
    
    for v in value:
        inv_classes[v] = key
sy622 associated 665
bx93 associated 229
sy622 specific 1213
bx93 specific 37
transhet specific 1302

Annotate the dataframe with the information on phenotypic classes, and save it to a file.

In [19]:
# label transcripts with the phenotypic class they belong to
tidy['phenotypic class'] = tidy.target_id.map(inv_classes)
# save to a dataframe
tidy.to_csv('../output/SI1_dpy_22_results.csv')