Table of Contents

Basic Statistics

In this Jupyter notebook, I will assemble the dataframe that I will use throughout the rest of the project. Then, I will calculate some relevant statistics (number of genes and isoforms identified in total, number of differentially expressed genes per genotype, average effect sizes and the percent internalization of each genotype. Finally, I will briefly show the benefits of pooling samples instead of using a general linear model (GLM) with interactions to analyze this dataset. To begin, I will import all the relevant libraries.

In [20]:
import pandas as pd
import numpy as np
import os

# import own libraries
import pretty_table as pretty

Having imported all the libraries I need, I will define the parameters that I will use for the rest of this Notebook.

In [21]:
# place the strains we will use in a list:
strains =['PS4187', 'PS4087', 'PS4176', 'EW15', 'MT4866', 'MT2124'] 
q = 0.1  # q-value cutoff

# place the strain list into a dictionary
strain_dict = {strains[i]: 1 for i in range(len(strains))}

# experimental design matrix fed into sleuth
genmap = pd.read_csv('../sleuth/rna_seq_info.txt', sep='\t', comment='#')

# strain to genotype information:
names = pd.read_csv('../input/strain_to_genotype.csv', comment='#')

Next, I will open the files containing the beta values output by sleuth for each genotype. If the genotype is not included in the list of genotypes to be analyzed, I will not open the file. Finally, I will make a single dataframe containing all of the data, and work with this unique dataframe.

In [23]:
# open all beta.csv files, and concatenate them into a single dataframe
frames = []
for root, dirs, files in os.walk("../sleuth/sleuth_strains"):
    for file in files:
        # don't open likelihood ratio test if there is one
        if file == 'lrt.csv':
            continue

        # extract the strain identifier from the filename
        strain =  file[:-4].replace('_', '-')

        # if the strain isn't in our strainlist, drop it
        if strain[2:] not in strains:
            continue

        # open the dataframe, and add strain, genotype and order columns
        df = pd.read_csv(root + '/' + file, sep=',')
        df.sort_values('target_id', inplace=True)
        df['strain'] = strain.replace('b-', '')
        df['order'] = strain_dict[df.strain.unique()[0]]

        # add the dataframe to the list
        frames += [df]

# concatenate, dropNAs
tidy = pd.concat(frames)
tidy.dropna(subset=['ens_gene', 'b', 'qval'], inplace=True)

# sort by a pre-determined value, and then by target_id. Always sort like this
tidy.sort_values(['order', 'target_id'], ascending=True, inplace=True)

# drop the first column because it doesn't contain anything.
tidy.drop([tidy.columns[0]], axis=1, inplace=True)

# final dataframe. I will use this dataframe throughout
tidy.head()
Out[23]:
target_id pval qval b se_b mean_obs var_obs tech_var sigma_sq smooth_sigma_sq final_sigma_sq ens_gene ext_gene description transcript_biotype strain order
19635 2L52.1a 0.892540 0.984461 -0.047659 0.352794 3.894122 1.016851 0.138264 0.065421 0.075103 0.075103 WBGene00007063 2L52.1 NaN protein_coding PS4176 1
17497 2L52.1a 0.755861 0.936079 0.117265 0.377153 3.894122 1.016851 0.138264 0.065421 0.075103 0.075103 WBGene00007063 2L52.1 NaN protein_coding EW15 1
13347 2L52.1a 0.609008 0.987898 0.192910 0.377153 3.894122 1.016851 0.138264 0.065421 0.075103 0.075103 WBGene00007063 2L52.1 NaN protein_coding MT2124 1
10030 2L52.1a 0.631823 1.000000 0.180717 0.377153 3.894122 1.016851 0.138264 0.065421 0.075103 0.075103 WBGene00007063 2L52.1 NaN protein_coding PS4187 1
20447 2L52.1a 0.944533 0.999687 -0.026240 0.377153 3.894122 1.016851 0.138264 0.065421 0.075103 0.075103 WBGene00007063 2L52.1 NaN protein_coding PS4087 1

Next, I need to annotate the dataframe. I will do that by using another dataframe called names that associates genotypes with strain IDs in various formats.

In [24]:
def strain_to_map(names, col='Alleles'):
    """
    A function to generate a dictionary of strain values to another column
    in the `names` dataframe.
    
    Params:
    -------
    names: pandas DataFrame. Must contain a `Strain` column
    col: string. Column name to associate values with.
    
    Output:
    strain_to: dictionary of Strain values to `col` values.
    """
    strain_to  = {names.Strain.values[i]: names[col].values[i]
                  for i in np.arange(len(names))}
    return strain_to

# generate a dictionary relating strain to the alleles in that strain:
strain_to_allele = strain_to_map(names)
# generate a dictionary relating strain to the genotype in that strain in
# code-friendly format
strain_to_genotype = strain_to_map(names, 'Genotype')

# generate a dictionary relating strain to the genotype in that strain in a
# fancy format appropriate for publication
strain_to_fancy = strain_to_map(names, 'FancyName')
In [25]:
# annotate the dataframe using the dictionaries above
tidy['fancy'] = tidy.strain.map(strain_to_fancy)
tidy['allele'] = tidy.strain.map(strain_to_allele)
tidy['genotype'] = tidy.strain.map(strain_to_allele)
In [26]:
# save this new dataframe
tidy.to_csv('../input/quantifications.csv', index=False)

Now that we have the dataframe, we can ask how many transcripts we identified in all genotypes, and how many genes these transcripts correspond to.

In [27]:
total_genes_id = tidy.target_id.unique().shape[0]

m = "Total isoforms identified in all genotypes: {0}"
print(m.format(total_genes_id))

m = "Total genes identified in all genotypes: {0}"
print(m.format(len(tidy.ens_gene.unique())))
Total isoforms identified in all genotypes: 19416
Total genes identified in all genotypes: 13704

I can also identify the number of differentially expressed genes for each genotype.

In [28]:
# subset to keep only DE genes:
sig = (tidy.qval < q)

# print table title
pretty.table_print(['Genotype', 'Strain', 'DEG'])

# group dataframe by genotypes, and count DE genes
for genotype, group in tidy[sig].groupby('genotype'):
    # extract strain name of the current group
    strain = group.strain.unique()[0]
    # find number of DE genes in group:
    deg = len(group.ens_gene.unique())
    # pretty print
    pretty.table_print([genotype, strain, deg])
Genotype            Strain              DEG                 
bx93                PS4187              266                 
bx93/sy622          PS4176              2128                
ga80                EW15                4613                
n1046               MT2124              2526                
n2021               MT4866              509                 
sy622               PS4087              2036                

Finally, I will calculate a statistic I refer to as internalization. Simply put, internalization measures how much of a small object is contained within a large object. In other words, I want to know what fraction of the transcripts that are differentially expressed in mdt-12(bx93) are also differentially expressed in med-12(sy622). The closer the internalization fraction is to 100%, the more mdt-12(bx93) is engulfed by mdt-12(sy622).

In [29]:
pretty.table_print(['pair', 'shared GENES', 'internalization',
                    'internal'], space=17)

passed = []
# group the dataframe by genotype:
for genotype1, group1 in tidy[sig].groupby('genotype'):
    # group the dataframe by genotype again.
    for genotype2, group2 in tidy[sig].groupby('genotype'):
        
        # make sure the genotypes are not the same
        if genotype1 == genotype2:
            continue
        
        # I only want to enumerate the genotype pairwise combinations once.
        # The for loop enumerates them twice.
        # So keep track of which combinations we don't need by adding to
        # a list
        passed += [genotype2 + genotype1]
        if genotype1 + genotype2 in passed:
            continue

        # find the size of each group:
        size1 = group1.shape[0]
        size2 = group2.shape[0]

        # find the overlap between both genotypes
        overlap = group1[group1.target_id.isin(group2.target_id)]
        overlap = overlap.target_id.nunique()
        
        # find whichever group is smaller:
        denominator = np.min([size1, size2])
        
        # record which group is smaller:
        if size1 < size2:
            internal = genotype1
        else:
            internal = genotype2
        
        # generate a message to pretty print
        # print the genotype combination:
        m = "{0}-{1}\t".format(genotype1, genotype2)
        # print the number of genes that overlap
        m += "{0}\t".format(overlap)
        # print the internalized fraction of genes that overlap
        m += "{0:.2g}\t".format(overlap/denominator)
        # print which genotype is internalized
        m += "{0}".format(internal)
        
        # pretty print:
        pretty.table_print(m.split('\t'), space=17)
pair             shared GENES     internalization  internal         
bx93-bx93/sy622  192              0.72             bx93             
bx93-ga80        184              0.69             bx93             
bx93-n1046       130              0.49             bx93             
bx93-n2021       56               0.21             bx93             
bx93-sy622       189              0.71             bx93             
bx93/sy622-ga80  1399             0.62             bx93/sy622       
bx93/sy622-n1046 981              0.44             bx93/sy622       
bx93/sy622-n2021 250              0.47             n2021            
bx93/sy622-sy622 836              0.39             sy622            
ga80-n1046       1947             0.76             n1046            
ga80-n2021       324              0.61             n2021            
ga80-sy622       1314             0.62             sy622            
n1046-n2021      144              0.27             n2021            
n1046-sy622      597              0.28             sy622            
n2021-sy622      263              0.5              n2021