Tutorial

[1]:
from IPython.display import display, Image
%matplotlib inline
import matplotlib.pyplot as plt
[2]:
# these lines of code are to be used if using the binder instance. Ignore otherwise!
import sys
sys.path.append('../../../')
from magine.copy_sample_dbs import copy_sample_databases
# This uses the a cached version of the databases to speed up the tutorial
copy_sample_databases()
Database directories already exist. Override using force=True argument. Skipping...
[3]:
import pandas as pd
import seaborn as sns
import numpy as np
c:\users\pinojc\miniconda3\envs\magine_37\lib\site-packages\statsmodels\tools\_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm

ExperimentalData

Since MAGINE is built for multi-sample, multi-omics data, it is no surprise that the data is the most important aspect. Here we should how to use the :py:class:ExperimentalData class. We designed MAGINE data input to be as flexible as possible, requiring a standard format for 8 columns. Users are required to format the input files to share the same column names. Additional columns can still be used for additional tags on the data.

For this tutorial, we are going to use our time series multi-omic response of A549 cells to cisplatin.

The description of the experiments and dataset can be found in Norris, Jeremy L., et al.

[4]:
Image('data_sample.png')
[4]:
_images/Tutorial_6_0.png
[5]:
# load the experimental data
from magine.data.experimental_data import load_data

exp_data = load_data('Data/norris_et_al_2017_cisplatin_data.csv.gz', low_memory=False)
[6]:
help(exp_data)
Help on ExperimentalData in module magine.data.experimental_data object:

class ExperimentalData(builtins.object)
 |  ExperimentalData(data_file)
 |
 |  Manages all experimental data
 |
 |  Methods defined here:
 |
 |  __getitem__(self, name)
 |
 |  __init__(self, data_file)
 |      Parameters
 |      ----------
 |      data_file : str, pandas.DataFrame
 |          Name of file, generally csv.
 |          If provided a str, the file will be read in as a pandas.DataFrame
 |
 |  __setattr__(self, name, value)
 |      Implement setattr(self, name, value).
 |
 |  create_summary_table(self, sig=False, index='identifier', save_name=None, plot=False, write_latex=False)
 |      Creates a summary table of data.
 |
 |
 |      Parameters
 |      ----------
 |      sig: bool
 |          Flag to summarize significant species only
 |      save_name: str
 |          Name to save csv and .tex file
 |      index: str
 |         Index for counts
 |      plot: bool
 |          If you want to create a plot of the table
 |      write_latex: bool
 |          Create latex file of table
 |
 |
 |      Returns
 |      -------
 |      pandas.DataFrame
 |
 |  get_measured_by_datatype(self)
 |      Returns dict of species per data type
 |
 |      Returns
 |      -------
 |      dict
 |
 |  subset(self, species, index='identifier')
 |      Parameters
 |      ----------
 |      species : list, str
 |          List of species to create subset dataframe from
 |      index : str
 |          Index to filter based on provided 'species' list
 |
 |      Returns
 |      -------
 |      magine.data.experimental_data.Species
 |
 |  volcano_analysis(self, out_dir, use_sig_flag=True, p_value=0.1, fold_change_cutoff=1.5)
 |      Creates a volcano plot for each experimental method
 |
 |      Parameters
 |      ----------
 |      out_dir: str, path
 |          Path to where the output figures will be saved
 |      use_sig_flag: bool
 |          Use significant flag of data
 |      p_value: float, optional
 |          p value criteria for significant
 |          Will not be used if use_sig_flag
 |      fold_change_cutoff: float, optional
 |          fold change criteria for significant
 |          Will not be used if use_sig_flag
 |
 |      Returns
 |      -------
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)
 |
 |  compounds
 |      Only compounds in data
 |
 |      Returns
 |      -------
 |      Sample
 |
 |  exp_methods
 |      List of source columns
 |
 |  genes
 |      All data tagged with gene
 |
 |      Includes protein and RNA.
 |
 |      Returns
 |      -------
 |
 |  proteins
 |      Protein level data
 |
 |      Tagged with "gene" identifier that is not RNA
 |
 |      Returns
 |      -------
 |
 |  rna
 |      RNA level data
 |
 |      Tagged with "RNA"
 |
 |      Returns
 |      -------
 |
 |  sample_ids
 |      List of sample_ids
 |
 |  species
 |      Returns data in Sample format
 |
 |      Returns
 |      -------
 |      Sample

About the data

There are 4 time points and 6 experimental platforms.

[7]:
exp_data.sample_ids
[7]:
['01hr', '06hr', '24hr', '48hr']
[8]:
exp_data.exp_methods
[8]:
['rna_seq', 'ph_silac', 'label_free', 'silac', 'C18', 'HILIC']

We created a function to quickly get the numbers of measured per time point and platform. Additional arguments can be used to get counts of only signficantly changes species (marked by signficant_flag column of input data).

[9]:
display(exp_data.create_summary_table())
display(exp_data.create_summary_table(sig=True))
sample_id 01hr 06hr 24hr 48hr Total Unique Across
source
C18 522 227 653 685 1402
HILIC 471 605 930 613 1504
label_free 2766 2742 2551 2261 3447
ph_silac 2608 3298 3384 3236 5113
rna_seq 18741 19104 19992 - 20642
silac 2923 3357 3072 3265 4086
sample_id 01hr 06hr 24hr 48hr Total Unique Across
source
C18 522 227 653 685 1402
HILIC 471 605 930 613 1504
label_free 196 46 271 874 1085
ph_silac 514 888 1227 851 2278
rna_seq 73 1999 12215 - 12340
silac 38 52 228 266 485

MAGINE uses the identifier column as the default index. This keeps things simple when using the output for other tools (passing to molecular networks). You can also pass an index argument to calculate other values. Here, we use the label column, which contains PTMs of our protein species. See how silac values do not change, but there is an increase number of ph_silac unique species.

[10]:
display(exp_data.create_summary_table(sig=True, index='label'))
sample_id 01hr 06hr 24hr 48hr Total Unique Across
source
C18 528 227 657 689 1412
HILIC 479 611 941 621 1521
label_free 201 46 281 911 1149
ph_silac 594 1370 2414 1368 4757
rna_seq 73 1999 12215 - 12340
silac 38 52 228 266 485

Filter by category (experimental method)

We can access the input data using the .species property. This returns a modified pandas.Datatable.

MAGINE uses the species_type and source column name to split data into compounds, genes (includes species_type==gene), rna (includes species_type==gene, source == rna), or protein (species_type==gene, source != rna). They can be accessed with the “.prefix”, such as

[11]:
exp_data.genes.head(5)
[11]:
identifier label species_type fold_change p_value significant sample_id source
0 HOXD1 HOXD1_rnaseq protein -520.256762 0.00102 True 06hr rna_seq
1 MIR7704 MIR7704_rnaseq protein -520.256762 0.00102 True 06hr rna_seq
2 AC078814.1 AC078814.1_rnaseq protein -76.022260 0.00102 True 06hr rna_seq
3 PPM1H PPM1H_rnaseq protein -76.022260 0.00102 True 06hr rna_seq
4 PLCH1 PLCH1_rnaseq protein -17.888990 0.00102 True 06hr rna_seq
[12]:
exp_data.compounds.head(5)
[12]:
identifier label species_type fold_change p_value significant sample_id source
128152 HMDB0036114 (-)-3-Thujone metabolites 1.6 2.100000e-02 True 06hr C18
128153 HMDB0001320 (13E)-11a-Hydroxy-9,15-dioxoprost-13-enoic acid metabolites 88.8 5.800000e-12 True 24hr C18
128154 HMDB0012113 (22Alpha)-hydroxy-campest-4-en-3-one metabolites 100.0 9.500000e-04 True 48hr HILIC
128155 HMDB0010361 (23S)-23,25-dihdroxy-24-oxovitamine D3 23-(bet... metabolites -100.0 1.000000e-12 True 48hr C18
128156 HMDB0011644 (24R)-Cholest-5-ene-3-beta,7-alpha,24-triol metabolites 1.6 7.400000e-05 True 01hr C18

Similarily, we can also filter the data by source using the .name, where name is anything in the source column. We can get a list of these by printing exp_data.exp_methods.

[13]:
# prints all the available exp_methods
exp_data.exp_methods
[13]:
['rna_seq', 'ph_silac', 'label_free', 'silac', 'C18', 'HILIC']
[14]:
# filters to only the 'label_free'
exp_data.label_free.shape
[14]:
(13085, 8)
[15]:
exp_data.label_free.head(5)
[15]:
identifier label species_type fold_change p_value significant sample_id source
102446 LIMS1 LIMS1_lf protein 12.42 0.00003 True 01hr label_free
102447 SMARCE1 SMARCE1_lf protein -2.49 0.00030 True 01hr label_free
102448 HEXA HEXA_lf protein 6.42 0.00060 True 01hr label_free
102449 SRSF1 SRSF1_lf protein -3.21 0.00060 True 01hr label_free
102450 SF3B1 SF3B1_lf protein -1.57 0.00130 True 01hr label_free
[16]:
exp_data.HILIC.head(5)
[16]:
identifier label species_type fold_change p_value significant sample_id source
128154 HMDB0012113 (22Alpha)-hydroxy-campest-4-en-3-one metabolites 100.0 0.000950 True 48hr HILIC
128157 HMDB0011644 (24R)-Cholest-5-ene-3-beta,7-alpha,24-triol metabolites 1.7 0.000072 True 24hr HILIC
128162 HMDB0012114 (3S)-3,6-Diaminohexanoate metabolites -1.9 0.000030 True 06hr HILIC
128164 HMDB0012114 (3S)-3,6-Diaminohexanoate metabolites -3.0 0.002000 True 24hr HILIC
128166 HMDB0012115 (3S,5S)-3,5-Diaminohexanoate metabolites -1.9 0.000030 True 06hr HILIC

Significant filter

We can use the significant column to filter that data to only contain those species.

[17]:
exp_data.species.shape
[17]:
(132932, 8)
[18]:
exp_data.species.sig.shape
[18]:
(27288, 8)

Filter data to up or down regulated species.

For enrichment analysis, we will want to access up-regulated and down-regulated species using .up and .down.

[19]:
exp_data.rna_seq.up.head(10)
[19]:
identifier label species_type fold_change p_value significant sample_id source
13 DLX2 DLX2_rnaseq protein 2.874358 0.001020 True 06hr rna_seq
18 RETSAT RETSAT_rnaseq protein 2.325934 0.001020 True 06hr rna_seq
21 SLC52A1 SLC52A1_rnaseq protein 2.871869 0.001020 True 06hr rna_seq
24 OTUD3 OTUD3_rnaseq protein 1.821775 0.001020 True 06hr rna_seq
35 RP11-209D14.2 RP11-209D14.2_rnaseq protein 1.819533 0.025204 True 06hr rna_seq
58 ZNF554 ZNF554_rnaseq protein 2.309691 0.004153 True 06hr rna_seq
59 FZD9 FZD9_rnaseq protein 1.812798 0.001020 True 06hr rna_seq
71 SBK1 SBK1_rnaseq protein 1.806427 0.002689 True 06hr rna_seq
88 PPM1D PPM1D_rnaseq protein 1.803186 0.001020 True 06hr rna_seq
92 ZNF425 ZNF425_rnaseq protein 2.846581 0.001020 True 06hr rna_seq
[20]:
exp_data.rna_seq.down.head(10)
[20]:
identifier label species_type fold_change p_value significant sample_id source
0 HOXD1 HOXD1_rnaseq protein -520.256762 0.001020 True 06hr rna_seq
1 MIR7704 MIR7704_rnaseq protein -520.256762 0.001020 True 06hr rna_seq
2 AC078814.1 AC078814.1_rnaseq protein -76.022260 0.001020 True 06hr rna_seq
3 PPM1H PPM1H_rnaseq protein -76.022260 0.001020 True 06hr rna_seq
4 PLCH1 PLCH1_rnaseq protein -17.888990 0.001020 True 06hr rna_seq
5 RP11-639F1.1 RP11-639F1.1_rnaseq protein -17.888990 0.001020 True 06hr rna_seq
6 TP63 TP63_rnaseq protein -12.355659 0.001020 True 06hr rna_seq
7 JARID2 JARID2_rnaseq protein -7.891502 0.001020 True 06hr rna_seq
8 GLI2 GLI2_rnaseq protein -5.389009 0.001020 True 06hr rna_seq
9 MAP3K5 MAP3K5_rnaseq protein -4.262353 0.001893 True 06hr rna_seq

Extracting by sample (time point)

We also added an index filter to segregate by sample_id.

[21]:
for i in exp_data.sample_ids:
    print(i)
    display(exp_data[i].head(5))
01hr
identifier label species_type fold_change p_value significant sample_id source
19160 GRIK4 GRIK4_rnaseq protein 77.555651 0.019824 True 01hr rna_seq
19161 GRIK4_3p_UTR GRIK4_3p_UTR_rnaseq protein 77.555651 0.019824 True 01hr rna_seq
19162 AP001187.9 AP001187.9_rnaseq protein -25.455050 0.019824 True 01hr rna_seq
19163 MIR192 MIR192_rnaseq protein -25.455050 0.019824 True 01hr rna_seq
19164 MIR194-2 MIR194-2_rnaseq protein -25.455050 0.019824 True 01hr rna_seq
06hr
identifier label species_type fold_change p_value significant sample_id source
0 HOXD1 HOXD1_rnaseq protein -520.256762 0.00102 True 06hr rna_seq
1 MIR7704 MIR7704_rnaseq protein -520.256762 0.00102 True 06hr rna_seq
2 AC078814.1 AC078814.1_rnaseq protein -76.022260 0.00102 True 06hr rna_seq
3 PPM1H PPM1H_rnaseq protein -76.022260 0.00102 True 06hr rna_seq
4 PLCH1 PLCH1_rnaseq protein -17.888990 0.00102 True 06hr rna_seq
24hr
identifier label species_type fold_change p_value significant sample_id source
37960 LHX3 LHX3_rnaseq protein 202.225343 0.005180 True 24hr rna_seq
37961 C17orf67 C17orf67_rnaseq protein 2.571464 0.000123 True 24hr rna_seq
37962 ALX1 ALX1_rnaseq protein -2.572587 0.000123 True 24hr rna_seq
37963 MIR7844 MIR7844_rnaseq protein 2.573033 0.009349 True 24hr rna_seq
37964 TMCC3 TMCC3_rnaseq protein 2.573033 0.009349 True 24hr rna_seq
48hr
identifier label species_type fold_change p_value significant sample_id source
58025 TNS3 TNS3_1188_1197_phsilac protein -3.837129 0.049 True 48hr ph_silac
58026 SIPA1L3 SIPA1L3_S(ph)158_phsilac protein -5.119600 0.049 True 48hr ph_silac
58027 TNS3 TNS3_Y(ph)780_phsilac protein -4.986421 0.049 True 48hr ph_silac
58028 FGD6 FGD6_S(ph)554_phsilac protein -3.900705 0.049 True 48hr ph_silac
58029 GPN1 GPN1_S(ph)312_phsilac protein 2.901199 0.049 True 48hr ph_silac
[22]:
exp_data['01hr'].head(5)
[22]:
identifier label species_type fold_change p_value significant sample_id source
19160 GRIK4 GRIK4_rnaseq protein 77.555651 0.019824 True 01hr rna_seq
19161 GRIK4_3p_UTR GRIK4_3p_UTR_rnaseq protein 77.555651 0.019824 True 01hr rna_seq
19162 AP001187.9 AP001187.9_rnaseq protein -25.455050 0.019824 True 01hr rna_seq
19163 MIR192 MIR192_rnaseq protein -25.455050 0.019824 True 01hr rna_seq
19164 MIR194-2 MIR194-2_rnaseq protein -25.455050 0.019824 True 01hr rna_seq

Pivot table to get table across time

We also provide a function to quickly pivot the data to for easy export.

[23]:
exp_data.label_free.pivoter(
    convert_to_log=False,
    index='identifier',
    columns='sample_id',
    values=['fold_change', 'p_value']
).head(10)
[23]:
fold_change p_value
sample_id 01hr 06hr 24hr 48hr 01hr 06hr 24hr 48hr
identifier
A2M 1.040000 1.140 51.93 11.58 0.514800 0.44370 0.24260 0.11130
AACS -1.100000 3.740 NaN NaN 0.281800 0.26950 NaN NaN
AAGAB 1.000000 -1.150 1.46 -2.03 0.968100 0.39240 0.84450 0.09760
AAK1 1.320000 1.590 NaN 1.72 0.715800 0.18110 NaN 0.95660
AAMP -1.200000 -1.460 1.85 1.78 0.836800 0.55420 0.13640 0.32460
AAR2 NaN -1.690 NaN NaN NaN 0.96510 NaN NaN
AARS 0.326667 -0.035 -1.44 -3.12 0.299867 0.62425 0.46725 0.00045
AARS2 1.170000 NaN NaN NaN 0.253000 NaN NaN NaN
AARSD1 1.210000 4.070 -2.05 NaN 0.459700 0.49160 0.78440 NaN
AASDHPPT -0.330000 1.020 1.07 -1.11 0.709600 0.81160 0.45290 0.00070

Note that in the previous two examples, we find that there are NaN values. This is because of our experiental data. We can easy check what species are not found in all 4 of our label free experiements.

[24]:
print(len(exp_data.label_free.present_in_all_columns(
    index='identifier',
    columns='sample_id',
).id_list))
Number in index went from 3447 to 1819
1819

This shows that out of the 3447 unique species measured in label-free proteomics, only 1819 were measured in all time points. What one can do with this information is dependent on the analysis. For now, we will keep using the full dataset.

It is important to note that this class is basically a hopped up pandas.DataFrame, so the commands can be chained together.

Visualization

We provide commonly used plotting functions. .volcano_plot .volcano_by_sample .plot_histogram .plot_species .heatmap

Volcano plots

[25]:
exp_data.label_free.volcano_plot();
_images/Tutorial_43_0.png
[26]:
exp_data.label_free.volcano_by_sample(sig_column=True);
_images/Tutorial_44_0.png
[27]:
exp_data.label_free.plot_histogram();
_images/Tutorial_45_0.png

Plotting subset of species

We provide the a few plotting interfaces to explore that subsets of the data. Basically, you create a list of species and provide it to the function. It filters based on these and then returns the results.

Time series using plot’y or matplotlib

[28]:
exp_data.label_free.plot_species(['LMNA', 'VDAC1'], plot_type='plotly')
[29]:
exp_data.label_free.plot_species(['LMNA', 'VDAC1'], plot_type='matplotlib');
_images/Tutorial_49_0.png

Heatplots

[30]:
exp_data.label_free.heatmap(
    ['LMNA', 'VDAC1'],
    figsize=(6,4),
    linewidths=0.01
);
_images/Tutorial_51_0.png

Notice that the above plot doesn’t show any of the modifiers of LMBA (no _s(ph)22_lf). This is because the default index to pivot plots is the identifier column. You can set the label column for plotting by passing index=label to the function. Note, if you want to filter the data using the more generic ‘identifier’ column, you just specify that with subset_index=’identifier’

[31]:
exp_data.label_free.heatmap(
    ['LMNA', 'VDAC1'],
    subset_index='identifier',
    index='label',
    figsize=(6,4),
    linewidths=0.01
);
_images/Tutorial_53_0.png

Examples

Here are a few examples how all the above commands can be chained together to create plots with varying degrees of critera.

Query 1:

Heatmap of label-free proteomics that are signficantly change in at least 3 time points.
[32]:
lf_sig = exp_data.label_free.require_n_sig(
    index='label',
    columns='sample_id',
    n_sig=3
).heatmap(
    convert_to_log=True,
    cluster_row=True,
    index='label',
    values='fold_change',
    columns='sample_id',
    annotate_sig=True,
    figsize=(8, 12),
    div_colors=True,
    num_colors=21,
    linewidths=0.01
);
_images/Tutorial_55_0.png

Query 2:

Changes that happen at all 3 timepoints for RNA-seq.
[33]:
exp_data.rna.require_n_sig(n_sig=3, index='label').plot_species(plot_type='plotly');

Query 3:

  • Heatmap and time series plot of proteins that are consistently down regulated at 3 time points.

[34]:
exp_data.proteins.down.require_n_sig(n_sig=3, index='label').plot_species(plot_type='matplotlib');
exp_data.proteins.down.require_n_sig(n_sig=3, index='label').heatmap(index='label', cluster_row=True, linewidths=0.01);
_images/Tutorial_59_0.png
_images/Tutorial_59_1.png

Query 4:

Clustered heatmap of label-free data
[35]:
exp_data.silac.heatmap(
    linewidths=0.01,
    index='label',
    cluster_row=True,
    min_sig=2,
    figsize=(12,18)
);
_images/Tutorial_61_0.png

Extending to other plots

Since our exp_data is built off a pandas.DataFrame, we can use other packages that take that data format. Seaborn is one such tool that provides some very nice plots.

[36]:
label_free = exp_data.label_free.copy()
label_free.log2_normalize_df(column='fold_change', inplace=True)

g = sns.PairGrid(label_free,
                 x_vars=('sample_id'),
                 y_vars=('fold_change', 'p_value'),
                 aspect=3.25, height=3.5)
g.map(
    sns.violinplot,
    palette="pastel",
    order=label_free.sample_ids
);
_images/Tutorial_63_0.png

Venn diagram comparisons between measurements

[37]:
from magine.plotting.venn_diagram_maker import create_venn2, create_venn3

lf = exp_data.label_free.sig.id_list
silac = exp_data.silac.sig.id_list
phsilac = exp_data.ph_silac.sig.id_list
hilic = exp_data.HILIC.sig.id_list
rplc = exp_data.C18.sig.id_list

create_venn2(hilic, rplc, 'HILIC', 'RPLC');
_images/Tutorial_65_0.png
[38]:
create_venn3(lf, silac, phsilac, 'LF', 'SILAC', 'ph-SILAC');
_images/Tutorial_66_0.png

Networks

Create data driven network

MAGINE generates networks using seed species. It located the seed species in multiple databases and finds interconnecting edges among them. The goal of this process was to obtain all the known biological regulation among the species. We currently utilize KEGG, Reactome, HMDB, TTRUST, and BioGrid for node and edge sources.

[39]:
Image('network_creation_steps.png')
[39]:
_images/Tutorial_70_0.png
[40]:
# some imports needed
from magine.networks.network_generator import build_network
import magine.networks.utils as utils
import networkx as nx
import os
2020-10-07 16:15:00.271 - magine - INFO - Logging started on MAGINE version 0.1.0
2020-10-07 16:15:00.273 - magine - INFO - Log entry time offset from UTC: -7.00 hours

This is done using the build_network function. Now we will create the network. We pass the seed and background list to the network as well as flags turning on all of the network databases. We also trim source/sink nodes (this basically cleans up dangling nodes that are not in our seed or background lists).

[41]:
measured = exp_data.species.id_list
sig_measured = exp_data.species.sig.id_list
print(len(measured))
print(len(sig_measured))
23725
15777
[42]:
if not os.path.exists('Data/cisplatin_network.p'):
    network = build_network(

        # genes seed species
        seed_species=sig_measured,

        # all data measured, used to allow interconnecting nodes that are not in seeds.
        all_measured_list=measured,

        use_biogrid=True,  # expand with biogrid
        use_hmdb=True,  # expand with hmdb
        use_reactome=True,  # expand with reactome
        use_signor=True,  # expand with signor
        trim_source_sink=True,  # remove all source and sink nodes not measured
        save_name='Data/cisplatin_network'
    )
else:
    # Load the network, note that it is returned above but for time limits,
    # we will just load the generated one.
    network = nx.read_gpickle('Data/cisplatin_network.p')

# add attibutes to graph nodes (measured, measured at which time points,
# significantly changed at which time point)
utils.add_data_to_graph(network, exp_data)
print("Saving network")
# write to GML for cytoscape or other program
nx.write_gml(
    network,
    os.path.join('Data', 'cisplatin_network_w_attributes.gml')
)

# write to gpickle for fast loading in python
nx.write_gpickle(
    network,
    os.path.join('Data', 'cisplatin_based_network.p'),
)
Saving network
[43]:
print(network.number_of_nodes())
print(network.number_of_edges())
13308
181300

As you might iMAGINE, the larger number of input nodes and source databases, the larger the resulting network. 13308 nodes and 181300 edges are too much to manually explore. Thus, we are going to use the Subgraph Class to being to query the network. We developed multiple tools to subgraph and explore the network, as well as multiple visualizations.

Explore subgraphs of network

[44]:
from magine.networks.subgraphs import Subgraph
from magine.networks.visualization import draw_igraph, draw_graphviz, draw_mpl, draw_cyjs
net_sub = Subgraph(network)
help(net_sub)
Help on Subgraph in module magine.networks.subgraphs object:

class Subgraph(builtins.object)
 |  Subgraph(network, exp_data=None, pool=None)
 |
 |  Methods defined here:
 |
 |  __init__(self, network, exp_data=None, pool=None)
 |      Generates network subgraphs
 |
 |      Parameters
 |      ----------
 |      network : networkx.DiGraph
 |      exp_data : magine.data.datatypes.ExperimentalData
 |
 |  downstream_of_node(self, species_1, include_list=None, save_name=None, draw=False)
 |      Generate network of all downstream species of provides species
 |
 |
 |      Parameters
 |      ----------
 |      species_1 : str
 |          species name
 |      save_name : str
 |          name to save gml file
 |      draw : bool
 |          create figure of graph
 |      include_list : list_like
 |          list of species that must be in path in order to consider a path
 |      Returns
 |      -------
 |      nx.DiGraph
 |
 |
 |      Examples
 |      --------
 |      >>> from networkx import DiGraph
 |      >>> from magine.networks.subgraphs import Subgraph
 |      >>> g = DiGraph()
 |      >>> g.add_edges_from([('a','b'),('b','c'), ('c', 'd'), ('a', 'd'),         ('e', 'd')])
 |      >>> net_sub = Subgraph(g)
 |      >>> downstream_d = net_sub.downstream_of_node('d')
 |      >>> sorted(downstream_d.edges)
 |      []
 |      >>> downstream_c = net_sub.downstream_of_node('c')
 |      >>> sorted(downstream_c.edges)
 |      [('c', 'd')]
 |
 |  expand_neighbors(self, network=None, nodes=None, upstream=False, downstream=False, max_dist=1, include_only=None, add_interconnecting_edges=False)
 |      Create/expand a network based on neighbors from a list of species
 |
 |      Parameters
 |      ----------
 |      network : nx.DiGraph or None
 |          Starting network to expand nodes. If not provided, will use
 |          default network
 |      nodes : list_like
 |          List of nodes to expand
 |      upstream : bool
 |          Expand upstream nodes
 |      downstream : bool
 |          Expand downstream nodes
 |      max_dist :
 |          Max distance to explore
 |      include_only : list_like
 |          Limit network to only contain these species
 |      add_interconnecting_edges : bool
 |          Add edges connecting all nodes. Default if False, so only direct
 |          edges to neighbors will be added.
 |
 |      Returns
 |      -------
 |      nx.DiGraph
 |
 |  measured_networks_over_time(self, graph, colors, prefix)
 |      Adds color to a network over time
 |
 |      Parameters
 |      ----------
 |      graph : nx.DiGraph
 |      colors : list
 |          List of colors for time points
 |      prefix : str
 |          Prefix for image files
 |
 |
 |      Returns
 |      -------
 |
 |  measured_networks_over_time_up_down(self, graph, prefix, color_up='tomato', color_down='lightblue')
 |      Parameters
 |      ----------
 |      graph : nx.DiGraph
 |
 |      prefix : str
 |          Prefix for image files
 |      color_up : str
 |
 |      color_down : str
 |
 |
 |      Returns
 |      -------
 |
 |  neighbors(self, node, upstream=True, downstream=True, max_dist=1, include_only=None, start_network=None)
 |      Create network containing provided node and its neighbors.
 |
 |      Parameters
 |      ----------
 |      node : str
 |      upstream : bool
 |      downstream : bool
 |      max_dist : int
 |      include_only : list
 |      start_network : nx.DiGraph
 |
 |
 |      Returns
 |      -------
 |      nx.DiGraph
 |
 |  paths_between_list(self, species_list, single_path=False, max_length=None, add_interconnecting_edges=False, include_only=None, pool=None, save_name=None, draw=False, image_format='png')
 |      Returns graph containing all shortest paths between list.
 |
 |      Parameters
 |      ----------
 |
 |      species_list : list_like
 |          list of species
 |      save_name : str
 |          name to save
 |      single_path : bool
 |          use single shortest path if True, else use all shortest paths
 |      draw : bool
 |          create a dot generated figure
 |      image_format : str
 |          dot acceptable output formats, (pdf, png, etc)
 |      pool : multiprocessing.Pool
 |          If it it provided, it uses its map function to run this function.
 |      max_length : int
 |          Max length for path between any 2 species
 |      include_only : list_like
 |          List of species that must be present
 |      Returns
 |      -------
 |      graph : networkx.DiGraph
 |          graph containing paths between species list provided
 |
 |
 |      Examples
 |      --------
 |      >>> from networkx import DiGraph
 |      >>> from magine.networks.subgraphs import Subgraph
 |      >>> g = DiGraph()
 |      >>> g.add_edges_from([('a','b'),('b','c'), ('c', 'd'), ('a', 'd'), ('e', 'd')])
 |      >>> g.add_path(['g', 'h', 'c', 'i', 'j', 'k'])
 |      >>> net_sub = Subgraph(g)
 |      >>> path_a_d = net_sub.paths_between_list(['a','c','d'])
 |      >>> sorted(path_a_d.edges)
 |      [('a', 'b'), ('a', 'd'), ('b', 'c'), ('c', 'd')]
 |      >>> path_a_f = net_sub.paths_between_list(['g', 'h', 'j'], max_length=4)
 |      >>> sorted(path_a_f.edges)
 |      [('c', 'i'), ('g', 'h'), ('h', 'c'), ('i', 'j')]
 |
 |  paths_between_pair(self, node_1, node_2, bidirectional=False, single_path=False, draw=False, image_format='png')
 |      Generates a graph based on all shortest paths between two species.
 |
 |
 |      Parameters
 |      ----------
 |      node_1 : str
 |          name of first species
 |      node_2 : str
 |          name of second species
 |      bidirectional : bool
 |          If you want to search bidirectionally
 |      single_path : bool
 |          If you only want a single shortest path
 |      draw : bool
 |          create an image of returned network
 |      image_format : str, optional
 |          If draw=True you can pass an image format. (pdf, png, svg).
 |          default=png
 |
 |      Returns
 |      -------
 |      graph : networkx.DiGraph
 |
 |
 |      Examples
 |      --------
 |      >>> from networkx import DiGraph
 |      >>> from magine.networks.subgraphs import Subgraph
 |      >>> g = DiGraph()
 |      >>> g.add_edges_from([('a','b'),('b','c'), ('c', 'd'),  ('e', 'd'), ('d', 'a')])
 |      >>> net_sub = Subgraph(g)
 |      >>> path_a_d = net_sub.paths_between_pair('a','d', False)
 |      >>> sorted(path_a_d.edges)
 |      [('a', 'b'), ('b', 'c'), ('c', 'd')]
 |      >>> path_a_d = net_sub.paths_between_pair('a','d', True)
 |      >>> sorted(path_a_d.edges)
 |      [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'a')]
 |
 |  paths_between_two_lists(self, list_1, list_2, single_path=False, max_length=None, include_only=None, reverse=False, add_interconnecting_edges=False, draw=False, save_name=None, image_format='png')
 |      Generates a graph based on all shortest paths between two species.
 |
 |
 |      Parameters
 |      ----------
 |      list_1 : list
 |          Node names
 |      list_2 : list
 |          Node names
 |      single_path : bool
 |          If you only want a single shortest path.
 |      max_length : int
 |          Maximum distance between any two species.
 |      include_only : list
 |          Species required to be in paths/
 |      reverse : bool
 |          Flag to check list_2 to list_1. Default will only look for list_1
 |          to list_2.
 |      add_interconnecting_edges : bool
 |          Add edges between species even if not between list_1 and list_2
 |          nodes.
 |      save_name : str
 |          Save of figure/network
 |      draw : bool
 |          create an image of returned network
 |      image_format : str, optional
 |          If draw=True you can pass an image format. (pdf, png, svg).
 |          default=png
 |
 |      Returns
 |      -------
 |      graph : networkx.DiGraph
 |
 |      Examples
 |      --------
 |      >>> from networkx import DiGraph
 |      >>> from magine.networks.subgraphs import Subgraph
 |      >>> g = DiGraph()
 |      >>> g.add_path(['a', 'b', 'c', 'd'])
 |      >>> g.add_path(['g', 'h', 'c', 'i'])
 |      >>> net_sub = Subgraph(g)
 |      >>> path_a_d = net_sub.paths_between_two_lists(['a','g'], ['c', 'i'], max_length=3)
 |      >>> sorted(path_a_d.edges)
 |      [('a', 'b'), ('b', 'c'), ('g', 'h'), ('h', 'c')]
 |
 |  upstream_of_node(self, species_1, include_list=None, save_name=None, draw=False)
 |      Generate network of all upstream species of provides species
 |
 |      Parameters
 |      ----------
 |      species_1 : str
 |          species name
 |      save_name : str
 |          name to save gml file
 |      draw : bool
 |          create figure of graph
 |      include_list : list_like
 |          Species that must be in path in order to consider a path
 |
 |      Returns
 |      -------
 |      nx.DiGraph
 |
 |
 |      Examples
 |      --------
 |      >>> from networkx import DiGraph
 |      >>> from magine.networks.subgraphs import Subgraph
 |      >>> g = DiGraph()
 |      >>> g.add_edges_from([('a','b'),('b','c'), ('c', 'd'), ('a', 'd'),         ('e', 'd')])
 |      >>> net_sub = Subgraph(g)
 |      >>> upstream_d = net_sub.upstream_of_node('d')
 |      >>> sorted(upstream_d.edges())
 |      [('a', 'd'), ('b', 'c'), ('c', 'd'), ('e', 'd')]
 |      >>> upstream_c = net_sub.upstream_of_node('c')
 |      >>> sorted(upstream_c.edges())
 |      [('a', 'b'), ('b', 'c')]
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)

[45]:
bax_neighbors = net_sub.neighbors(
    'BAX', # node of interest
    upstream=True, # include upstream nodes
    downstream=False,  # include downstream nodes
    include_only=exp_data.species.sig.id_list # limit nodes to only significant changed species
)

There are multiple ways to visualize the network. draw_igraph, draw_mpl, draw_graphviz, draw_cyjs

[46]:
draw_igraph(bax_neighbors, bbox=[400, 400], node_size=25, inline=True)
[46]:
_images/Tutorial_80_0.svg
[47]:
draw_mpl(bax_neighbors, layout='fdp', scale=3, node_size=100, font_size=12);
_images/Tutorial_81_0.png
[48]:
draw_graphviz(bax_neighbors, 'fdp')
_images/Tutorial_82_0.png
[49]:
draw_cyjs(bax_neighbors)

This network can be exanded by a single or list of nodes passed.

[50]:
expand = net_sub.expand_neighbors(bax_neighbors, nodes='CASP3', downstream=True)
[51]:
draw_igraph(expand,
            bbox=[800, 800],
            node_font_size=25,
            font_size=4,
            node_size=25,
            inline=True,
            layout='graphopt')
[51]:
_images/Tutorial_86_0.svg
[52]:
draw_graphviz(expand, 'sfdp', width=700)
_images/Tutorial_87_0.png

Finding paths between two proteins

[53]:
BAX_to_AKT = net_sub.paths_between_pair('BAX', 'AKT1')
[54]:
draw_graphviz(BAX_to_AKT)
_images/Tutorial_90_0.png

Running enrichment analysis via EnrichR

MAGINE automates the upload samples to EnrichR and collated the results into a user friends format (EnrichmentResult Class).

[55]:
from magine.enrichment.enrichr import Enrichr
[56]:
e = Enrichr()
[57]:
ph_silac_enrichment = e.run_samples(
    exp_data.ph_silac.sig.up_by_sample,
    exp_data.ph_silac.sig.sample_ids,
    gene_set_lib='Reactome_2016'
)
[58]:
ph_silac_enrichment.head(10)
[58]:
term_name rank p_value z_score combined_score adj_p_value genes n_genes db significant sample_id
0 cell cycle_hsa-1640170 1 6.097256e-07 3.022131 43.247475 0.000933 ACD,AKAP9,BRCA1,CDC16,CDC20,CDC7,CLASP2,DCTN1,... 26 Reactome_2016 True 01hr
1 interleukin-2 signaling_hsa-451927 2 1.697041e-06 4.177109 55.499684 0.001298 AKAP9,BRAF,CNKSR2,CUL3,HAVCR2,INPPL1,IRS2,MAPK... 16 Reactome_2016 True 01hr
2 interleukin-3, 5 and gm-csf signaling_hsa-512988 3 2.677495e-06 4.033071 51.746840 0.001366 AKAP9,BRAF,CNKSR2,CUL3,INPPL1,IRS2,MAPK3,MARK3... 16 Reactome_2016 True 01hr
3 interleukin receptor shc signaling_hsa-912526 4 5.589099e-06 4.027927 48.716538 0.002138 AKAP9,BRAF,CNKSR2,CUL3,INPPL1,IRS2,MAPK3,MARK3... 15 Reactome_2016 True 01hr
4 signalling by ngf_hsa-166520 5 5.968762e-06 3.070175 36.931051 0.001826 AKAP13,AKAP9,ARHGEF16,BRAF,CNKSR2,CUL3,HDAC1,I... 21 Reactome_2016 True 01hr
5 mapk family signaling cascades_hsa-5683057 6 7.859464e-06 3.706449 43.564834 0.002004 AKAP9,BRAF,CNKSR2,CUL3,DNAJB1,IRS2,MAPK3,MARK3... 16 Reactome_2016 True 01hr
6 insulin receptor signalling cascade_hsa-74751 7 8.967430e-06 3.667706 42.625753 0.001960 AKAP9,BRAF,CNKSR2,CUL3,INSR,IRS2,MAPK3,MARK3,P... 16 Reactome_2016 True 01hr
7 signaling by interleukins_hsa-449147 8 9.805077e-06 3.188776 36.774905 0.001875 AKAP9,BRAF,CNKSR2,CUL3,HAVCR2,INPPL1,IRS2,MAP3... 19 Reactome_2016 True 01hr
8 signal attenuation_hsa-74749 9 1.022501e-05 26.315789 302.386159 0.001738 INSR,IRS2,MAPK3,SHC1 4 Reactome_2016 True 01hr
9 signaling by fgfr2_hsa-5654738 10 1.150341e-05 3.280362 37.307117 0.001760 AKAP9,BRAF,CNKSR2,CUL3,HNRNPA1,HNRNPM,INSR,IRS... 18 Reactome_2016 True 01hr
[59]:
# clean up naming of terms
ph_silac_enrichment.term_name = ph_silac_enrichment.term_name.str.split('_').str.get(0)
[60]:
ph_silac_enrichment.head(10)
[60]:
term_name rank p_value z_score combined_score adj_p_value genes n_genes db significant sample_id
0 cell cycle 1 6.097256e-07 3.022131 43.247475 0.000933 ACD,AKAP9,BRCA1,CDC16,CDC20,CDC7,CLASP2,DCTN1,... 26 Reactome_2016 True 01hr
1 interleukin-2 signaling 2 1.697041e-06 4.177109 55.499684 0.001298 AKAP9,BRAF,CNKSR2,CUL3,HAVCR2,INPPL1,IRS2,MAPK... 16 Reactome_2016 True 01hr
2 interleukin-3, 5 and gm-csf signaling 3 2.677495e-06 4.033071 51.746840 0.001366 AKAP9,BRAF,CNKSR2,CUL3,INPPL1,IRS2,MAPK3,MARK3... 16 Reactome_2016 True 01hr
3 interleukin receptor shc signaling 4 5.589099e-06 4.027927 48.716538 0.002138 AKAP9,BRAF,CNKSR2,CUL3,INPPL1,IRS2,MAPK3,MARK3... 15 Reactome_2016 True 01hr
4 signalling by ngf 5 5.968762e-06 3.070175 36.931051 0.001826 AKAP13,AKAP9,ARHGEF16,BRAF,CNKSR2,CUL3,HDAC1,I... 21 Reactome_2016 True 01hr
5 mapk family signaling cascades 6 7.859464e-06 3.706449 43.564834 0.002004 AKAP9,BRAF,CNKSR2,CUL3,DNAJB1,IRS2,MAPK3,MARK3... 16 Reactome_2016 True 01hr
6 insulin receptor signalling cascade 7 8.967430e-06 3.667706 42.625753 0.001960 AKAP9,BRAF,CNKSR2,CUL3,INSR,IRS2,MAPK3,MARK3,P... 16 Reactome_2016 True 01hr
7 signaling by interleukins 8 9.805077e-06 3.188776 36.774905 0.001875 AKAP9,BRAF,CNKSR2,CUL3,HAVCR2,INPPL1,IRS2,MAP3... 19 Reactome_2016 True 01hr
8 signal attenuation 9 1.022501e-05 26.315789 302.386159 0.001738 INSR,IRS2,MAPK3,SHC1 4 Reactome_2016 True 01hr
9 signaling by fgfr2 10 1.150341e-05 3.280362 37.307117 0.001760 AKAP9,BRAF,CNKSR2,CUL3,HNRNPA1,HNRNPM,INSR,IRS... 18 Reactome_2016 True 01hr

The EnrichmentResult Class shares the same plotting format as the ExperimentalData Class.

[61]:
ph_silac_enrichment.heatmap(
    min_sig=3,
    figsize=(4,16),
    linewidths=0.01,
    cluster_by_set=False
);
_images/Tutorial_99_0.png
[62]:
print(len(ph_silac_enrichment.sig.term_name.unique()))
315
[79]:
display(sorted(ph_silac_enrichment.sig.term_name.unique()))
['2-ltr circle formation',
 "3' -utr-mediated translational regulation",
 'activation of bad and translocation to mitochondria',
 'activation of bh3-only proteins',
 'activation of the ap-1 family of transcription factors',
 'activation of the mrna upon binding of the cap-binding complex and eifs, and subsequent binding to 43s',
 'akt phosphorylates targets in the cytosol',
 'antiviral mechanism by ifn-stimulated genes',
 'apobec3g mediated resistance to hiv-1 infection',
 'apoptosis',
 'apoptotic cleavage of cellular proteins',
 'apoptotic execution  phase',
 'arms-mediated activation',
 'assembly of the pre-replicative complex',
 'asymmetric localization of pcp proteins',
 'auf1 (hnrnp d0) binds and destabilizes mrna',
 'autodegradation of cdh1 by cdh1:apc/c',
 'autodegradation of the e3 ubiquitin ligase cop1',
 'axon guidance',
 'basigin interactions',
 'beta-catenin phosphorylation cascade',
 'cap-dependent translation initiation',
 'caspase-mediated cleavage of cytoskeletal proteins',
 'cdk-mediated phosphorylation and removal of cdc6',
 'cdt1 association with the cdc6:orc:origin complex',
 'cell cycle',
 'cell cycle checkpoints',
 'cell cycle, mitotic',
 'cell death signalling via nrage, nrif and nade',
 'cell-cell communication',
 'cell-extracellular matrix interactions',
 'cellular response to heat stress',
 'cellular responses to stress',
 'cellular senescence',
 'centrosome maturation',
 'chk1/chk2(cds1) mediated inactivation of cyclin b:cdk1 complex',
 'chrebp activates metabolic gene expression',
 'clearance of nuclear envelope membranes from chromatin',
 'cleavage of growing transcript in the termination region',
 'cross-presentation of soluble exogenous antigens (endosomes)',
 'cytokine signaling in immune system',
 'dap12 interactions',
 'dap12 signaling',
 'dcc mediated attractive signaling',
 'deactivation of the beta-catenin transactivating complex',
 'deadenylation of mrna',
 'dectin-1 mediated noncanonical nf-kb signaling',
 'degradation of axin',
 'degradation of beta-catenin by the destruction complex',
 'degradation of dvl',
 'degradation of gli1 by the proteasome',
 'degradation of gli2 by the proteasome',
 'depolymerisation of the nuclear lamina',
 'developmental biology',
 'disease',
 'diseases of signal transduction',
 'dna replication pre-initiation',
 'downstream signal transduction',
 'downstream signaling events of b cell receptor (bcr)',
 'downstream signaling of activated fgfr1',
 'downstream signaling of activated fgfr2',
 'downstream signaling of activated fgfr3',
 'downstream signaling of activated fgfr4',
 'early phase of hiv life cycle',
 'egfr downregulation',
 'eph-ephrin signaling',
 'er-phagosome pathway',
 'erk/mapk targets',
 'erks are inactivated',
 'eukaryotic translation elongation',
 'eukaryotic translation initiation',
 'eukaryotic translation termination',
 'export of viral ribonucleoproteins from nucleus',
 'factors involved in megakaryocyte development and platelet production',
 'fc epsilon receptor (fceri) signaling',
 'fceri mediated mapk activation',
 'fcgamma receptor (fcgr) dependent phagocytosis',
 'fgfr1 mutant receptor activation',
 'formation of a pool of free 40s subunits',
 'frs-mediated fgfr1 signaling',
 'frs-mediated fgfr2 signaling',
 'frs-mediated fgfr3 signaling',
 'frs-mediated fgfr4 signaling',
 'frs2-mediated activation',
 'g alpha (12/13) signalling events',
 'g1/s dna damage checkpoints',
 'g2/m checkpoints',
 'g2/m dna damage checkpoint',
 'g2/m transition',
 'gab1 signalosome',
 'gene expression',
 'gene silencing by rna',
 'gli3 is processed to gli3r by the proteasome',
 'glucose transport',
 'glycolysis',
 'golgi cisternae pericentriolar stack reorganization',
 'gp1b-ix-v activation signalling',
 'grb2 events in egfr signaling',
 'growth hormone receptor signaling',
 'gtp hydrolysis and joining of the 60s ribosomal subunit',
 'hdr through mmej (alt-nhej)',
 'hedgehog ligand biogenesis',
 'hexose transport',
 'hh mutants abrogate ligand secretion',
 "hh mutants that don't undergo autocatalytic processing are degraded by erad",
 'hiv infection',
 'hiv life cycle',
 'host interactions of hiv factors',
 'host interactions with influenza factors',
 'hsf1-dependent transactivation',
 'igf1r signaling cascade',
 'immune system',
 'infectious disease',
 'influenza infection',
 'influenza life cycle',
 'influenza viral rna transcription and replication',
 'initiation of nuclear envelope reformation',
 'innate immune system',
 'insulin receptor signalling cascade',
 'integration of provirus',
 'interactions of rev with host cellular proteins',
 'interactions of vpr with host cellular proteins',
 'interferon signaling',
 'interleukin receptor shc signaling',
 'interleukin-2 signaling',
 'interleukin-3, 5 and gm-csf signaling',
 'intrinsic pathway for apoptosis',
 'irs activation',
 'irs-mediated signalling',
 'irs-related events triggered by igf1r',
 'isg15 antiviral mechanism',
 'l13a-mediated translational silencing of ceruloplasmin expression',
 'l1cam interactions',
 'late phase of hiv life cycle',
 'lrr flii-interacting protein 1 (lrrfip1) activates type i ifn production',
 'm phase',
 'm/g1 transition',
 'major pathway of rrna processing in the nucleolus',
 'map2k and mapk activation',
 'mapk family signaling cascades',
 'mapk targets/ nuclear events mediated by map kinases',
 'mapk1/mapk3 signaling',
 'mapk6/mapk4 signaling',
 'meiotic synapsis',
 'membrane trafficking',
 'metabolism of non-coding rna',
 'metabolism of proteins',
 'misspliced gsk3beta mutants stabilize beta-catenin',
 'mitotic anaphase',
 'mitotic g2-g2/m phases',
 'mitotic metaphase and anaphase',
 'mitotic prometaphase',
 'mitotic prophase',
 "mrna 3'-end processing",
 'mrna splicing',
 'mrna splicing - major pathway',
 'ncam signaling for neurite out-growth',
 'negative feedback regulation of mapk pathway',
 'negative regulation of fgfr1 signaling',
 'negative regulation of fgfr3 signaling',
 'negative regulation of fgfr4 signaling',
 'negative regulation of the pi3k/akt network',
 'nep/ns2 interacts with the cellular export machinery',
 'nephrin interactions',
 'netrin-1 signaling',
 'ngf signalling via trka from the plasma membrane',
 'nik-->noncanonical nf-kb signaling',
 'nonhomologous end-joining (nhej)',
 'nonsense mediated decay (nmd) enhanced by the exon junction complex (ejc)',
 'nonsense mediated decay (nmd) independent of the exon junction complex (ejc)',
 'nonsense-mediated decay (nmd)',
 'nrage signals death through jnk',
 'ns1 mediated effects on host pathways',
 'nuclear envelope breakdown',
 'nuclear envelope reassembly',
 'nuclear events (kinase and transcription factor activation)',
 'nuclear import of rev protein',
 'nuclear pore complex (npc) disassembly',
 'oncogene induced senescence',
 'orc1 removal from chromatin',
 'p53-dependent g1 dna damage response',
 'p53-dependent g1/s dna damage checkpoint',
 'p53-independent dna damage response',
 'p53-independent g1/s dna damage checkpoint',
 'p75 ntr receptor-mediated signalling',
 'pcp/ce pathway',
 'peptide chain elongation',
 'phosphorylation site mutants of ctnnb1 are not targeted to the proteasome by the destruction complex',
 'pi-3k cascade:fgfr1',
 'pi-3k cascade:fgfr2',
 'pi-3k cascade:fgfr3',
 'pi-3k cascade:fgfr4',
 'pi3k cascade',
 'pi3k events in erbb4 signaling',
 'pi3k/akt activation',
 'pi3k/akt signaling in cancer',
 'pi5p, pp2a and ier3 regulate pi3k/akt signaling',
 'pip3 activates akt signaling',
 'post-elongation processing of intron-containing pre-mrna',
 'post-elongation processing of intronless pre-mrna',
 'post-elongation processing of the transcript',
 'post-translational protein modification',
 'processing of capped intron-containing pre-mrna',
 'processing of capped intronless pre-mrna',
 'processing of intronless pre-mrnas',
 'programmed cell death',
 'prolonged erk activation events',
 'raf/map kinase cascade',
 'recruitment of mitotic centrosome proteins and complexes',
 'regulation of actin dynamics for phagocytic cup formation',
 'regulation of activated pak-2p34 by proteasome mediated degradation',
 'regulation of apoptosis',
 'regulation of dna replication',
 'regulation of glucokinase by glucokinase regulatory protein',
 'regulation of hsf1-mediated heat shock response',
 'regulation of mrna stability by proteins that bind au-rich elements',
 'regulation of ornithine decarboxylase (odc)',
 'regulation of plk1 activity at g2/m transition',
 'regulation of ras by gaps',
 'regulation of tp53 activity',
 'regulation of tp53 activity through phosphorylation',
 'removal of licensing factors from origins',
 'rev-mediated nuclear export of hiv rna',
 'rho gtpase effectors',
 'rho gtpases activate paks',
 'rho gtpases activate pkns',
 'rho gtpases activate wasps and waves',
 'ribosomal scanning and start codon recognition',
 'rna polymerase ii transcription',
 'rna polymerase ii transcription termination',
 'role of abl in robo-slit signaling',
 'role of lat2/ntal/lab on calcium mobilization',
 'rrna modification in the nucleus',
 'rrna processing',
 "s33 mutants of beta-catenin aren't phosphorylated",
 "s37 mutants of beta-catenin aren't phosphorylated",
 "s45 mutants of beta-catenin aren't phosphorylated",
 'scf(skp2)-mediated degradation of p27/p21',
 'scf-beta-trcp mediated degradation of emi1',
 'selenoamino acid metabolism',
 'selenocysteine synthesis',
 'semet incorporation into proteins',
 'separation of sister chromatids',
 'shc1 events in egfr signaling',
 'shc1 events in erbb4 signaling',
 'signal attenuation',
 'signal transduction',
 'signaling by cytosolic fgfr1 fusion mutants',
 'signaling by egfr',
 'signaling by erbb4',
 'signaling by fgfr',
 'signaling by fgfr1',
 'signaling by fgfr2',
 'signaling by fgfr3',
 'signaling by fgfr4',
 'signaling by insulin receptor',
 'signaling by interleukins',
 'signaling by leptin',
 'signaling by pdgf',
 'signaling by rho gtpases',
 'signaling by robo receptor',
 'signaling by scf-kit',
 'signaling by the b cell receptor (bcr)',
 'signaling by type 1 insulin-like growth factor 1 receptor (igf1r)',
 'signaling by vegf',
 'signalling by ngf',
 'signalling to erks',
 'signalling to p38 via rit and rin',
 'signalling to ras',
 'slc-mediated transmembrane transport',
 'snrnp assembly',
 'sos-mediated signalling',
 'spry regulation of fgf signaling',
 'stabilization of p53',
 'sumo e3 ligases sumoylate target proteins',
 'sumoylation',
 'sumoylation of dna damage response and repair proteins',
 'sumoylation of dna replication proteins',
 'sumoylation of rna binding proteins',
 'sumoylation of transcription factors',
 'switching of origins to a post-replicative state',
 'synthesis of dna',
 "t41 mutants of beta-catenin aren't phosphorylated",
 'the role of gtse1 in g2/m progression after g2 checkpoint',
 'tp53 regulates metabolic genes',
 'tp53 regulates transcription of dna repair genes',
 'traf6 mediated induction of proinflammatory cytokines',
 'transcriptional regulation by small rnas',
 'transcriptional regulation by tp53',
 'translation',
 'translation initiation complex formation',
 'translocation of glut4 to the plasma membrane',
 'transmembrane transport of small molecules',
 'transport of mature mrna derived from an intron-containing transcript',
 'transport of mature mrna derived from an intronless transcript',
 'transport of mature mrnas derived from intronless transcripts',
 'transport of mature transcript to cytoplasm',
 'transport of ribonucleoproteins into the host nucleus',
 'transport of the slbp dependant mature mrna',
 'transport of the slbp independent mature mrna',
 'trna processing',
 'trna processing in the nucleus',
 'ubiquitin mediated degradation of phosphorylated cdc25a',
 'ubiquitin-dependent degradation of cyclin d',
 'ubiquitin-dependent degradation of cyclin d1',
 'unfolded protein response (upr)',
 'vegfa-vegfr2 pathway',
 'vegfr2 mediated cell proliferation',
 'vesicle-mediated transport',
 'vif-mediated degradation of apobec3g',
 'viral messenger rna synthesis',
 'viral mrna translation',
 'vpr-mediated nuclear import of pics',
 'vpu mediated degradation of cd4',
 'vxpx cargo-targeting to cilium']

There are 315 enriched terms. If we look at the top ranked terms, we see that some fo them have similar descriptions “…. fgfr signaling”. If we look at the gene list, we can also see that some of the genes are similar. To see if there are redundant terms that are enriched, we can calculate their similarity with the Jaccard Index (intersection over union). width=50 Drawing

[64]:
# calculate the Jaccard Index and returns a ranked dataframe of terms and scores.
# Higher scores means more similar terms
d = ph_silac_enrichment.find_similar_terms('regulation of tp53 activity')
display(d.head(20))
term_name similarity_score
251 tp53 regulates transcription of dna repair genes 1.000000
1108 regulation of tp53 activity through phosphoryl... 1.000000
869 sumoylation of transcription factors 1.000000
422 hdr through mmej (alt-nhej) 1.000000
1109 hdr through mmej (alt-nhej) 1.000000
318 regulation of tp53 activity through phosphoryl... 1.000000
225 regulation of tp53 activity through phosphoryl... 1.000000
778 regulation of tp53 activity through phosphoryl... 1.000000
552 intrinsic pathway for apoptosis 1.000000
518 activation of bh3-only proteins 1.000000
361 transcriptional regulation by tp53 0.379310
799 transcriptional regulation by tp53 0.375000
942 transcriptional regulation by tp53 0.250000
397 g2/m dna damage checkpoint 0.217391
744 g2/m dna damage checkpoint 0.192308
338 g2/m checkpoints 0.178571
347 cell cycle checkpoints 0.172414
958 g2/m dna damage checkpoint 0.148148
381 cellular senescence 0.142857
97 transcriptional regulation by tp53 0.142857

We can do this for all terms and view the results in a distance matrix (plot used for visualization purpose only).

[65]:
ph_silac_enrichment.dist_matrix(figsize=(12, 12));
_images/Tutorial_105_0.png

We can remove the redundant ones to compress the array. Here, we sort the terms by combined_score. For each term, we calculate the Jaccard index with all other terms. If a term falls below above a user defined threshold, it will be removed in the resulting array. By doing so, we minimize the total number of terms, while maximizing the information content of the resulting array.

[66]:
ph_silac_enrichment_slim = ph_silac_enrichment.remove_redundant(level='dataframe')
Number of rows went from 315 to 84
[67]:
# notive the reduction in size and overlap of terms
ph_silac_enrichment_slim.dist_matrix();
_images/Tutorial_108_0.png
[68]:
ph_silac_enrichment_slim.heatmap(
    min_sig=2,
    figsize=(4,12),
    linewidths=0.01,
    cluster_by_set=True
);
_images/Tutorial_109_0.png

Important to known, we can still recover the terms removed based on the highest level term kept.

[69]:
ph_silac_enrichment.show_terms_below('g2/m dna damage checkpoint').heatmap(
    linewidths=0.01,
    convert_to_log=False,
    figsize=(3, 8));
Number of rows went from 315 to 79
_images/Tutorial_111_1.png
[70]:
display(sorted(ph_silac_enrichment_slim.term_name.unique()))
['2-ltr circle formation',
 'activation of bad and translocation to mitochondria',
 'activation of the ap-1 family of transcription factors',
 'activation of the mrna upon binding of the cap-binding complex and eifs, and subsequent binding to 43s',
 'akt phosphorylates targets in the cytosol',
 'antiviral mechanism by ifn-stimulated genes',
 'apoptotic cleavage of cellular proteins',
 'assembly of the pre-replicative complex',
 'basigin interactions',
 'cell-extracellular matrix interactions',
 'centrosome maturation',
 'chrebp activates metabolic gene expression',
 'dcc mediated attractive signaling',
 'deactivation of the beta-catenin transactivating complex',
 'deadenylation of mrna',
 'egfr downregulation',
 'eph-ephrin signaling',
 'er-phagosome pathway',
 'erks are inactivated',
 'factors involved in megakaryocyte development and platelet production',
 'g alpha (12/13) signalling events',
 'g2/m dna damage checkpoint',
 'gene silencing by rna',
 'glycolysis',
 'golgi cisternae pericentriolar stack reorganization',
 'gp1b-ix-v activation signalling',
 'growth hormone receptor signaling',
 'gtp hydrolysis and joining of the 60s ribosomal subunit',
 'hdr through mmej (alt-nhej)',
 "hh mutants that don't undergo autocatalytic processing are degraded by erad",
 'hsf1-dependent transactivation',
 'influenza life cycle',
 'initiation of nuclear envelope reformation',
 'l1cam interactions',
 'late phase of hiv life cycle',
 'lrr flii-interacting protein 1 (lrrfip1) activates type i ifn production',
 'major pathway of rrna processing in the nucleolus',
 'mapk6/mapk4 signaling',
 'meiotic synapsis',
 'membrane trafficking',
 'mitotic prometaphase',
 'mrna splicing - major pathway',
 'negative feedback regulation of mapk pathway',
 'nephrin interactions',
 'nonhomologous end-joining (nhej)',
 'nonsense mediated decay (nmd) enhanced by the exon junction complex (ejc)',
 'nrage signals death through jnk',
 'nuclear envelope breakdown',
 'nuclear import of rev protein',
 'oncogene induced senescence',
 'pcp/ce pathway',
 'peptide chain elongation',
 'phosphorylation site mutants of ctnnb1 are not targeted to the proteasome by the destruction complex',
 'pi3k cascade',
 'post-elongation processing of intron-containing pre-mrna',
 'regulation of hsf1-mediated heat shock response',
 'regulation of mrna stability by proteins that bind au-rich elements',
 'regulation of ornithine decarboxylase (odc)',
 'regulation of plk1 activity at g2/m transition',
 'regulation of ras by gaps',
 'regulation of tp53 activity through phosphorylation',
 'rho gtpases activate paks',
 'rho gtpases activate pkns',
 'rho gtpases activate wasps and waves',
 'role of abl in robo-slit signaling',
 'rrna modification in the nucleus',
 'semet incorporation into proteins',
 'separation of sister chromatids',
 'signal attenuation',
 'signaling by cytosolic fgfr1 fusion mutants',
 'slc-mediated transmembrane transport',
 'sumoylation of dna damage response and repair proteins',
 'sumoylation of dna replication proteins',
 'sumoylation of rna binding proteins',
 'sumoylation of transcription factors',
 'the role of gtse1 in g2/m progression after g2 checkpoint',
 'tp53 regulates metabolic genes',
 'tp53 regulates transcription of dna repair genes',
 'transport of mature mrna derived from an intron-containing transcript',
 'transport of mature mrna derived from an intronless transcript',
 'trna processing in the nucleus',
 'unfolded protein response (upr)',
 'vpr-mediated nuclear import of pics',
 'vxpx cargo-targeting to cilium']

We can use the EnrichmentResult to extract out any given set of genes for a term (or the entire array). For a select term, we can extract out the species of interest to visualize. This makes chaining together the data, networks, and enrichment output seamlessly.

[71]:
exp_data.ph_silac.heatmap(
    ph_silac_enrichment_slim.sig.term_to_genes('apoptotic cleavage of cellular proteins'),
    subset_index='identifier',
    index='label',
    cluster_row=True,
    sort_row='index',
    min_sig=2,
    linewidths=0.01,
    figsize=(4, 8),
);
_images/Tutorial_114_0.png
[72]:
exp_data.ph_silac.heatmap(
    ph_silac_enrichment_slim.sig.term_to_genes('tp53 regulates transcription of dna repair genes'),
    subset_index='identifier',
    index='label',
    cluster_row=True,
    sort_row='index',
    min_sig=2,
    linewidths=0.01,
    figsize=(3, 6),
);
_images/Tutorial_115_0.png
[73]:
exp_data.ph_silac.heatmap(
    ph_silac_enrichment.sig.term_to_genes('apoptosis'),
    subset_index='identifier',
    index='label',
    cluster_row=True,
    sort_row='index',
    min_sig=2,
    linewidths=0.01,
    figsize=(2,12),
);
_images/Tutorial_116_0.png

Drawing

Since this part is time consuming, it is best to do it outside of a notebook. The code to do so can be found in “run_enrichment.py”. The results will be a csv file that we will load next.

Annotated Gene set Network (AGN)

Lastly, we created a function to generate molecular and coarse grain networks based on enrichment terms. Users can used the compressed enrichment result terms to generate large scale representations of their data, or by selecting key terms of importance. Here, we are going to use 3 terms from the compressed enrichment result class.

[74]:
from magine.networks.annotated_set import create_subnetwork
# selected terms of interest
terms=['g2/m dna damage checkpoint',
       'apoptotic cleavage of cellular proteins',
       'regulation of tp53 activity through phosphorylation']

term_net, mol_net = create_subnetwork(
    ph_silac_enrichment_slim, network,
    terms=terms,
    save_name='all_example',
    use_threshold=True,
    use_cytoscape=False, # If you have cytoscape open, this will create a cytoscape session if True
)
Creating ontology network
[75]:
draw_cyjs(term_net)
[76]:
draw_cyjs(mol_net, add_parent=True)
[77]:
draw_cyjs(mol_net, add_parent=False)

Finally, we can bring it full circle and subset our experimental data to visualize the nodes in the networks measured values over time.

[78]:
exp_data.ph_silac.heatmap(
    mol_net.nodes,
    subset_index='identifier',
    index='label',
    cluster_row=True,
    sort_row='index',
    min_sig=2,
    linewidths=0.01,
    figsize=(4,12),
);
_images/Tutorial_124_0.png

It is important to note that MAGINE provides a suite of tools that can be used to automate and consturct custom pipelines. What is shown in this tutorial is just a subset of the possibly pipelines one can build. You can contact james.c.pino@vanderbilt.edu or alex.lubbock@vanderbilt.edu if you have any other questions.