Tutorial¶

from IPython.display import display, Image
%matplotlib inline
import matplotlib.pyplot as plt

import pandas as pd
import seaborn as sns
import numpy as np

ExperimentalData¶

Since MAGINE is built for multi-sample, multi-omics data, it is no surprise that the data is the most important aspect. Here we should how to use the ExperimentalData class. We designed MAGINE data input to be as flexible as possible, requiring a standard format for 8 columns. Users are required to format the input files to share the same column names. Additional columns can still be used for additional tags on the data.

Column	Data type	Description
identifier	string	HNGC or HMDB ids
label	string	any label that you would like to have as an alterative to identifier. Useful for PTMS (BAX_S(ph)292), aliases, or chemical names (Deoxyuridine)
species_type	string	Options (gene, protein, or metabolite) Right now we are trying to work out a more comprehensive tag to use. For now, we suggest using `protein` for all gene related products and use the source column to further distinguish.
significant	bool	used to label if the measurment was signficant compared to control.
fold_change	float	Assumes non-log space
p_value	float	HNGC or HMDB ids
identifier	string	Significance statistic for fold change calculation (can be FDR, BH, etc corrected.
source	string	Experimental platform (ie SILAC, PH_SILAC, LABEL_FREE, RNASEQ)
sample_id	string	Used to identify sample (time points, drug dose, etc)

For this tutorial, we are going to use our time series multi-omic response of A549 cells to cisplatin.

The description of the experiments and dataset can be found in Norris, Jeremy L., et al.

This file is in the format as show above in the table.

# load the experimental data
from magine.data.experimental_data import load_data

exp_data = load_data('Data/norris_et_al_2017_cisplatin_data.csv.gz', low_memory=False)

About the data¶

This dataset consists of 6 experimental platforms across 4 time points. The ExperimentData class is designed to explore these data in a seamless way. The core of the class is a pandas.DataFrame, however we built additional functions that we frequently used. We store meta data, such as the sample_ids, exp_methods.

exp_data.sample_ids

['01hr', '06hr', '24hr', '48hr']

exp_data.exp_methods

['rna_seq', 'ph_silac', 'label_free', 'silac', 'C18', 'HILIC']

There are many functions built around the ExperimentalData class. This notebook is meant to demostrate an examples workflow of how they can be used. Please refer to the documentation for exhuastive information about the functions.

Summary of data¶

Gather number of measured species per time point and platform.

exp_data.create_summary_table()

sample_id	01hr	06hr	24hr	48hr	Total Unique Across
source
C18	522	227	653	685	1402
HILIC	471	605	930	613	1504
label_free	2766	2742	2551	2261	3447
ph_silac	2608	3298	3384	3236	5113
rna_seq	18741	19104	19992	-	20642
silac	2923	3357	3072	3265	4086

Count number of significantly changed (signficant_flag=True).

exp_data.create_summary_table(sig=True)

sample_id	01hr	06hr	24hr	48hr	Total Unique Across
source
C18	522	227	653	685	1402
HILIC	471	605	930	613	1504
label_free	196	46	271	874	1085
ph_silac	514	888	1227	851	2278
rna_seq	73	1999	12215	-	12340
silac	38	52	228	266	485

MAGINE uses the identifier column as the default index. This keeps things simple when using the output for other tools (passing to molecular networks). You can also pass an index argument to calculate other values. Here, we use the label column, which contains PTMs of our protein species. See how rna_seq values do not change, but there is an increase number of ph_silac unique species.

exp_data.create_summary_table(sig=True, index='label')

sample_id	01hr	06hr	24hr	48hr	Total Unique Across
source
C18	528	227	657	689	1412
HILIC	479	611	941	621	1521
label_free	201	46	281	911	1149
ph_silac	594	1370	2414	1368	4757
rna_seq	73	1999	12215	-	12340
silac	38	52	228	266	485

Filter by category (experimental method)¶

We can access the input data using the .species property. This returns a modified pandas.Datatable.

exp_data.species.head(5)

	identifier	label	species_type	fold_change	p_value	significant	sample_id	source
0	HOXD1	HOXD1_rnaseq	protein	-520.256762	0.00102	True	06hr	rna_seq
1	MIR7704	MIR7704_rnaseq	protein	-520.256762	0.00102	True	06hr	rna_seq
2	AC078814.1	AC078814.1_rnaseq	protein	-76.022260	0.00102	True	06hr	rna_seq
3	PPM1H	PPM1H_rnaseq	protein	-76.022260	0.00102	True	06hr	rna_seq
4	PLCH1	PLCH1_rnaseq	protein	-17.888990	0.00102	True	06hr	rna_seq

We added attributes to the classs to quickly separate the data based on various input columns. We use the species_type and source column name to split data into compounds, genes (includes species_type==gene), rna (includes species_type==protein, source == rna), or protein (species_type==gene, source != rna). They can be accessed with the “.prefix”, such as

exp_data.genes.head(5)

	identifier	label	species_type	fold_change	p_value	significant	sample_id	source
0	HOXD1	HOXD1_rnaseq	protein	-520.256762	0.00102	True	06hr	rna_seq
1	MIR7704	MIR7704_rnaseq	protein	-520.256762	0.00102	True	06hr	rna_seq
2	AC078814.1	AC078814.1_rnaseq	protein	-76.022260	0.00102	True	06hr	rna_seq
3	PPM1H	PPM1H_rnaseq	protein	-76.022260	0.00102	True	06hr	rna_seq
4	PLCH1	PLCH1_rnaseq	protein	-17.888990	0.00102	True	06hr	rna_seq

exp_data.compounds.head(5)

	identifier	label	species_type	fold_change	p_value	significant	sample_id	source
128152	HMDB0036114	(-)-3-Thujone	metabolites	1.6	2.100000e-02	True	06hr	C18
128153	HMDB0001320	(13E)-11a-Hydroxy-9,15-dioxoprost-13-enoic acid	metabolites	88.8	5.800000e-12	True	24hr	C18
128154	HMDB0012113	(22Alpha)-hydroxy-campest-4-en-3-one	metabolites	100.0	9.500000e-04	True	48hr	HILIC
128155	HMDB0010361	(23S)-23,25-dihdroxy-24-oxovitamine D3 23-(bet...	metabolites	-100.0	1.000000e-12	True	48hr	C18
128156	HMDB0011644	(24R)-Cholest-5-ene-3-beta,7-alpha,24-triol	metabolites	1.6	7.400000e-05	True	01hr	C18

Similarily, we can also filter the data by source using the .name, where name is anything in the source column. We can get a list of these by printing exp_data.exp_methods.

# prints all the available exp_methods
exp_data.exp_methods

['rna_seq', 'ph_silac', 'label_free', 'silac', 'C18', 'HILIC']

# filters to only the 'label_free'
exp_data.label_free.shape

(13085, 8)

exp_data.label_free.head(5)

	identifier	label	species_type	fold_change	p_value	significant	sample_id	source
102446	LIMS1	LIMS1_lf	protein	12.42	0.00003	True	01hr	label_free
102447	SMARCE1	SMARCE1_lf	protein	-2.49	0.00030	True	01hr	label_free
102448	HEXA	HEXA_lf	protein	6.42	0.00060	True	01hr	label_free
102449	SRSF1	SRSF1_lf	protein	-3.21	0.00060	True	01hr	label_free
102450	SF3B1	SF3B1_lf	protein	-1.57	0.00130	True	01hr	label_free

exp_data.HILIC.head(5)

	identifier	label	species_type	fold_change	p_value	significant	sample_id	source
128154	HMDB0012113	(22Alpha)-hydroxy-campest-4-en-3-one	metabolites	100.0	0.000950	True	48hr	HILIC
128157	HMDB0011644	(24R)-Cholest-5-ene-3-beta,7-alpha,24-triol	metabolites	1.7	0.000072	True	24hr	HILIC
128162	HMDB0012114	(3S)-3,6-Diaminohexanoate	metabolites	-1.9	0.000030	True	06hr	HILIC
128164	HMDB0012114	(3S)-3,6-Diaminohexanoate	metabolites	-3.0	0.002000	True	24hr	HILIC
128166	HMDB0012115	(3S,5S)-3,5-Diaminohexanoate	metabolites	-1.9	0.000030	True	06hr	HILIC

Significant filter¶

We can use the significant column to filter that data to only contain those species by applying .sig .

exp_data.species.shape

(132932, 8)

exp_data.species.sig.shape

(27288, 8)

exp_data.label_free.sig.head(5)

	identifier	label	species_type	fold_change	p_value	significant	sample_id	source
102446	LIMS1	LIMS1_lf	protein	12.42	0.00003	True	01hr	label_free
102447	SMARCE1	SMARCE1_lf	protein	-2.49	0.00030	True	01hr	label_free
102448	HEXA	HEXA_lf	protein	6.42	0.00060	True	01hr	label_free
102449	SRSF1	SRSF1_lf	protein	-3.21	0.00060	True	01hr	label_free
102450	SF3B1	SF3B1_lf	protein	-1.57	0.00130	True	01hr	label_free

Filter data to up or down regulated species.¶

For enrichment analysis, we will want to access up-regulated and down-regulated species using .up and .down.

exp_data.label_free.up.head(5)

	identifier	label	species_type	fold_change	p_value	significant	sample_id	source
102446	LIMS1	LIMS1_lf	protein	12.42	0.00003	True	01hr	label_free
102448	HEXA	HEXA_lf	protein	6.42	0.00060	True	01hr	label_free
102451	USP15	USP15_N-term A(ace)2_lf	protein	18.78	0.00270	True	01hr	label_free
102460	SBDS	SBDS_lf	protein	2.79	0.00560	True	01hr	label_free
102488	CLIC4	CLIC4_lf	protein	2.03	0.01880	True	01hr	label_free

exp_data.label_free.down.head(5)

	identifier	label	species_type	fold_change	p_value	significant	sample_id	source
102447	SMARCE1	SMARCE1_lf	protein	-2.49	0.0003	True	01hr	label_free
102449	SRSF1	SRSF1_lf	protein	-3.21	0.0006	True	01hr	label_free
102450	SF3B1	SF3B1_lf	protein	-1.57	0.0013	True	01hr	label_free
102452	CKAP4	CKAP4_lf	protein	-3.26	0.0030	True	01hr	label_free
102453	DDX17	DDX17_lf	protein	-3.08	0.0034	True	01hr	label_free

Extracting by sample (time point)¶

We also added an index filter to segregate by sample_id.

for i in exp_data.sample_ids:
    print(i)
    display(exp_data[i].head(5))

01hr

	identifier	label	species_type	fold_change	p_value	significant	sample_id	source
19160	GRIK4	GRIK4_rnaseq	protein	77.555651	0.019824	True	01hr	rna_seq
19161	GRIK4_3p_UTR	GRIK4_3p_UTR_rnaseq	protein	77.555651	0.019824	True	01hr	rna_seq
19162	AP001187.9	AP001187.9_rnaseq	protein	-25.455050	0.019824	True	01hr	rna_seq
19163	MIR192	MIR192_rnaseq	protein	-25.455050	0.019824	True	01hr	rna_seq
19164	MIR194-2	MIR194-2_rnaseq	protein	-25.455050	0.019824	True	01hr	rna_seq

06hr

	identifier	label	species_type	fold_change	p_value	significant	sample_id	source
0	HOXD1	HOXD1_rnaseq	protein	-520.256762	0.00102	True	06hr	rna_seq
1	MIR7704	MIR7704_rnaseq	protein	-520.256762	0.00102	True	06hr	rna_seq
2	AC078814.1	AC078814.1_rnaseq	protein	-76.022260	0.00102	True	06hr	rna_seq
3	PPM1H	PPM1H_rnaseq	protein	-76.022260	0.00102	True	06hr	rna_seq
4	PLCH1	PLCH1_rnaseq	protein	-17.888990	0.00102	True	06hr	rna_seq

24hr

	identifier	label	species_type	fold_change	p_value	significant	sample_id	source
37960	LHX3	LHX3_rnaseq	protein	202.225343	0.005180	True	24hr	rna_seq
37961	C17orf67	C17orf67_rnaseq	protein	2.571464	0.000123	True	24hr	rna_seq
37962	ALX1	ALX1_rnaseq	protein	-2.572587	0.000123	True	24hr	rna_seq
37963	MIR7844	MIR7844_rnaseq	protein	2.573033	0.009349	True	24hr	rna_seq
37964	TMCC3	TMCC3_rnaseq	protein	2.573033	0.009349	True	24hr	rna_seq

48hr

	identifier	label	species_type	fold_change	p_value	significant	sample_id	source
58025	TNS3	TNS3_1188_1197_phsilac	protein	-3.837129	0.049	True	48hr	ph_silac
58026	SIPA1L3	SIPA1L3_S(ph)158_phsilac	protein	-5.119600	0.049	True	48hr	ph_silac
58027	TNS3	TNS3_Y(ph)780_phsilac	protein	-4.986421	0.049	True	48hr	ph_silac
58028	FGD6	FGD6_S(ph)554_phsilac	protein	-3.900705	0.049	True	48hr	ph_silac
58029	GPN1	GPN1_S(ph)312_phsilac	protein	2.901199	0.049	True	48hr	ph_silac

exp_data['01hr'].head(5)

	identifier	label	species_type	fold_change	p_value	significant	sample_id	source
19160	GRIK4	GRIK4_rnaseq	protein	77.555651	0.019824	True	01hr	rna_seq
19161	GRIK4_3p_UTR	GRIK4_3p_UTR_rnaseq	protein	77.555651	0.019824	True	01hr	rna_seq
19162	AP001187.9	AP001187.9_rnaseq	protein	-25.455050	0.019824	True	01hr	rna_seq
19163	MIR192	MIR192_rnaseq	protein	-25.455050	0.019824	True	01hr	rna_seq
19164	MIR194-2	MIR194-2_rnaseq	protein	-25.455050	0.019824	True	01hr	rna_seq

Pivot table to get table across time¶

We also provide a function to quickly pivot the data to for easy export.

exp_data.label_free.pivoter(
    convert_to_log=False,
    index='identifier',
    columns='sample_id',
    values=['fold_change', 'p_value']
).head(10)

	fold_change				p_value
sample_id	01hr	06hr	24hr	48hr	01hr	06hr	24hr	48hr
identifier
A2M	1.040000	1.140	51.93	11.58	0.514800	0.44370	0.24260	0.11130
AACS	-1.100000	3.740	NaN	NaN	0.281800	0.26950	NaN	NaN
AAGAB	1.000000	-1.150	1.46	-2.03	0.968100	0.39240	0.84450	0.09760
AAK1	1.320000	1.590	NaN	1.72	0.715800	0.18110	NaN	0.95660
AAMP	-1.200000	-1.460	1.85	1.78	0.836800	0.55420	0.13640	0.32460
AAR2	NaN	-1.690	NaN	NaN	NaN	0.96510	NaN	NaN
AARS	0.326667	-0.035	-1.44	-3.12	0.299867	0.62425	0.46725	0.00045
AARS2	1.170000	NaN	NaN	NaN	0.253000	NaN	NaN	NaN
AARSD1	1.210000	4.070	-2.05	NaN	0.459700	0.49160	0.78440	NaN
AASDHPPT	-0.330000	1.020	1.07	-1.11	0.709600	0.81160	0.45290	0.00070

Note that in the previous two examples, we find that there are NaN values. This is because of our experiental data. We can easy check what species are not found in all 4 of our label free experiements.

print(len(exp_data.label_free.present_in_all_columns(
    index='identifier',
    columns='sample_id',
).id_list))

Number in index went from 3447 to 1819
1819

This shows that out of the 3447 unique species measured in label-free proteomics, only 1819 were measured in all time points. What one can do with this information is dependent on the analysis. We can filter by requiring a species to be signficantly changed in at least n samples.

print(exp_data.label_free.require_n_sig(n_sig=2).identifier.unique().shape)
print(exp_data.label_free.require_n_sig(n_sig=3).identifier.unique().shape)
print(exp_data.label_free.require_n_sig(n_sig=4).identifier.unique().shape)

(247,)
(53,)
(2,)

It is important to note that this class is basically a hopped up pandas.DataFrame, so the commands can be chained together.

exp_data.label_free.require_n_sig(n_sig=3).pivoter(
    convert_to_log=False,
    index='identifier',
    columns='sample_id',
    values=['fold_change', 'p_value']
).head(10)

	fold_change				p_value
sample_id	01hr	06hr	24hr	48hr	01hr	06hr	24hr	48hr
identifier
COX5A	-2.185000	-1.550	-1.610	5.480	0.121850	0.07850	0.11900	2.800000e-03
DAZAP1	-1.860000	1.240	2.040	2.740	0.364050	0.56310	0.08610	1.100000e-03
DDX3X	-1.940000	1.090	-1.620	-3.030	0.089000	0.70330	0.05480	2.700000e-03
DDX5	-7.623333	1.090	-2.085	-6.160	0.094733	0.59080	0.00725	2.900000e-03
ERH	-2.340000	-1.095	1.440	2.020	0.073300	0.48190	0.06335	1.885000e-02
FUS	-2.415000	1.200	1.940	2.620	0.096850	0.08740	0.03320	6.000000e-04
GPRC5A	-2.450000	-1.160	1.800	31.455	0.099200	0.09660	0.05520	4.000000e-04
H1FX	-1.660000	-1.320	-1.540	-6.340	0.095700	0.00250	0.01970	9.345000e-07
HNRNPA2B1	-4.365000	1.360	2.600	4.960	0.050550	0.09280	0.05390	2.000000e-04
HNRNPC	-4.543333	4.170	2.270	4.920	0.070000	0.15085	0.12120	3.800000e-03

Visualization¶

We provide commonly used plotting functions. .volcano_plot .volcano_by_sample .plot_histogram .plot_species .heatmap

Volcano plots¶

exp_data.label_free.volcano_plot();

exp_data.label_free.volcano_by_sample(sig_column=True);

exp_data.label_free.plot_histogram();

Plotting subset of species¶

We provide the a few plotting interfaces to explore that subsets of the data. Basically, you create a list of species and provide it to the function. It filters based on these and then returns the results.

Time series using plot’y or matplotlib¶

exp_data.label_free.plot_species(['LMNA', 'VDAC1'], plot_type='plotly')

exp_data.label_free.plot_species(['LMNA', 'VDAC1'], plot_type='matplotlib');

Heatplots¶

exp_data.label_free.heatmap(
    ['LMNA', 'VDAC1'],
    figsize=(6,4),
    linewidths=0.01
);

Notice that the above plot doesn’t show any of the modifiers of LMBA (no _s(ph)22_lf). This is because the default index to pivot plots is the identifier column. You can set the label column for plotting by passing index=label to the function. Note, if you want to filter the data using the more generic ‘identifier’ column, you just specify that with subset_index=‘identifier’

exp_data.label_free.heatmap(
    ['LMNA', 'VDAC1'],
    subset_index='identifier',
    index='label',
    figsize=(6,4),
    linewidths=0.01
);

Examples¶

Here are a few examples how all the above commands can be chained together to create plots with varying degrees of critera.

Query 1:¶

Heatmap of label-free proteomics that are signficantly change in at least 3 time points.

lf_sig = exp_data.label_free.require_n_sig(
    index='label',
    columns='sample_id',
    n_sig=3
).heatmap(
    convert_to_log=True,
    cluster_row=True,
    index='label',
    values='fold_change',
    columns='sample_id',
    annotate_sig=True,
    figsize=(8, 12),
    div_colors=True,
    num_colors=21,
    linewidths=0.01
);

Query 2:¶

Changes that happen at all 3 timepoints for RNA-seq.

exp_data.rna.require_n_sig(n_sig=3, index='label').plot_species(plot_type='plotly');

Query 3:¶

Heatmap and time series plot of proteins that are consistently down regulated at 3 time points.

exp_data.proteins.down.require_n_sig(n_sig=3, index='label').plot_species(plot_type='matplotlib');
exp_data.proteins.down.require_n_sig(n_sig=3, index='label').heatmap(index='label', cluster_row=True, linewidths=0.01);

Query 4:¶

Clustered heatmap of label-free data

exp_data.label_free.heatmap(
    linewidths=0.01,
    index='label',
    cluster_row=True,
    cluster_col=True,
    min_sig=3,
    figsize=(12,18)
);

Extending to other plots¶

Since our exp_data is built off a pandas.DataFrame, we can use other packages that take that data format. Seaborn is one such tool that provides some very nice plots.

label_free = exp_data.label_free.copy()
label_free.log2_normalize_df(column='fold_change', inplace=True)

g = sns.PairGrid(label_free,
                 x_vars=('sample_id'),
                 y_vars=('fold_change', 'p_value'),
                 aspect=3.25, height=3.5)
g.map(
    sns.violinplot,
    palette="pastel",
    order=label_free.sample_ids
);

Venn diagram comparisons between measurements¶

from magine.plotting.venn_diagram_maker import create_venn2, create_venn3

lf = exp_data.label_free.sig

lf = exp_data.label_free.sig.id_list
silac = exp_data.silac.sig.id_list
phsilac = exp_data.ph_silac.sig.id_list
hilic = exp_data.HILIC.sig.id_list
rplc = exp_data.C18.sig.id_list

create_venn2(hilic, rplc, 'HILIC', 'RPLC');

create_venn3(lf, silac, phsilac, 'LF', 'SILAC', 'ph-SILAC');

Networks¶

Create data driven network¶

MAGINE generates networks using prior knowledge obtained from network databases. It starts with seed species, which are biologically interesting. This can be from manual curation, based on filtering of data, or a variety of other means. We locate the seed species in multiple databases and finds interconnecting edges among them. The goal of this process was to obtain all the known biological regulation among the species. We currently utilize KEGG, Reactome, HMDB, TTRUST, and BioGrid for node and edge sources.

# some imports needed
from magine.networks.network_generator import build_network
import magine.networks.utils as utils
import networkx as nx
import os

2021-09-03 13:59:25.008 - magine - INFO - Logging started on MAGINE version 0.1.5
2021-09-03 13:59:25.009 - magine - INFO - Log entry time offset from UTC: -7.00 hours
[33mWARNING [bioservices:UniChem:119]: [0m [34mThe URL (http://www.ebi.ac.uk/unichem/rest) provided cannot be reached.[0m
[33mWARNING [bioservices:UniChem:119]: [0m [34mUniChem has added new source. Please update the source_ids attribute in bioservices[0m

This is done using the build_network function. Now we will create the network. We pass the seed (interesting species) and background list (all things measured that we want to make sure are not deleted from the network) to the generator as well as flags turning on all of the network databases.

measured = exp_data.species.id_list
sig_measured = exp_data.species.sig.id_list
print(len(measured))
print(len(sig_measured))

23725
15777

if not os.path.exists('Data/cisplatin_network.p'):
    network = build_network(

        # genes seed species
        seed_species=sig_measured,

        # all data measured, used to allow interconnecting nodes that are not in seeds.
        all_measured_list=measured,

        use_biogrid=True,  # expand with biogrid
        use_hmdb=True,  # expand with hmdb
        use_reactome=True,  # expand with reactome
        use_signor=True,  # expand with signor
        trim_source_sink=True,  # remove all source and sink nodes not measured
        save_name='Data/cisplatin_network'
    )
    # add attibutes to graph nodes (measured, measured at which time points,
    # significantly changed at which time point)
    utils.add_data_to_graph(network, exp_data)
    print("Saving network")
    # write to GML for cytoscape or other program
    nx.write_gml(
        network,
        os.path.join('Data', 'cisplatin_network_w_attributes.gml')
    )

    # write to gpickle for fast loading in python
    nx.write_gpickle(
        network,
        os.path.join('Data', 'cisplatin_based_network.p'),
    )
else:
    # Load the network, note that it is returned above but for time limits,
    # we will just load the generated one.
    network = nx.read_gpickle('Data/cisplatin_based_network.p')

print(network.number_of_nodes())
print(network.number_of_edges())

13308
181300

As you might iMAGINE, the larger number of input nodes and source databases, the larger the resulting network. 13308 nodes and 181300 edges are too much to manually explore. Thus, we are going to use the Subgraph Class to being to query the network. We developed multiple tools to subgraph and explore the network, as well as multiple visualizations.

Explore subgraphs of network¶

from magine.networks.subgraphs import Subgraph
from magine.networks.visualization import draw_igraph, draw_graphviz, draw_mpl, draw_cyjs
net_sub = Subgraph(network)

bax_neighbors = net_sub.neighbors(
    'BAX', # node of interest
    upstream=True, # include upstream nodes
    downstream=False,  # include downstream nodes
    include_only=exp_data.species.sig.id_list # limit nodes to only significant changed species
)

There are multiple ways to visualize the network. draw_igraph, draw_mpl, draw_graphviz, draw_cyjs. The draw_cyjs provides interactive networks within the browser. It shouldn’t be used for extremely large networks. There is also some issues with exporting these views when converting the notebooks to other formats (such as web documentation).

draw_igraph(bax_neighbors, bbox=[400, 400], node_size=25, inline=True)

draw_graphviz(bax_neighbors, 'fdp')

draw_cyjs(bax_neighbors)

This network can be exanded by a single or list of nodes passed.

expand = net_sub.expand_neighbors(bax_neighbors, nodes='BID', downstream=True)

draw_igraph(expand,
            bbox=[500, 500],
            node_font_size=25,
            font_size=4,
            node_size=25,
            inline=True,
            layout='graphopt')

draw_graphviz(expand, 'sfdp', width=700)

Finding paths between two proteins¶

BAX_to_AKT = net_sub.paths_between_pair('NOTCH1', 'MYC', bidirectional=True)

draw_graphviz(BAX_to_AKT)

draw_cyjs(BAX_to_AKT)

Running enrichment analysis via EnrichR¶

MAGINE automates the upload samples to EnrichR and collated the results into a user friends format (EnrichmentResult Class).

from magine.enrichment.enrichr import Enrichr

e = Enrichr()

Consistantly down RNA

const_dn_lf = exp_data.label_free.up.require_n_sig(n_sig=2, index='label')
const_dn_lf.heatmap();

# single sample, single background gene set
enrich_dn_lf = e.run(const_dn_lf.id_list, gene_set_lib= 'Reactome_2016')
enrich_dn_lf.head(10)

enrich_dn_lf.term_name = enrich_dn_lf.term_name.str.split('_').str.get(0)
enrich_dn_lf.head(10)

	term_name	rank	p_value	z_score	combined_score	adj_p_value	genes	n_genes	db	significant
0	mrna splicing - major pathway	1	0.000003	17.523290	224.549447	0.000275	FUS,HNRNPA2B1,RBMX,SNRPF,SRSF1,SRSF7	6	Reactome_2016	True
1	mrna splicing	2	0.000004	16.245283	201.396704	0.000275	FUS,HNRNPA2B1,RBMX,SNRPF,SRSF1,SRSF7	6	Reactome_2016	True
2	apoptotic cleavage of cellular proteins	3	0.000004	43.874380	543.247522	0.000275	GSN,LMNA,PLEC,VIM	4	Reactome_2016	True
3	caspase-mediated cleavage of cytoskeletal prot...	4	0.000005	118.642857	1442.061152	0.000275	GSN,PLEC,VIM	3	Reactome_2016	True
4	the citric acid (tca) cycle and respiratory el...	5	0.000006	15.243743	183.651472	0.000275	ATP5D,ATP5F1,ATP5L,SLC16A1,UQCRC2,UQCRFS1	6	Reactome_2016	True
5	formation of atp by chemiosmotic coupling	6	0.000013	82.120879	922.112457	0.000510	ATP5D,ATP5F1,ATP5L	3	Reactome_2016	True
6	apoptotic execution phase	7	0.000015	30.783752	341.143188	0.000510	GSN,LMNA,PLEC,VIM	4	Reactome_2016	True
7	respiratory electron transport, atp synthesis ...	8	0.000017	17.661147	193.585769	0.000510	ATP5D,ATP5F1,ATP5L,UQCRC2,UQCRFS1	5	Reactome_2016	True
8	processing of capped intron-containing pre-mrna	9	0.000022	11.958834	128.249699	0.000575	FUS,HNRNPA2B1,RBMX,SNRPF,SRSF1,SRSF7	6	Reactome_2016	True
9	mrna splicing - minor pathway	10	0.000486	21.747813	165.908608	0.011427	SNRPF,SRSF1,SRSF7	3	Reactome_2016	True

from magine.plotting.wordcloud_tools import create_wordcloud
wc = create_wordcloud(enrich_dn_lf.sig)
wc.plot();

Multi-timepoint enrichment analysis¶

We can also run enrichment analysis across multiple samples. The resulting output is similar to the ExperimentalData class, but designed around functions for futrther enrichment analysis. This class is the EnrichementResult class. First let’s run the enrichment for each ph-silac sample. We need a list of those species by sample and the sample ids to place them in our results. We can access these with : exp_data.ph_silac.sig.up_by_sample, exp_data.ph_silac.sig.sample_ids. Here we will run against the Reactome_2016 gene set, but you can also run it across a list of gene sets.

ph_silac_enrichment = e.run_samples(
    exp_data.ph_silac.sig.up_by_sample,
    exp_data.ph_silac.sig.sample_ids,
    gene_set_lib='Reactome_2016'
)

ph_silac_enrichment.head(10)

	term_name	rank	p_value	z_score	combined_score	adj_p_value	genes	n_genes	db	significant	sample_id
0	cell cycle_hsa-1640170	1	6.097256e-07	3.317719	47.477413	0.00043	ACD,AKAP9,BRCA1,CDC16,CDC20,CDC7,CLASP2,DCTN1,...	26	Reactome_2016	True	01hr
1	interleukin-2 signaling_hsa-451927	2	1.697041e-06	4.580979	60.865751	0.00055	AKAP9,BRAF,CNKSR2,CUL3,HAVCR2,INPPL1,IRS2,MAPK...	16	Reactome_2016	True	01hr
2	interleukin-3, 5 and gm-csf signaling_hsa-512988	3	2.677495e-06	4.410658	56.591512	0.00055	AKAP9,BRAF,CNKSR2,CUL3,INPPL1,IRS2,MAPK3,MARK3...	16	Reactome_2016	True	01hr
3	interleukin receptor shc signaling_hsa-912526	4	5.589099e-06	4.392809	53.129672	0.00055	AKAP9,BRAF,CNKSR2,CUL3,INPPL1,IRS2,MAPK3,MARK3...	15	Reactome_2016	True	01hr
4	signalling by ngf_hsa-166520	5	5.968762e-06	3.332650	40.088346	0.00055	AKAP13,AKAP9,ARHGEF16,BRAF,CNKSR2,CUL3,HDAC1,I...	21	Reactome_2016	True	01hr
5	mapk family signaling cascades_hsa-5683057	6	7.859464e-06	4.027363	47.336790	0.00055	AKAP9,BRAF,CNKSR2,CUL3,DNAJB1,IRS2,MAPK3,MARK3...	16	Reactome_2016	True	01hr
6	insulin receptor signalling cascade_hsa-74751	7	8.967430e-06	3.982165	46.280367	0.00055	AKAP9,BRAF,CNKSR2,CUL3,INSR,IRS2,MAPK3,MARK3,P...	16	Reactome_2016	True	01hr
7	signaling by interleukins_hsa-449147	8	9.805077e-06	3.453619	39.829245	0.00055	AKAP9,BRAF,CNKSR2,CUL3,HAVCR2,INPPL1,IRS2,MAP3...	19	Reactome_2016	True	01hr
8	signal attenuation_hsa-74749	9	1.022501e-05	43.755556	502.780826	0.00055	INSR,IRS2,MAPK3,SHC1	4	Reactome_2016	True	01hr
9	signaling by fgfr2_hsa-5654738	10	1.150341e-05	3.551082	40.385980	0.00055	AKAP9,BRAF,CNKSR2,CUL3,HNRNPA1,HNRNPM,INSR,IRS...	18	Reactome_2016	True	01hr

Term names from enrichR follow various formats, depending on the library and when it was created. We provide some tools to clean them up, but since the EnrichmentResult is based on pandas.Dataframe, we suggest users clean/shorten names to their liking.

# clean up naming of terms
ph_silac_enrichment.term_name = ph_silac_enrichment.term_name.str.split('_').str.get(0)

ph_silac_enrichment.head(10)

	term_name	rank	p_value	z_score	combined_score	adj_p_value	genes	n_genes	db	significant	sample_id
0	cell cycle	1	6.097256e-07	3.317719	47.477413	0.00043	ACD,AKAP9,BRCA1,CDC16,CDC20,CDC7,CLASP2,DCTN1,...	26	Reactome_2016	True	01hr
1	interleukin-2 signaling	2	1.697041e-06	4.580979	60.865751	0.00055	AKAP9,BRAF,CNKSR2,CUL3,HAVCR2,INPPL1,IRS2,MAPK...	16	Reactome_2016	True	01hr
2	interleukin-3, 5 and gm-csf signaling	3	2.677495e-06	4.410658	56.591512	0.00055	AKAP9,BRAF,CNKSR2,CUL3,INPPL1,IRS2,MAPK3,MARK3...	16	Reactome_2016	True	01hr
3	interleukin receptor shc signaling	4	5.589099e-06	4.392809	53.129672	0.00055	AKAP9,BRAF,CNKSR2,CUL3,INPPL1,IRS2,MAPK3,MARK3...	15	Reactome_2016	True	01hr
4	signalling by ngf	5	5.968762e-06	3.332650	40.088346	0.00055	AKAP13,AKAP9,ARHGEF16,BRAF,CNKSR2,CUL3,HDAC1,I...	21	Reactome_2016	True	01hr
5	mapk family signaling cascades	6	7.859464e-06	4.027363	47.336790	0.00055	AKAP9,BRAF,CNKSR2,CUL3,DNAJB1,IRS2,MAPK3,MARK3...	16	Reactome_2016	True	01hr
6	insulin receptor signalling cascade	7	8.967430e-06	3.982165	46.280367	0.00055	AKAP9,BRAF,CNKSR2,CUL3,INSR,IRS2,MAPK3,MARK3,P...	16	Reactome_2016	True	01hr
7	signaling by interleukins	8	9.805077e-06	3.453619	39.829245	0.00055	AKAP9,BRAF,CNKSR2,CUL3,HAVCR2,INPPL1,IRS2,MAP3...	19	Reactome_2016	True	01hr
8	signal attenuation	9	1.022501e-05	43.755556	502.780826	0.00055	INSR,IRS2,MAPK3,SHC1	4	Reactome_2016	True	01hr
9	signaling by fgfr2	10	1.150341e-05	3.551082	40.385980	0.00055	AKAP9,BRAF,CNKSR2,CUL3,HNRNPA1,HNRNPM,INSR,IRS...	18	Reactome_2016	True	01hr

The EnrichmentResult Class shares the same plotting format as the ExperimentalData Class, so we can resuse a lot of the commands we demonstrated earlier.

ph_silac_enrichment.require_n_sig(n_sig=3, inplace=True)
ph_silac_enrichment.heatmap(
    figsize=(4,16),
    linewidths=0.01,
    cluster_by_set=False
);

print(len(ph_silac_enrichment.sig.term_name.unique()))

Term compression¶

The above plot seems really busy. There are 139 enriched terms. If we look at the top ranked terms, we see that some fo them have similar descriptions “…. fgfr signaling”. If we look at the gene list, we can also see that some of the genes are similar. To see if there are redundant terms that are enriched, we can calculate their similarity with the Jaccard Index (intersection over union). $width=50$

# calculate the Jaccard Index and returns a ranked dataframe of terms and scores.
# Higher scores means more similar terms
d = ph_silac_enrichment.find_similar_terms('cell cycle, mitotic')
display(d.head(20))

	term_name	similarity_score
85	viral messenger rna synthesis	1.0
490	ns1 mediated effects on host pathways	1.0
447	export of viral ribonucleoproteins from nucleus	1.0
61	viral messenger rna synthesis	1.0
60	glucose transport	1.0
59	mitotic prophase	1.0
120	m phase	1.0
57	host interactions with influenza factors	1.0
56	ns1 mediated effects on host pathways	1.0
55	nuclear pore complex (npc) disassembly	1.0
424	nuclear envelope breakdown	1.0
121	hexose transport	1.0
495	host interactions with influenza factors	1.0
497	antiviral mechanism by ifn-stimulated genes	1.0
48	nuclear envelope breakdown	1.0
134	nuclear import of rev protein	1.0
498	isg15 antiviral mechanism	1.0
468	interactions of vpr with host cellular proteins	1.0
499	glucose transport	1.0
116	transport of ribonucleoproteins into the host ...	1.0

We can do this for all terms and view the results in a distance matrix (plot used for visualization purpose only).

ph_silac_enrichment.dist_matrix(figsize=(12, 12));

We can remove the redundant ones to compress the array. Here, we sort the terms by combined_score. For each term, we calculate the Jaccard index with all other terms. If a term falls below above a user defined threshold, it will be removed in the resulting array. By doing so, we minimize the total number of terms, while maximizing the information content of the resulting array.

ph_silac_enrichment_slim = ph_silac_enrichment.remove_redundant(level='dataframe')

Number of rows went from 139 to 29

# notive the reduction in size and overlap of terms
ph_silac_enrichment_slim.dist_matrix();

ph_silac_enrichment_slim.heatmap(
    min_sig=2,
    figsize=(4,12),
    linewidths=0.01,
    cluster_by_set=True
);

Important to known, we can still recover the terms removed based on the highest level term kept.

ph_silac_enrichment.show_terms_below('g2/m checkpoints').heatmap(
    linewidths=0.01,
    convert_to_log=False,
    figsize=(3, 8));

Number of rows went from 139 to 28

sorted(ph_silac_enrichment_slim.term_name.unique())

['activation of the ap-1 family of transcription factors',
 'antiviral mechanism by ifn-stimulated genes',
 'apoptotic cleavage of cellular proteins',
 'cellular response to heat stress',
 'cellular senescence',
 'g2/m checkpoints',
 'golgi cisternae pericentriolar stack reorganization',
 'growth hormone receptor signaling',
 'influenza life cycle',
 'late phase of hiv life cycle',
 'major pathway of rrna processing in the nucleolus',
 'mrna splicing - major pathway',
 'negative feedback regulation of mapk pathway',
 'nonsense-mediated decay (nmd)',
 'nuclear envelope breakdown',
 'nuclear import of rev protein',
 'regulation of mrna stability by proteins that bind au-rich elements',
 'rho gtpases activate paks',
 'rho gtpases activate wasps and waves',
 'rrna modification in the nucleus',
 'signal attenuation',
 'sumoylation of dna damage response and repair proteins',
 'sumoylation of dna replication proteins',
 'sumoylation of rna binding proteins',
 'transcriptional regulation by tp53',
 'transport of mature mrna derived from an intron-containing transcript',
 'transport of mature mrna derived from an intronless transcript',
 'trna processing in the nucleus',
 'vpr-mediated nuclear import of pics']

Visualize species of enrichmentt term¶

We can use the EnrichmentResult to extract out any given set of genes for a term (or the entire array). For a select term, we can extract out the species of interest to visualize. This makes chaining together the data, networks, and enrichment output seamlessly.

exp_data.ph_silac.heatmap(
    ph_silac_enrichment_slim.sig.term_to_genes('apoptotic cleavage of cellular proteins'),
    subset_index='identifier',
    index='label',
    cluster_row=False,
    sort_row='mean',
    min_sig=2,
    linewidths=0.01,
    figsize=(4, 8),
);

exp_data.ph_silac.heatmap(
    ph_silac_enrichment_slim.sig.term_to_genes('g2/m checkpoints'),
    subset_index='identifier',
    index='label',
    cluster_row=True,
    sort_row='index',
    min_sig=2,
    linewidths=0.01,
    figsize=(3, 6),
);

exp_data.ph_silac.heatmap(
    ph_silac_enrichment.sig.term_to_genes('apoptosis'),
    subset_index='identifier',
    index='label',
    cluster_row=True,
    sort_row='index',
    min_sig=2,
    linewidths=0.01,
    figsize=(2,12),
);

We can use the ExperimentalData class to filter the data to create lists of genes for further analysis. We take these lists and run enrichment analysis using Enrichr.¶

Since this part is time consuming, it is best to do it outside of a notebook. The code to do so can be found in “run_enrichment.py”. The results will be a csv file that we will load next.

Annotated Gene set Network (AGN)¶

Lastly, we created a function to generate molecular and coarse grain networks based on enrichment terms. Users can used the compressed enrichment result terms to generate large scale representations of their data, or by selecting key terms of importance. Here, we are going to use 3 terms from the compressed enrichment result class.

from magine.networks.annotated_set import create_asn
# selected terms of interest
terms=['apoptotic cleavage of cellular proteins',
       'g2/m checkpoints',
       'transcriptional regulation by tp53']

term_net, mol_net = create_asn(
    ph_silac_enrichment_slim, network,
    terms=terms,
    save_name='all_example',
    use_threshold=True,
    use_cytoscape=False, # If you have cytoscape open, this will create a cytoscape session if True
)

Creating ontology network

draw_cyjs(term_net)

draw_cyjs(mol_net, add_parent=True)

draw_cyjs(mol_net, add_parent=False)

Finally, we can bring it full circle and subset our experimental data to visualize the nodes in the networks measured values over time.

exp_data.ph_silac.heatmap(
    mol_net.nodes,
    subset_index='identifier',
    index='label',
    cluster_row=True,
    sort_row='index',
    min_sig=2,
    linewidths=0.01,
    figsize=(6,12),
);

Here we presented some examples of how to use MAGINE. The strength is in MAGINES ability to explore data, enrichment, and networks within a single space, allowing back and forth exploration. If you have any suggestions or would like to contribute workflows or pipelines, please share on our github issues.