CBDD algorithms and functions

This document lists algorithms and features which are currently available in CBDD R package.

The functionality of CBDD R package can be roughly classified as follows:

  • Build network: tools to retrieve, create or modify the molecular networks
  • Analyze data: various algorithms trying to generate insight from the ‘omics’ data sets in the context of the networks and pathways.
  • Interpret results: gain bioligical insights out of algorithm outputs.

Within each functionality block, algorithms are split into ‘areas’ based on their principal goal and annotated with primary references and input requirements. Other features, such as visualizations, are briefly described as well.

All implemented algorithms are listed alongside with their inputs and outputs. The legend for the input columns reads as follows:

  • R – Required; algorithm will not work if this input / attribute is not provided
  • A – Allowed; algorithm will work with this attribute and take it into account
  • Blank – Ignored (algorithm will not use this attribute)

There are several major input formats for CBDD algorithms. The algorithms differ in their input requirements (some require inputs to have specific properties, while some are very general). In most common case, two main inputs are required:

  • ‘Functional’ input: global network of interactions between molecular entities (genes or proteins), or collection of molecular pathways
  • ‘Data’, or ‘omics’ input: molecular profiling data set user wants to interpret (e.g. differentially expressed gene list)

The ‘omics’ data formats are:

  • Start nodes: a list of genes somehow associated with the phenotype of interest. The start nodes may have associated activity / abundance change values (e.g. expression fold changes) and also p-values showing significance of their phenotype association.
  • Matrix: a matrix of whole genome measurements, e.g. gene expression signals, plus typically the vector of phenotypes for samples to facilitate inter-group comparison.

The basic network format is simply an adjacency list – two-column matrix showing who interacts with whom. Network edges may have additional attributes:

  • Direction. Some algorithms require directed network and will imply that signal goes from node 1 to node 2.
  • Edge type (effect) Some algorithm utilize information on whether interactions result in the activation or inhibition of their target nodes
  • Mechanism. Some algorithms may differentiate between different types of interactions (e.g. physical binding vs. transcription regulation links)
  • Weight. Edges may be weighted by confidence.

Pathways are small subnetworks, typically much better studied than the rest of the network. The same network attributes described above are fully applicable in them.

There are two major approaches to create large-scale networks describing molecular relationships in biological systems:

  • Scaffold network: experimental detection of physical interactions between genes, proteins and other molecules, or curation of such findings from literature;
  • Data-driven network: de novo prediction of relationships between molecules based on pattern mining in diverse data sets (for example, search for expression similarities, typically indicative of some sort of functional relationships between genes; or text mining for co-citations of genes in compendia of literature)

Both approaches have their advantages. The CBDD provides infrastructure to load pre-existing networks from text files and from Clarivate Analytics’ MetaBase. Also, following algorithms are available for data-driven generation and modification of networks:

Area Description
Data-driven network De novo generation of relationships between biological entities (e.g. from similarities in gene expression profiles).
Network adjustment Weighting and adjustment of networks (e.g. based on tissue expression data), making them more specific to a particular biological context.

There are three approaches for data-driven reconstruction of networks. The simplest one is conventional co-expression analysis (edges are correlations between matrix rows – i.e. gene expression profiles – which are higher than a threshold).

The latter two methods are more sophisticated. ARACNE uses mutual information to infer expression profile similarities and prunes networks to drive down false positive edges. networkAnova uses analysis of variance to infer edges from patterns of expression change across a range of phenotypes.

Algorithm Data input Matrix phenotypes Reference
dataDrivenNetwork Matrix
ARACNE Matrix 16723010
networkAnova Matrix R 22467911

All these approaches have average performance (minutes).

The network analysis algorithms do different things and might be utilized for different purposes. It is often hard to classify the algorithms into precisely defined categories. CBDD uses a classification based upon the end goal (roughly corresponding to the typical research needs in drug development):

Area Description Example purpose
Node prioritization Learn which nodes in the network are well connected to the nodes of interest and might regulate the phenotype. Drug target identification
Subnetwork prioritization Find modules in networks which are associated with phenotype. Mechanism reconstruction; Biomarker discovery
Pathway prioritization Learning which of the canonical signaling pathways are associated with phenotype provides good clues to the molecular mechanisms behind it and may help with biomarker search. Mechanism reconstruction; Biomarker discovery
Unsupervised analysis Learn how patients stratify into the subtypes and which networks and pathways drive this stratification. Patient stratification
Integrative analysis Many omics data types are routinely available, and there is often need to understand how they talk to one another. Networks can be utilized for answering questions such as ‘which mutations in my dataset affect the differential expression in disease?’. Any purpose

Almost all of these algorithms are based on the guilt-by-association principle, ranking all nodes in the networks by topological closeness to the set of start nodes (the definitions of topological closeness vary). All of these algorithms are pretty fast and can operate on large networks. They can be roughly split into three subclasses

  • Local approaches (assessing one or several steps away neighborhood of start nodes)
  • Global approaches (taking into account the entire network topology): random walk, network propagation, toppNet
  • Causal reasoning and SigNet: local algorithms taking advantage of rich network annotation to make inference about causal regulators of input node set and their possible activation/repression. These methods are more demanding in terms of memory and are a little slower.
Algorithm Func. input Edge dir. Edge weight Edge type Edge mech. Data input St.node weights St.node p-values Reference
overconnectivity Network A Start nodes 19010930
guiltByAssociation Network Start nodes 11101803
neighborhoodScoring Network A Start nodes R 20840752
interconnectivity Network A Start nodes 22369140
hiddenNodes Network A A Start nodes 19309513
randomWalk Network A A Start nodes A 18371930
networkPropagation Network A A Start nodes A 20090828
toppNetHITS Network R A Start nodes A 19245720
toppNetKM Network R A Start nodes A 19245720
GeneMania Network A Start nodes 18613948
causalReasoning Network R R R Start nodes R 22355083
SigNet Network R R R Start nodes R 24518063

Pathway prioritization approaches will be added soon

Subnetwork algorithms look for parts of global network which are enriched with the start nodes or somehow associated with input data. Pathway algorithms, similarly, prioritize the small networks from a collection (e.g. canonical signaling pathways) which are associated with the data.

There are several flavors of these algorithms:

  • Enriched subnetwork finders: activeModules, pathwayInference, HotNet/HotNet2, DEGAS
  • Dense subnetwork finders (they can also prioritize their findings by association with data): MCODE, DENSE. These approaches are better suited for data-driven networks than for scaffold networks like MetaBase
  • Biomarker finders, seeking pathways/subnetworks discriminating between two phenotypes in a data set: pathwayActivityInference, subnetworkMarkers
  • Methods looking for subnetworks/pathways where perturbed nodes are in agreement with network topology and activation/inhibition pathways: CASNet, SPIA
Algorithm Func. input Edge dir. Edge weight Edge type Edge mech. Data input St.node weights St.node p-values Matrix phenotypes Reference
activeModules Network Start nodes R 12169552
pathwayInference Network Start nodes R 15509611
HotNet Network Start nodes A A 22174262
HotNet2 Network Start nodes A A 25501392
DEGAS Network Matrix R 20976054
DENSE Network Start nodes 22024446
MCODE Network Start nodes 12525261
pathwayActivityInference Pathways R Matrix R 19997592
subnetworkMarkers Network Matrix R 17940530
SPIA Pathways R R R A Start nodes R 18990722
CASNet Network A R Start nodes R R 23368093

Most of these methods are reasonably fast and can handle large input networks and data sets. Exceptions are DENSE and HotNet (high memory requirements). HotNet may run for more than 1 hour on laptop for network with ~10K nodes.

These approaches seek to integrate several input sets of nodes (typically coming from different omics data types, like mutations and expression profiles) and infer subnetworks connecting them.

  • Cause/consequence algorithms: use random walk-like approaches or prize-collecting Steiner tree algorithm to devise pathways connecting nodes from ‘cause’ start set with nodes from ‘consequences’ start set.
Algorithm Func. input Edge dir. Edge weight Edge type Edge mech. Data input St.node weights St.node p-values Matrix phenotypes Reference
TieDie Network A A A Start nodes (2) A 23986566
PCST Network A A Start nodes (2) A 19638617
PCSF Network A A Start nodes (2) A 23383998
eQED Network A A A Start nodes (2 or more) 18319721
PARADIGM Network R R A Matrix (1 or more) 20529912

Of these approaches, TieDie works fast; PCST / PCSF is slower and more eager for memory (although network with 10K nodes can be handled reasonably fast), and eQED is very sensitive to the input size (likely running into problems in networks as large as 5K nodes)

There are two approaches for patient stratification / unsupervised clustering in the toolkit. Both utilize non-negative matrix factorization supplied with network-based constraints.

For NCIS, this is simply upweighting nodes which are well connected with high variance nodes in the network. For NBS, the constraints actually force node clusters to contain topologically close nodes (although they do not have to be all connected to each other)

Algorithm Func. input Edge dir. Edge weight Edge type Edge mech. Data input Matrix phenotypes Reference
NCIS Network A A Matrix 24491042
NBS Network Matrix 24037242

Both approaches have average performance (minutes); assessing cluster stability (performing multiple runs of algorithms and checking how pairs of samples co-cluster) may take a long time

This block includes functionality intended to make sense of the results produced by the algorithms, namely:

  • Visualization
  • Biological interpretation (e.g. enrichment analysis to reveal impacted biological processes)
  • Comparison of algorithm results
  • Network comparison

dE-MAP is a differential network analysis approach, which highlights differences between two input networks.

Algorithm Func. input Edge dir. Edge weight Edge type Edge mech. Reference
dEMAP Network (2) A 21127252
editDistance Network (2) A A A A doi:10.1109/TSMC.1983.6313167

Visualization is crucial aspect of network analysis; especially when it concerns the subnetwork identification task. In order to gain insight from the subnetwork analysis result, analyst should be able to create a network graph and overlay his or her data on top of it. Unfortunately, base R lacks effective tools for creating network-based visualizations.

CBDD includes basic interactive network viewer, which renders subnetwork as an HTML widget. The widget can be explored in RStudio, embedded into the R markdown documents or simply saved as HTML web page.

The new network viewer function has the following features:

  • Basic features
    • Interactive visualization produced by an ordinary R function call and rendered as an HTML page. Users can zoom, pan, select and move nodes around. Nodes and their neighbors are highlighted to help understand the local structure of the network;
    • Full control of aesthetics of the network (node shapes, sizes, colors etc.) Default visualization aesthetics would depend on network attributes (e.g. edge width corresponding to edge weight in weighted networks).;
    • Nodes can be relabeled to provide human-readable names in the visualization
    • Export to image, HTML page, XGMML data format (with preservation of visual attributes and layout) for Cytoscape.
  • Network and data interpretation tools
    • Data overlay: mapping data to network’s visual attributes (e.g. node color); support for multiple data layers.
    • Display of experimental data and annotations for nodes and edges on the pop-up tooltips
    • Enrichment of selected node sets with supplied ontology; depicting top processes in the separate panel of the visualization page. Highlight the nodes for selected process
    • Grouping and collapsing nodes to simplify the network.

There are functions allowing comparisons of the most popular kinds of results the CBDD algorithms yield: subnetworks and ranked lists of entities (e.g. network nodes). Three possible use cases are defined for comparing the results of algorithms:

  • Pairwise comparison of result sets. For example, how different are score discributions for two node prioritization algorithms?
  • Overview of nodes (edges) relevance for the multiple result sets. For example, what are nodes consistently ranked high by node prioritization algorithms? Are there any patterns in node scores distributions?
  • Functional overview multiple result sets. For example, what are GO processes associated with subnetworks found by different algorithms for my data set? Do they make sense?

For each use case, interactive visualizations are available.