Welcome to CINDERellA!
The goal of this project is to enable you to run causal Bayesian networks that accurately predict up and downstream genes and entire regulatory networks on the basis of gene expression or gene expression+genetics.
You can read more about methods included in this toolbox in this paper in Genetics where we compare their performance across ~14,000 realistic networks.
CINDERellA has been used to discover genes driving Alzheimer’s disease-related dementias (AD/ADRD) on the basis of bulk brain transcriptomes, brain multi-omics data 1, and brain multi-omics data 2.
Overview
CINDERellA is an easy-to-use Bayesian Network Learning Tool that learns causal networks from gene expression data using Markov Chain Monte Carlo (MCMC) methods.
⚠️ MATLAB Compatibility: This toolbox is compatible with MATLAB versions up to R2016a. Newer MATLAB versions may encounter compatibility issues.
Quick Start
Step 1: Setup
% Add CINDERellA to your MATLAB path
CINDERellA_PATH = './functions';
addpath(CINDERellA_PATH);
Step 2: Load Your Data
% Load expression data (user's responsibility - done outside the function)
expdata = read_exp('your_expression_data.txt');
Step 3: Run CINDERellA
% Basic usage with default parameters
CINDERellA(expdata.data);
Usage Examples
Basic Usage
% Load data first
expdata = read_exp('my_expression_data.txt');
% Run with default settings
CINDERellA(expdata.data);
Custom Parameters
% With custom parameters
CINDERellA(expdata.data, 'output_dir', 'my_results', ...
'max_parents', 3, 'runtime_minutes', 30, 'num_samples', 1000);
Using Prior Knowledge
% Create prior matrix to constrain search space
nGenes = size(expdata.data, 1);
prior = ones(nGenes, nGenes); % Start with all edges allowed
prior(1,2) = 0; % Disallow edge from gene 1 to gene 2
CINDERellA(expdata.data, 'prior_matrix', prior);
Visualization Options
% With layout and visualization options
CINDERellA(expdata.data, 'layout', 'force', 'edge_threshold', 0.33);
Input Parameters
Required
- data_matrix - Expression data matrix (rows=genes, columns=samples)
Optional Parameters
- ‘output_dir’ - Output directory (default:
'./CINDERellA_results/'
) - ‘sampler’ - MCMC sampler method (default:
'M.c2PB'
, or'M.REV50'
if prior used) - ‘max_parents’ - Maximum parents per gene (default:
3
) - ‘runtime_minutes’ - Total runtime in minutes (default:
0.5
) - ‘num_samples’ - Number of network samples to collect (default:
100
) - ‘edge_threshold’ - Edge frequency threshold for visualization (default:
0.3
) - ‘prior_matrix’ - nGenes × nGenes binary matrix (1=allowed, 0=disallowed edges)
- ‘force_recompute’ - Force recomputation even if results exist (default:
false
) - ‘layout’ - Network layout algorithm (default:
'force'
)
MCMC Samplers
Single Chain Samplers
'STR'
,'c2PB'
,'c3PB'
,'c4PB'
,'1PB'
,'2PB'
,'3PB'
,'4PB'
,'REV50'
- Samples networks every (runtime/num_samples) seconds
Multi-Chain Samplers (Recommended)
'M.STR'
,'M.c2PB'
,'M.c3PB'
,'M.c4PB'
,'M.1PB'
,'M.2PB'
,'M.3PB'
,'M.4PB'
,'M.REV50'
- Collects final network state from each chain after (runtime/num_samples) seconds
Layout Options
- ‘circle’ - Circular layout
- ‘force’ - Force-directed layout (default)
- ‘layered’ - Hierarchical layered layout
- ‘subspace’ - Subspace layout
Data Format Requirements
- Input file: Tab-separated text file
- Structure: Genes as rows, samples as columns
- Minimum: At least 2 genes and 2 samples
Output Files
CINDERellA generates several output files in the specified output directory:
- edgefrq.txt - Edge frequencies from sampled networks (main result for evaluation)
- Mcmc.mat - All sampled networks
- Param.mat - Parameters used in the analysis
- LS.mat - Local scores
- mcmc_diagnostics.png - Log likelihood trace plot
- network_visualization.png - Network plot with edges above threshold
Complete Workflow Example
% 1. Setup
CINDERellA_PATH = './functions';
addpath(CINDERellA_PATH);
% 2. Load data (user's responsibility)
expdata = read_exp('test_data/exp.txt');
% 3. Run CINDERellA
CINDERellA(expdata.data, 'runtime_minutes', 1, 'max_parents', 3);
% 4. Evaluation (done separately)
% Load true network if available
network = read_network('test_data/network.txt', size(expdata.data, 1));
% Load learned edge frequencies
edgefrq_data = dlmread('./CINDERellA_results/edgefrq.txt');
nGenes = size(expdata.data, 1);
edgefrq = sparse(edgefrq_data(:,1), edgefrq_data(:,2), edgefrq_data(:,3), nGenes, nGenes);
% Perform evaluation
[AUCPR, AUCROC] = evaluation(edgefrq, network.data, 'plot', 1);
Tips for Best Results
Runtime Settings
- Short test runs: 0.5-2 minutes for initial testing
- Production runs: 30-60 minutes for reliable results
- Complex networks: Consider longer runtimes for better convergence
Sampling Strategy
- Multi-chain samplers (M.* prefix) are generally recommended for final results
- Single chain samplers are useful for initial testing to see how much runtime is needed for convergence
Prior Knowledge
- Use
prior_matrix
to incorporate known biological constraints - Set
prior(i,j) = 0
to disallow edge from gene i to gene j - Set
prior(i,j) = 1
to allow edge (default)
Visualization
- Adjust
edge_threshold
to control network complexity in plots - Higher thresholds show only strongest connections
- Lower thresholds show more potential connections
Troubleshooting
Common Issues
- Empty data matrix: Ensure your data file loaded correctly
- Dimension errors: Check that genes are rows, samples are columns
- Prior matrix size: Must be nGenes × nGenes
- Memory issues: Reduce
num_samples
orruntime_minutes
for large datasets
Performance Optimization
- Start with short runtime for testing
- Use appropriate number of samples (100-1000 typical)
- Consider your system’s computational capacity when setting parameters
Citation
If you use CINDERellA in your research, please cite:
Tasaki, S., Sauerwine, B., Hoff, B., Toyoshiba, H., Gaiteri, C., & Chaibub Neto, E. (2015). Bayesian network reconstruction using systems genetics data: comparison of MCMC methods. Genetics, 199(4), 973-989. doi:10.1534/genetics.114.172619
Author: Shinya Tasaki, Ph.D. (stasaki@gmail.com)
License: 3-clause BSD License