Our main source of training and testing data is RNAcompete [19]. Here we describe the data collection, preprocessing, and evaluation methods, along with our rationale for relying on only a subset of in vivo experiments.
We use the normalized probe intensity data from RNAcompete [19]. The only non-linear preprocessing step is to clamp probe intensities so that none exceeds the 99.95th percentile intensity.
When training a DeepBind model, the probe intensities undergo a linear transformation before being used as training targets; this transformation gives each set of targets mean 0.0 and variance 1.0.
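A minimal NumPy sketch of this preprocessing, assuming `intensities` is a 1-D array of normalized probe intensities for one experiment (the function and variable names are illustrative, not from the original code):

```python
import numpy as np

def preprocess_targets(intensities: np.ndarray) -> np.ndarray:
    """Clamp extreme intensities, then standardize to mean 0.0, variance 1.0."""
    # Non-linear step: cap intensities at the 99.95th percentile.
    cap = np.percentile(intensities, 99.95)
    clamped = np.minimum(intensities, cap)
    # Linear step: shift and scale so the targets have mean 0 and variance 1.
    return (clamped - clamped.mean()) / clamped.std()
```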
Full details of the DeepBind model are given in the supplementary material of the original paper: https://static-content.springer.com/esm/art%3A10.1038%2Fnbt.3300/MediaObjects/41587_2015_BFnbt3300_MOESM50_ESM.pdf
The architecture itself is a pretty simple CNN with max pooling.
For all RBP models we use motif_len=16, because the RNAcompete sequences are rather short (32-43 nt) and most RBPs have short motifs. We use num_motifs=16 for all experiments. For each dataset we consider the performance of only two neural network configurations: either no hidden layer, or one hidden layer with 32 rectified-linear (ReLU) units. A sketch of this architecture follows.
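The following PyTorch sketch illustrates the configuration described above (num_motifs=16 filters of width motif_len=16 over one-hot RNA, global max pooling, and an optional 32-unit ReLU hidden layer). It is an illustration only, not the released DeepBind implementation, which includes further stages (e.g. dropout) described in the supplementary material:

```python
from typing import Optional

import torch
import torch.nn as nn

class SimpleRBPModel(nn.Module):
    """16 motif detectors of width 16, global max pooling, optional hidden layer."""

    def __init__(self, motif_len: int = 16, num_motifs: int = 16,
                 hidden_units: Optional[int] = 32):
        super().__init__()
        # Four input channels: one-hot A/C/G/U.
        self.conv = nn.Conv1d(4, num_motifs, kernel_size=motif_len)
        if hidden_units is None:      # configuration 1: no hidden layer
            self.head = nn.Linear(num_motifs, 1)
        else:                         # configuration 2: one ReLU hidden layer
            self.head = nn.Sequential(
                nn.Linear(num_motifs, hidden_units),
                nn.ReLU(),
                nn.Linear(hidden_units, 1),
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 4, seq_len), with seq_len in the 32-43 nt range.
        h = torch.relu(self.conv(x))     # (batch, num_motifs, positions)
        h = h.max(dim=2).values          # max-pool each motif over positions
        return self.head(h).squeeze(-1)  # predicted (z-scored) intensity
```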
Hyperparameters are selected by random search. We evaluate learning rates in the range [0.0005, 0.05]. Models are trained with stochastic gradient descent (SGD), and we use batch_size=64 for all experiments.
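A hypothetical sketch of how one calibration trial's settings could be drawn. The learning-rate range [0.0005, 0.05] and the two network configurations come from the text above; the sampling distribution is not specified, so log-uniform sampling (a common choice for random search over learning rates) is assumed here:

```python
import math
import random

def sample_trial_settings() -> dict:
    """Draw one random-search trial's hyperparameters (assumed distributions)."""
    lo, hi = math.log(0.0005), math.log(0.05)
    return {
        "learning_rate": math.exp(random.uniform(lo, hi)),  # log-uniform in [0.0005, 0.05]
        "hidden_units": random.choice([None, 32]),          # the two configurations tried
        "batch_size": 64,                                   # fixed for all experiments
    }
```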
Each calibration trial runs for 20,000 parameter update steps. At 4,000-step intervals, we evaluate the current performance of the trained model on held-out validation data, resulting in five performance ratings for each calibration trial. If a model's validation performance is better at, say, step 8,000 than at step 20,000, we take this as evidence that the model began to over-fit the data midway through training.
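A sketch of this schedule, reusing the SimpleRBPModel sketch above and assuming a hypothetical `valid_fn` callable that scores the model on held-out probes (e.g. a Pearson correlation); neither helper name is from the original code:

```python
import torch

def run_trial(model, train_loader, valid_fn, lr,
              total_steps=20_000, eval_every=4_000):
    """One calibration trial: train with SGD, score validation data 5 times."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    scores, step = [], 0
    while step < total_steps:
        for x, y in train_loader:          # (one-hot sequences, z-scored targets)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
            step += 1
            if step % eval_every == 0:     # steps 4,000, 8,000, ..., 20,000
                scores.append(valid_fn(model))
            if step == total_steps:
                break
    # Better performance before step 20,000 than at it suggests over-fitting.
    overfit = max(scores[:-1]) > scores[-1]
    return scores, overfit
```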