
Training Data

Our main source of training and testing data is RNAcompete [19]. Here we describe the data collection, preprocessing, and evaluation methods, along with our rationale for relying on only a subset of in vivo experiments.

We use the normalized probe intensity data from RNAcompete [19]. The only non-linear preprocessing step is to clamp probe intensities so that none exceeds the 99.95th percentile intensity.

When training a DeepBind model, the probe intensities undergo a linear transformation before being used as training targets. The linear transformation ensures each set of targets has mean 0.0 and variance 1.0.
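The two preprocessing steps above can be sketched as follows. This is a minimal illustration rather than the actual DeepBind code: the function names are hypothetical, the percentile uses linear interpolation between order statistics, and the variance is the population variance. The text fixes none of these choices, so treat them as assumptions.

```python
import math

def percentile(values, q):
    """q-th percentile (0-100), linearly interpolated (interpolation rule is an assumption)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must be non-empty")
    rank = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def clamp_intensities(intensities, q=99.95):
    """Non-linear step: cap every intensity at the q-th percentile."""
    cap = percentile(intensities, q)
    return [min(x, cap) for x in intensities]

def standardize(targets):
    """Linear step: shift and scale to mean 0.0 and (population) variance 1.0."""
    n = len(targets)
    mean = sum(targets) / n
    var = sum((t - mean) ** 2 for t in targets) / n
    if var == 0.0:
        raise ValueError("targets are constant; cannot scale to unit variance")
    std = math.sqrt(var)
    return [(t - mean) / std for t in targets]

# Toy data (not real RNAcompete intensities): 9,999 ordinary probes and one outlier.
raw = [float(i % 100) for i in range(9999)] + [1e6]
clamped = clamp_intensities(raw)
targets = standardize(clamped)
```

Clamping before standardizing keeps a single extreme probe from dominating the mean and variance used for the linear transformation.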


A pretty simple CNN with max pooling

For all RBP models we used motif_len=16 because the RNAcompete sequences are rather short (32-43 nt) and most RBPs have short motifs.

We use num_motifs=16 for all experiments.

For each dataset we only consider the performance of two neural network configurations: either no hidden layer, or one hidden layer with 32 rectified-linear (ReLU) units.
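To make the two configurations concrete, the sketch below runs a forward pass through a convolution-and-max-pooling model in plain Python. Only motif_len, num_motifs, and the hidden-layer size come from the text. The one-hot encoding, the absence of sequence padding, the rectification after each motif detector, the random initialisation, and all function names are assumptions made for illustration, not the DeepBind implementation.

```python
import random

MOTIF_LEN = 16    # motif_len from the text
NUM_MOTIFS = 16   # num_motifs from the text
NUM_HIDDEN = 32   # ReLU units in the configuration with a hidden layer
ALPHABET = "ACGU"

def one_hot(seq):
    """Encode an RNA sequence as one 4-dimensional indicator vector per base."""
    return [[1.0 if base == b else 0.0 for b in ALPHABET] for base in seq]

def relu(x):
    return x if x > 0.0 else 0.0

def init_params(rng, use_hidden):
    """Small Gaussian weights (initialisation scheme is an assumption)."""
    def w():
        return rng.gauss(0.0, 0.1)
    params = {
        "motifs": [[[w() for _ in ALPHABET] for _ in range(MOTIF_LEN)]
                   for _ in range(NUM_MOTIFS)],
        "motif_bias": [0.0] * NUM_MOTIFS,
    }
    n_in = NUM_MOTIFS
    if use_hidden:
        params["hidden_w"] = [[w() for _ in range(n_in)] for _ in range(NUM_HIDDEN)]
        params["hidden_b"] = [0.0] * NUM_HIDDEN
        n_in = NUM_HIDDEN
    params["out_w"] = [w() for _ in range(n_in)]
    params["out_b"] = 0.0
    return params

def predict(seq, params):
    """Convolve, rectify, max-pool, then apply the optional hidden layer."""
    x = one_hot(seq)
    if len(x) < MOTIF_LEN:
        raise ValueError("sequence is shorter than motif_len")
    pooled = []
    for motif, bias in zip(params["motifs"], params["motif_bias"]):
        responses = []
        for start in range(len(x) - MOTIF_LEN + 1):
            s = bias
            for j in range(MOTIF_LEN):
                for k in range(len(ALPHABET)):
                    s += motif[j][k] * x[start + j][k]
            responses.append(relu(s))
        pooled.append(max(responses))  # max pooling over all positions
    features = pooled
    if "hidden_w" in params:
        features = [relu(sum(wi * f for wi, f in zip(row, pooled)) + b)
                    for row, b in zip(params["hidden_w"], params["hidden_b"])]
    return sum(wi * f for wi, f in zip(params["out_w"], features)) + params["out_b"]

rng = random.Random(0)
probe = "".join(rng.choice(ALPHABET) for _ in range(38))  # toy probe, 38 nt
score_no_hidden = predict(probe, init_params(rng, use_hidden=False))
score_hidden = predict(probe, init_params(rng, use_hidden=True))
```

Max pooling reduces each motif detector to a single number regardless of probe length, which is why the same output layers can score sequences anywhere in the 32-43 nt range.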


Random search

We evaluate learning rates in the range [0.0005, 0.05].
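A random search over this range might look like the sketch below. The text gives only the interval, so the log-uniform sampling, the number of trials, and the `evaluate` callback (a stand-in for training a model and returning its validation score) are all assumptions.

```python
import math
import random

LR_MIN, LR_MAX = 0.0005, 0.05  # range given in the text

def sample_learning_rate(rng):
    """Draw one learning rate log-uniformly from [LR_MIN, LR_MAX] (assumed distribution)."""
    lr = math.exp(rng.uniform(math.log(LR_MIN), math.log(LR_MAX)))
    return min(max(lr, LR_MIN), LR_MAX)  # guard against floating-point overshoot

def random_search(evaluate, num_trials, seed=0):
    """Score num_trials random learning rates; return the best (score, rate) pair."""
    rng = random.Random(seed)
    trials = []
    for _ in range(num_trials):
        lr = sample_learning_rate(rng)
        trials.append((evaluate(lr), lr))
    return max(trials)

def toy_validation_score(lr):
    """Hypothetical stand-in for a real training run; peaks at lr = 0.005."""
    return -abs(math.log10(lr) - math.log10(0.005))

best_score, best_lr = random_search(toy_validation_score, num_trials=30)
```

Sampling on a log scale spends as many trials between 0.0005 and 0.005 as between 0.005 and 0.05, which is usually what one wants when the range spans two orders of magnitude.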

Models are trained with stochastic gradient descent (SGD).

We use batch_size=64 for all experiments.

Each calibration trial runs for 20,000 parameter update steps. At 4,000-step intervals, we evaluate the current performance of the model on held-out validation data, resulting in five performance ratings for each calibration trial. If a model's validation performance is better at step 8,000 than at step 20,000, we take this as evidence that the model began to over-fit the data midway through training.
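The checkpoint schedule and the over-fitting test can be written down directly. In the sketch below the function names are hypothetical, the validation scores are made-up numbers for illustration, and a higher score is assumed to mean better performance.

```python
NUM_STEPS = 20_000      # parameter update steps per calibration trial
EVAL_INTERVAL = 4_000   # validation is evaluated every 4,000 steps

def checkpoint_steps(num_steps=NUM_STEPS, interval=EVAL_INTERVAL):
    """Steps at which validation performance is recorded."""
    return list(range(interval, num_steps + 1, interval))

def shows_overfitting(scores_by_step, early_step=8_000, final_step=NUM_STEPS):
    """True if the score at early_step beats the score at final_step."""
    return scores_by_step[early_step] > scores_by_step[final_step]

steps = checkpoint_steps()

# Made-up validation scores for two hypothetical calibration trials.
overfit_trial = dict(zip(steps, [0.41, 0.52, 0.50, 0.47, 0.44]))
healthy_trial = dict(zip(steps, [0.38, 0.46, 0.51, 0.53, 0.54]))

overfit_flag = shows_overfitting(overfit_trial)
healthy_flag = shows_overfitting(healthy_trial)
```

With these settings `checkpoint_steps()` yields the five evaluation points 4,000 through 20,000, matching the five performance ratings per trial described above.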