Our main source of training and testing data is RNAcompete [19]. Here we describe the data collection, preprocessing, and evaluation methods, along with our rationale for relying on only a subset of in vivo experiments.
We use the normalized probe intensity data from RNAcompete [19]. The only non-linear preprocessing step is to clamp probe intensities so that none exceeds the 99.95th percentile intensity.
When training a DeepBind model, the probe intensities undergo a linear transformation before being used as training targets; this transformation gives each set of targets mean 0.0 and variance 1.0.
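A minimal NumPy sketch of this preprocessing, assuming `intensities` is a 1-D array of normalized probe intensities for one experiment (the function and variable names are illustrative, not from the original code):

```python
import numpy as np

def preprocess_targets(intensities: np.ndarray) -> np.ndarray:
    """Clamp extreme intensities, then standardize to mean 0.0, variance 1.0."""
    # Non-linear step: cap intensities at the 99.95th percentile.
    cap = np.percentile(intensities, 99.95)
    clamped = np.minimum(intensities, cap)
    # Linear step: shift and scale so the targets have mean 0 and variance 1.
    return (clamped - clamped.mean()) / clamped.std()
```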
Full details of the DeepBind model are given in the supplementary material of the original paper: https://static-content.springer.com/esm/art%3A10.1038%2Fnbt.3300/MediaObjects/41587_2015_BFnbt3300_MOESM50_ESM.pdf
The architecture itself is a pretty simple CNN with max pooling.
For all RBP models we use motif_len=16, because the RNAcompete sequences are rather short (32-43 nt) and most RBPs have short motifs. We use num_motifs=16 for all experiments. For each dataset we consider the performance of only two neural network configurations: either no hidden layer, or one hidden layer with 32 rectified-linear (ReLU) units. A sketch of this architecture follows.
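The following PyTorch sketch illustrates the configuration described above (num_motifs=16 filters of width motif_len=16 over one-hot RNA, global max pooling, and an optional 32-unit ReLU hidden layer). It is an illustration only, not the released DeepBind implementation, which includes further stages (e.g. dropout) described in the supplementary material:

```python
from typing import Optional

import torch
import torch.nn as nn

class SimpleRBPModel(nn.Module):
    """16 motif detectors of width 16, global max pooling, optional hidden layer."""

    def __init__(self, motif_len: int = 16, num_motifs: int = 16,
                 hidden_units: Optional[int] = 32):
        super().__init__()
        # Four input channels: one-hot A/C/G/U.
        self.conv = nn.Conv1d(4, num_motifs, kernel_size=motif_len)
        if hidden_units is None:      # configuration 1: no hidden layer
            self.head = nn.Linear(num_motifs, 1)
        else:                         # configuration 2: one ReLU hidden layer
            self.head = nn.Sequential(
                nn.Linear(num_motifs, hidden_units),
                nn.ReLU(),
                nn.Linear(hidden_units, 1),
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 4, seq_len), with seq_len in the 32-43 nt range.
        h = torch.relu(self.conv(x))     # (batch, num_motifs, positions)
        h = h.max(dim=2).values          # max-pool each motif over positions
        return self.head(h).squeeze(-1)  # predicted (z-scored) intensity
```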
Hyperparameters are selected by random search. We evaluate learning rates in the range [0.0005, 0.05]. Models are trained with stochastic gradient descent (SGD), and we use batch_size=64 for all experiments.
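A hypothetical sketch of how one calibration trial's settings could be drawn. The learning-rate range [0.0005, 0.05] and the two network configurations come from the text above; the sampling distribution is not specified, so log-uniform sampling (a common choice for random search over learning rates) is assumed here:

```python
import math
import random

def sample_trial_settings() -> dict:
    """Draw one random-search trial's hyperparameters (assumed distributions)."""
    lo, hi = math.log(0.0005), math.log(0.05)
    return {
        "learning_rate": math.exp(random.uniform(lo, hi)),  # log-uniform in [0.0005, 0.05]
        "hidden_units": random.choice([None, 32]),          # the two configurations tried
        "batch_size": 64,                                   # fixed for all experiments
    }
```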
Each calibration trial runs for 20,000 parameter update steps. At 4,000-step intervals, we evaluate the current performance of the trained model on held-out validation data, resulting in five performance ratings for each calibration trial. If a model's validation performance is better at, say, step 8,000 than at step 20,000, we take this as evidence that the model began to over-fit the data midway through training.
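A sketch of this schedule, reusing the SimpleRBPModel sketch above and assuming a hypothetical `valid_fn` callable that scores the model on held-out probes (e.g. a Pearson correlation); neither helper name is from the original code:

```python
import torch

def run_trial(model, train_loader, valid_fn, lr,
              total_steps=20_000, eval_every=4_000):
    """One calibration trial: train with SGD, score validation data 5 times."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    scores, step = [], 0
    while step < total_steps:
        for x, y in train_loader:          # (one-hot sequences, z-scored targets)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
            step += 1
            if step % eval_every == 0:     # steps 4,000, 8,000, ..., 20,000
                scores.append(valid_fn(model))
            if step == total_steps:
                break
    # Better performance before step 20,000 than at it suggests over-fitting.
    overfit = max(scores[:-1]) > scores[-1]
    return scores, overfit
```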