TAHLIL AHMED CHOWDHURY & MD. NAZMUL ALAM
ABSTRACT
MicroRNAs (miRNAs) are a group of short (~22 nt) non-coding RNAs that play important regulatory roles. miRNAs are transcribed as long primary miRNAs, which are processed into 60–70 nt miRNA precursors (pre-miRNAs) by the nuclear RNase III enzyme Drosha. The pre-miRNAs are then cleaved into ~22 nt mature miRNAs. miRNA precursors are characterized by their hairpin structures; however, a large number of similar hairpins can be folded in many genomes. Many computational approaches exist for classifying pseudo hairpins and pre-miRNAs. Support vector machines, Gaussian/kernel density estimators, and hidden Markov models are commonly used as prediction models. Nowadays the artificial neural network, a well-known supervised machine learning approach for pattern recognition, is also used extensively in this prediction field. In this study we describe an approach that uses an artificial neural network for improved classification of pseudo hairpins and pre-miRNAs.
Artificial Neural Network Architecture
A feedforward network is used as the ANN architecture. It has three layers: two hidden layers and one output layer (the input layer is not counted as a layer). We used 20 neurons in each hidden layer. The input layer has 22 neurons, equal to the number of features used in our system, and the output layer has one neuron. Each layer is directly connected only to its adjacent layers.
Input and Output
We used 21 features as input to our network. These 21 features are scaled to values between 0 and 1. The output of the network is a binary value, either 0 or 1: 1 means the input is a pre-miRNA sequence, and 0 means the input sequence is a pseudo hairpin.
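The exact feature-scaling step is not shown here; the snippet below is a minimal sketch of one common approach (per-feature min-max scaling to [0, 1], and thresholding the continuous network output at 0.5). The names raw_features and net_output are hypothetical and are not part of the original code.

% Minimal sketch: min-max scale each feature (row) of a hypothetical
% features-by-samples matrix raw_features to [0, 1].
min_f = min(raw_features, [], 2);
max_f = max(raw_features, [], 2);
n_samples = size(raw_features, 2);
scaled_features = (raw_features - repmat(min_f, 1, n_samples)) ./ ...
                  repmat(max_f - min_f, 1, n_samples);

% The trained network produces a continuous output (net_output); thresholding
% at 0.5 gives the binary label: 1 = pre-miRNA, 0 = pseudo hairpin.
predicted_label = double(net_output >= 0.5);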
Training Algorithm, Transfer Function, and Learning Function
Determining which training algorithm, transfer function, and learning function to use is a difficult task. There are 19 training algorithms, 13 transfer functions, and 15 learning functions, and testing all 19 × 13 × 15 combinations is impractical with our current processing capability, so we tested a subset of the combinations. In our system we used the quasi-Newton BFGS algorithm (trainbfg) as the training algorithm, the logsig transfer function for the first hidden layer and purelin for the second hidden layer, and learnlv1 as the learning function.
Weight Initialization
Random weight initialization gives different outputs for the same configuration and inputs each time the network is tested, so we needed a deterministic initialization method. We used the following equation, taken from the miRANN system, which initializes the weights of each layer within approximately [-0.05, 0.05]. Neighbouring connections receive only very small differences in their weights under this equation.

W(i,j) = -0.05 + (i*c + j) * (0.1 / (r*c))

where W(i,j) is the weight connecting input j to neuron i of a layer, r is the total number of neurons in that layer, and c is the total number of inputs to it.
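As a quick sanity check (an illustrative snippet, not part of the training code), evaluating the formula for a layer of the size used in our network shows that all weights stay in a narrow band around zero and that adjacent weights differ by only 0.1/(r*c):

% Evaluate the initialization formula for r = 20 neurons and c = 21 inputs;
% the resulting weights lie roughly between -0.045 and 0.055.
r = 20; c = 21;
W = zeros(r, c);
for i = 1:r
    for j = 1:c
        W(i, j) = -0.05 + (i*c + j) * (0.1 / (r*c));
    end
end
fprintf('min = %.4f, max = %.4f, step = %.6f\n', min(W(:)), max(W(:)), 0.1/(r*c));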
MATLAB Neural Network Toolbox
Many standard network implementations are available for training, but for this work we used the MATLAB Neural Network Toolbox. It is efficient, and all the complex functions we need can be accessed easily.
Training The Network
To train the network, we first have to load the input set and the corresponding output set, since this is supervised learning. We use the load function for this, for example:
f257_train_input=transpose(load('E:\MIRNA\NegetiveSet\TrainSet\Input_data'));
Then we define some parameters: the transfer functions, the training function, the learning function, the hidden layer sizes (the number of neurons in each hidden layer), the maximum number of iterations, and the validation set size.
transfer_func = {'logsig' 'purelin'};
train_func = 'trainbfg';
learn_func = 'learnlv1';
layer_size = [20 20];
max_epoc = 1;          % number of iterations per call to train
max_validation = 10;
Here max_epoc is 1 because we train the network repeatedly by hand: max_epoc = 1 means each training call automatically runs only one iteration, and after each iteration we check whether the network's performance is good enough. trainbfg is the training function for the network, learnlv1 is the learning function, and logsig and purelin are the transfer functions of the first and second hidden layers, respectively.
The function newff creates a feedforward backpropagation network. It requires three arguments: the training input, the training output, and the layer sizes. We can pass additional parameters to configure the network exactly as we want; here we also pass the transfer functions, the training function, and the learning function.
net = newff(f257_train_input, f257_train_output, layer_size, transfer_func, train_func, learn_func);
Here net is the network object returned by newff. f257_train_input is the set of both pseudo-hairpin and pre-miRNA sequences (the pre-miRNA sequences were obtained from miRBase release 17), and f257_train_output is the corresponding output for each input sequence, as required for supervised learning. This command creates the network object net and also initializes the weights and biases of the network, so the network is ready for training. To re-initialize the weights, or to perform custom initialization, we used our equation to generate the values and assigned them to net.IW, net.LW, and net.b.
[riw, ciw] = size(net.IW);
[rlw, clw] = size(net.LW);
[rb, cb] = size(net.b);

for i = 1:riw
    for j = 1:ciw
        [r, c] = size(net.IW{i,j});
        for m = 1:r
            for n = 1:c
                net.IW{i,j}(m,n) = -.05 + (m*c + n) * (.1/(r*c));
            end
        end
    end
end
The code above initializes the weights of the connections from the input layer, which here feed only the first hidden layer.
for i = 1:rlw
    for j = 1:clw
        [r, c] = size(net.LW{i,j});
        for m = 1:r
            for n = 1:c
                net.LW{i,j}(m,n) = -.05 + (m*c + n) * (.1/(r*c));
            end
        end
    end
end
The code above initializes the weights of the connections between layers, here from the first hidden layer to the second hidden layer and from the second hidden layer to the output layer.
for i = 1:rb
    for j = 1:cb
        [r, c] = size(net.b{i,j});
        for m = 1:r
            for n = 1:c
                net.b{i,j}(m,n) = -.05 + (m*c + n) * (.1/(r*c));
            end
        end
    end
end
The code above initializes the biases of all layers. The network is now created and initialized, so we can train it with the ‘train’ function. We call the train function iteratively, and after each iteration we measure the performance of the network with a confusion matrix.
[net, train_rec, train_output, train_error] = train(net, f257_train_input, f257_train_output);
The ‘train’ function returns the network object, the training record, the training outputs, and the training error. The training output contains the network's current prediction for each input sequence. We obtain the confusion matrix from the ‘confusion’ function, which takes the actual outputs and the predicted outputs as input and returns a 2x2 matrix that we index linearly as four elements. The first element is the number of sequences the network correctly predicted as pseudo hairpins, the second is the number it incorrectly predicted as pre-miRNAs, the third is the number it incorrectly predicted as pseudo hairpins, and the fourth is the number it correctly predicted as pre-miRNAs.
[c, cm, ind, per] = confusion(f257_train_output, train_output);
Here cm is the confusion matrix returned by the ‘confusion’ function. We define the performance of the network's training as its accuracy, i.e. the rate of correct predictions of both pre-miRNAs and pseudo hairpins.
train_perf = ((cm(1) + cm(4)) / sum(sum(cm))) * 100;
train_perf gives the accuracy as a percentage. The equation for accuracy is discussed in detail in earlier chapters of this report.
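If sensitivity and specificity are also wanted at this point, they can be computed from the same confusion matrix. This is a small sketch that assumes the element ordering described above (cm(1) = TN, cm(2) = FP, cm(3) = FN, cm(4) = TP); train_sens and train_spec are names introduced here for illustration.

% Extract the four counts from the confusion matrix (assumed ordering) and
% compute sensitivity and specificity as percentages.
TN = cm(1); FP = cm(2); FN = cm(3); TP = cm(4);
train_sens = (TP / (TP + FN)) * 100;   % sensitivity (SE)
train_spec = (TN / (TN + FP)) * 100;   % specificity (SP)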
Now that the network is trained, we need to test it to see how it performs on unseen data. We use the ‘sim’ function to simulate the network on the test set.
test_output = sim(net, f257_test_input);
Here f257_test_input is the test input set and test_output is the network's prediction for each test sequence. Just as in training, we obtain a confusion matrix by calling the ‘confusion’ function.
[c, cm, ind, per] = confusion(f257_test_output, test_output);
test_perf = ((cm(1) + cm(4)) / sum(sum(cm))) * 100;
From the confusion matrix we get the accuracy of the network on the test set. If the accuracy meets our standard, we stop iterating and save the current state of the network for future predictions.
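Putting these steps together, the manual epoch-by-epoch procedure described above could be written as the loop below. This is only a sketch under stated assumptions: target_accuracy, max_rounds, and the file name trained_net.mat are illustrative and not part of the original code.

% Sketch of the manual training loop: train one iteration at a time
% (max_epoc is 1), measure training accuracy with the confusion matrix,
% and stop once the target accuracy is reached.
target_accuracy = 99;   % required training accuracy, in percent (assumed)
max_rounds = 500;       % safety limit on the number of manual iterations

for k = 1:max_rounds
    [net, train_rec, train_output, train_error] = ...
        train(net, f257_train_input, f257_train_output);
    [c, cm, ind, per] = confusion(f257_train_output, train_output);
    train_perf = ((cm(1) + cm(4)) / sum(sum(cm))) * 100;
    if train_perf >= target_accuracy
        break;   % keep the current state of the network
    end
end

% Save the trained network for future predictions (file name is illustrative).
save('trained_net.mat', 'net');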
Results and Discussion
To measure the performance of our system we used the confusion matrix. There are four terms associated with a confusion matrix:

True positive (TP) = a pre-miRNA correctly predicted as a pre-miRNA.

True negative (TN) = a pseudo hairpin correctly predicted as a pseudo hairpin.

False positive (FP) = a pseudo hairpin incorrectly predicted as a pre-miRNA.

False negative (FN) = a pre-miRNA incorrectly predicted as a pseudo hairpin.
The total prediction accuracy (ACC), sensitivity (SE), and specificity (SP) of the prediction system are given by:

ACC = ((TP + TN) / (TP + TN + FP + FN)) * 100

SP = (TN / (TN + FP)) * 100

SE = (TP / (TP + FN)) * 100
Human Performance
For the human data set, our system achieved:

True positive predictions (TP) = 698

True negative predictions (TN) = 693

False positive predictions (FP) = 2

False negative predictions (FN) = 7

Sensitivity (SE) = 99.01%

Specificity (SP) = 99.71%

Accuracy (ACC) = 99.36%
|                        | Predicted Negative (ANN) | Predicted Positive (ANN) |
| Target (Actual) Negative | 693                      | 2                        |
| Target (Actual) Positive | 7                        | 698                      |

TABLE: Confusion matrix for the human data set
The Receiver Operating Characteristic (ROC) curve is a plot of the true positive rate (sensitivity) against the false positive rate (1 - specificity).
Figure: Receiver Operating Characteristic (ROC) curve
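The Neural Network Toolbox can produce this plot directly. The following is a minimal sketch, assuming the f257_test_output targets and test_output predictions from the testing step above:

% Plot the ROC curve for the test targets against the network's continuous
% outputs; roc returns the underlying true and false positive rates.
plotroc(f257_test_output, test_output);
[tpr, fpr, thresholds] = roc(f257_test_output, test_output);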
Cross-species Performance
We evaluated 20 species, including human, for cross-species testing. We trained on the human dataset and tested each species with its own positive dataset together with the human negative dataset. We used miRBase release 17 for the positive sets. Our overall cross-species prediction accuracy is 91.33%, which shows a strong resemblance between the miRNAs of those species and human miRNAs. The table below gives the per-species results.
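A loop along the following lines can be used to produce such per-species counts; it is only a sketch, and species_names and species_inputs are hypothetical cell arrays (one entry per species) that are not part of the original code.

% Simulate the human-trained network on each species' positive set and count
% how many known pre-miRNAs are predicted as positive (%Found).
for s = 1:numel(species_names)
    out = sim(net, species_inputs{s});      % continuous network outputs
    found = sum(out >= 0.5);                % sequences predicted as pre-miRNA
    total = size(species_inputs{s}, 2);     % one column per sequence
    fprintf('%s: %d / %d found (%.2f%%)\n', ...
            species_names{s}, found, total, 100 * found / total);
end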
| Species                  | Total miRNA | Found | Not Found | %Found    |
| Homo sapiens             | 1424        | 1415  | 9         | 99.357143 |
| Bos taurus               | 662         | 648   | 14        | 97.870778 |
| Pongo pygmaeus           | 581         | 563   | 18        | 96.955504 |
| Pan troglodytes          | 600         | 557   | 43        | 92.763664 |
| Oryza sativa             | 491         | 394   | 97        | 80.268682 |
| Mus musculus             | 720         | 699   | 21        | 97.112676 |
| Gallus gallus            | 499         | 447   | 52        | 89.658048 |
| Macaca mulatta           | 479         | 449   | 30        | 93.808312 |
| Canis familiaris         | 323         | 281   | 42        | 86.999022 |
| Taeniopygia guttata      | 220         | 206   | 14        | 93.804348 |
| Schmidtea mediterranea   | 148         | 139   | 9         | 93.750000 |
| Ciona intestinalis       | 331         | 259   | 72        | 78.176528 |
| Physcomitrella patens    | 229         | 217   | 12        | 94.940797 |
| Bombyx mori              | 487         | 475   | 12        | 97.556866 |
| Arabidopsis thaliana     | 232         | 196   | 36        | 84.334764 |
| Danio rerio              | 358         | 354   | 4         | 98.865784 |
| Rattus norvegicus        | 408         | 307   | 101       | 75.180505 |
| Drosophila pseudoobscura | 211         | 203   | 8         | 96.373626 |
| Medicago truncatula      | 375         | 302   | 73        | 80.651163 |
| Herpes simplex virus 2   | 18          | 17    | 1         | 98.189415 |
| Total                    | 8796        | 8128  | 668       | 91.330838 |

Table: Cross-species performance
Conclusion
In this work we tried to further improve on the earlier miRANN system. miRANN has several shortcomings: its cross-species performance was much lower, it did not have a good basis for choosing its feature set, the very large feature set made the system biased, and with 257 features its computation time was very long. We use genetic-algorithm-based feature selection for our feature set, we achieve very good cross-species performance, and our system is not biased by its feature set. Our ongoing work is a genome-wide search for miRNAs with the current system.