Saturday, January 9, 2016

An improved approach for Classification of Precursor microRNA using Artificial Neural Network.

   

                       TAHLIL AHMED CHOWDHURY

                                         &

                            MD. NAZMUL ALAM

  

                               ABSTRACT

MicroRNAs (miRNAs) are a group of short (~22 nt) non-coding RNAs that play important regulatory roles .The miRNAs are transcribed as long primary miRNAs, which are processed into 60–70nt miRNA precursors (pre-miRNAs) by nuclear RNaseIII Drosha. The pre-miRNAs are then cleaved into 22nt mature miRNAs. MiRNA precursors  are characterized by their hairpin structures. However, a large amount of similar hairpins can be folded in many genomes.

There are lots of computational approach to classify pseudo hairpin and pre-miRNAs. As computational prediction models, generally support-vector-machine, Gaussian/kernel density estimator, hidden Markov model etc are commonly used. Now a days another computational model Artificial Neural Network a renowned supervised approach of machine learning algorithm for pattern recognition has been extensively used in the prediction field. In this study we will describe an approach which uses artificial neural network for improved classification of pseudo hairpin and pre-miRNAs.


                  Artificial Neural Network Architecture

Feedforward network is used for the architecture of ANN. It has 3 layers. 2 hidden layers and 1 output layer. Though it has an input layer but it is not counted as a layer. We used 20 neurons in each hidden layer. There are 22 neurons in input layers which is equal to the number of features used in out system. It has one neuron in output layer. Every layer is internally connected to each other and directly connected only to its adjacent layer.


                             Input And output 

 We used 21 features as input for our network. This 21 featuers are scaled value from 0 to 1. The output of our network is binary value. Which is either 0 or 1. 1 means the inputs set is a pre-miRNA sequence. 0 means the input sequence is a pseudo hairpin.

      Training Algorithm, Transfer, Learning Function

Determining which training algorithm , transfer function and learning function to use is a very difficult task. There are 19 different training algorithms, 13 transfer functions and 15 learning functions. Testing all the 19*13*15 combination is an impossible with out current processing capability. So we tested some of the combinations. In our system we used Quasi-Newton BFGS algorithm known as trainbfs for training algorithm , logsig transfer function for first hidden layer and purelin for second hidden layer, learnlv1 for learning function.


                            Weight Initialization

Random weight initialization gives different output with same configuration and input while testing each time. So we needded to use a static initialization method. We used the followingequation found from miRANN system which initialize the weights of each layer to a limit  [-.05,0.5]. Closer connections has a very low differences in their weights for this equation.

W(i,j) = -0.5 + (i*c+j)*(.1/(r*c))

Where W(i,j) is the weight of j’th neruon of i’th layer, r is the total number of neurons in that Layer and c is the total number of inputs.


                    MATLAB Neural Network Toolbox

There are lots of standard network codes available for training. But for our work we used MATLAB Neural Network Toolbox. It’s very efficient and all the complex functions could be accessed easily.


                          Training The Network

To train the network first we have to load the input and corresponding output sets as it is a supervised learning. We use the load function for this. We can use load function like this,

f257_train_input=transpose(load('E:\MIRNA\NegetiveSet\TrainSet\Input_data'));

Then we define some parameters transfer function, train function, learn function, hidden layer Size means number of neurons in each hidden layer, maximum iteration and validation set size.

transfer_func = {'logsig' 'purelin'};

            train_func = 'trainbfg';

             learn_func = 'learnlv1';

            layer_size = [20 20];

            max_epoc = 1;             (Number of iteration)

max_validation = 10;

Here max_epoc is 1 because we train the network repeatedly manually. max_epoc=1 means it will automatically run only 1 iteration.  We did this because after each iteration we see if the networks performance is good enough.  Here trainbfg is the training function for the network. learnlv1 is the learning function used for the network. logsig and purelin are the transfer function in first and second hidden layer correspondingly.



Now the function newff creates a feedforward backpropagation network. This function requires three arguments which are, train input, train output, layer size. We can give additional parameters to it for running exactly we want.  We can give the parameters transfer function, train function , learning function.

net = newff( f257_train_input, f257_train_output, layer_size, transfer_func, train_func, learn_func);

Here net is the network object returned by newff. Here f257_train_input is a set of sequence of both pseudo and pre-miRNAs which we obtained from mirbase release 17. f257_train_output is the corresponding output for each input sequence as it is a supervised learning. This command creates the network object net and also initializes the weights and biases of the network; therefore the network is ready for training. For re initializing the weights, or to perform the custom initialization we used our equation for generating the values and assigned to the net.IW, net.LW, net.b.



[riw ciw] = size(net.IW)    [rlw clw] = size(net.LW)    [rb cb] = size(net.b)

    for i=1:riw

        for j=1:ciw

            [r c] = size(net.IW{i,j});    

            for m=1:r

                for n=1:c

                    net.IW{i,j}(m,n) =-.05+(m*c+n)*(.1/(r*c));%(1/(i+j+m+n))-(1/((rb+cb+r+c)/2));

                end

            end

        end

    end

The code above initializes the weights of the synapses from input layer to other layer, here to only first hidden layer.

    for i=1:rlw

        for j=1:clw

            [r c] = size(net.LW{i,j});

            for m=1:r

                for n=1:c

                    net.LW{i,j}(m,n) = -.05+(m*c+n)*(.1/(r*c));%(1/(i+j+m+n))-(1/((rb+cb+r+c)/2));

                end

            end

        end

    end

The code above initializes the weights of the synapses from each hidden layer to other layers, here to second hidden layer and from second hidden layer to output layer.

    for i=1:rb

        for j=1:cb

            [r c] = size(net.b{i,j});

            for m=1:r

                for n=1:c

                    net.b{i,j}(m,n) = -.05+(m*c+n)*(.1/(r*c));%(1/(i+j+m+n))-(1/((rb+cb+r+c)/2));

                end

            end

        end

    end



The code above initializes the weights of the synapses of the biases of all layers. Now the network is created and initialized.  Now we will train the network with ‘train’ function. We call the train function iteratively and after each iteration we measure the performance of the network with confusion matrix. 

[net, train_rec, train_output, train_error] = train (net, f257_train_input, f257_train_output);

‘train’ function returns the object of the network, train output and train error. Train output includes current prediction of the input sequences of the network. We get the confusion matrix from the ‘confusion’ function. It takes the train output and actual output as input and returns a 1*4 matrix. The value on the first element is the number of sequences the network predicted correctly as pseudo hairpin. The value on the second element is the number of sequences the network incorrectly predicted as pre-miRNA. The value of the third element defines the number of the sequences the network incorrectly predicted as pseudo hairpin. The last element of the matrix denotes the number of sequences the network correctly defined as pre-miRNA. 

[c, cm, ind, per] = confusion (f257_train_output, train_output);

Here cm is the confusion matrix returned by ‘confusion’ function.We define the performance of the network’s training by the accuracy of the network which means the rate of correct predictions of both pre-miRNA and pseudo hairpin.

train_perf = ((cm(1)+cm(4))/sum(sum(cm)))*100;

train_perf defines the accuracy as percentage. The equation for accuracy is elaborately discussed in previous chapters of this report.

Now that the network is trained we need to test it to see how it actually works on unknown data. We use ‘sim’ function to simulate the network for testing set.

test_output = sim(net, f257_test_input);

f257_test_input is the testing input sequence. test_output denotes the prediction output for the input sequences. Just like the training we also find the confusion matrix by calling ‘confusion’ function and getting confusion matrix. 

[c, cm, ind, per] = confusion(f257_test_output, test_output);

test_perf = ((cm(1)+cm(4))/sum(sum(cm)))*100

From the confusion matrix we get the accuracy of the network. If the accuracy is standard we stop the iteration and the current state of the network is saved for future predictions.

                     Result And Discussion

To measure the performance of our system we used confusion matrix. There are 4 terms associated with confusion matrix,

True positive (TP) = Correct prediction of pre-miRNA.

True positive (TN) = Correct prediction of pseudo hairpin.

False positive (FP) = Incorrect prediction of pseudo hairpin.

False negative(FN) = Incorrect prediction of pre-miRNA.

The total prediction accuracy (ACC), Specificity (SE), Sensitivity (SP) of the prediction system are given by:

ACC = ( (TP+TN) / (TP+TN+FP+FN) ) *100

SP = (TN / (TN+FP) )*100

SE = (TP / (TP+FN) )*100

                          Human Performance

In our system for human data set,

True Positive prediction (TP) = 698

True Negative prediction (TN) = 693

False Positive prediction (FP) = 2

False Negative prediction (FN) = 7

Sensitivity ( SE ) = 99.01%

Specificity ( SP ) = 99.71%

Accuracy (ACC) = 99.357



Prediction (from ANN)


       Negative
        Positive

Target (Actual)

Negative

       693

         7

Positive
     
       2


         698

                             TABLE : Confusion Matrix


      Receiver Operating Characteristic (ROC) Graph


Receiver Operator Characteristic (ROC) curve which is a plot of the true positive rate (sensitivity) against the false positive rate (specificity).



Figure : Receiver Operator Characteristic (ROC) curve








                      Cross-species Performance

We evaluated 20 species including human for cross-species testing. We trained with human dataset and tested with positive corresponding species dataset and for negative we used human dataset. We used mirbase release 17 for the positive sets. Our prediction accuracy for cross-species 91.33% which shows that there is strong resembles in those species miRNAs and human miRNAs.





Species
Total miRNA
      Found
  Not Found
                 %Found
Homo sapiens
1424
1415
9
99.357143
Bos Taurus
662
648
14
97.870778
Pongo pygmaeus
581
563
18
96.955504
Pan troglodytes
600
557
43
92.763664
Oryza sativa
491
394
97
80.268682
Mus musculus
720
699
21
97.112676
Gallus gallus
499
447
52
89.658048
Macaca mulatta
479
449
30
93.808312
Canis familiaris
323
281
42
86.999022
Taeniopygia guttata
220
206
14
93.804348
Schmidtea mediterranea
148
139
9
93.750000
Ciona intestinalis
331
259
72
78.176528
Physcomitrella patens
229
217
12
94.940797
Bombyx mori
487
475
12
97.556866
Arabidopsis thaliana
232
196
36
84.334764
Danio rerio
358
354
4
98.865784
Rattus norvegicus
408
307
101
75.180505
Drosophila pseudoobscura
211
203
8
96.373626
Medicago truncatula
375
302
73
80.651163
Herpes Simplex Virus 2
18
17
1
98.189415
Total
8796
8128
668
91.330838

Table: Cross Species Performance





                             Conclusion

In this work we tried to further improve previous work of miRANN. miRANN has some shortcoming which are, its cross species performance were much lower, it did not have good basis for choosing its feature set, due to huge amount of feature set the system was biased, and for having 257 feature set the computation time for the system was very lengthy. We use genetic algorithm based feature selection for our feature set, and we have a very good cross-species performance and our system is not biased due to feature set. Our ongoing work is whole genome wide search for miRNA with our current system.