
Maximizing lipocalin prediction through balanced and diversified training set and decision fusion

Paper ID: 14977
Volume ID: 1365
Publish Year: 2015
Pages: 10
File Format: PDF
Full-Text: Available
Abstract

• Unsupervised K-means preprocessing for balancing and diversifying the training set.
• Enhanced classification of lipocalins by fusion of classifiers.
• Superior generalization on blind testing data sets.
• ReliefF-based feature ranking.

Lipocalins are short in sequence length and perform several important biological functions. These proteins share less than 20% sequence similarity among paralogs. Identifying them experimentally is expensive and time-consuming. Computational methods that assign putative members to this family by sequence similarity also fall short, because of the low sequence similarity among its members. Consequently, machine learning methods that take sequence-derived or structure-derived features as input become a viable alternative for predicting them. Ideally, any machine-learning-based prediction method must be trained with all possible variations in the input feature vector (all the sub-class input patterns) to achieve perfect learning. Near-perfect learning can be achieved by training the model with diverse input instances drawn from different regions of the entire input space. Furthermore, prediction performance can be improved by balancing the training set, because imbalanced data sets tend to bias predictions towards the majority class and its sub-classes. This paper aims to achieve (i) high generalization ability without classification bias through diversified and balanced training sets and (ii) enhanced prediction accuracy by combining the results of individual classifiers with an appropriate fusion scheme. Instead of creating the training set randomly, we first used the unsupervised K-means clustering algorithm to group the input patterns into diverse clusters and then built the diversified and balanced training set by selecting an equal number of patterns from each of these clusters.
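
The cluster-then-sample idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each class is clustered separately, that clustering uses plain Lloyd's K-means with Euclidean distance, and that the same number of patterns is drawn at random from every cluster. The function names, the parameters `k` and `per_cluster`, and the toy data are all hypothetical. In practice a library implementation such as scikit-learn's `KMeans` would normally replace the hand-written clustering step.

```python
import math
import random


def kmeans_labels(points, k, n_iter=100, seed=0):
    """Plain Lloyd's K-means: return one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k of the points themselves
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def balanced_diverse_indices(X, y, k, per_cluster, seed=0):
    """Cluster each class separately (an assumption) and draw the same
    number of patterns from every cluster; returns sorted row indices."""
    rng = random.Random(seed)
    picked = []
    for cls in sorted(set(y)):
        idx = [i for i, label in enumerate(y) if label == cls]
        labels = kmeans_labels([X[i] for i in idx], k, seed=seed)
        for c in range(k):
            members = [i for i, lab in zip(idx, labels) if lab == c]
            # A cluster smaller than per_cluster contributes all it has.
            picked.extend(rng.sample(members, min(per_cluster, len(members))))
    return sorted(picked)


# Made-up toy data: 12 positives and 40 negatives, two blobs per class.
X = ([(0.1 * i, 0.0) for i in range(6)] + [(10 + 0.1 * i, 10.0) for i in range(6)]
     + [(0.0, 20 + 0.1 * i) for i in range(20)] + [(30.0, 0.1 * i) for i in range(20)])
y = [1] * 12 + [0] * 40

train_idx = balanced_diverse_indices(X, y, k=2, per_cluster=3)
print(len(train_idx), sum(y[i] for i in train_idx))
```

Although the toy data is imbalanced (12 positives against 40 negatives), both classes contribute the same number of training patterns, and every cluster within a class is represented. A purely random split would tend to reproduce the original imbalance instead.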
Finally, a probability-based classifier fusion scheme was applied to the boosted random forest algorithm (which produced greater sensitivity) and the K-nearest-neighbour algorithm (which produced greater specificity) to achieve better predictive performance than either base classifier alone. The models trained on the K-means-preprocessed training set perform far better than those trained on randomly generated training sets. The proposed method achieved a sensitivity of 90.6%, specificity of 91.4% and accuracy of 91.0% on the first test set, and a sensitivity of 92.9%, specificity of 96.2% and accuracy of 94.7% on the second blind test set. These results establish that diversifying the training set improves the performance of predictive models through superior generalization ability, and that balancing the training set improves prediction accuracy. For smaller data sets, unsupervised K-means-based sampling can be a more effective technique for improving generalization than the usual random splitting method.
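
The abstract does not say which probability-based fusion rule was used, so the sketch below shows two common candidates as assumptions: averaging the positive-class probabilities (mean rule) and multiplying them with renormalisation (product rule). It also shows how sensitivity, specificity and accuracy are computed from binary predictions. The function names, the 0.5 decision threshold, and the probability values are hypothetical; the numbers are made up purely to show the mechanics and are not results from the paper. With real models, the two probability lists would come from each fitted classifier's probability output.

```python
def fuse_probabilities(p_a, p_b, rule="mean"):
    """Fuse two classifiers' positive-class probabilities, sample by sample."""
    fused = []
    for a, b in zip(p_a, p_b):
        if rule == "mean":
            fused.append((a + b) / 2)
        elif rule == "product":
            pos, neg = a * b, (1 - a) * (1 - b)
            # Renormalise so the fused value is again a probability.
            fused.append(pos / (pos + neg) if pos + neg > 0 else 0.5)
        else:
            raise ValueError(f"unknown fusion rule: {rule}")
    return fused


def sensitivity_specificity_accuracy(y_true, y_pred):
    """Return (sensitivity, specificity, accuracy) for binary 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(y_true)


# Made-up probabilities for six samples (three positives, three negatives).
y_true = [1, 1, 1, 0, 0, 0]
p_forest = [0.9, 0.7, 0.8, 0.6, 0.2, 0.55]  # leans positive: more sensitive
p_knn = [0.8, 0.6, 0.3, 0.2, 0.1, 0.3]      # leans negative: more specific

fused = fuse_probabilities(p_forest, p_knn, rule="mean")
y_pred = [int(p >= 0.5) for p in fused]  # hypothetical 0.5 threshold
print(sensitivity_specificity_accuracy(y_true, y_pred))
```

In this toy case the forest-like scores alone misclassify two negatives and the KNN-like scores alone miss one positive, while the averaged scores fall on the correct side of the threshold for all six samples. This is the kind of complementarity that fusing a high-sensitivity and a high-specificity classifier relies on.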


Keywords
Lipocalins; Diverse input patterns; Balanced training set; Boosted random forest; KNN; Classifier fusion schemes
Publisher
Database: Elsevier - ScienceDirect
Journal: Computational Biology and Chemistry - Volume 59, Part A, December 2015, Pages 101–110
Subjects
Physical Sciences and Engineering; Chemical Engineering; Bioengineering