Article

Dimension Reduction and Classifier-Based Feature Selection for Oversampled Gene Expression Data and Cancer Classification

by Olutomilayo Olayemi Petinrin, Faisal Saeed, Naomie Salim, Muhammad Toseef, Zhe Liu and Ibukun Omotayo Muyide

1 Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
2 DAAI Research Group, Department of Computing and Data Science, School of Computing and Digital Technology, Birmingham City University, Birmingham B4 7XG, UK
3 UTM Big Data Centre, Ibnu Sina Institute for Scientific and Industrial Research, Universiti Teknologi Malaysia, Johor Bahru 81310, Johor, Malaysia
4 College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA
* Author to whom correspondence should be addressed.
Submission received: 16 May 2023 / Revised: 14 June 2023 / Accepted: 24 June 2023 / Published: 27 June 2023

Abstract

Gene expression data are known for having a large number of features, some of which are irrelevant and redundant. In other cases, however, all features, despite being numerous, are important and contribute to the analysis. Gene expression datasets also often have few instances and a high rate of class imbalance, which limits a classification model's exposure to instances of the different categories and can affect its performance. In this study, we propose a cancer detection approach that combines data preprocessing techniques, namely oversampling and feature selection, with classification models. SVMSMOTE was used for the oversampling of the six examined datasets. We then examined different techniques for feature reduction, using dimension reduction methods and classifier-based feature ranking and selection. Six machine learning algorithms were trained with repeated 5-fold cross-validation on the microarray datasets, and their performance differed based on the dataset and the feature reduction technique used.

1. Introduction

Gene expression is the process by which the information encoded in a gene is converted into gene products, including proteins, tRNA, rRNA, mRNA, or snRNA. With continued technological advances, gene expression has become increasingly important for health and life science applications [1]. Prognostic risk scores derived from gene expression have shown prominent clinical value as promising biomarkers. They can be used for the prediction of prognosis, including the identification of mortality and metastasis risks in patients, and to determine the response of patients to treatment [2]. Identifying the risk of cancer recurrence or metastasis in patients can help clinicians strategically recommend effective treatment. Furthermore, determining the response to treatment can indicate the overall survival of patients and guide the development of novel drugs or the choice of appropriate treatment based on each patient's classification. In the majority of cancer types, HLA gene expression has been associated with prolonged overall survival [3]. Likewise, an increase in the expression of Human Endogenous Retrovirus K mRNA in the blood is linked to the presence of breast cancer, indicating its potential as a biomarker [4]. BRCA2 is another gene whose expression is associated with highly proliferative and aggressive breast cancer: the higher the expression of BRCA2, the more aggressive the breast cancer [5], which indicates its potential as a biomarker for breast cancer. In essence, biological determinants such as predictive gene expression signatures can now be used for the effective classification of tumors according to their subgroup [6]. The profiling of gene expression for breast cancer or other cancer types can be further improved using clinicopathological and microenvironmental features [7,8,9].
The generation of data in the biomedical and health fields has increased rapidly, yielding samples with a high number of features [10,11]. The challenge with high-dimensional data is the difficulty of manual analysis and the redundancy of some of the features. Over the years, several studies have been carried out on feature selection. In [12], the authors utilized a hybrid filter and wrapper method for the selection of features in gene expression data. The authors also used LASSO, an embedded technique, and reported that the performance of the machine learning algorithms was better with the implementation of LASSO on the examined high-dimensional datasets. Townes et al. [13] implemented a simple multinomial method using generalized principal component analysis and carried out feature selection using deviance. Different combinations of methods were further compared with current methods to show their performance. Evolutionary algorithms have also been implemented in some studies to improve feature selection. Jain et al. [14] implemented a correlation-based feature selection method improved with binary particle swarm optimization to select genes before classifying the cancer types using the Naïve Bayes algorithm. This method improved the classifier's performance. In the same vein, Kabir et al. [15] compared two dimension reduction techniques, PCA and autoencoders, for the selection of features in a prostate cancer classification analysis. Two machine learning methods, neural networks and SVM, were further used for classification. The study showed that the classifiers performed better on the reduced dataset.
Prasad et al. [16] used a recursive particle swarm optimization technique integrated with filter-based methods for ranking and also reported improved performance on five datasets. Another similar approach for gene selection involves the hybridization of ant colony optimization and cellular learning automata; based on the ROC curve evaluation of three classifiers, the proposed method selected the minimum number of genes needed for maximum classifier performance [17]. Similarly, Alhenawi et al. [18] proposed a hybrid feature selection technique for microarray data using Hill Climbing, the Novel LS algorithm, and Tabu search, which is similar to the filter-wrapper and embedded technique utilized for gene expression data in [12]. Keshta et al. [19] proposed a multi-stage algorithm for the extraction and selection of features in a cancer detection study and reported that, despite the reduction in the number of features used for classification, the performance of the classifiers was either enhanced or unchanged. In addition, a nested genetic algorithm consisting of an outer and an inner genetic algorithm has previously been implemented for gene selection on colon and lung datasets using 5-fold cross-validation [20], and a significant increase in classification accuracy was reported. Several other feature/gene selection techniques are being improved and implemented to increase the accuracy of cancer classification, including robust linear discriminant analysis [21], adaptive principal component analysis [22], and deep variational autoencoders, especially in studies that involve deep learning [23].
In this study, we considered the problem of imbalanced data, which is common in health data, before using dimension reduction techniques, namely principal component analysis (PCA), truncated singular value decomposition (TSVD), and t-distributed stochastic neighbor embedding (TSNE), to address the high dimensionality characteristic of gene expression data. We also utilized the ability of some machine learning algorithms to rank genes and make selections based on a specified threshold.

2. Materials and Methods

2.1. Dataset Description

The six gene expression datasets used in this study are the brain, colon, leukemia, lymphoma, prostate, and small blue round cell tumor (SBRCT) datasets, as explained in our previous work [24] and shown in Table 1. Forty-two (42) patient samples make up the brain cancer microarray dataset. The tumors include 10 medulloblastomas, 5 atypical teratoid/rhabdoid tumors of the central nervous system (CNS), 5 rhabdoid tumors of the renal and extrarenal organs, 8 supratentorial primitive neuroectodermal tumors (PNETs), 10 non-embryonal brain tumors, and 4 normal human cerebella. The original oligonucleotide microarrays contained 6817 genes, which underwent thresholding during pre-processing by [25]; the resulting dataset, with five distinct sample classes, contains 5597 genes. Alon et al. [26] conducted the initial analysis of the colon cancer microarray dataset, and the raw data from the Affymetrix oligonucleotide arrays were processed by the dataset's original authors. The dataset consists of samples of both normal and tumor tissue. The total number of samples is 62, and the dataset contains 2000 genes after the pre-processing reported by earlier authors [27,28]. The leukemia cancer dataset was created from a gene expression study of two kinds of acute leukemia, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). Gene expression levels were determined with Affymetrix high-density oligonucleotide arrays, which contained 6817 genes but were reduced to 3051 genes and further analyzed by [29]. The dataset comprises 47 cases of ALL (38 B-cell ALL and 9 T-cell ALL) and 25 cases of AML. Dudoit et al. [30] performed further pre-processing on the dataset, which can be acquired from [27,28].
The lymphoma microarray dataset is found in [25]. It comprises 62 samples and 4026 genes. The samples come from three distinct adult lymphoid malignancies: 42 samples represent diffuse large B-cell lymphoma (DLBCL), 9 samples come from follicular lymphoma (FL), and 11 samples come from chronic lymphocytic leukemia (CLL). The dataset can also be found in the literature [27,28]. The prostate cancer dataset contains 102 gene expression patterns: 50 of the samples are normal prostate specimens, while the remaining 52 are tumors. This microarray dataset is based on an oligonucleotide microarray with about 12,600 genes, and it still contains 6033 genes after pre-processing [31]. The small blue round cell tumor (SBRCT) microarray dataset has four distinct classes and initially had 6567 genes and 63 samples: 8 samples come from NHL, 12 from NB, 20 from RMS, and 23 from EWS. The number of genes decreased to 2308 after pre-processing. This dataset was produced using [32] and is available in [27,28].

2.2. Methods

From the information about the datasets contained in Table 1, there is a clear indication of the high dimensionality of the datasets. This high dimensionality motivates the feature selection and dimension reduction methods used in the study. Before analysis, we normalized the data using the min-max normalization method, given by $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$, where $x$ is a value in a feature column, $x_{\min}$ is the minimum value in the column, and $x_{\max}$ is the maximum value in the column. Because the analysis deals with the selection of features, it is crucial to scale the features so that no feature contributes more to the analysis than another simply because of its range. We also note the class imbalance of the data for further analysis.
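As a minimal sketch of this preprocessing step (the DataFrame and gene names below are illustrative stand-ins, not the study data), min-max normalization can be applied per feature with scikit-learn's MinMaxScaler, which implements the same formula:

```python
# Minimal sketch of the min-max normalization step on stand-in data.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

X = pd.DataFrame({"gene_1": [2.0, 5.0, 9.0],
                  "gene_2": [100.0, 250.0, 400.0]})

# MinMaxScaler rescales each column to [0, 1] via (x - x_min) / (x_max - x_min)
scaler = MinMaxScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
print(X_scaled)
```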
For health-related data, it is essential to deal with class imbalance in order to avoid a misleading evaluation, especially when accuracy is used. Accordingly, oversampling was carried out on each of the datasets using the SVMSMOTE technique [24]. This technique uses a support vector machine to generate new synthetic samples around the borderline. In a support vector machine, the borderline, or margin, is crucial for establishing the decision boundary. SVMSMOTE therefore focuses on instances of the minority class found along the borderline and generates new minority-class samples where there are fewer instances of the majority class. In this study, the nearest neighbor parameter was set to 3 and the random state to 42.
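A possible realization of this oversampling step is sketched below with imbalanced-learn's SVMSMOTE, under the assumption that the "nearest neighbor parameter" mentioned above corresponds to k_neighbors; the data are a synthetic stand-in for the expression matrices.

```python
# Sketch of SVMSMOTE oversampling on synthetic imbalanced data
# (k_neighbors=3 and random_state=42 follow the text).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SVMSMOTE

X, y = make_classification(n_samples=100, n_features=200, n_informative=20,
                           weights=[0.8, 0.2], random_state=42)

sampler = SVMSMOTE(k_neighbors=3, random_state=42)
X_res, y_res = sampler.fit_resample(X, y)

print("class counts before:", Counter(y))
print("class counts after :", Counter(y_res))
```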
Two categories of feature reduction methods were considered in this study. The first entails the use of dimension reduction methods, and the second entails the use of classifiers for feature ranking and selection. The dimension reduction methods used are principal component analysis (PCA), truncated singular value decomposition (TSVD), and t-distributed stochastic neighbor embedding (TSNE), while the classifiers used for ranking are random forest (RF) and logistic regression (LR). PCA tries to retain as much of the data variance as possible. Basis vectors, known as principal components, are chosen along the directions of maximum variance of the data, with the goal of minimizing the reconstruction error over all the data points; to do this, the eigenvectors with the largest eigenvalues are selected first. TSVD uses a matrix factorization technique similar to PCA; however, TSVD is performed on the data matrix, while PCA is performed on the covariance matrix. The name "truncated" comes from the fact that the number of output columns equals the truncation level, that is, matrices with a specified number of columns are produced. The primary goal of TSNE is to preserve pairwise similarities as much as possible. Like PCA, it does not preserve the original input features, which makes it useful for visualization and exploration. It treats similarities in the original space as probabilities and finds the embedding that preserves this probability structure, using the Kullback-Leibler (KL) divergence as a measure of the difference between the two probability distributions.
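The three reducers can be instantiated in scikit-learn roughly as follows (a sketch on stand-in data; the PCA and TSNE settings follow the parameters reported in Section 3.2, while the TSVD component count here is chosen arbitrarily for illustration):

```python
# Sketch of the three dimension reduction methods on stand-in data.
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
X = rng.random((120, 2000))  # stand-in for an oversampled expression matrix

# PCA: keep enough principal components to explain 99% of the variance
X_pca = PCA(n_components=0.99).fit_transform(X)

# Truncated SVD: factorizes the data matrix directly; the output has exactly
# as many columns as the chosen truncation level (50 here, for illustration)
X_tsvd = TruncatedSVD(n_components=50, random_state=42).fit_transform(X)

# t-SNE: preserves pairwise similarities (KL divergence between probability
# distributions); mainly useful for visualization and exploration
X_tsne = TSNE(n_components=3, perplexity=50, random_state=42).fit_transform(X)

print(X_pca.shape, X_tsvd.shape, X_tsne.shape)
```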
All examined models went through a 5-fold cross-validation repeated for a total of five replicates and were evaluated based on four different metrics: accuracy, precision, recall, and F1-score. These evaluation criteria are calculated based on Equations (1)–(4) respectively.
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{1}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$
where TP, FP, TN, and FN are true positive, false positive, true negative, and false negative, respectively.
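For reference, scikit-learn's metric functions reproduce Equations (1)-(4); the small sketch below uses hypothetical labels chosen so that TP = 3, FP = 1, TN = 3, FN = 1.

```python
# Sketch relating Equations (1)-(4) to scikit-learn's metric functions.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # hypothetical ground truth
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]  # hypothetical predictions (TP=3, FP=1, TN=3, FN=1)

print("accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / (TP + FP + TN + FN) = 0.75
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP) = 0.75
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN) = 0.75
print("f1       :", f1_score(y_true, y_pred))         # 2PR / (P + R) = 0.75
```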
We implemented all analyses in this study on a Windows 10 64-bit operating system on an x64-based computer with 64 GB of RAM and an Intel(R) Core(TM) i7-9700K CPU @ 3.6 GHz. Python 3.6 and libraries including scikit-learn were used for the analyses. Figure 1 shows the procedure for the study analysis.

3. Results and Discussions

For all analyses, we used repeated 5-fold cross-validation, with the shuffle parameter set to true; the procedure was repeated five times to generate 25 results per evaluation metric. The mean of each analysis is reported as the result, while a full report of the results with standard deviations is given in the Supplementary Material. We examined the different feature reduction techniques using six different classifiers: logistic regression (LR), random forest (RF), support vector machine (SVM), gradient boosting classifier (GBC), Gaussian Naïve Bayes (GNB), and k-nearest neighbor (KNN). These are all commonly used classifiers with different behaviors, which informed their choice for the analyses. For the logistic regression, the maximum number of iterations was set to 500 and liblinear was selected as the solver. Similarly, for the random forest and gradient boosting classifiers, the number of estimators was 500. For the support vector machine, Gaussian Naïve Bayes, and k-nearest neighbor, the default scikit-learn parameters were used.
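A sketch of this evaluation protocol is given below. The classifier parameters follow the text; the data are a synthetic stand-in, RepeatedKFold reproduces the shuffled, five-times-repeated 5-fold split, and macro-averaged scorers are one reasonable way to obtain precision, recall, and F1 for the multi-class datasets.

```python
# Sketch of repeated 5-fold cross-validation over the six classifiers,
# using stand-in data. Parameter values follow the text; everything else
# is left at scikit-learn defaults.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, n_features=500, n_informative=30,
                           random_state=42)

models = {
    "LR": LogisticRegression(max_iter=500, solver="liblinear"),
    "RF": RandomForestClassifier(n_estimators=500, random_state=42),
    "SVM": SVC(),
    "GBC": GradientBoostingClassifier(n_estimators=500, random_state=42),
    "GNB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
}

cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)  # 25 folds in total
scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]

for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    print(name, round(float(np.mean(scores["test_accuracy"])), 4))
```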

3.1. Performance of Classification Methods after Oversampling

Firstly, we carried out the analysis on all the datasets without any prior oversampling or reduction; the only preprocessing technique applied was normalization. We then compared the results with those generated after the SVMSMOTE oversampling technique was used. The results from the two analyses are shown in Table 2, Table 3, Table 4 and Table 5, with the better result highlighted. From the tables, we deduce that the majority of the analyses with oversampling perform better than those without oversampling. A key benefit of oversampling is that it allows the model to be exposed to, and trained on, a balanced number of majority and minority samples. For the lymphoma dataset, the results are consistent between the original dataset and the oversampled dataset when using random forest and support vector machines.

3.2. Performance of Classification Methods Based on Dimension Reduction Techniques

Because the models performed better with the oversampled data in Table 2, Table 3, Table 4 and Table 5, the same data were used for the rest of the analyses. In Table 6, Table 7, Table 8 and Table 9, we compare the performance of the trained models based on three dimension reduction methods, namely principal component analysis (PCA), truncated singular value decomposition (TSVD), and t-distributed stochastic neighbor embedding (TSNE). The parameters of PCA and TSVD were set so that the cumulative explained variance of the chosen components was 0.99. For TSNE, the number of components was set to 3 and the perplexity to 50. From the results shown in Table 6, Table 7, Table 8 and Table 9, the performance of PCA and TSVD is relatively similar, although in many cases PCA performed better. In the majority of the analyses, the performance of the PCA- or TSVD-reduced analyses was better than that of the classifiers before reduction. TSNE, on the other hand, performed poorly for all the datasets and classifiers examined. We attribute this to the fact that TSNE is stochastic and attempts to retain only the similarity structure of neighboring points, that is, local structure. In many cases, TSNE is used only for data visualization, especially 2D or 3D visualization, while retaining this local structure.
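Note that scikit-learn's PCA accepts the 0.99 variance fraction directly, whereas TruncatedSVD takes an explicit component count, so reaching the same cumulative explained variance requires choosing the truncation level manually. A hedged sketch of one way to do this on stand-in data:

```python
# Sketch: choose the TSVD truncation level so that the retained components
# reach 0.99 cumulative explained variance (stand-in data).
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(42)
X = rng.random((120, 2000))  # stand-in for an oversampled expression matrix

# Fit with the maximum sensible number of components, then find the cutoff
probe = TruncatedSVD(n_components=min(X.shape) - 1, random_state=42).fit(X)
cumvar = np.cumsum(probe.explained_variance_ratio_)
k = int(np.searchsorted(cumvar, 0.99)) + 1  # smallest k reaching 0.99
k = min(k, len(cumvar))                     # guard if 0.99 is never reached

X_tsvd = TruncatedSVD(n_components=k, random_state=42).fit_transform(X)
print("components kept:", k, "reduced shape:", X_tsvd.shape)
```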

3.3. Performance of Classification Methods Based on Classifier-Based Gene Ranking and Selection

Furthermore, we used two classifiers (random forest and logistic regression) for feature ranking and selection, again using the oversampled datasets. For the feature selection based on random forest, we used 500 estimators and selected only the features whose importance was above the mean threshold. For the feature selection based on logistic regression, only the features above the median threshold were selected. We employed different thresholds for the two techniques to determine whether the feature selection threshold would lead to a significant difference in classifier performance. For the lymphoma dataset, both random forest and logistic regression gave the same performance across evaluations. This was similarly observed with the small blue round cell tumor (SBRCT) dataset, except when using the gradient boosting classifier and k-nearest neighbor. Overall, Table 10, Table 11, Table 12 and Table 13 show that both classifiers performed well in the ranking and selection of features, despite utilizing different threshold strategies.
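One way to realize this ranking-and-selection scheme is with scikit-learn's SelectFromModel, which supports "mean" and "median" importance thresholds directly; the sketch below uses stand-in data and the estimator settings quoted above.

```python
# Sketch of classifier-based feature ranking and selection with SelectFromModel:
# mean threshold on random forest importances, median threshold on logistic
# regression coefficients (stand-in data).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=120, n_features=2000, n_informative=30,
                           random_state=42)

rf_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=500, random_state=42), threshold="mean")
lr_selector = SelectFromModel(
    LogisticRegression(max_iter=500, solver="liblinear"), threshold="median")

X_rf = rf_selector.fit_transform(X, y)
X_lr = lr_selector.fit_transform(X, y)
print("features kept by RF ranking:", X_rf.shape[1])
print("features kept by LR ranking:", X_lr.shape[1])
```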

4. Conclusions

Data generated in the medical and bioinformatics fields are known to have a large number of features. Analyzing these features manually is exhausting, and this is where machine learning tools come in handy. Another major issue with health data is the high rate of class imbalance, which sometimes affects the integrity and robustness of models built using such data; models may be unable to accurately predict the class of unseen instances in the minority class. In this study, we used SVMSMOTE to oversample the data and thereby increase the number of minority-class instances. We found that, in general, the performance of the examined models was better after oversampling. The application of dimension reduction techniques and feature ranking and selection further improved the performance in many instances. Principal component analysis and truncated singular value decomposition performed better than t-distributed stochastic neighbor embedding; TSNE, in fact, performed poorly overall, as shown in Table 6, Table 7, Table 8 and Table 9, and we advise using it primarily for dimension reduction aimed at data visualization. In the same vein, both random forest and logistic regression classifiers were effective in the selection of features despite utilizing different threshold criteria. Although a small number of analyses with the original datasets performed well, the majority of the analyses with SVMSMOTE and feature reduction performed better, saved time, and enhanced the interpretability of the models. Future work will investigate the analysis of gene expression data and cancer classification using different oversampling and dimension reduction methods on different microarray datasets.

Supplementary Materials

The following supporting information can be downloaded at: https://0-www-mdpi-com.brum.beds.ac.uk/article/10.3390/pr11071940/s1, Table S1: Comparison between performance with original dataset and performance after oversampling. Table S2: Comparison between performance of PCA, TSVD, and TSNE using oversampled dataset. Table S3: Comparison between performance of RF and LR using oversampled dataset.

Author Contributions

Conceptualization, O.O.P. and F.S.; methodology, O.O.P., F.S. and N.S.; software, O.O.P.; validation, M.T., Z.L. and I.O.M.; formal analysis, O.O.P., F.S., M.T. and Z.L.; investigation, M.T., Z.L. and I.O.M.; resources, N.S. and I.O.M.; data curation, O.O.P.; writing—original draft preparation, O.O.P., M.T., Z.L. and I.O.M.; writing—review and editing, O.O.P., F.S. and N.S.; visualization, O.O.P.; supervision, F.S. and N.S.; project administration, F.S.; funding acquisition, N.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Research Management Center at Universiti Teknologi Malaysia (Vot No: Q.J130000.21A6.00P48) and the Ministry of Higher Education, Malaysia (JPT(BKPI)1000/016/018/25(58)) through the Malaysia Big Data Research Excellence Consortium (BiDaREC) (Vot No: R.J130000.7851.4L933), (Vot No: R.J130000.7851.5F568), (Vot No: R.J130000.7851.4L942), (Vot No: R.J130000.7851.4L938), and (Vot No: R.J130000.7851.4L936). We are also grateful to (Project No: KHAS-KKP/2021/FTMK/C00003) and (Project No: KKP002-2021) for their financial support of this research.

Data Availability Statement

The datasets are available and can be downloaded from Microarray Datasets: https://csse.szu.edu.cn/staff/zhuzx/Datasets.html.

Acknowledgments

The authors would like to thank the Research Management Center at Universiti Teknologi Malaysia for funding this research using (Vot No: Q.J130000.21A6.00P48) and the Ministry of Higher Education, Malaysia (JPT(BKPI)1000/016/018/25(58)) through the Malaysia Big Data Research Excellence Consortium (BiDaREC) (Vot No: R.J130000.7851.4L933), (Vot No: R.J130000.7851.5F568), (Vot No: R.J130000.7851.4L942), (Vot No: R.J130000.7851.4L938), and (Vot No: R.J130000.7851.4L936). We are also grateful to (Project No: KHAS-KKP/2021/FTMK/C00003) and (Project No: KKP002-2021) for their financial support of this research.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Thakur, T.; Batra, I.; Luthra, M.; Vimal, S.; Dhiman, G.; Malik, A.; Shabaz, M. Gene expression-assisted cancer prediction techniques. J. Healthc. Eng. 2021, 2021, 4242646.
2. Ahluwalia, P.; Kolhe, R.; Gahlay, G.K. The clinical relevance of gene expression based prognostic signatures in colorectal cancer. Biochim. Biophys. Acta Rev. Cancer 2021, 1875, 188513.
3. Schaafsma, E.; Fugle, C.M.; Wang, X.; Cheng, C. Pan-cancer association of HLA gene expression with cancer prognosis and immunotherapy efficacy. Br. J. Cancer 2021, 125, 422–432.
4. Tourang, M.; Fang, L.; Zhong, Y.; Suthar, R.C. Association between Human Endogenous Retrovirus K gene expression and breast cancer. Cell. Mol. Biomed. Rep. 2021, 1, 7–13.
5. Satyananda, V.; Oshi, M.; Endo, I.; Takabe, K. High BRCA2 gene expression is associated with aggressive and highly proliferative breast cancer. Ann. Surg. Oncol. 2021, 28, 7356–7365.
6. Qian, Y.; Daza, J.; Itzel, T.; Betge, J.; Zhan, T.; Marmé, F.; Teufel, A. Prognostic cancer gene expression signatures: Current status and challenges. Cells 2021, 10, 648.
7. Munkácsy, G.; Santarpia, L.; Győrffy, B. Gene Expression Profiling in Early Breast Cancer—Patient Stratification Based on Molecular and Tumor Microenvironment Features. Biomedicines 2022, 10, 248.
8. Oliveira, L.J.C.; Amorim, L.C.; Megid, T.B.C.; De Resende, C.A.A.; Mano, M.S. Gene expression signatures in early Breast Cancer: Better together with clinicopathological features. Crit. Rev. Oncol. Hematol. 2022, 175, 103708.
9. Schettini, F.; Chic, N.; Brasó-Maristany, F.; Paré, L.; Pascual, T.; Conte, B.; Martínez-Sáez, O.; Adamo, B.; Vidal, M.; Barnadas, E.; et al. Clinical, pathological, and PAM50 gene expression features of HER2-low breast cancer. NPJ Breast Cancer 2021, 7, 1.
10. Zhong, Y.; Chalise, P.; He, J. Nested cross-validation with ensemble feature selection and classification model for high-dimensional biological data. Commun. Stat. Simul. Comput. 2023, 52, 110–125.
11. Petinrin, O.O.; Saeed, F.; Li, X.; Ghabban, F.; Wong, K.C. Reactions' descriptors selection and yield estimation using metaheuristic algorithms and voting ensemble. Comput. Mater. Contin. 2022, 70, 4745–4762.
12. Hameed, S.S.; Petinrin, O.O.; Hashi, A.O.; Saeed, F. Filter-wrapper combination and embedded feature selection for gene expression data. Int. J. Adv. Soft Compu. Appl. 2018, 10, 90–105.
13. Townes, F.W.; Hicks, S.C.; Aryee, M.J.; Irizarry, R.A. Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome Biol. 2019, 20, 295.
14. Jain, I.; Jain, V.K.; Jain, R. Correlation feature selection based improved-binary particle swarm optimization for gene selection and cancer classification. Appl. Soft Comput. 2018, 62, 203–215.
15. Kabir, M.F.; Chen, T.; Ludwig, S.A. A performance analysis of dimensionality reduction algorithms in machine learning models for cancer prediction. Healthc. Anal. 2023, 3, 100125.
16. Prasad, Y.; Biswas, K.; Hanmandlu, M. A recursive PSO scheme for gene selection in microarray data. Appl. Soft Comput. 2018, 71, 213–225.
17. Sharbaf, F.V.; Mosafer, S.; Moattar, M.H. A hybrid gene selection approach for microarray data classification using cellular learning automata and ant colony optimization. Genomics 2016, 107, 231–238.
18. Alhenawi, E.A.; Al-Sayyed, R.; Hudaib, A.; Mirjalili, S. Improved intelligent water drop-based hybrid feature selection method for microarray data processing. Comput. Biol. Chem. 2023, 103, 107809.
19. Keshta, I.; Deshpande, P.S.; Shabaz, M.; Soni, M.; Bhadla, M.K.; Muhammed, Y. Multi-stage biomedical feature selection extraction algorithm for cancer detection. SN Appl. Sci. 2023, 5, 131.
20. Sayed, S.; Nassef, M.; Badr, A.; Farag, I. A nested genetic algorithm for feature selection in high-dimensional cancer microarray datasets. Expert Syst. Appl. 2019, 121, 233–243.
21. Li, X.; Wang, H. On Mean-Optimal Robust Linear Discriminant Analysis. In Proceedings of the 2022 IEEE International Conference on Data Mining (ICDM), Orlando, FL, USA, 30 November–3 December 2022; pp. 1047–1052.
22. Li, X.; Wang, H. Adaptive Principal Component Analysis. In Proceedings of the 2022 SIAM International Conference on Data Mining (SDM), Alexandria, VA, USA, 28–30 April 2022; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2022; pp. 486–494.
23. Jiang, J.; Xu, J.; Liu, Y.; Song, B.; Guo, X.; Zeng, X.; Zou, Q. Dimensionality reduction and visualization of single-cell RNA-seq data with an improved deep variational autoencoder. Briefings Bioinform. 2023, 24, bbad152.
24. Hameed, S.S.; Muhammad, F.F.; Hassan, R.; Saeed, F. Gene Selection and Classification in Microarray Datasets using a Hybrid Approach of PCC-BPSO/GA with Multi Classifiers. J. Comput. Sci. 2018, 14, 868–880.
25. Dettling, M.; Bühlmann, P. Supervised clustering of genes. Genome Biol. 2002, 3, research0069.1.
26. Alon, U.; Barkai, N.; Notterman, D.A.; Gish, K.; Ybarra, S.; Mack, D.; Levine, A.J. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 1999, 96, 6745–6750.
27. Zhu, Z.; Ong, Y.S.; Dash, M. Markov Blanket-Embedded Genetic Algorithm for Gene Selection. Pattern Recognit. 2007, 49, 3236–3248.
28. Microarray Datasets. Available online: https://csse.szu.edu.cn/staff/zhuzx/Datasets.html (accessed on 8 June 2023).
29. Golub, T.R.; Slonim, D.K.; Tamayo, P.; Huard, C.; Gaasenbeek, M.; Mesirov, J.P.; Coller, H.; Loh, M.L.; Downing, J.R.; Caligiuri, M.A.; et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 1999, 286, 531–537.
30. Dudoit, S.; Fridlyand, J.; Speed, T.P. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc. 2002, 97, 77–87.
31. Díaz-Uriarte, R.; De Andres, S.A. Gene selection and classification of microarray data using random forest. BMC Bioinform. 2006, 7, 3.
32. Nguyen, H.M.; Cooper, E.W.; Kamei, K. Borderline over-sampling for imbalanced data classification. In Proceedings of the Fifth International Workshop on Computational Intelligence & Applications, IEEE SMC Hiroshima Chapter, Hiroshima, Japan, 10–12 November 2009.
Figure 1. Process diagram of the study.
Table 1. Detailed information about the datasets.

Dataset     Classes   Instances   Attributes
Brain       5         42          5598
Colon       2         62          2001
Leukemia    2         72          3572
Lymphoma    3         62          4027
Prostate    2         102         6034
SBRCT       4         63          2309
Table 2. Comparison between the accuracy of models before and after oversampling.

Method  SVMSMOTE   Brain     Colon     Leukemia  Lymphoma  Prostate  SBRCT
LR      Without    0.8167    0.8872    0.9857    1.0000    0.9110    0.9846
        With       0.8667    0.9000    0.9895    0.9909    0.9229    0.9889
RF      Without    0.7444    0.8410    0.9457    1.0000    0.8910    0.9846
        With       0.8444    0.9250    0.9895    1.0000    0.8943    0.9889
SVM     Without    0.6750    0.8564    0.9857    1.0000    0.8714    0.8897
        With       0.7600    0.9125    1.0000    1.0000    0.9038    0.9234
GBC     Without    0.6383    0.7646    0.8486    0.9833    0.8818    0.8297
        With       0.6111    0.8000    0.8957    0.9814    0.8762    0.8457
GNB     Without    0.6417    0.8551    0.9724    0.9167    0.6371    0.9205
        With       0.6311    0.8625    1.0000    0.9909    0.6057    0.9778
KNN     Without    0.6972    0.7923    0.9305    0.9846    0.8614    0.7744
        With       0.7800    0.8750    0.8830    0.9727    0.8552    0.8146
Table 3. Comparison between the precision of models before and after oversampling.

Method  SVMSMOTE   Brain     Colon     Leukemia  Lymphoma  Prostate  SBRCT
LR      Without    0.7783    0.8737    0.9889    1.0000    0.9178    0.9929
        With       0.8567    0.9056    0.9909    0.9905    0.9324    0.9929
RF      Without    0.6433    0.8304    0.9556    1.0000    0.9053    0.9929
        With       0.8567    0.9362    0.9909    1.0000    0.9013    0.9929
SVM     Without    0.5483    0.8438    0.9889    1.0000    0.8757    0.8625
        With       0.8033    0.9249    1.0000    1.0000    0.9086    0.9200
GBC     Without    0.5615    0.7007    0.8378    0.9905    0.8920    0.7827
        With       0.6340    0.8144    0.8956    0.9771    0.8836    0.8552
GNB     Without    0.6225    0.8421    0.9746    0.9016    0.6630    0.9125
        With       0.6080    0.8706    1.0000    0.9917    0.6305    0.9762
KNN     Without    0.5258    0.7814    0.9139    0.9833    0.8669    0.8177
        With       0.6367    0.9072    0.9043    0.9742    0.8545    0.7935
Table 4. Comparison between the recall of models before and after oversampling.

Method  SVMSMOTE   Brain     Colon     Leukemia  Lymphoma  Prostate  SBRCT
LR      Without    0.7850    0.8873    0.9833    1.0000    0.9048    0.9750
        With       0.8750    0.8978    0.9889    0.9944    0.9149    0.9833
RF      Without    0.7150    0.8339    0.9500    1.0000    0.8923    0.9875
        With       0.8617    0.9246    0.9889    1.0000    0.8918    0.9833
SVM     Without    0.6567    0.8506    0.9833    1.0000    0.8690    0.8564
        With       0.8017    0.9103    1.0000    1.0000    0.8995    0.9324
GBC     Without    0.5473    0.7208    0.8739    0.9778    0.8784    0.7939
        With       0.6283    0.8016    0.8958    0.9849    0.8733    0.8501
GNB     Without    0.6283    0.8530    0.9722    0.8533    0.6405    0.8806
        With       0.7167    0.8585    1.0000    0.9926    0.6167    0.9733
KNN     Without    0.6400    0.7796    0.9444    0.9926    0.8606    0.8025
        With       0.7567    0.8714    0.8888    0.9778    0.8474    0.8097
Table 5. Comparison between F1 scores of models before and after oversampling.

Method  SVMSMOTE   Brain     Colon     Leukemia  Lymphoma  Prostate  SBRCT
LR      Without    0.7480    0.8736    0.9850    1.0000    0.9079    0.9795
        With       0.8427    0.8978    0.9894    0.9920    0.9185    0.9862
RF      Without    0.6467    0.8228    0.9450    1.0000    0.8895    0.9890
        With       0.8307    0.9232    0.9894    1.0000    0.8904    0.9862
SVM     Without    0.5610    0.8405    0.9850    1.0000    0.8695    0.8448
        With       0.7573    0.9097    1.0000    1.0000    0.8996    0.9149
GBC     Without    0.5288    0.6944    0.8205    0.9815    0.8776    0.7674
        With       0.5730    0.7979    0.8898    0.9794    0.8715    0.8394
GNB     Without    0.5793    0.8408    0.9715    0.8608    0.6310    0.8840
        With       0.6260    0.8598    1.0000    0.9916    0.5958    0.9706
KNN     Without    0.5467    0.7700    0.9158    0.9866    0.8594    0.7650
        With       0.6660    0.8665    0.8759    0.9744    0.8497    0.7711
Table 6. Comparison between model accuracy for PCA, TSVD, and TSNE using an oversampled dataset.

Method  FS Method  Brain     Colon     Leukemia  Lymphoma  Prostate  SBRCT
LR      PCA        0.9111    0.9000    0.9895    0.9909    0.9229    0.9889
        TSVD       0.8667    0.9000    0.9895    0.9909    0.9229    0.9889
        TSNE       0.2378    0.5500    0.6497    0.6139    0.4990    0.2731
RF      PCA        0.7822    0.8250    0.8842    0.9545    0.8176    0.8591
        TSVD       0.7156    0.8750    0.8842    1.0000    0.8271    0.8164
        TSNE       0.2378    0.6500    0.6696    0.6229    0.4605    0.4041
SVM     PCA        0.7356    0.9250    1.0000    1.0000    0.9133    0.9012
        TSVD       0.5067    0.9000    1.0000    0.9909    0.8457    0.9123
        TSNE       0.2178    0.6000    0.6918    0.6961    0.4795    0.4047
GBC     PCA        0.6422    0.7850    0.8942    0.9537    0.8210    0.8668
        TSVD       0.6644    0.8125    0.8731    0.9537    0.8290    0.8351
        TSNE       0.2800    0.6125    0.6744    0.4667    0.4463    0.3625
GNB     PCA        0.7578    0.8250    0.8415    0.9087    0.6533    0.7281
        TSVD       0.7356    0.8750    0.8520    0.9268    0.6433    0.6953
        TSNE       0.2356    0.6125    0.6386    0.6043    0.4405    0.3497
KNN     PCA        0.7822    0.8750    0.9041    0.9727    0.8452    0.8041
        TSVD       0.7800    0.8750    0.8830    0.9727    0.8652    0.8146
        TSNE       0.2600    0.5875    0.6497    0.6697    0.4429    0.3708
Table 7. Comparison between model precision for PCA, TSVD, and TSNE using an oversampled dataset.

Method  FS Method  Brain     Colon     Leukemia  Lymphoma  Prostate  SBRCT
LR      PCA        0.8900    0.9056    0.9909    0.9905    0.9324    0.9929
        TSVD       0.8567    0.9056    0.9909    0.9905    0.9324    0.9929
        TSNE       0.1660    0.5568    0.6548    0.6207    0.5084    0.2679
RF      PCA        0.8250    0.8460    0.9054    0.9744    0.8187    0.8771
        TSVD       0.7283    0.8888    0.9054    1.0000    0.8234    0.8290
        TSNE       0.2233    0.6658    0.6603    0.6270    0.4732    0.4012
SVM     PCA        0.7817    0.9434    1.0000    1.0000    0.9239    0.8925
        TSVD       0.4844    0.9056    1.0000    0.9905    0.8468    0.9067
        TSNE       0.1647    0.6076    0.6880    0.6946    0.4736    0.4241
GBC     PCA        0.6143    0.8075    0.8913    0.9619    0.8384    0.8720
        TSVD       0.6407    0.8190    0.8688    0.9528    0.8563    0.8367
        TSNE       0.2397    0.6213    0.6882    0.4888    0.4560    0.3752
GNB     PCA        0.7590    0.8417    0.8626    0.9331    0.6732    0.7521
        TSVD       0.7123    0.8999    0.8756    0.9466    0.6646    0.7282
        TSNE       0.1800    0.6484    0.6397    0.6034    0.4400    0.3502
KNN     PCA        0.6767    0.9072    0.9157    0.9742    0.8449    0.7810
        TSVD       0.6367    0.9072    0.9043    0.9742    0.8658    0.7935
        TSNE       0.2200    0.5889    0.6600    0.6743    0.4497    0.3923
Table 8. Comparison between model recall for PCA, TSVD, and TSNE using an oversampled dataset.

Method  FS Method  Brain     Colon     Leukemia  Lymphoma  Prostate  SBRCT
LR      PCA        0.9083    0.8978    0.9889    0.9944    0.9149    0.9833
        TSVD       0.8750    0.8978    0.9889    0.9944    0.9149    0.9833
        TSNE       0.2133    0.5528    0.6648    0.5949    0.5106    0.2908
RF      PCA        0.8517    0.8274    0.8929    0.9630    0.8157    0.8844
        TSVD       0.7717    0.8792    0.8929    1.0000    0.8195    0.8427
        TSNE       0.2617    0.6581    0.6566    0.6036    0.4738    0.4147
SVM     PCA        0.7817    0.9214    1.0000    1.0000    0.9072    0.9074
        TSVD       0.6100    0.8978    1.0000    0.9944    0.8459    0.9233
        TSNE       0.2133    0.5986    0.6941    0.7038    0.4873    0.4318
GBC     PCA        0.6413    0.7848    0.8993    0.9589    0.8061    0.8849
        TSVD       0.6613    0.8140    0.8771    0.9626    0.8101    0.8378
        TSNE       0.2440    0.6188    0.6924    0.4256    0.4572    0.3864
GNB     PCA        0.8017    0.8242    0.8418    0.8804    0.6506    0.7124
        TSVD       0.7617    0.8728    0.8518    0.9081    0.6415    0.6736
        TSNE       0.2533    0.6139    0.6479    0.5776    0.4488    0.3746
KNN     PCA        0.7867    0.8714    0.9099    0.9778    0.8363    0.8035
        TSVD       0.7567    0.8714    0.8888    0.9778    0.8565    0.8097
        TSNE       0.2667    0.5897    0.6628    0.6759    0.4473    0.4254
Table 9. Comparison between model F1 scores for PCA, TSVD, and TSNE using an oversampled dataset.

Method  FS Method  Brain     Colon     Leukemia  Lymphoma  Prostate  SBRCT
LR      PCA        0.8827    0.8978    0.9894    0.9920    0.9185    0.9862
        TSVD       0.8427    0.8978    0.9894    0.9920    0.9185    0.9862
        TSNE       0.1780    0.5457    0.6417    0.5811    0.4912    0.2504
RF      PCA        0.7923    0.8210    0.8806    0.9585    0.8131    0.8533
        TSVD       0.6977    0.8740    0.8806    1.0000    0.8206    0.8076
        TSNE       0.2180    0.6458    0.6488    0.5881    0.4591    0.3809
SVM     PCA        0.7373    0.9219    1.0000    1.0000    0.9092    0.8871
        TSVD       0.4842    0.8978    1.0000    0.9920    0.8422    0.9042
        TSNE       0.1615    0.5927    0.6809    0.6750    0.4665    0.4004
GBC     PCA        0.5804    0.7791    0.8903    0.9580    0.8110    0.8597
        TSVD       0.5988    0.8110    0.8684    0.9542    0.8152    0.8278
        TSNE       0.2245    0.6102    0.6666    0.4306    0.4398    0.3361
GNB     PCA        0.7522    0.8209    0.8342    0.8866    0.6387    0.7102
        TSVD       0.7056    0.8699    0.8442    0.9125    0.6274    0.6859
        TSNE       0.1880    0.6026    0.6297    0.5553    0.4318    0.3318
KNN     PCA        0.7003    0.8665    0.8992    0.9744    0.8389    0.7601
        TSVD       0.6660    0.8665    0.8759    0.9744    0.8594    0.7711
        TSNE       0.2227    0.5751    0.6376    0.6491    0.4394    0.3804
Table 10. Comparison between model accuracy for RF and LR using an oversampled dataset.

Method  FS Method  Brain     Colon     Leukemia  Lymphoma  Prostate  SBRCT
LR      RF         0.9333    0.9000    0.9895    1.0000    0.9514    0.9889
        LR         0.9556    0.9125    1.0000    1.0000    0.9610    0.9889
RF      RF         0.8889    0.9000    0.9895    1.0000    0.9324    0.9889
        LR         0.8667    0.9250    1.0000    1.0000    0.9229    0.9889
SVM     RF         0.9111    0.9125    1.0000    1.0000    0.9324    0.9889
        LR         0.9333    0.9375    1.0000    1.0000    0.9229    0.9889
GBC     RF         0.6289    0.8600    0.8978    0.9814    0.8686    0.8591
        LR         0.5889    0.8025    0.8957    0.9814    0.8662    0.8568
GNB     RF         0.7178    0.9000    1.0000    1.0000    0.8357    0.9889
        LR         0.7400    0.9250    1.0000    1.0000    0.7110    0.9889
KNN     RF         0.8667    0.9000    0.9895    0.9909    0.9229    0.9784
        LR         0.9556    0.9125    0.9778    0.9909    0.8938    0.9673
Table 11. Comparison between model precision for RF and LR using an oversampled dataset.

Method  FS Method  Brain     Colon     Leukemia  Lymphoma  Prostate  SBRCT
LR      RF         0.9433    0.9056    0.9909    1.0000    0.9629    0.9929
        LR         0.9633    0.9249    1.0000    1.0000    0.9691    0.9929
RF      RF         0.8767    0.9056    0.9909    1.0000    0.9428    0.9929
        LR         0.8767    0.9362    1.0000    1.0000    0.9324    0.9929
SVM     RF         0.8900    0.9249    1.0000    1.0000    0.9428    0.9929
        LR         0.9500    0.9516    1.0000    1.0000    0.9324    0.9929
GBC     RF         0.6759    0.8754    0.8963    0.9771    0.8770    0.8685
        LR         0.6270    0.8103    0.8956    0.9771    0.8758    0.8594
GNB     RF         0.7200    0.9056    1.0000    1.0000    0.8385    0.9929
        LR         0.7667    0.9335    1.0000    1.0000    0.7219    0.9929
KNN     RF         0.8167    0.9056    0.9909    0.9905    0.9344    0.9804
        LR         0.9633    0.9266    0.9714    0.9905    0.9012    0.9720
Table 12. Comparison between model recall for RF and LR using an oversampled dataset.

Method  FS Method  Brain     Colon     Leukemia  Lymphoma  Prostate  SBRCT
LR      RF         0.9417    0.8978    0.9889    1.0000    0.9428    0.9833
        LR         0.9617    0.9103    1.0000    1.0000    0.9553    0.9833
RF      RF         0.8883    0.8978    0.9889    1.0000    0.9226    0.9833
        LR         0.8750    0.9246    1.0000    1.0000    0.9149    0.9833
SVM     RF         0.9083    0.9103    1.0000    1.0000    0.9226    0.9833
        LR         0.9483    0.9357    1.0000    1.0000    0.9149    0.9833
GBC     RF         0.6657    0.8560    0.8990    0.9849    0.8649    0.8611
        LR         0.6117    0.8041    0.8958    0.9849    0.8603    0.8591
GNB     RF         0.7850    0.8978    1.0000    1.0000    0.8297    0.9833
        LR         0.7983    0.9228    1.0000    1.0000    0.7156    0.9833
KNN     RF         0.8633    0.8978    0.9889    0.9944    0.9101    0.9762
        LR         0.9617    0.9107    0.9846    0.9944    0.8819    0.9662
Table 13. Comparison between model F1 scores for RF and LR using an oversampled dataset.

Method  FS Method  Brain     Colon     Leukemia  Lymphoma  Prostate  SBRCT
LR      RF         0.9253    0.8978    0.9894    1.0000    0.9481    0.9862
        LR         0.9520    0.9097    1.0000    1.0000    0.9592    0.9862
RF      RF         0.8640    0.8978    0.9894    1.0000    0.9279    0.9862
        LR         0.8493    0.9232    1.0000    1.0000    0.9185    0.9862
SVM     RF         0.8827    0.9097    1.0000    1.0000    0.9279    0.9862
        LR         0.9360    0.9356    1.0000    1.0000    0.9185    0.9862
GBC     RF         0.6066    0.8570    0.8926    0.9794    0.8637    0.8515
        LR         0.5470    0.8014    0.8898    0.9794    0.8591    0.8481
GNB     RF         0.7063    0.8978    1.0000    1.0000    0.8302    0.9862
        LR         0.7320    0.9237    1.0000    1.0000    0.7075    0.9862
KNN     RF         0.8153    0.8978    0.9894    0.9920    0.9170    0.9752
        LR         0.9520    0.9106    0.9750    0.9920    0.8879    0.9651