1. Introduction
Cancer continues to be a major health problem worldwide, taking millions of lives [1,2,3,4,5,6,7,8]. Cancer, rather than being a single ailment, is a broad spectrum of complicated disorders characterized by uncontrolled cell growth and the propensity to rapidly invade other parts of the body. This inherent complexity and heterogeneity pose serious challenges in developing effective anticancer therapies [9].
Conventional methods like radiotherapy and chemotherapy are usually beneficial but have high costs and considerable adverse effects on normal cells. Furthermore, the development of resistance by cancer cells to existing chemotherapeutic drugs presents a challenging obstacle [10,11]. As a result, there is an ongoing need for the discovery of novel anticancer drugs. Traditional therapies destroy both cancer and normal cells, leading to exorbitant medical costs [12,13,14,15]. Peptide-based treatment is a promising option because of its high specificity, enhanced tumor penetration, and minimal toxicity under normal physiological conditions [16]. The discovery of anticancer peptides has transformed this landscape by enabling the selective targeting of cancer cells while protecting normal cells [17,18,19].
Anticancer peptides (ACPs) have therapeutic potential for numerous malignancies, as they selectively target cancer cells without affecting normal physiological processes [20,21]. ACPs, ranging in length from 5 to 50 amino acids, have cationic amphipathic structures that interact with the anionic lipid membrane of cancer cells, enabling selective targeting [22,23]. They exhibit broad-spectrum cytotoxicity against many cancer cells while sparing normal cells due to their electrostatic interaction with the plasma membrane of cancer cells [24]. Over the last decade, several peptide-based therapies have been evaluated in pre-clinical and clinical trials, highlighting the need to discover novel ACPs for cancer treatment [25]. ACPs, primarily derived from antimicrobial peptides (AMPs), represent a new direction in anticancer drug development [24,26]. The safety and efficacy of ACPs make them viable alternatives to traditional broad-spectrum drugs. The extensive research into ACP therapies in pre-clinical and clinical trials against numerous tumor types indicates a paradigm shift, though identifying clinically viable ACPs remains a challenge due to the time-consuming and expensive nature of experimental methods. Computational methods are, therefore, essential for efficient ACP identification.
Bioinformatics encompasses a myriad of computational methodologies [27,28,29,30,31,32], with a particular emphasis on machine-learning-based approaches for identifying anticancer peptides (ACPs). The pioneering tool, AntiCP, used a support vector machine (SVM) with sequence-based features and binary profiles [33]. Chou's pseudo-amino acid composition (PseAAC) and a local alignment kernel were used in [34] for ACP prediction, while g-gap dipeptide components were optimized in [35]. The amino acid composition, average chemical shifts, and reduced amino acid composition were selected in [36]. Several other methodologies have been proposed, including feature representation learning models [37], 400-dimensional features with g-gap dipeptide features [38], and a generalized chaos game representation (CGR) method [39]. Notably, the investigation into deep learning architectures indicated the advantages of recurrent neural networks [40]. ACP prediction research has grown in popularity over the last decade, with more experimentally validated ACPs generated from protein sequences [41]. The surge in accessible protein sequences from high-throughput sequencing efforts indicates rapid growth in potential ACPs. Given the challenges inherent in experimental procedures, computational approaches, particularly machine learning, have gained popularity. However, the short length of ACPs makes it difficult to capture specificity information.
In recent years, there has been a proliferation of machine-learning-based methods, notably efficient feature representation algorithms. While several sequence-based feature descriptors are available, combining multiple types of features to train classifiers raises concerns about the curse of dimensionality and information redundancy. Efficient approaches are required to optimize the information contained in feature descriptors. Furthermore, the integration of physicochemical information has been proposed, as ACPs vary considerably from non-ACPs in terms of these characteristics [22].
Inspired by the success of deep representation learning [42,43,44] in natural language processing, several sequence-based deep representation learning algorithms for proteins and peptides have emerged [45,46,47,48,49,50,51,52]. Unsupervised or semi-supervised learning methods, such as ProFET [53], UniRep [46], ProGen [54], and UDSMProt [55], use large datasets and show promise for protein-related predictions. For instance, in [56], the authors reviewed different deep learning architectures, including various embedding techniques used for feature extraction and model design in protein sequence prediction tasks. In [57], the authors showed that deep transfer learning using ProteinBERT representations produces promising results when labeled data are limited. Transfer learning enables these models to be employed as pre-training models for novel tasks such as ACP prediction. While prior techniques have shown promising results, taking into account the dimensional advantages of the model is critical. Methods like ACP-DA [58] use data augmentation to improve ACP prediction performance, addressing the curse of dimensionality by concatenating binary profile features and physicochemical properties. Similarly, iACP-DRLF [59] employs deep representation learning and LGBM feature selection, outperforming previous methods like ACPred-Fuse [60] and AntiCP 2.0 [61].
In a nutshell, advancements in ACP prediction methods have been substantial, but there is still a need to enhance prediction accuracy and consider the dimensionality advantages of the models. Computational methods, particularly those leveraging machine learning and deep representation learning, hold promise for rapid ACP identification.
While current machine learning approaches provide some advantages in predicting ACPs, there is still room for improvement. Deep learning models are highly effective, but their
black-box nature can obscure the rationale behind classification decisions. In contrast, a very simple model may lack the precision required for accurate classification. To this end, [
62] proposed ACP-KSRC, which uses a
sparse-representation classification (SRC) technique [
63,
64], incorporating polynomial kernel-based
principal component analysis (PCA) embedding to reduce feature space dimensions. Furthermore, it employs the
synthetic minority oversampling technique (SMOTE) with
K-Means [
65] to balance sample space dimensions, facilitating the construction of the
kernel SRC model.
ACP-KSRC [62], leveraging a carefully curated feature set and robust signal processing tools, improves decision-making transparency, resulting in better explainability in classification. In line with the growing emphasis on explainable machine learning, a novel deep-latent-space encoding scheme, termed DeepLSE, is introduced. This approach demonstrates significant advancements in classification outcomes, particularly in scenarios with a small sample size and an abundance of features [66,67,68,69].
The DeepLSE method uses representation learning [42,43] and an auto-encoder (AE) to project high-dimensional data into a compressed latent space. In contrast to classic AEs, where compressed representations may not guarantee discriminating features, the proposed DeepLSE learns a meaningful feature set that is both compact and effective for classification.
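To make the scheme concrete, the following is a minimal sketch of this idea: an auto-encoder whose latent code also feeds a classification head, trained with a mixed loss. The layer sizes, the mixing weight `alpha`, and the optimizer settings are illustrative assumptions, not the exact DeepLSE configuration.

```python
# Minimal sketch of a DeepLSE-style model: an auto-encoder whose latent code
# is additionally constrained by a classification head and a mixed loss.
# All sizes and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

class DeepLSE(nn.Module):
    def __init__(self, n_features: int, n_latent: int = 5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, n_latent),
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 128), nn.ReLU(),
            nn.Linear(128, n_features),
        )
        self.classifier = nn.Sequential(
            nn.Linear(n_latent, 16), nn.ReLU(),
            nn.Linear(16, 1),  # single ACP / non-ACP logit
        )

    def forward(self, x):
        z = self.encoder(x)  # compressed latent representation
        return self.decoder(z), self.classifier(z).squeeze(-1), z

def mixed_loss(x, y, x_hat, logit, alpha=0.5):
    # alpha trades classification off against reconstruction;
    # alpha = 0 recovers a conventional auto-encoder.
    rec = nn.functional.mse_loss(x_hat, x)
    clf = nn.functional.binary_cross_entropy_with_logits(logit, y)
    return (1 - alpha) * rec + alpha * clf

# Toy training step with random data standing in for Cg-SAAP features.
model = DeepLSE(n_features=400, n_latent=5)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, 400)
y = torch.randint(0, 2, (32,)).float()
x_hat, logit, _ = model(x)
loss = mixed_loss(x, y, x_hat, logit, alpha=0.5)
opt.zero_grad()
loss.backward()
opt.step()
```

Setting alpha = 0 reduces the sketch to a plain auto-encoder, while alpha = 1 removes the reconstruction constraint; this trade-off mirrors the loss mixing weight analyzed in Section 3.2.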
Section 2 provides comprehensive details of the proposed approach, encompassing datasets, feature encoding techniques, and classification methods. Subsequently, Section 3 presents the results of the experimental analysis and offers a detailed discussion. The paper concludes in Section 4, summarizing the key findings and outlining potential directions for future research.
3. Experimental Results
This section provides a detailed study of the DeepLSE approach, utilizing several numerical experiments. The proposed DeepLSE-based anticancer peptide classification method, ACP-LSE, is validated in terms of methodology, and supporting experiments offer a rationale for hyperparameter selection. The experiments are organized in the following order.
Section 3.1 focuses on demonstrating how the proposed DeepLSE compresses feature dimensions more effectively than a conventional auto-encoder and the original feature space. In the same section, the impact of different values of G in Cg-SAAP embeddings on model performance is also analyzed. In Section 3.2, the influence of the feature and latent-space dimensions is investigated for different values of the loss mixing weight. In Section 3.3, the effect of the loss mixing weight on the latent space is also analyzed. Section 3.4 evaluates the robustness of ACP-LSE against random mutations. Section 3.5 examines the performance consistency of ACP-LSE for various training and testing sample split ratios. Finally, Section 3.6 discusses the findings from comparing the performance of ACP-LSE to existing state-of-the-art approaches.
3.1. Latent-Space Encoding of Cg-SAAP
Modern machine learning methods need vast amounts of training data to achieve better generalization and performance. However, a phenomenon known as the curse of dimensionality occurs when the quantity of measurements or samples is limited but the attributes proliferate (F ≫ S). Here, F and S represent the number of features/attributes and samples, respectively, with S = S⁻ + S⁺, where S⁻ and S⁺ are the numbers of negative and positive samples in the dataset. In the context of this study, the dataset consists of a small number of samples (e.g., 344, 740, etc.), although the attributes may number in the thousands. For example, the Cg-SAAP description can have as many as 4000 attributes. The curse of dimensionality not only makes our classification task theoretically ill-posed but also poses a significant challenge in developing an effective compressed latent-space representation, particularly when F ≫ S. To address this, the proposed ACP-LSE removes the least representative dimensions, reducing the original feature space of size F to a deep latent-space representation of size LV, where LV ≪ F.
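As context for the attribute counts above, the sketch below computes a g-spaced amino-acid pair composition, which is assumed here to underlie the Cg-SAAP descriptor: each gap value contributes 20 × 20 = 400 ordered-pair frequencies, and concatenating gaps 0 through G − 1 yields 400·G features (4000 when G = 10). The exact Cg-SAAP definition may differ in detail.

```python
# Hedged sketch of a g-spaced amino-acid pair (g-gap dipeptide) composition,
# assumed to underlie Cg-SAAP: count ordered residue pairs separated by a gap
# of g, normalize, and concatenate over gaps 0..G-1 (400*G features in total).
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def g_spaced_pair_composition(seq: str, g: int) -> np.ndarray:
    counts = np.zeros((20, 20))
    for i in range(len(seq) - g - 1):
        a, b = seq[i], seq[i + g + 1]  # residues separated by a gap of g
        if a in AA_INDEX and b in AA_INDEX:
            counts[AA_INDEX[a], AA_INDEX[b]] += 1
    total = counts.sum()
    return (counts / total).ravel() if total else counts.ravel()

def cg_saap(seq: str, G: int = 10) -> np.ndarray:
    # Concatenate pair compositions for all gaps 0..G-1 (cumulative over gaps).
    return np.concatenate([g_spaced_pair_composition(seq, g) for g in range(G)])

features = cg_saap("FAKLLAKLAKKLL", G=10)  # arbitrary example peptide
print(features.shape)                      # (4000,)
```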
To assess the performance of the proposed ACP-LSE method, a comparison of the generalized multi-dimension distribution overlap (GMDM) [72,73] scores is presented in Figure 3. To that end, two models with identical encoder and decoder settings undergo training with 10-fold cross-validation on the ACP344 dataset. For both models, the number of latent-space variables (LV) was fixed at 5. The latent-space encoded features from these models are fed into the GMDM function to measure the degree of overlap between the feature spaces of the two classes. Specifically, one model is a conventional auto-encoder without a classification constraint, while the other is the DeepLSE, which incorporates a classification constraint on the latent-space representation. As a baseline, the original feature space is also evaluated for GMDM scores, and the process is repeated for 10 different values of the gap (G) in Cg-SAAP.
It is evident from Figure 3 that the original feature space is highly cluttered, resulting in very low class separability. In contrast, both the auto-encoder and the proposed DeepLSE effectively compress the feature dimensions, leading to higher relative class-separability scores. It is noteworthy that the GMDM suggests using a weighted contribution of the projected variables based on their eigen-spread values. However, since the goal of this experiment is to assess overlap in the original space without noise removal, no PCA embedding is employed during the GMDM calculation. Furthermore, to prevent data leakage, no training sample is used to calculate GMDM scores, and the average results of the 10-fold cross-validation are plotted.
The above experiment demonstrates, through GMDM, how the proposed DeepLSE compresses feature dimensions more effectively than a conventional auto-encoder and the original feature space. From the analysis, we can conclude that higher values of G bring some useful information that is otherwise cluttered by redundant features. Unlike a conventional AE, where cluttered information is encoded ineffectively and fails to harness useful discriminating information, the proposed DeepLSE efficiently represents it in a compact, discriminating feature space.
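The GMDM score itself is defined in [72,73] and is not reproduced here; as a generic illustration of quantifying class overlap in an encoded space, the sketch below uses scikit-learn's silhouette score on synthetic latent codes. This is a stand-in separability measure, not GMDM.

```python
# Generic class-separability check on latent encodings, used as a stand-in
# for GMDM [72,73]; higher silhouette values indicate less class overlap.
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
z_acp = rng.normal(loc=+1.0, scale=1.0, size=(100, 5))  # latent codes, class 1
z_non = rng.normal(loc=-1.0, scale=1.0, size=(150, 5))  # latent codes, class 0
z = np.vstack([z_acp, z_non])
labels = np.array([1] * 100 + [0] * 150)

print(f"separability: {silhouette_score(z, labels):.3f}")
```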
To further showcase the impact of different values of G on the model performance, size, and number of features, additional results are presented in Table 2, highlighting DeepLSE classification results for different values of the gap (G) in terms of the MCC score. Similar to the previous analysis, the results presented in Table 2 clearly show that higher values of G provide additional discriminating information and help improve classification performance at the cost of model complexity, which is evident in the number of input features generated by Cg-SAAP for different values of G. It is also noteworthy that larger models often need more training data; therefore, beyond a certain gap value, the model performance saturates in terms of the MCC score.
3.2. Analyzing the Effect of the Feature and Latent-Space Dimensions for Different Values of the Loss Mixing Weight
In the proposed DeepLSE-based ACP classification approach, several parameters can affect the classification performance. For instance, the gap (G) between the two amino acids of a pair in Cg-SAAP controls the length of the feature vector. Similarly, the latent-space dimension (LV) controls the size of the output of the encoder module, and the loss mixing weight controls the training priority for a specific loss. To analyze the sensitivity of DeepLSE to these hyperparameters, seventy experiments are performed for each value of the loss mixing weight, where each experiment is a 10-fold cross-validation on the ACP344 dataset.
In particular, for five different values of the loss mixing weight, seven combinations of the latent-space dimension and ten gap values are evaluated, resulting in 700 training and testing trials. The findings of this exhaustive analysis are summarized through surface plots in Figure 4. It is seen that irrespective of the choice of the loss mixing weight, gap G, and latent-space dimension, the overall test classification performance in terms of balanced accuracy is fairly consistent, lying in the range of 0.80∼0.98. Table 3 summarizes the results in the form of a comparison of the best mean statistics of the 10-fold cross-validation results for different loss mixing weight values. This demonstrates the adaptability of the proposed DeepLSE method for learning effective solutions to the given problem under variable conditions; the sweep protocol is sketched below.
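The sweep just described can be expressed compactly as nested loops over the three hyperparameters. In the sketch below, the grids and the helper train_eval_deeplse are hypothetical placeholders (the actual values are those reported in Figure 4 and Table 3), included only to make the protocol explicit.

```python
# Sketch of the hyperparameter sweep: every (loss weight, latent dim, gap)
# combination is scored by 10-fold cross-validation. The grids and the helper
# below are hypothetical placeholders, not the paper's actual values.
from itertools import product
from statistics import mean

def train_eval_deeplse(fold: int, alpha: float, n_latent: int, gap: int) -> float:
    """Hypothetical stand-in: train DeepLSE on nine folds and return the MCC
    on the held-out fold. A constant keeps the sketch runnable."""
    return 0.0

alphas = [0.1, 0.3, 0.5, 0.7, 0.9]      # five loss mixing weights (assumed grid)
latent_dims = [2, 3, 5, 8, 10, 15, 20]  # seven latent-space sizes (assumed grid)
gaps = list(range(1, 11))               # ten gap values for Cg-SAAP

results = {}
for alpha, lv, g in product(alphas, latent_dims, gaps):
    fold_mcc = [train_eval_deeplse(fold, alpha, lv, g) for fold in range(10)]
    results[(alpha, lv, g)] = mean(fold_mcc)  # mean 10-fold CV score

best = max(results, key=results.get)
print("best (alpha, LV, G):", best, "MCC:", results[best])
```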
However, it can also be seen that the model is sensitive to the choice of these parameters, and a carefully selected combination can help in designing a better classification model. In this regard, the best performance of the model was observed for a specific combination of the loss mixing weight, latent-space dimension, and gap value.
Figure 5 illustrates the MCC values for different loss mixing weights at the best-performing latent-space dimension and gap value. This is interesting because, on one hand, higher values of the loss mixing weight tend to favor classification over reconstruction (see Figure 4e: high MSE (dB) with good MCC), while lower values prioritize the reconstruction loss (see Figure 4a: lowest MSE (dB) with comparable MCC). A general assumption is that a classifier trained solely for the classification task must produce the best results, but it is observed from Figure 5 that the reconstruction constraint on the latent space can help in learning additional useful information that aids during the inference phase. A more prominent gain can be seen in larger feature spaces: for example, at higher gap values, models that retained a nonzero reconstruction weight achieved the best results; see Table 3. This further strengthens the proposed claim about using representational learning.
3.3. Analyzing the Effect of the Loss Mixing Weight on the Latent Space
By choosing a suitable value of the loss mixing hyperparameter, one can control the relative contribution of the reconstruction and classification losses. To analyze its effect on the learned feature representation, the deep latent-space encodings (the output of the encoder block) are visualized for three representative values.
For this experiment, the ACP-LSE model is trained on the ACP344 dataset with the aforementioned loss mixing weights, gap, and latent-space size. Since the output of the latent space is too high-dimensional for easy visualization, it is first reduced to a 2D projection using PCA and t-SNE [74].
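A sketch of this visualization step is given below, with random arrays standing in for the encoder outputs and labels; the t-SNE settings (e.g., perplexity) are assumptions.

```python
# Sketch of the visualization step: latent encodings are reduced to 2-D with
# PCA and t-SNE [74] for scatter plotting. Random data stands in for the
# encoder outputs; the t-SNE settings are assumptions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
z = rng.normal(size=(344, 5))          # stand-in latent encodings
labels = rng.integers(0, 2, size=344)  # stand-in ACP / non-ACP labels

z_pca = PCA(n_components=2).fit_transform(z)
z_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(z)

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
for ax, proj, title in [(axes[0], z_pca, "PCA"), (axes[1], z_tsne, "t-SNE")]:
    ax.scatter(proj[:, 0], proj[:, 1], c=labels, cmap="coolwarm", s=10)
    ax.set_title(title)
plt.show()
```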
Figure 6 shows scatter plots of the PCA- and t-SNE-based 2D projections of the encoded outputs. The findings show that with a large value of the loss mixing weight, the DeepLSE model focuses on classification accuracy and projects samples of the same class closer together. With a lower value, the model sacrifices classification accuracy and focuses on reducing the reconstruction loss. Ideally, it is better to have the best classification accuracy; however, a large loss mixing weight gives the classifier more freedom during the training phase and might lead the model to overfit the training data. Therefore, for better generalization, the loss mixing weight should be balanced. This enables the encoder to learn the variability in input features, while the latent-space representation constrains the classifier module to learn a decision boundary that maximizes inter-class separability.
3.4. Analyzing the Robustness of DeepLSE against Random Mutations
Figure 7 shows the original (unmutated) ACP344 dataset alongside two of its mutation variants. This experiment evaluates the robustness of the proposed ACP-LSE method against mutations in ACP sequences. Specifically, t-SNE [74] plots of Cg-SAAP features from the original ACP344 dataset are compared to mutants of 138 ACPs derived from the ACP344 dataset. The purpose of this experiment is to determine the susceptibility of the latent-space encoding to random mutations.
The findings show that in the original feature space, the separability of the empirical distributions of ACPs and non-ACPs decreases significantly as the mutation rate increases. When more amino acids in ACPs are mutated, the chance increases that these mutant ACP features will no longer carry anticancer properties. However, the proposed ACP-LSE, which was trained purely on original (unmutated) data, retains class separability even when three amino acids are randomly mutated or replaced. This demonstrates the effectiveness of representation learning in extracting robust representations.
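The mutation procedure can be sketched as follows: k randomly chosen residues of a peptide are substituted with different amino acids, after which the mutant's features can be re-extracted and compared. The sequence shown is an arbitrary example, not one from ACP344.

```python
# Sketch of the mutation experiment: randomly replace k residues of a peptide
# with different amino acids; the mutants' Cg-SAAP features can then be
# compared with the originals, as in Figure 7.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutate(seq: str, k: int, rng: random.Random) -> str:
    """Return a copy of `seq` with k randomly chosen positions substituted."""
    positions = rng.sample(range(len(seq)), k)
    residues = list(seq)
    for p in positions:
        choices = [aa for aa in AMINO_ACIDS if aa != residues[p]]
        residues[p] = rng.choice(choices)
    return "".join(residues)

rng = random.Random(42)
print(mutate("FAKLLAKLAKKLL", k=3, rng=rng))  # a three-residue mutant
```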
3.5. Analyzing Performance Consistency versus the Ratio of Training and Testing Sample Splits
To assess the influence of dataset size on the performance and consistency of the proposed method, an experiment is designed in which the performance of the proposed ACP-LSE is evaluated on various train and test dataset split ratios. For this experiment, the latent-space feature (LV) value and the gap were fixed, allowing comprehensive testing in comparatively less time. Figure 8 displays the Matthews correlation coefficient (MCC) values for models trained and tested on the ACP344 dataset for various sample sizes.
For a fair comparison, both the DeepLSE and conventional DNN models are designed with identical numbers of neurons in their feature extraction (encoder) and classification modules. The DeepLSE was trained with the default loss mixing weight, whereas for the standard DNN the loss consists of the classification term only, since there is no decoder and hence no reconstruction loss.
Both models were evaluated for nine different train–test split ratios, ranging from a small fraction of samples used for training with the remainder for testing to the opposite extreme. In each experiment, the training and testing samples were randomly shuffled, and the weights were reinitialized to random values. To obtain statistically meaningful results, all experiments were repeated five times, and the mean results were compared. The protocol is sketched below.
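In the sketch, a logistic-regression stand-in replaces the two models and random arrays replace the real features, so only the repeated-shuffling and averaging logic is shown; all names and numbers here are illustrative.

```python
# Sketch of the split-ratio consistency experiment: for each train fraction,
# samples are reshuffled and the model retrained five times, and the mean MCC
# is recorded. A logistic regression stands in for DeepLSE/DNN so the sketch
# runs end to end; X and y stand in for Cg-SAAP features and labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(344, 400))
y = rng.integers(0, 2, size=344)

for train_frac in np.linspace(0.1, 0.9, 9):  # nine split ratios
    mccs = []
    for seed in range(5):                    # five repetitions per ratio
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=float(train_frac), random_state=seed, shuffle=True)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        mccs.append(matthews_corrcoef(y_te, clf.predict(X_te)))
    print(f"train={train_frac:.0%}  mean MCC={np.mean(mccs):.3f}")
```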
The findings in Figure 8 show that the proposed method is more robust in classification performance compared with the standard DNN and produces superior outcomes with consistency. The proposed method outperforms the standard DNN classifier in both extreme cases of the split ratio. Similarly, for the other distributions of training and testing sample splits, the classification performance of DeepLSE is either higher than or comparable to that of the standard DNN. This highlights the superiority of the proposed representation-learning approach, in which the latent space is constrained by the reconstruction loss, allowing useful features to be learned with large models even when the number of training samples is extremely low.
3.6. Comparison with State-of-the-Art ACP Classification Approaches
This section compares the performance of the proposed anticancer peptide classification method, ACP-LSE, which is based on DeepLSE, to current state-of-the-art ACP classification algorithms. The assessment uses two standard datasets: ACP344 [34] and ACP740 [70]. It is critical to emphasize that, for a fair comparison, the training and testing samples across all methodologies are kept comparable, as outlined in previous works. Specifically, the ACP344 dataset is assessed using 5-fold and 10-fold cross-validation protocols, whereas the ACP740 dataset is examined using a 5-fold cross-validation approach.
Table 4 summarizes the hyperparameters used in this study, while any additional specifications about each experiment are presented in the respective subsections.
3.6.1. ACP344 Dataset
To guarantee a fair comparison, we analyze the proposed method on the ACP344 dataset using two well-known evaluation protocols reported in the literature. Table 5 and Table 6 show the performance statistics of several algorithms on the ACP344 dataset using 5-fold and 10-fold cross-validation, respectively. Given the imbalanced nature of the dataset, conventional accuracy metrics are deemed inadequate to describe the overall performance. As a result, class-specific assessment criteria such as the MCC and Youden's index are employed to assess the comprehensive classification capability of the models.
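For reference, both metrics follow directly from the binary confusion matrix; the sketch below computes the MCC via scikit-learn and Youden's index as J = sensitivity + specificity − 1 on a toy prediction vector.

```python
# The two class-specific metrics used above, computed from a confusion matrix:
# MCC and Youden's index J = sensitivity + specificity - 1.
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # toy ground-truth labels
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 0])  # toy predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
youden_j = sensitivity + specificity - 1

print(f"MCC = {matthews_corrcoef(y_true, y_pred):.3f}, J = {youden_j:.3f}")
```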
In 5-fold cross-validation, the proposed method achieves the third-best MCC score, demonstrating its effectiveness in differentiating ACP features. The MCC value of the proposed ACP-LSE is only marginally lower than the values reported for EnACP [75] and IACP [75,76]. It is also worth mentioning that the ACP344 dataset is severely unbalanced, with just a limited number of training samples (110 ACPs and 165 non-ACPs) available for learning in 5-fold cross-validation. Given that the auto-encoder model does not require classification labels, it might be interesting to investigate DeepLSE with pre-trained models to determine the potential impact on performance. To investigate the influence of a larger training dataset, a 10-fold cross-validation protocol is used.
Significantly, the proposed method performs best under 10-fold cross-validation, demonstrating its effectiveness in classifying ACP features. Specifically, the proposed ACP-LSE achieved the highest MCC value, outperforming ACP-DL [70], ACP-LDF [78] with the LibD3C classifier, ACP-KSRC [62], ACP-LDF [78] with RF and SVM classifiers, and SAP with the SVM classifier [38]. This substantiates the argument that the proposed method has the potential to predict novel ACPs or ACP-like peptides. Other assessment metrics support this efficacy, emphasizing the substantial difference between ACPs and non-ACPs.
3.6.2. ACP740 Dataset
This section compares the proposed ACP-LSE on the ACP740 dataset to several state-of-the-art ACP classification methods, namely ACP-DL [70], ACP-DA [58], ACP-MHCNN [79], and ACP-KSRC [62].
Table 7 summarizes the comparison results. The findings show that the proposed ACP-LSE achieves the highest MCC value, outperforming the ACP-DL [70], ACP-DA [58], and ACP-KSRC [62] algorithms by significant margins in terms of this class-specific evaluation metric. In addition, the performance of the proposed representation-based approach slightly exceeds that of the powerful ACP-MHCNN [79] method. This efficacy is consistent across various evaluation metrics, demonstrating the capability of the proposed ACP-LSE to distinguish between ACPs and non-ACPs. These findings indicate that the proposed method is promising for predicting ACPs or ACP-like peptides.
4. Conclusions
The diagnosis and treatment of cancer, a complex disease with diverse causes, are challenging in the field of medical research. Anticancer peptides (ACPs) are a promising approach in targeted therapy with the potential for precise and accurate treatment. However, for large-scale identification and synthesis, credible prediction approaches are required. In this paper, an intuitive yet powerful representation-learning-based method, ACP-LSE, is proposed, which shows significant improvements in classification performance, particularly in cases with small sample sizes and a large number of features.
For investigation, the results on two benchmark datasets (and three evaluation protocols) were analyzed, suggesting that a higher number of training samples, whether with 10-fold cross-validation on ACP344 or 5-fold cross-validation on ACP740, yields superior classification performance. Various experimental analyses show that the proposed method achieves improved classification performance and aids in learning a compact latent-space representation. The suggested approach is tested for different quantitative and qualitative metrics and demonstrates higher performance compared with current methods. A key limitation of the proposed ACP-LSE method is that, unlike contrastive learning, where a successfully trained model can learn the maximum allowable class separability (i.e., infinite), in the proposed method the maximum class separability is bounded to unity. Additionally, unlike a DNN, which has no decoder, the training steps in the proposed method involve learning a larger number of parameters due to the decoder network. Finally, effective model design using DeepLSE requires the tuning of additional hyperparameters (e.g., the loss mixing weight and the latent-space dimension), which demands computationally complex and time-consuming ablation studies. In future research, I would like to investigate techniques to address these limitations.