Learn about our MBSCLoc model.
First, the UTR-LM feature extraction module encodes the raw input mRNA sequence with a pre-trained UTR-LM model; the resulting representation is referred to as the UTR-LM embedding.
Next, the multi-class contrastive representation learning module takes the extracted features as input and constructs paired samples, each consisting of an anchor, a positive sample whose labels are identical to the anchor's, and multiple negative samples whose labels are completely different. These paired samples are processed through multi-class contrastive representation learning, and the learned representations serve as inputs to the clustering module.
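The sample construction described above can be sketched as follows. This is only an illustration under stated assumptions, not the MBSCLoc implementation: "completely different labels" is interpreted here as label sets that share no label with the anchor, and the function name `build_contrastive_sets`, the parameter `num_negatives`, and the placeholder labels `loc_A`, `loc_B`, `loc_C` are hypothetical.

```python
import random


def build_contrastive_sets(labels, num_negatives=2, seed=0):
    """Build (anchor, positive, negatives) index groups from multi-label annotations.

    Assumption: a positive has exactly the same label set as the anchor, and a
    negative shares no label with the anchor. Anchors without any positive or
    negative candidate are skipped.
    """
    rng = random.Random(seed)
    label_sets = [frozenset(item) for item in labels]
    groups = []
    for a, a_labels in enumerate(label_sets):
        positives = [i for i, s in enumerate(label_sets) if i != a and s == a_labels]
        negatives = [i for i, s in enumerate(label_sets) if i != a and s.isdisjoint(a_labels)]
        if not positives or not negatives:
            continue
        groups.append({
            "anchor": a,
            "positive": rng.choice(positives),
            "negatives": rng.sample(negatives, min(num_negatives, len(negatives))),
        })
    return groups


# Hypothetical label sets for six sequences (placeholder label names).
example_labels = [
    {"loc_A"}, {"loc_A"}, {"loc_B"},
    {"loc_A", "loc_B"}, {"loc_C"}, {"loc_A", "loc_B"},
]
example_groups = build_contrastive_sets(example_labels)
```

In this toy input, the sequences labeled only `loc_B` or only `loc_C` have no partner with an identical label set, so they never act as anchors, although they can still be drawn as negatives.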
Next, the optimal number of clusters is determined to be 7 by combining the elbow method and the silhouette coefficient method. In the optimal subspace construction and training module, subsets are then constructed by combining the number of labels with the clustering results, and an XGBoost base classifier is trained on each subset. For classification, a threshold of 0.5 is applied at each position of the XGBoost classifier's output.
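The two quantities behind the cluster-number selection can be sketched in plain Python. The elbow method inspects how the within-cluster sum of squares falls as the number of clusters grows, and the silhouette coefficient rewards tight, well-separated clusters. The sketch below only scores a given cluster assignment; how MBSCLoc combines the two criteria to arrive at 7 is not specified here, and the function names and toy points are assumptions for illustration.

```python
from collections import defaultdict
from math import dist


def within_cluster_ss(points, labels):
    """Within-cluster sum of squared distances to each centroid (the elbow-method quantity)."""
    clusters = defaultdict(list)
    for p, c in zip(points, labels):
        clusters[c].append(p)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(dist(p, centroid) ** 2 for p in members)
    return total


def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points; needs at least two clusters."""
    clusters = defaultdict(list)
    for i, c in enumerate(labels):
        clusters[c].append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, c in enumerate(labels):
        own = clusters[c]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in other) / len(other)
            for k, other in clusters.items() if k != c
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy 2-D points: two obvious groups, scored under a good and a poor assignment.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_assignment = [0, 0, 1, 1]
poor_assignment = [0, 1, 0, 1]
```

In practice one would compute both scores for each candidate number of clusters, then look for the bend in the within-cluster sum of squares and a high mean silhouette.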
Finally, the classifiers trained within the multiple balanced subspaces generate the final prediction through an ensemble voting mechanism.
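A minimal sketch of how per-position thresholding and voting could fit together is given below. It assumes a strict majority vote and treats a probability equal to 0.5 as positive; the exact voting rule used by MBSCLoc is not stated here, and the function names and probability values are hypothetical.

```python
def binarize(probabilities, threshold=0.5):
    """Turn per-position probabilities into 0/1 decisions (assumes >= threshold is positive)."""
    return [1 if p >= threshold else 0 for p in probabilities]


def majority_vote(per_classifier_probs, threshold=0.5):
    """Combine several classifiers' outputs by strict majority at each position."""
    votes = [binarize(probs, threshold) for probs in per_classifier_probs]
    n_classifiers = len(votes)
    return [1 if 2 * sum(column) > n_classifiers else 0 for column in zip(*votes)]


# Hypothetical probabilities from three subspace classifiers over three positions.
example_probs = [
    [0.9, 0.2, 0.6],
    [0.7, 0.6, 0.4],
    [0.4, 0.1, 0.8],
]
final_prediction = majority_vote(example_probs)
```

With an even number of classifiers a strict majority leaves ties negative; a different tie-breaking rule would be an equally plausible choice.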