Introduction
Recent technological advances have enabled the simultaneous collection of multi-omics data, i.e., different types or modalities of molecular data across various organ tissues of patients. For integrative predictive modeling, analyzing such data presents several challenges:
- Data Availability: Ideally, different modalities are measured in the same individuals, enabling early or intermediate integration techniques. However, real-world datasets often have missing modalities, requiring imputation or dataset reduction.
- Modality-Specific Characteristics: Each data modality may require tailored statistical methods rather than a one-size-fits-all approach.
- Late Integration Modeling: Instead of integrating data at the raw level, late-stage integration models first generate modality-specific predictions and then combine them. This approach is particularly useful when:
- Modalities have different scales, distributions, or levels of missingness.
- Feature-level integration is infeasible due to data heterogeneity.
- The interpretability of individual modalities is important.
Common late integration strategies include:
- Aggregative models, such as the Lasso or random forests, trained on the modality-specific predictions.
- Weighted averaging of modality-specific predictions (see the sketch below).
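The idea can be sketched in a few lines of plain R, independently of fuseMLR, with simulated data and arbitrary weights: one model is fitted per modality and the resulting predictions are combined afterwards.
# Conceptual late-integration sketch (not fuseMLR code): two modalities,
# one linear model per modality, predictions combined by a weighted average.
set.seed(42)
n <- 100
omics1 <- matrix(rnorm(n * 5), nrow = n)
omics2 <- matrix(rnorm(n * 5), nrow = n)
y <- omics1[, 1] + omics2[, 1] + rnorm(n)
fit1 <- lm(y ~ ., data = data.frame(y = y, omics1))
fit2 <- lm(y ~ ., data = data.frame(y = y, omics2))
w <- c(0.6, 0.4)  # arbitrary illustrative weights
pred_meta <- w[1] * predict(fit1) + w[2] * predict(fit2)
head(pred_meta)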
We introduce the R package fuseMLR for late integration predictive modeling. This package provides:
- A structured workflow for defining training processes with multiple data modalities.
- Support for modality-specific machine learning methods.
- Automatic aggregation of predictions once modality-specific training is completed.
- User-friendly functionality to simplify model training and evaluation.
To illustrate its application, we simulate multi-omics data and apply fuseMLR to perform late-stage integrative modeling.
Summary of Steps for Using the fuseMLR Package
Installation
To install fuseMLR, use the following commands:
# Install the released version from CRAN
install.packages("fuseMLR")
# Or install the development version from GitHub
remotes::install_github("your-repo/fuseMLR")
Usage
Follow this step-by-step guide for an overview of late integration modeling with fuseMLR. The short snippets in this section are schematic illustrations of the workflow; the package's actual functions (createTraining(), createTrainLayer(), createTrainMetaLayer(), fusemlr(), and predict()) are demonstrated in the Practical Use Case section below.
Step 1: Load the Package
library(fuseMLR)
library(ggplot2)
Step 2: Simulate Multi-Omics Data
To demonstrate fuseMLR, we generate synthetic multi-omics data and visualize their distributions.
set.seed(123)
modalities <- list(
omics1 = matrix(rnorm(1000), nrow = 100, ncol = 10),
omics2 = matrix(rnorm(1000), nrow = 100, ncol = 10)
)
response <- rnorm(100)
Visualizing the first feature of the first modality:
ggplot(data.frame(value = modalities$omics1[,1]), aes(x = value)) +
geom_histogram(binwidth = 0.2, fill = "blue", alpha = 0.5) +
ggtitle("Distribution of First Feature in Omics1")
Step 3: Define the Model
model <- fuseMLR$new()
Step 4: Train Modality-Specific Models
Different machine learning methods may be appropriate depending on the nature of the modality data:
model$train_modality("omics1", modalities$omics1, response, method = "lasso")
model$train_modality("omics2", modalities$omics2, response, method = "random_forest")
Step 5: Aggregate Predictions
To combine the predictions from each modality, fuseMLR offers multiple aggregation methods:
model$aggregate_predictions(method = "weighted_mean")
Step 6: Evaluate the Model
Evaluate predictive performance using built-in metrics and visualize predictions.
performance <- model$evaluate()
print(performance)
pred_df <- data.frame(True = response, Predicted = model$predict())
ggplot(pred_df, aes(x = True, y = Predicted)) +
geom_point(color = "blue", alpha = 0.6) +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
ggtitle("True vs Predicted Values")
Interpretation of results (a short sketch for computing these metrics is given after the list):
- Higher values of R-squared indicate better predictive performance.
- Lower RMSE suggests reduced prediction error.
- The scatter plot helps visualize the alignment between true and predicted values.
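For the simulated example above, R-squared and RMSE can be computed directly from pred_df; a minimal base R sketch, using the column names defined above:
# Compute RMSE and R-squared from the true and predicted values.
true_y <- pred_df$True
pred_y <- pred_df$Predicted
rmse <- sqrt(mean((true_y - pred_y)^2))
r2 <- 1 - sum((true_y - pred_y)^2) / sum((true_y - mean(true_y))^2)
c(RMSE = rmse, R2 = r2)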
Practical Use Case
A – Data
library(fuseMLR)
The following example is based on multi-omics simulated data available in fuseMLR. Data have been simulated using the R package InterSIM, version 2.2.0. Two types of data were simulated: training and testing datasets. Each consists of four data.frames: the gene expression, protein expression, and methylation data modalities, as well as the target observations. Individuals are organized in rows and variables in columns, with an additional column for individual IDs. In total, \(70\) individuals with \(50\) (not necessarily overlapping) individuals per layer have been simulated for training, and \(23\) (\(20\) per layer) for testing. Effects have been introduced across data modalities by shifting the means by \(0.5\) to create a case-control study with \(50\)% prevalence. For illustration, the number of variables was kept smaller than what is typically expected in practice. The data simulation code is available here.
data("multi_omics")
# This list contains two lists of data: training and test.
# Each sublist contains three omics data as described above.
str(object = multi_omics, max.level = 2L)
## List of 2
##  $ training:List of 4
##   ..$ geneexpr   :'data.frame': 50 obs. of 132 variables:
##   ..$ proteinexpr:'data.frame': 50 obs. of 161 variables:
##   ..$ methylation:'data.frame': 50 obs. of 368 variables:
##   ..$ target     :'data.frame': 70 obs. of 2 variables:
##  $ testing :List of 4
##   ..$ geneexpr   :'data.frame': 20 obs. of 132 variables:
##   ..$ proteinexpr:'data.frame': 20 obs. of 161 variables:
##   ..$ methylation:'data.frame': 20 obs. of 368 variables:
##   ..$ target     :'data.frame': 30 obs. of 2 variables:
B – Training
The training process is handled in fuseMLR by a class called Training. We will use training to refer to an instance of this class. The function createTraining() is available to create an empty training, i.e. one without layers.
B.1 – Creating a training
training <- createTraining(id = "training",
ind_col = "IDS",
target = "disease",
target_df = multi_omics$training$target,
verbose = TRUE)
print(training)
## Training : training
## Problem type : classification
## Status : Not trained
## Number of layers: 0
## Layers trained : 0
A training contains modality-specific training layers (instances of the TrainLayer class) and a meta-layer (an instance of the TrainMetaLayer class). The modality-specific training layers encapsulate the training data modalities, variable selection functions, and learners (i.e. R functions implementing statistical prediction methods). The meta-layer encapsulates the meta-learner. The terms modality-specific and layer-specific are used interchangeably in the following.
Three main functions are necessary to set up the training: createTraining(), createTrainLayer() and createTrainMetaLayer(). createTraining() creates an empty training, to which modality-specific training layers are added using createTrainLayer(). createTrainMetaLayer() is used to add the meta-layer to the training.
The following code adds a gene expression, a protein abundance, and a methylation layer to training. For illustration purposes we use the same variable selection method (Boruta) and the same learner (ranger) for all layers. These functions fulfill the fuseMLR requirements in terms of arguments and outputs (see the createTrainLayer documentation for details). We expect the variable selection function to accept two arguments: x (the predictor design matrix) and y (the response vector). The function must return a vector of selected variables. Methods that do not follow this format – such as Vita (via ranger) or Lasso (via glmnet) – require additional wrapping or a custom function to extract the selected variables, for example based on a significance threshold. An exception, however, is made for the Boruta function, for which an internal adjustment is implemented; its use requires no further modifications. For learners, the arguments x and y are mandatory as well, and the resulting model must support the generic predict function. Predictions should be returned either as a vector or as a list with a predictions field containing the predicted values (also a vector). If a function does not meet these input and output criteria, users should follow the steps in D – Interface and wrapping to create an interface or wrap their function for use with fuseMLR.
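As an illustration of this required signature (not a fuseMLR function), a hypothetical Lasso-based variable selection wrapper could look as follows; the name lassoVarSel is made up, glmnet is assumed to be installed, and the response is assumed to be binary as in this example.
# Hypothetical variable selection function with the required signature:
# x = predictor design matrix, y = (binary) response vector.
# It returns a character vector of selected variable names.
lassoVarSel <- function (x, y) {
  cv_fit <- glmnet::cv.glmnet(x = as.matrix(x), y = y, family = "binomial")
  coefs <- as.matrix(coef(cv_fit, s = "lambda.min"))
  selected <- rownames(coefs)[coefs[ , 1L] != 0]
  return(setdiff(selected, "(Intercept)"))
}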
# Create gene expression layer.
createTrainLayer(training = training,
train_layer_id = "geneexpr",
train_data = multi_omics$training$geneexpr,
varsel_package = "Boruta",
varsel_fct = "Boruta",
varsel_param = list(num.trees = 1000L,
mtry = 3L,
probability = TRUE),
lrner_package = "ranger",
lrn_fct = "ranger",
param_train_list = list(probability = TRUE,
mtry = 1L),
param_pred_list = list(),
na_action = "na.keep")
## Training : training
## Problem type : classification
## Status : Not trained
## Number of layers: 1
## Layers trained : 0
## p : 131
## n : 50
## na.action : na.keep
# Create protein abundance layer
createTrainLayer(training = training,
train_layer_id = "proteinexpr",
train_data = multi_omics$training$proteinexpr,
varsel_package = "Boruta",
varsel_fct = "Boruta",
varsel_param = list(num.trees = 1000L,
mtry = 3L,
probability = TRUE),
lrner_package = "ranger",
lrn_fct = "ranger",
param_train_list = list(probability = TRUE,
mtry = 1L),
param_pred_list = list(type = "response"),
na_action = "na.keep")
## Training : training
## Problem type : classification
## Status : Not trained
## Number of layers: 2
## Layers trained : 0
## p : 131 | 160
## n : 50 | 50
## na.action : na.keep | na.keep
# Create methylation layer
createTrainLayer(training = training,
train_layer_id = "methylation",
train_data = multi_omics$training$methylation,
varsel_package = "Boruta",
varsel_fct = "Boruta",
varsel_param = list(num.trees = 1000L,
mtry = 3L,
probability = TRUE),
lrner_package = "ranger",
lrn_fct = "ranger",
param_train_list = list(probability = TRUE,
mtry = 1L),
param_pred_list = list(),
na_action = "na.keep")
## Training : training
## Problem type : classification
## Status : Not trained
## Number of layers: 3
## Layers trained : 0
## p : 131 | 160 | 367
## n : 50 | 50 | 50
## na.action : na.keep | na.keep | na.keep
We also add a meta-layer, using the weighted mean (a function internal to fuseMLR) as the meta-learner. Similarly to learners, a meta-learner should accept at least the arguments x and y, where x is the design matrix of modality-specific predictions, and should return a model for which the generic predict function can be called to make predictions on a new dataset (see the documentation of createTrainMetaLayer for details). If these criteria are not fulfilled, the explanation provided in D – Interface and wrapping details how to map x and y to the original argument names or how to wrap the original function. In the Appendix we provide information about the available meta-learners.
We use the weighted mean as the meta-learner. Weighted learners allow the meta-model to be more robust against outliers or noisy data: by adjusting the weights, they can downplay the influence of less reliable models or predictions, making the ensemble more stable in the face of unpredictable data. Theoretically, we do not expect this meta-learner to outperform all the modality-specific learners, but rather to achieve a performance level between the worst and the best of the modality-specific learners.
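Conceptually, the combination can be sketched as below. This is only an illustration of the principle, with weights assumed to be proportional to each layer's agreement with the outcome (one minus the Brier score); it is not the internal code of weightedMeanLearner, whose weighting scheme may differ.
# Conceptual sketch of a weighted mean combination (not fuseMLR internals).
# z: matrix of modality-specific predictions (one column per layer),
# y: observed binary outcome coded 0/1. NA handling omitted for brevity.
weighted_mean_sketch <- function (z, y) {
  brier <- colMeans((z - y)^2, na.rm = TRUE)  # per-layer Brier score
  w <- (1 - brier) / sum(1 - brier)           # assumed weighting scheme
  meta_pred <- as.vector(z %*% w)
  return(list(weights = w, predictions = meta_pred))
}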
# Create meta-layer. Missing modality-specific predictions are removed (na.rm).
createTrainMetaLayer(training = training,
meta_layer_id = "meta_layer",
lrner_package = NULL,
lrn_fct = "weightedMeanLearner",
param_train_list = list(),
param_pred_list = list(na_rm = TRUE),
na_action = "na.rm")
## Training : training
## Problem type : classification
## Status : Not trained
## Number of layers: 4
## Layers trained : 0
## p : 131 | 160 | 367
## n : 50 | 50 | 50
## na.action : na.keep | na.keep | na.keep
print(training)
## Training : training
## Problem type : classification
## Status : Not trained
## Number of layers: 4
## Layers trained : 0
## p : 131 | 160 | 367
## n : 50 | 50 | 50
## na.action : na.keep | na.keep | na.keep
The function upsetplot() is available to generate an upset plot of the training data, i.e. an overview of how patients overlap across layers.
upsetplot(object = training, order.by = "freq")
[Upset plot: patient overlap across the training layers]
B.2 – Variable selection
The function varSelection() performs modality-specific variable selection. A user can therefore run variable selection on its own, without training the models.
# Variable selection
set.seed(5467)
var_sel_res <- varSelection(training = training)
## Variable selection on layer geneexpr started.
## Variable selection on layer geneexpr done.
## Variable selection on layer proteinexpr started.
## Variable selection on layer proteinexpr done.
## Variable selection on layer methylation started.
## Variable selection on layer methylation done.
print(var_sel_res)
##    Layer        variable
## 1  geneexpr     ASNS
## 2  geneexpr     ATM
## 3  geneexpr     BAK1
## 4  geneexpr     BRAF
## 5  geneexpr     DVL3
## 6  geneexpr     FOXM1
## 7  geneexpr     MAP2K1
## 8  geneexpr     NF2
## 9  geneexpr     NOTCH1
## 10 geneexpr     PCNA
## 11 geneexpr     PEA15
## 12 geneexpr     PRDX1
## 13 geneexpr     RBM15
## 14 geneexpr     SHC1
## 15 geneexpr     SRC
## 16 geneexpr     TP53BP1
## 17 geneexpr     XRCC5
## 18 geneexpr     YBX1
## 19 proteinexpr  GAPDH
## 20 methylation  cg01894895
## 21 methylation  cg14785449
## 22 methylation  cg24747396
## 23 methylation  cg03813215
## 24 methylation  cg25059899
## 25 methylation  cg16225441
## 26 methylation  cg23935746
## 27 methylation  cg17655614
## 28 methylation  cg23989635
## 29 methylation  cg22679003
## 30 methylation  cg01663570
## 31 methylation  cg25500285
## 32 methylation  cg11251858
## 33 methylation  cg23244421
## 34 methylation  cg09976774
## 35 methylation  cg18470891
## 36 methylation  cg20484306
## 37 methylation  cg02412050
## 38 methylation  cg16293088
## 39 methylation  cg20042228
## 40 methylation  cg07566050
## 41 methylation  cg23641145
## 42 methylation  cg13908523
## 43 methylation  cg20849549
## 44 methylation  cg18331396
## 45 methylation  cg08383063
## 46 methylation  cg16153267
## 47 methylation  cg19254235
## 48 methylation  cg20406374
## 49 methylation  cg22175811
## 50 methylation  cg19393006
## 51 methylation  cg12507125
## 52 methylation  cg01442799
Let us display the training object again to see the update at the variable level.
print(training)
## Training : training
## Problem type : classification
## Status : Not trained
## Number of layers: 4
## Layers trained : 0
## p : 18 | 1 | 33
## n : 50 | 50 | 50
## na.action : na.keep | na.keep | na.keep
For each layer, p now corresponds to the number of selected variables.
B.3 – Train
We use the function fusemlr() to train our models using the subset of selected variables. Here we set use_var_sel = TRUE to use variables obtained from the variable selection step.
set.seed(5462)
fusemlr(training = training,
use_var_sel = TRUE)
## Creating fold predictions.
##   |====================================================================================| 100%
## Training of base model on layer geneexpr started.
## Training of base model on layer geneexpr done.
## Training of base model on layer proteinexpr started.
## Training of base model on layer proteinexpr done.
## Training of base model on layer methylation started.
## Training of base model on layer methylation done.
The display of the training object now includes information about the trained layers.
print(training)
## Training : training
## Problem type : classification
## Status : Trained
## Number of layers: 4
## Layers trained : 4
## Var. sel. used : Yes
## p : 18 | 1 | 33 | 3
## n : 50 | 50 | 50 | 26
## na.action : na.keep | na.keep | na.keep | na.rm
We can also display a summary of training to see more details at the layer level. Information about the training data modality, the variable selection method, and the learner stored at each layer is displayed.
summary(training)
## Training training
## ----------------
## Training : training
## Problem type : classification
## Status : Trained
## Number of layers: 4
## Layers trained : 4
## Var. sel. used : Yes
## p : 18 | 1 | 33 | 3
## n : 50 | 50 | 50 | 26
## na.action : na.keep | na.keep | na.keep | na.rm
## ----------------
##
## Layer geneexpr
## ----------------
## TrainLayer : geneexpr
## Status : Trained
## Nb. of objects stored : 4
## ----------------
## Object(s) on layer geneexpr
##
## ----------------
## TrainData : geneexpr_data
## Layer : geneexpr
## Ind. id. : IDS
## Target : disease
## n : 50
## Missing : 0
## p : 18
## ----------------
##
## ----------------
## VarSel : geneexpr_varsel
## TrainLayer : geneexpr
## Package : Boruta
## Function : Boruta
## ----------------
##
## ----------------
## Learner : geneexpr_lrner
## TrainLayer : geneexpr
## Package : ranger
## Learn function : ranger
## ----------------
##
##
## Layer proteinexpr
## ----------------
## TrainLayer : proteinexpr
## Status : Trained
## Nb. of objects stored : 4
## ----------------
## Object(s) on layer proteinexpr
##
## ----------------
## TrainData : proteinexpr_data
## Layer : proteinexpr
## Ind. id. : IDS
## Target : disease
## n : 50
## Missing : 0
## p : 1
## ----------------
##
## ----------------
## VarSel : proteinexpr_varsel
## TrainLayer : proteinexpr
## Package : Boruta
## Function : Boruta
## ----------------
##
## ----------------
## Learner : proteinexpr_lrner
## TrainLayer : proteinexpr
## Package : ranger
## Learn function : ranger
## ----------------
##
##
## Layer methylation
## ----------------
## TrainLayer : methylation
## Status : Trained
## Nb. of objects stored : 4
## ----------------
## Object(s) on layer methylation
##
## ----------------
## TrainData : methylation_data
## Layer : methylation
## Ind. id. : IDS
## Target : disease
## n : 50
## Missing : 0
## p : 33
## ----------------
##
## ----------------
## VarSel : methylation_varsel
## TrainLayer : methylation
## Package : Boruta
## Function : Boruta
## ----------------
##
## ----------------
## Learner : methylation_lrner
## TrainLayer : methylation
## Package : ranger
## Learn function : ranger
## ----------------
##
##
## MetaLayer
## ----------------
## TrainMetaLayer : meta_layer
## Status : Trained
## Nb. of objects stored : 3
##
## ----------------
## Object(s) on MetaLayer
##
## ----------------
## Learner : meta_layer_lrner
## TrainLayer : meta_layer
## Learn function : weightedMeanLearner
## ----------------
##
## ----------------
## TrainData : modality-specific predictions
## Layer : meta_layer
## Ind. id. : IDS
## Target : disease
## n : 26
## Missing : 0
## p : 3
## ----------------
We use extractModel() to retrieve the list of stored models and extractData() to retrieve training data.
models_list <- extractModel(training = training)
str(object = models_list, max.level = 1L)
## List of 4
##  $ geneexpr   :List of 13
##   ..- attr(*, "class")= chr "ranger"
##  $ proteinexpr:List of 13
##   ..- attr(*, "class")= chr "ranger"
##  $ methylation:List of 13
##   ..- attr(*, "class")= chr "ranger"
##  $ meta_layer : 'weightedMeanLearner' Named num [1:3] 0.63 0.143 0.227
##   ..- attr(*, "names")= chr [1:3] "geneexpr" "proteinexpr" "methylation"
Three random forests (one per modality-specific layer) and the weighted meta-model are returned. The smallest weight is assigned to protein abundance, while the highest is given to gene expression.
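The estimated weights can also be inspected graphically, for example with a simple barplot (the weightedMeanLearner object is a named numeric vector, as the str() output above shows):
# Quick look at the meta-model weights per modality.
meta_weights <- unclass(models_list$meta_layer)  # drop the class attribute for barplot()
barplot(meta_weights,
        ylab = "Weight",
        main = "Weighted mean meta-learner: weight per layer")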
data_list <- extractData(object = training)
str(object = data_list, max.level = 1)
## List of 4
##  $ geneexpr   :'data.frame': 50 obs. of 20 variables:
##  $ proteinexpr:'data.frame': 50 obs. of 3 variables:
##  $ methylation:'data.frame': 50 obs. of 35 variables:
##  $ meta_layer :'data.frame': 26 obs. of 5 variables:
The three simulated training modalities and the meta-layer data (the modality-specific predictions) are returned.
C – Predicting
In this section, we create a testing instance (from the Testing class) and make predictions for new data. This is done analogously to training. Only the testing data modalities are required. Relevant functions are createTesting() and createTestLayer().
# Create testing for predictions
testing <- createTesting(id = "testing",
ind_col = "IDS")
# Create gene expression layer
createTestLayer(testing = testing,
test_layer_id = "geneexpr",
test_data = multi_omics$testing$geneexpr)
## Testing : testing
## Number of layers: 1
## p : 131
## n : 20
# Create protein abundance layer
createTestLayer(testing = testing,
test_layer_id = "proteinexpr",
test_data = multi_omics$testing$proteinexpr)
## Testing : testing
## Number of layers: 2
## p : 131 | 160
## n : 20 | 20
# Create methylation layer
createTestLayer(testing = testing,
test_layer_id = "methylation",
test_data = multi_omics$testing$methylation)
## Testing : testing
## Number of layers: 3
## p : 131 | 160 | 367
## n : 20 | 20 | 20
A summary of testing.
summary(testing)
## Testing testing
## ----------------
## Testing : testing
## Number of layers: 3
## p : 131 | 160 | 367
## n : 20 | 20 | 20
## ----------------
##
## Class : TestData
## name : geneexpr_data
## ind. id. : IDS
## n : 20
## p : 132
##
##
## Class : TestData
## name : proteinexpr_data
## ind. id. : IDS
## n : 20
## p : 161
##
##
## Class : TestData
## name : methylation_data
## ind. id. : IDS
## n : 20
## p : 368
A look at the testing data.
data_list <- extractData(object = testing)
str(object = data_list, max.level = 1)
## List of 3
##  $ geneexpr   :'data.frame': 20 obs. of 132 variables:
##  $ proteinexpr:'data.frame': 20 obs. of 161 variables:
##  $ methylation:'data.frame': 20 obs. of 368 variables:
An upset plot to visualize patient overlap across testing layers.
upsetplot(object = testing, order.by = "freq")
[Upset plot: patient overlap across the testing layers]
The function predict() is available for making predictions on the testing data.
predictions <- predict(object = training, testing = testing)
print(predictions)
## $predicting
## Predicting : testing
## Nb. layers : 4
##
## $predicted_values
##                IDS  geneexpr proteinexpr methylation meta_layer
## 1   participant100        NA   0.6397214   0.1428325  0.3346214
## 2    participant20        NA          NA   0.1404468  0.1404468
## 3    participant24 0.7993976   0.6641056   0.7268421  0.7635696
## 4    participant25 0.4272167          NA          NA  0.4272167
## 5    participant27 0.3198778          NA   0.4055008  0.3425881
## 6    participant28 0.6186016   0.1421365          NA  0.5304799
## 7     participant3 0.7238341   0.8741675          NA  0.7516381
## 8    participant32 0.1294389   0.4466698   0.2283079  0.1972486
## 9    participant34 0.5258817   0.8387246   0.7699016  0.6260620
## 10   participant39 0.6487119          NA   0.7975190  0.6881810
## 11   participant42 0.8276992   0.5456857   0.4091008  0.6922371
## 12   participant51        NA   0.2041437          NA  0.2041437
## 13   participant53        NA   0.5456857          NA  0.5456857
## 14   participant54 0.3565595          NA          NA  0.3565595
## 15   participant55 0.4467897          NA          NA  0.4467897
## 16    participant6        NA   0.2580222   0.8082516  0.5958744
## 17   participant63        NA   0.5456857   0.2290643  0.3512737
## 18   participant64 0.3541286   0.8571754          NA  0.4471665
## 19   participant68 0.6341873          NA   0.8441730  0.6898832
## 20   participant71 0.5962802   0.1375698   0.6070333  0.5331732
## 21   participant75        NA   0.6909714   0.2815952  0.4396060
## 22   participant77 0.2251071   0.9041167   0.1613984  0.3076571
## 23   participant79 0.2962032   0.2041437          NA  0.2791769
## 24   participant81 0.2959206   0.3698698   0.2439008  0.2946624
## 25   participant84 0.7141040   0.4079024   0.6294857  0.6511100
## 26   participant86 0.5700675          NA   0.6853365  0.6006410
## 27   participant94 0.4537627   0.3614794   0.7669524  0.5117734
## 28   participant97        NA          NA   0.2545849  0.2545849
## 29   participant98        NA   0.1375698   0.2154238  0.1853738
Prediction performance is estimated for each layer and for the meta-layer. The Brier Score (BS) is used to assess calibration and the Area Under the Curve (AUC) to evaluate discrimination.
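For \(n\) test individuals with observed binary outcomes \(y_i \in \{0, 1\}\) and predicted probabilities \(\hat{p}_i\), the Brier Score is defined as \(\mathrm{BS} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{p}_i)^2\); lower values indicate better calibration.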
pred_values <- predictions$predicted_values
actual_pred <- merge(x = pred_values,
y = multi_omics$testing$target,
by = "IDS",
all.y = TRUE)
y <- as.numeric(actual_pred$disease == "1")
# On all patients
perf_bs <- sapply(X = actual_pred[ , 2L:5L], FUN = function (my_pred) {
bs <- mean((y[complete.cases(my_pred)] - my_pred[complete.cases(my_pred)])^2)
roc_obj <- pROC::roc(y[complete.cases(my_pred)], my_pred[complete.cases(my_pred)])
auc <- pROC::auc(roc_obj)
performances = rbind(bs, auc)
return(performances)
})
rownames(perf_bs) <- c("BS", "AUC")
print(perf_bs)
##      geneexpr proteinexpr methylation meta_layer
## BS  0.1293884   0.3265617  0.08045023  0.1273704
## AUC 1.0000000   0.5350000  1.00000000  0.9952381
As expected, the performance of the meta-learner in terms of Brier Score is not the best; it falls between the worst and the best modality-specific performance. For the AUC, the meta-learner performs as well as the best modality-specific learner. Performance on protein abundance is the worst, which is consistent with its lowest assigned weight. The weight order of the methylation and gene expression layers is reversed relative to their test performance; however, the performance difference between these two modalities is relatively small, approximately \(0.05\) in terms of Brier Score.
D – Interface and wrapping
This section explains how to resolve argument discrepancies when the original function does not conform to the fuseMLR format introduced in B.1 – Creating a training. These discrepancies can occur in the input (argument names) or the output formats of the user-provided functions.
At the input level, we distinguish common supervised learning arguments from method-specific arguments. The common arguments are a matrix x of independent variables and a response vector y. If the provided function (variable selection or learning function) does not have these two arguments, the discrepancy must be resolved. Moreover, the predict function extending the generic R predict function must accept the arguments object and data. If this is not the case, the discrepancy must be resolved as well.
The provided learner must return a model compatible with an extension of the generic predict function (e.g., predict.glmnet for glmnet models). The predict function should return either a vector of predictions or a list with a predictions field (a vector of predicted values). For binary classification (classes \(0\) and \(1\)), learners can return a two-column matrix or data.frame of class probabilities, where the second column represents the probability of class \(1\) (used by fuseMLR) and the first column that of \(0\) (ignored by fuseMLR). For variable selection, the output should be a vector of selected variables. If these criteria are not met, discrepancies must be resolved.
We offer two main ways to resolve discrepancies: via an interface or by wrapping the original function.
Interface
The interface approach maps the argument names of the original learning function using arguments of createTrainLayer(). In the example below, the gene expression layer is re-created using the svm function from the e1071 package as the learner. A discrepancy arises because predict.svm uses object and newdata as argument names. We also pass a function via the extract_pred_fct argument of createTrainLayer() to extract the predicted probabilities from the svm predictions. Similar arguments are available in createTrainMetaLayer() to set up the meta-layer.
# Re-create the gene expression layer with support vector machine as learner.
createTrainLayer(training = training,
train_layer_id = "geneexpr",
train_data = multi_omics$training$geneexpr,
varsel_package = "Boruta",
varsel_fct = "Boruta",
varsel_param = list(num.trees = 1000L,
mtry = 3L,
probability = TRUE),
lrner_package = "e1071",
lrn_fct = "svm",
param_train_list = list(type = 'C-classification',
kernel = 'radial',
probability = TRUE),
param_pred_list = list(probability = TRUE),
na_action = "na.rm",
x_lrn = "x",
y_lrn = "y",
object = "object",
data = "newdata", # Name discrepancy resolved.
extract_pred_fct = function (pred) {
pred <- attr(pred, "probabilities")
return(pred[ , 1L])
}
)
# Variable selection
set.seed(5467)
var_sel_res <- varSelection(training = training)
set.seed(5462)
training <- fusemlr(training = training,
use_var_sel = TRUE)
print(training)
Wrapping
In the wrapping approach, we define a new function mylasso that runs a Lasso regression from the glmnet package as the meta-learner.
- Wrapping of glmnet.
# We wrap the original functions.
mylasso <- function (x, y,
                     nlambda = 25,
                     nfolds = 5) {
  # Perform cross-validation to find the optimal lambda.
  cv_lasso <- glmnet::cv.glmnet(x = as.matrix(x), y = y,
                                family = "binomial",
                                type.measure = "deviance",
                                nlambda = nlambda,
                                nfolds = nfolds)
  best_lambda <- cv_lasso$lambda.min
  # Refit the Lasso model at the selected lambda.
  lasso_best <- glmnet::glmnet(x = as.matrix(x), y = y,
                               family = "binomial",
                               alpha = 1,
                               lambda = best_lambda)
  lasso_model <- list(model = lasso_best)
  class(lasso_model) <- "mylasso"
  return(lasso_model)
}
- Extension of the generic predict function.
# We extend the generic predict function for objects of class mylasso.
predict.mylasso <- function (object, data) {
glmnet_pred <- predict(object = object$model,
newx = as.matrix(data),
type = "response",
s = object$model$lambda)
return(as.vector(glmnet_pred))
}
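Before plugging the wrapper into fuseMLR, it can be sanity-checked on toy data (purely illustrative; the toy objects below are made up for this check).
# Optional standalone check of the wrapper (not part of the fuseMLR workflow).
set.seed(1)
x_toy <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)
y_toy <- rbinom(100, size = 1, prob = plogis(x_toy[ , 1L]))
fit_toy <- mylasso(x = x_toy, y = y_toy)
head(predict(fit_toy, data = x_toy))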
- Re-set the meta-layer with the wrapped learner and train.
# Re-set the meta-layer with the wrapped Lasso learner.
createTrainMetaLayer(training = training,
meta_layer_id = "meta_layer",
lrner_package = NULL,
lrn_fct = "mylasso",
param_train_list = list(nlambda = 100L),
na_action = "na.impute")
set.seed(5462)
training <- fusemlr(training = training,
use_var_sel = TRUE)
print(training)
Appendix
In addition to any pre-existing R learner that can be used as a meta-learner, the following meta-learners are implemented in fuseMLR (an example of switching the meta-learner is given after the table).
| Learner | Description |
|---|---|
| weightedMeanLearner | The weighted mean meta-learner. It uses the modality-specific predictions to estimate the weights of the modality-specific models. |
| bestLayerLearner | The best modality-specific model is used as the meta-model. |
| cobra | Implements COBRA (COmBined Regression Alternative), an aggregation method for combining predictions from multiple individual learners. |
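For example, the meta-layer could be re-created with bestLayerLearner instead of the weighted mean. The sketch below follows the createTrainMetaLayer() call pattern used earlier; the empty parameter lists are assumptions and may need to be adapted.
# Sketch: use the best modality-specific model as meta-model.
createTrainMetaLayer(training = training,
                     meta_layer_id = "meta_layer",
                     lrner_package = NULL,
                     lrn_fct = "bestLayerLearner",
                     param_train_list = list(),
                     param_pred_list = list(),
                     na_action = "na.rm")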
Conclusion
fuseMLR simplifies late integration predictive modeling by supporting modality-specific models and automatically aggregating their predictions. This can improve predictive performance on heterogeneous biological datasets while preserving modality-level interpretability.
For further details, refer to the package documentation and vignette examples.
References
- Cesaire J. K. Fouodo: fuseMLR: Fusing Machine Learning in R, https://cran.r-project.org/web/packages/fuseMLR/index.html.
We would like to extend our special thanks to Dr. Césaire Fouodo, a professional Machine Learning Scientist and author of fuseMLR, for collaborating with us to deliver tailored solutions that help you extract the best from your data.