This document gives a basic walkthrough of the securexgboost Python package. There’s also a sample Jupyter notebook at demo/python/jupyter/e2e-demo.ipynb.
To install Secure XGBoost, follow the instructions in the Installation Guide.
To verify your installation, run the following in Python:
import securexgboost as xgb
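The Secure XGBoost Python module loads data from files. The formats used in this walkthrough are LibSVM text files, Secure XGBoost binary files, and CSV files.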
(See /tutorials/input_format for a detailed description of the text input format.)
The data is stored in a DMatrix object.
To load a libsvm text file or a Secure XGBoost binary file into DMatrix:
dtrain = xgb.DMatrix('train.svm.txt')
dtest = xgb.DMatrix('test.svm.buffer')
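For readers unfamiliar with the LibSVM text format, the following is a minimal sketch of how one line per example is laid out: a label followed by sparse `index:value` pairs. The helper `to_libsvm_line` is hypothetical and not part of securexgboost; whether feature indices should start at 0 or 1 is an assumption to check against the input format reference.

```python
def to_libsvm_line(label, features):
    """Format one example as a LibSVM text line (hypothetical helper).

    `features` maps a feature index to its value. Zero-valued features
    are omitted because the format is sparse.
    """
    pairs = [
        f"{idx}:{val:g}"
        for idx, val in sorted(features.items())
        if val != 0
    ]
    return " ".join([f"{label:g}"] + pairs)


# Two made-up examples: (label, {feature_index: value})
rows = [(1, {0: 0.5, 3: 1.0}), (0, {1: 2.0, 2: 0.0})]
lines = [to_libsvm_line(label, feats) for label, feats in rows]
print("\n".join(lines))
```

Writing these lines to a file such as train.svm.txt would give a text file in the shape the DMatrix constructor above expects.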
To load a CSV file into DMatrix:
# label_column specifies the index of the column containing the true label
dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0')
dtest = xgb.DMatrix('test.csv?format=csv&label_column=0')
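The CSV options are passed as a query string appended to the file path. As a rough sketch, the snippet below writes a tiny CSV with the label in column 0 and builds the matching URI string. The helper `csv_uri` and the sample values are illustrative assumptions, not part of securexgboost.

```python
import csv
import os
import tempfile


def csv_uri(path, label_column=0):
    """Build the 'path?format=csv&label_column=N' string (hypothetical helper)."""
    return f"{path}?format=csv&label_column={label_column}"


# Made-up data: the first column (index 0) holds the true label.
rows = [[1, 0.5, 1.2], [0, 0.1, 3.4]]
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "train.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

uri = csv_uri(path, label_column=0)
print(uri)  # this string is what would be passed to xgb.DMatrix(...)
```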
Note: Secure XGBoost does not support categorical features.
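One common workaround, offered here as a suggestion rather than something this guide prescribes, is to one-hot encode categorical columns before writing the data file. The sketch below uses only the standard library; `one_hot` is a hypothetical helper.

```python
def one_hot(values):
    """One-hot encode a list of categorical values (hypothetical helper).

    Returns (categories, rows): the sorted distinct categories and one
    0/1 row per input value, with a single 1 marking its category.
    """
    categories = sorted(set(values))
    position = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[position[v]] = 1
        rows.append(row)
    return categories, rows


colors = ["red", "green", "red", "blue"]
categories, encoded = one_hot(colors)
print(categories)
print(encoded)
```

Each encoded row can then be written out as ordinary numeric feature columns.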
Secure XGBoost can use either a list of pairs or a dictionary to set parameters. For instance:
Booster parameters
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
param['nthread'] = 4
param['eval_metric'] = 'auc'
You can also specify multiple eval metrics:
param['eval_metric'] = ['auc', 'ams@0']
# alternatively:
# plst = list(param.items())  # list() is required in Python 3, where items() returns a view
# plst += [('eval_metric', 'ams@0')]
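The two ways of setting parameters can be compared side by side. This sketch only manipulates plain Python objects and does not call the library; the name `param_dict` is introduced for illustration.

```python
param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}

# Dictionary form: several metrics go in a list under a single key.
param_dict = dict(param, eval_metric=['auc', 'ams@0'])

# List-of-pairs form: the key may repeat, once per metric.
# list(...) is needed because dict.items() returns a view in Python 3,
# and a view does not support +=.
plst = list(param.items())
plst += [('eval_metric', 'auc'), ('eval_metric', 'ams@0')]

metrics = [value for key, value in plst if key == 'eval_metric']
print(metrics)
```

The list form is the only one that can carry the same key more than once, which is why it is shown as the alternative for multiple metrics.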
Specify validation sets to watch performance
evallist = [(dtest, 'eval'), (dtrain, 'train')]
Training a model requires a parameter list and data set.
num_round = 10
bst = xgb.train(param, dtrain, num_round, evallist)
Methods including update and boost from securexgboost.Booster are designed for internal usage only. The wrapper function securexgboost.train does some pre-configuration including setting up caches and some other parameters.
If you have a validation set, you can use early stopping to find the optimal number of boosting rounds.
Early stopping requires at least one set in evals. If there’s more than one, it will use the last.
bst = xgb.train(param, dtrain, num_round, evals=evallist, early_stopping_rounds=10)
The model trains until the validation score stops improving: the validation error must decrease at least once every early_stopping_rounds rounds for training to continue.
This works with both metrics to minimize (RMSE, log loss, etc.) and metrics to maximize (MAP, NDCG, AUC). Note that if you specify more than one evaluation metric, the last one in param['eval_metric'] is used for early stopping.
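The stopping rule described above can be illustrated with a small standalone sketch. This is not the library's implementation; `early_stopping_round` is a hypothetical function that applies the same rule to a list of made-up validation scores.

```python
def early_stopping_round(scores, early_stopping_rounds, maximize=False):
    """Return (stop_round, best_round) for a sequence of validation scores.

    Stops once `early_stopping_rounds` rounds have passed without
    improving on the best score seen so far. If that never happens,
    stop_round is the index of the last round.
    """
    best_score, best_round = None, 0
    for rnd, score in enumerate(scores):
        improved = best_score is None or (
            score > best_score if maximize else score < best_score
        )
        if improved:
            best_score, best_round = score, rnd
        elif rnd - best_round >= early_stopping_rounds:
            return rnd, best_round
    return len(scores) - 1, best_round


# Made-up validation errors (lower is better).
errors = [0.30, 0.25, 0.24, 0.26, 0.25, 0.27, 0.20]
stop_round, best_round = early_stopping_round(errors, early_stopping_rounds=3)
print(stop_round, best_round)
```

In this example the later improvement to 0.20 is never reached, which shows why too small a value for early_stopping_rounds can stop training prematurely.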
A model that has been trained or loaded can perform predictions on data sets.
import numpy as np

# 7 entities, each with 10 features
data = np.random.rand(7, 10)
dtest = xgb.DMatrix(data)
ypred = bst.predict(dtest)
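With the 'binary:logistic' objective used earlier, predictions are assumed here to be probabilities, as in upstream XGBoost, so a threshold is needed to obtain class labels. The sketch below uses made-up values in place of real model output; `to_labels` and the 0.5 threshold are illustrative choices.

```python
def to_labels(probabilities, threshold=0.5):
    """Map predicted probabilities to 0/1 class labels (hypothetical helper)."""
    return [1 if p > threshold else 0 for p in probabilities]


# Made-up values standing in for the output of bst.predict(...)
ypred_example = [0.91, 0.12, 0.50, 0.73]
labels = to_labels(ypred_example)
print(labels)
```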