Python Package Introduction

This document gives a basic walkthrough of the securexgboost Python package. There is also a sample Jupyter notebook at demo/python/jupyter/e2e-demo.ipynb.

Install Secure XGBoost

To install Secure XGBoost, follow the instructions in the Installation Guide.

To verify your installation, run the following in Python:

import securexgboost as xgb

Data Interface

The Secure XGBoost Python module can load data from:

  • LibSVM text format file
  • Comma-separated values (CSV) file

(See /tutorials/input_format for a detailed description of the text input format.)
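
As a rough illustration of the LibSVM text layout, the sketch below parses a single line of the basic form `<label> <index>:<value> ...`. The helper name parse_libsvm_line and the sample line are hypothetical, and the sketch ignores optional extras such as comments or query IDs. It is not part of the securexgboost API.

```python
def parse_libsvm_line(line):
    """Split one basic LibSVM line into (label, {feature_index: value})."""
    tokens = line.split()
    label = float(tokens[0])          # first token is the label
    features = {}
    for token in tokens[1:]:          # remaining tokens are index:value pairs
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


# Hypothetical sample line: label 1, with three non-zero features.
label, features = parse_libsvm_line("1 3:0.5 7:1.0 12:2.25")
```

Only the features that appear on the line are stored, which is why the format suits sparse data.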

The data is stored in a DMatrix object.

  • To load a LibSVM text file or a Secure XGBoost binary file into DMatrix:

    dtrain = xgb.DMatrix('train.svm.txt')
    dtest = xgb.DMatrix('test.svm.buffer')
    
  • To load a CSV file into DMatrix:

    # label_column specifies the index of the column containing the true label
    dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0')
    dtest = xgb.DMatrix('test.csv?format=csv&label_column=0')
    

    Note

    Categorical features not supported

    Note that Secure XGBoost does not support categorical features.
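
To make the CSV loading convention above concrete, here is a minimal sketch that writes a small file with the label in column 0 and builds the matching URI string. The rows, the temporary path, and the choice of a headerless file are assumptions made for illustration. The final DMatrix call is left as a comment because it requires an installed securexgboost.

```python
import csv
import os
import tempfile

# Hypothetical rows: the first column is the label, the rest are numeric features.
rows = [
    [1, 0.5, 1.2, 3.4],
    [0, 0.1, 0.0, 2.2],
    [1, 0.9, 1.1, 0.3],
]

# Write the rows to a temporary CSV file (no header row, by assumption).
csv_path = os.path.join(tempfile.mkdtemp(), "train.csv")
with open(csv_path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

# label_column is the index of the column holding the true label.
label_column = 0
uri = f"{csv_path}?format=csv&label_column={label_column}"

# With the package installed, the URI would be passed as in the text above:
# dtrain = xgb.DMatrix(uri)
```

If the label were stored in a different column, only label_column would need to change.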

Setting Parameters

Secure XGBoost can use either a list of pairs or a dictionary to set parameters. For instance:

  • Booster parameters

    param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
    param['nthread'] = 4
    param['eval_metric'] = 'auc'
    
  • You can also specify multiple eval metrics:

    param['eval_metric'] = ['auc', 'ams@0']
    
    # alternatively:
    # plst = list(param.items())
    # plst += [('eval_metric', 'ams@0')]
    
  • Specify validation sets to watch performance:

    evallist = [(dtest, 'eval'), (dtrain, 'train')]
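
The two parameter styles above can be compared in a short sketch. A dictionary holds one value per key, so the list-of-pairs form is one way to repeat a key such as eval_metric. The parameter values are taken from the examples above, and nothing here calls into securexgboost.

```python
# Dictionary form: one value per key.
param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}

# List-of-pairs form: copy the dictionary items into a real list first,
# because dict.items() returns a view that does not support += in Python 3.
plst = list(param.items())

# The same key may now appear more than once.
for metric in ['auc', 'ams@0']:
    plst.append(('eval_metric', metric))
```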
    

Training

Training a model requires a parameter list and a data set.

num_round = 10
bst = xgb.train(param, dtrain, num_round, evallist)

Methods such as update and boost on securexgboost.Booster are intended for internal use only. The wrapper function securexgboost.train performs some pre-configuration, including setting up caches and other parameters.

Early Stopping

If you have a validation set, you can use early stopping to find the optimal number of boosting rounds. Early stopping requires at least one set in evals. If evals contains more than one set, the last one is used.

train(..., evals=evals, early_stopping_rounds=10)

The model trains until the validation score stops improving. The validation error needs to decrease at least once every early_stopping_rounds rounds for training to continue.

This works with both metrics to minimize (RMSE, log loss, etc.) and metrics to maximize (MAP, NDCG, AUC). Note that if you specify more than one evaluation metric, the last one in param['eval_metric'] is used for early stopping.
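
The stopping rule described above can be sketched in plain Python. This is an illustration of the patience logic only, not the library's implementation. The function name early_stopping_round and the score sequence are hypothetical.

```python
def early_stopping_round(scores, early_stopping_rounds, maximize=False):
    """Return (best_iteration, stop_iteration) for a list of validation scores.

    stop_iteration is the round at which training would halt, or the last
    round if the score keeps improving often enough.
    """
    best_score = None
    best_iteration = 0
    for i, score in enumerate(scores):
        improved = best_score is None or (
            score > best_score if maximize else score < best_score
        )
        if improved:
            best_score, best_iteration = score, i
        elif i - best_iteration >= early_stopping_rounds:
            # No improvement within the allowed window: stop here.
            return best_iteration, i
    return best_iteration, len(scores) - 1


# Hypothetical validation errors (lower is better), with a window of 3 rounds.
errors = [0.50, 0.40, 0.35, 0.36, 0.37, 0.36, 0.34]
best, stopped = early_stopping_round(errors, early_stopping_rounds=3)
```

In this run the best error occurs at round 2 and training stops at round 5, so the later value 0.34 is never reached. A larger window would trade extra training rounds for a chance to find it.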

Prediction

A model that has been trained or loaded can perform predictions on data sets.

import numpy as np

# 7 entities, each containing 10 features
data = np.random.rand(7, 10)
dtest = xgb.DMatrix(data)
ypred = bst.predict(dtest)