Python API Reference

This page gives the Python API reference of xgboost. Please also refer to the Python Package Introduction for more information about the Python package.

Core Data Structure

Core XGBoost Library.

class securexgboost.DMatrix(data, encrypted=False, label=None, missing=None, weight=None, silent=False, feature_names=None, feature_types=None, nthread=None)

Bases: object

Data Matrix used in XGBoost.

DMatrix is an internal data structure used by XGBoost that is optimized for both memory efficiency and training speed. You can construct a DMatrix from numpy arrays.

Parameters:
  • data (string/numpy.array/scipy.sparse/pd.DataFrame/dt.Frame) – Data source of DMatrix. When data is a string, it represents the path to a libsvm-format text file or to a binary file that xgboost can read.
  • label (list or numpy 1-D array, optional) – Label of the training data.
  • missing (float, optional) – Value in the data that should be treated as a missing value. If None, defaults to np.nan.
  • weight (list or numpy 1-D array, optional) –

    Weight for each instance.

    Note

    For ranking tasks, weights are per-group.

    In a ranking task, one weight is assigned to each group (not to each data point). This is because only the relative ordering of data points within each group matters, so it doesn’t make sense to assign weights to individual data points.

  • silent (boolean, optional) – Whether to print messages during construction.
  • feature_names (list, optional) – Set names for features.
  • feature_types (list, optional) – Set types for features.
  • nthread (integer, optional) – Number of threads to use for loading data from numpy array. If -1, uses maximum threads available on the system.
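As a rough illustration of what the missing parameter means, the sketch below marks a sentinel value as missing in a plain Python table. The helper name mark_missing and the sentinel -999.0 are hypothetical, and this is not how DMatrix is implemented; it only shows the idea of treating one chosen value as "no data".

```python
import math

def mark_missing(rows, missing=None):
    """Return a copy of rows with entries equal to `missing` replaced by NaN.

    Hypothetical helper for illustration only. With missing=None the data
    is copied unchanged, because NaN is already the missing marker.
    """
    if missing is None or (isinstance(missing, float) and math.isnan(missing)):
        return [list(row) for row in rows]
    return [[math.nan if value == missing else value for value in row]
            for row in rows]

# Two rows, three features; -999.0 is an assumed sentinel for "no value".
rows = [[1.0, -999.0, 3.0],
        [-999.0, 5.0, 6.0]]
cleaned = mark_missing(rows, missing=-999.0)

# Count how many entries are now treated as missing.
missing_count = sum(math.isnan(v) for row in cleaned for v in row)
```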
feature_names

Get feature names (column labels).

Returns:feature_names
Return type:list or None
feature_types

Get feature types (column types).

Returns:feature_types
Return type:list or None
num_col()

Get the number of columns (features) in the DMatrix.

Returns:number of columns
Return type:int
num_row()

Get the number of rows in the DMatrix.

Returns:number of rows
Return type:int
class securexgboost.Booster(params=None, cache=(), model_file=None)

Bases: object

A Booster of XGBoost.

Booster is the xgboost model. It contains low-level routines for training, prediction, and evaluation.

Parameters:
  • params (dict) – Parameters for boosters.
  • cache (list) – List of cache items.
  • model_file (string) – Path to the model file.
eval(data, name='eval', iteration=0)

Evaluate the model on data.

Parameters:
  • data (DMatrix) – The dmatrix storing the input.
  • name (str, optional) – The name of the dataset.
  • iteration (int, optional) – The current iteration number.
Returns:

result – Evaluation result string.

Return type:

str

eval_set(evals, iteration=0, feval=None)

Evaluate a set of data.

Parameters:
  • evals (list of tuples (DMatrix, string)) – List of items to be evaluated.
  • iteration (int) – Current iteration.
  • feval (function) – Custom evaluation function.
Returns:

result – Evaluation result string.

Return type:

str
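The evaluation result string can be split into numbers for logging or plotting. The sketch below assumes the string follows the upstream XGBoost layout, "[iteration]" followed by tab-separated "name-metric:value" fields; this page does not specify the layout, so treat the format and the helper name parse_eval_result as assumptions.

```python
def parse_eval_result(result):
    """Parse an evaluation string into (iteration, {(name, metric): value}).

    Assumed layout: '[iteration]' followed by whitespace-separated
    'name-metric:value' fields. Dataset names that contain '-' would be
    split incorrectly by this simple sketch.
    """
    fields = result.split()
    iteration = int(fields[0].strip("[]"))
    scores = {}
    for field in fields[1:]:
        key, value = field.rsplit(":", 1)
        name, metric = key.split("-", 1)
        scores[(name, metric)] = float(value)
    return iteration, scores

# Example string written by hand in the assumed layout.
iteration, scores = parse_eval_result(
    "[3]\teval-logloss:0.357756\ttrain-logloss:0.359530"
)
```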

predict(data, output_margin=False, ntree_limit=0, pred_leaf=False, pred_contribs=False, approx_contribs=False, pred_interactions=False, validate_features=True)

Predict with data.

Note

This function is not thread safe.

For each booster object, predict can only be called from one thread. If you want to run prediction using multiple threads, call bst.copy() to make copies of the model object and then call predict() on each copy.
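The copy-per-thread pattern can be sketched as follows. StandInBooster is a hypothetical stand-in with copy() and predict() methods so that the example runs without the library; it is not the real Booster, and predict_in_threads is an illustrative helper name.

```python
import copy
import threading

class StandInBooster:
    """Hypothetical stand-in for a model object; not the real Booster."""

    def __init__(self, scale):
        self.scale = scale

    def copy(self):
        # Mirrors the bst.copy() call mentioned above: an independent copy.
        return copy.deepcopy(self)

    def predict(self, data):
        return [self.scale * x for x in data]

def predict_in_threads(bst, batches):
    """Run predict() on each batch in its own thread, one model copy per thread."""
    results = [None] * len(batches)

    def worker(index, model, batch):
        results[index] = model.predict(batch)

    # Copies are made here, in the calling thread, before any worker starts.
    threads = [
        threading.Thread(target=worker, args=(i, bst.copy(), batch))
        for i, batch in enumerate(batches)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

outputs = predict_in_threads(StandInBooster(2.0), [[1.0, 2.0], [3.0]])
```

Each worker only touches its own copy, so no two threads ever call predict() on the same object.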

Note

Using predict() with DART booster

If the booster object is DART type, predict() will perform dropouts, i.e. only some of the trees will be evaluated. This will produce incorrect results if data is not the training data. To obtain correct results on test sets, set ntree_limit to a nonzero value, e.g.

preds = bst.predict(dtest, ntree_limit=num_round)
Parameters:
  • data (DMatrix) – The dmatrix storing the input.
  • output_margin (bool) – Whether to output the raw untransformed margin value.
  • ntree_limit (int) – Limit number of trees in the prediction; defaults to 0 (use all trees).
  • pred_leaf (bool) – When this option is on, the output will be a matrix of (nsample, ntrees) with each record indicating the predicted leaf index of each sample in each tree. Note that the leaf index of a tree is unique per tree, so you may find leaf 1 in both tree 1 and tree 0.
  • pred_contribs (bool) – When this is True the output will be a matrix of size (nsample, nfeats + 1) with each record indicating the feature contributions (SHAP values) for that prediction. The sum of all feature contributions is equal to the raw untransformed margin value of the prediction. Note the final column is the bias term.
  • approx_contribs (bool) – Approximate the contributions of each feature
  • pred_interactions (bool) – When this is True the output will be a matrix of size (nsample, nfeats + 1, nfeats + 1) indicating the SHAP interaction values for each pair of features. The sum of each row (or column) of the interaction values equals the corresponding SHAP value (from pred_contribs), and the sum of the entire matrix equals the raw untransformed margin value of the prediction. Note the last row and column correspond to the bias term.
  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
Returns:

  • prediction (numpy array)
  • num_preds (number of predictions)
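The additivity properties described for pred_contribs and pred_interactions can be checked by hand. The numbers below are made up for a single sample with two features and are not output from a real model; the sketch only shows which sums are expected to agree.

```python
import math

# Made-up values for one sample with two features; the last entry is the bias.
contribs = [0.25, -0.125, 0.5]   # one row of the (nsample, nfeats + 1) matrix
margin = 0.625                   # assumed raw untransformed margin

# Made-up (nfeats + 1, nfeats + 1) interaction matrix for the same sample.
# The last row and column correspond to the bias term.
interactions = [
    [0.1875, 0.0625, 0.0],
    [0.0625, -0.1875, 0.0],
    [0.0, 0.0, 0.5],
]

# Feature contributions plus bias add up to the margin.
contribs_match_margin = math.isclose(sum(contribs), margin)

# Each row of the interaction matrix adds up to the matching contribution.
row_sums = [sum(row) for row in interactions]
rows_match_contribs = all(
    math.isclose(r, c) for r, c in zip(row_sums, contribs)
)

# The whole interaction matrix adds up to the margin.
total_matches_margin = math.isclose(sum(row_sums), margin)
```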

set_param(params, value=None)

Set parameters into the Booster.

Parameters:
  • params (dict/list/str) – list of key,value pairs, dict of key to value or simply str key
  • value (optional) – value of the specified parameter, when params is str key
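The three accepted shapes of params can be pictured as different spellings of the same list of key/value pairs. The helper normalize_params below is hypothetical and only illustrates that equivalence; the actual set_param may treat edge cases (such as a str key with no value) differently.

```python
def normalize_params(params, value=None):
    """Turn any of the three accepted shapes into a list of (key, value) pairs.

    Hypothetical helper for illustration; not part of the library.
    """
    if isinstance(params, dict):
        return list(params.items())
    if isinstance(params, str):
        if value is None:
            # Assumption of this sketch: a bare key without a value is an error.
            raise ValueError("a str key needs an accompanying value")
        return [(params, value)]
    # Otherwise assume an iterable of (key, value) pairs.
    return [(key, val) for key, val in params]

from_dict = normalize_params({"max_depth": 3, "eta": 0.1})
from_list = normalize_params([("max_depth", 3), ("eta", 0.1)])
from_str = normalize_params("max_depth", 3)
```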
update(dtrain, iteration, fobj=None)

Update for one iteration, with objective function calculated internally. This function should not be called directly by users.

Parameters:
  • dtrain (DMatrix) – Training data.
  • iteration (int) – Current iteration number.
  • fobj (function) – Customized objective function.
class securexgboost.Enclave(enclave_image=None, flags=3, create_enclave=True, log_verbosity=0)

Bases: object

An enclave.

A trusted execution environment used for secure XGBoost.

get_remote_report_with_pubkey()

Get remote attestation report and public key of enclave

get_report_attrs()

Get the enclave public key and remote report

To be called by the RPC service

Must be called after get_remote_report_with_pubkey() is called

Returns:
  • pem_key (proto.NDArray)
  • key_size (int)
  • remote_report (proto.NDArray)
  • remote_report_size (int)
set_report_attrs(pem_key, key_size, remote_report, remote_report_size)

Set the enclave public key and remote report

To be used by RPC client during verification

verify_remote_report_and_set_pubkey()

Verify the received attestation report and set the public key

class securexgboost.CryptoUtils

Bases: object

Crypto utils class

add_client_key(data, data_len, signature, sig_len)

Add the client symmetric key used to encrypt client files

Parameters:
  • data (proto.NDArray) – key used to encrypt client files
  • data_len (int) – length of data
  • signature (proto.NDArray) – signature over data, signed with client private key
  • sig_len (int) – length of signature
decrypt_predictions(key, encrypted_preds, num_preds)

Decrypt encrypted predictions

Parameters:
  • key (byte array) – key used to encrypt client files
  • encrypted_preds (c_char_p) – encrypted predictions
  • num_preds (int) – number of predictions
Returns:

preds – plaintext predictions

Return type:

numpy array

encrypt_data_with_pk(data, data_len, pem_key, key_size)
Parameters:
  • data (byte array) –
  • data_len (int) –
  • pem_key (proto) –
  • key_size (int) –
Returns:

  • encrypted_data (proto.NDArray)
  • encrypted_data_size_as_int (int)

encrypt_file(input_file, output_file, key_file)

Encrypt a file

Parameters:
  • input_file (str) – path to file to be encrypted
  • output_file (str) – path to which encrypted file will be saved
  • key_file (str) – path to key used to encrypt file
generate_client_key(path_to_key)

Generate a new key and save it to path_to_key

Parameters:path_to_key (str) – path to which key will be saved
sign_data(keyfile, data, data_size)
Parameters:
  • keyfile (str) –
  • data (proto.NDArray) –
  • data_size (int) –
Returns:

  • signature (proto.NDArray)
  • sig_len_as_int (int)

Learning API

Training Library containing training routines.

securexgboost.train(params, dtrain, num_boost_round=10, evals=(), early_stopping_rounds=None, evals_result=None, verbose_eval=True, callbacks=None, learning_rates=None)

Train a booster with given parameters.

Parameters:
  • params (dict) – Booster params.
  • dtrain (DMatrix) – Data to be trained.
  • num_boost_round (int) – Number of boosting iterations.
  • evals (list of pairs (DMatrix, string)) – List of items to be evaluated during training. This allows the user to watch performance on the validation set.
  • early_stopping_rounds (int) – Activates early stopping. Validation error needs to decrease at least every early_stopping_rounds round(s) to continue training. Requires at least one item in evals. If there’s more than one, will use the last. Returns the model from the last iteration (not the best one). If early stopping occurs, the model will have three additional fields: bst.best_score, bst.best_iteration and bst.best_ntree_limit. (Use bst.best_ntree_limit to get the correct value if num_parallel_tree and/or num_class appears in the parameters)
  • evals_result (dict) –

    This dictionary stores the evaluation results of all the items in evals (the watchlist).

    Example: with a watchlist containing [(dtest,'eval'), (dtrain,'train')] and params containing 'eval_metric': 'logloss', evals_result contains

    {'train': {'logloss': ['0.48253', '0.35953']},
     'eval': {'logloss': ['0.480385', '0.357756']}}
    
  • verbose_eval (bool or int) – Requires at least one item in evals. If verbose_eval is True then the evaluation metric on the validation set is printed at each boosting stage. If verbose_eval is an integer then the evaluation metric on the validation set is printed at every given verbose_eval boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed. Example: with verbose_eval=4 and at least one item in evals, an evaluation metric is printed every 4 boosting stages, instead of every boosting stage.
  • learning_rates (list or function (deprecated - use callback API instead)) – List of learning rates, one per boosting round, or a customized function that calculates eta from the current round number and the total number of boosting rounds (e.g. to yield learning rate decay).
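A function-style learning rate schedule might look like the sketch below. The argument order (current round, total rounds) follows the description above; the function name decayed_eta and the base_eta and final_eta defaults are hypothetical, and the parameter itself is marked deprecated in favour of the callback API.

```python
def decayed_eta(current_round, total_rounds, base_eta=0.3, final_eta=0.05):
    """Exponentially interpolate from base_eta down to final_eta over the run."""
    if total_rounds <= 1:
        return base_eta
    progress = current_round / (total_rounds - 1)
    return base_eta * (final_eta / base_eta) ** progress

# The same schedule written out as a list, one value per boosting round.
num_boost_round = 5
schedule = [decayed_eta(i, num_boost_round) for i in range(num_boost_round)]
```

The list form is handy when the schedule should be inspected or logged before training starts.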
Returns:

a trained booster model

Return type:

Booster
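The evals_result dictionary shown above can be turned into numeric histories, and the early stopping rule can be replayed on such a history. The sketch below uses the example values from this page; metric_history and best_round are hypothetical helpers, and the sketch assumes a metric where lower is better. It does not call the library.

```python
# Example values copied from the evals_result description above.
evals_result = {
    "train": {"logloss": ["0.48253", "0.35953"]},
    "eval": {"logloss": ["0.480385", "0.357756"]},
}

def metric_history(results, dataset, metric):
    """Return the per-round values of one metric as floats."""
    return [float(v) for v in results[dataset][metric]]

def best_round(history, patience):
    """Return (best_index, stopped_early) for a lower-is-better metric.

    Scanning stops once `patience` rounds have passed without improvement,
    which mirrors the early_stopping_rounds rule described above.
    """
    best_index = 0
    for i, score in enumerate(history):
        if score < history[best_index]:
            best_index = i
        elif i - best_index >= patience:
            return best_index, True
    return best_index, False

eval_logloss = metric_history(evals_result, "eval", "logloss")
best_index, stopped = best_round(eval_logloss, patience=2)
```

With only two rounds of data the metric is still improving, so the scan reaches the end without stopping early.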