Akshit Arora
Data Scientist @ NVIDIA | @_AkshitArora
Speed, UX, and Iteration
The Way to Win at Data Science
Machine Learning Lifecycle
Also, Table of Contents
Random Forests
Intuition
Random Forests
Example Dataset
Random Forests
Sub-sampled dataset 1 / 3
Random Forests
Sub-sampled dataset 2 / 3
Random Forests
Sub-sampled dataset 3 / 3
Random Forests
Building a decision tree
Random Forests
Measuring "Highest Improvement" at a split
A potential split S, defined by a (feature, split_value) pair, divides this node's dataset of N rows into left and right subsets with N_left and N_right rows respectively. The improvement of split S is computed as:
improvement = Gini_parent - impurity
where the weighted impurity of the split is:
impurity = (N_left / N) * Gini_left + (N_right / N) * Gini_right
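As a concrete, NumPy-based illustration of these formulas (not from the slides), a small sketch that computes Gini impurity and split improvement on a toy set of labels:

import numpy as np

def gini(labels):
    # Gini impurity = 1 - sum of squared class probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def improvement(parent, left, right):
    # improvement = Gini_parent - weighted impurity of the two children
    n = len(parent)
    impurity = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - impurity

# Toy example: a split that perfectly separates the two classes
labels = np.array([0, 0, 0, 1, 1, 1])
print(improvement(labels, labels[:3], labels[3:]))  # 0.5, the best achievable here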
Random Forests
What is Gini Impurity?
Random Forests
Independent Decision Trees
Random Forests
sklearn implementation
from sklearn.ensemble import RandomForestClassifier as sklRF
from sklearn.metrics import accuracy_score
import multiprocessing as mp

skl_rf_params = {
    'n_estimators': 25,
    'max_depth': 13,
    'n_jobs': mp.cpu_count()  # train trees in parallel on all CPU cores
}

skl_rf = sklRF(**skl_rf_params)
skl_rf.fit(X_train, y_train)

y_pred = skl_rf.predict(X_test)
print("sklearn RF Accuracy Score: ", accuracy_score(y_test, y_pred))
Random Forests
Acceleration opportunities
Random Forests
Split Algorithm - Min/Max histograms
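A rough sketch of the idea (illustrative only, not cuML's actual GPU kernels): candidate split values are limited to evenly spaced bin boundaries between each feature's min and max, so a node evaluates at most n_bins candidate splits per feature instead of one per unique value.

import numpy as np

def minmax_bins(feature_values, n_bins=15):
    # Evenly spaced bin boundaries between the feature's min and max
    lo, hi = feature_values.min(), feature_values.max()
    return np.linspace(lo, hi, n_bins + 1)[1:-1]

x = np.random.default_rng(0).normal(size=10_000)
print(minmax_bins(x, n_bins=15))  # 14 interior candidate split values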
Random Forests
Split Algorithm - Quantiles
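Likewise, a rough sketch for quantile-based binning (illustrative only): bin boundaries follow the empirical quantiles of the feature, so each bin holds roughly the same number of samples even when the distribution is skewed.

import numpy as np

def quantile_bins(feature_values, n_bins=15):
    # Bin boundaries at the feature's empirical quantiles
    qs = np.linspace(0, 100, n_bins + 1)[1:-1]
    return np.unique(np.percentile(feature_values, qs))

x = np.random.default_rng(0).exponential(size=10_000)  # skewed feature
print(quantile_bins(x, n_bins=15))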
Random Forests
RAPIDS cuML implementation
from cuml import RandomForestClassifier as cuRF
from sklearn.metrics import accuracy_score

cu_rf_params = {
    'n_estimators': 25,
    'max_depth': 13,
    'max_features': 8,
    'n_bins': 15  # number of bins for the histogram-based split algorithm
}

cu_rf = cuRF(**cu_rf_params)
cu_rf.fit(X_train, y_train)

y_pred = cu_rf.predict(X_test)
print("cuML RF Accuracy Score: ", accuracy_score(y_test, y_pred))
Random Forests
Results - Benchmark Details
Random Forests
Results - Higgs Dataset
Random Forests
Results - Synthetic Dataset
Data Processing Evolution
Faster Data Access, Less Data Movement
Data Movement and Transformation
The Bane of Productivity and Performance
Learning from Apache Arrow
Data Processing Evolution
Faster Data Access, Less Data Movement
Open Source Data Science Ecosystem
Familiar Python APIs
RAPIDS
End-to-End GPU Accelerated Data Science
Random Forests
Using cuDF for data loading
from cuml import RandomForestClassifier as cuRF
from cuml.preprocessing.model_selection import train_test_split
from cuml.metrics import accuracy_score
import cudf

# Load the dataset directly into GPU memory as a cuDF DataFrame
data = cudf.read_csv('dataset.csv')
X_train, X_test, y_train, y_test = \
    train_test_split(data, 'label', train_size=0.8)

cu_rf_params = {
    'n_estimators': 25,
    'max_depth': 13,
    'max_features': 8,
    'n_bins': 15
}

cu_rf = cuRF(**cu_rf_params)
cu_rf.fit(X_train, y_train)

y_pred = cu_rf.predict(X_test)
print("cuML RF Accuracy Score: ", accuracy_score(y_test, y_pred))
Why cuDF?
cuDF Benchmarks
Random Forests
Dask cuML implementation
from cuml.dask.common import utils as dask_utils
from cuml.dask.ensemble import RandomForestClassifier as cumlDaskRF
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
from sklearn import datasets, model_selection
import dask_cudf
import cudf
import pandas as pd
import numpy as np
# This will use all GPUs on the local host by default
cluster = LocalCUDACluster(threads_per_worker=1)
c = Client(cluster)
# Query the client for all connected workers
workers = c.has_what().keys()
n_workers = len(workers)
n_streams = 8 # Performance optimization
# Data parameters
train_size = 100000
test_size = 1000
n_samples = train_size + test_size
n_features = 20
# Random Forest building parameters
max_depth = 12
n_bins = 16
n_trees = 1000
# Generate Data on host
X, y = datasets.make_classification(n_samples=n_samples, n_features=n_features,
                                    n_clusters_per_class=1, n_informative=int(n_features / 3),
                                    random_state=123, n_classes=5)
y = y.astype(np.int32)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size)
n_partitions = n_workers
# First convert to cudf (with real data, you would likely load in cuDF format to start)
X_train_cudf = cudf.DataFrame.from_pandas(pd.DataFrame(X_train))
y_train_cudf = cudf.Series(y_train)
# Partition with Dask
# In this case, each worker will train on 1/n_partitions fraction of the data
X_train_dask = dask_cudf.from_cudf(X_train_cudf, npartitions=n_partitions)
y_train_dask = dask_cudf.from_cudf(y_train_cudf, npartitions=n_partitions)
# Persist to cache the data in active memory
X_train_dask, y_train_dask = \
dask_utils.persist_across_workers(c, [X_train_dask, y_train_dask], workers=workers)
cuml_model = cumlDaskRF(max_depth=max_depth, n_estimators=n_trees, n_bins=n_bins, n_streams=n_streams)
cuml_model.fit(X_train_dask, y_train_dask)
wait(cuml_model.rfs) # Allow asynchronous training tasks to finish
Random Forests
Dask cuML implementation
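This slide presumably shows the distributed prediction step; a minimal sketch of how it might look, continuing the variables from the previous slide (exact return types vary slightly across cuML versions):

from sklearn.metrics import accuracy_score

# Distribute the test set the same way as the training set
X_test_cudf = cudf.DataFrame.from_pandas(pd.DataFrame(X_test))
X_test_dask = dask_cudf.from_cudf(X_test_cudf, npartitions=n_partitions)

# predict() runs on all workers; compute() gathers the result into a single cuDF Series
y_pred = cuml_model.predict(X_test_dask).compute()
print("Dask cuML RF Accuracy Score: ", accuracy_score(y_test, y_pred.to_pandas()))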
XGBoost models
Forest Inference Library implementation
from cuml import ForestInference

fm = ForestInference.load(filename=model_path,
                          algo='BATCH_TREE_REORG',
                          output_class=True,
                          threshold=0.50,
                          model_type='xgboost')

# Perform prediction with the model loaded from model_path
fil_preds = fm.predict(X_validation)
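For context, the model at model_path can come from an ordinary XGBoost training run; a minimal sketch of producing one (the file name and parameters here are illustrative, not from the slides):

import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
params = {'objective': 'binary:logistic', 'max_depth': 8, 'tree_method': 'gpu_hist'}
bst = xgb.train(params, dtrain, num_boost_round=100)

# FIL loads the serialized booster directly from this path
model_path = 'xgb_model.model'
bst.save_model(model_path)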
XGBoost Models
Forest Inference Library
Hyper Parameter Optimization w/ RAPIDS
Hyper Parameter Optimization w/ RAPIDS
Huge speedups translate into >7x TCO reduction
Based on sample Random Forest training code from cloud-ml-examples repository, running on Azure ML. 10 concurrent workers with 100 total runs, 100M rows, 5-fold cross-validation per run.
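As a simplified, single-GPU stand-in for that sweep (the parameter grid and scoring here are illustrative; the actual example runs on Azure ML workers with cross-validation), each HPO trial simply trains and scores a cuML Random Forest with a different parameter combination:

from itertools import product
from cuml import RandomForestClassifier as cuRF
from sklearn.metrics import accuracy_score

search_space = {
    'n_estimators': [100, 250, 500],
    'max_depth': [8, 13, 18],
}

best_score, best_params = 0.0, None
for n_estimators, max_depth in product(*search_space.values()):
    model = cuRF(n_estimators=n_estimators, max_depth=max_depth)
    model.fit(X_train, y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    if score > best_score:
        best_score, best_params = score, {'n_estimators': n_estimators, 'max_depth': max_depth}

print("best accuracy:", best_score, "best params:", best_params)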
Today we learned!