Text classification to detect toxic comments using Azure ML Services for Kaggle Jigsaw competition

Attribution

My work here is focused on tooling, monitoring, execution, clean pipeline and integration with Azure services. Other code was adopted from many public kernels and github repositories that can be found here and here. I tried to keep the references to original kernels or githubs in the code comments, when appropritate. I apologize if I missed some references or quotes, and if so, please let me know.

About this solution

This is the pipeline I developed during the Kaggle Jigsaw Toxic comment classification challenge. In this competition, I focused more on tools rather than on winning or developing novel models.

The goal of developing and publishing this is to share reusable code that will help other people run Kaggle competitions on Azure ML Services.

This is the list of features:

Everything is a hyperparameter approach. Config files control everything from model parameters to feature generation and cv.
Run experiments locally or remotely using Azure ML Workbench
Run experiments on GPU-based Azure DSVM machines
Easily start and stop VMs
Add and remove VMs to and from the VM fleet
Zero manual data download for new VMs - files are downloaded from Azure Storage on demand by code.
Caching feature transformations in Azure Storage
Shutting down the VMs at the end of experiment
Time and memory usage logging
Keras/Tensorboard integration
Usage of Sacred library and logging experiment reporducibility data to CosmosDB
Integration with Telegram Bot and Telegram notifications
Examples of running LightGBM, sklearn models, Keras/Tensorflow

Presentations and blog posts

This solution is accompanying a few presentations that I gave on PyData meetup in Ljubljana and AI meetup in Zagreb. Links to the slides:

I am also writing a blog post which is going to published here.

Technical overview of the solution

This sample solution allows you to run text classification experiments via command line or via Azure ML Workbench. To see, how to run, configure and monitor experiments, refer to the Standard Operating Procedures section. If you wish to extend or modify the code, please refer to [Modifying the solution]

Pipeline

Pipeline intends to be extendable and configurable to support “everything is a hyperparameter” approach.

Pipeline: code structure

Architecture

Below is the diagram of compute architecture. I used cheap Windows DSVM (burst instances) to run series of experiments and stop the machines after all experiments are finished. I used 2-3 CPU VMs and 2 GPU vms to run experiments. I used minimial sized collections for CosmosDB to store “python/sacred” experiment information. Information from Azure ML runs is stored in a separate storage account, which has 1 folder per experiment, automatically populated by AML.

Training pipeline architecture on azure

Experiment flow

Run az ml command or use AML WB to start an experiment
Azure ML Experimentation services send your code to VSTS/git special branch
Azure ML Experimentation spins off a docker container on the appropriate compute and monitors the progress
The pipeline code takes care of downloading all necessary files, so it can be ran on completely new VM
Python Sacred library saves detailed telemetry to CosmosDB
Python Sacred library notifies you via Telegram when experiment starts and ends, together with experiment results.
When experiment ends, Azure ML saves all the resulting files to your Azure Storage account

TODO: draw better diagram

Experiment flow doodle drawing

Deploying solution

This is not a turnkey solution. You will probably have to fix something or customize the code to your needs. Here is the approximate instruction how to set up the new deployment on your local machine from scratch.

Run git clone https://github.com/vykhand/kaggle-jigsaw to clone the git repository
Install AML Workbench (AMLWB)
Add github folder to AMLWB ()
Log in to Azure using CLI
Use the Example script to set up new VM
Create Azure Storage and load data to it
Create Cosmos DB for Sacred experiment monitoring.
(Semi-optional) Create and configure Telegram bot.
To be able to auto-shutdown VMs, get Azure subscription keys to be able to use Python management SDKs.
Configure your runconfig environment variables, copying them from Example VM configuration to your amlconfig\<vm>.runconfig file and populate them with your own values.
If you are setting up a GPU VM, be sure to follow the instruction or copy the the Example GPU config
You are now good to go. Refer to Standard Operating Procedures to understand how to run and configure experiments

Below you will find the details on (almost) each step

Setting up Azure ML Services

In this competition, I used experiment monitoring facility and notebook facility. Azure ML has many other capabilities as described at http://aka.ms/AzureMLGettingStarted

Getting the data and uploading it to storage account

Data for this competition is not distributed with this repository. You need to download it from the competition page located here. You can use Kaggle API to download data from command line.

There are also multiple other external files that I used for this competition:

To be able to run this solution, you need to set up your storage account, upload the required files there and configure the following variables in your .runconfig file:

  "AZURE_STORAGE_ACCOUNT": ""
  "AZURE_STORAGE_KEY": ""
  "AZURE_CONTAINER_NAME": "kaggle-jigsaw"

Setting up Azure CLI and logging in

AZCLI should be set up together with Azure ML Workbench.

To make sure your PATH variable is set up to use the AML WB environment, you should either start cmd.exe or powershell from AMLWB, or open terminal with VS code in your project (if VS Code was set up by AMLWB)

Verify az verison:

az --version

You need to login with az by typing the command and following the instructions

az login

then you need to set your subscriptipon using the commands

# show subscriptions
az account list
# set subscription
az account set -s <subscription_name>

Setting up Azure Service principal

This is only needed to shutdown the VMs automatically. In order to set it up, you need to configure the following environment variables in your runconfig file:

  "AZURE_CLIENT_ID":  ""
  "AZURE_CLIENT_SECRET": ""
  "AZURE_TENANT_ID": ""
  "AZURE_SUBSCRIPTION_ID":

This document describes how to find this information.

Setting up Cosmos DB

Create new cosmosDB instance with MongoDB API as default API. You don’t need anything else, as Sacred library will take care of creating your collections.

You need to specify your mongodb connection string using this env variable in your .runconfig file.

"COSMOS_URL": "mongodb://<>:<>"

this string can be found in the “Connection String” section of your CosmosDB in the Azure Portal.

Create and configure Telegram bot

This is non-essential but I don’t have a config option to switch it off. So if you don’t want to use it, you’ll have to comment all the lines like this:

**_experiment.observers.append(u.get_telegram_observer())

To configure telegram bot, please follow this instruction to set it up, and then update telegram.json file.

{
    "token": "your telegram token",
    "chat_id": "chat id"
}

Standard operating procedures

Running experiments from command line

Running GRU model with default config:

az ml experiment submit -c dsvm001 run.py GRU

Running with configuration file (modified file copied from conf_template)

az ml experiment submit -c dsvm001 run.py --conf experiment_conf/GRU.yaml

Running experiments with AMLWB

you can run run.py script with AML WB on your favourite compute, specifying configuration file or modeltype as an argument

Accessing experiment files

After you run the experiment, all the files are available in the output Azure Storage. You can easily access these files using Azure ML Workbench:

Downloading experiment outputs

You can also locate your Azure Storage that is created together with your Azure ML Experimentation account and download files from there:

![Accessing your experiment files using Portal Storage Explorer(img/storage_outputs.png)

Adding new compute

Copy and modify example_dsvm.json or example_nc6_dsvm.json
Copy and modify example_dsvm.azcli, replacing the example_dsvm with your VM name
Run the commands in your azcli file

az group create -n example_dsvm -l northeurope

# now let's create the DSVM based on the JSON configuration file you created earlier.
# note we assume the mydsvm.json config file is placed in the "docs" sub-folder.
az group deployment create -g example_dsvm --template-uri https://raw.githubusercontent.com/Azure/DataScienceVM/master/Scripts/CreateDSVM/Ubuntu/azuredeploy.json --parameters @scripts/example_dsvm.json

# find the FQDN (fully qualified domain name) of the VM just created.

az vm show -g example_dsvm -n example_dsvm -d

# attach the DSVM compute target
# it is a good idea to use FQDN in case the IP address changes after you deallocate the VM and restart it

az ml computetarget attach remotedocker --name example_dsvm --address example_dsvm.northeurope.cloudapp.azure.com --username <username> --password "<yourpass>"

# prepare the Docker image on the DSVM
az ml experiment prepare -c example_dsvm

Deallocating machines

run

az vm deallocate -g <vm_group> -n <vm_name>

Running series of experiments

specify folder name and compute target. the script will run all the experiments and will shut down the VM at the end

python run_folder.py experiments_conf\folder_name dsvm001

Monitoring GPU usage

watch nvidia-smi

Monitoring Tensorflow using Tensorboard

Note the tensorflow logdir link in your experiment log:

Tensorflow link from Keras

You just need the logdir name. Then log into your GPU VM and run:

tensorboard --logdir ~/.azureml/share/<your projectpath>/<logdir name>

Retrieving experiment results from CosmosDB

You can use the portal or VS Code plugin to access the data in CosmosDB:

CosmosDB data

You can also set up the PowerBI dashboards that connect to your CosmosDB.

Modifying solution

Programmers love developing their own frameworks. There is a chance that you will hate mine. But in the case you’d like it, here are some pointers on modifying and extending it.

Project structure

Folder name	Description
`aml_config`	Configuration files for compute
`conf_templates`	Experiment configuration templates for different models
`experiment_conf`	configurations of the individual experiments. No need to save every experiment here
`kaggle-jigsaw`	Main python package
`notebooks`	Jupyter notebooks with some musings
`scripts`	scripts to create compute
`root folder`	Run scripts and readme

Python package structure

Folder name	Description
`experiments`	Sacred experiments for a particular model type
`modeling`	Cross-validation logic, base classes for classifiers, and individual model classes
`preprocessing`	Base class for all preprocessors and data transofrmation classes
`constants.py`	This is where you can redefine default file names etc.
`util.py`	Auxilary routines for logging, dealing with Azure and other miscellaneous things

Utilites util.py

responsible for

Logging: time, memory
Communicating with Azure Storage
- Get file from Azure storage “cache”
- Put file to cache
- check if file is available
Shutting down the VMs
Handling the cache for experiment preprocessed data
- calculates feature param md5 hash and downloads from Azure if file is available with this hash

Experiments

Code here simply chooses appropriate classes for model and preporcessors, and sets up Sacred.

Modeling

Implements base classes for models and specific classes. Also implements cross-validation logic.

Preprocessing

Implements base preprocessor class and specific classes for specific needs.

Run scripts

Generic runner for all models run.py
Run a folder of experiments on a VM and then shut it down run_folder.py
Train embeddings on a vm train_w2v.py