Download data from a Kaggle competition and upload it to Azure ML

In some Kaggle competitions, the provided machines cannot handle the volume of data available. In these cases, it can be beneficial to train the model somewhere else.

I wrote a simple script to download the data from Kaggle and upload it to Azure ML. To make things configurable, I created a .env file to store important values such as API tokens.

KAGGLE_USERNAME=""
KAGGLE_KEY=""
AZURE_CLIENT_ID=""
AZURE_TENANT_ID=""
AZURE_CLIENT_SECRET=""
AZURE_SUBSCRIPTION_ID=""
AZURE_RESOURCE_GROUP=""
AZURE_WORKSPACE=""
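Before running anything, it helps to fail fast when one of these variables is missing rather than halfway through the upload. A minimal, stdlib-only sketch (the variable list mirrors the .env file above; the helper name is mine):

```python
import os

# Every variable the script expects to find in the environment
REQUIRED_VARS = [
    "KAGGLE_USERNAME", "KAGGLE_KEY",
    "AZURE_CLIENT_ID", "AZURE_TENANT_ID", "AZURE_CLIENT_SECRET",
    "AZURE_SUBSCRIPTION_ID", "AZURE_RESOURCE_GROUP", "AZURE_WORKSPACE",
]


def missing_vars(environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


# Example usage, right after load_dotenv():
# missing = missing_vars()
# if missing:
#     raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```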

In Kaggle, go to Profile and then Settings to generate an API token. At the end of this step you should have the information to fill the KAGGLE_USERNAME and KAGGLE_KEY environment variables.

Kaggle token creation
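As an alternative to environment variables, the Kaggle CLI also reads credentials from ~/.kaggle/kaggle.json. Here is a sketch of writing that file from the same two values (the helper name is mine, not part of the Kaggle tooling):

```python
import json
import os
from pathlib import Path


def write_kaggle_json(username, key, config_dir=None):
    """Write Kaggle credentials where the CLI looks for them by default."""
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    config_dir.mkdir(parents=True, exist_ok=True)
    token_path = config_dir / "kaggle.json"
    token_path.write_text(json.dumps({"username": username, "key": key}))
    os.chmod(token_path, 0o600)  # the CLI warns when this file is world-readable
    return token_path
```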

In Azure ML, we have to create a new workspace. Right now, I think we can create one workspace per competition, which allows us to organize data and code efficiently. To create the workspace, open the Azure ML Portal, click Workspaces and then New.

Create Azure Machine Learning Workspace

After creating the workspace, open it to get the workspace-related information. At the end of this step you should have the values for the AZURE_SUBSCRIPTION_ID, AZURE_RESOURCE_GROUP and AZURE_WORKSPACE environment variables.

Azure Machine Learning Workspace Information

In Microsoft Entra ID, we have to create an App Registration. Give it a meaningful name and keep the default configuration.

App registration

Access your recently created App registration and go to Overview to get the AZURE_CLIENT_ID and AZURE_TENANT_ID environment variables.

App registration information

Go to Certificates & Secrets to create a new secret. Copy the secret value to fill the AZURE_CLIENT_SECRET environment variable. These three values matter because DefaultAzureCredential, used in the script below, reads AZURE_CLIENT_ID, AZURE_TENANT_ID and AZURE_CLIENT_SECRET directly from the environment.

App registration secret

Next, we have to grant our App registration access to the Azure Machine Learning Workspace. In the Azure Portal, go to All resources and open the recently created Azure Machine Learning Workspace.

In Access control (IAM), click Add and then Add role assignment.

Azure Machine Learning Workspace access control

Select the role AzureML Data Scientist.

Role assignment

In the Members tab, click Select members, search for the App registration you created, select it, and finish the role assignment.

Role assignment – Members

The configuration is done. Now we can use the following script to download the competition data and save it in Azure ML.

import os
import argparse
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.identity import DefaultAzureCredential
from dotenv import load_dotenv


def download_kaggle_competition_data(competition_name, download_path="./data"):
    """
    Downloads and extracts Kaggle competition data using the Kaggle CLI.
    """

    os.makedirs(download_path, exist_ok=True)
    # The Kaggle CLI reads KAGGLE_USERNAME and KAGGLE_KEY from the environment
    os.system(f"kaggle competitions download -c {competition_name} -p {download_path}")
    # The download arrives as a zip archive; extract it in place
    os.system(f"unzip -o {download_path}/*.zip -d {download_path}")

if __name__ == "__main__":
    load_dotenv(override=True)

    parser = argparse.ArgumentParser(description="Download Kaggle competition data and upload it to Azure ML.")
    parser.add_argument("--competition_name", type=str, required=True, help="Name of the Kaggle competition.")
    parser.add_argument("--blob_name", type=str, required=True, help="Name of the blob in Azure ML datastore.")
    args = parser.parse_args()

    print(f"Downloading data for competition: {args.competition_name}")
    download_kaggle_competition_data(args.competition_name)

    print("Connecting to Azure ML Workspace...")
    ml_client = MLClient(
        DefaultAzureCredential(),
        subscription_id=os.environ["AZURE_SUBSCRIPTION_ID"],
        resource_group_name=os.environ["AZURE_RESOURCE_GROUP"],
        workspace_name=os.environ["AZURE_WORKSPACE"]
    )

    print(f"Uploading dataset as an asset with blob name: {args.blob_name}")
    data_asset = Data(
        path="./data",  # Local path where Kaggle data was downloaded and unzipped
        type="uri_folder",  # Specify this is a folder-type asset
        name=args.blob_name,  # Use the blob name provided as argument
        version="1"  # Versioning for tracking changes
    )
    ml_client.data.create_or_update(data_asset)

    print("Process completed successfully!")
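The script shells out to unzip, which is not available everywhere. If you prefer to stay in Python, the extraction step could instead use the standard library's zipfile module. A sketch under the same assumption that all archives live directly in the download folder (the function name is mine):

```python
import zipfile
from pathlib import Path


def extract_all_zips(download_path="./data"):
    """Extract every .zip archive in download_path into that same folder."""
    extracted = []
    for archive in Path(download_path).glob("*.zip"):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(download_path)
        extracted.append(archive.name)
    return extracted
```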

You should have a project folder structure like this.

Project folder structure

Here is an example of how to call the script.

python src/download_data.py --competition_name "jane-street-real-time-market-data-forecasting" --blob_name "jane-street-data"

At the end of the execution, you should be able to see the uploaded data in your Azure Machine Learning Workspace under Assets / Data.

Uploaded data
