Previously, I used AWS SageMaker Studio for model training in my work. However, when I received a generous $10,000 credit from Google Cloud for Startups, I decided to transition our training environment to Vertex AI Workbench.
This article explores the usability differences between SageMaker and Vertex AI and documents our migration process.
Building the Model Training Environment
Creating the Dockerfile
In SageMaker, the application code was not included in the container image. Instead, we used dependencies to load external code and entry_point to specify a shell script that switched Conda environments and executed the code.
SageMaker Training Script:
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()

estimator = Estimator(
    image_uri="*****.dkr.ecr.ap-northeast-1.amazonaws.com/bert-training:latest",
    role=role,
    instance_type="ml.g4dn.2xlarge",
    instance_count=1,
    base_job_name="pre-training",
    output_path="s3://sagemaker/output_data/pre_training",
    sagemaker_session=session,
    entry_point="pre-training.sh",
    dependencies=["bert-training"],
    checkpoint_s3_uri="s3://sagemaker/checkpoints/summary",
    checkpoint_local_path="/opt/ml/checkpoints/",
    use_spot_instances=True,
    max_wait=120*60*60,
    max_run=120*60*60,
    hyperparameters={
        "wandb_api_key": "*******",
        "mlm": True,
        "do_train": True,
        "field_hs": 64,
        "output_dir": "/opt/ml/checkpoints/",
        "data_root": "/opt/ml/input/data/input_data/",
        "data_fname": "pre_training_data",
        "num_train_epochs": 3,
        "save_steps": 100,
        "per_device_train_batch_size": 8
    },
    tags=[{'Key': 'Project', 'Value': 'AIResearch'}]
)

estimator.fit({"input_data": "s3://sagemaker/input_data/pre_training_data.csv"})
Unlike SageMaker, a Vertex AI custom-container job does not offer an entry_point parameter that loads external code into the container, so we included the application code directly in the container image and installed the necessary packages without using a Conda environment.
Dockerfile for Vertex AI:
# Dockerfile for model training on Vertex AI
FROM gcr.io/deeplearning-platform-release/pytorch-gpu.1-12

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

WORKDIR /app

RUN apt-get update && apt-get install -y \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir \
    pandas==1.4.3 \
    scikit-learn==1.1.1 \
    transformers==4.26.0 \
    numpy==1.23.1 \
    imbalanced-learn==0.10.1 \
    wandb \
    python-dotenv \
    google-cloud-storage

COPY . /app

ENTRYPOINT ["python", "main.py"]
Deploying the Container Image
We built the Docker image and pushed it to Google Cloud's Artifact Registry with the following commands:
docker buildx build --platform linux/amd64 -f Dockerfile.vertex -t asia-northeast1-docker.pkg.dev/ai/bert-training/pre-training:latest .
gcloud auth configure-docker asia-northeast1-docker.pkg.dev
docker push asia-northeast1-docker.pkg.dev/ai/bert-training/pre-training:latest
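The image name in these commands follows the Artifact Registry pattern of a regional host, then project, repository, image, and tag. As a small illustration, the hypothetical helper below (not part of our codebase) composes that URI from its parts so the same value can be reused when submitting the training job.

```python
def artifact_registry_image_uri(
    location: str,
    project: str,
    repository: str,
    image: str,
    tag: str = "latest",
) -> str:
    """Compose a Docker image URI for Artifact Registry.

    Hypothetical helper for illustration; it only formats a string.
    """
    host = f"{location}-docker.pkg.dev"  # regional Docker host
    return f"{host}/{project}/{repository}/{image}:{tag}"


# Values taken from the commands above.
uri = artifact_registry_image_uri(
    "asia-northeast1", "ai", "bert-training", "pre-training"
)
print(uri)
```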
Migrating Input Data
We transferred input data from S3 to Cloud Storage using the following steps:
- Open 'Create a Transfer Job' in Cloud Storage.
- Select 'Amazon S3' as the source and 'Google Cloud Storage' as the destination.
- Create an IAM user with AmazonS3ReadOnlyAccess and enter the provided credentials.
- Specify the destination bucket and start the transfer job.
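We did this through the console, but the same steps could probably be scripted. The following is a minimal sketch, assuming the google-cloud-storage-transfer Python client is installed; the helper names, the description text, and the one-time schedule are illustrative choices, not what we actually ran.

```python
import datetime


def build_s3_to_gcs_transfer_job(
    project_id, source_bucket, sink_bucket,
    aws_access_key_id, aws_secret_access_key, run_date=None,
):
    """Build the body of a one-time S3 -> Cloud Storage transfer job.

    Hypothetical helper: it only assembles a dict and makes no API call.
    """
    run_date = run_date or datetime.datetime.now(datetime.timezone.utc).date()
    # Same start and end date means the job runs once.
    one_time = {"year": run_date.year, "month": run_date.month, "day": run_date.day}
    return {
        "project_id": project_id,
        "description": "One-time copy of training data from S3",
        "schedule": {
            "schedule_start_date": one_time,
            "schedule_end_date": one_time,
        },
        "transfer_spec": {
            "aws_s3_data_source": {
                "bucket_name": source_bucket,
                # Credentials of the IAM user with AmazonS3ReadOnlyAccess.
                "aws_access_key": {
                    "access_key_id": aws_access_key_id,
                    "secret_access_key": aws_secret_access_key,
                },
            },
            "gcs_data_sink": {"bucket_name": sink_bucket},
        },
    }


def submit_transfer_job(transfer_job):
    """Send the job to Storage Transfer Service (requires GCP credentials)."""
    # Imported lazily so the builder above works without the package installed.
    from google.cloud import storage_transfer

    client = storage_transfer.StorageTransferServiceClient()
    job = dict(transfer_job, status=storage_transfer.TransferJob.Status.ENABLED)
    request = storage_transfer.CreateTransferJobRequest({"transfer_job": job})
    return client.create_transfer_job(request)


# Example (not executed here; bucket names follow the paths used in this article):
# job = build_s3_to_gcs_transfer_job("ai", "sagemaker", "bert-training", "<key id>", "<secret>")
# submit_transfer_job(job)
```

Keeping the request body separate from the API call makes it easy to inspect before any credentials are sent.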
Writing the Training Script
In Vertex AI, the training script can be written as follows:
from google.cloud import aiplatform

def create_custom_job(
    project: str,
    display_name: str,
    container_image_uri: str,
    location: str = 'us-central1',
    args: list = None,
    bucket_name: str = None,
):
    # Set the project, region, and staging bucket for subsequent SDK calls.
    aiplatform.init(project=project, location=location, staging_bucket=bucket_name)
    custom_job = {
        "display_name": display_name,
        "worker_pool_specs": [{
            "machine_spec": {
                "machine_type": "n1-highmem-32",
                "accelerator_type": "NVIDIA_TESLA_V100",
                "accelerator_count": 4,
            },
            "replica_count": 1,
            "container_spec": {
                "image_uri": container_image_uri,
                # Passed to the container's ENTRYPOINT as command-line arguments.
                "args": args or [],
            },
        }]
    }
    job = aiplatform.CustomJob(**custom_job)
    # Block until the job finishes.
    job.run(sync=True)

project_id = 'ai'
display_name = 'pre-training'
container_image_uri = 'asia-northeast1-docker.pkg.dev/ai/bert-training/pre-training:latest'
bucket_name = 'gs://bert-training'
location = 'us-central1'  # Region where the job runs (see Regional Constraints)
args = [
    "--mlm",
    "--do_train",
    "--field_hs", "64",
    "--data_fname", "pre_training_data",
    "--num_train_epochs", "1",
    "--save_steps", "100",
    "--per_device_train_batch_size", "8",
    "--gcs_bucket_name", "bert-training",
    "--gcs_blob_name", "vertex/input_data/pre_training_data.csv",
    "--local_data_path", "./data/action_history/pre_training_data.csv"
]

create_custom_job(
    project=project_id,
    display_name=display_name,
    container_image_uri=container_image_uri,
    bucket_name=bucket_name,
    location=location,
    args=args,
)
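The args list plays the role that the hyperparameters dictionary played in the SageMaker estimator. As a rough sketch, a dictionary in that style could be converted into such a list as follows; the helper name is hypothetical, and it assumes the training script treats boolean options as bare flags, as the --mlm and --do_train entries suggest.

```python
def hyperparameters_to_args(hyperparameters: dict) -> list:
    """Convert a SageMaker-style hyperparameters dict into CLI arguments.

    Hypothetical helper. True becomes a bare flag, False and None are
    skipped, and every other value is passed as "--key", "value".
    """
    args = []
    for key, value in hyperparameters.items():
        flag = f"--{key}"
        if value is True:
            args.append(flag)
        elif value is False or value is None:
            continue
        else:
            # Container args must be strings.
            args.extend([flag, str(value)])
    return args


example_args = hyperparameters_to_args({
    "mlm": True,
    "do_train": True,
    "field_hs": 64,
    "num_train_epochs": 1,
})
print(example_args)
# ['--mlm', '--do_train', '--field_hs', '64', '--num_train_epochs', '1']
```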
Vertex AI does not automatically place input data into a container path as SageMaker does with S3 paths. Therefore, the application must explicitly handle the download and upload of training artifacts.
from google.cloud import storage
import os


def download_csv_from_gcs(bucket_name, source_blob_name, destination_file_path):
    """Download a single object from Cloud Storage to a local file."""
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    # Make sure the parent directory exists before writing the file.
    os.makedirs(os.path.dirname(destination_file_path) or ".", exist_ok=True)
    blob.download_to_filename(destination_file_path)
    print(f"CSV file {source_blob_name} downloaded to {destination_file_path}.")


def upload_directory_to_gcs(bucket_name, source_directory, destination_blob_prefix):
    """Upload every file under source_directory, preserving relative paths."""
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    for root, _, files in os.walk(source_directory):
        for file in files:
            local_path = os.path.join(root, file)
            relative_path = os.path.relpath(local_path, source_directory)
            # Object names use forward slashes regardless of the local OS.
            blob_path = os.path.join(destination_blob_prefix, relative_path).replace(os.sep, "/")
            blob = bucket.blob(blob_path)
            blob.upload_from_filename(local_path)
            print(f"Uploaded {local_path} to {blob_path}.")
These functions are integrated into the application as follows:
def main(args):
    output_dir = args.output_dir  # Directory where training outputs are saved
    bucket_name = args.gcs_bucket_name  # Cloud Storage bucket name
    destination_blob_prefix = 'vertex/output_data/pre_training'
    # (Training code omitted; it writes its outputs to output_dir.)
    # Upload the training outputs once training has finished.
    upload_directory_to_gcs(bucket_name, output_dir, destination_blob_prefix)


if __name__ == "__main__":
    parser = define_main_parser()
    opts = parser.parse_args()
    # Fetch the input data before training starts.
    download_csv_from_gcs(opts.gcs_bucket_name, opts.gcs_blob_name, opts.local_data_path)
    main(opts)
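The define_main_parser function is not shown in this article. A minimal sketch of what such a parser might look like, assuming argparse and the flags passed in args, is given below; the types, the defaults, and the ./output default for --output_dir are guesses for illustration, not our actual settings.

```python
import argparse


def define_main_parser() -> argparse.ArgumentParser:
    """Illustrative parser covering the flags used in this article."""
    parser = argparse.ArgumentParser(description="BERT pre-training on Vertex AI")
    # Boolean switches are bare flags.
    parser.add_argument("--mlm", action="store_true")
    parser.add_argument("--do_train", action="store_true")
    # Training settings (defaults are assumptions).
    parser.add_argument("--field_hs", type=int, default=64)
    parser.add_argument("--data_fname", type=str)
    parser.add_argument("--num_train_epochs", type=int, default=1)
    parser.add_argument("--save_steps", type=int, default=100)
    parser.add_argument("--per_device_train_batch_size", type=int, default=8)
    parser.add_argument("--output_dir", type=str, default="./output")
    # Cloud Storage locations for input and output.
    parser.add_argument("--gcs_bucket_name", required=True)
    parser.add_argument("--gcs_blob_name", required=True)
    parser.add_argument("--local_data_path", required=True)
    return parser


# Parse the same argument list that is passed to the custom job.
opts = define_main_parser().parse_args([
    "--mlm",
    "--do_train",
    "--field_hs", "64",
    "--data_fname", "pre_training_data",
    "--num_train_epochs", "1",
    "--save_steps", "100",
    "--per_device_train_batch_size", "8",
    "--gcs_bucket_name", "bert-training",
    "--gcs_blob_name", "vertex/input_data/pre_training_data.csv",
    "--local_data_path", "./data/action_history/pre_training_data.csv",
])
print(opts.gcs_bucket_name, opts.num_train_epochs)
```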
Additional Notes
Regional Constraints
When we attempted to use the NVIDIA Tesla V100 GPU in the asia-northeast1 (Tokyo) region, we encountered errors. Further investigation revealed significant restrictions on the GPUs available in this region, prompting us to switch to the us-central1 (Iowa) region. It was a lesson in the importance of considering regional resource differences before migration.
https://cloud.google.com/vertex-ai/docs/quotas
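One way to catch this earlier would be a small pre-flight check before submitting a job. The sketch below is hypothetical: the table records only what we observed ourselves (V100 worked in us-central1 and failed in asia-northeast1), so it would need to be filled in from the official documentation before being relied on.

```python
# Regions where we observed each accelerator to work. This is NOT an
# authoritative list; it reflects only the experience described above.
OBSERVED_AVAILABILITY = {
    "NVIDIA_TESLA_V100": {"us-central1"},
}


def check_accelerator_region(accelerator_type, location, availability=None):
    """Raise ValueError if the accelerator is not recorded for the region."""
    availability = OBSERVED_AVAILABILITY if availability is None else availability
    regions = availability.get(accelerator_type)
    if regions is None:
        raise ValueError(f"No availability data recorded for {accelerator_type}")
    if location not in regions:
        raise ValueError(
            f"{accelerator_type} is not recorded as available in {location}; "
            f"known regions: {sorted(regions)}"
        )


# Passes silently for the region we ended up using.
check_accelerator_region("NVIDIA_TESLA_V100", "us-central1")

# Reports the problem before any job is submitted.
try:
    check_accelerator_region("NVIDIA_TESLA_V100", "asia-northeast1")
except ValueError as err:
    print(err)
```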
Spot Instances Unavailable
On SageMaker we frequently used spot instances to save costs, but Vertex AI custom training did not offer an equivalent option when we migrated. This was not a major issue given our ample GCP credits, but it is worth noting for anyone planning a budget. In fact, the guaranteed resource allocation on Vertex AI turned out to be a benefit: our jobs ran without the risk of the training interruptions that come with SageMaker's spot instances.