If you log an artifact that doesn’t track external files, W&B saves the artifact’s files to W&B servers. This is the default behavior when you log artifacts with the W&B Python SDK.
If you log an artifact that tracks external files, W&B logs metadata about the object, such as the object’s ETag and size. If object versioning is enabled on the bucket, W&B also logs the version ID.
Track an artifact in an external bucket
Use the W&B Python SDK to track references to files stored outside W&B.
- Initialize a run with wandb.init().
- Create an artifact object with wandb.Artifact().
- Specify the reference to the bucket path with the artifact object’s wandb.Artifact.add_reference() method.
- Log the artifact’s metadata with run.log_artifact().
For example, suppose your bucket (s3://my-bucket) has a datasets/mnist/ directory that contains a collection of images. To track the datasets/mnist/ directory as a dataset artifact:
- Provide a name for the artifact, such as "mnist".
- Set the type parameter to "dataset" when you construct the artifact object (wandb.Artifact(type="dataset")).
- When you call wandb.Artifact.add_reference(), provide the path to the datasets/mnist/ directory as an Amazon S3 URI (s3://my-bucket/datasets/mnist/).
- Log the artifact with run.log_artifact().
These steps create a reference artifact named mnist:latest.
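The steps above can be sketched roughly as follows. This is a minimal illustration, not the official sample: the helper names build_s3_uri and log_mnist_reference, the job_type value, and the project name in the usage note are assumptions made for the example.

```python
def build_s3_uri(bucket: str, prefix: str) -> str:
    """Return an Amazon S3 URI such as s3://my-bucket/datasets/mnist/."""
    return f"s3://{bucket}/{prefix.lstrip('/')}"


def log_mnist_reference(project: str, bucket: str = "my-bucket") -> None:
    """Log a dataset artifact that references files under datasets/mnist/."""
    # Imported here so build_s3_uri can be used without wandb installed.
    import wandb

    with wandb.init(project=project, job_type="track-reference") as run:
        # The name and type match the example in the text.
        artifact = wandb.Artifact(name="mnist", type="dataset")
        # add_reference records metadata about the objects, not their contents.
        artifact.add_reference(build_s3_uri(bucket, "datasets/mnist/"))
        run.log_artifact(artifact)
```

Calling log_mnist_reference(project="my-project") with a hypothetical project name would log the reference artifact, provided your environment has credentials for the bucket.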
W&B Artifacts support any Amazon S3 compatible interface, including CoreWeave AI Object Storage and MinIO. The scripts in this section work without modification with both providers when you set the AWS_S3_ENDPOINT_URL environment variable to point at your CoreWeave AI Object Storage or MinIO server.
Download an artifact from an external bucket
After you log a reference artifact, you can download it later to retrieve the original files from the bucket. When W&B downloads a reference artifact, it retrieves the files from the underlying bucket using the metadata recorded when you logged the artifact. If your bucket has object versioning enabled, W&B retrieves the object version that corresponds to the state of the file at the time the artifact was logged. As you evolve the contents of your bucket, you can point to the exact version of your data a given model was trained on, because the artifact serves as a snapshot of your bucket during the training run.
The APIs for downloading artifacts are the same for both reference and non-reference artifacts.
If you overwrite files as part of your workflow, W&B recommends that you enable object versioning on your storage buckets. If versioning is enabled, W&B can retrieve the correct version of the file when you download an artifact, even if the file has been overwritten since you logged the artifact.
Based on your use case, read the instructions to enable object versioning for your cloud provider: AWS, Google Cloud, or Azure.
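As a rough sketch of the download flow described above, the following helper fetches a previously logged reference artifact. The function names, the job_type value, and the optional root argument are choices made for this illustration; the artifact name mnist comes from the earlier example.

```python
from typing import Optional


def artifact_reference(name: str, alias: str = "latest") -> str:
    """Build the name:alias string that identifies an artifact version."""
    return f"{name}:{alias}"


def download_reference_artifact(
    project: str,
    name: str = "mnist",
    alias: str = "latest",
    root: Optional[str] = None,
) -> str:
    """Download a reference artifact and return the local directory path."""
    # Imported here so artifact_reference can be used without wandb installed.
    import wandb

    with wandb.init(project=project, job_type="download") as run:
        artifact = run.use_artifact(artifact_reference(name, alias), type="dataset")
        # W&B fetches the referenced files from the bucket using the recorded metadata.
        return artifact.download(root=root)
```

Passing alias="v0" instead of the default pins the download to a specific artifact version rather than the most recent one.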
Add and download an external file from a bucket
A typical end-to-end workflow uploads a dataset to an Amazon S3 bucket, tracks it with a reference artifact, then downloads it.
W&B also publishes reports with end-to-end walkthroughs on how to track artifacts by reference for Google Cloud and Azure.
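The upload, track, and download workflow might look like the sketch below. It assumes boto3 is installed and can find AWS credentials through its default mechanism. The helper names, job_type values, and the local directory layout are hypothetical, and the upload loop is one simple way to copy a directory tree, not the approach W&B prescribes.

```python
from pathlib import Path, PurePosixPath


def s3_key_for(local_file: Path, local_root: Path, prefix: str) -> str:
    """Map a local file to an object key under the given bucket prefix."""
    relative = local_file.relative_to(local_root)
    return str(PurePosixPath(prefix.strip("/")) / PurePosixPath(*relative.parts))


def upload_track_download(local_root: str, bucket: str, prefix: str, project: str) -> str:
    """Upload a directory to S3, log a reference artifact, then download it."""
    # Imported here so s3_key_for can be used without boto3 or wandb installed.
    import boto3
    import wandb

    root = Path(local_root)
    s3 = boto3.client("s3")
    for path in sorted(root.rglob("*")):
        if path.is_file():
            s3.upload_file(str(path), bucket, s3_key_for(path, root, prefix))

    # First run: record a reference to the uploaded objects.
    with wandb.init(project=project, job_type="upload") as run:
        artifact = wandb.Artifact("mnist", type="dataset")
        artifact.add_reference(f"s3://{bucket}/{prefix.strip('/')}/")
        run.log_artifact(artifact)

    # Second run: retrieve the referenced files from the bucket.
    with wandb.init(project=project, job_type="download") as run:
        artifact = run.use_artifact("mnist:latest", type="dataset")
        return artifact.download()
```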
Cloud storage credentials
To read from and write to your external bucket, W&B needs credentials configured in your environment. W&B uses the default mechanism to look for credentials based on the cloud provider you use. To learn more about the credentials used, read the documentation from your cloud provider:
| Cloud provider | Credentials documentation |
|---|---|
| CoreWeave AI Object Storage | CoreWeave AI Object Storage documentation |
| AWS | Boto3 documentation |
| Google Cloud | Google Cloud documentation |
| Azure | Azure documentation |
For AWS, if the bucket is not in your default region, set the AWS_REGION environment variable to match the bucket region.
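If you prefer to set these variables from Python rather than from your shell, a sketch like the following could run before any W&B calls. The helper name configure_s3_environment is an assumption made for this example, and the region and endpoint values in the usage note are placeholders; only the variable names AWS_REGION and AWS_S3_ENDPOINT_URL come from this page.

```python
import os
from typing import MutableMapping, Optional


def configure_s3_environment(
    region: Optional[str] = None,
    endpoint_url: Optional[str] = None,
    env: Optional[MutableMapping[str, str]] = None,
) -> MutableMapping[str, str]:
    """Set the S3-related environment variables that were provided."""
    # Default to the real process environment; tests can pass a plain dict.
    target = os.environ if env is None else env
    if region is not None:
        target["AWS_REGION"] = region
    if endpoint_url is not None:
        target["AWS_S3_ENDPOINT_URL"] = endpoint_url
    return target
```

For example, configure_s3_environment(region="us-west-2") sets only the region, while passing endpoint_url points S3 requests at an S3-compatible server such as MinIO.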
Track an artifact in a filesystem
A common pattern for accessing datasets is to expose an NFS mount point to a remote filesystem on all machines running training jobs. This can be an alternative to a cloud storage bucket because, from the perspective of the training script, the files appear local to your filesystem. To track an artifact in a filesystem:
- Initialize a run with wandb.init().
- Create an artifact object with wandb.Artifact().
- Specify the reference to the filesystem path with the artifact object’s wandb.Artifact.add_reference() method.
- Log the artifact’s metadata with run.log_artifact().
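Replace values enclosed in angle brackets (< >) with your own values.
A filesystem reference URI has three parts. The first component is the file:// prefix that denotes the use of filesystem references. The second component is the root / of the filesystem. The remaining components are the path to the directory or file you want to track.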
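To make the three components concrete, the following sketch assembles a filesystem reference URI from an absolute POSIX path. The helper name filesystem_reference_uri is hypothetical and is not part of the W&B SDK.

```python
from pathlib import PurePosixPath


def filesystem_reference_uri(path: str) -> str:
    """Build a file:// URI from an absolute POSIX path."""
    posix_path = PurePosixPath(path)
    if not posix_path.is_absolute():
        raise ValueError(f"expected an absolute path, got {path!r}")
    # "file://" prefix + root "/" + the remaining path components.
    return "file://" + str(posix_path)
```

Because the prefix ends in two slashes and an absolute path begins with one, the resulting URI contains three consecutive slashes. PurePosixPath drops any trailing slash from the input.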
As an example, suppose you have a filesystem mounted at /mount that contains a datasets/mnist/ directory and a models/cnn/my_model.h5 file.
To track the datasets/mnist/ directory as a dataset artifact, add a reference to file:///mount/datasets/mnist/.
This creates a reference artifact mnist:latest that points to the files stored under /mount/datasets/mnist/.
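A minimal sketch of tracking the mounted dataset directory might look like this. The constant and function names are assumptions made for the example; the artifact name, type, and path follow the text.

```python
MOUNT_ROOT = "/mount"
# file:// prefix + filesystem root + path to the tracked directory.
DATASET_URI = f"file://{MOUNT_ROOT}/datasets/mnist/"


def track_mounted_dataset(project: str) -> None:
    """Log a dataset artifact that references /mount/datasets/mnist/."""
    # Imported here so the constants can be used without wandb installed.
    import wandb

    with wandb.init(project=project) as run:
        artifact = wandb.Artifact("mnist", type="dataset")
        artifact.add_reference(DATASET_URI)
        run.log_artifact(artifact)


# Usage: track_mounted_dataset(project="<project>")
```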
Similarly, to track a model stored at models/cnn/my_model.h5, add a reference to file:///mount/models/cnn/my_model.h5.
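For the model file, a comparable sketch references a single file instead of a directory. The artifact name "cnn" and the type "model" are assumptions chosen for this illustration, as is the function name.

```python
MODEL_URI = "file:///mount/models/cnn/my_model.h5"


def track_mounted_model(project: str, artifact_name: str = "cnn") -> None:
    """Log a model artifact that references a single file on the mount."""
    # Imported here so MODEL_URI can be used without wandb installed.
    import wandb

    with wandb.init(project=project) as run:
        # The "model" type is an assumption; use whatever type fits your project.
        artifact = wandb.Artifact(artifact_name, type="model")
        artifact.add_reference(MODEL_URI)
        run.log_artifact(artifact)


# Usage: track_mounted_model(project="<project>")
```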
Download an artifact from an external filesystem
After you log a filesystem reference artifact, you can download it later to retrieve the original files from the mounted filesystem. Download files from a referenced filesystem using the same APIs as non-reference artifacts:
- Initialize a run with wandb.init().
- Use the wandb.Run.use_artifact() method to indicate the artifact you want to download.
- Call the artifact’s wandb.Artifact.download() method to download the files from the referenced filesystem.
Downloading the mnist artifact copies the contents of /mount/datasets/mnist to the artifacts/mnist:v0/ directory.
Artifact.download() throws an error if it can’t reconstruct the artifact. For example, if an artifact contains a reference to a file that was overwritten, Artifact.download() throws an error because the artifact can no longer be reconstructed.
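The download steps and the failure case can be sketched as follows. Because this page does not state which exception type Artifact.download() raises, the sketch catches Exception broadly and re-raises with context. That choice, along with both helper names, is an assumption of this example. The default_download_dir helper only mirrors the directory named in the text; W&B chooses the real location itself.

```python
from pathlib import PurePosixPath


def default_download_dir(name: str, version: str) -> str:
    """Illustrative only: the artifacts/<name>:<version> layout from the text."""
    return str(PurePosixPath("artifacts") / f"{name}:{version}")


def download_mounted_dataset(project: str) -> str:
    """Download the mnist reference artifact from the mounted filesystem."""
    # Imported here so default_download_dir can be used without wandb installed.
    import wandb

    with wandb.init(project=project) as run:
        artifact = run.use_artifact("mnist:latest", type="dataset")
        try:
            return artifact.download()
        except Exception as err:
            # For example, a referenced file was overwritten after logging.
            raise RuntimeError("could not reconstruct artifact mnist:latest") from err


# Usage: download_mounted_dataset(project="<project>")
```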