This walkthrough demonstrates how to create, track, and use a dataset artifact with W&B. By the end, you will have logged a dataset as a versioned artifact to W&B and downloaded it in a subsequent run. This lets you reproducibly share datasets across experiments and track them as inputs and outputs of your runs.

Log in to W&B

Import the W&B library and log in to W&B. If you haven’t done so already, sign up for a free W&B account.
import wandb

wandb.login()

Initialize a run

Use wandb.init() to initialize a run. This generates a background process to sync and log data. Provide a project name and a job type:
# Create a W&B run. The job type 'upload-dataset' indicates that this run
# creates and uploads a dataset artifact.
with wandb.init(project="artifacts-example", job_type="upload-dataset") as run:
    # Your code here
    pass

Create an artifact object

Create an artifact object with wandb.Artifact(). Provide a name for the artifact and a description of the file type for the name and type parameters, respectively. For example, the following code snippet demonstrates how to create an artifact called 'bicycle-dataset' of type 'dataset':
artifact = wandb.Artifact(name="bicycle-dataset", type="dataset")
For more information about how to construct an artifact, see Construct artifacts.

Add the dataset to the artifact

Add a file to the artifact. Common file types include models and datasets. The following example adds a dataset named dataset.h5 that is saved locally on your machine to the artifact:
# Add a file to the artifact's contents
artifact.add_file(local_path="dataset.h5")
Replace the filename dataset.h5 in the previous code snippet with the path to the file you want to add to the artifact.
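If you do not have a dataset file at hand, a small placeholder file is enough to try the walkthrough. The following is a minimal sketch that uses only the Python standard library. The helper name write_placeholder_dataset, the CSV format, the file name, and the column names are all hypothetical and are not part of W&B.

```python
import csv
from pathlib import Path


def write_placeholder_dataset(path, num_rows=5):
    """Write a tiny CSV file to stand in for a real dataset (hypothetical helper)."""
    path = Path(path)
    # Create the parent directory if it does not exist yet
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["bike_id", "wheel_size_in"])  # made-up column names
        for i in range(num_rows):
            writer.writerow([i, 26 + i % 3])  # made-up values
    return path


dataset_path = write_placeholder_dataset("placeholder-dataset.csv")

# Inside the run from the previous steps, you could then add the file with:
# artifact.add_file(local_path=str(dataset_path))
```

The placeholder only exists so that the add_file call has a real file to point at. Swap in your own dataset path when you have one.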

Log the dataset

Use the W&B run object’s wandb.Run.log_artifact() method to both save your artifact version and declare the artifact as an output of the run.
# Save the artifact version to W&B and mark it
# as the output of this run
run.log_artifact(artifact)
When you log an artifact, W&B creates a 'latest' alias by default. For more information about artifact aliases and versions, see Create a custom alias and Create new artifact versions, respectively.

Putting this together, your script so far should look like this:
import wandb

wandb.login()

with wandb.init(project="artifacts-example", job_type="upload-dataset") as run:
    artifact = wandb.Artifact(name="bicycle-dataset", type="dataset")
    artifact.add_file(local_path="dataset.h5")
    run.log_artifact(artifact)
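When you refer to a logged artifact by name, the alias follows the artifact name after a colon, as in "bicycle-dataset:latest" in the download step of this walkthrough. The following sketch builds such a reference string. The helper artifact_reference is hypothetical and not a W&B function, and the example assumes that the first logged version can be addressed by the version index "v0".

```python
def artifact_reference(name, alias="latest"):
    """Build a 'name:alias' reference string (hypothetical helper, not a W&B function)."""
    if ":" in name:
        raise ValueError(f"name already contains an alias: {name!r}")
    return f"{name}:{alias}"


# Reference the most recent version through the default 'latest' alias
latest_ref = artifact_reference("bicycle-dataset")

# Reference a fixed version; assumes the version index "v0" exists
first_ref = artifact_reference("bicycle-dataset", "v0")
```

Pinning a fixed version instead of 'latest' can help keep a downstream run reproducible when newer versions of the dataset are logged later.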

Download and use the artifact

Now that the dataset is logged as an artifact, you can pull it into other runs as a tracked input. The following code example demonstrates the steps you can take to use an artifact you’ve logged and saved to the W&B servers:
  1. Initialize a new run object with wandb.init().
  2. Use the run object’s wandb.Run.use_artifact() method to specify which artifact to use. This returns an artifact object.
  3. Use the artifact’s wandb.Artifact.download() method to download the contents of the artifact.
# Create a W&B run. Here you specify 'training' for 'job_type'
# because you use this run to track training.
with wandb.init(project="artifacts-example", job_type="training") as run:
    # Query W&B for an artifact and mark it as input to this run
    artifact = run.use_artifact("bicycle-dataset:latest")

    # Download the artifact's contents
    artifact_dir = artifact.download()
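After the download, artifact_dir points to a local directory that holds the artifact's files. To check what arrived, you can list that directory. The following is a minimal sketch using only the standard library. The helper list_artifact_files is hypothetical, and a temporary directory stands in for artifact_dir so the example runs without a W&B connection.

```python
import tempfile
from pathlib import Path


def list_artifact_files(artifact_dir):
    """Return the sorted relative paths of all files under artifact_dir."""
    root = Path(artifact_dir)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


# Stand-in for a downloaded artifact directory (an assumption for this demo only)
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "dataset.h5").write_bytes(b"")
    demo_files = list_artifact_files(demo_dir)
```

Inside the training run, you would pass the value returned by artifact.download() instead of the temporary directory.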
Alternatively, outside of a run, you can use the Public API (wandb.Api) to export or update data already saved in W&B. For more information, see Track external files.

You now have a versioned dataset artifact logged to W&B and consumed by a downstream run. The artifact graph tracks both the upload and the download.