
Data Owner Workflow

This guide covers the Data Owner's responsibilities: downloading data, creating Syft datasets, and approving federated learning jobs.

Run in Both DO Notebooks

These steps should be run in both DO1 and DO2 notebooks, with different partition numbers.

Step 1: Download the Dataset

Download the PIMA Indians Diabetes dataset (pre-partitioned) from Hugging Face:

from pathlib import Path
from huggingface_hub import snapshot_download

DATASET_DIR = Path("./dataset/").expanduser().absolute()

if not DATASET_DIR.exists():
    snapshot_download(
        repo_id="khoaguin/pima-indians-diabetes-database-partitions",
        repo_type="dataset",
        local_dir=DATASET_DIR,
    )

Step 2: Create a Syft Dataset

Create a Syft dataset from your partition. Each DO uses a different partition number:

Data Owner | Partition Number
DO1        | 0
DO2        | 1
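The partition number maps directly to a directory name inside the downloaded dataset, so each DO notebook only needs to change one variable. A quick sanity check of that mapping (pure pathlib, no SyftBox required):

```python
from pathlib import Path

DATASET_DIR = Path("./dataset/").expanduser().absolute()

def partition_path(partition_number: int) -> Path:
    """Return the dataset directory for a given DO's partition."""
    return DATASET_DIR / f"pima-indians-diabetes-database-{partition_number}"

# DO1 uses partition 0, DO2 uses partition 1
assert partition_path(0).name == "pima-indians-diabetes-database-0"
assert partition_path(1).name == "pima-indians-diabetes-database-1"
```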

For DO1:

partition_number = 0  # DO1 uses partition 0
DATASET_PATH = DATASET_DIR / f"pima-indians-diabetes-database-{partition_number}"

do_client.create_dataset(
    name="pima-indians-diabetes-database",
    mock_path=DATASET_PATH / "mock",
    private_path=DATASET_PATH / "private",
    summary="PIMA Indians Diabetes dataset - Partition 0",
    readme_path=DATASET_PATH / "README.md",
    tags=["healthcare", "diabetes"],
    sync=True,
)

For DO2:

partition_number = 1  # DO2 uses partition 1
DATASET_PATH = DATASET_DIR / f"pima-indians-diabetes-database-{partition_number}"

do_client.create_dataset(
    name="pima-indians-diabetes-database",
    mock_path=DATASET_PATH / "mock",
    private_path=DATASET_PATH / "private",
    summary="PIMA Indians Diabetes dataset - Partition 1",
    readme_path=DATASET_PATH / "README.md",
    tags=["healthcare", "diabetes"],
    sync=True,
)

Verify Dataset Creation

do_client.datasets.get_all()

Understanding Mock vs Private Data
  • mock_path: Contains synthetic/sample data that data scientists can explore and write code against
  • private_path: Contains the real data that never leaves your environment
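To make the mock/private distinction concrete, here is a hypothetical sketch of how the two relate (the actual mock files ship with the downloaded dataset; the column names below are a subset of the real PIMA schema):

```python
import numpy as np
import pandas as pd

# Hypothetical private partition: real patient records, never leave this machine
private_df = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Outcome": [1, 0, 1],
})

# A mock keeps the same schema and plausible ranges, but the values are synthetic
rng = np.random.default_rng(0)
mock_df = pd.DataFrame({
    "Glucose": rng.integers(70, 200, size=3),
    "BMI": rng.uniform(18.0, 45.0, size=3).round(1),
    "Outcome": rng.integers(0, 2, size=3),
})

# Data scientists develop against the mock; the same code later runs on the private data
assert list(mock_df.columns) == list(private_df.columns)
```

Because both frames share a schema, code written against `mock_df` transfers unchanged to the private partition at job-execution time.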

Step 3: Review and Approve Jobs

Once the Data Scientist submits a job, you'll see it in your jobs list.

Check for Incoming Jobs

do_client.jobs

Review a Job

Inspect the job details before approving:

try:
    do_client.jobs[0]
except IndexError:
    print("No jobs to approve.")

What Happens During Job Execution

This job will run the federated learning client code on your private data. (Filtering the jobs list by datasite or other criteria is a planned improvement.)

Security Best Practice

Always review the job code before approving. Use the job inspection tools to understand what the submitted code will do with your data.

Approve the Job

do_client.jobs[0].approve()

# Verify approval status
do_client.jobs

Step 4: Run Approved Jobs

Process all approved jobs. This executes the federated learning client code on your private data:

do_client.process_approved_jobs()

This will:

  1. Install required packages
  2. Run the client-side FL training on your private data
  3. Send only model updates (not raw data) back to the aggregator

What Happens During Training

The FL client code runs locally on your machine. Your raw data is used for training, but only the computed model parameters (weights and biases) are shared with the data scientist's aggregator.
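Conceptually, only the fitted parameters leave your machine. The sketch below illustrates one local training round with plain NumPy on synthetic data; it is not the actual syft_flwr client, just a minimal illustration of what "model updates, not raw data" means:

```python
import numpy as np

# Synthetic stand-in for a private partition: 64 rows, 8 features (like PIMA)
rng = np.random.default_rng(42)
X = rng.normal(size=(64, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Local logistic-regression training via gradient descent
w, b = np.zeros(8), 0.0
for _ in range(100):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
    w -= 0.5 * (X.T @ (p - y)) / len(y)       # gradient step on weights
    b -= 0.5 * float(np.mean(p - y))          # gradient step on bias

# Only this parameter update is sent to the aggregator -- never X or y
model_update = {"weights": w, "bias": b}
```

The aggregator combines such updates from every DO (e.g., by averaging) without ever seeing the underlying rows.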

Cleanup

When you're finished with the tutorial, clean up SyftBox resources:

do_client.delete_syftbox()

Next Steps

While waiting for jobs from the Data Scientist, you can continue to the Data Scientist Workflow to see the other side of the process.