# Data Owner Workflow
This guide covers the Data Owner's responsibilities: downloading data, creating Syft datasets, and approving federated learning jobs.
These steps should be run in both DO1 and DO2 notebooks, with different partition numbers.
## Step 1: Download the Dataset
Download the PIMA Indians Diabetes dataset (pre-partitioned) from Hugging Face:
```python
from pathlib import Path

from huggingface_hub import snapshot_download

DATASET_DIR = Path("./dataset/").expanduser().absolute()

if not DATASET_DIR.exists():
    snapshot_download(
        repo_id="khoaguin/pima-indians-diabetes-database-partitions",
        repo_type="dataset",
        local_dir=DATASET_DIR,
    )
```
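After the download finishes, it can help to sanity-check that both partition folders are in place. A minimal standalone sketch (it re-declares `DATASET_DIR`, and the folder names follow the partition naming used in Step 2):

```python
from pathlib import Path

DATASET_DIR = Path("./dataset/")

# Each DO only needs its own partition, but the snapshot contains both.
expected = [
    DATASET_DIR / f"pima-indians-diabetes-database-{i}" for i in (0, 1)
]
for d in expected:
    print(d, "exists" if d.exists() else "missing")
```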
## Step 2: Create a Syft Dataset
Create a Syft dataset from your partition. Each DO uses a different partition number:
| Data Owner | Partition Number |
|---|---|
| DO1 | 0 |
| DO2 | 1 |
For DO1:
```python
partition_number = 0  # DO1 uses partition 0
DATASET_PATH = DATASET_DIR / f"pima-indians-diabetes-database-{partition_number}"

do_client.create_dataset(
    name="pima-indians-diabetes-database",
    mock_path=DATASET_PATH / "mock",
    private_path=DATASET_PATH / "private",
    summary="PIMA Indians Diabetes dataset - Partition 0",
    readme_path=DATASET_PATH / "README.md",
    tags=["healthcare", "diabetes"],
    sync=True,
)
```
For DO2:
```python
partition_number = 1  # DO2 uses partition 1
DATASET_PATH = DATASET_DIR / f"pima-indians-diabetes-database-{partition_number}"

do_client.create_dataset(
    name="pima-indians-diabetes-database",
    mock_path=DATASET_PATH / "mock",
    private_path=DATASET_PATH / "private",
    summary="PIMA Indians Diabetes dataset - Partition 1",
    readme_path=DATASET_PATH / "README.md",
    tags=["healthcare", "diabetes"],
    sync=True,
)
```
### Verify Dataset Creation
```python
do_client.datasets.get_all()
```
- `mock_path`: contains synthetic/sample data that data scientists can explore and write code against
- `private_path`: contains the real data that never leaves your environment
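To illustrate the mock/private split, here is a hypothetical sketch of the relationship between the two folders: the mock data shares the private data's schema but holds synthetic values, so code written against the mock runs unchanged on the private partition. The column names and values below are illustrative, not the actual dataset contents:

```python
import numpy as np
import pandas as pd

# Hypothetical private partition: real patient rows (never shared).
private_df = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Outcome": [1, 0, 1],
})

# Mock partition: same columns, synthetic values drawn per column,
# safe for data scientists to explore.
rng = np.random.default_rng(0)
mock_df = pd.DataFrame({
    "Glucose": rng.integers(60, 200, size=3),
    "BMI": rng.uniform(18, 45, size=3).round(1),
    "Outcome": rng.integers(0, 2, size=3),
})

# The schemas match, so analysis code transfers between the two.
assert list(mock_df.columns) == list(private_df.columns)
```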
## Step 3: Review and Approve Jobs
Once the Data Scientist submits a job, you'll see it in your jobs list.
### Check for Incoming Jobs
```python
do_client.jobs
```
### Review a Job
Inspect the job details before approving:
```python
try:
    do_client.jobs[0]
except IndexError:
    print("No jobs to approve.")
```
This job will run the federated learning client code on your private data. (In the future, this step could be improved to filter jobs by datasite or other criteria.)
Always review the job code before approving. Use the job inspection tools to understand what the submitted code will do with your data.
### Approve the Job
```python
do_client.jobs[0].approve()

# Verify approval status
do_client.jobs
```
## Step 4: Run Approved Jobs
Process all approved jobs. This executes the federated learning client code on your private data:
```python
do_client.process_approved_jobs()
```
This will:
- Install required packages
- Run the client-side FL training on your private data
- Send only model updates (not raw data) back to the aggregator
The FL client code runs locally on your machine. Your raw data is used for training, but only the computed model parameters (weights and biases) are shared with the data scientist's aggregator.
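The "model updates, not raw data" idea can be sketched in a few lines. This is a toy stand-in for the FL client, not the actual syft-flwr code: one local gradient step of logistic regression on hypothetical data, after which only the parameter vector would leave the machine (the shapes mirror PIMA's 8 features):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical private partition: 100 rows x 8 features, binary labels.
X = rng.normal(size=(100, 8))
y = rng.integers(0, 2, size=100)

# One local training step (logistic regression gradient descent).
w = np.zeros(8)
preds = 1 / (1 + np.exp(-X @ w))  # sigmoid
grad = X.T @ (preds - y) / len(y)
w -= 0.1 * grad

# Only the updated parameters are shared with the aggregator:
# 8 floats, while the 800 raw feature values stay local.
payload = {"weights": w.tolist()}
```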
## Cleanup
When you're finished with the tutorial, clean up SyftBox resources:
```python
do_client.delete_syftbox()
```
## Next Steps
While waiting for jobs from the Data Scientist, you can continue to the Data Scientist Workflow to see the other side of the process.