Data Scientist Workflow
This guide covers the Data Scientist's responsibilities: exploring datasets, preparing and submitting FL projects, and running the aggregation server.
Step 1: Explore Data Owner Datasets
After the Data Owners have created their datasets, explore what's available:
# Check DO1's datasets
do1_datasets = ds_client.datasets.get_all(datasite=do1_email)
try:
    do1_datasets[0].describe()
except IndexError:
    print(
        "No datasets found for DO1 (do1_datasets is empty). "
        f"Verify DO1 published/shared the dataset and that {do1_email} is correct."
    )
# Check DO2's datasets
do2_datasets = ds_client.datasets.get_all(datasite=do2_email)
try:
    do2_datasets[0].describe()
except IndexError:
    print(
        "No datasets found for DO2 (do2_datasets is empty). "
        f"Verify DO2 published/shared the dataset and that {do2_email} is correct."
    )
# Get mock dataset URLs for development, skipping any Data Owner with no datasets
mock_dataset_urls = [
    datasets[0].mock_url
    for datasets in (do1_datasets, do2_datasets)
    if len(datasets) > 0
]
mock_dataset_urls
You'll see metadata about each dataset, including:
- Dataset name and summary
- Available features (columns)
- Tags and descriptions
As a data scientist, you can only see the mock data—synthetic data with the same schema as the private data. This lets you develop and test your code without accessing sensitive information.
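To make the idea of mock data concrete, here is a minimal sketch of how a synthetic stand-in with the same schema could be produced. The column names, value ranges, and the make_mock_rows helper are illustrative assumptions; they are not the actual schema or the tooling the Data Owners used.

```python
import random

# Hypothetical schema: column name -> (type, min, max). Not the real dataset schema.
SCHEMA = {
    "glucose": (float, 50.0, 200.0),
    "bmi": (float, 15.0, 50.0),
    "age": (int, 21, 80),
    "outcome": (int, 0, 1),
}


def make_mock_rows(schema, n_rows, seed=0):
    """Generate synthetic rows that match the schema but contain no real records."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        row = {}
        for name, (kind, low, high) in schema.items():
            if kind is int:
                row[name] = rng.randint(low, high)
            else:
                row[name] = rng.uniform(low, high)
        rows.append(row)
    return rows


mock_rows = make_mock_rows(SCHEMA, n_rows=5)
```

Code written against rows like these should run unchanged on the private data, because only the values differ, not the column names or types.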
Step 2: Clone the FL Project
The FL project uses Flower, a popular federated learning framework. Clone the pre-built project:
from pathlib import Path
!mkdir -p /content/fl-diabetes-prediction
!curl -sL https://github.com/khoaguin/fl-diabetes-prediction/archive/refs/heads/main.tar.gz | tar -xz --strip-components=1 -C /content/fl-diabetes-prediction
SYFT_FLWR_PROJECT_PATH = Path("/content/fl-diabetes-prediction")
print(f"syft-flwr project at: {SYFT_FLWR_PROJECT_PATH}")
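Because the download runs silently, a failed fetch can leave an empty directory behind. A quick layout check along the following lines may help. The missing_entries helper is hypothetical, and the expected entries are assumptions: the fl_diabetes_prediction package directory is referenced later in this guide, while pyproject.toml is assumed from the typical layout of a Flower app.

```python
from pathlib import Path


def missing_entries(project_path, expected):
    """Return the expected relative paths that do not exist under project_path."""
    root = Path(project_path)
    return [name for name in expected if not (root / name).exists()]


# Assumed project layout; adjust to what the repository actually contains.
EXPECTED = ["pyproject.toml", "fl_diabetes_prediction"]

# Example (run after cloning):
# missing = missing_entries(SYFT_FLWR_PROJECT_PATH, EXPECTED)
# if missing:
#     print(f"Project looks incomplete, missing: {missing}")
```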
Step 3: Bootstrap the Project
Configure the project with the aggregator (DS) and participating datasites (DOs):
import syft_flwr
try:
    # Remove any existing main.py; bootstrap generates it again
    !rm -rf {SYFT_FLWR_PROJECT_PATH / "main.py"}
    print(f"syft_flwr version = {syft_flwr.__version__}")
    do_emails = [peer.email for peer in ds_client.peers]
    syft_flwr.bootstrap(
        SYFT_FLWR_PROJECT_PATH, aggregator=ds_email, datasites=do_emails
    )
    print("Bootstrapped project successfully ✅")
except Exception as e:
    print(f"Bootstrap failed: {e}")
This generates the main.py entry point and configures the project for your specific data owners.
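Since the datasite list is built from whatever peers the client currently knows about, it can be worth checking the inputs before bootstrapping. The sketch below is a plain-Python pre-check, not part of syft_flwr. The check_participants helper is hypothetical, and the rule that the aggregator should not also appear as a datasite is an assumption about the intended setup.

```python
def check_participants(aggregator, datasites):
    """Return a list of problems found; an empty list means the inputs look sane."""
    problems = []
    if not datasites:
        problems.append("no datasites found; have the Data Owners been added as peers?")
    if len(set(datasites)) != len(datasites):
        problems.append("duplicate datasite emails")
    # Assumption: the aggregator is a separate party from the datasites.
    if aggregator in datasites:
        problems.append("aggregator email also appears in the datasite list")
    return problems


# Example:
# for problem in check_participants(ds_email, do_emails):
#     print(f"Warning: {problem}")
```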
Step 4: Submit Jobs to Data Owners
Send the FL project to each data owner for review:
# Clean up any cached files
!rm -rf {SYFT_FLWR_PROJECT_PATH / "fl_diabetes_prediction" / "__pycache__"}
job_name = "fl-diabetes-training"
# Submit to DO1
ds_client.submit_python_job(
    user=do1_email,
    code_path=str(SYFT_FLWR_PROJECT_PATH),
    job_name=job_name,
)
# Submit to DO2
ds_client.submit_python_job(
    user=do2_email,
    code_path=str(SYFT_FLWR_PROJECT_PATH),
    job_name=job_name,
)
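With more than two Data Owners, repeating the call by hand gets tedious. One possible pattern is a small loop that reuses the same submit_python_job arguments shown above and records which submissions failed. The submit_to_all helper is hypothetical, and catching a broad Exception is an assumption made because the errors this call can raise are not described here.

```python
def submit_to_all(client, emails, code_path, job_name):
    """Submit the same project to every Data Owner and collect per-owner failures."""
    failures = {}
    for email in emails:
        try:
            client.submit_python_job(
                user=email,
                code_path=str(code_path),
                job_name=job_name,
            )
        except Exception as exc:  # record the failure and continue with the rest
            failures[email] = exc
    return failures


# Example:
# failures = submit_to_all(ds_client, [do1_email, do2_email], SYFT_FLWR_PROJECT_PATH, job_name)
# for email, exc in failures.items():
#     print(f"Submission to {email} failed: {exc}")
```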
Check Submitted Jobs
ds_client.jobs
The job contains the training code—data owners can inspect it before approving execution on their private data. The code defines the model architecture, training loop, and aggregation strategy.
Step 5: Wait for Approval
At this point, wait for both Data Owners to:
- Review the submitted job code
- Approve the job
- Run do_client.process_approved_jobs()
You can monitor the job status:
ds_client.jobs
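Rather than re-running the cell by hand, you could poll until the jobs are ready. The sketch below is a generic polling helper. It deliberately leaves the readiness check to the caller, because the structure of the object returned by ds_client.jobs is not described in this guide. The wait_until name and its default timings are assumptions.

```python
import time


def wait_until(predicate, timeout_s=600.0, interval_s=10.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call predicate() repeatedly until it returns True or the timeout expires.

    Returns True if the predicate succeeded, False if the timeout was reached.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

For example, wait_until(lambda: all_jobs_approved(ds_client.jobs)) would work once you have written all_jobs_approved, a hypothetical function based on what ds_client.jobs actually returns in your environment.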
Step 6: Run the Aggregator
Once Data Owners have approved and started processing their jobs, install the required packages and run the aggregator:
Install Dependencies
!uv pip install \
    "flwr-datasets>=0.5.0" \
    "imblearn>=0.0" \
    "loguru>=0.7.3" \
    "pandas>=2.3.0" \
    "ipywidgets>=8.1.7" \
    "scikit-learn==1.7.1" \
    "torch>=2.8.0" \
    "ray==2.31.0"
Start the Aggregation Server
ds_email = ds_client.email
syftbox_folder = f"/content/SyftBox_{ds_email}"
!SYFTBOX_EMAIL="{ds_email}" SYFTBOX_FOLDER="{syftbox_folder}" \
    uv run {str(SYFT_FLWR_PROJECT_PATH / "main.py")}
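The ! syntax only works inside a notebook. If you run the aggregator from a plain Python script instead, a rough equivalent is to pass the same two environment variables through subprocess. The aggregator_invocation helper below is hypothetical and only builds the command and environment; it assumes uv is available on the PATH.

```python
import os
from pathlib import Path


def aggregator_invocation(project_path, ds_email, syftbox_folder):
    """Build the command and environment for launching main.py with uv."""
    env = dict(os.environ)  # copy, so the current process environment is untouched
    env["SYFTBOX_EMAIL"] = ds_email
    env["SYFTBOX_FOLDER"] = str(syftbox_folder)
    cmd = ["uv", "run", str(Path(project_path) / "main.py")]
    return cmd, env


# Example (not executed here):
# import subprocess
# cmd, env = aggregator_invocation(SYFT_FLWR_PROJECT_PATH, ds_email, syftbox_folder)
# subprocess.run(cmd, env=env, check=True)
```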
Watch the training logs as:
- Data Owner clients connect
- Model updates are received
- Aggregation happens each round
- Global model accuracy improves
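The aggregation step in each round can be pictured as a weighted average of the parameters each client sends back, which is the idea behind the FedAvg strategy this project uses. The sketch below is a plain-Python illustration of that idea, not Flower's implementation. The fedavg function, the flat-list representation of weights, and the example numbers are all assumptions made for clarity.

```python
def fedavg(client_updates):
    """Average parameter vectors, weighting each client by its number of examples.

    client_updates: list of (weights, num_examples) pairs, where weights is a
    flat list of floats and all clients send vectors of the same length.
    """
    total = sum(n for _, n in client_updates)
    if total <= 0:
        raise ValueError("need at least one training example across clients")
    size = len(client_updates[0][0])
    averaged = [0.0] * size
    for weights, n in client_updates:
        if len(weights) != size:
            raise ValueError("all clients must send vectors of the same length")
        for i, w in enumerate(weights):
            averaged[i] += w * n / total
    return averaged


# Two hypothetical clients: the one with more examples has more influence.
global_weights = fedavg([([1.0, 0.0], 100), ([3.0, 2.0], 300)])
print(global_weights)  # [2.5, 1.5]
```

Weighting by example count means a Data Owner with a larger partition pulls the global model further toward its local result than one with a smaller partition.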
Step 7: Check Results
After training completes, check the final job status:
ds_client.jobs
The aggregated model weights are saved in the SyftBox folder. Check the FL training logs to find the exact path where the final model is saved.
Cleanup
When finished, clean up SyftBox resources:
ds_client.delete_syftbox()
What Just Happened?
Congratulations! You successfully trained a diabetes prediction model using federated learning:
- Two data owners each held a private partition of the PIMA Indians Diabetes dataset
- A data scientist coordinated the training without ever seeing the raw data
- Model updates were aggregated using the Flower framework (FedAvg strategy)
- Privacy was preserved: raw data never left each data owner's Colab environment
This is the core promise of federated learning: collaborative machine learning without sharing sensitive data.
Privacy Guarantees
| Guarantee | Description |
|---|---|
| No raw data access | DS only sees mock data and model parameters |
| Compute-to-data | Training happens where the data lives |
| Parameter exchange only | Only mathematical weights (W, b) move over the network |
| Consent-based | DOs must explicitly approve each job |
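To illustrate the compute-to-data and parameter-exchange rows, here is a minimal sketch of what a client-side update could look like. It uses one gradient-descent step of logistic regression as a stand-in; the project's actual model and training loop live in the FL project code and will differ. The local_update function and the toy data are hypothetical.

```python
import math


def local_update(W, b, rows, labels, lr=0.1):
    """Run one gradient-descent step of logistic regression on local data.

    Returns only the updated parameters and the example count; the rows and
    labels are used inside this function and are never returned.
    """
    n = len(rows)
    if n == 0:
        raise ValueError("no local examples to train on")
    grad_W = [0.0] * len(W)
    grad_b = 0.0
    for x, y in zip(rows, labels):
        z = sum(w * xi for w, xi in zip(W, x)) + b
        p = 1.0 / (1.0 + math.exp(-z))  # predicted probability
        err = p - y
        for i, xi in enumerate(x):
            grad_W[i] += err * xi / n
        grad_b += err / n
    new_W = [w - lr * g for w, g in zip(W, grad_W)]
    new_b = b - lr * grad_b
    return new_W, new_b, n


# Toy private data that stays on the Data Owner's side.
private_rows = [[1.0, 2.0], [2.0, 1.0]]
private_labels = [1, 0]
new_W, new_b, num_examples = local_update([0.0, 0.0], 0.0, private_rows, private_labels)
```

Only new_W, new_b, and num_examples would be sent to the aggregator in this picture; the private rows stay where they are.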
Next Steps
- Explore the FL project code to understand the model architecture
- Modify the aggregation strategy in server_app.py
- Try with your own datasets and models