
Data Scientist Workflow

This guide covers the Data Scientist's responsibilities: exploring datasets, preparing and submitting FL projects, and running the aggregation server.

Step 1: Explore Data Owner Datasets

After the Data Owners have created their datasets, explore what's available:

# Check DO1's datasets
do1_datasets = ds_client.datasets.get_all(datasite=do1_email)

try:
    do1_datasets[0].describe()
except IndexError:
    print(
        f"No datasets found for DO1 (do1_datasets is empty). "
        f"Verify DO1 published/shared the dataset and that {do1_email} is correct."
    )



# Check DO2's datasets
do2_datasets = ds_client.datasets.get_all(datasite=do2_email)

try:
    do2_datasets[0].describe()
except IndexError:
    print(
        f"No datasets found for DO2 (do2_datasets is empty). "
        f"Verify DO2 published/shared the dataset and that {do2_email} is correct."
    )

# Get mock dataset URLs for development
mock_dataset_urls = [do1_datasets[0].mock_url, do2_datasets[0].mock_url]
mock_dataset_urls

You'll see metadata about each dataset, including:

  • Dataset name and summary
  • Available features (columns)
  • Tags and descriptions

Mock Data Access

As a data scientist, you can only see the mock data—synthetic data with the same schema as the private data. This lets you develop and test your code without accessing sensitive information.
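
Because the mock data shares its schema with the private data, a quick header check during development can catch column-name mismatches before a job is submitted. The sketch below is illustrative only: `missing_columns` is a hypothetical helper, and the column names are assumed from the public PIMA Indians Diabetes dataset, so the actual mock files may label them differently.

```python
import csv
import io

# Assumed column names (from the public PIMA Indians Diabetes dataset);
# the mock dataset in this tutorial may use different labels.
EXPECTED_COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]


def missing_columns(csv_text, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    present = {name.strip() for name in header}
    return [name for name in expected if name not in present]


# Tiny stand-in for a mock file; the values are made up for illustration.
sample_mock_csv = (
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "2,120,70,25,80,28.5,0.45,33,0\n"
)

print(missing_columns(sample_mock_csv))  # []
print(missing_columns("Glucose,Age\n1,2\n", expected=["Glucose", "BMI"]))  # ['BMI']
```

An empty list means every expected column is present; anything else names the columns the training code would fail to find.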

Step 2: Clone the FL Project

The FL project uses Flower, a popular federated learning framework. Clone the pre-built project:

from pathlib import Path

!mkdir -p /content/fl-diabetes-prediction

!curl -sL https://github.com/khoaguin/fl-diabetes-prediction/archive/refs/heads/main.tar.gz | tar -xz --strip-components=1 -C /content/fl-diabetes-prediction

SYFT_FLWR_PROJECT_PATH = Path("/content/fl-diabetes-prediction")

print(f"syft-flwr project at: {SYFT_FLWR_PROJECT_PATH}")
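
Since `curl -s` suppresses error output, a failed download can leave an empty project folder without any visible warning. A small check like the one below may help confirm the clone worked. It is a sketch: `missing_project_files` is a hypothetical helper, and the assumption that the project root contains a `pyproject.toml` (as Flower apps typically do) is not confirmed by this guide.

```python
from pathlib import Path
import tempfile


def missing_project_files(project_path, required=("pyproject.toml",)):
    """Return the names in `required` that do not exist under project_path."""
    root = Path(project_path)
    return [name for name in required if not (root / name).exists()]


# Demonstration against a throwaway directory rather than the real clone.
with tempfile.TemporaryDirectory() as tmp:
    before = missing_project_files(tmp)
    (Path(tmp) / "pyproject.toml").write_text("[project]\nname = 'demo'\n")
    after = missing_project_files(tmp)

print(before)  # ['pyproject.toml']
print(after)   # []
```

In the notebook, calling `missing_project_files(SYFT_FLWR_PROJECT_PATH)` and getting an empty list back suggests the archive was extracted where expected.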

Step 3: Bootstrap the Project

Configure the project with the aggregator (DS) and participating datasites (DOs):

import syft_flwr

try:
    # Remove any stale main.py so bootstrap can regenerate it
    !rm -rf {SYFT_FLWR_PROJECT_PATH / "main.py"}
    print(f"syft_flwr version = {syft_flwr.__version__}")
    # Every peer of the data scientist is treated as a participating data owner
    do_emails = [peer.email for peer in ds_client.peers]
    syft_flwr.bootstrap(
        SYFT_FLWR_PROJECT_PATH, aggregator=ds_email, datasites=do_emails
    )
    print("Bootstrapped project successfully ✅")
except Exception as e:
    print(f"Bootstrap failed: {e}")

This generates the main.py entry point and configures the project for your specific data owners.

Step 4: Submit Jobs to Data Owners

Send the FL project to each data owner for review:

# Clean up any cached files
!rm -rf {SYFT_FLWR_PROJECT_PATH / "fl_diabetes_prediction" / "__pycache__"}

job_name = "fl-diabetes-training"

# Submit to DO1
ds_client.submit_python_job(
    user=do1_email,
    code_path=str(SYFT_FLWR_PROJECT_PATH),
    job_name=job_name,
)

# Submit to DO2
ds_client.submit_python_job(
    user=do2_email,
    code_path=str(SYFT_FLWR_PROJECT_PATH),
    job_name=job_name,
)
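
With more than two data owners, repeating the call by hand gets tedious. One possible refactoring is a loop that submits the same project to every address and records failures instead of stopping at the first one. This is a sketch: `submit_to_all` and `RecordingClient` are hypothetical names introduced here, and only the `submit_python_job(user=..., code_path=..., job_name=...)` call shape is taken from the code above.

```python
def submit_to_all(client, emails, code_path, job_name):
    """Submit the same project to each data owner and collect failures by email."""
    failures = {}
    for email in emails:
        try:
            client.submit_python_job(
                user=email, code_path=str(code_path), job_name=job_name
            )
        except Exception as exc:  # broad on purpose: error types are not documented here
            failures[email] = str(exc)
    return failures


class RecordingClient:
    """Stand-in for ds_client so the sketch runs without SyftBox."""

    def __init__(self):
        self.calls = []

    def submit_python_job(self, user, code_path, job_name):
        if "@" not in user:
            raise ValueError(f"not an email address: {user}")
        self.calls.append((user, code_path, job_name))


demo_client = RecordingClient()
failures = submit_to_all(
    demo_client,
    ["do1@example.org", "bad-address"],
    "/tmp/project",
    "fl-diabetes-training",
)
print(len(demo_client.calls))  # 1
print(failures)  # {'bad-address': 'not an email address: bad-address'}
```

In the notebook this would be called as `submit_to_all(ds_client, [do1_email, do2_email], SYFT_FLWR_PROJECT_PATH, job_name)`.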

Check Submitted Jobs

ds_client.jobs

What Gets Submitted

The job contains the training code—data owners can inspect it before approving execution on their private data. The code defines the model architecture, training loop, and aggregation strategy.

Step 5: Wait for Approval

At this point, wait for both Data Owners to:

  1. Review the submitted job code
  2. Approve the job
  3. Run do_client.process_approved_jobs()

You can monitor the job status:

ds_client.jobs
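
Rather than re-running the cell by hand, the status check can be wrapped in a polling loop with a timeout. The helper below is a generic sketch: `wait_for` is a hypothetical name, and the guide does not document which attribute of a job carries its approval state, so the `check` callable is left for you to supply after inspecting `ds_client.jobs`.

```python
import time


def wait_for(check, timeout_s=600.0, interval_s=10.0,
             sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns True or timeout_s elapses.

    Returns True if check() succeeded, False if the timeout was reached.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Stand-in check that succeeds on the third poll. In the notebook, check
# would inspect ds_client.jobs (the exact status field is an assumption
# you need to verify against your installed version).
calls = {"n": 0}


def approved_on_third_poll():
    calls["n"] += 1
    return calls["n"] >= 3


result = wait_for(approved_on_third_poll, timeout_s=5.0, interval_s=0.01)
print(result)      # True
print(calls["n"])  # 3
```

A fixed polling interval keeps the sketch simple; for long approval waits, a longer interval avoids flooding the notebook output.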

Step 6: Run the Aggregator

Once Data Owners have approved and started processing their jobs, install the required packages and run the aggregator:

Install Dependencies

!uv pip install \
    "flwr-datasets>=0.5.0" \
    "imblearn>=0.0" \
    "loguru>=0.7.3" \
    "pandas>=2.3.0" \
    "ipywidgets>=8.1.7" \
    "scikit-learn==1.7.1" \
    "torch>=2.8.0" \
    "ray==2.31.0"

Start the Aggregation Server

ds_email = ds_client.email
syftbox_folder = f"/content/SyftBox_{ds_email}"

!SYFTBOX_EMAIL="{ds_email}" SYFTBOX_FOLDER="{syftbox_folder}" \
    uv run {str(SYFT_FLWR_PROJECT_PATH / "main.py")}

Watch the training logs as:

  • Data Owner clients connect
  • Model updates are received
  • Aggregation happens each round
  • Global model accuracy improves
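
To make the per-round aggregation step concrete, here is a minimal sketch of federated averaging (FedAvg): each client's parameters are weighted by the number of training examples that client used. It uses plain Python lists for readability and is not Flower's implementation, which operates on lists of NumPy arrays; `fedavg` and the sample numbers are illustrative.

```python
def fedavg(client_weights, client_sizes):
    """Average per-client parameter vectors, weighted by each client's example count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one example count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total example count must be positive")
    dim = len(client_weights[0])
    if any(len(w) != dim for w in client_weights):
        raise ValueError("all clients must send vectors of the same length")
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Two clients with made-up updates: DO2 trained on three times as many
# examples, so its parameters pull the average three times as hard.
do1_update = [1.0, 2.0]
do2_update = [3.0, 6.0]
print(fedavg([do1_update, do2_update], [100, 300]))  # [2.5, 5.0]
```

Weighting by example count means a data owner with a larger partition has proportionally more influence on the global model.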

Step 7: Check Results

After training completes, check the final job status:

ds_client.jobs

Finding the Trained Model

The aggregated model weights are saved in the SyftBox folder. Check the FL training logs to find the exact path where the final model is saved.
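
If the logs are long, searching the SyftBox folder for recently written files can narrow things down. The sketch below is a guess at how that search might look: `newest_matching` is a hypothetical helper, and the file extensions are common choices for saved weights, not ones this guide confirms the project uses.

```python
from pathlib import Path
import tempfile


def newest_matching(root, patterns=("*.pt", "*.pth", "*.safetensors", "*.npz")):
    """Return files under root matching any pattern, most recently modified first."""
    root = Path(root)
    found = {
        path
        for pattern in patterns
        for path in root.rglob(pattern)
        if path.is_file()
    }
    return sorted(found, key=lambda p: p.stat().st_mtime, reverse=True)


# Demonstration on a throwaway folder with one weights-like file and one
# unrelated file; the file names are invented for the example.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "run"
    run_dir.mkdir()
    (run_dir / "final_model.safetensors").write_bytes(b"\x00")
    (run_dir / "notes.txt").write_text("not a model")
    names = [p.name for p in newest_matching(tmp)]

print(names)  # ['final_model.safetensors']
```

In the notebook, `newest_matching(syftbox_folder)` would list candidates; the path reported in the training logs remains the authoritative answer.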

Cleanup

When finished, clean up SyftBox resources:

ds_client.delete_syftbox()

What Just Happened?

Congratulations! You successfully trained a diabetes prediction model using federated learning:

  1. Two data owners each held a private partition of the PIMA Indians Diabetes dataset
  2. A data scientist coordinated the training without ever seeing the raw data
  3. Model updates were aggregated using the Flower framework (FedAvg strategy)
  4. Privacy was preserved—raw data never left the data owner's Colab environment

This is the core promise of federated learning: collaborative machine learning without sharing sensitive data.

Privacy Guarantees

| Guarantee | Description |
|---|---|
| No raw data access | DS only sees mock data and model parameters |
| Compute-to-data | Training happens where the data lives |
| Parameter exchange only | Only mathematical weights (W, b) move over the network |
| Consent-based | DOs must explicitly approve each job |

Next Steps

  • Explore the FL project code to understand the model architecture
  • Modify the aggregation strategy in server_app.py
  • Try with your own datasets and models