Data Scientist: Exploring Datasites

Learn how to discover, explore, and assess distributed datasets for your federated learning experiments.

As a Data Scientist, your journey begins with discovery. Before writing any model code, you must understand the landscape of available data across the SyftBox network. In the SyftBox ecosystem, you don't need direct access to raw data to start your research. By connecting as a guest, you can browse "data storefronts" (Datasites) and interact with synthetic versions of sensitive information.

1. Discovering Datasites

SyftBox operates as a file sync network where metadata is synchronized through a central server maintained by OpenMined.

The SyftBox Network: You can find available datasites by checking the global metadata directory shared across all active SyftBox clients.
Connecting as a Guest: Call the init_session function to establish a read-only connection to any reachable datasite.

import syft_rds as sy
from syft_core import Client

# Get your own email from SyftBox config
your_email = Client.load().email

# Connect to a Data Owner's datasite as a guest
do_client = sy.init_session(host="data_owner@example.com", email=your_email)

# Check if you have admin access (you won't as a guest)
print(do_client.is_admin)  # False
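
To make the "global metadata directory" idea concrete, here is a minimal sketch that lists the datasites visible in a local SyftBox workspace. It assumes (this is not confirmed by the guide) that synced datasites appear as one sub-folder per owner email under a single root folder. The helper name list_datasites and the root path you pass in are hypothetical.

```python
from pathlib import Path


def list_datasites(datasites_root: Path) -> list[str]:
    """Return the names of sub-folders that look like datasite emails.

    Assumption: each datasite is synced as a folder named after the
    owner's email address. Anything else in the root is ignored.
    """
    if not datasites_root.is_dir():
        return []
    return sorted(
        entry.name
        for entry in datasites_root.iterdir()
        if entry.is_dir() and "@" in entry.name
    )
```

Each returned name can then be tried as the host argument of init_session. Where the root folder lives depends on how your SyftBox client is configured, so check your own workspace before relying on a particular path.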

2. Viewing Dataset Metadata

Once connected, you can see exactly what the Data Owner has made "discoverable".

Listing Datasets: Access the .datasets attribute on your client to see a high-level list of available data.
Inspecting Details: Each dataset includes rich metadata provided by the owner:
Summary & Description: Context on the study's purpose and data origin.
Asset List: The individual files (features, labels, etc.) contained within the dataset.
Schema Information: Column names, data types, and statistical ranges.
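
When the published metadata leaves out schema details, a comparable summary can be derived from a synthetic file that the owner has made available. The sketch below is one way to do that with the standard library only. The function name summarize_schema and the sample column names are illustrative assumptions, not part of the syft_rds API.

```python
import csv
import io
from typing import TextIO


def summarize_schema(csv_file: TextIO) -> dict[str, dict]:
    """Infer a rough type and value range for each column of a CSV file."""
    reader = csv.DictReader(csv_file)
    columns: dict[str, list[str]] = {name: [] for name in reader.fieldnames or []}

    # Collect the non-empty values of every column.
    for row in reader:
        for name in columns:
            value = row.get(name)
            if value not in (None, ""):
                columns[name].append(value)

    summary: dict[str, dict] = {}
    for name, values in columns.items():
        try:
            numbers = [float(v) for v in values]
        except ValueError:
            # At least one value is not a number: treat the column as text.
            summary[name] = {"type": "text", "distinct": len(set(values))}
            continue
        if numbers:
            summary[name] = {"type": "numeric", "min": min(numbers), "max": max(numbers)}
        else:
            summary[name] = {"type": "empty"}
    return summary


# Illustrative data only; a real file would be opened with open(path, newline="").
sample = io.StringIO("Glucose,BMI,Label\n148,33.6,yes\n85,26.6,no\n")
print(summarize_schema(sample))
```

The ranges come from synthetic values, so treat them as a hint about the data's shape rather than as statistics of the private data.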

3. Understanding Data Availability

Mock vs. Private: Every dataset exposes two paths. As a guest you cannot read the Private Path, which holds the real data, but the Mock Path, which holds the synthetic version, is open for inspection.
The Permission Model: You can check which datasets allow guest discovery and which require you to be a "registered" user or collaborator before the metadata is visible.

4. Assessing Dataset Suitability for FL

Before committing to a project, use the mock data to verify that your preprocessing and model code run end to end.

Feature Parity: If you are running Horizontal FL (e.g., across multiple hospitals), use the mock data to ensure all datasites use the same feature names and scaling.
Prototyping: Load the mock data into a Pandas DataFrame to test your preprocessing scripts.

import pandas as pd

# Get a dataset by name (do_client is the guest session created earlier)
dataset = do_client.dataset.get(name="pima-indians-diabetes-database")

# Get the mock data path (accessible to guests)
mock_path = dataset.get_mock_path()
print(f"Mock data path: {mock_path}")

# Load and inspect the mock data
mock_df = pd.read_csv(mock_path / "train.csv")
print(mock_df.head())
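
The feature parity check can be sketched as a comparison of column names across sites. The helper below is hypothetical (check_feature_parity is not a SyftBox function) and expects the column lists you collected from each datasite's mock data, for example via list(mock_df.columns). It compares names only and does not check scaling.

```python
def check_feature_parity(columns_by_site: dict[str, list[str]]) -> dict[str, list[str]]:
    """Report, per site, the columns seen at other sites but missing locally.

    An empty result means every site uses the same set of column names.
    """
    if not columns_by_site:
        return {}

    # Union of every column name seen at any site.
    all_columns = set().union(*columns_by_site.values())

    report: dict[str, list[str]] = {}
    for site, columns in columns_by_site.items():
        missing = sorted(all_columns - set(columns))
        if missing:
            report[site] = missing
    return report


# Illustrative input: the second site spells one column differently.
sites = {
    "hospital_a@example.com": ["Glucose", "BMI", "Outcome"],
    "hospital_b@example.com": ["Glucose", "bmi", "Outcome"],
}
print(check_feature_parity(sites))
```

A mismatch such as "BMI" versus "bmi" shows up as a missing column at both sites, which is usually a sign that a renaming step is needed before horizontal FL can proceed.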

5. Identifying Potential Collaborators

Discovery is also social. You can use metadata to find other Data Scientists or Data Owners with similar research interests.

Owner Contact: Metadata often includes the owner's public email or organization details.
Multi-Site Selection: You can select multiple datasites that host similar datasets to form a "cohort" for your federated training project.
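
Cohort selection can be sketched as a filter over a catalog you build yourself while browsing. In the example below, the catalog is a plain dictionary mapping each datasite email to the dataset names it lists. Both the structure and the select_cohort helper are assumptions for illustration, not part of SyftBox.

```python
def select_cohort(
    catalog: dict[str, list[str]],
    dataset_name: str,
    min_sites: int = 2,
) -> list[str]:
    """Return the datasites that list `dataset_name`, sorted by email.

    Raises ValueError when fewer than `min_sites` datasites qualify,
    since federated training needs more than one participant.
    """
    cohort = sorted(site for site, names in catalog.items() if dataset_name in names)
    if len(cohort) < min_sites:
        raise ValueError(
            f"only {len(cohort)} datasite(s) host {dataset_name!r}; need {min_sites}"
        )
    return cohort


# Illustrative catalog with made-up owners.
catalog = {
    "owner_1@example.com": ["pima-indians-diabetes-database"],
    "owner_2@example.com": ["pima-indians-diabetes-database", "other-study"],
    "owner_3@example.com": ["other-study"],
}
print(select_cohort(catalog, "pima-indians-diabetes-database"))
```

Matching on the dataset name alone is a starting point. Datasets with the same name can still differ in schema, so it is worth inspecting each site's mock data before finalizing the cohort.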

Next Step: Now that you've identified your data sources, head over to Creating FL Projects to learn how to structure your model and training logic.

Video: Federated Learning on Pima Diabetes Data. This video provides a practical look at how data scientists can build and coordinate models across distributed environments, similar to the diabetes prediction workflow in this guide.
