Data Owner: Managing Your Datasets

You manage two distinct versions of your data: Private Data, which stays on your machine, and Mock Data, which you share with the network. This guide walks you through how to create, organize, and maintain these assets.

1. Understanding the Dataset Structure

In SyftBox, data is organized into Datasets, which contain one or more Assets.

  • Dataset: A high-level container (e.g., "Pima Indians Diabetes Database").
  • Asset: An individual data file within that dataset (e.g., train.csv or test.csv).
  • Dual-Path System: Every Asset has a Private Path (real data used for training) and a Mock Path (fake data used for code development).
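
The relationship between these three concepts can be sketched as a small data model. This is an illustration only, not SyftBox's actual internal representation: the class names, field names, and the shared_view helper are all hypothetical.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass(frozen=True)
class Asset:
    """One data file with a real copy and a fake copy (hypothetical model)."""
    name: str
    private_path: Path  # real data; stays on the Data Owner's machine
    mock_path: Path     # fake data; shared with the network


@dataclass
class Dataset:
    """High-level container holding one or more assets (hypothetical model)."""
    name: str
    description: str = ""
    assets: list[Asset] = field(default_factory=list)

    def shared_view(self) -> dict:
        """Return only the fields that are safe to share: no private paths."""
        return {
            "name": self.name,
            "description": self.description,
            "assets": [
                {"name": a.name, "mock_file": a.mock_path.name}
                for a in self.assets
            ],
        }


# Example paths are placeholders, not real locations.
dataset = Dataset(
    name="pima-indians-diabetes",
    assets=[
        Asset("train", Path("/data/private/train.csv"), Path("/data/mock/train.csv")),
        Asset("test", Path("/data/private/test.csv"), Path("/data/mock/test.csv")),
    ],
)
view = dataset.shared_view()
print(view)
```

The point of the sketch is the split itself: every asset carries both paths, but anything derived for sharing is built from the mock side only.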

2. Creating Your First Dataset

Follow these steps to make your data discoverable for Federated Learning.

Step 1: Access the Dashboard

  1. Ensure your SyftBox client is running.
  2. Open your browser and navigate to the RDS-Dashboard (usually http://localhost:8000).
  3. Click on the Datasets tab in the sidebar.

Step 2: Define Dataset Metadata

  1. Click + Add a Dataset.
  2. Name: Give your dataset a unique, descriptive name. This is the ID Data Scientists will use to find your data (e.g., heart-disease-v1).
  3. Description: Provide context. Mention the features included, the sample size, and any relevant preprocessing. You can use Markdown here to create lists or tables.

Step 3: Add Assets (The Files)

For each file you want to include:

  1. Click Add Asset.
  2. Asset Name: Enter a short identifier for the file (e.g., training_records).
  3. Private File Path: Provide the absolute path to the real CSV on your local machine.

Security Note: This path is never shared with the network. It is only used by the local SyftBox runner during an approved job.

  4. Mock File Path: Provide the path to your synthetic/fake CSV.

Pro Tip: Ensure your mock data has exactly the same headers (columns) as your private data. Code that a Data Scientist develops against mismatched mock data is likely to fail when it runs on the real file.
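
One way to check this before registering an asset is to compare the header rows of the two CSV files. The sketch below uses only the Python standard library; the function names and the demo files are made up for illustration and are not part of SyftBox.

```python
import csv
import tempfile
from pathlib import Path


def read_header(path):
    """Return the first row of a CSV file as a list of column names."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        return next(csv.reader(f), [])


def compare_headers(private_path, mock_path):
    """Report columns that differ between the private and mock files."""
    private_cols = read_header(private_path)
    mock_cols = read_header(mock_path)
    return {
        "missing_in_mock": [c for c in private_cols if c not in mock_cols],
        "extra_in_mock": [c for c in mock_cols if c not in private_cols],
        "same_order": private_cols == mock_cols,
    }


# Demo with throwaway files; replace these with your own paths.
with tempfile.TemporaryDirectory() as tmp:
    private_csv = Path(tmp) / "private.csv"
    mock_csv = Path(tmp) / "mock.csv"
    private_csv.write_text("age,bmi,outcome\n50,33.6,1\n", encoding="utf-8")
    mock_csv.write_text("age,outcome\n41,0\n", encoding="utf-8")
    report = compare_headers(private_csv, mock_csv)

print(report)
```

In the demo the mock file is missing the bmi column, so the report flags it. An empty missing_in_mock and extra_in_mock with same_order set to True means the headers match.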

3. Best Practices for Mock Data

Mock data is the only file content from your dataset that a Data Scientist can see. To be effective, it should:

  • Match the Schema: Have identical column names and data types (int, float, string).
  • Reflect Distributions: The values are fake, but keeping their ranges and overall distribution similar to the real data helps the Data Scientist build better models.
  • Be Small: Usually 10–20 rows are enough for the Data Scientist to verify that their code works.
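
A minimal sketch of how such mock rows could be generated, assuming the private data has already been read into a list of dictionaries (for example with csv.DictReader). The helper names and sample values are hypothetical. Note that this version reuses the exact minimum and maximum of each numeric column, which may itself reveal information; widening or rounding those bounds is a reasonable precaution.

```python
import random


def infer_type(values):
    """Return int, float, or str: the narrowest type that parses every value."""
    for caster in (int, float):
        try:
            for v in values:
                caster(v)
            return caster
        except ValueError:
            continue
    return str


def make_mock_rows(private_rows, n_rows=15, seed=0):
    """Build fake rows with the same columns, types, and numeric ranges."""
    rng = random.Random(seed)
    columns = list(private_rows[0].keys())
    generators = {}
    for col in columns:
        values = [row[col] for row in private_rows]
        kind = infer_type(values)
        if kind is int:
            lo, hi = min(map(int, values)), max(map(int, values))
            generators[col] = lambda lo=lo, hi=hi: rng.randint(lo, hi)
        elif kind is float:
            lo, hi = min(map(float, values)), max(map(float, values))
            generators[col] = lambda lo=lo, hi=hi: rng.uniform(lo, hi)
        else:
            # Never copy real strings; emit placeholders instead.
            generators[col] = lambda col=col: f"{col}_{rng.randint(1, 999)}"
    return [{col: generators[col]() for col in columns} for _ in range(n_rows)]


# Illustrative stand-in for rows read from a private CSV.
private_rows = [
    {"age": "50", "bmi": "33.6", "label": "positive"},
    {"age": "31", "bmi": "26.6", "label": "negative"},
    {"age": "44", "bmi": "29.1", "label": "negative"},
]
mock_rows = make_mock_rows(private_rows, n_rows=15)
print(mock_rows[0])
```

The generated rows can then be written out with csv.DictWriter and registered as the Mock File Path.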

4. Maintenance & Updates

Data is rarely static. Here is how you manage your assets over time:

Editing a Dataset

You can update descriptions or paths at any time.

  • Navigate to Datasets and click View Details on your target dataset.
  • Click Edit to modify paths if you move your files locally.
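
Before entering a new path after moving files, it can help to confirm that the path is absolute and points to an existing file. The following is a small standard-library sketch; check_asset_path is a hypothetical helper, not a SyftBox function.

```python
import tempfile
from pathlib import Path


def check_asset_path(path_str):
    """Return a list of problems with a candidate path (empty list means OK)."""
    path = Path(path_str).expanduser()
    problems = []
    if not path.is_absolute():
        problems.append("not an absolute path")
    if not path.is_file():
        problems.append("file does not exist")
    return problems


# Demo: one valid throwaway file and one bad relative path.
with tempfile.TemporaryDirectory() as tmp:
    moved_file = Path(tmp) / "train.csv"
    moved_file.write_text("age,bmi\n50,33.6\n", encoding="utf-8")
    good = check_asset_path(str(moved_file))

bad = check_asset_path("relative/missing.csv")
print(good, bad)
```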

Deleting Data

  • If you no longer wish to participate in a project, you can delete the dataset from the dashboard.
  • Effect: This immediately removes your metadata from the SyftBox network. Any future jobs targeting this dataset ID will fail to find your node.

[Coming soon: Dataset Versioning]

  • Instructions on how to handle schema changes and versioned releases of your data assets.

Security FAQ

  • Can a Data Scientist (DS) see my private file path? No. They only see the name you gave the dataset and the contents of your mock file.
  • Is my data uploaded to a cloud? No. SyftBox uses "Compute-to-Data." Your private data remains on your hard drive; only the model training code comes to you.

Next Step: Head over to the Reviewing FL Job Proposals guide to learn how to audit and approve incoming training requests from Data Scientists.