@yeshwanthv5
Last active June 1, 2020 07:23
Federated Learning Notes

The paper discusses the system design and practical challenges of deploying a large-scale federated learning system.

Practical Issues

  • Device availability correlates with data distribution
    • E.g. time zones - if updates run at a fixed time of day, the available devices are likely to come from one particular time zone
  • Device interruption
    • Can be caused by unreliable connection
  • Limited device storage, compute, battery

Notions

  • Devices: the participating client devices
  • FL server: a cloud-based service that coordinates training
  • Round: select potential devices - run the model locally on each - send updates back to the server
  • FL plan: the server tells the clients what to execute - a TensorFlow graph plus instructions on how to run it
    • Communication happens in the form of exchanging TensorFlow checkpoints.
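The server-side aggregation step of a round can be sketched as a weighted average of the reported client updates (Federated Averaging). This is a minimal illustration, not the paper's actual implementation; the function name and flattened-checkpoint representation are assumptions.

```python
# Minimal sketch of the server-side aggregation in one federated round
# (Federated Averaging). Illustrative only - real checkpoints are
# TensorFlow tensors, not flat Python lists.

def federated_average(updates):
    """Combine client updates as a weighted average.

    `updates` is a list of (num_examples, weights) pairs, where
    `weights` is a flattened model update and `num_examples` is the
    size of that client's local dataset.
    """
    total_examples = sum(n for n, _ in updates)
    dim = len(updates[0][1])
    averaged = [0.0] * dim
    for n, w in updates:
        for i in range(dim):
            averaged[i] += (n / total_examples) * w[i]
    return averaged

# Example: two clients, weighted by local dataset size (10 vs. 30 examples).
print(federated_average([(10, [1.0, 2.0]), (30, [3.0, 4.0])]))
# -> [2.5, 3.5]
```

Weighting by example count means a device with more local data pulls the global model further toward its update.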

Selection

  • Eligibility criteria
    • Device must be plugged in
    • Device must be on an unmetered internet connection
  • Reservoir sampling
    • More sophisticated sampling methods may be required to remove selection bias
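The reservoir sampling mentioned above (Algorithm R) picks a uniform sample of k devices from a check-in stream of unknown length. A small sketch, assuming devices arrive as an iterable:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Uniformly sample k items from a stream of unknown length
    (Vitter's Algorithm R). Each item ends up in the sample with
    probability k / n, where n is the total stream length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # replace with decreasing probability
            if j < k:
                sample[j] = item
    return sample

# Example: pick 3 devices out of a stream of 10,000 check-ins.
devices = (f"device-{i}" for i in range(10_000))
print(reservoir_sample(devices, k=3, rng=random.Random(42)))
```

Uniformity over the stream is exactly what can introduce selection bias when availability itself is skewed (e.g. by time zone), which is why the notes flag the need for more sophisticated sampling.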

Reporting

  • Server waits for devices to report back after running their local updates.
  • If the required threshold number of devices reports in time, the round succeeds; otherwise it is abandoned.
  • The tolerance for straggling devices (devices dropping out without communicating results) is configurable.
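The success/abandon decision above reduces to a simple threshold check. A sketch, with the function name and the default goal fraction chosen for illustration:

```python
def round_outcome(selected, reported, goal_fraction=0.8):
    """Decide a round's fate: enough of the selected devices must
    report back before the deadline. `goal_fraction` is the
    configurable tolerance for stragglers (here an assumed default)."""
    if selected == 0:
        return "abandoned"
    return "success" if reported / selected >= goal_fraction else "abandoned"

print(round_outcome(selected=100, reported=85))  # -> success
print(round_outcome(selected=100, reported=60))  # -> abandoned
```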

Pace Steering

  • Scale down the selection when few devices are available, scale up when more devices are available
  • Set an optimal time window for devices to connect back
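One simple way to realize this: have the server suggest a check-back delay proportional to how oversubscribed it currently is. This heuristic is my own illustration, not the paper's pace steering algorithm.

```python
def suggested_checkin_delay(connected, target, base_delay=60):
    """Illustrative pace-steering heuristic (not from the paper):
    when more devices are connected than a round needs, tell each
    device to wait longer before reconnecting; when too few, shorten
    the wait. Delay is in seconds, floored at 1."""
    ratio = connected / target
    return max(1, int(base_delay * ratio))

print(suggested_checkin_delay(connected=10_000, target=1_000))  # -> 600
print(suggested_checkin_delay(connected=500, target=1_000))     # -> 30
```

Spreading check-ins over a window like this smooths server load and avoids thundering-herd reconnects after a round ends.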

Device

  • Maintains a repository of local data
    • Upper limit on this data - remove old data when it goes beyond this
    • Encryption recommended - the data is sensitive
  • Periodic job is scheduled by FL runtime
  • Avoid any negative impact on device experience - battery, data usage - use only when plugged in and connected to unmetered network
    • Abort if any condition is no longer met
  • This job tells the server: "I am ready to execute your program"
  • The server may respond: "You may execute" or "You are not required at this time; check back again later"
  • If selected, the device executes the model according to the plan sent by the server.
  • After completion, it reports back to the server and frees all resources.
  • Issues
    • Multi-tenancy: Multiple FL routines should be handled without overloading the device
    • Attestation: since users are not authenticated (to preserve privacy), we need a way to protect against attacks by non-genuine devices
      • Protection against model steering by "content farms"
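The device-side lifecycle described above - check eligibility, ask the server whether to participate, abort if conditions change mid-round - can be sketched as follows. Function names and the dict-based device state are assumptions for illustration.

```python
def eligible(device):
    """Eligibility criteria from the notes: the device must be
    plugged in and on an unmetered network."""
    return device["plugged_in"] and device["unmetered"]

def checkin_once(device, server_decision):
    """One device check-in. `server_decision` is a stand-in for the
    real FL server's response ("selected" or a deferral)."""
    if not eligible(device):
        return "skipped"            # don't even contact the server
    if server_decision != "selected":
        return "try-later"          # server: check back again later
    # ... download the FL plan, train on local data ...
    if not eligible(device):        # conditions can change mid-round
        return "aborted"
    return "reported"               # send checkpoint, free resources

dev = {"plugged_in": True, "unmetered": True}
print(checkin_once(dev, "selected"))  # -> reported
```

The second eligibility check models the "abort if any condition is no longer met" rule: a device unplugged mid-training drops out without reporting, becoming one of the stragglers the server's threshold must tolerate.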

Server

  • Major issue is the scale of devices - 10 to 100 million
  • Update sizes can range from KBs to MBs
  • Geographic region affects the number of active devices at a given time

Reference: K. Bonawitz et al., "Towards Federated Learning at Scale: System Design" - https://arxiv.org/pdf/1902.01046.pdf
