yeshwanthv5/federated_averaging_paper.md

Last active June 1, 2020 07:23


Federated Learning Notes

federated_averaging_paper.md

First paper to introduce federated learning; it proposes the Federated Averaging (FedAvg) algorithm.
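
A minimal sketch of the server-side averaging step in FedAvg, assuming each client sends back a flat weight vector along with its local example count. The function name and the toy numbers are illustrative only, not taken from the paper's code.

```python
def federated_average(client_weights, client_sizes):
    """Average client weight vectors, weighting each by its local example count."""
    total = sum(client_sizes)
    if total == 0:
        raise ValueError("no training examples reported")
    averaged = [0.0] * len(client_weights[0])
    for weights, n_k in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            # Client k contributes in proportion n_k / total.
            averaged[i] += (n_k / total) * w
    return averaged


# Two hypothetical clients holding 1 and 3 local examples.
new_global = federated_average([[0.0, 4.0], [4.0, 0.0]], [1, 3])
print(new_global)
```

Clients with more local data pull the global model further toward their own weights.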

References: https://arxiv.org/pdf/1602.05629.pdf

system_design_paper.md

Paper covers the system design and challenges of deploying a large-scale federated learning system.

Practical Issues

Device availability correlating with data distribution
- E.g. timezone availability - if we run an update at a certain time, the available devices are likely to come from one particular timezone
Device interruption
- Can be caused by unreliable connection
Limited device storage, compute, battery

Notions

Devices: Participating devices
FL server: cloud based service
Round: Select candidate devices - run the model locally on each - report updates back to the server
FL Plan: What the server tells the clients to execute - a TensorFlow graph plus instructions on how to run it
- Communication happens by exchanging TensorFlow checkpoints.

Selection

Eligibility criterion
- To be plugged in
- Unmetered internet connection
Devices are picked by reservoir sampling
- A more sophisticated sampling method would be needed to remove selection bias
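
The selection step above can be sketched as an eligibility filter followed by classic reservoir sampling (Algorithm R). This is only an illustration: the device dictionaries and field names are made up, and the real system's selection logic is not shown in these notes.

```python
import random


def is_eligible(device):
    """Eligibility as listed above: plugged in and on an unmetered connection."""
    return device["plugged_in"] and device["unmetered"]


def reservoir_sample(stream, k, rng=random):
    """Pick k items uniformly at random from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical check-ins: every third device is on a metered connection.
devices = [{"id": n, "plugged_in": True, "unmetered": n % 3 != 0} for n in range(30)]
eligible = (d for d in devices if is_eligible(d))
selected = reservoir_sample(eligible, k=5, rng=random.Random(0))
print([d["id"] for d in selected])
```

Reservoir sampling suits this setting because devices check in as a stream and the server does not know the total count in advance.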

Reporting

Server waits for devices to report after running their local updates.
If the required threshold number of devices report, the round succeeds; otherwise it is abandoned.
Tolerance for straggling devices (devices that drop out without communicating their results) is configurable
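
A small sketch of the reporting rule above, assuming the threshold is expressed as a minimum number of reports. `min_reports` is a hypothetical parameter name; the paper's actual configuration fields are not covered in these notes.

```python
def round_outcome(num_selected, num_reported, min_reports):
    """Return (status, stragglers) for one round."""
    # Stragglers are selected devices that never reported back.
    stragglers = num_selected - num_reported
    status = "success" if num_reported >= min_reports else "abandoned"
    return status, stragglers


print(round_outcome(num_selected=100, num_reported=85, min_reports=80))
print(round_outcome(num_selected=100, num_reported=70, min_reports=80))
```

Selecting more devices than `min_reports` is what gives the round its tolerance for stragglers.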

Pace Steering

Regulates the rate of device check-ins - scale down when few devices are available, scale up when more are
Server suggests an optimal time window for devices to reconnect
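
One possible way to picture the reconnect window, purely as an assumption: with a small population every device is told to return at the next round, while with a large population check-ins are spread randomly over a longer window so the server is not hit all at once. The function, its parameters, and the scaling rule are all hypothetical and not taken from the paper.

```python
import random


def suggest_reconnect_delay(population, devices_needed, round_period_s, rng=random):
    """Return how many seconds a device should wait before checking in again."""
    if population <= devices_needed:
        # Small population: everyone should show up for the next round.
        return float(round_period_s)
    # Large population: widen the window in proportion to the surplus
    # of devices and pick a random point inside it.
    window = round_period_s * population / devices_needed
    return rng.uniform(0, window)


rng = random.Random(1)
small = suggest_reconnect_delay(50, 100, 600, rng)
large = suggest_reconnect_delay(10_000, 100, 600, rng)
print(small, large)
```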

Device

Maintains a repository of local data
- Upper limit on this data - remove old data when it goes beyond this
- Encryption recommended - vulnerable data
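
The capped local store can be sketched with a fixed-size queue that drops the oldest entries first. This is a minimal sketch assuming "old" means insertion order; `ExampleStore` is a made-up name, and encryption at rest is left out.

```python
from collections import deque


class ExampleStore:
    """Keeps at most max_examples items, discarding the oldest first."""

    def __init__(self, max_examples):
        # A deque with maxlen evicts from the left when it is full.
        self._examples = deque(maxlen=max_examples)

    def add(self, example):
        self._examples.append(example)

    def snapshot(self):
        return list(self._examples)


store = ExampleStore(max_examples=3)
for n in range(5):
    store.add(n)
print(store.snapshot())
```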
A periodic job is scheduled by the FL runtime
Avoid any negative impact on device experience (battery, data usage) - run only when plugged in and connected to an unmetered network
- Abort if any condition is no longer met
This job communicates to server - "Hey I am ready to execute your program"
Server can decide - "Sure you may execute" or it might say - "You are not required at this time you may check back again at "
If selected, the device executes the model according to the plan sent by the server.
After completion - report back to the server and free all resources
Issues
- Multi-tenancy: Multiple FL routines should be handled without overloading the device
- Attestation: Since users are not authenticated (for privacy reasons), we need a way to protect against attacks by non-genuine devices.
  - Protection against model steering from 'content farms'
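
The device-side job described in this section (check conditions, check in, run the plan, abort if conditions change, report) could be sketched as below. All names are hypothetical; the server's answer is reduced to a boolean and the plan to a list of callables.

```python
def conditions_met(state):
    """Only run while plugged in and on an unmetered network."""
    return state["plugged_in"] and state["unmetered"]


def run_fl_job(read_state, server_accepts, plan_steps):
    """Run one scheduled job and return its final status."""
    if not conditions_met(read_state()):
        return "skipped"
    if not server_accepts:
        return "rejected"  # server asked the device to check back later
    for step in plan_steps:
        # Re-check before every step; abort as soon as a condition fails.
        if not conditions_met(read_state()):
            return "aborted"
        step()
    return "reported"


ok = {"plugged_in": True, "unmetered": True}
unplugged = {"plugged_in": False, "unmetered": True}

# The device is unplugged just before the second plan step.
states = iter([ok, ok, unplugged])
status = run_fl_job(lambda: next(states), True, [lambda: 1, lambda: 2])
print(status)
```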

Server

Major issue is the scale of devices - 10 to 100 million
Update size - can range from KBs to MBs
Geographic region affects the number of active devices

References: https://arxiv.org/pdf/1902.01046.pdf