The paper covers the system design and challenges of deploying a large-scale federated learning system.
Practical Issues
- Device availability correlating with data distribution
- E.g., time zones: if an update runs at a certain time of day, the available devices are likely to come from one particular time zone
- Device interruption
- Can be caused by unreliable connection
- Limited device storage, compute, battery
Notions
- Devices: Participating devices
- FL server: cloud based service
- Round: select potential devices, run the model locally on each, report the update back to the server
- FL Plan: tells the clients what needs to be executed. It contains a TensorFlow graph and instructions on how to run it
- Communication happens by exchanging TensorFlow checkpoints.
Selection
- Eligibility criteria
- Device is plugged in
- Unmetered internet connection
- Devices are picked from the eligible ones by reservoir sampling
- A more sophisticated sampling method would be required to remove selection bias
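A minimal sketch of reservoir sampling (the classic Algorithm R), to make the selection step concrete. The function name and arguments are illustrative and not taken from the paper. It picks `k` devices uniformly at random from a stream of check-ins whose total length is not known in advance.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Hypothetical helper for illustration only: `stream` stands in for
    devices checking in with the server one at a time.
    """
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    device_ids = (f"device-{n}" for n in range(10_000))
    print(reservoir_sample(device_ids, 5))
```

Every device that checks in has the same chance of being kept. This does not remove the selection bias noted above, since only available devices ever reach the stream.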
Reporting
- Server waits for devices to report after running their local updates.
- If enough devices report to meet the required threshold, the round succeeds; otherwise it is abandoned.
- The number of stragglers tolerated (devices that drop out without reporting their results) is configurable
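The success rule for a round can be sketched as a small check. This assumes the tolerance is expressed as a maximum number of stragglers. The names `round_succeeded` and `max_stragglers` are made up for this example, and the real system may express the threshold differently.

```python
def round_succeeded(num_selected: int, num_reported: int, max_stragglers: int) -> bool:
    """Decide whether a round counts as a success.

    Assumption for illustration: the round needs at least
    (num_selected - max_stragglers) reports to succeed.
    """
    if not 0 <= max_stragglers <= num_selected:
        raise ValueError("max_stragglers must be between 0 and num_selected")
    if not 0 <= num_reported <= num_selected:
        raise ValueError("num_reported must be between 0 and num_selected")
    min_reports = num_selected - max_stragglers
    return num_reported >= min_reports


if __name__ == "__main__":
    # 100 devices selected, up to 30 may drop out.
    print(round_succeeded(100, 80, 30))  # enough reports, round succeeds
    print(round_succeeded(100, 60, 30))  # too few reports, round abandoned
```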
Pace Steering
- Scale down when few devices are available, scale up when more devices are available
- The server sets a suitable time window in which devices should connect back
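One way to picture the reconnect window is the sketch below. The formula is an assumption made for illustration and is not the paper's algorithm: the window grows with the population so that check-ins spread evenly over it arrive at roughly a target rate, and each device gets a random delay inside that window. All names and default values are hypothetical.

```python
import random


def reconnect_window_s(population_size: int,
                       target_checkins_per_s: float,
                       min_window_s: float = 60.0) -> float:
    """Length of the reconnect window in seconds (illustrative formula)."""
    if target_checkins_per_s <= 0:
        raise ValueError("target_checkins_per_s must be positive")
    return max(min_window_s, population_size / target_checkins_per_s)


def suggest_reconnect_delay_s(population_size: int,
                              target_checkins_per_s: float,
                              rng=None) -> float:
    """Random delay within the window, so devices do not all return at once."""
    if rng is None:
        rng = random.Random()
    window = reconnect_window_s(population_size, target_checkins_per_s)
    return rng.uniform(0.0, window)


if __name__ == "__main__":
    # Small population: short window so devices come back soon.
    print(reconnect_window_s(10, 5.0))
    # Large population: long window to spread the load on the server.
    print(reconnect_window_s(1_000_000, 100.0))
    print(suggest_reconnect_delay_s(1_000_000, 100.0))
```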
Device
- Maintains a repository of local data
- There is an upper limit on this data: old data is removed once the limit is exceeded
- Encryption is recommended, since the data is sensitive
- Periodic job is scheduled by FL runtime
- Avoid any negative impact on device experience - battery, data usage - use only when plugged in and connected to unmetered network
- Abort if any condition is no longer met
- This job communicates to server - "Hey I am ready to execute your program"
- Server can decide - "Sure you may execute" or it might say - "You are not required at this time you may check back again at "
- If selected, the device executes the model according to the plan sent by the server.
- After completion - report back to the server - free all resources
- Issues
- Multi-tenancy: multiple FL routines should be handled without overloading the device
- Attestation: since users are not authenticated because of privacy concerns, we need a way to protect against attacks by non-genuine devices.
- Protection against model steering from 'content farms'
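The device-side job flow described above can be sketched as follows. This is a simplified illustration and not the actual FL runtime. `DeviceState`, `run_job`, and the callbacks are hypothetical names, and the server's reply is reduced to a yes/no answer.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class DeviceState:
    plugged_in: bool
    on_unmetered_network: bool


def is_eligible(state: DeviceState) -> bool:
    """Conditions from the notes: plugged in and on an unmetered network."""
    return state.plugged_in and state.on_unmetered_network


def run_job(get_state: Callable[[], DeviceState],
            server_accepts: Callable[[], bool],
            plan_steps: Iterable[Callable[[], None]]) -> str:
    """Run one scheduled job; return 'skipped', 'rejected', 'aborted' or 'reported'."""
    if not is_eligible(get_state()):
        return "skipped"    # do not even check in
    if not server_accepts():
        return "rejected"   # server asks the device to check back later
    for step in plan_steps:
        # Re-check conditions before every step; abort if one no longer holds.
        if not is_eligible(get_state()):
            return "aborted"
        step()
    return "reported"       # results sent, resources can be freed


if __name__ == "__main__":
    state = DeviceState(plugged_in=True, on_unmetered_network=True)
    steps = [lambda: print("load plan"), lambda: print("train locally")]
    print(run_job(lambda: state, lambda: True, steps))
```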
Server
- Major issue is scale of devices - 10 to 100 million
- Update size - can be in KBs or MBs
- Geographic region affects the number of active devices
References: https://arxiv.org/pdf/1902.01046.pdf