
@deviantony
Last active May 4, 2022 18:14
Swarm aggregation feature discussion

Swarm aggregation specs

Functional requirements

  • Ability to aggregate data from multiple nodes in a Swarm cluster
  • Easy to deploy solution
  • Works on both Windows and Linux platforms (and thus, multi-platform clusters)

Using an agent

Using an agent is an interesting way to go here. We can deploy an agent inside a Swarm overlay network using the global deployment mode. This schedules the agent to run on each node of the cluster and automatically creates/removes agents when nodes are added to/removed from the cluster.

The agent should act as a proxy to the Docker API (similar to what Portainer is doing at the moment via the /endpoints/<ID>/docker endpoint). It should be able to attach to the Docker API of a Docker node using the Unix socket or a TCP URL, with or without TLS (for Windows hosts, support for named pipes could be added later on).

Communication between Portainer and the agents must be secured/encrypted. An authentication mechanism is also required here to ensure that not just anyone can query the Docker API through the agent.

The agent should also be deployable on a standalone Docker host, simplifying the management of standalone hosts as well (no need to expose the Docker API anymore, for example).

Potential evolutions:

  • The agent could be used to expose node metrics (disk, CPU, memory usage, ...)

Possible implementations

Have a look at the following files for possible implementations.

Cluster of agents

This implementation relies on the agent being able to auto-discover other agents. Deployed as a global service inside a Swarm cluster, each agent automatically discovers the other agents in the cluster and registers them.

Portainer can then be plugged into any of these agents (either by using DNS-SRV records to ensure high availability, or by using the URL of a specific agent). To do so, a user would just need to create a new endpoint and set its URL to the IP:PORT of one of the agents in the cluster (or use the Swarm service name to rely on DNS-SRV records).

The agent would be responsible for the following:

  • Aggregate the data of multiple nodes (list the containers available in the cluster, for example)
  • Redirect requests to specific nodes in the cluster (inspect a container on a specific node, or create a new secret via a cluster manager, for example)

This has the advantage of being a totally transparent solution from Portainer's point of view. Changes in the Portainer codebase would be limited. For example, when querying the list of containers, Portainer would just send the /containers/json request to the agent, which would take care of the data aggregation and return the response.

On startup, the agent should do the following:

  • Retrieve information about the Docker engine where the agent is running: is it a Swarm manager? What version of the API is it using?
  • Auto-discover the other agents inside the network where the agent has been started and register them

When querying the Docker API via an agent, some queries should be intercepted and rewritten/redirected. Some examples are:

  • GET /containers/json: The agent should execute that request against all the existing nodes and aggregate the data into a new response object.

IMPORTANT: When aggregating data, the response must be as close as possible to the Docker API response and thus the agent should only decorate the response and not create a different response object. We don't want to create a new Docker API and should stay compatible.

  • POST /services: The agent should redirect the request to a manager node inside the cluster as this query can only be executed on manager nodes.

  • GET /containers/<ID>/json: The agent should redirect the request to the node where the container is located. To do so, a reference to the node where the container is located can be passed inside an HTTP header.

The advantages of this solution are the following:

  • Simple to deploy: just deploy a global agent service inside an overlay network
  • Highly available: you can connect the Portainer endpoint to any agent located on any node inside your Swarm cluster (even more HA when using DNS-SRV)
  • Portainer no longer needs to be connected to a Swarm manager; the agent takes care of redirecting requests to specific nodes in the cluster.
  • Transparent: does not require many changes inside the Portainer codebase. UAC is still managed inside the Portainer API.

Example usage:

  • Create your Docker Swarm cluster
  • Create a new overlay network inside the cluster: docker network create --driver overlay portainer_agent
  • Deploy the agent as a global service inside the cluster: docker service create --mode global --network portainer_agent portainer/agent (whether a port should be exposed depends on whether we use the DNS-SRV approach)
  • Within Portainer, create a new endpoint and either put the IP:PORT of an agent in the endpoint URL or use the service name for the DNS-SRV approach.

Example Compose:

version: "3"
services:
  portainer-agent:
    image: portainer/agent
    # ports:
    #  - "6000:6000"
    deploy:
      mode: global
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
  portainer:
    deploy:
      replicas: 1
    image: portainer/portainer
    volumes:
      - "/path/to/data:/data"

Technical details

  • Communications between Portainer and the agents are one-way only (that is, Portainer -> agents). At no time should an agent need to communicate with the Portainer instance.
  • Memberlist or even Serf can be used to register/manage the list of agents for each agent (with metadata such as role in the Swarm cluster, API version...).
  • No socat-type service to deploy; the agent itself should act as a reverse proxy to the Docker API.
  • The agent should be isolated from the Portainer codebase and have its own Docker image (portainer/agent).
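The member registration mentioned above can be backed by something as small as the following in-memory registry. This is a stdlib sketch with illustrative names; in a real agent, Register/Deregister would be driven by Memberlist or Serf join/leave events:

```go
package main

import (
	"fmt"
	"sync"
)

// member stores the metadata each agent keeps about its peers
// (Swarm role, API version, ...).
type member struct {
	Addr       string
	Manager    bool
	APIVersion string
}

// registry is a concurrency-safe list of known agents, keyed by address.
type registry struct {
	mu      sync.RWMutex
	members map[string]member
}

func newRegistry() *registry {
	return &registry{members: make(map[string]member)}
}

func (r *registry) Register(m member) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.members[m.Addr] = m
}

func (r *registry) Deregister(addr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.members, addr)
}

// Managers returns the addresses of the known manager nodes, so the agent
// can redirect manager-only operations.
func (r *registry) Managers() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	var out []string
	for _, m := range r.members {
		if m.Manager {
			out = append(out, m.Addr)
		}
	}
	return out
}

func main() {
	reg := newRegistry()
	reg.Register(member{Addr: "10.0.0.2:6000", Manager: true, APIVersion: "1.32"})
	reg.Register(member{Addr: "10.0.0.3:6000"})
	fmt.Println(reg.Managers()) // prints "[10.0.0.2:6000]"
}
```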

Things to investigate

  • How to secure the communications between Portainer and the agents
  • Authentication mechanism to prevent unauthenticated queries

Thomas' idea

Here are my thoughts on Swarm data aggregation as of today.

Two user scenarios

  1. If the Swarm is not yet registered in Portainer: a help section is available in Portainer that explains the following:

=> portainer-agent.yml

version: "3"
services:
  socat:
    image: "a-socat-image"
    deploy:
      mode: global
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
  agent:
    image: portainer/portainer
    command: --agent --agent-url=https://myswarm:6000 --cluster-name=PROD --portainer-url=http://url-to-portainer:port --token=aGeneratedTokenFromPortainerInstance
    volumes:
      - /my-path/portainer/agent-tls-config:/data/agent
    ports:
      - 6000:6000

Portainer provides the Compose file and the command to deploy it as an example; the user runs it on their Swarm.

Tony: I'm not a fan of having several services to deploy (a socat and an agent). I think the agent should play the role of proxy to the Docker API and therefore also be socat-like. That will simplify deployment.

Another thing: I think the agent should be separated from the Portainer codebase (with its own portainer/agent image).

On the agent's first startup, certificates are generated and placed in the volume, and the agent automatically registers the endpoint in Portainer. If only the agent option is set, it starts a plain agent with certificates, which can be added manually in Portainer.

The certificates are used for TLS communication between Portainer and the agents, right?

  2. If the Swarm is already an endpoint in Portainer: in that case, I think we can automate the process even further, with a button to convert a classic Swarm endpoint into a Portainer agent endpoint. For that, Portainer takes care of deploying the stack on the target Swarm and changes the endpoint (tcp:2376 or socket) into an agent-type endpoint (tcp:6000 on the target).

I don't think this process needs to be automated. It is enough to go to an endpoint, edit it, and replace its URL with the URL of an agent (we could easily add a migration section for existing endpoints to the FAQ).

We should perhaps add a note to the endpoint creation section to give more information about registering a Swarm cluster.

On the technical side

The agent will periodically fetch the DNS-SRV records of tasks.socat to find the IPs of the nodes. It will run a docker info to check whether they are managers/workers etc., and keep that in RAM. When it receives a request, it will use this information to return the right result, or to perform the right action.
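The discovery step above can be done with a plain DNS lookup: inside an overlay network, Docker's embedded DNS resolves tasks.<service> to one A record per running task. A sketch (the `discoverTasks` name and the tasks.portainer_agent service name are illustrative, following the naming used earlier in this document):

```go
package main

import (
	"fmt"
	"net"
)

// discoverTasks resolves a Docker Swarm service DNS name (for example
// "tasks.portainer_agent") and returns the task IPs as strings. Docker's
// embedded DNS answers with one A record per running task of the service.
func discoverTasks(host string) ([]string, error) {
	ips, err := net.LookupIP(host)
	if err != nil {
		return nil, err
	}
	addrs := make([]string, 0, len(ips))
	for _, ip := range ips {
		addrs = append(addrs, ip.String())
	}
	return addrs, nil
}

func main() {
	addrs, err := discoverTasks("tasks.portainer_agent")
	if err != nil {
		fmt.Println("discovery failed:", err)
		return
	}
	fmt.Println("agents:", addrs)
}
```

Running this periodically, then calling /info on each discovered address, gives the manager/worker map the agent keeps in RAM.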

For the first implementation phase, I'm thinking of:

Create a minimal stack with:

  • A basic socat agent in global mode
  • An agent in Go (or a Portainer launch option, --agent) that queries the DNS-SRV records to find the nodes and keeps this info in RAM
  • A basic request, such as listing containers, that runs against all the nodes and returns the aggregated data

(edited)
