TITLE: Feature request: Would like a way to refuse subsequent TCP connections while allowing current connections enough time to drain

### summary

This feature request was originally opened as https://github.com/envoyproxy/envoy/issues/2920, but was too specific about the implementation. This issue updates the title and content to clarify the goals and be flexible about the implementation.

Given I've configured Envoy with LDS serving a TCP proxy listener on some port, and there are connections in flight, I would like a way to refuse subsequent TCP connections to that port while allowing currently established connections to drain.

We tried the following approaches, but none of them achieve our goals:

- updating the LDS to remove the listener
- signalling Envoy with `SIGTERM`
- sending a `GET` request to `/healthcheck/fail`

### steps to reproduce

Write a `bootstrap.yaml` like

```
---
admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901
node:
  id: some-envoy-node
  cluster: some-envoy-cluster
dynamic_resources:
  lds_config:
    path: /cfg/lds-current.yaml
static_resources:
  clusters:
    - name: example_cluster
      connect_timeout: 0.25s
      type: STATIC
      lb_policy: ROUND_ROBIN
      hosts:
        - socket_address:
            address: 93.184.216.3  # IP address of example.com
            port_value: 80
```

Write an `lds-current.yaml` file like

```
version_info: "0"
resources:
  - "@type": type.googleapis.com/envoy.api.v2.Listener
    name: listener_0
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 8080
    filter_chains:
      - filters:
          - name: envoy.tcp_proxy
            config:
              stat_prefix: ingress_tcp
              cluster: example_cluster
```

Launch Envoy (I'm using v1.6.0):

```
envoy -c /cfg/bootstrap.yaml --v2-config-only --drain-time-s 30
```

Confirm that the TCP proxy is working:

```
curl -v -H 'Host: example.com' 127.0.0.1:8080
```

#### Removing the listener

Next, update the LDS to return an empty set of listeners. This is a two-step process.
First, write an empty LDS response file, `lds-empty.yaml`:

```
version_info: "1"
resources: []
```

Second, move that file on top of the file being watched:

```
mv lds-empty.yaml lds-current.yaml
```

In the Envoy stdout logs you'll see a line:

```
source/server/lds_api.cc:68] lds: remove listener 'listener_0'
```

Attempt to connect to the port where the listener used to be:

```
curl -v -H 'Host: example.com' 127.0.0.1:8080
```

##### expected behavior

Would like to see all new TCP connections be refused immediately, as if a listener had never been added in the first place. Existing TCP connections should continue to be serviced.

##### actual behavior

The port is still open, even after the LDS update occurs:

```
lsof -i
COMMAND PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
envoy     1 root    9u  IPv4  30166      0t0  TCP *:9901 (LISTEN)
envoy     1 root   22u  IPv4  30171      0t0  TCP *:8080 (LISTEN)
```

Clients can connect to the port, but the TCP proxying seems to hang (can't tell where):

```
curl -H 'Host: example.com' -v 127.0.0.1:8080
* Rebuilt URL to: 127.0.0.1:8080/
*   Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 8080 (#0)
> GET / HTTP/1.1
> Host: example.com
> User-Agent: curl/7.47.0
> Accept: */*
>
^C
```

This state remains until the `--drain-time-s` time has elapsed (30 seconds in this example).
At that point the port is finally closed, so you see:

```
curl 127.0.0.1:8080
curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
```

#### pkill -SIGTERM envoy

If, instead of removing the listeners, we signal Envoy:

```
pkill -SIGTERM envoy
```

Envoy exits immediately without allowing current connections to drain:

```
[2018-03-28 17:42:06.995][3563][warning][main] source/server/server.cc:312] caught SIGTERM
[2018-03-28 17:42:06.995][3563][info][main] source/server/server.cc:357] main dispatch loop exited
[2018-03-28 17:42:07.004][3563][info][main] source/server/server.cc:392] exiting
```

##### expected behavior

Would like to see all new TCP connections be refused immediately, as if a listener had never been added in the first place. Existing TCP connections should continue to be serviced.

##### actual behavior

Envoy exits immediately, as the logs above show: the listening port is closed, but established connections are severed rather than given time to drain.

#### background

In Cloud Foundry, we have the following setup currently:

```
     Router      =====>  Envoy  ---->  App
(shared ingress)  (TLS)          TCP
```

Each application instance has a sidecar Envoy which terminates TLS connections from the shared ingress router.
Applications may not speak HTTP, so we use basic TCP connectivity checks from the shared Router to the Envoy in order to infer application health and determine whether a client connection should be load-balanced to that Envoy: when the upstream Envoy accepts the TCP connection, the Router considers that upstream healthy; when the upstream refuses the TCP connection, the Router considers it unhealthy.

During a graceful shutdown, the scheduler ought to be able to drain the Envoy before terminating the application. This means the Envoy ought to service any in-flight TCP connections without accepting any new ones.
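For what it's worth, the semantics we're after can be sketched at the socket level (plain Python rather than Envoy; the names below are illustrative, not part of any Envoy API): closing only the listening socket makes the kernel refuse new connections immediately, while connections that were accepted earlier keep being serviced until they finish.

```python
import socket
import threading

def echo_once(conn: socket.socket) -> None:
    # Service one request on an already-established connection.
    conn.sendall(conn.recv(1024))
    conn.close()

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # ephemeral port, for illustration
port = listener.getsockname()[1]
listener.listen()

# Establish an "in-flight" connection while the listener is still open.
existing = socket.create_connection(("127.0.0.1", port))
accepted, _ = listener.accept()
threading.Thread(target=echo_once, args=(accepted,)).start()

# The drain step: stop accepting by closing the listening socket only.
listener.close()

# New connections are now refused immediately...
try:
    socket.create_connection(("127.0.0.1", port), timeout=1).close()
    new_connection_refused = False
except ConnectionRefusedError:
    new_connection_refused = True

# ...but the established connection is still serviced.
existing.sendall(b"in-flight request")
reply = existing.recv(1024)
existing.close()

print("new connection refused:", new_connection_refused)  # True
print("existing connection reply:", reply)                # b'in-flight request'
```

This is essentially what we'd like Envoy to do on the listener's behalf during a drain: release the listening socket right away while continuing to proxy established connections to completion.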