MikroTik RouterOS v7 dual DHCP WAN recursive failover w/ PCC load-balancing; and recursive ECMP

TL;DR: Default-Config Dual WAN PCC + Recursive Failover

This paste assumes a hardware MikroTik RouterBOARD with the standard MikroTik default config — hAP, hEX, RB5009-class, etc. — where ether1 is the WAN port and ether2 through etherN are LAN bridge ports. The paste removes ether2 from the LAN bridge and turns it into WAN2.

If your router is not in that default state — CHR, multi-WAN appliances, anything reconfigured, or anything where ether1 is not your WAN — read the full guide and substitute interface names. The lab report's §8.7 TL;DR validation shows the kind of remap CHR needs before this paste is safe.

Resulting layout after the paste:

WAN1 = ether1, DHCP
WAN2 = ether2, DHCP   (was a LAN bridge port; this paste moves it)
LAN  = bridge, 192.168.88.0/24
DNS  = router forwards to 1.1.1.1 and 8.8.8.8

Probe set, kept disjoint from DNS resolvers so a probe outage and a DNS outage are independent events:

WAN1 route probes: 64.6.64.6, 9.9.9.9, 208.67.222.222
WAN2 route probes: 149.112.112.112, 64.6.65.6, 208.67.220.220

Run this from a local console, in Safe Mode, or over a management path that will not disappear when ether2 is removed from the LAN bridge. Review the interface names first.
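
Before pasting, a read-only look at the current state can confirm that the router matches the default layout described above. The following is a minimal sketch, not part of the original paste; it only prints configuration and assumes the default names ether1, ether2, and bridge.

```
# Read-only checks; nothing here changes configuration.
/interface print
/interface bridge port print
/interface list member print
/ip dhcp-client print
/ip firewall filter print where action=fasttrack-connection
```

On a default config, ether2 should appear as a bridge port, ether1 should carry the only DHCP client, and the last command shows whether a FastTrack rule exists for the paste to disable.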

Paste

/system backup save name=before-dual-wan-pcc
/export file=before-dual-wan-pcc

# Default config uses ether2 as LAN. Move it to WAN2.
/interface bridge port remove [find where interface=ether2]

# FastTrack bypasses mangle/routing-mark logic.
/ip firewall filter disable [find where action=fasttrack-connection]

# Interface lists. Default config usually already has WAN and LAN.
:if ([:len [/interface list find where name="WAN"]] = 0) do={/interface list add name=WAN}
:if ([:len [/interface list find where name="LAN"]] = 0) do={/interface list add name=LAN}
:if ([:len [/interface list member find where list="WAN" interface="ether1"]] = 0) do={/interface list member add list=WAN interface=ether1 comment="WAN1"}
:if ([:len [/interface list member find where list="WAN" interface="ether2"]] = 0) do={/interface list member add list=WAN interface=ether2 comment="WAN2"}
:if ([:len [/interface list member find where list="LAN" interface="bridge"]] = 0) do={/interface list member add list=LAN interface=bridge comment="LAN bridge"}

# Keep the router as LAN DNS, but do not use route probes as DNS servers.
/ip dns set allow-remote-requests=yes servers=1.1.1.1,8.8.8.8
/ip dhcp-server network set [find where address=192.168.88.0/24] gateway=192.168.88.1 dns-server=192.168.88.1

# Local destinations must bypass PCC.
/ip firewall address-list remove [find where comment="LAN subnet - bypass PCC"]
/ip firewall address-list add list=local address=192.168.88.0/24 comment="LAN subnet - bypass PCC"

# Policy routing tables.
/ip firewall mangle remove [find where comment~"PCC|route ISP|skip local destinations"]
/ip route remove [find where comment~"WAN1_RECURSIVE|WAN2_RECURSIVE|MAIN_RECURSIVE|to_ISP"]
/routing table remove [find where name="to_ISP1"]
/routing table remove [find where name="to_ISP2"]
/routing table add fib name=to_ISP1 comment="Policy table for traffic selected for ISP1"
/routing table add fib name=to_ISP2 comment="Policy table for traffic selected for ISP2"

# Recursive routes. Probes avoid Cloudflare/Google so those can remain DNS resolvers.
/ip route remove [find where comment~"WAN1_RECURSIVE|WAN2_RECURSIVE|MAIN_RECURSIVE|to_ISP"]
/ip route add disabled=yes dst-address=64.6.64.6/32 gateway=ether1 scope=10 target-scope=10 comment="WAN1_RECURSIVE"
/ip route add disabled=yes dst-address=9.9.9.9/32 gateway=ether1 scope=10 target-scope=10 comment="WAN1_RECURSIVE"
/ip route add disabled=yes dst-address=208.67.222.222/32 gateway=ether1 scope=10 target-scope=10 comment="WAN1_RECURSIVE"
/ip route add disabled=yes dst-address=149.112.112.112/32 gateway=ether2 scope=10 target-scope=10 comment="WAN2_RECURSIVE"
/ip route add disabled=yes dst-address=64.6.65.6/32 gateway=ether2 scope=10 target-scope=10 comment="WAN2_RECURSIVE"
/ip route add disabled=yes dst-address=208.67.220.220/32 gateway=ether2 scope=10 target-scope=10 comment="WAN2_RECURSIVE"
/ip route add check-gateway=ping distance=1 dst-address=0.0.0.0/0 gateway=64.6.64.6 scope=10 target-scope=11 comment="MAIN_RECURSIVE_WAN1"
/ip route add check-gateway=ping distance=1 dst-address=0.0.0.0/0 gateway=149.112.112.112 scope=10 target-scope=11 comment="MAIN_RECURSIVE_WAN2"
/ip route add check-gateway=ping distance=1 dst-address=0.0.0.0/0 gateway=9.9.9.9 routing-table=to_ISP1 scope=10 target-scope=11 comment="to_ISP1_primary_via_WAN1"
/ip route add check-gateway=ping distance=2 dst-address=0.0.0.0/0 gateway=208.67.220.220 routing-table=to_ISP1 scope=10 target-scope=11 comment="to_ISP1_backup_via_WAN2"
/ip route add check-gateway=ping distance=1 dst-address=0.0.0.0/0 gateway=64.6.65.6 routing-table=to_ISP2 scope=10 target-scope=11 comment="to_ISP2_primary_via_WAN2"
/ip route add check-gateway=ping distance=2 dst-address=0.0.0.0/0 gateway=208.67.222.222 routing-table=to_ISP2 scope=10 target-scope=11 comment="to_ISP2_backup_via_WAN1"

# PCC mangle rules.
/ip firewall mangle remove [find where comment~"PCC|route ISP|skip local destinations"]
/ip firewall mangle add chain=prerouting action=accept in-interface-list=LAN dst-address-list=local comment="skip local destinations before PCC"
/ip firewall mangle add chain=prerouting action=mark-connection in-interface-list=LAN dst-address-list=!local dst-address-type=!local connection-mark=no-mark new-connection-mark=ISP1_conn passthrough=yes per-connection-classifier=both-addresses-and-ports:2/0 comment="PCC bucket 0 -> ISP1"
/ip firewall mangle add chain=prerouting action=mark-connection in-interface-list=LAN dst-address-list=!local dst-address-type=!local connection-mark=no-mark new-connection-mark=ISP2_conn passthrough=yes per-connection-classifier=both-addresses-and-ports:2/1 comment="PCC bucket 1 -> ISP2"
/ip firewall mangle add chain=prerouting action=mark-routing in-interface-list=LAN connection-mark=ISP1_conn new-routing-mark=to_ISP1 passthrough=no comment="route ISP1_conn via to_ISP1"
/ip firewall mangle add chain=prerouting action=mark-routing in-interface-list=LAN connection-mark=ISP2_conn new-routing-mark=to_ISP2 passthrough=no comment="route ISP2_conn via to_ISP2"

# NAT. Default config usually already has this; add only if missing.
:if ([:len [/ip firewall nat find where chain="srcnat" action="masquerade" out-interface-list="WAN"]] = 0) do={/ip firewall nat add chain=srcnat action=masquerade out-interface-list=WAN ipsec-policy=out,none comment="masquerade both WANs"}

# DHCP clients. ether1 exists in default config; ether2 is added as WAN2.
/ip dhcp-client set [find where interface=ether1] add-default-route=no use-peer-dns=no use-peer-ntp=no comment="WAN1 DHCP - updates WAN1_RECURSIVE routes"
:if ([:len [/ip dhcp-client find where interface=ether2]] = 0) do={/ip dhcp-client add interface=ether2 add-default-route=no use-peer-dns=no use-peer-ntp=no comment="WAN2 DHCP - updates WAN2_RECURSIVE routes"} else={/ip dhcp-client set [find where interface=ether2] add-default-route=no use-peer-dns=no use-peer-ntp=no comment="WAN2 DHCP - updates WAN2_RECURSIVE routes"}

/ip dhcp-client set [find where interface=ether1] script=":if ([:tobool \$bound]) do={:local gw \$\"gateway-address\"; /ip/route/set [find where comment=\"WAN1_RECURSIVE\"] gateway=\$gw disabled=no;} else={/ip/route/set [find where comment=\"WAN1_RECURSIVE\"] disabled=yes;}; /ip/firewall/connection/remove [find where connection-mark=\"ISP1_conn\"];"

/ip dhcp-client set [find where interface=ether2] script=":if ([:tobool \$bound]) do={:local gw \$\"gateway-address\"; /ip/route/set [find where comment=\"WAN2_RECURSIVE\"] gateway=\$gw disabled=no;} else={/ip/route/set [find where comment=\"WAN2_RECURSIVE\"] disabled=yes;}; /ip/firewall/connection/remove [find where connection-mark=\"ISP2_conn\"];"

# Bootstrap already-bound leases without waiting for renew.
:delay 10s
:if ([:len [/ip dhcp-client get [find where interface=ether1] gateway]] > 0) do={/ip route set [find where comment="WAN1_RECURSIVE"] gateway=[/ip dhcp-client get [find where interface=ether1] gateway] disabled=no}
:if ([:len [/ip dhcp-client get [find where interface=ether2] gateway]] > 0) do={/ip route set [find where comment="WAN2_RECURSIVE"] gateway=[/ip dhcp-client get [find where interface=ether2] gateway] disabled=no}

Check

/ip dhcp-client print detail
/ip route print detail where comment~"WAN|MAIN|to_ISP"
/ip firewall mangle print stats
/ip firewall nat print stats

From a LAN client:

for i in $(seq 1 30); do curl -4 -s --max-time 10 https://ifconfig.me; echo; done | sort | uniq -c

You should see both ISP public IPs over multiple connections.
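
The same split can be checked from the router side by counting tracked connections per mark. This is a sketch that assumes the ISP1_conn and ISP2_conn marks created by the paste; the counts depend on whatever client traffic exists when you run it.

```
# Count tracked connections currently assigned to each ISP.
/ip firewall connection print count-only where connection-mark="ISP1_conn"
/ip firewall connection print count-only where connection-mark="ISP2_conn"
```

Under mixed traffic both counts should be non-zero. A count that stays at zero for one mark points at a mangle or route problem on that side.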

Probe set used here:

WAN1: 64.6.64.6, 9.9.9.9, 208.67.222.222
WAN2: 149.112.112.112, 64.6.65.6, 208.67.220.220
DNS resolvers: 1.1.1.1, 8.8.8.8

MikroTik RouterOS v7 — Dual DHCP WAN Recursive Failover with PCC Load Balancing

Last updated: 2026-05-07

This is a production-minded RouterOS v7 dual-WAN guide. It covers:

  • two DHCP WAN uplinks;
  • recursive gateway checks;
  • per-connection-classifier, or PCC, load balancing;
  • automatic failover inside each policy routing table;
  • recursive main-table defaults for router-originated traffic, with optional ECMP;
  • NAT masquerade across both WANs.

It is written as a guide, not just a pastebin. Review every interface name, subnet, probe IP, firewall rule, and route comment before using it on a live router.

Companion lab validation report: Vultr CHR lab setup, methodology, and results

Want the short pasteable version first? Start with TL;DR default-config paste, then come back here for the rationale and variants.


At a glance

| Aspect | This design |
| --- | --- |
| WAN type | Two DHCP uplinks (PPPoE/static covered in §9) |
| Load balancing | Per-connection (PCC), connection-sticky |
| Health check | Recursive route + check-gateway=ping (link state alone is not enough) |
| Failover | Per policy table; backup default at distance=2 |
| Failback | Automatic when the probe recovers |
| NAT | Masquerade on both WANs |
| IPv6 | Out of scope |
| Tested on | RouterOS 7.21.4 on CHR, Vultr lab (see companion report) |

This design is for: a single router with two ISPs where LAN clients should use both uplinks concurrently and survive a single-WAN failure without manual intervention.

This design is not: bandwidth bonding, session-preserving failover, BGP multihoming, or a substitute for a real firewall policy.

If you take one thing away from this guide: a recursive default route through a probe IP that itself resolves through the DHCP-learned gateway, paired with check-gateway=ping, is what makes failover trigger on upstream outages instead of only on link drops. Everything else in this document is plumbing around that idea. See §5 for the rationale and §6.6 for the routes themselves.


Common gotchas

These are the issues most likely to bite a reader applying this guide. Each links to the section that explains how the design handles it.

| Gotcha | What goes wrong | Where to read |
| --- | --- | --- |
| Cloud CHR or multi-port routers may not have ether1 as WAN | Pasting ether1/ether2 examples blindly can turn a management NIC into a "WAN" and lock you out | §1 cloud CHR warning, §2 |
| FastTrack bypasses mangle and routing marks | Traffic marked for to_ISP1 / to_ISP2 exits the wrong WAN, or via main | §6.1 |
| DHCP route-update script can leave placeholders disabled on first apply | First script run during the unbound state disables placeholders; bootstrap once after first apply | §6.7 |
| Public management default route in main masks main-table dual-WAN | Router-originated services (DNS, NTP, package downloads, DDNS) follow the public path instead of the recursive defaults | §6.6 cloud-CHR note, §7.7 |
| PCC is connection-sticky and changes public IP per-flow | Single-stream speed tests show one WAN; existing sessions break on failover; some banking / anti-fraud flows misbehave | §6.8, §11.1 |
| Probe set should be disjoint from DNS resolvers | If your only probes are also your only resolvers, route-health failure and DNS-service failure become indistinguishable in triage | §6.6, §10 |
| check-gateway=ping reaction window is ~30–40 s | Link-layer-style HA, not BFD; do not expect sub-second failover | §10 |
| Local destinations must bypass PCC explicitly | Without the local address-list bypass, LAN-to-VLAN/VPN traffic gets PCC-marked and may fail | §6.4, §6.8, §8 |
| passthrough=no on mark-routing is required | passthrough=yes lets a later mangle rule overwrite the routing mark and silently misroute | §6.8 |
| Static or PPPoE WANs need a different bootstrap | The DHCP script does not apply; gateway must be set manually or to the PPPoE interface | §9.3, §9.4 |

Contents

  1. Assumptions
  2. Read this before pasting
  3. Topology
  4. Packet-flow overview
  5. Why recursive routing is used here
  6. Core config
  7. Validation checklist
  8. Troubleshooting
  9. Optional variants
  10. Operational notes
  11. Design trade-offs and when not to use this
  12. References
  13. Changelog

1. Assumptions

This example assumes:

| Item | Example value | Change this? |
| --- | --- | --- |
| RouterOS | v7.x, lab-tested on 7.21.4 | Yes, test on your target version |
| WAN 1 interface | ether1 | Usually yes |
| WAN 2 interface | ether2 | Usually yes |
| LAN bridge | bridge | Usually yes |
| LAN subnet | 192.168.88.0/24 | Usually yes |
| Router LAN IP | 192.168.88.1/24 | Usually yes |
| WAN type | DHCP on both WANs | Yes, see static/PPPoE notes |
| IPv4 NAT | masquerade on both WANs | Usually yes |
| IPv6 | not covered | Add separately |

The config does not include a complete firewall policy. Keep or build a real input/forward firewall around it.

Cloud CHR management NIC warning

Cloud-hosted RouterOS instances often have an extra public management NIC before the private WAN/LAN NICs. In the Vultr CHR lab used to validate this guide, ether1 was the public management interface, while the actual dual-WAN test interfaces were ether2 and ether3.

Do not assume ether1 is WAN1 on a cloud VM. First map interfaces with DHCP/static probes, then substitute the real WAN and LAN interface names throughout this guide. If you must keep public SSH/WinBox management during testing, leave the public default route in the main table and apply the dual-WAN policy routing to LAN traffic only.

Also check the cloud provider's private-network MTU. Vultr VPC 2.0 used MTU 1450 in the validation lab, so the WAN-side CHR interfaces were set to 1450.
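
As an illustration, the lab's MTU adjustment could be expressed as follows. This is a sketch under the lab's assumptions only: ether2 and ether3 are the WAN-side interfaces and 1450 is the provider's private-network MTU. Substitute your own interface names and the value your provider documents.

```
# Lab-specific values; ether2/ether3 and 1450 are assumptions taken from the Vultr lab.
/interface ethernet set [find where name=ether2] mtu=1450
/interface ethernet set [find where name=ether3] mtu=1450

# Confirm the change.
/interface ethernet print detail
```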


2. Read this before pasting

Do not paste this into production blind.

Recommended process:

/export file=before-dual-wan-change
/system backup save name=before-dual-wan-change

Use Safe Mode in WinBox/terminal while applying the routing and firewall changes.

Rollback path if something goes wrong: load the backup from a local console with /system backup load name=before-dual-wan-change, or restore selected sections by /import file-name=before-dual-wan-change.rsc. Keep at least one console-reachable management path that does not depend on the new policy tables.

Important preflight checks:

  1. Replace ether1, ether2, bridge, and 192.168.88.0/24 with your real values.
  2. Confirm whether the router has a separate public management NIC that should stay out of the WAN list.
  3. Disable or bypass FastTrack for traffic that will use routing marks.
  4. Add all local LAN, VLAN, and VPN subnets to the local address list.
  5. Pick probe IPs that are stable and reachable through the intended ISP.
  6. Test failover by breaking upstream reachability, not only by unplugging a cable.

3. Topology

                  ┌─────────────── ISP 1, DHCP ───────────────┐
                  │                                             │
LAN clients ── bridge ── MikroTik RouterOS v7 ── ether1 / WAN1  │
                  │                           └─ ether2 / WAN2 ─┤
                  │                                             │
                  └─────────────── ISP 2, DHCP ───────────────┘

4. Packet-flow overview

For LAN-originated traffic:

LAN client packet
    ↓
/prerouting mangle
    ↓
PCC marks the connection as ISP1_conn or ISP2_conn
    ↓
Routing mark is set to to_ISP1 or to_ISP2
    ↓
Route lookup happens in the selected FIB table
    ↓
Default route points at a recursive probe IP
    ↓
Probe IP resolves through the current DHCP gateway
    ↓
Packet exits WAN1 or WAN2

For router replies to traffic that arrived from a WAN, the inbound packet is connection-marked by WAN interface, and the router's output mangle rule sends the reply back through the same WAN. This is useful for router services and for symmetric behavior around dst-nat/port-forwarded flows.

For router-originated new traffic, the router uses the main table unless you add separate output mangle or routing rules. This includes the router's own DNS resolver, NTP, package downloads, DDNS updates, pings, and other services that are not generated by LAN clients in prerouting. Recursive defaults in the main table give that router-originated traffic the same upstream-health failover mechanics used by the policy tables. Equal route distances make those defaults ECMP; different distances make them active-backup.


5. Why recursive routing is used here

A normal default route that points at a DHCP gateway only checks whether the gateway itself is reachable. That may still be true when the ISP has lost upstream internet.

Recursive routing adds one extra layer:

0.0.0.0/0 → probe IP, for example 64.6.64.6
probe IP → actual DHCP gateway learned from ISP

The router pings the probe IP with check-gateway=ping. If the probe cannot be reached through the intended WAN, the route becomes inactive and the backup route with the next distance can take over.

The key values are:

| Setting | Purpose |
| --- | --- |
| /routing/table add fib name=to_ISP1 | Creates a FIB routing table for traffic marked to ISP1 |
| /routing/table add fib name=to_ISP2 | Creates a FIB routing table for traffic marked to ISP2 |
| /ip route ... dst-address=<probe>/32 gateway=<actual WAN gateway> | Host route that forces each probe through the intended ISP |
| /ip route ... dst-address=0.0.0.0/0 gateway=<probe> | Recursive default route |
| target-scope=11 on default route | Lets the default route resolve through the probe route with scope=10 |
| check-gateway=ping | Deactivates a route when its recursive probe fails |
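
To see this two-step resolution on a running router, the routes can be printed in detail. This is a sketch that assumes the route comments used in §6.6; the exact fields shown in the output, including how the resolved next hop is labeled, may differ between RouterOS versions.

```
# Default routes: the configured gateway is the probe IP, not the ISP gateway.
/ip route print detail where dst-address=0.0.0.0/0

# Probe host routes: these carry the real DHCP-learned next hop.
/ip route print detail where comment~"WAN1_RECURSIVE|WAN2_RECURSIVE"
```

A default route whose probe host route is disabled or missing has nothing to resolve through, so it should not show as active.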

Do not reuse probe IPs blindly. Use targets that are unlikely to disappear and that your ISP does not block. Avoid using your router's only configured DNS resolvers as the only probes; DNS failure and route-health failure should be separable when possible. The specific probe IPs used by this guide are listed in §6.6, and the selection criteria are in §10.


6. Core config

6.1 Disable FastPath / FastTrack for this design

PCC and policy routing depend on mangle marks. FastTrack can bypass firewall/mangle processing and can misroute traffic that should use non-main routing tables.

allow-fast-path=no is a global setting and will lower maximum throughput on lower-end routers (hEX-class) because more packets stay on the slow path. Test under real traffic on small hardware. See §10 for more.

/ip settings set allow-fast-path=no

# If you started from a default MikroTik firewall, disable the default FastTrack rule.
# This command is safe if no such rule exists; it simply finds nothing.
/ip firewall filter disable [find where action=fasttrack-connection]

If you must keep FastTrack for selected hosts, add explicit accept exceptions before the FastTrack rule for all traffic that uses to_ISP1 or to_ISP2. The simplest and safest version of this design disables FastTrack.


6.2 Interface lists

/interface list
add name=WAN comment="dual-WAN uplinks"
add name=LAN comment="trusted LAN-facing interfaces"

/interface list member
add interface=ether1 list=WAN comment="WAN1"
add interface=ether2 list=WAN comment="WAN2"
add interface=bridge list=LAN comment="LAN bridge"

6.3 Optional fresh-router LAN baseline

Skip this section if your bridge, LAN IP, DHCP server, VLANs, and firewall are already configured.

/interface bridge
add name=bridge comment="LAN bridge"

/interface bridge port
add bridge=bridge interface=ether3
add bridge=bridge interface=ether4
add bridge=bridge interface=ether5

/ip address
add address=192.168.88.1/24 interface=bridge network=192.168.88.0 comment="LAN gateway"

/ip pool
add name=pool-lan ranges=192.168.88.10-192.168.88.254

/ip dhcp-server
add name=dhcp-lan interface=bridge address-pool=pool-lan disabled=no

/ip dhcp-server network
add address=192.168.88.0/24 gateway=192.168.88.1 dns-server=192.168.88.1

/ip dns
set allow-remote-requests=yes servers=1.1.1.1,8.8.8.8

/ip dns static
add name=router.lan address=192.168.88.1

The DNS server IPs above are intentionally separate from the recursive probe IPs in §6.6. DNS resolution health and route-health probes should be separable when possible. Avoid making the router's only DNS resolvers identical to the only probe IPs used to decide whether a WAN is healthy.


6.4 Local address list

Add every local subnet that should not be PCC load-balanced toward the internet.

/ip firewall address-list
add list=local address=192.168.88.0/24 comment="LAN subnet - add VLAN/VPN/local subnets here too"

Examples for additional internal networks:

/ip firewall address-list
add list=local address=10.10.10.0/24 comment="example VLAN"
add list=local address=172.16.0.0/16 comment="example VPN/internal range"
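
A quick cross-check, sketched here with read-only commands, is to compare the addresses configured on the router with the entries in the list.

```
# Subnets the router is directly attached to.
/ip address print

# Subnets currently excluded from PCC.
/ip firewall address-list print where list=local
```

Any internal subnet that appears in the first output but is not covered by the second is a candidate to add. The WAN addresses will also appear in the first output; they do not belong in the local list.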

6.5 Routing tables

/routing table
add fib name=to_ISP1 comment="Policy table for traffic selected for ISP1"
add fib name=to_ISP2 comment="Policy table for traffic selected for ISP2"

6.6 Recursive probe routes and default routes

This section uses three host routes per WAN. They are initially disabled and then enabled by the DHCP client scripts after a lease is bound.

The route comments are intentional. The DHCP scripts update all routes with comment WAN1_RECURSIVE or WAN2_RECURSIVE. Do not reuse those exact comments for unrelated routes.

Why does the placeholder use gateway=ether1? It is a placeholder until the DHCP client binds. The script in §6.7 overwrites gateway= with the real next-hop IP from the lease and enables the route. On a long-running broadcast Ethernet WAN you should always have an actual next-hop IP set by the DHCP script — see §9.3 for the static-WAN case where you set the next-hop manually.

/ip route

# -------------------------------------------------------------------
# Main-table recursive probes for router-originated traffic.
# These are updated by the DHCP scripts.
# Probes deliberately avoid Cloudflare and Google DNS so those can be
# used as normal DNS resolvers without also being route-health targets.
# -------------------------------------------------------------------
add disabled=yes dst-address=64.6.64.6/32 gateway=ether1 scope=10 target-scope=10 comment="WAN1_RECURSIVE"
add disabled=yes dst-address=149.112.112.112/32 gateway=ether2 scope=10 target-scope=10 comment="WAN2_RECURSIVE"

# Main-table recursive defaults for router-originated new traffic.
# Equal distances use ECMP. If you prefer WAN1 primary / WAN2 backup
# for main-table traffic, set MAIN_RECURSIVE_WAN2 distance=2.
add check-gateway=ping distance=1 dst-address=0.0.0.0/0 gateway=64.6.64.6 scope=10 target-scope=11 comment="MAIN_RECURSIVE_WAN1"
add check-gateway=ping distance=1 dst-address=0.0.0.0/0 gateway=149.112.112.112 scope=10 target-scope=11 comment="MAIN_RECURSIVE_WAN2"

# -------------------------------------------------------------------
# Recursive host routes for the policy routing tables.
# WAN1_RECURSIVE routes are forced through WAN1's current DHCP gateway.
# WAN2_RECURSIVE routes are forced through WAN2's current DHCP gateway.
#
# Each table's primary and backup intentionally use different DNS providers
# so a single provider outage does not deactivate both routes at once.
# -------------------------------------------------------------------
add disabled=yes dst-address=9.9.9.9/32 gateway=ether1 scope=10 target-scope=10 comment="WAN1_RECURSIVE"
add disabled=yes dst-address=208.67.222.222/32 gateway=ether1 scope=10 target-scope=10 comment="WAN1_RECURSIVE"

add disabled=yes dst-address=64.6.65.6/32 gateway=ether2 scope=10 target-scope=10 comment="WAN2_RECURSIVE"
add disabled=yes dst-address=208.67.220.220/32 gateway=ether2 scope=10 target-scope=10 comment="WAN2_RECURSIVE"

# -------------------------------------------------------------------
# Policy-table defaults.
# Each table prefers its own ISP, then falls back to the other ISP.
# Primary and backup deliberately use different probe providers.
# -------------------------------------------------------------------
add check-gateway=ping distance=1 dst-address=0.0.0.0/0 gateway=9.9.9.9 routing-table=to_ISP1 scope=10 target-scope=11 comment="to_ISP1_primary_via_WAN1"
add check-gateway=ping distance=2 dst-address=0.0.0.0/0 gateway=208.67.220.220 routing-table=to_ISP1 scope=10 target-scope=11 comment="to_ISP1_backup_via_WAN2"

add check-gateway=ping distance=1 dst-address=0.0.0.0/0 gateway=64.6.65.6 routing-table=to_ISP2 scope=10 target-scope=11 comment="to_ISP2_primary_via_WAN2"
add check-gateway=ping distance=2 dst-address=0.0.0.0/0 gateway=208.67.222.222 routing-table=to_ISP2 scope=10 target-scope=11 comment="to_ISP2_backup_via_WAN1"

To make the main table active-backup instead of ECMP:

/ip route set [find where comment="MAIN_RECURSIVE_WAN1"] distance=1
/ip route set [find where comment="MAIN_RECURSIVE_WAN2"] distance=2
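
Whichever shape you choose, a read-only print of the two main-table defaults confirms the result. This sketch assumes the MAIN_RECURSIVE comments from this section.

```
/ip route print detail where comment~"MAIN_RECURSIVE"
```

With equal distances, both routes should be active at once while both probes answer. With distance=1 and distance=2, only the WAN1 route should be active until its probe fails.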

If the router has a public management NIC and you need to keep SSH/WinBox reachable through that public interface, consider omitting the MAIN_RECURSIVE_WAN1 and MAIN_RECURSIVE_WAN2 defaults during the lab. The policy tables still exercise PCC and failover for LAN traffic, while the main table keeps the cloud provider's management default route.

The main-table recursive defaults exist for router-originated traffic — the router's own DNS resolver, NTP, package downloads, DDNS updates, pings — which uses main because it does not enter the LAN prerouting chain. Equal distances make these defaults ECMP; different distances make them active-backup. Static defaults, DHCP-added defaults, or deliberate output policy routing can fill the same role if ECMP is not the right shape for your design.


6.7 DHCP clients with route-update scripts

Add the DHCP clients after creating the route placeholders above.

/ip dhcp-client
add interface=ether1 add-default-route=no use-peer-dns=no use-peer-ntp=no comment="WAN1 DHCP - updates WAN1_RECURSIVE routes"
add interface=ether2 add-default-route=no use-peer-dns=no use-peer-ntp=no comment="WAN2 DHCP - updates WAN2_RECURSIVE routes"

Set the WAN1 DHCP script:

/ip dhcp-client
set [find where interface=ether1] script={
    :if ([:tobool $bound]) do={
        :local gw $"gateway-address";
        /ip/route/set [find where comment="WAN1_RECURSIVE"] gateway=$gw disabled=no;
        :log info ("WAN1 DHCP bound; recursive routes now use " . $gw);
    } else={
        /ip/route/set [find where comment="WAN1_RECURSIVE"] disabled=yes;
        :log warning "WAN1 DHCP unbound; WAN1 recursive routes disabled";
    }

    /ip/firewall/connection/remove [find where connection-mark="ISP1_conn"];
}

Set the WAN2 DHCP script:

/ip dhcp-client
set [find where interface=ether2] script={
    :if ([:tobool $bound]) do={
        :local gw $"gateway-address";
        /ip/route/set [find where comment="WAN2_RECURSIVE"] gateway=$gw disabled=no;
        :log info ("WAN2 DHCP bound; recursive routes now use " . $gw);
    } else={
        /ip/route/set [find where comment="WAN2_RECURSIVE"] disabled=yes;
        :log warning "WAN2 DHCP unbound; WAN2 recursive routes disabled";
    }

    /ip/firewall/connection/remove [find where connection-mark="ISP2_conn"];
}

If your terminal rejects the multiline script form, paste the script body into the DHCP client script field in WinBox/WebFig, or convert it to a quoted one-liner with escaped quotes.

After applying the scripts, renew the leases:

/ip dhcp-client renew [find where interface=ether1]
/ip dhcp-client renew [find where interface=ether2]

The DHCP route-update script disables placeholder routes during any unbound transition, including the brief unbound state that occurs during initial paste/import. Its first run can therefore leave the placeholders disabled even when the leases bind moments later. Bootstrap once, unconditionally, after first apply:

# Replace these gateways with the values shown by /ip dhcp-client print detail.
/ip route set [find where comment="WAN1_RECURSIVE"] gateway=<WAN1_DHCP_GATEWAY> disabled=no
/ip route set [find where comment="WAN2_RECURSIVE"] gateway=<WAN2_DHCP_GATEWAY> disabled=no

Verify:

/ip dhcp-client print detail
/ip route print detail where comment~"WAN1_RECURSIVE|WAN2_RECURSIVE"

Future DHCP bind/unbind events update the routes via the script; this manual bootstrap is needed only once.

Each script clears only its own ISP's connection marks. A WAN1 lease change (new public IP, recovery after failure, etc.) leaves stale ISP1_conn flows pinned to the old path or old NAT public IP, so they are flushed and re-classified. Healthy ISP2_conn flows on the still-up WAN2 are not disturbed.


6.8 Mangle rules for inbound symmetry, PCC, and routing marks

Order matters. Keep the local-traffic accept rule before the PCC rules. The add command appends to the end of the chain, so paste this section into a router with no existing prerouting mangle rules, or reorder afterward with move.

The chain=output rules below intentionally do not filter on connection-state=new. They must mark every reply packet whose connection-mark matches; replies on already-established flows are not in state new, and adding that condition will break per-WAN reply symmetry.

/ip firewall mangle

# Do not PCC traffic whose destination is another local subnet.
add chain=prerouting action=accept in-interface-list=LAN dst-address-list=local comment="skip local destinations before PCC"

# Mark new inbound connections by the WAN where they arrived.
# This helps replies from router services and dst-nat flows return through the same ISP.
add chain=prerouting action=mark-connection connection-state=new connection-mark=no-mark in-interface=ether1 new-connection-mark=ISP1_conn passthrough=yes comment="new inbound via WAN1"
add chain=prerouting action=mark-connection connection-state=new connection-mark=no-mark in-interface=ether2 new-connection-mark=ISP2_conn passthrough=yes comment="new inbound via WAN2"

# PCC for new LAN-originated internet connections.
add chain=prerouting action=mark-connection connection-state=new connection-mark=no-mark in-interface-list=LAN dst-address-list=!local dst-address-type=!local new-connection-mark=ISP1_conn passthrough=yes per-connection-classifier=both-addresses-and-ports:2/0 comment="PCC bucket 0 -> ISP1"
add chain=prerouting action=mark-connection connection-state=new connection-mark=no-mark in-interface-list=LAN dst-address-list=!local dst-address-type=!local new-connection-mark=ISP2_conn passthrough=yes per-connection-classifier=both-addresses-and-ports:2/1 comment="PCC bucket 1 -> ISP2"

# Apply routing marks to LAN packets based on connection mark.
add chain=prerouting action=mark-routing connection-mark=ISP1_conn in-interface-list=LAN new-routing-mark=to_ISP1 passthrough=no comment="route ISP1_conn via to_ISP1"
add chain=prerouting action=mark-routing connection-mark=ISP2_conn in-interface-list=LAN new-routing-mark=to_ISP2 passthrough=no comment="route ISP2_conn via to_ISP2"

# Router-generated replies for marked inbound connections.
# Do NOT add connection-state=new here; replies on established flows must also be marked.
add chain=output action=mark-routing connection-mark=ISP1_conn dst-address-list=!local new-routing-mark=to_ISP1 passthrough=no comment="router output replies via ISP1"
add chain=output action=mark-routing connection-mark=ISP2_conn dst-address-list=!local new-routing-mark=to_ISP2 passthrough=no comment="router output replies via ISP2"

The mark-routing rules use passthrough=no on purpose. Once a packet has its routing mark, no later mangle rule should change it; passthrough=yes here would let a subsequent rule overwrite the mark and silently misroute traffic. The earlier mark-connection rules use passthrough=yes because the next stage (mark-routing) needs to read the connection mark.

Notes:

  • PCC is per connection, not per packet.
  • A single TCP session uses one WAN; it is not split across both links.
  • Existing sessions usually break during failover. New sessions should use the active route.
  • Some banking, gaming, streaming, and CDN flows may dislike changing public IPs across different connections.
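The per-connection behavior in these notes can be illustrated with a toy model. This is only a sketch: `pcc_bucket` is a hypothetical helper, and `cksum` is a stand-in hash chosen for convenience. RouterOS computes its own classifier hash internally, so real bucket assignments will differ. Only the "hash the tuple, take the remainder" idea carries over.

```shell
#!/bin/sh
# Toy model of per-connection classification. cksum (a CRC) is a stand-in
# for the classifier hash; RouterOS computes its own hash internally.
pcc_bucket() (
  src=$1; dst=$2; sport=$3; dport=$4; denom=$5
  hash=$(printf '%s' "$src$dst$sport$dport" | cksum | cut -d' ' -f1)
  echo $(( hash % denom ))
)

# The same tuple always yields the same bucket, so a tracked connection
# never changes WAN. A new source port is a new tuple and may land elsewhere.
pcc_bucket 192.168.88.50 203.0.113.10 50000 443 2
pcc_bucket 192.168.88.50 203.0.113.10 50000 443 2
pcc_bucket 192.168.88.50 203.0.113.10 50001 443 2
```

The first two calls print the same bucket number because the tuple is identical; the third may or may not match, which is why a single client can appear with different public IPs across connections.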

6.9 NAT

/ip firewall nat
add chain=srcnat action=masquerade out-interface-list=WAN ipsec-policy=out,none comment="masquerade both WANs"

If you use IPsec, WireGuard, GRE, EOIP, VLANs, or other tunnels, review NAT bypass rules before this masquerade rule.


7. Validation checklist

7.1 Check DHCP clients

/ip dhcp-client print detail

Confirm each WAN has:

  • status=bound;
  • an address;
  • a gateway value, shown as gateway= in print output and exposed to scripts as $"gateway-address";
  • add-default-route=no.

7.2 Check recursive route state

/ip route print detail where comment~"WAN|MAIN|to_ISP"

Expected behavior:

  • WAN1_RECURSIVE routes should use WAN1's current DHCP gateway.
  • WAN2_RECURSIVE routes should use WAN2's current DHCP gateway.
  • to_ISP1_primary_via_WAN1 should be active when WAN1's probe is reachable.
  • to_ISP2_primary_via_WAN2 should be active when WAN2's probe is reachable.
  • backup routes should become active when their primary route fails.

7.3 Test routing-table-specific pings

/tool ping address=9.9.9.9 routing-table=to_ISP1 count=5
/tool ping address=64.6.65.6 routing-table=to_ISP2 count=5

routing-table= on /tool ping requires RouterOS 7.1 or later, but syntax and CLI context can vary between versions and shells. If your build rejects that form, validate with route state plus source-address probes:

/ip route print detail where comment~"WAN|to_ISP"

# Replace the source addresses with the DHCP addresses on your WAN interfaces.
/ping address=9.9.9.9 src-address=<WAN1_DHCP_ADDRESS> count=5
/ping address=64.6.65.6 src-address=<WAN2_DHCP_ADDRESS> count=5

For PCC itself, prefer real client traffic from the LAN. Router-originated ping tests do not traverse the LAN prerouting PCC rules.

7.4 Check mangle hit counters

/ip firewall mangle print stats all

Generate client traffic, then confirm:

  • PCC bucket counters increase;
  • routing-mark rule counters increase;
  • local-destination accept rule increases when clients talk to local subnets.

Example client-side flow generator:

for i in $(seq 1 30); do
  curl -4 -s --max-time 10 https://ifconfig.me
  echo
done | sort | uniq -c

If the client also has a separate public management NIC, make sure the test traffic is sourced through the LAN side of the MikroTik path, for example with curl --interface <LAN_IP>.
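To read the split as percentages rather than raw counts, a small filter can be placed at the end of the loop. The sketch below is one possible approach; `split_summary` is a hypothetical helper name, and the sample addresses are documentation ranges, not lab results.

```shell
#!/bin/sh
# Reads one public IP per line on stdin and prints "count percent ip"
# for each distinct IP. Blank lines (failed requests) are ignored.
split_summary() {
  sed '/^$/d' | sort | uniq -c | awk '
    { count[$2] = $1; total += $1 }
    END {
      for (ip in count)
        printf "%d %.0f%% %s\n", count[ip], 100 * count[ip] / total, ip
    }' | sort -k3
}

# Fixed sample input for illustration only.
printf '%s\n' 198.51.100.7 203.0.113.9 198.51.100.7 203.0.113.9 | split_summary
```

In practice, pipe the curl loop into it in place of `sort | uniq -c`, for example `for i in $(seq 1 30); do curl -4 -s --max-time 10 https://ifconfig.me; echo; done | split_summary`.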

7.5 Check connection marks

/ip firewall connection print where connection-mark~"ISP"

You should see client flows marked as either ISP1_conn or ISP2_conn.

7.6 Failover test

Test WAN1 failure:

/ip route print detail where routing-table=to_ISP1

Then break WAN1 upstream connectivity. Do not rely only on unplugging the Ethernet cable; a real ISP failure may leave the local link up while upstream internet is dead.

Expected result:

  • WAN1 recursive probe becomes unreachable.
  • to_ISP1_primary_via_WAN1 becomes inactive.
  • to_ISP1_backup_via_WAN2 becomes the active route.
  • New connections marked ISP1_conn use WAN2 until WAN1 recovers.

Repeat the test for WAN2.
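From a LAN client, one way to confirm the last expected result is to sample egress IPs while a WAN is failed and check that every flow leaves through the surviving ISP. The following is a minimal sketch; `all_via` is a hypothetical helper, and the address shown is a documentation placeholder for the surviving WAN's public IP.

```shell
#!/bin/sh
# Succeeds only if at least one line was read and every non-empty line on
# stdin equals the expected public IP. Blank lines (failed requests) are skipped.
all_via() (
  expected=$1; seen=0
  while IFS= read -r line; do
    [ -n "$line" ] || continue
    seen=$((seen + 1))
    if [ "$line" != "$expected" ]; then
      echo "unexpected egress: $line" >&2
      exit 1
    fi
  done
  [ "$seen" -gt 0 ]
)

# Fixed sample input for illustration only.
printf '%s\n' 203.0.113.9 203.0.113.9 | all_via 203.0.113.9 \
  && echo "all sampled flows used the expected WAN"
```

During a WAN1 outage, feed it real samples, for example `for i in $(seq 1 20); do curl -4 -s --max-time 10 https://ifconfig.me; echo; done | all_via <WAN2_PUBLIC_IP>`. Because failed requests are skipped, also watch for timeouts in the first seconds after the failure.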

7.7 Validate router-originated DNS if you depend on it

If LAN clients use the MikroTik as their DNS server, also test the router's own upstream DNS path. Client DNS queries enter the router locally, but the router's upstream resolver queries are generated by the router itself and normally use the main table.

Use resolver IPs that are not the same as your recursive probe IPs, flush the cache, resolve a new name, and check the DNS connections:

/ip dns set allow-remote-requests=yes servers=1.1.1.1,8.8.8.8
/ip dns cache flush
/ip firewall connection remove [find where dst-port=53]

:put [/resolve example.net]
:put [/resolve openai.com]

/ip firewall connection print detail where dst-port=53
/ip route print detail where comment~"MAIN_RECURSIVE"

Expected result:

  • DNS resolution succeeds.
  • DNS connections use the expected WAN source address or expected main-table path.
  • If one main-table recursive default is failed, new resolver lookups use the surviving default.

Router-originated traffic uses main. If main contains a non-recursive default route — common on cloud VMs with a public management NIC, or on routers where DHCP-added WAN defaults still exist — that route can mask the recursive defaults during this test. To validate the real recursive path, manage the router through an out-of-band path (console, LAN-jump, noVNC) and temporarily disable the masking default before repeating the test.


8. Troubleshooting

| Symptom | Likely cause | What to check |
| --- | --- | --- |
| Recursive routes show unreachable | DHCP script did not update gateway, route comments do not match, or probe is blocked | /ip route print detail where comment~"WAN", /log print, DHCP client status |
| Traffic marked for ISP1 exits ISP2 unexpectedly | FastTrack still active or main table is being used | Disable FastTrack; check mangle counters and route marks |
| Router services reply through the wrong WAN | Inbound connection was not marked, or output mark rule missing | Check inbound mangle rules and chain=output mark-routing rules |
| LAN clients cannot reach local VLANs/subnets | Local destinations are being PCC-marked | Add every internal subnet to address-list=local; keep local accept rule before PCC |
| Failover works, but existing downloads or SSH sessions die | Expected behavior; PCC is connection-based and NAT public IP changed | Start new sessions; optionally clear affected connection marks |
| Failback does not happen after WAN recovery | Probe route still inactive, probe target does not answer, or DHCP route still disabled | Ping probe via intended WAN; check DHCP script logs and route state |
| Port forwards work on only one WAN | Inbound connection marking or dst-nat rules are incomplete | Mark new inbound connections by WAN and ensure dst-nat rules exist for both WANs |
| Traceroute looks odd | PCC, ECMP, ICMP behavior, and router-originated traffic can differ from TCP flows | Test with routing-table-specific pings/traceroutes and with real TCP traffic |
| BGP/OSPF/tunnels behave oddly | Mangle rules can override route decisions for traffic that should use another policy | Add explicit accept/bypass rules before PCC for routing protocols, tunnels, or management subnets |
| Load balancing seems uneven | PCC balances connection buckets, not bandwidth | Use weighted PCC buckets for asymmetric links; test with many flows |

Manual cleanup commands:

/ip firewall connection remove [find where connection-mark="ISP1_conn"]
/ip firewall connection remove [find where connection-mark="ISP2_conn"]

9. Optional variants

9.1 Load-balance only selected clients

Use this when most clients should use WAN1 primary / WAN2 backup, while selected clients are PCC load-balanced.

First, make the main-table recursive defaults active-backup instead of ECMP:

/ip route set [find where comment="MAIN_RECURSIVE_WAN1"] distance=1
/ip route set [find where comment="MAIN_RECURSIVE_WAN2"] distance=2

Create an address list for clients allowed to use PCC:

/ip firewall address-list
add list=MultiWAN-Clients address=192.168.88.50 comment="example load-balanced client"
add list=MultiWAN-Clients address=192.168.88.60 comment="example load-balanced client"

Then add src-address-list=MultiWAN-Clients to the two PCC mark-connection rules:

/ip firewall mangle
set [find where comment="PCC bucket 0 -> ISP1"] src-address-list=MultiWAN-Clients
set [find where comment="PCC bucket 1 -> ISP2"] src-address-list=MultiWAN-Clients

Non-listed clients will not receive PCC routing marks and will use the main table.


9.2 Weighted PCC for asymmetric WAN speeds

PCC does not aggregate bandwidth, but you can bias new connection distribution.

Example: WAN1 is 250 Mbps, WAN2 is 500 Mbps. Use a 1:2 connection-bucket ratio.

Replace the two PCC rules with three buckets:

/ip firewall mangle
remove [find where comment="PCC bucket 0 -> ISP1"]
remove [find where comment="PCC bucket 1 -> ISP2"]

add chain=prerouting action=mark-connection connection-state=new connection-mark=no-mark in-interface-list=LAN dst-address-list=!local dst-address-type=!local new-connection-mark=ISP1_conn passthrough=yes per-connection-classifier=both-addresses-and-ports:3/0 comment="PCC weighted bucket 0 -> ISP1"
add chain=prerouting action=mark-connection connection-state=new connection-mark=no-mark in-interface-list=LAN dst-address-list=!local dst-address-type=!local new-connection-mark=ISP2_conn passthrough=yes per-connection-classifier=both-addresses-and-ports:3/1 comment="PCC weighted bucket 1 -> ISP2"
add chain=prerouting action=mark-connection connection-state=new connection-mark=no-mark in-interface-list=LAN dst-address-list=!local dst-address-type=!local new-connection-mark=ISP2_conn passthrough=yes per-connection-classifier=both-addresses-and-ports:3/2 comment="PCC weighted bucket 2 -> ISP2"

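For different ratios, increase the denominator and allocate more buckets to the faster WAN. Because add appends rules to the end of the mangle list, move the new bucket rules above the "route ISP1_conn via to_ISP1" rule (or create them with place-before=) so that new connections are marked before the mark-routing rules run.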
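For ratios with more buckets, typing each rule by hand gets error-prone. The sketch below generates the rule lines on a workstation; `pcc_rules` is a hypothetical helper, and the rule text it prints simply mirrors the three-bucket example above with the bucket numbers filled in. Review the output before pasting it under /ip firewall mangle.

```shell
#!/bin/sh
# Prints one mark-connection rule per bucket for a WAN1:WAN2 bucket ratio.
# The first w1 buckets go to ISP1, the remaining w2 buckets go to ISP2.
pcc_rules() (
  w1=$1; w2=$2
  total=$(( w1 + w2 )); i=0
  while [ "$i" -lt "$total" ]; do
    if [ "$i" -lt "$w1" ]; then isp=ISP1; else isp=ISP2; fi
    printf 'add chain=prerouting action=mark-connection connection-state=new connection-mark=no-mark in-interface-list=LAN dst-address-list=!local dst-address-type=!local new-connection-mark=%s_conn passthrough=yes per-connection-classifier=both-addresses-and-ports:%d/%d comment="PCC weighted bucket %d -> %s"\n' \
      "$isp" "$total" "$i" "$i" "$isp"
    i=$(( i + 1 ))
  done
)

# Example: a 2:3 ratio (for instance 200 Mbps and 300 Mbps links) gives
# five buckets, two for ISP1 and three for ISP2.
pcc_rules 2 3
```

Calling `pcc_rules 1 2` reproduces the three rules shown above, which is a quick way to check the generator against a known-good configuration.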


9.3 Static WAN instead of DHCP

For a static Ethernet WAN, do not use the DHCP script for that interface. Set the recursive host routes manually to the ISP next-hop IP.

Example WAN1 static gateway 203.0.113.1:

/ip route set [find where comment="WAN1_RECURSIVE"] gateway=203.0.113.1 disabled=no

Do not use gateway=ether1 on a normal broadcast Ethernet WAN as a long-term substitute for a real next-hop IP. Use the actual gateway IP from the ISP.


9.4 PPPoE WAN instead of DHCP

For PPPoE, use the PPPoE interface as the gateway for recursive probe host routes because PPPoE is point-to-point.

Example WAN1 PPPoE interface pppoe-out1:

/interface list member add interface=pppoe-out1 list=WAN comment="WAN1 PPPoE"
/ip route set [find where comment="WAN1_RECURSIVE"] gateway=pppoe-out1 disabled=no

Then remove (not just disable) the WAN1 DHCP client, including its script. The script in §6.7 references $"gateway-address" from the DHCP lease; it has no meaning for PPPoE and must not run alongside this manual set. If you keep the DHCP client object disabled, also clear its script field to avoid confusion later.


10. Operational notes

PCC limitations

PCC is not bandwidth bonding. It does not split one TCP flow across both WANs. It assigns each new connection to a bucket and keeps that connection sticky to the selected path as long as the connection is tracked.

Practical effects:

  • speed tests with one connection may show only one WAN;
  • multiple clients or multi-connection tests show balancing better;
  • long-lived sessions may break when their selected WAN fails;
  • sites that bind login/session state to public IP may behave poorly when different connections leave through different ISPs.

Probe target selection

Use probe IPs that:

  • are stable;
  • answer ICMP consistently;
  • are outside your ISP network if you want to detect wider internet reachability;
  • are not all in the same provider or DNS service;
  • are not the same IPs you plan to use as the router's DNS resolvers;
  • are not blocked by the ISP.

If a provider blocks ICMP, choose another target or monitor by another method.

The example uses well-known anycast public DNS IPs from Quad9, OpenDNS, and Verisign/UltraDNS because they are globally reachable and tend to answer ICMP. It deliberately avoids Cloudflare and Google as route probes so those popular resolvers can be used for actual DNS service without also becoming route-health sensors.

The main-table probes (64.6.64.6 / 149.112.112.112) and the policy-table probes use different IPs on purpose. Reachability of one set does not guarantee reachability of the other, so it is normal during partial outages to see the router itself reach the internet via the main table while LAN clients on a policy table appear stuck — or the other way around. Compare both sets of probes when triaging.

Keep DNS resolvers and route probes separate where practical. If the router's only DNS servers are also the only recursive probe targets, DNS-service failure and route-health failure become harder to distinguish during troubleshooting.
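When probe or resolver lists are edited later, the separation is easy to break by accident. A minimal workstation-side check might look like the sketch below; `overlap` is a hypothetical helper, and the lists are the example addresses used in this guide.

```shell
#!/bin/sh
# Prints every address that appears in both space-separated lists.
# Exit status is 0 when the lists are disjoint, non-zero otherwise.
overlap() (
  found=0
  for a in $1; do
    for b in $2; do
      if [ "$a" = "$b" ]; then echo "$a"; found=1; fi
    done
  done
  [ "$found" -eq 0 ]
)

probes="64.6.64.6 9.9.9.9 208.67.222.222 149.112.112.112 64.6.65.6 208.67.220.220"
resolvers="1.1.1.1 8.8.8.8"
overlap "$probes" "$resolvers" && echo "probe and resolver sets are disjoint"
```

If an address is shared, the function prints it and returns a non-zero status, so it can gate a deployment script.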

FastTrack and performance

Disabling FastTrack/FastPath can reduce maximum throughput on low-end routers because more packets stay on the slow path. On devices such as hEX-class routers, test CPU under real traffic. On more capable routers such as RB5009-class hardware, this design is usually much more comfortable, but throughput still depends on firewall complexity, queueing, packet size, and enabled services.

Firewall placement

This guide only covers routing, mangle, and NAT. It does not replace a firewall policy. Keep normal protections such as:

  • input chain drop for unsolicited WAN traffic;
  • established/related accept rules;
  • invalid drop rules where appropriate;
  • explicit management access restrictions;
  • dst-nat rules only for services you intend to expose.

Be cautious with generic invalid drops if your network has asymmetric routing, tunnels, or advanced policy routing.


11. Design trade-offs and when not to use this

11.1 What you give up by choosing PCC

PCC is connection-level, not packet-level. That means:

  • A single TCP flow uses one WAN. Single-stream speed tests measure one link, not the sum of both.
  • Sessions that bind to a public IP — banking, anti-fraud / Cloudflare Turnstile, some video conferencing, IMAP IDLE on certain providers — can misbehave when consecutive connections from the same client leave through different ISPs. If most users hit a small set of such services, route those destinations through one WAN via address-list + routing mark, alongside or instead of PCC.
  • Failover is not session-preserving. Live SSH, VoIP, and streaming sessions on the failed WAN drop. New connections recover automatically.

11.2 When a simpler design is enough

If you only need failover (no load balancing), drop the PCC mangle rules and keep the recursive default routes in main at distances 1 and 2. One WAN is active at a time, and check-gateway=ping triggers failover the same way.

If you only need per-client steering (e.g. send the guest VLAN out WAN2), skip PCC and use a src-address-list match with action=mark-routing directly. Static assignment is easier to debug than connection-classifier hashing.

11.3 When PCC is the wrong tool

  • Two ISPs with very different latency or jitter where you want application-aware steering. Use a real SD-WAN appliance.
  • Public-facing services that must answer on a stable IP. PCC + masquerade gives a different public IP per WAN; you want BGP multihoming or an upstream front-door (DNS failover, anycast, cloud LB).
  • True bandwidth aggregation. Look at MLPPP, L2TP+BCP, or commercial bonding services. None are RouterOS-native at the level you might expect.

11.4 Why recursive routes instead of netwatch scripts

Some guides drive failover with /tool netwatch scripts that enable and disable default routes when a probe goes down. That works, but it is an out-of-band controller racing with the routing table. Recursive routing keeps the health check inside the FIB: a route is active if and only if its probe is reachable through the intended next-hop. There is one source of truth, no script timing to tune, and check-gateway=ping is documented behavior rather than a custom script you have to maintain across upgrades.


12. References


13. Changelog

  • 2026-05-07 — Initial public version. Includes the TL;DR default-config paste, route probes deliberately disjoint from DNS resolvers, and validation against RouterOS 7.21.4 in a Vultr CHR lab. PCC split, per-policy-table failover and failback, recursive main-table ECMP for router-originated traffic, and the TL;DR paste applied to a default-like RouterBOARD baseline are all confirmed. See companion lab report.

Vultr CHR Dual-WAN Lab: Setup, Methodology, and Results

  • Lab date: 2026-05-07
  • RouterOS tested: 7.21.4
  • Guide under test: MikroTik RouterOS v7 dual DHCP WAN recursive failover with PCC
  • Pasteable TL;DR: Default-config dual WAN PCC + recursive failover

This document is the validation companion for the dual-WAN guide. It records the topology, methodology, and results of a Vultr CHR lab that exercises every claim the guide makes: PCC load balancing, per-policy-table recursive failover, recursive ECMP for router-originated traffic, and NAT masquerade across both WANs.

The methodology below incorporates the gotchas surfaced during validation, so a second team can reproduce the lab without rediscovering them. Those gotchas are also catalogued in §10.


1. Goal

Validate, in a controlled cloud lab with realistic upstream-failure injection:

  • two DHCP WAN uplinks bound and updating recursive routes;
  • recursive route health checks deactivating routes when an upstream probe fails while link state stays up;
  • PCC distributing new LAN connections across both WANs;
  • per-policy-table failover and failback;
  • recursive main-table ECMP for router-originated traffic, including the router's own DNS resolver;
  • NAT masquerade on both WANs;
  • the TL;DR paste applied cleanly to a default-like RouterBOARD baseline.

2. Topology

                         public internet
                              │
             ┌────────────────┴────────────────┐
             │                                 │
       ┌─────┴─────┐                     ┌─────┴─────┐
       │  isp1-vm  │                     │  isp2-vm  │
       │ Debian 13 │                     │ Debian 13 │
       │ NAT/DHCP  │                     │ NAT/DHCP  │
       └─────┬─────┘                     └─────┬─────┘
             │ vpc-wan1                        │ vpc-wan2
             │ 10.10.1.0/24                    │ 10.10.2.0/24
             │                                 │
          ┌──┴─────────────────────────────────┴──┐
          │              chr-dual-wan-lab         │
          │ RouterOS CHR 7.21.4                   │
          │ ether1 = public management            │
          │ ether2 = WAN1                         │
          │ ether3 = WAN2                         │
          │ ether4 = LAN                          │
          └───────────────────┬───────────────────┘
                              │ vpc-lan
                              │ 192.168.88.0/24
                         ┌────┴────┐
                         │ lan-vm  │
                         │ Debian  │
                         └─────────┘

VPCs:

| VPC | CIDR | Purpose |
| --- | --- | --- |
| vpc-wan1 | 10.10.1.0/24 | CHR WAN1 to fake ISP1 |
| vpc-wan2 | 10.10.2.0/24 | CHR WAN2 to fake ISP2 |
| vpc-lan | 192.168.88.0/24 | CHR LAN to test client |

Instances:

| Host | Public IP | Role |
| --- | --- | --- |
| isp1-vm | ISP1_PUBLIC_IP | Gateway / DHCP / NAT for 10.10.1.0/24 |
| isp2-vm | ISP2_PUBLIC_IP | Gateway / DHCP / NAT for 10.10.2.0/24 |
| chr-dual-wan-lab | CHR_PUBLIC_IP | RouterOS CHR under test |
| lan-vm | LAN_VM_PUBLIC_IP | LAN traffic generator and management jump host |

3. Interface Map

The guide's examples use ether1 as WAN1 and ether2 as WAN2. On Vultr CHR, ether1 is the public management NIC, so the guide's interface names shift by one in this lab:

| Guide role | Guide example | This lab |
| --- | --- | --- |
| Public management | not in guide | ether1 |
| WAN1 | ether1 | ether2 |
| WAN2 | ether2 | ether3 |
| LAN bridge port | ether3 or local LAN ports | ether4 |

ether1 = public Vultr management, DHCP CHR_PUBLIC_IP/23
ether2 = vpc-wan1, DHCP from 10.10.1.1, MTU 1450
ether3 = vpc-wan2, DHCP from 10.10.2.1, MTU 1450
ether4 = vpc-lan, bridge port

This is the single biggest practical cloud-lab caveat for the guide. Always map interfaces before pasting any dual-WAN config — pasting the guide's ether1/ether2 examples blindly on cloud CHR will turn the management NIC into a "WAN" and lock you out.

The Vultr VPC 2.0 MTU is 1450, so both WAN-side CHR interfaces are pinned to MTU 1450.


4. Fake ISP VM Setup

Both fake ISP VMs are Debian 13 images using /etc/network/interfaces, dnsmasq for DHCP, and iptables-nft plus UFW for forwarding/NAT.

4.1 ISP1

enp1s0 = public NIC, ISP1_PUBLIC_IP/23
enp8s0 = VPC NIC, 10.10.1.1/24
ip addr flush dev enp8s0
ip addr add 10.10.1.1/24 dev enp8s0
ip link set enp8s0 mtu 1450 up

echo "net.ipv4.ip_forward=1" >/etc/sysctl.d/99-fwd.conf
sysctl -p /etc/sysctl.d/99-fwd.conf

/etc/dnsmasq.d/vpc.conf:

interface=enp8s0
bind-interfaces
dhcp-range=10.10.1.100,10.10.1.150,1h
dhcp-option=3,10.10.1.1
dhcp-option=6,1.1.1.1,8.8.8.8

iptables NAT:

-A POSTROUTING -s 10.10.1.0/24 -o enp1s0 -j MASQUERADE

UFW (rationale in §4.3):

sed -i 's/^DEFAULT_FORWARD_POLICY=.*/DEFAULT_FORWARD_POLICY="ACCEPT"/' /etc/default/ufw
ufw route allow in on enp8s0 out on enp1s0
ufw route allow in on enp1s0 out on enp8s0
ufw allow in on enp8s0 to any port 67 proto udp
ufw allow in on enp8s0 to any port 68 proto udp
ufw reload

4.2 ISP2

Identical to ISP1 with vpc-wan2 = 10.10.2.0/24 and a VPC IP of 10.10.2.1/24. Substitute addresses in dnsmasq.d/vpc.conf and the NAT rule; UFW commands are unchanged.

4.3 Why those UFW rules

With UFW's default policy, Debian 13 denies both forwarded traffic and unsolicited inbound traffic. That breaks two things at once on a fake-ISP VM:

  • routed packets between the VPC and the public NIC are dropped, so even a successful DHCP lease cannot reach the internet through the fake ISP;
  • DHCP discovery broadcasts arriving on the VPC NIC are dropped before dnsmasq sees them, so leases never bind.

The symptom is that CHR can ping 10.10.x.1 from a static address but its DHCP client stays at searching... status. The UFW changes in §4.1 set the default forward policy to ACCEPT, explicitly allow VPC↔public forwarding in both directions, and allow inbound UDP 67/68 on the VPC NIC ahead of UFW's not-local drop path. Bake them into cloud-init for the ISP VMs so DHCP-from-CHR works on first boot.


5. CHR Baseline and Management Path

The CHR is managed through lan-vm as a jump host throughout validation. Its public management interface (ether1) is enabled only for initial provisioning and final cleanup; it is disabled during all behavior tests so that the cloud-provider default route in main cannot mask the dual-WAN routing under test:

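# From the operator workstation: reach CHR through lan-vm.
ssh -J root@LAN_VM_PUBLIC_IP admin@192.168.88.1
# On CHR, once the jump path is confirmed working:
/interface/disable ether1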

This is the most consequential methodology choice in the lab. Tests of router-originated services such as the DNS resolver and main-table ECMP must run with the public management default route absent from main; otherwise the public NIC is what carries the test traffic and the dual-WAN behavior is invisible in connection tracking. Plan for this from the start — provision out-of-band access (LAN-jump, noVNC console) before disabling ether1.

CHR LAN baseline:

bridge = LAN bridge
ether4 = bridge port
192.168.88.1/24 = CHR LAN gateway
dhcp-lan = DHCP server on bridge
pool-lan = 192.168.88.10-192.168.88.254

WAN DHCP results after applying the dual-WAN config from the guide with the §3 interface substitutions:

WAN1 on ether2 = 10.10.1.123/24 via 10.10.1.1
WAN2 on ether3 = 10.10.2.113/24 via 10.10.2.1

The recursive routes are bootstrapped once immediately after the DHCP clients bind. The DHCP route-update script disables placeholder routes during any unbound transition — including the brief unbound state during initial paste — so the script's first run can leave the placeholders disabled even though the leases bind moments later. Bootstrap unconditionally:

/ip route set [find where comment="WAN1_RECURSIVE"] gateway=10.10.1.1 disabled=no
/ip route set [find where comment="WAN2_RECURSIVE"] gateway=10.10.2.1 disabled=no

Future DHCP bind/unbind events update the routes via the script; this manual step is needed only once.


6. LAN Client Setup

lan-vm has both a public NIC (for SSH and as the jump host) and a LAN VPC NIC:

enp1s0 = public NIC, LAN_VM_PUBLIC_IP/23
enp8s0 = LAN VPC NIC, 192.168.88.4/24

The default route stays public for SSH management. Source-policy routing forwards traffic sourced from 192.168.88.4 through CHR:

ip route replace default via 192.168.88.1 dev enp8s0 table 88
ip rule add from 192.168.88.4/32 table 88 priority 1000

Test traffic uses --interface 192.168.88.4 to bind curl to the LAN source so it follows the policy-routed path through CHR:

curl --interface 192.168.88.4 -4 -s --max-time 10 https://ifconfig.me

For the ECMP and main-table tests that need destination-tuple variance, several public IP-echo services are used so the destination hash varies meaningfully between flows (see §8.6).


7. Probe Set

The guide and TL;DR use a probe set deliberately disjoint from the router's DNS resolvers, so a probe outage and a DNS outage are independent events:

WAN1 route probes: 64.6.64.6 (Verisign), 9.9.9.9 (Quad9), 208.67.222.222 (OpenDNS)
WAN2 route probes: 149.112.112.112 (Quad9), 64.6.65.6 (Verisign), 208.67.220.220 (OpenDNS)
DNS resolvers:     1.1.1.1 (Cloudflare), 8.8.8.8 (Google)

The role of each probe under the guide:

| Comment | Address | Routed via | Used by |
| --- | --- | --- | --- |
| MAIN_RECURSIVE_WAN1 | 64.6.64.6 | WAN1 | main-table default |
| MAIN_RECURSIVE_WAN2 | 149.112.112.112 | WAN2 | main-table default |
| to_ISP1_primary_via_WAN1 | 9.9.9.9 | WAN1 | to_ISP1 primary |
| to_ISP1_backup_via_WAN2 | 208.67.220.220 | WAN2 | to_ISP1 backup |
| to_ISP2_primary_via_WAN2 | 64.6.65.6 | WAN2 | to_ISP2 primary |
| to_ISP2_backup_via_WAN1 | 208.67.222.222 | WAN1 | to_ISP2 backup |

Every probe is reachability-checked from the intended WAN before any failure injection (see §8.3).


8. Validation Methodology

8.1 Interface mapping

Identify which VPC each interface is attached to before applying the dual-WAN config. Probing from a temporary static address is reliable; DHCP is not until the UFW rules explained in §4.3 are in place:

ether2 reaches 10.10.1.1 -> WAN1
ether3 reaches 10.10.2.1 -> WAN2
ether4 reaches neither   -> LAN

After UFW is configured and the dual-WAN config is applied, leases bind:

ether2 leased 10.10.1.123/24 from 10.10.1.1
ether3 leased 10.10.2.113/24 from 10.10.2.1

8.2 Route health

/ip route print detail where comment~"WAN|to_ISP|MAIN"

Healthy state:

WAN1_RECURSIVE active via 10.10.1.1%ether2
WAN2_RECURSIVE active via 10.10.2.1%ether3
MAIN_RECURSIVE_WAN1 active, ecmp
MAIN_RECURSIVE_WAN2 active, ecmp
to_ISP1_primary_via_WAN1 active
to_ISP1_backup_via_WAN2 standby
to_ISP2_primary_via_WAN2 active
to_ISP2_backup_via_WAN1 standby

8.3 Probe reachability

For each probe, source-bound ping from the intended WAN's DHCP address confirms upstream reachability before failure injection:

/ping address=64.6.64.6 src-address=10.10.1.123 count=5
/ping address=9.9.9.9 src-address=10.10.1.123 count=5
/ping address=208.67.222.222 src-address=10.10.1.123 count=5
/ping address=149.112.112.112 src-address=10.10.2.113 count=5
/ping address=64.6.65.6 src-address=10.10.2.113 count=5
/ping address=208.67.220.220 src-address=10.10.2.113 count=5

8.4 PCC split test

The PCC mangle rules use per-connection-classifier=both-addresses-and-ports. Each curl gets a fresh ephemeral source port; with ifconfig.me anycast, destinations also vary modestly between connections. From lan-vm:

for i in $(seq 1 30); do
  curl --interface 192.168.88.4 -4 -s --max-time 10 https://ifconfig.me
  echo
done | sort | uniq -c

CHR counters checked:

/ip firewall mangle print stats
/ip firewall nat print stats
/ip firewall connection print where src-address~"192.168.88.4"

This is sufficient to demonstrate split behavior. For stricter validation under varied destination tuples — useful when results from a small sample look uneven — use the multi-destination loop from §8.6, and see the §10 PCC variance caveat.
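As a rough guide to how uneven a small sample can look, the sketch below prints the expected per-WAN count and its standard deviation for a two-bucket split. It rests on an idealisation that is an assumption here, not a lab measurement: each flow lands on either WAN independently with probability 1/2, giving a mean of n/2 and a standard deviation of sqrt(n)/2. `expected_split` is a hypothetical helper.

```shell
#!/bin/sh
# Expected flows per WAN and standard deviation for n flows over two
# equally weighted buckets, assuming independent, uniform assignment.
expected_split() {
  awk -v n="$1" 'BEGIN { printf "mean=%.1f sd=%.2f\n", n / 2, sqrt(n) / 2 }'
}

expected_split 30
```

For the 30-flow loop this gives a mean of 15 with a standard deviation of about 2.7, so a split such as 12/18 is unremarkable and does not by itself indicate a misconfigured rule.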

8.5 Failure injection

Failure tests block a single probe IP at the upstream fake ISP. This keeps the VPC link and DHCP lease up while breaking the route-health probe — exactly the outage shape the recursive design exists to detect. Allow ~35 s for check-gateway=ping to react before re-checking route state.

WAN1 policy primary failure and recovery:

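# Inject: drop forwarded traffic from the VPC to the WAN1 policy probe.
ssh root@ISP1_PUBLIC_IP 'iptables -I FORWARD 1 -i enp8s0 -d 9.9.9.9 -j DROP'
# Recover: delete the same rule.
ssh root@ISP1_PUBLIC_IP 'iptables -D FORWARD -i enp8s0 -d 9.9.9.9 -j DROP'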

WAN2 policy primary failure and recovery:

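# Inject: drop forwarded traffic from the VPC to the WAN2 policy probe.
ssh root@ISP2_PUBLIC_IP 'iptables -I FORWARD 1 -i enp8s0 -d 64.6.65.6 -j DROP'
# Recover: delete the same rule.
ssh root@ISP2_PUBLIC_IP 'iptables -D FORWARD -i enp8s0 -d 64.6.65.6 -j DROP'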

Main-table and ECMP probes are exercised separately in §8.6 and §8.7 with their own block/unblock pairs.
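Because the delete only works when its rule specification matches the insert exactly, it can help to build both commands from one definition. The following is a minimal sketch under stated assumptions: `probe_rule`, `block_probe`, and `unblock_probe` are hypothetical helper names, and `RUN` is a hypothetical dry-run hook that replaces ssh with echo so the commands can be reviewed first.

```shell
#!/bin/sh
# One rule specification shared by the insert and the delete.
probe_rule() { echo "-i enp8s0 -d $1 -j DROP"; }

# $1 = fake ISP host, $2 = probe IP. RUN defaults to ssh.
block_probe()   { ${RUN:-ssh} "root@$1" "iptables -I FORWARD 1 $(probe_rule "$2")"; }
unblock_probe() { ${RUN:-ssh} "root@$1" "iptables -D FORWARD $(probe_rule "$2")"; }

# Dry run: print what would be sent for the WAN1 policy probe.
RUN=echo
block_probe ISP1_PUBLIC_IP 9.9.9.9
unblock_probe ISP1_PUBLIC_IP 9.9.9.9
```

With `RUN` unset, the same calls execute over ssh. Check route state on CHR between the two calls, after the roughly 35 s reaction window.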

8.6 Main-table ECMP, isolated

ECMP is validated in a temporary FIB table named to_ECMP_TEST so the production policy tables stay intact.

Same-table recursion rule. In RouterOS v7, a recursive route resolves its gateway via FIB lookup in the same routing table by default. A default in to_ECMP_TEST cannot resolve through a host route in main. Both the recursive host routes and the recursive defaults must live in the same table, or both must live in main. This is the most common foot-gun when testing recursive routes across multiple FIB tables and it shapes both this test and §8.7.

Temporary table with co-located host and default routes:

/routing/table/add fib name=to_ECMP_TEST comment="Temporary strict ECMP validation table"

/ip/route/add dst-address=9.9.9.9/32 gateway=10.10.1.1 routing-table=to_ECMP_TEST scope=10 target-scope=10 comment="ECMP_STRICT_TEST_WAN1_RECURSIVE"
/ip/route/add dst-address=1.1.1.1/32 gateway=10.10.2.1 routing-table=to_ECMP_TEST scope=10 target-scope=10 comment="ECMP_STRICT_TEST_WAN2_RECURSIVE"

/ip/route/add check-gateway=ping distance=1 dst-address=0.0.0.0/0 gateway=9.9.9.9 routing-table=to_ECMP_TEST scope=10 target-scope=11 comment="ECMP_STRICT_TEST_DEFAULT_WAN1"
/ip/route/add check-gateway=ping distance=1 dst-address=0.0.0.0/0 gateway=1.1.1.1 routing-table=to_ECMP_TEST scope=10 target-scope=11 comment="ECMP_STRICT_TEST_DEFAULT_WAN2"

A single LAN-source mangle rule placed before the PCC bucket rules steers test traffic into the temporary table and bypasses PCC:

/ip/firewall/mangle/add chain=prerouting action=mark-routing src-address=192.168.88.4 in-interface-list=LAN dst-address-list=!local dst-address-type=!local new-routing-mark=to_ECMP_TEST passthrough=no comment="ECMP strict LAN-jump test client via to_ECMP_TEST"
/ip/firewall/mangle/move [find where comment="ECMP strict LAN-jump test client via to_ECMP_TEST"] destination=1

Multi-destination flow generator for destination-hash variance:

urls=(https://ifconfig.me https://api.ipify.org https://checkip.amazonaws.com https://icanhazip.com https://ident.me https://myexternalip.com/raw)
for i in $(seq 1 48); do
  u=${urls[$(( (i-1) % ${#urls[@]} ))]}
  curl --interface 192.168.88.4 -4 -s --max-time 10 "$u"
  echo
done | sed "/^$/d" | sort | uniq -c

ECMP failure injection blocks the test probes (9.9.9.9 for WAN1, 1.1.1.1 for WAN2) at their upstream fake ISP. Cleanup removes the mangle rule, the four routes, and the temporary table.

8.7 Router-originated DNS over the recursive main table

The router's own DNS resolver queries are router-originated; they enter no LAN prerouting chain and use the main table by default. Validation places recursive defaults directly in main — alongside the existing WAN1_RECURSIVE and WAN2_RECURSIVE host routes from the guide — and disables ether1 so the only main-table defaults are the dual-WAN recursive ones:

/ip/route/add check-gateway=ping distance=1 dst-address=0.0.0.0/0 gateway=64.6.64.6 routing-table=main scope=10 target-scope=11 comment="DNS_MAIN_TEST_WAN1"
/ip/route/add check-gateway=ping distance=1 dst-address=0.0.0.0/0 gateway=149.112.112.112 routing-table=main scope=10 target-scope=11 comment="DNS_MAIN_TEST_WAN2"

/ip/dns/set allow-remote-requests=yes servers=76.76.2.0,94.140.14.14 cache-max-ttl=1m
/ip/dns/cache/flush
/ip/firewall/connection/remove [find where dst-port=53]

Resolvers 76.76.2.0 (Control D) and 94.140.14.14 (AdGuard) are deliberately neither route probes nor the production DNS resolvers, so this test is isolated from both subsystems. Test queries:

:put [/resolve example.net]
:put [/resolve openai.com]
/ip/firewall/connection/print detail where dst-port=53
/ip/route/print detail where comment~"DNS_MAIN_TEST"

Failure injection blocks the WAN1 main probe (64.6.64.6), then restores it and blocks the WAN2 main probe (149.112.112.112). Between each block:

/ip/dns/cache/flush
/ip/firewall/connection/remove [find where dst-port=53]
:put [/resolve example.org]
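The block and restore steps themselves run on the fake ISP VMs, not on the router. The following is a minimal sketch that reuses the iptables rule shape shown in §9.2 and §9.3. The helper names (run, block_probe, restore_probe) and the VPC_IF and APPLY variables are illustrative assumptions, not part of the lab scripts. By default it only prints the command so it can be reviewed first.

```shell
#!/usr/bin/env bash
# Run on a fake ISP VM. VPC_IF defaults to enp8s0, the interface used in the
# iptables commands of §9.2 and §9.3; adjust it if your VM differs.
VPC_IF="${VPC_IF:-enp8s0}"

# Dry run by default: print the command. Set APPLY=1 to actually execute it.
run() {
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  else
    echo "$@"
  fi
}

# Drop forwarded traffic toward one probe address (simulates upstream loss).
block_probe() {
  run iptables -I FORWARD 1 -i "$VPC_IF" -d "$1" -j DROP
}

# Remove the matching DROP rule again.
restore_probe() {
  run iptables -D FORWARD -i "$VPC_IF" -d "$1" -j DROP
}

# Example (hypothetical session on isp1-vm):
#   APPLY=1 block_probe 64.6.64.6
#   APPLY=1 restore_probe 64.6.64.6
```

Keeping the dry run as the default is a deliberate choice here: an accidental paste on the wrong VM prints a rule instead of installing it.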

Cleanup removes the temporary DNS_MAIN_TEST routes, restores production DNS resolvers, re-enables ether1, and clears DNS connections.

Why steering DNS through a separate FIB table does not work. An obvious alternative is to mark router-originated UDP/TCP port 53 traffic in the output chain into a dedicated table such as to_DNS_TEST and to put the recursive defaults in that table. This fails unless the recursive host routes (<probe>/32 → DHCP gateway) live in the same table (see §8.6). With the host routes in main and the defaults in a separate table, the cross-table recursion does not resolve, and DNS falls back to whatever default exists in main (typically the cloud public management route). For router-originated traffic, put the recursive defaults in main, which already contains the WAN1_RECURSIVE and WAN2_RECURSIVE host routes from the guide.

8.8 TL;DR paste validation

The TL;DR (00-...md) is validated separately on a default-like RouterBOARD baseline. CHR's defaults are not the same as a hardware RouterBOARD's defaults, so the lab first re-maps VPCs to match the TL;DR's assumed shape:

pub    = Vultr public safety NIC (disabled before validation)
ether1 = WAN1 VPC
ether2 = WAN2 VPC
ether3 = LAN VPC
bridge = default LAN bridge containing ether2 and ether3

Default-like baseline before paste:

ether1 = DHCP WAN with add-default-route=yes
ether2 = bridge port (will be moved to WAN2 by the paste)
ether3 = bridge port for lan-vm
bridge = 192.168.88.1/24 with DHCP server
default masquerade rule present
default FastTrack rule present

The paste is applied through the LAN jump path:

ssh -J root@LAN_VM_PUBLIC_IP admin@192.168.88.1

Two categories of constraint must be satisfied for non-interactive SSH paste, and the TL;DR file already accounts for both:

  • Use fully qualified commands (/ip route add ...) rather than RouterOS context-mode commands (/ip route followed by bare add). Context mode is not reliable when piped through SSH.
  • Encode the DHCP route-update scripts as quoted RouterOS strings rather than multi-line script blocks, and follow them with a brief :delay 10s plus an unconditional bootstrap of already-bound leases.
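The first constraint can be checked mechanically before a paste is sent. This is a minimal sketch, assuming the paste lives in a local file (the name paste.rsc is hypothetical) and that a context-mode command is any line starting with a bare verb such as add or set. The function name is illustrative, and the verb list is not exhaustive.

```shell
#!/usr/bin/env bash
# find_context_mode_lines FILE
# Print, with line numbers, every line that starts with a bare verb and so
# depends on an earlier context line such as "/ip route".
# Exit status is 0 when at least one such line is found, 1 when none are.
find_context_mode_lines() {
  grep -nE '^[[:space:]]*(add|set|remove|enable|disable)([[:space:]]|$)' "$1"
}

# Example (hypothetical file name):
#   if find_context_mode_lines paste.rsc; then
#     echo "rewrite the lines above as fully qualified commands" >&2
#   fi
```

A fully qualified line such as /ip route add ... starts with a slash and is not reported; only the bare add that would follow a context line is.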

After the paste, pub is disabled and PCC, policy failover, and the main-table recursive DNS path are exercised.


9. Results

All numbers below are from runs with the public management interface (ether1 on CHR or pub on the TL;DR baseline) disabled. Tests use the §7 probe set.

9.1 PCC load balancing (steady state)

30 LAN client HTTPS flows from 192.168.88.4 to https://ifconfig.me:

12 flows exited ISP1_PUBLIC_IP
18 flows exited ISP2_PUBLIC_IP

CHR mangle counters increased on:

PCC bucket 0 -> ISP1
PCC bucket 1 -> ISP2
route ISP1_conn via to_ISP1
route ISP2_conn via to_ISP2
masquerade both WANs

A 12/18 split on 30 single-destination flows is within the expected variance for a per-connection hash. Distribution converges to the bucket ratio with larger samples and broader destination variance; small samples can tilt either way.
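As a rough sanity check on that statement, the split can be compared with a binomial model. The sketch below assumes two equal-weight buckets and flows that land in a bucket independently, which idealizes a deterministic per-connection hash. The helper name split_zscore is illustrative.

```shell
#!/usr/bin/env bash
# split_zscore A B
# Print how many binomial standard deviations the observed A/B split sits
# from an even split. Assumes two equal buckets (p = 0.5) and independent
# flows; a real hash over a few distinct tuples can deviate from this model.
split_zscore() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = a + b
    mean = n / 2          # expected flows per WAN
    sd = sqrt(n) / 2      # sqrt(n * 0.5 * 0.5)
    d = a - mean
    if (d < 0) d = -d
    printf "%.2f\n", d / sd
  }'
}

split_zscore 12 18   # prints 1.10
```

A result of about one standard deviation is unremarkable for 30 flows. Applied to the 39/9 sample in §9.4, the same helper prints 4.33, which suggests that sample does not behave like independent even draws and is consistent with the note there about limited source/destination variance.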

9.2 WAN1 policy-table failure

Block:   iptables -I FORWARD 1 -i enp8s0 -d 9.9.9.9 -j DROP   on isp1-vm
Result:  to_ISP1_primary_via_WAN1 = inactive (after ~35 s)
         to_ISP1_backup_via_WAN2 = active
         12/12 LAN curls exited ISP2_PUBLIC_IP
Restore: iptables -D FORWARD -i enp8s0 -d 9.9.9.9 -j DROP
         to_ISP1_primary_via_WAN1 = active
         to_ISP1_backup_via_WAN2 = standby

9.3 WAN2 policy-table failure

Block:   iptables -I FORWARD 1 -i enp8s0 -d 64.6.65.6 -j DROP   on isp2-vm
Result:  to_ISP2_primary_via_WAN2 = inactive
         to_ISP2_backup_via_WAN1 = active
         12/12 LAN curls exited ISP1_PUBLIC_IP
Restore: route returns to primary

9.4 Main-table ECMP, isolated

Healthy:

ECMP_STRICT_TEST_DEFAULT_WAN1 = active, ecmp, immediate-gw=10.10.1.1%ether2
ECMP_STRICT_TEST_DEFAULT_WAN2 = active, ecmp, immediate-gw=10.10.2.1%ether3

48-flow multi-destination sample from §8.6:

39 flows exited ISP1_PUBLIC_IP
9 flows exited ISP2_PUBLIC_IP

WAN1 ECMP probe failure (9.9.9.9 blocked at isp1-vm):

ECMP_STRICT_TEST_DEFAULT_WAN1 = inactive
ECMP_STRICT_TEST_DEFAULT_WAN2 = active
18/18 test curls exited ISP2_PUBLIC_IP

WAN2 ECMP probe failure (1.1.1.1 blocked at isp2-vm):

ECMP_STRICT_TEST_DEFAULT_WAN2 = inactive
ECMP_STRICT_TEST_DEFAULT_WAN1 = active
18/18 test curls exited ISP1_PUBLIC_IP

The 39/9 healthy split is more skewed than the PCC sample because RouterOS ECMP uses a per-flow hash, not round-robin. Distribution converges only with enough source/destination variance; under failure of one route, the survivor takes 100% of new flows as expected.

9.5 Router-originated DNS over the recursive main table

Healthy:

DNS_MAIN_TEST_WAN1 = active, ecmp
DNS_MAIN_TEST_WAN2 = active, ecmp
/resolve example.net succeeded
/resolve openai.com succeeded
DNS resolver connections sourced from 10.10.1.123 in this sample

WAN1 main probe blocked (64.6.64.6):

DNS_MAIN_TEST_WAN1 = inactive
DNS_MAIN_TEST_WAN2 = active
/resolve example.org succeeded
DNS resolver connections sourced from 10.10.2.113

WAN2 main probe blocked (149.112.112.112):

DNS_MAIN_TEST_WAN2 = inactive
DNS_MAIN_TEST_WAN1 = active
/resolve cloudflare.com succeeded
DNS resolver connections sourced from 10.10.1.123

The router's own DNS resolver follows the recursive main-table defaults and survives single-WAN probe failure with no manual intervention.

9.6 TL;DR paste validation

Post-paste state on the default-like baseline:

ether2 removed from bridge and became WAN2
FastTrack disabled
WAN1 DHCP bound on ether1
WAN2 DHCP bound on ether2
WAN1_RECURSIVE host routes active via 10.10.1.1%ether1
WAN2_RECURSIVE host routes active via 10.10.2.1%ether2
MAIN_RECURSIVE_WAN1 and MAIN_RECURSIVE_WAN2 active as ECMP
to_ISP1 and to_ISP2 primaries active

PCC sample (30 flows, single destination):

12 flows exited ISP1_PUBLIC_IP
18 flows exited ISP2_PUBLIC_IP
mangle and masquerade counters increased

WAN1 policy failure (9.9.9.9 blocked at isp1-vm): to_ISP1_primary_via_WAN1 deactivated, backup activated, 12/12 LAN curls exited ISP2_PUBLIC_IP.

WAN2 policy failure (64.6.65.6 blocked at isp2-vm): to_ISP2_primary_via_WAN2 deactivated, backup activated, 12/12 LAN curls exited ISP1_PUBLIC_IP.

Router-originated DNS over main-table recursive defaults: identical behavior to §9.5, with 1.1.1.1 and 8.8.8.8 as the production DNS resolvers (separate from the route probes per §7).


10. Caveats and Gotchas

These are the issues most likely to bite a reader reproducing this lab or applying the guide on cloud CHR. Each is reflected in the methodology above; this table is the index.

| Caveat | Why it matters | Where it shows up |
|---|---|---|
| Cloud CHR interface ordering puts the public management NIC at ether1 | Pasting the guide's ether1/ether2 examples blindly will turn the management NIC into a "WAN" and lock you out | §3 |
| UFW with default DENY drops both routed traffic and DHCP discovery | Fake ISPs will not lease until UFW is opened for forwarding and UDP 67/68 on the VPC NIC | §4.3 |
| DHCP route-update script can run during an unbound transition | Placeholder routes can stay disabled even after the lease binds; bootstrap once after the initial paste | §5 |
| Public management default route in main masks main-table dual-WAN behavior | Router-originated services (DNS, NTP, package downloads) follow the public route instead of the recursive defaults | §5, §8.7 |
| Recursive resolution in v7 happens within the same routing table by default | A default in to_FOO cannot resolve through a host route in main; put both in the same table or both in main | §8.6 sidebar, §8.7 sidebar |
| PCC hash variance with a single source IP | Single-client tests look uneven; vary destinations and use multiple source ports or multiple clients | §6, §8.4, §8.6 |
| Probe set should be disjoint from the router's DNS resolvers | If the only probes are also the only resolvers, DNS-service failure and route-health failure become indistinguishable in triage | §7 |
| check-gateway=ping reaction window is ~35 s in this lab | This is link-layer-style HA, not BFD; do not expect sub-second failover | §8.5 |
| ECMP is per-flow hash-based | Small samples can skew heavily; ECMP is not round-robin | §9.4 |
| RouterOS context-mode commands are unreliable in non-interactive SSH paste | Use fully qualified commands, encode scripts as quoted strings, and bootstrap leases unconditionally | §8.8 |

11. Assessment

The dual-WAN guide and TL;DR paste behave as documented in a Vultr CHR lab on RouterOS 7.21.4:

  • PCC distributes new LAN connections across both WANs;
  • per-policy-table recursive failover and failback react to single-probe upstream outages while link state stays up;
  • recursive main-table defaults provide router-originated services — the DNS resolver included — with the same upstream-health failover, after the public management default route is removed from main;
  • the TL;DR paste applies cleanly to a default-like RouterBOARD baseline through non-interactive SSH.

Issues encountered during validation were operational rather than architectural. They are captured in §10 and incorporated into the methodology above so a second team can reproduce this lab without rediscovering them.

Production deployment still requires reviewing site-specific firewall, management access, probe choice, and interface naming before applying any of this on real WANs.


12. Cleanup

When the lab is no longer needed, destroy resources in this order to avoid orphaned attachments and billing:

  1. LAN VM and fake ISP VMs.
  2. CHR instance.
  3. Uploaded ISO or custom image artifacts.
  4. VPCs.
@oakwhiz

oakwhiz commented Jul 24, 2022

I haven't yet had success with the routing side of the config.

@lucasasdelli

Hello, thanks for posting it. Unfortunately, with the two ECMP default gateway rules I get invalid and unreachable. I'm not sure if they should use the main route or one of the two PCC ones.

@oakwhiz

oakwhiz commented Jul 27, 2022

Testing several variants of this config with PCC set to destination only, and with either A) both ISPs connected, or B) a single ISP interface set up with a working link that has IP transit blocked via a switch, shows incorrect behavior. For example with both ISPs connected, traffic marked for ISP1 will only go through ISP2. Blocking ISP2 so that it silently fails breaks all default-destined traffic, but traffic destined for the directly connected on-link subnet at ISP1 still works.

My thought is that something is wrong with the routing, and my next step is to see if adding a 3rd layer of nexthop routes would actually fix this. In this configuration there are only two layers: 0.0.0.0/0 default to fake gateway inside routing tables, and fake gateway via real gateway inside the main table. However not all resources on this topic mention the intermediate nexthop, and in v7 the target-scope must be adjusted to match (might be 12>11>10 or 13>12>11).

Any discussion would be appreciated as the Mikrotik documentation is always a bit terse, ambiguous, and not so thorough.

@oakwhiz

oakwhiz commented Jul 27, 2022

I encountered this bug: https://forum.mikrotik.com/viewtopic.php?t=185950
and in addition I was using a gateway which was not set up to forward to a default route - after using the proper gateway things are starting to improve.

@oakwhiz

oakwhiz commented Jul 28, 2022

I have posted a partial configuration which might help someone: https://gist.github.com/oakwhiz/55b4043e99320129323496ffd5087f05
I should mention though, there are some minor errors in there and various things set up for debugging/testing purposes, I wouldn't recommend copying it.

@dipan29

dipan29 commented Nov 30, 2022

Can anyone please explain what this area is doing?

/ip dhcp-client add interface=ether1 add-default-route=no script=":if (\$bound=1) do={\r\
    \n    /ip/route/set [find where comment=\"ISP1\"] gateway=\$\"gateway-address\"\r\
    \n}\r\
    \n\r\
    \n/ip/firewall/connection/remove [find connection-mark=\"ISP1_conn\"]\r\
    \n/ip/firewall/connection/remove [find connection-mark=\"ISP2_conn\"]\r\
    \n" use-peer-dns=no use-peer-ntp=no

I have one static WAN that I need to set up; how do I add the script for it?

@marfillaster
Author

@dipan29 That is for resetting all connections whenever the DHCP client changes its IP. If your other uplink is a static IP, you only need one DHCP client.

@dipan29

dipan29 commented Dec 1, 2022

@marfillaster thanks for the response,

Also, I tried doing the same and gracefully it works for the local network right out of the box. As in for all interface lists LAN. However, My requirement was a little different and could not manage to get it working post Router OS 7.

I would require something similar but for the following scenario

  • Having one DHCP Client and one with Static IP or PPPoE Client (here the gateway caused an issue when I tried to replicate cause for ether1 when its PPPoE the gateway IP isn't defined)
  • Ensuring that both the WAN IPs can be used for DST-NAT - for any local subnet
  • I would be having generally two subnets - one for home lab and one for normal home users. For the home lab I won't really want to do load balancing but just ensure that the DST-NAT for that network works from both the WANs.
  • For normal home users, I would be okay if it goes out through ether1 (PPPoE/Static) most of the times but have an address-list called DUAL say that can only get Load Balancing when tagged to it. In other words, only the addresses tagged as DUAL can have the load balancing.

I tried finding a lot of guides on the internet but could not have something from reddit or mikrotik forum that serves my purpose (some claim to, but they just don't get connected).
Also, I am new to this mikrotik, would be glad if you can guide me with some exact details or something from the internet that works.

@rexsllemel

rexsllemel commented Feb 27, 2023

I'm using this config, but when ISP 1 has no internet it uses ISP 2; however, when ISP 1's internet comes back, the recursive route still marks ISP 1 as having no internet. It only uses ISP 2's internet the whole time, unless I restart the MikroTik.

@nhan6310

> @marfillaster thanks for the response,
>
> Also, I tried doing the same and gracefully it works for the local network right out of the box. As in for all interface lists LAN. However, My requirement was a little different and could not manage to get it working post Router OS 7.
>
> I would require something similar but for the following scenario
>
>   • Having one DHCP Client and one with Static IP or PPPoE Client (here the gateway caused an issue when I tried to replicate cause for ether1 when its PPPoE the gateway IP isn't defined)
>   • Ensuring that both the WAN IPs can be used for DST-NAT - for any local subnet
>   • I would be having generally two subnets - one for home lab and one for normal home users. For the home lab I won't really want to do load balancing but just ensure that the DST-NAT for that network works from both the WANs.
>   • For normal home users, I would be okay if it goes out through ether1 (PPPoE/Static) most of the times but have an address-list called DUAL say that can only get Load Balancing when tagged to it. In other words, only the addresses tagged as DUAL can have the load balancing.
>
> I tried finding a lot of guides on the internet but could not have something from reddit or mikrotik forum that serves my purpose (some claim to, but they just don't get connected). Also, I am new to this mikrotik, would be glad if you can guide me with some exact details or something from the internet that works.

I can help you with all your requests

@aaronbolton

Has the variable $leaseBound changed to $bound now?

@dipan29

dipan29 commented Aug 12, 2023

Hi @nhan6310 how can we connect to get this configuration set?

@quinont

quinont commented Aug 12, 2023

I tested this on a MikroTik RB4011, and it worked perfectly. However, I'm encountering an issue that perhaps you could assist with: the load balancing is happening constantly (meaning a PC is continuously switching between the two WANs). Is there a way to configure WAN balancing based on the client's IP?

@aaronbolton

aaronbolton commented Aug 12, 2023

@quinont the easiest way to change this is to add a src-address-list matcher to the prerouting mangle rules; you can then specify which IPs get load balanced and which do not.

Below is how mine looks

add action=mark-connection chain=prerouting connection-mark=no-mark dst-address-type=!local in-interface-list=LAN \
    new-connection-mark=ISP1_conn passthrough=yes per-connection-classifier=both-addresses-and-ports:2/0 \
    src-address-list=MultiWAN-Clients
add action=mark-connection chain=prerouting connection-mark=no-mark dst-address-type=!local in-interface-list=LAN \
    new-connection-mark=ISP2_conn passthrough=yes per-connection-classifier=both-addresses-and-ports:2/1 \
    src-address-list=MultiWAN-Clients

@yosijoe

yosijoe commented Dec 21, 2023

Can anyone give me a recursive failover gateway config?
I'm using NTH.
my ip : gateway
my dns :8.8.8.8,8.8.4.4,1.1.1.1
wan1 192.168.1.1
wan2 192.168.254.254
wan3 192.168.0.1

@adrianolima83

Young man, what an excellent configuration. We tested it on a lab bench, ran stress tests, and used different connection types (PPPoE, client, etc.), and everything worked perfectly. Congratulations, and thank you very much for sharing.

@carlosjs23

carlosjs23 commented Jul 5, 2024

 # WAN to LAN
 add action=mark-connection chain=prerouting connection-mark=no-mark connection-state=established,related in-interface=ether1 new-connection-mark=ISP1_conn \
     passthrough=yes
 add action=mark-connection chain=prerouting connection-mark=no-mark connection-state=established,related in-interface=ether2 new-connection-mark=ISP2_conn \
     passthrough=yes

These rules do not only match WAN-to-LAN connections; they also catch connections directed to the router itself (if you remove connection-state=established,related). That's the reason these two rules also exist:

 add action=mark-routing chain=output connection-mark=ISP1_conn dst-address-list=!local new-routing-mark=to_ISP1 passthrough=yes
 add action=mark-routing chain=output connection-mark=ISP2_conn dst-address-list=!local new-routing-mark=to_ISP2 passthrough=yes

The output rules use the already-marked connections that were directed to the router itself, so that replies leave through the same interface they entered on. This has nothing to do with the ECMP routes; the ECMP routes are used only for new connections originated from the router itself to the internet, so you can ping 8.8.8.8 from the router and it will choose one of the two routes, then ping 8.8.4.4 and it may choose the other.

Sorry for my English.

@n19htz

n19htz commented Oct 29, 2024

How to deal with BGP routes? Their routing mark is being overridden.

@cthu1hoo

@marfillaster thank you so much for this concise and coherent example, I didn't need PCC but two-way failover and routing specific clients via secondary ISP seems to work great based on this. no troubles with router-originated tunnels and OSPF either.

@dilbernd

@marfillaster Thanks for this detailed script.

My question is – do you have any reading recommendations to actually understand how these things work, since I obviously don’t get it enough to adapt your solution to my setup described below?

The MikroTik documentation is great for mapping an existing understanding of modern IP network management onto their syntax, if you already know all the lingo and principles involved in depth, and general reading on “how to IP” is extremely basic. I’d really like to read up to fill the gap between “I guess I understand a bit about consumer/SOHO networking” and “… and this is how you write the subtle art of networking in our language”.


My personal issue that has me confused is a bit more exotic since I have 2 ISPs, with asymmetric bandwidth (1x250, 1x500) and I’d like to distribute connections by available bandwidth proportions.

I tried to replicate the old RouterOS 6 syntax of just adding a default route with multiple gateways, with multiplying gateways by their proportion of the total bandwidth – so in my case gateway=ether1,pppoe-out1,pppoe-out1 with the recursive routes via public DNS addresses method in RouterOS 7.

I ended up with the extremely simple:

/ip route
add comment=Provider250 distance=1 dst-address=208.67.222.222 gateway=100.64.0.1 scope=10 target-scope=10
add comment=Provider500 distance=1 dst-address=1.1.1.1 gateway=pppoe-out1 scope=10 target-scope=10
add comment=Provider500 distance=1 dst-address=8.8.8.8 gateway=pppoe-out1 scope=10 target-scope=10
add check-gateway=ping dst-address=0.0.0.0/0 gateway=1.1.1.1 scope=10 target-scope=11
add check-gateway=ping dst-address=0.0.0.0/0 gateway=8.8.8.8 scope=10 target-scope=11
add check-gateway=ping dst-address=0.0.0.0/0 gateway=208.67.222.222 scope=10 target-scope=11
/ip firewall nat
add action=masquerade chain=srcnat comment="defconf: masquerade" ipsec-policy=out,none out-interface-list=WAN

This seems to work flawlessly in that /ip route print shows all default routes being used for ECMP and /interface monitor-traffic pppoe-out1,ether1 shows both interfaces used, and pppoe-out1 more than ether1 (if not always in the exact proportions sought, but of course that’s to be expected if connections go different routes and then have different demands). Now seeing your much more sophisticated mangling-based setup, it makes me nervous that I’m missing something that opens me up to subtle errors.

I don’t see an obvious way however to adapt your script to actually use different providers for different proportions of traffic since this would seem to collide with the mangling, which can only work based on interfaces used and thus would conflate multiple routes for the same provider and possibly mess up the proportional routing?

@marfillaster
Author

Updated the guide with a validated RouterOS v7 version split into three ordered files:

  • 00 TL;DR pasteable default-config setup
  • 01 full PCC + recursive failover guide
  • 02 Vultr CHR lab setup, methodology, and results

Notable changes: safer default-config paste, non-Google/non-Cloudflare route probes, LAN-jump validation methodology, and PCC/failover/ECMP/router-DNS validation
