From: David Ahern <dsahern@gmail.com>
To: Petr Machata <petrm@nvidia.com>, netdev@vger.kernel.org
Cc: David Ahern <dsahern@kernel.org>,
"David S. Miller" <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>,
Ido Schimmel <idosch@nvidia.com>
Subject: Re: [PATCH net-next 00/12] nexthop: Preparations for resilient next-hop groups
Date: Thu, 28 Jan 2021 20:24:33 -0700
Message-ID: <3ee4c868-0bd4-6df2-aaef-07efed2812f6@gmail.com>
In-Reply-To: <cover.1611836479.git.petrm@nvidia.com>
On 1/28/21 5:49 AM, Petr Machata wrote:
> At this moment, there is only one type of next-hop group: an mpath group.
> Mpath groups implement the hash-threshold algorithm, described in RFC
> 2992[1].
>
> To select a next hop, the hash-threshold algorithm first assigns a range of
> hashes to each next hop in the group, and then selects the next hop by
> comparing the SKB hash with the individual ranges. When a next hop is
> removed from the group, the ranges are recomputed, which leads to
> reassignment of parts of hash space from one next hop to another. RFC 2992
> illustrates it thus:
>
> +-------+-------+-------+-------+-------+
> |   1   |   2   |   3   |   4   |   5   |
> +-------+-+-----+---+---+-----+-+-------+
> |    1    |    2    |    4    |    5    |
> +---------+---------+---------+---------+
>
> Before and after deletion of next hop 3
> under the hash-threshold algorithm.
>
> Note how next hop 2 gave up part of the hash space in favor of next hop 1,
> and 4 in favor of 5. While there will usually be some overlap between the
> previous and the new distribution, some traffic flows change the next hop
> that they resolve to.
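
For readers following along, the hash-threshold selection described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the kernel's implementation: all next hops have equal weight, the hash is 32 bits wide, and hash_threshold_select() is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: hash-threshold selection over
 * equal-weight next hops. The 32-bit hash space is split into nhops
 * contiguous regions of equal size, and next hop i owns region i.
 * Assumes nhops > 0.
 */
static size_t hash_threshold_select(uint32_t hash, size_t nhops)
{
	/* Region size, rounded up so that nhops regions cover the space. */
	uint64_t region = (((uint64_t)1 << 32) + nhops - 1) / nhops;

	return (size_t)(hash / region);
}
```

With five next hops, hash 900000000 falls into the second region (index 1, next hop 2 in the figure). Once next hop 3 is removed and four regions remain, the same hash falls into the first region (index 0, next hop 1), so that flow changes its next hop even though neither next hop 1 nor next hop 2 was touched.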
>
> If a multipath group is used for load-balancing between multiple servers,
> this hash space reassignment means that packets from a single flow can
> suddenly end up arriving at a server that does not expect them, which
> may lead to a TCP reset.
>
> If a multipath group is used for load-balancing among available paths to
> the same server, the issue is that different latencies and reordering along
> the way cause the packets to arrive in the wrong order.
>
> Resilient hashing is a technique to address the above problems. A resilient
> next-hop group has another layer of indirection between the group itself
> and its constituent next hops: a hash table. The selection algorithm uses a
> straightforward modulo operation to choose a hash bucket, then reads the
> next hop that this bucket contains and forwards traffic there.
>
> This indirection brings an important feature. In the hash-threshold
> algorithm, the range of hashes associated with a next hop must be
> contiguous. With a hash table, the mapping between the hash table buckets
> and the individual next hops is arbitrary. Therefore, when a next hop is
> deleted, the buckets that held it are simply reassigned to other next hops:
>
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>                  v v v v
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>
> Before and after deletion of next hop 3
> under the resilient hashing algorithm.
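
The bucket table and the deletion step in the figure can be sketched like this. It is a minimal illustration, not the code from the upcoming patchsets: the table size of 20 matches the figure, survivors are handed the freed buckets round-robin, and struct res_table, res_select() and res_remove() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_BUCKETS 20	/* matches the 20-bucket figure above */

/* Hypothetical resilient group: each bucket stores a next-hop ID. */
struct res_table {
	int bucket[NUM_BUCKETS];
};

/* Selection is a plain modulo into the table. */
static int res_select(const struct res_table *t, uint32_t hash)
{
	return t->bucket[hash % NUM_BUCKETS];
}

/*
 * Reassign only the buckets that held 'dead', handing them to the
 * surviving next hops round-robin. All other buckets stay untouched.
 * Assumes nalive > 0. Returns the number of buckets reassigned.
 */
static size_t res_remove(struct res_table *t, int dead,
			 const int *alive, size_t nalive)
{
	size_t moved = 0;

	for (size_t i = 0; i < NUM_BUCKETS; i++) {
		if (t->bucket[i] != dead)
			continue;
		t->bucket[i] = alive[moved % nalive];
		moved++;
	}
	return moved;
}
```

Starting from the first row of the figure and removing next hop 3 with survivors {1, 2, 4, 5}, only the four buckets that held 3 change, which reproduces the second row. Flows hashing to any other bucket keep their next hop.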
>
> When weights of next hops in a group are altered, it may be possible to
> choose a subset of buckets that are currently not used for forwarding
> traffic, and use those to satisfy the new next-hop distribution demands,
> keeping the "busy" buckets intact. This way, established flows ideally
> keep being forwarded to the same endpoints through the same paths as
> before the next-hop group change.
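
One way to picture the "idle buckets first" idea is the sketch below. It assumes each bucket carries a flag saying whether it recently forwarded traffic; how that is actually tracked is not described in this cover letter, so the busy field, struct res_bucket and res_migrate_idle() are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bucket: a next-hop ID plus an assumed activity flag. */
struct res_bucket {
	int nh_id;
	bool busy;	/* assumption: set if the bucket recently saw traffic */
};

/*
 * Move up to 'want' idle buckets from next hop 'from' to next hop 'to'.
 * Busy buckets are never touched, so the result may fall short of
 * 'want'. Returns the number of buckets actually moved.
 */
static size_t res_migrate_idle(struct res_bucket *b, size_t n,
			       int from, int to, size_t want)
{
	size_t moved = 0;

	for (size_t i = 0; i < n && moved < want; i++) {
		if (b[i].nh_id != from || b[i].busy)
			continue;
		b[i].nh_id = to;
		moved++;
	}
	return moved;
}
```

A caller rebalancing after a weight change would ask for as many buckets as the new distribution requires and accept a smaller count when too few buckets are idle, leaving established flows where they are.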
>
> This patchset prepares the next-hop code for eventual introduction of
> resilient hashing groups.
>
> - Patches #1-#4 carry otherwise disjoint changes that just remove certain
>   assumptions in the next-hop code.
>
> - Patches #5-#6 extend the in-kernel next-hop notifiers to support more
>   next-hop group types.
>
> - Patches #7-#12 refactor RTNL message handlers. Resilient next-hop groups
>   will introduce a new logical object, a hash table bucket. It turns out
>   that handling bucket-related messages is similar to how next-hop messages
>   are handled. These patches extract the commonalities into reusable
>   components.
>
> The plan is to contribute approximately the following patchsets:
>
> 1) Nexthop policy refactoring (already pushed)
> 2) Preparations for resilient next-hop groups (this patchset)
> 3) Implementation of resilient next-hop groups
> 4) Netdevsim offload plus a suite of selftests
> 5) Preparations for mlxsw offload of resilient next-hop groups
> 6) mlxsw offload including selftests
>
> Interested parties can look at the current state of the code at [2] and
> [3].
>
> [1] https://tools.ietf.org/html/rfc2992
> [2] https://github.com/idosch/linux/commits/submit/res_integ_v1
> [3] https://github.com/idosch/iproute2/commits/submit/res_v1
>
Very easy to review patchset. Thank you for that and for this cover
letter with the end goal and progress.
Thread overview: 27+ messages
2021-01-28 12:49 [PATCH net-next 00/12] nexthop: Preparations for resilient next-hop groups Petr Machata
2021-01-28 12:49 ` [PATCH net-next 01/12] nexthop: Rename nexthop_free_mpath Petr Machata
2021-01-29 3:05 ` David Ahern
2021-01-28 12:49 ` [PATCH net-next 02/12] nexthop: Dispatch nexthop_select_path() by group type Petr Machata
2021-01-29 3:08 ` David Ahern
2021-01-28 12:49 ` [PATCH net-next 03/12] nexthop: Introduce to struct nh_grp_entry a per-type union Petr Machata
2021-01-29 3:09 ` David Ahern
2021-01-28 12:49 ` [PATCH net-next 04/12] nexthop: Assert the invariant that a NH group is of only one type Petr Machata
2021-01-29 3:10 ` David Ahern
2021-01-28 12:49 ` [PATCH net-next 05/12] nexthop: Use enum to encode notification type Petr Machata
2021-01-29 3:12 ` David Ahern
2021-01-28 12:49 ` [PATCH net-next 06/12] nexthop: Dispatch notifier init()/fini() by group type Petr Machata
2021-01-29 3:13 ` David Ahern
2021-01-28 12:49 ` [PATCH net-next 07/12] nexthop: Extract dump filtering parameters into a single structure Petr Machata
2021-01-29 3:16 ` David Ahern
2021-01-28 12:49 ` [PATCH net-next 08/12] nexthop: Extract a common helper for parsing dump attributes Petr Machata
2021-01-29 3:17 ` David Ahern
2021-01-28 12:49 ` [PATCH net-next 09/12] nexthop: Strongly-type context of rtm_dump_nexthop() Petr Machata
2021-01-29 3:18 ` David Ahern
2021-01-28 12:49 ` [PATCH net-next 10/12] nexthop: Extract a helper for walking the next-hop tree Petr Machata
2021-01-29 3:19 ` David Ahern
2021-01-28 12:49 ` [PATCH net-next 11/12] nexthop: Add a callback parameter to rtm_dump_walk_nexthops() Petr Machata
2021-01-29 3:20 ` David Ahern
2021-01-28 12:49 ` [PATCH net-next 12/12] nexthop: Extract a helper for validation of get/del RTNL requests Petr Machata
2021-01-29 3:21 ` David Ahern
2021-01-29 3:24 ` David Ahern [this message]
2021-01-29 5:10 ` [PATCH net-next 00/12] nexthop: Preparations for resilient next-hop groups patchwork-bot+netdevbpf