From: David Ahern <dsahern@gmail.com>
To: Petr Machata <petrm@nvidia.com>, netdev@vger.kernel.org
Cc: David Ahern <dsahern@kernel.org>,
"David S. Miller" <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>,
Ido Schimmel <idosch@nvidia.com>
Subject: Re: [PATCH net-next 00/12] nexthop: Preparations for resilient next-hop groups
Date: Thu, 28 Jan 2021 20:24:33 -0700
Message-ID: <3ee4c868-0bd4-6df2-aaef-07efed2812f6@gmail.com>
In-Reply-To: <cover.1611836479.git.petrm@nvidia.com>
On 1/28/21 5:49 AM, Petr Machata wrote:
> At this moment, there is only one type of next-hop group: an mpath group.
> Mpath groups implement the hash-threshold algorithm, described in RFC
> 2992[1].
>
> To select a next hop, the hash-threshold algorithm first assigns a range of
> hashes to each next hop in the group, and then selects the next hop by
> comparing the SKB hash with the individual ranges. When a next hop is
> removed from the group, the ranges are recomputed, which leads to
> reassignment of parts of hash space from one next hop to another. RFC 2992
> illustrates it thus:
>
> +-------+-------+-------+-------+-------+
> |   1   |   2   |   3   |   4   |   5   |
> +-------+-+-----+---+---+-----+-+-------+
> |    1    |    2    |    4    |    5    |
> +---------+---------+---------+---------+
>
> Before and after deletion of next hop 3
> under the hash-threshold algorithm.
>
> Note how next hop 2 gave up part of the hash space in favor of next hop 1,
> and 4 in favor of 5. While there will usually be some overlap between the
> previous and the new distribution, some traffic flows change the next hop
> that they resolve to.
>
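The range reassignment described above can be reproduced with a minimal sketch. It assumes equal-weight next hops and a small hash space of 1000 values so that the numbers are easy to check; the function name and the hash space size are illustrative and are not taken from the kernel code.

```python
HASH_SPACE = 1000  # assumed size of the hash space, for illustration only


def threshold_select(next_hops, flow_hash, hash_space=HASH_SPACE):
    """Split the hash space into equal contiguous ranges, one per next hop,
    and return the next hop whose range contains flow_hash."""
    index = flow_hash * len(next_hops) // hash_space
    return next_hops[index]


hops_before = [1, 2, 3, 4, 5]
hops_after = [1, 2, 4, 5]  # next hop 3 deleted

# Count hashes that did not resolve to next hop 3, yet changed next hop.
unaffected = [h for h in range(HASH_SPACE)
              if threshold_select(hops_before, h) != 3]
moved = sum(1 for h in unaffected
            if threshold_select(hops_before, h) != threshold_select(hops_after, h))

print(f"{moved} of {len(unaffected)} hashes moved to a different next hop")
```

In this toy setup, the hashes that move are the ones next hop 2 gives up to next hop 1 and next hop 4 gives up to next hop 5, even though none of them ever used next hop 3.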
> If a multipath group is used for load-balancing between multiple servers,
> this hash space reassignment causes packets from a single flow to suddenly
> arrive at a server that does not expect them, which may lead to a TCP
> reset.
>
> If a multipath group is used for load-balancing among available paths to
> the same server, the issue is that different latencies and reordering along
> the way cause the packets to arrive in the wrong order.
>
> Resilient hashing is a technique to address the above problem. A resilient
> next-hop group has another layer of indirection between the group itself
> and its constituent next hops: a hash table. The selection algorithm uses a
> straightforward modulo operation to choose a hash bucket, and then reads
> the next hop that this bucket contains, and forwards traffic there.
>
> This indirection brings an important feature. In the hash-threshold
> algorithm, the range of hashes associated with a next hop must be
> continuous. With a hash table, mapping between the hash table buckets and
> the individual next hops is arbitrary. Therefore when a next hop is deleted
> the buckets that held it are simply reassigned to other next hops:
>
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> v v v v
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>
> Before and after deletion of next hop 3
> under the resilient hashing algorithm.
>
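The bucket table in the diagram can be sketched in a few lines. This is an illustration under stated assumptions, not the planned kernel implementation: four buckets per next hop, and freed buckets handed to the surviving next hops in round-robin order. The function names are hypothetical.

```python
import itertools


def build_table(next_hops, buckets_per_hop=4):
    """Give every next hop the same number of consecutive buckets."""
    return [nh for nh in next_hops for _ in range(buckets_per_hop)]


def resilient_select(table, flow_hash):
    """Choose a bucket with a plain modulo, then read the next hop it holds."""
    return table[flow_hash % len(table)]


def remove_next_hop(table, removed, survivors):
    """Reassign only the buckets that held `removed`; leave the rest alone."""
    replacement = itertools.cycle(survivors)
    return [next(replacement) if nh == removed else nh for nh in table]


table_before = build_table([1, 2, 3, 4, 5])
table_after = remove_next_hop(table_before, removed=3, survivors=[1, 2, 4, 5])

# Hashes that did not resolve to next hop 3 keep their next hop.
moved = sum(1 for h in range(1000)
            if resilient_select(table_before, h) != 3
            and resilient_select(table_before, h) != resilient_select(table_after, h))

print(table_after)
print(f"{moved} unaffected hashes changed next hop")
```

Because the bucket-to-next-hop mapping is arbitrary, only the four buckets that held next hop 3 change, which matches the second row of the diagram.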
> When weights of next hops in a group are altered, it may be possible to
> choose a subset of buckets that are currently not used for forwarding
> traffic, and use those to satisfy the new next-hop distribution demands,
> keeping the "busy" buckets intact. This way, established flows ideally
> keep being forwarded to the same endpoints through the same paths as
> before the next-hop group change.
>
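One way to picture the weight change described above is the sketch below. It rests on assumptions that the cover letter does not spell out: each bucket records when it last forwarded a packet, a bucket counts as idle once that timestamp is older than a timeout, and the new weights are expressed as a wanted bucket count per next hop. All names and values are hypothetical.

```python
from collections import Counter


def rebalance(table, last_used, wanted, now, idle_timeout):
    """Move idle buckets from over-weight next hops to under-weight ones.

    table:     next hop held by each bucket
    last_used: time each bucket last forwarded a packet (assumed bookkeeping)
    wanted:    desired bucket count per next hop, summing to len(table)
    """
    have = Counter(table)
    new_table = list(table)
    for i, nh in enumerate(table):
        idle = now - last_used[i] >= idle_timeout
        if not idle or have[nh] <= wanted.get(nh, 0):
            continue  # busy bucket, or its next hop is not over-weight
        target = next((t for t, w in wanted.items() if have[t] < w), None)
        if target is None:
            break  # every next hop already has its wanted share
        new_table[i] = target
        have[nh] -= 1
        have[target] += 1
    return new_table


table = [1, 1, 1, 1, 2, 2, 2, 2]
last_used = [100, 100, 50, 50, 100, 100, 50, 50]  # buckets 0 and 1 are busy
new_table = rebalance(table, last_used, wanted={1: 2, 2: 6},
                      now=100, idle_timeout=10)
print(new_table)
```

In this example next hop 1 drops from four buckets to two, and only its two idle buckets are given to next hop 2, so flows hashing to the busy buckets keep their next hop. If not enough idle buckets exist, the sketch simply leaves the table short of the wanted distribution.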
> This patchset prepares the next-hop code for eventual introduction of
> resilient hashing groups.
>
> - Patches #1-#4 carry otherwise disjoint changes that just remove certain
> assumptions in the next-hop code.
>
> - Patches #5-#6 extend the in-kernel next-hop notifiers to support more
> next-hop group types.
>
> - Patches #7-#12 refactor RTNL message handlers. Resilient next-hop groups
> will introduce a new logical object, a hash table bucket. It turns out
> that handling bucket-related messages is similar to how next-hop messages
> are handled. These patches extract the commonalities into reusable
> components.
>
> The plan is to contribute approximately the following patchsets:
>
> 1) Nexthop policy refactoring (already pushed)
> 2) Preparations for resilient next-hop groups (this patchset)
> 3) Implementation of resilient next-hop groups
> 4) Netdevsim offload plus a suite of selftests
> 5) Preparations for mlxsw offload of resilient next-hop groups
> 6) mlxsw offload including selftests
>
> Interested parties can look at the current state of the code at [2] and
> [3].
>
> [1] https://tools.ietf.org/html/rfc2992
> [2] https://github.com/idosch/linux/commits/submit/res_integ_v1
> [3] https://github.com/idosch/iproute2/commits/submit/res_v1
>
Very easy to review patchset. Thank you for that and for this cover
letter with the end goal and progress.
Thread overview: 27+ messages
2021-01-28 12:49 [PATCH net-next 00/12] nexthop: Preparations for resilient next-hop groups Petr Machata
2021-01-28 12:49 ` [PATCH net-next 01/12] nexthop: Rename nexthop_free_mpath Petr Machata
2021-01-29 3:05 ` David Ahern
2021-01-28 12:49 ` [PATCH net-next 02/12] nexthop: Dispatch nexthop_select_path() by group type Petr Machata
2021-01-29 3:08 ` David Ahern
2021-01-28 12:49 ` [PATCH net-next 03/12] nexthop: Introduce to struct nh_grp_entry a per-type union Petr Machata
2021-01-29 3:09 ` David Ahern
2021-01-28 12:49 ` [PATCH net-next 04/12] nexthop: Assert the invariant that a NH group is of only one type Petr Machata
2021-01-29 3:10 ` David Ahern
2021-01-28 12:49 ` [PATCH net-next 05/12] nexthop: Use enum to encode notification type Petr Machata
2021-01-29 3:12 ` David Ahern
2021-01-28 12:49 ` [PATCH net-next 06/12] nexthop: Dispatch notifier init()/fini() by group type Petr Machata
2021-01-29 3:13 ` David Ahern
2021-01-28 12:49 ` [PATCH net-next 07/12] nexthop: Extract dump filtering parameters into a single structure Petr Machata
2021-01-29 3:16 ` David Ahern
2021-01-28 12:49 ` [PATCH net-next 08/12] nexthop: Extract a common helper for parsing dump attributes Petr Machata
2021-01-29 3:17 ` David Ahern
2021-01-28 12:49 ` [PATCH net-next 09/12] nexthop: Strongly-type context of rtm_dump_nexthop() Petr Machata
2021-01-29 3:18 ` David Ahern
2021-01-28 12:49 ` [PATCH net-next 10/12] nexthop: Extract a helper for walking the next-hop tree Petr Machata
2021-01-29 3:19 ` David Ahern
2021-01-28 12:49 ` [PATCH net-next 11/12] nexthop: Add a callback parameter to rtm_dump_walk_nexthops() Petr Machata
2021-01-29 3:20 ` David Ahern
2021-01-28 12:49 ` [PATCH net-next 12/12] nexthop: Extract a helper for validation of get/del RTNL requests Petr Machata
2021-01-29 3:21 ` David Ahern
2021-01-29 3:24 ` David Ahern [this message]
2021-01-29 5:10 ` [PATCH net-next 00/12] nexthop: Preparations for resilient next-hop groups patchwork-bot+netdevbpf