All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ido Schimmel <idosch@nvidia.com>
To: Guillaume Nault <gnault@redhat.com>
Cc: netdev@vger.kernel.org
Subject: Re: Matching on DSCP with IPv4 FIB rules
Date: Thu, 27 Jun 2024 22:54:29 +0300	[thread overview]
Message-ID: <Zn3DdfGZIVBxN0DR@shredder.lan> (raw)
In-Reply-To: <ZnypieBfn3CxCGDq@debian>

Hi Guillaume, thanks for the detailed response!

On Thu, Jun 27, 2024 at 01:51:37AM +0200, Guillaume Nault wrote:
> On Wed, Jun 26, 2024 at 02:58:17PM +0300, Ido Schimmel wrote:
> > Hi Guillaume, everyone,
> 
> Hi Ido, thanks for reaching out,
> 
> > We have users that would like to direct traffic to a routing table based
> > on the DSCP value in the IP header. While this can be done using IPv6
> > FIB rules, it cannot be done using IPv4 FIB rules as the kernel only
> > allows such rules to match on the three TOS bits from RFC 791 (lower
> > three DSCP bits). See more info in Guillaume's excellent presentation
> > here [1].
> > 
> > Extending IPv4 FIB rules to match on DSCP is not easy because of how
> > inconsistently the TOS field in the IPv4 flow information structure
> > (i.e., 'struct flowi4::flowi4_tos') is initialized and handled
> > throughout the networking stack.
> > 
> > Redefining the field using 'dscp_t' and removing the masking of the
> > upper three DSCP bits is not an option as it will change existing
> > behavior. For example, an incoming IPv4 packet with a DS field of 0xfc
> > will no longer match a FIB rule that matches on 'tos 0x1c'.
> 
> Could removing the high order bits mask actually _be_ an option? I was
> worried about behaviour change when I started looking into this. But,
> with time, I'm more and more thinking about just removing the mask.
> 
> Here are the reasons why:
> 
>   * DSCP deprecated the Precedence/TOS bits separation more than
>     25 years ago. I've never heard of anyone trying to use the high
>     order bits as Preference, while we've had several reports of people
>     using (or trying to use) the full DSCP bit range.
>     Also, I far as I know, Linux never offered any way to interpret the
>     high order bits as Precedence (apart from parsing these bits
>     manually with u32 or BPF, but these use cases wouldn't be affected
>     if we decided to use the full DSCP bit range in core IPv4 code).
> 
>   * Ignoring the high order bits creates useless inconsistencies
>     between the IPv4 and IPv6 code, while current RFCs make no such
>     distinction.
> 
>   * Even the IPv4 implementation isn't self consistent. While most
>     route lookups are done with the high order bits cleared, some parts
>     of the code explicitly use the full DSCP bit range.
> 
>   * In the past, people have sent patches to mask the high order DSCP
>     bits and created regressions because of that. People seem to use

By "patches" you mean IPv6 patches?

>     the RT_TOS() macro on whatever "tos" variable they see, without
>     really understanding the consequences. I think we'd be better off
>     without RT_TOS() and the various IPTOS_* variants, so people
>     wouldn't be tempted to copy/pasting such code.
> 
>   * It would indeed be a behaviour change to make "tos 0x1c" exactly
>     match "0x1c". But I'd be surprised if people really expected "0x1c"
>     to actually match "0xfc". Also currently one can set "tos 0x1f" in
>     routes, but no packet will ever match. That's probably not
>     something anyone would expect. Making "0x1c" mean "0x1c" and "0x1f"
>     mean "0x1f" would simplify everyone's life I believe.

Did you mean "0xfc" instead of "0x1f"? The kernel rejects routes with
"tos 0x1f" due to ECN bits being set.

I agree with everything you wrote except the assumption about users'
expectations. I honestly do not know if some users are relying on "tos
0x1c" to also match "0xfc", but I am not really interested in finding
out especially when undoing the change is not that easy. However, I have
another suggestion that might work which seems like a middle ground
between both approaches:

1. Extending the IPv4 flow information structure with a new 'dscp_t'
field (e.g., 'struct flowi4::dscp') and initializing it with the full
DSCP value throughout the stack. Already did this for all the places
where 'flowi4_tos' initialized other than flowi4_init_output() which is
next on my list.

2. Keeping the existing semantics of the "tos" keyword in ip-rule and
ip-route to match on the three lower DSCP bits, but changing the IPv4
functions that match on 'flowi4_tos' (fib_select_default,
fib4_rule_match, fib_table_lookup) to instead match on the new DSCP
field with a mask. For example, in fib4_rule_match(), instead of:

if (r->dscp && r->dscp != inet_dsfield_to_dscp(fl4->flowi4_tos))

We will have:

if (r->dscp && r->dscp != (fl4->dscp & IPTOS_RT_MASK))

I was only able to find two call paths that can reach these functions
with a TOS value that does not have its three upper DSCP bits masked:

nl_fib_input()
	nl_fib_lookup()
		flowi4_tos = frn->fl_tos	// Directly from user space
		fib_table_lookup()

nft_fib4_eval()
	flowi4_tos = iph->tos & DSCP_BITS
	fib_lookup()

The first belongs to an ancient "NETLINK_FIB_LOOKUP" family which I am
quite certain nobody is using and the second belongs to netfilter's fib
expression.

If regressions are reported for any of them (unlikely, IMO), we can add
a new flow information flag (e.g., 'FLOWI_FLAG_DSCP_NO_MASK') which will
tell the core routing functions to not apply the 'IPTOS_RT_MASK' mask.

3. Removing 'struct flowi4::flowi4_tos'.

4. Adding a new DSCP FIB rule attribute (e.g., 'FRA_DSCP') with a
matching "dscp" keyword in iproute2 that accepts values in the range of
[0, 63] which both address families will support. IPv4 will support it
via the new DSCP field ('struct flowi4::dscp') and IPv6 will support it
using the existing flow label field ('struct flowi6::flowlabel').

The kernel will reject rules that are configured with both "tos" and
"dscp".

I do not want to add a similar keyword to ip-route because I have no use
case for it and if we add it now we will never be able to remove it. It
can always be added later without too much effort.

> 
> > Instead, I was thinking of extending the IPv4 flow information structure
> > with a new 'dscp_t' field (e.g., 'struct flowi4::dscp') and adding a new
> > DSCP FIB rule attribute (e.g., 'FRA_DSCP') that accepts values in the
> > range of [0, 63] which both address families will support. This will
> > allow user space to get a consistent behavior between IPv4 and IPv6 with
> > regards to DSCP matching, without affecting existing use cases.
> 
> If removing the high order bits mask really isn't feasible, then yes,
> that'd probably be our best option. But this would make both the code
> and the UAPI more complex. Also we'd have to take care of corner cases
> (when both TOS and DSCP are set) and make the same changes to IPv4
> routes, to keep TOS/DSCP consistent between ip-rule and ip-route.
> 
> Dropping the high order bits mask, on the other hand, would make
> everything consistent and would simplifies both the code and the user
> experience. The only drawback is that "tos 0x1c" would only match "0x1c"
> (and not "0x1f" anymore). But, as I said earlier, I doubt if such a use
> case really exist.

Whether use cases like that exist or not is the main issue I have with
the removal of the high order bits mask. The advantage of this approach
is that no new uAPI is required, but the disadvantage is that there is a
potential for regressions without an easy mitigation.

I believe that with the approach I outlined above the potential for
regressions is lower and we should have a way to mitigate them if/when
reported. The disadvantage is that we need to introduce a new "dscp"
keyword and a new netlink attribute.

> 
> > Adding the new field and initializing it correctly throughout the stack
> > is not a small undertaking so I was wondering a) Are you OK with the
> > suggested approach? b) If not, what else would you suggest?
> 
> Sorry for the long text, but I think you have my opinion now.
> And yes, whatever the option, this is going to be a long task.

Yes :(

> 
> Side note: I'm actually working on a series to start converting
> flowi4_tos to dscp_t. I should have a first patch set ready soon
> (converting only a few places). But, I'm keeping the old behaviour of
> clearing the 3 high order bits for now (these are just two separate
> topics).

I will be happy to review, but I'm not sure what you mean by "converting
only a few places". How does it work?

> 
> I can allocate more time on the dscp_t conversion and work/help with
> removing the high order bits mask if there's interest in this option.
> 
> > Thanks
> > 
> > [1] https://lpc.events/event/11/contributions/943/attachments/901/1780/inet_tos_lpc2021.pdf
> > 
> 

  reply	other threads:[~2024-06-27 19:54 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-26 11:58 Matching on DSCP with IPv4 FIB rules Ido Schimmel
2024-06-26 23:51 ` Guillaume Nault
2024-06-27 19:54   ` Ido Schimmel [this message]
2024-07-04 18:19     ` Guillaume Nault
2024-07-18 13:14       ` Ido Schimmel
2024-07-19 17:57         ` Guillaume Nault
2024-07-25  9:58           ` Ido Schimmel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zn3DdfGZIVBxN0DR@shredder.lan \
    --to=idosch@nvidia.com \
    --cc=gnault@redhat.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.