Re: Matching on DSCP with IPv4 FIB rules

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Ido Schimmel <idosch@nvidia.com>
To: Guillaume Nault <gnault@redhat.com>
Cc: netdev@vger.kernel.org
Subject: Re: Matching on DSCP with IPv4 FIB rules
Date: Thu, 27 Jun 2024 22:54:29 +0300	[thread overview]
Message-ID: <Zn3DdfGZIVBxN0DR@shredder.lan> (raw)
In-Reply-To: <ZnypieBfn3CxCGDq@debian>

Hi Guillaume, thanks for the detailed response!

On Thu, Jun 27, 2024 at 01:51:37AM +0200, Guillaume Nault wrote:
> On Wed, Jun 26, 2024 at 02:58:17PM +0300, Ido Schimmel wrote:
> > Hi Guillaume, everyone,
> 
> Hi Ido, thanks for reaching out,
> 
> > We have users that would like to direct traffic to a routing table based
> > on the DSCP value in the IP header. While this can be done using IPv6
> > FIB rules, it cannot be done using IPv4 FIB rules as the kernel only
> > allows such rules to match on the three TOS bits from RFC 791 (lower
> > three DSCP bits). See more info in Guillaume's excellent presentation
> > here [1].
> > 
> > Extending IPv4 FIB rules to match on DSCP is not easy because of how
> > inconsistently the TOS field in the IPv4 flow information structure
> > (i.e., 'struct flowi4::flowi4_tos') is initialized and handled
> > throughout the networking stack.
> > 
> > Redefining the field using 'dscp_t' and removing the masking of the
> > upper three DSCP bits is not an option as it will change existing
> > behavior. For example, an incoming IPv4 packet with a DS field of 0xfc
> > will no longer match a FIB rule that matches on 'tos 0x1c'.
> 
> Could removing the high order bits mask actually _be_ an option? I was
> worried about behaviour change when I started looking into this. But,
> with time, I'm more and more thinking about just removing the mask.
> 
> Here are the reasons why:
> 
>   * DSCP deprecated the Precedence/TOS bits separation more than
>     25 years ago. I've never heard of anyone trying to use the high
>     order bits as Preference, while we've had several reports of people
>     using (or trying to use) the full DSCP bit range.
>     Also, I far as I know, Linux never offered any way to interpret the
>     high order bits as Precedence (apart from parsing these bits
>     manually with u32 or BPF, but these use cases wouldn't be affected
>     if we decided to use the full DSCP bit range in core IPv4 code).
> 
>   * Ignoring the high order bits creates useless inconsistencies
>     between the IPv4 and IPv6 code, while current RFCs make no such
>     distinction.
> 
>   * Even the IPv4 implementation isn't self consistent. While most
>     route lookups are done with the high order bits cleared, some parts
>     of the code explicitly use the full DSCP bit range.
> 
>   * In the past, people have sent patches to mask the high order DSCP
>     bits and created regressions because of that. People seem to use

By "patches" you mean IPv6 patches?

>     the RT_TOS() macro on whatever "tos" variable they see, without
>     really understanding the consequences. I think we'd be better off
>     without RT_TOS() and the various IPTOS_* variants, so people
>     wouldn't be tempted to copy/pasting such code.
> 
>   * It would indeed be a behaviour change to make "tos 0x1c" exactly
>     match "0x1c". But I'd be surprised if people really expected "0x1c"
>     to actually match "0xfc". Also currently one can set "tos 0x1f" in
>     routes, but no packet will ever match. That's probably not
>     something anyone would expect. Making "0x1c" mean "0x1c" and "0x1f"
>     mean "0x1f" would simplify everyone's life I believe.

Did you mean "0xfc" instead of "0x1f"? The kernel rejects routes with
"tos 0x1f" due to ECN bits being set.

I agree with everything you wrote except the assumption about users'
expectations. I honestly do not know if some users are relying on "tos
0x1c" to also match "0xfc", but I am not really interested in finding
out especially when undoing the change is not that easy. However, I have
another suggestion that might work which seems like a middle ground
between both approaches:

1. Extending the IPv4 flow information structure with a new 'dscp_t'
field (e.g., 'struct flowi4::dscp') and initializing it with the full
DSCP value throughout the stack. Already did this for all the places
where 'flowi4_tos' initialized other than flowi4_init_output() which is
next on my list.

2. Keeping the existing semantics of the "tos" keyword in ip-rule and
ip-route to match on the three lower DSCP bits, but changing the IPv4
functions that match on 'flowi4_tos' (fib_select_default,
fib4_rule_match, fib_table_lookup) to instead match on the new DSCP
field with a mask. For example, in fib4_rule_match(), instead of:

if (r->dscp && r->dscp != inet_dsfield_to_dscp(fl4->flowi4_tos))

We will have:

if (r->dscp && r->dscp != (fl4->dscp & IPTOS_RT_MASK))

I was only able to find two call paths that can reach these functions
with a TOS value that does not have its three upper DSCP bits masked:

nl_fib_input()
	nl_fib_lookup()
		flowi4_tos = frn->fl_tos	// Directly from user space
		fib_table_lookup()

nft_fib4_eval()
	flowi4_tos = iph->tos & DSCP_BITS
	fib_lookup()

The first belongs to an ancient "NETLINK_FIB_LOOKUP" family which I am
quite certain nobody is using and the second belongs to netfilter's fib
expression.

If regressions are reported for any of them (unlikely, IMO), we can add
a new flow information flag (e.g., 'FLOWI_FLAG_DSCP_NO_MASK') which will
tell the core routing functions to not apply the 'IPTOS_RT_MASK' mask.

3. Removing 'struct flowi4::flowi4_tos'.

4. Adding a new DSCP FIB rule attribute (e.g., 'FRA_DSCP') with a
matching "dscp" keyword in iproute2 that accepts values in the range of
[0, 63] which both address families will support. IPv4 will support it
via the new DSCP field ('struct flowi4::dscp') and IPv6 will support it
using the existing flow label field ('struct flowi6::flowlabel').

The kernel will reject rules that are configured with both "tos" and
"dscp".

I do not want to add a similar keyword to ip-route because I have no use
case for it and if we add it now we will never be able to remove it. It
can always be added later without too much effort.

> 
> > Instead, I was thinking of extending the IPv4 flow information structure
> > with a new 'dscp_t' field (e.g., 'struct flowi4::dscp') and adding a new
> > DSCP FIB rule attribute (e.g., 'FRA_DSCP') that accepts values in the
> > range of [0, 63] which both address families will support. This will
> > allow user space to get a consistent behavior between IPv4 and IPv6 with
> > regards to DSCP matching, without affecting existing use cases.
> 
> If removing the high order bits mask really isn't feasible, then yes,
> that'd probably be our best option. But this would make both the code
> and the UAPI more complex. Also we'd have to take care of corner cases
> (when both TOS and DSCP are set) and make the same changes to IPv4
> routes, to keep TOS/DSCP consistent between ip-rule and ip-route.
> 
> Dropping the high order bits mask, on the other hand, would make
> everything consistent and would simplifies both the code and the user
> experience. The only drawback is that "tos 0x1c" would only match "0x1c"
> (and not "0x1f" anymore). But, as I said earlier, I doubt if such a use
> case really exist.

Whether use cases like that exist or not is the main issue I have with
the removal of the high order bits mask. The advantage of this approach
is that no new uAPI is required, but the disadvantage is that there is a
potential for regressions without an easy mitigation.

I believe that with the approach I outlined above the potential for
regressions is lower and we should have a way to mitigate them if/when
reported. The disadvantage is that we need to introduce a new "dscp"
keyword and a new netlink attribute.

> 
> > Adding the new field and initializing it correctly throughout the stack
> > is not a small undertaking so I was wondering a) Are you OK with the
> > suggested approach? b) If not, what else would you suggest?
> 
> Sorry for the long text, but I think you have my opinion now.
> And yes, whatever the option, this is going to be a long task.

Yes :(

> 
> Side note: I'm actually working on a series to start converting
> flowi4_tos to dscp_t. I should have a first patch set ready soon
> (converting only a few places). But, I'm keeping the old behaviour of
> clearing the 3 high order bits for now (these are just two separate
> topics).

I will be happy to review, but I'm not sure what you mean by "converting
only a few places". How does it work?

> 
> I can allocate more time on the dscp_t conversion and work/help with
> removing the high order bits mask if there's interest in this option.
> 
> > Thanks
> > 
> > [1] https://lpc.events/event/11/contributions/943/attachments/901/1780/inet_tos_lpc2021.pdf
> > 
>

next prev parent reply	other threads:[~2024-06-27 19:54 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-26 11:58 Matching on DSCP with IPv4 FIB rules Ido Schimmel
2024-06-26 23:51 ` Guillaume Nault
2024-06-27 19:54   ` Ido Schimmel [this message]
2024-07-04 18:19     ` Guillaume Nault
2024-07-18 13:14       ` Ido Schimmel
2024-07-19 17:57         ` Guillaume Nault
2024-07-25  9:58           ` Ido Schimmel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zn3DdfGZIVBxN0DR@shredder.lan \
    --to=idosch@nvidia.com \
    --cc=gnault@redhat.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).