netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Matching on DSCP with IPv4 FIB rules
@ 2024-06-26 11:58 Ido Schimmel
  2024-06-26 23:51 ` Guillaume Nault
  0 siblings, 1 reply; 7+ messages in thread
From: Ido Schimmel @ 2024-06-26 11:58 UTC (permalink / raw)
  To: gnault, netdev

Hi Guillaume, everyone,

We have users that would like to direct traffic to a routing table based
on the DSCP value in the IP header. While this can be done using IPv6
FIB rules, it cannot be done using IPv4 FIB rules as the kernel only
allows such rules to match on the three TOS bits from RFC 791 (lower
three DSCP bits). See more info in Guillaume's excellent presentation
here [1].

Extending IPv4 FIB rules to match on DSCP is not easy because of how
inconsistently the TOS field in the IPv4 flow information structure
(i.e., 'struct flowi4::flowi4_tos') is initialized and handled
throughout the networking stack.

Redefining the field using 'dscp_t' and removing the masking of the
upper three DSCP bits is not an option as it will change existing
behavior. For example, an incoming IPv4 packet with a DS field of 0xfc
will no longer match a FIB rule that matches on 'tos 0x1c'.

Instead, I was thinking of extending the IPv4 flow information structure
with a new 'dscp_t' field (e.g., 'struct flowi4::dscp') and adding a new
DSCP FIB rule attribute (e.g., 'FRA_DSCP') that accepts values in the
range of [0, 63] which both address families will support. This will
allow user space to get a consistent behavior between IPv4 and IPv6 with
regards to DSCP matching, without affecting existing use cases.

Adding the new field and initializing it correctly throughout the stack
is not a small undertaking so I was wondering a) Are you OK with the
suggested approach? b) If not, what else would you suggest?

Thanks

[1] https://lpc.events/event/11/contributions/943/attachments/901/1780/inet_tos_lpc2021.pdf

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Matching on DSCP with IPv4 FIB rules
  2024-06-26 11:58 Matching on DSCP with IPv4 FIB rules Ido Schimmel
@ 2024-06-26 23:51 ` Guillaume Nault
  2024-06-27 19:54   ` Ido Schimmel
  0 siblings, 1 reply; 7+ messages in thread
From: Guillaume Nault @ 2024-06-26 23:51 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev

On Wed, Jun 26, 2024 at 02:58:17PM +0300, Ido Schimmel wrote:
> Hi Guillaume, everyone,

Hi Ido, thanks for reaching out,

> We have users that would like to direct traffic to a routing table based
> on the DSCP value in the IP header. While this can be done using IPv6
> FIB rules, it cannot be done using IPv4 FIB rules as the kernel only
> allows such rules to match on the three TOS bits from RFC 791 (lower
> three DSCP bits). See more info in Guillaume's excellent presentation
> here [1].
> 
> Extending IPv4 FIB rules to match on DSCP is not easy because of how
> inconsistently the TOS field in the IPv4 flow information structure
> (i.e., 'struct flowi4::flowi4_tos') is initialized and handled
> throughout the networking stack.
> 
> Redefining the field using 'dscp_t' and removing the masking of the
> upper three DSCP bits is not an option as it will change existing
> behavior. For example, an incoming IPv4 packet with a DS field of 0xfc
> will no longer match a FIB rule that matches on 'tos 0x1c'.

Could removing the high order bits mask actually _be_ an option? I was
worried about behaviour change when I started looking into this. But,
with time, I'm more and more thinking about just removing the mask.

Here are the reasons why:

  * DSCP deprecated the Precedence/TOS bits separation more than
    25 years ago. I've never heard of anyone trying to use the high
    order bits as Preference, while we've had several reports of people
    using (or trying to use) the full DSCP bit range.
    Also, I far as I know, Linux never offered any way to interpret the
    high order bits as Precedence (apart from parsing these bits
    manually with u32 or BPF, but these use cases wouldn't be affected
    if we decided to use the full DSCP bit range in core IPv4 code).

  * Ignoring the high order bits creates useless inconsistencies
    between the IPv4 and IPv6 code, while current RFCs make no such
    distinction.

  * Even the IPv4 implementation isn't self consistent. While most
    route lookups are done with the high order bits cleared, some parts
    of the code explicitly use the full DSCP bit range.

  * In the past, people have sent patches to mask the high order DSCP
    bits and created regressions because of that. People seem to use
    the RT_TOS() macro on whatever "tos" variable they see, without
    really understanding the consequences. I think we'd be better off
    without RT_TOS() and the various IPTOS_* variants, so people
    wouldn't be tempted to copy/pasting such code.

  * It would indeed be a behaviour change to make "tos 0x1c" exactly
    match "0x1c". But I'd be surprised if people really expected "0x1c"
    to actually match "0xfc". Also currently one can set "tos 0x1f" in
    routes, but no packet will ever match. That's probably not
    something anyone would expect. Making "0x1c" mean "0x1c" and "0x1f"
    mean "0x1f" would simplify everyone's life I believe.

> Instead, I was thinking of extending the IPv4 flow information structure
> with a new 'dscp_t' field (e.g., 'struct flowi4::dscp') and adding a new
> DSCP FIB rule attribute (e.g., 'FRA_DSCP') that accepts values in the
> range of [0, 63] which both address families will support. This will
> allow user space to get a consistent behavior between IPv4 and IPv6 with
> regards to DSCP matching, without affecting existing use cases.

If removing the high order bits mask really isn't feasible, then yes,
that'd probably be our best option. But this would make both the code
and the UAPI more complex. Also we'd have to take care of corner cases
(when both TOS and DSCP are set) and make the same changes to IPv4
routes, to keep TOS/DSCP consistent between ip-rule and ip-route.

Dropping the high order bits mask, on the other hand, would make
everything consistent and would simplifies both the code and the user
experience. The only drawback is that "tos 0x1c" would only match "0x1c"
(and not "0x1f" anymore). But, as I said earlier, I doubt if such a use
case really exist.

> Adding the new field and initializing it correctly throughout the stack
> is not a small undertaking so I was wondering a) Are you OK with the
> suggested approach? b) If not, what else would you suggest?

Sorry for the long text, but I think you have my opinion now.
And yes, whatever the option, this is going to be a long task.

Side note: I'm actually working on a series to start converting
flowi4_tos to dscp_t. I should have a first patch set ready soon
(converting only a few places). But, I'm keeping the old behaviour of
clearing the 3 high order bits for now (these are just two separate
topics).

I can allocate more time on the dscp_t conversion and work/help with
removing the high order bits mask if there's interest in this option.

> Thanks
> 
> [1] https://lpc.events/event/11/contributions/943/attachments/901/1780/inet_tos_lpc2021.pdf
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Matching on DSCP with IPv4 FIB rules
  2024-06-26 23:51 ` Guillaume Nault
@ 2024-06-27 19:54   ` Ido Schimmel
  2024-07-04 18:19     ` Guillaume Nault
  0 siblings, 1 reply; 7+ messages in thread
From: Ido Schimmel @ 2024-06-27 19:54 UTC (permalink / raw)
  To: Guillaume Nault; +Cc: netdev

Hi Guillaume, thanks for the detailed response!

On Thu, Jun 27, 2024 at 01:51:37AM +0200, Guillaume Nault wrote:
> On Wed, Jun 26, 2024 at 02:58:17PM +0300, Ido Schimmel wrote:
> > Hi Guillaume, everyone,
> 
> Hi Ido, thanks for reaching out,
> 
> > We have users that would like to direct traffic to a routing table based
> > on the DSCP value in the IP header. While this can be done using IPv6
> > FIB rules, it cannot be done using IPv4 FIB rules as the kernel only
> > allows such rules to match on the three TOS bits from RFC 791 (lower
> > three DSCP bits). See more info in Guillaume's excellent presentation
> > here [1].
> > 
> > Extending IPv4 FIB rules to match on DSCP is not easy because of how
> > inconsistently the TOS field in the IPv4 flow information structure
> > (i.e., 'struct flowi4::flowi4_tos') is initialized and handled
> > throughout the networking stack.
> > 
> > Redefining the field using 'dscp_t' and removing the masking of the
> > upper three DSCP bits is not an option as it will change existing
> > behavior. For example, an incoming IPv4 packet with a DS field of 0xfc
> > will no longer match a FIB rule that matches on 'tos 0x1c'.
> 
> Could removing the high order bits mask actually _be_ an option? I was
> worried about behaviour change when I started looking into this. But,
> with time, I'm more and more thinking about just removing the mask.
> 
> Here are the reasons why:
> 
>   * DSCP deprecated the Precedence/TOS bits separation more than
>     25 years ago. I've never heard of anyone trying to use the high
>     order bits as Preference, while we've had several reports of people
>     using (or trying to use) the full DSCP bit range.
>     Also, I far as I know, Linux never offered any way to interpret the
>     high order bits as Precedence (apart from parsing these bits
>     manually with u32 or BPF, but these use cases wouldn't be affected
>     if we decided to use the full DSCP bit range in core IPv4 code).
> 
>   * Ignoring the high order bits creates useless inconsistencies
>     between the IPv4 and IPv6 code, while current RFCs make no such
>     distinction.
> 
>   * Even the IPv4 implementation isn't self consistent. While most
>     route lookups are done with the high order bits cleared, some parts
>     of the code explicitly use the full DSCP bit range.
> 
>   * In the past, people have sent patches to mask the high order DSCP
>     bits and created regressions because of that. People seem to use

By "patches" you mean IPv6 patches?

>     the RT_TOS() macro on whatever "tos" variable they see, without
>     really understanding the consequences. I think we'd be better off
>     without RT_TOS() and the various IPTOS_* variants, so people
>     wouldn't be tempted to copy/pasting such code.
> 
>   * It would indeed be a behaviour change to make "tos 0x1c" exactly
>     match "0x1c". But I'd be surprised if people really expected "0x1c"
>     to actually match "0xfc". Also currently one can set "tos 0x1f" in
>     routes, but no packet will ever match. That's probably not
>     something anyone would expect. Making "0x1c" mean "0x1c" and "0x1f"
>     mean "0x1f" would simplify everyone's life I believe.

Did you mean "0xfc" instead of "0x1f"? The kernel rejects routes with
"tos 0x1f" due to ECN bits being set.

I agree with everything you wrote except the assumption about users'
expectations. I honestly do not know if some users are relying on "tos
0x1c" to also match "0xfc", but I am not really interested in finding
out especially when undoing the change is not that easy. However, I have
another suggestion that might work which seems like a middle ground
between both approaches:

1. Extending the IPv4 flow information structure with a new 'dscp_t'
field (e.g., 'struct flowi4::dscp') and initializing it with the full
DSCP value throughout the stack. Already did this for all the places
where 'flowi4_tos' initialized other than flowi4_init_output() which is
next on my list.

2. Keeping the existing semantics of the "tos" keyword in ip-rule and
ip-route to match on the three lower DSCP bits, but changing the IPv4
functions that match on 'flowi4_tos' (fib_select_default,
fib4_rule_match, fib_table_lookup) to instead match on the new DSCP
field with a mask. For example, in fib4_rule_match(), instead of:

if (r->dscp && r->dscp != inet_dsfield_to_dscp(fl4->flowi4_tos))

We will have:

if (r->dscp && r->dscp != (fl4->dscp & IPTOS_RT_MASK))

I was only able to find two call paths that can reach these functions
with a TOS value that does not have its three upper DSCP bits masked:

nl_fib_input()
	nl_fib_lookup()
		flowi4_tos = frn->fl_tos	// Directly from user space
		fib_table_lookup()

nft_fib4_eval()
	flowi4_tos = iph->tos & DSCP_BITS
	fib_lookup()

The first belongs to an ancient "NETLINK_FIB_LOOKUP" family which I am
quite certain nobody is using and the second belongs to netfilter's fib
expression.

If regressions are reported for any of them (unlikely, IMO), we can add
a new flow information flag (e.g., 'FLOWI_FLAG_DSCP_NO_MASK') which will
tell the core routing functions to not apply the 'IPTOS_RT_MASK' mask.

3. Removing 'struct flowi4::flowi4_tos'.

4. Adding a new DSCP FIB rule attribute (e.g., 'FRA_DSCP') with a
matching "dscp" keyword in iproute2 that accepts values in the range of
[0, 63] which both address families will support. IPv4 will support it
via the new DSCP field ('struct flowi4::dscp') and IPv6 will support it
using the existing flow label field ('struct flowi6::flowlabel').

The kernel will reject rules that are configured with both "tos" and
"dscp".

I do not want to add a similar keyword to ip-route because I have no use
case for it and if we add it now we will never be able to remove it. It
can always be added later without too much effort.

> 
> > Instead, I was thinking of extending the IPv4 flow information structure
> > with a new 'dscp_t' field (e.g., 'struct flowi4::dscp') and adding a new
> > DSCP FIB rule attribute (e.g., 'FRA_DSCP') that accepts values in the
> > range of [0, 63] which both address families will support. This will
> > allow user space to get a consistent behavior between IPv4 and IPv6 with
> > regards to DSCP matching, without affecting existing use cases.
> 
> If removing the high order bits mask really isn't feasible, then yes,
> that'd probably be our best option. But this would make both the code
> and the UAPI more complex. Also we'd have to take care of corner cases
> (when both TOS and DSCP are set) and make the same changes to IPv4
> routes, to keep TOS/DSCP consistent between ip-rule and ip-route.
> 
> Dropping the high order bits mask, on the other hand, would make
> everything consistent and would simplifies both the code and the user
> experience. The only drawback is that "tos 0x1c" would only match "0x1c"
> (and not "0x1f" anymore). But, as I said earlier, I doubt if such a use
> case really exist.

Whether use cases like that exist or not is the main issue I have with
the removal of the high order bits mask. The advantage of this approach
is that no new uAPI is required, but the disadvantage is that there is a
potential for regressions without an easy mitigation.

I believe that with the approach I outlined above the potential for
regressions is lower and we should have a way to mitigate them if/when
reported. The disadvantage is that we need to introduce a new "dscp"
keyword and a new netlink attribute.

> 
> > Adding the new field and initializing it correctly throughout the stack
> > is not a small undertaking so I was wondering a) Are you OK with the
> > suggested approach? b) If not, what else would you suggest?
> 
> Sorry for the long text, but I think you have my opinion now.
> And yes, whatever the option, this is going to be a long task.

Yes :(

> 
> Side note: I'm actually working on a series to start converting
> flowi4_tos to dscp_t. I should have a first patch set ready soon
> (converting only a few places). But, I'm keeping the old behaviour of
> clearing the 3 high order bits for now (these are just two separate
> topics).

I will be happy to review, but I'm not sure what you mean by "converting
only a few places". How does it work?

> 
> I can allocate more time on the dscp_t conversion and work/help with
> removing the high order bits mask if there's interest in this option.
> 
> > Thanks
> > 
> > [1] https://lpc.events/event/11/contributions/943/attachments/901/1780/inet_tos_lpc2021.pdf
> > 
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Matching on DSCP with IPv4 FIB rules
  2024-06-27 19:54   ` Ido Schimmel
@ 2024-07-04 18:19     ` Guillaume Nault
  2024-07-18 13:14       ` Ido Schimmel
  0 siblings, 1 reply; 7+ messages in thread
From: Guillaume Nault @ 2024-07-04 18:19 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev

On Thu, Jun 27, 2024 at 10:54:29PM +0300, Ido Schimmel wrote:
> Hi Guillaume, thanks for the detailed response!
> 
> On Thu, Jun 27, 2024 at 01:51:37AM +0200, Guillaume Nault wrote:
> > 
> > Could removing the high order bits mask actually _be_ an option? I was
> > worried about behaviour change when I started looking into this. But,
> > with time, I'm more and more thinking about just removing the mask.
> > 
> > Here are the reasons why:
> > 
> >   * DSCP deprecated the Precedence/TOS bits separation more than
> >     25 years ago. I've never heard of anyone trying to use the high
> >     order bits as Preference, while we've had several reports of people
> >     using (or trying to use) the full DSCP bit range.
> >     Also, I far as I know, Linux never offered any way to interpret the
> >     high order bits as Precedence (apart from parsing these bits
> >     manually with u32 or BPF, but these use cases wouldn't be affected
> >     if we decided to use the full DSCP bit range in core IPv4 code).
> > 
> >   * Ignoring the high order bits creates useless inconsistencies
> >     between the IPv4 and IPv6 code, while current RFCs make no such
> >     distinction.
> > 
> >   * Even the IPv4 implementation isn't self consistent. While most
> >     route lookups are done with the high order bits cleared, some parts
> >     of the code explicitly use the full DSCP bit range.
> > 
> >   * In the past, people have sent patches to mask the high order DSCP
> >     bits and created regressions because of that. People seem to use
> 
> By "patches" you mean IPv6 patches?

Not necessarily. I had the following case in mind:
https://lore.kernel.org/netdev/20200805024131.2091206-1-liuhangbin@gmail.com/

I'm pretty sure this revert came after someone complained that setting
the high order DSCP bits stopped working in VXLAN. But I haven't
managed to find the original report.

But there has been fixes for IPv6 too:
https://lore.kernel.org/netdev/20220805191906.9323-1-matthias.may@westermo.com/

> >   * It would indeed be a behaviour change to make "tos 0x1c" exactly
> >     match "0x1c". But I'd be surprised if people really expected "0x1c"
> >     to actually match "0xfc". Also currently one can set "tos 0x1f" in
> >     routes, but no packet will ever match. That's probably not
> >     something anyone would expect. Making "0x1c" mean "0x1c" and "0x1f"
> >     mean "0x1f" would simplify everyone's life I believe.
> 
> Did you mean "0xfc" instead of "0x1f"? The kernel rejects routes with
> "tos 0x1f" due to ECN bits being set.

Yes, 0xfc. I don't know what I had in mind when I wrote 0x1f. That
value clearly doesn't make any sense in this context.

> I agree with everything you wrote except the assumption about users'
> expectations. I honestly do not know if some users are relying on "tos
> 0x1c" to also match "0xfc", but I am not really interested in finding
> out especially when undoing the change is not that easy. However, I have
> another suggestion that might work which seems like a middle ground
> between both approaches:
> 
> 1. Extending the IPv4 flow information structure with a new 'dscp_t'
> field (e.g., 'struct flowi4::dscp') and initializing it with the full
> DSCP value throughout the stack. Already did this for all the places
> where 'flowi4_tos' initialized other than flowi4_init_output() which is
> next on my list.
> 
> 2. Keeping the existing semantics of the "tos" keyword in ip-rule and
> ip-route to match on the three lower DSCP bits, but changing the IPv4
> functions that match on 'flowi4_tos' (fib_select_default,
> fib4_rule_match, fib_table_lookup) to instead match on the new DSCP
> field with a mask. For example, in fib4_rule_match(), instead of:
> 
> if (r->dscp && r->dscp != inet_dsfield_to_dscp(fl4->flowi4_tos))
> 
> We will have:
> 
> if (r->dscp && r->dscp != (fl4->dscp & IPTOS_RT_MASK))

So, do you mean to centralise the effect of all the current RT_TOS()
calls inside a few functions? So that we can eventually remove all
those RT_TOS() calls later?

> I was only able to find two call paths that can reach these functions
> with a TOS value that does not have its three upper DSCP bits masked:
> 
> nl_fib_input()
> 	nl_fib_lookup()
> 		flowi4_tos = frn->fl_tos	// Directly from user space
> 		fib_table_lookup()
> 
> nft_fib4_eval()
> 	flowi4_tos = iph->tos & DSCP_BITS
> 	fib_lookup()
> 
> The first belongs to an ancient "NETLINK_FIB_LOOKUP" family which I am
> quite certain nobody is using and the second belongs to netfilter's fib
> expression.

I agree that nl_fib_input() probably doesn't matter.

For nft_fib4_eval() it really looks like the current behaviour is
intended. And even though it's possible that nobody currently relies on
it, I think it's the correct one. So I don't really feel like changing
it.

> If regressions are reported for any of them (unlikely, IMO), we can add
> a new flow information flag (e.g., 'FLOWI_FLAG_DSCP_NO_MASK') which will
> tell the core routing functions to not apply the 'IPTOS_RT_MASK' mask.
> 
> 3. Removing 'struct flowi4::flowi4_tos'.
> 
> 4. Adding a new DSCP FIB rule attribute (e.g., 'FRA_DSCP') with a
> matching "dscp" keyword in iproute2 that accepts values in the range of
> [0, 63] which both address families will support. IPv4 will support it
> via the new DSCP field ('struct flowi4::dscp') and IPv6 will support it
> using the existing flow label field ('struct flowi6::flowlabel').

I'm sorry, something isn't clear to me. Since masking the high order
bits has been centralised at step 2, how will you match them?

If we continue to take fib4_rule_match() as an example; do you mean to
extend struct fib4_rule to store the extra information, so that
fib4_rule_match() knows how to test fl4->dscp? For example:

    /* Assuming FRA_DSCP sets ->dscp_mask to 0xff while the default
     * would be 0x1c to keep the old behaviour.
     */
    if (r->dscp && r->dscp != (fl4->dscp & r->dscp_mask))
        return 0;

> The kernel will reject rules that are configured with both "tos" and
> "dscp".
> 
> I do not want to add a similar keyword to ip-route because I have no use
> case for it and if we add it now we will never be able to remove it. It
> can always be added later without too much effort.
> 
> > 
> > > Instead, I was thinking of extending the IPv4 flow information structure
> > > with a new 'dscp_t' field (e.g., 'struct flowi4::dscp') and adding a new
> > > DSCP FIB rule attribute (e.g., 'FRA_DSCP') that accepts values in the
> > > range of [0, 63] which both address families will support. This will
> > > allow user space to get a consistent behavior between IPv4 and IPv6 with
> > > regards to DSCP matching, without affecting existing use cases.
> > 
> > If removing the high order bits mask really isn't feasible, then yes,
> > that'd probably be our best option. But this would make both the code
> > and the UAPI more complex. Also we'd have to take care of corner cases
> > (when both TOS and DSCP are set) and make the same changes to IPv4
> > routes, to keep TOS/DSCP consistent between ip-rule and ip-route.
> > 
> > Dropping the high order bits mask, on the other hand, would make
> > everything consistent and would simplifies both the code and the user
> > experience. The only drawback is that "tos 0x1c" would only match "0x1c"
> > (and not "0x1f" anymore). But, as I said earlier, I doubt if such a use
> > case really exist.
> 
> Whether use cases like that exist or not is the main issue I have with
> the removal of the high order bits mask. The advantage of this approach
> is that no new uAPI is required, but the disadvantage is that there is a
> potential for regressions without an easy mitigation.
> 
> I believe that with the approach I outlined above the potential for
> regressions is lower and we should have a way to mitigate them if/when
> reported. The disadvantage is that we need to introduce a new "dscp"
> keyword and a new netlink attribute.

What I'd really like is to stop the proliferation of RT_TOS() and to
get consistent behaviour. If the new "dscp" option allows that, while
still allowing to simplify the implementation, then I'm fine with it.

> > Side note: I'm actually working on a series to start converting
> > flowi4_tos to dscp_t. I should have a first patch set ready soon
> > (converting only a few places). But, I'm keeping the old behaviour of
> > clearing the 3 high order bits for now (these are just two separate
> > topics).
> 
> I will be happy to review, but I'm not sure what you mean by "converting
> only a few places". How does it work?

My idea is to go through the functions that uses ->flowi4_tos one by
one. I convert their u8 variables to dscp_t, but keep ->flowi4_tos a
__u8 field for the moment. For example:

-void my_func(__u8 tos, ...)
-void my_func(dscp_t dscp, ...)
 {
     ...
-    fl4.flowi4_tos = tos;
-    fl4.flowi4_tos = inet_dscp_to_dsfield(dscp);
     ...
 }

Of course, the whole call chain should be converted, until the function
that reads the value from a packet header or from user space:

 void another_func(const struct iphdr *ip4h)
 {
-    __u8 tos = ip4h->tos;
+    dscp_t dscp = inet_dsfield_to_dscp(ip4h->tos);
     ...
-    my_func(tos, ...);
+    my_func(dscp, ...);
     ...
 }

Depending on how complex is the call chain, I introduce intermediate
u8/dscp_t conversions, to keep patches simple.

The idea is to eventually have inet_dsfield_to_dscp() conversions at
the boundaries of the kernel, and to have temporary internal
inet_dscp_to_dsfield() calls when interacting with ->flowi4_tos.

Once all ->flowi4_tos users will actually use a dscp_t value, then
I'll convert this field from __u8 to dscp_t and remove all the extra
inet_dscp_to_dsfield() calls. That last patch will have to touch many
places at once, but, by renaming ->flowi4_tos to ->flowi4_dscp, we can
rely on the compiler and on sparse to ensure that every place has been
taken care of. Also, that final patch should be easy to review as it
should mostly consist of chunks like:

-    fl4->flowi4_tos = inet_dscp_to_dsfield(dscp);
+    fl4->flowi4_dscp = dscp;

But I'm not there yet ;).

Note that I'm going to be offline for a bit more than a week. I'll
catch up on this topic after I get back online.

> > I can allocate more time on the dscp_t conversion and work/help with
> > removing the high order bits mask if there's interest in this option.
> > 
> > > Thanks
> > > 
> > > [1] https://lpc.events/event/11/contributions/943/attachments/901/1780/inet_tos_lpc2021.pdf
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Matching on DSCP with IPv4 FIB rules
  2024-07-04 18:19     ` Guillaume Nault
@ 2024-07-18 13:14       ` Ido Schimmel
  2024-07-19 17:57         ` Guillaume Nault
  0 siblings, 1 reply; 7+ messages in thread
From: Ido Schimmel @ 2024-07-18 13:14 UTC (permalink / raw)
  To: Guillaume Nault; +Cc: netdev

On Thu, Jul 04, 2024 at 08:19:05PM +0200, Guillaume Nault wrote:
> On Thu, Jun 27, 2024 at 10:54:29PM +0300, Ido Schimmel wrote:
> > Hi Guillaume, thanks for the detailed response!
> > 
> > On Thu, Jun 27, 2024 at 01:51:37AM +0200, Guillaume Nault wrote:
> > > 
> > > Could removing the high order bits mask actually _be_ an option? I was
> > > worried about behaviour change when I started looking into this. But,
> > > with time, I'm more and more thinking about just removing the mask.
> > > 
> > > Here are the reasons why:
> > > 
> > >   * DSCP deprecated the Precedence/TOS bits separation more than
> > >     25 years ago. I've never heard of anyone trying to use the high
> > >     order bits as Preference, while we've had several reports of people
> > >     using (or trying to use) the full DSCP bit range.
> > >     Also, I far as I know, Linux never offered any way to interpret the
> > >     high order bits as Precedence (apart from parsing these bits
> > >     manually with u32 or BPF, but these use cases wouldn't be affected
> > >     if we decided to use the full DSCP bit range in core IPv4 code).
> > > 
> > >   * Ignoring the high order bits creates useless inconsistencies
> > >     between the IPv4 and IPv6 code, while current RFCs make no such
> > >     distinction.
> > > 
> > >   * Even the IPv4 implementation isn't self consistent. While most
> > >     route lookups are done with the high order bits cleared, some parts
> > >     of the code explicitly use the full DSCP bit range.
> > > 
> > >   * In the past, people have sent patches to mask the high order DSCP
> > >     bits and created regressions because of that. People seem to use
> > 
> > By "patches" you mean IPv6 patches?
> 
> Not necessarily. I had the following case in mind:
> https://lore.kernel.org/netdev/20200805024131.2091206-1-liuhangbin@gmail.com/
> 
> I'm pretty sure this revert came after someone complained that setting
> the high order DSCP bits stopped working in VXLAN. But I haven't
> managed to find the original report.
> 
> But there has been fixes for IPv6 too:
> https://lore.kernel.org/netdev/20220805191906.9323-1-matthias.may@westermo.com/
> 
> > >   * It would indeed be a behaviour change to make "tos 0x1c" exactly
> > >     match "0x1c". But I'd be surprised if people really expected "0x1c"
> > >     to actually match "0xfc". Also currently one can set "tos 0x1f" in
> > >     routes, but no packet will ever match. That's probably not
> > >     something anyone would expect. Making "0x1c" mean "0x1c" and "0x1f"
> > >     mean "0x1f" would simplify everyone's life I believe.
> > 
> > Did you mean "0xfc" instead of "0x1f"? The kernel rejects routes with
> > "tos 0x1f" due to ECN bits being set.
> 
> Yes, 0xfc. I don't know what I had in mind when I wrote 0x1f. That
> value clearly doesn't make any sense in this context.
> 
> > I agree with everything you wrote except the assumption about users'
> > expectations. I honestly do not know if some users are relying on "tos
> > 0x1c" to also match "0xfc", but I am not really interested in finding
> > out especially when undoing the change is not that easy. However, I have
> > another suggestion that might work which seems like a middle ground
> > between both approaches:
> > 
> > 1. Extending the IPv4 flow information structure with a new 'dscp_t'
> > field (e.g., 'struct flowi4::dscp') and initializing it with the full
> > DSCP value throughout the stack. Already did this for all the places
> > where 'flowi4_tos' initialized other than flowi4_init_output() which is
> > next on my list.
> > 
> > 2. Keeping the existing semantics of the "tos" keyword in ip-rule and
> > ip-route to match on the three lower DSCP bits, but changing the IPv4
> > functions that match on 'flowi4_tos' (fib_select_default,
> > fib4_rule_match, fib_table_lookup) to instead match on the new DSCP
> > field with a mask. For example, in fib4_rule_match(), instead of:
> > 
> > if (r->dscp && r->dscp != inet_dsfield_to_dscp(fl4->flowi4_tos))
> > 
> > We will have:
> > 
> > if (r->dscp && r->dscp != (fl4->dscp & IPTOS_RT_MASK))
> 
> So, do you mean to centralise the effect of all the current RT_TOS()
> calls inside a few functions? So that we can eventually remove all
> those RT_TOS() calls later?

Yes. See this patch:

https://github.com/idosch/linux/commit/80f536f629c4dccdb6c015a10ca25d7743233208.patch

I can send it when net-next opens. It should allow us to start removing
the masking of the high order DSCP bits without introducing regressions
as the masking now happens at the core and not at individual call sites
or along the path to the core.

> 
> > I was only able to find two call paths that can reach these functions
> > with a TOS value that does not have its three upper DSCP bits masked:
> > 
> > nl_fib_input()
> > 	nl_fib_lookup()
> > 		flowi4_tos = frn->fl_tos	// Directly from user space
> > 		fib_table_lookup()
> > 
> > nft_fib4_eval()
> > 	flowi4_tos = iph->tos & DSCP_BITS
> > 	fib_lookup()
> > 
> > The first belongs to an ancient "NETLINK_FIB_LOOKUP" family which I am
> > quite certain nobody is using and the second belongs to netfilter's fib
> > expression.
> 
> I agree that nl_fib_input() probably doesn't matter.
> 
> For nft_fib4_eval() it really looks like the current behaviour is
> intended. And even though it's possible that nobody currently relies on
> it, I think it's the correct one. So I don't really feel like changing
> it.

Yes, I agree. The patch I mentioned takes care of that by setting the
new 'FLOWI_FLAG_MATCH_FULL_DSCP' in nft_fib4_eval().

> 
> > If regressions are reported for any of them (unlikely, IMO), we can add
> > a new flow information flag (e.g., 'FLOWI_FLAG_DSCP_NO_MASK') which will
> > tell the core routing functions to not apply the 'IPTOS_RT_MASK' mask.
> > 
> > 3. Removing 'struct flowi4::flowi4_tos'.
> > 
> > 4. Adding a new DSCP FIB rule attribute (e.g., 'FRA_DSCP') with a
> > matching "dscp" keyword in iproute2 that accepts values in the range of
> > [0, 63] which both address families will support. IPv4 will support it
> > via the new DSCP field ('struct flowi4::dscp') and IPv6 will support it
> > using the existing flow label field ('struct flowi6::flowlabel').
> 
> I'm sorry, something isn't clear to me. Since masking the high order
> bits has been centralised at step 2, how will you match them?
> 
> If we continue to take fib4_rule_match() as an example; do you mean to
> extend struct fib4_rule to store the extra information, so that
> fib4_rule_match() knows how to test fl4->dscp? For example:

Yes. See these patches:

https://github.com/idosch/linux/commit/1a79fb59f66731cfc891e3fecb3b08cda6bb0170.patch
https://github.com/idosch/linux/commit/a4990aab8ee4866b9f853777a50de09537255d67.patch
https://github.com/idosch/linux/commit/7328d60b7cfe2b07b2d565c9af36f650e96552a5.patch
https://github.com/idosch/linux/commit/73a739735d27bef613813f0ac0a9280e6427264d.patch

Specifically these hunks from the second patch:

@@ -37,6 +37,7 @@ struct fib4_rule {
 	u8			dst_len;
 	u8			src_len;
 	dscp_t			dscp;
+	u8			is_dscp_sel:1;	/* DSCP or TOS selector */
 	__be32			src;
 	__be32			srcmask;
 	__be32			dst;
@@ -186,7 +187,14 @@ INDIRECT_CALLABLE_SCOPE int fib4_rule_match(struct fib_rule *rule,
 	    ((daddr ^ r->dst) & r->dstmask))
 		return 0;
 
-	if (r->dscp && !fib_dscp_match(r->dscp, fl4))
+	/* When DSCP selector is used we need to match on the entire DSCP field
+	 * in the flow information structure. When TOS selector is used we need
+	 * to mask the upper three DSCP bits prior to matching to maintain
+	 * legacy behavior.
+	 */
+	if (r->is_dscp_sel && r->dscp != inet_dsfield_to_dscp(fl4->flowi4_tos))
+		return 0;
+	else if (!r->is_dscp_sel && r->dscp && !fib_dscp_match(r->dscp, fl4))
 		return 0;
 
 	if (rule->ip_proto && (rule->ip_proto != fl4->flowi4_proto))

Note that it is just an RFC. I first need to remove the masking of the
high order DSCP bits before I can send it.

> 
>     /* Assuming FRA_DSCP sets ->dscp_mask to 0xff while the default
>      * would be 0x1c to keep the old behaviour.
>      */
>     if (r->dscp && r->dscp != (fl4->dscp & r->dscp_mask))

It's a bit more involved. Even if the old TOS selector is used, we don't
always want to mask using 0x1c. If nft_fib4_eval() filled 0xfc in
'flowi4_tos', then by masking using 0x1c it will now match a FIB rule
that was configured with 'tos 0x1c' whereas previously it didn't. The
new 'FLOWI_FLAG_MATCH_FULL_DSCP' takes care of that, but it only applies
to rules configured with the TOS selector. The new DSCP selector will
always match against the entire DSCP field.

>         return 0;
> 
> > The kernel will reject rules that are configured with both "tos" and
> > "dscp".
> > 
> > I do not want to add a similar keyword to ip-route because I have no use
> > case for it and if we add it now we will never be able to remove it. It
> > can always be added later without too much effort.
> > 
> > > 
> > > > Instead, I was thinking of extending the IPv4 flow information structure
> > > > with a new 'dscp_t' field (e.g., 'struct flowi4::dscp') and adding a new
> > > > DSCP FIB rule attribute (e.g., 'FRA_DSCP') that accepts values in the
> > > > range of [0, 63] which both address families will support. This will
> > > > allow user space to get a consistent behavior between IPv4 and IPv6 with
> > > > regards to DSCP matching, without affecting existing use cases.
> > > 
> > > If removing the high order bits mask really isn't feasible, then yes,
> > > that'd probably be our best option. But this would make both the code
> > > and the UAPI more complex. Also we'd have to take care of corner cases
> > > (when both TOS and DSCP are set) and make the same changes to IPv4
> > > routes, to keep TOS/DSCP consistent between ip-rule and ip-route.
> > > 
> > > Dropping the high order bits mask, on the other hand, would make
> > > everything consistent and would simplifies both the code and the user
> > > experience. The only drawback is that "tos 0x1c" would only match "0x1c"
> > > (and not "0x1f" anymore). But, as I said earlier, I doubt if such a use
> > > case really exist.
> > 
> > Whether use cases like that exist or not is the main issue I have with
> > the removal of the high order bits mask. The advantage of this approach
> > is that no new uAPI is required, but the disadvantage is that there is a
> > potential for regressions without an easy mitigation.
> > 
> > I believe that with the approach I outlined above the potential for
> > regressions is lower and we should have a way to mitigate them if/when
> > reported. The disadvantage is that we need to introduce a new "dscp"
> > keyword and a new netlink attribute.
> 
> What I'd really like is to stop the proliferation of RT_TOS() and to
> get consistent behaviour. If the new "dscp" option allows that, while
> still allowing to simplify the implementation, then I'm fine with it.

Yes, I believe the new "dscp" option gets us there.

> 
> > > Side note: I'm actually working on a series to start converting
> > > flowi4_tos to dscp_t. I should have a first patch set ready soon
> > > (converting only a few places). But, I'm keeping the old behaviour of
> > > clearing the 3 high order bits for now (these are just two separate
> > > topics).
> > 
> > I will be happy to review, but I'm not sure what you mean by "converting
> > only a few places". How does it work?
> 
> My idea is to go through the functions that uses ->flowi4_tos one by
> one. I convert their u8 variables to dscp_t, but keep ->flowi4_tos a
> __u8 field for the moment. For example:
> 
> -void my_func(__u8 tos, ...)
> -void my_func(dscp_t dscp, ...)
>  {
>      ...
> -    fl4.flowi4_tos = tos;
> -    fl4.flowi4_tos = inet_dscp_to_dsfield(dscp);
>      ...
>  }
> 
> Of course, the whole call chain should be converted, until the function
> that reads the value from a packet header or from user space:
> 
>  void another_func(const struct iphdr *ip4h)
>  {
> -    __u8 tos = ip4h->tos;
> +    dscp_t dscp = inet_dsfield_to_dscp(ip4h->tos);
>      ...
> -    my_func(tos, ...);
> +    my_func(dscp, ...);
>      ...
>  }
> 
> Depending on how complex is the call chain, I introduce intermediate
> u8/dscp_t conversions, to keep patches simple.
> 
> The idea is to eventually have inet_dsfield_to_dscp() conversions at
> the boundaries of the kernel, and to have temporary internal
> inet_dscp_to_dsfield() calls when interacting with ->flowi4_tos.
> 
> Once all ->flowi4_tos users will actually use a dscp_t value, then
> I'll convert this field from __u8 to dscp_t and remove all the extra
> inet_dscp_to_dsfield() calls. That last patch will have to touch many
> places at once, but, by renaming ->flowi4_tos to ->flowi4_dscp, we can
> rely on the compiler and on sparse to ensure that every place has been
> taken care of. Also, that final patch should be easy to review as it
> should mostly consist of chunks like:
> 
> -    fl4->flowi4_tos = inet_dscp_to_dsfield(dscp);
> +    fl4->flowi4_dscp = dscp;
> 
> But I'm not there yet ;).

OK, I see, thanks for explaining.

Are you OK with the approach that I outlined above? Basically:

1. Submitting the patch that centralizes TOS matching
2. Removing the masking of the high order DSCP bits
3. Adding new DSCP selector for FIB rules

If you already have some patches that convert to 'dscp_t', then I can
work on top of them to avoid conflicts.

> 
> Note that I'm going to be offline for a bit more than a week. I'll
> catch up on this topic after I get back online.
> 
> > > I can allocate more time on the dscp_t conversion and work/help with
> > > removing the high order bits mask if there's interest in this option.
> > > 
> > > > Thanks
> > > > 
> > > > [1] https://lpc.events/event/11/contributions/943/attachments/901/1780/inet_tos_lpc2021.pdf
> > > > 
> > > 
> > 
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Matching on DSCP with IPv4 FIB rules
  2024-07-18 13:14       ` Ido Schimmel
@ 2024-07-19 17:57         ` Guillaume Nault
  2024-07-25  9:58           ` Ido Schimmel
  0 siblings, 1 reply; 7+ messages in thread
From: Guillaume Nault @ 2024-07-19 17:57 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev

On Thu, Jul 18, 2024 at 04:14:08PM +0300, Ido Schimmel wrote:
> On Thu, Jul 04, 2024 at 08:19:05PM +0200, Guillaume Nault wrote:
> 
> > So, do you mean to centralise the effect of all the current RT_TOS()
> > calls inside a few functions? So that we can eventually remove all
> > those RT_TOS() calls later?
> 
> Yes. See this patch:
> 
> https://github.com/idosch/linux/commit/80f536f629c4dccdb6c015a10ca25d7743233208.patch
> 
> I can send it when net-next opens. It should allow us to start removing
> the masking of the high order DSCP bits without introducing regressions
> as the masking now happens at the core and not at individual call sites
> or along the path to the core.

Thanks for the sample patch. I think we're on the same page.

> > > I was only able to find two call paths that can reach these functions
> > > with a TOS value that does not have its three upper DSCP bits masked:
> > > 
> > > nl_fib_input()
> > > 	nl_fib_lookup()
> > > 		flowi4_tos = frn->fl_tos	// Directly from user space
> > > 		fib_table_lookup()
> > > 
> > > nft_fib4_eval()
> > > 	flowi4_tos = iph->tos & DSCP_BITS
> > > 	fib_lookup()
> > > 
> > > The first belongs to an ancient "NETLINK_FIB_LOOKUP" family which I am
> > > quite certain nobody is using and the second belongs to netfilter's fib
> > > expression.
> > 
> > I agree that nl_fib_input() probably doesn't matter.
> > 
> > For nft_fib4_eval() it really looks like the current behaviour is
> > intended. And even though it's possible that nobody currently relies on
> > it, I think it's the correct one. So I don't really feel like changing
> > it.
> 
> Yes, I agree. The patch I mentioned takes care of that by setting the
> new 'FLOWI_FLAG_MATCH_FULL_DSCP' in nft_fib4_eval().

Okay, let me contradict myself... :)

Considering that the number of users of this new flag has no
reason to grow and that we can ignore nl_fib_input() (which is close to
unusable as the necessary fib_result_nl structure isn't exported to
include/uapi/), does it really make sense to add a special case just
for nft_fib4_eval()?

I imagine the pain of describing the tos and dscp keywords in man pages
for example. There will be enough important details about the ECN bits,
the behaviour differences between IPv4 and IPv6, the different number
representation between dscp and tos (bit shift)... If we also have to
explain that the behaviour also depends on the module at the origin of
the route lookup, end users are going to get completely lost.

> > > If regressions are reported for any of them (unlikely, IMO), we can add
> > > a new flow information flag (e.g., 'FLOWI_FLAG_DSCP_NO_MASK') which will
> > > tell the core routing functions to not apply the 'IPTOS_RT_MASK' mask.
> > > 
> > > 3. Removing 'struct flowi4::flowi4_tos'.
> > > 
> > > 4. Adding a new DSCP FIB rule attribute (e.g., 'FRA_DSCP') with a
> > > matching "dscp" keyword in iproute2 that accepts values in the range of
> > > [0, 63] which both address families will support. IPv4 will support it
> > > via the new DSCP field ('struct flowi4::dscp') and IPv6 will support it
> > > using the existing flow label field ('struct flowi6::flowlabel').
> > 
> > I'm sorry, something isn't clear to me. Since masking the high order
> > bits has been centralised at step 2, how will you match them?
> > 
> > If we continue to take fib4_rule_match() as an example; do you mean to
> > extend struct fib4_rule to store the extra information, so that
> > fib4_rule_match() knows how to test fl4->dscp? For example:
> 
> Yes. See these patches:
> 
> https://github.com/idosch/linux/commit/1a79fb59f66731cfc891e3fecb3b08cda6bb0170.patch
> https://github.com/idosch/linux/commit/a4990aab8ee4866b9f853777a50de09537255d67.patch
> https://github.com/idosch/linux/commit/7328d60b7cfe2b07b2d565c9af36f650e96552a5.patch
> https://github.com/idosch/linux/commit/73a739735d27bef613813f0ac0a9280e6427264d.patch
> 
> Specifically these hunks from the second patch:
> 
> @@ -37,6 +37,7 @@ struct fib4_rule {
>  	u8			dst_len;
>  	u8			src_len;
>  	dscp_t			dscp;
> +	u8			is_dscp_sel:1;	/* DSCP or TOS selector */
>  	__be32			src;
>  	__be32			srcmask;
>  	__be32			dst;
> @@ -186,7 +187,14 @@ INDIRECT_CALLABLE_SCOPE int fib4_rule_match(struct fib_rule *rule,
>  	    ((daddr ^ r->dst) & r->dstmask))
>  		return 0;
>  
> -	if (r->dscp && !fib_dscp_match(r->dscp, fl4))
> +	/* When DSCP selector is used we need to match on the entire DSCP field
> +	 * in the flow information structure. When TOS selector is used we need
> +	 * to mask the upper three DSCP bits prior to matching to maintain
> +	 * legacy behavior.
> +	 */
> +	if (r->is_dscp_sel && r->dscp != inet_dsfield_to_dscp(fl4->flowi4_tos))
> +		return 0;
> +	else if (!r->is_dscp_sel && r->dscp && !fib_dscp_match(r->dscp, fl4))
>  		return 0;
>  
>  	if (rule->ip_proto && (rule->ip_proto != fl4->flowi4_proto))
> 
> Note that it is just an RFC. I first need to remove the masking of the
> high order DSCP bits before I can send it.

I find "is_dscp_sel" not very informative as a field name. Maybe
"dscp_nomask" or "dscp_full" would be better? They're more intuitive
to me, but I have no problem if you prefer to keep "is_dscp_sel" of
course.

> >     /* Assuming FRA_DSCP sets ->dscp_mask to 0xff while the default
> >      * would be 0x1c to keep the old behaviour.
> >      */
> >     if (r->dscp && r->dscp != (fl4->dscp & r->dscp_mask))
> 
> It's a bit more involved. Even if the old TOS selector is used, we don't
> always want to mask using 0x1c. If nft_fib4_eval() filled 0xfc in
> 'flowi4_tos', then by masking using 0x1c it will now match a FIB rule
> that was configured with 'tos 0x1c' whereas previously it didn't. The
> new 'FLOWI_FLAG_MATCH_FULL_DSCP' takes care of that, but it only applies
> to rules configured with the TOS selector. The new DSCP selector will
> always match against the entire DSCP field.

Back to nft_fib4_eval(), making it behave like the rest of the kernel
would also mean it'd behave like the existing ipt_rpfilter module. So
people moving from iptables to nftables would keep the same behaviour.
Unless you strongly feel about keeping the FLOWI_FLAG_MATCH_FULL_DSCP
flag, I think we should ask Pablo and Florian if they're okay for
making nft_fib4_eval() behave like the rest of the stack.

> Are you OK with the approach that I outlined above? Basically:
> 
> 1. Submitting the patch that centralizes TOS matching
> 2. Removing the masking of the high order DSCP bits
> 3. Adding new DSCP selector for FIB rules

Yes, looks like a good plan!

> If you already have some patches that convert to 'dscp_t', then I can
> work on top of them to avoid conflicts.

I think we can complete the new dscp feature faster than the dscp_t
conversion. Therefore I think it'll make more sense for me to rebase
on top of your patches.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Matching on DSCP with IPv4 FIB rules
  2024-07-19 17:57         ` Guillaume Nault
@ 2024-07-25  9:58           ` Ido Schimmel
  0 siblings, 0 replies; 7+ messages in thread
From: Ido Schimmel @ 2024-07-25  9:58 UTC (permalink / raw)
  To: Guillaume Nault; +Cc: netdev

On Fri, Jul 19, 2024 at 07:57:27PM +0200, Guillaume Nault wrote:
> On Thu, Jul 18, 2024 at 04:14:08PM +0300, Ido Schimmel wrote:
> > On Thu, Jul 04, 2024 at 08:19:05PM +0200, Guillaume Nault wrote:
> > 
> > > So, do you mean to centralise the effect of all the current RT_TOS()
> > > calls inside a few functions? So that we can eventually remove all
> > > those RT_TOS() calls later?
> > 
> > Yes. See this patch:
> > 
> > https://github.com/idosch/linux/commit/80f536f629c4dccdb6c015a10ca25d7743233208.patch
> > 
> > I can send it when net-next opens. It should allow us to start removing
> > the masking of the high order DSCP bits without introducing regressions
> > as the masking now happens at the core and not at individual call sites
> > or along the path to the core.
> 
> Thanks for the sample patch. I think we're on the same page.
> 
> > > > I was only able to find two call paths that can reach these functions
> > > > with a TOS value that does not have its three upper DSCP bits masked:
> > > > 
> > > > nl_fib_input()
> > > > 	nl_fib_lookup()
> > > > 		flowi4_tos = frn->fl_tos	// Directly from user space
> > > > 		fib_table_lookup()
> > > > 
> > > > nft_fib4_eval()
> > > > 	flowi4_tos = iph->tos & DSCP_BITS
> > > > 	fib_lookup()
> > > > 
> > > > The first belongs to an ancient "NETLINK_FIB_LOOKUP" family which I am
> > > > quite certain nobody is using and the second belongs to netfilter's fib
> > > > expression.
> > > 
> > > I agree that nl_fib_input() probably doesn't matter.
> > > 
> > > For nft_fib4_eval() it really looks like the current behaviour is
> > > intended. And even though it's possible that nobody currently relies on
> > > it, I think it's the correct one. So I don't really feel like changing
> > > it.
> > 
> > Yes, I agree. The patch I mentioned takes care of that by setting the
> > new 'FLOWI_FLAG_MATCH_FULL_DSCP' in nft_fib4_eval().
> 
> Okay, let me contradict myself... :)
> 
> Considering that the number of users of this new flag has no
> reason to grow and that we can ignore nl_fib_input() (which is close to
> unusable as the necessary fib_result_nl structure isn't exported to
> include/uapi/), does it really make sense to add a special case just
> for nft_fib4_eval()?

I have no problem removing the flag and introducing it only if
regressions are reported. See more below.

> 
> I imagine the pain of describing the tos and dscp keywords in man pages
> for example. There will be enough important details about the ECN bits,
> the behaviour differences between IPv4 and IPv6, the different number
> representation between dscp and tos (bit shift)... If we also have to
> explain that the behaviour also depends on the module at the origin of
> the route lookup, end users are going to get completely lost.
> 
> > > > If regressions are reported for any of them (unlikely, IMO), we can add
> > > > a new flow information flag (e.g., 'FLOWI_FLAG_DSCP_NO_MASK') which will
> > > > tell the core routing functions to not apply the 'IPTOS_RT_MASK' mask.
> > > > 
> > > > 3. Removing 'struct flowi4::flowi4_tos'.
> > > > 
> > > > 4. Adding a new DSCP FIB rule attribute (e.g., 'FRA_DSCP') with a
> > > > matching "dscp" keyword in iproute2 that accepts values in the range of
> > > > [0, 63] which both address families will support. IPv4 will support it
> > > > via the new DSCP field ('struct flowi4::dscp') and IPv6 will support it
> > > > using the existing flow label field ('struct flowi6::flowlabel').
> > > 
> > > I'm sorry, something isn't clear to me. Since masking the high order
> > > bits has been centralised at step 2, how will you match them?
> > > 
> > > If we continue to take fib4_rule_match() as an example; do you mean to
> > > extend struct fib4_rule to store the extra information, so that
> > > fib4_rule_match() knows how to test fl4->dscp? For example:
> > 
> > Yes. See these patches:
> > 
> > https://github.com/idosch/linux/commit/1a79fb59f66731cfc891e3fecb3b08cda6bb0170.patch
> > https://github.com/idosch/linux/commit/a4990aab8ee4866b9f853777a50de09537255d67.patch
> > https://github.com/idosch/linux/commit/7328d60b7cfe2b07b2d565c9af36f650e96552a5.patch
> > https://github.com/idosch/linux/commit/73a739735d27bef613813f0ac0a9280e6427264d.patch
> > 
> > Specifically these hunks from the second patch:
> > 
> > @@ -37,6 +37,7 @@ struct fib4_rule {
> >  	u8			dst_len;
> >  	u8			src_len;
> >  	dscp_t			dscp;
> > +	u8			is_dscp_sel:1;	/* DSCP or TOS selector */
> >  	__be32			src;
> >  	__be32			srcmask;
> >  	__be32			dst;
> > @@ -186,7 +187,14 @@ INDIRECT_CALLABLE_SCOPE int fib4_rule_match(struct fib_rule *rule,
> >  	    ((daddr ^ r->dst) & r->dstmask))
> >  		return 0;
> >  
> > -	if (r->dscp && !fib_dscp_match(r->dscp, fl4))
> > +	/* When DSCP selector is used we need to match on the entire DSCP field
> > +	 * in the flow information structure. When TOS selector is used we need
> > +	 * to mask the upper three DSCP bits prior to matching to maintain
> > +	 * legacy behavior.
> > +	 */
> > +	if (r->is_dscp_sel && r->dscp != inet_dsfield_to_dscp(fl4->flowi4_tos))
> > +		return 0;
> > +	else if (!r->is_dscp_sel && r->dscp && !fib_dscp_match(r->dscp, fl4))
> >  		return 0;
> >  
> >  	if (rule->ip_proto && (rule->ip_proto != fl4->flowi4_proto))
> > 
> > Note that it is just an RFC. I first need to remove the masking of the
> > high order DSCP bits before I can send it.
> 
> I find "is_dscp_sel" not very informative as a field name. Maybe
> "dscp_nomask" or "dscp_full" would be better? They're more intuitive
> to me, but I have no problem if you prefer to keep "is_dscp_sel" of
> course.

OK, changed to "dscp_full" as you suggested.

> 
> > >     /* Assuming FRA_DSCP sets ->dscp_mask to 0xff while the default
> > >      * would be 0x1c to keep the old behaviour.
> > >      */
> > >     if (r->dscp && r->dscp != (fl4->dscp & r->dscp_mask))
> > 
> > It's a bit more involved. Even if the old TOS selector is used, we don't
> > always want to mask using 0x1c. If nft_fib4_eval() filled 0xfc in
> > 'flowi4_tos', then by masking using 0x1c it will now match a FIB rule
> > that was configured with 'tos 0x1c' whereas previously it didn't. The
> > new 'FLOWI_FLAG_MATCH_FULL_DSCP' takes care of that, but it only applies
> > to rules configured with the TOS selector. The new DSCP selector will
> > always match against the entire DSCP field.
> 
> Back to nft_fib4_eval(), making it behave like the rest of the kernel
> would also mean it'd behave like the existing ipt_rpfilter module. So
> people moving from iptables to nftables would keep the same behaviour.
> Unless you strongly feel about keeping the FLOWI_FLAG_MATCH_FULL_DSCP
> flag, I think we should ask Pablo and Florian if they're okay for
> making nft_fib4_eval() behave like the rest of the stack.

Yes, that's a good idea. I will post a RFC patchset that converts
nft_fib4_eval() and "NETLINK_FIB_LOOKUP" to mask the upper DSCP bits
along with the RT_TOS() centralization patch. Let's see what Pablo and
Florian say.

> 
> > Are you OK with the approach that I outlined above? Basically:
> > 
> > 1. Submitting the patch that centralizes TOS matching
> > 2. Removing the masking of the high order DSCP bits
> > 3. Adding new DSCP selector for FIB rules
> 
> Yes, looks like a good plan!

Great, thanks!

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-07-25  9:58 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-26 11:58 Matching on DSCP with IPv4 FIB rules Ido Schimmel
2024-06-26 23:51 ` Guillaume Nault
2024-06-27 19:54   ` Ido Schimmel
2024-07-04 18:19     ` Guillaume Nault
2024-07-18 13:14       ` Ido Schimmel
2024-07-19 17:57         ` Guillaume Nault
2024-07-25  9:58           ` Ido Schimmel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).