All of lore.kernel.org
 help / color / mirror / Atom feed
From: Florian Westphal <fw@strlen.de>
To: Boris Sukholitko <boris.sukholitko@broadcom.com>
Cc: Florian Westphal <fw@strlen.de>,
	Pablo Neira Ayuso <pablo@netfilter.org>,
	netfilter-devel@vger.kernel.org,
	Ilya Lifshits <ilya.lifshits@broadcom.com>
Subject: Re: [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload
Date: Tue, 9 May 2023 11:48:27 +0200	[thread overview]
Message-ID: <20230509094827.GA14758@breakpoint.cc> (raw)
In-Reply-To: <ZFj7PomKpCnLsDz2@noodle>

Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> On Sun, May 07, 2023 at 07:37:58PM +0200, Florian Westphal wrote:
> > Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> > > On Wed, May 3, 2023 at 9:46 PM Florian Westphal <fw@strlen.de> wrote:
> > > >
> > > > Boris Sukholitko <boris.sukholitko@broadcom.com> wrote:
> > > [... snip to non working offload ...]
> > > 
> > > > > table inet filter {
> > > > >         flowtable f1 {
> > > > >                 hook ingress priority filter
> > > > >                 devices = { veth0, veth1 }
> > > > >         }
> > > > >
> > > > >         chain forward {
> > > > >                 type filter hook forward priority filter; policy accept;
> > > > >                 ip dscp set cs3 offload
> > > > >                 ip protocol { tcp, udp, gre } flow add @f1
> > > > >                 ct state established,related accept
> > > > >         }
> > > > > }
> > > 
> > > [...]
> > > 
> > > >
> > > > I wish you would have reported this before you started to work on
> > > > this, because this is not a bug, this is expected behaviour.
> > > >
> > > > Once you offload, the ruleset is bypassed, this is by design.
> > > 
> > > From the rules UI perspective it seems possible to accelerate
> > > forward chain handling with the statements such as dscp modification there.
> > > 
> > > Isn't it better to modify the packets according to the bypassed
> > > ruleset thus making the behaviour more consistent?
> > 
> > The behaviour is consistent.  Once flow is offloaded, ruleset is
> > bypassed.  Its easy to not offload those flows that need the ruleset.
> > 
> > > > Lets not make the software offload more complex as it already is.
> > > 
> > > Could you please tell which parts of software offload are too complex?
> > > It's not too bad from what I've seen :)
> > > 
> > > This patch series adds 56 lines of code in the new nf_conntrack.ext.c
> > > file. 20 of them (nf_flow_offload_apply_payload) are used in
> > > the software fast path. Is it too high of a price?
> > 
> > 56 lines of code *now*.
> > 
> > Next someone wants to call into sets/maps for named counters that
> > they need.  Then someone wants limit or quota to work.  Then they want fib
> > for RPF.  Then xfrm policy matching to augment acccounting.
> > This will go on until we get to the point where removing "fast" path
> > turns into a performance optimization.
> 
> OK. May I assume that you are concerned with the eventual performance impact
> on the software fast path (i.e. nf_flow_offload_ip_hook)?

Yes, but I also dislike the concept, see below.

> Obviously the performance of the fast path is very important to our
> customers. Otherwise they would not be requiring dscp fast path
> modification. :)
> 
> One of the things we've thought about regarding the fast path
> performance is rewriting nf_flow_offload_ip_hook to work with
> nf_flowtable->flow_block instead of flow_offload_tuple.

Sorry, I should have expanded on my reservations towards this concept.

Let me explain.
Lets consider your original example first:

----------
table inet filter {
        flowtable f1 {
                hook ingress priority filter
                devices = { veth0, veth1 }
        }

        chain forward {
                type filter hook forward priority filter; policy accept;
                ip dscp set cs3
                ip protocol { tcp, udp, gre } flow add
                ct state established,related accept
        }
}
----------

This has a clearly defined meaning in all possible combinations.

Software:
1. It defines a bypass for veth0 <-> veth1
2. the way this specific ruleset is defined, all of tcp/udp/gre will
   attempt to offload
3. once offload has happened, entire inet:forward may be bypassed
4. User ruleset needs to cope with packets being moved back to
   software: fragmented packets, tcp fin/rst, hw timeouts and so on.
5. User can control via 'offload' keyword if HW offload should be
   attempted or not

Hardware:
even 'nf_flow_offload_ip_hook' may be bypassed.  But nothing changes
compared to 'no hw offload' case from a conceptual point of view.

Lets now consider existing netdev:ingress/egress in this same picture:
(Example from Pablo):
------
table inet filter {
        flowtable f1 {
                hook ingress priority filter
                devices = { veth0, veth1 }
        }

        chain ingress {
                type filter hook ingress device veth0 priority filter; policy accept; flags offload;
                ip dscp set cs3
        }

        chain forward {
                type filter hook forward priority filter; policy accept;
                meta l4proto { tcp, udp, gre } flow add @f1
                ct state established,related accept
        }
}

Again, this has defined meaning in all combinations:
With HW offload: veth0 will be told to mangle dscp.
This happens in all cases and for every matching packet,
regardless if a flowtable exists or not.

Same would happen for 'egress', just that it would happen at xmit time
rather at receive time.  Again, its not relevant if there is active
flowtable or not, or if data path is offloaded to hardware, to software,
handled by fallback or entirely without flowtables being present.

Its also clear that this is tied to 'veth0', other devices will
not be involved and not do such mangling.

Now lets look at your proposal:
----------------
table inet filter {
        flowtable f1 {
                hook ingress priority filter
                devices = { veth0, veth1 }
        }

        chain forward {
                type filter hook forward priority filter; policy accept;
                ip dscp set cs3 offload
                ip protocol { tcp, udp, gre } flow add
                ct state established,related accept
        }
}
----------------

This means that software flowtable offload
shall do a 'ip dscp set cs3'.

What if the flowtable is offloaded to hardware
entirely, without software fallback?

What if the devices listed in the flowtable definition can handle
flow offload, but no payload mangling?

Does the 'offload' mean that the rule is only active for
software path?  Only for hardware path? both?

How can I tell if its offloaded to hardware for one device
but not for the other?  Or will that be disallowed?

What if someone adds another rule after or before 'ip dscp',
but without the 'offload' keyword?  Now ordering becomes an
issue.

Users now need to consider different control flows:

  jump exceptions
  ip dscp set cs3 offload

  chain exceptions {
    ip daddr 1.2.3.4 accept
  }

This won't work as expected, because offloaded flows will not
pass through 'forward' chain but somehow a few selected rules
will be run anyway.

TL;DR: I think that for HW offload its paramount to make it crystal
clear as to which device is responsible to handle such rules.

The existing netdev:ingress/egress hooks provide the needed
chain/rules/expression:device mapping.  User can easily
tell if HW is responsible or SW by looking for 'offload' flag
presence.

I don't think mixing software and hardware offload contexts as proposed
is a good idea, both from user frontend syntax, clarity and error reporting
(e.g. if hw rejects offload request) point of view.

I also believe that allowing payload mangling from *software* offload
path sets a precedence for essentially allowing all other expressions
again which completely negates the flowtable concept.

I still think that dscp mangling should be done via netdev:ingress/egress
hooks, I don't see why this has to be bolted into flowtable sw offload.


  parent reply	other threads:[~2023-05-09  9:49 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-03 12:55 [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 01/19] selftest: netfilter: use /proc for pid checking Boris Sukholitko
2023-05-03 18:47   ` Florian Westphal
2023-05-04  8:53     ` Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 02/19] selftest: netfilter: no need for ps -x option Boris Sukholitko
2023-05-03 18:53   ` Florian Westphal
2023-05-03 12:55 ` [PATCH nf-next 03/19] selftest: netfilter: wait for specific nc pids Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 04/19] selftest: netfilter: monitor result file sizes Boris Sukholitko
2023-05-03 18:54   ` Florian Westphal
2023-05-03 12:55 ` [PATCH nf-next 05/19] netfilter: nft_payload: refactor mangle operation Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 06/19] netfilter: nft_payload: publish nft_payload_set Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 07/19] netfilter: nft_payload: export mangle Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 08/19] netfilter: nft_payload: use flag for checksum need Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 09/19] netfilter: nft_payload: add offload flag define Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 10/19] netfilter: nft_payload: allow offload in the netlink Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 11/19] netfilter: conntrack: nft extension Kconfig Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 12/19] netfilter: nft: empty nft conntrack extension Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 13/19] netfilter: conntrack: register nft extension Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 14/19] netfilter: nft: add payload context into extension Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 15/19] netfilter: nft: add payload application Boris Sukholitko
2023-05-03 23:32   ` kernel test robot
2023-05-04  0:44   ` kernel test robot
2023-05-03 12:55 ` [PATCH nf-next 16/19] netfilter: nftables: fast path payload mangle Boris Sukholitko
2023-05-03 15:41   ` kernel test robot
2023-05-03 12:55 ` [PATCH nf-next 17/19] netfilter: nftables: payload save mechanism Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 18/19] netfilter: nft_payload: save payload if needed Boris Sukholitko
2023-05-03 12:55 ` [PATCH nf-next 19/19] selftests: netfilter: dscp offload test Boris Sukholitko
2023-05-03 18:46 ` [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload Florian Westphal
2023-05-07 15:22   ` Boris Sukholitko
2023-05-07 17:37     ` Florian Westphal
2023-05-08 13:38       ` Boris Sukholitko
2023-05-08 20:07         ` Pablo Neira Ayuso
2023-05-09 14:56           ` Boris Sukholitko
2023-05-09  9:48         ` Florian Westphal [this message]
2023-05-10  7:49           ` Boris Sukholitko
2023-05-10 12:55             ` Florian Westphal
2023-05-11 15:59               ` Boris Sukholitko
2023-05-11 16:36                 ` Florian Westphal
2023-05-03 20:30 ` Pablo Neira Ayuso
2023-05-03 20:41 ` Pablo Neira Ayuso
2023-05-04  8:50   ` Boris Sukholitko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230509094827.GA14758@breakpoint.cc \
    --to=fw@strlen.de \
    --cc=boris.sukholitko@broadcom.com \
    --cc=ilya.lifshits@broadcom.com \
    --cc=netfilter-devel@vger.kernel.org \
    --cc=pablo@netfilter.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.