From: Vlad Buslov <vladbu@nvidia.com>
To: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: <netfilter-devel@vger.kernel.org>, <kadlec@netfilter.org>,
<fw@strlen.de>, <ozsh@nvidia.com>, <paulb@nvidia.com>
Subject: Re: [PATCH net-next 2/8] netfilter: introduce total count of hw offloaded flow table entries
Date: Tue, 15 Mar 2022 18:18:17 +0200 [thread overview]
Message-ID: <87tubztca0.fsf@nvidia.com> (raw)
In-Reply-To: <YjBpFDc+rOqhSPrW@salvia>
On Tue 15 Mar 2022 at 11:23, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Sat, Mar 12, 2022 at 08:56:49PM +0200, Vlad Buslov wrote:
> [...]
>> Hi Pablo,
>>
>> Thanks for reviewing my code and sorry for the late reply.
>>
>> We explored the approach you propose and found several issues with it.
>> First, the nice benefit of implementation in this patch is that having
>> counter increment in flow_offload_add() (and test in following patch)
>> completely avoids spamming the workqueue when the limit is reached which
>> is an important concern for slower embedded DPU cores. Second, it is not
>> possible to change it when IPS_HW_OFFLOAD_BIT is set at the very end of
>> flow_offload_work_add() function because in following patch we need to
>> verify that counter is in user-specified limit before attempting
>> offload. Third, changing the counter in wq tasks makes it hard to
>> balance correctly. Consider following cases:
>>
>> - flow_offload_work_add() can be called arbitrary amount of times per
>> flow due to refresh logic. However, any such flow is still deleted
>> only once.
>>
>> - flow_offload_work_del() can be called for flows that were never
>> actually offloaded (it is called for flows that have NF_FLOW_HW bit
>> that is unconditionally set before attempting to schedule offload task
>> on wq).
>>
>> Counter balancing issues could maybe be solved by carefully
>> conditionally changing it based on current value IPS_HW_OFFLOAD_BIT, but
>> spamming the workqueue can't be prevented for such design.
>>
>> > That also moves the atomic would be away from the packet path.
>>
>> I understand your concern. However, note that this atomic is normally
>> changed once for adding offloaded flow and once for removing it. The
>> code path is only executed per-packet in error cases where flow has
>> failed to offload and refresh is called repeatedly for same flow.
>
> Thanks for explaining.
>
> There used to be in the code a list of pending flows to be offloaded.
>
> I think it would be possible to restore such list and make it per-cpu,
> the idea is to add a new field to the flow_offload structure to
> annotate the cpu that needs to deal with this flow (same cpu deals
> with add/del/stats). The cpu field is set at flow creation time.
What would be the algorithm to assign the cpu field? Some simple
algorithm like round-robin will not take into account CPU load of
unrelated tasks (for example, OvS which is also CPU-intensive) and
offload tasks on contested cores will get less CPU time, which will
result unbalanced occupancy where some cores are idle and other have
long list of offload tasks. Any advanced algorithm will be hard to
implement since we don't have access to scheduler internal data. Also,
in my experience not all offload tasks take same amount of CPU time (for
example, offloading complex flows with tunnels takes longer than simple
flows and deletes take less time than adds), so having access to just
the current lists sizes doesn't directly translate to list processing
time.
>
> Once there is one item, add work to the workqueue to that cpu.
> Meanwhile the workqueue does not have a chance, we keep adding more
> items to the workqueue.
>
> The workqueue handler then zaps the list of pending flows to be
> offloaded, it might have more than one single item in the list.
I understand the proposal but I'm missing the benefit it provides over
existing workqueue approach. Both standard kernel linked list and
workqueue are unbounded and don't count their elements which means we
would still have to implement approach similar to what is proposed in
existing series - add atomic to manually count the size and reject new
elements over some maximum (including, in case of unified list, flow
deletions that we don't really want to skip).
>
> So instead of three workqueues, we only have one. Scalability is
> achieved by fanning out flows over CPUs.
But existing nf_ft_offload_* wokrqueues are already parallel and
unbound, so they already fan out tasks over CPU cores and probably also
do it better than any custom algorithm that we can come up with since
threads are scheduled by the system scheduler that takes into account
current CPU load.
next prev parent reply other threads:[~2022-03-15 16:24 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-02-22 15:09 [PATCH net-next 0/8] Conntrack offload debuggability improvements Vlad Buslov
2022-02-22 15:09 ` [PATCH net-next 1/8] net/sched: act_ct: set 'net' pointer when creating new nf_flow_table Vlad Buslov
2022-03-07 21:09 ` Pablo Neira Ayuso
2022-02-22 15:09 ` [PATCH net-next 2/8] netfilter: introduce total count of hw offloaded flow table entries Vlad Buslov
2022-03-07 21:47 ` Pablo Neira Ayuso
2022-03-12 18:56 ` Vlad Buslov
2022-03-15 10:23 ` Pablo Neira Ayuso
2022-03-15 16:18 ` Vlad Buslov [this message]
2022-03-07 21:56 ` Pablo Neira Ayuso
2022-03-12 19:51 ` Vlad Buslov
2022-03-15 10:41 ` Pablo Neira Ayuso
2022-03-15 16:34 ` Vlad Buslov
2022-02-22 15:09 ` [PATCH net-next 3/8] netfilter: introduce max " Vlad Buslov
2022-03-07 22:13 ` Pablo Neira Ayuso
2022-03-12 19:32 ` Vlad Buslov
2022-02-22 15:09 ` [PATCH net-next 4/8] netfilter: introduce total count of hw offload 'add' workqueue tasks Vlad Buslov
2022-03-07 22:46 ` Pablo Neira Ayuso
2022-02-22 15:10 ` [PATCH net-next 5/8] netfilter: introduce max " Vlad Buslov
2022-03-07 22:43 ` Pablo Neira Ayuso
2022-03-12 19:59 ` Vlad Buslov
2022-02-22 15:10 ` [PATCH net-next 6/8] netfilter: introduce total count of hw offload 'del' " Vlad Buslov
2022-02-22 15:10 ` [PATCH net-next 7/8] netfilter: introduce total count of hw offload 'stats' wq tasks Vlad Buslov
2022-02-22 15:10 ` [PATCH net-next 8/8] netfilter: flowtable: add hardware offload tracepoints Vlad Buslov
2022-03-07 22:49 ` Pablo Neira Ayuso
2022-03-12 20:05 ` Vlad Buslov
2022-03-15 10:29 ` Pablo Neira Ayuso
2022-03-15 16:36 ` Vlad Buslov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87tubztca0.fsf@nvidia.com \
--to=vladbu@nvidia.com \
--cc=fw@strlen.de \
--cc=kadlec@netfilter.org \
--cc=netfilter-devel@vger.kernel.org \
--cc=ozsh@nvidia.com \
--cc=pablo@netfilter.org \
--cc=paulb@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.