From: Aaron Conole <aconole@redhat.com>
To: Ilya Maximets <i.maximets@ovn.org>
Cc: netdev@vger.kernel.org, "David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>,
Paolo Abeni <pabeni@redhat.com>,
Simon Horman <horms@kernel.org>,
dev@openvswitch.org, linux-kernel@vger.kernel.org,
Eelco Chaudron <echaudro@redhat.com>
Subject: Re: [PATCH net-next] net: openvswitch: allow providing upcall pid for the 'execute' command
Date: Wed, 02 Jul 2025 08:46:26 -0400
Message-ID: <f7tcyaisqu5.fsf@redhat.com>
In-Reply-To: <b450a7f2-cab4-4b71-aec8-ef867403663a@ovn.org> (Ilya Maximets's message of "Wed, 2 Jul 2025 14:37:05 +0200")
Ilya Maximets <i.maximets@ovn.org> writes:
> On 7/2/25 2:14 PM, Aaron Conole wrote:
>> Ilya Maximets <i.maximets@ovn.org> writes:
>>
>>> When a packet enters the OVS datapath and there is no flow to handle it,
>>> the packet goes to userspace through a MISS upcall. With the per-CPU
>>> upcall dispatch mechanism, we use the current CPU id to select the
>>> Netlink PID to which this packet is sent. This allows us to send
>>> packets from the same traffic flow through the same handler.
>>>
>>> The handler will process the packet, install the required flow into the
>>> kernel and re-inject the original packet via OVS_PACKET_CMD_EXECUTE.
>>>
>>> While handling OVS_PACKET_CMD_EXECUTE, however, we may hit a
>>> recirculation action that will pass the (likely modified) packet
>>> through the flow lookup again. And if the flow is not found, the
>>> packet will be sent to userspace again through another MISS upcall.
>>>
>>> However, the handler thread in userspace is likely running on a
>>> different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled
>>> in the syscall context of that thread. So, when the time comes to
>>> send the packet through another upcall, the per-CPU dispatch will
>>> choose a different Netlink PID, and this packet will end up processed
>>> by a different handler thread on a different CPU.
>>
>> Just wondering, but why can't we choose the existing core handler when
>> running the packet_cmd_execute? For example, when looking into the
>> per-CPU table we know what the current core is, so can we just queue to
>> that one? I actually thought that's what the PER_CPU dispatch mode was
>> supposed to do.
>
> This is exactly how it works today, and that is the problem: our
> userspace handler is running on a different CPU, so the 'current CPU'
> during the packet_cmd_execute is different from the one on which the
> kernel was processing the original upcall.
>
>> Or is it that we want to make sure we always keep the
>> association between the skbuff and its handler for re-injection?
>
> We want the same packet to be enqueued to the same upcall socket after
> each recirculation, so it gets handled by the same userspace thread.
>
>>
>>> The process continues for as long as there are new recirculations: each
>>> time, the packet goes to a different handler thread before it is sent
>>> out of the OVS datapath to the destination port. In real setups the
>>> number of recirculations can go up to 4 or 5, sometimes more.
>>
>> Is it because the userspace handler threads are being rescheduled across
>> CPUs?
>
> Yes. Userspace handlers are not pinned to a specific core in most cases,
> so they will be running on different CPUs and will float around.
>
>> Do we still see this behavior if we pinned each handler thread to
>> a specific CPU rather than letting the scheduler make the decision?
>
> If you pin each userspace thread to the core that is specified in the
> PCPU_UPCALL_PIDS for the socket it is listening on, then the
> problem will go away, as the packet_cmd_execute syscall will be executed
> on the same core where the kernel received the original packet.
> However, that's not possible in many cases (reserved CPUs, different
> CPU affinity for IRQs and userspace applications, etc.) and not desired,
> as it may impact the performance of the system, because the kernel and
> userspace would compete for the same core.
Just wanted to make sure I was understanding the problem. Okay, this
makes sense.
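
For anyone else following along, my mental model of the per-CPU dispatch is
roughly the sketch below. It is simplified and the struct/function names are
approximations, not the literal datapath code:

    /* One upcall socket (Netlink PID) per CPU, registered by userspace. */
    struct dp_nlsk_pids {
            u32 n_ids;
            u32 pids[];
    };

    static u32 upcall_portid(const struct dp_nlsk_pids *ids)
    {
            /* The PID is picked based on whichever CPU is processing the
             * packet.  During OVS_PACKET_CMD_EXECUTE that is the CPU the
             * handler thread's syscall happens to run on, which is why a
             * recirculated packet can land on a different handler's
             * socket. */
            return ids->pids[smp_processor_id() % ids->n_ids];
    }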
>>
>>> There is always a chance of re-ordering packets while processing upcalls,
>>> because userspace will first install the flow and then re-inject the
>>> original packet. So, there is a race window in which the flow is already
>>> installed and a second packet can match it and be forwarded to the
>>> destination before the first packet is re-injected. But the fact that
>>> packets are going through multiple upcalls handled by different
>>> userspace threads makes the reordering noticeably more likely, because
>>> we not only have a race between the kernel and a userspace handler
>>> (which is hard to avoid), but also between multiple userspace handlers.
>>>
>>> For example, let's assume that 10 packets got enqueued through a MISS
>>> upcall for handler-1. It will start processing them, install the
>>> flow into the kernel and start re-injecting the packets, from where
>>> they will go through another MISS to handler-2. Handler-2 will install
>>> the flow into the kernel and start re-injecting the packets. While
>>> handler-1 continues to re-inject the last of the 10 packets, they will
>>> hit the flow installed by handler-2 and be forwarded without going to
>>> handler-2 at all, while handler-2 is still re-injecting the first of
>>> these 10 packets. Given multiple recirculations and misses, these 10
>>> packets may end up completely mixed up on the output from the datapath.
>>>
>>> Let's allow userspace to specify the Netlink PID to which the packets
>>> should be upcalled while processing OVS_PACKET_CMD_EXECUTE.
>>> This makes it possible to ensure that all the packets are processed
>>> by the same handler thread in userspace, even when they are
>>> upcalled multiple times in the process. Packets will remain in order
>>> since they will be enqueued to the same socket and re-injected in the
>>> same order. This doesn't eliminate re-ordering entirely, as stated
>>> above, since we still have a race between the kernel and the userspace
>>> thread, but it does eliminate the races between multiple userspace
>>> threads.
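
On the userspace side, this should amount to one extra attribute when
building the execute request. A minimal sketch using OVS's nl_msg helpers,
where OVS_PACKET_ATTR_UPCALL_PID and handler_sock_pid are illustrative names
(the attribute is not part of the current uAPI, and handler_sock_pid stands
for the PID of the socket on which the handler received the original upcall):

    struct ofpbuf *buf = ofpbuf_new(1024);

    nl_msg_put_genlmsghdr(buf, 0, ovs_packet_family, NLM_F_REQUEST,
                          OVS_PACKET_CMD_EXECUTE, OVS_PACKET_VERSION);
    /* ... OVS_PACKET_ATTR_PACKET, _KEY and _ACTIONS as before ... */

    /* New: ask the kernel to use this PID for any further MISS upcalls
     * triggered while executing this packet. */
    nl_msg_put_u32(buf, OVS_PACKET_ATTR_UPCALL_PID, handler_sock_pid);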
>>>
>>> Userspace knows the PID of the socket on which the original upcall is
>>> received, so there is no need to send it up from the kernel.
>>>
>>> The solution requires storing the value somewhere for the duration of
>>> the packet processing. There are two potential places for this: our skb
>>> extension or the per-CPU storage. It's not clear which is better,
>>> so this just follows the currently used scheme of storing this kind of
>>> data along with the skb.
>>
>> With this change we're almost full on the OVS sk_buff control block.
>> Might be good to mention it in the commit message if you're respinning.
>
> Are we full? The skb->cb size is 48 bytes and we're only using 24
> with this change, unless I'm missing something.
Hrrm... I guess I miscounted. Yes, I agree, with this change we're at 24
bytes in use on a 64-bit platform. Okay, no worries.
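
For reference, my count is based on a layout roughly like the one below; the
existing members are written from memory and 'upcall_pid' is just a
placeholder name for whatever field the patch adds, so treat it as a sketch
rather than the exact struct:

    struct ovs_skb_cb {
            struct vport    *input_vport;   /*  8 bytes on 64-bit. */
            u16             mru;
            u16             acts_origlen;
            u32             cutlen;
            u32             probability;
            u32             upcall_pid;     /* New: PID from CMD_EXECUTE. */
    };                                      /* ~24 bytes total on 64-bit. */

    #define OVS_CB(skb) ((struct ovs_skb_cb *)(skb)->cb)

    /* skb->cb is 48 bytes, so a compile-time check along these lines still
     * passes (the real one lives in the module init path): */
    BUILD_BUG_ON(sizeof(struct ovs_skb_cb) >
                 sizeof_field(struct sk_buff, cb));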
> Best regards, Ilya Maximets.