From: Aaron Conole <aconole@redhat.com>
To: Ilya Maximets <i.maximets@ovn.org>
Cc: netdev@vger.kernel.org, "David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>,
Paolo Abeni <pabeni@redhat.com>,
Simon Horman <horms@kernel.org>,
dev@openvswitch.org, linux-kernel@vger.kernel.org,
Eelco Chaudron <echaudro@redhat.com>
Subject: Re: [PATCH net-next] net: openvswitch: allow providing upcall pid for the 'execute' command
Date: Wed, 02 Jul 2025 08:46:26 -0400
Message-ID: <f7tcyaisqu5.fsf@redhat.com>
In-Reply-To: <b450a7f2-cab4-4b71-aec8-ef867403663a@ovn.org> (Ilya Maximets's message of "Wed, 2 Jul 2025 14:37:05 +0200")
Ilya Maximets <i.maximets@ovn.org> writes:
> On 7/2/25 2:14 PM, Aaron Conole wrote:
>> Ilya Maximets <i.maximets@ovn.org> writes:
>>
>>> When a packet enters the OVS datapath and there is no flow to handle it,
>>> the packet goes to userspace through a MISS upcall. With the per-CPU
>>> upcall dispatch mechanism, we use the current CPU id to select the
>>> Netlink PID to which this packet is sent. This allows us to send
>>> packets from the same traffic flow through the same handler.
>>>
>>> The handler will process the packet, install the required flow into the
>>> kernel and re-inject the original packet via OVS_PACKET_CMD_EXECUTE.
>>>
>>> While handling OVS_PACKET_CMD_EXECUTE, however, we may hit a
>>> recirculation action that will pass the (likely modified) packet
>>> through the flow lookup again. And if the flow is not found, the
>>> packet will be sent to userspace again through another MISS upcall.
>>>
>>> However, the handler thread in userspace is likely running on a
>>> different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled
>>> in the syscall context of that thread. So, when the time comes to
>>> send the packet through another upcall, the per-CPU dispatch will
>>> choose a different Netlink PID, and this packet will end up processed
>>> by a different handler thread on a different CPU.
>>
>> Just wondering, but why can't we choose the existing core handler when
>> running the packet_cmd_execute? For example, when looking into the
>> per-CPU table we know what the current core is, so can we just queue to
>> that one? I actually thought that's what the PER_CPU dispatch mode was
>> supposed to do.
>
> This is exactly how it works today, and that is the problem: our
> userspace handler is running on a different CPU, so the 'current CPU'
> during the packet_cmd_execute is different from the one on which the
> kernel was processing the original upcall.
>
>> Or is it that we want to make sure we always keep the
>> association between the skbuff and its handler for re-injection?
>
> We want the same packet to be enqueued to the same upcall socket after
> each recirculation, so it gets handled by the same userspace thread.
>
>>
>>> The process continues for as long as there are new recirculations: each
>>> time, the packet goes to a different handler thread before it is sent
>>> out of the OVS datapath to the destination port. In real setups the
>>> number of recirculations can go up to 4 or 5, sometimes more.
>>
>> Is it because the userspace handler threads are being rescheduled across
>> CPUs?
>
> Yes. Userspace handlers are not pinned to a specific core in most cases,
> so they will be running on different CPUs and will float around.
>
>> Do we still see this behavior if we pinned each handler thread to
>> a specific CPU rather than letting the scheduler make the decision?
>
> If you pin each userspace thread to the core that is specified in the
> PCPU_UPCALL_PIDS for the socket it is listening on, then the
> problem will go away, as the packet_cmd_execute syscall will be executed
> on the same core where the kernel received the original packet.
> However, that's not possible in many cases (reserved CPUs, different
> CPU affinity for IRQs and userspace applications, etc.) and not desired,
> as it may impact the performance of the system, because the kernel and
> userspace would compete for the same core.
Just wanted to make sure I was understanding the problem. Okay, this
makes sense.
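
For anyone else following along, my mental model of the per-CPU dispatch is
roughly the sketch below. It is simplified and the struct/function names are
approximations, not the literal datapath code:

    /* One upcall socket (Netlink PID) per CPU, registered by userspace. */
    struct dp_nlsk_pids {
            u32 n_ids;
            u32 pids[];
    };

    static u32 upcall_portid(const struct dp_nlsk_pids *ids)
    {
            /* The PID is picked based on whichever CPU is processing the
             * packet.  During OVS_PACKET_CMD_EXECUTE that is the CPU the
             * handler thread's syscall happens to run on, which is why a
             * recirculated packet can land on a different handler's
             * socket. */
            return ids->pids[smp_processor_id() % ids->n_ids];
    }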
>>
>>> There is always a chance of re-ordering packets while processing upcalls,
>>> because userspace will first install the flow and then re-inject the
>>> original packet. So, there is a race window in which the flow is already
>>> installed and a second packet can match it and be forwarded to the
>>> destination before the first packet is re-injected. But the fact that
>>> packets are going through multiple upcalls handled by different
>>> userspace threads makes the reordering noticeably more likely, because
>>> we not only have a race between the kernel and a userspace handler
>>> (which is hard to avoid), but also between multiple userspace handlers.
>>>
>>> For example, let's assume that 10 packets got enqueued through a MISS
>>> upcall for handler-1. It will start processing them, install the
>>> flow into the kernel and start re-injecting the packets, from where
>>> they will go through another MISS to handler-2. Handler-2 will install
>>> the flow into the kernel and start re-injecting the packets. While
>>> handler-1 continues to re-inject the last of the 10 packets, they will
>>> hit the flow installed by handler-2 and be forwarded without going to
>>> handler-2 at all, while handler-2 is still re-injecting the first of
>>> these 10 packets. Given multiple recirculations and misses, these 10
>>> packets may end up completely mixed up on the output from the datapath.
>>>
>>> Let's allow userspace to specify the Netlink PID to which the packets
>>> should be upcalled while processing OVS_PACKET_CMD_EXECUTE.
>>> This makes it possible to ensure that all the packets are processed
>>> by the same handler thread in userspace, even when they are
>>> upcalled multiple times in the process. Packets will remain in order
>>> since they will be enqueued to the same socket and re-injected in the
>>> same order. This doesn't eliminate re-ordering entirely, as stated
>>> above, since we still have a race between the kernel and the userspace
>>> thread, but it does eliminate the races between multiple userspace
>>> threads.
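
On the userspace side, this should amount to one extra attribute when
building the execute request. A minimal sketch using OVS's nl_msg helpers,
where OVS_PACKET_ATTR_UPCALL_PID and handler_sock_pid are illustrative names
(the attribute is not part of the current uAPI, and handler_sock_pid stands
for the PID of the socket on which the handler received the original upcall):

    struct ofpbuf *buf = ofpbuf_new(1024);

    nl_msg_put_genlmsghdr(buf, 0, ovs_packet_family, NLM_F_REQUEST,
                          OVS_PACKET_CMD_EXECUTE, OVS_PACKET_VERSION);
    /* ... OVS_PACKET_ATTR_PACKET, _KEY and _ACTIONS as before ... */

    /* New: ask the kernel to use this PID for any further MISS upcalls
     * triggered while executing this packet. */
    nl_msg_put_u32(buf, OVS_PACKET_ATTR_UPCALL_PID, handler_sock_pid);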
>>>
>>> Userspace knows the PID of the socket on which the original upcall is
>>> received, so there is no need to send it up from the kernel.
>>>
>>> The solution requires storing the value somewhere for the duration of
>>> the packet processing. There are two potential places for this: our skb
>>> extension or the per-CPU storage. It's not clear which is better,
>>> so this just follows the currently used scheme of storing this kind of
>>> data along with the skb.
>>
>> With this change we're almost full on the OVS sk_buff control block.
>> Might be good to mention it in the commit message if you're respinning.
>
> Are we full? The skb->cb size is 48 bytes and we're only using 24
> with this change, unless I'm missing something.
Hrrm... I guess I miscounted. Yes, I agree, with this change we're at 24
bytes in use on a 64-bit platform. Okay, no worries.
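
For reference, my count is based on a layout roughly like the one below; the
existing members are written from memory and 'upcall_pid' is just a
placeholder name for whatever field the patch adds, so treat it as a sketch
rather than the exact struct:

    struct ovs_skb_cb {
            struct vport    *input_vport;   /*  8 bytes on 64-bit. */
            u16             mru;
            u16             acts_origlen;
            u32             cutlen;
            u32             probability;
            u32             upcall_pid;     /* New: PID from CMD_EXECUTE. */
    };                                      /* ~24 bytes total on 64-bit. */

    #define OVS_CB(skb) ((struct ovs_skb_cb *)(skb)->cb)

    /* skb->cb is 48 bytes, so a compile-time check along these lines still
     * passes (the real one lives in the module init path): */
    BUILD_BUG_ON(sizeof(struct ovs_skb_cb) >
                 sizeof_field(struct sk_buff, cb));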
> Best regards, Ilya Maximets.