From: Paolo Abeni <pabeni@redhat.com>
To: Adrian Moreno <amorenoz@redhat.com>, netdev@vger.kernel.org
Cc: Aaron Conole <aconole@redhat.com>,
Eelco Chaudron <echaudro@redhat.com>,
Ilya Maximets <i.maximets@ovn.org>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Simon Horman <horms@kernel.org>,
"open list:OPENVSWITCH" <dev@openvswitch.org>,
open list <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH net-next v2] net: openvswitch: decouple flow_table from ovs_mutex
Date: Mon, 13 Apr 2026 10:39:37 +0200
Message-ID: <b4534229-994a-42ac-a200-a9cb1b333e60@redhat.com>
In-Reply-To: <20260407120418.356718-1-amorenoz@redhat.com>

On 4/7/26 2:04 PM, Adrian Moreno wrote:
> Currently the entire ovs module is write-protected using the global
> ovs_mutex. While this simple approach works fine for control-plane
> operations (such as vport configurations), requiring the global mutex
> for flow modifications can be problematic.
>
> During periods of high control-plane operations, e.g: netdevs (vports)
> coming and going, RTNL can suffer contention. This contention is easily
> transferred to the ovs_mutex as RTNL nests inside ovs_mutex. Flow
> modifications, however, are done as part of packet processing and having
> them wait for RTNL pressure to go away can lead to packet drops.
>
> This patch decouples flow_table modifications from ovs_mutex by means of
> the following:
>
> 1 - Make flow_table an rcu-protected pointer inside the datapath.
> This allows both objects to be protected independently while reducing the
> amount of changes required in "flow_table.c".
>
> 2 - Create a new mutex inside the flow_table that protects it from
> concurrent modifications.
> Putting the mutex inside flow_table makes it easier to consume for
> functions inside flow_table.c that do not currently take pointers to the
> datapath.
> Some function signatures need to be changed to accept flow_table so that
> lockdep checks can be performed.
>
> 3 - Create a reference count to temporarily extend rcu protection from
> the datapath to the flow_table.
> In order to use the flow_table without locking ovs_mutex, the flow_table
> pointer must be first dereferenced within an rcu-protected region.
> Next, the table->mutex needs to be locked to protect it from
> concurrent writes but mutexes must not be locked inside an rcu-protected
> region, so the rcu-protected region must be left at which point the
> datapath can be concurrently freed.
> To extend the protection beyond the rcu region, a reference count is used.
> One reference is held by the datapath, the other is temporarily
> increased during flow modifications. For example:
>
> Datapath deletion:
>
> ovs_lock();
> table = rcu_dereference_protected(dp->table, ...);
> rcu_assign_pointer(dp->table, NULL);
> ovs_flow_tbl_put(table);
> ovs_unlock();
>
> Flow modification:
>
> rcu_read_lock();
> dp = get_dp(...);
> table = rcu_dereference(dp->table);
> ovs_flow_tbl_get(table);
> rcu_read_unlock();
>
> mutex_lock(&table->lock);
> /* Perform modifications on the flow_table */
> mutex_unlock(&table->lock);
> ovs_flow_tbl_put(table);
>
> Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
> ---
> v2: Fix argument in ovs_flow_tbl_put (sparse)
> Remove rcu checks in ovs_dp_masks_rebalance
> ---
> net/openvswitch/datapath.c | 285 ++++++++++++++++++++++++-----------
> net/openvswitch/datapath.h | 2 +-
> net/openvswitch/flow.c | 13 +-
> net/openvswitch/flow.h | 9 +-
> net/openvswitch/flow_table.c | 180 ++++++++++++++--------
> net/openvswitch/flow_table.h | 51 ++++++-
> 6 files changed, 380 insertions(+), 160 deletions(-)
This is too big for a single patch. The changelog above already suggests
a way of splitting the change. At the very least, the RCU-ification of
the flow_table pointer should be straightforward to carry in a separate
patch, which in turn should be easy to review.
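
For reference while splitting, the get/put scheme described in point 3 of
the changelog can be modelled in user space roughly as below. This is only
a sketch under stated assumptions: struct tbl, tbl_alloc(), tbl_get() and
tbl_put() are hypothetical stand-ins, C11 atomics replace the kernel
refcount API, and the "get" fails once the count has reached zero (in the
kernel that would be refcount_inc_not_zero(); whether the patch does this
is not visible in the hunks quoted here).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical user-space stand-in for struct flow_table. */
struct tbl {
	atomic_int refcnt;	/* starts at 1: the reference owned by the datapath */
	int *freed_flag;	/* demo only: set when the last reference is dropped */
};

static struct tbl *tbl_alloc(int *freed_flag)
{
	struct tbl *t = malloc(sizeof(*t));

	if (!t)
		return NULL;
	atomic_init(&t->refcnt, 1);
	t->freed_flag = freed_flag;
	return t;
}

/* Take a temporary reference; fails for NULL or once the count hit zero. */
static bool tbl_get(struct tbl *t)
{
	int old;

	if (!t)
		return false;
	old = atomic_load(&t->refcnt);
	while (old != 0) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&t->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the last put releases the table. */
static void tbl_put(struct tbl *t)
{
	if (atomic_fetch_sub(&t->refcnt, 1) == 1) {
		*t->freed_flag = 1;
		free(t);
	}
}
```

In this model the flow-modification path would call tbl_get() inside the
RCU read-side section, bail out if it returns false, and call tbl_put()
after releasing the table mutex; datapath deletion only drops the
reference it owns.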
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index e209099218b4..9c234993520c 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -88,13 +88,17 @@ static void ovs_notify(struct genl_family *family,
> * DOC: Locking:
> *
> * All writes e.g. Writes to device state (add/remove datapath, port, set
> - * operations on vports, etc.), Writes to other state (flow table
> - * modifications, set miscellaneous datapath parameters, etc.) are protected
> - * by ovs_lock.
> + * operations on vports, etc.) and writes to other datapath parameters
> + * are protected by ovs_lock.
> + *
> + * Writes to the flow table are NOT protected by ovs_lock. Instead, a per-table
> + * mutex and reference count are used (see comment above "struct flow_table"
> + * definition). On some few occasions, the per-flow table mutex is nested
> + * inside ovs_mutex.
> *
> * Reads are protected by RCU.
> *
> - * There are a few special cases (mostly stats) that have their own
> + * There are a few other special cases (mostly stats) that have their own
> * synchronization but they nest under all of above and don't interact with
> * each other.
> *
> @@ -166,7 +170,6 @@ static void destroy_dp_rcu(struct rcu_head *rcu)
> {
> struct datapath *dp = container_of(rcu, struct datapath, rcu);
>
> - ovs_flow_tbl_destroy(&dp->table);
> free_percpu(dp->stats_percpu);
> kfree(dp->ports);
> ovs_meters_exit(dp);
> @@ -247,6 +250,7 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
> struct ovs_pcpu_storage *ovs_pcpu = this_cpu_ptr(ovs_pcpu_storage);
> const struct vport *p = OVS_CB(skb)->input_vport;
> struct datapath *dp = p->dp;
> + struct flow_table *table;
> struct sw_flow *flow;
> struct sw_flow_actions *sf_acts;
> struct dp_stats_percpu *stats;
> @@ -257,9 +261,16 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
> int error;
>
> stats = this_cpu_ptr(dp->stats_percpu);
> + table = rcu_dereference(dp->table);
> + if (!table) {
> + net_dbg_ratelimited("ovs: no flow table on datapath %s\n",
> + ovs_dp_name(dp));
> + kfree_skb(skb);
> + return;
> + }
>
> /* Look up flow. */
> - flow = ovs_flow_tbl_lookup_stats(&dp->table, key, skb_get_hash(skb),
> + flow = ovs_flow_tbl_lookup_stats(table, key, skb_get_hash(skb),
> &n_mask_hit, &n_cache_hit);
> if (unlikely(!flow)) {
> struct dp_upcall_info upcall;
> @@ -752,12 +763,16 @@ static struct genl_family dp_packet_genl_family __ro_after_init = {
> static void get_dp_stats(const struct datapath *dp, struct ovs_dp_stats *stats,
> struct ovs_dp_megaflow_stats *mega_stats)
> {
> + struct flow_table *table = ovsl_dereference(dp->table);
Should this be rcu_dereference_ovs_tbl()?
> int i;
>
> memset(mega_stats, 0, sizeof(*mega_stats));
>
> - stats->n_flows = ovs_flow_tbl_count(&dp->table);
> - mega_stats->n_masks = ovs_flow_tbl_num_masks(&dp->table);
> + if (table) {
> + stats->n_flows = ovs_flow_tbl_count(table);
As noted by Aaron, READ_ONCE() is now needed when reading table->count,
and WRITE_ONCE() when writing it.
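
Something along these lines, shown as a user-space sketch only: the
READ_ONCE()/WRITE_ONCE() definitions below are simplified approximations
of the kernel macros (they rely on the GCC/Clang __typeof__ extension),
and struct flow_table_sketch, tbl_count_inc() and tbl_count_read() are
hypothetical names, not the helpers from the patch.

```c
#include <assert.h>

/* Simplified user-space approximations of the kernel macros. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical reduced table: only the counter matters here. */
struct flow_table_sketch {
	unsigned int count;
};

/* Writer side: the caller is assumed to hold the per-table mutex, so the
 * plain read of the old value cannot race with another writer.
 */
static void tbl_count_inc(struct flow_table_sketch *table)
{
	WRITE_ONCE(table->count, table->count + 1);
}

/* Reader side: may run without the table mutex, e.g. from a stats dump. */
static unsigned int tbl_count_read(const struct flow_table_sketch *table)
{
	return READ_ONCE(table->count);
}
```

The annotations do not add ordering; they only keep the compiler from
tearing or re-reading the counter now that readers and writers no longer
share a lock.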
> + mega_stats->n_masks = ovs_flow_tbl_num_masks(table);
Sashiko says:
---
get_dp_stats() accesses table->mask_array via ovs_flow_tbl_num_masks()
while holding only ovs_mutex. Since this patch decouples flow table updates
by moving them under table->lock, ovs_flow_cmd_new() can execute
concurrently and trigger a reallocation of the mask array, freeing the old
one via call_rcu().
Because get_dp_stats() does not hold rcu_read_lock(), the thread can be
preempted (as ovs_mutex is sleepable) and the RCU grace period might expire
before the count is read. Can this lead to a use-after-free?
---
Note that it also spotted pre-existing issues; please have a look:
https://sashiko.dev/#/patchset/20260407120418.356718-1-amorenoz%40redhat.com
[...]
> @@ -71,15 +93,40 @@ struct flow_table {
>
> extern struct kmem_cache *flow_stats_cache;
>
> +#ifdef CONFIG_LOCKDEP
> +int lockdep_ovs_tbl_is_held(const struct flow_table *table);
> +#else
> +static inline int lockdep_ovs_tbl_is_held(const struct flow_table *table)
> +{
> + (void)table;
You can use the __always_unused annotation.
> + return 1;
> +}
> +#endif
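
Regarding the __always_unused suggestion above, the !CONFIG_LOCKDEP stub
could look roughly like the following. This is a stand-alone sketch: the
fallback #define mirrors how the kernel defines the annotation
(__attribute__((__unused__))) so that the snippet builds outside the
tree, and the forward declaration stands in for the real struct.

```c
#include <assert.h>
#include <stddef.h>

/* In the kernel this comes from <linux/compiler_attributes.h>. */
#ifndef __always_unused
#define __always_unused __attribute__((__unused__))
#endif

struct flow_table;	/* opaque here; the stub never dereferences it */

/* Without lockdep there is nothing to check, so always report "held". */
static inline int
lockdep_ovs_tbl_is_held(const struct flow_table *table __always_unused)
{
	return 1;
}
```

This drops the "(void)table;" statement while still silencing
unused-parameter warnings.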
> +
> +#define ASSERT_OVS_TBL(tbl) WARN_ON(!lockdep_ovs_tbl_is_held(tbl))
> +
> +/* Lock-protected update-allowed dereferences.*/
> +#define ovs_tbl_dereference(p, tbl) \
> + rcu_dereference_protected(p, lockdep_ovs_tbl_is_held(tbl))
> +
> +/* Read dereferences can be protected by either RCU, table lock or ovs_mutex. */
Is this access scheme really safe? As I understand it, tables can be
written/deleted under the table lock only. If so, this check should
ignore the ovs_mutex status.
/P