* [PATCH net-next v1] net: openvswitch: decouple flow_table from ovs_mutex
@ 2026-03-13 17:31 Adrian Moreno
From: Adrian Moreno @ 2026-03-13 17:31 UTC (permalink / raw)
To: netdev
Cc: Adrian Moreno, Aaron Conole, Eelco Chaudron, Ilya Maximets,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, dev, linux-kernel
Currently the entire ovs module is write-protected using the global
ovs_mutex. While this simple approach works fine for control-plane
operations (such as vport configurations), requiring the global mutex
for flow modifications can be problematic.
During periods of heavy control-plane activity, e.g. netdevs (vports)
coming and going, RTNL can become contended. This contention is easily
transferred to ovs_mutex because RTNL nests inside ovs_mutex. Flow
modifications, however, are done as part of packet processing, and having
them wait for RTNL pressure to go away can lead to packet drops.
This patch decouples flow_table modifications from ovs_mutex by means of
the following:
1 - Make flow_table an rcu-protected pointer inside the datapath.
This allows both objects to be protected independently while reducing the
number of changes required in "flow_table.c".
2 - Create a new mutex inside the flow_table that protects it from
concurrent modifications.
Putting the mutex inside the flow_table makes it easier to use from
functions in flow_table.c that do not currently take a pointer to the
datapath.
Some function signatures need to be changed to accept the flow_table so
that lockdep checks can be performed.
3 - Create a reference count to temporarily extend rcu protection from
the datapath to the flow_table.
In order to use the flow_table without holding ovs_mutex, the flow_table
pointer must first be dereferenced within an rcu-protected region.
Next, table->lock needs to be taken to protect the table from concurrent
writes. However, mutexes must not be acquired inside an rcu-protected
region, so the rcu-protected region has to be left first, at which point
the datapath could be freed concurrently.
To extend the protection beyond the rcu region, a reference count is used.
One reference is held by the datapath; another is taken temporarily
during flow modifications. For example:
Datapath deletion:

    ovs_lock();
    table = rcu_dereference_protected(dp->table, ...);
    rcu_assign_pointer(dp->table, NULL);
    ovs_flow_tbl_put(table);
    ovs_unlock();

Flow modification:

    rcu_read_lock();
    dp = get_dp(...);
    table = rcu_dereference(dp->table);
    ovs_flow_tbl_get(table);
    rcu_read_unlock();

    mutex_lock(&table->lock);
    /* Perform modifications on the flow_table */
    mutex_unlock(&table->lock);

    ovs_flow_tbl_put(table);
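
For completeness, pure readers need neither ovs_mutex nor table->lock; the
following is only a minimal sketch of the read side (the actual lookup in
ovs_dp_process_packet() below already runs inside an RCU read-side section,
so it does not take rcu_read_lock() itself, and "hash" stands in for
skb_get_hash(skb)):

    rcu_read_lock();
    dp = get_dp(...);
    table = rcu_dereference(dp->table);
    if (table)
            flow = ovs_flow_tbl_lookup_stats(table, key, hash,
                                             &n_mask_hit, &n_cache_hit);
    rcu_read_unlock();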
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
---
net/openvswitch/datapath.c | 284 ++++++++++++++++++++++++-----------
net/openvswitch/datapath.h | 2 +-
net/openvswitch/flow.c | 13 +-
net/openvswitch/flow.h | 9 +-
net/openvswitch/flow_table.c | 180 ++++++++++++++--------
net/openvswitch/flow_table.h | 51 ++++++-
6 files changed, 379 insertions(+), 160 deletions(-)
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index d5b6e2002bc1..133701fb0c77 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -88,13 +88,17 @@ static void ovs_notify(struct genl_family *family,
* DOC: Locking:
*
* All writes e.g. Writes to device state (add/remove datapath, port, set
- * operations on vports, etc.), Writes to other state (flow table
- * modifications, set miscellaneous datapath parameters, etc.) are protected
- * by ovs_lock.
+ * operations on vports, etc.) and writes to other datapath parameters
+ * are protected by ovs_lock.
+ *
+ * Writes to the flow table are NOT protected by ovs_lock. Instead, a per-table
+ * mutex and reference count are used (see comment above "struct flow_table"
+ * definition). On some few occasions, the per-flow table mutex is nested
+ * inside ovs_mutex.
*
* Reads are protected by RCU.
*
- * There are a few special cases (mostly stats) that have their own
+ * There are a few other special cases (mostly stats) that have their own
* synchronization but they nest under all of above and don't interact with
* each other.
*
@@ -166,7 +170,6 @@ static void destroy_dp_rcu(struct rcu_head *rcu)
{
struct datapath *dp = container_of(rcu, struct datapath, rcu);
- ovs_flow_tbl_destroy(&dp->table);
free_percpu(dp->stats_percpu);
kfree(dp->ports);
ovs_meters_exit(dp);
@@ -247,6 +250,7 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
struct ovs_pcpu_storage *ovs_pcpu = this_cpu_ptr(ovs_pcpu_storage);
const struct vport *p = OVS_CB(skb)->input_vport;
struct datapath *dp = p->dp;
+ struct flow_table *table;
struct sw_flow *flow;
struct sw_flow_actions *sf_acts;
struct dp_stats_percpu *stats;
@@ -257,9 +261,16 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
int error;
stats = this_cpu_ptr(dp->stats_percpu);
+ table = rcu_dereference(dp->table);
+ if (!table) {
+ net_dbg_ratelimited("ovs: no flow table on datapath %s\n",
+ ovs_dp_name(dp));
+ kfree_skb(skb);
+ return;
+ }
/* Look up flow. */
- flow = ovs_flow_tbl_lookup_stats(&dp->table, key, skb_get_hash(skb),
+ flow = ovs_flow_tbl_lookup_stats(table, key, skb_get_hash(skb),
&n_mask_hit, &n_cache_hit);
if (unlikely(!flow)) {
struct dp_upcall_info upcall;
@@ -752,12 +763,16 @@ static struct genl_family dp_packet_genl_family __ro_after_init = {
static void get_dp_stats(const struct datapath *dp, struct ovs_dp_stats *stats,
struct ovs_dp_megaflow_stats *mega_stats)
{
+ struct flow_table *table = ovsl_dereference(dp->table);
int i;
memset(mega_stats, 0, sizeof(*mega_stats));
- stats->n_flows = ovs_flow_tbl_count(&dp->table);
- mega_stats->n_masks = ovs_flow_tbl_num_masks(&dp->table);
+ if (table) {
+ stats->n_flows = ovs_flow_tbl_count(table);
+ mega_stats->n_masks = ovs_flow_tbl_num_masks(table);
+ }
+
stats->n_hit = stats->n_missed = stats->n_lost = 0;
@@ -829,15 +844,16 @@ static size_t ovs_flow_cmd_msg_size(const struct sw_flow_actions *acts,
+ nla_total_size_64bit(8); /* OVS_FLOW_ATTR_USED */
}
-/* Called with ovs_mutex or RCU read lock. */
+/* Called with table->lock or RCU read lock. */
static int ovs_flow_cmd_fill_stats(const struct sw_flow *flow,
+ const struct flow_table *table,
struct sk_buff *skb)
{
struct ovs_flow_stats stats;
__be16 tcp_flags;
unsigned long used;
- ovs_flow_stats_get(flow, &stats, &used, &tcp_flags);
+ ovs_flow_stats_get(flow, table, &stats, &used, &tcp_flags);
if (used &&
nla_put_u64_64bit(skb, OVS_FLOW_ATTR_USED, ovs_flow_used_time(used),
@@ -857,8 +873,9 @@ static int ovs_flow_cmd_fill_stats(const struct sw_flow *flow,
return 0;
}
-/* Called with ovs_mutex or RCU read lock. */
+/* Called with RCU read lock or table->lock held. */
static int ovs_flow_cmd_fill_actions(const struct sw_flow *flow,
+ const struct flow_table *table,
struct sk_buff *skb, int skb_orig_len)
{
struct nlattr *start;
@@ -878,7 +895,7 @@ static int ovs_flow_cmd_fill_actions(const struct sw_flow *flow,
if (start) {
const struct sw_flow_actions *sf_acts;
- sf_acts = rcu_dereference_ovsl(flow->sf_acts);
+ sf_acts = rcu_dereference_ovs_tbl(flow->sf_acts, table);
err = ovs_nla_put_actions(sf_acts->actions,
sf_acts->actions_len, skb);
@@ -897,8 +914,10 @@ static int ovs_flow_cmd_fill_actions(const struct sw_flow *flow,
return 0;
}
-/* Called with ovs_mutex or RCU read lock. */
-static int ovs_flow_cmd_fill_info(const struct sw_flow *flow, int dp_ifindex,
+/* Called with table->lock or RCU read lock. */
+static int ovs_flow_cmd_fill_info(const struct sw_flow *flow,
+ const struct flow_table *table,
+ int dp_ifindex,
struct sk_buff *skb, u32 portid,
u32 seq, u32 flags, u8 cmd, u32 ufid_flags)
{
@@ -929,12 +948,12 @@ static int ovs_flow_cmd_fill_info(const struct sw_flow *flow, int dp_ifindex,
goto error;
}
- err = ovs_flow_cmd_fill_stats(flow, skb);
+ err = ovs_flow_cmd_fill_stats(flow, table, skb);
if (err)
goto error;
if (should_fill_actions(ufid_flags)) {
- err = ovs_flow_cmd_fill_actions(flow, skb, skb_orig_len);
+ err = ovs_flow_cmd_fill_actions(flow, table, skb, skb_orig_len);
if (err)
goto error;
}
@@ -968,8 +987,9 @@ static struct sk_buff *ovs_flow_cmd_alloc_info(const struct sw_flow_actions *act
return skb;
}
-/* Called with ovs_mutex. */
+/* Called with table->lock. */
static struct sk_buff *ovs_flow_cmd_build_info(const struct sw_flow *flow,
+ const struct flow_table *table,
int dp_ifindex,
struct genl_info *info, u8 cmd,
bool always, u32 ufid_flags)
@@ -977,12 +997,12 @@ static struct sk_buff *ovs_flow_cmd_build_info(const struct sw_flow *flow,
struct sk_buff *skb;
int retval;
- skb = ovs_flow_cmd_alloc_info(ovsl_dereference(flow->sf_acts),
+ skb = ovs_flow_cmd_alloc_info(ovs_tbl_dereference(flow->sf_acts, table),
&flow->id, info, always, ufid_flags);
if (IS_ERR_OR_NULL(skb))
return skb;
- retval = ovs_flow_cmd_fill_info(flow, dp_ifindex, skb,
+ retval = ovs_flow_cmd_fill_info(flow, table, dp_ifindex, skb,
info->snd_portid, info->snd_seq, 0,
cmd, ufid_flags);
if (WARN_ON_ONCE(retval < 0)) {
@@ -998,6 +1018,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
struct nlattr **a = info->attrs;
struct ovs_header *ovs_header = genl_info_userhdr(info);
struct sw_flow *flow = NULL, *new_flow;
+ struct flow_table *table;
struct sw_flow_mask mask;
struct sk_buff *reply;
struct datapath *dp;
@@ -1064,30 +1085,43 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
goto err_kfree_acts;
}
- ovs_lock();
+ rcu_read_lock();
dp = get_dp(net, ovs_header->dp_ifindex);
if (unlikely(!dp)) {
error = -ENODEV;
- goto err_unlock_ovs;
+ rcu_read_unlock();
+ goto err_kfree_reply;
}
+ table = rcu_dereference(dp->table);
+ if (!table || !ovs_flow_tbl_get(table)) {
+ error = -ENODEV;
+ rcu_read_unlock();
+ goto err_kfree_reply;
+ }
+ rcu_read_unlock();
+
+ /* It is safe to dereference "table" after leaving rcu read-protected
+ * region because it's pinned by refcount.
+ */
+ mutex_lock(&table->lock);
/* Check if this is a duplicate flow */
if (ovs_identifier_is_ufid(&new_flow->id))
- flow = ovs_flow_tbl_lookup_ufid(&dp->table, &new_flow->id);
+ flow = ovs_flow_tbl_lookup_ufid(table, &new_flow->id);
if (!flow)
- flow = ovs_flow_tbl_lookup(&dp->table, key);
+ flow = ovs_flow_tbl_lookup(table, key);
if (likely(!flow)) {
rcu_assign_pointer(new_flow->sf_acts, acts);
/* Put flow in bucket. */
- error = ovs_flow_tbl_insert(&dp->table, new_flow, &mask);
+ error = ovs_flow_tbl_insert(table, new_flow, &mask);
if (unlikely(error)) {
acts = NULL;
- goto err_unlock_ovs;
+ goto err_unlock_tbl;
}
if (unlikely(reply)) {
- error = ovs_flow_cmd_fill_info(new_flow,
+ error = ovs_flow_cmd_fill_info(new_flow, table,
ovs_header->dp_ifindex,
reply, info->snd_portid,
info->snd_seq, 0,
@@ -1095,7 +1129,8 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
ufid_flags);
BUG_ON(error < 0);
}
- ovs_unlock();
+ mutex_unlock(&table->lock);
+ ovs_flow_tbl_put(table);
} else {
struct sw_flow_actions *old_acts;
@@ -1108,28 +1143,28 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
if (unlikely(info->nlhdr->nlmsg_flags & (NLM_F_CREATE
| NLM_F_EXCL))) {
error = -EEXIST;
- goto err_unlock_ovs;
+ goto err_unlock_tbl;
}
/* The flow identifier has to be the same for flow updates.
* Look for any overlapping flow.
*/
if (unlikely(!ovs_flow_cmp(flow, &match))) {
if (ovs_identifier_is_key(&flow->id))
- flow = ovs_flow_tbl_lookup_exact(&dp->table,
+ flow = ovs_flow_tbl_lookup_exact(table,
&match);
else /* UFID matches but key is different */
flow = NULL;
if (!flow) {
error = -ENOENT;
- goto err_unlock_ovs;
+ goto err_unlock_tbl;
}
}
/* Update actions. */
- old_acts = ovsl_dereference(flow->sf_acts);
+ old_acts = ovs_tbl_dereference(flow->sf_acts, table);
rcu_assign_pointer(flow->sf_acts, acts);
if (unlikely(reply)) {
- error = ovs_flow_cmd_fill_info(flow,
+ error = ovs_flow_cmd_fill_info(flow, table,
ovs_header->dp_ifindex,
reply, info->snd_portid,
info->snd_seq, 0,
@@ -1137,7 +1172,8 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
ufid_flags);
BUG_ON(error < 0);
}
- ovs_unlock();
+ mutex_unlock(&table->lock);
+ ovs_flow_tbl_put(table);
ovs_nla_free_flow_actions_rcu(old_acts);
ovs_flow_free(new_flow, false);
@@ -1149,8 +1185,10 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
kfree(key);
return 0;
-err_unlock_ovs:
- ovs_unlock();
+err_unlock_tbl:
+ mutex_unlock(&table->lock);
+ ovs_flow_tbl_put(table);
+err_kfree_reply:
kfree_skb(reply);
err_kfree_acts:
ovs_nla_free_flow_actions(acts);
@@ -1244,6 +1282,7 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
struct net *net = sock_net(skb->sk);
struct nlattr **a = info->attrs;
struct ovs_header *ovs_header = genl_info_userhdr(info);
+ struct flow_table *table;
struct sw_flow_key key;
struct sw_flow *flow;
struct sk_buff *reply = NULL;
@@ -1278,29 +1317,42 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
}
}
- ovs_lock();
+ rcu_read_lock();
dp = get_dp(net, ovs_header->dp_ifindex);
if (unlikely(!dp)) {
error = -ENODEV;
- goto err_unlock_ovs;
+ goto err_free_reply;
}
+ table = rcu_dereference(dp->table);
+ if (!table || !ovs_flow_tbl_get(table)) {
+ rcu_read_unlock();
+ error = -ENODEV;
+ goto err_free_reply;
+ }
+ rcu_read_unlock();
+
+ /* It is safe to dereference "table" after leaving rcu read-protected
+ * region because it's pinned by refcount.
+ */
+ mutex_lock(&table->lock);
+
/* Check that the flow exists. */
if (ufid_present)
- flow = ovs_flow_tbl_lookup_ufid(&dp->table, &sfid);
+ flow = ovs_flow_tbl_lookup_ufid(table, &sfid);
else
- flow = ovs_flow_tbl_lookup_exact(&dp->table, &match);
+ flow = ovs_flow_tbl_lookup_exact(table, &match);
if (unlikely(!flow)) {
error = -ENOENT;
- goto err_unlock_ovs;
+ goto err_unlock_tbl;
}
/* Update actions, if present. */
if (likely(acts)) {
- old_acts = ovsl_dereference(flow->sf_acts);
+ old_acts = ovs_tbl_dereference(flow->sf_acts, table);
rcu_assign_pointer(flow->sf_acts, acts);
if (unlikely(reply)) {
- error = ovs_flow_cmd_fill_info(flow,
+ error = ovs_flow_cmd_fill_info(flow, table,
ovs_header->dp_ifindex,
reply, info->snd_portid,
info->snd_seq, 0,
@@ -1310,20 +1362,22 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
}
} else {
/* Could not alloc without acts before locking. */
- reply = ovs_flow_cmd_build_info(flow, ovs_header->dp_ifindex,
+ reply = ovs_flow_cmd_build_info(flow, table,
+ ovs_header->dp_ifindex,
info, OVS_FLOW_CMD_SET, false,
ufid_flags);
if (IS_ERR(reply)) {
error = PTR_ERR(reply);
- goto err_unlock_ovs;
+ goto err_unlock_tbl;
}
}
/* Clear stats. */
if (a[OVS_FLOW_ATTR_CLEAR])
- ovs_flow_stats_clear(flow);
- ovs_unlock();
+ ovs_flow_stats_clear(flow, table);
+ mutex_unlock(&table->lock);
+ ovs_flow_tbl_put(table);
if (reply)
ovs_notify(&dp_flow_genl_family, reply, info);
@@ -1332,8 +1386,10 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
return 0;
-err_unlock_ovs:
- ovs_unlock();
+err_unlock_tbl:
+ mutex_unlock(&table->lock);
+ ovs_flow_tbl_put(table);
+err_free_reply:
kfree_skb(reply);
err_kfree_acts:
ovs_nla_free_flow_actions(acts);
@@ -1346,6 +1402,7 @@ static int ovs_flow_cmd_get(struct sk_buff *skb, struct genl_info *info)
struct nlattr **a = info->attrs;
struct ovs_header *ovs_header = genl_info_userhdr(info);
struct net *net = sock_net(skb->sk);
+ struct flow_table *table;
struct sw_flow_key key;
struct sk_buff *reply;
struct sw_flow *flow;
@@ -1370,33 +1427,48 @@ static int ovs_flow_cmd_get(struct sk_buff *skb, struct genl_info *info)
if (err)
return err;
- ovs_lock();
+ rcu_read_lock();
dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex);
if (!dp) {
- err = -ENODEV;
- goto unlock;
+ rcu_read_unlock();
+ return -ENODEV;
}
+ table = rcu_dereference(dp->table);
+ if (!table || !ovs_flow_tbl_get(table)) {
+ rcu_read_unlock();
+ return -ENODEV;
+ }
+ rcu_read_unlock();
+
+ /* It is safe to dereference "table" after leaving rcu read-protected
+ * region because it's pinned by refcount.
+ */
+ mutex_lock(&table->lock);
+
if (ufid_present)
- flow = ovs_flow_tbl_lookup_ufid(&dp->table, &ufid);
+ flow = ovs_flow_tbl_lookup_ufid(table, &ufid);
else
- flow = ovs_flow_tbl_lookup_exact(&dp->table, &match);
+ flow = ovs_flow_tbl_lookup_exact(table, &match);
if (!flow) {
err = -ENOENT;
goto unlock;
}
- reply = ovs_flow_cmd_build_info(flow, ovs_header->dp_ifindex, info,
- OVS_FLOW_CMD_GET, true, ufid_flags);
+ reply = ovs_flow_cmd_build_info(flow, table, ovs_header->dp_ifindex,
+ info, OVS_FLOW_CMD_GET, true,
+ ufid_flags);
if (IS_ERR(reply)) {
err = PTR_ERR(reply);
goto unlock;
}
- ovs_unlock();
+ mutex_unlock(&table->lock);
+ ovs_flow_tbl_put(table);
return genlmsg_reply(reply, info);
unlock:
- ovs_unlock();
+ mutex_unlock(&table->lock);
+ ovs_flow_tbl_put(table);
return err;
}
@@ -1405,6 +1477,7 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info)
struct nlattr **a = info->attrs;
struct ovs_header *ovs_header = genl_info_userhdr(info);
struct net *net = sock_net(skb->sk);
+ struct flow_table *table;
struct sw_flow_key key;
struct sk_buff *reply;
struct sw_flow *flow = NULL;
@@ -1425,36 +1498,49 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info)
return err;
}
- ovs_lock();
+ rcu_read_lock();
dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex);
if (unlikely(!dp)) {
- err = -ENODEV;
- goto unlock;
+ rcu_read_unlock();
+ return -ENODEV;
}
+ table = rcu_dereference(dp->table);
+ if (!table || !ovs_flow_tbl_get(table)) {
+ rcu_read_unlock();
+ return -ENODEV;
+ }
+ rcu_read_unlock();
+
+ /* It is safe to dereference "table" after leaving rcu read-protected
+ * region because it's pinned by refcount.
+ */
+ mutex_lock(&table->lock);
+
if (unlikely(!a[OVS_FLOW_ATTR_KEY] && !ufid_present)) {
- err = ovs_flow_tbl_flush(&dp->table);
+ err = ovs_flow_tbl_flush(table);
goto unlock;
}
if (ufid_present)
- flow = ovs_flow_tbl_lookup_ufid(&dp->table, &ufid);
+ flow = ovs_flow_tbl_lookup_ufid(table, &ufid);
else
- flow = ovs_flow_tbl_lookup_exact(&dp->table, &match);
+ flow = ovs_flow_tbl_lookup_exact(table, &match);
if (unlikely(!flow)) {
err = -ENOENT;
goto unlock;
}
- ovs_flow_tbl_remove(&dp->table, flow);
- ovs_unlock();
+ ovs_flow_tbl_remove(table, flow);
+ mutex_unlock(&table->lock);
reply = ovs_flow_cmd_alloc_info((const struct sw_flow_actions __force *) flow->sf_acts,
&flow->id, info, false, ufid_flags);
if (likely(reply)) {
if (!IS_ERR(reply)) {
rcu_read_lock(); /*To keep RCU checker happy. */
- err = ovs_flow_cmd_fill_info(flow, ovs_header->dp_ifindex,
+ err = ovs_flow_cmd_fill_info(flow, table,
+ ovs_header->dp_ifindex,
reply, info->snd_portid,
info->snd_seq, 0,
OVS_FLOW_CMD_DEL,
@@ -1473,10 +1559,12 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info)
}
out_free:
+ ovs_flow_tbl_put(table);
ovs_flow_free(flow, true);
return 0;
unlock:
- ovs_unlock();
+ mutex_unlock(&table->lock);
+ ovs_flow_tbl_put(table);
return err;
}
@@ -1485,6 +1573,7 @@ static int ovs_flow_cmd_dump(struct sk_buff *skb, struct netlink_callback *cb)
struct nlattr *a[__OVS_FLOW_ATTR_MAX];
struct ovs_header *ovs_header = genlmsg_data(nlmsg_data(cb->nlh));
struct table_instance *ti;
+ struct flow_table *table;
struct datapath *dp;
u32 ufid_flags;
int err;
@@ -1501,8 +1590,13 @@ static int ovs_flow_cmd_dump(struct sk_buff *skb, struct netlink_callback *cb)
rcu_read_unlock();
return -ENODEV;
}
+ table = rcu_dereference(dp->table);
+ if (!table) {
+ rcu_read_unlock();
+ return -ENODEV;
+ }
- ti = rcu_dereference(dp->table.ti);
+ ti = rcu_dereference(table->ti);
for (;;) {
struct sw_flow *flow;
u32 bucket, obj;
@@ -1513,8 +1607,8 @@ static int ovs_flow_cmd_dump(struct sk_buff *skb, struct netlink_callback *cb)
if (!flow)
break;
- if (ovs_flow_cmd_fill_info(flow, ovs_header->dp_ifindex, skb,
- NETLINK_CB(cb->skb).portid,
+ if (ovs_flow_cmd_fill_info(flow, table, ovs_header->dp_ifindex,
+ skb, NETLINK_CB(cb->skb).portid,
cb->nlh->nlmsg_seq, NLM_F_MULTI,
OVS_FLOW_CMD_GET, ufid_flags) < 0)
break;
@@ -1598,8 +1692,13 @@ static int ovs_dp_cmd_fill_info(struct datapath *dp, struct sk_buff *skb,
struct ovs_dp_stats dp_stats;
struct ovs_dp_megaflow_stats dp_megaflow_stats;
struct dp_nlsk_pids *pids = ovsl_dereference(dp->upcall_portids);
+ struct flow_table *table;
int err, pids_len;
+ table = ovsl_dereference(dp->table);
+ if (!table)
+ return -ENODEV;
+
ovs_header = genlmsg_put(skb, portid, seq, &dp_datapath_genl_family,
flags, cmd);
if (!ovs_header)
@@ -1625,7 +1724,7 @@ static int ovs_dp_cmd_fill_info(struct datapath *dp, struct sk_buff *skb,
goto nla_put_failure;
if (nla_put_u32(skb, OVS_DP_ATTR_MASKS_CACHE_SIZE,
- ovs_flow_tbl_masks_cache_size(&dp->table)))
+ ovs_flow_tbl_masks_cache_size(table)))
goto nla_put_failure;
if (dp->user_features & OVS_DP_F_DISPATCH_UPCALL_PER_CPU && pids) {
@@ -1736,6 +1835,7 @@ u32 ovs_dp_get_upcall_portid(const struct datapath *dp, uint32_t cpu_id)
static int ovs_dp_change(struct datapath *dp, struct nlattr *a[])
{
u32 user_features = 0, old_features = dp->user_features;
+ struct flow_table *table;
int err;
if (a[OVS_DP_ATTR_USER_FEATURES]) {
@@ -1757,8 +1857,12 @@ static int ovs_dp_change(struct datapath *dp, struct nlattr *a[])
int err;
u32 cache_size;
+ table = ovsl_dereference(dp->table);
+ if (!table)
+ return -ENODEV;
+
cache_size = nla_get_u32(a[OVS_DP_ATTR_MASKS_CACHE_SIZE]);
- err = ovs_flow_tbl_masks_cache_resize(&dp->table, cache_size);
+ err = ovs_flow_tbl_masks_cache_resize(table, cache_size);
if (err)
return err;
}
@@ -1812,6 +1916,7 @@ static int ovs_dp_vport_init(struct datapath *dp)
static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info)
{
struct nlattr **a = info->attrs;
+ struct flow_table *table;
struct vport_parms parms;
struct sk_buff *reply;
struct datapath *dp;
@@ -1835,9 +1940,12 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info)
ovs_dp_set_net(dp, sock_net(skb->sk));
/* Allocate table. */
- err = ovs_flow_tbl_init(&dp->table);
- if (err)
+ table = ovs_flow_tbl_alloc();
+ if (IS_ERR(table)) {
+ err = PTR_ERR(table);
goto err_destroy_dp;
+ }
+ rcu_assign_pointer(dp->table, table);
err = ovs_dp_stats_init(dp);
if (err)
@@ -1907,7 +2015,7 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info)
err_destroy_stats:
free_percpu(dp->stats_percpu);
err_destroy_table:
- ovs_flow_tbl_destroy(&dp->table);
+ ovs_flow_tbl_put(dp->table);
err_destroy_dp:
kfree(dp);
err_destroy_reply:
@@ -1919,7 +2027,8 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info)
/* Called with ovs_mutex. */
static void __dp_destroy(struct datapath *dp)
{
- struct flow_table *table = &dp->table;
+ struct flow_table *table = rcu_dereference_protected(dp->table,
+ lockdep_ovsl_is_held());
int i;
if (dp->user_features & OVS_DP_F_TC_RECIRC_SHARING)
@@ -1941,14 +2050,10 @@ static void __dp_destroy(struct datapath *dp)
*/
ovs_dp_detach_port(ovs_vport_ovsl(dp, OVSP_LOCAL));
- /* Flush sw_flow in the tables. RCU cb only releases resource
- * such as dp, ports and tables. That may avoid some issues
- * such as RCU usage warning.
- */
- table_instance_flow_flush(table, ovsl_dereference(table->ti),
- ovsl_dereference(table->ufid_ti));
+ rcu_assign_pointer(dp->table, NULL);
+ ovs_flow_tbl_put(table);
- /* RCU destroy the ports, meters and flow tables. */
+ /* RCU destroy the ports and meters. */
call_rcu(&dp->rcu, destroy_dp_rcu);
}
@@ -2556,13 +2661,18 @@ static void ovs_dp_masks_rebalance(struct work_struct *work)
{
struct ovs_net *ovs_net = container_of(work, struct ovs_net,
masks_rebalance.work);
+ struct flow_table *table;
struct datapath *dp;
ovs_lock();
-
- list_for_each_entry(dp, &ovs_net->dps, list_node)
- ovs_flow_masks_rebalance(&dp->table);
-
+ list_for_each_entry_rcu(dp, &ovs_net->dps, list_node) {
+ table = ovsl_dereference(dp->table);
+ if (!table)
+ continue;
+ mutex_lock(&table->lock);
+ ovs_flow_masks_rebalance(table);
+ mutex_unlock(&table->lock);
+ }
ovs_unlock();
schedule_delayed_work(&ovs_net->masks_rebalance,
diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index db0c3e69d66c..44773bf9f645 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -90,7 +90,7 @@ struct datapath {
struct list_head list_node;
/* Flow table. */
- struct flow_table table;
+ struct flow_table __rcu *table;
/* Switch ports. */
struct hlist_head *ports;
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 66366982f604..0a748cf20f53 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -124,8 +124,9 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags,
spin_unlock(&stats->lock);
}
-/* Must be called with rcu_read_lock or ovs_mutex. */
+/* Must be called with rcu_read_lock or table->lock held. */
void ovs_flow_stats_get(const struct sw_flow *flow,
+ const struct flow_table *table,
struct ovs_flow_stats *ovs_stats,
unsigned long *used, __be16 *tcp_flags)
{
@@ -136,7 +137,8 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
memset(ovs_stats, 0, sizeof(*ovs_stats));
for_each_cpu(cpu, flow->cpu_used_mask) {
- struct sw_flow_stats *stats = rcu_dereference_ovsl(flow->stats[cpu]);
+ struct sw_flow_stats *stats =
+ rcu_dereference_ovs_tbl(flow->stats[cpu], table);
if (stats) {
/* Local CPU may write on non-local stats, so we must
@@ -153,13 +155,14 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
}
}
-/* Called with ovs_mutex. */
-void ovs_flow_stats_clear(struct sw_flow *flow)
+/* Called with table->lock held. */
+void ovs_flow_stats_clear(struct sw_flow *flow, struct flow_table *table)
{
unsigned int cpu;
for_each_cpu(cpu, flow->cpu_used_mask) {
- struct sw_flow_stats *stats = ovsl_dereference(flow->stats[cpu]);
+ struct sw_flow_stats *stats =
+ ovs_tbl_dereference(flow->stats[cpu], table);
if (stats) {
spin_lock_bh(&stats->lock);
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index b5711aff6e76..e05ed6796e4e 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -23,6 +23,7 @@
#include <net/dst_metadata.h>
#include <net/nsh.h>
+struct flow_table;
struct sk_buff;
enum sw_flow_mac_proto {
@@ -280,9 +281,11 @@ static inline bool ovs_identifier_is_key(const struct sw_flow_id *sfid)
void ovs_flow_stats_update(struct sw_flow *, __be16 tcp_flags,
const struct sk_buff *);
-void ovs_flow_stats_get(const struct sw_flow *, struct ovs_flow_stats *,
- unsigned long *used, __be16 *tcp_flags);
-void ovs_flow_stats_clear(struct sw_flow *);
+void ovs_flow_stats_get(const struct sw_flow *flow,
+ const struct flow_table *table,
+ struct ovs_flow_stats *stats, unsigned long *used,
+ __be16 *tcp_flags);
+void ovs_flow_stats_clear(struct sw_flow *flow, struct flow_table *table);
u64 ovs_flow_used_time(unsigned long flow_jiffies);
int ovs_flow_key_update(struct sk_buff *skb, struct sw_flow_key *key);
diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
index ffc72a741a50..188e9d39ce00 100644
--- a/net/openvswitch/flow_table.c
+++ b/net/openvswitch/flow_table.c
@@ -45,6 +45,16 @@
static struct kmem_cache *flow_cache;
struct kmem_cache *flow_stats_cache __read_mostly;
+#ifdef CONFIG_LOCKDEP
+int lockdep_ovs_tbl_is_held(const struct flow_table *table)
+{
+ if (debug_locks)
+ return lockdep_is_held(&table->lock);
+ else
+ return 1;
+}
+#endif
+
static u16 range_n_bytes(const struct sw_flow_key_range *range)
{
return range->end - range->start;
@@ -250,12 +260,12 @@ static int tbl_mask_array_realloc(struct flow_table *tbl, int size)
if (!new)
return -ENOMEM;
- old = ovsl_dereference(tbl->mask_array);
+ old = ovs_tbl_dereference(tbl->mask_array, tbl);
if (old) {
int i;
for (i = 0; i < old->max; i++) {
- if (ovsl_dereference(old->masks[i]))
+ if (ovs_tbl_dereference(old->masks[i], tbl))
new->masks[new->count++] = old->masks[i];
}
call_rcu(&old->rcu, mask_array_rcu_cb);
@@ -269,7 +279,7 @@ static int tbl_mask_array_realloc(struct flow_table *tbl, int size)
static int tbl_mask_array_add_mask(struct flow_table *tbl,
struct sw_flow_mask *new)
{
- struct mask_array *ma = ovsl_dereference(tbl->mask_array);
+ struct mask_array *ma = ovs_tbl_dereference(tbl->mask_array, tbl);
int err, ma_count = READ_ONCE(ma->count);
if (ma_count >= ma->max) {
@@ -278,7 +288,7 @@ static int tbl_mask_array_add_mask(struct flow_table *tbl,
if (err)
return err;
- ma = ovsl_dereference(tbl->mask_array);
+ ma = ovs_tbl_dereference(tbl->mask_array, tbl);
} else {
/* On every add or delete we need to reset the counters so
* every new mask gets a fair chance of being prioritized.
@@ -286,7 +296,7 @@ static int tbl_mask_array_add_mask(struct flow_table *tbl,
tbl_mask_array_reset_counters(ma);
}
- BUG_ON(ovsl_dereference(ma->masks[ma_count]));
+ WARN_ON_ONCE(ovs_tbl_dereference(ma->masks[ma_count], tbl));
rcu_assign_pointer(ma->masks[ma_count], new);
WRITE_ONCE(ma->count, ma_count + 1);
@@ -297,12 +307,12 @@ static int tbl_mask_array_add_mask(struct flow_table *tbl,
static void tbl_mask_array_del_mask(struct flow_table *tbl,
struct sw_flow_mask *mask)
{
- struct mask_array *ma = ovsl_dereference(tbl->mask_array);
+ struct mask_array *ma = ovs_tbl_dereference(tbl->mask_array, tbl);
int i, ma_count = READ_ONCE(ma->count);
/* Remove the deleted mask pointers from the array */
for (i = 0; i < ma_count; i++) {
- if (mask == ovsl_dereference(ma->masks[i]))
+ if (mask == ovs_tbl_dereference(ma->masks[i], tbl))
goto found;
}
@@ -330,10 +340,10 @@ static void tbl_mask_array_del_mask(struct flow_table *tbl,
static void flow_mask_remove(struct flow_table *tbl, struct sw_flow_mask *mask)
{
if (mask) {
- /* ovs-lock is required to protect mask-refcount and
+ /* table lock is required to protect mask-refcount and
* mask list.
*/
- ASSERT_OVSL();
+ ASSERT_OVS_TBL(tbl);
BUG_ON(!mask->ref_count);
mask->ref_count--;
@@ -387,7 +397,8 @@ static struct mask_cache *tbl_mask_cache_alloc(u32 size)
}
int ovs_flow_tbl_masks_cache_resize(struct flow_table *table, u32 size)
{
- struct mask_cache *mc = rcu_dereference_ovsl(table->mask_cache);
+ struct mask_cache *mc = rcu_dereference_ovs_tbl(table->mask_cache,
+ table);
struct mask_cache *new;
if (size == mc->cache_size)
@@ -407,15 +418,23 @@ int ovs_flow_tbl_masks_cache_resize(struct flow_table *table, u32 size)
return 0;
}
-int ovs_flow_tbl_init(struct flow_table *table)
+struct flow_table *ovs_flow_tbl_alloc(void)
{
struct table_instance *ti, *ufid_ti;
+ struct flow_table *table;
struct mask_cache *mc;
struct mask_array *ma;
+ table = kzalloc_obj(*table, GFP_KERNEL);
+ if (!table)
+ return ERR_PTR(-ENOMEM);
+
+ mutex_init(&table->lock);
+ refcount_set(&table->refcnt, 1);
+
mc = tbl_mask_cache_alloc(MC_DEFAULT_HASH_ENTRIES);
if (!mc)
- return -ENOMEM;
+ goto free_table;
ma = tbl_mask_array_alloc(MASK_ARRAY_SIZE_MIN);
if (!ma)
@@ -436,7 +455,7 @@ int ovs_flow_tbl_init(struct flow_table *table)
table->last_rehash = jiffies;
table->count = 0;
table->ufid_count = 0;
- return 0;
+ return table;
free_ti:
__table_instance_destroy(ti);
@@ -444,7 +463,10 @@ int ovs_flow_tbl_init(struct flow_table *table)
__mask_array_destroy(ma);
free_mask_cache:
__mask_cache_destroy(mc);
- return -ENOMEM;
+free_table:
+ mutex_destroy(&table->lock);
+ kfree(table);
+ return ERR_PTR(-ENOMEM);
}
static void flow_tbl_destroy_rcu_cb(struct rcu_head *rcu)
@@ -471,7 +493,7 @@ static void table_instance_flow_free(struct flow_table *table,
flow_mask_remove(table, flow->mask);
}
-/* Must be called with OVS mutex held. */
+/* Must be called with table mutex held. */
void table_instance_flow_flush(struct flow_table *table,
struct table_instance *ti,
struct table_instance *ufid_ti)
@@ -506,11 +528,11 @@ static void table_instance_destroy(struct table_instance *ti,
call_rcu(&ufid_ti->rcu, flow_tbl_destroy_rcu_cb);
}
-/* No need for locking this function is called from RCU callback or
- * error path.
- */
-void ovs_flow_tbl_destroy(struct flow_table *table)
+/* No need for locking this function is called from RCU callback. */
+static void ovs_flow_tbl_destroy_rcu(struct rcu_head *rcu)
{
+ struct flow_table *table = container_of(rcu, struct flow_table, rcu);
+
struct table_instance *ti = rcu_dereference_raw(table->ti);
struct table_instance *ufid_ti = rcu_dereference_raw(table->ufid_ti);
struct mask_cache *mc = rcu_dereference_raw(table->mask_cache);
@@ -519,6 +541,20 @@ void ovs_flow_tbl_destroy(struct flow_table *table)
call_rcu(&mc->rcu, mask_cache_rcu_cb);
call_rcu(&ma->rcu, mask_array_rcu_cb);
table_instance_destroy(ti, ufid_ti);
+ mutex_destroy(&table->lock);
+ kfree(table);
+}
+
+void ovs_flow_tbl_put(struct flow_table *table)
+{
+ if (refcount_dec_and_test(&table->refcnt)) {
+ mutex_lock(&table->lock);
+ table_instance_flow_flush(table,
+ ovs_tbl_dereference(table->ti, table),
+ ovs_tbl_dereference(table->ufid_ti, table));
+ mutex_unlock(&table->lock);
+ call_rcu(&table->rcu, ovs_flow_tbl_destroy_rcu);
+ }
}
struct sw_flow *ovs_flow_tbl_dump_next(struct table_instance *ti,
@@ -572,7 +608,8 @@ static void ufid_table_instance_insert(struct table_instance *ti,
hlist_add_head_rcu(&flow->ufid_table.node[ti->node_ver], head);
}
-static void flow_table_copy_flows(struct table_instance *old,
+static void flow_table_copy_flows(struct flow_table *table,
+ struct table_instance *old,
struct table_instance *new, bool ufid)
{
int old_ver;
@@ -589,17 +626,18 @@ static void flow_table_copy_flows(struct table_instance *old,
if (ufid)
hlist_for_each_entry_rcu(flow, head,
ufid_table.node[old_ver],
- lockdep_ovsl_is_held())
+ lockdep_ovs_tbl_is_held(table))
ufid_table_instance_insert(new, flow);
else
hlist_for_each_entry_rcu(flow, head,
flow_table.node[old_ver],
- lockdep_ovsl_is_held())
+ lockdep_ovs_tbl_is_held(table))
table_instance_insert(new, flow);
}
}
-static struct table_instance *table_instance_rehash(struct table_instance *ti,
+static struct table_instance *table_instance_rehash(struct flow_table *table,
+ struct table_instance *ti,
int n_buckets, bool ufid)
{
struct table_instance *new_ti;
@@ -608,16 +646,19 @@ static struct table_instance *table_instance_rehash(struct table_instance *ti,
if (!new_ti)
return NULL;
- flow_table_copy_flows(ti, new_ti, ufid);
+ flow_table_copy_flows(table, ti, new_ti, ufid);
return new_ti;
}
+/* Must be called with flow_table->lock held. */
int ovs_flow_tbl_flush(struct flow_table *flow_table)
{
struct table_instance *old_ti, *new_ti;
struct table_instance *old_ufid_ti, *new_ufid_ti;
+ ASSERT_OVS_TBL(flow_table);
+
new_ti = table_instance_alloc(TBL_MIN_BUCKETS);
if (!new_ti)
return -ENOMEM;
@@ -625,8 +666,8 @@ int ovs_flow_tbl_flush(struct flow_table *flow_table)
if (!new_ufid_ti)
goto err_free_ti;
- old_ti = ovsl_dereference(flow_table->ti);
- old_ufid_ti = ovsl_dereference(flow_table->ufid_ti);
+ old_ti = ovs_tbl_dereference(flow_table->ti, flow_table);
+ old_ufid_ti = ovs_tbl_dereference(flow_table->ufid_ti, flow_table);
rcu_assign_pointer(flow_table->ti, new_ti);
rcu_assign_pointer(flow_table->ufid_ti, new_ufid_ti);
@@ -694,7 +735,8 @@ static bool ovs_flow_cmp_unmasked_key(const struct sw_flow *flow,
return cmp_key(flow->id.unmasked_key, key, key_start, key_end);
}
-static struct sw_flow *masked_flow_lookup(struct table_instance *ti,
+static struct sw_flow *masked_flow_lookup(struct flow_table *tbl,
+ struct table_instance *ti,
const struct sw_flow_key *unmasked,
const struct sw_flow_mask *mask,
u32 *n_mask_hit)
@@ -710,7 +752,7 @@ static struct sw_flow *masked_flow_lookup(struct table_instance *ti,
(*n_mask_hit)++;
hlist_for_each_entry_rcu(flow, head, flow_table.node[ti->node_ver],
- lockdep_ovsl_is_held()) {
+ lockdep_ovs_tbl_is_held(tbl)) {
if (flow->mask == mask && flow->flow_table.hash == hash &&
flow_cmp_masked_key(flow, &masked_key, &mask->range))
return flow;
@@ -737,9 +779,9 @@ static struct sw_flow *flow_lookup(struct flow_table *tbl,
int i;
if (likely(*index < ma->max)) {
- mask = rcu_dereference_ovsl(ma->masks[*index]);
+ mask = rcu_dereference_ovs_tbl(ma->masks[*index], tbl);
if (mask) {
- flow = masked_flow_lookup(ti, key, mask, n_mask_hit);
+ flow = masked_flow_lookup(tbl, ti, key, mask, n_mask_hit);
if (flow) {
u64_stats_update_begin(&stats->syncp);
stats->usage_cntrs[*index]++;
@@ -755,11 +797,11 @@ static struct sw_flow *flow_lookup(struct flow_table *tbl,
if (i == *index)
continue;
- mask = rcu_dereference_ovsl(ma->masks[i]);
+ mask = rcu_dereference_ovs_tbl(ma->masks[i], tbl);
if (unlikely(!mask))
break;
- flow = masked_flow_lookup(ti, key, mask, n_mask_hit);
+ flow = masked_flow_lookup(tbl, ti, key, mask, n_mask_hit);
if (flow) { /* Found */
*index = i;
u64_stats_update_begin(&stats->syncp);
@@ -846,8 +888,8 @@ struct sw_flow *ovs_flow_tbl_lookup_stats(struct flow_table *tbl,
struct sw_flow *ovs_flow_tbl_lookup(struct flow_table *tbl,
const struct sw_flow_key *key)
{
- struct table_instance *ti = rcu_dereference_ovsl(tbl->ti);
- struct mask_array *ma = rcu_dereference_ovsl(tbl->mask_array);
+ struct table_instance *ti = rcu_dereference_ovs_tbl(tbl->ti, tbl);
+ struct mask_array *ma = rcu_dereference_ovs_tbl(tbl->mask_array, tbl);
u32 __always_unused n_mask_hit;
u32 __always_unused n_cache_hit;
struct sw_flow *flow;
@@ -866,21 +908,22 @@ struct sw_flow *ovs_flow_tbl_lookup(struct flow_table *tbl,
struct sw_flow *ovs_flow_tbl_lookup_exact(struct flow_table *tbl,
const struct sw_flow_match *match)
{
- struct mask_array *ma = ovsl_dereference(tbl->mask_array);
+ struct mask_array *ma = ovs_tbl_dereference(tbl->mask_array, tbl);
int i;
- /* Always called under ovs-mutex. */
+ /* Always called under tbl->lock. */
for (i = 0; i < ma->max; i++) {
- struct table_instance *ti = rcu_dereference_ovsl(tbl->ti);
+ struct table_instance *ti =
+ rcu_dereference_ovs_tbl(tbl->ti, tbl);
u32 __always_unused n_mask_hit;
struct sw_flow_mask *mask;
struct sw_flow *flow;
- mask = ovsl_dereference(ma->masks[i]);
+ mask = ovs_tbl_dereference(ma->masks[i], tbl);
if (!mask)
continue;
- flow = masked_flow_lookup(ti, match->key, mask, &n_mask_hit);
+ flow = masked_flow_lookup(tbl, ti, match->key, mask, &n_mask_hit);
if (flow && ovs_identifier_is_key(&flow->id) &&
ovs_flow_cmp_unmasked_key(flow, match)) {
return flow;
@@ -916,7 +959,7 @@ bool ovs_flow_cmp(const struct sw_flow *flow,
struct sw_flow *ovs_flow_tbl_lookup_ufid(struct flow_table *tbl,
const struct sw_flow_id *ufid)
{
- struct table_instance *ti = rcu_dereference_ovsl(tbl->ufid_ti);
+ struct table_instance *ti = rcu_dereference_ovs_tbl(tbl->ufid_ti, tbl);
struct sw_flow *flow;
struct hlist_head *head;
u32 hash;
@@ -924,7 +967,7 @@ struct sw_flow *ovs_flow_tbl_lookup_ufid(struct flow_table *tbl,
hash = ufid_hash(ufid);
head = find_bucket(ti, hash);
hlist_for_each_entry_rcu(flow, head, ufid_table.node[ti->node_ver],
- lockdep_ovsl_is_held()) {
+ lockdep_ovs_tbl_is_held(tbl)) {
if (flow->ufid_table.hash == hash &&
ovs_flow_cmp_ufid(flow, ufid))
return flow;
@@ -934,28 +977,33 @@ struct sw_flow *ovs_flow_tbl_lookup_ufid(struct flow_table *tbl,
int ovs_flow_tbl_num_masks(const struct flow_table *table)
{
- struct mask_array *ma = rcu_dereference_ovsl(table->mask_array);
+ struct mask_array *ma = rcu_dereference_ovs_tbl(table->mask_array,
+ table);
return READ_ONCE(ma->count);
}
u32 ovs_flow_tbl_masks_cache_size(const struct flow_table *table)
{
- struct mask_cache *mc = rcu_dereference_ovsl(table->mask_cache);
+ struct mask_cache *mc = rcu_dereference_ovs_tbl(table->mask_cache,
+ table);
return READ_ONCE(mc->cache_size);
}
-static struct table_instance *table_instance_expand(struct table_instance *ti,
+static struct table_instance *table_instance_expand(struct flow_table *table,
+ struct table_instance *ti,
bool ufid)
{
- return table_instance_rehash(ti, ti->n_buckets * 2, ufid);
+ return table_instance_rehash(table, ti, ti->n_buckets * 2, ufid);
}
-/* Must be called with OVS mutex held. */
+/* Must be called with table mutex held. */
void ovs_flow_tbl_remove(struct flow_table *table, struct sw_flow *flow)
{
- struct table_instance *ti = ovsl_dereference(table->ti);
- struct table_instance *ufid_ti = ovsl_dereference(table->ufid_ti);
+ struct table_instance *ti = ovs_tbl_dereference(table->ti,
+ table);
+ struct table_instance *ufid_ti = ovs_tbl_dereference(table->ufid_ti,
+ table);
BUG_ON(table->count == 0);
table_instance_flow_free(table, ti, ufid_ti, flow);
@@ -989,10 +1037,10 @@ static struct sw_flow_mask *flow_mask_find(const struct flow_table *tbl,
struct mask_array *ma;
int i;
- ma = ovsl_dereference(tbl->mask_array);
+ ma = ovs_tbl_dereference(tbl->mask_array, tbl);
for (i = 0; i < ma->max; i++) {
struct sw_flow_mask *t;
- t = ovsl_dereference(ma->masks[i]);
+ t = ovs_tbl_dereference(ma->masks[i], tbl);
if (t && mask_equal(mask, t))
return t;
@@ -1030,22 +1078,25 @@ static int flow_mask_insert(struct flow_table *tbl, struct sw_flow *flow,
return 0;
}
-/* Must be called with OVS mutex held. */
+/* Must be called with table mutex held. */
static void flow_key_insert(struct flow_table *table, struct sw_flow *flow)
{
struct table_instance *new_ti = NULL;
struct table_instance *ti;
+ ASSERT_OVS_TBL(table);
+
flow->flow_table.hash = flow_hash(&flow->key, &flow->mask->range);
- ti = ovsl_dereference(table->ti);
+ ti = ovs_tbl_dereference(table->ti, table);
table_instance_insert(ti, flow);
table->count++;
/* Expand table, if necessary, to make room. */
if (table->count > ti->n_buckets)
- new_ti = table_instance_expand(ti, false);
+ new_ti = table_instance_expand(table, ti, false);
else if (time_after(jiffies, table->last_rehash + REHASH_INTERVAL))
- new_ti = table_instance_rehash(ti, ti->n_buckets, false);
+ new_ti = table_instance_rehash(table, ti, ti->n_buckets,
+ false);
if (new_ti) {
rcu_assign_pointer(table->ti, new_ti);
@@ -1054,13 +1105,15 @@ static void flow_key_insert(struct flow_table *table, struct sw_flow *flow)
}
}
-/* Must be called with OVS mutex held. */
+/* Must be called with table mutex held. */
static void flow_ufid_insert(struct flow_table *table, struct sw_flow *flow)
{
struct table_instance *ti;
+ ASSERT_OVS_TBL(table);
+
flow->ufid_table.hash = ufid_hash(&flow->id);
- ti = ovsl_dereference(table->ufid_ti);
+ ti = ovs_tbl_dereference(table->ufid_ti, table);
ufid_table_instance_insert(ti, flow);
table->ufid_count++;
@@ -1068,7 +1121,7 @@ static void flow_ufid_insert(struct flow_table *table, struct sw_flow *flow)
if (table->ufid_count > ti->n_buckets) {
struct table_instance *new_ti;
- new_ti = table_instance_expand(ti, true);
+ new_ti = table_instance_expand(table, ti, true);
if (new_ti) {
rcu_assign_pointer(table->ufid_ti, new_ti);
call_rcu(&ti->rcu, flow_tbl_destroy_rcu_cb);
@@ -1076,12 +1129,14 @@ static void flow_ufid_insert(struct flow_table *table, struct sw_flow *flow)
}
}
-/* Must be called with OVS mutex held. */
+/* Must be called with table mutex held. */
int ovs_flow_tbl_insert(struct flow_table *table, struct sw_flow *flow,
const struct sw_flow_mask *mask)
{
int err;
+ ASSERT_OVS_TBL(table);
+
err = flow_mask_insert(table, flow, mask);
if (err)
return err;
@@ -1100,10 +1155,11 @@ static int compare_mask_and_count(const void *a, const void *b)
return (s64)mc_b->counter - (s64)mc_a->counter;
}
-/* Must be called with OVS mutex held. */
+/* Must be called with table->lock held. */
void ovs_flow_masks_rebalance(struct flow_table *table)
{
- struct mask_array *ma = rcu_dereference_ovsl(table->mask_array);
+ struct mask_array *ma = rcu_dereference_ovs_tbl(table->mask_array,
+ table);
struct mask_count *masks_and_count;
struct mask_array *new;
int masks_entries = 0;
@@ -1119,7 +1175,7 @@ void ovs_flow_masks_rebalance(struct flow_table *table)
struct sw_flow_mask *mask;
int cpu;
- mask = rcu_dereference_ovsl(ma->masks[i]);
+ mask = rcu_dereference_ovs_tbl(ma->masks[i], table);
if (unlikely(!mask))
break;
@@ -1173,7 +1229,7 @@ void ovs_flow_masks_rebalance(struct flow_table *table)
for (i = 0; i < masks_entries; i++) {
int index = masks_and_count[i].index;
- if (ovsl_dereference(ma->masks[index]))
+ if (ovs_tbl_dereference(ma->masks[index], table))
new->masks[new->count++] = ma->masks[index];
}
diff --git a/net/openvswitch/flow_table.h b/net/openvswitch/flow_table.h
index f524dc3e4862..cffd412c9045 100644
--- a/net/openvswitch/flow_table.h
+++ b/net/openvswitch/flow_table.h
@@ -59,7 +59,29 @@ struct table_instance {
u32 hash_seed;
};
+/* Locking:
+ *
+ * flow_table is _not_ protected by ovs_lock (see comment above ovs_mutex
+ * in datapath.c).
+ *
+ * All writes to flow_table are protected by the embedded "lock".
+ * In order to ensure datapath destruction does not trigger the destruction
+ * of the flow_table, "refcnt" is used. Therefore, writers must:
+ * 1 - Enter rcu read-protected section
+ * 2 - Increase "table->refcnt"
+ * 3 - Leave rcu read-protected section (to avoid using mutexes inside rcu)
+ * 4 - Lock "table->lock"
+ * 5 - Perform modifications
+ * 6 - Release "table->lock"
+ * 7 - Decrease "table->refcnt"
+ *
+ * Reads are protected by RCU.
+ */
struct flow_table {
+ /* Locks flow table writes. */
+ struct mutex lock;
+ refcount_t refcnt;
+ struct rcu_head rcu;
struct table_instance __rcu *ti;
struct table_instance __rcu *ufid_ti;
struct mask_cache __rcu *mask_cache;
@@ -71,15 +93,40 @@ struct flow_table {
extern struct kmem_cache *flow_stats_cache;
+#ifdef CONFIG_LOCKDEP
+int lockdep_ovs_tbl_is_held(const struct flow_table *table);
+#else
+static inline int lockdep_ovs_tbl_is_held(const struct flow_table *table)
+{
+ (void)table;
+ return 1;
+}
+#endif
+
+#define ASSERT_OVS_TBL(tbl) WARN_ON(!lockdep_ovs_tbl_is_held(tbl))
+
+/* Lock-protected update-allowed dereferences.*/
+#define ovs_tbl_dereference(p, tbl) \
+ rcu_dereference_protected(p, lockdep_ovs_tbl_is_held(tbl))
+
+/* Read dereferences can be protected by either RCU, table lock or ovs_mutex. */
+#define rcu_dereference_ovs_tbl(p, tbl) \
+ rcu_dereference_check(p, \
+ lockdep_ovs_tbl_is_held(tbl) || lockdep_ovsl_is_held())
+
int ovs_flow_init(void);
void ovs_flow_exit(void);
struct sw_flow *ovs_flow_alloc(void);
void ovs_flow_free(struct sw_flow *, bool deferred);
-int ovs_flow_tbl_init(struct flow_table *);
+struct flow_table *ovs_flow_tbl_alloc(void);
+void ovs_flow_tbl_put(struct flow_table *table);
+static inline bool ovs_flow_tbl_get(struct flow_table *table)
+{
+ return refcount_inc_not_zero(&table->refcnt);
+}
int ovs_flow_tbl_count(const struct flow_table *table);
-void ovs_flow_tbl_destroy(struct flow_table *table);
int ovs_flow_tbl_flush(struct flow_table *flow_table);
int ovs_flow_tbl_insert(struct flow_table *table, struct sw_flow *flow,
--
2.53.0
* Re: [PATCH net-next v1] net: openvswitch: decouple flow_table from ovs_mutex
From: Jakub Kicinski @ 2026-03-13 19:35 UTC (permalink / raw)
To: Adrian Moreno
Cc: netdev, Aaron Conole, Eelco Chaudron, Ilya Maximets,
David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman, dev,
linux-kernel
On Fri, 13 Mar 2026 18:31:12 +0100 Adrian Moreno wrote:
> Currently the entire ovs module is write-protected using the global
> ovs_mutex. While this simple approach works fine for control-plane
> operations (such as vport configurations), requiring the global mutex
> for flow modifications can be problematic.
YNL selftest for ovs seems to trigger this:
[ 88.995118][ T50] =============================
[ 88.995287][ T50] WARNING: suspicious RCU usage
[ 88.995448][ T50] 7.0.0-rc3-virtme #1 Not tainted
[ 88.995630][ T50] -----------------------------
[ 88.995788][ T50] net/openvswitch/datapath.c:2666 RCU-list traversed in non-reader section!!
[ 88.996122][ T50]
[ 88.996122][ T50] other info that might help us debug this:
[ 88.996122][ T50]
[ 88.996388][ T50]
[ 88.996388][ T50] rcu_scheduler_active = 2, debug_locks = 1
[ 88.996640][ T50] 3 locks held by kworker/2:1/50:
[ 88.996800][ T50] #0: ff11000001139b48 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0xcb4/0x1390
[ 88.997092][ T50] #1: ffa000000036fd10 ((work_completion)(&(&ovs_net->masks_rebalance)->work)){+.+.}-{0:0}, at: process_one_work+0xd16/0x1390
[ 88.997420][ T50] #2: ffffffffc08038e8 (ovs_mutex){+.+.}-{4:4}, at: ovs_dp_masks_rebalance+0x29/0x270 [openvswitch]
[ 88.997707][ T50]
[ 88.997707][ T50] stack backtrace:
[ 88.997898][ T50] CPU: 2 UID: 0 PID: 50 Comm: kworker/2:1 Not tainted 7.0.0-rc3-virtme #1 PREEMPT(full)
[ 88.997903][ T50] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 88.997904][ T50] Workqueue: events ovs_dp_masks_rebalance [openvswitch]
[ 88.997911][ T50] Call Trace:
[ 88.997914][ T50] <TASK>
[ 88.997916][ T50] dump_stack_lvl+0x6f/0xa0
[ 88.997921][ T50] lockdep_rcu_suspicious.cold+0x4f/0xad
[ 88.997928][ T50] ovs_dp_masks_rebalance+0x226/0x270 [openvswitch]
[ 88.997933][ T50] process_one_work+0xd57/0x1390
[ 88.997940][ T50] ? pwq_dec_nr_in_flight+0x700/0x700
[ 88.997942][ T50] ? lock_acquire.part.0+0xbc/0x260
[ 88.997950][ T50] worker_thread+0x4d6/0xd40
[ 88.997954][ T50] ? rescuer_thread+0x1330/0x1330
[ 88.997956][ T50] ? __kthread_parkme+0xb3/0x200
[ 88.997960][ T50] ? rescuer_thread+0x1330/0x1330
[ 88.997962][ T50] kthread+0x30f/0x3f0
[ 88.997964][ T50] ? trace_irq_enable.constprop.0+0x13c/0x190
[ 88.997967][ T50] ? kthread_affine_node+0x150/0x150
[ 88.997970][ T50] ret_from_fork+0x472/0x6b0
[ 88.997974][ T50] ? arch_exit_to_user_mode_prepare.isra.0+0x140/0x140
[ 88.997977][ T50] ? __switch_to+0x538/0xcf0
[ 88.997980][ T50] ? kthread_affine_node+0x150/0x150
[ 88.997983][ T50] ret_from_fork_asm+0x11/0x20
[ 88.997991][ T50] </TASK>
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/tools/net/ynl/tests/ovs.c
* [syzbot ci] Re: net: openvswitch: decouple flow_table from ovs_mutex
From: syzbot ci @ 2026-03-14 18:32 UTC (permalink / raw)
To: aconole, amorenoz, davem, dev, echaudro, edumazet, horms,
i.maximets, kuba, linux-kernel, netdev, pabeni
Cc: syzbot, syzkaller-bugs
syzbot ci has tested the following series
[v1] net: openvswitch: decouple flow_table from ovs_mutex
https://lore.kernel.org/all/20260313173114.1220551-1-amorenoz@redhat.com
* [PATCH net-next v1] net: openvswitch: decouple flow_table from ovs_mutex
and found the following issue:
BUG: sleeping function called from invalid context in __alloc_skb
Full report is available here:
https://ci.syzbot.org/series/dad18167-5b3c-4436-a026-ab60850e4342
***
BUG: sleeping function called from invalid context in __alloc_skb
tree: net-next
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/netdev/net-next.git
base: ce8ee8583ed83122405eabaa8fb351be4d9dc65c
arch: amd64
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config: https://ci.syzbot.org/builds/2b4330f1-9bd1-42b5-8e81-a127a9bf1b82/config
C repro: https://ci.syzbot.org/findings/8259a139-586c-46ee-8e3e-4ed8801b7533/c_repro
syz repro: https://ci.syzbot.org/findings/8259a139-586c-46ee-8e3e-4ed8801b7533/syz_repro
BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:323
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 5961, name: syz.0.17
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
2 locks held by syz.0.17/5961:
#0: ffffffff8fc3a570 (cb_lock){++++}-{4:4}, at: genl_rcv+0x19/0x40 net/netlink/genetlink.c:1218
#1: ffffffff8e7602e0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:312 [inline]
#1: ffffffff8e7602e0 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:850 [inline]
#1: ffffffff8e7602e0 (rcu_read_lock){....}-{1:3}, at: ovs_flow_cmd_set+0x3c5/0xd60 net/openvswitch/datapath.c:1320
CPU: 0 UID: 0 PID: 5961 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
__might_resched+0x378/0x4d0 kernel/sched/core.c:8884
might_alloc include/linux/sched/mm.h:323 [inline]
slab_pre_alloc_hook mm/slub.c:4452 [inline]
slab_alloc_node mm/slub.c:4807 [inline]
kmem_cache_alloc_node_noprof+0x7f/0x690 mm/slub.c:4882
__alloc_skb+0x1d0/0x7d0 net/core/skbuff.c:702
alloc_skb include/linux/skbuff.h:1383 [inline]
nlmsg_new include/net/netlink.h:1055 [inline]
netlink_ack+0x146/0xa50 net/netlink/af_netlink.c:2487
netlink_rcv_skb+0x2b6/0x4b0 net/netlink/af_netlink.c:2556
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x80f/0x9b0 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0xa68/0xad0 net/socket.c:2592
___sys_sendmsg+0x2a5/0x360 net/socket.c:2646
__sys_sendmsg net/socket.c:2678 [inline]
__do_sys_sendmsg net/socket.c:2683 [inline]
__se_sys_sendmsg net/socket.c:2681 [inline]
__x64_sys_sendmsg+0x1bd/0x2a0 net/socket.c:2681
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f4c60b9c799
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc44689868 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f4c60e15fa0 RCX: 00007f4c60b9c799
RDX: 000000000000c000 RSI: 0000200000000000 RDI: 0000000000000003
RBP: 00007f4c60c32c99 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f4c60e15fac R14: 00007f4c60e15fa0 R15: 00007f4c60e15fa0
</TASK>
================================================
WARNING: lock held when returning to user space!
syzkaller #0 Tainted: G W
------------------------------------------------
syz.0.17/5961 is leaving the kernel with locks still held!
1 lock held by syz.0.17/5961:
#0: ffffffff8e7602e0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:312 [inline]
#0: ffffffff8e7602e0 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:850 [inline]
#0: ffffffff8e7602e0 (rcu_read_lock){....}-{1:3}, at: ovs_flow_cmd_set+0x3c5/0xd60 net/openvswitch/datapath.c:1320
------------[ cut here ]------------
Voluntary context switch within RCU read-side critical section!
WARNING: kernel/rcu/tree_plugin.h:332 at rcu_note_context_switch+0xcac/0xf40 kernel/rcu/tree_plugin.h:332, CPU#1: syz.0.17/5961
Modules linked in:
CPU: 1 UID: 0 PID: 5961 Comm: syz.0.17 Tainted: G W syzkaller #0 PREEMPT(full)
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:rcu_note_context_switch+0xcac/0xf40 kernel/rcu/tree_plugin.h:332
Code: 00 41 c6 45 00 00 48 8b 3d b1 e5 65 0e 48 81 c4 b8 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d e9 4b 63 ff ff 48 8d 3d 14 c3 69 0e <67> 48 0f b9 3a e9 1b f4 ff ff 90 0f 0b 90 45 84 e4 0f 84 ea f3 ff
RSP: 0000:ffffc900061dfbb0 EFLAGS: 00010002
RAX: 0000000000000000 RBX: ffff8881626b3a00 RCX: 0000000080000002
RDX: 0000000000000000 RSI: ffffffff8c27a4e0 RDI: ffffffff90150f10
RBP: dffffc0000000000 R08: ffffffff901183b7 R09: 1ffffffff2023076
R10: dffffc0000000000 R11: fffffbfff2023077 R12: 0000000000000000
R13: 0000000000000000 R14: ffff88823c63bd40 R15: ffff8881626b3e84
FS: 000055556739f500(0000) GS:ffff8882a9464000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffc9debcf78 CR3: 000000016d29c000 CR4: 00000000000006f0
Call Trace:
<TASK>
__schedule+0x2ff/0x5340 kernel/sched/core.c:6791
__schedule_loop kernel/sched/core.c:6989 [inline]
schedule+0x164/0x360 kernel/sched/core.c:7004
__exit_to_user_mode_loop kernel/entry/common.c:54 [inline]
exit_to_user_mode_loop kernel/entry/common.c:98 [inline]
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
irqentry_exit_to_user_mode_prepare include/linux/irq-entry-common.h:270 [inline]
irqentry_exit_to_user_mode include/linux/irq-entry-common.h:339 [inline]
irqentry_exit+0x155/0x620 kernel/entry/common.c:219
asm_sysvec_call_function_single+0x1a/0x20 arch/x86/include/asm/idtentry.h:704
RIP: 0033:0x7f4c60b9c799
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc44689868 EFLAGS: 00000246
RAX: 0000000000000020 RBX: 00007f4c60e15fa0 RCX: 00007f4c60b9c799
RDX: 000000000000c000 RSI: 0000200000000000 RDI: 0000000000000003
RBP: 00007f4c60c32c99 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f4c60e15fac R14: 00007f4c60e15fa0 R15: 00007f4c60e15fa0
</TASK>
----------------
Code disassembly (best guess):
0: 00 41 c6 add %al,-0x3a(%rcx)
3: 45 00 00 add %r8b,(%r8)
6: 48 8b 3d b1 e5 65 0e mov 0xe65e5b1(%rip),%rdi # 0xe65e5be
d: 48 81 c4 b8 00 00 00 add $0xb8,%rsp
14: 5b pop %rbx
15: 41 5c pop %r12
17: 41 5d pop %r13
19: 41 5e pop %r14
1b: 41 5f pop %r15
1d: 5d pop %rbp
1e: e9 4b 63 ff ff jmp 0xffff636e
23: 48 8d 3d 14 c3 69 0e lea 0xe69c314(%rip),%rdi # 0xe69c33e
* 2a: 67 48 0f b9 3a ud1 (%edx),%rdi <-- trapping instruction
2f: e9 1b f4 ff ff jmp 0xfffff44f
34: 90 nop
35: 0f 0b ud2
37: 90 nop
38: 45 84 e4 test %r12b,%r12b
3b: 0f .byte 0xf
3c: 84 ea test %ch,%dl
3e: f3 repz
3f: ff .byte 0xff
***
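For context, the two warnings above describe one underlying problem: a
sleeping lock is taken (and a reschedule happens) while an RCU read-side
critical section is still open, and the read lock is then never released
before returning to user space. A purely illustrative sketch of that
anti-pattern, with generic names rather than the actual datapath code,
looks like this:

	/* Illustrative anti-pattern only (generic names):
	 * a mutex may sleep, so taking it inside rcu_read_lock() allows a
	 * voluntary context switch within the RCU read-side critical
	 * section; forgetting rcu_read_unlock() afterwards then leaves the
	 * lock held on return to user space.
	 */
	rcu_read_lock();
	obj = rcu_dereference(ptr);
	mutex_lock(&obj->lock);      /* may sleep under rcu_read_lock() */
	do_update(obj);
	mutex_unlock(&obj->lock);
	return 0;                    /* rcu_read_unlock() never called */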
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH net-next v1] net: openvswitch: decouple flow_table from ovs_mutex
2026-03-13 19:35 ` Jakub Kicinski
@ 2026-03-16 12:01 ` Adrián Moreno
0 siblings, 0 replies; 6+ messages in thread
From: Adrián Moreno @ 2026-03-16 12:01 UTC (permalink / raw)
To: Jakub Kicinski
Cc: netdev, Aaron Conole, Eelco Chaudron, Ilya Maximets,
David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman, dev,
linux-kernel
On Fri, Mar 13, 2026 at 12:35:35PM -0700, Jakub Kicinski wrote:
> On Fri, 13 Mar 2026 18:31:12 +0100 Adrian Moreno wrote:
> > Currently the entire ovs module is write-protected using the global
> > ovs_mutex. While this simple approach works fine for control-plane
> > operations (such as vport configurations), requiring the global mutex
> > for flow modifications can be problematic.
>
> YNL selftest for ovs seems to trigger this:
>
> [ 88.995118][ T50] =============================
> [ 88.995287][ T50] WARNING: suspicious RCU usage
> [ 88.995448][ T50] 7.0.0-rc3-virtme #1 Not tainted
> [ 88.995630][ T50] -----------------------------
> [ 88.995788][ T50] net/openvswitch/datapath.c:2666 RCU-list traversed in non-reader section!!
> [ 88.996122][ T50]
> [ 88.996122][ T50] other info that might help us debug this:
> [ 88.996122][ T50]
> [ 88.996388][ T50]
> [ 88.996388][ T50] rcu_scheduler_active = 2, debug_locks = 1
> [ 88.996640][ T50] 3 locks held by kworker/2:1/50:
> [ 88.996800][ T50] #0: ff11000001139b48 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0xcb4/0x1390
> [ 88.997092][ T50] #1: ffa000000036fd10 ((work_completion)(&(&ovs_net->masks_rebalance)->work)){+.+.}-{0:0}, at: process_one_work+0xd16/0x1390
> [ 88.997420][ T50] #2: ffffffffc08038e8 (ovs_mutex){+.+.}-{4:4}, at: ovs_dp_masks_rebalance+0x29/0x270 [openvswitch]
> [ 88.997707][ T50]
> [ 88.997707][ T50] stack backtrace:
> [ 88.997898][ T50] CPU: 2 UID: 0 PID: 50 Comm: kworker/2:1 Not tainted 7.0.0-rc3-virtme #1 PREEMPT(full)
> [ 88.997903][ T50] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [ 88.997904][ T50] Workqueue: events ovs_dp_masks_rebalance [openvswitch]
> [ 88.997911][ T50] Call Trace:
> [ 88.997914][ T50] <TASK>
> [ 88.997916][ T50] dump_stack_lvl+0x6f/0xa0
> [ 88.997921][ T50] lockdep_rcu_suspicious.cold+0x4f/0xad
> [ 88.997928][ T50] ovs_dp_masks_rebalance+0x226/0x270 [openvswitch]
> [ 88.997933][ T50] process_one_work+0xd57/0x1390
> [ 88.997940][ T50] ? pwq_dec_nr_in_flight+0x700/0x700
> [ 88.997942][ T50] ? lock_acquire.part.0+0xbc/0x260
> [ 88.997950][ T50] worker_thread+0x4d6/0xd40
> [ 88.997954][ T50] ? rescuer_thread+0x1330/0x1330
> [ 88.997956][ T50] ? __kthread_parkme+0xb3/0x200
> [ 88.997960][ T50] ? rescuer_thread+0x1330/0x1330
> [ 88.997962][ T50] kthread+0x30f/0x3f0
> [ 88.997964][ T50] ? trace_irq_enable.constprop.0+0x13c/0x190
> [ 88.997967][ T50] ? kthread_affine_node+0x150/0x150
> [ 88.997970][ T50] ret_from_fork+0x472/0x6b0
> [ 88.997974][ T50] ? arch_exit_to_user_mode_prepare.isra.0+0x140/0x140
> [ 88.997977][ T50] ? __switch_to+0x538/0xcf0
> [ 88.997980][ T50] ? kthread_affine_node+0x150/0x150
> [ 88.997983][ T50] ret_from_fork_asm+0x11/0x20
> [ 88.997991][ T50] </TASK>
>
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/tools/net/ynl/tests/ovs.c
>
Thanks!
I wonder why this did not come up in my tests. Jakub, would you mind
sharing the config used for this test?
For mask rebalancing I initially thought of doing the same thing as for
the other flow commands (rcu + refcount + flow_table mutex).
But given that mask rebalancing is not as critical and does not run in the
context of the handler threads, I fell back to plain locking (ovs_mutex) in
this case. That "list_for_each_entry_rcu" is a leftover of my initial
attempt and should be replaced with a normal "list_for_each_entry".
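A minimal sketch of that change, assuming the rebalance worker keeps its
current overall shape (the exact dereference helper for the now
RCU-annotated dp->table is an assumption on my side):

	/* Walk the datapath list under ovs_mutex instead of pretending to
	 * be an RCU reader; the dereference helper for dp->table is to be
	 * confirmed in the next version.
	 */
	ovs_lock();
	list_for_each_entry(dp, &ovs_net->dps, list_node)
		ovs_flow_masks_rebalance(ovsl_dereference(dp->table));
	ovs_unlock();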
Thanks.
Adrián
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH net-next v1] net: openvswitch: decouple flow_table from ovs_mutex
2026-03-13 17:31 [PATCH net-next v1] net: openvswitch: decouple flow_table from ovs_mutex Adrian Moreno
2026-03-13 19:35 ` Jakub Kicinski
2026-03-14 18:32 ` [syzbot ci] " syzbot ci
@ 2026-03-17 13:30 ` kernel test robot
2026-03-18 14:09 ` kernel test robot
3 siblings, 0 replies; 6+ messages in thread
From: kernel test robot @ 2026-03-17 13:30 UTC (permalink / raw)
To: Adrian Moreno, netdev
Cc: oe-kbuild-all, Adrian Moreno, Aaron Conole, Eelco Chaudron,
Ilya Maximets, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, dev, linux-kernel
Hi Adrian,
kernel test robot noticed the following build warnings:
[auto build test WARNING on net-next/main]
url: https://github.com/intel-lab-lkp/linux/commits/Adrian-Moreno/net-openvswitch-decouple-flow_table-from-ovs_mutex/20260314-152511
base: net-next/main
patch link: https://lore.kernel.org/r/20260313173114.1220551-1-amorenoz%40redhat.com
patch subject: [PATCH net-next v1] net: openvswitch: decouple flow_table from ovs_mutex
config: arm-randconfig-r132-20260317 (https://download.01.org/0day-ci/archive/20260317/202603172153.FuWl19zD-lkp@intel.com/config)
compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project f46a5153850c1303d687233d4adf699b01041da8)
sparse: v0.6.5-rc1
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260317/202603172153.FuWl19zD-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603172153.FuWl19zD-lkp@intel.com/
sparse warnings: (new ones prefixed by >>)
>> net/openvswitch/datapath.c:2016:28: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct flow_table *table @@ got struct flow_table [noderef] __rcu *table @@
net/openvswitch/datapath.c:2016:28: sparse: expected struct flow_table *table
net/openvswitch/datapath.c:2016:28: sparse: got struct flow_table [noderef] __rcu *table
vim +2016 net/openvswitch/datapath.c
1913
1914 static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info)
1915 {
1916 struct nlattr **a = info->attrs;
1917 struct flow_table *table;
1918 struct vport_parms parms;
1919 struct sk_buff *reply;
1920 struct datapath *dp;
1921 struct vport *vport;
1922 struct ovs_net *ovs_net;
1923 int err;
1924
1925 err = -EINVAL;
1926 if (!a[OVS_DP_ATTR_NAME] || !a[OVS_DP_ATTR_UPCALL_PID])
1927 goto err;
1928
1929 reply = ovs_dp_cmd_alloc_info();
1930 if (!reply)
1931 return -ENOMEM;
1932
1933 err = -ENOMEM;
1934 dp = kzalloc_obj(*dp);
1935 if (dp == NULL)
1936 goto err_destroy_reply;
1937
1938 ovs_dp_set_net(dp, sock_net(skb->sk));
1939
1940 /* Allocate table. */
1941 table = ovs_flow_tbl_alloc();
1942 if (IS_ERR(table)) {
1943 err = PTR_ERR(table);
1944 goto err_destroy_dp;
1945 }
1946 rcu_assign_pointer(dp->table, table);
1947
1948 err = ovs_dp_stats_init(dp);
1949 if (err)
1950 goto err_destroy_table;
1951
1952 err = ovs_dp_vport_init(dp);
1953 if (err)
1954 goto err_destroy_stats;
1955
1956 err = ovs_meters_init(dp);
1957 if (err)
1958 goto err_destroy_ports;
1959
1960 /* Set up our datapath device. */
1961 parms.name = nla_data(a[OVS_DP_ATTR_NAME]);
1962 parms.type = OVS_VPORT_TYPE_INTERNAL;
1963 parms.options = NULL;
1964 parms.dp = dp;
1965 parms.port_no = OVSP_LOCAL;
1966 parms.upcall_portids = a[OVS_DP_ATTR_UPCALL_PID];
1967 parms.desired_ifindex = nla_get_s32_default(a[OVS_DP_ATTR_IFINDEX], 0);
1968
1969 /* So far only local changes have been made, now need the lock. */
1970 ovs_lock();
1971
1972 err = ovs_dp_change(dp, a);
1973 if (err)
1974 goto err_unlock_and_destroy_meters;
1975
1976 vport = new_vport(&parms);
1977 if (IS_ERR(vport)) {
1978 err = PTR_ERR(vport);
1979 if (err == -EBUSY)
1980 err = -EEXIST;
1981
1982 if (err == -EEXIST) {
1983 /* An outdated user space instance that does not understand
1984 * the concept of user_features has attempted to create a new
1985 * datapath and is likely to reuse it. Drop all user features.
1986 */
1987 if (info->genlhdr->version < OVS_DP_VER_FEATURES)
1988 ovs_dp_reset_user_features(skb, info);
1989 }
1990
1991 goto err_destroy_portids;
1992 }
1993
1994 err = ovs_dp_cmd_fill_info(dp, reply, info->snd_portid,
1995 info->snd_seq, 0, OVS_DP_CMD_NEW);
1996 BUG_ON(err < 0);
1997
1998 ovs_net = net_generic(ovs_dp_get_net(dp), ovs_net_id);
1999 list_add_tail_rcu(&dp->list_node, &ovs_net->dps);
2000
2001 ovs_unlock();
2002
2003 ovs_notify(&dp_datapath_genl_family, reply, info);
2004 return 0;
2005
2006 err_destroy_portids:
2007 kfree(rcu_dereference_raw(dp->upcall_portids));
2008 err_unlock_and_destroy_meters:
2009 ovs_unlock();
2010 ovs_meters_exit(dp);
2011 err_destroy_ports:
2012 kfree(dp->ports);
2013 err_destroy_stats:
2014 free_percpu(dp->stats_percpu);
2015 err_destroy_table:
> 2016 ovs_flow_tbl_put(dp->table);
2017 err_destroy_dp:
2018 kfree(dp);
2019 err_destroy_reply:
2020 kfree_skb(reply);
2021 err:
2022 return err;
2023 }
2024
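One possible way to address the sparse complaint in this error path,
assuming the intent is simply to drop the reference on the table that was
just allocated, is to pass the local plain pointer rather than the
__rcu-annotated dp->table, e.g.:

	 err_destroy_table:
	-	ovs_flow_tbl_put(dp->table);
	+	ovs_flow_tbl_put(table);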
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH net-next v1] net: openvswitch: decouple flow_table from ovs_mutex
2026-03-13 17:31 [PATCH net-next v1] net: openvswitch: decouple flow_table from ovs_mutex Adrian Moreno
` (2 preceding siblings ...)
2026-03-17 13:30 ` [PATCH net-next v1] " kernel test robot
@ 2026-03-18 14:09 ` kernel test robot
3 siblings, 0 replies; 6+ messages in thread
From: kernel test robot @ 2026-03-18 14:09 UTC (permalink / raw)
To: Adrian Moreno
Cc: oe-lkp, lkp, netdev, dev, Adrian Moreno, Aaron Conole,
Eelco Chaudron, Ilya Maximets, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, linux-kernel,
oliver.sang
Hello,
kernel test robot noticed "WARNING:suspicious_RCU_usage" on:
commit: e556fbe58fe8144bc9e7f0909344b22af53c05ce ("[PATCH net-next v1] net: openvswitch: decouple flow_table from ovs_mutex")
url: https://github.com/intel-lab-lkp/linux/commits/Adrian-Moreno/net-openvswitch-decouple-flow_table-from-ovs_mutex/20260314-152511
base: https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git 9089c5f3c444ad6e9eb172e9375615ed0b0bc31c
patch link: https://lore.kernel.org/all/20260313173114.1220551-1-amorenoz@redhat.com/
patch subject: [PATCH net-next v1] net: openvswitch: decouple flow_table from ovs_mutex
in testcase: boot
config: x86_64-randconfig-013-20260317
compiler: clang-20
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 32G
(please refer to attached dmesg/kmsg for entire log/backtrace)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202603182149.54af5f9b-lkp@intel.com
[ 67.815648][ T24] WARNING: suspicious RCU usage
[ 67.819156][ T24] 7.0.0-rc3-00655-ge556fbe58fe8 #1 Tainted: G T
[ 67.821304][ T24] -----------------------------
[ 67.822695][ T24] net/openvswitch/datapath.c:2666 RCU-list traversed in non-reader section!!
[ 67.825209][ T24]
[ 67.825209][ T24] other info that might help us debug this:
[ 67.825209][ T24]
[ 67.828074][ T24]
[ 67.828074][ T24] rcu_scheduler_active = 2, debug_locks = 1
[ 67.830500][ T24] 3 locks held by kworker/0:2/24:
[ 67.831886][ T24] #0: ffff88810007db40 ((wq_completion)events){+.+.}-{0:0}, at: process_scheduled_works (kernel/workqueue.c:3250)
[ 67.835011][ T24] #1: ffff88810184fd30 ((work_completion)(&(&ovs_net->masks_rebalance)->work)){+.+.}-{0:0}, at: process_scheduled_works (kernel/workqueue.c:3251)
[ 67.841040][ T24] #2: ffffffff96bd5220 (ovs_mutex){+.+.}-{4:4}, at: ovs_dp_masks_rebalance (net/openvswitch/datapath.c:2666)
[ 67.843339][ T24]
[ 67.843339][ T24] stack backtrace:
[ 67.844454][ T24] CPU: 0 UID: 0 PID: 24 Comm: kworker/0:2 Tainted: G T 7.0.0-rc3-00655-ge556fbe58fe8 #1 PREEMPT(lazy)
[ 67.846651][ T24] Tainted: [T]=RANDSTRUCT
[ 67.846651][ T24] Workqueue: events ovs_dp_masks_rebalance
[ 67.846651][ T24] Call Trace:
[ 67.846651][ T24] <TASK>
[ 67.846651][ T24] dump_stack_lvl (lib/dump_stack.c:123)
[ 67.846651][ T24] lockdep_rcu_suspicious (kernel/locking/lockdep.c:6877)
[ 67.846651][ T24] ovs_dp_masks_rebalance (net/openvswitch/datapath.c:?)
[ 67.846651][ T24] ? process_scheduled_works (kernel/workqueue.c:3251)
[ 67.846651][ T24] process_scheduled_works (kernel/workqueue.c:?)
[ 67.846651][ T24] worker_thread (kernel/workqueue.c:?)
[ 67.846651][ T24] ? _raw_spin_unlock_irqrestore (arch/x86/include/asm/preempt.h:104 kernel/locking/spinlock.c:194)
[ 67.846651][ T24] kthread (kernel/kthread.c:438)
[ 67.846651][ T24] ? worker_attach_to_pool (kernel/workqueue.c:3385)
[ 67.846651][ T24] ? kthread_unuse_mm (kernel/kthread.c:381)
[ 67.846651][ T24] ret_from_fork (arch/x86/kernel/process.c:164)
[ 67.846651][ T24] ? kthread_unuse_mm (kernel/kthread.c:381)
[ 67.846651][ T24] ret_from_fork_asm (arch/x86/entry/entry_64.S:258)
[ 67.846651][ T24] </TASK>
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20260318/202603182149.54af5f9b-lkp@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2026-03-18 14:09 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-13 17:31 [PATCH net-next v1] net: openvswitch: decouple flow_table from ovs_mutex Adrian Moreno
2026-03-13 19:35 ` Jakub Kicinski
2026-03-16 12:01 ` Adrián Moreno
2026-03-14 18:32 ` [syzbot ci] " syzbot ci
2026-03-17 13:30 ` [PATCH net-next v1] " kernel test robot
2026-03-18 14:09 ` kernel test robot