[PATCH v5 net-next 0/2] net: RPS table overwrite prevention and flow

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v5 net-next 0/2] net: RPS table overwrite prevention and flow_id caching
@ 2025-07-21  3:16 Krishna Kumar
  2025-07-21  3:16 ` [PATCH v5 net-next 1/2] net: Prevent RPS table overwrite for active flows Krishna Kumar
  2025-07-21  3:16 ` [PATCH v5 net-next 2/2] net: Cache hash and flow_id to avoid recalculation Krishna Kumar
  0 siblings, 2 replies; 8+ messages in thread
From: Krishna Kumar @ 2025-07-21  3:16 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, tom, kuba, pabeni, horms, sdf, kuniyu,
	ahmed.zaki, aleksander.lobakin, atenart, jdamato, krishna.ku,
	krikku

This series splits the previous RPS patch [1] into two patches for
net-next following reviewer feedback. It also addresses a kernel test
robot warning by ensuring rps_flow_is_active() is defined only when
aRFS is enabled. I tested v3 with four builds and reboots: two for
[PATCH 1/2] with aRFS enabled/disabled, and two for [PATCH 2/2].
There are no code changes in v4 and v5, only documentation, etc.

The first patch prevents RPS table overwrite for active flows thereby
improving aRFS stability.

The second patch caches hash & flow_id in get_rps_cpu() to avoid
recalculating it in set_rps_cpu() (this patch depends on the first).

[1] https://lore.kernel.org/netdev/20250708081516.53048-1-krikku@gmail.com/
[2] https://lore.kernel.org/netdev/20250717064554.3e4d9993@kernel.org/

Signed-off-by: Krishna Kumar <krikku@gmail.com>
---
v5: Same v4 patch sent with a change in documentation style for "return".
v4: Same v3 patch sent as a new thread instead of a continuation.
v3: Wrapped rps_flow_is_active() in #ifdef CONFIG_RFS_ACCEL to fix
    unused function warning reported by kernel test robot.
v2: Split original patch into two: RPS table overwrite prevention and hash/
    flow_id caching.

Krishna Kumar (2):
  net: Prevent RPS table overwrite for active flows
  net: Cache hash and flow_id to avoid recalculation

 include/net/rps.h    |  5 ++-
 net/core/dev.c       | 89 ++++++++++++++++++++++++++++++++++++++------
 net/core/net-sysfs.c |  4 +-
 3 files changed, 84 insertions(+), 14 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v5 net-next 1/2] net: Prevent RPS table overwrite for active flows
  2025-07-21  3:16 [PATCH v5 net-next 0/2] net: RPS table overwrite prevention and flow_id caching Krishna Kumar
@ 2025-07-21  3:16 ` Krishna Kumar
  2025-07-21  8:14   ` Eric Dumazet
  2025-07-21  8:21   ` Eric Dumazet
  2025-07-21  3:16 ` [PATCH v5 net-next 2/2] net: Cache hash and flow_id to avoid recalculation Krishna Kumar
  1 sibling, 2 replies; 8+ messages in thread
From: Krishna Kumar @ 2025-07-21  3:16 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, tom, kuba, pabeni, horms, sdf, kuniyu,
	ahmed.zaki, aleksander.lobakin, atenart, jdamato, krishna.ku,
	krikku

This patch fixes an issue where two different flows on the same RXq
produce the same hash resulting in continuous flow overwrites.

Flow #1: A packet for Flow #1 comes in, kernel calls the steering
         function. The driver gives back a filter id. The kernel saves
	 this filter id in the selected slot. Later, the driver's
	 service task checks if any filters have expired and then
	 installs the rule for Flow #1.
Flow #2: A packet for Flow #2 comes in. It goes through the same steps.
         But this time, the chosen slot is being used by Flow #1. The
	 driver gives a new filter id and the kernel saves it in the
	 same slot. When the driver's service task runs, it runs through
	 all the flows, checks if Flow #1 should be expired, the kernel
	 returns True as the slot has a different filter id, and then
	 the driver installs the rule for Flow #2.
Flow #1: Another packet for Flow #1 comes in. The same thing repeats.
         The slot is overwritten with a new filter id for Flow #1.

This causes a repeated cycle of flow programming for missed packets,
wasting CPU cycles while not improving performance. This problem happens
at higher rates when the RPS table is small, but tests show it still
happens even with 12,000 connections and an RPS size of 16K per queue
(global table size = 144x16K = 64K).

This patch prevents overwriting an rps_dev_flow entry if it is active.
The intention is that it is better to do aRFS for the first flow instead
of hurting all flows on the same hash. Without this, two (or more) flows
on one RX queue with the same hash can keep overwriting each other. This
causes the driver to reprogram the flow repeatedly.

Changes:
  1. Add a new 'hash' field to struct rps_dev_flow.
  2. Add rps_flow_is_active(): a helper function to check if a flow is
     active or not, extracted from rps_may_expire_flow().
  3. In set_rps_cpu():
     - Avoid overwriting by programming a new filter if:
        - The slot is not in use, or
        - The slot is in use but the flow is not active, or
        - The slot has an active flow with the same hash, but target CPU
          differs.
     - Save the hash in the rps_dev_flow entry.
  4. rps_may_expire_flow(): Use earlier extracted rps_flow_is_active().

Testing & results:
  - Driver: ice (E810 NIC), Kernel: net-next
  - #CPUs = #RXq = 144 (1:1)
  - Number of flows: 12K
  - Eight RPS settings from 256 to 32768. Though RPS=256 is not ideal,
    it is still sufficient to cover 12K flows (256*144 rx-queues = 64K
    global table slots)
  - Global Table Size = 144 * RPS (effectively equal to 256 * RPS)
  - Each RPS test duration = 8 mins (org code) + 8 mins (new code).
  - Metrics captured on client

Legend for following tables:
Steer-C: #times ndo_rx_flow_steer() was Called by set_rps_cpu()
Steer-L: #times ice_arfs_flow_steer() Looped over aRFS entries
Add:     #times driver actually programmed aRFS (ice_arfs_build_entry())
Del:     #times driver deleted the flow (ice_arfs_del_flow_rules())
Units:   K = 1,000 times, M = 1 million times

  |-------|---------|------|     Org Code    |---------|---------|
  | RPS   | Latency | CPU  | Add    |  Del   | Steer-C | Steer-L |
  |-------|---------|------|--------|--------|---------|---------|
  | 256   | 227.0   | 93.2 | 1.6M   | 1.6M   | 121.7M  | 267.6M  |
  | 512   | 225.9   | 94.1 | 11.5M  | 11.2M  | 65.7M   | 199.6M  |
  | 1024  | 223.5   | 95.6 | 16.5M  | 16.5M  | 27.1M   | 187.3M  |
  | 2048  | 222.2   | 96.3 | 10.5M  | 10.5M  | 12.5M   | 115.2M  |
  | 4096  | 223.9   | 94.1 | 5.5M   | 5.5M   | 7.2M    | 65.9M   |
  | 8192  | 224.7   | 92.5 | 2.7M   | 2.7M   | 3.0M    | 29.9M   |
  | 16384 | 223.5   | 92.5 | 1.3M   | 1.3M   | 1.4M    | 13.9M   |
  | 32768 | 219.6   | 93.2 | 838.1K | 838.1K | 965.1K  | 8.9M    |
  |-------|---------|------|   New Code      |---------|---------|
  | 256   | 201.5   | 99.1 | 13.4K  | 5.0K   | 13.7K   | 75.2K   |
  | 512   | 202.5   | 98.2 | 11.2K  | 5.9K   | 11.2K   | 55.5K   |
  | 1024  | 207.3   | 93.9 | 11.5K  | 9.7K   | 11.5K   | 59.6K   |
  | 2048  | 207.5   | 96.7 | 11.8K  | 11.1K  | 15.5K   | 79.3K   |
  | 4096  | 206.9   | 96.6 | 11.8K  | 11.7K  | 11.8K   | 63.2K   |
  | 8192  | 205.8   | 96.7 | 11.9K  | 11.8K  | 11.9K   | 63.9K   |
  | 16384 | 200.9   | 98.2 | 11.9K  | 11.9K  | 11.9K   | 64.2K   |
  | 32768 | 202.5   | 98.0 | 11.9K  | 11.9K  | 11.9K   | 64.2K   |
  |-------|---------|------|--------|--------|---------|---------|

Some observations:
  1. Overall Latency improved: (1790.19-1634.94)/1790.19*100 = 8.67%
  2. Overall CPU increased:    (777.32-751.49)/751.45*100    = 3.44%
  3. Flow Management (add/delete) remained almost constant at ~11K
     compared to values in millions.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202507161125.rUCoz9ov-lkp@intel.com/
Signed-off-by: Krishna Kumar <krikku@gmail.com>
---
 include/net/rps.h    |  5 +--
 net/core/dev.c       | 82 ++++++++++++++++++++++++++++++++++++++++----
 net/core/net-sysfs.c |  4 ++-
 3 files changed, 81 insertions(+), 10 deletions(-)

diff --git a/include/net/rps.h b/include/net/rps.h
index d8ab3a08bcc4..8e33dbea9327 100644
--- a/include/net/rps.h
+++ b/include/net/rps.h
@@ -25,13 +25,14 @@ struct rps_map {
 
 /*
  * The rps_dev_flow structure contains the mapping of a flow to a CPU, the
- * tail pointer for that CPU's input queue at the time of last enqueue, and
- * a hardware filter index.
+ * tail pointer for that CPU's input queue at the time of last enqueue, a
+ * hardware filter index, and the hash of the flow.
  */
 struct rps_dev_flow {
 	u16		cpu;
 	u16		filter;
 	unsigned int	last_qtail;
+	u32		hash;
 };
 #define RPS_NO_FILTER 0xffff
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 354d3453b407..979dcdfdee1f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4837,6 +4837,28 @@ static u32 rfs_slot(u32 hash, const struct rps_dev_flow_table *flow_table)
 	return hash_32(hash, flow_table->log);
 }
 
+#ifdef CONFIG_RFS_ACCEL
+/**
+ * rps_flow_is_active - check whether the flow is recently active.
+ * @rflow: Specific flow to check activity.
+ * @flow_table: Check activity against the flow_table's size.
+ * @cpu: CPU saved in @rflow.
+ *
+ * If the CPU has processed many packets since the flow's last activity
+ * (beyond 10 times the table size), the flow is considered stale.
+ *
+ * Return: true if flow was recently active.
+ */
+static bool rps_flow_is_active(struct rps_dev_flow *rflow,
+			       struct rps_dev_flow_table *flow_table,
+			       unsigned int cpu)
+{
+	return cpu < nr_cpu_ids &&
+	       ((int)(READ_ONCE(per_cpu(softnet_data, cpu).input_queue_head) -
+		READ_ONCE(rflow->last_qtail)) < (int)(10 << flow_table->log));
+}
+#endif
+
 static struct rps_dev_flow *
 set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	    struct rps_dev_flow *rflow, u16 next_cpu)
@@ -4847,8 +4869,11 @@ set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 		struct netdev_rx_queue *rxqueue;
 		struct rps_dev_flow_table *flow_table;
 		struct rps_dev_flow *old_rflow;
+		struct rps_dev_flow *tmp_rflow;
+		unsigned int tmp_cpu;
 		u16 rxq_index;
 		u32 flow_id;
+		u32 hash;
 		int rc;
 
 		/* Should we steer this flow to a different hardware queue? */
@@ -4863,14 +4888,58 @@ set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 		flow_table = rcu_dereference(rxqueue->rps_flow_table);
 		if (!flow_table)
 			goto out;
-		flow_id = rfs_slot(skb_get_hash(skb), flow_table);
+
+		hash = skb_get_hash(skb);
+		flow_id = rfs_slot(hash, flow_table);
+
+		tmp_rflow = &flow_table->flows[flow_id];
+		tmp_cpu = READ_ONCE(tmp_rflow->cpu);
+
+		/* Make sure this slot is usable before enabling steer */
+		if (READ_ONCE(tmp_rflow->filter) != RPS_NO_FILTER) {
+			/* This slot has an entry */
+			if (rps_flow_is_active(tmp_rflow, flow_table,
+					       tmp_cpu)) {
+				/*
+				 * This slot has an active "programmed" flow.
+				 * Break out if the cached value stored is for
+				 * a different flow, or (for our flow) the
+				 * rx-queue# did not change.
+				 */
+				if (hash != READ_ONCE(tmp_rflow->hash) ||
+				    next_cpu == tmp_cpu) {
+					/*
+					 * Don't unnecessarily reprogram if:
+					 * 1. This slot has an active different
+					 *    flow.
+					 * 2. This slot has the same flow (very
+					 *    likely but not guaranteed) and
+					 *    the rx-queue# did not change.
+					 */
+					goto out;
+				}
+			}
+
+			/*
+			 * When we overwrite the flow, the driver still has
+			 * the cached entry. But drivers will check if the
+			 * flow is active and rps_may_expire_entry() will
+			 * return False and driver will delete it soon. Hence
+			 * inconsistency between kernel & driver is quickly
+			 * resolved.
+			 */
+		}
+
 		rc = dev->netdev_ops->ndo_rx_flow_steer(dev, skb,
 							rxq_index, flow_id);
 		if (rc < 0)
 			goto out;
+
 		old_rflow = rflow;
-		rflow = &flow_table->flows[flow_id];
+		rflow = tmp_rflow;
 		WRITE_ONCE(rflow->filter, rc);
+		WRITE_ONCE(rflow->hash, hash);
+
 		if (old_rflow->filter == rc)
 			WRITE_ONCE(old_rflow->filter, RPS_NO_FILTER);
 	out:
@@ -5005,17 +5074,16 @@ bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index,
 	struct rps_dev_flow_table *flow_table;
 	struct rps_dev_flow *rflow;
 	bool expire = true;
-	unsigned int cpu;
 
 	rcu_read_lock();
 	flow_table = rcu_dereference(rxqueue->rps_flow_table);
 	if (flow_table && flow_id < (1UL << flow_table->log)) {
+		unsigned int cpu;
+
 		rflow = &flow_table->flows[flow_id];
 		cpu = READ_ONCE(rflow->cpu);
-		if (READ_ONCE(rflow->filter) == filter_id && cpu < nr_cpu_ids &&
-		    ((int)(READ_ONCE(per_cpu(softnet_data, cpu).input_queue_head) -
-			   READ_ONCE(rflow->last_qtail)) <
-		     (int)(10 << flow_table->log)))
+		if (READ_ONCE(rflow->filter) == filter_id &&
+		    rps_flow_is_active(rflow, flow_table, cpu))
 			expire = false;
 	}
 	rcu_read_unlock();
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index c28cd6665444..5ea9f64adce3 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1120,8 +1120,10 @@ static ssize_t store_rps_dev_flow_table_cnt(struct netdev_rx_queue *queue,
 			return -ENOMEM;
 
 		table->log = ilog2(mask) + 1;
-		for (count = 0; count <= mask; count++)
+		for (count = 0; count <= mask; count++) {
 			table->flows[count].cpu = RPS_NO_CPU;
+			table->flows[count].filter = RPS_NO_FILTER;
+		}
 	} else {
 		table = NULL;
 	}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v5 net-next 2/2] net: Cache hash and flow_id to avoid recalculation
  2025-07-21  3:16 [PATCH v5 net-next 0/2] net: RPS table overwrite prevention and flow_id caching Krishna Kumar
  2025-07-21  3:16 ` [PATCH v5 net-next 1/2] net: Prevent RPS table overwrite for active flows Krishna Kumar
@ 2025-07-21  3:16 ` Krishna Kumar
  1 sibling, 0 replies; 8+ messages in thread
From: Krishna Kumar @ 2025-07-21  3:16 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, tom, kuba, pabeni, horms, sdf, kuniyu,
	ahmed.zaki, aleksander.lobakin, atenart, jdamato, krishna.ku,
	krikku

get_rps_cpu() can cache flow_id and hash as both are required by
set_rps_cpu() instead of recalculating them twice.

Signed-off-by: Krishna Kumar <krikku@gmail.com>
---
 net/core/dev.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 979dcdfdee1f..2c3981b823c7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4861,7 +4861,8 @@ static bool rps_flow_is_active(struct rps_dev_flow *rflow,
 
 static struct rps_dev_flow *
 set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
-	    struct rps_dev_flow *rflow, u16 next_cpu)
+	    struct rps_dev_flow *rflow, u16 next_cpu, u32 hash,
+	    u32 flow_id)
 {
 	if (next_cpu < nr_cpu_ids) {
 		u32 head;
@@ -4872,8 +4873,6 @@ set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 		struct rps_dev_flow *tmp_rflow;
 		unsigned int tmp_cpu;
 		u16 rxq_index;
-		u32 flow_id;
-		u32 hash;
 		int rc;
 
 		/* Should we steer this flow to a different hardware queue? */
@@ -4889,9 +4888,6 @@ set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 		if (!flow_table)
 			goto out;
 
-		hash = skb_get_hash(skb);
-		flow_id = rfs_slot(hash, flow_table);
-
 		tmp_rflow = &flow_table->flows[flow_id];
 		tmp_cpu = READ_ONCE(tmp_rflow->cpu);
 
@@ -4965,6 +4961,7 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	struct rps_dev_flow_table *flow_table;
 	struct rps_map *map;
 	int cpu = -1;
+	u32 flow_id;
 	u32 tcpu;
 	u32 hash;
 
@@ -5011,7 +5008,8 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 		/* OK, now we know there is a match,
 		 * we can look at the local (per receive queue) flow table
 		 */
-		rflow = &flow_table->flows[rfs_slot(hash, flow_table)];
+		flow_id = rfs_slot(hash, flow_table);
+		rflow = &flow_table->flows[flow_id];
 		tcpu = rflow->cpu;
 
 		/*
@@ -5030,7 +5028,8 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 		     ((int)(READ_ONCE(per_cpu(softnet_data, tcpu).input_queue_head) -
 		      rflow->last_qtail)) >= 0)) {
 			tcpu = next_cpu;
-			rflow = set_rps_cpu(dev, skb, rflow, next_cpu);
+			rflow = set_rps_cpu(dev, skb, rflow, next_cpu, hash,
+					    flow_id);
 		}
 
 		if (tcpu < nr_cpu_ids && cpu_online(tcpu)) {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v5 net-next 1/2] net: Prevent RPS table overwrite for active flows
  2025-07-21  3:16 ` [PATCH v5 net-next 1/2] net: Prevent RPS table overwrite for active flows Krishna Kumar
@ 2025-07-21  8:14   ` Eric Dumazet
  2025-07-21 11:12     ` Krishna Kumar
  2025-07-23  4:33     ` Krishna Kumar
  2025-07-21  8:21   ` Eric Dumazet
  1 sibling, 2 replies; 8+ messages in thread
From: Eric Dumazet @ 2025-07-21  8:14 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: netdev, davem, tom, kuba, pabeni, horms, sdf, kuniyu, ahmed.zaki,
	aleksander.lobakin, atenart, jdamato, krishna.ku

On Sun, Jul 20, 2025 at 8:16 PM Krishna Kumar <krikku@gmail.com> wrote:
>
> This patch fixes an issue where two different flows on the same RXq
> produce the same hash resulting in continuous flow overwrites.
>
> Flow #1: A packet for Flow #1 comes in, kernel calls the steering
>          function. The driver gives back a filter id. The kernel saves
>          this filter id in the selected slot. Later, the driver's
>          service task checks if any filters have expired and then
>          installs the rule for Flow #1.
> Flow #2: A packet for Flow #2 comes in. It goes through the same steps.
>          But this time, the chosen slot is being used by Flow #1. The
>          driver gives a new filter id and the kernel saves it in the
>          same slot. When the driver's service task runs, it runs through
>          all the flows, checks if Flow #1 should be expired, the kernel
>          returns True as the slot has a different filter id, and then
>          the driver installs the rule for Flow #2.
> Flow #1: Another packet for Flow #1 comes in. The same thing repeats.
>          The slot is overwritten with a new filter id for Flow #1.
>
> This causes a repeated cycle of flow programming for missed packets,
> wasting CPU cycles while not improving performance. This problem happens
> at higher rates when the RPS table is small, but tests show it still
> happens even with 12,000 connections and an RPS size of 16K per queue
> (global table size = 144x16K = 64K).
>
> This patch prevents overwriting an rps_dev_flow entry if it is active.
> The intention is that it is better to do aRFS for the first flow instead
> of hurting all flows on the same hash. Without this, two (or more) flows
> on one RX queue with the same hash can keep overwriting each other. This
> causes the driver to reprogram the flow repeatedly.
>
> Changes:
>   1. Add a new 'hash' field to struct rps_dev_flow.
>   2. Add rps_flow_is_active(): a helper function to check if a flow is
>      active or not, extracted from rps_may_expire_flow().
>   3. In set_rps_cpu():
>      - Avoid overwriting by programming a new filter if:
>         - The slot is not in use, or
>         - The slot is in use but the flow is not active, or
>         - The slot has an active flow with the same hash, but target CPU
>           differs.
>      - Save the hash in the rps_dev_flow entry.
>   4. rps_may_expire_flow(): Use earlier extracted rps_flow_is_active().
>
> Testing & results:
>   - Driver: ice (E810 NIC), Kernel: net-next
>   - #CPUs = #RXq = 144 (1:1)
>   - Number of flows: 12K
>   - Eight RPS settings from 256 to 32768. Though RPS=256 is not ideal,
>     it is still sufficient to cover 12K flows (256*144 rx-queues = 64K
>     global table slots)
>   - Global Table Size = 144 * RPS (effectively equal to 256 * RPS)
>   - Each RPS test duration = 8 mins (org code) + 8 mins (new code).
>   - Metrics captured on client
>
> Legend for following tables:
> Steer-C: #times ndo_rx_flow_steer() was Called by set_rps_cpu()
> Steer-L: #times ice_arfs_flow_steer() Looped over aRFS entries
> Add:     #times driver actually programmed aRFS (ice_arfs_build_entry())
> Del:     #times driver deleted the flow (ice_arfs_del_flow_rules())
> Units:   K = 1,000 times, M = 1 million times
>
>   |-------|---------|------|     Org Code    |---------|---------|
>   | RPS   | Latency | CPU  | Add    |  Del   | Steer-C | Steer-L |
>   |-------|---------|------|--------|--------|---------|---------|
>   | 256   | 227.0   | 93.2 | 1.6M   | 1.6M   | 121.7M  | 267.6M  |
>   | 512   | 225.9   | 94.1 | 11.5M  | 11.2M  | 65.7M   | 199.6M  |
>   | 1024  | 223.5   | 95.6 | 16.5M  | 16.5M  | 27.1M   | 187.3M  |
>   | 2048  | 222.2   | 96.3 | 10.5M  | 10.5M  | 12.5M   | 115.2M  |
>   | 4096  | 223.9   | 94.1 | 5.5M   | 5.5M   | 7.2M    | 65.9M   |
>   | 8192  | 224.7   | 92.5 | 2.7M   | 2.7M   | 3.0M    | 29.9M   |
>   | 16384 | 223.5   | 92.5 | 1.3M   | 1.3M   | 1.4M    | 13.9M   |
>   | 32768 | 219.6   | 93.2 | 838.1K | 838.1K | 965.1K  | 8.9M    |
>   |-------|---------|------|   New Code      |---------|---------|
>   | 256   | 201.5   | 99.1 | 13.4K  | 5.0K   | 13.7K   | 75.2K   |
>   | 512   | 202.5   | 98.2 | 11.2K  | 5.9K   | 11.2K   | 55.5K   |
>   | 1024  | 207.3   | 93.9 | 11.5K  | 9.7K   | 11.5K   | 59.6K   |
>   | 2048  | 207.5   | 96.7 | 11.8K  | 11.1K  | 15.5K   | 79.3K   |
>   | 4096  | 206.9   | 96.6 | 11.8K  | 11.7K  | 11.8K   | 63.2K   |
>   | 8192  | 205.8   | 96.7 | 11.9K  | 11.8K  | 11.9K   | 63.9K   |
>   | 16384 | 200.9   | 98.2 | 11.9K  | 11.9K  | 11.9K   | 64.2K   |
>   | 32768 | 202.5   | 98.0 | 11.9K  | 11.9K  | 11.9K   | 64.2K   |
>   |-------|---------|------|--------|--------|---------|---------|
>
> Some observations:
>   1. Overall Latency improved: (1790.19-1634.94)/1790.19*100 = 8.67%
>   2. Overall CPU increased:    (777.32-751.49)/751.45*100    = 3.44%
>   3. Flow Management (add/delete) remained almost constant at ~11K
>      compared to values in millions.
>
> Reported-by: kernel test robot <lkp@intel.com>
> Closes: https://lore.kernel.org/oe-kbuild-all/202507161125.rUCoz9ov-lkp@intel.com/
> Signed-off-by: Krishna Kumar <krikku@gmail.com>
> ---
>  include/net/rps.h    |  5 +--
>  net/core/dev.c       | 82 ++++++++++++++++++++++++++++++++++++++++----
>  net/core/net-sysfs.c |  4 ++-
>  3 files changed, 81 insertions(+), 10 deletions(-)
>
> diff --git a/include/net/rps.h b/include/net/rps.h
> index d8ab3a08bcc4..8e33dbea9327 100644
> --- a/include/net/rps.h
> +++ b/include/net/rps.h
> @@ -25,13 +25,14 @@ struct rps_map {
>
>  /*
>   * The rps_dev_flow structure contains the mapping of a flow to a CPU, the
> - * tail pointer for that CPU's input queue at the time of last enqueue, and
> - * a hardware filter index.
> + * tail pointer for that CPU's input queue at the time of last enqueue, a
> + * hardware filter index, and the hash of the flow.
>   */
>  struct rps_dev_flow {
>         u16             cpu;
>         u16             filter;
>         unsigned int    last_qtail;
> +       u32             hash;

This is problematic, because adds an extra potential cache line miss in RPS.

Some of us do not use CONFIG_RFS_ACCEL, make sure to not add extra
costs for this configuration ?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v5 net-next 1/2] net: Prevent RPS table overwrite for active flows
  2025-07-21  3:16 ` [PATCH v5 net-next 1/2] net: Prevent RPS table overwrite for active flows Krishna Kumar
  2025-07-21  8:14   ` Eric Dumazet
@ 2025-07-21  8:21   ` Eric Dumazet
  2025-07-21 11:33     ` Krishna Kumar
  1 sibling, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2025-07-21  8:21 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: netdev, davem, tom, kuba, pabeni, horms, sdf, kuniyu, ahmed.zaki,
	aleksander.lobakin, atenart, jdamato, krishna.ku

On Sun, Jul 20, 2025 at 8:16 PM Krishna Kumar <krikku@gmail.com> wrote:
>
> This patch fixes an issue where two different flows on the same RXq
> produce the same hash resulting in continuous flow overwrites.
>
> Flow #1: A packet for Flow #1 comes in, kernel calls the steering
>          function. The driver gives back a filter id. The kernel saves
>          this filter id in the selected slot. Later, the driver's
>          service task checks if any filters have expired and then
>          installs the rule for Flow #1.
> Flow #2: A packet for Flow #2 comes in. It goes through the same steps.
>          But this time, the chosen slot is being used by Flow #1. The
>          driver gives a new filter id and the kernel saves it in the
>          same slot. When the driver's service task runs, it runs through
>          all the flows, checks if Flow #1 should be expired, the kernel
>          returns True as the slot has a different filter id, and then
>          the driver installs the rule for Flow #2.
> Flow #1: Another packet for Flow #1 comes in. The same thing repeats.
>          The slot is overwritten with a new filter id for Flow #1.
>
> This causes a repeated cycle of flow programming for missed packets,
> wasting CPU cycles while not improving performance. This problem happens
> at higher rates when the RPS table is small, but tests show it still
> happens even with 12,000 connections and an RPS size of 16K per queue
> (global table size = 144x16K = 64K).
>
>
> Reported-by: kernel test robot <lkp@intel.com>
> Closes: https://lore.kernel.org/oe-kbuild-all/202507161125.rUCoz9ov-lkp@intel.com/
> Signed-off-by: Krishna Kumar <krikku@gmail.com>
> ---
>  include/net/rps.h    |  5 +--
>  net/core/dev.c       | 82 ++++++++++++++++++++++++++++++++++++++++----
>  net/core/net-sysfs.c |  4 ++-
>  3 files changed, 81 insertions(+), 10 deletions(-)
>
> diff --git a/include/net/rps.h b/include/net/rps.h
> index d8ab3a08bcc4..8e33dbea9327 100644
> --- a/include/net/rps.h
> +++ b/include/net/rps.h
> @@ -25,13 +25,14 @@ struct rps_map {
>

> +#ifdef CONFIG_RFS_ACCEL
> +/**
> + * rps_flow_is_active - check whether the flow is recently active.
> + * @rflow: Specific flow to check activity.
> + * @flow_table: Check activity against the flow_table's size.
> + * @cpu: CPU saved in @rflow.
> + *
> + * If the CPU has processed many packets since the flow's last activity
> + * (beyond 10 times the table size), the flow is considered stale.
> + *
> + * Return: true if flow was recently active.
> + */
> +static bool rps_flow_is_active(struct rps_dev_flow *rflow,
> +                              struct rps_dev_flow_table *flow_table,
> +                              unsigned int cpu)
> +{
> +       return cpu < nr_cpu_ids &&
> +              ((int)(READ_ONCE(per_cpu(softnet_data, cpu).input_queue_head) -
> +               READ_ONCE(rflow->last_qtail)) < (int)(10 << flow_table->log));
> +}
> +#endif

This notion of active flow is kind of weird.
It might be time to make it less obscure, less expensive and time
(jiffies ?) deterministic.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v5 net-next 1/2] net: Prevent RPS table overwrite for active flows
  2025-07-21  8:14   ` Eric Dumazet
@ 2025-07-21 11:12     ` Krishna Kumar
  2025-07-23  4:33     ` Krishna Kumar
  1 sibling, 0 replies; 8+ messages in thread
From: Krishna Kumar @ 2025-07-21 11:12 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, davem, tom, kuba, pabeni, horms, sdf, kuniyu, ahmed.zaki,
	aleksander.lobakin, atenart, jdamato, krishna.ku

Thanks Eric.

1. I will move hash under aRFS (addressing your next comment separately)?
2. There's no point (actually may harm) in moving the existing
"filter" under aRFS
    though that's also used in aRFS code - it is u16 and pairs well
with the "u16 cpu".

I hope #1 is sufficient?

Thanks,
- Krishna

On Mon, Jul 21, 2025 at 1:44 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sun, Jul 20, 2025 at 8:16 PM Krishna Kumar <krikku@gmail.com> wrote:
> >
> > This patch fixes an issue where two different flows on the same RXq
> > produce the same hash resulting in continuous flow overwrites.
> >
> > Flow #1: A packet for Flow #1 comes in, kernel calls the steering
> >          function. The driver gives back a filter id. The kernel saves
> >          this filter id in the selected slot. Later, the driver's
> >          service task checks if any filters have expired and then
> >          installs the rule for Flow #1.
> > Flow #2: A packet for Flow #2 comes in. It goes through the same steps.
> >          But this time, the chosen slot is being used by Flow #1. The
> >          driver gives a new filter id and the kernel saves it in the
> >          same slot. When the driver's service task runs, it runs through
> >          all the flows, checks if Flow #1 should be expired, the kernel
> >          returns True as the slot has a different filter id, and then
> >          the driver installs the rule for Flow #2.
> > Flow #1: Another packet for Flow #1 comes in. The same thing repeats.
> >          The slot is overwritten with a new filter id for Flow #1.
> >
> > This causes a repeated cycle of flow programming for missed packets,
> > wasting CPU cycles while not improving performance. This problem happens
> > at higher rates when the RPS table is small, but tests show it still
> > happens even with 12,000 connections and an RPS size of 16K per queue
> > (global table size = 144x16K = 64K).
> >
> > This patch prevents overwriting an rps_dev_flow entry if it is active.
> > The intention is that it is better to do aRFS for the first flow instead
> > of hurting all flows on the same hash. Without this, two (or more) flows
> > on one RX queue with the same hash can keep overwriting each other. This
> > causes the driver to reprogram the flow repeatedly.
> >
> > Changes:
> >   1. Add a new 'hash' field to struct rps_dev_flow.
> >   2. Add rps_flow_is_active(): a helper function to check if a flow is
> >      active or not, extracted from rps_may_expire_flow().
> >   3. In set_rps_cpu():
> >      - Avoid overwriting by programming a new filter if:
> >         - The slot is not in use, or
> >         - The slot is in use but the flow is not active, or
> >         - The slot has an active flow with the same hash, but target CPU
> >           differs.
> >      - Save the hash in the rps_dev_flow entry.
> >   4. rps_may_expire_flow(): Use earlier extracted rps_flow_is_active().
> >
> > Testing & results:
> >   - Driver: ice (E810 NIC), Kernel: net-next
> >   - #CPUs = #RXq = 144 (1:1)
> >   - Number of flows: 12K
> >   - Eight RPS settings from 256 to 32768. Though RPS=256 is not ideal,
> >     it is still sufficient to cover 12K flows (256*144 rx-queues = 64K
> >     global table slots)
> >   - Global Table Size = 144 * RPS (effectively equal to 256 * RPS)
> >   - Each RPS test duration = 8 mins (org code) + 8 mins (new code).
> >   - Metrics captured on client
> >
> > Legend for following tables:
> > Steer-C: #times ndo_rx_flow_steer() was Called by set_rps_cpu()
> > Steer-L: #times ice_arfs_flow_steer() Looped over aRFS entries
> > Add:     #times driver actually programmed aRFS (ice_arfs_build_entry())
> > Del:     #times driver deleted the flow (ice_arfs_del_flow_rules())
> > Units:   K = 1,000 times, M = 1 million times
> >
> >   |-------|---------|------|     Org Code    |---------|---------|
> >   | RPS   | Latency | CPU  | Add    |  Del   | Steer-C | Steer-L |
> >   |-------|---------|------|--------|--------|---------|---------|
> >   | 256   | 227.0   | 93.2 | 1.6M   | 1.6M   | 121.7M  | 267.6M  |
> >   | 512   | 225.9   | 94.1 | 11.5M  | 11.2M  | 65.7M   | 199.6M  |
> >   | 1024  | 223.5   | 95.6 | 16.5M  | 16.5M  | 27.1M   | 187.3M  |
> >   | 2048  | 222.2   | 96.3 | 10.5M  | 10.5M  | 12.5M   | 115.2M  |
> >   | 4096  | 223.9   | 94.1 | 5.5M   | 5.5M   | 7.2M    | 65.9M   |
> >   | 8192  | 224.7   | 92.5 | 2.7M   | 2.7M   | 3.0M    | 29.9M   |
> >   | 16384 | 223.5   | 92.5 | 1.3M   | 1.3M   | 1.4M    | 13.9M   |
> >   | 32768 | 219.6   | 93.2 | 838.1K | 838.1K | 965.1K  | 8.9M    |
> >   |-------|---------|------|   New Code      |---------|---------|
> >   | 256   | 201.5   | 99.1 | 13.4K  | 5.0K   | 13.7K   | 75.2K   |
> >   | 512   | 202.5   | 98.2 | 11.2K  | 5.9K   | 11.2K   | 55.5K   |
> >   | 1024  | 207.3   | 93.9 | 11.5K  | 9.7K   | 11.5K   | 59.6K   |
> >   | 2048  | 207.5   | 96.7 | 11.8K  | 11.1K  | 15.5K   | 79.3K   |
> >   | 4096  | 206.9   | 96.6 | 11.8K  | 11.7K  | 11.8K   | 63.2K   |
> >   | 8192  | 205.8   | 96.7 | 11.9K  | 11.8K  | 11.9K   | 63.9K   |
> >   | 16384 | 200.9   | 98.2 | 11.9K  | 11.9K  | 11.9K   | 64.2K   |
> >   | 32768 | 202.5   | 98.0 | 11.9K  | 11.9K  | 11.9K   | 64.2K   |
> >   |-------|---------|------|--------|--------|---------|---------|
> >
> > Some observations:
> >   1. Overall Latency improved: (1790.19-1634.94)/1790.19*100 = 8.67%
> >   2. Overall CPU increased:    (777.32-751.49)/751.45*100    = 3.44%
> >   3. Flow Management (add/delete) remained almost constant at ~11K
> >      compared to values in millions.
> >
> > Reported-by: kernel test robot <lkp@intel.com>
> > Closes: https://lore.kernel.org/oe-kbuild-all/202507161125.rUCoz9ov-lkp@intel.com/
> > Signed-off-by: Krishna Kumar <krikku@gmail.com>
> > ---
> >  include/net/rps.h    |  5 +--
> >  net/core/dev.c       | 82 ++++++++++++++++++++++++++++++++++++++++----
> >  net/core/net-sysfs.c |  4 ++-
> >  3 files changed, 81 insertions(+), 10 deletions(-)
> >
> > diff --git a/include/net/rps.h b/include/net/rps.h
> > index d8ab3a08bcc4..8e33dbea9327 100644
> > --- a/include/net/rps.h
> > +++ b/include/net/rps.h
> > @@ -25,13 +25,14 @@ struct rps_map {
> >
> >  /*
> >   * The rps_dev_flow structure contains the mapping of a flow to a CPU, the
> > - * tail pointer for that CPU's input queue at the time of last enqueue, and
> > - * a hardware filter index.
> > + * tail pointer for that CPU's input queue at the time of last enqueue, a
> > + * hardware filter index, and the hash of the flow.
> >   */
> >  struct rps_dev_flow {
> >         u16             cpu;
> >         u16             filter;
> >         unsigned int    last_qtail;
> > +       u32             hash;
>
> This is problematic, because adds an extra potential cache line miss in RPS.
>
> Some of us do not use CONFIG_RFS_ACCEL, make sure to not add extra
> costs for this configuration ?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v5 net-next 1/2] net: Prevent RPS table overwrite for active flows
  2025-07-21  8:21   ` Eric Dumazet
@ 2025-07-21 11:33     ` Krishna Kumar
  0 siblings, 0 replies; 8+ messages in thread
From: Krishna Kumar @ 2025-07-21 11:33 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, davem, tom, kuba, pabeni, horms, sdf, kuniyu, ahmed.zaki,
	aleksander.lobakin, atenart, jdamato, krishna.ku

On Mon, Jul 21, 2025 at 1:51 PM Eric Dumazet <edumazet@google.com> wrote:
>
> > +static bool rps_flow_is_active(struct rps_dev_flow *rflow,
> > +                              struct rps_dev_flow_table *flow_table,
> > +                              unsigned int cpu)
> > +{
> > +       return cpu < nr_cpu_ids &&
> > +              ((int)(READ_ONCE(per_cpu(softnet_data, cpu).input_queue_head) -
> > +               READ_ONCE(rflow->last_qtail)) < (int)(10 << flow_table->log));
> > +}
> > +#endif
>
> This notion of active flow is kind of weird.
> It might be time to make it less obscure, less expensive and time
> (jiffies ?) deterministic.

My first internal attempt had this approach (not submitted as I felt
it was doing two
things in the same patch - fixing an issue we are seeing in Flipkart
production servers
vs improving an existing function):

struct rps_dev_flow {
        u16 cpu;
        u16 filter;
        unsigned int last_qtail;
        unsigned long last_active; /* Last activity timestamp (jiffies) */
        u32 hash;
};

I had not considered removing last_qtail or its implication of packet
reordering at
this time, so I had both last_qtail and last_active in the structure
(have to understand
this better).

and:
#define RFS_ACCEL_FLOW_TIMEOUT (HZ / 4)
static bool rps_flow_is_active(struct rps_dev_flow *rflow, unsigned int cpu)
{
    /*
     * Check if the flow is for a valid, online CPU and if the current
time is before
       the flow's expiration time.
     */
    return cpu < nr_cpu_ids &&
              time_before(jiffies, rflow->last_use + RFS_ACCEL_FLOW_TIMEOUT);
}

Please let me know if this approach is correct, and whether it should
be a separate patch.

Thanks,
- Krishna

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v5 net-next 1/2] net: Prevent RPS table overwrite for active flows
  2025-07-21  8:14   ` Eric Dumazet
  2025-07-21 11:12     ` Krishna Kumar
@ 2025-07-23  4:33     ` Krishna Kumar
  1 sibling, 0 replies; 8+ messages in thread
From: Krishna Kumar @ 2025-07-23  4:33 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, davem, tom, kuba, pabeni, horms, sdf, kuniyu, ahmed.zaki,
	aleksander.lobakin, atenart, krishna.ku

Hi Eric,

On Mon, Jul 21, 2025 at 1:44 PM Eric Dumazet <edumazet@google.com> wrote:
> >  struct rps_dev_flow {
> >         u16             cpu;
> >         u16             filter;
> >         unsigned int    last_qtail;
> > +       u32             hash;
>
> This is problematic, because adds an extra potential cache line miss in RPS.
>
> Some of us do not use CONFIG_RFS_ACCEL, make sure to not add extra
> costs for this configuration ?

I will send this fix today.

On your point on making the active check less obscure/costly,
my understanding is we need last_qtail for in-order processing
and add something like last_use for activity detection. If that’s
correct, I can send a patch for it after this.

Thanks,
- Krishna

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-07-23  4:34 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-21  3:16 [PATCH v5 net-next 0/2] net: RPS table overwrite prevention and flow_id caching Krishna Kumar
2025-07-21  3:16 ` [PATCH v5 net-next 1/2] net: Prevent RPS table overwrite for active flows Krishna Kumar
2025-07-21  8:14   ` Eric Dumazet
2025-07-21 11:12     ` Krishna Kumar
2025-07-23  4:33     ` Krishna Kumar
2025-07-21  8:21   ` Eric Dumazet
2025-07-21 11:33     ` Krishna Kumar
2025-07-21  3:16 ` [PATCH v5 net-next 2/2] net: Cache hash and flow_id to avoid recalculation Krishna Kumar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).