* [PATCH net v4 1/2] net/sched: Fix backlog accounting in qdisc_dequeue_internal
@ 2025-07-27 23:56 William Liu
  2025-07-27 23:57 ` [PATCH net v4 2/2] selftests/tc-testing: Check backlog stats in gso_skb case William Liu
  2025-08-08 21:27 ` [PATCH net v4 1/2] net/sched: Fix backlog accounting in qdisc_dequeue_internal Jakub Kicinski
  0 siblings, 2 replies; 14+ messages in thread
From: William Liu @ 2025-07-27 23:56 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, pabeni, kuba, jiri, davem, edumazet, horms,
	savy, victor, William Liu

This issue applies to the following qdiscs: hhf, fq, fq_codel, and
fq_pie, and occurs in their change handlers when adjusting to the new
limit. Given a tbf parent, the values passed to the subsequent
qdisc_tree_reduce_backlog call are wrong in the following ways:

1. When the tbf parent runs out of tokens, skbs of these qdiscs will
   be placed in gso_skb. Their peek handlers are qdisc_peek_dequeued,
   which accounts for both qlen and backlog. However, in the case of
   qdisc_dequeue_internal, ONLY qlen is accounted for when pulling
   from gso_skb. This means that these qdiscs are missing a
   qdisc_qstats_backlog_dec when dropping packets to satisfy the
   new limit in their change handlers (see the abbreviated helpers
   below the list).

   One can observe this issue with the following (with tc patched to
   support a limit of 0):

   export TARGET=fq
   tc qdisc del dev lo root
   tc qdisc add dev lo root handle 1: tbf rate 8bit burst 100b latency 1ms
   tc qdisc replace dev lo handle 3: parent 1:1 $TARGET limit 1000
   echo ''; echo 'add child'; tc -s -d qdisc show dev lo
   ping -I lo -f -c2 -s32 -W0.001 127.0.0.1 2>&1 >/dev/null
   echo ''; echo 'after ping'; tc -s -d qdisc show dev lo
   tc qdisc change dev lo handle 3: parent 1:1 $TARGET limit 0
   echo ''; echo 'after limit drop'; tc -s -d qdisc show dev lo
   tc qdisc replace dev lo handle 2: parent 1:1 sfq
   echo ''; echo 'post graft'; tc -s -d qdisc show dev lo

   The second-to-last show command reports 0 packets but a positive
   number (74) of backlog bytes. The problem becomes clearer in the
   last show command, where qdisc_purge_queue triggers
   qdisc_tree_reduce_backlog with the positive backlog and causes an
   underflow in the tbf parent's backlog (4096 Mb instead of 0).

2. fq_codel_change is also wrong in the non-gso_skb case. It tracks
   the amount to drop after the limit adjustment loop through
   cstats.drop_count and cstats.drop_len, but these fields are also
   updated in fq_codel_dequeue, and are reset there every time they
   are non-zero, after a call to qdisc_tree_reduce_backlog.
   If the drop path ever occurs in fq_codel_dequeue and
   qdisc_dequeue_internal takes the non-gso_skb path, then we would
   reduce the backlog by an extra packet.
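
For reference, the asymmetry in (1) comes from the following two
helpers (pre-patch, abbreviated from include/net/sch_generic.h):

   static inline struct sk_buff *qdisc_peek_dequeued(struct Qdisc *sch)
   {
           struct sk_buff *skb = skb_peek(&sch->gso_skb);

           if (!skb) {
                   skb = sch->dequeue(sch);
                   if (skb) {
                           __skb_queue_head(&sch->gso_skb, skb);
                           /* it's still part of the queue */
                           qdisc_qstats_backlog_inc(sch, skb);
                           sch->q.qlen++;
                   }
           }
           return skb;
   }

   static inline struct sk_buff *qdisc_dequeue_internal(struct Qdisc *sch,
                                                        bool direct)
   {
           struct sk_buff *skb;

           skb = __skb_dequeue(&sch->gso_skb);
           if (skb) {
                   sch->q.qlen--;  /* no matching qdisc_qstats_backlog_dec */
                   return skb;
           }
           ...
   }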

To fix these issues, the codepath for all clients of
qdisc_dequeue_internal has been simplified: codel, pie, hhf, fq,
fq_pie, and fq_codel. qdisc_dequeue_internal now handles the qlen
and backlog adjustments for all cases that do not go directly
through the dequeue handler.

Special care is taken for fq_codel_dequeue to account for the
qdisc_tree_reduce_backlog call in its dequeue handler. The
cstats reset is moved from the end to the beginning of
fq_codel_dequeue, so the change handler can use cstats for
proper backlog reduction accounting purposes. The drop_len and
drop_count fields are not used elsewhere so this reordering in
fq_codel_dequeue is ok.

Fixes: 2d3cbfd6d54a ("net_sched: Flush gso_skb list too during ->change()")
Fixes: 4b549a2ef4be ("fq_codel: Fair Queue Codel AQM")
Fixes: 10239edf86f1 ("net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdisc")
Signed-off-by: William Liu <will@willsroot.io>
Reviewed-by: Savino Dicanosa <savy@syst3mfailure.io>
---
v1 -> v2:
  - Fix commit formatting
  - There was a suggestion to split the patch apart into one for each
    qdisc - however, such a split would leave individual qdiscs further
    broken in the interim, since they all rely on the shared
    qdisc_dequeue_internal helper
---
 include/net/sch_generic.h |  9 +++++++--
 net/sched/sch_codel.c     | 10 +++++-----
 net/sched/sch_fq.c        | 14 +++++++-------
 net/sched/sch_fq_codel.c  | 22 +++++++++++++++-------
 net/sched/sch_fq_pie.c    | 10 +++++-----
 net/sched/sch_hhf.c       |  6 +++---
 net/sched/sch_pie.c       | 10 +++++-----
 7 files changed, 47 insertions(+), 34 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 638948be4c50..a24094a638dc 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -1038,10 +1038,15 @@ static inline struct sk_buff *qdisc_dequeue_internal(struct Qdisc *sch, bool dir
 	skb = __skb_dequeue(&sch->gso_skb);
 	if (skb) {
 		sch->q.qlen--;
+		qdisc_qstats_backlog_dec(sch, skb);
+		return skb;
+	}
+	if (direct) {
+		skb = __qdisc_dequeue_head(&sch->q);
+		if (skb)
+			qdisc_qstats_backlog_dec(sch, skb);
 		return skb;
 	}
-	if (direct)
-		return __qdisc_dequeue_head(&sch->q);
 	else
 		return sch->dequeue(sch);
 }
diff --git a/net/sched/sch_codel.c b/net/sched/sch_codel.c
index c93761040c6e..8dc467f665bb 100644
--- a/net/sched/sch_codel.c
+++ b/net/sched/sch_codel.c
@@ -103,7 +103,7 @@ static int codel_change(struct Qdisc *sch, struct nlattr *opt,
 {
 	struct codel_sched_data *q = qdisc_priv(sch);
 	struct nlattr *tb[TCA_CODEL_MAX + 1];
-	unsigned int qlen, dropped = 0;
+	unsigned int prev_qlen, prev_backlog;
 	int err;
 
 	err = nla_parse_nested_deprecated(tb, TCA_CODEL_MAX, opt,
@@ -142,15 +142,15 @@ static int codel_change(struct Qdisc *sch, struct nlattr *opt,
 		WRITE_ONCE(q->params.ecn,
 			   !!nla_get_u32(tb[TCA_CODEL_ECN]));
 
-	qlen = sch->q.qlen;
+	prev_qlen = sch->q.qlen;
+	prev_backlog = sch->qstats.backlog;
 	while (sch->q.qlen > sch->limit) {
 		struct sk_buff *skb = qdisc_dequeue_internal(sch, true);
 
-		dropped += qdisc_pkt_len(skb);
-		qdisc_qstats_backlog_dec(sch, skb);
 		rtnl_qdisc_drop(skb, sch);
 	}
-	qdisc_tree_reduce_backlog(sch, qlen - sch->q.qlen, dropped);
+	qdisc_tree_reduce_backlog(sch, prev_qlen - sch->q.qlen,
+				  prev_backlog - sch->qstats.backlog);
 
 	sch_tree_unlock(sch);
 	return 0;
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 902ff5470607..986e71e3362c 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -1014,10 +1014,10 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt,
 		     struct netlink_ext_ack *extack)
 {
 	struct fq_sched_data *q = qdisc_priv(sch);
+	unsigned int prev_qlen, prev_backlog;
 	struct nlattr *tb[TCA_FQ_MAX + 1];
-	int err, drop_count = 0;
-	unsigned drop_len = 0;
 	u32 fq_log;
+	int err;
 
 	err = nla_parse_nested_deprecated(tb, TCA_FQ_MAX, opt, fq_policy,
 					  NULL);
@@ -1135,16 +1135,16 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt,
 		err = fq_resize(sch, fq_log);
 		sch_tree_lock(sch);
 	}
+
+	prev_qlen = sch->q.qlen;
+	prev_backlog = sch->qstats.backlog;
 	while (sch->q.qlen > sch->limit) {
 		struct sk_buff *skb = qdisc_dequeue_internal(sch, false);
 
-		if (!skb)
-			break;
-		drop_len += qdisc_pkt_len(skb);
 		rtnl_kfree_skbs(skb, skb);
-		drop_count++;
 	}
-	qdisc_tree_reduce_backlog(sch, drop_count, drop_len);
+	qdisc_tree_reduce_backlog(sch, prev_qlen - sch->q.qlen,
+				  prev_backlog - sch->qstats.backlog);
 
 	sch_tree_unlock(sch);
 	return err;
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index 2a0f3a513bfa..f9e6d76a1712 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -286,6 +286,10 @@ static struct sk_buff *fq_codel_dequeue(struct Qdisc *sch)
 	struct fq_codel_flow *flow;
 	struct list_head *head;
 
+	/* reset these here, as change needs them for proper accounting */
+	q->cstats.drop_count = 0;
+	q->cstats.drop_len = 0;
+
 begin:
 	head = &q->new_flows;
 	if (list_empty(head)) {
@@ -319,8 +323,6 @@ static struct sk_buff *fq_codel_dequeue(struct Qdisc *sch)
 	if (q->cstats.drop_count) {
 		qdisc_tree_reduce_backlog(sch, q->cstats.drop_count,
 					  q->cstats.drop_len);
-		q->cstats.drop_count = 0;
-		q->cstats.drop_len = 0;
 	}
 	return skb;
 }
@@ -366,8 +368,10 @@ static const struct nla_policy fq_codel_policy[TCA_FQ_CODEL_MAX + 1] = {
 static int fq_codel_change(struct Qdisc *sch, struct nlattr *opt,
 			   struct netlink_ext_ack *extack)
 {
+	unsigned int dropped_qlen = 0, dropped_backlog = 0;
 	struct fq_codel_sched_data *q = qdisc_priv(sch);
 	struct nlattr *tb[TCA_FQ_CODEL_MAX + 1];
+	unsigned int prev_qlen, prev_backlog;
 	u32 quantum = 0;
 	int err;
 
@@ -439,17 +443,21 @@ static int fq_codel_change(struct Qdisc *sch, struct nlattr *opt,
 		WRITE_ONCE(q->memory_limit,
 			   min(1U << 31, nla_get_u32(tb[TCA_FQ_CODEL_MEMORY_LIMIT])));
 
+	prev_qlen = sch->q.qlen;
+	prev_backlog = sch->qstats.backlog;
 	while (sch->q.qlen > sch->limit ||
 	       q->memory_usage > q->memory_limit) {
 		struct sk_buff *skb = qdisc_dequeue_internal(sch, false);
 
-		q->cstats.drop_len += qdisc_pkt_len(skb);
+		if (q->cstats.drop_count) {
+			dropped_qlen += q->cstats.drop_count;
+			dropped_backlog += q->cstats.drop_len;
+		}
+
 		rtnl_kfree_skbs(skb, skb);
-		q->cstats.drop_count++;
 	}
-	qdisc_tree_reduce_backlog(sch, q->cstats.drop_count, q->cstats.drop_len);
-	q->cstats.drop_count = 0;
-	q->cstats.drop_len = 0;
+	qdisc_tree_reduce_backlog(sch, prev_qlen - dropped_qlen - sch->q.qlen,
+				  prev_backlog - dropped_backlog - sch->qstats.backlog);
 
 	sch_tree_unlock(sch);
 	return 0;
diff --git a/net/sched/sch_fq_pie.c b/net/sched/sch_fq_pie.c
index b0e34daf1f75..8f49e9ff4f4c 100644
--- a/net/sched/sch_fq_pie.c
+++ b/net/sched/sch_fq_pie.c
@@ -289,8 +289,7 @@ static int fq_pie_change(struct Qdisc *sch, struct nlattr *opt,
 {
 	struct fq_pie_sched_data *q = qdisc_priv(sch);
 	struct nlattr *tb[TCA_FQ_PIE_MAX + 1];
-	unsigned int len_dropped = 0;
-	unsigned int num_dropped = 0;
+	unsigned int prev_qlen, prev_backlog;
 	int err;
 
 	err = nla_parse_nested(tb, TCA_FQ_PIE_MAX, opt, fq_pie_policy, extack);
@@ -365,14 +364,15 @@ static int fq_pie_change(struct Qdisc *sch, struct nlattr *opt,
 			   nla_get_u32(tb[TCA_FQ_PIE_DQ_RATE_ESTIMATOR]));
 
 	/* Drop excess packets if new limit is lower */
+	prev_qlen = sch->q.qlen;
+	prev_backlog = sch->qstats.backlog;
 	while (sch->q.qlen > sch->limit) {
 		struct sk_buff *skb = qdisc_dequeue_internal(sch, false);
 
-		len_dropped += qdisc_pkt_len(skb);
-		num_dropped += 1;
 		rtnl_kfree_skbs(skb, skb);
 	}
-	qdisc_tree_reduce_backlog(sch, num_dropped, len_dropped);
+	qdisc_tree_reduce_backlog(sch, prev_qlen - sch->q.qlen,
+				  prev_backlog - sch->qstats.backlog);
 
 	sch_tree_unlock(sch);
 	return 0;
diff --git a/net/sched/sch_hhf.c b/net/sched/sch_hhf.c
index 5aa434b46707..011d1330aea5 100644
--- a/net/sched/sch_hhf.c
+++ b/net/sched/sch_hhf.c
@@ -509,8 +509,8 @@ static int hhf_change(struct Qdisc *sch, struct nlattr *opt,
 		      struct netlink_ext_ack *extack)
 {
 	struct hhf_sched_data *q = qdisc_priv(sch);
+	unsigned int prev_qlen, prev_backlog;
 	struct nlattr *tb[TCA_HHF_MAX + 1];
-	unsigned int qlen, prev_backlog;
 	int err;
 	u64 non_hh_quantum;
 	u32 new_quantum = q->quantum;
@@ -561,14 +561,14 @@ static int hhf_change(struct Qdisc *sch, struct nlattr *opt,
 			   usecs_to_jiffies(us));
 	}
 
-	qlen = sch->q.qlen;
+	prev_qlen = sch->q.qlen;
 	prev_backlog = sch->qstats.backlog;
 	while (sch->q.qlen > sch->limit) {
 		struct sk_buff *skb = qdisc_dequeue_internal(sch, false);
 
 		rtnl_kfree_skbs(skb, skb);
 	}
-	qdisc_tree_reduce_backlog(sch, qlen - sch->q.qlen,
+	qdisc_tree_reduce_backlog(sch, prev_qlen - sch->q.qlen,
 				  prev_backlog - sch->qstats.backlog);
 
 	sch_tree_unlock(sch);
diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c
index ad46ee3ed5a9..af2646545a8a 100644
--- a/net/sched/sch_pie.c
+++ b/net/sched/sch_pie.c
@@ -142,8 +142,8 @@ static int pie_change(struct Qdisc *sch, struct nlattr *opt,
 		      struct netlink_ext_ack *extack)
 {
 	struct pie_sched_data *q = qdisc_priv(sch);
+	unsigned int prev_qlen, prev_backlog;
 	struct nlattr *tb[TCA_PIE_MAX + 1];
-	unsigned int qlen, dropped = 0;
 	int err;
 
 	err = nla_parse_nested_deprecated(tb, TCA_PIE_MAX, opt, pie_policy,
@@ -193,15 +193,15 @@ static int pie_change(struct Qdisc *sch, struct nlattr *opt,
 			   nla_get_u32(tb[TCA_PIE_DQ_RATE_ESTIMATOR]));
 
 	/* Drop excess packets if new limit is lower */
-	qlen = sch->q.qlen;
+	prev_qlen = sch->q.qlen;
+	prev_backlog = sch->qstats.backlog;
 	while (sch->q.qlen > sch->limit) {
 		struct sk_buff *skb = qdisc_dequeue_internal(sch, true);
 
-		dropped += qdisc_pkt_len(skb);
-		qdisc_qstats_backlog_dec(sch, skb);
 		rtnl_qdisc_drop(skb, sch);
 	}
-	qdisc_tree_reduce_backlog(sch, qlen - sch->q.qlen, dropped);
+	qdisc_tree_reduce_backlog(sch, prev_qlen - sch->q.qlen,
+				  prev_backlog - sch->qstats.backlog);
 
 	sch_tree_unlock(sch);
 	return 0;
-- 
2.43.0

* [PATCH net v4 2/2] selftests/tc-testing: Check backlog stats in gso_skb case
  2025-07-27 23:56 [PATCH net v4 1/2] net/sched: Fix backlog accounting in qdisc_dequeue_internal William Liu
@ 2025-07-27 23:57 ` William Liu
  2025-07-30 17:54   ` Cong Wang
  2025-08-08 21:27 ` [PATCH net v4 1/2] net/sched: Fix backlog accounting in qdisc_dequeue_internal Jakub Kicinski
  1 sibling, 1 reply; 14+ messages in thread
From: William Liu @ 2025-07-27 23:57 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, pabeni, kuba, jiri, davem, edumazet, horms,
	savy, victor, William Liu

Add tests to ensure proper backlog accounting in hhf, codel, pie, fq,
fq_pie, and fq_codel qdiscs. We check for the bug pattern originally
found in fq, fq_pie, and fq_codel, which was an underflow in the tbf
parent backlog stats upon child qdisc removal.
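
Each test fills the child qdisc under a tbf parent (so skbs end up in
gso_skb via tbf's peek), lowers the child's limit, and then deletes the
child; the verify step expects exactly one "backlog 0b 0p" line from
the remaining tbf root. With the old accounting bug, tbf's backlog
would underflow on the delete and the pattern would not match.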

Signed-off-by: William Liu <will@willsroot.io>
Reviewed-by: Savino Dicanosa <savy@syst3mfailure.io>
---
v3 -> v4:
  - Unify tests to check against underflow in tbf stats on purge
v2 -> v3:
  - Simplify ping command in test cases
  - Remove scapyPlugin dependency
---
 .../tc-testing/tc-tests/infra/qdiscs.json     | 198 ++++++++++++++++++
 1 file changed, 198 insertions(+)

diff --git a/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json b/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
index c6db7fa94f55..14c6224866e8 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
@@ -185,6 +185,204 @@
             "$IP addr del 10.10.10.10/24 dev $DUMMY || true"
         ]
     },
+    {
+        "id": "34c0",
+        "name": "Test TBF with HHF Backlog Accounting in gso_skb case against underflow",
+        "category": [
+            "qdisc",
+            "tbf",
+            "hhf"
+        ],
+        "plugins": {
+            "requires": [
+                "nsPlugin"
+            ]
+        },
+        "setup": [
+            "$IP link set dev $DUMMY up || true",
+            "$IP addr add 10.10.11.10/24 dev $DUMMY || true",
+            "$TC qdisc add dev $DUMMY root handle 1: tbf rate 8bit burst 100b latency 100ms",
+            "$TC qdisc replace dev $DUMMY handle 2: parent 1:1 hhf limit 1000",
+            [
+                "ping -I $DUMMY -c2 10.10.11.11",
+                1
+            ],
+            "$TC qdisc change dev $DUMMY handle 2: parent 1:1 hhf limit 1"
+        ],
+        "cmdUnderTest": "$TC qdisc del dev $DUMMY handle 2: parent 1:1",
+        "expExitCode": "0",
+        "verifyCmd": "$TC -s qdisc show dev $DUMMY",
+        "matchPattern": "backlog 0b 0p",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root"
+        ]
+    },
+    {
+        "id": "fd68",
+        "name": "Test TBF with CODEL Backlog Accounting in gso_skb case against underflow",
+        "category": [
+            "qdisc",
+            "tbf",
+            "codel"
+        ],
+        "plugins": {
+            "requires": [
+                "nsPlugin"
+            ]
+        },
+        "setup": [
+            "$IP link set dev $DUMMY up || true",
+            "$IP addr add 10.10.11.10/24 dev $DUMMY || true",
+            "$TC qdisc add dev $DUMMY root handle 1: tbf rate 8bit burst 100b latency 100ms",
+            "$TC qdisc replace dev $DUMMY handle 2: parent 1:1 codel limit 1000",
+            [
+                "ping -I $DUMMY -c2 10.10.11.11",
+                1
+            ],
+            "$TC qdisc change dev $DUMMY handle 2: parent 1:1 codel limit 1"
+        ],
+        "cmdUnderTest": "$TC qdisc del dev $DUMMY handle 2: parent 1:1",
+        "expExitCode": "0",
+        "verifyCmd": "$TC -s qdisc show dev $DUMMY",
+        "matchPattern": "backlog 0b 0p",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root"
+        ]
+    },
+    {
+        "id": "514e",
+        "name": "Test TBF with PIE Backlog Accounting in gso_skb case against underflow",
+        "category": [
+            "qdisc",
+            "tbf",
+            "pie"
+        ],
+        "plugins": {
+            "requires": [
+                "nsPlugin"
+            ]
+        },
+        "setup": [
+            "$IP link set dev $DUMMY up || true",
+            "$IP addr add 10.10.11.10/24 dev $DUMMY || true",
+            "$TC qdisc add dev $DUMMY root handle 1: tbf rate 8bit burst 100b latency 100ms",
+            "$TC qdisc replace dev $DUMMY handle 2: parent 1:1 pie limit 1000",
+            [
+                "ping -I $DUMMY -c2 10.10.11.11",
+                1
+            ],
+            "$TC qdisc change dev $DUMMY handle 2: parent 1:1 pie limit 1"
+        ],
+        "cmdUnderTest": "$TC qdisc del dev $DUMMY handle 2: parent 1:1",
+        "expExitCode": "0",
+        "verifyCmd": "$TC -s qdisc show dev $DUMMY",
+        "matchPattern": "backlog 0b 0p",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root"
+        ]
+    },
+    {
+        "id": "6c97",
+        "name": "Test TBF with FQ Backlog Accounting in gso_skb case against underflow",
+        "category": [
+            "qdisc",
+            "tbf",
+            "fq"
+        ],
+        "plugins": {
+            "requires": [
+                "nsPlugin"
+            ]
+        },
+        "setup": [
+            "$IP link set dev $DUMMY up || true",
+            "$IP addr add 10.10.11.10/24 dev $DUMMY || true",
+            "$TC qdisc add dev $DUMMY root handle 1: tbf rate 8bit burst 100b latency 100ms",
+            "$TC qdisc replace dev $DUMMY handle 2: parent 1:1 fq limit 1000",
+            [
+                "ping -I $DUMMY -c2 10.10.11.11",
+                1
+            ],
+            "$TC qdisc change dev $DUMMY handle 2: parent 1:1 fq limit 1"
+        ],
+        "cmdUnderTest": "$TC qdisc del dev $DUMMY handle 2: parent 1:1",
+        "expExitCode": "0",
+        "verifyCmd": "$TC -s qdisc show dev $DUMMY",
+        "matchPattern": "backlog 0b 0p",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root"
+        ]
+    },
+    {
+        "id": "5d0b",
+        "name": "Test TBF with FQ_CODEL Backlog Accounting in gso_skb case against underflow",
+        "category": [
+            "qdisc",
+            "tbf",
+            "fq_codel"
+        ],
+        "plugins": {
+            "requires": [
+                "nsPlugin"
+            ]
+        },
+        "setup": [
+            "$IP link set dev $DUMMY up || true",
+            "$IP addr add 10.10.11.10/24 dev $DUMMY || true",
+            "$TC qdisc add dev $DUMMY root handle 1: tbf rate 8bit burst 100b latency 100ms",
+            "$TC qdisc replace dev $DUMMY handle 2: parent 1:1 fq_codel limit 1000",
+            [
+                "ping -I $DUMMY -c2 10.10.11.11",
+                1
+            ],
+            "$TC qdisc change dev $DUMMY handle 2: parent 1:1 fq_codel limit 1"
+        ],
+        "cmdUnderTest": "$TC qdisc del dev $DUMMY handle 2: parent 1:1",
+        "expExitCode": "0",
+        "verifyCmd": "$TC -s qdisc show dev $DUMMY",
+        "matchPattern": "backlog 0b 0p",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root"
+        ]
+    },
+    {
+        "id": "21c3",
+        "name": "Test TBF with FQ_PIE Backlog Accounting in gso_skb case against underflow",
+        "category": [
+            "qdisc",
+            "tbf",
+            "fq_pie"
+        ],
+        "plugins": {
+            "requires": [
+                "nsPlugin"
+            ]
+        },
+        "setup": [
+            "$IP link set dev $DUMMY up || true",
+            "$IP addr add 10.10.11.10/24 dev $DUMMY || true",
+            "$TC qdisc add dev $DUMMY root handle 1: tbf rate 8bit burst 100b latency 100ms",
+            "$TC qdisc replace dev $DUMMY handle 2: parent 1:1 fq_pie limit 1000",
+            [
+                "ping -I $DUMMY -c2 10.10.11.11",
+                1
+            ],
+            "$TC qdisc change dev $DUMMY handle 2: parent 1:1 fq_pie limit 1"
+        ],
+        "cmdUnderTest": "$TC qdisc del dev $DUMMY handle 2: parent 1:1",
+        "expExitCode": "0",
+        "verifyCmd": "$TC -s qdisc show dev $DUMMY",
+        "matchPattern": "backlog 0b 0p",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root"
+        ]
+    },
     {
         "id": "a4bb",
         "name": "Test FQ_CODEL with HTB parent - force packet drop with empty queue",
-- 
2.43.0

* Re: [PATCH net v4 2/2] selftests/tc-testing: Check backlog stats in gso_skb case
  2025-07-27 23:57 ` [PATCH net v4 2/2] selftests/tc-testing: Check backlog stats in gso_skb case William Liu
@ 2025-07-30 17:54   ` Cong Wang
  2025-07-30 17:59     ` William Liu
  0 siblings, 1 reply; 14+ messages in thread
From: Cong Wang @ 2025-07-30 17:54 UTC (permalink / raw)
  To: William Liu
  Cc: netdev, jhs, pabeni, kuba, jiri, davem, edumazet, horms, savy,
	victor

On Sun, Jul 27, 2025 at 11:57:10PM +0000, William Liu wrote:
> +        "setup": [
> +            "$IP link set dev $DUMMY up || true",
> +            "$IP addr add 10.10.11.10/24 dev $DUMMY || true",
> +            "$TC qdisc add dev $DUMMY root handle 1: tbf rate 8bit burst 100b latency 100ms",
> +            "$TC qdisc replace dev $DUMMY handle 2: parent 1:1 hhf limit 1000",
> +            [
> +                "ping -I $DUMMY -c2 10.10.11.11",
> +                1
> +            ],

Sorry, I still have trouble understanding the magic "1" here, and I
can't find any other selftest using it. So why do we need it here?

You said it is in the original reproducer, but the original reproducer
is not part of tc-testing. This does not explain it to me.

Thanks.


* Re: [PATCH net v4 2/2] selftests/tc-testing: Check backlog stats in gso_skb case
  2025-07-30 17:54   ` Cong Wang
@ 2025-07-30 17:59     ` William Liu
  0 siblings, 0 replies; 14+ messages in thread
From: William Liu @ 2025-07-30 17:59 UTC (permalink / raw)
  To: Cong Wang
  Cc: netdev, jhs, pabeni, kuba, jiri, davem, edumazet, horms, savy,
	victor

On Wednesday, July 30th, 2025 at 5:54 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:

> 
> 
> On Sun, Jul 27, 2025 at 11:57:10PM +0000, William Liu wrote:
> 
> > + "setup": [
> > + "$IP link set dev $DUMMY up || true",
> > + "$IP addr add 10.10.11.10/24 dev $DUMMY || true",
> > + "$TC qdisc add dev $DUMMY root handle 1: tbf rate 8bit burst 100b latency 100ms",
> > + "$TC qdisc replace dev $DUMMY handle 2: parent 1:1 hhf limit 1000",
> > + [
> > + "ping -I $DUMMY -c2 10.10.11.11",
> > + 1
> > + ],
> 
> 
> Sorry, I still have trouble understanding the magic "1" here, and I
> can't find any other selftest using it. So why do we need it here?
> 
> You said it is in the original reproducer, but the original reproducer
> is not part of tc-testing. This does not explain it to me.
> 
> Thanks.

Oh, that was the magic number you were referring to? That is the expected return code for ping in that case, and must be specified if a command in the setup returns non-zero. 

This behavior is documented in the tc-testing READMEs, under creating-testcases/AddingTestCases.txt (section SETUP/TEARDOWN errors).
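
For reference, a setup entry is either a plain command string (which
must exit 0) or a list whose first element is the command and whose
remaining element(s) are the acceptable exit code(s) - something like:

    "setup": [
        "$TC qdisc add dev $DUMMY root handle 1: tbf rate 8bit burst 100b latency 100ms",
        [
            "ping -I $DUMMY -c2 10.10.11.11",
            1
        ]
    ]

Since 10.10.11.11 is intentionally unreachable, ping exits with 1, and
listing that code keeps tdc from treating the setup as failed.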


* Re: [PATCH net v4 1/2] net/sched: Fix backlog accounting in qdisc_dequeue_internal
  2025-07-27 23:56 [PATCH net v4 1/2] net/sched: Fix backlog accounting in qdisc_dequeue_internal William Liu
  2025-07-27 23:57 ` [PATCH net v4 2/2] selftests/tc-testing: Check backlog stats in gso_skb case William Liu
@ 2025-08-08 21:27 ` Jakub Kicinski
  2025-08-10 21:06   ` William Liu
  1 sibling, 1 reply; 14+ messages in thread
From: Jakub Kicinski @ 2025-08-08 21:27 UTC (permalink / raw)
  To: William Liu
  Cc: netdev, jhs, xiyou.wangcong, pabeni, jiri, davem, edumazet, horms,
	savy, victor

On Sun, 27 Jul 2025 23:56:32 +0000 William Liu wrote:
> Special care is taken for fq_codel_dequeue to account for the
> qdisc_tree_reduce_backlog call in its dequeue handler. The
> cstats reset is moved from the end to the beginning of
> fq_codel_dequeue, so the change handler can use cstats for
> proper backlog reduction accounting purposes. The drop_len and
> drop_count fields are not used elsewhere so this reordering in
> fq_codel_dequeue is ok.

Using local variables like we do in other qdiscs will not work?
I think your change will break drop accounting during normal dequeue?

> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 638948be4c50..a24094a638dc 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -1038,10 +1038,15 @@ static inline struct sk_buff *qdisc_dequeue_internal(struct Qdisc *sch, bool dir
>  	skb = __skb_dequeue(&sch->gso_skb);
>  	if (skb) {
>  		sch->q.qlen--;
> +		qdisc_qstats_backlog_dec(sch, skb);
> +		return skb;
> +	}
> +	if (direct) {
> +		skb = __qdisc_dequeue_head(&sch->q);
> +		if (skb)
> +			qdisc_qstats_backlog_dec(sch, skb);
>  		return skb;
>  	}
> -	if (direct)
> -		return __qdisc_dequeue_head(&sch->q);
>  	else

sorry for a late nit, it wasn't very clear from the diff but
we end up with

	if (direct) {
		...
	}
	else
		return ..;

Please reformat:

	if (direct) {
		...
	} else {
		...
	}

>  		return sch->dequeue(sch);
>  }

> diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
> index 902ff5470607..986e71e3362c 100644
> --- a/net/sched/sch_fq.c
> +++ b/net/sched/sch_fq.c
> @@ -1014,10 +1014,10 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt,
>  		     struct netlink_ext_ack *extack)
>  {
>  	struct fq_sched_data *q = qdisc_priv(sch);
> +	unsigned int prev_qlen, prev_backlog;
>  	struct nlattr *tb[TCA_FQ_MAX + 1];
> -	int err, drop_count = 0;
> -	unsigned drop_len = 0;
>  	u32 fq_log;
> +	int err;
>  
>  	err = nla_parse_nested_deprecated(tb, TCA_FQ_MAX, opt, fq_policy,
>  					  NULL);
> @@ -1135,16 +1135,16 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt,
>  		err = fq_resize(sch, fq_log);
>  		sch_tree_lock(sch);
>  	}
> +
> +	prev_qlen = sch->q.qlen;
> +	prev_backlog = sch->qstats.backlog;
>  	while (sch->q.qlen > sch->limit) {
>  		struct sk_buff *skb = qdisc_dequeue_internal(sch, false);
>  
> -		if (!skb)
> -			break;

The break condition is removed to align the code across the qdiscs?

> -		drop_len += qdisc_pkt_len(skb);
>  		rtnl_kfree_skbs(skb, skb);
> -		drop_count++;
>  	}
> -	qdisc_tree_reduce_backlog(sch, drop_count, drop_len);
> +	qdisc_tree_reduce_backlog(sch, prev_qlen - sch->q.qlen,
> +				  prev_backlog - sch->qstats.backlog);

There is no real change in the math here, right?
Again, you're just changing this to align across the qdiscs?
-- 
pw-bot: cr


* Re: [PATCH net v4 1/2] net/sched: Fix backlog accounting in qdisc_dequeue_internal
  2025-08-08 21:27 ` [PATCH net v4 1/2] net/sched: Fix backlog accounting in qdisc_dequeue_internal Jakub Kicinski
@ 2025-08-10 21:06   ` William Liu
  2025-08-11 15:29     ` Jakub Kicinski
  0 siblings, 1 reply; 14+ messages in thread
From: William Liu @ 2025-08-10 21:06 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, jhs, xiyou.wangcong, pabeni, jiri, davem, edumazet, horms,
	savy, victor

On Friday, August 8th, 2025 at 9:27 PM, Jakub Kicinski <kuba@kernel.org> wrote:

> 
> 
> On Sun, 27 Jul 2025 23:56:32 +0000 William Liu wrote:
> 
> > Special care is taken for fq_codel_dequeue to account for the
> > qdisc_tree_reduce_backlog call in its dequeue handler. The
> > cstats reset is moved from the end to the beginning of
> > fq_codel_dequeue, so the change handler can use cstats for
> > proper backlog reduction accounting purposes. The drop_len and
> > drop_count fields are not used elsewhere so this reordering in
> > fq_codel_dequeue is ok.
> 
> 
> Using local variables like we do in other qdiscs will not work?
> I think your change will break drop accounting during normal dequeue?

Can you elaborate on this? 

I just moved the reset of two cstats fields from the dequeue handler epilogue to the prologue. Those specific cstats fields are not used elsewhere so they should be fine, but we need to accumulate their values during limit adjustment. Otherwise the limit adjustment loop could perform erroneous accounting in the final qdisc_tree_reduce_backlog because the dequeue path could have already triggered qdisc_tree_reduce_backlog calls.
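
As a concrete example (hypothetical numbers): say sch->q.qlen is 10 and the new limit is 8. The first qdisc_dequeue_internal call enters fq_codel_dequeue, which drops 2 packets through the codel drop path (already reporting them via its own qdisc_tree_reduce_backlog call) before returning an skb, leaving qlen at 7. A plain prev_qlen - sch->q.qlen would then report 3 packets in the final qdisc_tree_reduce_backlog, double counting the 2 already reported by the dequeue path; accumulating cstats across the loop lets the change handler subtract those 2 and report only the 1 skb it freed itself.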

> 
> > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> > index 638948be4c50..a24094a638dc 100644
> > --- a/include/net/sch_generic.h
> > +++ b/include/net/sch_generic.h
> > @@ -1038,10 +1038,15 @@ static inline struct sk_buff *qdisc_dequeue_internal(struct Qdisc *sch, bool dir
> > skb = __skb_dequeue(&sch->gso_skb);
> > if (skb) {
> > sch->q.qlen--;
> > + qdisc_qstats_backlog_dec(sch, skb);
> > + return skb;
> > + }
> > + if (direct) {
> > + skb = __qdisc_dequeue_head(&sch->q);
> > + if (skb)
> > + qdisc_qstats_backlog_dec(sch, skb);
> > return skb;
> > }
> > - if (direct)
> > - return __qdisc_dequeue_head(&sch->q);
> > else
> 
> 
> sorry for a late nit, it wasn't very clear from the diff but
> we end up with
> 
> if (direct) {
> ...
> }
> else
> return ..;
> 
> Please reformat:
> 
> if (direct) {
> ...
> } else {
> ...
> }
> 

Ok noted.

> > return sch->dequeue(sch);
> > }
> 
> > diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
> > index 902ff5470607..986e71e3362c 100644
> > --- a/net/sched/sch_fq.c
> > +++ b/net/sched/sch_fq.c
> > @@ -1014,10 +1014,10 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt,
> > struct netlink_ext_ack *extack)
> > {
> > struct fq_sched_data *q = qdisc_priv(sch);
> > + unsigned int prev_qlen, prev_backlog;
> > struct nlattr *tb[TCA_FQ_MAX + 1];
> > - int err, drop_count = 0;
> > - unsigned drop_len = 0;
> > u32 fq_log;
> > + int err;
> > 
> > err = nla_parse_nested_deprecated(tb, TCA_FQ_MAX, opt, fq_policy,
> > NULL);
> > @@ -1135,16 +1135,16 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt,
> > err = fq_resize(sch, fq_log);
> > sch_tree_lock(sch);
> > }
> > +
> > + prev_qlen = sch->q.qlen;
> > + prev_backlog = sch->qstats.backlog;
> > while (sch->q.qlen > sch->limit) {
> > struct sk_buff *skb = qdisc_dequeue_internal(sch, false);
> > 
> > - if (!skb)
> > - break;
> 
> 
> The break condition is removed to align the code across the qdiscs?

That break is no longer needed because qdisc_dequeue_internal handles all the length and backlog size adjustments. The check existed there because of the qdisc_pkt_len call.

> 
> > - drop_len += qdisc_pkt_len(skb);
> > rtnl_kfree_skbs(skb, skb);
> > - drop_count++;
> > }
> > - qdisc_tree_reduce_backlog(sch, drop_count, drop_len);
> > + qdisc_tree_reduce_backlog(sch, prev_qlen - sch->q.qlen,
> > + prev_backlog - sch->qstats.backlog);
> 
> 
> There is no real change in the math here, right?
> Again, you're just changing this to align across the qdiscs?

Yep, aside from using a properly updated qlen and backlog from the revamped qdisc_dequeue_internal.

> --
> pw-bot: cr


* Re: [PATCH net v4 1/2] net/sched: Fix backlog accounting in qdisc_dequeue_internal
  2025-08-10 21:06   ` William Liu
@ 2025-08-11 15:29     ` Jakub Kicinski
  2025-08-11 16:52       ` William Liu
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Kicinski @ 2025-08-11 15:29 UTC (permalink / raw)
  To: William Liu
  Cc: netdev, jhs, xiyou.wangcong, pabeni, jiri, davem, edumazet, horms,
	savy, victor

On Sun, 10 Aug 2025 21:06:57 +0000 William Liu wrote:
> > On Sun, 27 Jul 2025 23:56:32 +0000 William Liu wrote:
> >   
> > > Special care is taken for fq_codel_dequeue to account for the
> > > qdisc_tree_reduce_backlog call in its dequeue handler. The
> > > cstats reset is moved from the end to the beginning of
> > > fq_codel_dequeue, so the change handler can use cstats for
> > > proper backlog reduction accounting purposes. The drop_len and
> > > drop_count fields are not used elsewhere so this reordering in
> > > fq_codel_dequeue is ok.  
> > 
> > 
> > Using local variables like we do in other qdiscs will not work?
> > I think your change will break drop accounting during normal dequeue?  
> 
> Can you elaborate on this? 
> 
> I just moved the reset of two cstats fields from the dequeue handler
> epilogue to the prologue. Those specific cstats fields are not used
> elsewhere so they should be fine, 

That's the disconnect. AFAICT they are passed to codel_dequeue(),
and will be used during normal dequeue, as part of normal active
queue management under traffic.
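
(The call site in fq_codel_dequeue() looks roughly like:

	skb = codel_dequeue(sch, &q->backlogs[flow - q->flows], &q->cparams,
			    &flow->cvars, &q->cstats, qdisc_pkt_len,
			    codel_get_enqueue_time, drop_func, dequeue_func);

so codel's AQM drops land in q->cstats on every normal dequeue.)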

> but we need to accumulate their
> values during limit adjustment. Otherwise the limit adjustment loop
> could perform erroneous accounting in the final
> qdisc_tree_reduce_backlog because the dequeue path could have already
> triggered qdisc_tree_reduce_backlog calls.
>
> > > diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
> > > index 902ff5470607..986e71e3362c 100644
> > > --- a/net/sched/sch_fq.c
> > > +++ b/net/sched/sch_fq.c
> > > @@ -1014,10 +1014,10 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt,
> > > struct netlink_ext_ack *extack)
> > > {
> > > struct fq_sched_data *q = qdisc_priv(sch);
> > > + unsigned int prev_qlen, prev_backlog;
> > > struct nlattr *tb[TCA_FQ_MAX + 1];
> > > - int err, drop_count = 0;
> > > - unsigned drop_len = 0;
> > > u32 fq_log;
> > > + int err;
> > > 
> > > err = nla_parse_nested_deprecated(tb, TCA_FQ_MAX, opt, fq_policy,
> > > NULL);
> > > @@ -1135,16 +1135,16 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt,
> > > err = fq_resize(sch, fq_log);
> > > sch_tree_lock(sch);
> > > }
> > > +
> > > + prev_qlen = sch->q.qlen;
> > > + prev_backlog = sch->qstats.backlog;
> > > while (sch->q.qlen > sch->limit) {
> > > struct sk_buff *skb = qdisc_dequeue_internal(sch, false);
> > > 
> > > - if (!skb)
> > > - break;  
> > 
> > 
> > The break condition is removed to align the code across the qdiscs?
> 
> That break is no longer needed because qdisc_dequeue_internal handles
> all the length and backlog size adjustments. The check existed there
> because of the qdisc_pkt_len call.

Ack, tho, theoretically the break also prevents an infinite loop.
Change is fine, but worth calling this out in the commit message,
I reckon.

> > > - drop_len += qdisc_pkt_len(skb);
> > > rtnl_kfree_skbs(skb, skb);
> > > - drop_count++;
> > > }
> > > - qdisc_tree_reduce_backlog(sch, drop_count, drop_len);
> > > + qdisc_tree_reduce_backlog(sch, prev_qlen - sch->q.qlen,
> > > + prev_backlog - sch->qstats.backlog);  
> > 
> > 
> > There is no real change in the math here, right?
> > Again, you're just changing this to align across the qdiscs?  
> 
> Yep, aside from using a properly updated qlen and backlog from the
> revamped qdisc_dequeue_internal.

Personal preference, but my choice would be to follow the FQ code,
and count the skbs as they are freed. But up to you, since we hold
the lock supposedly the changes to backlog can only be due to our
purging.


* Re: [PATCH net v4 1/2] net/sched: Fix backlog accounting in qdisc_dequeue_internal
  2025-08-11 15:29     ` Jakub Kicinski
@ 2025-08-11 16:52       ` William Liu
  2025-08-11 17:24         ` Jakub Kicinski
  0 siblings, 1 reply; 14+ messages in thread
From: William Liu @ 2025-08-11 16:52 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, jhs, xiyou.wangcong, pabeni, jiri, davem, edumazet, horms,
	savy, victor

On Monday, August 11th, 2025 at 3:30 PM, Jakub Kicinski <kuba@kernel.org> wrote:

> 
> 
> On Sun, 10 Aug 2025 21:06:57 +0000 William Liu wrote:
> 
> > > On Sun, 27 Jul 2025 23:56:32 +0000 William Liu wrote:
> > > 
> > > > Special care is taken for fq_codel_dequeue to account for the
> > > > qdisc_tree_reduce_backlog call in its dequeue handler. The
> > > > cstats reset is moved from the end to the beginning of
> > > > fq_codel_dequeue, so the change handler can use cstats for
> > > > proper backlog reduction accounting purposes. The drop_len and
> > > > drop_count fields are not used elsewhere so this reordering in
> > > > fq_codel_dequeue is ok.
> > > 
> > > Using local variables like we do in other qdiscs will not work?
> > > I think your change will break drop accounting during normal dequeue?
> > 
> > Can you elaborate on this?
> > 
> > I just moved the reset of two cstats fields from the dequeue handler
> > epilogue to the prologue. Those specific cstats fields are not used
> > elsewhere so they should be fine,
> 
> 
> That's the disconnect. AFAICT they are passed to codel_dequeue(),
> and will be used during normal dequeue, as part of normal active
> queue management under traffic.
> 

Yes, that is the only place those values are used. From my understanding, codel_dequeue is only called in fq_codel_dequeue. So moving the reset from the dequeue epilogue to the dequeue prologue should be fine as the same behavior is kept - the same values should always be used by codel_dequeue.

Is there a case I am not seeing? If so, I can just add additional fields to the fq_codel_sched_data, but wanted to avoid doing that for this one edge case.

> > but we need to accumulate their
> > values during limit adjustment. Otherwise the limit adjustment loop
> > could perform erroneous accounting in the final
> > qdisc_tree_reduce_backlog because the dequeue path could have already
> > triggered qdisc_tree_reduce_backlog calls.
> > 
> > > > diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
> > > > index 902ff5470607..986e71e3362c 100644
> > > > --- a/net/sched/sch_fq.c
> > > > +++ b/net/sched/sch_fq.c
> > > > @@ -1014,10 +1014,10 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt,
> > > > struct netlink_ext_ack *extack)
> > > > {
> > > > struct fq_sched_data *q = qdisc_priv(sch);
> > > > + unsigned int prev_qlen, prev_backlog;
> > > > struct nlattr *tb[TCA_FQ_MAX + 1];
> > > > - int err, drop_count = 0;
> > > > - unsigned drop_len = 0;
> > > > u32 fq_log;
> > > > + int err;
> > > > 
> > > > err = nla_parse_nested_deprecated(tb, TCA_FQ_MAX, opt, fq_policy,
> > > > NULL);
> > > > @@ -1135,16 +1135,16 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt,
> > > > err = fq_resize(sch, fq_log);
> > > > sch_tree_lock(sch);
> > > > }
> > > > +
> > > > + prev_qlen = sch->q.qlen;
> > > > + prev_backlog = sch->qstats.backlog;
> > > > while (sch->q.qlen > sch->limit) {
> > > > struct sk_buff *skb = qdisc_dequeue_internal(sch, false);
> > > > 
> > > > - if (!skb)
> > > > - break;
> > > 
> > > The break condition is removed to align the code across the qdiscs?
> > 
> > That break is no longer needed because qdisc_dequeue_internal handles
> > all the length and backlog size adjustments. The check existed there
> > because of the qdisc_pkt_len call.
> 
> 
> Ack, tho, theoretically the break also prevents an infinite loop.
> Change is fine, but worth calling this out in the commit message,
> I reckon.
> 
> > > > - drop_len += qdisc_pkt_len(skb);
> > > > rtnl_kfree_skbs(skb, skb);
> > > > - drop_count++;
> > > > }
> > > > - qdisc_tree_reduce_backlog(sch, drop_count, drop_len);
> > > > + qdisc_tree_reduce_backlog(sch, prev_qlen - sch->q.qlen,
> > > > + prev_backlog - sch->qstats.backlog);
> > > 
> > > There is no real change in the math here, right?
> > > Again, you're just changing this to align across the qdiscs?
> > 
> > Yep, aside from using a properly updated qlen and backlog from the
> > revamped qdisc_dequeue_internal.
> 
> 
> Personal preference, but my choice would be to follow the FQ code,
> and count the skbs as they are freed. But up to you, since we hold
> the lock supposedly the changes to backlog can only be due to our
> purging.


* Re: [PATCH net v4 1/2] net/sched: Fix backlog accounting in qdisc_dequeue_internal
  2025-08-11 16:52       ` William Liu
@ 2025-08-11 17:24         ` Jakub Kicinski
  2025-08-11 17:51           ` William Liu
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Kicinski @ 2025-08-11 17:24 UTC (permalink / raw)
  To: William Liu
  Cc: netdev, jhs, xiyou.wangcong, pabeni, jiri, davem, edumazet, horms,
	savy, victor

On Mon, 11 Aug 2025 16:52:51 +0000 William Liu wrote:
> > > Can you elaborate on this?
> > > 
> > > I just moved the reset of two cstats fields from the dequeue handler
> > > epilogue to the prologue. Those specific cstats fields are not used
> > > elsewhere so they should be fine,  
> > 
> > 
> > That's the disconnect. AFAICT they are passed to codel_dequeue(),
> > and will be used during normal dequeue, as part of normal active
> > queue management under traffic.
> >   
> 
> Yes, that is the only place those values are used. From my
> understanding, codel_dequeue is only called in fq_codel_dequeue. So
> moving the reset from the dequeue epilogue to the dequeue prologue
> should be fine as the same behavior is kept - the same values should
> always be used by codel_dequeue.
> 
> Is there a case I am not seeing? If so, I can just add additional
> fields to the fq_codel_sched_data, but wanted to avoid doing that for
> this one edge case.

This sort of separation of logic is very error prone in general.
If you're asking for a specific bug that would exist with your 
patch - I believe that two subsequent fq_codel_change() calls,
first one reducing the limit, the other one _not_ reducing (and
therefore never invoking dequeue) will adjust the backlog twice.

As I commented in the previous message - wouldn't counting the
packets we actually dequeue solve this problem? smth like:

	pkts = 0;
	bytes = 0;
 	while (sch->q.qlen > sch->limit ||
 	       q->memory_usage > q->memory_limit) {
 		struct sk_buff *skb = qdisc_dequeue_internal(sch, false);
 
		pkts++;
		bytes += qdisc_pkt_len(skb);
 		rtnl_kfree_skbs(skb, skb);
 	}
	qdisc_tree_reduce_backlog(sch, pkts, bytes);

? "Conceptually" we are only responsible for adjusting the backlog
for skbs we actually gave to kfree_skb(). 


* Re: [PATCH net v4 1/2] net/sched: Fix backlog accounting in qdisc_dequeue_internal
  2025-08-11 17:24         ` Jakub Kicinski
@ 2025-08-11 17:51           ` William Liu
  2025-08-12  0:51             ` Jakub Kicinski
  0 siblings, 1 reply; 14+ messages in thread
From: William Liu @ 2025-08-11 17:51 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, jhs, xiyou.wangcong, pabeni, jiri, davem, edumazet, horms,
	savy, victor

On Monday, August 11th, 2025 at 5:24 PM, Jakub Kicinski <kuba@kernel.org> wrote:

> 
> 
> On Mon, 11 Aug 2025 16:52:51 +0000 William Liu wrote:
> 
> > > > Can you elaborate on this?
> > > > 
> > > > I just moved the reset of two cstats fields from the dequeue handler
> > > > epilogue to the prologue. Those specific cstats fields are not used
> > > > elsewhere so they should be fine,
> > > 
> > > That's the disconnect. AFAICT they are passed to codel_dequeue(),
> > > and will be used during normal dequeue, as part of normal active
> > > queue management under traffic.
> > 
> > Yes, that is the only place those values are used. From my
> > understanding, codel_dequeue is only called in fq_codel_dequeue. So
> > moving the reset from the dequeue epilogue to the dequeue prologue
> > should be fine as the same behavior is kept - the same values should
> > always be used by codel_dequeue.
> > 
> > Is there a case I am not seeing? If so, I can just add additional
> > fields to the fq_codel_sched_data, but wanted to avoid doing that for
> > this one edge case.
> 
> 
> This sort of separation of logic is very error prone in general.
> If you're asking for a specific bug that would exist with your
> patch - I believe that two subsequent fq_codel_change() calls,
> first one reducing the limit, the other one not reducing (and
> therefore never invoking dequeue) will adjust the backlog twice.
> 

In that case, I think the code in the limit adjustment while loop would never run, so the backlog reduction would only happen with arguments of 0. But yes, I agree that this approach is not ideal.

> As I commented in the previous message - wouldn't counting the
> packets we actually dequeue solve this problem? smth like:
> 
> pkts = 0;
> bytes = 0;
> while (sch->q.qlen > sch->limit ||
> 
> q->memory_usage > q->memory_limit) {
> 
> struct sk_buff *skb = qdisc_dequeue_internal(sch, false);
> 
> pkts++;
> bytes += qdisc_pkt_len(skb);
> rtnl_kfree_skbs(skb, skb);
> }
> qdisc_tree_reduce_backlog(sch, pkts, bytes);
> 
> ? "Conceptually" we are only responsible for adjusting the backlog
> for skbs we actually gave to kfree_skb().

I think the issue here is qdisc_dequeue_internal can call the actual dequeue handler, and fq_codel_dequeue would have already made a qdisc_tree_reduce_backlog call [1] when cstats.drop_count is non-zero. Wouldn't we be double counting packets and bytes for qdisc_tree_reduce_backlog after the limit adjustment loop with this approach?

[1] https://elixir.bootlin.com/linux/v6.16/source/net/sched/sch_fq_codel.c#L320


* Re: [PATCH net v4 1/2] net/sched: Fix backlog accounting in qdisc_dequeue_internal
  2025-08-11 17:51           ` William Liu
@ 2025-08-12  0:51             ` Jakub Kicinski
  2025-08-12  2:10               ` William Liu
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Kicinski @ 2025-08-12  0:51 UTC (permalink / raw)
  To: William Liu
  Cc: netdev, jhs, xiyou.wangcong, pabeni, jiri, davem, edumazet, horms,
	savy, victor

On Mon, 11 Aug 2025 17:51:39 +0000 William Liu wrote:
> > This sort of separation of logic is very error prone in general.
> > If you're asking for a specific bug that would exist with your
> > patch - I believe that two subsequent fq_codel_change() calls,
> > first one reducing the limit, the other one not reducing (and
> > therefore never invoking dequeue) will adjust the backlog twice.
> 
> In that case, I think the code in the limit adjustment while loop
> would never run, so the backlog reduction would only happen with arguments
> of 0.

True.

> But yes, I agree that this approach is not ideal.
> > As I commented in the previous message - wouldn't counting the
> > packets we actually dequeue solve this problem? smth like:
> > 
> > pkts = 0;
> > bytes = 0;
> > while (sch->q.qlen > sch->limit ||
> > 
> > q->memory_usage > q->memory_limit) {
> > 
> > struct sk_buff *skb = qdisc_dequeue_internal(sch, false);
> > 
> > pkts++;
> > bytes += qdisc_pkt_len(skb);
> > rtnl_kfree_skbs(skb, skb);
> > }
> > qdisc_tree_reduce_backlog(sch, pkts, bytes);
> > 
> > ? "Conceptually" we are only responsible for adjusting the backlog
> > for skbs we actually gave to kfree_skb().  
> 
> I think the issue here is qdisc_dequeue_internal can call the actual
> dequeue handler, and fq_codel_dequeue would have already made a
> qdisc_tree_reduce_backlog call [1] when cstats.drop_count is
> non-zero. Wouldn't we be double counting packets and bytes for
> qdisc_tree_reduce_backlog after the limit adjustment loop with this
> approach?

AFAICT only if the backlog adjustment is using the prev_qlen,
prev_backlog approach, which snapshots the backlog. In that case,
yes, the "internal drops" will mess up the count. 

My naive mental model is that we're only responsible for adjusting
the backlog for skbs we actually dequeued. IOW the skbs that
qdisc_dequeue_internal() returned to the "limit trimming loop".
Because we are "deleting" them, potentially in the middle of
a qdisc hierarchy. If the qdisc decides to delete some more, 
it's on the hook for adjusting for those portions.

I can't find any bugs in your prev_* snapshot approach, but it does
feel like a layering violation to me.


* Re: [PATCH net v4 1/2] net/sched: Fix backlog accounting in qdisc_dequeue_internal
  2025-08-12  0:51             ` Jakub Kicinski
@ 2025-08-12  2:10               ` William Liu
  2025-08-12 14:38                 ` Jakub Kicinski
  0 siblings, 1 reply; 14+ messages in thread
From: William Liu @ 2025-08-12  2:10 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, jhs, xiyou.wangcong, pabeni, jiri, davem, edumazet, horms,
	savy, victor

On Tuesday, August 12th, 2025 at 12:51 AM, Jakub Kicinski <kuba@kernel.org> wrote:

> 
> 
> On Mon, 11 Aug 2025 17:51:39 +0000 William Liu wrote:
> 
> > > This sort of separation of logic is very error prone in general.
> > > If you're asking for a specific bug that would exist with your
> > > patch - I believe that two subsequent fq_codel_change() calls,
> > > first one reducing the limit, the other one not reducing (and
> > > therefore never invoking dequeue) will adjust the backlog twice.
> > 
> > In that case, I think the code in the limit adjustment while loop
> > would never run, so the backlog reduction would only happen with arguments
> > of 0.
> 
> 
> True.
> 
> > But yes, I agree that this approach is not ideal.
> > 
> > > As I commented in the previous message - wouldn't counting the
> > > packets we actually dequeue solve this problem? smth like:
> > > 
> > > pkts = 0;
> > > bytes = 0;
> > > while (sch->q.qlen > sch->limit ||
> > > 
> > > q->memory_usage > q->memory_limit) {
> > > 
> > > struct sk_buff *skb = qdisc_dequeue_internal(sch, false);
> > > 
> > > pkts++;
> > > bytes += qdisc_pkt_len(skb);
> > > rtnl_kfree_skbs(skb, skb);
> > > }
> > > qdisc_tree_reduce_backlog(sch, pkts, bytes);
> > > 
> > > ? "Conceptually" we are only responsible for adjusting the backlog
> > > for skbs we actually gave to kfree_skb().
> > 
> > I think the issue here is qdisc_dequeue_internal can call the actual
> > dequeue handler, and fq_codel_dequeue would have already made a
> > qdisc_tree_reduce_backlog call [1] when cstats.drop_count is
> > non-zero. Wouldn't we be double counting packets and bytes for
> > qdisc_tree_reduce_backlog after the limit adjustment loop with this
> > approach?
> 
> 
> AFAICT only if the backlog adjustment is using the prev_qlen,
> prev_backlog approach, which snapshots the backlog. In that case,
> yes, the "internal drops" will mess up the count.

Yep, that's why I added the dropped_qlen and dropped_backlog variables, though that is not a very pretty solution.

But even looking at the method you suggested (copy pasting for reference):

	pkts = 0;
	bytes = 0;
 	while (sch->q.qlen > sch->limit ||
 	       q->memory_usage > q->memory_limit) {
 		struct sk_buff *skb = qdisc_dequeue_internal(sch, false);
 
		pkts++;
		bytes += qdisc_pkt_len(skb);
 		rtnl_kfree_skbs(skb, skb);
 	}
	qdisc_tree_reduce_backlog(sch, pkts, bytes);

qdisc_dequeue_internal can trigger fq_codel_dequeue, which can trigger qdisc_tree_reduce_backlog before returning (the only qdisc out of these that does so in its dequeue handler). 

Let's say the limit only goes down by one, and packet A is at the front of the queue. qdisc_dequeue_internal takes the dequeue path, and fq_codel_dequeue triggers a qdisc_tree_reduce_backlog from that packet before returning the skb. Would this final qdisc_tree_reduce_backlog after the limit drop not double count? 

> My naive mental model is that we're only responsible for adjusting
> the backlog for skbs we actually dequeued. IOW the skbs that
> qdisc_dequeue_internal() returned to the "limit trimming loop".
> Because we are "deleting" them, potentially in the middle of
> a qdisc hierarchy. If the qdisc decides to delete some more,
> it's on the hook for adjusting for those portions.
> 

Right, but the dequeue handler calling qdisc_tree_reduce_backlog might have already adjusted the backlog for an skb before qdisc_dequeue_internal returns in fq_codel. This is what makes the handling here messier. 

I could be missing something, but the original code in fq_codel was very strange.

> I can't find any bugs in your prev_* snapshot approach, but it does
> feel like a layering violation to me.


* Re: [PATCH net v4 1/2] net/sched: Fix backlog accounting in qdisc_dequeue_internal
  2025-08-12  2:10               ` William Liu
@ 2025-08-12 14:38                 ` Jakub Kicinski
  2025-08-12 16:59                   ` William Liu
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Kicinski @ 2025-08-12 14:38 UTC (permalink / raw)
  To: William Liu
  Cc: netdev, jhs, xiyou.wangcong, pabeni, jiri, davem, edumazet, horms,
	savy, victor

On Tue, 12 Aug 2025 02:10:02 +0000 William Liu wrote:
> > AFAICT only if the backlog adjustment is using the prev_qlen,
> > prev_backlog approach, which snapshots the backlog. In that case,
> > yes, the "internal drops" will mess up the count.  
> 
> Yep, that's why I added the dropped_qlen and dropped_backlog
> variables, though that is not a very pretty solution.
> 
> But even looking at the method you suggested (copy pasting for
> reference):
> 
> 	pkts = 0;
> 	bytes = 0;
>  	while (sch->q.qlen > sch->limit ||
>  	       q->memory_usage > q->memory_limit) {
>  		struct sk_buff *skb = qdisc_dequeue_internal(sch, false); 
> 		pkts++;
> 		bytes += qdisc_pkt_len(skb);
>  		rtnl_kfree_skbs(skb, skb);
>  	}
> 	qdisc_tree_reduce_backlog(sch, pkts, bytes);
> 
> qdisc_dequeue_internal can trigger fq_codel_dequeue, which can
> trigger qdisc_tree_reduce_backlog before returning (the only qdisc
> out of these that does so in its dequeue handler). 
> 
> Let's say the limit only goes down by one, and packet A is at the
> front of the queue. qdisc_dequeue_internal takes the dequeue path,
> and fq_codel_dequeue triggers a qdisc_tree_reduce_backlog from that
> packet before returning the skb. Would this final
> qdisc_tree_reduce_backlog after the limit drop not double count? 

The packets that got counted in qdisc_tree_reduce_backlog() inside
->dequeue are freed immediately via

  drop_func()
    kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_CONGESTED);

In the scenario you're describing, ->dequeue should return NULL.
If that's possible, then we have another bug here :$

Normally backlogs get adjusted as the packet travels down the hierarchy
thru the parent chain. ->dequeue is part of this normal path so skbs it
returns are still in the parent's backlogs. qdisc_tree_reduce_backlog()
is only called when we need to do something to an skb outside of the
normal ->enqueue/->dequeue flow that iterates the hierarchy.
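
For instance, on the normal path a parent like tbf adjusts its own
counters as it pulls an skb from its child - roughly, from tbf's
dequeue (abbreviated):

	skb = qdisc_dequeue_peeked(q->qdisc);
	if (unlikely(!skb))
		return NULL;
	...
	qdisc_qstats_backlog_dec(sch, skb);
	sch->q.qlen--;
	qdisc_bstats_update(sch, skb);
	return skb;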


* Re: [PATCH net v4 1/2] net/sched: Fix backlog accounting in qdisc_dequeue_internal
  2025-08-12 14:38                 ` Jakub Kicinski
@ 2025-08-12 16:59                   ` William Liu
  0 siblings, 0 replies; 14+ messages in thread
From: William Liu @ 2025-08-12 16:59 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, jhs, xiyou.wangcong, pabeni, jiri, davem, edumazet, horms,
	savy, victor

On Tuesday, August 12th, 2025 at 2:38 PM, Jakub Kicinski <kuba@kernel.org> wrote:

> 
> 
> On Tue, 12 Aug 2025 02:10:02 +0000 William Liu wrote:
> 
> > > AFAICT only if the backlog adjustment is using the prev_qlen,
> > > prev_backlog approach, which snapshots the backlog. In that case,
> > > yes, the "internal drops" will mess up the count.
> > 
> > Yep, that's why I added the dropped_qlen and dropped_backlog
> > variables, though that is not a very pretty solution.
> > 
> > But even looking at the method you suggested (copy pasting for
> > reference):
> > 
> > pkts = 0;
> > bytes = 0;
> > while (sch->q.qlen > sch->limit ||
> > q->memory_usage > q->memory_limit) {
> > struct sk_buff *skb = qdisc_dequeue_internal(sch, false);
> > pkts++;
> > bytes += qdisc_pkt_len(skb);
> > rtnl_kfree_skbs(skb, skb);
> > }
> > qdisc_tree_reduce_backlog(sch, pkts, bytes);
> > 
> > qdisc_dequeue_internal can trigger fq_codel_dequeue, which can
> > trigger qdisc_tree_reduce_backlog before returning (the only qdisc
> > out of these that does so in its dequeue handler).
> > 
> > Let's say the limit only goes down by one, and packet A is at the
> > front of the queue. qdisc_dequeue_internal takes the dequeue path,
> > and fq_codel_dequeue triggers a qdisc_tree_reduce_backlog from that
> > packet before returning the skb. Would this final
> > qdisc_tree_reduce_backlog after the limit drop not double count?
> 
> 
> The packets that got counted in qdisc_tree_reduce_backlog() inside
> ->dequeue are freed immediately via
> 
> 
> drop_func()
> kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_CONGESTED);
> 
> In the scenario you're describing, ->dequeue should return NULL.
> 
> If that's possible, then we have another bug here :$
> 
> Normally backlogs get adjusted as the packet travels down the hierarchy
> thru the parent chain. ->dequeue is part of this normal path so skbs it
> 
> returns are still in the parent's backlogs. qdisc_tree_reduce_backlog()
> is only called when we need to do something to an skb outside of the
> normal ->enqueue/->dequeue flow that iterates the hierarchy.


Ah ok, this makes much more sense now! Thank you for explaining - I will get a v5 in for this patch with your proposed fix soonish then.
