netdev.vger.kernel.org archive mirror
* [PATCH net-next v4 0/3] add tx packets aggregation to ethtool and rmnet
@ 2023-01-11 13:05 Daniele Palmas
  2023-01-11 13:05 ` [PATCH net-next v4 1/3] ethtool: add tx aggregation parameters Daniele Palmas
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Daniele Palmas @ 2023-01-11 13:05 UTC (permalink / raw)
  To: David Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Subash Abhinov Kasiviswanathan, Sean Tranchetti, Jonathan Corbet,
	Alexander Lobakin, Gal Pressman, Dave Taht
  Cc: Bjørn Mork, Greg Kroah-Hartman, netdev, Daniele Palmas

Hello maintainers and all,

this patchset implements tx qmap packet aggregation in rmnet and generic
ethtool support for configuring it.

Some low-cat Thread-x based modems are not capable of properly reaching the maximum
allowed throughput in both tx and rx during a bidirectional test unless tx packet
aggregation is enabled.

I verified this problem with rmnet + qmi_wwan by using an MDM9207 Cat. 4 based modem
(50Mbps/150Mbps max throughput). What is actually happening is pictured at
https://drive.google.com/file/d/1gSbozrtd9h0X63i6vdkNpN68d-9sg8f9/view

Testing with iperf TCP, when the rx and tx flows are tested individually there is no
issue in tx and only minor issues in rx (not able to reach the max throughput). When
there are concurrent tx and rx flows, tx throughput shows a huge drop; rx shows a
minor one, but it is still present.

The same scenario with tx aggregation enabled is pictured at
https://drive.google.com/file/d/1jcVIKNZD7K3lHtwKE5W02mpaloudYYih/view
showing a regular graph.

This issue does not happen with high-cat modems (e.g. SDX20), or at least it
does not happen at the throughputs I'm able to test currently: maybe the same
could happen when moving close to the maximum rates supported by those modems.
Anyway, having tx aggregation enabled should not hurt.

The first attempt to solve this issue was in the qmi_wwan qmap implementation;
see the discussion at https://lore.kernel.org/netdev/20221019132503.6783-1-dnlplm@gmail.com/

However, it turned out that rmnet was a better candidate for the implementation.

Moreover, Greg and Jakub also suggested using ethtool for the configuration:
I'm not sure I got their advice entirely right, but this patchset also adds
generic ethtool support for tx aggregation.
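
As a rough illustration of the driver-side hookup (this is a condensed sketch
of what patch 3 does for rmnet; the foo_* names are placeholders, not part of
this series):

static int foo_set_coalesce(struct net_device *dev,
                            struct ethtool_coalesce *coal,
                            struct kernel_ethtool_coalesce *kernel_coal,
                            struct netlink_ext_ack *extack)
{
        /* validate against the device limits, then store the three values */
        return foo_update_tx_aggr(dev, kernel_coal->tx_aggr_max_bytes,
                                  kernel_coal->tx_aggr_max_frames,
                                  kernel_coal->tx_aggr_time_usecs);
}

static const struct ethtool_ops foo_ethtool_ops = {
        .supported_coalesce_params = ETHTOOL_COALESCE_TX_AGGR,
        .get_coalesce = foo_get_coalesce, /* reports the stored values back */
        .set_coalesce = foo_set_coalesce,
};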

The patches have been tested mainly against an MDM9207 based modem through USB
and an SDX55 through PCI (MHI).

v2 should address the comments highlighted in the review: the implementation is
still in rmnet, due to Subash's request to keep tx aggregation there.

v3 fixes ethtool-netlink.rst content overflowing the table bounds and a W=1 build
warning for patch 2.

v4 solves a race related to egress_agg_params.

Daniele Palmas (3):
  ethtool: add tx aggregation parameters
  net: qualcomm: rmnet: add tx packets aggregation
  net: qualcomm: rmnet: add ethtool support for configuring tx
    aggregation

 Documentation/networking/ethtool-netlink.rst  |  17 ++
 .../ethernet/qualcomm/rmnet/rmnet_config.c    |   5 +
 .../ethernet/qualcomm/rmnet/rmnet_config.h    |  20 ++
 .../ethernet/qualcomm/rmnet/rmnet_handlers.c  |  18 +-
 .../net/ethernet/qualcomm/rmnet/rmnet_map.h   |   6 +
 .../ethernet/qualcomm/rmnet/rmnet_map_data.c  | 191 ++++++++++++++++++
 .../net/ethernet/qualcomm/rmnet/rmnet_vnd.c   |  54 ++++-
 .../net/ethernet/qualcomm/rmnet/rmnet_vnd.h   |   1 +
 include/linux/ethtool.h                       |  12 +-
 include/uapi/linux/ethtool_netlink.h          |   3 +
 net/ethtool/coalesce.c                        |  22 +-
 11 files changed, 342 insertions(+), 7 deletions(-)

-- 
2.37.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH net-next v4 1/3] ethtool: add tx aggregation parameters
  2023-01-11 13:05 [PATCH net-next v4 0/3] add tx packets aggregation to ethtool and rmnet Daniele Palmas
@ 2023-01-11 13:05 ` Daniele Palmas
  2023-01-11 13:05 ` [PATCH net-next v4 2/3] net: qualcomm: rmnet: add tx packets aggregation Daniele Palmas
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Daniele Palmas @ 2023-01-11 13:05 UTC (permalink / raw)
  To: David Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Subash Abhinov Kasiviswanathan, Sean Tranchetti, Jonathan Corbet,
	Alexander Lobakin, Gal Pressman, Dave Taht
  Cc: Bjørn Mork, Greg Kroah-Hartman, netdev, Daniele Palmas

Add the following ethtool tx aggregation parameters:

ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES
Maximum size in bytes of a tx aggregated block of frames.

ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES
Maximum number of frames that can be aggregated into a block.

ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS
Time in usecs, counted from the arrival of the first packet in an
aggregated block, after which the block is sent.

Signed-off-by: Daniele Palmas <dnlplm@gmail.com>
---
v4
- No change
v3
- Fixed ethtool-netlink.rst content out of table bounds
v2
- Replaced the generic 'size' word with 'bytes' in the related ETHTOOL define
- Changed all the names making the word 'aggr' to follow 'tx'
- Improved documentation on the feature in ethtool-netlink.rst
---
 Documentation/networking/ethtool-netlink.rst | 17 +++++++++++++++
 include/linux/ethtool.h                      | 12 ++++++++++-
 include/uapi/linux/ethtool_netlink.h         |  3 +++
 net/ethtool/coalesce.c                       | 22 ++++++++++++++++++--
 4 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index f10f8eb44255..06ea91bde164 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -1004,6 +1004,9 @@ Kernel response contents:
   ``ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL``  u32     rate sampling interval
   ``ETHTOOL_A_COALESCE_USE_CQE_TX``            bool    timer reset mode, Tx
   ``ETHTOOL_A_COALESCE_USE_CQE_RX``            bool    timer reset mode, Rx
+  ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES``     u32     max aggr size, Tx
+  ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES``    u32     max aggr packets, Tx
+  ``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS``    u32     time (us), aggr, Tx
   ===========================================  ======  =======================
 
 Attributes are only included in reply if their value is not zero or the
@@ -1022,6 +1025,17 @@ each packet event resets the timer. In this mode timer is used to force
 the interrupt if queue goes idle, while busy queues depend on the packet
 limit to trigger interrupts.
 
+Tx aggregation consists of copying frames into a contiguous buffer so that they
+can be submitted as a single IO operation. ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES``
+describes the maximum size in bytes for the submitted buffer.
+``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES`` describes the maximum number of frames
+that can be aggregated into a single buffer.
+``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` describes the amount of time in usecs,
+counted since the first packet arrival in an aggregated block, after which the
+block should be sent.
+This feature is mainly of interest for specific USB devices which do not cope
+well with frequent transmissions of small-sized URBs.
+
 COALESCE_SET
 ============
 
@@ -1055,6 +1069,9 @@ Request contents:
   ``ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL``  u32     rate sampling interval
   ``ETHTOOL_A_COALESCE_USE_CQE_TX``            bool    timer reset mode, Tx
   ``ETHTOOL_A_COALESCE_USE_CQE_RX``            bool    timer reset mode, Rx
+  ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES``     u32     max aggr size, Tx
+  ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES``    u32     max aggr packets, Tx
+  ``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS``    u32     time (us), aggr, Tx
   ===========================================  ======  =======================
 
 Request is rejected if it attributes declared as unsupported by driver (i.e.
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 9e0a76fc7de9..a1ff1ca0a5b6 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -217,6 +217,9 @@ __ethtool_get_link_ksettings(struct net_device *dev,
 struct kernel_ethtool_coalesce {
 	u8 use_cqe_mode_tx;
 	u8 use_cqe_mode_rx;
+	u32 tx_aggr_max_bytes;
+	u32 tx_aggr_max_frames;
+	u32 tx_aggr_time_usecs;
 };
 
 /**
@@ -260,7 +263,10 @@ bool ethtool_convert_link_mode_to_legacy_u32(u32 *legacy_u32,
 #define ETHTOOL_COALESCE_RATE_SAMPLE_INTERVAL	BIT(21)
 #define ETHTOOL_COALESCE_USE_CQE_RX		BIT(22)
 #define ETHTOOL_COALESCE_USE_CQE_TX		BIT(23)
-#define ETHTOOL_COALESCE_ALL_PARAMS		GENMASK(23, 0)
+#define ETHTOOL_COALESCE_TX_AGGR_MAX_BYTES	BIT(24)
+#define ETHTOOL_COALESCE_TX_AGGR_MAX_FRAMES	BIT(25)
+#define ETHTOOL_COALESCE_TX_AGGR_TIME_USECS	BIT(26)
+#define ETHTOOL_COALESCE_ALL_PARAMS		GENMASK(26, 0)
 
 #define ETHTOOL_COALESCE_USECS						\
 	(ETHTOOL_COALESCE_RX_USECS | ETHTOOL_COALESCE_TX_USECS)
@@ -288,6 +294,10 @@ bool ethtool_convert_link_mode_to_legacy_u32(u32 *legacy_u32,
 	 ETHTOOL_COALESCE_RATE_SAMPLE_INTERVAL)
 #define ETHTOOL_COALESCE_USE_CQE					\
 	(ETHTOOL_COALESCE_USE_CQE_RX | ETHTOOL_COALESCE_USE_CQE_TX)
+#define ETHTOOL_COALESCE_TX_AGGR		\
+	(ETHTOOL_COALESCE_TX_AGGR_MAX_BYTES |	\
+	 ETHTOOL_COALESCE_TX_AGGR_MAX_FRAMES |	\
+	 ETHTOOL_COALESCE_TX_AGGR_TIME_USECS)
 
 #define ETHTOOL_STAT_NOT_SET	(~0ULL)
 
diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h
index 5799a9db034e..62cffbf157b1 100644
--- a/include/uapi/linux/ethtool_netlink.h
+++ b/include/uapi/linux/ethtool_netlink.h
@@ -400,6 +400,9 @@ enum {
 	ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL,	/* u32 */
 	ETHTOOL_A_COALESCE_USE_CQE_MODE_TX,		/* u8 */
 	ETHTOOL_A_COALESCE_USE_CQE_MODE_RX,		/* u8 */
+	ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES,		/* u32 */
+	ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES,		/* u32 */
+	ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS,		/* u32 */
 
 	/* add new constants above here */
 	__ETHTOOL_A_COALESCE_CNT,
diff --git a/net/ethtool/coalesce.c b/net/ethtool/coalesce.c
index 487bdf345541..e405b47f7eed 100644
--- a/net/ethtool/coalesce.c
+++ b/net/ethtool/coalesce.c
@@ -105,7 +105,10 @@ static int coalesce_reply_size(const struct ethnl_req_info *req_base,
 	       nla_total_size(sizeof(u32)) +	/* _TX_MAX_FRAMES_HIGH */
 	       nla_total_size(sizeof(u32)) +	/* _RATE_SAMPLE_INTERVAL */
 	       nla_total_size(sizeof(u8)) +	/* _USE_CQE_MODE_TX */
-	       nla_total_size(sizeof(u8));	/* _USE_CQE_MODE_RX */
+	       nla_total_size(sizeof(u8)) +	/* _USE_CQE_MODE_RX */
+	       nla_total_size(sizeof(u32)) +	/* _TX_AGGR_MAX_BYTES */
+	       nla_total_size(sizeof(u32)) +	/* _TX_AGGR_MAX_FRAMES */
+	       nla_total_size(sizeof(u32));	/* _TX_AGGR_TIME_USECS */
 }
 
 static bool coalesce_put_u32(struct sk_buff *skb, u16 attr_type, u32 val,
@@ -180,7 +183,13 @@ static int coalesce_fill_reply(struct sk_buff *skb,
 	    coalesce_put_bool(skb, ETHTOOL_A_COALESCE_USE_CQE_MODE_TX,
 			      kcoal->use_cqe_mode_tx, supported) ||
 	    coalesce_put_bool(skb, ETHTOOL_A_COALESCE_USE_CQE_MODE_RX,
-			      kcoal->use_cqe_mode_rx, supported))
+			      kcoal->use_cqe_mode_rx, supported) ||
+	    coalesce_put_u32(skb, ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES,
+			     kcoal->tx_aggr_max_bytes, supported) ||
+	    coalesce_put_u32(skb, ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES,
+			     kcoal->tx_aggr_max_frames, supported) ||
+	    coalesce_put_u32(skb, ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS,
+			     kcoal->tx_aggr_time_usecs, supported))
 		return -EMSGSIZE;
 
 	return 0;
@@ -227,6 +236,9 @@ const struct nla_policy ethnl_coalesce_set_policy[] = {
 	[ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL] = { .type = NLA_U32 },
 	[ETHTOOL_A_COALESCE_USE_CQE_MODE_TX]	= NLA_POLICY_MAX(NLA_U8, 1),
 	[ETHTOOL_A_COALESCE_USE_CQE_MODE_RX]	= NLA_POLICY_MAX(NLA_U8, 1),
+	[ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES] = { .type = NLA_U32 },
+	[ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES] = { .type = NLA_U32 },
+	[ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS] = { .type = NLA_U32 },
 };
 
 int ethnl_set_coalesce(struct sk_buff *skb, struct genl_info *info)
@@ -321,6 +333,12 @@ int ethnl_set_coalesce(struct sk_buff *skb, struct genl_info *info)
 			tb[ETHTOOL_A_COALESCE_USE_CQE_MODE_TX], &mod);
 	ethnl_update_u8(&kernel_coalesce.use_cqe_mode_rx,
 			tb[ETHTOOL_A_COALESCE_USE_CQE_MODE_RX], &mod);
+	ethnl_update_u32(&kernel_coalesce.tx_aggr_max_bytes,
+			 tb[ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES], &mod);
+	ethnl_update_u32(&kernel_coalesce.tx_aggr_max_frames,
+			 tb[ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES], &mod);
+	ethnl_update_u32(&kernel_coalesce.tx_aggr_time_usecs,
+			 tb[ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS], &mod);
 	ret = 0;
 	if (!mod)
 		goto out_ops;
-- 
2.37.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH net-next v4 2/3] net: qualcomm: rmnet: add tx packets aggregation
  2023-01-11 13:05 [PATCH net-next v4 0/3] add tx packets aggregation to ethtool and rmnet Daniele Palmas
  2023-01-11 13:05 ` [PATCH net-next v4 1/3] ethtool: add tx aggregation parameters Daniele Palmas
@ 2023-01-11 13:05 ` Daniele Palmas
  2023-01-12 19:41   ` Subash Abhinov Kasiviswanathan (KS)
  2023-01-11 13:05 ` [PATCH net-next v4 3/3] net: qualcomm: rmnet: add ethtool support for configuring tx aggregation Daniele Palmas
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Daniele Palmas @ 2023-01-11 13:05 UTC (permalink / raw)
  To: David Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Subash Abhinov Kasiviswanathan, Sean Tranchetti, Jonathan Corbet,
	Alexander Lobakin, Gal Pressman, Dave Taht
  Cc: Bjørn Mork, Greg Kroah-Hartman, netdev, Daniele Palmas

Add tx packets aggregation.

Bidirectional TCP throughput tests through iperf with low-cat
Thread-x based modems revealed performance issues both in tx
and rx.

The Windows driver does not show this issue: inspecting USB
packets revealed that the only notable change is the driver
enabling tx packets aggregation.

Tx packets aggregation is by default disabled and can be enabled
by increasing the value of ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES.

The maximum aggregated size is by default set to a reasonably low
value in order to support the majority of modems.

This implementation is based on patches available in Code Aurora
repositories (msm kernel) whose main authors are

Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Sean Tranchetti <stranche@codeaurora.org>

Signed-off-by: Daniele Palmas <dnlplm@gmail.com>
---
Hi all,

in v4 I should have addressed all of Paolo's remarks, except the one related
to removing the scheduled work: according to my understanding of the code in
dev_queue_xmit, it is not possible to avoid the WARN_ONCE when it is called
directly from the timer function.

Context at https://lore.kernel.org/netdev/CAGRyCJGHSPO+i_xKHGbNg+Hki5tQC3_6Kc8RNcHWN6pxQdjODw@mail.gmail.com/

v4
- Solved race when accessing egress_agg_params
- Removed duplicated code for error in rmnet_map_tx_aggregate
v3
- Fixed no previous prototype for rmnet_map_flush_tx_packet_queue
  warning reported by kernel test robot
v2
- Removed icmp packets direct sending
- Changed spin_lock_irqsave to spin_lock_bh
- Increased the possible maximum size of an aggregated block
- Aligned rmnet_egress_agg_params and types to ethtool ones
- Changed bypass time from variable to define
- Fixed RCT style in rmnet_map_tx_aggregate
- Fixed order of skb freeing in rmnet_map_tx_aggregate
- rmnet_map_tx_aggregate refactoring
- Change aggregation function to use frag_list
- Removed RMNET_FLAGS_EGRESS_AGGREGATION
---
 .../ethernet/qualcomm/rmnet/rmnet_config.c    |   5 +
 .../ethernet/qualcomm/rmnet/rmnet_config.h    |  20 ++
 .../ethernet/qualcomm/rmnet/rmnet_handlers.c  |  18 +-
 .../net/ethernet/qualcomm/rmnet/rmnet_map.h   |   6 +
 .../ethernet/qualcomm/rmnet/rmnet_map_data.c  | 191 ++++++++++++++++++
 .../net/ethernet/qualcomm/rmnet/rmnet_vnd.c   |   9 +-
 .../net/ethernet/qualcomm/rmnet/rmnet_vnd.h   |   1 +
 7 files changed, 246 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
index 27b1663c476e..39d24e07f306 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
@@ -12,6 +12,7 @@
 #include "rmnet_handlers.h"
 #include "rmnet_vnd.h"
 #include "rmnet_private.h"
+#include "rmnet_map.h"
 
 /* Local Definitions and Declarations */
 
@@ -39,6 +40,8 @@ static int rmnet_unregister_real_device(struct net_device *real_dev)
 	if (port->nr_rmnet_devs)
 		return -EINVAL;
 
+	rmnet_map_tx_aggregate_exit(port);
+
 	netdev_rx_handler_unregister(real_dev);
 
 	kfree(port);
@@ -79,6 +82,8 @@ static int rmnet_register_real_device(struct net_device *real_dev,
 	for (entry = 0; entry < RMNET_MAX_LOGICAL_EP; entry++)
 		INIT_HLIST_HEAD(&port->muxed_ep[entry]);
 
+	rmnet_map_tx_aggregate_init(port);
+
 	netdev_dbg(real_dev, "registered with rmnet\n");
 	return 0;
 }
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
index 3d3cba56c516..ed112d51ac5a 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
@@ -6,6 +6,7 @@
  */
 
 #include <linux/skbuff.h>
+#include <linux/time.h>
 #include <net/gro_cells.h>
 
 #ifndef _RMNET_CONFIG_H_
@@ -19,6 +20,12 @@ struct rmnet_endpoint {
 	struct hlist_node hlnode;
 };
 
+struct rmnet_egress_agg_params {
+	u32 bytes;
+	u32 count;
+	u64 time_nsec;
+};
+
 /* One instance of this structure is instantiated for each real_dev associated
  * with rmnet.
  */
@@ -30,6 +37,19 @@ struct rmnet_port {
 	struct hlist_head muxed_ep[RMNET_MAX_LOGICAL_EP];
 	struct net_device *bridge_ep;
 	struct net_device *rmnet_dev;
+
+	/* Egress aggregation information */
+	struct rmnet_egress_agg_params egress_agg_params;
+	/* Protect aggregation related elements */
+	spinlock_t agg_lock;
+	struct sk_buff *skbagg_head;
+	struct sk_buff *skbagg_tail;
+	int agg_state;
+	u8 agg_count;
+	struct timespec64 agg_time;
+	struct timespec64 agg_last;
+	struct hrtimer hrtimer;
+	struct work_struct agg_wq;
 };
 
 extern struct rtnl_link_ops rmnet_link_ops;
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
index a313242a762e..9f3479500f85 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
@@ -164,8 +164,18 @@ static int rmnet_map_egress_handler(struct sk_buff *skb,
 
 	map_header->mux_id = mux_id;
 
-	skb->protocol = htons(ETH_P_MAP);
+	if (READ_ONCE(port->egress_agg_params.count) > 1) {
+		unsigned int len;
+
+		len = rmnet_map_tx_aggregate(skb, port, orig_dev);
+		if (likely(len)) {
+			rmnet_vnd_tx_fixup_len(len, orig_dev);
+			return -EINPROGRESS;
+		}
+		return -ENOMEM;
+	}
 
+	skb->protocol = htons(ETH_P_MAP);
 	return 0;
 }
 
@@ -235,6 +245,7 @@ void rmnet_egress_handler(struct sk_buff *skb)
 	struct rmnet_port *port;
 	struct rmnet_priv *priv;
 	u8 mux_id;
+	int err;
 
 	sk_pacing_shift_update(skb->sk, 8);
 
@@ -247,8 +258,11 @@ void rmnet_egress_handler(struct sk_buff *skb)
 	if (!port)
 		goto drop;
 
-	if (rmnet_map_egress_handler(skb, port, mux_id, orig_dev))
+	err = rmnet_map_egress_handler(skb, port, mux_id, orig_dev);
+	if (err == -ENOMEM)
 		goto drop;
+	else if (err == -EINPROGRESS)
+		return;
 
 	rmnet_vnd_tx_fixup(skb, orig_dev);
 
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
index 2b033060fc20..b70284095568 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
@@ -53,5 +53,11 @@ void rmnet_map_checksum_uplink_packet(struct sk_buff *skb,
 				      struct net_device *orig_dev,
 				      int csum_type);
 int rmnet_map_process_next_hdr_packet(struct sk_buff *skb, u16 len);
+unsigned int rmnet_map_tx_aggregate(struct sk_buff *skb, struct rmnet_port *port,
+				    struct net_device *orig_dev);
+void rmnet_map_tx_aggregate_init(struct rmnet_port *port);
+void rmnet_map_tx_aggregate_exit(struct rmnet_port *port);
+void rmnet_map_update_ul_agg_config(struct rmnet_port *port, u32 size,
+				    u32 count, u32 time);
 
 #endif /* _RMNET_MAP_H_ */
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
index ba194698cc14..a5e3d1a88305 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
@@ -12,6 +12,7 @@
 #include "rmnet_config.h"
 #include "rmnet_map.h"
 #include "rmnet_private.h"
+#include "rmnet_vnd.h"
 
 #define RMNET_MAP_DEAGGR_SPACING  64
 #define RMNET_MAP_DEAGGR_HEADROOM (RMNET_MAP_DEAGGR_SPACING / 2)
@@ -518,3 +519,193 @@ int rmnet_map_process_next_hdr_packet(struct sk_buff *skb,
 
 	return 0;
 }
+
+#define RMNET_AGG_BYPASS_TIME_NSEC 10000000L
+
+static void reset_aggr_params(struct rmnet_port *port)
+{
+	port->skbagg_head = NULL;
+	port->agg_count = 0;
+	port->agg_state = 0;
+	memset(&port->agg_time, 0, sizeof(struct timespec64));
+}
+
+static void rmnet_send_skb(struct rmnet_port *port, struct sk_buff *skb)
+{
+	if (skb_needs_linearize(skb, port->dev->features)) {
+		if (unlikely(__skb_linearize(skb))) {
+			struct rmnet_priv *priv;
+
+			priv = netdev_priv(port->rmnet_dev);
+			this_cpu_inc(priv->pcpu_stats->stats.tx_drops);
+			dev_kfree_skb_any(skb);
+			return;
+		}
+	}
+
+	dev_queue_xmit(skb);
+}
+
+static void rmnet_map_flush_tx_packet_work(struct work_struct *work)
+{
+	struct sk_buff *skb = NULL;
+	struct rmnet_port *port;
+
+	port = container_of(work, struct rmnet_port, agg_wq);
+
+	spin_lock_bh(&port->agg_lock);
+	if (likely(port->agg_state == -EINPROGRESS)) {
+		/* Buffer may have already been shipped out */
+		if (likely(port->skbagg_head)) {
+			skb = port->skbagg_head;
+			reset_aggr_params(port);
+		}
+		port->agg_state = 0;
+	}
+
+	spin_unlock_bh(&port->agg_lock);
+	if (skb)
+		rmnet_send_skb(port, skb);
+}
+
+static enum hrtimer_restart rmnet_map_flush_tx_packet_queue(struct hrtimer *t)
+{
+	struct rmnet_port *port;
+
+	port = container_of(t, struct rmnet_port, hrtimer);
+
+	schedule_work(&port->agg_wq);
+
+	return HRTIMER_NORESTART;
+}
+
+unsigned int rmnet_map_tx_aggregate(struct sk_buff *skb, struct rmnet_port *port,
+				    struct net_device *orig_dev)
+{
+	struct timespec64 diff, last;
+	unsigned int len = skb->len;
+	struct sk_buff *agg_skb;
+	int size;
+
+	spin_lock_bh(&port->agg_lock);
+	memcpy(&last, &port->agg_last, sizeof(struct timespec64));
+	ktime_get_real_ts64(&port->agg_last);
+
+	if (!port->skbagg_head) {
+		/* Check to see if we should agg first. If the traffic is very
+		 * sparse, don't aggregate.
+		 */
+new_packet:
+		diff = timespec64_sub(port->agg_last, last);
+		size = port->egress_agg_params.bytes - skb->len;
+
+		if (size < 0) {
+			/* dropped */
+			spin_unlock_bh(&port->agg_lock);
+			return 0;
+		}
+
+		if (diff.tv_sec > 0 || diff.tv_nsec > RMNET_AGG_BYPASS_TIME_NSEC ||
+		    size == 0)
+			goto no_aggr;
+
+		port->skbagg_head = skb_copy_expand(skb, 0, size, GFP_ATOMIC);
+		if (!port->skbagg_head)
+			goto no_aggr;
+
+		dev_kfree_skb_any(skb);
+		port->skbagg_head->protocol = htons(ETH_P_MAP);
+		port->agg_count = 1;
+		ktime_get_real_ts64(&port->agg_time);
+		skb_frag_list_init(port->skbagg_head);
+		goto schedule;
+	}
+	diff = timespec64_sub(port->agg_last, port->agg_time);
+	size = port->egress_agg_params.bytes - port->skbagg_head->len;
+
+	if (skb->len > size) {
+		agg_skb = port->skbagg_head;
+		reset_aggr_params(port);
+		spin_unlock_bh(&port->agg_lock);
+		hrtimer_cancel(&port->hrtimer);
+		rmnet_send_skb(port, agg_skb);
+		spin_lock_bh(&port->agg_lock);
+		goto new_packet;
+	}
+
+	if (skb_has_frag_list(port->skbagg_head))
+		port->skbagg_tail->next = skb;
+	else
+		skb_shinfo(port->skbagg_head)->frag_list = skb;
+
+	port->skbagg_head->len += skb->len;
+	port->skbagg_head->data_len += skb->len;
+	port->skbagg_head->truesize += skb->truesize;
+	port->skbagg_tail = skb;
+	port->agg_count++;
+
+	if (diff.tv_sec > 0 || diff.tv_nsec > port->egress_agg_params.time_nsec ||
+	    port->agg_count >= port->egress_agg_params.count ||
+	    port->skbagg_head->len == port->egress_agg_params.bytes) {
+		agg_skb = port->skbagg_head;
+		reset_aggr_params(port);
+		spin_unlock_bh(&port->agg_lock);
+		hrtimer_cancel(&port->hrtimer);
+		rmnet_send_skb(port, agg_skb);
+		return len;
+	}
+
+schedule:
+	if (!hrtimer_active(&port->hrtimer) && port->agg_state != -EINPROGRESS) {
+		port->agg_state = -EINPROGRESS;
+		hrtimer_start(&port->hrtimer,
+			      ns_to_ktime(port->egress_agg_params.time_nsec),
+			      HRTIMER_MODE_REL);
+	}
+	spin_unlock_bh(&port->agg_lock);
+
+	return len;
+
+no_aggr:
+	spin_unlock_bh(&port->agg_lock);
+	skb->protocol = htons(ETH_P_MAP);
+	dev_queue_xmit(skb);
+
+	return len;
+}
+
+void rmnet_map_update_ul_agg_config(struct rmnet_port *port, u32 size,
+				    u32 count, u32 time)
+{
+	spin_lock_bh(&port->agg_lock);
+	port->egress_agg_params.bytes = size;
+	WRITE_ONCE(port->egress_agg_params.count, count);
+	port->egress_agg_params.time_nsec = time * NSEC_PER_USEC;
+	spin_unlock_bh(&port->agg_lock);
+}
+
+void rmnet_map_tx_aggregate_init(struct rmnet_port *port)
+{
+	hrtimer_init(&port->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	port->hrtimer.function = rmnet_map_flush_tx_packet_queue;
+	spin_lock_init(&port->agg_lock);
+	rmnet_map_update_ul_agg_config(port, 4096, 1, 800);
+	INIT_WORK(&port->agg_wq, rmnet_map_flush_tx_packet_work);
+}
+
+void rmnet_map_tx_aggregate_exit(struct rmnet_port *port)
+{
+	hrtimer_cancel(&port->hrtimer);
+	cancel_work_sync(&port->agg_wq);
+
+	spin_lock_bh(&port->agg_lock);
+	if (port->agg_state == -EINPROGRESS) {
+		if (port->skbagg_head) {
+			dev_kfree_skb_any(port->skbagg_head);
+			reset_aggr_params(port);
+		}
+
+		port->agg_state = 0;
+	}
+	spin_unlock_bh(&port->agg_lock);
+}
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
index 3f5e6572d20e..6d8b8fdb9d03 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
@@ -29,7 +29,7 @@ void rmnet_vnd_rx_fixup(struct sk_buff *skb, struct net_device *dev)
 	u64_stats_update_end(&pcpu_ptr->syncp);
 }
 
-void rmnet_vnd_tx_fixup(struct sk_buff *skb, struct net_device *dev)
+void rmnet_vnd_tx_fixup_len(unsigned int len, struct net_device *dev)
 {
 	struct rmnet_priv *priv = netdev_priv(dev);
 	struct rmnet_pcpu_stats *pcpu_ptr;
@@ -38,10 +38,15 @@ void rmnet_vnd_tx_fixup(struct sk_buff *skb, struct net_device *dev)
 
 	u64_stats_update_begin(&pcpu_ptr->syncp);
 	pcpu_ptr->stats.tx_pkts++;
-	pcpu_ptr->stats.tx_bytes += skb->len;
+	pcpu_ptr->stats.tx_bytes += len;
 	u64_stats_update_end(&pcpu_ptr->syncp);
 }
 
+void rmnet_vnd_tx_fixup(struct sk_buff *skb, struct net_device *dev)
+{
+	rmnet_vnd_tx_fixup_len(skb->len, dev);
+}
+
 /* Network Device Operations */
 
 static netdev_tx_t rmnet_vnd_start_xmit(struct sk_buff *skb,
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.h
index dc3a4443ef0a..c2b2baf86894 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.h
@@ -16,6 +16,7 @@ int rmnet_vnd_newlink(u8 id, struct net_device *rmnet_dev,
 int rmnet_vnd_dellink(u8 id, struct rmnet_port *port,
 		      struct rmnet_endpoint *ep);
 void rmnet_vnd_rx_fixup(struct sk_buff *skb, struct net_device *dev);
+void rmnet_vnd_tx_fixup_len(unsigned int len, struct net_device *dev);
 void rmnet_vnd_tx_fixup(struct sk_buff *skb, struct net_device *dev);
 void rmnet_vnd_setup(struct net_device *dev);
 int rmnet_vnd_validate_real_dev_mtu(struct net_device *real_dev);
-- 
2.37.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH net-next v4 3/3] net: qualcomm: rmnet: add ethtool support for configuring tx aggregation
  2023-01-11 13:05 [PATCH net-next v4 0/3] add tx packets aggregation to ethtool and rmnet Daniele Palmas
  2023-01-11 13:05 ` [PATCH net-next v4 1/3] ethtool: add tx aggregation parameters Daniele Palmas
  2023-01-11 13:05 ` [PATCH net-next v4 2/3] net: qualcomm: rmnet: add tx packets aggregation Daniele Palmas
@ 2023-01-11 13:05 ` Daniele Palmas
  2023-01-12 22:59 ` [PATCH net-next v4 0/3] add tx packets aggregation to ethtool and rmnet Alexander H Duyck
  2023-01-13 21:20 ` patchwork-bot+netdevbpf
  4 siblings, 0 replies; 12+ messages in thread
From: Daniele Palmas @ 2023-01-11 13:05 UTC (permalink / raw)
  To: David Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Subash Abhinov Kasiviswanathan, Sean Tranchetti, Jonathan Corbet,
	Alexander Lobakin, Gal Pressman, Dave Taht
  Cc: Bjørn Mork, Greg Kroah-Hartman, netdev, Daniele Palmas

Add support for ETHTOOL_COALESCE_TX_AGGR for configuring the tx
aggregation settings.

Signed-off-by: Daniele Palmas <dnlplm@gmail.com>
---
v4
- No change
v3
- No change
v2
- Fixed undefined reference to `__aeabi_uldivmod' issue with arm, reported-by: kernel test robot
---
 .../net/ethernet/qualcomm/rmnet/rmnet_vnd.c   | 45 +++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
index 6d8b8fdb9d03..046b5f7d8e7c 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
@@ -215,7 +215,52 @@ static void rmnet_get_ethtool_stats(struct net_device *dev,
 	memcpy(data, st, ARRAY_SIZE(rmnet_gstrings_stats) * sizeof(u64));
 }
 
+static int rmnet_get_coalesce(struct net_device *dev,
+			      struct ethtool_coalesce *coal,
+			      struct kernel_ethtool_coalesce *kernel_coal,
+			      struct netlink_ext_ack *extack)
+{
+	struct rmnet_priv *priv = netdev_priv(dev);
+	struct rmnet_port *port;
+
+	port = rmnet_get_port_rtnl(priv->real_dev);
+
+	memset(kernel_coal, 0, sizeof(*kernel_coal));
+	kernel_coal->tx_aggr_max_bytes = port->egress_agg_params.bytes;
+	kernel_coal->tx_aggr_max_frames = port->egress_agg_params.count;
+	kernel_coal->tx_aggr_time_usecs = div_u64(port->egress_agg_params.time_nsec,
+						  NSEC_PER_USEC);
+
+	return 0;
+}
+
+static int rmnet_set_coalesce(struct net_device *dev,
+			      struct ethtool_coalesce *coal,
+			      struct kernel_ethtool_coalesce *kernel_coal,
+			      struct netlink_ext_ack *extack)
+{
+	struct rmnet_priv *priv = netdev_priv(dev);
+	struct rmnet_port *port;
+
+	port = rmnet_get_port_rtnl(priv->real_dev);
+
+	if (kernel_coal->tx_aggr_max_frames < 1 || kernel_coal->tx_aggr_max_frames > 64)
+		return -EINVAL;
+
+	if (kernel_coal->tx_aggr_max_bytes > 32768)
+		return -EINVAL;
+
+	rmnet_map_update_ul_agg_config(port, kernel_coal->tx_aggr_max_bytes,
+				       kernel_coal->tx_aggr_max_frames,
+				       kernel_coal->tx_aggr_time_usecs);
+
+	return 0;
+}
+
 static const struct ethtool_ops rmnet_ethtool_ops = {
+	.supported_coalesce_params = ETHTOOL_COALESCE_TX_AGGR,
+	.get_coalesce = rmnet_get_coalesce,
+	.set_coalesce = rmnet_set_coalesce,
 	.get_ethtool_stats = rmnet_get_ethtool_stats,
 	.get_strings = rmnet_get_strings,
 	.get_sset_count = rmnet_get_sset_count,
-- 
2.37.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v4 2/3] net: qualcomm: rmnet: add tx packets aggregation
  2023-01-11 13:05 ` [PATCH net-next v4 2/3] net: qualcomm: rmnet: add tx packets aggregation Daniele Palmas
@ 2023-01-12 19:41   ` Subash Abhinov Kasiviswanathan (KS)
  0 siblings, 0 replies; 12+ messages in thread
From: Subash Abhinov Kasiviswanathan (KS) @ 2023-01-12 19:41 UTC (permalink / raw)
  To: Daniele Palmas, David Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet, Sean Tranchetti, Jonathan Corbet, Alexander Lobakin,
	Gal Pressman, Dave Taht
  Cc: Bjørn Mork, Greg Kroah-Hartman, netdev



On 1/11/2023 6:05 AM, Daniele Palmas wrote:
> Add tx packets aggregation.
> 
> Bidirectional TCP throughput tests through iperf with low-cat
> Thread-x based modems revealed performance issues both in tx
> and rx.
> 
> The Windows driver does not show this issue: inspecting USB
> packets revealed that the only notable change is the driver
> enabling tx packets aggregation.
> 
> Tx packets aggregation is by default disabled and can be enabled
> by increasing the value of ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES.
> 
> The maximum aggregated size is by default set to a reasonably low
> value in order to support the majority of modems.
> 
> This implementation is based on patches available in Code Aurora
> repositories (msm kernel) whose main authors are
> 
> Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
> Sean Tranchetti <stranche@codeaurora.org>
> 
> Signed-off-by: Daniele Palmas <dnlplm@gmail.com>
> ---
> Hi all,
> 
> in v4 I should have met all Paolo's remarks, besides the one related
> to removing the scheduled work, since according to my understanding
> of the code in dev_queue_xmit it's not possible to avoid the
> WARN_ONCE when called directly from the timer function.
> 
> Context at https://lore.kernel.org/netdev/CAGRyCJGHSPO+i_xKHGbNg+Hki5tQC3_6Kc8RNcHWN6pxQdjODw@mail.gmail.com/
> 
> v4
> - Solved race when accessing egress_agg_params
> - Removed duplicated code for error in rmnet_map_tx_aggregate

Hi Daniele

I see that you are calling dev_queue_xmit only from the worker thread
context, so I assume you aren't running into any warnings anymore.


Reviewed-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v4 0/3] add tx packets aggregation to ethtool and rmnet
  2023-01-11 13:05 [PATCH net-next v4 0/3] add tx packets aggregation to ethtool and rmnet Daniele Palmas
                   ` (2 preceding siblings ...)
  2023-01-11 13:05 ` [PATCH net-next v4 3/3] net: qualcomm: rmnet: add ethtool support for configuring tx aggregation Daniele Palmas
@ 2023-01-12 22:59 ` Alexander H Duyck
  2023-01-13 15:43   ` Daniele Palmas
  2023-01-13 21:20 ` patchwork-bot+netdevbpf
  4 siblings, 1 reply; 12+ messages in thread
From: Alexander H Duyck @ 2023-01-12 22:59 UTC (permalink / raw)
  To: Daniele Palmas, David Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet, Subash Abhinov Kasiviswanathan, Sean Tranchetti,
	Jonathan Corbet, Alexander Lobakin, Gal Pressman, Dave Taht
  Cc: Bjørn Mork, Greg Kroah-Hartman, netdev

On Wed, 2023-01-11 at 14:05 +0100, Daniele Palmas wrote:
> Hello maintainers and all,
> 
> this patchset implements tx qmap packets aggregation in rmnet and generic
> ethtool support for that.
> 
> Some low-cat Thread-x based modems are not capable of properly reaching the maximum
> allowed throughput both in tx and rx during a bidirectional test if tx packets
> aggregation is not enabled.

One question I would have about this is if you are making use of Byte
Queue Limits and netdev_xmit_more at all? I know for high speed devices
most of us added support for xmit_more because PCIe bandwidth was
limiting when we were writing the Tx ring indexes/doorbells with every
packet. To overcome that we added netdev_xmit_more which told us when
the Qdisc had more packets to give us. This allowed us to better
utilize the PCIe bus by bursting packets through without adding
additional latency.
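
Roughly, the driver-side pattern looks like this (the foo_* ring helpers are
made up here, just to show where netdev_xmit_more() fits):

static netdev_tx_t foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        unsigned int q = skb_get_queue_mapping(skb);
        struct foo_tx_ring *ring = foo_get_tx_ring(dev, q);

        foo_post_descriptor(ring, skb);

        /* Write the tail/doorbell register only when the stack has no
         * more frames queued up for us, or the queue is stopped anyway.
         */
        if (!netdev_xmit_more() ||
            netif_xmit_stopped(netdev_get_tx_queue(dev, q)))
                foo_write_doorbell(ring);

        return NETDEV_TX_OK;
}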

> I verified this problem with rmnet + qmi_wwan by using a MDM9207 Cat. 4 based modem
> (50Mbps/150Mbps max throughput). What is actually happening is pictured at
> https://drive.google.com/file/d/1gSbozrtd9h0X63i6vdkNpN68d-9sg8f9/view
> 
> Testing with iperf TCP, when rx and tx flows are tested singularly there's no issue
> in tx and minor issues in rx (not able to reach max throughput). When there are concurrent
> tx and rx flows, tx throughput has an huge drop. rx a minor one, but still present.
> 
> The same scenario with tx aggregation enabled is pictured at
> https://drive.google.com/file/d/1jcVIKNZD7K3lHtwKE5W02mpaloudYYih/view
> showing a regular graph.
> 
> This issue does not happen with high-cat modems (e.g. SDX20), or at least it
> does not happen at the throughputs I'm able to test currently: maybe the same
> could happen when moving close to the maximum rates supported by those modems.
> Anyway, having the tx aggregation enabled should not hurt.
> 
> The first attempt to solve this issue was in qmi_wwan qmap implementation,
> see the discussion at https://lore.kernel.org/netdev/20221019132503.6783-1-dnlplm@gmail.com/
> 
> However, it turned out that rmnet was a better candidate for the implementation.
> 
> Moreover, Greg and Jakub suggested also to use ethtool for the configuration:
> not sure if I got their advice right, but this patchset add also generic ethtool
> support for tx aggregation.

I have concerns about this essentially moving queueing disciplines down
into the device. The idea of doing Tx aggregation seems like something
that should be done with the qdisc rather than the device driver.
Otherwise we are looking at having multiple implementations of this
aggregation floating around. It seems like it would make more sense to
have this batching happen at the qdisc layer, and then the qdisc layer
would pass down a batch of frames w/ xmit_more set to indicate it is
flushing the batch.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v4 0/3] add tx packets aggregation to ethtool and rmnet
  2023-01-12 22:59 ` [PATCH net-next v4 0/3] add tx packets aggregation to ethtool and rmnet Alexander H Duyck
@ 2023-01-13 15:43   ` Daniele Palmas
  2023-01-13 16:16     ` Alexander Duyck
  0 siblings, 1 reply; 12+ messages in thread
From: Daniele Palmas @ 2023-01-13 15:43 UTC (permalink / raw)
  To: Alexander H Duyck
  Cc: David Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Subash Abhinov Kasiviswanathan, Sean Tranchetti, Jonathan Corbet,
	Alexander Lobakin, Gal Pressman, Dave Taht, Bjørn Mork,
	Greg Kroah-Hartman, netdev

Hello Alexander,

Il giorno ven 13 gen 2023 alle ore 00:00 Alexander H Duyck
<alexander.duyck@gmail.com> ha scritto:
>
> On Wed, 2023-01-11 at 14:05 +0100, Daniele Palmas wrote:
> > Hello maintainers and all,
> >
> > this patchset implements tx qmap packets aggregation in rmnet and generic
> > ethtool support for that.
> >
> > Some low-cat Thread-x based modems are not capable of properly reaching the maximum
> > allowed throughput both in tx and rx during a bidirectional test if tx packets
> > aggregation is not enabled.
>
> One question I would have about this is if you are making use of Byte
> Queue Limits and netdev_xmit_more at all? I know for high speed devices
> most of us added support for xmit_more because PCIe bandwidth was
> limiting when we were writing the Tx ring indexes/doorbells with every
> packet. To overcome that we added netdev_xmit_more which told us when
> the Qdisc had more packets to give us. This allowed us to better
> utilize the PCIe bus by bursting packets through without adding
> additional latency.
>

no, I was not aware of BQL: this development has been basically
modelled on what other mobile broadband drivers do (e.g.
cdc_mbim/cdc_ncm, the Qualcomm downstream rmnet implementation),
which do not use BQL.

If I understand properly documentation

rmnet0/queues/tx-0/byte_queue_limits/limit_max

would be the candidate for replacing ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES.

But I can't find anything for ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES
and ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS, something that should work
in combination with the bytes limit: at least the first one is
mandatory, since the modem can't receive more than a certain number
of aggregated frames (a value that depends on the modem model and is
collected through userspace tools).

ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES also works as a way to determine
whether tx aggregation has been enabled by the userspace tool managing
the qmi requests; otherwise no aggregation should be performed.

> > I verified this problem with rmnet + qmi_wwan by using a MDM9207 Cat. 4 based modem
> > (50Mbps/150Mbps max throughput). What is actually happening is pictured at
> > https://drive.google.com/file/d/1gSbozrtd9h0X63i6vdkNpN68d-9sg8f9/view
> >
> > Testing with iperf TCP, when rx and tx flows are tested singularly there's no issue
> > in tx and minor issues in rx (not able to reach max throughput). When there are concurrent
> > tx and rx flows, tx throughput has an huge drop. rx a minor one, but still present.
> >
> > The same scenario with tx aggregation enabled is pictured at
> > https://drive.google.com/file/d/1jcVIKNZD7K3lHtwKE5W02mpaloudYYih/view
> > showing a regular graph.
> >
> > This issue does not happen with high-cat modems (e.g. SDX20), or at least it
> > does not happen at the throughputs I'm able to test currently: maybe the same
> > could happen when moving close to the maximum rates supported by those modems.
> > Anyway, having the tx aggregation enabled should not hurt.
> >
> > The first attempt to solve this issue was in qmi_wwan qmap implementation,
> > see the discussion at https://lore.kernel.org/netdev/20221019132503.6783-1-dnlplm@gmail.com/
> >
> > However, it turned out that rmnet was a better candidate for the implementation.
> >
> > Moreover, Greg and Jakub suggested also to use ethtool for the configuration:
> > not sure if I got their advice right, but this patchset add also generic ethtool
> > support for tx aggregation.
>
> I have concerns about this essentially moving queueing disciplines down
> into the device. The idea of doing Tx aggregation seems like something
> that should be done with the qdisc rather than the device driver.
> Otherwise we are looking at having multiple implementations of this
> aggregation floating around. It seems like it would make more sense to
> have this batching happen at the qdisc layer, and then the qdisc layer
> would pass down a batch of frames w/ xmit_more set to indicate it is
> flushing the batch.

Honestly, I'm not expert enough to give a reliable opinion about this.

I feel like these settings are more related to the hardware, since they
also require a configuration on the hardware itself done by the user, so
ethtool seems to me a good choice, but I may be biased since I did this
development :-)

Thanks,
Daniele

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v4 0/3] add tx packets aggregation to ethtool and rmnet
  2023-01-13 15:43   ` Daniele Palmas
@ 2023-01-13 16:16     ` Alexander Duyck
  2023-01-13 16:57       ` Dave Taht
  2023-01-13 19:48       ` Jakub Kicinski
  0 siblings, 2 replies; 12+ messages in thread
From: Alexander Duyck @ 2023-01-13 16:16 UTC (permalink / raw)
  To: Daniele Palmas
  Cc: David Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Subash Abhinov Kasiviswanathan, Sean Tranchetti, Jonathan Corbet,
	Alexander Lobakin, Gal Pressman, Dave Taht, Bjørn Mork,
	Greg Kroah-Hartman, netdev

On Fri, Jan 13, 2023 at 7:50 AM Daniele Palmas <dnlplm@gmail.com> wrote:
>
> Hello Alexander,
>
> Il giorno ven 13 gen 2023 alle ore 00:00 Alexander H Duyck
> <alexander.duyck@gmail.com> ha scritto:
> >
> > On Wed, 2023-01-11 at 14:05 +0100, Daniele Palmas wrote:
> > > Hello maintainers and all,
> > >
> > > this patchset implements tx qmap packets aggregation in rmnet and generic
> > > ethtool support for that.
> > >
> > > Some low-cat Thread-x based modems are not capable of properly reaching the maximum
> > > allowed throughput both in tx and rx during a bidirectional test if tx packets
> > > aggregation is not enabled.
> >
> > One question I would have about this is if you are making use of Byte
> > Queue Limits and netdev_xmit_more at all? I know for high speed devices
> > most of us added support for xmit_more because PCIe bandwidth was
> > limiting when we were writing the Tx ring indexes/doorbells with every
> > packet. To overcome that we added netdev_xmit_more which told us when
> > the Qdisc had more packets to give us. This allowed us to better
> > utilize the PCIe bus by bursting packets through without adding
> > additional latency.
> >
>
> no, I was not aware of BQL: this development has been basically
> modelled on what other mobile broadband drivers do (e.g.
> cdc_mbim/cdc_ncm, Qualcomm downstream rmnet implementation), that are
> not using BQL.
>
> If I understand properly documentation
>
> rmnet0/queues/tx-0/byte_queue_limits/limit_max
>
> would be the candidate for replacing ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES.

Yes the general idea is that you end up targeting the upper limit for
how many frames can be sent in a single burst.

> But I can't find anything for ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES
> and ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS, something that should work
> in combination with the bytes limit: at least the first one is
> mandatory, since the modem can't receive more than a certain number
> (this is a variable value depending on the modem model and is
> collected through userspace tools).

In terms of MAX_FRAMES there isn't necessarily anything like that, but
at the same time it is something that is already handled by the
netdev itself by using netif_stop_queue or netif_stop_subqueue
when there isn't space to store another frame. As such most devices
control this by just manipulating their descriptor ring size via
"ethtool -G <dev> tx <N>"

As far as TIME_USECS goes, that is something I tried to propose a decade
ago and it was essentially given a hard "NAK" before xmit_more was
introduced. We shouldn't be adding latency when we don't need to.
Between GSO and xmit_more you should find that the network stack
itself will naturally want to give you larger bursts of frames with
xmit_more set. In addition, adding latency can mess with certain TCP
algorithms and cause problematic behaviors.

> ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES works also as a way to determine
> that tx aggregation has been enabled by the userspace tool managing
> the qmi requests, otherwise no aggregation should be performed.

Is there a specific reason why you wouldn't want to take advantage of
aggregation that is already provided by the stack in the form of
things such as GSO and the qdisc layer? I know most of the high speed
NICs are always making use of xmit_more since things like GSO can take
advantage of it to increase the throughput. Enabling BQL is a way of
taking that one step further and enabling the non-GSO cases.
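
Enabling BQL is then mostly a matter of adding the byte accounting hooks,
roughly like this (single queue shown, foo_* names are placeholders):

static void foo_xmit_frame(struct net_device *dev, struct sk_buff *skb)
{
        /* account bytes handed to the device ... */
        netdev_tx_sent_queue(netdev_get_tx_queue(dev, 0), skb->len);
        /* ... then pass the frame down to the hardware/USB layer */
}

static void foo_tx_done(struct net_device *dev, unsigned int pkts,
                        unsigned int bytes)
{
        /* account what actually completed, so the stack can bound the
         * amount of data sitting in the device queue
         */
        netdev_tx_completed_queue(netdev_get_tx_queue(dev, 0), pkts, bytes);
}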

> > > I verified this problem with rmnet + qmi_wwan by using a MDM9207 Cat. 4 based modem
> > > (50Mbps/150Mbps max throughput). What is actually happening is pictured at
> > > https://drive.google.com/file/d/1gSbozrtd9h0X63i6vdkNpN68d-9sg8f9/view
> > >
> > > Testing with iperf TCP, when rx and tx flows are tested singularly there's no issue
> > > in tx and minor issues in rx (not able to reach max throughput). When there are concurrent
> > > tx and rx flows, tx throughput has an huge drop. rx a minor one, but still present.
> > >
> > > The same scenario with tx aggregation enabled is pictured at
> > > https://drive.google.com/file/d/1jcVIKNZD7K3lHtwKE5W02mpaloudYYih/view
> > > showing a regular graph.
> > >
> > > This issue does not happen with high-cat modems (e.g. SDX20), or at least it
> > > does not happen at the throughputs I'm able to test currently: maybe the same
> > > could happen when moving close to the maximum rates supported by those modems.
> > > Anyway, having the tx aggregation enabled should not hurt.
> > >
> > > The first attempt to solve this issue was in qmi_wwan qmap implementation,
> > > see the discussion at https://lore.kernel.org/netdev/20221019132503.6783-1-dnlplm@gmail.com/
> > >
> > > However, it turned out that rmnet was a better candidate for the implementation.
> > >
> > > Moreover, Greg and Jakub suggested also to use ethtool for the configuration:
> > > not sure if I got their advice right, but this patchset add also generic ethtool
> > > support for tx aggregation.
> >
> > I have concerns about this essentially moving queueing disciplines down
> > into the device. The idea of doing Tx aggregation seems like something
> > that should be done with the qdisc rather than the device driver.
> > Otherwise we are looking at having multiple implementations of this
> > aggregation floating around. It seems like it would make more sense to
> > have this batching happen at the qdisc layer, and then the qdisc layer
> > would pass down a batch of frames w/ xmit_more set to indicate it is
> > flushing the batch.
>
> Honestly, I'm not expert enough to give a reliable opinion about this.
>
> I feel like these settings are more related to the hardware, requiring
> also a configuration on the hardware itself done by the user, so
> ethtool would seem to me a good choice, but I may be biased since I
> did this development :-)

Yeah, I get that. I went through something similar when I had
submitted a patch to defer Tx tail writes in order to try and improve
the PCIe throughput. I would be open to revisiting this if we gave
xmit_more and BQL a try and it doesn't take care of this, but from my
past experience odds are the combination will likely resolve most of
what you are seeing without adding additional latency. At a minimum
the xmit_more should show a significant improvement in Tx throughput
just from the benefits of GSO sending bursts of frames through that
can easily be batched.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v4 0/3] add tx packets aggregation to ethtool and rmnet
  2023-01-13 16:16     ` Alexander Duyck
@ 2023-01-13 16:57       ` Dave Taht
  2023-01-13 19:48       ` Jakub Kicinski
  1 sibling, 0 replies; 12+ messages in thread
From: Dave Taht @ 2023-01-13 16:57 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Daniele Palmas, David Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet, Subash Abhinov Kasiviswanathan, Sean Tranchetti,
	Jonathan Corbet, Alexander Lobakin, Gal Pressman, Bjørn Mork,
	Greg Kroah-Hartman, netdev

Dear Alexander:

Thank you for taking a look at this thread. In earlier tests, the
aggregation was indeed an improvement over the default behavior,
but the amount of data that ended up living in the device was, best
case, 150ms and, worst case, 25 seconds, in the flent tcp_nup case. A
better test would be the staggered-start tcp_4up_squarewave test, which
would show that only one tcp flow out of the 4 is able to start, with
the current overall structure of this portion of the stack, once that
buffer gets filled.

The shiny new Rust-based Crusader test (
https://github.com/Zoxc/crusader ) now also has a nice staggered
start test that shows this problem in more detail.

I was so peeved about 4g/5g behaviors like this in general that I was
going to dedicate a blog entry to it; this recent rant only touches
upon the real problem with engineering to the test, here:

https://blog.cerowrt.org/post/speedtests/

Unfortunately your post is more about leveraging xmit_more - which is
good - but the BQL structure for ethernet itself leverages the concept
of a completion interrupt (when the packets actually hit the air or
wire), which I dearly wish were some previously unknown, single-line
hack to the driver or message in the USB API for 4g/5g etc.

On Fri, Jan 13, 2023 at 8:17 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Fri, Jan 13, 2023 at 7:50 AM Daniele Palmas <dnlplm@gmail.com> wrote:
> >
> > Hello Alexander,
> >
> > Il giorno ven 13 gen 2023 alle ore 00:00 Alexander H Duyck
> > <alexander.duyck@gmail.com> ha scritto:
> > >
> > > On Wed, 2023-01-11 at 14:05 +0100, Daniele Palmas wrote:
> > > > Hello maintainers and all,
> > > >
> > > > this patchset implements tx qmap packets aggregation in rmnet and generic
> > > > ethtool support for that.
> > > >
> > > > Some low-cat Thread-x based modems are not capable of properly reaching the maximum
> > > > allowed throughput both in tx and rx during a bidirectional test if tx packets
> > > > aggregation is not enabled.
> > >
> > > One question I would have about this is if you are making use of Byte
> > > Queue Limits and netdev_xmit_more at all? I know for high speed devices
> > > most of us added support for xmit_more because PCIe bandwidth was
> > > limiting when we were writing the Tx ring indexes/doorbells with every
> > > packet. To overcome that we added netdev_xmit_more which told us when
> > > the Qdisc had more packets to give us. This allowed us to better
> > > utilize the PCIe bus by bursting packets through without adding
> > > additional latency.
> > >
> >
> > no, I was not aware of BQL: this development has been basically
> > modelled on what other mobile broadband drivers do (e.g.
> > cdc_mbim/cdc_ncm, Qualcomm downstream rmnet implementation), that are
> > not using BQL.
> >
> > If I understand properly documentation
> >
> > rmnet0/queues/tx-0/byte_queue_limits/limit_max
> >
> > would be the candidate for replacing ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES.
>
> Yes the general idea is that you end up targeting the upper limit for
> how many frames can be sent in a single burst.
>
> > But I can't find anything for ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES
> > and ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS, something that should work
> > in combination with the bytes limit: at least the first one is
> > mandatory, since the modem can't receive more than a certain number
> > (this is a variable value depending on the modem model and is
> > collected through userspace tools).
>
> In terms of MAX_FRAMES there isn't necessarily anything like that, but
> at the same time it isn't something that is already controlled by the
> netdev itself by using the netif_stop_queue or netif_stop_subqueue
> when there isn't space to store another frame. As such most devices
> control this by just manipulating their descriptor ring size via
> "ethtool -G <dev> tx <N>"
>
> As far as the TIME_USECS that is something I tried to propose a decade
> ago and was essentially given a hard "NAK" before xmit_more was
> introduced. We shouldn't be adding latency when we don't need to.
> Between GSO and xmit_more you should find that the network stack
> itself will naturally want to give you larger bursts of frames with
> xmit_more set. In addition, adding latency can mess with certain TCP
> algorithms and cause problematic behaviors.
>
> > ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES works also as a way to determine
> > that tx aggregation has been enabled by the userspace tool managing
> > the qmi requests, otherwise no aggregation should be performed.
>
> Is there a specific reason why you wouldn't want to take advantage of
> aggregation that is already provided by the stack in the form of
> things such as GSO and the qdisc layer? I know most of the high speed
> NICs are always making use of xmit_more since things like GSO can take
> advantage of it to increase the throughput. Enabling BQL is a way of
> taking that one step further and enabling the non-GSO cases.
>
> > > > I verified this problem with rmnet + qmi_wwan by using a MDM9207 Cat. 4 based modem
> > > > (50Mbps/150Mbps max throughput). What is actually happening is pictured at
> > > > https://drive.google.com/file/d/1gSbozrtd9h0X63i6vdkNpN68d-9sg8f9/view
> > > >
> > > > Testing with iperf TCP, when rx and tx flows are tested singularly there's no issue
> > > > in tx and minor issues in rx (not able to reach max throughput). When there are concurrent
> > > > tx and rx flows, tx throughput has an huge drop. rx a minor one, but still present.
> > > >
> > > > The same scenario with tx aggregation enabled is pictured at
> > > > https://drive.google.com/file/d/1jcVIKNZD7K3lHtwKE5W02mpaloudYYih/view
> > > > showing a regular graph.
> > > >
> > > > This issue does not happen with high-cat modems (e.g. SDX20), or at least it
> > > > does not happen at the throughputs I'm able to test currently: maybe the same
> > > > could happen when moving close to the maximum rates supported by those modems.
> > > > Anyway, having the tx aggregation enabled should not hurt.
> > > >
> > > > The first attempt to solve this issue was in qmi_wwan qmap implementation,
> > > > see the discussion at https://lore.kernel.org/netdev/20221019132503.6783-1-dnlplm@gmail.com/
> > > >
> > > > However, it turned out that rmnet was a better candidate for the implementation.
> > > >
> > > > Moreover, Greg and Jakub suggested also to use ethtool for the configuration:
> > > > not sure if I got their advice right, but this patchset add also generic ethtool
> > > > support for tx aggregation.
> > >
> > > I have concerns about this essentially moving queueing disciplines down
> > > into the device. The idea of doing Tx aggregation seems like something
> > > that should be done with the qdisc rather than the device driver.
> > > Otherwise we are looking at having multiple implementations of this
> > > aggregation floating around. It seems like it would make more sense to
> > > have this batching happen at the qdisc layer, and then the qdisc layer
> > > would pass down a batch of frames w/ xmit_more set to indicate it is
> > > flushing the batch.
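
Roughly, that batching is already what the core transmit path does when the
qdisc hands it a list of skbs; a simplified paraphrase of the
dev_hard_start_xmit() loop (not a verbatim copy of the kernel code, and dev,
txq and rc are assumed to be defined by the surrounding function):

while (skb) {
	struct sk_buff *next = skb->next;

	skb_mark_not_on_list(skb);
	/* "more" is true for every frame except the last one in the batch,
	 * so the driver can defer its doorbell/flush until the batch ends.
	 */
	rc = netdev_start_xmit(skb, dev, txq, /* more */ !!next);
	if (!dev_xmit_complete(rc))
		break;
	skb = next;
}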
> >
> > Honestly, I'm not expert enough to give a reliable opinion about this.
> >
> > I feel these settings are more closely related to the hardware, since they
> > also require a configuration on the hardware itself done by the user, so
> > ethtool seems to me a good choice, but I may be biased since I did this
> > development :-)
>
> Yeah, I get that. I went through something similar when I had
> submitted a patch to defer Tx tail writes in order to try and improve
> PCIe throughput. I would be open to revisiting this if we give
> xmit_more and BQL a try and they don't take care of this, but from my
> past experience odds are the combination will resolve most of
> what you are seeing without adding additional latency. At a minimum,
> xmit_more should show a significant improvement in Tx throughput
> just from the benefits of GSO sending bursts of frames through that
> can easily be batched.



-- 
This song goes out to all the folk that thought Stadia would work:
https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
Dave Täht CEO, TekLibre, LLC


* Re: [PATCH net-next v4 0/3] add tx packets aggregation to ethtool and rmnet
  2023-01-13 16:16     ` Alexander Duyck
  2023-01-13 16:57       ` Dave Taht
@ 2023-01-13 19:48       ` Jakub Kicinski
  2023-01-15 12:15         ` Daniele Palmas
  1 sibling, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2023-01-13 19:48 UTC (permalink / raw)
  To: Daniele Palmas
  Cc: Alexander Duyck, David Miller, Paolo Abeni, Eric Dumazet,
	Subash Abhinov Kasiviswanathan, Sean Tranchetti, Jonathan Corbet,
	Alexander Lobakin, Gal Pressman, Dave Taht, Bjørn Mork,
	Greg Kroah-Hartman, netdev

On Fri, 13 Jan 2023 08:16:48 -0800 Alexander Duyck wrote:
> > ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES also works as a way to determine
> > whether tx aggregation has been enabled by the userspace tool managing
> > the QMI requests; otherwise no aggregation should be performed.
> 
> Is there a specific reason why you wouldn't want to take advantage of
> aggregation that is already provided by the stack in the form of
> things such as GSO and the qdisc layer? I know most of the high speed
> NICs are always making use of xmit_more since things like GSO can take
> advantage of it to increase the throughput. Enabling BQL is a way of
> taking that one step further and enabling the non-GSO cases.

The patches were applied last night by DaveM, but FWIW I think
Alex's idea is quite interesting. Even without BQL I believe you'd
get xmit_more set within a single GSO super-frame. TCP sends data
in chunks of 64kB, and you're only aggregating 32kB. If we could
get the same effect without any added latency and user configuration,
that would be great.
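
To put rough numbers on that (illustrative arithmetic only, assuming full-size
~1500-byte frames and software segmentation): a 64 kB GSO super-frame splits
into roughly 40+ segments, while a 32 kB byte budget covers only about 20 of
them, so a single GSO burst already spans more than one aggregation window and
every segment but the last in that burst would be handed down with the
xmit_more hint set.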


* Re: [PATCH net-next v4 0/3] add tx packets aggregation to ethtool and rmnet
  2023-01-11 13:05 [PATCH net-next v4 0/3] add tx packets aggregation to ethtool and rmnet Daniele Palmas
                   ` (3 preceding siblings ...)
  2023-01-12 22:59 ` [PATCH net-next v4 0/3] add tx packets aggregation to ethtool and rmnet Alexander H Duyck
@ 2023-01-13 21:20 ` patchwork-bot+netdevbpf
  4 siblings, 0 replies; 12+ messages in thread
From: patchwork-bot+netdevbpf @ 2023-01-13 21:20 UTC (permalink / raw)
  To: Daniele Palmas
  Cc: davem, kuba, pabeni, edumazet, quic_subashab, quic_stranche,
	corbet, alexandr.lobakin, gal, dave.taht, bjorn, gregkh, netdev

Hello:

This series was applied to netdev/net-next.git (master)
by David S. Miller <davem@davemloft.net>:

On Wed, 11 Jan 2023 14:05:17 +0100 you wrote:
> Hello maintainers and all,
> 
> this patchset implements tx qmap packets aggregation in rmnet and generic
> ethtool support for that.
> 
> Some low-cat Thread-x based modems are not capable of properly reaching the maximum
> allowed throughput both in tx and rx during a bidirectional test if tx packets
> aggregation is not enabled.
> 
> [...]

Here is the summary with links:
  - [net-next,v4,1/3] ethtool: add tx aggregation parameters
    https://git.kernel.org/netdev/net-next/c/31de2842399a
  - [net-next,v4,2/3] net: qualcomm: rmnet: add tx packets aggregation
    https://git.kernel.org/netdev/net-next/c/64b5d1f8f2d1
  - [net-next,v4,3/3] net: qualcomm: rmnet: add ethtool support for configuring tx aggregation
    https://git.kernel.org/netdev/net-next/c/db8a563a9d90

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH net-next v4 0/3] add tx packets aggregation to ethtool and rmnet
  2023-01-13 19:48       ` Jakub Kicinski
@ 2023-01-15 12:15         ` Daniele Palmas
  0 siblings, 0 replies; 12+ messages in thread
From: Daniele Palmas @ 2023-01-15 12:15 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexander Duyck, David Miller, Paolo Abeni, Eric Dumazet,
	Subash Abhinov Kasiviswanathan, Sean Tranchetti, Jonathan Corbet,
	Alexander Lobakin, Gal Pressman, Dave Taht, Bjørn Mork,
	Greg Kroah-Hartman, netdev

Hello Jakub and Alexander,

On Fri, 13 Jan 2023 at 20:48, Jakub Kicinski
<kuba@kernel.org> wrote:
>
> On Fri, 13 Jan 2023 08:16:48 -0800 Alexander Duyck wrote:
> > > ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES also works as a way to determine
> > > whether tx aggregation has been enabled by the userspace tool managing
> > > the QMI requests; otherwise no aggregation should be performed.
> >
> > Is there a specific reason why you wouldn't want to take advantage of
> > aggregation that is already provided by the stack in the form of
> > things such as GSO and the qdisc layer? I know most of the high speed
> > NICs are always making use of xmit_more since things like GSO can take
> > advantage of it to increase the throughput. Enabling BQL is a way of
> > taking that one step further and enabling the non-GSO cases.
>
> The patches were applied last night by DaveM, but FWIW I think
> Alex's idea is quite interesting. Even without BQL I believe you'd
> get xmit_more set within a single GSO super-frame. TCP sends data
> in chunks of 64kB, and you're only aggregating 32kB. If we could
> get the same effect without any added latency and user configuration,
> that would be great.

Thanks for the hints, I'll explore xmit_more usage and try to gather
some numbers to compare the two solutions.

Regards,
Daniele


Thread overview: 12+ messages
2023-01-11 13:05 [PATCH net-next v4 0/3] add tx packets aggregation to ethtool and rmnet Daniele Palmas
2023-01-11 13:05 ` [PATCH net-next v4 1/3] ethtool: add tx aggregation parameters Daniele Palmas
2023-01-11 13:05 ` [PATCH net-next v4 2/3] net: qualcomm: rmnet: add tx packets aggregation Daniele Palmas
2023-01-12 19:41   ` Subash Abhinov Kasiviswanathan (KS)
2023-01-11 13:05 ` [PATCH net-next v4 3/3] net: qualcomm: rmnet: add ethtool support for configuring tx aggregation Daniele Palmas
2023-01-12 22:59 ` [PATCH net-next v4 0/3] add tx packets aggregation to ethtool and rmnet Alexander H Duyck
2023-01-13 15:43   ` Daniele Palmas
2023-01-13 16:16     ` Alexander Duyck
2023-01-13 16:57       ` Dave Taht
2023-01-13 19:48       ` Jakub Kicinski
2023-01-15 12:15         ` Daniele Palmas
2023-01-13 21:20 ` patchwork-bot+netdevbpf
