linux-wireless.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] wifi: rtw88: usb: Copy instead of cloning the RX skb
@ 2024-11-14 23:06 Bitterblue Smith
  2024-11-14 23:07 ` [PATCH 2/2] wifi: rtw88: usb: Preallocate and reuse the RX skbs Bitterblue Smith
  2024-11-18  9:16 ` [PATCH 1/2] wifi: rtw88: usb: Copy instead of cloning the RX skb Ping-Ke Shih
  0 siblings, 2 replies; 10+ messages in thread
From: Bitterblue Smith @ 2024-11-14 23:06 UTC (permalink / raw)
  To: linux-wireless@vger.kernel.org; +Cc: Ping-Ke Shih, Sascha Hauer

"iperf3 -c 192.168.0.1 -R --udp -b 0" shows about 40% of datagrams
are lost. Many torrents don't download faster than 3 MiB/s, probably
because the Bittorrent protocol uses UDP. This is somehow related to
the use of skb_clone() in the RX path.

Don't use skb_clone(). Instead allocate a new skb for each 802.11 frame
received and copy the data from the big (32768 byte) skb.

With this patch, "iperf3 -c 192.168.0.1 -R --udp -b 0" shows only 1-2%
of datagrams are lost, and torrents can reach download speeds of 36
MiB/s.

Tested with RTL8812AU and RTL8822CU.

Signed-off-by: Bitterblue Smith <rtl8821cerfe2@gmail.com>
---
 drivers/net/wireless/realtek/rtw88/usb.c | 52 ++++++++++++++----------
 1 file changed, 31 insertions(+), 21 deletions(-)

diff --git a/drivers/net/wireless/realtek/rtw88/usb.c b/drivers/net/wireless/realtek/rtw88/usb.c
index 93ac4837e1b5..727569d4ef4b 100644
--- a/drivers/net/wireless/realtek/rtw88/usb.c
+++ b/drivers/net/wireless/realtek/rtw88/usb.c
@@ -7,6 +7,7 @@
 #include <linux/mutex.h>
 #include "main.h"
 #include "debug.h"
+#include "mac.h"
 #include "reg.h"
 #include "tx.h"
 #include "rx.h"
@@ -546,49 +547,58 @@ static void rtw_usb_rx_handler(struct work_struct *work)
 {
 	struct rtw_usb *rtwusb = container_of(work, struct rtw_usb, rx_work);
 	struct rtw_dev *rtwdev = rtwusb->rtwdev;
-	const struct rtw_chip_info *chip = rtwdev->chip;
-	u32 pkt_desc_sz = chip->rx_pkt_desc_sz;
 	struct ieee80211_rx_status rx_status;
-	u32 pkt_offset, next_pkt, urb_len;
 	struct rtw_rx_pkt_stat pkt_stat;
-	struct sk_buff *next_skb;
+	struct sk_buff *rx_skb;
 	struct sk_buff *skb;
+	u32 pkt_desc_sz = rtwdev->chip->rx_pkt_desc_sz;
+	u32 max_skb_len = pkt_desc_sz + PHY_STATUS_SIZE * 8 +
+			  IEEE80211_MAX_MPDU_LEN_VHT_11454;
+	u32 pkt_offset, next_pkt, skb_len;
 	u8 *rx_desc;
 	int limit;
 
 	for (limit = 0; limit < 200; limit++) {
-		skb = skb_dequeue(&rtwusb->rx_queue);
-		if (!skb)
+		rx_skb = skb_dequeue(&rtwusb->rx_queue);
+		if (!rx_skb)
 			break;
 
 		if (skb_queue_len(&rtwusb->rx_queue) >= RTW_USB_MAX_RXQ_LEN) {
 			dev_dbg_ratelimited(rtwdev->dev, "failed to get rx_queue, overflow\n");
-			dev_kfree_skb_any(skb);
+			dev_kfree_skb_any(rx_skb);
 			continue;
 		}
 
-		urb_len = skb->len;
+		rx_desc = rx_skb->data;
 
 		do {
-			rx_desc = skb->data;
 			rtw_rx_query_rx_desc(rtwdev, rx_desc, &pkt_stat,
 					     &rx_status);
 			pkt_offset = pkt_desc_sz + pkt_stat.drv_info_sz +
 				     pkt_stat.shift;
 
-			next_pkt = round_up(pkt_stat.pkt_len + pkt_offset, 8);
+			skb_len = pkt_stat.pkt_len + pkt_offset;
+			if (skb_len > max_skb_len) {
+				rtw_dbg(rtwdev, RTW_DBG_USB,
+					"skipping too big packet: %u\n",
+					skb_len);
+				goto skip_packet;
+			}
 
-			if (urb_len >= next_pkt + pkt_desc_sz)
-				next_skb = skb_clone(skb, GFP_KERNEL);
-			else
-				next_skb = NULL;
+			skb = alloc_skb(skb_len, GFP_KERNEL);
+			if (!skb) {
+				rtw_dbg(rtwdev, RTW_DBG_USB,
+					"failed to allocate RX skb of size %u\n",
+					skb_len);
+				goto skip_packet;
+			}
+
+			skb_put_data(skb, rx_desc, skb_len);
 
 			if (pkt_stat.is_c2h) {
-				skb_trim(skb, pkt_stat.pkt_len + pkt_offset);
 				rtw_fw_c2h_cmd_rx_irqsafe(rtwdev, pkt_offset, skb);
 			} else {
 				skb_pull(skb, pkt_offset);
-				skb_trim(skb, pkt_stat.pkt_len);
 				rtw_update_rx_freq_for_invalid(rtwdev, skb,
 							       &rx_status,
 							       &pkt_stat);
@@ -597,12 +607,12 @@ static void rtw_usb_rx_handler(struct work_struct *work)
 				ieee80211_rx_irqsafe(rtwdev->hw, skb);
 			}
 
-			skb = next_skb;
-			if (skb)
-				skb_pull(skb, next_pkt);
+skip_packet:
+			next_pkt = round_up(skb_len, 8);
+			rx_desc += next_pkt;
+		} while (rx_desc + pkt_desc_sz < rx_skb->data + rx_skb->len);
 
-			urb_len -= next_pkt;
-		} while (skb);
+		dev_kfree_skb_any(rx_skb);
 	}
 }
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/2] wifi: rtw88: usb: Preallocate and reuse the RX skbs
  2024-11-14 23:06 [PATCH 1/2] wifi: rtw88: usb: Copy instead of cloning the RX skb Bitterblue Smith
@ 2024-11-14 23:07 ` Bitterblue Smith
  2024-11-19  0:50   ` Ping-Ke Shih
  2024-11-18  9:16 ` [PATCH 1/2] wifi: rtw88: usb: Copy instead of cloning the RX skb Ping-Ke Shih
  1 sibling, 1 reply; 10+ messages in thread
From: Bitterblue Smith @ 2024-11-14 23:07 UTC (permalink / raw)
  To: linux-wireless@vger.kernel.org; +Cc: Ping-Ke Shih, Sascha Hauer

The USB driver uses four USB Request Blocks for RX. Before submitting
one, it allocates a 32768 byte skb for the RX data. This allocation can
fail, maybe due to temporary memory fragmentation. When the allocation
fails, the corresponding URB is never submitted again. After four such
allocation failures, all RX stops because the driver is not requesting
data from the device anymore.

Don't allocate a 32768 byte skb when submitting a USB Request Block
(which happens very often). Instead preallocate 8 such skbs, and reuse
them over and over. If all 8 are busy, allocate a new one. This is
pretty rare. If the allocation fails, use a work to try again later.
When there are enough free skbs again, free the excess skbs.

Also, use WQ_BH for the RX workqueue. With a normal or high priority
workqueue the skbs are processed too slowly when the system is even a
little busy, like when opening a new page in a browser, and the driver
runs out of free skbs and allocates a lot of new ones.

Move C2H_ADAPTIVITY to the c2h workqueue instead of handling it directly
from rtw_fw_c2h_cmd_rx_irqsafe(), which runs in the RX workqueue. It
reads hardware registers, which is not "irqsafe" with USB.

This is more or less what the out-of-tree Realtek drivers do, except
they use a tasklet instead of a BH workqueue.

Tested with RTL8723DU, RTL8821AU, RTL8812AU, RTL8812BU, RTL8822CU,
RTL8811CU.

Closes: https://lore.kernel.org/linux-wireless/6e7ecb47-7ea0-433a-a19f-05f88a2edf6b@gmail.com/
Signed-off-by: Bitterblue Smith <rtl8821cerfe2@gmail.com>
---
 drivers/net/wireless/realtek/rtw88/fw.c  |  7 +--
 drivers/net/wireless/realtek/rtw88/usb.c | 73 ++++++++++++++++++++----
 drivers/net/wireless/realtek/rtw88/usb.h |  3 +
 3 files changed, 67 insertions(+), 16 deletions(-)

diff --git a/drivers/net/wireless/realtek/rtw88/fw.c b/drivers/net/wireless/realtek/rtw88/fw.c
index e6e9946fbf44..02389b7c6876 100644
--- a/drivers/net/wireless/realtek/rtw88/fw.c
+++ b/drivers/net/wireless/realtek/rtw88/fw.c
@@ -332,6 +332,9 @@ void rtw_fw_c2h_cmd_handle(struct rtw_dev *rtwdev, struct sk_buff *skb)
 	case C2H_RA_RPT:
 		rtw_fw_ra_report_handle(rtwdev, c2h->payload, len);
 		break;
+	case C2H_ADAPTIVITY:
+		rtw_fw_adaptivity_result(rtwdev, c2h->payload, len);
+		break;
 	default:
 		rtw_dbg(rtwdev, RTW_DBG_FW, "C2H 0x%x isn't handled\n", c2h->id);
 		break;
@@ -367,10 +370,6 @@ void rtw_fw_c2h_cmd_rx_irqsafe(struct rtw_dev *rtwdev, u32 pkt_offset,
 		rtw_fw_scan_result(rtwdev, c2h->payload, len);
 		dev_kfree_skb_any(skb);
 		break;
-	case C2H_ADAPTIVITY:
-		rtw_fw_adaptivity_result(rtwdev, c2h->payload, len);
-		dev_kfree_skb_any(skb);
-		break;
 	default:
 		/* pass offset for further operation */
 		*((u32 *)skb->cb) = pkt_offset;
diff --git a/drivers/net/wireless/realtek/rtw88/usb.c b/drivers/net/wireless/realtek/rtw88/usb.c
index 727569d4ef4b..5c2dfa2ced92 100644
--- a/drivers/net/wireless/realtek/rtw88/usb.c
+++ b/drivers/net/wireless/realtek/rtw88/usb.c
@@ -585,7 +585,7 @@ static void rtw_usb_rx_handler(struct work_struct *work)
 				goto skip_packet;
 			}
 
-			skb = alloc_skb(skb_len, GFP_KERNEL);
+			skb = alloc_skb(skb_len, GFP_ATOMIC);
 			if (!skb) {
 				rtw_dbg(rtwdev, RTW_DBG_USB,
 					"failed to allocate RX skb of size %u\n",
@@ -612,7 +612,11 @@ static void rtw_usb_rx_handler(struct work_struct *work)
 			rx_desc += next_pkt;
 		} while (rx_desc + pkt_desc_sz < rx_skb->data + rx_skb->len);
 
-		dev_kfree_skb_any(rx_skb);
+		if (skb_queue_len(&rtwusb->rx_free_queue) >=
+		    RTW_USB_RX_SKB_NUM - RTW_USB_RXCB_NUM)
+			dev_kfree_skb_any(rx_skb);
+		else
+			skb_queue_tail(&rtwusb->rx_free_queue, rx_skb);
 	}
 }
 
@@ -621,23 +625,57 @@ static void rtw_usb_read_port_complete(struct urb *urb);
 static void rtw_usb_rx_resubmit(struct rtw_usb *rtwusb, struct rx_usb_ctrl_block *rxcb)
 {
 	struct rtw_dev *rtwdev = rtwusb->rtwdev;
+	struct sk_buff *rx_skb;
+	gfp_t priority = in_interrupt() ? GFP_ATOMIC : GFP_KERNEL;
 	int error;
 
-	rxcb->rx_skb = alloc_skb(RTW_USB_MAX_RECVBUF_SZ, GFP_ATOMIC);
-	if (!rxcb->rx_skb)
-		return;
+	rx_skb = skb_dequeue(&rtwusb->rx_free_queue);
+	if (!rx_skb)
+		rx_skb = alloc_skb(RTW_USB_MAX_RECVBUF_SZ, priority);
+
+	if (!rx_skb)
+		goto try_later;
+
+	rxcb->rx_skb = rx_skb;
+
+	skb_reset_tail_pointer(rxcb->rx_skb);
+	rxcb->rx_skb->len = 0;
 
 	usb_fill_bulk_urb(rxcb->rx_urb, rtwusb->udev,
 			  usb_rcvbulkpipe(rtwusb->udev, rtwusb->pipe_in),
 			  rxcb->rx_skb->data, RTW_USB_MAX_RECVBUF_SZ,
 			  rtw_usb_read_port_complete, rxcb);
 
-	error = usb_submit_urb(rxcb->rx_urb, GFP_ATOMIC);
+	error = usb_submit_urb(rxcb->rx_urb, priority);
 	if (error) {
-		kfree_skb(rxcb->rx_skb);
+		skb_queue_tail(&rtwusb->rx_free_queue, rxcb->rx_skb);
+
 		if (error != -ENODEV)
 			rtw_err(rtwdev, "Err sending rx data urb %d\n",
 				error);
+
+		if (error == -ENOMEM)
+			goto try_later;
+	}
+
+	return;
+
+try_later:
+	rxcb->rx_skb = NULL;
+	queue_work(rtwusb->rxwq, &rtwusb->rx_urb_work);
+}
+
+static void rtw_usb_rx_resubmit_work(struct work_struct *work)
+{
+	struct rtw_usb *rtwusb = container_of(work, struct rtw_usb, rx_urb_work);
+	struct rx_usb_ctrl_block *rxcb;
+	int i;
+
+	for (i = 0; i < RTW_USB_RXCB_NUM; i++) {
+		rxcb = &rtwusb->rx_cb[i];
+
+		if (!rxcb->rx_skb)
+			rtw_usb_rx_resubmit(rtwusb, rxcb);
 	}
 }
 
@@ -653,8 +691,7 @@ static void rtw_usb_read_port_complete(struct urb *urb)
 		    urb->actual_length < 24) {
 			rtw_err(rtwdev, "failed to get urb length:%d\n",
 				urb->actual_length);
-			if (skb)
-				dev_kfree_skb_any(skb);
+			skb_queue_tail(&rtwusb->rx_free_queue, skb);
 		} else {
 			skb_put(skb, urb->actual_length);
 			skb_queue_tail(&rtwusb->rx_queue, skb);
@@ -662,6 +699,8 @@ static void rtw_usb_read_port_complete(struct urb *urb)
 		}
 		rtw_usb_rx_resubmit(rtwusb, rxcb);
 	} else {
+		skb_queue_tail(&rtwusb->rx_free_queue, skb);
+
 		switch (urb->status) {
 		case -EINVAL:
 		case -EPIPE:
@@ -679,8 +718,6 @@ static void rtw_usb_read_port_complete(struct urb *urb)
 			rtw_err(rtwdev, "status %d\n", urb->status);
 			break;
 		}
-		if (skb)
-			dev_kfree_skb_any(skb);
 	}
 }
 
@@ -868,16 +905,26 @@ static struct rtw_hci_ops rtw_usb_ops = {
 static int rtw_usb_init_rx(struct rtw_dev *rtwdev)
 {
 	struct rtw_usb *rtwusb = rtw_get_usb_priv(rtwdev);
+	struct sk_buff *rx_skb;
+	int i;
 
-	rtwusb->rxwq = create_singlethread_workqueue("rtw88_usb: rx wq");
+	rtwusb->rxwq = alloc_workqueue("rtw88_usb: rx wq", WQ_BH, 0);
 	if (!rtwusb->rxwq) {
 		rtw_err(rtwdev, "failed to create RX work queue\n");
 		return -ENOMEM;
 	}
 
 	skb_queue_head_init(&rtwusb->rx_queue);
+	skb_queue_head_init(&rtwusb->rx_free_queue);
 
 	INIT_WORK(&rtwusb->rx_work, rtw_usb_rx_handler);
+	INIT_WORK(&rtwusb->rx_urb_work, rtw_usb_rx_resubmit_work);
+
+	for (i = 0; i < RTW_USB_RX_SKB_NUM; i++) {
+		rx_skb = alloc_skb(RTW_USB_MAX_RECVBUF_SZ, GFP_KERNEL);
+		if (rx_skb)
+			skb_queue_tail(&rtwusb->rx_free_queue, rx_skb);
+	}
 
 	return 0;
 }
@@ -902,6 +949,8 @@ static void rtw_usb_deinit_rx(struct rtw_dev *rtwdev)
 
 	flush_workqueue(rtwusb->rxwq);
 	destroy_workqueue(rtwusb->rxwq);
+
+	skb_queue_purge(&rtwusb->rx_free_queue);
 }
 
 static int rtw_usb_init_tx(struct rtw_dev *rtwdev)
diff --git a/drivers/net/wireless/realtek/rtw88/usb.h b/drivers/net/wireless/realtek/rtw88/usb.h
index 86697a5c0103..9b695b688b24 100644
--- a/drivers/net/wireless/realtek/rtw88/usb.h
+++ b/drivers/net/wireless/realtek/rtw88/usb.h
@@ -38,6 +38,7 @@
 #define RTW_USB_RXAGG_TIMEOUT		10
 
 #define RTW_USB_RXCB_NUM		4
+#define RTW_USB_RX_SKB_NUM		8
 
 #define RTW_USB_EP_MAX			4
 
@@ -81,7 +82,9 @@ struct rtw_usb {
 
 	struct rx_usb_ctrl_block rx_cb[RTW_USB_RXCB_NUM];
 	struct sk_buff_head rx_queue;
+	struct sk_buff_head rx_free_queue;
 	struct work_struct rx_work;
+	struct work_struct rx_urb_work;
 };
 
 static inline struct rtw_usb_tx_data *rtw_usb_get_tx_data(struct sk_buff *skb)
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* RE: [PATCH 1/2] wifi: rtw88: usb: Copy instead of cloning the RX skb
  2024-11-14 23:06 [PATCH 1/2] wifi: rtw88: usb: Copy instead of cloning the RX skb Bitterblue Smith
  2024-11-14 23:07 ` [PATCH 2/2] wifi: rtw88: usb: Preallocate and reuse the RX skbs Bitterblue Smith
@ 2024-11-18  9:16 ` Ping-Ke Shih
  2024-11-25 23:48   ` Bitterblue Smith
  1 sibling, 1 reply; 10+ messages in thread
From: Ping-Ke Shih @ 2024-11-18  9:16 UTC (permalink / raw)
  To: Bitterblue Smith, linux-wireless@vger.kernel.org; +Cc: Sascha Hauer

Bitterblue Smith <rtl8821cerfe2@gmail.com> wrote:
> "iperf3 -c 192.168.0.1 -R --udp -b 0" shows about 40% of datagrams
> are lost. Many torrents don't download faster than 3 MiB/s, probably
> because the Bittorrent protocol uses UDP. This is somehow related to
> the use of skb_clone() in the RX path.

Using skb_clone() can improve throughput is weird to me too. Do you check
top with 100% cpu usage?

> 
> Don't use skb_clone(). Instead allocate a new skb for each 802.11 frame
> received and copy the data from the big (32768 byte) skb.
> 
> With this patch, "iperf3 -c 192.168.0.1 -R --udp -b 0" shows only 1-2%
> of datagrams are lost, and torrents can reach download speeds of 36
> MiB/s.
> 
> Tested with RTL8812AU and RTL8822CU.
> 
> Signed-off-by: Bitterblue Smith <rtl8821cerfe2@gmail.com>
> ---
>  drivers/net/wireless/realtek/rtw88/usb.c | 52 ++++++++++++++----------
>  1 file changed, 31 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/net/wireless/realtek/rtw88/usb.c b/drivers/net/wireless/realtek/rtw88/usb.c
> index 93ac4837e1b5..727569d4ef4b 100644
> --- a/drivers/net/wireless/realtek/rtw88/usb.c
> +++ b/drivers/net/wireless/realtek/rtw88/usb.c
> @@ -7,6 +7,7 @@
>  #include <linux/mutex.h>
>  #include "main.h"
>  #include "debug.h"
> +#include "mac.h"
>  #include "reg.h"
>  #include "tx.h"
>  #include "rx.h"
> @@ -546,49 +547,58 @@ static void rtw_usb_rx_handler(struct work_struct *work)
>  {
>         struct rtw_usb *rtwusb = container_of(work, struct rtw_usb, rx_work);
>         struct rtw_dev *rtwdev = rtwusb->rtwdev;
> -       const struct rtw_chip_info *chip = rtwdev->chip;
> -       u32 pkt_desc_sz = chip->rx_pkt_desc_sz;
>         struct ieee80211_rx_status rx_status;
> -       u32 pkt_offset, next_pkt, urb_len;
>         struct rtw_rx_pkt_stat pkt_stat;
> -       struct sk_buff *next_skb;
> +       struct sk_buff *rx_skb;
>         struct sk_buff *skb;
> +       u32 pkt_desc_sz = rtwdev->chip->rx_pkt_desc_sz;
> +       u32 max_skb_len = pkt_desc_sz + PHY_STATUS_SIZE * 8 +
> +                         IEEE80211_MAX_MPDU_LEN_VHT_11454;
> +       u32 pkt_offset, next_pkt, skb_len;
>         u8 *rx_desc;
>         int limit;
> 
>         for (limit = 0; limit < 200; limit++) {
> -               skb = skb_dequeue(&rtwusb->rx_queue);
> -               if (!skb)
> +               rx_skb = skb_dequeue(&rtwusb->rx_queue);
> +               if (!rx_skb)
>                         break;
> 
>                 if (skb_queue_len(&rtwusb->rx_queue) >= RTW_USB_MAX_RXQ_LEN) {
>                         dev_dbg_ratelimited(rtwdev->dev, "failed to get rx_queue, overflow\n");
> -                       dev_kfree_skb_any(skb);
> +                       dev_kfree_skb_any(rx_skb);
>                         continue;
>                 }
> 
> -               urb_len = skb->len;
> +               rx_desc = rx_skb->data;
> 
>                 do {
> -                       rx_desc = skb->data;
>                         rtw_rx_query_rx_desc(rtwdev, rx_desc, &pkt_stat,
>                                              &rx_status);
>                         pkt_offset = pkt_desc_sz + pkt_stat.drv_info_sz +
>                                      pkt_stat.shift;
> 
> -                       next_pkt = round_up(pkt_stat.pkt_len + pkt_offset, 8);
> +                       skb_len = pkt_stat.pkt_len + pkt_offset;
> +                       if (skb_len > max_skb_len) {

This seems a new rule introduced by this patch. Do you really encounter this
case? Maybe this is the cause of slow download throughput?

> +                               rtw_dbg(rtwdev, RTW_DBG_USB,
> +                                       "skipping too big packet: %u\n",
> +                                       skb_len);
> +                               goto skip_packet;
> +                       }
> 
> -                       if (urb_len >= next_pkt + pkt_desc_sz)
> -                               next_skb = skb_clone(skb, GFP_KERNEL);
> -                       else
> -                               next_skb = NULL;
> +                       skb = alloc_skb(skb_len, GFP_KERNEL);
> +                       if (!skb) {
> +                               rtw_dbg(rtwdev, RTW_DBG_USB,
> +                                       "failed to allocate RX skb of size %u\n",
> +                                       skb_len);
> +                               goto skip_packet;
> +                       }
> +
> +                       skb_put_data(skb, rx_desc, skb_len);
> 
>                         if (pkt_stat.is_c2h) {
> -                               skb_trim(skb, pkt_stat.pkt_len + pkt_offset);
>                                 rtw_fw_c2h_cmd_rx_irqsafe(rtwdev, pkt_offset, skb);
>                         } else {
>                                 skb_pull(skb, pkt_offset);
> -                               skb_trim(skb, pkt_stat.pkt_len);
>                                 rtw_update_rx_freq_for_invalid(rtwdev, skb,
>                                                                &rx_status,
>                                                                &pkt_stat);
> @@ -597,12 +607,12 @@ static void rtw_usb_rx_handler(struct work_struct *work)
>                                 ieee80211_rx_irqsafe(rtwdev->hw, skb);
>                         }
> 
> -                       skb = next_skb;
> -                       if (skb)
> -                               skb_pull(skb, next_pkt);
> +skip_packet:
> +                       next_pkt = round_up(skb_len, 8);
> +                       rx_desc += next_pkt;
> +               } while (rx_desc + pkt_desc_sz < rx_skb->data + rx_skb->len);
> 
> -                       urb_len -= next_pkt;
> -               } while (skb);
> +               dev_kfree_skb_any(rx_skb);
>         }
>  }
> 
> --
> 2.47.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [PATCH 2/2] wifi: rtw88: usb: Preallocate and reuse the RX skbs
  2024-11-14 23:07 ` [PATCH 2/2] wifi: rtw88: usb: Preallocate and reuse the RX skbs Bitterblue Smith
@ 2024-11-19  0:50   ` Ping-Ke Shih
  2024-11-25 23:49     ` Bitterblue Smith
  0 siblings, 1 reply; 10+ messages in thread
From: Ping-Ke Shih @ 2024-11-19  0:50 UTC (permalink / raw)
  To: Bitterblue Smith, linux-wireless@vger.kernel.org; +Cc: Sascha Hauer

Bitterblue Smith <rtl8821cerfe2@gmail.com> wrote:
> The USB driver uses four USB Request Blocks for RX. Before submitting
> one, it allocates a 32768 byte skb for the RX data. This allocation can
> fail, maybe due to temporary memory fragmentation. When the allocation
> fails, the corresponding URB is never submitted again. After four such
> allocation failures, all RX stops because the driver is not requesting
> data from the device anymore.
> 
> Don't allocate a 32768 byte skb when submitting a USB Request Block
> (which happens very often). Instead preallocate 8 such skbs, and reuse
> them over and over. If all 8 are busy, allocate a new one. This is
> pretty rare. If the allocation fails, use a work to try again later.
> When there are enough free skbs again, free the excess skbs.
> 
> Also, use WQ_BH for the RX workqueue. With a normal or high priority
> workqueue the skbs are processed too slowly when the system is even a
> little busy, like when opening a new page in a browser, and the driver
> runs out of free skbs and allocates a lot of new ones.
> 
> Move C2H_ADAPTIVITY to the c2h workqueue instead of handling it directly
> from rtw_fw_c2h_cmd_rx_irqsafe(), which runs in the RX workqueue. It
> reads hardware registers, which is not "irqsafe" with USB.

This part should be in a separated patch. 

> 
> This is more or less what the out-of-tree Realtek drivers do, except
> they use a tasklet instead of a BH workqueue.
> 
> Tested with RTL8723DU, RTL8821AU, RTL8812AU, RTL8812BU, RTL8822CU,
> RTL8811CU.
> 
> Closes: https://lore.kernel.org/linux-wireless/6e7ecb47-7ea0-433a-a19f-05f88a2edf6b@gmail.com/
> Signed-off-by: Bitterblue Smith <rtl8821cerfe2@gmail.com>
> ---
>  drivers/net/wireless/realtek/rtw88/fw.c  |  7 +--
>  drivers/net/wireless/realtek/rtw88/usb.c | 73 ++++++++++++++++++++----
>  drivers/net/wireless/realtek/rtw88/usb.h |  3 +
>  3 files changed, 67 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/net/wireless/realtek/rtw88/fw.c b/drivers/net/wireless/realtek/rtw88/fw.c
> index e6e9946fbf44..02389b7c6876 100644
> --- a/drivers/net/wireless/realtek/rtw88/fw.c
> +++ b/drivers/net/wireless/realtek/rtw88/fw.c
> @@ -332,6 +332,9 @@ void rtw_fw_c2h_cmd_handle(struct rtw_dev *rtwdev, struct sk_buff *skb)
>         case C2H_RA_RPT:
>                 rtw_fw_ra_report_handle(rtwdev, c2h->payload, len);
>                 break;
> +       case C2H_ADAPTIVITY:
> +               rtw_fw_adaptivity_result(rtwdev, c2h->payload, len);
> +               break;
>         default:
>                 rtw_dbg(rtwdev, RTW_DBG_FW, "C2H 0x%x isn't handled\n", c2h->id);
>                 break;
> @@ -367,10 +370,6 @@ void rtw_fw_c2h_cmd_rx_irqsafe(struct rtw_dev *rtwdev, u32 pkt_offset,
>                 rtw_fw_scan_result(rtwdev, c2h->payload, len);
>                 dev_kfree_skb_any(skb);
>                 break;
> -       case C2H_ADAPTIVITY:
> -               rtw_fw_adaptivity_result(rtwdev, c2h->payload, len);
> -               dev_kfree_skb_any(skb);
> -               break;
>         default:
>                 /* pass offset for further operation */
>                 *((u32 *)skb->cb) = pkt_offset;
> diff --git a/drivers/net/wireless/realtek/rtw88/usb.c b/drivers/net/wireless/realtek/rtw88/usb.c
> index 727569d4ef4b..5c2dfa2ced92 100644
> --- a/drivers/net/wireless/realtek/rtw88/usb.c
> +++ b/drivers/net/wireless/realtek/rtw88/usb.c
> @@ -585,7 +585,7 @@ static void rtw_usb_rx_handler(struct work_struct *work)
>                                 goto skip_packet;
>                         }
> 
> -                       skb = alloc_skb(skb_len, GFP_KERNEL);
> +                       skb = alloc_skb(skb_len, GFP_ATOMIC);
>                         if (!skb) {
>                                 rtw_dbg(rtwdev, RTW_DBG_USB,
>                                         "failed to allocate RX skb of size %u\n",
> @@ -612,7 +612,11 @@ static void rtw_usb_rx_handler(struct work_struct *work)
>                         rx_desc += next_pkt;
>                 } while (rx_desc + pkt_desc_sz < rx_skb->data + rx_skb->len);
> 
> -               dev_kfree_skb_any(rx_skb);
> +               if (skb_queue_len(&rtwusb->rx_free_queue) >=
> +                   RTW_USB_RX_SKB_NUM - RTW_USB_RXCB_NUM)
> +                       dev_kfree_skb_any(rx_skb);
> +               else
> +                       skb_queue_tail(&rtwusb->rx_free_queue, rx_skb);

Why not just queue and reuse rx_skb for all? That would be simpler. 

>         }
>  }
> 
> @@ -621,23 +625,57 @@ static void rtw_usb_read_port_complete(struct urb *urb);
>  static void rtw_usb_rx_resubmit(struct rtw_usb *rtwusb, struct rx_usb_ctrl_block *rxcb)
>  {
>         struct rtw_dev *rtwdev = rtwusb->rtwdev;
> +       struct sk_buff *rx_skb;
> +       gfp_t priority = in_interrupt() ? GFP_ATOMIC : GFP_KERNEL;

I remember in_interrupt() is not recommended. Can't we pass necessary gfp_t
via function argument by callers?

>         int error;
> 
> -       rxcb->rx_skb = alloc_skb(RTW_USB_MAX_RECVBUF_SZ, GFP_ATOMIC);
> -       if (!rxcb->rx_skb)
> -               return;
> +       rx_skb = skb_dequeue(&rtwusb->rx_free_queue);
> +       if (!rx_skb)
> +               rx_skb = alloc_skb(RTW_USB_MAX_RECVBUF_SZ, priority);
> +
> +       if (!rx_skb)
> +               goto try_later;
> +
> +       rxcb->rx_skb = rx_skb;
> +
> +       skb_reset_tail_pointer(rxcb->rx_skb);
> +       rxcb->rx_skb->len = 0;

How about moving these initialization upward before ' rxcb->rx_skb = rx_skb;'?
Statements can be shorter, and it is reasonable to initialize data before
assignment. 

> 
>         usb_fill_bulk_urb(rxcb->rx_urb, rtwusb->udev,
>                           usb_rcvbulkpipe(rtwusb->udev, rtwusb->pipe_in),
>                           rxcb->rx_skb->data, RTW_USB_MAX_RECVBUF_SZ,
>                           rtw_usb_read_port_complete, rxcb);
> 
> -       error = usb_submit_urb(rxcb->rx_urb, GFP_ATOMIC);
> +       error = usb_submit_urb(rxcb->rx_urb, priority);

Not sure if 'priority' fits the meaning. Maybe many kernel code just
uses 'gfp'.

>         if (error) {
> -               kfree_skb(rxcb->rx_skb);
> +               skb_queue_tail(&rtwusb->rx_free_queue, rxcb->rx_skb);
> +
>                 if (error != -ENODEV)
>                         rtw_err(rtwdev, "Err sending rx data urb %d\n",
>                                 error);

Since here rxcb->rx_skb != NULL, will it be a problem? The rxcb will not be
submitted again? Should all error cases go to try_later label?


> +
> +               if (error == -ENOMEM)
> +                       goto try_later;
> +       }
> +
> +       return;
> +
> +try_later:
> +       rxcb->rx_skb = NULL;
> +       queue_work(rtwusb->rxwq, &rtwusb->rx_urb_work);
> +}
> +
> +static void rtw_usb_rx_resubmit_work(struct work_struct *work)
> +{
> +       struct rtw_usb *rtwusb = container_of(work, struct rtw_usb, rx_urb_work);
> +       struct rx_usb_ctrl_block *rxcb;
> +       int i;
> +
> +       for (i = 0; i < RTW_USB_RXCB_NUM; i++) {
> +               rxcb = &rtwusb->rx_cb[i];
> +
> +               if (!rxcb->rx_skb)
> +                       rtw_usb_rx_resubmit(rtwusb, rxcb);
>         }
>  }
> 
> @@ -653,8 +691,7 @@ static void rtw_usb_read_port_complete(struct urb *urb)
>                     urb->actual_length < 24) {
>                         rtw_err(rtwdev, "failed to get urb length:%d\n",
>                                 urb->actual_length);
> -                       if (skb)
> -                               dev_kfree_skb_any(skb);
> +                       skb_queue_tail(&rtwusb->rx_free_queue, skb);
>                 } else {
>                         skb_put(skb, urb->actual_length);
>                         skb_queue_tail(&rtwusb->rx_queue, skb);
> @@ -662,6 +699,8 @@ static void rtw_usb_read_port_complete(struct urb *urb)
>                 }
>                 rtw_usb_rx_resubmit(rtwusb, rxcb);
>         } else {
> +               skb_queue_tail(&rtwusb->rx_free_queue, skb);
> +
>                 switch (urb->status) {
>                 case -EINVAL:
>                 case -EPIPE:
> @@ -679,8 +718,6 @@ static void rtw_usb_read_port_complete(struct urb *urb)
>                         rtw_err(rtwdev, "status %d\n", urb->status);
>                         break;
>                 }
> -               if (skb)
> -                       dev_kfree_skb_any(skb);
>         }
>  }
> 
> @@ -868,16 +905,26 @@ static struct rtw_hci_ops rtw_usb_ops = {
>  static int rtw_usb_init_rx(struct rtw_dev *rtwdev)
>  {
>         struct rtw_usb *rtwusb = rtw_get_usb_priv(rtwdev);
> +       struct sk_buff *rx_skb;
> +       int i;
> 
> -       rtwusb->rxwq = create_singlethread_workqueue("rtw88_usb: rx wq");
> +       rtwusb->rxwq = alloc_workqueue("rtw88_usb: rx wq", WQ_BH, 0);
>         if (!rtwusb->rxwq) {
>                 rtw_err(rtwdev, "failed to create RX work queue\n");
>                 return -ENOMEM;
>         }
> 
>         skb_queue_head_init(&rtwusb->rx_queue);
> +       skb_queue_head_init(&rtwusb->rx_free_queue);
> 
>         INIT_WORK(&rtwusb->rx_work, rtw_usb_rx_handler);
> +       INIT_WORK(&rtwusb->rx_urb_work, rtw_usb_rx_resubmit_work);
> +
> +       for (i = 0; i < RTW_USB_RX_SKB_NUM; i++) {
> +               rx_skb = alloc_skb(RTW_USB_MAX_RECVBUF_SZ, GFP_KERNEL);
> +               if (rx_skb)
> +                       skb_queue_tail(&rtwusb->rx_free_queue, rx_skb);
> +       }
> 
>         return 0;
>  }
> @@ -902,6 +949,8 @@ static void rtw_usb_deinit_rx(struct rtw_dev *rtwdev)
> 
>         flush_workqueue(rtwusb->rxwq);
>         destroy_workqueue(rtwusb->rxwq);
> +
> +       skb_queue_purge(&rtwusb->rx_free_queue);
>  }
> 
>  static int rtw_usb_init_tx(struct rtw_dev *rtwdev)
> diff --git a/drivers/net/wireless/realtek/rtw88/usb.h b/drivers/net/wireless/realtek/rtw88/usb.h
> index 86697a5c0103..9b695b688b24 100644
> --- a/drivers/net/wireless/realtek/rtw88/usb.h
> +++ b/drivers/net/wireless/realtek/rtw88/usb.h
> @@ -38,6 +38,7 @@
>  #define RTW_USB_RXAGG_TIMEOUT          10
> 
>  #define RTW_USB_RXCB_NUM               4
> +#define RTW_USB_RX_SKB_NUM             8
> 
>  #define RTW_USB_EP_MAX                 4
> 
> @@ -81,7 +82,9 @@ struct rtw_usb {
> 
>         struct rx_usb_ctrl_block rx_cb[RTW_USB_RXCB_NUM];
>         struct sk_buff_head rx_queue;
> +       struct sk_buff_head rx_free_queue;
>         struct work_struct rx_work;
> +       struct work_struct rx_urb_work;
>  };
> 
>  static inline struct rtw_usb_tx_data *rtw_usb_get_tx_data(struct sk_buff *skb)
> --
> 2.47.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] wifi: rtw88: usb: Copy instead of cloning the RX skb
  2024-11-18  9:16 ` [PATCH 1/2] wifi: rtw88: usb: Copy instead of cloning the RX skb Ping-Ke Shih
@ 2024-11-25 23:48   ` Bitterblue Smith
  2024-11-26  1:17     ` Ping-Ke Shih
  0 siblings, 1 reply; 10+ messages in thread
From: Bitterblue Smith @ 2024-11-25 23:48 UTC (permalink / raw)
  To: Ping-Ke Shih, linux-wireless@vger.kernel.org; +Cc: Sascha Hauer

On 18/11/2024 11:16, Ping-Ke Shih wrote:
> Bitterblue Smith <rtl8821cerfe2@gmail.com> wrote:
>> "iperf3 -c 192.168.0.1 -R --udp -b 0" shows about 40% of datagrams
>> are lost. Many torrents don't download faster than 3 MiB/s, probably
>> because the Bittorrent protocol uses UDP. This is somehow related to
>> the use of skb_clone() in the RX path.
> 
> Using skb_clone() can improve throughput is weird to me too. Do you check
> top with 100% cpu usage?
> 

I checked the CPU usage now and the results are interesting. In short,
patch 1/2 slightly raises the CPU usage, and patch 2/2 lowers it a lot.

I used this command: "top -b -d 20 -n 2" while running iperf3 in
reverse mode. During every test the RX speed was approximately 490 Mbps.
I tested with RTL8812AU. I used the numbers from the second "frame"
printed by each top command. I added up all the processes that had CPU
usage of at least 0.1%. I removed processes that I think are irrelevant
(kwin_wayland, konsole, weechat, plasmashell).

Before patch 1/2:

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  21192 asdf      20   0   17988   3468   2828 S  26.6   0.0   0:05.90 iperf3
     34 root      20   0       0      0      0 S   8.7   0.0   1:06.89 ksoftirqd/1
     17 root      20   0       0      0      0 S   4.7   0.0   0:44.58 ksoftirqd/0
     40 root      20   0       0      0      0 S   3.9   0.0   0:25.15 ksoftirqd/3
     28 root      20   0       0      0      0 R   3.7   0.0   0:31.46 ksoftirqd/2
  20964 root      20   0       0      0      0 R   3.5   0.0   0:01.21 kworker/u16:1-rtw88_usb: tx wq
  20716 root      20   0       0      0      0 I   3.4   0.0   0:02.74 kworker/u16:3-flush-259:0
  11253 root      20   0       0      0      0 I   2.9   0.0   0:43.42 kworker/u16:4-events_unbound
  20572 root       0 -20       0      0      0 I   2.7   0.0   0:03.05 kworker/u17:0-rtw_tx_wq
  21194 root       0 -20       0      0      0 I   2.7   0.0   0:00.59 kworker/u17:1-rtw_tx_wq
                                                  62.8

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  21798 asdf      20   0   17988   3408   2768 S  33.7   0.0   0:08.41 iperf3
     34 root      20   0       0      0      0 S   7.5   0.0   1:21.17 ksoftirqd/1
     17 root      20   0       0      0      0 S   5.8   0.0   0:54.08 ksoftirqd/0
     28 root      20   0       0      0      0 S   5.7   0.0   0:38.42 ksoftirqd/2
     40 root      20   0       0      0      0 S   5.5   0.0   0:29.69 ksoftirqd/3
  21261 root      20   0       0      0      0 I   3.0   0.0   0:02.32 kworker/u16:0-rtw88_usb: tx wq
  21237 root       0 -20       0      0      0 I   2.6   0.0   0:00.71 kworker/u17:4-rtw_tx_wq
  20716 root      20   0       0      0      0 I   2.4   0.0   0:05.97 kworker/u16:3-rtw88_usb: rx wq
  20964 root      20   0       0      0      0 I   2.4   0.0   0:04.65 kworker/u16:1-rtw88_usb: rx wq
  21244 root       0 -20       0      0      0 I   0.7   0.0   0:00.24 kworker/u17:12-rtw_tx_wq
                                                  69.3

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  21853 asdf      20   0   17988   3700   2932 S  34.3   0.0   0:08.38 iperf3
     17 root      20   0       0      0      0 S   7.0   0.0   0:55.93 ksoftirqd/0
     34 root      20   0       0      0      0 S   6.8   0.0   1:22.92 ksoftirqd/1
     40 root      20   0       0      0      0 S   5.4   0.0   0:31.05 ksoftirqd/3
     28 root      20   0       0      0      0 S   4.9   0.0   0:39.62 ksoftirqd/2
  21261 root      20   0       0      0      0 I   3.1   0.0   0:03.22 kworker/u16:0-rtw88_usb: rx wq
  21833 root       0 -20       0      0      0 I   2.4   0.0   0:00.67 kworker/u17:0-rtw_tx_wq
  20427 root      20   0       0      0      0 I   2.2   0.0   0:12.37 kworker/u16:2-events_unbound
  21834 root       0 -20       0      0      0 I   1.1   0.0   0:00.30 kworker/u17:2-rtw_tx_wq
                                                  67.2

With patch 1/2:

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  22222 asdf      20   0   17988   3804   3036 S  29.4   0.0   0:08.38 iperf3
     34 root      20   0       0      0      0 R  10.3   0.0   1:26.13 ksoftirqd/1
     28 root      20   0       0      0      0 R   7.9   0.0   0:42.00 ksoftirqd/2
  20964 root      20   0       0      0      0 I   5.8   0.0   0:06.66 kworker/u16:1-rtw88_usb: rx wq
     17 root      20   0       0      0      0 S   5.0   0.0   0:57.49 ksoftirqd/0
  21261 root      20   0       0      0      0 I   4.2   0.0   0:04.55 kworker/u16:0-rtw88_usb: tx wq
     40 root      20   0       0      0      0 S   4.1   0.0   0:32.39 ksoftirqd/3
  20716 root      20   0       0      0      0 I   2.8   0.0   0:08.07 kworker/u16:3-rtw88_usb: tx wq
  21237 root       0 -20       0      0      0 I   2.5   0.0   0:01.32 kworker/u17:4-rtw_tx_wq
  21834 root       0 -20       0      0      0 I   1.0   0.0   0:00.61 kworker/u17:2-rtw_tx_wq
  21225 root      20   0       0      0      0 I   0.1   0.0   0:00.33 kworker/2:2-events
                                                  73.1

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  22265 asdf      20   0   17988   3596   2828 S  30.5   0.0   0:08.17 iperf3
     34 root      20   0       0      0      0 S  10.5   0.0   1:28.96 ksoftirqd/1
     28 root      20   0       0      0      0 S   9.1   0.0   0:44.54 ksoftirqd/2
  20427 root      20   0       0      0      0 I   5.9   0.0   0:13.72 kworker/u16:2-rtw88_usb: rx wq
  20716 root      20   0       0      0      0 I   5.7   0.0   0:09.78 kworker/u16:3-rtw88_usb: tx wq
     17 root      20   0       0      0      0 S   3.9   0.0   0:58.72 ksoftirqd/0
     40 root      20   0       0      0      0 S   3.5   0.0   0:33.53 ksoftirqd/3
  21833 root       0 -20       0      0      0 I   3.2   0.0   0:01.76 kworker/u17:0-rtw_tx_wq
  20964 root      20   0       0      0      0 I   1.1   0.0   0:07.34 kworker/u16:1-flush-259:0
  21834 root       0 -20       0      0      0 I   0.4   0.0   0:00.92 kworker/u17:2-rtw_tx_wq
                                                  73.8

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  22303 asdf      20   0   17988   3808   3040 S  31.3   0.0   0:08.58 iperf3
     28 root      20   0       0      0      0 R  10.0   0.0   0:47.50 ksoftirqd/2
     34 root      20   0       0      0      0 S   9.9   0.0   1:31.69 ksoftirqd/1
  20716 root      20   0       0      0      0 I   5.2   0.0   0:11.26 kworker/u16:3-rtw88_usb: rx wq
     17 root      20   0       0      0      0 S   4.0   0.0   0:59.98 ksoftirqd/0
  20427 root      20   0       0      0      0 I   4.0   0.0   0:15.06 kworker/u16:2-rtw88_usb: tx wq
  20964 root      20   0       0      0      0 I   3.8   0.0   0:08.73 kworker/u16:1-rtw88_usb: tx wq
     40 root      20   0       0      0      0 S   2.7   0.0   0:34.49 ksoftirqd/3
  21833 root       0 -20       0      0      0 I   1.9   0.0   0:02.47 kworker/u17:0-rtw_tx_wq
  21834 root       0 -20       0      0      0 I   1.8   0.0   0:01.49 kworker/u17:2-rtw_tx_wq
                                                  74.6


With patches 1/2 and 2/2:

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  23151 asdf      20   0   17988   3656   2888 S  28.9   0.0   0:08.86 iperf3
  22400 root       0 -20       0      0      0 I   2.7   0.0   0:00.86 kworker/u17:6-rtw_tx_wq
  21261 root      20   0       0      0      0 I   2.3   0.0   0:06.47 kworker/u16:0-rtw88_usb: tx wq
  22401 root       0 -20       0      0      0 I   1.3   0.0   0:00.41 kworker/u17:7-rtw_tx_wq
  20427 root      20   0       0      0      0 I   1.2   0.0   0:16.02 kworker/u16:2-rtw88_usb: tx wq
     34 root      20   0       0      0      0 S   0.5   0.0   1:32.23 ksoftirqd/1
  20964 root      20   0       0      0      0 I   0.4   0.0   0:09.30 kworker/u16:1-rtw88_usb: tx wq
  22498 root      20   0       0      0      0 I   0.1   0.0   0:00.07 kworker/1:1-events
                                                  37.4

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  23213 asdf      20   0   17988   3360   2720 S  30.7   0.0   0:08.69 iperf3
  23169 root       0 -20       0      0      0 I   3.3   0.0   0:00.93 kworker/u17:1-rtw_tx_wq
  21261 root      20   0       0      0      0 I   1.5   0.0   0:07.17 kworker/u16:0-rtw88_usb: tx wq
  23219 root      20   0       0      0      0 I   1.0   0.0   0:00.21 kworker/u16:3-rtw88_usb: tx wq
     34 root      20   0       0      0      0 S   0.6   0.0   1:32.59 ksoftirqd/1
  20964 root      20   0       0      0      0 I   0.6   0.0   0:09.46 kworker/u16:1-rtw88_usb: tx wq
  23189 root       0 -20       0      0      0 I   0.6   0.0   0:00.26 kworker/u17:9-rtw_tx_wq
  20427 root      20   0       0      0      0 I   0.5   0.0   0:16.48 kworker/u16:2-rtw88_usb: tx wq
  22283 root      20   0       0      0      0 I   0.1   0.0   0:00.03 kworker/1:0-events
                                                  38.9

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  23266 asdf      20   0   17988   3608   2968 S  27.5   0.0   0:07.55 iperf3
  21833 root       0 -20       0      0      0 R   2.7   0.0   0:03.09 kworker/u17:0+rtw_tx_wq
  20964 root      20   0       0      0      0 I   1.6   0.0   0:10.03 kworker/u16:1-events_unbound
  23272 root      20   0       0      0      0 I   1.5   0.0   0:00.31 kworker/u16:4-rtw88_usb: tx wq
  22401 root       0 -20       0      0      0 I   1.1   0.0   0:00.97 kworker/u17:7-rtw_tx_wq
     34 root      20   0       0      0      0 S   0.5   0.0   1:32.83 ksoftirqd/1
  20427 root      20   0       0      0      0 I   0.5   0.0   0:16.78 kworker/u16:2-rtw88_usb: tx wq
  23228 root      20   0       0      0      0 I   0.1   0.0   0:00.08 kworker/1:2-events
                                                  35.5


>>
>> Don't use skb_clone(). Instead allocate a new skb for each 802.11 frame
>> received and copy the data from the big (32768 byte) skb.
>>
>> With this patch, "iperf3 -c 192.168.0.1 -R --udp -b 0" shows only 1-2%
>> of datagrams are lost, and torrents can reach download speeds of 36
>> MiB/s.
>>
>> Tested with RTL8812AU and RTL8822CU.
>>
>> Signed-off-by: Bitterblue Smith <rtl8821cerfe2@gmail.com>
>> ---
>>  drivers/net/wireless/realtek/rtw88/usb.c | 52 ++++++++++++++----------
>>  1 file changed, 31 insertions(+), 21 deletions(-)
>>
>> diff --git a/drivers/net/wireless/realtek/rtw88/usb.c b/drivers/net/wireless/realtek/rtw88/usb.c
>> index 93ac4837e1b5..727569d4ef4b 100644
>> --- a/drivers/net/wireless/realtek/rtw88/usb.c
>> +++ b/drivers/net/wireless/realtek/rtw88/usb.c
>> @@ -7,6 +7,7 @@
>>  #include <linux/mutex.h>
>>  #include "main.h"
>>  #include "debug.h"
>> +#include "mac.h"
>>  #include "reg.h"
>>  #include "tx.h"
>>  #include "rx.h"
>> @@ -546,49 +547,58 @@ static void rtw_usb_rx_handler(struct work_struct *work)
>>  {
>>         struct rtw_usb *rtwusb = container_of(work, struct rtw_usb, rx_work);
>>         struct rtw_dev *rtwdev = rtwusb->rtwdev;
>> -       const struct rtw_chip_info *chip = rtwdev->chip;
>> -       u32 pkt_desc_sz = chip->rx_pkt_desc_sz;
>>         struct ieee80211_rx_status rx_status;
>> -       u32 pkt_offset, next_pkt, urb_len;
>>         struct rtw_rx_pkt_stat pkt_stat;
>> -       struct sk_buff *next_skb;
>> +       struct sk_buff *rx_skb;
>>         struct sk_buff *skb;
>> +       u32 pkt_desc_sz = rtwdev->chip->rx_pkt_desc_sz;
>> +       u32 max_skb_len = pkt_desc_sz + PHY_STATUS_SIZE * 8 +
>> +                         IEEE80211_MAX_MPDU_LEN_VHT_11454;
>> +       u32 pkt_offset, next_pkt, skb_len;
>>         u8 *rx_desc;
>>         int limit;
>>
>>         for (limit = 0; limit < 200; limit++) {
>> -               skb = skb_dequeue(&rtwusb->rx_queue);
>> -               if (!skb)
>> +               rx_skb = skb_dequeue(&rtwusb->rx_queue);
>> +               if (!rx_skb)
>>                         break;
>>
>>                 if (skb_queue_len(&rtwusb->rx_queue) >= RTW_USB_MAX_RXQ_LEN) {
>>                         dev_dbg_ratelimited(rtwdev->dev, "failed to get rx_queue, overflow\n");
>> -                       dev_kfree_skb_any(skb);
>> +                       dev_kfree_skb_any(rx_skb);
>>                         continue;
>>                 }
>>
>> -               urb_len = skb->len;
>> +               rx_desc = rx_skb->data;
>>
>>                 do {
>> -                       rx_desc = skb->data;
>>                         rtw_rx_query_rx_desc(rtwdev, rx_desc, &pkt_stat,
>>                                              &rx_status);
>>                         pkt_offset = pkt_desc_sz + pkt_stat.drv_info_sz +
>>                                      pkt_stat.shift;
>>
>> -                       next_pkt = round_up(pkt_stat.pkt_len + pkt_offset, 8);
>> +                       skb_len = pkt_stat.pkt_len + pkt_offset;
>> +                       if (skb_len > max_skb_len) {
> 
> This seems a new rule introduced by this patch. Do you really encounter this
> case? Maybe this is the cause of slow download throughput?
> 

No, I never saw this case. It just seemed like a good idea to check the
size passed to alloc_skb. Maybe it's not needed?

>> +                               rtw_dbg(rtwdev, RTW_DBG_USB,
>> +                                       "skipping too big packet: %u\n",
>> +                                       skb_len);
>> +                               goto skip_packet;
>> +                       }
>>
>> -                       if (urb_len >= next_pkt + pkt_desc_sz)
>> -                               next_skb = skb_clone(skb, GFP_KERNEL);
>> -                       else
>> -                               next_skb = NULL;
>> +                       skb = alloc_skb(skb_len, GFP_KERNEL);
>> +                       if (!skb) {
>> +                               rtw_dbg(rtwdev, RTW_DBG_USB,
>> +                                       "failed to allocate RX skb of size %u\n",
>> +                                       skb_len);
>> +                               goto skip_packet;
>> +                       }
>> +
>> +                       skb_put_data(skb, rx_desc, skb_len);
>>
>>                         if (pkt_stat.is_c2h) {
>> -                               skb_trim(skb, pkt_stat.pkt_len + pkt_offset);
>>                                 rtw_fw_c2h_cmd_rx_irqsafe(rtwdev, pkt_offset, skb);
>>                         } else {
>>                                 skb_pull(skb, pkt_offset);
>> -                               skb_trim(skb, pkt_stat.pkt_len);
>>                                 rtw_update_rx_freq_for_invalid(rtwdev, skb,
>>                                                                &rx_status,
>>                                                                &pkt_stat);
>> @@ -597,12 +607,12 @@ static void rtw_usb_rx_handler(struct work_struct *work)
>>                                 ieee80211_rx_irqsafe(rtwdev->hw, skb);
>>                         }
>>
>> -                       skb = next_skb;
>> -                       if (skb)
>> -                               skb_pull(skb, next_pkt);
>> +skip_packet:
>> +                       next_pkt = round_up(skb_len, 8);
>> +                       rx_desc += next_pkt;
>> +               } while (rx_desc + pkt_desc_sz < rx_skb->data + rx_skb->len);
>>
>> -                       urb_len -= next_pkt;
>> -               } while (skb);
>> +               dev_kfree_skb_any(rx_skb);
>>         }
>>  }
>>
>> --
>> 2.47.0
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] wifi: rtw88: usb: Preallocate and reuse the RX skbs
  2024-11-19  0:50   ` Ping-Ke Shih
@ 2024-11-25 23:49     ` Bitterblue Smith
  2024-11-26  1:26       ` Ping-Ke Shih
  0 siblings, 1 reply; 10+ messages in thread
From: Bitterblue Smith @ 2024-11-25 23:49 UTC (permalink / raw)
  To: Ping-Ke Shih, linux-wireless@vger.kernel.org; +Cc: Sascha Hauer

On 19/11/2024 02:50, Ping-Ke Shih wrote:
> Bitterblue Smith <rtl8821cerfe2@gmail.com> wrote:
>> The USB driver uses four USB Request Blocks for RX. Before submitting
>> one, it allocates a 32768 byte skb for the RX data. This allocation can
>> fail, maybe due to temporary memory fragmentation. When the allocation
>> fails, the corresponding URB is never submitted again. After four such
>> allocation failures, all RX stops because the driver is not requesting
>> data from the device anymore.
>>
>> Don't allocate a 32768 byte skb when submitting a USB Request Block
>> (which happens very often). Instead preallocate 8 such skbs, and reuse
>> them over and over. If all 8 are busy, allocate a new one. This is
>> pretty rare. If the allocation fails, use a work to try again later.
>> When there are enough free skbs again, free the excess skbs.
>>
>> Also, use WQ_BH for the RX workqueue. With a normal or high priority
>> workqueue the skbs are processed too slowly when the system is even a
>> little busy, like when opening a new page in a browser, and the driver
>> runs out of free skbs and allocates a lot of new ones.
>>
>> Move C2H_ADAPTIVITY to the c2h workqueue instead of handling it directly
>> from rtw_fw_c2h_cmd_rx_irqsafe(), which runs in the RX workqueue. It
>> reads hardware registers, which is not "irqsafe" with USB.
> 
> This part should be in a separated patch. 
> 

Do you mean just C2H_ADAPTIVITY or WQ_BH as well?

>>
>> This is more or less what the out-of-tree Realtek drivers do, except
>> they use a tasklet instead of a BH workqueue.
>>
>> Tested with RTL8723DU, RTL8821AU, RTL8812AU, RTL8812BU, RTL8822CU,
>> RTL8811CU.
>>
>> Closes: https://lore.kernel.org/linux-wireless/6e7ecb47-7ea0-433a-a19f-05f88a2edf6b@gmail.com/
>> Signed-off-by: Bitterblue Smith <rtl8821cerfe2@gmail.com>
>> ---
>>  drivers/net/wireless/realtek/rtw88/fw.c  |  7 +--
>>  drivers/net/wireless/realtek/rtw88/usb.c | 73 ++++++++++++++++++++----
>>  drivers/net/wireless/realtek/rtw88/usb.h |  3 +
>>  3 files changed, 67 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/net/wireless/realtek/rtw88/fw.c b/drivers/net/wireless/realtek/rtw88/fw.c
>> index e6e9946fbf44..02389b7c6876 100644
>> --- a/drivers/net/wireless/realtek/rtw88/fw.c
>> +++ b/drivers/net/wireless/realtek/rtw88/fw.c
>> @@ -332,6 +332,9 @@ void rtw_fw_c2h_cmd_handle(struct rtw_dev *rtwdev, struct sk_buff *skb)
>>         case C2H_RA_RPT:
>>                 rtw_fw_ra_report_handle(rtwdev, c2h->payload, len);
>>                 break;
>> +       case C2H_ADAPTIVITY:
>> +               rtw_fw_adaptivity_result(rtwdev, c2h->payload, len);
>> +               break;
>>         default:
>>                 rtw_dbg(rtwdev, RTW_DBG_FW, "C2H 0x%x isn't handled\n", c2h->id);
>>                 break;
>> @@ -367,10 +370,6 @@ void rtw_fw_c2h_cmd_rx_irqsafe(struct rtw_dev *rtwdev, u32 pkt_offset,
>>                 rtw_fw_scan_result(rtwdev, c2h->payload, len);
>>                 dev_kfree_skb_any(skb);
>>                 break;
>> -       case C2H_ADAPTIVITY:
>> -               rtw_fw_adaptivity_result(rtwdev, c2h->payload, len);
>> -               dev_kfree_skb_any(skb);
>> -               break;
>>         default:
>>                 /* pass offset for further operation */
>>                 *((u32 *)skb->cb) = pkt_offset;
>> diff --git a/drivers/net/wireless/realtek/rtw88/usb.c b/drivers/net/wireless/realtek/rtw88/usb.c
>> index 727569d4ef4b..5c2dfa2ced92 100644
>> --- a/drivers/net/wireless/realtek/rtw88/usb.c
>> +++ b/drivers/net/wireless/realtek/rtw88/usb.c
>> @@ -585,7 +585,7 @@ static void rtw_usb_rx_handler(struct work_struct *work)
>>                                 goto skip_packet;
>>                         }
>>
>> -                       skb = alloc_skb(skb_len, GFP_KERNEL);
>> +                       skb = alloc_skb(skb_len, GFP_ATOMIC);
>>                         if (!skb) {
>>                                 rtw_dbg(rtwdev, RTW_DBG_USB,
>>                                         "failed to allocate RX skb of size %u\n",
>> @@ -612,7 +612,11 @@ static void rtw_usb_rx_handler(struct work_struct *work)
>>                         rx_desc += next_pkt;
>>                 } while (rx_desc + pkt_desc_sz < rx_skb->data + rx_skb->len);
>>
>> -               dev_kfree_skb_any(rx_skb);
>> +               if (skb_queue_len(&rtwusb->rx_free_queue) >=
>> +                   RTW_USB_RX_SKB_NUM - RTW_USB_RXCB_NUM)
>> +                       dev_kfree_skb_any(rx_skb);
>> +               else
>> +                       skb_queue_tail(&rtwusb->rx_free_queue, rx_skb);
> 
> Why not just queue and reuse rx_skb for all? That would be simpler. 
> 

I didn't want to let it allocate too many skbs. I didn't find any kind
of limit in the official drivers, so maybe it would be fine.

>>         }
>>  }
>>
>> @@ -621,23 +625,57 @@ static void rtw_usb_read_port_complete(struct urb *urb);
>>  static void rtw_usb_rx_resubmit(struct rtw_usb *rtwusb, struct rx_usb_ctrl_block *rxcb)
>>  {
>>         struct rtw_dev *rtwdev = rtwusb->rtwdev;
>> +       struct sk_buff *rx_skb;
>> +       gfp_t priority = in_interrupt() ? GFP_ATOMIC : GFP_KERNEL;
> 
> I remember in_interrupt() is not recommended. Can't we pass necessary gfp_t
> via function argument by callers?
> 

Yes, I can do that.

>>         int error;
>>
>> -       rxcb->rx_skb = alloc_skb(RTW_USB_MAX_RECVBUF_SZ, GFP_ATOMIC);
>> -       if (!rxcb->rx_skb)
>> -               return;
>> +       rx_skb = skb_dequeue(&rtwusb->rx_free_queue);
>> +       if (!rx_skb)
>> +               rx_skb = alloc_skb(RTW_USB_MAX_RECVBUF_SZ, priority);
>> +
>> +       if (!rx_skb)
>> +               goto try_later;
>> +
>> +       rxcb->rx_skb = rx_skb;
>> +
>> +       skb_reset_tail_pointer(rxcb->rx_skb);
>> +       rxcb->rx_skb->len = 0;
> 
> How about moving these initialization upward before ' rxcb->rx_skb = rx_skb;'?
> Statements can be shorter, and it is reasonable to initialize data before
> assignment. 
> 
>>
>>         usb_fill_bulk_urb(rxcb->rx_urb, rtwusb->udev,
>>                           usb_rcvbulkpipe(rtwusb->udev, rtwusb->pipe_in),
>>                           rxcb->rx_skb->data, RTW_USB_MAX_RECVBUF_SZ,
>>                           rtw_usb_read_port_complete, rxcb);
>>
>> -       error = usb_submit_urb(rxcb->rx_urb, GFP_ATOMIC);
>> +       error = usb_submit_urb(rxcb->rx_urb, priority);
> 
> Not sure if 'priority' fits the meaning. Maybe many kernel code just
> uses 'gfp'.
> 
>>         if (error) {
>> -               kfree_skb(rxcb->rx_skb);
>> +               skb_queue_tail(&rtwusb->rx_free_queue, rxcb->rx_skb);
>> +
>>                 if (error != -ENODEV)
>>                         rtw_err(rtwdev, "Err sending rx data urb %d\n",
>>                                 error);
> 
> Since here rxcb->rx_skb != NULL, will it be a problem? The rxcb will not be
> submitted again? Should all error cases go to try_later label?
> 

Right, the rxcb will be submitted again only if the error was ENOMEM
I don't know what other errors can be considered temporary. I never
ran into the case where the error is not ENODEV.

> 
>> +
>> +               if (error == -ENOMEM)
>> +                       goto try_later;
>> +       }
>> +
>> +       return;
>> +
>> +try_later:
>> +       rxcb->rx_skb = NULL;
>> +       queue_work(rtwusb->rxwq, &rtwusb->rx_urb_work);
>> +}
>> +
>> +static void rtw_usb_rx_resubmit_work(struct work_struct *work)
>> +{
>> +       struct rtw_usb *rtwusb = container_of(work, struct rtw_usb, rx_urb_work);
>> +       struct rx_usb_ctrl_block *rxcb;
>> +       int i;
>> +
>> +       for (i = 0; i < RTW_USB_RXCB_NUM; i++) {
>> +               rxcb = &rtwusb->rx_cb[i];
>> +
>> +               if (!rxcb->rx_skb)
>> +                       rtw_usb_rx_resubmit(rtwusb, rxcb);
>>         }
>>  }
>>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [PATCH 1/2] wifi: rtw88: usb: Copy instead of cloning the RX skb
  2024-11-25 23:48   ` Bitterblue Smith
@ 2024-11-26  1:17     ` Ping-Ke Shih
  2024-11-26 20:40       ` Bitterblue Smith
  0 siblings, 1 reply; 10+ messages in thread
From: Ping-Ke Shih @ 2024-11-26  1:17 UTC (permalink / raw)
  To: Bitterblue Smith, linux-wireless@vger.kernel.org; +Cc: Sascha Hauer

Bitterblue Smith <rtl8821cerfe2@gmail.com> wrote:
> On 18/11/2024 11:16, Ping-Ke Shih wrote:
> > Bitterblue Smith <rtl8821cerfe2@gmail.com> wrote:
> >> "iperf3 -c 192.168.0.1 -R --udp -b 0" shows about 40% of datagrams
> >> are lost. Many torrents don't download faster than 3 MiB/s, probably
> >> because the Bittorrent protocol uses UDP. This is somehow related to
> >> the use of skb_clone() in the RX path.
> >
> > Using skb_clone() can improve throughput is weird to me too. Do you check
> > top with 100% cpu usage?
> >
> 
> I checked the CPU usage now and the results are interesting. In short,
> patch 1/2 slightly raises the CPU usage, and patch 2/2 lowers it a lot.

Originally I just wanted to know if throughput is a limit of CPU bound.
Anyway good to know this patchset can improve CPU usage. 

> >>
> >> Don't use skb_clone(). Instead allocate a new skb for each 802.11 frame
> >> received and copy the data from the big (32768 byte) skb.
> >>
> >> With this patch, "iperf3 -c 192.168.0.1 -R --udp -b 0" shows only 1-2%
> >> of datagrams are lost, and torrents can reach download speeds of 36
> >> MiB/s.
> >>
> >> Tested with RTL8812AU and RTL8822CU.
> >>
> >> Signed-off-by: Bitterblue Smith <rtl8821cerfe2@gmail.com>
> >> ---
> >>  drivers/net/wireless/realtek/rtw88/usb.c | 52 ++++++++++++++----------
> >>  1 file changed, 31 insertions(+), 21 deletions(-)
> >>
> >> diff --git a/drivers/net/wireless/realtek/rtw88/usb.c b/drivers/net/wireless/realtek/rtw88/usb.c
> >> index 93ac4837e1b5..727569d4ef4b 100644
> >> --- a/drivers/net/wireless/realtek/rtw88/usb.c
> >> +++ b/drivers/net/wireless/realtek/rtw88/usb.c
> >> @@ -7,6 +7,7 @@
> >>  #include <linux/mutex.h>
> >>  #include "main.h"
> >>  #include "debug.h"
> >> +#include "mac.h"
> >>  #include "reg.h"
> >>  #include "tx.h"
> >>  #include "rx.h"
> >> @@ -546,49 +547,58 @@ static void rtw_usb_rx_handler(struct work_struct *work)
> >>  {
> >>         struct rtw_usb *rtwusb = container_of(work, struct rtw_usb, rx_work);
> >>         struct rtw_dev *rtwdev = rtwusb->rtwdev;
> >> -       const struct rtw_chip_info *chip = rtwdev->chip;
> >> -       u32 pkt_desc_sz = chip->rx_pkt_desc_sz;
> >>         struct ieee80211_rx_status rx_status;
> >> -       u32 pkt_offset, next_pkt, urb_len;
> >>         struct rtw_rx_pkt_stat pkt_stat;
> >> -       struct sk_buff *next_skb;
> >> +       struct sk_buff *rx_skb;
> >>         struct sk_buff *skb;
> >> +       u32 pkt_desc_sz = rtwdev->chip->rx_pkt_desc_sz;
> >> +       u32 max_skb_len = pkt_desc_sz + PHY_STATUS_SIZE * 8 +
> >> +                         IEEE80211_MAX_MPDU_LEN_VHT_11454;
> >> +       u32 pkt_offset, next_pkt, skb_len;
> >>         u8 *rx_desc;
> >>         int limit;
> >>
> >>         for (limit = 0; limit < 200; limit++) {
> >> -               skb = skb_dequeue(&rtwusb->rx_queue);
> >> -               if (!skb)
> >> +               rx_skb = skb_dequeue(&rtwusb->rx_queue);
> >> +               if (!rx_skb)
> >>                         break;
> >>
> >>                 if (skb_queue_len(&rtwusb->rx_queue) >= RTW_USB_MAX_RXQ_LEN) {
> >>                         dev_dbg_ratelimited(rtwdev->dev, "failed to get rx_queue, overflow\n");
> >> -                       dev_kfree_skb_any(skb);
> >> +                       dev_kfree_skb_any(rx_skb);
> >>                         continue;
> >>                 }
> >>
> >> -               urb_len = skb->len;
> >> +               rx_desc = rx_skb->data;
> >>
> >>                 do {
> >> -                       rx_desc = skb->data;
> >>                         rtw_rx_query_rx_desc(rtwdev, rx_desc, &pkt_stat,
> >>                                              &rx_status);
> >>                         pkt_offset = pkt_desc_sz + pkt_stat.drv_info_sz +
> >>                                      pkt_stat.shift;
> >>
> >> -                       next_pkt = round_up(pkt_stat.pkt_len + pkt_offset, 8);
> >> +                       skb_len = pkt_stat.pkt_len + pkt_offset;
> >> +                       if (skb_len > max_skb_len) {
> >
> > This seems a new rule introduced by this patch. Do you really encounter this
> > case? Maybe this is the cause of slow download throughput?
> >
> 
> No, I never saw this case. It just seemed like a good idea to check the
> size passed to alloc_skb. Maybe it's not needed?

I think it is fine. 

Asking some questions before, I just tried to find a cause why 40% datagram get
lost as you mentioned in commit message, but I can't. 

> 
> >> +                               rtw_dbg(rtwdev, RTW_DBG_USB,
> >> +                                       "skipping too big packet: %u\n",
> >> +                                       skb_len);
> >> +                               goto skip_packet;
> >> +                       }
> >>
> >> -                       if (urb_len >= next_pkt + pkt_desc_sz)
> >> -                               next_skb = skb_clone(skb, GFP_KERNEL);
> >> -                       else
> >> -                               next_skb = NULL;
> >> +                       skb = alloc_skb(skb_len, GFP_KERNEL);
> >> +                       if (!skb) {
> >> +                               rtw_dbg(rtwdev, RTW_DBG_USB,
> >> +                                       "failed to allocate RX skb of size %u\n",
> >> +                                       skb_len);
> >> +                               goto skip_packet;
> >> +                       }
> >> +
> >> +                       skb_put_data(skb, rx_desc, skb_len);
> >>
> >>                         if (pkt_stat.is_c2h) {
> >> -                               skb_trim(skb, pkt_stat.pkt_len + pkt_offset);
> >>                                 rtw_fw_c2h_cmd_rx_irqsafe(rtwdev, pkt_offset, skb);
> >>                         } else {
> >>                                 skb_pull(skb, pkt_offset);
> >> -                               skb_trim(skb, pkt_stat.pkt_len);
> >>                                 rtw_update_rx_freq_for_invalid(rtwdev, skb,
> >>                                                                &rx_status,
> >>                                                                &pkt_stat);
> >> @@ -597,12 +607,12 @@ static void rtw_usb_rx_handler(struct work_struct *work)
> >>                                 ieee80211_rx_irqsafe(rtwdev->hw, skb);
> >>                         }
> >>
> >> -                       skb = next_skb;
> >> -                       if (skb)
> >> -                               skb_pull(skb, next_pkt);
> >> +skip_packet:
> >> +                       next_pkt = round_up(skb_len, 8);
> >> +                       rx_desc += next_pkt;
> >> +               } while (rx_desc + pkt_desc_sz < rx_skb->data + rx_skb->len);
> >>
> >> -                       urb_len -= next_pkt;
> >> -               } while (skb);
> >> +               dev_kfree_skb_any(rx_skb);
> >>         }
> >>  }
> >>
> >> --
> >> 2.47.0
> >


^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [PATCH 2/2] wifi: rtw88: usb: Preallocate and reuse the RX skbs
  2024-11-25 23:49     ` Bitterblue Smith
@ 2024-11-26  1:26       ` Ping-Ke Shih
  2024-12-11 18:07         ` Bitterblue Smith
  0 siblings, 1 reply; 10+ messages in thread
From: Ping-Ke Shih @ 2024-11-26  1:26 UTC (permalink / raw)
  To: Bitterblue Smith, linux-wireless@vger.kernel.org; +Cc: Sascha Hauer

Bitterblue Smith <rtl8821cerfe2@gmail.com> wrote:
> On 19/11/2024 02:50, Ping-Ke Shih wrote:
> > Bitterblue Smith <rtl8821cerfe2@gmail.com> wrote:
> >> The USB driver uses four USB Request Blocks for RX. Before submitting
> >> one, it allocates a 32768 byte skb for the RX data. This allocation can
> >> fail, maybe due to temporary memory fragmentation. When the allocation
> >> fails, the corresponding URB is never submitted again. After four such
> >> allocation failures, all RX stops because the driver is not requesting
> >> data from the device anymore.
> >>
> >> Don't allocate a 32768 byte skb when submitting a USB Request Block
> >> (which happens very often). Instead preallocate 8 such skbs, and reuse
> >> them over and over. If all 8 are busy, allocate a new one. This is
> >> pretty rare. If the allocation fails, use a work to try again later.
> >> When there are enough free skbs again, free the excess skbs.
> >>
> >> Also, use WQ_BH for the RX workqueue. With a normal or high priority
> >> workqueue the skbs are processed too slowly when the system is even a
> >> little busy, like when opening a new page in a browser, and the driver
> >> runs out of free skbs and allocates a lot of new ones.
> >>
> >> Move C2H_ADAPTIVITY to the c2h workqueue instead of handling it directly
> >> from rtw_fw_c2h_cmd_rx_irqsafe(), which runs in the RX workqueue. It
> >> reads hardware registers, which is not "irqsafe" with USB.
> >
> > This part should be in a separated patch.
> >
> 
> Do you mean just C2H_ADAPTIVITY or WQ_BH as well?

Just C2H_ADAPTIVITY.

With patch subject, people can't understand this change. 

> 
> >>
> >> This is more or less what the out-of-tree Realtek drivers do, except
> >> they use a tasklet instead of a BH workqueue.
> >>
> >> Tested with RTL8723DU, RTL8821AU, RTL8812AU, RTL8812BU, RTL8822CU,
> >> RTL8811CU.
> >>
> >> Closes: https://lore.kernel.org/linux-wireless/6e7ecb47-7ea0-433a-a19f-05f88a2edf6b@gmail.com/
> >> Signed-off-by: Bitterblue Smith <rtl8821cerfe2@gmail.com>
> >> ---
> >>  drivers/net/wireless/realtek/rtw88/fw.c  |  7 +--
> >>  drivers/net/wireless/realtek/rtw88/usb.c | 73 ++++++++++++++++++++----
> >>  drivers/net/wireless/realtek/rtw88/usb.h |  3 +
> >>  3 files changed, 67 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/drivers/net/wireless/realtek/rtw88/fw.c b/drivers/net/wireless/realtek/rtw88/fw.c
> >> index e6e9946fbf44..02389b7c6876 100644
> >> --- a/drivers/net/wireless/realtek/rtw88/fw.c
> >> +++ b/drivers/net/wireless/realtek/rtw88/fw.c
> >> @@ -332,6 +332,9 @@ void rtw_fw_c2h_cmd_handle(struct rtw_dev *rtwdev, struct sk_buff *skb)
> >>         case C2H_RA_RPT:
> >>                 rtw_fw_ra_report_handle(rtwdev, c2h->payload, len);
> >>                 break;
> >> +       case C2H_ADAPTIVITY:
> >> +               rtw_fw_adaptivity_result(rtwdev, c2h->payload, len);
> >> +               break;
> >>         default:
> >>                 rtw_dbg(rtwdev, RTW_DBG_FW, "C2H 0x%x isn't handled\n", c2h->id);
> >>                 break;
> >> @@ -367,10 +370,6 @@ void rtw_fw_c2h_cmd_rx_irqsafe(struct rtw_dev *rtwdev, u32 pkt_offset,
> >>                 rtw_fw_scan_result(rtwdev, c2h->payload, len);
> >>                 dev_kfree_skb_any(skb);
> >>                 break;
> >> -       case C2H_ADAPTIVITY:
> >> -               rtw_fw_adaptivity_result(rtwdev, c2h->payload, len);
> >> -               dev_kfree_skb_any(skb);
> >> -               break;
> >>         default:
> >>                 /* pass offset for further operation */
> >>                 *((u32 *)skb->cb) = pkt_offset;
> >> diff --git a/drivers/net/wireless/realtek/rtw88/usb.c b/drivers/net/wireless/realtek/rtw88/usb.c
> >> index 727569d4ef4b..5c2dfa2ced92 100644
> >> --- a/drivers/net/wireless/realtek/rtw88/usb.c
> >> +++ b/drivers/net/wireless/realtek/rtw88/usb.c
> >> @@ -585,7 +585,7 @@ static void rtw_usb_rx_handler(struct work_struct *work)
> >>                                 goto skip_packet;
> >>                         }
> >>
> >> -                       skb = alloc_skb(skb_len, GFP_KERNEL);
> >> +                       skb = alloc_skb(skb_len, GFP_ATOMIC);
> >>                         if (!skb) {
> >>                                 rtw_dbg(rtwdev, RTW_DBG_USB,
> >>                                         "failed to allocate RX skb of size %u\n",
> >> @@ -612,7 +612,11 @@ static void rtw_usb_rx_handler(struct work_struct *work)
> >>                         rx_desc += next_pkt;
> >>                 } while (rx_desc + pkt_desc_sz < rx_skb->data + rx_skb->len);
> >>
> >> -               dev_kfree_skb_any(rx_skb);
> >> +               if (skb_queue_len(&rtwusb->rx_free_queue) >=
> >> +                   RTW_USB_RX_SKB_NUM - RTW_USB_RXCB_NUM)
> >> +                       dev_kfree_skb_any(rx_skb);
> >> +               else
> >> +                       skb_queue_tail(&rtwusb->rx_free_queue, rx_skb);
> >
> > Why not just queue and reuse rx_skb for all? That would be simpler.
> >
> 
> I didn't want to let it allocate too many skbs. I didn't find any kind
> of limit in the official drivers, so maybe it would be fine.

The case ' skb_queue_len(&rtwusb->rx_free_queue) >= 8 - 4' is rare? 
If so, I think this is fine. If not, to repeatedly allocate and free 
would cause memory fragment, and higher probability to failed to allocate
memory with GFP_ATOMIC. Also seemingly additional 4 persistent rx_skb
is not a big cost. 



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] wifi: rtw88: usb: Copy instead of cloning the RX skb
  2024-11-26  1:17     ` Ping-Ke Shih
@ 2024-11-26 20:40       ` Bitterblue Smith
  0 siblings, 0 replies; 10+ messages in thread
From: Bitterblue Smith @ 2024-11-26 20:40 UTC (permalink / raw)
  To: Ping-Ke Shih, linux-wireless@vger.kernel.org; +Cc: Sascha Hauer

On 26/11/2024 03:17, Ping-Ke Shih wrote:
> Bitterblue Smith <rtl8821cerfe2@gmail.com> wrote:
>> On 18/11/2024 11:16, Ping-Ke Shih wrote:
>>> Bitterblue Smith <rtl8821cerfe2@gmail.com> wrote:
>>>> "iperf3 -c 192.168.0.1 -R --udp -b 0" shows about 40% of datagrams
>>>> are lost. Many torrents don't download faster than 3 MiB/s, probably
>>>> because the Bittorrent protocol uses UDP. This is somehow related to
>>>> the use of skb_clone() in the RX path.
>>>
>>> Using skb_clone() can improve throughput is weird to me too. Do you check
>>> top with 100% cpu usage?
>>>
>>
>> I checked the CPU usage now and the results are interesting. In short,
>> patch 1/2 slightly raises the CPU usage, and patch 2/2 lowers it a lot.
> 
> Originally I just wanted to know if throughput is a limit of CPU bound.
> Anyway good to know this patchset can improve CPU usage. 
> 
>>>>
>>>> Don't use skb_clone(). Instead allocate a new skb for each 802.11 frame
>>>> received and copy the data from the big (32768 byte) skb.
>>>>
>>>> With this patch, "iperf3 -c 192.168.0.1 -R --udp -b 0" shows only 1-2%
>>>> of datagrams are lost, and torrents can reach download speeds of 36
>>>> MiB/s.
>>>>
>>>> Tested with RTL8812AU and RTL8822CU.
>>>>
>>>> Signed-off-by: Bitterblue Smith <rtl8821cerfe2@gmail.com>
>>>> ---
>>>>  drivers/net/wireless/realtek/rtw88/usb.c | 52 ++++++++++++++----------
>>>>  1 file changed, 31 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/drivers/net/wireless/realtek/rtw88/usb.c b/drivers/net/wireless/realtek/rtw88/usb.c
>>>> index 93ac4837e1b5..727569d4ef4b 100644
>>>> --- a/drivers/net/wireless/realtek/rtw88/usb.c
>>>> +++ b/drivers/net/wireless/realtek/rtw88/usb.c
>>>> @@ -7,6 +7,7 @@
>>>>  #include <linux/mutex.h>
>>>>  #include "main.h"
>>>>  #include "debug.h"
>>>> +#include "mac.h"
>>>>  #include "reg.h"
>>>>  #include "tx.h"
>>>>  #include "rx.h"
>>>> @@ -546,49 +547,58 @@ static void rtw_usb_rx_handler(struct work_struct *work)
>>>>  {
>>>>         struct rtw_usb *rtwusb = container_of(work, struct rtw_usb, rx_work);
>>>>         struct rtw_dev *rtwdev = rtwusb->rtwdev;
>>>> -       const struct rtw_chip_info *chip = rtwdev->chip;
>>>> -       u32 pkt_desc_sz = chip->rx_pkt_desc_sz;
>>>>         struct ieee80211_rx_status rx_status;
>>>> -       u32 pkt_offset, next_pkt, urb_len;
>>>>         struct rtw_rx_pkt_stat pkt_stat;
>>>> -       struct sk_buff *next_skb;
>>>> +       struct sk_buff *rx_skb;
>>>>         struct sk_buff *skb;
>>>> +       u32 pkt_desc_sz = rtwdev->chip->rx_pkt_desc_sz;
>>>> +       u32 max_skb_len = pkt_desc_sz + PHY_STATUS_SIZE * 8 +
>>>> +                         IEEE80211_MAX_MPDU_LEN_VHT_11454;
>>>> +       u32 pkt_offset, next_pkt, skb_len;
>>>>         u8 *rx_desc;
>>>>         int limit;
>>>>
>>>>         for (limit = 0; limit < 200; limit++) {
>>>> -               skb = skb_dequeue(&rtwusb->rx_queue);
>>>> -               if (!skb)
>>>> +               rx_skb = skb_dequeue(&rtwusb->rx_queue);
>>>> +               if (!rx_skb)
>>>>                         break;
>>>>
>>>>                 if (skb_queue_len(&rtwusb->rx_queue) >= RTW_USB_MAX_RXQ_LEN) {
>>>>                         dev_dbg_ratelimited(rtwdev->dev, "failed to get rx_queue, overflow\n");
>>>> -                       dev_kfree_skb_any(skb);
>>>> +                       dev_kfree_skb_any(rx_skb);
>>>>                         continue;
>>>>                 }
>>>>
>>>> -               urb_len = skb->len;
>>>> +               rx_desc = rx_skb->data;
>>>>
>>>>                 do {
>>>> -                       rx_desc = skb->data;
>>>>                         rtw_rx_query_rx_desc(rtwdev, rx_desc, &pkt_stat,
>>>>                                              &rx_status);
>>>>                         pkt_offset = pkt_desc_sz + pkt_stat.drv_info_sz +
>>>>                                      pkt_stat.shift;
>>>>
>>>> -                       next_pkt = round_up(pkt_stat.pkt_len + pkt_offset, 8);
>>>> +                       skb_len = pkt_stat.pkt_len + pkt_offset;
>>>> +                       if (skb_len > max_skb_len) {
>>>
>>> This seems a new rule introduced by this patch. Do you really encounter this
>>> case? Maybe this is the cause of slow download throughput?
>>>
>>
>> No, I never saw this case. It just seemed like a good idea to check the
>> size passed to alloc_skb. Maybe it's not needed?
> 
> I think it is fine. 
> 
> Asking some questions before, I just tried to find a cause why 40% datagram get
> lost as you mentioned in commit message, but I can't. 
> 

By the way, I saw 30-40% datagrams lost with RTL8822CE as well. But the
PCI driver is not using skb_clone so I'm not sure what is going on there.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] wifi: rtw88: usb: Preallocate and reuse the RX skbs
  2024-11-26  1:26       ` Ping-Ke Shih
@ 2024-12-11 18:07         ` Bitterblue Smith
  0 siblings, 0 replies; 10+ messages in thread
From: Bitterblue Smith @ 2024-12-11 18:07 UTC (permalink / raw)
  To: Ping-Ke Shih, linux-wireless@vger.kernel.org; +Cc: Sascha Hauer

On 26/11/2024 03:26, Ping-Ke Shih wrote:
> Bitterblue Smith <rtl8821cerfe2@gmail.com> wrote:
>> On 19/11/2024 02:50, Ping-Ke Shih wrote:
>>> Bitterblue Smith <rtl8821cerfe2@gmail.com> wrote:
>>>> The USB driver uses four USB Request Blocks for RX. Before submitting
>>>> one, it allocates a 32768 byte skb for the RX data. This allocation can
>>>> fail, maybe due to temporary memory fragmentation. When the allocation
>>>> fails, the corresponding URB is never submitted again. After four such
>>>> allocation failures, all RX stops because the driver is not requesting
>>>> data from the device anymore.
>>>>
>>>> Don't allocate a 32768 byte skb when submitting a USB Request Block
>>>> (which happens very often). Instead preallocate 8 such skbs, and reuse
>>>> them over and over. If all 8 are busy, allocate a new one. This is
>>>> pretty rare. If the allocation fails, use a work to try again later.
>>>> When there are enough free skbs again, free the excess skbs.
>>>>
>>>> Also, use WQ_BH for the RX workqueue. With a normal or high priority
>>>> workqueue the skbs are processed too slowly when the system is even a
>>>> little busy, like when opening a new page in a browser, and the driver
>>>> runs out of free skbs and allocates a lot of new ones.
>>>>
>>>> Move C2H_ADAPTIVITY to the c2h workqueue instead of handling it directly
>>>> from rtw_fw_c2h_cmd_rx_irqsafe(), which runs in the RX workqueue. It
>>>> reads hardware registers, which is not "irqsafe" with USB.
>>>
>>> This part should be in a separated patch.
>>>
>>
>> Do you mean just C2H_ADAPTIVITY or WQ_BH as well?
> 
> Just C2H_ADAPTIVITY.
> 
> With patch subject, people can't understand this change. 
> 
>>
>>>>
>>>> This is more or less what the out-of-tree Realtek drivers do, except
>>>> they use a tasklet instead of a BH workqueue.
>>>>
>>>> Tested with RTL8723DU, RTL8821AU, RTL8812AU, RTL8812BU, RTL8822CU,
>>>> RTL8811CU.
>>>>
>>>> Closes: https://lore.kernel.org/linux-wireless/6e7ecb47-7ea0-433a-a19f-05f88a2edf6b@gmail.com/
>>>> Signed-off-by: Bitterblue Smith <rtl8821cerfe2@gmail.com>
>>>> ---
>>>>  drivers/net/wireless/realtek/rtw88/fw.c  |  7 +--
>>>>  drivers/net/wireless/realtek/rtw88/usb.c | 73 ++++++++++++++++++++----
>>>>  drivers/net/wireless/realtek/rtw88/usb.h |  3 +
>>>>  3 files changed, 67 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/drivers/net/wireless/realtek/rtw88/fw.c b/drivers/net/wireless/realtek/rtw88/fw.c
>>>> index e6e9946fbf44..02389b7c6876 100644
>>>> --- a/drivers/net/wireless/realtek/rtw88/fw.c
>>>> +++ b/drivers/net/wireless/realtek/rtw88/fw.c
>>>> @@ -332,6 +332,9 @@ void rtw_fw_c2h_cmd_handle(struct rtw_dev *rtwdev, struct sk_buff *skb)
>>>>         case C2H_RA_RPT:
>>>>                 rtw_fw_ra_report_handle(rtwdev, c2h->payload, len);
>>>>                 break;
>>>> +       case C2H_ADAPTIVITY:
>>>> +               rtw_fw_adaptivity_result(rtwdev, c2h->payload, len);
>>>> +               break;
>>>>         default:
>>>>                 rtw_dbg(rtwdev, RTW_DBG_FW, "C2H 0x%x isn't handled\n", c2h->id);
>>>>                 break;
>>>> @@ -367,10 +370,6 @@ void rtw_fw_c2h_cmd_rx_irqsafe(struct rtw_dev *rtwdev, u32 pkt_offset,
>>>>                 rtw_fw_scan_result(rtwdev, c2h->payload, len);
>>>>                 dev_kfree_skb_any(skb);
>>>>                 break;
>>>> -       case C2H_ADAPTIVITY:
>>>> -               rtw_fw_adaptivity_result(rtwdev, c2h->payload, len);
>>>> -               dev_kfree_skb_any(skb);
>>>> -               break;
>>>>         default:
>>>>                 /* pass offset for further operation */
>>>>                 *((u32 *)skb->cb) = pkt_offset;
>>>> diff --git a/drivers/net/wireless/realtek/rtw88/usb.c b/drivers/net/wireless/realtek/rtw88/usb.c
>>>> index 727569d4ef4b..5c2dfa2ced92 100644
>>>> --- a/drivers/net/wireless/realtek/rtw88/usb.c
>>>> +++ b/drivers/net/wireless/realtek/rtw88/usb.c
>>>> @@ -585,7 +585,7 @@ static void rtw_usb_rx_handler(struct work_struct *work)
>>>>                                 goto skip_packet;
>>>>                         }
>>>>
>>>> -                       skb = alloc_skb(skb_len, GFP_KERNEL);
>>>> +                       skb = alloc_skb(skb_len, GFP_ATOMIC);
>>>>                         if (!skb) {
>>>>                                 rtw_dbg(rtwdev, RTW_DBG_USB,
>>>>                                         "failed to allocate RX skb of size %u\n",
>>>> @@ -612,7 +612,11 @@ static void rtw_usb_rx_handler(struct work_struct *work)
>>>>                         rx_desc += next_pkt;
>>>>                 } while (rx_desc + pkt_desc_sz < rx_skb->data + rx_skb->len);
>>>>
>>>> -               dev_kfree_skb_any(rx_skb);
>>>> +               if (skb_queue_len(&rtwusb->rx_free_queue) >=
>>>> +                   RTW_USB_RX_SKB_NUM - RTW_USB_RXCB_NUM)
>>>> +                       dev_kfree_skb_any(rx_skb);
>>>> +               else
>>>> +                       skb_queue_tail(&rtwusb->rx_free_queue, rx_skb);
>>>
>>> Why not just queue and reuse rx_skb for all? That would be simpler.
>>>
>>
>> I didn't want to let it allocate too many skbs. I didn't find any kind
>> of limit in the official drivers, so maybe it would be fine.
> 
> The case ' skb_queue_len(&rtwusb->rx_free_queue) >= 8 - 4' is rare? 
> If so, I think this is fine. If not, to repeatedly allocate and free 
> would cause memory fragment, and higher probability to failed to allocate
> memory with GFP_ATOMIC. Also seemingly additional 4 persistent rx_skb
> is not a big cost. 
> 
> 

I feel like it's rare for me, but I never counted how often it happens.

Someone else reported that in a period of 14.5 hours additional skbs
were allocated and freed 295 times. This is their log:
https://github.com/user-attachments/files/17792914/dmesg_grep_rtw_8821au.txt
This computer has i7-6700 CPU with 32 GB of RAM.

Another user reported that additional skbs were allocated and freed
3824 times in a period of 1.25 hours. This is their log:
https://github.com/user-attachments/files/18008735/log.txt
This computer is a Raspberry Pi 3B with 1 GB RAM.

I agree that another 4 skbs would be fine. I don't know what the upper
limit should be. The user with the Raspberry Pi 3B also tested with no
limit and reported that the driver allocated 17 extra skbs. That seems
like too many to me.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2024-12-11 18:07 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-14 23:06 [PATCH 1/2] wifi: rtw88: usb: Copy instead of cloning the RX skb Bitterblue Smith
2024-11-14 23:07 ` [PATCH 2/2] wifi: rtw88: usb: Preallocate and reuse the RX skbs Bitterblue Smith
2024-11-19  0:50   ` Ping-Ke Shih
2024-11-25 23:49     ` Bitterblue Smith
2024-11-26  1:26       ` Ping-Ke Shih
2024-12-11 18:07         ` Bitterblue Smith
2024-11-18  9:16 ` [PATCH 1/2] wifi: rtw88: usb: Copy instead of cloning the RX skb Ping-Ke Shih
2024-11-25 23:48   ` Bitterblue Smith
2024-11-26  1:17     ` Ping-Ke Shih
2024-11-26 20:40       ` Bitterblue Smith

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).