* [PATCH net-next 1/4] igb: prepare for AF_XDP zero-copy support
2024-08-08 18:35 [PATCH net-next 0/4][pull request] igb: Add support for AF_XDP zero-copy Tony Nguyen
@ 2024-08-08 18:35 ` Tony Nguyen
2024-08-08 20:38 ` Maciej Fijalkowski
2024-08-08 18:35 ` [PATCH net-next 2/4] igb: Introduce XSK data structures and helpers Tony Nguyen
` (2 subsequent siblings)
3 siblings, 1 reply; 20+ messages in thread
From: Tony Nguyen @ 2024-08-08 18:35 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, netdev
Cc: Sriram Yagnaraman, anthony.l.nguyen, maciej.fijalkowski,
magnus.karlsson, ast, daniel, hawk, john.fastabend, bpf, kurt,
sriram.yagnaraman, richardcochran, benjamin.steinke, bigeasy,
Chandan Kumar Rout
From: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
Always call igb_xdp_ring_update_tail() under __netif_tx_lock and add a
comment to indicate that. This is needed to share the same TX ring
between the XDP, XSK and slow paths.
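For illustration, the calling pattern this implies (and which a later patch
in this series uses when flushing XDP_TX frames) looks roughly like:

	struct netdev_queue *nq = txring_txq(tx_ring);
	int cpu = smp_processor_id();

	__netif_tx_lock(nq, cpu);
	igb_xdp_ring_update_tail(tx_ring);
	__netif_tx_unlock(nq);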
Remove the static qualifier from the following functions so they can be
called from the XSK-specific file added in later patches:
- igb_xdp_tx_queue_mapping
- igb_xdp_ring_update_tail
- igb_clean_tx_ring
- igb_clean_rx_ring
- igb_run_xdp
- igb_process_skb_fields
Introduce igb_xdp_is_enabled() to check if an XDP program is assigned to
the device.
Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
drivers/net/ethernet/intel/igb/igb.h | 15 ++++++++++++
drivers/net/ethernet/intel/igb/igb_main.c | 29 +++++++++++------------
2 files changed, 29 insertions(+), 15 deletions(-)
diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index 3c2dc7bdebb5..0de71ec324ed 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -718,6 +718,8 @@ extern char igb_driver_name[];
int igb_xmit_xdp_ring(struct igb_adapter *adapter,
struct igb_ring *ring,
struct xdp_frame *xdpf);
+struct igb_ring *igb_xdp_tx_queue_mapping(struct igb_adapter *adapter);
+void igb_xdp_ring_update_tail(struct igb_ring *ring);
int igb_open(struct net_device *netdev);
int igb_close(struct net_device *netdev);
int igb_up(struct igb_adapter *);
@@ -731,12 +733,20 @@ int igb_setup_tx_resources(struct igb_ring *);
int igb_setup_rx_resources(struct igb_ring *);
void igb_free_tx_resources(struct igb_ring *);
void igb_free_rx_resources(struct igb_ring *);
+void igb_clean_tx_ring(struct igb_ring *tx_ring);
+void igb_clean_rx_ring(struct igb_ring *rx_ring);
void igb_configure_tx_ring(struct igb_adapter *, struct igb_ring *);
void igb_configure_rx_ring(struct igb_adapter *, struct igb_ring *);
void igb_setup_tctl(struct igb_adapter *);
void igb_setup_rctl(struct igb_adapter *);
void igb_setup_srrctl(struct igb_adapter *, struct igb_ring *);
netdev_tx_t igb_xmit_frame_ring(struct sk_buff *, struct igb_ring *);
+struct sk_buff *igb_run_xdp(struct igb_adapter *adapter,
+ struct igb_ring *rx_ring,
+ struct xdp_buff *xdp);
+void igb_process_skb_fields(struct igb_ring *rx_ring,
+ union e1000_adv_rx_desc *rx_desc,
+ struct sk_buff *skb);
void igb_alloc_rx_buffers(struct igb_ring *, u16);
void igb_update_stats(struct igb_adapter *);
bool igb_has_link(struct igb_adapter *adapter);
@@ -797,6 +807,11 @@ static inline struct netdev_queue *txring_txq(const struct igb_ring *tx_ring)
return netdev_get_tx_queue(tx_ring->netdev, tx_ring->queue_index);
}
+static inline bool igb_xdp_is_enabled(struct igb_adapter *adapter)
+{
+ return !!adapter->xdp_prog;
+}
+
int igb_add_filter(struct igb_adapter *adapter,
struct igb_nfc_filter *input);
int igb_erase_filter(struct igb_adapter *adapter,
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 11be39f435f3..bdb7637559b8 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -115,8 +115,6 @@ static void igb_configure_tx(struct igb_adapter *);
static void igb_configure_rx(struct igb_adapter *);
static void igb_clean_all_tx_rings(struct igb_adapter *);
static void igb_clean_all_rx_rings(struct igb_adapter *);
-static void igb_clean_tx_ring(struct igb_ring *);
-static void igb_clean_rx_ring(struct igb_ring *);
static void igb_set_rx_mode(struct net_device *);
static void igb_update_phy_info(struct timer_list *);
static void igb_watchdog(struct timer_list *);
@@ -2914,7 +2912,8 @@ static int igb_xdp(struct net_device *dev, struct netdev_bpf *xdp)
}
}
-static void igb_xdp_ring_update_tail(struct igb_ring *ring)
+/* This function assumes __netif_tx_lock is held by the caller. */
+void igb_xdp_ring_update_tail(struct igb_ring *ring)
{
/* Force memory writes to complete before letting h/w know there
* are new descriptors to fetch.
@@ -2923,7 +2922,7 @@ static void igb_xdp_ring_update_tail(struct igb_ring *ring)
writel(ring->next_to_use, ring->tail);
}
-static struct igb_ring *igb_xdp_tx_queue_mapping(struct igb_adapter *adapter)
+struct igb_ring *igb_xdp_tx_queue_mapping(struct igb_adapter *adapter)
{
unsigned int r_idx = smp_processor_id();
@@ -3000,11 +2999,11 @@ static int igb_xdp_xmit(struct net_device *dev, int n,
nxmit++;
}
- __netif_tx_unlock(nq);
-
if (unlikely(flags & XDP_XMIT_FLUSH))
igb_xdp_ring_update_tail(tx_ring);
+ __netif_tx_unlock(nq);
+
return nxmit;
}
@@ -4879,7 +4878,7 @@ static void igb_free_all_tx_resources(struct igb_adapter *adapter)
* igb_clean_tx_ring - Free Tx Buffers
* @tx_ring: ring to be cleaned
**/
-static void igb_clean_tx_ring(struct igb_ring *tx_ring)
+void igb_clean_tx_ring(struct igb_ring *tx_ring)
{
u16 i = tx_ring->next_to_clean;
struct igb_tx_buffer *tx_buffer = &tx_ring->tx_buffer_info[i];
@@ -4998,7 +4997,7 @@ static void igb_free_all_rx_resources(struct igb_adapter *adapter)
* igb_clean_rx_ring - Free Rx Buffers per Queue
* @rx_ring: ring to free buffers from
**/
-static void igb_clean_rx_ring(struct igb_ring *rx_ring)
+void igb_clean_rx_ring(struct igb_ring *rx_ring)
{
u16 i = rx_ring->next_to_clean;
@@ -6613,7 +6612,7 @@ static int igb_change_mtu(struct net_device *netdev, int new_mtu)
struct igb_adapter *adapter = netdev_priv(netdev);
int max_frame = new_mtu + IGB_ETH_PKT_HDR_PAD;
- if (adapter->xdp_prog) {
+ if (igb_xdp_is_enabled(adapter)) {
int i;
for (i = 0; i < adapter->num_rx_queues; i++) {
@@ -8569,9 +8568,9 @@ static struct sk_buff *igb_build_skb(struct igb_ring *rx_ring,
return skb;
}
-static struct sk_buff *igb_run_xdp(struct igb_adapter *adapter,
- struct igb_ring *rx_ring,
- struct xdp_buff *xdp)
+struct sk_buff *igb_run_xdp(struct igb_adapter *adapter,
+ struct igb_ring *rx_ring,
+ struct xdp_buff *xdp)
{
int err, result = IGB_XDP_PASS;
struct bpf_prog *xdp_prog;
@@ -8767,9 +8766,9 @@ static bool igb_cleanup_headers(struct igb_ring *rx_ring,
* order to populate the hash, checksum, VLAN, timestamp, protocol, and
* other fields within the skb.
**/
-static void igb_process_skb_fields(struct igb_ring *rx_ring,
- union e1000_adv_rx_desc *rx_desc,
- struct sk_buff *skb)
+void igb_process_skb_fields(struct igb_ring *rx_ring,
+ union e1000_adv_rx_desc *rx_desc,
+ struct sk_buff *skb)
{
struct net_device *dev = rx_ring->netdev;
--
2.42.0
* Re: [PATCH net-next 1/4] igb: prepare for AF_XDP zero-copy support
2024-08-08 18:35 ` [PATCH net-next 1/4] igb: prepare for AF_XDP zero-copy support Tony Nguyen
@ 2024-08-08 20:38 ` Maciej Fijalkowski
2024-08-09 13:05 ` Kurt Kanzenbach
0 siblings, 1 reply; 20+ messages in thread
From: Maciej Fijalkowski @ 2024-08-08 20:38 UTC (permalink / raw)
To: Tony Nguyen
Cc: davem, kuba, pabeni, edumazet, netdev, Sriram Yagnaraman,
magnus.karlsson, ast, daniel, hawk, john.fastabend, bpf, kurt,
sriram.yagnaraman, richardcochran, benjamin.steinke, bigeasy,
Chandan Kumar Rout
On Thu, Aug 08, 2024 at 11:35:51AM -0700, Tony Nguyen wrote:
> From: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
>
> Always call igb_xdp_ring_update_tail under __netif_tx_lock, add a
> comment to indicate that. This is needed to share the same TX ring
> between XDP, XSK and slow paths.
standalone commit
>
> Remove static qualifiers on the following functions to be able to call
> from XSK specific file that is added in the later patches
> - igb_xdp_tx_queue_mapping
> - igb_xdp_ring_update_tail
> - igb_clean_tx_ring
> - igb_clean_rx_ring
> - igb_run_xdp
> - igb_process_skb_fields
ditto
>
> Introduce igb_xdp_is_enabled() to check if an XDP program is assigned to
> the device.
ditto
>
> Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
> Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
> Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
> ---
> drivers/net/ethernet/intel/igb/igb.h | 15 ++++++++++++
> drivers/net/ethernet/intel/igb/igb_main.c | 29 +++++++++++------------
> 2 files changed, 29 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
> index 3c2dc7bdebb5..0de71ec324ed 100644
> --- a/drivers/net/ethernet/intel/igb/igb.h
> +++ b/drivers/net/ethernet/intel/igb/igb.h
> @@ -718,6 +718,8 @@ extern char igb_driver_name[];
> int igb_xmit_xdp_ring(struct igb_adapter *adapter,
> struct igb_ring *ring,
> struct xdp_frame *xdpf);
> +struct igb_ring *igb_xdp_tx_queue_mapping(struct igb_adapter *adapter);
> +void igb_xdp_ring_update_tail(struct igb_ring *ring);
> int igb_open(struct net_device *netdev);
> int igb_close(struct net_device *netdev);
> int igb_up(struct igb_adapter *);
> @@ -731,12 +733,20 @@ int igb_setup_tx_resources(struct igb_ring *);
> int igb_setup_rx_resources(struct igb_ring *);
> void igb_free_tx_resources(struct igb_ring *);
> void igb_free_rx_resources(struct igb_ring *);
> +void igb_clean_tx_ring(struct igb_ring *tx_ring);
> +void igb_clean_rx_ring(struct igb_ring *rx_ring);
> void igb_configure_tx_ring(struct igb_adapter *, struct igb_ring *);
> void igb_configure_rx_ring(struct igb_adapter *, struct igb_ring *);
> void igb_setup_tctl(struct igb_adapter *);
> void igb_setup_rctl(struct igb_adapter *);
> void igb_setup_srrctl(struct igb_adapter *, struct igb_ring *);
> netdev_tx_t igb_xmit_frame_ring(struct sk_buff *, struct igb_ring *);
> +struct sk_buff *igb_run_xdp(struct igb_adapter *adapter,
> + struct igb_ring *rx_ring,
> + struct xdp_buff *xdp);
> +void igb_process_skb_fields(struct igb_ring *rx_ring,
> + union e1000_adv_rx_desc *rx_desc,
> + struct sk_buff *skb);
> void igb_alloc_rx_buffers(struct igb_ring *, u16);
> void igb_update_stats(struct igb_adapter *);
> bool igb_has_link(struct igb_adapter *adapter);
> @@ -797,6 +807,11 @@ static inline struct netdev_queue *txring_txq(const struct igb_ring *tx_ring)
> return netdev_get_tx_queue(tx_ring->netdev, tx_ring->queue_index);
> }
>
> +static inline bool igb_xdp_is_enabled(struct igb_adapter *adapter)
> +{
> + return !!adapter->xdp_prog;
READ_ONCE() plus use this everywhere else where prog is read.
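i.e. something along these lines (untested sketch):

	static inline bool igb_xdp_is_enabled(struct igb_adapter *adapter)
	{
		return !!READ_ONCE(adapter->xdp_prog);
	}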
> +}
> +
> int igb_add_filter(struct igb_adapter *adapter,
> struct igb_nfc_filter *input);
> int igb_erase_filter(struct igb_adapter *adapter,
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> index 11be39f435f3..bdb7637559b8 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -115,8 +115,6 @@ static void igb_configure_tx(struct igb_adapter *);
> static void igb_configure_rx(struct igb_adapter *);
> static void igb_clean_all_tx_rings(struct igb_adapter *);
> static void igb_clean_all_rx_rings(struct igb_adapter *);
> -static void igb_clean_tx_ring(struct igb_ring *);
> -static void igb_clean_rx_ring(struct igb_ring *);
> static void igb_set_rx_mode(struct net_device *);
> static void igb_update_phy_info(struct timer_list *);
> static void igb_watchdog(struct timer_list *);
> @@ -2914,7 +2912,8 @@ static int igb_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> }
> }
>
> -static void igb_xdp_ring_update_tail(struct igb_ring *ring)
> +/* This function assumes __netif_tx_lock is held by the caller. */
> +void igb_xdp_ring_update_tail(struct igb_ring *ring)
> {
> /* Force memory writes to complete before letting h/w know there
> * are new descriptors to fetch.
> @@ -2923,7 +2922,7 @@ static void igb_xdp_ring_update_tail(struct igb_ring *ring)
> writel(ring->next_to_use, ring->tail);
> }
>
> -static struct igb_ring *igb_xdp_tx_queue_mapping(struct igb_adapter *adapter)
> +struct igb_ring *igb_xdp_tx_queue_mapping(struct igb_adapter *adapter)
> {
> unsigned int r_idx = smp_processor_id();
>
> @@ -3000,11 +2999,11 @@ static int igb_xdp_xmit(struct net_device *dev, int n,
> nxmit++;
> }
>
> - __netif_tx_unlock(nq);
> -
> if (unlikely(flags & XDP_XMIT_FLUSH))
> igb_xdp_ring_update_tail(tx_ring);
>
> + __netif_tx_unlock(nq);
> +
> return nxmit;
> }
>
> @@ -4879,7 +4878,7 @@ static void igb_free_all_tx_resources(struct igb_adapter *adapter)
> * igb_clean_tx_ring - Free Tx Buffers
> * @tx_ring: ring to be cleaned
> **/
> -static void igb_clean_tx_ring(struct igb_ring *tx_ring)
> +void igb_clean_tx_ring(struct igb_ring *tx_ring)
> {
> u16 i = tx_ring->next_to_clean;
> struct igb_tx_buffer *tx_buffer = &tx_ring->tx_buffer_info[i];
> @@ -4998,7 +4997,7 @@ static void igb_free_all_rx_resources(struct igb_adapter *adapter)
> * igb_clean_rx_ring - Free Rx Buffers per Queue
> * @rx_ring: ring to free buffers from
> **/
> -static void igb_clean_rx_ring(struct igb_ring *rx_ring)
> +void igb_clean_rx_ring(struct igb_ring *rx_ring)
> {
> u16 i = rx_ring->next_to_clean;
>
> @@ -6613,7 +6612,7 @@ static int igb_change_mtu(struct net_device *netdev, int new_mtu)
> struct igb_adapter *adapter = netdev_priv(netdev);
> int max_frame = new_mtu + IGB_ETH_PKT_HDR_PAD;
>
> - if (adapter->xdp_prog) {
> + if (igb_xdp_is_enabled(adapter)) {
> int i;
>
> for (i = 0; i < adapter->num_rx_queues; i++) {
> @@ -8569,9 +8568,9 @@ static struct sk_buff *igb_build_skb(struct igb_ring *rx_ring,
> return skb;
> }
>
> -static struct sk_buff *igb_run_xdp(struct igb_adapter *adapter,
> - struct igb_ring *rx_ring,
> - struct xdp_buff *xdp)
> +struct sk_buff *igb_run_xdp(struct igb_adapter *adapter,
> + struct igb_ring *rx_ring,
> + struct xdp_buff *xdp)
> {
> int err, result = IGB_XDP_PASS;
> struct bpf_prog *xdp_prog;
> @@ -8767,9 +8766,9 @@ static bool igb_cleanup_headers(struct igb_ring *rx_ring,
> * order to populate the hash, checksum, VLAN, timestamp, protocol, and
> * other fields within the skb.
> **/
> -static void igb_process_skb_fields(struct igb_ring *rx_ring,
> - union e1000_adv_rx_desc *rx_desc,
> - struct sk_buff *skb)
> +void igb_process_skb_fields(struct igb_ring *rx_ring,
> + union e1000_adv_rx_desc *rx_desc,
> + struct sk_buff *skb)
> {
> struct net_device *dev = rx_ring->netdev;
>
> --
> 2.42.0
>
>
* Re: [PATCH net-next 1/4] igb: prepare for AF_XDP zero-copy support
2024-08-08 20:38 ` Maciej Fijalkowski
@ 2024-08-09 13:05 ` Kurt Kanzenbach
2024-08-09 13:14 ` Fijalkowski, Maciej
0 siblings, 1 reply; 20+ messages in thread
From: Kurt Kanzenbach @ 2024-08-09 13:05 UTC (permalink / raw)
To: Maciej Fijalkowski, Tony Nguyen
Cc: davem, kuba, pabeni, edumazet, netdev, Sriram Yagnaraman,
magnus.karlsson, ast, daniel, hawk, john.fastabend, bpf,
sriram.yagnaraman, richardcochran, benjamin.steinke, bigeasy,
Chandan Kumar Rout
On Thu Aug 08 2024, Maciej Fijalkowski wrote:
> On Thu, Aug 08, 2024 at 11:35:51AM -0700, Tony Nguyen wrote:
>> From: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
>>
>> Always call igb_xdp_ring_update_tail under __netif_tx_lock, add a
>> comment to indicate that. This is needed to share the same TX ring
>> between XDP, XSK and slow paths.
>
> standalone commit
Ok.
>> +static inline bool igb_xdp_is_enabled(struct igb_adapter *adapter)
>> +{
>> + return !!adapter->xdp_prog;
>
> READ_ONCE() plus use this everywhere else where prog is read.
Sure. I'll send v6 to iwl then.
Thanks,
Kurt
* RE: [PATCH net-next 1/4] igb: prepare for AF_XDP zero-copy support
2024-08-09 13:05 ` Kurt Kanzenbach
@ 2024-08-09 13:14 ` Fijalkowski, Maciej
2024-08-09 13:19 ` Kurt Kanzenbach
0 siblings, 1 reply; 20+ messages in thread
From: Fijalkowski, Maciej @ 2024-08-09 13:14 UTC (permalink / raw)
To: Kurt Kanzenbach, Nguyen, Anthony L
Cc: davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com,
edumazet@google.com, netdev@vger.kernel.org, Sriram Yagnaraman,
Karlsson, Magnus, ast@kernel.org, daniel@iogearbox.net,
hawk@kernel.org, john.fastabend@gmail.com, bpf@vger.kernel.org,
sriram.yagnaraman@ericsson.com, richardcochran@gmail.com,
benjamin.steinke@woks-audio.com, bigeasy@linutronix.de,
Rout, ChandanX
>
> On Thu Aug 08 2024, Maciej Fijalkowski wrote:
> > On Thu, Aug 08, 2024 at 11:35:51AM -0700, Tony Nguyen wrote:
> >> From: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
> >>
> >> Always call igb_xdp_ring_update_tail under __netif_tx_lock, add a
> >> comment to indicate that. This is needed to share the same TX ring
> >> between XDP, XSK and slow paths.
> >
> > standalone commit
>
> Ok.
>
> >> +static inline bool igb_xdp_is_enabled(struct igb_adapter *adapter)
> >> +{
> >> + return !!adapter->xdp_prog;
> >
> > READ_ONCE() plus use this everywhere else where prog is read.
>
> Sure. I'll send v6 to iwl then.
I'm in the middle of going through the rest of the set, will finish today.
>
> Thanks,
> Kurt
* RE: [PATCH net-next 1/4] igb: prepare for AF_XDP zero-copy support
2024-08-09 13:14 ` Fijalkowski, Maciej
@ 2024-08-09 13:19 ` Kurt Kanzenbach
0 siblings, 0 replies; 20+ messages in thread
From: Kurt Kanzenbach @ 2024-08-09 13:19 UTC (permalink / raw)
To: Fijalkowski, Maciej, Nguyen, Anthony L
Cc: davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com,
edumazet@google.com, netdev@vger.kernel.org, Sriram Yagnaraman,
Karlsson, Magnus, ast@kernel.org, daniel@iogearbox.net,
hawk@kernel.org, john.fastabend@gmail.com, bpf@vger.kernel.org,
sriram.yagnaraman@ericsson.com, richardcochran@gmail.com,
benjamin.steinke@woks-audio.com, bigeasy@linutronix.de,
Rout, ChandanX
On Fri Aug 09 2024, Fijalkowski, Maciej wrote:
>> >> +static inline bool igb_xdp_is_enabled(struct igb_adapter *adapter)
>> >> +{
>> >> + return !!adapter->xdp_prog;
>> >
>> > READ_ONCE() plus use this everywhere else where prog is read.
>>
>> Sure. I'll send v6 to iwl then.
>
> I'm in the middle of going through rest of the set, will finish today.
Perfect, thanks!
* [PATCH net-next 2/4] igb: Introduce XSK data structures and helpers
2024-08-08 18:35 [PATCH net-next 0/4][pull request] igb: Add support for AF_XDP zero-copy Tony Nguyen
2024-08-08 18:35 ` [PATCH net-next 1/4] igb: prepare for AF_XDP zero-copy support Tony Nguyen
@ 2024-08-08 18:35 ` Tony Nguyen
2024-08-10 13:35 ` Maciej Fijalkowski
2024-08-08 18:35 ` [PATCH net-next 3/4] igb: add AF_XDP zero-copy Rx support Tony Nguyen
2024-08-08 18:35 ` [PATCH net-next 4/4] igb: add AF_XDP zero-copy Tx support Tony Nguyen
3 siblings, 1 reply; 20+ messages in thread
From: Tony Nguyen @ 2024-08-08 18:35 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, netdev
Cc: Sriram Yagnaraman, anthony.l.nguyen, maciej.fijalkowski,
magnus.karlsson, ast, daniel, hawk, john.fastabend, bpf, kurt,
sriram.yagnaraman, richardcochran, benjamin.steinke, bigeasy,
Chandan Kumar Rout
From: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
Add the following ring flags:
- IGB_RING_FLAG_TX_DISABLED (the xsk pool is being set up)
- IGB_RING_FLAG_AF_XDP_ZC (the xsk pool is active)
Add an xdp_buff array for use with the XSK receive batch API, and a
pointer to the xsk_pool in struct igb_ring.
Add enable/disable functions for TX and RX rings
Add enable/disable functions for XSK pool
Add xsk wakeup function
None of the above functionality will be active until
NETDEV_XDP_ACT_XSK_ZEROCOPY is advertised in netdev->xdp_features.
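For reference, that advertisement is only added by the Rx zero-copy patch
later in this series and amounts to roughly:

	netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT |
			       NETDEV_XDP_ACT_XSK_ZEROCOPY;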
Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
drivers/net/ethernet/intel/igb/Makefile | 2 +-
drivers/net/ethernet/intel/igb/igb.h | 14 +-
drivers/net/ethernet/intel/igb/igb_main.c | 9 +
drivers/net/ethernet/intel/igb/igb_xsk.c | 210 ++++++++++++++++++++++
4 files changed, 233 insertions(+), 2 deletions(-)
create mode 100644 drivers/net/ethernet/intel/igb/igb_xsk.c
diff --git a/drivers/net/ethernet/intel/igb/Makefile b/drivers/net/ethernet/intel/igb/Makefile
index 463c0d26b9d4..6c1b702fd992 100644
--- a/drivers/net/ethernet/intel/igb/Makefile
+++ b/drivers/net/ethernet/intel/igb/Makefile
@@ -8,4 +8,4 @@ obj-$(CONFIG_IGB) += igb.o
igb-y := igb_main.o igb_ethtool.o e1000_82575.o \
e1000_mac.o e1000_nvm.o e1000_phy.o e1000_mbx.o \
- e1000_i210.o igb_ptp.o igb_hwmon.o
+ e1000_i210.o igb_ptp.o igb_hwmon.o igb_xsk.o
diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index 0de71ec324ed..053130c01480 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -20,6 +20,7 @@
#include <linux/mdio.h>
#include <net/xdp.h>
+#include <net/xdp_sock_drv.h>
struct igb_adapter;
@@ -320,6 +321,7 @@ struct igb_ring {
union { /* array of buffer info structs */
struct igb_tx_buffer *tx_buffer_info;
struct igb_rx_buffer *rx_buffer_info;
+ struct xdp_buff **rx_buffer_info_zc;
};
void *desc; /* descriptor ring memory */
unsigned long flags; /* ring specific flags */
@@ -357,6 +359,7 @@ struct igb_ring {
};
};
struct xdp_rxq_info xdp_rxq;
+ struct xsk_buff_pool *xsk_pool;
} ____cacheline_internodealigned_in_smp;
struct igb_q_vector {
@@ -384,7 +387,9 @@ enum e1000_ring_flags_t {
IGB_RING_FLAG_RX_SCTP_CSUM,
IGB_RING_FLAG_RX_LB_VLAN_BSWAP,
IGB_RING_FLAG_TX_CTX_IDX,
- IGB_RING_FLAG_TX_DETECT_HANG
+ IGB_RING_FLAG_TX_DETECT_HANG,
+ IGB_RING_FLAG_TX_DISABLED,
+ IGB_RING_FLAG_AF_XDP_ZC
};
#define ring_uses_large_buffer(ring) \
@@ -822,4 +827,11 @@ int igb_add_mac_steering_filter(struct igb_adapter *adapter,
int igb_del_mac_steering_filter(struct igb_adapter *adapter,
const u8 *addr, u8 queue, u8 flags);
+struct xsk_buff_pool *igb_xsk_pool(struct igb_adapter *adapter,
+ struct igb_ring *ring);
+int igb_xsk_pool_setup(struct igb_adapter *adapter,
+ struct xsk_buff_pool *pool,
+ u16 qid);
+int igb_xsk_wakeup(struct net_device *dev, u32 qid, u32 flags);
+
#endif /* _IGB_H_ */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index bdb7637559b8..b6f23bbeff71 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2904,9 +2904,14 @@ static int igb_xdp_setup(struct net_device *dev, struct netdev_bpf *bpf)
static int igb_xdp(struct net_device *dev, struct netdev_bpf *xdp)
{
+ struct igb_adapter *adapter = netdev_priv(dev);
+
switch (xdp->command) {
case XDP_SETUP_PROG:
return igb_xdp_setup(dev, xdp);
+ case XDP_SETUP_XSK_POOL:
+ return igb_xsk_pool_setup(adapter, xdp->xsk.pool,
+ xdp->xsk.queue_id);
default:
return -EINVAL;
}
@@ -3033,6 +3038,7 @@ static const struct net_device_ops igb_netdev_ops = {
.ndo_setup_tc = igb_setup_tc,
.ndo_bpf = igb_xdp,
.ndo_xdp_xmit = igb_xdp_xmit,
+ .ndo_xsk_wakeup = igb_xsk_wakeup,
};
/**
@@ -4355,6 +4361,8 @@ void igb_configure_tx_ring(struct igb_adapter *adapter,
u64 tdba = ring->dma;
int reg_idx = ring->reg_idx;
+ ring->xsk_pool = igb_xsk_pool(adapter, ring);
+
wr32(E1000_TDLEN(reg_idx),
ring->count * sizeof(union e1000_adv_tx_desc));
wr32(E1000_TDBAL(reg_idx),
@@ -4750,6 +4758,7 @@ void igb_configure_rx_ring(struct igb_adapter *adapter,
xdp_rxq_info_unreg_mem_model(&ring->xdp_rxq);
WARN_ON(xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
MEM_TYPE_PAGE_SHARED, NULL));
+ ring->xsk_pool = igb_xsk_pool(adapter, ring);
/* disable the queue */
wr32(E1000_RXDCTL(reg_idx), 0);
diff --git a/drivers/net/ethernet/intel/igb/igb_xsk.c b/drivers/net/ethernet/intel/igb/igb_xsk.c
new file mode 100644
index 000000000000..925bf97f7caa
--- /dev/null
+++ b/drivers/net/ethernet/intel/igb/igb_xsk.c
@@ -0,0 +1,210 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2018 Intel Corporation. */
+
+#include <linux/bpf_trace.h>
+#include <net/xdp_sock_drv.h>
+#include <net/xdp.h>
+
+#include "e1000_hw.h"
+#include "igb.h"
+
+static int igb_realloc_rx_buffer_info(struct igb_ring *ring, bool pool_present)
+{
+ int size = pool_present ?
+ sizeof(*ring->rx_buffer_info_zc) * ring->count :
+ sizeof(*ring->rx_buffer_info) * ring->count;
+ void *buff_info = vmalloc(size);
+
+ if (!buff_info)
+ return -ENOMEM;
+
+ if (pool_present) {
+ vfree(ring->rx_buffer_info);
+ ring->rx_buffer_info = NULL;
+ ring->rx_buffer_info_zc = buff_info;
+ } else {
+ vfree(ring->rx_buffer_info_zc);
+ ring->rx_buffer_info_zc = NULL;
+ ring->rx_buffer_info = buff_info;
+ }
+
+ return 0;
+}
+
+static void igb_txrx_ring_disable(struct igb_adapter *adapter, u16 qid)
+{
+ struct igb_ring *tx_ring = adapter->tx_ring[qid];
+ struct igb_ring *rx_ring = adapter->rx_ring[qid];
+ struct e1000_hw *hw = &adapter->hw;
+
+ set_bit(IGB_RING_FLAG_TX_DISABLED, &tx_ring->flags);
+
+ wr32(E1000_TXDCTL(tx_ring->reg_idx), 0);
+ wr32(E1000_RXDCTL(rx_ring->reg_idx), 0);
+
+ /* Rx/Tx share the same napi context. */
+ napi_disable(&rx_ring->q_vector->napi);
+
+ igb_clean_tx_ring(tx_ring);
+ igb_clean_rx_ring(rx_ring);
+
+ memset(&rx_ring->rx_stats, 0, sizeof(rx_ring->rx_stats));
+ memset(&tx_ring->tx_stats, 0, sizeof(tx_ring->tx_stats));
+}
+
+static void igb_txrx_ring_enable(struct igb_adapter *adapter, u16 qid)
+{
+ struct igb_ring *tx_ring = adapter->tx_ring[qid];
+ struct igb_ring *rx_ring = adapter->rx_ring[qid];
+
+ igb_configure_tx_ring(adapter, tx_ring);
+ igb_configure_rx_ring(adapter, rx_ring);
+
+ clear_bit(IGB_RING_FLAG_TX_DISABLED, &tx_ring->flags);
+
+ /* call igb_desc_unused which always leaves
+ * at least 1 descriptor unused to make sure
+ * next_to_use != next_to_clean
+ */
+ igb_alloc_rx_buffers(rx_ring, igb_desc_unused(rx_ring));
+
+ /* Rx/Tx share the same napi context. */
+ napi_enable(&rx_ring->q_vector->napi);
+}
+
+struct xsk_buff_pool *igb_xsk_pool(struct igb_adapter *adapter,
+ struct igb_ring *ring)
+{
+ int qid = ring->queue_index;
+
+ if (!igb_xdp_is_enabled(adapter) ||
+ !test_bit(IGB_RING_FLAG_AF_XDP_ZC, &ring->flags))
+ return NULL;
+
+ return xsk_get_pool_from_qid(adapter->netdev, qid);
+}
+
+static int igb_xsk_pool_enable(struct igb_adapter *adapter,
+ struct xsk_buff_pool *pool,
+ u16 qid)
+{
+ struct net_device *netdev = adapter->netdev;
+ struct igb_ring *tx_ring, *rx_ring;
+ bool if_running;
+ int err;
+
+ if (qid >= adapter->num_rx_queues)
+ return -EINVAL;
+
+ if (qid >= netdev->real_num_rx_queues ||
+ qid >= netdev->real_num_tx_queues)
+ return -EINVAL;
+
+ err = xsk_pool_dma_map(pool, &adapter->pdev->dev, IGB_RX_DMA_ATTR);
+ if (err)
+ return err;
+
+ tx_ring = adapter->tx_ring[qid];
+ rx_ring = adapter->rx_ring[qid];
+ if_running = netif_running(adapter->netdev) && igb_xdp_is_enabled(adapter);
+ if (if_running)
+ igb_txrx_ring_disable(adapter, qid);
+
+ set_bit(IGB_RING_FLAG_AF_XDP_ZC, &tx_ring->flags);
+ set_bit(IGB_RING_FLAG_AF_XDP_ZC, &rx_ring->flags);
+
+ if (if_running) {
+ err = igb_realloc_rx_buffer_info(rx_ring, true);
+ if (!err) {
+ igb_txrx_ring_enable(adapter, qid);
+ /* Kick start the NAPI context so that receiving will start */
+ err = igb_xsk_wakeup(adapter->netdev, qid, XDP_WAKEUP_RX);
+ }
+
+ if (err) {
+ clear_bit(IGB_RING_FLAG_AF_XDP_ZC, &tx_ring->flags);
+ clear_bit(IGB_RING_FLAG_AF_XDP_ZC, &rx_ring->flags);
+ xsk_pool_dma_unmap(pool, IGB_RX_DMA_ATTR);
+ return err;
+ }
+ }
+
+ return 0;
+}
+
+static int igb_xsk_pool_disable(struct igb_adapter *adapter, u16 qid)
+{
+ struct igb_ring *tx_ring, *rx_ring;
+ struct xsk_buff_pool *pool;
+ bool if_running;
+ int err;
+
+ pool = xsk_get_pool_from_qid(adapter->netdev, qid);
+ if (!pool)
+ return -EINVAL;
+
+ tx_ring = adapter->tx_ring[qid];
+ rx_ring = adapter->rx_ring[qid];
+ if_running = netif_running(adapter->netdev) && igb_xdp_is_enabled(adapter);
+ if (if_running)
+ igb_txrx_ring_disable(adapter, qid);
+
+ xsk_pool_dma_unmap(pool, IGB_RX_DMA_ATTR);
+ clear_bit(IGB_RING_FLAG_AF_XDP_ZC, &tx_ring->flags);
+ clear_bit(IGB_RING_FLAG_AF_XDP_ZC, &rx_ring->flags);
+
+ if (if_running) {
+ err = igb_realloc_rx_buffer_info(rx_ring, false);
+ if (err)
+ return err;
+
+ igb_txrx_ring_enable(adapter, qid);
+ }
+
+ return 0;
+}
+
+int igb_xsk_pool_setup(struct igb_adapter *adapter,
+ struct xsk_buff_pool *pool,
+ u16 qid)
+{
+ return pool ? igb_xsk_pool_enable(adapter, pool, qid) :
+ igb_xsk_pool_disable(adapter, qid);
+}
+
+int igb_xsk_wakeup(struct net_device *dev, u32 qid, u32 flags)
+{
+ struct igb_adapter *adapter = netdev_priv(dev);
+ struct e1000_hw *hw = &adapter->hw;
+ struct igb_ring *ring;
+ u32 eics = 0;
+
+ if (test_bit(__IGB_DOWN, &adapter->state))
+ return -ENETDOWN;
+
+ if (!igb_xdp_is_enabled(adapter))
+ return -EINVAL;
+
+ if (qid >= adapter->num_tx_queues)
+ return -EINVAL;
+
+ ring = adapter->tx_ring[qid];
+
+ if (test_bit(IGB_RING_FLAG_TX_DISABLED, &ring->flags))
+ return -ENETDOWN;
+
+ if (!ring->xsk_pool)
+ return -EINVAL;
+
+ if (!napi_if_scheduled_mark_missed(&ring->q_vector->napi)) {
+ /* Cause software interrupt to ensure Rx ring is cleaned */
+ if (adapter->flags & IGB_FLAG_HAS_MSIX) {
+ eics |= ring->q_vector->eims_value;
+ wr32(E1000_EICS, eics);
+ } else {
+ wr32(E1000_ICS, E1000_ICS_RXDMT0);
+ }
+ }
+
+ return 0;
+}
--
2.42.0
* Re: [PATCH net-next 2/4] igb: Introduce XSK data structures and helpers
2024-08-08 18:35 ` [PATCH net-next 2/4] igb: Introduce XSK data structures and helpers Tony Nguyen
@ 2024-08-10 13:35 ` Maciej Fijalkowski
0 siblings, 0 replies; 20+ messages in thread
From: Maciej Fijalkowski @ 2024-08-10 13:35 UTC (permalink / raw)
To: Tony Nguyen
Cc: davem, kuba, pabeni, edumazet, netdev, Sriram Yagnaraman,
magnus.karlsson, ast, daniel, hawk, john.fastabend, bpf, kurt,
sriram.yagnaraman, richardcochran, benjamin.steinke, bigeasy,
Chandan Kumar Rout
On Thu, Aug 08, 2024 at 11:35:52AM -0700, Tony Nguyen wrote:
> From: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
>
> Add the following ring flags
> - IGB_RING_FLAG_TX_DISABLED (when xsk pool is being setup)
> - IGB_RING_FLAG_AF_XDP_ZC (xsk pool is active)
>
> Add a xdp_buff array for use with XSK receive batch API, and a pointer
> to xsk_pool in igb_adapter.
>
> Add enable/disable functions for TX and RX rings
> Add enable/disable functions for XSK pool
> Add xsk wakeup function
>
> None of the above functionality will be active until
> NETDEV_XDP_ACT_XSK_ZEROCOPY is advertised in netdev->xdp_features.
>
> Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
> Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
> Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
> ---
> drivers/net/ethernet/intel/igb/Makefile | 2 +-
> drivers/net/ethernet/intel/igb/igb.h | 14 +-
> drivers/net/ethernet/intel/igb/igb_main.c | 9 +
> drivers/net/ethernet/intel/igb/igb_xsk.c | 210 ++++++++++++++++++++++
> 4 files changed, 233 insertions(+), 2 deletions(-)
> create mode 100644 drivers/net/ethernet/intel/igb/igb_xsk.c
>
> diff --git a/drivers/net/ethernet/intel/igb/Makefile b/drivers/net/ethernet/intel/igb/Makefile
> index 463c0d26b9d4..6c1b702fd992 100644
> --- a/drivers/net/ethernet/intel/igb/Makefile
> +++ b/drivers/net/ethernet/intel/igb/Makefile
> @@ -8,4 +8,4 @@ obj-$(CONFIG_IGB) += igb.o
>
> igb-y := igb_main.o igb_ethtool.o e1000_82575.o \
> e1000_mac.o e1000_nvm.o e1000_phy.o e1000_mbx.o \
> - e1000_i210.o igb_ptp.o igb_hwmon.o
> + e1000_i210.o igb_ptp.o igb_hwmon.o igb_xsk.o
> diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
> index 0de71ec324ed..053130c01480 100644
> --- a/drivers/net/ethernet/intel/igb/igb.h
> +++ b/drivers/net/ethernet/intel/igb/igb.h
> @@ -20,6 +20,7 @@
> #include <linux/mdio.h>
>
> #include <net/xdp.h>
> +#include <net/xdp_sock_drv.h>
>
> struct igb_adapter;
>
> @@ -320,6 +321,7 @@ struct igb_ring {
> union { /* array of buffer info structs */
> struct igb_tx_buffer *tx_buffer_info;
> struct igb_rx_buffer *rx_buffer_info;
> + struct xdp_buff **rx_buffer_info_zc;
> };
> void *desc; /* descriptor ring memory */
> unsigned long flags; /* ring specific flags */
> @@ -357,6 +359,7 @@ struct igb_ring {
> };
> };
> struct xdp_rxq_info xdp_rxq;
> + struct xsk_buff_pool *xsk_pool;
> } ____cacheline_internodealigned_in_smp;
>
> struct igb_q_vector {
> @@ -384,7 +387,9 @@ enum e1000_ring_flags_t {
> IGB_RING_FLAG_RX_SCTP_CSUM,
> IGB_RING_FLAG_RX_LB_VLAN_BSWAP,
> IGB_RING_FLAG_TX_CTX_IDX,
> - IGB_RING_FLAG_TX_DETECT_HANG
> + IGB_RING_FLAG_TX_DETECT_HANG,
> + IGB_RING_FLAG_TX_DISABLED,
> + IGB_RING_FLAG_AF_XDP_ZC
> };
>
> #define ring_uses_large_buffer(ring) \
> @@ -822,4 +827,11 @@ int igb_add_mac_steering_filter(struct igb_adapter *adapter,
> int igb_del_mac_steering_filter(struct igb_adapter *adapter,
> const u8 *addr, u8 queue, u8 flags);
>
> +struct xsk_buff_pool *igb_xsk_pool(struct igb_adapter *adapter,
> + struct igb_ring *ring);
> +int igb_xsk_pool_setup(struct igb_adapter *adapter,
> + struct xsk_buff_pool *pool,
> + u16 qid);
> +int igb_xsk_wakeup(struct net_device *dev, u32 qid, u32 flags);
> +
> #endif /* _IGB_H_ */
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> index bdb7637559b8..b6f23bbeff71 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -2904,9 +2904,14 @@ static int igb_xdp_setup(struct net_device *dev, struct netdev_bpf *bpf)
>
> static int igb_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> {
> + struct igb_adapter *adapter = netdev_priv(dev);
> +
> switch (xdp->command) {
> case XDP_SETUP_PROG:
> return igb_xdp_setup(dev, xdp);
> + case XDP_SETUP_XSK_POOL:
> + return igb_xsk_pool_setup(adapter, xdp->xsk.pool,
> + xdp->xsk.queue_id);
> default:
> return -EINVAL;
> }
> @@ -3033,6 +3038,7 @@ static const struct net_device_ops igb_netdev_ops = {
> .ndo_setup_tc = igb_setup_tc,
> .ndo_bpf = igb_xdp,
> .ndo_xdp_xmit = igb_xdp_xmit,
> + .ndo_xsk_wakeup = igb_xsk_wakeup,
> };
>
> /**
> @@ -4355,6 +4361,8 @@ void igb_configure_tx_ring(struct igb_adapter *adapter,
> u64 tdba = ring->dma;
> int reg_idx = ring->reg_idx;
>
> + ring->xsk_pool = igb_xsk_pool(adapter, ring);
use WRITE_ONCE()
> +
> wr32(E1000_TDLEN(reg_idx),
> ring->count * sizeof(union e1000_adv_tx_desc));
> wr32(E1000_TDBAL(reg_idx),
> @@ -4750,6 +4758,7 @@ void igb_configure_rx_ring(struct igb_adapter *adapter,
> xdp_rxq_info_unreg_mem_model(&ring->xdp_rxq);
> WARN_ON(xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
> MEM_TYPE_PAGE_SHARED, NULL));
> + ring->xsk_pool = igb_xsk_pool(adapter, ring);
ditto
I was recently addressing issues around xsk in ice, see:
[0]: https://lore.kernel.org/netdev/172239123450.15322.12860347838208396251.git-patchwork-notify@kernel.org/
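i.e. pair the writer with READ_ONCE() on the reader side, roughly (untested
sketch):

	/* writer, e.g. in igb_configure_tx_ring()/igb_configure_rx_ring() */
	WRITE_ONCE(ring->xsk_pool, igb_xsk_pool(adapter, ring));

	/* readers, e.g. in the hot path or the wakeup path */
	struct xsk_buff_pool *xsk_pool = READ_ONCE(ring->xsk_pool);

	if (!xsk_pool)
		return -EINVAL;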
>
> /* disable the queue */
> wr32(E1000_RXDCTL(reg_idx), 0);
> diff --git a/drivers/net/ethernet/intel/igb/igb_xsk.c b/drivers/net/ethernet/intel/igb/igb_xsk.c
> new file mode 100644
> index 000000000000..925bf97f7caa
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/igb/igb_xsk.c
> @@ -0,0 +1,210 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2018 Intel Corporation. */
> +
> +#include <linux/bpf_trace.h>
> +#include <net/xdp_sock_drv.h>
> +#include <net/xdp.h>
> +
> +#include "e1000_hw.h"
> +#include "igb.h"
> +
> +static int igb_realloc_rx_buffer_info(struct igb_ring *ring, bool pool_present)
> +{
> + int size = pool_present ?
> + sizeof(*ring->rx_buffer_info_zc) * ring->count :
> + sizeof(*ring->rx_buffer_info) * ring->count;
> + void *buff_info = vmalloc(size);
You need to take into account the rx_buffer_info_zc in the memset in
igb_configure_rx_ring(). Also why vmalloc?
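For reference, the Rx zero-copy patch later in this series adjusts that
memset along these lines:

	if (ring->xsk_pool)
		memset(ring->rx_buffer_info_zc, 0,
		       sizeof(*ring->rx_buffer_info_zc) * ring->count);
	else
		memset(ring->rx_buffer_info, 0,
		       sizeof(*ring->rx_buffer_info) * ring->count);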
> +
> + if (!buff_info)
> + return -ENOMEM;
> +
> + if (pool_present) {
> + vfree(ring->rx_buffer_info);
> + ring->rx_buffer_info = NULL;
> + ring->rx_buffer_info_zc = buff_info;
> + } else {
> + vfree(ring->rx_buffer_info_zc);
> + ring->rx_buffer_info_zc = NULL;
> + ring->rx_buffer_info = buff_info;
> + }
> +
> + return 0;
> +}
> +
> +static void igb_txrx_ring_disable(struct igb_adapter *adapter, u16 qid)
> +{
> + struct igb_ring *tx_ring = adapter->tx_ring[qid];
> + struct igb_ring *rx_ring = adapter->rx_ring[qid];
> + struct e1000_hw *hw = &adapter->hw;
> +
> + set_bit(IGB_RING_FLAG_TX_DISABLED, &tx_ring->flags);
> +
> + wr32(E1000_TXDCTL(tx_ring->reg_idx), 0);
> + wr32(E1000_RXDCTL(rx_ring->reg_idx), 0);
> +
synchronize_net() to let the napi finish its current job?
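i.e. roughly (untested sketch):

	wr32(E1000_TXDCTL(tx_ring->reg_idx), 0);
	wr32(E1000_RXDCTL(rx_ring->reg_idx), 0);

	/* let any in-flight napi poll observe the disabled queues */
	synchronize_net();

	/* Rx/Tx share the same napi context. */
	napi_disable(&rx_ring->q_vector->napi);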
> + /* Rx/Tx share the same napi context. */
> + napi_disable(&rx_ring->q_vector->napi);
> +
> + igb_clean_tx_ring(tx_ring);
> + igb_clean_rx_ring(rx_ring);
> +
> + memset(&rx_ring->rx_stats, 0, sizeof(rx_ring->rx_stats));
> + memset(&tx_ring->tx_stats, 0, sizeof(tx_ring->tx_stats));
> +}
> +
> +static void igb_txrx_ring_enable(struct igb_adapter *adapter, u16 qid)
> +{
> + struct igb_ring *tx_ring = adapter->tx_ring[qid];
> + struct igb_ring *rx_ring = adapter->rx_ring[qid];
> +
> + igb_configure_tx_ring(adapter, tx_ring);
> + igb_configure_rx_ring(adapter, rx_ring);
> +
synchronize_net() after updating xsk_pool ptrs
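i.e. something like (untested sketch):

	igb_configure_tx_ring(adapter, tx_ring);
	igb_configure_rx_ring(adapter, rx_ring);

	/* make the updated ring->xsk_pool pointers visible to readers */
	synchronize_net();

	clear_bit(IGB_RING_FLAG_TX_DISABLED, &tx_ring->flags);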
> + clear_bit(IGB_RING_FLAG_TX_DISABLED, &tx_ring->flags);
> +
> + /* call igb_desc_unused which always leaves
> + * at least 1 descriptor unused to make sure
> + * next_to_use != next_to_clean
> + */
> + igb_alloc_rx_buffers(rx_ring, igb_desc_unused(rx_ring));
> +
> + /* Rx/Tx share the same napi context. */
> + napi_enable(&rx_ring->q_vector->napi);
> +}
> +
> +struct xsk_buff_pool *igb_xsk_pool(struct igb_adapter *adapter,
> + struct igb_ring *ring)
> +{
> + int qid = ring->queue_index;
> +
> + if (!igb_xdp_is_enabled(adapter) ||
> + !test_bit(IGB_RING_FLAG_AF_XDP_ZC, &ring->flags))
See:
[1]: https://lore.kernel.org/netdev/20240603-net-2024-05-30-intel-net-fixes-v2-3-e3563aa89b0c@intel.com/
how to avoid the introduction of IGB_RING_FLAG_AF_XDP_ZC altogether.
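i.e., following the ice approach from [1], the pool lookup itself can carry
that state, roughly (untested sketch):

	struct xsk_buff_pool *igb_xsk_pool(struct igb_adapter *adapter,
					   struct igb_ring *ring)
	{
		struct xsk_buff_pool *pool;

		pool = xsk_get_pool_from_qid(adapter->netdev, ring->queue_index);

		if (!igb_xdp_is_enabled(adapter) || !pool)
			return NULL;

		return pool;
	}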
> + return NULL;
> +
> + return xsk_get_pool_from_qid(adapter->netdev, qid);
> +}
> +
> +static int igb_xsk_pool_enable(struct igb_adapter *adapter,
> + struct xsk_buff_pool *pool,
> + u16 qid)
> +{
> + struct net_device *netdev = adapter->netdev;
> + struct igb_ring *tx_ring, *rx_ring;
> + bool if_running;
> + int err;
> +
> + if (qid >= adapter->num_rx_queues)
> + return -EINVAL;
> +
> + if (qid >= netdev->real_num_rx_queues ||
> + qid >= netdev->real_num_tx_queues)
> + return -EINVAL;
> +
> + err = xsk_pool_dma_map(pool, &adapter->pdev->dev, IGB_RX_DMA_ATTR);
> + if (err)
> + return err;
> +
> + tx_ring = adapter->tx_ring[qid];
> + rx_ring = adapter->rx_ring[qid];
> + if_running = netif_running(adapter->netdev) && igb_xdp_is_enabled(adapter);
> + if (if_running)
> + igb_txrx_ring_disable(adapter, qid);
> +
> + set_bit(IGB_RING_FLAG_AF_XDP_ZC, &tx_ring->flags);
> + set_bit(IGB_RING_FLAG_AF_XDP_ZC, &rx_ring->flags);
> +
> + if (if_running) {
> + err = igb_realloc_rx_buffer_info(rx_ring, true);
> + if (!err) {
> + igb_txrx_ring_enable(adapter, qid);
> + /* Kick start the NAPI context so that receiving will start */
> + err = igb_xsk_wakeup(adapter->netdev, qid, XDP_WAKEUP_RX);
> + }
> +
> + if (err) {
> + clear_bit(IGB_RING_FLAG_AF_XDP_ZC, &tx_ring->flags);
> + clear_bit(IGB_RING_FLAG_AF_XDP_ZC, &rx_ring->flags);
> + xsk_pool_dma_unmap(pool, IGB_RX_DMA_ATTR);
> + return err;
> + }
> + }
> +
> + return 0;
> +}
> +
> +static int igb_xsk_pool_disable(struct igb_adapter *adapter, u16 qid)
> +{
> + struct igb_ring *tx_ring, *rx_ring;
> + struct xsk_buff_pool *pool;
> + bool if_running;
> + int err;
> +
> + pool = xsk_get_pool_from_qid(adapter->netdev, qid);
> + if (!pool)
> + return -EINVAL;
> +
> + tx_ring = adapter->tx_ring[qid];
> + rx_ring = adapter->rx_ring[qid];
> + if_running = netif_running(adapter->netdev) && igb_xdp_is_enabled(adapter);
> + if (if_running)
> + igb_txrx_ring_disable(adapter, qid);
> +
> + xsk_pool_dma_unmap(pool, IGB_RX_DMA_ATTR);
> + clear_bit(IGB_RING_FLAG_AF_XDP_ZC, &tx_ring->flags);
> + clear_bit(IGB_RING_FLAG_AF_XDP_ZC, &rx_ring->flags);
> +
> + if (if_running) {
> + err = igb_realloc_rx_buffer_info(rx_ring, false);
> + if (err)
> + return err;
> +
> + igb_txrx_ring_enable(adapter, qid);
> + }
> +
> + return 0;
> +}
> +
> +int igb_xsk_pool_setup(struct igb_adapter *adapter,
> + struct xsk_buff_pool *pool,
> + u16 qid)
> +{
> + return pool ? igb_xsk_pool_enable(adapter, pool, qid) :
> + igb_xsk_pool_disable(adapter, qid);
> +}
> +
> +int igb_xsk_wakeup(struct net_device *dev, u32 qid, u32 flags)
> +{
> + struct igb_adapter *adapter = netdev_priv(dev);
> + struct e1000_hw *hw = &adapter->hw;
> + struct igb_ring *ring;
> + u32 eics = 0;
> +
> + if (test_bit(__IGB_DOWN, &adapter->state))
> + return -ENETDOWN;
> +
> + if (!igb_xdp_is_enabled(adapter))
> + return -EINVAL;
> +
> + if (qid >= adapter->num_tx_queues)
> + return -EINVAL;
> +
> + ring = adapter->tx_ring[qid];
> +
> + if (test_bit(IGB_RING_FLAG_TX_DISABLED, &ring->flags))
> + return -ENETDOWN;
> +
> + if (!ring->xsk_pool)
READ_ONCE()
Also, please test this patchset against a scenario where you do Tx ZC from
every queue available and toggle the interface down and up. We had a nasty
case, fixed by [0], where we were producing Tx descriptors to the wire when
the interface was either already going down or not brought up yet.
> + return -EINVAL;
> +
> + if (!napi_if_scheduled_mark_missed(&ring->q_vector->napi)) {
> + /* Cause software interrupt to ensure Rx ring is cleaned */
> + if (adapter->flags & IGB_FLAG_HAS_MSIX) {
> + eics |= ring->q_vector->eims_value;
> + wr32(E1000_EICS, eics);
> + } else {
> + wr32(E1000_ICS, E1000_ICS_RXDMT0);
> + }
> + }
> +
> + return 0;
> +}
> --
> 2.42.0
>
* [PATCH net-next 3/4] igb: add AF_XDP zero-copy Rx support
2024-08-08 18:35 [PATCH net-next 0/4][pull request] igb: Add support for AF_XDP zero-copy Tony Nguyen
2024-08-08 18:35 ` [PATCH net-next 1/4] igb: prepare for AF_XDP zero-copy support Tony Nguyen
2024-08-08 18:35 ` [PATCH net-next 2/4] igb: Introduce XSK data structures and helpers Tony Nguyen
@ 2024-08-08 18:35 ` Tony Nguyen
2024-08-10 13:55 ` Maciej Fijalkowski
2024-08-08 18:35 ` [PATCH net-next 4/4] igb: add AF_XDP zero-copy Tx support Tony Nguyen
3 siblings, 1 reply; 20+ messages in thread
From: Tony Nguyen @ 2024-08-08 18:35 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, netdev
Cc: Sriram Yagnaraman, anthony.l.nguyen, maciej.fijalkowski,
magnus.karlsson, ast, daniel, hawk, john.fastabend, bpf, kurt,
sriram.yagnaraman, richardcochran, benjamin.steinke, bigeasy,
Chandan Kumar Rout
From: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
Add support for AF_XDP zero-copy receive path.
When AF_XDP zero-copy is enabled, the rx buffers are allocated from the
xsk buff pool using igb_alloc_rx_buffers_zc.
Use xsk_pool_get_rx_frame_size to set SRRCTL rx buf size when zero-copy
is enabled.
Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
[Kurt: Port to v6.10 and provide napi_id for xdp_rxq_info_reg()]
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
drivers/net/ethernet/intel/igb/igb.h | 4 +
drivers/net/ethernet/intel/igb/igb_main.c | 95 ++++++--
drivers/net/ethernet/intel/igb/igb_xsk.c | 261 +++++++++++++++++++++-
3 files changed, 337 insertions(+), 23 deletions(-)
diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index 053130c01480..4983a6ec718e 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -87,6 +87,7 @@ struct igb_adapter;
#define IGB_XDP_CONSUMED BIT(0)
#define IGB_XDP_TX BIT(1)
#define IGB_XDP_REDIR BIT(2)
+#define IGB_XDP_EXIT BIT(3)
struct vf_data_storage {
unsigned char vf_mac_addresses[ETH_ALEN];
@@ -832,6 +833,9 @@ struct xsk_buff_pool *igb_xsk_pool(struct igb_adapter *adapter,
int igb_xsk_pool_setup(struct igb_adapter *adapter,
struct xsk_buff_pool *pool,
u16 qid);
+bool igb_alloc_rx_buffers_zc(struct igb_ring *rx_ring, u16 count);
+void igb_clean_rx_ring_zc(struct igb_ring *rx_ring);
+int igb_clean_rx_irq_zc(struct igb_q_vector *q_vector, const int budget);
int igb_xsk_wakeup(struct net_device *dev, u32 qid, u32 flags);
#endif /* _IGB_H_ */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index b6f23bbeff71..0b779b2ca9ea 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -472,12 +472,17 @@ static void igb_dump(struct igb_adapter *adapter)
for (i = 0; i < rx_ring->count; i++) {
const char *next_desc;
- struct igb_rx_buffer *buffer_info;
- buffer_info = &rx_ring->rx_buffer_info[i];
+ dma_addr_t dma = (dma_addr_t)0;
+ struct igb_rx_buffer *buffer_info = NULL;
rx_desc = IGB_RX_DESC(rx_ring, i);
u0 = (struct my_u0 *)rx_desc;
staterr = le32_to_cpu(rx_desc->wb.upper.status_error);
+ if (!rx_ring->xsk_pool) {
+ buffer_info = &rx_ring->rx_buffer_info[i];
+ dma = buffer_info->dma;
+ }
+
if (i == rx_ring->next_to_use)
next_desc = " NTU";
else if (i == rx_ring->next_to_clean)
@@ -497,11 +502,11 @@ static void igb_dump(struct igb_adapter *adapter)
"R ", i,
le64_to_cpu(u0->a),
le64_to_cpu(u0->b),
- (u64)buffer_info->dma,
+ (u64)dma,
next_desc);
if (netif_msg_pktdata(adapter) &&
- buffer_info->dma && buffer_info->page) {
+ buffer_info && dma && buffer_info->page) {
print_hex_dump(KERN_INFO, "",
DUMP_PREFIX_ADDRESS,
16, 1,
@@ -1983,7 +1988,10 @@ static void igb_configure(struct igb_adapter *adapter)
*/
for (i = 0; i < adapter->num_rx_queues; i++) {
struct igb_ring *ring = adapter->rx_ring[i];
- igb_alloc_rx_buffers(ring, igb_desc_unused(ring));
+ if (ring->xsk_pool)
+ igb_alloc_rx_buffers_zc(ring, igb_desc_unused(ring));
+ else
+ igb_alloc_rx_buffers(ring, igb_desc_unused(ring));
}
}
@@ -3335,7 +3343,8 @@ static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
netdev->priv_flags |= IFF_SUPP_NOFCS;
netdev->priv_flags |= IFF_UNICAST_FLT;
- netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT;
+ netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT |
+ NETDEV_XDP_ACT_XSK_ZEROCOPY;
/* MTU range: 68 - 9216 */
netdev->min_mtu = ETH_MIN_MTU;
@@ -4423,7 +4432,8 @@ int igb_setup_rx_resources(struct igb_ring *rx_ring)
if (xdp_rxq_info_is_reg(&rx_ring->xdp_rxq))
xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
res = xdp_rxq_info_reg(&rx_ring->xdp_rxq, rx_ring->netdev,
- rx_ring->queue_index, 0);
+ rx_ring->queue_index,
+ rx_ring->q_vector->napi.napi_id);
if (res < 0) {
dev_err(dev, "Failed to register xdp_rxq index %u\n",
rx_ring->queue_index);
@@ -4719,12 +4729,17 @@ void igb_setup_srrctl(struct igb_adapter *adapter, struct igb_ring *ring)
struct e1000_hw *hw = &adapter->hw;
int reg_idx = ring->reg_idx;
u32 srrctl = 0;
+ u32 buf_size;
- srrctl = IGB_RX_HDR_LEN << E1000_SRRCTL_BSIZEHDRSIZE_SHIFT;
- if (ring_uses_large_buffer(ring))
- srrctl |= IGB_RXBUFFER_3072 >> E1000_SRRCTL_BSIZEPKT_SHIFT;
+ if (ring->xsk_pool)
+ buf_size = xsk_pool_get_rx_frame_size(ring->xsk_pool);
+ else if (ring_uses_large_buffer(ring))
+ buf_size = IGB_RXBUFFER_3072;
else
- srrctl |= IGB_RXBUFFER_2048 >> E1000_SRRCTL_BSIZEPKT_SHIFT;
+ buf_size = IGB_RXBUFFER_2048;
+
+ srrctl = IGB_RX_HDR_LEN << E1000_SRRCTL_BSIZEHDRSIZE_SHIFT;
+ srrctl |= buf_size >> E1000_SRRCTL_BSIZEPKT_SHIFT;
srrctl |= E1000_SRRCTL_DESCTYPE_ADV_ONEBUF;
if (hw->mac.type >= e1000_82580)
srrctl |= E1000_SRRCTL_TIMESTAMP;
@@ -4756,9 +4771,17 @@ void igb_configure_rx_ring(struct igb_adapter *adapter,
u32 rxdctl = 0;
xdp_rxq_info_unreg_mem_model(&ring->xdp_rxq);
- WARN_ON(xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
- MEM_TYPE_PAGE_SHARED, NULL));
ring->xsk_pool = igb_xsk_pool(adapter, ring);
+ if (ring->xsk_pool) {
+ WARN_ON(xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
+ MEM_TYPE_XSK_BUFF_POOL,
+ NULL));
+ xsk_pool_set_rxq_info(ring->xsk_pool, &ring->xdp_rxq);
+ } else {
+ WARN_ON(xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
+ MEM_TYPE_PAGE_SHARED,
+ NULL));
+ }
/* disable the queue */
wr32(E1000_RXDCTL(reg_idx), 0);
@@ -4785,9 +4808,12 @@ void igb_configure_rx_ring(struct igb_adapter *adapter,
rxdctl |= IGB_RX_HTHRESH << 8;
rxdctl |= IGB_RX_WTHRESH << 16;
- /* initialize rx_buffer_info */
- memset(ring->rx_buffer_info, 0,
- sizeof(struct igb_rx_buffer) * ring->count);
+ if (ring->xsk_pool)
+ memset(ring->rx_buffer_info_zc, 0,
+ sizeof(*ring->rx_buffer_info_zc) * ring->count);
+ else
+ memset(ring->rx_buffer_info, 0,
+ sizeof(*ring->rx_buffer_info) * ring->count);
/* initialize Rx descriptor 0 */
rx_desc = IGB_RX_DESC(ring, 0);
@@ -4974,8 +5000,13 @@ void igb_free_rx_resources(struct igb_ring *rx_ring)
rx_ring->xdp_prog = NULL;
xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
- vfree(rx_ring->rx_buffer_info);
- rx_ring->rx_buffer_info = NULL;
+ if (rx_ring->xsk_pool) {
+ vfree(rx_ring->rx_buffer_info_zc);
+ rx_ring->rx_buffer_info_zc = NULL;
+ } else {
+ vfree(rx_ring->rx_buffer_info);
+ rx_ring->rx_buffer_info = NULL;
+ }
/* if not set, then don't free */
if (!rx_ring->desc)
@@ -5013,6 +5044,11 @@ void igb_clean_rx_ring(struct igb_ring *rx_ring)
dev_kfree_skb(rx_ring->skb);
rx_ring->skb = NULL;
+ if (rx_ring->xsk_pool) {
+ igb_clean_rx_ring_zc(rx_ring);
+ goto skip_for_xsk;
+ }
+
/* Free all the Rx ring sk_buffs */
while (i != rx_ring->next_to_alloc) {
struct igb_rx_buffer *buffer_info = &rx_ring->rx_buffer_info[i];
@@ -5040,6 +5076,7 @@ void igb_clean_rx_ring(struct igb_ring *rx_ring)
i = 0;
}
+skip_for_xsk:
rx_ring->next_to_alloc = 0;
rx_ring->next_to_clean = 0;
rx_ring->next_to_use = 0;
@@ -8195,7 +8232,9 @@ static int igb_poll(struct napi_struct *napi, int budget)
clean_complete = igb_clean_tx_irq(q_vector, budget);
if (q_vector->rx.ring) {
- int cleaned = igb_clean_rx_irq(q_vector, budget);
+ int cleaned = q_vector->rx.ring->xsk_pool ?
+ igb_clean_rx_irq_zc(q_vector, budget) :
+ igb_clean_rx_irq(q_vector, budget);
work_done += cleaned;
if (cleaned >= budget)
@@ -8603,8 +8642,15 @@ struct sk_buff *igb_run_xdp(struct igb_adapter *adapter,
break;
case XDP_REDIRECT:
err = xdp_do_redirect(adapter->netdev, xdp, xdp_prog);
- if (err)
+ if (err) {
+ if (rx_ring->xsk_pool &&
+ xsk_uses_need_wakeup(rx_ring->xsk_pool) &&
+ err == -ENOBUFS)
+ result = IGB_XDP_EXIT;
+ else
+ result = IGB_XDP_CONSUMED;
goto out_failure;
+ }
result = IGB_XDP_REDIR;
break;
default:
@@ -8861,12 +8907,14 @@ static void igb_put_rx_buffer(struct igb_ring *rx_ring,
static int igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget)
{
+ unsigned int total_bytes = 0, total_packets = 0;
struct igb_adapter *adapter = q_vector->adapter;
struct igb_ring *rx_ring = q_vector->rx.ring;
- struct sk_buff *skb = rx_ring->skb;
- unsigned int total_bytes = 0, total_packets = 0;
u16 cleaned_count = igb_desc_unused(rx_ring);
+ struct sk_buff *skb = rx_ring->skb;
+ int cpu = smp_processor_id();
unsigned int xdp_xmit = 0;
+ struct netdev_queue *nq;
struct xdp_buff xdp;
u32 frame_sz = 0;
int rx_buf_pgcnt;
@@ -8994,7 +9042,10 @@ static int igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget)
if (xdp_xmit & IGB_XDP_TX) {
struct igb_ring *tx_ring = igb_xdp_tx_queue_mapping(adapter);
+ nq = txring_txq(tx_ring);
+ __netif_tx_lock(nq, cpu);
igb_xdp_ring_update_tail(tx_ring);
+ __netif_tx_unlock(nq);
}
u64_stats_update_begin(&rx_ring->rx_syncp);
diff --git a/drivers/net/ethernet/intel/igb/igb_xsk.c b/drivers/net/ethernet/intel/igb/igb_xsk.c
index 925bf97f7caa..66cdc30e9b6e 100644
--- a/drivers/net/ethernet/intel/igb/igb_xsk.c
+++ b/drivers/net/ethernet/intel/igb/igb_xsk.c
@@ -66,7 +66,10 @@ static void igb_txrx_ring_enable(struct igb_adapter *adapter, u16 qid)
* at least 1 descriptor unused to make sure
* next_to_use != next_to_clean
*/
- igb_alloc_rx_buffers(rx_ring, igb_desc_unused(rx_ring));
+ if (rx_ring->xsk_pool)
+ igb_alloc_rx_buffers_zc(rx_ring, igb_desc_unused(rx_ring));
+ else
+ igb_alloc_rx_buffers(rx_ring, igb_desc_unused(rx_ring));
/* Rx/Tx share the same napi context. */
napi_enable(&rx_ring->q_vector->napi);
@@ -172,6 +175,262 @@ int igb_xsk_pool_setup(struct igb_adapter *adapter,
igb_xsk_pool_disable(adapter, qid);
}
+static u16 igb_fill_rx_descs(struct xsk_buff_pool *pool, struct xdp_buff **xdp,
+ union e1000_adv_rx_desc *rx_desc, u16 count)
+{
+ dma_addr_t dma;
+ u16 buffs;
+ int i;
+
+ /* nothing to do */
+ if (!count)
+ return 0;
+
+ buffs = xsk_buff_alloc_batch(pool, xdp, count);
+ for (i = 0; i < buffs; i++) {
+ dma = xsk_buff_xdp_get_dma(*xdp);
+ rx_desc->read.pkt_addr = cpu_to_le64(dma);
+ rx_desc->wb.upper.length = 0;
+
+ rx_desc++;
+ xdp++;
+ }
+
+ return buffs;
+}
+
+bool igb_alloc_rx_buffers_zc(struct igb_ring *rx_ring, u16 count)
+{
+ u32 nb_buffs_extra = 0, nb_buffs = 0;
+ union e1000_adv_rx_desc *rx_desc;
+ u16 ntu = rx_ring->next_to_use;
+ u16 total_count = count;
+ struct xdp_buff **xdp;
+
+ rx_desc = IGB_RX_DESC(rx_ring, ntu);
+ xdp = &rx_ring->rx_buffer_info_zc[ntu];
+
+ if (ntu + count >= rx_ring->count) {
+ nb_buffs_extra = igb_fill_rx_descs(rx_ring->xsk_pool, xdp,
+ rx_desc,
+ rx_ring->count - ntu);
+ if (nb_buffs_extra != rx_ring->count - ntu) {
+ ntu += nb_buffs_extra;
+ goto exit;
+ }
+ rx_desc = IGB_RX_DESC(rx_ring, 0);
+ xdp = rx_ring->rx_buffer_info_zc;
+ ntu = 0;
+ count -= nb_buffs_extra;
+ }
+
+ nb_buffs = igb_fill_rx_descs(rx_ring->xsk_pool, xdp, rx_desc, count);
+ ntu += nb_buffs;
+ if (ntu == rx_ring->count)
+ ntu = 0;
+
+ /* clear the length for the next_to_use descriptor */
+ rx_desc = IGB_RX_DESC(rx_ring, ntu);
+ rx_desc->wb.upper.length = 0;
+
+exit:
+ if (rx_ring->next_to_use != ntu) {
+ rx_ring->next_to_use = ntu;
+
+ /* Force memory writes to complete before letting h/w
+ * know there are new descriptors to fetch. (Only
+ * applicable for weak-ordered memory model archs,
+ * such as IA-64).
+ */
+ wmb();
+ writel(ntu, rx_ring->tail);
+ }
+
+ return total_count == (nb_buffs + nb_buffs_extra);
+}
+
+void igb_clean_rx_ring_zc(struct igb_ring *rx_ring)
+{
+ u16 ntc = rx_ring->next_to_clean;
+ u16 ntu = rx_ring->next_to_use;
+
+ while (ntc != ntu) {
+ struct xdp_buff *xdp = rx_ring->rx_buffer_info_zc[ntc];
+
+ xsk_buff_free(xdp);
+ ntc++;
+ if (ntc >= rx_ring->count)
+ ntc = 0;
+ }
+}
+
+static struct sk_buff *igb_construct_skb_zc(struct igb_ring *rx_ring,
+ struct xdp_buff *xdp,
+ ktime_t timestamp)
+{
+ unsigned int totalsize = xdp->data_end - xdp->data_meta;
+ unsigned int metasize = xdp->data - xdp->data_meta;
+ struct sk_buff *skb;
+
+ net_prefetch(xdp->data_meta);
+
+ /* allocate a skb to store the frags */
+ skb = napi_alloc_skb(&rx_ring->q_vector->napi, totalsize);
+ if (unlikely(!skb))
+ return NULL;
+
+ if (timestamp)
+ skb_hwtstamps(skb)->hwtstamp = timestamp;
+
+ memcpy(__skb_put(skb, totalsize), xdp->data_meta,
+ ALIGN(totalsize, sizeof(long)));
+
+ if (metasize) {
+ skb_metadata_set(skb, metasize);
+ __skb_pull(skb, metasize);
+ }
+
+ return skb;
+}
+
+static void igb_update_ntc(struct igb_ring *rx_ring)
+{
+ u32 ntc = rx_ring->next_to_clean + 1;
+
+ /* fetch, update, and store next to clean */
+ ntc = (ntc < rx_ring->count) ? ntc : 0;
+ rx_ring->next_to_clean = ntc;
+
+ prefetch(IGB_RX_DESC(rx_ring, ntc));
+}
+
+int igb_clean_rx_irq_zc(struct igb_q_vector *q_vector, const int budget)
+{
+ struct igb_adapter *adapter = q_vector->adapter;
+ unsigned int total_bytes = 0, total_packets = 0;
+ struct igb_ring *rx_ring = q_vector->rx.ring;
+ int cpu = smp_processor_id();
+ unsigned int xdp_xmit = 0;
+ struct netdev_queue *nq;
+ bool failure = false;
+ u16 entries_to_alloc;
+ struct sk_buff *skb;
+
+ while (likely(total_packets < budget)) {
+ union e1000_adv_rx_desc *rx_desc;
+ struct xdp_buff *xdp;
+ ktime_t timestamp = 0;
+ unsigned int size;
+
+ rx_desc = IGB_RX_DESC(rx_ring, rx_ring->next_to_clean);
+ size = le16_to_cpu(rx_desc->wb.upper.length);
+ if (!size)
+ break;
+
+ /* This memory barrier is needed to keep us from reading
+ * any other fields out of the rx_desc until we know the
+ * descriptor has been written back
+ */
+ dma_rmb();
+
+ xdp = rx_ring->rx_buffer_info_zc[rx_ring->next_to_clean];
+ xsk_buff_set_size(xdp, size);
+ xsk_buff_dma_sync_for_cpu(xdp);
+
+ /* pull rx packet timestamp if available and valid */
+ if (igb_test_staterr(rx_desc, E1000_RXDADV_STAT_TSIP)) {
+ int ts_hdr_len;
+
+ ts_hdr_len = igb_ptp_rx_pktstamp(rx_ring->q_vector,
+ xdp->data,
+ &timestamp);
+
+ xdp->data += ts_hdr_len;
+ xdp->data_meta += ts_hdr_len;
+ size -= ts_hdr_len;
+ }
+
+ skb = igb_run_xdp(adapter, rx_ring, xdp);
+
+ if (IS_ERR(skb)) {
+ unsigned int xdp_res = -PTR_ERR(skb);
+
+ if (likely(xdp_res & (IGB_XDP_TX | IGB_XDP_REDIR))) {
+ xdp_xmit |= xdp_res;
+ } else if (xdp_res == IGB_XDP_EXIT) {
+ failure = true;
+ break;
+ } else if (xdp_res == IGB_XDP_CONSUMED) {
+ xsk_buff_free(xdp);
+ }
+
+ total_packets++;
+ total_bytes += size;
+
+ igb_update_ntc(rx_ring);
+ continue;
+ }
+
+ skb = igb_construct_skb_zc(rx_ring, xdp, timestamp);
+
+ /* exit if we failed to retrieve a buffer */
+ if (!skb) {
+ rx_ring->rx_stats.alloc_failed++;
+ break;
+ }
+
+ xsk_buff_free(xdp);
+ igb_update_ntc(rx_ring);
+
+ if (eth_skb_pad(skb))
+ continue;
+
+ /* probably a little skewed due to removing CRC */
+ total_bytes += skb->len;
+
+ /* populate checksum, timestamp, VLAN, and protocol */
+ igb_process_skb_fields(rx_ring, rx_desc, skb);
+
+ napi_gro_receive(&q_vector->napi, skb);
+
+ /* update budget accounting */
+ total_packets++;
+ }
+
+ if (xdp_xmit & IGB_XDP_REDIR)
+ xdp_do_flush();
+
+ if (xdp_xmit & IGB_XDP_TX) {
+ struct igb_ring *tx_ring = igb_xdp_tx_queue_mapping(adapter);
+
+ nq = txring_txq(tx_ring);
+ __netif_tx_lock(nq, cpu);
+ igb_xdp_ring_update_tail(tx_ring);
+ __netif_tx_unlock(nq);
+ }
+
+ u64_stats_update_begin(&rx_ring->rx_syncp);
+ rx_ring->rx_stats.packets += total_packets;
+ rx_ring->rx_stats.bytes += total_bytes;
+ u64_stats_update_end(&rx_ring->rx_syncp);
+ q_vector->rx.total_packets += total_packets;
+ q_vector->rx.total_bytes += total_bytes;
+
+ entries_to_alloc = igb_desc_unused(rx_ring);
+ if (entries_to_alloc >= IGB_RX_BUFFER_WRITE)
+ failure |= !igb_alloc_rx_buffers_zc(rx_ring, entries_to_alloc);
+
+ if (xsk_uses_need_wakeup(rx_ring->xsk_pool)) {
+ if (failure || rx_ring->next_to_clean == rx_ring->next_to_use)
+ xsk_set_rx_need_wakeup(rx_ring->xsk_pool);
+ else
+ xsk_clear_rx_need_wakeup(rx_ring->xsk_pool);
+
+ return (int)total_packets;
+ }
+ return failure ? budget : (int)total_packets;
+}
+
int igb_xsk_wakeup(struct net_device *dev, u32 qid, u32 flags)
{
struct igb_adapter *adapter = netdev_priv(dev);
--
2.42.0
* Re: [PATCH net-next 3/4] igb: add AF_XDP zero-copy Rx support
2024-08-08 18:35 ` [PATCH net-next 3/4] igb: add AF_XDP zero-copy Rx support Tony Nguyen
@ 2024-08-10 13:55 ` Maciej Fijalkowski
2024-08-10 14:12 ` Maciej Fijalkowski
2024-08-14 8:29 ` Kurt Kanzenbach
0 siblings, 2 replies; 20+ messages in thread
From: Maciej Fijalkowski @ 2024-08-10 13:55 UTC (permalink / raw)
To: Tony Nguyen
Cc: davem, kuba, pabeni, edumazet, netdev, Sriram Yagnaraman,
magnus.karlsson, ast, daniel, hawk, john.fastabend, bpf, kurt,
sriram.yagnaraman, richardcochran, benjamin.steinke, bigeasy,
Chandan Kumar Rout
On Thu, Aug 08, 2024 at 11:35:53AM -0700, Tony Nguyen wrote:
> From: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
>
> Add support for AF_XDP zero-copy receive path.
>
> When AF_XDP zero-copy is enabled, the rx buffers are allocated from the
> xsk buff pool using igb_alloc_rx_buffers_zc.
>
> Use xsk_pool_get_rx_frame_size to set SRRCTL rx buf size when zero-copy
> is enabled.
>
> Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
> [Kurt: Port to v6.10 and provide napi_id for xdp_rxq_info_reg()]
> Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
> Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
> ---
> drivers/net/ethernet/intel/igb/igb.h | 4 +
> drivers/net/ethernet/intel/igb/igb_main.c | 95 ++++++--
> drivers/net/ethernet/intel/igb/igb_xsk.c | 261 +++++++++++++++++++++-
> 3 files changed, 337 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
> index 053130c01480..4983a6ec718e 100644
> --- a/drivers/net/ethernet/intel/igb/igb.h
> +++ b/drivers/net/ethernet/intel/igb/igb.h
> @@ -87,6 +87,7 @@ struct igb_adapter;
> #define IGB_XDP_CONSUMED BIT(0)
> #define IGB_XDP_TX BIT(1)
> #define IGB_XDP_REDIR BIT(2)
> +#define IGB_XDP_EXIT BIT(3)
>
> struct vf_data_storage {
> unsigned char vf_mac_addresses[ETH_ALEN];
> @@ -832,6 +833,9 @@ struct xsk_buff_pool *igb_xsk_pool(struct igb_adapter *adapter,
> int igb_xsk_pool_setup(struct igb_adapter *adapter,
> struct xsk_buff_pool *pool,
> u16 qid);
> +bool igb_alloc_rx_buffers_zc(struct igb_ring *rx_ring, u16 count);
> +void igb_clean_rx_ring_zc(struct igb_ring *rx_ring);
> +int igb_clean_rx_irq_zc(struct igb_q_vector *q_vector, const int budget);
> int igb_xsk_wakeup(struct net_device *dev, u32 qid, u32 flags);
>
> #endif /* _IGB_H_ */
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> index b6f23bbeff71..0b779b2ca9ea 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -472,12 +472,17 @@ static void igb_dump(struct igb_adapter *adapter)
>
> for (i = 0; i < rx_ring->count; i++) {
> const char *next_desc;
> - struct igb_rx_buffer *buffer_info;
> - buffer_info = &rx_ring->rx_buffer_info[i];
> + dma_addr_t dma = (dma_addr_t)0;
> + struct igb_rx_buffer *buffer_info = NULL;
> rx_desc = IGB_RX_DESC(rx_ring, i);
> u0 = (struct my_u0 *)rx_desc;
> staterr = le32_to_cpu(rx_desc->wb.upper.status_error);
>
> + if (!rx_ring->xsk_pool) {
> + buffer_info = &rx_ring->rx_buffer_info[i];
> + dma = buffer_info->dma;
> + }
> +
> if (i == rx_ring->next_to_use)
> next_desc = " NTU";
> else if (i == rx_ring->next_to_clean)
> @@ -497,11 +502,11 @@ static void igb_dump(struct igb_adapter *adapter)
> "R ", i,
> le64_to_cpu(u0->a),
> le64_to_cpu(u0->b),
> - (u64)buffer_info->dma,
> + (u64)dma,
> next_desc);
>
> if (netif_msg_pktdata(adapter) &&
> - buffer_info->dma && buffer_info->page) {
> + buffer_info && dma && buffer_info->page) {
> print_hex_dump(KERN_INFO, "",
> DUMP_PREFIX_ADDRESS,
> 16, 1,
> @@ -1983,7 +1988,10 @@ static void igb_configure(struct igb_adapter *adapter)
> */
> for (i = 0; i < adapter->num_rx_queues; i++) {
> struct igb_ring *ring = adapter->rx_ring[i];
> - igb_alloc_rx_buffers(ring, igb_desc_unused(ring));
> + if (ring->xsk_pool)
> + igb_alloc_rx_buffers_zc(ring, igb_desc_unused(ring));
> + else
> + igb_alloc_rx_buffers(ring, igb_desc_unused(ring));
> }
> }
>
> @@ -3335,7 +3343,8 @@ static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> netdev->priv_flags |= IFF_SUPP_NOFCS;
>
> netdev->priv_flags |= IFF_UNICAST_FLT;
> - netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT;
> + netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT |
> + NETDEV_XDP_ACT_XSK_ZEROCOPY;
>
> /* MTU range: 68 - 9216 */
> netdev->min_mtu = ETH_MIN_MTU;
> @@ -4423,7 +4432,8 @@ int igb_setup_rx_resources(struct igb_ring *rx_ring)
> if (xdp_rxq_info_is_reg(&rx_ring->xdp_rxq))
> xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
> res = xdp_rxq_info_reg(&rx_ring->xdp_rxq, rx_ring->netdev,
> - rx_ring->queue_index, 0);
> + rx_ring->queue_index,
> + rx_ring->q_vector->napi.napi_id);
> if (res < 0) {
> dev_err(dev, "Failed to register xdp_rxq index %u\n",
> rx_ring->queue_index);
> @@ -4719,12 +4729,17 @@ void igb_setup_srrctl(struct igb_adapter *adapter, struct igb_ring *ring)
> struct e1000_hw *hw = &adapter->hw;
> int reg_idx = ring->reg_idx;
> u32 srrctl = 0;
> + u32 buf_size;
>
> - srrctl = IGB_RX_HDR_LEN << E1000_SRRCTL_BSIZEHDRSIZE_SHIFT;
> - if (ring_uses_large_buffer(ring))
> - srrctl |= IGB_RXBUFFER_3072 >> E1000_SRRCTL_BSIZEPKT_SHIFT;
> + if (ring->xsk_pool)
> + buf_size = xsk_pool_get_rx_frame_size(ring->xsk_pool);
> + else if (ring_uses_large_buffer(ring))
> + buf_size = IGB_RXBUFFER_3072;
> else
> - srrctl |= IGB_RXBUFFER_2048 >> E1000_SRRCTL_BSIZEPKT_SHIFT;
> + buf_size = IGB_RXBUFFER_2048;
> +
> + srrctl = IGB_RX_HDR_LEN << E1000_SRRCTL_BSIZEHDRSIZE_SHIFT;
> + srrctl |= buf_size >> E1000_SRRCTL_BSIZEPKT_SHIFT;
> srrctl |= E1000_SRRCTL_DESCTYPE_ADV_ONEBUF;
> if (hw->mac.type >= e1000_82580)
> srrctl |= E1000_SRRCTL_TIMESTAMP;
> @@ -4756,9 +4771,17 @@ void igb_configure_rx_ring(struct igb_adapter *adapter,
> u32 rxdctl = 0;
>
> xdp_rxq_info_unreg_mem_model(&ring->xdp_rxq);
> - WARN_ON(xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
> - MEM_TYPE_PAGE_SHARED, NULL));
> ring->xsk_pool = igb_xsk_pool(adapter, ring);
> + if (ring->xsk_pool) {
> + WARN_ON(xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
> + MEM_TYPE_XSK_BUFF_POOL,
> + NULL));
> + xsk_pool_set_rxq_info(ring->xsk_pool, &ring->xdp_rxq);
> + } else {
> + WARN_ON(xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
> + MEM_TYPE_PAGE_SHARED,
> + NULL));
> + }
>
> /* disable the queue */
> wr32(E1000_RXDCTL(reg_idx), 0);
> @@ -4785,9 +4808,12 @@ void igb_configure_rx_ring(struct igb_adapter *adapter,
> rxdctl |= IGB_RX_HTHRESH << 8;
> rxdctl |= IGB_RX_WTHRESH << 16;
>
> - /* initialize rx_buffer_info */
> - memset(ring->rx_buffer_info, 0,
> - sizeof(struct igb_rx_buffer) * ring->count);
> + if (ring->xsk_pool)
> + memset(ring->rx_buffer_info_zc, 0,
> + sizeof(*ring->rx_buffer_info_zc) * ring->count);
> + else
> + memset(ring->rx_buffer_info, 0,
> + sizeof(*ring->rx_buffer_info) * ring->count);
Ah okay, the comment from the previous patch is addressed here, thanks.
>
> /* initialize Rx descriptor 0 */
> rx_desc = IGB_RX_DESC(ring, 0);
> @@ -4974,8 +5000,13 @@ void igb_free_rx_resources(struct igb_ring *rx_ring)
>
> rx_ring->xdp_prog = NULL;
> xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
> - vfree(rx_ring->rx_buffer_info);
> - rx_ring->rx_buffer_info = NULL;
> + if (rx_ring->xsk_pool) {
> + vfree(rx_ring->rx_buffer_info_zc);
> + rx_ring->rx_buffer_info_zc = NULL;
> + } else {
> + vfree(rx_ring->rx_buffer_info);
> + rx_ring->rx_buffer_info = NULL;
> + }
>
> /* if not set, then don't free */
> if (!rx_ring->desc)
> @@ -5013,6 +5044,11 @@ void igb_clean_rx_ring(struct igb_ring *rx_ring)
> dev_kfree_skb(rx_ring->skb);
> rx_ring->skb = NULL;
>
> + if (rx_ring->xsk_pool) {
> + igb_clean_rx_ring_zc(rx_ring);
> + goto skip_for_xsk;
> + }
> +
> /* Free all the Rx ring sk_buffs */
> while (i != rx_ring->next_to_alloc) {
> struct igb_rx_buffer *buffer_info = &rx_ring->rx_buffer_info[i];
> @@ -5040,6 +5076,7 @@ void igb_clean_rx_ring(struct igb_ring *rx_ring)
> i = 0;
> }
>
> +skip_for_xsk:
> rx_ring->next_to_alloc = 0;
> rx_ring->next_to_clean = 0;
> rx_ring->next_to_use = 0;
> @@ -8195,7 +8232,9 @@ static int igb_poll(struct napi_struct *napi, int budget)
> clean_complete = igb_clean_tx_irq(q_vector, budget);
>
> if (q_vector->rx.ring) {
> - int cleaned = igb_clean_rx_irq(q_vector, budget);
> + int cleaned = q_vector->rx.ring->xsk_pool ?
best if you would READ_ONCE() the pool here and use it for the rest of the
napi lifetime...
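Roughly something like the sketch below; passing xsk_pool down into the
clean functions is only an illustration of one possible plumbing:

	/* in igb_poll(), sketch only */
	if (q_vector->rx.ring) {
		struct xsk_buff_pool *xsk_pool =
			READ_ONCE(q_vector->rx.ring->xsk_pool);
		int cleaned = xsk_pool ?
			igb_clean_rx_irq_zc(q_vector, xsk_pool, budget) :
			igb_clean_rx_irq(q_vector, budget);

		work_done += cleaned;
		if (cleaned >= budget)
			clean_complete = false;
	}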
> + igb_clean_rx_irq_zc(q_vector, budget) :
> + igb_clean_rx_irq(q_vector, budget);
>
> work_done += cleaned;
> if (cleaned >= budget)
> @@ -8603,8 +8642,15 @@ struct sk_buff *igb_run_xdp(struct igb_adapter *adapter,
> break;
> case XDP_REDIRECT:
> err = xdp_do_redirect(adapter->netdev, xdp, xdp_prog);
We were introducing a ZC variant of handling XDP verdict due to a fact
that mostly what happens is the redirect to user space. We observed a
reasonable perf improvement from likely()fying XDP_REDIRECT case.
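For reference, a rough sketch of what such a ZC flavour could look like
(the name igb_run_xdp_zc, the plain int return and the xmit-back call are
illustrative only; the existing igb_run_xdp() encodes the verdict in an
ERR_PTR skb):

	static int igb_run_xdp_zc(struct igb_adapter *adapter,
				  struct igb_ring *rx_ring,
				  struct xdp_buff *xdp)
	{
		struct bpf_prog *xdp_prog = READ_ONCE(rx_ring->xdp_prog);
		int err;
		u32 act;

		act = bpf_prog_run_xdp(xdp_prog, xdp);

		/* on a ZC queue the verdict is almost always a redirect to
		 * the AF_XDP socket, so handle it before the switch
		 */
		if (likely(act == XDP_REDIRECT)) {
			err = xdp_do_redirect(adapter->netdev, xdp, xdp_prog);
			if (!err)
				return IGB_XDP_REDIR;
			if (err == -ENOBUFS &&
			    xsk_uses_need_wakeup(rx_ring->xsk_pool))
				return IGB_XDP_EXIT;
			trace_xdp_exception(rx_ring->netdev, xdp_prog, act);
			return IGB_XDP_CONSUMED;
		}

		switch (act) {
		case XDP_PASS:
			return IGB_XDP_PASS;
		case XDP_TX:
			/* same xmit-back handling as in igb_run_xdp() */
			return igb_xdp_xmit_back(adapter, xdp);
		default:
			bpf_warn_invalid_xdp_action(rx_ring->netdev, xdp_prog, act);
			fallthrough;
		case XDP_ABORTED:
			trace_xdp_exception(rx_ring->netdev, xdp_prog, act);
			fallthrough;
		case XDP_DROP:
			return IGB_XDP_CONSUMED;
		}
	}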
> - if (err)
> + if (err) {
> + if (rx_ring->xsk_pool &&
> + xsk_uses_need_wakeup(rx_ring->xsk_pool) &&
> + err == -ENOBUFS)
> + result = IGB_XDP_EXIT;
> + else
> + result = IGB_XDP_CONSUMED;
> goto out_failure;
> + }
> result = IGB_XDP_REDIR;
> break;
> default:
> @@ -8861,12 +8907,14 @@ static void igb_put_rx_buffer(struct igb_ring *rx_ring,
>
> static int igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget)
> {
> + unsigned int total_bytes = 0, total_packets = 0;
> struct igb_adapter *adapter = q_vector->adapter;
> struct igb_ring *rx_ring = q_vector->rx.ring;
> - struct sk_buff *skb = rx_ring->skb;
> - unsigned int total_bytes = 0, total_packets = 0;
> u16 cleaned_count = igb_desc_unused(rx_ring);
> + struct sk_buff *skb = rx_ring->skb;
> + int cpu = smp_processor_id();
> unsigned int xdp_xmit = 0;
> + struct netdev_queue *nq;
> struct xdp_buff xdp;
> u32 frame_sz = 0;
> int rx_buf_pgcnt;
> @@ -8994,7 +9042,10 @@ static int igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget)
> if (xdp_xmit & IGB_XDP_TX) {
> struct igb_ring *tx_ring = igb_xdp_tx_queue_mapping(adapter);
>
> + nq = txring_txq(tx_ring);
> + __netif_tx_lock(nq, cpu);
That belongs in the patch where you do:
"Always call igb_xdp_ring_update_tail under __netif_tx_lock"
> igb_xdp_ring_update_tail(tx_ring);
> + __netif_tx_unlock(nq);
> }
>
> u64_stats_update_begin(&rx_ring->rx_syncp);
> diff --git a/drivers/net/ethernet/intel/igb/igb_xsk.c b/drivers/net/ethernet/intel/igb/igb_xsk.c
> index 925bf97f7caa..66cdc30e9b6e 100644
> --- a/drivers/net/ethernet/intel/igb/igb_xsk.c
> +++ b/drivers/net/ethernet/intel/igb/igb_xsk.c
> @@ -66,7 +66,10 @@ static void igb_txrx_ring_enable(struct igb_adapter *adapter, u16 qid)
> * at least 1 descriptor unused to make sure
> * next_to_use != next_to_clean
> */
> - igb_alloc_rx_buffers(rx_ring, igb_desc_unused(rx_ring));
> + if (rx_ring->xsk_pool)
> + igb_alloc_rx_buffers_zc(rx_ring, igb_desc_unused(rx_ring));
> + else
> + igb_alloc_rx_buffers(rx_ring, igb_desc_unused(rx_ring));
>
> /* Rx/Tx share the same napi context. */
> napi_enable(&rx_ring->q_vector->napi);
> @@ -172,6 +175,262 @@ int igb_xsk_pool_setup(struct igb_adapter *adapter,
> igb_xsk_pool_disable(adapter, qid);
> }
>
> +static u16 igb_fill_rx_descs(struct xsk_buff_pool *pool, struct xdp_buff **xdp,
> + union e1000_adv_rx_desc *rx_desc, u16 count)
> +{
> + dma_addr_t dma;
> + u16 buffs;
> + int i;
> +
> + /* nothing to do */
> + if (!count)
> + return 0;
> +
> + buffs = xsk_buff_alloc_batch(pool, xdp, count);
> + for (i = 0; i < buffs; i++) {
> + dma = xsk_buff_xdp_get_dma(*xdp);
> + rx_desc->read.pkt_addr = cpu_to_le64(dma);
> + rx_desc->wb.upper.length = 0;
> +
> + rx_desc++;
> + xdp++;
> + }
> +
> + return buffs;
> +}
> +
> +bool igb_alloc_rx_buffers_zc(struct igb_ring *rx_ring, u16 count)
> +{
> + u32 nb_buffs_extra = 0, nb_buffs = 0;
> + union e1000_adv_rx_desc *rx_desc;
> + u16 ntu = rx_ring->next_to_use;
> + u16 total_count = count;
> + struct xdp_buff **xdp;
> +
> + rx_desc = IGB_RX_DESC(rx_ring, ntu);
> + xdp = &rx_ring->rx_buffer_info_zc[ntu];
> +
> + if (ntu + count >= rx_ring->count) {
> + nb_buffs_extra = igb_fill_rx_descs(rx_ring->xsk_pool, xdp,
> + rx_desc,
> + rx_ring->count - ntu);
> + if (nb_buffs_extra != rx_ring->count - ntu) {
> + ntu += nb_buffs_extra;
> + goto exit;
> + }
> + rx_desc = IGB_RX_DESC(rx_ring, 0);
> + xdp = rx_ring->rx_buffer_info_zc;
> + ntu = 0;
> + count -= nb_buffs_extra;
> + }
> +
> + nb_buffs = igb_fill_rx_descs(rx_ring->xsk_pool, xdp, rx_desc, count);
> + ntu += nb_buffs;
> + if (ntu == rx_ring->count)
> + ntu = 0;
> +
> + /* clear the length for the next_to_use descriptor */
> + rx_desc = IGB_RX_DESC(rx_ring, ntu);
> + rx_desc->wb.upper.length = 0;
> +
> +exit:
> + if (rx_ring->next_to_use != ntu) {
> + rx_ring->next_to_use = ntu;
> +
> + /* Force memory writes to complete before letting h/w
> + * know there are new descriptors to fetch. (Only
> + * applicable for weak-ordered memory model archs,
> + * such as IA-64).
> + */
> + wmb();
> + writel(ntu, rx_ring->tail);
> + }
> +
> + return total_count == (nb_buffs + nb_buffs_extra);
> +}
> +
> +void igb_clean_rx_ring_zc(struct igb_ring *rx_ring)
> +{
> + u16 ntc = rx_ring->next_to_clean;
> + u16 ntu = rx_ring->next_to_use;
> +
> + while (ntc != ntu) {
> + struct xdp_buff *xdp = rx_ring->rx_buffer_info_zc[ntc];
> +
> + xsk_buff_free(xdp);
> + ntc++;
> + if (ntc >= rx_ring->count)
> + ntc = 0;
> + }
> +}
> +
> +static struct sk_buff *igb_construct_skb_zc(struct igb_ring *rx_ring,
> + struct xdp_buff *xdp,
> + ktime_t timestamp)
> +{
> + unsigned int totalsize = xdp->data_end - xdp->data_meta;
> + unsigned int metasize = xdp->data - xdp->data_meta;
> + struct sk_buff *skb;
> +
> + net_prefetch(xdp->data_meta);
> +
> + /* allocate a skb to store the frags */
> + skb = napi_alloc_skb(&rx_ring->q_vector->napi, totalsize);
> + if (unlikely(!skb))
> + return NULL;
> +
> + if (timestamp)
> + skb_hwtstamps(skb)->hwtstamp = timestamp;
> +
> + memcpy(__skb_put(skb, totalsize), xdp->data_meta,
> + ALIGN(totalsize, sizeof(long)));
> +
> + if (metasize) {
> + skb_metadata_set(skb, metasize);
> + __skb_pull(skb, metasize);
> + }
> +
> + return skb;
> +}
> +
> +static void igb_update_ntc(struct igb_ring *rx_ring)
> +{
> + u32 ntc = rx_ring->next_to_clean + 1;
> +
> + /* fetch, update, and store next to clean */
> + ntc = (ntc < rx_ring->count) ? ntc : 0;
> + rx_ring->next_to_clean = ntc;
That is suboptimal. You don't have to update the ring's member for each
processed descriptor. What you can do is work on a local ntc and write it
back to the ring at the end of napi's Rx side.
Even if you insist on keeping the current approach, I would expect you to
use this helper in igb_is_non_eop() as well, since this is duplicated code.
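i.e. something along these lines (sketch only, per-buffer processing
elided):

	/* keep next_to_clean in a local for the whole poll */
	u16 ntc = rx_ring->next_to_clean;

	while (likely(total_packets < budget)) {
		union e1000_adv_rx_desc *rx_desc = IGB_RX_DESC(rx_ring, ntc);
		struct xdp_buff *xdp;

		if (!le16_to_cpu(rx_desc->wb.upper.length))
			break;
		dma_rmb();

		xdp = rx_ring->rx_buffer_info_zc[ntc];

		/* ... process the buffer as before ... */

		if (++ntc == rx_ring->count)
			ntc = 0;
		prefetch(IGB_RX_DESC(rx_ring, ntc));
	}

	/* one store per napi poll instead of one per descriptor */
	rx_ring->next_to_clean = ntc;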
> +
> + prefetch(IGB_RX_DESC(rx_ring, ntc));
> +}
> +
> +int igb_clean_rx_irq_zc(struct igb_q_vector *q_vector, const int budget)
> +{
> + struct igb_adapter *adapter = q_vector->adapter;
> + unsigned int total_bytes = 0, total_packets = 0;
> + struct igb_ring *rx_ring = q_vector->rx.ring;
> + int cpu = smp_processor_id();
> + unsigned int xdp_xmit = 0;
> + struct netdev_queue *nq;
> + bool failure = false;
> + u16 entries_to_alloc;
> + struct sk_buff *skb;
> +
> + while (likely(total_packets < budget)) {
> + union e1000_adv_rx_desc *rx_desc;
> + struct xdp_buff *xdp;
> + ktime_t timestamp = 0;
minor: RCT (reverse Christmas tree ordering of the local variable declarations)
> + unsigned int size;
> +
> + rx_desc = IGB_RX_DESC(rx_ring, rx_ring->next_to_clean);
> + size = le16_to_cpu(rx_desc->wb.upper.length);
> + if (!size)
> + break;
> +
> + /* This memory barrier is needed to keep us from reading
> + * any other fields out of the rx_desc until we know the
> + * descriptor has been written back
> + */
> + dma_rmb();
> +
> + xdp = rx_ring->rx_buffer_info_zc[rx_ring->next_to_clean];
> + xsk_buff_set_size(xdp, size);
> + xsk_buff_dma_sync_for_cpu(xdp);
> +
> + /* pull rx packet timestamp if available and valid */
> + if (igb_test_staterr(rx_desc, E1000_RXDADV_STAT_TSIP)) {
> + int ts_hdr_len;
> +
> + ts_hdr_len = igb_ptp_rx_pktstamp(rx_ring->q_vector,
> + xdp->data,
> + &timestamp);
> +
> + xdp->data += ts_hdr_len;
> + xdp->data_meta += ts_hdr_len;
> + size -= ts_hdr_len;
> + }
> +
> + skb = igb_run_xdp(adapter, rx_ring, xdp);
> +
> + if (IS_ERR(skb)) {
> + unsigned int xdp_res = -PTR_ERR(skb);
> +
> + if (likely(xdp_res & (IGB_XDP_TX | IGB_XDP_REDIR))) {
> + xdp_xmit |= xdp_res;
> + } else if (xdp_res == IGB_XDP_EXIT) {
> + failure = true;
> + break;
> + } else if (xdp_res == IGB_XDP_CONSUMED) {
> + xsk_buff_free(xdp);
> + }
> +
> + total_packets++;
> + total_bytes += size;
> +
> + igb_update_ntc(rx_ring);
> + continue;
> + }
> +
> + skb = igb_construct_skb_zc(rx_ring, xdp, timestamp);
> +
> + /* exit if we failed to retrieve a buffer */
> + if (!skb) {
> + rx_ring->rx_stats.alloc_failed++;
> + break;
> + }
> +
> + xsk_buff_free(xdp);
> + igb_update_ntc(rx_ring);
> +
> + if (eth_skb_pad(skb))
> + continue;
> +
> + /* probably a little skewed due to removing CRC */
> + total_bytes += skb->len;
> +
> + /* populate checksum, timestamp, VLAN, and protocol */
> + igb_process_skb_fields(rx_ring, rx_desc, skb);
> +
> + napi_gro_receive(&q_vector->napi, skb);
> +
> + /* update budget accounting */
> + total_packets++;
> + }
> +
> + if (xdp_xmit & IGB_XDP_REDIR)
> + xdp_do_flush();
> +
> + if (xdp_xmit & IGB_XDP_TX) {
> + struct igb_ring *tx_ring = igb_xdp_tx_queue_mapping(adapter);
pull this out into a common function and use it in igb_clean_rx_irq()?
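e.g. a small helper along these lines (name is just a placeholder; the
xdp_do_flush() could be folded in as well), called from both
igb_clean_rx_irq() and igb_clean_rx_irq_zc():

	static void igb_finalize_xdp(struct igb_adapter *adapter,
				     unsigned int xdp_xmit)
	{
		int cpu = smp_processor_id();
		struct netdev_queue *nq;

		if (xdp_xmit & IGB_XDP_REDIR)
			xdp_do_flush();

		if (xdp_xmit & IGB_XDP_TX) {
			struct igb_ring *tx_ring = igb_xdp_tx_queue_mapping(adapter);

			nq = txring_txq(tx_ring);
			__netif_tx_lock(nq, cpu);
			igb_xdp_ring_update_tail(tx_ring);
			__netif_tx_unlock(nq);
		}
	}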
> +
> + nq = txring_txq(tx_ring);
> + __netif_tx_lock(nq, cpu);
> + igb_xdp_ring_update_tail(tx_ring);
> + __netif_tx_unlock(nq);
> + }
> +
> + u64_stats_update_begin(&rx_ring->rx_syncp);
> + rx_ring->rx_stats.packets += total_packets;
> + rx_ring->rx_stats.bytes += total_bytes;
> + u64_stats_update_end(&rx_ring->rx_syncp);
> + q_vector->rx.total_packets += total_packets;
> + q_vector->rx.total_bytes += total_bytes;
... same here :)
> +
> + entries_to_alloc = igb_desc_unused(rx_ring);
> + if (entries_to_alloc >= IGB_RX_BUFFER_WRITE)
> + failure |= !igb_alloc_rx_buffers_zc(rx_ring, entries_to_alloc);
> +
> + if (xsk_uses_need_wakeup(rx_ring->xsk_pool)) {
> + if (failure || rx_ring->next_to_clean == rx_ring->next_to_use)
> + xsk_set_rx_need_wakeup(rx_ring->xsk_pool);
> + else
> + xsk_clear_rx_need_wakeup(rx_ring->xsk_pool);
> +
> + return (int)total_packets;
> + }
> + return failure ? budget : (int)total_packets;
> +}
> +
> int igb_xsk_wakeup(struct net_device *dev, u32 qid, u32 flags)
> {
> struct igb_adapter *adapter = netdev_priv(dev);
> --
> 2.42.0
>
* Re: [PATCH net-next 3/4] igb: add AF_XDP zero-copy Rx support
2024-08-10 13:55 ` Maciej Fijalkowski
@ 2024-08-10 14:12 ` Maciej Fijalkowski
2024-08-14 8:29 ` Kurt Kanzenbach
1 sibling, 0 replies; 20+ messages in thread
From: Maciej Fijalkowski @ 2024-08-10 14:12 UTC (permalink / raw)
To: Tony Nguyen
Cc: davem, kuba, pabeni, edumazet, netdev, Sriram Yagnaraman,
magnus.karlsson, ast, daniel, hawk, john.fastabend, bpf, kurt,
sriram.yagnaraman, richardcochran, benjamin.steinke, bigeasy,
Chandan Kumar Rout
On Sat, Aug 10, 2024 at 03:55:10PM +0200, Maciej Fijalkowski wrote:
> On Thu, Aug 08, 2024 at 11:35:53AM -0700, Tony Nguyen wrote:
> > From: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
> >
> > Add support for AF_XDP zero-copy receive path.
> >
> > When AF_XDP zero-copy is enabled, the rx buffers are allocated from the
> > xsk buff pool using igb_alloc_rx_buffers_zc.
> >
> > Use xsk_pool_get_rx_frame_size to set SRRCTL rx buf size when zero-copy
> > is enabled.
> >
> > Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
> > [Kurt: Port to v6.10 and provide napi_id for xdp_rxq_info_reg()]
> > Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
> > Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
> > Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
> > ---
> > netdev->priv_flags |= IFF_UNICAST_FLT;
> > - netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT;
> > + netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT |
> > + NETDEV_XDP_ACT_XSK_ZEROCOPY;
> >
Also please set this in the last patch, as doing it here opens up a state
where AF_XDP ZC is advertised but there is no Tx ZC support yet.
* Re: [PATCH net-next 3/4] igb: add AF_XDP zero-copy Rx support
2024-08-10 13:55 ` Maciej Fijalkowski
2024-08-10 14:12 ` Maciej Fijalkowski
@ 2024-08-14 8:29 ` Kurt Kanzenbach
2024-08-14 8:55 ` Maciej Fijalkowski
1 sibling, 1 reply; 20+ messages in thread
From: Kurt Kanzenbach @ 2024-08-14 8:29 UTC (permalink / raw)
To: Maciej Fijalkowski, Tony Nguyen
Cc: davem, kuba, pabeni, edumazet, netdev, Sriram Yagnaraman,
magnus.karlsson, ast, daniel, hawk, john.fastabend, bpf,
sriram.yagnaraman, richardcochran, benjamin.steinke, bigeasy,
Chandan Kumar Rout
>> case XDP_REDIRECT:
>> err = xdp_do_redirect(adapter->netdev, xdp, xdp_prog);
>
> We were introducing a ZC variant of handling XDP verdict due to a fact
> that mostly what happens is the redirect to user space. We observed a
> reasonable perf improvement from likely()fying XDP_REDIRECT case.
Indeed, I can observe an improvement too.
I've introduced igb_run_xdp_zc() which takes care of that
optimization. Also it takes a pointer to the (read once) xsk_pool to
address the other comment.
Thanks,
Kurt
* Re: [PATCH net-next 3/4] igb: add AF_XDP zero-copy Rx support
2024-08-14 8:29 ` Kurt Kanzenbach
@ 2024-08-14 8:55 ` Maciej Fijalkowski
0 siblings, 0 replies; 20+ messages in thread
From: Maciej Fijalkowski @ 2024-08-14 8:55 UTC (permalink / raw)
To: Kurt Kanzenbach
Cc: Tony Nguyen, davem, kuba, pabeni, edumazet, netdev,
Sriram Yagnaraman, magnus.karlsson, ast, daniel, hawk,
john.fastabend, bpf, sriram.yagnaraman, richardcochran,
benjamin.steinke, bigeasy, Chandan Kumar Rout
On Wed, Aug 14, 2024 at 10:29:29AM +0200, Kurt Kanzenbach wrote:
> >> case XDP_REDIRECT:
> >> err = xdp_do_redirect(adapter->netdev, xdp, xdp_prog);
> >
> > We were introducing a ZC variant of handling XDP verdict due to a fact
> > that mostly what happens is the redirect to user space. We observed a
> > reasonable perf improvement from likely()fying XDP_REDIRECT case.
>
> Indeed, I can observe an improvement too.
>
> I've introduced igb_run_xdp_zc() which takes care of that
> optimization. Also it takes a pointer to the (read once) xsk_pool to
> address the other comment.
Nice, thanks!
>
> Thanks,
> Kurt
* [PATCH net-next 4/4] igb: add AF_XDP zero-copy Tx support
2024-08-08 18:35 [PATCH net-next 0/4][pull request] igb: Add support for AF_XDP zero-copy Tony Nguyen
` (2 preceding siblings ...)
2024-08-08 18:35 ` [PATCH net-next 3/4] igb: add AF_XDP zero-copy Rx support Tony Nguyen
@ 2024-08-08 18:35 ` Tony Nguyen
2024-08-10 14:10 ` Maciej Fijalkowski
3 siblings, 1 reply; 20+ messages in thread
From: Tony Nguyen @ 2024-08-08 18:35 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, netdev
Cc: Sriram Yagnaraman, anthony.l.nguyen, maciej.fijalkowski,
magnus.karlsson, ast, daniel, hawk, john.fastabend, bpf, kurt,
sriram.yagnaraman, richardcochran, benjamin.steinke, bigeasy,
Chandan Kumar Rout
From: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
Add support for AF_XDP zero-copy transmit path.
A new TX buffer type IGB_TYPE_XSK is introduced to indicate that the Tx
frame was allocated from the xsk buff pool, so igb_clean_tx_ring and
igb_clean_tx_irq can clean the buffers correctly based on type.
igb_xmit_zc performs the actual packet transmit when AF_XDP zero-copy is
enabled. We share the TX ring between slow path, XDP and AF_XDP
zero-copy, so we use the netdev queue lock to ensure mutual exclusion.
Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
[Kurt: Set olinfo_status in igb_xmit_zc() so that frames are transmitted]
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
drivers/net/ethernet/intel/igb/igb.h | 2 +
drivers/net/ethernet/intel/igb/igb_main.c | 56 +++++++++++++++++++----
drivers/net/ethernet/intel/igb/igb_xsk.c | 53 +++++++++++++++++++++
3 files changed, 102 insertions(+), 9 deletions(-)
diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index 4983a6ec718e..9ee18ac1ba47 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -257,6 +257,7 @@ enum igb_tx_flags {
enum igb_tx_buf_type {
IGB_TYPE_SKB = 0,
IGB_TYPE_XDP,
+ IGB_TYPE_XSK
};
/* wrapper around a pointer to a socket buffer,
@@ -836,6 +837,7 @@ int igb_xsk_pool_setup(struct igb_adapter *adapter,
bool igb_alloc_rx_buffers_zc(struct igb_ring *rx_ring, u16 count);
void igb_clean_rx_ring_zc(struct igb_ring *rx_ring);
int igb_clean_rx_irq_zc(struct igb_q_vector *q_vector, const int budget);
+bool igb_xmit_zc(struct igb_ring *tx_ring);
int igb_xsk_wakeup(struct net_device *dev, u32 qid, u32 flags);
#endif /* _IGB_H_ */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 0b779b2ca9ea..1ebd67981978 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2996,6 +2996,9 @@ static int igb_xdp_xmit(struct net_device *dev, int n,
if (unlikely(!tx_ring))
return -ENXIO;
+ if (unlikely(test_bit(IGB_RING_FLAG_TX_DISABLED, &tx_ring->flags)))
+ return -ENXIO;
+
nq = txring_txq(tx_ring);
__netif_tx_lock(nq, cpu);
@@ -4917,15 +4920,20 @@ void igb_clean_tx_ring(struct igb_ring *tx_ring)
{
u16 i = tx_ring->next_to_clean;
struct igb_tx_buffer *tx_buffer = &tx_ring->tx_buffer_info[i];
+ u32 xsk_frames = 0;
while (i != tx_ring->next_to_use) {
union e1000_adv_tx_desc *eop_desc, *tx_desc;
/* Free all the Tx ring sk_buffs or xdp frames */
- if (tx_buffer->type == IGB_TYPE_SKB)
+ if (tx_buffer->type == IGB_TYPE_SKB) {
dev_kfree_skb_any(tx_buffer->skb);
- else
+ } else if (tx_buffer->type == IGB_TYPE_XDP) {
xdp_return_frame(tx_buffer->xdpf);
+ } else if (tx_buffer->type == IGB_TYPE_XSK) {
+ xsk_frames++;
+ goto skip_for_xsk;
+ }
/* unmap skb header data */
dma_unmap_single(tx_ring->dev,
@@ -4956,6 +4964,7 @@ void igb_clean_tx_ring(struct igb_ring *tx_ring)
DMA_TO_DEVICE);
}
+skip_for_xsk:
tx_buffer->next_to_watch = NULL;
/* move us one more past the eop_desc for start of next pkt */
@@ -4970,6 +4979,9 @@ void igb_clean_tx_ring(struct igb_ring *tx_ring)
/* reset BQL for queue */
netdev_tx_reset_queue(txring_txq(tx_ring));
+ if (tx_ring->xsk_pool && xsk_frames)
+ xsk_tx_completed(tx_ring->xsk_pool, xsk_frames);
+
/* reset next_to_use and next_to_clean */
tx_ring->next_to_use = 0;
tx_ring->next_to_clean = 0;
@@ -6503,6 +6515,9 @@ netdev_tx_t igb_xmit_frame_ring(struct sk_buff *skb,
return NETDEV_TX_BUSY;
}
+ if (unlikely(test_bit(IGB_RING_FLAG_TX_DISABLED, &tx_ring->flags)))
+ return NETDEV_TX_BUSY;
+
/* record the location of the first descriptor for this packet */
first = &tx_ring->tx_buffer_info[tx_ring->next_to_use];
first->type = IGB_TYPE_SKB;
@@ -8263,13 +8278,17 @@ static int igb_poll(struct napi_struct *napi, int budget)
**/
static bool igb_clean_tx_irq(struct igb_q_vector *q_vector, int napi_budget)
{
- struct igb_adapter *adapter = q_vector->adapter;
- struct igb_ring *tx_ring = q_vector->tx.ring;
- struct igb_tx_buffer *tx_buffer;
- union e1000_adv_tx_desc *tx_desc;
unsigned int total_bytes = 0, total_packets = 0;
+ struct igb_adapter *adapter = q_vector->adapter;
unsigned int budget = q_vector->tx.work_limit;
+ struct igb_ring *tx_ring = q_vector->tx.ring;
unsigned int i = tx_ring->next_to_clean;
+ union e1000_adv_tx_desc *tx_desc;
+ struct igb_tx_buffer *tx_buffer;
+ int cpu = smp_processor_id();
+ bool xsk_xmit_done = true;
+ struct netdev_queue *nq;
+ u32 xsk_frames = 0;
if (test_bit(__IGB_DOWN, &adapter->state))
return true;
@@ -8300,10 +8319,14 @@ static bool igb_clean_tx_irq(struct igb_q_vector *q_vector, int napi_budget)
total_packets += tx_buffer->gso_segs;
/* free the skb */
- if (tx_buffer->type == IGB_TYPE_SKB)
+ if (tx_buffer->type == IGB_TYPE_SKB) {
napi_consume_skb(tx_buffer->skb, napi_budget);
- else
+ } else if (tx_buffer->type == IGB_TYPE_XDP) {
xdp_return_frame(tx_buffer->xdpf);
+ } else if (tx_buffer->type == IGB_TYPE_XSK) {
+ xsk_frames++;
+ goto skip_for_xsk;
+ }
/* unmap skb header data */
dma_unmap_single(tx_ring->dev,
@@ -8335,6 +8358,7 @@ static bool igb_clean_tx_irq(struct igb_q_vector *q_vector, int napi_budget)
}
}
+skip_for_xsk:
/* move us one more past the eop_desc for start of next pkt */
tx_buffer++;
tx_desc++;
@@ -8363,6 +8387,20 @@ static bool igb_clean_tx_irq(struct igb_q_vector *q_vector, int napi_budget)
q_vector->tx.total_bytes += total_bytes;
q_vector->tx.total_packets += total_packets;
+ if (tx_ring->xsk_pool) {
+ if (xsk_frames)
+ xsk_tx_completed(tx_ring->xsk_pool, xsk_frames);
+ if (xsk_uses_need_wakeup(tx_ring->xsk_pool))
+ xsk_set_tx_need_wakeup(tx_ring->xsk_pool);
+
+ nq = txring_txq(tx_ring);
+ __netif_tx_lock(nq, cpu);
+ /* Avoid transmit queue timeout since we share it with the slow path */
+ txq_trans_cond_update(nq);
+ xsk_xmit_done = igb_xmit_zc(tx_ring);
+ __netif_tx_unlock(nq);
+ }
+
if (test_bit(IGB_RING_FLAG_TX_DETECT_HANG, &tx_ring->flags)) {
struct e1000_hw *hw = &adapter->hw;
@@ -8425,7 +8463,7 @@ static bool igb_clean_tx_irq(struct igb_q_vector *q_vector, int napi_budget)
}
}
- return !!budget;
+ return !!budget && xsk_xmit_done;
}
/**
diff --git a/drivers/net/ethernet/intel/igb/igb_xsk.c b/drivers/net/ethernet/intel/igb/igb_xsk.c
index 66cdc30e9b6e..4e530e1eb3c0 100644
--- a/drivers/net/ethernet/intel/igb/igb_xsk.c
+++ b/drivers/net/ethernet/intel/igb/igb_xsk.c
@@ -431,6 +431,59 @@ int igb_clean_rx_irq_zc(struct igb_q_vector *q_vector, const int budget)
return failure ? budget : (int)total_packets;
}
+bool igb_xmit_zc(struct igb_ring *tx_ring)
+{
+ unsigned int budget = igb_desc_unused(tx_ring);
+ struct xsk_buff_pool *pool = tx_ring->xsk_pool;
+ u32 cmd_type, olinfo_status, nb_pkts, i = 0;
+ struct xdp_desc *descs = pool->tx_descs;
+ union e1000_adv_tx_desc *tx_desc = NULL;
+ struct igb_tx_buffer *tx_buffer_info;
+ unsigned int total_bytes = 0;
+ dma_addr_t dma;
+
+ nb_pkts = xsk_tx_peek_release_desc_batch(pool, budget);
+ if (!nb_pkts)
+ return true;
+
+ while (nb_pkts-- > 0) {
+ dma = xsk_buff_raw_get_dma(pool, descs[i].addr);
+ xsk_buff_raw_dma_sync_for_device(pool, dma, descs[i].len);
+
+ tx_buffer_info = &tx_ring->tx_buffer_info[tx_ring->next_to_use];
+ tx_buffer_info->bytecount = descs[i].len;
+ tx_buffer_info->type = IGB_TYPE_XSK;
+ tx_buffer_info->xdpf = NULL;
+ tx_buffer_info->gso_segs = 1;
+ tx_buffer_info->time_stamp = jiffies;
+
+ tx_desc = IGB_TX_DESC(tx_ring, tx_ring->next_to_use);
+ tx_desc->read.buffer_addr = cpu_to_le64(dma);
+
+ /* put descriptor type bits */
+ cmd_type = E1000_ADVTXD_DTYP_DATA | E1000_ADVTXD_DCMD_DEXT |
+ E1000_ADVTXD_DCMD_IFCS;
+ olinfo_status = descs[i].len << E1000_ADVTXD_PAYLEN_SHIFT;
+
+ cmd_type |= descs[i].len | IGB_TXD_DCMD;
+ tx_desc->read.cmd_type_len = cpu_to_le32(cmd_type);
+ tx_desc->read.olinfo_status = cpu_to_le32(olinfo_status);
+
+ total_bytes += descs[i].len;
+
+ i++;
+ tx_ring->next_to_use++;
+ tx_buffer_info->next_to_watch = tx_desc;
+ if (tx_ring->next_to_use == tx_ring->count)
+ tx_ring->next_to_use = 0;
+ }
+
+ netdev_tx_sent_queue(txring_txq(tx_ring), total_bytes);
+ igb_xdp_ring_update_tail(tx_ring);
+
+ return nb_pkts < budget;
+}
+
int igb_xsk_wakeup(struct net_device *dev, u32 qid, u32 flags)
{
struct igb_adapter *adapter = netdev_priv(dev);
--
2.42.0
* Re: [PATCH net-next 4/4] igb: add AF_XDP zero-copy Tx support
2024-08-08 18:35 ` [PATCH net-next 4/4] igb: add AF_XDP zero-copy Tx support Tony Nguyen
@ 2024-08-10 14:10 ` Maciej Fijalkowski
2024-08-14 8:36 ` Kurt Kanzenbach
0 siblings, 1 reply; 20+ messages in thread
From: Maciej Fijalkowski @ 2024-08-10 14:10 UTC (permalink / raw)
To: Tony Nguyen
Cc: davem, kuba, pabeni, edumazet, netdev, Sriram Yagnaraman,
magnus.karlsson, ast, daniel, hawk, john.fastabend, bpf, kurt,
sriram.yagnaraman, richardcochran, benjamin.steinke, bigeasy,
Chandan Kumar Rout
On Thu, Aug 08, 2024 at 11:35:54AM -0700, Tony Nguyen wrote:
> From: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
>
> Add support for AF_XDP zero-copy transmit path.
>
> A new TX buffer type IGB_TYPE_XSK is introduced to indicate that the Tx
> frame was allocated from the xsk buff pool, so igb_clean_tx_ring and
> igb_clean_tx_irq can clean the buffers correctly based on type.
>
> igb_xmit_zc performs the actual packet transmit when AF_XDP zero-copy is
> enabled. We share the TX ring between slow path, XDP and AF_XDP
> zero-copy, so we use the netdev queue lock to ensure mutual exclusion.
>
> Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
> [Kurt: Set olinfo_status in igb_xmit_zc() so that frames are transmitted]
> Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
> Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
> ---
> drivers/net/ethernet/intel/igb/igb.h | 2 +
> drivers/net/ethernet/intel/igb/igb_main.c | 56 +++++++++++++++++++----
> drivers/net/ethernet/intel/igb/igb_xsk.c | 53 +++++++++++++++++++++
> 3 files changed, 102 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
> index 4983a6ec718e..9ee18ac1ba47 100644
> --- a/drivers/net/ethernet/intel/igb/igb.h
> +++ b/drivers/net/ethernet/intel/igb/igb.h
> @@ -257,6 +257,7 @@ enum igb_tx_flags {
> enum igb_tx_buf_type {
> IGB_TYPE_SKB = 0,
> IGB_TYPE_XDP,
> + IGB_TYPE_XSK
> };
>
> /* wrapper around a pointer to a socket buffer,
> @@ -836,6 +837,7 @@ int igb_xsk_pool_setup(struct igb_adapter *adapter,
> bool igb_alloc_rx_buffers_zc(struct igb_ring *rx_ring, u16 count);
> void igb_clean_rx_ring_zc(struct igb_ring *rx_ring);
> int igb_clean_rx_irq_zc(struct igb_q_vector *q_vector, const int budget);
> +bool igb_xmit_zc(struct igb_ring *tx_ring);
> int igb_xsk_wakeup(struct net_device *dev, u32 qid, u32 flags);
>
> #endif /* _IGB_H_ */
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> index 0b779b2ca9ea..1ebd67981978 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -2996,6 +2996,9 @@ static int igb_xdp_xmit(struct net_device *dev, int n,
> if (unlikely(!tx_ring))
> return -ENXIO;
>
> + if (unlikely(test_bit(IGB_RING_FLAG_TX_DISABLED, &tx_ring->flags)))
> + return -ENXIO;
> +
> nq = txring_txq(tx_ring);
> __netif_tx_lock(nq, cpu);
>
> @@ -4917,15 +4920,20 @@ void igb_clean_tx_ring(struct igb_ring *tx_ring)
> {
> u16 i = tx_ring->next_to_clean;
> struct igb_tx_buffer *tx_buffer = &tx_ring->tx_buffer_info[i];
> + u32 xsk_frames = 0;
>
> while (i != tx_ring->next_to_use) {
> union e1000_adv_tx_desc *eop_desc, *tx_desc;
>
> /* Free all the Tx ring sk_buffs or xdp frames */
> - if (tx_buffer->type == IGB_TYPE_SKB)
> + if (tx_buffer->type == IGB_TYPE_SKB) {
> dev_kfree_skb_any(tx_buffer->skb);
> - else
> + } else if (tx_buffer->type == IGB_TYPE_XDP) {
> xdp_return_frame(tx_buffer->xdpf);
> + } else if (tx_buffer->type == IGB_TYPE_XSK) {
> + xsk_frames++;
> + goto skip_for_xsk;
> + }
>
> /* unmap skb header data */
> dma_unmap_single(tx_ring->dev,
> @@ -4956,6 +4964,7 @@ void igb_clean_tx_ring(struct igb_ring *tx_ring)
> DMA_TO_DEVICE);
> }
>
> +skip_for_xsk:
> tx_buffer->next_to_watch = NULL;
>
> /* move us one more past the eop_desc for start of next pkt */
> @@ -4970,6 +4979,9 @@ void igb_clean_tx_ring(struct igb_ring *tx_ring)
> /* reset BQL for queue */
> netdev_tx_reset_queue(txring_txq(tx_ring));
>
> + if (tx_ring->xsk_pool && xsk_frames)
> + xsk_tx_completed(tx_ring->xsk_pool, xsk_frames);
> +
> /* reset next_to_use and next_to_clean */
> tx_ring->next_to_use = 0;
> tx_ring->next_to_clean = 0;
> @@ -6503,6 +6515,9 @@ netdev_tx_t igb_xmit_frame_ring(struct sk_buff *skb,
> return NETDEV_TX_BUSY;
> }
>
> + if (unlikely(test_bit(IGB_RING_FLAG_TX_DISABLED, &tx_ring->flags)))
> + return NETDEV_TX_BUSY;
> +
> /* record the location of the first descriptor for this packet */
> first = &tx_ring->tx_buffer_info[tx_ring->next_to_use];
> first->type = IGB_TYPE_SKB;
> @@ -8263,13 +8278,17 @@ static int igb_poll(struct napi_struct *napi, int budget)
> **/
> static bool igb_clean_tx_irq(struct igb_q_vector *q_vector, int napi_budget)
> {
> - struct igb_adapter *adapter = q_vector->adapter;
> - struct igb_ring *tx_ring = q_vector->tx.ring;
> - struct igb_tx_buffer *tx_buffer;
> - union e1000_adv_tx_desc *tx_desc;
> unsigned int total_bytes = 0, total_packets = 0;
> + struct igb_adapter *adapter = q_vector->adapter;
> unsigned int budget = q_vector->tx.work_limit;
> + struct igb_ring *tx_ring = q_vector->tx.ring;
> unsigned int i = tx_ring->next_to_clean;
> + union e1000_adv_tx_desc *tx_desc;
> + struct igb_tx_buffer *tx_buffer;
> + int cpu = smp_processor_id();
> + bool xsk_xmit_done = true;
> + struct netdev_queue *nq;
> + u32 xsk_frames = 0;
>
> if (test_bit(__IGB_DOWN, &adapter->state))
> return true;
> @@ -8300,10 +8319,14 @@ static bool igb_clean_tx_irq(struct igb_q_vector *q_vector, int napi_budget)
> total_packets += tx_buffer->gso_segs;
>
> /* free the skb */
> - if (tx_buffer->type == IGB_TYPE_SKB)
> + if (tx_buffer->type == IGB_TYPE_SKB) {
> napi_consume_skb(tx_buffer->skb, napi_budget);
> - else
> + } else if (tx_buffer->type == IGB_TYPE_XDP) {
> xdp_return_frame(tx_buffer->xdpf);
> + } else if (tx_buffer->type == IGB_TYPE_XSK) {
> + xsk_frames++;
> + goto skip_for_xsk;
> + }
>
> /* unmap skb header data */
> dma_unmap_single(tx_ring->dev,
> @@ -8335,6 +8358,7 @@ static bool igb_clean_tx_irq(struct igb_q_vector *q_vector, int napi_budget)
> }
> }
>
> +skip_for_xsk:
> /* move us one more past the eop_desc for start of next pkt */
> tx_buffer++;
> tx_desc++;
> @@ -8363,6 +8387,20 @@ static bool igb_clean_tx_irq(struct igb_q_vector *q_vector, int napi_budget)
> q_vector->tx.total_bytes += total_bytes;
> q_vector->tx.total_packets += total_packets;
>
> + if (tx_ring->xsk_pool) {
READ_ONCE()
> + if (xsk_frames)
> + xsk_tx_completed(tx_ring->xsk_pool, xsk_frames);
> + if (xsk_uses_need_wakeup(tx_ring->xsk_pool))
> + xsk_set_tx_need_wakeup(tx_ring->xsk_pool);
> +
> + nq = txring_txq(tx_ring);
> + __netif_tx_lock(nq, cpu);
> + /* Avoid transmit queue timeout since we share it with the slow path */
> + txq_trans_cond_update(nq);
> + xsk_xmit_done = igb_xmit_zc(tx_ring);
> + __netif_tx_unlock(nq);
> + }
> +
> if (test_bit(IGB_RING_FLAG_TX_DETECT_HANG, &tx_ring->flags)) {
> struct e1000_hw *hw = &adapter->hw;
>
> @@ -8425,7 +8463,7 @@ static bool igb_clean_tx_irq(struct igb_q_vector *q_vector, int napi_budget)
> }
> }
>
> - return !!budget;
> + return !!budget && xsk_xmit_done;
> }
>
> /**
> diff --git a/drivers/net/ethernet/intel/igb/igb_xsk.c b/drivers/net/ethernet/intel/igb/igb_xsk.c
> index 66cdc30e9b6e..4e530e1eb3c0 100644
> --- a/drivers/net/ethernet/intel/igb/igb_xsk.c
> +++ b/drivers/net/ethernet/intel/igb/igb_xsk.c
> @@ -431,6 +431,59 @@ int igb_clean_rx_irq_zc(struct igb_q_vector *q_vector, const int budget)
> return failure ? budget : (int)total_packets;
> }
>
> +bool igb_xmit_zc(struct igb_ring *tx_ring)
> +{
> + unsigned int budget = igb_desc_unused(tx_ring);
> + struct xsk_buff_pool *pool = tx_ring->xsk_pool;
> + u32 cmd_type, olinfo_status, nb_pkts, i = 0;
> + struct xdp_desc *descs = pool->tx_descs;
> + union e1000_adv_tx_desc *tx_desc = NULL;
> + struct igb_tx_buffer *tx_buffer_info;
> + unsigned int total_bytes = 0;
> + dma_addr_t dma;
check IGB_RING_FLAG_TX_DISABLED?
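i.e. something like this early in igb_xmit_zc() (whether to report the
ring as "done" in that case is a guess):

	/* sketch: don't touch a ring that is being torn down */
	if (test_bit(IGB_RING_FLAG_TX_DISABLED, &tx_ring->flags))
		return true;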
> +
> + nb_pkts = xsk_tx_peek_release_desc_batch(pool, budget);
> + if (!nb_pkts)
> + return true;
> +
> + while (nb_pkts-- > 0) {
> + dma = xsk_buff_raw_get_dma(pool, descs[i].addr);
> + xsk_buff_raw_dma_sync_for_device(pool, dma, descs[i].len);
> +
> + tx_buffer_info = &tx_ring->tx_buffer_info[tx_ring->next_to_use];
> + tx_buffer_info->bytecount = descs[i].len;
> + tx_buffer_info->type = IGB_TYPE_XSK;
> + tx_buffer_info->xdpf = NULL;
> + tx_buffer_info->gso_segs = 1;
> + tx_buffer_info->time_stamp = jiffies;
> +
> + tx_desc = IGB_TX_DESC(tx_ring, tx_ring->next_to_use);
> + tx_desc->read.buffer_addr = cpu_to_le64(dma);
> +
> + /* put descriptor type bits */
> + cmd_type = E1000_ADVTXD_DTYP_DATA | E1000_ADVTXD_DCMD_DEXT |
> + E1000_ADVTXD_DCMD_IFCS;
> + olinfo_status = descs[i].len << E1000_ADVTXD_PAYLEN_SHIFT;
> +
> + cmd_type |= descs[i].len | IGB_TXD_DCMD;
This is also sub-optimal as you are setting RS bit on each Tx descriptor,
which will in turn raise a lot of irqs. See how ice sets RS bit only on
last desc from a batch and then, on cleaning side, how it finds a
descriptor that is supposed to have DD bit written by HW.
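Roughly like below (untested sketch; the DCMD macros are the ones that
IGB_TXD_DCMD is built from). The cleaning side would then look for the DD
bit on that last descriptor, e.g. via next_to_watch of the last buffer of
the batch, instead of checking per-packet eop descriptors:

	while (nb_pkts-- > 0) {
		/* ... buffer_info/descriptor setup as before ... */
		cmd_type = E1000_ADVTXD_DTYP_DATA | E1000_ADVTXD_DCMD_DEXT |
			   E1000_ADVTXD_DCMD_IFCS | E1000_ADVTXD_DCMD_EOP |
			   descs[i].len;
		/* request a HW writeback only once per batch */
		if (!nb_pkts)
			cmd_type |= E1000_ADVTXD_DCMD_RS;
		tx_desc->read.cmd_type_len = cpu_to_le32(cmd_type);
		/* ... */
	}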
> + tx_desc->read.cmd_type_len = cpu_to_le32(cmd_type);
> + tx_desc->read.olinfo_status = cpu_to_le32(olinfo_status);
> +
> + total_bytes += descs[i].len;
> +
> + i++;
> + tx_ring->next_to_use++;
> + tx_buffer_info->next_to_watch = tx_desc;
> + if (tx_ring->next_to_use == tx_ring->count)
> + tx_ring->next_to_use = 0;
> + }
> +
> + netdev_tx_sent_queue(txring_txq(tx_ring), total_bytes);
> + igb_xdp_ring_update_tail(tx_ring);
> +
> + return nb_pkts < budget;
> +}
> +
> int igb_xsk_wakeup(struct net_device *dev, u32 qid, u32 flags)
> {
> struct igb_adapter *adapter = netdev_priv(dev);
> --
> 2.42.0
>
* Re: [PATCH net-next 4/4] igb: add AF_XDP zero-copy Tx support
2024-08-10 14:10 ` Maciej Fijalkowski
@ 2024-08-14 8:36 ` Kurt Kanzenbach
2024-08-14 8:55 ` Maciej Fijalkowski
0 siblings, 1 reply; 20+ messages in thread
From: Kurt Kanzenbach @ 2024-08-14 8:36 UTC (permalink / raw)
To: Maciej Fijalkowski, Tony Nguyen
Cc: davem, kuba, pabeni, edumazet, netdev, Sriram Yagnaraman,
magnus.karlsson, ast, daniel, hawk, john.fastabend, bpf,
sriram.yagnaraman, richardcochran, benjamin.steinke, bigeasy,
Chandan Kumar Rout
On Sat Aug 10 2024, Maciej Fijalkowski wrote:
>> + nb_pkts = xsk_tx_peek_release_desc_batch(pool, budget);
>> + if (!nb_pkts)
>> + return true;
>> +
>> + while (nb_pkts-- > 0) {
>> + dma = xsk_buff_raw_get_dma(pool, descs[i].addr);
>> + xsk_buff_raw_dma_sync_for_device(pool, dma, descs[i].len);
>> +
>> + tx_buffer_info = &tx_ring->tx_buffer_info[tx_ring->next_to_use];
>> + tx_buffer_info->bytecount = descs[i].len;
>> + tx_buffer_info->type = IGB_TYPE_XSK;
>> + tx_buffer_info->xdpf = NULL;
>> + tx_buffer_info->gso_segs = 1;
>> + tx_buffer_info->time_stamp = jiffies;
>> +
>> + tx_desc = IGB_TX_DESC(tx_ring, tx_ring->next_to_use);
>> + tx_desc->read.buffer_addr = cpu_to_le64(dma);
>> +
>> + /* put descriptor type bits */
>> + cmd_type = E1000_ADVTXD_DTYP_DATA | E1000_ADVTXD_DCMD_DEXT |
>> + E1000_ADVTXD_DCMD_IFCS;
>> + olinfo_status = descs[i].len << E1000_ADVTXD_PAYLEN_SHIFT;
>> +
>> + cmd_type |= descs[i].len | IGB_TXD_DCMD;
>
> This is also sub-optimal as you are setting RS bit on each Tx descriptor,
> which will in turn raise a lot of irqs. See how ice sets RS bit only on
> last desc from a batch and then, on cleaning side, how it finds a
> descriptor that is supposed to have DD bit written by HW.
I see your point. That requires changes to the cleaning side. However,
igb_clean_tx_irq() is shared between normal and zero-copy path.
The amount of irqs can be also controlled by irq coalescing or even
using busy polling. So I'd rather keep this implementation as simple as
it is now.
Thanks,
Kurt
* Re: [PATCH net-next 4/4] igb: add AF_XDP zero-copy Tx support
2024-08-14 8:36 ` Kurt Kanzenbach
@ 2024-08-14 8:55 ` Maciej Fijalkowski
2024-08-14 9:12 ` Kurt Kanzenbach
0 siblings, 1 reply; 20+ messages in thread
From: Maciej Fijalkowski @ 2024-08-14 8:55 UTC (permalink / raw)
To: Kurt Kanzenbach
Cc: Tony Nguyen, davem, kuba, pabeni, edumazet, netdev,
Sriram Yagnaraman, magnus.karlsson, ast, daniel, hawk,
john.fastabend, bpf, sriram.yagnaraman, richardcochran,
benjamin.steinke, bigeasy, Chandan Kumar Rout
On Wed, Aug 14, 2024 at 10:36:32AM +0200, Kurt Kanzenbach wrote:
> On Sat Aug 10 2024, Maciej Fijalkowski wrote:
> >> + nb_pkts = xsk_tx_peek_release_desc_batch(pool, budget);
> >> + if (!nb_pkts)
> >> + return true;
> >> +
> >> + while (nb_pkts-- > 0) {
> >> + dma = xsk_buff_raw_get_dma(pool, descs[i].addr);
> >> + xsk_buff_raw_dma_sync_for_device(pool, dma, descs[i].len);
> >> +
> >> + tx_buffer_info = &tx_ring->tx_buffer_info[tx_ring->next_to_use];
> >> + tx_buffer_info->bytecount = descs[i].len;
> >> + tx_buffer_info->type = IGB_TYPE_XSK;
> >> + tx_buffer_info->xdpf = NULL;
> >> + tx_buffer_info->gso_segs = 1;
> >> + tx_buffer_info->time_stamp = jiffies;
> >> +
> >> + tx_desc = IGB_TX_DESC(tx_ring, tx_ring->next_to_use);
> >> + tx_desc->read.buffer_addr = cpu_to_le64(dma);
> >> +
> >> + /* put descriptor type bits */
> >> + cmd_type = E1000_ADVTXD_DTYP_DATA | E1000_ADVTXD_DCMD_DEXT |
> >> + E1000_ADVTXD_DCMD_IFCS;
> >> + olinfo_status = descs[i].len << E1000_ADVTXD_PAYLEN_SHIFT;
> >> +
> >> + cmd_type |= descs[i].len | IGB_TXD_DCMD;
> >
> > This is also sub-optimal as you are setting RS bit on each Tx descriptor,
> > which will in turn raise a lot of irqs. See how ice sets RS bit only on
> > last desc from a batch and then, on cleaning side, how it finds a
> > descriptor that is supposed to have DD bit written by HW.
>
> I see your point. That requires changes to the cleaning side. However,
> igb_clean_tx_irq() is shared between normal and zero-copy path.
Ok if that's too much of a hassle then let's leave it as-is. I can address
that in some nearby future.
>
> The amount of irqs can be also controlled by irq coalescing or even
> using busy polling. So I'd rather keep this implementation as simple as
> it is now.
That has nothing to do with what I was describing.
>
> Thanks,
> Kurt
* Re: [PATCH net-next 4/4] igb: add AF_XDP zero-copy Tx support
2024-08-14 8:55 ` Maciej Fijalkowski
@ 2024-08-14 9:12 ` Kurt Kanzenbach
2024-08-14 10:26 ` Maciej Fijalkowski
0 siblings, 1 reply; 20+ messages in thread
From: Kurt Kanzenbach @ 2024-08-14 9:12 UTC (permalink / raw)
To: Maciej Fijalkowski
Cc: Tony Nguyen, davem, kuba, pabeni, edumazet, netdev,
Sriram Yagnaraman, magnus.karlsson, ast, daniel, hawk,
john.fastabend, bpf, sriram.yagnaraman, richardcochran,
benjamin.steinke, bigeasy, Chandan Kumar Rout
On Wed Aug 14 2024, Maciej Fijalkowski wrote:
> On Wed, Aug 14, 2024 at 10:36:32AM +0200, Kurt Kanzenbach wrote:
>> On Sat Aug 10 2024, Maciej Fijalkowski wrote:
>> >> + nb_pkts = xsk_tx_peek_release_desc_batch(pool, budget);
>> >> + if (!nb_pkts)
>> >> + return true;
>> >> +
>> >> + while (nb_pkts-- > 0) {
>> >> + dma = xsk_buff_raw_get_dma(pool, descs[i].addr);
>> >> + xsk_buff_raw_dma_sync_for_device(pool, dma, descs[i].len);
>> >> +
>> >> + tx_buffer_info = &tx_ring->tx_buffer_info[tx_ring->next_to_use];
>> >> + tx_buffer_info->bytecount = descs[i].len;
>> >> + tx_buffer_info->type = IGB_TYPE_XSK;
>> >> + tx_buffer_info->xdpf = NULL;
>> >> + tx_buffer_info->gso_segs = 1;
>> >> + tx_buffer_info->time_stamp = jiffies;
>> >> +
>> >> + tx_desc = IGB_TX_DESC(tx_ring, tx_ring->next_to_use);
>> >> + tx_desc->read.buffer_addr = cpu_to_le64(dma);
>> >> +
>> >> + /* put descriptor type bits */
>> >> + cmd_type = E1000_ADVTXD_DTYP_DATA | E1000_ADVTXD_DCMD_DEXT |
>> >> + E1000_ADVTXD_DCMD_IFCS;
>> >> + olinfo_status = descs[i].len << E1000_ADVTXD_PAYLEN_SHIFT;
>> >> +
>> >> + cmd_type |= descs[i].len | IGB_TXD_DCMD;
>> >
>> > This is also sub-optimal as you are setting RS bit on each Tx descriptor,
>> > which will in turn raise a lot of irqs. See how ice sets RS bit only on
>> > last desc from a batch and then, on cleaning side, how it finds a
>> > descriptor that is supposed to have DD bit written by HW.
>>
>> I see your point. That requires changes to the cleaning side. However,
>> igb_clean_tx_irq() is shared between normal and zero-copy path.
>
> Ok if that's too much of a hassle then let's leave it as-is. I can address
> that in some nearby future.
How would you do that, by adding a dedicated igb_clean_tx_irq_zc()
function? Or is there a more simple way?
BTW: This needs to be addressed in igc too.
>
>>
>> The amount of irqs can be also controlled by irq coalescing or even
>> using busy polling. So I'd rather keep this implementation as simple as
>> it is now.
>
> That has nothing to do with what I was describing.
Ok, maybe I misunderstood your suggestion. It seemed to me that adding
the RS bit to the last frame of the burst will reduce the amount of
raised irqs.
Thanks,
Kurt
* Re: [PATCH net-next 4/4] igb: add AF_XDP zero-copy Tx support
2024-08-14 9:12 ` Kurt Kanzenbach
@ 2024-08-14 10:26 ` Maciej Fijalkowski
2024-08-14 12:51 ` Kurt Kanzenbach
0 siblings, 1 reply; 20+ messages in thread
From: Maciej Fijalkowski @ 2024-08-14 10:26 UTC (permalink / raw)
To: Kurt Kanzenbach
Cc: Tony Nguyen, davem, kuba, pabeni, edumazet, netdev,
Sriram Yagnaraman, magnus.karlsson, ast, daniel, hawk,
john.fastabend, bpf, sriram.yagnaraman, richardcochran,
benjamin.steinke, bigeasy, Chandan Kumar Rout
On Wed, Aug 14, 2024 at 11:12:30AM +0200, Kurt Kanzenbach wrote:
> On Wed Aug 14 2024, Maciej Fijalkowski wrote:
> > On Wed, Aug 14, 2024 at 10:36:32AM +0200, Kurt Kanzenbach wrote:
> >> On Sat Aug 10 2024, Maciej Fijalkowski wrote:
> >> >> + nb_pkts = xsk_tx_peek_release_desc_batch(pool, budget);
> >> >> + if (!nb_pkts)
> >> >> + return true;
> >> >> +
> >> >> + while (nb_pkts-- > 0) {
> >> >> + dma = xsk_buff_raw_get_dma(pool, descs[i].addr);
> >> >> + xsk_buff_raw_dma_sync_for_device(pool, dma, descs[i].len);
> >> >> +
> >> >> + tx_buffer_info = &tx_ring->tx_buffer_info[tx_ring->next_to_use];
> >> >> + tx_buffer_info->bytecount = descs[i].len;
> >> >> + tx_buffer_info->type = IGB_TYPE_XSK;
> >> >> + tx_buffer_info->xdpf = NULL;
> >> >> + tx_buffer_info->gso_segs = 1;
> >> >> + tx_buffer_info->time_stamp = jiffies;
> >> >> +
> >> >> + tx_desc = IGB_TX_DESC(tx_ring, tx_ring->next_to_use);
> >> >> + tx_desc->read.buffer_addr = cpu_to_le64(dma);
> >> >> +
> >> >> + /* put descriptor type bits */
> >> >> + cmd_type = E1000_ADVTXD_DTYP_DATA | E1000_ADVTXD_DCMD_DEXT |
> >> >> + E1000_ADVTXD_DCMD_IFCS;
> >> >> + olinfo_status = descs[i].len << E1000_ADVTXD_PAYLEN_SHIFT;
> >> >> +
> >> >> + cmd_type |= descs[i].len | IGB_TXD_DCMD;
> >> >
> >> > This is also sub-optimal as you are setting RS bit on each Tx descriptor,
> >> > which will in turn raise a lot of irqs. See how ice sets RS bit only on
> >> > last desc from a batch and then, on cleaning side, how it finds a
> >> > descriptor that is supposed to have DD bit written by HW.
> >>
> >> I see your point. That requires changes to the cleaning side. However,
> >> igb_clean_tx_irq() is shared between normal and zero-copy path.
> >
> > Ok if that's too much of a hassle then let's leave it as-is. I can address
> > that in some nearby future.
>
> How would you do that, by adding a dedicated igb_clean_tx_irq_zc()
> function? Or is there a more simple way?
Yes that would be my first approach.
>
> BTW: This needs to be addressed in igc too.
Argh!
>
> >
> >>
> >> The amount of irqs can be also controlled by irq coalescing or even
> >> using busy polling. So I'd rather keep this implementation as simple as
> >> it is now.
> >
> > That has nothing to do with what I was describing.
>
> Ok, maybe I misunderstood your suggestion. It seemed to me that adding
> the RS bit to the last frame of the burst will reduce the amount of
> raised irqs.
You got it right, but I don't think it's related to any outer settings.
The main case here is that by doing what I proposed you get much less PCIe
traffic which in turn yields better performance.
>
> Thanks,
> Kurt
* Re: [PATCH net-next 4/4] igb: add AF_XDP zero-copy Tx support
2024-08-14 10:26 ` Maciej Fijalkowski
@ 2024-08-14 12:51 ` Kurt Kanzenbach
0 siblings, 0 replies; 20+ messages in thread
From: Kurt Kanzenbach @ 2024-08-14 12:51 UTC (permalink / raw)
To: Maciej Fijalkowski
Cc: Tony Nguyen, davem, kuba, pabeni, edumazet, netdev,
Sriram Yagnaraman, magnus.karlsson, ast, daniel, hawk,
john.fastabend, bpf, sriram.yagnaraman, richardcochran,
benjamin.steinke, bigeasy, Chandan Kumar Rout
On Wed Aug 14 2024, Maciej Fijalkowski wrote:
>> >> The amount of irqs can be also controlled by irq coalescing or even
>> >> using busy polling. So I'd rather keep this implementation as simple as
>> >> it is now.
>> >
>> > That has nothing to do with what I was describing.
>>
>> Ok, maybe I misunderstood your suggestion. It seemed to me that adding
>> the RS bit to the last frame of the burst will reduce the amount of
>> raised irqs.
>
> You got it right, but I don't think it's related to any outer settings.
> The main case here is that by doing what I proposed you get much less PCIe
> traffic which in turn yields better performance.
I see, makes sense. Then, let's address this in another patchset also
for igc.
Thanks,
Kurt