From: Michal Kubiak <michal.kubiak@intel.com>
To: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Jacob Keller <jacob.e.keller@intel.com>,
	Anthony Nguyen <anthony.l.nguyen@intel.com>,
	Intel Wired LAN <intel-wired-lan@lists.osuosl.org>,
	<netdev@vger.kernel.org>,
	"Christoph Petrausch" <christoph.petrausch@deepl.com>,
	Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>,
	kernel-team <kernel-team@cloudflare.com>
Subject: Re: [Intel-wired-lan] [PATCH iwl-net v2] ice: fix Rx page leak on multi-buffer frames
Date: Tue, 26 Aug 2025 12:24:36 +0200
Message-ID: <aK2LZCedKkXuG1I_@localhost.localdomain>
In-Reply-To: <85c2fea0-686f-435a-a539-81491a316e46@kernel.org>

On Tue, Aug 26, 2025 at 10:35:30AM +0200, Jesper Dangaard Brouer wrote:
> 
> 
> On 26/08/2025 01.00, Jacob Keller wrote:
> > XDP_DROP performance has been tested for this version, thanks to work from
> > Michal Kubiak. The results are quite promising, with 3 versions being
> > compared:
> > 
> > * baseline net-next tree
> > * v1 applied
> > * v2 applied
> > 
> > Michal said:
> > 
> >    I ran the XDP_DROP performance comparison tests on my setup in the way I
> >    usually do. I didn't have pktgen configured on my link partner, but I
> >    used 6 instances of xdpsock running in Tx-only mode to generate
> >    high-bandwidth traffic. Also, I tried to replicate the conditions according
> >    to Jesper's description, making sure that all the traffic is directed to a
> >    single Rx queue and one CPU is 100% loaded.
> > 
> 
> Thank you for replicating the test setup.  Using xdpsock as a traffic
> generator is fine, as long as we make sure that the generator TX speed
> exceeds the Device Under Test RX XDP_DROP speed.  It is also important
> for the test that packets hit a single RX queue and that we verify one CPU
> is 100% loaded, as you describe.
> 
> As a reminder, the pktgen kernel module comes with ready-to-use sample
> shell scripts [1].
> 
>  [1] https://elixir.bootlin.com/linux/v6.16.3/source/samples/pktgen
> 

Thank you! I am aware of those scripts and use them as well.
xdpsock was simply the quickest option at the time, so I decided not to
change my link partner setup, since I had already reproduced the
performance drop in v1.

> > The performance hit from v1 is replicated, and shown to be gone in v2, with
> > our results showing even an increase compared to baseline instead of a
> > drop. I've included the relative packets-per-second deltas compared against
> > a baseline test with neither v1 nor v2.
> > 
> 
> Thanks for also replicating the performance hit from v1 as I did in [2].
> 
> To Michal: What CPU did you use?
>  - I used CPU: AMD EPYC 9684X (with SRSO=IBPB)

In my test I used: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz

> 
> One of the reasons that I saw a larger percentage drop is that this CPU
> doesn't have DDIO/DCA, which delivers the packet to L3 cache (and an L2
> cache miss will obviously take less time than a full main-memory cache
> miss). (Details: Newer AMD CPUs will get something called PCIe TLP
> Processing Hints (TPH), which resembles DDIO.)
> 
> The point is that I see some opportunities in the driver to move some of the
> prefetches earlier. But we want to make sure it benefits both CPU types,
> and I can test on the AMD platform. (This CPU is a large part of our
> fleet, so it makes sense for us to optimize this.)
> 
> > baseline to v1, no-touch:
> >    -8,387,677 packets per second (17%) decrease.
> > 
> > baseline to v2, no-touch:
> >    +4,057,000 packets per second (8%) increase!
> > 
> > baseline to v1, read data:
> >    -411,709 packets per second (1%) decrease.
> > 
> > baseline to v2, read data:
> >    +4,331,857 packets per second (11%) increase!
> 
> Thanks for providing these numbers.
> I would also like to know the absolute throughput numbers (PPS) before and
> after, as this allows me to calculate the nanosecond difference. Percentages
> are usually useful, but they can be misleading when dealing with XDP_DROP
> speeds, because a small nanosecond change gets "magnified" too much.
> 

I have usually been asked to share percentage data, because absolute numbers
may depend on various circumstances.
However, I understand your point regarding XDP_DROP; in such a case absolute
numbers are justified. Please see my raw results (from the xdp-bench summary) below:


net-next (main) (drop, no touch)
  Duration            : 105.7s
  Packets received    : 4,960,778,583
  Average packets/s   : 46,951,873
  Rx dropped          : 4,960,778,583


net-next (main) (drop, read data)
  Duration            : 94.5s
  Packets received    : 3,524,346,352
  Average packets/s   : 37,295,056
  Rx dropped          : 3,524,346,352


net-next (main+v1) (drop, no touch)
  Duration            : 122.5s
  Packets received    : 4,722,510,839
  Average packets/s   : 38,564,196
  Rx dropped          : 4,722,510,839


net-next (main+v1) (drop, read data)
  Duration            : 115.7s
  Packets received    : 4,265,991,147
  Average packets/s   : 36,883,347
  Rx dropped          : 4,265,991,147


net-next (main+v2) (drop, no touch)
  Duration            : 130.6s
  Packets received    : 6,664,104,907
  Average packets/s   : 51,008,873
  Rx dropped          : 6,664,104,907


net-next (main+v2) (drop, read data)
  Duration            : 143.6s
  Packets received    : 5,975,991,044
  Average packets/s   : 41,626,913
  Rx dropped          : 5,975,991,044
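
For reference, the raw rates above can be turned into the per-packet
nanosecond cost that Jesper asked about. The following is only a minimal
sketch, not part of the original test tooling: it assumes that a single,
fully loaded CPU handles all packets, so the time per packet is roughly
1e9 / PPS. The names PPS, ns_per_packet and delta_ns are made up for this
illustration; the rates are copied from the summaries above.

```python
# Average packets/s copied from the xdp-bench summaries above.
# Keys are (xdp-bench mode, kernel variant); the names are illustrative only.
PPS = {
    ("no touch", "main"): 46_951_873,
    ("no touch", "main+v1"): 38_564_196,
    ("no touch", "main+v2"): 51_008_873,
    ("read data", "main"): 37_295_056,
    ("read data", "main+v1"): 36_883_347,
    ("read data", "main+v2"): 41_626_913,
}


def ns_per_packet(pps: float) -> float:
    """Approximate time spent per packet, assuming one 100% loaded CPU."""
    return 1e9 / pps


def delta_ns(mode: str, variant: str) -> float:
    """Per-packet cost of `variant` minus the baseline ("main") cost."""
    return ns_per_packet(PPS[(mode, variant)]) - ns_per_packet(PPS[(mode, "main")])


if __name__ == "__main__":
    for mode in ("no touch", "read data"):
        base = PPS[(mode, "main")]
        for variant in ("main+v1", "main+v2"):
            pps = PPS[(mode, variant)]
            print(
                f"{mode:9s} {variant:8s} "
                f"{pps - base:+11,d} pps "
                f"({100.0 * (pps - base) / base:+5.1f}%)  "
                f"{delta_ns(mode, variant):+.2f} ns/packet"
            )
```

A positive ns/packet delta means the variant spends more time per packet
than the baseline. Computed this way, the v1 no-touch regression is a few
nanoseconds per packet, which illustrates why small absolute changes show
up as large percentages at XDP_DROP rates.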


Thanks,
Michal

> > ---
> > Changes in v2:
> > - Only access shared info for fragmented frames
> > - Link to v1: https://lore.kernel.org/netdev/20250815204205.1407768-4-anthony.l.nguyen@intel.com/
> 
> [2] https://lore.kernel.org/netdev/6e2cbea1-8c70-4bfa-9ce4-1d07b545a705@kernel.org/
> 
> > ---
> >   drivers/net/ethernet/intel/ice/ice_txrx.h |  1 -
> >   drivers/net/ethernet/intel/ice/ice_txrx.c | 80 +++++++++++++------------------
> >   2 files changed, 34 insertions(+), 47 deletions(-)
> 
> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
