Question about xdp: how to figure out the throughput is limited by pcie

* Question about xdp: how to figure out the throughput is limited by pcie
@ 2023-04-07  1:46 Qiongwen Xu
  2023-04-09 15:44 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 7+ messages in thread
From: Qiongwen Xu @ 2023-04-07  1:46 UTC
  To: xdp-newbies@vger.kernel.org; +Cc: Srinivas Narayana Ganapathy

Dear XDP experts,

Hope this email finds you well!

I am a PhD student at Rutgers. Recently, I have been reading the XDP paper "The eXpress Data Path: Fast Programmable Packet Processing in the Operating System Kernel". In sections 4.1 and 4.3, you mention that the throughput of XDP programs (packet drop and packet forwarding) is limited by PCIe (e.g., "Both scale their performance linearly until they approach the global performance limit of the PCI bus"). I am curious how you determined that PCIe was the limit. Is there any tool or method to check this?

Looking forward to your reply!

Thanks,
Qiongwen Xu


* Re: Question about xdp: how to figure out the throughput is limited by pcie
  2023-04-07  1:46 Question about xdp: how to figure out the throughput is limited by pcie Qiongwen Xu
@ 2023-04-09 15:44 ` Jesper Dangaard Brouer
       [not found]   ` <CH2PR14MB3657EF09F9A2BE7C08E4C9DBE3989@CH2PR14MB3657.namprd14.prod.outlook.com>
  0 siblings, 1 reply; 7+ messages in thread
From: Jesper Dangaard Brouer @ 2023-04-09 15:44 UTC
  To: Qiongwen Xu, xdp-newbies@vger.kernel.org
  Cc: brouer, Srinivas Narayana Ganapathy, Tariq Toukan

(answered inline below)

On 07/04/2023 03.46, Qiongwen Xu wrote:
> Dear XDP experts,
> 
> I am a PhD student at Rutgers. Recently, I have been reading the XDP
> paper "The eXpress Data Path: Fast Programmable Packet Processing
> in the Operating System Kernel". In section 4.1 and 4.3, you mention 
> the throughputs of xdp programs (packet drop and packet forwarding) 
> are limited by the PCIe (e.g., "Both scale their performance linearly
> until they approach the global performance limit of the PCI bus").

Most of the article[1][2] authors are likely on this mailing list,
including me. (Sad to see we called it "PCI *bus*" and not just PCIe).

> I am curious about how you figured out it was the PCIe limitation. 

It is worth noting that the PCIe limitation shown in the article is
related to the number of PCIe transactions with small packets (Ethernet
minimum frame size of 64 bytes). (Thus meaning NOT bandwidth related.)
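
To make the "transactions, not bandwidth" point concrete, here is a rough back-of-the-envelope sketch. The 86Mpps figure and the 64-byte frame size come from this thread; the PCIe Gen3 x16 link (8 GT/s per lane, 128b/130b encoding) is only an assumption for illustration, as the thread does not say which link the NIC was attached to.

```python
# Back-of-the-envelope check: is 86 Mpps of 64-byte frames anywhere near
# the raw bandwidth of the PCIe link?
# ASSUMPTION: PCIe Gen3 x16 (8 GT/s per lane, 128b/130b encoding); the
# thread does not state which link the NIC was attached to.

FRAME_BYTES = 64        # Ethernet minimum frame size (from the thread)
OBSERVED_PPS = 86e6     # total XDP_DROP rate reported in the thread


def payload_gbit_per_sec(pps: float, frame_bytes: int) -> float:
    """Packet data rate in Gbit/s, ignoring all PCIe protocol overhead."""
    return pps * frame_bytes * 8 / 1e9


def pcie_raw_gbit_per_sec(gt_per_sec: float = 8.0, lanes: int = 16,
                          encoding: float = 128 / 130) -> float:
    """Raw link rate in Gbit/s after line encoding (assumed Gen3 x16)."""
    return gt_per_sec * lanes * encoding


if __name__ == "__main__":
    data = payload_gbit_per_sec(OBSERVED_PPS, FRAME_BYTES)
    link = pcie_raw_gbit_per_sec()
    print(f"packet data: {data:.1f} Gbit/s")
    print(f"link raw:    {link:.1f} Gbit/s")
    print(f"fraction:    {data / link:.0%}")
```

Under these assumptions the packet data alone fills only about a third of the raw link rate, which is consistent with the limit being a per-transaction cost rather than bandwidth.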

The observations that led to the PCIe limitation conclusion:
A single CPU doing XDP_DROP (25Mpps) was using 100% CPU time (runtime
attributed to ksoftirqd).  When we scaled up XDP_DROP to run on more
CPUs we saw something strange[3].  It scaled linearly to 3 CPUs, and at 4
CPUs each CPU started to process fewer packets per sec (pps) and the
total (86Mpps) stayed the same.  Even more strange, the CPUs weren't
using 100% CPU any longer; the CPUs had "time" to idle.  Looking at
ethtool stats, we noticed the counter "rx_discards_phy", which (we were
told) increments when PCIe causes backpressure.
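
A minimal sketch of how one might watch such a counter over time, assuming the `ethtool -S <dev>` output uses the usual "name: value" lines. The helper names and the device name "eth0" are hypothetical and not from the thread.

```python
import subprocess
import time


def parse_ethtool_stats(text: str) -> dict:
    """Parse `ethtool -S <dev>` output ("name: value" lines) into a dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        if not sep:
            continue
        try:
            stats[name.strip()] = int(value)
        except ValueError:
            continue  # e.g. the "NIC statistics:" header line
    return stats


def read_stats(dev: str) -> dict:
    """Run `ethtool -S` for one device and return its counters."""
    out = subprocess.run(["ethtool", "-S", dev], check=True,
                         capture_output=True, text=True).stdout
    return parse_ethtool_stats(out)


def counter_rate(dev: str, counter: str, interval: float = 2.0) -> float:
    """Increments per second of one counter over `interval` seconds."""
    before = read_stats(dev)
    time.sleep(interval)
    after = read_stats(dev)
    return (after[counter] - before[counter]) / interval


# Example (hypothetical device name):
#   print(counter_rate("eth0", "rx_discards_phy"))
```

A rising rate on its own is only a hint; it is most useful when read next to per-CPU utilisation taken over the same interval.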

What confirmed the PCIe (transactions) bottleneck[4] was discovering
that enabling the mlx5 priv-flag rx_cqe_compress=on (with
rx_striding_rq=off) changed the total limit (86Mpps to 108Mpps),
as rx_cqe_compress reduces the transactions on PCIe by compressing the RX
descriptors.  Thus, confirming this was related to PCIe.
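
As a sketch of that A/B experiment: the helper below only builds the `ethtool --set-priv-flags` command line and prints it (a dry run), then computes the relative gain from the two totals quoted above. The device name "eth0" and the helper names are hypothetical, and the flag names are driver-specific (mlx5 here).

```python
def priv_flag_cmd(dev: str, **flags: bool) -> list:
    """Build an `ethtool --set-priv-flags` argv for the given on/off flags."""
    cmd = ["ethtool", "--set-priv-flags", dev]
    for name, enabled in flags.items():
        cmd += [name, "on" if enabled else "off"]
    return cmd


def relative_gain(before_mpps: float, after_mpps: float) -> float:
    """Fractional throughput change between two measurements."""
    return (after_mpps - before_mpps) / before_mpps


if __name__ == "__main__":
    # Dry run only: print the command instead of executing it.
    cmd = priv_flag_cmd("eth0", rx_cqe_compress=True, rx_striding_rq=False)
    print(" ".join(cmd))
    # Totals reported in the thread: 86 Mpps before, 108 Mpps after.
    print(f"gain: {relative_gain(86, 108):.1%}")
```

If the total moves noticeably when only the number of PCIe transactions changes, while the CPUs still have idle time, that points at PCIe rather than the CPU.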

 > Is there any tool or method to check this?

I *highly* recommend that you read this article [pci1][pci2]:
  - Title: "Understanding PCIe performance for end host networking"

I wish we had read and referenced this article in ours (but both
happened in 2018).  They give a theoretical model for PCIe, both
bandwidth and latency.  That could be used to explain our PCIe
observations. They also released their [pcie-bench] tool.

I wish more (kernel) performance people understood that PCIe is a
protocol (3 layers: physical, data link layer (DLL) and transaction
layer, which carries Transaction Layer Packets (TLPs)) that is used
between the device and the host OS driver.  Networking people usually
ignore this PCIe protocol step, with its associated protocol overheads,
which actually causes a network packet to be split into smaller PCIe TLP
"packets" with their own PCIe-level headers.  Besides the packet data
itself, the PCIe protocol is used for reading TX descriptors (seen from
the device), writing RX descriptors (seen from the device), and
reading/updating queue pointers.
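
The TLP splitting overhead can be sketched with a simple model in the spirit of [pci1]. The 24-byte per-TLP overhead for a memory write and the 256-byte Maximum Payload Size (MPS) are assumptions here; please check them against the paper and against your own system before relying on the numbers.

```python
import math

# ASSUMPTIONS (verify against [pci1] and your hardware):
MWR_OVERHEAD_BYTES = 24   # framing + DLL + TLP header per memory-write TLP
MPS_BYTES = 256           # Maximum Payload Size negotiated on the link


def dma_write_bytes(size: int, mps: int = MPS_BYTES,
                    overhead: int = MWR_OVERHEAD_BYTES) -> int:
    """Bytes on the PCIe link for a device-to-host DMA write of `size` bytes."""
    tlps = math.ceil(size / mps)   # payload is split into MPS-sized TLPs
    return tlps * overhead + size


def efficiency(size: int) -> float:
    """Fraction of link bytes that are actual packet data."""
    return size / dma_write_bytes(size)


if __name__ == "__main__":
    for size in (64, 256, 1500):
        print(f"{size:5d} B packet -> {dma_write_bytes(size):5d} B on link, "
              f"{efficiency(size):.1%} efficient")
```

This only covers the packet data; descriptor writes and queue pointer updates add further transactions, which is why small packets are hit hardest.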

It might surprise people that article [pci1] shows that PCIe (128B
payload) introduces a latency of around 600ns (nanosec), which is
significantly larger than the inter-packet gap needed for wirespeed
networking.  Thus, latency hiding happens "behind our back": the
device and DMA engine have to keep many transactions in flight to
utilize the NIC (yet another hidden queue in the system).
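
A rough worked example of that hidden queue, using Little's law (in-flight = latency / arrival interval). The 600ns latency is the figure quoted above; the 100 Gbit/s link speed and "one transaction per packet" are simplifying assumptions for illustration.

```python
import math

WIRE_OVERHEAD_BYTES = 20  # preamble (7) + SFD (1) + inter-frame gap (12)


def packet_interval_ns(link_gbit: float, frame_bytes: int = 64) -> float:
    """Time between back-to-back frames at line rate, in nanoseconds."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return bits_on_wire / link_gbit   # bits / (Gbit/s) == ns


def transactions_in_flight(latency_ns: float, interval_ns: float) -> int:
    """Little's law: transactions that must overlap to hide the latency."""
    return math.ceil(latency_ns / interval_ns)


if __name__ == "__main__":
    interval = packet_interval_ns(100)        # ASSUMPTION: 100 Gbit/s link
    in_flight = transactions_in_flight(600, interval)
    print(f"one 64B frame every {interval:.2f} ns at line rate")
    print(f"about {in_flight} transactions in flight to hide 600 ns")
```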

--Jesper

Links:

  [1] https://dl.acm.org/doi/10.1145/3281411.3281443
  [2] https://github.com/xdp-project/xdp-paper
  [3] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench02_xdp_drop.org#possible-pcie-limit
  [4] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench01_baseline.org#initial-data-from-jespers-runs

Read this article:
  [pci0] https://dl.acm.org/doi/10.1145/3230543.3230560
  [pci1] https://www.cl.cam.ac.uk/research/srg/netos/projects/pcie-bench/neugebauer2018understanding.pdf
  [pci2] https://www.cl.cam.ac.uk/research/srg/netos/projects/pcie-bench/
  [pcie-bench] https://github.com/pcie-bench/pcie-model


* Re: Question about xdp: how to figure out the throughput is limited by pcie
       [not found]   ` <CH2PR14MB3657EF09F9A2BE7C08E4C9DBE3989@CH2PR14MB3657.namprd14.prod.outlook.com>
@ 2023-04-13  2:54     ` Qiongwen Xu
  2023-04-13 11:16       ` Toke Høiland-Jørgensen
  2023-04-13 11:59       ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 7+ messages in thread
From: Qiongwen Xu @ 2023-04-13  2:54 UTC
  To: Jesper Dangaard Brouer
  Cc: brouer@redhat.com, Srinivas Narayana Ganapathy, Tariq Toukan,
	xdp-newbies@vger.kernel.org

Hi Jesper,

Thanks for the detailed reply and sharing these helpful materials/papers with us!

After enabling rx_cqe_compress, the throughput in our experiment increases from
70+Mpps to 85 Mpps. We also tried to use the counter "rx_discards_phy". The counter
increases in both CPU-limited and PCIe-limited experiments, i.e., even an experiment
that is only CPU-limited increases the counter. We are looking for any
counter that can separate CPU- and PCIe-limited cases. Regarding the [pcie-bench] tool,
unfortunately, we are not able to use it, as it requires FPGA hardware.

Thanks,
Qiongwen


* Re: Question about xdp: how to figure out the throughput is limited by pcie
  2023-04-13  2:54     ` Qiongwen Xu
@ 2023-04-13 11:16       ` Toke Høiland-Jørgensen
  2023-04-13 11:30         ` Jesper Dangaard Brouer
  2023-04-13 11:59       ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 7+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-04-13 11:16 UTC
  To: Qiongwen Xu, Jesper Dangaard Brouer
  Cc: brouer@redhat.com, Srinivas Narayana Ganapathy, Tariq Toukan,
	xdp-newbies@vger.kernel.org

Qiongwen Xu <qx51@cs.rutgers.edu> writes:

> Hi Jesper,
>
> Thanks for the detailed reply and sharing these helpful
> materials/papers with us!

(Please don't top post on the mailing list).

> After enabling rx_cqe_compress, the throughput in our experiment increases from
> 70+Mpps to 85 Mpps. We also tried to use the counter "rx_discards_phy". The counter
> increases in both cpu-limited and pcie-limited experiments, i.e., in the experiment
> which is only cpu-limited can also increase the counter. We are looking for any
> counter that can separate cpu- and pcie-limited cases. Regarding the [pcie-bench] tool,
> unfortunately, we are not able to use it, as it requires fpga hardware.

Well, are your CPUs being maxed out? IIRC it was pretty obvious that
they weren't when we were running those tests, so just looking at
something like 'mpstat' should give you a hint. For more detailed
analysis you can use 'perf' to see exactly where the CPU is spending its
time.

-Toke



* Re: Question about xdp: how to figure out the throughput is limited by pcie
  2023-04-13 11:16       ` Toke Høiland-Jørgensen
@ 2023-04-13 11:30         ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 7+ messages in thread
From: Jesper Dangaard Brouer @ 2023-04-13 11:30 UTC
  To: Toke Høiland-Jørgensen, Qiongwen Xu,
	Jesper Dangaard Brouer
  Cc: brouer, Srinivas Narayana Ganapathy, Tariq Toukan,
	xdp-newbies@vger.kernel.org


On 13/04/2023 13.16, Toke Høiland-Jørgensen wrote:
> Qiongwen Xu <qx51@cs.rutgers.edu> writes:
> 
>> Hi Jesper,
>>
>> Thanks for the detailed reply and sharing these helpful
>> materials/papers with us!
> 
> (Please don't top post on the mailing list).

+1

>> After enabling rx_cqe_compress, the throughput in our experiment increases from
>> 70+Mpps to 85 Mpps. We also tried to use the counter "rx_discards_phy". The counter
>> increases in both cpu-limited and pcie-limited experiments, i.e., in the experiment
>> which is only cpu-limited can also increase the counter. We are looking for any
>> counter that can separate cpu- and pcie-limited cases. Regarding the [pcie-bench] tool,
>> unfortunately, we are not able to use it, as it requires fpga hardware.
> 
> Well, are your CPUs being maxed out? IIRC it was pretty obvious that
> they weren't when we were running those tests, so just looking at
> something like 'mpstat' should give you a hint. 

As you can see in [1], I find this mpstat command very useful:

  $ mpstat -P ALL -u -I SCPU -I SUM 2

The tool turbostat will also tell you how busy individual CPUs are.
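
If neither tool is at hand, roughly the same per-CPU picture can be had from /proc/stat. This is only a sketch: it counts everything except idle and iowait as busy (so softirq time, where XDP runs, is included), and the function names are made up for this example.

```python
import os
import time


def parse_proc_stat(text: str) -> dict:
    """Return {cpu_name: (idle_ticks, total_ticks)} for the per-CPU lines."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep "cpu0", "cpu1", ...; skip the aggregate "cpu" line and others.
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        ticks = [int(x) for x in parts[1:9]]   # user .. steal
        if len(ticks) < 5:
            continue
        idle = ticks[3] + ticks[4]             # idle + iowait
        result[parts[0]] = (idle, sum(ticks))
    return result


def busy_percent(before: dict, after: dict) -> dict:
    """Per-CPU busy percentage between two parse_proc_stat() snapshots."""
    out = {}
    for cpu, (idle1, total1) in after.items():
        idle0, total0 = before[cpu]
        delta = total1 - total0
        out[cpu] = 100.0 * (1 - (idle1 - idle0) / delta) if delta else 0.0
    return out


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    with open("/proc/stat") as f:
        first = parse_proc_stat(f.read())
    time.sleep(1.0)
    with open("/proc/stat") as f:
        second = parse_proc_stat(f.read())
    for cpu, pct in busy_percent(first, second).items():
        print(f"{cpu}: {pct:5.1f}% busy")
```

CPUs that handle RX queues but sit well below 100% busy while the total packet rate is flat suggest the limit is somewhere other than the CPU.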


> For more detailed analysis you can use 'perf' to see exactly where
> the CPU is spending its time.

Again a practical hint.
Perf record with cmdline:

  # perf record -g -a -- sleep 10

Look at results with cmdline that also expose the 'cpu' info:

  # perf report --sort cpu,dso,symbol --no-children

Look at a specific CPU e.g. core 3 (counting from 0) with cmdline:

  # perf report --sort cpu,dso,symbol --no-children -C3

--Jesper

Links:
  [1] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench02_xdp_drop.org#test-100g-bandwidth



* Re: Question about xdp: how to figure out the throughput is limited by pcie
  2023-04-13  2:54     ` Qiongwen Xu
  2023-04-13 11:16       ` Toke Høiland-Jørgensen
@ 2023-04-13 11:59       ` Jesper Dangaard Brouer
  2023-04-13 20:11         ` Andi Kleen
  1 sibling, 1 reply; 7+ messages in thread
From: Jesper Dangaard Brouer @ 2023-04-13 11:59 UTC
  To: Arnaldo Carvalho de Melo, Andi Kleen
  Cc: brouer, Srinivas Narayana Ganapathy, Tariq Toukan,
	xdp-newbies@vger.kernel.org, Qiongwen Xu, Jiri Olsa

Hi Andi and Acme,

Regarding below discussion and subj (top-posting as you don't need to
read discussion to answer my perf questions).

Can we somehow use perf to profile things happening in PCIe?
E.g. are there any "uncore" PMU counters/events for PCIe?

   Hint, we can list more PMU counters via Andi's ocperf tool[42].
   # sudo ./ocperf list

Could we use the TopDown [toplev] model, to indicate/detect that the
PCIe device (or PCIe root complex) is the bottleneck?

   Hint, try out the [toplev] tool looking at a specific core under load
   # sudo ./toplev.py -I 3000 -l3 -a --show-sample --core C2

--Jesper

  [toplev] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
  [42] https://github.com/andikleen/pmu-tools

On 13/04/2023 04.54, Qiongwen Xu wrote:
> Hi Jesper,
> 
> Thanks for the detailed reply and sharing these helpful materials/papers with us!
> 
> After enabling rx_cqe_compress, the throughput in our experiment increases from
> 70+Mpps to 85 Mpps. We also tried to use the counter "rx_discards_phy". The counter
> increases in both cpu-limited and pcie-limited experiments, i.e., in the experiment
> which is only cpu-limited can also increase the counter. We are looking for any
> counter that can separate cpu- and pcie-limited cases. Regarding the [pcie-bench] tool,
> unfortunately, we are not able to use it, as it requires fpga hardware.
> 
> Thanks,
> Qiongwen

* Re: Question about xdp: how to figure out the throughput is limited by pcie
  2023-04-13 11:59       ` Jesper Dangaard Brouer
@ 2023-04-13 20:11         ` Andi Kleen
  0 siblings, 0 replies; 7+ messages in thread
From: Andi Kleen @ 2023-04-13 20:11 UTC
  To: Jesper Dangaard Brouer
  Cc: Arnaldo Carvalho de Melo, brouer, Srinivas Narayana Ganapathy,
	Tariq Toukan, xdp-newbies@vger.kernel.org, Qiongwen Xu, Jiri Olsa

On Thu, Apr 13, 2023 at 01:59:44PM +0200, Jesper Dangaard Brouer wrote:
> Hi Andi and Acme,
> 
> Regarding below discussion and subj (top-posting as you don't need to
> read discussion to answer my perf questions).
> 
> Can we somehow use perf to profile things happening in PCIe ?
> E.g. Are there any PMU counters "uncore" events for PCIe ?
> 
>   Hint, we can list more PMU counter via Andi's ocperf tool[42].
>   # sudo ./ocperf list
> 
> Could we use the TopDown [toplev] model, to indicate/detect that the
> PCIe device (or PCIe root complex) is the bottleneck?

'perf list uncore_iio_free_running' shows the bandwidth counters
on Intel servers.

It can be tricky to identify the device for that. There was a patchkit
from Alexander Andronov to make it easier, but I'm not sure if it
made it in.
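
A small sketch for checking which of these PMUs and events a given machine exposes, by reading the standard sysfs location for perf event sources. It assumes nothing about the event names and prints whatever it finds; the function name is made up for this example, and on machines without these PMUs it prints nothing.

```python
from pathlib import Path

SYSFS_PMUS = Path("/sys/bus/event_source/devices")


def list_pmu_events(prefix: str = "uncore_iio_free_running",
                    root: Path = SYSFS_PMUS) -> dict:
    """Map each PMU whose name starts with `prefix` to its event names."""
    found = {}
    if not root.is_dir():
        return found
    for pmu in sorted(root.glob(prefix + "*")):
        events_dir = pmu / "events"
        if not events_dir.is_dir():
            continue
        # Skip the companion "<event>.scale" / "<event>.unit" files.
        found[pmu.name] = sorted(p.name for p in events_dir.iterdir()
                                 if "." not in p.name)
    return found


if __name__ == "__main__":
    for pmu, events in list_pmu_events().items():
        for event in events:
            # Print a perf command line for each discovered event.
            print(f"perf stat -a -e {pmu}/{event}/ -- sleep 1")
```

Mapping a given PMU instance and port to the PCIe slot of the NIC is still the tricky part mentioned above, and this sketch does not attempt it.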

-Andi
