From: Jesper Dangaard Brouer <jbrouer@redhat.com>
To: Mina Almasry <almasrymina@google.com>, Jason Gunthorpe <jgg@ziepe.ca>
Cc: brouer@redhat.com, "Christian König" <christian.koenig@amd.com>,
"Hari Ramakrishnan" <rharix@google.com>,
"David Ahern" <dsahern@kernel.org>,
"Samiullah Khawaja" <skhawaja@google.com>,
"Willem de Bruijn" <willemb@google.com>,
"Jakub Kicinski" <kuba@kernel.org>,
"Christoph Hellwig" <hch@lst.de>,
"John Hubbard" <jhubbard@nvidia.com>,
"Dan Williams" <dan.j.williams@intel.com>,
"Jesper Dangaard Brouer" <jbrouer@redhat.com>,
"Alexander Duyck" <alexander.duyck@gmail.com>,
"Yunsheng Lin" <linyunsheng@huawei.com>,
davem@davemloft.net, pabeni@redhat.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org,
"Lorenzo Bianconi" <lorenzo@kernel.org>,
"Yisen Zhuang" <yisen.zhuang@huawei.com>,
"Salil Mehta" <salil.mehta@huawei.com>,
"Eric Dumazet" <edumazet@google.com>,
"Sunil Goutham" <sgoutham@marvell.com>,
"Geetha sowjanya" <gakula@marvell.com>,
"Subbaraya Sundeep" <sbhatta@marvell.com>,
hariprasad <hkelam@marvell.com>,
"Saeed Mahameed" <saeedm@nvidia.com>,
"Leon Romanovsky" <leon@kernel.org>,
"Felix Fietkau" <nbd@nbd.name>,
"Ryder Lee" <ryder.lee@mediatek.com>,
"Shayne Chen" <shayne.chen@mediatek.com>,
"Sean Wang" <sean.wang@mediatek.com>,
"Kalle Valo" <kvalo@kernel.org>,
"Matthias Brugger" <matthias.bgg@gmail.com>,
"AngeloGioacchino Del Regno"
<angelogioacchino.delregno@collabora.com>,
"Jesper Dangaard Brouer" <hawk@kernel.org>,
"Ilias Apalodimas" <ilias.apalodimas@linaro.org>,
linux-rdma@vger.kernel.org, linux-wireless@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-mediatek@lists.infradead.org,
"Jonathan Lemon" <jonathan.lemon@gmail.com>,
logang@deltatee.com, "Bjorn Helgaas" <bhelgaas@google.com>
Subject: Re: Memory providers multiplexing (Was: [PATCH net-next v4 4/5] page_pool: remove PP_FLAG_PAGE_FRAG flag)
Date: Mon, 24 Jul 2023 16:56:27 +0200
Message-ID: <a2569132-393e-0149-f76c-f6de282e1c96@redhat.com>
In-Reply-To: <CAHS8izNMB-H3w0CE9kj6hT5q_F6_XJy_X_HtZwmisOEDhp31yg@mail.gmail.com>
On 17/07/2023 03.53, Mina Almasry wrote:
> On Fri, Jul 14, 2023 at 8:55 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>
>> On Fri, Jul 14, 2023 at 07:55:15AM -0700, Mina Almasry wrote:
>>
>>> Once the skb frags with struct new_abstraction are in the TCP stack,
>>> they will need some special handling in code accessing the frags. But
>>> my RFC already addressed that somewhat because the frags were
>>> inaccessible in that case. In this case the frags will be both
>>> inaccessible and will not be struct pages at all (things like
>>> get_page() will not work), so more special handling will be required,
>>> maybe.
>>
>> It seems sort of reasonable, though there will be interesting concerns
>> about coherence and synchronization with general-purpose DMABUFs that
>> will need tackling.
>>
>> Still, it is such a lot of churn and weirdness on the netdev side, I
>> think you'd do well to present an actual full application as
>> justification.
>>
>> Yes, you showed you can stick unordered TCP data frags into GPU memory
>> sort of quickly, but have you gone further with this to actually show
>> it is useful for a real world GPU centric application?
>>
>> BTW your cover letter said 96% utilization; the usual server
>> configuration is one NIC per GPU, so you were able to hit 1500Gb/sec of
>> TCP BW with this?
>>
>
> I do notice that the number of NICs is missing from our public
> documentation so far, so I will refrain from specifying how many NICs
> are on those A3 VMs until the information is public. But I think I can
> confirm that your general thinking is correct: the perf that we're
> getting is 96.6% of line rate for each GPU/NIC pair,
What do you mean by 96.6% "line rate"?
Is it the Ethernet line-rate?
Is the measured throughput the TCP data "goodput"?
Assuming:
- MTU 1500 bytes (1514 on wire).
- Ethernet header 14 bytes
- IP header 20 bytes
- TCP header 20 bytes
Due to header overhead the goodput will be approx 96.4%.
- (1514-(14+20+20))/1514 = 0.9643
- (Not taking Ethernet interframe gap into account).
Thus, maybe you have hit Ethernet wire line-rate already?
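For reproducibility, here is the same arithmetic as a minimal Python
sketch; the second variant, which also counts FCS plus preamble and
interframe gap, is my own addition and not part of the 96.4% figure:

  # Goodput fraction of a single MTU-1500 TCP segment on Ethernet.
  eth_hdr, ip_hdr, tcp_hdr = 14, 20, 20
  mtu = 1500
  frame = mtu + eth_hdr                # 1514 bytes on the wire
  payload = mtu - ip_hdr - tcp_hdr     # 1460 bytes of TCP payload
  print(payload / frame)               # 0.9643 -> approx 96.4%
  # My addition: add FCS (4B) and preamble+SFD+IFG (20B) per frame.
  print(payload / (frame + 4 + 20))    # 0.9493 -> approx 94.9%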
> and scales linearly
> for each NIC/GPU pair we've tested with so far. Line rate of each
> NIC/GPU pair is 200 Gb/sec.
>
> So if we have 8 NIC/GPU pairs we'd be hitting 96.6% * 200 * 8 = 1545 GB/sec.
Let's keep our units straight.
Here you mean 1545 Gbit/s, which is 193 GBytes/s.
> If we have, say, 2 NIC/GPU pairs, we'd be hitting 96.6% * 200 * 2 = 384 GB/sec
Here you mean 384 Gbit/s, which is 48 GBytes/s.
> ...
> etc.
>
These massive throughput numbers are important, because they *exceed*
the physical host RAM/DIMM memory speeds.
This is the *real argument* for why software cannot afford even a single
copy of the data from host RAM into GPU memory: the CPU's memory
throughput to DRAM/DIMMs is simply insufficient.
My testlab CPU (Xeon E5-1650 v4) has 4 DDR4 DIMM slots:
- Data Width: 64 bits (= 8 bytes)
- Configured Memory Speed: 2400 MT/s
- Theoretical maximum memory bandwidth: 76.8 GBytes/s (2400 MT/s * 8 bytes * 4 channels)
Even the theoretical max of 76.8 GBytes/s (614 Gbit/s) is not enough for
the 193 GBytes/s or 1545 Gbit/s needed by 8 NIC/GPU pairs.
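The same calculation as a small Python sketch (parameters are the
testlab values above; treating the 4 DIMM slots as 4 usable memory
channels is my assumption):

  # Theoretical peak DRAM bandwidth = transfers/s * bytes/transfer * channels.
  mt_per_sec = 2400 * 10**6    # Configured Memory Speed: 2400 MT/s
  bytes_per_xfer = 8           # Data Width: 64 bits
  channels = 4                 # assumption: 4 DIMM slots = 4 channels
  peak = mt_per_sec * bytes_per_xfer * channels
  print(peak / 10**9)          # 76.8 GBytes/s
  print(peak * 8 / 10**9)      # 614.4 Gbit/s, well below 1545 Gbit/s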
When testing this with the lmbench tool bw_mem, the results (below my
signature) are around 14.8 GBytes/s (118 Gbit/s) as soon as the working
set exceeds the L3 cache size. In practice it looks like main memory is
limited to reading 118 Gbit/s *once*. (Mina's NICs each run at 200 Gbit/s.)
Given that DDIO can deliver network packets into L3, I also tried to
figure out the L3 read bandwidth, which I measured to be 42.4 GBytes/s
(339 Gbit/s), in hopes that it would be enough, but it was not.
--Jesper
(data below signature)
CPU under test:
$ cat /proc/cpuinfo | egrep -e 'model name|cache size' | head -2
model name : Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
cache size : 15360 KB
Providing some cmdline outputs from the lmbench "bw_mem" tool.
(Output format is "%0.2f %.2f\n", megabytes, megabytes_per_second)
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rd
256.00 14924.50
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M wr
256.00 9895.25
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rdwr
256.00 9737.54
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bcopy
256.00 12462.88
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bzero
256.00 14869.89
The next outputs show that reducing the buffer size below the L3 cache
size gives an increase in speed, likely reflecting the L3 bandwidth.
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 64M rd
64.00 14840.58
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 32M rd
32.00 14823.97
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 16M rd
16.00 24743.86
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 8M rd
8.00 40852.26
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 4M rd
4.00 42545.65
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 2M rd
2.00 42447.82
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 1M rd
1.00 42447.82
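For completeness, the Gbit/s figures quoted above are derived from the
bw_mem MB/s outputs like this (a trivial conversion sketch; I assume
bw_mem reports decimal megabytes per second):

  # Convert bw_mem MB/s output to Gbit/s (decimal units assumed).
  def mbytes_to_gbit(mb_per_sec):
      return mb_per_sec * 10**6 * 8 / 10**9

  print(mbytes_to_gbit(14840.58))   # ~118.7 Gbit/s: main-memory read
  print(mbytes_to_gbit(42447.82))   # ~339.6 Gbit/s: L3-resident read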