From: Jesper Dangaard Brouer <jbrouer@redhat.com>
To: Mina Almasry <almasrymina@google.com>, Jason Gunthorpe <jgg@ziepe.ca>
Cc: brouer@redhat.com, "Christian König" <christian.koenig@amd.com>,
"Hari Ramakrishnan" <rharix@google.com>,
"David Ahern" <dsahern@kernel.org>,
"Samiullah Khawaja" <skhawaja@google.com>,
"Willem de Bruijn" <willemb@google.com>,
"Jakub Kicinski" <kuba@kernel.org>,
"Christoph Hellwig" <hch@lst.de>,
"John Hubbard" <jhubbard@nvidia.com>,
"Dan Williams" <dan.j.williams@intel.com>,
"Jesper Dangaard Brouer" <jbrouer@redhat.com>,
"Alexander Duyck" <alexander.duyck@gmail.com>,
"Yunsheng Lin" <linyunsheng@huawei.com>,
davem@davemloft.net, pabeni@redhat.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org,
"Lorenzo Bianconi" <lorenzo@kernel.org>,
"Yisen Zhuang" <yisen.zhuang@huawei.com>,
"Salil Mehta" <salil.mehta@huawei.com>,
"Eric Dumazet" <edumazet@google.com>,
"Sunil Goutham" <sgoutham@marvell.com>,
"Geetha sowjanya" <gakula@marvell.com>,
"Subbaraya Sundeep" <sbhatta@marvell.com>,
hariprasad <hkelam@marvell.com>,
"Saeed Mahameed" <saeedm@nvidia.com>,
"Leon Romanovsky" <leon@kernel.org>,
"Felix Fietkau" <nbd@nbd.name>,
"Ryder Lee" <ryder.lee@mediatek.com>,
"Shayne Chen" <shayne.chen@mediatek.com>,
"Sean Wang" <sean.wang@mediatek.com>,
"Kalle Valo" <kvalo@kernel.org>,
"Matthias Brugger" <matthias.bgg@gmail.com>,
"AngeloGioacchino Del Regno"
<angelogioacchino.delregno@collabora.com>,
"Jesper Dangaard Brouer" <hawk@kernel.org>,
"Ilias Apalodimas" <ilias.apalodimas@linaro.org>,
linux-rdma@vger.kernel.org, linux-wireless@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-mediatek@lists.infradead.org,
"Jonathan Lemon" <jonathan.lemon@gmail.com>,
logang@deltatee.com, "Bjorn Helgaas" <bhelgaas@google.com>
Subject: Re: Memory providers multiplexing (Was: [PATCH net-next v4 4/5] page_pool: remove PP_FLAG_PAGE_FRAG flag)
Date: Mon, 24 Jul 2023 16:56:27 +0200
Message-ID: <a2569132-393e-0149-f76c-f6de282e1c96@redhat.com>
In-Reply-To: <CAHS8izNMB-H3w0CE9kj6hT5q_F6_XJy_X_HtZwmisOEDhp31yg@mail.gmail.com>
On 17/07/2023 03.53, Mina Almasry wrote:
> On Fri, Jul 14, 2023 at 8:55 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>
>> On Fri, Jul 14, 2023 at 07:55:15AM -0700, Mina Almasry wrote:
>>
>>> Once the skb frags with struct new_abstraction are in the TCP stack,
>>> they will need some special handling in code accessing the frags. But
>>> my RFC already addressed that somewhat because the frags were
>>> inaccessible in that case. In this case the frags will be both
>>> inaccessible and will not be struct pages at all (things like
>>> get_page() will not work), so more special handling will be required,
>>> maybe.
>>
>> It seems sort of reasonable, though there will be interesting concerns
>> about coherence and synchronization with general-purpose DMABUFs that
>> will need tackling.
>>
>> Still, it is such a lot of churn and weirdness on the netdev side, I
>> think you'd do well to present an actual full application as
>> justification.
>>
>> Yes, you showed you can stick unordered TCP data frags into GPU memory
>> sort of quickly, but have you gone further with this to actually show
>> it is useful for a real world GPU centric application?
>>
>> BTW your cover letter said 96% utilization; the usual server
>> configuration is one NIC per GPU, so you were able to hit 1500Gb/sec of
>> TCP BW with this?
>>
>
> I do notice that the number of NICs is missing from our public
> documentation so far, so I will refrain from specifying how many NICs
> are on those A3 VMs until the information is public. But I think I can
> confirm that your general thinking is correct: the perf that we're
> getting is 96.6% of line rate for each GPU/NIC pair,
What do you mean by 96.6% "line rate"?
Is it the Ethernet line-rate?
Is the measured throughput the TCP data "goodput"?
Assuming:
- MTU 1500 bytes (1514 on wire).
- Ethernet header 14 bytes
- IP header 20 bytes
- TCP header 20 bytes
Due to header overhead the goodput will be approx 96.4%.
- (1514-(14+20+20))/1514 = 0.9643
- (Not taking Ethernet interframe gap into account).
Thus, maybe you have hit Ethernet wire line-rate already?
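For reproducibility, here is the same arithmetic as a minimal Python
sketch; the second variant, which also counts FCS plus preamble and
interframe gap, is my own addition and not part of the 96.4% figure:

  # Goodput fraction of a single MTU-1500 TCP segment on Ethernet.
  eth_hdr, ip_hdr, tcp_hdr = 14, 20, 20
  mtu = 1500
  frame = mtu + eth_hdr                # 1514 bytes on the wire
  payload = mtu - ip_hdr - tcp_hdr     # 1460 bytes of TCP payload
  print(payload / frame)               # 0.9643 -> approx 96.4%
  # My addition: add FCS (4B) and preamble+SFD+IFG (20B) per frame.
  print(payload / (frame + 4 + 20))    # 0.9493 -> approx 94.9%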
> and scales linearly
> for each NIC/GPU pair we've tested with so far. Line rate of each
> NIC/GPU pair is 200 Gb/sec.
>
> So if we have 8 NIC/GPU pairs we'd be hitting 96.6% * 200 * 8 = 1545 GB/sec.
Let's keep our units straight.
Here you mean 1545 Gbit/s, which is 193 GBytes/s.
> If we have, say, 2 NIC/GPU pairs, we'd be hitting 96.6% * 200 * 2 = 384 GB/sec
Here you mean 384 Gbit/s, which is 48 GBytes/s.
> ...
> etc.
>
These massive throughput numbers are important, because they *exceed*
the physical host RAM/DIMM memory speeds.
This is the *real argument* for why software cannot afford even a single
copy of the data from host RAM into GPU memory: the CPU's memory
throughput to DRAM/DIMMs is simply insufficient.
My testlab CPU (Xeon E5-1650 v4) has 4 DDR4 DIMM slots:
- Data Width: 64 bits (= 8 bytes)
- Configured Memory Speed: 2400 MT/s
- Theoretical maximum memory bandwidth: 76.8 GBytes/s (2400 MT/s * 8 bytes * 4 channels)
Even the theoretical max of 76.8 GBytes/s (614 Gbit/s) is not enough for
the 193 GBytes/s or 1545 Gbit/s needed by 8 NIC/GPU pairs.
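The same calculation as a small Python sketch (parameters are the
testlab values above; treating the 4 DIMM slots as 4 usable memory
channels is my assumption):

  # Theoretical peak DRAM bandwidth = transfers/s * bytes/transfer * channels.
  mt_per_sec = 2400 * 10**6    # Configured Memory Speed: 2400 MT/s
  bytes_per_xfer = 8           # Data Width: 64 bits
  channels = 4                 # assumption: 4 DIMM slots = 4 channels
  peak = mt_per_sec * bytes_per_xfer * channels
  print(peak / 10**9)          # 76.8 GBytes/s
  print(peak * 8 / 10**9)      # 614.4 Gbit/s, well below 1545 Gbit/s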
When testing this with the lmbench tool bw_mem, the results (below my
signature) are around 14.8 GBytes/s (118 Gbit/s) as soon as the working
set exceeds the L3 cache size. In practice it looks like main memory is
limited to reading 118 Gbit/s *once*. (Mina's NICs each run at 200 Gbit/s.)
Given that DDIO can deliver network packets into L3, I also tried to
figure out the L3 read bandwidth, which I measured to be 42.4 GBytes/s
(339 Gbit/s), in hopes that it would be enough, but it was not.
--Jesper
(data below signature)
CPU under test:
$ cat /proc/cpuinfo | egrep -e 'model name|cache size' | head -2
model name : Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
cache size : 15360 KB
Providing some cmdline outputs from the lmbench "bw_mem" tool.
(Output format is "%0.2f %.2f\n", megabytes, megabytes_per_second)
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rd
256.00 14924.50
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M wr
256.00 9895.25
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rdwr
256.00 9737.54
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bcopy
256.00 12462.88
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bzero
256.00 14869.89
The next outputs show that reducing the buffer size below the L3 cache
size gives an increase in speed, likely reflecting the L3 bandwidth.
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 64M rd
64.00 14840.58
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 32M rd
32.00 14823.97
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 16M rd
16.00 24743.86
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 8M rd
8.00 40852.26
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 4M rd
4.00 42545.65
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 2M rd
2.00 42447.82
$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 1M rd
1.00 42447.82
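For completeness, the Gbit/s figures quoted above are derived from the
bw_mem MB/s outputs like this (a trivial conversion sketch; I assume
bw_mem reports decimal megabytes per second):

  # Convert bw_mem MB/s output to Gbit/s (decimal units assumed).
  def mbytes_to_gbit(mb_per_sec):
      return mb_per_sec * 10**6 * 8 / 10**9

  print(mbytes_to_gbit(14840.58))   # ~118.7 Gbit/s: main-memory read
  print(mbytes_to_gbit(42447.82))   # ~339.6 Gbit/s: L3-resident read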