From: Dust Li <dust.li@linux.alibaba.com>
To: Niklas Schnelle <schnelle@linux.ibm.com>,
	Wen Gu <guwen@linux.alibaba.com>,
	kgraul@linux.ibm.com, wenjia@linux.ibm.com, jaka@linux.ibm.com,
	davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com
Cc: linux-s390@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH net-next v2 0/5] net/smc:Introduce SMC-D based loopback acceleration
Date: Mon, 26 Dec 2022 18:46:08 +0800	[thread overview]
Message-ID: <20221226104608.GD40720@linux.alibaba.com> (raw)
In-Reply-To: <42f2972f1dfe45a2741482f36fbbda5b5a56d8f1.camel@linux.ibm.com>

On Tue, Dec 20, 2022 at 03:02:45PM +0100, Niklas Schnelle wrote:
>On Tue, 2022-12-20 at 11:21 +0800, Wen Gu wrote:
>> Hi, all
>> 
>> # Background
>> 
>> As previously mentioned in [1], we (Alibaba Cloud) are trying to use SMC
>> to accelerate TCP applications in cloud environments, improving inter-host
>> or inter-VM communication.
>> 
>> In addition to this, we also found SMC-D valuable for local
>> inter-process communication, such as accelerating communication between
>> containers within the same host. So this RFC tries to provide an SMC-D
>> loopback solution for such scenarios, bringing a significant improvement
>> in latency and throughput compared to TCP loopback.
>> 
>> # Design
>> 
>> This patch set provides a kind of SMC-D loopback solution.
>> 
>> Patches #1/5 and #2/5 provide an SMC-D based dummy device, preparing for
>> inter-process communication acceleration. Besides loopback acceleration,
>> the dummy device also meets the requirement mentioned in [2]: providing a
>> way for the broader community to test SMC-D logic without an ISM device.
>> 
>>  +------------------------------------------+
>>  |  +-----------+           +-----------+   |
>>  |  | process A |           | process B |   |
>>  |  +-----------+           +-----------+   |
>>  |       ^                        ^         |
>>  |       |    +---------------+   |         |
>>  |       |    |   SMC stack   |   |         |
>>  |       +--->| +-----------+ |<--|         |
>>  |            | |   dummy   | |             |
>>  |            | |   device  | |             |
>>  |            +-+-----------+-+             |
>>  |                   VM                     |
>>  +------------------------------------------+
>> 
>> Patches #3/5, #4/5 and #5/5 provide a way to avoid the data copy from sndbuf
>> to RMB and improve SMC-D loopback performance. By extending smcd_ops with two
>> new operations, attach_dmb and detach_dmb, the sender's sndbuf shares the same
>> physical memory region with the receiver's RMB. Data copied from userspace
>> to the sender's sndbuf thus directly reaches the receiver's RMB without an
>> extra memory copy within the same kernel.
>> 
>>  +----------+                     +----------+
>>  | socket A |                     | socket B |
>>  +----------+                     +----------+
>>        |                               ^
>>        |         +---------+           |
>>   regard as      |         | ----------|
>>   local sndbuf   |  B's    |     regard as
>>        |         |  RMB    |     local RMB
>>        |-------> |         |
>>                  +---------+
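
For readers who have not opened the patches yet: the extension boils down to
two new callbacks in smcd_ops. The fragment below is only a sketch of their
semantics with stand-in types; the exact prototypes are the ones in patch
#3/5, not necessarily these.

	/* Sketch only -- stand-in types, not the kernel's real definitions.
	 * attach_dmb() lets the sender map the receiver's already-registered
	 * DMB so that sndbuf and RMB share the same physical pages;
	 * detach_dmb() undoes that mapping when the connection is torn down.
	 */
	struct smcd_dev;        /* SMC-D device (stand-in declaration) */
	struct smcd_dmb;        /* DMB descriptor (stand-in declaration) */

	struct smcd_ops_sketch {
		int (*attach_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
		int (*detach_dmb)(struct smcd_dev *dev, unsigned long long token);
	};

For transports where no such shared mapping exists between the two sides,
these callbacks would simply not be provided and the existing copy path
stays (more on that in Niklas' reply below).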
>
>Hi Wen Gu,
>
>I maintain the s390 specific PCI support in Linux and would like to
>provide a bit of background on this. You're surely wondering why we
>even have a copy in there for our ISM virtual PCI device. To understand
>why this copy operation exists and why we need to keep it working, one
>needs a bit of s390 aka mainframe background.
>
>On s390 all (currently supported) native machines have a mandatory
>machine level hypervisor. All OSs whether z/OS or Linux run either on
>this machine level hypervisor as so called Logical Partitions (LPARs)
>or as second/third/… level guests on e.g. a KVM or z/VM hypervisor that
>in turn runs in an LPAR. Now, in terms of memory, this machine level
>hypervisor, sometimes called PR/SM, is, unlike KVM, z/VM, or VMware, a
>partitioning hypervisor without paging. This is one of the main reasons
>for the very-near-native performance of the machine hypervisor as the
>memory of its guests acts just like native RAM on other systems. It is
>never paged out and always accessible to IOMMU translated DMA from
>devices without the need for pinning pages and besides a trivial
>offset/limit adjustment an LPAR's MMU does the same amount of work as
>an MMU on a bare metal x86_64/ARM64 box.
>
>It also means, however, that when SMC-D is used to communicate between
>LPARs via an ISM device, there is no way of mapping the DMBs to the
>same physical memory as there exists no MMU-like layer spanning
>partitions that could do such a mapping. Meanwhile for machine level
>firmware including the ISM virtual PCI device it is still possible to
>_copy_ memory between different memory partitions. So yeah, while I do
>see the appeal of skipping the memcpy() for loopback or even between
>guests of a paging hypervisor such as KVM, which can map the DMBs on
>the same physical memory, we must keep in mind this original use case
>requiring a copy operation.
>
>Thanks,
>Niklas
>
>> 
>> # Benchmark Test
>> 
>>  * Test environments:
>>       - VM with an 8-core Intel Xeon Platinum CPU @ 2.50GHz, 16 GiB memory.
>>       - SMC sndbuf/RMB size 1MB.
>> 
>>  * Test object:
>>       - TCP: runs over TCP loopback.
>>       - domain: runs over UNIX domain sockets.
>>       - SMC-lo: runs over the SMC loopback device with patches #1/5 ~ #2/5.
>>       - SMC-lo-nocpy: runs over the SMC loopback device with patches #1/5 ~ #5/5.
>> 
>> 1. ipc-benchmark (see [3])
>> 
>>  - ./<foo> -c 1000000 -s 100
>> 
>>                        TCP              domain              SMC-lo             SMC-lo-nocpy
>> Message
>> rate (msg/s)         75140     129548(+72.41%)    152266(+102.64%)         151914(+102.17%)
>
>Interesting that it does beat UNIX domain sockets. Also, see my below
>comment for nginx/wrk as this seems very similar.
>
>> 
>> 2. sockperf
>> 
>>  - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>>  - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>> 
>>                        TCP                  SMC-lo             SMC-lo-nocpy
>> Bandwidth(MBps)   4943.359        4936.096(-0.15%)        8239.624(+66.68%)
>> Latency(us)          6.372          3.359(-47.28%)            3.25(-49.00%)
>> 
>> 3. iperf3
>> 
>>  - serv: <smc_run> taskset -c <cpu> iperf3 -s
>>  - clnt: <smc_run> taskset -c <cpu> iperf3 -c 127.0.0.1 -t 15
>> 
>>                        TCP                  SMC-lo             SMC-lo-nocpy
>> Bitrate(Gb/s)         40.5            41.4(+2.22%)            76.4(+88.64%)
>> 
>> 4. nginx/wrk
>> 
>>  - serv: <smc_run> nginx
>>  - clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80
>> 
>>                        TCP                  SMC-lo             SMC-lo-nocpy
>> Requests/s       154643.22      220894.03(+42.84%)        226754.3(+46.63%)
>
>
>This result is very interesting indeed. So with the much more realistic
>nginx/wrk workload, it seems the copy hurts much less than
>iperf3/sockperf would suggest, while SMC-D itself seems to help more.
>I'd hope that this translates to actual applications as well. Maybe
>this makes SMC-D based loopback interesting even while keeping the
>copy, at least until we can come up with a sane way to work a no-copy
>variant into SMC-D?

Yes, SMC-D based loopback shows great advantages over TCP loopback, with
or without copy.

The advantage of zero-copy should show up when we need to transfer
a large amount of data. But here in the wrk/nginx case, the test file
transferred from server to client is small, so we didn't see much gain.
If we used a large file (e.g. >= 1MB), I think we would observe a quite
different result.
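
For example, one could rerun the nginx/wrk test above against a larger static
file; the dd command below just creates a 1 MiB file to serve, and the docroot
path and file name are only placeholders:

 - serv: dd if=/dev/zero of=<nginx docroot>/1m.bin bs=1M count=1; <smc_run> nginx
 - clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80/1m.bin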

Thanks!




Thread overview: 18+ messages
2022-12-20  3:21 [RFC PATCH net-next v2 0/5] net/smc:Introduce SMC-D based loopback acceleration Wen Gu
2022-12-20  3:21 ` [RFC PATCH net-next v2 1/5] net/smc: introduce SMC-D loopback device Wen Gu
2023-01-19 16:25   ` Alexandra Winter
2023-01-30 16:30     ` Wen Gu
2022-12-20  3:21 ` [RFC PATCH net-next v2 2/5] net/smc: choose loopback device in SMC-D communication Wen Gu
2022-12-20  3:21 ` [RFC PATCH net-next v2 3/5] net/smc: add dmb attach and detach interface Wen Gu
2022-12-20  3:21 ` [RFC PATCH net-next v2 4/5] net/smc: avoid data copy from sndbuf to peer RMB in SMC-D loopback Wen Gu
2022-12-20  3:21 ` [RFC PATCH net-next v2 5/5] net/smc: logic of cursors update in SMC-D loopback connections Wen Gu
2022-12-20 14:02 ` [RFC PATCH net-next v2 0/5] net/smc:Introduce SMC-D based loopback acceleration Niklas Schnelle
2022-12-21 13:14   ` Wen Gu
2023-01-04 16:09     ` Alexandra Winter
2023-01-12 12:12       ` Wen Gu
2023-01-16 11:01         ` Wenjia Zhang
2023-01-18 12:15           ` Wen Gu
2023-01-19 12:30             ` Alexandra Winter
2023-01-30 16:27               ` Wen Gu
2022-12-26 10:46   ` Dust Li [this message]
2022-12-28 10:26 ` Wen Gu
