All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dust Li <dust.li@linux.alibaba.com>
To: Niklas Schnelle <schnelle@linux.ibm.com>,
	Wen Gu <guwen@linux.alibaba.com>,
	kgraul@linux.ibm.com, wenjia@linux.ibm.com, jaka@linux.ibm.com,
	davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com
Cc: linux-s390@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH net-next v2 0/5] net/smc:Introduce SMC-D based loopback acceleration
Date: Mon, 26 Dec 2022 18:46:08 +0800	[thread overview]
Message-ID: <20221226104608.GD40720@linux.alibaba.com> (raw)
In-Reply-To: <42f2972f1dfe45a2741482f36fbbda5b5a56d8f1.camel@linux.ibm.com>

On Tue, Dec 20, 2022 at 03:02:45PM +0100, Niklas Schnelle wrote:
>On Tue, 2022-12-20 at 11:21 +0800, Wen Gu wrote:
>> Hi, all
>> 
>> # Background
>> 
>> As previously mentioned in [1], we (Alibaba Cloud) are trying to use SMC
>> to accelerate TCP applications in cloud environment, improving inter-host
>> or inter-VM communication.
>> 
>> In addition of these, we also found the value of SMC-D in scenario of local
>> inter-process communication, such as accelerate communication between containers
>> within the same host. So this RFC tries to provide a SMC-D loopback solution
>> in such scenario, to bring a significant improvement in latency and throughput
>> compared to TCP loopback.
>> 
>> # Design
>> 
>> This patch set provides a kind of SMC-D loopback solution.
>> 
>> Patch #1/5 and #2/5 provide an SMC-D based dummy device, preparing for the
>> inter-process communication acceleration. Except for loopback acceleration,
>> the dummy device can also meet the requirements mentioned in [2], which is
>> providing a way to test SMC-D logic for broad community without ISM device.
>> 
>>  +------------------------------------------+
>>  |  +-----------+           +-----------+   |
>>  |  | process A |           | process B |   |
>>  |  +-----------+           +-----------+   |
>>  |       ^                        ^         |
>>  |       |    +---------------+   |         |
>>  |       |    |   SMC stack   |   |         |
>>  |       +--->| +-----------+ |<--|         |
>>  |            | |   dummy   | |             |
>>  |            | |   device  | |             |
>>  |            +-+-----------+-+             |
>>  |                   VM                     |
>>  +------------------------------------------+
>> 
>> Patch #3/5, #4/5, #5/5 provides a way to avoid data copy from sndbuf to RMB
>> and improve SMC-D loopback performance. Through extending smcd_ops with two
>> new semantic: attach_dmb and detach_dmb, sender's sndbuf shares the same
>> physical memory region with receiver's RMB. The data copied from userspace
>> to sender's sndbuf directly reaches the receiver's RMB without unnecessary
>> memory copy in the same kernel.
>> 
>>  +----------+                     +----------+
>>  | socket A |                     | socket B |
>>  +----------+                     +----------+
>>        |                               ^
>>        |         +---------+           |
>>   regard as      |         | ----------|
>>   local sndbuf   |  B's    |     regard as
>>        |         |  RMB    |     local RMB
>>        |-------> |         |
>>                  +---------+
>
>Hi Wen Gu,
>
>I maintain the s390 specific PCI support in Linux and would like to
>provide a bit of background on this. You're surely wondering why we
>even have a copy in there for our ISM virtual PCI device. To understand
>why this copy operation exists and why we need to keep it working, one
>needs a bit of s390 aka mainframe background.
>
>On s390 all (currently supported) native machines have a mandatory
>machine level hypervisor. All OSs whether z/OS or Linux run either on
>this machine level hypervisor as so called Logical Partitions (LPARs)
>or as second/third/… level guests on e.g. a KVM or z/VM hypervisor that
>in turn runs in an LPAR. Now, in terms of memory this machine level
>hypervisor sometimes called PR/SM unlike KVM, z/VM, or VMWare is a
>partitioning hypervisor without paging. This is one of the main reasons
>for the very-near-native performance of the machine hypervisor as the
>memory of its guests acts just like native RAM on other systems. It is
>never paged out and always accessible to IOMMU translated DMA from
>devices without the need for pinning pages and besides a trivial
>offset/limit adjustment an LPAR's MMU does the same amount of work as
>an MMU on a bare metal x86_64/ARM64 box.
>
>It also means however that when SMC-D is used to communicate between
>LPARs via an ISM device there is  no way of mapping the DMBs to the
>same physical memory as there exists no MMU-like layer spanning
>partitions that could do such a mapping. Meanwhile for machine level
>firmware including the ISM virtual PCI device it is still possible to
>_copy_ memory between different memory partitions. So yeah while I do
>see the appeal of skipping the memcpy() for loopback or even between
>guests of a paging hypervisor such as KVM, which can map the DMBs on
>the same physical memory, we must keep in mind this original use case
>requiring a copy operation.
>
>Thanks,
>Niklas
>
>> 
>> # Benchmark Test
>> 
>>  * Test environments:
>>       - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>>       - SMC sndbuf/RMB size 1MB.
>> 
>>  * Test object:
>>       - TCP: run on TCP loopback.
>>       - domain: run on UNIX domain.
>>       - SMC lo: run on SMC loopback device with patch #1/5 ~ #2/5.
>>       - SMC lo-nocpy: run on SMC loopback device with patch #1/5 ~ #5/5.
>> 
>> 1. ipc-benchmark (see [3])
>> 
>>  - ./<foo> -c 1000000 -s 100
>> 
>>                        TCP              domain              SMC-lo             SMC-lo-nocpy
>> Message
>> rate (msg/s)         75140      129548(+72.41)    152266(+102.64%)         151914(+102.17%)
>
>Interesting that it does beat UNIX domain sockets. Also, see my below
>comment for nginx/wrk as this seems very similar.
>
>> 
>> 2. sockperf
>> 
>>  - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>>  - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>> 
>>                        TCP                  SMC-lo             SMC-lo-nocpy
>> Bandwidth(MBps)   4943.359        4936.096(-0.15%)        8239.624(+66.68%)
>> Latency(us)          6.372          3.359(-47.28%)            3.25(-49.00%)
>> 
>> 3. iperf3
>> 
>>  - serv: <smc_run> taskset -c <cpu> iperf3 -s
>>  - clnt: <smc_run> taskset -c <cpu> iperf3 -c 127.0.0.1 -t 15
>> 
>>                        TCP                  SMC-lo             SMC-lo-nocpy
>> Bitrate(Gb/s)         40.5            41.4(+2.22%)            76.4(+88.64%)
>> 
>> 4. nginx/wrk
>> 
>>  - serv: <smc_run> nginx
>>  - clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80
>> 
>>                        TCP                  SMC-lo             SMC-lo-nocpy
>> Requests/s       154643.22      220894.03(+42.84%)        226754.3(+46.63%)
>
>
>This result is very interesting indeed. So with the much more realistic
>nginx/wrk workload it seems to copy hurts much less than the
>iperf3/sockperf would suggest while SMC-D itself seems to help more.
>I'd hope that this translates to actual applications as well. Maybe
>this makes SMC-D based loopback interesting even while keeping the
>copy, at least until we can come up with a sane way to work a no-copy
>variant into SMC-D?

Yes, SMC-D based loopback shows great advantages over TCP loopback, with
or without copy.

The advantage of zero-copy should be observed when we need to transfer
a large mount of data. But here in this wrk/nginx case, the test file
transferred from server to client is a small file. So we didn't see much gain.
If we use a large file(e.g >=1MB file), I think we should observe a much
different result.

Thinks!



  parent reply	other threads:[~2022-12-26 10:46 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-12-20  3:21 [RFC PATCH net-next v2 0/5] net/smc:Introduce SMC-D based loopback acceleration Wen Gu
2022-12-20  3:21 ` [RFC PATCH net-next v2 1/5] net/smc: introduce SMC-D loopback device Wen Gu
2023-01-19 16:25   ` Alexandra Winter
2023-01-30 16:30     ` Wen Gu
2022-12-20  3:21 ` [RFC PATCH net-next v2 2/5] net/smc: choose loopback device in SMC-D communication Wen Gu
2022-12-20  3:21 ` [RFC PATCH net-next v2 3/5] net/smc: add dmb attach and detach interface Wen Gu
2022-12-20  3:21 ` [RFC PATCH net-next v2 4/5] net/smc: avoid data copy from sndbuf to peer RMB in SMC-D loopback Wen Gu
2022-12-20  3:21 ` [RFC PATCH net-next v2 5/5] net/smc: logic of cursors update in SMC-D loopback connections Wen Gu
2022-12-20 14:02 ` [RFC PATCH net-next v2 0/5] net/smc:Introduce SMC-D based loopback acceleration Niklas Schnelle
2022-12-21 13:14   ` Wen Gu
2023-01-04 16:09     ` Alexandra Winter
2023-01-12 12:12       ` Wen Gu
2023-01-16 11:01         ` Wenjia Zhang
2023-01-18 12:15           ` Wen Gu
2023-01-19 12:30             ` Alexandra Winter
2023-01-30 16:27               ` Wen Gu
2022-12-26 10:46   ` Dust Li [this message]
2022-12-28 10:26 ` Wen Gu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20221226104608.GD40720@linux.alibaba.com \
    --to=dust.li@linux.alibaba.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=guwen@linux.alibaba.com \
    --cc=jaka@linux.ibm.com \
    --cc=kgraul@linux.ibm.com \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=schnelle@linux.ibm.com \
    --cc=wenjia@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.