Date: Mon, 26 Dec 2022 18:46:08 +0800
From: Dust Li <dust.li@linux.alibaba.com>
To: Niklas Schnelle, Wen Gu, kgraul@linux.ibm.com, wenjia@linux.ibm.com,
    jaka@linux.ibm.com, davem@davemloft.net, edumazet@google.com,
    kuba@kernel.org, pabeni@redhat.com
Cc: linux-s390@vger.kernel.org, netdev@vger.kernel.org,
    linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH net-next v2 0/5] net/smc:Introduce SMC-D based loopback acceleration
Message-ID: <20221226104608.GD40720@linux.alibaba.com>
References: <1671506505-104676-1-git-send-email-guwen@linux.alibaba.com>
 <42f2972f1dfe45a2741482f36fbbda5b5a56d8f1.camel@linux.ibm.com>
In-Reply-To: <42f2972f1dfe45a2741482f36fbbda5b5a56d8f1.camel@linux.ibm.com>

On Tue, Dec 20, 2022 at 03:02:45PM +0100, Niklas Schnelle wrote:
>On Tue, 2022-12-20 at 11:21 +0800, Wen Gu wrote:
>> Hi, all
>>
>> # Background
>>
>> As previously mentioned in [1], we (Alibaba Cloud) are trying to use SMC
>> to accelerate TCP applications in cloud environments, improving inter-host
>> or inter-VM communication.
>>
>> In addition to this, we also found value in SMC-D for local inter-process
>> communication, such as accelerating communication between containers on
>> the same host. So this RFC tries to provide an SMC-D loopback solution for
>> that scenario, bringing a significant improvement in latency and throughput
>> compared to TCP loopback.
>>
>> # Design
>>
>> This patch set provides a kind of SMC-D loopback solution.
>>
>> Patch #1/5 and #2/5 provide an SMC-D based dummy device, preparing for the
>> inter-process communication acceleration. Besides loopback acceleration,
>> the dummy device can also meet the requirement mentioned in [2]: providing
>> a way to test SMC-D logic for the broader community without an ISM device.
>>
>> +------------------------------------------+
>> |  +-----------+          +-----------+    |
>> |  | process A |          | process B |    |
>> |  +-----------+          +-----------+    |
>> |        ^                      ^          |
>> |        |    +---------------+ |          |
>> |        |    |   SMC stack   | |          |
>> |        +--->| +-----------+ |<+          |
>> |             | |   dummy   | |            |
>> |             | |  device   | |            |
>> |             +-+-----------+-+            |
>> |                                       VM |
>> +------------------------------------------+
>>
>> Patch #3/5, #4/5 and #5/5 provide a way to avoid copying data from sndbuf
>> to RMB and so improve SMC-D loopback performance. By extending smcd_ops
>> with two new semantics, attach_dmb and detach_dmb, the sender's sndbuf
>> shares the same physical memory region with the receiver's RMB. Data
>> copied from userspace to the sender's sndbuf directly reaches the
>> receiver's RMB without an extra memory copy inside the kernel.
>>
>> +----------+                  +----------+
>> | socket A |                  | socket B |
>> +----------+                  +----------+
>>       |                             ^
>>       |          +---------+        |
>>  regard as       |         | -------|
>> local sndbuf     |   B's   |    regard as
>>       |          |   RMB   |   local RMB
>>       |--------> |         |
>>                  +---------+
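
For anyone following the thread without the patches at hand, the proposed
extension boils down to roughly the sketch below. Only the attach_dmb and
detach_dmb pair comes from the series; the surrounding names and signatures
are simplified placeholders rather than the actual include/net/smc.h
definitions:

    /* Illustrative sketch only -- not the actual include/net/smc.h layout. */
    #include <linux/types.h>

    struct smcd_dev;        /* an SMC-D device instance                     */
    struct smcd_dmb;        /* DMB descriptor: token, length, cpu_addr, ... */

    struct smcd_ops {
            /* existing callbacks: DMB registration and the copy-based path */
            int (*register_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
            int (*unregister_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
            int (*move_data)(struct smcd_dev *dev, u64 dmb_tok,
                             unsigned int idx, bool sf, unsigned int offset,
                             void *data, unsigned int size);

            /* proposed by patches #3-#5: attach the peer's DMB (its RMB) to
             * the local connection so the sender's sndbuf and the receiver's
             * RMB reference the same physical pages, letting the loopback
             * device skip the copy that move_data() would otherwise do */
            int (*attach_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
            int (*detach_dmb)(struct smcd_dev *dev, u64 token);
    };

The dummy device from patches #1-#2 would implement these ops entirely in
software, which is what makes such an attach/detach mapping possible on a
single host.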
>
>Hi Wen Gu,
>
>I maintain the s390 specific PCI support in Linux and would like to
>provide a bit of background on this. You're surely wondering why we
>even have a copy in there for our ISM virtual PCI device. To understand
>why this copy operation exists and why we need to keep it working, one
>needs a bit of s390 aka mainframe background.
>
>On s390 all (currently supported) native machines have a mandatory
>machine level hypervisor. All OSs, whether z/OS or Linux, run either on
>this machine level hypervisor as so-called Logical Partitions (LPARs)
>or as second/third/… level guests on e.g. a KVM or z/VM hypervisor that
>in turn runs in an LPAR. Now, in terms of memory this machine level
>hypervisor, sometimes called PR/SM, is, unlike KVM, z/VM, or VMWare, a
>partitioning hypervisor without paging. This is one of the main reasons
>for the very-near-native performance of the machine hypervisor, as the
>memory of its guests acts just like native RAM on other systems. It is
>never paged out and is always accessible to IOMMU-translated DMA from
>devices without the need for pinning pages, and besides a trivial
>offset/limit adjustment an LPAR's MMU does the same amount of work as
>an MMU on a bare-metal x86_64/ARM64 box.
>
>It also means, however, that when SMC-D is used to communicate between
>LPARs via an ISM device there is no way of mapping the DMBs to the
>same physical memory, as there exists no MMU-like layer spanning
>partitions that could do such a mapping. Meanwhile, for machine level
>firmware, including the ISM virtual PCI device, it is still possible to
>_copy_ memory between different memory partitions. So yeah, while I do
>see the appeal of skipping the memcpy() for loopback, or even between
>guests of a paging hypervisor such as KVM which can map the DMBs onto
>the same physical memory, we must keep in mind this original use case
>requiring a copy operation.
>
>Thanks,
>Niklas
>
>>
>> # Benchmark Test
>>
>>  * Test environments:
>>       - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>>       - SMC sndbuf/RMB size 1MB.
>>
>>  * Test object:
>>       - TCP: run on TCP loopback.
>>       - domain: run on UNIX domain.
>>       - SMC lo: run on SMC loopback device with patch #1/5 ~ #2/5.
>>       - SMC lo-nocpy: run on SMC loopback device with patch #1/5 ~ #5/5.
>>
>> 1. ipc-benchmark (see [3])
>>
>>  - ./ -c 1000000 -s 100
>>
>>                  TCP         domain            SMC-lo             SMC-lo-nocpy
>> Message
>> rate (msg/s)    75140    129548(+72.41%)   152266(+102.64%)   151914(+102.17%)
>
>Interesting that it does beat UNIX domain sockets. Also, see my below
>comment for nginx/wrk as this seems very similar.
>
>>
>> 2. sockperf
>>
>>  - serv: taskset -c sockperf sr --tcp
>>  - clnt: taskset -c sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>>
>>                    TCP            SMC-lo             SMC-lo-nocpy
>> Bandwidth(MBps)   4943.359    4936.096(-0.15%)    8239.624(+66.68%)
>> Latency(us)       6.372       3.359(-47.28%)      3.25(-49.00%)
>>
>> 3. iperf3
>>
>>  - serv: taskset -c iperf3 -s
>>  - clnt: taskset -c iperf3 -c 127.0.0.1 -t 15
>>
>>                  TCP        SMC-lo          SMC-lo-nocpy
>> Bitrate(Gb/s)    40.5       41.4(+2.22%)    76.4(+88.64%)
>>
>> 4. nginx/wrk
>>
>>  - serv: nginx
>>  - clnt: wrk -t 8 -c 500 -d 30 http://127.0.0.1:80
>>
>>                  TCP            SMC-lo                SMC-lo-nocpy
>> Requests/s    154643.22     220894.03(+42.84%)     226754.3(+46.63%)
>
>
>This result is very interesting indeed. So with the much more realistic
>nginx/wrk workload it seems the copy hurts much less than the
>iperf3/sockperf numbers would suggest, while SMC-D itself seems to help
>more. I'd hope that this translates to actual applications as well. Maybe
>this makes SMC-D based loopback interesting even while keeping the
>copy, at least until we can come up with a sane way to work a no-copy
>variant into SMC-D?

Yes, SMC-D based loopback shows clear advantages over TCP loopback, with
or without the copy.

The advantage of zero-copy should show up when we transfer a large amount
of data. In this wrk/nginx case the file served to the client is small,
so we don't see much gain. If we used a large file (e.g. >= 1MB), I think
we would observe a very different result.

Thanks!