Date: Wed, 3 Aug 2022 16:41:19 -0700
From: Stephen Hemminger
To: Matthew Rosato
Cc: Tony Lu, kgraul@linux.ibm.com, wenjia@linux.ibm.com,
	davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, netdev@vger.kernel.org,
	linux-s390@vger.kernel.org, zmlcc@linux.alibaba.com,
	hans@linux.alibaba.com, zhiyuan2048@linux.alibaba.com,
	herongguang@linux.alibaba.com
Subject: Re: [RFC net-next 1/1] net/smc: SMC for inter-VM communication
Message-ID: <20220803164119.5955b442@hermes.local>
In-Reply-To: <0ccf9cc6-4916-7815-9ce2-990dc7884849@linux.ibm.com>
References: <20220720170048.20806-1-tonylu@linux.alibaba.com>
	<0ccf9cc6-4916-7815-9ce2-990dc7884849@linux.ibm.com>
X-Mailing-List: linux-s390@vger.kernel.org

On Wed, 3 Aug 2022 16:27:54 -0400
Matthew Rosato wrote:

> On 7/20/22 1:00 PM, Tony Lu wrote:
> > Hi all,
> > 
> > # Background
> > 
> > We (Alibaba Cloud) already use SMC in our cloud environment to
> > transparently accelerate TCP applications with ERDMA [1]. A common
> > scenario nowadays is deploying containers (whose runtime is based on
> > lightweight virtual machines) on ECS (Elastic Compute Service), where
> > the containers may be scheduled onto the same host in order to get
> > better network performance, e.g. for AI, big data, or other workloads
> > that are sensitive to bandwidth and latency. Currently, inter-VM
> > performance is poor and CPU resources are wasted (see #Benchmark, the
> > virtio rows). This scenario has been discussed many times, but a
> > solution that covers common applications is still missing [2] [3] [4].
> > 
> > # Design
> > 
> > For the inter-VM scenario, we use ivshmem (the Inter-VM shared memory
> > device) modeled by QEMU [5]. With it, multiple VMs can access one
> > shared memory region. The shared memory device is statically created
> > by the host and shared with the desired guests. The device is exposed
> > as a PCI BAR and can interrupt its peers (ivshmem-doorbell).
> > 
> > In order to use ivshmem in SMC, we wrote a draft device driver as a
> > bridge between SMC and the ivshmem PCI device. To make things easier,
> > the driver acts like an SMC-D device so that it fits into SMC without
> > modifying SMC's code; it is named ivpci (see patch #1).
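> > 
> > For illustration only, a minimal sketch of the doorbell path
> > mentioned above (register offsets follow the QEMU ivshmem v1 spec
> > [5]; the function and macro names are illustrative, not the actual
> > patch code):
> > 
> > 	#include <linux/io.h>
> > 
> > 	#define IVSHMEM_REG_IVPOSITION	0x08	/* our own peer id (RO)     */
> > 	#define IVSHMEM_REG_DOORBELL	0x0c	/* (peer_id << 16) | vector */
> > 
> > 	static void ivpci_notify_peer(void __iomem *regs, u16 peer_id,
> > 				      u16 vector)
> > 	{
> > 		/* Raise MSI-X interrupt 'vector' on peer 'peer_id'. */
> > 		iowrite32(((u32)peer_id << 16) | vector,
> > 			  regs + IVSHMEM_REG_DOORBELL);
> > 	}
> > 
> > The overall layout looks like this: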
> > 
> > ┌─────────────────────────────────────┐
> > │ ┌───────────────┐ ┌───────────────┐ │
> > │ │      VM1      │ │      VM2      │ │
> > │ │┌─────────────┐│ │┌─────────────┐│ │
> > │ ││ Application ││ ││ Application ││ │
> > │ │├─────────────┤│ │├─────────────┤│ │
> > │ ││     SMC     ││ ││     SMC     ││ │
> > │ │├─────────────┤│ │├─────────────┤│ │
> > │ ││    ivpci    ││ ││    ivpci    ││ │
> > │ └└─────────────┘┘ └└─────────────┘┘ │
> > │    x  *              x  *           │
> > │    x  ***************x* *           │
> > │    x  xxxxxxxxxxxxxxx*  *           │
> > │    x  x              *  *           │
> > │ ┌───────────────┐ ┌───────────────┐ │
> > │ │shared memories│ │ ivshmem-server│ │
> > │ └───────────────┘ └───────────────┘ │
> > │               HOST A                │
> > └─────────────────────────────────────┘
> >    *********** Control flow (interrupt)
> >    xxxxxxxxxxx Data flow (memory access)
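> > 
> > As a hedged sketch of what the driver maps at probe time (BAR0 holds
> > the registers and BAR2 the shared memory, per the ivshmem spec [5];
> > error handling is trimmed and all names are illustrative, not the
> > actual patch code):
> > 
> > 	#include <linux/pci.h>
> > 
> > 	#define IVSHMEM_VENDOR_ID	0x1af4	/* Red Hat, per ivshmem spec */
> > 	#define IVSHMEM_DEVICE_ID	0x1110
> > 
> > 	static int ivpci_probe(struct pci_dev *pdev,
> > 			       const struct pci_device_id *id)
> > 	{
> > 		void __iomem *regs, *shmem;
> > 
> > 		if (pcim_enable_device(pdev))
> > 			return -ENODEV;
> > 
> > 		regs  = pcim_iomap(pdev, 0, 0);	/* BAR0: doorbell/status regs */
> > 		shmem = pcim_iomap(pdev, 2, 0);	/* BAR2: the shared memory    */
> > 		if (!regs || !shmem)
> > 			return -ENOMEM;
> > 
> > 		/* ... save state, register with SMC as an SMC-D device ... */
> > 		return 0;
> > 	}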
> > 
> > Inside the ivpci driver, almost all of the operations of an SMC-D
> > device are implemented. They can be divided into two parts:
> > 
> > - Control flow: mostly the same as SMC-D; ivshmem doorbells trigger
> >   interrupts in ivpci, which then processes the CDC flow.
> > 
> > - Data flow: the shared memory for each connection is one large
> >   region divided into two parts, the local and the remote RMB. Every
> >   write syscall copies data into the sndbuf and calls the ISM-style
> >   move_data() to move the data into the remote RMB in ivshmem and
> >   interrupt the remote side (sketched below). The reader then
> >   receives the interrupt, checks the CDC message, and consumes the
> >   data if the cursor has been updated.
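> > 
> > A rough sketch of that callback, not the patch itself (the smcd_ops
> > and move_data() prototype are taken from include/net/smc.h as of this
> > writing; the ivpci_* helpers are hypothetical):
> > 
> > 	#include <linux/io.h>	/* memcpy_toio() */
> > 	#include <net/smc.h>	/* struct smcd_dev, struct smcd_ops */
> > 
> > 	static int ivpci_move_data(struct smcd_dev *smcd, u64 dmb_tok,
> > 				   unsigned int idx, bool sf,
> > 				   unsigned int offset, void *data,
> > 				   unsigned int size)
> > 	{
> > 		struct ivpci_dev *ivpci = ivpci_from_smcd(smcd);
> > 		/* Resolve the peer's RMB inside the shared BAR. */
> > 		void __iomem *rmb = ivpci_dmb_vaddr(ivpci, dmb_tok);
> > 
> > 		memcpy_toio(rmb + offset, data, size);	/* data flow       */
> > 		if (sf)					/* signal flag set */
> > 			ivpci_notify_peer(ivpci->regs, ivpci->peer_id, idx);
> > 		return 0;
> > 	}
> > 
> > 	static const struct smcd_ops ivpci_smcd_ops = {
> > 		.move_data = ivpci_move_data,
> > 		/* .register_dmb, .unregister_dmb, .signal_event, ... */
> > 	};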
> > 
> > # Benchmark
> > 
> > The current POC of ivpci is unstable and only works for a single SMC
> > connection. Here is the brief data:
> > 
> >   Items           Latency (pingpong)   Throughput (64KB)
> >   TCP (virtio)    19.3 us               3794.185 MBps
> >   TCP (SR-IOV)    13.2 us               3948.792 MBps
> >   SMC (ivshmem)    6.3 us              11900.269 MBps
> > 
> > Test environment:
> > 
> > - CPU: Intel Xeon Platinum, 8 cores; memory: 32 GiB
> > - NIC: Mellanox CX4 with 2 VFs in two different guests
> > - virtio-net + vhost set up with virsh
> > - sockperf, single connection
> > - The SMC + ivshmem throughput number uses one copy (userspace ->
> >   kernel) with an intrusive modification of SMC (see patch #1); the
> >   latency (pingpong) number uses two copies (user -> kernel plus the
> >   move_data() copy; patch version).
> > 
> > In this comparison, SMC with ivshmem achieves 3-4x the bandwidth at
> > half the latency.
> > 
> > TCP + virtio is the most commonly used solution for guests, but its
> > performance is lower. Moreover, it occupies an extra host thread,
> > with a full CPU core, to transfer data, wasting CPU resources. If
> > the host is very busy, performance gets even worse.
> 
> Hi Tony,
> 
> Quite interesting! FWIW for s390x we are also looking at passthrough
> of host ISM devices to enable SMC-D in QEMU guests:
> https://lore.kernel.org/kvm/20220606203325.110625-1-mjrosato@linux.ibm.com/
> https://lore.kernel.org/kvm/20220606203614.110928-1-mjrosato@linux.ibm.com/
> 
> But it seems to me an 'emulated ISM' of sorts could still be
> interesting even on s390x, e.g. for scenarios where host device
> passthrough is not possible or desired.
> 
> Out of curiosity I tried this ivpci module on s390x, but the device
> won't probe -- this is possibly an issue with the s390x PCI emulation
> layer in QEMU; I'll have to look into that.
> 
> > # Discussion
> > 
> > This RFC and its solution are still at an early stage, so we want to
> > bring them up as soon as possible and discuss them fully with IBM
> > and the community. We have some topics to put on the table:
> > 
> > 1. Should SMC officially support this scenario?
> > 
> > SMC + ivshmem shows a huge improvement for inter-VM communication.
> > SMC-D with a mocked ISM device might not be the official solution;
> > maybe this should be another SMC extension besides SMC-R and SMC-D.
> > So we are wondering whether SMC would accept this idea for this
> > scenario. Are there any other possibilities?
> 
> I am curious about ivshmem and its current state, though -- e.g.
> looking around I see mention of v2, which you also referenced, but I
> don't see any activity on it for a few years? And as far as v1
> ivshmem goes -- the server is "not for production use", etc.
> 
> Thanks,
> Matt
> 
> > 
> > 2. Implementation of SMC for inter-VM communication.
> > 
> > SMC is used in container and cloud environments; if possible, we
> > could propose a new device and a new protocol for these scenarios to
> > solve this problem.
> > 
> > 3. Standardizing this new protocol and device.
> > 
> > SMC-R has an open RFC (RFC 7609), so perhaps this new device or
> > protocol could be standardized like SMC-D. One possible option is to
> > propose a new device model in the QEMU + virtio ecosystem and have
> > SMC support that standard virtio device, like [6].
> > 
> > If there are any problems, please point them out.
> > 
> > Hope to hear from you, thank you.
> > 
> > [1] https://lwn.net/Articles/879373/
> > [2] https://projectacrn.github.io/latest/tutorials/enable_ivshmem.html
> > [3] https://dl.acm.org/doi/10.1145/2847562
> > [4] https://hal.archives-ouvertes.fr/hal-00368622/document
> > [5] https://github.com/qemu/qemu/blob/master/docs/specs/ivshmem-spec.txt
> > [6] https://github.com/siemens/jailhouse/blob/master/Documentation/ivshmem-v2-specification.md
> > 
> > Signed-off-by: Tony Lu

Also looks a lot like the existing VSOCK, which already has transports
for virtio, Hyper-V, and VMware.
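
For reference, the guest-side API there is just another socket family.
A minimal sketch (untested; the CID and port here are placeholders):

	#include <stdio.h>
	#include <sys/socket.h>
	#include <unistd.h>
	#include <linux/vm_sockets.h>

	int main(void)
	{
		int fd = socket(AF_VSOCK, SOCK_STREAM, 0);
		struct sockaddr_vm addr = {
			.svm_family = AF_VSOCK,
			.svm_cid = 2,	/* 2 = the host; guests are >= 3 */
			.svm_port = 1234,
		};

		if (fd < 0 || connect(fd, (struct sockaddr *)&addr,
				      sizeof(addr)) < 0) {
			perror("vsock");
			return 1;
		}
		close(fd);
		return 0;
	}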