From: Stefan Hajnoczi
Date: Mon, 3 Jul 2023 11:03:34 +0200
Subject: Re: [PATCH] net: add initial support for AF_XDP network backend
To: Jason Wang
Cc: Ilya Maximets, Paolo Bonzini, Eric Blake, Stefan Hajnoczi, qemu-devel@nongnu.org

On Fri, 30 Jun 2023 at 09:41, Jason Wang wrote:
>
> On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi wrote:
> >
> > On Thu, 29 Jun 2023 at 07:26, Jason Wang wrote:
> > >
> > > On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi wrote:
> > > >
> > > > On Wed, 28 Jun 2023 at 10:19, Jason Wang wrote:
> > > > >
> > > > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi wrote:
> > > > > >
> > > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang wrote:
> > > > > > >
> > > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi wrote:
> > > > > > > >
> > > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets wrote:
> > > > > > > > > >
> > > > > > > > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang wrote:
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets wrote:
> > > > > > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS.
> > > > > > > > > > >> So, that might be one case. Taking into account that just rcu lock and
> > > > > > > > > > >> unlock in virtio-net code takes more time than a packet copy, some batching
> > > > > > > > > > >> on QEMU side should improve performance significantly. And it shouldn't be
> > > > > > > > > > >> too hard to implement.
> > > > > > > > > > >>
> > > > > > > > > > >> Performance over virtual interfaces may potentially be improved by creating
> > > > > > > > > > >> a kernel thread for async Tx. Similarly to what io_uring allows. Currently
> > > > > > > > > > >> Tx on non-zero-copy interfaces is synchronous, and that doesn't allow to
> > > > > > > > > > >> scale well.
> > > > > > > > > > >
> > > > > > > > > > > Interestingly, actually, there are a lot of "duplication" between
> > > > > > > > > > > io_uring and AF_XDP:
> > > > > > > > > > >
> > > > > > > > > > > 1) both have similar memory model (user register)
> > > > > > > > > > > 2) both use ring for communication
> > > > > > > > > > >
> > > > > > > > > > > I wonder if we can let io_uring talks directly to AF_XDP.
> > > > > > > > > >
> > > > > > > > > > Well, if we submit poll() in QEMU main loop via io_uring, then we can
> > > > > > > > > > avoid cost of the synchronous Tx for non-zero-copy modes, i.e. for
> > > > > > > > > > virtual interfaces. io_uring thread in the kernel will be able to
> > > > > > > > > > perform transmission for us.
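
(For illustration only, not code from the patch: the kind of asynchronous TX kick
described above could look roughly like this with liburing, where the usual
zero-byte AF_XDP sendto() wakeup is submitted to io_uring instead of being issued
synchronously. All names below are made up for the example.)

/* Illustrative only, not from the patch: kick the AF_XDP TX ring by
 * submitting the usual zero-byte send through io_uring so the kernel
 * performs the transmit asynchronously instead of in the caller.
 */
#include <liburing.h>
#include <sys/socket.h>

static int xsk_kick_tx_async(struct io_uring *ring, int xsk_fd)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    if (!sqe) {
        return -1;   /* SQ full; caller could fall back to sendto() */
    }

    /* A zero-byte send is the conventional AF_XDP TX doorbell. */
    io_uring_prep_send(sqe, xsk_fd, NULL, 0, MSG_DONTWAIT);
    io_uring_sqe_set_data(sqe, NULL);

    return io_uring_submit(ring);
}

The completion would then be reaped from the CQ in the event loop like any
other io_uring operation, so the submitting thread never blocks on the transmit.
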
> > > > > > > > >
> > > > > > > > > It would be nice if we can use iothread/vhost other than the main loop
> > > > > > > > > even if io_uring can use kthreads. We can avoid the memory translation
> > > > > > > > > cost.
> > > > > > > >
> > > > > > > > The QEMU event loop (AioContext) has io_uring code
> > > > > > > > (utils/fdmon-io_uring.c) but it's disabled at the moment. I'm working
> > > > > > > > on patches to re-enable it and will probably send them in July. The
> > > > > > > > patches also add an API to submit arbitrary io_uring operations so
> > > > > > > > that you can do stuff besides file descriptor monitoring. Both the
> > > > > > > > main loop and IOThreads will be able to use io_uring on Linux hosts.
> > > > > > >
> > > > > > > Just to make sure I understand. If we still need a copy from guest to
> > > > > > > io_uring buffer, we still need to go via memory API for GPA which
> > > > > > > seems expensive.
> > > > > > >
> > > > > > > Vhost seems to be a shortcut for this.
> > > > > >
> > > > > > I'm not sure how exactly you're thinking of using io_uring.
> > > > > >
> > > > > > Simply using io_uring for the event loop (file descriptor monitoring)
> > > > > > doesn't involve an extra buffer, but the packet payload still needs to
> > > > > > reside in AF_XDP umem, so there is a copy between guest memory and
> > > > > > umem.
> > > > >
> > > > > So there would be a translation from GPA to HVA (unless io_uring
> > > > > support 2 stages) which needs to go via qemu memory core. And this
> > > > > part seems to be very expensive according to my test in the past.
> > > >
> > > > Yes, but in the current approach where AF_XDP is implemented as a QEMU
> > > > netdev, there is already QEMU device emulation (e.g. virtio-net)
> > > > happening. So the GPA to HVA translation will happen anyway in device
> > > > emulation.
> > >
> > > Just to make sure we're on the same page.
> > >
> > > I meant, AF_XDP can do more than e.g. 10Mpps. So if we still use the
> > > QEMU netdev, it would be very hard to achieve that if we stick to
> > > using the Qemu memory core translations which need to take care about
> > > too much extra stuff. That's why I suggest using vhost in io threads
> > > which only cares about ram so the translation could be very fast.
> >
> > What does using "vhost in io threads" mean?
>
> It means a vhost userspace dataplane that is implemented via io threads.

AFAIK this does not exist today. QEMU's built-in devices that use
IOThreads don't use vhost code. QEMU vhost code is for vhost kernel,
vhost-user, or vDPA but not built-in devices that use IOThreads. The
built-in devices implement VirtioDeviceClass callbacks directly and
use AioContext APIs to run in IOThreads.

Do you have an idea for using vhost code for built-in devices? Maybe
it's fastest if you explain your idea and its advantages instead of me
guessing.

> > > > Regarding pinning - I wonder if that's something that can be refined
> > > > in the kernel by adding an AF_XDP flag that enables on-demand pinning
> > > > of umem. That way only rx and tx buffers that are currently in use
> > > > will be pinned. The disadvantage is the runtime overhead to pin/unpin
> > > > pages. I'm not sure whether it's possible to implement this, I haven't
> > > > checked the kernel code.
> > >
> > > It requires the device to do page faults which is not commonly
> > > supported nowadays.
> >
> > I don't understand this comment. AF_XDP processes each rx/tx
> > descriptor. At that point it can get_user_pages() or similar in order to
> > pin the page. When the memory is no longer needed, it can put those
> > pages. No fault mechanism is needed. What am I missing?
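
(To make that concrete, a very rough kernel-side sketch of per-descriptor
pinning might look like the following. It is not based on the existing
xsk/umem code and the helper names are invented.)

/* Hypothetical sketch, not actual xsk/umem kernel code: pin the page
 * backing one descriptor while the kernel is using it, instead of
 * pinning the whole umem up front at XDP_UMEM_REG time.
 */
#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/types.h>

/* Pin the single page that backs a umem chunk at user address 'addr'. */
static int xsk_pin_chunk_page(unsigned long addr, bool writable,
                              struct page **page)
{
    unsigned int gup_flags = writable ? FOLL_WRITE : 0;
    int ret;

    ret = pin_user_pages_fast(addr & PAGE_MASK, 1, gup_flags, page);
    return ret == 1 ? 0 : (ret < 0 ? ret : -EFAULT);
}

/* Drop the pin once the rx/tx descriptor has completed. */
static void xsk_unpin_chunk_page(struct page *page)
{
    unpin_user_page(page);
}

Even this fast GUP path has a real per-call cost, which is the PPS concern
raised in the reply below.
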
>
> Ok, I think I kind of get you, you mean doing pinning while processing
> rx/tx buffers? It's not easy since GUP itself is not very fast, it may
> hit PPS for sure.

Yes. It's not as fast as permanently pinning rx/tx buffers, but it
supports unpinned guest RAM. There are variations on this approach,
like keeping a certain amount of pages pinned after they have been
used so the cost of pinning/unpinning can be avoided when the same
pages are reused in the future, but I don't know how effective that is
in practice.

Is there a more efficient approach without relying on hardware page
fault support? My understanding is that hardware page fault support is
not yet deployed. We'd be left with pinning guest RAM permanently or
using a runtime pinning/unpinning approach like I've described.

Stefan
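
P.S. Purely illustrative (invented names, no locking or teardown): the "keep
recently used pages pinned" variation mentioned above could be as simple as a
small direct-mapped cache in front of pin_user_pages_fast(); whether its hit
rate is good enough in practice is exactly the open question.

/* Hypothetical sketch only: a tiny direct-mapped cache of recently pinned
 * umem pages, so descriptors that keep reusing the same pages do not pay
 * the pin_user_pages_fast()/unpin_user_page() cost on every packet.
 */
#include <linux/mm.h>

#define PIN_CACHE_SLOTS 256

struct pin_cache_entry {
    unsigned long uaddr;   /* page-aligned user address, 0 if empty */
    struct page *page;     /* pinned page backing that address */
};

struct pin_cache {
    struct pin_cache_entry slots[PIN_CACHE_SLOTS];
};

static struct page *pin_cache_get(struct pin_cache *pc, unsigned long uaddr)
{
    struct pin_cache_entry *e;
    struct page *page;

    uaddr &= PAGE_MASK;
    e = &pc->slots[(uaddr >> PAGE_SHIFT) % PIN_CACHE_SLOTS];

    if (e->uaddr == uaddr)
        return e->page;               /* hit: page is already pinned */

    if (pin_user_pages_fast(uaddr, 1, FOLL_WRITE, &page) != 1)
        return NULL;                  /* pin failed */

    if (e->uaddr)
        unpin_user_page(e->page);     /* evict the previous occupant */

    e->uaddr = uaddr;
    e->page = page;
    return page;
}
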