Date: Wed, 3 May 2023 17:18:13 -0400
From: Peter Xu
To: Anish Moorthy
Cc: Nadav Amit, Axel Rasmussen, Paolo Bonzini, maz@kernel.org, oliver.upton@linux.dev, Sean Christopherson, James Houghton, bgardon@google.com, dmatlack@google.com, ricarkol@google.com, kvm, kvmarm@lists.linux.dev
Subject: Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
References: <46DD705B-3A3F-438E-A5B1-929C1E43D11F@gmail.com> <84DD9212-31FB-4AF6-80DD-9BA5AEA0EC1A@gmail.com>

On Wed, May 03, 2023 at 12:45:07PM -0700, Anish Moorthy wrote:
> On Thu, Apr 27, 2023 at 1:26 PM Peter Xu wrote:
> >
> > Thanks (for doing this test, and also to Nadav for all his inputs), and
> > sorry for a late response.
>
> No need to apologize: anyways, I've got you comfortably beat on being
> late at this point :)
>
> > These numbers caught my eye, and I'm very curious why even 2 vcpus can
> > scale that bad.
> >
> > I gave it a shot on a test machine and I got something slightly different:
> >
> >   Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (20 cores, 40 threads)
> >   $ ./demand_paging_test -b 512M -u MINOR -s shmem -v N
> >   |-------+----------+--------|
> >   | n_thr | per-vcpu | total  |
> >   |-------+----------+--------|
> >   |     1 | 39.5K    | 39.5K  |
> >   |     2 | 33.8K    | 67.6K  |
> >   |     4 | 31.8K    | 127.2K |
> >   |     8 | 30.8K    | 246.1K |
> >   |    16 | 21.9K    | 351.0K |
> >   |-------+----------+--------|
> >
> > I used larger ram due to less cores. I didn't try 32+ vcpus to make sure I
> > don't have two threads content on a core/thread already since I only got 40
> > hardware threads there, but still we can compare with your lower half.
> >
> > When I was testing I noticed bad numbers and another bug on not using
> > NSEC_PER_SEC properly, so I did this before the test:
> >
> >   https://lore.kernel.org/all/20230427201112.2164776-1-peterx@redhat.com/
> >
> > I think it means it still doesn't scale that good, however not so bad
> > either - no obvious 1/2 drop on using 2vcpus. There're still a bunch of
> > paths triggered in the test so I also don't expect it to fully scale
> > linearly. From my numbers I just didn't see as drastic as yours. I'm not
> > sure whether it's simply broken test number, parameter differences
> > (e.g. you used 64M only per-vcpu), or hardware differences.
>
> Hmm, I suspect we're dealing with hardware differences here. I
> rebased my changes onto those two patches you sent up, taking care not
> to clobber them, but even with the repro command you provided my
> results look very different than yours (at least on 1-4 vcpus) on the
> machine I've been testing on (4x AMD EPYC 7B13 64-Core, 2.2GHz).
>
> (n=20)
> n_thr   per_vcpu   total
> 1       154K       154K
> 2       92k        184K
> 4       71K        285K
> 8       36K        291K
> 16      19K        310K
>
> Out of interested I tested on another machine (Intel(R) Xeon(R)
> Platinum 8273CL CPU @ 2.20GHz) as well, and results are a bit
> different again
>
> (n=20)
> n_thr   per_vcpu   total
> 1       115K       115K
> 2       103k       206K
> 4       65K        262K
> 8       39K        319K
> 16      19K        398K

Interesting.

>
> It is interesting how all three sets of numbers start off different
> but seem to converge around 16 vCPUs. I did check to make sure the
> memory fault exits sped things up in all cases, and that at least
> stays true.
>
> By the way, I've got a little helper script that I've been using to
> run/average the selftest results (which can vary quite a bit). I've
> attached it below- hopefully it doesn't bounce from the mailing list.
> Just for reference, the invocation to test the command you provided is
>
> > python dp_runner.py --num_runs 20 --max_cores 16 --percpu_mem 512M

I found that indeed I shouldn't have stopped at 16 vcpus, since that's
exactly where it starts to bottleneck. :)

So out of curiosity I profiled the 32-vcpu case on my system with this
test, in two configurations:

  - 1 uffd + 8 readers
  - 32 uffds (so 32 readers)

I've attached the flamegraphs for both.

It seems that when using >1 uffds the bottleneck is no longer the spinlock
but something else. From what I got there, vmx_vcpu_load() shows up more
prominently than the spinlocks. I think that's the TLB flush broadcast.

OTOH, with 1 uffd we can clearly see the overhead of spinlock contention on
either the fault() path or read()/poll(), as you and James rightfully
pointed out.

I'm not sure whether my numbers are caused by my particular setup, though.
After all I only had 40 hardware threads and I started 32 vcpus + 8 readers,
so there will already be contention between the workloads.

IMHO this means there's still a chance to provide a more generic
userfaultfd scaling solution, as long as we can remove the single spinlock
contention on the fault/fault_pending queues. I'll see whether I can
explore that possibility a bit further and keep you guys updated. The
general idea to me is still to make a multi-queue out of 1 uffd.

I _think_ this might also be a positive result for your work: if the
bottleneck is not userfaultfd (once we scale it by creating multiple uffds,
ignoring the split vma effect), then it cannot be resolved by scaling
userfaultfd alone anyway. So a general solution, even if one existed, may
not work here for kvm, because we'd already get stuck somewhere else.
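
As a rough cross-check of the three tables quoted in this thread, the
per-vcpu rate at each thread count can be expressed as a fraction of the
1-vcpu rate. This is only a sketch: the numbers are copied from the tables
(in thousands, as printed), while the names RESULTS and scaling_efficiency
are made up for illustration and are not part of the selftest or of
dp_runner.py.

```python
# Per-vcpu rates copied from the tables in this thread (in thousands, as
# printed there), keyed by number of vcpu threads. The dict and function
# names below are hypothetical; they do not come from the selftest.
RESULTS = {
    "Xeon E5-2630 v4": {1: 39.5, 2: 33.8, 4: 31.8, 8: 30.8, 16: 21.9},
    "EPYC 7B13": {1: 154, 2: 92, 4: 71, 8: 36, 16: 19},
    "Xeon Platinum 8273CL": {1: 115, 2: 103, 4: 65, 8: 39, 16: 19},
}


def scaling_efficiency(per_vcpu):
    """Return {n_thr: per-vcpu rate at n_thr / per-vcpu rate at 1 thread}.

    A value of 1.0 means perfect linear scaling; lower values mean each
    vcpu gets less throughput as more vcpus are added.
    """
    base = per_vcpu[1]
    return {n: rate / base for n, rate in sorted(per_vcpu.items())}


if __name__ == "__main__":
    for machine, per_vcpu in RESULTS.items():
        eff = scaling_efficiency(per_vcpu)
        row = "  ".join(f"{n}:{e:.2f}" for n, e in eff.items())
        print(f"{machine:22s} {row}")
```

Read this way, at 16 vcpus the E5-2630 v4 keeps about 55% of its 1-vcpu
rate, while the EPYC 7B13 keeps about 12%. That matches the observation
that the absolute per-vcpu numbers converge near 16 vcpus even though the
machines start from very different baselines.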
--
Peter Xu

[Attachment: uffd-1-reader-8.svg (image/svg+xml flamegraph; content not preserved)]