Date: Fri, 9 Jul 2021 10:48:00 -0400
From: Peter Xu <peterx@redhat.com>
To: "Wang, Wei W"
Cc: Hailiang Zhang, David Hildenbrand, Juan Quintela, qemu-devel@nongnu.org,
 "Dr. David Alan Gilbert", Leonardo Bras Soares Passos
Subject: Re: [PATCH] migration: Move bitmap_mutex out of migration_bitmap_clear_dirty()
References: <20210630200805.280905-1-peterx@redhat.com>
 <33f137dae5c346078a3a7a658bb5f1ab@intel.com>
 <27cb8a0141fa493a8d4bb6bb918e8a82@intel.com>
 <2eb1ccd7105d4823898ea68a614d16f9@intel.com>

On Fri, Jul 09, 2021 at 08:58:08AM +0000, Wang, Wei W wrote:
> On Friday, July 9, 2021 2:31 AM, Peter Xu wrote:
> > > > Yes, I think this is the place where I didn't make myself clear.
> > > > It's not about sleeping; it's about the cmpxchg already being
> > > > expensive when the VM is huge.
> > >
> > > OK.
> > > How did you root-cause it to the cmpxchg, rather than lock contention
> > > (i.e. syscall and sleep) or some other code inside pthread_mutex_lock()?
> > > Do you have cycle counts for the cmpxchg vs. cycles spent in
> > > pthread_mutex_lock()?
> >
> > We've got "perf top -g" showing a huge amount of stacks lying in
> > pthread_mutex_lock().
>
> This only shows that pthread_mutex_lock is the cause; it is not root-caused
> to the cmpxchg.

I think that's already enough to prove we can move the lock elsewhere.  It's
not really a heavy race between threads; it's the pure overhead of calling it
too many times.

So it's not really a problem yet of "what type of lock we should use" (mutex
or spinlock) or "how this lock is implemented" (say, whether using cmpxchg
only, or optimizing with test + test-and-set, which sounds like a good
optimization of a pure test-and-set spinlock), because the lock is not busy
at all.  If we had two busy threads contending on the lock for real, that
would be another problem, imho.  It's not yet the case.
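(For reference, a minimal illustration of the "test + test-and-set" idea
mentioned above, written with C11 atomics.  This is only a sketch; the type
and function names are invented here and it is not QEMU's QemuSpin nor
glibc's mutex implementation:)

  /*
   * Illustrative test-and-test-and-set spinlock.  Waiters spin on plain
   * loads and only attempt the expensive atomic when the lock looks free.
   */
  #include <stdatomic.h>
  #include <stdbool.h>

  typedef struct {
      atomic_bool locked;
  } tts_lock;

  static inline void tts_lock_acquire(tts_lock *l)
  {
      for (;;) {
          /* "test": read-only spin, no atomic read-modify-write traffic
           * while the lock is held by someone else. */
          while (atomic_load_explicit(&l->locked, memory_order_relaxed)) {
              /* busy wait */
          }
          /* "test-and-set": try the atomic only once the lock looks free. */
          if (!atomic_exchange_explicit(&l->locked, true,
                                        memory_order_acquire)) {
              return;
          }
      }
  }

  static inline void tts_lock_release(tts_lock *l)
  {
      atomic_store_explicit(&l->locked, false, memory_order_release);
  }

Either way, the point above stands: when the lock is uncontended, the
dominant cost is simply how many times it is taken.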
> > > I checked the implementation of pthread_mutex_lock().  The code path
> > > for lock acquisition is long.  QemuSpin looks more efficient.
> > > (We probably also don't want the migration thread to sleep in any case.)
> >
> > I didn't check it, but I really hoped it would be very close to a
> > spinlock version when it's not contended.  We should be careful about
> > using spinlocks in userspace; e.g., moving the clear-log call into the
> > critical section would be too much and actually close to "wrong", imho,
> > because the kvm memslot lock inside it is sleepable.
>
> OK, that might be a good reason to keep clear_map out of the lock.
> We also don't want the lock holder to have more chances to go to sleep,
> even though it is a mutex.

Yes, that looks fine to me, but that's really another topic.  It's not a
requirement either, before a spinlock is justified to be better than a mutex
here.

> >
> > I think it's perfectly fine for the migration thread to sleep.  IMHO we
> > need explicit justification for each mutex that gets converted to a
> > spinlock in userspace.  So far I don't see it for this bitmap lock.
>
> What if the guest gets stopped and then the migration thread goes to sleep?

Doesn't the balloon code run in a standalone iothread?  Why would stopping
the guest stop the migration thread?

> >
> > Frankly, I also don't know how a spinlock would work reliably without
> > being able to disable scheduling: could the thread get scheduled out
> > while holding the spinlock?  Would another thread trying to take the
> > lock spin on that sleeping task?
>
> Yes, and it needs good justification:
> If it's a per-page spinlock, the granularity is very small, so there is
> very little chance that a lock holder gets scheduled out because its time
> slice runs out.  Even if that happens once, it seems like no big issue:
> the waiter wastes some CPU cycles, which is still better than having the
> migration thread scheduled out by a mutex, I think.

Yes, I believe there was good reason when Emilio introduced the lock; I
didn't measure it myself, so I can't tell.  It's just that we need more
justification that this mutex needs to be replaced with a spinlock.

> >
> > I think it's also better to see a comparison of migration throughput
> > data (i.e. pages per second) in the following cases before we make a
> > decision:
> > - per-page mutex
> > - per-page spinlock
> > - 50ms mutex
> >
> > I don't have the machines, so I can't do this.  I also see this as a
> > separate issue from using a spinlock; as I said before, I would prefer
> > some justification first rather than blindly running tests and picking
> > the patch with the higher number.
> >
> > I also hope we can reach a consensus that we fix one issue at a time.
> > This patch has already proven itself with some real workloads, so I hope
> > we can merge it first, either with a 50ms period or less.
> >
> > Per-page locking already goes against at least my instinct of how this
> > should be done; I just regret that I didn't see it as an issue when
> > reviewing the balloon patch, as I offered the r-b myself.  However, I
> > want to make it work as it did before, and then we can figure out a
> > better approach to how the lock is taken, and which lock we should use.
> > I don't see it as a must to do all of this in one patch.
>
> Sorry, it wasn't my intention to be a barrier to merging your patch.

No problem, and I appreciate all the discussion.  It's just that, as you can
see, I still prefer to keep it like this first.
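(To make the granularity point concrete, here is a rough, standalone sketch
of the two locking patterns being compared.  It uses plain pthreads since
that is the lock under discussion; the names DemoState, demo_scan_page and
page_count are invented for illustration and this is not the actual
migration/ram.c code:)

  #include <pthread.h>

  /* Hypothetical state for this sketch only. */
  typedef struct {
      pthread_mutex_t bitmap_mutex;
      unsigned long page_count;
  } DemoState;

  static void demo_scan_page(DemoState *s, unsigned long page)
  {
      /* stand-in for the real per-page work (clearing dirty bits, etc.) */
      (void)s; (void)page;
  }

  /* Per-page locking: the mutex is taken and released once per guest page,
   * so the locking overhead scales with guest memory size (the problem
   * reported above for huge VMs). */
  static void demo_scan_per_page_lock(DemoState *s)
  {
      for (unsigned long i = 0; i < s->page_count; i++) {
          pthread_mutex_lock(&s->bitmap_mutex);
          demo_scan_page(s, i);
          pthread_mutex_unlock(&s->bitmap_mutex);
      }
  }

  /* Coarser locking: hold the mutex across the scan so it is taken a
   * bounded number of times.  Roughly speaking, the patch being discussed
   * goes in this direction, with the mutex additionally dropped on a
   * bounded period (the 50ms mentioned above) rather than held for the
   * whole scan. */
  static void demo_scan_coarse_lock(DemoState *s)
  {
      pthread_mutex_lock(&s->bitmap_mutex);
      for (unsigned long i = 0; i < s->page_count; i++) {
          demo_scan_page(s, i);
      }
      pthread_mutex_unlock(&s->bitmap_mutex);
  }

The real code is of course more involved; the sketch only shows why the
number of lock operations, rather than the lock type, dominates here.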
free-page-hint is a great feature, though if we grep for it in libvirt we
find that it's not yet enabled anywhere.  That means it's not used by the
majority yet.  Meanwhile, the code path this patch changes runs for every
migration that QEMU starts.  What I'm afraid of is that we worry too much
about what most people are not using, while not really making things better
for the majority.

From what I've learned in the past few years, funnily enough "speed of
migration" is normally not what people care about the most.  Issues are
mostly with convergence and with being transparent to the users of the VMs,
so that they aren't even aware.  I admit this patch may not be the perfect
one, but that's the reason I wanted to do it in two steps: a spinlock
definitely needs more justification here, which I can't provide myself,
while this patch is verified to help.

> Just try to come up with better solutions if possible.
> - Option 1: a QemuSpin lock would reduce the lock acquisition overhead,
>   but we need to test whether that migration could converge;
> - Option 2: a conditional lock.  We haven't seen test results on whether
>   turning on the free page optimization with that per-page lock could
>   still make your 2TB guest migration converge.

Yes, after I checked libvirt I am more confident it's not with
free-page-hint, and as I said, I wonder why you think free-page-hint could
help a guest with heavy memory usage at all, since, as I mentioned, most of
the memory is in use.

If we want to further optimize it, I don't think it makes sense to choose
either option 1 or option 2 alone rather than taking them together.  But it
would be great if whoever proposes that new change also compares it against
the current patch's 50ms period, even if it is sleepable; I'm wondering
whether it will really make a measurable difference, especially if we can
also shrink that 50ms period, which is what I asked initially.

> Seems we lack resources for those tests right now.  If you are urgent for
> a decision to have it work first, I'm also OK with you merging it.

No, I can't merge it myself as I'm not the maintainer. :)  I haven't
received any acks yet, so at least I'll need to see what Dave and Juan
think.  It's just that I don't think QemuSpin could help much in this case,
and I don't want to mess up the issue either.

Thanks for being considerate.

-- 
Peter Xu