Date: Wed, 3 Apr 2024 10:20:02 -0700
Subject: Re: [PATCH v2] KVM: Aggressively drop and reacquire mmu_lock during CLEAR_DIRTY_LOG
From: Sean Christopherson
To: maobibo
Cc: David Matlack, Paolo Bonzini, kvm@vger.kernel.org, Marc Zyngier, Oliver Upton, Tianrui Zhao, Huacai Chen, Michael Ellerman, Anup Patel, Christian Borntraeger, Janosch Frank, Claudio Imbrenda
References: <20240402213656.3068504-1-dmatlack@google.com>

On Wed, Apr 03, 2024, maobibo wrote:
>
> On 2024/4/3 5:36 AM, David Matlack wrote:
> > Aggressively drop and reacquire mmu_lock during CLEAR_DIRTY_LOG to avoid
> > blocking other threads (e.g. vCPUs taking page faults) for too long.
> >
> > Specifically, change kvm_clear_dirty_log_protect() to acquire/release
> > mmu_lock only when calling kvm_arch_mmu_enable_log_dirty_pt_masked(),
> > rather than around the entire for loop. This ensures that KVM will only
> > hold mmu_lock for the time it takes the architecture-specific code to
> > process up to 64 pages, rather than holding mmu_lock for log->num_pages,
> > which is controllable by userspace. This also avoids holding mmu_lock
> > when processing parts of the dirty_bitmap that are zero (i.e. when there
> > is nothing to clear).
> >
> > Moving the acquire/release points for mmu_lock should be safe since
> > dirty_bitmap_buffer is already protected by slots_lock, and dirty_bitmap
> > is already accessed with atomic_long_fetch_andnot(). And at least on x86
> > holding mmu_lock doesn't even serialize access to the memslot dirty
> > bitmap, as vCPUs can call mark_page_dirty_in_slot() without holding
> > mmu_lock.
> >
> > This change eliminates dips in guest performance during live migration
> > in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with
> > 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which
>
> Frequently dropping and reacquiring mmu_lock will cause the userspace
> migration process issuing CLEAR ioctls to contend with the 160 vCPUs, so
> migration may become slower.

Only if vCPUs actually acquire mmu_lock. E.g. on x86, KVM fixes/handles faults
due to write-protection for dirty logging without acquiring mmu_lock. So for
x86, taking faults that need to acquire mmu_lock while dirty logging should be
fairly uncommon (and if vCPUs are taking lots of faults, performance is likely
going to be bad no matter what).

> In theory, the priority of the userspace migration thread should be higher
> than that of the vCPU threads.

That's very debatable. And it's not an apples-to-apples comparison, because
CLEAR_DIRTY_LOG can hold mmu_lock for a very long time, probably orders of
magnitude longer than a vCPU will hold mmu_lock when handling a page fault.
And on x86 and ARM, page faults can be resolved while holding mmu_lock for
read. As a result, holding mmu_lock in CLEAR_DIRTY_LOG (for write) is
effectively more costly than holding it in vCPUs.
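For reference, the patch essentially moves the lock/unlock from around the
entire bitmap walk to around each 64-page chunk. A simplified sketch of the
resulting kvm_clear_dirty_log_protect() loop (based on the patch description,
not the literal diff; flush tracking and error handling omitted):

	for (offset = log->first_page, i = offset / BITS_PER_LONG,
	     n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--;
	     i++, offset += BITS_PER_LONG) {
		unsigned long mask = *dirty_bitmap_buffer++;
		atomic_long_t *p = (atomic_long_t *)&dirty_bitmap[i];

		if (!mask)
			continue;

		/* Atomically clear the to-be-CLEARed bits; safe without
		 * mmu_lock, as dirty_bitmap_buffer is protected by
		 * slots_lock. */
		mask &= atomic_long_fetch_andnot(mask, p);
		if (!mask)
			continue;

		/* Hold mmu_lock only while arch code processes <= 64 pages. */
		KVM_MMU_LOCK(kvm);
		kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset,
							mask);
		KVM_MMU_UNLOCK(kvm);
	}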
> Dropping and reacquiring mmu_lock every 64 pages may be a little too small;
> typically the granularity is one huge page. However, that should be decided
> by the framework maintainers :)

We could tweak the batching, but my very strong preference would be to do that
only as a last resort, i.e. if and only if some magic batch number provides
waaay better performance in all scenarios.

Maintaining code with arbitrary magic numbers is a pain, e.g. KVM x86's MMU has
arbitrary batching in kvm_zap_obsolete_pages(), and the justification for the
number is extremely handwavy (paraphrasing the changelog):

  Zap at least 10 shadow pages before releasing mmu_lock to reduce the
  overhead associated with re-acquiring the lock.

  Note: "10" is an arbitrary number, speculated to be high enough so that
  a vCPU isn't stuck zapping obsolete pages for an extended period, but
  small enough so that other vCPUs aren't starved waiting for mmu_lock.

I.e. we're stuck with someone's best guess from years ago without any real idea
if 10 is a good number, let alone optimal.

Obviously that doesn't mean 64 pages is optimal, but it's at least not
arbitrary, it's just an artifact of how KVM processes the bitmap.

To be clear, I'm not opposed to per-arch behavior, nor do I think x86 should
dictate how all other architectures should behave. But I would strongly prefer
to avoid per-arch behavior unless it's actually necessary (doubly so for
batching).

In other words, if we do need per-arch behavior, e.g. because aggressively
dropping mmu_lock causes performance issues on other architectures that need
to take mmu_lock for write to handle faults, I would prefer to have the arch
knob control whether the lock+unlock is outside versus inside the loop, not
control an arbitrary batch size.
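To illustrate the shape of that knob (a purely hypothetical sketch; the config
name and the chunk-iteration helper are invented, no such option exists), the
arch choice would select where the lock/unlock lands, not a batch size:

	bool coarse = IS_ENABLED(CONFIG_KVM_COARSE_CLEAR_DIRTY_LOG_LOCK);

	if (coarse)
		KVM_MMU_LOCK(kvm);
	for_each_clear_dirty_chunk(offset, mask) {	/* invented helper */
		if (!mask)
			continue;
		if (!coarse)
			KVM_MMU_LOCK(kvm);
		kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset,
							mask);
		if (!coarse)
			KVM_MMU_UNLOCK(kvm);
	}
	if (coarse)
		KVM_MMU_UNLOCK(kvm);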