From: Gavin Shan <gshan@redhat.com>
To: Marc Zyngier <maz@kernel.org>, Peter Xu <peterx@redhat.com>
Cc: kvmarm@lists.cs.columbia.edu,
linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
linux-kselftest@vger.kernel.org, pbonzini@redhat.com,
corbet@lwn.net, james.morse@arm.com, alexandru.elisei@arm.com,
suzuki.poulose@arm.com, oliver.upton@linux.dev,
catalin.marinas@arm.com, will@kernel.org, shuah@kernel.org,
seanjc@google.com, dmatlack@google.com, bgardon@google.com,
ricarkol@google.com, zhenyzha@redhat.com, shan.gavin@gmail.com
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking
Date: Fri, 26 Aug 2022 16:05:28 +1000 [thread overview]
Message-ID: <78c613e8-b600-119e-0d33-b049dd7c35ce@redhat.com> (raw)
In-Reply-To: <877d2xweae.wl-maz@kernel.org>
Hi Marc,
On 8/25/22 6:57 AM, Marc Zyngier wrote:
> On Wed, 24 Aug 2022 17:21:50 +0100,
> Peter Xu <peterx@redhat.com> wrote:
>>
>> On Wed, Aug 24, 2022 at 03:45:11PM +0100, Marc Zyngier wrote:
>>> On Wed, 24 Aug 2022 00:19:04 +0100,
>>> Peter Xu <peterx@redhat.com> wrote:
>>>>
>>>> On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
>>>>> Atomicity doesn't guarantee ordering, unfortunately.
>>>>
>>>> Right, sorry to be misleading. By "atomicity" I was trying to say that
>>>> the kernel will always see consistent updates on the fields.
>>>>
>>>> The ordering should also be guaranteed, because things must happen in
>>>> the sequence below:
>>>>
>>>> (1) kernel publish dirty GFN data (slot, offset)
>>>> (2) kernel publish dirty GFN flag (set to DIRTY)
>>>> (3) user sees DIRTY, collects (slots, offset)
>>>> (4) user sets it to RESET
>>>> (5) kernel reads RESET
>>>
>>> Maybe. Maybe not. The reset could well be sitting in the CPU write
>>> buffer for as long as it wants and not be seen by the kernel if the
>>> read occurs on another CPU. And that's the crucial bit: single-CPU is
>>> fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU
>>> on collection, and global on reset (this seems like a bad decision,
>>> but it is too late to fix this).
>>
>> Regarding the last statement, that's something I had questions about too and
>> discussed with Paolo, even though at that time it wasn't an outcome of
>> discussing memory ordering issues.
>>
>> IIUC the initial design was trying to avoid a TLB flush flood when the vcpu
>> number is large (each per-ring RESET, even for one page, would need all vcpus
>> to flush, so O(N^2) flushes are needed). With a global RESET it's O(N). So
>> it's kind of a trade-off, and indeed until now I'm not sure which one is
>> better. E.g., with per-ring reset, we can have locality in userspace too,
>> e.g. the vcpu thread might be able to recycle without holding global locks.
>
> I don't get that. On x86, each CPU must perform the TLB invalidation
> (there is an IPI for that). So whether you do a per-CPU scan of the
> ring or a global scan is irrelevant: each entry you find in any of the
> rings must result in a global invalidation, since you've updated the
> PTE to make the page RO.
>
> The same is true on ARM, except that the broadcast is done in HW
> instead of being tracked in SW.
>
> But anyway, this is all moot. The API is what it is, and it isn't
> going to change any time soon. All we can do is add some
> clarifications to the API for the more relaxed architectures, and make
> sure the kernel behaves accordingly.
>
> [...]
>
>>> It may be safe, but it isn't what the userspace API promises.
>>
>> The document says:
>>
>> After processing one or more entries in the ring buffer, userspace calls
>> the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about it, so that
>> the kernel will reprotect those collected GFNs. Therefore, the ioctl
>> must be called *before* reading the content of the dirty pages.
>>
>> I'd say it's not an explicit promise, but I think I agree with you that at
>> least it's unclear on the behavior.
>
> This is the least problematic part of the documentation. The bit I
> literally choke on is this:
>
> <quote>
> It's not necessary for userspace to harvest the all dirty GFNs at once.
> However it must collect the dirty GFNs in sequence, i.e., the userspace
> program cannot skip one dirty GFN to collect the one next to it.
> </quote>
>
> This is the core of the issue. Without ordering rules, the consumer on
> the other side cannot observe the updates correctly, even if userspace
> follows the above to the letter. Of course, the kernel itself must do
> the right thing (I guess it amounts to the kernel doing a
> load-acquire, and userspace doing a store-release -- effectively
> emulating x86...).
>
>> Since we have a global recycle mechanism, most likely the app (e.g. the
>> current qemu impl) will use the same thread to collect/reset dirty GFNs and
>> trigger the RESET ioctl(). In that case it's safe, IIUC, because there are
>> no cross-core ops.
>>
>> QEMU even guarantees this by checking it (kvm_dirty_ring_reap_locked):
>>
>>     if (total) {
>>         ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
>>         assert(ret == total);
>>     }
>>
>> I think the assert() should never trigger as mentioned above. But ideally
>> maybe it should just be a loop until the cleared GFN count matches the total.
>
> Right. If userspace calls the ioctl on every vcpu, then things should
> work correctly. It is only that the overhead is higher than what it
> should be if multiple vcpus perform a reset at the same time.
>
>>
>>> In other words, without further straightening of the API, this doesn't
>>> work as expected on relaxed memory architectures. So before this gets
>>> enabled on arm64, this whole ordering issue must be addressed.
>>
>> How about adding some more documentation for KVM_RESET_DIRTY_RINGS on the
>> possibility of recycling only part of the pages, especially when collection
>> and the ioctl() aren't done from the same thread?
>
> I'd rather tell people about the ordering rules. That will come at
> zero cost on x86.
>
>> Any suggestions will be greatly welcomed.
>
> I'll write a couple of patches when I get the time, most likely next
> week. Gavin will hopefully be able to take them as part of his series.
>
Thanks, Marc. Please let me know where I can check out the patches
when they're ready. I can include them in the next revision of this
series :)
Thanks,
Gavin
Thread overview: 33+ messages
2022-08-19 0:55 [PATCH v1 0/5] KVM: arm64: Enable ring-based dirty memory tracking Gavin Shan
2022-08-19 0:55 ` [PATCH v1 1/5] " Gavin Shan
2022-08-19 8:00 ` Marc Zyngier
2022-08-22 1:58 ` Gavin Shan
2022-08-22 18:55 ` Peter Xu
2022-08-23 3:19 ` Gavin Shan
2022-08-22 21:42 ` Marc Zyngier
2022-08-23 5:22 ` Gavin Shan
2022-08-23 13:58 ` Peter Xu
2022-08-23 19:17 ` Marc Zyngier
2022-08-23 21:20 ` Peter Xu
2022-08-23 22:47 ` Marc Zyngier
2022-08-23 23:19 ` Peter Xu
2022-08-24 14:45 ` Marc Zyngier
2022-08-24 16:21 ` Peter Xu
2022-08-24 20:57 ` Marc Zyngier
2022-08-26 6:05 ` Gavin Shan [this message]
2022-08-26 10:50 ` Paolo Bonzini
2022-08-26 15:49 ` Marc Zyngier
2022-08-27 8:27 ` Paolo Bonzini
2022-08-23 14:44 ` Oliver Upton
2022-08-23 20:35 ` Marc Zyngier
2022-08-26 10:58 ` Paolo Bonzini
2022-08-26 15:28 ` Marc Zyngier
2022-08-30 14:42 ` Peter Xu
2022-09-02 0:19 ` Paolo Bonzini
2022-08-19 0:55 ` [PATCH v1 2/5] KVM: selftests: Use host page size to map ring buffer in dirty_log_test Gavin Shan
2022-08-19 0:55 ` [PATCH v1 3/5] KVM: selftests: Dirty host pages " Gavin Shan
2022-08-19 5:28 ` Andrew Jones
2022-08-22 6:29 ` Gavin Shan
2022-08-23 3:09 ` Gavin Shan
2022-08-19 0:56 ` [PATCH v1 4/5] KVM: selftests: Clear dirty ring states between two modes " Gavin Shan
2022-08-19 0:56 ` [PATCH v1 5/5] KVM: selftests: Automate choosing dirty ring size " Gavin Shan