Linux Kernel Selftest development
 help / color / mirror / Atom feed
From: Gavin Shan <gshan@redhat.com>
To: Marc Zyngier <maz@kernel.org>, Peter Xu <peterx@redhat.com>
Cc: kvmarm@lists.cs.columbia.edu,
	linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kselftest@vger.kernel.org, pbonzini@redhat.com,
	corbet@lwn.net, james.morse@arm.com, alexandru.elisei@arm.com,
	suzuki.poulose@arm.com, oliver.upton@linux.dev,
	catalin.marinas@arm.com, will@kernel.org, shuah@kernel.org,
	seanjc@google.com, dmatlack@google.com, bgardon@google.com,
	ricarkol@google.com, zhenyzha@redhat.com, shan.gavin@gmail.com
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking
Date: Fri, 26 Aug 2022 16:05:28 +1000	[thread overview]
Message-ID: <78c613e8-b600-119e-0d33-b049dd7c35ce@redhat.com> (raw)
In-Reply-To: <877d2xweae.wl-maz@kernel.org>

Hi Marc,

On 8/25/22 6:57 AM, Marc Zyngier wrote:
> On Wed, 24 Aug 2022 17:21:50 +0100,
> Peter Xu <peterx@redhat.com> wrote:
>>
>> On Wed, Aug 24, 2022 at 03:45:11PM +0100, Marc Zyngier wrote:
>>> On Wed, 24 Aug 2022 00:19:04 +0100,
>>> Peter Xu <peterx@redhat.com> wrote:
>>>>
>>>> On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
>>>>> Atomicity doesn't guarantee ordering, unfortunately.
>>>>
>>>> Right, sorry to be misleading.  The "atomicity" part I was trying to say
>>>> the kernel will always see consistent update on the fields.
>>>>
>>>> The ordering should also be guaranteed, because things must happen with
>>>> below sequence:
>>>>
>>>>    (1) kernel publish dirty GFN data (slot, offset)
>>>>    (2) kernel publish dirty GFN flag (set to DIRTY)
>>>>    (3) user sees DIRTY, collects (slots, offset)
>>>>    (4) user sets it to RESET
>>>>    (5) kernel reads RESET
>>>
>>> Maybe. Maybe not. The reset could well be sitting in the CPU write
>>> buffer for as long as it wants and not be seen by the kernel if the
>>> read occurs on another CPU. And that's the crucial bit: single-CPU is
>>> fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU
>>> on collection, and global on reset (this seems like a bad decision,
>>> but it is too late to fix this).
>>
>> Regarding the last statement, that's something I had question too and
>> discussed with Paolo, even though at that time it's not a outcome of
>> discussing memory ordering issues.
>>
>> IIUC the initial design was trying to avoid tlb flush flood when vcpu
>> number is large (each RESET per ring even for one page will need all vcpus
>> to flush, so O(N^2) flushing needed). With global RESET it's O(N).  So
>> it's kind of a trade-off, and indeed until now I'm not sure which one is
>> better.  E.g., with per-ring reset, we can have locality too in userspace,
>> e.g. the vcpu thread might be able to recycle without holding global locks.
> 
> I don't get that. On x86, each CPU must perform the TLB invalidation
> (there is an IPI for that). So whether you do a per-CPU scan of the
> ring or a global scan is irrelevant: each entry you find in any of the
> rings must result in a global invalidation, since you've updated the
> PTE to make the page RO.
> 
> The same is true on ARM, except that the broadcast is done in HW
> instead of being tracked in SW.
> 
> Buy anyway, this is all moot. The API is what it is, and it isn't
> going to change any time soon. All we can do is add some
> clarifications to the API for the more relaxed architectures, and make
> sure the kernel behaves accordingly.
> 
> [...]
> 
>>> It may be safe, but it isn't what the userspace API promises.
>>
>> The document says:
>>
>>    After processing one or more entries in the ring buffer, userspace calls
>>    the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about it, so that
>>    the kernel will reprotect those collected GFNs.  Therefore, the ioctl
>>    must be called *before* reading the content of the dirty pages.
>>
>> I'd say it's not an explicit promise, but I think I agree with you that at
>> least it's unclear on the behavior.
> 
> This is the least problematic part of the documentation. The bit I
> literally choke on is this:
> 
> <quote>
> It's not necessary for userspace to harvest the all dirty GFNs at once.
> However it must collect the dirty GFNs in sequence, i.e., the userspace
> program cannot skip one dirty GFN to collect the one next to it.
> </quote>
> 
> This is the core of the issue. Without ordering rules, the consumer on
> the other side cannot observe the updates correctly, even if userspace
> follows the above to the letter. Of course, the kernel itself must do
> the right thing (I guess it amounts to the kernel doing a
> load-acquire, and userspace doing a store-release -- effectively
> emulating x86...).
> 
>> Since we have a global recycle mechanism, most likely the app (e.g. current
>> qemu impl) will use the same thread to collect/reset dirty GFNs, and
>> trigger the RESET ioctl().  In that case it's safe, IIUC, because no
>> cross-core ops.
>>
>> QEMU even guarantees this by checking it (kvm_dirty_ring_reap_locked):
>>
>>      if (total) {
>>          ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
>>          assert(ret == total);
>>      }
>>
>> I think the assert() should never trigger as mentioned above.  But ideally
>> maybe it should just be a loop until cleared gfns match total.
> 
> Right. If userspace calls the ioctl on every vcpu, then things should
> work correctly. It is only that the overhead is higher than what it
> should be if multiple vcpus perform a reset at the same time.
> 
>>
>>> In other words, without further straightening of the API, this doesn't
>>> work as expected on relaxed memory architectures. So before this gets
>>> enabled on arm64, this whole ordering issue must be addressed.
>>
>> How about adding some more documentation for KVM_RESET_DIRTY_RINGS on the
>> possibility of recycling partial of the pages, especially when collection
>> and the ioctl() aren't done from the same thread?
> 
> I'd rather tell people about the ordering rules. That will come at
> zero cost on x86.
> 
>> Any suggestions will be greatly welcomed.
> 
> I'll write a couple of patch when I get the time, most likely next
> week. Gavin will hopefully be able to take them as part of his series.
> 

Thanks, Marc. Please let me know where I can check out the patches
when they're ready. I can include the patches into this series in
next revision :)

Thanks,
Gavin



  reply	other threads:[~2022-08-26  6:05 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-19  0:55 [PATCH v1 0/5] KVM: arm64: Enable ring-based dirty memory tracking Gavin Shan
2022-08-19  0:55 ` [PATCH v1 1/5] " Gavin Shan
2022-08-19  8:00   ` Marc Zyngier
2022-08-22  1:58     ` Gavin Shan
2022-08-22 18:55       ` Peter Xu
2022-08-23  3:19         ` Gavin Shan
2022-08-22 21:42       ` Marc Zyngier
2022-08-23  5:22         ` Gavin Shan
2022-08-23 13:58           ` Peter Xu
2022-08-23 19:17             ` Marc Zyngier
2022-08-23 21:20               ` Peter Xu
2022-08-23 22:47                 ` Marc Zyngier
2022-08-23 23:19                   ` Peter Xu
2022-08-24 14:45                     ` Marc Zyngier
2022-08-24 16:21                       ` Peter Xu
2022-08-24 20:57                         ` Marc Zyngier
2022-08-26  6:05                           ` Gavin Shan [this message]
2022-08-26 10:50                   ` Paolo Bonzini
2022-08-26 15:49                     ` Marc Zyngier
2022-08-27  8:27                       ` Paolo Bonzini
2022-08-23 14:44         ` Oliver Upton
2022-08-23 20:35           ` Marc Zyngier
2022-08-26 10:58             ` Paolo Bonzini
2022-08-26 15:28               ` Marc Zyngier
2022-08-30 14:42                 ` Peter Xu
2022-09-02  0:19                   ` Paolo Bonzini
2022-08-19  0:55 ` [PATCH v1 2/5] KVM: selftests: Use host page size to map ring buffer in dirty_log_test Gavin Shan
2022-08-19  0:55 ` [PATCH v1 3/5] KVM: selftests: Dirty host pages " Gavin Shan
2022-08-19  5:28   ` Andrew Jones
2022-08-22  6:29     ` Gavin Shan
2022-08-23  3:09       ` Gavin Shan
2022-08-19  0:56 ` [PATCH v1 4/5] KVM: selftests: Clear dirty ring states between two modes " Gavin Shan
2022-08-19  0:56 ` [PATCH v1 5/5] KVM: selftests: Automate choosing dirty ring size " Gavin Shan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=78c613e8-b600-119e-0d33-b049dd7c35ce@redhat.com \
    --to=gshan@redhat.com \
    --cc=alexandru.elisei@arm.com \
    --cc=bgardon@google.com \
    --cc=catalin.marinas@arm.com \
    --cc=corbet@lwn.net \
    --cc=dmatlack@google.com \
    --cc=james.morse@arm.com \
    --cc=kvm@vger.kernel.org \
    --cc=kvmarm@lists.cs.columbia.edu \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=maz@kernel.org \
    --cc=oliver.upton@linux.dev \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=ricarkol@google.com \
    --cc=seanjc@google.com \
    --cc=shan.gavin@gmail.com \
    --cc=shuah@kernel.org \
    --cc=suzuki.poulose@arm.com \
    --cc=will@kernel.org \
    --cc=zhenyzha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox