From: Christoffer Dall <cdall@linaro.org>
To: James Morse <james.morse@arm.com>
Cc: Tyler Baicar <tbaicar@codeaurora.org>,
Marc Zyngier <marc.zyngier@arm.com>,
Punit Agrawal <punit.agrawal@arm.com>,
linux-arm-kernel@lists.infradead.org,
kvmarm@lists.cs.columbia.edu, gengdongjiu <gengdj.1984@gmail.com>
Subject: Re: [PATCH] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
Date: Mon, 27 Mar 2017 14:44:59 +0200
Message-ID: <20170327124459.GA21871@cbox>
In-Reply-To: <58D8FEF8.9000205@arm.com>
On Mon, Mar 27, 2017 at 01:00:56PM +0100, James Morse wrote:
> Hi guys,
>
> On 27/03/17 12:20, Punit Agrawal wrote:
> > Christoffer Dall <cdall@linaro.org> writes:
> >> On Wed, Mar 15, 2017 at 04:07:27PM +0000, James Morse wrote:
> >>> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64[0], notifications for
> >>> broken memory can call memory_failure() in mm/memory-failure.c to deliver
> >>> SIGBUS to any user space process using the page, and notify all the
> >>> in-kernel users.
> >>>
> >>> If the page corresponded with guest memory, KVM will unmap this page
> >>> from its stage2 page tables. The user space process that allocated
> >>> this memory may have never touched this page in which case it may not
> >>> be mapped meaning SIGBUS won't be delivered.
> >>>
> >>> When this happens KVM discovers pfn == KVM_PFN_ERR_HWPOISON when it
> >>> comes to process the stage2 fault.
> >>>
> >>> Do as x86 does, and deliver the SIGBUS when we discover
> >>> KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
> >>> as this matches the user space mapping size.
>
> >>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> >>> index 962616fd4ddd..9d1aa294e88f 100644
> >>> --- a/arch/arm/kvm/mmu.c
> >>> +++ b/arch/arm/kvm/mmu.c
> >>> @@ -20,8 +20,10 @@
> >>> #include <linux/kvm_host.h>
> >>> #include <linux/io.h>
> >>> #include <linux/hugetlb.h>
> >>> +#include <linux/sched/signal.h>
> >>> #include <trace/events/kvm.h>
> >>> #include <asm/pgalloc.h>
> >>> +#include <asm/siginfo.h>
> >>> #include <asm/cacheflush.h>
> >>> #include <asm/kvm_arm.h>
> >>> #include <asm/kvm_mmu.h>
> >>> @@ -1237,6 +1239,23 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
> >>> __coherent_cache_guest_page(vcpu, pfn, size);
> >>> }
> >>>
> >>> +static void kvm_send_hwpoison_signal(unsigned long address, bool hugetlb)
> >>> +{
> >>> +	siginfo_t info;
> >>> +
> >>> +	info.si_signo = SIGBUS;
> >>> +	info.si_errno = 0;
> >>> +	info.si_code = BUS_MCEERR_AR;
> >>> +	info.si_addr = (void __user *)address;
> >>> +
> >>> +	if (hugetlb)
> >>> +		info.si_addr_lsb = PMD_SHIFT;
> >>> +	else
> >>> +		info.si_addr_lsb = PAGE_SHIFT;
> >>> +
> >>> +	send_sig_info(SIGBUS, &info, current);
> >>> +}
> >>> +
> >>> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >>> struct kvm_memory_slot *memslot, unsigned long hva,
> >>> unsigned long fault_status)
> >>> @@ -1306,6 +1325,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >>> smp_rmb();
> >>>
> >>> pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
> >>> + if (pfn == KVM_PFN_ERR_HWPOISON) {
> >>> + kvm_send_hwpoison_signal(hva, hugetlb);
> >>
> >> The way this is called means that we'll only notify userspace of a huge
> >> mapping if userspace is mapping hugetlbfs, and not because the stage2
> >> mapping may or may not have used transparent huge pages when the error
> >> was discovered. Is this the desired semantics?
>
> No,
>
>
> > I think so.
> >
> > AFAIUI, transparent hugepages are split before being poisoned while all
> > the underlying pages of a hugepage are poisoned together, i.e., no
> > splitting.
>
> In which case I need to look into this some more!
>
> My thinking was we should report the size that was knocked out of the stage2 to
> avoid the guest repeatedly faulting until it has touched every guest-page-size
> in the stage2 hole.
By signaling something in the fault path, I think it's going to be very
hard to backtrack what the stage 2 page tables looked like when faults
started happening, because I think these are completely decoupled events
(the mmu notifier and the later fault).
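
To put a rough number on the repeated-fault concern quoted above, here is
a small sketch. It assumes a 4K granule, where PAGE_SHIFT would be 12 and
PMD_SHIFT would be 21; both values depend on the kernel configuration,
and the helper name pages_in_block() is made up for illustration only.

```c
#include <assert.h>

/* Assumed values for a 4K granule; both depend on kernel configuration. */
#define EX_PAGE_SHIFT 12
#define EX_PMD_SHIFT  21

/*
 * Hypothetical helper: number of base pages covered by one block of
 * size (1 << block_shift). If only page-size errors are reported, this
 * is an upper bound on how many separate faults could be needed to
 * cover a hole of that size.
 */
static unsigned long pages_in_block(unsigned int block_shift,
				    unsigned int page_shift)
{
	assert(block_shift >= page_shift);
	return 1UL << (block_shift - page_shift);
}
```

Under those assumptions, pages_in_block(EX_PMD_SHIFT, EX_PAGE_SHIFT) is
512, so a guest could in the worst case fault 512 times on one PMD-sized
hole if each signal only covers a single page.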
>
> Reading the code in that kvm/mmu.c it looked like the mapping sizes would always
> be the same as those used by userspace.
I think the mapping sizes should be the same between userspace and KVM,
but the mapping size of a particular page (and associated pages) may
vary over time.
>
> If the page was split before KVM could have taken this fault I assumed it would
> fault on the page-size mapping and hugetlb would be false.
I think you could have a huge page which gets unmapped as a result of
it getting split (perhaps because there was a failure on one page), and
later, as you fault, you can discover a range which can be backed by
hugetlbfs or transparent huge pages.

What I don't know is how Linux behaves when a page is marked with
hwpoison. If, in that case, Linux never supports THP and always marks an
entire hugetlbfs huge page with the poison, then I think we're mostly
good here. If not, we should make sure we align with whatever the rest
of the kernel does.
> (which is already
> wrong for another reason, looks like I grabbed the variable before
> transparent_hugepage_adjust() has had a go at it.).
>
Yes, which is why I asked if you only care about hugetlbfs.
>
> >> Also notice that the hva is not necessarily aligned to the beginning of
> >> the huge page, so can we be giving userspace wrong information by
> >> pointing in the middle of a huge page and telling it there was an
> >> address error in the size of the PMD ?
> >>
> >
> > I could be reading it wrong but I think we are fine here - the address
> > (hva) is the location that faulted. And the lsb indicates the least
> > significant bit of the faulting address (See man sigaction(2)). The
> > receiver of the signal is expected to use the address and lsb to work out
> > the extent of corruption.
>
> kill_proc() in mm/memory-failure.c does this too, but the address is set by
> page_address_in_vma() in add_to_kill() of the same file. (I'll chat with Punit
> off list.)
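
For reference, this is roughly how I would expect a userspace receiver to
combine the two fields, going by the si_addr_lsb description in
sigaction(2). It is only a sketch: struct poison_range and
poison_range_from_siginfo() are names I made up, and a real handler would
take the values from the siginfo_t it is handed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: the region a receiver would treat as bad. */
struct poison_range {
	uintptr_t start;	/* first byte of the affected region */
	uintptr_t len;		/* size of the affected region in bytes */
};

/*
 * Bits of the address below si_addr_lsb are not significant, so round
 * the reported address down to a (1 << si_addr_lsb) boundary and treat
 * that whole naturally aligned block as affected.
 */
static struct poison_range
poison_range_from_siginfo(uintptr_t si_addr, unsigned int si_addr_lsb)
{
	struct poison_range r;

	assert(si_addr_lsb < sizeof(uintptr_t) * 8);
	r.len = (uintptr_t)1 << si_addr_lsb;
	r.start = si_addr & ~(r.len - 1);
	return r;
}
```

With this reading, an hva in the middle of a huge page plus an lsb of 21
still resolves to the enclosing 2MB-aligned block, so the unaligned
address by itself would not mislead a receiver that masks it this way.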
>
>
> > Though I missed a subtlety while reviewing the patch before. The
> > reported lsb should be for the userspace hugepage mapping (i.e., hva)
> > and not for the stage 2.
>
> I thought these were always supposed to be the same, and using hugetlb was a bug
> because I didn't look closely enough at what is_vm_hugetlb_page() does.
>
>
> > In light of this, I'd like to retract my Reviewed-by tag for this
> > version of the patch as I believe we'll need to change the lsb
> > reporting.
>
> Sure, let's work out what this should be doing. I'm beginning to suspect x86's
> 'always page size' was correct to begin with!
>
I had a sense of that too, but it would be good to understand how you
mark an individual page within a hugetlbfs huge page with hwpoison...
Thanks,
-Christoffer