* [PATCH] powerpc/mm/radix: Make Radix require HUGETLB_PAGE
From: Michael Ellerman @ 2019-04-17 7:41 UTC
To: linuxppc-dev; +Cc: stewart, aneesh.kumar, joel
Joel reported weird crashes using skiroot_defconfig, in his case we
jumped into an NX page:
kernel tried to execute exec-protected page (c000000002bff4f0) - exploit attempt? (uid: 0)
BUG: Unable to handle kernel instruction fetch
Faulting instruction address: 0xc000000002bff4f0
Looking at the disassembly, we had simply branched to that address:
c000000000c001bc 49fff335 bl c000000002bff4f0
But that didn't match the original kernel image:
c000000000c001bc 4bfff335 bl c000000000bff4f0 <kobject_get+0x8>
When STRICT_KERNEL_RWX is enabled, and we're using the radix MMU, we
call radix__change_memory_range() late in boot to change page
protections. We do that both to mark rodata read only and also to mark
init text no-execute. That involves walking the kernel page tables,
and clearing _PAGE_WRITE or _PAGE_EXEC respectively.
With radix we may use hugepages for the linear mapping, so the code in
radix__change_memory_range() uses e.g. pmd_huge() to test if it has
found a huge mapping, and if so it stops the page table walk and
changes the PMD permissions.
However if the kernel is built without HUGETLBFS support, pmd_huge()
is just a #define that always returns 0. That causes the code in
radix__change_memory_range() to incorrectly interpret the PMD value as
a pointer to a PTE page rather than as a PTE at the PMD level.
Unfortunately the combination of _PAGE_PTE and _PAGE_PRESENT in the
high bits of the PMD entry gives us 0xc in the top nibble, which means
the PMD entry happens to look like a valid pointer into the linear
mapping.
We can see this using `dv` in xmon:
0:mon> dv c000000000000000
pgd @ 0xc000000001740000
pgdp @ 0xc000000001740000 = 0x80000000ffffb009
pudp @ 0xc0000000ffffb000 = 0x80000000ffffa009
pmdp @ 0xc0000000ffffa000 = 0xc00000000000018f <- not a pointer
ptep @ 0xc000000000000100 = 0xa64bb17da64ab07d <- kernel text
The end result is we treat the value at 0xc000000000000100 as a PTE
and clear _PAGE_WRITE or _PAGE_EXEC, potentially corrupting the code
at that address.
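The misinterpretation can be sketched in plain C using the PMD value from the
xmon dump above. This is only an illustration: the bit values of
SKETCH_PAGE_PRESENT and SKETCH_PAGE_PTE and the 0xff flag mask are assumptions
made for the sketch, and the helper names are hypothetical, not the kernel's
accessors.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions; the commit message only says that together they
 * put 0xc in the top nibble. */
#define SKETCH_PAGE_PRESENT 0x8000000000000000ULL
#define SKETCH_PAGE_PTE     0x4000000000000000ULL
/* Assumed mask of low flag bits dropped when a PMD is followed as a pointer. */
#define SKETCH_LOW_FLAGS    0x00000000000000ffULL

/* What a huge-page test that does not depend on HUGETLBFS could look at. */
static bool sketch_pmd_is_leaf(uint64_t pmd)
{
	return (pmd & SKETCH_PAGE_PTE) != 0;
}

/* What the walk effectively does when pmd_huge() is compiled to 0:
 * treat the entry as the address of a PTE page. */
static uint64_t sketch_pmd_as_pte_page(uint64_t pmd)
{
	return pmd & ~SKETCH_LOW_FLAGS;
}
```

With the dumped value 0xc00000000000018f, sketch_pmd_is_leaf() reports a leaf,
while sketch_pmd_as_pte_page() yields 0xc000000000000100, the bogus "ptep"
shown by dv.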
In Joel's specific case we cleared the sign bit in the offset of the
branch, causing a backward branch to turn into a forward branch which
caused us to branch into a non-executable page. However the exact
nature of the crash depends on kernel version, compiler version, and
other factors.
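The two disassembly lines above can be checked with a small decoder for the
relative branch. This is a sketch that assumes the I-form layout (a 24-bit LI
field shifted left by two, giving a 26-bit signed byte offset) and a relative
branch (AA = 0); sketch_branch_target() is a hypothetical helper.

```c
#include <stdint.h>

/* Compute the target of a relative "b"/"bl" at address pc. */
static uint64_t sketch_branch_target(uint64_t pc, uint32_t insn)
{
	/* Bits 2..25 of the instruction hold the byte offset. */
	int64_t off = insn & 0x03fffffcU;

	/* 0x02000000 is the sign bit of the 26-bit offset. */
	if (off & 0x02000000)
		off -= 0x04000000;

	return pc + (uint64_t)off;
}
```

For 0x4bfff335 at 0xc000000000c001bc this gives 0xc000000000bff4f0, and for
the corrupted 0x49fff335 it gives 0xc000000002bff4f0. The two encodings differ
only in bit 0x02000000, the sign bit of the offset.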
We need to fix radix__change_memory_range() to not use accessors that
depend on HUGETLBFS, but we also have radix memory hotplug code that
uses pmd_huge() etc that will also need fixing. So for now just
disallow the broken combination of Radix with HUGETLBFS disabled.
The only defconfig we have that is affected is skiroot_defconfig, so
turn on HUGETLBFS there so that it still gets Radix.
Fixes: 566ca99af026 ("powerpc/mm/radix: Add dummy radix_enabled()")
Cc: stable@vger.kernel.org # v4.7+
Reported-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
arch/powerpc/configs/skiroot_defconfig | 1 +
arch/powerpc/platforms/Kconfig.cputype | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/configs/skiroot_defconfig b/arch/powerpc/configs/skiroot_defconfig
index 5ba131c30f6b..1bcd468ab422 100644
--- a/arch/powerpc/configs/skiroot_defconfig
+++ b/arch/powerpc/configs/skiroot_defconfig
@@ -266,6 +266,7 @@ CONFIG_UDF_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_PROC_KCORE=y
+CONFIG_HUGETLBFS=y
# CONFIG_MISC_FILESYSTEMS is not set
# CONFIG_NETWORK_FILESYSTEMS is not set
CONFIG_NLS=y
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 842b2c7e156a..50cd09b4e05d 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -324,7 +324,7 @@ config ARCH_ENABLE_SPLIT_PMD_PTLOCK
config PPC_RADIX_MMU
bool "Radix MMU Support"
- depends on PPC_BOOK3S_64
+ depends on PPC_BOOK3S_64 && HUGETLB_PAGE
select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA
default y
help
--
2.20.1
* [PATCH] drivers: cpuidle: This patch fix the following checkpatch warning.
From: Mohan Kumar @ 2019-04-17 7:06 UTC
To: rjw; +Cc: linux-pm, daniel.lezcano, linux-kernel, linuxppc-dev
Use pr_debug instead of printk
WARNING: Prefer [subsystem eg: netdev]_dbg([subsystem]dev, ...
then dev_dbg(dev, ... then pr_debug(... to printk(KERN_DEBUG ...
Signed-off-by: Mohan Kumar <mohankumar718@gmail.com>
---
drivers/cpuidle/cpuidle-pseries.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
index 74c2479..a9c1db8 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -282,7 +282,7 @@ static int __init pseries_processor_idle_init(void)
pseries_cpuidle_driver_init();
retval = cpuidle_register(&pseries_idle_driver, NULL);
if (retval) {
- printk(KERN_DEBUG "Registration of pseries driver failed.\n");
+ pr_debug("Registration of pseries driver failed.\n");
return retval;
}
@@ -294,7 +294,7 @@ static int __init pseries_processor_idle_init(void)
"cpuidle/pseries:DEAD", NULL,
pseries_cpuidle_cpu_dead);
WARN_ON(retval < 0);
- printk(KERN_DEBUG "pseries_idle_driver registered\n");
+ pr_debug("pseries_idle_driver registered\n");
return 0;
}
--
2.7.4
* Re: [PATCH v3 00/26] compat_ioctl: cleanups
From: Douglas Gilbert @ 2019-04-16 22:33 UTC
To: Arnd Bergmann, Alexander Viro
Cc: linux-fbdev, linux-iio, linux-remoteproc, alsa-devel, dri-devel,
platform-driver-x86, linux-ide, linux-mtd, sparclinux,
linux1394-devel, devel, linux-s390, linux-scsi, linux-bluetooth,
y2038, qat-linux, amd-gfx, linux-input, Marcel Holtmann,
linux-media, linux-rtc, James E.J. Bottomley, linux-nvme,
ceph-devel, linux-arm-kernel, Karsten Keil, Martin K. Petersen,
Greg Kroah-Hartman, linux-usb, linux-wireless, linux-kernel,
linux-rdma, linux-crypto, netdev, linux-fsdevel, linux-integrity,
linuxppc-dev, David S. Miller, linux-btrfs, linux-ppp
In-Reply-To: <20190416202013.4034148-1-arnd@arndb.de>
On 2019-04-16 4:19 p.m., Arnd Bergmann wrote:
> Hi Al,
>
> It took me way longer than I had hoped to revisit this series, see
> https://lore.kernel.org/lkml/20180912150142.157913-1-arnd@arndb.de/
> for the previously posted version.
>
> I've come to the point where all conversion handlers and most
> COMPATIBLE_IOCTL() entries are gone from this file, but for
> now, this series only has the parts that have either been reviewed
> previously, or that are simple enough to include.
>
> The main missing piece is the SG_IO/SG_GET_REQUEST_TABLE conversion.
> I'll post the patches I made for that later, as they need more
> testing and review from the scsi maintainers.
Perhaps you could look at the document in this url:
http://sg.danny.cz/sg/sg_v40.html
It is work-in-progress to modernize the SCSI generic driver. It
extends ioctl(sg_fd, SG_IO, &pt_obj) to additionally accept the sg v4
interface as defined in include/uapi/linux/bsg.h . Currently only the
bsg driver uses the sg v4 interface. Since struct sg_io_v4 is all
explicitly sized integers, I'm guessing it is immune to "compat" problems.
[I can see no reference to bsg nor struct sg_io_v4 in the current
fs/compat_ioctl.c file.]
Other additions described in that document are these new ioctls:
- SG_IOSUBMIT ultimately to replace write(sg_fd, ...)
- SG_IORECEIVE to replace read(sg_fd, ...)
- SG_IOABORT abort SCSI cmd in progress; new functionality
- SG_SET_GET_EXTENDED has associated struct sg_extended_info
The first three take a pointer to a struct sg_io_hdr (v3 interface) or
a struct sg_io_v4 object. Both objects start with a 32 bit integer:
'S' identifies the v3 interface while 'Q' identifies the v4 interface.
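As a rough illustration of that dispatch, the sketch below looks only at the
leading 32-bit integer of the object. The enum and sketch_sg_classify() are
hypothetical names for this example and are not taken from the sg driver.

```c
#include <stdint.h>
#include <string.h>

enum sketch_sg_iface {
	SKETCH_SG_UNKNOWN,
	SKETCH_SG_V3,	/* object starts with 'S' */
	SKETCH_SG_V4,	/* object starts with 'Q' */
};

/* Classify a pass-through object by its first 32-bit integer. */
static enum sketch_sg_iface sketch_sg_classify(const void *obj)
{
	int32_t first;

	/* memcpy avoids alignment and aliasing assumptions about obj. */
	memcpy(&first, obj, sizeof(first));

	if (first == 'S')
		return SKETCH_SG_V3;
	if (first == 'Q')
		return SKETCH_SG_V4;
	return SKETCH_SG_UNKNOWN;
}
```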
The SG_SET_GET_EXTENDED ioctl takes a pointer to a struct
sg_extended_info object which contains explicitly sized integers so it
may also be immune from "compat" problems. The ioctls section (13) of
that document referenced above has a table showing how many "sets and
gets" are hiding in the SG_SET_GET_EXTENDED ioctl.
BTW No change is proposed for this case:
ioctl(normal_block_device, SG_IO, &sg_v3_obj)
which is handled by block/scsi_ioctl.c
This would be a good time for me to address any "compat" concerns in the
proposed sg driver update.
Doug Gilbert
> I hope you can still take these for the coming merge window, unless
> new problems come up.
>
> Arnd
>
> Arnd Bergmann (26):
> compat_ioctl: pppoe: fix PPPOEIOCSFWD handling
> compat_ioctl: move simple ppp command handling into driver
> compat_ioctl: avoid unused function warning for do_ioctl
> compat_ioctl: move PPPIOCSCOMPRESS32 to ppp-generic.c
> compat_ioctl: move PPPIOCSPASS32/PPPIOCSACTIVE32 to ppp_generic.c
> compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
> compat_ioctl: move rtc handling into rtc-dev.c
> compat_ioctl: add compat_ptr_ioctl()
> compat_ioctl: move drivers to compat_ptr_ioctl
> compat_ioctl: use correct compat_ptr() translation in drivers
> ceph: fix compat_ioctl for ceph_dir_operations
> compat_ioctl: move more drivers to compat_ptr_ioctl
> compat_ioctl: move tape handling into drivers
> compat_ioctl: move ATYFB_CLK handling to atyfb driver
> compat_ioctl: move isdn/capi ioctl translation into driver
> compat_ioctl: move rfcomm handlers into driver
> compat_ioctl: move hci_sock handlers into driver
> compat_ioctl: remove HCIUART handling
> compat_ioctl: remove HIDIO translation
> compat_ioctl: remove translation for sound ioctls
> compat_ioctl: remove IGNORE_IOCTL()
> compat_ioctl: remove /dev/random commands
> compat_ioctl: remove joystick ioctl translation
> compat_ioctl: remove PCI ioctl translation
> compat_ioctl: remove /dev/raw ioctl translation
> compat_ioctl: remove last RAID handling code
>
> Documentation/networking/ppp_generic.txt | 2 +
> arch/um/drivers/hostaudio_kern.c | 1 +
> drivers/android/binder.c | 2 +-
> drivers/char/ppdev.c | 12 +-
> drivers/char/random.c | 1 +
> drivers/char/tpm/tpm_vtpm_proxy.c | 12 +-
> drivers/crypto/qat/qat_common/adf_ctl_drv.c | 2 +-
> drivers/dma-buf/dma-buf.c | 4 +-
> drivers/dma-buf/sw_sync.c | 2 +-
> drivers/dma-buf/sync_file.c | 2 +-
> drivers/firewire/core-cdev.c | 12 +-
> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 2 +-
> drivers/hid/hidraw.c | 4 +-
> drivers/hid/usbhid/hiddev.c | 11 +-
> drivers/hwtracing/stm/core.c | 12 +-
> drivers/ide/ide-tape.c | 31 +-
> drivers/iio/industrialio-core.c | 2 +-
> drivers/infiniband/core/uverbs_main.c | 4 +-
> drivers/isdn/capi/capi.c | 31 +
> drivers/isdn/i4l/isdn_ppp.c | 14 +-
> drivers/media/rc/lirc_dev.c | 4 +-
> drivers/mfd/cros_ec_dev.c | 4 +-
> drivers/misc/cxl/flash.c | 8 +-
> drivers/misc/genwqe/card_dev.c | 23 +-
> drivers/misc/mei/main.c | 22 +-
> drivers/misc/vmw_vmci/vmci_host.c | 2 +-
> drivers/mtd/ubi/cdev.c | 36 +-
> drivers/net/ppp/ppp_generic.c | 99 +++-
> drivers/net/ppp/pppoe.c | 7 +
> drivers/net/ppp/pptp.c | 3 +
> drivers/net/tap.c | 12 +-
> drivers/nvdimm/bus.c | 4 +-
> drivers/nvme/host/core.c | 2 +-
> drivers/pci/switch/switchtec.c | 2 +-
> drivers/platform/x86/wmi.c | 2 +-
> drivers/rpmsg/rpmsg_char.c | 4 +-
> drivers/rtc/dev.c | 13 +-
> drivers/rtc/rtc-vr41xx.c | 10 +
> drivers/s390/char/tape_char.c | 41 +-
> drivers/sbus/char/display7seg.c | 2 +-
> drivers/sbus/char/envctrl.c | 4 +-
> drivers/scsi/3w-xxxx.c | 4 +-
> drivers/scsi/cxlflash/main.c | 2 +-
> drivers/scsi/esas2r/esas2r_main.c | 2 +-
> drivers/scsi/megaraid/megaraid_mm.c | 28 +-
> drivers/scsi/osst.c | 34 +-
> drivers/scsi/pmcraid.c | 4 +-
> drivers/scsi/st.c | 35 +-
> drivers/staging/android/ion/ion.c | 4 +-
> drivers/staging/pi433/pi433_if.c | 12 +-
> drivers/staging/vme/devices/vme_user.c | 2 +-
> drivers/tee/tee_core.c | 2 +-
> drivers/usb/class/cdc-wdm.c | 2 +-
> drivers/usb/class/usbtmc.c | 4 +-
> drivers/usb/core/devio.c | 16 +-
> drivers/usb/gadget/function/f_fs.c | 12 +-
> drivers/vfio/vfio.c | 39 +-
> drivers/vhost/net.c | 12 +-
> drivers/vhost/scsi.c | 12 +-
> drivers/vhost/test.c | 12 +-
> drivers/vhost/vsock.c | 12 +-
> drivers/video/fbdev/aty/atyfb_base.c | 12 +-
> drivers/virt/fsl_hypervisor.c | 2 +-
> fs/btrfs/super.c | 2 +-
> fs/ceph/dir.c | 1 +
> fs/ceph/file.c | 2 +-
> fs/compat_ioctl.c | 602 +-------------------
> fs/fat/file.c | 13 +-
> fs/fuse/dev.c | 2 +-
> fs/notify/fanotify/fanotify_user.c | 2 +-
> fs/userfaultfd.c | 2 +-
> include/linux/fs.h | 7 +
> include/linux/if_pppox.h | 2 +
> include/linux/mtio.h | 58 ++
> include/uapi/linux/ppp-ioctl.h | 2 +
> include/uapi/linux/ppp_defs.h | 14 +
> net/bluetooth/hci_sock.c | 21 +-
> net/bluetooth/rfcomm/sock.c | 14 +-
> net/l2tp/l2tp_ppp.c | 3 +
> net/rfkill/core.c | 2 +-
> sound/core/oss/pcm_oss.c | 4 +
> sound/oss/dmasound/dmasound_core.c | 2 +
> 82 files changed, 452 insertions(+), 1034 deletions(-)
> create mode 100644 include/linux/mtio.h
>
* Re: Linux 5.1-rc5
From: Linus Torvalds @ 2019-04-17 4:13 UTC
To: Michael Ellerman
Cc: linux-s390, Aneesh Kumar K.V, Linux List Kernel Mailing,
Nicholas Piggin, Christoph Hellwig, Martin Schwidefsky,
linuxppc-dev
In-Reply-To: <87sguhti6e.fsf@concordia.ellerman.id.au>
On Tue, Apr 16, 2019 at 8:38 PM Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> > That said, powerpc and s390 should at least look at maybe adding a
> > check for the page ref in their gup paths too. Powerpc has the special
> > gup_hugepte() case
>
> Which uses page_cache_add_speculative(), which handles the case of the
> refcount being zero but not overflow. So that looks like it needs
> fixing.
Note that unlike the zero check, the "too many refs" check does _not_
need to be atomic.
Because it's not a correctness issue right at some magical exact
point, it's a much more ambiguous "the refcount is now so large that
I'm not going to do GUP on this page any more". Being off by a number
of pages in case there's a race is just fine.
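A userspace analogy of that reasoning, not kernel code: the overflow check is
a plain read followed by a separate increment, so a concurrent caller may slip
in between the two, which only makes the cut-off slightly fuzzy. struct
sketch_page and sketch_try_get_page() are hypothetical names, and C11 atomics
stand in for the kernel's page refcount helpers.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_page {
	atomic_int refcount;
};

/* Refuse a new reference once the count has wrapped into negative
 * territory (or is zero); otherwise take one. */
static bool sketch_try_get_page(struct sketch_page *page)
{
	/* Deliberately not atomic with the increment below: being off by a
	 * few references near the limit is acceptable. */
	if (atomic_load_explicit(&page->refcount, memory_order_relaxed) <= 0)
		return false;

	atomic_fetch_add_explicit(&page->refcount, 1, memory_order_relaxed);
	return true;
}
```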
So you could do something like the appended patch (TOTALLY UNTESTED, and
whitespace-damaged on purpose - I don't want you to apply it blindly).
> And we have a few uses of bare get_page() in KVM code which might be
> subject to the same attack.
Note that you really have to have not just a get_page(), but some way
of lining up *billions* of them. Which really tends to be pretty hard.
Linus
----
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 9e732bb2c84a..52db7ff7c756 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -523,7 +523,8 @@ struct page *follow_huge_pd(struct vm_area_struct *vma,
page = pte_page(*ptep);
page += ((address & mask) >> PAGE_SHIFT);
if (flags & FOLL_GET)
- get_page(page);
+ if (!try_get_page(page))
+ page = NULL;
} else {
if (is_hugetlb_entry_migration(*ptep)) {
spin_unlock(ptl);
@@ -883,6 +884,8 @@ int gup_hugepte(pte_t *ptep, unsigned long sz,
unsigned long addr,
refs = 0;
head = pte_page(pte);
+ if (page_ref_count(head) < 0)
+ return 0;
page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
do {
* Re: [PATCH v2 00/21] Convert hwmon documentation to ReST
From: Guenter Roeck @ 2019-04-17 3:49 UTC
To: Mauro Carvalho Chehab
Cc: linux-hwmon, Jean Delvare, linux-aspeed, Linux Doc Mailing List,
Andrew Jeffery, Jonathan Corbet, Liviu Dudau, linux-kernel,
Mauro Carvalho Chehab, Lorenzo Pieralisi, Paul Mackerras,
Joel Stanley, linuxppc-dev, Sudeep Holla, linux-arm-kernel
In-Reply-To: <20190416225836.1d5953bc@coco.lan>
On 4/16/19 6:58 PM, Mauro Carvalho Chehab wrote:
> Em Tue, 16 Apr 2019 13:31:14 -0700
> Guenter Roeck <linux@roeck-us.net> escreveu:
>
>> On Tue, Apr 16, 2019 at 02:19:49PM -0600, Jonathan Corbet wrote:
>>> On Fri, 12 Apr 2019 20:09:16 -0700
>>> Guenter Roeck <linux@roeck-us.net> wrote:
>>>
>>>> The big real-world question is: Is the series good enough for you to accept,
>>>> or do you expect some level of user/kernel separation ?
>>>
>>> I guess it can go in; it's forward progress, even if it doesn't make the
>>> improvements I would like to see.
>>>
>>> The real question, I guess, is who should take it. I've been seeing a
>>> fair amount of activity on hwmon, so I suspect that the potential for
>>> conflicts is real. Perhaps things would go smoother if it went through
>>> your tree?
>>>
>> We'll see a number of conflicts, yes. In terms of timing, this is probably
>> the worst release in the last few years to make such a change. I currently
>> have 9 patches queued in hwmon-next which touch Documentation/hwmon.
>> Of course the changes made in those are all not ReST compatible, and I have
>> no idea what to look out for to make it compatible. So this is going to be
>> fun (in a negative sense) either way.
>>
>> I don't really have a recommendation at this point; I think the best I could
>> do is to take the patches which don't generate conflicts and leave the rest
>> alone. But that would also be bad, since the new index file would not match
>> reality. No idea, really, what the best or even a useful approach would be.
>>
>> Maybe automated changes like this (assuming they are indeed automated)
>> can be generated and pushed right after a commit window closes. Would
>> that by any chance be possible ?
>
> No, those patches are hand-made, but I can surely rebase them on top of
> your tree. Is your tree already merged at linux-next, or should I use some
> other branch/tree for rebase?
>
linux-next merges hwmon-next. next-20190416 is missing one patch which touches
Documentation/hwmon, but that should be easy to deal with.
Thanks,
Guenter
* Re: Linux 5.1-rc5
From: Michael Ellerman @ 2019-04-17 3:38 UTC
To: Linus Torvalds, Christoph Hellwig
Cc: linux-s390, Aneesh Kumar K.V, Linux List Kernel Mailing,
Nicholas Piggin, Martin Schwidefsky, linuxppc-dev
In-Reply-To: <CAHk-=wj7jgMOVFW0tiU-X+zhg6+Rn7mEBTej+f26rV3zXezOSA@mail.gmail.com>
[ Cc += Nick & Aneesh & Paul ]
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Sun, Apr 14, 2019 at 10:19 PM Christoph Hellwig <hch@infradead.org> wrote:
>>
>> Can we please have the page refcount overflow fixes out on the list
>> for review, even if it is after the fact?
>
> They were actually on a list for review long before the fact, but it
> was the security mailing list. The issue actually got discussed back
> in January along with early versions of the patches, but then we
> dropped the ball because it just wasn't on anybody's radar and it got
> resurrected late March. Willy wrote a rather bigger patch-series, and
> review of that is what then resulted in those commits. So they may
> look recent, but that's just because the original patches got
> seriously edited down and rewritten.
>
> That said, powerpc and s390 should at least look at maybe adding a
> check for the page ref in their gup paths too. Powerpc has the special
> gup_hugepte() case
Which uses page_cache_add_speculative(), which handles the case of the
refcount being zero but not overflow. So that looks like it needs
fixing.
We also have follow_huge_pd() that should use try_get_page().
And we have a few uses of bare get_page() in KVM code which might be
subject to the same attack.
cheers
* Re: [PATCH v3 7/8] powerpc/mm: Consolidate radix and hash address map details
From: Aneesh Kumar K.V @ 2019-04-17 3:00 UTC
To: Nicholas Piggin, mpe, paulus; +Cc: linuxppc-dev
In-Reply-To: <1555422365.eio3zgx55b.astroid@bobo.none>
On 4/16/19 7:33 PM, Nicholas Piggin wrote:
> Aneesh Kumar K.V's on April 16, 2019 8:07 pm:
>> We now have
>>
>> 4K page size config
>>
>> kernel_region_map_size = 16TB
>> kernel vmalloc start = 0xc000100000000000
>> kernel IO start = 0xc000200000000000
>> kernel vmemmap start = 0xc000300000000000
>>
>> with 64K page size config:
>>
>> kernel_region_map_size = 512TB
>> kernel vmalloc start = 0xc008000000000000
>> kernel IO start = 0xc00a000000000000
>> kernel vmemmap start = 0xc00c000000000000
>
> Hey Aneesh,
>
> I like the series, I like consolidating the address spaces into 0xc,
> and making the layouts match or similar isn't a bad thing. I don't
> see any real reason to force limitations on one layout or another --
> you could make the argument that 4k radix should match 64k radix
> as much as matching 4k hash IMO.
>
> I wouldn't like to tie them too strongly to the same base defines
> that force them to stay in sync.
>
> Can we drop this patch? Or at least keep the users of the H_ and R_
> defines and set them to the same thing in map.h?
>
>
I did that based on the suggestion from Michael Ellerman. I guess he
wanted the VMALLOC_START to match. I am not sure whether we should match
the kernel_region_map_size too. I did mention that in the cover letter.
I agree with your suggestion above. I still can keep the VMALLOC_START
at 16TB and keep the region_map_size as 512TB for radix 4k. I am not
sure we want to do that.
I will wait for feedback from Michael to make the suggested changes.
-aneesh
* Re: [PATCH v5 1/6] iommu: add generic boot option iommu.dma_mode
From: Leizhen (ThunderTown) @ 2019-04-17 2:36 UTC
To: Will Deacon, Robin Murphy
Cc: linux-ia64, Sebastian Ott, linux-doc, Hanjun Guo, Heiko Carstens,
Paul Mackerras, H . Peter Anvin, linux-s390, Jonathan Corbet,
Jean-Philippe Brucker, Joerg Roedel, x86, Ingo Molnar, Fenghua Yu,
John Garry, Borislav Petkov, Thomas Gleixner, Gerald Schaefer,
Tony Luck, linuxppc-dev, linux-kernel, iommu, Martin Schwidefsky,
David Woodhouse
In-Reply-To: <20190416152100.GB4187@fuggles.cambridge.arm.com>
On 2019/4/16 23:21, Will Deacon wrote:
> On Fri, Apr 12, 2019 at 02:11:31PM +0100, Robin Murphy wrote:
>> On 12/04/2019 11:26, John Garry wrote:
>>> On 09/04/2019 13:53, Zhen Lei wrote:
>>>> +static int __init iommu_dma_mode_setup(char *str)
>>>> +{
>>>> + if (!str)
>>>> + goto fail;
>>>> +
>>>> + if (!strncmp(str, "passthrough", 11))
>>>> + iommu_default_dma_mode = IOMMU_DMA_MODE_PASSTHROUGH;
>>>> + else if (!strncmp(str, "lazy", 4))
>>>> + iommu_default_dma_mode = IOMMU_DMA_MODE_LAZY;
>>>> + else if (!strncmp(str, "strict", 6))
>>>> + iommu_default_dma_mode = IOMMU_DMA_MODE_STRICT;
>>>> + else
>>>> + goto fail;
>>>> +
>>>> + pr_info("Force dma mode to be %d\n", iommu_default_dma_mode);
>>>
>>> What happens if the cmdline option iommu.dma_mode is passed multiple
>>> times? We get multiple - possibly conflicting - prints, right?
>>
>> Indeed; we ended up removing such prints for the existing options here,
>> specifically because multiple messages seemed more likely to be confusing
>> than useful.
I originally intended to be compatible with X86 printing.
} else if (!strncmp(str, "strict", 6)) {
pr_info("Disable batched IOTLB flush\n");
intel_iommu_strict = 1;
}
>>
>>> And do we need to have backwards compatibility, such that the setting
>>> for iommu.strict or iommu.passthrough trumps iommu.dma_mode, regardless
>>> of order?
>>
>> As above I think it would be preferable to just keep using the existing
>> options anyway. The current behaviour works out as:
>>
>> iommu.passthrough |      Y      |         N
>> iommu.strict      |      x      |    Y    |   N
>> ------------------|-------------|---------|--------
>> MODE              | PASSTHROUGH | STRICT  |  LAZY
>>
>> which seems intuitive enough that a specific dma_mode option doesn't add
>> much value, and would more likely just overcomplicate things for users as
>> well as our implementation.
>
> Agreed. We can't remove the existing options, and they do the job perfectly
> well so I don't see the need to add more options on top.
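The table quoted above can be read as a two-input decision. The following is a
minimal sketch of that reading, with hypothetical names (enum sketch_dma_mode,
sketch_resolve_mode()) that do not come from the iommu code:

```c
#include <stdbool.h>

enum sketch_dma_mode {
	SKETCH_MODE_PASSTHROUGH,
	SKETCH_MODE_STRICT,
	SKETCH_MODE_LAZY,
};

/* passthrough wins regardless of strict ("x" in the table);
 * otherwise strict selects between STRICT and LAZY. */
static enum sketch_dma_mode sketch_resolve_mode(bool passthrough, bool strict)
{
	if (passthrough)
		return SKETCH_MODE_PASSTHROUGH;
	return strict ? SKETCH_MODE_STRICT : SKETCH_MODE_LAZY;
}
```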
OK, I will remove the iommu.dma_mode option in the next version. Thanks to all three of you.
I didn't want to add it at first, but later found that the boot options on
each ARCH are different, so I wanted to normalize them.
In addition, do we need to stay compatible with the build option name IOMMU_DEFAULT_PASSTHROUGH, or
should we change it to IOMMU_DEFAULT_DMA_MODE_PASSTHROUGH or IOMMU_DEFAULT_MODE_PASSTHROUGH?
>
> Will
>
> .
>
--
Thanks!
BestRegards
* Re: [PATCH v2 00/21] Convert hwmon documentation to ReST
From: Mauro Carvalho Chehab @ 2019-04-17 1:58 UTC
To: Guenter Roeck
Cc: linux-hwmon, Jean Delvare, linux-aspeed, Linux Doc Mailing List,
Andrew Jeffery, Jonathan Corbet, Liviu Dudau, linux-kernel,
Mauro Carvalho Chehab, Lorenzo Pieralisi, Paul Mackerras,
Joel Stanley, linuxppc-dev, Sudeep Holla, linux-arm-kernel
In-Reply-To: <20190416203114.GB25517@roeck-us.net>
Em Tue, 16 Apr 2019 13:31:14 -0700
Guenter Roeck <linux@roeck-us.net> escreveu:
> On Tue, Apr 16, 2019 at 02:19:49PM -0600, Jonathan Corbet wrote:
> > On Fri, 12 Apr 2019 20:09:16 -0700
> > Guenter Roeck <linux@roeck-us.net> wrote:
> >
> > > The big real-world question is: Is the series good enough for you to accept,
> > > or do you expect some level of user/kernel separation ?
> >
> > I guess it can go in; it's forward progress, even if it doesn't make the
> > improvements I would like to see.
> >
> > The real question, I guess, is who should take it. I've been seeing a
> > fair amount of activity on hwmon, so I suspect that the potential for
> > conflicts is real. Perhaps things would go smoother if it went through
> > your tree?
> >
> We'll see a number of conflicts, yes. In terms of timing, this is probably
> the worst release in the last few years to make such a change. I currently
> have 9 patches queued in hwmon-next which touch Documentation/hwmon.
> Of course the changes made in those are all not ReST compatible, and I have
> no idea what to look out for to make it compatible. So this is going to be
> fun (in a negative sense) either way.
>
> I don't really have a recommendation at this point; I think the best I could
> do is to take the patches which don't generate conflicts and leave the rest
> alone. But that would also be bad, since the new index file would not match
> reality. No idea, really, what the best or even a useful approach would be.
>
> Maybe automated changes like this (assuming they are indeed automated)
> can be generated and pushed right after a commit window closes. Would
> that by any chance be possible ?
No, those patches are hand-made, but I can surely rebase them on top of
your tree. Is your tree already merged at linux-next, or should I use some
other branch/tree for rebase?
Thanks,
Mauro
* Re: [PATCH v3 3/5] powerpc: Use the correct style for SPDX License Identifier
From: Andrew Donnellan @ 2019-04-16 23:47 UTC
To: Nishad Kamdar, Frederic Barrat
Cc: Greg Kroah-Hartman, linux-kernel, Paul Mackerras,
Uwe Kleine-König, Joe Perches, linuxppc-dev
In-Reply-To: <5c04f72569b508cd5477fa1bf15f0166d376cd3a.1555427420.git.nishadkamdar@gmail.com>
On 17/4/19 1:28 am, Nishad Kamdar wrote:
> This patch corrects the SPDX License Identifier style
> in the powerpc Hardware Architecture related files.
>
> Suggested-by: Joe Perches <joe@perches.com>
> Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com>
> ---
TIL there's a different style for source vs headers... sigh. :( Thanks
for fixing.
Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
> arch/powerpc/include/asm/pnv-ocxl.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/pnv-ocxl.h b/arch/powerpc/include/asm/pnv-ocxl.h
> index 208b5503f4ed..7de82647e761 100644
> --- a/arch/powerpc/include/asm/pnv-ocxl.h
> +++ b/arch/powerpc/include/asm/pnv-ocxl.h
> @@ -1,4 +1,4 @@
> -// SPDX-License-Identifier: GPL-2.0+
> +/* SPDX-License-Identifier: GPL-2.0+ */
> // Copyright 2017 IBM Corp.
> #ifndef _ASM_PNV_OCXL_H
> #define _ASM_PNV_OCXL_H
>
--
Andrew Donnellan OzLabs, ADL Canberra
andrew.donnellan@au1.ibm.com IBM Australia Limited
* Re: [PATCH 1/6] mm: change locked_vm's type from unsigned long to atomic64_t
From: Andrew Morton @ 2019-04-16 23:33 UTC
To: Daniel Jordan
Cc: Mark Rutland, Davidlohr Bueso, kvm, Alan Tull,
Alexey Kardashevskiy, linux-fpga, linux-kernel, kvm-ppc, linux-mm,
Alex Williamson, Moritz Fischer, Christoph Lameter, linuxppc-dev,
Wu Hao
In-Reply-To: <20190411202807.q2fge33uoduhtehq@ca-dmjordan1.us.oracle.com>
On Thu, 11 Apr 2019 16:28:07 -0400 Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
> On Thu, Apr 11, 2019 at 10:55:43AM +0100, Mark Rutland wrote:
> > On Thu, Apr 11, 2019 at 02:22:23PM +1000, Alexey Kardashevskiy wrote:
> > > On 03/04/2019 07:41, Daniel Jordan wrote:
> >
> > > > - dev_dbg(dev, "[%d] RLIMIT_MEMLOCK %c%ld %ld/%ld%s\n", current->pid,
> > > > + dev_dbg(dev, "[%d] RLIMIT_MEMLOCK %c%ld %lld/%lu%s\n", current->pid,
> > > > incr ? '+' : '-', npages << PAGE_SHIFT,
> > > > - current->mm->locked_vm << PAGE_SHIFT, rlimit(RLIMIT_MEMLOCK),
> > > > - ret ? "- exceeded" : "");
> > > > + (s64)atomic64_read(&current->mm->locked_vm) << PAGE_SHIFT,
> > > > + rlimit(RLIMIT_MEMLOCK), ret ? "- exceeded" : "");
> > >
> > >
> > >
> > > atomic64_read() returns "long" which matches "%ld", why this change (and
> > > similar below)? You did not do this in the two pr_debug()s above anyway.
> >
> > Unfortunately, architectures return inconsistent types for atomic64 ops.
> >
> > Some return long (e.g. powerpc), some return long long (e.g. arc), and
> > some return s64 (e.g. x86).
>
> Yes, Mark said it all, I'm just chiming in to confirm that's why I added the
> cast.
>
> Btw, thanks for doing this, Mark.
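The reason for the cast can be shown in userspace: when the type of a value
varies between builds, casting it to one fixed wide type keeps a single format
string valid everywhere. In this sketch sketch_atomic64_ret_t is a
hypothetical stand-in for whatever a given architecture returns, and long long
with "%lld" plays the role of the s64 cast in the quoted hunk.

```c
#include <stdio.h>

/* Stand-in: could equally be long long or a fixed-width 64-bit type. */
typedef long sketch_atomic64_ret_t;

/* Format a page count as bytes; the cast happens before the shift so the
 * argument type always matches "%lld". Assumes a non-negative count. */
static int sketch_format_locked_bytes(char *buf, size_t len,
				      sketch_atomic64_ret_t pages,
				      unsigned int page_shift)
{
	return snprintf(buf, len, "%lld", (long long)pages << page_shift);
}
```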
What's the status of this patchset, btw?
I have a note here that
powerpc-mmu-drop-mmap_sem-now-that-locked_vm-is-atomic.patch is to be
updated.
* Re: [PATCH v2 5/5] arm64/speculation: Support 'mitigations=' cmdline option
From: Will Deacon @ 2019-04-16 21:39 UTC
To: Thomas Gleixner
Cc: Peter Zijlstra, Heiko Carstens, Paul Mackerras, H . Peter Anvin,
Ingo Molnar, Andrea Arcangeli, linux-s390, x86, Steven Price,
Linus Torvalds, Catalin Marinas, Waiman Long, linux-arch,
Jon Masters, Jiri Kosina, Borislav Petkov, Andy Lutomirski,
Josh Poimboeuf, linux-arm-kernel, Phil Auld, Greg Kroah-Hartman,
Randy Dunlap, linux-kernel, Tyler Hicks, Martin Schwidefsky,
linuxppc-dev
In-Reply-To: <alpine.DEB.2.21.1904162124020.1780@nanos.tec.linutronix.de>
On Tue, Apr 16, 2019 at 09:26:13PM +0200, Thomas Gleixner wrote:
> On Fri, 12 Apr 2019, Josh Poimboeuf wrote:
>
> > Configure arm64 runtime CPU speculation bug mitigations in accordance
> > with the 'mitigations=' cmdline option. This affects Meltdown, Spectre
> > v2, and Speculative Store Bypass.
> >
> > The default behavior is unchanged.
> >
> > Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> > ---
> > NOTE: This is based on top of Jeremy Linton's patches:
> > https://lkml.kernel.org/r/20190410231237.52506-1-jeremy.linton@arm.com
>
> So I keep that out and we have to revisit that once the ARM64 stuff hits a
> tree, right? I can have a branch with just the 4 first patches applied
> which ARM64 folks can pull in when they apply Jeremy's patches before the
> merge window.
Yes, that would work for us, cheers. I should get to Jeremy's latest version
next week and I'm certainly planning to get them queued up for 5.2.
Will
* [PATCH v3 3/5] powerpc: Use the correct style for SPDX License Identifier
From: Nishad Kamdar @ 2019-04-16 15:28 UTC
To: Frederic Barrat, Andrew Donnellan
Cc: Greg Kroah-Hartman, linux-kernel, Paul Mackerras,
Uwe Kleine-König, Joe Perches, linuxppc-dev
In-Reply-To: <cover.1555427418.git.nishadkamdar@gmail.com>
This patch corrects the SPDX License Identifier style
in the powerpc Hardware Architecture related files.
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com>
---
arch/powerpc/include/asm/pnv-ocxl.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/include/asm/pnv-ocxl.h b/arch/powerpc/include/asm/pnv-ocxl.h
index 208b5503f4ed..7de82647e761 100644
--- a/arch/powerpc/include/asm/pnv-ocxl.h
+++ b/arch/powerpc/include/asm/pnv-ocxl.h
@@ -1,4 +1,4 @@
-// SPDX-License-Identifier: GPL-2.0+
+/* SPDX-License-Identifier: GPL-2.0+ */
// Copyright 2017 IBM Corp.
#ifndef _ASM_PNV_OCXL_H
#define _ASM_PNV_OCXL_H
--
2.17.1
^ permalink raw reply related
* [PATCH v12 08/31] mm: introduce INIT_VMA()
From: Laurent Dufour @ 2019-04-16 13:44 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
Some VMA struct fields need to be initialized once the VMA structure is
allocated.
Currently this only concerns the anon_vma_chain field, but others will be
added to support the speculative page fault.
Instead of spreading the initialization calls all over the code, let's
introduce a dedicated inline function.
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
fs/exec.c | 1 +
include/linux/mm.h | 5 +++++
kernel/fork.c | 2 +-
mm/mmap.c | 3 +++
mm/nommu.c | 1 +
5 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/fs/exec.c b/fs/exec.c
index 2e0033348d8e..9762e060295c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -266,6 +266,7 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
vma->vm_start = vma->vm_end - PAGE_SIZE;
vma->vm_flags = VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP;
vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
+ INIT_VMA(vma);
err = insert_vm_struct(mm, vma);
if (err)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4ba2f53f9d60..2ceb1d2869a6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1407,6 +1407,11 @@ struct zap_details {
pgoff_t last_index; /* Highest page->index to unmap */
};
+static inline void INIT_VMA(struct vm_area_struct *vma)
+{
+ INIT_LIST_HEAD(&vma->anon_vma_chain);
+}
+
struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte, bool with_public_device);
#define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
diff --git a/kernel/fork.c b/kernel/fork.c
index 915be4918a2b..f8dae021c2e5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -341,7 +341,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
if (new) {
*new = *orig;
- INIT_LIST_HEAD(&new->anon_vma_chain);
+ INIT_VMA(new);
}
return new;
}
diff --git a/mm/mmap.c b/mm/mmap.c
index bd7b9f293b39..5ad3a3228d76 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1765,6 +1765,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vma->vm_flags = vm_flags;
vma->vm_page_prot = vm_get_page_prot(vm_flags);
vma->vm_pgoff = pgoff;
+ INIT_VMA(vma);
if (file) {
if (vm_flags & VM_DENYWRITE) {
@@ -3037,6 +3038,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
}
vma_set_anonymous(vma);
+ INIT_VMA(vma);
vma->vm_start = addr;
vma->vm_end = addr + len;
vma->vm_pgoff = pgoff;
@@ -3395,6 +3397,7 @@ static struct vm_area_struct *__install_special_mapping(
if (unlikely(vma == NULL))
return ERR_PTR(-ENOMEM);
+ INIT_VMA(vma);
vma->vm_start = addr;
vma->vm_end = addr + len;
diff --git a/mm/nommu.c b/mm/nommu.c
index 749276beb109..acf7ca72ca90 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1210,6 +1210,7 @@ unsigned long do_mmap(struct file *file,
region->vm_flags = vm_flags;
region->vm_pgoff = pgoff;
+ INIT_VMA(vma);
vma->vm_flags = vm_flags;
vma->vm_pgoff = pgoff;
--
2.21.0
^ permalink raw reply related
* [PATCH v12 00/31] Speculative page faults
From: Laurent Dufour @ 2019-04-16 13:44 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
This is a port to kernel 5.1 of the work done by Peter Zijlstra to handle
page faults without holding the mm semaphore [1].
The idea is to try to handle user space page faults without holding the
mmap_sem. This should allow better concurrency for massively threaded
processes, since the page fault handler will not wait for other threads'
memory layout changes to complete, assuming that those changes are made in
another part of the process's memory space. This type of page fault is
called a speculative page fault. If the speculative page fault fails
because a concurrent change has been detected or because the underlying
PMD or PTE tables are not yet allocated, its processing is aborted and a
regular page fault is then tried.
The speculative page fault (SPF) handler has to look up the VMA matching
the fault address without holding the mmap_sem. This is done by protecting
the MM RB tree with RCU and by using a reference counter on each VMA. When
a VMA is fetched under RCU protection, its reference counter is
incremented to ensure that the VMA will not be freed behind our back
during the SPF processing. Once that processing is done, the VMA's
reference counter is decremented. To ensure that a VMA is still present
when walking the RB tree locklessly, the VMA's reference counter is
incremented when that VMA is linked into the RB tree. When the VMA is
unlinked from the RB tree, its reference counter is decremented at the end
of the RCU grace period, ensuring it remains available during this time.
This means that freeing the VMA could be delayed, which could in turn
delay closing the file for file mappings. Since the SPF handler is not
able to manage file mappings, the file is closed synchronously and not
during the RCU cleanup. This is safe since the page fault handler aborts
if a file pointer is associated with the VMA.
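The reference counting described above can be illustrated with a small
userspace sketch. This is not the code from the series: the structure and
the vma_sketch_*() helpers are hypothetical names, C11 atomics stand in
for the kernel primitives, and the RCU grace period is not modelled. It
only shows that the tree holds one reference, a speculative reader takes
another, and the object goes away when the last reference is dropped.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a VMA; not the kernel's vm_area_struct. */
struct vma_sketch {
	atomic_int refcount;
	unsigned long start, end;
};

/* Counts frees so that the behaviour can be observed from outside. */
static int vma_sketch_freed;

/* Allocate a VMA holding one reference, owned by the (imaginary) RB tree. */
static struct vma_sketch *vma_sketch_alloc(unsigned long start,
					   unsigned long end)
{
	struct vma_sketch *vma = malloc(sizeof(*vma));

	if (!vma)
		return NULL;
	atomic_init(&vma->refcount, 1);
	vma->start = start;
	vma->end = end;
	return vma;
}

/* Take a reference unless the count has already dropped to zero. */
static bool vma_sketch_get(struct vma_sketch *vma)
{
	int old = atomic_load(&vma->refcount);

	while (old > 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&vma->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; dropping the last one frees the object. */
static void vma_sketch_put(struct vma_sketch *vma)
{
	if (atomic_fetch_sub(&vma->refcount, 1) == 1) {
		vma_sketch_freed++;
		free(vma);
	}
}
```

In this model, unlinking the VMA from the tree corresponds to one
vma_sketch_put() call, and the end of the SPF processing to another one.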
Using RCU fixes the overhead seen by Haiyan Song using the will-it-scale
benchmark [2].
The VMA's attributes checked during the speculative page fault processing
have to be protected against parallel changes. This is done by using a
per-VMA sequence lock. This sequence lock allows the speculative page
fault handler to quickly check for parallel changes in progress and to
abort the speculative page fault in that case.
Once the VMA has been found, the speculative page fault handler checks the
VMA's attributes to verify whether the page fault can be handled this way.
The VMA is protected through a sequence lock which allows fast detection
of concurrent VMA changes. If such a change is detected, the speculative
page fault is aborted and a *classic* page fault is tried. VMA sequence
locking is added wherever VMA attributes which are checked during the page
fault are modified.
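A minimal userspace sketch of the sequence count idea follows. It assumes
C11 atomics with their default ordering instead of the kernel seqcount
API, and every name in it (vma_seq_sketch, seq_write_begin() and so on) is
hypothetical. The counter is odd while a writer is active, and a reader
aborts if the counter is odd or has moved since it was sampled.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-VMA sequence counter: odd while a writer is active. */
struct vma_seq_sketch {
	atomic_uint sequence;
	atomic_ulong vm_flags;	/* example attribute guarded by the counter */
};

static void seq_write_begin(struct vma_seq_sketch *v)
{
	atomic_fetch_add(&v->sequence, 1);	/* counter becomes odd */
}

static void seq_write_end(struct vma_seq_sketch *v)
{
	atomic_fetch_add(&v->sequence, 1);	/* counter becomes even again */
}

static unsigned int seq_read_begin(struct vma_seq_sketch *v)
{
	return atomic_load(&v->sequence);
}

/* True if the counter moved since 'seq' was sampled. */
static bool seq_read_retry(struct vma_seq_sketch *v, unsigned int seq)
{
	return atomic_load(&v->sequence) != seq;
}

/*
 * Speculative read of vm_flags. Returns false when the caller should
 * give up and fall back to the classic, locked path.
 */
static bool read_flags_speculative(struct vma_seq_sketch *v,
				   unsigned long *flags)
{
	unsigned int seq = seq_read_begin(v);

	if (seq & 1)
		return false;	/* a writer is in progress */
	*flags = atomic_load(&v->vm_flags);
	return !seq_read_retry(v, seq);
}
```

The reader never blocks: it either gets a consistent snapshot or reports
that a change was detected, which matches the abort-and-retry behaviour
described above.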
When the PTE is fetched, the VMA is checked to see if it has been changed.
Once the page table is locked, the VMA is therefore known to be valid, and
any other change touching this PTE will need to lock the page table, so no
parallel change is possible at this time.
The PTE is locked with interrupts disabled, which allows checking the PMD
to ensure that there is no ongoing collapse operation. Since khugepaged
first sets the PMD to pmd_none and then waits for the other CPUs to have
caught the IPI, if the PMD is valid at the time the PTE is locked, we have
the guarantee that the collapse operation will have to wait on the PTE
lock to move forward. This allows the SPF handler to map the PTE safely.
If the PMD value is different from the one recorded at the beginning of
the SPF operation, the classic page fault handler is called to handle the
operation while holding the mmap_sem. As the PTE lock is taken with
interrupts disabled, it is taken using spin_trylock() to avoid a deadlock
when handling a page fault while a TLB invalidate is requested by another
CPU holding the PTE lock.
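The locking step can be sketched in userspace as follows. This is only a
model under stated assumptions: an atomic_flag plays the role of the PTE
spinlock, the interrupt disabling is reduced to comments, and the names
ptl_sketch, ptl_trylock() and pte_lock_sketch() are hypothetical. What the
sketch does on a busy lock (report failure to the caller) is also an
assumption; the point is only that it never spins.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the PTE spinlock. */
struct ptl_sketch {
	atomic_flag locked;
};

/* Single attempt: returns true if the lock was taken, never spins. */
static bool ptl_trylock(struct ptl_sketch *ptl)
{
	return !atomic_flag_test_and_set_explicit(&ptl->locked,
						  memory_order_acquire);
}

static void ptl_unlock(struct ptl_sketch *ptl)
{
	atomic_flag_clear_explicit(&ptl->locked, memory_order_release);
}

/*
 * Model of the locking step. 'saved_pmd' is the value recorded at the
 * start of the speculative fault, '*pmd' is the current one. Returns
 * true with the lock held, or false when the caller has to give up.
 */
static bool pte_lock_sketch(struct ptl_sketch *ptl, const unsigned long *pmd,
			    unsigned long saved_pmd)
{
	/* kernel: interrupts would be disabled here */
	if (*pmd != saved_pmd)
		return false;	/* PMD changed: use the classic fault path */
	if (!ptl_trylock(ptl))
		return false;	/* lock busy: do not spin with IRQs off */
	/* kernel: interrupts would be re-enabled here */
	return true;
}
```

A ptl_sketch is initialised with ATOMIC_FLAG_INIT, and a successful
pte_lock_sketch() has to be paired with ptl_unlock().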
In pseudo code, this could be seen as:
speculative_page_fault()
{
	vma = find_vma_rcu()
	check vma sequence count
	check vma's support
	disable interrupt
		check pgd,p4d,...,pte
		save pmd and pte in vmf
		save vma sequence counter in vmf
	enable interrupt
	check vma sequence count
	handle_pte_fault(vma)
		..
		page = alloc_page()
		pte_map_lock()
			disable interrupt
				abort if sequence counter has changed
				abort if pmd or pte has changed
				pte map and lock
			enable interrupt
		if abort
			free page
			abort
		...
	put_vma(vma)
}
arch_fault_handler()
{
	if (speculative_page_fault(&vma))
		goto done
again:
	lock(mmap_sem)
	vma = find_vma();
	handle_pte_fault(vma);
	if retry
		unlock(mmap_sem)
		goto again;
done:
	handle fault error
}
THP is not supported because, when checking the PMD, we can be confused by
an in-progress collapse operation done by khugepaged. The issue is that
pmd_none() could be true either if the PMD is not yet populated or if the
underlying PTEs are about to be collapsed. So we cannot safely allocate a
PMD if pmd_none() is true.
This series adds a new software performance event named
'speculative-faults' or 'spf'. It counts the number of page fault events
successfully handled speculatively. When recording 'faults,spf' events,
'faults' counts the total number of page fault events while 'spf' counts
only the faults processed speculatively.
This series also introduces some trace events. They allow identifying why
page faults were not processed speculatively. This doesn't take into
account the faults generated by a monothreaded process, which are directly
processed while holding the mmap_sem. These trace events are grouped in a
system named 'pagefault'. They are:
- pagefault:spf_vma_changed : the VMA has been changed behind our back.
- pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set.
- pagefault:spf_vma_notsup : the VMA's type is not supported.
- pagefault:spf_vma_access : the VMA's access rights are not respected.
- pagefault:spf_pmd_changed : the upper PMD pointer has changed behind our
back.
To record all the related events, the easiest way is to run perf with the
following arguments:
$ perf stat -e 'faults,spf,pagefault:*' <command>
There is also a dedicated vmstat counter showing the number of page faults
successfully handled speculatively. It can be seen this way:
$ grep speculative_pgfault /proc/vmstat
It is possible to deactivate the speculative page fault handler by writing
0 to /proc/sys/vm/speculative_page_fault.
This series builds on top of v5.1-rc4-mmotm-2019-04-09-17-51 and is
functional on x86 and PowerPC. I cross-built it on arm64 but I was not able
to test it.
This series is also available on github [4].
---------------------
Real Workload results
Tests using a "popular in-memory multithreaded database product" on a
128-core SMT8 Power system are in progress and I will come back with
performance measurements as soon as possible. With the previous series we
saw up to 30% improvement in the number of transactions processed per
second, and we hope this will be the case with this series too.
------------------
Benchmarks results
Base kernel is v5.1-rc4-mmotm-2019-04-09-17-51
SPF is BASE + this series
Kernbench:
----------
Here are the results on a 48 CPUs X86 system using kernbench on a 5.0
kernel (kernel is built 5 times):
Average Half load -j 24 Run (std deviation):
                  BASE                 SPF                  delta
Elapsed Time      56.52 (1.39185)      56.256 (1.15106)      0.47%
User Time         980.018 (2.94734)    984.958 (1.98518)    -0.50%
System Time       130.744 (1.19148)    133.616 (0.873573)   -2.20%
Percent CPU       1965.6 (49.682)      1988.4 (40.035)      -1.16%
Context Switches  29926.6 (272.789)    30472.4 (109.569)    -1.82%
Sleeps            124793 (415.87)      125003 (591.008)     -0.17%
Average Optimal load -j 48 Run (std deviation):
                  BASE                 SPF                  delta
Elapsed Time      46.354 (0.917949)    45.968 (1.42786)      0.83%
User Time         1193.42 (224.96)     1196.78 (223.28)     -0.28%
System Time       143.306 (13.2726)    146.177 (13.2659)    -2.00%
Percent CPU       2668.6 (743.157)     2699.9 (753.767)     -1.17%
Context Switches  62268.3 (34097.1)    62721.7 (33999.1)    -0.73%
Sleeps            132556 (8222.99)     132607 (8077.6)      -0.04%
During a run on the SPF, perf events were captured:
Performance counter stats for '../kernbench -M':
525,873,132 faults
242 spf
0 pagefault:spf_vma_changed
0 pagefault:spf_vma_noanon
441 pagefault:spf_vma_notsup
0 pagefault:spf_vma_access
0 pagefault:spf_pmd_changed
Very few speculative page faults were recorded as most of the processes
involved are monothreaded (it seems that on this architecture some threads
were created during the kernel build).
Here are the kernbench results on a 1024 CPUs Power8 VM:
                  5.1.0-rc4-mm1+       5.1.0-rc4-mm1-spf-rcu+
Average Half load -j 512 Run (std deviation):
Elapsed Time      52.52 (0.906697)     52.778 (0.510069)    -0.49%
User Time         3855.43 (76.378)     3890.44 (73.0466)    -0.91%
System Time       1977.24 (182.316)    1974.56 (166.097)     0.14%
Percent CPU       11111.6 (540.461)    11115.2 (458.907)    -0.03%
Context Switches  83245.6 (3061.44)    83651.8 (1202.31)    -0.49%
Sleeps            613459 (23091.8)     628378 (27485.2)     -2.43%
Average Optimal load -j 1024 Run (std deviation):
Elapsed Time      52.964 (0.572346)    53.132 (0.825694)    -0.32%
User Time         4058.22 (222.034)    4070.2 (201.646)     -0.30%
System Time       2672.81 (759.207)    2712.13 (797.292)    -1.47%
Percent CPU       12756.7 (1786.35)    12806.5 (1858.89)    -0.39%
Context Switches  88818.5 (6772)       87890.6 (5567.72)     1.04%
Sleeps            618658 (20842.2)     636297 (25044)       -2.85%
During a run on the SPF, perf events were captured:
Performance counter stats for '../kernbench -M':
149 375 832 faults
1 spf
0 pagefault:spf_vma_changed
0 pagefault:spf_vma_noanon
561 pagefault:spf_vma_notsup
0 pagefault:spf_vma_access
0 pagefault:spf_pmd_changed
Most of the processes involved are monothreaded so SPF is not activated but
there is no impact on the performance.
Ebizzy:
-------
The test counts the number of records per second it can manage; higher is
better. I ran it as 'ebizzy -mTt <nrcpus>'. To get consistent results, I
repeated the test 100 times and measured the average. The number is the
records processed per second.
                  BASE        SPF         delta
24 CPUs x86       5492.69     9383.07     70.83%
1024 CPUS P8 VM   8476.74     17144.38    102%
Here are the performance counters read during a run on a 48 CPUs x86 node:
Performance counter stats for './ebizzy -mTt 48':
11,846,569 faults
10,886,706 spf
957,702 pagefault:spf_vma_changed
0 pagefault:spf_vma_noanon
815 pagefault:spf_vma_notsup
0 pagefault:spf_vma_access
0 pagefault:spf_pmd_changed
And the ones captured during a run on a 1024 CPUs Power VM:
Performance counter stats for './ebizzy -mTt 1024':
1 359 789 faults
1 284 910 spf
72 085 pagefault:spf_vma_changed
0 pagefault:spf_vma_noanon
2 669 pagefault:spf_vma_notsup
0 pagefault:spf_vma_access
0 pagefault:spf_pmd_changed
In ebizzy's case most of the page faults were handled speculatively,
leading to the ebizzy performance boost.
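To put a number on "most", the share of faults handled speculatively can
be derived from the 'faults' and 'spf' counters above. The helper below is
a small sketch with hypothetical names, not part of the series. With the
counters reported for the two ebizzy runs it gives roughly 92% on the x86
node and roughly 94% on the Power VM.

```c
#include <stdio.h>

/* Share of faults handled speculatively, as a percentage of all faults. */
static double spf_share_percent(unsigned long long faults,
				unsigned long long spf)
{
	if (faults == 0)
		return 0.0;
	return 100.0 * (double)spf / (double)faults;
}

/* Print one summary line for a given run. */
static void spf_report(const char *run, unsigned long long faults,
		       unsigned long long spf)
{
	printf("%-24s %6.2f%% speculative\n", run,
	       spf_share_percent(faults, spf));
}
```

For example, spf_report("ebizzy -mTt 48", 11846569ULL, 10886706ULL) prints
the x86 share, and spf_report("ebizzy -mTt 1024", 1359789ULL, 1284910ULL)
the Power one.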
------------------
Changes since v11 [3]
- Check vm_ops.fault instead of vm_ops since all VMAs now have a vm_ops.
- Abort the speculative page fault when doing swap readahead because the
VMA's boundaries are not protected at this time. This way the first
swap-in does a readahead, and the next fault should be handled
speculatively since the page has already been read in by then.
- Handle a race between copy_pte_range() and the wp_page_copy called by
the speculative page fault handler.
- Ported to Kernel v5.0
- Moved VM_FAULT_PTNOTSAME define in mm_types.h
- Use RCU to protect the MM RB tree instead of a rwlock.
- Add a toggle interface: /proc/sys/vm/speculative_page_fault
[1] https://lore.kernel.org/linux-mm/20141020215633.717315139@infradead.org/
[2] https://lore.kernel.org/linux-mm/9FE19350E8A7EE45B64D8D63D368C8966B847F54@SHSMSX101.ccr.corp.intel.com/
[3] https://lore.kernel.org/linux-mm/1526555193-7242-1-git-send-email-ldufour@linux.vnet.ibm.com/
[4] https://github.com/ldu4/linux/tree/spf-v12
Laurent Dufour (25):
mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
mm: make pte_unmap_same compatible with SPF
mm: introduce INIT_VMA()
mm: protect VMA modifications using VMA sequence count
mm: protect mremap() against SPF hanlder
mm: protect SPF handler against anon_vma changes
mm: cache some VMA fields in the vm_fault structure
mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
mm: introduce __lru_cache_add_active_or_unevictable
mm: introduce __vm_normal_page()
mm: introduce __page_add_new_anon_rmap()
mm: protect against PTE changes done by dup_mmap()
mm: protect the RB tree with a sequence lock
mm: introduce vma reference counter
mm: Introduce find_vma_rcu()
mm: don't do swap readahead during speculative page fault
mm: adding speculative page fault failure trace events
perf: add a speculative page fault sw event
perf tools: add support for the SPF perf event
mm: add speculative page fault vmstats
powerpc/mm: add speculative page fault
mm: Add a speculative page fault switch in sysctl
Mahendran Ganesh (2):
arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
arm64/mm: add speculative page fault
Peter Zijlstra (4):
mm: prepare for FAULT_FLAG_SPECULATIVE
mm: VMA sequence count
mm: provide speculative fault infrastructure
x86/mm: add speculative pagefault handling
arch/arm64/Kconfig | 1 +
arch/arm64/mm/fault.c | 12 +
arch/powerpc/Kconfig | 1 +
arch/powerpc/mm/fault.c | 16 +
arch/x86/Kconfig | 1 +
arch/x86/mm/fault.c | 14 +
fs/exec.c | 1 +
fs/proc/task_mmu.c | 5 +-
fs/userfaultfd.c | 17 +-
include/linux/hugetlb_inline.h | 2 +-
include/linux/migrate.h | 4 +-
include/linux/mm.h | 138 +++++-
include/linux/mm_types.h | 16 +-
include/linux/pagemap.h | 4 +-
include/linux/rmap.h | 12 +-
include/linux/swap.h | 10 +-
include/linux/vm_event_item.h | 3 +
include/trace/events/pagefault.h | 80 ++++
include/uapi/linux/perf_event.h | 1 +
kernel/fork.c | 35 +-
kernel/sysctl.c | 9 +
mm/Kconfig | 22 +
mm/huge_memory.c | 6 +-
mm/hugetlb.c | 2 +
mm/init-mm.c | 3 +
mm/internal.h | 45 ++
mm/khugepaged.c | 5 +
mm/madvise.c | 6 +-
mm/memory.c | 631 ++++++++++++++++++++++----
mm/mempolicy.c | 51 ++-
mm/migrate.c | 6 +-
mm/mlock.c | 13 +-
mm/mmap.c | 249 ++++++++--
mm/mprotect.c | 4 +-
mm/mremap.c | 13 +
mm/nommu.c | 1 +
mm/rmap.c | 5 +-
mm/swap.c | 6 +-
mm/swap_state.c | 10 +-
mm/vmstat.c | 5 +-
tools/include/uapi/linux/perf_event.h | 1 +
tools/perf/util/evsel.c | 1 +
tools/perf/util/parse-events.c | 4 +
tools/perf/util/parse-events.l | 1 +
tools/perf/util/python.c | 1 +
45 files changed, 1277 insertions(+), 196 deletions(-)
create mode 100644 include/trace/events/pagefault.h
--
2.21.0
^ permalink raw reply
* [PATCH v12 24/31] mm: adding speculative page fault failure trace events
From: Laurent Dufour @ 2019-04-16 13:45 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
This patch adds a set of new trace events to collect the speculative page
fault event failures.
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
include/trace/events/pagefault.h | 80 ++++++++++++++++++++++++++++++++
mm/memory.c | 57 ++++++++++++++++++-----
2 files changed, 125 insertions(+), 12 deletions(-)
create mode 100644 include/trace/events/pagefault.h
diff --git a/include/trace/events/pagefault.h b/include/trace/events/pagefault.h
new file mode 100644
index 000000000000..d9438f3e6bad
--- /dev/null
+++ b/include/trace/events/pagefault.h
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM pagefault
+
+#if !defined(_TRACE_PAGEFAULT_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PAGEFAULT_H
+
+#include <linux/tracepoint.h>
+#include <linux/mm.h>
+
+DECLARE_EVENT_CLASS(spf,
+
+ TP_PROTO(unsigned long caller,
+ struct vm_area_struct *vma, unsigned long address),
+
+ TP_ARGS(caller, vma, address),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, caller)
+ __field(unsigned long, vm_start)
+ __field(unsigned long, vm_end)
+ __field(unsigned long, address)
+ ),
+
+ TP_fast_assign(
+ __entry->caller = caller;
+ __entry->vm_start = vma->vm_start;
+ __entry->vm_end = vma->vm_end;
+ __entry->address = address;
+ ),
+
+ TP_printk("ip:%lx vma:%lx-%lx address:%lx",
+ __entry->caller, __entry->vm_start, __entry->vm_end,
+ __entry->address)
+);
+
+DEFINE_EVENT(spf, spf_vma_changed,
+
+ TP_PROTO(unsigned long caller,
+ struct vm_area_struct *vma, unsigned long address),
+
+ TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_noanon,
+
+ TP_PROTO(unsigned long caller,
+ struct vm_area_struct *vma, unsigned long address),
+
+ TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_notsup,
+
+ TP_PROTO(unsigned long caller,
+ struct vm_area_struct *vma, unsigned long address),
+
+ TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_access,
+
+ TP_PROTO(unsigned long caller,
+ struct vm_area_struct *vma, unsigned long address),
+
+ TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_pmd_changed,
+
+ TP_PROTO(unsigned long caller,
+ struct vm_area_struct *vma, unsigned long address),
+
+ TP_ARGS(caller, vma, address)
+);
+
+#endif /* _TRACE_PAGEFAULT_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/memory.c b/mm/memory.c
index 1991da97e2db..509851ad7c95 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -81,6 +81,9 @@
#include "internal.h"
+#define CREATE_TRACE_POINTS
+#include <trace/events/pagefault.h>
+
#if defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) && !defined(CONFIG_COMPILE_TEST)
#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid.
#endif
@@ -2100,8 +2103,10 @@ static bool pte_spinlock(struct vm_fault *vmf)
again:
local_irq_disable();
- if (vma_has_changed(vmf))
+ if (vma_has_changed(vmf)) {
+ trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+ }
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
@@ -2109,8 +2114,10 @@ static bool pte_spinlock(struct vm_fault *vmf)
* is not a huge collapse operation in progress in our back.
*/
pmdval = READ_ONCE(*vmf->pmd);
- if (!pmd_same(pmdval, vmf->orig_pmd))
+ if (!pmd_same(pmdval, vmf->orig_pmd)) {
+ trace_spf_pmd_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+ }
#endif
vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
@@ -2121,6 +2128,7 @@ static bool pte_spinlock(struct vm_fault *vmf)
if (vma_has_changed(vmf)) {
spin_unlock(vmf->ptl);
+ trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
}
@@ -2154,8 +2162,10 @@ static bool pte_map_lock(struct vm_fault *vmf)
*/
again:
local_irq_disable();
- if (vma_has_changed(vmf))
+ if (vma_has_changed(vmf)) {
+ trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+ }
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
@@ -2163,8 +2173,10 @@ static bool pte_map_lock(struct vm_fault *vmf)
* is not a huge collapse operation in progress in our back.
*/
pmdval = READ_ONCE(*vmf->pmd);
- if (!pmd_same(pmdval, vmf->orig_pmd))
+ if (!pmd_same(pmdval, vmf->orig_pmd)) {
+ trace_spf_pmd_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+ }
#endif
/*
@@ -2184,6 +2196,7 @@ static bool pte_map_lock(struct vm_fault *vmf)
if (vma_has_changed(vmf)) {
pte_unmap_unlock(pte, ptl);
+ trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
}
@@ -4187,47 +4200,60 @@ vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
/* rmb <-> seqlock,vma_rb_erase() */
seq = raw_read_seqcount(&vma->vm_sequence);
- if (seq & 1)
+ if (seq & 1) {
+ trace_spf_vma_changed(_RET_IP_, vma, address);
goto out_put;
+ }
/*
* Can't call vm_ops service has we don't know what they would do
* with the VMA.
* This include huge page from hugetlbfs.
*/
- if (vma->vm_ops && vma->vm_ops->fault)
+ if (vma->vm_ops && vma->vm_ops->fault) {
+ trace_spf_vma_notsup(_RET_IP_, vma, address);
goto out_put;
+ }
/*
* __anon_vma_prepare() requires the mmap_sem to be held
* because vm_next and vm_prev must be safe. This can't be guaranteed
* in the speculative path.
*/
- if (unlikely(!vma->anon_vma))
+ if (unlikely(!vma->anon_vma)) {
+ trace_spf_vma_notsup(_RET_IP_, vma, address);
goto out_put;
+ }
vmf.vma_flags = READ_ONCE(vma->vm_flags);
vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);
/* Can't call userland page fault handler in the speculative path */
- if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
+ if (unlikely(vmf.vma_flags & VM_UFFD_MISSING)) {
+ trace_spf_vma_notsup(_RET_IP_, vma, address);
goto out_put;
+ }
- if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
+ if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP) {
/*
* This could be detected by the check address against VMA's
* boundaries but we want to trace it as not supported instead
* of changed.
*/
+ trace_spf_vma_notsup(_RET_IP_, vma, address);
goto out_put;
+ }
if (address < READ_ONCE(vma->vm_start)
- || READ_ONCE(vma->vm_end) <= address)
+ || READ_ONCE(vma->vm_end) <= address) {
+ trace_spf_vma_changed(_RET_IP_, vma, address);
goto out_put;
+ }
if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
flags & FAULT_FLAG_INSTRUCTION,
flags & FAULT_FLAG_REMOTE)) {
+ trace_spf_vma_access(_RET_IP_, vma, address);
ret = VM_FAULT_SIGSEGV;
goto out_put;
}
@@ -4235,10 +4261,12 @@ vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
/* This is one is required to check that the VMA has write access set */
if (flags & FAULT_FLAG_WRITE) {
if (unlikely(!(vmf.vma_flags & VM_WRITE))) {
+ trace_spf_vma_access(_RET_IP_, vma, address);
ret = VM_FAULT_SIGSEGV;
goto out_put;
}
} else if (unlikely(!(vmf.vma_flags & (VM_READ|VM_EXEC|VM_WRITE)))) {
+ trace_spf_vma_access(_RET_IP_, vma, address);
ret = VM_FAULT_SIGSEGV;
goto out_put;
}
@@ -4252,8 +4280,10 @@ vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
pol = __get_vma_policy(vma, address);
if (!pol)
pol = get_task_policy(current);
- if (pol && pol->mode == MPOL_INTERLEAVE)
+ if (pol && pol->mode == MPOL_INTERLEAVE) {
+ trace_spf_vma_notsup(_RET_IP_, vma, address);
goto out_put;
+ }
#endif
/*
@@ -4326,8 +4356,10 @@ vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
* We need to re-validate the VMA after checking the bounds, otherwise
* we might have a false positive on the bounds.
*/
- if (read_seqcount_retry(&vma->vm_sequence, seq))
+ if (read_seqcount_retry(&vma->vm_sequence, seq)) {
+ trace_spf_vma_changed(_RET_IP_, vma, address);
goto out_put;
+ }
mem_cgroup_enter_user_fault();
ret = handle_pte_fault(&vmf);
@@ -4346,6 +4378,7 @@ vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
return ret;
out_walk:
+ trace_spf_vma_notsup(_RET_IP_, vma, address);
local_irq_enable();
out_put:
put_vma(vma);
--
2.21.0
^ permalink raw reply related
* [PATCH v12 17/31] mm: introduce __page_add_new_anon_rmap()
From: Laurent Dufour @ 2019-04-16 13:45 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
When dealing with the speculative page fault handler, we may race with a
VMA being split or merged. In this case the vma->vm_start and vma->vm_end
fields may not match the address at which the page fault is occurring.
This can only happen when the VMA is split, but in that case the
anon_vma pointer of the new VMA will be the same as the original one,
because in __split_vma the new->anon_vma is set to src->anon_vma when
*new = *vma.
So even if the VMA boundaries are not correct, the anon_vma pointer is
still valid.
If the VMA has been merged, then the VMA into which it has been merged
must have the same anon_vma pointer, otherwise the merge can't be done.
So in all cases we know that the anon_vma is valid: we have checked
before starting the speculative page fault that the anon_vma pointer is
valid for this VMA, and since there is an anon_vma, a page has been
backed at some point. Before the VMA is cleaned, the page table lock
would have to be grabbed to clean the PTE, and the anon_vma field is
checked once the PTE is locked.
This patch introduces a new __page_add_new_anon_rmap() service which
doesn't check the VMA boundaries, and creates a new inline one
which does the check.
When called from a page fault handler, if this is not a speculative one,
there is a guarantee that vm_start and vm_end match the faulting address,
so this check is useless. In the context of the speculative page fault
handler, this check may be wrong but anon_vma is still valid as explained
above.
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
include/linux/rmap.h | 12 ++++++++++--
mm/memory.c | 8 ++++----
mm/rmap.c | 5 ++---
3 files changed, 16 insertions(+), 9 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 988d176472df..a5d282573093 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -174,8 +174,16 @@ void page_add_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long, bool);
void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long, int);
-void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
- unsigned long, bool);
+void __page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
+ unsigned long, bool);
+static inline void page_add_new_anon_rmap(struct page *page,
+ struct vm_area_struct *vma,
+ unsigned long address, bool compound)
+{
+ VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+ __page_add_new_anon_rmap(page, vma, address, compound);
+}
+
void page_add_file_rmap(struct page *, bool);
void page_remove_rmap(struct page *, bool);
diff --git a/mm/memory.c b/mm/memory.c
index be93f2c8ebe0..46f877b6abea 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2347,7 +2347,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
* thread doing COW.
*/
ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
- page_add_new_anon_rmap(new_page, vma, vmf->address, false);
+ __page_add_new_anon_rmap(new_page, vma, vmf->address, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
__lru_cache_add_active_or_unevictable(new_page, vmf->vma_flags);
/*
@@ -2897,7 +2897,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
/* ksm created a completely new copy */
if (unlikely(page != swapcache && swapcache)) {
- page_add_new_anon_rmap(page, vma, vmf->address, false);
+ __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
@@ -3049,7 +3049,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
}
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, vmf->address, false);
+ __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
setpte:
@@ -3328,7 +3328,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
/* copy-on-write page */
if (write && !(vmf->vma_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, vmf->address, false);
+ __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
diff --git a/mm/rmap.c b/mm/rmap.c
index e5dfe2ae6b0d..2148e8ce6e34 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1140,7 +1140,7 @@ void do_page_add_anon_rmap(struct page *page,
}
/**
- * page_add_new_anon_rmap - add pte mapping to a new anonymous page
+ * __page_add_new_anon_rmap - add pte mapping to a new anonymous page
* @page: the page to add the mapping to
* @vma: the vm area in which the mapping is added
* @address: the user virtual address mapped
@@ -1150,12 +1150,11 @@ void do_page_add_anon_rmap(struct page *page,
* This means the inc-and-test can be bypassed.
* Page does not have to be locked.
*/
-void page_add_new_anon_rmap(struct page *page,
+void __page_add_new_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address, bool compound)
{
int nr = compound ? hpage_nr_pages(page) : 1;
- VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
__SetPageSwapBacked(page);
if (compound) {
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
--
2.21.0
* Re: [PATCH v12 04/31] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
From: Laurent Dufour @ 2019-04-16 14:31 UTC (permalink / raw)
To: Mark Rutland
Cc: jack, sergey.senozhatsky.work, peterz, Will Deacon, mhocko,
linux-mm, paulus, Punit Agrawal, hpa, Michel Lespinasse,
Alexei Starovoitov, Andrea Arcangeli, ak, Minchan Kim,
aneesh.kumar, x86, Matthew Wilcox, Daniel Jordan, Ingo Molnar,
David Rientjes, paulmck, Haiyan Song, npiggin, sj38.park,
Jerome Glisse, dave, kemi.wang, kirill, Thomas Gleixner,
zhong jiang, Ganesh Mahendran, Yang Shi, Mike Rapoport,
linuxppc-dev, linux-kernel, Sergey Senozhatsky, vinayak menon,
akpm, Tim Chen, haren
In-Reply-To: <20190416142710.GA54515@lakrids.cambridge.arm.com>
On 16/04/2019 at 16:27, Mark Rutland wrote:
> On Tue, Apr 16, 2019 at 03:44:55PM +0200, Laurent Dufour wrote:
>> From: Mahendran Ganesh <opensource.ganesh@gmail.com>
>>
>> Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for arm64. This
>> enables Speculative Page Fault handler.
>>
>> Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
>
> This is missing your S-o-B.
You're right, I missed that...
>
> The first patch noted that the ARCH_SUPPORTS_* option was there because
> the arch code had to make an explicit call to try to handle the fault
> speculatively, but that isn't added until patch 30.
>
> Why is this separate from that code?
Andrew recommended this a long time ago for bisection purposes. It
allows the code to be built with CONFIG_SPECULATIVE_PAGE_FAULT before the
code that triggers the spf handler is added to each architecture's code.
Thanks,
Laurent.
> Thanks,
> Mark.
>
>> ---
>> arch/arm64/Kconfig | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 870ef86a64ed..8e86934d598b 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -174,6 +174,7 @@ config ARM64
>> select SWIOTLB
>> select SYSCTL_EXCEPTION_TRACE
>> select THREAD_INFO_IN_TASK
>> + select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>> help
>> ARM 64-bit (AArch64) Linux support.
>>
>> --
>> 2.21.0
>>
>
* [PATCH v12 22/31] mm: provide speculative fault infrastructure
From: Laurent Dufour @ 2019-04-16 13:45 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
From: Peter Zijlstra <peterz@infradead.org>
Provide infrastructure to do a speculative fault (not holding
mmap_sem).
Not holding mmap_sem means we can race against VMA change/removal and
page-table destruction. We use the SRCU VMA freeing to keep the VMA
around. We use the VMA seqcount to detect change (including unmapping /
page-table deletion) and we use gup_fast() style page-table walking to
deal with page-table races.
Once we've obtained the page and are ready to update the PTE, we
check that the state we started the fault with is still valid. If
not, we fail the fault with VM_FAULT_RETRY; otherwise we update the
PTE and we're done.
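As a rough illustration of the validate-before-commit step described above,
here is a minimal single-threaded userspace sketch in C11. All toy_* names,
the struct layout and the race hook are hypothetical and not part of this
patch; the kernel code relies on the VMA seqcount and READ_ONCE() instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA: a sequence count and the bounds it guards. */
struct toy_vma {
	atomic_uint sequence;	/* odd while a writer is changing the VMA */
	unsigned long start;
	unsigned long end;
};

enum toy_result { TOY_DONE, TOY_RETRY };

static void toy_write_begin(struct toy_vma *vma)
{
	atomic_fetch_add(&vma->sequence, 1);	/* count becomes odd */
}

static void toy_write_end(struct toy_vma *vma)
{
	atomic_fetch_add(&vma->sequence, 1);	/* count becomes even again */
}

/* Simulated concurrent writer: shrinks the VMA under the sequence count. */
void toy_concurrent_shrink(struct toy_vma *vma)
{
	toy_write_begin(vma);
	vma->end = vma->start;
	toy_write_end(vma);
}

/*
 * Speculative attempt: snapshot the count, work without any lock, then
 * validate the snapshot before committing. 'race' is a test hook standing
 * in for another CPU touching the VMA in the middle of the fault.
 */
enum toy_result toy_speculative_fault(struct toy_vma *vma, unsigned long addr,
				      void (*race)(struct toy_vma *))
{
	unsigned int seq = atomic_load(&vma->sequence);

	if (seq & 1)
		return TOY_RETRY;	/* writer in progress, do not wait */
	if (addr < vma->start || addr >= vma->end)
		return TOY_RETRY;	/* leave it to the regular path */

	if (race)
		race(vma);		/* window in which the page is obtained */

	if (atomic_load(&vma->sequence) != seq)
		return TOY_RETRY;	/* state changed, fall back */
	return TOY_DONE;		/* safe to update the PTE */
}
```

The point of the sketch is only that a retry is cheap: any doubt about the
snapshot sends the caller back to the regular, mmap_sem protected path.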
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Manage the newly introduced pte_spinlock() for speculative page
fault to fail if the VMA is touched behind our back]
[Rename vma_is_dead() to vma_has_changed() and declare it here]
[Fetch p4d and pud]
[Set vmf.sequence in __handle_mm_fault()]
[Abort speculative path when handle_userfault() has to be called]
[Add additional VMA's flags checks in handle_speculative_fault()]
[Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
[Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
[Remove warning comment about waiting for !seq&1 since we don't want
to wait]
[Remove warning about no huge page support, mention it explicitly]
[Don't call do_fault() in the speculative path as __do_fault() calls
vma->vm_ops->fault() which may want to release mmap_sem]
[Only vm_fault pointer argument for vma_has_changed()]
[Fix check against huge page, calling pmd_trans_huge()]
[Use READ_ONCE() when reading VMA's fields in the speculative path]
[Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support the
processing done in vm_normal_page()]
[Check that vma->anon_vma is already set when starting the speculative
path]
[Check for memory policy as we can't support MPOL_INTERLEAVE case due to
the processing done in mpol_misplaced()]
[Don't support VMA growing up or down]
[Move check on vm_sequence just before calling handle_pte_fault()]
[Don't build SPF services if !CONFIG_SPECULATIVE_PAGE_FAULT]
[Add mem cgroup oom check]
[Use READ_ONCE to access p*d entries]
[Replace deprecated ACCESS_ONCE() by READ_ONCE() in vma_has_changed()]
[Don't fetch pte again in handle_pte_fault() when running the speculative
path]
[Check PMD against concurrent collapsing operation]
[Try to spin lock the pte during the speculative path to avoid a deadlock
with other CPUs invalidating the TLB and requiring this CPU to catch the
inter-processor interrupt]
[Move define of FAULT_FLAG_SPECULATIVE here]
[Introduce __handle_speculative_fault() and add a check against
mm->mm_users in handle_speculative_fault() defined in mm.h]
[Abort if vm_ops->fault is set instead of checking only vm_ops]
[Use find_vma_rcu() and call put_vma() when we are done with the VMA]
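The trylock note above can be pictured with a small userspace sketch in
C11. It is only an approximation under stated assumptions: the toy_* names
are hypothetical, an atomic_flag plays the role of the page-table lock, and
the IRQ disabling done by the real pte_spinlock() is reduced to a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical fault context: a toy page-table lock and a seqcount snapshot. */
struct toy_fault {
	atomic_flag ptl;		/* set means "locked" */
	atomic_uint *vma_sequence;	/* live sequence count of the toy VMA */
	unsigned int sequence;		/* value sampled when the fault started */
};

bool toy_vma_has_changed(struct toy_fault *f)
{
	return atomic_load(f->vma_sequence) != f->sequence;
}

/* Returns true with the lock held, or false if the fault must be retried. */
bool toy_pte_spinlock(struct toy_fault *f)
{
	for (;;) {
		if (toy_vma_has_changed(f))
			return false;	/* nothing taken, nothing to undo */
		if (!atomic_flag_test_and_set(&f->ptl))
			break;		/* trylock succeeded */
		/*
		 * Contended: the kernel re-enables IRQs at this point so a
		 * pending TLB-invalidate IPI can be served, then tries again.
		 */
	}

	/* Re-check now that the lock is held. */
	if (toy_vma_has_changed(f)) {
		atomic_flag_clear(&f->ptl);
		return false;
	}
	return true;
}

void toy_pte_unlock(struct toy_fault *f)
{
	atomic_flag_clear(&f->ptl);
}
```

A blocking lock would not allow the re-check between attempts, which is why
the sketch, like the patch, loops on a trylock.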
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
include/linux/hugetlb_inline.h | 2 +-
include/linux/mm.h | 30 +++
include/linux/pagemap.h | 4 +-
mm/internal.h | 15 ++
mm/memory.c | 344 ++++++++++++++++++++++++++++++++-
5 files changed, 389 insertions(+), 6 deletions(-)
diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index 0660a03d37d9..9e25283d6fc9 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -8,7 +8,7 @@
static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
{
- return !!(vma->vm_flags & VM_HUGETLB);
+ return !!(READ_ONCE(vma->vm_flags) & VM_HUGETLB);
}
#else
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f761a9c65c74..ec609cbad25a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -381,6 +381,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */
#define FAULT_FLAG_REMOTE 0x80 /* faulting for non current tsk/mm */
#define FAULT_FLAG_INSTRUCTION 0x100 /* The fault was during an instruction fetch */
+#define FAULT_FLAG_SPECULATIVE 0x200 /* Speculative fault, not holding mmap_sem */
#define FAULT_FLAG_TRACE \
{ FAULT_FLAG_WRITE, "WRITE" }, \
@@ -409,6 +410,10 @@ struct vm_fault {
gfp_t gfp_mask; /* gfp mask to be used for allocations */
pgoff_t pgoff; /* Logical page offset based on vma */
unsigned long address; /* Faulting virtual address */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ unsigned int sequence;
+ pmd_t orig_pmd; /* value of PMD at the time of fault */
+#endif
pmd_t *pmd; /* Pointer to pmd entry matching
* the 'address' */
pud_t *pud; /* Pointer to pud entry matching
@@ -1524,6 +1529,31 @@ int invalidate_inode_page(struct page *page);
#ifdef CONFIG_MMU
extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
unsigned long address, unsigned int flags);
+
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+extern vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
+ unsigned long address,
+ unsigned int flags);
+static inline vm_fault_t handle_speculative_fault(struct mm_struct *mm,
+ unsigned long address,
+ unsigned int flags)
+{
+ /*
+ * Try speculative page fault for multithreaded user space task only.
+ */
+ if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+ return VM_FAULT_RETRY;
+ return __handle_speculative_fault(mm, address, flags);
+}
+#else
+static inline vm_fault_t handle_speculative_fault(struct mm_struct *mm,
+ unsigned long address,
+ unsigned int flags)
+{
+ return VM_FAULT_RETRY;
+}
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
unsigned long address, unsigned int fault_flags,
bool *unlocked);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2e8438a1216a..2fcfaa910007 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -457,8 +457,8 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
pgoff_t pgoff;
if (unlikely(is_vm_hugetlb_page(vma)))
return linear_hugepage_index(vma, address);
- pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
- pgoff += vma->vm_pgoff;
+ pgoff = (address - READ_ONCE(vma->vm_start)) >> PAGE_SHIFT;
+ pgoff += READ_ONCE(vma->vm_pgoff);
return pgoff;
}
diff --git a/mm/internal.h b/mm/internal.h
index 1e368e4afe3c..ed91b199cb8c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -58,6 +58,21 @@ static inline void put_vma(struct vm_area_struct *vma)
extern struct vm_area_struct *find_vma_rcu(struct mm_struct *mm,
unsigned long addr);
+
+static inline bool vma_has_changed(struct vm_fault *vmf)
+{
+ int ret = RB_EMPTY_NODE(&vmf->vma->vm_rb);
+ unsigned int seq = READ_ONCE(vmf->vma->vm_sequence.sequence);
+
+ /*
+ * Matches both the wmb in write_seqlock_{begin,end}() and
+ * the wmb in vma_rb_erase().
+ */
+ smp_rmb();
+
+ return ret || seq != vmf->sequence;
+}
+
#else /* CONFIG_SPECULATIVE_PAGE_FAULT */
static inline void get_vma(struct vm_area_struct *vma)
diff --git a/mm/memory.c b/mm/memory.c
index 46f877b6abea..6e6bf61c0e5c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -522,7 +522,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
if (page)
dump_page(page, "bad pte");
pr_alert("addr:%p vm_flags:%08lx anon_vma:%p mapping:%p index:%lx\n",
- (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
+ (void *)addr, READ_ONCE(vma->vm_flags), vma->anon_vma,
+ mapping, index);
pr_alert("file:%pD fault:%pf mmap:%pf readpage:%pf\n",
vma->vm_file,
vma->vm_ops ? vma->vm_ops->fault : NULL,
@@ -2082,6 +2083,118 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
}
EXPORT_SYMBOL_GPL(apply_to_page_range);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+static bool pte_spinlock(struct vm_fault *vmf)
+{
+ bool ret = false;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ pmd_t pmdval;
+#endif
+
+ /* Check if vma is still valid */
+ if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
+ vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+ spin_lock(vmf->ptl);
+ return true;
+ }
+
+again:
+ local_irq_disable();
+ if (vma_has_changed(vmf))
+ goto out;
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ /*
+ * We check if the pmd value is still the same to ensure that there
+ * is not a huge collapse operation in progress in our back.
+ */
+ pmdval = READ_ONCE(*vmf->pmd);
+ if (!pmd_same(pmdval, vmf->orig_pmd))
+ goto out;
+#endif
+
+ vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+ if (unlikely(!spin_trylock(vmf->ptl))) {
+ local_irq_enable();
+ goto again;
+ }
+
+ if (vma_has_changed(vmf)) {
+ spin_unlock(vmf->ptl);
+ goto out;
+ }
+
+ ret = true;
+out:
+ local_irq_enable();
+ return ret;
+}
+
+static bool pte_map_lock(struct vm_fault *vmf)
+{
+ bool ret = false;
+ pte_t *pte;
+ spinlock_t *ptl;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ pmd_t pmdval;
+#endif
+
+ if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
+ vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ return true;
+ }
+
+ /*
+ * The first vma_has_changed() guarantees the page-tables are still
+ * valid, having IRQs disabled ensures they stay around, hence the
+ * second vma_has_changed() to make sure they are still valid once
+ * we've got the lock. After that a concurrent zap_pte_range() will
+ * block on the PTL and thus we're safe.
+ */
+again:
+ local_irq_disable();
+ if (vma_has_changed(vmf))
+ goto out;
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ /*
+ * We check if the pmd value is still the same to ensure that there
+ * is not a huge collapse operation in progress in our back.
+ */
+ pmdval = READ_ONCE(*vmf->pmd);
+ if (!pmd_same(pmdval, vmf->orig_pmd))
+ goto out;
+#endif
+
+ /*
+ * Same as pte_offset_map_lock() except that we call
+ * spin_trylock() in place of spin_lock() to avoid race with
+ * unmap path which may have the lock and wait for this CPU
+ * to invalidate TLB but this CPU has irq disabled.
+ * Since we are in a speculative path, accept it could fail
+ */
+ ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+ pte = pte_offset_map(vmf->pmd, vmf->address);
+ if (unlikely(!spin_trylock(ptl))) {
+ pte_unmap(pte);
+ local_irq_enable();
+ goto again;
+ }
+
+ if (vma_has_changed(vmf)) {
+ pte_unmap_unlock(pte, ptl);
+ goto out;
+ }
+
+ vmf->pte = pte;
+ vmf->ptl = ptl;
+ ret = true;
+out:
+ local_irq_enable();
+ return ret;
+}
+#else
static inline bool pte_spinlock(struct vm_fault *vmf)
{
vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
@@ -2095,6 +2208,7 @@ static inline bool pte_map_lock(struct vm_fault *vmf)
vmf->address, &vmf->ptl);
return true;
}
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
/*
* handle_pte_fault chooses page fault handler according to an entry which was
@@ -2999,6 +3113,14 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
ret = check_stable_address_space(vma->vm_mm);
if (ret)
goto unlock;
+ /*
+ * Don't call the userfaultfd during the speculative path.
+ * We already checked for the VMA to not be managed through
+ * userfaultfd, but it may be set behind our back once we have
+ * locked the pte. In such a case we can ignore it this time.
+ */
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ goto setpte;
/* Deliver the page fault to userland, check inside PT lock */
if (userfaultfd_missing(vma)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -3041,7 +3163,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
goto unlock_and_release;
/* Deliver the page fault to userland, check inside PT lock */
- if (userfaultfd_missing(vma)) {
+ if (!(vmf->flags & FAULT_FLAG_SPECULATIVE) &&
+ userfaultfd_missing(vma)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
mem_cgroup_cancel_charge(page, memcg, false);
put_page(page);
@@ -3836,6 +3959,15 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
pte_t entry;
if (unlikely(pmd_none(*vmf->pmd))) {
+ /*
+ * In the case of the speculative page fault handler we abort
+ * the speculative path immediately as the pmd is probably
+ * in the process of being converted into a huge one. We will
+ * try again holding the mmap_sem (which implies that the
+ * collapse operation is done).
+ */
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ return VM_FAULT_RETRY;
/*
* Leave __pte_alloc() until later: because vm_ops->fault may
* want to allocate huge page, and if we expose page table
@@ -3843,7 +3975,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
* concurrent faults and from rmap lookups.
*/
vmf->pte = NULL;
- } else {
+ } else if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
/* See comment in pte_alloc_one_map() */
if (pmd_devmap_trans_unstable(vmf->pmd))
return 0;
@@ -3852,6 +3984,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
* pmd from under us anymore at this point because we hold the
* mmap_sem read mode and khugepaged takes it in write mode.
* So now it's safe to run pte_offset_map().
+ * This is not applicable to the speculative page fault handler
+ * but in that case, the pte is fetched earlier in
+ * handle_speculative_fault().
*/
vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
vmf->orig_pte = *vmf->pte;
@@ -3874,6 +4009,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
if (!vmf->pte) {
if (vma_is_anonymous(vmf->vma))
return do_anonymous_page(vmf);
+ else if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ return VM_FAULT_RETRY;
else
return do_fault(vmf);
}
@@ -3971,6 +4108,9 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
vmf.pmd = pmd_alloc(mm, vmf.pud, address);
if (!vmf.pmd)
return VM_FAULT_OOM;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ vmf.sequence = raw_read_seqcount(&vma->vm_sequence);
+#endif
if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) {
ret = create_huge_pmd(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
@@ -4004,6 +4144,204 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
return handle_pte_fault(&vmf);
}
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+/*
+ * Tries to handle the page fault in a speculative way, without grabbing the
+ * mmap_sem.
+ */
+vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
+ unsigned long address,
+ unsigned int flags)
+{
+ struct vm_fault vmf = {
+ .address = address,
+ };
+ pgd_t *pgd, pgdval;
+ p4d_t *p4d, p4dval;
+ pud_t pudval;
+ int seq;
+ vm_fault_t ret = VM_FAULT_RETRY;
+ struct vm_area_struct *vma;
+#ifdef CONFIG_NUMA
+ struct mempolicy *pol;
+#endif
+
+ /* Clear flags that may lead to release the mmap_sem to retry */
+ flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
+ flags |= FAULT_FLAG_SPECULATIVE;
+
+ vma = find_vma_rcu(mm, address);
+ if (!vma)
+ return ret;
+
+ /* rmb <-> seqlock,vma_rb_erase() */
+ seq = raw_read_seqcount(&vma->vm_sequence);
+ if (seq & 1)
+ goto out_put;
+
+ /*
+ * Can't call vm_ops service as we don't know what they would do
+ * with the VMA.
+ * This includes huge pages from hugetlbfs.
+ */
+ if (vma->vm_ops && vma->vm_ops->fault)
+ goto out_put;
+
+ /*
+ * __anon_vma_prepare() requires the mmap_sem to be held
+ * because vm_next and vm_prev must be safe. This can't be guaranteed
+ * in the speculative path.
+ */
+ if (unlikely(!vma->anon_vma))
+ goto out_put;
+
+ vmf.vma_flags = READ_ONCE(vma->vm_flags);
+ vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);
+
+ /* Can't call userland page fault handler in the speculative path */
+ if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
+ goto out_put;
+
+ if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
+ /*
+ * This could be detected by checking the address against the
+ * VMA's boundaries but we want to trace it as not supported
+ * instead of changed.
+ */
+ goto out_put;
+
+ if (address < READ_ONCE(vma->vm_start)
+ || READ_ONCE(vma->vm_end) <= address)
+ goto out_put;
+
+ if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+ flags & FAULT_FLAG_INSTRUCTION,
+ flags & FAULT_FLAG_REMOTE)) {
+ ret = VM_FAULT_SIGSEGV;
+ goto out_put;
+ }
+
+ /* This one is required to check that the VMA has write access set */
+ if (flags & FAULT_FLAG_WRITE) {
+ if (unlikely(!(vmf.vma_flags & VM_WRITE))) {
+ ret = VM_FAULT_SIGSEGV;
+ goto out_put;
+ }
+ } else if (unlikely(!(vmf.vma_flags & (VM_READ|VM_EXEC|VM_WRITE)))) {
+ ret = VM_FAULT_SIGSEGV;
+ goto out_put;
+ }
+
+#ifdef CONFIG_NUMA
+ /*
+ * MPOL_INTERLEAVE implies additional checks in
+ * mpol_misplaced() which are not compatible with the
+ * speculative page fault processing.
+ */
+ pol = __get_vma_policy(vma, address);
+ if (!pol)
+ pol = get_task_policy(current);
+ if (pol && pol->mode == MPOL_INTERLEAVE)
+ goto out_put;
+#endif
+
+ /*
+ * Do a speculative lookup of the PTE entry.
+ */
+ local_irq_disable();
+ pgd = pgd_offset(mm, address);
+ pgdval = READ_ONCE(*pgd);
+ if (pgd_none(pgdval) || unlikely(pgd_bad(pgdval)))
+ goto out_walk;
+
+ p4d = p4d_offset(pgd, address);
+ p4dval = READ_ONCE(*p4d);
+ if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval)))
+ goto out_walk;
+
+ vmf.pud = pud_offset(p4d, address);
+ pudval = READ_ONCE(*vmf.pud);
+ if (pud_none(pudval) || unlikely(pud_bad(pudval)))
+ goto out_walk;
+
+ /* Huge pages at PUD level are not supported. */
+ if (unlikely(pud_trans_huge(pudval)))
+ goto out_walk;
+
+ vmf.pmd = pmd_offset(vmf.pud, address);
+ vmf.orig_pmd = READ_ONCE(*vmf.pmd);
+ /*
+ * pmd_none could mean that a hugepage collapse is in progress
+ * behind our back as collapse_huge_page() marks it before
+ * invalidating the pte (which is done once the IPI is caught
+ * by all CPUs and we have interrupts disabled).
+ * For this reason we cannot handle THP in a speculative way since we
+ * can't safely identify an in progress collapse operation done in our
+ * back on that PMD.
+ * Regarding the order of the following checks, see comment in
+ * pmd_devmap_trans_unstable()
+ */
+ if (unlikely(pmd_devmap(vmf.orig_pmd) ||
+ pmd_none(vmf.orig_pmd) || pmd_trans_huge(vmf.orig_pmd) ||
+ is_swap_pmd(vmf.orig_pmd)))
+ goto out_walk;
+
+ /*
+ * The above does not allocate/instantiate page-tables because doing so
+ * would lead to the possibility of instantiating page-tables after
+ * free_pgtables() -- and consequently leaking them.
+ *
+ * The result is that we take at least one !speculative fault per PMD
+ * in order to instantiate it.
+ */
+
+ vmf.pte = pte_offset_map(vmf.pmd, address);
+ vmf.orig_pte = READ_ONCE(*vmf.pte);
+ barrier(); /* See comment in handle_pte_fault() */
+ if (pte_none(vmf.orig_pte)) {
+ pte_unmap(vmf.pte);
+ vmf.pte = NULL;
+ }
+
+ vmf.vma = vma;
+ vmf.pgoff = linear_page_index(vma, address);
+ vmf.gfp_mask = __get_fault_gfp_mask(vma);
+ vmf.sequence = seq;
+ vmf.flags = flags;
+
+ local_irq_enable();
+
+ /*
+ * We need to re-validate the VMA after checking the bounds, otherwise
+ * we might have a false positive on the bounds.
+ */
+ if (read_seqcount_retry(&vma->vm_sequence, seq))
+ goto out_put;
+
+ mem_cgroup_enter_user_fault();
+ ret = handle_pte_fault(&vmf);
+ mem_cgroup_exit_user_fault();
+
+ put_vma(vma);
+
+ /*
+ * The task may have entered a memcg OOM situation but
+ * if the allocation error was handled gracefully (no
+ * VM_FAULT_OOM), there is no need to kill anything.
+ * Just clean up the OOM state peacefully.
+ */
+ if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
+ mem_cgroup_oom_synchronize(false);
+ return ret;
+
+out_walk:
+ local_irq_enable();
+out_put:
+ put_vma(vma);
+ return ret;
+}
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
/*
* By the time we get here, we already hold the mm semaphore
*
--
2.21.0
* [PATCH v12 18/31] mm: protect against PTE changes done by dup_mmap()
From: Laurent Dufour @ 2019-04-16 13:45 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, Vinayak Menon,
paulmck, Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
Vinayak Menon and Ganesh Mahendran reported that the following scenario may
lead to a thread being blocked due to data corruption:
    CPU 1                       CPU 2                       CPU 3
    Process 1,                  Process 1,                  Process 1,
    Thread A                    Thread B                    Thread C

    while (1) {                 while (1) {                 while(1) {
    pthread_mutex_lock(l)       pthread_mutex_lock(l)       fork
    pthread_mutex_unlock(l)     pthread_mutex_unlock(l)     }
    }                           }
In detail, this happens because:
    CPU 1                       CPU 2                       CPU 3
    fork()
    copy_pte_range()
      set PTE rdonly
    go to next VMA...
     .                          PTE is seen rdonly          PTE still writable
     .                          thread is writing to page
     .                          -> page fault
     .                            copy the page             Thread writes to page
     .                             .                        -> no page fault
     .                            update the PTE
     .                            flush TLB for that PTE
    flush TLB                                               PTE are now rdonly
So the write done by CPU 3 interferes with the page copy operation
done by CPU 2, leading to data corruption.
To avoid this, we mark all the VMAs involved in the COW mechanism as changing
by calling vm_write_begin(). This ensures that the speculative page fault
handler will not try to handle a fault on these pages.
The marker stays set until the TLB is flushed, ensuring that all the CPUs
now see the PTE as not writable.
Once the TLB is flushed, the marker is removed by calling vm_write_end().
The variable last is used to keep track of the latest VMA marked, in order
to handle the error path where only part of the VMAs may have been marked.
Since multiple VMAs from the same mm may have their sequence count increased
during this process, the use of vm_raw_write_begin/end() is required to
avoid false lockdep warnings.
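The mark, flush, then walk-back-and-unmark sequence can be sketched in
plain userspace C as below. This is only an illustration under assumptions:
the toy_* names and flag values are hypothetical, a bare counter stands in
for the VMA sequence count, and skipping the "don't copy" VMAs in the
marking loop is assumed to mirror what the unmark loop in the diff does.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_DONTCOPY	0x1	/* hypothetical stand-in for VM_DONTCOPY */
#define TOY_WIPEONFORK	0x2	/* hypothetical stand-in for VM_WIPEONFORK */

/* Hypothetical VMA: a sequence count (odd means "changing") in a linked list. */
struct toy_vma {
	unsigned int seq;
	unsigned int flags;
	struct toy_vma *prev, *next;
};

/*
 * Forward walk, as done while duplicating the address space: every VMA whose
 * pages are about to be write-protected is marked as changing. Returns the
 * last VMA that was marked, or NULL if none was.
 */
struct toy_vma *toy_mark_for_copy(struct toy_vma *head)
{
	struct toy_vma *last = NULL;

	for (struct toy_vma *vma = head; vma; vma = vma->next) {
		if (vma->flags & TOY_DONTCOPY)
			continue;
		if (!(vma->flags & TOY_WIPEONFORK)) {
			last = vma;
			vma->seq++;	/* odd: speculative faults back off */
		}
	}
	return last;
}

/* Backward walk from 'last', run only once the TLB has been flushed. */
void toy_unmark_after_flush(struct toy_vma *last)
{
	for (; last; last = last->prev) {
		if (last->flags & TOY_DONTCOPY)
			continue;
		if (!(last->flags & TOY_WIPEONFORK))
			last->seq++;	/* even again: VMA is stable */
	}
}
```

Starting the backward walk from the last marked VMA is what keeps the error
path correct: VMAs after it were never marked and are never touched.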
Reported-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Reported-by: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
kernel/fork.c | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index f8dae021c2e5..2992d2c95256 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -462,7 +462,7 @@ EXPORT_SYMBOL(free_task);
static __latent_entropy int dup_mmap(struct mm_struct *mm,
struct mm_struct *oldmm)
{
- struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
+ struct vm_area_struct *mpnt, *tmp, *prev, **pprev, *last = NULL;
struct rb_node **rb_link, *rb_parent;
int retval;
unsigned long charge;
@@ -581,8 +581,18 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
rb_parent = &tmp->vm_rb;
mm->map_count++;
- if (!(tmp->vm_flags & VM_WIPEONFORK))
+ if (!(tmp->vm_flags & VM_WIPEONFORK)) {
+ if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
+ /*
+ * Mark this VMA as changing to prevent the
+ * speculative page fault handler from processing
+ * it until the TLB is flushed below.
+ */
+ last = mpnt;
+ vm_raw_write_begin(mpnt);
+ }
retval = copy_page_range(mm, oldmm, mpnt);
+ }
if (tmp->vm_ops && tmp->vm_ops->open)
tmp->vm_ops->open(tmp);
@@ -595,6 +605,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
out:
up_write(&mm->mmap_sem);
flush_tlb_mm(oldmm);
+
+ if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
+ /*
+ * Since the TLB has been flushed, we can safely unmark the
+ * copied VMAs and allow the speculative page fault handler to
+ * process them again.
+ * Walk back the VMA list from the last marked VMA.
+ */
+ for (; last; last = last->vm_prev) {
+ if (last->vm_flags & VM_DONTCOPY)
+ continue;
+ if (!(last->vm_flags & VM_WIPEONFORK))
+ vm_raw_write_end(last);
+ }
+ }
+
up_write(&oldmm->mmap_sem);
dup_userfaultfd_complete(&uf);
fail_uprobe_end:
--
2.21.0
* [PATCH v12 23/31] mm: don't do swap readahead during speculative page fault
From: Laurent Dufour @ 2019-04-16 13:45 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, Vinayak Menon,
paulmck, Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
Vinayak Menon faced a panic because one thread was page faulting a page in
swap, while another one was mprotecting a part of the VMA leading to a VMA
split.
This raises a panic in swap_vma_readahead() because the VMA's boundaries
no longer match the faulting address.
To avoid this, if the page is not found in the swap cache, the speculative
page fault is aborted so that a regular page fault is retried.
Reported-by: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
mm/memory.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index 6e6bf61c0e5c..1991da97e2db 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2900,6 +2900,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
lru_cache_add_anon(page);
swap_readpage(page, true);
}
+ } else if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ /*
+ * Don't try readahead during a speculative page fault
+ * as the VMA's boundaries may change in our back.
+ * If the page is not in the swap cache and synchronous
+ * read is disabled, fall back to the regular page
+ * fault mechanism.
+ */
+ delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+ ret = VM_FAULT_RETRY;
+ goto out;
} else {
page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
vmf);
--
2.21.0
* [PATCH v12 25/31] perf: add a speculative page fault sw event
From: Laurent Dufour @ 2019-04-16 13:45 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
Add a new software event to count successful speculative page faults.
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
include/uapi/linux/perf_event.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 7198ddd0c6b1..3b4356c55caa 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -112,6 +112,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_EMULATION_FAULTS = 8,
PERF_COUNT_SW_DUMMY = 9,
PERF_COUNT_SW_BPF_OUTPUT = 10,
+ PERF_COUNT_SW_SPF = 11,
PERF_COUNT_SW_MAX, /* non-ABI */
};
--
2.21.0
* [PATCH v12 20/31] mm: introduce vma reference counter
From: Laurent Dufour @ 2019-04-16 13:45 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
The final goal is to be able to use a VMA structure without holding the
mmap_sem and to be sure that the structure will not be freed behind our back.
The lockless use of the VMA will be done through RCU protection and thus a
dedicated freeing service is required to manage it asynchronously.
As reported in a 2010 thread [1], this may impact file handling when a
file is still referenced while the mapping is no longer there. As the final
goal is to handle anonymous VMAs in a speculative way and not file-backed
mappings, we can close and free the file pointer synchronously, as
soon as we are guaranteed not to use it without holding the mmap_sem. For
sanity reasons, as a minimal effort, the vm_file file pointer is unset once
the file pointer is put.
[1] https://lore.kernel.org/linux-mm/20100104182429.833180340@chello.nl/
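
The get/put scheme described above can be modelled in userspace with C11 atomics. This is only a sketch under stated assumptions: struct vma_model, its freed_flag member and the vma_model_*() helpers are hypothetical stand-ins for vm_area_struct, get_vma() and put_vma(); the kernel code uses atomic_t and __free_vma() instead.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct vm_area_struct. */
struct vma_model {
	atomic_int ref_count;	/* plays the role of vm_ref_count */
	int *freed_flag;	/* demo only: set when the object is released */
};

/* Allocate an object holding one reference, like vma_init() sets 1. */
struct vma_model *vma_model_alloc(int *freed_flag)
{
	struct vma_model *vma = malloc(sizeof(*vma));

	if (!vma)
		return NULL;
	atomic_init(&vma->ref_count, 1);
	vma->freed_flag = freed_flag;
	return vma;
}

/* Take an extra reference; the caller must already hold a valid pointer. */
void vma_model_get(struct vma_model *vma)
{
	atomic_fetch_add(&vma->ref_count, 1);
}

/* Drop a reference; returns true if this call released the object. */
bool vma_model_put(struct vma_model *vma)
{
	/* fetch_sub returns the previous value: 1 means we were the last. */
	if (atomic_fetch_sub(&vma->ref_count, 1) == 1) {
		*vma->freed_flag = 1;
		free(vma);
		return true;
	}
	return false;
}
```

With this model, a lockless user that took a reference keeps the object alive even after the owner has dropped its own reference; the memory is released only by whichever put comes last.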
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
include/linux/mm.h | 4 ++++
include/linux/mm_types.h | 3 +++
mm/internal.h | 27 +++++++++++++++++++++++++++
mm/mmap.c | 13 +++++++++----
4 files changed, 43 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f14b2c9ddfd4..f761a9c65c74 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -529,6 +529,9 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ atomic_set(&vma->vm_ref_count, 1);
+#endif
}
static inline void vma_set_anonymous(struct vm_area_struct *vma)
@@ -1418,6 +1421,7 @@ static inline void INIT_VMA(struct vm_area_struct *vma)
INIT_LIST_HEAD(&vma->anon_vma_chain);
#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
seqcount_init(&vma->vm_sequence);
+ atomic_set(&vma->vm_ref_count, 1);
#endif
}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 24b3f8ce9e42..6a6159e11a3f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -285,6 +285,9 @@ struct vm_area_struct {
/* linked list of VM areas per task, sorted by address */
struct vm_area_struct *vm_next, *vm_prev;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ atomic_t vm_ref_count;
+#endif
struct rb_node vm_rb;
/*
diff --git a/mm/internal.h b/mm/internal.h
index 9eeaf2b95166..302382bed406 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -40,6 +40,33 @@ void page_writeback_init(void);
vm_fault_t do_swap_page(struct vm_fault *vmf);
+
+extern void __free_vma(struct vm_area_struct *vma);
+
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+static inline void get_vma(struct vm_area_struct *vma)
+{
+ atomic_inc(&vma->vm_ref_count);
+}
+
+static inline void put_vma(struct vm_area_struct *vma)
+{
+ if (atomic_dec_and_test(&vma->vm_ref_count))
+ __free_vma(vma);
+}
+
+#else
+
+static inline void get_vma(struct vm_area_struct *vma)
+{
+}
+
+static inline void put_vma(struct vm_area_struct *vma)
+{
+ __free_vma(vma);
+}
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
diff --git a/mm/mmap.c b/mm/mmap.c
index f7f6027a7dff..c106440dcae7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -188,6 +188,12 @@ static inline void mm_write_sequnlock(struct mm_struct *mm)
}
#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+void __free_vma(struct vm_area_struct *vma)
+{
+ mpol_put(vma_policy(vma));
+ vm_area_free(vma);
+}
+
/*
* Close a vm structure and free it, returning the next.
*/
@@ -200,8 +206,8 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
vma->vm_ops->close(vma);
if (vma->vm_file)
fput(vma->vm_file);
- mpol_put(vma_policy(vma));
- vm_area_free(vma);
+ vma->vm_file = NULL;
+ put_vma(vma);
return next;
}
@@ -990,8 +996,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
if (next->anon_vma)
anon_vma_merge(vma, next);
mm->map_count--;
- mpol_put(vma_policy(next));
- vm_area_free(next);
+ put_vma(next);
/*
* In mprotect's case 6 (see comments on vma_merge),
* we must remove another next too. It would clutter
--
2.21.0
* [PATCH v12 10/31] mm: protect VMA modifications using VMA sequence count
From: Laurent Dufour @ 2019-04-16 13:45 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
The VMA sequence count has been introduced to allow fast detection of
VMA modification when running a page fault handler without holding
the mmap_sem.
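
To illustrate the idea, here is a minimal userspace sketch of a sequence-count protocol, assuming that vm_write_begin() and vm_write_end() bracket a write section on vma->vm_sequence (their definitions are not part of this patch). All names below are hypothetical, and C11 sequentially consistent atomics stand in for the kernel's seqcount primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a VMA guarded by a sequence counter. */
struct vma_seq_model {
	atomic_uint sequence;		/* odd while a writer is active */
	_Atomic unsigned long start;
	_Atomic unsigned long end;
};

/* Writer side: counter becomes odd on entry, even again on exit. */
void model_write_begin(struct vma_seq_model *v)
{
	atomic_fetch_add(&v->sequence, 1);
}

void model_write_end(struct vma_seq_model *v)
{
	atomic_fetch_add(&v->sequence, 1);
}

/*
 * Reader side: take a snapshot without any lock. Returns false when a
 * writer was active or the counter moved, in which case the caller is
 * expected to retry or fall back to the locked path.
 */
bool model_read_range(struct vma_seq_model *v,
		      unsigned long *start, unsigned long *end)
{
	unsigned int seq = atomic_load(&v->sequence);

	if (seq & 1)
		return false;
	*start = atomic_load(&v->start);
	*end = atomic_load(&v->end);
	return atomic_load(&v->sequence) == seq;
}
```

The reader never blocks the writer; it only detects that a modification happened and discards its snapshot, which is the property the speculative fault path relies on.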
This patch provides protection against the VMA modifications done in:
- madvise()
- mpol_rebind_policy()
- vma_replace_policy()
- change_prot_numa()
- mlock(), munlock()
- mprotect()
- mmap_region()
- collapse_huge_page()
- userfaultfd registering services
In addition, VMA fields which will be read during the speculative fault
path need to be written using WRITE_ONCE() to prevent writes from being
split and intermediate values from being pushed to other CPUs.
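
The following sketch shows the pattern applied throughout the diff: compute the new value first, then publish it with a single store. MODEL_WRITE_ONCE() and MODEL_READ_ONCE() are hypothetical simplifications of the kernel macros (a volatile access for word-sized fields), and struct vma_flags_model is an illustrative stand-in; neither adds any memory-ordering barrier.

```c
/* Simplified models of the kernel macros; GNU __typeof__ is assumed. */
#define MODEL_WRITE_ONCE(x, val) \
	(*(volatile __typeof__(x) *)&(x) = (val))
#define MODEL_READ_ONCE(x) \
	(*(volatile __typeof__(x) *)&(x))

/* Hypothetical stand-in for the vm_flags field of a VMA. */
struct vma_flags_model {
	unsigned long flags;
};

/*
 * Clear bits the way the patch rewrites "vma->vm_flags &= ~mask":
 * the read-modify step uses a plain read, and only the final value
 * is stored, through one volatile write.
 */
void model_clear_flags(struct vma_flags_model *vma, unsigned long mask)
{
	unsigned long new_flags = vma->flags & ~mask;

	MODEL_WRITE_ONCE(vma->flags, new_flags);
}

/* Lockless reader: one volatile load yields one consistent value. */
unsigned long model_read_flags(struct vma_flags_model *vma)
{
	return MODEL_READ_ONCE(vma->flags);
}
```

Note that the second argument must be a pure expression. Passing a compound assignment such as "flags &= mask" would perform a plain store before the volatile one and defeat the purpose.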
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
fs/proc/task_mmu.c | 5 ++++-
fs/userfaultfd.c | 17 ++++++++++++----
mm/khugepaged.c | 3 +++
mm/madvise.c | 6 +++++-
mm/mempolicy.c | 51 ++++++++++++++++++++++++++++++----------------
mm/mlock.c | 13 +++++++-----
mm/mmap.c | 28 ++++++++++++++++---------
mm/mprotect.c | 4 +++-
mm/swap_state.c | 10 ++++++---
9 files changed, 95 insertions(+), 42 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 01d4eb0e6bd1..0864c050b2de 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1162,8 +1162,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
goto out_mm;
}
for (vma = mm->mmap; vma; vma = vma->vm_next) {
- vma->vm_flags &= ~VM_SOFTDIRTY;
+ vm_write_begin(vma);
+ WRITE_ONCE(vma->vm_flags,
+ vma->vm_flags & ~VM_SOFTDIRTY);
vma_set_page_prot(vma);
+ vm_write_end(vma);
}
downgrade_write(&mm->mmap_sem);
break;
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 3b30301c90ec..2e0f98cadd81 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -667,8 +667,11 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
octx = vma->vm_userfaultfd_ctx.ctx;
if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
+ vm_write_begin(vma);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
- vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+ WRITE_ONCE(vma->vm_flags,
+ vma->vm_flags & ~(VM_UFFD_WP | VM_UFFD_MISSING));
+ vm_write_end(vma);
return 0;
}
@@ -908,8 +911,10 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
vma = prev;
else
prev = vma;
- vma->vm_flags = new_flags;
+ vm_write_begin(vma);
+ WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+ vm_write_end(vma);
}
skip_mm:
up_write(&mm->mmap_sem);
@@ -1474,8 +1479,10 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
* the next vma was merged into the current one and
* the current one has not been updated yet.
*/
- vma->vm_flags = new_flags;
+ vm_write_begin(vma);
+ WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx.ctx = ctx;
+ vm_write_end(vma);
skip:
prev = vma;
@@ -1636,8 +1643,10 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
* the next vma was merged into the current one and
* the current one has not been updated yet.
*/
- vma->vm_flags = new_flags;
+ vm_write_begin(vma);
+ WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+ vm_write_end(vma);
skip:
prev = vma;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a335f7c1fac4..6a0cbca3885e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1011,6 +1011,7 @@ static void collapse_huge_page(struct mm_struct *mm,
if (mm_find_pmd(mm, address) != pmd)
goto out;
+ vm_write_begin(vma);
anon_vma_lock_write(vma->anon_vma);
pte = pte_offset_map(pmd, address);
@@ -1046,6 +1047,7 @@ static void collapse_huge_page(struct mm_struct *mm,
pmd_populate(mm, pmd, pmd_pgtable(_pmd));
spin_unlock(pmd_ptl);
anon_vma_unlock_write(vma->anon_vma);
+ vm_write_end(vma);
result = SCAN_FAIL;
goto out;
}
@@ -1081,6 +1083,7 @@ static void collapse_huge_page(struct mm_struct *mm,
set_pmd_at(mm, address, pmd, _pmd);
update_mmu_cache_pmd(vma, address, pmd);
spin_unlock(pmd_ptl);
+ vm_write_end(vma);
*hpage = NULL;
diff --git a/mm/madvise.c b/mm/madvise.c
index a692d2a893b5..6cf07dc546fc 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -184,7 +184,9 @@ static long madvise_behavior(struct vm_area_struct *vma,
/*
* vm_flags is protected by the mmap_sem held in write mode.
*/
- vma->vm_flags = new_flags;
+ vm_write_begin(vma);
+ WRITE_ONCE(vma->vm_flags, new_flags);
+ vm_write_end(vma);
out:
return error;
}
@@ -450,9 +452,11 @@ static void madvise_free_page_range(struct mmu_gather *tlb,
.private = tlb,
};
+ vm_write_begin(vma);
tlb_start_vma(tlb, vma);
walk_page_range(addr, end, &free_walk);
tlb_end_vma(tlb, vma);
+ vm_write_end(vma);
}
static int madvise_free_single_vma(struct vm_area_struct *vma,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2219e747df49..94c103c5034a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -380,8 +380,11 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
struct vm_area_struct *vma;
down_write(&mm->mmap_sem);
- for (vma = mm->mmap; vma; vma = vma->vm_next)
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ vm_write_begin(vma);
mpol_rebind_policy(vma->vm_policy, new);
+ vm_write_end(vma);
+ }
up_write(&mm->mmap_sem);
}
@@ -575,9 +578,11 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
{
int nr_updated;
+ vm_write_begin(vma);
nr_updated = change_protection(vma, addr, end, PAGE_NONE, 0, 1);
if (nr_updated)
count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
+ vm_write_end(vma);
return nr_updated;
}
@@ -683,6 +688,7 @@ static int vma_replace_policy(struct vm_area_struct *vma,
if (IS_ERR(new))
return PTR_ERR(new);
+ vm_write_begin(vma);
if (vma->vm_ops && vma->vm_ops->set_policy) {
err = vma->vm_ops->set_policy(vma, new);
if (err)
@@ -690,11 +696,17 @@ static int vma_replace_policy(struct vm_area_struct *vma,
}
old = vma->vm_policy;
- vma->vm_policy = new; /* protected by mmap_sem */
+ /*
+ * The speculative page fault handler accesses this field without
+ * holding the mmap_sem.
+ */
+ WRITE_ONCE(vma->vm_policy, new);
+ vm_write_end(vma);
mpol_put(old);
return 0;
err_out:
+ vm_write_end(vma);
mpol_put(new);
return err;
}
@@ -1654,23 +1666,28 @@ COMPAT_SYSCALL_DEFINE4(migrate_pages, compat_pid_t, pid,
struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
unsigned long addr)
{
- struct mempolicy *pol = NULL;
+ struct mempolicy *pol;
- if (vma) {
- if (vma->vm_ops && vma->vm_ops->get_policy) {
- pol = vma->vm_ops->get_policy(vma, addr);
- } else if (vma->vm_policy) {
- pol = vma->vm_policy;
+ if (!vma)
+ return NULL;
- /*
- * shmem_alloc_page() passes MPOL_F_SHARED policy with
- * a pseudo vma whose vma->vm_ops=NULL. Take a reference
- * count on these policies which will be dropped by
- * mpol_cond_put() later
- */
- if (mpol_needs_cond_ref(pol))
- mpol_get(pol);
- }
+ if (vma->vm_ops && vma->vm_ops->get_policy)
+ return vma->vm_ops->get_policy(vma, addr);
+
+ /*
+ * This could be called without holding the mmap_sem in the
+ * speculative page fault handler's path.
+ */
+ pol = READ_ONCE(vma->vm_policy);
+ if (pol) {
+ /*
+ * shmem_alloc_page() passes MPOL_F_SHARED policy with
+ * a pseudo vma whose vma->vm_ops=NULL. Take a reference
+ * count on these policies which will be dropped by
+ * mpol_cond_put() later
+ */
+ if (mpol_needs_cond_ref(pol))
+ mpol_get(pol);
}
return pol;
diff --git a/mm/mlock.c b/mm/mlock.c
index 080f3b36415b..f390903d9bbb 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -445,7 +445,9 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
void munlock_vma_pages_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
- vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+ vm_write_begin(vma);
+ WRITE_ONCE(vma->vm_flags, vma->vm_flags & VM_LOCKED_CLEAR_MASK);
+ vm_write_end(vma);
while (start < end) {
struct page *page;
@@ -569,10 +571,11 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
* It's okay if try_to_unmap_one unmaps a page just after we
* set VM_LOCKED, populate_vma_page_range will bring it back.
*/
-
- if (lock)
- vma->vm_flags = newflags;
- else
+ if (lock) {
+ vm_write_begin(vma);
+ WRITE_ONCE(vma->vm_flags, newflags);
+ vm_write_end(vma);
+ } else
munlock_vma_pages_range(vma, start, end);
out:
diff --git a/mm/mmap.c b/mm/mmap.c
index a4e4d52a5148..b77ec0149249 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -877,17 +877,18 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
}
if (start != vma->vm_start) {
- vma->vm_start = start;
+ WRITE_ONCE(vma->vm_start, start);
start_changed = true;
}
if (end != vma->vm_end) {
- vma->vm_end = end;
+ WRITE_ONCE(vma->vm_end, end);
end_changed = true;
}
- vma->vm_pgoff = pgoff;
+ WRITE_ONCE(vma->vm_pgoff, pgoff);
if (adjust_next) {
- next->vm_start += adjust_next << PAGE_SHIFT;
- next->vm_pgoff += adjust_next;
+ WRITE_ONCE(next->vm_start,
+ next->vm_start + (adjust_next << PAGE_SHIFT));
+ WRITE_ONCE(next->vm_pgoff, next->vm_pgoff + adjust_next);
}
if (root) {
@@ -1850,12 +1851,14 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
out:
perf_event_mmap(vma);
+ vm_write_begin(vma);
vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
if (vm_flags & VM_LOCKED) {
if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
is_vm_hugetlb_page(vma) ||
vma == get_gate_vma(current->mm))
- vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+ WRITE_ONCE(vma->vm_flags,
+ vma->vm_flags & VM_LOCKED_CLEAR_MASK);
else
mm->locked_vm += (len >> PAGE_SHIFT);
}
@@ -1870,9 +1873,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
* then new mapped in-place (which must be aimed as
* a completely new data area).
*/
- vma->vm_flags |= VM_SOFTDIRTY;
+ WRITE_ONCE(vma->vm_flags, vma->vm_flags | VM_SOFTDIRTY);
vma_set_page_prot(vma);
+ vm_write_end(vma);
return addr;
@@ -2430,7 +2434,9 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
mm->locked_vm += grow;
vm_stat_account(mm, vma->vm_flags, grow);
anon_vma_interval_tree_pre_update_vma(vma);
- vma->vm_end = address;
+ vm_write_begin(vma);
+ WRITE_ONCE(vma->vm_end, address);
+ vm_write_end(vma);
anon_vma_interval_tree_post_update_vma(vma);
if (vma->vm_next)
vma_gap_update(vma->vm_next);
@@ -2510,8 +2516,10 @@ int expand_downwards(struct vm_area_struct *vma,
mm->locked_vm += grow;
vm_stat_account(mm, vma->vm_flags, grow);
anon_vma_interval_tree_pre_update_vma(vma);
- vma->vm_start = address;
- vma->vm_pgoff -= grow;
+ vm_write_begin(vma);
+ WRITE_ONCE(vma->vm_start, address);
+ WRITE_ONCE(vma->vm_pgoff, vma->vm_pgoff - grow);
+ vm_write_end(vma);
anon_vma_interval_tree_post_update_vma(vma);
vma_gap_update(vma);
spin_unlock(&mm->page_table_lock);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 65242f1e4457..78fce873ca3a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -427,12 +427,14 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
* vm_flags and vm_page_prot are protected by the mmap_sem
* held in write mode.
*/
- vma->vm_flags = newflags;
+ vm_write_begin(vma);
+ WRITE_ONCE(vma->vm_flags, newflags);
dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
vma_set_page_prot(vma);
change_protection(vma, start, end, vma->vm_page_prot,
dirty_accountable, 0);
+ vm_write_end(vma);
/*
* Private VM_LOCKED VMA becoming writable: trigger COW to avoid major
diff --git a/mm/swap_state.c b/mm/swap_state.c
index eb714165afd2..c45f9122b457 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -523,7 +523,11 @@ static unsigned long swapin_nr_pages(unsigned long offset)
* This has been extended to use the NUMA policies from the mm triggering
* the readahead.
*
- * Caller must hold read mmap_sem if vmf->vma is not NULL.
+ * Caller must hold down_read on the vma->vm_mm if vmf->vma is not NULL.
+ * This is needed to ensure the VMA will not be freed behind our back. In the case
+ * of the speculative page fault handler, this cannot happen, even if we don't
+ * hold the mmap_sem. Callees are assumed to take care of reading VMA's fields
+ * using READ_ONCE() to read consistent values.
*/
struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
struct vm_fault *vmf)
@@ -624,9 +628,9 @@ static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma,
unsigned long *start,
unsigned long *end)
{
- *start = max3(lpfn, PFN_DOWN(vma->vm_start),
+ *start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)),
PFN_DOWN(faddr & PMD_MASK));
- *end = min3(rpfn, PFN_DOWN(vma->vm_end),
+ *end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)),
PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
}
--
2.21.0
* [PATCH v12 04/31] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
From: Laurent Dufour @ 2019-04-16 13:44 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
From: Mahendran Ganesh <opensource.ganesh@gmail.com>
Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for arm64. This
enables the speculative page fault handler.
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
---
arch/arm64/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 870ef86a64ed..8e86934d598b 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -174,6 +174,7 @@ config ARM64
select SWIOTLB
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
+ select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
help
ARM 64-bit (AArch64) Linux support.
--
2.21.0