From: Sasha Levin <sashal@kernel.org>
To: stable@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Andrea Arcangeli <aarcange@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Linus Torvalds <torvalds@linux-foundation.org>,
Sasha Levin <sashal@kernel.org>
Subject: [PATCH AUTOSEL 4.19 06/57] userfaultfd: allow get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) to trigger userfaults
Date: Sun, 4 Nov 2018 08:50:53 -0500
Message-ID: <20181104135144.88324-6-sashal@kernel.org>
In-Reply-To: <20181104135144.88324-1-sashal@kernel.org>
From: Andrea Arcangeli <aarcange@redhat.com>
[ Upstream commit 3b9aadf7278d16d7bed4d5d808501065f70898d8 ]
get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) called get_user_pages(), which
does not wait for userfaults before failing, so the fault ended in a
SIGBUS instead. Using get_user_pages_locked/unlocked instead lets
userfaults resolve the fault and fill the hole before get_mempolicy
grabs the node id of the page.
If the user calls get_mempolicy() with MPOL_F_ADDR | MPOL_F_NODE for an
address inside an area managed by uffd and there is no page at that
address, the page allocation from within get_mempolicy() will fail
because get_user_pages() does not allow for page fault retry required
for uffd; the user will get SIGBUS.
With this patch, the page fault will be resolved by the uffd and the
get_mempolicy() will continue normally.
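For illustration, the call that hits this path can be sketched from
user space as below. This is a minimal sketch, not code from the
reporter: node_of_address() is a hypothetical helper, the raw syscall is
used to avoid depending on libnuma, and the two flag values are copied
from the UAPI header linux/mempolicy.h. No userfaultfd region is set up
here; the address is assumed to be already populated.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Flag values as defined in the UAPI header <linux/mempolicy.h>. */
#ifndef MPOL_F_NODE
#define MPOL_F_NODE (1 << 0)	/* return a node id instead of a policy */
#endif
#ifndef MPOL_F_ADDR
#define MPOL_F_ADDR (1 << 1)	/* look up the policy/page at addr */
#endif

/*
 * Hypothetical helper: return the NUMA node backing the page at addr,
 * or -1 with errno set. With MPOL_F_NODE | MPOL_F_ADDR the kernel
 * faults the page in if needed and stores the node id in the first
 * argument.
 */
static long node_of_address(void *addr)
{
	int node = -1;
	long ret;

	ret = syscall(SYS_get_mempolicy, &node, NULL, 0UL, addr,
		      (unsigned long)(MPOL_F_NODE | MPOL_F_ADDR));
	if (ret < 0)
		return -1;
	return node;
}
```

If addr falls in a missing page of a userfaultfd-registered range, this
is the call that previously failed and, with the patch, waits for the
fault to be resolved.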
Background:
Via code review, previously the syscall would have returned -EFAULT
(vm_fault_to_errno); now it will block and wait for a userfault (if
it is woken before the fault is resolved it will still return -EFAULT).
This way get_mempolicy will give a chance to an "unaware" app to be
compliant with userfaults.
This visible change is acceptable because becoming "userfault
compliant" cannot regress anything: all other syscalls, including
read(2)/write(2), had to become "userfault compliant" a long time ago
(that's one of the things userfaultfd can do that PROT_NONE and
trapping segfaults can't). So this is just one more syscall that
becomes "userfault compliant", like all other major ones already were.
This has been happening in a virtio-bridge dpdk process, which called
get_mempolicy on the guest space after live migration but before the
memory had a chance to be migrated to the destination.
I didn't run an strace to show the -EFAULT going away, but I have
confirmation that the debug aid output below (only visible with
CONFIG_DEBUG_VM=y) goes away with the patch:
[20116.371461] FAULT_FLAG_ALLOW_RETRY missing 0
[20116.371464] CPU: 1 PID: 13381 Comm: vhost-events Not tainted 4.17.12-200.fc28.x86_64 #1
[20116.371465] Hardware name: LENOVO 20FAS2BN0A/20FAS2BN0A, BIOS N1CET54W (1.22 ) 02/10/2017
[20116.371466] Call Trace:
[20116.371473] dump_stack+0x5c/0x80
[20116.371476] handle_userfault.cold.37+0x1b/0x22
[20116.371479] ? remove_wait_queue+0x20/0x60
[20116.371481] ? poll_freewait+0x45/0xa0
[20116.371483] ? do_sys_poll+0x31c/0x520
[20116.371485] ? radix_tree_lookup_slot+0x1e/0x50
[20116.371488] shmem_getpage_gfp+0xce7/0xe50
[20116.371491] ? page_add_file_rmap+0x1a/0x2c0
[20116.371493] shmem_fault+0x78/0x1e0
[20116.371495] ? filemap_map_pages+0x3a1/0x450
[20116.371498] __do_fault+0x1f/0xc0
[20116.371500] __handle_mm_fault+0xe2e/0x12f0
[20116.371502] handle_mm_fault+0xda/0x200
[20116.371504] __get_user_pages+0x238/0x790
[20116.371506] get_user_pages+0x3e/0x50
[20116.371510] kernel_get_mempolicy+0x40b/0x700
[20116.371512] ? vfs_write+0x170/0x1a0
[20116.371515] __x64_sys_get_mempolicy+0x21/0x30
[20116.371517] do_syscall_64+0x5b/0x160
[20116.371520] entry_SYSCALL_64_after_hwframe+0x44/0xa9
The above harmless debug message (not a kernel crash, just a
dump_stack()) is shown with CONFIG_DEBUG_VM=y to more quickly identify
and improve kernel spots that may have to become "userfaultfd
compliant" like this one (without having to run an strace and search
for syscall misbehavior). Spots like the above are closer to a kernel
bug for the non-cooperative usages that Mike focuses on than for the
dpdk qemu-cooperative usages that reproduced it, but it's still nicer
to get this fixed for dpdk too.
The only part of the patch that gave me pause is the implementation of
mpol_get, but it looks like it should work safely no matter what kind
of mempolicy structure it is (the default static policy's refcount also
starts at 1, so it will go to 2 and back to 1 without ever reaching 0
and crashing everything).
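The refcount argument can be checked with a toy model. This is only a
sketch under stated assumptions: struct toy_policy, toy_get(), toy_put()
and toy_lookup_cycle() are hypothetical stand-ins for struct mempolicy,
mpol_get(), mpol_put() and the lookup_node() call site, and a plain int
is used where the kernel uses an atomic counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct mempolicy; only the refcount is modeled. */
struct toy_policy {
	int refcnt;	/* the kernel uses an atomic_t here */
	bool freed;	/* set when the last reference is dropped */
};

static void toy_get(struct toy_policy *p)
{
	p->refcnt++;
}

static void toy_put(struct toy_policy *p)
{
	if (--p->refcnt == 0)
		p->freed = true;	/* the real code would free the policy */
}

/*
 * Same ordering as the patch: take a reference, do the lookup that may
 * drop the lock, then release the reference. Returns the final count.
 */
static int toy_lookup_cycle(struct toy_policy *p)
{
	toy_get(p);	/* 1 -> 2 for a policy that starts at 1 */
	/* lookup_node() would run here and may drop mmap_sem */
	toy_put(p);	/* 2 -> 1, never reaching 0 */
	return p->refcnt;
}
```

A policy whose count starts at 1 ends the cycle at 1 and is never
marked freed, which is the property the paragraph above relies on.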
[rppt@linux.vnet.ibm.com: changelog addition]
http://lkml.kernel.org/r/20180904073718.GA26916@rapoport-lnx
Link: http://lkml.kernel.org/r/20180831214848.23676-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Tested-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
mm/mempolicy.c | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index da858f794eb6..2e76a8f65e94 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -797,16 +797,19 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
}
}
-static int lookup_node(unsigned long addr)
+static int lookup_node(struct mm_struct *mm, unsigned long addr)
{
struct page *p;
int err;
- err = get_user_pages(addr & PAGE_MASK, 1, 0, &p, NULL);
+ int locked = 1;
+ err = get_user_pages_locked(addr & PAGE_MASK, 1, 0, &p, &locked);
if (err >= 0) {
err = page_to_nid(p);
put_page(p);
}
+ if (locked)
+ up_read(&mm->mmap_sem);
return err;
}
@@ -817,7 +820,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
int err;
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma = NULL;
- struct mempolicy *pol = current->mempolicy;
+ struct mempolicy *pol = current->mempolicy, *pol_refcount = NULL;
if (flags &
~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR|MPOL_F_MEMS_ALLOWED))
@@ -857,7 +860,16 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
if (flags & MPOL_F_NODE) {
if (flags & MPOL_F_ADDR) {
- err = lookup_node(addr);
+ /*
+ * Take a refcount on the mpol, lookup_node()
+ * will drop the mmap_sem, so after calling
+ * lookup_node() only "pol" remains valid, "vma"
+ * is stale.
+ */
+ pol_refcount = pol;
+ vma = NULL;
+ mpol_get(pol);
+ err = lookup_node(mm, addr);
if (err < 0)
goto out;
*policy = err;
@@ -892,7 +904,9 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
out:
mpol_cond_put(pol);
if (vma)
- up_read(&current->mm->mmap_sem);
+ up_read(&mm->mmap_sem);
+ if (pol_refcount)
+ mpol_put(pol_refcount);
return err;
}
--
2.17.1