public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org,
	Ross Zwisler <ross.zwisler@linux.intel.com>,
	"Slusarz, Marcin" <marcin.slusarz@intel.com>,
	Jan Kara <jack@suse.cz>, Alexander Viro <viro@zeniv.linux.org.uk>,
	Christoph Hellwig <hch@lst.de>,
	Dan Williams <dan.j.williams@intel.com>,
	Dave Chinner <david@fromorbit.com>,
	Matthew Wilcox <mawilcox@microsoft.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 4.12 47/99] dax: fix deadlock due to misaligned PMD faults
Date: Mon, 28 Aug 2017 10:04:45 +0200	[thread overview]
Message-ID: <20170828080457.850515473@linuxfoundation.org> (raw)
In-Reply-To: <20170828080455.968552605@linuxfoundation.org>

4.12-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Ross Zwisler <ross.zwisler@linux.intel.com>

commit fffa281b48a91ad6dac1a18c5907ece58fa3879b upstream.

In DAX there are two separate places where the 2MiB range of a PMD is
defined.

The first is in the page tables, where a PMD mapping inserted for a
given address spans from (vmf->address & PMD_MASK) to ((vmf->address &
PMD_MASK) + PMD_SIZE - 1).  That is, from the 2MiB boundary below the
address to the 2MiB boundary above the address.

So, for example, a fault at address 3MiB (0x30 0000) falls within the
PMD that ranges from 2MiB (0x20 0000) to 4MiB (0x40 0000).

The second PMD range is in the mapping->page_tree, where a given file
offset is covered by a radix tree entry that spans from one 2MiB aligned
file offset to another 2MiB aligned file offset.

So, for example, the file offset for 3MiB (pgoff 768) falls within the
PMD range for the order 9 radix tree entry that ranges from 2MiB (pgoff
512) to 4MiB (pgoff 1024).

This system works so long as the addresses and file offsets for a given
mapping both have the same offsets relative to the start of each PMD.

Consider the case where the starting address for a given file isn't 2MiB
aligned - say our faulting address is 3 MiB (0x30 0000), but that
corresponds to the beginning of our file (pgoff 0).  Now all the PMDs in
the mapping are misaligned so that the 2MiB range defined in the page
tables never matches up with the 2MiB range defined in the radix tree.

The current code notices this case for DAX faults to storage with the
following test in dax_pmd_insert_mapping():

	if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)
		goto unlock_fallback;

This test makes sure that the pfn we get from the driver is 2MiB
aligned, and relies on the assumption that the 2MiB alignment of the pfn
we get back from the driver matches the 2MiB alignment of the faulting
address.

However, faults to holes were not checked and we could hit the problem
described above.

This was reported in response to the NVML nvml/src/test/pmempool_sync
TEST5:

	$ cd nvml/src/test/pmempool_sync
	$ make TEST5

You can grab NVML here:

	https://github.com/pmem/nvml/

The dmesg warning you see when you hit this error is:

  WARNING: CPU: 13 PID: 2900 at fs/dax.c:641 dax_insert_mapping_entry+0x2df/0x310

Where we notice in dax_insert_mapping_entry() that the radix tree entry
we are about to replace doesn't match the locked entry that we had
previously inserted into the tree.  This happens because the initial
insertion was done in grab_mapping_entry() using a pgoff calculated from
the faulting address (vmf->address), and the replacement in
dax_pmd_load_hole() => dax_insert_mapping_entry() is done using
vmf->pgoff.

In our failure case those two page offsets (one calculated from
vmf->address, one using vmf->pgoff) point to different order 9 radix
tree entries.

This failure case can result in a deadlock because the radix tree unlock
also happens on the pgoff calculated from vmf->address.  This means that
the locked radix tree entry that we swapped in to the tree in
dax_insert_mapping_entry() using vmf->pgoff is never unlocked, so all
future faults to that 2MiB range will block forever.

Fix this by validating that the faulting address's PMD offset matches
the PMD offset from the start of the file.  This check is done at the
very beginning of the fault and covers faults that would have mapped to
storage as well as faults to holes.  I left the COLOUR check in
dax_pmd_insert_mapping() in place in case we ever hit the insanity
condition where the alignment of the pfn we get from the driver doesn't
match the alignment of the userspace address.

Link: http://lkml.kernel.org/r/20170822222436.18926-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: "Slusarz, Marcin" <marcin.slusarz@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 fs/dax.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1380,6 +1380,16 @@ static int dax_iomap_pmd_fault(struct vm
 
 	trace_dax_pmd_fault(inode, vmf, max_pgoff, 0);
 
+	/*
+	 * Make sure that the faulting address's PMD offset (color) matches
+	 * the PMD offset from the start of the file.  This is necessary so
+	 * that a PMD range in the page table overlaps exactly with a PMD
+	 * range in the radix tree.
+	 */
+	if ((vmf->pgoff & PG_PMD_COLOUR) !=
+	    ((vmf->address >> PAGE_SHIFT) & PG_PMD_COLOUR))
+		goto fallback;
+
 	/* Fall back to PTEs if we're going to COW */
 	if (write && !(vma->vm_flags & VM_SHARED))
 		goto fallback;

  parent reply	other threads:[~2017-08-28  8:07 UTC|newest]

Thread overview: 102+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-28  8:03 [PATCH 4.12 00/99] 4.12.10-stable review Greg Kroah-Hartman
2017-08-28  8:03 ` [PATCH 4.12 01/99] sparc64: remove unnecessary log message Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 02/99] bonding: require speed/duplex only for 802.3ad, alb and tlb Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 03/99] bonding: ratelimit failed speed/duplex update warning Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 04/99] af_key: do not use GFP_KERNEL in atomic contexts Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 05/99] dccp: purge write queue in dccp_destroy_sock() Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 06/99] dccp: defer ccid_hc_tx_delete() at dismantle time Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 07/99] ipv4: fix NULL dereference in free_fib_info_rcu() Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 08/99] net_sched/sfq: update hierarchical backlog when drop packet Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 09/99] net_sched: remove warning from qdisc_hash_add Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 10/99] bpf: fix bpf_trace_printk on 32 bit archs Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 11/99] net: igmp: Use ingress interface rather than vrf device Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 12/99] openvswitch: fix skb_panic due to the incorrect actions attrlen Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 13/99] ptr_ring: use kmalloc_array() Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 14/99] ipv4: better IP_MAX_MTU enforcement Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 15/99] nfp: fix infinite loop on umapping cleanup Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 16/99] tun: handle register_netdevice() failures properly Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 17/99] sctp: fully initialize the IPv6 address in sctp_v6_to_addr() Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 18/99] tipc: fix use-after-free Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 19/99] ipv6: reset fn->rr_ptr when replacing route Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 20/99] ipv6: repair fib6 tree in failure case Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 21/99] tcp: when rearming RTO, if RTO time is in past then fire RTO ASAP Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 22/99] net/mlx4_core: Enable 4K UAR if SRIOV module parameter is not enabled Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 23/99] irda: do not leak initialized list.dev to userspace Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 24/99] net: sched: fix NULL pointer dereference when action calls some targets Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 25/99] net_sched: fix order of queue length updates in qdisc_replace() Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 26/99] bpf, verifier: add additional patterns to evaluate_reg_imm_alu Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 27/99] bpf: fix mixed signed/unsigned derived min/max value bounds Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 28/99] bpf/verifier: fix min/max handling in BPF_SUB Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 29/99] Input: trackpoint - add new trackpoint firmware ID Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 30/99] Input: elan_i2c - add ELAN0602 ACPI ID to support Lenovo Yoga310 Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 31/99] Input: ALPS - fix two-finger scroll breakage in right side on ALPS touchpad Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 32/99] KVM: s390: sthyi: fix sthyi inline assembly Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 33/99] KVM: s390: sthyi: fix specification exception detection Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 34/99] KVM: x86: simplify handling of PKRU Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 35/99] KVM, pkeys: do not use PKRU value in vcpu->arch.guest_fpu.state Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 36/99] KVM: x86: block guest protection keys unless the host has them enabled Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 37/99] ALSA: usb-audio: Add delay quirk for H650e/Jabra 550a USB headsets Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 38/99] ALSA: core: Fix unexpected error at replacing user TLV Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 39/99] ALSA: hda - Add stereo mic quirk for Lenovo G50-70 (17aa:3978) Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 40/99] ALSA: firewire: fix NULL pointer dereference when releasing uninitialized data of iso-resource Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 41/99] ALSA: firewire-motu: destroy stream data surely at failure of card initialization Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 42/99] ARCv2: SLC: Make sure busy bit is set properly for region ops Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 43/99] ARCv2: PAE40: Explicitly set MSB counterpart of SLC region ops addresses Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 44/99] ARCv2: PAE40: set MSB even if !CONFIG_ARC_HAS_PAE40 but PAE exists in SoC Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 45/99] PM/hibernate: touch NMI watchdog when creating snapshot Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 46/99] mm, shmem: fix handling /sys/kernel/mm/transparent_hugepage/shmem_enabled Greg Kroah-Hartman
2017-08-28  8:04 ` Greg Kroah-Hartman [this message]
2017-08-28  8:04 ` [PATCH 4.12 48/99] i2c: designware: Fix system suspend Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 49/99] mm/madvise.c: fix freeing of locked page with MADV_FREE Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 50/99] fork: fix incorrect fput of ->exe_file causing use-after-free Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 51/99] mm/memblock.c: reversed logic in memblock_discard() Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 52/99] arm64: fpsimd: Prevent registers leaking across exec Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 53/99] drm: Fix framebuffer leak Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 55/99] drm/sun4i: Implement drm_driver lastclose to restore fbdev console Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 56/99] drm/atomic: Handle -EDEADLK with out-fences correctly Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 57/99] drm/atomic: If the atomic check fails, return its value first Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 59/99] drm/i915/gvt: Fix the kernel null pointer error Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 60/99] Revert "drm/amdgpu: fix vblank_time when displays are off" Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.12 61/99] ACPI: device property: Fix node lookup in acpi_graph_get_child_prop_value() Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 62/99] tracing: Call clear_boot_tracer() at lateinit_sync Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 63/99] tracing: Missing error code in tracer_alloc_buffers() Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 64/99] tracing: Fix kmemleak in tracing_map_array_free() Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 65/99] tracing: Fix freeing of filter in create_filter() when set_str is false Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 66/99] RDMA/uverbs: Initialize cq_context appropriately Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 67/99] kbuild: linker script do not match C names unless LD_DEAD_CODE_DATA_ELIMINATION is configured Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 68/99] cifs: Fix df output for users with quota limits Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 69/99] cifs: return ENAMETOOLONG for overlong names in cifs_open()/cifs_lookup() Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 70/99] nfsd: Limit end of page list when decoding NFSv4 WRITE Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 71/99] ring-buffer: Have ring_buffer_alloc_read_page() return error on offline CPU Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 72/99] virtio_pci: fix cpu affinity support Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 73/99] ftrace: Check for null ret_stack on profile function graph entry function Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 74/99] perf/core: Fix group {cpu,task} validation Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 75/99] timers: Fix excessive granularity of new timers after a nohz idle Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 76/99] x86/mm: Fix use-after-free of ldt_struct Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 77/99] net: sunrpc: svcsock: fix NULL-pointer exception Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 78/99] netfilter: expect: fix crash when putting uninited expectation Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 79/99] netfilter: nat: fix src map lookup Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 80/99] netfilter: nfnetlink: Improve input length sanitization in nfnetlink_rcv Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 81/99] Bluetooth: hidp: fix possible might sleep error in hidp_session_thread Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 82/99] Bluetooth: cmtp: fix possible might sleep error in cmtp_session Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 83/99] Bluetooth: bnep: fix possible might sleep error in bnep_session Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 84/99] Revert "android: binder: Sanity check at binder ioctl" Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 85/99] binder: use group leader instead of open thread Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 86/99] binder: Use wake up hint for synchronous transactions Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 87/99] ANDROID: binder: fix proc->tsk check Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 88/99] iio: imu: adis16480: Fix acceleration scale factor for adis16480 Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 89/99] iio: hid-sensor-trigger: Fix the race with user space powering up sensors Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 90/99] iio: magnetometer: st_magn: fix status register address for LSM303AGR Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 91/99] iio: magnetometer: st_magn: remove ihl property " Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 92/99] staging: rtl8188eu: add RNX-N150NUB support Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 93/99] iommu: Fix wrong freeing of iommu_device->dev Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 94/99] Clarify (and fix) MAX_LFS_FILESIZE macros Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 95/99] ntb: ntb_test: ensure the link is up before trying to configure the mws Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 96/99] ntb: transport shouldnt disable link due to bogus values in SPADs Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 97/99] ACPI: APD: Fix HID for Hisilicon Hip07/08 Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 98/99] ACPI: EC: Fix regression related to wrong ECDT initialization order Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.12 99/99] powerpc/mm: Ensure cpumask update is ordered Greg Kroah-Hartman
2017-08-28 19:40 ` [PATCH 4.12 00/99] 4.12.10-stable review Shuah Khan
2017-08-29  4:56   ` Greg Kroah-Hartman
2017-08-29  0:11 ` Guenter Roeck
2017-08-29  4:56   ` Greg Kroah-Hartman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170828080457.850515473@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=akpm@linux-foundation.org \
    --cc=dan.j.williams@intel.com \
    --cc=david@fromorbit.com \
    --cc=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=linux-kernel@vger.kernel.org \
    --cc=marcin.slusarz@intel.com \
    --cc=mawilcox@microsoft.com \
    --cc=ross.zwisler@linux.intel.com \
    --cc=stable@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox