From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Nadav Amit <nadav.amit@gmail.com>,
	Mel Gorman <mgorman@suse.de>, Andy Lutomirski <luto@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 4.4 18/58] mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries
Date: Wed,  9 Aug 2017 12:41:30 -0700
Message-ID: <20170809194147.234463750@linuxfoundation.org>
In-Reply-To: <20170809194146.501519882@linuxfoundation.org>

4.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Mel Gorman <mgorman@suse.de>

commit 3ea277194daaeaa84ce75180ec7c7a2075027a68 upstream.

Stable note for 4.4: The upstream patch patches madvise(MADV_FREE) but 4.4
	does not have support for that feature. The changelog is left
	as-is but the hunk related to madvise is omitted from the backport.

Nadav Amit identified a theoretical race between page reclaim and
mprotect, caused by TLB flushes being batched outside of the PTL.

He described the race as follows:

        CPU0                            CPU1
        ----                            ----
                                        user accesses memory using RW PTE
                                        [PTE now cached in TLB]
        try_to_unmap_one()
        ==> ptep_get_and_clear()
        ==> set_tlb_ubc_flush_pending()
                                        mprotect(addr, PROT_READ)
                                        ==> change_pte_range()
                                        ==> [ PTE non-present - no flush ]

                                        user writes using cached RW PTE
        ...

        try_to_unmap_flush()

The same type of race exists for reads when protecting for PROT_NONE and
also exists for operations that can leave an old TLB entry behind such
as munmap, mremap and madvise.
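
To make the interleaving above concrete, here is a minimal single-threaded
userspace sketch in C that replays it step by step. It is not kernel code
and makes no claim about how the kernel implements any of this: every
toy_* name, the one-entry TLB and the flush_deferred flag are hypothetical
stand-ins chosen for illustration, and the PTL is left out because the
steps run in a fixed order.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (all names hypothetical): one page, one PTE, one TLB entry. */
struct toy_pte { bool present; bool writable; };
struct toy_tlb { bool valid; bool writable; };

struct toy_state {
	struct toy_pte pte;
	struct toy_tlb cpu1_tlb;	/* CPU1's cached translation */
	bool flush_deferred;		/* reclaim still owes a TLB flush */
};

/* CPU1: user write. Fills the TLB on a miss; returns true if the write
 * is allowed, false if it would fault. */
static bool toy_user_write(struct toy_state *s)
{
	if (!s->cpu1_tlb.valid) {
		if (!s->pte.present)
			return false;
		s->cpu1_tlb.valid = true;
		s->cpu1_tlb.writable = s->pte.writable;
	}
	return s->cpu1_tlb.writable;
}

/* CPU0: reclaim clears the PTE but only records that a flush is owed. */
static void toy_try_to_unmap_one(struct toy_state *s)
{
	s->pte.present = false;
	s->pte.writable = false;
	s->flush_deferred = true;
}

/* CPU1: mprotect(PROT_READ) without the fix. A non-present PTE is
 * skipped, so no flush is issued for it. */
static void toy_mprotect_read_unpatched(struct toy_state *s)
{
	if (s->pte.present) {
		s->pte.writable = false;
		s->cpu1_tlb.valid = false;
	}
}

/* CPU0: the deferred flush finally happens. */
static void toy_try_to_unmap_flush(struct toy_state *s)
{
	if (s->flush_deferred) {
		s->cpu1_tlb.valid = false;
		s->flush_deferred = false;
	}
}

/* Replays the diagram; returns true if the write issued after mprotect
 * still went through the stale RW TLB entry. */
static bool toy_replay_race(void)
{
	struct toy_state s = { .pte = { true, true } };
	bool first = toy_user_write(&s);	/* RW PTE now cached */
	bool stale_write;

	assert(first);
	toy_try_to_unmap_one(&s);
	toy_mprotect_read_unpatched(&s);
	stale_write = toy_user_write(&s);	/* uses cached RW entry */
	toy_try_to_unmap_flush(&s);
	return stale_write;
}
```

In this model toy_replay_race() reports that the write after mprotect is
still permitted, which is the window described above.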

For some operations like mprotect, it's not necessarily a data integrity
issue but it is a correctness issue, as there is a window where an
mprotect that limits access still allows access.  For munmap, it's
potentially a data integrity issue, although the race is very hard to
hit: a munmap, mmap and return to userspace must all complete in the
window between reclaim dropping the PTL and flushing the TLB.  However,
it's theoretically possible, so handle this issue by flushing the mm if
reclaim is potentially currently batching TLB flushes.
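
As a rough illustration of that approach, the following userspace C sketch
models only the flag handling: reclaim marks the mm when it defers a flush,
and the next at-risk operation flushes once and clears the mark. The field
name tlb_flush_batched matches the patch; everything else (struct toy_mm,
the mm_flushes counter, the toy_* helpers) is hypothetical. The sketch is
single-threaded, so the PTL and the barrier() calls the real code relies on
are deliberately omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of struct mm_struct. */
struct toy_mm {
	bool tlb_flush_batched;		/* same name as the field in the patch */
	unsigned int mm_flushes;	/* counts simulated flush_tlb_mm() calls */
};

/* Reclaim side: note that a flush for this mm has been deferred. */
static void toy_set_flush_pending(struct toy_mm *mm)
{
	assert(mm);
	mm->tlb_flush_batched = true;
}

/* mprotect/munmap/mremap side: flush only if reclaim deferred one. */
static void toy_flush_tlb_batched_pending(struct toy_mm *mm)
{
	assert(mm);
	if (mm->tlb_flush_batched) {
		mm->mm_flushes++;	/* stands in for flush_tlb_mm(mm) */
		mm->tlb_flush_batched = false;
	}
}

/* Runs `unmaps` deferred unmaps followed by `at_risk_ops` at-risk
 * operations and returns how many mm-wide flushes were issued. */
static unsigned int toy_count_flushes(unsigned int unmaps,
				      unsigned int at_risk_ops)
{
	struct toy_mm mm = { false, 0 };
	unsigned int i;

	for (i = 0; i < unmaps; i++)
		toy_set_flush_pending(&mm);
	for (i = 0; i < at_risk_ops; i++)
		toy_flush_tlb_batched_pending(&mm);
	return mm.mm_flushes;
}
```

Under these assumptions, any number of deferred unmaps followed by any
number of at-risk operations costs a single extra flush, and an mm that
reclaim never touched costs none.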

Other instances where a flush is required for a present pte should be ok
as either the page lock is held preventing parallel reclaim or a page
reference count is elevated preventing a parallel free leading to
corruption.  In the case of page_mkclean there isn't an obvious path
that userspace could take advantage of without using the operations that
are guarded by this patch.  Other users such as gup should be ok, as a
race with reclaim only looks at PTEs.  Huge page variants should be ok as
they don't race with reclaim.  mincore only looks at PTEs.  userfault
also should be ok: if a parallel reclaim takes place, it will either
fault the page back in or read some of the data before the flush occurs,
triggering a fault.

Note that a variant of this patch was acked by Andy Lutomirski but this
was for the x86 parts on top of his PCID work which didn't make the 4.13
merge window as expected.  His ack is dropped from this version and
there will be a follow-on patch on top of PCID that will include his
ack.

[akpm@linux-foundation.org: tweak comments]
[akpm@linux-foundation.org: fix spello]
Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 include/linux/mm_types.h |    4 ++++
 mm/internal.h            |    5 ++++-
 mm/memory.c              |    1 +
 mm/mprotect.c            |    1 +
 mm/mremap.c              |    1 +
 mm/rmap.c                |   36 ++++++++++++++++++++++++++++++++++++
 6 files changed, 47 insertions(+), 1 deletion(-)

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -504,6 +504,10 @@ struct mm_struct {
 	 */
 	bool tlb_flush_pending;
 #endif
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+	/* See flush_tlb_batched_pending() */
+	bool tlb_flush_batched;
+#endif
 	struct uprobes_state uprobes_state;
 #ifdef CONFIG_X86_INTEL_MPX
 	/* address of the bounds directory */
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -453,6 +453,7 @@ struct tlbflush_unmap_batch;
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 void try_to_unmap_flush(void);
 void try_to_unmap_flush_dirty(void);
+void flush_tlb_batched_pending(struct mm_struct *mm);
 #else
 static inline void try_to_unmap_flush(void)
 {
@@ -460,6 +461,8 @@ static inline void try_to_unmap_flush(vo
 static inline void try_to_unmap_flush_dirty(void)
 {
 }
-
+static inline void flush_tlb_batched_pending(struct mm_struct *mm)
+{
+}
 #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
 #endif	/* __MM_INTERNAL_H */
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1127,6 +1127,7 @@ again:
 	init_rss_vec(rss);
 	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	pte = start_pte;
+	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
 	do {
 		pte_t ptent = *pte;
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -72,6 +72,7 @@ static unsigned long change_pte_range(st
 	if (!pte)
 		return 0;
 
+	flush_tlb_batched_pending(vma->vm_mm);
 	arch_enter_lazy_mmu_mode();
 	do {
 		oldpte = *pte;
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -135,6 +135,7 @@ static void move_ptes(struct vm_area_str
 	new_ptl = pte_lockptr(mm, new_pmd);
 	if (new_ptl != old_ptl)
 		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+	flush_tlb_batched_pending(vma->vm_mm);
 	arch_enter_lazy_mmu_mode();
 
 	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -649,6 +649,13 @@ static void set_tlb_ubc_flush_pending(st
 	tlb_ubc->flush_required = true;
 
 	/*
+	 * Ensure compiler does not re-order the setting of tlb_flush_batched
+	 * before the PTE is cleared.
+	 */
+	barrier();
+	mm->tlb_flush_batched = true;
+
+	/*
 	 * If the PTE was dirty then it's best to assume it's writable. The
 	 * caller must use try_to_unmap_flush_dirty() or try_to_unmap_flush()
 	 * before the page is queued for IO.
@@ -675,6 +682,35 @@ static bool should_defer_flush(struct mm
 
 	return should_defer;
 }
+
+/*
+ * Reclaim unmaps pages under the PTL but does not flush the TLB prior to
+ * releasing the PTL if TLB flushes are batched. It's possible for a parallel
+ * operation such as mprotect or munmap to race between reclaim unmapping
+ * the page and flushing the page. If this race occurs, it potentially allows
+ * access to data via a stale TLB entry. Tracking all mm's that have TLB
+ * batching in flight would be expensive during reclaim so instead track
+ * whether TLB batching occurred in the past and if so then do a flush here
+ * if required. This will cost one additional flush per reclaim cycle paid
+ * by the first operation at risk such as mprotect and munmap.
+ *
+ * This must be called under the PTL so that an access to tlb_flush_batched
+ * that is potentially a "reclaim vs mprotect/munmap/etc" race will synchronise
+ * via the PTL.
+ */
+void flush_tlb_batched_pending(struct mm_struct *mm)
+{
+	if (mm->tlb_flush_batched) {
+		flush_tlb_mm(mm);
+
+		/*
+		 * Do not allow the compiler to re-order the clearing of
+		 * tlb_flush_batched before the tlb is flushed.
+		 */
+		barrier();
+		mm->tlb_flush_batched = false;
+	}
+}
 #else
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
 		struct page *page, bool writable)
