From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
patches@lists.linux.dev, Jann Horn <jannh@google.com>,
Yang Shi <shy828301@gmail.com>,
David Hildenbrand <david@redhat.com>,
John Hubbard <jhubbard@nvidia.com>, Peter Xu <peterx@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Sasha Levin <sashal@kernel.org>
Subject: [PATCH 5.4 17/67] mm/khugepaged: take the right locks for page table retraction
Date: Mon, 12 Dec 2022 14:16:52 +0100
Message-ID: <20221212130918.447051134@linuxfoundation.org>
In-Reply-To: <20221212130917.599345531@linuxfoundation.org>

From: Jann Horn <jannh@google.com>

commit 8d3c106e19e8d251da31ff4cc7462e4565d65084 upstream.

Page table walks on address ranges mapped by VMAs can be done under the
mmap lock, the lock of an anon_vma attached to the VMA, or the lock of the
VMA's address_space. Only one of these needs to be held, and it does not
need to be held in exclusive mode.

Under those circumstances, the rules for concurrent access to page table
entries are:

 - Terminal page table entries (entries that don't point to another page
   table) can be arbitrarily changed under the page table lock, with the
   exception that they always need to be consistent for
   hardware page table walks and lockless_pages_from_mm().
   This includes changing them into non-terminal entries.
 - Non-terminal page table entries (which point to another page table)
   cannot be modified; readers are allowed to READ_ONCE() an entry, verify
   that it is non-terminal, and then assume that its value will stay as-is.
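
As a rough illustration of the reader-side rule for non-terminal entries,
here is a minimal userspace sketch. It is a toy model and not kernel code:
the toy_* names, the low-bit tag, and the four-slot table are all
assumptions made up for this example, and a relaxed C11 atomic load stands
in for READ_ONCE().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: every name here is hypothetical, none is a kernel API. */
#define TOY_SLOTS	4
#define TOY_ENTRY_TABLE	((uintptr_t)1)	/* low bit set: entry points to a lower table */

struct toy_table {
	uint64_t slots[TOY_SLOTS];	/* stands in for the terminal entries */
};

typedef _Atomic uintptr_t toy_entry_t;	/* one upper-level entry */

static uintptr_t toy_make_table_entry(struct toy_table *t)
{
	/* struct toy_table is at least 8-byte aligned, so bit 0 is free */
	return (uintptr_t)t | TOY_ENTRY_TABLE;
}

static bool toy_entry_is_table(uintptr_t e)
{
	return (e & TOY_ENTRY_TABLE) != 0;
}

/*
 * Reader side: snapshot the upper entry exactly once, verify that it is
 * non-terminal, then rely on that snapshot for the rest of the walk.
 * This is only safe if nobody modifies a non-terminal entry while a
 * reader may be in here, which is the rule described above.
 */
static bool toy_lookup(toy_entry_t *upper, size_t idx, uint64_t *out)
{
	uintptr_t e = atomic_load_explicit(upper, memory_order_relaxed);

	if (idx >= TOY_SLOTS || !toy_entry_is_table(e))
		return false;

	*out = ((struct toy_table *)(e & ~TOY_ENTRY_TABLE))->slots[idx];
	return true;
}
```

The reader never re-reads the upper entry after the check. If the entry
could be cleared and the lower table freed between the check and the
dereference, the reader would be left using a stale pointer.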

Retracting a page table involves modifying a non-terminal entry, so
page-table-level locks are insufficient to protect against concurrent page
table traversal; it requires taking, in exclusive mode, all the
higher-level locks under which it is possible to start a page walk in the
relevant range.

The collapse_huge_page() path for anonymous THP already follows this rule,
but the shmem/file THP path was getting it wrong, making it possible for
concurrent rmap-based operations to cause corruption.

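
The retraction rule can be sketched with a toy, single-threaded lock model.
Everything below is hypothetical illustration (the toy_* names and the
counter-based "rwlock" are assumptions, not kernel primitives), and the
acquisition order shown is arbitrary rather than the kernel's lock order.
Note also that the patch itself avoids needing the anon_vma lock by
skipping VMAs that have an anon_vma, whereas this sketch simply takes all
three locks.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: counts holders, provides no real synchronization. */
struct toy_rwlock {
	int readers;	/* number of shared holders */
	bool writer;	/* true while held exclusively */
};

static bool toy_read_trylock(struct toy_rwlock *l)
{
	if (l->writer)
		return false;
	l->readers++;
	return true;
}

static void toy_read_unlock(struct toy_rwlock *l)
{
	assert(l->readers > 0);
	l->readers--;
}

static bool toy_write_trylock(struct toy_rwlock *l)
{
	if (l->writer || l->readers > 0)
		return false;
	l->writer = true;
	return true;
}

static void toy_write_unlock(struct toy_rwlock *l)
{
	assert(l->writer);
	l->writer = false;
}

/* The three locks under which a page table walk may start. */
struct toy_walk_locks {
	struct toy_rwlock mmap_lock;
	struct toy_rwlock anon_vma_lock;
	struct toy_rwlock mapping_lock;
};

/*
 * Retraction needs every one of these held exclusively, because a walker
 * only needs any single one of them in shared mode. Back out fully if any
 * of them cannot be taken.
 */
static bool toy_retract_trylock(struct toy_walk_locks *w)
{
	if (!toy_write_trylock(&w->mmap_lock))
		return false;
	if (!toy_write_trylock(&w->anon_vma_lock))
		goto out_mmap;
	if (!toy_write_trylock(&w->mapping_lock))
		goto out_anon;
	return true;

out_anon:
	toy_write_unlock(&w->anon_vma_lock);
out_mmap:
	toy_write_unlock(&w->mmap_lock);
	return false;
}

static void toy_retract_unlock(struct toy_walk_locks *w)
{
	toy_write_unlock(&w->mapping_lock);
	toy_write_unlock(&w->anon_vma_lock);
	toy_write_unlock(&w->mmap_lock);
}
```

In this model, a walker holding only mapping_lock in shared mode (the
analogue of an rmap walk) is enough to make toy_retract_trylock() fail,
which is the exclusion the shmem/file path was missing.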
Link: https://lkml.kernel.org/r/20221129154730.2274278-1-jannh@google.com
Link: https://lkml.kernel.org/r/20221128180252.1684965-1-jannh@google.com
Link: https://lkml.kernel.org/r/20221125213714.4115729-1-jannh@google.com
Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[manual backport: this code was refactored from two copies into a common
helper between 5.15 and 6.0]
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
mm/khugepaged.c | 31 ++++++++++++++++++++++++++-----
1 file changed, 26 insertions(+), 5 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3c2326568193..55631cd73939 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1326,6 +1326,14 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 	if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
 		return;
 
+	/*
+	 * Symmetry with retract_page_tables(): Exclude MAP_PRIVATE mappings
+	 * that got written to. Without this, we'd have to also lock the
+	 * anon_vma if one exists.
+	 */
+	if (vma->anon_vma)
+		return;
+
 	hpage = find_lock_page(vma->vm_file->f_mapping,
 			       linear_page_index(vma, haddr));
 	if (!hpage)
@@ -1338,6 +1346,19 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 	if (!pmd)
 		goto drop_hpage;
 
+	/*
+	 * We need to lock the mapping so that from here on, only GUP-fast and
+	 * hardware page walks can access the parts of the page tables that
+	 * we're operating on.
+	 */
+	i_mmap_lock_write(vma->vm_file->f_mapping);
+
+	/*
+	 * This spinlock should be unnecessary: Nobody else should be accessing
+	 * the page tables under spinlock protection here, only
+	 * lockless_pages_from_mm() and the hardware page walker can access page
+	 * tables while all the high-level locks are held in write mode.
+	 */
 	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
 
 	/* step 1: check all mapped PTEs are to the right huge page */
@@ -1384,12 +1405,12 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 	}
 
 	/* step 4: collapse pmd */
-	ptl = pmd_lock(vma->vm_mm, pmd);
 	_pmd = pmdp_collapse_flush(vma, haddr, pmd);
-	spin_unlock(ptl);
 	mm_dec_nr_ptes(mm);
 	pte_free(mm, pmd_pgtable(_pmd));
 
+	i_mmap_unlock_write(vma->vm_file->f_mapping);
+
 drop_hpage:
 	unlock_page(hpage);
 	put_page(hpage);
@@ -1397,6 +1418,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 
 abort:
 	pte_unmap_unlock(start_pte, ptl);
+	i_mmap_unlock_write(vma->vm_file->f_mapping);
 	goto drop_hpage;
 }
 
@@ -1446,7 +1468,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		 * An alternative would be drop the check, but check that page
 		 * table is clear before calling pmdp_collapse_flush() under
 		 * ptl. It has higher chance to recover THP for the VMA, but
-		 * has higher cost too.
+		 * has higher cost too. It would also probably require locking
+		 * the anon_vma.
 		 */
 		if (vma->anon_vma)
 			continue;
@@ -1468,10 +1491,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		 */
 		if (down_write_trylock(&mm->mmap_sem)) {
 			if (!khugepaged_test_exit(mm)) {
-				spinlock_t *ptl = pmd_lock(mm, pmd);
 				/* assume page table is clear */
 				_pmd = pmdp_collapse_flush(vma, addr, pmd);
-				spin_unlock(ptl);
 				mm_dec_nr_ptes(mm);
 				pte_free(mm, pmd_pgtable(_pmd));
 			}
--
2.35.1