From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
patches@lists.linux.dev, Hugh Dickins <hughd@google.com>,
David Hildenbrand <david@redhat.com>,
Yang Shi <shy828301@gmail.com>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
Barry Song <baohua@kernel.org>, Chris Li <chrisl@kernel.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Kefeng Wang <wangkefeng.wang@huawei.com>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
"Matthew Wilcox (Oracle)" <willy@infradead.org>,
Nhat Pham <nphamcs@gmail.com>,
Ryan Roberts <ryan.roberts@arm.com>,
Shakeel Butt <shakeel.butt@linux.dev>,
Usama Arif <usamaarif642@gmail.com>,
Wei Yang <richard.weiyang@gmail.com>, Zi Yan <ziy@nvidia.com>,
Andrew Morton <akpm@linux-foundation.org>
Subject: [PATCH 6.6 47/48] mm/thp: fix deferred split unqueue naming and locking
Date: Fri, 15 Nov 2024 07:38:36 +0100
Message-ID: <20241115063724.659792191@linuxfoundation.org>
In-Reply-To: <20241115063722.962047137@linuxfoundation.org>
6.6-stable review patch. If anyone has any objections, please let me know.
------------------
From: Hugh Dickins <hughd@google.com>
commit f8f931bba0f92052cf842b7e30917b1afcc77d5a upstream.
Recent changes are putting more pressure on THP deferred split queues:
under load revealing long-standing races, causing list_del corruptions,
"Bad page state"s and worse (I keep BUGs in both of those, so usually
don't get to see how badly they end up without). The relevant recent
changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
improved swap allocation, and underused THP splitting.
Before fixing locking: rename misleading folio_undo_large_rmappable(),
which does not undo large_rmappable, to folio_unqueue_deferred_split(),
which is what it does. But that and its out-of-line __callee are mm
internals of very limited usability: add comment and WARN_ON_ONCEs to
check usage; and return a bool to say if a deferred split was unqueued,
which can then be used in WARN_ON_ONCEs around safety checks (sparing
callers the arcane conditionals in __folio_unqueue_deferred_split()).
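
As an illustration only (not part of the patch), the "return a bool" idea can be sketched in userspace C. Everything below is a hypothetical stand-in: struct toy_queue and struct toy_folio model struct deferred_split and struct folio, the list helpers imitate the kernel's list_head, and the split_queue_lock is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, imitating the kernel's list_head. */
struct toy_list {
	struct toy_list *prev, *next;
};

static void toy_list_init(struct toy_list *n)
{
	n->prev = n->next = n;
}

static bool toy_list_empty(const struct toy_list *n)
{
	return n->next == n;
}

static void toy_list_add_tail(struct toy_list *n, struct toy_list *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void toy_list_del_init(struct toy_list *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	toy_list_init(n);
}

/* Hypothetical stand-ins for struct deferred_split and struct folio. */
struct toy_queue {
	struct toy_list head;
	unsigned long len;
};

struct toy_folio {
	struct toy_list deferred;
};

/* Queue the folio unless it is already queued. */
static void toy_queue_folio(struct toy_queue *q, struct toy_folio *f)
{
	if (toy_list_empty(&f->deferred)) {
		toy_list_add_tail(&f->deferred, &q->head);
		q->len++;
	}
}

/* Returns true only if the folio really was on the queue. */
static bool toy_unqueue_folio(struct toy_queue *q, struct toy_folio *f)
{
	bool unqueued = false;

	if (!toy_list_empty(&f->deferred)) {
		q->len--;
		toy_list_del_init(&f->deferred);
		unqueued = true;
	}
	return unqueued;
}
```

A caller that believes the folio can no longer be queued may wrap the call in a check on the return value, much as the patch wraps folio_unqueue_deferred_split() in WARN_ON_ONCE().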
Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all
of whose callers now call it beforehand (and if any forget then bad_page()
will tell) - except for its caller put_pages_list(), which itself no
longer has any callers (and will be deleted separately).
Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data to 0
without checking and unqueueing a THP folio from deferred split list;
which is unfortunate, since the split_queue_lock depends on the memcg
(when memcg is enabled); so swapout has been unqueueing such THPs later,
when freeing the folio, using the pgdat's lock instead: potentially
corrupting the memcg's list. __remove_mapping() has frozen refcount to 0
here, so no problem with calling folio_unqueue_deferred_split() before
resetting memcg_data.
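
The ordering problem can be modelled in a few lines of userspace C. This is a rough sketch under loud assumptions: every toy_* name is hypothetical, a per-queue length counter stands in for the list and its split_queue_lock, and a global toy_node_queue stands in for the pgdat's queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* In the kernel the queue also carries split_queue_lock; here only a count. */
struct toy_queue {
	long len;
};

struct toy_memcg {
	struct toy_queue queue;
};

struct toy_folio {
	struct toy_memcg *memcg;	/* models folio->memcg_data */
	bool queued;			/* models a non-empty _deferred_list */
};

/* Models the pgdat's queue, used when the folio has no memcg. */
static struct toy_queue toy_node_queue;

/* The queue (and so the lock) is chosen from the folio's current memcg. */
static struct toy_queue *toy_get_queue(struct toy_folio *f)
{
	return f->memcg ? &f->memcg->queue : &toy_node_queue;
}

static void toy_queue_folio(struct toy_folio *f)
{
	if (!f->queued) {
		toy_get_queue(f)->len++;
		f->queued = true;
	}
}

static bool toy_unqueue_folio(struct toy_folio *f)
{
	if (!f->queued)
		return false;
	toy_get_queue(f)->len--;
	f->queued = false;
	return true;
}

/* Old ordering: memcg is cleared first, so the later unqueue picks the
 * node queue although the folio sits on the memcg's queue. */
static void toy_swapout_old(struct toy_folio *f)
{
	f->memcg = NULL;
	toy_unqueue_folio(f);
}

/* Fixed ordering: unqueue while the memcg still identifies the queue. */
static void toy_swapout_fixed(struct toy_folio *f)
{
	toy_unqueue_folio(f);
	f->memcg = NULL;
}
```

With the old ordering the memcg's count stays stale and the node's count goes negative; in the kernel the analogous mistake is modifying the memcg's list while holding the wrong lock.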
That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
shrinker memcg aware"): which included a check on swapcache before adding
to deferred queue, but no check on deferred queue before adding THP to
swapcache. That worked fine with the usual sequence of events in reclaim
(though there were a couple of rare ways in which a THP on deferred queue
could have been swapped out), but 6.12 commit dafff3f4c850 ("mm: split
underused THPs") avoids splitting underused THPs in reclaim, which makes
swapcache THPs on deferred queue commonplace.
Keep the check on swapcache before adding to deferred queue? Yes: it is
no longer essential, but preserves the existing behaviour, and is likely
to be a worthwhile optimization (vmstat showed much more traffic on the
queue under swapping load if the check was removed); update its comment.
Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
folio->memcg_data without checking and unqueueing a THP folio from the
deferred list, sometimes corrupting "from" memcg's list, like swapout.
Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
mem_cgroup_move_charge_pte_range() first tries to split the THP (splitting
of course unqueues), or skip it if that fails. Not ideal, but moving
charge has been requested, and khugepaged should repair the THP later:
nobody wants new custom unqueueing code just for this deprecated case.
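
The "split once, else skip" control flow can be sketched as below. This is a simplified userspace model, not the kernel code: struct toy_thp, toy_split() and toy_move_charge() are hypothetical names, locking and reference counting are omitted, and a successful split is treated as making the charge movable, whereas the real code would then take the non-huge path on retry.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_move_result {
	TOY_MOVED,
	TOY_SKIPPED,
};

/* Hypothetical model of a THP that may sit on a deferred split queue. */
struct toy_thp {
	bool on_deferred_list;
	bool split_will_succeed;	/* test knob: outcome of the split attempt */
	int split_attempts;
};

/* Splitting unqueues as a side effect, but only if it succeeds. */
static bool toy_split(struct toy_thp *t)
{
	t->split_attempts++;
	if (!t->split_will_succeed)
		return false;
	t->on_deferred_list = false;
	return true;
}

/* Try at most one split; if the THP is still queued afterwards, skip it. */
static enum toy_move_result toy_move_charge(struct toy_thp *t)
{
	bool tried_split_before = false;

	for (;;) {
		if (!t->on_deferred_list)
			return TOY_MOVED;
		if (tried_split_before)
			return TOY_SKIPPED;
		toy_split(t);
		tried_split_before = true;
	}
}
```

The single flag keeps the retry bounded: a THP whose split fails is left where it is rather than moved with a stale queue entry.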
The 87eaceb3faa5 commit did have the code to move from one deferred list
to another (but was not conscious of its unsafety while refcount non-0);
but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need care
deferred split queue in memcg charge move path"), which argued that the
existence of a PMD mapping guarantees that the THP cannot be on a deferred
list. As above, false in rare cases, and now commonly false.
Backport to 6.11 should be straightforward. Earlier backports must take
care that other _deferred_list fixes and dependencies are included. There
is not a strong case for backports, but they can fix corner cases.
Link: https://lkml.kernel.org/r/8dc111ae-f6db-2da7-b25c-7a20b1effe3b@google.com
Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
Fixes: dafff3f4c850 ("mm: split underused THPs")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Upstream commit itself does not apply cleanly, because there
are fewer calls to folio_undo_large_rmappable() in this tree
(in particular, folio migration does not migrate memcg charge),
and mm/memcontrol-v1.c has not been split out of mm/memcontrol.c. ]
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/huge_memory.c | 35 ++++++++++++++++++++++++++---------
mm/internal.h | 10 +++++-----
mm/memcontrol.c | 32 +++++++++++++++++++++++++++++---
mm/page_alloc.c | 2 +-
4 files changed, 61 insertions(+), 18 deletions(-)
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2767,18 +2767,38 @@ out:
return ret;
}
-void __folio_undo_large_rmappable(struct folio *folio)
+/*
+ * __folio_unqueue_deferred_split() is not to be called directly:
+ * the folio_unqueue_deferred_split() inline wrapper in mm/internal.h
+ * limits its calls to those folios which may have a _deferred_list for
+ * queueing THP splits, and that list is (racily observed to be) non-empty.
+ *
+ * It is unsafe to call folio_unqueue_deferred_split() until folio refcount is
+ * zero: because even when split_queue_lock is held, a non-empty _deferred_list
+ * might be in use on deferred_split_scan()'s unlocked on-stack list.
+ *
+ * If memory cgroups are enabled, split_queue_lock is in the mem_cgroup: it is
+ * therefore important to unqueue deferred split before changing folio memcg.
+ */
+bool __folio_unqueue_deferred_split(struct folio *folio)
{
struct deferred_split *ds_queue;
unsigned long flags;
+ bool unqueued = false;
+
+ WARN_ON_ONCE(folio_ref_count(folio));
+ WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
ds_queue = get_deferred_split_queue(folio);
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
if (!list_empty(&folio->_deferred_list)) {
ds_queue->split_queue_len--;
list_del_init(&folio->_deferred_list);
+ unqueued = true;
}
spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+
+ return unqueued; /* useful for debug warnings */
}
void deferred_split_folio(struct folio *folio)
@@ -2797,14 +2817,11 @@ void deferred_split_folio(struct folio *
return;
/*
- * The try_to_unmap() in page reclaim path might reach here too,
- * this may cause a race condition to corrupt deferred split queue.
- * And, if page reclaim is already handling the same folio, it is
- * unnecessary to handle it again in shrinker.
- *
- * Check the swapcache flag to determine if the folio is being
- * handled by page reclaim since THP swap would add the folio into
- * swap cache before calling try_to_unmap().
+ * Exclude swapcache: originally to avoid a corrupt deferred split
+ * queue. Nowadays that is fully prevented by mem_cgroup_swapout();
+ * but if page reclaim is already handling the same folio, it is
+ * unnecessary to handle it again in the shrinker, so excluding
+ * swapcache here may still be a useful optimization.
*/
if (folio_test_swapcache(folio))
return;
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -413,11 +413,11 @@ static inline void folio_set_order(struc
#endif
}
-void __folio_undo_large_rmappable(struct folio *folio);
-static inline void folio_undo_large_rmappable(struct folio *folio)
+bool __folio_unqueue_deferred_split(struct folio *folio);
+static inline bool folio_unqueue_deferred_split(struct folio *folio)
{
if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
- return;
+ return false;
/*
* At this point, there is no one trying to add the folio to
@@ -425,9 +425,9 @@ static inline void folio_undo_large_rmap
* to check without acquiring the split_queue_lock.
*/
if (data_race(list_empty(&folio->_deferred_list)))
- return;
+ return false;
- __folio_undo_large_rmappable(folio);
+ return __folio_unqueue_deferred_split(folio);
}
static inline struct folio *page_rmappable_folio(struct page *page)
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5873,6 +5873,8 @@ static int mem_cgroup_move_account(struc
css_get(&to->css);
css_put(&from->css);
+ /* Warning should never happen, so don't worry about refcount non-0 */
+ WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
folio->memcg_data = (unsigned long)to;
__folio_memcg_unlock(from);
@@ -6237,7 +6239,10 @@ static int mem_cgroup_move_charge_pte_ra
enum mc_target_type target_type;
union mc_target target;
struct page *page;
+ struct folio *folio;
+ bool tried_split_before = false;
+retry_pmd:
ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
if (mc.precharge < HPAGE_PMD_NR) {
@@ -6247,6 +6252,28 @@ static int mem_cgroup_move_charge_pte_ra
target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
if (target_type == MC_TARGET_PAGE) {
page = target.page;
+ folio = page_folio(page);
+ /*
+ * Deferred split queue locking depends on memcg,
+ * and unqueue is unsafe unless folio refcount is 0:
+ * split or skip if on the queue? first try to split.
+ */
+ if (!list_empty(&folio->_deferred_list)) {
+ spin_unlock(ptl);
+ if (!tried_split_before)
+ split_folio(folio);
+ folio_unlock(folio);
+ folio_put(folio);
+ if (tried_split_before)
+ return 0;
+ tried_split_before = true;
+ goto retry_pmd;
+ }
+ /*
+ * So long as that pmd lock is held, the folio cannot
+ * be racily added to the _deferred_list, because
+ * page_remove_rmap() will find it still pmdmapped.
+ */
if (isolate_lru_page(page)) {
if (!mem_cgroup_move_account(page, true,
mc.from, mc.to)) {
@@ -7153,9 +7180,6 @@ static void uncharge_folio(struct folio
struct obj_cgroup *objcg;
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
- VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
- !folio_test_hugetlb(folio) &&
- !list_empty(&folio->_deferred_list), folio);
/*
* Nobody should be changing or seriously looking at
@@ -7202,6 +7226,7 @@ static void uncharge_folio(struct folio
ug->nr_memory += nr_pages;
ug->pgpgout++;
+ WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
folio->memcg_data = 0;
}
@@ -7495,6 +7520,7 @@ void mem_cgroup_swapout(struct folio *fo
VM_BUG_ON_FOLIO(oldid, folio);
mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
+ folio_unqueue_deferred_split(folio);
folio->memcg_data = 0;
if (!mem_cgroup_is_root(memcg))
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -600,7 +600,7 @@ void destroy_large_folio(struct folio *f
return;
}
- folio_undo_large_rmappable(folio);
+ folio_unqueue_deferred_split(folio);
mem_cgroup_uncharge(folio);
free_the_page(&folio->page, folio_order(folio));
}