From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
patches@lists.linux.dev, Oleg Nesterov <oleg@redhat.com>,
Hugh Dickins <hughd@google.com>, Michal Hocko <mhocko@suse.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Sasha Levin <sashal@kernel.org>
Subject: [PATCH 5.4 02/60] mm: rewrite wait_on_page_bit_common() logic
Date: Mon, 26 Jun 2023 20:11:41 +0200 [thread overview]
Message-ID: <20230626180739.640524541@linuxfoundation.org> (raw)
In-Reply-To: <20230626180739.558575012@linuxfoundation.org>
From: Linus Torvalds <torvalds@linux-foundation.org>
[ Upstream commit 2a9127fcf2296674d58024f83981f40b128fffea ]
It turns out that wait_on_page_bit_common() had several problems,
ranging from just unfair behavioe due to re-queueing at the end of the
wait queue when re-trying, and an outright bug that could result in
missed wakeups (but probably never happened in practice).
This rewrites the whole logic to avoid both issues, by simply moving the
logic to check (and possibly take) the bit lock into the wakeup path
instead.
That makes everything much more straightforward, and means that we never
need to re-queue the wait entry: if we get woken up, we'll be notified
through WQ_FLAG_WOKEN, and the wait queue entry will have been removed,
and everything will have been done for us.
Link: https://lore.kernel.org/lkml/CAHk-=wjJA2Z3kUFb-5s=6+n0qbTs8ELqKFt9B3pH85a8fGD73w@mail.gmail.com/
Link: https://lore.kernel.org/lkml/alpine.LSU.2.11.2007221359450.1017@eggly.anvils/
Reported-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stable-dep-of: 2192bba03d80 ("epoll: ep_autoremove_wake_function should use list_del_init_careful")
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
mm/filemap.c | 132 +++++++++++++++++++++++++++++++++------------------
1 file changed, 85 insertions(+), 47 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index c094103191a6e..83b324420046b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1046,6 +1046,7 @@ struct wait_page_queue {
static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *arg)
{
+ int ret;
struct wait_page_key *key = arg;
struct wait_page_queue *wait_page
= container_of(wait, struct wait_page_queue, wait);
@@ -1058,17 +1059,40 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
return 0;
/*
- * Stop walking if it's locked.
- * Is this safe if put_and_wait_on_page_locked() is in use?
- * Yes: the waker must hold a reference to this page, and if PG_locked
- * has now already been set by another task, that task must also hold
- * a reference to the *same usage* of this page; so there is no need
- * to walk on to wake even the put_and_wait_on_page_locked() callers.
+ * If it's an exclusive wait, we get the bit for it, and
+ * stop walking if we can't.
+ *
+ * If it's a non-exclusive wait, then the fact that this
+ * wake function was called means that the bit already
+ * was cleared, and we don't care if somebody then
+ * re-took it.
*/
- if (test_bit(key->bit_nr, &key->page->flags))
- return -1;
+ ret = 0;
+ if (wait->flags & WQ_FLAG_EXCLUSIVE) {
+ if (test_and_set_bit(key->bit_nr, &key->page->flags))
+ return -1;
+ ret = 1;
+ }
+ wait->flags |= WQ_FLAG_WOKEN;
- return autoremove_wake_function(wait, mode, sync, key);
+ wake_up_state(wait->private, mode);
+
+ /*
+ * Ok, we have successfully done what we're waiting for,
+ * and we can unconditionally remove the wait entry.
+ *
+ * Note that this has to be the absolute last thing we do,
+ * since after list_del_init(&wait->entry) the wait entry
+ * might be de-allocated and the process might even have
+ * exited.
+ *
+ * We _really_ should have a "list_del_init_careful()" to
+ * properly pair with the unlocked "list_empty_careful()"
+ * in finish_wait().
+ */
+ smp_mb();
+ list_del_init(&wait->entry);
+ return ret;
}
static void wake_up_page_bit(struct page *page, int bit_nr)
@@ -1147,16 +1171,31 @@ enum behavior {
*/
};
+/*
+ * Attempt to check (or get) the page bit, and mark the
+ * waiter woken if successful.
+ */
+static inline bool trylock_page_bit_common(struct page *page, int bit_nr,
+ struct wait_queue_entry *wait)
+{
+ if (wait->flags & WQ_FLAG_EXCLUSIVE) {
+ if (test_and_set_bit(bit_nr, &page->flags))
+ return false;
+ } else if (test_bit(bit_nr, &page->flags))
+ return false;
+
+ wait->flags |= WQ_FLAG_WOKEN;
+ return true;
+}
+
static inline int wait_on_page_bit_common(wait_queue_head_t *q,
struct page *page, int bit_nr, int state, enum behavior behavior)
{
struct wait_page_queue wait_page;
wait_queue_entry_t *wait = &wait_page.wait;
- bool bit_is_set;
bool thrashing = false;
bool delayacct = false;
unsigned long pflags;
- int ret = 0;
if (bit_nr == PG_locked &&
!PageUptodate(page) && PageWorkingset(page)) {
@@ -1174,48 +1213,47 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
wait_page.page = page;
wait_page.bit_nr = bit_nr;
- for (;;) {
- spin_lock_irq(&q->lock);
+ /*
+ * Do one last check whether we can get the
+ * page bit synchronously.
+ *
+ * Do the SetPageWaiters() marking before that
+ * to let any waker we _just_ missed know they
+ * need to wake us up (otherwise they'll never
+ * even go to the slow case that looks at the
+ * page queue), and add ourselves to the wait
+ * queue if we need to sleep.
+ *
+ * This part needs to be done under the queue
+ * lock to avoid races.
+ */
+ spin_lock_irq(&q->lock);
+ SetPageWaiters(page);
+ if (!trylock_page_bit_common(page, bit_nr, wait))
+ __add_wait_queue_entry_tail(q, wait);
+ spin_unlock_irq(&q->lock);
- if (likely(list_empty(&wait->entry))) {
- __add_wait_queue_entry_tail(q, wait);
- SetPageWaiters(page);
- }
+ /*
+ * From now on, all the logic will be based on
+ * the WQ_FLAG_WOKEN flag, and the and the page
+ * bit testing (and setting) will be - or has
+ * already been - done by the wake function.
+ *
+ * We can drop our reference to the page.
+ */
+ if (behavior == DROP)
+ put_page(page);
+ for (;;) {
set_current_state(state);
- spin_unlock_irq(&q->lock);
-
- bit_is_set = test_bit(bit_nr, &page->flags);
- if (behavior == DROP)
- put_page(page);
-
- if (likely(bit_is_set))
- io_schedule();
-
- if (behavior == EXCLUSIVE) {
- if (!test_and_set_bit_lock(bit_nr, &page->flags))
- break;
- } else if (behavior == SHARED) {
- if (!test_bit(bit_nr, &page->flags))
- break;
- }
-
- if (signal_pending_state(state, current)) {
- ret = -EINTR;
+ if (signal_pending_state(state, current))
break;
- }
- if (behavior == DROP) {
- /*
- * We can no longer safely access page->flags:
- * even if CONFIG_MEMORY_HOTREMOVE is not enabled,
- * there is a risk of waiting forever on a page reused
- * for something that keeps it locked indefinitely.
- * But best check for -EINTR above before breaking.
- */
+ if (wait->flags & WQ_FLAG_WOKEN)
break;
- }
+
+ io_schedule();
}
finish_wait(q, wait);
@@ -1234,7 +1272,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
* bother with signals either.
*/
- return ret;
+ return wait->flags & WQ_FLAG_WOKEN ? 0 : -EINTR;
}
void wait_on_page_bit(struct page *page, int bit_nr)
--
2.39.2
next prev parent reply other threads:[~2023-06-26 18:35 UTC|newest]
Thread overview: 66+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-06-26 18:11 [PATCH 5.4 00/60] 5.4.249-rc1 review Greg Kroah-Hartman
2023-06-26 18:11 ` [PATCH 5.4 01/60] nilfs2: reject devices with insufficient block count Greg Kroah-Hartman
2023-06-26 18:11 ` Greg Kroah-Hartman [this message]
2023-06-26 18:11 ` [PATCH 5.4 03/60] list: add "list_del_init_careful()" to go with "list_empty_careful()" Greg Kroah-Hartman
2023-06-26 18:11 ` [PATCH 5.4 04/60] epoll: ep_autoremove_wake_function should use list_del_init_careful Greg Kroah-Hartman
2023-06-26 18:11 ` [PATCH 5.4 05/60] tracing: Add tracing_reset_all_online_cpus_unlocked() function Greg Kroah-Hartman
2023-06-26 18:11 ` [PATCH 5.4 06/60] x86/purgatory: remove PGO flags Greg Kroah-Hartman
2023-06-26 18:11 ` [PATCH 5.4 07/60] tick/common: Align tick period during sched_timer setup Greg Kroah-Hartman
2023-06-26 18:11 ` [PATCH 5.4 08/60] media: dvbdev: Fix memleak in dvb_register_device Greg Kroah-Hartman
2023-06-26 18:11 ` [PATCH 5.4 09/60] media: dvbdev: fix error logic at dvb_register_device() Greg Kroah-Hartman
2023-06-26 18:11 ` [PATCH 5.4 10/60] media: dvb-core: Fix use-after-free due to race " Greg Kroah-Hartman
2023-06-26 18:11 ` [PATCH 5.4 11/60] nilfs2: fix buffer corruption due to concurrent device reads Greg Kroah-Hartman
2023-06-26 18:11 ` [PATCH 5.4 12/60] Drivers: hv: vmbus: Fix vmbus_wait_for_unload() to scan present CPUs Greg Kroah-Hartman
2023-06-26 18:11 ` [PATCH 5.4 13/60] PCI: hv: Fix a race condition bug in hv_pci_query_relations() Greg Kroah-Hartman
2023-06-26 18:11 ` [PATCH 5.4 14/60] cgroup: Do not corrupt task iteration when rebinding subsystem Greg Kroah-Hartman
2023-06-26 18:11 ` [PATCH 5.4 15/60] mmc: meson-gx: remove redundant mmc_request_done() call from irq context Greg Kroah-Hartman
2023-06-26 18:11 ` [PATCH 5.4 16/60] ip_tunnels: allow VXLAN/GENEVE to inherit TOS/TTL from VLAN Greg Kroah-Hartman
2023-06-26 18:11 ` [PATCH 5.4 17/60] writeback: fix dereferencing NULL mapping->host on writeback_page_template Greg Kroah-Hartman
2023-06-26 18:11 ` [PATCH 5.4 18/60] nilfs2: prevent general protection fault in nilfs_clear_dirty_page() Greg Kroah-Hartman
2023-06-26 18:11 ` [PATCH 5.4 19/60] cifs: Clean up DFS referral cache Greg Kroah-Hartman
2023-06-26 18:11 ` [PATCH 5.4 20/60] cifs: Get rid of kstrdup_const()d paths Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 21/60] cifs: Introduce helpers for finding TCP connection Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 22/60] cifs: Merge is_path_valid() into get_normalized_path() Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 23/60] cifs: Fix potential deadlock when updating vol in cifs_reconnect() Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 24/60] x86/mm: Avoid using set_pgd() outside of real PGD pages Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 25/60] rcu: Upgrade rcu_swap_protected() to rcu_replace_pointer() Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 26/60] ieee802154: hwsim: Fix possible memory leaks Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 27/60] xfrm: Linearize the skb after offloading if needed Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 28/60] net: qca_spi: Avoid high load if QCA7000 is not available Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 29/60] mmc: mtk-sd: fix deferred probing Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 30/60] mmc: mvsdio: convert to devm_platform_ioremap_resource Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 31/60] mmc: mvsdio: fix deferred probing Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 32/60] mmc: omap: " Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 33/60] mmc: omap_hsmmc: " Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 34/60] mmc: sdhci-acpi: " Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 35/60] mmc: sh_mmcif: " Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 36/60] mmc: usdhi60rol0: " Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 37/60] ipvs: align inner_mac_header for encapsulation Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 38/60] net: dsa: mt7530: fix trapping frames on non-MT7621 SoC MT7530 switch Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 39/60] be2net: Extend xmit workaround to BE3 chip Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 40/60] netfilter: nf_tables: disallow element updates of bound anonymous sets Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 41/60] netfilter: nfnetlink_osf: fix module autoload Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 42/60] Revert "net: phy: dp83867: perform soft reset and retain established link" Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 43/60] sch_netem: acquire qdisc lock in netem_change() Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 44/60] scsi: target: iscsi: Prevent login threads from racing between each other Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 45/60] HID: wacom: Add error check to wacom_parse_and_register() Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 46/60] arm64: Add missing Set/Way CMO encodings Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 47/60] media: cec: core: dont set last_initiator if tx in progress Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 48/60] nfcsim.c: Fix error checking for debugfs_create_dir Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 49/60] usb: gadget: udc: fix NULL dereference in remove() Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 50/60] s390/cio: unregister device when the only path is gone Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 51/60] ASoC: nau8824: Add quirk to active-high jack-detect Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 52/60] ARM: dts: Fix erroneous ADS touchscreen polarities Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 53/60] drm/exynos: vidi: fix a wrong error return Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 54/60] drm/exynos: fix race condition UAF in exynos_g2d_exec_ioctl Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 55/60] drm/radeon: fix race condition UAF in radeon_gem_set_domain_ioctl Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 56/60] x86/apic: Fix kernel panic when booting with intremap=off and x2apic_phys Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 57/60] i2c: imx-lpi2c: fix type char overflow issue when calculating the clock cycle Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 58/60] mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback) Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 59/60] mm: make wait_on_page_writeback() wait for multiple pending writebacks Greg Kroah-Hartman
2023-06-26 18:12 ` [PATCH 5.4 60/60] xfs: verify buffer contents when we skip log replay Greg Kroah-Hartman
2023-06-27 9:04 ` [PATCH 5.4 00/60] 5.4.249-rc1 review Jon Hunter
2023-06-27 14:15 ` Harshit Mogalapalli
2023-06-27 20:10 ` Chris Paterson
2023-06-27 21:35 ` Guenter Roeck
2023-06-28 7:03 ` Naresh Kamboju
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230626180739.640524541@linuxfoundation.org \
--to=gregkh@linuxfoundation.org \
--cc=hughd@google.com \
--cc=mhocko@suse.com \
--cc=oleg@redhat.com \
--cc=patches@lists.linux.dev \
--cc=sashal@kernel.org \
--cc=stable@vger.kernel.org \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).