From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Jan Kara <jack@suse.cz>, Matthew Wilcox <willy@infradead.org>,
Guo Xuenan <guoxuenan@huawei.com>,
Andrew Morton <akpm@linux-foundation.org>,
Sasha Levin <sashal@kernel.org>
Subject: [PATCH 6.6 53/60] readahead: avoid multiple marked readahead pages
Date: Wed, 13 Mar 2024 12:37:00 -0400 [thread overview]
Message-ID: <20240313163707.615000-54-sashal@kernel.org> (raw)
In-Reply-To: <20240313163707.615000-1-sashal@kernel.org>
From: Jan Kara <jack@suse.cz>
[ Upstream commit ab4443fe3ca6298663a55c4a70efc6c3ce913ca6 ]
ra_alloc_folio() marks a page that should trigger next round of async
readahead. However it rounds up computed index to the order of page being
allocated. This can however lead to multiple consecutive pages being
marked with readahead flag. Consider situation with index == 1, mark ==
1, order == 0. We insert order 0 page at index 1 and mark it. Then we
bump order to 1, index to 2, mark (still == 1) is rounded up to 2 so page
at index 2 is marked as well. Then we bump order to 2, index is
incremented to 4, mark gets rounded to 4 so page at index 4 is marked as
well. The fact that multiple pages get marked within a single readahead
window confuses the readahead logic and results in readahead window being
trimmed back to 1. This situation is triggered in particular when maximum
readahead window size is not a power of two (in the observed case it was
768 KB) and as a result sequential read throughput suffers.
Fix the problem by rounding 'mark' down instead of up. Because the index
is naturally aligned to 'order', we are guaranteed 'rounded mark' == index
iff 'mark' is within the page we are allocating at 'index' and thus
exactly one page is marked with readahead flag as required by the
readahead code and sequential read performance is restored.
This effectively reverts part of commit b9ff43dd2743 ("mm/readahead: Fix
readahead with large folios"). The commit changed the rounding with the
rationale:
"... we were setting the readahead flag on the folio which contains the
last byte read from the block. This is wrong because we will trigger
readahead at the end of the read without waiting to see if a subsequent
read is going to use the pages we just read."
Although this is true, the fact is this was always the case with read
sizes not aligned to folio boundaries and large folios in the page cache
just make the situation more obvious (and frequent). Also for sequential
read workloads it is better to trigger the readahead earlier rather than
later. It is true that the difference in the rounding and thus earlier
triggering of the readahead can result in reading more for semi-random
workloads. However workloads really suffering from this seem to be rare.
In particular I have verified that the workload described in commit
b9ff43dd2743 ("mm/readahead: Fix readahead with large folios") of reading
random 100k blocks from a file like:
[reader]
bs=100k
rw=randread
numjobs=1
size=64g
runtime=60s
is not impacted by the rounding change and achieves ~70MB/s in both cases.
[jack@suse.cz: fix one more place where mark rounding was done as well]
Link: https://lkml.kernel.org/r/20240123153254.5206-1-jack@suse.cz
Link: https://lkml.kernel.org/r/20240104085839.21029-1-jack@suse.cz
Fixes: b9ff43dd2743 ("mm/readahead: Fix readahead with large folios")
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Guo Xuenan <guoxuenan@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
mm/readahead.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index 6925e6959fd3f..1d1a84deb5bc5 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -469,7 +469,7 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
if (!folio)
return -ENOMEM;
- mark = round_up(mark, 1UL << order);
+ mark = round_down(mark, 1UL << order);
if (index == mark)
folio_set_readahead(folio);
err = filemap_add_folio(ractl->mapping, folio, index, gfp);
@@ -577,7 +577,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
* It's the expected callback index, assume sequential access.
* Ramp up sizes, and push forward the readahead window.
*/
- expected = round_up(ra->start + ra->size - ra->async_size,
+ expected = round_down(ra->start + ra->size - ra->async_size,
1UL << order);
if (index == expected || index == (ra->start + ra->size)) {
ra->start += ra->size;
--
2.43.0
next prev parent reply other threads:[~2024-03-13 16:38 UTC|newest]
Thread overview: 72+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-03-13 16:36 [PATCH 6.6 00/60] 6.6.22-rc1 review Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 01/60] dt-bindings: dma: fsl-edma: Add fsl-edma.h to prevent hardcoding in dts Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 02/60] dmaengine: fsl-edma: utilize common dt-binding header file Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 03/60] dmaengine: fsl-edma: correct max_segment_size setting Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 04/60] ceph: switch to corrected encoding of max_xattr_size in mdsmap Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 05/60] mm: migrate: remove PageTransHuge check in numamigrate_isolate_page() Sasha Levin
2024-03-13 17:29 ` Hugh Dickins
2024-03-13 16:36 ` [PATCH 6.6 06/60] mm: migrate: remove THP mapcount " Sasha Levin
2024-03-13 17:31 ` Hugh Dickins
2024-03-13 16:36 ` [PATCH 6.6 07/60] mm: migrate: convert numamigrate_isolate_page() to numamigrate_isolate_folio() Sasha Levin
2024-03-13 17:32 ` Hugh Dickins
2024-03-13 18:32 ` Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 08/60] mm/vmscan: fix a bug calling wakeup_kswapd() with a wrong zone index Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 09/60] xfrm: Pass UDP encapsulation in TX packet offload Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 10/60] net: lan78xx: fix runtime PM count underflow on link stop Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 11/60] ixgbe: {dis, en}able irqs in ixgbe_txrx_ring_{dis, en}able Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 12/60] i40e: disable NAPI right after disabling irqs when handling xsk_pool Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 13/60] ice: reorder disabling IRQ and NAPI in ice_qp_dis Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 14/60] Revert "net/mlx5: Block entering switchdev mode with ns inconsistency" Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 15/60] Revert "net/mlx5e: Check the number of elements before walk TC rhashtable" Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 16/60] net/mlx5: E-switch, Change flow rule destination checking Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 17/60] net/mlx5: Check capability for fw_reset Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 18/60] net/mlx5e: Change the warning when ignore_flow_level is not supported Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 19/60] net/mlx5e: Fix MACsec state loss upon state update in offload path Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 20/60] net/mlx5e: Use a memory barrier to enforce PTP WQ xmit submission tracking occurs after populating the metadata_map Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 21/60] net/mlx5e: Switch to using _bh variant of of spinlock API in port timestamping NAPI poll context Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 22/60] tracing/net_sched: Fix tracepoints that save qdisc_dev() as a string Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 23/60] geneve: make sure to pull inner header in geneve_rx() Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 24/60] net: sparx5: Fix use after free inside sparx5_del_mact_entry Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 25/60] ice: virtchnl: stop pretending to support RSS over AQ or registers Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 26/60] net: ice: Fix potential NULL pointer dereference in ice_bridge_setlink() Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 27/60] igc: avoid returning frame twice in XDP_REDIRECT Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 28/60] net/ipv6: avoid possible UAF in ip6_route_mpath_notify() Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 29/60] bpf: check bpf_func_state->callback_depth when pruning states Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 30/60] xdp, bonding: Fix feature flags when there are no slave devs anymore Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 31/60] selftests/bpf: Fix up xdp bonding test wrt feature flags Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 32/60] cpumap: Zero-initialise xdp_rxq_info struct before running XDP program Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 33/60] net: dsa: microchip: fix register write order in ksz8_ind_write8() Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 34/60] net/rds: fix WARNING in rds_conn_connect_if_down Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 35/60] netfilter: nft_ct: fix l3num expectations with inet pseudo family Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 36/60] netfilter: nf_conntrack_h323: Add protection for bmp length out of range Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 37/60] erofs: apply proper VMA alignment for memory mapped files on THP Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 38/60] netrom: Fix a data-race around sysctl_netrom_default_path_quality Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 39/60] netrom: Fix a data-race around sysctl_netrom_obsolescence_count_initialiser Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 40/60] netrom: Fix data-races around sysctl_netrom_network_ttl_initialiser Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 41/60] netrom: Fix a data-race around sysctl_netrom_transport_timeout Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 42/60] netrom: Fix a data-race around sysctl_netrom_transport_maximum_tries Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 43/60] netrom: Fix a data-race around sysctl_netrom_transport_acknowledge_delay Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 44/60] netrom: Fix a data-race around sysctl_netrom_transport_busy_delay Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 45/60] netrom: Fix a data-race around sysctl_netrom_transport_requested_window_size Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 46/60] netrom: Fix a data-race around sysctl_netrom_transport_no_activity_timeout Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 47/60] netrom: Fix a data-race around sysctl_netrom_routing_control Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 48/60] netrom: Fix a data-race around sysctl_netrom_link_fails_count Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 49/60] netrom: Fix data-races around sysctl_net_busy_read Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 50/60] net: pds_core: Fix possible double free in error handling path Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 51/60] KVM: s390: add stat counter for shadow gmap events Sasha Levin
2024-03-13 16:36 ` [PATCH 6.6 52/60] KVM: s390: vsie: fix race during shadow creation Sasha Levin
2024-03-13 16:37 ` Sasha Levin [this message]
2024-03-13 16:37 ` [PATCH 6.6 54/60] selftests: mptcp: decrease BW in simult flows Sasha Levin
2024-03-13 16:37 ` [PATCH 6.6 55/60] exit: wait_task_zombie: kill the no longer necessary spin_lock_irq(siglock) Sasha Levin
2024-03-13 16:37 ` [PATCH 6.6 56/60] x86/mmio: Disable KVM mitigation when X86_FEATURE_CLEAR_CPU_BUF is set Sasha Levin
2024-03-13 16:37 ` [PATCH 6.6 57/60] Documentation/hw-vuln: Add documentation for RFDS Sasha Levin
2024-03-13 16:37 ` [PATCH 6.6 58/60] x86/rfds: Mitigate Register File Data Sampling (RFDS) Sasha Levin
2024-03-13 16:37 ` [PATCH 6.6 59/60] KVM/x86: Export RFDS_NO and RFDS_CLEAR to guests Sasha Levin
2024-03-13 16:37 ` [PATCH 6.6 60/60] Linux 6.6.22-rc1 Sasha Levin
2024-03-14 8:02 ` [PATCH 6.6 00/60] 6.6.22-rc1 review Bagas Sanjaya
2024-03-14 10:08 ` Naresh Kamboju
2024-03-14 11:56 ` Takeshi Ogasawara
2024-03-14 20:55 ` Florian Fainelli
2024-03-15 15:44 ` Mark Brown
2024-03-15 16:01 ` Ron Economos
2024-03-15 17:36 ` Harshit Mogalapalli
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240313163707.615000-54-sashal@kernel.org \
--to=sashal@kernel.org \
--cc=akpm@linux-foundation.org \
--cc=guoxuenan@huawei.com \
--cc=jack@suse.cz \
--cc=linux-kernel@vger.kernel.org \
--cc=stable@vger.kernel.org \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox