* [REGRESSION][BISECTED] Unexpected OOM instead of reclaiming inactive file pages @ 2025-08-11 15:48 Oleksandr Natalenko 2025-08-11 16:06 ` David Rientjes 0 siblings, 1 reply; 5+ messages in thread From: Oleksandr Natalenko @ 2025-08-11 15:48 UTC (permalink / raw) To: linux-kernel Cc: linux-block, Jens Axboe, Damien Le Moal, John Garry, Christoph Hellwig, Martin K. Petersen, linux-mm, Lorenzo Stoakes, Shakeel Butt, Qi Zheng, Michal Hocko, David Hildenbrand, Johannes Weiner, Andrew Morton [-- Attachment #1: Type: text/plain, Size: 969 bytes --] Hello Damien. I'm fairly confident that the following commit 459779d04ae8d block: Improve read ahead size for rotational devices caused a regression in my test bench. I'm running v6.17-rc1 in a small QEMU VM with a virtio-scsi disk. It has 1 GiB of RAM, so I can saturate it easily, causing the reclaim mechanism to kick in. If MGLRU is enabled: $ echo 1000 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms then, once the page cache builds up, an OOM happens without reclaiming inactive file pages: [1]. Note the inactive_file:506952kB; I'd expect these pages to be reclaimed instead, as happens with v6.16. If MGLRU is disabled: $ echo 0 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms then the OOM doesn't occur, and things seem to work as usual. If MGLRU is enabled, and 459779d04ae8d is reverted on top of v6.17-rc1, the OOM doesn't happen either. Could you please check this? Thank you. -- Oleksandr Natalenko, MSE [1]: https://paste.voidband.net/TG5OiZ29.log [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 5+ messages in thread
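A minimal reproducer sketch for the setup described above, assuming a ~1 GiB VM running the bisected kernel with CONFIG_LRU_GEN=y; any buffered-read workload large enough to fill the page cache will do (the journalctl invocation used later in this thread is one option), and the exact commands are illustrative:

$ echo 1000 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms   # ask MGLRU to protect the last 1000 ms of working set
$ echo 3 | sudo tee /proc/sys/vm/drop_caches               # start from an empty page cache
$ sudo journalctl -kb > /dev/null                          # buffered reads that build up the page cache
$ sudo dmesg | grep -iE 'out of memory|oom-kill'           # check whether the OOM killer fired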
* Re: [REGRESSION][BISECTED] Unexpected OOM instead of reclaiming inactive file pages 2025-08-11 15:48 [REGRESSION][BISECTED] Unexpected OOM instead of reclaiming inactive file pages Oleksandr Natalenko @ 2025-08-11 16:06 ` David Rientjes 2025-08-11 20:42 ` Oleksandr Natalenko 0 siblings, 1 reply; 5+ messages in thread From: David Rientjes @ 2025-08-11 16:06 UTC (permalink / raw) To: Oleksandr Natalenko Cc: linux-kernel, linux-block, Jens Axboe, Damien Le Moal, John Garry, Christoph Hellwig, Martin K. Petersen, linux-mm, Lorenzo Stoakes, Shakeel Butt, Qi Zheng, Michal Hocko, David Hildenbrand, Johannes Weiner, Andrew Morton On Mon, 11 Aug 2025, Oleksandr Natalenko wrote: > Hello Damien. > > I'm fairly confident that the following commit > > 459779d04ae8d block: Improve read ahead size for rotational devices > > caused a regression in my test bench. > > I'm running v6.17-rc1 in a small QEMU VM with virtio-scsi disk. It has got 1 GiB of RAM, so I can saturate it easily causing reclaiming mechanism to kick in. > > If MGLRU is enabled: > > $ echo 1000 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms > > then, once page cache builds up, an OOM happens without reclaiming inactive file pages: [1]. Note that inactive_file:506952kB, I'd expect these to be reclaimed instead, like how it happens with v6.16. > > If MGLRU is disabled: > > $ echo 0 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms > > then OOM doesn't occur, and things seem to work as usual. > > If MGLRU is enabled, and 459779d04ae8d is reverted on top of v6.17-rc1, the OOM doesn't happen either. > > Could you please check this? > This looks to be an MGLRU policy decision rather than a readahead regression, correct? Mem-Info: active_anon:388 inactive_anon:5382 isolated_anon:0 active_file:9638 inactive_file:126738 isolated_file:0 Setting min_ttl_ms to 1000 is preserving the working set, and triggering the oom kill is the only alternative to free memory in that configuration. The oom kill is being triggered by kswapd for this purpose. So additional readahead would certainly increase that working set. This looks to be working as intended. ^ permalink raw reply [flat|nested] 5+ messages in thread
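A rough way to see this policy decision in action (a sketch, assuming the MGLRU debugfs interface is available and debugfs is mounted; the exact output format varies between kernel versions) is to compare the configured TTL with the age of the generations when free memory runs out:

$ cat /sys/kernel/mm/lru_gen/min_ttl_ms   # the working-set TTL being enforced
$ sudo cat /sys/kernel/debug/lru_gen      # per-memcg, per-node generations with birth times (ms) and page counts

Roughly speaking, if even the oldest generations are younger than min_ttl_ms when the low watermark is reached, MGLRU treats the whole page cache as working set, and kswapd's only remaining option is the OOM kill described above.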
* Re: [REGRESSION][BISECTED] Unexpected OOM instead of reclaiming inactive file pages 2025-08-11 16:06 ` David Rientjes @ 2025-08-11 20:42 ` Oleksandr Natalenko 2025-08-12 0:45 ` Damien Le Moal 0 siblings, 1 reply; 5+ messages in thread From: Oleksandr Natalenko @ 2025-08-11 20:42 UTC (permalink / raw) To: David Rientjes Cc: linux-kernel, linux-block, Jens Axboe, Damien Le Moal, John Garry, Christoph Hellwig, Martin K. Petersen, linux-mm, Lorenzo Stoakes, Shakeel Butt, Qi Zheng, Michal Hocko, David Hildenbrand, Johannes Weiner, Andrew Morton [-- Attachment #1: Type: text/plain, Size: 3175 bytes --] Hello. On pondělí 11. srpna 2025 18:06:16, středoevropský letní čas David Rientjes wrote: > On Mon, 11 Aug 2025, Oleksandr Natalenko wrote: > > > Hello Damien. > > > > I'm fairly confident that the following commit > > > > 459779d04ae8d block: Improve read ahead size for rotational devices > > > > caused a regression in my test bench. > > > > I'm running v6.17-rc1 in a small QEMU VM with virtio-scsi disk. It has got 1 GiB of RAM, so I can saturate it easily causing reclaiming mechanism to kick in. > > > > If MGLRU is enabled: > > > > $ echo 1000 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms > > > > then, once page cache builds up, an OOM happens without reclaiming inactive file pages: [1]. Note that inactive_file:506952kB, I'd expect these to be reclaimed instead, like how it happens with v6.16. > > > > If MGLRU is disabled: > > > > $ echo 0 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms > > > > then OOM doesn't occur, and things seem to work as usual. > > > > If MGLRU is enabled, and 459779d04ae8d is reverted on top of v6.17-rc1, the OOM doesn't happen either. > > > > Could you please check this? > > > > This looks to be an MGLRU policy decision rather than a readahead > regression, correct? > > Mem-Info: > active_anon:388 inactive_anon:5382 isolated_anon:0 > active_file:9638 inactive_file:126738 isolated_file:0 > > Setting min_ttl_ms to 1000 is preserving the working set and triggering > the oom kill is the only alternative to free memory in that configuration. > The oom kill is being triggered by kswapd for this purpose. > > So additional readahead would certainly increase that working set. This > looks working as intended. OK, this makes sense indeed, thanks for the explanation. But is inactive_file explosion expected and justified? Without revert: $ echo 3 | sudo tee /proc/sys/vm/drop_caches; free -m; sudo journalctl -kb >/dev/null; free -m 3 total used free shared buff/cache available Mem: 690 179 536 3 57 510 Swap: 1379 12 1367 /* OOM happens here */ total used free shared buff/cache available Mem: 690 177 52 3 561 513 Swap: 1379 17 1362 With revert: $ echo 3 | sudo tee /proc/sys/vm/drop_caches; free -m; sudo journalctl -kb >/dev/null; free -m 3 total used free shared buff/cache available Mem: 690 214 498 4 64 476 Swap: 1379 0 1379 /* no OOM */ total used free shared buff/cache available Mem: 690 209 462 4 119 481 Swap: 1379 0 1379 The journal folder size is: $ sudo du -hs /var/log/journal 575M /var/log/journal It looks like this readahead change causes far more data to be read than actually needed? -- Oleksandr Natalenko, MSE [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 5+ messages in thread
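A rough way to quantify that suspicion (a sketch; the device name sda is an assumption, and the block-layer stat file counts 512-byte sectors) is to diff the sectors-read counter around the same journalctl run, with and without the revert:

$ dev=sda
$ before=$(awk '{print $3}' /sys/block/$dev/stat)   # field 3 = sectors read so far
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
$ sudo journalctl -kb > /dev/null
$ after=$(awk '{print $3}' /sys/block/$dev/stat)
$ echo "device read $(( (after - before) * 512 / 1048576 )) MiB"

The difference between the two kernels shows how much extra data the larger read-ahead pulls in beyond what journalctl actually needs.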
* Re: [REGRESSION][BISECTED] Unexpected OOM instead of reclaiming inactive file pages 2025-08-11 20:42 ` Oleksandr Natalenko @ 2025-08-12 0:45 ` Damien Le Moal 2025-08-12 7:37 ` Oleksandr Natalenko 0 siblings, 1 reply; 5+ messages in thread From: Damien Le Moal @ 2025-08-12 0:45 UTC (permalink / raw) To: Oleksandr Natalenko, David Rientjes Cc: linux-kernel, linux-block, Jens Axboe, John Garry, Christoph Hellwig, Martin K. Petersen, linux-mm, Lorenzo Stoakes, Shakeel Butt, Qi Zheng, Michal Hocko, David Hildenbrand, Johannes Weiner, Andrew Morton On 8/12/25 5:42 AM, Oleksandr Natalenko wrote: > Hello. > > On pondělí 11. srpna 2025 18:06:16, středoevropský letní čas David Rientjes wrote: >> On Mon, 11 Aug 2025, Oleksandr Natalenko wrote: >> >>> Hello Damien. >>> >>> I'm fairly confident that the following commit >>> >>> 459779d04ae8d block: Improve read ahead size for rotational devices >>> >>> caused a regression in my test bench. >>> >>> I'm running v6.17-rc1 in a small QEMU VM with virtio-scsi disk. It has got 1 GiB of RAM, so I can saturate it easily causing reclaiming mechanism to kick in. >>> >>> If MGLRU is enabled: >>> >>> $ echo 1000 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms >>> >>> then, once page cache builds up, an OOM happens without reclaiming inactive file pages: [1]. Note that inactive_file:506952kB, I'd expect these to be reclaimed instead, like how it happens with v6.16. >>> >>> If MGLRU is disabled: >>> >>> $ echo 0 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms >>> >>> then OOM doesn't occur, and things seem to work as usual. >>> >>> If MGLRU is enabled, and 459779d04ae8d is reverted on top of v6.17-rc1, the OOM doesn't happen either. >>> >>> Could you please check this? >>> >> >> This looks to be an MGLRU policy decision rather than a readahead >> regression, correct? >> >> Mem-Info: >> active_anon:388 inactive_anon:5382 isolated_anon:0 >> active_file:9638 inactive_file:126738 isolated_file:0 >> >> Setting min_ttl_ms to 1000 is preserving the working set and triggering >> the oom kill is the only alternative to free memory in that configuration. >> The oom kill is being triggered by kswapd for this purpose. >> >> So additional readahead would certainly increase that working set. This >> looks working as intended. > > OK, this makes sense indeed, thanks for the explanation. But is inactive_file explosion expected and justified? > > Without revert: > > $ echo 3 | sudo tee /proc/sys/vm/drop_caches; free -m; sudo journalctl -kb >/dev/null; free -m > 3 > total used free shared buff/cache available > Mem: 690 179 536 3 57 510 > Swap: 1379 12 1367 > /* OOM happens here */ > total used free shared buff/cache available > Mem: 690 177 52 3 561 513 > Swap: 1379 17 1362 > > With revert: > > $ echo 3 | sudo tee /proc/sys/vm/drop_caches; free -m; sudo journalctl -kb >/dev/null; free -m > 3 > total used free shared buff/cache available > Mem: 690 214 498 4 64 476 > Swap: 1379 0 1379 > /* no OOM */ > total used free shared buff/cache available > Mem: 690 209 462 4 119 481 > Swap: 1379 0 1379 > > The journal folder size is: > > $ sudo du -hs /var/log/journal > 575M /var/log/journal > > It looks like this readahead change causes far more data to be read than actually needed? For your drive as seen by the VM, what is the value of /sys/block/sdX/queue/optimal_io_size ? I guess it is "0", as I see on my VM. So before 459779d04ae8d, the block device read_ahead_kb was 128KB only, and 459779d04ae8d switched it to be 2 times the max_sectors_kb, so 8MB. 
This change significantly improves file buffered read performance on HDDs, and HDDs only. This means that your VM device is probably being reported as a rotational one (/sys/block/sdX/queue/rotational is 1), which is normal if you attached an actual HDD. If you are using a qcow2 image for that disk, then having rotational==1 is questionable... The other issue is the device driver reporting 0 for the optimal IO size, which normally happens only for SATA drives. I see the same with virtio-scsi, which is also questionable given that the maximum IO size with it is fairly limited. So virtio-scsi may need some tweaking. The other thing to question, I think, is setting read_ahead_kb using the optimal_io_size limit (io_opt), which can be *very large*. For most SCSI devices, it is 16MB, so you will see a read_ahead_kb of 32MB. But for SCSI devices, optimal_io_size indicates a *maximum* IO size beyond which performance may degrade. So using any value lower than this, but still reasonably large, would be better in general, I think. Note that lim->io_opt for RAID arrays actually indicates the stripe size, so generally a lot smaller than the component drives' io_opt. And this use changes the meaning of that queue limit, which makes things even more confusing and finding an adequate default harder. -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 5+ messages in thread
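For reference, a sketch of inspecting the queue limits discussed here and, as a temporary workaround, pinning the read-ahead back to its old default (the device name is an assumption; the values in the comments are the ones reported in this thread):

$ dev=sda
$ cat /sys/block/$dev/queue/rotational        # 1 here, so the device is treated as an HDD
$ cat /sys/block/$dev/queue/optimal_io_size   # 0 here, so io_opt cannot be used
$ cat /sys/block/$dev/queue/max_sectors_kb    # 4096 here, hence roughly 2 * 4096 = 8192 KB of read-ahead
$ cat /sys/block/$dev/queue/read_ahead_kb     # the resulting default
$ echo 128 | sudo tee /sys/block/$dev/queue/read_ahead_kb   # restore the pre-459779d04ae8d value for testing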
* Re: [REGRESSION][BISECTED] Unexpected OOM instead of reclaiming inactive file pages 2025-08-12 0:45 ` Damien Le Moal @ 2025-08-12 7:37 ` Oleksandr Natalenko 0 siblings, 0 replies; 5+ messages in thread From: Oleksandr Natalenko @ 2025-08-12 7:37 UTC (permalink / raw) To: David Rientjes, Damien Le Moal Cc: linux-kernel, linux-block, Jens Axboe, John Garry, Christoph Hellwig, Martin K. Petersen, linux-mm, Lorenzo Stoakes, Shakeel Butt, Qi Zheng, Michal Hocko, David Hildenbrand, Johannes Weiner, Andrew Morton [-- Attachment #1: Type: text/plain, Size: 5467 bytes --] Hello. On úterý 12. srpna 2025 2:45:02, středoevropský letní čas Damien Le Moal wrote: > On 8/12/25 5:42 AM, Oleksandr Natalenko wrote: > > On pondělí 11. srpna 2025 18:06:16, středoevropský letní čas David Rientjes wrote: > >> On Mon, 11 Aug 2025, Oleksandr Natalenko wrote: > >>> I'm fairly confident that the following commit > >>> > >>> 459779d04ae8d block: Improve read ahead size for rotational devices > >>> > >>> caused a regression in my test bench. > >>> > >>> I'm running v6.17-rc1 in a small QEMU VM with virtio-scsi disk. It has got 1 GiB of RAM, so I can saturate it easily causing reclaiming mechanism to kick in. > >>> > >>> If MGLRU is enabled: > >>> > >>> $ echo 1000 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms > >>> > >>> then, once page cache builds up, an OOM happens without reclaiming inactive file pages: [1]. Note that inactive_file:506952kB, I'd expect these to be reclaimed instead, like how it happens with v6.16. > >>> > >>> If MGLRU is disabled: > >>> > >>> $ echo 0 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms > >>> > >>> then OOM doesn't occur, and things seem to work as usual. > >>> > >>> If MGLRU is enabled, and 459779d04ae8d is reverted on top of v6.17-rc1, the OOM doesn't happen either. > >>> > >>> Could you please check this? > >>> > >> > >> This looks to be an MGLRU policy decision rather than a readahead > >> regression, correct? > >> > >> Mem-Info: > >> active_anon:388 inactive_anon:5382 isolated_anon:0 > >> active_file:9638 inactive_file:126738 isolated_file:0 > >> > >> Setting min_ttl_ms to 1000 is preserving the working set and triggering > >> the oom kill is the only alternative to free memory in that configuration. > >> The oom kill is being triggered by kswapd for this purpose. > >> > >> So additional readahead would certainly increase that working set. This > >> looks working as intended. > > > > OK, this makes sense indeed, thanks for the explanation. But is inactive_file explosion expected and justified? > > > > Without revert: > > > > $ echo 3 | sudo tee /proc/sys/vm/drop_caches; free -m; sudo journalctl -kb >/dev/null; free -m > > 3 > > total used free shared buff/cache available > > Mem: 690 179 536 3 57 510 > > Swap: 1379 12 1367 > > /* OOM happens here */ > > total used free shared buff/cache available > > Mem: 690 177 52 3 561 513 > > Swap: 1379 17 1362 > > > > With revert: > > > > $ echo 3 | sudo tee /proc/sys/vm/drop_caches; free -m; sudo journalctl -kb >/dev/null; free -m > > 3 > > total used free shared buff/cache available > > Mem: 690 214 498 4 64 476 > > Swap: 1379 0 1379 > > /* no OOM */ > > total used free shared buff/cache available > > Mem: 690 209 462 4 119 481 > > Swap: 1379 0 1379 > > > > The journal folder size is: > > > > $ sudo du -hs /var/log/journal > > 575M /var/log/journal > > > > It looks like this readahead change causes far more data to be read than actually needed? > > For your drive as seen by the VM, what is the value of > /sys/block/sdX/queue/optimal_io_size ? 
> > I guess it is "0", as I see on my VM. Yes, it's 0. > So before 459779d04ae8d, the block device read_ahead_kb was 128KB only, and > 459779d04ae8d switched it to be 2 times the max_sectors_kb, so 8MB. This change > significantly improves file buffered read performance on HDDs, and HDDs only. Right, max_sectors_kb is 4096. > This means that your VM device is probably being reported as a rotational one > (/sys/block/sdX/queue/rotational is 1), which is normal if you attached an > actual HDD. If you are using a qcow2 image for that disk, then having > rotational==1 is questionable... Yes, it's reported as rotational by default. I've just set -device scsi-hd,drive=hd1,rotation_rate=1 so that the guest will see the drive as non-rotational from now on, which brings the old behaviour back. > The other issue is the device driver for the device reporting 0 for the optimal > IO size, which normally happens only for SATA drives. I see the same with > virtio-scsi, which is also questionable given that the maximum IO size with it > is fairly limited. So virtio-scsi may need some tweaking. > > The other thing to question, I think, is setting read_ahead_kb using the > optimal_io_size limit (io_opt), which can be *very large*. For most SCSI > devices, it is 16MB, so you will see a read_ahead_kb of 32 MB. But for SCSI > devices, optimal_io_size indicates a *maximum* IO size beyond which performance > may degrade. So using any value lower than this, but still reasonably large, > would be better in general I think. Note that lim->io_opt for RAID arrays > actually indicates the stripe size, so generally a lot smaller than the > component drives io_opt. And this use changes the meaning of that queue limit, > which makes things even more confusing and finding an adequate default harder. Thank you for the explanation. -- Oleksandr Natalenko, MSE [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 5+ messages in thread
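For completeness, a hedged example of the host-side workaround mentioned above (drive name, image format and bus layout are illustrative; the relevant part is rotation_rate=1, which makes the emulated disk report itself as non-rotational):

$ qemu-system-x86_64 -m 1024 \
    -device virtio-scsi-pci,id=scsi0 \
    -drive file=disk.qcow2,if=none,id=hd1,format=qcow2 \
    -device scsi-hd,drive=hd1,bus=scsi0.0,rotation_rate=1

Inside the guest, the change can be verified with:

$ cat /sys/block/sda/queue/rotational
0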