linux-mm.kvack.org archive mirror
* [REGRESSION][BISECTED] Unexpected OOM instead of reclaiming inactive file pages
@ 2025-08-11 15:48 Oleksandr Natalenko
  2025-08-11 16:06 ` David Rientjes
  0 siblings, 1 reply; 5+ messages in thread
From: Oleksandr Natalenko @ 2025-08-11 15:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-block, Jens Axboe, Damien Le Moal, John Garry,
	Christoph Hellwig, Martin K. Petersen, linux-mm, Lorenzo Stoakes,
	Shakeel Butt, Qi Zheng, Michal Hocko, David Hildenbrand,
	Johannes Weiner, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 969 bytes --]

Hello Damien.

I'm fairly confident that the following commit

459779d04ae8d block: Improve read ahead size for rotational devices

caused a regression in my test bench.

I'm running v6.17-rc1 in a small QEMU VM with a virtio-scsi disk. It has 1 GiB of RAM, so I can easily saturate it, causing the reclaim mechanism to kick in.

If MGLRU is enabled:

$ echo 1000 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms

then, once the page cache builds up, an OOM happens without reclaiming inactive file pages: [1]. Note that inactive_file is 506952kB; I'd expect these pages to be reclaimed instead, as happens with v6.16.

If MGLRU is disabled:

$ echo 0 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms

then OOM doesn't occur, and things seem to work as usual.

If MGLRU is enabled, and 459779d04ae8d is reverted on top of v6.17-rc1, the OOM doesn't happen either.
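
For reference, the readahead-related queue limits of the test disk can be dumped like this (a minimal sketch; I'm assuming the virtio-scsi disk shows up as sda, adjust the device name as needed):

$ grep . /sys/block/sda/queue/{read_ahead_kb,max_sectors_kb,optimal_io_size,rotational}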

Could you please check this?

Thank you.

-- 
Oleksandr Natalenko, MSE

[1]: https://paste.voidband.net/TG5OiZ29.log

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [REGRESSION][BISECTED] Unexpected OOM instead of reclaiming inactive file pages
  2025-08-11 15:48 [REGRESSION][BISECTED] Unexpected OOM instead of reclaiming inactive file pages Oleksandr Natalenko
@ 2025-08-11 16:06 ` David Rientjes
  2025-08-11 20:42   ` Oleksandr Natalenko
  0 siblings, 1 reply; 5+ messages in thread
From: David Rientjes @ 2025-08-11 16:06 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: linux-kernel, linux-block, Jens Axboe, Damien Le Moal, John Garry,
	Christoph Hellwig, Martin K. Petersen, linux-mm, Lorenzo Stoakes,
	Shakeel Butt, Qi Zheng, Michal Hocko, David Hildenbrand,
	Johannes Weiner, Andrew Morton

On Mon, 11 Aug 2025, Oleksandr Natalenko wrote:

> Hello Damien.
> 
> I'm fairly confident that the following commit
> 
> 459779d04ae8d block: Improve read ahead size for rotational devices
> 
> caused a regression in my test bench.
> 
> I'm running v6.17-rc1 in a small QEMU VM with a virtio-scsi disk. It has 1 GiB of RAM, so I can easily saturate it, causing the reclaim mechanism to kick in.
> 
> If MGLRU is enabled:
> 
> $ echo 1000 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms
> 
> then, once the page cache builds up, an OOM happens without reclaiming inactive file pages: [1]. Note that inactive_file is 506952kB; I'd expect these pages to be reclaimed instead, as happens with v6.16.
> 
> If MGLRU is disabled:
> 
> $ echo 0 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms
> 
> then OOM doesn't occur, and things seem to work as usual.
> 
> If MGLRU is enabled, and 459779d04ae8d is reverted on top of v6.17-rc1, the OOM doesn't happen either.
> 
> Could you please check this?
> 

This looks to be an MGLRU policy decision rather than a readahead 
regression, correct?

Mem-Info:
active_anon:388 inactive_anon:5382 isolated_anon:0
 active_file:9638 inactive_file:126738 isolated_file:0

Setting min_ttl_ms to 1000 preserves the working set, and triggering the
oom kill is the only remaining way to free memory in that configuration.
The oom kill is triggered by kswapd for this purpose.

So additional readahead would certainly increase that working set.  This
looks to be working as intended.
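
The MGLRU knobs involved can be inspected directly if you want to confirm
this (a small sketch; these are the standard lru_gen sysfs files):

$ cat /sys/kernel/mm/lru_gen/enabled      # feature bitmask, e.g. 0x0007
$ cat /sys/kernel/mm/lru_gen/min_ttl_ms   # working-set protection window in ms

With min_ttl_ms back at 0 (the default), kswapd no longer has to fall back
to the oom killer to honor that TTL.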


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [REGRESSION][BISECTED] Unexpected OOM instead of reclaiming inactive file pages
  2025-08-11 16:06 ` David Rientjes
@ 2025-08-11 20:42   ` Oleksandr Natalenko
  2025-08-12  0:45     ` Damien Le Moal
  0 siblings, 1 reply; 5+ messages in thread
From: Oleksandr Natalenko @ 2025-08-11 20:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-kernel, linux-block, Jens Axboe, Damien Le Moal, John Garry,
	Christoph Hellwig, Martin K. Petersen, linux-mm, Lorenzo Stoakes,
	Shakeel Butt, Qi Zheng, Michal Hocko, David Hildenbrand,
	Johannes Weiner, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 3175 bytes --]

Hello.

On Monday, 11 August 2025 at 18:06:16 Central European Summer Time, David Rientjes wrote:
> On Mon, 11 Aug 2025, Oleksandr Natalenko wrote:
> 
> > Hello Damien.
> > 
> > I'm fairly confident that the following commit
> > 
> > 459779d04ae8d block: Improve read ahead size for rotational devices
> > 
> > caused a regression in my test bench.
> > 
> > I'm running v6.17-rc1 in a small QEMU VM with a virtio-scsi disk. It has 1 GiB of RAM, so I can easily saturate it, causing the reclaim mechanism to kick in.
> > 
> > If MGLRU is enabled:
> > 
> > $ echo 1000 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms
> > 
> > then, once the page cache builds up, an OOM happens without reclaiming inactive file pages: [1]. Note that inactive_file is 506952kB; I'd expect these pages to be reclaimed instead, as happens with v6.16.
> > 
> > If MGLRU is disabled:
> > 
> > $ echo 0 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms
> > 
> > then OOM doesn't occur, and things seem to work as usual.
> > 
> > If MGLRU is enabled, and 459779d04ae8d is reverted on top of v6.17-rc1, the OOM doesn't happen either.
> > 
> > Could you please check this?
> > 
> 
> This looks to be an MGLRU policy decision rather than a readahead 
> regression, correct?
> 
> Mem-Info:
> active_anon:388 inactive_anon:5382 isolated_anon:0
>  active_file:9638 inactive_file:126738 isolated_file:0
> 
> Setting min_ttl_ms to 1000 preserves the working set, and triggering the
> oom kill is the only remaining way to free memory in that configuration.
> The oom kill is triggered by kswapd for this purpose.
> 
> So additional readahead would certainly increase that working set.  This
> looks to be working as intended.

OK, this does make sense, thanks for the explanation. But is the inactive_file explosion expected and justified?

Without revert:

$ echo 3 | sudo tee /proc/sys/vm/drop_caches; free -m; sudo journalctl -kb >/dev/null; free -m
3
               total        used        free      shared  buff/cache   available
Mem:             690         179         536           3          57         510
Swap:           1379          12        1367
/* OOM happens here */
               total        used        free      shared  buff/cache   available
Mem:             690         177          52           3         561         513
Swap:           1379          17        1362 

With revert:

$ echo 3 | sudo tee /proc/sys/vm/drop_caches; free -m; sudo journalctl -kb >/dev/null; free -m
3
               total        used        free      shared  buff/cache   available
Mem:             690         214         498           4          64         476
Swap:           1379           0        1379
/* no OOM */
               total        used        free      shared  buff/cache   available
Mem:             690         209         462           4         119         481
Swap:           1379           0        1379

The journal folder size is:

$ sudo du -hs /var/log/journal
575M    /var/log/journal

It looks like this readahead change causes far more data to be read than is actually needed?
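
One way to quantify that would be to compare the sectors actually read from
the device around the journalctl run (a rough sketch, assuming the disk is
sda; field 6 of /proc/diskstats is sectors read, 512 bytes each):

$ before=$(awk '$3 == "sda" { print $6 }' /proc/diskstats)
$ sudo journalctl -kb >/dev/null
$ after=$(awk '$3 == "sda" { print $6 }' /proc/diskstats)
$ echo "$(( (after - before) * 512 / 1024 / 1024 )) MiB read"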

-- 
Oleksandr Natalenko, MSE

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [REGRESSION][BISECTED] Unexpected OOM instead of reclaiming inactive file pages
  2025-08-11 20:42   ` Oleksandr Natalenko
@ 2025-08-12  0:45     ` Damien Le Moal
  2025-08-12  7:37       ` Oleksandr Natalenko
  0 siblings, 1 reply; 5+ messages in thread
From: Damien Le Moal @ 2025-08-12  0:45 UTC (permalink / raw)
  To: Oleksandr Natalenko, David Rientjes
  Cc: linux-kernel, linux-block, Jens Axboe, John Garry,
	Christoph Hellwig, Martin K. Petersen, linux-mm, Lorenzo Stoakes,
	Shakeel Butt, Qi Zheng, Michal Hocko, David Hildenbrand,
	Johannes Weiner, Andrew Morton

On 8/12/25 5:42 AM, Oleksandr Natalenko wrote:
> Hello.
> 
> On Monday, 11 August 2025 at 18:06:16 Central European Summer Time, David Rientjes wrote:
>> On Mon, 11 Aug 2025, Oleksandr Natalenko wrote:
>>
>>> Hello Damien.
>>>
>>> I'm fairly confident that the following commit
>>>
>>> 459779d04ae8d block: Improve read ahead size for rotational devices
>>>
>>> caused a regression in my test bench.
>>>
>>> I'm running v6.17-rc1 in a small QEMU VM with a virtio-scsi disk. It has 1 GiB of RAM, so I can easily saturate it, causing the reclaim mechanism to kick in.
>>>
>>> If MGLRU is enabled:
>>>
>>> $ echo 1000 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms
>>>
>>> then, once the page cache builds up, an OOM happens without reclaiming inactive file pages: [1]. Note that inactive_file is 506952kB; I'd expect these pages to be reclaimed instead, as happens with v6.16.
>>>
>>> If MGLRU is disabled:
>>>
>>> $ echo 0 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms
>>>
>>> then OOM doesn't occur, and things seem to work as usual.
>>>
>>> If MGLRU is enabled, and 459779d04ae8d is reverted on top of v6.17-rc1, the OOM doesn't happen either.
>>>
>>> Could you please check this?
>>>
>>
>> This looks to be an MGLRU policy decision rather than a readahead 
>> regression, correct?
>>
>> Mem-Info:
>> active_anon:388 inactive_anon:5382 isolated_anon:0
>>  active_file:9638 inactive_file:126738 isolated_file:0
>>
>> Setting min_ttl_ms to 1000 preserves the working set, and triggering the
>> oom kill is the only remaining way to free memory in that configuration.
>> The oom kill is triggered by kswapd for this purpose.
>>
>> So additional readahead would certainly increase that working set.  This
>> looks to be working as intended.
> 
> OK, this does make sense, thanks for the explanation. But is the inactive_file explosion expected and justified?
> 
> Without revert:
> 
> $ echo 3 | sudo tee /proc/sys/vm/drop_caches; free -m; sudo journalctl -kb >/dev/null; free -m
> 3
>                total        used        free      shared  buff/cache   available
> Mem:             690         179         536           3          57         510
> Swap:           1379          12        1367
> /* OOM happens here */
>                total        used        free      shared  buff/cache   available
> Mem:             690         177          52           3         561         513
> Swap:           1379          17        1362 
> 
> With revert:
> 
> $ echo 3 | sudo tee /proc/sys/vm/drop_caches; free -m; sudo journalctl -kb >/dev/null; free -m
> 3
>                total        used        free      shared  buff/cache   available
> Mem:             690         214         498           4          64         476
> Swap:           1379           0        1379
> /* no OOM */
>                total        used        free      shared  buff/cache   available
> Mem:             690         209         462           4         119         481
> Swap:           1379           0        1379
> 
> The journal folder size is:
> 
> $ sudo du -hs /var/log/journal
> 575M    /var/log/journal
> 
> It looks like this readahead change causes far more data to be read than is actually needed?

For your drive as seen by the VM, what is the value of
/sys/block/sdX/queue/optimal_io_size ?

I guess it is "0", as I see on my VM.
So before 459779d04ae8d, the block device read_ahead_kb was only 128 KB, and
459779d04ae8d switched it to twice max_sectors_kb, so 8 MB. This change
significantly improves buffered file read performance on HDDs, and HDDs only.
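
You can confirm this on your VM, and also restore the previous default at
runtime without reverting the commit (a sketch, assuming the disk is sda):

$ cat /sys/block/sda/queue/max_sectors_kb                  # likely 4096 here
$ cat /sys/block/sda/queue/read_ahead_kb                   # 2 x max_sectors_kb with 459779d04ae8d
$ echo 128 | sudo tee /sys/block/sda/queue/read_ahead_kb   # old 128 KB default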

This means that your VM device is probably being reported as a rotational one
(/sys/block/sdX/queue/rotational is 1), which is normal if you attached an
actual HDD. If you are using a qcow2 image for that disk, then having
rotational==1 is questionable...

The other issue is the device driver reporting 0 for the optimal IO size,
which normally happens only for SATA drives. I see the same with virtio-scsi,
which is also questionable given that the maximum IO size with it is fairly
limited. So virtio-scsi may need some tweaking.

The other thing to question, I think, is setting read_ahead_kb from the
optimal_io_size limit (io_opt), which can be *very large*. For most SCSI
devices it is 16 MB, so you would see a read_ahead_kb of 32 MB. But for SCSI
devices, optimal_io_size indicates a *maximum* IO size beyond which performance
may degrade. So using a value lower than this, but still reasonably large,
would be better in general, I think. Note that lim->io_opt for RAID arrays
actually indicates the stripe size, which is generally a lot smaller than the
component drives' io_opt. This use also changes the meaning of that queue
limit, which makes things even more confusing and finding an adequate default
harder.
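
To make the sizing concrete, here is a rough reconstruction of the policy as
described above (a sketch only, not the actual code in 459779d04ae8d; the
device name and the non-rotational fallback are assumptions):

# query the queue limits the readahead sizing is based on
rot=$(cat /sys/block/sda/queue/rotational)
io_opt=$(cat /sys/block/sda/queue/optimal_io_size)   # bytes, 0 if unreported
max_kb=$(cat /sys/block/sda/queue/max_sectors_kb)    # KB
if [ "$rot" -eq 1 ]; then
        if [ "$io_opt" -gt 0 ]; then
                ra_kb=$(( 2 * io_opt / 1024 ))       # e.g. 32 MB when io_opt is 16 MB
        else
                ra_kb=$(( 2 * max_kb ))              # e.g. 8 MB when max_sectors_kb is 4096
        fi
else
        ra_kb=128                                    # old default kept for non-rotational
fi
echo "read_ahead_kb would be $ra_kb"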

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [REGRESSION][BISECTED] Unexpected OOM instead of reclaiming inactive file pages
  2025-08-12  0:45     ` Damien Le Moal
@ 2025-08-12  7:37       ` Oleksandr Natalenko
  0 siblings, 0 replies; 5+ messages in thread
From: Oleksandr Natalenko @ 2025-08-12  7:37 UTC (permalink / raw)
  To: David Rientjes, Damien Le Moal
  Cc: linux-kernel, linux-block, Jens Axboe, John Garry,
	Christoph Hellwig, Martin K. Petersen, linux-mm, Lorenzo Stoakes,
	Shakeel Butt, Qi Zheng, Michal Hocko, David Hildenbrand,
	Johannes Weiner, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 5467 bytes --]

Hello.

On Tuesday, 12 August 2025 at 2:45:02 Central European Summer Time, Damien Le Moal wrote:
> On 8/12/25 5:42 AM, Oleksandr Natalenko wrote:
> > On Monday, 11 August 2025 at 18:06:16 Central European Summer Time, David Rientjes wrote:
> >> On Mon, 11 Aug 2025, Oleksandr Natalenko wrote:
> >>> I'm fairly confident that the following commit
> >>>
> >>> 459779d04ae8d block: Improve read ahead size for rotational devices
> >>>
> >>> caused a regression in my test bench.
> >>>
> >>> I'm running v6.17-rc1 in a small QEMU VM with a virtio-scsi disk. It has 1 GiB of RAM, so I can easily saturate it, causing the reclaim mechanism to kick in.
> >>>
> >>> If MGLRU is enabled:
> >>>
> >>> $ echo 1000 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms
> >>>
> >>> then, once the page cache builds up, an OOM happens without reclaiming inactive file pages: [1]. Note that inactive_file is 506952kB; I'd expect these pages to be reclaimed instead, as happens with v6.16.
> >>>
> >>> If MGLRU is disabled:
> >>>
> >>> $ echo 0 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms
> >>>
> >>> then OOM doesn't occur, and things seem to work as usual.
> >>>
> >>> If MGLRU is enabled, and 459779d04ae8d is reverted on top of v6.17-rc1, the OOM doesn't happen either.
> >>>
> >>> Could you please check this?
> >>>
> >>
> >> This looks to be an MGLRU policy decision rather than a readahead 
> >> regression, correct?
> >>
> >> Mem-Info:
> >> active_anon:388 inactive_anon:5382 isolated_anon:0
> >>  active_file:9638 inactive_file:126738 isolated_file:0
> >>
> >> Setting min_ttl_ms to 1000 preserves the working set, and triggering the
> >> oom kill is the only remaining way to free memory in that configuration.
> >> The oom kill is triggered by kswapd for this purpose.
> >>
> >> So additional readahead would certainly increase that working set.  This
> >> looks to be working as intended.
> > 
> > OK, this does make sense, thanks for the explanation. But is the inactive_file explosion expected and justified?
> > 
> > Without revert:
> > 
> > $ echo 3 | sudo tee /proc/sys/vm/drop_caches; free -m; sudo journalctl -kb >/dev/null; free -m
> > 3
> >                total        used        free      shared  buff/cache   available
> > Mem:             690         179         536           3          57         510
> > Swap:           1379          12        1367
> > /* OOM happens here */
> >                total        used        free      shared  buff/cache   available
> > Mem:             690         177          52           3         561         513
> > Swap:           1379          17        1362 
> > 
> > With revert:
> > 
> > $ echo 3 | sudo tee /proc/sys/vm/drop_caches; free -m; sudo journalctl -kb >/dev/null; free -m
> > 3
> >                total        used        free      shared  buff/cache   available
> > Mem:             690         214         498           4          64         476
> > Swap:           1379           0        1379
> > /* no OOM */
> >                total        used        free      shared  buff/cache   available
> > Mem:             690         209         462           4         119         481
> > Swap:           1379           0        1379
> > 
> > The journal folder size is:
> > 
> > $ sudo du -hs /var/log/journal
> > 575M    /var/log/journal
> > 
> > It looks like this readahead change causes far more data to be read than is actually needed?
> 
> For your drive as seen by the VM, what is the value of
> /sys/block/sdX/queue/optimal_io_size ?
> 
> I guess it is "0", as I see on my VM.

Yes, it's 0.

> So before 459779d04ae8d, the block device read_ahead_kb was only 128 KB, and
> 459779d04ae8d switched it to twice max_sectors_kb, so 8 MB. This change
> significantly improves buffered file read performance on HDDs, and HDDs only.

Right, max_sectors_kb is 4096.

> This means that your VM device is probably being reported as a rotational one
> (/sys/block/sdX/queue/rotational is 1), which is normal if you attached an
> actual HDD. If you are using a qcow2 image for that disk, then having
> rotational==1 is questionable...

Yes, it's reported as rotational by default.

I've just set -device scsi-hd,drive=hd1,rotation_rate=1 so that the guest will see the drive as non-rotational from now on, which brings the old behaviour back.
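
For completeness, the relevant part of the QEMU command line now looks
roughly like this (a sketch; the controller id, image file and format are
placeholders, not my exact invocation):

-device virtio-scsi-pci,id=scsi0 \
-drive if=none,id=hd1,file=disk.qcow2,format=qcow2 \
-device scsi-hd,drive=hd1,bus=scsi0.0,rotation_rate=1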

> The other issue is the device driver reporting 0 for the optimal IO size,
> which normally happens only for SATA drives. I see the same with virtio-scsi,
> which is also questionable given that the maximum IO size with it is fairly
> limited. So virtio-scsi may need some tweaking.
> 
> The other thing to question, I think, is setting read_ahead_kb from the
> optimal_io_size limit (io_opt), which can be *very large*. For most SCSI
> devices it is 16 MB, so you would see a read_ahead_kb of 32 MB. But for SCSI
> devices, optimal_io_size indicates a *maximum* IO size beyond which performance
> may degrade. So using a value lower than this, but still reasonably large,
> would be better in general, I think. Note that lim->io_opt for RAID arrays
> actually indicates the stripe size, which is generally a lot smaller than the
> component drives' io_opt. This use also changes the meaning of that queue
> limit, which makes things even more confusing and finding an adequate default
> harder.

Thank you for the explanation.

-- 
Oleksandr Natalenko, MSE

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-08-12  7:37 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-11 15:48 [REGRESSION][BISECTED] Unexpected OOM instead of reclaiming inactive file pages Oleksandr Natalenko
2025-08-11 16:06 ` David Rientjes
2025-08-11 20:42   ` Oleksandr Natalenko
2025-08-12  0:45     ` Damien Le Moal
2025-08-12  7:37       ` Oleksandr Natalenko
