linux-mm.kvack.org archive mirror
* OOMs on PS3 since kernel 6.9-rc4
       [not found] <7CE7C8BC-D728-4A10-BD8F-15293D7CF312.ref@yahoo.com>
@ 2024-09-24 20:52 ` Damian Dudycz
  2024-09-25 17:20   ` Johannes Weiner
  0 siblings, 1 reply; 5+ messages in thread
From: Damian Dudycz @ 2024-09-24 20:52 UTC (permalink / raw)
  To: linux-mm; +Cc: sam, holger, kernel, hannes

[-- Attachment #1: Type: text/plain, Size: 2480 bytes --]

I'm running Gentoo on the PlayStation 3 console (PPC64BE CPU), using the
OtherOS++ custom firmware feature.

After upgrading from 6.6 to 6.10, I noticed OOM kills during long, intensive
tasks such as compiling code or extracting a large archive.

The OOM usually occurs after about 10-20 minutes of, for example, compiling
the gentoo-kernel package.

This system has a limited amount of RAM (256MB), and another 256MB of VRAM
can be used as a fast swap device. Besides that, a standard 4GB swap partition
is also enabled. I bisected with vanilla upstream sources, apart from a few
irrelevant patches mentioned at the end.

After bisecting, I found that the issue first appeared in commit
c0cd6f557b9090525d288806cccbc73440ac235a (build 6.9.0-rc4-test)
(titled: "page_alloc: fix freelist movement during block conversion").
https://github.com/torvalds/linux/commit/c0cd6f557b9090525d288806cccbc73440ac235a

Unfortunately, it doesn't revert cleanly on 6.11, so I couldn't test that.

# Files and directories:
- patches: the patches applied to the kernel when preparing a test build. These work with version 6.9.
- config: the kernel config used
- bisect.txt: log from the bisecting process
- dmesg.txt: log from dmesg after the issue occurred
- c0cd6f557b9090525d288806cccbc73440ac235a.patch: diff of the commit that introduced the issue
- proc: a collection of files from /proc, taken before, during and after the test ("during" snapshots were always taken 5 minutes after the test started).
	- 6.9.0-rc3-test-dirty: working version, the issue did not occur.
	- 6.9.0-rc4-test-00116-gc0cd6f557b90-dirty: the commit that introduced the issue, OOM occurred.
	- 6.11.0-test-dirty: newer kernel version, OOM still occurred.

# Patches:
In order for the kernel to work on the PS3 using OtherOS++, some patches are
required. During testing I reduced them to only the ones that are essential to
boot correctly. The patches I used are in the "patches" directory.

These are used mainly to let Linux use the disk regions reserved for Linux. I
doubt they have any impact on the issue, but I'm including them in case this
needs verification.

There are also 2 disabled patches related to page allocation that I have left
there. They were not used in the tests, as they don't affect the result in this
situation; I'm leaving them just in case.

The logs and files mentioned above are in the attached tarball.


[-- Attachment #2: files-linux-6.9.0-rc4-test.tar.xz --]
[-- Type: application/x-xz, Size: 308736 bytes --]

* Re: OOMs on PS3 since kernel 6.9-rc4
  2024-09-24 20:52 ` OOMs on PS3 since kernel 6.9-rc4 Damian Dudycz
@ 2024-09-25 17:20   ` Johannes Weiner
  2024-09-25 17:43     ` Damian Dudycz
  0 siblings, 1 reply; 5+ messages in thread
From: Johannes Weiner @ 2024-09-25 17:20 UTC (permalink / raw)
  To: Damian Dudycz; +Cc: linux-mm, sam, holger, kernel, Michal Hocko

Hi Damian,

On Tue, Sep 24, 2024 at 10:52:28PM +0200, Damian Dudycz wrote:
> I'm running Gentoo on the PlayStation 3 console (PPC64BE CPU), using custom
> firmware (OtherOS++) feature.
> 
> Upgrading from 6.6 to 6.10, I have noticed that OOM kills started during long
> and intense processes, like compiling code or extracting a large archive.
> 
> The OOM usually occurs after about 10-20 minutes of for example
> compiling the gentoo-kernel package.

Thanks for your excellent and detailed report, and sorry about the
breakage.

While going through the dmesg, I'm noticing the following:

[  719.989545] configure invoked oom-killer: gfp_mask=0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), order=2, oom_score_adj=0
[  719.989607] COMPACTION is disabled!!!
[  719.989633] CPU: 1 PID: 4701 Comm: configure Not tainted 6.9.0-rc4-test-00116-gc0cd6f557b90-dirty #1
[  719.989665] Hardware name: SonyPS3 Cell Broadband Engine 0x702100 PS3
[  719.989688] Call Trace:
[  719.989708] [c00000000a5834a0] [c000000000662e9c] .dump_stack_lvl+0xb0/0x100 (unreliable)
[  719.989777] [c00000000a583530] [c00000000013e43c] .dump_header+0x5c/0x414
[  719.989835] [c00000000a583600] [c00000000013ec38] .oom_kill_process+0xcc/0x598
[  719.989888] [c00000000a5836f0] [c00000000013f6f0] .out_of_memory+0x3d0/0x3f0
[  719.989939] [c00000000a5837a0] [c00000000018f28c] .__alloc_pages_slowpath.constprop.0+0x540/0x6b0
[  719.989987] [c00000000a5838f0] [c00000000018f4f4] .__alloc_pages_noprof+0xf8/0x1c0
[  719.990031] [c00000000a5839c0] [c0000000000505d0] .copy_process+0x1d4/0x1bf0
[  719.990085] [c00000000a583b40] [c000000000052144] .kernel_clone+0xcc/0x3f0
[  719.990136] [c00000000a583c50] [c0000000000524d4] .__do_sys_clone+0x6c/0x90
[  719.990188] [c00000000a583d80] [c00000000001f600] .system_call_exception+0x1f4/0x260
[  719.990246] [c00000000a583e10] [c00000000000b2d4] system_call_common+0xf4/0x258

This is clone() trying to allocate a thread stack, which is a request
for 4 physically contiguous pages (order=2 -> 2^2 pages).

The second line warns that you don't have CONFIG_COMPACTION enabled,
which is the kernel's facility to assemble such contiguous page
blocks. (God bless you, Michal Hocko, for adding this warning.)

This is not a common configuration anymore, as we have since removed
various other mechanisms from the MM code that supported higher-order
allocations. So I think you may have gotten lucky in the past.

Can you please try with CONFIG_COMPACTION=y?

[ I think what likely happened is that, before my patch, an unmovable
  request falling back to a movable block would have stolen the rest
  of its free pages even if it hadn't claimed the block as unmovable.
  Now it doesn't anymore, and the block, already dominated by cache
  and anon, will continue to fill up with cache and anon. Not an issue
  with compaction - and better for long-term defragmentation
  prospects; but without compaction, you just get a bit less lucky
  specifically with those higher-order kernel requests. ]

Thanks


* Re: OOMs on PS3 since kernel 6.9-rc4
  2024-09-25 17:20   ` Johannes Weiner
@ 2024-09-25 17:43     ` Damian Dudycz
  2024-09-26  7:00       ` Damian Dudycz
  0 siblings, 1 reply; 5+ messages in thread
From: Damian Dudycz @ 2024-09-25 17:43 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, sam, holger, kernel, Michal Hocko

Thank you for the response, Johannes.

I'll test this and get back to you with the results.
I should also mention that Holger suggested enabling LRU, and this also seems
to help with this issue, but I still thought I should report it for the case
where it's not enabled.
I'll see whether compaction helps and let you know.

Regards, 
Damian.


* Re: OOMs on PS3 since kernel 6.9-rc4
  2024-09-25 17:43     ` Damian Dudycz
@ 2024-09-26  7:00       ` Damian Dudycz
  2024-09-26 10:34         ` Johannes Weiner
  0 siblings, 1 reply; 5+ messages in thread
From: Damian Dudycz @ 2024-09-26  7:00 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, sam, holger, kernel, Michal Hocko

Johannes,

I have tested this with compaction enabled, and it seems to be working fine now.
I think, in that case, it should be enabled in ps3_defconfig by default.

As for not having compaction in previous versions: I have been using this setup
for a long time, and I'm pretty sure it used to work fine without it. Still, I
understand that it should have been enabled; I'm just mentioning that it really
did work without it before that version.

I'll let the ps3_defconfig maintainer know that compaction is missing from
ps3_defconfig, or send a patch for that config myself.

Thank you all for your help with this.

Regards,
Damian.


* Re: OOMs on PS3 since kernel 6.9-rc4
  2024-09-26  7:00       ` Damian Dudycz
@ 2024-09-26 10:34         ` Johannes Weiner
  0 siblings, 0 replies; 5+ messages in thread
From: Johannes Weiner @ 2024-09-26 10:34 UTC (permalink / raw)
  To: Damian Dudycz; +Cc: linux-mm, sam, holger, kernel, Michal Hocko

Hello Damian,

On Thu, Sep 26, 2024 at 09:00:25AM +0200, Damian Dudycz wrote:
> Johannes,
> 
> I have tested this with compaction enabled and it seems to be working fine now.
> I think, in that case, this should be enabled in ps3_defconfig by default.

I'm glad to hear it's working again!

> As for not having compaction in previous versions - I have been using this for pretty long
> time and Im pretty sure it used to work fine without it. Still I understand, that it should have been
> used, just mentioning that it really did work without this before that version.

Yes, it's a real regression, and I believe you that it worked until
now. My comment about luck was more in reference to the level of
support, testing and attention this configuration gets:

config COMPACTION
        bool "Allow for memory compaction"
        default y
        select MIGRATION
        depends on MMU
        help
          Compaction is the only memory management component to form
          high order (larger physically contiguous) memory blocks
          reliably. The page allocator relies on compaction heavily and
          the lack of the feature can lead to unexpected OOM killer
          invocations for high order memory requests. You shouldn't
          disable this option unless there really is a strong reason for
          it and then we would be really interested to hear about that at
          linux-mm@kvack.org.

So I definitely agree that the ps3_defconfig should be fixed.

> I’ll let ps3_defconfig maintainer know about compaction missing in ps3_defconfig
> or send patch for that config myself.

Thanks, yes this makes sense. This should be a good list of pointers:

hannes@column ~/src/linux/linux $ ./scripts/get_maintainer.pl -f arch/powerpc/configs/ps3_defconfig 
Michael Ellerman <mpe@ellerman.id.au> (supporter:LINUX FOR POWERPC (32-BIT AND 64-BIT),commit_signer:2/2=100%)
Nicholas Piggin <npiggin@gmail.com> (reviewer:LINUX FOR POWERPC (32-BIT AND 64-BIT))
Christophe Leroy <christophe.leroy@csgroup.eu> (reviewer:LINUX FOR POWERPC (32-BIT AND 64-BIT))
Naveen N Rao <naveen@kernel.org> (reviewer:LINUX FOR POWERPC (32-BIT AND 64-BIT))
Geoff Levand <geoff@infradead.org> (commit_signer:2/2=100%,authored:2/2=100%,added_lines:1/1=100%,removed_lines:1/1=100%)
linuxppc-dev@lists.ozlabs.org (open list:LINUX FOR POWERPC (32-BIT AND 64-BIT))
linux-kernel@vger.kernel.org (open list)

Johannes

