From: "Colin King (gmail)" <colin.i.king@gmail.com>
To: Hui Wang <hui.wang@canonical.com>,
	Gao Xiang <hsiangkao@linux.alibaba.com>,
	Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, surenb@google.com,
	shy828301@gmail.com, hannes@cmpxchg.org, vbabka@suse.cz,
	hch@infradead.org, mgorman@suse.de,
	Phillip Lougher <phillip@squashfs.org.uk>
Subject: Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS
Date: Thu, 27 Apr 2023 08:03:08 +0100	[thread overview]
Message-ID: <8bac892e-15e2-e95b-5b2b-0981f1279e4c@gmail.com> (raw)
In-Reply-To: <4aa48b6a-362d-de1b-f0ff-9bb8dafbdcc7@canonical.com>

On 27/04/2023 04:47, Hui Wang wrote:
> 
> On 4/27/23 09:18, Gao Xiang wrote:
>>
>>
>> On 2023/4/26 19:07, Hui Wang wrote:
>>>
>>> On 4/26/23 16:33, Michal Hocko wrote:
>>>> [CC squashfs maintainer]
>>>>
>>>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
>>>>> If we run stress-ng on a squashfs filesystem, the system ends up
>>>>> in a state that looks like a hang: stress-ng cannot finish running
>>>>> and the console stops reacting to user input.
>>>>>
>>>>> This issue happens on all arm/arm64 platforms we are working on.
>>>>> Through debugging, we found that it is caused by the OOM handling
>>>>> in the kernel.
>>>>>
>>>>> fs->readahead() is called between memalloc_nofs_save() and
>>>>> memalloc_nofs_restore(), and squashfs_readahead() calls
>>>>> alloc_page(). In this case, if there is no memory left,
>>>>> out_of_memory() is called without __GFP_FS, so the OOM killer is
>>>>> not triggered and the process loops endlessly, waiting for someone
>>>>> else to trigger the OOM killer and release some memory. On a
>>>>> system whose whole root filesystem is squashfs, nearly all
>>>>> userspace processes call out_of_memory() without __GFP_FS, so the
>>>>> system enters a hang-like state when running stress-ng.
>>>>>
>>>>> To fix it, we could trigger a kthread to call page_alloc() with
>>>>> __GFP_FS before out_of_memory() returns early due to the missing
>>>>> __GFP_FS.
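
For reference, the scoping described above works roughly as sketched below
(a condensed illustration of the readahead path and the NOFS scope API, not
verbatim kernel code):

  /* mm/readahead.c, simplified: the readahead work runs inside a NOFS scope */
  unsigned int nofs = memalloc_nofs_save();     /* sets PF_MEMALLOC_NOFS */
  ...
  aops->readahead(rac);                         /* e.g. squashfs_readahead() */
  ...
  memalloc_nofs_restore(nofs);

  /* include/linux/sched/mm.h, simplified: every allocation made inside the
   * scope has __GFP_FS masked off, which is why out_of_memory() later sees
   * a gfp_mask without __GFP_FS.
   */
  if (current->flags & PF_MEMALLOC_NOFS)
          flags &= ~__GFP_FS;
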
>>>> I do not think this is an appropriate way to deal with this issue.
>>>> Does it even make sense to trigger OOM killer for something like
>>>> readahead? Would it be more mindful to fail the allocation instead?
>>>> That being said, should allocations from squashfs_readahead use
>>>> __GFP_RETRY_MAYFAIL instead?
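
If that route were taken, the change would presumably look something like
the following (a hypothetical illustration only; the actual allocation sites
in fs/squashfs/file.c may differ):

  /* Let the allocation fail under memory pressure instead of retrying
   * forever; readahead is best-effort, so the failure can simply be
   * ignored and the readahead skipped.
   */
  page = alloc_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL);
  if (!page)
          return;
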
>>>
>>> Thanks for your comment. This issue can hardly be reproduced on an 
>>> ext4 filesystem, because ext4->readahead() doesn't call 
>>> alloc_page(). If ext4->readahead() is changed as below, the issue 
>>> becomes easy to reproduce on ext4 as well (repeatedly 
>>> run: $stress-ng --bigheap ${num_of_cpu_threads} --sequential 0 
>>> --timeout 30s --skip-silent --verbose)
>>>
>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>> index ffbbd9626bd8..8b9db0b9d0b8 100644
>>> --- a/fs/ext4/inode.c
>>> +++ b/fs/ext4/inode.c
>>> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, struct folio *folio)
>>>   static void ext4_readahead(struct readahead_control *rac)
>>>   {
>>>          struct inode *inode = rac->mapping->host;
>>> +       struct page *tmp_page;
>>>
>>>          /* If the file has inline data, no need to do readahead. */
>>>          if (ext4_has_inline_data(inode))
>>>                  return;
>>>
>>> +       tmp_page = alloc_page(GFP_KERNEL);
>>> +
>>>          ext4_mpage_readpages(inode, rac, NULL);
>>> +
>>> +       if (tmp_page)
>>> +               __free_page(tmp_page);
>>>   }
>>>
>>
> Hi Xiang and Michal,
>> Is it tested with a pure ext4 without any other fs background?
>>
> Basically yes. Maybe there is a squashfs mounted for python3 in my test 
> environment, but stress-ng and the shared libraries it needs are on the 
> ext4 filesystem.

One could build a static version of stress-ng to remove the need for 
shared library loading at run time:

git clone https://github.com/ColinIanKing/stress-ng
cd stress-ng
make clean
STATIC=1 make -j 8


>> I don't think it's true that "ext4->readahead() doesn't call
>> alloc_page()", since I think even ext2/ext4 uses the buffer-head
>> interfaces to read metadata (extents or the old block mapping)
>> from its bd_inode during readahead, which indirectly allocates
>> some extra pages into the page cache as well.
> 
> Calling alloc_page() or allocating memory in readahead() is not a 
> problem by itself. Suppose we have 4 processes (A, B, C and D). 
> Processes A, B and C enter out_of_memory() because of allocations in 
> readahead(); they loop, waiting for some memory to be released. If 
> process D enters out_of_memory() with __GFP_FS, it can trigger the OOM 
> killer, so A, B and C eventually get their memory and return to 
> readahead(); there is no system hang.
> 
> But if all 4 processes enter out_of_memory() from readahead(), they 
> loop and wait endlessly; there is no process left to trigger the OOM 
> killer, so users perceive the system as hung.
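
The endless looping described here happens in the page allocator slow path;
very roughly paraphrased (a condensed sketch, not verbatim kernel code):

  /* __alloc_pages_slowpath(), heavily condensed */
  retry:
          page = __alloc_pages_direct_reclaim(...);
          if (page)
                  goto got_pg;
          page = __alloc_pages_may_oom(..., &did_some_progress);
          if (page)
                  goto got_pg;
          /* out_of_memory() returned true for the !__GFP_FS request without
           * killing anything, but did_some_progress was still set, so the
           * task loops back and tries again indefinitely.
           */
          if (did_some_progress)
                  goto retry;
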
> 
> I applied my change to ext4->readahead() on linux-next and tested it on 
> my Ubuntu classic server for arm64; I could reproduce the hang issue 
> within 1 minute, with a 100% reproduction rate. I guess it is easy to 
> reproduce because this is an embedded environment: the total number of 
> processes in the system is very limited, nearly all userspace processes 
> eventually reach out_of_memory() from ext4_readahead(), and hardly any 
> kthread reaches out_of_memory() for a long time, which leaves the 
> system in a hang-like state (not a real hang).
> 
> And this is why I wrote a patch that lets a dedicated kthread trigger 
> the OOM killer forcibly (my initial patch).
> 
> 
>>
>> The only difference here is the total number of pages being
>> allocated, but the extra allocations needed for compressed data make
>> things worse. So I think it depends a lot on how stressful your
>> workload is, and I'm not even sure it's a real issue, since if you
>> stop the stress workload the system recovers immediately (it just
>> may not OOM directly).
>>
> Yes, it is not a real hang. All userspace processes are looping, 
> waiting for other processes to release or reclaim memory. And in this 
> case we can't stop the stress workload, because users cannot control 
> the system through the console.
> 
> So Michal,
> 
> I don't know if you have read "[PATCH 0/1] mm/oom_kill: system enters 
> a state something like hang when running stress-ng". Do you know why 
> out_of_memory() returns immediately if __GFP_FS is not set? Could we 
> drop these lines directly:
> 
>      /*
>       * The OOM killer does not compensate for IO-less reclaim.
>       * pagefault_out_of_memory lost its gfp context so we have to
>       * make sure exclude 0 mask - all other users should have at least
>       * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
>       * invoke the OOM killer even if it is a GFP_NOFS allocation.
>       */
>      if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
>          return true;
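
For context, that early return sits near the top of out_of_memory() in
mm/oom_kill.c; condensed roughly (not verbatim kernel code):

  bool out_of_memory(struct oom_control *oc)
  {
          ...
          /* !__GFP_FS (e.g. GFP_NOFS) allocations bail out here: the
           * function reports success to the allocator without ever
           * selecting and killing a victim.
           */
          if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
                  return true;
          ...
          select_bad_process(oc);         /* pick a victim to kill */
          ...
  }
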
> 
> 
> Thanks,
> 
> Hui.
> 
>> Thanks,
>> Gao Xiang




Thread overview: 25+ messages
2023-04-26  5:10 [PATCH 0/1] mm/oom_kill: system enters a state something like hang when running stress-ng Hui Wang
2023-04-26  5:10 ` [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS Hui Wang
2023-04-26  8:33   ` Michal Hocko
2023-04-26 11:07     ` Hui Wang
2023-04-26 16:44       ` Phillip Lougher
2023-04-26 17:38         ` Phillip Lougher
2023-04-26 18:26           ` Yang Shi
2023-04-26 19:06             ` Phillip Lougher
2023-04-26 19:34               ` Phillip Lougher
2023-04-27  0:42                 ` Hui Wang
2023-04-27  1:37                   ` Phillip Lougher
2023-04-27  5:22                     ` Hui Wang
2023-04-27  1:18       ` Gao Xiang
2023-04-27  3:47         ` Hui Wang
2023-04-27  4:17           ` Gao Xiang
2023-04-27  7:03           ` Colin King (gmail) [this message]
2023-04-27  7:49             ` Hui Wang
2023-04-28 19:53           ` Michal Hocko
2023-05-03 11:49             ` Hui Wang
2023-05-03 12:20               ` Michal Hocko
2023-05-03 18:41                 ` Phillip Lougher
2023-05-03 19:10               ` Phillip Lougher
2023-05-03 19:38                 ` Hui Wang
2023-05-07 21:07                 ` Phillip Lougher
2023-05-08 10:05                   ` Hui Wang
