Re: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy

From: jane.chu@oracle.com
To: Jiaqi Yan <jiaqiyan@google.com>
Cc: nao.horiguchi@gmail.com, linmiaohe@huawei.com,
	tony.luck@intel.com, wangkefeng.wang@huawei.com,
	akpm@linux-foundation.org, osalvador@suse.de,
	rientjes@google.com, duenwen@google.com, jthoughton@google.com,
	jgg@nvidia.com, ankita@nvidia.com, peterx@redhat.com,
	linux-mm@kvack.org
Subject: Re: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy
Date: Fri, 11 Oct 2024 11:28:04 -0700
Message-ID: <aa42865e-faab-4199-b80b-8fd15aae3ed7@oracle.com>
In-Reply-To: <CACw3F53CVOVH1NeaAuXeacvgpxVyZ=dfOeacSTX-HLWhPdaHPw@mail.gmail.com>

On 10/10/2024 4:21 PM, Jiaqi Yan wrote:

> On Mon, Oct 7, 2024 at 10:24 AM <jane.chu@oracle.com> wrote:
>> On 10/3/2024 4:51 PM, Jiaqi Yan wrote:
>>> soned page (sub- or huge-) will eventually be isolated, because,
>>> The code here is "global policy". The "per-VMA policy", proposed in
>>> 0/2 but code not sent, should be able to support isolation + offline
>>> at some point (all VMAs are gone and page becomes free).
>> "per-VMA policy" sounds interesting.
>>>> Another thing I'm curious at is whether you have tested with real
>>>> hardware UE - the one that triggers MCE.  When a real UE is consumed by
>>> Yes, with our workload. Can you share more about what is the "training
>>> process"? Is it something to train memory or screen memory errors?
>> The cover letter mentioned "Machine Learning (ML) workloads", so I used
>> it as an example.
> Got you. In that case, if the ML workload (running in a VM) wants to
> do what you described, wouldn't losing 1G hugetlb page due to kernel
> offline make the VM/workload even harder to execute recover logic?

Indeed.

As user applications get more sophisticated at recovering from
poison, what about having the kernel do the heavy lifting?

Something like this: by way of userfaultfd, the kernel provides a
new, clean hugetlb page, copies over the good data from the clean
subpages, and then presents the new hugetlb page to the user process
with an indication that subpage x is a substitute for the poisoned old
subpage x, hence its data might need a refill?  I am not sure exactly
how to pull this off, as the event is not a page fault, but I am just
wondering whether something like this is possible.
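
To make the idea a bit more concrete, here is a rough userspace model
of the copy-and-substitute step. It is only a sketch, not kernel code
and not an existing interface: HugePage, replace_poisoned_hugepage(),
and the two size constants are hypothetical names made up for
illustration, and the userfaultfd notification itself is not modeled.

```python
from dataclasses import dataclass, field

SUBPAGE_SIZE = 4096        # assumed base page size
SUBPAGES_PER_HUGEPAGE = 8  # kept tiny here; a 1G page has 262144 4K subpages


@dataclass
class HugePage:
    """Hypothetical model: a huge page as a list of subpage buffers."""
    subpages: list
    poisoned: set = field(default_factory=set)  # indices of poisoned subpages


def replace_poisoned_hugepage(old):
    """Build a clean replacement for `old`.

    Clean subpages are copied over; each poisoned subpage is replaced by
    a zero-filled substitute. Returns the new page and the list of
    subpage indices whose data the application may need to refill.
    """
    new_subpages = []
    needs_refill = []
    for idx, data in enumerate(old.subpages):
        if idx in old.poisoned:
            # Never read the poisoned data; hand out a fresh zeroed subpage.
            new_subpages.append(bytearray(SUBPAGE_SIZE))
            needs_refill.append(idx)
        else:
            new_subpages.append(bytearray(data))  # copy the good data
    return HugePage(new_subpages), needs_refill
```

In this model the application would get back the new page together
with the list of substituted subpage indices, and would refill only
those subpages from its own source of truth (a checkpoint, for
example).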

thanks,

-jane

>
>> -jane
>>

Thread overview: 21+ messages
2024-09-24  4:39 [RFC PATCH v1 0/2] Userspace Can Control Memory Failure Recovery Jiaqi Yan
2024-09-24  4:39 ` [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy Jiaqi Yan
2024-10-02 23:50   ` jane.chu
2024-10-03 23:51     ` Jiaqi Yan
2024-10-07 17:24       ` jane.chu
2024-10-10 23:21         ` Jiaqi Yan
2024-10-11 18:28           ` jane.chu [this message]
2024-10-11 19:44             ` Luck, Tony
2024-10-11 20:15               ` jane.chu
2024-10-15 23:45             ` Jiaqi Yan
2024-10-15 23:56               ` Luck, Tony
2024-10-16  0:19                 ` jane.chu
2024-10-11  7:04       ` Miaohe Lin
2024-10-15 23:58         ` Jiaqi Yan
2024-09-24  4:39 ` [RFC PATCH v1 2/2] docs: mm: add enable_hard_offline sysctl Jiaqi Yan
2024-10-02 15:02 ` [RFC PATCH v1 0/2] Userspace Can Control Memory Failure Recovery Jason Gunthorpe
2024-10-03 22:45   ` Jiaqi Yan
2024-10-03 22:58     ` Luck, Tony
2024-10-03 23:19       ` Jiaqi Yan
2024-10-03 23:19     ` Jason Gunthorpe
2024-10-04 18:32       ` Jiaqi Yan
