From: Jiaqi Yan
Date: Mon, 27 Oct 2025 21:17:31 -0700
Subject: Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
To: Harry Yoo, William Roche
Cc: Ackerley Tng, jgg@nvidia.com, akpm@linux-foundation.org,
 ankita@nvidia.com, dave.hansen@linux.intel.com, david@redhat.com,
 duenwen@google.com, jane.chu@oracle.com, jthoughton@google.com,
 linmiaohe@huawei.com, linux-fsdevel@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, muchun.song@linux.dev,
 nao.horiguchi@gmail.com, osalvador@suse.de, peterx@redhat.com,
 rientjes@google.com, sidhartha.kumar@oracle.com, tony.luck@intel.com,
 wangkefeng.wang@huawei.com, willy@infradead.org
References: <20250118231549.1652825-1-jiaqiyan@google.com>
 <20250919155832.1084091-1-william.roche@oracle.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

On Wed, Oct 22, 2025 at 6:09 AM Harry Yoo wrote:
>
> On Mon, Oct 13, 2025 at 03:14:32PM -0700, Jiaqi Yan wrote:
> > On Fri, Sep 19, 2025 at 8:58 AM William Roche wrote:
> > >
> > > From: William Roche
> > >
> > > Hello,
> > >
> > > The possibility to keep a VM using large hugetlbfs pages running after a memory
> > > error is very important, and the possibility described here could be a good
> > > candidate to address this issue.
> >
> > Thanks for expressing interest, William, and sorry for getting back to
> > you so late.
> >
> > >
> > > So I would like to provide my feedback after testing this code with the
> > > introduction of persistent errors in the address space: My tests used a VM
> > > running a kernel able to provide MFD_MF_KEEP_UE_MAPPED memfd segments to the
> > > test program provided with this project. But instead of injecting the errors
> > > with madvise calls from this program, I get the guest physical address of a
> > > location and inject the error from the hypervisor into the VM, so that any
> > > subsequent access to the location is prevented directly from the hypervisor
> > > level.
> >
> > This is exactly what VMM should do: when it owns or manages the VM
> > memory with MFD_MF_KEEP_UE_MAPPED, it is then VMM's responsibility to
> > isolate guest/VCPUs from poisoned memory pages, e.g. by intercepting
> > such memory accesses.
> >
> > >
> > > Using this framework, I realized that the code provided here has a problem:
> > > When the error impacts a large folio, the release of this folio doesn't isolate
> > > the sub-page(s) actually impacted by the poison. __rmqueue_pcplist() can return
> > > a known poisoned page to get_page_from_freelist().
> >
> > Just curious, how exactly can you repro this leaking of a known poisoned
> > page? It may help me debug my patch.
> >
> > >
> > > This revealed some mm limitations, as I would have expected that the
> > > check_new_pages() mechanism used by the __rmqueue functions would filter these
> > > pages out, but I noticed that this has been disabled by default in 2023 with:
> > > [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
> > > https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> >
> > Thanks for the reference. I did turn on CONFIG_DEBUG_VM=y during dev
> > and testing but didn't notice any WARNING on "bad page"; it is very
> > likely I was just lucky.
> >
> > >
> > >
> > > This problem seems to be avoided if we call take_page_off_buddy(page) in the
> > > filemap_offline_hwpoison_folio_hugetlb() function without testing if
> > > PageBuddy(page) is true first.
> >
> > Oh, I think you are right, filemap_offline_hwpoison_folio_hugetlb
> > shouldn't make its call to take_page_off_buddy(page) depend on
> > PageBuddy(page). take_page_off_buddy will itself check PageBuddy on the
> > page_head of the different page orders. So maybe somehow a known poisoned
> > page is not taken off from the buddy allocator due to this?
>
> Maybe it's the case where the poisoned page is merged into a larger page,
> and the PGTY_buddy flag is set on the buddy of the poisoned page, so
> PageBuddy() returns false?:
>
>   [ free page A ][ free page B (poisoned) ]
>
> When these two are merged, then we set PGTY_buddy on page A but not on B.

Thanks Harry!

It is indeed this case. I validated it by adding some debug prints in
take_page_off_buddy:

[ 193.029423] Memory failure: 0x2800200: [yjq] PageBuddy=0 after drain_all_pages
[ 193.029426] 0x2800200: [yjq] order=0, page_order=0, PageBuddy(page_head)=0
[ 193.029428] 0x2800200: [yjq] order=1, page_order=0, PageBuddy(page_head)=0
[ 193.029429] 0x2800200: [yjq] order=2, page_order=0, PageBuddy(page_head)=0
[ 193.029430] 0x2800200: [yjq] order=3, page_order=0, PageBuddy(page_head)=0
[ 193.029431] 0x2800200: [yjq] order=4, page_order=0, PageBuddy(page_head)=0
[ 193.029432] 0x2800200: [yjq] order=5, page_order=0, PageBuddy(page_head)=0
[ 193.029434] 0x2800200: [yjq] order=6, page_order=0, PageBuddy(page_head)=0
[ 193.029435] 0x2800200: [yjq] order=7, page_order=0, PageBuddy(page_head)=0
[ 193.029436] 0x2800200: [yjq] order=8, page_order=0, PageBuddy(page_head)=0
[ 193.029437] 0x2800200: [yjq] order=9, page_order=0, PageBuddy(page_head)=0
[ 193.029438] 0x2800200: [yjq] order=10, page_order=10, PageBuddy(page_head)=1

In this case,
page for 0x2800200 is hwpoisoned, and its buddy page is 0x2800000 with order 10.

>
> But even after fixing that we need to fix the race condition.

What exactly is the race condition you are referring to?

>
> > Let me try to fix it in v2, by the end of the week. If you could test
> > with your way of repro as well, that will be very helpful!
> >
> > > But according to me it leaves a (small) race condition where a new page
> > > allocation could get a poisoned sub-page between the dissolve phase and the
> > > attempt to remove it from the buddy allocator.
> > >
> > > I do have the impression that a correct behavior (isolating an impacted
> > > sub-page and remapping the valid memory content) using large pages is
> > > currently only achieved with Transparent Huge Pages.
> > > If performance requires using Hugetlb pages, then maybe we could accept
> > > losing a huge page after a memory impacted MFD_MF_KEEP_UE_MAPPED memfd segment
> > > is released? If it can easily avoid some other corruption.
> > >
> > > I'm very interested in finding an appropriate way to deal with memory errors on
> > > hugetlbfs pages, and willing to help to build a valid solution. This project
> > > showed a real possibility to do so, even in cases where pinned memory is used -
> > > with VFIO for example.
> > >
> > > I would really be interested in knowing your feedback about this project, and
> > > if another solution is considered more adapted to deal with errors on hugetlbfs
> > > pages, please let us know.
> >
> > There is also another possible path if VMM can change to back VM
> > memory with *1G guest_memfd*, which wraps 1G hugetlbfs. In Ackerley's
> > work [1], guest_memfd can split the 1G page for conversions. If we
> > re-use the splitting for memory failure recovery, we can probably
> > achieve something generally similar to THP's memory failure recovery:
> > split 1G to 2M and 4k chunks, then unmap only the poisoned 4k page.
> > We still lose the 1G TLB size so VM may be subject to some performance
> > sacrifice.
> > [1] https://lore.kernel.org/linux-mm/2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com
>
> I want to take a closer look at the actual patches but either way sounds
> good to me.
>
> By the way, please Cc me in future revisions :)

For sure!

>
> Thanks!
>
> --
> Cheers,
> Harry / Hyeonggon
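
For reference, the order walk reported by the debug prints in this message can
be modeled in userspace. This is only a minimal sketch in Python, not kernel
code: head_pfn(), page_buddy(), find_free_block() and the free_blocks
dictionary are hypothetical names made up for illustration, and
MAX_PAGE_ORDER = 10 is assumed from the log. It shows why PageBuddy() is false
on the poisoned page 0x2800200 itself, while checking the aligned block head
at each order finds the merged order-10 free block at 0x2800000.

```python
# Userspace model only, not kernel code. All names here are hypothetical.

MAX_PAGE_ORDER = 10  # assumed from the log, which walks order 0..10


def head_pfn(pfn, order):
    """First PFN of the naturally aligned 2**order block that contains pfn."""
    return pfn & ~((1 << order) - 1)


def page_buddy(pfn, free_blocks):
    """Model of PageBuddy(): only the head page of a free block has the flag."""
    return pfn in free_blocks


def find_free_block(pfn, free_blocks):
    """Walk orders 0..MAX_PAGE_ORDER; return (head, order) of the free block
    covering pfn, or None if no such block is found.

    free_blocks maps a head PFN to the order of the free block starting there.
    """
    for order in range(MAX_PAGE_ORDER + 1):
        head = head_pfn(pfn, order)
        block_order = free_blocks.get(head)
        # The head must be marked free and its block must be big enough
        # to actually cover pfn.
        if block_order is not None and block_order >= order:
            return head, block_order
    return None


free_blocks = {0x2800000: 10}  # one merged order-10 free block, as in the log
poisoned = 0x2800200

head, order = find_free_block(poisoned, free_blocks)
print(page_buddy(poisoned, free_blocks))  # False
print(f"{head:#x} order={order}")         # 0x2800000 order=10
```

Because 0x2800200 is aligned to 512 pages, the candidate head for orders 0
through 9 is the poisoned page itself, which carries no buddy flag; only at
order 10 does the head move to 0x2800000, matching the last line of the log.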