From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 17 Nov 2025 13:43:04 +0000
From: Matthew Wilcox
To: Harry Yoo
Cc: Jiaqi Yan, nao.horiguchi@gmail.com, linmiaohe@huawei.com,
    ziy@nvidia.com, david@redhat.com, lorenzo.stoakes@oracle.com,
    william.roche@oracle.com, tony.luck@intel.com,
    wangkefeng.wang@huawei.com, jane.chu@oracle.com,
    akpm@linux-foundation.org, osalvador@suse.de, muchun.song@linux.dev,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, Vlastimil Babka, Suren Baghdasaryan,
    Michal Hocko, Brendan Jackman, Johannes Weiner
Subject: Re: [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order
References: <20251116014721.1561456-1-jiaqiyan@google.com>
 <20251116014721.1561456-2-jiaqiyan@google.com>

On Mon, Nov 17, 2025 at 12:15:23PM +0900, Harry Yoo wrote:
> On Sun, Nov 16, 2025 at 11:51:14AM +0000, Matthew Wilcox wrote:
> > But since we're only doing this on free, we won't need to do folio
> > allocations at all; we'll just be able to release the good pages to
> > the page allocator and sequester the hwpoison pages.
>
> [+Cc PAGE ALLOCATOR folks]
>
> So we need an interface to free only the healthy portion of a
> hwpoison folio.
>
> I think the proper approach to this is to "free a hwpoison folio just
> like freeing a normal folio via folio_put() or free_frozen_pages(),
> and then the page allocator will add only the healthy pages to the
> freelist and isolate the hwpoison pages".  Otherwise we'll end up
> open-coding a lot, which is too fragile.

Yes, I think it should be handled by the page allocator.  There may be
some complexity to this that I've missed, e.g. if hugetlb wants to
retain the good 2MB chunks of a 1GB allocation.  I'm not sure whether
that's a useful thing to do or not.

> In fact, that can be done by teaching free_pages_prepare() how to
> handle the case where one or more subpages of a folio are hwpoison
> pages.
>
> How should this be implemented in the page allocator in a memdescs
> world?  Hmm, we'll want to do some kind of non-uniform split, without
> actually splitting the folio but allocating struct buddy?

Let me sketch that out, realising that it's subject to change.

A page in buddy state can't need a memdesc allocated.  Otherwise we're
allocating memory to free memory, and that way lies madness.  We can't
do the hack of "embed struct buddy in the page that we're freeing"
because HIGHMEM.  So we'll never shrink struct page smaller than struct
buddy (which is fine, because I've laid out how to get to a 64-bit
struct buddy, and we're probably two years from getting there anyway).

My design for handling hwpoison is that we do allocate a struct
hwpoison for a page.  It looks like this (for now, in my head):

struct hwpoison {
	memdesc_t original;
	... other things ...
};

So we can replace the memdesc in a page with a hwpoison memdesc when we
encounter the error.  We still need a folio flag to indicate that "this
folio contains a page with hwpoison".

I haven't put much thought yet into the interaction with
HUGETLB_PAGE_OPTIMIZE_VMEMMAP; maybe "other things" includes an index
of where the actually poisoned page is in the folio.  Then it doesn't
matter if the pages alias with each other, as we can recover the
information when it becomes useful to do so.  (The first two sketches
at the end of this mail flesh that out a little.)

> But... for now I think hiding this complexity inside the page
> allocator is good enough.  For now this would just mean splitting a
> frozen page inside the page allocator (probably non-uniform?).  We
> can later re-implement this to provide better support for memdescs.

Yes, I like this approach; the last sketch at the end of this mail
shows the kind of non-uniform split on free I'd expect.  But then I'm
not the page allocator maintainer ;-)
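
To make the HVO question above concrete, here is a purely illustrative
expansion of struct hwpoison; memdesc_t doesn't exist yet, and the
second field is only a guess at the "other things":

struct hwpoison {
	memdesc_t original;	/* memdesc the page had before the error */
	unsigned int index;	/* offset of the poisoned page within the
				 * folio, so it doesn't matter if HVO made
				 * the tail struct pages alias */
};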
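
And a sketch of installing it at error-report time.  Everything named
here (page->memdesc, memdesc_hwpoison(), the helper itself) is
hypothetical; this only shows the shape of the operation, not a real
API:

/* Hypothetical: swap the page's memdesc for a hwpoison memdesc. */
static int page_set_hwpoison_memdesc(struct page *page, unsigned int index)
{
	struct hwpoison *hwp = kmalloc(sizeof(*hwp), GFP_KERNEL);

	if (!hwp)
		return -ENOMEM;

	hwp->original = page->memdesc;		/* preserve what was there */
	hwp->index = index;
	page->memdesc = memdesc_hwpoison(hwp);	/* hypothetical constructor */
	/* ... and set the folio flag: "contains a hwpoison page" */
	return 0;
}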
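
Finally, the interim non-uniform split on free.  free_frozen_pages() is
the function Harry mentions; the helper below and its caller are
hand-waved, and the page init / pcp details are ignored:

/*
 * Free the healthy portion of a frozen high-order page, peeling off
 * the largest aligned chunks that don't contain the poisoned page.
 * "bad" is the poisoned page's offset from @head.
 */
static void free_frozen_hwpoison(struct page *head, unsigned int order,
				 unsigned long bad)
{
	while (order > 0) {
		unsigned long half = 1UL << --order;

		if (bad < half) {
			/* poison is in the low half; high half is clean */
			free_frozen_pages(head + half, order);
		} else {
			/* poison is in the high half; low half is clean */
			free_frozen_pages(head, order);
			head += half;
			bad -= half;
		}
	}
	/* @head is now the poisoned order-0 page; it is never freed */
}

For a 2MB folio that gives back 511 pages as one chunk of each order
from 8 down to 0, which is exactly the shape of a buddy split, just
driven from the free path.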