Subject: Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
From: Muchun Song
In-Reply-To: <20260107113130.37231-1-lizhe.67@bytedance.com>
Date: Fri, 9 Jan 2026 14:05:01 +0800
Cc: osalvador@suse.de, david@kernel.org, akpm@linux-foundation.org,
    fvdl@google.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Message-Id: <1981A332-0585-49AB-9ADE-99FA2FB32DD4@linux.dev>
References: <20260107113130.37231-1-lizhe.67@bytedance.com>
To: Li Zhe

> On Jan 7, 2026, at 19:31, Li Zhe wrote:
>
> This patchset is based on this commit[1] ("mm/hugetlb: optionally
> pre-zero hugetlb pages").

I'd like you to add a brief summary here explaining what concerns were
raised about the previous attempts and whether the current proposal
already addresses them, so that more people can quickly grasp the
context.

>
> Fresh hugetlb pages are zeroed out when they are faulted in,
> just like with all other page types.
> This can take up a good amount of time for larger page sizes
> (e.g. around 250 milliseconds for a 1G page on a Skylake machine).
>
> This normally isn't a problem, since hugetlb pages are typically
> mapped by the application for a long time, and the initial delay
> when touching them isn't much of an issue.
>
> However, there are some use cases where a large number of hugetlb
> pages are touched when an application starts (such as a VM backed
> by these pages), rendering the launch noticeably slow.
>
> On a Skylake platform running v6.19-rc2, faulting in 64 × 1 GB huge
> pages takes about 16 seconds, roughly 250 ms per page. Even with
> Ankur's optimizations[2], the time drops only to ~13 seconds,
> ~200 ms per page, still a noticeable delay.

I did see some comments in [1] about QEMU supporting user-mode
parallel zero-page operations; I'm just not sure what the current
state of that support looks like, or what the corresponding benchmark
numbers are.

>
> To accelerate the above scenario, this patchset exports a per-node,
> read-write "zeroable_hugepages" sysfs interface for every hugepage size.
> This interface reports how many hugepages on that node can currently
> be pre-zeroed and allows user space to request that any integer number
> in the range [0, max] be zeroed in a single operation.
>
> This mechanism offers the following advantages:
>
> (1) User space gains full control over when zeroing is triggered,
> enabling it to minimize the impact on both CPU and cache utilization.
>
> (2) Applications can spawn as many zeroing processes as they need,
> enabling concurrent background zeroing.
>
> (3) By binding the process to specific CPUs, users can confine zeroing
> threads to cores that do not run latency-critical tasks, eliminating
> interference.
>
> (4) A zeroing process can be interrupted at any time through standard
> signal mechanisms, allowing immediate cancellation.
>
> (5) The CPU consumption incurred by zeroing can be throttled and contained
> with cgroups, ensuring that the cost is not borne system-wide.
>
> On the same Skylake platform as above, with the 64 GiB of memory
> pre-zeroed in advance by this mechanism, the faulting-latency test
> completed in negligible time.
>
> In user space, we can use system calls such as epoll and write to zero
> huge folios as they become available, and sleep when none are ready. The
> following pseudocode illustrates this approach. The pseudocode spawns
> eight threads (each running thread_fun()) that wait for huge pages on
> node 0 to become eligible for zeroing; whenever such pages are available,
> the threads clear them in parallel.
>
> static void thread_fun(void)
> {
>         epoll_create();
>         epoll_ctl();
>         while (1) {
>                 val = read("/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
>                 if (val > 0)
>                         system("echo max > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
>                 epoll_wait();
>         }
> }
>
> static void start_pre_zero_thread(int thread_num)
> {
>         create_pre_zero_threads(thread_num, thread_fun);
> }
>
> int main(void)
> {
>         start_pre_zero_thread(8);
> }
>
> [1]: https://lore.kernel.org/linux-mm/202412030519.W14yll4e-lkp@intel.com/T/#t
> [2]: https://lore.kernel.org/all/20251215204922.475324-1-ankur.a.arora@oracle.com/T/#u
>
> Li Zhe (8):
>   mm/hugetlb: add pre-zeroed framework
>   mm/hugetlb: convert to prep_account_new_hugetlb_folio()
>   mm/hugetlb: move the huge folio to the end of the list during enqueue
>   mm/hugetlb: introduce per-node sysfs interface "zeroable_hugepages"
>   mm/hugetlb: simplify function hugetlb_sysfs_add_hstate()
>   mm/hugetlb: relocate the per-hstate struct kobject pointer
>   mm/hugetlb: add epoll support for interface "zeroable_hugepages"
>   mm/hugetlb: limit event generation frequency of function
>     do_zero_free_notify()
>
>  fs/hugetlbfs/inode.c    |   3 +-
>  include/linux/hugetlb.h |  26 +++++
>  mm/hugetlb.c            | 131 ++++++++++++++++++++++---
>  mm/hugetlb_internal.h   |   6 ++
>  mm/hugetlb_sysfs.c      | 206 ++++++++++++++++++++++++++++++++++++----
>  5 files changed, 337 insertions(+), 35 deletions(-)
>
> ---
> Changelogs:
>
> v1->v2:
> - Use guard() to simplify function hpage_wait_zeroing(). (pointed out by
>   Raghu)
> - Simplify the logic of zero_free_hugepages_nid() by removing
>   redundant checks and exiting the loop upon encountering a
>   pre-zeroed folio. (pointed out by Frank)
> - Include in the cover letter a performance comparison with Ankur's
>   optimization patch[2]. (pointed out by Andrew)
>
> v1: https://lore.kernel.org/all/20251225082059.1632-1-lizhe.67@bytedance.com/
>
> --
> 2.20.1
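
For anyone who wants to experiment with the interface, here is a rough
sketch of how the pseudocode above might be fleshed out into a runnable
user-space program. It is only an illustration based on the semantics
described in this cover letter: the sysfs path and the "max" keyword are
taken from the pseudocode, the helper names (zero_thread, NR_THREADS,
ZEROABLE_PATH) are mine, and the EPOLLPRI-style notification on the
attribute is assumed from the "add epoll support" patch title rather
than verified against the patches.

#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

#define ZEROABLE_PATH \
        "/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages"
#define NR_THREADS 8

static void *zero_thread(void *arg)
{
        struct epoll_event ev = { .events = EPOLLPRI | EPOLLERR };
        char buf[64];
        ssize_t len;
        int fd, epfd;

        (void)arg;
        fd = open(ZEROABLE_PATH, O_RDWR);
        epfd = epoll_create1(0);
        if (fd < 0 || epfd < 0)
                return NULL;

        ev.data.fd = fd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

        for (;;) {
                /* A sysfs attribute must be re-read from offset 0. */
                lseek(fd, 0, SEEK_SET);
                len = read(fd, buf, sizeof(buf) - 1);
                if (len > 0) {
                        buf[len] = '\0';
                        if (strtol(buf, NULL, 10) > 0) {
                                /* Ask the kernel to zero every eligible hugepage. */
                                lseek(fd, 0, SEEK_SET);
                                write(fd, "max", 3);
                        }
                }
                /* Sleep until the attribute signals newly zeroable pages. */
                epoll_wait(epfd, &ev, 1, -1);
        }
        return NULL;
}

int main(void)
{
        pthread_t tids[NR_THREADS];
        int i;

        for (i = 0; i < NR_THREADS; i++)
                pthread_create(&tids[i], NULL, zero_thread, NULL);
        for (i = 0; i < NR_THREADS; i++)
                pthread_join(tids[i], NULL);
        return 0;
}

As in the pseudocode, each thread sleeps in epoll_wait() until the
attribute reports zeroable pages and then asks the kernel to clear all of
them; pinning these threads with taskset or a cpuset cgroup would confine
the zeroing cost as described in advantage (3) above.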