From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 504DCC7EE30
	for <linux-mm@archiver.kernel.org>; Tue,  1 Jul 2025 13:59:07 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id E6BAE6B00A1; Tue,  1 Jul 2025 09:59:06 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id E41866B00AA; Tue,  1 Jul 2025 09:59:06 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id D7E396B00AD; Tue,  1 Jul 2025 09:59:06 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14])
	by kanga.kvack.org (Postfix) with ESMTP id C6D106B00A1
	for <linux-mm@kvack.org>; Tue,  1 Jul 2025 09:59:06 -0400 (EDT)
Received: from smtpin25.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay03.hostedemail.com (Postfix) with ESMTP id 671C6B9883
	for <linux-mm@kvack.org>; Tue,  1 Jul 2025 13:59:06 +0000 (UTC)
X-FDA: 83615852292.25.03C6E80
Received: from tor.source.kernel.org (tor.source.kernel.org [172.105.4.254])
	by imf11.hostedemail.com (Postfix) with ESMTP id C602F40010
	for <linux-mm@kvack.org>; Tue,  1 Jul 2025 13:59:04 +0000 (UTC)
Authentication-Results: imf11.hostedemail.com;
	dkim=none;
	dmarc=fail reason="SPF not aligned (relaxed), No valid DKIM" header.from=arm.com (policy=none);
	spf=pass (imf11.hostedemail.com: domain of cmarinas@kernel.org designates 172.105.4.254 as permitted sender) smtp.mailfrom=cmarinas@kernel.org
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1751378344; a=rsa-sha256;
	cv=none;
	b=784hjTk5BbkFAYGH7j4ZKl8SiEYaei0rZRyET6Iit34RU6lh2B0w+veDhByeXb2EWq0LLM
	znxredyLQhTnQCHFCAi3T1XFgKOMxqEusky28Z2/aROKnZOKgNsXbBziYkeFaUf9o1jUhX
	jft1h4LocHhV0xmr3h96Tz2K7BOVwMk=
ARC-Authentication-Results: i=1;
	imf11.hostedemail.com;
	dkim=none;
	dmarc=fail reason="SPF not aligned (relaxed), No valid DKIM" header.from=arm.com (policy=none);
	spf=pass (imf11.hostedemail.com: domain of cmarinas@kernel.org designates 172.105.4.254 as permitted sender) smtp.mailfrom=cmarinas@kernel.org
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1751378344;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=w8hcaTAx86juA/niq7QPs9mBEe9/c+bteJ+HTKMUGZY=;
	b=ZgBdcP43DPcQNTLjUA1dvZWmLxTIv4/wDJUf/ROp1Qysg6YUhKVs4+bJLNt97wL6LtHm2g
	3knqIaPx9gmJWK/cY4zkqsRBUSf22QHKjjZ1/SZGfg93kPNP6hPU5RZ/INZvPOav2K1dLb
	nlaDt+eM6khc8EnA2PR5A7RefzGMHn4=
Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58])
	by tor.source.kernel.org (Postfix) with ESMTP id 25DBE61441;
	Tue,  1 Jul 2025 13:59:04 +0000 (UTC)
Received: by smtp.kernel.org (Postfix) with ESMTPSA id 191FAC4CEEB;
	Tue,  1 Jul 2025 13:59:00 +0000 (UTC)
Date: Tue, 1 Jul 2025 14:58:58 +0100
From: Catalin Marinas <catalin.marinas@arm.com>
To: Xavier Xia <xavier.qyxia@gmail.com>
Cc: ryan.roberts@arm.com, will@kernel.org, 21cnbao@gmail.com,
	ioworker0@gmail.com, dev.jain@arm.com, akpm@linux-foundation.org,
	david@redhat.com, gshan@redhat.com,
	linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, willy@infradead.org, xavier_qy@163.com,
	ziy@nvidia.com, Barry Song <baohua@kernel.org>
Subject: Re: [PATCH v7] arm64/mm: Optimize loop to reduce redundant
 operations of  contpte_ptep_get
Message-ID: <aGPpohrc8APQad-v@arm.com>
References: <20250624152549.2647828-1-xavier.qyxia@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20250624152549.2647828-1-xavier.qyxia@gmail.com>
X-Rspamd-Server: rspam09
X-Rspamd-Queue-Id: C602F40010
X-Stat-Signature: 1eydndoc6u1bo6zifonsbtk17p66xtks
X-Rspam-User: 
X-HE-Tag: 1751378344-403691
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Tue, Jun 24, 2025 at 11:25:49PM +0800, Xavier Xia wrote:
> This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
> functions by adding early termination logic. It checks if the dirty and
> young bits of orig_pte are already set and skips redundant bit-setting
> operations during the loop. This reduces unnecessary iterations and
> improves performance.
> 
> In order to verify the optimization performance, a test function has been
> designed. The function's execution time and instruction statistics have
> been traced using perf, and the following are the operation results on a
> certain Qualcomm mobile phone chip:
> 
> Test Code:
> 	#include <stdlib.h>
> 	#include <sys/mman.h>
> 	#include <stdio.h>
> 
> 	#define PAGE_SIZE 4096
> 	#define CONT_PTES 16
> 	#define TEST_SIZE (4096 * CONT_PTES * PAGE_SIZE)
> 	#define YOUNG_BIT 8
> 	void rwdata(char *buf)
> 	{
> 		for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
> 			buf[i] = 'a';
> 			volatile char c = buf[i];
> 		}
> 	}
> 	void clear_young_dirty(char *buf)
> 	{
> 		if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
> 			perror("madvise free failed");
> 			free(buf);
> 			exit(EXIT_FAILURE);
> 		}
> 		if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
> 			perror("madvise cold failed");
> 			free(buf);
> 			exit(EXIT_FAILURE);
> 		}
> 	}
> 	void set_one_young(char *buf)
> 	{
> 		for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
> 			volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
> 		}
> 	}
> 
> 	void test_contpte_perf() {
> 		char *buf;
> 		int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE,
> 				TEST_SIZE);
> 		if ((ret != 0) || ((unsigned long)buf % (CONT_PTES * PAGE_SIZE))) {
> 			perror("posix_memalign failed");
> 			exit(EXIT_FAILURE);
> 		}
> 
> 		rwdata(buf);
> 	#if TEST_CASE2 || TEST_CASE3
> 		clear_young_dirty(buf);
> 	#endif
> 	#if TEST_CASE2
> 		set_one_young(buf);
> 	#endif
> 
> 		for (int j = 0; j < 500; j++) {
> 			mlock(buf, TEST_SIZE);
> 
> 			munlock(buf, TEST_SIZE);
> 		}
> 		free(buf);
> 	}
> 
> 	int main(void) 
> 	{
> 		test_contpte_perf();
> 		return 0;
> 	}
> 
> 	Descriptions of three test scenarios
> 
> Scenario 1
> 	The data of all 16 PTEs are both dirty and young.
> 	#define TEST_CASE2 0
> 	#define TEST_CASE3 0
> 
> Scenario 2
> 	Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
> 	#define TEST_CASE2 1
> 	#define TEST_CASE3 0
> 
> Scenario 3
> 	Among the 16 PTEs, there are neither young nor dirty ones.
> 	#define TEST_CASE2 0
> 	#define TEST_CASE3 1
> 
> Test results
> 
> |Scenario 1         |       Original|       Optimized|
> |-------------------|---------------|----------------|
> |instructions       |    37912436160|     18731580031|
> |test time          |         4.2797|          2.2949|
> |Overhead of        |               |                |
> |contpte_ptep_get() |         21.31%|           4.80%|
> 
> |Scenario 2         |       Original|       Optimized|
> |-------------------|---------------|----------------|
> |instructions       |    36701270862|     36115790086|
> |test time          |         3.2335|          3.0874|
> |Overhead of        |               |                |
> |contpte_ptep_get() |         32.26%|          33.57%|
> 
> |Scenario 3         |       Original|       Optimized|
> |-------------------|---------------|----------------|
> |instructions       |    36706279735|     36750881878|
> |test time          |         3.2008|          3.1249|
> |Overhead of        |               |                |
> |contpte_ptep_get() |         31.94%|          34.59%|
> 
> For Scenario 1, the optimized code achieves an instruction count benefit
> of 50.59% and a time benefit of 46.38%.
> For Scenario 2, the optimized code achieves an instruction count benefit
> of 1.6% and a time benefit of 4.5%.
> For Scenario 3, since none of the PTEs has the young or the dirty flag,
> the branches taken by the optimized code should be the same as those of
> the original code. The test results confirm this: the optimized code
> performs close to the original code.
> 
> Ryan re-ran these tests on Apple M2 with 4K base pages + 64K mTHP.
> 
> Scenario 1: reduced to 56% of baseline execution time
> Scenario 2: reduced to 89% of baseline execution time
> Scenario 3: reduced to 91% of baseline execution time
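
For readers following the thread, the early-termination idea in the quoted
commit message can be illustrated in user space roughly as below. This is
only a sketch under stated assumptions, not the code from the patch: the
flag bits SK_YOUNG and SK_DIRTY, the helpers gather_all() and
gather_early(), and the "visited" counter are all hypothetical names made
up for illustration, and plain integers stand in for real PTEs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CONT_PTES 16

/* Hypothetical flag bits for this sketch; NOT the arm64 PTE layout. */
#define SK_YOUNG (UINT64_C(1) << 0)
#define SK_DIRTY (UINT64_C(1) << 1)

/* Baseline: visit every entry in the block and accumulate young/dirty. */
static uint64_t gather_all(const uint64_t *ptes, size_t *visited)
{
	uint64_t flags = 0;
	size_t i;

	assert(ptes && visited);
	for (i = 0; i < CONT_PTES; i++)
		flags |= ptes[i] & (SK_YOUNG | SK_DIRTY);

	*visited = i;
	return flags;
}

/* Early termination: stop as soon as both bits have been observed. */
static uint64_t gather_early(const uint64_t *ptes, size_t *visited)
{
	const uint64_t both = SK_YOUNG | SK_DIRTY;
	uint64_t flags = 0;
	size_t i;

	assert(ptes && visited);
	for (i = 0; i < CONT_PTES; i++) {
		flags |= ptes[i] & both;
		if (flags == both) {
			i++;	/* count the entry that completed the set */
			break;
		}
	}

	*visited = i;
	return flags;
}
```

In this model, a block where every entry is young and dirty (Scenario 1)
lets gather_early() stop after the first entry, while a block with a single
young entry and no dirty ones (Scenario 2), or with neither flag set
(Scenario 3), still requires all 16 entries to be visited. That matches the
shape of the quoted results, where the gain is concentrated in Scenario 1.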

Still not keen on microbenchmarks to justify such a change, but at least
the code is more readable than the macro approach in an earlier
version.

Do you have any numbers showing how it compares with your v1?

https://lore.kernel.org/all/20250407092243.2207837-1-xavier_qy@163.com/

That patch was a lot simpler.

Thanks.

-- 
Catalin