From: Donet Tom
Date: Thu, 30 Apr 2026 12:36:06 +0530
Subject: Re: [RFC PATCH v6 3/5] mm: Hot page tracking and promotion - pghot
To: Bharata B Rao, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Jonathan.Cameron@huawei.com, dave.hansen@intel.com, gourry@gourry.net, mgorman@techsingularity.net, mingo@redhat.com, peterz@infradead.org, raghavendra.kt@amd.com, riel@surriel.com, rientjes@google.com, sj@kernel.org, weixugc@google.com, willy@infradead.org, ying.huang@linux.alibaba.com, ziy@nvidia.com, dave@stgolabs.net, nifan.cxl@gmail.com, xuezhengchu@huawei.com, yiannis@zptcorp.com, akpm@linux-foundation.org, david@redhat.com, byungchul@sk.com, kinseyho@google.com, joshua.hahnjy@gmail.com, yuanchu@google.com, balbirs@nvidia.com, alok.rathore@samsung.com, shivankg@amd.com
Message-ID: <32d03cb3-2199-444b-94de-ff34cf2d5315@linux.ibm.com>
References: <20260323095104.238982-1-bharata@amd.com> <20260323095104.238982-4-bharata@amd.com> <250e68f3-3664-4148-bfbf-52fd4230a3b9@linux.ibm.com>

Hi Bharata,

On 4/27/26 10:54 AM, Bharata B Rao wrote:
> On 24-Apr-26 6:27 PM, Donet Tom wrote:
>>> +int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
>>> +{
>>> +    struct mem_section *ms;
>>> +    struct folio *folio;
>>> +    phi_t *phi, *hot_map;
>>> +    struct page *page;
>>> +
>>> +    if (!kmigrated_started)
>>> +        return 0;
>>> +
>>> +    if (!pghot_nid_valid(nid))
>>> +        return -EINVAL;
>>> +
>>> +    switch (src) {
>>> +    case PGHOT_HINTFAULTS:
>>> +        if (!static_branch_unlikely(&pghot_src_hintfaults))
>>> +            return 0;
>>> +        count_vm_event(PGHOT_RECORDED_HINTFAULTS);
>>> +        break;
>>> +    case PGHOT_HWHINTS:
>>> +        if (!static_branch_unlikely(&pghot_src_hwhints))
>>> +            return 0;
>>> +        count_vm_event(PGHOT_RECORDED_HWHINTS);
>>> +        break;
>>> +    default:
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    /*
>>> +     * Record only accesses from lower tiers.
>>> +     */
>>> +    if (node_is_toptier(pfn_to_nid(pfn)))
>>> +        return 0;
>>
>> Just a thought—could we check this at the beginning of the function, before the
>> switch case?
> I am accumulating two stats here: How many hot page intimations pghot obtained
> that are attributable to different sources
>
> and
>
> out of them, how many turned out to be actionable
>
Understood. Thanks for the clarification.
>
>>
>>> +
>>> +    /*
>>> +     * Reject the non-migratable pages right away.
>>> +     */
>>> +    page = pfn_to_online_page(pfn);
>>> +    if (!page || is_zone_device_page(page))
>>> +        return 0;
>>> +
>>> +    folio = page_folio(page);
>>> +    if (!folio_try_get(folio))
>>> +        return 0;
>>> +
>>> +    if (unlikely(page_folio(page) != folio))
>>> +        goto out;
>>> +
>>> +    if (!folio_test_lru(folio))
>>> +        goto out;
>>> +
>>> +    /* Get the hotness slot corresponding to the 1st PFN of the folio */
>>> +    pfn = folio_pfn(folio);
>>> +    ms = __pfn_to_section(pfn);
>>> +    if (!ms || !ms->hot_map)
>>> +        goto out;
>>> +
>>> +    hot_map = (phi_t *)(((unsigned long)(ms->hot_map)) & ~PGHOT_SECTION_HOT_MASK);
>>> +    phi = &hot_map[pfn % PAGES_PER_SECTION];
>>> +
>>> +    count_vm_event(PGHOT_RECORDED_ACCESSES);
> which is this ^
>
>>> +static void kmigrated_do_work(pg_data_t *pgdat)
>>> +{
>>> +    unsigned long section_nr, s_begin, start_pfn;
>>> +    struct mem_section *ms;
>>> +    int nid;
>>> +
>>> +    clear_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags);
>>> +    s_begin = next_present_section_nr(-1);
>>> +    for_each_present_section_nr(s_begin, section_nr) {
>>> +        start_pfn = section_nr_to_pfn(section_nr);
>>
>> I may be missing something, but in pghot_setup_hot_map() and kmigrated_do_work()
>> we seem to iterate over all memory sections. On large memory systems, could this
>> become a bottleneck right?
>>
>> Since hot_map is allocated only for lower-tier memory and the hotness
>> information is primarily used there, would it make sense to skip scanning
>> higher-tier sections?
>>
>> for_each_online_node(nid) {
>>         if (node_is_toptier(nid))
>>             continue;
>>
>>         start_pfn = node_start_pfn(nid);
>>         end_pfn = node_end_pfn(nid);
>>
>>         s_begin = pfn_to_section_nr(start_pfn);
>>         for_each_present_section_nr(s_begin, section_nr) {
>>     }
>> }
>>
>> Would this approach be reasonable, or am I overlooking something?
> I didn't just yet optimize the walk. Since there is one kmigrated thread per
> lower tier, this routine already is aware of which node to scan. We can limit
> the section walk to that node instead. Something like this:
>
> static void kmigrated_do_work(pg_data_t *pgdat)
> {
>     unsigned long section_nr, s_begin, start_pfn, end_pfn;
>     struct mem_section *ms;
>     int nid = pgdat->node_id;
>
>     start_pfn = SECTION_ALIGN_DOWN(node_start_pfn(nid));
>     end_pfn = SECTION_ALIGN_UP(start_pfn + node_end_pfn(nid));
>
>     for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>         section_nr = pfn_to_section_nr(pfn);
>
>         if (!present_section_nr(section_nr))
>             continue;
>
>         ms = __nr_to_section(section_nr);
>
>         ...
>
>         kmigrated_walk_zone(pfn, pfn + PAGES_PER_SECTION, nid);
>     }
> }

Thanks. This looks good to me.
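For reference, here is a minimal userspace sketch of the per-node section walk that makes the bounds concrete. It rests on stated assumptions. Everything in it is a hypothetical stand-in: `PAGES_PER_SECTION`, the `SECTION_ALIGN_DOWN`/`SECTION_ALIGN_UP` macros, the `section_present[]` array (modelling `present_section_nr()`) and `struct node_span` are written out here and are not taken from the kernel headers. In the kernel, `node_end_pfn()` already returns an absolute, exclusive end PFN (the node's start PFN plus its spanned pages). For that reason, the sketch aligns the node's end PFN directly instead of computing `start_pfn + node_end_pfn(nid)`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's SPARSEMEM definitions. */
#define PAGES_PER_SECTION 32768UL   /* e.g. 128 MiB sections with 4 KiB pages */
#define SECTION_ALIGN_DOWN(pfn) ((pfn) & ~(PAGES_PER_SECTION - 1))
#define SECTION_ALIGN_UP(pfn) \
    (((pfn) + PAGES_PER_SECTION - 1) & ~(PAGES_PER_SECTION - 1))
#define NR_SECTIONS 64

/* Hypothetical model of present_section_nr(): true if section nr exists. */
static bool section_present[NR_SECTIONS];

/* Hypothetical node span; end_pfn is exclusive, like node_end_pfn(). */
struct node_span {
    unsigned long start_pfn;
    unsigned long end_pfn;
};

/* Count the present sections overlapping one node's PFN range. A real
 * kmigrated would scan each such section instead of just counting it. */
static size_t walk_node_sections(const struct node_span *node)
{
    unsigned long start = SECTION_ALIGN_DOWN(node->start_pfn);
    /* end_pfn is already an absolute PFN, so align it directly. */
    unsigned long end = SECTION_ALIGN_UP(node->end_pfn);
    size_t visited = 0;

    for (unsigned long pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
        unsigned long nr = pfn / PAGES_PER_SECTION;  /* pfn_to_section_nr() */

        if (nr >= NR_SECTIONS || !section_present[nr])
            continue;                                /* hole: skip it */
        visited++;
    }
    return visited;
}
```

Suppose sections 0, 4, 5 and 7 are present and the node spans sections 4 through 7. In that case the walk visits three sections and never touches section 0, which belongs to another node. The cost therefore scales with the size of the node rather than with all of memory.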
>
>>> +static int pghot_online_sec_hotmap(unsigned long start_pfn,
>>> +                   unsigned long nr_pages)
>>> +{
>>> +    int nid = pfn_to_nid(start_pfn);
>>> +    unsigned long start, end, pfn;
>>> +    struct mem_section *ms;
>>> +    int fail = 0;
>>> +
>>> +    start = SECTION_ALIGN_DOWN(start_pfn);
>>> +    end = SECTION_ALIGN_UP(start_pfn + nr_pages);
>>> +
>>> +    for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
>>> +        ms = __pfn_to_section(pfn);
>>> +        if (!ms || ms->hot_map)
>>> +            continue;
>>> +
>>> +        fail = pghot_alloc_hot_map(ms, nid);
>> I may be missing something, but after pghot_alloc_hot_map fails, we continue the
>> loop. Would it make sense to break and go to the cleanup logic instead?
> There is a !fail check in the for-loop due to which we break when alloc fails.

My bad, I missed it.

-Donet

> Regards,
> Bharata.
>
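The `!fail` loop pattern from pghot_online_sec_hotmap() can be modelled in userspace, together with the kind of rollback a cleanup path might perform. This is a minimal sketch. `hot_map[]`, `alloc_hot_map()`, `fail_at` and `online_sections()` are hypothetical stand-ins and do not reproduce the patch's actual cleanup code. The model also shows one C subtlety: when the allocation fails inside the loop body, the loop increment still runs once before `!fail` is evaluated. The cursor therefore ends one section past the failing one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define NR_SECTIONS 16

/* Hypothetical stand-in for the per-section ms->hot_map pointers. */
static void *hot_map[NR_SECTIONS];
/* Hypothetical fault injection: allocation fails for this section (-1 = never). */
static int fail_at = -1;

/* Hypothetical stand-in for pghot_alloc_hot_map(): 0 on success, -1 on failure. */
static int alloc_hot_map(int sec)
{
    if (sec == fail_at)
        return -1;
    hot_map[sec] = malloc(64);
    return hot_map[sec] ? 0 : -1;
}

/* Allocate maps for sections [start, end). Because !fail is part of the loop
 * condition, the loop exits right after the first failed allocation. */
static int online_sections(int start, int end)
{
    bool allocated_here[NR_SECTIONS] = { false };
    int sec, fail = 0;

    for (sec = start; !fail && sec < end; sec++) {
        if (hot_map[sec])
            continue;                 /* map already present: skip it */
        fail = alloc_hot_map(sec);
        allocated_here[sec] = !fail;
    }

    if (fail) {
        /* The increment ran once more after the failure, so sec is one past
         * the failing section; free only the maps this call allocated. */
        for (int i = start; i < sec; i++) {
            if (allocated_here[i]) {
                free(hot_map[i]);
                hot_map[i] = NULL;
            }
        }
    }
    return fail;
}
```

As an example, take a pre-existing map in section 2 and a forced failure at section 5. The call returns -1, and only section 2's map remains allocated afterwards. The rollback tracks which maps it allocated itself, so it leaves maps that existed before the call untouched.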