From: Yu Zhao <yuzhao@google.com>
Date: Thu, 13 Jul 2023 21:57:29 -0600
Subject: Re: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte_range() support large folio
To: Yin Fengwei
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org, willy@infradead.org, david@redhat.com, ryan.roberts@arm.com, shy828301@gmail.com
In-Reply-To: <20230713150558.200545-1-fengwei.yin@intel.com>
References: <20230713150558.200545-1-fengwei.yin@intel.com>
Content-Type: text/plain; charset="UTF-8"
On Thu, Jul 13, 2023 at 9:06 AM Yin Fengwei wrote:
>
> Current madvise_cold_or_pageout_pte_range() has two problems for
> large folio support:
> - Using folio_mapcount() with a large folio prevents the large folio
>   from being picked up.
> - If a large folio is fully within the requested range, it shouldn't
>   be split in madvise_cold_or_pageout_pte_range().
>
> Fix them by:
> - Using folio_estimated_sharers() with large folios.
> - If a large folio is fully within the requested range, don't split
>   it. Leave that to the page reclaim phase.
>
> For a large folio that crosses the boundaries of the requested range,
> skip it if it's page cache. Try to split it if it's an anonymous
> folio. If splitting fails, skip it.
>
> The main reason to call folio_referenced() is to clear the young bit
> of the corresponding PTEs. So in the page reclaim phase, there is a
> good chance the folio can be reclaimed.
>
> Signed-off-by: Yin Fengwei
> ---
> This patch is based on mlock large folio support rfc2 as it depends
> on the folio_in_range() added by that patchset.
>
> Also folio_op_size() can be unified with get_folio_mlock_step().
>
> Testing done:
> - kselftest: No new regression introduced.
>
>  mm/madvise.c | 133 ++++++++++++++++++++++++++++++++-------------------
>  1 file changed, 84 insertions(+), 49 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 38382a5d1e393..5748cf098235d 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -31,6 +31,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include
>
> @@ -339,6 +340,35 @@ static inline bool can_do_file_pageout(struct vm_area_struct *vma)
>  		file_permission(vma->vm_file, MAY_WRITE) == 0;
>  }
>
> +static inline bool skip_current_entry(struct folio *folio, bool pageout_anon)
> +{
> +	if (!folio)
> +		return true;
> +
> +	if (folio_is_zone_device(folio))
> +		return true;
> +
> +	if (!folio_test_lru(folio))
> +		return true;
> +
> +	if (pageout_anon && !folio_test_anon(folio))
> +		return true;
> +
> +	if (folio_test_unevictable(folio))
> +		return true;
> +
> +	return false;
> +}
> +
> +static inline unsigned int folio_op_size(struct folio *folio, pte_t pte,
> +		unsigned long addr, unsigned long end)
> +{
> +	unsigned int nr;
> +
> +	nr = folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte);
> +	return min_t(unsigned int, nr, (end - addr) >> PAGE_SHIFT);
> +}
> +
>  static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  				unsigned long addr, unsigned long end,
>  				struct mm_walk *walk)
> @@ -353,6 +383,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  	struct folio *folio = NULL;
>  	LIST_HEAD(folio_list);
>  	bool pageout_anon_only_filter;
> +	unsigned long start = addr;
>
>  	if (fatal_signal_pending(current))
>  		return -EINTR;
> @@ -383,7 +414,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  		folio = pfn_folio(pmd_pfn(orig_pmd));
>
>  		/* Do not interfere with other mappings of this folio */
> -		if (folio_mapcount(folio) != 1)
> +		if (folio_estimated_sharers(folio) != 1)
>  			goto huge_unlock;
>
>  		if (pageout_anon_only_filter && !folio_test_anon(folio))
> @@ -442,78 +473,60 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  	for (; addr < end; pte++, addr += PAGE_SIZE) {
>  		ptent = ptep_get(pte);
>
> -		if (pte_none(ptent))
> -			continue;
> -
> -		if (!pte_present(ptent))
> +		if (pte_none(ptent) || !pte_present(ptent))
>  			continue;
>
>  		folio = vm_normal_folio(vma, addr, ptent);
> -		if (!folio || folio_is_zone_device(folio))
> +		if (skip_current_entry(folio, pageout_anon_only_filter))
>  			continue;
>
>  		/*
> -		 * Creating a THP page is expensive so split it only if we
> -		 * are sure it's worth. Split it if we are only owner.
> +		 * Split large folio if it's anonymous and cross the
> +		 * boundaries of request range.
>  		 */
>  		if (folio_test_large(folio)) {
> -			int err;
> +			int err, step;
> +
> +			if (folio_estimated_sharers(folio) != 1)
> +				continue;
> +
> +			if (folio_in_range(folio, vma, start, end))
> +				goto pageout_cold_folio;
>
> -			if (folio_mapcount(folio) != 1)
> -				break;
> -			if (pageout_anon_only_filter && !folio_test_anon(folio))
> -				break;
> -			if (!folio_trylock(folio))
> -				break;
>  			folio_get(folio);
> +			step = folio_op_size(folio, ptent, addr, end);
> +			if (!folio_test_anon(folio) || !folio_trylock(folio)) {
> +				folio_put(folio);
> +				goto next_folio;
> +			}
> +
>  			arch_leave_lazy_mmu_mode();
>  			pte_unmap_unlock(start_pte, ptl);
>  			start_pte = NULL;
>  			err = split_folio(folio);
>  			folio_unlock(folio);
>  			folio_put(folio);
> -			if (err)
> -				break;
> +
>  			start_pte = pte =
>  				pte_offset_map_lock(mm, pmd, addr, &ptl);
>  			if (!start_pte)
>  				break;
>  			arch_enter_lazy_mmu_mode();
> -			pte--;
> -			addr -= PAGE_SIZE;
> -			continue;
> -		}
>
> -		/*
> -		 * Do not interfere with other mappings of this folio and
> -		 * non-LRU folio.
> -		 */
> -		if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> +			/* Skip the folio if split fails */
> +			if (!err)
> +				step = 0;
> +next_folio:
> +			pte += step - 1;
> +			addr += (step - 1) << PAGE_SHIFT;
>  			continue;
> +		}
>
> -		if (pageout_anon_only_filter && !folio_test_anon(folio))
> +		/* Do not interfere with other mappings of this folio */
> +		if (folio_mapcount(folio) != 1)
>  			continue;
>
> -		VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> -
> -		if (pte_young(ptent)) {
> -			ptent = ptep_get_and_clear_full(mm, addr, pte,
> -							tlb->fullmm);
> -			ptent = pte_mkold(ptent);
> -			set_pte_at(mm, addr, pte, ptent);
> -			tlb_remove_tlb_entry(tlb, pte, addr);
> -		}
> -
> -		/*
> -		 * We are deactivating a folio for accelerating reclaiming.
> -		 * VM couldn't reclaim the folio unless we clear PG_young.
> -		 * As a side effect, it makes confuse idle-page tracking
> -		 * because they will miss recent referenced history.
> -		 */
> -		folio_clear_referenced(folio);
> -		folio_test_clear_young(folio);
> -		if (folio_test_active(folio))
> -			folio_set_workingset(folio);
> +pageout_cold_folio:
>  		if (pageout) {
>  			if (folio_isolate_lru(folio)) {
>  				if (folio_test_unevictable(folio))
> @@ -529,8 +542,30 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  		arch_leave_lazy_mmu_mode();
>  		pte_unmap_unlock(start_pte, ptl);
>  	}
> -	if (pageout)
> -		reclaim_pages(&folio_list);
> +
> +	if (pageout) {
> +		LIST_HEAD(reclaim_list);
> +
> +		while (!list_empty(&folio_list)) {
> +			int refs;
> +			unsigned long flags;
> +			struct mem_cgroup *memcg = folio_memcg(folio);
> +
> +			folio = lru_to_folio(&folio_list);
> +			list_del(&folio->lru);
> +
> +			refs = folio_referenced(folio, 0, memcg, &flags);
> +
> +			if ((flags & VM_LOCKED) || (refs == -1)) {
> +				folio_putback_lru(folio);
> +				continue;
> +			}
> +
> +			folio_test_clear_referenced(folio);
> +			list_add(&folio->lru, &reclaim_list);
> +		}
> +		reclaim_pages(&reclaim_list);
> +	}

I overlooked the chunk above -- it's unnecessary: after we split the
large folio (and splice the base folios onto the same LRU list), we
continue at the position of the first base folio because of:

  pte--;
  addr -= PAGE_SIZE;
  continue;

And then we do pte_mkold(), which takes care of the A-bit.
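As a side note on the step calculation: folio_op_size() in the patch is
plain PFN arithmetic -- the number of pages from the current PTE to the
end of the folio, clamped to the pages remaining in the requested range.
A minimal standalone sketch of that arithmetic (Python, with hypothetical
PFN and address values; PAGE_SHIFT assumed to be 12, i.e. 4K pages):

```python
PAGE_SHIFT = 12           # assumed: 4K pages
PAGE_SIZE = 1 << PAGE_SHIFT

def folio_op_size(folio_pfn, folio_nr_pages, pte_pfn, addr, end):
    # Pages from the current PTE to the end of the folio...
    nr = folio_pfn + folio_nr_pages - pte_pfn
    # ...clamped to the pages left in the requested [addr, end) range.
    return min(nr, (end - addr) >> PAGE_SHIFT)

# Hypothetical example: a 16-page folio at PFN 0x100, with the walk
# currently at PFN 0x104 and 8 pages left in the requested range.
# 12 folio pages remain, but the range limits the step to 8.
addr = 0x7f0000004000
end = addr + 8 * PAGE_SIZE
print(folio_op_size(0x100, 16, 0x104, addr, end))  # -> 8
```

This step is only used when the folio is skipped (the `goto next_folio`
path); on a successful split the patch sets step to 0, so `pte += step - 1`
rewinds one page and the loop revisits the first base folio, which is the
behavior the reply above relies on.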