From: SeongJae Park
To: Baolin Wang
Cc: sj@kernel.org, akpm@linux-foundation.org, ying.huang@intel.com, dave.hansen@linux.intel.com, ziy@nvidia.com, shy828301@gmail.com, zhongjiang-ali@linux.alibaba.com, xlpang@linux.alibaba.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] mm/damon: Add a new scheme to support demotion on tiered memory system
Date: Tue, 21 Dec 2021 13:24:39 +0000
Message-Id: <20211221132439.14997-1-sj@kernel.org>
In-Reply-To: <1c014ce5c6f6c62a30d07096c5e28aa1310c1bbd.1640077468.git.baolin.wang@linux.alibaba.com>

Hi Baolin,


Thank you for this great patch!  I left some comments below.
On Tue, 21 Dec 2021 17:18:04 +0800 Baolin Wang wrote:

> On tiered memory system, the reclaim path in shrink_page_list() already
> support demoting pages to slow memory node instead of discarding the
> pages. However, at that time the fast memory node memory wartermark is
> already tense, which will increase the memory allocation latency during
> page demotion.
>
> We can rely on the DAMON in user space to help to monitor the cold
> memory on fast memory node, and demote the cold pages to slow memory
> node proactively to keep the fast memory node in a healthy state.
> Thus this patch introduces a new scheme named DAMOS_DEMOTE to support
> this feature.
>
> Signed-off-by: Baolin Wang
> ---
>  include/linux/damon.h |   3 +
>  mm/damon/dbgfs.c      |   1 +
>  mm/damon/vaddr.c      | 156 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 160 insertions(+)
>
> diff --git a/include/linux/damon.h b/include/linux/damon.h
> index af64838..da9957c 100644
> --- a/include/linux/damon.h
> +++ b/include/linux/damon.h
> @@ -87,6 +87,8 @@ struct damon_target {
>   * @DAMOS_PAGEOUT: Call ``madvise()`` for the region with MADV_PAGEOUT.
>   * @DAMOS_HUGEPAGE: Call ``madvise()`` for the region with MADV_HUGEPAGE.
>   * @DAMOS_NOHUGEPAGE: Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
> + * @DAMOS_DEMOTE: Migate cold pages from fast memory type (DRAM) to slow
> + * memory type (persistent memory).

s/Migate/Migrate/

Also, could you make the second line of the description aligned with the
first line?  e.g.,

+ * @DAMOS_DEMOTE: Migate cold pages from fast memory type (DRAM) to slow
+ *                memory type (persistent memory).

>   * @DAMOS_STAT: Do nothing but count the stat.
>   */
>  enum damos_action {
> @@ -95,6 +97,7 @@ enum damos_action {
>  	DAMOS_PAGEOUT,
>  	DAMOS_HUGEPAGE,
>  	DAMOS_NOHUGEPAGE,
> +	DAMOS_DEMOTE,
>  	DAMOS_STAT,		/* Do nothing but only record the stat */

This would change the value of 'DAMOS_STAT' and confuse user space programs
that were already using it.  Could you put 'DAMOS_DEMOTE' after 'DAMOS_STAT'?

>  };
>
> diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c
> index 58dbb96..43355ab 100644
> --- a/mm/damon/dbgfs.c
> +++ b/mm/damon/dbgfs.c
> @@ -168,6 +168,7 @@ static bool damos_action_valid(int action)
>  	case DAMOS_PAGEOUT:
>  	case DAMOS_HUGEPAGE:
>  	case DAMOS_NOHUGEPAGE:
> +	case DAMOS_DEMOTE:
>  	case DAMOS_STAT:
>  		return true;
>  	default:
> diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
> index 9e213a1..b354d3e 100644
> --- a/mm/damon/vaddr.c
> +++ b/mm/damon/vaddr.c
> @@ -14,6 +14,10 @@
>  #include
>  #include
>  #include
> +#include
> +#include
> +#include
> +#include "../internal.h"
>
>  #include "prmtv-common.h"
>
> @@ -693,6 +697,156 @@ static unsigned long damos_madvise(struct damon_target *target,
>  }
>  #endif /* CONFIG_ADVISE_SYSCALLS */
>
> +static bool damos_isolate_page(struct page *page, struct list_head *demote_list)
> +{
> +	struct page *head = compound_head(page);
> +
> +	/* Do not interfere with other mappings of this page */
> +	if (page_mapcount(head) != 1)
> +		return false;
> +
> +	/* No need migration if the target demotion node is empty. */
> +	if (next_demotion_node(page_to_nid(head)) == NUMA_NO_NODE)
> +		return false;
> +
> +	if (isolate_lru_page(head))
> +		return false;
> +
> +	list_add_tail(&head->lru, demote_list);
> +	mod_node_page_state(page_pgdat(head),
> +			NR_ISOLATED_ANON + page_is_file_lru(head),
> +			thp_nr_pages(head));
> +	return true;

The return value is not used by callers.  If not needed, let's remove it.
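If so, simply dropping the return value would work, e.g., something like
below (just for sharing the idea; written on top of this patch and not even
compile-tested):

static void damos_isolate_page(struct page *page, struct list_head *demote_list)
{
	struct page *head = compound_head(page);

	/* Do not interfere with other mappings of this page */
	if (page_mapcount(head) != 1)
		return;

	/* No need migration if the target demotion node is empty. */
	if (next_demotion_node(page_to_nid(head)) == NUMA_NO_NODE)
		return;

	if (isolate_lru_page(head))
		return;

	/* Queue the isolated page for demotion and account it as isolated */
	list_add_tail(&head->lru, demote_list);
	mod_node_page_state(page_pgdat(head),
			NR_ISOLATED_ANON + page_is_file_lru(head),
			thp_nr_pages(head));
}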
> +}
> +
> +static int damos_migrate_pmd_entry(pmd_t *pmd, unsigned long addr,
> +		unsigned long end, struct mm_walk *walk)
> +{
> +	struct vm_area_struct *vma = walk->vma;
> +	struct list_head *demote_list = walk->private;
> +	spinlock_t *ptl;
> +	struct page *page;
> +	pte_t *pte, *mapped_pte;
> +
> +	if (!vma_migratable(vma))
> +		return -EFAULT;
> +
> +	ptl = pmd_trans_huge_lock(pmd, vma);
> +	if (ptl) {
> +		/* Bail out if THP migration is not supported. */
> +		if (!thp_migration_supported())
> +			goto thp_out;
> +
> +		/* If the THP pte is under migration, do not bother it. */
> +		if (unlikely(is_pmd_migration_entry(*pmd)))
> +			goto thp_out;
> +
> +		page = damon_get_page(pmd_pfn(*pmd));
> +		if (!page)
> +			goto thp_out;
> +
> +		damos_isolate_page(page, demote_list);
> +
> +		put_page(page);
> +thp_out:
> +		spin_unlock(ptl);
> +		return 0;
> +	}

Could we wrap the above THP handling code with '#ifdef CONFIG_TRANSPARENT_HUGEPAGE'?

> +
> +	/* regular page handling */
> +	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> +		return -EINVAL;
> +
> +	mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> +		if (pte_none(*pte) || !pte_present(*pte))
> +			continue;
> +
> +		page = damon_get_page(pte_pfn(*pte));
> +		if (!page)
> +			continue;
> +
> +		damos_isolate_page(page, demote_list);
> +		put_page(page);
> +	}
> +	pte_unmap_unlock(mapped_pte, ptl);
> +	cond_resched();
> +
> +	return 0;
> +}
> +
> +static const struct mm_walk_ops damos_migrate_pages_walk_ops = {
> +	.pmd_entry = damos_migrate_pmd_entry,

The names are a little bit confusing to me.  How about 's/migrate/isolate/'?

> +};
> +
> +/*
> + * damos_demote() - demote cold pages from fast memory to slow memory
> + * @target: the given target
> + * @r: region of the target
> + *
> + * On tiered memory system, if DAMON monitored cold pages on fast memory
> + * node (DRAM), we can demote them to slow memory node proactively in case
> + * accumulating much more cold memory on fast memory node (DRAM) to reclaim.
> + *
> + * Return:
> + * = 0 means no pages were demoted.
> + * > 0 means how many cold pages were demoted successfully.
> + * < 0 means errors happened.

damon_va_apply_scheme(), which returns what this function returns when the
DAMOS_DEMOTE action is given, should return the number of bytes of the region
that the action has successfully been applied to.

> + */
> +static int damos_demote(struct damon_target *target, struct damon_region *r)
> +{
> +	struct mm_struct *mm;
> +	LIST_HEAD(demote_pages);
> +	LIST_HEAD(pagelist);
> +	int target_nid, nr_pages, ret = 0;
> +	unsigned int nr_succeeded, demoted_pages = 0;
> +	struct page *page, *page2;
> +
> +	/* Validate if allowing to do page demotion */
> +	if (!numa_demotion_enabled)
> +		return -EINVAL;
> +
> +	mm = damon_get_mm(target);
> +	if (!mm)
> +		return -ENOMEM;
> +
> +	mmap_read_lock(mm);
> +	walk_page_range(mm, PAGE_ALIGN(r->ar.start), PAGE_ALIGN(r->ar.end),

I guess PAGE_ALIGN()s are not necessary here?

> +			&damos_migrate_pages_walk_ops, &demote_pages);
> +	mmap_read_unlock(mm);
> +
> +	mmput(mm);
> +	if (list_empty(&demote_pages))
> +		return 0;
> +
> +	list_for_each_entry_safe(page, page2, &demote_pages, lru) {

I'd prefer 'next' or 'next_page' instead of 'page2'.
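That is, something like below (again, untested and only to show what I mean):

-	struct page *page, *page2;
+	struct page *page, *next;

and

-	list_for_each_entry_safe(page, page2, &demote_pages, lru) {
+	list_for_each_entry_safe(page, next, &demote_pages, lru) {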
> +		list_add(&page->lru, &pagelist);
> +		target_nid = next_demotion_node(page_to_nid(page));
> +		nr_pages = thp_nr_pages(page);
> +
> +		ret = migrate_pages(&pagelist, alloc_demote_page, NULL,
> +				target_nid, MIGRATE_SYNC, MR_DEMOTION,
> +				&nr_succeeded);
> +		if (ret) {
> +			if (!list_empty(&pagelist)) {
> +				list_del(&page->lru);
> +				mod_node_page_state(page_pgdat(page),
> +					NR_ISOLATED_ANON + page_is_file_lru(page),
> +					-nr_pages);
> +				putback_lru_page(page);
> +			}
> +		} else {
> +			__count_vm_events(PGDEMOTE_DIRECT, nr_succeeded);
> +			demoted_pages += nr_succeeded;
> +		}

Why should we do the above work for each page on our own instead of exporting
and calling 'demote_page_list()'?


Thanks,
SJ

> +
> +		INIT_LIST_HEAD(&pagelist);
> +		cond_resched();
> +	}
> +
> +	return demoted_pages;
> +}
> +
>  static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx,
>  		struct damon_target *t, struct damon_region *r,
>  		struct damos *scheme)
> @@ -715,6 +869,8 @@ static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx,
>  	case DAMOS_NOHUGEPAGE:
>  		madv_action = MADV_NOHUGEPAGE;
>  		break;
> +	case DAMOS_DEMOTE:
> +		return damos_demote(t, r);
>  	case DAMOS_STAT:
>  		return 0;
>  	default:
> --
> 1.8.3.1