From: Barry Song <21cnbao@gmail.com>
To: wangkefeng.wang@huawei.com
Cc: akpm@linux-foundation.org, baolin.wang@linux.alibaba.com,
    chrisl@kernel.org, david@redhat.com, hanchuanhua@oppo.com,
    hannes@cmpxchg.org, hch@infradead.org, hughd@google.com,
    kaleshsingh@google.com, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, mhocko@suse.com, minchan@kernel.org,
    nphamcs@gmail.com, ryan.roberts@arm.com, ryncsn@gmail.com,
    senozhatsky@chromium.org, shakeel.butt@linux.dev,
    shy828301@gmail.com, surenb@google.com, v-songbaohua@oppo.com,
    willy@infradead.org, xiang@kernel.org, ying.huang@intel.com,
    yosryahmed@google.com
Subject: Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
Date: Fri, 16 Aug 2024 11:06:12 +1200
Message-Id: <20240815230612.77266-1-21cnbao@gmail.com>
In-Reply-To: <20ed69ad-5dad-446b-9f01-86ad8b1c67fa@huawei.com>
References: <20ed69ad-5dad-446b-9f01-86ad8b1c67fa@huawei.com>

On Fri, Aug 16, 2024 at 1:27 AM Kefeng Wang wrote:
>
>
>
> On 2024/8/15 17:47, Kairui Song wrote:
> > On Fri, Aug 2, 2024 at 8:21 PM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> From: Chuanhua Han
> >
> > Hi Chuanhua,
> >
> >>
> ...
>
> >> +
> >> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >> +{
> >> +       struct vm_area_struct *vma = vmf->vma;
> >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >> +       unsigned long orders;
> >> +       struct folio *folio;
> >> +       unsigned long addr;
> >> +       swp_entry_t entry;
> >> +       spinlock_t *ptl;
> >> +       pte_t *pte;
> >> +       gfp_t gfp;
> >> +       int order;
> >> +
> >> +       /*
> >> +        * If uffd is active for the vma we need per-page fault fidelity to
> >> +        * maintain the uffd semantics.
> >> +        */
> >> +       if (unlikely(userfaultfd_armed(vma)))
> >> +               goto fallback;
> >> +
> >> +       /*
> >> +        * A large swapped out folio could be partially or fully in zswap. We
> >> +        * lack handling for such cases, so fallback to swapping in order-0
> >> +        * folio.
> >> +        */
> >> +       if (!zswap_never_enabled())
> >> +               goto fallback;
> >> +
> >> +       entry = pte_to_swp_entry(vmf->orig_pte);
> >> +       /*
> >> +        * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >> +        * and suitable for swapping THP.
> >> +        */
> >> +       orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> >> +                       TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
> >> +       orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> >> +       orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders);
> >> +
> >> +       if (!orders)
> >> +               goto fallback;
> >> +
> >> +       pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl);
> >> +       if (unlikely(!pte))
> >> +               goto fallback;
> >> +
> >> +       /*
> >> +        * For do_swap_page, find the highest order where the aligned range is
> >> +        * completely swap entries with contiguous swap offsets.
> >> +        */
> >> +       order = highest_order(orders);
> >> +       while (orders) {
> >> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> >> +               if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order))
> >> +                       break;
> >> +               order = next_order(&orders, order);
> >> +       }
> >> +
> >> +       pte_unmap_unlock(pte, ptl);
> >> +
> >> +       /* Try allocating the highest of the remaining orders. */
> >> +       gfp = vma_thp_gfp_mask(vma);
> >> +       while (orders) {
> >> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> >> +               folio = vma_alloc_folio(gfp, order, vma, addr, true);
> >> +               if (folio)
> >> +                       return folio;
> >> +               order = next_order(&orders, order);
> >> +       }
> >> +
> >> +fallback:
> >> +#endif
> >> +       return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
> >> +}
> >> +
> >> +
> >>   /*
> >>    * We enter with non-exclusive mmap_lock (to exclude vma changes,
> >>    * but allow concurrent faults), and pte mapped but not yet locked.
> >> @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>          if (!folio) {
> >>                  if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >>                      __swap_count(entry) == 1) {
> >> -                       /*
> >> -                        * Prevent parallel swapin from proceeding with
> >> -                        * the cache flag. Otherwise, another thread may
> >> -                        * finish swapin first, free the entry, and swapout
> >> -                        * reusing the same entry. It's undetectable as
> >> -                        * pte_same() returns true due to entry reuse.
> >> -                        */
> >> -                       if (swapcache_prepare(entry, 1)) {
> >> -                               /* Relax a bit to prevent rapid repeated page faults */
> >> -                               schedule_timeout_uninterruptible(1);
> >> -                               goto out;
> >> -                       }
> >> -                       need_clear_cache = true;
> >> -
> >>                          /* skip swapcache */
> >> -                       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> >> -                                               vma, vmf->address, false);
> >> +                       folio = alloc_swap_folio(vmf);
> >>                          page = &folio->page;
> >>                          if (folio) {
> >>                                  __folio_set_locked(folio);
> >>                                  __folio_set_swapbacked(folio);
> >>
> >> +                               nr_pages = folio_nr_pages(folio);
> >> +                               if (folio_test_large(folio))
> >> +                                       entry.val = ALIGN_DOWN(entry.val, nr_pages);
> >> +                               /*
> >> +                                * Prevent parallel swapin from proceeding with
> >> +                                * the cache flag. Otherwise, another thread may
> >> +                                * finish swapin first, free the entry, and swapout
> >> +                                * reusing the same entry. It's undetectable as
> >> +                                * pte_same() returns true due to entry reuse.
> >> +                                */
> >> +                               if (swapcache_prepare(entry, nr_pages)) {
> >> +                                       /* Relax a bit to prevent rapid repeated page faults */
> >> +                                       schedule_timeout_uninterruptible(1);
> >> +                                       goto out_page;
> >> +                               }
> >> +                               need_clear_cache = true;
> >> +
> >>                                  if (mem_cgroup_swapin_charge_folio(folio,
> >>                                                          vma->vm_mm, GFP_KERNEL,
> >>                                                          entry)) {
> >>                                          ret = VM_FAULT_OOM;
> >>                                          goto out_page;
> >>                                  }
> >
> > After your patch, with a kernel build test, I'm seeing the kernel log
> > spammed like this:
> > [  101.048594] pagefault_out_of_memory: 95 callbacks suppressed
> > [  101.048599] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > [  101.059416] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > [  101.118575] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > [  101.125585] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > [  101.182501] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > [  101.215351] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > [  101.272822] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > [  101.403195] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > ............
> >
> > And heavy performance loss with memcg-limited workloads when mTHP is
> > enabled.
> >
> > After some debugging, the problematic part is the
> > mem_cgroup_swapin_charge_folio() call above.
> > When under pressure, cgroup charge fails easily for mTHP. One 64k
> > swapin will require much more aggressive reclaim to succeed.
> >
> > If I change MAX_RECLAIM_RETRIES from 16 to 512, the log spam is gone
> > and mTHP swapin should have a much higher success rate, but this
> > might not be the right way.
> >
> > For this particular issue, maybe you can change the charge order:
> > try charging first; if it succeeds, use mTHP; if it fails, fall back
> > to 4K?
>
> This is what we did in alloc_anon_folio(); see 085ff35e7636
> ("mm: memory: move mem_cgroup_charge() into alloc_anon_folio()"):
> 1) fall back earlier
> 2) use the same GFP flags for allocation and charge
>
> But it seems a little more complicated for the swapin charge.

Kefeng, thanks! I guess we can continue using the same approach, and
it's not too complicated.
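
Roughly, the idea is the following: a simplified, untested sketch of
moving the memcg charge into alloc_swap_folio()'s allocation loop,
charging with the same gfp mask used for the allocation and falling
back to the next order when the charge fails (the formal patch is
below):

	/*
	 * Try the highest remaining order first; charge with the same
	 * gfp mask used for the allocation, and fall back to the next
	 * smaller order whenever the allocation or the memcg charge
	 * fails.
	 */
	gfp = vma_thp_gfp_mask(vma);
	while (orders) {
		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
		folio = vma_alloc_folio(gfp, order, vma, addr, true);
		if (folio) {
			if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
							    gfp, entry))
				return folio;
			/* charge failed: drop this folio, try a lower order */
			folio_put(folio);
		}
		order = next_order(&orders, order);
	}

The order-0 fallback path keeps its existing GFP_KERNEL charge in
do_swap_page(), so only the nr_pages == 1 case still charges there.
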
Kairui, sorry for the trouble and thanks for the report! Could you
check whether the solution below resolves the issue? On phones, we
don't encounter the scenarios you're facing.

From 2daaf91077705a8fa26a3a428117f158f05375b0 Mon Sep 17 00:00:00 2001
From: Barry Song
Date: Fri, 16 Aug 2024 10:51:48 +1200
Subject: [PATCH] mm: fall back to next_order if charging mTHP fails

When memcg approaches its limit, charging an mTHP becomes difficult.
In that case, once the charge fails, fall back to the next order to
avoid repeatedly retrying larger orders.

Reported-by: Kairui Song
Signed-off-by: Barry Song
---
 mm/memory.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0ed3603aaf31..6cba28ef91e7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4121,8 +4121,12 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	while (orders) {
 		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
 		folio = vma_alloc_folio(gfp, order, vma, addr, true);
-		if (folio)
-			return folio;
+		if (folio) {
+			if (!mem_cgroup_swapin_charge_folio(folio,
+					vma->vm_mm, gfp, entry))
+				return folio;
+			folio_put(folio);
+		}
 		order = next_order(&orders, order);
 	}
 
@@ -4244,7 +4248,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 				}
 				need_clear_cache = true;
 
-				if (mem_cgroup_swapin_charge_folio(folio,
+				if (nr_pages == 1 && mem_cgroup_swapin_charge_folio(folio,
 							vma->vm_mm, GFP_KERNEL,
 							entry)) {
 					ret = VM_FAULT_OOM;
-- 
2.34.1

Thanks
Barry