References: <20230817164733.2475092-1-nphamcs@gmail.com>
 <20230817190126.3155299-1-nphamcs@gmail.com>
 <20230818134906.GA138967@cmpxchg.org>
From: Yosry Ahmed
Date: Fri, 18 Aug 2023 09:30:54 -0700
Subject: Re: [PATCH v2] workingset: ensure memcg is valid for recency check
To: Nhat Pham
Cc: Johannes Weiner, Yu Zhao, akpm@linux-foundation.org, kernel-team@meta.com,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, stable@vger.kernel.org

On Fri, Aug 18, 2023 at 9:24 AM Nhat Pham wrote:
>
> On Fri, Aug 18, 2023 at 7:57 AM Yosry Ahmed wrote:
> >
> > On Fri, Aug 18, 2023 at 6:49 AM Johannes Weiner wrote:
> > >
> > > On Thu, Aug 17, 2023 at 05:12:17PM -0600, Yu Zhao wrote:
> > > > On Thu, Aug 17, 2023 at 4:50 PM Yosry Ahmed wrote:
> > > > >
> > > > > On Thu, Aug 17, 2023 at 3:43 PM Nhat Pham wrote:
> > > > > >
> > > > > > On Thu, Aug 17, 2023 at 1:50 PM Yosry Ahmed wrote:
> > > > > > >
> > > > > > > On Thu, Aug 17, 2023 at 12:01 PM Nhat Pham wrote:
> > > > > > > >
> > > > > > > > In eviction recency check, we are currently not holding a local
> > > > > > > > reference to the memcg that the refaulted folio belonged to when it was
> > > > > > > > evicted. This could cause serious memcg lifetime issues, for e.g in the
> > > > > > > > memcg hierarchy traversal done in mem_cgroup_get_nr_swap_pages(). This
> > > > > > > > has occurred in production:
> > > > > > > >
> > > > > > > > [ 155757.793456] BUG: kernel NULL pointer dereference, address: 00000000000000c0
> > > > > > > > [ 155757.807568] #PF: supervisor read access in kernel mode
> > > > > > > > [ 155757.818024] #PF: error_code(0x0000) - not-present page
> > > > > > > > [ 155757.828482] PGD 401f77067 P4D 401f77067 PUD 401f76067 PMD 0
> > > > > > > > [ 155757.839985] Oops: 0000 [#1] SMP
> > > > > > > > [ 155757.846444] CPU: 7 PID: 1380944 Comm: ThriftSrv-pri3- Kdump: loaded Tainted: G S 6.4.3-0_fbk1_rc0_594_g8d0cbcaa67ba #1
> > > > > > > > [ 155757.870808] Hardware name: Wiwynn Twin Lakes MP/Twin Lakes Passive MP, BIOS YMM16 05/24/2021
> > > > > > > > [ 155757.887870] RIP: 0010:mem_cgroup_get_nr_swap_pages+0x3d/0xb0
> > > > > > > > [ 155757.899377] Code: 29 19 4a 02 48 39 f9 74 63 48 8b 97 c0 00 00 00 48 8b b7 58 02 00 00 48 2b b7 c0 01 00 00 48 39 f0 48 0f 4d c6 48 39 d1 74 42 <48> 8b b2 c0 00 00 00 48 8b ba 58 02 00 00 48 2b ba c0 01 00 00 48
> > > > > > > > [ 155757.937125] RSP: 0018:ffffc9002ecdfbc8 EFLAGS: 00010286
> > > > > > > > [ 155757.947755] RAX: 00000000003a3b1c RBX: 000007ffffffffff RCX: ffff888280183000
> > > > > > > > [ 155757.962202] RDX: 0000000000000000 RSI: 0007ffffffffffff RDI: ffff888bbc2d1000
> > > > > > > > [ 155757.976648] RBP: 0000000000000001 R08: 000000000000000b R09: ffff888ad9cedba0
> > > > > > > > [ 155757.991094] R10: ffffea0039c07900 R11: 0000000000000010 R12: ffff888b23a7b000
> > > > > > > > [ 155758.005540] R13: 0000000000000000 R14: ffff888bbc2d1000 R15: 000007ffffc71354
> > > > > > > > [ 155758.019991] FS: 00007f6234c68640(0000) GS:ffff88903f9c0000(0000) knlGS:0000000000000000
> > > > > > > > [ 155758.036356] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > > > [ 155758.048023] CR2: 00000000000000c0 CR3: 0000000a83eb8004 CR4: 00000000007706e0
> > > > > > > > [ 155758.062473] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > > > > [ 155758.076924] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > > > > > [ 155758.091376] PKRU: 55555554
> > > > > > > > [ 155758.096957] Call Trace:
> > > > > > > > [ 155758.102016]
> > > > > > > > [ 155758.106502] ? __die+0x78/0xc0
> > > > > > > > [ 155758.112793] ? page_fault_oops+0x286/0x380
> > > > > > > > [ 155758.121175] ? exc_page_fault+0x5d/0x110
> > > > > > > > [ 155758.129209] ? asm_exc_page_fault+0x22/0x30
> > > > > > > > [ 155758.137763] ? mem_cgroup_get_nr_swap_pages+0x3d/0xb0
> > > > > > > > [ 155758.148060] workingset_test_recent+0xda/0x1b0
> > > > > > > > [ 155758.157133] workingset_refault+0xca/0x1e0
> > > > > > > > [ 155758.165508] filemap_add_folio+0x4d/0x70
> > > > > > > > [ 155758.173538] page_cache_ra_unbounded+0xed/0x190
> > > > > > > > [ 155758.182919] page_cache_sync_ra+0xd6/0x1e0
> > > > > > > > [ 155758.191738] filemap_read+0x68d/0xdf0
> > > > > > > > [ 155758.199495] ? mlx5e_napi_poll+0x123/0x940
> > > > > > > > [ 155758.207981] ? __napi_schedule+0x55/0x90
> > > > > > > > [ 155758.216095] __x64_sys_pread64+0x1d6/0x2c0
> > > > > > > > [ 155758.224601] do_syscall_64+0x3d/0x80
> > > > > > > > [ 155758.232058] entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > > > > > > > [ 155758.242473] RIP: 0033:0x7f62c29153b5
> > > > > > > > [ 155758.249938] Code: e8 48 89 75 f0 89 7d f8 48 89 4d e0 e8 b4 e6 f7 ff 41 89 c0 4c 8b 55 e0 48 8b 55 e8 48 8b 75 f0 8b 7d f8 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 45 f8 e8 e7 e6 f7 ff 48 8b
> > > > > > > > [ 155758.288005] RSP: 002b:00007f6234c5ffd0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
> > > > > > > > [ 155758.303474] RAX: ffffffffffffffda RBX: 00007f628c4e70c0 RCX: 00007f62c29153b5
> > > > > > > > [ 155758.318075] RDX: 000000000003c041 RSI: 00007f61d2986000 RDI: 0000000000000076
> > > > > > > > [ 155758.332678] RBP: 00007f6234c5fff0 R08: 0000000000000000 R09: 0000000064d5230c
> > > > > > > > [ 155758.347452] R10: 000000000027d450 R11: 0000000000000293 R12: 000000000003c041
> > > > > > > > [ 155758.362044] R13: 00007f61d2986000 R14: 00007f629e11b060 R15: 000000000027d450
> > > > > > > > [ 155758.376661]
> > > > > > > >
> > > > > > > > This patch fixes the issue by getting a local reference inside
> > > > > > > > unpack_shadow().
> > > > > > > >
> > > > > > > > Fixes: f78dfc7b77d5 ("workingset: fix confusion around eviction vs refault container")
> > > > > > >
> > > > > > > Beyond mem_cgroup_get_nr_swap_pages(), we still use the eviction_memcg
> > > > > > > without grabbing a ref to it first in workingset_test_recent() (and in
> > > > > > > workingset_refault() before that) as well as lru_gen_test_recent().
> > > > > > >
> > > > > > > Wouldn't the fix go back even further? or am I misinterpreting the problem?
> > > > > > Hmm I don't see eviction_memcg being used outside of *_test_recent
> > > > > > (the rest just uses memcg = folio_memcg(folio), which if I'm not mistaken is
> > > > > > the memcg that is refaulting the folio into memory).
> > > > > >
> > > > > > Inside workingset_test_recent(), the only other place where eviction_memcg
> > > > > > is used is for mem_cgroup_lruvec. This function call won't crash whether
> > > > > > eviction_memcg is valid or not.
> > > > >
> > > > > If eviction_memcg is invalid because the memory was already freed, we
> > > > > are basically dereferencing garbage in mem_cgroup_lruvec() aren't we?
> > > > >
> > > > > > The crash only happens during
> > > > > > mem_cgroup_get_nr_swap_pages, which has an upward traversal from
> > > > > > eviction_memcg to root.
> > > > > >
> > > > > > Let me know if this does not make sense and/or is insufficient to ensure
> > > > > > safe upward traversal from eviction_memcg to root!
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Signed-off-by: Nhat Pham
> > > > > > > > Cc: stable@vger.kernel.org
> > > > > > > > ---
> > > > > > > >  mm/workingset.c | 65 ++++++++++++++++++++++++++++++++-----------------
> > > > > > > >  1 file changed, 43 insertions(+), 22 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/mm/workingset.c b/mm/workingset.c
> > > > > > > > index da58a26d0d4d..03cadad4e484 100644
> > > > > > > > --- a/mm/workingset.c
> > > > > > > > +++ b/mm/workingset.c
> > > > > > > > @@ -206,10 +206,19 @@ static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
> > > > > > > >          return xa_mk_value(eviction);
> > > > > > > >  }
> > > > > > > >
> > > > > > > > -static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
> > > > > > > > -                          unsigned long *evictionp, bool *workingsetp)
> > > > > > > > +/*
> > > > > > > > + * Unpacks the stored fields of a shadow entry into the given pointers.
> > > > > > > > + *
> > > > > > > > + * The memcg pointer is only populated if the memcg recorded in the shadow
> > > > > > > > + * entry is valid. In this case, a reference to the memcg will be acquired,
> > > > > > > > + * and a corresponding mem_cgroup_put() will be needed when we no longer
> > > > > > > > + * need the memcg.
> > > > > > > > + */
> > > > > > > > +static void unpack_shadow(void *shadow, struct mem_cgroup **memcgp,
> > > > > > > > +                        pg_data_t **pgdat, unsigned long *evictionp, bool *workingsetp)
> > > > > > > >  {
> > > > > > > >          unsigned long entry = xa_to_value(shadow);
> > > > > > > > +        struct mem_cgroup *memcg;
> > > > > > > >          int memcgid, nid;
> > > > > > > >          bool workingset;
> > > > > > > >
> > > > > > > > @@ -220,7 +229,24 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
> > > > > > > >          memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
> > > > > > > >          entry >>= MEM_CGROUP_ID_SHIFT;
> > > > > > > >
> > > > > > > > -        *memcgidp = memcgid;
> > > > > > > > +        /*
> > > > > > > > +         * Look up the memcg associated with the stored ID. It might
> > > > > > > > +         * have been deleted since the folio's eviction.
> > > > > > > > +         *
> > > > > > > > +         * Note that in rare events the ID could have been recycled
> > > > > > > > +         * for a new cgroup that refaults a shared folio. This is
> > > > > > > > +         * impossible to tell from the available data. However, this
> > > > > > > > +         * should be a rare and limited disturbance, and activations
> > > > > > > > +         * are always speculative anyway. Ultimately, it's the aging
> > > > > > > > +         * algorithm's job to shake out the minimum access frequency
> > > > > > > > +         * for the active cache.
> > > > > > > > +         */
> > > > > > > > +        memcg = mem_cgroup_from_id(memcgid);
> > > > > > > > +        if (memcg && css_tryget(&memcg->css))
> > > > > > > > +                *memcgp = memcg;
> > > > > > > > +        else
> > > > > > > > +                *memcgp = NULL;
> > > > > > > > +
> > > > > > > >          *pgdat = NODE_DATA(nid);
> > > > > > > >          *evictionp = entry;
> > > > > > > >          *workingsetp = workingset;
> > > > > > > > @@ -262,15 +288,16 @@ static void *lru_gen_eviction(struct folio *folio)
> > > > > > > >  static bool lru_gen_test_recent(void *shadow, bool file, struct lruvec **lruvec,
> > > > > > > >                                  unsigned long *token, bool *workingset)
> > > > > > > >  {
> > > > > > > > -        int memcg_id;
> > > > > > > >          unsigned long min_seq;
> > > > > > > >          struct mem_cgroup *memcg;
> > > > > > > >          struct pglist_data *pgdat;
> > > > > > > >
> > > > > > > > -        unpack_shadow(shadow, &memcg_id, &pgdat, token, workingset);
> > > > > > > > +        unpack_shadow(shadow, &memcg, &pgdat, token, workingset);
> > > > > > > > +        if (!mem_cgroup_disabled() && !memcg)
> > > > > > > > +                return false;
> > > > > > >
> > > > > > > +Yu Zhao
> > > > > > >
> > > > > > > There is a change of behavior here, right?
> > > > > > >
> > > > > > > The existing code will continue if !mem_cgroup_disabled() && !memcg is
> > > > > > > true, and mem_cgroup_lruvec() will return the lruvec of the root
> > > > > > > memcg. Now we are just returning false.
> > > > > > >
> > > > > > > Is this intentional?
> > > > > > Oh right, there is. Should have cc-ed Yu Zhao as well, my bad.
> > > > > > get_maintainers.pl isn't always sufficient I guess :)
> > > > > >
> > > > > > But yeah, this behavioral change is intentional.
> > > > > >
> > > > > > Correct me if I'm wrong of course, but it seems like MGLRU should
> > > > > > follow the same pattern here. That is, once we return from unpack_shadow,
> > > > > > the possible scenarios are the same as prescribed in workingset_test_recent:
> > > > > >
> > > > > > 1. If mem_cgroup is disabled, we can ignore this check.
> > > > > > 2. If mem_cgroup is enabled, then the only reason why we get NULL
> > > > > > memcg from unpack_shadow is if the eviction_memcg is no longer
> > > > > > valid. We should not try to get its lruvec, or substitute it with the
> > > > > > root memcg, but return false right away (i.e not recent).
> > > > > >
> > > > > >
> > > > > I will leave this for Yu :)
> > > >
> > > > Thanks, Yosry.
> > > >
> > > > Hi Nhat, it seems unnecessary to me to introduce a get/put into
> > > > lru_gen_test_recent() because it doesn't suffer from the bug this
> > > > patch tries to fix. In theory, the extra get/put can impact
> > > > performance, though admittedly the impact is unlikely to be
> > > > measurable. Regardless, the general practice is to fix the bug
> > > > locally, i.e., when the mem_cgroup_get_nr_swap_pages() path is taken,
> > > > rather than change the unrelated path. Thank you.
> > >
> > > Hey guys,
> > >
> > > I had suggested to have it in unpack_shadow() to keep things simple,
> > > and not further complicate the lifetime rules in this code. The
> > > tryget() is against a per-cpu counter, so it's not expensive.
> > >
> > > The NULL deref is evidence that while *some* cgroup members are still
> > > accessible once it's dead, not all of it is. There is no explicit
> > > guarantee from the cgroup code that anything BUT the tryget() is still
> > > valid against group that is under rcu freeing.
> > >
> > > Since it isn't expensive, let's keep it simple and robust, and prevent
> > > future bugs of the same class, by always ensuring the cgroup is alive
> > > before accessing random members. Especially in non-cgroup code.
> >
> > I looked at this again today with fresh eyes, and I want to go back to
> > what I initially said. Isn't RCU protection in this case enough to
> > keep the memcg "valid" (i.e accessible, not garbage)? The tryget is
> > not a lot of complexity or performance tax, but I want to really
> > understand what's happening here.
> >
> > Looking at the code again, this seems to be the sequence of events on
> > the cgroup side:
> > - css_put() puts the last reference invoking a call to css_release()
> > - css_release() queues css_release_work_fn()
> > - css_release() does some bookkeeping, makes some callbacks, and
> > queues css_free_rwork_fn() to run *after* an RCU grace period.
> > - css_free_rwork_fn() makes callbacks to free the memory, ultimately
> > freeing the memcg.
> >
> > On the memcg idr side, the removal sequence of events seem to be:
> > - mem_cgroup_id_put() will decrement the id ref and check if falls to 0
> > - If the id ref falls to 0, we call mem_cgroup_id_remove() *then* css_put()
> >
> > On the workingset_refault() side, the sequence of events seems to be:
> > - rcu_read_lock()
> > - memcg = mem_cgroup_from_id()
> > - ... // use memcg
> > - rcu_read_unlock()
> >
> > So technically, after holding the rcu read lock, if we find the memcg
> > in the idr, it must be valid, and it must not be freed until after the
> > rcu read section is completed. It's not just the cgroup internal
> > implementation, it's the contract between cgroup core and controllers
> > such as memcg.
> >
> > The memory controller expects a sequence of callbacks during freeing:
> > css_offline() -> css_released() -> css_free(). So memcg code is within
> > its right to access any fields of struct mem_cgroup that are not freed
> > by the css_offline() or css_released() until css_free() is called,
> > right?
> >
> >
> >
> > Here is a guess / question, because I am not really familiar with
> > memory barriers and such, but is it at all possible that the actual
> > problem is reordering of instructions in mem_cgroup_id_put_many(),
> > such that we actually execute css_put() *before*
> > mem_cgroup_id_remove()?
> >
> > If this happens it seems possible for this to happen:
> >
> > cpu #1                                  cpu#2
> >                                         css_put()
> >                                         /* css_free_rwork_fn is queued */
> > rcu_read_lock()
> > mem_cgroup_from_id()
> >                                         mem_cgroup_id_remove()
> > /* access memcg */
> >
> > If I understand correctly, if css_free_rwork_fn() is queued before the
> > rcu_read_lock in workingset_refault() begins, then it can be executed
> > during the rcu read section, and the memcg can be freed at any point
> > from under us. Perhaps what we need is memory barriers to ensure
> > correct ordering in mem_cgroup_id_put_many()? I am not sure if
> > rcu_read_lock() implies a barrier on the other side.
> >
> > Sorry if this is all off, I am just trying to understand what's going on.
> Ah that is wild. That does sound plausible.
> In this case, maybe something like this?
>
> mem_cgroup_id_remove(memcg);
> /*
>  * Preventing css_put from happening before id removal due to
>  * instruction reordering.
>  * This is redefining what smp_mb() is, probably unnecessary.
>  * This guarantees that if a non-null memcg is acquired from ID within
>  * an RCU read section, its css won't be freed for the
>  * duration of this section.
>  */
> smp_mb();
> /* Memcg ID pins CSS */
> css_put(&memcg->css);

I am not the best person to answer this question. Ideally, someone with
more understanding of memory barriers should chime in here to:
- Confirm my theory is correct.
- Confirm smp_mb() is the correct primitive to use. I am guessing
  smp_wmb() is enough here.
- Confirm that we don't need an additional read barrier on the read
  side. Ideally rcu_read_lock() is enough, but I am not sure.
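
To make the two orderings under discussion easier to reason about, here is
a minimal userspace sketch, not kernel code. It uses C11 atomics, whose
rules differ from the kernel memory model, so it does not settle whether
smp_mb() or smp_wmb() is needed. Every toy_* name is hypothetical:
toy_idr stands in for the memcg IDR, toy_tryget() for css_tryget(), and
toy_id_remove_then_put() for the "remove the ID, then css_put()" step in
mem_cgroup_id_put_many().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup; not the kernel type. */
struct toy_memcg {
	atomic_int refs;	/* models the css reference count */
	int id;			/* models the memcg ID kept in the shadow entry */
};

#define TOY_MAX_ID 8

/* Models the memcg IDR: maps an ID to a published object, or NULL. */
static _Atomic(struct toy_memcg *) toy_idr[TOY_MAX_ID];

/* Models css_tryget(): take a reference only while the count is nonzero. */
static bool toy_tryget(struct toy_memcg *m)
{
	int old = atomic_load_explicit(&m->refs, memory_order_relaxed);

	while (old != 0) {
		/* On failure, old is reloaded with the current count. */
		if (atomic_compare_exchange_weak_explicit(&m->refs, &old, old + 1,
							  memory_order_acquire,
							  memory_order_relaxed))
			return true;
	}
	return false;
}

/* Drops one reference; returns true if it was the last one. */
static bool toy_put(struct toy_memcg *m)
{
	int old = atomic_fetch_sub_explicit(&m->refs, 1, memory_order_release);

	assert(old > 0);
	return old == 1;
}

/* Reader side: look up by ID and pin the object before using it. */
static struct toy_memcg *toy_lookup_get(int id)
{
	struct toy_memcg *m = atomic_load_explicit(&toy_idr[id],
						   memory_order_acquire);

	return (m && toy_tryget(m)) ? m : NULL;
}

/* Teardown side: unpublish the ID first, then drop the reference it held. */
static bool toy_id_remove_then_put(struct toy_memcg *m)
{
	atomic_store_explicit(&toy_idr[m->id], (struct toy_memcg *)NULL,
			      memory_order_release);
	return toy_put(m);
}
```

In this model a reader either misses the object in toy_idr or has to win
toy_tryget() before touching it, which is the property the v2 patch adds
to unpack_shadow(). The release ordering on the final decrement is only
the C11 analogue of the barrier question above, included to show where
the ordering would have to sit.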