From: Chris Li
Date: Thu, 22 Aug 2024 13:16:11 -0700
Subject: Re: [syzbot] [mm?] WARNING in zswap_swapoff
To: Yosry Ahmed
Cc: Kairui Song, Barry Song <21cnbao@gmail.com>, akpm@linux-foundation.org,
 chengming.zhou@linux.dev, hannes@cmpxchg.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, nphamcs@gmail.com, ryan.roberts@arm.com,
 syzbot+ce6029250d7fd4d0476d@syzkaller.appspotmail.com,
 syzkaller-bugs@googlegroups.com, ying.huang@intel.com
References: <20240821054921.43468-1-21cnbao@gmail.com>

On Thu, Aug 22, 2024 at 11:13 AM Yosry Ahmed wrote:
>
> On Tue, Aug 20, 2024 at 11:42 PM Kairui Song wrote:
> >
> > On Wed, Aug 21, 2024 at 1:49 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Tue, Aug 20, 2024 at 9:02 PM Kairui Song wrote:
> > > >
> > > > On Tue, Aug 20, 2024 at 4:47 PM Kairui Song wrote:
> > > > >
> > > > > On Tue, Aug 20, 2024 at 4:13 AM Yosry Ahmed wrote:
> > > > > > On Fri, Aug 16, 2024 at 12:52 PM syzbot
> > > > > > wrote:
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > syzbot found the following issue on:
> > > > > > >
> > > > > > > HEAD commit: 367b5c3d53e5 Add linux-next specific files for 20240816
> > > > >
> > > > > I can't find this commit, seems this commit is not in linux-next any more?
> > > > >
> > > > > > > git tree: linux-next
> > > > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=12489105980000
> > > > > > > kernel config: https://syzkaller.appspot.com/x/.config?x=61ba6f3b22ee5467
> > > > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=ce6029250d7fd4d0476d
> > > > > > > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
> > > > > > >
> > > > > > > Unfortunately, I don't have any reproducer for this issue yet.
> > > > > > >
> > > > > > > Downloadable assets:
> > > > > > > disk image: https://storage.googleapis.com/syzbot-assets/0b1b4e3cad3c/disk-367b5c3d.raw.xz
> > > > > > > vmlinux: https://storage.googleapis.com/syzbot-assets/5bb090f7813c/vmlinux-367b5c3d.xz
> > > > > > > kernel image: https://storage.googleapis.com/syzbot-assets/6674cb0709b1/bzImage-367b5c3d.xz
> > > > > > >
> > > > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > > > > > Reported-by: syzbot+ce6029250d7fd4d0476d@syzkaller.appspotmail.com
> > > > > > >
> > > > > > > ------------[ cut here ]------------
> > > > > > > WARNING: CPU: 0 PID: 11298 at mm/zswap.c:1700 zswap_swapoff+0x11b/0x2b0 mm/zswap.c:1700
> > > > > > > Modules linked in:
> > > > > > > CPU: 0 UID: 0 PID: 11298 Comm: swapoff Not tainted 6.11.0-rc3-next-20240816-syzkaller #0
> > > > > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/27/2024
> > > > > > > RIP: 0010:zswap_swapoff+0x11b/0x2b0 mm/zswap.c:1700
> > > > > > > Code: 74 05 e8 78 73 07 00 4b 83 7c 35 00 00 75 15 e8 1b bd 9e ff 48 ff c5 49 83 c6 50 83 7c 24 0c 17 76 9b eb 24 e8 06 bd 9e ff 90 <0f> 0b 90 eb e5 48 8b 0c 24 80 e1 07 80 c1 03 38 c1 7c 90 48 8b 3c
> > > > > > > RSP: 0018:ffffc9000302fa38 EFLAGS: 00010293
> > > > > > > RAX: ffffffff81f4d66a RBX: dffffc0000000000 RCX: ffff88802c19bc00
> > > > > > > RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff888015986248
> > > > > > > RBP: 0000000000000000 R08: ffffffff81f4d620 R09: 1ffffffff1d476ac
> > > > > > > R10: dffffc0000000000 R11: fffffbfff1d476ad R12: dffffc0000000000
> > > > > > > R13: ffff888015986200 R14: 0000000000000048 R15: 0000000000000002
> > > > > > > FS: 00007f9e628a5380(0000) GS:ffff8880b9000000(0000) knlGS:0000000000000000
> > > > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > > CR2: 0000001b30f15ff8 CR3: 000000006c5f0000 CR4: 00000000003506f0
> > > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > > > > Call Trace:
> > > > > > >
> > > > > > >  __do_sys_swapoff mm/swapfile.c:2837 [inline]
> > > > > > >  __se_sys_swapoff+0x4653/0x4cf0 mm/swapfile.c:2706
> > > > > > >  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> > > > > > >  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
> > > > > > >  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > > > > > > RIP: 0033:0x7f9e629feb37
> > > > > > > Code: 73 01 c3 48 8b 0d f1 52 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a8 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c1 52 0d 00 f7 d8 64 89 01 48
> > > > > > > RSP: 002b:00007fff17734f68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a8
> > > > > > > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9e629feb37
> > > > > > > RDX: 00007f9e62a9e7e8 RSI: 00007f9e62b9beed RDI: 0000563090942a20
> > > > > > > RBP: 0000563090942a20 R08: 0000000000000000 R09: 77872e07ed164f94
> > > > > > > R10: 000000000000001f R11: 0000000000000246 R12: 00007fff17735188
> > > > > > > R13: 00005630909422a0 R14: 0000563073724169 R15: 00007f9e62bdda80
> > > > > > >
> > > > > >
> > > > > > I am hoping syzbot would find a reproducer and bisect this for us.
> > > > > > Meanwhile, from a high-level it looks to me like we are missing a
> > > > > > zswap_invalidate() call in some paths.
> > > > > >
> > > > > > If I have to guess, I would say it's related to the latest mTHP swap
> > > > > > changes, but I am not following closely. Perhaps one of the following
> > > > > > things happened:
> > > > > >
> > > > > > (1) We are not calling zswap_invalidate() in some invalidation paths.
> > > > > > It used to not be called for the cluster freeing path, so maybe we end
> > > > > > up with some order-0 swap entries in a cluster? or maybe there is an
> > > > > > entirely new invalidation path that does not go through
> > > > > > free_swap_slot() for order-0 entries?
> > > > > >
> > > > > > (2) Some higher order swap entries (i.e. a cluster) end up in zswap
> > > > > > somehow. zswap_store() has a warning to cover that though. Maybe
> > > > > > somehow some swap entries are allocated as a cluster, but then pages
> > > > > > are swapped out one-by-one as order-0 (which can go to zswap), but
> > > > > > then we still free the swap entries as a cluster?
> > > > >
> > > > > Hi Yosry, thanks for the report.
> > > > >
> > > > > There are many mTHP related optimizations recently, for this problem I
> > > > > can reproduce this locally. Can confirm the problem is gone for me
> > > > > after reverting:
> > > > >
> > > > > "mm: attempt to batch free swap entries for zap_pte_range()"
> > > > >
> > > > > Hi Barry,
> > > > >
> > > > > If a set of continuous slots are having the same value, they are
> > > > > considered a mTHP and freed, bypassing the slot cache, and causing
> > > > > zswap leak.
> > > > > This didn't happen in put_swap_folio because that function is
> > > > > expecting an actual mTHP folio behind the slots but
> > > > > free_swap_and_cache_nr is simply walking the slots.
> > > > >
> > > > > For the testing, I actually have to disable mTHP, because linux-next
> > > > > will panic with mTHP due to lack of following fixes:
> > > > > https://lore.kernel.org/linux-mm/a4b1b34f-0d8c-490d-ab00-eaedbf3fe780@gmail.com/
> > > > > https://lore.kernel.org/linux-mm/403b7f3c-6e5b-4030-ab1c-3198f36e3f73@gmail.com/
> > > > >
> > > > > >
> > > > > > I am not closely following the latest changes so I am not sure. CCing
> > > > > > folks who have done work in that area recently.
> > > > > >
> > > > > > I am starting to think maybe it would be more reliable to just call
> > > > > > zswap_invalidate() for all freed swap entries anyway. Would that be
> > > > > > too expensive? We used to do that before the zswap_invalidate() call
> > > > > > was moved by commit 0827a1fb143f ("mm/zswap: invalidate zswap entry
> > > > > > when swap entry free"), and that was before we started using the
> > > > > > xarray (so it was arguably worse than it would be now).
> > > > > >
> > > > >
> > > > > That might be a good idea, I suggest moving zswap_invalidate to
> > > > > swap_range_free and call it for every freed slot.
> > > > >
> > > > > Below patch can be squash into or put before "mm: attempt to batch
> > > > > free swap entries for zap_pte_range()".
> > > >
> > > > Hmm, on second thought, the commit message in the attachment commit
> > > > might be not suitable, current zswap_invalidate is also designed to
> > > > only work for order 0 ZSWAP, so things are not clean even after this.
> > >
> > > Kairui, what about the below? we don't touch the path of __try_to_reclaim_swap() where
> > > you have one folio backed?
> > >
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index c1638a009113..8ff58be40544 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -1514,6 +1514,8 @@ static bool __swap_entries_free(struct swap_info_struct *si,
> > >  	unlock_cluster_or_swap_info(si, ci);
> > >
> > >  	if (!has_cache) {
> > > +		for (i = 0; i < nr; i++)
> > > +			zswap_invalidate(swp_entry(si->type, offset + i));
> > >  		spin_lock(&si->lock);
> > >  		swap_entry_range_free(si, entry, nr);
> > >  		spin_unlock(&si->lock);
> > >
> >
> > Hi Barry,
> >
> > Thanks for updating this thread, I'm thinking maybe something will
> > better be done at the zswap side?
> >
> > The concern of using zswap_invalidate is that it calls xa_erase which
> > requires the xa spin lock. But if we are calling zswap_invalidate in
> > swap_entry_range_free, and ensure the slot is HAS_CACHE pinned, doing
> > a lockless read first with xa_load should be OK for checking if the
> > slot needs a ZSWAP invalidation. The performance cost will be minimal
> > and we only need to call zswap_invalidate in one place, something like
> > this (haven't tested, comments are welcome). Also ZSWAP mthp will
> > still store entried in order 0 so this should be OK for future.
> >
> While I do agree with this change on a high level, it's essentially
> reverting commit 0827a1fb143f ("mm/zswap: invalidate zswap entry when
> swap entry free") which fixed a small problem with zswap writeback.
> I'd prefer that we don't if possible.
>
> One thing that I always wanted to do is to pull some of the work done
> in swap_entry_range_free() and swap_range_free() before the slots
> caching layer. The memcg uncharging, clearing shadow entries from the
> swap cache, arch invalidation, zswap invalidation, etc. If we can have
> a hook for these pre-free callbacks we can call it for single entries
> before we add them to the slots cache, and call them for the clusters
> as we do today. This should also reduce the amount of work done under
> the lock, and move more work to where the freeing is actually
> happening vs. the cache draining.
>
> I remember discussing this briefly with Ying before. Anyone have any thoughts?

Hi Yosry,

If I understand correctly, the lock you are talking about is the
si->lock, right?

Kairui has some WIP patches that remove the swap slot cache from the
swap entry freeing path. Basically, the si->lock is only used to
protect the cluster lists. Most of the time, freeing a swap entry only
takes the ci->lock; it does not need the si->lock, because the cluster
lists do not change. The si->lock is required only when a cluster moves
to another list, e.g. when a cluster moves to the free list after all
512 of its entries are freed. Because each cluster has 512 entries,
the need to take the si->lock is dramatically reduced.

That patch is based on the new cluster swap allocator series. Kairui
can share more details.

I don't think the ci->lock is heavily contended.

Chris

>
> >
> > diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > index 13ab3b771409..d7bb3caa9d4e 100644
> > --- a/mm/swap_slots.c
> > +++ b/mm/swap_slots.c
> > @@ -273,9 +273,6 @@ void free_swap_slot(swp_entry_t entry)
> >  {
> >  	struct swap_slots_cache *cache;
> >
> > -	/* Large folio swap slot is not covered. */
> > -	zswap_invalidate(entry);
> > -
> >  	cache = raw_cpu_ptr(&swp_slots);
> >  	if (likely(use_swap_slot_cache && cache->slots_ret)) {
> >  		spin_lock_irq(&cache->free_lock);
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index f947f4dd31a9..fbc25d38a27e 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -242,9 +242,6 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> >  	folio_set_dirty(folio);
> >
> >  	spin_lock(&si->lock);
> > -	/* Only sinple page folio can be backed by zswap */
> > -	if (nr_pages == 1)
> > -		zswap_invalidate(entry);
> >  	swap_entry_range_free(si, entry, nr_pages);
> >  	spin_unlock(&si->lock);
> >  	ret = nr_pages;
> > @@ -1545,6 +1542,10 @@ static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry
> >  	unsigned char *map_end = map + nr_pages;
> >  	struct swap_cluster_info *ci;
> >
> > +	/* Slots are pinned with SWAP_HAS_CACHE, safe to invalidate */
> > +	for (int i = 0; i < nr_pages; ++i)
> > +		zswap_invalidate(swp_entry(si->type, offset + i));
> > +
> >  	ci = lock_cluster(si, offset);
> >  	do {
> >  		VM_BUG_ON(*map != SWAP_HAS_CACHE);
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index df66ab102d27..100ad04397fe 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1656,15 +1656,18 @@ bool zswap_load(struct folio *folio)
> >  	return true;
> >  }
> >
> > +/* Caller need to pin the slot to prevent parallel store */
> >  void zswap_invalidate(swp_entry_t swp)
> >  {
> >  	pgoff_t offset = swp_offset(swp);
> >  	struct xarray *tree = swap_zswap_tree(swp);
> >  	struct zswap_entry *entry;
> >
> > -	entry = xa_erase(tree, offset);
> > -	if (entry)
> > -		zswap_entry_free(entry);
> > +	if (xa_load(tree, offset)) {
> > +		entry = xa_erase(tree, offset);
> > +		if (entry)
> > +			zswap_entry_free(entry);
> > +	}
> >  }
> >
> >  int zswap_swapon(int type, unsigned long nr_pages)
> > --
> > 2.45.2
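
The check-before-erase idea in the quoted zswap_invalidate() hunk can be modelled in userspace with a minimal single-threaded sketch in C. This is an illustration under stated assumptions, not the kernel implementation: `model_tree`, `model_load()`, `model_erase()` and `model_invalidate()` are hypothetical stand-ins for the xarray, xa_load(), xa_erase() and zswap_invalidate(), and the tree lock is reduced to a counter. As the patch comment says, the real pattern relies on the caller pinning the slot so that no parallel store can race with the unlocked read.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_SLOTS 16

/* Hypothetical stand-in for the per-device zswap tree. */
struct model_tree {
	void *slot[MODEL_SLOTS];
	unsigned long lock_taken;     /* times the tree lock would be taken */
};

/* Models xa_load(): a read that does not take the tree lock. */
static void *model_load(const struct model_tree *t, size_t off)
{
	return t->slot[off];
}

/* Models xa_erase(): takes the tree lock, removes and returns the old entry. */
static void *model_erase(struct model_tree *t, size_t off)
{
	void *old;

	t->lock_taken++;              /* stands in for the xarray spin lock */
	old = t->slot[off];
	t->slot[off] = NULL;
	return old;
}

/* Returns 1 if an entry was removed, 0 if the slot was already empty. */
static int model_invalidate(struct model_tree *t, size_t off)
{
	if (!model_load(t, off))
		return 0;             /* nothing stored: skip the lock entirely */
	return model_erase(t, off) != NULL;
}
```

Invalidating an empty slot leaves `lock_taken` unchanged, while invalidating a populated slot takes the modelled lock exactly once. That is why calling the function for every freed slot is expected to be cheap when most slots have no zswap entry.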