From: Yosry Ahmed
Date: Tue, 30 Jan 2024 00:22:14 +0000
Subject: Re: [PATCH v2 2/3] mm/zswap: fix race between lru writeback and swapoff
To: Chengming Zhou
Cc: Johannes Weiner, Nhat Pham, Andrew Morton, Chris Li, linux-kernel@vger.kernel.org, linux-mm@kvack.org
In-Reply-To: <20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com>
References: <20240126-zswap-writeback-race-v2-0-b10479847099@bytedance.com> <20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com>
On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
> LRU writeback has a race problem with swapoff, as spotted by Yosry [1]:
>
> CPU1                           CPU2
> shrink_memcg_cb                swap_off
>   list_lru_isolate             zswap_invalidate
>                                zswap_swapoff
>                                  kfree(tree)
>   // UAF
>   spin_lock(&tree->lock)
>
> The problem is that the entry on the LRU list can't protect the tree
> from being freed by swapoff, and the entry itself can also be
> invalidated and freed concurrently after we unlock the LRU lock.
>
> We can fix it by moving the swap cache allocation ahead of any
> reference to the tree, then checking for the invalidation race under
> the tree lock; only after that can we safely deref the entry. Note
> that we can't deref the entry or the tree anymore once we unlock the
> folio, since we depend on the locked folio to hold off swapoff.
>
> So this patch moves all tree and entry usage into
> zswap_writeback_entry(); we only use the copied swpentry on the stack
> to allocate the swap cache, and once it returns with the folio locked
> we can reference the tree safely. Then we check for the invalidation
> race under the tree lock; what follows is much the same as in
> zswap_load().
>
> Since we can't deref the entry after zswap_writeback_entry(), we can't
> use zswap_lru_putback() anymore; instead we rotate the entry at the
> beginning, and it will be unlinked and freed by the invalidation if
> writeback succeeds.

You are also removing one redundant lookup from the zswap writeback path
to check for the invalidation race, and simplifying the code. Give
yourself full credit :)

> Another change is that we don't update the memcg nr_zswap_protected in
> the -ENOMEM and -EEXIST cases anymore. -EEXIST means we raced with
> swapin or a concurrent shrinker action; since swapin has already
> updated the memcg nr_zswap_protected, there is no need to double count
> here. For a concurrent shrinker, the folio will be written back and
> freed anyway. The -ENOMEM case is extremely rare and doesn't happen
> spuriously either, so don't bother distinguishing it.
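To make the ordering concrete for other readers, the resulting sequence
in zswap_writeback_entry() is roughly the following (a paraphrased
sketch of this series, not the exact hunk; error unwinding abbreviated):

	/* Only the stack copy of the swp_entry_t is used up to here. */
	mpol = get_task_policy(current);
	folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
				NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
	if (!folio)
		return -ENOMEM;

	/* An existing folio means we raced with swapin; the folio is hot. */
	if (!folio_was_allocated) {
		folio_put(folio);
		return -EEXIST;
	}

	/*
	 * The locked, newly allocated swapcache folio secures the swap
	 * slot and holds off swapoff, so only now is it safe to deref
	 * the tree and revalidate the entry under the tree lock.
	 */
	tree = swap_zswap_tree(swpentry);
	spin_lock(&tree->lock);
	if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
		/* Entry was invalidated while we slept. */
		spin_unlock(&tree->lock);
		delete_from_swap_cache(folio);
		folio_unlock(folio);
		folio_put(folio);
		return -ENOMEM;
	}
	spin_unlock(&tree->lock);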
>
> [1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
>
> Acked-by: Johannes Weiner
> Acked-by: Nhat Pham
> Signed-off-by: Chengming Zhou
> ---
>  mm/zswap.c | 114 ++++++++++++++++++++++++++-----------------------------------
>  1 file changed, 49 insertions(+), 65 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 81cb3790e0dd..f5cb5a46e4d7 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -277,7 +277,7 @@ static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
>  		 zpool_get_type((p)->zpools[0]))
>  
>  static int zswap_writeback_entry(struct zswap_entry *entry,
> -				 struct zswap_tree *tree);
> +				 swp_entry_t swpentry);
>  static int zswap_pool_get(struct zswap_pool *pool);
>  static void zswap_pool_put(struct zswap_pool *pool);
>  
> @@ -445,27 +445,6 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
>  	rcu_read_unlock();
>  }
>  
> -static void zswap_lru_putback(struct list_lru *list_lru,
> -			      struct zswap_entry *entry)
> -{
> -	int nid = entry_to_nid(entry);
> -	spinlock_t *lock = &list_lru->node[nid].lock;
> -	struct mem_cgroup *memcg;
> -	struct lruvec *lruvec;
> -
> -	rcu_read_lock();
> -	memcg = mem_cgroup_from_entry(entry);
> -	spin_lock(lock);
> -	/* we cannot use list_lru_add here, because it increments node's lru count */
> -	list_lru_putback(list_lru, &entry->lru, nid, memcg);
> -	spin_unlock(lock);
> -
> -	lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
> -	/* increment the protection area to account for the LRU rotation. */
> -	atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> -	rcu_read_unlock();
> -}
> -
>  /*********************************
>  * rbtree functions
>  **********************************/
> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>  {
>  	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>  	bool *encountered_page_in_swapcache = (bool *)arg;
> -	struct zswap_tree *tree;
> -	pgoff_t swpoffset;
> +	swp_entry_t swpentry;
>  	enum lru_status ret = LRU_REMOVED_RETRY;
>  	int writeback_result;
>  
> +	/*
> +	 * Rotate the entry to the tail before unlocking the LRU,
> +	 * so that in case of an invalidation race concurrent
> +	 * reclaimers don't waste their time on it.
> +	 *
> +	 * If writeback succeeds, or failure is due to the entry
> +	 * being invalidated by the swap subsystem, the invalidation
> +	 * will unlink and free it.
> +	 *
> +	 * Temporary failures, where the same entry should be tried
> +	 * again immediately, almost never happen for this shrinker.
> +	 * We don't do any trylocking; -ENOMEM comes closest,
> +	 * but that's extremely rare and doesn't happen spuriously
> +	 * either. Don't bother distinguishing this case.
> +	 *
> +	 * But since they do exist in theory, the entry cannot just
> +	 * be unlinked, or we could leak it. Hence, rotate.

The entry cannot be unlinked because we cannot get a ref on it without
holding the tree lock, and we cannot deref the tree before we acquire a
swap cache ref in zswap_writeback_entry() -- or if
zswap_writeback_entry() fails. This should be called out explicitly
somewhere. Perhaps we can describe this whole deref dance with added
docs to shrink_memcg_cb().

We could also use a comment in zswap_writeback_entry() (or above it) to
state that we only deref the tree *after* we get the swapcache ref.
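For instance, something along these lines above zswap_writeback_entry()
(just strawman wording, feel free to rephrase):

	/*
	 * Attempt to write back an entry that is on (or racing off) the
	 * LRU. The caller passes a stack copy of the swp_entry_t because
	 * the entry, and the tree it lives in, can be invalidated and
	 * freed at any point by concurrent invalidation or swapoff.
	 * Neither may be dereferenced until __read_swap_cache_async()
	 * returns a locked, newly allocated swapcache folio, which pins
	 * the swap slot and holds off swapoff; only then is it safe to
	 * deref the tree and revalidate the entry under tree->lock.
	 */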