From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 26 Jan 2024 10:31:26 -0500
From: Johannes Weiner <hannes@cmpxchg.org>
To: chengming.zhou@linux.dev
Cc: yosryahmed@google.com, nphamcs@gmail.com, akpm@linux-foundation.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, Chengming Zhou
Subject: Re: [PATCH 2/2] mm/zswap: fix race between lru writeback and swapoff
Message-ID: <20240126153126.GG1567330@cmpxchg.org>
References: <20240126083015.3557006-1-chengming.zhou@linux.dev>
 <20240126083015.3557006-2-chengming.zhou@linux.dev>
In-Reply-To: <20240126083015.3557006-2-chengming.zhou@linux.dev>
On Fri, Jan 26, 2024 at 08:30:15AM +0000, chengming.zhou@linux.dev wrote:
> From: Chengming Zhou
>
> LRU writeback has race problem with swapoff, as spotted by Yosry[1]:
>
>         CPU1                           CPU2
>         shrink_memcg_cb                swap_off
>           list_lru_isolate
>                                          zswap_invalidate
>                                          zswap_swapoff
>                                            kfree(tree)
>           // UAF
>           spin_lock(&tree->lock)
>
> The problem is that the entry in lru list can't protect the tree from
> being swapoff and freed, and the entry also can be invalidated and freed
> concurrently after we unlock the lru lock.
>
> We can fix it by moving the swap cache allocation ahead before
> referencing the tree, then check invalidate race with tree lock,
> only after that we can safely deref the entry. Note we couldn't
> deref entry or tree anymore after we unlock the folio, since we
> depend on this to hold on swapoff.

This is a great simplification on top of being a bug fix.

> So this patch moves all tree and entry usage to zswap_writeback_entry(),
> we only use the copied swpentry on the stack to allocate swap cache
> and return with folio locked, after which we can reference the tree.
> Then check invalidate race with tree lock, the following things is
> much the same like zswap_load().
>
> Since we can't deref the entry after zswap_writeback_entry(), we
> can't use zswap_lru_putback() anymore, instead we rotate the entry
> in the LRU list so concurrent reclaimers have little chance to see it.
> Or it will be deleted from LRU list if writeback success.
>
> Another confusing part to me is the update of memcg nr_zswap_protected
> in zswap_lru_putback(). I'm not sure why it's needed here since
> if we raced with swapin, memcg nr_zswap_protected has already been
> updated in zswap_folio_swapin(). So not include this part for now.

Good observation. Technically, it could also fail on -ENOMEM, but in
practice these size allocations don't fail, especially since the
shrinker runs in PF_MEMALLOC context. The shrink_worker might be
affected, since it doesn't run in PF_MEMALLOC context. But the common
case is -EEXIST, which indeed double counts.

To make it "correct", you'd have to grab an objcg reference while
holding the LRU lock, and also re-order the objcg put on entry freeing
to after the LRU del. This is probably not worth doing.
But it could use a comment.

I was going to ask if you could reorder objcg uncharging after LRU
deletion to make it more robust for future changes in that direction.
However, staring at this, I notice this is a second UAF bug:

	if (entry->objcg) {
		obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
		obj_cgroup_put(entry->objcg);
	}
	if (!entry->length)
		atomic_dec(&zswap_same_filled_pages);
	else {
		zswap_lru_del(&entry->pool->list_lru, entry);

zswap_lru_del() uses entry->objcg to determine the list_lru memcg, but
the put may have killed it. I'll send a separate patch on top.

> @@ -860,40 +839,34 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>  {
>  	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>  	bool *encountered_page_in_swapcache = (bool *)arg;
> -	struct zswap_tree *tree;
> -	pgoff_t swpoffset;
> +	swp_entry_t swpentry;
>  	enum lru_status ret = LRU_REMOVED_RETRY;
>  	int writeback_result;
>
> +	/*
> +	 * First rotate to the tail of lru list before unlocking lru lock,
> +	 * so the concurrent reclaimers have little chance to see it.
> +	 * It will be deleted from the lru list if writeback success.
> +	 */
> +	list_move_tail(item, &l->list);

We don't hold a reference to the object, so there could also be an
invalidation waiting on the LRU lock, which will free the entry even
when writeback fails.

It would also be good to expand on the motivation, because it's not
clear WHY you'd want to hide it from other reclaimers.

Lastly, maybe mention the story around temporary failures? Most
shrinkers have a lock inversion pattern (object lock -> LRU lock for
linking versus LRU lock -> object trylock during reclaim) that can
fail and require the same object be tried again before advancing.

How about this?

	/*
	 * Rotate the entry to the tail before unlocking the LRU,
	 * so that in case of an invalidation race concurrent
	 * reclaimers don't waste their time on it.
	 *
	 * If writeback succeeds, or failure is due to the entry
	 * being invalidated by the swap subsystem, the invalidation
	 * will unlink and free it.
	 *
	 * Temporary failures, where the same entry should be tried
	 * again immediately, almost never happen for this shrinker.
	 * We don't do any trylocking; -ENOMEM comes closest,
	 * but that's extremely rare and doesn't happen spuriously
	 * either. Don't bother distinguishing this case.
	 *
	 * But since they do exist in theory, the entry cannot just
	 * be unlinked, or we could leak it. Hence, rotate.
	 */

Otherwise, looks great to me.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>