From: Chengming Zhou <chengming.zhou@linux.dev>
Date: Sat, 27 Jan 2024 22:53:09 +0800
Subject: Re: [PATCH 2/2] mm/zswap: fix race between lru writeback and swapoff
To: Johannes Weiner
Cc: yosryahmed@google.com, nphamcs@gmail.com, akpm@linux-foundation.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Chengming Zhou
In-Reply-To: <20240126153126.GG1567330@cmpxchg.org>
References: <20240126083015.3557006-1-chengming.zhou@linux.dev>
 <20240126083015.3557006-2-chengming.zhou@linux.dev>
 <20240126153126.GG1567330@cmpxchg.org>

On 2024/1/26 23:31, Johannes Weiner wrote:
> On Fri, Jan 26, 2024 at 08:30:15AM +0000, chengming.zhou@linux.dev wrote:
>> From: Chengming Zhou
>>
>> LRU writeback has a race problem with swapoff, as spotted by Yosry [1]:
>>
>>   CPU1                           CPU2
>>   shrink_memcg_cb                swap_off
>>     list_lru_isolate               zswap_invalidate
>>                                      zswap_swapoff
>>                                        kfree(tree)
>>     // UAF
>>     spin_lock(&tree->lock)
>>
>> The problem is that the entry in the LRU list can't protect the tree from
>> being swapped off and freed, and the entry can also be invalidated and
>> freed concurrently after we unlock the LRU lock.
>>
>> We can fix it by moving the swap cache allocation ahead, before
>> referencing the tree, then checking for the invalidate race with the tree
>> lock held; only after that can we safely deref the entry. Note that we
>> can't deref the entry or tree anymore after we unlock the folio, since we
>> depend on that to hold off swapoff.
>
> This is a great simplification on top of being a bug fix.
>
>> So this patch moves all tree and entry usage into zswap_writeback_entry();
>> we only use the copied swpentry on the stack to allocate the swap cache
>> and return with the folio locked, after which we can reference the tree.
>> Then we check for the invalidate race with the tree lock; the rest is
>> much the same as zswap_load().
>>
>> Since we can't deref the entry after zswap_writeback_entry(), we
>> can't use zswap_lru_putback() anymore; instead we rotate the entry
>> in the LRU list so concurrent reclaimers have little chance to see it,
>> or it will be deleted from the LRU list if writeback succeeds.
>>
>> Another confusing part to me is the update of memcg nr_zswap_protected
>> in zswap_lru_putback(). I'm not sure why it's needed here, since
>> if we raced with swapin, memcg nr_zswap_protected has already been
>> updated in zswap_folio_swapin(). So this part is not included for now.
>
> Good observation.
>
> Technically, it could also fail on -ENOMEM, but in practice these size
> allocations don't fail, especially since the shrinker runs in
> PF_MEMALLOC context. The shrink_worker might be affected, since it
> doesn't. But the common case is -EEXIST, which indeed double counts.

Ah right, the rotation in the more unlikely case that the allocation fails
does still need to update memcg nr_zswap_protected; only the -EEXIST case
has the double-counting problem.

> To make it "correct", you'd have to grab an objcg reference with the
> LRU lock, and also re-order the objcg put on entry freeing after the
> LRU del. This is probably not worth doing. But it could use a comment.

Agree, will add your comments below.

> I was going to ask if you could reorder objcg uncharging after LRU
> deletion to make it more robust for future changes in that direction.
> However, staring at this, I notice this is a second UAF bug:
>
> 	if (entry->objcg) {
> 		obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
> 		obj_cgroup_put(entry->objcg);
> 	}
> 	if (!entry->length)
> 		atomic_dec(&zswap_same_filled_pages);
> 	else {
> 		zswap_lru_del(&entry->pool->list_lru, entry);
>
> zswap_lru_del() uses entry->objcg to determine the list_lru memcg, but
> the put may have killed it. I'll send a separate patch on top.

Good observation.

>> @@ -860,40 +839,34 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>>  {
>>  	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>>  	bool *encountered_page_in_swapcache = (bool *)arg;
>> -	struct zswap_tree *tree;
>> -	pgoff_t swpoffset;
>> +	swp_entry_t swpentry;
>>  	enum lru_status ret = LRU_REMOVED_RETRY;
>>  	int writeback_result;
>>
>> +	/*
>> +	 * First rotate to the tail of lru list before unlocking lru lock,
>> +	 * so the concurrent reclaimers have little chance to see it.
>> +	 * It will be deleted from the lru list if writeback success.
>> +	 */
>> +	list_move_tail(item, &l->list);
>
> We don't hold a reference to the object, so there could also be an
> invalidation waiting on the LRU lock, which will free the entry even
> when writeback fails.
>
> It would also be good to expand on the motivation, because it's not
> clear WHY you'd want to hide it from other reclaimers.

Right, my comments are not clear enough.

> Lastly, maybe mention the story around temporary failures? Most
> shrinkers have a lock inversion pattern (object lock -> LRU lock for
> linking versus LRU lock -> object trylock during reclaim) that can
> fail and require the same object be tried again before advancing.

Your comments are great, will add them in the next version. Thanks.

> How about this?
>
> 	/*
> 	 * Rotate the entry to the tail before unlocking the LRU,
> 	 * so that in case of an invalidation race concurrent
> 	 * reclaimers don't waste their time on it.
> 	 *
> 	 * If writeback succeeds, or failure is due to the entry
> 	 * being invalidated by the swap subsystem, the invalidation
> 	 * will unlink and free it.
> 	 *
> 	 * Temporary failures, where the same entry should be tried
> 	 * again immediately, almost never happen for this shrinker.
> 	 * We don't do any trylocking; -ENOMEM comes closest,
> 	 * but that's extremely rare and doesn't happen spuriously
> 	 * either. Don't bother distinguishing this case.
> 	 *
> 	 * But since they do exist in theory, the entry cannot just
> 	 * be unlinked, or we could leak it. Hence, rotate.
> 	 */
>
> Otherwise, looks great to me.
>
> Acked-by: Johannes Weiner
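
For reference, the reordering Johannes describes for the entry-free path
would look roughly like the sketch below. It keeps only the lines from the
snippet quoted above, with the LRU deletion moved ahead of the objcg
uncharge/put so zswap_lru_del() can still use entry->objcg to find the
list_lru memcg; the surrounding function and any other teardown steps are
not shown in the thread and are omitted here.

	/* Sketch: drop the entry from the LRU before the objcg put. */
	if (!entry->length)
		atomic_dec(&zswap_same_filled_pages);
	else {
		/* entry->objcg is still valid here: no put has happened yet. */
		zswap_lru_del(&entry->pool->list_lru, entry);
	}
	if (entry->objcg) {
		obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
		obj_cgroup_put(entry->objcg);	/* safe: LRU del already done */
	}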