From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8D038C47DA9 for ; Tue, 30 Jan 2024 03:31:14 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 0D9366B0071; Mon, 29 Jan 2024 22:31:14 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 0894E6B00CD; Mon, 29 Jan 2024 22:31:14 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id E44D46B00CE; Mon, 29 Jan 2024 22:31:13 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id D11766B0071 for ; Mon, 29 Jan 2024 22:31:13 -0500 (EST) Received: from smtpin01.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id A48A4A07BD for ; Tue, 30 Jan 2024 03:31:13 +0000 (UTC) X-FDA: 81734551626.01.79BF1F8 Received: from mail-pg1-f170.google.com (mail-pg1-f170.google.com [209.85.215.170]) by imf12.hostedemail.com (Postfix) with ESMTP id B79A640010 for ; Tue, 30 Jan 2024 03:31:11 +0000 (UTC) Authentication-Results: imf12.hostedemail.com; dkim=pass header.d=bytedance.com header.s=google header.b=kulOAUzv; spf=pass (imf12.hostedemail.com: domain of zhouchengming@bytedance.com designates 209.85.215.170 as permitted sender) smtp.mailfrom=zhouchengming@bytedance.com; dmarc=pass (policy=quarantine) header.from=bytedance.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1706585471; a=rsa-sha256; cv=none; b=zbVUg0Ri/NEaql7FKaFYX3Q510sV6m8s8XuADvbCc4s/8xIouwMZrtku+jTEDJA6yhrigP M192vMjV1EXZSWVaV7EWZ/ZaT8FZc05/ZyS7SW42QH8PyeofeuEwdRkctA0+02UckdlVYp g9/TanG98iF53lGXuJfm9o4GqVHnbVA= ARC-Authentication-Results: i=1; imf12.hostedemail.com; dkim=pass header.d=bytedance.com header.s=google header.b=kulOAUzv; spf=pass (imf12.hostedemail.com: domain of zhouchengming@bytedance.com designates 209.85.215.170 as permitted sender) smtp.mailfrom=zhouchengming@bytedance.com; dmarc=pass (policy=quarantine) header.from=bytedance.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1706585471; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=RD6hTXlMSyJ3iWEtJweqxifSu9jRBJgphTtX1Fvzr68=; b=NMmDsdYcnVBkRJrnXSoGu6zRpoUHV0/Xkcdec/tO4l2Nt47ZKhtHF4RC3sJ+fBAeEP4Trv tsymyxDMECltDR7sDs1Ty2Uu3WZHNg8pcHfrzpDqJdfrDwcqy9r2vNmz43xcP4oi4IaUdI FL6aqLViJLnYqPKV4PMML1ExntFa2EA= Received: by mail-pg1-f170.google.com with SMTP id 41be03b00d2f7-5ca1b4809b5so1465674a12.3 for ; Mon, 29 Jan 2024 19:31:11 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1706585470; x=1707190270; darn=kvack.org; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=RD6hTXlMSyJ3iWEtJweqxifSu9jRBJgphTtX1Fvzr68=; b=kulOAUzvcTGB5pT/W3IqrqeMcpE+7+KqFe8YhwkuPA2aHxI1ovfCyJFSvfNsQ7sLQE 9S/LyxTCBCFtyBkuqByvGkIJuSRGmIx1MLZqpgRI8u0Dc9HeriX+t/jRqsUHFB1bLhZi MlUNAq3G8Hh1sC0+kHX9VbL42VDvlpd+ANPrqvTNu6YTrDifWxlyDCl9Y7TxFSoqfCy9 iKrWESJCuJ9ClVtC4u4f6hISEcCoSu3+tewP8a6TDxkDc1TCoZnalreUzw5zaIy8TX61 Zw8FTvonN7Jnp/Dwf0UFnXp7w0rXsmzUOlEnor0nMxU55ngyLIDUD6+AfeV5R0vxtb6h aluA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1706585470; x=1707190270; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=RD6hTXlMSyJ3iWEtJweqxifSu9jRBJgphTtX1Fvzr68=; b=N2ziT/78qnLBtQUYFs2L74+S+A5YPPN6JfSv7iVTYjUb+u5vZz/Qr9u+IN/anFkGro ujSsinArg0MMQ+3kpwyQC8LMQ4xICoe+t+UNjiIxO808s2HZ+/hAJF25G38D1rtt2Fpb U0JkBUh0b44LH1g5unuldbADnv8M7ICXqcVutwfIt1CzyGdBOlmtp1EIL3ZjLgcALjDN xLy1vzxgdRu+lykX7aPAJ69eY7Oa0hx3+2pIM23RHhhAU6prWQBwaaxRhX47jbZ7t/di x/pY82sVPT3wqpdwTs5ww89Wu9Z122UylB6DKW6YtQBxx4YeEcfE1EOps+Z436xl82dc 3WFw== X-Gm-Message-State: AOJu0Yx+DH9nt7hSgGZNoYscvKFXaCjjsnLvuEz5sDJMKEY4D+Lv8R3q AR5g6siyAXo9Ul5ZAafYCauIQ8+++SeHkml0vXSMzDOl5pR4AFKTfw4davxFhCk= X-Google-Smtp-Source: AGHT+IHJmPuERhTK5o2Nq2jM7SYhLb3zkXP+HLMEJlqpfVkhvjIwq1UPqfASkuCuKKfgnYlGPV/2sA== X-Received: by 2002:a05:6a20:3ba6:b0:19c:61d5:beab with SMTP id b38-20020a056a203ba600b0019c61d5beabmr3654918pzh.39.1706585470314; Mon, 29 Jan 2024 19:31:10 -0800 (PST) Received: from [10.4.207.234] ([139.177.225.234]) by smtp.gmail.com with ESMTPSA id iz19-20020a170902ef9300b001d8cf16d890sm3334197plb.16.2024.01.29.19.31.07 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 29 Jan 2024 19:31:10 -0800 (PST) Message-ID: <30e949b5-9455-4053-93ef-1ca0ceb10c3f@bytedance.com> Date: Tue, 30 Jan 2024 11:31:04 +0800 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v2 2/3] mm/zswap: fix race between lru writeback and swapoff Content-Language: en-US To: Johannes Weiner Cc: Yosry Ahmed , Nhat Pham , Andrew Morton , Chris Li , linux-kernel@vger.kernel.org, linux-mm@kvack.org References: <20240126-zswap-writeback-race-v2-0-b10479847099@bytedance.com> <20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com> <527bd543-97a5-4262-be73-6a5d21c2f896@bytedance.com> <20240130031727.GA780575@cmpxchg.org> From: Chengming Zhou In-Reply-To: <20240130031727.GA780575@cmpxchg.org> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Rspamd-Server: rspam08 X-Rspamd-Queue-Id: B79A640010 X-Stat-Signature: apcsc5o4ba7xrawidw7wboqjd9pk8f44 X-Rspam-User: X-HE-Tag: 1706585471-577752 X-HE-Meta: U2FsdGVkX1/wGUTG8jsqK+Bi8MnrE+mv8L/njSg0S8HKvA4F3j4CFQMFRGxwm/8uUPg6TCWZzR/MHsdQakgz6FAKddPKJXgpPIifgUyHvT9USxgZNiKeab+6t+cn/HbcCviyySo3TVFx5PKwYkX7xoLMqbnQb3wrgqHG7s7xQtgUYb9++2gN/vNn3d1CgJAoj+/BpzBlVt7/jNCDcrQOhmFpj9hs8EV1e2KuGKvU06w5i2RuccKJgDwmF7fHfGRjXdz7u4b/efuUcViwGqp3pDJM7QLER8ETcNO8GLH8jCDuaoVwZAJgPwUcdWqqEiohdxitkZ22oNw72ijn5mUUVVvhdKjDkczqyTNM8ABA2Kyrtowid6flp/MYWUCXLZhGjpwzHIuLjJX10JmFNkxUojKZWNLA4llnvV0tylA/P612Yi3grRTJwdwVBrQZY4VDRQXtc4kwWf8Iu51LAEwosk18ZnpNl+st400Bed+ceqnYk1TSBQwTBzg7hAo/vDtdOvOXsdBAVAK+eBx7ITGPnJZSG0LrcfebW70p9eP9Fyx5sIs8c98khQPwTkO4bv+pZAqmoczfAmO09UbuU5tEyLhrHNgGallmxxQ8P3TitM30CLUmWipnQLrxBtme0ivkjzrtf3vLM5F88JNk0VBxCgc1XNVxBxdgdXYic6HbCpFidTPIKOGyWHiByFg7WlZu1jaDs3AsA5uoaktuLtT+Ob+sjCSYAqzsN5n5a5wv73lOEZ7+kdVNaom6NiQedJpBsn2ZL4ZA4NK/p5bmrgcenMfr68Fg5Oqr2vn6vV7mPY5Y0D9oV4M2MGfBbR7M4kIQoaeKtQwS1l6No3h6n7g9fQZgrDbhT+JimFItAgX2Z3z0Uw/B9BVw6rMbMjmXnF31PhsDDSwlwABn9v/uvkmxYOS4S4MFHct2FNIIiGkvpHy426Il48goQDY+12m333F1W/dou72tTe4zrIlF/Ex t4BhgQPn u27U2XYGjJnTuI+nCzjLw+uxyfpEebBXqXDk3KX5h1iHWSucO0/pAYSuYz+wLCfEKl3Xp+L/HN0eRTly3BM41PJApSxc8bGpGxXHf+svqJPXvLJsA5zDdSzeLGOEx0OUhNlB+XxdwQOalAEbKFrD6ScYfqSFtAFig1mCluBwvAodkiwZuUau2s6JO5UUIG4RzcyERwfrKXLrsTrPBL+U9GWz6zzshSwPer2ejZLNzMbO8L+eiVmAZNTUHvX55ipMMCjba7XiZoE7egFnflDPPIuIz4DJ6fvCOxEHi2EUJKmGNxkd39QNWSkt7f4TrTEYT1SrPZ0iN5+/CUF6abtaG6jxka86hf8Bl3MRttQhaljEzwVml2NWrYGnOxjldipmUz7WFr5+0CTDCyIeYQxHZWnu35A== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 2024/1/30 11:17, Johannes Weiner wrote: > On Tue, Jan 30, 2024 at 10:30:20AM +0800, Chengming Zhou wrote: >> On 2024/1/30 08:22, Yosry Ahmed wrote: >>> On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote: >>>> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o >>>> { >>>> struct zswap_entry *entry = container_of(item, struct zswap_entry, lru); >>>> bool *encountered_page_in_swapcache = (bool *)arg; >>>> - struct zswap_tree *tree; >>>> - pgoff_t swpoffset; >>>> + swp_entry_t swpentry; >>>> enum lru_status ret = LRU_REMOVED_RETRY; >>>> int writeback_result; >>>> >>>> + /* >>>> + * Rotate the entry to the tail before unlocking the LRU, >>>> + * so that in case of an invalidation race concurrent >>>> + * reclaimers don't waste their time on it. >>>> + * >>>> + * If writeback succeeds, or failure is due to the entry >>>> + * being invalidated by the swap subsystem, the invalidation >>>> + * will unlink and free it. >>>> + * >>>> + * Temporary failures, where the same entry should be tried >>>> + * again immediately, almost never happen for this shrinker. >>>> + * We don't do any trylocking; -ENOMEM comes closest, >>>> + * but that's extremely rare and doesn't happen spuriously >>>> + * either. Don't bother distinguishing this case. >>>> + * >>>> + * But since they do exist in theory, the entry cannot just >>>> + * be unlinked, or we could leak it. Hence, rotate. >>> >>> The entry cannot be unlinked because we cannot get a ref on it without >>> holding the tree lock, and we cannot deref the tree before we acquire a >>> swap cache ref in zswap_writeback_entry() -- or if >>> zswap_writeback_entry() fails. This should be called out explicitly >>> somewhere. Perhaps we can describe this whole deref dance with added >>> docs to shrink_memcg_cb(). >> >> Maybe we should add some comments before or after zswap_writeback_entry()? >> Or do you have some suggestions? I'm not good at this. :) > > I agree with the suggestion of a central point to document this. > > How about something like this: > > /* > * As soon as we drop the LRU lock, the entry can be freed by > * a concurrent invalidation. This means the following: > * > * 1. We extract the swp_entry_t to the stack, allowing > * zswap_writeback_entry() to pin the swap entry and > * then validate the zwap entry against that swap entry's > * tree using pointer value comparison. Only when that > * is successful can the entry be dereferenced. > * > * 2. Usually, objects are taken off the LRU for reclaim. In > * this case this isn't possible, because if reclaim fails > * for whatever reason, we have no means of knowing if the > * entry is alive to put it back on the LRU. > * > * So rotate it before dropping the lock. If the entry is > * written back or invalidated, the free path will unlink > * it. For failures, rotation is the right thing as well. > * > * Temporary failures, where the same entry should be tried > * again immediately, almost never happen for this shrinker. > * We don't do any trylocking; -ENOMEM comes closest, > * but that's extremely rare and doesn't happen spuriously > * either. Don't bother distinguishing this case. > */ > Thanks! I think this document is great enough. >>> We could also use a comment in zswap_writeback_entry() (or above it) to >>> state that we only deref the tree *after* we get the swapcache ref. >> >> I just notice there are some comments in zswap_writeback_entry(), should >> we add more comments here? >> >> /* >> * folio is locked, and the swapcache is now secured against >> * concurrent swapping to and from the slot. Verify that the >> * swap entry hasn't been invalidated and recycled behind our >> * backs (our zswap_entry reference doesn't prevent that), to >> * avoid overwriting a new swap folio with old compressed data. >> */ > > The bit in () is now stale, since we're not even holding a ref ;) Right. > > Otherwise, a brief note that entry can not be dereferenced until > validation would be helpful in zswap_writeback_entry(). The core of Ok. > the scheme I'd probably describe in shrink_memcg_cb(), though. > > Can I ask a favor, though? > > For non-critical updates to this patch, can you please make them > follow-up changes? I just sent out 20 cleanup patches on top of this > patch which would be super painful and error prone to rebase. I'd like > to avoid that if at all possible. Ok, so these comments changes should be changed on top of your cleanup series and sent as a follow-up patch. Thanks.