From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6F85CC4345F for ; Fri, 26 Apr 2024 12:57:46 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id E51B56B0083; Fri, 26 Apr 2024 08:57:45 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id E01C36B0085; Fri, 26 Apr 2024 08:57:45 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id CF0E26B0087; Fri, 26 Apr 2024 08:57:45 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id ACE556B0083 for ; Fri, 26 Apr 2024 08:57:45 -0400 (EDT) Received: from smtpin11.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id 390FD1A08EC for ; Fri, 26 Apr 2024 12:57:45 +0000 (UTC) X-FDA: 82051684890.11.A7EC1C3 Received: from mail-pg1-f179.google.com (mail-pg1-f179.google.com [209.85.215.179]) by imf17.hostedemail.com (Postfix) with ESMTP id A9D2C4000A for ; Fri, 26 Apr 2024 12:57:42 +0000 (UTC) Authentication-Results: imf17.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=XIfmCfoH; spf=pass (imf17.hostedemail.com: domain of ritesh.list@gmail.com designates 209.85.215.179 as permitted sender) smtp.mailfrom=ritesh.list@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1714136262; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:content-type: content-transfer-encoding:in-reply-to:in-reply-to:references:dkim-signature; bh=endWZFs/huUwpiPXnvyhJNr3iwSnVQvC+T17t46UJ4o=; b=aBy6N75k9XeY5swn6dqZBrhuYXogfBBtliZGpPv4tLqUg6VDXLmV67ChvhSAW8oqpGq0hA dMq3R5J9wEPwmpjS2NHWn6N24m2NU51fCjdXDNoJcYHlVC/79Ss08elQcLGFBuId3xYwqP eXNtjbF9ZvENmGOSHNqmGdV7PHRoPEw= ARC-Authentication-Results: i=1; imf17.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=XIfmCfoH; spf=pass (imf17.hostedemail.com: domain of ritesh.list@gmail.com designates 209.85.215.179 as permitted sender) smtp.mailfrom=ritesh.list@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1714136262; a=rsa-sha256; cv=none; b=3l3LClKfWvxC/rMxi1wPIrd5v8M9ZBapWPeeTid0pzzCYwdHtmVnnmw0Yh2HYg4EfafdNP XIKOjYrw2fekzoCfu88ggw+FufWvZqe72eldWlK98jqGna3XjhwFo6C/NVmD+cMvOWbfgz y1vW0/0b2fJ2eCDfWSvTtMXiSvk9hJ8= Received: by mail-pg1-f179.google.com with SMTP id 41be03b00d2f7-5e8470c1cb7so1449381a12.2 for ; Fri, 26 Apr 2024 05:57:42 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1714136261; x=1714741061; darn=kvack.org; h=in-reply-to:subject:cc:to:from:message-id:date:from:to:cc:subject :date:message-id:reply-to; bh=endWZFs/huUwpiPXnvyhJNr3iwSnVQvC+T17t46UJ4o=; b=XIfmCfoHDW3/87DH2lOhGabu2nkCE94WrXB4HPJq/5WlXM0sBurt15w0epuXYWxRcb 5V3kX1ijz9dTE2sSm33QlNmzKpKVftbEMMFVRJLOgsBS8gTR4QjHuga1dJGm4eURY4qh YmgDqyRO3A1DAclktDEy856tePAVB1DKP+qSNPl49iNdfJL/BAI0DArdG/5fIFyXwevi P8H78D3HGpdBspPh/e1Op6gsvRq6pofCyStHhsRfoaAwo1UaRDaSIPcQ9yYgVu9rOVyK PWvm3RA0KCACPd3u1Pptm0Yh1X3Ko0VYZZ3aDtosFp8br1BF8dgHndrkEoKSbhYlIoCl nZXA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1714136261; x=1714741061; h=in-reply-to:subject:cc:to:from:message-id:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=endWZFs/huUwpiPXnvyhJNr3iwSnVQvC+T17t46UJ4o=; b=e1t6+E26mFdCpzQnP9+mX6yYEyv6EjmPMQZ3fLW+XAlR9c76fCHH6ERfD0f5DtKFbQ eC/ktIVgDOOtgkY+v5Nj9PTWbcNtlQZ0rIDUAimdtTeq3XYTPbyBIg+HKYCSahS6R4vP DecLDM+plJPF0cmO7knv5fngb+7b+yfUSpJ6AKnJFV1C80HJFvaQ71vArVoxoEDbvGRz LNNI4eJdXSpUdCijKUxfGlyn2ZoVS1z2jg85iQYVGB+90FIJcihnj+KTjncM9DGTX01o 0X4Fg3PqSCYxCGRgdDKZTtOEjdHiTe0AQhq0bwtuisYlmoTAKhOhVkq72xJm2CxRFsPV KPgg== X-Forwarded-Encrypted: i=1; AJvYcCXpx4Mi4zUMbqF0nVBGkiqo/ECAj0kc3osNPDUYfbEGV/CnQz2K1GVwXhRapcYQ2P5ZKepvRu6ZUfC3wm193N1VC88= X-Gm-Message-State: AOJu0YyWkopaT4Fgplih92Mr0j591HZGyh9eTO1+1Slf8B3zVOuLvnbE xRVbKP5zymQfcgRsGY8OCd9uCZeYISxqq6/+mRkN68tz4QvDUXOl X-Google-Smtp-Source: AGHT+IEQAPUV/KA5Lz34E4irpjiyAsv33koqvqCYKYdiaGY5Yw7DPqFcAt70S4/nUSAiR+Zq1QnoMg== X-Received: by 2002:a17:90a:67c6:b0:2ae:6e36:af21 with SMTP id g6-20020a17090a67c600b002ae6e36af21mr2412507pjm.25.1714136261420; Fri, 26 Apr 2024 05:57:41 -0700 (PDT) Received: from dw-tp ([171.76.87.172]) by smtp.gmail.com with ESMTPSA id eu6-20020a17090af94600b002aeb05d6e97sm6334259pjb.21.2024.04.26.05.57.35 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 26 Apr 2024 05:57:40 -0700 (PDT) Date: Fri, 26 Apr 2024 18:27:30 +0530 Message-Id: <87frv8z3gl.fsf@gmail.com> From: Ritesh Harjani (IBM) To: Zhang Yi , linux-ext4@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, tytso@mit.edu, adilger.kernel@dilger.ca, jack@suse.cz, hch@infradead.org, djwong@kernel.org, david@fromorbit.com, willy@infradead.org, zokeefe@google.com, yi.zhang@huawei.com, yi.zhang@huaweicloud.com, chengzhihao1@huawei.com, yukuai3@huawei.com, wangkefeng.wang@huawei.com Subject: Re: [PATCH v4 02/34] ext4: check the extent status again before inserting delalloc block In-Reply-To: <87frv8nw4a.fsf@gmail.com> X-Stat-Signature: bsdcf9h6ehondwc8y4esucw8jdfk8skk X-Rspamd-Queue-Id: A9D2C4000A X-Rspamd-Server: rspam10 X-Rspam-User: X-HE-Tag: 1714136262-244499 X-HE-Meta: U2FsdGVkX1+a6UHXHDRZoH968zko2M+K6tnHlRMZ0bgGUGgSn65/te3+AFdBs/kipclkaVn7K2uXolEapb6xasQPvyMUJy0f+IPALX1Pkt3o9p97ij9PT1xt8Hra1PYF58iDPZ72oXzEfUlah5h/89QvmpWXtDoNYcm8C/hhWYP9yVqPMYnDoYHGOtc2fjNYG25Vqxwgh4vGbugpfA+ItFc0esyVhVkatTY975i2c9aU8+9rp4gHMAd5Yt7bYYkUnj8k+3vPKtpmaj6iGOe54p8pDyCZtCjXrKSUhP0MNwt03oPZ7Hikw8LO+AswC/s4bHy43ZhZd+MIdTZT69HlDp0VpDs1Q6p7E058khaUpYYr6DnK5h5OLiye4tpcWsImgqE2FTjUJZeQhtMfMlIIoGBVmKRrvczckoGJmF+t7XOqhdiDjaSjanpaz77O60nng3zW2KT3VDUtLBkl9+0s/ZxStuU5Z0D+fRRE/qX+BFjeyz/Lqf+jZHgpztTUocjq97SPISHhnJSbnYGdqQwVjlLaQKbI3C6Z+5I8tvgZy8jN4TdbTdJ3Jurofvvmmc9JEfxSF3t3/iWjUvUsDHrU2V0pDVr2gGhPv7w7nf+XHYyjVW+5pRQCx6SYjSoE5TflhZz69oQ//GjmhC4CieIgk4Be82PvATG3RHCQXgmcFSzCp34Zx0j9PdfTeybptIWSP9enYzXMvar021n2dhbn44s4XFNd0a4JAGU7p3//j2h4WkwtvjKUA4Vd9VfdPWRMn7bx3zi5F5cvAltl3GwGoil5sY5Yn+/4iyeFSvykOkUpWmm9pohEBzHs/JIpCxuuhVnyiIW9YkAOPnRHSLQloqAFJLqIMABSO83fFdVwo3WKNYPNXEi8YSPNxDzM/43sEQH4bjLmPd011+S1TEzFERtlCsQITE0L6r/BKSawm4aRQ2zY3uKs9AV4xtZcx226qn+czw561e2Jxq/mQ0O BzZt+7/S Z6F5Abt6ZNyzDMooXeOEs2mgNN9/GZbryhKj9nqnXMmjCKO4sJx+ha6KGd5PRq6ZMiLGrQpIrOHjAffOUzDIC0IQ33EHMiElIkC/kI9y2YCnkVmA= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Ritesh Harjani (IBM) writes: > Zhang Yi writes: > >> From: Zhang Yi >> >> Now we lookup extent status entry without holding the i_data_sem before >> inserting delalloc block, it works fine in buffered write path and >> because it holds i_rwsem and folio lock, and the mmap path holds folio >> lock, so the found extent locklessly couldn't be modified concurrently. >> But it could be raced by fallocate since it allocate block whitout >> holding i_rwsem and folio lock. >> >> ext4_page_mkwrite() ext4_fallocate() >> block_page_mkwrite() >> ext4_da_map_blocks() >> //find hole in extent status tree >> ext4_alloc_file_blocks() >> ext4_map_blocks() >> //allocate block and unwritten extent >> ext4_insert_delayed_block() >> ext4_da_reserve_space() >> //reserve one more block >> ext4_es_insert_delayed_block() >> //drop unwritten extent and add delayed extent by mistake >> >> Then, the delalloc extent is wrong until writeback, the one more >> reserved block can't be release any more and trigger below warning: >> >> EXT4-fs (pmem2): Inode 13 (00000000bbbd4d23): i_reserved_data_blocks(1) not cleared! >> >> Hold i_data_sem in write mode directly can fix the problem, but it's >> expansive, we should keep the lockless check and check the extent again >> once we need to add an new delalloc block. > > Hi Zhang, > > It's a nice finding. I was wondering if this was caught in any of the > xfstests? > > I have reworded some of the commit message, feel free to use it if you > think this version is better. The use of which path uses which locks was > a bit confusing in the original commit message. > > > > ext4_da_map_blocks(), first looks up the extent status tree for any > extent entry with i_data_sem held in read mode. It then unlocks > i_data_sem, if it can't find an entry and take this lock in write > mode for inserting a new da entry. Sorry about this above paragraph. I messed this paragraph. Here is the correct version of this. ext4_da_map_blocks looks up for any extent entry in the extent status tree (w/o i_data_sem) and then the looks up for any ondisk extent mapping (with i_data_sem in read mode). If it finds a hole in the extent status tree or if it couldn't find any entry at all, it then takes the i_data_sem in write mode to add a da entry into the extent status tree. This can actually race with page mkwrite & fallocate path. Note that this is ok between... > > This is ok between - > 1. ext4 buffered-write path v/s ext4_page_mkwrite(), because of the > folio lock > 2. ext4 buffered write path v/s ext4 fallocate because of the inode > lock. > > But this can race between ext4_page_mkwrite() & ext4 fallocate path - > > ext4_page_mkwrite() ext4_fallocate() > block_page_mkwrite() > ext4_da_map_blocks() > //find hole in extent status tree > ext4_alloc_file_blocks() > ext4_map_blocks() > //allocate block and unwritten extent > ext4_insert_delayed_block() > ext4_da_reserve_space() > //reserve one more block > ext4_es_insert_delayed_block() > //drop unwritten extent and add delayed extent by mistake > > Then, the delalloc extent is wrong until writeback and the extra > reserved block can't be released any more and it triggers below warning: > > EXT4-fs (pmem2): Inode 13 (00000000bbbd4d23): i_reserved_data_blocks(1) not cleared! > > This patch fixes the problem by looking up extent status tree again > while the i_data_sem is held in write mode. If it still can't find > any entry, then we insert a new da entry into the extent status tree. > >> >> Cc: stable@vger.kernel.org >> Signed-off-by: Zhang Yi >> --- >> fs/ext4/inode.c | 19 +++++++++++++++++++ >> 1 file changed, 19 insertions(+) >> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >> index 6a41172c06e1..118b0497a954 100644 >> --- a/fs/ext4/inode.c >> +++ b/fs/ext4/inode.c >> @@ -1737,6 +1737,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock, >> if (ext4_es_is_hole(&es)) >> goto add_delayed; >> >> +found: >> /* >> * Delayed extent could be allocated by fallocate. >> * So we need to check it. >> @@ -1781,6 +1782,24 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock, >> >> add_delayed: >> down_write(&EXT4_I(inode)->i_data_sem); >> + /* >> + * Lookup extents tree again under i_data_sem, make sure this >> + * inserting delalloc range haven't been delayed or allocated >> + * whitout holding i_rwsem and folio lock. >> + */ > > page fault path (ext4_page_mkwrite does not take i_rwsem) and fallocate > path (no folio lock) can race. Make sure we lookup the extent status > tree here again while i_data_sem is held in write mode, before inserting > a new da entry in the extent status tree. > > -ritesh