From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 65B91C4345F for ; Fri, 26 Apr 2024 13:20:00 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id B27EA6B0089; Fri, 26 Apr 2024 09:19:59 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id AD7866B008A; Fri, 26 Apr 2024 09:19:59 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 99F956B008C; Fri, 26 Apr 2024 09:19:59 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 7AD456B0089 for ; Fri, 26 Apr 2024 09:19:59 -0400 (EDT) Received: from smtpin06.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 18F69C0840 for ; Fri, 26 Apr 2024 13:19:59 +0000 (UTC) X-FDA: 82051740918.06.9D4237E Received: from dggsgout11.his.huawei.com (unknown [45.249.212.51]) by imf06.hostedemail.com (Postfix) with ESMTP id 13C92180006 for ; Fri, 26 Apr 2024 13:19:54 +0000 (UTC) Authentication-Results: imf06.hostedemail.com; dkim=none; dmarc=none; spf=pass (imf06.hostedemail.com: domain of yi.zhang@huaweicloud.com designates 45.249.212.51 as permitted sender) smtp.mailfrom=yi.zhang@huaweicloud.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1714137597; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=87sLCNL/TN6XG8APGuceKOxpk2h/1LvaEaS6+4WNJTI=; b=lEvW4typTRTYTrXUzgcYHLr6I2kQsA11H133SYHEkIttZldbFUiRB1oeMAyhL8rT5cEUTQ Pj3amMVqTNWq9PKKJxVeUXyt/ZgzRrLtfXk5/i8AIiqaTVdaj7d5wX/YaFKDtfVVUPljEQ lGp7J7H6ARm2Hum09+ZcF/mOL35XYl4= ARC-Authentication-Results: i=1; imf06.hostedemail.com; dkim=none; dmarc=none; spf=pass (imf06.hostedemail.com: domain of yi.zhang@huaweicloud.com designates 45.249.212.51 as permitted sender) smtp.mailfrom=yi.zhang@huaweicloud.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1714137597; a=rsa-sha256; cv=none; b=wvDX27r2jOIKR4sPABofxO1qA17O6dhvUNiQ0Mt6q2pwNzhyrLSjPLK+uP/tO3xCIRTpJl cVc+1kxvrGYjbbQ5paoIdb9yl9F2FrMesaaLAIcDx8qT+RWovBjKUn+Qj3wPgbuhH0t4k2 C6xRJMGkIFOz56qBTDQ23qRT6KKYk60= Received: from mail.maildlp.com (unknown [172.19.163.235]) by dggsgout11.his.huawei.com (SkyGuard) with ESMTP id 4VQtcD2tK9z4f3lgG for ; Fri, 26 Apr 2024 21:19:40 +0800 (CST) Received: from mail02.huawei.com (unknown [10.116.40.112]) by mail.maildlp.com (Postfix) with ESMTP id AE9581A0568 for ; Fri, 26 Apr 2024 21:19:49 +0800 (CST) Received: from [10.174.176.34] (unknown [10.174.176.34]) by APP1 (Coremail) with SMTP id cCh0CgDHlxDzqStmn1MRLA--.12428S3; Fri, 26 Apr 2024 21:19:49 +0800 (CST) Subject: Re: [PATCH v4 02/34] ext4: check the extent status again before inserting delalloc block To: "Ritesh Harjani (IBM)" , linux-ext4@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, tytso@mit.edu, adilger.kernel@dilger.ca, jack@suse.cz, hch@infradead.org, djwong@kernel.org, david@fromorbit.com, willy@infradead.org, zokeefe@google.com, yi.zhang@huawei.com, chengzhihao1@huawei.com, yukuai3@huawei.com, wangkefeng.wang@huawei.com References: <87frv8z3gl.fsf@gmail.com> From: Zhang Yi Message-ID: <185a0d75-558e-a1ae-9415-c3eed4def60f@huaweicloud.com> Date: Fri, 26 Apr 2024 21:19:47 +0800 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.12.0 MIME-Version: 1.0 In-Reply-To: <87frv8z3gl.fsf@gmail.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit X-CM-TRANSID:cCh0CgDHlxDzqStmn1MRLA--.12428S3 X-Coremail-Antispam: 1UD129KBjvJXoW3Jw47Ar4UKF4rXw4Dur43GFg_yoW7urykpa 9IkF45Grs5Ww1kCanagF1UXr10gw18XrW2gr9xKr1jvFZ0kFyfWF12qFyY9FySkrs7G3W0 vF4jqa4xu3WjyaDanT9S1TB71UUUUUUqnTZGkaVYY2UrUUUUjbIjqfuFe4nvWSU5nxnvy2 9KBjDU0xBIdaVrnRJUUUvIb4IE77IF4wAFF20E14v26ryj6rWUM7CY07I20VC2zVCF04k2 6cxKx2IYs7xG6rWj6s0DM7CIcVAFz4kK6r1j6r18M28lY4IEw2IIxxk0rwA2F7IY1VAKz4 vEj48ve4kI8wA2z4x0Y4vE2Ix0cI8IcVAFwI0_Ar0_tr1l84ACjcxK6xIIjxv20xvEc7Cj xVAFwI0_Gr1j6F4UJwA2z4x0Y4vEx4A2jsIE14v26rxl6s0DM28EF7xvwVC2z280aVCY1x 0267AKxVW0oVCq3wAS0I0E0xvYzxvE52x082IY62kv0487Mc02F40EFcxC0VAKzVAqx4xG 6I80ewAv7VC0I7IYx2IY67AKxVWUJVWUGwAv7VC2z280aVAFwI0_Jr0_Gr1lOx8S6xCaFV Cjc4AY6r1j6r4UM4x0Y48IcVAKI48JM4IIrI8v6xkF7I0E8cxan2IY04v7Mxk0xIA0c2IE e2xFo4CEbIxvr21l42xK82IYc2Ij64vIr41l4I8I3I0E4IkC6x0Yz7v_Jr0_Gr1lx2IqxV Aqx4xG67AKxVWUJVWUGwC20s026x8GjcxK67AKxVWUGVWUWwC2zVAF1VAY17CE14v26r4a 6rW5MIIYrxkI7VAKI48JMIIF0xvE2Ix0cI8IcVAFwI0_Jr0_JF4lIxAIcVC0I7IYx2IY6x kF7I0E14v26r4j6F4UMIIF0xvE42xK8VAvwI8IcIk0rVWrZr1j6s0DMIIF0xvEx4A2jsIE 14v26r1j6r4UMIIF0xvEx4A2jsIEc7CjxVAFwI0_Gr0_Gr1UYxBIdaVFxhVjvjDU0xZFpf 9x07UZ18PUUUUU= X-CM-SenderInfo: d1lo6xhdqjqx5xdzvxpfor3voofrz/ X-Rspam-User: X-Rspamd-Server: rspam11 X-Rspamd-Queue-Id: 13C92180006 X-Stat-Signature: fnzpsdg9kqup68acqqgjgsx43x5kyh4x X-HE-Tag: 1714137594-737524 X-HE-Meta: U2FsdGVkX1+PsaEe87leY95SCWWcrOH7aoovVTWVlZ7Soutw2RkiTei9Vn8t02a4odNY24Nl1FByL3fx5aXiB4Nt2u8VYtRJKHk4i9xD1dYZt593ckcq1M1u9j55/FaiMBMkCvQ9rvLw5fp/17vI88apT2XwCuPWISdOMYQQjB+/dt7fBB+MBxJEbGazjLdwBcwb7eUGkQO7/lvJVZH8U+G/SIo8Weeve4wRLRRytuJhj9GPWSS3LBzr7h5iybH/53RR3MzQHmkzAIglXLjbmOwsaUE64zvViX3/u2Z4mfgpM+sc1DJ96G7lkDKqRn2mL8Mp5myE09e+y2NhLIlRoyC/tBolUp+LaagQtYaICTQOFMD/aseGWK006TDoQFoUM9ZLnHOll0lP5xa7AV/aXo/x6ZS3TgOYGYVcH7jHbnyc27RYycVp/YE2gypNgoKu9BMV2dHLUh3SZC4mf9gAH22/RKI3dwTmgichQDOOlnM2kjpPAtX32BvZU2KDguVjwkhT//Sd9eCotE1XGCE/fOxItGUhaJa8K489oFFGn8j7t2DV4QlM6d+NjuUq0IuYAKXvbUZ8Uj7y6n+/bpVGHwf7QL/viyEDFY0vXbP1mCbwaKutSjvhgkO+/ztk+xcCTfpAjtWRxf00SxP7XwjVWjF7NZDHrdfpb4FgwQaw2+WX6hiNYnbI4lWFxbwwf0SEHt1MI1EkT4kdqqJ8yPAvCeJTO63tuDoYsT82hSIFItC7U/Q13xMSSxMUxRxy/maQjb7Dd0zQlL62ahvgfPV0zZjizijoIy1VhOCBrxaq4a7cYfcym+kTAHHH9XFyEd+yA1dVtokpDdORs5bPWszvTKhhGO9l6bXHe0+u2e6NHLBeaK9osxw2AnI4NAK0MFT/AEVutFwzZzVjLOSNJ3N0pQcUgzNJFum+He7aysZGvqa5Nq/H+PhGFOgbBcGrqA5jYiGr65F0AI8q1yOOfI0 5Z5DVrby CD5vXfjVwCH5N96dOmApGAGWnAhrho3cGFAZOBAWtRDQ+DSFfwOHV0AN0kEVhdYA4gPF1eTovprXU5ByleiQcoSTBtGqvY7fa1iAsy27Axu3BO9OKTC1pxhVoUBbCc7r+80/g6t/L3nzokP/Pne77i26efV2Z9BPeEcoB0EKUrtQJXTU4iyS+COhrjNEm1T+sKIz93XBlgrFHaQdlEXuNhQxbelWo7+uu2gFU5FSJk1AwmqdMMR27sTJaNreh0gk+4o7BW92YuyAOI+vCF4wV2KIxfaaMn1EASUh4xfyvPDS46T05S8blggEcjUusXz5ufptoca4wyh7ZU4Uhjg2xgmE/FhuzVpHfvg7vNAFgtjGiedg9kclEWQVskWO5g20tR9oSepmGTCTEUgpOdYOUywhuTu894Cuxf/stHPntvZuE/TMFIcLh1JDrxz8lhJWoGDtjTjU0un8Utoo= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 2024/4/26 20:57, Ritesh Harjani (IBM) wrote: > Ritesh Harjani (IBM) writes: > >> Zhang Yi writes: >> >>> From: Zhang Yi >>> >>> Now we lookup extent status entry without holding the i_data_sem before >>> inserting delalloc block, it works fine in buffered write path and >>> because it holds i_rwsem and folio lock, and the mmap path holds folio >>> lock, so the found extent locklessly couldn't be modified concurrently. >>> But it could be raced by fallocate since it allocate block whitout >>> holding i_rwsem and folio lock. >>> >>> ext4_page_mkwrite() ext4_fallocate() >>> block_page_mkwrite() >>> ext4_da_map_blocks() >>> //find hole in extent status tree >>> ext4_alloc_file_blocks() >>> ext4_map_blocks() >>> //allocate block and unwritten extent >>> ext4_insert_delayed_block() >>> ext4_da_reserve_space() >>> //reserve one more block >>> ext4_es_insert_delayed_block() >>> //drop unwritten extent and add delayed extent by mistake >>> >>> Then, the delalloc extent is wrong until writeback, the one more >>> reserved block can't be release any more and trigger below warning: >>> >>> EXT4-fs (pmem2): Inode 13 (00000000bbbd4d23): i_reserved_data_blocks(1) not cleared! >>> >>> Hold i_data_sem in write mode directly can fix the problem, but it's >>> expansive, we should keep the lockless check and check the extent again >>> once we need to add an new delalloc block. >> >> Hi Zhang, >> >> It's a nice finding. I was wondering if this was caught in any of the >> xfstests? >> Hi, Ritesh I caught this issue when I tested my iomap series in generic/344 and generic/346. It's easy to reproduce because the iomap's buffered write path doesn't hold folio lock while inserting delalloc blocks, so it could be raced by the mmap page fault path. But the buffer_head's buffered write path can't trigger this problem, the race between buffered write path and fallocate path was discovered while I was analyzing the code, so I'm not sure if it could be caught by xfstests now, at least I haven't noticed this problem so far. >> I have reworded some of the commit message, feel free to use it if you >> think this version is better. The use of which path uses which locks was >> a bit confusing in the original commit message. >> Thanks for the message improvement, it looks more clear then mine, I will use it. Thanks, Yi. >> >> >> ext4_da_map_blocks(), first looks up the extent status tree for any >> extent entry with i_data_sem held in read mode. It then unlocks >> i_data_sem, if it can't find an entry and take this lock in write >> mode for inserting a new da entry. > > Sorry about this above paragraph. I messed this paragraph. > Here is the correct version of this. > > ext4_da_map_blocks looks up for any extent entry in the extent status > tree (w/o i_data_sem) and then the looks up for any ondisk extent > mapping (with i_data_sem in read mode). > > If it finds a hole in the extent status tree or if it couldn't find any > entry at all, it then takes the i_data_sem in write mode to add a da entry > into the extent status tree. This can actually race with page mkwrite > & fallocate path. > > Note that this is ok between... > >> >> This is ok between - >> 1. ext4 buffered-write path v/s ext4_page_mkwrite(), because of the >> folio lock >> 2. ext4 buffered write path v/s ext4 fallocate because of the inode >> lock. >> > > >> But this can race between ext4_page_mkwrite() & ext4 fallocate path - >> >> ext4_page_mkwrite() ext4_fallocate() >> block_page_mkwrite() >> ext4_da_map_blocks() >> //find hole in extent status tree >> ext4_alloc_file_blocks() >> ext4_map_blocks() >> //allocate block and unwritten extent >> ext4_insert_delayed_block() >> ext4_da_reserve_space() >> //reserve one more block >> ext4_es_insert_delayed_block() >> //drop unwritten extent and add delayed extent by mistake >> >> Then, the delalloc extent is wrong until writeback and the extra >> reserved block can't be released any more and it triggers below warning: >> >> EXT4-fs (pmem2): Inode 13 (00000000bbbd4d23): i_reserved_data_blocks(1) not cleared! >> >> This patch fixes the problem by looking up extent status tree again >> while the i_data_sem is held in write mode. If it still can't find >> any entry, then we insert a new da entry into the extent status tree. >> >>> >>> Cc: stable@vger.kernel.org >>> Signed-off-by: Zhang Yi >>> --- >>> fs/ext4/inode.c | 19 +++++++++++++++++++ >>> 1 file changed, 19 insertions(+) >>> >>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >>> index 6a41172c06e1..118b0497a954 100644 >>> --- a/fs/ext4/inode.c >>> +++ b/fs/ext4/inode.c >>> @@ -1737,6 +1737,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock, >>> if (ext4_es_is_hole(&es)) >>> goto add_delayed; >>> >>> +found: >>> /* >>> * Delayed extent could be allocated by fallocate. >>> * So we need to check it. >>> @@ -1781,6 +1782,24 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock, >>> >>> add_delayed: >>> down_write(&EXT4_I(inode)->i_data_sem); >>> + /* >>> + * Lookup extents tree again under i_data_sem, make sure this >>> + * inserting delalloc range haven't been delayed or allocated >>> + * whitout holding i_rwsem and folio lock. >>> + */ >> >> page fault path (ext4_page_mkwrite does not take i_rwsem) and fallocate >> path (no folio lock) can race. Make sure we lookup the extent status >> tree here again while i_data_sem is held in write mode, before inserting >> a new da entry in the extent status tree. >> >> > > > -ritesh >