From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 24BB3C32793 for ; Wed, 18 Jan 2023 21:43:44 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231321AbjARVnl (ORCPT ); Wed, 18 Jan 2023 16:43:41 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39688 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231231AbjARVnY (ORCPT ); Wed, 18 Jan 2023 16:43:24 -0500 Received: from mail-pf1-x434.google.com (mail-pf1-x434.google.com [IPv6:2607:f8b0:4864:20::434]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BA4DD66016 for ; Wed, 18 Jan 2023 13:42:27 -0800 (PST) Received: by mail-pf1-x434.google.com with SMTP id 20so21902175pfu.13 for ; Wed, 18 Jan 2023 13:42:27 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fromorbit-com.20210112.gappssmtp.com; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=25/ogZ1IY1yCzD+p9FJLZ5EqJq4IvrsYWI+uktJqoc8=; b=b6n2HzUq915R4Q52gekT31VCYQfhro59qvgUT/934MH4Ph5mOu/m74OMQPxPnVS/Jj udr4O/ozJG9Wr9xUuH2pWLtCd0DTUDFDxmD74lfUUGrbGuRaGcSnGoh0rNX1AC2NFXV6 qMCOhTTi27jvwOqSTZhFFRHiEyeZfv1Z7SJm/zUnPYhEZFr3pUaYsE1UE9b+AyxmBdYe 7NiRz5Hqjq2balCLmD9DyIqwhWgbKeB2es8kJ68QDAwvLHxSwwYUN+FimGMTNMO7OYv4 LSx+3Sfiw6/J+ifh5//FHTHi/49BsvmL+hcFXy0sAcr90oEDNxS5oG5e41N6r7Nqtgwz BdIg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=25/ogZ1IY1yCzD+p9FJLZ5EqJq4IvrsYWI+uktJqoc8=; b=u9iF9bK8QUfKim4yIBPD0k0gbn97LJtQjng/Ku1Khta3HsY7tKUTQ/gqp39Oiw47+x TTz4xpJWhFlaW1cmShoavaEGB5Hw5tuiCp2e4Rv4b5lZxsTZ9ERKPKJq2cciiKG6Q5TU MaJOoaNVZmWhMRlNUwgpWyQbobIFnu87QkLmzjhikr9htmuxMSmLLFXcX4/U+CIJcu+x xkj7CV9/MFINiOny1+H9asaXECdG31IK81OTWmn2NkzpPfmdngo49fPMXJ1+/ips5tzX 6udgqIxgzjBHxEs/iICSZXRdmkI1Q0SqeNpt+gpj1vtRm7bxqkuz0t8++6GHAkpQmvut 9zAA== X-Gm-Message-State: AFqh2krP87pmRBKwwGC3eN70XHJHAxgP9HuF4KrcBXy6im6moAFJ2xkC gDX/DA4wPvfh7AaD9UWeqHOVSA== X-Google-Smtp-Source: AMrXdXuUhT365uogRqJeciSgZ6jusXovRP6UIOCn0m3CnHYkFzQba4SbhSlaI19LX086ihdE2UUqBA== X-Received: by 2002:a05:6a00:4088:b0:58d:aadf:5e62 with SMTP id bw8-20020a056a00408800b0058daadf5e62mr9302248pfb.18.1674078147224; Wed, 18 Jan 2023 13:42:27 -0800 (PST) Received: from dread.disaster.area (pa49-186-146-207.pa.vic.optusnet.com.au. [49.186.146.207]) by smtp.gmail.com with ESMTPSA id m2-20020a62a202000000b005869a33dd3bsm20734472pff.164.2023.01.18.13.42.26 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 18 Jan 2023 13:42:26 -0800 (PST) Received: from dave by dread.disaster.area with local (Exim 4.92.3) (envelope-from ) id 1pIGCN-004hTW-Hg; Thu, 19 Jan 2023 08:42:23 +1100 Date: Thu, 19 Jan 2023 08:42:23 +1100 From: Dave Chinner To: "Darrick J. Wong" Cc: Andreas =?iso-8859-1?Q?Gr=FCnbacher?= , Andreas Gruenbacher , Christoph Hellwig , Alexander Viro , Matthew Wilcox , linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, cluster-devel@redhat.com Subject: Re: [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler Message-ID: <20230118214223.GH360264@dread.disaster.area> References: <20230108194034.1444764-1-agruenba@redhat.com> <20230108194034.1444764-9-agruenba@redhat.com> <20230108215911.GP1971568@dread.disaster.area> <20230109225453.GQ1971568@dread.disaster.area> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org On Sun, Jan 15, 2023 at 09:29:58AM -0800, Darrick J. Wong wrote: > 2. Do we need to revalidate mappings for directio writes? I think the > answer is no (for xfs) because the ->iomap_begin call will allocate > whatever blocks are needed and truncate/punch/reflink block on the > iolock while the directio writes are pending, so you'll never end up > with a stale mapping. But I don't know if that statement applies > generally... The issue is not truncate/punch/reflink for either DIO or buffered IO - the issue that leads to stale iomaps is async extent state. i.e. IO completion doing unwritten extent conversion. For DIO, AIO doesn't hold the IOLOCK at all when completion is run (like buffered writeback), but non-AIO DIO writes hold the IOLOCK shared while waiting for completion. This means that we can have DIO submission and completion still running concurrently, and so stale iomaps are a definite possibility. >From my notes when I looked at this: 1. the race condition for a DIO write mapping go stale is an overlapping DIO completion and converting the block from unwritten to written, and then the dio write incorrectly issuing sub-block zeroing because the mapping is now stale. 2. DIO read into a hole or unwritten extent zeroes the entire range in the user buffer in one operation. If this is a large range, this could race with small DIO writes within that range that have completed 3. There is a window between dio write completion doing unwritten extent conversion (by ->end_io) and the page cache being invalidated, providing a window where buffered read maps can be stale and incorrect read behaviour exposed to userpace before the page cache is invalidated. These all stem from IO having overlapping ranges, which is largely unsupported but can't be entirely prevented (e.g. backup applications running in the background). Largely the problems are confined to sub-block IOs. i.e. when sub-block DIO writes to the same block are being performed, we have the possiblity that one write completes whilst the other is deciding what to zero, unaware that the range is now MAPPED rather than UNWRITTEN. We currently avoid issues with sub-block dio writes by using IOMAP_DIO_OVERWRITE_ONLY with shared locking. This ensures that the unaligned IO fits entirely within a MAPPED extent so no sub-block zeroing is required. If allocation or sub-block zeroing is required, then we force the filesystem to fall back to exclusive IO locking and wait for all concurrent DIO in flight to complete so that it can't race with any other DIO write that might cause the map to become stale while we are doing the zeroing. This does not avoid potential issues with DIO write vs buffered read, nor DIO write vs mmap IO. It's not totally clear to me whether we need ->iomap_valid checks in the buffered read paths to avoid the completion races with DIO writes, but there are windows there where cached iomaps could be considered stale.... Cheers, Dave. -- Dave Chinner david@fromorbit.com