From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-ot1-f46.google.com (mail-ot1-f46.google.com [209.85.210.46]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9C4D113D8B0 for ; Tue, 23 Apr 2024 17:26:33 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.46 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1713893195; cv=none; b=fALNbzOr73/4YjzIVRIAxZopYfm/7TnnniPzvNmbVJhLAKmMvsa5WdJTWl2KjAbOZYM4jCVoSn6pbGaNkmy+Flx2tLP+WtEQh4xwU4oDsZp2LyzHV6g42kgiH4kHtOvjFxoDbr5jrOK6ryiIqA/miPLCPF7JE9Vfwu/ndDUGHuw= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1713893195; c=relaxed/simple; bh=aw5QyXTFrt0I6wkXNS1QSaH9h20uEcN6f5SlYI5wjiI=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=ARG56R7XclhJhpycI+YvseH0WEynzMGyJZtfmHP8q5DFQXDoDxVlTHtFO2Pik7Y93DsPO9Mg4jfuoXg3/JKOwuxZjqYm0Bw6n/2nFJc76ZS8tEQGmCLd/G2vF51/I/moE7ONWgfKxOgkUceu6EIwmJmu0+6fQnKst4DQKJLIkCg= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=fail (p=none dis=none) header.from=kernel.org; spf=pass smtp.mailfrom=redhat.com; arc=none smtp.client-ip=209.85.210.46 Authentication-Results: smtp.subspace.kernel.org; dmarc=fail (p=none dis=none) header.from=kernel.org Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Received: by mail-ot1-f46.google.com with SMTP id 46e09a7af769-6eb848b5a2eso3277000a34.3 for ; Tue, 23 Apr 2024 10:26:33 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1713893192; x=1714497992; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:message-id:subject:cc:to:from:date :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=ay/Z5MQ+WP1fncKf7FVy06TMnDYaQnNxLByiP4iYl+w=; b=bR/ziJwWjjSI80yRVPUD3X/7nDXOE6Z95U4dRFDHJ7OlRh8WIkwGrcjEyZNSJCWPUh 1qoJblQRNywO0KH2uWyA8+559SQWSblY0HfJP1g0TpGd4nMAjTn0zwmsCYmtf4aT8Mol /qUlrcQNZPqdgotQUr0u4kWJy8dmsSsren41QMv+v/l8HdGGg20VvdxtaVZYLFhBgRce PJNzj9nIWaSE1tjDustV7+9r4yid1Wey3ZLHxIbZWiwlZRLrE9wjWumTZDAcWBiNOdMt M3UuX7eQ4FIWuCWOorng8jQJB/A8kCAxRuA0fdnw3nLjWpXvTm5MkAcT/JlJv/EZDWI4 L62A== X-Forwarded-Encrypted: i=1; AJvYcCXJWHxy6aCl9Wd7rkwaf93WDRA6YpDOKC87B/Obx1P3DdGoPWuyr48ISu0fOFhukLJZS6zpCmjKhgDpTweVgBIiDPPSykhPhwc= X-Gm-Message-State: AOJu0YyE3XyXoOgxMSK56DjtEblV5JFwuFWDAD2IfhC+39SXiG248B+1 zvm/BbpHWkGgg+Yz8PXvAsEYDH7SiCjh9wcw/8+FhV+RR7UFeMptdEBzJ/Tuag== X-Google-Smtp-Source: AGHT+IFEGiPnPGvD+GhFoRFgm+os6owhJlcDyAfL5BQMgOPIDxTZsXR1XaWF0Tm2XxdfMwCWsBa8rw== X-Received: by 2002:a05:6830:2047:b0:6eb:d850:ae05 with SMTP id f7-20020a056830204700b006ebd850ae05mr241306otp.20.1713893192679; Tue, 23 Apr 2024 10:26:32 -0700 (PDT) Received: from localhost (pool-68-160-141-91.bstnma.fios.verizon.net. [68.160.141.91]) by smtp.gmail.com with ESMTPSA id u5-20020a05620a022500b0078f13d368f3sm4967735qkm.95.2024.04.23.10.26.31 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 23 Apr 2024 10:26:32 -0700 (PDT) Date: Tue, 23 Apr 2024 13:26:31 -0400 From: Mike Snitzer To: Dave Chinner Cc: Joe Thornber , Eric Sandeen , dm-devel@lists.linux.dev, Mikulas Patocka , Jonathan Brassow Subject: Re: RFE: dm-thin: fine-grained handling of REQ_FUA is needed Message-ID: References: <20210114081525.GB1923@rh> <7bb4980cec013e7e3accdb156f1c17f3c305e0b9.camel@redhat.com> <20210121064512.GD2431@rh> <5c47483805514c9dd7313e2e21f5d946953b508c.camel@redhat.com> <20210122023346.GF2431@rh> <20210122042536.GB6814@redhat.com> <20210125013435.GG2431@rh> Precedence: bulk X-Mailing-List: dm-devel@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: [one more time with feeling, after fixing dm-devel address typo] On Tue, Apr 23, 2024 at 01:19:39PM -0400, Mike Snitzer wrote: > [adding snitzer@kernel.org and dm-devel to the Cc so I don't lose this important thread again] > [this was old Red Hat internal discussion that I tripped over like a dead cat, but there isn't anything sensitive here: dm-thinp needs to improve] > > On Sun, Jan 24 2021 at 8:34P -0500, > Dave Chinner wrote: > > > On Thu, Jan 21, 2021 at 11:25:37PM -0500, Mike Snitzer wrote: > > > On Thu, Jan 21 2021 at 9:33pm -0500, > > > Dave Chinner wrote: > > > > > > > On Thu, Jan 21, 2021 at 04:23:20PM +0000, Joe Thornber wrote: > > > > > On Thu, 2021-01-21 at 17:45 +1100, Dave Chinner wrote: > > > > > > On Wed, Jan 20, 2021 at 10:34:01AM +0000, Joe Thornber wrote: > > > > > > > > > > > > > - Currently the tools unshare _metadata_ when restoring/repairing > > > > > > > pools > > > > > > > with snapshots.  Combined with the hard 16gb limit on metadata size > > > > > > > this means some systems with many snapshots can't be repaired if > > > > > > > they > > > > > > > run into trouble.  Next release of the tools will fix, but as said > > > > > > > before, we have to assess the software as is. > > > > > > > > > > > > Can you explain more about this limit and what it means? The test > > > > > > I'm currently running (1000 iterations of "write 10,000 4kB IOs, > > > > > > snapshot" > > > > > > > > > > Right let's talk about your scenario. > > > > > > > > > > Firstly metadata. Our transactional model is based on persistent data > > > > > structures in the functional programming sense. To update metadata in > > > > > a new transaction we COW key metadata pages to build a new data > > > > > structure that shares data with that of the previous transaction. We > > > > > amortise the costs associated with this by holding the transactions > > > > > open as long as possible and allowing real mutation *within* a > > > > > transaction. > > > > > > > > > > There are advantages and disadvantages with this scheme. > > > > > > > > > > Advantages: It's simple. It allows the 'take snapshot' operation to be > > > > > very quick (just writing a handful of pages). It also allows us to > > > > > take a 'metadata snapshot' of all the metadata which userland can > > > > > access safely. So we can move the implementation of certain > > > > > query/health functions entirely to userland. (see thin_check, thin_ls, > > > > > thin_dump etc). > > > > > > > > > > Disadvantages: The biggest down side is everything is done in one > > > > > global transaction; if a commit is required by one thin device (eg, due > > > > > to a REQ_FUA), then other active thin devices will be temporarily > > > > > slowed as the transaction commits. > > > > > > > > OUCH! > > > > > > > > That means every journal IO from the filesystem on the thin vol will > > > > trigger a globally serialising metadata commit. It also means every > > > > user O_DSYNC/RWF_DSYNC direct IO (which we optimise to FUA writes > > > > to avoid journal flushes) will trigger a thinp commit. > > > > > > So XFS hijacks REQ_FUA to bypass its journaling _but_ expects that > > > REQ_FUA to have no impact on the underlying storage? Why not clear > > > REQ_FUA after XFS decides to bypass its journal? Why use REQ_FUA for > > > this? What are your assumptions about how the underlying block device > > > will respond? > > > > "Hijack" is a bit ... strong. We replaced a full device cache flush > > with an FUA write at the request of a partner that demonstrated that > > using FUA writes in their applications on other operating systems on > > the same hardware configurations ran them 25-50% faster than on > > RHEL/XFS. > > > > I reproduced their results locally, and the difference in > > performance was the device level overhead between write+full device > > caceh flush (the traditional linux DIO O_DSYNC implementation) and > > using FUA writes to avoid the full device cache flush when only the > > data in the IO itself needed to be written to stable storage. > > > > It's not at all controversial - REQ_FUA has been used by filesystems > > for years for their journal writes to optimise away the post-write > > cache flush on hardware that supports FUA. And ext4, gfs2, zonefs > > and btrfs all use this same code direct IO code path now, so it's > > not just XFS doing this... > > > > FWIW, I mused on the weekend that we don't even need to check that > > the block device supports FUA to make this optimisation, because the > > block layer will convert REQ_FUA to a normal write + post write > > cache flush if the underlying device doesn't support FUA. So we are > > likely to make all pure O_DSYNC data overwrites use REQ_FUA in the > > future, not just on devices that support FUA... > > > > > > This is what enterprise applications do for IO performance - small, > > > > random, highly concurrent O_DSYNC AIO+DIO overwrites - so it seems > > > > like this is going to cause IO serialisation problems even if there is > > > > only one thin-vol in a thin-pool... > > > > > > Yes, but I currently cannot say this is thinp's fault. REQ_FUA must be > > > honored. If dm-thinp is being asked to flush caches and get a > > > consistent point-in-time it cannot be ignored. > > > > Yes, that's the point I was trying to make. FUA is not a cache flush > > request - it's much more fine grained than that, and so can be used > > to elide full device cache flushes. Hence requiring global commits > > to honor FUA data writes defeats the higher level optimisations that > > try to minimise the device cache flush overhead... > > > > > So what additonal data structures and approach should we be taking to > > > mask the impact of needing to keep data and metadata consistent? > > > > I'm not a dm-thin expert, but I can describe the constraints for > > using REQ_FUA because I think they probably apply to dm-thin, too > > > > For filesystems, there are per-inode metadata trees, and that's all > > we need to be stable w.r.t. O_DSYNC data writes. If the per-inode > > metadata is up-to-date on disk (i.e. we are doing a pure overwrite > > and no metadata needs to change to perform the data IO) then we can > > issue a REQ_FUA write. If there is any metadata that needs to be > > modified to perform the write (e.g. we need an allocation, convert > > an unwritten extent, etc) or something else dirtied the inode > > metadata prior to the write, then we do a normal data write and a > > post-write journal flush (which is REQ_PREFLUSH|REQ_FUA write) to > > ensure both the data and the metadata that references it is on > > stable storage. > > > > See this commit: > > > > commit 3460cac1ca76215a60acb086ebe97b3e50731628 > > Author: Dave Chinner > > Date: Wed May 2 12:54:53 2018 -0700 > > > > iomap: Use FUA for pure data O_DSYNC DIO writes > > .... > > > > Also, check the current upstream tree for IOMAP_F_DIRTY so the fs > > can indicate that inode has O_DSYNC dirty metadata (allocate, > > unwritten) or IOMAP_F_SHARED (requires COW), and the iomap > > implementation uses the IOMAP_DIO_WRITE_FUA and IOMAP_DIO_NEED_SYNC > > flags to determine what to do at IO completion. > > > > I suspect that dm-thin would need to do something similar for > > REQ_FUA writes - if the thin-dev metadata (rather than the global > > pool metadata) is clean and there's an already allocated block being > > overwritten into, then you can just pass the FUA down and not have > > to run a global transaction commit. If there's anything dirty, or > > the IO requires COW or allocation, then it needs to run through the > > existing slow path... > > > > > > > This limits the number of active > > > > > thin devices. Similarly if different thin devices are performing > > > > > operations that allocate new data space (provisioning, COW), then there > > > > > can be contention in the allocators. > > > > > > > > Or we are doing concurrent IO to the same thin device... > > > > > > Concurrent IO that is causing a denial-of-service due to floods of > > > REQ_FUA IO? Yeap.. that sounds like it'd hurt ;) > > > > This is what enterprise applications like SAP, MS-SQL, Seastar, etc > > all do to extract maximum performance out of the underlying storage. > > They can have hundreds, even thousands of O_DSYNC AIO+DIO in flight at > > once, generally only limited only by software and hardware queue depths. > > > > So, yeah, if "general use" is the goal, this better not be a DOS > > vector. ;) > > Just bumping this thread because it has been over 3 years now and > dm-thinp hasn't seen any changes to address its heavy handling of > REQ_FUA. > > Mike