linux-block.vger.kernel.org archive mirror
From: Christian Brauner <brauner@kernel.org>
To: Christoph Hellwig <hch@lst.de>
Cc: John Garry <john.g.garry@oracle.com>,
	 Christoph Hellwig <hch@infradead.org>,
	"Darrick J. Wong" <djwong@kernel.org>,
	 linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-block@vger.kernel.org,  linux-nvme@lists.infradead.org
Subject: Re: Do we need an opt-in for file systems use of hw atomic writes?
Date: Tue, 15 Jul 2025 14:20:36 +0200
Message-ID: <20250715-gekapert-einsam-4645671c7555@brauner>
In-Reply-To: <20250715112952.GA23935@lst.de>

On Tue, Jul 15, 2025 at 01:29:52PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 15, 2025 at 12:02:06PM +0200, Christian Brauner wrote:
> > > I'm not sure a XFLAG is all that useful.  It's not really a per-file
> > > persistent thing.  It's more of a mount option, or better persistent
> > > mount-option attr like we did for autofsck.
> > 
> > If we were to make this a mount option it would be really really ugly.
> > Either it is a filesystem specific mount option and then we have the
> > problem that we're ending up with different mount option names
> > per-filesystem.
> 
> Not that I'm arguing for a mount option (this should be sticky), but
> we've had plenty of fs parsed mount options with common semantics.
> 
> > It feels like this is something that needs to be done on the block
> > layer. IOW, maybe add generic block layer ioctls or a per-device sysfs
> > entry that allows to turn atomic writes on or off. That information
> > would then also potentially be available to the filesystem to, e.g.,
> > generate an info message during mount that hardware atomics are used or
> > aren't used. Because ultimately the block layer is where the decision
> > needs to be made.
> 
> The block layer just passes things through.

We already have bdev_can_atomic_write() which checks whether the
underlying device is capable of hardware assisted atomic writes. If
that's the case the filesystem currently just uses them, fine.

So it is possible to implement an ioctl() that allows an administrator
to mark a device as untrusted for hardware assisted atomic writes.
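
To make the idea concrete, here is a minimal userspace sketch of the gating
logic, not kernel code. Everything in it is hypothetical: struct model_bdev,
the atomic_writes_untrusted flag and the model_*() helpers are illustrative
stand-ins, and the capability check is a deliberately simplified version of
what bdev_can_atomic_write() does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a block device; not a kernel structure. */
struct model_bdev {
	unsigned int atomic_write_unit_max;	/* 0: no hardware atomic writes */
	bool atomic_writes_untrusted;		/* hypothetical admin/udev override */
};

/* Simplified stand-in for the capability check: the device advertises a
 * non-zero atomic write unit. */
static bool model_hw_can_atomic_write(const struct model_bdev *bdev)
{
	assert(bdev != NULL);
	return bdev->atomic_write_unit_max > 0;
}

/* Stand-in for the proposed ioctl(): record the administrator's verdict. */
static void model_set_atomic_untrusted(struct model_bdev *bdev, bool untrusted)
{
	assert(bdev != NULL);
	bdev->atomic_writes_untrusted = untrusted;
}

/* What a filesystem would see under the proposal: the hardware is capable
 * and nobody has marked the device as untrusted. */
static bool model_can_atomic_write(const struct model_bdev *bdev)
{
	return model_hw_can_atomic_write(bdev) &&
	       !bdev->atomic_writes_untrusted;
}
```

The point of the sketch is that the filesystem keeps calling a single helper
and the override is folded into its result, so no filesystem-specific mount
option is needed.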

This is also nice because it can be integrated with udev easily. If a
device is known to have broken hardware assisted atomic writes, then add
the device to systemd-udev's hardware database (hwdb).

When systemd-udev sees that device show up during boot it will
automatically mark that device as having broken atomic write support and
any mount of that device will have the filesystem immediately see the
broken hardware assisted atomic write support in bdev_can_atomic_write()
and not use it.

Fwiw, this pattern is already used elsewhere. For example, udev
auto-applies known iocost solutions to matching block devices. The broken
atomic write handling would fit in there very well, as either an allowlist
or a denylist.
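
A rough sketch of what the denylist lookup could look like from a udev
helper's point of view. The key layout is borrowed from the iocost hwdb
example in the commit quoted below; the denylist entries, the model names
and the function name are made up for illustration, and hwdb's glob-style
matching is only approximated here with POSIX fnmatch().

```c
#include <assert.h>
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical denylist; these are not real device models. */
static const char *const atomic_write_denylist[] = {
	"block:*:name:EXAMPLE-BROKEN-SSD:fwver:1.0*:",
	"block:*:name:EXAMPLE-OTHER-*:fwver:*:",
};

/* Returns true if the device matches a denylist pattern. */
static bool atomic_writes_denylisted(const char *devpath, const char *model,
				     const char *fwver)
{
	char key[256];
	int len;

	assert(devpath != NULL && model != NULL && fwver != NULL);

	/* Build a match key in the same layout as the iocost hwdb entries. */
	len = snprintf(key, sizeof(key), "block:%s:name:%s:fwver:%s:",
		       devpath, model, fwver);
	if (len < 0 || (size_t)len >= sizeof(key))
		return false;	/* truncated key: treat the device as unknown */

	for (size_t i = 0;
	     i < sizeof(atomic_write_denylist) / sizeof(atomic_write_denylist[0]);
	     i++) {
		/* fnmatch() returns 0 when the pattern matches the string. */
		if (fnmatch(atomic_write_denylist[i], key, 0) == 0)
			return true;
	}
	return false;
}
```

On a match, the helper would then flip the per-device knob discussed above
before any filesystem on that device gets mounted.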

commit 6b8e90545e918a4653281b3672a873e948f12b65
Author:     Gustavo Noronha Silva <gustavo.noronha@collabora.com>
AuthorDate: Mon May 2 14:02:23 2022 -0300
Commit:     Lennart Poettering <lennart@poettering.net>
CommitDate: Thu Apr 20 16:45:57 2023 +0200

    Apply known iocost solutions to block devices

    Meta's resource control demo project[0] includes a benchmark tool that can
    be used to calculate the best iocost solutions for a given SSD.

      [0]: https://github.com/facebookexperimental/resctl-demo

    A project[1] has now been started to create a publicly available database
    of results that can be used to apply them automatically.

      [1]: https://github.com/iocost-benchmark/iocost-benchmarks

    This change adds a new tool that gets triggered by a udev rule for any
    block device and queries the hwdb for known solutions. The format for
    the hwdb file that is currently generated by the github action looks like
    this:

      # This file was auto-generated on Tue, 23 Aug 2022 13:03:57 +0000.
      # From the following commit:
      # https://github.com/iocost-benchmark/iocost-benchmarks/commit/ca82acfe93c40f21d3b513c055779f43f1126f88
      #
      # Match key format:
      # block:<devpath>:name:<model name>:

      # 12 points, MOF=[1.346,1.346], aMOF=[1.249,1.249]
      block:*:name:HFS256GD9TNG-62A0A:fwver:*:
        IOCOST_SOLUTIONS=isolation isolated-bandwidth bandwidth naive
        IOCOST_MODEL_ISOLATION=rbps=1091439492 rseqiops=52286 rrandiops=63784 wbps=192329466 wseqiops=12309 wrandiops=16119
        IOCOST_QOS_ISOLATION=rpct=0.00 rlat=8807 wpct=0.00 wlat=59023 min=100.00 max=100.00
        IOCOST_MODEL_ISOLATED_BANDWIDTH=rbps=1091439492 rseqiops=52286 rrandiops=63784 wbps=192329466 wseqiops=12309 wrandiops=16119
        IOCOST_QOS_ISOLATED_BANDWIDTH=rpct=0.00 rlat=8807 wpct=0.00 wlat=59023 min=100.00 max=100.00
        IOCOST_MODEL_BANDWIDTH=rbps=1091439492 rseqiops=52286 rrandiops=63784 wbps=192329466 wseqiops=12309 wrandiops=16119
        IOCOST_QOS_BANDWIDTH=rpct=0.00 rlat=8807 wpct=0.00 wlat=59023 min=100.00 max=100.00
        IOCOST_MODEL_NAIVE=rbps=1091439492 rseqiops=52286 rrandiops=63784 wbps=192329466 wseqiops=12309 wrandiops=16119
        IOCOST_QOS_NAIVE=rpct=99.00 rlat=8807 wpct=99.00 wlat=59023 min=75.00 max=100.00

    The IOCOST_SOLUTIONS key lists the solutions available for that device
    in the preferred order for higher isolation, which is a reasonable
    default for most client systems. This can be overridden to choose better
    defaults for custom use cases, like the various data center workloads.

    The tool can also be used to query the known solutions for a specific
    device or to apply a non-default solution (say, isolation or bandwidth).

    Co-authored-by: Santosh Mahto <santosh.mahto@collabora.com>
