public inbox for linux-nvme@lists.infradead.org
 help / color / mirror / Atom feed
From: "Martin K. Petersen" <martin.petersen@oracle.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: John Garry <john.g.garry@oracle.com>,
	Dave Chinner <david@fromorbit.com>,
	axboe@kernel.dk, kbusch@kernel.org, hch@lst.de, sagi@grimberg.me,
	martin.petersen@oracle.com, viro@zeniv.linux.org.uk,
	brauner@kernel.org, dchinner@redhat.com, jejb@linux.ibm.com,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-nvme@lists.infradead.org, linux-scsi@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-security-module@vger.kernel.org, paul@paul-moore.com,
	jmorris@namei.org, serge@hallyn.com,
	Himanshu Madhani <himanshu.madhani@oracle.com>
Subject: Re: [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits
Date: Sat, 06 May 2023 21:59:30 -0400	[thread overview]
Message-ID: <yq1y1m10w0d.fsf@ca-mkp.ca.oracle.com> (raw)
In-Reply-To: <20230505220056.GJ15394@frogsfrogsfrogs> (Darrick J. Wong's message of "Fri, 5 May 2023 15:00:56 -0700")


Darrick,

> Could a SCSI device could advertise 512b LBAs, 4096b physical blocks,
> a 64k atomic_write_unit_max, and a 1MB maximum transfer length
> (atomic_write_max_bytes)?

Yes.

> And does that mean that application software can send one 64k-aligned
> write and expect it either to be persisted completely or not at all?

Yes.

> And, does that mean that the application can send up to 16 of these
> 64k-aligned blocks as a single 1MB IO and expect that each of those 16
> blocks will either be persisted entirely or not at all?

Yes.

> There doesn't seem to be any means for the device to report /which/ of
> the 16 were persisted, which is disappointing. But maybe the
> application encodes LSNs and can tell after the fact that something
> went wrong, and recover?

Correct. Although we traditionally haven't had too much fun with partial
completion for sequential I/O either.

> If the same device reports a 2048b atomic_write_unit_min, does that mean
> that I can send between 2 and 64k of data as a single atomic write and
> that's ok?  I assume that this weird situation (512b LBA, 4k physical,
> 2k atomic unit min) requires some fancy RMW but that the device is
> prepared to cr^Wpersist that correctly?

Yes.

It would not make much sense for a device to report a minimum atomic
granularity smaller than the reported physical block size. But in theory
it could.

> What if the device also advertises a 128k atomic_write_boundary?
> That means that a 2k atomic block write will fail if it starts at 127k,
> but if it starts at 126k then thats ok.  Right?

Correct.

> As for avoiding splits in the block layer, I guess that also means that
> someone needs to reduce atomic_write_unit_max and atomic_write_boundary
> if (say) some sysadmin decides to create a raid0 of these devices with a
> 32k stripe size?

Correct. Atomic limits will need to be stacked for MD and DM like we do
with the remaining queue limits.

> It sounds like NVME is simpler in that it would report 64k for both the
> max unit and the max transfer length?  And for the 1M write I mentioned
> above, the application must send 16 individual writes?

Correct.

> With my app developer hat on, the simplest mental model of this is that
> if I want to persist a blob of data that is larger than one device LBA,
> then atomic_write_unit_min <= blob size <= atomic_write_unit_max must be
> true, and the LBA range for the write cannot cross a atomic_write_boundary.
>
> Does that sound right?

Yep.

> Going back to my sample device above, the XFS buffer cache could write
> individual 4k filesystem metadata blocks using REQ_ATOMIC because 4k is
> between the atomic write unit min/max, 4k metadata blocks will never
> cross a 128k boundary, and we'd never have to worry about torn writes
> in metadata ever again?

Correct.

> Furthermore, if I want to persist a bunch of blobs in a contiguous LBA
> range and atomic_write_max_bytes > atomic_write_unit_max, then I can do
> that with a single direct write?

Yes.

> I'm assuming that the blobs in the middle of the range must all be
> exactly atomic_write_unit_max bytes in size?

If you care about each blob being written atomically, yes.

> And I had better be prepared to (I guess) re-read the entire range
> after the system goes down to find out if any of them did or did not
> persist?

If you crash or get an I/O error, then yes. There is no way to inquire
which blobs were written. Just like we don't know which LBAs were
written if the OS crashes in the middle of a regular write operation.

-- 
Martin K. Petersen	Oracle Linux Engineering


  reply	other threads:[~2023-05-07  3:01 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-03 18:38 [PATCH RFC 00/16] block atomic writes John Garry
2023-05-03 18:38 ` [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits John Garry
2023-05-03 21:39   ` Dave Chinner
2023-05-04 18:14     ` John Garry
2023-05-04 22:26       ` Dave Chinner
2023-05-05  7:54         ` John Garry
2023-05-05 22:00           ` Darrick J. Wong
2023-05-07  1:59             ` Martin K. Petersen [this message]
2023-05-05 23:18           ` Dave Chinner
2023-05-06  9:38             ` John Garry
2023-05-07  2:35             ` Martin K. Petersen
2023-05-05 22:47         ` Eric Biggers
2023-05-05 23:31           ` Dave Chinner
2023-05-06  0:08             ` Eric Biggers
2023-05-09  0:19   ` Mike Snitzer
2023-05-17 17:02     ` John Garry
2023-05-03 18:38 ` [PATCH RFC 02/16] fs/bdev: Add atomic write support info to statx John Garry
2023-05-03 21:58   ` Dave Chinner
2023-05-04  8:45     ` John Garry
2023-05-04 22:40       ` Dave Chinner
2023-05-05  8:01         ` John Garry
2023-05-05 22:04           ` Darrick J. Wong
2023-05-03 18:38 ` [PATCH RFC 03/16] xfs: Support atomic write for statx John Garry
2023-05-03 22:17   ` Dave Chinner
2023-05-05 22:10     ` Darrick J. Wong
2023-05-03 18:38 ` [PATCH RFC 04/16] fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support John Garry
2023-05-03 18:38 ` [PATCH RFC 05/16] block: Add REQ_ATOMIC flag John Garry
2023-05-03 18:38 ` [PATCH RFC 06/16] block: Limit atomic writes according to bio and queue limits John Garry
2023-05-03 18:53   ` Keith Busch
2023-05-04  8:24     ` John Garry
2023-05-03 18:38 ` [PATCH RFC 07/16] block: Add bdev_find_max_atomic_write_alignment() John Garry
2023-05-03 18:38 ` [PATCH RFC 08/16] block: Add support for atomic_write_unit John Garry
2023-05-03 18:38 ` [PATCH RFC 09/16] block: Add blk_validate_atomic_write_op() John Garry
2023-05-03 18:38 ` [PATCH RFC 10/16] block: Add fops atomic write support John Garry
2023-05-03 18:38 ` [PATCH RFC 11/16] fs: iomap: Atomic " John Garry
2023-05-04  5:00   ` Dave Chinner
2023-05-05 21:19     ` Darrick J. Wong
2023-05-05 23:56       ` Dave Chinner
2023-05-03 18:38 ` [PATCH RFC 12/16] xfs: Add support for fallocate2 John Garry
2023-05-03 23:26   ` Dave Chinner
2023-05-05 22:23     ` Darrick J. Wong
2023-05-05 23:42       ` Dave Chinner
2023-05-03 18:38 ` [PATCH RFC 13/16] scsi: sd: Support reading atomic properties from block limits VPD John Garry
2023-05-03 18:38 ` [PATCH RFC 14/16] scsi: sd: Add WRITE_ATOMIC_16 support John Garry
2023-05-03 18:48   ` Bart Van Assche
2023-05-04  8:17     ` John Garry
2023-05-03 18:38 ` [PATCH RFC 15/16] scsi: scsi_debug: Atomic write support John Garry
2023-05-03 18:38 ` [PATCH RFC 16/16] nvme: Support atomic writes John Garry
2023-05-03 18:49   ` Bart Van Assche
2023-05-04  8:19     ` John Garry

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=yq1y1m10w0d.fsf@ca-mkp.ca.oracle.com \
    --to=martin.petersen@oracle.com \
    --cc=axboe@kernel.dk \
    --cc=brauner@kernel.org \
    --cc=david@fromorbit.com \
    --cc=dchinner@redhat.com \
    --cc=djwong@kernel.org \
    --cc=hch@lst.de \
    --cc=himanshu.madhani@oracle.com \
    --cc=jejb@linux.ibm.com \
    --cc=jmorris@namei.org \
    --cc=john.g.garry@oracle.com \
    --cc=kbusch@kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=linux-security-module@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=paul@paul-moore.com \
    --cc=sagi@grimberg.me \
    --cc=serge@hallyn.com \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox