public inbox for linux-ext4@vger.kernel.org
 help / color / mirror / Atom feed
* direct I/O: ext4 seems to not honor RWF_DSYNC when journal is disabled
       [not found] <87ttscddv4.fsf@x1.mail-host-address-is-not-set>
@ 2023-09-06 20:15 ` Free Ekanayaka
  2024-01-08 21:31   ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Free Ekanayaka @ 2023-09-06 20:15 UTC (permalink / raw)
  To: linux-ext4

(reposting after having subscribed, and from a different address since
the first mail did not show up in the archives and didn't bounce either)

Hello,

I'm using Linux 6.4.0 from Debian/testing (but tried this with 6.5.1
too).

I've created an ext4 filesystem with journalling disabled on an NVMe
drive:

mkfs.ext4 -O ^has_journal -F /dev/nvme0n1p6

I have a program that creates and open a new file with O_DIRECT, and
sets its size to 8M with posix_fallocate(), something like:

fd = open("/dir/file", O_CREAT | O_WRONLY | O_DIRECT);
posix_fallocate(fd, 0, 8 * 1024 * 1024);
fsync(fd);
dirfd = open("/dir", O_RDONLY | O_DIRECTORY);
fsync(dirfd);

and then it uses io_uring to perform a single write of 4096 bytes at the
beginning of the file, passing RWF_DSYNC to the submitted
IORING_OP_WRITE_FIXED entry,

I would expect the kernel to tell the NVMe device to actually flush the
write, not only buffer it. However I measured the end-to-end latency of
the io_uring operation and it was very low, as if the write was only
buffered by the NVMe device, but not flushed.

This suspicion seems to be confirmed by tracing the write nvme command
sent to the device:

raft-benchmark-37801   [003] .....  9904.831153: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=4, cmdid=25169, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498455480, len=7, ctrl=0x0, dsmgmt=0, reftag=0)

notice the "ctrl=0x0" in there. If I run the same program, but I write
directly to the /dev/nvme0n1p6 device instead of using a file, I get:

raft-benchmark-37936   [003] .....  9942.784628: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=4, cmdid=29265, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498192384, len=7, ctrl=0x4000, dsmgmt=0, reftag=0)

and here notice the "ctrl=0x4000", which I would expect is some flag to
tell to the NVMe device to also flush the write.

If I enable "write through" on the NVMe device:

echo "write through" | sudo tee /sys/block/nvme0n1/queue/write_cache

then also the direct write to the /dev/nvme0n1p6 device avoids flushing
the command, as expected:

raft-benchmark-38106   [003] .....  9971.741308: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=4, cmdid=33361, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498192384, len=7, ctrl=0x0, dsmgmt=0, reftag=0)

(notice the "ctrl=0x0" there).

My question is: is my interpretation correct? Is it expected that ext4
does not tell the block layer/device to flush this write when
journalling is disabled?

Below you find the complete event trace of the io_uring call, both in
the ext4 write case and in the raw /dev/nvme0n1p6 write case.

I can provide a C reproducer as well if needed, but wanted to first
check if this rings any bell.

Thanks,

Free

== ext4 ==

  raft-benchmark-37801   [003] .....  9904.830974: io_uring_submit_req: ring 0000000011cab2e4, req 00000000c7a7d835, user_data 0x0, opcode WRITE_FIXED, flags 0x1, sq_thread 0
  raft-benchmark-37801   [003] .....  9904.830982: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
  raft-benchmark-37801   [003] .....  9904.830983: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 U
  raft-benchmark-37801   [003] .....  9904.830985: ext4_journal_start_inode: dev 259,5 blocks 2, rsv_blocks 0, revoke_creds 8, type 1, ino 12, caller ext4_dirty_inode+0x38/0x80 [ext4]
  raft-benchmark-37801   [003] .....  9904.830987: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_dirty_inode+0x5b/0x80 [ext4]
  raft-benchmark-37801   [003] .....  9904.830989: block_touch_buffer: 259,5 sector=135 size=4096
  raft-benchmark-37801   [003] .....  9904.830993: block_dirty_buffer: 259,5 sector=135 size=4096
  raft-benchmark-37801   [003] .....  9904.831121: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
  raft-benchmark-37801   [003] .....  9904.831122: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 UR
  raft-benchmark-37801   [003] .....  9904.831123: ext4_journal_start_inode: dev 259,5 blocks 8, rsv_blocks 0, revoke_creds 8, type 3, ino 12, caller ext4_iomap_begin+0x1c2/0x2f0 [ext4]
  raft-benchmark-37801   [003] .....  9904.831124: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
  raft-benchmark-37801   [003] .....  9904.831124: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 UR
  raft-benchmark-37801   [003] .....  9904.831125: ext4_ext_map_blocks_enter: dev 259,5 ino 12 lblk 0 len 1 flags CREATE|UNWRIT|PRE_IO
  raft-benchmark-37801   [003] .....  9904.831126: ext4_es_cache_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
  raft-benchmark-37801   [003] .....  9904.831127: ext4_ext_show_extent: dev 259,5 ino 12 lblk 0 pblk 32887 len 1
  raft-benchmark-37801   [003] .....  9904.831128: ext4_ext_handle_unwritten_extents: dev 259,5 ino 12 m_lblk 0 m_pblk 32887 m_len 1 flags CREATE|UNWRIT|PRE_IO|METADATA_NOFAIL allocated 1 newblock 32887
  raft-benchmark-37801   [003] .....  9904.831129: ext4_es_cache_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
  raft-benchmark-37801   [003] .....  9904.831130: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_split_extent+0xcd/0x190 [ext4]
  raft-benchmark-37801   [003] .....  9904.831131: block_touch_buffer: 259,5 sector=135 size=4096
  raft-benchmark-37801   [003] .....  9904.831133: block_dirty_buffer: 259,5 sector=135 size=4096
  raft-benchmark-37801   [003] .....  9904.831134: ext4_ext_map_blocks_exit: dev 259,5 ino 12 flags CREATE|UNWRIT|PRE_IO lblk 0 pblk 32887 len 1 mflags NMU ret 1
  raft-benchmark-37801   [003] .....  9904.831135: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
  raft-benchmark-37801   [003] .....  9904.831135: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 UR
  raft-benchmark-37801   [003] .....  9904.831136: ext4_es_insert_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
  raft-benchmark-37801   [003] .....  9904.831143: block_bio_remap: 259,0 WS 498455480 + 8 <- (259,5) 263096
  raft-benchmark-37801   [003] .....  9904.831144: block_bio_queue: 259,0 WS 498455480 + 8 [raft-benchmark]
  raft-benchmark-37801   [003] .....  9904.831149: block_getrq: 259,0 WS 498455480 + 8 [raft-benchmark]
  raft-benchmark-37801   [003] .....  9904.831149: block_plug: [raft-benchmark]
  raft-benchmark-37801   [003] .....  9904.831153: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=4, cmdid=25169, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498455480, len=7, ctrl=0x0, dsmgmt=0, reftag=0)
  raft-benchmark-37801   [003] .....  9904.831159: block_rq_issue: 259,0 WS 4096 () 498455480 + 8 [raft-benchmark]
  raft-benchmark-37801   [003] d.h..  9904.831173: nvme_sq: nvme0: disk=nvme0n1, qid=4, head=783, tail=783
  raft-benchmark-37801   [003] d.h..  9904.831177: nvme_complete_rq: nvme0: disk=nvme0n1, qid=4, cmdid=25169, res=0x0, retries=0, flags=0x0, status=0x0
  raft-benchmark-37801   [003] d.h..  9904.831178: block_rq_complete: 259,0 WS () 498455480 + 8 [0]
     kworker/3:1-30279   [003] .....  9904.831193: ext4_journal_start_inode: dev 259,5 blocks 8, rsv_blocks 0, revoke_creds 8, type 3, ino 12, caller ext4_convert_unwritten_extents+0xb4/0x260 [ext4]
     kworker/3:1-30279   [003] .....  9904.831193: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
     kworker/3:1-30279   [003] .....  9904.831194: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 U
     kworker/3:1-30279   [003] .....  9904.831194: ext4_ext_map_blocks_enter: dev 259,5 ino 12 lblk 0 len 1 flags CREATE|UNWRIT|CONVERT
     kworker/3:1-30279   [003] .....  9904.831195: ext4_es_cache_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
     kworker/3:1-30279   [003] .....  9904.831195: ext4_ext_show_extent: dev 259,5 ino 12 lblk 0 pblk 32887 len 1
     kworker/3:1-30279   [003] .....  9904.831196: ext4_ext_handle_unwritten_extents: dev 259,5 ino 12 m_lblk 0 m_pblk 32887 m_len 1 flags CREATE|UNWRIT|CONVERT|METADATA_NOFAIL allocated 1 newblock 32887
     kworker/3:1-30279   [003] .....  9904.831196: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_ext_map_blocks+0xeee/0x1980 [ext4]
     kworker/3:1-30279   [003] .....  9904.831197: block_touch_buffer: 259,5 sector=135 size=4096
     kworker/3:1-30279   [003] .....  9904.831198: block_dirty_buffer: 259,5 sector=135 size=4096
     kworker/3:1-30279   [003] .....  9904.831199: ext4_ext_map_blocks_exit: dev 259,5 ino 12 flags CREATE|UNWRIT|CONVERT lblk 0 pblk 32887 len 1 mflags M ret 1
     kworker/3:1-30279   [003] .....  9904.831199: ext4_es_insert_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status W
     kworker/3:1-30279   [003] .....  9904.831200: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_convert_unwritten_extents+0x1e2/0x260 [ext4]
     kworker/3:1-30279   [003] .....  9904.831200: block_touch_buffer: 259,5 sector=135 size=4096
     kworker/3:1-30279   [003] .....  9904.831201: block_dirty_buffer: 259,5 sector=135 size=4096
     kworker/3:1-30279   [003] .....  9904.831202: ext4_sync_file_enter: dev 259,5 ino 12 parent 2 datasync 1 
     kworker/3:1-30279   [003] .....  9904.831203: ext4_sync_file_exit: dev 259,5 ino 12 ret 0
  raft-benchmark-37801   [003] ...1.  9904.831212: io_uring_complete: ring 0000000011cab2e4, req 00000000c7a7d835, user_data 0x0, result 4096, cflags 0x0 extra1 0 extra2 0 
  raft-benchmark-37801   [003] .....  9904.831213: io_uring_task_work_run: tctx 000000005c8048e9, count 1, loops 1

== /dev/nvme0n1p6 ==

  raft-benchmark-37936   [003] .....  9942.784613: io_uring_submit_req: ring 00000000c4f84b11, req 00000000058196bf, user_data 0x0, opcode WRITE_FIXED, flags 0x1, sq_thread 0
  raft-benchmark-37936   [003] .....  9942.784622: block_bio_remap: 259,0 WFS 498192384 + 8 <- (259,5) 0
  raft-benchmark-37936   [003] .....  9942.784622: block_bio_queue: 259,0 WFS 498192384 + 8 [raft-benchmark]
  raft-benchmark-37936   [003] .....  9942.784626: block_getrq: 259,0 WFS 498192384 + 8 [raft-benchmark]
  raft-benchmark-37936   [003] .....  9942.784628: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=4, cmdid=29265, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498192384, len=7, ctrl=0x4000, dsmgmt=0, reftag=0)
  raft-benchmark-37936   [003] .....  9942.784634: block_rq_issue: 259,0 WFS 4096 () 498192384 + 8 [raft-benchmark]
  raft-benchmark-37936   [003] d.h..  9942.794331: nvme_sq: nvme0: disk=nvme0n1, qid=4, head=784, tail=784
  raft-benchmark-37936   [003] d.h..  9942.794334: nvme_complete_rq: nvme0: disk=nvme0n1, qid=4, cmdid=29265, res=0x0, retries=0, flags=0x0, status=0x0
  raft-benchmark-37936   [003] d.h..  9942.794334: block_rq_complete: 259,0 WFS () 498192384 + 8 [0]
  raft-benchmark-37936   [003] ...1.  9942.794338: io_uring_complete: ring 00000000c4f84b11, req 00000000058196bf, user_data 0x0, result 4096, cflags 0x0 extra1 0 extra2 0 
  raft-benchmark-37936   [003] .....  9942.794338: io_uring_task_work_run: tctx 000000005fccbc0a, count 1, loops 1
-------------------- End of forwarded message --------------------

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: direct I/O: ext4 seems to not honor RWF_DSYNC when journal is disabled
  2023-09-06 20:15 ` direct I/O: ext4 seems to not honor RWF_DSYNC when journal is disabled Free Ekanayaka
@ 2024-01-08 21:31   ` Jan Kara
  2024-01-09  6:05     ` Dave Chinner
  2024-01-09 15:46     ` Free Ekanayaka
  0 siblings, 2 replies; 9+ messages in thread
From: Jan Kara @ 2024-01-08 21:31 UTC (permalink / raw)
  To: Free Ekanayaka; +Cc: linux-ext4

Hello,

I've found this when going through some old email. Were you able to debug
this?

On Wed 06-09-23 21:15:01, Free Ekanayaka wrote:
> I'm using Linux 6.4.0 from Debian/testing (but tried this with 6.5.1
> too).
> 
> I've created an ext4 filesystem with journalling disabled on an NVMe
> drive:
> 
> mkfs.ext4 -O ^has_journal -F /dev/nvme0n1p6
> 
> I have a program that creates and open a new file with O_DIRECT, and
> sets its size to 8M with posix_fallocate(), something like:
> 
> fd = open("/dir/file", O_CREAT | O_WRONLY | O_DIRECT);
> posix_fallocate(fd, 0, 8 * 1024 * 1024);
> fsync(fd);
> dirfd = open("/dir", O_RDONLY | O_DIRECTORY);
> fsync(dirfd);
> 
> and then it uses io_uring to perform a single write of 4096 bytes at the
> beginning of the file, passing RWF_DSYNC to the submitted
> IORING_OP_WRITE_FIXED entry,
> 
> I would expect the kernel to tell the NVMe device to actually flush the
> write, not only buffer it. However I measured the end-to-end latency of
> the io_uring operation and it was very low, as if the write was only
> buffered by the NVMe device, but not flushed.

Yes, the kernel should issue device cache flush or mark the IO as FUA.

> This suspicion seems to be confirmed by tracing the write nvme command
> sent to the device:
> 
> raft-benchmark-37801   [003] .....  9904.831153: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=4, cmdid=25169, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498455480, len=7, ctrl=0x0, dsmgmt=0, reftag=0)
> 
> notice the "ctrl=0x0" in there.

Not quite sure about the NVME here but ...

> == ext4 ==
> 
>   raft-benchmark-37801   [003] .....  9904.830974: io_uring_submit_req: ring 0000000011cab2e4, req 00000000c7a7d835, user_data 0x0, opcode WRITE_FIXED, flags 0x1, sq_thread 0
>   raft-benchmark-37801   [003] .....  9904.830982: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
>   raft-benchmark-37801   [003] .....  9904.830983: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 U
>   raft-benchmark-37801   [003] .....  9904.830985: ext4_journal_start_inode: dev 259,5 blocks 2, rsv_blocks 0, revoke_creds 8, type 1, ino 12, caller ext4_dirty_inode+0x38/0x80 [ext4]
>   raft-benchmark-37801   [003] .....  9904.830987: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_dirty_inode+0x5b/0x80 [ext4]
>   raft-benchmark-37801   [003] .....  9904.830989: block_touch_buffer: 259,5 sector=135 size=4096
>   raft-benchmark-37801   [003] .....  9904.830993: block_dirty_buffer: 259,5 sector=135 size=4096
>   raft-benchmark-37801   [003] .....  9904.831121: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
>   raft-benchmark-37801   [003] .....  9904.831122: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 UR
>   raft-benchmark-37801   [003] .....  9904.831123: ext4_journal_start_inode: dev 259,5 blocks 8, rsv_blocks 0, revoke_creds 8, type 3, ino 12, caller ext4_iomap_begin+0x1c2/0x2f0 [ext4]
>   raft-benchmark-37801   [003] .....  9904.831124: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
>   raft-benchmark-37801   [003] .....  9904.831124: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 UR
>   raft-benchmark-37801   [003] .....  9904.831125: ext4_ext_map_blocks_enter: dev 259,5 ino 12 lblk 0 len 1 flags CREATE|UNWRIT|PRE_IO
>   raft-benchmark-37801   [003] .....  9904.831126: ext4_es_cache_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
>   raft-benchmark-37801   [003] .....  9904.831127: ext4_ext_show_extent: dev 259,5 ino 12 lblk 0 pblk 32887 len 1
>   raft-benchmark-37801   [003] .....  9904.831128: ext4_ext_handle_unwritten_extents: dev 259,5 ino 12 m_lblk 0 m_pblk 32887 m_len 1 flags CREATE|UNWRIT|PRE_IO|METADATA_NOFAIL allocated 1 newblock 32887
>   raft-benchmark-37801   [003] .....  9904.831129: ext4_es_cache_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
>   raft-benchmark-37801   [003] .....  9904.831130: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_split_extent+0xcd/0x190 [ext4]
>   raft-benchmark-37801   [003] .....  9904.831131: block_touch_buffer: 259,5 sector=135 size=4096
>   raft-benchmark-37801   [003] .....  9904.831133: block_dirty_buffer: 259,5 sector=135 size=4096
>   raft-benchmark-37801   [003] .....  9904.831134: ext4_ext_map_blocks_exit: dev 259,5 ino 12 flags CREATE|UNWRIT|PRE_IO lblk 0 pblk 32887 len 1 mflags NMU ret 1
>   raft-benchmark-37801   [003] .....  9904.831135: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
>   raft-benchmark-37801   [003] .....  9904.831135: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 UR
>   raft-benchmark-37801   [003] .....  9904.831136: ext4_es_insert_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
>   raft-benchmark-37801   [003] .....  9904.831143: block_bio_remap: 259,0 WS 498455480 + 8 <- (259,5) 263096
>   raft-benchmark-37801   [003] .....  9904.831144: block_bio_queue: 259,0 WS 498455480 + 8 [raft-benchmark]
>   raft-benchmark-37801   [003] .....  9904.831149: block_getrq: 259,0 WS 498455480 + 8 [raft-benchmark]

Here we can see the indeed the write was submitted without the cache flush.
However we can also see that the write was going into unwritten extent
so...

>   raft-benchmark-37801   [003] .....  9904.831149: block_plug: [raft-benchmark]
>   raft-benchmark-37801   [003] .....  9904.831153: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=4, cmdid=25169, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498455480, len=7, ctrl=0x0, dsmgmt=0, reftag=0)
>   raft-benchmark-37801   [003] .....  9904.831159: block_rq_issue: 259,0 WS 4096 () 498455480 + 8 [raft-benchmark]
>   raft-benchmark-37801   [003] d.h..  9904.831173: nvme_sq: nvme0: disk=nvme0n1, qid=4, head=783, tail=783
>   raft-benchmark-37801   [003] d.h..  9904.831177: nvme_complete_rq: nvme0: disk=nvme0n1, qid=4, cmdid=25169, res=0x0, retries=0, flags=0x0, status=0x0
>   raft-benchmark-37801   [003] d.h..  9904.831178: block_rq_complete: 259,0 WS () 498455480 + 8 [0]
>      kworker/3:1-30279   [003] .....  9904.831193: ext4_journal_start_inode: dev 259,5 blocks 8, rsv_blocks 0, revoke_creds 8, type 3, ino 12, caller ext4_convert_unwritten_extents+0xb4/0x260 [ext4]

... after io completed here, we need to convert unwritten extent into a
written one.

>      kworker/3:1-30279   [003] .....  9904.831193: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
>      kworker/3:1-30279   [003] .....  9904.831194: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 U
>      kworker/3:1-30279   [003] .....  9904.831194: ext4_ext_map_blocks_enter: dev 259,5 ino 12 lblk 0 len 1 flags CREATE|UNWRIT|CONVERT
>      kworker/3:1-30279   [003] .....  9904.831195: ext4_es_cache_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
>      kworker/3:1-30279   [003] .....  9904.831195: ext4_ext_show_extent: dev 259,5 ino 12 lblk 0 pblk 32887 len 1
>      kworker/3:1-30279   [003] .....  9904.831196: ext4_ext_handle_unwritten_extents: dev 259,5 ino 12 m_lblk 0 m_pblk 32887 m_len 1 flags CREATE|UNWRIT|CONVERT|METADATA_NOFAIL allocated 1 newblock 32887
>      kworker/3:1-30279   [003] .....  9904.831196: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_ext_map_blocks+0xeee/0x1980 [ext4]
>      kworker/3:1-30279   [003] .....  9904.831197: block_touch_buffer: 259,5 sector=135 size=4096
>      kworker/3:1-30279   [003] .....  9904.831198: block_dirty_buffer: 259,5 sector=135 size=4096
>      kworker/3:1-30279   [003] .....  9904.831199: ext4_ext_map_blocks_exit: dev 259,5 ino 12 flags CREATE|UNWRIT|CONVERT lblk 0 pblk 32887 len 1 mflags M ret 1
>      kworker/3:1-30279   [003] .....  9904.831199: ext4_es_insert_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status W
>      kworker/3:1-30279   [003] .....  9904.831200: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_convert_unwritten_extents+0x1e2/0x260 [ext4]
>      kworker/3:1-30279   [003] .....  9904.831200: block_touch_buffer: 259,5 sector=135 size=4096
>      kworker/3:1-30279   [003] .....  9904.831201: block_dirty_buffer: 259,5 sector=135 size=4096

The conversion to written extents happened here.

>      kworker/3:1-30279   [003] .....  9904.831202: ext4_sync_file_enter: dev 259,5 ino 12 parent 2 datasync 1 
>      kworker/3:1-30279   [003] .....  9904.831203: ext4_sync_file_exit: dev 259,5 ino 12 ret 0

And here we've called fdatasync() for the inode. Now this should have
submitted a cache flush through blkdev_issue_flush() but that doesn't seem
to happen. Indeed checking the code in 6.4 the problem is that inode is
dirtied only through ext4_mark_inode_dirty() which does not alter
inode->i_state and thus neither the inode buffer is properly flushed to the
disk in this case as it should (as it contains the converted extent) nor do
we issue the cache flush. As part of commit 5b5b4ff8f92da ("ext4: Use
generic_buffers_fsync_noflush() implementation") in 6.5 we accidentally
fixed the second problem AFAICT but the first problem with not flushing inode
buffer properly is still there... I'll have to think how to fix that
properly.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: direct I/O: ext4 seems to not honor RWF_DSYNC when journal is disabled
  2024-01-08 21:31   ` Jan Kara
@ 2024-01-09  6:05     ` Dave Chinner
  2024-01-09 13:59       ` Jan Kara
  2024-01-09 15:46     ` Free Ekanayaka
  1 sibling, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2024-01-09  6:05 UTC (permalink / raw)
  To: Jan Kara; +Cc: Free Ekanayaka, linux-ext4

On Mon, Jan 08, 2024 at 10:31:12PM +0100, Jan Kara wrote:
> Hello,
> 
> I've found this when going through some old email. Were you able to debug
> this?
> 
> On Wed 06-09-23 21:15:01, Free Ekanayaka wrote:
> > I'm using Linux 6.4.0 from Debian/testing (but tried this with 6.5.1
> > too).
> > 
> > I've created an ext4 filesystem with journalling disabled on an NVMe
> > drive:
> > 
> > mkfs.ext4 -O ^has_journal -F /dev/nvme0n1p6
> > 
> > I have a program that creates and open a new file with O_DIRECT, and
> > sets its size to 8M with posix_fallocate(), something like:
> > 
> > fd = open("/dir/file", O_CREAT | O_WRONLY | O_DIRECT);
> > posix_fallocate(fd, 0, 8 * 1024 * 1024);
> > fsync(fd);
> > dirfd = open("/dir", O_RDONLY | O_DIRECTORY);
> > fsync(dirfd);
> > 
> > and then it uses io_uring to perform a single write of 4096 bytes at the
> > beginning of the file, passing RWF_DSYNC to the submitted
> > IORING_OP_WRITE_FIXED entry,
> > 
> > I would expect the kernel to tell the NVMe device to actually flush the
> > write, not only buffer it. However I measured the end-to-end latency of
> > the io_uring operation and it was very low, as if the write was only
> > buffered by the NVMe device, but not flushed.
> 
> Yes, the kernel should issue device cache flush or mark the IO as FUA.
> 
> > This suspicion seems to be confirmed by tracing the write nvme command
> > sent to the device:
> > 
> > raft-benchmark-37801   [003] .....  9904.831153: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=4, cmdid=25169, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498455480, len=7, ctrl=0x0, dsmgmt=0, reftag=0)
> > 
> > notice the "ctrl=0x0" in there.
> 
> Not quite sure about the NVME here but ...
> 
> > == ext4 ==
> > 
> >   raft-benchmark-37801   [003] .....  9904.830974: io_uring_submit_req: ring 0000000011cab2e4, req 00000000c7a7d835, user_data 0x0, opcode WRITE_FIXED, flags 0x1, sq_thread 0
> >   raft-benchmark-37801   [003] .....  9904.830982: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
> >   raft-benchmark-37801   [003] .....  9904.830983: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 U
> >   raft-benchmark-37801   [003] .....  9904.830985: ext4_journal_start_inode: dev 259,5 blocks 2, rsv_blocks 0, revoke_creds 8, type 1, ino 12, caller ext4_dirty_inode+0x38/0x80 [ext4]
> >   raft-benchmark-37801   [003] .....  9904.830987: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_dirty_inode+0x5b/0x80 [ext4]
> >   raft-benchmark-37801   [003] .....  9904.830989: block_touch_buffer: 259,5 sector=135 size=4096
> >   raft-benchmark-37801   [003] .....  9904.830993: block_dirty_buffer: 259,5 sector=135 size=4096
> >   raft-benchmark-37801   [003] .....  9904.831121: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
> >   raft-benchmark-37801   [003] .....  9904.831122: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 UR
> >   raft-benchmark-37801   [003] .....  9904.831123: ext4_journal_start_inode: dev 259,5 blocks 8, rsv_blocks 0, revoke_creds 8, type 3, ino 12, caller ext4_iomap_begin+0x1c2/0x2f0 [ext4]
> >   raft-benchmark-37801   [003] .....  9904.831124: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
> >   raft-benchmark-37801   [003] .....  9904.831124: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 UR
> >   raft-benchmark-37801   [003] .....  9904.831125: ext4_ext_map_blocks_enter: dev 259,5 ino 12 lblk 0 len 1 flags CREATE|UNWRIT|PRE_IO
> >   raft-benchmark-37801   [003] .....  9904.831126: ext4_es_cache_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
> >   raft-benchmark-37801   [003] .....  9904.831127: ext4_ext_show_extent: dev 259,5 ino 12 lblk 0 pblk 32887 len 1
> >   raft-benchmark-37801   [003] .....  9904.831128: ext4_ext_handle_unwritten_extents: dev 259,5 ino 12 m_lblk 0 m_pblk 32887 m_len 1 flags CREATE|UNWRIT|PRE_IO|METADATA_NOFAIL allocated 1 newblock 32887
> >   raft-benchmark-37801   [003] .....  9904.831129: ext4_es_cache_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
> >   raft-benchmark-37801   [003] .....  9904.831130: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_split_extent+0xcd/0x190 [ext4]
> >   raft-benchmark-37801   [003] .....  9904.831131: block_touch_buffer: 259,5 sector=135 size=4096
> >   raft-benchmark-37801   [003] .....  9904.831133: block_dirty_buffer: 259,5 sector=135 size=4096
> >   raft-benchmark-37801   [003] .....  9904.831134: ext4_ext_map_blocks_exit: dev 259,5 ino 12 flags CREATE|UNWRIT|PRE_IO lblk 0 pblk 32887 len 1 mflags NMU ret 1
> >   raft-benchmark-37801   [003] .....  9904.831135: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
> >   raft-benchmark-37801   [003] .....  9904.831135: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 UR
> >   raft-benchmark-37801   [003] .....  9904.831136: ext4_es_insert_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
> >   raft-benchmark-37801   [003] .....  9904.831143: block_bio_remap: 259,0 WS 498455480 + 8 <- (259,5) 263096
> >   raft-benchmark-37801   [003] .....  9904.831144: block_bio_queue: 259,0 WS 498455480 + 8 [raft-benchmark]
> >   raft-benchmark-37801   [003] .....  9904.831149: block_getrq: 259,0 WS 498455480 + 8 [raft-benchmark]
> 
> Here we can see the indeed the write was submitted without the cache flush.
> However we can also see that the write was going into unwritten extent
> so...
> 
> >   raft-benchmark-37801   [003] .....  9904.831149: block_plug: [raft-benchmark]
> >   raft-benchmark-37801   [003] .....  9904.831153: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=4, cmdid=25169, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498455480, len=7, ctrl=0x0, dsmgmt=0, reftag=0)
> >   raft-benchmark-37801   [003] .....  9904.831159: block_rq_issue: 259,0 WS 4096 () 498455480 + 8 [raft-benchmark]
> >   raft-benchmark-37801   [003] d.h..  9904.831173: nvme_sq: nvme0: disk=nvme0n1, qid=4, head=783, tail=783
> >   raft-benchmark-37801   [003] d.h..  9904.831177: nvme_complete_rq: nvme0: disk=nvme0n1, qid=4, cmdid=25169, res=0x0, retries=0, flags=0x0, status=0x0
> >   raft-benchmark-37801   [003] d.h..  9904.831178: block_rq_complete: 259,0 WS () 498455480 + 8 [0]
> >      kworker/3:1-30279   [003] .....  9904.831193: ext4_journal_start_inode: dev 259,5 blocks 8, rsv_blocks 0, revoke_creds 8, type 3, ino 12, caller ext4_convert_unwritten_extents+0xb4/0x260 [ext4]
> 
> ... after io completed here, we need to convert unwritten extent into a
> written one.
> 
> >      kworker/3:1-30279   [003] .....  9904.831193: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
> >      kworker/3:1-30279   [003] .....  9904.831194: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 U
> >      kworker/3:1-30279   [003] .....  9904.831194: ext4_ext_map_blocks_enter: dev 259,5 ino 12 lblk 0 len 1 flags CREATE|UNWRIT|CONVERT
> >      kworker/3:1-30279   [003] .....  9904.831195: ext4_es_cache_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
> >      kworker/3:1-30279   [003] .....  9904.831195: ext4_ext_show_extent: dev 259,5 ino 12 lblk 0 pblk 32887 len 1
> >      kworker/3:1-30279   [003] .....  9904.831196: ext4_ext_handle_unwritten_extents: dev 259,5 ino 12 m_lblk 0 m_pblk 32887 m_len 1 flags CREATE|UNWRIT|CONVERT|METADATA_NOFAIL allocated 1 newblock 32887
> >      kworker/3:1-30279   [003] .....  9904.831196: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_ext_map_blocks+0xeee/0x1980 [ext4]
> >      kworker/3:1-30279   [003] .....  9904.831197: block_touch_buffer: 259,5 sector=135 size=4096
> >      kworker/3:1-30279   [003] .....  9904.831198: block_dirty_buffer: 259,5 sector=135 size=4096
> >      kworker/3:1-30279   [003] .....  9904.831199: ext4_ext_map_blocks_exit: dev 259,5 ino 12 flags CREATE|UNWRIT|CONVERT lblk 0 pblk 32887 len 1 mflags M ret 1
> >      kworker/3:1-30279   [003] .....  9904.831199: ext4_es_insert_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status W
> >      kworker/3:1-30279   [003] .....  9904.831200: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_convert_unwritten_extents+0x1e2/0x260 [ext4]
> >      kworker/3:1-30279   [003] .....  9904.831200: block_touch_buffer: 259,5 sector=135 size=4096
> >      kworker/3:1-30279   [003] .....  9904.831201: block_dirty_buffer: 259,5 sector=135 size=4096
> 
> The conversion to written extents happened here.

Which dirties metadata buffers....

> >      kworker/3:1-30279   [003] .....  9904.831202: ext4_sync_file_enter: dev 259,5 ino 12 parent 2 datasync 1 
> >      kworker/3:1-30279   [003] .....  9904.831203: ext4_sync_file_exit: dev 259,5 ino 12 ret 0
> 
> And here we've called fdatasync() for the inode. Now this should have
> submitted a cache flush through blkdev_issue_flush() but that doesn't seem
> to happen.

I suspect correct crash recovery behaviour here requires
multiple cache flushes to ensure the correct ordering or data vs
metadata updates. i.e:

	....
	data write completes
	fdatasync()
	  cache flush to ensure data is on disk
	  if (dirty metadata) {
		issue metadata write(s) for extent records and inode
		....
		metadata write(s) complete
		cache flush to ensure metadata is on disk
	  }

If we don't flush the cache between the data write and the metadata
write(s) that marks the extent as written, we could have a state
after a power fail where the metadata writes hit the disk
before the data write and after the system comes back up that file
now it exposes stale data to the user.

This is a problem we have to deal with in XFS because we have
external logs and multiple data devices - we can't rely on the cache
flush issued before a log write to flush the data write to stable
storage if the data is on a different device to the metadata. In
that situation, we have to explicitly flush the data device cache
before we flush the log and persist the metadata changes.

Indeed, I suspect that ext4 needs the same device cache flushing
logic in the fsync path as XFS because it, like XFS, can have an
external log device and so multiple device cache flushes are needed
to ensure correct persistent ordering of data vs metadata updates...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: direct I/O: ext4 seems to not honor RWF_DSYNC when journal is disabled
  2024-01-09  6:05     ` Dave Chinner
@ 2024-01-09 13:59       ` Jan Kara
  2024-01-09 15:57         ` Free Ekanayaka
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2024-01-09 13:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, Free Ekanayaka, linux-ext4

On Tue 09-01-24 17:05:00, Dave Chinner wrote:
> On Mon, Jan 08, 2024 at 10:31:12PM +0100, Jan Kara wrote:
> > Hello,
> > 
> > I've found this when going through some old email. Were you able to debug
> > this?
> > 
> > On Wed 06-09-23 21:15:01, Free Ekanayaka wrote:
> > > I'm using Linux 6.4.0 from Debian/testing (but tried this with 6.5.1
> > > too).
> > > 
> > > I've created an ext4 filesystem with journalling disabled on an NVMe
> > > drive:
> > > 
> > > mkfs.ext4 -O ^has_journal -F /dev/nvme0n1p6
> > > 
> > > I have a program that creates and open a new file with O_DIRECT, and
> > > sets its size to 8M with posix_fallocate(), something like:
> > > 
> > > fd = open("/dir/file", O_CREAT | O_WRONLY | O_DIRECT);
> > > posix_fallocate(fd, 0, 8 * 1024 * 1024);
> > > fsync(fd);
> > > dirfd = open("/dir", O_RDONLY | O_DIRECTORY);
> > > fsync(dirfd);
> > > 
> > > and then it uses io_uring to perform a single write of 4096 bytes at the
> > > beginning of the file, passing RWF_DSYNC to the submitted
> > > IORING_OP_WRITE_FIXED entry,
> > > 
> > > I would expect the kernel to tell the NVMe device to actually flush the
> > > write, not only buffer it. However I measured the end-to-end latency of
> > > the io_uring operation and it was very low, as if the write was only
> > > buffered by the NVMe device, but not flushed.
> > 
> > Yes, the kernel should issue device cache flush or mark the IO as FUA.

<snip>

> > >   raft-benchmark-37801   [003] .....  9904.831134: ext4_ext_map_blocks_exit: dev 259,5 ino 12 flags CREATE|UNWRIT|PRE_IO lblk 0 pblk 32887 len 1 mflags NMU ret 1
> > >   raft-benchmark-37801   [003] .....  9904.831135: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
> > >   raft-benchmark-37801   [003] .....  9904.831135: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 UR
> > >   raft-benchmark-37801   [003] .....  9904.831136: ext4_es_insert_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
> > >   raft-benchmark-37801   [003] .....  9904.831143: block_bio_remap: 259,0 WS 498455480 + 8 <- (259,5) 263096
> > >   raft-benchmark-37801   [003] .....  9904.831144: block_bio_queue: 259,0 WS 498455480 + 8 [raft-benchmark]
> > >   raft-benchmark-37801   [003] .....  9904.831149: block_getrq: 259,0 WS 498455480 + 8 [raft-benchmark]
> > 
> > Here we can see the indeed the write was submitted without the cache flush.
> > However we can also see that the write was going into unwritten extent
> > so...
> > 
> > >   raft-benchmark-37801   [003] .....  9904.831149: block_plug: [raft-benchmark]
> > >   raft-benchmark-37801   [003] .....  9904.831153: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=4, cmdid=25169, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498455480, len=7, ctrl=0x0, dsmgmt=0, reftag=0)
> > >   raft-benchmark-37801   [003] .....  9904.831159: block_rq_issue: 259,0 WS 4096 () 498455480 + 8 [raft-benchmark]
> > >   raft-benchmark-37801   [003] d.h..  9904.831173: nvme_sq: nvme0: disk=nvme0n1, qid=4, head=783, tail=783
> > >   raft-benchmark-37801   [003] d.h..  9904.831177: nvme_complete_rq: nvme0: disk=nvme0n1, qid=4, cmdid=25169, res=0x0, retries=0, flags=0x0, status=0x0
> > >   raft-benchmark-37801   [003] d.h..  9904.831178: block_rq_complete: 259,0 WS () 498455480 + 8 [0]
> > >      kworker/3:1-30279   [003] .....  9904.831193: ext4_journal_start_inode: dev 259,5 blocks 8, rsv_blocks 0, revoke_creds 8, type 3, ino 12, caller ext4_convert_unwritten_extents+0xb4/0x260 [ext4]
> > 
> > ... after io completed here, we need to convert unwritten extent into a
> > written one.
> > 
> > >      kworker/3:1-30279   [003] .....  9904.831193: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
> > >      kworker/3:1-30279   [003] .....  9904.831194: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 U
> > >      kworker/3:1-30279   [003] .....  9904.831194: ext4_ext_map_blocks_enter: dev 259,5 ino 12 lblk 0 len 1 flags CREATE|UNWRIT|CONVERT
> > >      kworker/3:1-30279   [003] .....  9904.831195: ext4_es_cache_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
> > >      kworker/3:1-30279   [003] .....  9904.831195: ext4_ext_show_extent: dev 259,5 ino 12 lblk 0 pblk 32887 len 1
> > >      kworker/3:1-30279   [003] .....  9904.831196: ext4_ext_handle_unwritten_extents: dev 259,5 ino 12 m_lblk 0 m_pblk 32887 m_len 1 flags CREATE|UNWRIT|CONVERT|METADATA_NOFAIL allocated 1 newblock 32887
> > >      kworker/3:1-30279   [003] .....  9904.831196: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_ext_map_blocks+0xeee/0x1980 [ext4]
> > >      kworker/3:1-30279   [003] .....  9904.831197: block_touch_buffer: 259,5 sector=135 size=4096
> > >      kworker/3:1-30279   [003] .....  9904.831198: block_dirty_buffer: 259,5 sector=135 size=4096
> > >      kworker/3:1-30279   [003] .....  9904.831199: ext4_ext_map_blocks_exit: dev 259,5 ino 12 flags CREATE|UNWRIT|CONVERT lblk 0 pblk 32887 len 1 mflags M ret 1
> > >      kworker/3:1-30279   [003] .....  9904.831199: ext4_es_insert_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status W
> > >      kworker/3:1-30279   [003] .....  9904.831200: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_convert_unwritten_extents+0x1e2/0x260 [ext4]
> > >      kworker/3:1-30279   [003] .....  9904.831200: block_touch_buffer: 259,5 sector=135 size=4096
> > >      kworker/3:1-30279   [003] .....  9904.831201: block_dirty_buffer: 259,5 sector=135 size=4096
> > 
> > The conversion to written extents happened here.
> 
> Which dirties metadata buffers....

Yes.

> > >      kworker/3:1-30279   [003] .....  9904.831202: ext4_sync_file_enter: dev 259,5 ino 12 parent 2 datasync 1 
> > >      kworker/3:1-30279   [003] .....  9904.831203: ext4_sync_file_exit: dev 259,5 ino 12 ret 0
> > 
> > And here we've called fdatasync() for the inode. Now this should have
> > submitted a cache flush through blkdev_issue_flush() but that doesn't seem
> > to happen.
> 
> I suspect correct crash recovery behaviour here requires
> multiple cache flushes to ensure the correct ordering or data vs
> metadata updates. i.e:
> 
> 	....
> 	data write completes
> 	fdatasync()
> 	  cache flush to ensure data is on disk
> 	  if (dirty metadata) {
> 		issue metadata write(s) for extent records and inode
> 		....
> 		metadata write(s) complete
> 		cache flush to ensure metadata is on disk
> 	  }
> 
> If we don't flush the cache between the data write and the metadata
> write(s) that marks the extent as written, we could have a state
> after a power fail where the metadata writes hit the disk
> before the data write and after the system comes back up that file
> now it exposes stale data to the user.

So when we are journalling, we end up doing this (we flush data disk before
writing and flushing the transaction commit block in jbd2). When we are not
doing journalling (which is the case here), our crash consistency
guarantees are pretty weak. We want to guarantee that if fsync(2)
successfully completed on the file before the crash, user should see the
data there. But not much more - i.e., stale data exposure in case of crash
is fully within what sysadmin should expect from a filesystem without a
journal. After all even if we improved fsync(2) as you suggest, we'd still
have normal page writeback where we'd have to separate data & metadata
writes with cache flushes and I don't think the performace overhead is
something people would be willing to pay.

So yes, nojournal mode is unsafe in case of crash. It is there for people
not caring about the filesystem after the crash, single user filesystems
doing data verification in userspace and similar special usecases. Still, I
think we want at least minimum fsync(2) guarantees if nothing else for
backwards compatibility with ext2.

Thanks for the comments anyway :)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: direct I/O: ext4 seems to not honor RWF_DSYNC when journal is disabled
  2024-01-08 21:31   ` Jan Kara
  2024-01-09  6:05     ` Dave Chinner
@ 2024-01-09 15:46     ` Free Ekanayaka
  2024-01-10 10:01       ` Jan Kara
  1 sibling, 1 reply; 9+ messages in thread
From: Free Ekanayaka @ 2024-01-09 15:46 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

Hello Jan,

thanks for getting back to me on this one.

Jan Kara <jack@suse.cz> writes:

> Hello,
>
> I've found this when going through some old email. Were you able to debug
> this?

If you mean looking at the kernel code to understand what was going on,
no I didn't. Wanted to ask you guys first.

>
> On Wed 06-09-23 21:15:01, Free Ekanayaka wrote:
>> I'm using Linux 6.4.0 from Debian/testing (but tried this with 6.5.1
>> too).
>> 
>> I've created an ext4 filesystem with journalling disabled on an NVMe
>> drive:
>> 
>> mkfs.ext4 -O ^has_journal -F /dev/nvme0n1p6
>> 
>> I have a program that creates and open a new file with O_DIRECT, and
>> sets its size to 8M with posix_fallocate(), something like:
>> 
>> fd = open("/dir/file", O_CREAT | O_WRONLY | O_DIRECT);
>> posix_fallocate(fd, 0, 8 * 1024 * 1024);
>> fsync(fd);
>> dirfd = open("/dir", O_RDONLY | O_DIRECTORY);
>> fsync(dirfd);
>> 
>> and then it uses io_uring to perform a single write of 4096 bytes at the
>> beginning of the file, passing RWF_DSYNC to the submitted
>> IORING_OP_WRITE_FIXED entry,
>> 
>> I would expect the kernel to tell the NVMe device to actually flush the
>> write, not only buffer it. However I measured the end-to-end latency of
>> the io_uring operation and it was very low, as if the write was only
>> buffered by the NVMe device, but not flushed.
>
> Yes, the kernel should issue device cache flush or mark the IO as FUA.

Good, thanks for confirming.

>
>> This suspicion seems to be confirmed by tracing the write nvme command
>> sent to the device:
>> 
>> raft-benchmark-37801   [003] .....  9904.831153: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=4, cmdid=25169, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498455480, len=7, ctrl=0x0, dsmgmt=0, reftag=0)
>> 
>> notice the "ctrl=0x0" in there.
>
> Not quite sure about the NVME here but ...
>
>> == ext4 ==
>> 
>>   raft-benchmark-37801   [003] .....  9904.830974: io_uring_submit_req: ring 0000000011cab2e4, req 00000000c7a7d835, user_data 0x0, opcode WRITE_FIXED, flags 0x1, sq_thread 0
>>   raft-benchmark-37801   [003] .....  9904.830982: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
>>   raft-benchmark-37801   [003] .....  9904.830983: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 U
>>   raft-benchmark-37801   [003] .....  9904.830985: ext4_journal_start_inode: dev 259,5 blocks 2, rsv_blocks 0, revoke_creds 8, type 1, ino 12, caller ext4_dirty_inode+0x38/0x80 [ext4]
>>   raft-benchmark-37801   [003] .....  9904.830987: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_dirty_inode+0x5b/0x80 [ext4]
>>   raft-benchmark-37801   [003] .....  9904.830989: block_touch_buffer: 259,5 sector=135 size=4096
>>   raft-benchmark-37801   [003] .....  9904.830993: block_dirty_buffer: 259,5 sector=135 size=4096
>>   raft-benchmark-37801   [003] .....  9904.831121: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
>>   raft-benchmark-37801   [003] .....  9904.831122: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 UR
>>   raft-benchmark-37801   [003] .....  9904.831123: ext4_journal_start_inode: dev 259,5 blocks 8, rsv_blocks 0, revoke_creds 8, type 3, ino 12, caller ext4_iomap_begin+0x1c2/0x2f0 [ext4]
>>   raft-benchmark-37801   [003] .....  9904.831124: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
>>   raft-benchmark-37801   [003] .....  9904.831124: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 UR
>>   raft-benchmark-37801   [003] .....  9904.831125: ext4_ext_map_blocks_enter: dev 259,5 ino 12 lblk 0 len 1 flags CREATE|UNWRIT|PRE_IO
>>   raft-benchmark-37801   [003] .....  9904.831126: ext4_es_cache_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
>>   raft-benchmark-37801   [003] .....  9904.831127: ext4_ext_show_extent: dev 259,5 ino 12 lblk 0 pblk 32887 len 1
>>   raft-benchmark-37801   [003] .....  9904.831128: ext4_ext_handle_unwritten_extents: dev 259,5 ino 12 m_lblk 0 m_pblk 32887 m_len 1 flags CREATE|UNWRIT|PRE_IO|METADATA_NOFAIL allocated 1 newblock 32887
>>   raft-benchmark-37801   [003] .....  9904.831129: ext4_es_cache_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
>>   raft-benchmark-37801   [003] .....  9904.831130: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_split_extent+0xcd/0x190 [ext4]
>>   raft-benchmark-37801   [003] .....  9904.831131: block_touch_buffer: 259,5 sector=135 size=4096
>>   raft-benchmark-37801   [003] .....  9904.831133: block_dirty_buffer: 259,5 sector=135 size=4096
>>   raft-benchmark-37801   [003] .....  9904.831134: ext4_ext_map_blocks_exit: dev 259,5 ino 12 flags CREATE|UNWRIT|PRE_IO lblk 0 pblk 32887 len 1 mflags NMU ret 1
>>   raft-benchmark-37801   [003] .....  9904.831135: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
>>   raft-benchmark-37801   [003] .....  9904.831135: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 UR
>>   raft-benchmark-37801   [003] .....  9904.831136: ext4_es_insert_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
>>   raft-benchmark-37801   [003] .....  9904.831143: block_bio_remap: 259,0 WS 498455480 + 8 <- (259,5) 263096
>>   raft-benchmark-37801   [003] .....  9904.831144: block_bio_queue: 259,0 WS 498455480 + 8 [raft-benchmark]
>>   raft-benchmark-37801   [003] .....  9904.831149: block_getrq: 259,0 WS 498455480 + 8 [raft-benchmark]
>
> Here we can see the indeed the write was submitted without the cache flush.
> However we can also see that the write was going into unwritten extent
> so...
>
>>   raft-benchmark-37801   [003] .....  9904.831149: block_plug: [raft-benchmark]
>>   raft-benchmark-37801   [003] .....  9904.831153: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=4, cmdid=25169, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498455480, len=7, ctrl=0x0, dsmgmt=0, reftag=0)
>>   raft-benchmark-37801   [003] .....  9904.831159: block_rq_issue: 259,0 WS 4096 () 498455480 + 8 [raft-benchmark]
>>   raft-benchmark-37801   [003] d.h..  9904.831173: nvme_sq: nvme0: disk=nvme0n1, qid=4, head=783, tail=783
>>   raft-benchmark-37801   [003] d.h..  9904.831177: nvme_complete_rq: nvme0: disk=nvme0n1, qid=4, cmdid=25169, res=0x0, retries=0, flags=0x0, status=0x0
>>   raft-benchmark-37801   [003] d.h..  9904.831178: block_rq_complete: 259,0 WS () 498455480 + 8 [0]
>>      kworker/3:1-30279   [003] .....  9904.831193: ext4_journal_start_inode: dev 259,5 blocks 8, rsv_blocks 0, revoke_creds 8, type 3, ino 12, caller ext4_convert_unwritten_extents+0xb4/0x260 [ext4]
>
> ... after io completed here, we need to convert unwritten extent into a
> written one.
>
>>      kworker/3:1-30279   [003] .....  9904.831193: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
>>      kworker/3:1-30279   [003] .....  9904.831194: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 U
>>      kworker/3:1-30279   [003] .....  9904.831194: ext4_ext_map_blocks_enter: dev 259,5 ino 12 lblk 0 len 1 flags CREATE|UNWRIT|CONVERT
>>      kworker/3:1-30279   [003] .....  9904.831195: ext4_es_cache_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
>>      kworker/3:1-30279   [003] .....  9904.831195: ext4_ext_show_extent: dev 259,5 ino 12 lblk 0 pblk 32887 len 1
>>      kworker/3:1-30279   [003] .....  9904.831196: ext4_ext_handle_unwritten_extents: dev 259,5 ino 12 m_lblk 0 m_pblk 32887 m_len 1 flags CREATE|UNWRIT|CONVERT|METADATA_NOFAIL allocated 1 newblock 32887
>>      kworker/3:1-30279   [003] .....  9904.831196: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_ext_map_blocks+0xeee/0x1980 [ext4]
>>      kworker/3:1-30279   [003] .....  9904.831197: block_touch_buffer: 259,5 sector=135 size=4096
>>      kworker/3:1-30279   [003] .....  9904.831198: block_dirty_buffer: 259,5 sector=135 size=4096
>>      kworker/3:1-30279   [003] .....  9904.831199: ext4_ext_map_blocks_exit: dev 259,5 ino 12 flags CREATE|UNWRIT|CONVERT lblk 0 pblk 32887 len 1 mflags M ret 1
>>      kworker/3:1-30279   [003] .....  9904.831199: ext4_es_insert_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status W
>>      kworker/3:1-30279   [003] .....  9904.831200: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_convert_unwritten_extents+0x1e2/0x260 [ext4]
>>      kworker/3:1-30279   [003] .....  9904.831200: block_touch_buffer: 259,5 sector=135 size=4096
>>      kworker/3:1-30279   [003] .....  9904.831201: block_dirty_buffer: 259,5 sector=135 size=4096
>
> The conversion to written extents happened here.
>
>>      kworker/3:1-30279   [003] .....  9904.831202: ext4_sync_file_enter: dev 259,5 ino 12 parent 2 datasync 1 
>>      kworker/3:1-30279   [003] .....  9904.831203: ext4_sync_file_exit: dev 259,5 ino 12 ret 0
>
> And here we've called fdatasync() for the inode. Now this should have
> submitted a cache flush through blkdev_issue_flush() but that doesn't seem
> to happen.

Right.

> Indeed checking the code in 6.4 the problem is that inode is
> dirtied only through ext4_mark_inode_dirty() which does not alter
> inode->i_state and thus neither the inode buffer is properly flushed to the
> disk in this case as it should (as it contains the converted extent) nor do
> we issue the cache flush. As part of commit 5b5b4ff8f92da ("ext4: Use
> generic_buffers_fsync_noflush() implementation") in 6.5 we accidentally
> fixed the second problem AFAICT but the first problem with not flushing inode
> buffer properly is still there... I'll have to think how to fix that
> properly.

I've re-ran the same code with kernel 6.5.0 and indeed the behavior has
changed and an actual NVMe flush command seems to be issued (the flags
passed to nvme_setup_cmd match the ones that I see in the case I write
to the raw block device). So that part seems fixed. I thought I had
tried with 6.5.1 when I had posted this issue, and that it was present
there too, but maybe I was mistaken.

I'm not entirely sure what you mean about flushing the inode buffer
properly. As far as I see, the current behavior I see matches what I'd
expect.

For reference I'm attaching below the trace of the same user code, this
time run on kernel 6.5.0, which is the one currently shipping with
Debian/testing. Note that there are quite a bit less trace lines emitted
by the ext4 sub-system, not sure if it's related/relevant.

  raft-benchmark-35708   [000] .....  9203.271114: io_uring_submit_req: ring 000000007ee609d1, req 00000000221c7d2e, user_data 0x0, opcode WRITE_FIXED, flags 0x1, sq_thread 0
  raft-benchmark-35708   [000] .....  9203.271117: ext4_es_lookup_extent_enter: dev 259,5 ino 16 lblk 0
  raft-benchmark-35708   [000] .....  9203.271117: ext4_es_lookup_extent_exit: dev 259,5 ino 16 found 1 [0/16) 32896 WR
  raft-benchmark-35708   [000] .....  9203.271118: ext4_journal_start_inode: dev 259,5 blocks 2, rsv_blocks 0, revoke_creds 8, type 1, ino 16, caller ext4_dirty_inode+0x38/0x80 [ext4]
  raft-benchmark-35708   [000] .....  9203.271119: ext4_mark_inode_dirty: dev 259,5 ino 16 caller ext4_dirty_inode+0x5b/0x80 [ext4]
  raft-benchmark-35708   [000] .....  9203.271119: block_touch_buffer: 259,5 sector=135 size=4096
  raft-benchmark-35708   [000] .....  9203.271120: block_dirty_buffer: 259,5 sector=135 size=4096
  raft-benchmark-35708   [000] .....  9203.271120: ext4_es_lookup_extent_enter: dev 259,5 ino 16 lblk 0
  raft-benchmark-35708   [000] .....  9203.271121: ext4_es_lookup_extent_exit: dev 259,5 ino 16 found 1 [0/16) 32896 WR
  raft-benchmark-35708   [000] .....  9203.271122: block_bio_remap: 259,0 WFS 498455552 + 8 <- (259,5) 263168
  raft-benchmark-35708   [000] .....  9203.271122: block_bio_queue: 259,0 WFS 498455552 + 8 [raft-benchmark]
  raft-benchmark-35708   [000] .....  9203.271123: block_getrq: 259,0 WFS 498455552 + 8 [raft-benchmark]
  raft-benchmark-35708   [000] .....  9203.271123: block_io_start: 259,0 WFS 4096 () 498455552 + 8 [raft-benchmark]
  raft-benchmark-35708   [000] .....  9203.271124: block_plug: [raft-benchmark]
  raft-benchmark-35708   [000] .....  9203.271124: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=1, cmdid=53265, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498455552, len=7, ctrl=0x4000, dsmgmt=0, reftag=0)
  raft-benchmark-35708   [000] .....  9203.271126: block_rq_issue: 259,0 WFS 4096 () 498455552 + 8 [raft-benchmark]
  raft-benchmark-35708   [000] d.h..  9203.271382: nvme_sq: nvme0: disk=nvme0n1, qid=1, head=79, tail=79
  raft-benchmark-35708   [000] d.h..  9203.271384: nvme_complete_rq: nvme0: disk=nvme0n1, qid=1, cmdid=53265, res=0x0, retries=0, flags=0x0, status=0x0
  raft-benchmark-35708   [000] d.h..  9203.271384: block_rq_complete: 259,0 WFS () 498455552 + 8 [0]
  raft-benchmark-35708   [000] dNh..  9203.271386: block_io_done: 259,0 WFS 0 () 498455552 + 0 [raft-benchmark]
  raft-benchmark-35708   [000] ...1.  9203.271391: io_uring_complete: ring 000000007ee609d1, req 00000000221c7d2e, user_data 0x0, result 4096, cflags 0x0 extra1 0 extra2 0 
  raft-benchmark-35708   [000] .....  9203.271391: io_uring_task_work_run: tctx 00000000f15587dc, count 1, loops 1

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: direct I/O: ext4 seems to not honor RWF_DSYNC when journal is disabled
  2024-01-09 13:59       ` Jan Kara
@ 2024-01-09 15:57         ` Free Ekanayaka
  0 siblings, 0 replies; 9+ messages in thread
From: Free Ekanayaka @ 2024-01-09 15:57 UTC (permalink / raw)
  To: Jan Kara, Dave Chinner; +Cc: linux-ext4

Jan Kara <jack@suse.cz> writes:

[...]
>> I suspect correct crash recovery behaviour here requires
>> multiple cache flushes to ensure the correct ordering or data vs
>> metadata updates. i.e:
>> 
>> 	....
>> 	data write completes
>> 	fdatasync()
>> 	  cache flush to ensure data is on disk
>> 	  if (dirty metadata) {
>> 		issue metadata write(s) for extent records and inode
>> 		....
>> 		metadata write(s) complete
>> 		cache flush to ensure metadata is on disk
>> 	  }
>> 
>> If we don't flush the cache between the data write and the metadata
>> write(s) that marks the extent as written, we could have a state
>> after a power fail where the metadata writes hit the disk
>> before the data write and after the system comes back up that file
>> now it exposes stale data to the user.
>
> So when we are journalling, we end up doing this (we flush data disk before
> writing and flushing the transaction commit block in jbd2). When we are not
> doing journalling (which is the case here), our crash consistency
> guarantees are pretty weak. We want to guarantee that if fsync(2)
> successfully completed on the file before the crash, user should see the
> data there. But not much more - i.e., stale data exposure in case of crash
> is fully within what sysadmin should expect from a filesystem without a
> journal.

Right, which is exectly the tradeoff I need. Weaker guarantees for lower
latency.

All I need is that RWF_DSYNC holds up the promise that once I see a
successful io_uring completion entry, than I'm sure that the data has
made it to disk and it would survive a power loss.

> After all even if we improved fsync(2) as you suggest, we'd still
> have normal page writeback where we'd have to separate data & metadata
> writes with cache flushes and I don't think the performace overhead is
> something people would be willing to pay.
>
> So yes, nojournal mode is unsafe in case of crash. It is there for people
> not caring about the filesystem after the crash, single user filesystems
> doing data verification in userspace and similar special usecases. Still, I
> think we want at least minimum fsync(2) guarantees if nothing else for
> backwards compatibility with ext2.

I'm doing data verification in user space indeed. As sad, the file has
been pre-allocated posix_fallocate() and fsync'ed (along with its
dir), so no metadata changes will occur, just the bare write.

FWIW the use case is writing the log for an implementation of the Raft
consensus algorithm. So basically a series of sequential writes.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: direct I/O: ext4 seems to not honor RWF_DSYNC when journal is disabled
  2024-01-09 15:46     ` Free Ekanayaka
@ 2024-01-10 10:01       ` Jan Kara
  2024-01-10 11:19         ` Free Ekanayaka
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2024-01-10 10:01 UTC (permalink / raw)
  To: Free Ekanayaka; +Cc: Jan Kara, linux-ext4

On Tue 09-01-24 15:46:39, Free Ekanayaka wrote:
> > On Wed 06-09-23 21:15:01, Free Ekanayaka wrote:
> >> I'm using Linux 6.4.0 from Debian/testing (but tried this with 6.5.1
> >> too).
> >> 
> >> I've created an ext4 filesystem with journalling disabled on an NVMe
> >> drive:
> >> 
> >> mkfs.ext4 -O ^has_journal -F /dev/nvme0n1p6
> >> 
> >> I have a program that creates and open a new file with O_DIRECT, and
> >> sets its size to 8M with posix_fallocate(), something like:
> >> 
> >> fd = open("/dir/file", O_CREAT | O_WRONLY | O_DIRECT);
> >> posix_fallocate(fd, 0, 8 * 1024 * 1024);
> >> fsync(fd);
> >> dirfd = open("/dir", O_RDONLY | O_DIRECTORY);
> >> fsync(dirfd);
> >> 
> >> and then it uses io_uring to perform a single write of 4096 bytes at the
> >> beginning of the file, passing RWF_DSYNC to the submitted
> >> IORING_OP_WRITE_FIXED entry,

<snip>

> >> == ext4 ==
> >> 
> >>   raft-benchmark-37801   [003] .....  9904.830974: io_uring_submit_req: ring 0000000011cab2e4, req 00000000c7a7d835, user_data 0x0, opcode WRITE_FIXED, flags 0x1, sq_thread 0
> >>   raft-benchmark-37801   [003] .....  9904.830982: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
> >>   raft-benchmark-37801   [003] .....  9904.830983: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 U
> >>   raft-benchmark-37801   [003] .....  9904.830985: ext4_journal_start_inode: dev 259,5 blocks 2, rsv_blocks 0, revoke_creds 8, type 1, ino 12, caller ext4_dirty_inode+0x38/0x80 [ext4]
> >>   raft-benchmark-37801   [003] .....  9904.830987: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_dirty_inode+0x5b/0x80 [ext4]
> >>   raft-benchmark-37801   [003] .....  9904.830989: block_touch_buffer: 259,5 sector=135 size=4096
> >>   raft-benchmark-37801   [003] .....  9904.830993: block_dirty_buffer: 259,5 sector=135 size=4096
> >>   raft-benchmark-37801   [003] .....  9904.831121: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
> >>   raft-benchmark-37801   [003] .....  9904.831122: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 UR
> >>   raft-benchmark-37801   [003] .....  9904.831123: ext4_journal_start_inode: dev 259,5 blocks 8, rsv_blocks 0, revoke_creds 8, type 3, ino 12, caller ext4_iomap_begin+0x1c2/0x2f0 [ext4]
> >>   raft-benchmark-37801   [003] .....  9904.831124: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
> >>   raft-benchmark-37801   [003] .....  9904.831124: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 UR
> >>   raft-benchmark-37801   [003] .....  9904.831125: ext4_ext_map_blocks_enter: dev 259,5 ino 12 lblk 0 len 1 flags CREATE|UNWRIT|PRE_IO
> >>   raft-benchmark-37801   [003] .....  9904.831126: ext4_es_cache_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
> >>   raft-benchmark-37801   [003] .....  9904.831127: ext4_ext_show_extent: dev 259,5 ino 12 lblk 0 pblk 32887 len 1
> >>   raft-benchmark-37801   [003] .....  9904.831128: ext4_ext_handle_unwritten_extents: dev 259,5 ino 12 m_lblk 0 m_pblk 32887 m_len 1 flags CREATE|UNWRIT|PRE_IO|METADATA_NOFAIL allocated 1 newblock 32887
> >>   raft-benchmark-37801   [003] .....  9904.831129: ext4_es_cache_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
> >>   raft-benchmark-37801   [003] .....  9904.831130: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_split_extent+0xcd/0x190 [ext4]
> >>   raft-benchmark-37801   [003] .....  9904.831131: block_touch_buffer: 259,5 sector=135 size=4096
> >>   raft-benchmark-37801   [003] .....  9904.831133: block_dirty_buffer: 259,5 sector=135 size=4096
> >>   raft-benchmark-37801   [003] .....  9904.831134: ext4_ext_map_blocks_exit: dev 259,5 ino 12 flags CREATE|UNWRIT|PRE_IO lblk 0 pblk 32887 len 1 mflags NMU ret 1
> >>   raft-benchmark-37801   [003] .....  9904.831135: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
> >>   raft-benchmark-37801   [003] .....  9904.831135: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 UR
> >>   raft-benchmark-37801   [003] .....  9904.831136: ext4_es_insert_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
> >>   raft-benchmark-37801   [003] .....  9904.831143: block_bio_remap: 259,0 WS 498455480 + 8 <- (259,5) 263096
> >>   raft-benchmark-37801   [003] .....  9904.831144: block_bio_queue: 259,0 WS 498455480 + 8 [raft-benchmark]
> >>   raft-benchmark-37801   [003] .....  9904.831149: block_getrq: 259,0 WS 498455480 + 8 [raft-benchmark]
> >
> > Here we can see the indeed the write was submitted without the cache flush.
> > However we can also see that the write was going into unwritten extent
> > so...
> >
> >>   raft-benchmark-37801   [003] .....  9904.831149: block_plug: [raft-benchmark]
> >>   raft-benchmark-37801   [003] .....  9904.831153: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=4, cmdid=25169, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498455480, len=7, ctrl=0x0, dsmgmt=0, reftag=0)
> >>   raft-benchmark-37801   [003] .....  9904.831159: block_rq_issue: 259,0 WS 4096 () 498455480 + 8 [raft-benchmark]
> >>   raft-benchmark-37801   [003] d.h..  9904.831173: nvme_sq: nvme0: disk=nvme0n1, qid=4, head=783, tail=783
> >>   raft-benchmark-37801   [003] d.h..  9904.831177: nvme_complete_rq: nvme0: disk=nvme0n1, qid=4, cmdid=25169, res=0x0, retries=0, flags=0x0, status=0x0
> >>   raft-benchmark-37801   [003] d.h..  9904.831178: block_rq_complete: 259,0 WS () 498455480 + 8 [0]
> >>      kworker/3:1-30279   [003] .....  9904.831193: ext4_journal_start_inode: dev 259,5 blocks 8, rsv_blocks 0, revoke_creds 8, type 3, ino 12, caller ext4_convert_unwritten_extents+0xb4/0x260 [ext4]
> >
> > ... after io completed here, we need to convert unwritten extent into a
> > written one.
> >
> >>      kworker/3:1-30279   [003] .....  9904.831193: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
> >>      kworker/3:1-30279   [003] .....  9904.831194: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 U
> >>      kworker/3:1-30279   [003] .....  9904.831194: ext4_ext_map_blocks_enter: dev 259,5 ino 12 lblk 0 len 1 flags CREATE|UNWRIT|CONVERT
> >>      kworker/3:1-30279   [003] .....  9904.831195: ext4_es_cache_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
> >>      kworker/3:1-30279   [003] .....  9904.831195: ext4_ext_show_extent: dev 259,5 ino 12 lblk 0 pblk 32887 len 1
> >>      kworker/3:1-30279   [003] .....  9904.831196: ext4_ext_handle_unwritten_extents: dev 259,5 ino 12 m_lblk 0 m_pblk 32887 m_len 1 flags CREATE|UNWRIT|CONVERT|METADATA_NOFAIL allocated 1 newblock 32887
> >>      kworker/3:1-30279   [003] .....  9904.831196: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_ext_map_blocks+0xeee/0x1980 [ext4]
> >>      kworker/3:1-30279   [003] .....  9904.831197: block_touch_buffer: 259,5 sector=135 size=4096
> >>      kworker/3:1-30279   [003] .....  9904.831198: block_dirty_buffer: 259,5 sector=135 size=4096
> >>      kworker/3:1-30279   [003] .....  9904.831199: ext4_ext_map_blocks_exit: dev 259,5 ino 12 flags CREATE|UNWRIT|CONVERT lblk 0 pblk 32887 len 1 mflags M ret 1
> >>      kworker/3:1-30279   [003] .....  9904.831199: ext4_es_insert_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status W
> >>      kworker/3:1-30279   [003] .....  9904.831200: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_convert_unwritten_extents+0x1e2/0x260 [ext4]
> >>      kworker/3:1-30279   [003] .....  9904.831200: block_touch_buffer: 259,5 sector=135 size=4096
> >>      kworker/3:1-30279   [003] .....  9904.831201: block_dirty_buffer: 259,5 sector=135 size=4096
> >
> > The conversion to written extents happened here.
> >
> >>      kworker/3:1-30279   [003] .....  9904.831202: ext4_sync_file_enter: dev 259,5 ino 12 parent 2 datasync 1 
> >>      kworker/3:1-30279   [003] .....  9904.831203: ext4_sync_file_exit: dev 259,5 ino 12 ret 0
> >
> > And here we've called fdatasync() for the inode. Now this should have
> > submitted a cache flush through blkdev_issue_flush() but that doesn't seem
> > to happen.
> 
> Right.
> 
> > Indeed checking the code in 6.4 the problem is that inode is
> > dirtied only through ext4_mark_inode_dirty() which does not alter
> > inode->i_state and thus neither the inode buffer is properly flushed to the
> > disk in this case as it should (as it contains the converted extent) nor do
> > we issue the cache flush. As part of commit 5b5b4ff8f92da ("ext4: Use
> > generic_buffers_fsync_noflush() implementation") in 6.5 we accidentally
> > fixed the second problem AFAICT but the first problem with not flushing inode
> > buffer properly is still there... I'll have to think how to fix that
> > properly.
> 
> I've re-ran the same code with kernel 6.5.0 and indeed the behavior has
> changed and an actual NVMe flush command seems to be issued (the flags
> passed to nvme_setup_cmd match the ones that I see in the case I write
> to the raw block device). So that part seems fixed. I thought I had
> tried with 6.5.1 when I had posted this issue, and that it was present
> there too, but maybe I was mistaken.
> 
> I'm not entirely sure what you mean about flushing the inode buffer
> properly. As far as I see, the current behavior I see matches what I'd
> expect.

The thing is: fallocate(2) does preallocate blocks to the file so
block allocation is not needed when writing to those blocks. However that
does not mean metadata changes are not needed when writing to those blocks.
In case of ext4 (and other filesystems such as xfs or btrfs as well) blocks
allocated using fallocate(2) are tracked as unwritten in inode's metadata (so
reads from them return 0 instead of random garbage). After block contents
is written with iouring write, we need to convert the state of blocks from
unwritten to written and that needs metadata modifications. So the first
write to the file after fallocate(2) call still needs to do metadata
modifications and these need to be persisted as part of fdatasync(2). And
by mistake this did not happen in nojournal mode.

> For reference I'm attaching below the trace of the same user code, this
> time run on kernel 6.5.0, which is the one currently shipping with
> Debian/testing. Note that there are quite a bit less trace lines emitted
> by the ext4 sub-system, not sure if it's related/relevant.
> 
>   raft-benchmark-35708   [000] .....  9203.271114: io_uring_submit_req: ring 000000007ee609d1, req 00000000221c7d2e, user_data 0x0, opcode WRITE_FIXED, flags 0x1, sq_thread 0
>   raft-benchmark-35708   [000] .....  9203.271117: ext4_es_lookup_extent_enter: dev 259,5 ino 16 lblk 0
>   raft-benchmark-35708   [000] .....  9203.271117: ext4_es_lookup_extent_exit: dev 259,5 ino 16 found 1 [0/16) 32896 WR
>   raft-benchmark-35708   [000] .....  9203.271118: ext4_journal_start_inode: dev 259,5 blocks 2, rsv_blocks 0, revoke_creds 8, type 1, ino 16, caller ext4_dirty_inode+0x38/0x80 [ext4]
>   raft-benchmark-35708   [000] .....  9203.271119: ext4_mark_inode_dirty: dev 259,5 ino 16 caller ext4_dirty_inode+0x5b/0x80 [ext4]
>   raft-benchmark-35708   [000] .....  9203.271119: block_touch_buffer: 259,5 sector=135 size=4096
>   raft-benchmark-35708   [000] .....  9203.271120: block_dirty_buffer: 259,5 sector=135 size=4096
>   raft-benchmark-35708   [000] .....  9203.271120: ext4_es_lookup_extent_enter: dev 259,5 ino 16 lblk 0
>   raft-benchmark-35708   [000] .....  9203.271121: ext4_es_lookup_extent_exit: dev 259,5 ino 16 found 1 [0/16) 32896 WR
>   raft-benchmark-35708   [000] .....  9203.271122: block_bio_remap: 259,0 WFS 498455552 + 8 <- (259,5) 263168
>   raft-benchmark-35708   [000] .....  9203.271122: block_bio_queue: 259,0 WFS 498455552 + 8 [raft-benchmark]
>   raft-benchmark-35708   [000] .....  9203.271123: block_getrq: 259,0 WFS 498455552 + 8 [raft-benchmark]
>   raft-benchmark-35708   [000] .....  9203.271123: block_io_start: 259,0 WFS 4096 () 498455552 + 8 [raft-benchmark]
>   raft-benchmark-35708   [000] .....  9203.271124: block_plug: [raft-benchmark]
>   raft-benchmark-35708   [000] .....  9203.271124: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=1, cmdid=53265, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498455552, len=7, ctrl=0x4000, dsmgmt=0, reftag=0)
>   raft-benchmark-35708   [000] .....  9203.271126: block_rq_issue: 259,0 WFS 4096 () 498455552 + 8 [raft-benchmark]
>   raft-benchmark-35708   [000] d.h..  9203.271382: nvme_sq: nvme0: disk=nvme0n1, qid=1, head=79, tail=79
>   raft-benchmark-35708   [000] d.h..  9203.271384: nvme_complete_rq: nvme0: disk=nvme0n1, qid=1, cmdid=53265, res=0x0, retries=0, flags=0x0, status=0x0
>   raft-benchmark-35708   [000] d.h..  9203.271384: block_rq_complete: 259,0 WFS () 498455552 + 8 [0]
>   raft-benchmark-35708   [000] dNh..  9203.271386: block_io_done: 259,0 WFS 0 () 498455552 + 0 [raft-benchmark]
>   raft-benchmark-35708   [000] ...1.  9203.271391: io_uring_complete: ring 000000007ee609d1, req 00000000221c7d2e, user_data 0x0, result 4096, cflags 0x0 extra1 0 extra2 0 
>   raft-benchmark-35708   [000] .....  9203.271391: io_uring_task_work_run: tctx 00000000f15587dc, count 1, loops 1

So in this case the file blocks seem to have been already written. In this
case we don't need to do any block conversions (thus no metadata changes)
and we also submit the write including the flush in the single block
request. In the trace:

raft-benchmark-35708   [000] .....  9203.271126: block_rq_issue: 259,0 WFS 4096 () 498455552 + 8 [raft-benchmark]

this line shows the request submission. The type of request is 'WFS' which
means 'write' with 'flush' (this makes sure write does not just go to
devices' cache) and 'sync' (which just says somebody is waiting for
completion of the IO so it should be treated with priority by IO scheduling
algorithms).

								Honza



-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: direct I/O: ext4 seems to not honor RWF_DSYNC when journal is disabled
  2024-01-10 10:01       ` Jan Kara
@ 2024-01-10 11:19         ` Free Ekanayaka
  2024-01-10 11:57           ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Free Ekanayaka @ 2024-01-10 11:19 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

Jan Kara <jack@suse.cz> writes:

[...]

>> I've re-ran the same code with kernel 6.5.0 and indeed the behavior has
>> changed and an actual NVMe flush command seems to be issued (the flags
>> passed to nvme_setup_cmd match the ones that I see in the case I write
>> to the raw block device). So that part seems fixed. I thought I had
>> tried with 6.5.1 when I had posted this issue, and that it was present
>> there too, but maybe I was mistaken.
>> 
>> I'm not entirely sure what you mean about flushing the inode buffer
>> properly. As far as I see, the current behavior I see matches what I'd
>> expect.
>
> The thing is: fallocate(2) does preallocate blocks to the file so
> block allocation is not needed when writing to those blocks. However that
> does not mean metadata changes are not needed when writing to those blocks.
> In case of ext4 (and other filesystems such as xfs or btrfs as well) blocks
> allocated using fallocate(2) are tracked as unwritten in inode's metadata (so
> reads from them return 0 instead of random garbage). After block contents
> is written with iouring write, we need to convert the state of blocks from
> unwritten to written and that needs metadata modifications. So the first
> write to the file after fallocate(2) call still needs to do metadata
> modifications and these need to be persisted as part of fdatasync(2). And
> by mistake this did not happen in nojournal mode.

Thanks for the explanation.

Since I want to achieve the lowest possible latency for these writes, is
there a way to not onlly preallocate blocks with fallocate() but also to
avoid the extra write for metadata modifications that you mention?

I mean, I could of course take the brute force approach of performing
dump "pre" writes of those blocks, but I'm wondering if there's a better
way that doesn't require writing those blocks twice (which might end up
defeating the purpose of lowering overall latency).

>
>> For reference I'm attaching below the trace of the same user code, this
>> time run on kernel 6.5.0, which is the one currently shipping with
>> Debian/testing. Note that there are quite a bit less trace lines emitted
>> by the ext4 sub-system, not sure if it's related/relevant.
>> 
>>   raft-benchmark-35708   [000] .....  9203.271114: io_uring_submit_req: ring 000000007ee609d1, req 00000000221c7d2e, user_data 0x0, opcode WRITE_FIXED, flags 0x1, sq_thread 0
>>   raft-benchmark-35708   [000] .....  9203.271117: ext4_es_lookup_extent_enter: dev 259,5 ino 16 lblk 0
>>   raft-benchmark-35708   [000] .....  9203.271117: ext4_es_lookup_extent_exit: dev 259,5 ino 16 found 1 [0/16) 32896 WR
>>   raft-benchmark-35708   [000] .....  9203.271118: ext4_journal_start_inode: dev 259,5 blocks 2, rsv_blocks 0, revoke_creds 8, type 1, ino 16, caller ext4_dirty_inode+0x38/0x80 [ext4]
>>   raft-benchmark-35708   [000] .....  9203.271119: ext4_mark_inode_dirty: dev 259,5 ino 16 caller ext4_dirty_inode+0x5b/0x80 [ext4]
>>   raft-benchmark-35708   [000] .....  9203.271119: block_touch_buffer: 259,5 sector=135 size=4096
>>   raft-benchmark-35708   [000] .....  9203.271120: block_dirty_buffer: 259,5 sector=135 size=4096
>>   raft-benchmark-35708   [000] .....  9203.271120: ext4_es_lookup_extent_enter: dev 259,5 ino 16 lblk 0
>>   raft-benchmark-35708   [000] .....  9203.271121: ext4_es_lookup_extent_exit: dev 259,5 ino 16 found 1 [0/16) 32896 WR
>>   raft-benchmark-35708   [000] .....  9203.271122: block_bio_remap: 259,0 WFS 498455552 + 8 <- (259,5) 263168
>>   raft-benchmark-35708   [000] .....  9203.271122: block_bio_queue: 259,0 WFS 498455552 + 8 [raft-benchmark]
>>   raft-benchmark-35708   [000] .....  9203.271123: block_getrq: 259,0 WFS 498455552 + 8 [raft-benchmark]
>>   raft-benchmark-35708   [000] .....  9203.271123: block_io_start: 259,0 WFS 4096 () 498455552 + 8 [raft-benchmark]
>>   raft-benchmark-35708   [000] .....  9203.271124: block_plug: [raft-benchmark]
>>   raft-benchmark-35708   [000] .....  9203.271124: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=1, cmdid=53265, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498455552, len=7, ctrl=0x4000, dsmgmt=0, reftag=0)
>>   raft-benchmark-35708   [000] .....  9203.271126: block_rq_issue: 259,0 WFS 4096 () 498455552 + 8 [raft-benchmark]
>>   raft-benchmark-35708   [000] d.h..  9203.271382: nvme_sq: nvme0: disk=nvme0n1, qid=1, head=79, tail=79
>>   raft-benchmark-35708   [000] d.h..  9203.271384: nvme_complete_rq: nvme0: disk=nvme0n1, qid=1, cmdid=53265, res=0x0, retries=0, flags=0x0, status=0x0
>>   raft-benchmark-35708   [000] d.h..  9203.271384: block_rq_complete: 259,0 WFS () 498455552 + 8 [0]
>>   raft-benchmark-35708   [000] dNh..  9203.271386: block_io_done: 259,0 WFS 0 () 498455552 + 0 [raft-benchmark]
>>   raft-benchmark-35708   [000] ...1.  9203.271391: io_uring_complete: ring 000000007ee609d1, req 00000000221c7d2e, user_data 0x0, result 4096, cflags 0x0 extra1 0 extra2 0 
>>   raft-benchmark-35708   [000] .....  9203.271391: io_uring_task_work_run: tctx 00000000f15587dc, count 1, loops 1
>
> So in this case the file blocks seem to have been already written.

Do you mean that they have been already written at some point in the
past after the file system was created?

I guess it doesn't matter if they were written as part of the new file
that is being written (and that was created with open()/fallocate()), or
if they were written before as part of some other file that was deleted
since then.

> In this
> case we don't need to do any block conversions (thus no metadata changes)
> and we also submit the write including the flush in the single block
> request. In the trace:
>
> raft-benchmark-35708   [000] .....  9203.271126: block_rq_issue: 259,0 WFS 4096 () 498455552 + 8 [raft-benchmark]
>
> this line shows the request submission. The type of request is 'WFS' which
> means 'write' with 'flush' (this makes sure write does not just go to
> devices' cache) and 'sync' (which just says somebody is waiting for
> completion of the IO so it should be treated with priority by IO scheduling
> algorithms).

Thanks, that's helpful.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: direct I/O: ext4 seems to not honor RWF_DSYNC when journal is disabled
  2024-01-10 11:19         ` Free Ekanayaka
@ 2024-01-10 11:57           ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2024-01-10 11:57 UTC (permalink / raw)
  To: Free Ekanayaka; +Cc: Jan Kara, linux-ext4

On Wed 10-01-24 11:19:42, Free Ekanayaka wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> [...]
> 
> >> I've re-ran the same code with kernel 6.5.0 and indeed the behavior has
> >> changed and an actual NVMe flush command seems to be issued (the flags
> >> passed to nvme_setup_cmd match the ones that I see in the case I write
> >> to the raw block device). So that part seems fixed. I thought I had
> >> tried with 6.5.1 when I had posted this issue, and that it was present
> >> there too, but maybe I was mistaken.
> >> 
> >> I'm not entirely sure what you mean about flushing the inode buffer
> >> properly. As far as I see, the current behavior I see matches what I'd
> >> expect.
> >
> > The thing is: fallocate(2) does preallocate blocks to the file so
> > block allocation is not needed when writing to those blocks. However that
> > does not mean metadata changes are not needed when writing to those blocks.
> > In case of ext4 (and other filesystems such as xfs or btrfs as well) blocks
> > allocated using fallocate(2) are tracked as unwritten in inode's metadata (so
> > reads from them return 0 instead of random garbage). After block contents
> > is written with iouring write, we need to convert the state of blocks from
> > unwritten to written and that needs metadata modifications. So the first
> > write to the file after fallocate(2) call still needs to do metadata
> > modifications and these need to be persisted as part of fdatasync(2). And
> > by mistake this did not happen in nojournal mode.
> 
> Thanks for the explanation.
> 
> Since I want to achieve the lowest possible latency for these writes, is
> there a way to not onlly preallocate blocks with fallocate() but also to
> avoid the extra write for metadata modifications that you mention?
> 
> I mean, I could of course take the brute force approach of performing
> dump "pre" writes of those blocks, but I'm wondering if there's a better
> way that doesn't require writing those blocks twice (which might end up
> defeating the purpose of lowering overall latency).

No, if you really want to make sure there will be no metadata modifications
while writing to a file, writing 0's (or whatever you wish) to it is the
only way.

> >> For reference I'm attaching below the trace of the same user code, this
> >> time run on kernel 6.5.0, which is the one currently shipping with
> >> Debian/testing. Note that there are quite a bit less trace lines emitted
> >> by the ext4 sub-system, not sure if it's related/relevant.
> >> 
> >>   raft-benchmark-35708   [000] .....  9203.271114: io_uring_submit_req: ring 000000007ee609d1, req 00000000221c7d2e, user_data 0x0, opcode WRITE_FIXED, flags 0x1, sq_thread 0
> >>   raft-benchmark-35708   [000] .....  9203.271117: ext4_es_lookup_extent_enter: dev 259,5 ino 16 lblk 0
> >>   raft-benchmark-35708   [000] .....  9203.271117: ext4_es_lookup_extent_exit: dev 259,5 ino 16 found 1 [0/16) 32896 WR
> >>   raft-benchmark-35708   [000] .....  9203.271118: ext4_journal_start_inode: dev 259,5 blocks 2, rsv_blocks 0, revoke_creds 8, type 1, ino 16, caller ext4_dirty_inode+0x38/0x80 [ext4]
> >>   raft-benchmark-35708   [000] .....  9203.271119: ext4_mark_inode_dirty: dev 259,5 ino 16 caller ext4_dirty_inode+0x5b/0x80 [ext4]
> >>   raft-benchmark-35708   [000] .....  9203.271119: block_touch_buffer: 259,5 sector=135 size=4096
> >>   raft-benchmark-35708   [000] .....  9203.271120: block_dirty_buffer: 259,5 sector=135 size=4096
> >>   raft-benchmark-35708   [000] .....  9203.271120: ext4_es_lookup_extent_enter: dev 259,5 ino 16 lblk 0
> >>   raft-benchmark-35708   [000] .....  9203.271121: ext4_es_lookup_extent_exit: dev 259,5 ino 16 found 1 [0/16) 32896 WR
> >>   raft-benchmark-35708   [000] .....  9203.271122: block_bio_remap: 259,0 WFS 498455552 + 8 <- (259,5) 263168
> >>   raft-benchmark-35708   [000] .....  9203.271122: block_bio_queue: 259,0 WFS 498455552 + 8 [raft-benchmark]
> >>   raft-benchmark-35708   [000] .....  9203.271123: block_getrq: 259,0 WFS 498455552 + 8 [raft-benchmark]
> >>   raft-benchmark-35708   [000] .....  9203.271123: block_io_start: 259,0 WFS 4096 () 498455552 + 8 [raft-benchmark]
> >>   raft-benchmark-35708   [000] .....  9203.271124: block_plug: [raft-benchmark]
> >>   raft-benchmark-35708   [000] .....  9203.271124: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=1, cmdid=53265, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498455552, len=7, ctrl=0x4000, dsmgmt=0, reftag=0)
> >>   raft-benchmark-35708   [000] .....  9203.271126: block_rq_issue: 259,0 WFS 4096 () 498455552 + 8 [raft-benchmark]
> >>   raft-benchmark-35708   [000] d.h..  9203.271382: nvme_sq: nvme0: disk=nvme0n1, qid=1, head=79, tail=79
> >>   raft-benchmark-35708   [000] d.h..  9203.271384: nvme_complete_rq: nvme0: disk=nvme0n1, qid=1, cmdid=53265, res=0x0, retries=0, flags=0x0, status=0x0
> >>   raft-benchmark-35708   [000] d.h..  9203.271384: block_rq_complete: 259,0 WFS () 498455552 + 8 [0]
> >>   raft-benchmark-35708   [000] dNh..  9203.271386: block_io_done: 259,0 WFS 0 () 498455552 + 0 [raft-benchmark]
> >>   raft-benchmark-35708   [000] ...1.  9203.271391: io_uring_complete: ring 000000007ee609d1, req 00000000221c7d2e, user_data 0x0, result 4096, cflags 0x0 extra1 0 extra2 0 
> >>   raft-benchmark-35708   [000] .....  9203.271391: io_uring_task_work_run: tctx 00000000f15587dc, count 1, loops 1
> >
> > So in this case the file blocks seem to have been already written.
> 
> Do you mean that they have been already written at some point in the
> past after the file system was created?

No, I mean this particular file offset has already been written to since
the file was created. The line:

raft-benchmark-35708   [000] .....  9203.271121: ext4_es_lookup_extent_exit: dev 259,5 ino 16 found 1 [0/16) 32896 WR

at the end has 'WR' which is the status of the found extent (contiguous
sequence of blocks) and 'WR' means 'written' and 'referenced'.

> I guess it doesn't matter if they were written as part of the new file
> that is being written (and that was created with open()/fallocate()), or
> if they were written before as part of some other file that was deleted
> since then.

No, the block must have been written when it was already part of this file.
The block being written is tracked on logical-file-offset basis and this
tracking is there exactly so that you cannot read some stale data
(potentially sensitive data of another user) after fallocating some blocks
to the file.

I understand this protection need not be that interesting for your usecase
and in nojournal mode you still may potentially see such stale data but
there's a big difference between possibly seeing such data only after a
crash and giving user a way to read such data at will.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2024-01-10 11:57 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <87ttscddv4.fsf@x1.mail-host-address-is-not-set>
2023-09-06 20:15 ` direct I/O: ext4 seems to not honor RWF_DSYNC when journal is disabled Free Ekanayaka
2024-01-08 21:31   ` Jan Kara
2024-01-09  6:05     ` Dave Chinner
2024-01-09 13:59       ` Jan Kara
2024-01-09 15:57         ` Free Ekanayaka
2024-01-09 15:46     ` Free Ekanayaka
2024-01-10 10:01       ` Jan Kara
2024-01-10 11:19         ` Free Ekanayaka
2024-01-10 11:57           ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox