* Block size and read-modify-write
@ 2017-12-28 23:14 Gionatan Danti
2018-01-02 10:25 ` Carlos Maiolino
0 siblings, 1 reply; 9+ messages in thread
From: Gionatan Danti @ 2017-12-28 23:14 UTC (permalink / raw)
To: linux-xfs; +Cc: g.danti
Hi list,
I would like to ask a question: how does the XFS block size affect
read-modify-write in the case of very small writes?
For example, suppose an XFS filesystem with the default 4K data block
size. Am I correct in saying that:
- a normal, cached 512B write will cause a read-modify-write of the
entire 4K sector?
- a 512B O_DIRECT write will *not* cause a read-modify-write of the 4K
sector, rather it will be flushed to disk as-is (512 bytes in length)?
Thanks.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
* Re: Block size and read-modify-write
2017-12-28 23:14 Block size and read-modify-write Gionatan Danti
@ 2018-01-02 10:25 ` Carlos Maiolino
2018-01-03 1:19 ` Dave Chinner
0 siblings, 1 reply; 9+ messages in thread
From: Carlos Maiolino @ 2018-01-02 10:25 UTC (permalink / raw)
To: Gionatan Danti; +Cc: linux-xfs
On Fri, Dec 29, 2017 at 12:14:14AM +0100, Gionatan Danti wrote:
> Hi list,
> I would like to ask a question: how does the XFS block size affect
> read-modify-write in the case of very small writes?
>
Hi,
> For example, suppose an XFS filesystem with the default 4K data block size.
> Am I correct in saying that:
> - a normal, cached 512B write will cause a read-modify-write of the entire
> 4K sector?
> - a 512B O_DIRECT write will *not* cause a read-modify-write of the 4K
> sector, rather it will be flushed to disk as-is (512 bytes in length)?
>
IIRC, although the DIO requirement is only that writes be aligned to the logical
sector size, issuing such IOs without proper alignment to the filesystem block
size has a few consequences:
- It will require an exclusive inode IO lock, serializing IOs to the inode.
- And yes, it will require an RMW of the block in question; all IO is always
done in filesystem block size units.
I'm probably missing something else here, but these two are the things I had
in mind.
Cheers
> Thanks.
>
> --
> Danti Gionatan
> Supporto Tecnico
> Assyoma S.r.l. - www.assyoma.it
> email: g.danti@assyoma.it - info@assyoma.it
> GPG public key ID: FF5F32A8
--
Carlos
* Re: Block size and read-modify-write
2018-01-02 10:25 ` Carlos Maiolino
@ 2018-01-03 1:19 ` Dave Chinner
2018-01-03 8:19 ` Carlos Maiolino
2018-01-03 14:54 ` Gionatan Danti
0 siblings, 2 replies; 9+ messages in thread
From: Dave Chinner @ 2018-01-03 1:19 UTC (permalink / raw)
To: Gionatan Danti, linux-xfs
On Tue, Jan 02, 2018 at 11:25:39AM +0100, Carlos Maiolino wrote:
> On Fri, Dec 29, 2017 at 12:14:14AM +0100, Gionatan Danti wrote:
> > Hi list,
> > I would like to ask a question: how does the XFS block size affect
> > read-modify-write in the case of very small writes?
> >
>
> Hi,
>
> > For example, suppose an XFS filesystem with the default 4K data block size.
> > Am I correct in saying that:
> > - a normal, cached 512B write will cause a read-modify-write of the entire
> > 4K sector?
Cached writes smaller than a *page* will cause RMW cycles in the
page cache, regardless of the block size of the filesystem.
> > - a 512B O_DIRECT write will *not* cause a read-modify-write of the 4K
> > sector, rather it will be flushed to disk as-is (512 bytes in length)?
Ok, there is a difference between *sector size* and *filesystem
block size*. You seem to be using them interchangeably in your
question, and that's not correct.
> IIRC, although the DIO requirement is only that writes be aligned to the logical
> sector size, issuing such IOs without proper alignment to the filesystem block
> size has a few consequences:
>
> - It will require an exclusive inode IO lock, serializing IOs to the inode.
That is correct, but....
> - And yes, it will require an RMW of the block in question; all IO is always
> done in filesystem block size units.
.... this is not correct for direct IO. The direct IO path does not
do RMW cycles at all.
Put simply: a 512B DIO write on a (real or emulated) 512B sector
device with a 4k FSB will be serialised by the filesystem and do a
single 512B sector write to the device. However, if the device
reports as a 4k sector device then a 512B DIO write will be rejected
by the filesystem because sub-sector IO is not possible.
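Purely as an illustration (a sketch with hypothetical device and file
names), this is easy to check from userspace:
# what the device advertises; 512/4096 here means a 512e drive
cat /sys/block/sda/queue/logical_block_size
cat /sys/block/sda/queue/physical_block_size
# on a true 4Kn device (logical_block_size = 4096) the same 512B direct
# write should be rejected with EINVAL:
dd if=/dev/zero of=/mnt/test/file bs=512 count=1 oflag=direct conv=notrunc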
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Block size and read-modify-write
2018-01-03 1:19 ` Dave Chinner
@ 2018-01-03 8:19 ` Carlos Maiolino
2018-01-03 14:54 ` Gionatan Danti
1 sibling, 0 replies; 9+ messages in thread
From: Carlos Maiolino @ 2018-01-03 8:19 UTC (permalink / raw)
To: Dave Chinner; +Cc: Gionatan Danti, linux-xfs
On Wed, Jan 03, 2018 at 12:19:26PM +1100, Dave Chinner wrote:
> On Tue, Jan 02, 2018 at 11:25:39AM +0100, Carlos Maiolino wrote:
> > On Fri, Dec 29, 2017 at 12:14:14AM +0100, Gionatan Danti wrote:
> > > Hi list,
> > > I would like to ask a question: how does the XFS block size affect
> > > read-modify-write in the case of very small writes?
> > >
> >
> > Hi,
> >
> > > For example, suppose an XFS filesystem with the default 4K data block size.
> > > Am I correct in saying that:
> > > - a normal, cached 512B write will cause a read-modify-write of the entire
> > > 4K sector?
>
> Cached writes smaller than a *page* will cause RMW cycles in the
> page cache, regardless of the block size of the filesystem.
>
> > > - a 512B O_DIRECT write will *not* cause a read-modify-write of the 4K
> > > sector, rather it will be flushed to disk as-is (512 bytes in length)?
>
> Ok, there is a difference between *sector size* and *filesystem
> block size*. You seem to be using them interchangeably in your
> question, and that's not correct.
>
> > IIRC, although the DIO requirement is only that writes be aligned to the logical
> > sector size, issuing such IOs without proper alignment to the filesystem block
> > size has a few consequences:
> >
> > - It will require an exclusive inode IO lock, serializing IOs to the inode.
>
> That is correct, but....
>
> > - And yes, it will require an RMW of the block in question; all IO is always
> > done in filesystem block size units.
>
> .... this is not correct for direct IO. The direct IO path does not
> do RMW cycles at all.
>
> Put simply: a 512B DIO write on a (real or emulated) 512B sector
> device with a 4k FSB will be serialised by the filesystem and do a
> single 512B sector write to the device. However, if the device
> reports as a 4k sector device then a 512B DIO write will be rejected
> by the filesystem because sub-sector IO is not possible.
Oh, thanks for the correction :)
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
--
Carlos
* Re: Block size and read-modify-write
2018-01-03 1:19 ` Dave Chinner
2018-01-03 8:19 ` Carlos Maiolino
@ 2018-01-03 14:54 ` Gionatan Danti
2018-01-03 21:47 ` Dave Chinner
1 sibling, 1 reply; 9+ messages in thread
From: Gionatan Danti @ 2018-01-03 14:54 UTC (permalink / raw)
To: Dave Chinner, linux-xfs; +Cc: Gionatan Danti
On 03/01/2018 02:19, Dave Chinner wrote:
> Cached writes smaller than a *page* will cause RMW cycles in the
> page cache, regardless of the block size of the filesystem.
Sure, in this case a page-sized r/m/w cycle happens in the page cache.
However, it seems to me that, when flushed to disk, writes happen at
block-level granularity, as you can see from tests [1,2] below. Am I
wrong? Am I missing something?
> Ok, there is a difference between *sector size* and *filesystem
> block size*. You seem to be using them interchangeably in your
> question, and that's not correct.
True, maybe I have issues grasping the concept of sector size from the
XFS point of view. I understand sector size as a hardware property of the
underlying block device, but how does it relate to the filesystem?
I naively supposed that an XFS filesystem created with a 4k *sector* size
(ie: mkfs.xfs -s size=4096) would prevent 512-byte O_DIRECT writes, but
my test [3] shows that even on such a filesystem a 512B direct write is
indeed possible.
Is the sector size information only used by XFS's own metadata and
journaling in order to avoid costly device-level r/m/w cycles on 512e
devices? I understand that on a 4Kn device you *have* to avoid
sub-sector writes, or the transfer will fail.
>
> .... this is not correct for direct IO. The direct IO path does not
> do RMW cycles at all.
>
> Put simply: a 512B DIO write on a (real or emulated) 512B sector
> device with a 4k FSB will be serialised by the filesystem and do a
> single 512B sector write to the device. However, if the device
> reports as a 4k sector device then a 512B DIO write will be rejected
> by the filesystem because sub-sector IO is not possible.
Ok, this was as expected.
I want to put some context on the original question, and why I am so
interested in r/m/w cycles. SSD flash-page sizes have, in recent years
(2014+), ballooned to 8/16/32K. I wonder if a matching block size and/or
sector size is needed to avoid (some of the) device-level r/m/w cycles,
which can dramatically increase flash write amplification (with reduced
endurance).
Thanks.
------ test output below ------
# Block device properties
[root@blackhole queue]# blockdev --getss --getpbsz --getiomin --getbsz
/dev/sda3
512
512
512
4096
[1] # XFS with blocksize=4K and sectorsize=512B (default)
[root@blackhole queue]# mkfs.xfs /dev/sda3
meta-data=/dev/sda3 isize=512 agcount=4, agsize=65536 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0
data = bsize=4096 blocks=262144, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=2560, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
[root@blackhole queue]# mount /dev/sda3 /mnt/test/
# Preallocate file to minimize metadata traffic
[root@blackhole test]# fallocate /mnt/test/test.img -l 256M
# Write 512B via pagecache
[root@blackhole test]# while true; do echo 3 > /proc/sys/vm/drop_caches;
dd if=/dev/urandom of=/mnt/test/test.img bs=512 count=1 oflag=dsync
conv=nocreat,notrunc; sleep 1; done
# Dstat results: 4K reads/writes (read-modify-write)
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
read writ
4096B 4096B
4096B 4096B
4096B 4096B
# Write 512B via O_DIRECT
[root@blackhole test]# while true; do echo 3 > /proc/sys/vm/drop_caches;
dd if=/dev/urandom of=/mnt/test/test.img bs=512 count=1 oflag=dsync
conv=nocreat,notrunc oflag=direct; sleep 1; done
# Dstat results: 512B writes
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
read writ
0 512B
0 512B
0 512B
[2] # XFS with blocksize=1K and sectorsize=512B
[root@blackhole mnt]# umount /mnt/test/
[root@blackhole mnt]# mkfs.xfs /dev/sda3 -f -b size=1024
meta-data=/dev/sda3 isize=512 agcount=4, agsize=262144 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0
data = bsize=1024 blocks=1048576, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=1024 blocks=10240, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
[root@blackhole mnt]# mount /dev/sda3 /mnt/test/
# Preallocate file to minimize metadata traffic
[root@blackhole mnt]# fallocate /mnt/test/test.img -l 256M
# Write 512B via pagecache
[root@blackhole mnt]# while true; do echo 3 > /proc/sys/vm/drop_caches;
dd if=/dev/urandom of=/mnt/test/test.img bs=512 count=1 oflag=dsync
conv=nocreat,notrunc; sleep 1; done
# Dstat results: 1K reads/writes (read-modify-write)
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
read writ
1024B 1024B
1024B 1024B
1024B 1024B
# Write 512B via O_DIRECT
while true; do echo 3 > /proc/sys/vm/drop_caches; dd if=/dev/urandom
of=/mnt/test/test.img bs=512 count=1 oflag=dsync conv=nocreat,notrunc
oflag=direct; sleep 1; done
# Dstat results: 512B writes
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
read writ
0 512B
0 512B
0 512B
[3] # XFS with blocksize=4K and sectorsize=4K
[root@blackhole mnt]# umount /mnt/test/
[root@blackhole mnt]# mkfs.xfs /dev/sda3 -f -s size=4096
meta-data=/dev/sda3 isize=512 agcount=4, agsize=65536 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0
data = bsize=4096 blocks=262144, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=2560, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
[root@blackhole mnt]# mount /dev/sda3 /mnt/test/
# Preallocate file to minimize metadata traffic
[root@blackhole mnt]# fallocate /mnt/test/test.img -l 256M
# Write 512B via pagecache
[root@blackhole mnt]# while true; do echo 3 > /proc/sys/vm/drop_caches;
dd if=/dev/urandom of=/mnt/test/test.img bs=512 count=1 oflag=dsync
conv=nocreat,notrunc; sleep 1; done
# Dstat results: 4K reads/writes (read-modify-write)
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
read writ
4096B 4096B
4096B 4096B
4096B 4096B
# Write 512B via O_DIRECT
[root@blackhole mnt]# while true; do echo 3 > /proc/sys/vm/drop_caches;
dd if=/dev/urandom of=/mnt/test/test.img bs=512 count=1 oflag=dsync
conv=nocreat,notrunc oflag=direct; sleep 1; done
# Dstat results: 512B writes
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
read writ
0 512B
0 512B
0 512B
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
* Re: Block size and read-modify-write
2018-01-03 14:54 ` Gionatan Danti
@ 2018-01-03 21:47 ` Dave Chinner
2018-01-03 22:09 ` Gionatan Danti
0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2018-01-03 21:47 UTC (permalink / raw)
To: Gionatan Danti; +Cc: linux-xfs
On Wed, Jan 03, 2018 at 03:54:42PM +0100, Gionatan Danti wrote:
>
>
> On 03/01/2018 02:19, Dave Chinner wrote:
> >Cached writes smaller than a *page* will cause RMW cycles in the
> >page cache, regardless of the block size of the filesystem.
>
> Sure, in this case a page-sized r/m/w cycle happens in the page cache.
> However, it seems to me that, when flushed to disk, writes happen at
> block-level granularity, as you can see from tests [1,2] below.
> Am I wrong? Am I missing something?
You're writing into unwritten extents. That's not a data overwrite,
so behaviour can be very different. And when you have sub-page block
sizes, the filesystem and/or page cache may decide not to read the
whole page if it doesn't need to immediately. e.g. you'll see
different behaviour between a 512 byte write() and a 512 byte write
via mmap()...
IOWs, there are so many different combinations of behaviour and
variables that we don't try to explain every single nuance. If you
do sub-page and/or sub-block size IO, then expect page-sized RMW
to occur. It might be smaller depending on the fs config, the file
layout, the underlying extent type, the type of IO operation the
write must perform (e.g. plain overwrite vs copy-on-write), the
offset into the page/block, etc. The simple message is this: avoid
sub-block/page size IO if you can possibly avoid it.
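If you want to see which case a given file is in, one rough check (a
sketch, using the preallocated test file from your mail) is to look at
the extent state before and after the first write:
# per xfs_bmap(8), the 10000 flag marks an unwritten (preallocated)
# extent; it should disappear once that range has actually been written
xfs_bmap -v /mnt/test/test.img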
> >Ok, there is a difference between *sector size* and *filesystem
> >block size*. You seem to be using them interchangeably in your
> >question, and that's not correct.
>
> True, maybe I have issues grasping the concept of sector size from
> the XFS point of view. I understand sector size as a hardware property
> of the underlying block device, but how does it relate to the
> filesystem?
>
> I naively supposed that an XFS filesystem created with a 4k *sector*
> size (ie: mkfs.xfs -s size=4096) would prevent 512-byte O_DIRECT
> writes, but my test [3] shows that even on such a filesystem a 512B
> direct write is indeed possible.
>
> Is the sector size information only used by XFS's own metadata and
> journaling in order to avoid costly device-level r/m/w cycles on
> 512e devices? I understand that on a 4Kn device you *have* to avoid
> sub-sector writes, or the transfer will fail.
We don't care if the device does internal RMW cycles (RAID does
that all the time). The sector size we care about is the size of an
atomic write IO - the IO size that the device guarantees will either
succeed completely or fail without modification. This is needed for
journal recovery sanity.
For data, the kernel checks the logical device sector size and
limits direct IO to those sizes, not the filesystem sector size.
i.e. the filesystem sector size is there for sizing journal operations
and metadata, not limiting data access alignment.
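As a rough illustration (a sketch, reusing the device and mount point
from your tests):
# logical sector size the kernel enforces for direct IO granularity
blockdev --getss /dev/sda3
# filesystem sector size, used for sizing the journal and metadata
xfs_info /mnt/test | grep sectsz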
> I want to put some context on the original question, and why I am so
> interested in r/m/w cycles. SSD flash-page sizes have, in recent
> years (2014+), ballooned to 8/16/32K. I wonder if a matching
> block size and/or sector size is needed to avoid (some of the)
> device-level r/m/w cycles, which can dramatically increase flash
> write amplification (with reduced endurance).
We've been over this many times in the past few years. User data
alignment is controlled by the stripe unit/width specification,
not sector/block sizes.
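For example, something like this (purely illustrative values, assuming
a 16k flash page - not a recommendation for your device) is how those
alignment hints are conveyed at mkfs time:
# su = stripe unit, sw = number of stripe units per stripe width
mkfs.xfs -f -d su=16k,sw=1 /dev/sda3
# the resulting sunit/swidth can be checked afterwards with xfs_info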
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Block size and read-modify-write
2018-01-03 21:47 ` Dave Chinner
@ 2018-01-03 22:09 ` Gionatan Danti
2018-01-03 22:59 ` Dave Chinner
0 siblings, 1 reply; 9+ messages in thread
From: Gionatan Danti @ 2018-01-03 22:09 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, g.danti
Il 03-01-2018 22:47 Dave Chinner ha scritto:
> On Wed, Jan 03, 2018 at 03:54:42PM +0100, Gionatan Danti wrote:
>>
>>
>> On 03/01/2018 02:19, Dave Chinner wrote:
>> >Cached writes smaller than a *page* will cause RMW cycles in the
>> >page cache, regardless of the block size of the filesystem.
>>
>> Sure, in this case a page-sized r/m/w cycle happens in the page cache.
>> However, it seems to me that, when flushed to disk, writes happen at
>> block-level granularity, as you can see from tests [1,2] below.
>> Am I wrong? Am I missing something?
>
> You're writing into unwritten extents. That's not a data overwrite,
> so behaviour can be very different. And when you have sub-page block
> sizes, the filesystem and/or page cache may decide not to read the
> whole page if it doesn't need to immediately. e.g. you'll see
> different behaviour between a 512 byte write() and a 512 byte write
> via mmap()...
The first "dd" execution surely writes into unwritten extents. However,
the following writes overwrite real data, right?
> IOWs, there are so many different combinations of behaviour and
> variables that we don't try to explain every single nuance. If you
> do sub-page and/or sub-block size IO, then expect page-sized RMW
> to occur. It might be smaller depending on the fs config, the file
> layout, the underlying extent type, the type of IO operation the
> write must perform (e.g. plain overwrite vs copy-on-write), the
> offset into the page/block, etc. The simple message is this: avoid
> sub-block/page size IO if you can possibly avoid it.
>
>> >Ok, there is a difference between *sector size* and *filesystem
>> >block size*. You seem to be using them interchangeably in your
>> >question, and that's not correct.
>>
>> True, maybe I have issues grasping the concept of sector size from
>> the XFS point of view. I understand sector size as a hardware property
>> of the underlying block device, but how does it relate to the
>> filesystem?
>>
>> I naively supposed that an XFS filesystem created with a 4k *sector*
>> size (ie: mkfs.xfs -s size=4096) would prevent 512-byte O_DIRECT
>> writes, but my test [3] shows that even on such a filesystem a 512B
>> direct write is indeed possible.
>>
>> Is the sector size information only used by XFS's own metadata and
>> journaling in order to avoid costly device-level r/m/w cycles on
>> 512e devices? I understand that on a 4Kn device you *have* to avoid
>> sub-sector writes, or the transfer will fail.
>
> We don't care if the device does internal RMW cycles (RAID does
> that all the time). The sector size we care about is the size of an
> atomic write IO - the IO size that the device guarantees will either
> succeed completely or fail without modification. This is needed for
> journal recovery sanity.
>
> For data, the kernel checks the logical device sector size and
> limits direct IO to those sizes, not the filesystem sector size.
> i.e. the filesystem sector size is there for sizing journal operations
> and metadata, not limiting data access alignment.
This is an outstanding explanation.
Thank you very much.
>> I want to put some context on the original question, and why I am so
>> interested in r/m/w cycles. SSD flash-page sizes have, in recent
>> years (2014+), ballooned to 8/16/32K. I wonder if a matching
>> block size and/or sector size is needed to avoid (some of the)
>> device-level r/m/w cycles, which can dramatically increase flash
>> write amplification (with reduced endurance).
>
> We've been over this many times in the past few years. User data
> alignment is controlled by the stripe unit/width specification,
> not sector/block sizes.
Sure, but to avoid/mitigate device-level r/m/w, proper alignment is
not sufficient by itself. You should also avoid partial page writes.
Anyway, I got the message: this is not something XFS directly cares
about.
Thanks again.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
* Re: Block size and read-modify-write
2018-01-03 22:09 ` Gionatan Danti
@ 2018-01-03 22:59 ` Dave Chinner
2018-01-04 1:38 ` Gionatan Danti
0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2018-01-03 22:59 UTC (permalink / raw)
To: Gionatan Danti; +Cc: linux-xfs
On Wed, Jan 03, 2018 at 11:09:30PM +0100, Gionatan Danti wrote:
> Il 03-01-2018 22:47 Dave Chinner ha scritto:
> >On Wed, Jan 03, 2018 at 03:54:42PM +0100, Gionatan Danti wrote:
> >>
> >>
> >>On 03/01/2018 02:19, Dave Chinner wrote:
> >>>Cached writes smaller than a *page* will cause RMW cycles in the
> >>>page cache, regardless of the block size of the filesystem.
> >>
> >>Sure, in this case a page-sized r/m/w cycle happens in the page cache.
> >>However, it seems to me that, when flushed to disk, writes happen at
> >>block-level granularity, as you can see from tests [1,2] below.
> >>Am I wrong? Am I missing something?
> >
> >You're writing into unwritten extents. That's not a data overwrite,
> >so behaviour can be very different. And when you have sub-page block
> >sizes, the filesystem and/or page cache may decide not to read the
> >whole page if it doesn't need to immediately. e.g. you'll see
> >different behaviour between a 512 byte write() and a 512 byte write
> >via mmap()...
>
> The first "dd" execution surely writes into unwritten extents.
> However, the following writes overwrite real data, right?
Yes. But I'm talking about the initial page cache writes in your
tests, and they were all into unwritten extents. These are the
writes that had different behaviour in each test case.
The second write in each test case was the direct IO write. That's
what went over existing data, written through the page cache by the
first write. They all had the same behaviour - a single 512 byte
write - as they were all being written into allocated blocks that
contained existing data on a device with a logical sector size of
512 bytes.
> >We've been over this many times in the past few years. User data
> >alignment is controlled by the stripe unit/width specification,
> >not sector/block sizes.
>
> Sure, but to avoid/mitigate device-level r/m/w, proper alignment
> is not sufficient by itself. You should also avoid partial page
> writes.
That's an application problem, not a filesystem problem. All the
filesystem can do is align/size the data extents to match what is
optimal for the underlying storage (as we do for RAID) and hope
the application is smart enough to do large, well formed IOs to
the filesystem.
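For example (just a sketch, reusing the file from your tests), a
well-formed IO from the application side would look more like:
# a block-sized, block-aligned direct write avoids both the page cache
# RMW and the sub-block behaviour discussed earlier in this thread
dd if=/dev/urandom of=/mnt/test/test.img bs=16k count=1 oflag=direct conv=nocreat,notrunc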
> Anyway, I got the message: this is not something XFS directly
> cares about.
I think you've jumped to entirely the wrong conclusion. We do care
about it because if you can't convey/control data alignment at the
filesystem level, then you can't fully optimise IO at the
application level.
The reality is that we've been doing these sorts of data alignment
optimisations for the last 20 years with XFS and applications using
direct IO. We care an awful lot about alignment of the filesystem
structure to the underlying device characteristics because if we
don't then IO performance is extremely difficult to maximise and/or
make deterministic.
However, this is such a complex domain that very, very few people
have the knowledge and expertise to understand how to take advantage
of it fully. It's hard even to convey just how complex it is to
people without a solid knowledge base of filesystem and storage
knowledge, as this conversation shows...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Block size and read-modify-write
2018-01-03 22:59 ` Dave Chinner
@ 2018-01-04 1:38 ` Gionatan Danti
0 siblings, 0 replies; 9+ messages in thread
From: Gionatan Danti @ 2018-01-04 1:38 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, g.danti
Il 03-01-2018 23:59 Dave Chinner ha scritto:
> Yes. But I'm talking about the initial page cache writes in your
> tests, and they were all into unwritten extents. These are the
> writes that had different behaviour in each test case.
I have some difficulty grasping that. Please note that each test is a
*while loop* of the "dd" command. So, in the first test (dd with cached
writes), only the first iteration of the loop should write to an
unwritten extent; the second, third and so on should overwrite real data
(as "dd" was issued with "oflag=dsync", which immediately flushes data).
So, why did you note that "they were all into unwritten extents"? Again,
am I missing something?
> That's an application problem, not a filesystem problem. All the
> filesystem can do is align/size the data extents to match what is
> optimal for the underlying storage (as we do for RAID) and hope
> the application is smart enough to do large, well formed IOs to
> the filesystem.
You are right, I am surely approaching the issue from the wrong end...
> I think you've jumped to entirely the wrong conclusion. We do care
> about it because if you can't convey/control data alignment at the
> filesystem level, then you can't fully optimise IO at the
> application level.
Uhm no, it is my (bad) English that failed...
I fully understand XFS does a wonderful job with regard to data
alignment.
What I intended to say is that I understand it is not an XFS problem if
an application does very small writes rather than large ones.
> The reality is that we've been doing these sorts of data alignment
> optimisations for the last 20 years with XFS and applications using
> direct IO. We care an awful lot about alignment of the filesystem
> structure to the underlying device characteristics because if we
> don't then IO performance is extremely difficult to maximise and/or
> make deterministic.
>
> However, this is such a complex domain that very, very few people
> have the knowledge and expertise to understand how to take advantage
> of it fully. It's hard even to convey just how complex it is to
> people without a solid knowledge base of filesystem and storage
> knowledge, as this conversation shows...
True. I really thank you for the time spent explaining the issue.
Apart from studying the source code, are there any resources I can read
about these advanced topics?
Regards.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
end of thread, other threads:[~2018-01-04 1:38 UTC | newest]
Thread overview: 9+ messages
2017-12-28 23:14 Block size and read-modify-write Gionatan Danti
2018-01-02 10:25 ` Carlos Maiolino
2018-01-03 1:19 ` Dave Chinner
2018-01-03 8:19 ` Carlos Maiolino
2018-01-03 14:54 ` Gionatan Danti
2018-01-03 21:47 ` Dave Chinner
2018-01-03 22:09 ` Gionatan Danti
2018-01-03 22:59 ` Dave Chinner
2018-01-04 1:38 ` Gionatan Danti