linux-xfs.vger.kernel.org archive mirror
From: Gionatan Danti <g.danti@assyoma.it>
To: Dave Chinner <david@fromorbit.com>, linux-xfs@vger.kernel.org
Cc: Gionatan Danti <g.danti@assyoma.it>
Subject: Re: Block size and read-modify-write
Date: Wed, 3 Jan 2018 15:54:42 +0100	[thread overview]
Message-ID: <b58a3a90-0e7b-abca-91ce-8b2d8819a75b@assyoma.it> (raw)
In-Reply-To: <20180103011926.GJ5858@dastard>



On 03/01/2018 02:19, Dave Chinner wrote:
> Cached writes smaller than a *page* will cause RMW cycles in the
> page cache, regardless of the block size of the filesystem.

Sure, in this case a page-sized r/m/w cycle happens in the pagecache.
However, it seems to me that, when flushed to disk, writes happen at
filesystem block granularity, as you can see from tests [1,2] below.
Am I missing something?
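
To make the expectation concrete, here is a minimal sketch of the rounding arithmetic behind the numbers seen in tests [1,2]. It only models the observed I/O sizes and is not how the kernel implements writeback; the function name rmw_span is made up for illustration.

```shell
#!/bin/sh
# rmw_span OFFSET LENGTH BLOCKSIZE   (hypothetical helper, illustration only)
# Prints "START END SIZE": the byte range, rounded out to BLOCKSIZE
# boundaries, that contains a write of LENGTH bytes at OFFSET.
rmw_span() {
    offset=$1 length=$2 bs=$3
    start=$(( offset / bs * bs ))                      # round start down
    end=$(( (offset + length + bs - 1) / bs * bs ))    # round end up
    echo "$start $end $(( end - start ))"
}

rmw_span 0 512 4096    # 4K filesystem blocks: prints "0 4096 4096"
rmw_span 0 512 1024    # 1K filesystem blocks: prints "0 1024 1024"
```

The third field matches the 4096B and 1024B read/write pairs reported by dstat for the buffered writes.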

> Ok, there is a difference between *sector size* and *filesystem
> block size*. You seem to be using them interchangably in your
> question, and that's not correct.

True, maybe I have trouble grasping the concept of sector size from
XFS's point of view. I understand sector size as a hardware property of
the underlying block device, but how does it relate to the filesystem?

I naively supposed that an XFS filesystem created with a 4k *sector*
size (i.e. mkfs.xfs -s size=4096) would prevent 512-byte O_DIRECT
writes, but my test [3] shows that even on such a filesystem a 512B
direct write is indeed possible.

Is the sector size information used only by XFS's own metadata and
journaling, in order to avoid costly device-level r/m/w cycles on 512e
devices? I understand that on a 4Kn device you *have* to avoid
sub-sector writes, or the transfer will fail.
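
For reference, the 512n/512e/4Kn terms used here can be derived from the logical and physical sector sizes a device reports. The following is a small sketch; the function name sector_format is an assumption for illustration, and only the three common combinations are handled.

```shell
#!/bin/sh
# sector_format LOGICAL PHYSICAL   (hypothetical helper, illustration only)
# Maps a logical/physical sector size pair to the usual format name.
sector_format() {
    case "$1/$2" in
        512/512)   echo 512n ;;     # native 512B sectors
        512/4096)  echo 512e ;;     # 4K physical sectors, 512B emulation
        4096/4096) echo 4Kn ;;      # native 4K sectors
        *)         echo unknown ;;
    esac
}

# On a real system the two values can be read with blockdev, e.g.:
#   sector_format "$(blockdev --getss /dev/sda3)" "$(blockdev --getpbsz /dev/sda3)"
sector_format 512 512     # prints "512n"
sector_format 512 4096    # prints "512e"
```

With the values shown in the block device properties of the test output (512 logical, 512 physical), the test partition would be classified as 512n.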

> 
> .... this is not correct for direct IO. The direct IO path does not
> do RMW cycles at all.
> 
> Put simply: a 512B DIO write on a (real or emulated) 512B sector
> device with a 4k FSB will be serialised by the filesystem and do a
> single 512B sector write to the device.  However, if the device
> reports as a 4k sector device then a 512B DIO write will be rejected
> by the filesystem because sub-sector IO is not possible.

Ok, this was as expected.
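
The rule quoted above can be sketched as a simple alignment check against the sector size the device reports. This is a simplified model under stated assumptions: it ignores memory buffer alignment and any other checks the kernel performs, and the function name dio_allowed is made up for illustration.

```shell
#!/bin/sh
# dio_allowed OFFSET LENGTH SECTOR   (hypothetical helper, illustration only)
# Models only the rule that a direct write must start and end on a
# boundary of the sector size reported by the device.
dio_allowed() {
    if [ $(( $1 % $3 )) -eq 0 ] && [ $(( $2 % $3 )) -eq 0 ]; then
        echo accepted
    else
        echo rejected
    fi
}

dio_allowed 0 512 512     # 512B sector device: prints "accepted"
dio_allowed 0 512 4096    # 4k sector device:   prints "rejected"
```

Test [3] below is consistent with the first case: the device reports 512B sectors, so the 512B direct write goes through even though the filesystem was made with -s size=4096.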

I want to give some context for the original question, and explain why
I am so interested in r/m/w cycles. SSD flash-page sizes have, in recent
years (2014+), ballooned to 8/16/32K. I wonder if a matching block size
and/or sector size is needed to avoid (some of) the device-level r/m/w
cycles, which can dramatically increase flash write amplification (and
so reduce endurance).
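
As a rough illustration of the concern, the worst-case amplification for an aligned write smaller than a flash page is just the ratio of the two sizes. This is a back-of-the-envelope sketch that assumes every touched flash page is reprogrammed in full; it ignores write coalescing, caching and garbage collection in the drive's firmware, so real figures will differ. The function name worst_case_waf is made up for illustration.

```shell
#!/bin/sh
# worst_case_waf WRITE_BYTES FLASH_PAGE_BYTES   (hypothetical helper)
# Integer worst-case ratio of bytes programmed to bytes written, assuming
# an aligned write and that every touched flash page is rewritten in full.
worst_case_waf() {
    pages=$(( ($1 + $2 - 1) / $2 ))    # flash pages touched by the write
    echo $(( pages * $2 / $1 ))
}

for kib in 8 16 32; do
    echo "${kib}K page, 4K write: $(worst_case_waf 4096 $(( kib * 1024 )))x"
done
```

Under these assumptions a 4K write costs 2x, 4x and 8x on 8K, 16K and 32K flash pages respectively.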

Thanks.


------ test output below ------

# Block device properties
[root@blackhole queue]# blockdev --getss --getpbsz --getiomin --getbsz \
    /dev/sda3
512
512
512
4096

[1] # XFS with blocksize=4K and sectorsize=512B (default)
[root@blackhole queue]# mkfs.xfs /dev/sda3
meta-data=/dev/sda3              isize=512    agcount=4, agsize=65536 blks
          =                       sectsz=512   attr=2, projid32bit=1
          =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=262144, imaxpct=25
          =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
          =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@blackhole queue]# mount /dev/sda3 /mnt/test/
# Preallocate file to minimize metadata traffic
[root@blackhole test]# fallocate /mnt/test/test.img -l 256M
# Write 512B via pagecache
[root@blackhole test]# while true; do echo 3 > /proc/sys/vm/drop_caches; \
    dd if=/dev/urandom of=/mnt/test/test.img bs=512 count=1 oflag=dsync \
    conv=nocreat,notrunc; sleep 1; done
# Dstat results: 4K reads/writes (read-modify-write)
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
  read  writ
4096B 4096B
4096B 4096B
4096B 4096B
# Write 512B via O_DIRECT
[root@blackhole test]# while true; do echo 3 > /proc/sys/vm/drop_caches; \
    dd if=/dev/urandom of=/mnt/test/test.img bs=512 count=1 oflag=dsync \
    conv=nocreat,notrunc oflag=direct; sleep 1; done
# Dstat results: 512B writes
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
  read  writ
    0   512B
    0   512B
    0   512B

[2] # XFS with blocksize=1K and sectorsize=512B
[root@blackhole mnt]# umount /mnt/test/
[root@blackhole mnt]# mkfs.xfs /dev/sda3 -f -b size=1024
meta-data=/dev/sda3              isize=512    agcount=4, agsize=262144 blks
          =                       sectsz=512   attr=2, projid32bit=1
          =                       crc=1        finobt=0, sparse=0
data     =                       bsize=1024   blocks=1048576, imaxpct=25
          =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=1024   blocks=10240, version=2
          =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@blackhole mnt]# mount /dev/sda3 /mnt/test/
# Preallocate file to minimize metadata traffic
[root@blackhole mnt]# fallocate /mnt/test/test.img -l 256M
# Write 512B via pagecache
[root@blackhole mnt]# while true; do echo 3 > /proc/sys/vm/drop_caches; \
    dd if=/dev/urandom of=/mnt/test/test.img bs=512 count=1 oflag=dsync \
    conv=nocreat,notrunc; sleep 1; done
# Dstat results: 1K reads/writes (read-modify-write)
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
  read  writ
1024B 1024B
1024B 1024B
1024B 1024B
# Write 512B via O_DIRECT
while true; do echo 3 > /proc/sys/vm/drop_caches; dd if=/dev/urandom \
    of=/mnt/test/test.img bs=512 count=1 oflag=dsync conv=nocreat,notrunc \
    oflag=direct; sleep 1; done
# Dstat results: 512B writes
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
  read  writ
    0   512B
    0   512B
    0   512B

[3] # XFS with blocksize=4K and sectorsize=4K
[root@blackhole mnt]# umount /mnt/test/
[root@blackhole mnt]# mkfs.xfs /dev/sda3 -f -s size=4096
meta-data=/dev/sda3              isize=512    agcount=4, agsize=65536 blks
          =                       sectsz=4096  attr=2, projid32bit=1
          =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=262144, imaxpct=25
          =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
          =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@blackhole mnt]# mount /dev/sda3 /mnt/test/
# Preallocate file to minimize metadata traffic
[root@blackhole mnt]# fallocate /mnt/test/test.img -l 256M
# Write 512B via pagecache
[root@blackhole mnt]# while true; do echo 3 > /proc/sys/vm/drop_caches; \
    dd if=/dev/urandom of=/mnt/test/test.img bs=512 count=1 oflag=dsync \
    conv=nocreat,notrunc; sleep 1; done
# Dstat results: 4K reads/writes (read-modify-write)
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
  read  writ
4096B 4096B
4096B 4096B
4096B 4096B
# Write 512B via O_DIRECT
[root@blackhole mnt]# while true; do echo 3 > /proc/sys/vm/drop_caches; \
    dd if=/dev/urandom of=/mnt/test/test.img bs=512 count=1 oflag=dsync \
    conv=nocreat,notrunc oflag=direct; sleep 1; done
# Dstat results: 512B writes
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
  read  writ
    0   512B
    0   512B
    0   512B

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
