* Re: dm-writecache issue
       [not found] <20180911221147.GA23308@redhat.com>
@ 2018-09-18 11:46 ` Mikulas Patocka
  2018-09-18 12:32   ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread
From: Mikulas Patocka @ 2018-09-18 11:46 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, David Teigland

I would ask the XFS developers about this - why does mkfs.xfs select 
sector size 512 by default?

If a filesystem created with the default 512-byte sector size is activated 
on a device with 4k sectors, it results in mount failure.
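
A minimal sketch of checking and avoiding this up front (the volume name is taken from David's transcript below; adjust to the actual stack):

  # Check what the block device currently advertises.
  blockdev --getss --getpbsz /dev/foo/main
  lsblk -o NAME,LOG-SEC,PHY-SEC /dev/foo/main

  # If the volume will later sit on 4k-sector storage (e.g. behind
  # dm-writecache on pmem), force a 4k sector size at mkfs time.
  mkfs.xfs -s size=4096 /dev/foo/main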

Mikulas



On Tue, 11 Sep 2018, David Teigland wrote:

> Hi Mikulas,
> 
> Am I doing something wrong below or is there a bug somewhere?  (I could be
> doing something wrong in the lvm activation code, also.)
> Thanks
> 
> 
> [root@null-05 ~]# lvs foo
>   LV   VG  Attr       LSize   
>   fast foo -wi-------  32.00m
>   main foo -wi------- 200.00m
> 
> [root@null-05 ~]# lvchange -ay foo/main
> 
> [root@null-05 ~]# mkfs.xfs /dev/foo/main
> meta-data=/dev/foo/main          isize=512    agcount=4, agsize=12800 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=0, sparse=0
> data     =                       bsize=4096   blocks=51200, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal log           bsize=4096   blocks=855, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> [root@null-05 ~]# mount /dev/foo/main /mnt
> [root@null-05 ~]# cp /root/pattern* /mnt/
> [root@null-05 ~]# umount /mnt
> [root@null-05 ~]# lvchange -an foo/main
> 
> [root@null-05 ~]# lvconvert --type writecache --cachepool fast foo/main
>   Logical volume foo/main now has write cache.
> 
> [root@null-05 ~]# lvs -a foo -o+devices
>   LV            VG  Attr       LSize   Origin        Devices       
>   [fast]        foo -wi-------  32.00m               /dev/pmem0(0) 
>   main          foo Cwi------- 200.00m [main_wcorig] main_wcorig(0)
>   [main_wcorig] foo -wi------- 200.00m               /dev/loop0(0) 
> 
> [root@null-05 ~]# lvchange -ay foo/main
> 
> [root@null-05 ~]# dmsetup ls
> foo-main_wcorig (253:4)
> rhel_null--05-home      (253:2)
> foo-main        (253:5)
> foo-fast        (253:3)
> rhel_null--05-swap      (253:1)
> rhel_null--05-root      (253:0)
> 
> [root@null-05 ~]# dmsetup table
> foo-main_wcorig: 0 409600 linear 7:0 2048
> rhel_null--05-home: 0 853286912 linear 8:2 16517120
> foo-main: 0 409600 writecache p 253:4 253:3 4096 0
> foo-fast: 0 65536 linear 259:0 2048
> rhel_null--05-swap: 0 16515072 linear 8:2 2048
> rhel_null--05-root: 0 104857600 linear 8:2 869804032
> 
> [root@null-05 ~]# mount /dev/foo/main /mnt
> mount: mount /dev/mapper/foo-main on /mnt failed: Function not implemented
> 
> [root@null-05 ~]# dmesg | tail -1
> [802734.176118] XFS (dm-5): device supports 4096 byte sectors (not 512)
> 
> [root@null-05 ~]# lvchange -an foo/main
> 
> [root@null-05 ~]# lvconvert --splitcache foo/main
>   Logical volume foo/main write cache has been detached.
> 
> [root@null-05 ~]# lvchange -ay foo/main
> 
> [root@null-05 ~]# mount /dev/foo/main /mnt
> 
> [root@null-05 ~]# diff /mnt/pattern1 /root/pattern1
> 
> [root@null-05 ~]# lvs foo
>   LV   VG  Attr       LSize   
>   fast foo -wi-------  32.00m
>   main foo -wi-ao---- 200.00m
> 


* Re: dm-writecache issue
  2018-09-18 11:46 ` dm-writecache issue Mikulas Patocka
@ 2018-09-18 12:32   ` Dave Chinner
  2018-09-18 12:48     ` Eric Sandeen
  2018-09-18 14:22     ` Mikulas Patocka
  0 siblings, 2 replies; 19+ messages in thread
From: Dave Chinner @ 2018-09-18 12:32 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Darrick J. Wong, linux-xfs, David Teigland

On Tue, Sep 18, 2018 at 07:46:47AM -0400, Mikulas Patocka wrote:
> I would ask the XFS developers about this - why does mkfs.xfs select 
> sector size 512 by default?

Because the underlying device told it that it supported a
sector size of 512 bytes?

> If a filesystem created with the default 512-byte sector size is activated 
> on a device with 4k sectors, it results in mount failure.

Yes, it does, but mkfs should also fail when it tries to write 512
byte sectors to a 4k device, too.

> On Tue, 11 Sep 2018, David Teigland wrote:
> 
> > Hi Mikulas,
> > 
> > Am I doing something wrong below or is there a bug somewhere?  (I could be
> > doing something wrong in the lvm activation code, also.)
> > Thanks
> > 
> > 
> > [root@null-05 ~]# lvs foo
> >   LV   VG  Attr       LSize   
> >   fast foo -wi-------  32.00m
> >   main foo -wi------- 200.00m
> > 
> > [root@null-05 ~]# lvchange -ay foo/main
> > 
> > [root@null-05 ~]# mkfs.xfs /dev/foo/main
> > meta-data=/dev/foo/main          isize=512    agcount=4, agsize=12800 blks
> >          =                       sectsz=512   attr=2, projid32bit=1
> >          =                       crc=1        finobt=0, sparse=0
> > data     =                       bsize=4096   blocks=51200, imaxpct=25
> >          =                       sunit=0      swidth=0 blks
> > naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> > log      =internal log           bsize=4096   blocks=855, version=2
> >          =                       sectsz=512   sunit=0 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > 
> > [root@null-05 ~]# mount /dev/foo/main /mnt
> > [root@null-05 ~]# cp /root/pattern* /mnt/
> > [root@null-05 ~]# umount /mnt
> > [root@null-05 ~]# lvchange -an foo/main
> > 
> > [root@null-05 ~]# lvconvert --type writecache --cachepool fast foo/main
> >   Logical volume foo/main now has write cache.
> > 
> > [root@null-05 ~]# lvs -a foo -o+devices
> >   LV            VG  Attr       LSize   Origin        Devices       
> >   [fast]        foo -wi-------  32.00m               /dev/pmem0(0) 
> >   main          foo Cwi------- 200.00m [main_wcorig] main_wcorig(0)
> >   [main_wcorig] foo -wi------- 200.00m               /dev/loop0(0) 

Yeehaw!

I'm betting that the underlying device advertised a logical/physical
sector size of 512 bytes to mkfs, and then adding pmem as the cache
device changed the logical volume from a 512 byte sector device to a
hard 4k sector device.

If so, this is a dm-cache bug. Filesystems don't support changing
the logical/physical sector sizes of the block device dynamically.
Filesystems lay out the filesystem structure at mkfs time based on
the assumption that the sector size of the block device is fixed and
will never change for the life of that filesystem.

Indeed, if the sector size of the block device is not fixed and can
change dynamically, then the block device also violates the
assumptions that the filesystem journalling algorithms make about
the atomic write size of the underlying device....
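
A rough way to see the mismatch described above (a sketch only; the device path is from the report, and xfs_db is used because the filesystem can no longer be mounted):

  # Sector size recorded in the XFS superblock at mkfs time.
  xfs_db -r -c 'sb 0' -c 'p sectsize' /dev/foo/main

  # Logical/physical sector sizes the stacked device advertises now.
  blockdev --getss --getpbsz /dev/foo/main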

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: dm-writecache issue
  2018-09-18 12:32   ` Dave Chinner
@ 2018-09-18 12:48     ` Eric Sandeen
  2018-09-18 14:09       ` Mikulas Patocka
  2018-09-18 14:20       ` David Teigland
  2018-09-18 14:22     ` Mikulas Patocka
  1 sibling, 2 replies; 19+ messages in thread
From: Eric Sandeen @ 2018-09-18 12:48 UTC (permalink / raw)
  To: Dave Chinner, Mikulas Patocka; +Cc: Darrick J. Wong, linux-xfs, David Teigland

On 9/18/18 7:32 AM, Dave Chinner wrote:
> On Tue, Sep 18, 2018 at 07:46:47AM -0400, Mikulas Patocka wrote:
>> I would ask the XFS developers about this - why does mkfs.xfs select 
>> sector size 512 by default?
> 
> Because the underlying device told it that it supported a
> sector size of 512 bytes?

Not only that, but it must have told us that it had a /physical/ 512 sector.
If it had even said physical/logical 4096/512, we would have chosen 4096.

Please check: what does blockdev --getpbsz --getss /dev/$FOO say at mkfs time?

>> If a filesystem created with the default 512-byte sector size is activated 
>> on a device with 4k sectors, it results in mount failure.
> 
> Yes, it does, but mkfs should also fail when it tries to write 512
> byte sectors to a 4k device, too.

...

> I'm betting that the underlying device advertised a logical/physical
> sector size of 512 bytes to mkfs, and then adding pmem as the cache
> device changed the logical volume from a 512 byte sector device to a
> hard 4k sector device.

Yup, please check blockdev --getpbsz --getss /dev/$FOO for the
devices in question during the course of this testcase.  As Dave says,
we can't mkfs a device that claims to support 512-byte IOs and then
have that granularity increase.  (And going the other way, i.e. making
512 byte IOs non-atomic may break guarantees that journaling relies on).

-Eric


* Re: dm-writecache issue
  2018-09-18 12:48     ` Eric Sandeen
@ 2018-09-18 14:09       ` Mikulas Patocka
  2018-09-18 14:16         ` Eric Sandeen
  2018-09-18 14:20       ` David Teigland
  1 sibling, 1 reply; 19+ messages in thread
From: Mikulas Patocka @ 2018-09-18 14:09 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Dave Chinner, Darrick J. Wong, linux-xfs, David Teigland



On Tue, 18 Sep 2018, Eric Sandeen wrote:

> On 9/18/18 7:32 AM, Dave Chinner wrote:
> > On Tue, Sep 18, 2018 at 07:46:47AM -0400, Mikulas Patocka wrote:
> >> I would ask the XFS developers about this - why does mkfs.xfs select 
> >> sector size 512 by default?
> > 
> > Because the underlying device told it that it supported a
> > sector size of 512 bytes?
> 
> Not only that, but it must have told us that it had a /physical/ 512 sector.
> If it had even said physical/logical 4096/512, we would have chosen 4096.
> 
> What does please check blockdev --getpbsz --getss /dev/$FOO say at mkfs time?

On SSDs, the physical sector size is not detectable - the ATA and NVME 
standards allow reporting the physical sector size, but some SSD vendors 
report it as 512 bytes despite the fact that the SSD has 4k sectors 
internally.

I tested 5 SSDs (Samsung SSD 960 EVO NVME, KINGSTON SKC1000240G NVME, 
Samsung SSD 850 EVO SATA, Crucial MX100 SATA, Intel 520 SATA) - all of 
them have 4k sectors internally (i.e. the SSDs have higher IOPS for 4k 
writes than for 2k writes), but only the Crucial SSD reports 4096 in 
/sys/block/*/queue/physical_block_size. Intel and Samsung report 512.

The SSDs use 4k sectors to reduce the size of the mapping table (hardly 
any SSD vendor would want to use real 512-byte sectors and increase the 
size of the table 8 times), and they do read-modify-write for sub-4k 
writes. So why do you want to do sub-4k writes in XFS? They are slower.

For example, the Kingston NVME SSD has 5 times lower IOPS for 2k writes 
than for 4k writes. And if I use mkfs.xfs directly on it, it selects 
sectsz=512 for both metadata and log.
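
For what it's worth, a comparison along these lines can be reproduced with fio - a sketch only; the device path is an example and the run destroys data on it:

  for bs in 512 2k 4k; do
      fio --name=randwrite-$bs --filename=/dev/nvme0n1 --direct=1 \
          --ioengine=libaio --rw=randwrite --bs=$bs --iodepth=32 \
          --runtime=30 --time_based --group_reporting
  done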

Mikulas


* Re: dm-writecache issue
  2018-09-18 14:09       ` Mikulas Patocka
@ 2018-09-18 14:16         ` Eric Sandeen
  2018-09-18 14:19           ` Eric Sandeen
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Sandeen @ 2018-09-18 14:16 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Dave Chinner, Darrick J. Wong, linux-xfs, David Teigland

On 9/18/18 9:09 AM, Mikulas Patocka wrote:
> 
> 
> On Tue, 18 Sep 2018, Eric Sandeen wrote:
> 
>> On 9/18/18 7:32 AM, Dave Chinner wrote:
>>> On Tue, Sep 18, 2018 at 07:46:47AM -0400, Mikulas Patocka wrote:
>>>> I would ask the XFS developers about this - why does mkfs.xfs select 
>>>> sector size 512 by default?
>>>
>>> Because the underlying device told it that it supported a
>>> sector size of 512 bytes?
>>
>> Not only that, but it must have told us that it had a /physical/ 512 sector.
>> If it had even said physical/logical 4096/512, we would have chosen 4096.
>>
>> What does please check blockdev --getpbsz --getss /dev/$FOO say at mkfs time?
> 
> On SSDs, physical sector size is not detectable - the ATA and NVME 
> standards allows reporting physical sector size, but some SSD vendors 
> report this as 512-bytes despite the fact that the SSD has 4k sectors 
> internally.

There's a difference between "detecting" and "observing what the
device reports."

All we have to go on is the geometry reported by the device.

# cat /sys/block/sdc/device/model 
Samsung SSD 850 
# blockdev --getpbsz --getss /dev/sdc
512
512

If the device lies to us, there's nothing to be done about it.

> I tested 5 SSDs (Samsung SSD 960 EVO NVME, KINGSTON SKC1000240G NVME, 
> Samsung SSD 850 EVO SATA, Crucial MX100 SATA, Intel 520 SATA) - all of 
> them have 4k sectors internally (i.e. the SSDs have higher IOPS for 4k 
> writes than for 2k writes), but only the Crucial SSD reports 4096 in 
> /sys/block/*/queue/physical_block_size. Intel and Samsung report 512.

Then that's what they'll get from us.

> The SSDs use 4k sectors to reduce the size of the mapping table (hardly 
> any SSD vendor would want to use real 512-byte sectors and increase the 
> size of the table 8 times) and they do read-modify-write for sub-4k 
> writes. So, why do you want to do sub-4k writes in XFS? - they are slower.
> 
> For example, the Kingston NVME SSD has 5-times lower IOPS for 2k writes 
> than for 4k writes. And if I use mkfs.xfs directly on it, it selects 
> sectsz=512 for both metadata and log.

If a device tells us that it has 512-byte sector granularity, that's all we have
to go on.  We can't second guess what the device reports to us.  How could
we do so safely?

-Eric


* Re: dm-writecache issue
  2018-09-18 14:16         ` Eric Sandeen
@ 2018-09-18 14:19           ` Eric Sandeen
  2018-09-18 14:29             ` Mikulas Patocka
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Sandeen @ 2018-09-18 14:19 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Dave Chinner, Darrick J. Wong, linux-xfs, David Teigland



On 9/18/18 9:16 AM, Eric Sandeen wrote:
> On 9/18/18 9:09 AM, Mikulas Patocka wrote:
>>
>>
>> On Tue, 18 Sep 2018, Eric Sandeen wrote:
>>
>>> On 9/18/18 7:32 AM, Dave Chinner wrote:
>>>> On Tue, Sep 18, 2018 at 07:46:47AM -0400, Mikulas Patocka wrote:
>>>>> I would ask the XFS developers about this - why does mkfs.xfs select 
>>>>> sector size 512 by default?
>>>>
>>>> Because the underlying device told it that it supported a
>>>> sector size of 512 bytes?
>>>
>>> Not only that, but it must have told us that it had a /physical/ 512 sector.
>>> If it had even said physical/logical 4096/512, we would have chosen 4096.
>>>
>>> What does please check blockdev --getpbsz --getss /dev/$FOO say at mkfs time?
>>
>> On SSDs, physical sector size is not detectable - the ATA and NVME 
>> standards allows reporting physical sector size, but some SSD vendors 
>> report this as 512-bytes despite the fact that the SSD has 4k sectors 
>> internally.
> 
> There's a difference between "detecting" and "observing what the
> device reports."
> 
> All we have to go on is the geometry reported by the device.
> 
> # cat /sys/block/sdc/device/model 
> Samsung SSD 850 
> # blockdev --getpbsz --getss /dev/sdc
> 512
> 512
> 
> If the device lies to us, there's nothing to be done about it.
> 
>> I tested 5 SSDs (Samsung SSD 960 EVO NVME, KINGSTON SKC1000240G NVME, 
>> Samsung SSD 850 EVO SATA, Crucial MX100 SATA, Intel 520 SATA) - all of 
>> them have 4k sectors internally (i.e. the SSDs have higher IOPS for 4k 
>> writes than for 2k writes), but only the Crucial SSD reports 4096 in 
>> /sys/block/*/queue/physical_block_size. Intel and Samsung report 512.

See also 
https://www.intel.com/content/www/us/en/support/articles/000006392/memory-and-storage.html
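
(For drives that expose multiple LBA formats, a similar switch can be made with plain nvme-cli instead of a vendor tool - a sketch only; the namespace path and format index are examples, and formatting destroys all data on the namespace:)

  # List the LBA formats the namespace supports (512 vs 4096 data size).
  nvme id-ns /dev/nvme0n1 -H | grep 'LBA Format'

  # Reformat the namespace to the 4096-byte format, if one is offered.
  nvme format /dev/nvme0n1 --lbaf=1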

-Eric


* Re: dm-writecache issue
  2018-09-18 12:48     ` Eric Sandeen
  2018-09-18 14:09       ` Mikulas Patocka
@ 2018-09-18 14:20       ` David Teigland
  2018-09-18 14:23         ` Eric Sandeen
  1 sibling, 1 reply; 19+ messages in thread
From: David Teigland @ 2018-09-18 14:20 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Dave Chinner, Mikulas Patocka, Darrick J. Wong, linux-xfs

On Tue, Sep 18, 2018 at 07:48:04AM -0500, Eric Sandeen wrote:
> Not only that, but it must have told us that it had a /physical/ 512 sector.
> If it had even said physical/logical 4096/512, we would have chosen 4096.

Right, just after asking I discovered that the advertised
physical_block_size changes after attaching dm-writecache:

Before:

# dmsetup ls
foo-main        (253:3)
foo-fast        (253:4)

# cat /sys/block/dm-3/queue/physical_block_size 
512
# cat /sys/block/dm-4/queue/physical_block_size 
4096

After:

# dmsetup ls
foo-main_wcorig (253:4)
foo-main        (253:5)
foo-fast        (253:3)

# cat /sys/block/dm-3/queue/physical_block_size 
4096
# cat /sys/block/dm-5/queue/physical_block_size 
4096


> What does please check blockdev --getpbsz --getss /dev/$FOO say at mkfs time?

Before:

# blockdev --getpbsz --getss /dev/foo/main
512
512

After:

# blockdev --getpbsz --getss /dev/foo/main
4096
4096


> > Yes, it does, but mkfs should also fail when it tries to write 512
> > byte sectors to a 4k device, too.

By default it uses the 4k sectors in that case:

# mkfs.xfs -f /dev/foo/main
meta-data=/dev/foo/main          isize=512    agcount=4, agsize=13056 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=52224, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=1605, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


But you are correct if I request 512 on a 4k device:

# mkfs.xfs -f -s size=512 /dev/foo/main
illegal sector size 512; hw sector is 4096

> > I'm betting that the underlying device advertised a logical/physical
> > sector size of 512 bytes to mkfs, and then adding pmem as the cache
> > device changed the logical volume from a 512 byte sector device to a
> > hard 4k sector device.

Yes

I've been dealing with this by just using -s size=4096.
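
For the record, the workaround looks roughly like this (a sketch using the volume names from the transcript above):

  # Format with 4k sectors up front so the filesystem is valid both
  # with and without the writecache attached.
  mkfs.xfs -f -s size=4096 /dev/foo/main

  lvchange -an foo/main
  lvconvert --type writecache --cachepool fast foo/main
  lvchange -ay foo/main
  mount /dev/foo/main /mnt    # succeeds with the 4k-sector cache in place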


* Re: dm-writecache issue
  2018-09-18 12:32   ` Dave Chinner
  2018-09-18 12:48     ` Eric Sandeen
@ 2018-09-18 14:22     ` Mikulas Patocka
  2018-09-18 15:33       ` Christoph Hellwig
  1 sibling, 1 reply; 19+ messages in thread
From: Mikulas Patocka @ 2018-09-18 14:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs, David Teigland



On Tue, 18 Sep 2018, Dave Chinner wrote:

> On Tue, Sep 18, 2018 at 07:46:47AM -0400, Mikulas Patocka wrote:
> > I would ask the XFS developers about this - why does mkfs.xfs select 
> > sector size 512 by default?
> 
> Because the underlying device told it that it supported a
> sector size of 512 bytes?

SSDs lie about this. They have 4k sectors internally, but report 512.

> > If a filesystem created with the default 512-byte sector size is activated 
> > on a device with 4k sectors, it results in mount failure.
> 
> Yes, it does, but mkfs should also fail when it tries to write 512
> byte sectors to a 4k device, too.
> 
> > On Tue, 11 Sep 2018, David Teigland wrote:
> > 
> > > Hi Mikulas,
> > > 
> > > Am I doing something wrong below or is there a bug somewhere?  (I could be
> > > doing something wrong in the lvm activation code, also.)
> > > Thanks
> > > 
> > > 
> > > [root@null-05 ~]# lvs foo
> > >   LV   VG  Attr       LSize   
> > >   fast foo -wi-------  32.00m
> > >   main foo -wi------- 200.00m
> > > 
> > > [root@null-05 ~]# lvchange -ay foo/main
> > > 
> > > [root@null-05 ~]# mkfs.xfs /dev/foo/main
> > > meta-data=/dev/foo/main          isize=512    agcount=4, agsize=12800 blks
> > >          =                       sectsz=512   attr=2, projid32bit=1
> > >          =                       crc=1        finobt=0, sparse=0
> > > data     =                       bsize=4096   blocks=51200, imaxpct=25
> > >          =                       sunit=0      swidth=0 blks
> > > naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> > > log      =internal log           bsize=4096   blocks=855, version=2
> > >          =                       sectsz=512   sunit=0 blks, lazy-count=1
> > > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > > 
> > > [root@null-05 ~]# mount /dev/foo/main /mnt
> > > [root@null-05 ~]# cp /root/pattern* /mnt/
> > > [root@null-05 ~]# umount /mnt
> > > [root@null-05 ~]# lvchange -an foo/main
> > > 
> > > [root@null-05 ~]# lvconvert --type writecache --cachepool fast foo/main
> > >   Logical volume foo/main now has write cache.
> > > 
> > > [root@null-05 ~]# lvs -a foo -o+devices
> > >   LV            VG  Attr       LSize   Origin        Devices       
> > >   [fast]        foo -wi-------  32.00m               /dev/pmem0(0) 
> > >   main          foo Cwi------- 200.00m [main_wcorig] main_wcorig(0)
> > >   [main_wcorig] foo -wi------- 200.00m               /dev/loop0(0) 
> 
> Yeehaw!
> 
> I'm betting that the underlying device advertised a logical/physical
> sector size of 512 bytes to mkfs, and then adding pmem as the cache
> device changed the logical volume from a 512 byte sector device to a
> hard 4k sector device.
> 
> If so, this is a dm-cache bug.

dm-writecache can run with 512-byte sectors, but it increases metadata 
overhead 8 times and degrades performance.

My question is - what's the purpose of using a filesystem with 512-byte 
sector size? Does it really improve performance?

> Filesystems don't support changing the logical/physical sector sizes of 

ext4 uses 4096 byte sectors by default (except for devices smaller than 
512MiB).

> the block device dynamically. Filesystems lay out the filesystem 
> structure at mkfs time based on the assumption that the sector size of 
> the block device is fixed and will never change for the life of that 
> filesystem.
> 
> Indeed, if the sector size of the block device is not fixed and can
> change dynamically, then the block device also violates the
> assumptions that the filesystem journalling algorithms make about
> the atomic write size of the underlying device....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Mikulas


* Re: dm-writecache issue
  2018-09-18 14:20       ` David Teigland
@ 2018-09-18 14:23         ` Eric Sandeen
  0 siblings, 0 replies; 19+ messages in thread
From: Eric Sandeen @ 2018-09-18 14:23 UTC (permalink / raw)
  To: David Teigland; +Cc: Dave Chinner, Mikulas Patocka, Darrick J. Wong, linux-xfs

On 9/18/18 9:20 AM, David Teigland wrote:
>>> I'm betting that the underlying device advertised a logical/physical
>>> sector size of 512 bytes to mkfs, and then adding pmem as the cache
>>> device changed the logical volume from a 512 byte sector device to a
>>> hard 4k sector device.
> Yes
> 
> I've been dealing with this by just using -s size=4096.
> 

So the risk here - in general - is that using a sector size larger
than the advertised physical sector size may lead to metadata IOs that
are not atomic when they are required to be so for consistency.

-Eric


* Re: dm-writecache issue
  2018-09-18 14:19           ` Eric Sandeen
@ 2018-09-18 14:29             ` Mikulas Patocka
  2018-09-18 14:36               ` Eric Sandeen
  0 siblings, 1 reply; 19+ messages in thread
From: Mikulas Patocka @ 2018-09-18 14:29 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Dave Chinner, Darrick J. Wong, linux-xfs, David Teigland



On Tue, 18 Sep 2018, Eric Sandeen wrote:

> 
> 
> On 9/18/18 9:16 AM, Eric Sandeen wrote:
> > On 9/18/18 9:09 AM, Mikulas Patocka wrote:
> >>
> >>
> >> On Tue, 18 Sep 2018, Eric Sandeen wrote:
> >>
> >>> On 9/18/18 7:32 AM, Dave Chinner wrote:
> >>>> On Tue, Sep 18, 2018 at 07:46:47AM -0400, Mikulas Patocka wrote:
> >>>>> I would ask the XFS developers about this - why does mkfs.xfs select 
> >>>>> sector size 512 by default?
> >>>>
> >>>> Because the underlying device told it that it supported a
> >>>> sector size of 512 bytes?
> >>>
> >>> Not only that, but it must have told us that it had a /physical/ 512 sector.
> >>> If it had even said physical/logical 4096/512, we would have chosen 4096.
> >>>
> >>> What does please check blockdev --getpbsz --getss /dev/$FOO say at mkfs time?
> >>
> >> On SSDs, physical sector size is not detectable - the ATA and NVME 
> >> standards allows reporting physical sector size, but some SSD vendors 
> >> report this as 512-bytes despite the fact that the SSD has 4k sectors 
> >> internally.
> > 
> > There's a difference between "detecting" and "observing what the
> > device reports."
> > 
> > All we have to go on is the geometry reported by the device.
> > 
> > # cat /sys/block/sdc/device/model 
> > Samsung SSD 850 
> > # blockdev --getpbsz --getss /dev/sdc
> > 512
> > 512
> > 
> > If the device lies to us, there's nothing to be done about it.
> > 
> >> I tested 5 SSDs (Samsung SSD 960 EVO NVME, KINGSTON SKC1000240G NVME, 
> >> Samsung SSD 850 EVO SATA, Crucial MX100 SATA, Intel 520 SATA) - all of 
> >> them have 4k sectors internally (i.e. the SSDs have higher IOPS for 4k 
> >> writes than for 2k writes), but only the Crucial SSD reports 4096 in 
> >> /sys/block/*/queue/physical_block_size. Intel and Samsung report 512.
> 
> See also 
> https://www.intel.com/content/www/us/en/support/articles/000006392/memory-and-storage.html
> 
> -Eric

And does it really support native 512-byte writes? Or does it emulate 
512-byte writes by doing read-modify-write? That would need to be 
benchmarked; the article doesn't say.

Memory is expensive, and reducing the SSD sector size increases the memory 
requirements on the SSD. I doubt that any SSD vendor would want to use 
8 times more memory just to support 512-byte sectors natively.

Mikulas


* Re: dm-writecache issue
  2018-09-18 14:29             ` Mikulas Patocka
@ 2018-09-18 14:36               ` Eric Sandeen
  2018-09-18 14:42                 ` Mikulas Patocka
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Sandeen @ 2018-09-18 14:36 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Dave Chinner, Darrick J. Wong, linux-xfs, David Teigland

On 9/18/18 9:29 AM, Mikulas Patocka wrote:

> On Tue, 18 Sep 2018, Eric Sandeen wrote:

...

>> See also 
>> https://www.intel.com/content/www/us/en/support/articles/000006392/memory-and-storage.html
>>
>> -Eric
> 
> And does it really support native 512-byte writes? Or does it emulate 
> 512-byte writes by doing read-modify-write? That needs to be benchmarked, 
> the paper doesn't say that.

Interesting from a manual tuning perspective, but not from a default
behavior perspective.

I'm just pointing out that Intel does seem to give the user a choice about
the /advertised/ geometry for some of their SSDs.

> Memory is expensive and reducing SSD sector size increases memory 
> requirement on the SSD. I doubt that any SSD vendor would want to use 
> 8-times more memory just to support 512-byte sectors natively.

Marketing decisions aside, we just can't safely ignore what the device
tells us about these IO sizes.

We have similar issues with raid devices which report nonsensical
optimal and minimum IO sizes.  If it's reporting bad info, the user
can override it, but they have to be sure to get it right.  For default
behavior, mkfs.xfs has no choice but to use what the device tells it
to use.
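
(A sketch of such an override, with made-up values - only sensible when the real geometry is actually known:)

  # Override the reported geometry explicitly at mkfs time.
  mkfs.xfs -f -s size=4096 -d su=64k,sw=4 /dev/md0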

-Eric


* Re: dm-writecache issue
  2018-09-18 14:36               ` Eric Sandeen
@ 2018-09-18 14:42                 ` Mikulas Patocka
  2018-09-18 15:04                   ` Eric Sandeen
  0 siblings, 1 reply; 19+ messages in thread
From: Mikulas Patocka @ 2018-09-18 14:42 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Dave Chinner, Darrick J. Wong, linux-xfs, David Teigland



On Tue, 18 Sep 2018, Eric Sandeen wrote:

> On 9/18/18 9:29 AM, Mikulas Patocka wrote:
> 
> > On Tue, 18 Sep 2018, Eric Sandeen wrote:
> 
> ...
> 
> >> See also 
> >> https://www.intel.com/content/www/us/en/support/articles/000006392/memory-and-storage.html
> >>
> >> -Eric
> > 
> > And does it really support native 512-byte writes? Or does it emulate 
> > 512-byte writes by doing read-modify-write? That needs to be benchmarked, 
> > the paper doesn't say that.
> 
> Interesting from a manual tuning perspective, but not from a default
> behavior perspective.
> 
> I'm just pointing out that Intel does seem to give the user a choice about
> the /advertised/ geometry for some of their SSDs.
> 
> > Memory is expensive and reducing SSD sector size increases memory 
> > requirement on the SSD. I doubt that any SSD vendor would want to use 
> > 8-times more memory just to support 512-byte sectors natively.
> 
> Marketing decisions aside, we just can't safely ignore what the device
> tells us about these IO sizes.

No one is forcing you to use 512-byte writes. You can use 4k writes on a 
device that advertises 512-byte sectors.

ext4 uses a 4k block size by default (and lets the user lower it if they 
are tight on disk space and don't care about performance).

Mikulas


* Re: dm-writecache issue
  2018-09-18 14:42                 ` Mikulas Patocka
@ 2018-09-18 15:04                   ` Eric Sandeen
  2018-09-18 15:27                     ` Eric Sandeen
  2018-09-18 17:15                     ` Mikulas Patocka
  0 siblings, 2 replies; 19+ messages in thread
From: Eric Sandeen @ 2018-09-18 15:04 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Dave Chinner, Darrick J. Wong, linux-xfs, David Teigland

On 9/18/18 9:42 AM, Mikulas Patocka wrote:
> 
> 
> On Tue, 18 Sep 2018, Eric Sandeen wrote:
> 
>> On 9/18/18 9:29 AM, Mikulas Patocka wrote:
>>
>>> On Tue, 18 Sep 2018, Eric Sandeen wrote:
>>
>> ...
>>
>>>> See also 
>>>> https://www.intel.com/content/www/us/en/support/articles/000006392/memory-and-storage.html
>>>>
>>>> -Eric
>>>
>>> And does it really support native 512-byte writes? Or does it emulate 
>>> 512-byte writes by doing read-modify-write? That needs to be benchmarked, 
>>> the paper doesn't say that.
>>
>> Interesting from a manual tuning perspective, but not from a default
>> behavior perspective.
>>
>> I'm just pointing out that Intel does seem to give the user a choice about
>> the /advertised/ geometry for some of their SSDs.
>>
>>> Memory is expensive and reducing SSD sector size increases memory 
>>> requirement on the SSD. I doubt that any SSD vendor would want to use 
>>> 8-times more memory just to support 512-byte sectors natively.
>>
>> Marketing decisions aside, we just can't safely ignore what the device
>> tells us about these IO sizes.
> 
> No one is forcing you to use 512-byte writes. You can use 4k writes on a 
> device that advertises 512-byte sectors.

Of course.  But not if you require those 4k writes to be /atomic/.
 
> ext4 uses 4k block size by default (and lets the user lower it if the user 
> is tight on disk space and doesn't care about performance).

I think you may be conflating sector size with filesystem block size.

ext4 makes no distinction between the two.

XFS has both sector size (metadata atomic IO unit) and filesystem block size
(file data allocation unit) as configurable mkfs-time options. The sector size
can be smaller than, and up to, the filesystem block size.

mkfs.xfs defaults to 4k filesystem blocks and device-physical-sector-sized
sectors, i.e. the largest atomic IO the device advertises, because XFS
metadata journaling relies on this IO atomicity.  We allocate file data in
4k chunks, and do atomic metadata IO in device-sector-sized chunks.
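
(As a sketch, both knobs can be set and verified explicitly - the device path is an example:)

  mkfs.xfs -f -b size=4096 -s size=4096 /dev/foo/main
  # Confirm what was written into the superblock.
  xfs_db -r -c 'sb 0' -c 'p blocksize sectsize' /dev/foo/main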

ext4 doesn't - it's true - but I cannot help but believe that ext4 occasionally
gets harmed by this choice, because it's absolutely possible that a 4k
metadata write gets only partly-persisted if power fails on a 512/512 disk,
for example.  In practice it seems to generally work out ok, but it is going
beyond what the device says it can guarantee.

-Eric


* Re: dm-writecache issue
  2018-09-18 15:04                   ` Eric Sandeen
@ 2018-09-18 15:27                     ` Eric Sandeen
  2018-09-18 15:29                       ` Christoph Hellwig
  2018-09-18 17:15                     ` Mikulas Patocka
  1 sibling, 1 reply; 19+ messages in thread
From: Eric Sandeen @ 2018-09-18 15:27 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Dave Chinner, Darrick J. Wong, linux-xfs, David Teigland

On 9/18/18 10:04 AM, Eric Sandeen wrote:
> I cannot help but believe that ext4 occasionally
> gets harmed by this choice, because it's absolutely possible that a 4k
> metadata write gets only partly-persisted if power fails on a 512/512 disk,
> for example.  In practice it seems to generally work out ok, but it is going
> beyond what the device says it can guarantee.

(This may be a bit uninformed on my part; Darrick reminds me that jbd2's
careful use of cache flushing & FUA probably means that it won't get burned
by a partial 4k metadata update if the power fails.)

I'll stop for now and see if Dave wants to chime in on xfs's reliance on
the actual atomic IO size for metadata IO.  ;)

Thanks,
-Eric 


* Re: dm-writecache issue
  2018-09-18 15:27                     ` Eric Sandeen
@ 2018-09-18 15:29                       ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2018-09-18 15:29 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Mikulas Patocka, Dave Chinner, Darrick J. Wong, linux-xfs,
	David Teigland

On Tue, Sep 18, 2018 at 10:27:20AM -0500, Eric Sandeen wrote:
> (this may be a bit uninformed on my part, Darrick reminds me that jbd2's
> careful use of cache flushing & FUA probably means that it won't get burned
> by a partial 4k metadata update if the power fails.)

cache flushing and FUA is never going to help you with torn writes.

> 
> I'll stop for now and see if Dave wants to chime in on xfs's reliance on
> the actual atomic IO size for metadata IO.  ;)

What helps today's XFS (and probably ext4) is checksums on all metadata
blocks, as that way we can check if we wrote the full "block".  This
doesn't help applications that rely on the sector size unless they have
similar protections of their own.


* Re: dm-writecache issue
  2018-09-18 14:22     ` Mikulas Patocka
@ 2018-09-18 15:33       ` Christoph Hellwig
  2018-09-18 17:39         ` Mikulas Patocka
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2018-09-18 15:33 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Dave Chinner, Darrick J. Wong, linux-xfs, David Teigland

On Tue, Sep 18, 2018 at 10:22:15AM -0400, Mikulas Patocka wrote:
> > On Tue, Sep 18, 2018 at 07:46:47AM -0400, Mikulas Patocka wrote:
> > > I would ask the XFS developers about this - why does mkfs.xfs select 
> > > sector size 512 by default?
> > 
> > Because the underlying device told it that it supported a
> > sector size of 512 bytes?
> 
> SSDs lie about this. They have 4k sectors internally, but report 512.

SSDs can't lie about the sector size because they don't even have
sectors in the disk sense; they have program and erase block sizes,
and some kind of FTL granularity (think of it like a filesystem block
size - even a filesystem with a 4k block size can do smaller writes with
read-modify-write cycles, and so can SSDs).

SSDs can just properly implement the guarantees they inherited from
disks by other means.  So if an SSD claims it supports 512-byte blocks,
it had better deal with them atomically.  If they have issues in that
area (like Intel did recently, corrupting data left, right and center
when you actually did 512-byte writes), they are simply buggy.

SATA and SAS SSDs can always use the same trick as modern disks:
support 512-byte access where it is really needed (e.g. BIOS and legacy
OSes), but use the physical block exponent to give modern OSes a strong
hint that they don't want it actually used.  NVMe doesn't have anything
like that yet, but we are working on something similar in the NVMe TWG.


* Re: dm-writecache issue
  2018-09-18 15:04                   ` Eric Sandeen
  2018-09-18 15:27                     ` Eric Sandeen
@ 2018-09-18 17:15                     ` Mikulas Patocka
  1 sibling, 0 replies; 19+ messages in thread
From: Mikulas Patocka @ 2018-09-18 17:15 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Dave Chinner, Darrick J. Wong, linux-xfs, David Teigland



On Tue, 18 Sep 2018, Eric Sandeen wrote:

> > is tight on disk space and doesn't care about performance).
> 
> I think you may be conflating sector size with filesystem block size.
> 
> ext4 makes no distinction between the two.
> 
> XFS has both sector size (metadata atomic IO unit) and filesystem block size
> (file data allocation unit) as configurable mkfs-time options. The sector size
> can be smaller than, and up to, the filesystem block size.
> 
> mkfs.xfs defaults to 4k filesystem blocks and device-physical-sector-sized
> sectors, i.e. the largest atomic IO the device advertises, because XFS
> metadata journaling relies on this IO atomicity.  We allocate file data in
> 4k chunks, and do atomic metadata IO in device-sector-sized chunks.

You can have 512-byte metadata sectors and still read and write them in 
4k chunks (so that you avoid the read-modify-write logic in the SSDs). If 
data blocks are allocated on a 4k boundary, there's no risk of 
metadata-vs-data buffer races.

> ext4 doesn't - it's true - but I cannot help but believe that ext4 occasionally
> gets harmed by this choice, because it's absolutely possible that a 4k
> metadata write gets only partly-persisted if power fails on a 512/512 disk,
> for example.  In practice it seems to generally work out ok, but it is going
> beyond what the device says it can guarantee.
> 
> -Eric

I implemented the journal in the dm-integrity driver and solved this
partial-write problem by tagging every 512-byte journal sector with an
8-byte tag. If the tags don't match, there was a power failure during the
write and the partially written journal section will not be replayed. The
journal is written using 4k-aligned writes because they perform better.

ext4 solves this problem by using checksums.

Mikulas


* Re: dm-writecache issue
  2018-09-18 15:33       ` Christoph Hellwig
@ 2018-09-18 17:39         ` Mikulas Patocka
  2018-09-18 22:52           ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread
From: Mikulas Patocka @ 2018-09-18 17:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Darrick J. Wong, linux-xfs, David Teigland



On Tue, 18 Sep 2018, Christoph Hellwig wrote:

> On Tue, Sep 18, 2018 at 10:22:15AM -0400, Mikulas Patocka wrote:
> > > On Tue, Sep 18, 2018 at 07:46:47AM -0400, Mikulas Patocka wrote:
> > > > I would ask the XFS developers about this - why does mkfs.xfs select 
> > > > sector size 512 by default?
> > > 
> > > Because the underlying device told it that it supported a
> > > sector size of 512 bytes?
> > 
> > SSDs lie about this. They have 4k sectors internally, but report 512.
> 
> SSDs can't lie about the sector size because they don't even have
> sectors in the disk sense, they have program and erase block size,

They have a remapping table that maps each 4k block to a location on the 
NAND flash.

> and some kind of FTL granularity (think of it like a file system block
> size - even a 4k block size file can do smaller writes with
> read-modify-write cycles, so can SSDs).
> 
> SSDs can just properly implement the guarantees they inherited from
> disk by other means.  So if an SSD claims it supports 512 byte blocks
> it better can deal with them atomically.  If they have issues in that
> area (like Intel did recently where they corrupted data left right
> and center if you actually did 512byte writes) they are simply buggy.
> 
> SATA and SAS SSDs can always use the same trick as modern disks to
> support 512 byte access where really needed (e.g. BIOS and legacy
> OSes) but give a strong hint to modern OSes that they don't want that
> to be actually used with the physical block exponent.  NVMe doesn't
> have anything like that yet, but we are working on something like
> that in the NVMe TWG.

The question is - why do you want to use 512-byte writes if they perform 
badly? For example, the Kingston NVME SSD has 242k IOPS for 4k writes and 
45k IOPS for 2k writes.

The same problem exists with dm-writecache - it can run with 512-byte 
sectors, but there's unnecessary overhead. It would be much better if XFS 
did 4k-aligned writes.


You can do 4k writes and assume that only 512-byte units are written 
atomically - that would be safe for old 512-byte-sector disks and it 
wouldn't degrade performance on SSDs.

If the XFS data blocks are aligned on a 4k boundary, you can do 4k-aligned 
I/Os on the metadata as well. You could allocate metadata in 512-byte 
quantities and still do 4k reads and writes on them. You would 
over-read and over-write a bit, but it would perform better because it 
avoids the read-modify-write logic in the SSD.

I'm not an expert in the XFS journal - but could the journal writes just 
be padded to a 4k boundary (even if the journal space is allocated in 
512-byte quantities)?

Mikulas


* Re: dm-writecache issue
  2018-09-18 17:39         ` Mikulas Patocka
@ 2018-09-18 22:52           ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2018-09-18 22:52 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, David Teigland

On Tue, Sep 18, 2018 at 01:39:53PM -0400, Mikulas Patocka wrote:
> On Tue, 18 Sep 2018, Christoph Hellwig wrote:
> > On Tue, Sep 18, 2018 at 10:22:15AM -0400, Mikulas Patocka wrote:
> > > > On Tue, Sep 18, 2018 at 07:46:47AM -0400, Mikulas Patocka wrote:
> > SATA and SAS SSDs can always use the same trick as modern disks to
> > support 512 byte access where really needed (e.g. BIOS and legacy
> > OSes) but give a strong hint to modern OSes that they don't want that
> > to be actually used with the physical block exponent.  NVMe doesn't
> > have anything like that yet, but we are working on something like
> > that in the NVMe TWG.
> 
> The question is - why do you want to use 512-byte writes if they perform 
> badly? For example, the Kingston NVME SSD has 242k IOPS for 4k writes and 
> 45k IOPS for 2k writes.

We have known all about this sort of thing for a long, long while (e.g.
the very first Fusion-io hardware behaved like this).  We also know
that various Intel SSDs have the same problems(*), not to mention
XFS being blamed for the single-sector-write data corruption
problems they had (which Christoph has already mentioned).

That said, we /rarely/ do 512 byte metadata IOs in active XFS
filesystems.  Single sector metadata are used only for static
metadata - the allocation group headers. They were designed as
single sector objects 25 years ago so they could be written
atomically and could be relied on to be intact after power failure
events.

In reality, we read these objects once into memory when they are
first accessed, and we only write them when the filesystem needs
journal space or goes idle, both of which may be "never" because
active metadata is just continually relogged and never rewritten in
place until the workload idles.

IOWs arguments about 512 byte sector metadata writes causing
performance issues are irrelevant to the discussion here because
they just don't.

> The same problem is with dm-writecache - it can run with 512-byte sectors, 
> but there's unnecessary overhead. It would be much better if XFS did 4k 
> aligned writes.
> 
> 
> You can do 4k writes and assume that only 512-byte units are written 
> atomically - that would be safe for old 512-byte sector disk and it 
> wouldn't degrade performance on SSDs.
> 
> If the XFS data blocks are aligned on 4k boundary, you can do 4k-aligned 
> I/Os on the metadata as well.

Yup, that's how XFS was designed way back when.  All the dynamically
allocated metadata in XFS (i.e. everything but the AG header blocks)
is filesystem block sized (or larger) and at least filesystem block
aligned. So if you use the default 4k filesystem block size, all the
dynamic metadata IO (dirs, btrees, inodes, etc) is at least 4k
aligned and sized.

This is all really basic filesystem design stuff - structures
critical to recovery and repair need to be robust against torn
writes (i.e. single sectors), while dynamically allocated run-time
structures (data or metadata) are all allocated in multiples of the
fundamental unit of space accounting (i.e. filesystem blocks).

> You could allocate metadata in 512-byte 
> quantities and you could do 4k reads and writes on them. You would 
> over-read and over-write a bit, but it will perform better due to avoiding 
> the read-modify-write logic in the SSD.

No, we can't do that. The transaction model requires independent IO
and locking for each of those AG headers. Trying to stuff them all
into one IO and/or buffer will cause deadlocks and lock contention
problems.

> I'm not an expert in the XFS journal - but could the journal writes be 
> just padded to 4k boundary (even if the journal space is allocated in 
> 512-byte quantities)?

<sigh>

What do you think the log stripe unit configuration option does?
Yup, it pads the log writes to whatever block size is specified.
We've had that functionality since around 2005. The issue here is
that the device told us "512 byte sectors" and no minimum/optimal IO
sizes, so mkfs doesn't add any padding by default.
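
(As an illustration - the device path and value are examples - the padding can be requested explicitly when the device doesn't report useful IO sizes:)

  # Pad and align log writes to 4k even on a device that advertises
  # 512-byte sectors.
  mkfs.xfs -f -l su=4096 /dev/foo/main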

Regardless, a log write is rarely a single sector - a log write
header is a complete sector, maybe 2 depending on the iclogbuf size,
which is then followed by the metadata that needs checkpointing.
The typical log IO is an entire iclogbuf, which defaults to 32k and
can be configured up to 256k. These writes have lsunit padding and
alignment (which may be single sector), but these large writes
didn't have any protection against tearing. Hence we still have to
ensure individual sectors in each write are intact during recovery.

Yes, we now have CRCs over entire log writes that catch these large
write tears and other IO errors, but that doesn't change the underlying
on-disk format or the algorithms that are used.

And that's the issue: the XFS journal is a sector based device, not
a filesystem block based device. It's on-disk format was designed
around the fact that torn writes do not occur inside a single
sector. Hence the log scan and search algorithms assume that it can
read and write individual sectors in the device, and that if the
per-sector log header has the correct sequence number stamped in it
then the entire sector has valid contents.

The long and short of all this is that block devices cannot
dynamically change their advertised sector size. Filesystems have
been designed for 40+ years around the assumption that the physical
sector size of the underlying storage device is fixed for the life
of the device. DM is just going to have to reject attempts to build
dynamically layered devices with incompatible sector size
requirements.
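
(A sketch of the kind of check such a rejection could be based on, with device paths assumed from the earlier transcript:)

  origin=/dev/loop0    # device backing the existing filesystem
  cache=/dev/pmem0     # candidate cache device
  if [ "$(blockdev --getss $cache)" -gt "$(blockdev --getss $origin)" ]; then
      echo "refusing: attaching $cache would raise the logical sector size" >&2
      exit 1
  fi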

Cheers,

Dave.

(*) Some recent high-end Intel enterprise SSDs had bath-tub-curve
performance problems with IO that wasn't 128k aligned(**). I'm
guessing that the internal page (flash erase block) size was 128k,
and they didn't handle sub-page or cross-page IO at all well. I'm not
sure if there were ever firmware updates to fix that.

(**) yes, filesystem developers see all sorts of whacky performance
problems with storage. That's because everyone blames the messenger
for storage problems (i.e. the filesystem), so we're the lucky
people who get to triage them and determine where the problem really
lies.

-- 
Dave Chinner
david@fromorbit.com

