Using fio for testing for SMR


* Using fio for testing for SMR
@ 2020-09-05 13:30 Ian S. Worthington
  2020-09-06 14:13 ` Sitsofe Wheeler
  2020-09-07  1:38 ` Damien Le Moal
  0 siblings, 2 replies; 9+ messages in thread
From: Ian S. Worthington @ 2020-09-05 13:30 UTC (permalink / raw)
  To: fio

I'm trying to establish if a new disk is SMR or not, or has any other
characteristics that would make it unsuitable for use in a zfs array.

CrystalDiskMark suggests it has a speed of 6~8 MB/s in its RND4K testing.

iiuc SMR disks contain a CMR area, possibly of variable size, which is used as
a cache, so to test a drive I need to ensure I fill this cache so the drive is
forced to start shingling.

As the disk is 14TB, my first test used:

sudo fio --name TEST --eta-newline=5s --filename=/dev/sda --rw=randwrite
--size=100t --io_size=14t  --ioengine=libaio --iodepth=1 --direct=1
--numjobs=1 --runtime=10h --group_reporting

which reported:

TEST: (groupid=0, jobs=1): err= 0: pid=4685: Sat Sep  5 07:42:02 2020
  write: IOPS=490, BW=1962KiB/s (2009kB/s)(67.4GiB/36000002msec); 0 zone
resets
    slat (usec): min=16, max=10242, avg=41.02, stdev=11.10
    clat (usec): min=17, max=371540, avg=1980.75, stdev=1016.94
     lat (usec): min=283, max=371587, avg=2024.00, stdev=1016.92
    clat percentiles (usec):
     |  1.00th=[  486],  5.00th=[  594], 10.00th=[ 1074], 20.00th=[ 1418],
     | 30.00th=[ 1565], 40.00th=[ 1713], 50.00th=[ 1876], 60.00th=[ 2040],
     | 70.00th=[ 2245], 80.00th=[ 2474], 90.00th=[ 2933], 95.00th=[ 3589],
     | 99.00th=[ 4686], 99.50th=[ 5211], 99.90th=[ 8356], 99.95th=[11863],
     | 99.99th=[21627]
   bw (  KiB/s): min=  832, max= 7208, per=100.00%, avg=1961.66, stdev=105.29,
samples=72000
   iops        : min=  208, max= 1802, avg=490.40, stdev=26.31, samples=72000

I have a number of concerns about this test:

1. Why is the average speed, 2MB/s, so much lower than that reported by
CrystalDiskMark?

2. After running for 10 hours, only 67 GiB were written.  This could easily
not yet have filled any CMR cache on a SMR disk, rendering the test
worthless.

I then ran some 5m tests, using different blocksizes in the command

sudo fio --name TEST --eta-newline=5s --filename=/dev/sda --rw=randwrite
--size=100t --io_size=14t  --ioengine=libaio --iodepth=1 --direct=1
--numjobs=1 --runtime=5m --group_reporting --blocksize=xxx

with the result:

blksize speed(MB/s) IOPS
  4k        2        490
  1M      100         97
 10M      130         12
100M      160        1~2
  1G      160          -

3. I'm considering running a dual test, where I first write, say 10TB data
with a blocksize of 1M (28 hours), followed by 10 hours of 4k writes again. 
Although the 1M block contents will be sequential data, can I assume that
enough of them will go via any CMR cache in order to fill it up and reveal any
slow down?


* Re: Using fio for testing for SMR
  2020-09-05 13:30 Using fio for testing for SMR Ian S. Worthington
@ 2020-09-06 14:13 ` Sitsofe Wheeler
  2020-09-07  1:38 ` Damien Le Moal
  1 sibling, 0 replies; 9+ messages in thread
From: Sitsofe Wheeler @ 2020-09-06 14:13 UTC (permalink / raw)
  To: Ian S. Worthington; +Cc: fio

On Sat, 5 Sep 2020 at 14:40, Ian S. Worthington <ianworthington@usa.net> wrote:
>
> I'm trying to establish if a new disk is SMR or not, or has any other
> characteristics that would make it unsuitable for use in a zfs array.
>
> CrystalDiskMark suggests it has a speed of 6~8 MB/s in its RND4K testing.
>
> iiuc SMR disks contain a CMR area, possibly of variable size, which is used as
> a cache, so to test a drive I need to ensure I fill this cache so the drive is
> forced to start shingling.
>
> As the disk is 14TB, my first test used:
>
> sudo fio --name TEST --eta-newline=5s --filename=/dev/sda --rw=randwrite
> --size=100t --io_size=14t  --ioengine=libaio --iodepth=1 --direct=1
> --numjobs=1 --runtime=10h --group_reporting
>
> which reported:
>
> TEST: (groupid=0, jobs=1): err= 0: pid=4685: Sat Sep  5 07:42:02 2020
>   write: IOPS=490, BW=1962KiB/s (2009kB/s)(67.4GiB/36000002msec); 0 zone
> resets
>     slat (usec): min=16, max=10242, avg=41.02, stdev=11.10
>     clat (usec): min=17, max=371540, avg=1980.75, stdev=1016.94
>      lat (usec): min=283, max=371587, avg=2024.00, stdev=1016.92
>     clat percentiles (usec):
>      |  1.00th=[  486],  5.00th=[  594], 10.00th=[ 1074], 20.00th=[ 1418],
>      | 30.00th=[ 1565], 40.00th=[ 1713], 50.00th=[ 1876], 60.00th=[ 2040],
>      | 70.00th=[ 2245], 80.00th=[ 2474], 90.00th=[ 2933], 95.00th=[ 3589],
>      | 99.00th=[ 4686], 99.50th=[ 5211], 99.90th=[ 8356], 99.95th=[11863],
>      | 99.99th=[21627]
>    bw (  KiB/s): min=  832, max= 7208, per=100.00%, avg=1961.66, stdev=105.29,
> samples=72000
>    iops        : min=  208, max= 1802, avg=490.40, stdev=26.31, samples=72000
>
> I have a number of concerns about this test:
>
> 1. Why is the average speed, 2MB/s, so much lower than that reported by
> CrystalDiskMark?

Hard to say without seeing the exact CrystalDiskMark job and knowing
how the I/O ends up being seen by the disk. I heard that under the hood it
uses diskspd, so it would be good to know what parameters it was
sending to that and/or information about what the *disk* was
actually seeing (e.g. average block size and depth)... Bear in mind
that CDM is usually a filesystem test rather than a block device/raw
disk test so there's some indirection compared to the fio job above
(assuming /dev/sda is a SATA block device).

> 2. After running for 10 hours, only 67 GiB were written.  This could easily
> not yet have filled any CMR cache on a SMR disk, rendering the test
> worthless.
>
> I then ran some 5m tests, using different blocksizes in the command
>
> sudo fio --name TEST --eta-newline=5s --filename=/dev/sda --rw=randwrite
> --size=100t --io_size=14t  --ioengine=libaio --iodepth=1 --direct=1
> --numjobs=1 --runtime=5m --group_reporting --blocksize=xxx
>
> with the result:
>
> blksize speed(MB/s) IOPS
>   4k        2        490
>   1M      100         97
>  10M      130         12
> 100M      160        1~2
>   1G      160          -

I'm not sure I saw the question in this one... Note: when the block
size gets big enough (probably somewhere between 512K and 2M,
from reading https://stackoverflow.com/a/59403297 and
https://kernel.dk/when-2mb-turns-into-512k.pdf ) the kernel block
layer will split the bigger block into smaller pieces (which it might
then choose to send down to the disk in parallel).

> 3. I'm considering running a dual test, where I first write, say 10TB data
> with a blocksize of 1M (28 hours), followed by 10 hours of 4k writes again.
> Although the 1M block contents will be sequential data, can I assume that
> enough of them will go via any CMR cache in order to fill it up and reveal any
> slow down?

I think that would depend on the size of the cache, the speed at which
it was filled and the speed at which said cache could be destaged. If
those 1MByte blocks are sent "slowly" then the destaging may be able
to keep up...
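
The fill-versus-destage point can be put as a rough model. This is only a sketch: the cache size and both rates below are made-up numbers for illustration, since none of them are known for this drive, and a real drive's destage rate is unlikely to be constant.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def seconds_to_fill_cache(cache_bytes, fill_rate, destage_rate):
    """Return the time in seconds until the cache is full.

    Rates are in bytes per second. Returns None when destaging keeps up
    with (or outpaces) the incoming writes, so the cache never fills.
    """
    net_rate = fill_rate - destage_rate
    if net_rate <= 0:
        return None
    return cache_bytes / net_rate


# Hypothetical figures: a 100 GiB cache that destages at 50 MiB/s.
fast_writer = seconds_to_fill_cache(100 * GIB, 100 * MIB, 50 * MIB)
slow_writer = seconds_to_fill_cache(100 * GIB, 2 * MIB, 50 * MIB)

print("fast writer fills the cache after", fast_writer, "seconds")
print("slow writer fills the cache after", slow_writer)
```

With these assumed numbers the slow writer never fills the cache, which is the case where a long run of slow writes would reveal nothing.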

-- 
Sitsofe | http://sucs.org/~sits/



* Re: Using fio for testing for SMR
  2020-09-05 13:30 Using fio for testing for SMR Ian S. Worthington
  2020-09-06 14:13 ` Sitsofe Wheeler
@ 2020-09-07  1:38 ` Damien Le Moal
  2020-09-08 14:02   ` Ian S. Worthington
  1 sibling, 1 reply; 9+ messages in thread
From: Damien Le Moal @ 2020-09-07  1:38 UTC (permalink / raw)
  To: Ian S. Worthington, fio@vger.kernel.org

On 2020/09/05 22:38, Ian S. Worthington wrote:
> I'm trying to establish if a new disk is SMR or not, or has any other
> characteristics that would make it unsuitable for use in a zfs array.
> 
> CrystalDiskMark suggests it has a speed of 6~8 MB/s in its RND4K testing.
> 
> iiuc SMR disks contain a CMR area, possibly of variable size, which is used as
> a cache, so to test a drive I need to ensure I fill this cache so the drive is
> forced to start shingling.

That is not necessarily true. One can handle the SMR sequential write constraint
using a log structured approach that does not require any CMR caching. It really
depends on how the disk FW is implemented, but generally, that is not public
information unfortunately.

> As the disk is 14TB, my first test used:
> 
> sudo fio --name TEST --eta-newline=5s --filename=/dev/sda --rw=randwrite
> --size=100t --io_size=14t  --ioengine=libaio --iodepth=1 --direct=1
> --numjobs=1 --runtime=10h --group_reporting
> 
> which reported:
> 
> TEST: (groupid=0, jobs=1): err= 0: pid=4685: Sat Sep  5 07:42:02 2020
>   write: IOPS=490, BW=1962KiB/s (2009kB/s)(67.4GiB/36000002msec); 0 zone
> resets
>     slat (usec): min=16, max=10242, avg=41.02, stdev=11.10
>     clat (usec): min=17, max=371540, avg=1980.75, stdev=1016.94
>      lat (usec): min=283, max=371587, avg=2024.00, stdev=1016.92
>     clat percentiles (usec):
>      |  1.00th=[  486],  5.00th=[  594], 10.00th=[ 1074], 20.00th=[ 1418],
>      | 30.00th=[ 1565], 40.00th=[ 1713], 50.00th=[ 1876], 60.00th=[ 2040],
>      | 70.00th=[ 2245], 80.00th=[ 2474], 90.00th=[ 2933], 95.00th=[ 3589],
>      | 99.00th=[ 4686], 99.50th=[ 5211], 99.90th=[ 8356], 99.95th=[11863],
>      | 99.99th=[21627]
>    bw (  KiB/s): min=  832, max= 7208, per=100.00%, avg=1961.66, stdev=105.29,
> samples=72000
>    iops        : min=  208, max= 1802, avg=490.40, stdev=26.31, samples=72000
> 
> I have a number of concerns about this test:
> 
> 1. Why is the average speed, 2MB/s, so much lower than that reported by
> CrystalDiskMark?

Likely because CrystalDiskMark is very short and does not trigger internal
sector management (GC) by the disk. Your 10h run most likely did.

> 2. After running for 10 hours, only 67 GiB were written.  This could easily
> not yet have filled any CMR cache on a SMR disk, rendering the test
> worthless.

Likely no. Whatever CMR space the disk has (if any at all) was likely filled.
The internal disk sector movements needed to handle the SMR sequential write
constraint are causing enormous overhead, leading to only 67GB written. Your 4K
random write test is the worst possible for a drive managed SMR disk. You simply
are seeing what the drive performance is given the horrible conditions it is
subjected to.
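
As a cross-check of the figures in the fio report, the amount written follows directly from the average bandwidth and the runtime. A small sketch, using only the numbers quoted above (1962 KiB/s over 36000 s) and taking the "14TB" capacity as 14 x 10^12 bytes, which is an assumption about how the vendor counts:

```python
KIB = 1024
GIB = 1024 ** 3


def gib_written(bw_kib_per_s, runtime_s):
    """Total data written, in GiB, at a constant average bandwidth."""
    return bw_kib_per_s * KIB * runtime_s / GIB


written = gib_written(1962, 36000)           # values from the 10h fio run
disk_bytes = 14 * 10 ** 12                   # assumed decimal 14TB capacity
fraction = written * GIB / disk_bytes

print(f"written: {written:.1f} GiB")
print(f"share of the disk touched: {fraction:.2%}")
```

So the run covered well under 1% of the disk's capacity, whatever the drive was doing internally.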

> 
> I then ran some 5m tests, using different blocksizes in the command
> 
> sudo fio --name TEST --eta-newline=5s --filename=/dev/sda --rw=randwrite
> --size=100t --io_size=14t  --ioengine=libaio --iodepth=1 --direct=1
> --numjobs=1 --runtime=5m --group_reporting --blocksize=xxx
> 
> with the result:
> 
> blksize speed(MB/s) IOPS
>   4k        2        490
>   1M      100         97
>  10M      130         12
> 100M      160        1~2
>   1G      160          -
> 
> 3. I'm considering running a dual test, where I first write, say 10TB data
> with a blocksize of 1M (28 hours), followed by 10 hours of 4k writes again.
> Although the 1M block contents will be sequential data, can I assume that
> enough of them will go via any CMR cache in order to fill it up and reveal any
> slow down?

On Linux, one easy thing to check is to look at:

cat /sys/block/<disk name>/device/scsi_disk/X:Y:Z:N/zoned_cap

A drive managed SMR disk that is not hiding its true nature will say
"drive-managed". You will need kernel 5.8 to have this attribute file.
Otherwise, you can use SG to inspect the VPD page 0xB1 (block device
characteristics). Look for the value of bits 4-5 of byte 8 (ZONED field). If the
value is 2 (10b), then your disk is a drive managed SMR disk.
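
For scripting the sysfs check, something along these lines should work. It is a sketch only: the helper name read_zoned_cap and its sysfs_root parameter are invented here, and the X:Y:Z:N component is matched with a wildcard rather than spelled out.

```python
import glob
import os


def read_zoned_cap(disk, sysfs_root="/sys"):
    """Return the zoned_cap string for a disk such as "sda", or None.

    None means the attribute file was not found, e.g. on a kernel that
    does not provide it or for a device that is not a SCSI/SATA disk.
    """
    pattern = os.path.join(
        sysfs_root, "block", disk, "device", "scsi_disk", "*", "zoned_cap"
    )
    for path in glob.glob(pattern):
        with open(path) as attr:
            return attr.read().strip()
    return None


if __name__ == "__main__":
    # "sda" is just the device name used in this thread.
    print(read_zoned_cap("sda"))
```

A result of None says nothing about the disk itself, only that the kernel did not expose the attribute.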


-- 
Damien Le Moal
Western Digital Research



* Re: Using fio for testing for SMR
  2020-09-07  1:38 ` Damien Le Moal
@ 2020-09-08 14:02   ` Ian S. Worthington
  2020-09-08 23:45     ` Damien Le Moal
  0 siblings, 1 reply; 9+ messages in thread
From: Ian S. Worthington @ 2020-09-08 14:02 UTC (permalink / raw)
  To: Damien Le Moal, fio@vger.kernel.org

Hello Damien --

Many thanks indeed for this most comprehensive answer.

> On Linux, one easy thing to check is to look at:
> 
> cat /sys/block/<disk name>/device/scsi_disk/X:Y:Z:N/zoned_cap
> 
> A drive managed SMR disk that is not hiding its true nature will say
> "drive-managed". You will need kernel 5.8 to have this attribute file.
> Otherwise, you can use SG to inspect the VPD page 0xB1 (block device
> characteristics). Look for the value of bits 4-5 of byte 8 (ZONED field). If the
> value is 2 (10b), then your disk is a drive managed SMR disk.

I'm not on 5.8, so I guess that's why I don't have a zoned_cap. But:

sudo sg_vpd --page=bdc /dev/sda
Block device characteristics VPD page (SBC):
  Nominal rotation rate: 5400 rpm
  Product type: Not specified
  WABEREQ=0
  WACEREQ=0
  Nominal form factor not reported
  ZONED=0
  RBWZ=0
  BOCS=0
  FUAB=0
  VBULS=0
  DEPOPULATION_TIME=0 (seconds)

sudo sg_vpd --page=bdc -H /dev/sda
Block device characteristics VPD page (SBC):
 00     00 b1 00 3c 15 18 00 00  00 00 00 00 00 00 00 00    ...<............
 10     00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00    ................
 20     00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00    ................
 30     00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00    ................

This seems to suggest that this is NOT a "drive managed SMR disk".  Are there
other types of SMR disks that could have zoned=0?

> > 1. Why is the average speed, 2MB/s, so much lower than that reported by
> > CrystalDiskMark?
> 
> Likely because CrystalDiskMark is very short and does not trigger internal
> sector management (GC) by the disk. Your 10h run most likely did.

Unfortunately, it was showing that speed pretty much from the start.  I ran it
again in three runs, all of 4k randwrite, with sizes of 256MB and 1GB (the
same as I used in my CDM test), and 10GB, viz:

sudo fio --name SPINUP      --eta-newline=5s --eta-interval=5s
-filename=/dev/sda --rw=randwrite --size=100t --io_size=14t --ioengine=libaio
--iodepth=4 --direct=1 --numjobs=1 --runtime=1m --group_reporting
--blocksize=4k
sudo fio --name 4K256m   --eta-newline=5s --eta-interval=5s -filename=/dev/sda
--rw=randwrite --size=256m --io_size=256m --ioengine=libaio --iodepth=1
--direct=1 --numjobs=1 --group_reporting --blocksize=4k
sudo fio --name 4K1g     --eta-newline=5s --eta-interval=5s -filename=/dev/sda
--rw=randwrite --size=1g   --io_size=1g   --ioengine=libaio --iodepth=1
--direct=1 --numjobs=1 --group_reporting --blocksize=4k
sudo fio --name 4K10g    --eta-newline=5s --eta-interval=5s -filename=/dev/sda
--rw=randwrite --size=10g   --io_size=10g   --ioengine=libaio --iodepth=1
--direct=1 --numjobs=1 --group_reporting --blocksize=4k --runtime=5m

size   KiB/s bw-min  max  avg (KiB/s)
256MB  3216    1600  6896 3216
  1G   3265    1712 12008 3263
 10G   2886    1264  6976 2885

I've noticed that always after finishing running these tests there are minutes
of head seeking noise from the drive.  Is this the GC to which you refer?  I'm
curious as to what it might actually be doing during this time, if we assume
that SG_VPD is correctly reporting that this is NOT an SMR drive?  Is there
other internal sector management that it might be doing?

If I ran a test where I filled the drive to capacity using sequential writes
so the drive recorded all sectors as being in use, then wrote 10TB randwrite
using a 1MB blocksize to fill as much of any CMR cache as possible, then
finally redid the 10 hour test with 4k randwrite, could I then compare the
results of that final test to the short tests to definitively show if there
were any slowdowns that might be caused by reshingling in that final test?

Best wishes,

Ian
 






* Re: Using fio for testing for SMR
  2020-09-08 14:02   ` Ian S. Worthington
@ 2020-09-08 23:45     ` Damien Le Moal
  2020-09-12 18:20       ` Ian S. Worthington
  0 siblings, 1 reply; 9+ messages in thread
From: Damien Le Moal @ 2020-09-08 23:45 UTC (permalink / raw)
  To: Ian S. Worthington, fio@vger.kernel.org

On 2020/09/08 23:02, Ian S. Worthington wrote:
> Hello Damien --
> 
> Many thanks indeed for this most comprehensive answer.
> 
>> On Linux, one easy thing to check is to look at:
>>
>> cat /sys/block/<disk name>/device/scsi_disk/X:Y:Z:N/zoned_cap
>>
>> A drive managed SMR disk that is not hiding its true nature will say
>> "drive-managed". You will need kernel 5.8 to have this attribute file.
>> Otherwise, you can use SG to inspect the VPD page 0xB1 (block device
>> characteristics). Look for the value of bits 4-5 of byte 8 (ZONED field). If the
>> value is 2 (10b), then your disk is a drive managed SMR disk.
> 
> I'm not on 5.8, so I guess that's why I don't have a zoned_cap. But:
> 
> sudo sg_vpd --page=bdc /dev/sda
> Block device characteristics VPD page (SBC):
>   Nominal rotation rate: 5400 rpm
>   Product type: Not specified
>   WABEREQ=0
>   WACEREQ=0
>   Nominal form factor not reported
>   ZONED=0
>   RBWZ=0
>   BOCS=0
>   FUAB=0
>   VBULS=0
>   DEPOPULATION_TIME=0 (seconds)
> 
> sudo sg_vpd --page=bdc -H /dev/sda
> Block device characteristics VPD page (SBC):
>  00     00 b1 00 3c 15 18 00 00  00 00 00 00 00 00 00 00    ...<............
>  10     00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00    ................
>  20     00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00    ................
>  30     00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00    ................
> 
> This seems to suggest that this is NOT a "drive managed SMR disk".  Are there
> other types of SMR disks that could have zoned=0?

zoned=0 means "not reported". So either your disk is a regular CMR one, or it is
a drive-managed SMR one but it is not confirming it :) At this point, you may
want to contact your drive vendor to ask for information. Some vendors (e.g. WD)
already have clarified which of their drive models are drive managed SMR. Google
it and you may be able to find more information online.
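
For reference, the ZONED field can be pulled out of the hex dump above by hand or with a few lines of code. This is a sketch: the meanings for values 1 and 3 (host-aware, reserved) and the reading of bytes 4-5 as the rotation rate come from my reading of the SBC layout of this page, not from anything stated in this thread.

```python
# Meaning of the 2-bit ZONED field (byte 8, bits 4-5 of VPD page 0xB1).
ZONED_MEANING = {
    0: "not reported",
    1: "host-aware",
    2: "drive-managed",
    3: "reserved",
}


def parse_hex_row(row):
    """Parse one 16-byte row of `sg_vpd -H` output into bytes.

    The first token is the row offset; the next 16 tokens are the data.
    The trailing ASCII column is ignored.
    """
    tokens = row.split()
    return bytes(int(tok, 16) for tok in tokens[1:17])


def zoned_field(page):
    """Extract bits 4-5 of byte 8 from the raw VPD page bytes."""
    return (page[8] >> 4) & 0x3


first_row = " 00     00 b1 00 3c 15 18 00 00  00 00 00 00 00 00 00 00    ...<............"
page = parse_hex_row(first_row)

print("page code:", hex(page[1]))
print("rotation rate:", int.from_bytes(page[4:6], "big"), "rpm")
print("ZONED:", ZONED_MEANING[zoned_field(page)])
```

The same bytes that decode to 5400 rpm also give ZONED=0, so the decoded sg_vpd output and the raw dump agree.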

> 
>>> 1. Why is the average speed, 2MB/s, so much lower than that reported by
>>> CrystalDiskMark?
>>
>> Likely because CrystalDiskMark is very short and does not trigger internal
>> sector management (GC) by the disk. Your 10h run most likely did.
> 
> Unfortunately, it was showing that speed pretty much from the start.  I ran it
> again in three runs, all of 4k randwrite, with sizes of 256MB and 1GB (the
> same as I used in my CDM test), and 10GB, viz:
> 
> sudo fio --name SPINUP      --eta-newline=5s --eta-interval=5s
> -filename=/dev/sda --rw=randwrite --size=100t --io_size=14t --ioengine=libaio
> --iodepth=4 --direct=1 --numjobs=1 --runtime=1m --group_reporting
> --blocksize=4k
> sudo fio --name 4K256m   --eta-newline=5s --eta-interval=5s -filename=/dev/sda
> --rw=randwrite --size=256m --io_size=256m --ioengine=libaio --iodepth=1
> --direct=1 --numjobs=1 --group_reporting --blocksize=4k
> sudo fio --name 4K1g     --eta-newline=5s --eta-interval=5s -filename=/dev/sda
> --rw=randwrite --size=1g   --io_size=1g   --ioengine=libaio --iodepth=1
> --direct=1 --numjobs=1 --group_reporting --blocksize=4k
> sudo fio --name 4K10g    --eta-newline=5s --eta-interval=5s -filename=/dev/sda
> --rw=randwrite --size=10g   --io_size=10g   --ioengine=libaio --iodepth=1
> --direct=1 --numjobs=1 --group_reporting --blocksize=4k --runtime=5m
> 
> size   KiB/s bw-min  max  avg (KiB/s)
> 256MB  3216    1600  6896 3216
>   1G   3265    1712 12008 3263
>  10G   2886    1264  6976 2885

4K@QD=4 random write at 3MB/s is actually pretty normal. That is about 800 IOPS.
I am getting similar results with CMR disks too.
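
The bandwidth-to-IOPS conversion used here is just bandwidth divided by block size. A quick sketch applying it to the KiB/s figures from the table above (the function name is made up for illustration):

```python
def iops_from_bandwidth(bw_kib_per_s, block_kib=4):
    """IOPS implied by an average bandwidth at a fixed block size."""
    return bw_kib_per_s / block_kib


# Average KiB/s values reported for the three short 4k randwrite runs.
runs = {"256MB": 3216, "1G": 3265, "10G": 2886}

for label, bw in runs.items():
    print(f"{label:>6}: {bw} KiB/s -> {iops_from_bandwidth(bw):.0f} IOPS")
```

The same arithmetic applied to the 10 hour run (1962 KiB/s) gives about 490 IOPS, matching what fio reported.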

> I've noticed that always after finishing running these tests there are minutes
> of head seeking noise from the drive.  Is this the GC to which you refer?  I'm
> curious as to what it might actually be doing during this time, if we assume
> that SG_VPD is correctly reporting that this is NOT an SMR drive?  Is there
> other internal sector management that it might be doing?

Well, yes. There are a lot of things that drives can do to try to improve small
random IO performance, especially writes. E.g. use opportunistic media caching
(not in place writes) in special reserved areas of the disk, transforming the
random pattern into a more sequential one. And such a method of course results in
these media cache areas being cleaned up later to free space (sectors are moved
in place). And if this is really a drive managed SMR disk, then GC is a
possibility too. But this is all speculation. Again, without detailed knowledge
of the disk FW implementation (and if it is SMR or not), this is all guess work.

> If I ran a test where I filled the drive to capacity using sequential writes
> so the drive recorded all sectors as being in use, then wrote 10TB randwrite
> using a 1MB blocksize to fill as much of any CMR cache as possible, then
> finally redid the 10 hour test with 4k randwrite, could I then compare the
> results of that final test to the short tests to definitively show if there
> were any slowdowns that might be caused by reshingling in that final test?

No idea. I do not think this can give a definitive/reliable answer. Getting
clear information from the drive vendors seems to me like a much easier solution :)


-- 
Damien Le Moal
Western Digital Research



* Re: Using fio for testing for SMR
  2020-09-08 23:45     ` Damien Le Moal
@ 2020-09-12 18:20       ` Ian S. Worthington
  2020-09-15  2:28         ` Damien Le Moal
  0 siblings, 1 reply; 9+ messages in thread
From: Ian S. Worthington @ 2020-09-12 18:20 UTC (permalink / raw)
  To: Damien Le Moal, fio@vger.kernel.org

Many thanks again Damien for that great information.

I've now run the following tests:

sudo fio --name START   --eta-newline=5s --eta-interval=5s -filename=/dev/sda
--rw=randwrite --size=100t --io_size=14t  --ioengine=libaio --iodepth=4
--direct=1 --numjobs=1 --group_reporting --blocksize=4k --runtime=1m
sudo fio --name FILLSEQ --eta-newline=5s --eta-interval=5s -filename=/dev/sda
--rw=write     --size=14t  --io_size=14t  --ioengine=libaio --iodepth=4
--direct=1 --numjobs=1 --group_reporting --blocksize=512m
sudo fio --name 10TB1MB --eta-newline=5s --eta-interval=5s -filename=/dev/sda
--rw=randwrite --size=14t  --io_size=14t  --ioengine=libaio --iodepth=4
--direct=1 --numjobs=1 --group_reporting --blocksize=1m
sudo fio --name 4K10HR  --eta-newline=5s --eta-interval=5s -filename=/dev/sda
--rw=randwrite --size=14t  --io_size=14t  --ioengine=libaio --iodepth=1
--direct=1 --numjobs=1 --group_reporting --blocksize=4k --runtime=10h 

Results:

FILLSEQ:
Purpose: to fill disk as quickly as possible.
Seems fio's "ios" counts whole blocks in some places and maybe split blocks in
other places?
blocksize=512MiB
avg bw=157MiB/s, ios=58720040, runtime=93692287 msec (26.03 hours),
io=14.0TiB (15.4TB)
14 TiB = 14680064 MiB
14680064 MiB / (26.03 h x 3600 s/h) = 156.7 MiB/s
58720040 ios / 93692287 msec = 0.6267 ios/ms = 626.73 iops
157 MiB/s / 626.73 iops = 0.25 MiB/io = 256 KiB/io => 2048 ios per 512 MiB block
626.73 iops / 2048 = 0.31 blks/sec
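
The same arithmetic as a short script, in case anyone wants to rerun it with their own numbers. It uses only the totals from the FILLSEQ report above; the variable names are mine.

```python
MIB_PER_TIB = 1024 * 1024

total_mib = 14 * MIB_PER_TIB        # io=14.0TiB
runtime_s = 93692287 / 1000         # runtime in msec, converted to seconds
ios = 58720040                      # I/Os counted at the disk
block_mib = 512                     # fio blocksize for this job

bandwidth = total_mib / runtime_s                 # MiB/s
disk_iops = ios / runtime_s                       # I/Os per second at the disk
kib_per_io = total_mib * 1024 / ios               # average size of one disk I/O
ios_per_block = block_mib * 1024 / round(kib_per_io)
blocks_per_s = disk_iops / ios_per_block          # fio-level 512 MiB blocks/s

print(f"{total_mib} MiB in {runtime_s / 3600:.2f} h")
print(f"bandwidth     {bandwidth:.1f} MiB/s")
print(f"disk IOPS     {disk_iops:.2f}")
print(f"size per I/O  {kib_per_io:.0f} KiB")
print(f"I/Os per blk  {ios_per_block:.0f}")
print(f"blocks/sec    {blocks_per_s:.2f}")
```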

From https://kernel.dk/when-2mb-turns-into-512k.pdf:

pi@raspberrypi:/ $ cat /sys/block/sda/queue/max_hw_sectors_kb
256
pi@raspberrypi:/ $ cat /sys/block/sda/queue/max_sectors_kb
256
pi@raspberrypi:/ $ cat /sys/block/sda/queue/max_segments
2048
pi@raspberrypi:/ $ cat /sys/block/sda/queue/max_segment_size
65536

This tells us that the maximum size the device can support is 256KB
(max_hw_sectors_kb)
and the maximum size that the kernel allows is 256KB (max_sectors_kb). 
Additionally, the DMA engine is limited to 2048 segments of IO, 
each with a max size of 64KB.
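
Given max_sectors_kb, the number of requests each fio block is split into can be estimated up front. A sketch, assuming the only limit that matters is max_sectors_kb and that blocks are aligned so no extra partial request is created; the function name is made up.

```python
def requests_per_block(block_bytes, max_sectors_kb):
    """Estimate how many disk requests one fio block is split into.

    Uses ceiling division so a block slightly over the limit still
    counts as needing one more request.
    """
    limit_bytes = max_sectors_kb * 1024
    return -(-block_bytes // limit_bytes)


MIB = 1024 * 1024

# max_sectors_kb=256 is the value read from /sys/block/sda/queue above.
for label, size in [("4k", 4096), ("1M", MIB), ("512M", 512 * MIB)]:
    print(f"bs={label}: {requests_per_block(size, 256)} request(s)")
```

For the 512M blocksize this gives the same 2048 I/Os per block as the figure derived from the fio disk statistics.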


10TB1MB:
Blksize 1MB randwrite 14TiB
Purpose: Attempt to ensure any CMR cache is full
Avg BW: 104 MiB/s; IOPS: 104
Samples: BW min: 32 MiB/s; max: 190 MiB/s; avg=104 MiB/s
		 IOPS min: 32; max: 190; avg: 104
lat (msec)   : 4=0.01%, 10=0.01%, 20=0.86%, 50=98.44%, 100=0.69%
lat (msec)   : 250=0.01%, 500=0.01%


4K10HR:
10 hours of randwrite blksize=4KiB
Purpose: See if we can see any delays caused by forced destaging from CMR
cache to SMR
BW: 1960 KiB/s; IOPS: 489; 67.3 GiB written
Sampling:
BW min: 640 KiB/s; max: 8264 KiB/s
IOPS min: 160; max: 2066
  lat (usec)   : 20=0.01%, 50=0.01%, 250=0.01%, 500=1.12%, 750=5.86%
  lat (usec)   : 1000=1.80%
  lat (msec)   : 2=48.59%, 4=39.72%, 10=2.81%, 20=0.07%, 50=0.02%
  lat (msec)   : 100=0.01%, 250=0.01%, 500=0.01%
No obvious slow downs in fio log.

Conclusions:

So my guess, from the absence of any drastic slow downs during the 4K test is
that either this disk is not an SMR disk, or if it is, I'm not showing it in a
test that writes 4k random for only 10 hours.

> Getting clear information from the drive vendors seems to me like a much
> easier solution :)

Totally agree.  I've declined to look until now in case it biased my
analysis.

The model here is a WD140EMFZ-11A0WA0 which WD don't seem to publish any
information on. HDDScan claim in their blog that it's a CMR (PMR, He) drive
manufactured by HGST (https://hddscan.com/blog/2020/hdd-wd-smr.html), but I
have no idea if this information is reliable or not.

Ian
...





* Re: Using fio for testing for SMR
  2020-09-12 18:20       ` Ian S. Worthington
@ 2020-09-15  2:28         ` Damien Le Moal
  2020-09-19 11:47           ` Ian S. Worthington
  0 siblings, 1 reply; 9+ messages in thread
From: Damien Le Moal @ 2020-09-15  2:28 UTC (permalink / raw)
  To: Ian S. Worthington, fio@vger.kernel.org

On 2020/09/13 3:20, Ian S. Worthington wrote:
> Many thanks again Damien for that great information.
> 
> I've now run the following tests:
> 
> sudo fio --name START   --eta-newline=5s --eta-interval=5s -filename=/dev/sda
> --rw=randwrite --size=100t --io_size=14t  --ioengine=libaio --iodepth=4
> --direct=1 --numjobs=1 --group_reporting --blocksize=4k --runtime=1m
> sudo fio --name FILLSEQ --eta-newline=5s --eta-interval=5s -filename=/dev/sda
> --rw=write     --size=14t  --io_size=14t  --ioengine=libaio --iodepth=4
> --direct=1 --numjobs=1 --group_reporting --blocksize=512m
> sudo fio --name 10TB1MB --eta-newline=5s --eta-interval=5s -filename=/dev/sda
> --rw=randwrite --size=14t  --io_size=14t  --ioengine=libaio --iodepth=4
> --direct=1 --numjobs=1 --group_reporting --blocksize=1m
> sudo fio --name 4K10HR  --eta-newline=5s --eta-interval=5s -filename=/dev/sda
> --rw=randwrite --size=14t  --io_size=14t  --ioengine=libaio --iodepth=1
> --direct=1 --numjobs=1 --group_reporting --blocksize=4k --runtime=10h 
> 
> Results:
> 
> FILLSEQ:
> Purpose: to fill disk as quickly as possible.
> Seems FIO IOS counts blocks in some places and maybe split blocks in other
> places?
> blocksize=512MiB
> avg bw=157MiB/s ios=58720040/93692287 msec(26.03 hours),  io=14.0TiB
> (15.4TB),
> 14tib = 14680064 miB
> ~/26.03/3600 = 156.7 MiB/s
> 58720040 ios/93692287 msecs = 0.6267 ios/ms = 626.73 iops
> 157 MiB/s / 626 iops = 0.25 MiB/io = 256KiB/io => 2048 ios/512MiB 
> 626.73/2048=0.31 blks/sec
> 
> From https://kernel.dk/when-2mb-turns-into-512k.pdf:
> 
> pi@raspberrypi:/ $ cat /sys/block/sda/queue/max_hw_sectors_kb
> 256
> pi@raspberrypi:/ $ cat /sys/block/sda/queue/max_sectors_kb
> 256
> pi@raspberrypi:/ $ cat /sys/block/sda/queue/max_segments
> 2048
> pi@raspberrypi:/ $ cat /sys/block/sda/queue/max_segment_size
> 65536
> 
> This tells us that the maximum size the device can support is 256KB
> (max_hw_sectors_kb)
> and the maximum size that the kernel allows is 256KB (max_sectors_kb). 

Which highly depends on the HBA since the hard-disk interface will accept much
larger commands. This 256KB on your system seems to be very low. What HBA are
you using ? These days, most SAS HBAs have at least 128 or 256 segments, which
allow up to 512KB/1MB IOs. SATA/AHCI is limited at 169 segments for up to 32MB
max_hw_sectors_kb (and 1280 KB max_sectors_kb).

> Additionally, the DMA engine is limited to 2048 segments of IO, 
> each with a max size of 64KB.

That's a lot. I wonder why max_hw_sectors_kb ends up so small.
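
As a rough illustration of how these queue limits interact, here is a minimal
Python sketch. It assumes the largest request is bounded by max_sectors_kb and
by max_segments * max_segment_size, which is a simplification of what the block
layer does. The function names are hypothetical, and the numbers are the ones
quoted above.

```python
import math
from pathlib import Path

QUEUE_ATTRS = (
    "max_hw_sectors_kb",
    "max_sectors_kb",
    "max_segments",
    "max_segment_size",
)


def read_queue_limits(device, sysfs_root="/sys/block"):
    """Read the four queue limits for a device such as 'sda' as integers."""
    queue = Path(sysfs_root) / device / "queue"
    return {name: int((queue / name).read_text().strip()) for name in QUEUE_ATTRS}


def max_request_bytes(limits):
    """Approximate upper bound on a single request (assumption: min of both caps)."""
    by_sectors = limits["max_sectors_kb"] * 1024
    by_segments = limits["max_segments"] * limits["max_segment_size"]
    return min(by_sectors, by_segments)


def split_count(block_bytes, limits):
    """Number of requests one block of block_bytes is split into."""
    return math.ceil(block_bytes / max_request_bytes(limits))


# Values quoted in this thread for /dev/sda.
limits = {
    "max_hw_sectors_kb": 256,
    "max_sectors_kb": 256,
    "max_segments": 2048,
    "max_segment_size": 65536,
}

for label, size in (("4k", 4096), ("1m", 1 << 20), ("512m", 512 << 20)):
    print(label, split_count(size, limits))
```

With these values the segment limit alone would allow 128 MiB per request, so
the 256 KiB cap comes entirely from max_sectors_kb. A 512 MiB fio block then
becomes 2048 requests, a 1 MiB block becomes 4, and a 4 KiB block is not split.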

> 10TB1MB:
> Blksize 1MB randwrite 14TiB
> Purpose: Attempt to ensure any CMR cache is full
> Avg BW: 104 MiB/s; IOPS: 104
> Samples: BW min: 32 MiB/s; max: 190 MiB/s; avg=104 MiB/s
> 		 IOPS min: 32; max: 190; avg: 104
> lat (msec)   : 4=0.01%, 10=0.01%, 20=0.86%, 50=98.44%, 100=0.69%
> lat (msec)   : 250=0.01%, 500=0.01%
> 
> 
> 4K10HR:
> 10 hours of randwrite blksize=4KiB
> Purpose: See if we can see any delays caused by forced destaging from the
> CMR cache to the SMR area
> BW: 1960 KiB/s; IOPS: 489; 67.3 GiB written
> Sampling:
> BW min: 640 KiB/s; max: 8264 KiB/s
> IOPS min: 160; max: 2066
>   lat (usec)   : 20=0.01%, 50=0.01%, 250=0.01%, 500=1.12%, 750=5.86%
>   lat (usec)   : 1000=1.80%
>   lat (msec)   : 2=48.59%, 4=39.72%, 10=2.81%, 20=0.07%, 50=0.02%
>   lat (msec)   : 100=0.01%, 250=0.01%, 500=0.01%
> No obvious slow downs in fio log.
> 
> Conclusions:
> 
> So my guess, from the absence of any drastic slowdowns during the 4K test, is
> that either this disk is not an SMR disk or, if it is, 10 hours of 4k random
> writes is not enough to show it.
> 
> >> Getting clear information from the drive vendors seems to me like a much
> >> easier solution :)
> 
> Totally agree.  I've declined to look until now in case it biased my
> analysis.
> 
> The model here is a WD140EMFZ-11A0WA0, which WD don't seem to publish any
> information on. HDDScan claim in their blog that it's a CMR (PMR, He) drive
> manufactured by HGST (https://hddscan.com/blog/2020/hdd-wd-smr.html), but I
> have no idea if this information is reliable or not.


The list of WD drives using SMR as drive-managed is published. See:

https://blog.westerndigital.com/wd-red-nas-drives/

and

https://blog.westerndigital.com/wp-content/uploads/2020/07/WD_SMR_SKUs_vDS.pdf

Your drive is not on the list :)


-- 
Damien Le Moal
Western Digital Research



* Re: Using fio for testing for SMR
  2020-09-15  2:28         ` Damien Le Moal
@ 2020-09-19 11:47           ` Ian S. Worthington
  2020-09-23  1:00             ` Damien Le Moal
  0 siblings, 1 reply; 9+ messages in thread
From: Ian S. Worthington @ 2020-09-19 11:47 UTC (permalink / raw)
  To: Damien Le Moal, fio@vger.kernel.org

Hi Damien --


> > This tells us that the maximum size the device can support is 256KB
> > (max_hw_sectors_kb)
> > and the maximum size that the kernel allows is 256KB (max_sectors_kb). 
> 
> Which highly depends on the HBA since the hard-disk interface will accept
> much larger commands. This 256KB on your system seems to be very low. What
> HBA are you using ? These days, most SAS HBAs have at least 128 or 256
> segments, which allow up to 512KB/1MB IOs. SATA/AHCI is limited at 169
> segments for up to 32MB max_hw_sectors_kb (and 1280 KB max_sectors_kb).
> 
> > Additionally, the DMA engine is limited to 2048 segments of IO, 
> > each with a max size of 64KB.
> 
> That's a lot. I wonder why max_hw_sectors_kb ends up so small.

I'm trying a Raspberry Pi for this testing so only have a USB3 interface
available.  Seems to be fast enough but maybe that's why it's giving numbers
that seem odd to you?


> The list of WD drives using SMR as drive-managed is published. See:
> 
> https://blog.westerndigital.com/wd-red-nas-drives/
> 
> and
> 
> https://blog.westerndigital.com/wp-content/uploads/2020/07/WD_SMR_SKUs_vDS.pdf
> 
> Your drive is not on the list :)


I wasn't sure that the list was reliable when it came to white label drives,
but it's good to have it confirmed.

Many thanks!

Ian
...





* Re: Using fio for testing for SMR
  2020-09-19 11:47           ` Ian S. Worthington
@ 2020-09-23  1:00             ` Damien Le Moal
  0 siblings, 0 replies; 9+ messages in thread
From: Damien Le Moal @ 2020-09-23  1:00 UTC (permalink / raw)
  To: Ian S. Worthington, fio@vger.kernel.org

On 2020/09/19 20:47, Ian S. Worthington wrote:
> Hi Damien --
> 
> 
>>> This tells us that the maximum size the device can support is 256KB
>>> (max_hw_sectors_kb)
>>> and the maximum size that the kernel allows is 256KB (max_sectors_kb). 
>>
>> Which highly depends on the HBA since the hard-disk interface will accept
>> much larger commands. This 256KB on your system seems to be very low. What
>> HBA are you using ? These days, most SAS HBAs have at least 128 or 256
>> segments, which allow up to 512KB/1MB IOs. SATA/AHCI is limited at 169
>> segments for up to 32MB max_hw_sectors_kb (and 1280 KB max_sectors_kb).
>>
>>> Additionally, the DMA engine is limited to 2048 segments of IO, 
>>> each with a max size of 64KB.
>>
>> That's a lot. I wonder why max_hw_sectors_kb ends up so small.
> 
> I'm trying a Raspberry Pi for this testing so only have a USB3 interface
> available.  Seems to be fast enough but maybe that's why it's giving numbers
> that seem odd to you?

Ah. OK. That explains it. USB is introducing the limitation on the number of
segments here. That will result in a lot of command splits for large reads and
writes, but I do not think it matters much for the workloads you ran.
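
The FILLSEQ totals reported earlier in the thread are consistent with this kind
of splitting. The following Python sketch only redoes that arithmetic from the
quoted fio numbers (14 TiB written, 58720040 ios, 93692287 msec, 512 MiB
blocks). The function name and dictionary keys are hypothetical, not fio output
fields.

```python
MIB = 1024 * 1024
TIB = 1024 ** 4


def summarize(total_bytes, ios, runtime_msec, block_bytes):
    """Derive bandwidth, IOPS and average request size from run totals."""
    seconds = runtime_msec / 1000
    bytes_per_io = total_bytes / ios
    return {
        "bw_mib_s": total_bytes / MIB / seconds,
        "iops": ios / seconds,
        "kib_per_io": bytes_per_io / 1024,
        "ios_per_block": block_bytes / bytes_per_io,
    }


# Totals quoted for the FILLSEQ job in this thread.
stats = summarize(
    total_bytes=14 * TIB,
    ios=58720040,
    runtime_msec=93692287,
    block_bytes=512 * MIB,
)

for key, value in stats.items():
    print(f"{key}: {value:.2f}")
```

The average request size comes out at about 256 KiB, or roughly 2048 requests
per 512 MiB block, which matches the max_sectors_kb value of 256 shown for this
device.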

Cheers.


-- 
Damien Le Moal
Western Digital Research

