* How can btrfs take 23sec to stat 23K files from an SSD?
From: Marc MERLIN @ 2012-07-23 6:42 UTC
To: linux-btrfs
I just realized that the older thread got a bit confusing, so I'll keep
problems separate and make things simpler :)
On an _unencrypted_ partition on the SSD, running du -sh on a directory
with 15K files takes 23 seconds, versus 4 seconds on an encrypted
spinning drive, both with a similar btrfs filesystem and the same
kernel (3.4.4).
Unencrypted btrfs on SSD:
gandalfthegreat:~# mount -o compress=lzo,discard,nossd,space_cache,noatime /dev/sda2 /mnt/mnt2
gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
514M src
real 0m22.667s
Encrypted btrfs on spinning drive of the same src directory:
gandalfthegreat:/var/local# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
514M src
real 0m3.881s
I've run this many times and get the same numbers.
I've tried deadline and noop on /dev/sda (the SSD) and du is just as slow.
I also tried with:
- space_cache and nospace_cache
- ssd and nossd
- noatime didn't seem to help even though I was hopeful on this one.
In all cases, I get:
gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
514M src
real 0m22.537s
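For reference, the scheduler and mount-option variants above can be cycled
from a small loop; a minimal sketch (the paths and options are the ones
from this thread, the loop itself is just an assumption about how one
would script it):

for sched in noop deadline cfq; do
    echo $sched > /sys/block/sda/queue/scheduler
    echo 3 > /proc/sys/vm/drop_caches
    time du -sh /mnt/mnt2/src
done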
I'm having the same slow speed on 2 btrfs filesystems on the same SSD.
One is encrypted, the other one isn't:
Label: 'btrfs_pool1' uuid: d570c40a-4a0b-4d03-b1c9-cff319fc224d
Total devices 1 FS bytes used 144.74GB
devid 1 size 441.70GB used 195.04GB path /dev/dm-0
Label: 'boot' uuid: 84199644-3542-430a-8f18-a5aa58959662
Total devices 1 FS bytes used 2.33GB
devid 1 size 25.00GB used 5.04GB path /dev/sda2
If instead of stating a bunch of files, I try reading a big file, I do get speeds
that are quite fast (253MB/s and 423MB/s).
22 seconds for 15K files on an SSD is super slow, and 5 times slower
than a spinning disk with the same data.
What's going on?
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
From: Marc MERLIN @ 2012-07-24 6:18 UTC
To: linux-btrfs
On Sun, Jul 22, 2012 at 11:42:03PM -0700, Marc MERLIN wrote:
> I just realized that the older thread got a bit confusing, so I'll keep
> problems separate and make things simpler :)
Since yesterday, I tried other kernels, including nopreempt, volpreempt
and preempt builds of 3.4.4.
I also tried a default 3.2.0 kernel from debian (all amd64), but that did
not help. I'm still seeing close to 25 seconds to scan 15K files.
How can it possibly be so slow?
More importantly, how can I provide useful debug information?
- I don't think it's a problem with the kernel since I tried 4 kernels,
including a default debian one.
- Alignment seems ok, I made sure the partition start sector was divisible by 512:
/dev/sda2 502272 52930559 26214144 83 Linux
- I tried another brand new btrfs, and things are even slower now.
gandalfthegreat:/mnt/mnt2# mount -o ssd,discard,noatime /dev/sda2 /mnt/mnt2
gandalfthegreat:/mnt/mnt2# reset_cache
gandalfthegreat:/mnt/mnt2# time du -sh src/
514M src/
real 0m29.584s
gandalfthegreat:/mnt/mnt2# find src/| wc -l
15261
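(For reference: reset_cache is a local helper, not a standard tool; given
the explicit drop_caches runs earlier in the thread, it is presumably
something like:

reset_cache() {
    sync
    echo 3 > /proc/sys/vm/drop_caches
}
)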
This is bad enough that there ought to be a way to debug this, right?
Can you suggest something?
Thanks,
Marc
> On an _unencrypted_ partition on the SSD, running du -sh on a directory
> with 15K files takes 23 seconds, versus 4 seconds on an encrypted
> spinning drive, both with a similar btrfs filesystem and the same
> kernel (3.4.4).
>
> Unencrypted btrfs on SSD:
> gandalfthegreat:~# mount -o compress=lzo,discard,nossd,space_cache,noatime /dev/sda2 /mnt/mnt2
> gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
> 514M src
> real 0m22.667s
>
> Encrypted btrfs on spinning drive of the same src directory:
> gandalfthegreat:/var/local# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
> 514M src
> real 0m3.881s
>
> I've run this many times and get the same numbers.
> I've tried deadline and noop on /dev/sda (the SSD) and du is just as slow.
>
> I also tried with:
> - space_cache and nospace_cache
> - ssd and nossd
> - noatime didn't seem to help even though I was hopeful on this one.
>
> In all cases, I get:
> gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
> 514M src
> real 0m22.537s
>
>
> I'm having the same slow speed on 2 btrfs filesystems on the same SSD.
> One is encrypted, the other one isn't:
> Label: 'btrfs_pool1' uuid: d570c40a-4a0b-4d03-b1c9-cff319fc224d
> Total devices 1 FS bytes used 144.74GB
> devid 1 size 441.70GB used 195.04GB path /dev/dm-0
>
> Label: 'boot' uuid: 84199644-3542-430a-8f18-a5aa58959662
> Total devices 1 FS bytes used 2.33GB
> devid 1 size 25.00GB used 5.04GB path /dev/sda2
>
> If instead of stating a bunch of files, I try reading a big file, I do get speeds
> that are quite fast (253MB/s and 423MB/s).
>
> 22 seconds for 15K files on an SSD is super slow, and 5 times slower
> than a spinning disk with the same data.
> What's going on?
>
> Thanks,
> Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
From: Martin Steigerwald @ 2012-07-24 7:56 UTC
To: linux-btrfs; +Cc: Marc MERLIN
Am Montag, 23. Juli 2012 schrieb Marc MERLIN:
> I just realized that the older thread got a bit confusing, so I'll keep
> problems separate and make things simpler :)
>
> On an _unencrypted_ partition on the SSD, running du -sh on a directory
> with 15K files takes 23 seconds, versus 4 seconds on an encrypted
> spinning drive, both with a similar btrfs filesystem and the same
> kernel (3.4.4).
>
> Unencrypted btrfs on SSD:
> gandalfthegreat:~# mount -o
> compress=lzo,discard,nossd,space_cache,noatime /dev/sda2 /mnt/mnt2
> gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du
> -sh src 514M src
> real 0m22.667s
>
> Encrypted btrfs on spinning drive of the same src directory:
> gandalfthegreat:/var/local# echo 3 > /proc/sys/vm/drop_caches; time du
> -sh src 514M src
> real 0m3.881s
find is fast, du is much slower:
merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( find /usr | wc -l )
404166
( find /usr | wc -l; ) 0,03s user 0,07s system 1% cpu 9,212 total
merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( du -sh /usr )
11G /usr
( du -sh /usr; ) 1,00s user 19,07s system 41% cpu 48,886 total
Now I try to find something with fewer files.
merkaba:~> find /usr/share/doc | wc -l
50715
merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( find /usr/share/doc
| wc -l )
50715
( find /usr/share/doc | wc -l; ) 0,00s user 0,02s system 1% cpu 1,398
total
merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( du -sh
/usr/share/doc )
606M /usr/share/doc
( du -sh /usr/share/doc; ) 0,20s user 3,63s system 35% cpu 10,691 total
merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time du -sh /usr/share/doc
606M /usr/share/doc
du -sh /usr/share/doc 0,19s user 3,54s system 35% cpu 10,386 total
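The gap is plausible: find can usually get away with getdents() plus the
d_type hint, while du has to stat() every entry, and on a cold cache each
stat can mean another btree block read. A quick way to see the syscall
difference (a sketch, not from the thread; strace -c prints its summary
on stderr):

strace -c find /usr/share/doc > /dev/null
strace -c du -sh /usr/share/doc > /dev/null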
Anyway, that's still much faster than your measurements.
merkaba:~> df -hT /usr
Filesystem Type Size Used Avail Use% Mounted on
/dev/dm-0 btrfs 19G 11G 5,6G 67% /
merkaba:~> btrfs fi sh
failed to read /dev/sr0
Label: 'debian' uuid: […]
Total devices 1 FS bytes used 10.25GB
devid 1 size 18.62GB used 18.62GB path /dev/dm-0
Btrfs Btrfs v0.19
merkaba:~> btrfs fi df /
Data: total=15.10GB, used=9.59GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=1.75GB, used=670.43MB
Metadata: total=8.00MB, used=0.00
merkaba:~> grep btrfs /proc/mounts
/dev/dm-0 / btrfs rw,noatime,compress=lzo,ssd,space_cache,inode_cache 0 0
Somewhat aged BTRFS filesystem on ThinkPad T520, Intel SSD 320, kernel
3.5.
Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
From: Marc MERLIN @ 2012-07-27 4:40 UTC
To: Martin Steigerwald; +Cc: linux-btrfs
On Tue, Jul 24, 2012 at 09:56:26AM +0200, Martin Steigerwald wrote:
> find is fast, du is much slower:
>
> merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( find /usr | wc -l )
> 404166
> ( find /usr | wc -l; ) 0,03s user 0,07s system 1% cpu 9,212 total
> merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( du -sh /usr )
> 11G /usr
> ( du -sh /usr; ) 1,00s user 19,07s system 41% cpu 48,886 total
You're right.
gandalfthegreat [mc]# time du -sh src
514M src
real 0m25.159s
gandalfthegreat [mc]# reset_cache
gandalfthegreat [mc]# time bash -c "find src | wc -l"
15261
real 0m7.614s
But find is still slower than du on my spinning disk for the same tree.
> Anyway thats still much faster than your measurements.
Right. Your numbers look reasonable for an SSD.
Since then, I did some more tests and I'm also getting slower than normal
speeds with ext4, an indication that it's a problem with the block layer.
I'm working with some folks to try to pin down the core problem, but it
looks like it's not an easy one.
Thanks for your data.
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
From: Chris Mason @ 2012-07-27 11:08 UTC
To: Marc MERLIN; +Cc: linux-btrfs@vger.kernel.org
On Mon, Jul 23, 2012 at 12:42:03AM -0600, Marc MERLIN wrote:
>
> 22 seconds for 15K files on an SSD is super slow, and 5 times slower
> than a spinning disk with the same data.
> What's going on?
Hi Marc,
The easiest way to figure this out is with latencytop. I'd either run the
latencytop GUI or use the latencytop -c patch, which sends a text dump to
the console.
This is assuming that you're not pegged at 100% CPU...
https://oss.oracle.com/~mason/latencytop.patch
-chris
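For reference, a plausible sequence for the suggested run (the build
steps are assumptions, only the -c text-dump mode is described above;
latencytop is built from source here):

wget https://oss.oracle.com/~mason/latencytop.patch
cd latencytop-*                       # unpacked latencytop source tree
patch -p1 < ../latencytop.patch       # patch level may differ
make
./latencytop -c > /tmp/latencytop.txt # capture while du -sh runs in another shell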
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
From: Marc MERLIN @ 2012-07-27 18:42 UTC
To: Chris Mason, linux-btrfs@vger.kernel.org
On Fri, Jul 27, 2012 at 07:08:35AM -0400, Chris Mason wrote:
> On Mon, Jul 23, 2012 at 12:42:03AM -0600, Marc MERLIN wrote:
> >
> > 22 seconds for 15K files on an SSD is super slow, and 5 times slower
> > than a spinning disk with the same data.
> > What's going on?
>
> Hi Marc,
>
> The easiest way to figure this out is with latencytop. I'd either run the
> latencytop GUI or use the latencytop -c patch, which sends a text dump to
> the console.
>
> This is assuming that you're not pegged at 100% CPU...
>
> https://oss.oracle.com/~mason/latencytop.patch
Thanks for the patch, and yes, I can confirm I'm definitely not pegged on
CPU (not even close, and I get the same problem with an unencrypted
filesystem; actually, du -sh is exactly the same speed on encrypted and
unencrypted).
Here's the result I think you were looking for. I'm not good at reading this,
but hopefully it tells you something useful :)
The full run is here if that helps:
http://marc.merlins.org/tmp/latencytop.txt
Process du (6748) Total: 4280.5 msec
Reading directory content 15.2 msec 11.0 %
sleep_on_page wait_on_page_bit read_extent_buffer_pages
btree_read_extent_buffer_pages.constprop.110 read_tree_block
read_block_for_search.isra.32 btrfs_next_leaf
btrfs_real_readdir vfs_readdir sys_getdents system_call_fastpath
[sleep_on_page] 13.5 msec 88.2 %
sleep_on_page wait_on_page_bit read_extent_buffer_pages
btree_read_extent_buffer_pages.constprop.110 read_tree_block
read_block_for_search.isra.32 btrfs_search_slot
btrfs_lookup_csum __btrfs_lookup_bio_sums
btrfs_lookup_bio_sums btrfs_submit_compressed_read btrfs_submit_bio_hook
Page fault 12.9 msec 0.6 %
sleep_on_page_killable wait_on_page_bit_killable
__lock_page_or_retry filemap_fault __do_fault handle_pte_fault
handle_mm_fault do_page_fault page_fault
Executing a program 7.1 msec 0.2 %
sleep_on_page_killable __lock_page_killable
generic_file_aio_read do_sync_read vfs_read kernel_read
prepare_binprm do_execve_common.isra.27 do_execve sys_execve
stub_execve
Process du (6748) Total: 9517.4 msec
[sleep_on_page] 23.0 msec 82.8 %
sleep_on_page wait_on_page_bit read_extent_buffer_pages
btree_read_extent_buffer_pages.constprop.110 read_tree_block
read_block_for_search.isra.32 btrfs_search_slot
btrfs_lookup_inode btrfs_iget btrfs_lookup_dentry btrfs_lookup
__lookup_hash
Reading directory content 13.2 msec 17.2 %
sleep_on_page wait_on_page_bit read_extent_buffer_pages
btree_read_extent_buffer_pages.constprop.110 read_tree_block
read_block_for_search.isra.32 btrfs_search_slot
btrfs_real_readdir vfs_readdir sys_getdents system_call_fastpath
Process du (6748) Total: 9524.0 msec
[sleep_on_page] 17.1 msec 88.5 %
sleep_on_page wait_on_page_bit read_extent_buffer_pages
btree_read_extent_buffer_pages.constprop.110 read_tree_block
read_block_for_search.isra.32 btrfs_search_slot
btrfs_lookup_inode btrfs_iget btrfs_lookup_dentry btrfs_lookup
__lookup_hash
Reading directory content 16.0 msec 11.5 %
sleep_on_page wait_on_page_bit read_extent_buffer_pages
btree_read_extent_buffer_pages.constprop.110 read_tree_block
read_block_for_search.isra.32 btrfs_search_slot
btrfs_real_readdir vfs_readdir sys_getdents system_call_fastpath
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
From: Marc MERLIN @ 2012-08-01 6:01 UTC
To: Chris Mason, linux-btrfs@vger.kernel.org, Ted Ts'o
Cc: Lukáš Czerner, linux-ext4, axboe, Milan Broz
On Fri, Jul 27, 2012 at 11:42:39AM -0700, Marc MERLIN wrote:
> > https://oss.oracle.com/~mason/latencytop.patch
>
> Thanks for the patch, and yes, I can confirm I'm definitely not pegged on
> CPU (not even close, and I get the same problem with an unencrypted
> filesystem; actually, du -sh is exactly the same speed on encrypted and
> unencrypted).
>
> Here's the result I think you were looking for. I'm not good at reading this,
> but hopefully it tells you something useful :)
>
> The full run is here if that helps:
> http://marc.merlins.org/tmp/latencytop.txt
I did some other tests since last week; my laptop is hard to use
considering how slow the SSD is.
(TL;DR: ntfs on linux via fuse is 33% faster than ext4, which is 2x faster
than btrfs, but 3x slower than the same filesystem on spinning disk :( )
Ok, just to help with debugging this,
1) I put my samsung 830 SSD into another thinkpad and it wasn't faster or
slower.
2) Then I put in a Crucial C300 256GB SSD (the replacement for the one I
had that just died and killed all my data), and du took 0.3 seconds on
both my old and new thinkpads.
The old thinkpad is running ubuntu 32bit, the new one debian testing
64bit, both with kernel 3.4.4.
So, clearly, there is something wrong with the samsung 830 SSD with
linux, but I have no clue what :(
In raw speed (dd) the samsung is faster than the crucial (350MB/s vs
500MB/s).
If it were a random crappy SSD from a random vendor, I'd blame the SSD,
but I have a hard time believing that samsung is selling SSDs that are
slower than hard drives at random IO and 'seeks'.
3) I just got a 2nd ssd from samsung (same kind), just to make sure the one
I had wasn't bad. It's brand new, and I formatted it carefully on 512
boundaries:
/dev/sda1 2048 502271 250112 83 Linux
/dev/sda2 502272 52930559 26214144 7 HPFS/NTFS/exFAT
/dev/sda3 52930560 73902079 10485760 82 Linux swap / Solaris
/dev/sda4 73902080 1000215215 463156568 83 Linux
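For reference, that boundary can be double-checked via sysfs, which
reports partition starts in 512-byte sectors (502272 is the sda2 start
shown above):

cat /sys/block/sda/sda2/start    # prints 502272
echo $((502272 % 512))           # prints 0, i.e. on a 512-sector boundary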
I also upgraded to 3.5.0 in the meantime but unfortunately the results are
similar.
First: btrfs is the slowest:
gandalfthegreat:/mnt/ssd/var/local# time du -sh src/
514M src/
real 0m25.741s
gandalfthegreat:/mnt/ssd/var/local# grep /mnt/ssd/var /proc/mounts
/dev/mapper/ssd /mnt/ssd/var btrfs rw,noatime,compress=lzo,ssd,discard,space_cache 0 0
Second: ext4 is 2x faster than btrfs with mkfs.ext4 -O extent -b 4096 /dev/sda3
gandalfthegreat:/mnt/mnt3# reset_cache
gandalfthegreat:/mnt/mnt3# time du -sh src/
519M src/
real 0m12.459s
gandalfthegreat:~# grep mnt3 /proc/mounts
/dev/sda3 /mnt/mnt3 ext4 rw,noatime,discard,data=ordered 0 0
Third: a freshly made ntfs filesystem through fuse is actually FASTER!
gandalfthegreat:/mnt/mnt2# reset_cache
gandalfthegreat:/mnt/mnt2# time du -sh src/
506M src/
real 0m8.928s
gandalfthegreat:/mnt/mnt2# grep mnt2 /proc/mounts
/dev/sda2 /mnt/mnt2 fuseblk rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other,blksize=4096 0 0
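(For reference, the ntfs filesystem was presumably created and mounted
with something like the following; the exact mkntfs options are not given
in the thread:

mkntfs -Q /dev/sda2              # -Q: quick format, skips zeroing
mount -t ntfs-3g /dev/sda2 /mnt/mnt2
)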
How can ntfs via fuse be the fastest and btrfs so slow?
Of course, all 3 are slower than the same filesystem on spinning disk too, but
I'm wondering if there is a scheduling issue that is somehow causing the
extreme slowness I'm seeing.
Did the latencytop trace I got help in any way?
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
From: Fajar A. Nugraha @ 2012-08-01 6:08 UTC
To: Marc MERLIN; +Cc: linux-btrfs@vger.kernel.org
On Wed, Aug 1, 2012 at 1:01 PM, Marc MERLIN <marc@merlins.org> wrote:
> So, clearly, there is something wrong with the samsung 830 SSD with linux
> If it were a random crappy SSD from a random vendor, I'd blame the SSD, but
> I have a hard time believing that samsung is selling SSDs that are slower
> than hard drives at random IO and 'seeks'.
You'd be surprised at how badly some vendors can screw up :)
> First: btrfs is the slowest:
> gandalfthegreat:/mnt/ssd/var/local# grep /mnt/ssd/var /proc/mounts
> /dev/mapper/ssd /mnt/ssd/var btrfs rw,noatime,compress=lzo,ssd,discard,space_cache 0 0
Just checking: did you explicitly activate "discard"? Because on my
setup (with a corsair SSD) it made things MUCH slower. Also, try adding
"noatime" (just in case the slowdown was because "du" causes many
access time updates).
--
Fajar
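For reference, the usual alternative to the discard mount option is batch
trimming from userspace; a sketch (fstrim is in util-linux):

mount -o remount,nodiscard /mnt/mnt2
fstrim -v /mnt/mnt2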
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
From: Marc MERLIN @ 2012-08-01 6:21 UTC
To: Fajar A. Nugraha; +Cc: linux-btrfs@vger.kernel.org
On Wed, Aug 01, 2012 at 01:08:46PM +0700, Fajar A. Nugraha wrote:
> > If it were a random crappy SSD from a random vendor, I'd blame the SSD, but
> > I have a hard time believing that samsung is selling SSDs that are slower
> > than hard drives at random IO and 'seeks'.
>
> You'd be surprised at how badly some vendors can screw up :)
At some point, it may come down to that indeed :-/
I'm still hopeful that Samsung didn't, but we'll see.
> > First: btrfs is the slowest:
>
> > gandalfthegreat:/mnt/ssd/var/local# grep /mnt/ssd/var /proc/mounts
> > /dev/mapper/ssd /mnt/ssd/var btrfs rw,noatime,compress=lzo,ssd,discard,space_cache 0 0
>
> Just checking: did you explicitly activate "discard"? Because on my
Yes. Note that it should be a noop when all you're doing is stating
inodes and not writing (I'm using noatime).
> setup (with a corsair SSD) it made things MUCH slower. Also, try adding
> "noatime" (just in case the slowdown was because "du" causes many
> access time updates).
I have noatime in there already :)
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
From: Chris Samuel @ 2012-08-01 6:36 UTC
To: Marc MERLIN
Cc: Chris Mason, linux-btrfs@vger.kernel.org, Ted Ts'o,
Lukáš Czerner, linux-ext4, axboe, Milan Broz
On 01/08/12 16:01, Marc MERLIN wrote:
> Third: a freshly made ntfs filesystem through fuse is actually FASTER!
Could it be that Samsung's FTL has optimisations in it for NTFS?
A horrible thought, but not impossible..
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
From: Marc MERLIN @ 2012-08-01 6:40 UTC
To: Chris Samuel
Cc: Chris Mason, linux-btrfs@vger.kernel.org, Ted Ts'o,
Lukáš Czerner, linux-ext4, axboe, Milan Broz
On Wed, Aug 01, 2012 at 04:36:22PM +1000, Chris Samuel wrote:
> On 01/08/12 16:01, Marc MERLIN wrote:
>
> > Third: a freshly made ntfs filesystem through fuse is actually FASTER!
>
> Could it be that Samsung's FTL has optimisations in it for NTFS?
>
> A horrible thought, but not impossible..
Not impossible, but it can't be the main reason, since it still clocks 2x
slower with ntfs than a spinning disk with encrypted btrfs.
Since SSDs should "seek" 10-100x faster than spinning disks, that can't be
the only reason.
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
From: Martin Steigerwald @ 2012-08-01 21:57 UTC
To: linux-btrfs; +Cc: Marc MERLIN, Fajar A. Nugraha
Hi Marc,
Am Mittwoch, 1. August 2012 schrieb Marc MERLIN:
> On Wed, Aug 01, 2012 at 01:08:46PM +0700, Fajar A. Nugraha wrote:
> > > If it were a random crappy SSD from a random vendor, I'd blame the
> > > SSD, but I have a hard time believing that samsung is selling SSDs
> > > that are slower than hard drives at random IO and 'seeks'.
> >
> > You'd be surprised at how badly some vendors can screw up :)
>
> At some point, it may come down to that indeed :-/
> I'm still hopefully that Samsung didn't, but we'll see.
It's getting quite strange.
I lost track of whether you did that already or not, but if you didn't,
please post some
vmstat 1
iostat -xd 1
on the device while it is being slow.
I am interested in wait I/O and latencies and disk utilization.
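A simple way to capture both for exactly the duration of the slow run (a
sketch; adjust device and path):

vmstat 1 > /tmp/vmstat.log &
iostat -xd 1 /dev/sda > /tmp/iostat.log &
echo 3 > /proc/sys/vm/drop_caches; time du -sh src
kill %1 %2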
Comparison data of Intel SSD 320 in ThinkPad T520 during
merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; du -sch /usr
on BTRFS with Kernel 3.5:
martin@merkaba:~> vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 1 21556 4442668 2056 502352 0 0 194 85 247 120 11 2 87 0
1 2 21556 4408888 2448 514884 0 0 11684 328 4975 24585 5 16 65 14
1 0 21556 4389880 2448 528060 0 0 13400 0 4574 23452 2 16 68 14
3 1 21556 4370068 2448 545052 0 0 18132 0 5499 27220 1 18 64 16
1 0 21556 4350228 2448 555580 0 0 10856 0 4122 25339 3 16 67 14
1 1 21556 4315604 2448 569756 0 0 12648 0 4647 31153 5 14 66 15
0 1 21556 4295652 2456 581480 0 0 11548 56 4093 24618 2 13 69 16
0 1 21556 4286720 2456 591580 0 0 10824 0 3750 21445 1 12 71 16
0 1 21556 4266308 2456 603620 0 0 12932 0 4841 26447 4 12 68 17
1 0 21556 4248228 2456 613808 0 0 10264 4 3703 22108 1 13 71 15
5 1 21556 4231976 2456 624356 0 0 10540 0 3581 20436 1 10 72 17
0 1 21556 4197168 2456 639108 0 0 12952 0 4738 28223 4 15 66 15
4 1 21556 4178456 2456 650552 0 0 11656 0 4234 23480 2 14 68 16
0 1 21556 4163616 2456 662992 0 0 13652 0 4619 26580 1 16 70 13
4 1 21556 4138288 2456 675696 0 0 13352 0 4422 22254 1 16 70 13
1 0 21556 4113204 2456 689060 0 0 13232 0 4312 21936 1 15 70 14
0 1 21556 4085532 2456 704160 0 0 14972 0 4820 24238 1 16 69 14
2 0 21556 4055740 2456 719644 0 0 15736 0 5099 25513 3 17 66 14
0 1 21556 4028612 2456 734380 0 0 14504 0 4795 25052 3 15 68 14
2 0 21556 3999108 2456 749040 0 0 14656 16 4672 21878 1 17 69 13
1 1 21556 3972732 2456 762108 0 0 12972 0 4717 22411 1 17 70 13
5 0 21556 3949684 2584 773484 0 0 11528 52 4837 24107 3 15 67 15
1 0 21556 3912504 2584 787420 0 0 12156 0 4883 25201 4 15 67 14
martin@merkaba:~> iostat -xd 1 /dev/sda
Linux 3.5.0-tp520 (merkaba) 01.08.2012 _x86_64_ (4 CPU)
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 1,29 1,44 11,58 12,78 684,74 299,75 80,81 0,24 9,86 0,95 17,93 0,29 0,71
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 2808,00 0,00 11232,00 0,00 8,00 0,57 0,21 0,21 0,00 0,19 54,50
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 2967,00 0,00 11868,00 0,00 8,00 0,63 0,21 0,21 0,00 0,21 60,90
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 11,00 2992,00 4,00 11968,00 56,00 8,03 0,64 0,22 0,22 0,25 0,21 62,00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 2680,00 0,00 10720,00 0,00 8,00 0,70 0,26 0,26 0,00 0,25 66,70
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 3153,00 0,00 12612,00 0,00 8,00 0,72 0,23 0,23 0,00 0,22 69,30
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 2769,00 0,00 11076,00 0,00 8,00 0,63 0,23 0,23 0,00 0,21 58,00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 2523,00 1,00 10092,00 4,00 8,00 0,74 0,29 0,29 0,00 0,28 71,30
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 3026,00 0,00 12104,00 0,00 8,00 0,73 0,24 0,24 0,00 0,21 64,00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 3069,00 0,00 12276,00 0,00 8,00 0,67 0,22 0,22 0,00 0,20 62,00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 3346,00 0,00 13384,00 0,00 8,00 0,64 0,19 0,19 0,00 0,18 59,90
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 3188,00 0,00 12752,00 0,00 8,00 0,80 0,25 0,25 0,00 0,17 54,00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 3433,00 0,00 13732,00 0,00 8,00 1,03 0,30 0,30 0,00 0,17 57,00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 3565,00 0,00 14260,00 0,00 8,00 0,92 0,26 0,26 0,00 0,16 57,30
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 3972,00 0,00 15888,00 0,00 8,00 1,13 0,29 0,29 0,00 0,16 62,90
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 3743,00 0,00 14972,00 0,00 8,00 1,03 0,28 0,28 0,00 0,16 59,40
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 3408,00 0,00 13632,00 0,00 8,00 1,08 0,32 0,32 0,00 0,17 56,70
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 3730,00 3,00 14920,00 16,00 8,00 1,14 0,31 0,31 0,00 0,15 56,30
I also suggest using fio with the ssd-test example on the SSD. I
have some comparison data available for my setup. Heck, it should be
publicly available in my ADMIN magazine article about fio. I used slightly
different fio jobs with block sizes of 2k to 16k, but it's similar enough,
and I might even have some 4k examples at hand or can easily create one.
I also raised size and duration a bit.
An example based on what's in my article:
[global]
ioengine=libaio
direct=1
iodepth=64
runtime=60
filename=testfile
size=2G
bsrange=2k-16k
refill_buffers=1
[randomwrite]
stonewall
rw=randwrite
[sequentialwrite]
stonewall
rw=write
[randomread]
stonewall
rw=randread
[sequentialread]
stonewall
rw=read
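For reference, to run it, save the job above to a file (the name is
arbitrary) and point fio at it from a directory on the filesystem under
test:

fio ssd-test.job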
Remove bsrange if you want 4k blocks only. I put the writes above the
reads because the writes with refill_buffers=1 initialize the testfile
with random data; otherwise it would contain only zeros, which modern
SandForce controllers compress nicely. The above job is untested,
but it should do.
Please remove the test file, and fstrim your disk after having run any
write tests unless you have the discard mount option on. (Not necessarily
after each one ;)
Also I am interested in
merkaba:~> hdparm -I /dev/sda | grep -i queue
Queue depth: 32
* Native Command Queueing (NCQ)
output for your SSD.
Try to load the ssd with different iodepths up to twice the amount
displayed by hdparm.
But note: a du -sch will not reach that high an iodepth. I expect it to
go as low as an iodepth of one in that case. You can see in my comparison
output that a single du -sch is not able to load the Intel SSD 320 fully:
load is just about 50-70%, and iowait is just about 10-20%.
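To mimic a du-style load more closely, a single-depth random-read variant
can be run against the same testfile; a sketch (an assumption-based
example, not from the article: 4k blocks, iodepth=1):

cat > du-like.job <<'EOF'
[global]
ioengine=libaio
direct=1
iodepth=1
filename=testfile
size=2G
bs=4k
[du-like-randread]
rw=randread
runtime=60
EOF
fio du-like.job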
Thanks,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
From: Marc MERLIN @ 2012-08-02 5:07 UTC
To: Martin Steigerwald; +Cc: linux-btrfs, Fajar A. Nugraha
On Wed, Aug 01, 2012 at 11:57:39PM +0200, Martin Steigerwald wrote:
> It's getting quite strange.
I would agree :)
Before I paste a bunch of things, I wanted to thank you for not giving up
on me and offering your time to help me figure this out :)
> > I lost track of whether you did that already or not, but if you didn't,
> > please post some
>
> vmstat 1
> iostat -xd 1
> on the device while it is being slow.
Sure thing, here's the 24 second du -s run:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 1 0 2747264 44 348388 0 0 28 50 242 184 19 6 74 1
1 0 0 2744128 44 351700 0 0 144 0 2758 32115 30 5 61 4
2 1 0 2743100 44 351992 0 0 792 0 2616 30613 28 4 50 18
1 1 0 2741592 44 352668 0 0 776 0 2574 31551 29 4 45 21
1 1 0 2740720 44 353432 0 0 692 0 2734 32891 30 4 45 22
1 1 0 2740104 44 354284 0 0 460 0 2639 31585 30 4 45 21
3 1 0 2738520 44 354692 0 0 544 264 2834 30302 32 5 42 21
1 1 0 2736936 44 355476 0 0 1064 2012 2867 31172 28 4 45 23
3 1 0 2734892 44 356840 0 0 1236 0 2865 31100 30 3 46 21
3 1 0 2733840 44 357432 0 0 636 12 2930 31362 30 4 45 21
3 1 0 2735152 44 357892 0 0 612 0 2546 31641 28 4 46 22
3 1 0 2733608 44 358680 0 0 668 0 2846 29747 30 3 46 22
2 1 0 2731480 44 359556 0 0 712 248 2744 27939 31 3 44 22
2 1 0 2732688 44 360736 0 0 1204 0 2774 30979 30 3 46 21
4 1 0 2730276 44 361872 0 0 872 256 3174 31657 30 5 45 20
1 1 0 2728372 44 363008 0 0 912 132 2752 31811 30 3 46 22
1 1 0 2727060 44 363240 0 0 476 0 2595 30313 33 4 43 21
1 1 0 2725856 44 364000 0 0 580 0 2547 30803 29 4 46 21
1 1 0 2723628 44 364724 0 0 756 272 2766 30652 31 4 44 22
2 1 0 2722128 44 365616 0 0 744 8 2771 31760 30 4 45 21
2 1 0 2721024 44 366284 0 0 788 196 2695 31819 31 4 44 22
3 1 0 2719312 44 367084 0 0 732 0 2737 32038 30 5 45 20
1 1 0 2717436 44 367880 0 0 728 0 2570 30810 29 3 46 22
3 1 0 2716820 44 368500 0 0 608 0 2614 30949 29 4 45 22
2 1 0 2715576 44 369240 0 0 628 264 2715 31180 34 4 40 22
1 1 0 2714308 44 369952 0 0 624 0 2533 31155 28 4 46 22
Linux 3.5.0-amd64-preempt-noide-20120410 (gandalfthegreat) 08/01/2012 _x86_64_ (4 CPU)
rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
2.18 0.68 6.45 17.77 78.12 153.39 19.12 0.18 7.51 8.52 7.15 4.46 10.81
0.00 0.00 118.00 0.00 540.00 0.00 9.15 1.18 9.93 9.93 0.00 4.98 58.80
0.00 0.00 217.00 0.00 868.00 0.00 8.00 1.90 8.77 8.77 0.00 4.44 96.40
0.00 0.00 192.00 0.00 768.00 0.00 8.00 1.63 8.44 8.44 0.00 5.10 98.00
0.00 0.00 119.00 0.00 476.00 0.00 8.00 1.06 9.01 9.01 0.00 8.20 97.60
0.00 0.00 125.00 0.00 500.00 0.00 8.00 1.08 8.67 8.67 0.00 7.55 94.40
0.00 0.00 165.00 0.00 660.00 0.00 8.00 1.50 9.12 9.12 0.00 5.87 96.80
0.00 0.00 195.00 13.00 780.00 272.00 10.12 1.68 8.10 7.94 10.46 4.65 96.80
0.00 0.00 173.00 0.00 692.00 0.00 8.00 1.72 9.87 9.87 0.00 5.71 98.80
0.00 0.00 171.00 0.00 684.00 0.00 8.00 1.62 9.33 9.33 0.00 5.75 98.40
0.00 0.00 161.00 0.00 644.00 0.00 8.00 1.52 9.57 9.57 0.00 6.14 98.80
0.00 0.00 136.00 0.00 544.00 0.00 8.00 1.26 9.29 9.29 0.00 7.24 98.40
0.00 0.00 199.00 0.00 796.00 0.00 8.00 1.94 9.73 9.73 0.00 4.94 98.40
0.00 0.00 201.00 0.00 804.00 0.00 8.00 1.70 8.54 8.54 0.00 4.80 96.40
0.00 0.00 272.00 15.00 1088.00 272.00 9.48 2.35 8.21 8.46 3.73 3.39 97.20
0.00 0.00 208.00 1.00 832.00 4.00 8.00 1.90 9.01 9.04 4.00 4.67 97.60
0.00 0.00 228.00 0.00 912.00 0.00 8.00 1.94 8.56 8.56 0.00 4.23 96.40
0.00 0.00 107.00 0.00 428.00 0.00 8.00 0.98 9.05 9.05 0.00 9.12 97.60
0.00 0.00 163.00 0.00 652.00 0.00 8.00 1.60 9.89 9.89 0.00 6.01 98.00
0.00 0.00 162.00 0.00 648.00 0.00 8.00 1.83 10.20 10.20 0.00 6.07 98.40
1.00 10.00 224.00 71.00 900.00 2157.50 20.73 2.82 10.14 9.50 12.17 3.35 98.80
41.00 8.00 166.00 176.00 832.00 333.50 6.82 2.48 7.24 8.48 6.07 2.92 100.00
18.00 0.00 78.00 72.00 388.00 36.00 5.65 2.30 15.28 16.56 13.89 6.67 100.00
13.00 0.00 190.00 25.00 824.00 12.50 7.78 1.95 9.17 9.43 7.20 4.58 98.40
0.00 0.00 156.00 0.00 624.00 0.00 8.00 1.36 8.72 8.72 0.00 6.26 97.60
0.00 0.00 158.00 19.00 632.00 64.00 7.86 1.50 8.45 8.61 7.16 5.60 99.20
0.00 0.00 176.00 0.00 704.00 0.00 8.00 1.65 9.39 9.39 0.00 5.61 98.80
0.00 0.00 44.00 0.00 176.00 0.00 8.00 0.38 8.55 8.55 0.00 8.27 36.40
> I am interested in wait I/O and latencies and disk utilization.
Cool tool, I didn't know about iostat.
My r_await numbers don't look good obviously and yet %util is pretty much
100% the entire time.
Does that show that it's indeed the device that is unable to deliver the requests any quicker, despite
being an ssd, or are you reading this differently?
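The numbers are at least internally consistent: by Little's law the
average queue equals arrival rate times latency, which matches iostat's
own avgqu-sz column above:

avg queue ≈ r/s × r_await ≈ 170/s × 0.009 s ≈ 1.5   (avgqu-sz shows ~1.0-1.9)

So 100% util here just means a single ~9 ms read is in flight nearly all
the time, which is what a one-request-at-a-time du produces; the anomaly
is the ~9 ms itself, versus the ~0.2 ms r_await in the Intel SSD 320
output earlier.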
> Also I am interested in
> merkaba:~> hdparm -I /dev/sda | grep -i queue
> Queue depth: 32
> * Native Command Queueing (NCQ)
> output for your SSD.
gandalfthegreat:/var/local# hdparm -I /dev/sda | grep -i queue
Queue depth: 32
* Native Command Queueing (NCQ)
gandalfthegreat:/var/local#
I've run the fio tests in:
/dev/mapper/cryptroot /var btrfs rw,noatime,compress=lzo,nossd,discard,space_cache 0 0
(discard is there, so fstrim shouldn't be needed)
> I also suggest to use fio with with the ssd-test example on the SSD. I
> have some comparison data available for my setup. Heck it should be
> publicly available in my ADMIN magazine article about fio. I used a bit
> different fio jobs with block sizes of 2k to 16k, but its similar enough
> and I might even have some 4k examples at hand or can easily create one.
> I also raised size and duration a bit.
>
> An example based on whats in my article:
Thanks, here's the output.
I see bw jumping from bw=1700.8KB/s all the way to bw=474684KB/s depending
on the test. That's "interesting" to say the least.
So, doctor, is it bad? :)
randomwrite: (g=0): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentialwrite: (g=1): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
randomread: (g=2): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentialread: (g=3): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
2.0.8
Starting 4 processes
randomwrite: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 1 (f=1): [___R] [100.0% done] [558.8M/0K /s] [63.8K/0 iops] [eta 00m:00s]
randomwrite: (groupid=0, jobs=1): err= 0: pid=7193
write: io=102048KB, bw=1700.8KB/s, iops=189 , runt= 60003msec
slat (usec): min=21 , max=219834 , avg=5250.91, stdev=5936.55
clat (usec): min=25 , max=738932 , avg=329339.45, stdev=106004.63
lat (msec): min=4 , max=751 , avg=334.59, stdev=107.57
clat percentiles (msec):
| 1.00th=[ 225], 5.00th=[ 241], 10.00th=[ 247], 20.00th=[ 260],
| 30.00th=[ 269], 40.00th=[ 281], 50.00th=[ 293], 60.00th=[ 306],
| 70.00th=[ 322], 80.00th=[ 351], 90.00th=[ 545], 95.00th=[ 570],
| 99.00th=[ 627], 99.50th=[ 644], 99.90th=[ 709], 99.95th=[ 725],
| 99.99th=[ 742]
bw (KB/s) : min= 11, max= 2591, per=99.83%, avg=1697.13, stdev=491.48
lat (usec) : 50=0.01%
lat (msec) : 10=0.02%, 20=0.02%, 50=0.05%, 100=0.14%, 250=12.89%
lat (msec) : 500=72.44%, 750=14.43%
cpu : usr=0.30%, sys=2.65%, ctx=9699, majf=0, minf=19
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=0/w=11396/d=0, short=r=0/w=0/d=0
sequentialwrite: (groupid=1, jobs=1): err= 0: pid=7218
write: io=80994KB, bw=1349.7KB/s, iops=149 , runt= 60011msec
slat (usec): min=15 , max=296798 , avg=6658.06, stdev=7414.10
clat (usec): min=15 , max=962319 , avg=418984.76, stdev=128626.12
lat (msec): min=12 , max=967 , avg=425.64, stdev=130.36
clat percentiles (msec):
| 1.00th=[ 269], 5.00th=[ 306], 10.00th=[ 322], 20.00th=[ 338],
| 30.00th=[ 355], 40.00th=[ 371], 50.00th=[ 388], 60.00th=[ 404],
| 70.00th=[ 420], 80.00th=[ 445], 90.00th=[ 603], 95.00th=[ 766],
| 99.00th=[ 881], 99.50th=[ 906], 99.90th=[ 938], 99.95th=[ 947],
| 99.99th=[ 963]
bw (KB/s) : min= 418, max= 1952, per=99.31%, avg=1339.72, stdev=354.39
lat (usec) : 20=0.01%
lat (msec) : 20=0.01%, 50=0.11%, 100=0.27%, 250=0.29%, 500=86.93%
lat (msec) : 750=6.87%, 1000=5.51%
cpu : usr=0.25%, sys=2.32%, ctx=13426, majf=0, minf=21
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.3%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=0/w=8991/d=0, short=r=0/w=0/d=0
randomread: (groupid=2, jobs=1): err= 0: pid=7234
read : io=473982KB, bw=7899.5KB/s, iops=934 , runt= 60005msec
slat (usec): min=2 , max=31048 , avg=1065.57, stdev=3519.63
clat (usec): min=21 , max=246981 , avg=67367.84, stdev=33459.21
lat (msec): min=1 , max=263 , avg=68.43, stdev=33.81
clat percentiles (msec):
| 1.00th=[ 15], 5.00th=[ 25], 10.00th=[ 33], 20.00th=[ 39],
| 30.00th=[ 50], 40.00th=[ 57], 50.00th=[ 61], 60.00th=[ 70],
| 70.00th=[ 76], 80.00th=[ 89], 90.00th=[ 111], 95.00th=[ 139],
| 99.00th=[ 176], 99.50th=[ 190], 99.90th=[ 217], 99.95th=[ 229],
| 99.99th=[ 247]
bw (KB/s) : min= 2912, max=11909, per=100.00%, avg=7900.46, stdev=2255.68
lat (usec) : 50=0.01%
lat (msec) : 2=0.17%, 4=0.16%, 10=0.35%, 20=2.49%, 50=27.44%
lat (msec) : 100=55.05%, 250=14.34%
cpu : usr=0.38%, sys=2.89%, ctx=8425, majf=0, minf=276
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=56073/w=0/d=0, short=r=0/w=0/d=0
sequentialread: (groupid=3, jobs=1): err= 0: pid=7249
read : io=2048.0MB, bw=474684KB/s, iops=52754 , runt= 4418msec
slat (usec): min=1 , max=44869 , avg=15.35, stdev=320.96
clat (usec): min=0 , max=67787 , avg=1145.60, stdev=3487.25
lat (usec): min=2 , max=67801 , avg=1161.14, stdev=3508.06
clat percentiles (usec):
| 1.00th=[ 10], 5.00th=[ 187], 10.00th=[ 217], 20.00th=[ 249],
| 30.00th=[ 278], 40.00th=[ 302], 50.00th=[ 330], 60.00th=[ 362],
| 70.00th=[ 398], 80.00th=[ 450], 90.00th=[ 716], 95.00th=[ 6624],
| 99.00th=[16064], 99.50th=[20096], 99.90th=[44800], 99.95th=[61696],
| 99.99th=[67072]
bw (KB/s) : min=50063, max=635019, per=97.34%, avg=462072.50, stdev=202894.18
lat (usec) : 2=0.65%, 4=0.02%, 10=0.30%, 20=0.24%, 50=0.36%
lat (usec) : 100=0.52%, 250=18.08%, 500=65.03%, 750=4.88%, 1000=0.52%
lat (msec) : 2=0.81%, 4=0.95%, 10=4.72%, 20=2.43%, 50=0.45%
lat (msec) : 100=0.05%
cpu : usr=7.52%, sys=67.56%, ctx=1494, majf=0, minf=277
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=233071/w=0/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
WRITE: io=102048KB, aggrb=1700KB/s, minb=1700KB/s, maxb=1700KB/s, mint=60003msec, maxt=60003msec
Run status group 1 (all jobs):
WRITE: io=80994KB, aggrb=1349KB/s, minb=1349KB/s, maxb=1349KB/s, mint=60011msec, maxt=60011msec
Run status group 2 (all jobs):
READ: io=473982KB, aggrb=7899KB/s, minb=7899KB/s, maxb=7899KB/s, mint=60005msec, maxt=60005msec
Run status group 3 (all jobs):
READ: io=2048.0MB, aggrb=474683KB/s, minb=474683KB/s, maxb=474683KB/s, mint=4418msec, maxt=4418msec
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
From: Martin Steigerwald @ 2012-08-02 11:18 UTC
To: Marc MERLIN; +Cc: linux-btrfs, Fajar A. Nugraha
Am Donnerstag, 2. August 2012 schrieb Marc MERLIN:
> On Wed, Aug 01, 2012 at 11:57:39PM +0200, Martin Steigerwald wrote:
> > It's getting quite strange.
>
> I would agree :)
>
> Before I paste a bunch of things, I wanted to thank you for not giving up
> on me and offering your time to help me figure this out :)
You are welcome.
Well, I hold Linux performance analysis & tuning trainings, and I am
really interested in issues like this ;)
I will take care of myself and take my time to respond, or even stop
responding, if I run out of ideas ;)
> > I lost track of whether you did that already or not, but if you didn't,
> > please post some
> >
> > vmstat 1
> > iostat -xd 1
> > on the device while it is being slow.
>
> Sure thing, here's the 24 second du -s run:
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 2 1 0 2747264 44 348388 0 0 28 50 242 184 19 6 74 1
> 1 0 0 2744128 44 351700 0 0 144 0 2758 32115 30 5 61 4
> 2 1 0 2743100 44 351992 0 0 792 0 2616 30613 28 4 50 18
> 1 1 0 2741592 44 352668 0 0 776 0 2574 31551 29 4 45 21
> 1 1 0 2740720 44 353432 0 0 692 0 2734 32891 30 4 45 22
> 1 1 0 2740104 44 354284 0 0 460 0 2639 31585 30 4 45 21
> 3 1 0 2738520 44 354692 0 0 544 264 2834 30302 32 5 42 21
> 1 1 0 2736936 44 355476 0 0 1064 2012 2867 31172 28 4 45 23
A bit more wait I/O with not even 10% of the throughput compared to
the Intel SSD 320 figures. Seems that the Intel SSD is running circles
around your Samsung SSD while not, as expected for that use case, being
fully utilized.
> Linux 3.5.0-amd64-preempt-noide-20120410 (gandalfthegreat) 08/01/2012 _x86_64_ (4 CPU)
> rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> 2.18 0.68 6.45 17.77 78.12 153.39 19.12 0.18 7.51 8.52 7.15 4.46 10.81
> 0.00 0.00 118.00 0.00 540.00 0.00 9.15 1.18 9.93 9.93 0.00 4.98 58.80
> 0.00 0.00 217.00 0.00 868.00 0.00 8.00 1.90 8.77 8.77 0.00 4.44 96.40
> 0.00 0.00 192.00 0.00 768.00 0.00 8.00 1.63 8.44 8.44 0.00 5.10 98.00
> 0.00 0.00 119.00 0.00 476.00 0.00 8.00 1.06 9.01 9.01 0.00 8.20 97.60
> 0.00 0.00 125.00 0.00 500.00 0.00 8.00 1.08 8.67 8.67 0.00 7.55 94.40
> 0.00 0.00 165.00 0.00 660.00 0.00 8.00 1.50 9.12 9.12 0.00 5.87 96.80
> 0.00 0.00 195.00 13.00 780.00 272.00 10.12 1.68 8.10 7.94 10.46 4.65 96.80
> 0.00 0.00 173.00 0.00 692.00 0.00 8.00 1.72 9.87 9.87 0.00 5.71 98.80
> 0.00 0.00 171.00 0.00 684.00 0.00 8.00 1.62 9.33 9.33 0.00 5.75 98.40
> 0.00 0.00 161.00 0.00 644.00 0.00 8.00 1.52 9.57 9.57 0.00 6.14 98.80
> 0.00 0.00 136.00 0.00 544.00 0.00 8.00 1.26 9.29 9.29 0.00 7.24 98.40
> 0.00 0.00 199.00 0.00 796.00 0.00 8.00 1.94 9.73 9.73 0.00 4.94 98.40
> 0.00 0.00 201.00 0.00 804.00 0.00 8.00 1.70 8.54 8.54 0.00 4.80 96.40
> 0.00 0.00 272.00 15.00 1088.00 272.00 9.48 2.35 8.21 8.46 3.73 3.39 97.20
[…]
> > I am interested in wait I/O and latencies and disk utilization.
>
> Cool tool, I didn't know about iostat.
> My r_await numbers don't look good obviously and yet %util is pretty much
> 100% the entire time.
>
> Does that show that it's indeed the device that is unable to deliver the requests any quicker, despite
> being an ssd, or are you reading this differently?
That, or…
> > Also I am interested in
> > merkaba:~> hdparm -I /dev/sda | grep -i queue
> > Queue depth: 32
> > * Native Command Queueing (NCQ)
> > output for your SSD.
>
> gandalfthegreat:/var/local# hdparm -I /dev/sda | grep -i queue
> Queue depth: 32
> * Native Command Queueing (NCQ)
> gandalfthegreat:/var/local#
>
> I've run the fio tests in:
> /dev/mapper/cryptroot /var btrfs rw,noatime,compress=lzo,nossd,discard,space_cache 0 0
… you are still using dm_crypt?
Please test without dm_crypt. My figures are from within LVM, but no
dm_crypt. It's good to have a comparable base for the measurements.
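For reference, a minimal way to get that comparable base (a sketch;
/dev/sdXn stands for a hypothetical scratch partition, and mkfs destroys
its contents):

mkfs.btrfs /dev/sdXn
mount -o noatime /dev/sdXn /mnt/plain
cd /mnt/plain && fio /path/to/ssd-test.job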
> (discard is there, so fstrim shouldn't be needed)
I can't imagine why it should matter, but maybe it's worth having some
tests without "discard".
> > I also suggest using fio with the ssd-test example on the SSD. I
> > have some comparison data available for my setup. Heck, it should be
> > publicly available in my ADMIN magazine article about fio. I used slightly
> > different fio jobs with block sizes of 2k to 16k, but it's similar enough,
> > and I might even have some 4k examples at hand or can easily create one.
> > I also raised size and duration a bit.
> >
> > An example based on what's in my article:
>
> Thanks, here's the output.
> I see bw jumping from bw=1700.8KB/s all the way to bw=474684KB/s depending
> on the test. That's "interesting" to say the least.
>
> So, doctor, is it bad? :)
Well, I do not engineer those SSDs, but to me it looks like either dm_crypt
is wildly hogging performance or the SSD is way slower than what I would
expect. You did some tests without dm_crypt AFAIR, but it would still be
interesting to repeat these fio tests without dm_crypt as well.
> randomwrite: (g=0): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> sequentialwrite: (g=1): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> randomread: (g=2): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> sequentialread: (g=3): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> 2.0.8
> Starting 4 processes
> randomwrite: Laying out IO file(s) (1 file(s) / 2048MB)
> Jobs: 1 (f=1): [___R] [100.0% done] [558.8M/0K /s] [63.8K/0 iops] [eta 00m:00s]
> randomwrite: (groupid=0, jobs=1): err= 0: pid=7193
> write: io=102048KB, bw=1700.8KB/s, iops=189 , runt= 60003msec
> slat (usec): min=21 , max=219834 , avg=5250.91, stdev=5936.55
> clat (usec): min=25 , max=738932 , avg=329339.45, stdev=106004.63
> lat (msec): min=4 , max=751 , avg=334.59, stdev=107.57
> clat percentiles (msec):
> | 1.00th=[ 225], 5.00th=[ 241], 10.00th=[ 247], 20.00th=[ 260],
> | 30.00th=[ 269], 40.00th=[ 281], 50.00th=[ 293], 60.00th=[ 306],
> | 70.00th=[ 322], 80.00th=[ 351], 90.00th=[ 545], 95.00th=[ 570],
> | 99.00th=[ 627], 99.50th=[ 644], 99.90th=[ 709], 99.95th=[ 725],
> | 99.99th=[ 742]
> bw (KB/s) : min= 11, max= 2591, per=99.83%, avg=1697.13, stdev=491.48
> lat (usec) : 50=0.01%
> lat (msec) : 10=0.02%, 20=0.02%, 50=0.05%, 100=0.14%, 250=12.89%
> lat (msec) : 500=72.44%, 750=14.43%
Gosh, look at these latencies!
72.44% of all requests above 500 (in words: five hundred) milliseconds!
And 14.43% above 750 msecs. The percentage of requests served at 100 msecs
or less was below one percent! Hey, is this an SSD or what?
Please really test without dm_crypt so that we can see whether it
introduces any kind of latency and if so in what amount.
But then, let me compare. You are using iodepth=64, which fills the
device with requests at maximum speed, so this will likely increase
latencies.
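With the queue kept full, Little's law gives average completion latency
≈ iodepth / IOPS, and the random-write numbers fit it exactly:

64 / 189 IOPS   ≈ 340 ms    (Marc's reported avg clat ~329 ms)
64 / 14032 IOPS ≈ 4.6 ms    (the Intel SSD 320 run below, avg clat ~4.5 ms)

So the horrid latencies are just the ~75x IOPS gap seen from a full
queue; the open question is why the setup only sustains ~190 random-write
IOPS.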
martin@merkaba:~[…]> fio iops-iodepth64.job
zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
fio 1.57
Starting 4 processes
[…]
zufälligschreiben: (groupid=2, jobs=1): err= 0: pid=13103
write: io=2048.0MB, bw=76940KB/s, iops=14032 , runt= 27257msec
slat (usec): min=3 , max=6426 , avg=14.35, stdev=22.33
clat (usec): min=82 , max=414190 , avg=4540.81, stdev=4996.47
lat (usec): min=97 , max=414205 , avg=4555.53, stdev=4996.42
bw (KB/s) : min=12890, max=112728, per=100.39%, avg=77240.48, stdev=20310.79
cpu : usr=9.67%, sys=29.47%, ctx=72053, majf=0, minf=532
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=0/382480/0, short=0/0/0
lat (usec): 100=0.01%, 250=0.14%, 500=0.66%, 750=0.82%, 1000=1.52%
lat (msec): 2=10.86%, 4=27.88%, 10=57.18%, 20=0.79%, 50=0.03%
lat (msec): 100=0.07%, 250=0.04%, 500=0.01%
Still, even with iodepth 64, a totally different picture. And look at the
IOPS and throughput.
For reference, this refers to
[global]
ioengine=libaio
direct=1
iodepth=64
# For random data over the complete length
# of the file, run the sequential job beforehand
filename=testdatei
size=2G
bsrange=2k-16k
refill_buffers=1
[zufälliglesen]
rw=randread
runtime=60
[sequentielllesen]
stonewall
rw=read
runtime=60
[zufälligschreiben]
stonewall
rw=randwrite
runtime=60
[sequentiellschreiben]
stonewall
rw=write
runtime=60
but on Ext4 instead of BTRFS.
This could be another good test: Ext4 on a plain logical volume
without dm_crypt.
> cpu : usr=0.30%, sys=2.65%, ctx=9699, majf=0, minf=19
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
> issued : total=r=0/w=11396/d=0, short=r=0/w=0/d=0
> sequentialwrite: (groupid=1, jobs=1): err= 0: pid=7218
> write: io=80994KB, bw=1349.7KB/s, iops=149 , runt= 60011msec
> slat (usec): min=15 , max=296798 , avg=6658.06, stdev=7414.10
> clat (usec): min=15 , max=962319 , avg=418984.76, stdev=128626.12
> lat (msec): min=12 , max=967 , avg=425.64, stdev=130.36
> clat percentiles (msec):
> | 1.00th=[ 269], 5.00th=[ 306], 10.00th=[ 322], 20.00th=[ 338],
> | 30.00th=[ 355], 40.00th=[ 371], 50.00th=[ 388], 60.00th=[ 404],
> | 70.00th=[ 420], 80.00th=[ 445], 90.00th=[ 603], 95.00th=[ 766],
> | 99.00th=[ 881], 99.50th=[ 906], 99.90th=[ 938], 99.95th=[ 947],
> | 99.99th=[ 963]
> bw (KB/s) : min= 418, max= 1952, per=99.31%, avg=1339.72, stdev=354.39
> lat (usec) : 20=0.01%
> lat (msec) : 20=0.01%, 50=0.11%, 100=0.27%, 250=0.29%, 500=86.93%
> lat (msec) : 750=6.87%, 1000=5.51%
Sequential write latencies are abysmal as well.
sequentiellschreiben: (groupid=3, jobs=1): err= 0: pid=13105
write: io=2048.0MB, bw=155333KB/s, iops=17290 , runt= 13501msec
slat (usec): min=2 , max=4299 , avg=13.51, stdev=18.91
clat (usec): min=334 , max=201706 , avg=3680.94, stdev=2088.53
lat (usec): min=340 , max=201718 , avg=3694.83, stdev=2088.28
bw (KB/s) : min=144952, max=162856, per=100.04%, avg=155394.92, stdev=5720.43
cpu : usr=16.77%, sys=31.17%, ctx=47790, majf=0, minf=535
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=0/233444/0, short=0/0/0
lat (usec): 500=0.01%, 750=0.02%, 1000=0.03%
lat (msec): 2=0.55%, 4=77.52%, 10=21.86%, 20=0.01%, 100=0.01%
lat (msec): 250=0.01%
> cpu : usr=0.25%, sys=2.32%, ctx=13426, majf=0, minf=21
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.3%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
> issued : total=r=0/w=8991/d=0, short=r=0/w=0/d=0
> randomread: (groupid=2, jobs=1): err= 0: pid=7234
> read : io=473982KB, bw=7899.5KB/s, iops=934 , runt= 60005msec
> slat (usec): min=2 , max=31048 , avg=1065.57, stdev=3519.63
> clat (usec): min=21 , max=246981 , avg=67367.84, stdev=33459.21
> lat (msec): min=1 , max=263 , avg=68.43, stdev=33.81
> clat percentiles (msec):
> | 1.00th=[ 15], 5.00th=[ 25], 10.00th=[ 33], 20.00th=[ 39],
> | 30.00th=[ 50], 40.00th=[ 57], 50.00th=[ 61], 60.00th=[ 70],
> | 70.00th=[ 76], 80.00th=[ 89], 90.00th=[ 111], 95.00th=[ 139],
> | 99.00th=[ 176], 99.50th=[ 190], 99.90th=[ 217], 99.95th=[ 229],
> | 99.99th=[ 247]
> bw (KB/s) : min= 2912, max=11909, per=100.00%, avg=7900.46, stdev=2255.68
> lat (usec) : 50=0.01%
> lat (msec) : 2=0.17%, 4=0.16%, 10=0.35%, 20=2.49%, 50=27.44%
> lat (msec) : 100=55.05%, 250=14.34%
Even read latencies are really high.
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=13101
read : io=2048.0MB, bw=162545KB/s, iops=29719 , runt= 12902msec
slat (usec): min=2 , max=1485 , avg=10.58, stdev=10.81
clat (usec): min=246 , max=18706 , avg=2140.05, stdev=1073.28
lat (usec): min=267 , max=18714 , avg=2150.97, stdev=1074.42
bw (KB/s) : min=108000, max=205060, per=101.08%, avg=164305.12, stdev=32494.26
cpu : usr=12.74%, sys=49.83%, ctx=82018, majf=0, minf=276
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=383439/0/0, short=0/0/0
lat (usec): 250=0.01%, 500=0.03%, 750=0.30%, 1000=3.21%
lat (msec): 2=48.26%, 4=44.68%, 10=3.35%, 20=0.18%
Here for comparison that 500 GB 2.5 inch eSATA Hitachi disk (5400rpm):
merkaba:[…]> fio --readonly iops-gerät-lesend-iodepth64.job
zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
fio 1.57
Starting 2 processes
Jobs: 1 (f=1): [_R] [66.9% done] [79411K/0K /s] [8617 /0 iops] [eta 01m:00s]
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=32578
read : io=58290KB, bw=984592 B/s, iops=106 , runt= 60623msec
slat (usec): min=3 , max=82 , avg=24.63, stdev= 6.64
clat (msec): min=40 , max=2825 , avg=602.78, stdev=374.10
lat (msec): min=40 , max=2825 , avg=602.80, stdev=374.10
bw (KB/s) : min= 0, max= 1172, per=65.52%, avg=629.66, stdev=466.50
cpu : usr=0.24%, sys=0.64%, ctx=6443, majf=0, minf=275
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.5%, >=64=99.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=6435/0/0, short=0/0/0
lat (msec): 50=0.20%, 100=2.61%, 250=15.37%, 500=26.64%, 750=25.07%
lat (msec): 1000=16.13%, 2000=13.44%, >=2000=0.54%
Okay, so at least your SSD has shorter random read latencies than this
slow harddisk. ;)
It has a queue depth of 32 as well.
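(As a quick check, the negotiated queue depth can be read from sysfs,
assuming the disk is sda:)

cat /sys/block/sda/device/queue_depth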
> cpu : usr=0.38%, sys=2.89%, ctx=8425, majf=0, minf=276
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
> issued : total=r=56073/w=0/d=0, short=r=0/w=0/d=0
> sequentialread: (groupid=3, jobs=1): err= 0: pid=7249
> read : io=2048.0MB, bw=474684KB/s, iops=52754 , runt= 4418msec
> slat (usec): min=1 , max=44869 , avg=15.35, stdev=320.96
> clat (usec): min=0 , max=67787 , avg=1145.60, stdev=3487.25
> lat (usec): min=2 , max=67801 , avg=1161.14, stdev=3508.06
> clat percentiles (usec):
> | 1.00th=[ 10], 5.00th=[ 187], 10.00th=[ 217], 20.00th=[ 249],
> | 30.00th=[ 278], 40.00th=[ 302], 50.00th=[ 330], 60.00th=[ 362],
> | 70.00th=[ 398], 80.00th=[ 450], 90.00th=[ 716], 95.00th=[ 6624],
> | 99.00th=[16064], 99.50th=[20096], 99.90th=[44800], 99.95th=[61696],
> | 99.99th=[67072]
> bw (KB/s) : min=50063, max=635019, per=97.34%, avg=462072.50, stdev=202894.18
> lat (usec) : 2=0.65%, 4=0.02%, 10=0.30%, 20=0.24%, 50=0.36%
> lat (usec) : 100=0.52%, 250=18.08%, 500=65.03%, 750=4.88%, 1000=0.52%
> lat (msec) : 2=0.81%, 4=0.95%, 10=4.72%, 20=2.43%, 50=0.45%
> lat (msec) : 100=0.05%
> cpu : usr=7.52%, sys=67.56%, ctx=1494, majf=0, minf=277
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
> issued : total=r=233071/w=0/d=0, short=r=0/w=0/d=0
Intel SSD 320:
sequentielllesen: (groupid=1, jobs=1): err= 0: pid=13102
read : io=2048.0MB, bw=252699KB/s, iops=28043 , runt= 8299msec
slat (usec): min=2 , max=1416 , avg=10.75, stdev=10.95
clat (usec): min=305 , max=201105 , avg=2268.82, stdev=2066.69
lat (usec): min=319 , max=201114 , avg=2279.94, stdev=2066.61
bw (KB/s) : min=249424, max=254472, per=100.03%, avg=252776.50, stdev=1416.63
cpu : usr=12.29%, sys=43.62%, ctx=43913, majf=0, minf=278
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=232729/0/0, short=0/0/0
lat (usec): 500=0.01%, 750=0.06%, 1000=0.10%
lat (msec): 2=11.00%, 4=88.32%, 10=0.50%, 20=0.01%, 50=0.01%
lat (msec): 100=0.01%, 250=0.01%
Harddisk again:
sequentielllesen: (groupid=1, jobs=1): err= 0: pid=32580
read : io=4510.3MB, bw=76964KB/s, iops=8551 , runt= 60008msec
slat (usec): min=1 , max=3373 , avg=10.03, stdev= 8.96
clat (msec): min=1 , max=49 , avg= 7.47, stdev= 1.14
lat (msec): min=1 , max=49 , avg= 7.48, stdev= 1.14
bw (KB/s) : min=72608, max=79588, per=100.05%, avg=77003.96, stdev=1626.66
cpu : usr=6.89%, sys=16.96%, ctx=229486, majf=0, minf=277
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=513184/0/0, short=0/0/0
lat (msec): 2=0.01%, 4=0.02%, 10=99.67%, 20=0.29%, 50=0.01%
Maximum latencies on our SSD are longer; minimum latencies are shorter.
Well, sequential data might not be located sequentially on the SSD: the SSD
controller might put the contents of a big file into totally different places.
> Run status group 0 (all jobs):
> WRITE: io=102048KB, aggrb=1700KB/s, minb=1700KB/s, maxb=1700KB/s, mint=60003msec, maxt=60003msec
>
> Run status group 1 (all jobs):
> WRITE: io=80994KB, aggrb=1349KB/s, minb=1349KB/s, maxb=1349KB/s, mint=60011msec, maxt=60011msec
>
> Run status group 2 (all jobs):
> READ: io=473982KB, aggrb=7899KB/s, minb=7899KB/s, maxb=7899KB/s, mint=60005msec, maxt=60005msec
>
> Run status group 3 (all jobs):
> READ: io=2048.0MB, aggrb=474683KB/s, minb=474683KB/s, maxb=474683KB/s, mint=4418msec, maxt=4418msec
That's pathetic, except for the last job group (each group has just one job
here because of the stonewall commands), which reads sequentially. Strangely,
sequential write is also abysmally slow.
This is how I expect it to look with iodepth 64:
Run status group 0 (all jobs):
READ: io=2048.0MB, aggrb=162544KB/s, minb=166445KB/s, maxb=166445KB/s, mint=12902msec, maxt=12902msec
Run status group 1 (all jobs):
READ: io=2048.0MB, aggrb=252699KB/s, minb=258764KB/s, maxb=258764KB/s, mint=8299msec, maxt=8299msec
Run status group 2 (all jobs):
WRITE: io=2048.0MB, aggrb=76939KB/s, minb=78786KB/s, maxb=78786KB/s, mint=27257msec, maxt=27257msec
Run status group 3 (all jobs):
WRITE: io=2048.0MB, aggrb=155333KB/s, minb=159061KB/s, maxb=159061KB/s, mint=13501msec, maxt=13501msec
Can you also post the last lines:
Disk stats (read/write):
dm-2: ios=616191/613142, merge=0/0, ticks=1300820/2565384, in_queue=3867448, util=98.81%, aggrios=504829/504643, aggrmerge=111362/111451, aggrticks=1058320/2164664, aggrin_queue=3223048, aggrutil=98.78%
sda: ios=504829/504643, merge=111362/111451, ticks=1058320/2164664, in_queue=3223048, util=98.78%
martin@merkaba:~/Artikel/LinuxNewMedia/fio/Recherche/Messungen/merkaba>
It gives information on how well the I/O scheduler was able to merge requests.
I didn't see much of a difference between CFQ and noop, so it may not
matter much, but since it also gives a number for total disk utilization
it's still quite nice to have.
So my recommendation as of now:
Remove as many factors as possible, and to compare results with
what I posted, try a plain logical volume with Ext4.
If the values are still quite slow, I think it's good to ask Linux
block layer experts – for example by posting on the fio mailing list, where
people are subscribed who may be able to provide other test
results – and SSD experts. There might be a Linux block layer mailing
list, or use libata or fsdevel, I don't know.
If your Samsung SSD turns out to be this slow or almost this slow on
such a basic level, I am out of ideas.
Except for one thing:
Have the IOPS test run on the device itself. That will remove any filesystem
layer. But run only the read-only tests; to be safe I suggest using fio
with the --readonly option as a safety guard. Unless you have a spare SSD
that you can afford to use for write testing, which will likely destroy
every filesystem on it. Or let it run on just one logical volume.
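A minimal command-line sketch of such a read-only raw-device run, assuming
the SSD is /dev/sda (--readonly makes fio refuse any job that would write):

fio --readonly --name=rawrandread --ioengine=libaio --direct=1 \
    --rw=randread --bsrange=2k-16k --runtime=60 --filename=/dev/sda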
If it's then slow, I'd tend to use a different SSD and test from
there. If it's then faster, I'd tend to believe that it's either a hardware
or, probably more likely, a compatibility issue.
What does your SSD report as its discard alignment? Well, even that should
not matter for reads:
merkaba:/sys/block/sda> cat discard_alignment
0
Seems Intel tells us not to care at all. Or the value cannot be read
for some reason.
merkaba:/sys/block/sda> cd queue
merkaba:/sys/block/sda/queue> grep . *
add_random:1
discard_granularity:512
discard_max_bytes:2147450880
discard_zeroes_data:1
hw_sector_size:512
I would be interested in whether the values above differ.
grep: iosched: Is a directory
iostats:1
logical_block_size:512
max_hw_sectors_kb:32767
max_integrity_segments:0
max_sectors_kb:512
max_segments:168
max_segment_size:65536
minimum_io_size:512
nomerges:0
nr_requests:128
optimal_io_size:0
physical_block_size:512
read_ahead_kb:128
rotational:0
rq_affinity:1
scheduler:noop deadline [cfq]
And whether there is any optimal io size.
One note regarding this: back then I set up the Ext4 filesystem I ran the
above tests on last year to be aligned to what I thought could be good for
the SSD. I do not know whether this matters much.
merkaba:~> tune2fs -l /dev/merkaba/home
RAID stripe width: 128
Which should be in blocks, and thus align to 512 KiB. Hmmm, I am a bit puzzled
about this value at the moment. Either 1 MiB or 128 KiB would make sense
to me. But then I do not know the erase block size of that SSD.
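(For reference, the arithmetic behind that figure, assuming the usual
4 KiB Ext4 block size: 128 blocks × 4 KiB/block = 512 KiB.)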
Then add BTRFS and then dm_crypt and look at how the numbers change.
I think this is a plan to find out whether it's really the hardware
or some weird behavior in the low-level parts of Linux, e.g. the block
layer, dm_crypt or the filesystem.
Reduce it to the most basic level and then work from there.
Thanks,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
2012-08-02 5:07 ` Marc MERLIN
2012-08-02 11:18 ` Martin Steigerwald
@ 2012-08-02 11:25 ` Martin Steigerwald
1 sibling, 0 replies; 22+ messages in thread
From: Martin Steigerwald @ 2012-08-02 11:25 UTC (permalink / raw)
To: linux-btrfs; +Cc: Marc MERLIN, Fajar A. Nugraha
On Thursday, 2 August 2012, Marc MERLIN wrote:
> So, doctor, is it bad? :)
>
> randomwrite: (g=0): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> sequentialwrite: (g=1): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> randomread: (g=2): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> sequentialread: (g=3): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> 2.0.8
> Starting 4 processes
> randomwrite: Laying out IO file(s) (1 file(s) / 2048MB)
> Jobs: 1 (f=1): [___R] [100.0% done] [558.8M/0K /s] [63.8K/0 iops] [eta 00m:00s]
> randomwrite: (groupid=0, jobs=1): err= 0: pid=7193
> write: io=102048KB, bw=1700.8KB/s, iops=189 , runt= 60003msec
> slat (usec): min=21 , max=219834 , avg=5250.91, stdev=5936.55
> clat (usec): min=25 , max=738932 , avg=329339.45, stdev=106004.63
> lat (msec): min=4 , max=751 , avg=334.59, stdev=107.57
> clat percentiles (msec):
Heck, I didn't look at the IOPS figure!
189 IOPS for a SATA-600 SSD. That's pathetic.
So again, please test this without dm_crypt. I can't believe that this
is the maximum the hardware is able to achieve.
A really fast 15000 rpm SAS harddisk might top that.
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
2012-08-02 11:18 ` Martin Steigerwald
@ 2012-08-02 17:39 ` Marc MERLIN
2012-08-02 20:20 ` Martin Steigerwald
0 siblings, 1 reply; 22+ messages in thread
From: Marc MERLIN @ 2012-08-02 17:39 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: linux-btrfs, Fajar A. Nugraha
On Thu, Aug 02, 2012 at 01:18:07PM +0200, Martin Steigerwald wrote:
> > I've run the fio tests in:
> > /dev/mapper/cryptroot /var btrfs rw,noatime,compress=lzo,nossd,discard,space_cache 0 0
>
> … you are still using dm_crypt?
That was my biggest partition and so far I've found no performance impact
on file access between unencrypted and dm_crypt.
I just took out my swap partition and made a smaller btrfs there:
/dev/sda3 /mnt/mnt3 btrfs rw,noatime,ssd,space_cache 0 0
I mounted without discard.
> > lat (usec) : 50=0.01%
> > lat (msec) : 10=0.02%, 20=0.02%, 50=0.05%, 100=0.14%, 250=12.89%
> > lat (msec) : 500=72.44%, 750=14.43%
>
> Gosh, look at these latencies!
>
> 72.44% of all requests above 500 (in words: five hundred) milliseconds!
> And 14.43% above 750 msecs. The percentage of requests served at 100 msecs
> or less was below one percent! Hey, is this an SSD or what?
Yeah, that's kind of what I've been complaining about since the beginning :)
Once I'm reading sequentially, it goes fast, but random access/latency is
indeed abysmal.
> Still, even with iodepth 64, a totally different picture. And look at the IOPS
> and throughput.
Yep. I know mine are bad :(
> For reference, this refers to
>
> [global]
> ioengine=libaio
> direct=1
> iodepth=64
Since it's slightly different from the first job file you gave me, I re-ran
with this one this time.
gandalfthegreat:~# /sbin/mkfs.btrfs -L test /dev/sda2
gandalfthegreat:~# mount -o noatime /dev/sda2 /mnt/mnt2
gandalfthegreat:~# grep sda2 /proc/mounts
/dev/sda2 /mnt/mnt2 btrfs rw,noatime,ssd,space_cache 0 0
here's the btrfs test (ext4 is lower down):
gandalfthegreat:/mnt/mnt2# fio ~/fio.job2
zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
2.0.8
Starting 4 processes
zufälliglesen: Laying out IO file(s) (1 file(s) / 2048MB)
sequentielllesen: Laying out IO file(s) (1 file(s) / 2048MB)
zufälligschreiben: Laying out IO file(s) (1 file(s) / 2048MB)
sequentiellschreiben: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 1 (f=1): [___W] [59.5% done] [0K/1800K /s] [0 /193 iops] [eta 02m:10s]
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=30318
read : io=73682KB, bw=1227.1KB/s, iops=137 , runt= 60004msec
slat (usec): min=3 , max=37432 , avg=7252.52, stdev=5717.70
clat (usec): min=13 , max=981927 , avg=454046.13, stdev=110527.92
lat (msec): min=5 , max=999 , avg=461.30, stdev=112.00
clat percentiles (msec):
| 1.00th=[ 145], 5.00th=[ 269], 10.00th=[ 371], 20.00th=[ 408],
| 30.00th=[ 424], 40.00th=[ 437], 50.00th=[ 449], 60.00th=[ 457],
| 70.00th=[ 474], 80.00th=[ 490], 90.00th=[ 570], 95.00th=[ 644],
| 99.00th=[ 865], 99.50th=[ 922], 99.90th=[ 963], 99.95th=[ 979],
| 99.99th=[ 979]
bw (KB/s) : min= 8, max= 2807, per=100.00%, avg=1227.75, stdev=317.57
lat (usec) : 20=0.01%
lat (msec) : 10=0.01%, 20=0.02%, 50=0.04%, 100=0.46%, 250=3.82%
lat (msec) : 500=79.48%, 750=13.57%, 1000=2.58%
cpu : usr=0.12%, sys=1.13%, ctx=12186, majf=0, minf=276
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.2%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=8262/w=0/d=0, short=r=0/w=0/d=0
sequentielllesen: (groupid=1, jobs=1): err= 0: pid=30340
read : io=2048.0MB, bw=211257KB/s, iops=23473 , runt= 9927msec
slat (usec): min=1 , max=56321 , avg=20.51, stdev=424.44
clat (usec): min=0 , max=57987 , avg=2695.98, stdev=6624.00
lat (usec): min=1 , max=58015 , avg=2716.75, stdev=6642.09
clat percentiles (usec):
| 1.00th=[ 1], 5.00th=[ 10], 10.00th=[ 30], 20.00th=[ 100],
| 30.00th=[ 217], 40.00th=[ 362], 50.00th=[ 494], 60.00th=[ 636],
| 70.00th=[ 892], 80.00th=[ 1656], 90.00th=[ 7392], 95.00th=[21632],
| 99.00th=[29056], 99.50th=[29568], 99.90th=[43776], 99.95th=[46848],
| 99.99th=[57600]
bw (KB/s) : min=166675, max=260984, per=99.83%, avg=210892.26, stdev=22433.65
lat (usec) : 2=2.16%, 4=0.43%, 10=2.33%, 20=2.80%, 50=5.72%
lat (usec) : 100=6.47%, 250=12.35%, 500=18.29%, 750=15.52%, 1000=5.44%
lat (msec) : 2=13.04%, 4=3.59%, 10=3.83%, 20=1.88%, 50=6.11%
lat (msec) : 100=0.04%
cpu : usr=4.51%, sys=35.70%, ctx=11480, majf=0, minf=278
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=233025/w=0/d=0, short=r=0/w=0/d=0
zufälligschreiben: (groupid=2, jobs=1): err= 0: pid=30348
write: io=110768KB, bw=1845.1KB/s, iops=208 , runt= 60007msec
slat (usec): min=26 , max=50160 , avg=4789.12, stdev=4968.75
clat (usec): min=29 , max=752494 , avg=301858.86, stdev=80422.24
lat (msec): min=12 , max=757 , avg=306.65, stdev=81.52
clat percentiles (msec):
| 1.00th=[ 208], 5.00th=[ 241], 10.00th=[ 245], 20.00th=[ 255],
| 30.00th=[ 262], 40.00th=[ 269], 50.00th=[ 277], 60.00th=[ 289],
| 70.00th=[ 306], 80.00th=[ 326], 90.00th=[ 363], 95.00th=[ 519],
| 99.00th=[ 627], 99.50th=[ 652], 99.90th=[ 734], 99.95th=[ 742],
| 99.99th=[ 750]
bw (KB/s) : min= 616, max= 2688, per=99.72%, avg=1839.80, stdev=399.60
lat (usec) : 50=0.01%
lat (msec) : 20=0.02%, 50=0.04%, 100=0.07%, 250=14.31%, 500=79.62%
lat (msec) : 750=5.91%, 1000=0.01%
cpu : usr=0.28%, sys=2.60%, ctx=10167, majf=0, minf=20
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.5%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=0/w=12495/d=0, short=r=0/w=0/d=0
sequentiellschreiben: (groupid=3, jobs=1): err= 0: pid=30364
write: io=84902KB, bw=1414.1KB/s, iops=156 , runt= 60005msec
slat (usec): min=18 , max=40825 , avg=6389.99, stdev=6072.56
clat (usec): min=22 , max=887097 , avg=401897.77, stdev=108015.23
lat (msec): min=11 , max=899 , avg=408.29, stdev=109.47
clat percentiles (msec):
| 1.00th=[ 262], 5.00th=[ 302], 10.00th=[ 318], 20.00th=[ 338],
| 30.00th=[ 351], 40.00th=[ 363], 50.00th=[ 375], 60.00th=[ 388],
| 70.00th=[ 404], 80.00th=[ 433], 90.00th=[ 502], 95.00th=[ 693],
| 99.00th=[ 783], 99.50th=[ 824], 99.90th=[ 873], 99.95th=[ 881],
| 99.99th=[ 889]
bw (KB/s) : min= 346, max= 2115, per=99.44%, avg=1406.11, stdev=333.58
lat (usec) : 50=0.01%
lat (msec) : 20=0.02%, 50=0.03%, 100=0.07%, 250=0.59%, 500=89.12%
lat (msec) : 750=7.91%, 1000=2.24%
cpu : usr=0.28%, sys=2.21%, ctx=13224, majf=0, minf=21
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.3%, >=64=99.3%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=0/w=9369/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
READ: io=73682KB, aggrb=1227KB/s, minb=1227KB/s, maxb=1227KB/s, mint=60004msec, maxt=60004msec
Run status group 1 (all jobs):
READ: io=2048.0MB, aggrb=211257KB/s, minb=211257KB/s, maxb=211257KB/s, mint=9927msec, maxt=9927msec
Run status group 2 (all jobs):
WRITE: io=110768KB, aggrb=1845KB/s, minb=1845KB/s, maxb=1845KB/s, mint=60007msec, maxt=60007msec
Run status group 3 (all jobs):
WRITE: io=84902KB, aggrb=1414KB/s, minb=1414KB/s, maxb=1414KB/s, mint=60005msec, maxt=60005msec
gandalfthegreat:/mnt/mnt2#
(no lines beyond this, fio 2.0.8)
> This could be another good test: test with Ext4 on a plain logical volume
> without dm_crypt.
> Can you also post the last lines:
>
> Disk stats (read/write):
> dm-2: ios=616191/613142, merge=0/0, ticks=1300820/2565384, in_queue=3867448, util=98.81%, aggrios=504829/504643, aggrmerge=111362/111451, aggrticks=1058320/2164664, aggrin_queue=3223048, aggrutil=98.78%
> sda: ios=504829/504643, merge=111362/111451, ticks=1058320/2164664, in_queue=3223048, util=98.78%
> martin@merkaba:~/Artikel/LinuxNewMedia/fio/Recherche/Messungen/merkaba>
I didn't get these lines.
> It gives information on how well the I/O scheduler was able to merge requests.
>
> I didn't see much of a difference between CFQ and noop, so it may not
> matter much, but since it also gives a number for total disk utilization
> it's still quite nice to have.
I tried deadline and noop, and indeed I'm not seeing much of a difference for my basic tests.
For now I have deadline.
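(For reference, switching the elevator at runtime is just a sysfs write;
a sketch, assuming the SSD is sda:)

echo deadline > /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/scheduler   # active scheduler is shown in brackets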
> So my recommendation as of now:
>
> Remove as many factors as possible, and to compare results with
> what I posted, try a plain logical volume with Ext4.
gandalfthegreat:~# mkfs.ext4 -O extent -b 4096 -E stride=128,stripe-width=128 /dev/sda2
/dev/sda2 /mnt/mnt2 ext4 rw,noatime,stripe=128,data=ordered 0 0
gandalfthegreat:/mnt/mnt2# fio ~/fio.job2
zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
2.0.8
Starting 4 processes
zufälliglesen: Laying out IO file(s) (1 file(s) / 2048MB)
sequentielllesen: Laying out IO file(s) (1 file(s) / 2048MB)
zufälligschreiben: Laying out IO file(s) (1 file(s) / 2048MB)
sequentiellschreiben: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 1 (f=1): [___W] [63.8% done] [0K/2526K /s] [0 /280 iops] [eta 01m:21s]
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=30077
read : io=2048.0MB, bw=276232KB/s, iops=50472 , runt= 7592msec
slat (usec): min=2 , max=2276 , avg= 6.87, stdev=12.01
clat (usec): min=249 , max=52128 , avg=1258.87, stdev=1714.63
lat (usec): min=260 , max=52134 , avg=1266.00, stdev=1715.36
clat percentiles (usec):
| 1.00th=[ 450], 5.00th=[ 548], 10.00th=[ 620], 20.00th=[ 724],
| 30.00th=[ 820], 40.00th=[ 908], 50.00th=[ 1004], 60.00th=[ 1096],
| 70.00th=[ 1208], 80.00th=[ 1368], 90.00th=[ 1640], 95.00th=[ 2040],
| 99.00th=[ 8256], 99.50th=[14912], 99.90th=[21120], 99.95th=[23168],
| 99.99th=[33024]
bw (KB/s) : min=76463, max=385328, per=100.00%, avg=277313.20, stdev=94661.29
lat (usec) : 250=0.01%, 500=2.46%, 750=19.82%, 1000=27.70%
lat (msec) : 2=44.79%, 4=3.55%, 10=0.72%, 20=0.78%, 50=0.17%
lat (msec) : 100=0.01%
cpu : usr=11.91%, sys=51.64%, ctx=91337, majf=0, minf=277
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=383190/w=0/d=0, short=r=0/w=0/d=0
sequentielllesen: (groupid=1, jobs=1): err= 0: pid=30081
read : io=2048.0MB, bw=150043KB/s, iops=16641 , runt= 13977msec
slat (usec): min=1 , max=2134 , avg= 7.09, stdev=10.83
clat (usec): min=298 , max=16751 , avg=3836.41, stdev=755.26
lat (usec): min=304 , max=16771 , avg=3843.77, stdev=754.77
clat percentiles (usec):
| 1.00th=[ 2608], 5.00th=[ 2960], 10.00th=[ 3152], 20.00th=[ 3376],
| 30.00th=[ 3536], 40.00th=[ 3664], 50.00th=[ 3792], 60.00th=[ 3920],
| 70.00th=[ 4080], 80.00th=[ 4256], 90.00th=[ 4448], 95.00th=[ 4704],
| 99.00th=[ 5216], 99.50th=[ 5984], 99.90th=[13888], 99.95th=[15296],
| 99.99th=[16512]
bw (KB/s) : min=134280, max=169692, per=100.00%, avg=150111.30, stdev=7227.35
lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.07%, 4=64.79%, 10=34.87%, 20=0.25%
cpu : usr=6.81%, sys=20.01%, ctx=77116, majf=0, minf=279
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=232601/w=0/d=0, short=r=0/w=0/d=0
zufälligschreiben: (groupid=2, jobs=1): err= 0: pid=30088
write: io=111828KB, bw=1863.7KB/s, iops=209 , runt= 60006msec
slat (usec): min=7 , max=51044 , avg=4757.60, stdev=4816.78
clat (usec): min=958 , max=582205 , avg=299854.73, stdev=56663.66
lat (msec): min=12 , max=588 , avg=304.61, stdev=57.22
clat percentiles (msec):
| 1.00th=[ 73], 5.00th=[ 225], 10.00th=[ 251], 20.00th=[ 273],
| 30.00th=[ 285], 40.00th=[ 297], 50.00th=[ 306], 60.00th=[ 314],
| 70.00th=[ 318], 80.00th=[ 330], 90.00th=[ 343], 95.00th=[ 359],
| 99.00th=[ 482], 99.50th=[ 519], 99.90th=[ 553], 99.95th=[ 570],
| 99.99th=[ 570]
bw (KB/s) : min= 1033, max= 3430, per=99.67%, avg=1856.90, stdev=307.24
lat (usec) : 1000=0.01%
lat (msec) : 20=0.06%, 50=0.72%, 100=0.44%, 250=8.50%, 500=89.52%
lat (msec) : 750=0.75%
cpu : usr=0.27%, sys=1.27%, ctx=9505, majf=0, minf=21
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.5%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=0/w=12580/d=0, short=r=0/w=0/d=0
sequentiellschreiben: (groupid=3, jobs=1): err= 0: pid=30102
write: io=137316KB, bw=2288.2KB/s, iops=254 , runt= 60013msec
slat (usec): min=5 , max=40817 , avg=3911.57, stdev=4785.23
clat (msec): min=4 , max=682 , avg=246.61, stdev=77.35
lat (msec): min=5 , max=688 , avg=250.53, stdev=78.39
clat percentiles (msec):
| 1.00th=[ 73], 5.00th=[ 143], 10.00th=[ 186], 20.00th=[ 206],
| 30.00th=[ 221], 40.00th=[ 229], 50.00th=[ 239], 60.00th=[ 245],
| 70.00th=[ 258], 80.00th=[ 273], 90.00th=[ 318], 95.00th=[ 416],
| 99.00th=[ 529], 99.50th=[ 562], 99.90th=[ 660], 99.95th=[ 668],
| 99.99th=[ 676]
bw (KB/s) : min= 811, max= 4243, per=99.65%, avg=2279.92, stdev=610.32
lat (msec) : 10=0.26%, 20=0.03%, 50=0.22%, 100=1.97%, 250=62.01%
lat (msec) : 500=33.43%, 750=2.07%
cpu : usr=0.35%, sys=1.46%, ctx=10993, majf=0, minf=21
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=99.6%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=0/w=15293/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
READ: io=2048.0MB, aggrb=276231KB/s, minb=276231KB/s, maxb=276231KB/s, mint=7592msec, maxt=7592msec
Run status group 1 (all jobs):
READ: io=2048.0MB, aggrb=150043KB/s, minb=150043KB/s, maxb=150043KB/s, mint=13977msec, maxt=13977msec
Run status group 2 (all jobs):
WRITE: io=111828KB, aggrb=1863KB/s, minb=1863KB/s, maxb=1863KB/s, mint=60006msec, maxt=60006msec
Run status group 3 (all jobs):
WRITE: io=137316KB, aggrb=2288KB/s, minb=2288KB/s, maxb=2288KB/s, mint=60013msec, maxt=60013msec
Disk stats (read/write):
sda: ios=505649/30833, merge=110818/7835, ticks=917980/187892, in_queue=1106088, util=98.50%
gandalfthegreat:/mnt/mnt2#
> If the values are still quite slow, I think it's good to ask Linux
> block layer experts – for example by posting on the fio mailing list, where
> people are subscribed who may be able to provide other test
> results – and SSD experts. There might be a Linux block layer mailing
> list, or use libata or fsdevel, I don't know.
I'll try there next. I think after this mail, and your help which is very
much appreciated, it's fair to take things out of this list to stop spamming
the wrong people :)
> Have the IOPS test run on the device itself. That will remove any filesystem
> layer. But run only the read-only tests; to be safe I suggest using fio
> with the --readonly option as a safety guard. Unless you have a spare SSD
> that you can afford to use for write testing, which will likely destroy
> every filesystem on it. Or let it run on just one logical volume.
Can you send me a recommended job config you'd like me to run if the runs
above haven't already answered your questions?
> What does your SSD report as its discard alignment? Well, even that should
> not matter for reads:
>
> merkaba:/sys/block/sda> cat discard_alignment
> 0
gandalfthegreat:/sys/block/sda# cat discard_alignment
0
> Seems Intel tells us not to care at all. Or the value cannot be read
> for some reason.
>
> merkaba:/sys/block/sda> cd queue
> merkaba:/sys/block/sda/queue> grep . *
> add_random:1
> discard_granularity:512
> discard_max_bytes:2147450880
> discard_zeroes_data:1
> hw_sector_size:512
>
> I would be interested in whether the values above differ.
I have the exact same.
> And whether there is any optimal io size.
I also have
optimal_io_size:0
> One note regarding this: back then I set up the Ext4 filesystem I ran the
> above tests on last year to be aligned to what I thought could be good for
> the SSD. I do not know whether this matters much.
>
> merkaba:~> tune2fs -l /dev/merkaba/home
> RAID stripe width: 128
> I think this is a plan to find out whether it's really the hardware
> or some weird behavior in the low-level parts of Linux, e.g. the block
> layer, dm_crypt or the filesystem.
>
> Reduce it to the most basic level and then work from there.
So, I went back to basics.
On Thu, Aug 02, 2012 at 01:25:17PM +0200, Martin Steigerwald wrote:
> Heck, I didn't look at the IOPS figure!
>
> 189 IOPS for a SATA-600 SSD. That's pathetic.
Yeah, and the new tests above without dmcrypt are no better.
> A really fast 15000 rpm SAS harddisk might top that.
My slow 1TB 5400rpm laptop hard drive runs du -s 4x faster than
the SSD right now :)
I read your comments about the multiple things you had in mind. Given the
new data, should I just go to an io list now, or even save myself some time
and just return the damned SSDs and buy from another vendor? :)
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
2012-08-02 17:39 ` Marc MERLIN
@ 2012-08-02 20:20 ` Martin Steigerwald
2012-08-02 20:44 ` Marc MERLIN
0 siblings, 1 reply; 22+ messages in thread
From: Martin Steigerwald @ 2012-08-02 20:20 UTC (permalink / raw)
To: Marc MERLIN; +Cc: linux-btrfs, Fajar A. Nugraha
On Thursday, 2 August 2012, Marc MERLIN wrote:
> On Thu, Aug 02, 2012 at 01:18:07PM +0200, Martin Steigerwald wrote:
> > > I've run the fio tests in:
> > > /dev/mapper/cryptroot /var btrfs rw,noatime,compress=lzo,nossd,discard,space_cache 0 0
> >
> > … you are still using dm_crypt?
[…]
> I just took out my swap partition and made a smaller btrfs there:
> /dev/sda3 /mnt/mnt3 btrfs rw,noatime,ssd,space_cache 0 0
>
> I mounted without discard.
[…]
> > For reference, this refers to
> >
> > [global]
> > ioengine=libaio
> > direct=1
> > iodepth=64
>
> Since it's slightly different than the first job file you gave me, I re-ran
> with this one this time.
>
> gandalfthegreat:~# /sbin/mkfs.btrfs -L test /dev/sda2
> gandalfthegreat:~# mount -o noatime /dev/sda2 /mnt/mnt2
> gandalfthegreat:~# grep sda2 /proc/mounts
> /dev/sda2 /mnt/mnt2 btrfs rw,noatime,ssd,space_cache 0 0
>
> here's the btrfs test (ext4 is lower down):
> gandalfthegreat:/mnt/mnt2# fio ~/fio.job2
> zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> 2.0.8
Still abysmal except for sequential reads.
> Starting 4 processes
> zufälliglesen: Laying out IO file(s) (1 file(s) / 2048MB)
> sequentielllesen: Laying out IO file(s) (1 file(s) / 2048MB)
> zufälligschreiben: Laying out IO file(s) (1 file(s) / 2048MB)
> sequentiellschreiben: Laying out IO file(s) (1 file(s) / 2048MB)
> Jobs: 1 (f=1): [___W] [59.5% done] [0K/1800K /s] [0 /193 iops] [eta 02m:10s]
> zufälliglesen: (groupid=0, jobs=1): err= 0: pid=30318
> read : io=73682KB, bw=1227.1KB/s, iops=137 , runt= 60004msec
[…]
> lat (usec) : 20=0.01%
> lat (msec) : 10=0.01%, 20=0.02%, 50=0.04%, 100=0.46%, 250=3.82%
> lat (msec) : 500=79.48%, 750=13.57%, 1000=2.58%
1 second latency? Oh well…
> Run status group 3 (all jobs):
> WRITE: io=84902KB, aggrb=1414KB/s, minb=1414KB/s, maxb=1414KB/s, mint=60005msec, maxt=60005msec
> gandalfthegreat:/mnt/mnt2#
> (no lines beyond this, fio 2.0.8)
[…]
> > Can you also post the last lines:
> >
> > Disk stats (read/write):
> > dm-2: ios=616191/613142, merge=0/0, ticks=1300820/2565384, in_queue=3867448, util=98.81%, aggrios=504829/504643, aggrmerge=111362/111451, aggrticks=1058320/2164664, aggrin_queue=3223048, aggrutil=98.78%
> > sda: ios=504829/504643, merge=111362/111451, ticks=1058320/2164664, in_queue=3223048, util=98.78%
> > martin@merkaba:~/Artikel/LinuxNewMedia/fio/Recherche/Messungen/merkaba>
>
> I didn't get these lines.
Hmmm, back then I guess this was fio 1.57 or so.
> I tried deadline and noop, and indeed I'm not seeing much of a difference for my basic tests.
> For now I have deadline.
>
> > So my recommendation as of now:
> >
> > Remove as many factors as possible, and to compare results with
> > what I posted, try a plain logical volume with Ext4.
>
> gandalfthegreat:~# mkfs.ext4 -O extent -b 4096 -E stride=128,stripe-width=128 /dev/sda2
> /dev/sda2 /mnt/mnt2 ext4 rw,noatime,stripe=128,data=ordered 0 0
>
> gandalfthegreat:/mnt/mnt2# fio ~/fio.job2
> zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> 2.0.8
> Starting 4 processes
> zufälliglesen: Laying out IO file(s) (1 file(s) / 2048MB)
> sequentielllesen: Laying out IO file(s) (1 file(s) / 2048MB)
> zufälligschreiben: Laying out IO file(s) (1 file(s) / 2048MB)
> sequentiellschreiben: Laying out IO file(s) (1 file(s) / 2048MB)
> Jobs: 1 (f=1): [___W] [63.8% done] [0K/2526K /s] [0 /280 iops] [eta 01m:21s]
> zufälliglesen: (groupid=0, jobs=1): err= 0: pid=30077
> read : io=2048.0MB, bw=276232KB/s, iops=50472 , runt= 7592msec
> slat (usec): min=2 , max=2276 , avg= 6.87, stdev=12.01
> clat (usec): min=249 , max=52128 , avg=1258.87, stdev=1714.63
> lat (usec): min=260 , max=52134 , avg=1266.00, stdev=1715.36
> clat percentiles (usec):
> | 1.00th=[ 450], 5.00th=[ 548], 10.00th=[ 620], 20.00th=[ 724],
> | 30.00th=[ 820], 40.00th=[ 908], 50.00th=[ 1004], 60.00th=[ 1096],
> | 70.00th=[ 1208], 80.00th=[ 1368], 90.00th=[ 1640], 95.00th=[ 2040],
> | 99.00th=[ 8256], 99.50th=[14912], 99.90th=[21120], 99.95th=[23168],
> | 99.99th=[33024]
> bw (KB/s) : min=76463, max=385328, per=100.00%, avg=277313.20, stdev=94661.29
> lat (usec) : 250=0.01%, 500=2.46%, 750=19.82%, 1000=27.70%
> lat (msec) : 2=44.79%, 4=3.55%, 10=0.72%, 20=0.78%, 50=0.17%
> lat (msec) : 100=0.01%
> cpu : usr=11.91%, sys=51.64%, ctx=91337, majf=0, minf=277
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
> issued : total=r=383190/w=0/d=0, short=r=0/w=0/d=0
Hey, what's this? With Ext4 you have really good random read performance
now! Way better than the Intel SSD 320 and…
> sequentielllesen: (groupid=1, jobs=1): err= 0: pid=30081
> read : io=2048.0MB, bw=150043KB/s, iops=16641 , runt= 13977msec
> slat (usec): min=1 , max=2134 , avg= 7.09, stdev=10.83
> clat (usec): min=298 , max=16751 , avg=3836.41, stdev=755.26
> lat (usec): min=304 , max=16771 , avg=3843.77, stdev=754.77
> clat percentiles (usec):
> | 1.00th=[ 2608], 5.00th=[ 2960], 10.00th=[ 3152], 20.00th=[ 3376],
> | 30.00th=[ 3536], 40.00th=[ 3664], 50.00th=[ 3792], 60.00th=[ 3920],
> | 70.00th=[ 4080], 80.00th=[ 4256], 90.00th=[ 4448], 95.00th=[ 4704],
> | 99.00th=[ 5216], 99.50th=[ 5984], 99.90th=[13888], 99.95th=[15296],
> | 99.99th=[16512]
> bw (KB/s) : min=134280, max=169692, per=100.00%, avg=150111.30, stdev=7227.35
> lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
> lat (msec) : 2=0.07%, 4=64.79%, 10=34.87%, 20=0.25%
> cpu : usr=6.81%, sys=20.01%, ctx=77116, majf=0, minf=279
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
> issued : total=r=232601/w=0/d=0, short=r=0/w=0/d=0
… even faster than sequential reads. That just doesn't make any sense
to me.
The performance of your Samsung SSD seems to be quite erratic. It seems
that the device is capable of being fast, but only sometimes shows this
capability.
> zufälligschreiben: (groupid=2, jobs=1): err= 0: pid=30088
> write: io=111828KB, bw=1863.7KB/s, iops=209 , runt= 60006msec
[…]
> sequentiellschreiben: (groupid=3, jobs=1): err= 0: pid=30102
> write: io=137316KB, bw=2288.2KB/s, iops=254 , runt= 60013msec
[…]
> Run status group 0 (all jobs):
> READ: io=2048.0MB, aggrb=276231KB/s, minb=276231KB/s, maxb=276231KB/s, mint=7592msec, maxt=7592msec
>
> Run status group 1 (all jobs):
> READ: io=2048.0MB, aggrb=150043KB/s, minb=150043KB/s, maxb=150043KB/s, mint=13977msec, maxt=13977msec
>
> Run status group 2 (all jobs):
> WRITE: io=111828KB, aggrb=1863KB/s, minb=1863KB/s, maxb=1863KB/s, mint=60006msec, maxt=60006msec
>
> Run status group 3 (all jobs):
> WRITE: io=137316KB, aggrb=2288KB/s, minb=2288KB/s, maxb=2288KB/s, mint=60013msec, maxt=60013msec
Writes are slow again.
> Disk stats (read/write):
> sda: ios=505649/30833, merge=110818/7835, ticks=917980/187892, in_queue=1106088, util=98.50%
Well, there we have that line. You don't seem to have the other line,
possibly because you are not using LVM.
> gandalfthegreat:/mnt/mnt2#
>
> > If the values are still quite slow, I think its good to ask Linux
> > block layer experts – for example by posting on fio mailing list where
> > there are people subscribed that may be able to provide other test
> > results – and SSD experts. There might be a Linux block layer mailing
> > list or use libata or fsdevel, I don´t know.
>
> I'll try there next. I think after this mail, and your help which is very
> much appreciated, it's fair to take things out of this list to stop spamming
> the wrong people :)
Well, the only interesting thing regarding BTRFS is the whopping
difference in random read performance between BTRFS and Ext4 on this
SSD.
> > Have the IOPS test run on the device itself. That will remove any filesystem
> > layer. But run only the read-only tests; to be safe I suggest using fio
> > with the --readonly option as a safety guard. Unless you have a spare SSD
> > that you can afford to use for write testing, which will likely destroy
> > every filesystem on it. Or let it run on just one logical volume.
>
> Can you send me a recommended job config you'd like me to run if the runs
> above haven't already answered your questions?
[global]
ioengine=libaio
direct=1
filename=/dev/sdb
size=476940m
bsrange=2k-16k
[zufälliglesen]
rw=randread
runtime=60
[sequentielllesen]
stonewall
rw=read
runtime=60
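To run it read-only as before, save it, e.g. as iops-read.job (the file
name is just an example):

fio --readonly iops-read.job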
I don't expect much of a difference, but then the random read performance
is quite different between Ext4 and BTRFS on this disk. That would make
it interesting to test without any filesystem in between and over the
whole device.
It might make sense to experiment with different alignments; I think the
option is called blockalign or so…
> On Thu, Aug 02, 2012 at 01:25:17PM +0200, Martin Steigerwald wrote:
> > Heck, I didn't look at the IOPS figure!
> >
> > 189 IOPS for a SATA-600 SSD. That's pathetic.
>
> Yeah, and the new tests above without dmcrypt are no better.
>
> > A really fast 15000 rpm SAS harddisk might top that.
>
> My slow 1TB 5400rpm laptop hard drive runs du -s 4x faster than
> the SSD right now :)
>
> I read your comments about the multiple things you had in mind. Given the
> new data, should I just go to an io list now, or even save myself some time
> and just return the damned SSDs and buy from another vendor? :)
… or get yourself another SSD. It's your decision.
I admire your endurance. ;)
Thanks,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
2012-08-02 20:20 ` Martin Steigerwald
@ 2012-08-02 20:44 ` Marc MERLIN
2012-08-02 21:21 ` Martin Steigerwald
0 siblings, 1 reply; 22+ messages in thread
From: Marc MERLIN @ 2012-08-02 20:44 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: linux-btrfs, Fajar A. Nugraha
On Thu, Aug 02, 2012 at 10:20:07PM +0200, Martin Steigerwald wrote:
> Hey, what's this? With Ext4 you have really good random read performance
> now! Way better than the Intel SSD 320 and…
Yep, my du -sh tests do show that ext4 is 2x faster than btrfs.
Obviously it's sending IO in a way that either the IO subsystem, Linux
driver, or drive prefers.
> The performance of your Samsung SSD seems to be quite erratic. It seems
> that the device is capable of being fast, but only sometimes shows this
> capability.
That's indeed exactly what I'm seeing in real life :)
> > > Have the IOPS test run on the device itself. That will remove any filesystem
> > > layer. But run only the read-only tests; to be safe I suggest using fio
> > > with the --readonly option as a safety guard. Unless you have a spare SSD
> > > that you can afford to use for write testing, which will likely destroy
> > > every filesystem on it. Or let it run on just one logical volume.
> >
> > Can you send me a recommended job config you'd like me to run if the runs
> > above haven't already answered your questions?
> [global]
(...)
I used this and just changed filename to /dev/sda. Since I'm reading
from the beginning of the drive, reads have to be aligned.
> I don't expect much of a difference, but then the random read performance
> is quite different between Ext4 and BTRFS on this disk. That would make
> it interesting to test without any filesystem in between and over the
> whole device.
Here is the output:
gandalfthegreat:~# fio --readonly ./fio.job3
zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1
sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1
2.0.8
Starting 2 processes
Jobs: 1 (f=1): [_R] [66.9% done] [966K/0K /s] [108 /0 iops] [eta 01m:00s]
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=2172
read : io=59036KB, bw=983.93KB/s, iops=108 , runt= 60002msec
slat (usec): min=5 , max=158 , avg=27.62, stdev=10.64
clat (usec): min=45 , max=27348 , avg=9150.78, stdev=4452.66
lat (usec): min=53 , max=27370 , avg=9179.05, stdev=4454.88
clat percentiles (usec):
| 1.00th=[ 126], 5.00th=[ 235], 10.00th=[ 5216], 20.00th=[ 5920],
| 30.00th=[ 5920], 40.00th=[ 5984], 50.00th=[ 7712], 60.00th=[12480],
| 70.00th=[12608], 80.00th=[12736], 90.00th=[12864], 95.00th=[16768],
| 99.00th=[18560], 99.50th=[18816], 99.90th=[20352], 99.95th=[22656],
| 99.99th=[27264]
bw (KB/s) : min= 423, max= 5776, per=100.00%, avg=986.48, stdev=480.68
lat (usec) : 50=0.11%, 100=0.64%, 250=4.47%, 500=1.65%, 750=0.02%
lat (usec) : 1000=0.02%
lat (msec) : 2=0.06%, 4=0.03%, 10=43.31%, 20=49.51%, 50=0.18%
cpu : usr=0.17%, sys=0.45%, ctx=6534, majf=0, minf=26
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=6532/w=0/d=0, short=r=0/w=0/d=0
sequentielllesen: (groupid=1, jobs=1): err= 0: pid=2199
read : io=54658KB, bw=932798 B/s, iops=101 , runt= 60002msec
slat (usec): min=5 , max=140 , avg=28.63, stdev= 9.91
clat (usec): min=39 , max=34210 , avg=9799.18, stdev=4471.32
lat (usec): min=45 , max=34228 , avg=9828.50, stdev=4472.06
clat percentiles (usec):
| 1.00th=[ 61], 5.00th=[ 5088], 10.00th=[ 5856], 20.00th=[ 5920],
| 30.00th=[ 5984], 40.00th=[ 6048], 50.00th=[11840], 60.00th=[12608],
| 70.00th=[12608], 80.00th=[12736], 90.00th=[16512], 95.00th=[17536],
| 99.00th=[18816], 99.50th=[19584], 99.90th=[24960], 99.95th=[29568],
| 99.99th=[34048]
bw (KB/s) : min= 405, max= 2680, per=100.00%, avg=912.92, stdev=261.62
lat (usec) : 50=0.41%, 100=1.77%, 250=1.20%, 500=0.23%, 750=0.02%
lat (usec) : 1000=0.03%
lat (msec) : 2=0.02%, 10=43.06%, 20=52.91%, 50=0.36%
cpu : usr=0.15%, sys=0.45%, ctx=6103, majf=0, minf=28
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=6101/w=0/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
READ: io=59036KB, aggrb=983KB/s, minb=983KB/s, maxb=983KB/s, mint=60002msec, maxt=60002msec
Run status group 1 (all jobs):
READ: io=54658KB, aggrb=910KB/s, minb=910KB/s, maxb=910KB/s, mint=60002msec, maxt=60002msec
Disk stats (read/write):
sda: ios=12660/2072, merge=5/34, ticks=119452/22496, in_queue=141936, util=99.30%
> … or get yourself another SSD. It's your decision.
>
> I admire your endurance. ;)
Since I've gotten 2 SSDs to make sure I didn't get one bad one, and the
company is getting great reviews for them, I'm now pretty sure that
it's a problem with a Linux driver, which is interesting for us
all to debug :)
If I go buy another brand, the next guy will have the same problems as
me.
But yes, if we figure this out, Samsung owes you and me some money :)
I'll try plugging this SSD into a totally different PC and see what happens.
That may tell us whether it's an AHCI/Intel SATA driver problem.
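(One quick way to see which kernel driver is actually bound to the SATA
controller, for example:)

lspci -nnk | grep -A 3 -i sata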
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
2012-08-02 20:44 ` Marc MERLIN
@ 2012-08-02 21:21 ` Martin Steigerwald
2012-08-02 21:49 ` Marc MERLIN
0 siblings, 1 reply; 22+ messages in thread
From: Martin Steigerwald @ 2012-08-02 21:21 UTC (permalink / raw)
To: Marc MERLIN; +Cc: linux-btrfs, Fajar A. Nugraha
On Thursday, 2 August 2012, Marc MERLIN wrote:
> On Thu, Aug 02, 2012 at 10:20:07PM +0200, Martin Steigerwald wrote:
> > Hey, what's this? With Ext4 you have really good random read performance
> > now! Way better than the Intel SSD 320 and…
>
> Yep, my du -sh tests do show that ext4 is 2x faster than btrfs.
> Obviously it's sending IO in a way that either the IO subsystem, Linux
> driver, or drive prefers.
But only on reads.
> > > > Have the IOPS test run on the device itself. That will remove any filesystem
> > > > layer. But run only the read-only tests; to be safe I suggest using fio
> > > > with the --readonly option as a safety guard. Unless you have a spare SSD
> > > > that you can afford to use for write testing, which will likely destroy
> > > > every filesystem on it. Or let it run on just one logical volume.
> > >
> > > Can you send me a recommended job config you'd like me to run if the runs
> > > above haven't already answered your questions?
>
> > [global]
> (...)
>
> I used this and just changed filename to /dev/sda. Since I'm reading
> from the beginning of the drive, reads have to be aligned.
>
> > I don't expect much of a difference, but then the random read performance
> > is quite different between Ext4 and BTRFS on this disk. That would make
> > it interesting to test without any filesystem in between and over the
> > whole device.
>
> Here is the output:
> gandalfthegreat:~# fio --readonly ./fio.job3
> zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1
> sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1
> 2.0.8
> Starting 2 processes
> Jobs: 1 (f=1): [_R] [66.9% done] [966K/0K /s] [108 /0 iops] [eta 01m:00s]
> zufälliglesen: (groupid=0, jobs=1): err= 0: pid=2172
> read : io=59036KB, bw=983.93KB/s, iops=108 , runt= 60002msec
WTF?
Hey, did you adapt the size= keyword? It seems fio 2.0.8 can do completely
without it.
Also I noticed that I had iodepth 1 in there, to circumvent any in-drive
caching / optimization.
> slat (usec): min=5 , max=158 , avg=27.62, stdev=10.64
> clat (usec): min=45 , max=27348 , avg=9150.78, stdev=4452.66
> lat (usec): min=53 , max=27370 , avg=9179.05, stdev=4454.88
> clat percentiles (usec):
> | 1.00th=[ 126], 5.00th=[ 235], 10.00th=[ 5216], 20.00th=[ 5920],
> | 30.00th=[ 5920], 40.00th=[ 5984], 50.00th=[ 7712], 60.00th=[12480],
> | 70.00th=[12608], 80.00th=[12736], 90.00th=[12864], 95.00th=[16768],
> | 99.00th=[18560], 99.50th=[18816], 99.90th=[20352], 99.95th=[22656],
> | 99.99th=[27264]
> bw (KB/s) : min= 423, max= 5776, per=100.00%, avg=986.48, stdev=480.68
> lat (usec) : 50=0.11%, 100=0.64%, 250=4.47%, 500=1.65%, 750=0.02%
> lat (usec) : 1000=0.02%
> lat (msec) : 2=0.06%, 4=0.03%, 10=43.31%, 20=49.51%, 50=0.18%
> cpu : usr=0.17%, sys=0.45%, ctx=6534, majf=0, minf=26
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=6532/w=0/d=0, short=r=0/w=0/d=0
Latency is still way too high even with iodepth 1: 10 milliseconds for
43% of requests. And the throughput and IOPS are still abysmal even
for iodepth 1 (see below for the Intel SSD 320 values).
Okay, one further idea: remove the bsrange to just test with 4k blocks.
Additionally, test that these are aligned, with
blockalign=int[,int], ba=int[,int]
At what boundary to align random IO offsets. Defaults to
the same as 'blocksize' the minimum blocksize given.
Minimum alignment is typically 512b for using direct IO,
though it usually depends on the hardware block size.
This option is mutually exclusive with using a random
map for files, so it will turn off that option.
I would first test with 4k blocks as is. And then do something like:
blocksize=4k
blockalign=4k
And then raise blockalign to some values that may matter, like 8k, 128k,
512k, 1m or so.
But that's just guesswork. I do not even know exactly whether it works this
way in fio.
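A sketch of what such an alignment experiment could look like as a job
file; untested, the device name is an example, and blockalign is the
value to vary between runs:

[global]
ioengine=libaio
direct=1
filename=/dev/sda
blocksize=4k
runtime=60

[ausgerichtetlesen]
rw=randread
blockalign=4k

Run it with fio --readonly as before, and raise blockalign step by step.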
There is something pretty weird going on. But I am not sure what it is.
Maybe an alignment issue, since Ext4 with stripe alignment was able to
read so much faster.
> sequentielllesen: (groupid=1, jobs=1): err= 0: pid=2199
> read : io=54658KB, bw=932798 B/s, iops=101 , runt= 60002msec
Hey, what's this?
> slat (usec): min=5 , max=140 , avg=28.63, stdev= 9.91
> clat (usec): min=39 , max=34210 , avg=9799.18, stdev=4471.32
> lat (usec): min=45 , max=34228 , avg=9828.50, stdev=4472.06
> clat percentiles (usec):
> | 1.00th=[ 61], 5.00th=[ 5088], 10.00th=[ 5856], 20.00th=[ 5920],
> | 30.00th=[ 5984], 40.00th=[ 6048], 50.00th=[11840], 60.00th=[12608],
> | 70.00th=[12608], 80.00th=[12736], 90.00th=[16512], 95.00th=[17536],
> | 99.00th=[18816], 99.50th=[19584], 99.90th=[24960], 99.95th=[29568],
> | 99.99th=[34048]
> bw (KB/s) : min= 405, max= 2680, per=100.00%, avg=912.92, stdev=261.62
> lat (usec) : 50=0.41%, 100=1.77%, 250=1.20%, 500=0.23%, 750=0.02%
> lat (usec) : 1000=0.03%
> lat (msec) : 2=0.02%, 10=43.06%, 20=52.91%, 50=0.36%
> cpu : usr=0.15%, sys=0.45%, ctx=6103, majf=0, minf=28
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=6101/w=0/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
> READ: io=59036KB, aggrb=983KB/s, minb=983KB/s, maxb=983KB/s, mint=60002msec, maxt=60002msec
>
> Run status group 1 (all jobs):
> READ: io=54658KB, aggrb=910KB/s, minb=910KB/s, maxb=910KB/s, mint=60002msec, maxt=60002msec
>
> Disk stats (read/write):
> sda: ios=12660/2072, merge=5/34, ticks=119452/22496, in_queue=141936, util=99.30%
What on earth is this?
You are testing the raw device and are getting these fun values out of it?
Let me repeat that here:
merkaba:/tmp> cat iops-read.job
[global]
ioengine=libaio
direct=1
filename=/dev/sda
bsrange=2k-16k
[zufälliglesen]
rw=randread
runtime=60
[sequentielllesen]
stonewall
rw=read
runtime=60
merkaba:/tmp> fio iops-read.job
zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1
sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1
2.0.8
Starting 2 processes
Jobs: 1 (f=1): [_R] [66.9% done] [57784K/0K /s] [6471 /0 iops] [eta 01m:00s]
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=31915
read : io=1681.2MB, bw=28692KB/s, iops=3193 , runt= 60001msec
slat (usec): min=5 , max=991 , avg=31.18, stdev=10.07
clat (usec): min=2 , max=3312 , avg=276.67, stdev=124.41
lat (usec): min=48 , max=3337 , avg=308.54, stdev=125.51
clat percentiles (usec):
| 1.00th=[ 61], 5.00th=[ 82], 10.00th=[ 103], 20.00th=[ 159],
| 30.00th=[ 201], 40.00th=[ 237], 50.00th=[ 270], 60.00th=[ 306],
| 70.00th=[ 354], 80.00th=[ 402], 90.00th=[ 446], 95.00th=[ 478],
| 99.00th=[ 524], 99.50th=[ 548], 99.90th=[ 644], 99.95th=[ 716],
| 99.99th=[ 1020]
bw (KB/s) : min=27760, max=31784, per=100.00%, avg=28694.49, stdev=625.77
lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.15%, 100=8.76%
lat (usec) : 250=34.69%, 500=54.13%, 750=2.23%, 1000=0.03%
lat (msec) : 2=0.01%, 4=0.01%
cpu : usr=2.91%, sys=12.49%, ctx=199678, majf=0, minf=26
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=191587/w=0/d=0, short=r=0/w=0/d=0
sequentielllesen: (groupid=1, jobs=1): err= 0: pid=31921
read : io=3895.9MB, bw=66487KB/s, iops=7384 , runt= 60001msec
slat (usec): min=5 , max=518 , avg=30.18, stdev= 8.16
clat (usec): min=1 , max=3394 , avg=100.21, stdev=83.51
lat (usec): min=44 , max=3429 , avg=131.69, stdev=84.44
clat percentiles (usec):
| 1.00th=[ 54], 5.00th=[ 60], 10.00th=[ 61], 20.00th=[ 68],
| 30.00th=[ 72], 40.00th=[ 79], 50.00th=[ 86], 60.00th=[ 92],
| 70.00th=[ 99], 80.00th=[ 105], 90.00th=[ 112], 95.00th=[ 163],
| 99.00th=[ 556], 99.50th=[ 716], 99.90th=[ 900], 99.95th=[ 940],
| 99.99th=[ 1080]
bw (KB/s) : min=30276, max=81176, per=99.95%, avg=66453.19, stdev=10893.90
lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.57%
lat (usec) : 100=69.52%, 250=26.59%, 500=2.01%, 750=0.89%, 1000=0.39%
lat (msec) : 2=0.02%, 4=0.01%
cpu : usr=7.12%, sys=27.34%, ctx=461537, majf=0, minf=27
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=443106/w=0/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
READ: io=1681.2MB, aggrb=28691KB/s, minb=28691KB/s, maxb=28691KB/s, mint=60001msec, maxt=60001msec
Run status group 1 (all jobs):
READ: io=3895.9MB, aggrb=66487KB/s, minb=66487KB/s, maxb=66487KB/s, mint=60001msec, maxt=60001msec
Disk stats (read/write):
sda: ios=632851/228, merge=0/140, ticks=106392/351, in_queue=105682, util=87.78%
I have seen about 4000 IOPS with pure 4k blocks from that drive, so this
seems to be fine. And look at the latencies: barely any requests even in
the low millisecond range.
Now just for the sake of it with iodepth 64:
merkaba:/tmp> fio iops-read.job
zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
2.0.8
Starting 2 processes
Jobs: 1 (f=1): [_R] [66.9% done] [217.8M/0K /s] [24.1K/0 iops] [eta 01m:00s]
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=31945
read : io=12412MB, bw=211819KB/s, iops=23795 , runt= 60003msec
slat (usec): min=2 , max=2158 , avg=15.69, stdev= 8.96
clat (usec): min=138 , max=19349 , avg=2670.01, stdev=861.85
lat (usec): min=189 , max=19366 , avg=2686.29, stdev=861.67
clat percentiles (usec):
| 1.00th=[ 1256], 5.00th=[ 1448], 10.00th=[ 1592], 20.00th=[ 1864],
| 30.00th=[ 2128], 40.00th=[ 2384], 50.00th=[ 2640], 60.00th=[ 2896],
| 70.00th=[ 3152], 80.00th=[ 3408], 90.00th=[ 3728], 95.00th=[ 3952],
| 99.00th=[ 4448], 99.50th=[ 4896], 99.90th=[ 8768], 99.95th=[10048],
| 99.99th=[12096]
bw (KB/s) : min=116464, max=216400, per=100.00%, avg=211823.16, stdev=9787.89
lat (usec) : 250=0.01%, 500=0.01%, 750=0.02%, 1000=0.05%
lat (msec) : 2=25.18%, 4=70.40%, 10=4.28%, 20=0.05%
cpu : usr=14.39%, sys=44.38%, ctx=624900, majf=0, minf=278
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=1427811/w=0/d=0, short=r=0/w=0/d=0
sequentielllesen: (groupid=1, jobs=1): err= 0: pid=32301
read : io=12499MB, bw=213307KB/s, iops=23691 , runt= 60002msec
slat (usec): min=1 , max=1509 , avg=12.42, stdev= 8.12
clat (usec): min=306 , max=201523 , avg=2685.75, stdev=1220.14
lat (usec): min=316 , max=201536 , avg=2698.70, stdev=1220.06
clat percentiles (usec):
| 1.00th=[ 1816], 5.00th=[ 2024], 10.00th=[ 2096], 20.00th=[ 2192],
| 30.00th=[ 2256], 40.00th=[ 2352], 50.00th=[ 2416], 60.00th=[ 2544],
| 70.00th=[ 2736], 80.00th=[ 2992], 90.00th=[ 3632], 95.00th=[ 4320],
| 99.00th=[ 5664], 99.50th=[ 6112], 99.90th=[ 7136], 99.95th=[ 7712],
| 99.99th=[26240]
bw (KB/s) : min=144720, max=256568, per=99.96%, avg=213210.08, stdev=23175.58
lat (usec) : 500=0.01%, 750=0.02%, 1000=0.06%
lat (msec) : 2=4.10%, 4=88.72%, 10=7.09%, 20=0.01%, 50=0.01%
lat (msec) : 100=0.01%, 250=0.01%
cpu : usr=11.77%, sys=34.79%, ctx=440558, majf=0, minf=280
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=1421534/w=0/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
READ: io=12412MB, aggrb=211818KB/s, minb=211818KB/s, maxb=211818KB/s, mint=60003msec, maxt=60003msec
Run status group 1 (all jobs):
READ: io=12499MB, aggrb=213306KB/s, minb=213306KB/s, maxb=213306KB/s, mint=60002msec, maxt=60002msec
Disk stats (read/write):
sda: ios=2151624/243, merge=692022/115, ticks=5719203/651132, in_queue=6372962, util=99.87%
(It seems that extra line is gone from fio 2; I dimly remember that
Jens Axboe removed some output he found to be superfluous, but the
disk stats are still there.)
In that case random versus sequential does not even seem to make much
of a difference.
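(The numbers are also internally consistent: with 64 requests in
flight and an average completion latency of about 2.69 ms, Little's
law gives 64 / 0.00269 s ≈ 23,800 IOPS – almost exactly the iops=23795
fio reports for the random job.)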
> > … or get yourself another SSD. It's your decision.
> >
> > I admire your endurance. ;)
>
> Since I've gotten 2 SSDs to make sure I didn't get one bad one, and since the
> company is getting great reviews for them, I'm now pretty sure that
> it's a problem with a Linux driver, which is interesting for us
> all to debug :)
Either that, or possibly there is some firmware update? But then those
reviews would mention it, I bet.
I'd still check whether there is some firmware update available labeled:
"This will raise performance on lots of Linux based workloads by factors
of 10x to 250x" ;)
Anyway, I think we are now pretty close to the hardware.
So the issue has to be somewhere in the block layer, the controller,
the SSD firmware, or the SSD itself, IMHO.
> If I go buy another brand, the next guy will have the same problems as
> me.
>
> But yes, if we figure this out, Samsung owes you and me some money :)
;)
> I'll try plugging this SSD into a totally different PC and see what happens.
> This may tell us whether it's an AHCI/Intel SATA driver problem.
Seems we will continue until someone starts to complain here. Maybe
another list would be more appropriate? But then this thread has it all
in one place ;). Adding a CC with some introductory note might be
appropriate. It's your problem, so you decide ;). I'd suggest the fio
mailing list; there are other performance people there who may want to
chime in.
Another idea: is there anything funny in the SMART values (smartctl -a
and possibly even -x)? Well, I gather these SSDs are new, but still.
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
2012-08-02 21:21 ` Martin Steigerwald
@ 2012-08-02 21:49 ` Marc MERLIN
2012-08-03 18:45 ` Martin Steigerwald
0 siblings, 1 reply; 22+ messages in thread
From: Marc MERLIN @ 2012-08-02 21:49 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: linux-btrfs, Fajar A. Nugraha
On Thu, Aug 02, 2012 at 11:21:51PM +0200, Martin Steigerwald wrote:
> > Yep, my du -sh tests do show that ext4 is 2x faster than btrfs.
> > Obviously it's sending IO in a way that either the IO subsystem, Linux
> > driver, or drive prefers.
>
> But only on reads.
Yes. I was focusing on the simple case. SSDs can be slower than hard
drives on writes, and now you're getting into deep firmware magic for
corner cases, so I didn't want to touch that here :)
> WTF?
>
> Hey, did you adapt the size= keyword? It seems fio 2.0.8 can do
> without it completely.
I left it as is:
[global]
ioengine=libaio
direct=1
filename=/dev/sda
size=476940m
bsrange=2k-16k
[zufälliglesen]
rw=randread
runtime=60
[sequentielllesen]
stonewall
rw=read
runtime=60
I just removed it now.
> Okay, one further idea: Remove the bsrange to just test with 4k blocks.
Ok, there you go:
[global]
ioengine=libaio
direct=1
filename=/dev/sda
blocksize=4k
blockalign=4k
[zufälliglesen]
rw=randread
runtime=60
[sequentielllesen]
stonewall
rw=read
runtime=60
gandalfthegreat:~# fio --readonly ./fio.job4
zufälliglesen: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1
sequentielllesen: (g=1): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1
2.0.8
Starting 2 processes
Jobs: 1 (f=1): [_R] [66.9% done] [1476K/0K /s] [369 /0 iops] [eta 01m:00s]
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=3614
read : io=60832KB, bw=1013.7KB/s, iops=253 , runt= 60012msec
slat (usec): min=4 , max=892 , avg=24.36, stdev=16.05
clat (usec): min=2 , max=27575 , avg=3915.39, stdev=4990.69
lat (usec): min=44 , max=27591 , avg=3940.35, stdev=4991.99
clat percentiles (usec):
| 1.00th=[ 41], 5.00th=[ 46], 10.00th=[ 51], 20.00th=[ 66],
| 30.00th=[ 113], 40.00th=[ 163], 50.00th=[ 205], 60.00th=[ 4960],
| 70.00th=[ 5856], 80.00th=[11584], 90.00th=[12480], 95.00th=[12608],
| 99.00th=[14272], 99.50th=[17280], 99.90th=[18816], 99.95th=[21120],
| 99.99th=[22912]
bw (KB/s) : min= 253, max= 2520, per=100.00%, avg=1014.86, stdev=364.74
lat (usec) : 4=0.05%, 10=0.01%, 20=0.01%, 50=8.07%, 100=19.94%
lat (usec) : 250=26.26%, 500=2.78%, 750=0.30%, 1000=0.09%
lat (msec) : 2=0.12%, 4=0.07%, 10=20.75%, 20=21.51%, 50=0.05%
cpu : usr=0.31%, sys=0.88%, ctx=15250, majf=0, minf=23
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=15208/w=0/d=0, short=r=0/w=0/d=0
sequentielllesen: (groupid=1, jobs=1): err= 0: pid=3636
read : io=88468KB, bw=1474.3KB/s, iops=368 , runt= 60009msec
slat (usec): min=3 , max=340 , avg=16.46, stdev=10.23
clat (usec): min=1 , max=27356 , avg=2692.99, stdev=4564.76
lat (usec): min=30 , max=27370 , avg=2709.87, stdev=4566.31
clat percentiles (usec):
| 1.00th=[ 27], 5.00th=[ 29], 10.00th=[ 31], 20.00th=[ 35],
| 30.00th=[ 47], 40.00th=[ 61], 50.00th=[ 78], 60.00th=[ 107],
| 70.00th=[ 270], 80.00th=[ 5856], 90.00th=[11712], 95.00th=[12608],
| 99.00th=[16512], 99.50th=[17536], 99.90th=[18816], 99.95th=[19584],
| 99.99th=[21120]
bw (KB/s) : min= 282, max= 5096, per=100.00%, avg=1474.40, stdev=899.36
lat (usec) : 2=0.05%, 4=0.01%, 20=0.02%, 50=32.45%, 100=24.61%
lat (usec) : 250=12.52%, 500=1.40%, 750=0.03%, 1000=0.03%
lat (msec) : 2=0.02%, 4=0.03%, 10=13.95%, 20=14.85%, 50=0.03%
cpu : usr=0.33%, sys=0.90%, ctx=22176, majf=0, minf=25
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=22117/w=0/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
READ: io=60832KB, aggrb=1013KB/s, minb=1013KB/s, maxb=1013KB/s, mint=60012msec, maxt=60012msec
Run status group 1 (all jobs):
READ: io=88468KB, aggrb=1474KB/s, minb=1474KB/s, maxb=1474KB/s, mint=60009msec, maxt=60009msec
Disk stats (read/write):
sda: ios=54013/1836, merge=291/17, ticks=251660/11052, in_queue=262668, util=99.51%
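(Note how completely latency-bound this is at iodepth=1: the random
job's average completion latency is ~3.94 ms, and 1 / 0.00394 s ≈ 254
requests per second – matching the reported iops=253 almost exactly.
Each request has to wait for the previous one to finish.)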
and the same test with blockalign=512k:
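(Same job file as above, with only the alignment line in [global]
changed:

blockalign=512k
)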
gandalfthegreat:~# fio --readonly ./fio.job4 # blockalign=512k
fio: Any use of blockalign= turns off randommap
zufälliglesen: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1
sequentielllesen: (g=1): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1
2.0.8
Starting 2 processes
Jobs: 1 (f=1): [_R] [66.9% done] [4180K/0K /s] [1045 /0 iops] [eta 01m:00s]
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=3679
read : io=56096KB, bw=957355 B/s, iops=233 , runt= 60001msec
slat (usec): min=4 , max=344 , avg=27.73, stdev=15.51
clat (usec): min=2 , max=38954 , avg=4244.27, stdev=5153.32
lat (usec): min=44 , max=38980 , avg=4272.70, stdev=5154.68
clat percentiles (usec):
| 1.00th=[ 42], 5.00th=[ 46], 10.00th=[ 53], 20.00th=[ 69],
| 30.00th=[ 120], 40.00th=[ 169], 50.00th=[ 211], 60.00th=[ 5024],
| 70.00th=[ 5920], 80.00th=[11584], 90.00th=[12480], 95.00th=[12608],
| 99.00th=[16768], 99.50th=[17536], 99.90th=[18560], 99.95th=[20096],
| 99.99th=[37632]
bw (KB/s) : min= 263, max= 3133, per=100.00%, avg=935.09, stdev=392.66
lat (usec) : 4=0.05%, 10=0.03%, 20=0.01%, 50=7.84%, 100=19.12%
lat (usec) : 250=25.84%, 500=1.61%, 750=0.08%, 1000=0.05%
lat (msec) : 2=0.06%, 4=0.08%, 10=21.99%, 20=23.17%, 50=0.06%
cpu : usr=0.29%, sys=0.94%, ctx=14069, majf=0, minf=25
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=14024/w=0/d=0, short=r=0/w=0/d=0
sequentielllesen: (groupid=1, jobs=1): err= 0: pid=3703
read : io=99592KB, bw=1659.7KB/s, iops=414 , runt= 60008msec
slat (usec): min=4 , max=186 , avg=15.83, stdev= 9.95
clat (usec): min=1 , max=25530 , avg=2390.79, stdev=4340.79
lat (usec): min=28 , max=25539 , avg=2407.01, stdev=4342.90
clat percentiles (usec):
| 1.00th=[ 26], 5.00th=[ 28], 10.00th=[ 30], 20.00th=[ 34],
| 30.00th=[ 43], 40.00th=[ 57], 50.00th=[ 72], 60.00th=[ 91],
| 70.00th=[ 155], 80.00th=[ 5152], 90.00th=[11712], 95.00th=[12480],
| 99.00th=[12992], 99.50th=[16768], 99.90th=[18560], 99.95th=[18816],
| 99.99th=[21632]
bw (KB/s) : min= 269, max= 8465, per=100.00%, avg=1663.84, stdev=1367.82
lat (usec) : 2=0.01%, 4=0.02%, 20=0.01%, 50=35.42%, 100=26.00%
lat (usec) : 250=11.29%, 500=1.31%, 750=0.05%, 1000=0.01%
lat (msec) : 2=0.04%, 4=0.02%, 10=12.55%, 20=13.25%, 50=0.01%
cpu : usr=0.31%, sys=0.99%, ctx=24963, majf=0, minf=25
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=24898/w=0/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
READ: io=56096KB, aggrb=934KB/s, minb=934KB/s, maxb=934KB/s, mint=60001msec, maxt=60001msec
Run status group 1 (all jobs):
READ: io=99592KB, aggrb=1659KB/s, minb=1659KB/s, maxb=1659KB/s, mint=60008msec, maxt=60008msec
Disk stats (read/write):
sda: ios=55021/1637, merge=72/39, ticks=240896/12080, in_queue=252912, util=99.59%
> > Since I've gotten 2 SSDs to make sure I didn't get one bad one, and since the
> > company is getting great reviews for them, I'm now pretty sure that
> > it's a problem with a Linux driver, which is interesting for us
> > all to debug :)
>
> Either that, or possibly there is some firmware update? But then those
> reviews would mention it, I bet.
So far, there are no firmware updates for mine; I checked right after
unboxing it :)
> > I'll try plugging this SSD into a totally different PC and see what happens.
> > This may tell us whether it's an AHCI/Intel SATA driver problem.
>
> Seems we will continue until someone starts to complain here. Maybe
> another list would be more appropriate? But then this thread has it all
> in one place ;). Adding a CC with some introductory note might be
> appropriate. It's your problem, so you decide ;). I'd suggest the fio
> mailing list; there are other performance people there who may want to
> chime in.
Actually you know the lists and people involved better than I do. I'd be
happy if you added a Cc to another list you think is best, and we can
move there.
> Another idea: is there anything funny in the SMART values (smartctl -a
> and possibly even -x)? Well, I gather these SSDs are new, but still.
smartctl -a /dev/sda
(snipped)
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 36
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 7
177 Wear_Leveling_Count 0x0013 099 099 000 Pre-fail Always - 9
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0
181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0032 058 040 000 Old_age Always - 42
195 Hardware_ECC_Recovered 0x001a 200 200 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 253 253 000 Old_age Always - 0
235 Unknown_Attribute 0x0012 099 099 000 Old_age Always - 4
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 433339112
smartctl -x /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.5.0-amd64-preempt-noide-20120410] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model: SAMSUNG SSD 830 Series
Serial Number: S0WGNEAC400484
LU WWN Device Id: 5 002538 043584d30
Firmware Version: CXM03B1Q
User Capacity: 512,110,190,592 bytes [512 GB]
Sector Size: 512 bytes logical/physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ACS-2 revision 2
Local Time is: Thu Aug 2 14:42:08 2012 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x83) Offline data collection activity
is in a Reserved state.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 1980) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 33) minutes.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0
9 Power_On_Hours -O--CK 099 099 000 - 36
12 Power_Cycle_Count -O--CK 099 099 000 - 7
177 Wear_Leveling_Count PO--C- 099 099 000 - 9
179 Used_Rsvd_Blk_Cnt_Tot PO--C- 100 100 010 - 0
181 Program_Fail_Cnt_Total -O--CK 100 100 010 - 0
182 Erase_Fail_Count_Total -O--CK 100 100 010 - 0
183 Runtime_Bad_Block PO--C- 100 100 010 - 0
187 Reported_Uncorrect -O--CK 100 100 000 - 0
190 Airflow_Temperature_Cel -O--CK 058 040 000 - 42
195 Hardware_ECC_Recovered -O-RC- 200 200 000 - 0
199 UDMA_CRC_Error_Count -OSRCK 253 253 000 - 0
235 Unknown_Attribute -O--C- 099 099 000 - 4
241 Total_LBAs_Written -O--CK 099 099 000 - 433352008
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
GP/S Log at address 0x00 has 1 sectors [Log Directory]
GP/S Log at address 0x01 has 1 sectors [Summary SMART error log]
GP/S Log at address 0x02 has 2 sectors [Comprehensive SMART error log]
GP/S Log at address 0x03 has 2 sectors [Ext. Comprehensive SMART error log]
GP/S Log at address 0x06 has 1 sectors [SMART self-test log]
GP/S Log at address 0x07 has 2 sectors [Extended self-test log]
GP/S Log at address 0x09 has 1 sectors [Selective self-test log]
GP/S Log at address 0x10 has 1 sectors [NCQ Command Error]
GP/S Log at address 0x11 has 1 sectors [SATA Phy Event Counters]
GP/S Log at address 0x80 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x81 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x82 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x83 has 16 sectors [Host vendor specific log]
(snip)
SMART Extended Comprehensive Error Log Version: 1 (2 sectors)
No Errors Logged
SMART Extended Self-test Log Version: 1 (2 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 27 -
# 2 Short offline Completed without error 00% 8 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Warning: device does not support SCT Commands
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0002 2 49152 R_ERR response for data FIS
0x0003 2 53248 R_ERR response for device-to-host data FIS
0x0004 2 57344 R_ERR response for host-to-device data FIS
0x0005 2 7952 R_ERR response for non-data FIS
0x0006 2 2562 R_ERR response for device-to-host non-data FIS
0x0007 2 4096 R_ERR response for host-to-device non-data FIS
0x0008 2 7952 Device-to-host non-data FIS retries
0x0009 2 7953 Transition from drive PhyRdy to drive PhyNRdy
0x000a 2 25 Device-to-host register FISes sent due to a COMRESET
0x000b 2 6145 CRC errors within host-to-device FIS
0x000d 2 7953 Non-CRC errors within host-to-device FIS
0x000f 2 8176 R_ERR response for host-to-device data FIS, CRC
0x0010 2 15 R_ERR response for host-to-device data FIS, non-CRC
0x0012 2 49153 R_ERR response for host-to-device non-data FIS, CRC
0x0013 2 12112 R_ERR response for host-to-device non-data FIS, non-CRC
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
2012-08-02 21:49 ` Marc MERLIN
@ 2012-08-03 18:45 ` Martin Steigerwald
2012-08-16 7:45 ` Marc MERLIN
0 siblings, 1 reply; 22+ messages in thread
From: Martin Steigerwald @ 2012-08-03 18:45 UTC (permalink / raw)
To: Marc MERLIN; +Cc: linux-btrfs, Fajar A. Nugraha
On Thursday, 2 August 2012, Marc MERLIN wrote:
> > > I'll try plugging this SSD into a totally different PC and see what
> > > happens. This may tell us whether it's an AHCI/Intel SATA driver problem.
> >
> >
> >
> > Seems we will continue until someone starts to complain here. Maybe
> > another list would be more appropriate? But then this thread has it all
> > in one place ;). Adding a CC with some introductory note might be
> > appropriate. It's your problem, so you decide ;). I'd suggest the fio
> > mailing list; there are other performance people there who may want to
> > chime in.
>
>
> Actually you know the lists and people involved better than I do. I'd be
> happy if you added a Cc to another list you think is best, and we can
> move there.
I have no more ideas for the moment.
I suggest you try the fio mailing list at fio (at) vger.kernel.org.
I currently have no ideas for other lists. I bet there is some block layer
/ libata / scsi list that might match. Just browse the lists on
kernel.org and pick the one that sounds most suitable to you¹. Probably
not fsdevel, since the thing does not seem to be filesystem related beyond
some alignment effect (stripe in Ext4). Hmmm, probably linux-scsi, but
look there first to see whether block layer and libata related things are
discussed there.
I would start a new thread. State the problem, summarize what you have
tried and what the outcome was, and ask for advice. Add a link to
the original thread here. (It might be a good idea to have one more post
here with a link to the new thread so that other curious people can follow
along – in case there are any. Otherwise I would end it on this thread.
The thread view already looks ridiculous in KMail here ;-))
And put me on Cc ;). I'd really like to know what's going on here. As
written, I have no more ideas right now. It seems to be getting too
low-level for me.
[1] http://vger.kernel.org/vger-lists.html
Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
2012-08-03 18:45 ` Martin Steigerwald
@ 2012-08-16 7:45 ` Marc MERLIN
0 siblings, 0 replies; 22+ messages in thread
From: Marc MERLIN @ 2012-08-16 7:45 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: linux-btrfs, Fajar A. Nugraha
To close this thread.
I gave up on the drive after it showed abysmal benchmarking results in
Windows too.
I owed everyone an update, which I just finished typing:
http://marc.merlins.org/perso/linux/post_2012-08-15_The-tale-of-SSDs_-Crucial-C300-early-Death_-Samsung-830-extreme-random-IO-slowness_-and-settling-with-OCZ-Vertex-4.html
I'm not sure how I could have gotten 2 bad drives from Samsung in 2
different shipments, so I'm afraid the entire line may be bad. At least, it
was for me after extensive benchmarking, and even when using their own
Windows benchmarking tool.
In the end, I got an OCZ Vertex 4 and it's superfast, as per the benchmarks I
posted in the link above.
What is interesting is that ext4 is somewhat faster than btrfs in the tests
I posted above, but not in ways that should be truly worrisome.
I'm still happy to have COW and snapshots, even if that costs me
performance.
Sorry for spamming the list with what apparently just ended up being a
crappy drive (well, 2, since I got two just to make sure) from a vendor
who shouldn't have been selling crappy SSDs.
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2012-08-16 7:45 UTC | newest]
Thread overview: 22+ messages
2012-07-24 6:18 How can btrfs take 23sec to stat 23K files from an SSD? Marc MERLIN
-- strict thread matches above, loose matches on Subject: below --
2012-02-02 12:42 brtfs on top of dmcrypt with SSD. No corruption iff write cache off? Chris Mason
2012-07-18 22:04 ` brtfs on top of dmcrypt with SSD -> Trim or no Trim Marc MERLIN
2012-07-22 18:58 ` brtfs on top of dmcrypt with SSD -> ssd or nossd + crypt performance? Marc MERLIN
2012-07-22 19:35 ` Martin Steigerwald
2012-07-22 20:44 ` Marc MERLIN
2012-07-22 22:41 ` brtfs on top of dmcrypt with SSD -> file access 5x slower than spinning disk Marc MERLIN
2012-07-23 6:42 ` How can btrfs take 23sec to stat 23K files from an SSD? Marc MERLIN
2012-07-24 7:56 ` Martin Steigerwald
2012-07-27 4:40 ` Marc MERLIN
2012-07-27 11:08 ` Chris Mason
2012-07-27 18:42 ` Marc MERLIN
[not found] ` <20120801053042.GG12695@merlins.org>
2012-08-01 6:01 ` Marc MERLIN
2012-08-01 6:08 ` Fajar A. Nugraha
2012-08-01 6:21 ` Marc MERLIN
2012-08-01 21:57 ` Martin Steigerwald
2012-08-02 5:07 ` Marc MERLIN
2012-08-02 11:18 ` Martin Steigerwald
2012-08-02 17:39 ` Marc MERLIN
2012-08-02 20:20 ` Martin Steigerwald
2012-08-02 20:44 ` Marc MERLIN
2012-08-02 21:21 ` Martin Steigerwald
2012-08-02 21:49 ` Marc MERLIN
2012-08-03 18:45 ` Martin Steigerwald
2012-08-16 7:45 ` Marc MERLIN
2012-08-02 11:25 ` Martin Steigerwald
2012-08-01 6:36 ` Chris Samuel
2012-08-01 6:40 ` Marc MERLIN