linux-ext4.vger.kernel.org archive mirror
* Ext4 slow on links
@ 2012-06-20  0:20 Norbert Preining
  2012-06-20  2:19 ` Ted Ts'o
  2012-06-20  3:15 ` Eric Sandeen
  0 siblings, 2 replies; 16+ messages in thread
From: Norbert Preining @ 2012-06-20  0:20 UTC (permalink / raw)
  To: linux-ext4

Dear all

(please Cc)

I recently had to track down a big delay in one of my Debian packages,
and it turned out that it seems to be due to ext4 being *horribly*
slow at dealing with symlinks.

On my system, if I create a directory with 8000 symlinks (that is
a real case of a font package shipping special encoded files) and
the symlink targets are "far away" (long names), then, after
a reboot, a simple
	ls -l
in this directory took 1m20s, while on the second run it is down to 2s
(nice caching).

I read in the ext4 design document that if the symlink target is
less than 66 (?) chars long, then it is saved right in the inode,
otherwise some other action has to be taken.

Now my questions are:
- is this to be expected and not to be avoided?
- do you have a way around it?
- do other file systems, esp ext2/ext3 behave differently in this respect?

Finally the specs: kernel 3.5.0-rc3 (but was the same with 3.4.0 and
before), mount options rw,noatime,errors=remount-ro,user_xattr

tune2fs -l output:
tune2fs 1.42.4 (12-Jun-2012)
Filesystem volume name:   <none>
Last mounted on:          /
Filesystem UUID:          961635f4-762d-4136-a3d5-35fca8e4f3d8
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent sparse_super large_file uninit_bg
Filesystem flags:         signed_directory_hash 
Default mount options:    journal_data_writeback
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              46333952
Block count:              185335808
Reserved block count:     9266789
Free blocks:              104044481
Free inodes:              41749891
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      979
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Filesystem created:       Sun Nov 15 15:09:13 2009
Last mount time:          Tue Jun 19 15:15:48 2012
Last write time:          Tue May 29 07:17:52 2012
Mount count:              34
Maximum mount count:      50
Last checked:             Tue May 29 07:17:52 2012
Check interval:           15552000 (6 months)
Next check after:         Sun Nov 25 07:17:52 2012
Lifetime writes:          2151 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:	          256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
First orphan inode:       13246498
Default directory hash:   half_md4
Directory Hash Seed:      87ea85d5-2287-4211-a920-f793468c22c1
Journal backup:           inode blocks


Anything else I can provide?

Best wishes

Norbert
------------------------------------------------------------------------
Norbert Preining            preining@{jaist.ac.jp, logic.at, debian.org}
JAIST, Japan                                 TeX Live & Debian Developer
DSA: 0x09C5B094   fp: 14DF 2E6C 0307 BE6D AD76  A9C0 D2BF 4AA3 09C5 B094
------------------------------------------------------------------------
BALDOCK
The sharp prong on the top of a tree stump where the tree has snapped
off before being completely sawn through.
			--- Douglas Adams, The Meaning of Liff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Ext4 slow on links
  2012-06-20  0:20 Ext4 slow on links Norbert Preining
@ 2012-06-20  2:19 ` Ted Ts'o
  2012-06-20  3:38   ` Norbert Preining
  2012-06-20  3:15 ` Eric Sandeen
  1 sibling, 1 reply; 16+ messages in thread
From: Ted Ts'o @ 2012-06-20  2:19 UTC (permalink / raw)
  To: Norbert Preining; +Cc: linux-ext4

On Wed, Jun 20, 2012 at 09:20:14AM +0900, Norbert Preining wrote:
> Dear all
> 
> (please Cc)
> 
> I recently had to track down a big delay in one of my Debian packages,
> and it turned out that it seems to be due to ext4 being *horribly*
> slow on dealing with symlinks.
> 
> On my system, if I create a directory with 8000 symlinks (that is
> a real case of a font package shipping special encoded files) and
> the symlink targets are "far away" (long names), then, after 
> a reboot a simply
> 	ls -l
> in this directory took 1m20sec. While on second run it is down to 2secs
> (nice caching).
> 
> I read in the ext4 design document that if the symlink target is
> less then 66 (?) chars long, then it is saved right in the inode,
> otherwise some other action has to be taken.

The inode has room for 60 characters; after that, the symlink target
gets stored in an external block.  The seek to read in the symlink
target could be one of the causes of the delay.  The other is
potentially reading in the inode that the symlink points to.  Both of
these will take disk time in a cold cache situation.
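
A minimal sketch of how one might count which symlinks in a directory
have targets too long to be stored in the inode. It assumes the 60-byte
limit quoted above; the helper name split_symlinks_by_target_length and
the exact boundary (whether a target of exactly 60 bytes still fits) are
assumptions for illustration, not taken from the ext4 source.

```python
import os

# Limit quoted in the thread; the exact boundary is an assumption.
INLINE_LIMIT = 60


def split_symlinks_by_target_length(directory, limit=INLINE_LIMIT):
    """Return (short, long) lists of symlink names found in `directory`.

    A link is counted as "long" when its target is at least `limit`
    bytes, i.e. when it would presumably need an external block.
    Entries that are not symlinks are ignored.
    """
    short, long_ = [], []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_symlink():
                continue
            # Measure the target in bytes, as it is stored on disk.
            target = os.fsencode(os.readlink(entry.path))
            if len(target) >= limit:
                long_.append(entry.name)
            else:
                short.append(entry.name)
    return sorted(short), sorted(long_)
```

Running this on the font directory would show how many of the 8000
links fall into the "long" group and therefore cost an extra block read
each on a cold cache.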

> Now my questions are:
> - is this to be expected and not to be avoided?
> - do you have a way around it?
> - do other file systems, esp ext2/ext3 behave differently in this respect?

Nothing has changed between ext2/ext3 and ext4 here, so ext2/ext3
will behave exactly the same.  There are changes in the block and
inode allocation algorithms which might make a minor difference, but
the same is potentially true of a very fragmented file system.

There is a relatively new feature, which is not yet merged into ext4
mainline, called the inline data patch set, which could potentially
allow you to store more than 60 characters in a symlink in large
inodes.  This could potentially help, but as a feature it will be a
while before it's ready (it definitely won't make the upcoming Debian
stable freeze) --- and so most of your Debian users won't be able to
take advantage of it for quite a while.

Otherwise, there's not much we can do about this, unfortunately.  The
cold cache case is always a hard one, and the simplest ways of
optimizing it would involve changing how the application is storing
its files.  In general, trying to use a file system as a poor man's
database is a bad idea, and will only end in tears, and it sounds like
this is what you're running into in terms of very long file names to
symlinks in a font directory.

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Ext4 slow on links
  2012-06-20  0:20 Ext4 slow on links Norbert Preining
  2012-06-20  2:19 ` Ted Ts'o
@ 2012-06-20  3:15 ` Eric Sandeen
  1 sibling, 0 replies; 16+ messages in thread
From: Eric Sandeen @ 2012-06-20  3:15 UTC (permalink / raw)
  To: Norbert Preining; +Cc: linux-ext4

On 6/19/12 7:20 PM, Norbert Preining wrote:
> Dear all
> 
> (please Cc)
> 
> I recently had to track down a big delay in one of my Debian packages,
> and it turned out that it seems to be due to ext4 being *horribly*
> slow on dealing with symlinks.
> 
> On my system, if I create a directory with 8000 symlinks (that is
> a real case of a font package shipping special encoded files) and
> the symlink targets are "far away" (long names), then, after 
> a reboot a simply
> 	ls -l
> in this directory took 1m20sec. While on second run it is down to 2secs
> (nice caching).

As Ted said, the targets might be far-flung.  If you do /bin/ls -l instead
of maybe an aliased ls which stats everything to make pretty colors,
is that faster?

-Eric

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Ext4 slow on links
  2012-06-20  2:19 ` Ted Ts'o
@ 2012-06-20  3:38   ` Norbert Preining
  2012-06-20  3:57     ` Eric Sandeen
  0 siblings, 1 reply; 16+ messages in thread
From: Norbert Preining @ 2012-06-20  3:38 UTC (permalink / raw)
  To: Ted Ts'o, Eric Sandeen; +Cc: linux-ext4

Hi Ted, hi Eric,

thanks for the answers, here are some remarks.

On Di, 19 Jun 2012, Ted Ts'o wrote:
> The inode has room for 60 characters; after that, the symlink target
> gets stored in an external block.  The seek to read in the symlink
> target could be one of the causes of the delay.  The other is

Ok.

> Nothing has changed here between ext2/ext3 and ext4 here, so ext2/ext3
> will behave exactly the same.  There are changes in the block and
> inode allocation algorithms which might make a minor difference, but
> the same is potentially true of a very fragmented file system.

Ok.

Thinking about that, even if I dereference the files, I am still a
bit surprised. For each file we have the following steps:
1- read the inode and determine if it is a link
2- check if the link target fits in the 60 chars
3- read the additional block for a long link target
4- read the target inode

I assume that items 1, 3, and 4 are the time-consuming ones and
take about the same time each.

Now what I don't understand is why doing a
	time ls -l >/dev/null
on the directory with the original files takes 1.2s,
but reading the links with ls -l >/dev/null takes 1m13s, both
after a reboot with a cold cache.

I assume that some data is hashed in the directory inode, so doing
ls -l on the real files only reads the directory inode and not
each file individually, while reading all the links reads all the
individual files.

Is this the explanation? If not, I cannot imagine any way that reading
a list of links and dereferencing them plus reading the targets
takes 60 times as long.

On Di, 19 Jun 2012, Eric Sandeen wrote:
> As Ted said, the targets might be far-flung.  If you do /bin/ls -l instead
> of maybe an aliased ls which stats everything to make pretty colors,
> is that faster?

Might be the problem, but I saw the same with a program doing
opendir, readdir, etc., so no alias or external program involved.

Best wishes

Norbert

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Ext4 slow on links
  2012-06-20  3:38   ` Norbert Preining
@ 2012-06-20  3:57     ` Eric Sandeen
  2012-06-20  4:01       ` Norbert Preining
                         ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Eric Sandeen @ 2012-06-20  3:57 UTC (permalink / raw)
  To: Norbert Preining; +Cc: Ted Ts'o, Eric Sandeen, linux-ext4@vger.kernel.org

On Jun 19, 2012, at 10:38 PM, Norbert Preining <preining@logic.at> wrote:

> Hi Ted, hi Eric,
> 
> thanks for the answers, here some remarks.
> 
...

> On Di, 19 Jun 2012, Eric Sandeen wrote:
>> As Ted said, the targets might be far-flung.  If you do /bin/ls -l instead
>> of maybe an aliased ls which stats everything to make pretty colors,
>> is that faster?
> 
> Might be the problem, but I saw the same with a program doing
> opendir readdir etc, so no allias or external program involved.
> 
Of course ls -l must stat anyway.  I shouldn't compose emails so late.  :). 

You might see if the dir itself is badly fragmented (if not filefrag, stat in debugfs would show you block mapping) and maybe a blktrace of the actions would show you something interesting as well.

Eric

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Ext4 slow on links
  2012-06-20  3:57     ` Eric Sandeen
@ 2012-06-20  4:01       ` Norbert Preining
  2012-06-20  5:18       ` Norbert Preining
  2012-06-20 19:35       ` Eric Sandeen
  2 siblings, 0 replies; 16+ messages in thread
From: Norbert Preining @ 2012-06-20  4:01 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Ted Ts'o, Eric Sandeen, linux-ext4@vger.kernel.org

Hi Eric,

On Di, 19 Jun 2012, Eric Sandeen wrote:
> You might see if the dir itself is badly fragmented

I don't think so, since I did:
	for i in /usr/local/font-collection/......./* ; do ln -s $i . ; done

	reboot

	time ls -l >/dev/null

Newly created entries shouldn't be fragmented, I guess.

> (if not filefrag, stat in debugfs would show you block mapping) and 
> maybe a blktrace of the actions would show you something interesting as well.

I will investigate, but this is the first time I have looked into that, so
if you have a link to a quick howto, great; otherwise I will read through
the man pages ;-)

Best wishes

Norbert

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Ext4 slow on links
  2012-06-20  3:57     ` Eric Sandeen
  2012-06-20  4:01       ` Norbert Preining
@ 2012-06-20  5:18       ` Norbert Preining
  2012-06-20 14:07         ` Eric Sandeen
  2012-06-20 19:35       ` Eric Sandeen
  2 siblings, 1 reply; 16+ messages in thread
From: Norbert Preining @ 2012-06-20  5:18 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Ted Ts'o, Eric Sandeen, linux-ext4@vger.kernel.org

Hi Eric,

On Di, 19 Jun 2012, Eric Sandeen wrote:
> blktrace of the actions would show you something interesting as well.

I tried to understand the output, but didn't find anything in it
that tells me something.

I rebooted into single user mode, started blktrace on sda, then ran
time ls -l /..../dir/with/links/ >/dev/null, and stopped the blktrace.

Then I ran blkparse, btt, etc. to generate a variety of data.
Here is some of the output of the btt run:
==================== All Devices ====================

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

Q2Q               0.000002654   0.009716548   6.594440648        8953
Q2G               0.000001047   0.000001825   0.000011594        8913
G2I               0.000000908   0.000001561   0.000234317        8913
Q2M               0.000000908   0.000001046   0.000001536          41
I2D               0.000005378   0.001040528   1.572900300        8913
M2D               0.000014527   0.038626776   1.572908333          41
D2C               0.000098337   0.009050273   0.053790446        8954
Q2C               0.000108394   0.010266282   1.577191424        8954

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8,  0) |   0.0177%   0.0151%   0.0000%  10.0890%  88.1553%
---------- | --------- --------- --------- --------- ---------
   Overall |   0.0177%   0.0151%   0.0000%  10.0890%  88.1553%

==================== Device Merge Information ====================

       DEV |       #Q       #D   Ratio |   BLKmin   BLKavg   BLKmax    Total
---------- | -------- -------- ------- | -------- -------- -------- --------
 (  8,  0) |     8954     8913     1.0 |        8        9     1024    86872

==================== Device Q2Q Seek Information ====================

       DEV |          NSEEKS            MEAN          MEDIAN | MODE           
---------- | --------------- --------------- --------------- | ---------------
 (  8,  0) |            8954     193362127.3               0 | 0(538)
---------- | --------------- --------------- --------------- | ---------------
   Overall |          NSEEKS            MEAN          MEDIAN | MODE           
   Average |            8954     193362127.3               0 | 0(538)

==================== Device D2D Seek Information ====================

       DEV |          NSEEKS            MEAN          MEDIAN | MODE           
---------- | --------------- --------------- --------------- | ---------------
 (  8,  0) |            8913     194044831.4               0 | 0(497)
---------- | --------------- --------------- --------------- | ---------------
   Overall |          NSEEKS            MEAN          MEDIAN | MODE           
   Average |            8913     194044831.4               0 | 0(497)

==================== Plug Information ====================

       DEV |    # Plugs # Timer Us  | % Time Q Plugged
---------- | ---------- ----------  | ----------------
 (  8,  0) |         81(         0) |   0.002663579%

       DEV |    IOs/Unp   IOs/Unp(to)
---------- | ----------   ----------
 (  8,  0) |        1.2          0.0
---------- | ----------   ----------
   Overall |    IOs/Unp   IOs/Unp(to)
   Average |        1.2          0.0

==================== Active Requests At Q Information ====================

       DEV |  Avg Reqs @ Q
---------- | -------------
 (  8,  0) |           0.1

.....


I don't know if that shows anything of interest, but if you need more,
and want to waste a bit of time looking at the data, I have uploaded
everything that was generated to
	http://www.logic.at/people/preining/BlkParse.tar.gz
size 5090853
md5sum 46db8455a04dcc4a602e34d21eecc6bd

In any case, thanks for your patience and support

Norbert


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Ext4 slow on links
  2012-06-20  5:18       ` Norbert Preining
@ 2012-06-20 14:07         ` Eric Sandeen
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Sandeen @ 2012-06-20 14:07 UTC (permalink / raw)
  To: Norbert Preining; +Cc: Ted Ts'o, linux-ext4@vger.kernel.org

On 6/20/12 12:18 AM, Norbert Preining wrote:
> Hi Eric,
> 
> On Di, 19 Jun 2012, Eric Sandeen wrote:
>> blktrace of the actions would show you something interesting as well.
> 
> I tried to understand the output, but didn't get any information
> that tells me something.
> 
> I rebooted into single user mode, started blktrace on sda, then run
> time ls -l /..../dir/with/links/ >/dev/null, stopped the blktrace.
> 
> Then I run blkparse and btt etc to generate a variety of data.

...

> 
> I don't know if that shows anything of interest, but if you need more,
> and want to waste a bit of time looking at the data, I have uploaded
> everything created into
> 	http://www.logic.at/people/preining/BlkParse.tar.gz

Here are the overall stats:

Total (sda):
 Reads Queued:       8,864,   35,456KiB	 Writes Queued:          90,    7,980KiB
 Read Dispatches:    8,864,   35,456KiB	 Write Dispatches:       49,    7,980KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:    8,864,   35,456KiB	 Writes Completed:       59,    7,980KiB
 Read Merges:            0,        0KiB	 Write Merges:           41,      164KiB
 IO unplugs:            81        	 Timer unplugs:           0

so almost all reads, and no read merges; almost 35 megabytes read and every
one was a small 4k IO.

It's doing about 120 seeks/second.  I'm a little surprised that there was no read
merging...
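
The figures can be cross-checked with a little arithmetic. The sketch
below takes the read totals from the blkparse summary above and, as an
assumption, the 1m13s (73 s) wall-clock time reported earlier in the
thread for the cold-cache ls -l; the traced run itself was not timed in
this thread, so the rate is only approximate.

```python
# Totals from the blkparse summary above.
reads_completed = 8864
read_kib = 35456

# Assumption: wall-clock time of the cold-cache `ls -l` (1m13s), taken
# from an earlier message; the traced run itself was not timed.
elapsed_seconds = 73

kib_per_read = read_kib / reads_completed
reads_per_second = reads_completed / elapsed_seconds

print(f"{kib_per_read:.1f} KiB per read")
print(f"{reads_per_second:.0f} reads per second")
```

Under that assumption every read is a single 4 KiB block and the rate
comes out at roughly 120 reads per second, which is what one would
expect from a rotating disk doing one seek per read.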

Let me think about this. :)

-Eric

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Ext4 slow on links
  2012-06-20  3:57     ` Eric Sandeen
  2012-06-20  4:01       ` Norbert Preining
  2012-06-20  5:18       ` Norbert Preining
@ 2012-06-20 19:35       ` Eric Sandeen
  2012-06-21  2:28         ` Norbert Preining
  2 siblings, 1 reply; 16+ messages in thread
From: Eric Sandeen @ 2012-06-20 19:35 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Norbert Preining, Ted Ts'o, linux-ext4@vger.kernel.org

On 6/19/12 10:57 PM, Eric Sandeen wrote:
> On Jun 19, 2012, at 10:38 PM, Norbert Preining <preining@logic.at> wrote:
> 
>> Hi Ted, hi Eric,
>>
>> thanks for the answers, here some remarks.
>>
> ...
> 
>> On Di, 19 Jun 2012, Eric Sandeen wrote:
>>> As Ted said, the targets might be far-flung.  If you do /bin/ls -l instead
>>> of maybe an aliased ls which stats everything to make pretty colors,
>>> is that faster?
>>
>> Might be the problem, but I saw the same with a program doing
>> opendir readdir etc, so no allias or external program involved.
>>
> Of course ls -l must stat anyway.  I shouldn't compose emails so late.  :). 

Oh, but Zach Brown reminds me that if we stat the entries in getdents/hash
order, it's roughly random w.r.t. disk location.  Newer utils will sort into
inode order, I think(?)  Might be interesting to strace the ls -l and see
if it's doing it in inode order, or not.
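
A minimal sketch of the inode-order idea, written in Python rather than
as a patch to any real utility. The function name lstat_in_inode_order
is hypothetical; it only relies on os.scandir, os.lstat and os.readlink
from the standard library, and it assumes that inode number is a
reasonable proxy for on-disk location, as suggested above.

```python
import os
import stat


def lstat_in_inode_order(directory):
    """lstat every entry of `directory`, visiting entries sorted by inode.

    Returns a list of (name, inode, link_target_or_None) tuples in the
    order in which the entries were visited.
    """
    with os.scandir(directory) as entries:
        # entry.inode() comes from the directory entry itself (d_ino),
        # so collecting and sorting needs no per-file disk access.
        ordered = sorted(entries, key=lambda e: e.inode())

    results = []
    for entry in ordered:
        st = os.lstat(entry.path)  # one inode read per entry
        target = None
        if stat.S_ISLNK(st.st_mode):
            target = os.readlink(entry.path)  # may need an extra block
        results.append((entry.name, entry.inode(), target))
    return results
```

The whole directory is read first and only then stat'ed, so the cost is
the memory for one list of entries in exchange for mostly ascending
disk accesses.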

-Eric


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Ext4 slow on links
  2012-06-20 19:35       ` Eric Sandeen
@ 2012-06-21  2:28         ` Norbert Preining
  2012-06-21  4:05           ` Eric Sandeen
  0 siblings, 1 reply; 16+ messages in thread
From: Norbert Preining @ 2012-06-21  2:28 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Ted Ts'o, linux-ext4@vger.kernel.org, Eric Sandeen

Hi Eric,

thanks a lot for looking into that.

On Mi, 20 Jun 2012, Eric Sandeen wrote:
> so almost all reads, and no read merges; almost 35 megabytes read and every
> one was a small 4k IO.

Ouch, that hurts.

On Mi, 20 Jun 2012, Eric Sandeen wrote:
> Would you be willing to provide an "e2image -r" image of the filesystem?

OK, it has been running for a few hours now and I guess it is far from
finished, since there are 350+G on the fs, and the compressed image
is 200M by now.

Is it fine to do it on a running system, or do I have to boot
from USB or so?

If it is not too big I will try to upload it to some place where
you can access it.

On Mi, 20 Jun 2012, Eric Sandeen wrote:
> Oh, but Zach Brown reminds me that if we stat the entries in getdents/hash
> order, it's roughly random w.r.t. disk location.  Newer utils will sort into
> inode order, I think(?)  Might be interesting to strace the ls -l and see
> if it's doing it in inode order, or not.

Ok, is there a special option to strace, or -trace=all?

Best wishes

Norbert

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Ext4 slow on links
  2012-06-21  2:28         ` Norbert Preining
@ 2012-06-21  4:05           ` Eric Sandeen
  2012-06-21  4:50             ` Norbert Preining
  2012-06-22  9:53             ` Bernd Schubert
  0 siblings, 2 replies; 16+ messages in thread
From: Eric Sandeen @ 2012-06-21  4:05 UTC (permalink / raw)
  To: Norbert Preining; +Cc: Ted Ts'o, linux-ext4@vger.kernel.org

On 6/20/12 9:28 PM, Norbert Preining wrote:
> Hi Eric,
> 
> thanks a lot for looking into that.
> 
> On Mi, 20 Jun 2012, Eric Sandeen wrote:
>> so almost all reads, and no read merges; almost 35 megabytes read and every
>> one was a small 4k IO.
> 
> Ouch, that hurts.
> 
> On Mi, 20 Jun 2012, Eric Sandeen wrote:
>> Would you be willing to provide an "e2image -r" image of the filesystem?
> 
> Ok, it is running now since a few hours and I am far from finished
> I guess, since there are 350+G on the fs, and the compressed image
> is by now 200M.
> 
> Is it fine to do it on a running system, or do I have to boot
> from USB or so?

Well, don't bother, sorry.  See below.  Zach had it right.

> If it is not toooo big I will tr to upload it to some place were
> you can get access to.
> 
> On Mi, 20 Jun 2012, Eric Sandeen wrote:
>> Oh, but Zach Brown reminds me that if we stat the entries in getdents/hash
>> order, it's roughly random w.r.t. disk location.  Newer utils will sort into
>> inode order, I think(?)  Might be interesting to strace the ls -l and see
>> if it's doing it in inode order, or not.
> 
> Ok, is there a special option to strace, or -trace=all?

if you do 

# strace -v -o outfile ls -l 

you'll see things like:

getdents(3, {{d_ino=249052, d_off=186216735, d_reclen=32, d_name="file3"} {d_ino=245882, d_off=473549160, d_reclen=24, d_name="."} {d_ino=249051, d_off=516459536, d_reclen=32, d_name="file2"} {d_ino=249055, d_off=545762253, d_reclen=32, d_name="file6"} {d_ino=249049, d_off=550416647, d_reclen=32, d_name="file1"} ...

and from there see that the entries returned are not in inode order (and therefore not in disk order).

and lstats after that, also out of order:

# grep lstat outfile
lstat("file3", {st_dev=makedev(8, 8), st_ino=249052, st_mode=S_IFLNK|0777, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=13, st_atime=2012/06/20-22:13:08, st_mtime=2012/06/20-22:13:07, st_ctime=2012/06/20-22:13:07}) = 0
lstat("file2", {st_dev=makedev(8, 8), st_ino=249051, st_mode=S_IFLNK|0777, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=13, st_atime=2012/06/20-22:13:08, st_mtime=2012/06/20-22:13:07, st_ctime=2012/06/20-22:13:07}) = 0
lstat("file6", {st_dev=makedev(8, 8), st_ino=249055, st_mode=S_IFLNK|0777, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=13, st_atime=2012/06/20-22:13:08, st_mtime=2012/06/20-22:13:07, st_ctime=2012/06/20-22:13:07}) = 0
lstat("file1", {st_dev=makedev(8, 8), st_ino=249049, st_mode=S_IFLNK|0777, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=13, st_atime=2012/06/20-22:13:08, st_mtime=2012/06/20-22:13:07, st_ctime=2012/06/20-22:13:07}) = 0
...

later on you'll see readlinks:

# grep readlink outfile
readlink("file3", "../dir2/file3", 14)  = 13
readlink("file2", "../dir2/file2", 14)  = 13
readlink("file6", "../dir2/file6", 14)  = 13
readlink("file1", "../dir2/file1", 14)  = 13
...

etc.
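
Checking a long strace log by eye is tedious, so here is a small sketch
that pulls the d_ino values out of a getdents line and reports whether
they are ascending. The helper names are hypothetical, and the sample
string is abbreviated from the getdents output quoted above (only the
d_ino and d_name fields are kept).

```python
import re


def dirent_inodes(getdents_line):
    """Extract the d_ino values, in order, from one `strace -v` getdents line."""
    return [int(m) for m in re.findall(r"d_ino=(\d+)", getdents_line)]


def is_inode_ordered(inodes):
    """Return True when the inode numbers never decrease."""
    return all(a <= b for a, b in zip(inodes, inodes[1:]))


# Abbreviated from the getdents output quoted above.
sample = (
    'getdents(3, {{d_ino=249052, d_name="file3"} {d_ino=245882, d_name="."} '
    '{d_ino=249051, d_name="file2"} {d_ino=249055, d_name="file6"} '
    '{d_ino=249049, d_name="file1"}'
)

inodes = dirent_inodes(sample)
print(inodes, is_inode_ordered(inodes))
```

For the sample above the check reports that the entries are not in
inode order, matching what the trace shows.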

Hm.  Upstream coreutils fixed this for rm and some other ops:

http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=24412edeaf556a

# grep unlink /tmp/rm-strace 
unlink("file1")                         = 0
unlink("file10")                        = 0
unlink("file2")                         = 0
unlink("file3")                         = 0
unlink("file4")                         = 0
unlink("file5")                         = 0
unlink("file6")                         = 0
unlink("file7")                         = 0
unlink("file8")                         = 0
unlink("file9")                         = 0

but maybe not for ls -l

You could see if you could get this LD_PRELOAD working:

http://git.kernel.org/?p=fs/ext2/e2fsprogs.git;a=blob_plain;f=contrib/spd_readdir.c

build & enable with:

gcc -o spd_readdir.so -fPIC -shared spd_readdir.c -ldl
export LD_PRELOAD=`pwd`/spd_readdir.so

and see if that addresses the problem; 

here, it does for me:

# grep readlink outfile2 
readlink("file1", "../dir2/file1"..., 14) = 13
readlink("file10", "../dir2/file10"..., 15) = 14
readlink("file2", "../dir2/file2"..., 14) = 13
readlink("file3", "../dir2/file3"..., 14) = 13
readlink("file4", "../dir2/file4"..., 14) = 13
readlink("file5", "../dir2/file5"..., 14) = 13

I'm guessing that operating in inode order should help
you a bit, at least.  I tested on a dir w/ 10,000 long symlinks
with and without the sorting, and you can see the difference pretty
clearly.

sorted took 2.6s, unsorted took 52s.

And you can see why:

http://people.redhat.com/esandeen/sorted_unsorted.png
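
To repeat this kind of comparison on another directory, a rough timing
harness could look like the following sketch. The function name
time_lstat_pass is hypothetical, and the numbers are only meaningful
when each pass starts with a cold cache (for example after a reboot, as
done elsewhere in this thread); on a warm cache both passes will be
fast.

```python
import os
import time


def time_lstat_pass(directory, sort_by_inode):
    """Time one lstat pass over `directory` and return elapsed seconds.

    With sort_by_inode=False the entries are visited in the order the
    directory returns them; with True they are sorted by inode first.
    """
    with os.scandir(directory) as entries:
        items = list(entries)  # order as returned by the directory
    if sort_by_inode:
        items.sort(key=lambda e: e.inode())

    start = time.perf_counter()
    for entry in items:
        os.lstat(entry.path)
    return time.perf_counter() - start
```

Only the lstat loop is timed, so the cost of reading the directory and
sorting the entries is deliberately left out of the measurement.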

meanwhile I can ask Jim about coreutils & ls -l.

-Eric

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Ext4 slow on links
  2012-06-21  4:05           ` Eric Sandeen
@ 2012-06-21  4:50             ` Norbert Preining
  2012-06-21  5:18               ` Andreas Dilger
  2012-06-22  9:53             ` Bernd Schubert
  1 sibling, 1 reply; 16+ messages in thread
From: Norbert Preining @ 2012-06-21  4:50 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Ted Ts'o, linux-ext4@vger.kernel.org

Hi Eric,

wow, thanks again.

On Mi, 20 Jun 2012, Eric Sandeen wrote:
> Hm.  Upstream coreutils fixed this for rm and some other ops:

Ok, I see.

> sorted took 2.6s, unsorted took 52s.

Got the idea, and tried it now myself, not with ls etc. but
with the program that generates the chaos, and yes, stracing it
gives the same result: the getdents and the stats that follow are all
*not* in inode order.

So that means it should be fixed in glibc? Right? Ouuchhh...

That means that this behaviour applies to *each* program using getdents
etc ...

Do you have any suggestions? Is there a way to force readdir (I guess
most people use readdir instead of getdents directly) to iterate
in inode order?



Best wishes

Norbert

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Ext4 slow on links
  2012-06-21  4:50             ` Norbert Preining
@ 2012-06-21  5:18               ` Andreas Dilger
  2012-06-21  6:55                 ` Norbert Preining
  0 siblings, 1 reply; 16+ messages in thread
From: Andreas Dilger @ 2012-06-21  5:18 UTC (permalink / raw)
  To: Norbert Preining; +Cc: Eric Sandeen, Ted Ts'o, linux-ext4@vger.kernel.org

On 2012-06-20, at 10:50 PM, Norbert Preining wrote:
> On Mi, 20 Jun 2012, Eric Sandeen wrote:
>> Hm.  Upstream coreutils fixed this for rm and some other ops:
> 
> Ok, I see.
> 
>> sorted took 2.6s, unsorted took 52s.
> 
> Got the idea, and tried it now myself, not with ls etc. but
> with the program that generates the chaos, and yes, stracing it
> gives the same result: getdents and the subsequent stats are all
> *not* in inode order.
> 
> So that means, it should be fixed in glibc? Right? Ouuchhh...
> 
> That means that this behaviour affects *each* program using getdents
> etc ...
> 
> Do you have any suggestions? Is there a way to force readdir (I guess
> most people use readdir instead of getdents directly) to iterate
> in inode order?

That's what the LD_PRELOAD library that Eric referenced does - you can
load it for any application, and it will sort the dirents in inode order.

It would definitely be better to do this in glibc, though we've also
been discussing on occasion doing this inside ext4 for small directories.
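
A minimal sketch of the inode-order idea done at application level, in
Python rather than the C of spd_readdir.c. The function name
`lstat_in_inode_order` is hypothetical. It reads all directory entries
first, sorts them by the inode number taken from the directory entry, and
only then calls lstat on each one, so that on ext3/ext4 the inode table
should be read roughly sequentially instead of in hash order.

```python
import os

def lstat_in_inode_order(path):
    """Return [(name, stat_result)] for the entries of path, lstat'ed in inode order."""
    with os.scandir(path) as it:
        # On Linux, DirEntry.inode() is taken from the dirent itself,
        # so sorting here does not trigger any extra stat calls.
        entries = sorted(it, key=lambda e: e.inode())
    # Only now touch the inodes, in ascending inode-number order.
    return [(e.name, os.lstat(e.path)) for e in entries]
```

The whole directory is held in memory before the first lstat, which is
the same trade-off the preload library makes.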

Cheers, Andreas

* Re: Ext4 slow on links
  2012-06-21  5:18               ` Andreas Dilger
@ 2012-06-21  6:55                 ` Norbert Preining
  0 siblings, 0 replies; 16+ messages in thread
From: Norbert Preining @ 2012-06-21  6:55 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Eric Sandeen, Ted Ts'o, linux-ext4@vger.kernel.org

On Mi, 20 Jun 2012, Andreas Dilger wrote:
> That's what the LD_PRELOAD library that Eric referenced does - you can
> load it for any application, and it will sort the dirents in inode order.

Yes, hmm, I tried it without success. 
I did:
	export LD_PRELOAD=/path/to/spd_readdir.so
	strace -o ... /usr/bin/texlua /usr/bin/mtxrun --generate
(the bad command) and I still see stats and getdents out of
inode order.

> It would definitely be better to do this in glibc, though we've also
> been discussing on occasion doing this inside ext4 for small directories.

I have now found the thread 
	 Large directories and poor order correlation  
from March 2011 on ext4-devel, interesting read. 

Anyway, as far as I can see I cannot do much but
	fsck -D
the filesystem and see if it gets better, right?

Best wishes

Norbert
------------------------------------------------------------------------
Norbert Preining            preining@{jaist.ac.jp, logic.at, debian.org}
JAIST, Japan                                 TeX Live & Debian Developer
DSA: 0x09C5B094   fp: 14DF 2E6C 0307 BE6D AD76  A9C0 D2BF 4AA3 09C5 B094
------------------------------------------------------------------------
MARGATE (n.)
A margate is a particular kind of commissionaire who sees you every
day and is on cheerful Christian-name terms with you, then one day
refuses to let you in because you've forgotten your identity card.
			--- Douglas Adams, The Meaning of Liff

* Re: Ext4 slow on links
  2012-06-21  4:05           ` Eric Sandeen
  2012-06-21  4:50             ` Norbert Preining
@ 2012-06-22  9:53             ` Bernd Schubert
  2012-06-22 14:08               ` Ted Ts'o
  1 sibling, 1 reply; 16+ messages in thread
From: Bernd Schubert @ 2012-06-22  9:53 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Norbert Preining, Ted Ts'o, linux-ext4@vger.kernel.org

On 06/21/2012 06:05 AM, Eric Sandeen wrote:
> On 6/20/12 9:28 PM, Norbert Preining wrote:
>> Hi Eric,
>>
>
> You could see if you could get this LD_PRELOAD working:
>
> http://git.kernel.org/?p=fs/ext2/e2fsprogs.git;a=blob_plain;f=contrib/spd_readdir.c
>

Hrmm, I need to look through that commit again, but at first glance I
cannot see code doing the sorting for ext3/ext4 only (e.g. by checking
the fsid). So while I like the general approach, it will have the 
opposite effect for some file systems. I will report that back on the 
coreutils list.
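
A hedged sketch of the kind of check meant here, in Python: sort by inode
only when the directory lives on ext2/ext3/ext4. It looks the filesystem
type up in mount-table text in the format of /proc/self/mounts, which is
an assumption of this sketch (a C implementation would more likely use
statfs). The names `fs_type_for` and `should_sort_by_inode` are
hypothetical, and octal escapes such as \040 in mount points are not
decoded.

```python
import os

EXT_TYPES = {"ext2", "ext3", "ext4"}

def fs_type_for(path, mounts_text):
    """Return the fs type of the longest mount point containing path, or None."""
    path = os.path.realpath(path)
    best_len, best_type = -1, None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        # Compare with a trailing slash so /mnt/data does not match /mnt/database.
        prefix = mount_point.rstrip("/") + "/"
        # ">=" lets a later mount on the same mount point win.
        if (path + "/").startswith(prefix) and len(mount_point) >= best_len:
            best_len, best_type = len(mount_point), fs_type
    return best_type

def should_sort_by_inode(path, mounts_file="/proc/self/mounts"):
    """True if path appears to be on an ext2/ext3/ext4 filesystem (Linux only)."""
    with open(mounts_file) as f:
        return fs_type_for(path, f.read()) in EXT_TYPES
```

An application could call `should_sort_by_inode()` once per directory and
fall back to plain readdir order everywhere else.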

Thanks a lot for the pointer to the commit!


Cheers,
Bernd


* Re: Ext4 slow on links
  2012-06-22  9:53             ` Bernd Schubert
@ 2012-06-22 14:08               ` Ted Ts'o
  0 siblings, 0 replies; 16+ messages in thread
From: Ted Ts'o @ 2012-06-22 14:08 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Eric Sandeen, Norbert Preining, linux-ext4@vger.kernel.org

On Fri, Jun 22, 2012 at 11:53:13AM +0200, Bernd Schubert wrote:
> On 06/21/2012 06:05 AM, Eric Sandeen wrote:
> >On 6/20/12 9:28 PM, Norbert Preining wrote:
> >>Hi Eric,
> >>
> >
> >You could see if you could get this LD_PRELOAD working:
> >
> >http://git.kernel.org/?p=fs/ext2/e2fsprogs.git;a=blob_plain;f=contrib/spd_readdir.c
> >
> 
> Hrmm, I need to look through that commit again, but at first
> glance I cannot see code doing the sorting for ext3/ext4 only (e.g.
> by checking the fsid). So while I like the general approach, it will
> have the opposite effect for some file systems. I will report that
> back on the coreutils list.
> 
> Thanks a lot for the pointer to the commit!

One warning about spd_readdir.  It's not thread-safe, and I've noted
that some programs crash when they try using spd_readdir.so as a
pre-load.  I've tried to fix some of the causes, and I think
thread-safety is the primary fix which is missing, but it's possible
that programs which really care about telldir()/seekdir() behaviour as
it relates to readdir() and when files are added to a directory may
also end up getting surprised.

I wrote it primarily as a demonstration of how sorting by inode number
is a big win.  It is *not* suitable for use in /etc/ld.so.preload!

If people want to try to make it safer, patches are accepted, but
ultimately it's better to fix this in the application; that way you
will get your performance gains no matter what OS you happen to be
running on, whether it's Linux, Solaris, AIX, OS X, etc.

     	       	     	   	    	- Ted

end of thread, other threads:[~2012-06-22 14:08 UTC | newest]

Thread overview: 16+ messages
2012-06-20  0:20 Ext4 slow on links Norbert Preining
2012-06-20  2:19 ` Ted Ts'o
2012-06-20  3:38   ` Norbert Preining
2012-06-20  3:57     ` Eric Sandeen
2012-06-20  4:01       ` Norbert Preining
2012-06-20  5:18       ` Norbert Preining
2012-06-20 14:07         ` Eric Sandeen
2012-06-20 19:35       ` Eric Sandeen
2012-06-21  2:28         ` Norbert Preining
2012-06-21  4:05           ` Eric Sandeen
2012-06-21  4:50             ` Norbert Preining
2012-06-21  5:18               ` Andreas Dilger
2012-06-21  6:55                 ` Norbert Preining
2012-06-22  9:53             ` Bernd Schubert
2012-06-22 14:08               ` Ted Ts'o
2012-06-20  3:15 ` Eric Sandeen
