linux-ext4.vger.kernel.org archive mirror
* Re: suspiciously good fsck times?
       [not found]       ` <20080710172117.GE10402@mit.edu>
@ 2008-07-10 17:27         ` Ric Wheeler
  0 siblings, 0 replies; 9+ messages in thread
From: Ric Wheeler @ 2008-07-10 17:27 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Eric Sandeen, linux-ext4

Theodore Tso wrote:
> On Thu, Jul 10, 2008 at 11:14:28AM -0500, Eric Sandeen wrote:
>   
>> Val & I talked about this a little, and came to the conclusion that
>> directory fragmentation might be a pretty big part of it.
>>     
>
> Hmm, could be.  Let's see.  Ric said 46.5 million files, I don't know
> how big the filenames were, but let's assume a directory entry size of
> 32, so that means if we assume perfect packing, 128 directory entries
> per 4k block.  Let's use 100 directory entries/block just to make the
> math easier, so that's 465,000 blocks.  If we assume a 10ms seek time,
> and that the blocks are totally scattered, that's 4650 seconds, or
> 1.29 hours. So that's roughly within the ballpark that Ric measured.
>
>      	       	      	      	     	 	  - Ted
>   

(changing cc to the real list instead of ext4 owner - sorry!)

The file names are 40 bytes long (6 initial bytes of time stamp with 24
random bytes at end of name). For example:

451aeb61ead89~~~DYASX8LYL4NAUWK3WI187VRP

The 4 threads chose the target subdirectory based on the time stamp, 
rotating into a new subdirectory every 3 minutes or so.

ric
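
With the 40-byte names given above, the 32-byte entry size assumed in Ted's estimate can be refined. The sketch below is a rough lower bound only: it assumes the classic ext2/3/4 on-disk directory entry (an 8-byte header followed by the name, padded to a multiple of 4 bytes) and perfectly packed 4k blocks, and it ignores htree index blocks and partially filled leaf blocks. The function names are made up for illustration.

```python
import math

# Assumed ext2/3/4 directory entry header:
# inode (4 bytes) + rec_len (2) + name_len (1) + file_type (1)
DIRENT_HEADER = 8

def dirent_size(name_len):
    """Bytes used by one directory entry, rounded up to a multiple of 4."""
    return (DIRENT_HEADER + name_len + 3) & ~3

def min_dir_blocks(n_files, name_len, block_size=4096):
    """Lower bound on directory blocks, assuming perfect packing."""
    per_block = block_size // dirent_size(name_len)
    return math.ceil(n_files / per_block)

print(dirent_size(40))                  # 48 bytes per entry
print(4096 // dirent_size(40))          # 85 entries per 4k block
print(min_dir_blocks(45_600_000, 40))   # 536471 blocks
```

So the real figure is probably somewhat above the 465,000 blocks used in the estimate. The e2fsck run posted in this thread read 3023MB in pass 2, about 774,000 4k blocks, which suggests the directories are less densely packed than this bound.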





* [ricwheeler@gmail.com: suspiciously good fsck times?]
@ 2008-07-10 17:28 Theodore Tso
  2008-07-10 17:53 ` suspiciously good fsck times? Theodore Tso
  0 siblings, 1 reply; 9+ messages in thread
From: Theodore Tso @ 2008-07-10 17:28 UTC (permalink / raw)
  To: linux-ext4

[-- Attachment #1: Type: text/plain, Size: 95 bytes --]

Transferring this thread to the linux-ext4 list instead of
linux-ext4-owner.  :-)

						- Ted

[-- Attachment #2: Type: message/rfc822, Size: 4488 bytes --]

From: Ric Wheeler <ricwheeler@gmail.com>
To: linux-ext4-owner@vger.kernel.org, Theodore Tso <tytso@mit.edu>
Subject: suspiciously good fsck times?
Date: Thu, 10 Jul 2008 08:36:42 -0400
Message-ID: <4876025A.80909@gmail.com>


Just to be mean, I have been trying to test the fsck speed of ext4 with 
lots of small files.  The test I ran uses fs_mark to fill a 1TB Seagate 
drive with 45.6 million 20k files (distributed between 256 subdirectories).

Running on ext3, "fsck -f" takes about one hour.

Running on ext4, with uninit_bg, the same fsck is finished in a bit over 
5 minutes - more than 10x faster.  (Without uninit_bg, the fsck takes 
about 10 minutes).

Is this too good to be true? Below is the fsck run itself, the tree is 
Ted's latest git tree and his 1.41 WIP tools,

ric


[root@localhost Perf]# time /sbin/fsck.ext4 -t -t -f /dev/sdb1
e4fsck 1.41-WIP (07-Jul-2008)
Pass 1: Checking inodes, blocks, and sizes
Pass 1: Memory used: 40632k/69424k (36424k/4209k), time: 204.95/78.22/25.58
Pass 1: I/O read: 11140MB, write: 0MB, rate: 54.35MB/s
Pass 2: Checking directory structure
Pass 2: Memory used: 70184k/61968k (51803k/18382k), time: 76.47/50.27/ 8.77
Pass 2: I/O read: 3023MB, write: 0MB, rate: 39.53MB/s
Pass 3: Checking directory connectivity
Peak memory: Memory used: 70184k/61968k (59256k/10929k), time: 281.72/128.59/34.35
Pass 3A: Memory used: 70184k/61968k (59256k/10929k), time:  0.00/ 0.00/ 0.00
Pass 3A: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 3: Memory used: 70184k/61968k (51803k/18382k), time:  0.03/ 0.00/ 0.00
Pass 3: I/O read: 1MB, write: 0MB, rate: 37.86MB/s
Pass 4: Checking reference counts
Pass 4: Memory used: 70184k/44968k (27354k/42831k), time:  2.37/ 2.36/ 0.00
Pass 4: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 5: Checking group summary information
Pass 5: Memory used: 70184k/240k (64619k/5566k), time: 19.40/ 5.52/ 0.29
Pass 5: I/O read: 34MB, write: 0MB, rate: 1.75MB/s
/dev/sdb1: 45600268/61054976 files (0.0% non-contiguous), 232657574/244190000 blocks
Memory used: 70184k/240k (64889k/5296k), time: 303.54/136.48/34.65
I/O read: 14198MB, write: 1MB, rate: 46.77MB/s

real    5m3.993s
user    2m16.477s
sys     0m35.041s

[-- Attachment #3: Type: message/rfc822, Size: 3307 bytes --]

From: Theodore Tso <tytso@MIT.EDU>
To: Ric Wheeler <ricwheeler@gmail.com>
Cc: linux-ext4-owner@vger.kernel.org
Subject: Re: suspiciously good fsck times?
Date: Thu, 10 Jul 2008 11:18:22 -0400
Message-ID: <20080710151822.GA25939@mit.edu>

On Thu, Jul 10, 2008 at 08:36:42AM -0400, Ric Wheeler wrote:
>
> Just to be mean, I have been trying to test the fsck speed of ext4 with  
> lots of small files.  The test I ran uses fs_mark to fill a 1TB Seagate  
> drive with 45.6 million 20k files (distributed between 256 
> subdirectories).
>
> Running on ext3, "fsck -f" takes about one hour.
>
> Running on ext4, with uninit_bg, the same fsck is finished in a bit over  
> 5 minutes - more than 10x faster.  (Without uninit_bg, the fsck takes  
> about 10 minutes).
>
> Is this too good to be true? Below is the fsck run itself, the tree is  
> Ted's latest git tree and his 1.41 WIP tools,

Wow.  My guess is that flex_bg is making the difference.  What we
would want to compare is the I/O read statistics line:

> I/O read: 14198MB, write: 1MB, rate: 46.77MB/s

That's pretty good, and indicates we've avoided a *lot* of seeking.
The e2fsck -t -t output for ext3 should show roughly the same amount of
I/O read (with 20k files, there would be no advantage towards using
extents), but the I/O rate is probably *much* lower, indicating a lot
more seeking is going on.

Can you send the full e2fsck -t -t output of the ext3 run?  And what
is the hdparm -t -t results of the disk?

If I'm right, if you create the filesystem with mke2fs -t ext4dev -O
^flex_bg,^uninit_bg, you should see performance back to the old ext3
levels.

							- Ted

P.S.  We probably do want to examine the block allocation layout with
flex_bg to make sure that the filesystem ages well in the long term.

[-- Attachment #4: Type: message/rfc822, Size: 4425 bytes --]

From: Ric Wheeler <rwheeler@redhat.com>
To: Theodore Tso <tytso@mit.edu>
Cc: linux-ext4-owner@vger.kernel.org, Eric Sandeen <sandeen@redhat.com>
Subject: Re: suspiciously good fsck times?
Date: Thu, 10 Jul 2008 11:49:51 -0400
Message-ID: <48762F9F.5070308@redhat.com>

Theodore Tso wrote:
> On Thu, Jul 10, 2008 at 08:36:42AM -0400, Ric Wheeler wrote:
>   
>> Just to be mean, I have been trying to test the fsck speed of ext4 with  
>> lots of small files.  The test I ran uses fs_mark to fill a 1TB Seagate  
>> drive with 45.6 million 20k files (distributed between 256 
>> subdirectories).
>>
>> Running on ext3, "fsck -f" takes about one hour.
>>
>> Running on ext4, with uninit_bg, the same fsck is finished in a bit over  
>> 5 minutes - more than 10x faster.  (Without uninit_bg, the fsck takes  
>> about 10 minutes).
>>
>> Is this too good to be true? Below is the fsck run itself, the tree is  
>> Ted's latest git tree and his 1.41 WIP tools,
>>     
>
> Wow.  My guess is that flex_bg is making the difference.  What we
> would want to compare is the I/O read statistics line:
>
>   
>> I/O read: 14198MB, write: 1MB, rate: 46.77MB/s
>>     
>
> That's pretty good, and indicates we've avoided a *lot* of seeking.
> The e2fsck -t -t output for ext3 should show roughly the same amount of
> I/O read (with 20k files, there would be no advantage towards using
> extents), but the I/O rate is probably *much* lower, indicating a lot
> more seeking is going on.
>   
We did run fsck through seekwatcher & saw a significant reduction in 
seeks/sec for ext4. Eric has the pretty pictures that he can share.

> Can you send the full e2fsck -t -t output of the ext3 run?  And what
> is the hdparm -t -t results of the disk?
>   

I didn't run the ext3 test with -t -t (but can refill and rerun, takes 
about 12 hours).

This disk is a relatively new Seagate 1TB drive, specs at:

http://www.seagate.com/ww/v/index.jsp?vgnextoid=0732f141e7f43110VgnVCM100000f5ee0a0aRCRD

hdparm test:

[root@localhost rwheeler]# /sbin/hdparm -t -t /dev/sdb

/dev/sdb:
 Timing buffered disk reads:  186 MB in  3.03 seconds =  61.33 MB/sec



> If I'm right, if you create the filesystem with mke2fs -t ext4dev -O
> ^flex_bg,^uninit_bg, you should see performance back to the old ext3
> levels.
>   

With uninit_bg off, it ran about 10 minutes, but it would be interesting 
to run without either.
> 							- Ted
>
> P.S.  We probably do want to examine the block allocation layout with
> flex_bg to make sure that the filesystem ages well in the long term.
>   
Testing aged file systems is always the holy grail - this workload is a 
fairly artificial one and was laid down with 4 threads concurrently writing
to a shared subdirectory.

ric


[-- Attachment #5: Type: message/rfc822, Size: 2877 bytes --]

From: Theodore Tso <tytso@MIT.EDU>
To: Ric Wheeler <rwheeler@redhat.com>
Cc: linux-ext4-owner@vger.kernel.org, Eric Sandeen <sandeen@redhat.com>
Subject: Re: suspiciously good fsck times?
Date: Thu, 10 Jul 2008 12:13:54 -0400
Message-ID: <20080710161354.GA10402@mit.edu>

On Thu, Jul 10, 2008 at 11:49:51AM -0400, Ric Wheeler wrote:
> We did run fsck through seekwatcher & saw a significant reduction in  
> seeks/sec for ext4. Eric has the pretty pictures that he can share.

Pictures are always fun!  It would be great to see the comparison
between ext3 and ext4 for fsck in this case.

> [root@localhost rwheeler]# /sbin/hdparm -t -t /dev/sdb
>
> /dev/sdb:
> Timing buffered disk reads:  186 MB in  3.03 seconds =  61.33 MB/sec
>

I meant hdparm -t -T, but that's ok, the 61.33 MB/sec is what I was
curious about.  So for this very artificial benchmark, fsck was using
2/3rd of the disk's full benchmark.  Not bad.  :-)

> Testing aged file systems is always the holy grail - this workload is a  
> fairly artificial one and was laid down with 4 threads concurrently writing
> to a shared subdirectory.

If you haven't nuked the ext4 filesystem yet, can you grab a dumpe2fs
of it first, so we can compare it to the inode allocation patterns
under ext3.  Thanks!!

						- Ted
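
As a quick check of how close fsck came to the drive's sequential rate, the sketch below divides the I/O rates reported by e2fsck -t -t by the hdparm figure. The numbers are copied from the outputs quoted in this thread; the function name is made up, and treating the hdparm result as the sequential ceiling is an assumption.

```python
HDPARM_MB_S = 61.33  # "Timing buffered disk reads" from the hdparm run

def fraction_of_sequential(rate_mb_s, baseline_mb_s=HDPARM_MB_S):
    """Ratio of an observed read rate to the sequential baseline."""
    return rate_mb_s / baseline_mb_s

# I/O rates (MB/s) from the ext4 e2fsck -t -t run
rates = {"pass 1": 54.35, "pass 2": 39.53, "overall": 46.77}
for name, rate in rates.items():
    print(f"{name}: {fraction_of_sequential(rate):.0%}")
# pass 1: 89%
# pass 2: 64%
# overall: 76%
```

Pass 2 lands near the two-thirds mark, and the overall rate is a bit higher, at roughly three-quarters of the hdparm number.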

[-- Attachment #6: Type: message/rfc822, Size: 5176 bytes --]

From: Eric Sandeen <sandeen@redhat.com>
To: rwheeler@redhat.com
Cc: Theodore Tso <tytso@mit.edu>, linux-ext4-owner@vger.kernel.org
Subject: Re: suspiciously good fsck times?
Date: Thu, 10 Jul 2008 11:14:28 -0500
Message-ID: <48763564.2090505@redhat.com>

Ric Wheeler wrote:
> Theodore Tso wrote:
>> On Thu, Jul 10, 2008 at 08:36:42AM -0400, Ric Wheeler wrote:
>>   
>>> Just to be mean, I have been trying to test the fsck speed of ext4 with  
>>> lots of small files.  The test I ran uses fs_mark to fill a 1TB Seagate  
>>> drive with 45.6 million 20k files (distributed between 256 
>>> subdirectories).
>>>
>>> Running on ext3, "fsck -f" takes about one hour.
>>>
>>> Running on ext4, with uninit_bg, the same fsck is finished in a bit over  
>>> 5 minutes - more than 10x faster.  (Without uninit_bg, the fsck takes  
>>> about 10 minutes).
>>>
>>> Is this too good to be true? Below is the fsck run itself, the tree is  
>>> Ted's latest git tree and his 1.41 WIP tools,
>>>     
>> Wow.  My guess is that flex_bg is making the difference.  What we
>> would want to compare is the I/O read statistics line:

I thought we actually had flex_bg off at least on the first run and it
still looked good.  (Ric just made the fs with mkfs.ext3 -j -I 256 -E
test_fs initially I think)

Val & I talked about this a little, and came to the conclusion that
directory fragmentation might be a pretty big part of it.

I did a similar workload on a much smaller fs, and the largest dir
(~11MB) looked like this on ext3:

BLOCKS:
(0-4):3950592-3950596, (5):3950604, (6-7):3950606-3950607, (8):3950630,
(9):3950871, (10-11):3950875-3950876, (IND):3950899, (12):3950900,
(13):3950934, (14):3950937, (15-16):3950943-3950944, (17):3951390,
(18):3951396, (19):3951402, (20):3951406, (21):3951408, (22):3951410,
(23):3951581, (24):3951684, (25):3951985, (26):3952031, (27):3952156,
(28):3952322, (29):3952418, (30):3952599, (31):3952626, (32):3954038,
(33):3954693, (34):3954698, (35):3954874, (36):3955108, (37):3955708,
(38):3955711, (39):3956034, (40):3956598, (41):3957173, (42):3957179,
(43):3957622, (44):3957763, (45):3957824, (46):3957910, (47):3958190,
(48):3958302, (49):3958488, (50):3958834, (51):3959173, (52):3959468,
(53):3959842, (54):3959903, (55):3960029, (56):3960245, (57):3960446
..... ad nauseam ...
(4032):4893557, (4033):4894194, (4034):4894719, (4035):4937580,
(4036):4937887, (4037):4939087, (4038):4939233, (4039):4939502,
(4040):4939508, (4041):4940473, (4042-4043):4940939-4940940,
(4044):4941191, (4045):4941402, (4046-4048):4941409-4941411,
(4049):4943061, (4050):4943307, (4051-4052):4943314-4943315
TOTAL: 4058

compared to ext4:

BLOCKS:
(0):1900544, (1-5070):1900546-1905615
TOTAL: 5071
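
One way to put a number on the difference between the two listings is to count physically contiguous runs. The sketch below is a minimal parser for BLOCKS text in the format shown above; the regular expression and function name are illustrative, and only the first few ext3 entries are used as sample input.

```python
import re

# Matches entries such as "(0-4):3950592-3950596", "(5):3950604"
# or "(IND):3950899" in the BLOCKS listing format shown above.
ENTRY = re.compile(r"\(([^)]+)\):(\d+)(?:-(\d+))?")

def physical_runs(blocks_text):
    """Return (start, end) physical block runs, merging adjacent entries."""
    runs = []
    for _label, start, end in ENTRY.findall(blocks_text):
        start = int(start)
        end = int(end) if end else start
        if runs and start == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], end)   # extends the previous run
        else:
            runs.append((start, end))       # a seek is needed to get here
    return runs

ext3_head = "(0-4):3950592-3950596, (5):3950604, (6-7):3950606-3950607, (8):3950630"
ext4_all = "(0):1900544, (1-5070):1900546-1905615"

print(len(physical_runs(ext3_head)))  # 4 runs for only 9 blocks
print(len(physical_runs(ext4_all)))   # 2 runs for all 5071 blocks
```

Each run boundary is a potential seek when the directory is read in order, so the run count is a rough proxy for the seeking fsck has to do in pass 2.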


> We did run fsck through seekwatcher & saw a significant reduction in 
> seeks/sec for ext4. Eric has the pretty pictures that he can share.

sure do (AFAIK these were with neither flex_bg nor uninit_bg):

http://people.redhat.com/esandeen/ext4/e4fsck-1T.png
http://people.redhat.com/esandeen/ext4/e3fsck-1T.png
http://people.redhat.com/esandeen/ext4/ext3-ext4-fsck-1T.png

I'm still working out what's what.  But that hockey-stick-shaped red
line for ext4 is intriguing, I think it's very densely packed $SOMETHING
that ext3 had to seek all over for, guessing it's the directories.
Although that strikes me as an odd place for the root-level directories
to land.

I need to check, does ext3 use reservation windows for directories?
Looks like maybe it should... :)

-Eric



[-- Attachment #7: Type: message/rfc822, Size: 2612 bytes --]

From: Theodore Tso <tytso@MIT.EDU>
To: Eric Sandeen <sandeen@redhat.com>
Cc: rwheeler@redhat.com, linux-ext4-owner@vger.kernel.org
Subject: Re: suspiciously good fsck times?
Date: Thu, 10 Jul 2008 13:21:17 -0400
Message-ID: <20080710172117.GE10402@mit.edu>

On Thu, Jul 10, 2008 at 11:14:28AM -0500, Eric Sandeen wrote:
> Val & I talked about this a little, and came to the conclusion that
> directory fragmentation might be a pretty big part of it.

Hmm, could be.  Let's see.  Ric said 46.5 million files, I don't know
how big the filenames were, but let's assume a directory entry size of
32, so that means if we assume perfect packing, 128 directory entries
per 4k block.  Let's use 100 directory entries/block just to make the
math easier, so that's 465,000 blocks.  If we assume a 10ms seek time,
and that the blocks are totally scattered, that's 4650 seconds, or
1.29 hours. So that's roughly within the ballpark that Ric measured.

     	       	      	      	     	 	  - Ted
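
The arithmetic in the estimate above can be reproduced with a few lines. This is only a sketch of the same back-of-the-envelope model (one full seek per directory block); the function name is made up, and the inputs are the assumptions stated in the message, not measurements. Note that Ric's original report gave 45.6 million files, so both figures are shown.

```python
def scattered_dir_read_seconds(n_files, entries_per_block, seek_ms):
    """Seconds spent seeking if every directory block costs one full seek."""
    blocks = n_files / entries_per_block
    return blocks * seek_ms / 1000.0

# Figures used in the estimate: 46.5M files, 100 entries/block, 10ms seek
secs = scattered_dir_read_seconds(46_500_000, 100, 10)
print(f"{secs:.0f} s = {secs / 3600:.2f} h")    # 4650 s = 1.29 h

# Same model with the 45.6M files from the original report
secs = scattered_dir_read_seconds(45_600_000, 100, 10)
print(f"{secs:.0f} s = {secs / 3600:.2f} h")    # 4560 s = 1.27 h
```

Either way the result stays close to the one hour measured for the ext3 fsck.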

[-- Attachment #8: Type: message/rfc822, Size: 2925 bytes --]

From: Eric Sandeen <sandeen@redhat.com>
To: Theodore Tso <tytso@mit.edu>
Cc: rwheeler@redhat.com, linux-ext4-owner@vger.kernel.org
Subject: Re: suspiciously good fsck times?
Date: Thu, 10 Jul 2008 12:23:08 -0500
Message-ID: <4876457C.3040709@redhat.com>

Theodore Tso wrote:
> On Thu, Jul 10, 2008 at 11:14:28AM -0500, Eric Sandeen wrote:
>> Val & I talked about this a little, and came to the conclusion that
>> directory fragmentation might be a pretty big part of it.
> 
> Hmm, could be.  Let's see.  Ric said 46.5 million files, I don't know
> how big the filenames were, but let's assume a directory entry size of
> 32, so that means if we assume perfect packing, 128 directory entries
> per 4k block.  Let's use 100 directory entries/block just to make the
> math easier, so that's 465,000 blocks.  If we assume a 10ms seek time,
> and that the blocks are totally scattered, that's 4650 seconds, or
> 1.29 hours. So that's roughly within the ballpark that Ric measured.
> 
>      	       	      	      	     	 	  - Ted


btw guys  this thread is not on linux-ext4, it's going to linux-ext4-owner

maybe someone who has them all can bounce to the list ;)

-Eric


* suspiciously good fsck times?
@ 2008-07-10 17:29 Ric Wheeler
  0 siblings, 0 replies; 9+ messages in thread
From: Ric Wheeler @ 2008-07-10 17:29 UTC (permalink / raw)
  To: linux-ext4; +Cc: Theodore Tso, Eric Sandeen

(Repost to the list - this was mistakenly sent to linux-ext4-owner)

Just to be mean, I have been trying to test the fsck speed of ext4 with
lots of small files.  The test I ran uses fs_mark to fill a 1TB Seagate
drive with 45.6 million 20k files (distributed between 256 subdirectories).

Running on ext3, "fsck -f" takes about one hour.

Running on ext4, with uninit_bg, the same fsck is finished in a bit over
5 minutes - more than 10x faster.  (Without uninit_bg, the fsck takes
about 10 minutes).

Is this too good to be true? Below is the fsck run itself, the tree is
Ted's latest git tree and his 1.41 WIP tools,

ric


[root@localhost Perf]# time /sbin/fsck.ext4 -t -t -f /dev/sdb1
e4fsck 1.41-WIP (07-Jul-2008)
Pass 1: Checking inodes, blocks, and sizes
Pass 1: Memory used: 40632k/69424k (36424k/4209k), time: 204.95/78.22/25.58
Pass 1: I/O read: 11140MB, write: 0MB, rate: 54.35MB/s
Pass 2: Checking directory structure
Pass 2: Memory used: 70184k/61968k (51803k/18382k), time: 76.47/50.27/ 8.77
Pass 2: I/O read: 3023MB, write: 0MB, rate: 39.53MB/s
Pass 3: Checking directory connectivity
Peak memory: Memory used: 70184k/61968k (59256k/10929k), time: 281.72/128.59/34.35
Pass 3A: Memory used: 70184k/61968k (59256k/10929k), time:  0.00/ 0.00/ 0.00
Pass 3A: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 3: Memory used: 70184k/61968k (51803k/18382k), time:  0.03/ 0.00/ 0.00
Pass 3: I/O read: 1MB, write: 0MB, rate: 37.86MB/s
Pass 4: Checking reference counts
Pass 4: Memory used: 70184k/44968k (27354k/42831k), time:  2.37/ 2.36/ 0.00
Pass 4: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 5: Checking group summary information
Pass 5: Memory used: 70184k/240k (64619k/5566k), time: 19.40/ 5.52/ 0.29
Pass 5: I/O read: 34MB, write: 0MB, rate: 1.75MB/s
/dev/sdb1: 45600268/61054976 files (0.0% non-contiguous), 232657574/244190000 blocks
Memory used: 70184k/240k (64889k/5296k), time: 303.54/136.48/34.65
I/O read: 14198MB, write: 1MB, rate: 46.77MB/s

real    5m3.993s
user    2m16.477s
sys     0m35.041s


* Re: suspiciously good fsck times?
       [not found] ` <20080710151822.GA25939@mit.edu>
       [not found]   ` <48762F9F.5070308@redhat.com>
@ 2008-07-10 17:30   ` Ric Wheeler
  1 sibling, 0 replies; 9+ messages in thread
From: Ric Wheeler @ 2008-07-10 17:30 UTC (permalink / raw)
  To: Theodore Tso; +Cc: linux-ext4, Eric Sandeen

Theodore Tso wrote:
> On Thu, Jul 10, 2008 at 08:36:42AM -0400, Ric Wheeler wrote:
>   
>> Just to be mean, I have been trying to test the fsck speed of ext4 with  
>> lots of small files.  The test I ran uses fs_mark to fill a 1TB Seagate  
>> drive with 45.6 million 20k files (distributed between 256 
>> subdirectories).
>>
>> Running on ext3, "fsck -f" takes about one hour.
>>
>> Running on ext4, with uninit_bg, the same fsck is finished in a bit over  
>> 5 minutes - more than 10x faster.  (Without uninit_bg, the fsck takes  
>> about 10 minutes).
>>
>> Is this too good to be true? Below is the fsck run itself, the tree is  
>> Ted's latest git tree and his 1.41 WIP tools,
>>     
>
> Wow.  My guess is that flex_bg is making the difference.  What we
> would want to compare is the I/O read statistics line:
>
>   
>> I/O read: 14198MB, write: 1MB, rate: 46.77MB/s
>>     
>
> That's pretty good, and indicates we've avoided a *lot* of seeking.
> The e2fsck -t -t output for ext3 should show roughly the same amount of
> I/O read (with 20k files, there would be no advantage towards using
> extents), but the I/O rate is probably *much* lower, indicating a lot
> more seeking is going on.
>   
We did run fsck through seekwatcher & saw a significant reduction in
seeks/sec for ext4. Eric has the pretty pictures that he can share.

> Can you send the full e2fsck -t -t output of the ext3 run?  And what
> is the hdparm -t -t results of the disk?
>   

I didn't run the ext3 test with -t -t (but can refill and rerun, takes
about 12 hours).

This disk is a relatively new Seagate 1TB drive, specs at:

http://www.seagate.com/ww/v/index.jsp?vgnextoid=0732f141e7f43110VgnVCM100000f5ee0a0aRCRD

hdparm test:

[root@localhost rwheeler]# /sbin/hdparm -t -t /dev/sdb

/dev/sdb:
Timing buffered disk reads:  186 MB in  3.03 seconds =  61.33 MB/sec



> If I'm right, if you create the filesystem with mke2fs -t ext4dev -O
> ^flex_bg,^uninit_bg, you should see performance back to the old ext3
> levels.
>   

With uninit_bg off, it ran about 10 minutes, but it would be interesting
to run without either.
> 							- Ted
>
> P.S.  We probably do want to examine the block allocation layout with
> flex_bg to make sure that the filesystem ages well in the long term.
>   
Testing aged file systems is always the holy grail - this workload is a
fairly artificial one and was laid down with 4 threads concurrently writing
to a shared subdirectory.

ric




* Re: suspiciously good fsck times?
  2008-07-10 17:28 [ricwheeler@gmail.com: suspiciously good fsck times?] Theodore Tso
@ 2008-07-10 17:53 ` Theodore Tso
  2008-07-10 20:13   ` Ric Wheeler
  2008-07-11 15:39   ` Ric Wheeler
  0 siblings, 2 replies; 9+ messages in thread
From: Theodore Tso @ 2008-07-10 17:53 UTC (permalink / raw)
  To: linux-ext4

Based on the graphs which Eric posted, one interesting thing I think
you'll find if you repeat the ext3 experiment with e2fsck -t -t is
that pass2 will be about seven times longer than pass1.  (Which is
backwards from most e2fsck runs, where pass2 is about half pass 1's
run time --- although obviously that depends on how many directory
blocks you have.)

Yes, some kind of reservation windows would help on ext3 --- but the
question is whether such a change would be too-specific for this
benchmark or not.  Most of the time directories don't grow to such a
huge size.  So if you use a smallish reservation window (around 8 blocks,
say) for many directories, this might lead to more filesystem fragmentation
that in the long run would cause the filesystem not to age well; it also
wouldn't help much when you have over 11 million files in the
directory, and a directory with over 100,000 blocks.

I don't think delayed allocation is what's helping here either,
because the journal will force the directory blocks to be placed as
soon as we commit a transaction.  I think what's saving us here is
that flex_bg and mballoc are separating the directory blocks from the
data blocks, allowing the directory blocks to be closely packed
together.

					- Ted


* Re: suspiciously good fsck times?
  2008-07-10 17:53 ` suspiciously good fsck times? Theodore Tso
@ 2008-07-10 20:13   ` Ric Wheeler
  2008-07-14 21:19     ` Andreas Dilger
  2008-07-11 15:39   ` Ric Wheeler
  1 sibling, 1 reply; 9+ messages in thread
From: Ric Wheeler @ 2008-07-10 20:13 UTC (permalink / raw)
  To: Theodore Tso; +Cc: linux-ext4

Theodore Tso wrote:
> Based on the graphs which Eric posted, one interesting thing I think
> you'll find if you repeat the ext3 experiment with e2fsck -t -t is
> that pass2 will be about seven times longer than pass1.  (Which is
> backwards from most e2fsck runs, where pass2 is about half pass 1's
> run time --- although obviously that depends on how many directory
> blocks you have.)
>
>   
Pass2 was where both spent most of their time, but I can rerun later to 
validate that.

> Yes, some kind of reservation windows would help on ext3 --- but the
> question is whether such a change would be too-specific for this
> benchmark or not.  Most of the time directories don't grow to such a
> huge size.  So if you use a smallish reservation window (around 8 blocks,
> say) for many directories, this might lead to more filesystem fragmentation
> that in the long run would cause the filesystem not to age well; it also
> wouldn't help much when you have over 11 million files in the
> directory, and a directory with over 100,000 blocks.
>   
I think that the key is to lay out the directories (or files for that 
matter) in reasonably contiguous chunks.  If we could always bump up the 
allocation by enough to capture a full disk track (128k? 512k?) you 
would probably be near optimal, but any significant portion of a track 
would also help.

It would be interesting to rerun with the 46 million files in one 
directory as well (basically, for working sets that have no natural 
mapping into directories like some object based workloads).

> I don't think delayed allocation is what's helping here either,
> because the journal will force the directory blocks to be placed as
> soon as we commit a transaction.  I think what's saving us here is
> that flex_bg and mballoc are separating the directory blocks from the
> data blocks, allowing the directory blocks to be closely packed
> together.
>
> 					- Ted
>   
I can try to validate that, thanks!

ric




* Re: suspiciously good fsck times?
  2008-07-10 17:53 ` suspiciously good fsck times? Theodore Tso
  2008-07-10 20:13   ` Ric Wheeler
@ 2008-07-11 15:39   ` Ric Wheeler
  1 sibling, 0 replies; 9+ messages in thread
From: Ric Wheeler @ 2008-07-11 15:39 UTC (permalink / raw)
  To: Theodore Tso; +Cc: linux-ext4

Theodore Tso wrote:
> Based on the graphs which Eric posted, one interesting thing I think
> you'll find if you repeat the ext3 experiment with e2fsck -t -t is
> that pass2 will be about seven times longer than pass1.  (Which is
> backwards from most e2fsck runs, where pass2 is about half pass 1's
> run time --- although obviously that depends on how many directory
> blocks you have.)
>
> Yes, some kind of reservation windows would help on ext3 --- but the
> question is whether such a change would be too-specific for this
> benchmark or not.  Most of the time directories don't grow to such a
> huge size.  So if you use a smallish reservation window (around 8 blocks,
> say) for many directories, this might lead to more filesystem fragmentation
> that in the long run would cause the filesystem not to age well; it also
> wouldn't help much when you have over 11 million files in the
> directory, and a directory with over 100,000 blocks.
>
> I don't think delayed allocation is what's helping here either,
> because the journal will force the directory blocks to be placed as
> soon as we commit a transaction.  I think what's saving us here is
> that flex_bg and mballoc are separating the directory blocks from the
> data blocks, allowing the directory blocks to be closely packed
> together.
>
> 					- Ted
>   

I made a new ext4 file system without flex_bg or uninit_bg:

[root@localhost Perf]# /sbin/debuge4fs /dev/sdb1
debuge4fs 1.41-WIP (07-Jul-2008)
debuge4fs:  feature
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent sparse_super large_file


The fsck time was a bit slower, but still looks like 8 minutes on ext4 
vs 1 hour on ext3:

[root@localhost Perf]# umount /mnt
[root@localhost Perf]# time /sbin/fsck.ext4 -t -t -f /dev/sdb1
e4fsck 1.41-WIP (07-Jul-2008)
Pass 1: Checking inodes, blocks, and sizes
Pass 1: Memory used: 43944k/69424k (36476k/7469k), time: 352.48/93.27/29.45
Pass 1: I/O read: 14914MB, write: 0MB, rate: 42.31MB/s
Pass 2: Checking directory structure
Pass 2: Memory used: 71396k/61968k (51854k/19543k), time: 73.00/50.46/ 7.65
Pass 2: I/O read: 3023MB, write: 0MB, rate: 41.41MB/s
Pass 3: Checking directory connectivity
Peak memory: Memory used: 71396k/61968k (59307k/12090k), time: 425.82/143.83/37.10
Pass 3A: Memory used: 71396k/61968k (59307k/12090k), time:  0.00/ 0.00/ 0.00
Pass 3A: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 3: Memory used: 71396k/61968k (51854k/19543k), time:  0.01/ 0.00/ 0.00
Pass 3: I/O read: 1MB, write: 0MB, rate: 76.91MB/s
Pass 4: Checking reference counts
Pass 4: Memory used: 71396k/44968k (27406k/43991k), time:  2.37/ 2.36/ 0.00
Pass 4: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 5: Checking group summary information
Pass 5: Memory used: 71396k/240k (64671k/6726k), time: 63.60/ 4.98/ 0.33
Pass 5: I/O read: 37MB, write: 0MB, rate: 0.58MB/s
/dev/sdb1: 45600268/61054976 files (0.0% non-contiguous), 232657587/244190000 blocks
Memory used: 71396k/240k (64671k/6726k), time: 491.82/151.17/37.43
I/O read: 17974MB, write: 1MB, rate: 36.55MB/s

real    8m12.260s
user    2m31.167s
sys     0m37.766s
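
To see where the extra three minutes went, the per-pass wall-clock times of the two ext4 runs can be compared. The sketch below parses the "time: real/user/sys" fields from e2fsck -t -t lines in the format shown in this thread; the regular expression and names are illustrative, and only passes 1, 2 and 5 are included as sample input.

```python
import re

# Captures the pass name and the real/user/sys times from lines such as
# "Pass 2: Memory used: ..., time: 76.47/50.27/ 8.77"
TIME_LINE = re.compile(
    r"^(Pass \w+): Memory used: .*time:\s*([\d.]+)/\s*([\d.]+)/\s*([\d.]+)",
    re.M,
)

def pass_wall_times(output):
    """Map each pass name to its wall-clock ("real") seconds."""
    return {m.group(1): float(m.group(2)) for m in TIME_LINE.finditer(output)}

first_run = """\
Pass 1: Memory used: 40632k/69424k (36424k/4209k), time: 204.95/78.22/25.58
Pass 2: Memory used: 70184k/61968k (51803k/18382k), time: 76.47/50.27/ 8.77
Pass 5: Memory used: 70184k/240k (64619k/5566k), time: 19.40/ 5.52/ 0.29
"""
# Second run: file system made without flex_bg or uninit_bg
second_run = """\
Pass 1: Memory used: 43944k/69424k (36476k/7469k), time: 352.48/93.27/29.45
Pass 2: Memory used: 71396k/61968k (51854k/19543k), time: 73.00/50.46/ 7.65
Pass 5: Memory used: 71396k/240k (64671k/6726k), time: 63.60/ 4.98/ 0.33
"""

a = pass_wall_times(first_run)
b = pass_wall_times(second_run)
for name in a:
    print(f"{name}: {a[name]:.2f}s -> {b[name]:.2f}s ({b[name] - a[name]:+.2f}s)")
# Pass 1: 204.95s -> 352.48s (+147.53s)
# Pass 2: 76.47s -> 73.00s (-3.47s)
# Pass 5: 19.40s -> 63.60s (+44.20s)
```

Nearly all of the difference is in passes 1 and 5, while pass 2 is essentially unchanged. That would be consistent with the directory layout, rather than flex_bg or uninit_bg, accounting for most of the gap to ext3, though a matching ext3 -t -t run would be needed to confirm it.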



* Re: suspiciously good fsck times?
  2008-07-10 20:13   ` Ric Wheeler
@ 2008-07-14 21:19     ` Andreas Dilger
  2008-07-15  0:47       ` Ric Wheeler
  0 siblings, 1 reply; 9+ messages in thread
From: Andreas Dilger @ 2008-07-14 21:19 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Theodore Tso, linux-ext4

On Jul 10, 2008  16:13 -0400, Ric Wheeler wrote:
> It would be interesting to rerun with the 46 million files in one  
> directory as well (basically, for working sets that have no natural  
> mapping into directories like some object based workloads).

I think you'll hit a limit around 15M files in a single directory.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: suspiciously good fsck times?
  2008-07-14 21:19     ` Andreas Dilger
@ 2008-07-15  0:47       ` Ric Wheeler
  0 siblings, 0 replies; 9+ messages in thread
From: Ric Wheeler @ 2008-07-15  0:47 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Theodore Tso, linux-ext4

Andreas Dilger wrote:
> On Jul 10, 2008  16:13 -0400, Ric Wheeler wrote:
>   
>> It would be interesting to rerun with the 46 million files in one  
>> directory as well (basically, for working sets that have no natural  
>> mapping into directories like some object based workloads).
>>     
>
> I think you'll hit a limit around 15M files in a single directory.
>
> Cheers, Andreas
> --
>   
Probably still worth a quick test, just to see how well it holds up at 
the edge, thanks!

ric



end of thread, other threads:[~2008-07-15  0:48 UTC | newest]

Thread overview: 9+ messages
2008-07-10 17:28 [ricwheeler@gmail.com: suspiciously good fsck times?] Theodore Tso
2008-07-10 17:53 ` suspiciously good fsck times? Theodore Tso
2008-07-10 20:13   ` Ric Wheeler
2008-07-14 21:19     ` Andreas Dilger
2008-07-15  0:47       ` Ric Wheeler
2008-07-11 15:39   ` Ric Wheeler
  -- strict thread matches above, loose matches on Subject: below --
2008-07-10 17:29 Ric Wheeler
     [not found] <4876025A.80909@gmail.com>
     [not found] ` <20080710151822.GA25939@mit.edu>
     [not found]   ` <48762F9F.5070308@redhat.com>
     [not found]     ` <48763564.2090505@redhat.com>
     [not found]       ` <20080710172117.GE10402@mit.edu>
2008-07-10 17:27         ` Ric Wheeler
2008-07-10 17:30   ` Ric Wheeler
