* [RFC] Tree fragmentation and prefetching
From: Arne Jansen @ 2011-03-23 13:06 UTC (permalink / raw)
To: Linux Btrfs; +Cc: Chris Mason
While looking into the performance of scrub I noticed that a significant
amount of time is spent loading the extent tree and the csum tree. While
this is no surprise, I did some prototyping on how to improve on it.
The main idea is to load the tree (or parts of it) top-down, order the
needed blocks, and distribute the reads over all disks.
To keep you interested, some results first.
a) by tree enumeration with reada=2
   reading extent tree: 242s
   reading csum tree: 140s
   reading both trees: 324s

b) prefetch prototype
   reading extent tree: 23.5s
   reading csum tree: 20.4s
   reading both trees: 25.7s
The test setup is a filesystem on 7 Seagate ES.2 1TB disks, filled to
28%. It is created with the current git tree plus the round-robin patch
and filled with
fs_mark -D 512 -t 16 -n 4096 -F -S0
The 'normal' read is done by enumerating the leaves with btrfs_next_leaf()
and path->reada=2. Both trees are enumerated one after the other.
The prototype currently just uses raw bios; it does not make use of the
page cache and does not enter the read pages into the cache. Adding that
will probably cost some overhead. It also does not check the crcs yet.
While it is very promising to implement this for scrub, I think a more
general interface that can be used for every enumeration would be
beneficial. Use cases that come to mind are rebalance, reflink, deletion
of large files, listing of large directories, etc.
I'd imagine an interface along the lines of
int btrfs_readahead_init(struct btrfs_reada_ctx *reada);
int btrfs_readahead_add(struct btrfs_root *root,
                        struct btrfs_key *start,
                        struct btrfs_key *end,
                        struct btrfs_reada_ctx *reada);
void btrfs_readahead_wait(struct btrfs_reada_ctx *reada);
to trigger the readahead of parts of a tree. Multiple readahead
requests can be given before waiting. This would enable the very
beneficial folding seen above for 'reading both trees'.
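As an illustration, here is a minimal sketch of how a consumer such as
scrub might drive this proposed interface. The btrfs_reada_* calls are
the hypothetical API above, not existing kernel functions; only the root
pointers come from the existing fs_info:

static int prefetch_both_trees(struct btrfs_fs_info *fs_info,
                               struct btrfs_key *start,
                               struct btrfs_key *end)
{
        struct btrfs_reada_ctx reada;
        int ret;

        ret = btrfs_readahead_init(&reada);
        if (ret)
                return ret;

        /* queue both requests before waiting, so the reads of the
         * two trees can be folded as in the numbers above */
        ret = btrfs_readahead_add(fs_info->extent_root, start, end, &reada);
        if (ret)
                return ret;
        ret = btrfs_readahead_add(fs_info->csum_root, start, end, &reada);
        if (ret)
                return ret;

        btrfs_readahead_wait(&reada);
        return 0;
}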
It would also be possible to add a cascading readahead, where the
contents of leaves trigger readaheads in other trees, maybe by giving
a callback that decides what to read instead of a fixed start/end range.
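One possible shape for that extension (the callback type and the _cb
variant are invented here for illustration, not part of the proposal
above):

/* Hypothetical cascading variant: instead of a fixed start/end range,
 * a callback inspects each fetched leaf and may queue further
 * readahead, e.g. an extent-tree leaf triggering csum-tree reads. */
typedef void (*btrfs_reada_cb_t)(struct btrfs_root *root,
                                 struct extent_buffer *leaf,
                                 struct btrfs_reada_ctx *reada);

int btrfs_readahead_add_cb(struct btrfs_root *root,
                           struct btrfs_key *start,
                           struct btrfs_key *end,
                           btrfs_reada_cb_t cb,
                           struct btrfs_reada_ctx *reada);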
For the implementation I'd need an interface which I haven't been able
to find yet. Currently I can trigger the read of several pages / tree
blocks and wait for the completion of each of them. What I'd need is an
interface that gives me a callback on each completion, or a waiting
function that wakes up on each completion and reports which pages just
completed. One way to achieve this would be to add a hook, but I'll
gladly take any implementation hints.
--
Arne
* Re: [RFC] Tree fragmentation and prefetching
From: Arne Jansen @ 2011-03-23 15:01 UTC (permalink / raw)
To: Linux Btrfs; +Cc: Chris Mason
On 23.03.2011 14:06, Arne Jansen wrote:
> While looking into the performance of scrub I noticed that a significant
> amount of time is spent loading the extent tree and the csum tree. While
> this is no surprise, I did some prototyping on how to improve on it.
> The main idea is to load the tree (or parts of it) top-down, order the
> needed blocks, and distribute the reads over all disks.
> To keep you interested, some results first.
>
> a) by tree enumeration with reada=2
>    reading extent tree: 242s
>    reading csum tree: 140s
>    reading both trees: 324s
Sorry, the number is wrong; it should be 384s (just the sum of both,
modulo rounding errors).
* Re: [RFC] Tree fragmentation and prefetching
From: Andrey Kuzmin @ 2011-03-23 19:26 UTC (permalink / raw)
To: Arne Jansen; +Cc: Linux Btrfs, Chris Mason
On Wed, Mar 23, 2011 at 4:06 PM, Arne Jansen <sensille@gmx.net> wrote:
> The main idea is to load the tree (or parts of it) top-down, order the
> needed blocks, and distribute the reads over all disks.
> To keep you interested, some results first.
>
> a) by tree enumeration with reada=2
>    reading extent tree: 242s
>    reading csum tree: 140s
>    reading both trees: 324s
>
> b) prefetch prototype
>    reading extent tree: 23.5s
>    reading csum tree: 20.4s
>    reading both trees: 25.7s
A 10x speed-up looks impressive indeed. Just to be sure I got you right:
do you attribute this effect specifically to enumerating tree leaves in
key order vs. disk-address order when the two are not aligned?
Regards,
Andrey
* Re: [RFC] Tree fragmentation and prefetching
From: Chris Mason @ 2011-03-23 19:32 UTC (permalink / raw)
To: Arne Jansen; +Cc: Linux Btrfs
Excerpts from Arne Jansen's message of 2011-03-23 09:06:02 -0400:
> While looking into the performance of scrub I noticed that a significant
> amount of time is spent loading the extent tree and the csum tree. While
> this is no surprise, I did some prototyping on how to improve on it.
> The main idea is to load the tree (or parts of it) top-down, order the
> needed blocks, and distribute the reads over all disks.
> To keep you interested, some results first.
>
> a) by tree enumeration with reada=2
>    reading extent tree: 242s
>    reading csum tree: 140s
>    reading both trees: 324s
>
> b) prefetch prototype
>    reading extent tree: 23.5s
>    reading csum tree: 20.4s
>    reading both trees: 25.7s
Very nice, btrfsck does something similar.
> For the implementation I'd need an interface which I haven't been able
> to find yet. Currently I can trigger the read of several pages / tree
> blocks and wait for the completion of each of them. What I'd need is an
> interface that gives me a callback on each completion, or a waiting
> function that wakes up on each completion and reports which pages just
> completed. One way to achieve this would be to add a hook, but I'll
> gladly take any implementation hints.
We have the bio endio callbacks for this; I think that's the only thing
you can use.
-chris
* Re: [RFC] Tree fragmentation and prefetching
2011-03-23 19:26 ` Andrey Kuzmin
@ 2011-03-23 20:28 ` Arne Jansen
2011-03-23 21:47 ` Andrey Kuzmin
2011-03-24 1:38 ` Miao Xie
0 siblings, 2 replies; 13+ messages in thread
From: Arne Jansen @ 2011-03-23 20:28 UTC (permalink / raw)
To: Andrey Kuzmin; +Cc: Linux Btrfs, Chris Mason
On 23.03.2011 20:26, Andrey Kuzmin wrote:
> A 10x speed-up looks impressive indeed. Just to be sure I got you right:
> do you attribute this effect specifically to enumerating tree leaves in
> key order vs. disk-address order when the two are not aligned?
Yes. Leaves and the intermediate nodes tend to be quite scattered
around the disk with respect to their logical order.
Reading them in logical (ascending/descending) order requires lots
of seeks.
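Schematically, the prototype's gain comes from collecting the child
block addresses of a node first and issuing the reads in ascending
physical order. A sketch only, not the actual prototype code;
lookup_block_physical() and submit_read() are hypothetical stand-ins
for the chunk-mapping and bio-submission paths:

struct block_req {
        u64 logical;    /* btrfs logical address of the tree block */
        u64 physical;   /* physical position on the chosen disk */
};

static int cmp_physical(const void *a, const void *b)
{
        const struct block_req *ra = a, *rb = b;

        if (ra->physical < rb->physical)
                return -1;
        return ra->physical > rb->physical;
}

/* collect all children of one node, then read them in disk order */
static void prefetch_node_children(struct extent_buffer *node,
                                   struct block_req *reqs)
{
        int i, nr = btrfs_header_nritems(node);

        for (i = 0; i < nr; i++) {
                reqs[i].logical = btrfs_node_blockptr(node, i);
                reqs[i].physical = lookup_block_physical(reqs[i].logical);
        }
        /* seek-friendly: ascending physical order instead of key order */
        sort(reqs, nr, sizeof(*reqs), cmp_physical, NULL);
        for (i = 0; i < nr; i++)
                submit_read(&reqs[i]);  /* asynchronous raw bio */
}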
* Re: [RFC] Tree fragmentation and prefetching
From: Andrey Kuzmin @ 2011-03-23 21:47 UTC (permalink / raw)
To: Arne Jansen; +Cc: Linux Btrfs, Chris Mason
On Wed, Mar 23, 2011 at 11:28 PM, Arne Jansen <sensille@gmx.net> wrote:
> Yes. Leaves and the intermediate nodes tend to be quite scattered
> around the disk with respect to their logical order.
> Reading them in logical (ascending/descending) order requires lots
> of seeks.
And the patch actually does on-the-fly defragmentation, right? Why
lose it then? :)
Regards,
Andrey
* Re: [RFC] Tree fragmentation and prefetching
From: Miao Xie @ 2011-03-24 1:38 UTC (permalink / raw)
To: Arne Jansen; +Cc: Andrey Kuzmin, Linux Btrfs, Chris Mason
On Wed, 23 Mar 2011 21:28:25 +0100, Arne Jansen wrote:
> Yes. Leaves and the intermediate nodes tend to be quite scattered
> around the disk with respect to their logical order.
> Reading them in logical (ascending/descending) order requires lots
> of seeks.
I'm also dealing with the tree fragmentation problem; I'm trying to
store leaves that have the same parent close together.
Regards
Miao
* Re: [RFC] Tree fragmentation and prefetching
From: Arne Jansen @ 2011-03-24 7:29 UTC (permalink / raw)
To: miaox; +Cc: Andrey Kuzmin, Linux Btrfs, Chris Mason
On 24.03.2011 02:38, Miao Xie wrote:
> I'm also dealing with the tree fragmentation problem; I'm trying to
> store leaves that have the same parent close together.
That's good to hear. Do you already have anything I can repeat the test
with?
-Arne
* Re: [RFC] Tree fragmentation and prefetching
From: Miao Xie @ 2011-03-24 12:45 UTC (permalink / raw)
To: Arne Jansen; +Cc: Andrey Kuzmin, Linux Btrfs, Chris Mason
On Thu, 24 Mar 2011 08:29:57 +0100, Arne Jansen wrote:
>> I'm also dealing with the tree fragmentation problem; I'm trying to
>> store leaves that have the same parent close together.
>
> That's good to hear. Do you already have anything I can repeat the test
> with?
It is still under development. ;)
Thanks
Miao
* Re: [RFC] Tree fragmentation and prefetching
From: Arne Jansen @ 2011-03-25 20:14 UTC (permalink / raw)
To: Chris Mason; +Cc: Linux Btrfs
On 23.03.2011 20:32, Chris Mason wrote:
> Excerpts from Arne Jansen's message of 2011-03-23 09:06:02 -0400:
>>
>> For the implementation I'd need an interface which I haven't been able
>> to find yet. Currently I can trigger the read of several pages / tree
>> blocks and wait for the completion of each of them. What I'd need is an
>> interface that gives me a callback on each completion, or a waiting
>> function that wakes up on each completion and reports which pages just
>> completed. One way to achieve this would be to add a hook, but I'll
>> gladly take any implementation hints.
>
> We have the bio endio callbacks for this; I think that's the only thing
> you can use.
>
ok, so I'll add a new extent state bit EXTENT_READAHEAD and test for it
in btree_readpage_end_io_hook.
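In rough outline, and only as a sketch: EXTENT_READAHEAD is the proposed
new bit, its value and the reada_page_done() helper are made up here,
eb_start/eb_end stand for the tree block's byte range in the io_tree,
and the extent-io calls follow their shape in the current tree:

#define EXTENT_READAHEAD (1U << 15)     /* assumed free extent state bit */

/* before submitting the read: tag the tree block's byte range */
set_extent_bits(io_tree, eb_start, eb_end, EXTENT_READAHEAD, GFP_NOFS);

/* in btree_readpage_end_io_hook(): detect completed readahead pages */
if (test_range_bit(io_tree, eb_start, eb_end, EXTENT_READAHEAD, 1, NULL)) {
        clear_extent_bits(io_tree, eb_start, eb_end, EXTENT_READAHEAD,
                          GFP_NOFS);
        reada_page_done(eb_start);      /* notify the readahead machinery */
}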
* Re: [RFC] Tree fragmentation and prefetching
From: Chris Mason @ 2011-03-25 20:15 UTC (permalink / raw)
To: Arne Jansen; +Cc: Linux Btrfs
Excerpts from Arne Jansen's message of 2011-03-25 16:14:35 -0400:
> ok, so I'll add a new extent state bit EXTENT_READAHEAD and test for it
> in btree_readpage_end_io_hook.
It's also common to use a chain of endio handlers. If you're allocating
any state for the RA, you just save the original endio handler in your
new struct, and then use your own endio handler that does the readahead
smarts.
-chris
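For reference, a minimal sketch of the chaining pattern Chris describes,
with the bio end_io signature as it looks in this era of the kernel;
struct reada_extent and the function names are hypothetical:

struct reada_extent {
        bio_end_io_t *orig_end_io;      /* saved original handler */
        void *orig_private;             /* saved original bi_private */
        /* ... readahead bookkeeping ... */
};

static void reada_end_io(struct bio *bio, int err)
{
        struct reada_extent *re = bio->bi_private;

        /* readahead smarts go here: mark blocks done, trigger
         * cascaded requests, wake up waiters, ... */

        /* restore and invoke the original handler */
        bio->bi_private = re->orig_private;
        bio->bi_end_io = re->orig_end_io;
        bio->bi_end_io(bio, err);
}

static void reada_hook_bio(struct bio *bio, struct reada_extent *re)
{
        re->orig_end_io = bio->bi_end_io;
        re->orig_private = bio->bi_private;
        bio->bi_end_io = reada_end_io;
        bio->bi_private = re;
}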
* Re: [RFC] Tree fragmentation and prefetching
From: Arne Jansen @ 2011-03-25 20:53 UTC (permalink / raw)
To: Chris Mason; +Cc: Linux Btrfs
On 25.03.2011 21:15, Chris Mason wrote:
> It's also common to use a chain of endio handlers. If you're allocating
> any state for the RA, you just save the original endio handler in your
> new struct, and then use your own endio handler that does the readahead
> smarts.
>
Do you mean replacing the bio end_io handler or the
readpage_end_io_hook? As I want the pages to end up in the page cache,
I'd like to use as much of the existing infrastructure as possible.
Intercepting the bio deep down in the chain would mean duplicating
some code on the way down and on the way up again.
-Arne
* Re: [RFC] Tree fragmentation and prefetching
From: Chris Mason @ 2011-03-25 21:01 UTC (permalink / raw)
To: Arne Jansen; +Cc: Linux Btrfs
Excerpts from Arne Jansen's message of 2011-03-25 16:53:24 -0400:
> Do you mean replacing the bio end_io handler or the
> readpage_end_io_hook? As I want the pages to end up in the page cache,
> I'd like to use as much of the existing infrastructure as possible.
> Intercepting the bio deep down in the chain would mean duplicating
> some code on the way down and on the way up again.
The end_io handler; see how we chain them around for the crc helper
threads.
-chris