* reiser4: discard support
@ 2014-04-30 7:43 Ivan Shapovalov
2014-04-30 14:11 ` Edward Shishkin
[not found] ` <1B07908E-B768-407D-ADFA-B3C539FABB49@gmail.com>
0 siblings, 2 replies; 19+ messages in thread
From: Ivan Shapovalov @ 2014-04-30 7:43 UTC (permalink / raw)
To: Edward Shishkin; +Cc: reiserfs-devel
[-- Attachment #1: Type: text/plain, Size: 1925 bytes --]
Hi Edward,
So I bit the bullet, ditched all tasks for the upcoming holidays and gave
another try to the discard support in reiser4. Since my last status update,
I've lost the local kernel tree due to bad scripting (decided to TRIM free
space between partitions... and TRIM'ed everything that is NOT free space), so
I had to start from scratch.
I decided to go with following algorithm:
During the transaction deallocated ranges are logged as-is to a list or array
(let's call it "the delete log"). On atom commit we will generate a minimal
superset of the delete log, comprised only of whole erase units ("the discard
set").
So, at commit time the following actions take place:
- elements of the delete log are sorted;
- the delete log is iterated, merging any adjacent ranges;
- for each resulting range in the delete log its start is rounded down to the
closest erase unit boundary, and, starting from this block, extents of erase
unit length are created until whole range is covered;
- for each such new extent it is checked whether does it contain allocated
blocks. If no (i. e. the range is completely deallocated), the range is
added to the discard set.
Complexity of logging is thus O(n), added complexity of atom joining is O(1)
and added complexity of atom commit is O(n*log(n) + n).
The first question is how to store the delete log.
- blocknr_set isn't ordered per se, so adjacent entries can't be easily merged
- simple linked list (extent per node) seems like too big overhead
- kmalloc()'ed arrays can't be joined in O(1) and they are static (imposes a
hard limit on log size)
- kmem_cache... feels like using a sledge-hammer.
And oh, where in commit_current_atom() should these things be done? I guess
right after the reiser4_write_logs(), under the tmgr mutex, but I can't really
explain why I think so.
Thanks,
--
Ivan Shapovalov / intelfx /
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 230 bytes --]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: reiser4: discard support
2014-04-30 7:43 Ivan Shapovalov
@ 2014-04-30 14:11 ` Edward Shishkin
[not found] ` <1B07908E-B768-407D-ADFA-B3C539FABB49@gmail.com>
1 sibling, 0 replies; 19+ messages in thread
From: Edward Shishkin @ 2014-04-30 14:11 UTC (permalink / raw)
To: Ivan Shapovalov; +Cc: reiserfs-devel
On 04/30/2014 09:43 AM, Ivan Shapovalov wrote:
> Hi Edward,
Hello Ivan.
>
> So I bit the bullet, ditched all tasks for the upcoming holidays and gave
> another try to the discard support in reiser4. Since my last status update,
> I've lost the local kernel tree due to bad scripting (decided to TRIM free
> space between partitions... and TRIM'ed everything that is NOT free space), so
> I had to start from scratch.
Once in a while I send key patches to my gmail mailbox configured with
IMAP..
>
> I decided to go with following algorithm:
>
> During the transaction deallocated ranges are logged as-is to a list or array
> (let's call it "the delete log").
"Log" is a busy term. It is already used for journal blocks.
Let's use "discard set" instead?
> On atom commit we will generate a minimal
> superset of the delete log, comprised only of whole erase units ("the discard
> set").
> So, at commit time the following actions take place:
> - elements of the delete log are sorted;
> - the delete log is iterated, merging any adjacent ranges;
> - for each resulting range in the delete log its start is rounded down to the
> closest erase unit boundary, and, starting from this block, extents of erase
> unit length are created until whole range is covered;
> - for each such new extent it is checked whether does it contain allocated
> blocks. If no (i. e. the range is completely deallocated), the range is
> added to the discard set.
>
> Complexity of logging is thus O(n), added complexity of atom joining is O(1)
> and added complexity of atom commit is O(n*log(n) + n).
>
> The first question is how to store the delete log.
> - blocknr_set isn't ordered per se, so adjacent entries can't be easily merged
> - simple linked list (extent per node) seems like too big overhead
> - kmalloc()'ed arrays can't be joined in O(1) and they are static (imposes a
> hard limit on log size)
> - kmem_cache... feels like using a sledge-hammer.
Why not to maintain discard set as extents in a special per-atom tree?
When adding an extent, try to merge it with the neighbors (if any), so
that everything is prepared for discard at any moment. Just walk through
the tree from left to right and discard all suitable extents?
There is a ready implementation of rb-trees in the kernel (see
./lib/rbtree.c,
./lib/rbtree_test.c). Could you please take a look at this?
>
> And oh, where in commit_current_atom() should these things be done? I guess
> right after the reiser4_write_logs(), under the tmgr mutex,
You are right.
Insert the function issue_discard_requests() right after the
reiser4_invalidate_list(ATOM_OVRWR_LIST(*atom)); and before
mutex_unlock(&sbinfo->tmgr.commit_mutex);
> but I can't really
> explain why I think so.
reiser4_write_logs() commits the atom's overwrite set, which includes
all system blocks, including bitmap blocks.
If you issue discard requests before reiser4_write_logs(), then it
can happen that the system crashes after issue_discard_requests(),
but before reiser4_write_logs(). So, we'll have that the blocks have
been discarded, while the file system still refers to them (as bitmap
eventually hasn't been updated). Welcome to data corruption..
Thanks,
Edward.
P.S.: You'll need to set up a debugging environment. Take a look at this:
http://bipinkunal.blogspot.cz/2012/05/kgdb-tutorial.html
Do you have a second machine with a serial port?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: reiser4: discard support
@ 2014-04-30 17:55 Edward Shishkin
2014-04-30 18:53 ` Edward Shishkin
0 siblings, 1 reply; 19+ messages in thread
From: Edward Shishkin @ 2014-04-30 17:55 UTC (permalink / raw)
To: ReiserFS Development mailing list
On 04/30/2014 05:04 PM, Ivan Shapovalov wrote:
[...]
>> There is a ready implementation of rb-trees in the kernel (see
./lib/rbtree.c,
>> ./lib/rbtree_test.c). Could you please take a look at this?
>
> I've checked it already. Well, in case of trees:
> - combined extent insertion+adjacent extent merge complexity is
O(n*log(n))
Actually, this is:
1) Add n extents one-by-one to empty tree:
log(1) + log(2) + ... + log(n-1)
2) Merge 2 trees of n extents:
log(n) + log(n+1) + ... + log(2n-1)
Total: log(1*2*...*2n-1) = log((2n-1)!)
in accordance with Stirling's formula:
log((2n)!) = log(((2n)^(1/2))*(2n/e)^n) =
= (1/2)*log(2n) + n*log(2n/e) < n + n*log(n)
Since atoms can be merged in parallel, we have that
trees work better than lists. Also tree of extents has
better memory consumption because of merging extents.
> against O(n+n*log(n)) in lists;
> - joining two trees is not less than O(log(n)) against O(1) in lists.
>
> So I can't see any benefit in using trees, and join performance is a
downside of these. Am I missing something?
>
>>
>>>
[...]
> Why after reiser4_invalidate_list()? I thought that it should be
called between reiser4_write_logs(), but before reiser4_invalidate_list().
OK, insert before. I think, it doesn't matter, since we don't look
at this list when issuing discard requests, no? :)
>
> And why not after all reiser4_invalidate_list() calls?
>
>>
>> Thanks,
>> Edward.
>> P.S.: You'll need to set up a debugging environment. Take a look at
this:
>> http://bipinkunal.blogspot.cz/2012/05/kgdb-tutorial.html
>> Do you have a second machine with a serial port?
>
> No, I don't. Will a VM suffice?
Yes, if it works for you..
Thanks,
Edward.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: reiser4: discard support
2014-04-30 17:55 reiser4: discard support Edward Shishkin
@ 2014-04-30 18:53 ` Edward Shishkin
0 siblings, 0 replies; 19+ messages in thread
From: Edward Shishkin @ 2014-04-30 18:53 UTC (permalink / raw)
To: ReiserFS Development mailing list, Ivan Shapovalov
On 04/30/2014 07:55 PM, Edward Shishkin wrote:
> On 04/30/2014 05:04 PM, Ivan Shapovalov wrote:
> [...]
> >> There is a ready implementation of rb-trees in the kernel (see
> ./lib/rbtree.c,
> >> ./lib/rbtree_test.c). Could you please take a look at this?
> >
> > I've checked it already. Well, in case of trees:
> > - combined extent insertion+adjacent extent merge complexity is
> O(n*log(n))
>
> Actually, this is:
>
> 1) Add n extents one-by-one to empty tree:
> log(1) + log(2) + ... + log(n-1)
>
> 2) Merge 2 trees of n extents:
BTW, (2) is an excess component, since merging of 2 trees
looks like adding one-by-one extents to a tree, and the final
number of extents to discard is n.
So, when using trees, complexity is (1/2)log(n)+ nlog(n/e).
> log(n) + log(n+1) + ... + log(2n-1)
>
> Total: log(1*2*...*2n-1) = log((2n-1)!)
>
> in accordance with Stirling's formula:
>
> log((2n)!) = log(((2n)^(1/2))*(2n/e)^n) =
> = (1/2)*log(2n) + n*log(2n/e) < n + n*log(n)
>
> Since atoms can be merged in parallel, we have that
> trees work better than lists. Also tree of extents has
> better memory consumption because of merging extents.
>
>
> > against O(n+n*log(n)) in lists;
> > - joining two trees is not less than O(log(n)) against O(1) in lists.
> >
> > So I can't see any benefit in using trees, and join performance is a
> downside of these. Am I missing something?
> >
> >>
> >>>
> [...]
> > Why after reiser4_invalidate_list()? I thought that it should be
> called between reiser4_write_logs(), but before
> reiser4_invalidate_list().
>
> OK, insert before. I think, it doesn't matter, since we don't look
> at this list when issuing discard requests, no? :)
>
> >
> > And why not after all reiser4_invalidate_list() calls?
> >
> >>
> >> Thanks,
> >> Edward.
> >> P.S.: You'll need to set up a debugging environment. Take a look at
> this:
> >> http://bipinkunal.blogspot.cz/2012/05/kgdb-tutorial.html
> >> Do you have a second machine with a serial port?
> >
> > No, I don't. Will a VM suffice?
>
> Yes, if it works for you..
>
> Thanks,
> Edward.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: reiser4: discard support
[not found] ` <53613807.7090605@gmail.com>
@ 2014-05-01 23:51 ` Ivan Shapovalov
[not found] ` <CAJZSrNK4=o9vocDfSM=4W5ZgtqZ6RVpmU66sGCu6HFsdN47OHw@mail.gmail.com>
2014-05-02 11:48 ` Edward Shishkin
0 siblings, 2 replies; 19+ messages in thread
From: Ivan Shapovalov @ 2014-05-01 23:51 UTC (permalink / raw)
To: Edward Shishkin; +Cc: reiserfs-devel
[-- Attachment #1: Type: text/plain, Size: 2924 bytes --]
On Wednesday 30 April 2014 at 19:51:03, Edward Shishkin wrote:
> On 04/30/2014 05:04 PM, Ivan Shapovalov wrote:
> [...]
>
> >> There is a ready implementation of rb-trees in the kernel (see
> >> ./lib/rbtree.c,
> >> ./lib/rbtree_test.c). Could you please take a look at this?
> >
> > I've checked it already. Well, in case of trees:
> > - combined extent insertion+adjacent extent merge complexity is
> > O(n*log(n))
>
> Actually, this is:
>
> 1) Add n extents one-by-one to empty tree:
> log(1) + log(2) + ... + log(n-1)
>
> 2) Merge 2 trees of n extents:
> log(n) + log(n+1) + ... + log(2n-1)
>
> Total: log(1*2*...*2n-1) = log((2n-1)!)
>
> in accordance with Stirling's formula:
>
> log((2n)!) = log(((2n)^(1/2))*(2n/e)^n) =
> = (1/2)*log(2n) + n*log(2n/e) < n + n*log(n)
Hm.
Consider we fill two data structures with N extents each, then merge these two
data structures and prepare the resulting one for traversal (for lists it will
be sorting, for trees that's no-op).
For trees:
1) Adding
2*log((n-1)!)
2) Merging
log((2n-1)!/(n-1)!)
3) Sorting
none
Total:
log((2n-1)!) + log((n-1)!)
For lists:
1) Adding
2N
2) Merging
constant
3) Sorting
2N*log(2N)
Total:
2N+2N*log(2N)
Quick testing in WolframAlpha shows that, if these estimations are correct,
lists tend to have lesser complexity starting with approx. N=160.
>
> Since atoms can be merged in parallel, we have that
> trees work better than lists. Also tree of extents has
> better memory consumption because of merging extents.
>
> > against O(n+n*log(n)) in lists;
> >
> > - joining two trees is not less than O(log(n)) against O(1) in lists.
> >
> > So I can't see any benefit in using trees, and join performance is a
> > downside of these. Am I missing something?
>
> [...]
>
> > Why after reiser4_invalidate_list()? I thought that it should be
> > called between reiser4_write_logs(), but before reiser4_invalidate_list().
>
> OK, insert before. I think, it doesn't matter, since we don't look
> at this list when issuing discard requests, no? :)
I just don't understand what implications does that function have. It seems to
mess with jnodes, and I can imagine that this leads to races with somebody
modifying the bitmaps while issue_discard_requests() tries to read them to
check discard extents... Again, for me, internals of reiser4 still mostly look
like heavy wizardry. :)
Thanks for guidance,
--
Ivan Shapovalov / intelfx /
>
> > And why not after all reiser4_invalidate_list() calls?
> >
> >> Thanks,
> >> Edward.
> >> P.S.: You'll need to set up a debugging environment. Take a look at this:
> >> http://bipinkunal.blogspot.cz/2012/05/kgdb-tutorial.html
> >> Do you have a second machine with a serial port?
> >
> > No, I don't. Will a VM suffice?
>
> Yes, if it works for you..
>
> Thanks,
> Edward.
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 230 bytes --]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: reiser4: discard support
[not found] ` <CAJZSrNK4=o9vocDfSM=4W5ZgtqZ6RVpmU66sGCu6HFsdN47OHw@mail.gmail.com>
@ 2014-05-02 11:06 ` Ivan Shapovalov
0 siblings, 0 replies; 19+ messages in thread
From: Ivan Shapovalov @ 2014-05-02 11:06 UTC (permalink / raw)
To: Daniel Horne; +Cc: Edward Shishkin, Reiserfs mailing list
[-- Attachment #1: Type: text/plain, Size: 841 bytes --]
On Friday 02 May 2014 at 02:28:24, Daniel Horne wrote:
> When you say
>
> > For lists:
> > 1) Adding
> > 2N
> >
> > 2) Merging
> > constant
> >
> > 3) Sorting
> > 2N*log(2N)
> >
> > Total:
> > 2N+2N*log(2N)
>
> For the sorting pass, are you assuming use of quicksort? That would require
> constant-time random access characteristics to achieve that efficiency (in
> other words, transfer it to an array prior to sorting). Otherwise, you'd
> have to add in a multiple of N for a linear scan on each item, making it
> O(2N^2) by my reckoning.
>
>
> (ps, apologies for possible duplicate post - hit reply to Ivan instead of
> the list )
>
> --
> DH
I'm assuming usage of the in-kernel list_sort(), which uses merge sort and
claims to have a complexity of O(n*log(n)).
--
Ivan Shapovalov / intelfx /
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 230 bytes --]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: reiser4: discard support
2014-05-01 23:51 ` Ivan Shapovalov
[not found] ` <CAJZSrNK4=o9vocDfSM=4W5ZgtqZ6RVpmU66sGCu6HFsdN47OHw@mail.gmail.com>
@ 2014-05-02 11:48 ` Edward Shishkin
2014-05-02 12:44 ` Edward Shishkin
2014-05-02 13:36 ` Ivan Shapovalov
1 sibling, 2 replies; 19+ messages in thread
From: Edward Shishkin @ 2014-05-02 11:48 UTC (permalink / raw)
To: Ivan Shapovalov; +Cc: reiserfs-devel
On 05/02/2014 01:51 AM, Ivan Shapovalov wrote:
> On Wednesday 30 April 2014 at 19:51:03, Edward Shishkin wrote:
>> On 04/30/2014 05:04 PM, Ivan Shapovalov wrote:
>> [...]
>>
>>>> There is a ready implementation of rb-trees in the kernel (see
>>>> ./lib/rbtree.c,
>>>> ./lib/rbtree_test.c). Could you please take a look at this?
>>> I've checked it already. Well, in case of trees:
>>> - combined extent insertion+adjacent extent merge complexity is
>>> O(n*log(n))
>> Actually, this is:
>>
>> 1) Add n extents one-by-one to empty tree:
>> log(1) + log(2) + ... + log(n-1)
>>
>> 2) Merge 2 trees of n extents:
>> log(n) + log(n+1) + ... + log(2n-1)
>>
>> Total: log(1*2*...*2n-1) = log((2n-1)!)
>>
>> in accordance with Stirling's formula:
>>
>> log((2n)!) = log(((2n)^(1/2))*(2n/e)^n) =
>> = (1/2)*log(2n) + n*log(2n/e) < n + n*log(n)
> Hm.
>
> Consider we fill two data structures with N extents each, then merge these two
> data structures and prepare the resulting one for traversal (for lists it will
> be sorting, for trees that's no-op).
>
> For trees:
> 1) Adding
> 2*log((n-1)!)
Why "2*" ?
I suppose they are filled in parallel...
Total complexity (when everything is serialized) is not interesting ;)
>
> 2) Merging
> log((2n-1)!/(n-1)!)
>
> 3) Sorting
> none
>
> Total:
> log((2n-1)!) + log((n-1)!)
>
> For lists:
> 1) Adding
> 2N
>
> 2) Merging
> constant
>
> 3) Sorting
> 2N*log(2N)
>
> Total:
> 2N+2N*log(2N)
>
> Quick testing in WolframAlpha shows that, if these estimations are correct,
> lists tend to have lesser complexity starting with approx. N=160.
>
>> Since atoms can be merged in parallel, we have that
>> trees work better than lists. Also tree of extents has
>> better memory consumption because of merging extents.
>>
>>> against O(n+n*log(n)) in lists;
>>>
>>> - joining two trees is not less than O(log(n)) against O(1) in lists.
>>>
>>> So I can't see any benefit in using trees, and join performance is a
>>> downside of these. Am I missing something?
>> [...]
>>
>>> Why after reiser4_invalidate_list()? I thought that it should be
>>> called between reiser4_write_logs(), but before reiser4_invalidate_list().
>> OK, insert before. I think, it doesn't matter, since we don't look
>> at this list when issuing discard requests, no? :)
> I just don't understand what implications does that function have. It seems to
> mess with jnodes, and I can imagine that this leads to races with somebody
> modifying the bitmaps while issue_discard_requests() tries to read them to
> check discard extents...
"somebody modifying the bitmaps" will wait for commit_mutex.
Have you found where the bitmaps are modified? This is
pre_commit_hook_bitmap().
> Again, for me, internals of reiser4 still mostly look
> like heavy wizardry. :)
This is because the technical level of reiser4 is very high.
Many very strong scientists contributed to reiser4.
I also don't understand in details how some reiser4 subsystems work.
Once I need details, I open the respective code and read it with the
comments.
The most important thing is general understanding of how it works.
Nobody is able to grok this only by reading reiser4 code before
sleeping. You need to implement at least one feature by your hands.
Yes, it is not easy, and not for everyone. Like every high-level research
in the science...
Edward.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: reiser4: discard support
2014-05-02 11:48 ` Edward Shishkin
@ 2014-05-02 12:44 ` Edward Shishkin
2014-05-02 13:47 ` Ivan Shapovalov
2014-05-02 13:36 ` Ivan Shapovalov
1 sibling, 1 reply; 19+ messages in thread
From: Edward Shishkin @ 2014-05-02 12:44 UTC (permalink / raw)
To: Ivan Shapovalov; +Cc: reiserfs-devel
On 05/02/2014 01:48 PM, Edward Shishkin wrote:
> On 05/02/2014 01:51 AM, Ivan Shapovalov wrote:
>> On Wednesday 30 April 2014 at 19:51:03, Edward Shishkin wrote:
>>> On 04/30/2014 05:04 PM, Ivan Shapovalov wrote:
>>> [...]
>>>
>>>>> There is a ready implementation of rb-trees in the kernel (see
>>>>> ./lib/rbtree.c,
>>>>> ./lib/rbtree_test.c). Could you please take a look at this?
>>>> I've checked it already. Well, in case of trees:
>>>> - combined extent insertion+adjacent extent merge complexity is
>>>> O(n*log(n))
>>> Actually, this is:
>>>
>>> 1) Add n extents one-by-one to empty tree:
>>> log(1) + log(2) + ... + log(n-1)
>>>
>>> 2) Merge 2 trees of n extents:
>>> log(n) + log(n+1) + ... + log(2n-1)
>>>
>>> Total: log(1*2*...*2n-1) = log((2n-1)!)
>>>
>>> in accordance with Stirling's formula:
>>>
>>> log((2n)!) = log(((2n)^(1/2))*(2n/e)^n) =
>>> = (1/2)*log(2n) + n*log(2n/e) < n + n*log(n)
>> Hm.
>>
>> Consider we fill two data structures with N extents each, then merge
>> these two
>> data structures and prepare the resulting one for traversal (for
>> lists it will
>> be sorting, for trees that's no-op).
>>
>> For trees:
>> 1) Adding
>> 2*log((n-1)!)
>
> Why "2*" ?
> I suppose they are filled in parallel...
> Total complexity (when everything is serialized) is not interesting ;)
If lists look more simple for you, then do it with lists.
Eventually, we'll replace everything with fast-merge trees ;)
>
>>
>> 2) Merging
>> log((2n-1)!/(n-1)!)
>>
>> 3) Sorting
>> none
>>
>> Total:
>> log((2n-1)!) + log((n-1)!)
>>
>> For lists:
>> 1) Adding
>> 2N
>>
>> 2) Merging
>> constant
>>
>> 3) Sorting
>> 2N*log(2N)
>>
>> Total:
>> 2N+2N*log(2N)
>>
>> Quick testing in WolframAlpha shows that, if these estimations are
>> correct,
>> lists tend to have lesser complexity starting with approx. N=160.
>>
>>> Since atoms can be merged in parallel, we have that
>>> trees work better than lists. Also tree of extents has
>>> better memory consumption because of merging extents.
>>>
>>>> against O(n+n*log(n)) in lists;
>>>>
>>>> - joining two trees is not less than O(log(n)) against O(1) in lists.
>>>>
>>>> So I can't see any benefit in using trees, and join performance is a
>>>> downside of these. Am I missing something?
>>> [...]
>>>
>>>> Why after reiser4_invalidate_list()? I thought that it should be
>>>> called between reiser4_write_logs(), but before
>>>> reiser4_invalidate_list().
>>> OK, insert before. I think, it doesn't matter, since we don't look
>>> at this list when issuing discard requests, no? :)
>> I just don't understand what implications does that function have. It
>> seems to
>> mess with jnodes, and I can imagine that this leads to races with
>> somebody
>> modifying the bitmaps while issue_discard_requests() tries to read
>> them to
>> check discard extents...
>
> "somebody modifying the bitmaps" will wait for commit_mutex.
> Have you found where the bitmaps are modified? This is
> pre_commit_hook_bitmap().
>
>> Again, for me, internals of reiser4 still mostly look
>> like heavy wizardry. :)
>
> This is because the technical level of reiser4 is very high.
> Many very strong scientists contributed to reiser4.
>
> I also don't understand in details how some reiser4 subsystems work.
> Once I need details, I open the respective code and read it with the
> comments.
> The most important thing is general understanding of how it works.
> Nobody is able to grok this only by reading reiser4 code before
> sleeping. You need to implement at least one feature by your hands.
> Yes, it is not easy, and not for everyone. Like every high-level research
> in the science...
>
> Edward.
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: reiser4: discard support
2014-05-02 11:48 ` Edward Shishkin
2014-05-02 12:44 ` Edward Shishkin
@ 2014-05-02 13:36 ` Ivan Shapovalov
2014-05-02 14:07 ` Edward Shishkin
1 sibling, 1 reply; 19+ messages in thread
From: Ivan Shapovalov @ 2014-05-02 13:36 UTC (permalink / raw)
To: Edward Shishkin; +Cc: reiserfs-devel
[-- Attachment #1: Type: text/plain, Size: 4149 bytes --]
On Friday 02 May 2014 at 13:48:28, Edward Shishkin wrote:
> On 05/02/2014 01:51 AM, Ivan Shapovalov wrote:
> > On Wednesday 30 April 2014 at 19:51:03, Edward Shishkin wrote:
> >> On 04/30/2014 05:04 PM, Ivan Shapovalov wrote:
> >> [...]
> >>
> >>>> There is a ready implementation of rb-trees in the kernel (see
> >>>> ./lib/rbtree.c,
> >>>> ./lib/rbtree_test.c). Could you please take a look at this?
> >>>
> >>> I've checked it already. Well, in case of trees:
> >>> - combined extent insertion+adjacent extent merge complexity is
> >>> O(n*log(n))
> >>
> >> Actually, this is:
> >>
> >> 1) Add n extents one-by-one to empty tree:
> >> log(1) + log(2) + ... + log(n-1)
> >>
> >> 2) Merge 2 trees of n extents:
> >> log(n) + log(n+1) + ... + log(2n-1)
> >>
> >> Total: log(1*2*...*2n-1) = log((2n-1)!)
> >>
> >> in accordance with Stirling's formula:
> >>
> >> log((2n)!) = log(((2n)^(1/2))*(2n/e)^n) =
> >> = (1/2)*log(2n) + n*log(2n/e) < n + n*log(n)
> >
> > Hm.
> >
> > Consider we fill two data structures with N extents each, then merge these
> > two data structures and prepare the resulting one for traversal (for
> > lists it will be sorting, for trees that's no-op).
> >
> > For trees:
> > 1) Adding
> > 2*log((n-1)!)
>
> Why "2*" ?
> I suppose they are filled in parallel...
> Total complexity (when everything is serialized) is not interesting ;)
Oh why? Everything is still serialized by the physical CPU, for that matter.
So less complexity -> less CPU time...
>
> > 2) Merging
> > log((2n-1)!/(n-1)!)
> >
> > 3) Sorting
> > none
> >
> > Total:
> > log((2n-1)!) + log((n-1)!)
> >
> > For lists:
> > 1) Adding
> > 2N
> >
> > 2) Merging
> > constant
> >
> > 3) Sorting
> > 2N*log(2N)
> >
> > Total:
> > 2N+2N*log(2N)
> >
> > Quick testing in WolframAlpha shows that, if these estimations are
> > correct,
> > lists tend to have lesser complexity starting with approx. N=160.
> >
> >> Since atoms can be merged in parallel, we have that
> >> trees work better than lists. Also tree of extents has
> >> better memory consumption because of merging extents.
> >>
> >>> against O(n+n*log(n)) in lists;
> >>>
> >>> - joining two trees is not less than O(log(n)) against O(1) in lists.
> >>>
> >>> So I can't see any benefit in using trees, and join performance is a
> >>> downside of these. Am I missing something?
> >>
> >> [...]
> >>
> >>> Why after reiser4_invalidate_list()? I thought that it should be
> >>> called between reiser4_write_logs(), but before
> >>> reiser4_invalidate_list().
> >>
> >> OK, insert before. I think, it doesn't matter, since we don't look
> >> at this list when issuing discard requests, no? :)
> >
> > I just don't understand what implications does that function have. It
> > seems to mess with jnodes, and I can imagine that this leads to races
> > with somebody modifying the bitmaps while issue_discard_requests() tries
> > to read them to check discard extents...
>
> "somebody modifying the bitmaps" will wait for commit_mutex.
> Have you found where the bitmaps are modified? This is
> pre_commit_hook_bitmap().
Yes, but I was not sure this is the only place where the bitmaps are
touched...
>
> > Again, for me, internals of reiser4 still mostly look
> >
> > like heavy wizardry. :)
>
> This is because the technical level of reiser4 is very high.
> Many very strong scientists contributed to reiser4.
>
> I also don't understand in details how some reiser4 subsystems work.
> Once I need details, I open the respective code and read it with the
> comments.
> The most important thing is general understanding of how it works.
> Nobody is able to grok this only by reading reiser4 code before
> sleeping. You need to implement at least one feature by your hands.
> Yes, it is not easy, and not for everyone. Like every high-level research
> in the science...
>
> Edward.
...and nobody to this point had written an explanation of that, or some piece
of documentation except the design document...
--
Ivan Shapovalov / intelfx /
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 230 bytes --]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: reiser4: discard support
2014-05-02 12:44 ` Edward Shishkin
@ 2014-05-02 13:47 ` Ivan Shapovalov
0 siblings, 0 replies; 19+ messages in thread
From: Ivan Shapovalov @ 2014-05-02 13:47 UTC (permalink / raw)
To: Edward Shishkin; +Cc: reiserfs-devel
[-- Attachment #1: Type: text/plain, Size: 4328 bytes --]
On Friday 02 May 2014 at 14:44:52, Edward Shishkin wrote:
> On 05/02/2014 01:48 PM, Edward Shishkin wrote:
> > On 05/02/2014 01:51 AM, Ivan Shapovalov wrote:
> >> On Wednesday 30 April 2014 at 19:51:03, Edward Shishkin wrote:
> >>> On 04/30/2014 05:04 PM, Ivan Shapovalov wrote:
> >>> [...]
> >>>
> >>>>> There is a ready implementation of rb-trees in the kernel (see
> >>>>> ./lib/rbtree.c,
> >>>>> ./lib/rbtree_test.c). Could you please take a look at this?
> >>>>
> >>>> I've checked it already. Well, in case of trees:
> >>>> - combined extent insertion+adjacent extent merge complexity is
> >>>> O(n*log(n))
> >>>
> >>> Actually, this is:
> >>>
> >>> 1) Add n extents one-by-one to empty tree:
> >>> log(1) + log(2) + ... + log(n-1)
> >>>
> >>> 2) Merge 2 trees of n extents:
> >>> log(n) + log(n+1) + ... + log(2n-1)
> >>>
> >>> Total: log(1*2*...*2n-1) = log((2n-1)!)
> >>>
> >>> in accordance with Stirling's formula:
> >>>
> >>> log((2n)!) = log(((2n)^(1/2))*(2n/e)^n) =
> >>> = (1/2)*log(2n) + n*log(2n/e) < n + n*log(n)
> >>
> >> Hm.
> >>
> >> Consider we fill two data structures with N extents each, then merge
> >> these two
> >> data structures and prepare the resulting one for traversal (for
> >> lists it will
> >> be sorting, for trees that's no-op).
> >>
> >> For trees:
> >> 1) Adding
> >> 2*log((n-1)!)
> >
> > Why "2*" ?
> > I suppose they are filled in parallel...
> > Total complexity (when everything is serialized) is not interesting ;)
>
> If lists look more simple for you, then do it with lists.
> Eventually, we'll replace everything with fast-merge trees ;)
The trees with fingers are most probably the way to go, but I would like to
defer implementing of yet another in-kernel data structure at least for some
time. :)
--
Ivan Shapovalov / intelfx /
>
> >> 2) Merging
> >> log((2n-1)!/(n-1)!)
> >>
> >> 3) Sorting
> >> none
> >>
> >> Total:
> >> log((2n-1)!) + log((n-1)!)
> >>
> >> For lists:
> >> 1) Adding
> >> 2N
> >>
> >> 2) Merging
> >> constant
> >>
> >> 3) Sorting
> >> 2N*log(2N)
> >>
> >> Total:
> >> 2N+2N*log(2N)
> >>
> >> Quick testing in WolframAlpha shows that, if these estimations are
> >> correct,
> >> lists tend to have lesser complexity starting with approx. N=160.
> >>
> >>> Since atoms can be merged in parallel, we have that
> >>> trees work better than lists. Also tree of extents has
> >>> better memory consumption because of merging extents.
> >>>
> >>>> against O(n+n*log(n)) in lists;
> >>>>
> >>>> - joining two trees is not less than O(log(n)) against O(1) in lists.
> >>>>
> >>>> So I can't see any benefit in using trees, and join performance is a
> >>>> downside of these. Am I missing something?
> >>>
> >>> [...]
> >>>
> >>>> Why after reiser4_invalidate_list()? I thought that it should be
> >>>> called between reiser4_write_logs(), but before
> >>>> reiser4_invalidate_list().
> >>>
> >>> OK, insert before. I think, it doesn't matter, since we don't look
> >>> at this list when issuing discard requests, no? :)
> >>
> >> I just don't understand what implications does that function have. It
> >> seems to
> >> mess with jnodes, and I can imagine that this leads to races with
> >> somebody
> >> modifying the bitmaps while issue_discard_requests() tries to read
> >> them to
> >> check discard extents...
> >
> > "somebody modifying the bitmaps" will wait for commit_mutex.
> > Have you found where the bitmaps are modified? This is
> > pre_commit_hook_bitmap().
> >
> >> Again, for me, internals of reiser4 still mostly look
> >>
> >> like heavy wizardry. :)
> >
> > This is because the technical level of reiser4 is very high.
> > Many very strong scientists contributed to reiser4.
> >
> > I also don't understand in details how some reiser4 subsystems work.
> > Once I need details, I open the respective code and read it with the
> > comments.
> > The most important thing is general understanding of how it works.
> > Nobody is able to grok this only by reading reiser4 code before
> > sleeping. You need to implement at least one feature by your hands.
> > Yes, it is not easy, and not for everyone. Like every high-level research
> > in the science...
> >
> > Edward.
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 230 bytes --]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: reiser4: discard support
2014-05-02 13:36 ` Ivan Shapovalov
@ 2014-05-02 14:07 ` Edward Shishkin
2014-05-02 14:32 ` Ivan Shapovalov
0 siblings, 1 reply; 19+ messages in thread
From: Edward Shishkin @ 2014-05-02 14:07 UTC (permalink / raw)
To: Ivan Shapovalov; +Cc: reiserfs-devel
On 05/02/2014 03:36 PM, Ivan Shapovalov wrote:
> On Friday 02 May 2014 at 13:48:28, Edward Shishkin wrote:
>> On 05/02/2014 01:51 AM, Ivan Shapovalov wrote:
>>> On Wednesday 30 April 2014 at 19:51:03, Edward Shishkin wrote:
>>>> On 04/30/2014 05:04 PM, Ivan Shapovalov wrote:
>>>> [...]
>>>>
>>>>>> There is a ready implementation of rb-trees in the kernel (see
>>>>>> ./lib/rbtree.c,
>>>>>> ./lib/rbtree_test.c). Could you please take a look at this?
>>>>> I've checked it already. Well, in case of trees:
>>>>> - combined extent insertion+adjacent extent merge complexity is
>>>>> O(n*log(n))
>>>> Actually, this is:
>>>>
>>>> 1) Add n extents one-by-one to empty tree:
>>>> log(1) + log(2) + ... + log(n-1)
>>>>
>>>> 2) Merge 2 trees of n extents:
>>>> log(n) + log(n+1) + ... + log(2n-1)
>>>>
>>>> Total: log(1*2*...*2n-1) = log((2n-1)!)
>>>>
>>>> in accordance with Stirling's formula:
>>>>
>>>> log((2n)!) = log(((2n)^(1/2))*(2n/e)^n) =
>>>> = (1/2)*log(2n) + n*log(2n/e) < n + n*log(n)
>>> Hm.
>>>
>>> Consider we fill two data structures with N extents each, then merge these
>>> two data structures and prepare the resulting one for traversal (for
>>> lists it will be sorting, for trees that's no-op).
>>>
>>> For trees:
>>> 1) Adding
>>> 2*log((n-1)!)
>> Why "2*" ?
>> I suppose they are filled in parallel...
>> Total complexity (when everything is serialized) is not interesting ;)
> Oh why? Everything is still serialized by the physical CPU, for that matter.
> So less complexity -> less CPU time...
We can perfectly populate different "discard trees" in parallel on
different CPUs.
As to sorting the list: I don't know how to perform it in parallel :)
Default assumptions, that everything is serialized, usually lead to various
bad-scalable solutions...
>
>>> 2) Merging
>>> log((2n-1)!/(n-1)!)
>>>
>>> 3) Sorting
>>> none
>>>
>>> Total:
>>> log((2n-1)!) + log((n-1)!)
>>>
>>> For lists:
>>> 1) Adding
>>> 2N
>>>
>>> 2) Merging
>>> constant
>>>
>>> 3) Sorting
>>> 2N*log(2N)
>>>
>>> Total:
>>> 2N+2N*log(2N)
>>>
>>> Quick testing in WolframAlpha shows that, if these estimations are
>>> correct,
>>> lists tend to have lesser complexity starting with approx. N=160.
>>>
>>>> Since atoms can be merged in parallel, we have that
>>>> trees work better than lists. Also tree of extents has
>>>> better memory consumption because of merging extents.
>>>>
>>>>> against O(n+n*log(n)) in lists;
>>>>>
>>>>> - joining two trees is not less than O(log(n)) against O(1) in lists.
>>>>>
>>>>> So I can't see any benefit in using trees, and join performance is a
>>>>> downside of these. Am I missing something?
>>>> [...]
>>>>
>>>>> Why after reiser4_invalidate_list()? I thought that it should be
>>>>> called between reiser4_write_logs(), but before
>>>>> reiser4_invalidate_list().
>>>> OK, insert before. I think, it doesn't matter, since we don't look
>>>> at this list when issuing discard requests, no? :)
>>> I just don't understand what implications does that function have. It
>>> seems to mess with jnodes, and I can imagine that this leads to races
>>> with somebody modifying the bitmaps while issue_discard_requests() tries
>>> to read them to check discard extents...
>> "somebody modifying the bitmaps" will wait for commit_mutex.
>> Have you found where the bitmaps are modified? This is
>> pre_commit_hook_bitmap().
> Yes, but I was not sure this is the only place where the bitmaps are
> touched...
>
>>> Again, for me, internals of reiser4 still mostly look
>>>
>>> like heavy wizardry. :)
>> This is because the technical level of reiser4 is very high.
>> Many very strong scientists contributed to reiser4.
>>
>> I also don't understand in details how some reiser4 subsystems work.
>> Once I need details, I open the respective code and read it with the
>> comments.
>> The most important thing is general understanding of how it works.
>> Nobody is able to grok this only by reading reiser4 code before
>> sleeping. You need to implement at least one feature by your hands.
>> Yes, it is not easy, and not for everyone. Like every high-level research
>> in the science...
>>
>> Edward.
> ...and nobody to this point had written an explanation of that, or some piece
> of documentation except the design document...
what explanation do you mean?
Edward.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: reiser4: discard support
2014-05-02 14:07 ` Edward Shishkin
@ 2014-05-02 14:32 ` Ivan Shapovalov
2014-05-02 18:10 ` Edward Shishkin
0 siblings, 1 reply; 19+ messages in thread
From: Ivan Shapovalov @ 2014-05-02 14:32 UTC (permalink / raw)
To: reiserfs-devel; +Cc: Edward Shishkin
[-- Attachment #1: Type: text/plain, Size: 5347 bytes --]
On Friday 02 May 2014 at 16:07:21, Edward Shishkin wrote:
> On 05/02/2014 03:36 PM, Ivan Shapovalov wrote:
> > On Friday 02 May 2014 at 13:48:28, Edward Shishkin wrote:
> >> On 05/02/2014 01:51 AM, Ivan Shapovalov wrote:
> >>> On Wednesday 30 April 2014 at 19:51:03, Edward Shishkin wrote:
> >>>> On 04/30/2014 05:04 PM, Ivan Shapovalov wrote:
> >>>> [...]
> >>>>
> >>>>>> There is a ready implementation of rb-trees in the kernel (see
> >>>>>> ./lib/rbtree.c,
> >>>>>> ./lib/rbtree_test.c). Could you please take a look at this?
> >>>>>
> >>>>> I've checked it already. Well, in case of trees:
> >>>>> - combined extent insertion+adjacent extent merge complexity is
> >>>>> O(n*log(n))
> >>>>
> >>>> Actually, this is:
> >>>>
> >>>> 1) Add n extents one-by-one to empty tree:
> >>>> log(1) + log(2) + ... + log(n-1)
> >>>>
> >>>> 2) Merge 2 trees of n extents:
> >>>> log(n) + log(n+1) + ... + log(2n-1)
> >>>>
> >>>> Total: log(1*2*...*2n-1) = log((2n-1)!)
> >>>>
> >>>> in accordance with Stirling's formula:
> >>>>
> >>>> log((2n)!) = log(((2n)^(1/2))*(2n/e)^n) =
> >>>> = (1/2)*log(2n) + n*log(2n/e) < n + n*log(n)
> >>>
> >>> Hm.
> >>>
> >>> Consider we fill two data structures with N extents each, then merge
> >>> these
> >>> two data structures and prepare the resulting one for traversal (for
> >>> lists it will be sorting, for trees that's no-op).
> >>>
> >>> For trees:
> >>> 1) Adding
> >>> 2*log((n-1)!)
> >>
> >> Why "2*" ?
> >> I suppose they are filled in parallel...
> >> Total complexity (when everything is serialized) is not interesting ;)
> >
> > Oh why? Everything is still serialized by the physical CPU, for that
> > matter. So less complexity -> less CPU time...
>
> We can perfectly populate different "discard trees" in parallel on
> different CPUs.
> As to sorting the list: I don't know how to perform it in parallel :)
> Default assumptions, that everything is serialized, usually lead to various
> bad-scalable solutions...
Ah, now I understand what do you mean. If that's about doing less work under
the tmgr mutex, then yes, trees are better than lists.
BTW, I've seen that reiser4 releases atom lock before allocating another node
for a blocknr_set, and that leads to a "do { } while (ret == -E_REPEAT)" loop
around the blocknr_set_add_extent() calls. Is this the preferred way of
attaching a dynamic data structure to an atom?
>
> >>> 2) Merging
> >>> log((2n-1)!/(n-1)!)
> >>>
> >>> 3) Sorting
> >>> none
> >>>
> >>> Total:
> >>> log((2n-1)!) + log((n-1)!)
> >>>
> >>> For lists:
> >>> 1) Adding
> >>> 2N
> >>>
> >>> 2) Merging
> >>> constant
> >>>
> >>> 3) Sorting
> >>> 2N*log(2N)
> >>>
> >>> Total:
> >>> 2N+2N*log(2N)
> >>>
> >>> Quick testing in WolframAlpha shows that, if these estimations are
> >>> correct,
> >>> lists tend to have lesser complexity starting with approx. N=160.
> >>>
> >>>> Since atoms can be merged in parallel, we have that
> >>>> trees work better than lists. Also tree of extents has
> >>>> better memory consumption because of merging extents.
> >>>>
> >>>>> against O(n+n*log(n)) in lists;
> >>>>>
> >>>>> - joining two trees is not less than O(log(n)) against O(1) in lists.
> >>>>>
> >>>>> So I can't see any benefit in using trees, and join performance is a
> >>>>> downside of these. Am I missing something?
> >>>>
> >>>> [...]
> >>>>
> >>>>> Why after reiser4_invalidate_list()? I thought that it should be
> >>>>> called between reiser4_write_logs(), but before
> >>>>> reiser4_invalidate_list().
> >>>>
> >>>> OK, insert before. I think, it doesn't matter, since we don't look
> >>>> at this list when issuing discard requests, no? :)
> >>>
> >>> I just don't understand what implications does that function have. It
> >>> seems to mess with jnodes, and I can imagine that this leads to races
> >>> with somebody modifying the bitmaps while issue_discard_requests() tries
> >>> to read them to check discard extents...
> >>
> >> "somebody modifying the bitmaps" will wait for commit_mutex.
> >> Have you found where the bitmaps are modified? This is
> >> pre_commit_hook_bitmap().
> >
> > Yes, but I was not sure this is the only place where the bitmaps are
> > touched...
> >
> >>> Again, for me, internals of reiser4 still mostly look
> >>>
> >>> like heavy wizardry. :)
> >>
> >> This is because the technical level of reiser4 is very high.
> >> Many very strong scientists contributed to reiser4.
> >>
> >> I also don't understand in details how some reiser4 subsystems work.
> >> Once I need details, I open the respective code and read it with the
> >> comments.
> >> The most important thing is general understanding of how it works.
> >> Nobody is able to grok this only by reading reiser4 code before
> >> sleeping. You need to implement at least one feature by your hands.
> >> Yes, it is not easy, and not for everyone. Like every high-level research
> >> in the science...
> >>
> >> Edward.
> >
> > ...and nobody to this point had written an explanation of that, or some
> > piece of documentation except the design document...
>
> what explanation do you mean?
>
> Edward.
A general explanation of how does it work :)
--
Ivan Shapovalov / intelfx /
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 230 bytes --]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: reiser4: discard support
2014-05-02 14:32 ` Ivan Shapovalov
@ 2014-05-02 18:10 ` Edward Shishkin
2014-05-03 18:48 ` Ivan Shapovalov
0 siblings, 1 reply; 19+ messages in thread
From: Edward Shishkin @ 2014-05-02 18:10 UTC (permalink / raw)
To: Ivan Shapovalov; +Cc: reiserfs-devel
On 05/02/2014 04:32 PM, Ivan Shapovalov wrote:
> On Friday 02 May 2014 at 16:07:21, Edward Shishkin wrote:
>> On 05/02/2014 03:36 PM, Ivan Shapovalov wrote:
>>> On Friday 02 May 2014 at 13:48:28, Edward Shishkin wrote:
>>>> On 05/02/2014 01:51 AM, Ivan Shapovalov wrote:
>>>>> On Wednesday 30 April 2014 at 19:51:03, Edward Shishkin wrote:
>>>>>> On 04/30/2014 05:04 PM, Ivan Shapovalov wrote:
>>>>>> [...]
>>>>>>
>>>>>>>> There is a ready implementation of rb-trees in the kernel (see
>>>>>>>> ./lib/rbtree.c,
>>>>>>>> ./lib/rbtree_test.c). Could you please take a look at this?
>>>>>>> I've checked it already. Well, in case of trees:
>>>>>>> - combined extent insertion+adjacent extent merge complexity is
>>>>>>> O(n*log(n))
>>>>>> Actually, this is:
>>>>>>
>>>>>> 1) Add n extents one-by-one to empty tree:
>>>>>> log(1) + log(2) + ... + log(n-1)
>>>>>>
>>>>>> 2) Merge 2 trees of n extents:
>>>>>> log(n) + log(n+1) + ... + log(2n-1)
>>>>>>
>>>>>> Total: log(1*2*...*2n-1) = log((2n-1)!)
>>>>>>
>>>>>> in accordance with Stirling's formula:
>>>>>>
>>>>>> log((2n)!) = log(((2n)^(1/2))*(2n/e)^n) =
>>>>>> = (1/2)*log(2n) + n*log(2n/e) < n + n*log(n)
>>>>> Hm.
>>>>>
>>>>> Consider we fill two data structures with N extents each, then merge
>>>>> these
>>>>> two data structures and prepare the resulting one for traversal (for
>>>>> lists it will be sorting, for trees that's no-op).
>>>>>
>>>>> For trees:
>>>>> 1) Adding
>>>>> 2*log((n-1)!)
>>>> Why "2*" ?
>>>> I suppose they are filled in parallel...
>>>> Total complexity (when everything is serialized) is not interesting ;)
>>> Oh why? Everything is still serialized by the physical CPU, for that
>>> matter. So less complexity -> less CPU time...
>> We can perfectly populate different "discard trees" in parallel on
>> different CPUs.
>> As to sorting the list: I don't know how to perform it in parallel :)
>> Default assumptions, that everything is serialized, usually lead to various
>> bad-scalable solutions...
> Ah, now I understand what do you mean. If that's about doing less work under
> the tmgr mutex,
No. This is about minimizing real time.
We don't know about mutexes. We want to decide, what is preferable:
populating trees, or sorting the list. There is a chance that the first will
be faster in systems with many CPUs, so I suggest to use trees.
That's all!
> then yes, trees are better than lists.
>
> BTW, I've seen that reiser4 releases atom lock before allocating another node
> for a blocknr_set, and that leads to a "do { } while (ret == -E_REPEAT)" loop
> around the blocknr_set_add_extent() calls. Is this the preferred way of
> attaching a dynamic data structure to an atom?
TBH, I have never looked at the deallocation paths in reiser4: everything
worked fine there.. BTW, why not to use atom's delete_set to discard things?
Could you please take a look?
>>>>> 2) Merging
>>>>> log((2n-1)!/(n-1)!)
>>>>>
>>>>> 3) Sorting
>>>>> none
>>>>>
>>>>> Total:
>>>>> log((2n-1)!) + log((n-1)!)
>>>>>
>>>>> For lists:
>>>>> 1) Adding
>>>>> 2N
>>>>>
>>>>> 2) Merging
>>>>> constant
>>>>>
>>>>> 3) Sorting
>>>>> 2N*log(2N)
>>>>>
>>>>> Total:
>>>>> 2N+2N*log(2N)
>>>>>
>>>>> Quick testing in WolframAlpha shows that, if these estimations are
>>>>> correct,
>>>>> lists tend to have lesser complexity starting with approx. N=160.
>>>>>
>>>>>> Since atoms can be merged in parallel, we have that
>>>>>> trees work better than lists. Also tree of extents has
>>>>>> better memory consumption because of merging extents.
>>>>>>
>>>>>>> against O(n+n*log(n)) in lists;
>>>>>>>
>>>>>>> - joining two trees is not less than O(log(n)) against O(1) in lists.
>>>>>>>
>>>>>>> So I can't see any benefit in using trees, and join performance is a
>>>>>>> downside of these. Am I missing something?
>>>>>> [...]
>>>>>>
>>>>>>> Why after reiser4_invalidate_list()? I thought that it should be
>>>>>>> called between reiser4_write_logs(), but before
>>>>>>> reiser4_invalidate_list().
>>>>>> OK, insert before. I think, it doesn't matter, since we don't look
>>>>>> at this list when issuing discard requests, no? :)
>>>>> I just don't understand what implications does that function have. It
>>>>> seems to mess with jnodes, and I can imagine that this leads to races
>>>>> with somebody modifying the bitmaps while issue_discard_requests() tries
>>>>> to read them to check discard extents...
>>>> "somebody modifying the bitmaps" will wait for commit_mutex.
>>>> Have you found where the bitmaps are modified? This is
>>>> pre_commit_hook_bitmap().
>>> Yes, but I was not sure this is the only place where the bitmaps are
>>> touched...
>>>
>>>>> Again, for me, internals of reiser4 still mostly look
>>>>>
>>>>> like heavy wizardry. :)
>>>> This is because the technical level of reiser4 is very high.
>>>> Many very strong scientists contributed to reiser4.
>>>>
>>>> I also don't understand in details how some reiser4 subsystems work.
>>>> Once I need details, I open the respective code and read it with the
>>>> comments.
>>>> The most important thing is general understanding of how it works.
>>>> Nobody is able to grok this only by reading reiser4 code before
>>>> sleeping. You need to implement at least one feature by your hands.
>>>> Yes, it is not easy, and not for everyone. Like every high-level research
>>>> in the science...
>>>>
>>>> Edward.
>>> ...and nobody to this point had written an explanation of that, or some
>>> piece of documentation except the design document...
>> what explanation do you mean?
>>
>> Edward.
> A general explanation of how does it work :)
I think this is because of historical reasons.
Such good explanation costs a lot of time and efforts.
Namesys was a small company, we couldn't afford to have a technical writer.
Hans insisted on good comments in the source code..
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: reiser4: discard support
2014-05-02 18:10 ` Edward Shishkin
@ 2014-05-03 18:48 ` Ivan Shapovalov
2014-05-03 20:21 ` Edward Shishkin
0 siblings, 1 reply; 19+ messages in thread
From: Ivan Shapovalov @ 2014-05-03 18:48 UTC (permalink / raw)
To: Edward Shishkin; +Cc: reiserfs-devel
[-- Attachment #1: Type: text/plain, Size: 2760 bytes --]
On Friday 02 May 2014 at 20:10:16, Edward Shishkin wrote:
> On 05/02/2014 04:32 PM, Ivan Shapovalov wrote:
> > On Friday 02 May 2014 at 16:07:21, Edward Shishkin wrote:
> >> On 05/02/2014 03:36 PM, Ivan Shapovalov wrote:
> >>> On Friday 02 May 2014 at 13:48:28, Edward Shishkin wrote:
> >>> [...]
> >>
> >> We can perfectly populate different "discard trees" in parallel on
> >> different CPUs.
> >> As to sorting the list: I don't know how to perform it in parallel :)
> >> Default assumptions, that everything is serialized, usually lead to
> >> various
> >> bad-scalable solutions...
> >
> > Ah, now I understand what do you mean. If that's about doing less work
> > under the tmgr mutex,
>
> No. This is about minimizing real time.
> We don't know about mutexes. We want to decide, what is preferable:
> populating trees, or sorting the list. There is a chance that the first will
> be faster in systems with many CPUs, so I suggest to use trees.
> That's all!
OK.. well, I've started with lists anyway. The data structure is (of course)
under an abstraction layer, so we can change it.
>
> > then yes, trees are better than lists.
> >
> > BTW, I've seen that reiser4 releases atom lock before allocating another
> > node for a blocknr_set, and that leads to a "do { } while (ret ==
> > -E_REPEAT)" loop around the blocknr_set_add_extent() calls. Is this the
> > preferred way of attaching a dynamic data structure to an atom?
>
> TBH, I have never looked at the deallocation paths in reiser4: everything
> worked fine there.. BTW, why not to use atom's delete_set to discard things?
> Could you please take a look?
Blocks used for the journal (wander.c) are deallocated without BA_DEFER set
and thus they never hit delete_set. However, we want to discard these.
>
> >>> [...]
> >>
> >> what explanation do you mean?
> >>
> >> Edward.
> >
> > A general explanation of how does it work :)
>
> I think this is because of historical reasons.
> Such good explanation costs a lot of time and efforts.
> Namesys was a small company, we couldn't afford to have a technical writer.
> Hans insisted on good comments in the source code..
But there aren't much comments there. ;)
I'm now coding the last part - bitmap querying - and I've got a question: how
do we pass reiser4_block_nr parameters?
Many places use pointers, but I do not really understand why (to optimize for
stack frame size on 32-bit archs?) and there is a place
(reiser4_dealloc_blocks_bitmap()) where this convention is apparently broken.
And this is tedious, because public interfaces seem to pass start+len, but in
some places it's more convenient to use start+end...
--
Ivan Shapovalov / intelfx /
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 230 bytes --]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: reiser4: discard support
2014-05-03 18:48 ` Ivan Shapovalov
@ 2014-05-03 20:21 ` Edward Shishkin
2014-05-03 20:32 ` Ivan Shapovalov
0 siblings, 1 reply; 19+ messages in thread
From: Edward Shishkin @ 2014-05-03 20:21 UTC (permalink / raw)
To: Ivan Shapovalov; +Cc: reiserfs-devel
On 05/03/2014 08:48 PM, Ivan Shapovalov wrote:
> On Friday 02 May 2014 at 20:10:16, Edward Shishkin wrote:
>> On 05/02/2014 04:32 PM, Ivan Shapovalov wrote:
>>> On Friday 02 May 2014 at 16:07:21, Edward Shishkin wrote:
>>>> On 05/02/2014 03:36 PM, Ivan Shapovalov wrote:
>>>>> On Friday 02 May 2014 at 13:48:28, Edward Shishkin wrote:
>>>>> [...]
>>>> We can perfectly populate different "discard trees" in parallel on
>>>> different CPUs.
>>>> As to sorting the list: I don't know how to perform it in parallel :)
>>>> Default assumptions, that everything is serialized, usually lead to
>>>> various
>>>> bad-scalable solutions...
>>> Ah, now I understand what do you mean. If that's about doing less work
>>> under the tmgr mutex,
>> No. This is about minimizing real time.
>> We don't know about mutexes. We want to decide, what is preferable:
>> populating trees, or sorting the list. There is a chance that the first will
>> be faster in systems with many CPUs, so I suggest to use trees.
>> That's all!
> OK.. well, I've started with lists anyway. The data structure is (of course)
> under an abstraction layer, so we can change it.
>
>>> then yes, trees are better than lists.
>>>
>>> BTW, I've seen that reiser4 releases atom lock before allocating another
>>> node for a blocknr_set, and that leads to a "do { } while (ret ==
>>> -E_REPEAT)" loop around the blocknr_set_add_extent() calls. Is this the
>>> preferred way of attaching a dynamic data structure to an atom?
>> TBH, I have never looked at the deallocation paths in reiser4: everything
>> worked fine there.. BTW, why not to use atom's delete_set to discard things?
>> Could you please take a look?
> Blocks used for the journal (wander.c) are deallocated without BA_DEFER set
> and thus they never hit delete_set. However, we want to discard these.
This happens in error paths. Don't be so scrupulous ;)
Also users will use copy-on-write transaction model for their SSDs(*),
and in this mode journals are tiny: they contain only system blocks.
In short, there is nothing to discard..
(*) http://marc.info/?l=reiserfs-devel&m=139449965000686&w=2
>
>>>>> [...]
>>>> what explanation do you mean?
>>>>
>>>> Edward.
>>> A general explanation of how does it work :)
>> I think this is because of historical reasons.
>> Such good explanation costs a lot of time and efforts.
>> Namesys was a small company, we couldn't afford to have a technical writer.
>> Hans insisted on good comments in the source code..
> But there aren't much comments there. ;)
>
> I'm now coding the last part - bitmap querying - and I've got a question: how
> do we pass reiser4_block_nr parameters?
> Many places use pointers, but I do not really understand why (to optimize for
> stack frame size on 32-bit archs?)
I think, yes: 12 years ago 64-bit archs looked like exotic..
> and there is a place
> (reiser4_dealloc_blocks_bitmap()) where this convention is apparently broken.
I suspect that even the author of the bitmap code doesn't know, why it
is broken ;)
Edward.
>
> And this is tedious, because public interfaces seem to pass start+len, but in
> some places it's more convenient to use start+end...
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: reiser4: discard support
2014-05-03 20:21 ` Edward Shishkin
@ 2014-05-03 20:32 ` Ivan Shapovalov
2014-05-06 8:58 ` Edward Shishkin
0 siblings, 1 reply; 19+ messages in thread
From: Ivan Shapovalov @ 2014-05-03 20:32 UTC (permalink / raw)
To: Edward Shishkin; +Cc: reiserfs-devel
[-- Attachment #1: Type: text/plain, Size: 3886 bytes --]
On Saturday 03 May 2014 at 22:21:58, Edward Shishkin wrote:
> On 05/03/2014 08:48 PM, Ivan Shapovalov wrote:
> > On Friday 02 May 2014 at 20:10:16, Edward Shishkin wrote:
> >> On 05/02/2014 04:32 PM, Ivan Shapovalov wrote:
> >>> On Friday 02 May 2014 at 16:07:21, Edward Shishkin wrote:
> >>>> On 05/02/2014 03:36 PM, Ivan Shapovalov wrote:
> >>>>> On Friday 02 May 2014 at 13:48:28, Edward Shishkin wrote:
> >>>>> [...]
> >>>>
> >>>> We can perfectly populate different "discard trees" in parallel on
> >>>> different CPUs.
> >>>> As to sorting the list: I don't know how to perform it in parallel :)
> >>>> Default assumptions, that everything is serialized, usually lead to
> >>>> various
> >>>> bad-scalable solutions...
> >>>
> >>> Ah, now I understand what do you mean. If that's about doing less work
> >>> under the tmgr mutex,
> >>
> >> No. This is about minimizing real time.
> >> We don't know about mutexes. We want to decide, what is preferable:
> >> populating trees, or sorting the list. There is a chance that the first
> >> will be faster in systems with many CPUs, so I suggest to use trees.
> >> That's all!
> >
> > OK.. well, I've started with lists anyway. The data structure is (of
> > course) under an abstraction layer, so we can change it.
> >
> >>> then yes, trees are better than lists.
> >>>
> >>> BTW, I've seen that reiser4 releases atom lock before allocating another
> >>> node for a blocknr_set, and that leads to a "do { } while (ret ==
> >>> -E_REPEAT)" loop around the blocknr_set_add_extent() calls. Is this the
> >>> preferred way of attaching a dynamic data structure to an atom?
> >>
> >> TBH, I have never looked at the deallocation paths in reiser4: everything
> >> worked fine there.. BTW, why not to use atom's delete_set to discard
> >> things? Could you please take a look?
> >
> > Blocks used for the journal (wander.c) are deallocated without BA_DEFER
> > set
> > and thus they never hit delete_set. However, we want to discard these.
>
> This happens in error paths. Don't be so scrupulous ;)
I don't think so:
- wander.c:485
dealloc_tx_list() <- reiser4_write_logs()
- wander.c:505
dealloc_wmap_actor() <- dealloc_wmap() <- reiser4_write_logs()
> Also users will use copy-on-write transaction model for their SSDs(*),
> and in this mode journals are tiny: they contain only system blocks.
> In short, there is nothing to discard..
>
> (*) http://marc.info/?l=reiserfs-devel&m=139449965000686&w=2
>
> >>>>> [...]
> >>>>
> >>>> what explanation do you mean?
> >>>>
> >>>> Edward.
> >>>
> >>> A general explanation of how does it work :)
> >>
> >> I think this is because of historical reasons.
> >> Such good explanation costs a lot of time and efforts.
> >> Namesys was a small company, we couldn't afford to have a technical
> >> writer.
> >> Hans insisted on good comments in the source code..
> >
> > But there aren't much comments there. ;)
> >
> > I'm now coding the last part - bitmap querying - and I've got a question:
> > how do we pass reiser4_block_nr parameters?
> > Many places use pointers, but I do not really understand why (to optimize
> > for stack frame size on 32-bit archs?)
>
> I think, yes: 12 years ago 64-bit archs looked like exotic..
>
> > and there is a place
> >
> > (reiser4_dealloc_blocks_bitmap()) where this convention is apparently
> > broken.
> I suspect that even the author of the bitmap code doesn't know, why it
> is broken ;)
>
> Edward.
>
> > And this is tedious, because public interfaces seem to pass start+len, but
> > in some places it's more convenient to use start+end...
Ok, so I'll follow the convention.
And oh, blocks != sectors... It looks like I really need to make a backup of:
- the last good reiser4.ko
- my /home
--
Ivan Shapovalov / intelfx /
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 230 bytes --]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: reiser4: discard support
2014-05-03 20:32 ` Ivan Shapovalov
@ 2014-05-06 8:58 ` Edward Shishkin
2014-05-07 7:35 ` Ivan Shapovalov
0 siblings, 1 reply; 19+ messages in thread
From: Edward Shishkin @ 2014-05-06 8:58 UTC (permalink / raw)
To: Ivan Shapovalov; +Cc: reiserfs-devel
On 05/03/2014 10:32 PM, Ivan Shapovalov wrote:
> On Saturday 03 May 2014 at 22:21:58, Edward Shishkin wrote:
>
> TBH, I have never looked at the deallocation paths in reiser4: everything
> worked fine there.. BTW, why not to use atom's delete_set to discard
> things? Could you please take a look?
[...]
>>> Blocks used for the journal (wander.c) are deallocated without BA_DEFER
>>> set
>>> and thus they never hit delete_set. However, we want to discard these.
>> This happens in error paths. Don't be so scrupulous ;)
> I don't think so:
> - wander.c:485
> dealloc_tx_list() <- reiser4_write_logs()
> - wander.c:505
> dealloc_wmap_actor() <- dealloc_wmap() <- reiser4_write_logs()
Yep, indeed.
I don't completely understand this special case of wandered blocks.
Also I don't see where wandered blocks are deallocated after journal
replay (as journal is not needed any more): it can be a possible leak
of disk space after system crashes..
Once we understand it, I think we'll able to re-use the delete_set for
discard needs.
Edward.
>
>> Also users will use copy-on-write transaction model for their SSDs(*),
>> and in this mode journals are tiny: they contain only system blocks.
>> In short, there is nothing to discard..
>>
>> (*) http://marc.info/?l=reiserfs-devel&m=139449965000686&w=2
>>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: reiser4: discard support
2014-05-06 8:58 ` Edward Shishkin
@ 2014-05-07 7:35 ` Ivan Shapovalov
2014-05-07 21:04 ` Edward Shishkin
0 siblings, 1 reply; 19+ messages in thread
From: Ivan Shapovalov @ 2014-05-07 7:35 UTC (permalink / raw)
To: Edward Shishkin; +Cc: reiserfs-devel
[-- Attachment #1: Type: text/plain, Size: 2306 bytes --]
On Tuesday 06 May 2014 at 10:58:53, Edward Shishkin wrote:
> On 05/03/2014 10:32 PM, Ivan Shapovalov wrote:
> > On Saturday 03 May 2014 at 22:21:58, Edward Shishkin wrote:
> >
> > TBH, I have never looked at the deallocation paths in reiser4: everything
> > worked fine there.. BTW, why not to use atom's delete_set to discard
> > things? Could you please take a look?
>
> [...]
>
> >>> Blocks used for the journal (wander.c) are deallocated without BA_DEFER
> >>> set
> >>> and thus they never hit delete_set. However, we want to discard these.
> >>
> >> This happens in error paths. Don't be so scrupulous ;)
> >
> > I don't think so:
> > - wander.c:485
> >
> > dealloc_tx_list() <- reiser4_write_logs()
> >
> > - wander.c:505
> >
> > dealloc_wmap_actor() <- dealloc_wmap() <- reiser4_write_logs()
>
> Yep, indeed.
> I don't completely understand this special case of wandered blocks.
From what I've been able to understand, delete_set is not used there just
because the deallocations are done directly to the bitmaps: we already hold
the tmgr mutex at that time.
> Also I don't see where wandered blocks are deallocated after journal
> replay (as journal is not needed any more): it can be a possible leak
> of disk space after system crashes..
Maybe the bitmap version submitted to disk does not have these blocks
allocated? I don't really understand what's going on wrt. disk write-out, and
I don't even understand the difference between COMMIT BITMAP and WORKING
BITMAP...
> Once we understand it, I think we'll able to re-use the delete_set for
> discard needs.
>
> Edward.
We can get any benefits from re-using delete_set only if we replace it with a
list/rbtree. Current implementation (blocknr_set) seems to be inherently
unordered and can't be trivially modified (extent merge etc), so if we want to
use delete_set as-is, we still need to copy the extents into a better data
structure at commit time.
--
Ivan Shapovalov / intelfx /
>
> >> Also users will use copy-on-write transaction model for their SSDs(*),
> >> and in this mode journals are tiny: they contain only system blocks.
> >> In short, there is nothing to discard..
> >>
> >> (*) http://marc.info/?l=reiserfs-devel&m=139449965000686&w=2
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 230 bytes --]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: reiser4: discard support
2014-05-07 7:35 ` Ivan Shapovalov
@ 2014-05-07 21:04 ` Edward Shishkin
0 siblings, 0 replies; 19+ messages in thread
From: Edward Shishkin @ 2014-05-07 21:04 UTC (permalink / raw)
To: Ivan Shapovalov; +Cc: reiserfs-devel
On 05/07/2014 09:35 AM, Ivan Shapovalov wrote:
> On Tuesday 06 May 2014 at 10:58:53, Edward Shishkin wrote:
>> On 05/03/2014 10:32 PM, Ivan Shapovalov wrote:
>>> On Saturday 03 May 2014 at 22:21:58, Edward Shishkin wrote:
>>>
>>> TBH, I have never looked at the deallocation paths in reiser4: everything
>>> worked fine there.. BTW, why not to use atom's delete_set to discard
>>> things? Could you please take a look?
>> [...]
>>
>>>>> Blocks used for the journal (wander.c) are deallocated without BA_DEFER
>>>>> set
>>>>> and thus they never hit delete_set. However, we want to discard these.
>>>> This happens in error paths. Don't be so scrupulous ;)
>>> I don't think so:
>>> - wander.c:485
>>>
>>> dealloc_tx_list() <- reiser4_write_logs()
>>>
>>> - wander.c:505
>>>
>>> dealloc_wmap_actor() <- dealloc_wmap() <- reiser4_write_logs()
>> Yep, indeed.
>> I don't completely understand this special case of wandered blocks.
> From what I've been able to understand, delete_set is not used there just
> because the deallocations are done directly to the bitmaps: we already hold
> the tmgr mutex at that time.
>
>> Also I don't see where wandered blocks are deallocated after journal
>> replay (as journal is not needed any more): it can be a possible leak
>> of disk space after system crashes..
> Maybe the bitmap version submitted to disk does not have these blocks
> allocated? I don't really understand what's going on wrt. disk write-out, and
> I don't even understand the difference between COMMIT BITMAP and WORKING
> BITMAP...
I'll try to find the original zam's design document.
It is horribly written and obsolete, but it contains some hints about
who is who..
>
>> Once we understand it, I think we'll able to re-use the delete_set for
>> discard needs.
>>
>> Edward.
> We can get any benefits from re-using delete_set only if we replace it with a
> list/rbtree. Current implementation (blocknr_set) seems to be inherently
> unordered and can't be trivially modified (extent merge etc), so if we want to
> use delete_set as-is, we still need to copy the extents into a better data
> structure at commit time.
>
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2014-05-07 21:04 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-04-30 17:55 reiser4: discard support Edward Shishkin
2014-04-30 18:53 ` Edward Shishkin
-- strict thread matches above, loose matches on Subject: below --
2014-04-30 7:43 Ivan Shapovalov
2014-04-30 14:11 ` Edward Shishkin
[not found] ` <1B07908E-B768-407D-ADFA-B3C539FABB49@gmail.com>
[not found] ` <53613807.7090605@gmail.com>
2014-05-01 23:51 ` Ivan Shapovalov
[not found] ` <CAJZSrNK4=o9vocDfSM=4W5ZgtqZ6RVpmU66sGCu6HFsdN47OHw@mail.gmail.com>
2014-05-02 11:06 ` Ivan Shapovalov
2014-05-02 11:48 ` Edward Shishkin
2014-05-02 12:44 ` Edward Shishkin
2014-05-02 13:47 ` Ivan Shapovalov
2014-05-02 13:36 ` Ivan Shapovalov
2014-05-02 14:07 ` Edward Shishkin
2014-05-02 14:32 ` Ivan Shapovalov
2014-05-02 18:10 ` Edward Shishkin
2014-05-03 18:48 ` Ivan Shapovalov
2014-05-03 20:21 ` Edward Shishkin
2014-05-03 20:32 ` Ivan Shapovalov
2014-05-06 8:58 ` Edward Shishkin
2014-05-07 7:35 ` Ivan Shapovalov
2014-05-07 21:04 ` Edward Shishkin
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).