* reiser4: discard implementation, pass 2: allocation issues
@ 2014-06-13 20:28 Ivan Shapovalov
2014-06-15 17:36 ` Edward Shishkin
0 siblings, 1 reply; 22+ messages in thread
From: Ivan Shapovalov @ 2014-06-13 20:28 UTC (permalink / raw)
To: reiserfs-devel; +Cc: Edward Shishkin
[-- Attachment #1: Type: text/plain, Size: 3163 bytes --]
Here is my "analysis" of what happens in reiser4 during a transaction's
lifetime wrt. block allocation and deallocation.
THE EFFECTS (SEMANTICS) OF RELATED FUNCTIONS
reiser4_alloc_blocks_bitmap(): allocates in WORKING BITMAP
reiser4_dealloc_blocks_bitmap(!BA_DEFER): deallocates from WORKING BITMAP
reiser4_dealloc_blocks_bitmap(BA_DEFER): stores to ->delete_set
reiser4_pre_commit_hook_bitmap(): allocates all relocated nodes in COMMIT BITMAP
deallocates ->delete_set from COMMIT BITMAP
reiser4_post_commit_hook(): deallocates ->delete_set using !BA_DEFER
(i. e. from WORKING BITMAP)
TIMELINE OF ALLOCATIONS FOR "USUAL" NODES, AND TIMELINE OF TRANSACTION COMMIT
- nodes are allocated using reiser4_alloc_blocks() and setting JNODE_RELOC,
so WORKING BITMAP ensures that two nodes cannot get the same block;
- nodes are deallocated using reiser4_dealloc_blocks(BA_DEFER),
so their deallocation is not immediately reflected in WORKING BITMAP;
(the relocate set is written here)
- reiser4_pre_commit_hook_bitmap() uses 1) JNODE_RELOC flag and 2) ->delete_set
to convey effective bitmap changes into COMMIT BITMAP;
(the journal and overwrite set are written here)
- reiser4_post_commit_hook() uses ->delete_set to convey deallocations
from step 2 to WORKING BITMAP.
(the discard happens here)
TIMELINE OF ALLOCATIONS FOR WANDERED JOURNAL BLOCKS
- at commit time, blocks are allocated using reiser4_alloc_blocks(), so they
are allocated in WORKING BITMAP and do not interfere with any "usual" blocks;
- after writing wandered blocks, they are deallocated using
reiser4_dealloc_blocks(!BA_DEFER), i. e. from the WORKING BITMAP.
CONCLUSION
At possible transaction replay time, journal blocks are not allocated in any
of the bitmaps. However, because the journal is read and replayed before a
transaction has a chance to commit, this fact does not matter.
What matters is that wandered journal blocks never hit COMMIT BITMAP.
So, if I've got all this correct (which I highly doubt), the disk space leak
(as you pointed it out) does not exist.
What exists is a rather different problem with my idea of "log every
deallocated block". Current implementation logs every block regardless of
BA_DEFER flag presence or absence, so non-wandered blocks are logged twice.
We could just use ->delete_set, but we would lose wandered blocks then.
Or we could only log !BA_DEFER requests, which would do the right thing
(wandered blocks + deallocations from reiser4_post_commit_hook()), but
the reasoning behind this decision would not be obvious for a casual
code reader.
Or we could log only wandered blocks (in addition to ->delete_set)
at discard time, but this is messy and requires us to merge the discard log
with ->delete_set at discard time.
Or we could log wandered blocks straight into ->delete_set and do something in
reiser4_post_commit_hook() to separate these entries, but this is super messy.
I'm preferring the second way... Edward, please proof-read all this.
--
Ivan Shapovalov / intelfx /
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 213 bytes --]
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: reiser4: discard implementation, pass 2: allocation issues
2014-06-13 20:28 reiser4: discard implementation, pass 2: allocation issues Ivan Shapovalov
@ 2014-06-15 17:36 ` Edward Shishkin
2014-06-15 18:07 ` Ivan Shapovalov
0 siblings, 1 reply; 22+ messages in thread
From: Edward Shishkin @ 2014-06-15 17:36 UTC (permalink / raw)
To: Ivan Shapovalov, reiserfs-devel
On 06/13/2014 10:28 PM, Ivan Shapovalov wrote:
> Here is my "analysis" of what happens in reiser4 during a transaction's
> lifetime wrt. block allocation and deallocation.
>
>
> THE EFFECTS (SEMANTICS) OF RELATED FUNCTIONS
> reiser4_alloc_blocks_bitmap(): allocates in WORKING BITMAP
Yes.
> reiser4_dealloc_blocks_bitmap(!BA_DEFER): deallocates from WORKING BITMAP
> reiser4_dealloc_blocks_bitmap(BA_DEFER): stores to ->delete_set
This is correct for the middle-level allocator (without suffix "bitmap").
The low-level one frees blocks only in WORKING BITMAP.
>
> reiser4_pre_commit_hook_bitmap(): allocates all relocated nodes in COMMIT BITMAP
"Relocated" is bad term here: nodes with new data also get the flag
JNODE_RELOC. So, I would rather say, that it applies freshly allocated
nodes of the atom to COMMIT BITMAP.
> deallocates ->delete_set from COMMIT BITMAP
applies the atom's delete_set to COMMIT BITMAP
>
> reiser4_post_commit_hook(): deallocates ->delete_set using !BA_DEFER
> (i. e. from WORKING BITMAP)
applies the atom's delete_set to WORKING BITMAP.
I would also mention a function of the middle-level block allocator
reiser4_alloc_blocks(): allocates blocks in WORKING BITMAP.
Note that the middle-level block allocator (block_alloc.c) actually
manipulates
with abstract space maps. Currently in reiser4 they are represented only by
bitmaps (plugin/space/bitmap.c). We can also implement another
representation -
extent tree (like in XFS). I don't see any needs for now, though.
>
>
> TIMELINE OF ALLOCATIONS FOR "USUAL" NODES, AND TIMELINE OF TRANSACTION COMMIT
> - nodes are allocated using reiser4_alloc_blocks() and setting JNODE_RELOC,
> so WORKING BITMAP ensures that two nodes cannot get the same block;
> - nodes are deallocated using reiser4_dealloc_blocks(BA_DEFER),
> so their deallocation is not immediately reflected in WORKING BITMAP;
> (the relocate set is written here)
> - reiser4_pre_commit_hook_bitmap() uses 1) JNODE_RELOC flag and 2) ->delete_set
> to convey effective bitmap changes into COMMIT BITMAP;
> (the journal and overwrite set are written here)
> - reiser4_post_commit_hook() uses ->delete_set to convey deallocations
> from step 2 to WORKING BITMAP.
> (the discard happens here)
>
>
> TIMELINE OF ALLOCATIONS FOR WANDERED JOURNAL BLOCKS
> - at commit time, blocks are allocated using reiser4_alloc_blocks(), so they
> are allocated in WORKING BITMAP and do not interfere with any "usual" blocks;
> - after writing wandered blocks, they are deallocated using
> reiser4_dealloc_blocks(!BA_DEFER), i. e. from the WORKING BITMAP.
So, the system of working and commit bitmaps plus the delete set seems
to be redundant? I think this is because of performance reasons: block
allocation is critical thing...
>
>
> CONCLUSION
> At possible transaction replay time, journal blocks are not allocated in any
> of the bitmaps. However, because the journal is read and replayed before a
> transaction has a chance to commit, this fact does not matter.
> What matters is that wandered journal blocks never hit COMMIT BITMAP.
>
> So, if I've got all this correct (which I highly doubt), the disk space leak
> (as you pointed it out) does not exist.
It seems, you are right..
>
> What exists is a rather different problem with my idea of "log every
> deallocated block". Current implementation logs every block regardless of
> BA_DEFER flag presence or absence, so non-wandered blocks are logged twice.
>
> We could just use ->delete_set, but we would lose wandered blocks then.
> Or we could only log !BA_DEFER requests, which would do the right thing
> (wandered blocks + deallocations from reiser4_post_commit_hook()), but
> the reasoning behind this decision would not be obvious for a casual
> code reader.
I think that a good comment will save the situation..
> Or we could log only wandered blocks (in addition to ->delete_set)
> at discard time, but this is messy and requires us to merge the discard log
> with ->delete_set at discard time.
what is the difference with the previous "we could.."?
> Or we could log wandered blocks straight into ->delete_set and do something in
> reiser4_post_commit_hook() to separate these entries, but this is super messy.
>
> I'm preferring the second way... Edward, please proof-read all this.
BTW, what about your current implementation? Does it work for you?
Thanks,
Edward.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: reiser4: discard implementation, pass 2: allocation issues
2014-06-15 17:36 ` Edward Shishkin
@ 2014-06-15 18:07 ` Ivan Shapovalov
2014-06-15 21:49 ` Edward Shishkin
0 siblings, 1 reply; 22+ messages in thread
From: Ivan Shapovalov @ 2014-06-15 18:07 UTC (permalink / raw)
To: Edward Shishkin; +Cc: reiserfs-devel
[-- Attachment #1: Type: text/plain, Size: 5971 bytes --]
On Sunday 15 June 2014 at 19:36:05, Edward Shishkin wrote:
> On 06/13/2014 10:28 PM, Ivan Shapovalov wrote:
> > Here is my "analysis" of what happens in reiser4 during a transaction's
> > lifetime wrt. block allocation and deallocation.
> >
> >
> > THE EFFECTS (SEMANTICS) OF RELATED FUNCTIONS
> > reiser4_alloc_blocks_bitmap(): allocates in WORKING BITMAP
>
> Yes.
>
> > reiser4_dealloc_blocks_bitmap(!BA_DEFER): deallocates from WORKING BITMAP
> > reiser4_dealloc_blocks_bitmap(BA_DEFER): stores to ->delete_set
>
> This is correct for the middle-level allocator (without suffix "bitmap").
> The low-level one frees blocks only in WORKING BITMAP.
Yes, this was a wording mistake. I've noticed it shortly after sending...
> > reiser4_pre_commit_hook_bitmap(): allocates all relocated nodes in COMMIT BITMAP
>
> "Relocated" is bad term here: nodes with new data also get the flag
> JNODE_RELOC. So, I would rather say, that it applies freshly allocated
> nodes of the atom to COMMIT BITMAP.
>
>
> > deallocates ->delete_set from COMMIT BITMAP
>
> applies the atom's delete_set to COMMIT BITMAP
I've meant exactly this.
> > reiser4_post_commit_hook(): deallocates ->delete_set using !BA_DEFER
> > (i. e. from WORKING BITMAP)
>
> applies the atom's delete_set to WORKING BITMAP.
ditto
> I would also mention a function of the middle-level block allocator
> reiser4_alloc_blocks(): allocates blocks in WORKING BITMAP.
Yes, sure.
> Note that the middle-level block allocator (block_alloc.c) actually
> manipulates
> with abstract space maps. Currently in reiser4 they are represented only by
> bitmaps (plugin/space/bitmap.c). We can also implement another
> representation -
> extent tree (like in XFS). I don't see any needs for now, though.
Extent trees seem to be interesting.. They claim that they are more efficient -
is the difference that huge?
> > TIMELINE OF ALLOCATIONS FOR "USUAL" NODES, AND TIMELINE OF TRANSACTION COMMIT
> > - nodes are allocated using reiser4_alloc_blocks() and setting JNODE_RELOC,
> > so WORKING BITMAP ensures that two nodes cannot get the same block;
> > - nodes are deallocated using reiser4_dealloc_blocks(BA_DEFER),
> > so their deallocation is not immediately reflected in WORKING BITMAP;
> > (the relocate set is written here)
> > - reiser4_pre_commit_hook_bitmap() uses 1) JNODE_RELOC flag and 2) ->delete_set
> > to convey effective bitmap changes into COMMIT BITMAP;
> > (the journal and overwrite set are written here)
> > - reiser4_post_commit_hook() uses ->delete_set to convey deallocations
> > from step 2 to WORKING BITMAP.
> > (the discard happens here)
> >
> >
> > TIMELINE OF ALLOCATIONS FOR WANDERED JOURNAL BLOCKS
> > - at commit time, blocks are allocated using reiser4_alloc_blocks(), so they
> > are allocated in WORKING BITMAP and do not interfere with any "usual" blocks;
> > - after writing wandered blocks, they are deallocated using
> > reiser4_dealloc_blocks(!BA_DEFER), i. e. from the WORKING BITMAP.
>
>
> So, the system of working and commit bitmaps plus the delete set seems
> to be redundant? I think this is because of performance reasons: block
> allocation is critical thing...
Seems like it is -- for performance and simplicity (== robustness).
> > CONCLUSION
> > At possible transaction replay time, journal blocks are not allocated in any
> > of the bitmaps. However, because the journal is read and replayed before a
> > transaction has a chance to commit, this fact does not matter.
> > What matters is that wandered journal blocks never hit COMMIT BITMAP.
> >
> > So, if I've got all this correct (which I highly doubt), the disk space leak
> > (as you pointed it out) does not exist.
>
> It seems, you are right..
>
> >
> > What exists is a rather different problem with my idea of "log every
> > deallocated block". Current implementation logs every block regardless of
> > BA_DEFER flag presence or absence, so non-wandered blocks are logged twice.
> >
> > We could just use ->delete_set, but we would lose wandered blocks then.
> > Or we could only log !BA_DEFER requests, which would do the right thing
> > (wandered blocks + deallocations from reiser4_post_commit_hook()), but
> > the reasoning behind this decision would not be obvious for a casual
> > code reader.
>
> I think that a good comment will save the situation..
>
> > Or we could log only wandered blocks (in addition to ->delete_set)
> > at discard time, but this is messy and requires us to merge the discard log
> > with ->delete_set at discard time.
>
> what is the difference with the previous "we could.."?
The previous option is to log all !BA_DEFER requests in addition to
->delete_set, so those two block sets would be partially overlapping.
And this option is to log only those !BA_DEFER requests which are not in
->delete_set, so no block will be stored twice. Thus we will have to consider
both ->delete_set and this hypothetical ->discard_set at discard time.
> > Or we could log wandered blocks straight into ->delete_set and do something in
> > reiser4_post_commit_hook() to separate these entries, but this is super messy.
> >
> > I'm preferring the second way... Edward, please proof-read all this.
>
> BTW, what about your current implementation? Does it work for you?
Yes, it apparently does (no massive corruptions).
I get batches of "orphan cluster" errors from time to time (from offset 0 till
the supposed end of file), but 1) I do not see which files get corrupted,
2) this could be due to txmod=wa, 3) my machine is pretty crashy now due to
some driver issues, so I don't know what exactly causes this.
Thanks for proof-reading. What do you think about all this?
--
Ivan Shapovalov / intelfx /
>
> Thanks,
> Edward.
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 213 bytes --]
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: reiser4: discard implementation, pass 2: allocation issues
2014-06-15 18:07 ` Ivan Shapovalov
@ 2014-06-15 21:49 ` Edward Shishkin
2014-06-15 21:58 ` Ivan Shapovalov
0 siblings, 1 reply; 22+ messages in thread
From: Edward Shishkin @ 2014-06-15 21:49 UTC (permalink / raw)
To: Ivan Shapovalov; +Cc: reiserfs-devel
On 06/15/2014 08:07 PM, Ivan Shapovalov wrote:
> On Sunday 15 June 2014 at 19:36:05, Edward Shishkin wrote:
>> On 06/13/2014 10:28 PM, Ivan Shapovalov wrote:
>>> Here is my "analysis" of what happens in reiser4 during a transaction's
>>> lifetime wrt. block allocation and deallocation.
>>>
>>>
>>> THE EFFECTS (SEMANTICS) OF RELATED FUNCTIONS
>>> reiser4_alloc_blocks_bitmap(): allocates in WORKING BITMAP
>> Yes.
>>
>>> reiser4_dealloc_blocks_bitmap(!BA_DEFER): deallocates from WORKING BITMAP
>>> reiser4_dealloc_blocks_bitmap(BA_DEFER): stores to ->delete_set
>> This is correct for the middle-level allocator (without suffix "bitmap").
>> The low-level one frees blocks only in WORKING BITMAP.
> Yes, this was a wording mistake. I've noticed it shortly after sending...
>
>>> reiser4_pre_commit_hook_bitmap(): allocates all relocated nodes in COMMIT BITMAP
>> "Relocated" is bad term here: nodes with new data also get the flag
>> JNODE_RELOC. So, I would rather say, that it applies freshly allocated
>> nodes of the atom to COMMIT BITMAP.
>>
>>
>>> deallocates ->delete_set from COMMIT BITMAP
>> applies the atom's delete_set to COMMIT BITMAP
> I've meant exactly this.
>
>>> reiser4_post_commit_hook(): deallocates ->delete_set using !BA_DEFER
>>> (i. e. from WORKING BITMAP)
>> applies the atom's delete_set to WORKING BITMAP.
> ditto
>
>> I would also mention a function of the middle-level block allocator
>> reiser4_alloc_blocks(): allocates blocks in WORKING BITMAP.
> Yes, sure.
>
>
>> Note that the middle-level block allocator (block_alloc.c) actually
>> manipulates
>> with abstract space maps. Currently in reiser4 they are represented only by
>> bitmaps (plugin/space/bitmap.c). We can also implement another
>> representation -
>> extent tree (like in XFS). I don't see any needs for now, though.
> Extent trees seem to be interesting.. They claim that they are more efficient -
> is the difference that huge?
>
>>> TIMELINE OF ALLOCATIONS FOR "USUAL" NODES, AND TIMELINE OF TRANSACTION COMMIT
>>> - nodes are allocated using reiser4_alloc_blocks() and setting JNODE_RELOC,
>>> so WORKING BITMAP ensures that two nodes cannot get the same block;
>>> - nodes are deallocated using reiser4_dealloc_blocks(BA_DEFER),
>>> so their deallocation is not immediately reflected in WORKING BITMAP;
>>> (the relocate set is written here)
>>> - reiser4_pre_commit_hook_bitmap() uses 1) JNODE_RELOC flag and 2) ->delete_set
>>> to convey effective bitmap changes into COMMIT BITMAP;
>>> (the journal and overwrite set are written here)
>>> - reiser4_post_commit_hook() uses ->delete_set to convey deallocations
>>> from step 2 to WORKING BITMAP.
>>> (the discard happens here)
>>>
>>>
>>> TIMELINE OF ALLOCATIONS FOR WANDERED JOURNAL BLOCKS
>>> - at commit time, blocks are allocated using reiser4_alloc_blocks(), so they
>>> are allocated in WORKING BITMAP and do not interfere with any "usual" blocks;
>>> - after writing wandered blocks, they are deallocated using
>>> reiser4_dealloc_blocks(!BA_DEFER), i. e. from the WORKING BITMAP.
>>
>> So, the system of working and commit bitmaps plus the delete set seems
>> to be redundant? I think this is because of performance reasons: block
>> allocation is critical thing...
> Seems like it is -- for performance and simplicity (== robustness).
>
>>> CONCLUSION
>>> At possible transaction replay time, journal blocks are not allocated in any
>>> of the bitmaps. However, because the journal is read and replayed before a
>>> transaction has a chance to commit, this fact does not matter.
>>> What matters is that wandered journal blocks never hit COMMIT BITMAP.
>>>
>>> So, if I've got all this correct (which I highly doubt), the disk space leak
>>> (as you pointed it out) does not exist.
>> It seems, you are right..
>>
>>> What exists is a rather different problem with my idea of "log every
>>> deallocated block". Current implementation logs every block regardless of
>>> BA_DEFER flag presence or absence, so non-wandered blocks are logged twice.
>>>
>>> We could just use ->delete_set, but we would lose wandered blocks then.
>>> Or we could only log !BA_DEFER requests, which would do the right thing
>>> (wandered blocks + deallocations from reiser4_post_commit_hook()), but
>>> the reasoning behind this decision would not be obvious for a casual
>>> code reader.
>> I think that a good comment will save the situation..
>>
>>> Or we could log only wandered blocks (in addition to ->delete_set)
>>> at discard time, but this is messy and requires us to merge the discard log
>>> with ->delete_set at discard time.
>> what is the difference with the previous "we could.."?
> The previous option is to log all !BA_DEFER requests in addition to
> ->delete_set, so those two block sets would be partially overlapping.
Hmm.. Are they really overlapped?
reiser4_dealloc_blocks() looks like the following:
{
if (flags & BA_DEFER)
log in the delete_set;
else
deallocate in working bitmap
...
}
so (!BA_DEFER) requests and the delete_set look disjoint,
or, I missed something?
> And this option is to log only those !BA_DEFER requests which are not in
> ->delete_set, so no block will be stored twice. Thus we will have to consider
> both ->delete_set and this hypothetical ->discard_set at discard time.
>
>>> Or we could log wandered blocks straight into ->delete_set and do something in
>>> reiser4_post_commit_hook() to separate these entries, but this is super messy.
>>>
>>> I'm preferring the second way... Edward, please proof-read all this.
>> BTW, what about your current implementation? Does it work for you?
> Yes, it apparently does (no massive corruptions).
>
> I get batches of "orphan cluster" errors from time to time (from offset 0 till
> the supposed end of file),
this is fsck complaints?
> but 1) I do not see which files get corrupted,
> 2) this could be due to txmod=wa, 3) my machine is pretty crashy now due to
> some driver issues, so I don't know what exactly causes this.
>
> Thanks for proof-reading. What do you think about all this?
I think that not bad for the beginning! I have purchased an SSD card,
hope to test the discard stuff in the near future.
Thanks,
Edward.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: reiser4: discard implementation, pass 2: allocation issues
2014-06-15 21:49 ` Edward Shishkin
@ 2014-06-15 21:58 ` Ivan Shapovalov
2014-06-16 0:14 ` Edward Shishkin
0 siblings, 1 reply; 22+ messages in thread
From: Ivan Shapovalov @ 2014-06-15 21:58 UTC (permalink / raw)
To: Edward Shishkin; +Cc: reiserfs-devel
[-- Attachment #1: Type: text/plain, Size: 7265 bytes --]
On Sunday 15 June 2014 at 23:49:59, Edward Shishkin wrote:
>
> On 06/15/2014 08:07 PM, Ivan Shapovalov wrote:
> > On Sunday 15 June 2014 at 19:36:05, Edward Shishkin wrote:
> >> On 06/13/2014 10:28 PM, Ivan Shapovalov wrote:
> >>> Here is my "analysis" of what happens in reiser4 during a transaction's
> >>> lifetime wrt. block allocation and deallocation.
> >>>
> >>>
> >>> THE EFFECTS (SEMANTICS) OF RELATED FUNCTIONS
> >>> reiser4_alloc_blocks_bitmap(): allocates in WORKING BITMAP
> >> Yes.
> >>
> >>> reiser4_dealloc_blocks_bitmap(!BA_DEFER): deallocates from WORKING BITMAP
> >>> reiser4_dealloc_blocks_bitmap(BA_DEFER): stores to ->delete_set
> >> This is correct for the middle-level allocator (without suffix "bitmap").
> >> The low-level one frees blocks only in WORKING BITMAP.
> > Yes, this was a wording mistake. I've noticed it shortly after sending...
> >
> >>> reiser4_pre_commit_hook_bitmap(): allocates all relocated nodes in COMMIT BITMAP
> >> "Relocated" is bad term here: nodes with new data also get the flag
> >> JNODE_RELOC. So, I would rather say, that it applies freshly allocated
> >> nodes of the atom to COMMIT BITMAP.
> >>
> >>
> >>> deallocates ->delete_set from COMMIT BITMAP
> >> applies the atom's delete_set to COMMIT BITMAP
> > I've meant exactly this.
> >
> >>> reiser4_post_commit_hook(): deallocates ->delete_set using !BA_DEFER
> >>> (i. e. from WORKING BITMAP)
> >> applies the atom's delete_set to WORKING BITMAP.
> > ditto
> >
> >> I would also mention a function of the middle-level block allocator
> >> reiser4_alloc_blocks(): allocates blocks in WORKING BITMAP.
> > Yes, sure.
> >
> >
> >> Note that the middle-level block allocator (block_alloc.c) actually
> >> manipulates
> >> with abstract space maps. Currently in reiser4 they are represented only by
> >> bitmaps (plugin/space/bitmap.c). We can also implement another
> >> representation -
> >> extent tree (like in XFS). I don't see any needs for now, though.
> > Extent trees seem to be interesting.. They claim that they are more efficient -
> > is the difference that huge?
> >
> >>> TIMELINE OF ALLOCATIONS FOR "USUAL" NODES, AND TIMELINE OF TRANSACTION COMMIT
> >>> - nodes are allocated using reiser4_alloc_blocks() and setting JNODE_RELOC,
> >>> so WORKING BITMAP ensures that two nodes cannot get the same block;
> >>> - nodes are deallocated using reiser4_dealloc_blocks(BA_DEFER),
> >>> so their deallocation is not immediately reflected in WORKING BITMAP;
> >>> (the relocate set is written here)
> >>> - reiser4_pre_commit_hook_bitmap() uses 1) JNODE_RELOC flag and 2) ->delete_set
> >>> to convey effective bitmap changes into COMMIT BITMAP;
> >>> (the journal and overwrite set are written here)
> >>> - reiser4_post_commit_hook() uses ->delete_set to convey deallocations
> >>> from step 2 to WORKING BITMAP.
> >>> (the discard happens here)
> >>>
> >>>
> >>> TIMELINE OF ALLOCATIONS FOR WANDERED JOURNAL BLOCKS
> >>> - at commit time, blocks are allocated using reiser4_alloc_blocks(), so they
> >>> are allocated in WORKING BITMAP and do not interfere with any "usual" blocks;
> >>> - after writing wandered blocks, they are deallocated using
> >>> reiser4_dealloc_blocks(!BA_DEFER), i. e. from the WORKING BITMAP.
> >>
> >> So, the system of working and commit bitmaps plus the delete set seems
> >> to be redundant? I think this is because of performance reasons: block
> >> allocation is critical thing...
> > Seems like it is -- for performance and simplicity (== robustness).
> >
> >>> CONCLUSION
> >>> At possible transaction replay time, journal blocks are not allocated in any
> >>> of the bitmaps. However, because the journal is read and replayed before a
> >>> transaction has a chance to commit, this fact does not matter.
> >>> What matters is that wandered journal blocks never hit COMMIT BITMAP.
> >>>
> >>> So, if I've got all this correct (which I highly doubt), the disk space leak
> >>> (as you pointed it out) does not exist.
> >> It seems, you are right..
> >>
> >>> What exists is a rather different problem with my idea of "log every
> >>> deallocated block". Current implementation logs every block regardless of
> >>> BA_DEFER flag presence or absence, so non-wandered blocks are logged twice.
> >>>
> >>> We could just use ->delete_set, but we would lose wandered blocks then.
> >>> Or we could only log !BA_DEFER requests, which would do the right thing
> >>> (wandered blocks + deallocations from reiser4_post_commit_hook()), but
> >>> the reasoning behind this decision would not be obvious for a casual
> >>> code reader.
> >> I think that a good comment will save the situation..
> >>
> >>> Or we could log only wandered blocks (in addition to ->delete_set)
> >>> at discard time, but this is messy and requires us to merge the discard log
> >>> with ->delete_set at discard time.
> >> what is the difference with the previous "we could.."?
> > The previous option is to log all !BA_DEFER requests in addition to
> > ->delete_set, so those two block sets would be partially overlapping.
>
>
> Hmm.. Are they really overlapped?
>
> reiser4_dealloc_blocks() looks like the following:
> {
> if (flags & BA_DEFER)
> log in the delete_set;
> else
> deallocate in working bitmap
> ...
> }
>
> so (!BA_DEFER) requests and the delete_set look disjoint,
> or, I missed something?
You're right, I've misread the code. I was under impression that
reiser4_post_commit_hook() internally calls reiser4_dealloc_blocks(!BA_DEFER).
So the second option (which I prefer) is to log all sa_dealloc_blocks() calls.
That is, effectively, ->delete_set plus wandered journal block deallocations.
> > And this option is to log only those !BA_DEFER requests which are not in
> > ->delete_set, so no block will be stored twice. Thus we will have to consider
> > both ->delete_set and this hypothetical ->discard_set at discard time.
> >
> >>> Or we could log wandered blocks straight into ->delete_set and do something in
> >>> reiser4_post_commit_hook() to separate these entries, but this is super messy.
> >>>
> >>> I'm preferring the second way... Edward, please proof-read all this.
> >> BTW, what about your current implementation? Does it work for you?
> > Yes, it apparently does (no massive corruptions).
> >
> > I get batches of "orphan cluster" errors from time to time (from offset 0 till
> > the supposed end of file),
>
>
> this is fsck complaints?
Yes. (I do `fsck.reiser4 --build-fs` pretty routinely...)
> > but 1) I do not see which files get corrupted,
> > 2) this could be due to txmod=wa, 3) my machine is pretty crashy now due to
> > some driver issues, so I don't know what exactly causes this.
> >
> > Thanks for proof-reading. What do you think about all this?
>
>
> I think that not bad for the beginning! I have purchased an SSD card,
> hope to test the discard stuff in the near future.
And I hope to finish this feature somewhen... :)
Thanks,
--
Ivan Shapovalov / intelfx /
>
> Thanks,
> Edward.
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 213 bytes --]
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: reiser4: discard implementation, pass 2: allocation issues
2014-06-15 21:58 ` Ivan Shapovalov
@ 2014-06-16 0:14 ` Edward Shishkin
2014-06-16 5:03 ` Ivan Shapovalov
0 siblings, 1 reply; 22+ messages in thread
From: Edward Shishkin @ 2014-06-16 0:14 UTC (permalink / raw)
To: Ivan Shapovalov; +Cc: reiserfs-devel
On 06/15/2014 11:58 PM, Ivan Shapovalov wrote:
> On Sunday 15 June 2014 at 23:49:59, Edward Shishkin wrote:
>> On 06/15/2014 08:07 PM, Ivan Shapovalov wrote:
>>> On Sunday 15 June 2014 at 19:36:05, Edward Shishkin wrote:
>>>> On 06/13/2014 10:28 PM, Ivan Shapovalov wrote:
>>>>> Here is my "analysis" of what happens in reiser4 during a transaction's
>>>>> lifetime wrt. block allocation and deallocation.
>>>>>
>>>>>
>>>>> THE EFFECTS (SEMANTICS) OF RELATED FUNCTIONS
>>>>> reiser4_alloc_blocks_bitmap(): allocates in WORKING BITMAP
>>>> Yes.
>>>>
>>>>> reiser4_dealloc_blocks_bitmap(!BA_DEFER): deallocates from WORKING BITMAP
>>>>> reiser4_dealloc_blocks_bitmap(BA_DEFER): stores to ->delete_set
>>>> This is correct for the middle-level allocator (without suffix "bitmap").
>>>> The low-level one frees blocks only in WORKING BITMAP.
>>> Yes, this was a wording mistake. I've noticed it shortly after sending...
>>>
>>>>> reiser4_pre_commit_hook_bitmap(): allocates all relocated nodes in COMMIT BITMAP
>>>> "Relocated" is bad term here: nodes with new data also get the flag
>>>> JNODE_RELOC. So, I would rather say, that it applies freshly allocated
>>>> nodes of the atom to COMMIT BITMAP.
>>>>
>>>>
>>>>> deallocates ->delete_set from COMMIT BITMAP
>>>> applies the atom's delete_set to COMMIT BITMAP
>>> I've meant exactly this.
>>>
>>>>> reiser4_post_commit_hook(): deallocates ->delete_set using !BA_DEFER
>>>>> (i. e. from WORKING BITMAP)
>>>> applies the atom's delete_set to WORKING BITMAP.
>>> ditto
>>>
>>>> I would also mention a function of the middle-level block allocator
>>>> reiser4_alloc_blocks(): allocates blocks in WORKING BITMAP.
>>> Yes, sure.
>>>
>>>
>>>> Note that the middle-level block allocator (block_alloc.c) actually
>>>> manipulates
>>>> with abstract space maps. Currently in reiser4 they are represented only by
>>>> bitmaps (plugin/space/bitmap.c). We can also implement another
>>>> representation -
>>>> extent tree (like in XFS). I don't see any needs for now, though.
>>> Extent trees seem to be interesting.. They claim that they are more efficient -
>>> is the difference that huge?
>>>
>>>>> TIMELINE OF ALLOCATIONS FOR "USUAL" NODES, AND TIMELINE OF TRANSACTION COMMIT
>>>>> - nodes are allocated using reiser4_alloc_blocks() and setting JNODE_RELOC,
>>>>> so WORKING BITMAP ensures that two nodes cannot get the same block;
>>>>> - nodes are deallocated using reiser4_dealloc_blocks(BA_DEFER),
>>>>> so their deallocation is not immediately reflected in WORKING BITMAP;
>>>>> (the relocate set is written here)
>>>>> - reiser4_pre_commit_hook_bitmap() uses 1) JNODE_RELOC flag and 2) ->delete_set
>>>>> to convey effective bitmap changes into COMMIT BITMAP;
>>>>> (the journal and overwrite set are written here)
>>>>> - reiser4_post_commit_hook() uses ->delete_set to convey deallocations
>>>>> from step 2 to WORKING BITMAP.
>>>>> (the discard happens here)
>>>>>
>>>>>
>>>>> TIMELINE OF ALLOCATIONS FOR WANDERED JOURNAL BLOCKS
>>>>> - at commit time, blocks are allocated using reiser4_alloc_blocks(), so they
>>>>> are allocated in WORKING BITMAP and do not interfere with any "usual" blocks;
>>>>> - after writing wandered blocks, they are deallocated using
>>>>> reiser4_dealloc_blocks(!BA_DEFER), i. e. from the WORKING BITMAP.
>>>> So, the system of working and commit bitmaps plus the delete set seems
>>>> to be redundant? I think this is because of performance reasons: block
>>>> allocation is critical thing...
>>> Seems like it is -- for performance and simplicity (== robustness).
>>>
>>>>> CONCLUSION
>>>>> At possible transaction replay time, journal blocks are not allocated in any
>>>>> of the bitmaps. However, because the journal is read and replayed before a
>>>>> transaction has a chance to commit, this fact does not matter.
>>>>> What matters is that wandered journal blocks never hit COMMIT BITMAP.
>>>>>
>>>>> So, if I've got all this correct (which I highly doubt), the disk space leak
>>>>> (as you pointed it out) does not exist.
>>>> It seems, you are right..
>>>>
>>>>> What exists is a rather different problem with my idea of "log every
>>>>> deallocated block". Current implementation logs every block regardless of
>>>>> BA_DEFER flag presence or absence, so non-wandered blocks are logged twice.
>>>>>
>>>>> We could just use ->delete_set, but we would lose wandered blocks then.
>>>>> Or we could only log !BA_DEFER requests, which would do the right thing
>>>>> (wandered blocks + deallocations from reiser4_post_commit_hook()), but
>>>>> the reasoning behind this decision would not be obvious for a casual
>>>>> code reader.
>>>> I think that a good comment will save the situation..
>>>>
>>>>> Or we could log only wandered blocks (in addition to ->delete_set)
>>>>> at discard time, but this is messy and requires us to merge the discard log
>>>>> with ->delete_set at discard time.
>>>> what is the difference with the previous "we could.."?
>>> The previous option is to log all !BA_DEFER requests in addition to
>>> ->delete_set, so those two block sets would be partially overlapping.
>>
>> Hmm.. Are they really overlapped?
>>
>> reiser4_dealloc_blocks() looks like the following:
>> {
>> if (flags & BA_DEFER)
>> log in the delete_set;
>> else
>> deallocate in working bitmap
>> ...
>> }
>>
>> so (!BA_DEFER) requests and the delete_set look disjoint,
>> or, I missed something?
> You're right, I've misread the code. I was under impression that
> reiser4_post_commit_hook() internally calls reiser4_dealloc_blocks(!BA_DEFER).
>
> So the second option (which I prefer) is to log all sa_dealloc_blocks() calls.
> That is, effectively, ->delete_set plus wandered journal block deallocations.
Yes, I think to do something like this:
reiser4_dealloc_blocks()
{
if (flags & BA_DEFER) {
...
} else {
...
if (discard is turned on)
/* we'll want to discard these blocks, which are not to
be represented
in the COMMIT SPACE MAP, so store them in a
separate list */
do {
blocknr_set_add_extent(atom,
&atom->delete_set_for_wander, &bsep, start, len);
...
} while (ret == -E_REPEAT);
sa_dealloc_blocks(reiser4_get_space_allocator(ctx->super), *start, *len);
...
}
Then join the sets (->delete_set, and ->delete_set_for_wander) at
discard time
(should be pretty cheap operation). insert a check at every atoms
fusion, that
->delete_set_for_wander of both atoms are empty).
>
>>> And this option is to log only those !BA_DEFER requests which are not in
>>> ->delete_set, so no block will be stored twice. Thus we will have to consider
>>> both ->delete_set and this hypothetical ->discard_set at discard time.
>>>
>>>>> Or we could log wandered blocks straight into ->delete_set and do something in
>>>>> reiser4_post_commit_hook() to separate these entries, but this is super messy.
>>>>>
>>>>> I'm preferring the second way... Edward, please proof-read all this.
>>>> BTW, what about your current implementation? Does it work for you?
>>> Yes, it apparently does (no massive corruptions).
>>>
>>> I get batches of "orphan cluster" errors from time to time (from offset 0 till
>>> the supposed end of file),
>>
>> this is fsck complaints?
> Yes. (I do `fsck.reiser4 --build-fs` pretty routinely...)
>
>>> but 1) I do not see which files get corrupted,
>>> 2) this could be due to txmod=wa, 3) my machine is pretty crashy now due to
>>> some driver issues, so I don't know what exactly causes this.
>>>
>>> Thanks for proof-reading. What do you think about all this?
>>
>> I think that not bad for the beginning! I have purchased an SSD card,
>> hope to test the discard stuff in the near future.
> And I hope to finish this feature somewhen... :)
>
> Thanks,
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: reiser4: discard implementation, pass 2: allocation issues
2014-06-16 0:14 ` Edward Shishkin
@ 2014-06-16 5:03 ` Ivan Shapovalov
2014-06-16 9:24 ` Edward Shishkin
0 siblings, 1 reply; 22+ messages in thread
From: Ivan Shapovalov @ 2014-06-16 5:03 UTC (permalink / raw)
To: Edward Shishkin; +Cc: reiserfs-devel
[-- Attachment #1: Type: text/plain, Size: 8910 bytes --]
On Monday 16 June 2014 at 02:14:49, Edward Shishkin wrote:
>
> On 06/15/2014 11:58 PM, Ivan Shapovalov wrote:
> > On Sunday 15 June 2014 at 23:49:59, Edward Shishkin wrote:
> >> On 06/15/2014 08:07 PM, Ivan Shapovalov wrote:
> >>> On Sunday 15 June 2014 at 19:36:05, Edward Shishkin wrote:
> >>>> On 06/13/2014 10:28 PM, Ivan Shapovalov wrote:
> >>>>> Here is my "analysis" of what happens in reiser4 during a transaction's
> >>>>> lifetime wrt. block allocation and deallocation.
> >>>>>
> >>>>>
> >>>>> THE EFFECTS (SEMANTICS) OF RELATED FUNCTIONS
> >>>>> reiser4_alloc_blocks_bitmap(): allocates in WORKING BITMAP
> >>>> Yes.
> >>>>
> >>>>> reiser4_dealloc_blocks_bitmap(!BA_DEFER): deallocates from WORKING BITMAP
> >>>>> reiser4_dealloc_blocks_bitmap(BA_DEFER): stores to ->delete_set
> >>>> This is correct for the middle-level allocator (without suffix "bitmap").
> >>>> The low-level one frees blocks only in WORKING BITMAP.
> >>> Yes, this was a wording mistake. I've noticed it shortly after sending...
> >>>
> >>>>> reiser4_pre_commit_hook_bitmap(): allocates all relocated nodes in COMMIT BITMAP
> >>>> "Relocated" is bad term here: nodes with new data also get the flag
> >>>> JNODE_RELOC. So, I would rather say, that it applies freshly allocated
> >>>> nodes of the atom to COMMIT BITMAP.
> >>>>
> >>>>
> >>>>> deallocates ->delete_set from COMMIT BITMAP
> >>>> applies the atom's delete_set to COMMIT BITMAP
> >>> I've meant exactly this.
> >>>
> >>>>> reiser4_post_commit_hook(): deallocates ->delete_set using !BA_DEFER
> >>>>> (i. e. from WORKING BITMAP)
> >>>> applies the atom's delete_set to WORKING BITMAP.
> >>> ditto
> >>>
> >>>> I would also mention a function of the middle-level block allocator
> >>>> reiser4_alloc_blocks(): allocates blocks in WORKING BITMAP.
> >>> Yes, sure.
> >>>
> >>>
> >>>> Note that the middle-level block allocator (block_alloc.c) actually
> >>>> manipulates
> >>>> with abstract space maps. Currently in reiser4 they are represented only by
> >>>> bitmaps (plugin/space/bitmap.c). We can also implement another
> >>>> representation -
> >>>> extent tree (like in XFS). I don't see any needs for now, though.
> >>> Extent trees seem to be interesting.. They claim that they are more efficient -
> >>> is the difference that huge?
> >>>
> >>>>> TIMELINE OF ALLOCATIONS FOR "USUAL" NODES, AND TIMELINE OF TRANSACTION COMMIT
> >>>>> - nodes are allocated using reiser4_alloc_blocks() and setting JNODE_RELOC,
> >>>>> so WORKING BITMAP ensures that two nodes cannot get the same block;
> >>>>> - nodes are deallocated using reiser4_dealloc_blocks(BA_DEFER),
> >>>>> so their deallocation is not immediately reflected in WORKING BITMAP;
> >>>>> (the relocate set is written here)
> >>>>> - reiser4_pre_commit_hook_bitmap() uses 1) JNODE_RELOC flag and 2) ->delete_set
> >>>>> to convey effective bitmap changes into COMMIT BITMAP;
> >>>>> (the journal and overwrite set are written here)
> >>>>> - reiser4_post_commit_hook() uses ->delete_set to convey deallocations
> >>>>> from step 2 to WORKING BITMAP.
> >>>>> (the discard happens here)
> >>>>>
> >>>>>
> >>>>> TIMELINE OF ALLOCATIONS FOR WANDERED JOURNAL BLOCKS
> >>>>> - at commit time, blocks are allocated using reiser4_alloc_blocks(), so they
> >>>>> are allocated in WORKING BITMAP and do not interfere with any "usual" blocks;
> >>>>> - after writing wandered blocks, they are deallocated using
> >>>>> reiser4_dealloc_blocks(!BA_DEFER), i. e. from the WORKING BITMAP.
> >>>> So, the system of working and commit bitmaps plus the delete set seems
> >>>> to be redundant? I think this is because of performance reasons: block
> >>>> allocation is critical thing...
> >>> Seems like it is -- for performance and simplicity (== robustness).
> >>>
> >>>>> CONCLUSION
> >>>>> At possible transaction replay time, journal blocks are not allocated in any
> >>>>> of the bitmaps. However, because the journal is read and replayed before a
> >>>>> transaction has a chance to commit, this fact does not matter.
> >>>>> What matters is that wandered journal blocks never hit COMMIT BITMAP.
> >>>>>
> >>>>> So, if I've got all this correct (which I highly doubt), the disk space leak
> >>>>> (as you pointed it out) does not exist.
> >>>> It seems, you are right..
> >>>>
> >>>>> What exists is a rather different problem with my idea of "log every
> >>>>> deallocated block". Current implementation logs every block regardless of
> >>>>> BA_DEFER flag presence or absence, so non-wandered blocks are logged twice.
> >>>>>
> >>>>> We could just use ->delete_set, but we would lose wandered blocks then.
> >>>>> Or we could only log !BA_DEFER requests, which would do the right thing
> >>>>> (wandered blocks + deallocations from reiser4_post_commit_hook()), but
> >>>>> the reasoning behind this decision would not be obvious for a casual
> >>>>> code reader.
> >>>> I think that a good comment will save the situation..
> >>>>
> >>>>> Or we could log only wandered blocks (in addition to ->delete_set)
> >>>>> at discard time, but this is messy and requires us to merge the discard log
> >>>>> with ->delete_set at discard time.
> >>>> what is the difference with the previous "we could.."?
> >>> The previous option is to log all !BA_DEFER requests in addition to
> >>> ->delete_set, so those two block sets would be partially overlapping.
> >>
> >> Hmm.. Are they really overlapped?
> >>
> >> reiser4_dealloc_blocks() looks like the following:
> >> {
> >> if (flags & BA_DEFER)
> >> log in the delete_set;
> >> else
> >> deallocate in working bitmap
> >> ...
> >> }
> >>
> >> so (!BA_DEFER) requests and the delete_set look disjoint,
> >> or, I missed something?
> > You're right, I've misread the code. I was under impression that
> > reiser4_post_commit_hook() internally calls reiser4_dealloc_blocks(!BA_DEFER).
> >
> > So the second option (which I prefer) is to log all sa_dealloc_blocks() calls.
> > That is, effectively, ->delete_set plus wandered journal block deallocations.
>
>
> Yes, I think to do something like this:
>
> reiser4_dealloc_blocks()
> {
> if (flags & BA_DEFER) {
> ...
> } else {
> ...
> if (discard is turned on)
> /* we'll want to discard these blocks, which are not to
> be represented
> in the COMMIT SPACE MAP, so store them in a
> separate list */
> do {
> blocknr_set_add_extent(atom,
> &atom->delete_set_for_wander, &bsep, start, len);
> ...
> } while (ret == -E_REPEAT);
> sa_dealloc_blocks(reiser4_get_space_allocator(ctx->super), *start, *len);
> ...
> }
>
> Then join the sets (->delete_set, and ->delete_set_for_wander) at
> discard time
> (should be pretty cheap operation). insert a check at every atoms
> fusion, that
> ->delete_set_for_wander of both atoms are empty).
Hm. IIRC, ultimately we wanted to use rb-trees for storing discard block sets,
but this means that we'll need to convert two blocknr_sets into an rb-tree
at discard time, which pretty defeats their purpose in our context... Or do you
say that blocknr_set itself should be made rb-tree?
Thanks,
--
Ivan Shapovalov / intelfx /
>
>
> >
> >>> And this option is to log only those !BA_DEFER requests which are not in
> >>> ->delete_set, so no block will be stored twice. Thus we will have to consider
> >>> both ->delete_set and this hypothetical ->discard_set at discard time.
> >>>
> >>>>> Or we could log wandered blocks straight into ->delete_set and do something in
> >>>>> reiser4_post_commit_hook() to separate these entries, but this is super messy.
> >>>>>
> >>>>> I'm preferring the second way... Edward, please proof-read all this.
> >>>> BTW, what about your current implementation? Does it work for you?
> >>> Yes, it apparently does (no massive corruptions).
> >>>
> >>> I get batches of "orphan cluster" errors from time to time (from offset 0 till
> >>> the supposed end of file),
> >>
> >> this is fsck complaints?
> > Yes. (I do `fsck.reiser4 --build-fs` pretty routinely...)
> >
> >>> but 1) I do not see which files get corrupted,
> >>> 2) this could be due to txmod=wa, 3) my machine is pretty crashy now due to
> >>> some driver issues, so I don't know what exactly causes this.
> >>>
> >>> Thanks for proof-reading. What do you think about all this?
> >>
> >> I think that not bad for the beginning! I have purchased an SSD card,
> >> hope to test the discard stuff in the near future.
> > And I hope to finish this feature somewhen... :)
> >
> > Thanks,
>
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 213 bytes --]
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: reiser4: discard implementation, pass 2: allocation issues
2014-06-16 5:03 ` Ivan Shapovalov
@ 2014-06-16 9:24 ` Edward Shishkin
2014-06-16 11:00 ` Ivan Shapovalov
0 siblings, 1 reply; 22+ messages in thread
From: Edward Shishkin @ 2014-06-16 9:24 UTC (permalink / raw)
To: Ivan Shapovalov; +Cc: reiserfs-devel
On 06/16/2014 07:03 AM, Ivan Shapovalov wrote:
> On Monday 16 June 2014 at 02:14:49, Edward Shishkin wrote:
>> On 06/15/2014 11:58 PM, Ivan Shapovalov wrote:
>>> On Sunday 15 June 2014 at 23:49:59, Edward Shishkin wrote:
>>>> On 06/15/2014 08:07 PM, Ivan Shapovalov wrote:
>>>>> On Sunday 15 June 2014 at 19:36:05, Edward Shishkin wrote:
>>>>>> On 06/13/2014 10:28 PM, Ivan Shapovalov wrote:
>>>>>>> Here is my "analysis" of what happens in reiser4 during a transaction's
>>>>>>> lifetime wrt. block allocation and deallocation.
>>>>>>>
>>>>>>>
>>>>>>> THE EFFECTS (SEMANTICS) OF RELATED FUNCTIONS
>>>>>>> reiser4_alloc_blocks_bitmap(): allocates in WORKING BITMAP
>>>>>> Yes.
>>>>>>
>>>>>>> reiser4_dealloc_blocks_bitmap(!BA_DEFER): deallocates from WORKING BITMAP
>>>>>>> reiser4_dealloc_blocks_bitmap(BA_DEFER): stores to ->delete_set
>>>>>> This is correct for the middle-level allocator (without suffix "bitmap").
>>>>>> The low-level one frees blocks only in WORKING BITMAP.
>>>>> Yes, this was a wording mistake. I've noticed it shortly after sending...
>>>>>
>>>>>>> reiser4_pre_commit_hook_bitmap(): allocates all relocated nodes in COMMIT BITMAP
>>>>>> "Relocated" is bad term here: nodes with new data also get the flag
>>>>>> JNODE_RELOC. So, I would rather say, that it applies freshly allocated
>>>>>> nodes of the atom to COMMIT BITMAP.
>>>>>>
>>>>>>
>>>>>>> deallocates ->delete_set from COMMIT BITMAP
>>>>>> applies the atom's delete_set to COMMIT BITMAP
>>>>> I've meant exactly this.
>>>>>
>>>>>>> reiser4_post_commit_hook(): deallocates ->delete_set using !BA_DEFER
>>>>>>> (i. e. from WORKING BITMAP)
>>>>>> applies the atom's delete_set to WORKING BITMAP.
>>>>> ditto
>>>>>
>>>>>> I would also mention a function of the middle-level block allocator
>>>>>> reiser4_alloc_blocks(): allocates blocks in WORKING BITMAP.
>>>>> Yes, sure.
>>>>>
>>>>>
>>>>>> Note that the middle-level block allocator (block_alloc.c) actually
>>>>>> manipulates
>>>>>> with abstract space maps. Currently in reiser4 they are represented only by
>>>>>> bitmaps (plugin/space/bitmap.c). We can also implement another
>>>>>> representation -
>>>>>> extent tree (like in XFS). I don't see any needs for now, though.
>>>>> Extent trees seem to be interesting.. They claim that they are more efficient -
>>>>> is the difference that huge?
>>>>>
>>>>>>> TIMELINE OF ALLOCATIONS FOR "USUAL" NODES, AND TIMELINE OF TRANSACTION COMMIT
>>>>>>> - nodes are allocated using reiser4_alloc_blocks() and setting JNODE_RELOC,
>>>>>>> so WORKING BITMAP ensures that two nodes cannot get the same block;
>>>>>>> - nodes are deallocated using reiser4_dealloc_blocks(BA_DEFER),
>>>>>>> so their deallocation is not immediately reflected in WORKING BITMAP;
>>>>>>> (the relocate set is written here)
>>>>>>> - reiser4_pre_commit_hook_bitmap() uses 1) JNODE_RELOC flag and 2) ->delete_set
>>>>>>> to convey effective bitmap changes into COMMIT BITMAP;
>>>>>>> (the journal and overwrite set are written here)
>>>>>>> - reiser4_post_commit_hook() uses ->delete_set to convey deallocations
>>>>>>> from step 2 to WORKING BITMAP.
>>>>>>> (the discard happens here)
>>>>>>>
>>>>>>>
>>>>>>> TIMELINE OF ALLOCATIONS FOR WANDERED JOURNAL BLOCKS
>>>>>>> - at commit time, blocks are allocated using reiser4_alloc_blocks(), so they
>>>>>>> are allocated in WORKING BITMAP and do not interfere with any "usual" blocks;
>>>>>>> - after writing wandered blocks, they are deallocated using
>>>>>>> reiser4_dealloc_blocks(!BA_DEFER), i. e. from the WORKING BITMAP.
>>>>>> So, the system of working and commit bitmaps plus the delete set seems
>>>>>> to be redundant? I think this is because of performance reasons: block
>>>>>> allocation is critical thing...
>>>>> Seems like it is -- for performance and simplicity (== robustness).
>>>>>
>>>>>>> CONCLUSION
>>>>>>> At possible transaction replay time, journal blocks are not allocated in any
>>>>>>> of the bitmaps. However, because the journal is read and replayed before a
>>>>>>> transaction has a chance to commit, this fact does not matter.
>>>>>>> What matters is that wandered journal blocks never hit COMMIT BITMAP.
>>>>>>>
>>>>>>> So, if I've got all this correct (which I highly doubt), the disk space leak
>>>>>>> (as you pointed it out) does not exist.
>>>>>> It seems, you are right..
>>>>>>
>>>>>>> What exists is a rather different problem with my idea of "log every
>>>>>>> deallocated block". Current implementation logs every block regardless of
>>>>>>> BA_DEFER flag presence or absence, so non-wandered blocks are logged twice.
>>>>>>>
>>>>>>> We could just use ->delete_set, but we would lose wandered blocks then.
>>>>>>> Or we could only log !BA_DEFER requests, which would do the right thing
>>>>>>> (wandered blocks + deallocations from reiser4_post_commit_hook()), but
>>>>>>> the reasoning behind this decision would not be obvious for a casual
>>>>>>> code reader.
>>>>>> I think that a good comment will save the situation..
>>>>>>
>>>>>>> Or we could log only wandered blocks (in addition to ->delete_set)
>>>>>>> at discard time, but this is messy and requires us to merge the discard log
>>>>>>> with ->delete_set at discard time.
>>>>>> what is the difference with the previous "we could.."?
>>>>> The previous option is to log all !BA_DEFER requests in addition to
>>>>> ->delete_set, so those two block sets would be partially overlapping.
>>>> Hmm.. Are they really overlapped?
>>>>
>>>> reiser4_dealloc_blocks() looks like the following:
>>>> {
>>>> if (flags & BA_DEFER)
>>>> log in the delete_set;
>>>> else
>>>> deallocate in working bitmap
>>>> ...
>>>> }
>>>>
>>>> so (!BA_DEFER) requests and the delete_set look disjoint,
>>>> or, I missed something?
>>> You're right, I've misread the code. I was under impression that
>>> reiser4_post_commit_hook() internally calls reiser4_dealloc_blocks(!BA_DEFER).
>>>
>>> So the second option (which I prefer) is to log all sa_dealloc_blocks() calls.
>>> That is, effectively, ->delete_set plus wandered journal block deallocations.
>>
>> Yes, I think to do something like this:
>>
>> reiser4_dealloc_blocks()
>> {
>> if (flags & BA_DEFER) {
>> ...
>> } else {
>> ...
>> if (discard is turned on)
>> /* we'll want to discard these blocks, which are not to
>> be represented
>> in the COMMIT SPACE MAP, so store them in a
>> separate list */
>> do {
>> blocknr_set_add_extent(atom,
>> &atom->delete_set_for_wander, &bsep, start, len);
>> ...
>> } while (ret == -E_REPEAT);
>> sa_dealloc_blocks(reiser4_get_space_allocator(ctx->super), *start, *len);
>> ...
>> }
>>
>> Then join the sets (->delete_set, and ->delete_set_for_wander) at
>> discard time
>> (should be pretty cheap operation). insert a check at every atoms
>> fusion, that
>> ->delete_set_for_wander of both atoms are empty).
> Hm. IIRC, ultimately we wanted to use rb-trees for storing discard block sets,
> but this means that we'll need to convert two blocknr_sets into an rb-tree
> at discard time, which pretty defeats their purpose in our context... Or do you
> say that blocknr_set itself should be made rb-tree?
I didn't know about the delete sets. They are not optional and lists are
better for them (as they don't require sorting). So I suggest to confine
with lists for now, and may be to try the fast-merge trees in the future
(if there is a great desire).
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: reiser4: discard implementation, pass 2: allocation issues
2014-06-16 9:24 ` Edward Shishkin
@ 2014-06-16 11:00 ` Ivan Shapovalov
2014-06-16 11:32 ` Edward Shishkin
0 siblings, 1 reply; 22+ messages in thread
From: Ivan Shapovalov @ 2014-06-16 11:00 UTC (permalink / raw)
To: reiserfs-devel; +Cc: Edward Shishkin
[-- Attachment #1: Type: text/plain, Size: 8580 bytes --]
On Monday 16 June 2014 at 11:24:39, Edward Shishkin wrote:
>
> On 06/16/2014 07:03 AM, Ivan Shapovalov wrote:
> > On Monday 16 June 2014 at 02:14:49, Edward Shishkin wrote:
> >> On 06/15/2014 11:58 PM, Ivan Shapovalov wrote:
> >>> On Sunday 15 June 2014 at 23:49:59, Edward Shishkin wrote:
> >>>> On 06/15/2014 08:07 PM, Ivan Shapovalov wrote:
> >>>>> On Sunday 15 June 2014 at 19:36:05, Edward Shishkin wrote:
> >>>>>> On 06/13/2014 10:28 PM, Ivan Shapovalov wrote:
> >>>>>>> Here is my "analysis" of what happens in reiser4 during a transaction's
> >>>>>>> lifetime wrt. block allocation and deallocation.
> >>>>>>>
> >>>>>>>
> >>>>>>> THE EFFECTS (SEMANTICS) OF RELATED FUNCTIONS
> >>>>>>> reiser4_alloc_blocks_bitmap(): allocates in WORKING BITMAP
> >>>>>> Yes.
> >>>>>>
> >>>>>>> reiser4_dealloc_blocks_bitmap(!BA_DEFER): deallocates from WORKING BITMAP
> >>>>>>> reiser4_dealloc_blocks_bitmap(BA_DEFER): stores to ->delete_set
> >>>>>> This is correct for the middle-level allocator (without suffix "bitmap").
> >>>>>> The low-level one frees blocks only in WORKING BITMAP.
> >>>>> Yes, this was a wording mistake. I've noticed it shortly after sending...
> >>>>>
> >>>>>>> reiser4_pre_commit_hook_bitmap(): allocates all relocated nodes in COMMIT BITMAP
> >>>>>> "Relocated" is bad term here: nodes with new data also get the flag
> >>>>>> JNODE_RELOC. So, I would rather say, that it applies freshly allocated
> >>>>>> nodes of the atom to COMMIT BITMAP.
> >>>>>>
> >>>>>>
> >>>>>>> deallocates ->delete_set from COMMIT BITMAP
> >>>>>> applies the atom's delete_set to COMMIT BITMAP
> >>>>> I've meant exactly this.
> >>>>>
> >>>>>>> reiser4_post_commit_hook(): deallocates ->delete_set using !BA_DEFER
> >>>>>>> (i. e. from WORKING BITMAP)
> >>>>>> applies the atom's delete_set to WORKING BITMAP.
> >>>>> ditto
> >>>>>
> >>>>>> I would also mention a function of the middle-level block allocator
> >>>>>> reiser4_alloc_blocks(): allocates blocks in WORKING BITMAP.
> >>>>> Yes, sure.
> >>>>>
> >>>>>
> >>>>>> Note that the middle-level block allocator (block_alloc.c) actually
> >>>>>> manipulates
> >>>>>> with abstract space maps. Currently in reiser4 they are represented only by
> >>>>>> bitmaps (plugin/space/bitmap.c). We can also implement another
> >>>>>> representation -
> >>>>>> extent tree (like in XFS). I don't see any needs for now, though.
> >>>>> Extent trees seem to be interesting.. They claim that they are more efficient -
> >>>>> is the difference that huge?
> >>>>>
> >>>>>>> TIMELINE OF ALLOCATIONS FOR "USUAL" NODES, AND TIMELINE OF TRANSACTION COMMIT
> >>>>>>> - nodes are allocated using reiser4_alloc_blocks() and setting JNODE_RELOC,
> >>>>>>> so WORKING BITMAP ensures that two nodes cannot get the same block;
> >>>>>>> - nodes are deallocated using reiser4_dealloc_blocks(BA_DEFER),
> >>>>>>> so their deallocation is not immediately reflected in WORKING BITMAP;
> >>>>>>> (the relocate set is written here)
> >>>>>>> - reiser4_pre_commit_hook_bitmap() uses 1) JNODE_RELOC flag and 2) ->delete_set
> >>>>>>> to convey effective bitmap changes into COMMIT BITMAP;
> >>>>>>> (the journal and overwrite set are written here)
> >>>>>>> - reiser4_post_commit_hook() uses ->delete_set to convey deallocations
> >>>>>>> from step 2 to WORKING BITMAP.
> >>>>>>> (the discard happens here)
> >>>>>>>
> >>>>>>>
> >>>>>>> TIMELINE OF ALLOCATIONS FOR WANDERED JOURNAL BLOCKS
> >>>>>>> - at commit time, blocks are allocated using reiser4_alloc_blocks(), so they
> >>>>>>> are allocated in WORKING BITMAP and do not interfere with any "usual" blocks;
> >>>>>>> - after writing wandered blocks, they are deallocated using
> >>>>>>> reiser4_dealloc_blocks(!BA_DEFER), i. e. from the WORKING BITMAP.
> >>>>>> So, the system of working and commit bitmaps plus the delete set seems
> >>>>>> to be redundant? I think this is because of performance reasons: block
> >>>>>> allocation is critical thing...
> >>>>> Seems like it is -- for performance and simplicity (== robustness).
> >>>>>
> >>>>>>> CONCLUSION
> >>>>>>> At possible transaction replay time, journal blocks are not allocated in any
> >>>>>>> of the bitmaps. However, because the journal is read and replayed before a
> >>>>>>> transaction has a chance to commit, this fact does not matter.
> >>>>>>> What matters is that wandered journal blocks never hit COMMIT BITMAP.
> >>>>>>>
> >>>>>>> So, if I've got all this correct (which I highly doubt), the disk space leak
> >>>>>>> (as you pointed it out) does not exist.
> >>>>>> It seems, you are right..
> >>>>>>
> >>>>>>> What exists is a rather different problem with my idea of "log every
> >>>>>>> deallocated block". Current implementation logs every block regardless of
> >>>>>>> BA_DEFER flag presence or absence, so non-wandered blocks are logged twice.
> >>>>>>>
> >>>>>>> We could just use ->delete_set, but we would lose wandered blocks then.
> >>>>>>> Or we could only log !BA_DEFER requests, which would do the right thing
> >>>>>>> (wandered blocks + deallocations from reiser4_post_commit_hook()), but
> >>>>>>> the reasoning behind this decision would not be obvious for a casual
> >>>>>>> code reader.
> >>>>>> I think that a good comment will save the situation..
> >>>>>>
> >>>>>>> Or we could log only wandered blocks (in addition to ->delete_set)
> >>>>>>> at discard time, but this is messy and requires us to merge the discard log
> >>>>>>> with ->delete_set at discard time.
> >>>>>> what is the difference with the previous "we could.."?
> >>>>> The previous option is to log all !BA_DEFER requests in addition to
> >>>>> ->delete_set, so those two block sets would be partially overlapping.
> >>>> Hmm.. Are they really overlapped?
> >>>>
> >>>> reiser4_dealloc_blocks() looks like the following:
> >>>> {
> >>>> if (flags & BA_DEFER)
> >>>> log in the delete_set;
> >>>> else
> >>>> deallocate in working bitmap
> >>>> ...
> >>>> }
> >>>>
> >>>> so (!BA_DEFER) requests and the delete_set look disjoint,
> >>>> or, I missed something?
> >>> You're right, I've misread the code. I was under impression that
> >>> reiser4_post_commit_hook() internally calls reiser4_dealloc_blocks(!BA_DEFER).
> >>>
> >>> So the second option (which I prefer) is to log all sa_dealloc_blocks() calls.
> >>> That is, effectively, ->delete_set plus wandered journal block deallocations.
> >>
> >> Yes, I think to do something like this:
> >>
> >> reiser4_dealloc_blocks()
> >> {
> >> if (flags & BA_DEFER) {
> >> ...
> >> } else {
> >> ...
> >> if (discard is turned on)
> >> /* we'll want to discard these blocks, which are not to
> >> be represented
> >> in the COMMIT SPACE MAP, so store them in a
> >> separate list */
> >> do {
> >> blocknr_set_add_extent(atom,
> >> &atom->delete_set_for_wander, &bsep, start, len);
> >> ...
> >> } while (ret == -E_REPEAT);
> >> sa_dealloc_blocks(reiser4_get_space_allocator(ctx->super), *start, *len);
> >> ...
> >> }
> >>
> >> Then join the sets (->delete_set, and ->delete_set_for_wander) at
> >> discard time
> >> (should be pretty cheap operation). insert a check at every atoms
> >> fusion, that
> >> ->delete_set_for_wander of both atoms are empty).
> > Hm. IIRC, ultimately we wanted to use rb-trees for storing discard block sets,
> > but this means that we'll need to convert two blocknr_sets into an rb-tree
> > at discard time, which pretty defeats their purpose in our context... Or do you
> > say that blocknr_set itself should be made rb-tree?
>
> I didn't know about the delete sets. They are not optional and lists are
> better for them (as they don't require sorting). So I suggest to confine
> with lists for now, and may be to try the fast-merge trees in the future
> (if there is a great desire).
The discard algorithm still needs its input to be sorted, and blocknr_set is
inherently unordered.
So there are two options:
- log into ->delete_set and ->delete_set_for_wander (as you've proposed),
then merge them into a sorted list at discard time;
- log into ->discard_set (just like it is now), just fix what to log.
What do you think?
--
Ivan Shapovalov / intelfx /
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 213 bytes --]
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: reiser4: discard implementation, pass 2: allocation issues
2014-06-16 11:00 ` Ivan Shapovalov
@ 2014-06-16 11:32 ` Edward Shishkin
2014-06-16 11:47 ` Ivan Shapovalov
0 siblings, 1 reply; 22+ messages in thread
From: Edward Shishkin @ 2014-06-16 11:32 UTC (permalink / raw)
To: Ivan Shapovalov; +Cc: reiserfs-devel
On 06/16/2014 01:00 PM, Ivan Shapovalov wrote:
> On Monday 16 June 2014 at 11:24:39, Edward Shishkin wrote:
>> On 06/16/2014 07:03 AM, Ivan Shapovalov wrote:
>>> On Monday 16 June 2014 at 02:14:49, Edward Shishkin wrote:
>>>> On 06/15/2014 11:58 PM, Ivan Shapovalov wrote:
>>>>> On Sunday 15 June 2014 at 23:49:59, Edward Shishkin wrote:
>>>>>> On 06/15/2014 08:07 PM, Ivan Shapovalov wrote:
>>>>>>> On Sunday 15 June 2014 at 19:36:05, Edward Shishkin wrote:
>>>>>>>> On 06/13/2014 10:28 PM, Ivan Shapovalov wrote:
>>>>>>>>> Here is my "analysis" of what happens in reiser4 during a transaction's
>>>>>>>>> lifetime wrt. block allocation and deallocation.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> THE EFFECTS (SEMANTICS) OF RELATED FUNCTIONS
>>>>>>>>> reiser4_alloc_blocks_bitmap(): allocates in WORKING BITMAP
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>> reiser4_dealloc_blocks_bitmap(!BA_DEFER): deallocates from WORKING BITMAP
>>>>>>>>> reiser4_dealloc_blocks_bitmap(BA_DEFER): stores to ->delete_set
>>>>>>>> This is correct for the middle-level allocator (without suffix "bitmap").
>>>>>>>> The low-level one frees blocks only in WORKING BITMAP.
>>>>>>> Yes, this was a wording mistake. I've noticed it shortly after sending...
>>>>>>>
>>>>>>>>> reiser4_pre_commit_hook_bitmap(): allocates all relocated nodes in COMMIT BITMAP
>>>>>>>> "Relocated" is bad term here: nodes with new data also get the flag
>>>>>>>> JNODE_RELOC. So, I would rather say, that it applies freshly allocated
>>>>>>>> nodes of the atom to COMMIT BITMAP.
>>>>>>>>
>>>>>>>>
>>>>>>>>> deallocates ->delete_set from COMMIT BITMAP
>>>>>>>> applies the atom's delete_set to COMMIT BITMAP
>>>>>>> I've meant exactly this.
>>>>>>>
>>>>>>>>> reiser4_post_commit_hook(): deallocates ->delete_set using !BA_DEFER
>>>>>>>>> (i. e. from WORKING BITMAP)
>>>>>>>> applies the atom's delete_set to WORKING BITMAP.
>>>>>>> ditto
>>>>>>>
>>>>>>>> I would also mention a function of the middle-level block allocator
>>>>>>>> reiser4_alloc_blocks(): allocates blocks in WORKING BITMAP.
>>>>>>> Yes, sure.
>>>>>>>
>>>>>>>
>>>>>>>> Note that the middle-level block allocator (block_alloc.c) actually
>>>>>>>> manipulates
>>>>>>>> with abstract space maps. Currently in reiser4 they are represented only by
>>>>>>>> bitmaps (plugin/space/bitmap.c). We can also implement another
>>>>>>>> representation -
>>>>>>>> extent tree (like in XFS). I don't see any needs for now, though.
>>>>>>> Extent trees seem to be interesting.. They claim that they are more efficient -
>>>>>>> is the difference that huge?
>>>>>>>
>>>>>>>>> TIMELINE OF ALLOCATIONS FOR "USUAL" NODES, AND TIMELINE OF TRANSACTION COMMIT
>>>>>>>>> - nodes are allocated using reiser4_alloc_blocks() and setting JNODE_RELOC,
>>>>>>>>> so WORKING BITMAP ensures that two nodes cannot get the same block;
>>>>>>>>> - nodes are deallocated using reiser4_dealloc_blocks(BA_DEFER),
>>>>>>>>> so their deallocation is not immediately reflected in WORKING BITMAP;
>>>>>>>>> (the relocate set is written here)
>>>>>>>>> - reiser4_pre_commit_hook_bitmap() uses 1) JNODE_RELOC flag and 2) ->delete_set
>>>>>>>>> to convey effective bitmap changes into COMMIT BITMAP;
>>>>>>>>> (the journal and overwrite set are written here)
>>>>>>>>> - reiser4_post_commit_hook() uses ->delete_set to convey deallocations
>>>>>>>>> from step 2 to WORKING BITMAP.
>>>>>>>>> (the discard happens here)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> TIMELINE OF ALLOCATIONS FOR WANDERED JOURNAL BLOCKS
>>>>>>>>> - at commit time, blocks are allocated using reiser4_alloc_blocks(), so they
>>>>>>>>> are allocated in WORKING BITMAP and do not interfere with any "usual" blocks;
>>>>>>>>> - after writing wandered blocks, they are deallocated using
>>>>>>>>> reiser4_dealloc_blocks(!BA_DEFER), i. e. from the WORKING BITMAP.
>>>>>>>> So, the system of working and commit bitmaps plus the delete set seems
>>>>>>>> to be redundant? I think this is because of performance reasons: block
>>>>>>>> allocation is critical thing...
>>>>>>> Seems like it is -- for performance and simplicity (== robustness).
>>>>>>>
>>>>>>>>> CONCLUSION
>>>>>>>>> At possible transaction replay time, journal blocks are not allocated in any
>>>>>>>>> of the bitmaps. However, because the journal is read and replayed before a
>>>>>>>>> transaction has a chance to commit, this fact does not matter.
>>>>>>>>> What matters is that wandered journal blocks never hit COMMIT BITMAP.
>>>>>>>>>
>>>>>>>>> So, if I've got all this correct (which I highly doubt), the disk space leak
>>>>>>>>> (as you pointed it out) does not exist.
>>>>>>>> It seems, you are right..
>>>>>>>>
>>>>>>>>> What exists is a rather different problem with my idea of "log every
>>>>>>>>> deallocated block". Current implementation logs every block regardless of
>>>>>>>>> BA_DEFER flag presence or absence, so non-wandered blocks are logged twice.
>>>>>>>>>
>>>>>>>>> We could just use ->delete_set, but we would lose wandered blocks then.
>>>>>>>>> Or we could only log !BA_DEFER requests, which would do the right thing
>>>>>>>>> (wandered blocks + deallocations from reiser4_post_commit_hook()), but
>>>>>>>>> the reasoning behind this decision would not be obvious for a casual
>>>>>>>>> code reader.
>>>>>>>> I think that a good comment will save the situation..
>>>>>>>>
>>>>>>>>> Or we could log only wandered blocks (in addition to ->delete_set)
>>>>>>>>> at discard time, but this is messy and requires us to merge the discard log
>>>>>>>>> with ->delete_set at discard time.
>>>>>>>> what is the difference with the previous "we could.."?
>>>>>>> The previous option is to log all !BA_DEFER requests in addition to
>>>>>>> ->delete_set, so those two block sets would be partially overlapping.
>>>>>> Hmm.. Are they really overlapped?
>>>>>>
>>>>>> reiser4_dealloc_blocks() looks like the following:
>>>>>> {
>>>>>> if (flags & BA_DEFER)
>>>>>> log in the delete_set;
>>>>>> else
>>>>>> deallocate in working bitmap
>>>>>> ...
>>>>>> }
>>>>>>
>>>>>> so (!BA_DEFER) requests and the delete_set look disjoint,
>>>>>> or, I missed something?
>>>>> You're right, I've misread the code. I was under impression that
>>>>> reiser4_post_commit_hook() internally calls reiser4_dealloc_blocks(!BA_DEFER).
>>>>>
>>>>> So the second option (which I prefer) is to log all sa_dealloc_blocks() calls.
>>>>> That is, effectively, ->delete_set plus wandered journal block deallocations.
>>>> Yes, I think to do something like this:
>>>>
>>>> reiser4_dealloc_blocks()
>>>> {
>>>> if (flags & BA_DEFER) {
>>>> ...
>>>> } else {
>>>> ...
>>>> if (discard is turned on)
>>>> /* we'll want to discard these blocks, which are not to
>>>> be represented
>>>> in the COMMIT SPACE MAP, so store them in a
>>>> separate list */
>>>> do {
>>>> blocknr_set_add_extent(atom,
>>>> &atom->delete_set_for_wander, &bsep, start, len);
>>>> ...
>>>> } while (ret == -E_REPEAT);
>>>> sa_dealloc_blocks(reiser4_get_space_allocator(ctx->super), *start, *len);
>>>> ...
>>>> }
>>>>
>>>> Then join the sets (->delete_set, and ->delete_set_for_wander) at
>>>> discard time
>>>> (should be pretty cheap operation). insert a check at every atoms
>>>> fusion, that
>>>> ->delete_set_for_wander of both atoms are empty).
>>> Hm. IIRC, ultimately we wanted to use rb-trees for storing discard block sets,
>>> but this means that we'll need to convert two blocknr_sets into an rb-tree
>>> at discard time, which pretty defeats their purpose in our context... Or do you
>>> say that blocknr_set itself should be made rb-tree?
>> I didn't know about the delete sets. They are not optional and lists are
>> better for them (as they don't require sorting). So I suggest to confine
>> with lists for now, and may be to try the fast-merge trees in the future
>> (if there is a great desire).
> The discard algorithm still needs its input to be sorted, and blocknr_set is
> inherently unordered.
>
> So there are two options:
> - log into ->delete_set and ->delete_set_for_wander (as you've proposed),
> then merge them into a sorted list at discard time;
> - log into ->discard_set (just like it is now), just fix what to log.
>
> What do you think?
Are the blocknr sets comfortable?
Say, can we organize the discard process via the blocknr_set_iterator()?
If yes, then let's do everything via the blocknr sets (i.e. let's
implement the
first option). Add needed operations to blocknrset.c, the discard iteration
actor to discard.c, etc.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: reiser4: discard implementation, pass 2: allocation issues
2014-06-16 11:32 ` Edward Shishkin
@ 2014-06-16 11:47 ` Ivan Shapovalov
2014-06-17 0:37 ` Edward Shishkin
0 siblings, 1 reply; 22+ messages in thread
From: Ivan Shapovalov @ 2014-06-16 11:47 UTC (permalink / raw)
To: Edward Shishkin; +Cc: reiserfs-devel
[-- Attachment #1: Type: text/plain, Size: 9487 bytes --]
On Monday 16 June 2014 at 13:32:02, Edward Shishkin wrote:
> On 06/16/2014 01:00 PM, Ivan Shapovalov wrote:
> > On Monday 16 June 2014 at 11:24:39, Edward Shishkin wrote:
> >> On 06/16/2014 07:03 AM, Ivan Shapovalov wrote:
> >>> On Monday 16 June 2014 at 02:14:49, Edward Shishkin wrote:
> >>>> On 06/15/2014 11:58 PM, Ivan Shapovalov wrote:
> >>>>> On Sunday 15 June 2014 at 23:49:59, Edward Shishkin wrote:
> >>>>>> On 06/15/2014 08:07 PM, Ivan Shapovalov wrote:
> >>>>>>> On Sunday 15 June 2014 at 19:36:05, Edward Shishkin wrote:
> >>>>>>>> On 06/13/2014 10:28 PM, Ivan Shapovalov wrote:
> >>>>>>>>> Here is my "analysis" of what happens in reiser4 during a transaction's
> >>>>>>>>> lifetime wrt. block allocation and deallocation.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> THE EFFECTS (SEMANTICS) OF RELATED FUNCTIONS
> >>>>>>>>> reiser4_alloc_blocks_bitmap(): allocates in WORKING BITMAP
> >>>>>>>> Yes.
> >>>>>>>>
> >>>>>>>>> reiser4_dealloc_blocks_bitmap(!BA_DEFER): deallocates from WORKING BITMAP
> >>>>>>>>> reiser4_dealloc_blocks_bitmap(BA_DEFER): stores to ->delete_set
> >>>>>>>> This is correct for the middle-level allocator (without suffix "bitmap").
> >>>>>>>> The low-level one frees blocks only in WORKING BITMAP.
> >>>>>>> Yes, this was a wording mistake. I've noticed it shortly after sending...
> >>>>>>>
> >>>>>>>>> reiser4_pre_commit_hook_bitmap(): allocates all relocated nodes in COMMIT BITMAP
> >>>>>>>> "Relocated" is bad term here: nodes with new data also get the flag
> >>>>>>>> JNODE_RELOC. So, I would rather say, that it applies freshly allocated
> >>>>>>>> nodes of the atom to COMMIT BITMAP.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> deallocates ->delete_set from COMMIT BITMAP
> >>>>>>>> applies the atom's delete_set to COMMIT BITMAP
> >>>>>>> I've meant exactly this.
> >>>>>>>
> >>>>>>>>> reiser4_post_commit_hook(): deallocates ->delete_set using !BA_DEFER
> >>>>>>>>> (i. e. from WORKING BITMAP)
> >>>>>>>> applies the atom's delete_set to WORKING BITMAP.
> >>>>>>> ditto
> >>>>>>>
> >>>>>>>> I would also mention a function of the middle-level block allocator
> >>>>>>>> reiser4_alloc_blocks(): allocates blocks in WORKING BITMAP.
> >>>>>>> Yes, sure.
> >>>>>>>
> >>>>>>>
> >>>>>>>> Note that the middle-level block allocator (block_alloc.c) actually
> >>>>>>>> manipulates
> >>>>>>>> with abstract space maps. Currently in reiser4 they are represented only by
> >>>>>>>> bitmaps (plugin/space/bitmap.c). We can also implement another
> >>>>>>>> representation -
> >>>>>>>> extent tree (like in XFS). I don't see any needs for now, though.
> >>>>>>> Extent trees seem to be interesting.. They claim that they are more efficient -
> >>>>>>> is the difference that huge?
> >>>>>>>
> >>>>>>>>> TIMELINE OF ALLOCATIONS FOR "USUAL" NODES, AND TIMELINE OF TRANSACTION COMMIT
> >>>>>>>>> - nodes are allocated using reiser4_alloc_blocks() and setting JNODE_RELOC,
> >>>>>>>>> so WORKING BITMAP ensures that two nodes cannot get the same block;
> >>>>>>>>> - nodes are deallocated using reiser4_dealloc_blocks(BA_DEFER),
> >>>>>>>>> so their deallocation is not immediately reflected in WORKING BITMAP;
> >>>>>>>>> (the relocate set is written here)
> >>>>>>>>> - reiser4_pre_commit_hook_bitmap() uses 1) JNODE_RELOC flag and 2) ->delete_set
> >>>>>>>>> to convey effective bitmap changes into COMMIT BITMAP;
> >>>>>>>>> (the journal and overwrite set are written here)
> >>>>>>>>> - reiser4_post_commit_hook() uses ->delete_set to convey deallocations
> >>>>>>>>> from step 2 to WORKING BITMAP.
> >>>>>>>>> (the discard happens here)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> TIMELINE OF ALLOCATIONS FOR WANDERED JOURNAL BLOCKS
> >>>>>>>>> - at commit time, blocks are allocated using reiser4_alloc_blocks(), so they
> >>>>>>>>> are allocated in WORKING BITMAP and do not interfere with any "usual" blocks;
> >>>>>>>>> - after writing wandered blocks, they are deallocated using
> >>>>>>>>> reiser4_dealloc_blocks(!BA_DEFER), i. e. from the WORKING BITMAP.
> >>>>>>>> So, the system of working and commit bitmaps plus the delete set seems
> >>>>>>>> to be redundant? I think this is because of performance reasons: block
> >>>>>>>> allocation is critical thing...
> >>>>>>> Seems like it is -- for performance and simplicity (== robustness).
> >>>>>>>
> >>>>>>>>> CONCLUSION
> >>>>>>>>> At possible transaction replay time, journal blocks are not allocated in any
> >>>>>>>>> of the bitmaps. However, because the journal is read and replayed before a
> >>>>>>>>> transaction has a chance to commit, this fact does not matter.
> >>>>>>>>> What matters is that wandered journal blocks never hit COMMIT BITMAP.
> >>>>>>>>>
> >>>>>>>>> So, if I've got all this correct (which I highly doubt), the disk space leak
> >>>>>>>>> (as you pointed it out) does not exist.
> >>>>>>>> It seems, you are right..
> >>>>>>>>
> >>>>>>>>> What exists is a rather different problem with my idea of "log every
> >>>>>>>>> deallocated block". Current implementation logs every block regardless of
> >>>>>>>>> BA_DEFER flag presence or absence, so non-wandered blocks are logged twice.
> >>>>>>>>>
> >>>>>>>>> We could just use ->delete_set, but we would lose wandered blocks then.
> >>>>>>>>> Or we could only log !BA_DEFER requests, which would do the right thing
> >>>>>>>>> (wandered blocks + deallocations from reiser4_post_commit_hook()), but
> >>>>>>>>> the reasoning behind this decision would not be obvious for a casual
> >>>>>>>>> code reader.
> >>>>>>>> I think that a good comment will save the situation..
> >>>>>>>>
> >>>>>>>>> Or we could log only wandered blocks (in addition to ->delete_set)
> >>>>>>>>> at discard time, but this is messy and requires us to merge the discard log
> >>>>>>>>> with ->delete_set at discard time.
> >>>>>>>> what is the difference with the previous "we could.."?
> >>>>>>> The previous option is to log all !BA_DEFER requests in addition to
> >>>>>>> ->delete_set, so those two block sets would be partially overlapping.
> >>>>>> Hmm.. Are they really overlapped?
> >>>>>>
> >>>>>> reiser4_dealloc_blocks() looks like the following:
> >>>>>> {
> >>>>>> if (flags & BA_DEFER)
> >>>>>> log in the delete_set;
> >>>>>> else
> >>>>>> deallocate in working bitmap
> >>>>>> ...
> >>>>>> }
> >>>>>>
> >>>>>> so (!BA_DEFER) requests and the delete_set look disjoint,
> >>>>>> or, I missed something?
> >>>>> You're right, I've misread the code. I was under impression that
> >>>>> reiser4_post_commit_hook() internally calls reiser4_dealloc_blocks(!BA_DEFER).
> >>>>>
> >>>>> So the second option (which I prefer) is to log all sa_dealloc_blocks() calls.
> >>>>> That is, effectively, ->delete_set plus wandered journal block deallocations.
> >>>> Yes, I think to do something like this:
> >>>>
> >>>> reiser4_dealloc_blocks()
> >>>> {
> >>>> if (flags & BA_DEFER) {
> >>>> ...
> >>>> } else {
> >>>> ...
> >>>> if (discard is turned on)
> >>>> /* we'll want to discard these blocks, which are not to
> >>>> be represented
> >>>> in the COMMIT SPACE MAP, so store them in a
> >>>> separate list */
> >>>> do {
> >>>> blocknr_set_add_extent(atom,
> >>>> &atom->delete_set_for_wander, &bsep, start, len);
> >>>> ...
> >>>> } while (ret == -E_REPEAT);
> >>>> sa_dealloc_blocks(reiser4_get_space_allocator(ctx->super), *start, *len);
> >>>> ...
> >>>> }
> >>>>
> >>>> Then join the sets (->delete_set, and ->delete_set_for_wander) at
> >>>> discard time
> >>>> (should be pretty cheap operation). insert a check at every atoms
> >>>> fusion, that
> >>>> ->delete_set_for_wander of both atoms are empty).
> >>> Hm. IIRC, ultimately we wanted to use rb-trees for storing discard block sets,
> >>> but this means that we'll need to convert two blocknr_sets into an rb-tree
> >>> at discard time, which pretty defeats their purpose in our context... Or do you
> >>> say that blocknr_set itself should be made rb-tree?
> >> I didn't know about the delete sets. They are not optional and lists are
> >> better for them (as they don't require sorting). So I suggest to confine
> >> with lists for now, and may be to try the fast-merge trees in the future
> >> (if there is a great desire).
> > The discard algorithm still needs its input to be sorted, and blocknr_set is
> > inherently unordered.
> >
> > So there are two options:
> > - log into ->delete_set and ->delete_set_for_wander (as you've proposed),
> > then merge them into a sorted list at discard time;
> > - log into ->discard_set (just like it is now), just fix what to log.
> >
> > What do you think?
>
>
> Are the blocknr sets comfortable?
> Say, can we organize the discard process via the blocknr_set_iterator()?
> If yes, then let's do everything via the blocknr sets (i.e. let's
> implement the
> first option). Add needed operations to blocknrset.c, the discard iteration
> actor to discard.c, etc.
No, they aren't, and this is a problem. As I said, they are not just unsorted,
they are unsortable as I understand them.
--
Ivan Shapovalov / intelfx /
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 213 bytes --]
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: reiser4: discard implementation, pass 2: allocation issues
2014-06-16 11:47 ` Ivan Shapovalov
@ 2014-06-17 0:37 ` Edward Shishkin
2014-06-17 10:14 ` Ivan Shapovalov
0 siblings, 1 reply; 22+ messages in thread
From: Edward Shishkin @ 2014-06-17 0:37 UTC (permalink / raw)
To: Ivan Shapovalov; +Cc: reiserfs-devel
On 06/16/2014 01:47 PM, Ivan Shapovalov wrote:
[...]
>
> Are the blocknr sets comfortable?
> Say, can we organize the discard process via the blocknr_set_iterator()?
> If yes, then let's do everything via the blocknr sets (i.e. let's
> implement the
> first option). Add needed operations to blocknrset.c, the discard iteration
> actor to discard.c, etc.
> No, they aren't, and this is a problem. As I said, they are not just unsorted,
> they are unsortable as I understand them.
>
Yup, blocknr sets minimize memory consumption and are unsortable...
I think that the cleanest option will be using lists (instead of blocknr
sets) for
the delete sets, if the discard is turned on. It will reduce memory
consumption
by 20%. Indeed, every entry in a blocknr_set occupies ~8 bytes (assuming
that everything is pretty fragmented because of txmod=wa), whereas a list
entry occupies 32 bytes (start, length, plus 2 pointers for the link).
In this option we'll need to join lists (instead of merging blocknr
sets) during
atoms fusion and apply the list (instead of blocknr set) to the COMMIT
BITMAP
at pre_commit_hook(). I think it won't be a problem, since the lists are
simpler
than blocknr sets.
The next option is to leave everything as is.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: reiser4: discard implementation, pass 2: allocation issues
2014-06-17 0:37 ` Edward Shishkin
@ 2014-06-17 10:14 ` Ivan Shapovalov
2014-06-17 10:29 ` Edward Shishkin
0 siblings, 1 reply; 22+ messages in thread
From: Ivan Shapovalov @ 2014-06-17 10:14 UTC (permalink / raw)
To: Edward Shishkin; +Cc: reiserfs-devel
[-- Attachment #1: Type: text/plain, Size: 1017 bytes --]
On Tuesday 17 June 2014 at 02:37:16, Edward Shishkin wrote:
>
> [...]
>
> Yup, blocknr sets minimize memory consumption and are unsortable...
>
> I think that the cleanest option will be using lists (instead of blocknr
> sets) for
> the delete sets, if the discard is turned on. It will reduce memory
> consumption
> by 20%. Indeed, every entry in a blocknr_set occupies ~8 bytes (assuming
> that everything is pretty fragmented because of txmod=wa), whereas a list
> entry occupies 32 bytes (start, length, plus 2 pointers for the link).
>
> In this option we'll need to join lists (instead of merging blocknr
> sets) during
> atoms fusion and apply the list (instead of blocknr set) to the COMMIT
> BITMAP
> at pre_commit_hook(). I think it won't be a problem, since the lists are
> simpler
> than blocknr sets.
That's a neat approach. I think I'll use unions and do the decision at runtime.
--
Ivan Shapovalov / intelfx /
>
> The next option is to leave everything as is.
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 213 bytes --]
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: reiser4: discard implementation, pass 2: allocation issues
2014-06-17 10:14 ` Ivan Shapovalov
@ 2014-06-17 10:29 ` Edward Shishkin
2014-06-17 18:31 ` Ivan Shapovalov
0 siblings, 1 reply; 22+ messages in thread
From: Edward Shishkin @ 2014-06-17 10:29 UTC (permalink / raw)
To: Ivan Shapovalov; +Cc: reiserfs-devel
On 06/17/2014 12:14 PM, Ivan Shapovalov wrote:
> On Tuesday 17 June 2014 at 02:37:16, Edward Shishkin wrote:
>> [...]
>>
>> Yup, blocknr sets minimize memory consumption and are unsortable...
>>
>> I think that the cleanest option will be using lists (instead of blocknr
>> sets) for
>> the delete sets, if the discard is turned on. It will reduce memory
>> consumption
>> by 20%. Indeed, every entry in a blocknr_set occupies ~8 bytes (assuming
>> that everything is pretty fragmented because of txmod=wa), whereas a list
>> entry occupies 32 bytes (start, length, plus 2 pointers for the link).
>>
>> In this option we'll need to join lists (instead of merging blocknr
>> sets) during
>> atoms fusion and apply the list (instead of blocknr set) to the COMMIT
>> BITMAP
>> at pre_commit_hook(). I think it won't be a problem, since the lists are
>> simpler
>> than blocknr sets.
> That's a neat approach. I think I'll use unions and do the decision at runtime.
Yup.
So, if discard is on, we work with 2 lists (delete_set,
delete_set_for_wander).
If discard is off, we work with one blocknr set..
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: reiser4: discard implementation, pass 2: allocation issues
2014-06-17 10:29 ` Edward Shishkin
@ 2014-06-17 18:31 ` Ivan Shapovalov
2014-06-17 20:47 ` Ivan Shapovalov
2014-06-18 0:30 ` Edward Shishkin
0 siblings, 2 replies; 22+ messages in thread
From: Ivan Shapovalov @ 2014-06-17 18:31 UTC (permalink / raw)
To: Edward Shishkin; +Cc: reiserfs-devel
[-- Attachment #1: Type: text/plain, Size: 1980 bytes --]
On Tuesday 17 June 2014 at 12:29:53, Edward Shishkin wrote:
>
> On 06/17/2014 12:14 PM, Ivan Shapovalov wrote:
> > On Tuesday 17 June 2014 at 02:37:16, Edward Shishkin wrote:
> >> [...]
> >>
> >> Yup, blocknr sets minimize memory consumption and are unsortable...
> >>
> >> I think that the cleanest option will be using lists (instead of blocknr
> >> sets) for
> >> the delete sets, if the discard is turned on. It will reduce memory
> >> consumption
> >> by 20%. Indeed, every entry in a blocknr_set occupies ~8 bytes (assuming
> >> that everything is pretty fragmented because of txmod=wa), whereas a list
> >> entry occupies 32 bytes (start, length, plus 2 pointers for the link).
> >>
> >> In this option we'll need to join lists (instead of merging blocknr
> >> sets) during
> >> atoms fusion and apply the list (instead of blocknr set) to the COMMIT
> >> BITMAP
> >> at pre_commit_hook(). I think it won't be a problem, since the lists are
> >> simpler
> >> than blocknr sets.
> > That's a neat approach. I think I'll use unions and do the decision at runtime.
>
>
> Yup.
> So, if discard is on, we work with 2 lists (delete_set,
> delete_set_for_wander).
> If discard is off, we work with one blocknr set..
Good. So I'll do roughly following for v5:
- rename discard_set_* to block_list_* and split off these definitions
- write a family of reiser4_atom_dset_*() (log_deferred, log_immediate,
apply_deferred, merge, init, destroy) which will encapsulate discard/nodiscard
check and operate on correct lists (blocknr_set vs block_list)
- call reiser4_atom_dset_{init,destroy,merge}() from respective functions
- call reiser4_atom_dset_log_{deferred,immediate}() from reiser4_dealloc_blocks()
- call reiser4_atom_dset_apply_deferred() from reiser4_post_commit_hook()
- directly manipulate the block lists from discard_atom(), checking that we
indeed have discard enabled
Is this OK?
--
Ivan Shapovalov / intelfx /
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 213 bytes --]
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: reiser4: discard implementation, pass 2: allocation issues
2014-06-17 18:31 ` Ivan Shapovalov
@ 2014-06-17 20:47 ` Ivan Shapovalov
2014-06-18 1:41 ` Edward Shishkin
2014-06-18 0:30 ` Edward Shishkin
1 sibling, 1 reply; 22+ messages in thread
From: Ivan Shapovalov @ 2014-06-17 20:47 UTC (permalink / raw)
To: reiserfs-devel; +Cc: Edward Shishkin
[-- Attachment #1: Type: text/plain, Size: 1335 bytes --]
On Tuesday 17 June 2014 at 22:31:36, Ivan Shapovalov wrote:
> On Tuesday 17 June 2014 at 12:29:53, Edward Shishkin wrote:
> >
> > [...]
> >
> > Yup.
> > So, if discard is on, we work with 2 lists (delete_set,
> > delete_set_for_wander).
> > If discard is off, we work with one blocknr set..
>
> Good. So I'll do roughly following for v5:
> - rename discard_set_* to block_list_* and split off these definitions
> - write a family of reiser4_atom_dset_*() (log_deferred, log_immediate,
> apply_deferred, merge, init, destroy) which will encapsulate discard/nodiscard
> check and operate on correct lists (blocknr_set vs block_list)
> - call reiser4_atom_dset_{init,destroy,merge}() from respective functions
> - call reiser4_atom_dset_log_{deferred,immediate}() from reiser4_dealloc_blocks()
> - call reiser4_atom_dset_apply_deferred() from reiser4_post_commit_hook()
> - directly manipulate the block lists from discard_atom(), checking that we
> indeed have discard enabled
>
> Is this OK?
BTW, with txn_atoms there is a locking idiom involving E_REPEAT loops.
Is it fine to implement a
current_atom_dset_log_...(...) // E_REPEAT loop inside
instead of
atom_dset_log_...(txn_atom* atom, ...) // may return E_REPEAT
?
I mean, per coding style.
Thanks,
--
Ivan Shapovalov / intelfx /
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 213 bytes --]
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: reiser4: discard implementation, pass 2: allocation issues
2014-06-17 18:31 ` Ivan Shapovalov
2014-06-17 20:47 ` Ivan Shapovalov
@ 2014-06-18 0:30 ` Edward Shishkin
1 sibling, 0 replies; 22+ messages in thread
From: Edward Shishkin @ 2014-06-18 0:30 UTC (permalink / raw)
To: Ivan Shapovalov; +Cc: reiserfs-devel
On 06/17/2014 08:31 PM, Ivan Shapovalov wrote:
> On Tuesday 17 June 2014 at 12:29:53, Edward Shishkin wrote:
>> On 06/17/2014 12:14 PM, Ivan Shapovalov wrote:
>>> On Tuesday 17 June 2014 at 02:37:16, Edward Shishkin wrote:
>>>> [...]
>>>>
>>>> Yup, blocknr sets minimize memory consumption and are unsortable...
>>>>
>>>> I think that the cleanest option will be using lists (instead of blocknr
>>>> sets) for
>>>> the delete sets, if the discard is turned on. It will reduce memory
>>>> consumption
>>>> by 20%. Indeed, every entry in a blocknr_set occupies ~8 bytes (assuming
>>>> that everything is pretty fragmented because of txmod=wa), whereas a list
>>>> entry occupies 32 bytes (start, length, plus 2 pointers for the link).
>>>>
>>>> In this option we'll need to join lists (instead of merging blocknr
>>>> sets) during
>>>> atoms fusion and apply the list (instead of blocknr set) to the COMMIT
>>>> BITMAP
>>>> at pre_commit_hook(). I think it won't be a problem, since the lists are
>>>> simpler
>>>> than blocknr sets.
>>> That's a neat approach. I think I'll use unions and do the decision at runtime.
>>
>> Yup.
>> So, if discard is on, we work with 2 lists (delete_set,
>> delete_set_for_wander).
>> If discard is off, we work with one blocknr set..
> Good. So I'll do roughly following for v5:
> - rename discard_set_* to block_list_* and split off these definitions
How about blocknr_lst_*?
> - write a family of reiser4_atom_dset_*() (log_deferred, log_immediate,
Can we avoid "log" if possible? Logs live in the journals...
Let's use "add_extent" instead of "log".
> apply_deferred, merge, init, destroy) which will encapsulate discard/nodiscard
> check and operate on correct lists (blocknr_set vs block_list)
> - call reiser4_atom_dset_{init,destroy,merge}() from respective functions
> - call reiser4_atom_dset_log_{deferred,immediate}() from reiser4_dealloc_blocks()
Let's do something like this:
reiser4_dealloc_blocks
{
...
if (flags & BA_DEFER)
atom_dset_defer_add_extent(atom, &bsep, start, len);
else
....
sa_dealloc_blocks();
atom_dset_immed_add_extent(...);
...
}
where
atom_dset_defer_add_extent() encapsulates
1) the loop
do {
blocknr_set_add_extent (atom, &atom->delete_set_defer, &bsep,
start, len);
...
} while (ret == -E_REPEAT);
2) blocknr_lst_add_extent (&atom->delete_set_defer, start, len);
atom_dset_immed_add_extent() encapsulates
blocknr_lst_add_extent (&atom->delete_set_immed, start, len);
etc.
> - call reiser4_atom_dset_apply_deferred() from reiser4_post_commit_hook()
> - directly manipulate the block lists from discard_atom(), checking that we
> indeed have discard enabled
OK.
Thanks,
Edward.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: reiser4: discard implementation, pass 2: allocation issues
2014-06-17 20:47 ` Ivan Shapovalov
@ 2014-06-18 1:41 ` Edward Shishkin
2014-06-18 9:55 ` Ivan Shapovalov
0 siblings, 1 reply; 22+ messages in thread
From: Edward Shishkin @ 2014-06-18 1:41 UTC (permalink / raw)
To: Ivan Shapovalov, reiserfs-devel
On 06/17/2014 10:47 PM, Ivan Shapovalov wrote:
> On Tuesday 17 June 2014 at 22:31:36, Ivan Shapovalov wrote:
>> On Tuesday 17 June 2014 at 12:29:53, Edward Shishkin wrote:
>>> [...]
>>>
>>> Yup.
>>> So, if discard is on, we work with 2 lists (delete_set,
>>> delete_set_for_wander).
>>> If discard is off, we work with one blocknr set..
>> Good. So I'll do roughly following for v5:
>> - rename discard_set_* to block_list_* and split off these definitions
>> - write a family of reiser4_atom_dset_*() (log_deferred, log_immediate,
>> apply_deferred, merge, init, destroy) which will encapsulate discard/nodiscard
>> check and operate on correct lists (blocknr_set vs block_list)
>> - call reiser4_atom_dset_{init,destroy,merge}() from respective functions
>> - call reiser4_atom_dset_log_{deferred,immediate}() from reiser4_dealloc_blocks()
>> - call reiser4_atom_dset_apply_deferred() from reiser4_post_commit_hook()
>> - directly manipulate the block lists from discard_atom(), checking that we
>> indeed have discard enabled
>>
>> Is this OK?
> BTW, with txn_atoms there is a locking idiom involving E_REPEAT loops.
>
> Is it fine to implement a
> current_atom_dset_log_...(...) // E_REPEAT loop inside
> instead of
> atom_dset_log_...(txn_atom* atom, ...)
IMHO it is not needed.
atom = get_current_atom_locked();
atom_dset_{defer, immed}_add_extent(atom, ...); // <-- the loop is here
spin_unlock_atom(atom);
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: reiser4: discard implementation, pass 2: allocation issues
2014-06-18 1:41 ` Edward Shishkin
@ 2014-06-18 9:55 ` Ivan Shapovalov
2014-06-18 11:49 ` Edward Shishkin
0 siblings, 1 reply; 22+ messages in thread
From: Ivan Shapovalov @ 2014-06-18 9:55 UTC (permalink / raw)
To: reiserfs-devel; +Cc: Edward Shishkin
[-- Attachment #1: Type: text/plain, Size: 2378 bytes --]
On Wednesday 18 June 2014 at 03:41:04, Edward Shishkin wrote:
>
> On 06/17/2014 10:47 PM, Ivan Shapovalov wrote:
> > On Tuesday 17 June 2014 at 22:31:36, Ivan Shapovalov wrote:
> >> On Tuesday 17 June 2014 at 12:29:53, Edward Shishkin wrote:
> >>> [...]
> >>>
> >>> Yup.
> >>> So, if discard is on, we work with 2 lists (delete_set,
> >>> delete_set_for_wander).
> >>> If discard is off, we work with one blocknr set..
> >> Good. So I'll do roughly following for v5:
> >> - rename discard_set_* to block_list_* and split off these definitions
> >> - write a family of reiser4_atom_dset_*() (log_deferred, log_immediate,
> >> apply_deferred, merge, init, destroy) which will encapsulate discard/nodiscard
> >> check and operate on correct lists (blocknr_set vs block_list)
> >> - call reiser4_atom_dset_{init,destroy,merge}() from respective functions
> >> - call reiser4_atom_dset_log_{deferred,immediate}() from reiser4_dealloc_blocks()
> >> - call reiser4_atom_dset_apply_deferred() from reiser4_post_commit_hook()
> >> - directly manipulate the block lists from discard_atom(), checking that we
> >> indeed have discard enabled
> >>
> >> Is this OK?
> > BTW, with txn_atoms there is a locking idiom involving E_REPEAT loops.
> >
> > Is it fine to implement a
> > current_atom_dset_log_...(...) // E_REPEAT loop inside
> > instead of
> > atom_dset_log_...(txn_atom* atom, ...)
>
> IMHO it is not needed.
>
> atom = get_current_atom_locked();
> atom_dset_{defer, immed}_add_extent(atom, ...); // <-- the loop is here
> spin_unlock_atom(atom);
IIUC, the point of the E_REPEAT loop is that memory allocations must be done
with atom spinlock released, and once released, the atom may "disappear", so
it is unsafe to use the same pointer again. Hence there is a loop of
void *data = NULL;
do {
atom = get_current_atom_locked();
ret = blocknr_set_add_extent(atom, &atom->delete_set, &data, ...);
} while (ret == -E_REPEAT);
where `data` is a piece of "state" saved across the two blocknr_set_add_extent()
calls. On first iteration of the loop, atom lock is released, memory is
allocated and pointer is stored in `data`, and -E_REPEAT is returned.
On second iteration of the loop, the previously allocated memory is accessed
and the atom is modified under its spinlock.
--
Ivan Shapovalov / intelfx /
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 213 bytes --]
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: reiser4: discard implementation, pass 2: allocation issues
2014-06-18 9:55 ` Ivan Shapovalov
@ 2014-06-18 11:49 ` Edward Shishkin
2014-06-18 12:26 ` Ivan Shapovalov
0 siblings, 1 reply; 22+ messages in thread
From: Edward Shishkin @ 2014-06-18 11:49 UTC (permalink / raw)
To: Ivan Shapovalov; +Cc: reiserfs-devel
On 06/18/2014 11:55 AM, Ivan Shapovalov wrote:
> On Wednesday 18 June 2014 at 03:41:04, Edward Shishkin wrote:
>> On 06/17/2014 10:47 PM, Ivan Shapovalov wrote:
>>> On Tuesday 17 June 2014 at 22:31:36, Ivan Shapovalov wrote:
>>>> On Tuesday 17 June 2014 at 12:29:53, Edward Shishkin wrote:
>>>>> [...]
>>>>>
>>>>> Yup.
>>>>> So, if discard is on, we work with 2 lists (delete_set,
>>>>> delete_set_for_wander).
>>>>> If discard is off, we work with one blocknr set..
>>>> Good. So I'll do roughly following for v5:
>>>> - rename discard_set_* to block_list_* and split off these definitions
>>>> - write a family of reiser4_atom_dset_*() (log_deferred, log_immediate,
>>>> apply_deferred, merge, init, destroy) which will encapsulate discard/nodiscard
>>>> check and operate on correct lists (blocknr_set vs block_list)
>>>> - call reiser4_atom_dset_{init,destroy,merge}() from respective functions
>>>> - call reiser4_atom_dset_log_{deferred,immediate}() from reiser4_dealloc_blocks()
>>>> - call reiser4_atom_dset_apply_deferred() from reiser4_post_commit_hook()
>>>> - directly manipulate the block lists from discard_atom(), checking that we
>>>> indeed have discard enabled
>>>>
>>>> Is this OK?
>>> BTW, with txn_atoms there is a locking idiom involving E_REPEAT loops.
>>>
>>> Is it fine to implement a
>>> current_atom_dset_log_...(...) // E_REPEAT loop inside
>>> instead of
>>> atom_dset_log_...(txn_atom* atom, ...)
>> IMHO it is not needed.
>>
>> atom = get_current_atom_locked();
>> atom_dset_{defer, immed}_add_extent(atom, ...); // <-- the loop is here
>> spin_unlock_atom(atom);
> IIUC, the point of the E_REPEAT loop is that memory allocations must be done
> with atom spinlock released, and once released,
You are right, we need something similar for the lists. Otherwise,
deadlocks are possible (they will appear on high-load systems).
Not necessarily in the first edition: first make sure that the base
stuff works.
> the atom may "disappear", so
> it is unsafe to use the same pointer again. Hence there is a loop of
>
> void *data = NULL;
>
> do {
> atom = get_current_atom_locked();
> ret = blocknr_set_add_extent(atom, &atom->delete_set, &data, ...);
> } while (ret == -E_REPEAT);
>
> where `data` is a piece of "state" saved across the two blocknr_set_add_extent()
> calls. On first iteration of the loop, atom lock is released, memory is
> allocated and pointer is stored in `data`, and -E_REPEAT is returned.
> On second iteration of the loop, the previously allocated memory is accessed
> and the atom is modified under its spinlock.
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: reiser4: discard implementation, pass 2: allocation issues
2014-06-18 11:49 ` Edward Shishkin
@ 2014-06-18 12:26 ` Ivan Shapovalov
2014-06-18 22:46 ` Edward Shishkin
0 siblings, 1 reply; 22+ messages in thread
From: Ivan Shapovalov @ 2014-06-18 12:26 UTC (permalink / raw)
To: Edward Shishkin; +Cc: reiserfs-devel
[-- Attachment #1: Type: text/plain, Size: 3217 bytes --]
On Wednesday 18 June 2014 at 13:49:30, Edward Shishkin wrote:
> On 06/18/2014 11:55 AM, Ivan Shapovalov wrote:
> > On Wednesday 18 June 2014 at 03:41:04, Edward Shishkin wrote:
> >> On 06/17/2014 10:47 PM, Ivan Shapovalov wrote:
> >>> On Tuesday 17 June 2014 at 22:31:36, Ivan Shapovalov wrote:
> >>>> On Tuesday 17 June 2014 at 12:29:53, Edward Shishkin wrote:
> >>>>> [...]
> >>>>>
> >>>>> Yup.
> >>>>> So, if discard is on, we work with 2 lists (delete_set,
> >>>>> delete_set_for_wander).
> >>>>> If discard is off, we work with one blocknr set..
> >>>> Good. So I'll do roughly following for v5:
> >>>> - rename discard_set_* to block_list_* and split off these definitions
> >>>> - write a family of reiser4_atom_dset_*() (log_deferred, log_immediate,
> >>>> apply_deferred, merge, init, destroy) which will encapsulate discard/nodiscard
> >>>> check and operate on correct lists (blocknr_set vs block_list)
> >>>> - call reiser4_atom_dset_{init,destroy,merge}() from respective functions
> >>>> - call reiser4_atom_dset_log_{deferred,immediate}() from reiser4_dealloc_blocks()
> >>>> - call reiser4_atom_dset_apply_deferred() from reiser4_post_commit_hook()
> >>>> - directly manipulate the block lists from discard_atom(), checking that we
> >>>> indeed have discard enabled
> >>>>
> >>>> Is this OK?
> >>> BTW, with txn_atoms there is a locking idiom involving E_REPEAT loops.
> >>>
> >>> Is it fine to implement a
> >>> current_atom_dset_log_...(...) // E_REPEAT loop inside
> >>> instead of
> >>> atom_dset_log_...(txn_atom* atom, ...)
> >> IMHO it is not needed.
> >>
> >> atom = get_current_atom_locked();
> >> atom_dset_{defer, immed}_add_extent(atom, ...); // <-- the loop is here
> >> spin_unlock_atom(atom);
> > IIUC, the point of the E_REPEAT loop is that memory allocations must be done
> > with atom spinlock released, and once released,
>
>
> You are right, we need something similar for the lists. Otherwise,
> deadlocks are possible (they will appear on high-load systems).
> Not necessarily in the first edition: first make sure that the base
> stuff works.
It's cheap, there is no need to delay this until some ephemeral "final
imlementation". I just wanted to make sure it's OK per coding style to
implement functions like "current_atom_XXX()" which internally do
get_current_atom_locked() and all that stuff. (Or not OK, in which case
I'll push this loop out of the function.)
Thanks,
--
Ivan Shapovalov / intelfx /
>
>
> > the atom may "disappear", so
> > it is unsafe to use the same pointer again. Hence there is a loop of
> >
> > void *data = NULL;
> >
> > do {
> > atom = get_current_atom_locked();
> > ret = blocknr_set_add_extent(atom, &atom->delete_set, &data, ...);
> > } while (ret == -E_REPEAT);
> >
> > where `data` is a piece of "state" saved across the two blocknr_set_add_extent()
> > calls. On first iteration of the loop, atom lock is released, memory is
> > allocated and pointer is stored in `data`, and -E_REPEAT is returned.
> > On second iteration of the loop, the previously allocated memory is accessed
> > and the atom is modified under its spinlock.
> >
>
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 213 bytes --]
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: reiser4: discard implementation, pass 2: allocation issues
2014-06-18 12:26 ` Ivan Shapovalov
@ 2014-06-18 22:46 ` Edward Shishkin
0 siblings, 0 replies; 22+ messages in thread
From: Edward Shishkin @ 2014-06-18 22:46 UTC (permalink / raw)
To: Ivan Shapovalov; +Cc: reiserfs-devel
On 06/18/2014 02:26 PM, Ivan Shapovalov wrote:
> On Wednesday 18 June 2014 at 13:49:30, Edward Shishkin wrote:
>> On 06/18/2014 11:55 AM, Ivan Shapovalov wrote:
>>> On Wednesday 18 June 2014 at 03:41:04, Edward Shishkin wrote:
>>>> On 06/17/2014 10:47 PM, Ivan Shapovalov wrote:
>>>>> On Tuesday 17 June 2014 at 22:31:36, Ivan Shapovalov wrote:
>>>>>> On Tuesday 17 June 2014 at 12:29:53, Edward Shishkin wrote:
>>>>>>> [...]
>>>>>>>
>>>>>>> Yup.
>>>>>>> So, if discard is on, we work with 2 lists (delete_set,
>>>>>>> delete_set_for_wander).
>>>>>>> If discard is off, we work with one blocknr set..
>>>>>> Good. So I'll do roughly following for v5:
>>>>>> - rename discard_set_* to block_list_* and split off these definitions
>>>>>> - write a family of reiser4_atom_dset_*() (log_deferred, log_immediate,
>>>>>> apply_deferred, merge, init, destroy) which will encapsulate discard/nodiscard
>>>>>> check and operate on correct lists (blocknr_set vs block_list)
>>>>>> - call reiser4_atom_dset_{init,destroy,merge}() from respective functions
>>>>>> - call reiser4_atom_dset_log_{deferred,immediate}() from reiser4_dealloc_blocks()
>>>>>> - call reiser4_atom_dset_apply_deferred() from reiser4_post_commit_hook()
>>>>>> - directly manipulate the block lists from discard_atom(), checking that we
>>>>>> indeed have discard enabled
>>>>>>
>>>>>> Is this OK?
>>>>> BTW, with txn_atoms there is a locking idiom involving E_REPEAT loops.
>>>>>
>>>>> Is it fine to implement a
>>>>> current_atom_dset_log_...(...) // E_REPEAT loop inside
>>>>> instead of
>>>>> atom_dset_log_...(txn_atom* atom, ...)
>>>> IMHO it is not needed.
>>>>
>>>> atom = get_current_atom_locked();
>>>> atom_dset_{defer, immed}_add_extent(atom, ...); // <-- the loop is here
>>>> spin_unlock_atom(atom);
>>> IIUC, the point of the E_REPEAT loop is that memory allocations must be done
>>> with atom spinlock released, and once released,
>>
>> You are right, we need something similar for the lists. Otherwise,
>> deadlocks are possible (they will appear on high-load systems).
>> Not necessarily in the first edition: first make sure that the base
>> stuff works.
> It's cheap, there is no need to delay this until some ephemeral "final
> imlementation". I just wanted to make sure it's OK per coding style to
> implement functions like "current_atom_XXX()" which internally do
> get_current_atom_locked() and all that stuff. (Or not OK, in which case
> I'll push this loop out of the function.)
Ah, I see.. Not good. Let's push the loop out of the encapsulation then.
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2014-06-18 22:46 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-06-13 20:28 reiser4: discard implementation, pass 2: allocation issues Ivan Shapovalov
2014-06-15 17:36 ` Edward Shishkin
2014-06-15 18:07 ` Ivan Shapovalov
2014-06-15 21:49 ` Edward Shishkin
2014-06-15 21:58 ` Ivan Shapovalov
2014-06-16 0:14 ` Edward Shishkin
2014-06-16 5:03 ` Ivan Shapovalov
2014-06-16 9:24 ` Edward Shishkin
2014-06-16 11:00 ` Ivan Shapovalov
2014-06-16 11:32 ` Edward Shishkin
2014-06-16 11:47 ` Ivan Shapovalov
2014-06-17 0:37 ` Edward Shishkin
2014-06-17 10:14 ` Ivan Shapovalov
2014-06-17 10:29 ` Edward Shishkin
2014-06-17 18:31 ` Ivan Shapovalov
2014-06-17 20:47 ` Ivan Shapovalov
2014-06-18 1:41 ` Edward Shishkin
2014-06-18 9:55 ` Ivan Shapovalov
2014-06-18 11:49 ` Edward Shishkin
2014-06-18 12:26 ` Ivan Shapovalov
2014-06-18 22:46 ` Edward Shishkin
2014-06-18 0:30 ` Edward Shishkin
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.