Re: reiser4: discard implementation, pass 2: allocation issues

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Edward Shishkin <edward.shishkin@gmail.com>
To: Ivan Shapovalov <intelfx100@gmail.com>
Cc: reiserfs-devel@vger.kernel.org
Subject: Re: reiser4: discard implementation, pass 2: allocation issues
Date: Sun, 15 Jun 2014 23:49:59 +0200	[thread overview]
Message-ID: <539E1507.4070503@gmail.com> (raw)
In-Reply-To: <1427799.NLW5IsuhQY@intelfx-laptop>


On 06/15/2014 08:07 PM, Ivan Shapovalov wrote:
> On Sunday 15 June 2014 at 19:36:05, Edward Shishkin wrote:	
>> On 06/13/2014 10:28 PM, Ivan Shapovalov wrote:
>>> Here is my "analysis" of what happens in reiser4 during a transaction's
>>> lifetime wrt. block allocation and deallocation.
>>>
>>>
>>> THE EFFECTS (SEMANTICS) OF RELATED FUNCTIONS
>>> reiser4_alloc_blocks_bitmap(): allocates in WORKING BITMAP
>> Yes.
>>
>>> reiser4_dealloc_blocks_bitmap(!BA_DEFER): deallocates from WORKING BITMAP
>>> reiser4_dealloc_blocks_bitmap(BA_DEFER): stores to ->delete_set
>> This is correct for the middle-level allocator (without suffix "bitmap").
>> The low-level one frees blocks only in WORKING BITMAP.
> Yes, this was a wording mistake. I've noticed it shortly after sending...
>
>>> reiser4_pre_commit_hook_bitmap(): allocates all relocated nodes in COMMIT BITMAP
>> "Relocated" is bad term here: nodes with new data also get the flag
>> JNODE_RELOC. So, I would rather say, that it applies freshly allocated
>> nodes of the atom to COMMIT BITMAP.
>>
>>
>>>                                     deallocates ->delete_set from COMMIT BITMAP
>> applies the atom's delete_set to COMMIT BITMAP
> I've meant exactly this.
>
>>> reiser4_post_commit_hook(): deallocates ->delete_set using !BA_DEFER
>>>                                                      (i. e. from WORKING BITMAP)
>> applies the atom's delete_set to WORKING BITMAP.
> ditto
>
>> I would also mention a function of the middle-level block allocator
>> reiser4_alloc_blocks(): allocates blocks in WORKING  BITMAP.
> Yes, sure.
>
>
>> Note that the middle-level block allocator (block_alloc.c) actually
>> manipulates
>> with abstract space maps. Currently in reiser4 they are represented only by
>> bitmaps (plugin/space/bitmap.c). We can also implement another
>> representation -
>> extent tree (like in XFS). I don't see any needs for now, though.
> Extent trees seem to be interesting.. They claim that they are more efficient -
> is the difference that huge?
>
>>> TIMELINE OF ALLOCATIONS FOR "USUAL" NODES, AND TIMELINE OF TRANSACTION COMMIT
>>> - nodes are allocated using reiser4_alloc_blocks() and setting JNODE_RELOC,
>>>     so WORKING BITMAP ensures that two nodes cannot get the same block;
>>> - nodes are deallocated using reiser4_dealloc_blocks(BA_DEFER),
>>>     so their deallocation is not immediately reflected in WORKING BITMAP;
>>> (the relocate set is written here)
>>> - reiser4_pre_commit_hook_bitmap() uses 1) JNODE_RELOC flag and 2) ->delete_set
>>>     to convey effective bitmap changes into COMMIT BITMAP;
>>> (the journal and overwrite set are written here)
>>> - reiser4_post_commit_hook() uses ->delete_set to convey deallocations
>>>     from step 2 to WORKING BITMAP.
>>> (the discard happens here)
>>>
>>>
>>> TIMELINE OF ALLOCATIONS FOR WANDERED JOURNAL BLOCKS
>>> - at commit time, blocks are allocated using reiser4_alloc_blocks(), so they
>>>     are allocated in WORKING BITMAP and do not interfere with any "usual" blocks;
>>> - after writing wandered blocks, they are deallocated using
>>>     reiser4_dealloc_blocks(!BA_DEFER), i. e. from the WORKING BITMAP.
>>
>> So, the system of working and commit bitmaps plus the delete set seems
>> to be redundant? I think this is because of performance reasons: block
>> allocation is critical thing...
> Seems like it is -- for performance and simplicity (== robustness).
>
>>> CONCLUSION
>>> At possible transaction replay time, journal blocks are not allocated in any
>>> of the bitmaps. However, because the journal is read and replayed before a
>>> transaction has a chance to commit, this fact does not matter.
>>> What matters is that wandered journal blocks never hit COMMIT BITMAP.
>>>
>>> So, if I've got all this correct (which I highly doubt), the disk space leak
>>> (as you pointed it out) does not exist.
>> It seems, you are right..
>>
>>> What exists is a rather different problem with my idea of "log every
>>> deallocated block". Current implementation logs every block regardless of
>>> BA_DEFER flag presence or absence, so non-wandered blocks are logged twice.
>>>
>>> We could just use ->delete_set, but we would lose wandered blocks then.
>>> Or we could only log !BA_DEFER requests, which would do the right thing
>>> (wandered blocks + deallocations from reiser4_post_commit_hook()), but
>>> the reasoning behind this decision would not be obvious for a casual
>>> code reader.
>> I think that a good comment will save the situation..
>>
>>> Or we could log only wandered blocks (in addition to ->delete_set)
>>> at discard time, but this is messy and requires us to merge the discard log
>>> with ->delete_set at discard time.
>> what is the difference with the previous "we could.."?
> The previous option is to log all !BA_DEFER requests in addition to
> ->delete_set, so those two block sets would be partially overlapping.


Hmm.. Are they really overlapped?

reiser4_dealloc_blocks() looks like the following:
{
if (flags & BA_DEFER)
         log in the delete_set;
else
         deallocate in working bitmap
...
}

so (!BA_DEFER) requests and the delete_set look disjoint,
or, I missed something?

> And this option is to log only those !BA_DEFER requests which are not in
> ->delete_set, so no block will be stored twice. Thus we will have to consider
> both ->delete_set and this hypothetical ->discard_set at discard time.
>
>>> Or we could log wandered blocks straight into ->delete_set and do something in
>>> reiser4_post_commit_hook() to separate these entries, but this is super messy.
>>>
>>> I'm preferring the second way... Edward, please proof-read all this.
>> BTW, what about your current implementation? Does it work for you?
> Yes, it apparently does (no massive corruptions).
>
> I get batches of "orphan cluster" errors from time to time (from offset 0 till
> the supposed end of file),


this is fsck complaints?


>   but 1) I do not see which files get corrupted,
> 2) this could be due to txmod=wa, 3) my machine is pretty crashy now due to
> some driver issues, so I don't know what exactly causes this.
>
> Thanks for proof-reading. What do you think about all this?


I think that not bad for the beginning! I have purchased an SSD card,
hope to test the discard stuff in the near future.

Thanks,
Edward.

next prev parent reply	other threads:[~2014-06-15 21:49 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-06-13 20:28 reiser4: discard implementation, pass 2: allocation issues Ivan Shapovalov
2014-06-15 17:36 ` Edward Shishkin
2014-06-15 18:07   ` Ivan Shapovalov
2014-06-15 21:49     ` Edward Shishkin [this message]
2014-06-15 21:58       ` Ivan Shapovalov
2014-06-16  0:14         ` Edward Shishkin
2014-06-16  5:03           ` Ivan Shapovalov
2014-06-16  9:24             ` Edward Shishkin
2014-06-16 11:00               ` Ivan Shapovalov
2014-06-16 11:32                 ` Edward Shishkin
2014-06-16 11:47                   ` Ivan Shapovalov
2014-06-17  0:37                     ` Edward Shishkin
2014-06-17 10:14                       ` Ivan Shapovalov
2014-06-17 10:29                         ` Edward Shishkin
2014-06-17 18:31                           ` Ivan Shapovalov
2014-06-17 20:47                             ` Ivan Shapovalov
2014-06-18  1:41                               ` Edward Shishkin
2014-06-18  9:55                                 ` Ivan Shapovalov
2014-06-18 11:49                                   ` Edward Shishkin
2014-06-18 12:26                                     ` Ivan Shapovalov
2014-06-18 22:46                                       ` Edward Shishkin
2014-06-18  0:30                             ` Edward Shishkin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=539E1507.4070503@gmail.com \
    --to=edward.shishkin@gmail.com \
    --cc=intelfx100@gmail.com \
    --cc=reiserfs-devel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.