From mboxrd@z Thu Jan  1 00:00:00 1970
From: Edward Shishkin <edward.shishkin@gmail.com>
Subject: Re: reiser4: discard implementation, pass 2: allocation issues
Date: Mon, 16 Jun 2014 02:14:49 +0200
Message-ID: <539E36F9.6070706@gmail.com>
References: <3401431.jj87z7i0xD@intelfx-laptop> <1427799.NLW5IsuhQY@intelfx-laptop> <539E1507.4070503@gmail.com> <1848882.0DacExnpE3@intelfx-laptop>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Return-path: <reiserfs-devel-owner@vger.kernel.org>
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20120113;
        h=message-id:date:from:user-agent:mime-version:to:cc:subject
         :references:in-reply-to:content-type:content-transfer-encoding;
        bh=POiacOwW7mKx5Ya3xAUK8/pTQACupHeWtYvpmWpJuH8=;
        b=fD568tvOnE6NUPcVlPsgBLeKyVq2D/PQ/uHVMr5km3VFQHyG1sH1V4RWOCeNKCtahj
         WvUGMrgprYh3XyTIGlm2qc7Xdre59ph5Ej/qVvPsoqJby5eHgqJ/IxP4aaXq9PbYvLee
         jdWzu9RtmKNUawo/iQaHtuwPxxnt3m4xZxcvu3F3VRn3TDItiTe/niaf5J2PyiSYNLn/
         ScoaVN4bYt0Z3OhqiuOQ8o0ONes//lDXiJbK3SJNObQ8sU5pHIty0NxYaNVDwivf5SRk
         /+u+p9alA6rX/ckfnRob6BHCL+ChQ8J5zijxeokWENsG21p+ImDoDtKXXNzm3Sk6gKTA
         TluA==
In-Reply-To: <1848882.0DacExnpE3@intelfx-laptop>
Sender: reiserfs-devel-owner@vger.kernel.org
List-ID: <reiserfs-devel.vger.kernel.org>
Content-Type: text/plain; charset="us-ascii"; format="flowed"
To: Ivan Shapovalov <intelfx100@gmail.com>
Cc: reiserfs-devel@vger.kernel.org


On 06/15/2014 11:58 PM, Ivan Shapovalov wrote:
> On Sunday 15 June 2014 at 23:49:59, Edward Shishkin wrote:	
>> On 06/15/2014 08:07 PM, Ivan Shapovalov wrote:
>>> On Sunday 15 June 2014 at 19:36:05, Edward Shishkin wrote:	
>>>> On 06/13/2014 10:28 PM, Ivan Shapovalov wrote:
>>>>> Here is my "analysis" of what happens in reiser4 during a transaction's
>>>>> lifetime wrt. block allocation and deallocation.
>>>>>
>>>>>
>>>>> THE EFFECTS (SEMANTICS) OF RELATED FUNCTIONS
>>>>> reiser4_alloc_blocks_bitmap(): allocates in WORKING BITMAP
>>>> Yes.
>>>>
>>>>> reiser4_dealloc_blocks_bitmap(!BA_DEFER): deallocates from WORKING BITMAP
>>>>> reiser4_dealloc_blocks_bitmap(BA_DEFER): stores to ->delete_set
>>>> This is correct for the middle-level allocator (without suffix "bitmap").
>>>> The low-level one frees blocks only in WORKING BITMAP.
>>> Yes, this was a wording mistake. I've noticed it shortly after sending...
>>>
>>>>> reiser4_pre_commit_hook_bitmap(): allocates all relocated nodes in COMMIT BITMAP
>>>> "Relocated" is bad term here: nodes with new data also get the flag
>>>> JNODE_RELOC. So, I would rather say, that it applies freshly allocated
>>>> nodes of the atom to COMMIT BITMAP.
>>>>
>>>>
>>>>>                                      deallocates ->delete_set from COMMIT BITMAP
>>>> applies the atom's delete_set to COMMIT BITMAP
>>> I've meant exactly this.
>>>
>>>>> reiser4_post_commit_hook(): deallocates ->delete_set using !BA_DEFER
>>>>>                                                       (i. e. from WORKING BITMAP)
>>>> applies the atom's delete_set to WORKING BITMAP.
>>> ditto
>>>
>>>> I would also mention a function of the middle-level block allocator
>>>> reiser4_alloc_blocks(): allocates blocks in WORKING  BITMAP.
>>> Yes, sure.
>>>
>>>
>>>> Note that the middle-level block allocator (block_alloc.c) actually
>>>> manipulates
>>>> with abstract space maps. Currently in reiser4 they are represented only by
>>>> bitmaps (plugin/space/bitmap.c). We can also implement another
>>>> representation -
>>>> extent tree (like in XFS). I don't see any needs for now, though.
>>> Extent trees seem to be interesting.. They claim that they are more efficient -
>>> is the difference that huge?
>>>
>>>>> TIMELINE OF ALLOCATIONS FOR "USUAL" NODES, AND TIMELINE OF TRANSACTION COMMIT
>>>>> - nodes are allocated using reiser4_alloc_blocks() and setting JNODE_RELOC,
>>>>>      so WORKING BITMAP ensures that two nodes cannot get the same block;
>>>>> - nodes are deallocated using reiser4_dealloc_blocks(BA_DEFER),
>>>>>      so their deallocation is not immediately reflected in WORKING BITMAP;
>>>>> (the relocate set is written here)
>>>>> - reiser4_pre_commit_hook_bitmap() uses 1) JNODE_RELOC flag and 2) ->delete_set
>>>>>      to convey effective bitmap changes into COMMIT BITMAP;
>>>>> (the journal and overwrite set are written here)
>>>>> - reiser4_post_commit_hook() uses ->delete_set to convey deallocations
>>>>>      from step 2 to WORKING BITMAP.
>>>>> (the discard happens here)
>>>>>
>>>>>
>>>>> TIMELINE OF ALLOCATIONS FOR WANDERED JOURNAL BLOCKS
>>>>> - at commit time, blocks are allocated using reiser4_alloc_blocks(), so they
>>>>>      are allocated in WORKING BITMAP and do not interfere with any "usual" blocks;
>>>>> - after writing wandered blocks, they are deallocated using
>>>>>      reiser4_dealloc_blocks(!BA_DEFER), i. e. from the WORKING BITMAP.
>>>> So, the system of working and commit bitmaps plus the delete set seems
>>>> to be redundant? I think this is because of performance reasons: block
>>>> allocation is critical thing...
>>> Seems like it is -- for performance and simplicity (== robustness).
>>>
>>>>> CONCLUSION
>>>>> At possible transaction replay time, journal blocks are not allocated in any
>>>>> of the bitmaps. However, because the journal is read and replayed before a
>>>>> transaction has a chance to commit, this fact does not matter.
>>>>> What matters is that wandered journal blocks never hit COMMIT BITMAP.
>>>>>
>>>>> So, if I've got all this correct (which I highly doubt), the disk space leak
>>>>> (as you pointed it out) does not exist.
>>>> It seems, you are right..
>>>>
>>>>> What exists is a rather different problem with my idea of "log every
>>>>> deallocated block". Current implementation logs every block regardless of
>>>>> BA_DEFER flag presence or absence, so non-wandered blocks are logged twice.
>>>>>
>>>>> We could just use ->delete_set, but we would lose wandered blocks then.
>>>>> Or we could only log !BA_DEFER requests, which would do the right thing
>>>>> (wandered blocks + deallocations from reiser4_post_commit_hook()), but
>>>>> the reasoning behind this decision would not be obvious for a casual
>>>>> code reader.
>>>> I think that a good comment will save the situation..
>>>>
>>>>> Or we could log only wandered blocks (in addition to ->delete_set)
>>>>> at discard time, but this is messy and requires us to merge the discard log
>>>>> with ->delete_set at discard time.
>>>> what is the difference with the previous "we could.."?
>>> The previous option is to log all !BA_DEFER requests in addition to
>>> ->delete_set, so those two block sets would be partially overlapping.
>>
>> Hmm.. Are they really overlapped?
>>
>> reiser4_dealloc_blocks() looks like the following:
>> {
>> if (flags & BA_DEFER)
>>           log in the delete_set;
>> else
>>           deallocate in working bitmap
>> ...
>> }
>>
>> so (!BA_DEFER) requests and the delete_set look disjoint,
>> or, I missed something?
> You're right, I've misread the code. I was under impression that
> reiser4_post_commit_hook() internally calls reiser4_dealloc_blocks(!BA_DEFER).
>
> So the second option (which I prefer) is to log all sa_dealloc_blocks() calls.
> That is, effectively, ->delete_set plus wandered journal block deallocations.


Yes, I think to do something like this:

reiser4_dealloc_blocks()
{
if (flags & BA_DEFER) {
         ...
         } else {
         ...
         if (discard is turned on)
                 /* we'll want to discard these blocks, which are not to 
be represented
                     in the COMMIT SPACE MAP, so store them in a 
separate list */
                 do {
                         blocknr_set_add_extent(atom, 
&atom->delete_set_for_wander, &bsep, start, len);
                         ...
                 } while (ret == -E_REPEAT);
sa_dealloc_blocks(reiser4_get_space_allocator(ctx->super), *start, *len);
         ...
}

Then join the sets (->delete_set, and ->delete_set_for_wander) at 
discard time
(should be pretty cheap operation). insert a check at every atoms 
fusion, that
->delete_set_for_wander of both atoms are empty).


>
>>> And this option is to log only those !BA_DEFER requests which are not in
>>> ->delete_set, so no block will be stored twice. Thus we will have to consider
>>> both ->delete_set and this hypothetical ->discard_set at discard time.
>>>
>>>>> Or we could log wandered blocks straight into ->delete_set and do something in
>>>>> reiser4_post_commit_hook() to separate these entries, but this is super messy.
>>>>>
>>>>> I'm preferring the second way... Edward, please proof-read all this.
>>>> BTW, what about your current implementation? Does it work for you?
>>> Yes, it apparently does (no massive corruptions).
>>>
>>> I get batches of "orphan cluster" errors from time to time (from offset 0 till
>>> the supposed end of file),
>>
>> this is fsck complaints?
> Yes. (I do `fsck.reiser4 --build-fs` pretty routinely...)
>
>>>    but 1) I do not see which files get corrupted,
>>> 2) this could be due to txmod=wa, 3) my machine is pretty crashy now due to
>>> some driver issues, so I don't know what exactly causes this.
>>>
>>> Thanks for proof-reading. What do you think about all this?
>>
>> I think that not bad for the beginning! I have purchased an SSD card,
>> hope to test the discard stuff in the near future.
> And I hope to finish this feature somewhen... :)
>
> Thanks,