From mboxrd@z Thu Jan 1 00:00:00 1970 From: Edward Shishkin Subject: Re: reiser4: discard implementation, pass 2: allocation issues Date: Mon, 16 Jun 2014 11:24:39 +0200 Message-ID: <539EB7D7.8020407@gmail.com> References: <3401431.jj87z7i0xD@intelfx-laptop> <1848882.0DacExnpE3@intelfx-laptop> <539E36F9.6070706@gmail.com> <1867805.hmMaC7j7al@intelfx-laptop> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type:content-transfer-encoding; bh=Ev6Rb3FlxzKsB1cx5cZiLLLj55yqy6iF4CFQJwi5t7c=; b=wdyEFBJ1F/CzpVE/MUOT+v8tZdGy5dJd3AMJ/RGZg59v+q5POzvaTh8pd8+WeNvszx fPRGmq9nCZ1y3mzrCr8lPmsFbgdm3N/fK4rA5DRAPHqAF/Jza4qwnQ+yPPUFS5ckI7ON m1IG4zsqDQE8aauoG2itpxWmsn6XpphVdfo/MiYxQQeLxlJurZlQWW5qhcVn6rINV0SE Vmc1iifOxbxEyRhSWcn+Be5pfr9SIhXql5hyZMwoo5SPsGsIic78W/DjQcBHmKVUBCpO 8wfxZgiIR9RhW+TRFhOCEm12l/tvl2IBdUsTmd2FucoR244aZli7GQz3eJpQQZf9R3NG 8V0g== In-Reply-To: <1867805.hmMaC7j7al@intelfx-laptop> Sender: reiserfs-devel-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii"; format="flowed" To: Ivan Shapovalov Cc: reiserfs-devel@vger.kernel.org On 06/16/2014 07:03 AM, Ivan Shapovalov wrote: > On Monday 16 June 2014 at 02:14:49, Edward Shishkin wrote: >> On 06/15/2014 11:58 PM, Ivan Shapovalov wrote: >>> On Sunday 15 June 2014 at 23:49:59, Edward Shishkin wrote: >>>> On 06/15/2014 08:07 PM, Ivan Shapovalov wrote: >>>>> On Sunday 15 June 2014 at 19:36:05, Edward Shishkin wrote: >>>>>> On 06/13/2014 10:28 PM, Ivan Shapovalov wrote: >>>>>>> Here is my "analysis" of what happens in reiser4 during a transaction's >>>>>>> lifetime wrt. block allocation and deallocation. >>>>>>> >>>>>>> >>>>>>> THE EFFECTS (SEMANTICS) OF RELATED FUNCTIONS >>>>>>> reiser4_alloc_blocks_bitmap(): allocates in WORKING BITMAP >>>>>> Yes. >>>>>> >>>>>>> reiser4_dealloc_blocks_bitmap(!BA_DEFER): deallocates from WORKING BITMAP >>>>>>> reiser4_dealloc_blocks_bitmap(BA_DEFER): stores to ->delete_set >>>>>> This is correct for the middle-level allocator (without suffix "bitmap"). >>>>>> The low-level one frees blocks only in WORKING BITMAP. >>>>> Yes, this was a wording mistake. I've noticed it shortly after sending... >>>>> >>>>>>> reiser4_pre_commit_hook_bitmap(): allocates all relocated nodes in COMMIT BITMAP >>>>>> "Relocated" is bad term here: nodes with new data also get the flag >>>>>> JNODE_RELOC. So, I would rather say, that it applies freshly allocated >>>>>> nodes of the atom to COMMIT BITMAP. >>>>>> >>>>>> >>>>>>> deallocates ->delete_set from COMMIT BITMAP >>>>>> applies the atom's delete_set to COMMIT BITMAP >>>>> I've meant exactly this. >>>>> >>>>>>> reiser4_post_commit_hook(): deallocates ->delete_set using !BA_DEFER >>>>>>> (i. e. from WORKING BITMAP) >>>>>> applies the atom's delete_set to WORKING BITMAP. >>>>> ditto >>>>> >>>>>> I would also mention a function of the middle-level block allocator >>>>>> reiser4_alloc_blocks(): allocates blocks in WORKING BITMAP. >>>>> Yes, sure. >>>>> >>>>> >>>>>> Note that the middle-level block allocator (block_alloc.c) actually >>>>>> manipulates >>>>>> with abstract space maps. Currently in reiser4 they are represented only by >>>>>> bitmaps (plugin/space/bitmap.c). We can also implement another >>>>>> representation - >>>>>> extent tree (like in XFS). I don't see any needs for now, though. >>>>> Extent trees seem to be interesting.. They claim that they are more efficient - >>>>> is the difference that huge? >>>>> >>>>>>> TIMELINE OF ALLOCATIONS FOR "USUAL" NODES, AND TIMELINE OF TRANSACTION COMMIT >>>>>>> - nodes are allocated using reiser4_alloc_blocks() and setting JNODE_RELOC, >>>>>>> so WORKING BITMAP ensures that two nodes cannot get the same block; >>>>>>> - nodes are deallocated using reiser4_dealloc_blocks(BA_DEFER), >>>>>>> so their deallocation is not immediately reflected in WORKING BITMAP; >>>>>>> (the relocate set is written here) >>>>>>> - reiser4_pre_commit_hook_bitmap() uses 1) JNODE_RELOC flag and 2) ->delete_set >>>>>>> to convey effective bitmap changes into COMMIT BITMAP; >>>>>>> (the journal and overwrite set are written here) >>>>>>> - reiser4_post_commit_hook() uses ->delete_set to convey deallocations >>>>>>> from step 2 to WORKING BITMAP. >>>>>>> (the discard happens here) >>>>>>> >>>>>>> >>>>>>> TIMELINE OF ALLOCATIONS FOR WANDERED JOURNAL BLOCKS >>>>>>> - at commit time, blocks are allocated using reiser4_alloc_blocks(), so they >>>>>>> are allocated in WORKING BITMAP and do not interfere with any "usual" blocks; >>>>>>> - after writing wandered blocks, they are deallocated using >>>>>>> reiser4_dealloc_blocks(!BA_DEFER), i. e. from the WORKING BITMAP. >>>>>> So, the system of working and commit bitmaps plus the delete set seems >>>>>> to be redundant? I think this is because of performance reasons: block >>>>>> allocation is critical thing... >>>>> Seems like it is -- for performance and simplicity (== robustness). >>>>> >>>>>>> CONCLUSION >>>>>>> At possible transaction replay time, journal blocks are not allocated in any >>>>>>> of the bitmaps. However, because the journal is read and replayed before a >>>>>>> transaction has a chance to commit, this fact does not matter. >>>>>>> What matters is that wandered journal blocks never hit COMMIT BITMAP. >>>>>>> >>>>>>> So, if I've got all this correct (which I highly doubt), the disk space leak >>>>>>> (as you pointed it out) does not exist. >>>>>> It seems, you are right.. >>>>>> >>>>>>> What exists is a rather different problem with my idea of "log every >>>>>>> deallocated block". Current implementation logs every block regardless of >>>>>>> BA_DEFER flag presence or absence, so non-wandered blocks are logged twice. >>>>>>> >>>>>>> We could just use ->delete_set, but we would lose wandered blocks then. >>>>>>> Or we could only log !BA_DEFER requests, which would do the right thing >>>>>>> (wandered blocks + deallocations from reiser4_post_commit_hook()), but >>>>>>> the reasoning behind this decision would not be obvious for a casual >>>>>>> code reader. >>>>>> I think that a good comment will save the situation.. >>>>>> >>>>>>> Or we could log only wandered blocks (in addition to ->delete_set) >>>>>>> at discard time, but this is messy and requires us to merge the discard log >>>>>>> with ->delete_set at discard time. >>>>>> what is the difference with the previous "we could.."? >>>>> The previous option is to log all !BA_DEFER requests in addition to >>>>> ->delete_set, so those two block sets would be partially overlapping. >>>> Hmm.. Are they really overlapped? >>>> >>>> reiser4_dealloc_blocks() looks like the following: >>>> { >>>> if (flags & BA_DEFER) >>>> log in the delete_set; >>>> else >>>> deallocate in working bitmap >>>> ... >>>> } >>>> >>>> so (!BA_DEFER) requests and the delete_set look disjoint, >>>> or, I missed something? >>> You're right, I've misread the code. I was under impression that >>> reiser4_post_commit_hook() internally calls reiser4_dealloc_blocks(!BA_DEFER). >>> >>> So the second option (which I prefer) is to log all sa_dealloc_blocks() calls. >>> That is, effectively, ->delete_set plus wandered journal block deallocations. >> >> Yes, I think to do something like this: >> >> reiser4_dealloc_blocks() >> { >> if (flags & BA_DEFER) { >> ... >> } else { >> ... >> if (discard is turned on) >> /* we'll want to discard these blocks, which are not to >> be represented >> in the COMMIT SPACE MAP, so store them in a >> separate list */ >> do { >> blocknr_set_add_extent(atom, >> &atom->delete_set_for_wander, &bsep, start, len); >> ... >> } while (ret == -E_REPEAT); >> sa_dealloc_blocks(reiser4_get_space_allocator(ctx->super), *start, *len); >> ... >> } >> >> Then join the sets (->delete_set, and ->delete_set_for_wander) at >> discard time >> (should be pretty cheap operation). insert a check at every atoms >> fusion, that >> ->delete_set_for_wander of both atoms are empty). > Hm. IIRC, ultimately we wanted to use rb-trees for storing discard block sets, > but this means that we'll need to convert two blocknr_sets into an rb-tree > at discard time, which pretty defeats their purpose in our context... Or do you > say that blocknr_set itself should be made rb-tree? I didn't know about the delete sets. They are not optional and lists are better for them (as they don't require sorting). So I suggest to confine with lists for now, and may be to try the fast-merge trees in the future (if there is a great desire).