From: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
To: Ryusuke Konishi
<konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy
Date: Tue, 10 Mar 2015 21:37:50 +0100 [thread overview]
Message-ID: <54FF561E.7030409@gmx.net> (raw)
In-Reply-To: <20150310.142119.813265940569588216.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
Hi Ryusuke,
Thanks for your thorough review.
On 2015-03-10 06:21, Ryusuke Konishi wrote:
> Hi Andreas,
>
> I looked through whole kernel patches and a part of util patches.
> Overall comments are as follows:
>
> [Algorithm]
> As for algorithm, it looks about OK except for the starvation
> countermeasure. The stavation countermeasure looks adhoc/hacky, but
> it's good that it doesn't change kernel/userland interface; we may be
> able to replace it with better ways in a future or in a revised
> version of this patchset.
>
> (1) Drawback of the starvation countermeasure
> The patch 9/9 looks to make the execution time of chcp operation
> worse since it will scan through sufile to modify live block
> counters. How much does it prolong the execution time ?
I'll do some tests, but I haven't noticed any significant performance
drop. The GC basically does the same thing, every time it selects
segments to reclaim.
> In a use case of nilfs, many snapshots are created and they are
> automatically changed back to plain checkpoints because old
> snapshots are thinned out over time. The patch 9/9 may impact on
> such usage.
>
> (2) Compatibility
> What will happen in the following case:
> 1. Create a file system, use it with the new module, and
> create snapshots.
> 2. Mount it with an old module, and release snapshot with "chcp cp"
> 3. Mount it with the new module, and cleanerd runs gc with
> cost benefit or greedy policy.
Some segments could be subject to starvation. But it would probably only
affect a small number of segments and it could be fixed by "chcp ss
<CP>; chcp cp <CP>".
> (3) Durability against unexpected power failures (just a note)
> The current patchset looks not to cause starvation issue even when
> unexpected power failure occurs during or after executing "chcp
> cp" because nilfs_ioctl_change_cpmode() do changes in a
> transactional way with nilfs_transaction_begin/commit.
> We should always think this kind of situtation to keep consistency.
>
> [Coding Style]
> (4) This patchset has several coding style issues. Please fix them and
> re-check with the latest checkpatch script (script/checkpatch.pl).
I'll fix that. Sorry.
> patch 2:
> WARNING: Prefer kmalloc_array over kmalloc with multiply
> #85: FILE: fs/nilfs2/sufile.c:1192:
> + mc->mc_mods = kmalloc(capacity * sizeof(struct nilfs_sufile_mod),
>
> patch 5,6:
> WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
> #60:
> the same semaphore has to be aquired. So if the DAT-Entry belongs to
>
> WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
> #46:
> be aquired, which blocks the entire SUFILE and effectively turns
>
> WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
> #53:
> afore mentioned lock only needs to be aquired, if the cache is full
>
> (5) sub_sizeof macro:
> The same definition exists as offsetofend() in vfio.h,
> and a patch to move it to stddef.h is now proposed.
>
> Please use the same name, and redefine it only if it's not
> defined:
>
> #ifndef offsetofend
> #define offsetofend(TYPE, MEMBER) \
> (offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))
> #endif
Ok I'll change that.
> [Implementation]
> (6) b_blocknr
> Please do not use bh->b_blocknr to store disk block number. This
> field is used to keep virtual block number except for DAT files.
> It is only replaced to an actual block number during calling
> submit_bh(). Keep this policy.
As far as I can tell, this is only true for blocks of GC inodes and node
blocks. All other buffer_heads are always mapped to on disk blocks by
nilfs_get_block(). I only added the mapping in nilfs_segbuf_submit_bh()
to correctly set the value in b_blocknr to the new location.
> In segment constructor context, you can calculate the disk block
> number from the start disk address of the segment and the block
> index (offset) in the segment.
If I understand you correctly, this approach would give me the on disk
location inside of the segment that is currently constructed. But I need
to know the previous on disk location of the buffer_head. I have to
decrement the counter for the previous segment.
> (7) sufile mod cache
> Consider gathering the cache into nilfs_sufile_info struct and
> stopping to pass it via argument of bmap/sufile/dat interface
> functions. It's hacky, and decreases readability of programs, and
> is bloating changes of this patchset over multiple function
> blocks.
If I use a global structure, I have to protect it with a lock. Since
almost any operation has to modify the counters in the SUFILE, this
would serialize the whole file system.
> The cache should be well designed. It's important to balance the
> performance and locality/transparency of the feature. For
> instance, it can be implemented with radix-tree of objects in
> which each object has a vector of 2^k cache entries.
I'll look into that.
> I think the cache should be written back to the sufile buffers
> only within segment construction context. At least, it should be
> written back in the context in which a transaction lock is held.
>
> In addition, introducing a new bmap lock dependency,
> nilfs_sufile_lock_key, is undesireble. You should avoid it
> by delaying the writeback of cache entries to sufile.
The cache could end up using a lot of memory. In the worst case one
entry per block.
> (8) Changes to the sufile must be finished before dirty buffer
> collection of sufile.
> All mark_buffer_dirty() calls to sufile must be finished
> before or in NILFS_ST_SUFILE stage of nilfs_segctor_collect_blocks().
>
> (You can write fixed figures to sufile after the collection phase
> of sufile by preparatory marking buffer dirty before the
> colection phase.)
>
> In the current patchset, sufile mod cache can be flushed in
> nilfs_segctor_update_palyload_blocknr(), which comes after the
> dirty buffer collection phase.
This is a hard problem. I have to count the blocks added in the
NILFS_ST_DAT stage. I don't know, which SUFILE blocks I have to mark in
advance. I'll have to think about this.
> (9) cpfile is also excluded in the dead block counting like sufile
> cpfile is always changed and written back along with sufile and dat.
> So, cpfile must be excluded from the dead block counting.
> Otherwise, sufile change can trigger cpfile changes, and it in turn
> triggers sufile.
I don't quite understand your example. How exactly can a sufile change
trigger a cpfile change and how can this turn into an infinite loop?
Thanks,
Andreas Rohner
> This also helps to simplify nilfs_dat_commit_end() that the patchset
> added two arguments for the dead block counting in the patchset.
> I mean, "dead" argument and "count_blocks" argument can be unified by
> changing meaning of the "dead" argument.
>
>
> I will add detail comments for patches tonight or another day.
>
> Regards,
> Ryusuke Konishi
>
> On Wed, 25 Feb 2015 09:18:04 +0900 (JST), Ryusuke Konishi wrote:
>> Hi Andreas,
>>
>> Thank you for posting this proposal!
>>
>> I would like to have time to review this series through, but please
>> wait for several days. (This week I'm quite busy until weekend)
>>
>> Thanks,
>> Ryusuke Konishi
>>
>> On Tue, 24 Feb 2015 20:01:35 +0100, Andreas Rohner wrote:
>>> Hi everyone!
>>>
>>> One of the biggest performance problems of NILFS is its
>>> inefficient Timestamp GC policy. This patch set introduces two new GC
>>> policies, namely Cost-Benefit and Greedy.
>>>
>>> The Cost-Benefit policy is nothing new. It has been around for a long
>>> time with log-structured file systems [1]. But it relies on accurate
>>> information, about the number of live blocks in a segment. NILFS
>>> currently does not provide the necessary information. So this patch set
>>> extends the entries in the SUFILE to include a counter for the number of
>>> live blocks. This counter is decremented whenever a file is deleted or
>>> overwritten.
>>>
>>> Except for some tricky parts, the counting of live blocks is quite
>>> trivial. The problem is snapshots. At any time, a checkpoint can be
>>> turned into a snapshot or vice versa. So blocks that are reclaimable at
>>> one point in time, are protected by a snapshot a moment later.
>>>
>>> This patch set does not try to track snapshots at all. Instead it uses a
>>> heuristic approach to prevent the worst case scenario. The performance
>>> is still significantly better than timestamp for my benchmarks.
>>>
>>> The worst case scenario is, the following:
>>>
>>> 1. Segment 1 is written
>>> 2. Snapshot is created
>>> 3. GC tries to reclaim Segment 1, but all blocks are protected
>>> by the Snapshot. The GC has to set the number of live blocks
>>> to maximum to avoid reclaiming this Segment again in the near future.
>>> 4. Snapshot is deleted
>>> 5. Segment 1 is reclaimable, but its counter is so high, that the GC
>>> will never try to reclaim it again.
>>>
>>> To prevent this kind of starvation I use another field in the SUFILE
>>> entry, to store the number of blocks that are protected by a snapshot.
>>> This value is just a heuristic and it is usually set to 0. Only if the
>>> GC reclaims a segment, it is written to the SUFILE entry. The GC has to
>>> check for snapshots anyway, so we get this information for free. By
>>> storing this information in the SUFILE we can avoid starvation in the
>>> following way:
>>>
>>> 1. Segment 1 is written
>>> 2. Snapshot is created
>>> 3. GC tries to reclaim Segment 1, but all blocks are protected
>>> by the Snapshot. The GC has to set the number of live blocks
>>> to maximum to avoid reclaiming this Segment again in the near future.
>>> 4. GC sets the number of snapshot blocks in Segment 1 in the SUFILE
>>> entry
>>> 5. Snapshot is deleted
>>> 6. On Snapshot deletion we walk through every entry in the SUFILE and
>>> reduce the number of live blocks to half, if the number of snapshot
>>> blocks is bigger than half of the maximum.
>>> 7. Segment 1 is reclaimable and the number of live blocks entry is at
>>> half the maximum. The GC will try to reclaim this segment as soon as
>>> there are no other better choices.
>>>
>>> BENCHMARKS:
>>> -----------
>>>
>>> My benchmark is quite simple. It consists of a process, that replays
>>> real NFS traces at a faster speed. It thereby creates relatively
>>> realistic patterns of file creation and deletions. At the same time
>>> multiple snapshots are created and deleted in parallel. I use a 100GB
>>> partition of a Samsung SSD:
>>>
>>> WITH SNAPSHOTS EVERY 5 MINUTES:
>>> --------------------------------------------------------------------
>>> Execution time Wear (Data written to disk)
>>> Timestamp: 100% 100%
>>> Cost-Benefit: 80% 43%
>>>
>>> NO SNAPSHOTS:
>>> ---------------------------------------------------------------------
>>> Execution time Wear (Data written to disk)
>>> Timestamp: 100% 100%
>>> Cost-Benefit: 70% 45%
>>>
>>> I plan on adding more benchmark results soon.
>>>
>>> Best regards,
>>> Andreas Rohner
>>>
>>> [1] Mendel Rosenblum and John K. Ousterhout. The design and implementa-
>>> tion of a log-structured file system. ACM Trans. Comput. Syst.,
>>> 10(1):26–52, February 1992.
>>>
>>> Andreas Rohner (9):
>>> nilfs2: refactor nilfs_sufile_updatev()
>>> nilfs2: add simple cache for modifications to SUFILE
>>> nilfs2: extend SUFILE on-disk format to enable counting of live blocks
>>> nilfs2: add function to modify su_nlive_blks
>>> nilfs2: add simple tracking of block deletions and updates
>>> nilfs2: use modification cache to improve performance
>>> nilfs2: add additional flags for nilfs_vdesc
>>> nilfs2: improve accuracy and correct for invalid GC values
>>> nilfs2: prevent starvation of segments protected by snapshots
>>>
>>> fs/nilfs2/bmap.c | 84 +++++++-
>>> fs/nilfs2/bmap.h | 14 +-
>>> fs/nilfs2/btree.c | 4 +-
>>> fs/nilfs2/cpfile.c | 5 +
>>> fs/nilfs2/dat.c | 95 ++++++++-
>>> fs/nilfs2/dat.h | 8 +-
>>> fs/nilfs2/direct.c | 4 +-
>>> fs/nilfs2/inode.c | 24 ++-
>>> fs/nilfs2/ioctl.c | 27 ++-
>>> fs/nilfs2/mdt.c | 5 +-
>>> fs/nilfs2/page.h | 6 +-
>>> fs/nilfs2/segbuf.c | 6 +
>>> fs/nilfs2/segbuf.h | 3 +
>>> fs/nilfs2/segment.c | 155 +++++++++++++-
>>> fs/nilfs2/segment.h | 3 +
>>> fs/nilfs2/sufile.c | 533 +++++++++++++++++++++++++++++++++++++++++++---
>>> fs/nilfs2/sufile.h | 97 +++++++--
>>> fs/nilfs2/the_nilfs.c | 4 +
>>> fs/nilfs2/the_nilfs.h | 23 ++
>>> include/linux/nilfs2_fs.h | 122 ++++++++++-
>>> 20 files changed, 1126 insertions(+), 96 deletions(-)
>>>
>>> --
>>> 2.3.0
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2015-03-10 20:37 UTC|newest]
Thread overview: 36+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-02-24 19:01 [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy Andreas Rohner
[not found] ` <1424804504-10914-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-02-24 19:01 ` [PATCH 1/9] nilfs2: refactor nilfs_sufile_updatev() Andreas Rohner
[not found] ` <1424804504-10914-2-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-10 15:52 ` Ryusuke Konishi
[not found] ` <20150311.005220.1374468405510151934.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-03-10 20:40 ` Andreas Rohner
2015-02-24 19:01 ` [PATCH 2/9] nilfs2: add simple cache for modifications to SUFILE Andreas Rohner
[not found] ` <1424804504-10914-3-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14 0:45 ` Ryusuke Konishi
2015-02-24 19:01 ` [PATCH 3/9] nilfs2: extend SUFILE on-disk format to enable counting of live blocks Andreas Rohner
[not found] ` <1424804504-10914-4-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14 4:05 ` Ryusuke Konishi
2015-02-24 19:01 ` [PATCH 4/9] nilfs2: add function to modify su_nlive_blks Andreas Rohner
[not found] ` <1424804504-10914-5-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14 4:57 ` Ryusuke Konishi
2015-02-24 19:01 ` [PATCH 5/9] nilfs2: add simple tracking of block deletions and updates Andreas Rohner
[not found] ` <1424804504-10914-6-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14 3:46 ` Ryusuke Konishi
2015-02-24 19:01 ` [PATCH 6/9] nilfs2: use modification cache to improve performance Andreas Rohner
[not found] ` <1424804504-10914-7-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14 1:04 ` Ryusuke Konishi
2015-02-24 19:01 ` [PATCH 7/9] nilfs2: add additional flags for nilfs_vdesc Andreas Rohner
[not found] ` <1424804504-10914-8-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14 3:21 ` Ryusuke Konishi
2015-02-24 19:01 ` [PATCH 8/9] nilfs2: improve accuracy and correct for invalid GC values Andreas Rohner
[not found] ` <1424804504-10914-9-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14 2:50 ` Ryusuke Konishi
2015-02-24 19:01 ` [PATCH 9/9] nilfs2: prevent starvation of segments protected by snapshots Andreas Rohner
[not found] ` <1424804504-10914-10-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14 3:51 ` Ryusuke Konishi
[not found] ` <20150314.125109.1017248837083480553.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-03-14 12:36 ` Andreas Rohner
[not found] ` <55042B53.5000101-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14 12:49 ` Ryusuke Konishi
2015-03-14 14:32 ` Ryusuke Konishi
2015-02-24 19:04 ` [PATCH 1/6] nilfs-utils: extend SUFILE on-disk format to enable track live blocks Andreas Rohner
[not found] ` <1424804659-10986-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-02-24 19:04 ` [PATCH 2/6] nilfs-utils: add additional flags for nilfs_vdesc Andreas Rohner
2015-02-24 19:04 ` [PATCH 3/6] nilfs-utils: add support for tracking live blocks Andreas Rohner
[not found] ` <1424804659-10986-3-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14 5:52 ` Ryusuke Konishi
2015-02-24 19:04 ` [PATCH 4/6] nilfs-utils: implement the tracking of live blocks for set_suinfo Andreas Rohner
2015-02-24 19:04 ` [PATCH 5/6] nilfs-utils: add support for greedy/cost-benefit policies Andreas Rohner
2015-02-24 19:04 ` [PATCH 6/6] nilfs-utils: add su_nsnapshot_blks field to indicate starvation Andreas Rohner
2015-02-25 0:18 ` [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy Ryusuke Konishi
[not found] ` <20150225.091804.1850885506186316087.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-03-10 5:21 ` Ryusuke Konishi
[not found] ` <20150310.142119.813265940569588216.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-03-10 20:37 ` Andreas Rohner [this message]
[not found] ` <54FF561E.7030409-hi6Y0CQ0nG0@public.gmane.org>
2015-03-12 12:54 ` Ryusuke Konishi
[not found] ` <20150312.215431.324210374799651841.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-03-14 12:24 ` Andreas Rohner
[not found] ` <55042879.90701-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14 15:40 ` Ryusuke Konishi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=54FF561E.7030409@gmx.net \
--to=andreas.rohner-hi6y0cq0ng0@public.gmane.org \
--cc=konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org \
--cc=linux-nilfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).