linux-nilfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
To: Ryusuke Konishi
	<konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy
Date: Tue, 10 Mar 2015 21:37:50 +0100	[thread overview]
Message-ID: <54FF561E.7030409@gmx.net> (raw)
In-Reply-To: <20150310.142119.813265940569588216.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>

Hi Ryusuke,

Thanks for your thorough review.

On 2015-03-10 06:21, Ryusuke Konishi wrote:
> Hi Andreas,
> 
> I looked through whole kernel patches and a part of util patches.
> Overall comments are as follows:
> 
> [Algorithm]
> As for algorithm, it looks about OK except for the starvation
> countermeasure.  The stavation countermeasure looks adhoc/hacky, but
> it's good that it doesn't change kernel/userland interface; we may be
> able to replace it with better ways in a future or in a revised
> version of this patchset.
> 
> (1) Drawback of the starvation countermeasure
>     The patch 9/9 looks to make the execution time of chcp operation
>     worse since it will scan through sufile to modify live block
>     counters.  How much does it prolong the execution time ?

I'll do some tests, but I haven't noticed any significant performance
drop. The GC basically does the same thing, every time it selects
segments to reclaim.

>     In a use case of nilfs, many snapshots are created and they are
>     automatically changed back to plain checkpoints because old
>     snapshots are thinned out over time.  The patch 9/9 may impact on
>     such usage.
>
> (2) Compatibility
>     What will happen in the following case:
>     1. Create a file system, use it with the new module, and
>        create snapshots.
>     2. Mount it with an old module, and release snapshot with "chcp cp"
>     3. Mount it with the new module, and cleanerd runs gc with
>        cost benefit or greedy policy.

Some segments could be subject to starvation. But it would probably only
affect a small number of segments and it could be fixed by "chcp ss
<CP>; chcp cp <CP>".

> (3) Durability against unexpected power failures (just a note)
>     The current patchset looks not to cause starvation issue even when
>     unexpected power failure occurs during or after executing "chcp
>     cp" because nilfs_ioctl_change_cpmode() do changes in a
>     transactional way with nilfs_transaction_begin/commit.
>     We should always think this kind of situtation to keep consistency.
> 
> [Coding Style]
> (4) This patchset has several coding style issues. Please fix them and
>     re-check with the latest checkpatch script (script/checkpatch.pl).

I'll fix that. Sorry.

> patch 2:
> WARNING: Prefer kmalloc_array over kmalloc with multiply
> #85: FILE: fs/nilfs2/sufile.c:1192:
> +    mc->mc_mods = kmalloc(capacity * sizeof(struct nilfs_sufile_mod),
> 
> patch 5,6:
> WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
> #60: 
> the same semaphore has to be aquired. So if the DAT-Entry belongs to
> 
> WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
> #46: 
> be aquired, which blocks the entire SUFILE and effectively turns
> 
> WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
> #53: 
> afore mentioned lock only needs to be aquired, if the cache is full
> 
> (5) sub_sizeof macro:
>     The same definition exists as offsetofend() in vfio.h,
>     and a patch to move it to stddef.h is now proposed.
> 
>     Please use the same name, and redefine it only if it's not
>     defined:
> 
> #ifndef offsetofend
> #define offsetofend(TYPE, MEMBER) \
>         (offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))
> #endif

Ok I'll change that.

> [Implementation]
> (6) b_blocknr
>     Please do not use bh->b_blocknr to store disk block number.  This
>     field is used to keep virtual block number except for DAT files.
>     It is only replaced to an actual block number during calling
>     submit_bh().  Keep this policy.

As far as I can tell, this is only true for blocks of GC inodes and node
blocks. All other buffer_heads are always mapped to on disk blocks by
nilfs_get_block(). I only added the mapping in nilfs_segbuf_submit_bh()
to correctly set the value in b_blocknr to the new location.

>     In segment constructor context, you can calculate the disk block
>     number from the start disk address of the segment and the block
>     index (offset) in the segment.

If I understand you correctly, this approach would give me the on disk
location inside of the segment that is currently constructed. But I need
to know the previous on disk location of the buffer_head. I have to
decrement the counter for the previous segment.

> (7) sufile mod cache
>     Consider gathering the cache into nilfs_sufile_info struct and
>     stopping to pass it via argument of bmap/sufile/dat interface
>     functions.  It's hacky, and decreases readability of programs, and
>     is bloating changes of this patchset over multiple function
>     blocks.

If I use a global structure, I have to protect it with a lock. Since
almost any operation has to modify the counters in the SUFILE, this
would serialize the whole file system.

>     The cache should be well designed. It's important to balance the
>     performance and locality/transparency of the feature.  For
>     instance, it can be implemented with radix-tree of objects in
>     which each object has a vector of 2^k cache entries.

I'll look into that.

>     I think the cache should be written back to the sufile buffers
>     only within segment construction context. At least, it should be
>     written back in the context in which a transaction lock is held.
> 
>     In addition, introducing a new bmap lock dependency,
>     nilfs_sufile_lock_key, is undesireble. You should avoid it
>     by delaying the writeback of cache entries to sufile.

The cache could end up using a lot of memory. In the worst case one
entry per block.

> (8) Changes to the sufile must be finished before dirty buffer
>     collection of sufile.
>     All mark_buffer_dirty() calls to sufile must be finished
>     before or in NILFS_ST_SUFILE stage of nilfs_segctor_collect_blocks().
> 
>     (You can write fixed figures to sufile after the collection phase
>      of sufile by preparatory marking buffer dirty before the
>      colection phase.)
>
>     In the current patchset, sufile mod cache can be flushed in
>     nilfs_segctor_update_palyload_blocknr(), which comes after the
>     dirty buffer collection phase.

This is a hard problem. I have to count the blocks added in the
NILFS_ST_DAT stage. I don't know, which SUFILE blocks I have to mark in
advance. I'll have to think about this.

> (9) cpfile is also excluded in the dead block counting like sufile
>     cpfile is always changed and written back along with sufile and dat.
>     So, cpfile must be excluded from the dead block counting.
>     Otherwise, sufile change can trigger cpfile changes, and it in turn
>     triggers sufile.

I don't quite understand your example. How exactly can a sufile change
trigger a cpfile change and how can this turn into an infinite loop?

Thanks,
Andreas Rohner

>     This also helps to simplify nilfs_dat_commit_end() that the patchset
>     added two arguments for the dead block counting in the patchset.
>     I mean, "dead" argument and "count_blocks" argument can be unified by
>     changing meaning of the "dead" argument.
> 
> 
> I will add detail comments for patches tonight or another day.
> 
> Regards,
> Ryusuke Konishi
> 
> On Wed, 25 Feb 2015 09:18:04 +0900 (JST), Ryusuke Konishi wrote:
>> Hi Andreas,
>>
>> Thank you for posting this proposal!
>>
>> I would like to have time to review this series through, but please
>> wait for several days. (This week I'm quite busy until weekend)
>>
>> Thanks,
>> Ryusuke Konishi
>>
>> On Tue, 24 Feb 2015 20:01:35 +0100, Andreas Rohner wrote:
>>> Hi everyone!
>>>
>>> One of the biggest performance problems of NILFS is its
>>> inefficient Timestamp GC policy. This patch set introduces two new GC
>>> policies, namely Cost-Benefit and Greedy.
>>>
>>> The Cost-Benefit policy is nothing new. It has been around for a long
>>> time with log-structured file systems [1]. But it relies on accurate
>>> information, about the number of live blocks in a segment. NILFS
>>> currently does not provide the necessary information. So this patch set
>>> extends the entries in the SUFILE to include a counter for the number of
>>> live blocks. This counter is decremented whenever a file is deleted or
>>> overwritten.
>>>
>>> Except for some tricky parts, the counting of live blocks is quite
>>> trivial. The problem is snapshots. At any time, a checkpoint can be
>>> turned into a snapshot or vice versa. So blocks that are reclaimable at
>>> one point in time, are protected by a snapshot a moment later.
>>>
>>> This patch set does not try to track snapshots at all. Instead it uses a
>>> heuristic approach to prevent the worst case scenario. The performance
>>> is still significantly better than timestamp for my benchmarks.
>>>
>>> The worst case scenario is, the following:
>>>
>>> 1. Segment 1 is written
>>> 2. Snapshot is created
>>> 3. GC tries to reclaim Segment 1, but all blocks are protected
>>>    by the Snapshot. The GC has to set the number of live blocks
>>>    to maximum to avoid reclaiming this Segment again in the near future.
>>> 4. Snapshot is deleted
>>> 5. Segment 1 is reclaimable, but its counter is so high, that the GC
>>>    will never try to reclaim it again.
>>>
>>> To prevent this kind of starvation I use another field in the SUFILE
>>> entry, to store the number of blocks that are protected by a snapshot.
>>> This value is just a heuristic and it is usually set to 0. Only if the
>>> GC reclaims a segment, it is written to the SUFILE entry. The GC has to
>>> check for snapshots anyway, so we get this information for free. By
>>> storing this information in the SUFILE we can avoid starvation in the
>>> following way:
>>>
>>> 1. Segment 1 is written
>>> 2. Snapshot is created
>>> 3. GC tries to reclaim Segment 1, but all blocks are protected
>>>    by the Snapshot. The GC has to set the number of live blocks
>>>    to maximum to avoid reclaiming this Segment again in the near future.
>>> 4. GC sets the number of snapshot blocks in Segment 1 in the SUFILE
>>>    entry
>>> 5. Snapshot is deleted
>>> 6. On Snapshot deletion we walk through every entry in the SUFILE and
>>>    reduce the number of live blocks to half, if the number of snapshot
>>>    blocks is bigger than half of the maximum.
>>> 7. Segment 1 is reclaimable and the number of live blocks entry is at
>>>    half the maximum. The GC will try to reclaim this segment as soon as
>>>    there are no other better choices.
>>>
>>> BENCHMARKS:
>>> -----------
>>>
>>> My benchmark is quite simple. It consists of a process, that replays
>>> real NFS traces at a faster speed. It thereby creates relatively
>>> realistic patterns of file creation and deletions. At the same time
>>> multiple snapshots are created and deleted in parallel. I use a 100GB
>>> partition of a Samsung SSD:
>>>
>>> WITH SNAPSHOTS EVERY 5 MINUTES:
>>> --------------------------------------------------------------------
>>>                 Execution time       Wear (Data written to disk)
>>> Timestamp:      100%                 100%
>>> Cost-Benefit:   80%                  43%
>>>
>>> NO SNAPSHOTS:
>>> ---------------------------------------------------------------------
>>>                 Execution time       Wear (Data written to disk)
>>> Timestamp:      100%                 100%
>>> Cost-Benefit:   70%                  45%
>>>
>>> I plan on adding more benchmark results soon.
>>>
>>> Best regards,
>>> Andreas Rohner
>>>
>>> [1] Mendel Rosenblum and John K. Ousterhout. The design and implementa-
>>>     tion of a log-structured file system. ACM Trans. Comput. Syst.,
>>>     10(1):26–52, February 1992.
>>>
>>> Andreas Rohner (9):
>>>   nilfs2: refactor nilfs_sufile_updatev()
>>>   nilfs2: add simple cache for modifications to SUFILE
>>>   nilfs2: extend SUFILE on-disk format to enable counting of live blocks
>>>   nilfs2: add function to modify su_nlive_blks
>>>   nilfs2: add simple tracking of block deletions and updates
>>>   nilfs2: use modification cache to improve performance
>>>   nilfs2: add additional flags for nilfs_vdesc
>>>   nilfs2: improve accuracy and correct for invalid GC values
>>>   nilfs2: prevent starvation of segments protected by snapshots
>>>
>>>  fs/nilfs2/bmap.c          |  84 +++++++-
>>>  fs/nilfs2/bmap.h          |  14 +-
>>>  fs/nilfs2/btree.c         |   4 +-
>>>  fs/nilfs2/cpfile.c        |   5 +
>>>  fs/nilfs2/dat.c           |  95 ++++++++-
>>>  fs/nilfs2/dat.h           |   8 +-
>>>  fs/nilfs2/direct.c        |   4 +-
>>>  fs/nilfs2/inode.c         |  24 ++-
>>>  fs/nilfs2/ioctl.c         |  27 ++-
>>>  fs/nilfs2/mdt.c           |   5 +-
>>>  fs/nilfs2/page.h          |   6 +-
>>>  fs/nilfs2/segbuf.c        |   6 +
>>>  fs/nilfs2/segbuf.h        |   3 +
>>>  fs/nilfs2/segment.c       | 155 +++++++++++++-
>>>  fs/nilfs2/segment.h       |   3 +
>>>  fs/nilfs2/sufile.c        | 533 +++++++++++++++++++++++++++++++++++++++++++---
>>>  fs/nilfs2/sufile.h        |  97 +++++++--
>>>  fs/nilfs2/the_nilfs.c     |   4 +
>>>  fs/nilfs2/the_nilfs.h     |  23 ++
>>>  include/linux/nilfs2_fs.h | 122 ++++++++++-
>>>  20 files changed, 1126 insertions(+), 96 deletions(-)
>>>
>>> -- 
>>> 2.3.0
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

  parent reply	other threads:[~2015-03-10 20:37 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-02-24 19:01 [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy Andreas Rohner
     [not found] ` <1424804504-10914-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-02-24 19:01   ` [PATCH 1/9] nilfs2: refactor nilfs_sufile_updatev() Andreas Rohner
     [not found]     ` <1424804504-10914-2-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-10 15:52       ` Ryusuke Konishi
     [not found]         ` <20150311.005220.1374468405510151934.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-03-10 20:40           ` Andreas Rohner
2015-02-24 19:01   ` [PATCH 2/9] nilfs2: add simple cache for modifications to SUFILE Andreas Rohner
     [not found]     ` <1424804504-10914-3-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14  0:45       ` Ryusuke Konishi
2015-02-24 19:01   ` [PATCH 3/9] nilfs2: extend SUFILE on-disk format to enable counting of live blocks Andreas Rohner
     [not found]     ` <1424804504-10914-4-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14  4:05       ` Ryusuke Konishi
2015-02-24 19:01   ` [PATCH 4/9] nilfs2: add function to modify su_nlive_blks Andreas Rohner
     [not found]     ` <1424804504-10914-5-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14  4:57       ` Ryusuke Konishi
2015-02-24 19:01   ` [PATCH 5/9] nilfs2: add simple tracking of block deletions and updates Andreas Rohner
     [not found]     ` <1424804504-10914-6-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14  3:46       ` Ryusuke Konishi
2015-02-24 19:01   ` [PATCH 6/9] nilfs2: use modification cache to improve performance Andreas Rohner
     [not found]     ` <1424804504-10914-7-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14  1:04       ` Ryusuke Konishi
2015-02-24 19:01   ` [PATCH 7/9] nilfs2: add additional flags for nilfs_vdesc Andreas Rohner
     [not found]     ` <1424804504-10914-8-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14  3:21       ` Ryusuke Konishi
2015-02-24 19:01   ` [PATCH 8/9] nilfs2: improve accuracy and correct for invalid GC values Andreas Rohner
     [not found]     ` <1424804504-10914-9-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14  2:50       ` Ryusuke Konishi
2015-02-24 19:01   ` [PATCH 9/9] nilfs2: prevent starvation of segments protected by snapshots Andreas Rohner
     [not found]     ` <1424804504-10914-10-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14  3:51       ` Ryusuke Konishi
     [not found]         ` <20150314.125109.1017248837083480553.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-03-14 12:36           ` Andreas Rohner
     [not found]             ` <55042B53.5000101-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14 12:49               ` Ryusuke Konishi
2015-03-14 14:32           ` Ryusuke Konishi
2015-02-24 19:04   ` [PATCH 1/6] nilfs-utils: extend SUFILE on-disk format to enable track live blocks Andreas Rohner
     [not found]     ` <1424804659-10986-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-02-24 19:04       ` [PATCH 2/6] nilfs-utils: add additional flags for nilfs_vdesc Andreas Rohner
2015-02-24 19:04       ` [PATCH 3/6] nilfs-utils: add support for tracking live blocks Andreas Rohner
     [not found]         ` <1424804659-10986-3-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14  5:52           ` Ryusuke Konishi
2015-02-24 19:04       ` [PATCH 4/6] nilfs-utils: implement the tracking of live blocks for set_suinfo Andreas Rohner
2015-02-24 19:04       ` [PATCH 5/6] nilfs-utils: add support for greedy/cost-benefit policies Andreas Rohner
2015-02-24 19:04       ` [PATCH 6/6] nilfs-utils: add su_nsnapshot_blks field to indicate starvation Andreas Rohner
2015-02-25  0:18   ` [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy Ryusuke Konishi
     [not found]     ` <20150225.091804.1850885506186316087.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-03-10  5:21       ` Ryusuke Konishi
     [not found]         ` <20150310.142119.813265940569588216.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-03-10 20:37           ` Andreas Rohner [this message]
     [not found]             ` <54FF561E.7030409-hi6Y0CQ0nG0@public.gmane.org>
2015-03-12 12:54               ` Ryusuke Konishi
     [not found]                 ` <20150312.215431.324210374799651841.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-03-14 12:24                   ` Andreas Rohner
     [not found]                     ` <55042879.90701-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14 15:40                       ` Ryusuke Konishi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=54FF561E.7030409@gmx.net \
    --to=andreas.rohner-hi6y0cq0ng0@public.gmane.org \
    --cc=konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org \
    --cc=linux-nilfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).