From: NeilBrown <neilb@suse.com>
To: linux-raid@vger.kernel.org
Cc: shli@fb.com, kernel-team@fb.com, dan.j.williams@intel.com,
	hch@infradead.org, liuzhengyuang521@gmail.com,
	liuzhengyuan@kylinos.cn, Song Liu <songliubraving@fb.com>
Subject: Re: [PATCH v5 5/8] md/r5cache: reclaim support
Date: Wed, 19 Oct 2016 13:03:53 +1100	[thread overview]
Message-ID: <87oa2hxfg6.fsf@notabene.neil.brown.name> (raw)
In-Reply-To: <20161013054944.1038806-6-songliubraving@fb.com>


On Thu, Oct 13 2016, Song Liu wrote:

> There are two limited resources, stripe cache and journal disk space.
> For better performance, we prioritize reclaim of full stripe writes.
> To free up more journal space, we free earliest data on the journal.
>
> In current implementation, reclaim happens when:
> 1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 5 seconds) reclaim
>    if there is no reclaim in the past 5 seconds.

5 seconds is an arbitrary number.  I don't think it needs to be made
configurable, but it might help to justify it.
I would probably make it closer to 30 seconds.  There really isn't any
rush.  If there is much load, it will never wait this long.  If there is
not much load ... then it probably doesn't matter.


> 2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (32) cached full stripes
>    (r5c_check_cached_full_stripe)

This is another arbitrary number, but I think it needs more
justification.  Why 32?  That's only 128K per device (32 stripes with
one 4K page each), which isn't very much.
It might make sense to set the batch size based on the stripe size
(e.g. at least one stripe?) or on the cache size.
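For illustration only, sizing the batch from the stripe cache rather than a constant might look like this (a sketch with stand-in names, not the actual r5cache code):

```c
#include <assert.h>

#define R5C_FULL_STRIPE_FLUSH_BATCH 32	/* the current fixed batch */

/* Hypothetical: flush once 1/4 of the stripe cache is full stripes,
 * but never use a smaller batch than the old constant. */
static int full_stripe_flush_batch(int min_nr_stripes)
{
	int batch = min_nr_stripes / 4;

	return batch > R5C_FULL_STRIPE_FLUSH_BATCH
		? batch : R5C_FULL_STRIPE_FLUSH_BATCH;
}
```

That way a large cache doesn't wake reclaim for a tiny fraction of its capacity.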

> 3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
> 4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
>
> r5c_do_reclaim() contains new logic of reclaim.
>
> For stripe cache:
>
> When stripe cache pressure is high (more than 3/4 stripes are cached,

You say "3/4" here but I think the code says "1/2"
> +	if (total_cached > conf->min_nr_stripes * 1 / 2 ||
But there is also
> +	if (total_cached > conf->min_nr_stripes * 3 / 4 ||
> +	    atomic_read(&conf->empty_inactive_list_nr) > 0)

so maybe you do mean 3/4.  Some comments next to the different
fractions would help.

> @@ -141,6 +150,12 @@ struct r5l_log {
>  
>  	/* for r5c_cache */
>  	enum r5c_state r5c_state;
> +
> +	/* all stripes in r5cache, in the order of seq at sh->log_start */
> +	struct list_head stripe_in_cache_list;
> +
> +	spinlock_t stripe_in_cache_lock;

You use the _irqsave / _irqrestore versions of spinlock for this lock,
but I cannot see where it is taken from interrupt context.
What did I miss?

> +/* evaluate log space usage and update R5C_LOG_TIGHT and R5C_LOG_CRITICAL */
> +static inline void r5c_update_log_state(struct r5l_log *log)
> +{
> +	struct r5conf *conf = log->rdev->mddev->private;
> +	sector_t free_space = r5l_ring_distance(log, log->log_start,
> +						log->last_checkpoint);
> +	sector_t reclaim_space = r5c_log_required_to_flush_cache(conf);
> +
> +	if (!r5c_is_writeback(log))
> +		return;
> +	else if (free_space < 2 * reclaim_space) {
> +		set_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +		set_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +	} else if (free_space < 3 * reclaim_space) {
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +		set_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +	} else if (free_space > 4 * reclaim_space) {
> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +	} else
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);

Hmmm...
We set LOG_CRITICAL if free_space < 2*reclaim_space, else we clear it.
We set LOG_TIGHT if free_space < 3*reclaim_space and clear it if > 4*reclaim_space

Would the code be clearer if you just made that explicit?

At the very least, change
> +	} else if (free_space > 4 * reclaim_space) {
> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +	} else
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);

to

> +	} else if (free_space < 4 * reclaim_space) {
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +	} else {
> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +	}

so the order of the branches is consistent and all the tests are '<'.
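Making it explicit could also mean one decision per flag.  As a sketch only (stand-in types, not the actual md code), the same hysteresis reads:

```c
#include <assert.h>
#include <stdbool.h>

struct cache_flags { bool tight; bool critical; };

static void update_log_state(struct cache_flags *f,
			     unsigned long long free_space,
			     unsigned long long reclaim_space)
{
	/* CRITICAL has no hysteresis: set below 2x, cleared otherwise. */
	f->critical = free_space < 2 * reclaim_space;

	/* TIGHT has hysteresis: set below 3x, cleared above 4x,
	 * left unchanged in between. */
	if (free_space < 3 * reclaim_space)
		f->tight = true;
	else if (free_space > 4 * reclaim_space)
		f->tight = false;
}
```

which makes the asymmetry between the two flags obvious at a glance.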

>  
> +/*
> + * do not release a stripe to cached lists in quiesce
> + */

You tell me what this function doesn't do, but not what it does.

This function is only called once, and I think the code would be
clearer if it was just included in-line in that one place.

> +void r5c_prepare_stripe_for_release_in_quiesce(struct stripe_head *sh)
> +{
> +	if (!test_bit(STRIPE_HANDLE, &sh->state) &&
> +	    atomic_read(&sh->dev_in_cache) != 0) {
> +		if (!test_bit(STRIPE_R5C_FROZEN, &sh->state))
> +			r5c_freeze_stripe_for_reclaim(sh);
> +		set_bit(STRIPE_HANDLE, &sh->state);
> +	}
> +}
> +
>  
> @@ -709,6 +867,7 @@ static void r5l_run_no_space_stripes(struct r5l_log *log)
>  	while (!list_empty(&log->no_space_stripes)) {
>  		sh = list_first_entry(&log->no_space_stripes,
>  				      struct stripe_head, log_list);
> +		set_bit(STRIPE_PREREAD_ACTIVE, &sh->state);
>  		list_del_init(&sh->log_list);
>  		set_bit(STRIPE_HANDLE, &sh->state);
>  		raid5_release_stripe(sh);

This chunk doesn't seem to belong here.  It looks like it might be a
bugfix that is quite independent of the rest of the code.  Is it?
If it is, it probably belongs in a separate patch.


>  static void r5l_do_reclaim(struct r5l_log *log)
>  {
> +	struct r5conf *conf = log->rdev->mddev->private;
>  	sector_t reclaim_target = xchg(&log->reclaim_target, 0);
>  	sector_t reclaimable;
>  	sector_t next_checkpoint;
> -	u64 next_cp_seq;
>  
>  	spin_lock_irq(&log->io_list_lock);
>  	/*
> @@ -924,8 +1115,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
>  				    log->io_list_lock);
>  	}
>  
> -	next_checkpoint = log->next_checkpoint;
> -	next_cp_seq = log->next_cp_seq;
> +	next_checkpoint = r5c_calculate_new_cp(conf);
>  	spin_unlock_irq(&log->io_list_lock);
>  
>  	BUG_ON(reclaimable < 0);
> @@ -941,7 +1131,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
>  
>  	mutex_lock(&log->io_mutex);
>  	log->last_checkpoint = next_checkpoint;
> -	log->last_cp_seq = next_cp_seq;

Why don't we update last_cp_seq here any more?


> +
> +/*
> + * if num < 0, flush all stripes
> + * if num == 0, flush all full stripes
> + * if num > 0, flush all full stripes. If less than num full stripes are
> + *             flushed, flush some partial stripes until totally num stripes are
> + *             flushed or there are no more cached stripes.
> + */

I'm not very keen on the idea of having "num < 0".

I'd rather the meaning was:
  flush all full stripes, and a total of at least num stripes.

To flush all stripes, pass INT_MAX for num (or similar).

> +void r5c_flush_cache(struct r5conf *conf, int num)
> +{
> +	int count;
> +	struct stripe_head *sh, *next;
> +
> +	assert_spin_locked(&conf->device_lock);
> +	if (!conf->log)
> +		return;
> +
> +	count = 0;
> +	list_for_each_entry_safe(sh, next, &conf->r5c_full_stripe_list, lru) {
> +		r5c_flush_stripe(conf, sh);
> +		count++;
> +	}
> +
> +	if (num == 0 || (num > 0 && count >= num))
> +		return;
> +	list_for_each_entry_safe(sh, next,
> +				 &conf->r5c_partial_stripe_list, lru) {
> +		r5c_flush_stripe(conf, sh);
> +		if (num > 0 && count++ >= num)
> +			break;
> +	}
> +}
> +
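To illustrate the suggested semantics (this is a sketch with the lists reduced to counters, not a replacement for the patch): always flush every full stripe, then take partial stripes until at least num total have been flushed.

```c
#include <assert.h>
#include <limits.h>

/* Returns the total number of stripes flushed.  Pass INT_MAX for num
 * to flush everything. */
static int flush_cache(int full_stripes, int partial_stripes, int num)
{
	int count = full_stripes;	/* all full stripes, always */
	int want_more = num - count;	/* partial stripes still needed */

	if (want_more > 0) {
		if (want_more > partial_stripes)
			want_more = partial_stripes;
		count += want_more;
	}
	return count;
}
```

With that interface there is no special-cased negative num, and num == 0 naturally means "full stripes only".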


Thanks,
NeilBrown
