linux-raid.vger.kernel.org archive mirror
From: NeilBrown <neilb@suse.com>
To: Shaohua Li <shli@kernel.org>
Cc: linux-raid@vger.kernel.org, hch@lst.de
Subject: Re: [md PATCH 14/14] MD: use per-cpu counter for writes_pending
Date: Fri, 17 Feb 2017 13:34:18 +1100	[thread overview]
Message-ID: <87a89l4kb9.fsf@notabene.neil.brown.name> (raw)
In-Reply-To: <20170216201211.3q7rvvp2p3xxz6ew@kernel.org>

On Thu, Feb 16 2017, Shaohua Li wrote:

> On Thu, Feb 16, 2017 at 03:39:03PM +1100, Neil Brown wrote:
>> The 'writes_pending' counter is used to determine when the
>> array is stable so that it can be marked in the superblock
>> as "Clean".  Consequently it needs to be updated frequently
>> but only checked for zero occasionally.  Recent changes to
>> raid5 cause the count to be updated even more often - once
>> per 4K rather than once per bio.  This provided
>> justification for making the updates more efficient.
>> 
>> So we replace the atomic counter with a per-cpu array of
>> 'long' counters. Incrementing and decrementing is normally
>> much cheaper, testing for zero is more expensive.
>> 
>> To meaningfully be able to test for zero we need to be able
>> to block further updates.  This is done by forcing the
>> "increment" step to take a spinlock in the rare case that
>> another thread is checking if the count is zero.  This is
>> done using a new field: "checkers".  "checkers" is the
>> number of threads that are currently checking whether the
>> count is zero.  It is usually 0, occasionally 1, and it is
>> not impossible that it could be higher, though this would be
>> rare.
>> 
>> If, within an rcu_read_locked section, checkers is seen to
>> be zero, then the local-cpu counter can be incremented
>> freely.  If checkers is not zero, mddev->lock must be taken
>> before the increment is allowed.  A decrement is always
>> allowed.
>> 
>> To test for zero, a thread must increment "checkers", call
>> synchronize_rcu(), then take mddev->lock.  Once this is done
>> no new increments can happen.  A thread may choose to
>> perform a quick test-for-zero by summing all the counters
>> without holding a lock.  If this is non-zero, then the total
>> count is non-zero, or was non-zero very recently, so it is
>> safe to assume that it isn't zero.  If the quick check does
>> report a zero sum, then it is worth performing the locking
>> protocol.
>> 
>> When the counter is decremented, it is no longer possible to
>> immediately test if the result is zero
>> (atomic_dec_and_test()).  We don't even really want to
>> perform the "quick" tests as that sums over all cpus and is
>> work that will most often bring no benefit.
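The increment/decrement and test-for-zero protocol described above can be
sketched in userspace C.  This is a simplified model, not the kernel code:
the struct and function names are made up for illustration, a pthread mutex
stands in for mddev->lock, and the RCU steps (rcu_read_lock() around the
fast path, synchronize_rcu() in the checker) are reduced to comments.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NCPUS 4

/* Hypothetical userspace model: per-"cpu" long counters, a
 * 'checkers' count, and a mutex standing in for mddev->lock. */
struct wp {
	long percpu[NCPUS];
	atomic_int checkers;	/* threads currently testing for zero */
	pthread_mutex_t lock;	/* stands in for mddev->lock */
};

void wp_inc(struct wp *w, int cpu)
{
	/* kernel code would hold rcu_read_lock() across this test */
	if (atomic_load(&w->checkers) == 0) {
		w->percpu[cpu]++;		/* fast path: no checker */
	} else {
		pthread_mutex_lock(&w->lock);	/* slow path: serialize */
		w->percpu[cpu]++;
		pthread_mutex_unlock(&w->lock);
	}
}

void wp_dec(struct wp *w, int cpu)
{
	w->percpu[cpu]--;	/* a decrement is always allowed */
}

/* Lockless "quick" sum: a non-zero result means the count is, or
 * very recently was, non-zero, so the locking protocol is skipped. */
long wp_quick_sum(struct wp *w)
{
	long sum = 0;

	for (int i = 0; i < NCPUS; i++)
		sum += w->percpu[i];
	return sum;
}

/* Authoritative test for zero: announce ourselves in 'checkers',
 * then take the lock so no new increment can race with the sum.
 * (Decrements may still happen, but can only lower the total.) */
int wp_is_zero(struct wp *w)
{
	long sum;

	atomic_fetch_add(&w->checkers, 1);
	/* kernel code would call synchronize_rcu() here, so every
	 * in-flight fast-path increment has finished */
	pthread_mutex_lock(&w->lock);
	sum = wp_quick_sum(w);
	pthread_mutex_unlock(&w->lock);
	atomic_fetch_sub(&w->checkers, 1);
	return sum == 0;
}
```

The point of the split is that the common operations (wp_inc, wp_dec) touch
only one cache line and take no lock unless a checker is active, while the
rare wp_is_zero pays the full cost of summing every counter under the lock.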
>> 
>> In the "safemode==2" case, when we want to mark the array as
>> "clean" immediately when there are no writes, we perform the
>> quick test anyway, and possibly wake the md thread to do the
>> full test.  "safemode==2" is only used during shutdown so
>> the cost is not problematic.
>> 
>> When safemode!=2 we always set the timer, rather than only
>> when the counter reaches zero.
>> 
>> If mod_timer() is called to set the timeout to the value it
>> already has, mod_timer() has low overhead with no atomic
>> operations.  So at worst it will have a noticeable cost once
>> per jiffy.  To further reduce the overhead, we round the
>> requested delay to a multiple of ->safemode_delay.  This
>> might increase the delay until the timer fires a little, but
>> will reduce the overhead of calling mod_timer()
>> significantly.  If lots of requests are completing, the
>> timer will be updated every 200 milliseconds (by default)
>> and never fire.  When it does eventually fire, it will
>> schedule the md thread to perform the full test for
>> writes_pending==0, and this is quite likely to find '0'.
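The rounding step might look like the following hypothetical helper (the
function name and signature are made up for illustration; values are in
jiffies, and in the kernel the roundup() macro would do this arithmetic).
Rounding the expiry up to a multiple of ->safemode_delay means that many
consecutive mod_timer() calls compute the same expiry, which is the cheap
no-op case.

```c
#include <assert.h>

/* Round the timeout up to the next multiple of safemode_delay at or
 * after now + safemode_delay, so that repeated calls within one
 * rounding interval all produce the same expiry value. */
unsigned long round_timeout(unsigned long now, unsigned long safemode_delay)
{
	unsigned long timeout = now + safemode_delay;

	/* round up to a multiple of safemode_delay */
	timeout += safemode_delay - 1;
	timeout -= timeout % safemode_delay;
	return timeout;
}
```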
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>
> This sounds like a good place to use percpu-refcount. In set_in_sync, we switch
> it to atomic, read it, then switch it back to percpu.

I did have a quick look at percpu-refcount and it didn't seem suitable.
There is no interface to set the counter to atomic and wait for that to
complete.
There is also no interface to perform a lockless sum - one that is
only expected to report '0' if the count has been zero for a little
while.

I might be able to make do without those, or possibly enhance
percpu-refcount.

I'll have a look - thanks.

NeilBrown



Thread overview: 27+ messages
2017-02-16  4:39 [md PATCH 00/14] remove all abuse of bi_phys_segments NeilBrown
2017-02-16  4:39 ` [md PATCH 01/14] md/raid5: use md_write_start to count stripes, not bios NeilBrown
2017-02-16 17:29   ` Shaohua Li
2017-02-17  2:04     ` NeilBrown
2017-02-16  4:39 ` [md PATCH 03/14] md/raid5: call bio_endio() directly rather than queueing for later NeilBrown
2017-02-16  4:39 ` [md PATCH 04/14] block: trace completion of all bios NeilBrown
2017-02-16  4:39 ` [md PATCH 02/14] md/raid5: simplfy delaying of writes while metadata is updated NeilBrown
2017-02-16 17:37   ` Shaohua Li
2017-02-17  2:10     ` NeilBrown
2017-02-16  4:39 ` [md PATCH 10/14] md/raid1: stop using bi_phys_segment NeilBrown
2017-02-20 10:57   ` Ming Lei
2017-02-21  0:05     ` NeilBrown
2017-02-21  7:41       ` Ming Lei
2017-03-03  0:34         ` NeilBrown
2017-02-16  4:39 ` [md PATCH 06/14] md/raid5: remove over-loading of ->bi_phys_segments NeilBrown
2017-02-16  4:39 ` [md PATCH 11/14] md/raid5: don't test ->writes_pending in raid5_remove_disk NeilBrown
2017-02-16  4:39 ` [md PATCH 05/14] md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter NeilBrown
2017-02-16  4:39 ` [md PATCH 07/14] Revert "md/raid5: limit request size according to implementation limits" NeilBrown
2017-02-16  4:39 ` [md PATCH 08/14] md/raid1, raid10: move rXbio accounting closer to allocation NeilBrown
2017-02-16  4:39 ` [md PATCH 09/14] md/raid10: stop using bi_phys_segments NeilBrown
2017-02-16 14:26   ` Jack Wang
2017-02-17  2:15     ` NeilBrown
2017-02-16  4:39 ` [md PATCH 13/14] md: close a race with setting mddev->in_sync NeilBrown
2017-02-16  4:39 ` [md PATCH 14/14] MD: use per-cpu counter for writes_pending NeilBrown
2017-02-16 20:12   ` Shaohua Li
2017-02-17  2:34     ` NeilBrown [this message]
2017-02-16  4:39 ` [md PATCH 12/14] md: factor out set_in_sync() NeilBrown
