From: NeilBrown <neilb@suse.com>
To: Shaohua Li <shli@kernel.org>
Cc: linux-raid@vger.kernel.org, hch@lst.de
Subject: [md PATCH 14/14] MD: use per-cpu counter for writes_pending
Date: Thu, 16 Feb 2017 15:39:03 +1100
Message-ID: <148721994322.7521.3431105903303667812.stgit@noble>
In-Reply-To: <148721992248.7521.17160361058957519076.stgit@noble>
The 'writes_pending' counter is used to determine when the
array is stable so that it can be marked in the superblock
as "Clean". Consequently it needs to be updated frequently
but only checked for zero occasionally. Recent changes to
raid5 cause the count to be updated even more often - once
per 4K rather than once per bio. This provided
justification for making the updates more efficient.
So we replace the atomic counter with a per-cpu array of
'long' counters. Incrementing and decrementing is normally
much cheaper, testing for zero is more expensive.
To meaningfully be able to test for zero we need to be able
to block further updates. This is done by forcing the
"increment" step to take a spinlock in the rare case that
another thread is checking if the count is zero. This is
done using a new field: "checkers". "checkers" is the
number of threads that are currently checking whether the
count is zero. It is usually 0, occasionally 1, and it is
not impossible that it could be higher, though this would be
rare.
If, within an rcu_read_locked section, checkers is seen to
be zero, then the local-cpu counter can be incremented
freely. If checkers is not zero, mddev->lock must be taken
before the increment is allowed. A decrement is always
allowed.
To test for zero, a thread must increment "checkers", call
synchronize_rcu(), then take mddev->lock. Once this is done
no new increments can happen. A thread may choose to
perform a quick test-for-zero by summing all the counters
without holding a lock. If this is non-zero, then the total
count is non-zero, or was non-zero very recently, so it is
safe to assume that it isn't zero. If the quick check does
report a zero sum, then it is worth performing the locking
protocol.
When the counter is decremented, it is no longer possible to
immediately test if the result is zero
(atomic_dec_and_test()). We don't even really want to
perform the "quick" tests as that sums over all cpus and is
work that will most often bring no benefit.
In the "safemode==2" case, when we want to mark the array as
"clean" immediately when there are no writes, we perform the
quick test anyway, and possibly wake the md thread to do the
full test. "safemode==2" is only used during shutdown so
the cost is not problematic.
When safemode!=2 we always set the timer, rather than only
when the counter reaches zero.
If mod_timer() is called to set the timeout to the value it
already has, mod_timer() has low overhead with no atomic
operations. So at worst it will have a noticeable cost once
per jiffy. To further reduce the overhead, we round the
requested delay to a multiple of ->safemode_delay. This
might increase the delay until the timer fires a little, but
will reduce the overhead of calling mod_timer()
significantly. If lots of requests are completing, the
timer will be updated every 200 milliseconds (by default)
and never fire. When it does eventually fire, it will
schedule the md thread to perform the full test for
writes_pending==0, and this is quite likely to find '0'.
Signed-off-by: NeilBrown <neilb@suse.com>
---
drivers/md/md.c | 107 +++++++++++++++++++++++++++++++++++++++++++++++--------
drivers/md/md.h | 3 +-
2 files changed, 93 insertions(+), 17 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index f856c95ee7d5..47bbd22e2865 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -64,6 +64,8 @@
#include <linux/raid/md_p.h>
#include <linux/raid/md_u.h>
#include <linux/slab.h>
+#include <linux/percpu.h>
+
#include <trace/events/block.h>
#include "md.h"
#include "bitmap.h"
@@ -2250,21 +2252,52 @@ static void export_array(struct mddev *mddev)
mddev->major_version = 0;
}
+/*
+ * The percpu writes_pending counters are linked with ->checkers and
+ * ->lock. ->writes_pending can always be decremented without a
+ * lock. It can only be incremented without a lock if ->checkers is 0
+ * and the test+incr happen in an rcu_read_locked region.
+ * ->checkers can only be changed under ->lock protection.
+ * To determine if ->writes_pending is totally zero, a quick sum without
+ * locks can be performed. If this is non-zero, then the result is final.
+ * Otherwise ->checkers must be incremented and synchronize_rcu() called.
+ * Then a sum is calculated under ->lock, and the result is final until
+ * ->checkers is decremented, or the lock is dropped.
+ *
+ */
+
+static bool __writes_pending(struct mddev *mddev)
+{
+ long cnt = 0;
+ int i;
+
+ for_each_possible_cpu(i)
+ cnt += *per_cpu_ptr(mddev->writes_pending_percpu, i);
+ return cnt != 0;
+}
+
static bool set_in_sync(struct mddev *mddev)
{
bool ret = false;
WARN_ON_ONCE(!spin_is_locked(&mddev->lock));
- if (atomic_read(&mddev->writes_pending) == 0) {
- if (mddev->in_sync == 0) {
+ if (!__writes_pending(mddev) && !mddev->in_sync) {
+ mddev->checkers++;
+ spin_unlock(&mddev->lock);
+ synchronize_rcu();
+ spin_lock(&mddev->lock);
+ if (!mddev->in_sync &&
+ !__writes_pending(mddev)) {
mddev->in_sync = 1;
+ /*
+ * Ensure in_sync is visible before ->checkers
+ * is decremented
+ */
smp_mb();
- if (atomic_read(&mddev->writes_pending))
- /* lost a race with md_write_start() */
- mddev->in_sync = 0;
set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
sysfs_notify_dirent_safe(mddev->sysfs_state);
}
+ mddev->checkers--;
ret = true;
}
if (mddev->safemode == 1)
@@ -2272,6 +2305,29 @@ static bool set_in_sync(struct mddev *mddev)
return ret;
}
+static void inc_writes_pending(struct mddev *mddev)
+{
+ rcu_read_lock();
+ if (mddev->checkers == 0) {
+ __this_cpu_inc(*mddev->writes_pending_percpu);
+ rcu_read_unlock();
+ return;
+ }
+ rcu_read_unlock();
+ /* Need that spinlock */
+ spin_lock(&mddev->lock);
+ this_cpu_inc(*mddev->writes_pending_percpu);
+ spin_unlock(&mddev->lock);
+}
+
+static void zero_writes_pending(struct mddev *mddev)
+{
+ int i;
+
+ for_each_possible_cpu(i)
+ *per_cpu_ptr(mddev->writes_pending_percpu, i) = 0;
+}
+
static void sync_sbs(struct mddev *mddev, int nospares)
{
/* Update each superblock (in-memory image), but
@@ -5000,6 +5056,8 @@ static void md_free(struct kobject *ko)
del_gendisk(mddev->gendisk);
put_disk(mddev->gendisk);
}
+ if (mddev->writes_pending_percpu)
+ free_percpu(mddev->writes_pending_percpu);
kfree(mddev);
}
@@ -5076,6 +5134,13 @@ static int md_alloc(dev_t dev, char *name)
blk_queue_make_request(mddev->queue, md_make_request);
blk_set_stacking_limits(&mddev->queue->limits);
+ mddev->writes_pending_percpu = alloc_percpu(long);
+ if (!mddev->writes_pending_percpu) {
+ blk_cleanup_queue(mddev->queue);
+ mddev->queue = NULL;
+ goto abort;
+ }
+
disk = alloc_disk(1 << shift);
if (!disk) {
blk_cleanup_queue(mddev->queue);
@@ -5159,7 +5224,7 @@ static void md_safemode_timeout(unsigned long data)
{
struct mddev *mddev = (struct mddev *) data;
- if (!atomic_read(&mddev->writes_pending)) {
+ if (!__writes_pending(mddev)) {
mddev->safemode = 1;
if (mddev->external)
sysfs_notify_dirent_safe(mddev->sysfs_state);
@@ -5365,7 +5430,7 @@ int md_run(struct mddev *mddev)
} else if (mddev->ro == 2) /* auto-readonly not meaningful */
mddev->ro = 0;
- atomic_set(&mddev->writes_pending,0);
+ zero_writes_pending(mddev);
atomic_set(&mddev->max_corr_read_errors,
MD_DEFAULT_MAX_CORRECTED_READ_ERRORS);
mddev->safemode = 0;
@@ -7774,11 +7839,13 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
md_wakeup_thread(mddev->sync_thread);
did_change = 1;
}
- atomic_inc(&mddev->writes_pending);
+ rcu_read_lock();
+ inc_writes_pending(mddev);
smp_mb(); /* Match smp_mb in set_in_sync() */
if (mddev->safemode == 1)
mddev->safemode = 0;
- if (mddev->in_sync) {
+ if (mddev->in_sync || mddev->checkers) {
+ rcu_read_unlock();
spin_lock(&mddev->lock);
if (mddev->in_sync) {
mddev->in_sync = 0;
@@ -7788,7 +7855,9 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
did_change = 1;
}
spin_unlock(&mddev->lock);
- }
+ } else
+ rcu_read_unlock();
+
if (did_change)
sysfs_notify_dirent_safe(mddev->sysfs_state);
wait_event(mddev->sb_wait,
@@ -7798,12 +7867,18 @@ EXPORT_SYMBOL(md_write_start);
void md_write_end(struct mddev *mddev)
{
- if (atomic_dec_and_test(&mddev->writes_pending)) {
- if (mddev->safemode == 2)
+ this_cpu_dec(*mddev->writes_pending_percpu);
+ if (mddev->safemode == 2) {
+ if (!__writes_pending(mddev))
md_wakeup_thread(mddev->thread);
- else if (mddev->safemode_delay)
- mod_timer(&mddev->safemode_timer, jiffies + mddev->safemode_delay);
- }
+ } else if (mddev->safemode_delay)
+ /* The roundup() ensures this only performs locking once
+ * every ->safemode_delay jiffies
+ */
+ mod_timer(&mddev->safemode_timer,
+ roundup(jiffies, mddev->safemode_delay) +
+ mddev->safemode_delay);
+
}
EXPORT_SYMBOL(md_write_end);
@@ -8406,7 +8481,7 @@ void md_check_recovery(struct mddev *mddev)
test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
test_bit(MD_RELOAD_SB, &mddev->flags) ||
(mddev->external == 0 && mddev->safemode == 1) ||
- (mddev->safemode == 2 && ! atomic_read(&mddev->writes_pending)
+ (mddev->safemode == 2 && !__writes_pending(mddev)
&& !mddev->in_sync && mddev->recovery_cp == MaxSector)
))
return;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 2a514036a83d..7e41f882d33d 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -404,7 +404,8 @@ struct mddev {
*/
unsigned int safemode_delay;
struct timer_list safemode_timer;
- atomic_t writes_pending;
+ long *writes_pending_percpu;
+ int checkers; /* # of threads checking writes_pending */
struct request_queue *queue; /* for plugging ... */
struct bitmap *bitmap; /* the bitmap for the device */