Linux RAID subsystem development
 help / color / mirror / Atom feed
* [PATCH 1/2] raid5-cache: suspend reclaim thread instead of shutdown
From: Shaohua Li @ 2016-11-21 18:29 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, neilb

There is mechanism to suspend a kernel thread. Use it instead of playing
create/destroy game.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/md.c          |  4 +++-
 drivers/md/raid5-cache.c | 18 +++++-------------
 2 files changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index d3cef77..f548469 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7136,10 +7136,12 @@ static int md_thread(void *arg)
 		wait_event_interruptible_timeout
 			(thread->wqueue,
 			 test_bit(THREAD_WAKEUP, &thread->flags)
-			 || kthread_should_stop(),
+			 || kthread_should_stop() || kthread_should_park(),
 			 thread->timeout);
 
 		clear_bit(THREAD_WAKEUP, &thread->flags);
+		if (kthread_should_park())
+			kthread_parkme();
 		if (!kthread_should_stop())
 			thread->run(thread);
 	}
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 8cb79fc..5f817bd 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -19,6 +19,7 @@
 #include <linux/raid/md_p.h>
 #include <linux/crc32c.h>
 #include <linux/random.h>
+#include <linux/kthread.h>
 #include "md.h"
 #include "raid5.h"
 #include "bitmap.h"
@@ -1437,23 +1438,14 @@ void r5l_quiesce(struct r5l_log *log, int state)
 	struct mddev *mddev;
 	if (!log || state == 2)
 		return;
-	if (state == 0) {
-		/*
-		 * This is a special case for hotadd. In suspend, the array has
-		 * no journal. In resume, journal is initialized as well as the
-		 * reclaim thread.
-		 */
-		if (log->reclaim_thread)
-			return;
-		log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
-					log->rdev->mddev, "reclaim");
-		log->reclaim_thread->timeout = R5C_RECLAIM_WAKEUP_INTERVAL;
-	} else if (state == 1) {
+	if (state == 0)
+		kthread_unpark(log->reclaim_thread->tsk);
+	else if (state == 1) {
 		/* make sure r5l_write_super_and_discard_space exits */
 		mddev = log->rdev->mddev;
 		wake_up(&mddev->sb_wait);
+		kthread_park(log->reclaim_thread->tsk);
 		r5l_wake_reclaim(log, MaxSector);
-		md_unregister_thread(&log->reclaim_thread);
 		r5l_do_reclaim(log);
 	}
 }
-- 
2.9.3


^ permalink raw reply related

* [PATCH 2/2] md: stop write should stop journal reclaim
From: Shaohua Li @ 2016-11-21 18:29 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, neilb
In-Reply-To: <598ff2610261ffa745b682c311d5300fefeb2fcc.1479751751.git.shli@fb.com>

__md_stop_writes currently doesn't stop raid5-cache reclaim thread. It's
possible the reclaim thread is still running and doing write, which
doesn't match what __md_stop_writes should do. The extra ->quiesce()
call should not harm any raid types. For raid5-cache, this will
guarantee we reclaim all caches before we update superblock.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/md.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index f548469..560150c 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5474,6 +5474,10 @@ static void __md_stop_writes(struct mddev *mddev)
 
 	del_timer_sync(&mddev->safemode_timer);
 
+	if (mddev->pers && mddev->pers->quiesce) {
+		mddev->pers->quiesce(mddev, 1);
+		mddev->pers->quiesce(mddev, 0);
+	}
 	bitmap_flush(mddev);
 
 	if (mddev->ro == 0 &&
-- 
2.9.3


^ permalink raw reply related

* Re: linux raid wiki - backup files
From: Anthony Youngman @ 2016-11-21 20:22 UTC (permalink / raw)
  To: Phil Turmel, linux-raid, neilb
In-Reply-To: <4febebb6-0442-cc5e-45d5-e041e21e2d95@turmel.org>



On 21/11/16 14:07, Phil Turmel wrote:
> On 11/20/2016 09:48 AM, Wols Lists wrote:
>> On 20/11/16 00:27, Phil Turmel wrote:
>>> Yes.  But the new stripes lay on top of the old stripes, unless you move
>>> the data offset.  Which is why a backup file holds the old stripe just
>>> in case.  If you can move the offset, you use the lower offset for the
>>> lower addresses in the array, and the higher offset for the higher
>>> addresses, on either side of the reshape position.
>>
>> Okay, understood. So v0.9 and v1.0 always need a backup for a reshape.

Having looked at the man page, this now seems obvious - the superblock 
is at the end, so the data offset is 0. But for a 1.0 array, could we 
create a data offset?

(So, if we created a data offset, we could then move the superblock and 
convert a 1.0 to 1.1 or 1.2? Okay, it can't do it now, but it looks to 
me like it shouldn't be that hard ... ?)
>>
>> But if we have a data offset with v1.2, a reshape will use that space if
>> it can rather than needing a backup file?
>
I'm guessing that 1.0 and 1.1 defaulted to no data offset to speak of? 
And if we (can) create a decent data offset, we can then use that in 
exactly the same way as with v1.2?

Cheers,
Wol

^ permalink raw reply

* Re: linux raid wiki - backup files
From: Phil Turmel @ 2016-11-21 21:02 UTC (permalink / raw)
  To: Anthony Youngman, linux-raid, neilb
In-Reply-To: <c3d16a30-b36e-3d26-f30a-441b236f7107@youngman.org.uk>

On 11/21/2016 03:22 PM, Anthony Youngman wrote:
> Having looked at the man page, this now seems obvious - the superblock
> is at the end, so the data offset is 0. But for a 1.0 array, could we
> create a data offset?
> 
> (So, if we created a data offset, we could then move the superblock and
> convert a 1.0 to 1.1 or 1.2? Okay, it can't do it now, but it looks to
> me like it shouldn't be that hard ... ?)

I suppose it would be possible, but defeats the purpose of having data
offset == 0: making non-parity array contents directly readable outside
the array.  Commonly used to raid boot partitions but still have grub
able to read them directly.  Less of an issue today with grub2, I
suppose (but I don't use bootloaders anymore, so I'm not a good resource
for that).

>>> But if we have a data offset with v1.2, a reshape will use that space if
>>> it can rather than needing a backup file?
>>
> I'm guessing that 1.0 and 1.1 defaulted to no data offset to speak of?
> And if we (can) create a decent data offset, we can then use that in
> exactly the same way as with v1.2?

V1.1 and v1.2 are identical except for the superblock offset (one 4k
block difference).  v1.1 reshapes just like v1.2.

Phil

^ permalink raw reply

* Re: linux raid wiki - backup files
From: Anthony Youngman @ 2016-11-21 21:30 UTC (permalink / raw)
  To: Phil Turmel, linux-raid, neilb
In-Reply-To: <177d8cfb-f7ec-b0e4-040f-3cfb55f033cc@turmel.org>



On 21/11/16 21:02, Phil Turmel wrote:
> On 11/21/2016 03:22 PM, Anthony Youngman wrote:
>> Having looked at the man page, this now seems obvious - the superblock
>> is at the end, so the data offset is 0. But for a 1.0 array, could we
>> create a data offset?
>>
>> (So, if we created a data offset, we could then move the superblock and
>> convert a 1.0 to 1.1 or 1.2? Okay, it can't do it now, but it looks to
>> me like it shouldn't be that hard ... ?)
>
> I suppose it would be possible, but defeats the purpose of having data
> offset == 0: making non-parity array contents directly readable outside
> the array.  Commonly used to raid boot partitions but still have grub
> able to read them directly.  Less of an issue today with grub2, I
> suppose (but I don't use bootloaders anymore, so I'm not a good resource
> for that).

Only works for mirrors ...

So basically, there's no point using 1.0 for any form of parity raid. In 
which case, if you convert a mirror to parity, it would be nice to be 
able to create said offset and move the superblock.
>
>>>> But if we have a data offset with v1.2, a reshape will use that space if
>>>> it can rather than needing a backup file?
>>>
>> I'm guessing that 1.0 and 1.1 defaulted to no data offset to speak of?
>> And if we (can) create a decent data offset, we can then use that in
>> exactly the same way as with v1.2?
>
> V1.1 and v1.2 are identical except for the superblock offset (one 4k
> block difference).  v1.1 reshapes just like v1.2.
>
Any reason for that difference? How big is the superblock? Did v1.1 fill 
most of the first 4K and just leave not much room with a default 4K data 
offset initially?

> Phil
>
Cheers,
Wol

^ permalink raw reply

* Re: linux raid wiki - backup files
From: Phil Turmel @ 2016-11-21 21:40 UTC (permalink / raw)
  To: Anthony Youngman, linux-raid, neilb
In-Reply-To: <980f8ae9-124e-e10c-afc2-97eef9e05b94@youngman.org.uk>

On 11/21/2016 04:30 PM, Anthony Youngman wrote:
> So basically, there's no point using 1.0 for any form of parity raid. In
> which case, if you convert a mirror to parity, it would be nice to be
> able to create said offset and move the superblock.

Not for new parity arrays, no.  v1.0 is what you get if you convert an
old v0.90 array to v1.x using --assemble --update=metadata.  I simply
don't know if a v1.0 superblock will run with a non-zero data offset.
That would have to be an intermediate step before the superblock and the
rest of the metadata could be relocated.

>> V1.1 and v1.2 are identical except for the superblock offset (one 4k
>> block difference).  v1.1 reshapes just like v1.2.
>>
> Any reason for that difference? How big is the superblock? Did v1.1 fill
> most of the first 4K and just leave not much room with a default 4K data
> offset initially?

v1.1 places the superblock in the first sector of the device.  v1.2
skips the first 4k, making such arrays resistant to destruction by
accidental creation of MBR partition tables or similar crises.

Phil

^ permalink raw reply

* [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Coly Li @ 2016-11-21 21:54 UTC (permalink / raw)
  To: linux-raid
  Cc: Coly Li, Shaohua Li, Neil Brown, Johannes Thumshirn,
	Guoqing Jiang

'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
introduces a sliding resync window for raid1 I/O barrier, this idea limits
I/O barriers to happen only inside a slidingresync window, for regular
I/Os out of this resync window they don't need to wait for barrier any
more. On large raid1 device, it helps a lot to improve parallel writing
I/O throughput when there are background resync I/Os performing at
same time. 

The idea of sliding resync widow is awesome, but there are several
challenges are very difficult to solve,
 - code complexity
   Sliding resync window requires several veriables to work collectively,
   this is complexed and very hard to make it work correctly. Just grep
   "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches to fix
   the original resync window patch. This is not the end, any further
   related modification may easily introduce more regreassion.
 - multiple sliding resync windows
   Currently raid1 code only has a single sliding resync window, we cannot
   do parallel resync with current I/O barrier implementation.
   Implementing multiple resync windows are much more complexed, and very
   hard to make it correctly.

Therefore I decide to implement a much simpler raid1 I/O barrier, by
removing resync window code, I believe life will be much easier.

The brief idea of the simpler barrier is,
 - Do not maintain a logbal unique resync window
 - Use multiple hash buckets to reduce I/O barrier conflictions, regular
   I/O only has to wait for a resync I/O when both them have same barrier
   bucket index, vice versa.
 - I/O barrier can be recuded to an acceptable number if there are enought
   barrier buckets

Here I explain how the barrier buckets are designed,
 - BARRIER_UNIT_SECTOR_SIZE
   The whole LBA address space of a raid1 device is divided into multiple
   barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
   Bio request won't go across border of barrier unit size, that means
   maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 in bytes.
 - BARRIER_BUCKETS_NR
   There are BARRIER_BUCKETS_NR buckets in total, if multiple I/O requests
   hit different barrier units, they only need to compete I/O barrier with
   other I/Os which hit the same barrier bucket index with each other. The
   index of a barrier bucket which a bio should look for is calculated by
   get_barrier_bucket_idx(),
	(sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR
   sector is the start sector number of a bio. align_to_barrier_unit_end()
   will make sure the finall bio sent into generic_make_request() won't
   exceed border of the barrier unit size.
 - RRIER_BUCKETS_NR
   Number of barrier buckets is defined by,
	#define BARRIER_BUCKETS_NR	(PAGE_SIZE/sizeof(long))
   For 4KB page size, there are 512 buckets for each raid1 device. That
   means the propobility of full random I/O barrier confliction may be
   reduced down to 1/512.

Comparing to single sliding resync window,
 - Currently resync I/O grows linearly, therefore regular and resync I/O
   will have confliction within a single barrier units. So it is similar to
   single sliding resync window.
 - But a barrier unit bucket is shared by all barrier units with identical
   barrier uinit index, the probability of confliction might be higher
   than single sliding resync window, in condition that writing I/Os
   always hit barrier units which have identical barrier bucket index with
   the resync I/Os. This is a very rare condition in real I/O work loads,
   I cannot imagine how it could happen in practice.
 - Therefore we can achieve a good enough low confliction rate with much
   simpler barrier algorithm and implementation.

If user has a (realy) large raid1 device, for example 10PB size, we may
just increase the buckets number BARRIER_BUCKETS_NR. Now this is a macro,
it is possible to be a raid1-created-time-defined variable in future.

There are two changes should be noticed,
 - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
   single loop, it looks like this,
	spin_lock_irqsave(&conf->device_lock, flags);
	conf->nr_queued[idx]--;
	spin_unlock_irqrestore(&conf->device_lock, flags);
   This change generates more spin lock operations, but in next patch of
   this patch set, it will be replaced by a single line code,
	atomic_dec(conf->nr_queueud[idx]);
   So we don't need to worry about spin lock cost here.
 - In raid1_make_request(), wait_barrier() is replaced by,
   a) wait_read_barrier(): wait barrier in regular read I/O code path
   b) wait_barrier(): wait barrier in regular write I/O code path
   The differnece is wait_read_barrier() only waits if array is frozen, I
   am not able to combile them into one function, because they must receive
   differnet data types in their arguments list.
 - align_to_barrier_unit_end() is called to make sure both regular and
   resync I/O won't go across the border of a barrier unit size.
 
Open question:
 - Need review from md clustring developer, I don't touch related code now.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/raid1.c | 389 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------
 drivers/md/raid1.h |  42 +++++-------
 2 files changed, 242 insertions(+), 189 deletions(-)

Index: linux-raid1/drivers/md/raid1.c
===================================================================
--- linux-raid1.orig/drivers/md/raid1.c
+++ linux-raid1/drivers/md/raid1.c
@@ -66,9 +66,8 @@
  */
 static int max_queued_requests = 1024;
 
-static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
-			  sector_t bi_sector);
-static void lower_barrier(struct r1conf *conf);
+static void allow_barrier(struct r1conf *conf, sector_t sector_nr);
+static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
 
 static void * r1bio_pool_alloc(gfp_t gfp_flags, void *data)
 {
@@ -92,7 +91,6 @@ static void r1bio_pool_free(void *r1_bio
 #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9)
 #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW)
 #define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9)
-#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
 
 static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
 {
@@ -198,6 +196,7 @@ static void put_buf(struct r1bio *r1_bio
 {
 	struct r1conf *conf = r1_bio->mddev->private;
 	int i;
+	sector_t sector_nr = r1_bio->sector;
 
 	for (i = 0; i < conf->raid_disks * 2; i++) {
 		struct bio *bio = r1_bio->bios[i];
@@ -207,7 +206,7 @@ static void put_buf(struct r1bio *r1_bio
 
 	mempool_free(r1_bio, conf->r1buf_pool);
 
-	lower_barrier(conf);
+	lower_barrier(conf, sector_nr);
 }
 
 static void reschedule_retry(struct r1bio *r1_bio)
@@ -215,10 +214,15 @@ static void reschedule_retry(struct r1bi
 	unsigned long flags;
 	struct mddev *mddev = r1_bio->mddev;
 	struct r1conf *conf = mddev->private;
+	sector_t sector_nr;
+	long idx;
+
+	sector_nr = r1_bio->sector;
+	idx = get_barrier_bucket_idx(sector_nr);
 
 	spin_lock_irqsave(&conf->device_lock, flags);
 	list_add(&r1_bio->retry_list, &conf->retry_list);
-	conf->nr_queued ++;
+	conf->nr_queued[idx]++;
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 
 	wake_up(&conf->wait_barrier);
@@ -235,8 +239,6 @@ static void call_bio_endio(struct r1bio
 	struct bio *bio = r1_bio->master_bio;
 	int done;
 	struct r1conf *conf = r1_bio->mddev->private;
-	sector_t start_next_window = r1_bio->start_next_window;
-	sector_t bi_sector = bio->bi_iter.bi_sector;
 
 	if (bio->bi_phys_segments) {
 		unsigned long flags;
@@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio
 	if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
 		bio->bi_error = -EIO;
 
-	if (done) {
+	if (done)
 		bio_endio(bio);
-		/*
-		 * Wake up any possible resync thread that waits for the device
-		 * to go idle.
-		 */
-		allow_barrier(conf, start_next_window, bi_sector);
-	}
 }
 
 static void raid_end_bio_io(struct r1bio *r1_bio)
 {
 	struct bio *bio = r1_bio->master_bio;
+	struct r1conf *conf = r1_bio->mddev->private;
 
 	/* if nobody has done the final endio yet, do it now */
 	if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
@@ -278,6 +275,12 @@ static void raid_end_bio_io(struct r1bio
 
 		call_bio_endio(r1_bio);
 	}
+
+	/*
+	 * Wake up any possible resync thread that waits for the device
+	 * to go idle.
+	 */
+	allow_barrier(conf, r1_bio->sector);
 	free_r1bio(r1_bio);
 }
 
@@ -311,6 +314,7 @@ static int find_bio_disk(struct r1bio *r
 	return mirror;
 }
 
+/* bi_end_io callback for regular READ bio */
 static void raid1_end_read_request(struct bio *bio)
 {
 	int uptodate = !bio->bi_error;
@@ -490,6 +494,25 @@ static void raid1_end_write_request(stru
 		bio_put(to_put);
 }
 
+static sector_t align_to_barrier_unit_end(sector_t start_sector,
+					  sector_t sectors)
+{
+	sector_t len;
+
+	WARN_ON(sectors == 0);
+	/* len is the number of sectors from start_sector to end of the
+	 * barrier unit which start_sector belongs to.
+	 */
+	len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
+	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
+	      start_sector;
+
+	if (len > sectors)
+		len = sectors;
+
+	return len;
+}
+
 /*
  * This routine returns the disk from which the requested read should
  * be done. There is a per-array 'next expected sequential IO' sector
@@ -691,6 +714,7 @@ static int read_balance(struct r1conf *c
 		conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
 	}
 	rcu_read_unlock();
+	sectors = align_to_barrier_unit_end(this_sector, sectors);
 	*max_sectors = sectors;
 
 	return best_disk;
@@ -779,168 +803,174 @@ static void flush_pending_writes(struct
  *    there is no normal IO happeing.  It must arrange to call
  *    lower_barrier when the particular background IO completes.
  */
+
 static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
 {
+	long idx = get_barrier_bucket_idx(sector_nr);
+
 	spin_lock_irq(&conf->resync_lock);
 
 	/* Wait until no block IO is waiting */
-	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
+	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
 			    conf->resync_lock);
 
 	/* block any new IO from starting */
-	conf->barrier++;
-	conf->next_resync = sector_nr;
+	conf->barrier[idx]++;
 
 	/* For these conditions we must wait:
 	 * A: while the array is in frozen state
-	 * B: while barrier >= RESYNC_DEPTH, meaning resync reach
-	 *    the max count which allowed.
-	 * C: next_resync + RESYNC_SECTORS > start_next_window, meaning
-	 *    next resync will reach to the window which normal bios are
-	 *    handling.
-	 * D: while there are any active requests in the current window.
+	 * B: while conf->nr_pending[idx] is not 0, meaning regular I/O
+	 *    existing in sector number ranges corresponding to idx.
+	 * C: while conf->barrier[idx] >= RESYNC_DEPTH, meaning resync reach
+	 *    the max count which allowed in sector number ranges
+	 *    conrresponding to idx.
 	 */
 	wait_event_lock_irq(conf->wait_barrier,
-			    !conf->array_frozen &&
-			    conf->barrier < RESYNC_DEPTH &&
-			    conf->current_window_requests == 0 &&
-			    (conf->start_next_window >=
-			     conf->next_resync + RESYNC_SECTORS),
+			    !conf->array_frozen && !conf->nr_pending[idx] &&
+			    conf->barrier[idx] < RESYNC_DEPTH,
 			    conf->resync_lock);
-
-	conf->nr_pending++;
+	conf->nr_pending[idx]++;
 	spin_unlock_irq(&conf->resync_lock);
 }
 
-static void lower_barrier(struct r1conf *conf)
+static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
 {
 	unsigned long flags;
-	BUG_ON(conf->barrier <= 0);
+	long idx = get_barrier_bucket_idx(sector_nr);
+
+	BUG_ON(conf->barrier[idx] <= 0);
 	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->barrier--;
-	conf->nr_pending--;
+	conf->barrier[idx]--;
+	conf->nr_pending[idx]--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
 }
 
-static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
+/* A regular I/O should wait when,
+ * - The whole array is frozen (both READ and WRITE)
+ * - bio is WRITE and in same barrier bucket conf->barrier[idx] raised
+ */
+static void _wait_barrier(struct r1conf *conf, long idx)
 {
-	bool wait = false;
-
-	if (conf->array_frozen || !bio)
-		wait = true;
-	else if (conf->barrier && bio_data_dir(bio) == WRITE) {
-		if ((conf->mddev->curr_resync_completed
-		     >= bio_end_sector(bio)) ||
-		    (conf->next_resync + NEXT_NORMALIO_DISTANCE
-		     <= bio->bi_iter.bi_sector))
-			wait = false;
-		else
-			wait = true;
+	spin_lock_irq(&conf->resync_lock);
+	if (conf->array_frozen || conf->barrier[idx]) {
+		conf->nr_waiting[idx]++;
+		/* Wait for the barrier to drop. */
+		wait_event_lock_irq(
+			conf->wait_barrier,
+			!conf->array_frozen && !conf->barrier[idx],
+			conf->resync_lock);
+		conf->nr_waiting[idx]--;
 	}
 
-	return wait;
+	conf->nr_pending[idx]++;
+	spin_unlock_irq(&conf->resync_lock);
 }
 
-static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
+static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
 {
-	sector_t sector = 0;
+	long idx = get_barrier_bucket_idx(sector_nr);
 
 	spin_lock_irq(&conf->resync_lock);
-	if (need_to_wait_for_sync(conf, bio)) {
-		conf->nr_waiting++;
-		/* Wait for the barrier to drop.
-		 * However if there are already pending
-		 * requests (preventing the barrier from
-		 * rising completely), and the
-		 * per-process bio queue isn't empty,
-		 * then don't wait, as we need to empty
-		 * that queue to allow conf->start_next_window
-		 * to increase.
-		 */
-		wait_event_lock_irq(conf->wait_barrier,
-				    !conf->array_frozen &&
-				    (!conf->barrier ||
-				     ((conf->start_next_window <
-				       conf->next_resync + RESYNC_SECTORS) &&
-				      current->bio_list &&
-				      !bio_list_empty(current->bio_list))),
-				    conf->resync_lock);
-		conf->nr_waiting--;
-	}
-
-	if (bio && bio_data_dir(bio) == WRITE) {
-		if (bio->bi_iter.bi_sector >= conf->next_resync) {
-			if (conf->start_next_window == MaxSector)
-				conf->start_next_window =
-					conf->next_resync +
-					NEXT_NORMALIO_DISTANCE;
-
-			if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE)
-			    <= bio->bi_iter.bi_sector)
-				conf->next_window_requests++;
-			else
-				conf->current_window_requests++;
-			sector = conf->start_next_window;
-		}
+	if (conf->array_frozen) {
+		conf->nr_waiting[idx]++;
+		/* Wait for array to unfreeze */
+		wait_event_lock_irq(
+			conf->wait_barrier,
+			!conf->array_frozen,
+			conf->resync_lock);
+		conf->nr_waiting[idx]--;
 	}
-
-	conf->nr_pending++;
+	conf->nr_pending[idx]++;
 	spin_unlock_irq(&conf->resync_lock);
-	return sector;
 }
 
-static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
-			  sector_t bi_sector)
+static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
+{
+	long idx = get_barrier_bucket_idx(sector_nr);
+
+	_wait_barrier(conf, idx);
+}
+
+static void wait_all_barriers(struct r1conf *conf)
+{
+	long idx;
+
+	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		_wait_barrier(conf, idx);
+}
+
+static void _allow_barrier(struct r1conf *conf, long idx)
 {
 	unsigned long flags;
 
 	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->nr_pending--;
-	if (start_next_window) {
-		if (start_next_window == conf->start_next_window) {
-			if (conf->start_next_window + NEXT_NORMALIO_DISTANCE
-			    <= bi_sector)
-				conf->next_window_requests--;
-			else
-				conf->current_window_requests--;
-		} else
-			conf->current_window_requests--;
-
-		if (!conf->current_window_requests) {
-			if (conf->next_window_requests) {
-				conf->current_window_requests =
-					conf->next_window_requests;
-				conf->next_window_requests = 0;
-				conf->start_next_window +=
-					NEXT_NORMALIO_DISTANCE;
-			} else
-				conf->start_next_window = MaxSector;
-		}
-	}
+	conf->nr_pending[idx]--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
 }
 
+static void allow_barrier(struct r1conf *conf, sector_t sector_nr)
+{
+	long idx = get_barrier_bucket_idx(sector_nr);
+
+	_allow_barrier(conf, idx);
+}
+
+static void allow_all_barriers(struct r1conf *conf)
+{
+	long idx;
+
+	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		_allow_barrier(conf, idx);
+}
+
+
+/* conf->resync_lock should be held */
+static int get_all_pendings(struct r1conf *conf)
+{
+	long idx;
+	int ret;
+
+	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		ret += conf->nr_pending[idx];
+	return ret;
+}
+
+/* conf->resync_lock should be held */
+static int get_all_queued(struct r1conf *conf)
+{
+	long idx;
+	int  ret;
+
+	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		ret += conf->nr_queued[idx];
+	return ret;
+}
+
 static void freeze_array(struct r1conf *conf, int extra)
 {
 	/* stop syncio and normal IO and wait for everything to
 	 * go quite.
-	 * We wait until nr_pending match nr_queued+extra
+	 * We wait until get_all_pending() matches get_all_queued()+extra,
+	 * which means sum of conf->nr_pending[] matches sum of
+	 * conf->nr_queued[] plus extra (which might be 0 or 1).
 	 * This is called in the context of one normal IO request
 	 * that has failed. Thus any sync request that might be pending
-	 * will be blocked by nr_pending, and we need to wait for
+	 * will be blocked by a conf->nr_pending[idx] which the idx depends
+	 * on the request's sector number, and we need to wait for
 	 * pending IO requests to complete or be queued for re-try.
-	 * Thus the number queued (nr_queued) plus this request (extra)
-	 * must match the number of pending IOs (nr_pending) before
-	 * we continue.
+	 * Thus the number queued (sum of conf->nr_queued[]) plus this
+	 * request (extra) must match the number of pending IOs (sum
+	 * of conf->nr_pending[]) before we continue.
 	 */
 	spin_lock_irq(&conf->resync_lock);
 	conf->array_frozen = 1;
-	wait_event_lock_irq_cmd(conf->wait_barrier,
-				conf->nr_pending == conf->nr_queued+extra,
-				conf->resync_lock,
-				flush_pending_writes(conf));
+	wait_event_lock_irq_cmd(
+		conf->wait_barrier,
+		get_all_pendings(conf) == get_all_queued(conf)+extra,
+		conf->resync_lock,
+		flush_pending_writes(conf));
 	spin_unlock_irq(&conf->resync_lock);
 }
 static void unfreeze_array(struct r1conf *conf)
@@ -1031,6 +1061,7 @@ static void raid1_unplug(struct blk_plug
 	kfree(plug);
 }
 
+
 static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 {
 	struct r1conf *conf = mddev->private;
@@ -1051,7 +1082,6 @@ static void raid1_make_request(struct md
 	int first_clone;
 	int sectors_handled;
 	int max_sectors;
-	sector_t start_next_window;
 
 	/*
 	 * Register the new request and wait if the reconstruction
@@ -1087,8 +1117,6 @@ static void raid1_make_request(struct md
 		finish_wait(&conf->wait_barrier, &w);
 	}
 
-	start_next_window = wait_barrier(conf, bio);
-
 	bitmap = mddev->bitmap;
 
 	/*
@@ -1121,6 +1149,14 @@ static void raid1_make_request(struct md
 		int rdisk;
 
 read_again:
+		/* Still need barrier for READ in case that whole
+		 * array is frozen.
+		 */
+		wait_read_barrier(conf, r1_bio->sector);
+		/* max_sectors from read_balance is  modified to no
+		 * go across border of the barrier unit which
+		 * r1_bio->sector is in.
+		 */
 		rdisk = read_balance(conf, r1_bio, &max_sectors);
 
 		if (rdisk < 0) {
@@ -1140,7 +1176,6 @@ read_again:
 				   atomic_read(&bitmap->behind_writes) == 0);
 		}
 		r1_bio->read_disk = rdisk;
-		r1_bio->start_next_window = 0;
 
 		read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
 		bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector,
@@ -1211,7 +1246,7 @@ read_again:
 
 	disks = conf->raid_disks * 2;
  retry_write:
-	r1_bio->start_next_window = start_next_window;
+	wait_barrier(conf, r1_bio->sector);
 	blocked_rdev = NULL;
 	rcu_read_lock();
 	max_sectors = r1_bio->sectors;
@@ -1279,27 +1314,17 @@ read_again:
 	if (unlikely(blocked_rdev)) {
 		/* Wait for this device to become unblocked */
 		int j;
-		sector_t old = start_next_window;
 
 		for (j = 0; j < i; j++)
 			if (r1_bio->bios[j])
 				rdev_dec_pending(conf->mirrors[j].rdev, mddev);
 		r1_bio->state = 0;
-		allow_barrier(conf, start_next_window, bio->bi_iter.bi_sector);
+		allow_barrier(conf, r1_bio->sector);
 		md_wait_for_blocked_rdev(blocked_rdev, mddev);
-		start_next_window = wait_barrier(conf, bio);
-		/*
-		 * We must make sure the multi r1bios of bio have
-		 * the same value of bi_phys_segments
-		 */
-		if (bio->bi_phys_segments && old &&
-		    old != start_next_window)
-			/* Wait for the former r1bio(s) to complete */
-			wait_event(conf->wait_barrier,
-				   bio->bi_phys_segments == 1);
 		goto retry_write;
 	}
 
+	max_sectors = align_to_barrier_unit_end(r1_bio->sector, max_sectors);
 	if (max_sectors < r1_bio->sectors) {
 		/* We are splitting this write into multiple parts, so
 		 * we need to prepare for allocating another r1_bio.
@@ -1495,19 +1520,11 @@ static void print_conf(struct r1conf *co
 
 static void close_sync(struct r1conf *conf)
 {
-	wait_barrier(conf, NULL);
-	allow_barrier(conf, 0, 0);
+	wait_all_barriers(conf);
+	allow_all_barriers(conf);
 
 	mempool_destroy(conf->r1buf_pool);
 	conf->r1buf_pool = NULL;
-
-	spin_lock_irq(&conf->resync_lock);
-	conf->next_resync = MaxSector - 2 * NEXT_NORMALIO_DISTANCE;
-	conf->start_next_window = MaxSector;
-	conf->current_window_requests +=
-		conf->next_window_requests;
-	conf->next_window_requests = 0;
-	spin_unlock_irq(&conf->resync_lock);
 }
 
 static int raid1_spare_active(struct mddev *mddev)
@@ -1787,7 +1804,7 @@ static int fix_sync_read_error(struct r1
 	struct bio *bio = r1_bio->bios[r1_bio->read_disk];
 	sector_t sect = r1_bio->sector;
 	int sectors = r1_bio->sectors;
-	int idx = 0;
+	long idx = 0;
 
 	while(sectors) {
 		int s = sectors;
@@ -1983,6 +2000,14 @@ static void process_checks(struct r1bio
 	}
 }
 
+/* If there is no error encountered during sync writing out, there are two
+ * places to destroy r1_bio:
+ *  - sync_request_write(): If all wbio completed even before returning
+ *    back to its caller.
+ *  - end_sync_write(): when all remaining sync writes are done
+ * When there are error encountered from the above functions, r1_bio will
+ * be handled to handle_sync_write_finish() by reschedule_retry().
+ */
 static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
 {
 	struct r1conf *conf = mddev->private;
@@ -2244,6 +2269,9 @@ static void handle_write_finished(struct
 {
 	int m;
 	bool fail = false;
+	sector_t sector_nr;
+	long idx;
+
 	for (m = 0; m < conf->raid_disks * 2 ; m++)
 		if (r1_bio->bios[m] == IO_MADE_GOOD) {
 			struct md_rdev *rdev = conf->mirrors[m].rdev;
@@ -2269,7 +2297,9 @@ static void handle_write_finished(struct
 	if (fail) {
 		spin_lock_irq(&conf->device_lock);
 		list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
-		conf->nr_queued++;
+		sector_nr = r1_bio->sector;
+		idx = get_barrier_bucket_idx(sector_nr);
+		conf->nr_queued[idx]++;
 		spin_unlock_irq(&conf->device_lock);
 		md_wakeup_thread(conf->mddev->thread);
 	} else {
@@ -2380,6 +2410,8 @@ static void raid1d(struct md_thread *thr
 	struct r1conf *conf = mddev->private;
 	struct list_head *head = &conf->retry_list;
 	struct blk_plug plug;
+	sector_t sector_nr;
+	long idx;
 
 	md_check_recovery(mddev);
 
@@ -2387,17 +2419,19 @@ static void raid1d(struct md_thread *thr
 	    !test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
 		LIST_HEAD(tmp);
 		spin_lock_irqsave(&conf->device_lock, flags);
-		if (!test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
-			while (!list_empty(&conf->bio_end_io_list)) {
-				list_move(conf->bio_end_io_list.prev, &tmp);
-				conf->nr_queued--;
-			}
-		}
+		if (!test_bit(MD_CHANGE_PENDING, &mddev->flags))
+			list_splice_init(&conf->bio_end_io_list, &tmp);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
+
 		while (!list_empty(&tmp)) {
 			r1_bio = list_first_entry(&tmp, struct r1bio,
 						  retry_list);
 			list_del(&r1_bio->retry_list);
+			sector_nr = r1_bio->sector;
+			idx = get_barrier_bucket_idx(sector_nr);
+			spin_lock_irqsave(&conf->device_lock, flags);
+			conf->nr_queued[idx]--;
+			spin_unlock_irqrestore(&conf->device_lock, flags);
 			if (mddev->degraded)
 				set_bit(R1BIO_Degraded, &r1_bio->state);
 			if (test_bit(R1BIO_WriteError, &r1_bio->state))
@@ -2418,7 +2452,9 @@ static void raid1d(struct md_thread *thr
 		}
 		r1_bio = list_entry(head->prev, struct r1bio, retry_list);
 		list_del(head->prev);
-		conf->nr_queued--;
+		sector_nr = r1_bio->sector;
+		idx = get_barrier_bucket_idx(sector_nr);
+		conf->nr_queued[idx]--;
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 
 		mddev = r1_bio->mddev;
@@ -2457,7 +2493,6 @@ static int init_resync(struct r1conf *co
 					  conf->poolinfo);
 	if (!conf->r1buf_pool)
 		return -ENOMEM;
-	conf->next_resync = 0;
 	return 0;
 }
 
@@ -2486,6 +2521,7 @@ static sector_t raid1_sync_request(struc
 	int still_degraded = 0;
 	int good_sectors = RESYNC_SECTORS;
 	int min_bad = 0; /* number of sectors that are bad in all devices */
+	long idx = get_barrier_bucket_idx(sector_nr);
 
 	if (!conf->r1buf_pool)
 		if (init_resync(conf))
@@ -2535,7 +2571,7 @@ static sector_t raid1_sync_request(struc
 	 * If there is non-resync activity waiting for a turn, then let it
 	 * though before starting on this new sync request.
 	 */
-	if (conf->nr_waiting)
+	if (conf->nr_waiting[idx])
 		schedule_timeout_uninterruptible(1);
 
 	/* we are incrementing sector_nr below. To be safe, we check against
@@ -2562,6 +2598,8 @@ static sector_t raid1_sync_request(struc
 	r1_bio->sector = sector_nr;
 	r1_bio->state = 0;
 	set_bit(R1BIO_IsSync, &r1_bio->state);
+	/* make sure good_sectors won't go across barrier unit border */
+	good_sectors = align_to_barrier_unit_end(sector_nr, good_sectors);
 
 	for (i = 0; i < conf->raid_disks * 2; i++) {
 		struct md_rdev *rdev;
@@ -2786,6 +2824,22 @@ static struct r1conf *setup_conf(struct
 	if (!conf)
 		goto abort;
 
+	conf->nr_pending = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!conf->nr_pending)
+		goto abort;
+
+	conf->nr_waiting = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!conf->nr_waiting)
+		goto abort;
+
+	conf->nr_queued = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!conf->nr_queued)
+		goto abort;
+
+	conf->barrier = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!conf->barrier)
+		goto abort;
+
 	conf->mirrors = kzalloc(sizeof(struct raid1_info)
 				* mddev->raid_disks * 2,
 				 GFP_KERNEL);
@@ -2841,9 +2895,6 @@ static struct r1conf *setup_conf(struct
 	conf->pending_count = 0;
 	conf->recovery_disabled = mddev->recovery_disabled - 1;
 
-	conf->start_next_window = MaxSector;
-	conf->current_window_requests = conf->next_window_requests = 0;
-
 	err = -EIO;
 	for (i = 0; i < conf->raid_disks * 2; i++) {
 
@@ -2890,6 +2941,10 @@ static struct r1conf *setup_conf(struct
 		kfree(conf->mirrors);
 		safe_put_page(conf->tmppage);
 		kfree(conf->poolinfo);
+		kfree(conf->nr_pending);
+		kfree(conf->nr_waiting);
+		kfree(conf->nr_queued);
+		kfree(conf->barrier);
 		kfree(conf);
 	}
 	return ERR_PTR(err);
@@ -2992,6 +3047,10 @@ static void raid1_free(struct mddev *mdd
 	kfree(conf->mirrors);
 	safe_put_page(conf->tmppage);
 	kfree(conf->poolinfo);
+	kfree(conf->nr_pending);
+	kfree(conf->nr_waiting);
+	kfree(conf->nr_queued);
+	kfree(conf->barrier);
 	kfree(conf);
 }
 
Index: linux-raid1/drivers/md/raid1.h
===================================================================
--- linux-raid1.orig/drivers/md/raid1.h
+++ linux-raid1/drivers/md/raid1.h
@@ -1,6 +1,20 @@
 #ifndef _RAID1_H
 #define _RAID1_H
 
+/* each barrier unit size is 64MB fow now
+ * note: it must be larger than RESYNC_DEPTH
+ */
+#define BARRIER_UNIT_SECTOR_BITS	17
+#define BARRIER_UNIT_SECTOR_SIZE	(1<<17)
+#define BARRIER_BUCKETS_NR		(PAGE_SIZE/sizeof(long))
+
+/* will use bit shift later */
+static inline long get_barrier_bucket_idx(sector_t sector)
+{
+	return (long)(sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR;
+
+}
+
 struct raid1_info {
 	struct md_rdev	*rdev;
 	sector_t	head_position;
@@ -35,25 +49,6 @@ struct r1conf {
 						 */
 	int			raid_disks;
 
-	/* During resync, read_balancing is only allowed on the part
-	 * of the array that has been resynced.  'next_resync' tells us
-	 * where that is.
-	 */
-	sector_t		next_resync;
-
-	/* When raid1 starts resync, we divide array into four partitions
-	 * |---------|--------------|---------------------|-------------|
-	 *        next_resync   start_next_window       end_window
-	 * start_next_window = next_resync + NEXT_NORMALIO_DISTANCE
-	 * end_window = start_next_window + NEXT_NORMALIO_DISTANCE
-	 * current_window_requests means the count of normalIO between
-	 *   start_next_window and end_window.
-	 * next_window_requests means the count of normalIO after end_window.
-	 * */
-	sector_t		start_next_window;
-	int			current_window_requests;
-	int			next_window_requests;
-
 	spinlock_t		device_lock;
 
 	/* list of 'struct r1bio' that need to be processed by raid1d,
@@ -79,10 +74,10 @@ struct r1conf {
 	 */
 	wait_queue_head_t	wait_barrier;
 	spinlock_t		resync_lock;
-	int			nr_pending;
-	int			nr_waiting;
-	int			nr_queued;
-	int			barrier;
+	int			*nr_pending;
+	int			*nr_waiting;
+	int			*nr_queued;
+	int			*barrier;
 	int			array_frozen;
 
 	/* Set to 1 if a full sync is needed, (fresh device added).
@@ -135,7 +130,6 @@ struct r1bio {
 						 * in this BehindIO request
 						 */
 	sector_t		sector;
-	sector_t		start_next_window;
 	int			sectors;
 	unsigned long		state;
 	struct mddev		*mddev;

^ permalink raw reply

* [RFC PATCH 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
From: Coly Li @ 2016-11-21 21:54 UTC (permalink / raw)
  To: linux-raid
  Cc: Coly Li, Shaohua Li, Hannes Reinecke, Neil Brown,
	Johannes Thumshirn, Guoqing Jiang
In-Reply-To: <1479765241-15528-1-git-send-email-colyli@suse.de>

When I run a parallel reading performan testing on a md raid1 device with
two NVMe SSDs, I observe very bad throughput in supprise: by fio with 64KB
block size, 40 seq read I/O jobs, 128 iodepth, overall throughput is
only 2.7GB/s, this is around 50% of the idea performance number.

The perf reports locking contention happens at allow_barrier() and
wait_barrier() code,
 - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
	 + 89.92% allow_barrier
	 + 9.34% __wake_up
 - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
   - _raw_spin_lock_irq
	 - 100.00% wait_barrier 

The reason is, in these I/O barrier related functions,
 - raise_barrier()
 - lower_barrier()
 - wait_barrier()
 - allow_barrier()
They always hold conf->resync_lock firstly, even there are only regular
reading I/Os and no resync I/O at all. This is a huge performance penalty.

The solution is a lockless-like algorithm in I/O barrier code, and only
holding conf->resync_lock when it is really necessary.

The original idea is from Hannes Reinecke, and Neil Brown provides
comments to improve it. Now I write the patch based on new simpler raid1
I/O barrier code.

In the new simpler raid1 I/O barrier implementation, there are two
wait barrier functions,
 - wait_barrier()
   Which in turns calls _wait_barrier(), is used for regular write I/O.
   If there is resync I/O happening on the same barrier bucket index, or
   the whole array is frozen, task will wait untill no barrier on same
   bucket index, or the whold array is unfreezed.
 - wait_read_barrier()
   Since regular read I/O won't interfere with resync I/O (read_balance()
   will make sure only uptodate data will be read out), so it is
   unnecessary to wait for barrier in regular read I/Os, they only have to
   wait only when the whole array is frozen.
The operations on conf->nr_pending[idx], conf->nr_waiting[idx], conf->
barrier[idx] are very carefully designed in raise_barrier(),
lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
avoid unnecessary spin locks in these functions. Once conf->
nr_pengding[idx] is increased, a resync I/O with same barrier bucket index
has to wait in raise_barrier(). Then in _wait_barrier() or
wait_read_barrier() if no barrier raised in same barrier bucket index or
array is not frozen, the regular I/O doesn't need to hold conf->
resync_lock, it can just increase conf->nr_pending[idx], and return to its
caller. For heavy parallel reading I/Os, the lockless I/O barrier code
almostly gets rid of all spin lock cost.

This patch significantly improves raid1 reading peroformance. From my
testing, a raid1 device built by two NVMe SSD, runs fio with 64KB
blocksize, 40 seq read I/O jobs, 128 iodepth, overall throughput
increases from 2.7GB/s to 4.6GB/s (+70%).

Open question:
 - I am not comfortable with freeze_array() and unfreeze_array(), for
   writing I/Os if devices failed, wait_barrier() may have race with
   freeze_array(), I am still looking for a solution now.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/raid1.c | 125 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------------------------------
 drivers/md/raid1.h |  12 +++++------
 2 files changed, 75 insertions(+), 62 deletions(-)

Index: linux-raid1/drivers/md/raid1.c
===================================================================
--- linux-raid1.orig/drivers/md/raid1.c
+++ linux-raid1/drivers/md/raid1.c
@@ -222,8 +222,8 @@ static void reschedule_retry(struct r1bi
 
 	spin_lock_irqsave(&conf->device_lock, flags);
 	list_add(&r1_bio->retry_list, &conf->retry_list);
-	conf->nr_queued[idx]++;
 	spin_unlock_irqrestore(&conf->device_lock, flags);
+	atomic_inc(&conf->nr_queued[idx]);
 
 	wake_up(&conf->wait_barrier);
 	md_wakeup_thread(mddev->thread);
@@ -811,11 +811,12 @@ static void raise_barrier(struct r1conf
 	spin_lock_irq(&conf->resync_lock);
 
 	/* Wait until no block IO is waiting */
-	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
+	wait_event_lock_irq(conf->wait_barrier,
+			    !atomic_read(&conf->nr_waiting[idx]),
 			    conf->resync_lock);
 
 	/* block any new IO from starting */
-	conf->barrier[idx]++;
+	atomic_inc(&conf->barrier[idx]);
 
 	/* For these conditions we must wait:
 	 * A: while the array is in frozen state
@@ -826,23 +827,21 @@ static void raise_barrier(struct r1conf
 	 *    conrresponding to idx.
 	 */
 	wait_event_lock_irq(conf->wait_barrier,
-			    !conf->array_frozen && !conf->nr_pending[idx] &&
-			    conf->barrier[idx] < RESYNC_DEPTH,
+			    !atomic_read(&conf->array_frozen) &&
+			    !atomic_read(&conf->nr_pending[idx]) &&
+			    atomic_read(&conf->barrier[idx]) < RESYNC_DEPTH,
 			    conf->resync_lock);
-	conf->nr_pending[idx]++;
+	atomic_inc(&conf->nr_pending[idx]);
 	spin_unlock_irq(&conf->resync_lock);
 }
 
 static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
 {
-	unsigned long flags;
 	long idx = get_barrier_bucket_idx(sector_nr);
 
-	BUG_ON(conf->barrier[idx] <= 0);
-	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->barrier[idx]--;
-	conf->nr_pending[idx]--;
-	spin_unlock_irqrestore(&conf->resync_lock, flags);
+	BUG_ON(atomic_read(&conf->barrier[idx]) <= 0);
+	atomic_dec(&conf->barrier[idx]);
+	atomic_dec(&conf->nr_pending[idx]);
 	wake_up(&conf->wait_barrier);
 }
 
@@ -852,36 +851,55 @@ static void lower_barrier(struct r1conf
  */
 static void _wait_barrier(struct r1conf *conf, long idx)
 {
-	spin_lock_irq(&conf->resync_lock);
-	if (conf->array_frozen || conf->barrier[idx]) {
-		conf->nr_waiting[idx]++;
-		/* Wait for the barrier to drop. */
-		wait_event_lock_irq(
-			conf->wait_barrier,
-			!conf->array_frozen && !conf->barrier[idx],
-			conf->resync_lock);
-		conf->nr_waiting[idx]--;
-	}
+	/* We need to increase conf->nr_pending[idx] very early here,
+	 * then raise_barrier() can be blocked when it waits for
+	 * conf->nr_pending[idx] to be 0. Then we can avoid holding
+	 * conf->resync_lock when there is no barrier raised in same
+	 * barrier unit bucket.
+	 */
+	atomic_inc(&conf->nr_pending[idx]);
+	if (!atomic_read(&conf->barrier[idx]) &&
+	    !atomic_read(&conf->array_frozen))
+		return;
 
-	conf->nr_pending[idx]++;
+	/* After holding conf->resync_lock, conf->nr_pending[idx]
+	 * should be decreased before waiting for barrier to drop.
+	 * Otherwise, we may encounter a race condition because
+	 * raise_barrer() might be waiting for conf->nr_pending[idx]
+	 * to be 0 at same time.
+	 */
+	spin_lock_irq(&conf->resync_lock);
+	atomic_inc(&conf->nr_waiting[idx]);
+	atomic_dec(&conf->nr_pending[idx]);
+	/* Wait for the barrier in same barrier unit bucket to drop. */
+	wait_event_lock_irq(conf->wait_barrier,
+			    !atomic_read(&conf->array_frozen) &&
+			    !atomic_read(&conf->barrier[idx]),
+			    conf->resync_lock);
+	atomic_inc(&conf->nr_pending[idx]);
+	atomic_dec(&conf->nr_waiting[idx]);
 	spin_unlock_irq(&conf->resync_lock);
 }
 
+/* Very similar to _wait_barrier(), but only wait if array is frozen.
+ */
 static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
 {
 	long idx = get_barrier_bucket_idx(sector_nr);
 
+	atomic_inc(&conf->nr_pending[idx]);
+	if (!atomic_read(&conf->array_frozen))
+		return;
+
 	spin_lock_irq(&conf->resync_lock);
-	if (conf->array_frozen) {
-		conf->nr_waiting[idx]++;
-		/* Wait for array to unfreeze */
-		wait_event_lock_irq(
-			conf->wait_barrier,
-			!conf->array_frozen,
-			conf->resync_lock);
-		conf->nr_waiting[idx]--;
-	}
-	conf->nr_pending[idx]++;
+	/* Wait for array to unfreeze */
+	atomic_inc(&conf->nr_waiting[idx]);
+	atomic_dec(&conf->nr_pending[idx]);
+	wait_event_lock_irq(conf->wait_barrier,
+			    !atomic_read(&conf->array_frozen),
+			    conf->resync_lock);
+	atomic_dec(&conf->nr_waiting[idx]);
+	atomic_inc(&conf->nr_pending[idx]);
 	spin_unlock_irq(&conf->resync_lock);
 }
 
@@ -902,11 +920,7 @@ static void wait_all_barriers(struct r1c
 
 static void _allow_barrier(struct r1conf *conf, long idx)
 {
-	unsigned long flags;
-
-	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->nr_pending[idx]--;
-	spin_unlock_irqrestore(&conf->resync_lock, flags);
+	atomic_dec(&conf->nr_pending[idx]);
 	wake_up(&conf->wait_barrier);
 }
 
@@ -933,7 +947,7 @@ static int get_all_pendings(struct r1con
 	int ret;
 
 	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
-		ret += conf->nr_pending[idx];
+		ret += atomic_read(&conf->nr_pending[idx]);
 	return ret;
 }
 
@@ -944,7 +958,7 @@ static int get_all_queued(struct r1conf
 	int  ret;
 
 	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
-		ret += conf->nr_queued[idx];
+		ret += atomic_read(&conf->nr_queued[idx]);
 	return ret;
 }
 
@@ -965,7 +979,7 @@ static void freeze_array(struct r1conf *
 	 * of conf->nr_pending[]) before we continue.
 	 */
 	spin_lock_irq(&conf->resync_lock);
-	conf->array_frozen = 1;
+	atomic_set(&conf->array_frozen, 1);
 	wait_event_lock_irq_cmd(
 		conf->wait_barrier,
 		get_all_pendings(conf) == get_all_queued(conf)+extra,
@@ -977,7 +991,7 @@ static void unfreeze_array(struct r1conf
 {
 	/* reverse the effect of the freeze */
 	spin_lock_irq(&conf->resync_lock);
-	conf->array_frozen = 0;
+	atomic_set(&conf->array_frozen, 0);
 	wake_up(&conf->wait_barrier);
 	spin_unlock_irq(&conf->resync_lock);
 }
@@ -2295,12 +2309,16 @@ static void handle_write_finished(struct
 					 conf->mddev);
 		}
 	if (fail) {
+		/* set sector_nr before r1_bio add into conf->bio_end_io_list,
+		 * we can't touch r1_bio once it is in this list, because
+		 * it might be freed by raid_end_bio_io() in raid1d()
+		 */
+		sector_nr = r1_bio->sector;
 		spin_lock_irq(&conf->device_lock);
 		list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
-		sector_nr = r1_bio->sector;
-		idx = get_barrier_bucket_idx(sector_nr);
-		conf->nr_queued[idx]++;
 		spin_unlock_irq(&conf->device_lock);
+		idx = get_barrier_bucket_idx(sector_nr);
+		atomic_inc(&conf->nr_queued[idx]);
 		md_wakeup_thread(conf->mddev->thread);
 	} else {
 		if (test_bit(R1BIO_WriteError, &r1_bio->state))
@@ -2410,7 +2428,6 @@ static void raid1d(struct md_thread *thr
 	struct r1conf *conf = mddev->private;
 	struct list_head *head = &conf->retry_list;
 	struct blk_plug plug;
-	sector_t sector_nr;
 	long idx;
 
 	md_check_recovery(mddev);
@@ -2427,11 +2444,8 @@ static void raid1d(struct md_thread *thr
 			r1_bio = list_first_entry(&tmp, struct r1bio,
 						  retry_list);
 			list_del(&r1_bio->retry_list);
-			sector_nr = r1_bio->sector;
-			idx = get_barrier_bucket_idx(sector_nr);
-			spin_lock_irqsave(&conf->device_lock, flags);
-			conf->nr_queued[idx]--;
-			spin_unlock_irqrestore(&conf->device_lock, flags);
+			idx = get_barrier_bucket_idx(r1_bio->sector);
+			atomic_dec(&conf->nr_queued[idx]);
 			if (mddev->degraded)
 				set_bit(R1BIO_Degraded, &r1_bio->state);
 			if (test_bit(R1BIO_WriteError, &r1_bio->state))
@@ -2452,10 +2466,9 @@ static void raid1d(struct md_thread *thr
 		}
 		r1_bio = list_entry(head->prev, struct r1bio, retry_list);
 		list_del(head->prev);
-		sector_nr = r1_bio->sector;
-		idx = get_barrier_bucket_idx(sector_nr);
-		conf->nr_queued[idx]--;
 		spin_unlock_irqrestore(&conf->device_lock, flags);
+		idx = get_barrier_bucket_idx(r1_bio->sector);
+		atomic_dec(&conf->nr_queued[idx]);
 
 		mddev = r1_bio->mddev;
 		conf = mddev->private;
@@ -2571,7 +2584,7 @@ static sector_t raid1_sync_request(struc
 	 * If there is non-resync activity waiting for a turn, then let it
 	 * though before starting on this new sync request.
 	 */
-	if (conf->nr_waiting[idx])
+	if (atomic_read(&conf->nr_waiting[idx]))
 		schedule_timeout_uninterruptible(1);
 
 	/* we are incrementing sector_nr below. To be safe, we check against
@@ -3224,7 +3237,7 @@ static void *raid1_takeover(struct mddev
 		conf = setup_conf(mddev);
 		if (!IS_ERR(conf))
 			/* Array must appear to be quiesced */
-			conf->array_frozen = 1;
+			atomic_set(&conf->array_frozen, 1);
 		return conf;
 	}
 	return ERR_PTR(-EINVAL);
Index: linux-raid1/drivers/md/raid1.h
===================================================================
--- linux-raid1.orig/drivers/md/raid1.h
+++ linux-raid1/drivers/md/raid1.h
@@ -6,7 +6,7 @@
  */
 #define BARRIER_UNIT_SECTOR_BITS	17
 #define BARRIER_UNIT_SECTOR_SIZE	(1<<17)
-#define BARRIER_BUCKETS_NR		(PAGE_SIZE/sizeof(long))
+#define BARRIER_BUCKETS_NR		(PAGE_SIZE/sizeof(atomic_t))
 
 /* will use bit shift later */
 static inline long get_barrier_bucket_idx(sector_t sector)
@@ -74,11 +74,11 @@ struct r1conf {
 	 */
 	wait_queue_head_t	wait_barrier;
 	spinlock_t		resync_lock;
-	int			*nr_pending;
-	int			*nr_waiting;
-	int			*nr_queued;
-	int			*barrier;
-	int			array_frozen;
+	atomic_t		*nr_pending;
+	atomic_t		*nr_waiting;
+	atomic_t		*nr_queued;
+	atomic_t		*barrier;
+	atomic_t		array_frozen;
 
 	/* Set to 1 if a full sync is needed, (fresh device added).
 	 * Cleared when a sync completes.

^ permalink raw reply

* Re: linux raid wiki - backup files
From: NeilBrown @ 2016-11-21 22:45 UTC (permalink / raw)
  To: Anthony Youngman, Phil Turmel, linux-raid
In-Reply-To: <c3d16a30-b36e-3d26-f30a-441b236f7107@youngman.org.uk>

[-- Attachment #1: Type: text/plain, Size: 1633 bytes --]

On Tue, Nov 22 2016, Anthony Youngman wrote:

> On 21/11/16 14:07, Phil Turmel wrote:
>> On 11/20/2016 09:48 AM, Wols Lists wrote:
>>> On 20/11/16 00:27, Phil Turmel wrote:
>>>> Yes.  But the new stripes lay on top of the old stripes, unless you move
>>>> the data offset.  Which is why a backup file holds the old stripe just
>>>> in case.  If you can move the offset, you use the lower offset for the
>>>> lower addresses in the array, and the higher offset for the higher
>>>> addresses, on either side of the reshape position.
>>>
>>> Okay, understood. So v0.9 and v1.0 always need a backup for a reshape.
>
> Having looked at the man page, this now seems obvious - the superblock 
> is at the end, so the data offset is 0. But for a 1.0 array, could we 
> create a data offset?

Yes.  But usually the purpose for using 1.0 is to have data_offset == 0,
so you might not want to.

>
> (So, if we created a data offset, we could then move the superblock and 
> convert a 1.0 to 1.1 or 1.2? Okay, it can't do it now, but it looks to 
> me like it shouldn't be that hard ... ?)

It would be quite easy to extend "--update=metadata" to change the
version once the data_offset had been changed.

>>>
>>> But if we have a data offset with v1.2, a reshape will use that space if
>>> it can rather than needing a backup file?
>>
> I'm guessing that 1.0 and 1.1 defaulted to no data offset to speak of? 
> And if we (can) create a decent data offset, we can then use that in 
> exactly the same way as with v1.2?

1.0 defaults to no data_offset.
1.1 uses the safe choice function as 1.2.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* raid10.c read_balance question
From: Robert LeBlanc @ 2016-11-21 22:45 UTC (permalink / raw)
  To: linux-raid

I'm looking to improve 'near' layout performance in raid10 and looking
at the code for read_balance, the comments mention that there is a
'next expected sequential IO' sector number stored for the RAID array,
but I don't see such a variable in the struct, nor do I see in
read_balance where that variable is being updated. With such a value,
I could use it (with another variable) to guess if the current
requests are random or sequential enough to take the latency hit to
move drive heads to help parallel the read. This would be a much
simpler route than what I was thinking about previously.

Am I just missing this variable? I would expect it to be in the
r10conf struct, would that be the right place to implement it if
doesn't exist?

Thanks,
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

^ permalink raw reply

* Re: Newly-created arrays don't auto-assemble - related to hostname change?
From: NeilBrown @ 2016-11-21 22:56 UTC (permalink / raw)
  To: Andy Smith; +Cc: linux-raid
In-Reply-To: <20161121060256.GL1804@bitfolk.com>

[-- Attachment #1: Type: text/plain, Size: 1587 bytes --]

On Mon, Nov 21 2016, Andy Smith wrote:

> Hi Neil,
>
> On Mon, Nov 21, 2016 at 03:32:42PM +1100, NeilBrown wrote:
>> If you still want to get to the bottom of this, you might need to revert
>> your work-around, the try the "udevadm monitor" and "udevadm info" and "udevadm
>> trigger" while the array is not assembled.
>
> I have reverted my addition of "mpt3sas" from
> /etc/initramfs-tools/modules and rebooted, so that md5 is again not
> assembled.

Thanks.  Sorry this is taking a lot of back-and-forth...
Still getting
> E: ID_FS_TYPE=linux_raid_member
which is good.  Not getting and MD_*, which is bad.

I would:
 - check that md5 definitely isn't running (mdadm -S /dev/md5)
 - run mdadm -I just like udev does.

   /sbin/mdadm --incremental --export /dev/sdc --offroot /dev/disk/by-id/ata-SAMSUNG_MZ7KM1T9HAJM-00005_S2HNNAAH200633 /dev/disk/by-id/wwn-0x5002538c0007e7a8 /dev/disk/by-path/pci-0000:01:00.0-sas-0x4433221100000000-lun-0

(the string of paths is from the "DEVLINKS" field).

That *should* produce several lines like "MD_NAME=tbd:5" etc.
My guess is that it is producing an error.  Knowing that error message
would help.

If it doesn't produce an error, but does produce some MD_* lines, then
the problem must be that udev isn't doing quite the same thing.
So stop md5 again (mdadm -S /dev/md5), enable udev debugging
  udevadm control -l debug

and re-issue the 'change'
     echo change > /sys/block/sdc/uevent

That should puts lots of stuff in the journal.  If you could extract
that and post it I might be able to find something of interest.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: [md PATCH 0/5] Stop using bi_phys_segments as a counter
From: Shaohua Li @ 2016-11-21 23:43 UTC (permalink / raw)
  To: NeilBrown; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid
In-Reply-To: <147969099621.5434.12384452255155063186.stgit@noble>

On Mon, Nov 21, 2016 at 12:19:43PM +1100, Neil Brown wrote:
> There are 2 problems with using bi_phys_segments as a counter
> 1/ we only use 16bits, which limits bios to 256M
> 2/ it is poor form to reuse a field like this.  It interferes
>    with other changes to bios.
> 
> We need to clean up a few things before we can change the use the
> counter which is now available inside a bio.
> 
> I have only tested this lightly.  More review and testing would be
> appreciated.

So without the accounting, we:
- don't do bio completion trace
- call md_write_start/md_write_end excessively, which involves atomic operation.

Not big problems. But we are actually reusing __bi_remaining, I'm wondering why
we not explicitly reuse it. Eg, adds bio_dec_remaining_return() and uses it
like raid5_dec_bi_active_stripes.

Thanks,
Shaohua

^ permalink raw reply

* Re: Fwd: Re: mdadm I/O error with Ddf RAID
From: NeilBrown @ 2016-11-21 23:54 UTC (permalink / raw)
  To: Arka Sharma, linux-raid
In-Reply-To: <CAPO=kN3pSizni=e3N3zxktSjQWsRL7T_GwZJdUUyKzCjM-0MWw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2059 bytes --]

On Tue, Nov 22 2016, Arka Sharma wrote:

> ---------- Forwarded message ----------
> From: "Arka Sharma" <arka.sw1988@gmail.com>
> Date: 21 Nov 2016 12:57 p.m.
> Subject: Re: mdadm I/O error with Ddf RAID
> To: "NeilBrown" <neilb@suse.com>
> Cc: <linux-raid@vger.kernel.org>
>
> I have run mdadm --examine on both the component devices as well as on
> the container. This shows that one of the component disk is marked as
> offline and status is failed. When I run mdadm --detail on the RAID
> device it shows the component disk 0 state as removed. Since I am very
> much new to md and linux in general I am not able to fully root cause
> this issue. I have made couple of observation though, that before the
> invalid sector 18446744073709551615 is sent, the sector 1000182866 is
> accessed after which mdraid reports as not clean starts background
> reconstruction. I read the LBA 1000182866 and this block contains FF.
> So is md expecting something in the metadata we are not populating ?
> Please find the attached md127.txt which is the output of the mdadm
> --examine <container>, blk-core_diff.txt which contains the printk's
> and dmesg.txt, also DDF_Header0.txt and DDF_Header1.txt are the dump
> of ddf headers for both the disks.

Thanks for providing more details.

Sector 1000182866 is 256 sectors into the config section.
It starts reading the config section at 1000182610 and gets 256 sectors,
so it reads the rest from 1000182866 and then starts the array.

My guess is that md is getting confused about resync and recovery.
It tries a resync, but as the array appears degraded, this code:
		if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
			j = mddev->resync_min;
		else if (!mddev->bitmap)
			j = mddev->recovery_cp;

in md_do_sync() sets 'j' to MaxSector, which is effectively "-1".  It
then starts resync from there and goes crazy.  You could put a printk in
there to confirm.

I don't know why.  Something about the config makes mdadm think the
array is degraded.  I might try to find time to dig into it again later.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: [md PATCH 0/5] Stop using bi_phys_segments as a counter
From: NeilBrown @ 2016-11-22  0:25 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid
In-Reply-To: <20161121234311.6qhwa2g3oa4uhcbi@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 1385 bytes --]

On Tue, Nov 22 2016, Shaohua Li wrote:

> On Mon, Nov 21, 2016 at 12:19:43PM +1100, Neil Brown wrote:
>> There are 2 problems with using bi_phys_segments as a counter
>> 1/ we only use 16bits, which limits bios to 256M
>> 2/ it is poor form to reuse a field like this.  It interferes
>>    with other changes to bios.
>> 
>> We need to clean up a few things before we can change the use the
>> counter which is now available inside a bio.
>> 
>> I have only tested this lightly.  More review and testing would be
>> appreciated.
>
> So without the accounting, we:
> - don't do bio completion trace

Yes, but hopefully that will be added back to bio_endio() soon.

> - call md_write_start/md_write_end excessively, which involves atomic operation.

raid5_inc_bio_active_stripes() did an atomic operation.  I don't think
there is a net increase in the number of atomic operations.

>
> Not big problems. But we are actually reusing __bi_remaining, I'm wondering why
> we not explicitly reuse it. Eg, adds bio_dec_remaining_return() and uses it
> like raid5_dec_bi_active_stripes.

Because using it exactly the same way that other places use it leads to
fewer surprises, now or later.
And I think that the effort to rearrange the code so that we could just
call bio_endio() brought real improvements in code clarity and
simplicity.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: [PATCH 07/12] dm: use bvec iterator helpers to implement .get_page and .next_page
From: Ming Lei @ 2016-11-22  0:26 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, Linux Kernel Mailing List, linux-block,
	Linux FS Devel, Christoph Hellwig, Alasdair Kergon,
	maintainer:DEVICE-MAPPER (LVM), Shaohua Li,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT
In-Reply-To: <20161121144956.GB23051@redhat.com>

On Mon, Nov 21, 2016 at 10:49 PM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Fri, Nov 11 2016 at  7:05am -0500,
> Ming Lei <tom.leiming@gmail.com> wrote:
>
>> Firstly we have mature bvec/bio iterator helper for iterate each
>> page in one bio, not necessary to reinvent a wheel to do that.
>>
>> Secondly the coming multipage bvecs requires this patch.
>>
>> Also add comments about the direct access to bvec table.
>>
>> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
>
> I've staged this for 4.10

Thanks Mike!

^ permalink raw reply

* Re: [md PATCH 0/5] Stop using bi_phys_segments as a counter
From: Shaohua Li @ 2016-11-22  1:02 UTC (permalink / raw)
  To: NeilBrown; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid
In-Reply-To: <87mvgscqe7.fsf@notabene.neil.brown.name>

On Tue, Nov 22, 2016 at 11:25:04AM +1100, Neil Brown wrote:
> On Tue, Nov 22 2016, Shaohua Li wrote:
> 
> > On Mon, Nov 21, 2016 at 12:19:43PM +1100, Neil Brown wrote:
> >> There are 2 problems with using bi_phys_segments as a counter
> >> 1/ we only use 16bits, which limits bios to 256M
> >> 2/ it is poor form to reuse a field like this.  It interferes
> >>    with other changes to bios.
> >> 
> >> We need to clean up a few things before we can change the use the
> >> counter which is now available inside a bio.
> >> 
> >> I have only tested this lightly.  More review and testing would be
> >> appreciated.
> >
> > So without the accounting, we:
> > - don't do bio completion trace
> 
> Yes, but hopefully that will be added back to bio_endio() soon.
> 
> > - call md_write_start/md_write_end excessively, which involves atomic operation.
> 
> raid5_inc_bio_active_stripes() did an atomic operation.  I don't think
> there is a net increase in the number of atomic operations.

That's different. md_write_start/end uses a global atomic.
raid5_inc_bio_active_stripes uses a bio atomic. So we have more cache bouncing now.
 
> >
> > Not big problems. But we are actually reusing __bi_remaining, I'm wondering why
> > we not explicitly reuse it. Eg, adds bio_dec_remaining_return() and uses it
> > like raid5_dec_bi_active_stripes.
> 
> Because using it exactly the same way that other places use it leads to
> fewer surprises, now or later.
> And I think that the effort to rearrange the code so that we could just
> call bio_endio() brought real improvements in code clarity and
> simplicity.

Not the same way. The return_bi list and retry list fix are still good. We can
replace the bio_endio in your patch with something like:
if (bio_dec_remaining_return() == 0) {
	trace_block_bio_complete()
	md_write_end()
	bio_endio();
}
This will give us better control when to end io.

Thanks,
Shaohua

^ permalink raw reply

* [PATCH v2] md/r5cache: handle alloc_page failure
From: Song Liu @ 2016-11-22  1:31 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu

RMW of r5c write back cache uses an extra page to store old data for
prexor. handle_stripe_dirtying() allocates this page by calling
alloc_page(). However, alloc_page() may fail.

To handle alloc_page() failures, this patch adds a small mempool
in r5l_log. When alloc_page fails, handle_stripe() trys to use pages
from the mempool. When the mempool is in use, the stripe is added
to delayed_list.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c       | 47 ++++++++++++++++++++++++++++++++++++++---------
 drivers/md/raid5.h       |  8 ++++++++
 3 files changed, 89 insertions(+), 9 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 8cb79fc..9406c39 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -162,6 +162,9 @@ struct r5l_log {
 
 	/* to submit async io_units, to fulfill ordering of flush */
 	struct work_struct deferred_io_work;
+
+	/* to handle alloc_page() failures in handle_stripe_dirtying() */
+	mempool_t *extra_page_pool;
 };
 
 /*
@@ -2334,15 +2337,48 @@ int r5c_try_caching_write(struct r5conf *conf,
  */
 void r5c_release_extra_page(struct stripe_head *sh)
 {
+	struct r5conf *conf = sh->raid_conf;
+	struct r5l_log *log = conf->log;
 	int i;
+	bool using_extra_page_pool;
+
+	using_extra_page_pool = test_and_clear_bit(
+		STRIPE_R5C_EXTRA_PAGE, &sh->state);
 
 	for (i = sh->disks; i--; )
 		if (sh->dev[i].page != sh->dev[i].orig_page) {
 			struct page *p = sh->dev[i].orig_page;
 
 			sh->dev[i].orig_page = sh->dev[i].page;
+			if (using_extra_page_pool)
+				mempool_free(p, log->extra_page_pool);
+			else
+				put_page(p);
+		}
+
+	if (using_extra_page_pool)
+		clear_bit(R5C_EXTRA_PAGE_IN_USE, &conf->cache_state);
+}
+
+void r5c_use_extra_page_pool(struct stripe_head *sh)
+{
+	struct r5l_log *log = sh->raid_conf->log;
+	int i;
+	struct r5dev *dev;
+	struct page *p;
+
+	for (i = sh->disks; i--; ) {
+		dev = &sh->dev[i];
+		if (dev->orig_page != dev->page) {
+			p = dev->orig_page;
+			dev->orig_page = dev->page;
 			put_page(p);
 		}
+		dev->orig_page = mempool_alloc(log->extra_page_pool,
+					       GFP_ATOMIC);
+		BUG_ON(!dev->orig_page);
+	}
+	set_bit(STRIPE_R5C_EXTRA_PAGE, &sh->state);
 }
 
 /*
@@ -2581,6 +2617,10 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	if (!log->meta_pool)
 		goto out_mempool;
 
+	log->extra_page_pool = mempool_create_page_pool(conf->raid_disks, 0);
+	if (!log->extra_page_pool)
+		goto extra_page_pool;
+
 	log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
 						 log->rdev->mddev, "reclaim");
 	if (!log->reclaim_thread)
@@ -2611,6 +2651,8 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 error:
 	md_unregister_thread(&log->reclaim_thread);
 reclaim_thread:
+	mempool_destroy(log->extra_page_pool);
+extra_page_pool:
 	mempool_destroy(log->meta_pool);
 out_mempool:
 	bioset_free(log->bs);
@@ -2626,6 +2668,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 void r5l_exit_log(struct r5l_log *log)
 {
 	md_unregister_thread(&log->reclaim_thread);
+	mempool_destroy(log->extra_page_pool);
 	mempool_destroy(log->meta_pool);
 	bioset_free(log->bs);
 	mempool_destroy(log->io_pool);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index dbab8c7..9d20849 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -876,6 +876,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 
 	if (!test_bit(STRIPE_R5C_CACHING, &sh->state)) {
 		/* writing out phase */
+		if (s->waiting_extra_page)
+			return;
 		if (r5l_write_stripe(conf->log, sh) == 0)
 			return;
 	} else {  /* caching phase */
@@ -2007,6 +2009,7 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
 		INIT_LIST_HEAD(&sh->batch_list);
 		INIT_LIST_HEAD(&sh->lru);
 		INIT_LIST_HEAD(&sh->r5c);
+		INIT_LIST_HEAD(&sh->log_list);
 		atomic_set(&sh->count, 1);
 		sh->log_start = MaxSector;
 		for (i = 0; i < disks; i++) {
@@ -3580,10 +3583,10 @@ static void handle_stripe_clean_event(struct r5conf *conf,
 		break_stripe_batch_list(head_sh, STRIPE_EXPAND_SYNC_FLAGS);
 }
 
-static void handle_stripe_dirtying(struct r5conf *conf,
-				   struct stripe_head *sh,
-				   struct stripe_head_state *s,
-				   int disks)
+static int handle_stripe_dirtying(struct r5conf *conf,
+				  struct stripe_head *sh,
+				  struct stripe_head_state *s,
+				  int disks)
 {
 	int rmw = 0, rcw = 0, i;
 	sector_t recovery_cp = conf->mddev->recovery_cp;
@@ -3649,12 +3652,32 @@ static void handle_stripe_dirtying(struct r5conf *conf,
 			    dev->page == dev->orig_page &&
 			    !test_bit(R5_LOCKED, &sh->dev[sh->pd_idx].flags)) {
 				/* alloc page for prexor */
-				dev->orig_page = alloc_page(GFP_NOIO);
+				struct page *p = alloc_page(GFP_NOIO);
+
+				if (p) {
+					dev->orig_page = p;
+					continue;
+				}
 
-				/* will handle failure in a later patch*/
-				BUG_ON(!dev->orig_page);
+				/*
+				 * alloc_page() failed, try use
+				 * log->extra_page_pool
+				 */
+				if (!test_and_set_bit(R5C_EXTRA_PAGE_IN_USE,
+						      &conf->cache_state)) {
+					r5c_use_extra_page_pool(sh);
+					break;
+				}
+
+				/* extra_page_pool busy, add to delayed_list */
+				set_bit(STRIPE_DELAYED, &sh->state);
+				s->waiting_extra_page = 1;
+				return -EAGAIN;
 			}
+		}
 
+		for (i = disks; i--; ) {
+			struct r5dev *dev = &sh->dev[i];
 			if ((dev->towrite ||
 			     i == sh->pd_idx || i == sh->qd_idx ||
 			     test_bit(R5_InJournal, &dev->flags)) &&
@@ -3730,6 +3753,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
 	    (s->locked == 0 && (rcw == 0 || rmw == 0) &&
 	     !test_bit(STRIPE_BIT_DELAY, &sh->state)))
 		schedule_reconstruction(sh, s, rcw == 0, 0);
+	return 0;
 }
 
 static void handle_parity_checks5(struct r5conf *conf, struct stripe_head *sh,
@@ -4545,8 +4569,12 @@ static void handle_stripe(struct stripe_head *sh)
 			if (ret == -EAGAIN ||
 			    /* stripe under reclaim: !caching && injournal */
 			    (!test_bit(STRIPE_R5C_CACHING, &sh->state) &&
-			     s.injournal > 0))
-				handle_stripe_dirtying(conf, sh, &s, disks);
+			     s.injournal > 0)) {
+				ret = handle_stripe_dirtying(conf, sh, &s,
+							     disks);
+				if (ret == -EAGAIN)
+					goto finish;
+			}
 		}
 	}
 
@@ -6612,6 +6640,7 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 
 	conf->disks = kzalloc(max_disks * sizeof(struct disk_info),
 			      GFP_KERNEL);
+
 	if (!conf->disks)
 		goto abort;
 
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index d13fe45..fad19eb 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -276,6 +276,7 @@ struct stripe_head_state {
 	struct md_rdev *blocked_rdev;
 	int handle_bad_blocks;
 	int log_failed;
+	int waiting_extra_page;
 };
 
 /* Flags for struct r5dev.flags */
@@ -377,6 +378,9 @@ enum {
 				 * in conf->r5c_full_stripe_list)
 				 */
 	STRIPE_R5C_PREFLUSH,	/* need to flush journal device */
+	STRIPE_R5C_EXTRA_PAGE,	/* extra orig_page of this stripe is allocated
+				 * from extra_page_pool
+				 */
 };
 
 #define STRIPE_EXPAND_SYNC_FLAGS \
@@ -559,6 +563,9 @@ enum r5_cache_state {
 				 * only process stripes that are already
 				 * occupying the log
 				 */
+	R5C_EXTRA_PAGE_IN_USE,	/* a stripe is using the pages in
+				 * r5l_log.extra_page_pool for prexor
+				 */
 };
 
 struct r5conf {
@@ -765,6 +772,7 @@ extern void
 r5c_finish_stripe_write_out(struct r5conf *conf, struct stripe_head *sh,
 			    struct stripe_head_state *s);
 extern void r5c_release_extra_page(struct stripe_head *sh);
+extern void r5c_use_extra_page_pool(struct stripe_head *sh);
 extern void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
 extern void r5c_handle_cached_data_endio(struct r5conf *conf,
 	struct stripe_head *sh, int disks, struct bio_list *return_bi);
-- 
2.9.3


^ permalink raw reply related

* Re: [PATCH/RFC] add "failfast" support for raid1/raid10.
From: Shaohua Li @ 2016-11-22  2:02 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, linux-block, Christoph Hellwig, linux-kernel, hare
In-Reply-To: <147944614789.3302.1959091446949640579.stgit@noble>

On Fri, Nov 18, 2016 at 04:16:11PM +1100, Neil Brown wrote:
> Hi,
> 
>  I've been sitting on these patches for a while because although they
>  solve a real problem, it is a fairly limited use-case, and I don't
>  really like some of the details.
> 
>  So I'm posting them as RFC in the hope that a different perspective
>  might help me like them better, or find a better approach.
> 
>  The core idea is that when you have multiple copies of data
>  (i.e. mirrored drives) it doesn't make sense to wait for a read from
>  a drive that seems to be having problems.  It will probably be faster
>  to just cancel that read, and read from the other device.
>  Similarly, in some circumstances, it might be better to fail a drive
>  that is being slow to respond to writes, rather than cause all writes
>  to be very slow.
> 
>  The particular context where this comes up is when mirroring across
>  storage arrays, where the storage arrays can temporarily take an
>  unusually long time to respond to requests (firmware updates have
>  been mentioned).  As the array will have redundancy internally, there
>  is little risk to the data.  The mirrored pair is really only for
>  disaster recovery, and it is deemed better to lose the last few
>  minutes of updates in the case of a serious disaster, rather than
>  occasionally having latency issues because one array needs to do some
>  maintenance for a few minutes.  The particular storage arrays in
>  question are DASD devices which are part of the s390 ecosystem.
> 
>  Linux block layer has "failfast" flags to direct drivers to fail more
>  quickly.  These patches allow devices in an md array to be given a
>  "failfast" flag, which will cause IO requests to be marked as
>  "failfast" providing there is another device available.  Once the
>  array becomes degraded, we stop using failfast, as that could result
>  in data loss.
> 
>  I don't like the whole "failfast" concept because it is not at all
>  clear how fast "fast" is.  In fact, these block-layer flags are
>  really a misnomer.  They should be "noretry" flags.
>  REQ_FAILFAST_DEV means "don't retry requests which reported an error
>  which seems to come from the device.
>  REQ_FAILFAST_TRANSPORT means "don't retry requests which seem to
>  indicate a problem with the transport, rather than the device"
>  REQ_FAILFAST_DRIVER means  .... I'm not exactly sure.  I think it
>  means whatever a particular driver wants it to mean, basically "I
>  cannot seem to handle this right now, just resend and I'll probably
>  be more in control next time".  It seems to be for internal-use only.
> 
>  Multipath code uses REQ_FAILFAST_TRANSPORT only, which makes sense.
>  btrfs uses REQ_FAILFAST_DEV only (for read-ahead) which doesn't seem
>  to make sense.... why would you ever use _DEV without _TRANSPORT?
> 
>  None of these actually change the timeouts in the driver or in the
>  device, which is what I would expect for "failfast", so to get real
>  "fast failure" you need to enable failfast, and adjust the timeouts.
>  That is what we do for our customers with DASD.
> 
>  Anyway, it seems to make sense to use _TRANSPORT and _DEV for
>  requests from md where there is somewhere to fall-back on.
>  If we get an error from a "failfast" request, and the array is still
>  non-degraded, we just fail the device.  We don't try to repair read
>  errors (which is pointless on storage arrays).
> 
>  It is assumed that some user-space code will notice the failure,
>  monitor the device to see when it becomes available again, and then
>  --re-add it.  Assuming the array has a bitmap, the --re-add should be
>  fast and the array will become optimal again without experiencing
>  excessive latencies.
> 
>  My two main concerns are:
>   - does this functionality have any use-case outside of mirrored
>     storage arrays, and are there other storage arrays which
>     occasionally inserted excessive latency (seems like a serious
>     misfeature to me, but I know few of the details)?
>   - would it be at all possible to have "real" failfast functionality
>     in the block layer?  I.e. something that is based on time rather
>     than retry count.  Maybe in some cases a retry would be
>     appropriate if the first failure was very fast.
>     I.e. it would reduce timeouts and decide on retries based on
>     elapsed time rather than number of attempts.
>     With this would come the question of "how fast is fast" and I
>     don't have a really good answer.  Maybe md would need to set a
>     timeout, which it would double whenever it got failures on all
>     drives.  Otherwise the timeout would drift towards (say) 10 times
>     the typical response time.
> 
>  So: comments most welcome.  As I say, this does address a genuine
>  need.  Just find it hard to like it :-(

Patches looks good. As long as these don't break normal raid array (while they
don't if the superblock bit isn't set), I have no objection to apply the
patches even they are for special usage. I'll add the series to the next tree.

Just curious: will the FAILFAST increase the chance IO failure?

Thanks,
Shaohua

^ permalink raw reply

* Re: [md PATCH 0/5] Stop using bi_phys_segments as a counter
From: NeilBrown @ 2016-11-22  2:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid
In-Reply-To: <20161122010220.dcq6brjhsliw4io6@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 3534 bytes --]

On Tue, Nov 22 2016, Shaohua Li wrote:

> On Tue, Nov 22, 2016 at 11:25:04AM +1100, Neil Brown wrote:
>> On Tue, Nov 22 2016, Shaohua Li wrote:
>> 
>> > On Mon, Nov 21, 2016 at 12:19:43PM +1100, Neil Brown wrote:
>> >> There are 2 problems with using bi_phys_segments as a counter
>> >> 1/ we only use 16bits, which limits bios to 256M
>> >> 2/ it is poor form to reuse a field like this.  It interferes
>> >>    with other changes to bios.
>> >> 
>> >> We need to clean up a few things before we can change the use the
>> >> counter which is now available inside a bio.
>> >> 
>> >> I have only tested this lightly.  More review and testing would be
>> >> appreciated.
>> >
>> > So without the accounting, we:
>> > - don't do bio completion trace
>> 
>> Yes, but hopefully that will be added back to bio_endio() soon.
>> 
>> > - call md_write_start/md_write_end excessively, which involves atomic operation.
>> 
>> raid5_inc_bio_active_stripes() did an atomic operation.  I don't think
>> there is a net increase in the number of atomic operations.
>
> That's different. md_write_start/end uses a global atomic.
> raid5_inc_bio_active_stripes uses a bio atomic. So we have more cache bouncing now.

Maybe.
Most md_write_start() calls are made in the context of
raid5_make_request().
We could
 - call md_write_start() once at the start
 - count how many times we want to call it in a variable local to
   raid5_make_request()
 - atomically add that to the counter at the end.

Similarly mode md_write_end() requests are in the context of raid5d.  It
could maintain local counter and apply them all in a single update
before it sleeps.

It would be a little messy, but not too horrible I think.

>  
>> >
>> > Not big problems. But we are actually reusing __bi_remaining, I'm wondering why
>> > we not explicitly reuse it. Eg, adds bio_dec_remaining_return() and uses it
>> > like raid5_dec_bi_active_stripes.
>> 
>> Because using it exactly the same way that other places use it leads to
>> fewer surprises, now or later.
>> And I think that the effort to rearrange the code so that we could just
>> call bio_endio() brought real improvements in code clarity and
>> simplicity.
>
> Not the same way. The return_bi list and retry list fix are still good. We can
> replace the bio_endio in your patch with something like:
> if (bio_dec_remaining_return() == 0) {
> 	trace_block_bio_complete()
> 	md_write_end()
> 	bio_endio();
> }
> This will give us better control when to end io.

This isn't safe.  The bio arriving at raid5_make_request() might already
have been split and could be chained.  Then raid5 might never see
bio_dec_remaining_return() return zero.

For example, suppose there is a RAID0 make of some other device, and
this RAID5.
A write request arrives which crosses a chunk boundary.
raid0.c calls bio_split to split off a new bio that will fit in the other
device, leaving the original bio with a larger bi_sector which will get
mapped only into the raid5.
The split bio is chained into the original bio, elevating its
__bi_remaining count.
If the other device is particularly slow, or the RAID5 is particularly
fast, the RAID5 IO might complete before the split bio completes, so
raid5 will only see __bi_remaining go down to one, not zero.
When the split bio finally completes, it's bi_endio is
bio_chain_endio(), and that will call the final bio_endio() on the
original bio.  md_write_end() would then never be called.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: raid10.c read_balance question
From: NeilBrown @ 2016-11-22  4:49 UTC (permalink / raw)
  To: Robert LeBlanc, linux-raid
In-Reply-To: <CAANLjFodaUUZrROncC+uQjWX5pXVxgJ0HWkNunwugQQrp4Kojw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1172 bytes --]

On Tue, Nov 22 2016, Robert LeBlanc wrote:

> I'm looking to improve 'near' layout performance in raid10 and looking
> at the code for read_balance, the comments mention that there is a
> 'next expected sequential IO' sector number stored for the RAID array,
> but I don't see such a variable in the struct, nor do I see in
> read_balance where that variable is being updated. With such a value,
> I could use it (with another variable) to guess if the current
> requests are random or sequential enough to take the latency hit to
> move drive heads to help parallel the read. This would be a much
> simpler route than what I was thinking about previously.
>
> Am I just missing this variable? I would expect it to be in the
> r10conf struct, would that be the right place to implement it if
> doesn't exist?

I think it is "head_position".

NeilBrown

>
> Thanks,
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: Newly-created arrays don't auto-assemble - related to hostname change?
From: Andy Smith @ 2016-11-22  6:01 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <87shqkcuhv.fsf@notabene.neil.brown.name>

Hi Neil,

On Tue, Nov 22, 2016 at 09:56:28AM +1100, NeilBrown wrote:
> Thanks.  Sorry this is taking a lot of back-and-forth...

No worries. This is very interesting to me and I'd also like to know
what is going wrong even if I have a work-around.

> I would:
>  - check that md5 definitely isn't running (mdadm -S /dev/md5)
>  - run mdadm -I just like udev does.
> 
>    /sbin/mdadm --incremental --export /dev/sdc --offroot /dev/disk/by-id/ata-SAMSUNG_MZ7KM1T9HAJM-00005_S2HNNAAH200633 /dev/disk/by-id/wwn-0x5002538c0007e7a8 /dev/disk/by-path/pci-0000:01:00.0-sas-0x4433221100000000-lun-0
> 
> (the string of paths is from the "DEVLINKS" field).
> 
> That *should* produce several lines like "MD_NAME=tbd:5" etc.

I got:

$ sudo /sbin/mdadm --incremental --export /dev/sdc --offroot /dev/disk/by-id/ata-SAMSUNG_MZ7KM1T9HAJM-00005_S2HNNAAH200633 /dev/disk/by-id/wwn-0x5002538c0007e7a8 /dev/disk/by-path/pci-0000:01:00.0-sas-0x4433221100000000-lun-0
MD_DEVICE=md5
MD_DEVNAME=5
MD_FOREIGN=no
MD_STARTED=unsafe

Over in the "udevadm monitor" window I see:

KERNEL[86682.101420] add      /devices/virtual/bdi/9:5 (bdi)
KERNEL[86682.101866] add      /devices/virtual/block/md5 (block)
UDEV  [86682.102211] add      /devices/virtual/bdi/9:5 (bdi)
UDEV  [86682.103376] add      /devices/virtual/block/md5 (block)

and cat /proc/mdstat:

Personalities : [raid1] [raid10]
md5 : inactive sdc[1](S)
      1875243352 blocks super 1.2
…etc…

> If it doesn't produce an error, but does produce some MD_* lines, then
> the problem must be that udev isn't doing quite the same thing.
> So stop md5 again (mdadm -S /dev/md5), enable udev debugging
>   udevadm control -l debug
> 
> and re-issue the 'change'
>      echo change > /sys/block/sdc/uevent
> 
> That should puts lots of stuff in the journal.  If you could extract
> that and post it I might be able to find something of interest.

Nov 22 05:50:15 jfd systemd-udevd[218]: validate module index
Nov 22 05:50:15 jfd systemd-udevd[218]: Check if link configuration needs reloading.
Nov 22 05:50:15 jfd systemd-udevd[218]: seq 5382 queued, 'change' 'block'
Nov 22 05:50:15 jfd systemd-udevd[218]: seq 5382 forked new worker [29004]
Nov 22 05:50:15 jfd systemd-udevd[29004]: seq 5382 running
Nov 22 05:50:15 jfd systemd-udevd[29004]: device 0x7f26dc1cf950 filled with db file data
Nov 22 05:50:15 jfd systemd-udevd[29004]: removing watch on '/dev/sdc'
Nov 22 05:50:15 jfd systemd-udevd[29004]: device 0x7f26dc1d1a50 has devpath '/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host10/port-10:0/end_device-10:0/target10:0:0/10:0:0:0'
Nov 22 05:50:15 jfd systemd-udevd[29004]: IMPORT 'ata_id --export /dev/sdc' /lib/udev/rules.d/60-persistent-storage.rules:31
Nov 22 05:50:15 jfd systemd-udevd[29005]: starting 'ata_id --export /dev/sdc'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_ATA=1'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_TYPE=disk'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_BUS=ata'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_MODEL=SAMSUNG_MZ7KM1T9HAJM-00005'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_MODEL_ENC=SAMSUNG\x20MZ7KM1T9HAJM-00005\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_REVISION=GXM1003Q'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_SERIAL=SAMSUNG_MZ7KM1T9HAJM-00005_S2HNNAAH200633'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_SERIAL_SHORT=S2HNNAAH200633'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_ATA_WRITE_CACHE=1'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_ATA_WRITE_CACHE_ENABLED=1'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_ATA_FEATURE_SET_HPA=1'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_ATA_FEATURE_SET_HPA_ENABLED=1'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_ATA_FEATURE_SET_PM=1'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_ATA_FEATURE_SET_PM_ENABLED=1'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_ATA_FEATURE_SET_SECURITY=1'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_ATA_FEATURE_SET_SECURITY_ENABLED=0'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_ATA_FEATURE_SET_SECURITY_ERASE_UNIT_MIN=32'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_ATA_FEATURE_SET_SECURITY_ENHANCED_ERASE_UNIT_MIN=32'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_ATA_FEATURE_SET_SMART=1'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_ATA_FEATURE_SET_SMART_ENABLED=1'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_ATA_DOWNLOAD_MICROCODE=1'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_ATA_SATA=1'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_ATA_SATA_SIGNAL_RATE_GEN2=1'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_ATA_SATA_SIGNAL_RATE_GEN1=1'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_ATA_ROTATION_RATE_RPM=0'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_WWN=0x5002538c0007e7a8'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc'(out) 'ID_WWN_WITH_EXTENSION=0x5002538c0007e7a8'
Nov 22 05:50:15 jfd systemd-udevd[29004]: 'ata_id --export /dev/sdc' [29005] exit with return code 0
Nov 22 05:50:15 jfd systemd-udevd[29004]: LINK 'disk/by-id/ata-SAMSUNG_MZ7KM1T9HAJM-00005_S2HNNAAH200633' /lib/udev/rules.d/60-persistent-storage.rules:43
Nov 22 05:50:15 jfd systemd-udevd[29004]: device 0x7f26dc1dde20 has devpath '/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host10/port-10:0/end_device-10:0/target10:0:0'
Nov 22 05:50:15 jfd systemd-udevd[29004]: device 0x7f26dc1de650 has devpath '/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host10/port-10:0/end_device-10:0'
Nov 22 05:50:15 jfd systemd-udevd[29004]: device 0x7f26dc1dee60 has devpath '/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host10/port-10:0'
Nov 22 05:50:15 jfd systemd-udevd[29004]: device 0x7f26dc1df650 has devpath '/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host10'
Nov 22 05:50:15 jfd systemd-udevd[29004]: device 0x7f26dc1dfe30 has devpath '/devices/pci0000:00/0000:00:01.0/0000:01:00.0'
Nov 22 05:50:15 jfd systemd-udevd[29004]: device 0x7f26dc1e0600 has devpath '/devices/pci0000:00/0000:00:01.0'
Nov 22 05:50:15 jfd systemd-udevd[29004]: device 0x7f26dc1e0db0 has devpath '/devices/pci0000:00'
Nov 22 05:50:15 jfd systemd-udevd[29004]: IMPORT builtin 'path_id' /lib/udev/rules.d/60-persistent-storage.rules:56
Nov 22 05:50:15 jfd systemd-udevd[29004]: device 0x7f26dc1e17c0 has devpath '/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host10/port-10:0/end_device-10:0/sas_device/end_device-10:0'
Nov 22 05:50:15 jfd systemd-udevd[29004]: LINK 'disk/by-path/pci-0000:01:00.0-sas-0x4433221100000000-lun-0' /lib/udev/rules.d/60-persistent-storage.rules:57
Nov 22 05:50:15 jfd systemd-udevd[29004]: IMPORT builtin 'blkid' /lib/udev/rules.d/60-persistent-storage.rules:71
Nov 22 05:50:15 jfd systemd-udevd[29004]: probe /dev/sdc raid offset=0
Nov 22 05:50:15 jfd systemd-udevd[29004]: LINK 'disk/by-id/wwn-0x5002538c0007e7a8' /lib/udev/rules.d/60-persistent-storage.rules:81
Nov 22 05:50:15 jfd systemd-udevd[29004]: no db file to read /run/udev/data/+scsi:10:0:0:0: No such file or directory
Nov 22 05:50:15 jfd systemd-udevd[29004]: handling device node '/dev/sdc', devnum=b8:32, mode=0600, uid=0, gid=0
Nov 22 05:50:15 jfd systemd-udevd[29004]: preserve already existing symlink '/dev/block/8:32' to '../sdc'
Nov 22 05:50:15 jfd systemd-udevd[29004]: found 'b8:32' claiming '/run/udev/links/\x2fdisk\x2fby-id\x2fata-SAMSUNG_MZ7KM1T9HAJM-00005_S2HNNAAH200633'
Nov 22 05:50:15 jfd systemd-udevd[29004]: creating link '/dev/disk/by-id/ata-SAMSUNG_MZ7KM1T9HAJM-00005_S2HNNAAH200633' to '/dev/sdc'
Nov 22 05:50:15 jfd systemd-udevd[29004]: preserve already existing symlink '/dev/disk/by-id/ata-SAMSUNG_MZ7KM1T9HAJM-00005_S2HNNAAH200633' to '../../sdc'
Nov 22 05:50:15 jfd systemd-udevd[29004]: found 'b8:32' claiming '/run/udev/links/\x2fdisk\x2fby-id\x2fwwn-0x5002538c0007e7a8'
Nov 22 05:50:15 jfd systemd-udevd[29004]: creating link '/dev/disk/by-id/wwn-0x5002538c0007e7a8' to '/dev/sdc'
Nov 22 05:50:15 jfd systemd-udevd[29004]: preserve already existing symlink '/dev/disk/by-id/wwn-0x5002538c0007e7a8' to '../../sdc'
Nov 22 05:50:15 jfd systemd-udevd[29004]: found 'b8:32' claiming '/run/udev/links/\x2fdisk\x2fby-path\x2fpci-0000:01:00.0-sas-0x4433221100000000-lun-0'
Nov 22 05:50:15 jfd systemd-udevd[29004]: creating link '/dev/disk/by-path/pci-0000:01:00.0-sas-0x4433221100000000-lun-0' to '/dev/sdc'
Nov 22 05:50:15 jfd systemd-udevd[29004]: preserve already existing symlink '/dev/disk/by-path/pci-0000:01:00.0-sas-0x4433221100000000-lun-0' to '../../sdc'
Nov 22 05:50:15 jfd systemd-udevd[29004]: created db file '/run/udev/data/b8:32' for '/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host10/port-10:0/end_device-10:0/target10:0:0/10:0:0:0/block/sdc'
Nov 22 05:50:15 jfd systemd-udevd[29004]: adding watch on '/dev/sdc'
Nov 22 05:50:15 jfd systemd-udevd[29004]: created db file '/run/udev/data/b8:32' for '/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host10/port-10:0/end_device-10:0/target10:0:0/10:0:0:0/block/sdc'
Nov 22 05:50:15 jfd systemd-udevd[29004]: passed -1 bytes to netlink monitor 0x7f26dc1d2470
Nov 22 05:50:15 jfd systemd-udevd[29004]: seq 5382 processed with 0
Nov 22 05:50:15 jfd ata_id[29005]: custom logging function 0x7fc30e86abd0 registered
Nov 22 05:50:18 jfd systemd-udevd[218]: cleanup idle workers
Nov 22 05:50:18 jfd systemd-udevd[218]: validate module index
Nov 22 05:50:18 jfd systemd-udevd[218]: Check if link configuration needs reloading.
Nov 22 05:50:18 jfd systemd-udevd[29004]: unload module index
Nov 22 05:50:18 jfd systemd-udevd[29004]: Unloaded link configuration context.
Nov 22 05:50:18 jfd systemd-udevd[218]: worker [29004] exit
Nov 22 05:50:18 jfd systemd-udevd[218]: worker [29004] cleaned up

Cheers,
Andy

^ permalink raw reply

* Re: [PATCH 07/12] dm: use bvec iterator helpers to implement .get_page and .next_page
From: Christoph Hellwig @ 2016-11-22  7:28 UTC (permalink / raw)
  To: Ming Lei
  Cc: Mike Snitzer, Jens Axboe, Linux Kernel Mailing List, linux-block,
	Linux FS Devel, Christoph Hellwig, Alasdair Kergon,
	maintainer:DEVICE-MAPPER (LVM), Shaohua Li,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT
In-Reply-To: <CACVXFVOHLMC+qHZco7S675PnbbNQyDH+qobcjdtJEyTE41gEmw@mail.gmail.com>

Jens, can you apply the non-dm patches?

^ permalink raw reply

* Re: [PATCH 05/12] bcache: debug: avoid to access .bi_io_vec directly
From: Christoph Hellwig @ 2016-11-22  7:30 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-kernel, linux-block, linux-fsdevel,
	Christoph Hellwig, Kent Overstreet, Shaohua Li, Mike Christie,
	Guoqing Jiang, open list:BCACHE (BLOCK LAYER CACHE),
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT
In-Reply-To: <1478865957-25252-6-git-send-email-tom.leiming@gmail.com>

On Fri, Nov 11, 2016 at 08:05:33PM +0800, Ming Lei wrote:
> Instead we use standard iterator way to do that.
> 
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [md PATCH 0/5] Stop using bi_phys_segments as a counter
From: Shaohua Li @ 2016-11-22  8:01 UTC (permalink / raw)
  To: NeilBrown; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid
In-Reply-To: <87k2bwcl44.fsf@notabene.neil.brown.name>

On Tue, Nov 22, 2016 at 01:19:07PM +1100, Neil Brown wrote:
> On Tue, Nov 22 2016, Shaohua Li wrote:
> 
> > On Tue, Nov 22, 2016 at 11:25:04AM +1100, Neil Brown wrote:
> >> On Tue, Nov 22 2016, Shaohua Li wrote:
> >> 
> >> > On Mon, Nov 21, 2016 at 12:19:43PM +1100, Neil Brown wrote:
> >> >> There are 2 problems with using bi_phys_segments as a counter
> >> >> 1/ we only use 16bits, which limits bios to 256M
> >> >> 2/ it is poor form to reuse a field like this.  It interferes
> >> >>    with other changes to bios.
> >> >> 
> >> >> We need to clean up a few things before we can change the use the
> >> >> counter which is now available inside a bio.
> >> >> 
> >> >> I have only tested this lightly.  More review and testing would be
> >> >> appreciated.
> >> >
> >> > So without the accounting, we:
> >> > - don't do bio completion trace
> >> 
> >> Yes, but hopefully that will be added back to bio_endio() soon.
> >> 
> >> > - call md_write_start/md_write_end excessively, which involves atomic operation.
> >> 
> >> raid5_inc_bio_active_stripes() did an atomic operation.  I don't think
> >> there is a net increase in the number of atomic operations.
> >
> > That's different. md_write_start/end uses a global atomic.
> > raid5_inc_bio_active_stripes uses a bio atomic. So we have more cache bouncing now.
> 
> Maybe.
> Most md_write_start() calls are made in the context of
> raid5_make_request().
> We could
>  - call md_write_start() once at the start
>  - count how many times we want to call it in a variable local to
>    raid5_make_request()
>  - atomically add that to the counter at the end.
> 
> Similarly mode md_write_end() requests are in the context of raid5d.  It
> could maintain local counter and apply them all in a single update
> before it sleeps.
> 
> It would be a little messy, but not too horrible I think.

this is messy actually, don't think it's justified.

> >> >
> >> > Not big problems. But we are actually reusing __bi_remaining, I'm wondering why
> >> > we not explicitly reuse it. Eg, adds bio_dec_remaining_return() and uses it
> >> > like raid5_dec_bi_active_stripes.
> >> 
> >> Because using it exactly the same way that other places use it leads to
> >> fewer surprises, now or later.
> >> And I think that the effort to rearrange the code so that we could just
> >> call bio_endio() brought real improvements in code clarity and
> >> simplicity.
> >
> > Not the same way. The return_bi list and retry list fix are still good. We can
> > replace the bio_endio in your patch with something like:
> > if (bio_dec_remaining_return() == 0) {
> > 	trace_block_bio_complete()
> > 	md_write_end()
> > 	bio_endio();
> > }
> > This will give us better control when to end io.
> 
> This isn't safe.  The bio arriving at raid5_make_request() might already
> have been split and could be chained.  Then raid5 might never see
> bio_dec_remaining_return() return zero.
> 
> For example, suppose there is a RAID0 make of some other device, and
> this RAID5.
> A write request arrives which crosses a chunk boundary.
> raid0.c calls bio_split to split off a new bio that will fit in the other
> device, leaving the original bio with a larger bi_sector which will get
> mapped only into the raid5.
> The split bio is chained into the original bio, elevating its
> __bi_remaining count.
> If the other device is particularly slow, or the RAID5 is particularly
> fast, the RAID5 IO might complete before the split bio completes, so
> raid5 will only see __bi_remaining go down to one, not zero.
> When the split bio finally completes, it's bi_endio is
> bio_chain_endio(), and that will call the final bio_endio() on the
> original bio.  md_write_end() would then never be called.
ok, this makes sense.

I still don't like we have no control when to do endio, but don't have better
idea either. These patches work for raid5, but we need to find similar tricky
way to workaround raid1/raid10, which reuse bi_phys_segments too. Probably it's
time MD allocates additional data and attaches to bio. Adding a counter will
solve the issue in a consistency/clean way for raid1/10/5.

Thanks,
Shaohua

^ permalink raw reply

* Re: Fwd: Re: mdadm I/O error with Ddf RAID
From: Arka Sharma @ 2016-11-22  9:30 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <87polocrsv.fsf@notabene.neil.brown.name>

I have observed that following block
else if (!mddev->bitmap)
                        j = mddev->recovery_cp;
is getting executed in md_do_sync. I performed to test. In case 1 I
filled the entire 32 mb of physical disks with FF and then wrote the
metadata. And in the following case we filled the 32 mb with zeros and
then wrote the metadata. In both the cases we receive md/raid1:md126:
not clean -- starting background reconstruction message from md when
there is access to LBA 1000182866. However when I create raid 1 using
mdadm and reboot the system there is no access to  LBA 1000182866.
Also when I read that sector after creating raid 1 with mdadm we see
this block contains FF. As we have confirmed that mdadm also writing
the config data at 1000182610. Only in case of raid created through
our application results access at that offset.

Regards,
Arka

On Tue, Nov 22, 2016 at 5:24 AM, NeilBrown <neilb@suse.com> wrote:
> On Tue, Nov 22 2016, Arka Sharma wrote:
>
>> ---------- Forwarded message ----------
>> From: "Arka Sharma" <arka.sw1988@gmail.com>
>> Date: 21 Nov 2016 12:57 p.m.
>> Subject: Re: mdadm I/O error with Ddf RAID
>> To: "NeilBrown" <neilb@suse.com>
>> Cc: <linux-raid@vger.kernel.org>
>>
>> I have run mdadm --examine on both the component devices as well as on
>> the container. This shows that one of the component disk is marked as
>> offline and status is failed. When I run mdadm --detail on the RAID
>> device it shows the component disk 0 state as removed. Since I am very
>> much new to md and linux in general I am not able to fully root cause
>> this issue. I have made couple of observation though, that before the
>> invalid sector 18446744073709551615 is sent, the sector 1000182866 is
>> accessed after which mdraid reports as not clean starts background
>> reconstruction. I read the LBA 1000182866 and this block contains FF.
>> So is md expecting something in the metadata we are not populating ?
>> Please find the attached md127.txt which is the output of the mdadm
>> --examine <container>, blk-core_diff.txt which contains the printk's
>> and dmesg.txt, also DDF_Header0.txt and DDF_Header1.txt are the dump
>> of ddf headers for both the disks.
>
> Thanks for providing more details.
>
> Sector 1000182866 is 256 sectors into the config section.
> It starts reading the config section at 1000182610 and gets 256 sectors,
> so it reads the rest from 1000182866 and then starts the array.
>
> My guess is that md is getting confused about resync and recovery.
> It tries a resync, but as the array appears degraded, this code:
>                 if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
>                         j = mddev->resync_min;
>                 else if (!mddev->bitmap)
>                         j = mddev->recovery_cp;
>
> in md_do_sync() sets 'j' to MaxSector, which is effectively "-1".  It
> then starts resync from there and goes crazy.  You could put a printk in
> there to confirm.
>
> I don't know why.  Something about the config makes mdadm think the
> array is degraded.  I might try to find time to dig into it again later.
>
> NeilBrown

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox