[RFC 0/2] raid5 trim support

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC 0/2] raid5 trim support
@ 2012-04-17  8:35 Shaohua Li
  2012-04-17  8:35 ` [RFC 1/2] MD: " Shaohua Li
  2012-04-17  8:35 ` [RFC 2/2] MD: raid5 avoid unnecessary zero page for trim Shaohua Li
  0 siblings, 2 replies; 17+ messages in thread
From: Shaohua Li @ 2012-04-17  8:35 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, dan.j.williams, axboe

These are the raid5 trim support patches I'm working on. There are some
limitations to support raid5 trim. I explained the details in the first patch.
The raid5 state machine is quite complicated and I'm sure I missed something,
so it's a RFC so far. And the raid5 trim suffers from the same discard request
merge issue as raid0/10. If you try these in a SSD, please ignore the warning
from block layer. But anyway, my raid4/5/6 tests work well with a fusionio
card.  Comments and suggestions are welcome!

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC 1/2] MD: raid5 trim support
  2012-04-17  8:35 [RFC 0/2] raid5 trim support Shaohua Li
@ 2012-04-17  8:35 ` Shaohua Li
  2012-04-17 14:46   ` Dan Williams
  2012-04-17  8:35 ` [RFC 2/2] MD: raid5 avoid unnecessary zero page for trim Shaohua Li
  1 sibling, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2012-04-17  8:35 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, dan.j.williams, axboe, Shaohua Li

[-- Attachment #1: md-raid5-discard-support.patch --]
[-- Type: text/plain, Size: 10150 bytes --]

Discard for raid4/5/6 has limitation. If discard request size is small, we do
discard for one disk, but we need calculate parity and write parity disk.  To
correctly calculate parity, zero_after_discard must be guaranteed. Even it's
true, we need do discard for one disk but write another disks, which makes the
parity disks wear out fast. This doesn't make sense. So an efficient discard
for raid4/5/6 should discard all data disks and parity disks, which requires
the write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size is
smaller than chunk_size, such pattern is almost impossible in practice. So in
this patch, I only handle the case that A's size equals to chunk_size. That is
discard request should be aligned to stripe size and its size is multiple of
stripe size.

Since we can only handle request with specific alignment and size (or part of
the request fitting stripes), we can't guarantee zero_after_discard even
zero_after_discard is true in low level drives.

The block layer doesn't send down correctly aligned requests even correct
discard alignment is set, so I must filter out.

For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after discard.
Otherwise, data might be lost. Let's consider a scenario: discard a stripe,
write data to one disk and write parity disk. The stripe could be still
inconsistent till then depending on using data from other data disks or parity
disks to calculate new parity. If the disk is broken, we can't restore it. So
in this patch, we only enable discard support if all disks have
zero_after_discard.

If discard fails in one disk, we face the similar inconsistent issue above. The
patch will make discard follow the same path as normal write request. If
discard fails, a resync will be scheduled to make the data consistent. This
isn't good to have extra writes, but data consistency is important.

If a subsequent read/write request hits raid5 cache of a discarded stripe, the
discarded dev page should have zero filled, so the data is consistent. This
patch will always zero dev page for discarded request stripe. This isn't
optimal because discard request doesn't need such payload. Next patch will
avoid it.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |  133 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 drivers/md/raid5.h |    1 
 2 files changed, 128 insertions(+), 6 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-04-17 14:15:38.546074804 +0800
+++ linux/drivers/md/raid5.c	2012-04-17 16:02:49.776046739 +0800
@@ -510,6 +510,8 @@ static void ops_run_io(struct stripe_hea
 				rw = WRITE_FUA;
 			else
 				rw = WRITE;
+			if (test_and_clear_bit(R5_Discard, &sh->dev[i].flags))
+				rw |= REQ_DISCARD;
 		} else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
 			rw = READ;
 		else if (test_and_clear_bit(R5_WantReplace,
@@ -1114,8 +1116,13 @@ ops_run_biodrain(struct stripe_head *sh,
 				dev->sector + STRIPE_SECTORS) {
 				if (wbi->bi_rw & REQ_FUA)
 					set_bit(R5_WantFUA, &dev->flags);
-				tx = async_copy_data(1, wbi, dev->page,
-					dev->sector, tx);
+				if (wbi->bi_rw & REQ_DISCARD) {
+					memset(page_address(dev->page), 0,
+						STRIPE_SECTORS << 9);
+					set_bit(R5_Discard, &dev->flags);
+				} else
+					tx = async_copy_data(1, wbi, dev->page,
+						dev->sector, tx);
 				wbi = r5_next_bio(wbi, dev->sector);
 			}
 		}
@@ -1177,6 +1184,20 @@ ops_run_reconstruct5(struct stripe_head
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
 
+	for (i = 0; i < sh->disks; i++) {
+		if (pd_idx == i)
+			continue;
+		if (!test_bit(R5_Discard, &sh->dev[i].flags))
+			break;
+	}
+	if (i >= sh->disks) {
+		atomic_inc(&sh->count);
+		memset(page_address(sh->dev[pd_idx].page), 0,
+			STRIPE_SECTORS << 9);
+		set_bit(R5_Discard, &sh->dev[pd_idx].flags);
+		ops_complete_reconstruct(sh);
+		return;
+	}
 	/* check if prexor is active which means only process blocks
 	 * that are part of a read-modify-write (written)
 	 */
@@ -1221,10 +1242,28 @@ ops_run_reconstruct6(struct stripe_head
 {
 	struct async_submit_ctl submit;
 	struct page **blocks = percpu->scribble;
-	int count;
+	int count, i;
 
 	pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector);
 
+	for (i = 0; i < sh->disks; i++) {
+		if (sh->pd_idx == i || sh->qd_idx == i)
+			continue;
+		if (!test_bit(R5_Discard, &sh->dev[i].flags))
+			break;
+	}
+	if (i >= sh->disks) {
+		atomic_inc(&sh->count);
+		memset(page_address(sh->dev[sh->pd_idx].page), 0,
+			STRIPE_SECTORS << 9);
+		memset(page_address(sh->dev[sh->qd_idx].page), 0,
+			STRIPE_SECTORS << 9);
+		set_bit(R5_Discard, &sh->dev[sh->pd_idx].flags);
+		set_bit(R5_Discard, &sh->dev[sh->qd_idx].flags);
+		ops_complete_reconstruct(sh);
+		return;
+	}
+
 	count = set_syndrome_sources(blocks, sh);
 
 	atomic_inc(&sh->count);
@@ -2278,7 +2317,7 @@ schedule_reconstruction(struct stripe_he
  */
 static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, int forwrite)
 {
-	struct bio **bip;
+	struct bio **bip, *orig_bi = bi;
 	struct r5conf *conf = sh->raid_conf;
 	int firstwrite=0;
 
@@ -2288,6 +2327,23 @@ static int add_stripe_bio(struct stripe_
 
 
 	spin_lock_irq(&conf->device_lock);
+
+	if (bi->bi_rw & REQ_DISCARD) {
+		int i;
+		dd_idx = -1;
+		for (i = 0; i < sh->disks; i++) {
+			if (i == sh->pd_idx || i == sh->qd_idx)
+				continue;
+			if (dd_idx == -1)
+				dd_idx = i;
+			if (sh->dev[i].towrite) {
+				dd_idx = i;
+				goto overlap;
+			}
+		}
+	}
+
+again:
 	if (forwrite) {
 		bip = &sh->dev[dd_idx].towrite;
 		if (*bip == NULL && sh->dev[dd_idx].written == NULL)
@@ -2321,6 +2377,15 @@ static int add_stripe_bio(struct stripe_
 		if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS)
 			set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
 	}
+
+	bi = orig_bi;
+	if (bi->bi_rw & REQ_DISCARD) {
+		dd_idx++;
+		while (dd_idx == sh->pd_idx || dd_idx == sh->qd_idx)
+			dd_idx++;
+		if (dd_idx < sh->disks)
+			goto again;
+	}
 	spin_unlock_irq(&conf->device_lock);
 
 	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
@@ -3950,6 +4015,19 @@ static void make_request(struct mddev *m
 	bi->bi_next = NULL;
 	bi->bi_phys_segments = 1;	/* over-loaded to count active stripes */
 
+	/* block layer doesn't correctly do alignment even we set correct alignment */
+	if (bi->bi_rw & REQ_DISCARD) {
+		int stripe_sectors = conf->chunk_sectors *
+			(conf->raid_disks - conf->max_degraded);
+
+		logical_sector = (logical_sector + stripe_sectors - 1);
+		sector_div(logical_sector, stripe_sectors);
+		sector_div(last_sector, stripe_sectors);
+
+		logical_sector *= stripe_sectors;
+		last_sector *= stripe_sectors;
+	}
+
 	plugged = mddev_check_plugged(mddev);
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
 		DEFINE_WAIT(w);
@@ -3973,6 +4051,11 @@ static void make_request(struct mddev *m
 			if (mddev->delta_disks < 0
 			    ? logical_sector < conf->reshape_progress
 			    : logical_sector >= conf->reshape_progress) {
+				/* The stripe will be reshaped soon, ignore it */
+				if (bi->bi_rw & REQ_DISCARD) {
+					spin_unlock_irq(&conf->device_lock);
+					continue;
+				}
 				disks = conf->previous_raid_disks;
 				previous = 1;
 			} else {
@@ -4063,7 +4146,11 @@ static void make_request(struct mddev *m
 			finish_wait(&conf->wait_for_overlap, &w);
 			break;
 		}
-			
+		/* For discard, we always discard one stripe */
+		if ((bi->bi_rw & REQ_DISCARD) &&
+		   !((logical_sector + STRIPE_SECTORS) % conf->chunk_sectors))
+			logical_sector += conf->chunk_sectors *
+				(conf->raid_disks - conf->max_degraded - 1);
 	}
 	if (!plugged)
 		md_wakeup_thread(mddev->thread);
@@ -5162,6 +5249,7 @@ static int run(struct mddev *mddev)
 
 	if (mddev->queue) {
 		int chunk_size;
+		bool discard_supported = true;
 		/* read-ahead size must cover two whole stripes, which
 		 * is 2 * (datadisks) * chunksize where 'n' is the
 		 * number of raid devices
@@ -5182,9 +5270,42 @@ static int run(struct mddev *mddev)
 		blk_queue_io_opt(mddev->queue, chunk_size *
 				 (conf->raid_disks - conf->max_degraded));
 
-		rdev_for_each(rdev, mddev)
+		/*
+		 * We can only discard a whole stripe. It doesn't make sense to
+		 * discard data disk but write parity disk
+		 */
+		stripe = stripe * PAGE_SIZE;
+		mddev->queue->limits.discard_alignment = stripe;
+		mddev->queue->limits.discard_granularity = stripe;
+		/*
+		 * unaligned part of discard request will be ignored, so can't
+		 * guarantee discard_zerors_data
+		 */
+		mddev->queue->limits.discard_zeroes_data = 0;
+
+		rdev_for_each(rdev, mddev) {
 			disk_stack_limits(mddev->gendisk, rdev->bdev,
 					  rdev->data_offset << 9);
+			/*
+			 * discard_zeroes_data is required, otherwise data
+			 * could be lost. Consider a scenario: discard a stripe
+			 * (the stripe could be inconsistent); write one disk
+			 * of the stripe (the stripe could be inconsistent
+			 * again depending on which disks are used to calculate
+			 * parity); the disk is broken; This stripe data of
+			 * the disk is lost.
+			 */
+			if (!blk_queue_discard(bdev_get_queue(rdev->bdev)) ||
+			   !bdev_get_queue(rdev->bdev)->limits.discard_zeroes_data)
+				discard_supported = false;
+		}
+
+		if (discard_supported &&
+		   mddev->queue->limits.max_discard_sectors >= stripe &&
+		   mddev->queue->limits.discard_granularity >= stripe)
+			queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
+		else
+			queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
 	}
 
 	return 0;
Index: linux/drivers/md/raid5.h
===================================================================
--- linux.orig/drivers/md/raid5.h	2012-04-17 14:15:38.506074977 +0800
+++ linux/drivers/md/raid5.h	2012-04-17 14:16:50.206075090 +0800
@@ -295,6 +295,7 @@ enum r5dev_flags {
 	R5_WantReplace, /* We need to update the replacement, we have read
 			 * data in, and now is a good time to write it out.
 			 */
+	R5_Discard,	/* Discard the stripe */
 };
 
 /*


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] MD: raid5 trim support
  2012-04-17  8:35 ` [RFC 1/2] MD: " Shaohua Li
@ 2012-04-17 14:46   ` Dan Williams
  2012-04-17 15:07     ` Shaohua Li
  2012-04-17 20:26     ` NeilBrown
  0 siblings, 2 replies; 17+ messages in thread
From: Dan Williams @ 2012-04-17 14:46 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, neilb, axboe, Shaohua Li

On Tue, Apr 17, 2012 at 1:35 AM, Shaohua Li <shli@kernel.org> wrote:
> Discard for raid4/5/6 has limitation. If discard request size is small, we do
> discard for one disk, but we need calculate parity and write parity disk.  To
> correctly calculate parity, zero_after_discard must be guaranteed.

I'm wondering if we could use the new bad blocks facility to mark
discarded ranges so we don't necessarily need determinate data after
discard.

...but I have not looked into it beyond that.

--
Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] MD: raid5 trim support
  2012-04-17 14:46   ` Dan Williams
@ 2012-04-17 15:07     ` Shaohua Li
  2012-04-17 18:16       ` Dan Williams
  2012-04-17 20:26     ` NeilBrown
  1 sibling, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2012-04-17 15:07 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-raid, neilb, axboe, Shaohua Li

Hi Dan,
On 4/17/12 10:46 PM, Dan Williams wrote:
> On Tue, Apr 17, 2012 at 1:35 AM, Shaohua Li<shli@kernel.org>  wrote:
>> Discard for raid4/5/6 has limitation. If discard request size is small, we do
>> discard for one disk, but we need calculate parity and write parity disk.  To
>> correctly calculate parity, zero_after_discard must be guaranteed.
>
> I'm wondering if we could use the new bad blocks facility to mark
> discarded ranges so we don't necessarily need determinate data after
> discard.

It would be great the limitation can be avoided and the code can be
simplified. I didn't follow linux-raid maillist, can you point me the url
of the new bad blocks facility please?

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] MD: raid5 trim support
  2012-04-17 15:07     ` Shaohua Li
@ 2012-04-17 18:16       ` Dan Williams
  0 siblings, 0 replies; 17+ messages in thread
From: Dan Williams @ 2012-04-17 18:16 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, neilb, axboe, Shaohua Li

On Tue, Apr 17, 2012 at 8:07 AM, Shaohua Li <shli@kernel.org> wrote:
> Hi Dan,
>
> On 4/17/12 10:46 PM, Dan Williams wrote:
>>
>> On Tue, Apr 17, 2012 at 1:35 AM, Shaohua Li<shli@kernel.org>  wrote:
>>>
>>> Discard for raid4/5/6 has limitation. If discard request size is small,
>>> we do
>>> discard for one disk, but we need calculate parity and write parity disk.
>>>  To
>>> correctly calculate parity, zero_after_discard must be guaranteed.
>>
>>
>> I'm wondering if we could use the new bad blocks facility to mark
>> discarded ranges so we don't necessarily need determinate data after
>> discard.
>
>
> It would be great the limitation can be avoided and the code can be
> simplified. I didn't follow linux-raid maillist, can you point me the url
> of the new bad blocks facility please?

It came in at 3.1 starting with:

commit 2230dfe4ccc3add340dc6d437965b2de1d269fde
Author: NeilBrown <neilb@suse.de>
Date:   Thu Jul 28 11:31:46 2011 +1000

    md: beginnings of bad block management.

    This the first step in allowing md to track bad-blocks per-device so
    that we can fail individual blocks rather than the whole device.

    This patch just adds a data structure for recording bad blocks, with
    routines to add, remove, search the list.

    Signed-off-by: NeilBrown <neilb@suse.de>
    Reviewed-by: Namhyung Kim <namhyung@gmail.com>

Here is a link to the patch set:
http://marc.info/?l=linux-raid&m=131121721902900&w=2

--
Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] MD: raid5 trim support
  2012-04-17 14:46   ` Dan Williams
  2012-04-17 15:07     ` Shaohua Li
@ 2012-04-17 20:26     ` NeilBrown
  2012-04-18  0:58       ` Shaohua Li
  1 sibling, 1 reply; 17+ messages in thread
From: NeilBrown @ 2012-04-17 20:26 UTC (permalink / raw)
  To: Dan Williams; +Cc: Shaohua Li, linux-raid, axboe, Shaohua Li

[-- Attachment #1: Type: text/plain, Size: 954 bytes --]

On Tue, 17 Apr 2012 07:46:03 -0700 Dan Williams <dan.j.williams@intel.com>
wrote:

> On Tue, Apr 17, 2012 at 1:35 AM, Shaohua Li <shli@kernel.org> wrote:
> > Discard for raid4/5/6 has limitation. If discard request size is small, we do
> > discard for one disk, but we need calculate parity and write parity disk.  To
> > correctly calculate parity, zero_after_discard must be guaranteed.
> 
> I'm wondering if we could use the new bad blocks facility to mark
> discarded ranges so we don't necessarily need determinate data after
> discard.
> 
> ...but I have not looked into it beyond that.
> 
> --
> Dan

No.

The bad blocks framework can only store a limited number of bad ranges - 512
in the current implementation.
That would not be an acceptable restriction for discarded ranges.

You would need a bitmap of some sort if you wanted to record discarded
regions.

http://neil.brown.name/blog/20110216044002#5

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] MD: raid5 trim support
  2012-04-17 20:26     ` NeilBrown
@ 2012-04-18  0:58       ` Shaohua Li
  2012-04-18  4:48         ` NeilBrown
  0 siblings, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2012-04-18  0:58 UTC (permalink / raw)
  To: NeilBrown; +Cc: Dan Williams, linux-raid, axboe, Shaohua Li

On 4/18/12 4:26 AM, NeilBrown wrote:
> On Tue, 17 Apr 2012 07:46:03 -0700 Dan Williams<dan.j.williams@intel.com>
> wrote:
>
>> On Tue, Apr 17, 2012 at 1:35 AM, Shaohua Li<shli@kernel.org>  wrote:
>>> Discard for raid4/5/6 has limitation. If discard request size is small, we do
>>> discard for one disk, but we need calculate parity and write parity disk.  To
>>> correctly calculate parity, zero_after_discard must be guaranteed.
>>
>> I'm wondering if we could use the new bad blocks facility to mark
>> discarded ranges so we don't necessarily need determinate data after
>> discard.
>>
>> ...but I have not looked into it beyond that.
>>
>> --
>> Dan
>
> No.
>
> The bad blocks framework can only store a limited number of bad ranges - 512
> in the current implementation.
> That would not be an acceptable restriction for discarded ranges.
>
> You would need a bitmap of some sort if you wanted to record discarded
> regions.
>
> http://neil.brown.name/blog/20110216044002#5

This appears to remove the unnecessary resync for discarded range after 
a crash
or discard error, eg an enhancement. From my understanding, it can't 
remove the
limitation I mentioned in the patch. For raid5, we still need discard a 
whole
stripe (discarding one disk but writing parity disk isn't good).

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] MD: raid5 trim support
  2012-04-18  0:58       ` Shaohua Li
@ 2012-04-18  4:48         ` NeilBrown
  2012-04-18  5:30           ` Shaohua Li
  0 siblings, 1 reply; 17+ messages in thread
From: NeilBrown @ 2012-04-18  4:48 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Dan Williams, linux-raid, axboe, Shaohua Li

[-- Attachment #1: Type: text/plain, Size: 2076 bytes --]

On Wed, 18 Apr 2012 08:58:14 +0800 Shaohua Li <shli@kernel.org> wrote:

> On 4/18/12 4:26 AM, NeilBrown wrote:
> > On Tue, 17 Apr 2012 07:46:03 -0700 Dan Williams<dan.j.williams@intel.com>
> > wrote:
> >
> >> On Tue, Apr 17, 2012 at 1:35 AM, Shaohua Li<shli@kernel.org>  wrote:
> >>> Discard for raid4/5/6 has limitation. If discard request size is small, we do
> >>> discard for one disk, but we need calculate parity and write parity disk.  To
> >>> correctly calculate parity, zero_after_discard must be guaranteed.
> >>
> >> I'm wondering if we could use the new bad blocks facility to mark
> >> discarded ranges so we don't necessarily need determinate data after
> >> discard.
> >>
> >> ...but I have not looked into it beyond that.
> >>
> >> --
> >> Dan
> >
> > No.
> >
> > The bad blocks framework can only store a limited number of bad ranges - 512
> > in the current implementation.
> > That would not be an acceptable restriction for discarded ranges.
> >
> > You would need a bitmap of some sort if you wanted to record discarded
> > regions.
> >
> > http://neil.brown.name/blog/20110216044002#5
> 
> This appears to remove the unnecessary resync for discarded range after 
> a crash
> or discard error, eg an enhancement. From my understanding, it can't 
> remove the
> limitation I mentioned in the patch. For raid5, we still need discard a 
> whole
> stripe (discarding one disk but writing parity disk isn't good).

It is certainly not ideal, but it is worse than not discarding at all?
And would updating some sort of bitmap be just as bad as updating the parity
block?

How about treating a DISCARD request as a request to write a block full of
zeros, then at the lower level treat any request to write a block full of
zeros as a DISCARD request.  So when the parity becomes zero, it gets
discarded.

Certainly it is best if the filesystem would discard whole stripes at a time,
and we should be sure to optimise that.  But maybe there is still room to do
something useful with small discards?

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] MD: raid5 trim support
  2012-04-18  4:48         ` NeilBrown
@ 2012-04-18  5:30           ` Shaohua Li
  2012-04-18  5:57             ` NeilBrown
  0 siblings, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2012-04-18  5:30 UTC (permalink / raw)
  To: NeilBrown; +Cc: Dan Williams, linux-raid, axboe, Shaohua Li

On 4/18/12 12:48 PM, NeilBrown wrote:
> On Wed, 18 Apr 2012 08:58:14 +0800 Shaohua Li<shli@kernel.org>  wrote:
>
>> On 4/18/12 4:26 AM, NeilBrown wrote:
>>> On Tue, 17 Apr 2012 07:46:03 -0700 Dan Williams<dan.j.williams@intel.com>
>>> wrote:
>>>
>>>> On Tue, Apr 17, 2012 at 1:35 AM, Shaohua Li<shli@kernel.org>   wrote:
>>>>> Discard for raid4/5/6 has limitation. If discard request size is small, we do
>>>>> discard for one disk, but we need calculate parity and write parity disk.  To
>>>>> correctly calculate parity, zero_after_discard must be guaranteed.
>>>>
>>>> I'm wondering if we could use the new bad blocks facility to mark
>>>> discarded ranges so we don't necessarily need determinate data after
>>>> discard.
>>>>
>>>> ...but I have not looked into it beyond that.
>>>>
>>>> --
>>>> Dan
>>>
>>> No.
>>>
>>> The bad blocks framework can only store a limited number of bad ranges - 512
>>> in the current implementation.
>>> That would not be an acceptable restriction for discarded ranges.
>>>
>>> You would need a bitmap of some sort if you wanted to record discarded
>>> regions.
>>>
>>> http://neil.brown.name/blog/20110216044002#5
>>
>> This appears to remove the unnecessary resync for discarded range after
>> a crash
>> or discard error, eg an enhancement. From my understanding, it can't
>> remove the
>> limitation I mentioned in the patch. For raid5, we still need discard a
>> whole
>> stripe (discarding one disk but writing parity disk isn't good).
>
> It is certainly not ideal, but it is worse than not discarding at all?
> And would updating some sort of bitmap be just as bad as updating the parity
> block?
>
> How about treating a DISCARD request as a request to write a block full of
> zeros, then at the lower level treat any request to write a block full of
> zeros as a DISCARD request.  So when the parity becomes zero, it gets
> discarded.
>
> Certainly it is best if the filesystem would discard whole stripes at a time,
> and we should be sure to optimise that.  But maybe there is still room to do
> something useful with small discards?

Sure, it would be great we can do small discards. But I didn't get how to do
it with the bitmap approach. Let's give an example, data disk1, data disk2,
parity disk3. Say discard some sectors of disk1. The suggested approach is
to mark the range bad. Then how to deal with parity disk3? As I said, 
writing
parity disk3 isn't good. So mark the corresponding range of parity disk3
bad too? If we did this, if disk2 is broken, how can we restore it?

Am I missed something or are you talking about different issues?

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] MD: raid5 trim support
  2012-04-18  5:30           ` Shaohua Li
@ 2012-04-18  5:57             ` NeilBrown
  2012-04-18  6:34               ` Shaohua Li
  0 siblings, 1 reply; 17+ messages in thread
From: NeilBrown @ 2012-04-18  5:57 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Dan Williams, linux-raid, axboe, Shaohua Li

[-- Attachment #1: Type: text/plain, Size: 3080 bytes --]

On Wed, 18 Apr 2012 13:30:45 +0800 Shaohua Li <shli@kernel.org> wrote:

> On 4/18/12 12:48 PM, NeilBrown wrote:
> > On Wed, 18 Apr 2012 08:58:14 +0800 Shaohua Li<shli@kernel.org>  wrote:
> >
> >> On 4/18/12 4:26 AM, NeilBrown wrote:
> >>> On Tue, 17 Apr 2012 07:46:03 -0700 Dan Williams<dan.j.williams@intel.com>
> >>> wrote:
> >>>
> >>>> On Tue, Apr 17, 2012 at 1:35 AM, Shaohua Li<shli@kernel.org>   wrote:
> >>>>> Discard for raid4/5/6 has limitation. If discard request size is small, we do
> >>>>> discard for one disk, but we need calculate parity and write parity disk.  To
> >>>>> correctly calculate parity, zero_after_discard must be guaranteed.
> >>>>
> >>>> I'm wondering if we could use the new bad blocks facility to mark
> >>>> discarded ranges so we don't necessarily need determinate data after
> >>>> discard.
> >>>>
> >>>> ...but I have not looked into it beyond that.
> >>>>
> >>>> --
> >>>> Dan
> >>>
> >>> No.
> >>>
> >>> The bad blocks framework can only store a limited number of bad ranges - 512
> >>> in the current implementation.
> >>> That would not be an acceptable restriction for discarded ranges.
> >>>
> >>> You would need a bitmap of some sort if you wanted to record discarded
> >>> regions.
> >>>
> >>> http://neil.brown.name/blog/20110216044002#5
> >>
> >> This appears to remove the unnecessary resync for discarded range after
> >> a crash
> >> or discard error, eg an enhancement. From my understanding, it can't
> >> remove the
> >> limitation I mentioned in the patch. For raid5, we still need discard a
> >> whole
> >> stripe (discarding one disk but writing parity disk isn't good).
> >
> > It is certainly not ideal, but it is worse than not discarding at all?
> > And would updating some sort of bitmap be just as bad as updating the parity
> > block?
> >
> > How about treating a DISCARD request as a request to write a block full of
> > zeros, then at the lower level treat any request to write a block full of
> > zeros as a DISCARD request.  So when the parity becomes zero, it gets
> > discarded.
> >
> > Certainly it is best if the filesystem would discard whole stripes at a time,
> > and we should be sure to optimise that.  But maybe there is still room to do
> > something useful with small discards?
> 
> Sure, it would be great we can do small discards. But I didn't get how to do
> it with the bitmap approach. Let's give an example, data disk1, data disk2,
> parity disk3. Say discard some sectors of disk1. The suggested approach is
> to mark the range bad. Then how to deal with parity disk3? As I said, 
> writing
> parity disk3 isn't good. So mark the corresponding range of parity disk3
> bad too? If we did this, if disk2 is broken, how can we restore it?

Why, exactly, is writing the parity disk not good?
Not discarding blocks that we possibly could discard is also not good.
Which is worst?


> 
> Am I missed something or are you talking about different issues?

We are probably talking about slightly different issues..

NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] MD: raid5 trim support
  2012-04-18  5:57             ` NeilBrown
@ 2012-04-18  6:34               ` Shaohua Li
  2012-04-25  3:43                 ` Shaohua Li
  0 siblings, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2012-04-18  6:34 UTC (permalink / raw)
  To: NeilBrown; +Cc: Dan Williams, linux-raid, axboe, Shaohua Li

On 4/18/12 1:57 PM, NeilBrown wrote:
> On Wed, 18 Apr 2012 13:30:45 +0800 Shaohua Li<shli@kernel.org>  wrote:
>
>> On 4/18/12 12:48 PM, NeilBrown wrote:
>>> On Wed, 18 Apr 2012 08:58:14 +0800 Shaohua Li<shli@kernel.org>   wrote:
>>>
>>>> On 4/18/12 4:26 AM, NeilBrown wrote:
>>>>> On Tue, 17 Apr 2012 07:46:03 -0700 Dan Williams<dan.j.williams@intel.com>
>>>>> wrote:
>>>>>
>>>>>> On Tue, Apr 17, 2012 at 1:35 AM, Shaohua Li<shli@kernel.org>    wrote:
>>>>>>> Discard for raid4/5/6 has limitation. If discard request size is small, we do
>>>>>>> discard for one disk, but we need calculate parity and write parity disk.  To
>>>>>>> correctly calculate parity, zero_after_discard must be guaranteed.
>>>>>>
>>>>>> I'm wondering if we could use the new bad blocks facility to mark
>>>>>> discarded ranges so we don't necessarily need determinate data after
>>>>>> discard.
>>>>>>
>>>>>> ...but I have not looked into it beyond that.
>>>>>>
>>>>>> --
>>>>>> Dan
>>>>>
>>>>> No.
>>>>>
>>>>> The bad blocks framework can only store a limited number of bad ranges - 512
>>>>> in the current implementation.
>>>>> That would not be an acceptable restriction for discarded ranges.
>>>>>
>>>>> You would need a bitmap of some sort if you wanted to record discarded
>>>>> regions.
>>>>>
>>>>> http://neil.brown.name/blog/20110216044002#5
>>>>
>>>> This appears to remove the unnecessary resync for discarded range after
>>>> a crash
>>>> or discard error, eg an enhancement. From my understanding, it can't
>>>> remove the
>>>> limitation I mentioned in the patch. For raid5, we still need discard a
>>>> whole
>>>> stripe (discarding one disk but writing parity disk isn't good).
>>>
>>> It is certainly not ideal, but it is worse than not discarding at all?
>>> And would updating some sort of bitmap be just as bad as updating the parity
>>> block?
>>>
>>> How about treating a DISCARD request as a request to write a block full of
>>> zeros, then at the lower level treat any request to write a block full of
>>> zeros as a DISCARD request.  So when the parity becomes zero, it gets
>>> discarded.
>>>
>>> Certainly it is best if the filesystem would discard whole stripes at a time,
>>> and we should be sure to optimise that.  But maybe there is still room to do
>>> something useful with small discards?
>>
>> Sure, it would be great we can do small discards. But I didn't get how to do
>> it with the bitmap approach. Let's give an example, data disk1, data disk2,
>> parity disk3. Say discard some sectors of disk1. The suggested approach is
>> to mark the range bad. Then how to deal with parity disk3? As I said,
>> writing
>> parity disk3 isn't good. So mark the corresponding range of parity disk3
>> bad too? If we did this, if disk2 is broken, how can we restore it?
>
> Why, exactly, is writing the parity disk not good?
> Not discarding blocks that we possibly could discard is also not good.
> Which is worst?

Writing the parity disk is worse. Discard is to improve the garbage 
collection
of SSD firmware, so improve later write performance. While write is bad for
SSD, because SSD can be wear leveling out with extra write and also write
increases garbage collection overhead. So the result of small discard is 
data
disk garbage collection is improved but parity disk gets worse and 
parity disk
gets fast to end of its life, which doesn't make sense. This is even 
worse when
the parity is distributed.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] MD: raid5 trim support
  2012-04-18  6:34               ` Shaohua Li
@ 2012-04-25  3:43                 ` Shaohua Li
  2012-05-08 10:16                   ` Shaohua Li
  0 siblings, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2012-04-25  3:43 UTC (permalink / raw)
  To: NeilBrown; +Cc: Dan Williams, linux-raid, axboe, Shaohua Li

On Wed, Apr 18, 2012 at 02:34:04PM +0800, Shaohua Li wrote:
> On 4/18/12 1:57 PM, NeilBrown wrote:
> >On Wed, 18 Apr 2012 13:30:45 +0800 Shaohua Li<shli@kernel.org>  wrote:
> >
> >>On 4/18/12 12:48 PM, NeilBrown wrote:
> >>>On Wed, 18 Apr 2012 08:58:14 +0800 Shaohua Li<shli@kernel.org>   wrote:
> >>>
> >>>>On 4/18/12 4:26 AM, NeilBrown wrote:
> >>>>>On Tue, 17 Apr 2012 07:46:03 -0700 Dan Williams<dan.j.williams@intel.com>
> >>>>>wrote:
> >>>>>
> >>>>>>On Tue, Apr 17, 2012 at 1:35 AM, Shaohua Li<shli@kernel.org>    wrote:
> >>>>>>>Discard for raid4/5/6 has limitation. If discard request size is small, we do
> >>>>>>>discard for one disk, but we need calculate parity and write parity disk.  To
> >>>>>>>correctly calculate parity, zero_after_discard must be guaranteed.
> >>>>>>
> >>>>>>I'm wondering if we could use the new bad blocks facility to mark
> >>>>>>discarded ranges so we don't necessarily need determinate data after
> >>>>>>discard.
> >>>>>>
> >>>>>>...but I have not looked into it beyond that.
> >>>>>>
> >>>>>>--
> >>>>>>Dan
> >>>>>
> >>>>>No.
> >>>>>
> >>>>>The bad blocks framework can only store a limited number of bad ranges - 512
> >>>>>in the current implementation.
> >>>>>That would not be an acceptable restriction for discarded ranges.
> >>>>>
> >>>>>You would need a bitmap of some sort if you wanted to record discarded
> >>>>>regions.
> >>>>>
> >>>>>http://neil.brown.name/blog/20110216044002#5
> >>>>
> >>>>This appears to remove the unnecessary resync for discarded range after
> >>>>a crash
> >>>>or discard error, eg an enhancement. From my understanding, it can't
> >>>>remove the
> >>>>limitation I mentioned in the patch. For raid5, we still need discard a
> >>>>whole
> >>>>stripe (discarding one disk but writing parity disk isn't good).
> >>>
> >>>It is certainly not ideal, but it is worse than not discarding at all?
> >>>And would updating some sort of bitmap be just as bad as updating the parity
> >>>block?
> >>>
> >>>How about treating a DISCARD request as a request to write a block full of
> >>>zeros, then at the lower level treat any request to write a block full of
> >>>zeros as a DISCARD request.  So when the parity becomes zero, it gets
> >>>discarded.
> >>>
> >>>Certainly it is best if the filesystem would discard whole stripes at a time,
> >>>and we should be sure to optimise that.  But maybe there is still room to do
> >>>something useful with small discards?
> >>
> >>Sure, it would be great we can do small discards. But I didn't get how to do
> >>it with the bitmap approach. Let's give an example, data disk1, data disk2,
> >>parity disk3. Say discard some sectors of disk1. The suggested approach is
> >>to mark the range bad. Then how to deal with parity disk3? As I said,
> >>writing
> >>parity disk3 isn't good. So mark the corresponding range of parity disk3
> >>bad too? If we did this, if disk2 is broken, how can we restore it?
> >
> >Why, exactly, is writing the parity disk not good?
> >Not discarding blocks that we possibly could discard is also not good.
> >Which is worst?
> 
> Writing the parity disk is worse. Discard is to improve the garbage
> collection
> of SSD firmware, so improve later write performance. While write is bad for
> SSD, because SSD can be wear leveling out with extra write and also write
> increases garbage collection overhead. So the result of small
> discard is data
> disk garbage collection is improved but parity disk gets worse and
> parity disk
> gets fast to end of its life, which doesn't make sense. This is even
> worse when
> the parity is distributed.
Neil,
Any comments about the patches?

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] MD: raid5 trim support
  2012-04-25  3:43                 ` Shaohua Li
@ 2012-05-08 10:16                   ` Shaohua Li
  2012-05-08 15:52                     ` Dan Williams
  2012-05-08 20:17                     ` NeilBrown
  0 siblings, 2 replies; 17+ messages in thread
From: Shaohua Li @ 2012-05-08 10:16 UTC (permalink / raw)
  To: NeilBrown; +Cc: Dan Williams, linux-raid, axboe, Shaohua Li

On Wed, Apr 25, 2012 at 11:43:07AM +0800, Shaohua Li wrote:
> On Wed, Apr 18, 2012 at 02:34:04PM +0800, Shaohua Li wrote:
> > On 4/18/12 1:57 PM, NeilBrown wrote:
> > >On Wed, 18 Apr 2012 13:30:45 +0800 Shaohua Li<shli@kernel.org>  wrote:
> > >
> > >>On 4/18/12 12:48 PM, NeilBrown wrote:
> > >>>On Wed, 18 Apr 2012 08:58:14 +0800 Shaohua Li<shli@kernel.org>   wrote:
> > >>>
> > >>>>On 4/18/12 4:26 AM, NeilBrown wrote:
> > >>>>>On Tue, 17 Apr 2012 07:46:03 -0700 Dan Williams<dan.j.williams@intel.com>
> > >>>>>wrote:
> > >>>>>
> > >>>>>>On Tue, Apr 17, 2012 at 1:35 AM, Shaohua Li<shli@kernel.org>    wrote:
> > >>>>>>>Discard for raid4/5/6 has limitation. If discard request size is small, we do
> > >>>>>>>discard for one disk, but we need calculate parity and write parity disk.  To
> > >>>>>>>correctly calculate parity, zero_after_discard must be guaranteed.
> > >>>>>>
> > >>>>>>I'm wondering if we could use the new bad blocks facility to mark
> > >>>>>>discarded ranges so we don't necessarily need determinate data after
> > >>>>>>discard.
> > >>>>>>
> > >>>>>>...but I have not looked into it beyond that.
> > >>>>>>
> > >>>>>>--
> > >>>>>>Dan
> > >>>>>
> > >>>>>No.
> > >>>>>
> > >>>>>The bad blocks framework can only store a limited number of bad ranges - 512
> > >>>>>in the current implementation.
> > >>>>>That would not be an acceptable restriction for discarded ranges.
> > >>>>>
> > >>>>>You would need a bitmap of some sort if you wanted to record discarded
> > >>>>>regions.
> > >>>>>
> > >>>>>http://neil.brown.name/blog/20110216044002#5
> > >>>>
> > >>>>This appears to remove the unnecessary resync for discarded range after
> > >>>>a crash
> > >>>>or discard error, eg an enhancement. From my understanding, it can't
> > >>>>remove the
> > >>>>limitation I mentioned in the patch. For raid5, we still need discard a
> > >>>>whole
> > >>>>stripe (discarding one disk but writing parity disk isn't good).
> > >>>
> > >>>It is certainly not ideal, but it is worse than not discarding at all?
> > >>>And would updating some sort of bitmap be just as bad as updating the parity
> > >>>block?
> > >>>
> > >>>How about treating a DISCARD request as a request to write a block full of
> > >>>zeros, then at the lower level treat any request to write a block full of
> > >>>zeros as a DISCARD request.  So when the parity becomes zero, it gets
> > >>>discarded.
> > >>>
> > >>>Certainly it is best if the filesystem would discard whole stripes at a time,
> > >>>and we should be sure to optimise that.  But maybe there is still room to do
> > >>>something useful with small discards?
> > >>
> > >>Sure, it would be great we can do small discards. But I didn't get how to do
> > >>it with the bitmap approach. Let's give an example, data disk1, data disk2,
> > >>parity disk3. Say discard some sectors of disk1. The suggested approach is
> > >>to mark the range bad. Then how to deal with parity disk3? As I said,
> > >>writing
> > >>parity disk3 isn't good. So mark the corresponding range of parity disk3
> > >>bad too? If we did this, if disk2 is broken, how can we restore it?
> > >
> > >Why, exactly, is writing the parity disk not good?
> > >Not discarding blocks that we possibly could discard is also not good.
> > >Which is worst?
> > 
> > Writing the parity disk is worse. Discard is to improve the garbage
> > collection
> > of SSD firmware, so improve later write performance. While write is bad for
> > SSD, because SSD can be wear leveling out with extra write and also write
> > increases garbage collection overhead. So the result of small
> > discard is data
> > disk garbage collection is improved but parity disk gets worse and
> > parity disk
> > gets fast to end of its life, which doesn't make sense. This is even
> > worse when
> > the parity is distributed.
> Neil,
> Any comments about the patches?
ping!

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] MD: raid5 trim support
  2012-05-08 10:16                   ` Shaohua Li
@ 2012-05-08 15:52                     ` Dan Williams
  2012-05-09  3:12                       ` Shaohua Li
  2012-05-08 20:17                     ` NeilBrown
  1 sibling, 1 reply; 17+ messages in thread
From: Dan Williams @ 2012-05-08 15:52 UTC (permalink / raw)
  To: Shaohua Li; +Cc: NeilBrown, linux-raid, axboe, Shaohua Li

On Tue, May 8, 2012 at 3:16 AM, Shaohua Li <shli@kernel.org> wrote:
>> > Writing the parity disk is worse. Discard is to improve the garbage
>> > collection
>> > of SSD firmware, so improve later write performance. While write is bad for
>> > SSD, because SSD can be wear leveling out with extra write and also write
>> > increases garbage collection overhead. So the result of small
>> > discard is data
>> > disk garbage collection is improved but parity disk gets worse and
>> > parity disk
>> > gets fast to end of its life, which doesn't make sense. This is even
>> > worse when
>> > the parity is distributed.
>> Neil,
>> Any comments about the patches?
> ping!
>

So, are we still in the position of needing to degrade individual
stripes to support trim?  Is there any benefit to making this a
temporary condition?  I.e. trim large ranges, including the parity
disk, and then once all the trim sequences have coalesced resync the
stripes that remain only partially trimmed?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] MD: raid5 trim support
  2012-05-08 15:52                     ` Dan Williams
@ 2012-05-09  3:12                       ` Shaohua Li
  0 siblings, 0 replies; 17+ messages in thread
From: Shaohua Li @ 2012-05-09  3:12 UTC (permalink / raw)
  To: Dan Williams; +Cc: NeilBrown, linux-raid, axboe, Shaohua Li

On Tue, May 08, 2012 at 08:52:07AM -0700, Dan Williams wrote:
> On Tue, May 8, 2012 at 3:16 AM, Shaohua Li <shli@kernel.org> wrote:
> >> > Writing the parity disk is worse. Discard is to improve the garbage
> >> > collection
> >> > of SSD firmware, so improve later write performance. While write is bad for
> >> > SSD, because SSD can be wear leveling out with extra write and also write
> >> > increases garbage collection overhead. So the result of small
> >> > discard is data
> >> > disk garbage collection is improved but parity disk gets worse and
> >> > parity disk
> >> > gets fast to end of its life, which doesn't make sense. This is even
> >> > worse when
> >> > the parity is distributed.
> >> Neil,
> >> Any comments about the patches?
> > ping!
> >
> 
> So, are we still in the position of needing to degrade individual
> stripes to support trim?  Is there any benefit to making this a
> temporary condition?  I.e. trim large ranges, including the parity
> disk, and then once all the trim sequences have coalesced resync the
> stripes that remain only partially trimmed?

Yes, we need trim a whole stripe one time. This is the best I can do now.
resync involves write and I don't want to do any unnecessary write, because
trim is to improve ssd firmware garbage collection while write goes to the
oppositive and makes flash wear out faster.

Or your suggestion is if a trim is small (for example, only covers one disk),
we don't do trim immediately but just record it in bitmap. Later if the small
trim can be coalesced to one strip, trim the strip. This is optimal but
involves disk format change. And to make it efficient, we need track 4kB range.
From Neil's blog, there isn't such space left.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] MD: raid5 trim support
  2012-05-08 10:16                   ` Shaohua Li
  2012-05-08 15:52                     ` Dan Williams
@ 2012-05-08 20:17                     ` NeilBrown
  1 sibling, 0 replies; 17+ messages in thread
From: NeilBrown @ 2012-05-08 20:17 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Dan Williams, linux-raid, axboe, Shaohua Li

[-- Attachment #1: Type: text/plain, Size: 266 bytes --]

On Tue, 8 May 2012 18:16:53 +0800 Shaohua Li <shli@kernel.org> wrote:
> > Neil,
> > Any comments about the patches?
> ping!

Sorry for not responding earlier.
I'll try to make time next week to look at these and the RAID1 patches you
just sent.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC 2/2] MD: raid5 avoid unnecessary zero page for trim
  2012-04-17  8:35 [RFC 0/2] raid5 trim support Shaohua Li
  2012-04-17  8:35 ` [RFC 1/2] MD: " Shaohua Li
@ 2012-04-17  8:35 ` Shaohua Li
  1 sibling, 0 replies; 17+ messages in thread
From: Shaohua Li @ 2012-04-17  8:35 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, dan.j.williams, axboe, Shaohua Li

[-- Attachment #1: md-raid5-remove-unnecessary-memset.patch --]
[-- Type: text/plain, Size: 7593 bytes --]

We want to avoid zero discarded dev page, because it's useless for discard.
But if we don't zero it, another read/write hit such page in the cache and
will get inconsistent data. To avoid zero the page, we set R5_WantZeroFill
for discarded dev page. Every time before the page is accessed and the
flag is set, we zero the page and clear the flag. If the page will be
drained or computed, we just clear the flag for it. In this way, the dev
page data is alway consistent. And since the chance discarded data is
accessed soon is low, zero discard dev page is largely avoided.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   83 +++++++++++++++++++++++++++++++++++++++++++----------
 drivers/md/raid5.h |    1 
 2 files changed, 69 insertions(+), 15 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-04-17 16:02:49.776046739 +0800
+++ linux/drivers/md/raid5.c	2012-04-17 16:07:52.426045417 +0800
@@ -770,6 +770,10 @@ static void ops_run_biofill(struct strip
 			dev->read = rbi = dev->toread;
 			dev->toread = NULL;
 			spin_unlock_irq(&conf->device_lock);
+
+			if (test_and_clear_bit(R5_WantZeroFill, &dev->flags))
+				memset(page_address(dev->page), 0, STRIPE_SIZE);
+
 			while (rbi && rbi->bi_sector <
 				dev->sector + STRIPE_SECTORS) {
 				tx = async_copy_data(0, rbi, dev->page,
@@ -839,9 +843,16 @@ ops_run_compute5(struct stripe_head *sh,
 		__func__, (unsigned long long)sh->sector, target);
 	BUG_ON(!test_bit(R5_Wantcompute, &tgt->flags));
 
-	for (i = disks; i--; )
-		if (i != target)
+	for (i = disks; i--; ) {
+		if (i != target) {
 			xor_srcs[count++] = sh->dev[i].page;
+			if (test_and_clear_bit(R5_WantZeroFill,
+			   &sh->dev[i].flags))
+				memset(page_address(sh->dev[i].page), 0,
+					STRIPE_SIZE);
+		}
+		clear_bit(R5_WantZeroFill, &sh->dev[i].flags);
+	}
 
 	atomic_inc(&sh->count);
 
@@ -918,6 +929,10 @@ ops_run_compute6_1(struct stripe_head *s
 
 	atomic_inc(&sh->count);
 
+	for (i = 0; i < sh->disks; i++)
+		if (test_and_clear_bit(R5_WantZeroFill, &sh->dev[i].flags))
+			memset(page_address(sh->dev[i].page), 0, STRIPE_SIZE);
+
 	if (target == qd_idx) {
 		count = set_syndrome_sources(blocks, sh);
 		blocks[count] = NULL; /* regenerating p is not necessary */
@@ -968,8 +983,11 @@ ops_run_compute6_2(struct stripe_head *s
 	/* we need to open-code set_syndrome_sources to handle the
 	 * slot number conversion for 'faila' and 'failb'
 	 */
-	for (i = 0; i < disks ; i++)
+	for (i = 0; i < disks ; i++) {
 		blocks[i] = NULL;
+		if (test_and_clear_bit(R5_WantZeroFill, &sh->dev[i].flags))
+			memset(page_address(sh->dev[i].page), 0, STRIPE_SIZE);
+	}
 	count = 0;
 	i = d0_idx;
 	do {
@@ -1080,6 +1098,9 @@ ops_run_prexor(struct stripe_head *sh, s
 		/* Only process blocks that are known to be uptodate */
 		if (test_bit(R5_Wantdrain, &dev->flags))
 			xor_srcs[count++] = dev->page;
+		if ((i == pd_idx || test_bit(R5_Wantdrain, &dev->flags)) &&
+		   test_and_clear_bit(R5_WantZeroFill, &dev->flags))
+			memset(page_address(dev->page), 0, STRIPE_SIZE);
 	}
 
 	init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_DROP_DST, tx,
@@ -1117,12 +1138,13 @@ ops_run_biodrain(struct stripe_head *sh,
 				if (wbi->bi_rw & REQ_FUA)
 					set_bit(R5_WantFUA, &dev->flags);
 				if (wbi->bi_rw & REQ_DISCARD) {
-					memset(page_address(dev->page), 0,
-						STRIPE_SECTORS << 9);
+					set_bit(R5_WantZeroFill, &dev->flags);
 					set_bit(R5_Discard, &dev->flags);
-				} else
+				} else {
+					clear_bit(R5_WantZeroFill, &dev->flags);
 					tx = async_copy_data(1, wbi, dev->page,
 						dev->sector, tx);
+				}
 				wbi = r5_next_bio(wbi, dev->sector);
 			}
 		}
@@ -1192,8 +1214,7 @@ ops_run_reconstruct5(struct stripe_head
 	}
 	if (i >= sh->disks) {
 		atomic_inc(&sh->count);
-		memset(page_address(sh->dev[pd_idx].page), 0,
-			STRIPE_SECTORS << 9);
+		set_bit(R5_WantZeroFill, &sh->dev[pd_idx].flags);
 		set_bit(R5_Discard, &sh->dev[pd_idx].flags);
 		ops_complete_reconstruct(sh);
 		return;
@@ -1208,13 +1229,21 @@ ops_run_reconstruct5(struct stripe_head
 			struct r5dev *dev = &sh->dev[i];
 			if (dev->written)
 				xor_srcs[count++] = dev->page;
+			if ((i == pd_idx || dev->written) &&
+			   test_and_clear_bit(R5_WantZeroFill, &dev->flags))
+				memset(page_address(dev->page), 0, STRIPE_SIZE);
 		}
 	} else {
 		xor_dest = sh->dev[pd_idx].page;
+		clear_bit(R5_WantZeroFill, &sh->dev[pd_idx].flags);
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
-			if (i != pd_idx)
+			if (i != pd_idx) {
 				xor_srcs[count++] = dev->page;
+			   	if (test_and_clear_bit(R5_WantZeroFill, &dev->flags))
+					memset(page_address(dev->page), 0,
+					      STRIPE_SIZE);
+			}
 		}
 	}
 
@@ -1254,16 +1283,23 @@ ops_run_reconstruct6(struct stripe_head
 	}
 	if (i >= sh->disks) {
 		atomic_inc(&sh->count);
-		memset(page_address(sh->dev[sh->pd_idx].page), 0,
-			STRIPE_SECTORS << 9);
-		memset(page_address(sh->dev[sh->qd_idx].page), 0,
-			STRIPE_SECTORS << 9);
+		set_bit(R5_WantZeroFill, &sh->dev[sh->pd_idx].flags);
 		set_bit(R5_Discard, &sh->dev[sh->pd_idx].flags);
+		set_bit(R5_WantZeroFill, &sh->dev[sh->qd_idx].flags);
 		set_bit(R5_Discard, &sh->dev[sh->qd_idx].flags);
 		ops_complete_reconstruct(sh);
 		return;
 	}
 
+	for (i = 0; i < sh->disks; i++) {
+		if (sh->pd_idx == i || sh->qd_idx == i) {
+			clear_bit(R5_WantZeroFill, &sh->dev[i].flags);
+			continue;
+		}
+		if (test_and_clear_bit(R5_WantZeroFill, &sh->dev[i].flags))
+			memset(page_address(sh->dev[i].page), 0, STRIPE_SIZE);
+	}
+
 	count = set_syndrome_sources(blocks, sh);
 
 	atomic_inc(&sh->count);
@@ -1304,8 +1340,13 @@ static void ops_run_check_p(struct strip
 	xor_dest = sh->dev[pd_idx].page;
 	xor_srcs[count++] = xor_dest;
 	for (i = disks; i--; ) {
-		if (i == pd_idx || i == qd_idx)
+		if (i != qd_idx &&
+		   test_and_clear_bit(R5_WantZeroFill, &sh->dev[i].flags))
+			memset(page_address(sh->dev[i].page), 0, STRIPE_SIZE);
+		if (i == pd_idx || i == qd_idx) {
+			clear_bit(R5_WantZeroFill, &sh->dev[i].flags);
 			continue;
+		}
 		xor_srcs[count++] = sh->dev[i].page;
 	}
 
@@ -1323,11 +1364,20 @@ static void ops_run_check_pq(struct stri
 {
 	struct page **srcs = percpu->scribble;
 	struct async_submit_ctl submit;
-	int count;
+	int count, i;
 
 	pr_debug("%s: stripe %llu checkp: %d\n", __func__,
 		(unsigned long long)sh->sector, checkp);
 
+	for (i = 0; i < sh->disks; i++) {
+		if (sh->pd_idx == i || sh->qd_idx == i) {
+			clear_bit(R5_WantZeroFill, &sh->dev[i].flags);
+			continue;
+		}
+		if (test_and_clear_bit(R5_WantZeroFill, &sh->dev[i].flags))
+			memset(page_address(sh->dev[i].page), 0, STRIPE_SIZE);
+	}
+
 	count = set_syndrome_sources(srcs, sh);
 	if (!checkp)
 		srcs[count] = NULL;
@@ -3134,6 +3184,9 @@ static void handle_stripe_expansion(stru
 				release_stripe(sh2);
 				continue;
 			}
+			if (test_and_clear_bit(R5_WantZeroFill, &sh->dev[i].flags))
+				memset(page_address(sh->dev[i].page),
+					0, STRIPE_SIZE);
 
 			/* place all the copies on one channel */
 			init_async_submit(&submit, 0, tx, NULL, NULL, NULL);
Index: linux/drivers/md/raid5.h
===================================================================
--- linux.orig/drivers/md/raid5.h	2012-04-17 14:16:50.206075090 +0800
+++ linux/drivers/md/raid5.h	2012-04-17 16:07:52.426045417 +0800
@@ -296,6 +296,7 @@ enum r5dev_flags {
 			 * data in, and now is a good time to write it out.
 			 */
 	R5_Discard,	/* Discard the stripe */
+	R5_WantZeroFill, /* should be zero filled before read */
 };
 
 /*


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2012-05-09  3:12 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-04-17  8:35 [RFC 0/2] raid5 trim support Shaohua Li
2012-04-17  8:35 ` [RFC 1/2] MD: " Shaohua Li
2012-04-17 14:46   ` Dan Williams
2012-04-17 15:07     ` Shaohua Li
2012-04-17 18:16       ` Dan Williams
2012-04-17 20:26     ` NeilBrown
2012-04-18  0:58       ` Shaohua Li
2012-04-18  4:48         ` NeilBrown
2012-04-18  5:30           ` Shaohua Li
2012-04-18  5:57             ` NeilBrown
2012-04-18  6:34               ` Shaohua Li
2012-04-25  3:43                 ` Shaohua Li
2012-05-08 10:16                   ` Shaohua Li
2012-05-08 15:52                     ` Dan Williams
2012-05-09  3:12                       ` Shaohua Li
2012-05-08 20:17                     ` NeilBrown
2012-04-17  8:35 ` [RFC 2/2] MD: raid5 avoid unnecessary zero page for trim Shaohua Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).