* [PATCH 1/2] allow md_flush_request to take NULL bio
@ 2016-02-11 0:53 Song Liu
2016-02-11 0:53 ` [PATCH 2/2] enable bypass raid5 journal for full stripe writes Song Liu
2016-02-11 3:46 ` [PATCH 1/2] allow md_flush_request to take NULL bio NeilBrown
From: Song Liu @ 2016-02-11 0:53 UTC (permalink / raw)
To: linux-raid; +Cc: dan.j.williams, shli, hch, kernel-team, Song Liu
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
drivers/md/md.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index e55e6cf..2997104 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -406,6 +406,8 @@ static void md_submit_flush_data(struct work_struct *ws)
struct mddev *mddev = container_of(ws, struct mddev, flush_work);
struct bio *bio = mddev->flush_bio;
+ if (!bio)
+ goto out;
if (bio->bi_iter.bi_size == 0)
/* an empty barrier - all done */
bio_endio(bio);
@@ -415,6 +417,7 @@ static void md_submit_flush_data(struct work_struct *ws)
}
mddev->flush_bio = NULL;
+out:
wake_up(&mddev->sb_wait);
}
--
2.4.6
* [PATCH 2/2] enable bypass raid5 journal for full stripe writes
2016-02-11 0:53 [PATCH 1/2] allow md_flush_request to take NULL bio Song Liu
@ 2016-02-11 0:53 ` Song Liu
2016-02-11 4:07 ` NeilBrown
2016-02-23 16:34 ` Alireza Haghdoost
2016-02-11 3:46 ` [PATCH 1/2] allow md_flush_request to take NULL bio NeilBrown
From: Song Liu @ 2016-02-11 0:53 UTC (permalink / raw)
To: linux-raid; +Cc: dan.j.williams, shli, hch, kernel-team, Song Liu
Summary:
Resending the patch to see whether we can get another chance...
When testing current SATA SSDs as the journal device, we have
seen two challenges: throughput of long sequential writes, and
SSD lifetime.

To reduce the wear on the SSD, we tested bypassing the journal
for full stripe writes. We understand that bypassing the journal
will re-introduce the write hole to the md layer. However, with
a well-designed application and file system, such write holes
should not result in any data loss.

Our test systems have two RAID-6 arrays per server, with 15 HDDs
per array. The two arrays share one SSD as the journal (two
partitions). Btrfs is created on both arrays.

For sequential write benchmarks, we observe a significant
performance gain (250MB/s per volume vs. 150MB/s) from
bypassing the journal for full stripes.

We also performed power cycle tests on these systems while
running a write workload. Over more than 50 power cycles,
we have seen zero data loss.
To configure the bypass feature:
echo 1 > /sys/block/mdX/md/r5l_bypass_full_stripe
and
echo 0 > /sys/block/mdX/md/r5l_bypass_full_stripe
For file system integrity, the code does not bypass any write
with REQ_FUA.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
drivers/md/raid5-cache.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++-
drivers/md/raid5.c | 1 +
drivers/md/raid5.h | 2 ++
3 files changed, 77 insertions(+), 1 deletion(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 9531f5f..9ec0878 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -97,6 +97,8 @@ struct r5l_log {
bool need_cache_flush;
bool in_teardown;
+
+ int bypass_full_stripe;
};
/*
@@ -446,6 +448,7 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
int reserve;
int i;
int ret = 0;
+ int fua = 0;
if (!log)
return -EAGAIN;
@@ -462,6 +465,8 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
continue;
+ if (test_bit(R5_WantFUA, &sh->dev[i].flags))
+ fua = 1;
write_disks++;
/* checksum is already calculated in last run */
if (test_bit(STRIPE_LOG_TRAPPED, &sh->state))
@@ -471,6 +476,10 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
addr, PAGE_SIZE);
kunmap_atomic(addr);
}
+
+ if (log->bypass_full_stripe && (write_disks == sh->disks) && (!fua))
+ return -EAGAIN; /* bypass journal device */
+
parity_pages = 1 + !!(sh->qd_idx >= 0);
data_pages = write_disks - parity_pages;
@@ -524,7 +533,7 @@ void r5l_write_stripe_run(struct r5l_log *log)
int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio)
{
- if (!log)
+ if (!log || log->bypass_full_stripe)
return -ENODEV;
/*
* we flush log disk cache first, then write stripe data to raid disks.
@@ -1186,6 +1195,69 @@ ioerr:
return ret;
}
+static ssize_t
+r5l_show_bypass_full_stripe(struct mddev *mddev, char *page)
+{
+ struct r5conf *conf;
+ int ret = 0;
+
+ spin_lock(&mddev->lock);
+ conf = mddev->private;
+ if (conf) {
+ if (conf->log)
+ ret = sprintf(page, "%d\n",
+ conf->log->bypass_full_stripe);
+ else
+ ret = sprintf(page, "n/a\n");
+ }
+ spin_unlock(&mddev->lock);
+ return ret;
+}
+
+static ssize_t
+r5l_store_bypass_full_stripe(struct mddev *mddev, const char *page, size_t len)
+{
+ struct r5conf *conf;
+ int err = 0;
+ unsigned int val;
+
+ if (len >= PAGE_SIZE)
+ return -EINVAL;
+ if (kstrtouint(page, 10, &val))
+ return -EINVAL;
+
+ if (val > 1)
+ val = 1;
+
+ err = mddev_lock(mddev);
+ if (err)
+ return err;
+ mddev_suspend(mddev);
+
+ /*
+ * We do not flush when journal is on, add extra flush for previous writes
+ */
+ if (val == 0)
+ md_flush_request(mddev, NULL);
+
+ conf = mddev->private;
+ if (conf) {
+ if (conf->log) {
+ conf->log->bypass_full_stripe = val;
+ } else
+ err = -EINVAL;
+ } else
+ err = -ENODEV;
+ mddev_resume(mddev);
+ mddev_unlock(mddev);
+ return err ?: len;
+}
+
+struct md_sysfs_entry
+r5l_bypass_full_stripe = __ATTR(r5l_bypass_full_stripe, S_IRUGO | S_IWUSR,
+ r5l_show_bypass_full_stripe,
+ r5l_store_bypass_full_stripe);
+
int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
{
struct r5l_log *log;
@@ -1227,6 +1299,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
if (!log->meta_pool)
goto out_mempool;
+ log->bypass_full_stripe = 0;
log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
log->rdev->mddev, "reclaim");
if (!log->reclaim_thread)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b4f02c9..bdf30b1 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6216,6 +6216,7 @@ static struct attribute *raid5_attrs[] = {
&raid5_group_thread_cnt.attr,
&raid5_skip_copy.attr,
&raid5_rmw_level.attr,
+ &r5l_bypass_full_stripe.attr,
NULL,
};
static struct attribute_group raid5_attrs_group = {
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index a415e1c..6b39a07 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -633,4 +633,6 @@ extern void r5l_stripe_write_finished(struct stripe_head *sh);
extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
extern void r5l_quiesce(struct r5l_log *log, int state);
extern bool r5l_log_disk_error(struct r5conf *conf);
+
+extern struct md_sysfs_entry r5l_bypass_full_stripe;
#endif
--
2.4.6
* Re: [PATCH 1/2] allow md_flush_request to take NULL bio
2016-02-11 0:53 [PATCH 1/2] allow md_flush_request to take NULL bio Song Liu
2016-02-11 0:53 ` [PATCH 2/2] enable bypass raid5 journal for full stripe writes Song Liu
@ 2016-02-11 3:46 ` NeilBrown
From: NeilBrown @ 2016-02-11 3:46 UTC (permalink / raw)
To: linux-raid; +Cc: dan.j.williams, shli, hch, kernel-team, Song Liu
On Thu, Feb 11 2016, Song Liu wrote:
> Signed-off-by: Song Liu <songliubraving@fb.com>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
> drivers/md/md.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index e55e6cf..2997104 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -406,6 +406,8 @@ static void md_submit_flush_data(struct work_struct *ws)
> struct mddev *mddev = container_of(ws, struct mddev, flush_work);
> struct bio *bio = mddev->flush_bio;
>
> + if (!bio)
> + goto out;
> if (bio->bi_iter.bi_size == 0)
> /* an empty barrier - all done */
> bio_endio(bio);
> @@ -415,6 +417,7 @@ static void md_submit_flush_data(struct work_struct *ws)
> }
>
> mddev->flush_bio = NULL;
> +out:
> wake_up(&mddev->sb_wait);
> }
>
I don't think this is safe.
->flush_bio is used for two different purposes.
One is to carry the bio so it can be passed to ->make_request after the
flush.
The other is to ensure exclusive access to mddev->flush_work.
The first thing md_flush_request() does is wait for ->flush_bio to be
NULL. Then it knows that ->flush_work is no longer in use.
With your change you could get md_flush_request trying to INIT_WORK
->flush_work while it is already queued.
Not good.
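For reference, here is a simplified sketch of the md_flush_request()
entry path as I read it (paraphrased from memory, not a verbatim copy
of md.c; helper names like submit_flushes and md_wq are the ones I
recall):

    void md_flush_request(struct mddev *mddev, struct bio *bio)
    {
            spin_lock_irq(&mddev->lock);
            /* a non-NULL ->flush_bio marks ->flush_work as busy */
            wait_event_lock_irq(mddev->sb_wait,
                                !mddev->flush_bio,
                                mddev->lock);
            mddev->flush_bio = bio;
            spin_unlock_irq(&mddev->lock);

            /* only safe because the wait above guaranteed exclusivity */
            INIT_WORK(&mddev->flush_work, submit_flushes);
            queue_work(md_wq, &mddev->flush_work);
    }

With a NULL bio, ->flush_bio stays NULL even while flush_work is
queued, so a second caller sails straight past the wait_event and
re-initialises work that is still pending.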
NeilBrown
* Re: [PATCH 2/2] enable bypass raid5 journal for full stripe writes
2016-02-11 0:53 ` [PATCH 2/2] enable bypass raid5 journal for full stripe writes Song Liu
@ 2016-02-11 4:07 ` NeilBrown
2016-02-23 16:34 ` Alireza Haghdoost
From: NeilBrown @ 2016-02-11 4:07 UTC (permalink / raw)
To: linux-raid; +Cc: dan.j.williams, shli, hch, kernel-team, Song Liu
On Thu, Feb 11 2016, Song Liu wrote:
> Summary:
>
> Resending the patch to see whether we can get another chance...
>
> When testing current SATA SSDs as the journal device, we have
> seen two challenges: throughput of long sequential writes, and
> SSD lifetime.
>
> To reduce the wear on the SSD, we tested bypassing the journal
> for full stripe writes. We understand that bypassing the journal
> will re-introduce the write hole to the md layer. However, with
> a well-designed application and file system, such write holes
> should not result in any data loss.
"should not" isn't very encouraging.
I've always thought that the "write hole" was more of a theoretical than
a practical concern. But if we go to the trouble of preventing it, I
really think we need "will not".
I don't suppose we could ask the filesystem to tag certain requests with
a flag which say "If the system crashes before this write completes, I
promise to re-write the entire region (possibly with different data)
before trusting any data I read from here".
That is what you really need.
But it wouldn't help for in-place overwrites.
>
> Our test systems have two RAID-6 arrays per server, with 15 HDDs
> per array. The two arrays share one SSD as the journal (two
> partitions). Btrfs is created on both arrays.
>
> For sequential write benchmarks, we observe a significant
> performance gain (250MB/s per volume vs. 150MB/s) from
> bypassing the journal for full stripes.
You need a faster SSD :-)
>
> We also performed power cycle tests on these systems while
> running a write workload. Over more than 50 power cycles,
> we have seen zero data loss.
That doesn't fill me with any confidence.
To get even close to a credible test, you would need to instrument the
kernel so that at some (random) point it would never send another IO
request until a reboot, and so that this random point was always
somewhere in the middle of submitting a set of writes to a stripe.
Instrument a kernel to do this, demonstrate that the write-hole causes
problems, then demonstrate that it doesn't with the various patches
applied. Then do this across a range of tests (write-in-place, append,
etc) on a range of filesystems.
Testing for data integrity is *hard*.
And as you say, it may be up to the application to ensure it doesn't use
corrupt data (like a partial write). Testing that is not really
possible.
If you were to make it possible to do this, you would need to be very
explicit about what sort of write-hole corruption could still occur, so
that users could make an informed decision.
(I assume your test had a degraded array... you didn't explicitly say
so, but the test would be meaningless on a non-degraded array so you
probably did).
>
> +static ssize_t
> +r5l_show_bypass_full_stripe(struct mddev *mddev, char *page)
> +{
> + struct r5conf *conf;
> + int ret = 0;
> +
> + spin_lock(&mddev->lock);
> + conf = mddev->private;
> + if (conf) {
> + if (conf->log)
> + ret = sprintf(page, "%d\n",
> + conf->log->bypass_full_stripe);
> + else
> + ret = sprintf(page, "n/a\n");
> + }
Given that this is a boolean, I would much prefer it was consistently a
boolean value, never "n/a".
Not sure whether 0/1, Y/N, or T/F is best, but one of those.
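For example, a hypothetical reworking of the show function above that
always reports a plain 0/1 (reading 0 whenever there is no array or no
journal):

    static ssize_t
    r5l_show_bypass_full_stripe(struct mddev *mddev, char *page)
    {
            struct r5conf *conf;
            int val = 0;

            spin_lock(&mddev->lock);
            conf = mddev->private;
            /* no array or no journal simply reads as 0 (bypass disabled) */
            if (conf && conf->log)
                    val = conf->log->bypass_full_stripe;
            spin_unlock(&mddev->lock);

            return sprintf(page, "%d\n", val);
    }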
Otherwise the code looks sensible (assuming the idea is sensible, which
I'm not 100% convinced of).
Thanks,
NeilBrown
* Re: [PATCH 2/2] enable bypass raid5 journal for full stripe writes
2016-02-11 0:53 ` [PATCH 2/2] enable bypass raid5 journal for full stripe writes Song Liu
2016-02-11 4:07 ` NeilBrown
@ 2016-02-23 16:34 ` Alireza Haghdoost
From: Alireza Haghdoost @ 2016-02-23 16:34 UTC (permalink / raw)
To: Song Liu
Cc: Linux RAID, Dan Williams, Shaohua Li, hch, kernel-team,
Tony Floeder
On Wed, Feb 10, 2016 at 6:53 PM, Song Liu <songliubraving@fb.com> wrote:
> Summary:
>
> Resending the patch to see whether we can get another chance...
>
> When testing current SATA SSDs as the journal device, we have
> seen two challenges: throughput of long sequential writes, and
> SSD lifetime.
>
> To reduce the wear on the SSD, we tested bypassing the journal
> for full stripe writes. We understand that bypassing the journal
> will re-introduce the write hole to the md layer. However, with
> a well-designed application and file system, such write holes
> should not result in any data loss.
To me, the probability of data loss during a full-stripe write is
higher than during a partial-stripe write. I understand your motivation
for doing this; however, as Neil mentioned, this trade-off and your
assumption about a "well-designed application and file system" put a
question mark over the general usage of the MD journal.
> Our test systems have two RAID-6 arrays per server, with 15 HDDs
> per array. The two arrays share one SSD as the journal (two
> partitions). Btrfs is created on both arrays.
>
> For sequential write benchmarks, we observe a significant
> performance gain (250MB/s per volume vs. 150MB/s) from
> bypassing the journal for full stripes.
>
> We also performed power cycle tests on these systems while
> running a write workload. Over more than 50 power cycles,
> we have seen zero data loss.
>
Is it possible to share more details about your power cycle test
procedure and data loss detection method?
> To configure the bypass feature:
>
> echo 1 > /sys/block/mdX/md/r5l_bypass_full_stripe
>
> and
>
> echo 0 > /sys/block/mdX/md/r5l_bypass_full_stripe
>
> For file system integrity, the code does not bypass any write
> with REQ_FUA.
>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> Signed-off-by: Shaohua Li <shli@fb.com>