* raid5-cache I/O path improvements
From: Christoph Hellwig @ 2015-09-07 5:20 UTC (permalink / raw)
To: Shaohua Li, neilb; +Cc: linux-raid, Kernel-team, dan.j.williams
Hi Shaohua, hi Neil,
this series contains a few updates to the raid5-cache feature.
The first patch just ports it to the post-4.2 block layer. As part of that
I noticed that it currently doesn't handle I/O errors - fixes for that will
follow.
The second and third patch simplify the I/O unit state machine and reduce
latency and memory usage for the I/O units. The remainder are just a couple
of cleanups in this area that I stumbled upon.
* [PATCH 01/10] raid5-cache: port to 4.3-rc
From: Christoph Hellwig @ 2015-09-07 5:20 UTC (permalink / raw)
To: Shaohua Li, neilb; +Cc: linux-raid, Kernel-team, dan.j.williams
Port for changes in the block layer: bio endio handlers no longer get
passed a separate error argument, and bio_get_nr_vecs is gone.
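For reference, a minimal sketch of the interface change being ported to (illustrative only, not part of the patch; the handler names are made up): in 4.3-rc the completion handler takes only the bio and reads the status from bio->bi_error, and bio_kmalloc is sized with BIO_MAX_PAGES now that bio_get_nr_vecs is gone.

#include <linux/bio.h>
#include <linux/printk.h>

/* pre-4.3 style: the error is passed as a separate argument */
static void example_log_endio_old(struct bio *bio, int error)
{
        if (error)
                pr_err("log write failed: %d\n", error);
        bio_put(bio);
}

/* 4.3-rc style: the error is read from bio->bi_error instead */
static void example_log_endio_new(struct bio *bio)
{
        if (bio->bi_error)
                pr_err("log write failed: %d\n", bio->bi_error);
        bio_put(bio);
}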
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/md/raid5-cache.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 785749b1..c345479 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -222,7 +222,8 @@ static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
io->state = state;
}
-static inline void r5l_log_endio(struct bio *bio, int error)
+/* XXX: totally ignores I/O errors */
+static void r5l_log_endio(struct bio *bio)
{
struct r5l_io_unit *io = bio->bi_private;
struct r5l_log *log = io->log;
@@ -288,8 +289,7 @@ static struct r5l_io_unit *r5l_new_meta(struct r5l_log *log)
io->meta_offset = sizeof(struct r5l_meta_block);
io->seq = log->seq;
- bio = bio_kmalloc(GFP_NOIO | __GFP_NOFAIL,
- bio_get_nr_vecs(log->rdev->bdev));
+ bio = bio_kmalloc(GFP_NOIO | __GFP_NOFAIL, BIO_MAX_PAGES);
io->current_bio = bio;
bio->bi_rw = WRITE;
bio->bi_bdev = log->rdev->bdev;
@@ -358,8 +358,7 @@ static void r5l_append_payload_page(struct r5l_log *log, struct page *page)
alloc_bio:
if (!io->current_bio) {
struct bio *bio;
- bio = bio_kmalloc(GFP_NOIO | __GFP_NOFAIL,
- bio_get_nr_vecs(log->rdev->bdev));
+ bio = bio_kmalloc(GFP_NOIO | __GFP_NOFAIL, BIO_MAX_PAGES);
bio->bi_rw = WRITE;
bio->bi_bdev = log->rdev->bdev;
bio->bi_iter.bi_sector = log->log_start;
@@ -518,7 +517,7 @@ int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio)
* don't need to flush again
* */
if (bio->bi_iter.bi_size == 0) {
- bio_endio(bio, 0);
+ bio_endio(bio);
return 0;
}
bio->bi_rw &= ~REQ_FLUSH;
@@ -581,7 +580,7 @@ void r5l_stripe_write_finished(struct stripe_head *sh)
__r5l_stripe_write_finished(io);
}
-static void r5l_log_flush_endio(struct bio *bio, int error)
+static void r5l_log_flush_endio(struct bio *bio)
{
struct r5l_log *log = container_of(bio, struct r5l_log,
flush_bio);
--
1.9.1
* [PATCH 02/10] raid5-cache: free I/O units earlier
From: Christoph Hellwig @ 2015-09-07 5:20 UTC (permalink / raw)
To: Shaohua Li, neilb; +Cc: linux-raid, Kernel-team, dan.j.williams
There is no good reason to keep the I/O unit structures around after the
stripe has been written back to the RAID array. The only information
we need is the log sequence number, and the checkpoint offset of the
highest successful writeback. Store those in the log structure, and
free the IO units from __r5l_stripe_write_finished.
Besides simplifying the code, this also avoids having to keep the allocation
for the I/O unit around for a potentially long time as superblock updates
that checkpoint the log do not happen very often.
As a side effect this also fixes the previously incorrect calculation of
'free' in r5l_do_reclaim: the old code took the last I/O unit, which
isn't checkpointed yet, into account.
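For context, the reclaimable space above is just the ring distance from the last checkpoint to the new next_checkpoint. A sketch of the circular-log arithmetic this relies on (an assumption mirroring the driver's r5l_ring_distance helper, which is not shown in this hunk):

#include <linux/types.h>

/*
 * How far the log has advanced from 'start' to 'end' on a circular
 * device of 'dev_size' sectors, wrapping at the end of the device.
 */
static sector_t ring_distance(sector_t dev_size, sector_t start, sector_t end)
{
        if (end >= start)
                return end - start;
        return end + dev_size - start;
}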
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/md/raid5-cache.c | 145 ++++++++++++++++++-----------------------------
1 file changed, 55 insertions(+), 90 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index c345479..803bcc6 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -51,6 +51,9 @@ struct r5l_log {
sector_t log_start; /* log head. where new data appends */
u64 seq; /* log head sequence */
+ sector_t next_checkpoint;
+ u64 next_cp_seq;
+
struct mutex io_mutex;
struct r5l_io_unit *current_io; /* current io_unit accepting new data */
@@ -65,10 +68,6 @@ struct r5l_log {
* cache flush */
struct list_head flushed_ios; /* io_units which settle down in log disk */
struct bio flush_bio;
- struct list_head stripe_end_ios; /* io_units which have been
- * completely written to the RAID *
- * but have not yet been considered *
- * for updating super */
struct kmem_cache *io_kc;
@@ -185,35 +184,6 @@ static void r5l_move_io_unit_list(struct list_head *from, struct list_head *to,
}
}
-/*
- * We don't want too many io_units reside in stripe_end_ios list, which will
- * waste a lot of memory. So we try to remove some. But we must keep at least 2
- * io_units. The superblock must point to a valid meta, if it's the last meta,
- * recovery can scan less
- * */
-static void r5l_compress_stripe_end_list(struct r5l_log *log)
-{
- struct r5l_io_unit *first, *last, *io;
-
- first = list_first_entry(&log->stripe_end_ios,
- struct r5l_io_unit, log_sibling);
- last = list_last_entry(&log->stripe_end_ios,
- struct r5l_io_unit, log_sibling);
- if (first == last)
- return;
- list_del(&first->log_sibling);
- list_del(&last->log_sibling);
- while (!list_empty(&log->stripe_end_ios)) {
- io = list_first_entry(&log->stripe_end_ios,
- struct r5l_io_unit, log_sibling);
- list_del(&io->log_sibling);
- first->log_end = io->log_end;
- r5l_free_io_unit(log, io);
- }
- list_add_tail(&first->log_sibling, &log->stripe_end_ios);
- list_add_tail(&last->log_sibling, &log->stripe_end_ios);
-}
-
static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
enum r5l_io_unit_state state)
{
@@ -540,31 +510,53 @@ static void r5l_run_no_space_stripes(struct r5l_log *log)
spin_unlock(&log->no_space_stripes_lock);
}
+static sector_t r5l_reclaimable_space(struct r5l_log *log)
+{
+ return r5l_ring_distance(log, log->last_checkpoint,
+ log->next_checkpoint);
+}
+
+static bool r5l_complete_flushed_ios(struct r5l_log *log)
+{
+ struct r5l_io_unit *io, *next;
+ bool found = false;
+
+ assert_spin_locked(&log->io_list_lock);
+
+ list_for_each_entry_safe(io, next, &log->flushed_ios, log_sibling) {
+ /* don't change list order */
+ if (io->state < IO_UNIT_STRIPE_END)
+ break;
+
+ log->next_checkpoint = io->log_start;
+ log->next_cp_seq = io->seq;
+
+ list_del(&io->log_sibling);
+ r5l_free_io_unit(log, io);
+
+ found = true;
+ }
+
+
+ return found;
+}
+
static void __r5l_stripe_write_finished(struct r5l_io_unit *io)
{
struct r5l_log *log = io->log;
- struct r5l_io_unit *last;
- sector_t reclaimable_space;
unsigned long flags;
spin_lock_irqsave(&log->io_list_lock, flags);
__r5l_set_io_unit_state(io, IO_UNIT_STRIPE_END);
- /* might move 0 entry */
- r5l_move_io_unit_list(&log->flushed_ios, &log->stripe_end_ios,
- IO_UNIT_STRIPE_END);
- if (list_empty(&log->stripe_end_ios)) {
+
+ if (!r5l_complete_flushed_ios(log)) {
spin_unlock_irqrestore(&log->io_list_lock, flags);
return;
}
- last = list_last_entry(&log->stripe_end_ios,
- struct r5l_io_unit, log_sibling);
- reclaimable_space = r5l_ring_distance(log, log->last_checkpoint,
- last->log_end);
- if (reclaimable_space >= log->max_free_space)
+ if (r5l_reclaimable_space(log) > log->max_free_space)
r5l_wake_reclaim(log, 0);
- r5l_compress_stripe_end_list(log);
spin_unlock_irqrestore(&log->io_list_lock, flags);
wake_up(&log->iounit_wait);
}
@@ -640,13 +632,6 @@ void r5l_flush_stripe_to_raid(struct r5l_log *log)
submit_bio(WRITE_FLUSH, &log->flush_bio);
}
-static void r5l_kick_io_unit(struct r5l_log *log)
-{
- md_wakeup_thread(log->rdev->mddev->thread);
- wait_event_lock_irq(log->iounit_wait, !list_empty(&log->stripe_end_ios),
- log->io_list_lock);
-}
-
static void r5l_write_super(struct r5l_log *log, sector_t cp);
static void r5l_write_super_and_discard_space(struct r5l_log *log,
sector_t end)
@@ -678,10 +663,10 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
static void r5l_do_reclaim(struct r5l_log *log)
{
- struct r5l_io_unit *io, *last;
- LIST_HEAD(list);
- sector_t free = 0;
sector_t reclaim_target = xchg(&log->reclaim_target, 0);
+ sector_t reclaimable;
+ sector_t next_checkpoint;
+ u64 next_cp_seq;
spin_lock_irq(&log->io_list_lock);
/*
@@ -690,60 +675,41 @@ static void r5l_do_reclaim(struct r5l_log *log)
* shouldn't reuse space of an unreclaimable io_unit
* */
while (1) {
- struct list_head *target_list = NULL;
-
- while (!list_empty(&log->stripe_end_ios)) {
- io = list_first_entry(&log->stripe_end_ios,
- struct r5l_io_unit, log_sibling);
- list_move_tail(&io->log_sibling, &list);
- free += r5l_ring_distance(log, io->log_start,
- io->log_end);
- }
-
- if (free >= reclaim_target ||
+ reclaimable = r5l_reclaimable_space(log);
+ if (reclaimable >= reclaim_target ||
(list_empty(&log->running_ios) &&
list_empty(&log->io_end_ios) &&
list_empty(&log->flushing_ios) &&
list_empty(&log->flushed_ios)))
break;
- /* Below waiting mostly happens when we shutdown the raid */
- if (!list_empty(&log->flushed_ios))
- target_list = &log->flushed_ios;
- else if (!list_empty(&log->flushing_ios))
- target_list = &log->flushing_ios;
- else if (!list_empty(&log->io_end_ios))
- target_list = &log->io_end_ios;
- else if (!list_empty(&log->running_ios))
- target_list = &log->running_ios;
-
- r5l_kick_io_unit(log);
+ md_wakeup_thread(log->rdev->mddev->thread);
+ wait_event_lock_irq(log->iounit_wait,
+ r5l_reclaimable_space(log) > reclaimable,
+ log->io_list_lock);
}
+
+ next_checkpoint = log->next_checkpoint;
+ next_cp_seq = log->next_cp_seq;
spin_unlock_irq(&log->io_list_lock);
- if (list_empty(&list))
+ BUG_ON(reclaimable < 0);
+ if (reclaimable == 0)
return;
- /* super always point to last valid meta */
- last = list_last_entry(&list, struct r5l_io_unit, log_sibling);
/*
* write_super will flush cache of each raid disk. We must write super
* here, because the log area might be reused soon and we don't want to
* confuse recovery
* */
- r5l_write_super_and_discard_space(log, last->log_start);
+ r5l_write_super_and_discard_space(log, next_checkpoint);
mutex_lock(&log->io_mutex);
- log->last_checkpoint = last->log_start;
- log->last_cp_seq = last->seq;
+ log->last_checkpoint = next_checkpoint;
+ log->last_cp_seq = next_cp_seq;
mutex_unlock(&log->io_mutex);
- r5l_run_no_space_stripes(log);
- while (!list_empty(&list)) {
- io = list_first_entry(&list, struct r5l_io_unit, log_sibling);
- list_del(&io->log_sibling);
- r5l_free_io_unit(log, io);
- }
+ r5l_run_no_space_stripes(log);
}
static void r5l_reclaim_thread(struct md_thread *thread)
@@ -1105,7 +1071,6 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
spin_lock_init(&log->io_list_lock);
INIT_LIST_HEAD(&log->running_ios);
INIT_LIST_HEAD(&log->io_end_ios);
- INIT_LIST_HEAD(&log->stripe_end_ios);
INIT_LIST_HEAD(&log->flushing_ios);
INIT_LIST_HEAD(&log->flushed_ios);
bio_init(&log->flush_bio);
--
1.9.1
* [PATCH 03/10] raid5-cache: use FUA writes for the log
From: Christoph Hellwig @ 2015-09-07 5:20 UTC (permalink / raw)
To: Shaohua Li, neilb; +Cc: linux-raid, Kernel-team, dan.j.williams
If we submit writes with the FUA bit for the log they are guaranteed to
be on stable storage once the endio callback is called. This allows us
to simplify the I/O unit state machine and decrease latencies a lot
when the device supports FUA. If the device doesn't support FUA the
block layer has an efficient state machine to emulate it.
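To illustrate the trade-off, a rough sketch with made-up helper names (not code from this patch): with FUA a single submission gives durability once the completion handler runs, while without FUA the same guarantee takes a write followed by an explicit cache flush, which is what the block layer's flush machinery emulates when REQ_FUA is used on a device that lacks FUA.

#include <linux/bio.h>
#include <linux/blkdev.h>

/* FUA: data is on stable storage once the bio's endio callback runs. */
static void log_write_fua(struct bio *bio)
{
        submit_bio(WRITE | REQ_FUA, bio);
}

/* Without FUA: write, wait for it, then flush the device write cache. */
static int log_write_then_flush(struct block_device *bdev, struct bio *bio)
{
        int err = submit_bio_wait(WRITE, bio);

        if (err)
                return err;
        return blkdev_issue_flush(bdev, GFP_NOIO, NULL);
}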
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/md/raid5-cache.c | 133 +++++++++++++----------------------------------
drivers/md/raid5.c | 9 +---
drivers/md/raid5.h | 1 -
3 files changed, 37 insertions(+), 106 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 803bcc6..1e54249 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -61,13 +61,8 @@ struct r5l_log {
struct list_head running_ios; /* io_units which are still running,
* and have not yet been completely
* written to the log */
- struct list_head io_end_ios; /* io_units which have been completely
- * written to the log but not yet written
- * to the RAID */
- struct list_head flushing_ios; /* io_units which are waiting for log
- * cache flush */
- struct list_head flushed_ios; /* io_units which settle down in log disk */
- struct bio flush_bio;
+ struct list_head finished_ios; /* io_units already written to the
+ * log disk */
struct kmem_cache *io_kc;
@@ -169,21 +164,6 @@ static void r5l_free_io_unit(struct r5l_log *log, struct r5l_io_unit *io)
kmem_cache_free(log->io_kc, io);
}
-static void r5l_move_io_unit_list(struct list_head *from, struct list_head *to,
- enum r5l_io_unit_state state)
-{
- struct r5l_io_unit *io;
-
- while (!list_empty(from)) {
- io = list_first_entry(from, struct r5l_io_unit, log_sibling);
- /* don't change list order */
- if (io->state >= state)
- list_move_tail(&io->log_sibling, to);
- else
- break;
- }
-}
-
static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
enum r5l_io_unit_state state)
{
@@ -192,6 +172,33 @@ static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
io->state = state;
}
+static void r5l_io_run_stripes(struct r5l_io_unit *io)
+{
+ struct stripe_head *sh, *next;
+
+ list_for_each_entry_safe(sh, next, &io->stripe_list, log_list) {
+ list_del_init(&sh->log_list);
+ set_bit(STRIPE_HANDLE, &sh->state);
+ raid5_release_stripe(sh);
+ }
+}
+
+static void r5l_log_run_stripes(struct r5l_log *log)
+{
+ struct r5l_io_unit *io, *next;
+
+ assert_spin_locked(&log->io_list_lock);
+
+ list_for_each_entry_safe(io, next, &log->running_ios, log_sibling) {
+ /* don't change list order */
+ if (io->state < IO_UNIT_IO_END)
+ break;
+
+ list_move_tail(&io->log_sibling, &log->finished_ios);
+ r5l_io_run_stripes(io);
+ }
+}
+
/* XXX: totally ignores I/O errors */
static void r5l_log_endio(struct bio *bio)
{
@@ -206,11 +213,8 @@ static void r5l_log_endio(struct bio *bio)
spin_lock_irqsave(&log->io_list_lock, flags);
__r5l_set_io_unit_state(io, IO_UNIT_IO_END);
- r5l_move_io_unit_list(&log->running_ios, &log->io_end_ios,
- IO_UNIT_IO_END);
+ r5l_log_run_stripes(log);
spin_unlock_irqrestore(&log->io_list_lock, flags);
-
- md_wakeup_thread(log->rdev->mddev->thread);
}
static void r5l_submit_current_io(struct r5l_log *log)
@@ -237,7 +241,7 @@ static void r5l_submit_current_io(struct r5l_log *log)
while ((bio = bio_list_pop(&io->bios))) {
/* all IO must start from rdev->data_offset */
bio->bi_iter.bi_sector += log->rdev->data_offset;
- submit_bio(WRITE, bio);
+ submit_bio(WRITE | REQ_FUA, bio);
}
}
@@ -516,14 +520,14 @@ static sector_t r5l_reclaimable_space(struct r5l_log *log)
log->next_checkpoint);
}
-static bool r5l_complete_flushed_ios(struct r5l_log *log)
+static bool r5l_complete_finished_ios(struct r5l_log *log)
{
struct r5l_io_unit *io, *next;
bool found = false;
assert_spin_locked(&log->io_list_lock);
- list_for_each_entry_safe(io, next, &log->flushed_ios, log_sibling) {
+ list_for_each_entry_safe(io, next, &log->finished_ios, log_sibling) {
/* don't change list order */
if (io->state < IO_UNIT_STRIPE_END)
break;
@@ -549,7 +553,7 @@ static void __r5l_stripe_write_finished(struct r5l_io_unit *io)
spin_lock_irqsave(&log->io_list_lock, flags);
__r5l_set_io_unit_state(io, IO_UNIT_STRIPE_END);
- if (!r5l_complete_flushed_ios(log)) {
+ if (!r5l_complete_finished_ios(log)) {
spin_unlock_irqrestore(&log->io_list_lock, flags);
return;
}
@@ -572,66 +576,6 @@ void r5l_stripe_write_finished(struct stripe_head *sh)
__r5l_stripe_write_finished(io);
}
-static void r5l_log_flush_endio(struct bio *bio)
-{
- struct r5l_log *log = container_of(bio, struct r5l_log,
- flush_bio);
- unsigned long flags;
- struct r5l_io_unit *io;
- struct stripe_head *sh;
-
- spin_lock_irqsave(&log->io_list_lock, flags);
- list_for_each_entry(io, &log->flushing_ios, log_sibling) {
- while (!list_empty(&io->stripe_list)) {
- sh = list_first_entry(&io->stripe_list,
- struct stripe_head, log_list);
- list_del_init(&sh->log_list);
- set_bit(STRIPE_HANDLE, &sh->state);
- raid5_release_stripe(sh);
- }
- }
- list_splice_tail_init(&log->flushing_ios, &log->flushed_ios);
- spin_unlock_irqrestore(&log->io_list_lock, flags);
-}
-
-/*
- * Starting dispatch IO to raid.
- * io_unit(meta) consists of a log. There is one situation we want to avoid. A
- * broken meta in the middle of a log causes recovery can't find meta at the
- * head of log. If operations require meta at the head persistent in log, we
- * must make sure meta before it persistent in log too. A case is:
- *
- * stripe data/parity is in log, we start write stripe to raid disks. stripe
- * data/parity must be persistent in log before we do the write to raid disks.
- *
- * The solution is we restrictly maintain io_unit list order. In this case, we
- * only write stripes of an io_unit to raid disks till the io_unit is the first
- * one whose data/parity is in log.
- * */
-void r5l_flush_stripe_to_raid(struct r5l_log *log)
-{
- bool do_flush;
- if (!log)
- return;
-
- spin_lock_irq(&log->io_list_lock);
- /* flush bio is running */
- if (!list_empty(&log->flushing_ios)) {
- spin_unlock_irq(&log->io_list_lock);
- return;
- }
- list_splice_tail_init(&log->io_end_ios, &log->flushing_ios);
- do_flush = !list_empty(&log->flushing_ios);
- spin_unlock_irq(&log->io_list_lock);
-
- if (!do_flush)
- return;
- bio_reset(&log->flush_bio);
- log->flush_bio.bi_bdev = log->rdev->bdev;
- log->flush_bio.bi_end_io = r5l_log_flush_endio;
- submit_bio(WRITE_FLUSH, &log->flush_bio);
-}
-
static void r5l_write_super(struct r5l_log *log, sector_t cp);
static void r5l_write_super_and_discard_space(struct r5l_log *log,
sector_t end)
@@ -678,9 +622,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
reclaimable = r5l_reclaimable_space(log);
if (reclaimable >= reclaim_target ||
(list_empty(&log->running_ios) &&
- list_empty(&log->io_end_ios) &&
- list_empty(&log->flushing_ios) &&
- list_empty(&log->flushed_ios)))
+ list_empty(&log->finished_ios)))
break;
md_wakeup_thread(log->rdev->mddev->thread);
@@ -1070,10 +1012,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
spin_lock_init(&log->io_list_lock);
INIT_LIST_HEAD(&log->running_ios);
- INIT_LIST_HEAD(&log->io_end_ios);
- INIT_LIST_HEAD(&log->flushing_ios);
- INIT_LIST_HEAD(&log->flushed_ios);
- bio_init(&log->flush_bio);
+ INIT_LIST_HEAD(&log->finished_ios);
log->io_kc = KMEM_CACHE(r5l_io_unit, 0);
if (!log->io_kc)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index d86a39e..99e2d13 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5732,12 +5732,8 @@ static int handle_active_stripes(struct r5conf *conf, int group,
for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
if (!list_empty(temp_inactive_list + i))
break;
- if (i == NR_STRIPE_HASH_LOCKS) {
- spin_unlock_irq(&conf->device_lock);
- r5l_flush_stripe_to_raid(conf->log);
- spin_lock_irq(&conf->device_lock);
+ if (i == NR_STRIPE_HASH_LOCKS)
return batch_size;
- }
release_inactive = true;
}
spin_unlock_irq(&conf->device_lock);
@@ -5745,7 +5741,6 @@ static int handle_active_stripes(struct r5conf *conf, int group,
release_inactive_stripe_list(conf, temp_inactive_list,
NR_STRIPE_HASH_LOCKS);
- r5l_flush_stripe_to_raid(conf->log);
if (release_inactive) {
spin_lock_irq(&conf->device_lock);
return 0;
@@ -5875,8 +5870,6 @@ static void raid5d(struct md_thread *thread)
mutex_unlock(&conf->cache_size_mutex);
}
- r5l_flush_stripe_to_raid(conf->log);
-
async_tx_issue_pending_all();
blk_finish_plug(&plug);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index b85ee02..720f0b3 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -624,7 +624,6 @@ extern int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev);
extern void r5l_exit_log(struct r5l_log *log);
extern int r5l_write_stripe(struct r5l_log *log, struct stripe_head *head_sh);
extern void r5l_write_stripe_run(struct r5l_log *log);
-extern void r5l_flush_stripe_to_raid(struct r5l_log *log);
extern void r5l_stripe_write_finished(struct stripe_head *sh);
extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
#endif
--
1.9.1
* [PATCH 04/10] raid5-cache: clean up r5l_get_meta
From: Christoph Hellwig @ 2015-09-07 5:20 UTC (permalink / raw)
To: Shaohua Li, neilb; +Cc: linux-raid, Kernel-team, dan.j.williams
Remove the only partially used local 'io' variable to simplify the code
flow.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/md/raid5-cache.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 1e54249..10ccda3 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -291,16 +291,12 @@ static struct r5l_io_unit *r5l_new_meta(struct r5l_log *log)
static int r5l_get_meta(struct r5l_log *log, unsigned int payload_size)
{
- struct r5l_io_unit *io;
-
- io = log->current_io;
- if (io && io->meta_offset + payload_size > PAGE_SIZE)
+ if (log->current_io &&
+ log->current_io->meta_offset + payload_size > PAGE_SIZE)
r5l_submit_current_io(log);
- io = log->current_io;
- if (io)
- return 0;
- log->current_io = r5l_new_meta(log);
+ if (!log->current_io)
+ log->current_io = r5l_new_meta(log);
return 0;
}
--
1.9.1
* [PATCH 05/10] raid5-cache: refactor bio allocation
From: Christoph Hellwig @ 2015-09-07 5:20 UTC (permalink / raw)
To: Shaohua Li, neilb; +Cc: linux-raid, Kernel-team, dan.j.williams
Split out a helper to allocate a bio for log writes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/md/raid5-cache.c | 44 ++++++++++++++++++++------------------------
1 file changed, 20 insertions(+), 24 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 10ccda3..d0ad798 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -245,11 +245,25 @@ static void r5l_submit_current_io(struct r5l_log *log)
}
}
+static struct bio *r5l_bio_alloc(struct r5l_log *log, struct r5l_io_unit *io)
+{
+ struct bio *bio = bio_kmalloc(GFP_NOIO | __GFP_NOFAIL, BIO_MAX_PAGES);
+
+ bio->bi_rw = WRITE;
+ bio->bi_bdev = log->rdev->bdev;
+ bio->bi_iter.bi_sector = log->log_start;
+ bio->bi_end_io = r5l_log_endio;
+ bio->bi_private = io;
+
+ bio_list_add(&io->bios, bio);
+ atomic_inc(&io->pending_io);
+ return bio;
+}
+
static struct r5l_io_unit *r5l_new_meta(struct r5l_log *log)
{
struct r5l_io_unit *io;
struct r5l_meta_block *block;
- struct bio *bio;
io = r5l_alloc_io_unit(log);
@@ -263,17 +277,8 @@ static struct r5l_io_unit *r5l_new_meta(struct r5l_log *log)
io->meta_offset = sizeof(struct r5l_meta_block);
io->seq = log->seq;
- bio = bio_kmalloc(GFP_NOIO | __GFP_NOFAIL, BIO_MAX_PAGES);
- io->current_bio = bio;
- bio->bi_rw = WRITE;
- bio->bi_bdev = log->rdev->bdev;
- bio->bi_iter.bi_sector = log->log_start;
- bio_add_page(bio, io->meta_page, PAGE_SIZE, 0);
- bio->bi_end_io = r5l_log_endio;
- bio->bi_private = io;
-
- bio_list_add(&io->bios, bio);
- atomic_inc(&io->pending_io);
+ io->current_bio = r5l_bio_alloc(log, io);
+ bio_add_page(io->current_bio, io->meta_page, PAGE_SIZE, 0);
log->seq++;
log->log_start = r5l_ring_add(log, log->log_start, BLOCK_SECTORS);
@@ -326,18 +331,9 @@ static void r5l_append_payload_page(struct r5l_log *log, struct page *page)
struct r5l_io_unit *io = log->current_io;
alloc_bio:
- if (!io->current_bio) {
- struct bio *bio;
- bio = bio_kmalloc(GFP_NOIO | __GFP_NOFAIL, BIO_MAX_PAGES);
- bio->bi_rw = WRITE;
- bio->bi_bdev = log->rdev->bdev;
- bio->bi_iter.bi_sector = log->log_start;
- bio->bi_end_io = r5l_log_endio;
- bio->bi_private = io;
- bio_list_add(&io->bios, bio);
- atomic_inc(&io->pending_io);
- io->current_bio = bio;
- }
+ if (!io->current_bio)
+ io->current_bio = r5l_bio_alloc(log, io);
+
if (!bio_add_page(io->current_bio, page, PAGE_SIZE, 0)) {
io->current_bio = NULL;
goto alloc_bio;
--
1.9.1
* [PATCH 06/10] raid5-cache: take rdev->data_offset into account early on
From: Christoph Hellwig @ 2015-09-07 5:20 UTC (permalink / raw)
To: Shaohua Li, neilb; +Cc: linux-raid, Kernel-team, dan.j.williams
Set up bi_sector properly when we allocate a bio instead of updating it
at submission time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/md/raid5-cache.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index d0ad798..df79d49 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -238,11 +238,8 @@ static void r5l_submit_current_io(struct r5l_log *log)
__r5l_set_io_unit_state(io, IO_UNIT_IO_START);
spin_unlock_irqrestore(&log->io_list_lock, flags);
- while ((bio = bio_list_pop(&io->bios))) {
- /* all IO must start from rdev->data_offset */
- bio->bi_iter.bi_sector += log->rdev->data_offset;
+ while ((bio = bio_list_pop(&io->bios)))
submit_bio(WRITE | REQ_FUA, bio);
- }
}
static struct bio *r5l_bio_alloc(struct r5l_log *log, struct r5l_io_unit *io)
@@ -251,7 +248,7 @@ static struct bio *r5l_bio_alloc(struct r5l_log *log, struct r5l_io_unit *io)
bio->bi_rw = WRITE;
bio->bi_bdev = log->rdev->bdev;
- bio->bi_iter.bi_sector = log->log_start;
+ bio->bi_iter.bi_sector = log->rdev->data_offset + log->log_start;
bio->bi_end_io = r5l_log_endio;
bio->bi_private = io;
--
1.9.1
* [PATCH 07/10] raid5-cache: inline r5l_alloc_io_unit into r5l_new_meta
From: Christoph Hellwig @ 2015-09-07 5:20 UTC (permalink / raw)
To: Shaohua Li, neilb; +Cc: linux-raid, Kernel-team, dan.j.williams
This is the only user, and keeping all code initializing the io_unit
structure together improves readability.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/md/raid5-cache.c | 26 ++++++++------------------
1 file changed, 8 insertions(+), 18 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index df79d49..2ccc1b0 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -141,23 +141,6 @@ static bool r5l_has_free_space(struct r5l_log *log, sector_t size)
return log->device_size > used_size + size;
}
-static struct r5l_io_unit *r5l_alloc_io_unit(struct r5l_log *log)
-{
- struct r5l_io_unit *io;
- /* We can't handle memory allocate failure so far */
- gfp_t gfp = GFP_NOIO | __GFP_NOFAIL;
-
- io = kmem_cache_zalloc(log->io_kc, gfp);
- io->log = log;
- io->meta_page = alloc_page(gfp | __GFP_ZERO);
-
- bio_list_init(&io->bios);
- INIT_LIST_HEAD(&io->log_sibling);
- INIT_LIST_HEAD(&io->stripe_list);
- io->state = IO_UNIT_RUNNING;
- return io;
-}
-
static void r5l_free_io_unit(struct r5l_log *log, struct r5l_io_unit *io)
{
__free_page(io->meta_page);
@@ -262,8 +245,15 @@ static struct r5l_io_unit *r5l_new_meta(struct r5l_log *log)
struct r5l_io_unit *io;
struct r5l_meta_block *block;
- io = r5l_alloc_io_unit(log);
+ /* We can't handle memory allocate failure so far */
+ io = kmem_cache_zalloc(log->io_kc, GFP_NOIO | __GFP_NOFAIL);
+ io->log = log;
+ bio_list_init(&io->bios);
+ INIT_LIST_HEAD(&io->log_sibling);
+ INIT_LIST_HEAD(&io->stripe_list);
+ io->state = IO_UNIT_RUNNING;
+ io->meta_page = alloc_page(GFP_NOIO | __GFP_NOFAIL | __GFP_ZERO);
block = page_address(io->meta_page);
block->magic = cpu_to_le32(R5LOG_MAGIC);
block->version = R5LOG_VERSION;
--
1.9.1
* [PATCH 08/10] raid5-cache: new helper: r5_reserve_log_entry
From: Christoph Hellwig @ 2015-09-07 5:20 UTC (permalink / raw)
To: Shaohua Li, neilb; +Cc: linux-raid, Kernel-team, dan.j.williams
Factor out code to reserve log space.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/md/raid5-cache.c | 30 +++++++++++++++++++-----------
1 file changed, 19 insertions(+), 11 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 2ccc1b0..ae8caa8 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -240,6 +240,23 @@ static struct bio *r5l_bio_alloc(struct r5l_log *log, struct r5l_io_unit *io)
return bio;
}
+static void r5_reserve_log_entry(struct r5l_log *log, struct r5l_io_unit *io)
+{
+ log->log_start = r5l_ring_add(log, log->log_start, BLOCK_SECTORS);
+
+ /*
+ * If we filled up the log device start from the beginning again,
+ * which will require a new bio.
+ *
+ * Note: for this to work properly the log size needs to be a multiple
+ * of BLOCK_SECTORS.
+ */
+ if (log->log_start == 0)
+ io->current_bio = NULL;
+
+ io->log_end = log->log_start;
+}
+
static struct r5l_io_unit *r5l_new_meta(struct r5l_log *log)
{
struct r5l_io_unit *io;
@@ -268,11 +285,7 @@ static struct r5l_io_unit *r5l_new_meta(struct r5l_log *log)
bio_add_page(io->current_bio, io->meta_page, PAGE_SIZE, 0);
log->seq++;
- log->log_start = r5l_ring_add(log, log->log_start, BLOCK_SECTORS);
- io->log_end = log->log_start;
- /* current bio hit disk end */
- if (log->log_start == 0)
- io->current_bio = NULL;
+ r5_reserve_log_entry(log, io);
spin_lock_irq(&log->io_list_lock);
list_add_tail(&io->log_sibling, &log->running_ios);
@@ -325,13 +338,8 @@ alloc_bio:
io->current_bio = NULL;
goto alloc_bio;
}
- log->log_start = r5l_ring_add(log, log->log_start,
- BLOCK_SECTORS);
- /* current bio hit disk end */
- if (log->log_start == 0)
- io->current_bio = NULL;
- io->log_end = log->log_start;
+ r5_reserve_log_entry(log, io);
}
static void r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
--
1.9.1
* [PATCH 09/10] raid5-cache: small log->seq cleanup
From: Christoph Hellwig @ 2015-09-07 5:20 UTC (permalink / raw)
To: Shaohua Li, neilb; +Cc: linux-raid, Kernel-team, dan.j.williams
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/md/raid5-cache.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index ae8caa8..af274e9 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -279,12 +279,11 @@ static struct r5l_io_unit *r5l_new_meta(struct r5l_log *log)
io->log_start = log->log_start;
io->meta_offset = sizeof(struct r5l_meta_block);
- io->seq = log->seq;
+ io->seq = log->seq++;
io->current_bio = r5l_bio_alloc(log, io);
bio_add_page(io->current_bio, io->meta_page, PAGE_SIZE, 0);
- log->seq++;
r5_reserve_log_entry(log, io);
spin_lock_irq(&log->io_list_lock);
--
1.9.1
* [PATCH 10/10] raid5-cache: use bio chaining
From: Christoph Hellwig @ 2015-09-07 5:20 UTC (permalink / raw)
To: Shaohua Li, neilb; +Cc: linux-raid, Kernel-team, dan.j.williams
Simplify the bio completion handler by using bio chaining and submitting
bios as soon as they are full.
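A minimal sketch of the pattern (a hypothetical helper, not taken from the patch): when the current bio fills up, allocate a new one, chain it to the full bio and submit the full bio right away. bio_chain() holds back the full bio's completion until the chained bio completes as well, so a single bi_end_io on the first bio of the chain covers the whole I/O unit.

#include <linux/bio.h>

static struct bio *extend_log_bio(struct bio *full, unsigned int max_vecs)
{
        struct bio *next = bio_kmalloc(GFP_NOIO | __GFP_NOFAIL, max_vecs);

        /* 'full' will not report completion until 'next' has completed */
        bio_chain(next, full);
        submit_bio(WRITE, full);

        /* the caller keeps adding pages to 'next' and submits it last */
        return next;
}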
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/md/raid5-cache.c | 38 ++++++++++++++++----------------------
1 file changed, 16 insertions(+), 22 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index af274e9..50ec1aa 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -91,8 +91,6 @@ struct r5l_io_unit {
struct page *meta_page; /* store meta block */
int meta_offset; /* current offset in meta_page */
- struct bio_list bios;
- atomic_t pending_io; /* pending bios not written to log yet */
struct bio *current_bio; /* current_bio accepting new data */
atomic_t pending_stripe; /* how many stripes not flushed to raid */
@@ -103,6 +101,7 @@ struct r5l_io_unit {
struct list_head stripe_list; /* stripes added to the io_unit */
int state;
+ bool need_split_bio;
};
/* r5l_io_unit state */
@@ -191,9 +190,6 @@ static void r5l_log_endio(struct bio *bio)
bio_put(bio);
- if (!atomic_dec_and_test(&io->pending_io))
- return;
-
spin_lock_irqsave(&log->io_list_lock, flags);
__r5l_set_io_unit_state(io, IO_UNIT_IO_END);
r5l_log_run_stripes(log);
@@ -204,7 +200,6 @@ static void r5l_submit_current_io(struct r5l_log *log)
{
struct r5l_io_unit *io = log->current_io;
struct r5l_meta_block *block;
- struct bio *bio;
unsigned long flags;
u32 crc;
@@ -221,22 +216,17 @@ static void r5l_submit_current_io(struct r5l_log *log)
__r5l_set_io_unit_state(io, IO_UNIT_IO_START);
spin_unlock_irqrestore(&log->io_list_lock, flags);
- while ((bio = bio_list_pop(&io->bios)))
- submit_bio(WRITE | REQ_FUA, bio);
+ submit_bio(WRITE | REQ_FUA, io->current_bio);
}
-static struct bio *r5l_bio_alloc(struct r5l_log *log, struct r5l_io_unit *io)
+static struct bio *r5l_bio_alloc(struct r5l_log *log)
{
struct bio *bio = bio_kmalloc(GFP_NOIO | __GFP_NOFAIL, BIO_MAX_PAGES);
bio->bi_rw = WRITE;
bio->bi_bdev = log->rdev->bdev;
bio->bi_iter.bi_sector = log->rdev->data_offset + log->log_start;
- bio->bi_end_io = r5l_log_endio;
- bio->bi_private = io;
- bio_list_add(&io->bios, bio);
- atomic_inc(&io->pending_io);
return bio;
}
@@ -252,7 +242,7 @@ static void r5_reserve_log_entry(struct r5l_log *log, struct r5l_io_unit *io)
* of BLOCK_SECTORS.
*/
if (log->log_start == 0)
- io->current_bio = NULL;
+ io->need_split_bio = true;
io->log_end = log->log_start;
}
@@ -265,7 +255,6 @@ static struct r5l_io_unit *r5l_new_meta(struct r5l_log *log)
/* We can't handle memory allocate failure so far */
io = kmem_cache_zalloc(log->io_kc, GFP_NOIO | __GFP_NOFAIL);
io->log = log;
- bio_list_init(&io->bios);
INIT_LIST_HEAD(&io->log_sibling);
INIT_LIST_HEAD(&io->stripe_list);
io->state = IO_UNIT_RUNNING;
@@ -281,7 +270,9 @@ static struct r5l_io_unit *r5l_new_meta(struct r5l_log *log)
io->meta_offset = sizeof(struct r5l_meta_block);
io->seq = log->seq++;
- io->current_bio = r5l_bio_alloc(log, io);
+ io->current_bio = r5l_bio_alloc(log);
+ io->current_bio->bi_end_io = r5l_log_endio;
+ io->current_bio->bi_private = io;
bio_add_page(io->current_bio, io->meta_page, PAGE_SIZE, 0);
r5_reserve_log_entry(log, io);
@@ -329,15 +320,18 @@ static void r5l_append_payload_page(struct r5l_log *log, struct page *page)
{
struct r5l_io_unit *io = log->current_io;
-alloc_bio:
- if (!io->current_bio)
- io->current_bio = r5l_bio_alloc(log, io);
+ if (io->need_split_bio) {
+ struct bio *prev = io->current_bio;
- if (!bio_add_page(io->current_bio, page, PAGE_SIZE, 0)) {
- io->current_bio = NULL;
- goto alloc_bio;
+ io->current_bio = r5l_bio_alloc(log);
+ bio_chain(io->current_bio, prev);
+
+ submit_bio(WRITE, prev);
}
+ if (!bio_add_page(io->current_bio, page, PAGE_SIZE, 0))
+ BUG();
+
r5_reserve_log_entry(log, io);
}
--
1.9.1
* Re: raid5-cache I/O path improvements
From: Shaohua Li @ 2015-09-08 0:28 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: neilb, linux-raid, Kernel-team, dan.j.williams
On Mon, Sep 07, 2015 at 07:20:40AM +0200, Christoph Hellwig wrote:
> Hi Shaohua, hi Neil,
>
> this series contains a few updates to the raid5-cache feature.
>
> The first patch just ports it to the post-4.2 block layer. As part of that
> I noticed that it currently doesn't handle I/O errors - fixes for that will
> follow.
>
> The second and third patch simplify the I/O unit state machine and reduce
> latency and memory usage for the I/O units. The remainder are just a couple
> of cleanups in this area that I stumbled upon.
Hi Christoph,
Thanks for this work. Yes, I/O error handling is in the plan. We could
simply panic (people here like this option) or report the error and bypass
the log. Either way an option is good.
For the patches, FUA writes do simplify things a lot. However, I tried
this before and the performance was quite bad on SSDs. FUA is off in SATA
by default, and the emulation is fairly slow because the FLUSH request
isn't an NCQ command. I tried to enable FUA in SATA too, and FUA writes
were still slow on the SSD I tested. Other than this one, the other
patches look good:
Reviewed-by: Shaohua Li <shli@fb.com>
* Re: raid5-cache I/O path improvements
From: Christoph Hellwig @ 2015-09-08 6:12 UTC (permalink / raw)
To: Shaohua Li
Cc: neilb, linux-raid, Kernel-team, dan.j.williams, Tejun Heo,
Martin K. Petersen, linux-ide
On Mon, Sep 07, 2015 at 05:28:55PM -0700, Shaohua Li wrote:
> Hi Christoph,
> Thanks for this work. Yes, I/O error handling is in the plan. We could
> simply panic (people here like this option) or report the error and bypass
> the log. Either way an option is good.
I think the sensible thing in general is to fail the I/O. Once we have
a cache device the assumption is that a) write holes are properly handled,
and b) we do all kinds of optimizations based on the presence of the
log device, like not passing through flush requests or skipping resync.
Having the cache device suddenly disappear will always break a) and
require a lot of hairy code, only used in failure cases, to undo the
rest.
> For the patches, FUA writes do simplify things a lot. However, I tried
> this before and the performance was quite bad on SSDs. FUA is off in SATA
> by default, and the emulation is fairly slow because the FLUSH request
> isn't an NCQ command. I tried to enable FUA in SATA too, and FUA writes
> were still slow on the SSD I tested. Other than this one, the other
> patches look good:
Pretty much every SSD (and modern disk driver) supports FUA. Please
benchmark with libata.fua=Y, as I think the simplification is absolutely
worth it. On my SSDs using it gives far lower latency for writes,
never mind NVDIMM, where it's also essential, as the flush state machine
increases the write latency by an order of magnitude.
Tejun, do you have any updates on libata vs FUA? We enabled it
by default for a while in 2012, but then Jeff reverted it with a rather
non-descriptive commit message.
Also NVMe or SAS SSDs will benefit heavily from the FUA bit.
* Re: raid5-cache I/O path improvements
From: Tejun Heo @ 2015-09-08 15:25 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Shaohua Li, neilb, linux-raid, Kernel-team, dan.j.williams,
Martin K. Petersen, linux-ide
Hello, Christoph.
On Tue, Sep 08, 2015 at 08:12:15AM +0200, Christoph Hellwig wrote:
> Tejun, do you have any updates on libata vs FUA? We onable it
> by default for a while in 2012, but then Jeff reverted it with a rather
> non-descriptive commit message.
IIRC, some controllers and/or controllers were choking on it and it
didn't make any noticeable difference on rotating disks. Maybe we can
try again with a controller white list and enable it by default on SSDs.
Thanks.
--
tejun
* Re: raid5-cache I/O path improvements
From: Tejun Heo @ 2015-09-08 15:26 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Shaohua Li, neilb, linux-raid, Kernel-team, dan.j.williams,
Martin K. Petersen, linux-ide
On Tue, Sep 08, 2015 at 11:25:46AM -0400, Tejun Heo wrote:
...
> IIRC, some controllers and/or controllers were choking on it and it
                                ^
                                drives
> didn't make any noticeable difference on rotating disks. Maybe we can
> try again with a controller white list and enable it by default on SSDs.
--
tejun
* Re: raid5-cache I/O path improvements
From: Christoph Hellwig @ 2015-09-08 15:40 UTC (permalink / raw)
To: Tejun Heo
Cc: Shaohua Li, neilb, linux-raid, Kernel-team, dan.j.williams,
Martin K. Petersen, linux-ide
On Tue, Sep 08, 2015 at 11:25:46AM -0400, Tejun Heo wrote:
> IIRC, some controllers and/or controllers were choking on it and it
> didn't make any noticeable difference on rotating disks. Maybe we can
> try again with a controller white list and enable it by default on SSDs.
I guess we could start with AHCI only as a good approximation for a not
too crappy controller and driver.
* Re: raid5-cache I/O path improvements
From: Shaohua Li @ 2015-09-08 16:56 UTC (permalink / raw)
To: Christoph Hellwig
Cc: neilb, linux-raid, Kernel-team, dan.j.williams, Tejun Heo,
Martin K. Petersen, linux-ide
On Tue, Sep 08, 2015 at 08:12:15AM +0200, Christoph Hellwig wrote:
> On Mon, Sep 07, 2015 at 05:28:55PM -0700, Shaohua Li wrote:
> > Hi Christoph,
> > Thanks for this work. Yes, I/O error handling is in the plan. We could
> > simply panic (people here like this option) or report the error and bypass
> > the log. Either way an option is good.
>
> I think the sensible thing in general is to fail the I/O. Once we have
> a cache device the assumption is that a) write holes are properly handled,
> and b) we do all kinds of optimizations based on the presence of the
> log device, like not passing through flush requests or skipping resync.
>
> Having the cache device suddenly disappear will always break a) and
> require a lot of hairy code, only used in failure cases, to undo the
> rest.
Failing the I/O is ok too.
> > For the patches, FUA writes do simplify things a lot. However, I tried
> > this before and the performance was quite bad on SSDs. FUA is off in SATA
> > by default, and the emulation is fairly slow because the FLUSH request
> > isn't an NCQ command. I tried to enable FUA in SATA too, and FUA writes
> > were still slow on the SSD I tested. Other than this one, the other
> > patches look good:
>
> Pretty much every SSD (and modern disk driver) supports FUA. Please
> benchmark with libata.fua=Y, as I think the simplification is absolutely
> worth it. On my SSDs using it gives far lower latency for writes,
> never mind NVDIMM, where it's also essential, as the flush state machine
> increases the write latency by an order of magnitude.
>
> Tejun, do you have any updates on libata vs FUA? We enabled it
> by default for a while in 2012, but then Jeff reverted it with a rather
> non-descriptive commit message.
>
> Also NVMe or SAS SSDs will benefit heavily from the FUA bit.
I agree about the benefit of FUA. In the system I'm testing, an Intel SSD
supports FUA, while a SanDisk SSD doesn't (this is the SSD we will
deploy for the log). This is AHCI with libata.fua=1. FUA isn't supported
by every SSD. If the log uses FUA by default, we will have a lot of FUA
writes and performance will suffer.
I'll benchmark on an SSD from another vendor, which supports FUA, but FUA
writes had poor performance in my last test.
Thanks,
Shaohua
* Re: raid5-cache I/O path improvements
From: Tejun Heo @ 2015-09-08 17:02 UTC (permalink / raw)
To: Shaohua Li
Cc: Christoph Hellwig, neilb, linux-raid, Kernel-team, dan.j.williams,
Martin K. Petersen, linux-ide
Hello,
On Tue, Sep 08, 2015 at 09:56:22AM -0700, Shaohua Li wrote:
> I'll benchmark on an SSD from another vendor, which supports FUA, but FUA
> writes had poor performance in my last test.
lolwut? Was it slower than write + flush?
thanks.
--
tejun
* Re: raid5-cache I/O path improvements
From: Shaohua Li @ 2015-09-08 17:07 UTC (permalink / raw)
To: Tejun Heo
Cc: Christoph Hellwig, neilb, linux-raid, Kernel-team, dan.j.williams,
Martin K. Petersen, linux-ide
On Tue, Sep 08, 2015 at 01:02:26PM -0400, Tejun Heo wrote:
> Hello,
>
> On Tue, Sep 08, 2015 at 09:56:22AM -0700, Shaohua Li wrote:
> > I'll benchmark on an SSD from another vendor, which supports FUA, but FUA
> > writes had poor performance in my last test.
>
> lolwut? Was it slower than write + flush?
I need to double-check. But for write + flush we aggregate several
writes and do a single flush; with FUA we do every meta write with FUA. So
this is not an apples-to-apples comparison.
Thanks,
Shaohua
* Re: raid5-cache I/O path improvements
From: Tejun Heo @ 2015-09-08 17:34 UTC (permalink / raw)
To: Shaohua Li
Cc: Christoph Hellwig, neilb, linux-raid, Kernel-team, dan.j.williams,
Martin K. Petersen, linux-ide
Hello,
On Tue, Sep 08, 2015 at 10:07:36AM -0700, Shaohua Li wrote:
> I need to double-check. But for write + flush we aggregate several
> writes and do a single flush; with FUA we do every meta write with FUA. So
> this is not an apples-to-apples comparison.
Does that mean that upper layers are taking different actions
depending on whether the underlying device supports FUA? That at
least wasn't the original model that I had in mind when implementing
the current incarnation of REQ_FLUSH and FUA. The only difference FUA
was expected to make was optimizing out the flush after REQ_FUA and
the upper layers were expected to issue REQ_FLUSH/FUA the same whether
the device supports FUA or not.
Hmmm... grep tells me that dm and md actually are branching on whether
the underlying device supports FUA. This is tricky. I didn't even
mean flush_flags to be used directly by upper layers. For rotational
devices, doing multiple FUAs compared to multiple writes followed by
REQ_FLUSH is probably a lot worse - the head gets moved multiple times,
likely skipping over data which could be written out while traversing,
and it's not like stalling the write pipeline and draining the write queue
has much impact on hard drives.
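For illustration, a sketch of the kind of branching being described (hypothetical code, assuming the 4.2-era request_queue flush_flags field, not quoted from the actual md/dm sources):

#include <linux/blkdev.h>

/* An upper layer peeking at FUA support to pick a write-out strategy. */
static bool dev_supports_fua(struct block_device *bdev)
{
        return bdev_get_queue(bdev)->flush_flags & REQ_FUA;
}

The intended model was instead to always issue REQ_FLUSH/REQ_FUA and let the block layer strip or emulate them depending on the device.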
Maybe it's different on SSDs. I'm not sure how expensive a flush
itself would be given that a lot of the write cost is paid asynchronously
anyway during GC, but a flush stalls the IO pipeline and that could be very
noticeable on high-IOPS devices. Also, unless the implementation is
braindead, FUA IOs are unlikely to be expensive on SSDs, so maybe what
we should do is make the block layer hint to upper layers what's
likely to perform better.
But, ultimately, I don't think it'd be too difficult for high-depth
SSD devices to report write-through w/o losing any performance one way
or the other and it'd be great if we eventually can get there.
Thanks.
--
tejun
* Re: raid5-cache I/O path improvements
From: Christoph Hellwig @ 2015-09-09 15:59 UTC (permalink / raw)
To: Tejun Heo
Cc: Shaohua Li, Christoph Hellwig, neilb, linux-raid, Kernel-team,
dan.j.williams, Martin K. Petersen, linux-ide
On Tue, Sep 08, 2015 at 01:34:20PM -0400, Tejun Heo wrote:
> Hmmm... grep tells me that dm and md actually are branching on whether
> the underlying device supports FUA. This is tricky. I didn't even
> mean flush_flags to be used directly by upper layers. For rotational
> devices, doing multiple FUAs compared to multiple writes followed by
> REQ_FLUSH is probably a lot worse - the head gets moved multiple times,
> likely skipping over data which could be written out while traversing,
> and it's not like stalling the write pipeline and draining the write queue
> has much impact on hard drives.
Well, that's what we'd need to do for the raid cache as well, given
the results that Shaohua sees. Unless you have a good idea for another
way to handle the issue we'll need to support both behaviors there.