* [PATCH 1/8] md: fix feature map check
2015-09-02 20:49 [PATCH 0/8] raid5-cache fixes Shaohua Li
@ 2015-09-02 20:49 ` Shaohua Li
2015-09-02 20:49 ` [PATCH 2/8] raid5: fix build error Shaohua Li
` (7 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: Shaohua Li @ 2015-09-02 20:49 UTC (permalink / raw)
To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb
The feature map is a __le32; convert it to CPU endianness before testing
feature bits.
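As an illustration only (not part of the patch), here is a minimal
userspace sketch of the hazard; le32toh() from <endian.h> stands in for
the kernel's le32_to_cpu(), and the feature-bit value is assumed for the
demo:

#include <endian.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_FEATURE_JOURNAL 512        /* assumed bit value, demo only */

int main(void)
{
        /* on-disk superblock fields are stored little-endian */
        uint32_t feature_map = htole32(DEMO_FEATURE_JOURNAL);

        /* wrong: on a big-endian CPU this tests byte-swapped data */
        printf("raw test:       %d\n",
               !!(feature_map & DEMO_FEATURE_JOURNAL));
        /* right: convert to CPU endianness before testing the bit */
        printf("converted test: %d\n",
               !!(le32toh(feature_map) & DEMO_FEATURE_JOURNAL));
        return 0;
}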
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
drivers/md/md.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8e49ea3..b3f9eed 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1660,7 +1660,7 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
set_bit(Faulty, &rdev->flags);
break;
case MD_DISK_ROLE_JOURNAL: /* journal device */
- if (!(sb->feature_map & MD_FEATURE_JOURNAL)) {
+ if (!(le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL)) {
/* journal device without journal feature */
printk(KERN_WARNING
"md: journal device provided without "
--
1.8.1
* [PATCH 2/8] raid5: fix build error
2015-09-02 20:49 [PATCH 0/8] raid5-cache fixes Shaohua Li
2015-09-02 20:49 ` [PATCH 1/8] md: fix feature map check Shaohua Li
@ 2015-09-02 20:49 ` Shaohua Li
2015-09-02 20:49 ` [PATCH 3/8] raid5-cache: switching to state machine for log disk cache flush Shaohua Li
` (6 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: Shaohua Li @ 2015-09-02 20:49 UTC (permalink / raw)
To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb
cmpxchg() doesn't work with a u64 on 32-bit architectures, so change the
data type to 'unsigned long'. In theory the value could overflow there,
but it's not worth worrying about.
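For reference, a standalone sketch of the lock-free "raise the target to
a maximum" loop this patch keeps, with C11 atomics standing in for the
kernel's cmpxchg() (illustration only):

#include <stdatomic.h>
#include <stdio.h>

static _Atomic unsigned long reclaim_target;

static void wake_reclaim(unsigned long space)
{
        unsigned long target = atomic_load(&reclaim_target);

        /* only ever raise the target; racing callers retry */
        while (space >= target) {
                if (atomic_compare_exchange_weak(&reclaim_target,
                                                 &target, space))
                        break;
                /* on failure, target holds the freshly installed value */
        }
}

int main(void)
{
        wake_reclaim(100);
        wake_reclaim(50);       /* ignored: below the current target */
        printf("%lu\n", atomic_load(&reclaim_target)); /* prints 100 */
        return 0;
}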
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
drivers/md/raid5-cache.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 52feb90..c85d72a 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -69,7 +69,7 @@ struct r5l_log {
struct kmem_cache *io_kc;
struct md_thread *reclaim_thread;
- sector_t reclaim_target; /* number of space that need to be reclaimed.
+ unsigned long reclaim_target; /* number of space that need to be reclaimed.
* if it's 0, reclaim spaces used by io_units
* which are in IO_UNIT_STRIPE_END state (eg,
* reclaim dones't wait for specific io_unit
@@ -701,13 +701,14 @@ static void r5l_reclaim_thread(struct md_thread *thread)
static void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
{
- sector_t target;
+ unsigned long target;
+ unsigned long new = (unsigned long)space; /* overflow in theory */
do {
target = log->reclaim_target;
- if (space < target)
+ if (new < target)
return;
- } while (cmpxchg(&log->reclaim_target, target, space) != target);
+ } while (cmpxchg(&log->reclaim_target, target, new) != target);
md_wakeup_thread(log->reclaim_thread);
}
--
1.8.1
* [PATCH 3/8] raid5-cache: switching to state machine for log disk cache flush
2015-09-02 20:49 [PATCH 0/8] raid5-cache fixes Shaohua Li
2015-09-02 20:49 ` [PATCH 1/8] md: fix feature map check Shaohua Li
2015-09-02 20:49 ` [PATCH 2/8] raid5: fix build error Shaohua Li
@ 2015-09-02 20:49 ` Shaohua Li
2015-09-02 20:49 ` [PATCH 4/8] raid5-cache: fix a use-after-free bug Shaohua Li
` (5 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: Shaohua Li @ 2015-09-02 20:49 UTC (permalink / raw)
To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb
Before we write stripe data to the raid disks, we must guarantee the
stripe data has settled down in the log disk. To do this, we flush the
log disk cache and wait for the flush to finish. That wait introduces
sleep time into the raid5d thread and hurts performance. This patch
moves the log disk cache flush into the stripe-handling state machine,
which removes the wait from raid5d.
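A toy model of the single-in-flight flush introduced here (illustration
only: three counters stand in for the io_end_ios, flushing_ios and
flushed_ios lists, and locking and stripe release are omitted):

#include <stdio.h>

static int io_end, flushing, flushed;

/* runs when the single in-flight FLUSH bio completes */
static void flush_endio(void)
{
        flushed += flushing;    /* flushing_ios -> flushed_ios */
        flushing = 0;           /* these stripes may now hit the raid disks */
}

/* called from raid5d; never blocks */
static void flush_stripe_to_raid(void)
{
        if (flushing)           /* a flush bio is already running */
                return;
        flushing = io_end;      /* io_end_ios -> flushing_ios */
        io_end = 0;
        if (flushing)
                printf("submit WRITE_FLUSH for %d io_units\n", flushing);
}

int main(void)
{
        io_end = 3;
        flush_stripe_to_raid(); /* submits a flush for 3 io_units */
        io_end = 2;
        flush_stripe_to_raid(); /* no-op: a flush is in flight */
        flush_endio();          /* the first 3 io_units settle down */
        flush_stripe_to_raid(); /* submits a flush for the next 2 */
        printf("flushed=%d\n", flushed);
        return 0;
}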
Signed-off-by: Shaohua Li <shli@fb.com>
---
drivers/md/raid5-cache.c | 116 ++++++++++++++++++++++++++---------------------
drivers/md/raid5.c | 7 ++-
2 files changed, 71 insertions(+), 52 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index c85d72a..86b836c 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -61,6 +61,10 @@ struct r5l_log {
struct list_head io_end_ios; /* io_units which have been completely
* written to the log but not yet written
* to the RAID */
+ struct list_head flushing_ios; /* io_units which are waiting for log
+ * cache flush */
+ struct list_head flushed_ios; /* io_units which settle down in log disk */
+ struct bio flush_bio;
struct list_head stripe_end_ios; /* io_units which have been
* completely written to the RAID *
* but have not yet been considered *
@@ -113,8 +117,7 @@ enum r5l_io_unit_state {
IO_UNIT_IO_START = 1, /* io_unit bio start writting to log,
* don't accepting new bio */
IO_UNIT_IO_END = 2, /* io_unit bio finish writting to log */
- IO_UNIT_STRIPE_START = 3, /* stripes of io_unit are flushing to raid */
- IO_UNIT_STRIPE_END = 4, /* stripes data finished writting to raid */
+ IO_UNIT_STRIPE_END = 3, /* stripes data finished writting to raid */
};
static sector_t r5l_ring_add(struct r5l_log *log, sector_t start, sector_t inc)
@@ -228,7 +231,7 @@ static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
struct r5l_io_unit *last;
sector_t reclaimable_space;
- r5l_move_io_unit_list(&log->io_end_ios, &log->stripe_end_ios,
+ r5l_move_io_unit_list(&log->flushed_ios, &log->stripe_end_ios,
IO_UNIT_STRIPE_END);
last = list_last_entry(&log->stripe_end_ios,
@@ -554,6 +557,28 @@ void r5l_stripe_write_finished(struct stripe_head *sh)
r5l_set_io_unit_state(io, IO_UNIT_STRIPE_END);
}
+static void r5l_log_flush_endio(struct bio *bio, int error)
+{
+ struct r5l_log *log = container_of(bio, struct r5l_log,
+ flush_bio);
+ unsigned long flags;
+ struct r5l_io_unit *io;
+ struct stripe_head *sh;
+
+ spin_lock_irqsave(&log->io_list_lock, flags);
+ list_for_each_entry(io, &log->flushing_ios, log_sibling) {
+ while (!list_empty(&io->stripe_list)) {
+ sh = list_first_entry(&io->stripe_list,
+ struct stripe_head, log_list);
+ list_del_init(&sh->log_list);
+ set_bit(STRIPE_HANDLE, &sh->state);
+ raid5_release_stripe(sh);
+ }
+ }
+ list_splice_tail_init(&log->flushing_ios, &log->flushed_ios);
+ spin_unlock_irqrestore(&log->io_list_lock, flags);
+}
+
/*
* Starting dispatch IO to raid.
* io_unit(meta) consists of a log. There is one situation we want to avoid. A
@@ -570,44 +595,31 @@ void r5l_stripe_write_finished(struct stripe_head *sh)
* */
void r5l_flush_stripe_to_raid(struct r5l_log *log)
{
- struct r5l_io_unit *io;
- struct stripe_head *sh;
- bool run_stripe;
-
+ bool do_flush;
if (!log)
return;
- spin_lock_irq(&log->io_list_lock);
- run_stripe = !list_empty(&log->io_end_ios);
- spin_unlock_irq(&log->io_list_lock);
-
- if (!run_stripe)
- return;
-
- blkdev_issue_flush(log->rdev->bdev, GFP_NOIO, NULL);
spin_lock_irq(&log->io_list_lock);
- list_for_each_entry(io, &log->io_end_ios, log_sibling) {
- if (io->state >= IO_UNIT_STRIPE_START)
- continue;
- __r5l_set_io_unit_state(io, IO_UNIT_STRIPE_START);
-
- while (!list_empty(&io->stripe_list)) {
- sh = list_first_entry(&io->stripe_list,
- struct stripe_head, log_list);
- list_del_init(&sh->log_list);
- set_bit(STRIPE_HANDLE, &sh->state);
- raid5_release_stripe(sh);
- }
+ /* flush bio is running */
+ if (!list_empty(&log->flushing_ios)) {
+ spin_unlock_irq(&log->io_list_lock);
+ return;
}
+ list_splice_tail_init(&log->io_end_ios, &log->flushing_ios);
+ do_flush = !list_empty(&log->flushing_ios);
spin_unlock_irq(&log->io_list_lock);
+
+ if (!do_flush)
+ return;
+ bio_reset(&log->flush_bio);
+ log->flush_bio.bi_bdev = log->rdev->bdev;
+ log->flush_bio.bi_end_io = r5l_log_flush_endio;
+ submit_bio(WRITE_FLUSH, &log->flush_bio);
}
static void r5l_kick_io_unit(struct r5l_log *log, struct r5l_io_unit *io)
{
- /* the log thread will write the io unit */
- wait_event(io->wait_state, io->state >= IO_UNIT_IO_END);
- if (io->state < IO_UNIT_STRIPE_START)
- r5l_flush_stripe_to_raid(log);
+ md_wakeup_thread(log->rdev->mddev->thread);
wait_event(io->wait_state, io->state >= IO_UNIT_STRIPE_END);
}
@@ -626,6 +638,8 @@ static void r5l_do_reclaim(struct r5l_log *log)
* shouldn't reuse space of an unreclaimable io_unit
* */
while (1) {
+ struct list_head *target_list = NULL;
+
while (!list_empty(&log->stripe_end_ios)) {
io = list_first_entry(&log->stripe_end_ios,
struct r5l_io_unit, log_sibling);
@@ -637,29 +651,26 @@ static void r5l_do_reclaim(struct r5l_log *log)
if (free >= reclaim_target ||
(list_empty(&log->running_ios) &&
list_empty(&log->io_end_ios) &&
- list_empty(&log->stripe_end_ios)))
+ list_empty(&log->flushing_ios) &&
+ list_empty(&log->flushed_ios)))
break;
/* Below waiting mostly happens when we shutdown the raid */
- if (!list_empty(&log->io_end_ios)) {
- io = list_first_entry(&log->io_end_ios,
- struct r5l_io_unit, log_sibling);
- spin_unlock_irq(&log->io_list_lock);
- /* nobody else can delete the io, we are safe */
- r5l_kick_io_unit(log, io);
- spin_lock_irq(&log->io_list_lock);
- continue;
- }
-
- if (!list_empty(&log->running_ios)) {
- io = list_first_entry(&log->running_ios,
- struct r5l_io_unit, log_sibling);
- spin_unlock_irq(&log->io_list_lock);
- /* nobody else can delete the io, we are safe */
- r5l_kick_io_unit(log, io);
- spin_lock_irq(&log->io_list_lock);
- continue;
- }
+ if (!list_empty(&log->flushed_ios))
+ target_list = &log->flushed_ios;
+ else if (!list_empty(&log->flushing_ios))
+ target_list = &log->flushing_ios;
+ else if (!list_empty(&log->io_end_ios))
+ target_list = &log->io_end_ios;
+ else if (!list_empty(&log->running_ios))
+ target_list = &log->running_ios;
+
+ io = list_first_entry(target_list,
+ struct r5l_io_unit, log_sibling);
+ spin_unlock_irq(&log->io_list_lock);
+ /* nobody else can delete the io, we are safe */
+ r5l_kick_io_unit(log, io);
+ spin_lock_irq(&log->io_list_lock);
}
spin_unlock_irq(&log->io_list_lock);
@@ -1048,6 +1059,9 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
INIT_LIST_HEAD(&log->running_ios);
INIT_LIST_HEAD(&log->io_end_ios);
INIT_LIST_HEAD(&log->stripe_end_ios);
+ INIT_LIST_HEAD(&log->flushing_ios);
+ INIT_LIST_HEAD(&log->flushed_ios);
+ bio_init(&log->flush_bio);
log->io_kc = KMEM_CACHE(r5l_io_unit, 0);
if (!log->io_kc)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index cd24d58..b247997 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5758,8 +5758,12 @@ static int handle_active_stripes(struct r5conf *conf, int group,
for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
if (!list_empty(temp_inactive_list + i))
break;
- if (i == NR_STRIPE_HASH_LOCKS)
+ if (i == NR_STRIPE_HASH_LOCKS) {
+ spin_unlock_irq(&conf->device_lock);
+ r5l_flush_stripe_to_raid(conf->log);
+ spin_lock_irq(&conf->device_lock);
return batch_size;
+ }
release_inactive = true;
}
spin_unlock_irq(&conf->device_lock);
@@ -5767,6 +5771,7 @@ static int handle_active_stripes(struct r5conf *conf, int group,
release_inactive_stripe_list(conf, temp_inactive_list,
NR_STRIPE_HASH_LOCKS);
+ r5l_flush_stripe_to_raid(conf->log);
if (release_inactive) {
spin_lock_irq(&conf->device_lock);
return 0;
--
1.8.1
* [PATCH 4/8] raid5-cache: fix a use-after-free bug
2015-09-02 20:49 [PATCH 0/8] raid5-cache fixes Shaohua Li
` (2 preceding siblings ...)
2015-09-02 20:49 ` [PATCH 3/8] raid5-cache: switching to state machine for log disk cache flush Shaohua Li
@ 2015-09-02 20:49 ` Shaohua Li
2015-09-02 20:49 ` [PATCH 5/8] raid5-cache: move functionality out of __r5l_set_io_unit_state Shaohua Li
` (4 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: Shaohua Li @ 2015-09-02 20:49 UTC (permalink / raw)
To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb
r5l_compress_stripe_end_list() can free an io_unit. This breaks the
assumption that only the reclaimer can free an io_unit. We could add
reference-count based io_unit freeing, but since only reclaim waits for
an io_unit to reach the STRIPE_END state, we use a simple global wait
queue here.
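A compact sketch of the new waiting scheme (illustration only: pthreads
stand in for the kernel wait queue and a counter stands in for the
stripe_end_ios list); the waiter holds no io_unit pointer, so a freed
io_unit can no longer be touched:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t io_list_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t iounit_wait = PTHREAD_COND_INITIALIZER;
static int stripe_end_ios;

/* completion side: an io_unit reached STRIPE_END (and may be freed) */
static void *io_unit_stripe_end(void *arg)
{
        pthread_mutex_lock(&io_list_lock);
        stripe_end_ios++;
        pthread_cond_broadcast(&iounit_wait);
        pthread_mutex_unlock(&io_list_lock);
        return NULL;
}

/* reclaim side: wait on the global queue, not on one io_unit */
static void kick_io_unit(void)
{
        pthread_mutex_lock(&io_list_lock);
        while (stripe_end_ios == 0)
                pthread_cond_wait(&iounit_wait, &io_list_lock);
        pthread_mutex_unlock(&io_list_lock);
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, io_unit_stripe_end, NULL);
        kick_io_unit();         /* returns once a STRIPE_END arrives */
        pthread_join(t, NULL);
        puts("reclaim woke up");
        return 0;
}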
Signed-off-by: Shaohua Li <shli@fb.com>
---
drivers/md/raid5-cache.c | 18 +++++++-----------
1 file changed, 7 insertions(+), 11 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 86b836c..2f5e2b8 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -78,6 +78,7 @@ struct r5l_log {
* which are in IO_UNIT_STRIPE_END state (eg,
* reclaim dones't wait for specific io_unit
* switching to IO_UNIT_STRIPE_END state) */
+ wait_queue_head_t iounit_wait;
struct list_head no_space_stripes; /* pending stripes, log has no space */
spinlock_t no_space_stripes_lock;
@@ -108,7 +109,6 @@ struct r5l_io_unit {
struct list_head stripe_list; /* stripes added to the io_unit */
int state;
- wait_queue_head_t wait_state;
};
/* r5l_io_unit state */
@@ -161,7 +161,6 @@ static struct r5l_io_unit *r5l_alloc_io_unit(struct r5l_log *log)
INIT_LIST_HEAD(&io->log_sibling);
INIT_LIST_HEAD(&io->stripe_list);
io->state = IO_UNIT_RUNNING;
- init_waitqueue_head(&io->wait_state);
return io;
}
@@ -242,8 +241,8 @@ static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
r5l_wake_reclaim(log, 0);
r5l_compress_stripe_end_list(log);
+ wake_up(&log->iounit_wait);
}
- wake_up(&io->wait_state);
}
static void r5l_set_io_unit_state(struct r5l_io_unit *io,
@@ -617,10 +616,11 @@ void r5l_flush_stripe_to_raid(struct r5l_log *log)
submit_bio(WRITE_FLUSH, &log->flush_bio);
}
-static void r5l_kick_io_unit(struct r5l_log *log, struct r5l_io_unit *io)
+static void r5l_kick_io_unit(struct r5l_log *log)
{
md_wakeup_thread(log->rdev->mddev->thread);
- wait_event(io->wait_state, io->state >= IO_UNIT_STRIPE_END);
+ wait_event_lock_irq(log->iounit_wait, !list_empty(&log->stripe_end_ios),
+ log->io_list_lock);
}
static void r5l_write_super(struct r5l_log *log, sector_t cp);
@@ -665,12 +665,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
else if (!list_empty(&log->running_ios))
target_list = &log->running_ios;
- io = list_first_entry(target_list,
- struct r5l_io_unit, log_sibling);
- spin_unlock_irq(&log->io_list_lock);
- /* nobody else can delete the io, we are safe */
- r5l_kick_io_unit(log, io);
- spin_lock_irq(&log->io_list_lock);
+ r5l_kick_io_unit(log);
}
spin_unlock_irq(&log->io_list_lock);
@@ -1071,6 +1066,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
log->rdev->mddev, "reclaim");
if (!log->reclaim_thread)
goto reclaim_thread;
+ init_waitqueue_head(&log->iounit_wait);
INIT_LIST_HEAD(&log->no_space_stripes);
spin_lock_init(&log->no_space_stripes_lock);
--
1.8.1
* [PATCH 5/8] raid5-cache: move functionality out of __r5l_set_io_unit_state
2015-09-02 20:49 [PATCH 0/8] raid5-cache fixes Shaohua Li
` (3 preceding siblings ...)
2015-09-02 20:49 ` [PATCH 4/8] raid5-cache: fix a use-after-free bug Shaohua Li
@ 2015-09-02 20:49 ` Shaohua Li
2015-09-02 20:49 ` [PATCH 6/8] raid5-cache: optimize FLUSH IO with log enabled Shaohua Li
` (3 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: Shaohua Li @ 2015-09-02 20:49 UTC (permalink / raw)
To: linux-raid
Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb,
Christoph Hellwig
From: Christoph Hellwig <hch@lst.de>
Just keep __r5l_set_io_unit_state as a small wrapper that sets the
state, and remove r5l_set_io_unit_state entirely after moving the real
functionality to the two callers that need it.
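The refactor follows the usual kernel convention that a
double-underscore helper assumes the caller already holds the lock; a
schematic standalone sketch of the shape (not the real md code):

#include <pthread.h>

struct unit { int state; };
static pthread_mutex_t io_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* __ prefix: caller holds io_list_lock; only sets the state */
static void __set_unit_state(struct unit *u, int state)
{
        if (u->state < state)
                u->state = state;
}

/* each former caller takes the lock itself and performs the list
 * movement (elided here) that used to hide inside the wrapper */
static void unit_io_end(struct unit *u)
{
        pthread_mutex_lock(&io_list_lock);
        __set_unit_state(u, 2 /* IO_END */);
        pthread_mutex_unlock(&io_list_lock);
}

int main(void)
{
        struct unit u = { 0 };

        unit_io_end(&u);
        return u.state == 2 ? 0 : 1;
}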
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
---
drivers/md/raid5-cache.c | 81 +++++++++++++++++++++++-------------------------
1 file changed, 39 insertions(+), 42 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 2f5e2b8..5a216ab 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -214,59 +214,31 @@ static void r5l_compress_stripe_end_list(struct r5l_log *log)
list_add_tail(&last->log_sibling, &log->stripe_end_ios);
}
-static void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
enum r5l_io_unit_state state)
{
- struct r5l_log *log = io->log;
-
if (WARN_ON(io->state >= state))
return;
io->state = state;
- if (state == IO_UNIT_IO_END)
- r5l_move_io_unit_list(&log->running_ios, &log->io_end_ios,
- IO_UNIT_IO_END);
- if (state == IO_UNIT_STRIPE_END) {
- struct r5l_io_unit *last;
- sector_t reclaimable_space;
-
- r5l_move_io_unit_list(&log->flushed_ios, &log->stripe_end_ios,
- IO_UNIT_STRIPE_END);
-
- last = list_last_entry(&log->stripe_end_ios,
- struct r5l_io_unit, log_sibling);
- reclaimable_space = r5l_ring_distance(log, log->last_checkpoint,
- last->log_end);
- if (reclaimable_space >= log->max_free_space)
- r5l_wake_reclaim(log, 0);
-
- r5l_compress_stripe_end_list(log);
- wake_up(&log->iounit_wait);
- }
}
-static void r5l_set_io_unit_state(struct r5l_io_unit *io,
- enum r5l_io_unit_state state)
-{
- struct r5l_log *log = io->log;
- unsigned long flags;
-
- spin_lock_irqsave(&log->io_list_lock, flags);
- __r5l_set_io_unit_state(io, state);
- spin_unlock_irqrestore(&log->io_list_lock, flags);
-}
-
-static void r5l_log_endio(struct bio *bio, int error)
+static inline void r5l_log_endio(struct bio *bio, int error)
{
struct r5l_io_unit *io = bio->bi_private;
struct r5l_log *log = io->log;
+ unsigned long flags;
bio_put(bio);
if (!atomic_dec_and_test(&io->pending_io))
return;
- r5l_set_io_unit_state(io, IO_UNIT_IO_END);
+ spin_lock_irqsave(&log->io_list_lock, flags);
+ __r5l_set_io_unit_state(io, IO_UNIT_IO_END);
+ r5l_move_io_unit_list(&log->running_ios, &log->io_end_ios,
+ IO_UNIT_IO_END);
+ spin_unlock_irqrestore(&log->io_list_lock, flags);
+
md_wakeup_thread(log->rdev->mddev->thread);
}
@@ -275,6 +247,7 @@ static void r5l_submit_current_io(struct r5l_log *log)
struct r5l_io_unit *io = log->current_io;
struct r5l_meta_block *block;
struct bio *bio;
+ unsigned long flags;
u32 crc;
if (!io)
@@ -286,7 +259,9 @@ static void r5l_submit_current_io(struct r5l_log *log)
block->checksum = cpu_to_le32(crc);
log->current_io = NULL;
- r5l_set_io_unit_state(io, IO_UNIT_IO_START);
+ spin_lock_irqsave(&log->io_list_lock, flags);
+ __r5l_set_io_unit_state(io, IO_UNIT_IO_START);
+ spin_unlock_irqrestore(&log->io_list_lock, flags);
while ((bio = bio_list_pop(&io->bios))) {
/* all IO must start from rdev->data_offset */
@@ -451,6 +426,7 @@ static void r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
sh->log_io = io;
}
+static void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
/*
* running in raid5d, where reclaim could wait for raid5d too (when it flushes
* data from log to raid disks), so we shouldn't wait for reclaim here
@@ -542,18 +518,39 @@ static void r5l_run_no_space_stripes(struct r5l_log *log)
spin_unlock(&log->no_space_stripes_lock);
}
+static void __r5l_stripe_write_finished(struct r5l_io_unit *io)
+{
+ struct r5l_log *log = io->log;
+ struct r5l_io_unit *last;
+ sector_t reclaimable_space;
+ unsigned long flags;
+
+ spin_lock_irqsave(&log->io_list_lock, flags);
+ __r5l_set_io_unit_state(io, IO_UNIT_STRIPE_END);
+ r5l_move_io_unit_list(&log->flushed_ios, &log->stripe_end_ios,
+ IO_UNIT_STRIPE_END);
+
+ last = list_last_entry(&log->stripe_end_ios,
+ struct r5l_io_unit, log_sibling);
+ reclaimable_space = r5l_ring_distance(log, log->last_checkpoint,
+ last->log_end);
+ if (reclaimable_space >= log->max_free_space)
+ r5l_wake_reclaim(log, 0);
+
+ r5l_compress_stripe_end_list(log);
+ spin_unlock_irqrestore(&log->io_list_lock, flags);
+ wake_up(&log->iounit_wait);
+}
+
void r5l_stripe_write_finished(struct stripe_head *sh)
{
struct r5l_io_unit *io;
- /* Don't support stripe batch */
io = sh->log_io;
- if (!io)
- return;
sh->log_io = NULL;
- if (atomic_dec_and_test(&io->pending_stripe))
- r5l_set_io_unit_state(io, IO_UNIT_STRIPE_END);
+ if (io && atomic_dec_and_test(&io->pending_stripe))
+ __r5l_stripe_write_finished(io);
}
static void r5l_log_flush_endio(struct bio *bio, int error)
--
1.8.1
* [PATCH 6/8] raid5-cache: optimize FLUSH IO with log enabled
2015-09-02 20:49 [PATCH 0/8] raid5-cache fixes Shaohua Li
` (4 preceding siblings ...)
2015-09-02 20:49 ` [PATCH 5/8] raid5-cache: move functionality out of __r5l_set_io_unit_state Shaohua Li
@ 2015-09-02 20:49 ` Shaohua Li
2015-09-02 20:49 ` [PATCH 7/8] md: skip resync for raid array with journal Shaohua Li
` (2 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: Shaohua Li @ 2015-09-02 20:49 UTC (permalink / raw)
To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb
With the log enabled, a bio is written to the raid disks only after it
has settled down in the log disk. Recovery guarantees we can recover the
bio data from the log disk, so we can skip the FLUSH IO.
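The caller-side contract of the new hook, modeled as a standalone sketch
(illustration only; struct bio is reduced to two fields and the names
are made up for the demo):

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

#define DEMO_REQ_FLUSH 1u

struct demo_bio { unsigned int size; unsigned int rw; };

/* mirrors the three-way return of r5l_handle_flush_request() */
static int handle_flush_request(bool log_enabled, struct demo_bio *bio)
{
        if (!log_enabled)
                return -ENODEV;     /* fall back to md_flush_request() */
        if (bio->size == 0)
                return 0;           /* empty flush: complete immediately */
        bio->rw &= ~DEMO_REQ_FLUSH; /* data is durable once in the log */
        return -EAGAIN;             /* caller submits the data part normally */
}

int main(void)
{
        struct demo_bio b = { .size = 4096, .rw = DEMO_REQ_FLUSH };

        printf("no log:    %d\n", handle_flush_request(false, &b));
        printf("with data: %d (rw now %u)\n",
               handle_flush_request(true, &b), b.rw);
        b.size = 0;
        printf("empty:     %d\n", handle_flush_request(true, &b));
        return 0;
}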
Signed-off-by: Shaohua Li <shli@fb.com>
---
drivers/md/raid5-cache.c | 18 ++++++++++++++++++
drivers/md/raid5.c | 10 ++++++++--
drivers/md/raid5.h | 1 +
3 files changed, 27 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 5a216ab..27fb513 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -502,6 +502,24 @@ void r5l_write_stripe_run(struct r5l_log *log)
mutex_unlock(&log->io_mutex);
}
+int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio)
+{
+ if (!log)
+ return -ENODEV;
+ /*
+ * we flush log disk cache first, then write stripe data to raid disks.
+ * So if bio is finished, the log disk cache is flushed already. The
+ * recovery guarantees we can recover the bio from log disk, so we
+ * don't need to flush again
+ * */
+ if (bio->bi_iter.bi_size == 0) {
+ bio_endio(bio, 0);
+ return 0;
+ }
+ bio->bi_rw &= ~REQ_FLUSH;
+ return -EAGAIN;
+}
+
/* This will run after log space is reclaimed */
static void r5l_run_no_space_stripes(struct r5l_log *log)
{
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b247997..394cdf8 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5166,8 +5166,14 @@ static void make_request(struct mddev *mddev, struct bio * bi)
bool do_prepare;
if (unlikely(bi->bi_rw & REQ_FLUSH)) {
- md_flush_request(mddev, bi);
- return;
+ int ret = r5l_handle_flush_request(conf->log, bi);
+ if (ret == 0)
+ return;
+ if (ret == -ENODEV) {
+ md_flush_request(mddev, bi);
+ return;
+ }
+ /* ret == -EAGAIN, fallback */
}
md_write_start(mddev, bi);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 7ecd7d4..e6b9a40 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -629,4 +629,5 @@ extern int r5l_write_stripe(struct r5l_log *log, struct stripe_head *head_sh);
extern void r5l_write_stripe_run(struct r5l_log *log);
extern void r5l_flush_stripe_to_raid(struct r5l_log *log);
extern void r5l_stripe_write_finished(struct stripe_head *sh);
+extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
#endif
--
1.8.1
* [PATCH 7/8] md: skip resync for raid array with journal
2015-09-02 20:49 [PATCH 0/8] raid5-cache fixes Shaohua Li
` (5 preceding siblings ...)
2015-09-02 20:49 ` [PATCH 6/8] raid5-cache: optimize FLUSH IO with log enabled Shaohua Li
@ 2015-09-02 20:49 ` Shaohua Li
2015-09-30 4:24 ` Neil Brown
2015-09-02 20:49 ` [PATCH 8/8] raid5-cache: add trim support for log Shaohua Li
2015-09-30 5:36 ` [PATCH 0/8] raid5-cache fixes Neil Brown
8 siblings, 1 reply; 14+ messages in thread
From: Shaohua Li @ 2015-09-02 20:49 UTC (permalink / raw)
To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb
If a raid array has a journal, the journal can guarantee consistency,
so we can skip resync after an unclean shutdown. The exceptions are
array creation and user-initiated resync, for which we still do a full
resync.
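The superblock-side decision reduces to a small pure function; a
standalone sketch (illustration only; ~0ULL stands in for the kernel's
MaxSector):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_MAX_SECTOR (~0ULL)

/* what super_1_sync() writes as resync_offset after this patch */
static uint64_t resync_offset(bool in_sync, bool journal_clean,
                              uint64_t recovery_cp)
{
        if (in_sync)
                return recovery_cp;
        if (journal_clean)
                return DEMO_MAX_SECTOR; /* journal covers the dirty window */
        return 0;                       /* force a full resync */
}

int main(void)
{
        /* unclean shutdown, but the journal was clean when assembled */
        printf("%llu\n", (unsigned long long)
               resync_offset(false, true, DEMO_MAX_SECTOR));
        return 0;
}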
Signed-off-by: Shaohua Li <shli@fb.com>
---
drivers/md/md.c | 4 ++++
drivers/md/md.h | 1 +
2 files changed, 5 insertions(+)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index b3f9eed..95824fb 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1669,6 +1669,8 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
}
set_bit(Journal, &rdev->flags);
rdev->journal_tail = le64_to_cpu(sb->journal_tail);
+ if (mddev->recovery_cp == MaxSector)
+ set_bit(MD_JOURNAL_CLEAN, &mddev->flags);
break;
default:
rdev->saved_raid_disk = role;
@@ -1711,6 +1713,8 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
sb->events = cpu_to_le64(mddev->events);
if (mddev->in_sync)
sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
+ else if (test_bit(MD_JOURNAL_CLEAN, &mddev->flags))
+ sb->resync_offset = cpu_to_le64(MaxSector);
else
sb->resync_offset = cpu_to_le64(0);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 226f4ba..0288a0b 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -236,6 +236,7 @@ struct mddev {
#define MD_STILL_CLOSED 4 /* If set, then array has not been opened since
* md_ioctl checked on it.
*/
+#define MD_JOURNAL_CLEAN 5 /* A raid with journal is already clean */
int suspended;
atomic_t active_io;
--
1.8.1
* Re: [PATCH 7/8] md: skip resync for raid array with journal
2015-09-02 20:49 ` [PATCH 7/8] md: skip resync for raid array with journal Shaohua Li
@ 2015-09-30 4:24 ` Neil Brown
2015-10-01 17:47 ` Song Liu
0 siblings, 1 reply; 14+ messages in thread
From: Neil Brown @ 2015-09-30 4:24 UTC (permalink / raw)
To: Shaohua Li, linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams
Shaohua Li <shli@fb.com> writes:
> If a raid array has a journal, the journal can guarantee consistency, so
> we can skip resync after an unclean shutdown. The exceptions are array
> creation and user-initiated resync, for which we still do a full resync.
>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
> drivers/md/md.c | 4 ++++
> drivers/md/md.h | 1 +
> 2 files changed, 5 insertions(+)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index b3f9eed..95824fb 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1669,6 +1669,8 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
> }
> set_bit(Journal, &rdev->flags);
> rdev->journal_tail = le64_to_cpu(sb->journal_tail);
> + if (mddev->recovery_cp == MaxSector)
> + set_bit(MD_JOURNAL_CLEAN, &mddev->flags);
> break;
> default:
> rdev->saved_raid_disk = role;
> @@ -1711,6 +1713,8 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
> sb->events = cpu_to_le64(mddev->events);
> if (mddev->in_sync)
> sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
> + else if (test_bit(MD_JOURNAL_CLEAN, &mddev->flags))
> + sb->resync_offset = cpu_to_le64(MaxSector);
> else
> sb->resync_offset = cpu_to_le64(0);
>
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 226f4ba..0288a0b 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -236,6 +236,7 @@ struct mddev {
> #define MD_STILL_CLOSED 4 /* If set, then array has not been opened since
> * md_ioctl checked on it.
> */
> +#define MD_JOURNAL_CLEAN 5 /* A raid with journal is already clean */
>
> int suspended;
> atomic_t active_io;
> --
> 1.8.1
This looks right as far as it goes, but I don't think it goes far
enough.
The particular scenario that bothers me is if the array is started
without the journal being present.
I cannot see anything to prevent that - is there?
In that case we need to assume that the array is not in-sync,
and we need to clear MD_FEATURE_JOURNAL so that if it gets stopped and
then assembled again the stale journal doesn't get used.
One unfortunate side effect of that is that you couldn't stop the array
cleanly (leaving the journal effectively empty) and then restart with no
journal and no resync. Is that a problem I wonder?
I'm not sure what the best solution is here, but we need a clear
understanding of what happens if you try to assemble an array without
the journal where previously it had one, and I don't think the current
code gets it right.
Thanks,
NeilBrown
* RE: [PATCH 7/8] md: skip resync for raid array with journal
2015-09-30 4:24 ` Neil Brown
@ 2015-10-01 17:47 ` Song Liu
2015-10-01 23:50 ` Neil Brown
0 siblings, 1 reply; 14+ messages in thread
From: Song Liu @ 2015-10-01 17:47 UTC (permalink / raw)
To: Neil Brown, Shaohua Li, linux-raid@vger.kernel.org
Cc: Kernel Team, hch@infradead.org, dan.j.williams@intel.com
> -----Original Message-----
> From: Neil Brown [mailto:neilb@suse.de]
> Sent: Tuesday, September 29, 2015 9:25 PM
> To: Shaohua Li; linux-raid@vger.kernel.org
> Cc: Kernel Team; Song Liu; hch@infradead.org; dan.j.williams@intel.com
> Subject: Re: [PATCH 7/8] md: skip resync for raid array with journal
>
> Shaohua Li <shli@fb.com> writes:
>
> > If a raid array has a journal, the journal can guarantee consistency,
> > so we can skip resync after an unclean shutdown. The exceptions are
> > array creation and user-initiated resync, for which we still do a full
> > resync.
> >
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> > drivers/md/md.c | 4 ++++
> > drivers/md/md.h | 1 +
> > 2 files changed, 5 insertions(+)
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index b3f9eed..95824fb 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -1669,6 +1669,8 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
> > }
> > set_bit(Journal, &rdev->flags);
> > rdev->journal_tail = le64_to_cpu(sb->journal_tail);
> > + if (mddev->recovery_cp == MaxSector)
> > + set_bit(MD_JOURNAL_CLEAN, &mddev->flags);
> > break;
> > default:
> > rdev->saved_raid_disk = role;
> > @@ -1711,6 +1713,8 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
> > sb->events = cpu_to_le64(mddev->events);
> > if (mddev->in_sync)
> > sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
> > + else if (test_bit(MD_JOURNAL_CLEAN, &mddev->flags))
> > + sb->resync_offset = cpu_to_le64(MaxSector);
> > else
> > sb->resync_offset = cpu_to_le64(0);
> >
> > diff --git a/drivers/md/md.h b/drivers/md/md.h
> > index 226f4ba..0288a0b 100644
> > --- a/drivers/md/md.h
> > +++ b/drivers/md/md.h
> > @@ -236,6 +236,7 @@ struct mddev {
> > #define MD_STILL_CLOSED 4 /* If set, then array has not been opened since
> > * md_ioctl checked on it.
> > */
> > +#define MD_JOURNAL_CLEAN 5 /* A raid with journal is already clean */
> >
> > int suspended;
> > atomic_t active_io;
> > --
> > 1.8.1
>
> This looks right as far as it goes, but I don't think it goes far enough.
> The particular scenario that bothers me is if the array is started without the
> journal being present.
> I cannot see anything to prevent that - is there?
I have sent mdadm patches that prevent the array from assembling with a
missing journal, for both --assemble and --incremental:
http://marc.info/?l=linux-raid&m=144080445720123
http://marc.info/?l=linux-raid&m=144080446120127
We can force start in both cases with --force or --run.
Currently, mdadm doesn't clear the MD_FEATURE_JOURNAL bit, so we get a
warning every time.
I did miss one step: marking the missing journal as faulty.
>
> In that case we need to assume that the array is not in-sync, and we need to
> clear MD_FEATURE_JOURNAL so that if it gets stopped and then assembled
> again the stale journal doesn't get used.
>
> One unfortunate side effect of that is that you couldn't stop the array cleanly
> (leaving the journal effectively empty) and then restart with no journal and
> no resync. Is that a problem I wonder?
>
> I'm not sure what the best solution is here, but we need a clear
> understanding of what happens if you try to assemble an array without the
> journal where previously it had one, and I don't think the current code gets it
> right.
I just talked with Shaohua. We think we can probably do the following:
1. When assembling with the cache device missing, show the warning (with
the --force option). Run a resync if the previous shutdown was not clean
(this is the default behavior).
2. When we get an I/O error on the journal device at run time, make the
whole array read-only.
3. Add a new operation that adds a new journal device to an array with a
missing journal. We can limit this option to inactive arrays only.
Would this flow cover all the cases?
Thanks,
Song
* RE: [PATCH 7/8] md: skip resync for raid array with journal
2015-10-01 17:47 ` Song Liu
@ 2015-10-01 23:50 ` Neil Brown
0 siblings, 0 replies; 14+ messages in thread
From: Neil Brown @ 2015-10-01 23:50 UTC (permalink / raw)
To: Song Liu, Shaohua Li, linux-raid@vger.kernel.org
Cc: Kernel Team, hch@infradead.org, dan.j.williams@intel.com
Song Liu <songliubraving@fb.com> writes:
>> -----Original Message-----
>> From: Neil Brown [mailto:neilb@suse.de]
>> Sent: Tuesday, September 29, 2015 9:25 PM
>> To: Shaohua Li; linux-raid@vger.kernel.org
>> Cc: Kernel Team; Song Liu; hch@infradead.org; dan.j.williams@intel.com
>> Subject: Re: [PATCH 7/8] md: skip resync for raid array with journal
>>
>> Shaohua Li <shli@fb.com> writes:
>>
>> > If a raid array has a journal, the journal can guarantee consistency,
>> > so we can skip resync after an unclean shutdown. The exceptions are
>> > array creation and user-initiated resync, for which we still do a full
>> > resync.
>> >
>> > Signed-off-by: Shaohua Li <shli@fb.com>
>> > ---
>> > drivers/md/md.c | 4 ++++
>> > drivers/md/md.h | 1 +
>> > 2 files changed, 5 insertions(+)
>> >
>> > diff --git a/drivers/md/md.c b/drivers/md/md.c
>> > index b3f9eed..95824fb 100644
>> > --- a/drivers/md/md.c
>> > +++ b/drivers/md/md.c
>> > @@ -1669,6 +1669,8 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
>> > }
>> > set_bit(Journal, &rdev->flags);
>> > rdev->journal_tail = le64_to_cpu(sb->journal_tail);
>> > + if (mddev->recovery_cp == MaxSector)
>> > + set_bit(MD_JOURNAL_CLEAN, &mddev->flags);
>> > break;
>> > default:
>> > rdev->saved_raid_disk = role;
>> > @@ -1711,6 +1713,8 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
>> > sb->events = cpu_to_le64(mddev->events);
>> > if (mddev->in_sync)
>> > sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
>> > + else if (test_bit(MD_JOURNAL_CLEAN, &mddev->flags))
>> > + sb->resync_offset = cpu_to_le64(MaxSector);
>> > else
>> > sb->resync_offset = cpu_to_le64(0);
>> >
>> > diff --git a/drivers/md/md.h b/drivers/md/md.h
>> > index 226f4ba..0288a0b 100644
>> > --- a/drivers/md/md.h
>> > +++ b/drivers/md/md.h
>> > @@ -236,6 +236,7 @@ struct mddev {
>> > #define MD_STILL_CLOSED 4 /* If set, then array has not been opened since
>> > * md_ioctl checked on it.
>> > */
>> > +#define MD_JOURNAL_CLEAN 5 /* A raid with journal is already clean */
>> >
>> > int suspended;
>> > atomic_t active_io;
>> > --
>> > 1.8.1
>>
>> This looks right as far as it goes, but I don't think it goes far enough.
>> The particular scenario that bothers me is if the array is started without the
>> journal being present.
>> I cannot see anything to prevent that - is there?
>
> I have sent mdadm patches that prevent the array from assembling with a
> missing journal, for both --assemble and --incremental:
>
> http://marc.info/?l=linux-raid&m=144080445720123
> http://marc.info/?l=linux-raid&m=144080446120127
>
That is probably good and useful, but the kernel cannot rely on mdadm to
protect it. If the kernel is asked to start a journalled array when no
journal is present, it must do something that is safe.
That might be to refuse, it might be to force a resync, it might be to
somehow "know" if the journal is empty and only force a resync in that
case.
This should be treated much like a "dirty/degraded" start. Currently
the kernel refuses to do that unless a module parameter is set. When
mdadm is asked to --force start a dirty/degraded array, it modifies the
metadata so that it doesn't appear to be degraded. Something similar
might be best for the missing-journal case.
I *think* the current kernel code will allow the array to be started
without a resync even if the journal was not empty, and that is not
safe.
> We can force start in both cases with --force or --run.
>
> Currently, mdadm doesn't clear the MD_FEATURE_JOURNAL bit, so we get a
> warning every time.
>
> I did miss one step: marking the missing journal as faulty.
>
>>
>> In that case we need to assume that the array is not in-sync, and we need to
>> clear MD_FEATURE_JOURNAL so that if it gets stopped and then assembled
>> again the stale journal doesn't get used.
>>
>> One unfortunate side effect of that is that you couldn't stop the array cleanly
>> (leaving the journal effectively empty) and then restart with no journal and
>> no resync. Is that a problem I wonder?
>>
>> I'm not sure what the best solution is here, but we need a clear
>> understanding of what happens if you try to assemble an array without the
>> journal where previously it had one, and I don't think the current code gets it
>> right.
>
> I just talked with Shaohua. We think we can probably do the following:
>
> 1. When assembling with the cache device missing, show the warning (with
> the --force option). Run a resync if the previous shutdown was not clean
> (this is the default behavior).
Probably sensible. But I think the current code never marks the array as
"not clean" when there is a journal.
>
> 2. When we get an I/O error on the journal device at run time, make the
> whole array read-only.
Yes, I think that is best.
>
> 3. Add a new operation that adds a new journal device to an array with a
> missing journal. We can limit this option to inactive arrays only.
It would be great if we could add a journal to an active array...
>
> Would this flow cover all the cases?
Quite possibly. I don't want to commit myself until I see the code :-)
Thanks,
NeilBrown
* [PATCH 8/8] raid5-cache: add trim support for log
2015-09-02 20:49 [PATCH 0/8] raid5-cache fixes Shaohua Li
` (6 preceding siblings ...)
2015-09-02 20:49 ` [PATCH 7/8] md: skip resync for raid array with journal Shaohua Li
@ 2015-09-02 20:49 ` Shaohua Li
2015-09-30 4:14 ` Neil Brown
2015-09-30 5:36 ` [PATCH 0/8] raid5-cache fixes Neil Brown
8 siblings, 1 reply; 14+ messages in thread
From: Shaohua Li @ 2015-09-02 20:49 UTC (permalink / raw)
To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb
Since the superblock is updated infrequently, we do a simple trim of
the log disk (a synchronous trim).
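The wrap-around handling in the new helper, as a standalone sketch
(illustration only; issue_discard() is a stand-in for
blkdev_issue_discard() and data_offset is ignored):

#include <stdint.h>
#include <stdio.h>

static void issue_discard(uint64_t start, uint64_t len)
{
        printf("discard [%llu, +%llu)\n",
               (unsigned long long)start, (unsigned long long)len);
}

/* discard the reclaimed region [last_checkpoint, end) of a circular
 * log of device_size sectors, splitting it in two when it wraps */
static void discard_reclaimed(uint64_t last_checkpoint, uint64_t end,
                              uint64_t device_size)
{
        if (last_checkpoint < end) {
                issue_discard(last_checkpoint, end - last_checkpoint);
        } else {
                issue_discard(last_checkpoint,
                              device_size - last_checkpoint);
                issue_discard(0, end);
        }
}

int main(void)
{
        discard_reclaimed(100, 400, 1000);      /* one range */
        discard_reclaimed(900, 200, 1000);      /* wraps: two ranges */
        return 0;
}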
Signed-off-by: Shaohua Li <shli@fb.com>
---
drivers/md/raid5-cache.c | 30 +++++++++++++++++++++++++++++-
1 file changed, 29 insertions(+), 1 deletion(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 27fb513..410b85b 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -639,6 +639,34 @@ static void r5l_kick_io_unit(struct r5l_log *log)
}
static void r5l_write_super(struct r5l_log *log, sector_t cp);
+static void r5l_write_super_and_discard_space(struct r5l_log *log,
+ sector_t end)
+{
+ struct block_device *bdev = log->rdev->bdev;
+
+ r5l_write_super(log, end);
+
+ if (!blk_queue_discard(bdev_get_queue(bdev)))
+ return;
+
+ /* discard destroy old data in log, so force a super update */
+ md_update_sb(log->rdev->mddev, 1);
+
+ if (log->last_checkpoint < end) {
+ blkdev_issue_discard(bdev,
+ log->last_checkpoint + log->rdev->data_offset,
+ end - log->last_checkpoint, GFP_NOIO, 0);
+ } else {
+ blkdev_issue_discard(bdev,
+ log->last_checkpoint + log->rdev->data_offset,
+ log->device_size - log->last_checkpoint,
+ GFP_NOIO, 0);
+ blkdev_issue_discard(bdev, log->rdev->data_offset, end,
+ GFP_NOIO, 0);
+ }
+}
+
+
static void r5l_do_reclaim(struct r5l_log *log)
{
struct r5l_io_unit *io, *last;
@@ -694,7 +722,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
* here, because the log area might be reused soon and we don't want to
* confuse recovery
* */
- r5l_write_super(log, last->log_start);
+ r5l_write_super_and_discard_space(log, last->log_start);
mutex_lock(&log->io_mutex);
log->last_checkpoint = last->log_start;
--
1.8.1
* Re: [PATCH 8/8] raid5-cache: add trim support for log
2015-09-02 20:49 ` [PATCH 8/8] raid5-cache: add trim support for log Shaohua Li
@ 2015-09-30 4:14 ` Neil Brown
0 siblings, 0 replies; 14+ messages in thread
From: Neil Brown @ 2015-09-30 4:14 UTC (permalink / raw)
To: Shaohua Li, linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams
Shaohua Li <shli@fb.com> writes:
> Since the superblock is updated infrequently, we do a simple trim of
> the log disk (a synchronous trim)
>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
> drivers/md/raid5-cache.c | 30 +++++++++++++++++++++++++++++-
> 1 file changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 27fb513..410b85b 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -639,6 +639,34 @@ static void r5l_kick_io_unit(struct r5l_log *log)
> }
>
> static void r5l_write_super(struct r5l_log *log, sector_t cp);
> +static void r5l_write_super_and_discard_space(struct r5l_log *log,
> + sector_t end)
> +{
> + struct block_device *bdev = log->rdev->bdev;
> +
> + r5l_write_super(log, end);
> +
> + if (!blk_queue_discard(bdev_get_queue(bdev)))
> + return;
> +
> + /* discard destroy old data in log, so force a super update */
> + md_update_sb(log->rdev->mddev, 1);
I don't think this can go here.
md_update_sb() is currently always called with ->reconfig_mutex held,
and I think that should stay.
Instead you could:
set_bit(MD_CHANGE_PENDING, &mddev->flags);
md_wakeup_thread(mddev->thread);
wait_event(mddev->sb_wait,
!test_bit(MD_CHANGE_PENDING, &mddev->flags));
a bit like md_write_start().
Thanks,
NeilBrown
> +
> + if (log->last_checkpoint < end) {
> + blkdev_issue_discard(bdev,
> + log->last_checkpoint + log->rdev->data_offset,
> + end - log->last_checkpoint, GFP_NOIO, 0);
> + } else {
> + blkdev_issue_discard(bdev,
> + log->last_checkpoint + log->rdev->data_offset,
> + log->device_size - log->last_checkpoint,
> + GFP_NOIO, 0);
> + blkdev_issue_discard(bdev, log->rdev->data_offset, end,
> + GFP_NOIO, 0);
> + }
> +}
> +
> +
> static void r5l_do_reclaim(struct r5l_log *log)
> {
> struct r5l_io_unit *io, *last;
> @@ -694,7 +722,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
> * here, because the log area might be reused soon and we don't want to
> * confuse recovery
> * */
> - r5l_write_super(log, last->log_start);
> + r5l_write_super_and_discard_space(log, last->log_start);
>
> mutex_lock(&log->io_mutex);
> log->last_checkpoint = last->log_start;
> --
> 1.8.1
* Re: [PATCH 0/8] raid5-cache fixes
2015-09-02 20:49 [PATCH 0/8] raid5-cache fixes Shaohua Li
` (7 preceding siblings ...)
2015-09-02 20:49 ` [PATCH 8/8] raid5-cache: add trim support for log Shaohua Li
@ 2015-09-30 5:36 ` Neil Brown
8 siblings, 0 replies; 14+ messages in thread
From: Neil Brown @ 2015-09-30 5:36 UTC (permalink / raw)
To: Shaohua Li, linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams
Shaohua Li <shli@fb.com> writes:
> Fix some bugs, improve performance and add trim/resync support.
>
> Thanks,
> Shaohua
>
> Christoph Hellwig (1):
> raid5-cache: move functionality out of __r5l_set_io_unit_state
>
> Shaohua Li (7):
> md: fix feature map check
> raid5: fix build error
> raid5-cache: switching to state machine for log disk cache flush
> raid5-cache: fix a use-after-free bug
> raid5-cache: optimize FLUSH IO with log enabled
> md: skip resync for raid array with journal
> raid5-cache: add trim support for log
>
> drivers/md/md.c | 6 +-
> drivers/md/md.h | 1 +
> drivers/md/raid5-cache.c | 254 ++++++++++++++++++++++++++++-------------------
> drivers/md/raid5.c | 17 +++-
> drivers/md/raid5.h | 1 +
> 5 files changed, 175 insertions(+), 104 deletions(-)
>
> --
> 1.8.1
Thanks.
I have applied all except
> raid5-cache: add trim support for log
which I have replied to separately.
I have also applied the other two patches you sent separately.
I've made a number of white-space fixes throughout the series,
and I've merged Christoph's "port to 4.3-rc" patch into various places
in the series.
This is all in the 'devel' branch of git://neil.brown.name/md/
Thanks,
NeilBrown