linux-raid.vger.kernel.org archive mirror
* [PATCH v3 0/8] MD: a caching layer for raid5/6
@ 2015-06-03 22:48 Shaohua Li
  2015-06-03 22:48 ` [PATCH v3 1/8] MD: add a new disk role to present cache device Shaohua Li
                   ` (8 more replies)
  0 siblings, 9 replies; 13+ messages in thread
From: Shaohua Li @ 2015-06-03 22:48 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

Hi,

This is the third version of the raid5/6 caching layer patches. The patches
add a caching layer for raid5/6. The caching layer uses an SSD as a cache for
a raid5/6 array and works in much the same way as the cache of a hardware
raid controller. The purpose is to improve raid performance (reduce
read-modify-write) and to fix the write hole issue. The main patch is patch 3
and its description has all the details about the implementation. Please
review!

Thanks,
Shaohua

V3:
-make reclaim multi-thread
-add statistics in sysfs
-bug fixes

V2:
-metadata write doesn't use FUA
-discard request is only issued when necessary
-bug fixes and cleanup

Shaohua Li (7):
  raid5: directly use mddev->queue
  raid5: A caching layer for RAID5/6
  raid5: add some sysfs entries
  md: don't allow resize/reshape with cache support
  raid5: skip resync if caching is enabled
  raid5: guarantee cache release stripes in correct way
  raid5: multi-thread support for raid5 caching reclaim

Song Liu (1):
  MD: add a new disk role to present cache device

 drivers/md/Makefile            |    2 +-
 drivers/md/md.c                |   14 +-
 drivers/md/md.h                |    4 +
 drivers/md/raid5-cache.c       | 3775 ++++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c             |  177 +-
 drivers/md/raid5.h             |   25 +-
 include/uapi/linux/raid/md_p.h |   73 +
 7 files changed, 4022 insertions(+), 48 deletions(-)
 create mode 100644 drivers/md/raid5-cache.c

-- 
1.8.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v3 1/8] MD: add a new disk role to present cache device
  2015-06-03 22:48 [PATCH v3 0/8] MD: a caching layer for raid5/6 Shaohua Li
@ 2015-06-03 22:48 ` Shaohua Li
  2015-06-17 23:32   ` Neil Brown
  2015-06-03 22:48 ` [PATCH v3 2/8] raid5: directly use mddev->queue Shaohua Li
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 13+ messages in thread
From: Shaohua Li @ 2015-06-03 22:48 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

From: Song Liu <songliubraving@fb.com>

The next patches will use a disk as a raid5/6 cache. We need a new disk role
to represent the cache device.

Not sure if we should bump up the MD superblock version for the disk
role.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/md.c                | 14 +++++++++++++-
 drivers/md/md.h                |  4 ++++
 include/uapi/linux/raid/md_p.h |  1 +
 3 files changed, 18 insertions(+), 1 deletion(-)

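For illustration only (not part of the patch): userspace could request the
new role through the existing ADD_NEW_DISK ioctl by setting the
MD_DISK_WRITECACHE bit, roughly as below. The device paths are made up, and
a real tool (mdadm) would also have to write a suitable v1.x superblock to
the device first; the sketch only shows where the new state bit goes:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <linux/raid/md_u.h>
#include <linux/raid/md_p.h>

int main(void)
{
	mdu_disk_info_t info = { 0 };
	struct stat st;
	int fd = open("/dev/md0", O_RDWR);

	if (fd < 0 || stat("/dev/nvme0n1", &st) < 0)
		return 1;

	info.major = major(st.st_rdev);
	info.minor = minor(st.st_rdev);
	info.state = 1 << MD_DISK_WRITECACHE;	/* new cache role from this patch */

	if (ioctl(fd, ADD_NEW_DISK, &info) < 0)
		perror("ADD_NEW_DISK");
	close(fd);
	return 0;
}
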
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2750630..6297087 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1656,6 +1656,9 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
 		case 0xfffe: /* faulty */
 			set_bit(Faulty, &rdev->flags);
 			break;
+		case 0xfffd: /* cache device */
+			set_bit(WriteCache, &rdev->flags);
+			break;
 		default:
 			rdev->saved_raid_disk = role;
 			if ((le32_to_cpu(sb->feature_map) &
@@ -1811,6 +1814,8 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
 			sb->dev_roles[i] = cpu_to_le16(0xfffe);
 		else if (test_bit(In_sync, &rdev2->flags))
 			sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
+		else if (test_bit(WriteCache, &rdev2->flags))
+			sb->dev_roles[i] = cpu_to_le16(0xfffd);
 		else if (rdev2->raid_disk >= 0)
 			sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
 		else
@@ -5780,7 +5785,8 @@ static int get_disk_info(struct mddev *mddev, void __user * arg)
 		else if (test_bit(In_sync, &rdev->flags)) {
 			info.state |= (1<<MD_DISK_ACTIVE);
 			info.state |= (1<<MD_DISK_SYNC);
-		}
+		} else if (test_bit(WriteCache, &rdev->flags))
+			info.state |= (1<<MD_DISK_WRITECACHE);
 		if (test_bit(WriteMostly, &rdev->flags))
 			info.state |= (1<<MD_DISK_WRITEMOSTLY);
 	} else {
@@ -5895,6 +5901,8 @@ static int add_new_disk(struct mddev *mddev, mdu_disk_info_t *info)
 		else
 			clear_bit(WriteMostly, &rdev->flags);
 
+		if (info->state & (1<<MD_DISK_WRITECACHE))
+			set_bit(WriteCache, &rdev->flags);
 		/*
 		 * check whether the device shows up in other nodes
 		 */
@@ -7263,6 +7271,10 @@ static int md_seq_show(struct seq_file *seq, void *v)
 				seq_printf(seq, "(F)");
 				continue;
 			}
+			if (test_bit(WriteCache, &rdev->flags)) {
+				seq_printf(seq, "(C)");
+				continue;
+			}
 			if (rdev->raid_disk < 0)
 				seq_printf(seq, "(S)"); /* spare */
 			if (test_bit(Replacement, &rdev->flags))
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 4046a6c..6857592 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -175,6 +175,10 @@ enum flag_bits {
 				 * This device is seen locally but not
 				 * by the whole cluster
 				 */
+	WriteCache,		/* This device is used as write cache.
+				 * Usually, this device should be faster
+				 * than other devices in the array
+				 */
 };
 
 #define BB_LEN_MASK	(0x00000000000001FFULL)
diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
index 2ae6131..9d36b91 100644
--- a/include/uapi/linux/raid/md_p.h
+++ b/include/uapi/linux/raid/md_p.h
@@ -89,6 +89,7 @@
 				   * read requests will only be sent here in
 				   * dire need
 				   */
+#define MD_DISK_WRITECACHE      18 /* disk is used as the write cache in RAID-5/6 */
 
 typedef struct mdp_device_descriptor_s {
 	__u32 number;		/* 0 Device number in the entire set	      */
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 2/8] raid5: directly use mddev->queue
  2015-06-03 22:48 [PATCH v3 0/8] MD: a caching layer for raid5/6 Shaohua Li
  2015-06-03 22:48 ` [PATCH v3 1/8] MD: add a new disk role to present cache device Shaohua Li
@ 2015-06-03 22:48 ` Shaohua Li
  2015-06-03 22:48 ` [PATCH v3 4/8] raid5: add some sysfs entries Shaohua Li
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2015-06-03 22:48 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

When the cache layer flushes data from the cache disk to the raid disks, it
will dispatch IO to the raid disks. At that point we don't have a block
device attached to the bio, so directly use mddev->queue. This does not
affect IO dispatched to an rdev, which has the rdev's block device attached.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 553d54b..4c122ad 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -223,7 +223,7 @@ static int raid6_idx_to_slot(int idx, struct stripe_head *sh,
 	return slot;
 }
 
-static void return_io(struct bio *return_bi)
+static void return_io(struct r5conf *conf, struct bio *return_bi)
 {
 	struct bio *bi = return_bi;
 	while (bi) {
@@ -231,8 +231,7 @@ static void return_io(struct bio *return_bi)
 		return_bi = bi->bi_next;
 		bi->bi_next = NULL;
 		bi->bi_iter.bi_size = 0;
-		trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
-					 bi, 0);
+		trace_block_bio_complete(conf->mddev->queue, bi, 0);
 		bio_endio(bi, 0);
 		bi = return_bi;
 	}
@@ -1200,7 +1199,7 @@ static void ops_complete_biofill(void *stripe_head_ref)
 	}
 	clear_bit(STRIPE_BIOFILL_RUN, &sh->state);
 
-	return_io(return_bi);
+	return_io(sh->raid_conf, return_bi);
 
 	set_bit(STRIPE_HANDLE, &sh->state);
 	release_stripe(sh);
@@ -4594,7 +4593,7 @@ static void handle_stripe(struct stripe_head *sh)
 			md_wakeup_thread(conf->mddev->thread);
 	}
 
-	return_io(s.return_bi);
+	return_io(conf, s.return_bi);
 
 	clear_bit_unlock(STRIPE_ACTIVE, &sh->state);
 }
@@ -5298,8 +5297,7 @@ static void make_request(struct mddev *mddev, struct bio * bi)
 		if ( rw == WRITE )
 			md_write_end(mddev);
 
-		trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
-					 bi, 0);
+		trace_block_bio_complete(mddev->queue, bi, 0);
 		bio_endio(bi, 0);
 	}
 }
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 4/8] raid5: add some sysfs entries
  2015-06-03 22:48 [PATCH v3 0/8] MD: a caching layer for raid5/6 Shaohua Li
  2015-06-03 22:48 ` [PATCH v3 1/8] MD: add a new disk role to present cache device Shaohua Li
  2015-06-03 22:48 ` [PATCH v3 2/8] raid5: directly use mddev->queue Shaohua Li
@ 2015-06-03 22:48 ` Shaohua Li
  2015-06-03 22:48 ` [PATCH v3 5/8] md: don't allow resize/reshape with cache support Shaohua Li
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2015-06-03 22:48 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

Add some sysfs entries.
-cache_memory. Controls the cache memory size.
-cache_reclaim_batch. Controls how many stripes reclaim flushes in one run.
-cache_memory_watermark. Background reclaim starts when cache memory hits
the watermark and stops after it reaches 2x of the watermark.
-cache_disk_watermark. Background reclaim starts when cache disk space hits
the watermark and stops after it reaches 1.5x of the watermark.
-cache_stat. Statistics about the cache.
A small usage sketch follows the diffstat below.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5-cache.c | 294 ++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/md/raid5.c       |   3 +
 drivers/md/raid5.h       |   1 +
 3 files changed, 297 insertions(+), 1 deletion(-)

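For illustration only, a rough userspace sketch of driving these entries.
The /sys/block/md0/md/ location is an assumption on my side (the entries are
added to the mddev kobject, which normally lives there), as are the values;
the cache_stat line carries in_cache_rq, in_cache_sectors, out_cache_rq,
out_cache_sectors and read_cache_sectors, in that order:

#include <stdio.h>

/* write a value to one of the new md cache attributes (path assumed) */
static int md_attr_write(const char *attr, const char *val)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/md0/md/%s", attr);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	char stat[128];
	FILE *f;

	md_attr_write("cache_memory", "1073741824");	/* 1GiB of cache memory */
	md_attr_write("cache_reclaim_batch", "16");	/* stripes per reclaim run */

	f = fopen("/sys/block/md0/md/cache_stat", "r");
	if (f && fgets(stat, sizeof(stat), f))
		printf("cache_stat: %s", stat);
	if (f)
		fclose(f);
	return 0;
}
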
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index c21d2f2..04b1684 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -278,6 +278,12 @@ struct r5c_cache {
 	struct kmem_cache *io_range_kc;
 	struct kmem_cache *stripe_kc;
 	struct bio_set *bio_set;
+
+	atomic64_t in_cache_rq;
+	atomic64_t out_cache_rq;
+	atomic64_t in_cache_sectors;
+	atomic64_t out_cache_sectors;
+	atomic64_t read_cache_sectors;
 };
 
 enum {
@@ -314,6 +320,12 @@ static inline int r5l_page_blocks(struct r5l_log *log, int pages)
 	return pages << log->page_block_shift;
 }
 
+static inline int r5l_max_flush_stripes(struct r5l_log *log)
+{
+	return (log->block_size - sizeof(struct r5l_flush_block)) /
+		sizeof(__le64);
+}
+
 static u32 r5l_calculate_checksum(struct r5l_log *log, u32 crc,
 	void *buf, size_t size, bool data)
 {
@@ -1644,6 +1656,9 @@ static void r5c_write_bio(struct r5c_cache *cache, struct bio *bio)
 	stripe->existing_pages += new_pages;
 	r5c_unlock_stripe(cache, stripe, &flags);
 
+	atomic64_inc(&cache->in_cache_rq);
+	atomic64_add(bio_sectors(bio), &cache->in_cache_sectors);
+
 	if (r5l_queue_bio(&cache->log, bio, r5c_bio_task_end, io_range,
 	    reserved_blocks))
 		goto put_error;
@@ -1692,6 +1707,8 @@ static void r5c_read_bio(struct r5c_cache *cache, struct bio *bio)
 				split = bio;
 
 			r5c_copy_bio(split, &stripe->data_pages[start], true);
+			atomic64_add(bio_sectors(split),
+						&cache->read_cache_sectors);
 
 			bio_endio(split, 0);
 
@@ -1837,6 +1854,10 @@ static void r5c_flush_one(struct r5c_cache *cache, struct r5c_stripe *stripe,
 		bio->bi_end_io = r5c_flush_endio;
 		bio->bi_rw = WRITE;
 		atomic_inc(&stripe->pending_bios);
+
+		atomic64_inc(&cache->out_cache_rq);
+		atomic64_add(bio_sectors(bio), &cache->out_cache_sectors);
+
 		raid5_make_request(cache->mddev, bio);
 	}
 }
@@ -3102,6 +3123,273 @@ static int r5c_shrink_cache_memory(struct r5c_cache *cache, unsigned long size)
 	return 0;
 }
 
+static ssize_t r5c_show_cache_memory(struct mddev *mddev, char *page)
+{
+	struct r5conf *conf = mddev->private;
+	struct r5c_cache *cache = conf->cache;
+
+	return sprintf(page, "%lld\n", cache->max_pages << PAGE_SHIFT);
+}
+
+static ssize_t r5c_store_cache_memory(struct mddev *mddev, const char *page,
+	size_t len)
+{
+	struct r5conf *conf = mddev->private;
+	struct r5c_cache *cache = conf->cache;
+	unsigned long new;
+	LIST_HEAD(page_list);
+	u64 i;
+
+	if (len >= PAGE_SIZE)
+		return -EINVAL;
+	if (kstrtoul(page, 0, &new))
+		return -EINVAL;
+	new >>= PAGE_SHIFT;
+
+	if (new > cache->max_pages) {
+		i = cache->max_pages;
+		while (i < new) {
+			struct page *page = alloc_page(GFP_KERNEL);
+
+			if (!page)
+				break;
+			list_add(&page->lru, &page_list);
+			i++;
+		}
+
+		spin_lock_irq(&cache->pool_lock);
+		list_splice(&page_list, &cache->page_pool);
+		cache->free_pages += i - cache->max_pages;
+		cache->max_pages = i;
+		cache->total_pages = i;
+		r5c_calculate_watermark(cache);
+		spin_unlock_irq(&cache->pool_lock);
+		return len;
+	}
+	r5c_shrink_cache_memory(cache, new);
+	return len;
+}
+
+static struct md_sysfs_entry r5c_cache_memory = __ATTR(cache_memory,
+	S_IRUGO | S_IWUSR, r5c_show_cache_memory, r5c_store_cache_memory);
+
+int r5c_min_stripe_cache_size(struct r5c_cache *cache)
+{
+	struct r5conf *conf = cache->mddev->private;
+	return (conf->chunk_sectors >> PAGE_SECTOR_SHIFT) *
+		cache->reclaim_batch;
+}
+
+static void r5c_set_reclaim_batch(struct r5c_cache *cache, int batch)
+{
+	struct mddev *mddev = cache->mddev;
+	struct r5conf *conf = mddev->private;
+	int size;
+
+	size = (cache->stripe_parity_pages << PAGE_SECTOR_SHIFT) * batch;
+	if (size > cache->reserved_space) {
+		cache->reserved_space = size;
+		mutex_lock(&cache->log.io_mutex);
+		cache->log.reserved_blocks = r5l_sector_to_block(&cache->log,
+			cache->reserved_space) + 1;
+		mutex_unlock(&cache->log.io_mutex);
+		r5c_wake_wait_reclaimer(cache,
+				RECLAIM_DISK_BACKGROUND);
+	} else {
+		mutex_lock(&cache->log.io_mutex);
+		cache->log.reserved_blocks -= r5l_sector_to_block(&cache->log,
+			cache->reserved_space - size);
+		mutex_unlock(&cache->log.io_mutex);
+		cache->reserved_space = size;
+	}
+
+	size = (conf->chunk_sectors >> PAGE_SECTOR_SHIFT) * batch;
+
+	mddev_lock(mddev);
+	if (size > conf->max_nr_stripes)
+		raid5_set_cache_size(mddev, size);
+	mddev_unlock(mddev);
+
+	cache->reclaim_batch = batch;
+}
+
+static ssize_t r5c_show_cache_reclaim_batch(struct mddev *mddev, char *page)
+{
+	struct r5conf *conf = mddev->private;
+	struct r5c_cache *cache = conf->cache;
+
+	return sprintf(page, "%d\n", cache->reclaim_batch);
+}
+
+static ssize_t r5c_store_cache_reclaim_batch(struct mddev *mddev,
+	const char *page, size_t len)
+{
+	struct r5conf *conf = mddev->private;
+	struct r5c_cache *cache = conf->cache;
+	unsigned long new;
+
+	if (len >= PAGE_SIZE)
+		return -EINVAL;
+	if (kstrtoul(page, 0, &new))
+		return -EINVAL;
+
+	if (new > r5l_max_flush_stripes(&cache->log))
+		new = r5l_max_flush_stripes(&cache->log);
+
+	if (new != cache->reclaim_batch)
+		r5c_set_reclaim_batch(cache, new);
+	return len;
+}
+
+static struct md_sysfs_entry r5c_cache_reclaim_batch =
+	__ATTR(cache_reclaim_batch, S_IRUGO | S_IWUSR,
+	r5c_show_cache_reclaim_batch, r5c_store_cache_reclaim_batch);
+
+static ssize_t r5c_show_cache_stat(struct mddev *mddev, char *page)
+{
+	struct r5conf *conf = mddev->private;
+	struct r5c_cache *cache = conf->cache;
+
+	return sprintf(page, "%lld %lld %lld %lld %lld\n",
+		(u64)atomic64_read(&cache->in_cache_rq),
+		(u64)atomic64_read(&cache->in_cache_sectors),
+		(u64)atomic64_read(&cache->out_cache_rq),
+		(u64)atomic64_read(&cache->out_cache_sectors),
+		(u64)atomic64_read(&cache->read_cache_sectors));
+}
+
+static struct md_sysfs_entry r5c_cache_stat =
+	__ATTR(cache_stat, S_IRUGO, r5c_show_cache_stat, NULL);
+
+static ssize_t r5c_show_cache_disk_watermark(struct mddev *mddev, char *page)
+{
+	struct r5conf *conf = mddev->private;
+	struct r5c_cache *cache = conf->cache;
+
+	return sprintf(page, "%lld\n", cache->log.low_watermark *
+		cache->log.block_size);
+}
+
+static ssize_t r5c_store_cache_disk_watermark(struct mddev *mddev,
+	const char *page, size_t len)
+{
+	struct r5conf *conf = mddev->private;
+	struct r5c_cache *cache = conf->cache;
+	struct r5l_log *log = &cache->log;
+	unsigned long new;
+
+	if (len >= PAGE_SIZE)
+		return -EINVAL;
+	if (kstrtoul(page, 0, &new))
+		return -EINVAL;
+	new /= log->block_size;
+
+	if (new * 3 / 2 >= log->total_blocks)
+		return -EINVAL;
+
+	mutex_lock(&log->io_mutex);
+	log->low_watermark = new;
+	log->high_watermark = new * 3 / 2;
+	mutex_unlock(&log->io_mutex);
+	return len;
+}
+
+static struct md_sysfs_entry r5c_cache_disk_watermark =
+	__ATTR(cache_disk_watermark, S_IRUGO | S_IWUSR,
+	r5c_show_cache_disk_watermark, r5c_store_cache_disk_watermark);
+
+static ssize_t r5c_show_cache_memory_watermark(struct mddev *mddev, char *page)
+{
+	struct r5conf *conf = mddev->private;
+	struct r5c_cache *cache = conf->cache;
+
+	return sprintf(page, "%lld\n", cache->low_watermark << PAGE_SHIFT);
+}
+
+static ssize_t r5c_store_cache_memory_watermark(struct mddev *mddev,
+	const char *page, size_t len)
+{
+	struct r5conf *conf = mddev->private;
+	struct r5c_cache *cache = conf->cache;
+	unsigned long new;
+
+	if (len >= PAGE_SIZE)
+		return -EINVAL;
+	if (kstrtoul(page, 0, &new))
+		return -EINVAL;
+	new >>= PAGE_SHIFT;
+
+	if (new * 2 >= cache->max_pages)
+		return -EINVAL;
+
+	spin_lock_irq(&cache->pool_lock);
+	cache->low_watermark = new;
+	cache->high_watermark = new << 1;
+	spin_unlock_irq(&cache->pool_lock);
+	return len;
+}
+
+static struct md_sysfs_entry r5c_cache_memory_watermark =
+	__ATTR(cache_memory_watermark, S_IRUGO | S_IWUSR,
+	r5c_show_cache_memory_watermark, r5c_store_cache_memory_watermark);
+
+static int r5c_init_sysfs(struct r5c_cache *cache)
+{
+	struct mddev *mddev = cache->mddev;
+	int ret;
+
+	ret = sysfs_add_file_to_group(&mddev->kobj, &r5c_cache_memory.attr,
+				      NULL);
+	if (ret)
+		return ret;
+	ret = sysfs_add_file_to_group(&mddev->kobj,
+				      &r5c_cache_reclaim_batch.attr, NULL);
+	if (ret)
+		goto err_reclaim;
+	ret = sysfs_add_file_to_group(&mddev->kobj,
+				      &r5c_cache_disk_watermark.attr, NULL);
+	if (ret)
+		goto disk_watermark;
+	ret = sysfs_add_file_to_group(&mddev->kobj,
+				      &r5c_cache_stat.attr, NULL);
+	if (ret)
+		goto stat;
+
+	ret = sysfs_add_file_to_group(&mddev->kobj,
+				      &r5c_cache_memory_watermark.attr, NULL);
+	if (ret)
+		goto memory_watermark;
+	return 0;
+memory_watermark:
+	sysfs_remove_file_from_group(&mddev->kobj,
+		&r5c_cache_stat.attr, NULL);
+stat:
+	sysfs_remove_file_from_group(&mddev->kobj,
+		&r5c_cache_disk_watermark.attr, NULL);
+disk_watermark:
+	sysfs_remove_file_from_group(&mddev->kobj,
+		&r5c_cache_reclaim_batch.attr, NULL);
+err_reclaim:
+	sysfs_remove_file_from_group(&mddev->kobj,
+		&r5c_cache_memory.attr, NULL);
+	return ret;
+}
+
+static void r5c_exit_sysfs(struct r5c_cache *cache)
+{
+	struct mddev *mddev = cache->mddev;
+	sysfs_remove_file_from_group(&mddev->kobj,
+		&r5c_cache_reclaim_batch.attr, NULL);
+	sysfs_remove_file_from_group(&mddev->kobj,
+		&r5c_cache_memory.attr, NULL);
+	sysfs_remove_file_from_group(&mddev->kobj,
+		&r5c_cache_disk_watermark.attr, NULL);
+	sysfs_remove_file_from_group(&mddev->kobj,
+		&r5c_cache_stat.attr, NULL);
+	sysfs_remove_file_from_group(&mddev->kobj,
+		&r5c_cache_memory_watermark.attr, NULL);
+}
+
 static void r5c_free_cache_data(struct r5c_cache *cache)
 {
 	struct r5c_stripe *stripe;
@@ -3212,8 +3500,11 @@ struct r5c_cache *r5c_init_cache(struct r5conf *conf, struct md_rdev *rdev)
 	cache->reclaim_thread->timeout = CHECKPOINT_TIMEOUT;
 
 	r5c_shrink_cache_memory(cache, cache->max_pages);
-
+	if (r5c_init_sysfs(cache))
+		goto err_sysfs;
 	return cache;
+err_sysfs:
+	md_unregister_thread(&cache->reclaim_thread);
 err_page:
 	r5c_free_cache_data(cache);
 
@@ -3232,6 +3523,7 @@ struct r5c_cache *r5c_init_cache(struct r5conf *conf, struct md_rdev *rdev)
 
 void r5c_exit_cache(struct r5c_cache *cache)
 {
+	r5c_exit_sysfs(cache);
 	md_unregister_thread(&cache->reclaim_thread);
 	r5l_exit_log(&cache->log);
 
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6cfba8f..26561d8 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5907,6 +5907,9 @@ raid5_set_cache_size(struct mddev *mddev, int size)
 	if (size <= 16 || size > 32768)
 		return -EINVAL;
 
+	if (conf->cache && size < r5c_min_stripe_cache_size(conf->cache))
+		size = r5c_min_stripe_cache_size(conf->cache);
+
 	conf->min_nr_stripes = size;
 	while (size < conf->max_nr_stripes &&
 	       drop_one_stripe(conf))
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 40307ca..f117b68 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -617,4 +617,5 @@ void r5c_exit_cache(struct r5c_cache *cache);
 void r5c_write_start(struct mddev *mddev, struct bio *bi);
 void r5c_write_end(struct mddev *mddev, struct bio *bi);
 void r5c_quiesce(struct r5conf *conf, int state);
+int r5c_min_stripe_cache_size(struct r5c_cache *cache);
 #endif
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 5/8] md: don't allow resize/reshape with cache support
  2015-06-03 22:48 [PATCH v3 0/8] MD: a caching layer for raid5/6 Shaohua Li
                   ` (2 preceding siblings ...)
  2015-06-03 22:48 ` [PATCH v3 4/8] raid5: add some sysfs entries Shaohua Li
@ 2015-06-03 22:48 ` Shaohua Li
  2015-06-18  1:16   ` Neil Brown
  2015-06-03 22:48 ` [PATCH v3 6/8] raid5: skip resync if caching is enabled Shaohua Li
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 13+ messages in thread
From: Shaohua Li @ 2015-06-03 22:48 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

If cache support is enabled, don't allow resize/reshape at this stage. In
the future we can flush all data from the cache to the raid before a
resize/reshape and then allow it.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 26561d8..29f49c7 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7207,6 +7207,10 @@ static int raid5_resize(struct mddev *mddev, sector_t sectors)
 	 * worth it.
 	 */
 	sector_t newsize;
+	struct r5conf *conf = mddev->private;
+
+	if (conf->cache)
+		return -EINVAL;
 	sectors &= ~((sector_t)mddev->chunk_sectors - 1);
 	newsize = raid5_size(mddev, sectors, mddev->raid_disks);
 	if (mddev->external_size &&
@@ -7258,6 +7262,8 @@ static int check_reshape(struct mddev *mddev)
 {
 	struct r5conf *conf = mddev->private;
 
+	if (conf->cache)
+		return -EINVAL;
 	if (mddev->delta_disks == 0 &&
 	    mddev->new_layout == mddev->layout &&
 	    mddev->new_chunk_sectors == mddev->chunk_sectors)
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 6/8] raid5: skip resync if caching is enabled
  2015-06-03 22:48 [PATCH v3 0/8] MD: a caching layer for raid5/6 Shaohua Li
                   ` (3 preceding siblings ...)
  2015-06-03 22:48 ` [PATCH v3 5/8] md: don't allow resize/reshape with cache support Shaohua Li
@ 2015-06-03 22:48 ` Shaohua Li
  2015-06-03 22:48 ` [PATCH v3 7/8] raid5: guarantee cache release stripes in correct way Shaohua Li
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2015-06-03 22:48 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

If caching is enabled, the caching layer will guarantee data consistency,
so skip the resync after an unclean shutdown.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 29f49c7..b1c942c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6975,6 +6975,13 @@ static int run(struct mddev *mddev)
 		if (mddev->queue)
 			blk_queue_logical_block_size(mddev->queue, STRIPE_SIZE);
 		conf->skip_copy = 1;
+
+		if (mddev->recovery_cp == 0) {
+			printk(KERN_NOTICE
+				"md/raid:%s: skip resync with caching enabled\n",
+				mdname(mddev));
+			mddev->recovery_cp = MaxSector;
+		}
 	}
 
 	return 0;
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 7/8] raid5: guarantee cache release stripes in correct way
  2015-06-03 22:48 [PATCH v3 0/8] MD: a caching layer for raid5/6 Shaohua Li
                   ` (4 preceding siblings ...)
  2015-06-03 22:48 ` [PATCH v3 6/8] raid5: skip resync if caching is enabled Shaohua Li
@ 2015-06-03 22:48 ` Shaohua Li
  2015-06-03 22:48 ` [PATCH v3 8/8] raid5: multi-thread support for raid5 caching reclaim Shaohua Li
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2015-06-03 22:48 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

When the cache tries to flush data to the raid, we must be very careful
about stripe release. The issue is that if we release a stripe while it is
being handled by raid5d and add a new bio to the stripe later, the stripe
will enter the raid5 state machine several times, which r5c_write_parity
can't handle well. This can happen if get_active_stripe() sleeps while
adding bios to several stripes. To solve this, we guarantee that stripes
are only released after all bios have been added to the corresponding
stripes. This is also a performance win when that situation occurs.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5-cache.c | 28 +++++++++-------
 drivers/md/raid5.c       | 84 +++++++++++++++++++++++++++++++-----------------
 drivers/md/raid5.h       | 11 ++++++-
 3 files changed, 80 insertions(+), 43 deletions(-)

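To make the intended calling pattern concrete, here is a condensed sketch of
what the reclaim path does with the new helpers (error handling and the
flush-block bookkeeping are omitted; flush_batch is a made-up name, the
helpers are the ones introduced below):

/* Queue every flush bio for the whole batch first, then release the
 * plugged stripes in one go, so no stripe re-enters the raid5 state
 * machine while bios are still being added to other stripes.
 */
static void flush_batch(struct r5c_cache *cache, struct list_head *stripes)
{
	struct raid5_plug_context pc;
	struct r5c_stripe *stripe;

	raid5_context_init(&pc);

	list_for_each_entry(stripe, stripes, lru)
		r5c_flush_stripe(cache, stripe, &pc);	/* plugs stripes into pc */

	/* only now hand the plugged stripes over to raid5d */
	raid5_context_unplug(&pc, cache->mddev, false);
}
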
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 04b1684..86e7b94 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1670,7 +1670,7 @@ static void r5c_write_bio(struct r5c_cache *cache, struct bio *bio)
 	r5c_enter_error_mode(cache, -ENOMEM);
 	r5c_check_wait_error_mode(cache);
 error_mode:
-	raid5_make_request(cache->mddev, bio);
+	raid5_make_request(cache->mddev, bio, NULL);
 }
 
 static void r5c_read_bio(struct r5c_cache *cache, struct bio *bio)
@@ -1686,7 +1686,7 @@ static void r5c_read_bio(struct r5c_cache *cache, struct bio *bio)
 
 	stripe = r5c_get_stripe(cache, stripe_index);
 	if (!stripe) {
-		raid5_make_request(cache->mddev, bio);
+		raid5_make_request(cache->mddev, bio, NULL);
 		return;
 	}
 
@@ -1725,7 +1725,7 @@ static void r5c_read_bio(struct r5c_cache *cache, struct bio *bio)
 			} else
 				split = bio;
 
-			raid5_make_request(cache->mddev, split);
+			raid5_make_request(cache->mddev, split, NULL);
 
 			start = tmp;
 		}
@@ -1823,7 +1823,7 @@ void r5c_write_end(struct mddev *mddev, struct bio *bi)
 }
 
 static void r5c_flush_one(struct r5c_cache *cache, struct r5c_stripe *stripe,
-	int start, int pages)
+	int start, int pages, struct raid5_plug_context *pc)
 {
 	sector_t base;
 	struct bio *bio;
@@ -1858,7 +1858,7 @@ static void r5c_flush_one(struct r5c_cache *cache, struct r5c_stripe *stripe,
 		atomic64_inc(&cache->out_cache_rq);
 		atomic64_add(bio_sectors(bio), &cache->out_cache_sectors);
 
-		raid5_make_request(cache->mddev, bio);
+		raid5_make_request(cache->mddev, bio, pc);
 	}
 }
 
@@ -1871,7 +1871,8 @@ static void r5c_put_stripe_dirty(struct r5c_cache *cache,
 	}
 }
 
-static void r5c_flush_stripe(struct r5c_cache *cache, struct r5c_stripe *stripe)
+static void r5c_flush_stripe(struct r5c_cache *cache, struct r5c_stripe *stripe,
+	struct raid5_plug_context *pc)
 {
 	unsigned long *stripe_bits;
 	int chunk_stripes;
@@ -1900,7 +1901,7 @@ static void r5c_flush_stripe(struct r5c_cache *cache, struct r5c_stripe *stripe)
 		while (end < cache->stripe_data_pages &&
 		       stripe->data_pages[end])
 			end++;
-		r5c_flush_one(cache, stripe, start, end - start);
+		r5c_flush_one(cache, stripe, start, end - start, pc);
 	}
 	r5c_put_stripe_dirty(cache, stripe);
 }
@@ -2034,11 +2035,14 @@ static void r5c_reclaim_stripe_list(struct r5c_cache *cache,
 	struct r5c_io_range *range;
 	u64 seq;
 	sector_t meta;
-	struct blk_plug plug;
 	size_t size = 0;
+	struct raid5_plug_context pc;
 
 	if (list_empty(stripe_list))
 		return;
+
+	raid5_context_init(&pc);
+
 	list_sort(NULL, stripe_list, r5c_stripe_list_cmp);
 	list_for_each_entry(stripe, stripe_list, lru) {
 		cache->stripe_flush_data[size] =
@@ -2047,11 +2051,11 @@ static void r5c_reclaim_stripe_list(struct r5c_cache *cache,
 	}
 	size *= sizeof(__le64);
 
-	blk_start_plug(&plug);
 	/* step 1: start write to raid */
 	list_for_each_entry(stripe, stripe_list, lru)
-		r5c_flush_stripe(cache, stripe);
-	blk_finish_plug(&plug);
+		r5c_flush_stripe(cache, stripe, &pc);
+
+	raid5_context_unplug(&pc, cache->mddev, false);
 
 	/* step 2: wait parity write to cache */
 	list_for_each_entry_reverse(stripe, stripe_list, lru)
@@ -2158,7 +2162,7 @@ static void r5c_reclaim_thread(struct md_thread *thread)
 		wake_up(&cache->error_wait);
 
 		while ((bio = bio_list_pop(&cache->retry_bio_list)) != NULL)
-			raid5_make_request(cache->mddev, bio);
+			raid5_make_request(cache->mddev, bio, NULL);
 
 		if (++cache->retry_cnt < MAX_RETRY) {
 			cache->next_retry_time = jiffies +
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b1c942c..6cb10af 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4973,26 +4973,26 @@ static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
 	return sh;
 }
 
-struct raid5_plug_cb {
-	struct blk_plug_cb	cb;
-	struct list_head	list;
-	struct list_head	temp_inactive_list[NR_STRIPE_HASH_LOCKS];
-};
+void raid5_context_init(struct raid5_plug_context *context)
+{
+	int i;
+	INIT_LIST_HEAD(&context->list);
+	for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
+		INIT_LIST_HEAD(context->temp_inactive_list + i);
+}
 
-static void raid5_unplug(struct blk_plug_cb *blk_cb, bool from_schedule)
+void raid5_context_unplug(struct raid5_plug_context *context,
+	struct mddev *mddev, bool from_schedule)
 {
-	struct raid5_plug_cb *cb = container_of(
-		blk_cb, struct raid5_plug_cb, cb);
 	struct stripe_head *sh;
-	struct mddev *mddev = cb->cb.data;
 	struct r5conf *conf = mddev->private;
 	int cnt = 0;
 	int hash;
 
-	if (cb->list.next && !list_empty(&cb->list)) {
+	if (context->list.next && !list_empty(&context->list)) {
 		spin_lock_irq(&conf->device_lock);
-		while (!list_empty(&cb->list)) {
-			sh = list_first_entry(&cb->list, struct stripe_head, lru);
+		while (!list_empty(&context->list)) {
+			sh = list_first_entry(&context->list, struct stripe_head, lru);
 			list_del_init(&sh->lru);
 			/*
 			 * avoid race release_stripe_plug() sees
@@ -5006,15 +5006,38 @@ static void raid5_unplug(struct blk_plug_cb *blk_cb, bool from_schedule)
 			 * case, the count is always > 1 here
 			 */
 			hash = sh->hash_lock_index;
-			__release_stripe(conf, sh, &cb->temp_inactive_list[hash]);
+			__release_stripe(conf, sh, &context->temp_inactive_list[hash]);
 			cnt++;
 		}
 		spin_unlock_irq(&conf->device_lock);
 	}
-	release_inactive_stripe_list(conf, cb->temp_inactive_list,
+	release_inactive_stripe_list(conf, context->temp_inactive_list,
 				     NR_STRIPE_HASH_LOCKS);
 	if (mddev->queue)
 		trace_block_unplug(mddev->queue, cnt, !from_schedule);
+}
+
+static void raid5_context_plug(struct raid5_plug_context *context,
+			       struct stripe_head *sh)
+{
+	if (!test_and_set_bit(STRIPE_ON_UNPLUG_LIST, &sh->state))
+		list_add_tail(&sh->lru, &context->list);
+	else
+		release_stripe(sh);
+}
+
+struct raid5_plug_cb {
+	struct blk_plug_cb	cb;
+	struct raid5_plug_context context;
+};
+
+static void raid5_unplug(struct blk_plug_cb *blk_cb, bool from_schedule)
+{
+	struct raid5_plug_cb *cb = container_of(
+		blk_cb, struct raid5_plug_cb, cb);
+	struct mddev *mddev = cb->cb.data;
+
+	raid5_context_unplug(&cb->context, mddev, from_schedule);
 	kfree(cb);
 }
 
@@ -5033,20 +5056,14 @@ static void release_stripe_plug(struct mddev *mddev,
 
 	cb = container_of(blk_cb, struct raid5_plug_cb, cb);
 
-	if (cb->list.next == NULL) {
-		int i;
-		INIT_LIST_HEAD(&cb->list);
-		for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
-			INIT_LIST_HEAD(cb->temp_inactive_list + i);
-	}
+	if (cb->context.list.next == NULL)
+		raid5_context_init(&cb->context);
 
-	if (!test_and_set_bit(STRIPE_ON_UNPLUG_LIST, &sh->state))
-		list_add_tail(&sh->lru, &cb->list);
-	else
-		release_stripe(sh);
+	raid5_context_plug(&cb->context, sh);
 }
 
-static void make_discard_request(struct mddev *mddev, struct bio *bi)
+static void make_discard_request(struct mddev *mddev, struct bio *bi,
+	struct raid5_plug_context *context)
 {
 	struct r5conf *conf = mddev->private;
 	sector_t logical_sector, last_sector;
@@ -5128,7 +5145,10 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 		clear_bit(STRIPE_DELAYED, &sh->state);
 		if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
 			atomic_inc(&conf->preread_active_stripes);
-		release_stripe_plug(mddev, sh);
+		if (context)
+			raid5_context_plug(context, sh);
+		else
+			release_stripe_plug(mddev, sh);
 	}
 
 	remaining = raid5_dec_bi_active_stripes(bi);
@@ -5138,7 +5158,8 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 	}
 }
 
-void raid5_make_request(struct mddev *mddev, struct bio * bi)
+void raid5_make_request(struct mddev *mddev, struct bio * bi,
+	struct raid5_plug_context *context)
 {
 	struct r5conf *conf = mddev->private;
 	int dd_idx;
@@ -5168,7 +5189,7 @@ void raid5_make_request(struct mddev *mddev, struct bio * bi)
 		return;
 
 	if (unlikely(bi->bi_rw & REQ_DISCARD)) {
-		make_discard_request(mddev, bi);
+		make_discard_request(mddev, bi, context);
 		return;
 	}
 
@@ -5296,7 +5317,10 @@ void raid5_make_request(struct mddev *mddev, struct bio * bi)
 			    ((bi->bi_rw & REQ_SYNC) || conf->cache) &&
 			    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
 				atomic_inc(&conf->preread_active_stripes);
-			release_stripe_plug(mddev, sh);
+			if (context)
+				raid5_context_plug(context, sh);
+			else
+				release_stripe_plug(mddev, sh);
 		} else {
 			/* cannot get stripe for read-ahead, just give-up */
 			clear_bit(BIO_UPTODATE, &bi->bi_flags);
@@ -5323,7 +5347,7 @@ static void make_request(struct mddev *mddev, struct bio *bi)
 	if (conf->cache)
 		r5c_handle_bio(conf->cache, bi);
 	else
-		raid5_make_request(mddev, bi);
+		raid5_make_request(mddev, bi, NULL);
 }
 
 static sector_t raid5_size(struct mddev *mddev, sector_t sectors, int raid_disks);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index f117b68..1adb2b8 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -607,8 +607,17 @@ static inline int algorithm_is_DDF(int layout)
 extern void md_raid5_kick_device(struct r5conf *conf);
 extern int raid5_set_cache_size(struct mddev *mddev, int size);
 
+struct raid5_plug_context {
+	struct list_head	list;
+	struct list_head	temp_inactive_list[NR_STRIPE_HASH_LOCKS];
+};
+void raid5_context_init(struct raid5_plug_context *context);
+void raid5_context_unplug(struct raid5_plug_context *context,
+	struct mddev *mddev, bool from_schedule);
+
 void release_stripe(struct stripe_head *sh);
-void raid5_make_request(struct mddev *mddev, struct bio *bi);
+void raid5_make_request(struct mddev *mddev, struct bio *bi,
+	struct raid5_plug_context *context);
 void r5c_handle_bio(struct r5c_cache *cache, struct bio *bi);
 int r5c_write_parity(struct r5c_cache *cache, struct stripe_head *sh);
 void r5c_flush_pending_parity(struct r5c_cache *cache);
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 8/8] raid5: multi-thread support for raid5 caching reclaim
  2015-06-03 22:48 [PATCH v3 0/8] MD: a caching layer for raid5/6 Shaohua Li
                   ` (5 preceding siblings ...)
  2015-06-03 22:48 ` [PATCH v3 7/8] raid5: guarantee cache release stripes in correct way Shaohua Li
@ 2015-06-03 22:48 ` Shaohua Li
  2015-06-04 19:29 ` Fwd: [PATCH v3 0/8] MD: a caching layer for raid5/6 Davor Vusir
       [not found] ` <c6df8779f11a4dc3362a04e7cee0be2aec213ebe.1433356864.git.shli@fb.com>
  8 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2015-06-03 22:48 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

There are several stages in raid5 caching reclaim that can stall, which can
significantly hurt reclaim performance. To mitigate this, we introduce
multi-thread support for reclaim. Since each thread records the stripes it
reclaims in its own flush start/end blocks, running multiple reclaim
threads is safe.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5-cache.c | 343 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 288 insertions(+), 55 deletions(-)

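As a rough usage sketch (not part of the patch): the worker count can be
tuned through the new cache_reclaim_threads attribute added below. The
/sys/block/md0/md/ path and the match-the-CPU-count policy are assumptions
on my side; the value is the number of extra workers on top of the base
reclaim thread, and the kernel clamps it to MAX_RECLAIM_WORKERS (16):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long n = sysconf(_SC_NPROCESSORS_ONLN);	/* one extra worker per CPU */
	FILE *f = fopen("/sys/block/md0/md/cache_reclaim_threads", "w");

	if (!f)
		return 1;
	if (n > 16)
		n = 16;		/* kernel clamps to MAX_RECLAIM_WORKERS anyway */
	fprintf(f, "%ld\n", n);
	return fclose(f);
}
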
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 86e7b94..329aa38 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -224,6 +224,8 @@ enum {
 };
 
 #define STRIPE_LOCK_BITS 8
+
+#define MAX_RECLAIM_WORKERS 16
 struct r5c_cache {
 	struct mddev *mddev;
 	struct md_rdev *rdev;
@@ -258,9 +260,11 @@ struct r5c_cache {
 	unsigned long reclaim_reason;
 	wait_queue_head_t reclaim_wait;
 	struct md_thread *reclaim_thread;
-	__le64 *stripe_flush_data;
 	int quiesce_state;
 
+	struct md_thread *reclaim_workers[MAX_RECLAIM_WORKERS];
+	int reclaim_worker_cnt;
+
 	int in_recovery;
 
 	struct work_struct pending_io_work;
@@ -294,9 +298,8 @@ enum {
 	RECLAIM_DISK_BACKGROUND = 9, /* try to reclaim disk */
 	RECLAIM_FLUSH_ALL = 16, /* flush all data to raid */
 
-	QUIESCE_NONE = 0,
+	QUIESCE_END = 0,
 	QUIESCE_START = 1,
-	QUIESCE_END = 2,
 
 	ERROR_NOERROR = 0,
 	ERROR_PREPARE = 1, /* Had an error, flushing cache to raid */
@@ -1958,7 +1961,9 @@ static void r5c_select_stripes(struct r5c_cache *cache, struct list_head *list)
 {
 	int stripes;
 	bool blocking;
+	static DEFINE_MUTEX(lock);
 
+	mutex_lock(&lock);
 	/*
 	 * generally select full stripe, if no disk space, select first stripe
 	 */
@@ -1991,6 +1996,7 @@ static void r5c_select_stripes(struct r5c_cache *cache, struct list_head *list)
 	}
 
 	spin_unlock_irq(&cache->tree_lock);
+	mutex_unlock(&lock);
 }
 
 static void r5c_disks_flush_end(struct bio *bio, int err)
@@ -2028,11 +2034,44 @@ static int r5c_stripe_list_cmp(void *priv, struct list_head *a,
 	return !(stripe_a->raid_index < stripe_b->raid_index);
 }
 
-static void r5c_reclaim_stripe_list(struct r5c_cache *cache,
-	struct list_head *stripe_list)
+struct r5c_reclaim_context {
+	struct list_head stripe_list;
+	__le64 *stripe_flush_data;
+	u64 seq;
+	sector_t meta;
+	struct completion comp;
+};
+
+static struct r5c_reclaim_context *
+r5c_alloc_reclaim_context(struct r5c_cache *cache)
+{
+	struct r5c_reclaim_context *context;
+
+	context = kzalloc(sizeof(*context), GFP_KERNEL);
+	if (!context)
+		return NULL;
+	INIT_LIST_HEAD(&context->stripe_list);
+	context->stripe_flush_data = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!context->stripe_flush_data) {
+		kfree(context);
+		return NULL;
+	}
+	init_completion(&context->comp);
+	return context;
+}
+
+static void r5c_free_reclaim_context(struct r5c_reclaim_context *context)
+{
+	kfree(context->stripe_flush_data);
+	kfree(context);
+}
+
+static void r5c_do_reclaim(struct r5c_cache *cache,
+	struct r5c_reclaim_context *context)
 {
+	struct list_head *stripe_list = &context->stripe_list;
+	__le64 *stripe_flush_data = context->stripe_flush_data;
 	struct r5c_stripe *stripe;
-	struct r5c_io_range *range;
 	u64 seq;
 	sector_t meta;
 	size_t size = 0;
@@ -2045,7 +2084,7 @@ static void r5c_reclaim_stripe_list(struct r5c_cache *cache,
 
 	list_sort(NULL, stripe_list, r5c_stripe_list_cmp);
 	list_for_each_entry(stripe, stripe_list, lru) {
-		cache->stripe_flush_data[size] =
+		stripe_flush_data[size] =
 				cpu_to_le64(stripe->raid_index);
 		size++;
 	}
@@ -2063,7 +2102,7 @@ static void r5c_reclaim_stripe_list(struct r5c_cache *cache,
 
 	/* step 3: make sure data and parity settle down */
 	r5l_flush_block(&cache->log, R5LOG_TYPE_FLUSH_START,
-		cache->stripe_flush_data, size, &seq, &meta);
+		stripe_flush_data, size, &seq, &meta);
 
 	/* step 4: continue write to raid */
 	list_for_each_entry(stripe, stripe_list, lru) {
@@ -2087,7 +2126,7 @@ static void r5c_reclaim_stripe_list(struct r5c_cache *cache,
 
 	/* step 7: mark data is flushed to raid */
 	r5l_flush_block(&cache->log, R5LOG_TYPE_FLUSH_END,
-		cache->stripe_flush_data, size, &seq, &meta);
+		stripe_flush_data, size, &seq, &meta);
 
 	/* step 8: mark stripe as dead */
 	while (!list_empty(stripe_list)) {
@@ -2100,6 +2139,23 @@ static void r5c_reclaim_stripe_list(struct r5c_cache *cache,
 
 		r5c_put_stripe(stripe);
 	}
+	context->seq = seq;
+	context->meta = meta;
+}
+
+static void r5c_do_reclaim_and_write_super(struct r5c_cache *cache,
+	struct r5c_reclaim_context *context)
+{
+	struct r5c_io_range *range;
+	u64 seq;
+	sector_t meta;
+
+	if (list_empty(&context->stripe_list))
+		return;
+
+	r5c_do_reclaim(cache, context);
+	seq = context->seq;
+	meta = context->meta;
 
 	/* step 9: advance superblock checkpoint */
 	spin_lock_irq(&cache->tree_lock);
@@ -2118,14 +2174,67 @@ static void r5c_reclaim_stripe_list(struct r5c_cache *cache,
 	r5l_write_super(&cache->log, seq, meta);
 }
 
+static void r5c_reclaim_work(struct r5c_cache *cache,
+	struct r5c_reclaim_context *context)
+{
+	/*
+	 * select stripe will freeze stripe, which will guarantee no
+	 * new task pending in error mode
+	 * */
+	r5c_select_stripes(cache, &context->stripe_list);
+
+	r5c_do_reclaim(cache, context);
+}
+
+static void r5c_reclaim_stripe_list(struct r5c_cache *cache,
+	struct list_head *stripe_list)
+{
+	int size;
+	struct r5c_reclaim_context *context;
+
+	context = r5c_alloc_reclaim_context(cache);
+
+	while (!list_empty(stripe_list)) {
+		size = 0;
+		while (!list_empty(stripe_list) && size < RECLAIM_BATCH) {
+			list_move_tail(stripe_list->next,
+				&context->stripe_list);
+			size++;
+		}
+		if (!list_empty(stripe_list))
+			r5c_do_reclaim(cache, context);
+		else
+			r5c_do_reclaim_and_write_super(cache, context);
+	}
+
+	r5c_free_reclaim_context(context);
+}
+
+static void r5c_reclaim_thread_check_quiesce(struct r5c_cache *cache,
+	struct r5c_reclaim_context *context)
+{
+	if (cache->quiesce_state == QUIESCE_START) {
+		complete(&context->comp);
+		wait_event(cache->reclaim_wait, cache->quiesce_state ==
+			QUIESCE_END);
+		reinit_completion(&context->comp);
+	}
+}
+
 static void r5c_reclaim_thread(struct md_thread *thread)
 {
 	struct mddev *mddev = thread->mddev;
 	struct r5conf *conf = mddev->private;
 	struct r5c_cache *cache = conf->cache;
-	LIST_HEAD(stripe_list);
+	struct r5c_reclaim_context *context = thread->private;
 	bool retry;
 
+	r5c_reclaim_thread_check_quiesce(cache, context);
+
+	if (thread != cache->reclaim_thread) {
+		r5c_reclaim_work(cache, context);
+		return;
+	}
 do_retry:
 	retry = false;
 
@@ -2134,9 +2243,9 @@ static void r5c_reclaim_thread(struct md_thread *thread)
 		 * select stripe will freeze stripe, which will guarantee no
 		 * new task pending in error mode
 		 * */
-		r5c_select_stripes(cache, &stripe_list);
+		r5c_select_stripes(cache, &context->stripe_list);
 
-		if (list_empty(&stripe_list)) {
+		if (list_empty(&context->stripe_list)) {
 			clear_bit(RECLAIM_MEM_FULL, &cache->reclaim_reason);
 			clear_bit(RECLAIM_MEM_BACKGROUND,
 					&cache->reclaim_reason);
@@ -2144,7 +2253,7 @@ static void r5c_reclaim_thread(struct md_thread *thread)
 					&cache->reclaim_reason);
 			clear_bit(RECLAIM_FLUSH_ALL, &cache->reclaim_reason);
 		} else
-			r5c_reclaim_stripe_list(cache, &stripe_list);
+			r5c_do_reclaim_and_write_super(cache, context);
 
 		wake_up(&cache->reclaim_wait);
 	}
@@ -2179,20 +2288,80 @@ static void r5c_reclaim_thread(struct md_thread *thread)
 		cache->error_state = 0;
 		mddev_resume(cache->mddev);
 	}
-	if (cache->quiesce_state == QUIESCE_START) {
-		/* user IO already finished, we just stop reclaim */
-		cache->reclaim_reason = 0;
-		cache->quiesce_state = QUIESCE_END;
-		wake_up(&cache->reclaim_wait);
-		wait_event(cache->reclaim_wait, cache->quiesce_state ==
-			QUIESCE_NONE);
+}
+
+static struct md_thread *r5c_init_reclaim_thread(struct r5c_cache *cache)
+{
+	struct r5c_reclaim_context *context;
+	struct md_thread *thread;
+
+	context = r5c_alloc_reclaim_context(cache);
+	if (!context)
+		return NULL;
+
+	thread = md_register_thread(r5c_reclaim_thread,
+			cache->mddev, "reclaim");
+	if (!thread) {
+		r5c_free_reclaim_context(context);
+		return NULL;
+	}
+	thread->private = context;
+
+	return thread;
+}
+
+static void r5c_exit_reclaim_thread(struct r5c_cache *cache,
+	struct md_thread **thread)
+{
+	struct r5c_reclaim_context *context;
+
+	context = (*thread)->private;
+	r5c_free_reclaim_context(context);
+
+	md_unregister_thread(thread);
+}
+
+static int r5c_init_reclaimers(struct r5c_cache *cache)
+{
+	struct md_thread *thread;
+
+	thread = r5c_init_reclaim_thread(cache);
+	if (!thread)
+		return -ENOMEM;
+	cache->reclaim_thread = thread;
+	cache->reclaim_thread->timeout = CHECKPOINT_TIMEOUT;
+	return 0;
+}
+
+static void r5c_exit_reclaimers(struct r5c_cache *cache)
+{
+	int i = cache->reclaim_worker_cnt;
+
+	while (i > 0) {
+		r5c_exit_reclaim_thread(cache, &cache->reclaim_workers[i - 1]);
+		i--;
 	}
+
+	r5c_exit_reclaim_thread(cache, &cache->reclaim_thread);
+}
+
+static void r5c_wakeup_reclaimer_threads(struct r5c_cache *cache)
+{
+	int i;
+
+	md_wakeup_thread(cache->reclaim_thread);
+
+	preempt_disable();
+	for (i = 0; i < cache->reclaim_worker_cnt; i++)
+		if (cache->reclaim_workers[i])
+			md_wakeup_thread(cache->reclaim_workers[i]);
+	preempt_enable();
 }
 
 static void r5c_wake_reclaimer(struct r5c_cache *cache, int reason)
 {
 	set_bit(reason, &cache->reclaim_reason);
-	md_wakeup_thread(cache->reclaim_thread);
+	r5c_wakeup_reclaimer_threads(cache);
 }
 
 static void r5c_wake_wait_reclaimer(struct r5c_cache *cache, int reason)
@@ -2205,17 +2374,25 @@ static void r5c_wake_wait_reclaimer(struct r5c_cache *cache, int reason)
 void r5c_quiesce(struct r5conf *conf, int state)
 {
 	struct r5c_cache *cache = conf->cache;
+	struct r5c_reclaim_context *context;
+	int i;
 
 	if (!cache || cache->error_state)
 		return;
 	if (state == 1) {
 		r5c_wake_wait_reclaimer(cache, RECLAIM_FLUSH_ALL);
+
 		cache->quiesce_state = QUIESCE_START;
-		md_wakeup_thread(cache->reclaim_thread);
-		wait_event(cache->reclaim_wait, cache->quiesce_state ==
-			QUIESCE_END);
+		r5c_wakeup_reclaimer_threads(cache);
+
+		for (i = 0; i < cache->reclaim_worker_cnt; i++) {
+			context = cache->reclaim_workers[i]->private;
+			wait_for_completion(&context->comp);
+		}
+		context = cache->reclaim_thread->private;
+		wait_for_completion(&context->comp);
 	} else if (state == 0) {
-		cache->quiesce_state = QUIESCE_NONE;
+		cache->quiesce_state = QUIESCE_END;
 		wake_up(&cache->reclaim_wait);
 	}
 }
@@ -2751,7 +2928,6 @@ static int r5c_recover_stripes(struct r5c_load_ctx *ctx)
 {
 	struct r5c_cache *cache = ctx->cache;
 	LIST_HEAD(list);
-	int i;
 
 	r5l_check_stripes_checksum(ctx);
 
@@ -2762,18 +2938,7 @@ static int r5c_recover_stripes(struct r5c_load_ctx *ctx)
 
 	cache->in_recovery = 1;
 
-	while (!list_empty(&ctx->stripes_with_parity)) {
-		i = 0;
-		/* Can't handle large stripe list */
-		while (i < RECLAIM_BATCH &&
-		       !list_empty(&ctx->stripes_with_parity)) {
-			list_move_tail(ctx->stripes_with_parity.next,
-				&list);
-			i++;
-		}
-		r5c_reclaim_stripe_list(cache, &list);
-		BUG_ON(!list_empty(&list));
-	}
+	r5c_reclaim_stripe_list(cache, &ctx->stripes_with_parity);
 
 	cache->in_recovery = 0;
 	return 0;
@@ -3181,16 +3346,20 @@ int r5c_min_stripe_cache_size(struct r5c_cache *cache)
 {
 	struct r5conf *conf = cache->mddev->private;
 	return (conf->chunk_sectors >> PAGE_SECTOR_SHIFT) *
-		cache->reclaim_batch;
+		cache->reclaim_batch * (1 + cache->reclaim_worker_cnt);
 }
 
-static void r5c_set_reclaim_batch(struct r5c_cache *cache, int batch)
+static void r5c_set_reclaim_batch(struct r5c_cache *cache, int batch,
+	int threads)
 {
 	struct mddev *mddev = cache->mddev;
 	struct r5conf *conf = mddev->private;
 	int size;
 
-	size = (cache->stripe_parity_pages << PAGE_SECTOR_SHIFT) * batch;
+	threads++;
+
+	size = (cache->stripe_parity_pages << PAGE_SECTOR_SHIFT) * batch *
+		threads;
 	if (size > cache->reserved_space) {
 		cache->reserved_space = size;
 		mutex_lock(&cache->log.io_mutex);
@@ -3207,7 +3376,7 @@ static void r5c_set_reclaim_batch(struct r5c_cache *cache, int batch)
 		cache->reserved_space = size;
 	}
 
-	size = (conf->chunk_sectors >> PAGE_SECTOR_SHIFT) * batch;
+	size = (conf->chunk_sectors >> PAGE_SECTOR_SHIFT) * batch * threads;
 
 	mddev_lock(mddev);
 	if (size > conf->max_nr_stripes)
@@ -3241,7 +3410,7 @@ static ssize_t r5c_store_cache_reclaim_batch(struct mddev *mddev,
 		new = r5l_max_flush_stripes(&cache->log);
 
 	if (new != cache->reclaim_batch)
-		r5c_set_reclaim_batch(cache, new);
+		r5c_set_reclaim_batch(cache, new, cache->reclaim_worker_cnt);
 	return len;
 }
 
@@ -3337,6 +3506,68 @@ static struct md_sysfs_entry r5c_cache_memory_watermark =
 	__ATTR(cache_memory_watermark, S_IRUGO | S_IWUSR,
 	r5c_show_cache_memory_watermark, r5c_store_cache_memory_watermark);
 
+static ssize_t r5c_show_reclaim_threads(struct mddev *mddev, char *page)
+{
+	struct r5conf *conf = mddev->private;
+	struct r5c_cache *cache = conf->cache;
+
+	return sprintf(page, "%d\n", cache->reclaim_worker_cnt);
+}
+
+static void r5c_set_reclaim_thread_count(struct r5c_cache *cache, int cnt)
+{
+	struct md_thread *thread;
+	int old_cnt;
+
+	if (cache->reclaim_worker_cnt == cnt)
+		return;
+	if (cnt > MAX_RECLAIM_WORKERS)
+		cnt = MAX_RECLAIM_WORKERS;
+
+	old_cnt = cache->reclaim_worker_cnt;
+	if (old_cnt > cnt) {
+		cache->reclaim_worker_cnt = cnt;
+		/* make sure r5c_wake_reclaimer() isn't using thread */
+		synchronize_sched();
+	}
+
+	while (old_cnt > cnt) {
+		r5c_exit_reclaim_thread(cache,
+			&cache->reclaim_workers[old_cnt - 1]);
+		old_cnt--;
+	}
+	while (old_cnt < cnt) {
+		thread = r5c_init_reclaim_thread(cache);
+		if (!thread)
+			break;
+		cache->reclaim_workers[old_cnt++] = thread;
+	}
+
+	r5c_set_reclaim_batch(cache, cache->reclaim_batch, old_cnt);
+
+	cache->reclaim_worker_cnt = old_cnt;
+}
+
+static ssize_t r5c_store_reclaim_threads(struct mddev *mddev,
+	const char *page, size_t len)
+{
+	struct r5conf *conf = mddev->private;
+	struct r5c_cache *cache = conf->cache;
+	unsigned int new;
+
+	if (len >= PAGE_SIZE)
+		return -EINVAL;
+	if (kstrtouint(page, 0, &new))
+		return -EINVAL;
+
+	r5c_set_reclaim_thread_count(cache, new);
+	return len;
+}
+
+static struct md_sysfs_entry r5c_cache_reclaim_threads =
+	__ATTR(cache_reclaim_threads, S_IRUGO | S_IWUSR,
+	r5c_show_reclaim_threads, r5c_store_reclaim_threads);
+
 static int r5c_init_sysfs(struct r5c_cache *cache)
 {
 	struct mddev *mddev = cache->mddev;
@@ -3363,7 +3594,16 @@ static int r5c_init_sysfs(struct r5c_cache *cache)
 				      &r5c_cache_memory_watermark.attr, NULL);
 	if (ret)
 		goto memory_watermark;
+
+	ret = sysfs_add_file_to_group(&mddev->kobj,
+				      &r5c_cache_reclaim_threads.attr, NULL);
+	if (ret)
+		goto reclaim_threads;
+
 	return 0;
+reclaim_threads:
+	sysfs_remove_file_from_group(&mddev->kobj,
+		&r5c_cache_memory_watermark.attr, NULL);
 memory_watermark:
 	sysfs_remove_file_from_group(&mddev->kobj,
 		&r5c_cache_stat.attr, NULL);
@@ -3392,6 +3632,8 @@ static void r5c_exit_sysfs(struct r5c_cache *cache)
 		&r5c_cache_stat.attr, NULL);
 	sysfs_remove_file_from_group(&mddev->kobj,
 		&r5c_cache_memory_watermark.attr, NULL);
+	sysfs_remove_file_from_group(&mddev->kobj,
+		&r5c_cache_reclaim_threads.attr, NULL);
 }
 
 static void r5c_free_cache_data(struct r5c_cache *cache)
@@ -3446,10 +3688,6 @@ struct r5c_cache *r5c_init_cache(struct r5conf *conf, struct md_rdev *rdev)
 	cache->stripe_parity_pages = (cache->stripe_size -
 		cache->stripe_data_size) >> PAGE_SECTOR_SHIFT;
 
-	cache->stripe_flush_data = kmalloc(PAGE_SIZE, GFP_KERNEL);
-	if (!cache->stripe_flush_data)
-		goto io_range_kc;
-
 	cache->io_range_kc = KMEM_CACHE(r5c_io_range, 0);
 	if (!cache->io_range_kc)
 		goto io_range_kc;
@@ -3497,18 +3735,15 @@ struct r5c_cache *r5c_init_cache(struct r5conf *conf, struct md_rdev *rdev)
 
 	r5c_calculate_watermark(cache);
 
-	cache->reclaim_thread = md_register_thread(r5c_reclaim_thread,
-		mddev, "reclaim");
-	if (!cache->reclaim_thread)
+	if (r5c_init_reclaimers(cache))
 		goto err_page;
-	cache->reclaim_thread->timeout = CHECKPOINT_TIMEOUT;
 
 	r5c_shrink_cache_memory(cache, cache->max_pages);
 	if (r5c_init_sysfs(cache))
 		goto err_sysfs;
 	return cache;
 err_sysfs:
-	md_unregister_thread(&cache->reclaim_thread);
+	r5c_exit_reclaimers(cache);
 err_page:
 	r5c_free_cache_data(cache);
 
@@ -3520,7 +3755,6 @@ struct r5c_cache *r5c_init_cache(struct r5conf *conf, struct md_rdev *rdev)
 stripe_kc:
 	kmem_cache_destroy(cache->io_range_kc);
 io_range_kc:
-	kfree(cache->stripe_flush_data);
 	kfree(cache);
 	return NULL;
 }
@@ -3528,7 +3762,7 @@ struct r5c_cache *r5c_init_cache(struct r5conf *conf, struct md_rdev *rdev)
 void r5c_exit_cache(struct r5c_cache *cache)
 {
 	r5c_exit_sysfs(cache);
-	md_unregister_thread(&cache->reclaim_thread);
+	r5c_exit_reclaimers(cache);
 	r5l_exit_log(&cache->log);
 
 	r5c_free_cache_data(cache);
@@ -3537,6 +3771,5 @@ void r5c_exit_cache(struct r5c_cache *cache)
 	kmem_cache_destroy(cache->stripe_kc);
 	kmem_cache_destroy(cache->io_range_kc);
 
-	kfree(cache->stripe_flush_data);
 	kfree(cache);
 }
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Fwd: [PATCH v3 0/8] MD: a caching layer for raid5/6
  2015-06-03 22:48 [PATCH v3 0/8] MD: a caching layer for raid5/6 Shaohua Li
                   ` (6 preceding siblings ...)
  2015-06-03 22:48 ` [PATCH v3 8/8] raid5: multi-thread support for raid5 caching reclaim Shaohua Li
@ 2015-06-04 19:29 ` Davor Vusir
       [not found] ` <c6df8779f11a4dc3362a04e7cee0be2aec213ebe.1433356864.git.shli@fb.com>
  8 siblings, 0 replies; 13+ messages in thread
From: Davor Vusir @ 2015-06-04 19:29 UTC (permalink / raw)
  To: linux-raid

Sorry for intruding on the patch reviewing process and for the novice
question.

Is this caching layer also suitable for a smaller system with only one
raid5 where /boot and / reside? Or is it meant for dedicated data storage
only?

Regards
Davor Vusir



-------- Forwarded message --------
Subject: [PATCH v3 0/8] MD: a caching layer for raid5/6
Date: Wed, 3 Jun 2015 15:48:35 -0700
From: Shaohua Li <shli@fb.com>
To: linux-raid@vger.kernel.org
Cc: Kernel-team@fb.com, songliubraving@fb.com, hch@infradead.org,
dan.j.williams@intel.com, neilb@suse.de


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/8] MD: add a new disk role to present cache device
  2015-06-03 22:48 ` [PATCH v3 1/8] MD: add a new disk role to present cache device Shaohua Li
@ 2015-06-17 23:32   ` Neil Brown
  0 siblings, 0 replies; 13+ messages in thread
From: Neil Brown @ 2015-06-17 23:32 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

On Wed, 3 Jun 2015 15:48:36 -0700
Shaohua Li <shli@fb.com> wrote:

> From: Song Liu <songliubraving@fb.com>
> 
> Next patches will use a disk as raid5/6 caching. We need a new disk role
> to present the cache device
> 
> Not sure if we should bump up the MD superblock version for the disk
> role.

No need to increase the superblock version, but you would need to add a
feature flag (for feature_map) which was set whenever the array had a
caching device.
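
A minimal sketch of what that could look like (the bit value below is
illustrative only, not an assigned feature bit):

	#define MD_FEATURE_WRITE_CACHE	32	/* illustrative value */

	/* in super_1_sync(), once a cache device is seen: */
	if (test_bit(WriteCache, &rdev2->flags))
		sb->feature_map |= cpu_to_le32(MD_FEATURE_WRITE_CACHE);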

NeilBrown

> 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/md.c                | 14 +++++++++++++-
>  drivers/md/md.h                |  4 ++++
>  include/uapi/linux/raid/md_p.h |  1 +
>  3 files changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 2750630..6297087 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1656,6 +1656,9 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
>  		case 0xfffe: /* faulty */
>  			set_bit(Faulty, &rdev->flags);
>  			break;
> +		case 0xfffd: /* cache device */
> +			set_bit(WriteCache, &rdev->flags);
> +			break;
>  		default:
>  			rdev->saved_raid_disk = role;
>  			if ((le32_to_cpu(sb->feature_map) &
> @@ -1811,6 +1814,8 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
>  			sb->dev_roles[i] = cpu_to_le16(0xfffe);
>  		else if (test_bit(In_sync, &rdev2->flags))
>  			sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
> +		else if (test_bit(WriteCache, &rdev2->flags))
> +			sb->dev_roles[i] = cpu_to_le16(0xfffd);
>  		else if (rdev2->raid_disk >= 0)
>  			sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
>  		else
> @@ -5780,7 +5785,8 @@ static int get_disk_info(struct mddev *mddev, void __user * arg)
>  		else if (test_bit(In_sync, &rdev->flags)) {
>  			info.state |= (1<<MD_DISK_ACTIVE);
>  			info.state |= (1<<MD_DISK_SYNC);
> -		}
> +		} else if (test_bit(WriteCache, &rdev->flags))
> +			info.state |= (1<<MD_DISK_WRITECACHE);
>  		if (test_bit(WriteMostly, &rdev->flags))
>  			info.state |= (1<<MD_DISK_WRITEMOSTLY);
>  	} else {
> @@ -5895,6 +5901,8 @@ static int add_new_disk(struct mddev *mddev, mdu_disk_info_t *info)
>  		else
>  			clear_bit(WriteMostly, &rdev->flags);
>  
> +		if (info->state & (1<<MD_DISK_WRITECACHE))
> +			set_bit(WriteCache, &rdev->flags);
>  		/*
>  		 * check whether the device shows up in other nodes
>  		 */
> @@ -7263,6 +7271,10 @@ static int md_seq_show(struct seq_file *seq, void *v)
>  				seq_printf(seq, "(F)");
>  				continue;
>  			}
> +			if (test_bit(WriteCache, &rdev->flags)) {
> +				seq_printf(seq, "(C)");
> +				continue;
> +			}
>  			if (rdev->raid_disk < 0)
>  				seq_printf(seq, "(S)"); /* spare */
>  			if (test_bit(Replacement, &rdev->flags))
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 4046a6c..6857592 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -175,6 +175,10 @@ enum flag_bits {
>  				 * This device is seen locally but not
>  				 * by the whole cluster
>  				 */
> +	WriteCache,		/* This device is used as write cache.
> +				 * Usually, this device should be faster
> +				 * than other devices in the array
> +				 */
>  };
>  
>  #define BB_LEN_MASK	(0x00000000000001FFULL)
> diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
> index 2ae6131..9d36b91 100644
> --- a/include/uapi/linux/raid/md_p.h
> +++ b/include/uapi/linux/raid/md_p.h
> @@ -89,6 +89,7 @@
>  				   * read requests will only be sent here in
>  				   * dire need
>  				   */
> +#define MD_DISK_WRITECACHE      18 /* disk is used as the write cache in RAID-5/6 */
>  
>  typedef struct mdp_device_descriptor_s {
>  	__u32 number;		/* 0 Device number in the entire set	      */


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 3/8] raid5: A caching layer for RAID5/6
       [not found] ` <c6df8779f11a4dc3362a04e7cee0be2aec213ebe.1433356864.git.shli@fb.com>
@ 2015-06-18  1:00   ` Neil Brown
  2015-06-18  5:24     ` Shaohua Li
  0 siblings, 1 reply; 13+ messages in thread
From: Neil Brown @ 2015-06-18  1:00 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

On Wed, 3 Jun 2015 15:48:38 -0700 Shaohua Li <shli@fb.com> wrote:

Hi,
 sorry for the delay in getting to this.

3000 lines of code is rather hard to review - especially with so few
comments :-)
There seem to be a number of fairly well defined modules in the code,
such as managing the various data structures.  Maybe if these arrived
one per patch it would be easier to review them individually.


> Main goal of the caching layer is to aggregate IO write to hopefully
> make full stripe IO write and fix write hole issue. This might speed up
> read too, but it's not optimized for read, eg, we don't proactively
> cache data for read. The aggregation makes a lot of sense for workloads
> which sequentially write to several files with/without fsync. Such
> workloads are popular in today's datacenter.
> 
> Write IO data will write to a cache disk (SSD) first, then later the
> data will be flushed to raid disks.
> 
> The cache disk will be organized as a simple ring buffer log. For IO
> data, a tuple (raid_sector, io length, checksum, data) will be appended
> to the log; for raid 5/6 parity, a tuple (stripe_sector, parity length,
> checksum, parity data) will be appended to the log. We don't have
> on-disk index for the data appended to the log. So either we can rebuild
> an in-memory index at startup with scanning the whole log, or we can
> flush all data from cache disk to raid disks at shutdown so cache disk
> has no valid data. Current code chooses the second option, but this can
> be easily changed.
> 
> We have a simple meta data for above tuples. It's essentially a
> tuple (sequence, metadata length).

How are these tuples stored in the log?  One tuple per block?  A block
with lots of tuples followed by all the described data?  Or does the
block of tuples follow the data?


>                                    Crash recovery or startup will scan
> the log to read the metadata and rebuild in-memory index. If metadata is
> broken at the head of the log, even if metadata afterward is ok, the
> scanning will not work well. So we take some steps to mitigate the
> issue:
> -A flush request is sent to cache disk every second to relieve the issue

I don't feel at all comfortable about the every-second flush.
I can certainly understand a flush after 50% of the log space is
written, or even 25%.  but time-based flushes should be coming from the
filesystem.



> -FLUSH/FUA request is carefully handled. FLUSH/FUA will only be
> dispatched after all previous requests finish.

This certainly makes sense.

> 
> The in-memory index is a simple list of io_range (sequence, metadata
> sector, data sector, length). The list is ordered by sequence. The first
> io_range entry's metadata sector is the tail of the log. There is also a
> struct to track io ranges within a stripe. All stripes will be organized
> as a radix tree.

It would be useful here to briefly say how these data structures are
used.
I assumed the order-by-sequence list is to flush data to RAID when the
log is getting full?

The radix-tree tracks where in the log each stripe is - is that correct?
So it is easy to find the stripe at the tail of the list to flush it
out.

> 
> All IO data will be copied to a memory pool for caching too until the
> data is flushed to raid disks. This is just to simplify the
> implementation, it's not mandated. In the future flush can do a data
> copy from cache disk to raid disks, so the memory pool can be discarded.
> If a stripe is flushed to raid disks, memory of the stripe can be
> reused.

Again, why a separate memory pool rather than just leaving the data in
the stripe cache until it is safe in the RAID?


> 
> We have two limited resources here, memory pool and cache disk space. If
> resource is tight, we will do reclaim. In either case, we will flush
> some data from cache disk to raid disks. However, we implement different
> strategies. For memory reclaim, we prefer reclaiming full stripe. For
> cache disk space reclaim, we prefer reclaiming io_range entry at the
> head of index list.

Wouldn't full stripes be scheduled to the RAID immediately they become
full (or immediately after the parity is safe in the log)?
So for memory reclaim it would make sense to prefer stripes that have
been idle for a long time - so an LRU list.

Certainly when the log approaches full you need to start flushing
things at the start of the log.

> 
> If cache disk has IO error, since all data are in memory pool, we will
> flush all data to raid disks and fallback to no-cache-disk mode. IO
> error of cache disk doesn't corrupt any data. After some time, we will
> try to use cache disk again if the disk is ok. The retry will stop after
> several rounds of failure.

The normal data flow involves lots of writing to the cache device and
very little if any reading.  So we will probably need some sort of
"scrub" process to read and verify the log occasionally, just to be
sure that reads still work.  That could be largely done in user-space.

> 
> We always do reclaim in stripe unit. Reclaim could create holes in the
> log, eg, some io_range in the middle is reclaimed, but io_range at the
> head remains. So the index list entries don't always have continuous
> sequence. But this doesn't matter, the first io_range is always the log
> tail. Superblock has a field pointing to the position of log tail. The
> hole can waste a lot of disk space though. In the future, we can
> implement a garbage collection to mitigate the issue, eg, copy data
> from the index tail to head.

I doubt there would be value in garbage collection.  Just flush the old
stripes to the RAID.  Maybe I'm wrong, but I certainly agree that it
isn't a priority.

> 
> In the process reclaim flush data to raid disk, stripe parity will be
> appended to the cache log. Parity is always appended after its corresponding
> data. After parity is in cache disk, a flush_start block is appended to
> log, which indicates stripe is flushing to raid disks. Data writing to
> raid disks only happens after all data and parity are already in cache
> disk. This will fix the write hole issue. After a stripe is flushed to
> raid disks, a flush_end block is appended to log, which indicates a
> stripe is settled down in raid disks.

I'm not sure that a separate "flush start" block is needed. Once all
the parity blocks have been written we can assume that the flush has
started.
A "flush-end" block does make sense, though I would think of it as an
'invalidate' block in that it invalidates specific previous data blocks.
Same concept though.


> 
> Recovery relies on the flush_start and flush_end block. If recovery
> finds data and parity of a stripe, the flush start/end block will be
> used to determine which stage the stripe is in flush. If the stripe is
> listed in flush end block, the stripe is in raid disks, all data and
> parity of the stripe can be discarded. If the stripe isn't listed in
> flush start block, the stripe hasn't started flush to raid yet, its
> parity can be ignored. Otherwise, the stripe is in the middle of
> flushing to raid disks.  Since we have both data and parity, the
> recovery just rewrite them to raid disks.
> 
> IO write code path:
> 1. copy bio data to stripe memory pages
> 2. append metadata and data to cache log
> 3. IO write endio
> 
> reclaim code path:
> 1. select stripe to reclaim
> 2. write all stripe data to raid disk
> 3. in raid5 ops_run_io, append metadata and parity data to cache log.
>     ops_run_io doesn't write data/parity to raid disks at this time
> 4. flush cache disk and write flush_start block
> 5. ops_run_io continues. data/parity will be written to raid disks
> 6. flush all raid disks cache
> 7. write flush_end block
> 8. delete in-memory index of the stripe, and advance superblock log checkpoint

I find this a bit confusing.  Step 2 writes to the raid disk, but step
3 says it doesn't write to the raid disk yet.  It also doesn't tell me
when parity is calculated.

I imagine:

1. select stripe to reclaim
2. read missing data or parity block so new parity can be calculated.
3. calculate parity and write to log - this records that the flush has
   started.
4. flush cache device - probably multiple stripes will get up to the
   step, and then a single flush will be performed for all of them.
> 5. schedule writes to the RAID disks, both data and parity.
6. flush RAID disks - again, wait for lots of stripes to reach this
   point, then perform a single flush
7. write flush_end / invalidate block for all of the flushed stripes
8. delete in-memory index and advance superblock log checkpoint.
   flush the invalidate blocks before writing the superblock.

If you get '4' for one set of stripes to line up with '8' for the
previous set of stripes, then you can combine the two 'flush'
operations to the cache devices.
So the cache device sees:
  FLUSH superblock update, new parity writes, flush_end of older writes FLUSH
with data writes sprinkled through wherever needed.

> 
> Recovery:
> Crash in IO write code path doesn't need recovery. If data and checksum
> don't match, the data will be ignored so read will return old data. In
> reclaim code path, crash before step 4 doesn't need recovery as
> data/parity don't touch raid disk yet. Parity can be ignored too. crash
> after 7 doesn't need recovery too, as the stripe is fully flushed to
> raid disks. Crash between 4 and 7 need recovery. Data and parity in the
> log will be written to raid disks.

So recovery involves reading everything in the log, adding data and
parity to the stripe cache as it is found, invalidating any entries
when an 'invalidate' block is found.
Then any stripe that has all required parity is written to the RAID,
and any other stripe is discarded.
Then the log is emptied and we start from scratch.

> 
> The performance of the raid will largely be determined by reclaim speed
> at run time. Several stages of the reclaim process involves IO wait or
> disk cache flush, which significantly impact the raid disk utilization
> and performance. The flush_start and flush_end block make running
> multiple reclaim possible. Each reclaim only records stripes in the
> flush start/end block which it is reclaiming.  Recovery can use the
> information to correctly determine stripe's flush stage. An example of 2
> reclaimer:
> 
> xxx1 belongs to stripe1 and the same for stripe2
> 
> reclaim 1:  |data1|...|parity1|...|flush_start1|...|flush_end1|
> reclaim 2:      |data2|..|parity2|...............|flush_start2|...|flush_end2|
> 
> the reclaims only record its own stripes in flush block. If, for
> example, recovery finds flush_start1, it knows stripe1 is flushing to
> raid disk. Recovery will ignore stripe2, since stripe2 isn't in
> flush_start1.
> 
> Multiple reclaim will efficiently solve the performance issue. Current
> code hasn't added multiple reclaim yet, but it certainly will be added soon.

Maybe this is what you have in mind, but I would use a state-machine
approach for the reclaim
i.e. schedule a collection of stripes for parity calculation.
    as they complete, schedule the writes to cache
    When there are no pending writes to cache, schedule a flush
    etc.
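
Purely as an illustration of the stages such a state machine would step
through (names invented here, not taken from the patch):

	enum reclaim_stage {
		RECLAIM_IDLE,
		RECLAIM_CALC_PARITY,	/* parity being computed for the batch */
		RECLAIM_LOG_PARITY,	/* parity writes to the log in flight */
		RECLAIM_LOG_FLUSH,	/* one FLUSH to the cache device per batch */
		RECLAIM_RAID_WRITE,	/* data and parity going to the raid disks */
		RECLAIM_RAID_FLUSH,	/* one FLUSH to the raid disks per batch */
		RECLAIM_WRITE_END,	/* flush_end/invalidate block written */
	};

Each batch of stripes advances independently, so the cache device and the
raid disks can both stay busy without needing several reclaim threads.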


> 
> V3:
> -fix a use after free bug. fix bio to cache disk crosses disk boundary
> -adjust memory watermark
> V2:
> -fix bugs and code clean up
> -log meta data write isn't FUA any more
> -discard only runs when the discard area is big enough
> 
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/Makefile            |    2 +-
>  drivers/md/raid5-cache.c       | 3246 ++++++++++++++++++++++++++++++++++++++++
>  drivers/md/raid5.c             |   69 +-
>  drivers/md/raid5.h             |   15 +-
>  include/uapi/linux/raid/md_p.h |   72 +
>  5 files changed, 3392 insertions(+), 12 deletions(-)
>  create mode 100644 drivers/md/raid5-cache.c
> 
> diff --git a/drivers/md/Makefile b/drivers/md/Makefile
> index dba4db5..aeb4330 100644
> --- a/drivers/md/Makefile
> +++ b/drivers/md/Makefile
> @@ -16,7 +16,7 @@ dm-cache-mq-y   += dm-cache-policy-mq.o
>  dm-cache-cleaner-y += dm-cache-policy-cleaner.o
>  dm-era-y	+= dm-era-target.o
>  md-mod-y	+= md.o bitmap.o
> -raid456-y	+= raid5.o
> +raid456-y	+= raid5.o raid5-cache.o
>  
>  # Note: link order is important.  All raid personalities
>  # and must come before md.o, as they each initialise 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> new file mode 100644
> index 0000000..c21d2f2
> --- /dev/null
> +++ b/drivers/md/raid5-cache.c
> @@ -0,0 +1,3246 @@
> +/*
> + * A caching layer for raid 4/5/6
> + *
> + * Copyright (C) 2015 Shaohua Li <shli@fb.com>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along with
> + * this program; if not, write to the Free Software Foundation, Inc.,
> + * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
> + */
> +#include <linux/kernel.h>
> +#include <linux/wait.h>
> +#include <linux/blkdev.h>
> +#include <linux/slab.h>
> +#include <linux/raid/md_p.h>
> +#include <linux/crc32.h>
> +#include <linux/list_sort.h>
> +#include "md.h"
> +#include "raid5.h"
> +
> +/* the log disk cache will be flushed forcely every second */
> +#define LOG_FLUSH_TIME HZ
> +#define LOG_SUPER_WRITE_TIME (5 * HZ)
> +
> +#define MAX_MEM (256 * 1024 * 1024)
> +#define RECLAIM_BATCH 16
> +#define CHECKPOINT_TIMEOUT (5 * HZ)
> +
> +#define MAX_RETRY 10
> +#define IOERR_RETRY_TIME (5 * 60 * HZ)
> +#define MEMERR_RETRY_TIME (60 * HZ)
> +
> +typedef u64 r5blk_t; /* log blocks, could be 512B - 4k */

Why isn't this just "sector_t" ??


> +
> +struct r5l_log {
> +	struct r5c_cache *cache;

It isn't immediately clear what the "log" is separate from the "cache".
If there is a good reason, then a comment with that reason here would
be very helpful.

> +	struct block_device *bdev;
> +	struct md_rdev *rdev;

Is there a reason for storing both of these instead of just 'rdev' and
using 'rdev->bdev' when that is needed?


> +
> +	struct page *super_page;
> +	u8 uuid[16];
> +	u32 uuid_checksum_data;
> +	u32 uuid_checksum_meta;
> +
> +	unsigned int block_size; /* bytes */
> +	unsigned int block_sector_shift;
> +	unsigned int page_block_shift;
> +	unsigned int stripe_data_size; /* sector */
> +	unsigned int chunk_size; /* sector */
> +	unsigned int stripe_size; /* sector */
> +	unsigned int parity_disks;

Some of these are obvious, some are not.
Some more specific comments would help.

> +
> +	r5blk_t total_blocks;
> +	r5blk_t first_block;
> +	r5blk_t last_block;
> +
> +	r5blk_t low_watermark; /* For disk space */
> +	r5blk_t high_watermark;
> +
> +	r5blk_t last_checkpoint;
> +	r5blk_t last_freepoint;
> +	r5blk_t super_point;
> +	u64 last_cp_seq;
> +	u64 last_freeseq;
> +	unsigned long last_cp_time;
> +	struct work_struct discard_work;
> +
> +	int do_discard;
> +
> +	u64 seq; /* get after read log */
> +	r5blk_t log_start; /* get after read log */
> +
> +	u8 data_checksum_type;
> +	u8 meta_checksum_type;
> +
> +	unsigned int reserved_blocks;
> +	wait_queue_head_t space_waitq;
> +
> +	struct mutex io_mutex;
> +	struct r5l_io_unit *current_io;
> +
> +	spinlock_t io_list_lock;
> +	struct list_head running_ios; /* order is important */
> +
> +	struct list_head task_list;
> +	struct list_head parity_task_list;
> +	spinlock_t task_lock;
> +
> +	struct kmem_cache *io_kc;
> +	mempool_t *io_pool;
> +	struct bio_set *bio_set;
> +
> +	unsigned long next_flush_time;
> +	struct work_struct flush_work;
> +};
> +
> +struct r5l_task;
> +/* end function must free task */
> +typedef void (r5l_task_end_fn)(struct r5l_task *task, int error);
> +struct r5l_task {

A comment here telling me what these tasks do would help me understand
the data structure.


> +	struct list_head list;
> +	int type;
> +	struct bio *bio;
> +	union {
> +		struct bio *orig_bio;
> +		struct {
> +			sector_t stripe_sector;
> +			struct page *page_p;
> +			struct page *page_q;
> +		};
> +	};
> +	/* tasks in a single r5l_io_unit will have the same seq and meta_start */
> +	sector_t meta_start;
> +	sector_t data_start;
> +	u64 seq;
> +	r5l_task_end_fn *fn;
> +	void *private;
> +	unsigned int reserved_blocks;
> +	u32 checksum[];
> +};
> +
> +/*
> + * Data and metadata in log is written with normal IO write. A power failure
> + * can cause data loss. To relieve the issue, log disk cache will be flushed
> + * forcely every second.
> + *
> + * Note IO is finished out of order. If metadata corrupts in the middle,
> + * recovery can't work well even metadata/data at the tail is good. IO in log
> + * tail could finish earlier than IO ahead, so we must be very careful to
> + * handle FLUSH/FUA bio.
> + *
> + * FLUSH bio: the bio will be dispatched after all previous IO finish. FLUSH
> + * syntax doesn't require pending IO finish, but pending IO might be a metadata
> + * in the middle of log, we must force this order.
> + *
> + * FUA bio: same like FLUSH bio, previous meta IO must be finished before
> + * dispatching this bio. To simplify implementation, we wait all previous IO.
> + * And we must add a FLUSH to make sure previous IO hit disk media. metadata
> + * and the bio itself must be written with FUA.
> + * */
> +struct r5l_io_unit {
> +	struct r5l_log *log;
> +	struct list_head log_sibling;
> +
> +	struct page *meta_page;
> +	sector_t meta_sector;
> +	int meta_offset;
> +	u64 seq;
> +	struct bio *meta_bio;
> +
> +	struct list_head tasks;
> +	struct bio *current_bio;
> +	atomic_t refcnt;
> +
> +	unsigned int has_flush:1; /* include flush request */
> +	unsigned int has_fua:1; /* include fua request */
> +	unsigned int has_null_flush:1; /* include empty flush request */
> +	/*
> +	 * io isn't sent yet, flush/fua request can only be submitted till it's
> +	 * the first IO in running_ios list
> +	 * */
> +	unsigned int io_deferred:1;
> +	int error;
> +};
> +
> +struct r5c_io_range {
> +	struct list_head log_sibling;
> +	struct list_head stripe_sibling;
> +
> +	u64 seq;
> +
> +	sector_t meta_start; /* cache position */
> +	sector_t data_start; /* cache position */
> +	sector_t raid_start;
> +	unsigned int data_sectors;
> +
> +	struct r5c_stripe *stripe;
> +	union {
> +		struct bio *bio;
> +		u32 *checksum; /* only for recovery */
> +	};
> +};
> +
> +struct r5c_stripe {
> +	u64 raid_index;
> +	struct r5c_cache *cache;
> +	atomic_t ref;
> +	int state;
> +	int recovery_state; /* just for recovery */
> +
> +	struct list_head io_ranges; /* order list */
> +	union {
> +		struct list_head stripes;
> +		struct list_head parity_list; /* just for recovery */
> +	};
> +
> +	struct list_head lru;
> +
> +	int existing_pages;
> +	atomic_t dirty_stripes;
> +	atomic_t pending_bios;
> +	struct page **parity_pages; /* just for recovery */
> +	struct page *data_pages[];
> +};
> +
> +enum {
> +	STRIPE_RUNNING = 0,
> +	STRIPE_FROZEN = 1, /* Doesn't accept new IO */
> +	STRIPE_PARITY_DONE = 2,
> +	STRIPE_INRAID = 3,
> +	STRIPE_DEAD = 4,
> +
> +	RECOVERY_NO_FLUSH = 0,
> +	RECOVERY_FLUSH_START = 1, /* stripe in a start flush block */
> +	RECOVERY_FLUSH_END = 2,
> +};

I wouldn't be surprised if some compilers rejected that.  Two separate
enums would be better.
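
Something like (the enum tag names are just a suggestion):

	enum r5c_stripe_state {
		STRIPE_RUNNING = 0,
		STRIPE_FROZEN,		/* doesn't accept new IO */
		STRIPE_PARITY_DONE,
		STRIPE_INRAID,
		STRIPE_DEAD,
	};

	enum r5c_recovery_state {
		RECOVERY_NO_FLUSH = 0,
		RECOVERY_FLUSH_START,	/* stripe in a start flush block */
		RECOVERY_FLUSH_END,
	};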


> +
> +#define STRIPE_LOCK_BITS 8
> +struct r5c_cache {
> +	struct mddev *mddev;
> +	struct md_rdev *rdev;
> +
> +	struct r5l_log log;
> +

Ahh - the log is embedded in the cache.
That probably makes sense.  But you don't need a pointer from the log
to the cache, just use container_of().
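
i.e. a trivial helper instead of the back pointer (assuming the embedded
member keeps the name 'log'):

	static inline struct r5c_cache *r5l_cache(struct r5l_log *log)
	{
		return container_of(log, struct r5c_cache, log);
	}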



> +	spinlock_t tree_lock; /* protect stripe_tree, log_list, full_stripes */
> +	struct radix_tree_root stripe_tree;
> +	struct list_head log_list; /* sorted list of io_range */
> +	struct list_head full_stripes;
> +	int full_stripe_cnt;
> +
> +	struct list_head page_pool;
> +	spinlock_t pool_lock;
> +	u64 free_pages;
> +	u64 total_pages;
> +	u64 max_pages;
> +	u64 low_watermark; /* for memory, pages */
> +	u64 high_watermark;
> +
> +	unsigned int stripe_data_size; /* stripe size excluding parity, sector */
> +	unsigned int chunk_size; /* one disk chunk size including parity, sector */
> +	unsigned int stripe_size; /* stripe size including parity, sector */
> +	unsigned int parity_disks;
> +	unsigned int stripe_data_pages;
> +	unsigned int stripe_parity_pages;
> +
> +	unsigned int reclaim_batch;
> +
> +	unsigned int reserved_space; /* log reserved size, sector */
> +
> +	unsigned long reclaim_reason;
> +	wait_queue_head_t reclaim_wait;
> +	struct md_thread *reclaim_thread;
> +	__le64 *stripe_flush_data;
> +	int quiesce_state;
> +
> +	int in_recovery;
> +
> +	struct work_struct pending_io_work;
> +
> +	spinlock_t stripe_locks[1 << STRIPE_LOCK_BITS];
> +	wait_queue_head_t stripe_waitq[1 << STRIPE_LOCK_BITS];
> +
> +	int error_state;
> +	int error_type;
> +	int retry_cnt;
> +	unsigned long next_retry_time;
> +	struct bio_list retry_bio_list;
> +	wait_queue_head_t error_wait;
> +
> +	struct kmem_cache *io_range_kc;
> +	struct kmem_cache *stripe_kc;
> +	struct bio_set *bio_set;
> +};
> +
> +enum {
> +	RECLAIM_MEM = 0, /* work hard to reclaim memory */
> +	RECLAIM_MEM_BACKGROUND = 1, /* try to reclaim memory */
> +	RECLAIM_MEM_FULL = 2, /* only reclaim full stripe */
> +	RECLAIM_DISK = 8, /* work hard to reclaim disk */
> +	RECLAIM_DISK_BACKGROUND = 9, /* try to reclaim disk */
> +	RECLAIM_FLUSH_ALL = 16, /* flush all data to raid */
> +
> +	QUIESCE_NONE = 0,
> +	QUIESCE_START = 1,
> +	QUIESCE_END = 2,
> +
> +	ERROR_NOERROR = 0,
> +	ERROR_PREPARE = 1, /* Had an error, flushing cache to raid */
> +	ERROR_FINISH = 2, /* Had an error, cache has no data */
> +};

Three enums here :-)


> +
> +#define PAGE_SECTOR_SHIFT (PAGE_SHIFT - 9)

Maybe this should go in md.h - raid1.c raid10.c and raid5.c all use
something like it.

... and that's as far as I got.
I really need this to be a bit more like md-cluster.c: add modules one at a
time which I can understand and review as units.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 5/8] md: don't allow resize/reshape with cache support
  2015-06-03 22:48 ` [PATCH v3 5/8] md: don't allow resize/reshape with cache support Shaohua Li
@ 2015-06-18  1:16   ` Neil Brown
  0 siblings, 0 replies; 13+ messages in thread
From: Neil Brown @ 2015-06-18  1:16 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

On Wed, 3 Jun 2015 15:48:40 -0700
Shaohua Li <shli@fb.com> wrote:

> If cache support is enabled, don't allow resize/reshape in current
> stage. In the future, we can flush all data from cache to raid before
> resize/reshape and then allow resize/reshape.
> 
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/raid5.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 26561d8..29f49c7 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -7207,6 +7207,10 @@ static int raid5_resize(struct mddev *mddev, sector_t sectors)
>  	 * worth it.
>  	 */
>  	sector_t newsize;
> +	struct r5conf *conf = mddev->private;
> +
> +	if (conf->cache)
> +		return -EINVAL;
>  	sectors &= ~((sector_t)mddev->chunk_sectors - 1);
>  	newsize = raid5_size(mddev, sectors, mddev->raid_disks);
>  	if (mddev->external_size &&
> @@ -7258,6 +7262,8 @@ static int check_reshape(struct mddev *mddev)
>  {
>  	struct r5conf *conf = mddev->private;
>  
> +	if (conf->cache)
> +		return -EINVAL;
>  	if (mddev->delta_disks == 0 &&
>  	    mddev->new_layout == mddev->layout &&
>  	    mddev->new_chunk_sectors == mddev->chunk_sectors)

This patch should come before the patch which enables caching - the
code should be correct at each point in the series.

NeilBrown

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 3/8] raid5: A caching layer for RAID5/6
  2015-06-18  1:00   ` [PATCH v3 3/8] raid5: A caching layer for RAID5/6 Neil Brown
@ 2015-06-18  5:24     ` Shaohua Li
  0 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2015-06-18  5:24 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

On Thu, Jun 18, 2015 at 11:00:36AM +1000, Neil Brown wrote:
> On Wed, 3 Jun 2015 15:48:38 -0700 Shaohua Li <shli@fb.com> wrote:
> 
> Hi,
>  sorry for the delay in getting to this.
> 
> 3000 lines of code is rather hard to review - especially with so few
> comments :-)
> There seem to be a number of fairly well defined modules in the code,
> such as managing the various data structures.  Maybe if these arrived
> one per patch it would be easier  to review them individually.

ok, I'll try to split this one.
 
> > Main goal of the caching layer is to aggregate IO write to hopefully
> > make full stripe IO write and fix write hole issue. This might speed up
> > read too, but it's not optimized for read, eg, we don't proactively
> > cache data for read. The aggregation makes a lot of sense for workloads
> > which sequentially write to several files with/without fsync. Such
> > workloads are popular in today's datacenter.
> > 
> > Write IO data will write to a cache disk (SSD) first, then later the
> > data will be flushed to raid disks.
> > 
> > The cache disk will be organized as a simple ring buffer log. For IO
> > data, a tuple (raid_sector, io length, checksum, data) will be appended
> > to the log; for raid 5/6 parity, a tuple (stripe_sector, parity length,
> > checksum, parity data) will be appended to the log. We don't have
> > on-disk index for the data appended to the log. So either we can rebuild
> > an in-memory index at startup with scanning the whole log, or we can
> > flush all data from cache disk to raid disks at shutdown so cache disk
> > has no valid data. Current code chooses the second option, but this can
> > be easily changed.
> > 
> > We have a simple meta data for above tuples. It's essentially a
> > tuple (sequence, metadata length).
> 
> How are these tuples stored in the log?  One tuple per block?  A block
> with lots of tuples followed by all the described data?  Or does the
> block of tuples follow the data?

One or several tuples per block. I'll give more details in a later post.
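
Roughly, the meta block I have in mind looks like this (field names below
are only illustrative; the real on-disk layout will be in the later post):

	struct r5l_meta_header {
		__le32 magic;
		__le32 checksum;	/* checksum of the whole meta block */
		__le64 seq;		/* sequence number of this block */
		__le32 meta_size;	/* bytes of payload that follow */
	};

	struct r5l_meta_payload {
		__le16 payload_type;	/* data or parity */
		__le16 pages;		/* number of pages described */
		__le64 location;	/* raid_sector (data) or stripe_sector (parity) */
		__le32 checksum[];	/* one data checksum per page */
	};

One header plus as many payloads as fit in a block, with the data the
payloads describe appended to the log after it.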
> >                                    Crash recovery or startup will scan
> > the log to read the metadata and rebuild in-memory index. If metadata is
> > broken at the head of the log, even if metadata afterward is ok, the
> > scanning will not work well. So we take some steps to mitigate the
> > issue:
> > -A flush request is sent to cache disk every second to relieve the issue
> 
> I don't feel at all comfortable about the every-second flush.
> I can certainly understand a flush after 50% of the log space is
> written, or even 25%.  but time-based flushes should be coming from the
> filesystem.

Would you like me to delete the every-second flush logic? The reclaim does a
flush too, so this path rarely runs; it's just there for the worst case.

> > -FLUSH/FUA request is carefully handled. FLUSH/FUA will only be
> > dispatched after all previous requests finish.
> 
> This certainly makes sense.
> 
> > 
> > The in-memory index is a simple list of io_range (sequence, metadata
> > sector, data sector, length). The list is ordered by sequence. The first
> > io_range entry's metadata sector is the tail of the log. There is also a
> > struct to track io ranges within a stripe. All stripes will be organized
> > as a radix tree.
> 
> It would be useful here to briefly say how these data structures are
> used.
> I assumed the order-by-sequence list is to flush data to RAID when the
> log is getting full?
> The radix-tree tracks where in the log each stripe is - is that correct?
> So it is easy to find the stripe at the tail of the list to flush it
> out.

The order-by-sequence list tracks the valid data in the log which hasn't
been flushed to the raid yet. The radix-tree maps a stripe sector to its stripe.
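
For example, finding the cached stripe that a raid sector belongs to is just
a divide plus a radix tree lookup (a sketch; helper name hypothetical and
locking elided):

	static struct r5c_stripe *
	r5c_find_stripe(struct r5c_cache *cache, sector_t raid_sector)
	{
		u64 index = raid_sector;

		do_div(index, cache->stripe_data_size);
		return radix_tree_lookup(&cache->stripe_tree, index);
	}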

> > All IO data will be copied to a memory pool for caching too until the
> > data is flushed to raid disks. This is just to simplify the
> > implementation, it's not mandated. In the future flush can do a data
> > copy from cache disk to raid disks, so the memory pool can be discarded.
> > If a stripe is flushed to raid disks, memory of the stripe can be
> > reused.
> 
> Again, why a separate memory pool rather than just leaving the data in
> the stripe cache until it is safe in the RAID?

I'd say a separate pool is much more convenient than using the stripe cache.
The stripe cache handling state machine is complex enough, and I don't
want to make it even more so. Plus, the pool isn't a must-have. A small
pool will cause frequent stripe flushes to the raid disks, which impacts both
performance and latency. I'm planning to remove it in the future, though
if that happens error handling will be hard.

> > We have two limited resources here, memory pool and cache disk space. If
> > resource is tight, we will do reclaim. In either case, we will flush
> > some data from cache disk to raid disks. However, we implement different
> > strategies. For memory reclaim, we prefer reclaiming full stripe. For
> > cache disk space reclaim, we prefer reclaiming io_range entry at the
> > head of index list.
> 
> Wouldn't full stripes be scheduled to the RAID immediately they become
> full (or immediately after the parity is safe in the log)?
> So for memory reclaim it would make sense to prefer stripes that have
> been idle for a long time - so an LRU list.
> 
> Certainly when the log approaches full you need to start flushing
> things at the start of the log.

Yep, an LRU list does make sense, though we don't have it yet. Full
stripes aren't flushed out until we have accumulated several of them. It's a
performance consideration, as a flush must also flush the disk cache, which I
don't want to run very frequently.
> 
> > 
> > If cache disk has IO error, since all data are in memory pool, we will
> > flush all data to raid disks and fallback to no-cache-disk mode. IO
> > error of cache disk doesn't corrupt any data. After some time, we will
> > try to use cache disk again if the disk is ok. The retry will stop after
> > several rounds of failure.
> 
> The normal data flow involves lots of writing to the cache device and
> very little if any reading.  So we will probably need some sort of
> "scrub" process to read and verify the log occasionally, just to be
> sure that reads still work.  That could be largely done in user-space.

makes sense.
> > 
> > We always do reclaim in stripe unit. Reclaim could create holes in the
> > log, eg, some io_range in the middle is reclaimed, but io_range at the
> > head remains. So the index list entries don't always have continuous
> > sequence. But this doesn't matter, the first io_range is always the log
> > tail. Superblock has a field pointing to the position of log tail. The
> > hole can waste a lot of disk space though. In the future, we can
> > implement a garbage collection to mitigate the issue, eg, copy data
> > from the index tail to head.
> 
> I doubt there would be value in garbage collection.  Just flush the old
> stripes to the RAID.  Maybe I'm wrong, but I certainly agree that it
> isn't a priority.
> 
> > 
> > In the process reclaim flush data to raid disk, stripe parity will be
> > appended to the cache log. Parity is always appended after its corresponding
> > data. After parity is in cache disk, a flush_start block is appended to
> > log, which indicates stripe is flushing to raid disks. Data writing to
> > raid disks only happens after all data and parity are already in cache
> > disk. This will fix the write hole issue. After a stripe is flushed to
> > raid disks, a flush_end block is appended to log, which indicates a
> > stripe is settled down in raid disks.
> 
> I'm not sure that a separate "flush start" block is needed. Once all
> the parity blocks have been written we can assume that the flush has
> started.
> A "flush-end" block does make sense, though I would think of it as an
> 'invalidate' block in that it invalidates specific previous data blocks.
> Same concept though.

My original implementation didn't have the 'flush start' block, but
once we have multiple reclaim threads (or the state machine you mentioned
handling several sets of stripe flushes), the 'flush start' block lets the
recovery code correctly distinguish stripe flush states and finish
recovery. The reason is that with multiple sets of stripes and a crash,
recovery can't distinguish the sets. For example, a set has 8 stripes
writing parity to the cache, but recovery only sees 7 with parity and
doesn't know there should be 8 stripes. In this case, the parity of those 7
stripes should be discarded because the disk cache wasn't flushed yet (we
flush the cache disk's cache only after all 8 stripes have parity on disk),
so we shouldn't trust the parity data. Maybe the log checksum can help here;
I must double check.
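
In code terms, recovery then makes a per-stripe decision roughly like this
(a sketch only, using the recovery_state values from the patch):

	switch (stripe->recovery_state) {
	case RECOVERY_FLUSH_END:
		/* listed in a flush_end block: already settled in the raid,
		 * everything in the log for this stripe can be dropped */
		break;
	case RECOVERY_NO_FLUSH:
		/* never listed in a flush_start block: parity in the log
		 * can't be trusted, keep the data only */
		break;
	case RECOVERY_FLUSH_START:
		/* flush was in progress: data and parity are both in the
		 * log, rewrite them to the raid disks */
		break;
	}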

> > Recovery relies on the flush_start and flush_end block. If recovery
> > finds data and parity of a stripe, the flush start/end block will be
> > used to determine which stage the stripe is in flush. If the stripe is
> > listed in flush end block, the stripe is in raid disks, all data and
> > parity of the stripe can be discarded. If the stripe isn't listed in
> > flush start block, the stripe hasn't started flush to raid yet, its
> > parity can be ignored. Otherwise, the stripe is in the middle of
> > flushing to raid disks.  Since we have both data and parity, the
> > recovery just rewrite them to raid disks.
> > 
> > IO write code path:
> > 1. copy bio data to stripe memory pages
> > 2. append metadata and data to cache log
> > 3. IO write endio
> > 
> > reclaim code path:
> > 1. select stripe to reclaim
> > 2. write all stripe data to raid disk
> > 3. in raid5 ops_run_io, append metadata and parity data to cache log.
> >     ops_run_io doesn't write data/parity to raid disks at this time
> > 4. flush cache disk and write flush_start block
> > 5. ops_run_io continues. data/parity will be written to raid disks
> > 6. flush all raid disks cache
> > 7. write flush_end block
> > 8. delete in-memory index of the stripe, and advance superblock log checkpoint
> 
> I find this a bit confusing.  Step 2 writes to the raid disk, but step
> 3 says it doesn't write to the raid disk yet.  It also doesn't tell me
> when parity is calculated.
It's in step 2, just like step 2 and part of step 3 in what you wrote below.

> I imagine:
> 
> 1. select stripe to reclaim
> 2. read missing data or parity block so new parity can be calculated.
> 3. calculate parity and write to log - this records that the flush has
>    started.
> 4. flush cache device - probably multiple stripes will get up to the
>    step, and then a single flush will be performed for all of them.
> > 5. schedule writes to the RAID disks, both data and parity.
> 6. flush RAID disks - again, wait for lots of stripes to reach this
>    point, then perform a single flush
> 7. write flush_end / invalidate block for all of the flushed stripes
> 8. delete in-memory index and advance superblock log checkpoint.
>    flush the invalidate blocks before writing the superblock.
> 
> If you get '4' for one set of stripes to line up with '8' for the
> previous set of stripes, then you can combine the two 'flush'
> operations to the cache devices.
> So the cache device sees:
>   FLUSH superblock update, new parity writes, flush_end of older writes FLUSH
> with data writes sprinkled through wherever needed.

Agree.

> > Recovery:
> > Crash in IO write code path doesn't need recovery. If data and checksum
> > don't match, the data will be ignored so read will return old data. In
> > reclaim code path, crash before step 4 doesn't need recovery as
> > data/parity don't touch raid disk yet. Parity can be ignored too. crash
> > after 7 doesn't need recovery too, as the stripe is fully flushed to
> > raid disks. Crash between 4 and 7 need recovery. Data and parity in the
> > log will be written to raid disks.
> 
> So recovery involves reading everything in the log, adding data and
> parity to the stripe cache as it is found, invalidating any entries
> when an 'invalidate' block is found.
> Then any stripe that has all required parity is written to the RAID,
> and any other stripe is discarded.
> Then the log is emptied and we start from scratch.

right
> > 
> > The performance of the raid will largely be determined by reclaim speed
> > at run time. Several stages of the reclaim process involves IO wait or
> > disk cache flush, which significantly impact the raid disk utilization
> > and performance. The flush_start and flush_end block make running
> > multiple reclaim possible. Each reclaim only records stripes in the
> > flush start/end block which it is reclaiming.  Recovery can use the
> > information to correctly determine stripe's flush stage. An example of 2
> > reclaimer:
> > 
> > xxx1 belongs to stripe1 and the same for stripe2
> > 
> > reclaim 1:  |data1|...|parity1|...|flush_start1|...|flush_end1|
> > reclaim 2:      |data2|..|parity2|...............|flush_start2|...|flush_end2|
> > 
> > the reclaims only record its own stripes in flush block. If, for
> > example, recovery finds flush_start1, it knows stripe1 is flushing to
> > raid disk. Recovery will ignore stripe2, since stripe2 isn't in
> > flush_start1.
> > 
> > Multiple reclaim will efficiently solve the performance issue. Current
> > code hasn't added multiple reclaim yet, but it certainly will be added soon.
> 
> Maybe this is what you have in mind, but I would use a state-machine
> approach for the reclaim
> i.e. schedule a collection of stripes for parity calculation.
>     as they complete, schedule the writes to cache
>     When there are no pending writes to cache, schedule a flush
>     etc.

The state machine approach is a good choice too. The problem here is that
a reclaim thread sometimes writes to the cache disk and sometimes to the raid
disks, so neither the cache disk nor the raid disks get enough IO in the
meantime. With multiple reclaim threads, the threads are (hopefully) in
different stages and dispatch enough IO to all disks. A state machine
approach can do a similar thing if it's done correctly. I think we
must make sure recovery works with multiple stripe flushing sets for now;
we can try different reclaim approaches later.


> > 
> > V3:
> > -fix a use after free bug. fix bio to cache disk crosses disk boundary
> > -adjust memory watermark
> > V2:
> > -fix bugs and code clean up
> > -log meta data write isn't FUA any more
> > -discard only runs when the discard area is big enough
> > 
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> >  drivers/md/Makefile            |    2 +-
> >  drivers/md/raid5-cache.c       | 3246 ++++++++++++++++++++++++++++++++++++++++
> >  drivers/md/raid5.c             |   69 +-
> >  drivers/md/raid5.h             |   15 +-
> >  include/uapi/linux/raid/md_p.h |   72 +
> >  5 files changed, 3392 insertions(+), 12 deletions(-)
> >  create mode 100644 drivers/md/raid5-cache.c
> > 
> > diff --git a/drivers/md/Makefile b/drivers/md/Makefile
> > index dba4db5..aeb4330 100644
> > --- a/drivers/md/Makefile
> > +++ b/drivers/md/Makefile
> > @@ -16,7 +16,7 @@ dm-cache-mq-y   += dm-cache-policy-mq.o
> >  dm-cache-cleaner-y += dm-cache-policy-cleaner.o
> >  dm-era-y	+= dm-era-target.o
> >  md-mod-y	+= md.o bitmap.o
> > -raid456-y	+= raid5.o
> > +raid456-y	+= raid5.o raid5-cache.o
> >  
> >  # Note: link order is important.  All raid personalities
> >  # and must come before md.o, as they each initialise 
> > diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> > new file mode 100644
> > index 0000000..c21d2f2
> > --- /dev/null
> > +++ b/drivers/md/raid5-cache.c
> > @@ -0,0 +1,3246 @@
> > +/*
> > + * A caching layer for raid 4/5/6
> > + *
> > + * Copyright (C) 2015 Shaohua Li <shli@fb.com>
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms and conditions of the GNU General Public License,
> > + * version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope it will be useful, but WITHOUT
> > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> > + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> > + * more details.
> > + *
> > + * You should have received a copy of the GNU General Public License along with
> > + * this program; if not, write to the Free Software Foundation, Inc.,
> > + * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
> > + */
> > +#include <linux/kernel.h>
> > +#include <linux/wait.h>
> > +#include <linux/blkdev.h>
> > +#include <linux/slab.h>
> > +#include <linux/raid/md_p.h>
> > +#include <linux/crc32.h>
> > +#include <linux/list_sort.h>
> > +#include "md.h"
> > +#include "raid5.h"
> > +
> > +/* the log disk cache will be flushed forcely every second */
> > +#define LOG_FLUSH_TIME HZ
> > +#define LOG_SUPER_WRITE_TIME (5 * HZ)
> > +
> > +#define MAX_MEM (256 * 1024 * 1024)
> > +#define RECLAIM_BATCH 16
> > +#define CHECKPOINT_TIMEOUT (5 * HZ)
> > +
> > +#define MAX_RETRY 10
> > +#define IOERR_RETRY_TIME (5 * 60 * HZ)
> > +#define MEMERR_RETRY_TIME (60 * HZ)
> > +
> > +typedef u64 r5blk_t; /* log blocks, could be 512B - 4k */
> 
> Why isn't this just "sector_t" ??

It can be sector_t. But with the new type, it's immediately clear whether code
is working on a sector or a block. It helps avoid bugs.
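
For example, the conversions stay obvious at the call site (a sketch, using
the block_sector_shift field from the struct below):

	static inline sector_t r5l_block_to_sector(struct r5l_log *log, r5blk_t block)
	{
		return block << log->block_sector_shift;
	}

	static inline r5blk_t r5l_sector_to_block(struct r5l_log *log, sector_t sect)
	{
		return sect >> log->block_sector_shift;
	}

Mixing the two up is then visible from the types when reading the code,
rather than showing up as a silently wrong offset.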

> > +
> > +struct r5l_log {
> > +	struct r5c_cache *cache;
> 
> It isn't immediately clear what the "log" is separate from the "cache".
> If there is a good reason, then a comment with that reason here would
> be very helpful.

ok, will do

> > +	struct block_device *bdev;
> > +	struct md_rdev *rdev;
> 
> Is there a reason for storing both of these instead of just 'rdev' and
> using 'rdev->bdev' when that is needed?

It's just for convenience.
> 
> > +
> > +	struct page *super_page;
> > +	u8 uuid[16];
> > +	u32 uuid_checksum_data;
> > +	u32 uuid_checksum_meta;
> > +
> > +	unsigned int block_size; /* bytes */
> > +	unsigned int block_sector_shift;
> > +	unsigned int page_block_shift;
> > +	unsigned int stripe_data_size; /* sector */
> > +	unsigned int chunk_size; /* sector */
> > +	unsigned int stripe_size; /* sector */
> > +	unsigned int parity_disks;
> 
> Some of these are obvious, some are not.
> Some more specific comments would help.

Ok, I'll split the patch and add more comments.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2015-06-18  5:24 UTC | newest]

Thread overview: 13+ messages
2015-06-03 22:48 [PATCH v3 0/8] MD: a caching layer for raid5/6 Shaohua Li
2015-06-03 22:48 ` [PATCH v3 1/8] MD: add a new disk role to present cache device Shaohua Li
2015-06-17 23:32   ` Neil Brown
2015-06-03 22:48 ` [PATCH v3 2/8] raid5: directly use mddev->queue Shaohua Li
2015-06-03 22:48 ` [PATCH v3 4/8] raid5: add some sysfs entries Shaohua Li
2015-06-03 22:48 ` [PATCH v3 5/8] md: don't allow resize/reshape with cache support Shaohua Li
2015-06-18  1:16   ` Neil Brown
2015-06-03 22:48 ` [PATCH v3 6/8] raid5: skip resync if caching is enabled Shaohua Li
2015-06-03 22:48 ` [PATCH v3 7/8] raid5: guarantee cache release stripes in correct way Shaohua Li
2015-06-03 22:48 ` [PATCH v3 8/8] raid5: multi-thread support for raid5 caching reclaim Shaohua Li
2015-06-04 19:29 ` Fwd: [PATCH v3 0/8] MD: a caching layer for raid5/6 Davor Vusir
     [not found] ` <c6df8779f11a4dc3362a04e7cee0be2aec213ebe.1433356864.git.shli@fb.com>
2015-06-18  1:00   ` [PATCH v3 3/8] raid5: A caching layer for RAID5/6 Neil Brown
2015-06-18  5:24     ` Shaohua Li
