Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Ddf based RAID management software
From: Arka Sharma @ 2016-11-12 16:33 UTC (permalink / raw)
  To: linux-raid

Hello All,

Is there any tool apart from mdadm available which can create software
RAID based on Ddf metadata. We want to dump the metadata content and
tally with metadata written by mdadm and our application.

Regards,
Arka

^ permalink raw reply

* Re: "creative" bio usage in the RAID code
From: Shaohua Li @ 2016-11-11 19:02 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-raid, linux-block, neilb
In-Reply-To: <20161110194636.GA32241@infradead.org>

On Thu, Nov 10, 2016 at 11:46:36AM -0800, Christoph Hellwig wrote:
> Hi Shaohua,
> 
> one of the major issues with Ming Lei's multipage biovec works
> is that we can't easily enabled the MD RAID code for it.  I had
> a quick chat on that with Chris and Jens and they suggested talking
> to you about it.
> 
> It's mostly about the RAID1 and RAID10 code which does a lot of funny
> things with the bi_iov_vec and bi_vcnt fields, which we'd prefer that
> drivers don't touch.  One example is the r1buf_pool_alloc code,
> which I think should simply use bio_clone for the MD_RECOVERY_REQUESTED
> case, which would also take care of r1buf_pool_free.  I'm not sure
> about all the others cases, as some bits don't fully make sense to me,

The problem is we use the iov_vec to track the pages allocated. We will read
data to the pages and write out later for resync. If we add new fields to track
the pages in r1bio, we could use standard API bio_kmalloc/bio_add_page and
avoid the tricky parts. This should work for both the resync and writebehind
cases.

> e.g. why we're trying to do single page I/O out of a bigger bio.

what's this one?

> Maybe you have some better ideas what's going on there?
> 
> Another not quite as urgent issue is how the RAID5 code abuses
> ->bi_phys_segments as and outstanding I/O counter, and I have no
> really good answer to that either.

I don't have good idea for this one either if we don't want to allocate extra
memory. The good side is we never dispatch the original bio to under layer
disks.

Thanks,
Shaohua

^ permalink raw reply

* Re: Question on blocks periodic writes
From: Theophanis Kontogiannis @ 2016-11-11 12:13 UTC (permalink / raw)
  To: NeilBrown; +Cc: Wols Lists, Linux RAID
In-Reply-To: <87a8d6lr4t.fsf@notabene.neil.brown.name>

Have also run the following for about 6 hours:


-bash-4.2# cat iotop-results
Total DISK READ :    0.00 B/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:    0.00 B/s | Actual DISK WRITE:       0.00 B/s
  PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
  696 be/3 root          0.00 B    956.00 K  0.00 %  0.31 % [jbd2/md0-8]
  750 be/4 root          0.00 B      0.00 B  0.00 %  0.05 % smartd -n -q never
 2134 be/4 mysql       624.00 K    344.85 M  0.00 %  0.01 % mysqld
--basedir=/usr --datadir=/mnt/md/md0/DBs/mys~riadb/mariadb.pid
--socket=/var/lib/mysql/mysql.sock
14110 be/4 root          0.00 B      0.00 B  0.00 %  0.01 % udisksd --no-debug
 5217 be/4 root          0.00 B      0.00 B  0.00 %  0.29 % [kworker/1:0]
 2454 be/4 root          0.00 B      0.00 B  0.00 %  0.04 % [kworker/0:2]
31764 be/4 root          4.00 K      0.00 B  0.00 %  0.00 % -bash
 2368 be/4 root          0.00 B      0.00 B  0.00 %  0.01 % [kworker/1:3]
 4584 be/4 root          0.00 B      0.00 B  0.00 %  0.00 % [kworker/1:1]
 5265 be/4 root          0.00 B      0.00 B  0.00 %  0.00 % [kworker/0:0]
 4595 be/4 root          0.00 B      0.00 B  0.00 %  0.00 % [kworker/0:1]
31747 be/4 root          0.00 B      8.00 K  0.00 %  0.00 % sshd: root@pts/3
  761 be/4 root          0.00 B     10.20 M  0.00 %  0.00 % rsyslogd -n
 7437 be/4 apache        0.00 B     24.00 K  0.00 %  0.00 % httpd -DFOREGROUND
14625 be/4 apache        0.00 B     20.00 K  0.00 %  0.00 % httpd -DFOREGROUND
10059 be/4 apache        0.00 B      8.00 K  0.00 %  0.00 % httpd -DFOREGROUND
10062 be/4 apache        0.00 B     24.00 K  0.00 %  0.00 % httpd -DFOREGROUND
10094 be/4 apache        0.00 B     16.00 K  0.00 %  0.00 % httpd -DFOREGROUND
10095 be/4 apache        0.00 B     20.00 K  0.00 %  0.00 % httpd -DFOREGROUND
10141 be/4 apache        0.00 B     24.00 K  0.00 %  0.00 % httpd -DFOREGROUND
12220 be/4 root          0.00 B      9.99 M  0.00 %  0.00 % python
/usr/bin/weewxd --daemon --pidfile=/var/run/weewx.pid
/etc/weewx/weewx.conf

---
Best regards,
ΜΦΧ,

Theophanis Kontogiannis



On Fri, Nov 11, 2016 at 3:52 AM, NeilBrown <neilb@suse.com> wrote:
> On Fri, Nov 11 2016, Wols Lists wrote:
>
>> On 10/11/16 02:00, NeilBrown wrote:
>>>> [ 8664.858104] xfsaild/md1(658): WRITE block 0 on md1 (8 sectors)
>>> This is XFS doing something.  md cannot possibly stop all IO while the
>>> filesystem performs occasional IO.  If these continue, you need to
>>> discuss with xfs developers how to stop it.  If the writes to individual
>>> drives continue after there are no writes to 'md1', then it is worth
>>> coming back here to ask.
>>>
>>>
>> Would the new journal feature be any help?
>
> Probably not, though until we know what is causing the writes, it is
> hard to say.
>
>>
>> I haven't dug in enough to understand it properly, and it would increase
>> the vulnerability of the system to a journal failure, but the feature
>> itself seems almost perfect for batching writes and enabling the disks
>> to spin down for extended periods.
>
> You might be able to build functionality onto the journal which allows
> the drives in the main array to stay idle for longer, but it doesn't try
> to do that at present.
>
> NeilBrown

^ permalink raw reply

* Re: Question on blocks periodic writes
From: Theophanis Kontogiannis @ 2016-11-11 12:11 UTC (permalink / raw)
  To: NeilBrown; +Cc: Wols Lists, Linux RAID
In-Reply-To: <87a8d6lr4t.fsf@notabene.neil.brown.name>

Hi Neil,

Hi Community,


Regarding XFS we can ignore it. That file system will be moved to an
ext2 file system on a CF.

So we are left with the rest:


-bash-4.2# mdadm --detail /dev/md1
/dev/md1:
        Version : 1.2
  Creation Time : Thu Jun 16 18:02:57 2016
     Raid Level : raid6
     Array Size : 9397248 (8.96 GiB 9.62 GB)
  Used Dev Size : 3132416 (2.99 GiB 3.21 GB)
   Raid Devices : 5
  Total Devices : 5
    Persistence : Superblock is persistent

    Update Time : Fri Nov 11 14:05:33 2016
          State : clean
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : tweety.example.com:1  (local to host tweety.example.com)
           UUID : 98e2af83:dc074310:d1639adb:3f19f0d3
         Events : 127

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       49        1      active sync   /dev/sdd1
       2       8       65        2      active sync   /dev/sde1
       3       8       81        3      active sync   /dev/sdf1
       4       8       97        4      active sync   /dev/sdg1



-bash-4.2# mdadm --examine /dev/sda1
/dev/sda1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 98e2af83:dc074310:d1639adb:3f19f0d3
           Name : tweety.example.com:1  (local to host tweety.example.com)
  Creation Time : Thu Jun 16 18:02:57 2016
     Raid Level : raid6
   Raid Devices : 5

 Avail Dev Size : 6264832 (2.99 GiB 3.21 GB)
     Array Size : 9397248 (8.96 GiB 9.62 GB)
    Data Offset : 4096 sectors
   Super Offset : 8 sectors
   Unused Space : before=4008 sectors, after=0 sectors
          State : clean
    Device UUID : 5ff290a3:68faf9d0:22edd403:abbaf970

    Update Time : Fri Nov 11 14:05:59 2016
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : 55812945 - correct
         Events : 127

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AAAAA ('A' == active, '.' == missing, 'R' == replacing)


Thank you ALL


---
Best regards,
ΜΦΧ,

Theophanis Kontogiannis



On Fri, Nov 11, 2016 at 3:52 AM, NeilBrown <neilb@suse.com> wrote:
> On Fri, Nov 11 2016, Wols Lists wrote:
>
>> On 10/11/16 02:00, NeilBrown wrote:
>>>> [ 8664.858104] xfsaild/md1(658): WRITE block 0 on md1 (8 sectors)
>>> This is XFS doing something.  md cannot possibly stop all IO while the
>>> filesystem performs occasional IO.  If these continue, you need to
>>> discuss with xfs developers how to stop it.  If the writes to individual
>>> drives continue after there are no writes to 'md1', then it is worth
>>> coming back here to ask.
>>>
>>>
>> Would the new journal feature be any help?
>
> Probably not, though until we know what is causing the writes, it is
> hard to say.
>
>>
>> I haven't dug in enough to understand it properly, and it would increase
>> the vulnerability of the system to a journal failure, but the feature
>> itself seems almost perfect for batching writes and enabling the disks
>> to spin down for extended periods.
>
> You might be able to build functionality onto the journal which allows
> the drives in the main array to stay idle for longer, but it doesn't try
> to do that at present.
>
> NeilBrown

^ permalink raw reply

* [PATCH 08/12] dm: dm.c: replace 'bio->bi_vcnt == 1' with !bio_multiple_segments
From: Ming Lei @ 2016-11-11 12:05 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel
  Cc: linux-block, linux-fsdevel, Christoph Hellwig, Ming Lei,
	Alasdair Kergon, Mike Snitzer, maintainer:DEVICE-MAPPER LVM,
	Shaohua Li, open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1478865957-25252-1-git-send-email-tom.leiming@gmail.com>

Avoid to access .bi_vcnt directly, because the bio can be
splitted from block layer, and .bi_vcnt should never have
been used here.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/dm-rq.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index b2a9e2d161e4..d9cc8324e597 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -797,8 +797,13 @@ static void dm_old_request_fn(struct request_queue *q)
 		if (req_op(rq) != REQ_OP_FLUSH)
 			pos = blk_rq_pos(rq);
 
+		/*
+		 * bio can be splitted from block layer, so use
+		 * !bio_multiple_segments instead of 'bio->bi_vcnt == 1'
+		 */
 		if ((dm_old_request_peeked_before_merge_deadline(md) &&
-		     md_in_flight(md) && rq->bio && rq->bio->bi_vcnt == 1 &&
+		     md_in_flight(md) && rq->bio &&
+		     !bio_multiple_segments(rq->bio) &&
 		     md->last_rq_pos == pos && md->last_rq_rw == rq_data_dir(rq)) ||
 		    (ti->type->busy && ti->type->busy(ti))) {
 			blk_delay_queue(q, 10);
-- 
2.7.4

^ permalink raw reply related

* [PATCH 07/12] dm: use bvec iterator helpers to implement .get_page and .next_page
From: Ming Lei @ 2016-11-11 12:05 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel
  Cc: linux-block, linux-fsdevel, Christoph Hellwig, Ming Lei,
	Alasdair Kergon, Mike Snitzer, maintainer:DEVICE-MAPPER LVM,
	Shaohua Li, open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1478865957-25252-1-git-send-email-tom.leiming@gmail.com>

Firstly we have mature bvec/bio iterator helper for iterate each
page in one bio, not necessary to reinvent a wheel to do that.

Secondly the coming multipage bvecs requires this patch.

Also add comments about the direct access to bvec table.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/dm-io.c | 34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 0bf1a12e35fe..2ef573c220fc 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -162,7 +162,10 @@ struct dpages {
 			 struct page **p, unsigned long *len, unsigned *offset);
 	void (*next_page)(struct dpages *dp);
 
-	unsigned context_u;
+	union {
+		unsigned context_u;
+		struct bvec_iter context_bi;
+	};
 	void *context_ptr;
 
 	void *vma_invalidate_address;
@@ -204,25 +207,36 @@ static void list_dp_init(struct dpages *dp, struct page_list *pl, unsigned offse
 static void bio_get_page(struct dpages *dp, struct page **p,
 			 unsigned long *len, unsigned *offset)
 {
-	struct bio_vec *bvec = dp->context_ptr;
-	*p = bvec->bv_page;
-	*len = bvec->bv_len - dp->context_u;
-	*offset = bvec->bv_offset + dp->context_u;
+	struct bio_vec bv = bvec_iter_bvec((struct bio_vec *)dp->context_ptr,
+			dp->context_bi);
+
+	*p = bv.bv_page;
+	*len = bv.bv_len;
+	*offset = bv.bv_offset;
+
+	/* avoid to figure out it in bio_next_page() again */
+	dp->context_bi.bi_sector = (sector_t)bv.bv_len;
 }
 
 static void bio_next_page(struct dpages *dp)
 {
-	struct bio_vec *bvec = dp->context_ptr;
-	dp->context_ptr = bvec + 1;
-	dp->context_u = 0;
+	unsigned int len = (unsigned int)dp->context_bi.bi_sector;
+
+	bvec_iter_advance((struct bio_vec *)dp->context_ptr,
+			&dp->context_bi, len);
 }
 
 static void bio_dp_init(struct dpages *dp, struct bio *bio)
 {
 	dp->get_page = bio_get_page;
 	dp->next_page = bio_next_page;
-	dp->context_ptr = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
-	dp->context_u = bio->bi_iter.bi_bvec_done;
+
+	/*
+	 * We just use bvec iterator to retrieve pages, so it is ok to
+	 * access the bvec table directly here
+	 */
+	dp->context_ptr = bio->bi_io_vec;
+	dp->context_bi = bio->bi_iter;
 }
 
 /*
-- 
2.7.4


^ permalink raw reply related

* [PATCH 06/12] dm: crypt: use bio_add_page()
From: Ming Lei @ 2016-11-11 12:05 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel
  Cc: linux-block, linux-fsdevel, Christoph Hellwig, Ming Lei,
	Alasdair Kergon, Mike Snitzer, maintainer:DEVICE-MAPPER LVM,
	Shaohua Li, open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1478865957-25252-1-git-send-email-tom.leiming@gmail.com>

We have the standard interface to add page to bio, so don't
do that in hacking way.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/dm-crypt.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 68a9eb4f3f36..e4f86bead9d3 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -994,7 +994,6 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
 	gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM;
 	unsigned i, len, remaining_size;
 	struct page *page;
-	struct bio_vec *bvec;
 
 retry:
 	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
@@ -1019,12 +1018,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
 
 		len = (remaining_size > PAGE_SIZE) ? PAGE_SIZE : remaining_size;
 
-		bvec = &clone->bi_io_vec[clone->bi_vcnt++];
-		bvec->bv_page = page;
-		bvec->bv_len = len;
-		bvec->bv_offset = 0;
-
-		clone->bi_iter.bi_size += len;
+		bio_add_page(clone, page, len, 0);
 
 		remaining_size -= len;
 	}
-- 
2.7.4

^ permalink raw reply related

* [PATCH 05/12] bcache: debug: avoid to access .bi_io_vec directly
From: Ming Lei @ 2016-11-11 12:05 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel
  Cc: linux-block, linux-fsdevel, Christoph Hellwig, Ming Lei,
	Kent Overstreet, Shaohua Li, Mike Christie, Guoqing Jiang,
	open list:BCACHE BLOCK LAYER CACHE,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1478865957-25252-1-git-send-email-tom.leiming@gmail.com>

Instead we use standard iterator way to do that.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/bcache/debug.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 1c9130ae0073..06f55056aaae 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -107,8 +107,8 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
 {
 	char name[BDEVNAME_SIZE];
 	struct bio *check;
-	struct bio_vec bv;
-	struct bvec_iter iter;
+	struct bio_vec bv, cbv;
+	struct bvec_iter iter, citer = { 0 };
 
 	check = bio_clone(bio, GFP_NOIO);
 	if (!check)
@@ -120,9 +120,13 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
 
 	submit_bio_wait(check);
 
+	citer.bi_size = UINT_MAX;
 	bio_for_each_segment(bv, bio, iter) {
 		void *p1 = kmap_atomic(bv.bv_page);
-		void *p2 = page_address(check->bi_io_vec[iter.bi_idx].bv_page);
+		void *p2;
+
+		cbv = bio_iter_iovec(check, citer);
+		p2 = page_address(cbv.bv_page);
 
 		cache_set_err_on(memcmp(p1 + bv.bv_offset,
 					p2 + bv.bv_offset,
@@ -133,6 +137,7 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
 				 (uint64_t) bio->bi_iter.bi_sector);
 
 		kunmap_atomic(p1);
+		bio_advance_iter(check, &citer, bv.bv_len);
 	}
 
 	bio_free_pages(check);
-- 
2.7.4

^ permalink raw reply related

* [PATCH 01/12] block: bio: pass bvec table to bio_init()
From: Ming Lei @ 2016-11-11 12:05 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel
  Cc: linux-block, linux-fsdevel, Christoph Hellwig, Ming Lei,
	Jens Axboe, Jiri Kosina, Kent Overstreet, Shaohua Li,
	Alasdair Kergon, Mike Snitzer, maintainer:DEVICE-MAPPER LVM,
	Christoph Hellwig, Sagi Grimberg, Joern Engel, Prasad Joshi,
	Mike Christie, Hannes Reinecke, Rasmus Villemoes,
	Johannes Thumshirn, Guoqing Jiang, Eric Wheeler
In-Reply-To: <1478865957-25252-1-git-send-email-tom.leiming@gmail.com>

Some drivers often use external bvec table, so introduce
this helper for this case. It is always safe to access the
bio->bi_io_vec in this way for this case.

After converting to this usage, it will becomes a bit easier
to evaluate the remaining direct access to bio->bi_io_vec,
so it can help to prepare for the following multipage bvec
support.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 block/bio.c                   |  8 ++++++--
 drivers/block/floppy.c        |  3 +--
 drivers/md/bcache/io.c        |  4 +---
 drivers/md/bcache/journal.c   |  4 +---
 drivers/md/bcache/movinggc.c  |  6 ++----
 drivers/md/bcache/request.c   |  2 +-
 drivers/md/bcache/super.c     | 12 +++---------
 drivers/md/bcache/writeback.c |  5 ++---
 drivers/md/dm-bufio.c         |  4 +---
 drivers/md/dm.c               |  2 +-
 drivers/md/multipath.c        |  2 +-
 drivers/md/raid5-cache.c      |  2 +-
 drivers/md/raid5.c            |  9 ++-------
 drivers/nvme/target/io-cmd.c  |  4 +---
 fs/logfs/dev_bdev.c           |  4 +---
 include/linux/bio.h           |  3 ++-
 16 files changed, 27 insertions(+), 47 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 2cf6ebabc68c..de257ced69b1 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -270,11 +270,15 @@ static void bio_free(struct bio *bio)
 	}
 }
 
-void bio_init(struct bio *bio)
+void bio_init(struct bio *bio, struct bio_vec *table,
+	      unsigned short max_vecs)
 {
 	memset(bio, 0, sizeof(*bio));
 	atomic_set(&bio->__bi_remaining, 1);
 	atomic_set(&bio->__bi_cnt, 1);
+
+	bio->bi_io_vec = table;
+	bio->bi_max_vecs = max_vecs;
 }
 EXPORT_SYMBOL(bio_init);
 
@@ -480,7 +484,7 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 		return NULL;
 
 	bio = p + front_pad;
-	bio_init(bio);
+	bio_init(bio, NULL, 0);
 
 	if (nr_iovecs > inline_vecs) {
 		unsigned long idx = 0;
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index e3d8e4ced4a2..6a3ff2b2e3ae 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -3806,8 +3806,7 @@ static int __floppy_read_block_0(struct block_device *bdev, int drive)
 
 	cbdata.drive = drive;
 
-	bio_init(&bio);
-	bio.bi_io_vec = &bio_vec;
+	bio_init(&bio, &bio_vec, 1);
 	bio_vec.bv_page = page;
 	bio_vec.bv_len = size;
 	bio_vec.bv_offset = 0;
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index e97b0acf7b8d..db45a88c0ce9 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -24,9 +24,7 @@ struct bio *bch_bbio_alloc(struct cache_set *c)
 	struct bbio *b = mempool_alloc(c->bio_meta, GFP_NOIO);
 	struct bio *bio = &b->bio;
 
-	bio_init(bio);
-	bio->bi_max_vecs	 = bucket_pages(c);
-	bio->bi_io_vec		 = bio->bi_inline_vecs;
+	bio_init(bio, bio->bi_inline_vecs, bucket_pages(c));
 
 	return bio;
 }
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 6925023e12d4..1198e53d5670 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -448,13 +448,11 @@ static void do_journal_discard(struct cache *ca)
 
 		atomic_set(&ja->discard_in_flight, DISCARD_IN_FLIGHT);
 
-		bio_init(bio);
+		bio_init(bio, bio->bi_inline_vecs, 1);
 		bio_set_op_attrs(bio, REQ_OP_DISCARD, 0);
 		bio->bi_iter.bi_sector	= bucket_to_sector(ca->set,
 						ca->sb.d[ja->discard_idx]);
 		bio->bi_bdev		= ca->bdev;
-		bio->bi_max_vecs	= 1;
-		bio->bi_io_vec		= bio->bi_inline_vecs;
 		bio->bi_iter.bi_size	= bucket_bytes(ca);
 		bio->bi_end_io		= journal_discard_endio;
 
diff --git a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c
index 5c4bddecfaf0..13b8a907006d 100644
--- a/drivers/md/bcache/movinggc.c
+++ b/drivers/md/bcache/movinggc.c
@@ -77,15 +77,13 @@ static void moving_init(struct moving_io *io)
 {
 	struct bio *bio = &io->bio.bio;
 
-	bio_init(bio);
+	bio_init(bio, bio->bi_inline_vecs,
+		 DIV_ROUND_UP(KEY_SIZE(&io->w->key), PAGE_SECTORS));
 	bio_get(bio);
 	bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
 
 	bio->bi_iter.bi_size	= KEY_SIZE(&io->w->key) << 9;
-	bio->bi_max_vecs	= DIV_ROUND_UP(KEY_SIZE(&io->w->key),
-					       PAGE_SECTORS);
 	bio->bi_private		= &io->cl;
-	bio->bi_io_vec		= bio->bi_inline_vecs;
 	bch_bio_map(bio, NULL);
 }
 
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 0d99b5f4b3e6..f49c5417527d 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -623,7 +623,7 @@ static void do_bio_hook(struct search *s, struct bio *orig_bio)
 {
 	struct bio *bio = &s->bio.bio;
 
-	bio_init(bio);
+	bio_init(bio, NULL, 0);
 	__bio_clone_fast(bio, orig_bio);
 	bio->bi_end_io		= request_endio;
 	bio->bi_private		= &s->cl;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 988edf928466..2fb5bfeb43e2 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1152,9 +1152,7 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
 	dc->bdev = bdev;
 	dc->bdev->bd_holder = dc;
 
-	bio_init(&dc->sb_bio);
-	dc->sb_bio.bi_max_vecs	= 1;
-	dc->sb_bio.bi_io_vec	= dc->sb_bio.bi_inline_vecs;
+	bio_init(&dc->sb_bio, dc->sb_bio.bi_inline_vecs, 1);
 	dc->sb_bio.bi_io_vec[0].bv_page = sb_page;
 	get_page(sb_page);
 
@@ -1814,9 +1812,7 @@ static int cache_alloc(struct cache *ca)
 	__module_get(THIS_MODULE);
 	kobject_init(&ca->kobj, &bch_cache_ktype);
 
-	bio_init(&ca->journal.bio);
-	ca->journal.bio.bi_max_vecs = 8;
-	ca->journal.bio.bi_io_vec = ca->journal.bio.bi_inline_vecs;
+	bio_init(&ca->journal.bio, ca->journal.bio.bi_inline_vecs, 8);
 
 	free = roundup_pow_of_two(ca->sb.nbuckets) >> 10;
 
@@ -1852,9 +1848,7 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
 	ca->bdev = bdev;
 	ca->bdev->bd_holder = ca;
 
-	bio_init(&ca->sb_bio);
-	ca->sb_bio.bi_max_vecs	= 1;
-	ca->sb_bio.bi_io_vec	= ca->sb_bio.bi_inline_vecs;
+	bio_init(&ca->sb_bio, ca->sb_bio.bi_inline_vecs, 1);
 	ca->sb_bio.bi_io_vec[0].bv_page = sb_page;
 	get_page(sb_page);
 
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index e51644e503a5..69e1ae59cab8 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -106,14 +106,13 @@ static void dirty_init(struct keybuf_key *w)
 	struct dirty_io *io = w->private;
 	struct bio *bio = &io->bio;
 
-	bio_init(bio);
+	bio_init(bio, bio->bi_inline_vecs,
+		 DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS));
 	if (!io->dc->writeback_percent)
 		bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
 
 	bio->bi_iter.bi_size	= KEY_SIZE(&w->key) << 9;
-	bio->bi_max_vecs	= DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS);
 	bio->bi_private		= w;
-	bio->bi_io_vec		= bio->bi_inline_vecs;
 	bch_bio_map(bio, NULL);
 }
 
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index b3ba142e59a4..262e75365cc0 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -611,9 +611,7 @@ static void use_inline_bio(struct dm_buffer *b, int rw, sector_t block,
 	char *ptr;
 	int len;
 
-	bio_init(&b->bio);
-	b->bio.bi_io_vec = b->bio_vec;
-	b->bio.bi_max_vecs = DM_BUFIO_INLINE_VECS;
+	bio_init(&b->bio, b->bio_vec, DM_BUFIO_INLINE_VECS);
 	b->bio.bi_iter.bi_sector = block << b->c->sectors_per_block_bits;
 	b->bio.bi_bdev = b->c->bdev;
 	b->bio.bi_end_io = inline_endio;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 1b8b21aba34b..ffa97b742a68 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1525,7 +1525,7 @@ static struct mapped_device *alloc_dev(int minor)
 	if (!md->bdev)
 		goto bad;
 
-	bio_init(&md->flush_bio);
+	bio_init(&md->flush_bio, NULL, 0);
 	md->flush_bio.bi_bdev = md->bdev;
 	md->flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
 
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 673efbd6fc47..4da06d813b8f 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -130,7 +130,7 @@ static void multipath_make_request(struct mddev *mddev, struct bio * bio)
 	}
 	multipath = conf->multipaths + mp_bh->path;
 
-	bio_init(&mp_bh->bio);
+	bio_init(&mp_bh->bio, NULL, 0);
 	__bio_clone_fast(&mp_bh->bio, bio);
 
 	mp_bh->bio.bi_iter.bi_sector += multipath->rdev->data_offset;
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 2bca090cd64e..8491edcfb5a6 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1205,7 +1205,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	INIT_LIST_HEAD(&log->io_end_ios);
 	INIT_LIST_HEAD(&log->flushing_ios);
 	INIT_LIST_HEAD(&log->finished_ios);
-	bio_init(&log->flush_bio);
+	bio_init(&log->flush_bio, NULL, 0);
 
 	log->io_kc = KMEM_CACHE(r5l_io_unit, 0);
 	if (!log->io_kc)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 70acdd379e44..5f9e28443c8a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2004,13 +2004,8 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
 		for (i = 0; i < disks; i++) {
 			struct r5dev *dev = &sh->dev[i];
 
-			bio_init(&dev->req);
-			dev->req.bi_io_vec = &dev->vec;
-			dev->req.bi_max_vecs = 1;
-
-			bio_init(&dev->rreq);
-			dev->rreq.bi_io_vec = &dev->rvec;
-			dev->rreq.bi_max_vecs = 1;
+			bio_init(&dev->req, &dev->vec, 1);
+			bio_init(&dev->rreq, &dev->rvec, 1);
 		}
 	}
 	return sh;
diff --git a/drivers/nvme/target/io-cmd.c b/drivers/nvme/target/io-cmd.c
index c2784cfc5e29..c58b67e61092 100644
--- a/drivers/nvme/target/io-cmd.c
+++ b/drivers/nvme/target/io-cmd.c
@@ -37,9 +37,7 @@ static void nvmet_inline_bio_init(struct nvmet_req *req)
 {
 	struct bio *bio = &req->inline_bio;
 
-	bio_init(bio);
-	bio->bi_max_vecs = NVMET_MAX_INLINE_BIOVEC;
-	bio->bi_io_vec = req->inline_bvec;
+	bio_init(bio, req->inline_bvec, NVMET_MAX_INLINE_BIOVEC);
 }
 
 static void nvmet_execute_rw(struct nvmet_req *req)
diff --git a/fs/logfs/dev_bdev.c b/fs/logfs/dev_bdev.c
index a8329cc47dec..dc8cafeee038 100644
--- a/fs/logfs/dev_bdev.c
+++ b/fs/logfs/dev_bdev.c
@@ -19,9 +19,7 @@ static int sync_request(struct page *page, struct block_device *bdev, int op)
 	struct bio bio;
 	struct bio_vec bio_vec;
 
-	bio_init(&bio);
-	bio.bi_max_vecs = 1;
-	bio.bi_io_vec = &bio_vec;
+	bio_init(&bio, &bio_vec, 1);
 	bio_vec.bv_page = page;
 	bio_vec.bv_len = PAGE_SIZE;
 	bio_vec.bv_offset = 0;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index d367cd37a7f7..70a7244f08a7 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -420,7 +420,8 @@ extern int bio_phys_segments(struct request_queue *, struct bio *);
 extern int submit_bio_wait(struct bio *bio);
 extern void bio_advance(struct bio *, unsigned);
 
-extern void bio_init(struct bio *);
+extern void bio_init(struct bio *bio, struct bio_vec *table,
+		     unsigned short max_vecs);
 extern void bio_reset(struct bio *);
 void bio_chain(struct bio *, struct bio *);
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH 00/12] block: cleanup direct access to bvec table
From: Ming Lei @ 2016-11-11 12:05 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Mike Snitzer, Ming Lei, Rasmus Villemoes,
	open list:NVM EXPRESS TARGET DRIVER, Keith Busch,
	open list:DEVICE-MAPPER  LVM, open list:TARGET SUBSYSTEM,
	Yijing Wang, open list:DRBD DRIVER, Hannes Reinecke,
	open list:TARGET SUBSYSTEM, Christoph Hellwig, Mike Christie,
	Guoqing Jiang, Johannes Thumshirn, Kent Overstreet, Coly Li,
	open list:SOFTWARE RAID Multiple Disks SUPPORT,
	open list:BCACHE BLOCK LAYER CACHE

Hi,

This patchset cleans up direct access to bvec table.

The 1st patch passes bvec table to bio_init(), so that
direct access to bvec table in bio initialization is avoided.

For other patches, most of them uses bio_add_page()
to replace hardcode style of adding page to bvec,
and others avoids to access bio->bi_vcnt.

The most special one is to use bvec iterator helpers
to implement .get_page/.next_page for dm-io.c

One big motivation is to prepare for supporting multipage
bvec, but this patchset is one good cleanup too even not
for that purpose.

Thanks,
Ming

Ming Lei (12):
  block: bio: pass bvec table to bio_init()
  block: drbd: remove impossible failure handling
  block: floppy: use bio_add_page()
  target: avoid to access .bi_vcnt directly
  bcache: debug: avoid to access .bi_io_vec directly
  dm: crypt: use bio_add_page()
  dm: use bvec iterator helpers to implement .get_page and .next_page
  dm: dm.c: replace 'bio->bi_vcnt == 1' with !bio_multiple_segments
  fs: logfs: convert to bio_add_page() in sync_request()
  fs: logfs: use bio_add_page() in __bdev_writeseg()
  fs: logfs: use bio_add_page() in do_erase()
  fs: logfs: remove unnecesary check

 block/bio.c                        |   8 ++-
 drivers/block/drbd/drbd_receiver.c |  14 +----
 drivers/block/floppy.c             |  10 ++--
 drivers/md/bcache/debug.c          |  11 ++--
 drivers/md/bcache/io.c             |   4 +-
 drivers/md/bcache/journal.c        |   4 +-
 drivers/md/bcache/movinggc.c       |   6 +--
 drivers/md/bcache/request.c        |   2 +-
 drivers/md/bcache/super.c          |  12 ++---
 drivers/md/bcache/writeback.c      |   5 +-
 drivers/md/dm-bufio.c              |   4 +-
 drivers/md/dm-crypt.c              |   8 +--
 drivers/md/dm-io.c                 |  34 ++++++++----
 drivers/md/dm-rq.c                 |   7 ++-
 drivers/md/dm.c                    |   2 +-
 drivers/md/multipath.c             |   2 +-
 drivers/md/raid5-cache.c           |   2 +-
 drivers/md/raid5.c                 |   9 +---
 drivers/nvme/target/io-cmd.c       |   4 +-
 drivers/target/target_core_pscsi.c |   8 +--
 fs/logfs/dev_bdev.c                | 106 +++++++++++++------------------------
 include/linux/bio.h                |   3 +-
 22 files changed, 107 insertions(+), 158 deletions(-)

-- 
2.7.4

^ permalink raw reply

* mdadm I/O error with Ddf RAID
From: Arka Sharma @ 2016-11-11  3:45 UTC (permalink / raw)
  To: linux-raid

Hi All,

We have developed a RAID creation application which create RAID with
Ddf RAID metadata. We are using PCIe ssd as physical disks. We are
writing the anchor, primary, secondary headers, virtual and physical
records, configuration record and physical disk data. The offsets of
the headers are updated in the primary, secondary and anchor headers
correctly. The problem is when we try to boot to Ubuntu server and we
observe that mdadm is throwing a disk failure error message and from
block layer we are getting rw=0, want=7, limit=1000215216. We also
confirmed using there is no I/O error is coming from the PCIe ssd,
using a logic analyzer. Also the limit value 1000215216 is the
capacity of the ssd in 512 byte blocks. Any insight will be highly
appreciated.

Regards,
Arka

^ permalink raw reply

* Re: Question on blocks periodic writes
From: NeilBrown @ 2016-11-11  1:52 UTC (permalink / raw)
  To: Wols Lists, Theophanis Kontogiannis, Linux RAID
In-Reply-To: <58249BD9.8000303@youngman.org.uk>

[-- Attachment #1: Type: text/plain, Size: 1070 bytes --]

On Fri, Nov 11 2016, Wols Lists wrote:

> On 10/11/16 02:00, NeilBrown wrote:
>>> [ 8664.858104] xfsaild/md1(658): WRITE block 0 on md1 (8 sectors)
>> This is XFS doing something.  md cannot possibly stop all IO while the
>> filesystem performs occasional IO.  If these continue, you need to
>> discuss with xfs developers how to stop it.  If the writes to individual
>> drives continue after there are no writes to 'md1', then it is worth
>> coming back here to ask.
>> 
>> 
> Would the new journal feature be any help?

Probably not, though until we know what is causing the writes, it is
hard to say.

>
> I haven't dug in enough to understand it properly, and it would increase
> the vulnerability of the system to a journal failure, but the feature
> itself seems almost perfect for batching writes and enabling the disks
> to spin down for extended periods.

You might be able to build functionality onto the journal which allows
the drives in the main array to stay idle for longer, but it doesn't try
to do that at present.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* [PATCH v6 11/11] md/r5cache: handle alloc_page failure
From: Song Liu @ 2016-11-10 20:46 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu
In-Reply-To: <20161110204623.3484694-1-songliubraving@fb.com>

RMW of r5c write back cache uses an extra page to store old data for
prexor. handle_stripe_dirtying() allocates this page by calling
alloc_page(). However, alloc_page() may fail.

To handle alloc_page() failures, this patch adds a small mempool
in r5l_log. When alloc_page fails, the code tries to allocate a
page from the mempool.

The mempool is small, so it maintains a waiting list so that only
R5C_EXTRA_PAGE_POOL_SIZE (2) stripes are using the mempool at a time.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 124 +++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c       |  62 ++++++++++++++++++------
 drivers/md/raid5.h       |   7 +++
 3 files changed, 177 insertions(+), 16 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index b83fd94..93c3caa 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -52,6 +52,8 @@
  */
 #define R5L_POOL_SIZE	4
 
+#define R5C_EXTRA_PAGE_POOL_SIZE	2
+
 /*
  * r5c journal modes of the array: write-back or write-through.
  * write-through mode has identical behavior as existing log only
@@ -163,6 +165,17 @@ struct r5l_log {
 
 	/* to submit async io_units, to fulfill ordering of flush */
 	struct work_struct deferred_io_work;
+
+	/* to handle alloc_page failures in handle_stripe_dirtying */
+	/* mempool for up to R5C_EXTRA_PAGE_POOL_SIZE stripes */
+	mempool_t *extra_page_pool;
+	/* stripes using the page pool */
+	struct stripe_head *extra_page_pool_stripes[R5C_EXTRA_PAGE_POOL_SIZE];
+	/* stripes waiting to use the page pool */
+	struct list_head extra_page_pool_waiting_list;
+	/* lock for extra_page_pool_stripes and extra_page_pool_waiting_list */
+	spinlock_t extra_page_pool_lock;
+
 };
 
 /*
@@ -253,6 +266,103 @@ static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
 	io->state = state;
 }
 
+/*
+ * Try allocate pages from extra_page_pool
+ * The mempool only supports R5C_EXTRA_PAGE_POOL_SIZE stripes. More stripes
+ * are added to extra_page_pool_waiting_list.
+ *
+ * If allocation succeeded, return pointer to the page;
+ * Otherwise, return NULL
+ */
+struct page *r5c_alloc_page_from_extra_page_pool(struct stripe_head *sh,
+						 struct stripe_head_state *s,
+						 int dd_idx)
+{
+	struct r5conf *conf = sh->raid_conf;
+	struct r5l_log *log = conf->log;
+	int i;
+	bool using_the_pool = false;
+	struct page *p;
+
+	BUG_ON(!r5c_is_writeback(log));
+
+	spin_lock(&log->extra_page_pool_lock);
+	/* check whether sh is already using the mempool */
+	for (i = 0; i < R5C_EXTRA_PAGE_POOL_SIZE; i++)
+		if (log->extra_page_pool_stripes[i] == sh) {
+			using_the_pool = true;
+			break;
+		}
+
+	if (!using_the_pool)
+		/* try add sh to extra_page_pool_stripes */
+		for (i = 0; i < R5C_EXTRA_PAGE_POOL_SIZE; i++)
+			if (log->extra_page_pool_stripes[i] == NULL) {
+				using_the_pool = true;
+				log->extra_page_pool_stripes[i] = sh;
+				break;
+			}
+	spin_unlock(&log->extra_page_pool_lock);
+	if (using_the_pool) {
+		/* this mempool alloc should never fail */
+		p = mempool_alloc(log->extra_page_pool, GFP_ATOMIC);
+		BUG_ON(!p);
+		set_bit(R5_R5CMemPool, &sh->dev[dd_idx].flags);
+		return p;
+	}
+
+	/* add sh to waiting list */
+	atomic_inc(&sh->count);
+	WARN_ON(!list_empty(&sh->log_list));
+	spin_lock(&log->extra_page_pool_lock);
+	list_add_tail(&sh->log_list, &log->extra_page_pool_waiting_list);
+	spin_unlock(&log->extra_page_pool_lock);
+	s->waiting_extra_page = 1;
+	return NULL;
+}
+
+static void r5c_run_extra_page_pool_waiting_list(struct r5l_log *log)
+{
+	struct stripe_head *sh;
+
+	assert_spin_locked(&log->extra_page_pool_lock);
+	while (!list_empty(&log->extra_page_pool_waiting_list)) {
+		sh = list_first_entry(&log->extra_page_pool_waiting_list,
+				      struct stripe_head, log_list);
+		list_del_init(&sh->log_list);
+		set_bit(STRIPE_HANDLE, &sh->state);
+		raid5_release_stripe(sh);
+	}
+}
+
+void r5c_stripe_finish_using_extra_page_pool(struct stripe_head *sh)
+{
+	struct r5conf *conf = sh->raid_conf;
+	struct r5l_log *log = conf->log;
+	int i;
+	struct page *p;
+
+	/* return pages to extra_page_pool */
+	for (i = sh->disks; i--; )
+		if (test_and_clear_bit(R5_R5CMemPool, &sh->dev[i].flags)) {
+			p = sh->dev[i].page;
+			sh->dev[i].page = sh->dev[i].orig_page;
+			mempool_free(p, log->extra_page_pool);
+		}
+
+	/* remove sh from extra_page_pool_stripes */
+	spin_lock(&log->extra_page_pool_lock);
+	for (i = 0; i < R5C_EXTRA_PAGE_POOL_SIZE; i++)
+		if (log->extra_page_pool_stripes[i] == sh) {
+			log->extra_page_pool_stripes[i] = NULL;
+			r5c_run_extra_page_pool_waiting_list(log);
+			spin_unlock(&log->extra_page_pool_lock);
+			return;
+		}
+	/* didn't find sh in extra_page_pool_stripes? it must be a bug */
+	BUG();
+}
+
 static void
 r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev,
 			      struct bio_list *return_bi)
@@ -2457,6 +2567,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 {
 	struct request_queue *q = bdev_get_queue(rdev->bdev);
 	struct r5l_log *log;
+	int i;
 
 	if (PAGE_SIZE != 4096)
 		return -EINVAL;
@@ -2511,6 +2622,16 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	if (!log->meta_pool)
 		goto out_mempool;
 
+	log->extra_page_pool = mempool_create_page_pool(
+		R5C_EXTRA_PAGE_POOL_SIZE *
+		(conf->raid_disks - conf->max_degraded), 0);
+	if (!log->extra_page_pool)
+		goto extra_page_pool;
+	for (i = 0; i < R5C_EXTRA_PAGE_POOL_SIZE; i++)
+		log->extra_page_pool_stripes[i] = NULL;
+	INIT_LIST_HEAD(&log->extra_page_pool_waiting_list);
+	spin_lock_init(&log->extra_page_pool_lock);
+
 	log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
 						 log->rdev->mddev, "reclaim");
 	if (!log->reclaim_thread)
@@ -2541,6 +2662,8 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 error:
 	md_unregister_thread(&log->reclaim_thread);
 reclaim_thread:
+	mempool_destroy(log->extra_page_pool);
+extra_page_pool:
 	mempool_destroy(log->meta_pool);
 out_mempool:
 	bioset_free(log->bs);
@@ -2556,6 +2679,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 void r5l_exit_log(struct r5l_log *log)
 {
 	md_unregister_thread(&log->reclaim_thread);
+	mempool_destroy(log->extra_page_pool);
 	mempool_destroy(log->meta_pool);
 	bioset_free(log->bs);
 	mempool_destroy(log->io_pool);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 1ed9632..4ff2c96 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -880,6 +880,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 
 	if (test_bit(STRIPE_R5C_WRITE_OUT, &sh->state)) {
 		/* writing out mode */
+		if (s->waiting_extra_page)
+			return;
 		if (r5l_write_stripe(conf->log, sh) == 0)
 			return;
 	} else {
@@ -1533,6 +1535,7 @@ static void ops_complete_prexor(void *stripe_head_ref)
 {
 	struct stripe_head *sh = stripe_head_ref;
 	int i;
+	bool using_extra_page_pool = false;
 
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
@@ -1546,11 +1549,18 @@ static void ops_complete_prexor(void *stripe_head_ref)
 	 */
 	for (i = sh->disks; i--; )
 		if (sh->dev[i].page != sh->dev[i].orig_page) {
-			struct page *p = sh->dev[i].page;
+			struct page *p;
 
-			sh->dev[i].page = sh->dev[i].orig_page;
-			put_page(p);
+			if (test_bit(R5_R5CMemPool, &sh->dev[i].flags))
+				using_extra_page_pool = true;
+			else {
+				p = sh->dev[i].page;
+				sh->dev[i].page = sh->dev[i].orig_page;
+				put_page(p);
+			}
 		}
+	if (using_extra_page_pool)
+		r5c_stripe_finish_using_extra_page_pool(sh);
 }
 
 static struct dma_async_tx_descriptor *
@@ -2012,6 +2022,7 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
 		INIT_LIST_HEAD(&sh->batch_list);
 		INIT_LIST_HEAD(&sh->lru);
 		INIT_LIST_HEAD(&sh->r5c);
+		INIT_LIST_HEAD(&sh->log_list);
 		atomic_set(&sh->count, 1);
 		sh->log_start = MaxSector;
 		for (i = 0; i < disks; i++) {
@@ -2878,6 +2889,7 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 	int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx, disks = sh->disks;
 	struct r5conf *conf = sh->raid_conf;
 	int level = conf->level;
+	bool using_extra_page_pool = false;
 
 	if (rcw) {
 
@@ -2892,10 +2904,15 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 			 * ops_complete_prexor().
 			 */
 			if (sh->dev[i].page != sh->dev[i].orig_page) {
-				struct page *p = sh->dev[i].page;
+				struct page *p;
 
-				sh->dev[i].page = sh->dev[i].orig_page;
-				put_page(p);
+				if (test_bit(R5_R5CMemPool, &sh->dev[i].flags))
+					using_extra_page_pool = true;
+				else {
+					p = sh->dev[i].page;
+					sh->dev[i].page = sh->dev[i].orig_page;
+					put_page(p);
+				}
 			}
 
 			if (dev->towrite) {
@@ -2909,6 +2926,9 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 				s->locked++;
 			}
 		}
+		if (using_extra_page_pool)
+			r5c_stripe_finish_using_extra_page_pool(sh);
+
 		/* if we are not expanding this is a proper write request, and
 		 * there will be bios with new data to be drained into the
 		 * stripe cache
@@ -3592,16 +3612,16 @@ static void handle_stripe_clean_event(struct r5conf *conf,
 		break_stripe_batch_list(head_sh, STRIPE_EXPAND_SYNC_FLAGS);
 }
 
-static void handle_stripe_dirtying(struct r5conf *conf,
-				   struct stripe_head *sh,
-				   struct stripe_head_state *s,
-				   int disks)
+static int handle_stripe_dirtying(struct r5conf *conf,
+				  struct stripe_head *sh,
+				  struct stripe_head_state *s,
+				  int disks)
 {
 	int rmw = 0, rcw = 0, i;
 	sector_t recovery_cp = conf->mddev->recovery_cp;
 
 	if (r5c_handle_stripe_dirtying(conf, sh, s, disks) == 0)
-		return;
+		return 0;
 
 	/* Check whether resync is now happening or should start.
 	 * If yes, then the array is dirty (after unclean shutdown or
@@ -3662,13 +3682,21 @@ static void handle_stripe_dirtying(struct r5conf *conf,
 			struct r5dev *dev = &sh->dev[i];
 			if (test_bit(R5_InJournal, &dev->flags) &&
 			    dev->page == dev->orig_page) {
-				/* alloc page for prexor */
-				dev->page = alloc_page(GFP_NOIO);
+				struct page *p;
 
-				/* will handle failure in a later patch*/
-				BUG_ON(!dev->page);
+				/* alloc page for prexor */
+				p = alloc_page(GFP_NOIO);
+				if (!p) {
+					p = r5c_alloc_page_from_extra_page_pool(sh, s, i);
+					if (!p)  /* added to waiting list, try again later */
+						return -EAGAIN;
+				}
+				dev->page = p;
 			}
+		}
 
+		for (i = disks; i--; ) {
+			struct r5dev *dev = &sh->dev[i];
 			if ((dev->towrite ||
 			     i == sh->pd_idx || i == sh->qd_idx ||
 			     test_bit(R5_InJournal, &dev->flags)) &&
@@ -3744,6 +3772,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
 	    (s->locked == 0 && (rcw == 0 || rmw == 0) &&
 	     !test_bit(STRIPE_BIT_DELAY, &sh->state)))
 		schedule_reconstruction(sh, s, rcw == 0, 0);
+	return 0;
 }
 
 static void handle_parity_checks5(struct r5conf *conf, struct stripe_head *sh,
@@ -4535,7 +4564,8 @@ static void handle_stripe(struct stripe_head *sh)
 	 */
 	if ((s.to_write || test_bit(STRIPE_R5C_WRITE_OUT, &sh->state)) &&
 	    !sh->reconstruct_state && !sh->check_state && !sh->log_io)
-		handle_stripe_dirtying(conf, sh, &s, disks);
+		if (handle_stripe_dirtying(conf, sh, &s, disks) == -EAGAIN)
+			goto finish;
 
 	/* maybe we need to check and possibly fix the parity for this stripe
 	 * Any reads will already have been scheduled, so we just see if enough
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 4261c3d..2ff62c1 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -276,6 +276,7 @@ struct stripe_head_state {
 	struct md_rdev *blocked_rdev;
 	int handle_bad_blocks;
 	int log_failed;
+	int waiting_extra_page;
 };
 
 /* Flags for struct r5dev.flags */
@@ -317,6 +318,9 @@ enum r5dev_flags {
 	R5_Discard,	/* Discard the stripe */
 	R5_SkipCopy,	/* Don't copy data from bio to stripe cache */
 	R5_InJournal,	/* data being written is in the journal device */
+	R5_R5CMemPool,	/* because of alloc_page failure, data page of this
+			 * dev is allocated from r5l_log.extra_page_pool
+			 */
 };
 
 /*
@@ -755,5 +759,8 @@ extern void r5c_make_stripe_write_out(struct stripe_head *sh);
 extern void r5c_flush_cache(struct r5conf *conf, int num);
 extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
 extern void r5c_check_cached_full_stripe(struct r5conf *conf);
+extern struct page *r5c_alloc_page_from_extra_page_pool(
+	struct stripe_head *sh, struct stripe_head_state *s, int dd_idx);
+extern void r5c_stripe_finish_using_extra_page_pool(struct stripe_head *sh);
 extern struct md_sysfs_entry r5c_journal_mode;
 #endif
-- 
2.9.3


^ permalink raw reply related

* [PATCH v6 10/11] md/r5cache: handle SYNC and FUA
From: Song Liu @ 2016-11-10 20:46 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu
In-Reply-To: <20161110204623.3484694-1-songliubraving@fb.com>

With raid5 cache, we committing data from journal device. When
there is flush request, we need to flush journal device's cache.
This was not needed in raid5 journal, because we will flush the
journal before committing data to raid disks.

This is similar to FUA, except that we also need flush journal for
FUA. Otherwise, corruptions in earlier meta data will stop recovery
from reaching FUA data.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 162 +++++++++++++++++++++++++++++++++++++++++------
 drivers/md/raid5.c       |   8 +++
 drivers/md/raid5.h       |   1 +
 3 files changed, 153 insertions(+), 18 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index c6b6840..b83fd94 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -19,6 +19,7 @@
 #include <linux/raid/md_p.h>
 #include <linux/crc32c.h>
 #include <linux/random.h>
+#include <trace/events/block.h>
 #include "md.h"
 #include "raid5.h"
 #include "bitmap.h"
@@ -159,6 +160,9 @@ struct r5l_log {
 
 	spinlock_t stripe_in_cache_lock;
 	atomic_t stripe_in_cache_count;
+
+	/* to submit async io_units, to fulfill ordering of flush */
+	struct work_struct deferred_io_work;
 };
 
 /*
@@ -185,6 +189,18 @@ struct r5l_io_unit {
 
 	int state;
 	bool need_split_bio;
+	struct bio *split_bio;
+
+	unsigned int has_flush:1;      /* include flush request */
+	unsigned int has_fua:1;        /* include fua request */
+	unsigned int has_null_flush:1; /* include empty flush request */
+	/*
+	 * io isn't sent yet, flush/fua request can only be submitted till it's
+	 * the first IO in running_ios list
+	 */
+	unsigned int io_deferred:1;
+
+	struct bio_list flush_barriers;   /* size == 0 flush bios */
 };
 
 /* r5l_io_unit state */
@@ -498,9 +514,11 @@ static void r5l_move_to_end_ios(struct r5l_log *log)
 	}
 }
 
+static void __r5l_stripe_write_finished(struct r5l_io_unit *io);
 static void r5l_log_endio(struct bio *bio)
 {
 	struct r5l_io_unit *io = bio->bi_private;
+	struct r5l_io_unit *io_deferred;
 	struct r5l_log *log = io->log;
 	unsigned long flags;
 
@@ -516,18 +534,89 @@ static void r5l_log_endio(struct bio *bio)
 		r5l_move_to_end_ios(log);
 	else
 		r5l_log_run_stripes(log);
+	if (!list_empty(&log->running_ios)) {
+		/*
+		 * FLUSH/FUA io_unit is deferred because of ordering, now we
+		 * can dispatch it
+		 */
+		io_deferred = list_first_entry(&log->running_ios,
+					       struct r5l_io_unit, log_sibling);
+		if (io_deferred->io_deferred)
+			schedule_work(&log->deferred_io_work);
+	}
+
 	spin_unlock_irqrestore(&log->io_list_lock, flags);
 
 	if (log->need_cache_flush)
 		md_wakeup_thread(log->rdev->mddev->thread);
+
+	if (io->has_null_flush) {
+		struct bio *bi;
+
+		WARN_ON(bio_list_empty(&io->flush_barriers));
+		while ((bi = bio_list_pop(&io->flush_barriers)) != NULL) {
+			bio_endio(bi);
+			atomic_dec(&io->pending_stripe);
+		}
+		if (atomic_read(&io->pending_stripe) == 0)
+			__r5l_stripe_write_finished(io);
+	}
+}
+
+static void r5l_do_submit_io(struct r5l_log *log, struct r5l_io_unit *io)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&log->io_list_lock, flags);
+	__r5l_set_io_unit_state(io, IO_UNIT_IO_START);
+	spin_unlock_irqrestore(&log->io_list_lock, flags);
+
+	if (io->has_flush)
+		bio_set_op_attrs(io->current_bio, REQ_OP_WRITE, WRITE_FLUSH);
+	if (io->has_fua)
+		bio_set_op_attrs(io->current_bio, REQ_OP_WRITE, WRITE_FUA);
+	submit_bio(io->current_bio);
+
+	if (!io->split_bio)
+		return;
+
+	if (io->has_flush)
+		bio_set_op_attrs(io->split_bio, REQ_OP_WRITE, WRITE_FLUSH);
+	if (io->has_fua)
+		bio_set_op_attrs(io->split_bio, REQ_OP_WRITE, WRITE_FUA);
+	submit_bio(io->split_bio);
+}
+
+/* deferred io_unit will be dispatched here */
+static void r5l_submit_io_async(struct work_struct *work)
+{
+	struct r5l_log *log = container_of(work, struct r5l_log,
+					   deferred_io_work);
+	struct r5l_io_unit *io = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&log->io_list_lock, flags);
+	if (!list_empty(&log->running_ios)) {
+		io = list_first_entry(&log->running_ios, struct r5l_io_unit,
+				      log_sibling);
+		if (!io->io_deferred)
+			io = NULL;
+		else
+			io->io_deferred = 0;
+	}
+	spin_unlock_irqrestore(&log->io_list_lock, flags);
+	if (io)
+		r5l_do_submit_io(log, io);
 }
 
 static void r5l_submit_current_io(struct r5l_log *log)
 {
 	struct r5l_io_unit *io = log->current_io;
+	struct bio *bio;
 	struct r5l_meta_block *block;
 	unsigned long flags;
 	u32 crc;
+	bool do_submit = true;
 
 	if (!io)
 		return;
@@ -536,13 +625,20 @@ static void r5l_submit_current_io(struct r5l_log *log)
 	block->meta_size = cpu_to_le32(io->meta_offset);
 	crc = crc32c_le(log->uuid_checksum, block, PAGE_SIZE);
 	block->checksum = cpu_to_le32(crc);
+	bio = io->current_bio;
 
 	log->current_io = NULL;
 	spin_lock_irqsave(&log->io_list_lock, flags);
-	__r5l_set_io_unit_state(io, IO_UNIT_IO_START);
+	if (io->has_flush || io->has_fua) {
+		if (io != list_first_entry(&log->running_ios,
+					   struct r5l_io_unit, log_sibling)) {
+			io->io_deferred = 1;
+			do_submit = false;
+		}
+	}
 	spin_unlock_irqrestore(&log->io_list_lock, flags);
-
-	submit_bio(io->current_bio);
+	if (do_submit)
+		r5l_do_submit_io(log, io);
 }
 
 static struct bio *r5l_bio_alloc(struct r5l_log *log)
@@ -587,6 +683,7 @@ static struct r5l_io_unit *r5l_new_meta(struct r5l_log *log)
 	io->log = log;
 	INIT_LIST_HEAD(&io->log_sibling);
 	INIT_LIST_HEAD(&io->stripe_list);
+	bio_list_init(&io->flush_barriers);
 	io->state = IO_UNIT_RUNNING;
 
 	io->meta_page = mempool_alloc(log->meta_pool, GFP_NOIO);
@@ -657,12 +754,11 @@ static void r5l_append_payload_page(struct r5l_log *log, struct page *page)
 	struct r5l_io_unit *io = log->current_io;
 
 	if (io->need_split_bio) {
-		struct bio *prev = io->current_bio;
-
+		BUG_ON(io->split_bio);
+		io->split_bio = io->current_bio;
 		io->current_bio = r5l_bio_alloc(log);
-		bio_chain(io->current_bio, prev);
-
-		submit_bio(prev);
+		bio_chain(io->current_bio, io->split_bio);
+		io->need_split_bio = false;
 	}
 
 	if (!bio_add_page(io->current_bio, page, PAGE_SIZE, 0))
@@ -691,12 +787,23 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
 
 	io = log->current_io;
 
+	if (test_and_clear_bit(STRIPE_R5C_PREFLUSH, &sh->state))
+		io->has_flush = 1;
+
 	for (i = 0; i < sh->disks; i++) {
 		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags) ||
 		    test_bit(R5_InJournal, &sh->dev[i].flags))
 			continue;
 		if (i == sh->pd_idx || i == sh->qd_idx)
 			continue;
+		if (test_bit(R5_WantFUA, &sh->dev[i].flags)) {
+			io->has_fua = 1;
+			/*
+			 * we need to flush journal to make sure recovery can
+			 * reach the data with fua flag
+			 */
+			io->has_flush = 1;
+		}
 		r5l_append_payload_meta(log, R5LOG_PAYLOAD_DATA,
 					raid5_compute_blocknr(sh, i, 0),
 					sh->dev[i].log_checksum, 0, false);
@@ -861,17 +968,34 @@ int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio)
 {
 	if (!log)
 		return -ENODEV;
-	/*
-	 * we flush log disk cache first, then write stripe data to raid disks.
-	 * So if bio is finished, the log disk cache is flushed already. The
-	 * recovery guarantees we can recovery the bio from log disk, so we
-	 * don't need to flush again
-	 */
-	if (bio->bi_iter.bi_size == 0) {
-		bio_endio(bio);
-		return 0;
+
+	if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH) {
+		/*
+		 * in write through (journal only)
+		 * we flush log disk cache first, then write stripe data to
+		 * raid disks. So if bio is finished, the log disk cache is
+		 * flushed already. The recovery guarantees we can recovery
+		 * the bio from log disk, so we don't need to flush again
+		 */
+		if (bio->bi_iter.bi_size == 0) {
+			bio_endio(bio);
+			return 0;
+		}
+		bio->bi_opf &= ~REQ_PREFLUSH;
+	} else {
+		/* write back (with cache) */
+		if (bio->bi_iter.bi_size == 0) {
+			mutex_lock(&log->io_mutex);
+			r5l_get_meta(log, 0);
+			bio_list_add(&log->current_io->flush_barriers, bio);
+			log->current_io->has_flush = 1;
+			log->current_io->has_null_flush = 1;
+			atomic_inc(&log->current_io->pending_stripe);
+			r5l_submit_current_io(log);
+			mutex_unlock(&log->io_mutex);
+			return 0;
+		}
 	}
-	bio->bi_opf &= ~REQ_PREFLUSH;
 	return -EAGAIN;
 }
 
@@ -2400,7 +2524,9 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	INIT_LIST_HEAD(&log->no_space_stripes);
 	spin_lock_init(&log->no_space_stripes_lock);
 
+	INIT_WORK(&log->deferred_io_work, r5l_submit_io_async);
 	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
+
 	INIT_LIST_HEAD(&log->stripe_in_cache_list);
 	spin_lock_init(&log->stripe_in_cache_lock);
 	atomic_set(&log->stripe_in_cache_count, 0);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3ac3172..1ed9632 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5235,6 +5235,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	int remaining;
 	DEFINE_WAIT(w);
 	bool do_prepare;
+	bool do_flush = false;
 
 	if (unlikely(bi->bi_opf & REQ_PREFLUSH)) {
 		int ret = r5l_handle_flush_request(conf->log, bi);
@@ -5246,6 +5247,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 			return;
 		}
 		/* ret == -EAGAIN, fallback */
+		do_flush = true;
 	}
 
 	md_write_start(mddev, bi);
@@ -5385,6 +5387,12 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 				do_prepare = true;
 				goto retry;
 			}
+			if (do_flush) {
+				set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
+				/* we only need flush for one stripe */
+				do_flush = false;
+			}
+
 			set_bit(STRIPE_HANDLE, &sh->state);
 			clear_bit(STRIPE_DELAYED, &sh->state);
 			if ((!sh->batch_head || sh == sh->batch_head) &&
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 801722a..4261c3d 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -359,6 +359,7 @@ enum {
 	STRIPE_R5C_FULL_STRIPE,	/* in r5c cache (to-be/being handled or
 				 * in conf->r5c_full_stripe_list)
 				 */
+	STRIPE_R5C_PREFLUSH,	/* need to flush journal device */
 };
 
 #define STRIPE_EXPAND_SYNC_FLAGS \
-- 
2.9.3


^ permalink raw reply related

* [PATCH v6 09/11] md/r5cache: r5cache recovery: part 2
From: Song Liu @ 2016-11-10 20:46 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu
In-Reply-To: <20161110204623.3484694-1-songliubraving@fb.com>

1. In previous patch, we:
      - add new data to r5l_recovery_ctx
      - add new functions to recovery write-back cache
   The new functions are not used in this patch, so this patch does not
   change the behavior of recovery.

2. In this patchpatch, we:
      - modify main recovery procedure r5l_recovery_log() to call new
        functions
      - remove old functions

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 198 ++++++-----------------------------------------
 drivers/md/raid5.c       |   3 +-
 2 files changed, 26 insertions(+), 175 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 44af32d..c6b6840 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1390,144 +1390,6 @@ static int r5l_recovery_read_meta_block(struct r5l_log *log,
 	return 0;
 }
 
-static int r5l_recovery_flush_one_stripe(struct r5l_log *log,
-					 struct r5l_recovery_ctx *ctx,
-					 sector_t stripe_sect,
-					 int *offset)
-{
-	struct r5conf *conf = log->rdev->mddev->private;
-	struct stripe_head *sh;
-	struct r5l_payload_data_parity *payload;
-	int disk_index;
-
-	sh = raid5_get_active_stripe(conf, stripe_sect, 0, 0, 0);
-	while (1) {
-		sector_t log_offset = r5l_ring_add(log, ctx->pos,
-				ctx->meta_total_blocks);
-		payload = page_address(ctx->meta_page) + *offset;
-
-		if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_DATA) {
-			raid5_compute_sector(conf,
-					     le64_to_cpu(payload->location), 0,
-					     &disk_index, sh);
-
-			sync_page_io(log->rdev, log_offset, PAGE_SIZE,
-				     sh->dev[disk_index].page, REQ_OP_READ, 0,
-				     false);
-			sh->dev[disk_index].log_checksum =
-				le32_to_cpu(payload->checksum[0]);
-			set_bit(R5_Wantwrite, &sh->dev[disk_index].flags);
-		} else {
-			disk_index = sh->pd_idx;
-			sync_page_io(log->rdev, log_offset, PAGE_SIZE,
-				     sh->dev[disk_index].page, REQ_OP_READ, 0,
-				     false);
-			sh->dev[disk_index].log_checksum =
-				le32_to_cpu(payload->checksum[0]);
-			set_bit(R5_Wantwrite, &sh->dev[disk_index].flags);
-
-			if (sh->qd_idx >= 0) {
-				disk_index = sh->qd_idx;
-				sync_page_io(log->rdev,
-					     r5l_ring_add(log, log_offset, BLOCK_SECTORS),
-					     PAGE_SIZE, sh->dev[disk_index].page,
-					     REQ_OP_READ, 0, false);
-				sh->dev[disk_index].log_checksum =
-					le32_to_cpu(payload->checksum[1]);
-				set_bit(R5_Wantwrite,
-					&sh->dev[disk_index].flags);
-			}
-		}
-
-		ctx->meta_total_blocks += le32_to_cpu(payload->size);
-		*offset += sizeof(struct r5l_payload_data_parity) +
-			sizeof(__le32) *
-			(le32_to_cpu(payload->size) >> (PAGE_SHIFT - 9));
-		if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_PARITY)
-			break;
-	}
-
-	for (disk_index = 0; disk_index < sh->disks; disk_index++) {
-		void *addr;
-		u32 checksum;
-
-		if (!test_bit(R5_Wantwrite, &sh->dev[disk_index].flags))
-			continue;
-		addr = kmap_atomic(sh->dev[disk_index].page);
-		checksum = crc32c_le(log->uuid_checksum, addr, PAGE_SIZE);
-		kunmap_atomic(addr);
-		if (checksum != sh->dev[disk_index].log_checksum)
-			goto error;
-	}
-
-	for (disk_index = 0; disk_index < sh->disks; disk_index++) {
-		struct md_rdev *rdev, *rrdev;
-
-		if (!test_and_clear_bit(R5_Wantwrite,
-					&sh->dev[disk_index].flags))
-			continue;
-
-		/* in case device is broken */
-		rdev = rcu_dereference(conf->disks[disk_index].rdev);
-		if (rdev)
-			sync_page_io(rdev, stripe_sect, PAGE_SIZE,
-				     sh->dev[disk_index].page, REQ_OP_WRITE, 0,
-				     false);
-		rrdev = rcu_dereference(conf->disks[disk_index].replacement);
-		if (rrdev)
-			sync_page_io(rrdev, stripe_sect, PAGE_SIZE,
-				     sh->dev[disk_index].page, REQ_OP_WRITE, 0,
-				     false);
-	}
-	raid5_release_stripe(sh);
-	return 0;
-
-error:
-	for (disk_index = 0; disk_index < sh->disks; disk_index++)
-		sh->dev[disk_index].flags = 0;
-	raid5_release_stripe(sh);
-	return -EINVAL;
-}
-
-static int r5l_recovery_flush_one_meta(struct r5l_log *log,
-				       struct r5l_recovery_ctx *ctx)
-{
-	struct r5conf *conf = log->rdev->mddev->private;
-	struct r5l_payload_data_parity *payload;
-	struct r5l_meta_block *mb;
-	int offset;
-	sector_t stripe_sector;
-
-	mb = page_address(ctx->meta_page);
-	offset = sizeof(struct r5l_meta_block);
-
-	while (offset < le32_to_cpu(mb->meta_size)) {
-		int dd;
-
-		payload = (void *)mb + offset;
-		stripe_sector = raid5_compute_sector(conf,
-						     le64_to_cpu(payload->location), 0, &dd, NULL);
-		if (r5l_recovery_flush_one_stripe(log, ctx, stripe_sector,
-						  &offset))
-			return -EINVAL;
-	}
-	return 0;
-}
-
-/* copy data/parity from log to raid disks */
-static void r5l_recovery_flush_log(struct r5l_log *log,
-				   struct r5l_recovery_ctx *ctx)
-{
-	while (1) {
-		if (r5l_recovery_read_meta_block(log, ctx))
-			return;
-		if (r5l_recovery_flush_one_meta(log, ctx))
-			return;
-		ctx->seq++;
-		ctx->pos = r5l_ring_add(log, ctx->pos, ctx->meta_total_blocks);
-	}
-}
-
 static void
 r5l_recovery_create_empty_meta_block(struct r5l_log *log,
 				     struct page *page,
@@ -2139,7 +2001,9 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 
 static int r5l_recovery_log(struct r5l_log *log)
 {
+	struct mddev *mddev = log->rdev->mddev;
 	struct r5l_recovery_ctx ctx;
+	int ret;
 
 	ctx.pos = log->last_checkpoint;
 	ctx.seq = log->last_cp_seq;
@@ -2151,47 +2015,33 @@ static int r5l_recovery_log(struct r5l_log *log)
 	if (!ctx.meta_page)
 		return -ENOMEM;
 
-	r5l_recovery_flush_log(log, &ctx);
+	ret = r5c_recovery_flush_log(log, &ctx);
 	__free_page(ctx.meta_page);
 
-	/*
-	 * we did a recovery. Now ctx.pos points to an invalid meta block. New
-	 * log will start here. but we can't let superblock point to last valid
-	 * meta block. The log might looks like:
-	 * | meta 1| meta 2| meta 3|
-	 * meta 1 is valid, meta 2 is invalid. meta 3 could be valid. If
-	 * superblock points to meta 1, we write a new valid meta 2n.  if crash
-	 * happens again, new recovery will start from meta 1. Since meta 2n is
-	 * valid now, recovery will think meta 3 is valid, which is wrong.
-	 * The solution is we create a new meta in meta2 with its seq == meta
-	 * 1's seq + 10 and let superblock points to meta2. The same recovery will
-	 * not think meta 3 is a valid meta, because its seq doesn't match
-	 */
-	if (ctx.seq > log->last_cp_seq) {
-		int ret;
-
-		ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10);
-		if (ret)
-			return ret;
-		log->seq = ctx.seq + 11;
-		log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
-		r5l_write_super(log, ctx.pos);
-		log->last_checkpoint = ctx.pos;
-		log->next_checkpoint = ctx.pos;
-	} else {
-		log->log_start = ctx.pos;
-		log->seq = ctx.seq;
-	}
+	if (ret)
+		return ret;
 
-	/*
-	 * This is to suppress "function defined but not used" warning.
-	 * It will be removed when the two functions are used (next patch).
-	 */
-	if (!log) {
-		r5c_recovery_flush_log(log, &ctx);
-		r5c_recovery_rewrite_data_only_stripes(log, &ctx);
+	if ((ctx.data_only_stripes == 0) && (ctx.data_parity_stripes == 0))
+		pr_debug("md/raid:%s: starting from clean shutdown\n",
+			 mdname(mddev));
+	else {
+		pr_debug("md/raid:%s: recoverying %d data-only stripes and %d data-parity stripes\n",
+			 mdname(mddev), ctx.data_only_stripes,
+			 ctx.data_parity_stripes);
+
+		if (ctx.data_only_stripes > 0)
+			if (r5c_recovery_rewrite_data_only_stripes(log, &ctx)) {
+				pr_err("md/raid:%s: failed to rewrite stripes to journal\n",
+				       mdname(mddev));
+				return -EIO;
+			}
 	}
 
+	log->log_start = ctx.pos;
+	log->next_checkpoint = ctx.pos;
+	log->seq = ctx.seq;
+	r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq);
+	r5l_write_super(log, ctx.pos);
 	return 0;
 }
 
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e8dace5..3ac3172 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7087,7 +7087,8 @@ static int raid5_run(struct mddev *mddev)
 
 		pr_debug("md/raid:%s: using device %s as journal\n",
 			 mdname(mddev), bdevname(journal_dev->bdev, b));
-		r5l_init_log(conf, journal_dev);
+		if (r5l_init_log(conf, journal_dev))
+			goto abort;
 	}
 
 	return 0;
-- 
2.9.3


^ permalink raw reply related

* [PATCH v6 08/11] md/r5cache: r5cache recovery: part 1
From: Song Liu @ 2016-11-10 20:46 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu
In-Reply-To: <20161110204623.3484694-1-songliubraving@fb.com>

Recovery of write-back cache has different logic to write-through only
cache. Specifically, for write-back cache, the recovery need to scan
through all active journal entries before flushing data out. Therefore,
large portion of the recovery logic is rewritten here.

To make the diffs cleaner, we split the rewrite as follows:

1. In this patch, we:
      - add new data to r5l_recovery_ctx
      - add new functions to recovery write-back cache
   The new functions are not used in this patch, so this patch does not
   change the behavior of recovery.

2. In next patch, we:
      - modify main recovery procedure r5l_recovery_log() to call new
        functions
      - remove old functions

With cache feature, there are 2 different scenarios of recovery:
1. Data-Parity stripe: a stripe with complete parity in journal.
2. Data-Only stripe: a stripe with only data in journal (or partial
   parity).

The code differentiate Data-Parity stripe from Data-Only stripe with
flag STRIPE_R5C_WRITE_OUT.

For Data-Parity stripes, we use the same procedure as raid5 journal,
where all the data and parity are replayed to the RAID devices.

For Data-Only strips, we need to finish complete calculate parity and
finish the full reconstruct write or RMW write. For simplicity, in
the recovery, we load the stripe to stripe cache. Once the array is
started, the stripe cache state machine will handle these stripes
through normal write path.

r5c_recovery_flush_log contains the main procedure of recovery. The
recovery code first scans through the journal and loads data to
stripe cache. The code keeps tracks of all these stripes in a list
(use sh->lru and ctx->cached_list), stripes in the list are
organized in the order of its first appearance on the journal.
During the scan, the recovery code assesses each stripe as
Data-Parity or Data-Only.

During scan, the array may run out of stripe cache. In these cases,
the recovery code will also call raid5_set_cache_size to increase
stripe cache size. If the array still runs out of stripe cache
because there isn't enough memory, the array will not assemble.

At the end of scan, the recovery code replays all Data-Parity
stripes, and sets proper states for Data-Only stripes. The recovery
code also increases seq number by 10 and rewrites all Data-Only
stripes to journal. This is to avoid confusion after repeated
crashes. More details is explained in raid5-cache.c before
r5c_recovery_rewrite_data_only_stripes().

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 590 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 590 insertions(+)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 1058e81..44af32d 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1,5 +1,6 @@
 /*
  * Copyright (C) 2015 Shaohua Li <shli@fb.com>
+ * Copyright (C) 2016 Song Liu <songliubraving@fb.com>
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms and conditions of the GNU General Public License,
@@ -1351,6 +1352,9 @@ struct r5l_recovery_ctx {
 	sector_t meta_total_blocks;	/* total size of current meta and data */
 	sector_t pos;			/* recovery position */
 	u64 seq;			/* recovery position seq */
+	int data_parity_stripes;	/* number of data_parity stripes */
+	int data_only_stripes;		/* number of data_only stripes */
+	struct list_head cached_list;
 };
 
 static int r5l_recovery_read_meta_block(struct r5l_log *log,
@@ -1561,6 +1565,578 @@ static int r5l_log_write_empty_meta_block(struct r5l_log *log, sector_t pos,
 	return 0;
 }
 
+/*
+ * r5l_recovery_load_data and r5l_recovery_load_parity uses flag R5_Wantwrite
+ * to mark valid (potentially not flushed) data in the journal.
+ *
+ * We already verified checksum in r5l_recovery_verify_data_checksum_for_mb,
+ * so there should not be any mismatch here.
+ */
+static void r5l_recovery_load_data(struct r5l_log *log,
+				   struct stripe_head *sh,
+				   struct r5l_recovery_ctx *ctx,
+				   struct r5l_payload_data_parity *payload,
+				   sector_t log_offset)
+{
+	struct mddev *mddev = log->rdev->mddev;
+	struct r5conf *conf = mddev->private;
+	int dd_idx;
+
+	raid5_compute_sector(conf,
+			     le64_to_cpu(payload->location), 0,
+			     &dd_idx, sh);
+	sync_page_io(log->rdev, log_offset, PAGE_SIZE,
+		     sh->dev[dd_idx].page, REQ_OP_READ, 0, false);
+	sh->dev[dd_idx].log_checksum =
+		le32_to_cpu(payload->checksum[0]);
+	ctx->meta_total_blocks += BLOCK_SECTORS;
+
+	set_bit(R5_Wantwrite, &sh->dev[dd_idx].flags);
+}
+
+static void r5l_recovery_load_parity(struct r5l_log *log,
+				     struct stripe_head *sh,
+				     struct r5l_recovery_ctx *ctx,
+				     struct r5l_payload_data_parity *payload,
+				     sector_t log_offset)
+{
+	struct mddev *mddev = log->rdev->mddev;
+	struct r5conf *conf = mddev->private;
+
+	ctx->meta_total_blocks += BLOCK_SECTORS * conf->max_degraded;
+	sync_page_io(log->rdev, log_offset, PAGE_SIZE,
+		     sh->dev[sh->pd_idx].page, REQ_OP_READ, 0, false);
+	sh->dev[sh->pd_idx].log_checksum =
+		le32_to_cpu(payload->checksum[0]);
+	set_bit(R5_Wantwrite, &sh->dev[sh->pd_idx].flags);
+
+	if (sh->qd_idx >= 0) {
+		sync_page_io(log->rdev,
+			     r5l_ring_add(log, log_offset, BLOCK_SECTORS),
+			     PAGE_SIZE, sh->dev[sh->qd_idx].page,
+			     REQ_OP_READ, 0, false);
+		sh->dev[sh->qd_idx].log_checksum =
+			le32_to_cpu(payload->checksum[1]);
+		set_bit(R5_Wantwrite, &sh->dev[sh->qd_idx].flags);
+	}
+	set_bit(STRIPE_R5C_WRITE_OUT, &sh->state);
+}
+
+static void r5l_recovery_reset_stripe(struct stripe_head *sh)
+{
+	int i;
+
+	sh->state = 0;
+	sh->log_start = MaxSector;
+	for (i = sh->disks; i--; )
+		sh->dev[i].flags = 0;
+}
+
+static void
+r5l_recovery_replay_one_stripe(struct r5conf *conf,
+			       struct stripe_head *sh,
+			       struct r5l_recovery_ctx *ctx)
+{
+	struct md_rdev *rdev, *rrdev;
+	int disk_index;
+	int data_count = 0;
+
+	for (disk_index = 0; disk_index < sh->disks; disk_index++) {
+		if (!test_bit(R5_Wantwrite, &sh->dev[disk_index].flags))
+			continue;
+		if (disk_index == sh->qd_idx || disk_index == sh->pd_idx)
+			continue;
+		data_count++;
+	}
+
+	/*
+	 * stripes that only have parity must have been flushed
+	 * before the crash that we are now recovering from, so
+	 * there is nothing more to recovery.
+	 */
+	if (data_count == 0)
+		goto out;
+
+	for (disk_index = 0; disk_index < sh->disks; disk_index++) {
+		if (!test_bit(R5_Wantwrite, &sh->dev[disk_index].flags))
+			continue;
+
+		/* in case device is broken */
+		rdev = rcu_dereference(conf->disks[disk_index].rdev);
+		if (rdev)
+			sync_page_io(rdev, sh->sector, PAGE_SIZE,
+				     sh->dev[disk_index].page, REQ_OP_WRITE, 0,
+				     false);
+		rrdev = rcu_dereference(conf->disks[disk_index].replacement);
+		if (rrdev)
+			sync_page_io(rrdev, sh->sector, PAGE_SIZE,
+				     sh->dev[disk_index].page, REQ_OP_WRITE, 0,
+				     false);
+	}
+	ctx->data_parity_stripes++;
+out:
+	r5l_recovery_reset_stripe(sh);
+}
+
+static struct stripe_head *
+r5c_recovery_alloc_stripe(struct r5conf *conf,
+			  struct list_head *recovery_list,
+			  sector_t stripe_sect,
+			  sector_t log_start)
+{
+	struct stripe_head *sh;
+
+	sh = raid5_get_active_stripe(conf, stripe_sect, 0, 1, 0);
+	if (!sh)
+		return NULL;  /* no more stripe available */
+
+	r5l_recovery_reset_stripe(sh);
+	sh->log_start = log_start;
+
+	return sh;
+}
+
+static struct stripe_head *
+r5c_recovery_lookup_stripe(struct list_head *list, sector_t sect)
+{
+	struct stripe_head *sh;
+
+	list_for_each_entry(sh, list, lru)
+		if (sh->sector == sect)
+			return sh;
+	return NULL;
+}
+
+static void
+r5c_recovery_drop_stripes(struct list_head *cached_stripe_list,
+			  struct r5l_recovery_ctx *ctx)
+{
+	struct stripe_head *sh, *next;
+
+	list_for_each_entry_safe(sh, next, cached_stripe_list, lru)
+		if (test_and_clear_bit(STRIPE_R5C_WRITE_OUT, &sh->state)) {
+			r5l_recovery_reset_stripe(sh);
+			list_del_init(&sh->lru);
+			raid5_release_stripe(sh);
+		}
+}
+
+static void
+r5c_recovery_replay_stripes(struct list_head *cached_stripe_list,
+			    struct r5l_recovery_ctx *ctx)
+{
+	struct stripe_head *sh, *next;
+
+	list_for_each_entry_safe(sh, next, cached_stripe_list, lru)
+		if (test_and_clear_bit(STRIPE_R5C_WRITE_OUT, &sh->state)) {
+			r5l_recovery_replay_one_stripe(sh->raid_conf, sh, ctx);
+			list_del_init(&sh->lru);
+			raid5_release_stripe(sh);
+		}
+}
+
+/* if matches return 0; otherwise return -EINVAL */
+static int
+r5l_recovery_verify_data_checksum(struct r5l_log *log, struct page *page,
+				  sector_t log_offset, __le32 log_checksum)
+{
+	void *addr;
+	u32 checksum;
+
+	sync_page_io(log->rdev, log_offset, PAGE_SIZE,
+		     page, REQ_OP_READ, 0, false);
+	addr = kmap_atomic(page);
+	checksum = crc32c_le(log->uuid_checksum, addr, PAGE_SIZE);
+	kunmap_atomic(addr);
+	return (le32_to_cpu(log_checksum) == checksum) ? 0 : -EINVAL;
+}
+
+/*
+ * before loading data to stripe cache, we need verify checksum for all data,
+ * if there is mismatch for any data page, we drop all data in the mata block
+ */
+static int
+r5l_recovery_verify_data_checksum_for_mb(struct r5l_log *log,
+					 struct r5l_recovery_ctx *ctx)
+{
+	struct mddev *mddev = log->rdev->mddev;
+	struct r5conf *conf = mddev->private;
+	struct r5l_meta_block *mb = page_address(ctx->meta_page);
+	sector_t mb_offset = sizeof(struct r5l_meta_block);
+	sector_t log_offset = r5l_ring_add(log, ctx->pos, BLOCK_SECTORS);
+	struct page *page;
+	struct r5l_payload_data_parity *payload;
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		return -ENOMEM;
+
+	while (mb_offset < le32_to_cpu(mb->meta_size)) {
+		payload = (void *)mb + mb_offset;
+
+		if (payload->header.type == R5LOG_PAYLOAD_DATA) {
+			if (r5l_recovery_verify_data_checksum(
+				    log, page, log_offset,
+				    payload->checksum[0]) < 0)
+				goto mismatch;
+		} else if (payload->header.type == R5LOG_PAYLOAD_PARITY) {
+			if (r5l_recovery_verify_data_checksum(
+				    log, page, log_offset,
+				    payload->checksum[0]) < 0)
+				goto mismatch;
+			if (conf->max_degraded == 2 && /* q for RAID 6 */
+			    r5l_recovery_verify_data_checksum(
+				    log, page,
+				    r5l_ring_add(log, log_offset,
+						 BLOCK_SECTORS),
+				    payload->checksum[1]) < 0)
+				goto mismatch;
+		} else /* not R5LOG_PAYLOAD_DATA or R5LOG_PAYLOAD_PARITY */
+			goto mismatch;
+
+		log_offset = r5l_ring_add(log, log_offset,
+					  le32_to_cpu(payload->size));
+
+		mb_offset += sizeof(struct r5l_payload_data_parity) +
+			sizeof(__le32) *
+			(le32_to_cpu(payload->size) >> (PAGE_SHIFT - 9));
+	}
+
+	put_page(page);
+	return 0;
+
+mismatch:
+	put_page(page);
+	return -EINVAL;
+}
+
+/*
+ * Analyze all data/parity pages in one meta block
+ * Returns:
+ * 0 for success
+ * -EINVAL for unknown playload type
+ * -EAGAIN for checksum mismatch of data page
+ * -ENOMEM for run out of memory (alloc_page failed or run out of stripes)
+ */
+static int
+r5c_recovery_analyze_meta_block(struct r5l_log *log,
+				struct r5l_recovery_ctx *ctx,
+				struct list_head *cached_stripe_list)
+{
+	struct mddev *mddev = log->rdev->mddev;
+	struct r5conf *conf = mddev->private;
+	struct r5l_meta_block *mb;
+	struct r5l_payload_data_parity *payload;
+	int mb_offset;
+	sector_t log_offset;
+	sector_t stripe_sect;
+	struct stripe_head *sh;
+	int ret;
+
+	/*
+	 * for mismatch in data blocks, we will drop all data in this mb, but
+	 * we will still read next mb for other data with FLUSH flag, as
+	 * io_unit could finish out of order.
+	 */
+	ret = r5l_recovery_verify_data_checksum_for_mb(log, ctx);
+	if (ret == -EINVAL)
+		return -EAGAIN;
+	else if (ret)
+		return ret;   /* -ENOMEM duo to alloc_page() failed */
+
+	mb = page_address(ctx->meta_page);
+	mb_offset = sizeof(struct r5l_meta_block);
+	log_offset = r5l_ring_add(log, ctx->pos, BLOCK_SECTORS);
+
+	while (mb_offset < le32_to_cpu(mb->meta_size)) {
+		int dd;
+
+		payload = (void *)mb + mb_offset;
+		stripe_sect = (payload->header.type == R5LOG_PAYLOAD_DATA) ?
+			raid5_compute_sector(
+				conf, le64_to_cpu(payload->location), 0, &dd,
+				NULL)
+			: le64_to_cpu(payload->location);
+
+		sh = r5c_recovery_lookup_stripe(cached_stripe_list,
+						stripe_sect);
+
+		if (!sh) {
+			sh = r5c_recovery_alloc_stripe(conf, cached_stripe_list,
+						       stripe_sect, ctx->pos);
+			/*
+			 * cannot get stripe from raid5_get_active_stripe
+			 * try replay some stripes
+			 */
+			if (!sh) {
+				r5c_recovery_replay_stripes(
+					cached_stripe_list, ctx);
+				sh = r5c_recovery_alloc_stripe(
+					conf, cached_stripe_list,
+					stripe_sect, ctx->pos);
+			}
+			if (!sh) {
+				pr_debug("md/raid:%s: Increasing stripe cache size to %d to recovery data on journal.\n",
+					mdname(mddev),
+					conf->min_nr_stripes * 2);
+				raid5_set_cache_size(mddev,
+						     conf->min_nr_stripes * 2);
+				sh = r5c_recovery_alloc_stripe(
+					conf, cached_stripe_list, stripe_sect,
+					ctx->pos);
+			}
+			if (!sh) {
+				pr_err("md/raid:%s: Cannot get enough stripes due to memory pressure. Recovery failed.\n",
+				       mdname(mddev));
+				return -ENOMEM;
+			}
+			list_add_tail(&sh->lru, cached_stripe_list);
+		}
+
+		if (payload->header.type == R5LOG_PAYLOAD_DATA) {
+			if (test_bit(STRIPE_R5C_WRITE_OUT, &sh->state)) {
+				r5l_recovery_replay_one_stripe(conf, sh, ctx);
+				r5l_recovery_reset_stripe(sh);
+				sh->log_start = ctx->pos;
+				list_move_tail(&sh->lru, cached_stripe_list);
+			}
+			r5l_recovery_load_data(log, sh, ctx, payload,
+					       log_offset);
+		} else if (payload->header.type == R5LOG_PAYLOAD_PARITY)
+			r5l_recovery_load_parity(log, sh, ctx, payload,
+						 log_offset);
+		else
+			return -EINVAL;
+
+		log_offset = r5l_ring_add(log, log_offset,
+					  le32_to_cpu(payload->size));
+
+		mb_offset += sizeof(struct r5l_payload_data_parity) +
+			sizeof(__le32) *
+			(le32_to_cpu(payload->size) >> (PAGE_SHIFT - 9));
+	}
+
+	return 0;
+}
+
+/*
+ * Load the stripe into cache. The stripe will be written out later by
+ * the stripe cache state machine.
+ */
+static void r5c_recovery_load_one_stripe(struct r5l_log *log,
+					 struct stripe_head *sh)
+{
+	struct r5conf *conf = sh->raid_conf;
+	struct r5dev *dev;
+	int i;
+
+	for (i = sh->disks; i--; ) {
+		dev = sh->dev + i;
+		if (test_and_clear_bit(R5_Wantwrite, &dev->flags)) {
+			set_bit(R5_InJournal, &dev->flags);
+			set_bit(R5_UPTODATE, &dev->flags);
+		}
+	}
+	set_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state);
+	atomic_inc(&conf->r5c_cached_partial_stripes);
+	list_add_tail(&sh->r5c, &log->stripe_in_cache_list);
+}
+
+/*
+ * Scan through the log for all to-be-flushed data
+ *
+ * For stripes with data and parity, namely Data-Parity stripe
+ * (STRIPE_R5C_WRITE_OUT == 0), we simply replay all the writes.
+ *
+ * For stripes with only data, namely Data-Only stripe
+ * (STRIPE_R5C_WRITE_OUT == 1), we load them to stripe cache state machine.
+ *
+ * For a stripe, if we see data after parity, we should discard all previous
+ * data and parity for this stripe, as these data are already flushed to
+ * the array.
+ *
+ * At the end of the scan, we return the new journal_tail, which points to
+ * first data-only stripe on the journal device, or next invalid meta block.
+ */
+static int r5c_recovery_flush_log(struct r5l_log *log,
+				  struct r5l_recovery_ctx *ctx)
+{
+	struct stripe_head *sh, *next;
+	int ret = 0;
+
+	/* scan through the log */
+	while (1) {
+		if (r5l_recovery_read_meta_block(log, ctx))
+			break;
+
+		ret = r5c_recovery_analyze_meta_block(log, ctx,
+						      &ctx->cached_list);
+		/*
+		 * -EAGAIN means mismatch in data block, in this case, we still
+		 * try scan the next metablock
+		 */
+		if (ret && ret != -EAGAIN)
+			break;   /* ret == -EINVAL or -ENOMEM */
+		ctx->seq++;
+		ctx->pos = r5l_ring_add(log, ctx->pos, ctx->meta_total_blocks);
+	}
+
+	if (ret == -ENOMEM) {
+		r5c_recovery_drop_stripes(&ctx->cached_list, ctx);
+		return ret;
+	}
+
+	/* replay data-parity stripes */
+	r5c_recovery_replay_stripes(&ctx->cached_list, ctx);
+
+	/* load data-only stripes to stripe cache */
+	list_for_each_entry_safe(sh, next, &ctx->cached_list, lru) {
+		WARN_ON(test_bit(STRIPE_R5C_WRITE_OUT, &sh->state));
+		r5c_recovery_load_one_stripe(log, sh);
+		list_del_init(&sh->lru);
+		raid5_release_stripe(sh);
+		ctx->data_only_stripes++;
+	}
+
+	return 0;
+}
+
+/*
+ * we did a recovery. Now ctx.pos points to an invalid meta block. New
+ * log will start here. but we can't let superblock point to last valid
+ * meta block. The log might looks like:
+ * | meta 1| meta 2| meta 3|
+ * meta 1 is valid, meta 2 is invalid. meta 3 could be valid. If
+ * superblock points to meta 1, we write a new valid meta 2n.  if crash
+ * happens again, new recovery will start from meta 1. Since meta 2n is
+ * valid now, recovery will think meta 3 is valid, which is wrong.
+ * The solution is we create a new meta in meta2 with its seq == meta
+ * 1's seq + 10 and let superblock points to meta2. The same recovery will
+ * not think meta 3 is a valid meta, because its seq doesn't match
+ */
+
+/*
+ * Before recovery, the log looks like the following
+ *
+ *   ---------------------------------------------
+ *   |           valid log        | invalid log  |
+ *   ---------------------------------------------
+ *   ^
+ *   |- log->last_checkpoint
+ *   |- log->last_cp_seq
+ *
+ * Now we scan through the log until we see invalid entry
+ *
+ *   ---------------------------------------------
+ *   |           valid log        | invalid log  |
+ *   ---------------------------------------------
+ *   ^                            ^
+ *   |- log->last_checkpoint      |- ctx->pos
+ *   |- log->last_cp_seq          |- ctx->seq
+ *
+ * From this point, we need to increase seq number by 10 to avoid
+ * confusing next recovery.
+ *
+ *   ---------------------------------------------
+ *   |           valid log        | invalid log  |
+ *   ---------------------------------------------
+ *   ^                              ^
+ *   |- log->last_checkpoint        |- ctx->pos+1
+ *   |- log->last_cp_seq            |- ctx->seq+11
+ *
+ * However, it is not safe to start the state machine yet, because data only
+ * parities are not yet secured in RAID. To save these data only parities, we
+ * rewrite them from seq+11.
+ *
+ *   -----------------------------------------------------------------
+ *   |           valid log        | data only stripes | invalid log  |
+ *   -----------------------------------------------------------------
+ *   ^                                                ^
+ *   |- log->last_checkpoint                          |- ctx->pos+n
+ *   |- log->last_cp_seq                              |- ctx->seq+10+n
+ *
+ * If failure happens again during this process, the recovery can safe start
+ * again from log->last_checkpoint.
+ *
+ * Once data only stripes are rewritten to journal, we move log_tail
+ *
+ *   -----------------------------------------------------------------
+ *   |     old log        |    data only stripes    | invalid log  |
+ *   -----------------------------------------------------------------
+ *                        ^                         ^
+ *                        |- log->last_checkpoint   |- ctx->pos+n
+ *                        |- log->last_cp_seq       |- ctx->seq+10+n
+ *
+ * Then we can safely start the state machine. If failure happens from this
+ * point on, the recovery will start from new log->last_checkpoint.
+ */
+static int
+r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
+				       struct r5l_recovery_ctx *ctx)
+{
+	struct stripe_head *sh;
+	struct mddev *mddev = log->rdev->mddev;
+	struct page *page;
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page) {
+		pr_err("md/raid:%s: cannot allocate memory to rewrite data only stripes\n",
+		       mdname(mddev));
+		return -ENOMEM;
+	}
+
+	ctx->seq += 10;
+	list_for_each_entry(sh, &ctx->cached_list, lru) {
+		struct r5l_meta_block *mb;
+		int i;
+		int offset;
+		sector_t write_pos;
+
+		WARN_ON(test_bit(STRIPE_R5C_WRITE_OUT, &sh->state));
+		r5l_recovery_create_empty_meta_block(log, page,
+						     ctx->pos, ctx->seq);
+		mb = page_address(page);
+		offset = le32_to_cpu(mb->meta_size);
+		write_pos = ctx->pos + BLOCK_SECTORS;
+
+		for (i = sh->disks; i--; ) {
+			struct r5dev *dev = &sh->dev[i];
+			struct r5l_payload_data_parity *payload;
+			void *addr;
+
+			if (test_bit(R5_InJournal, &dev->flags)) {
+				payload = (void *)mb + offset;
+				payload->header.type = cpu_to_le16(
+					R5LOG_PAYLOAD_DATA);
+				payload->size = BLOCK_SECTORS;
+				payload->location = cpu_to_le64(
+					raid5_compute_blocknr(sh, i, 0));
+				addr = kmap_atomic(dev->page);
+				payload->checksum[0] = cpu_to_le32(
+					crc32c_le(log->uuid_checksum, addr,
+						  PAGE_SIZE));
+				kunmap_atomic(addr);
+				sync_page_io(log->rdev, write_pos, PAGE_SIZE,
+					     dev->page, REQ_OP_WRITE, 0, false);
+				write_pos = r5l_ring_add(log, write_pos,
+							 BLOCK_SECTORS);
+				offset += sizeof(__le32) +
+					sizeof(struct r5l_payload_data_parity);
+
+			}
+		}
+		mb->meta_size = cpu_to_le32(offset);
+		mb->checksum = crc32c_le(log->uuid_checksum, mb, PAGE_SIZE);
+		sync_page_io(log->rdev, ctx->pos, PAGE_SIZE, page,
+			     REQ_OP_WRITE, WRITE_FUA, false);
+		sh->log_start = ctx->pos;
+		ctx->pos = write_pos;
+		ctx->seq += 1;
+	}
+	__free_page(page);
+	return 0;
+}
+
 static int r5l_recovery_log(struct r5l_log *log)
 {
 	struct r5l_recovery_ctx ctx;
@@ -1568,6 +2144,10 @@ static int r5l_recovery_log(struct r5l_log *log)
 	ctx.pos = log->last_checkpoint;
 	ctx.seq = log->last_cp_seq;
 	ctx.meta_page = alloc_page(GFP_KERNEL);
+	ctx.data_only_stripes = 0;
+	ctx.data_parity_stripes = 0;
+	INIT_LIST_HEAD(&ctx.cached_list);
+
 	if (!ctx.meta_page)
 		return -ENOMEM;
 
@@ -1602,6 +2182,16 @@ static int r5l_recovery_log(struct r5l_log *log)
 		log->log_start = ctx.pos;
 		log->seq = ctx.seq;
 	}
+
+	/*
+	 * This is to suppress "function defined but not used" warning.
+	 * It will be removed when the two functions are used (next patch).
+	 */
+	if (!log) {
+		r5c_recovery_flush_log(log, &ctx);
+		r5c_recovery_rewrite_data_only_stripes(log, &ctx);
+	}
+
 	return 0;
 }
 
-- 
2.9.3


^ permalink raw reply related

* [PATCH v6 07/11] md/r5cache: refactoring journal recovery code
From: Song Liu @ 2016-11-10 20:46 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu
In-Reply-To: <20161110204623.3484694-1-songliubraving@fb.com>

1. rename r5l_read_meta_block() as r5l_recovery_read_meta_block();
2. pull the code that initialize r5l_meta_block from
   r5l_log_write_empty_meta_block() to a separate function
   r5l_recovery_create_empty_meta_block(), so that we can reuse this
   piece of code.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index d2acb69..1058e81 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1353,8 +1353,8 @@ struct r5l_recovery_ctx {
 	u64 seq;			/* recovery position seq */
 };
 
-static int r5l_read_meta_block(struct r5l_log *log,
-			       struct r5l_recovery_ctx *ctx)
+static int r5l_recovery_read_meta_block(struct r5l_log *log,
+					struct r5l_recovery_ctx *ctx)
 {
 	struct page *page = ctx->meta_page;
 	struct r5l_meta_block *mb;
@@ -1515,7 +1515,7 @@ static void r5l_recovery_flush_log(struct r5l_log *log,
 				   struct r5l_recovery_ctx *ctx)
 {
 	while (1) {
-		if (r5l_read_meta_block(log, ctx))
+		if (r5l_recovery_read_meta_block(log, ctx))
 			return;
 		if (r5l_recovery_flush_one_meta(log, ctx))
 			return;
@@ -1524,17 +1524,16 @@ static void r5l_recovery_flush_log(struct r5l_log *log,
 	}
 }
 
-static int r5l_log_write_empty_meta_block(struct r5l_log *log, sector_t pos,
-					  u64 seq)
+static void
+r5l_recovery_create_empty_meta_block(struct r5l_log *log,
+				     struct page *page,
+				     sector_t pos, u64 seq)
 {
-	struct page *page;
 	struct r5l_meta_block *mb;
 	u32 crc;
 
-	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
-	if (!page)
-		return -ENOMEM;
 	mb = page_address(page);
+	clear_page(mb);
 	mb->magic = cpu_to_le32(R5LOG_MAGIC);
 	mb->version = R5LOG_VERSION;
 	mb->meta_size = cpu_to_le32(sizeof(struct r5l_meta_block));
@@ -1542,7 +1541,17 @@ static int r5l_log_write_empty_meta_block(struct r5l_log *log, sector_t pos,
 	mb->position = cpu_to_le64(pos);
 	crc = crc32c_le(log->uuid_checksum, mb, PAGE_SIZE);
 	mb->checksum = cpu_to_le32(crc);
+}
 
+static int r5l_log_write_empty_meta_block(struct r5l_log *log, sector_t pos,
+					  u64 seq)
+{
+	struct page *page;
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		return -ENOMEM;
+	r5l_recovery_create_empty_meta_block(log, page, pos, seq);
 	if (!sync_page_io(log->rdev, pos, PAGE_SIZE, page, REQ_OP_WRITE,
 			  WRITE_FUA, false)) {
 		__free_page(page);
-- 
2.9.3


^ permalink raw reply related

* [PATCH v6 06/11] md/r5cache: sysfs entry r5c_journal_mode
From: Song Liu @ 2016-11-10 20:46 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu
In-Reply-To: <20161110204623.3484694-1-songliubraving@fb.com>

With write cache, r5c_journal_mode is the knob to switch between
write-back and write-through.

Below is an example:

root@virt-test:~/# cat /sys/block/md0/md/r5c_state
[write-through] write-back
root@virt-test:~/# echo write-back > /sys/block/md0/md/r5c_state
root@virt-test:~/# cat /sys/block/md0/md/r5c_state
write-through [write-back]

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c       |  1 +
 drivers/md/raid5.h       |  1 +
 3 files changed, 62 insertions(+)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 8330053..d2acb69 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -60,6 +60,8 @@ enum r5c_journal_mode {
 	R5C_JOURNAL_MODE_WRITE_BACK = 1,
 };
 
+static char *r5c_journal_mode_str[] = {"write-through",
+				       "write-back"};
 /*
  * raid5 cache state machine
  *
@@ -1602,6 +1604,64 @@ static void r5l_write_super(struct r5l_log *log, sector_t cp)
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);
 }
 
+static ssize_t r5c_journal_mode_show(struct mddev *mddev, char *page)
+{
+	struct r5conf *conf = mddev->private;
+	int ret;
+
+	if (!conf->log)
+		return 0;
+
+	switch (conf->log->r5c_journal_mode) {
+	case R5C_JOURNAL_MODE_WRITE_THROUGH:
+		ret = snprintf(page, PAGE_SIZE, "[%s] %s\n",
+			       r5c_journal_mode_str[R5C_JOURNAL_MODE_WRITE_THROUGH],
+			       r5c_journal_mode_str[R5C_JOURNAL_MODE_WRITE_BACK]);
+		break;
+	case R5C_JOURNAL_MODE_WRITE_BACK:
+		ret = snprintf(page, PAGE_SIZE, "%s [%s]\n",
+			       r5c_journal_mode_str[R5C_JOURNAL_MODE_WRITE_THROUGH],
+			       r5c_journal_mode_str[R5C_JOURNAL_MODE_WRITE_BACK]);
+		break;
+	default:
+		ret = 0;
+	}
+	return ret;
+}
+
+static ssize_t r5c_journal_mode_store(struct mddev *mddev,
+				      const char *page, size_t len)
+{
+	struct r5conf *conf = mddev->private;
+	struct r5l_log *log = conf->log;
+	int val = -1, i;
+
+	if (!log)
+		return -ENODEV;
+
+	for (i = 0; i < sizeof(r5c_journal_mode_str) / sizeof(r5c_journal_mode_str[0]); i++)
+		if (strlen(r5c_journal_mode_str[i]) == len - 1 &&
+		    strncmp(page, r5c_journal_mode_str[i], len - 1) == 0) {
+			val = i;
+			break;
+		}
+	if (val < R5C_JOURNAL_MODE_WRITE_THROUGH ||
+	    val > R5C_JOURNAL_MODE_WRITE_BACK)
+		return -EINVAL;
+
+	mddev_suspend(mddev);
+	conf->log->r5c_journal_mode = val;
+	mddev_resume(mddev);
+
+	pr_debug("md/raid:%s: setting r5c cache mode to %d: %s\n",
+		 mdname(mddev), val, r5c_journal_mode_str[val]);
+	return len;
+}
+
+struct md_sysfs_entry
+r5c_journal_mode = __ATTR(r5c_journal_mode, S_IRUGO | S_IWUSR,
+			  r5c_journal_mode_show, r5c_journal_mode_store);
+
 int r5c_handle_stripe_dirtying(struct r5conf *conf,
 			       struct stripe_head *sh,
 			       struct stripe_head_state *s,
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 49414f9..e8dace5 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6306,6 +6306,7 @@ static struct attribute *raid5_attrs[] =  {
 	&raid5_group_thread_cnt.attr,
 	&raid5_skip_copy.attr,
 	&raid5_rmw_level.attr,
+	&r5c_journal_mode.attr,
 	NULL,
 };
 static struct attribute_group raid5_attrs_group = {
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 8a913f9..801722a 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -754,4 +754,5 @@ extern void r5c_make_stripe_write_out(struct stripe_head *sh);
 extern void r5c_flush_cache(struct r5conf *conf, int num);
 extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
 extern void r5c_check_cached_full_stripe(struct r5conf *conf);
+extern struct md_sysfs_entry r5c_journal_mode;
 #endif
-- 
2.9.3


^ permalink raw reply related

* [PATCH v6 05/11] md/r5cache: write-out mode and reclaim support
From: Song Liu @ 2016-11-10 20:46 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu
In-Reply-To: <20161110204623.3484694-1-songliubraving@fb.com>

There are two limited resources, stripe cache and journal disk space.
For better performance, we priotize reclaim of full stripe writes.
To free up more journal space, we free earliest data on the journal.

In current implementation, reclaim happens when:
1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
   if there is no reclaim in the past 5 seconds.
2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
   or cached stripes is enough for a full stripe (chunk size / 4k)
   (r5c_check_cached_full_stripe)
3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)

r5c_do_reclaim() contains new logic of reclaim.

For stripe cache:

When stripe cache pressure is high (more than 3/4 stripes are cached,
or there is empty inactive lists), flush all full stripe. If fewer
than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
are flushed, flush some paritial stripes. When stripe cache pressure
is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.

For log space:

To avoid deadlock due to log space, we need to reserve enough space
to flush cached data. The size of required log space depends on total
number of cached stripes (stripe_in_cache_count). In current
implementation, the writing-out mode automatically include pending
data writes with parity writes (similar to write through case).
Therefore, we need up to (conf->raid_disks + 1) pages for each cached
stripe (1 page for meta data, raid_disks pages for all data and
parity). r5c_log_required_to_flush_cache() calculates log space
required to flush cache. In the following, we refer to the space
calculated by r5c_log_required_to_flush_cache() as
reclaim_required_space.

Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
is set when free space on the log device is less than 2x of
reclaim_required_space.

r5c_cache keeps all data in cache (not fully committed to RAID) in
a list (stripe_in_cache_list). These stripes are in the order of their
first appearance on the journal. So the log tail (last_checkpoint)
should point to the journal_start of the first item in the list.

When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
stripes at the head of stripe_in_cache. When R5C_LOG_CRITICAL is
set, the state machine only writes data that are already in the
log device (in stripe_in_cache_list).

This patch includes a fix to improve performance by
Shaohua Li <shli@fb.com>.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 401 +++++++++++++++++++++++++++++++++++++++++++----
 drivers/md/raid5.c       |  21 +++
 drivers/md/raid5.h       |  39 +++--
 3 files changed, 420 insertions(+), 41 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 99c64d8..8330053 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -29,12 +29,21 @@
 #define BLOCK_SECTORS (8)
 
 /*
- * reclaim runs every 1/4 disk size or 10G reclaimable space. This can prevent
- * recovery scans a very long log
+ * log->max_free_space is min(1/4 disk size, 10G reclaimable space).
+ *
+ * In write through mode, the reclaim runs every log->max_free_space.
+ * This can prevent the recovery scans for too long
  */
 #define RECLAIM_MAX_FREE_SPACE (10 * 1024 * 1024 * 2) /* sector */
 #define RECLAIM_MAX_FREE_SPACE_SHIFT (2)
 
+/* wake up reclaim thread periodically */
+#define R5C_RECLAIM_WAKEUP_INTERVAL (30 * HZ)
+/* start flush with these full stripes */
+#define R5C_FULL_STRIPE_FLUSH_BATCH 256
+/* reclaim stripes in groups */
+#define R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2)
+
 /*
  * We only need 2 bios per I/O unit to make progress, but ensure we
  * have a few more available to not get too tight.
@@ -141,6 +150,12 @@ struct r5l_log {
 
 	/* for r5c_cache */
 	enum r5c_journal_mode r5c_journal_mode;
+
+	/* all stripes in r5cache, in the order of seq at sh->log_start */
+	struct list_head stripe_in_cache_list;
+
+	spinlock_t stripe_in_cache_lock;
+	atomic_t stripe_in_cache_count;
 };
 
 /*
@@ -256,6 +271,103 @@ void r5c_handle_cached_data_endio(struct r5conf *conf,
 	}
 }
 
+/* Check whether we should flush some stripes to free up stripe cache */
+void r5c_check_stripe_cache_usage(struct r5conf *conf)
+{
+	int total_cached;
+
+	if (!r5c_is_writeback(conf->log))
+		return;
+
+	total_cached = atomic_read(&conf->r5c_cached_partial_stripes) +
+		atomic_read(&conf->r5c_cached_full_stripes);
+
+	/*
+	 * The following condition is true for either of the following:
+	 *   - stripe cache pressure high:
+	 *          total_cached > 3/4 min_nr_stripes ||
+	 *          empty_inactive_list_nr > 0
+	 *   - stripe cache pressure moderate:
+	 *          total_cached > 1/2 min_nr_stripes
+	 */
+	if (total_cached > conf->min_nr_stripes * 1 / 2 ||
+	    atomic_read(&conf->empty_inactive_list_nr) > 0)
+		r5l_wake_reclaim(conf->log, 0);
+}
+
+/*
+ * flush cache when there are R5C_FULL_STRIPE_FLUSH_BATCH or more full
+ * stripes in the cache
+ */
+void r5c_check_cached_full_stripe(struct r5conf *conf)
+{
+	if (!r5c_is_writeback(conf->log))
+		return;
+
+	/*
+	 * wake up reclaim for R5C_FULL_STRIPE_FLUSH_BATCH cached stripes or
+	 * a full stripe (chunk size / 4k stripes).
+	 */
+	if (atomic_read(&conf->r5c_cached_full_stripes) >=
+	    min(R5C_FULL_STRIPE_FLUSH_BATCH, conf->chunk_sectors >> STRIPE_SHIFT))
+		r5l_wake_reclaim(conf->log, 0);
+}
+
+/*
+ * Total log space (in sectors) needed to flush all data in cache
+ *
+ * Currently, writing-out mode automatically includes all pending writes
+ * to the same sector. So the reclaim of each stripe takes up to
+ * (conf->raid_disks + 1) pages of log space.
+ *
+ * To totally avoid deadlock due to log space, the code reserves
+ * (conf->raid_disks + 1) pages for each stripe in cache, which is not
+ * necessary in most cases.
+ *
+ * To improve this, we will need writing-out mode to be able to NOT include
+ * pending writes, which will reduce the requirement to
+ * (conf->max_degraded + 1) pages per stripe in cache.
+ */
+static sector_t r5c_log_required_to_flush_cache(struct r5conf *conf)
+{
+	struct r5l_log *log = conf->log;
+
+	if (!r5c_is_writeback(log))
+		return 0;
+
+	return BLOCK_SECTORS * (conf->raid_disks + 1) *
+		atomic_read(&log->stripe_in_cache_count);
+}
+
+/*
+ * evaluate log space usage and update R5C_LOG_TIGHT and R5C_LOG_CRITICAL
+ *
+ * R5C_LOG_TIGHT is set when free space on the log device is less than 3x of
+ * reclaim_required_space. R5C_LOG_CRITICAL is set when free space on the log
+ * device is less than 2x of reclaim_required_space.
+ */
+static inline void r5c_update_log_state(struct r5l_log *log)
+{
+	struct r5conf *conf = log->rdev->mddev->private;
+	sector_t free_space;
+	sector_t reclaim_space;
+
+	if (!r5c_is_writeback(log))
+		return;
+
+	free_space = r5l_ring_distance(log, log->log_start,
+				       log->last_checkpoint);
+	reclaim_space = r5c_log_required_to_flush_cache(conf);
+	if (free_space < 2 * reclaim_space)
+		set_bit(R5C_LOG_CRITICAL, &conf->cache_state);
+	else
+		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
+	if (free_space < 3 * reclaim_space)
+		set_bit(R5C_LOG_TIGHT, &conf->cache_state);
+	else
+		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
+}
+
 /*
  * Put the stripe into writing-out mode by setting STRIPE_R5C_WRITE_OUT.
  *
@@ -263,7 +375,7 @@ void r5c_handle_cached_data_endio(struct r5conf *conf,
  * caching mode and writing-out mode. In such cases, this function is called
  * in r5c_handle_stripe_dirtying().
  */
-static void r5c_make_stripe_write_out(struct stripe_head *sh)
+void r5c_make_stripe_write_out(struct stripe_head *sh)
 {
 	struct r5conf *conf = sh->raid_conf;
 	struct r5l_log *log = conf->log;
@@ -445,6 +557,7 @@ static void r5_reserve_log_entry(struct r5l_log *log, struct r5l_io_unit *io)
 {
 	log->log_start = r5l_ring_add(log, log->log_start, BLOCK_SECTORS);
 
+	r5c_update_log_state(log);
 	/*
 	 * If we filled up the log device start from the beginning again,
 	 * which will require a new bio.
@@ -605,21 +718,42 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
 	atomic_inc(&io->pending_stripe);
 	sh->log_io = io;
 
+	if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH)
+		return 0;
+
+	if (sh->log_start == MaxSector) {
+		BUG_ON(!list_empty(&sh->r5c));
+		sh->log_start = io->log_start;
+		spin_lock(&log->stripe_in_cache_lock);
+		list_add_tail(&sh->r5c,
+			      &log->stripe_in_cache_list);
+		spin_unlock(&log->stripe_in_cache_lock);
+		atomic_inc(&log->stripe_in_cache_count);
+	}
 	return 0;
 }
 
-static void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
+/* add stripe to no_space_stripes, and then wake up reclaim */
+static inline void r5l_add_no_space_stripe(struct r5l_log *log, struct stripe_head *sh)
+{
+	spin_lock(&log->no_space_stripes_lock);
+	list_add_tail(&sh->log_list, &log->no_space_stripes);
+	spin_unlock(&log->no_space_stripes_lock);
+}
+
 /*
  * running in raid5d, where reclaim could wait for raid5d too (when it flushes
  * data from log to raid disks), so we shouldn't wait for reclaim here
  */
 int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
 {
+	struct r5conf *conf = sh->raid_conf;
 	int write_disks = 0;
 	int data_pages, parity_pages;
 	int reserve;
 	int i;
 	int ret = 0;
+	bool wake_reclaim = false;
 
 	if (!log)
 		return -EAGAIN;
@@ -665,22 +799,49 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
 	mutex_lock(&log->io_mutex);
 	/* meta + data */
 	reserve = (1 + write_disks) << (PAGE_SHIFT - 9);
-	if (!r5l_has_free_space(log, reserve)) {
-		spin_lock(&log->no_space_stripes_lock);
-		list_add_tail(&sh->log_list, &log->no_space_stripes);
-		spin_unlock(&log->no_space_stripes_lock);
 
-		r5l_wake_reclaim(log, reserve);
-	} else {
-		ret = r5l_log_stripe(log, sh, data_pages, parity_pages);
-		if (ret) {
-			spin_lock_irq(&log->io_list_lock);
-			list_add_tail(&sh->log_list, &log->no_mem_stripes);
-			spin_unlock_irq(&log->io_list_lock);
+	if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH) {
+		if (!r5l_has_free_space(log, reserve)) {
+			r5l_add_no_space_stripe(log, sh);
+			wake_reclaim = true;
+		} else {
+			ret = r5l_log_stripe(log, sh, data_pages, parity_pages);
+			if (ret) {
+				spin_lock_irq(&log->io_list_lock);
+				list_add_tail(&sh->log_list,
+					      &log->no_mem_stripes);
+				spin_unlock_irq(&log->io_list_lock);
+			}
+		}
+	} else {  /* R5C_JOURNAL_MODE_WRITE_BACK */
+		/*
+		 * log space critical, do not process stripes that are
+		 * not in cache yet (sh->log_start == MaxSector).
+		 */
+		if (test_bit(R5C_LOG_CRITICAL, &conf->cache_state) &&
+		    sh->log_start == MaxSector) {
+			r5l_add_no_space_stripe(log, sh);
+			wake_reclaim = true;
+			reserve = 0;
+		} else if (!r5l_has_free_space(log, reserve)) {
+			if (sh->log_start == log->last_checkpoint)
+				BUG();
+			else {
+				r5l_add_no_space_stripe(log, sh);
+			}
+		} else {
+			ret = r5l_log_stripe(log, sh, data_pages, parity_pages);
+			if (ret) {
+				spin_lock_irq(&log->io_list_lock);
+				list_add_tail(&sh->log_list, &log->no_mem_stripes);
+				spin_unlock_irq(&log->io_list_lock);
+			}
 		}
 	}
 
 	mutex_unlock(&log->io_mutex);
+	if (wake_reclaim)
+		r5l_wake_reclaim(log, reserve);
 	return 0;
 }
 
@@ -727,10 +888,39 @@ static void r5l_run_no_space_stripes(struct r5l_log *log)
 	spin_unlock(&log->no_space_stripes_lock);
 }
 
+/*
+ * calculate new last_checkpoint
+ * for write through mode, returns log->next_checkpoint
+ * for write back, returns log_start of first sh in stripe_in_cache_list
+ */
+static sector_t r5c_calculate_new_cp(struct r5conf *conf)
+{
+	struct stripe_head *sh;
+	struct r5l_log *log = conf->log;
+	sector_t end = MaxSector;
+
+	if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH)
+		return log->next_checkpoint;
+
+	spin_lock(&log->stripe_in_cache_lock);
+	if (list_empty(&conf->log->stripe_in_cache_list)) {
+		/* all stripes flushed */
+		spin_unlock(&log->stripe_in_cache_lock);
+		return log->next_checkpoint;
+	}
+	sh = list_first_entry(&conf->log->stripe_in_cache_list,
+			      struct stripe_head, r5c);
+	end = sh->log_start;
+	spin_unlock(&log->stripe_in_cache_lock);
+	return end;
+}
+
 static sector_t r5l_reclaimable_space(struct r5l_log *log)
 {
+	struct r5conf *conf = log->rdev->mddev->private;
+
 	return r5l_ring_distance(log, log->last_checkpoint,
-				 log->next_checkpoint);
+				 r5c_calculate_new_cp(conf));
 }
 
 static void r5l_run_no_mem_stripe(struct r5l_log *log)
@@ -776,6 +966,7 @@ static bool r5l_complete_finished_ios(struct r5l_log *log)
 static void __r5l_stripe_write_finished(struct r5l_io_unit *io)
 {
 	struct r5l_log *log = io->log;
+	struct r5conf *conf = log->rdev->mddev->private;
 	unsigned long flags;
 
 	spin_lock_irqsave(&log->io_list_lock, flags);
@@ -786,7 +977,8 @@ static void __r5l_stripe_write_finished(struct r5l_io_unit *io)
 		return;
 	}
 
-	if (r5l_reclaimable_space(log) > log->max_free_space)
+	if (r5l_reclaimable_space(log) > log->max_free_space ||
+	    test_bit(R5C_LOG_TIGHT, &conf->cache_state))
 		r5l_wake_reclaim(log, 0);
 
 	spin_unlock_irqrestore(&log->io_list_lock, flags);
@@ -907,14 +1099,140 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
 	}
 }
 
+/*
+ * r5c_flush_stripe moves stripe from cached list to handle_list. When called,
+ * the stripe must be on r5c_cached_full_stripes or r5c_cached_partial_stripes.
+ *
+ * must hold conf->device_lock
+ */
+static void r5c_flush_stripe(struct r5conf *conf, struct stripe_head *sh)
+{
+	BUG_ON(list_empty(&sh->lru));
+	BUG_ON(test_bit(STRIPE_R5C_WRITE_OUT, &sh->state));
+	BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
+	assert_spin_locked(&conf->device_lock);
+
+	list_del_init(&sh->lru);
+	atomic_inc(&sh->count);
+
+	set_bit(STRIPE_HANDLE, &sh->state);
+	atomic_inc(&conf->active_stripes);
+	r5c_make_stripe_write_out(sh);
+
+	if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+		atomic_inc(&conf->preread_active_stripes);
+	raid5_release_stripe(sh);
+}
+
+/*
+ * if num == 0, flush all full stripes
+ * if num > 0, flush all full stripes. If less than num full stripes are
+ *             flushed, flush some partial stripes until totally num stripes are
+ *             flushed or there is no more cached stripes.
+ */
+void r5c_flush_cache(struct r5conf *conf, int num)
+{
+	int count;
+	struct stripe_head *sh, *next;
+
+	assert_spin_locked(&conf->device_lock);
+	if (!conf->log)
+		return;
+
+	count = 0;
+	list_for_each_entry_safe(sh, next, &conf->r5c_full_stripe_list, lru) {
+		r5c_flush_stripe(conf, sh);
+		count++;
+	}
+
+	if (count >= num)
+		return;
+	list_for_each_entry_safe(sh, next,
+				 &conf->r5c_partial_stripe_list, lru) {
+		r5c_flush_stripe(conf, sh);
+		if (++count >= num)
+			break;
+	}
+}
+
+static void r5c_do_reclaim(struct r5conf *conf)
+{
+	struct r5l_log *log = conf->log;
+	struct stripe_head *sh;
+	int count = 0;
+	unsigned long flags;
+	int total_cached;
+	int stripes_to_flush;
+
+	if (!r5c_is_writeback(log))
+		return;
+
+	total_cached = atomic_read(&conf->r5c_cached_partial_stripes) +
+		atomic_read(&conf->r5c_cached_full_stripes);
+
+	if (total_cached > conf->min_nr_stripes * 3 / 4 ||
+	    atomic_read(&conf->empty_inactive_list_nr) > 0)
+		/*
+		 * if stripe cache pressure high, flush all full stripes and
+		 * some partial stripes
+		 */
+		stripes_to_flush = R5C_RECLAIM_STRIPE_GROUP;
+	else if (total_cached > conf->min_nr_stripes * 1 / 2 ||
+		 atomic_read(&conf->r5c_cached_full_stripes) >
+		 R5C_FULL_STRIPE_FLUSH_BATCH)
+		/*
+		 * if stripe cache pressure moderate, or if there is many full
+		 * stripes,flush all full stripes
+		 */
+		stripes_to_flush = 0;
+	else
+		/* no need to flush */
+		stripes_to_flush = -1;
+
+	if (stripes_to_flush >= 0) {
+		spin_lock_irqsave(&conf->device_lock, flags);
+		r5c_flush_cache(conf, stripes_to_flush);
+		spin_unlock_irqrestore(&conf->device_lock, flags);
+	}
+
+	/* if log space is tight, flush stripes on stripe_in_cache_list */
+	if (test_bit(R5C_LOG_TIGHT, &conf->cache_state)) {
+		spin_lock(&log->stripe_in_cache_lock);
+		spin_lock(&conf->device_lock);
+		list_for_each_entry(sh, &log->stripe_in_cache_list, r5c) {
+			/*
+			 * stripes on stripe_in_cache_list could be in any
+			 * state of the stripe_cache state machine. In this
+			 * case, we only want to flush stripe on
+			 * r5c_cached_full/partial_stripes. The following
+			 * condition makes sure the stripe is on one of the
+			 * two lists.
+			 */
+			if (!list_empty(&sh->lru) &&
+			    !test_bit(STRIPE_HANDLE, &sh->state) &&
+			    atomic_read(&sh->count) == 0) {
+				r5c_flush_stripe(conf, sh);
+			}
+			if (count++ >= R5C_RECLAIM_STRIPE_GROUP)
+				break;
+		}
+		spin_unlock(&conf->device_lock);
+		spin_unlock(&log->stripe_in_cache_lock);
+	}
+	md_wakeup_thread(conf->mddev->thread);
+}
+
 static void r5l_do_reclaim(struct r5l_log *log)
 {
+	struct r5conf *conf = log->rdev->mddev->private;
 	sector_t reclaim_target = xchg(&log->reclaim_target, 0);
 	sector_t reclaimable;
 	sector_t next_checkpoint;
-	u64 next_cp_seq;
+	bool write_super;
 
 	spin_lock_irq(&log->io_list_lock);
+	write_super = r5l_reclaimable_space(log) > log->max_free_space ||
+		reclaim_target != 0;
 	/*
 	 * move proper io_unit to reclaim list. We should not change the order.
 	 * reclaimable/unreclaimable io_unit can be mixed in the list, we
@@ -935,12 +1253,11 @@ static void r5l_do_reclaim(struct r5l_log *log)
 				    log->io_list_lock);
 	}
 
-	next_checkpoint = log->next_checkpoint;
-	next_cp_seq = log->next_cp_seq;
+	next_checkpoint = r5c_calculate_new_cp(conf);
 	spin_unlock_irq(&log->io_list_lock);
 
 	BUG_ON(reclaimable < 0);
-	if (reclaimable == 0)
+	if (reclaimable == 0 || !write_super)
 		return;
 
 	/*
@@ -952,7 +1269,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
 
 	mutex_lock(&log->io_mutex);
 	log->last_checkpoint = next_checkpoint;
-	log->last_cp_seq = next_cp_seq;
+	r5c_update_log_state(log);
 	mutex_unlock(&log->io_mutex);
 
 	r5l_run_no_space_stripes(log);
@@ -966,14 +1283,17 @@ static void r5l_reclaim_thread(struct md_thread *thread)
 
 	if (!log)
 		return;
+	r5c_do_reclaim(conf);
 	r5l_do_reclaim(log);
 }
 
-static void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
+void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
 {
 	unsigned long target;
 	unsigned long new = (unsigned long)space; /* overflow in theory */
 
+	if (!log)
+		return;
 	do {
 		target = log->reclaim_target;
 		if (new < target)
@@ -997,11 +1317,12 @@ void r5l_quiesce(struct r5l_log *log, int state)
 			return;
 		log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
 					log->rdev->mddev, "reclaim");
+		log->reclaim_thread->timeout = R5C_RECLAIM_WAKEUP_INTERVAL;
 	} else if (state == 1) {
 		/* make sure r5l_write_super_and_discard_space exits */
 		mddev = log->rdev->mddev;
 		wake_up(&mddev->sb_wait);
-		r5l_wake_reclaim(log, -1L);
+		r5l_wake_reclaim(log, MaxSector);
 		md_unregister_thread(&log->reclaim_thread);
 		r5l_do_reclaim(log);
 	}
@@ -1360,12 +1681,22 @@ void r5c_finish_stripe_write_out(struct r5conf *conf,
 
 	if (do_wakeup)
 		wake_up(&conf->wait_for_overlap);
+
+	if (conf->log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH)
+		return;
+
+	spin_lock(&conf->log->stripe_in_cache_lock);
+	list_del_init(&sh->r5c);
+	spin_unlock(&conf->log->stripe_in_cache_lock);
+	sh->log_start = MaxSector;
+	atomic_dec(&conf->log->stripe_in_cache_count);
 }
 
 int
 r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
 	       struct stripe_head_state *s)
 {
+	struct r5conf *conf = sh->raid_conf;
 	int pages = 0;
 	int reserve;
 	int i;
@@ -1396,12 +1727,15 @@ r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
 	mutex_lock(&log->io_mutex);
 	/* meta + data */
 	reserve = (1 + pages) << (PAGE_SHIFT - 9);
-	if (!r5l_has_free_space(log, reserve)) {
-		spin_lock(&log->no_space_stripes_lock);
-		list_add_tail(&sh->log_list, &log->no_space_stripes);
-		spin_unlock(&log->no_space_stripes_lock);
 
-		r5l_wake_reclaim(log, reserve);
+	if (test_bit(R5C_LOG_CRITICAL, &conf->cache_state) &&
+	    sh->log_start == MaxSector)
+		r5l_add_no_space_stripe(log, sh);
+	else if (!r5l_has_free_space(log, reserve)) {
+		if (sh->log_start == log->last_checkpoint)
+			BUG();
+		else
+			r5l_add_no_space_stripe(log, sh);
 	} else {
 		ret = r5l_log_stripe(log, sh, pages, 0);
 		if (ret) {
@@ -1415,7 +1749,6 @@ r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
 	return 0;
 }
 
-
 static int r5l_load_log(struct r5l_log *log)
 {
 	struct md_rdev *rdev = log->rdev;
@@ -1475,6 +1808,9 @@ static int r5l_load_log(struct r5l_log *log)
 		log->max_free_space = RECLAIM_MAX_FREE_SPACE;
 	log->last_checkpoint = cp;
 	log->next_checkpoint = cp;
+	mutex_lock(&log->io_mutex);
+	r5c_update_log_state(log);
+	mutex_unlock(&log->io_mutex);
 
 	__free_page(page);
 
@@ -1546,6 +1882,8 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 						 log->rdev->mddev, "reclaim");
 	if (!log->reclaim_thread)
 		goto reclaim_thread;
+	log->reclaim_thread->timeout = R5C_RECLAIM_WAKEUP_INTERVAL;
+
 	init_waitqueue_head(&log->iounit_wait);
 
 	INIT_LIST_HEAD(&log->no_mem_stripes);
@@ -1554,6 +1892,9 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	spin_lock_init(&log->no_space_stripes_lock);
 
 	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
+	INIT_LIST_HEAD(&log->stripe_in_cache_list);
+	spin_lock_init(&log->stripe_in_cache_lock);
+	atomic_set(&log->stripe_in_cache_count, 0);
 
 	if (r5l_load_log(log))
 		goto error;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ad97103..49414f9 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -232,6 +232,16 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
 			if (test_bit(R5_InJournal, &sh->dev[i].flags))
 				injournal++;
 		}
+	/*
+	 * When quiesce in r5c write back, set STRIPE_HANDLE for stripes with
+	 * data in journal, so they are not released to cached lists
+	 */
+	if (conf->quiesce && r5c_is_writeback(conf->log) &&
+	    !test_bit(STRIPE_HANDLE, &sh->state) && injournal != 0) {
+		if (!test_bit(STRIPE_R5C_WRITE_OUT, &sh->state))
+			r5c_make_stripe_write_out(sh);
+		set_bit(STRIPE_HANDLE, &sh->state);
+	}
 
 	if (test_bit(STRIPE_HANDLE, &sh->state)) {
 		if (test_bit(STRIPE_DELAYED, &sh->state) &&
@@ -272,6 +282,7 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
 					if (test_and_clear_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state))
 						atomic_dec(&conf->r5c_cached_partial_stripes);
 					list_add_tail(&sh->lru, &conf->r5c_full_stripe_list);
+					r5c_check_cached_full_stripe(conf);
 				} else {
 					/* partial stripe */
 					if (!test_and_set_bit(STRIPE_R5C_PARTIAL_STRIPE,
@@ -643,9 +654,12 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
 			}
 			if (noblock && sh == NULL)
 				break;
+
+			r5c_check_stripe_cache_usage(conf);
 			if (!sh) {
 				set_bit(R5_INACTIVE_BLOCKED,
 					&conf->cache_state);
+				r5l_wake_reclaim(conf->log, 0);
 				wait_event_lock_irq(
 					conf->wait_for_stripe,
 					!list_empty(conf->inactive_list + hash) &&
@@ -1997,7 +2011,9 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
 		spin_lock_init(&sh->batch_lock);
 		INIT_LIST_HEAD(&sh->batch_list);
 		INIT_LIST_HEAD(&sh->lru);
+		INIT_LIST_HEAD(&sh->r5c);
 		atomic_set(&sh->count, 1);
+		sh->log_start = MaxSector;
 		for (i = 0; i < disks; i++) {
 			struct r5dev *dev = &sh->dev[i];
 
@@ -4746,6 +4762,10 @@ static int raid5_congested(struct mddev *mddev, int bits)
 
 	if (test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state))
 		return 1;
+
+	/* Also checks whether there is pressure on r5cache log space */
+	if (test_bit(R5C_LOG_TIGHT, &conf->cache_state))
+		return 1;
 	if (conf->quiesce)
 		return 1;
 	if (atomic_read(&conf->empty_inactive_list_nr))
@@ -7648,6 +7668,7 @@ static void raid5_quiesce(struct mddev *mddev, int state)
 		/* '2' tells resync/reshape to pause so that all
 		 * active stripes can drain
 		 */
+		r5c_flush_cache(conf, INT_MAX);
 		conf->quiesce = 2;
 		wait_event_cmd(conf->wait_for_quiescent,
 				    atomic_read(&conf->active_stripes) == 0 &&
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index a5d907a..8a913f9 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -226,6 +226,8 @@ struct stripe_head {
 
 	struct r5l_io_unit	*log_io;
 	struct list_head	log_list;
+	sector_t		log_start; /* first meta block on the journal */
+	struct list_head	r5c; /* for r5c_cache->stripe_in_cache */
 	/**
 	 * struct stripe_operations
 	 * @target - STRIPE_OP_COMPUTE_BLK target
@@ -520,6 +522,27 @@ struct r5worker_group {
 	int stripes_cnt;
 };
 
+enum r5_cache_state {
+	R5_INACTIVE_BLOCKED,	/* release of inactive stripes blocked,
+				 * waiting for 25% to be free
+				 */
+	R5_ALLOC_MORE,		/* It might help to allocate another
+				 * stripe.
+				 */
+	R5_DID_ALLOC,		/* A stripe was allocated, don't allocate
+				 * more until at least one has been
+				 * released.  This avoids flooding
+				 * the cache.
+				 */
+	R5C_LOG_TIGHT,		/* log device space tight, need to
+				 * prioritize stripes at last_checkpoint
+				 */
+	R5C_LOG_CRITICAL,	/* log device is running out of space,
+				 * only process stripes that are already
+				 * occupying the log
+				 */
+};
+
 struct r5conf {
 	struct hlist_head	*stripe_hashtbl;
 	/* only protect corresponding hash list and inactive_list */
@@ -619,17 +642,6 @@ struct r5conf {
 	wait_queue_head_t	wait_for_stripe;
 	wait_queue_head_t	wait_for_overlap;
 	unsigned long		cache_state;
-#define R5_INACTIVE_BLOCKED	1	/* release of inactive stripes blocked,
-					 * waiting for 25% to be free
-					 */
-#define R5_ALLOC_MORE		2	/* It might help to allocate another
-					 * stripe.
-					 */
-#define R5_DID_ALLOC		4	/* A stripe was allocated, don't allocate
-					 * more until at least one has been
-					 * released.  This avoids flooding
-					 * the cache.
-					 */
 	struct shrinker		shrinker;
 	int			pool_size; /* number of disks in stripeheads in pool */
 	spinlock_t		device_lock;
@@ -733,8 +745,13 @@ r5c_handle_stripe_dirtying(struct r5conf *conf, struct stripe_head *sh,
 			   struct stripe_head_state *s, int disks);
 extern void
 r5c_finish_stripe_write_out(struct r5conf *conf, struct stripe_head *sh);
+extern void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
 extern void r5c_handle_cached_data_endio(struct r5conf *conf,
 	struct stripe_head *sh, int disks, struct bio_list *return_bi);
 extern int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
 			  struct stripe_head_state *s);
+extern void r5c_make_stripe_write_out(struct stripe_head *sh);
+extern void r5c_flush_cache(struct r5conf *conf, int num);
+extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
+extern void r5c_check_cached_full_stripe(struct r5conf *conf);
 #endif
-- 
2.9.3


^ permalink raw reply related

* [PATCH v6 04/11] md/r5cache: caching mode of r5cache
From: Song Liu @ 2016-11-10 20:46 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu
In-Reply-To: <20161110204623.3484694-1-songliubraving@fb.com>

As described in previous patch, write back cache operates in two
modes: caching and writing-out. The caching mode works as:
1. write data to journal
   (r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
   (r5c_handle_data_cached, r5c_return_dev_pending_writes).

Then the writing-out path is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks

This patch implements caching mode. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.

Writing-out mode of the cache is implemented in the next patch.

With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.

This patch adds 2 flags to stripe_head.state:
 - STRIPE_R5C_PARTIAL_STRIPE,
 - STRIPE_R5C_FULL_STRIPE,

Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".

For RMW, the code allocates an extra page for each data block
being updated.  This is stored in r5dev->page and the old data
is read into it.  Then the prexor calculation subtracts ->page
from the parity block, and the reconstruct calculation adds the
->orig_page data back into the parity block.

r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.

There are some known limitations of the cache implementation:

1. Write cache only covers full page writes (R5_OVERWRITE). Writes
   of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
   writes for the same stripe have to wait. This can be improved by
   moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
   is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
   the log, so recovery code has to replay more than necessary data
   (sometimes all the log from last_checkpoint). This reduces
   availability of the array.

This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 192 +++++++++++++++++++++++++++++++++++++++++++++--
 drivers/md/raid5.c       | 158 +++++++++++++++++++++++++++++++++-----
 drivers/md/raid5.h       |  17 +++++
 3 files changed, 342 insertions(+), 25 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 5876727..99c64d8 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -20,6 +20,7 @@
 #include <linux/random.h>
 #include "md.h"
 #include "raid5.h"
+#include "bitmap.h"
 
 /*
  * metadata/data stored in disk with 4k size unit (a block) regardless
@@ -218,6 +219,43 @@ static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
 	io->state = state;
 }
 
+static void
+r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev,
+			      struct bio_list *return_bi)
+{
+	struct bio *wbi, *wbi2;
+
+	wbi = dev->written;
+	dev->written = NULL;
+	while (wbi && wbi->bi_iter.bi_sector <
+	       dev->sector + STRIPE_SECTORS) {
+		wbi2 = r5_next_bio(wbi, dev->sector);
+		if (!raid5_dec_bi_active_stripes(wbi)) {
+			md_write_end(conf->mddev);
+			bio_list_add(return_bi, wbi);
+		}
+		wbi = wbi2;
+	}
+}
+
+void r5c_handle_cached_data_endio(struct r5conf *conf,
+	  struct stripe_head *sh, int disks, struct bio_list *return_bi)
+{
+	int i;
+
+	for (i = sh->disks; i--; ) {
+		if (sh->dev[i].written) {
+			set_bit(R5_UPTODATE, &sh->dev[i].flags);
+			r5c_return_dev_pending_writes(conf, &sh->dev[i],
+						      return_bi);
+			bitmap_endwrite(conf->mddev->bitmap, sh->sector,
+					STRIPE_SECTORS,
+					!test_bit(STRIPE_DEGRADED, &sh->state),
+					0);
+		}
+	}
+}
+
 /*
  * Put the stripe into writing-out mode by setting STRIPE_R5C_WRITE_OUT.
  *
@@ -234,6 +272,47 @@ static void r5c_make_stripe_write_out(struct stripe_head *sh)
 		return;
 	WARN_ON(test_bit(STRIPE_R5C_WRITE_OUT, &sh->state));
 	set_bit(STRIPE_R5C_WRITE_OUT, &sh->state);
+
+	if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH)
+		return;
+
+	if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+		atomic_inc(&conf->preread_active_stripes);
+
+	if (test_and_clear_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state)) {
+		BUG_ON(atomic_read(&conf->r5c_cached_partial_stripes) == 0);
+		atomic_dec(&conf->r5c_cached_partial_stripes);
+	}
+
+	if (test_and_clear_bit(STRIPE_R5C_FULL_STRIPE, &sh->state)) {
+		BUG_ON(atomic_read(&conf->r5c_cached_full_stripes) == 0);
+		atomic_dec(&conf->r5c_cached_full_stripes);
+	}
+}
+
+static void r5c_handle_data_cached(struct stripe_head *sh)
+{
+	int i;
+
+	for (i = sh->disks; i--; )
+		if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags)) {
+			set_bit(R5_InJournal, &sh->dev[i].flags);
+			clear_bit(R5_LOCKED, &sh->dev[i].flags);
+		}
+	clear_bit(STRIPE_LOG_TRAPPED, &sh->state);
+}
+
+/*
+ * this journal write must contain full parity,
+ * it may also contain some data pages
+ */
+static void r5c_handle_parity_cached(struct stripe_head *sh)
+{
+	int i;
+
+	for (i = sh->disks; i--; )
+		if (test_bit(R5_InJournal, &sh->dev[i].flags))
+			set_bit(R5_Wantwrite, &sh->dev[i].flags);
 }
 
 /*
@@ -253,8 +332,11 @@ static void r5c_finish_cache_stripe(struct stripe_head *sh)
 		 * out together.
 		 */
 		set_bit(R5_InJournal, &sh->dev[sh->pd_idx].flags);
+	} else if (test_bit(STRIPE_R5C_WRITE_OUT, &sh->state)) {
+		r5c_handle_parity_cached(sh);
+		set_bit(R5_InJournal, &sh->dev[sh->pd_idx].flags);
 	} else
-		BUG(); /* write back logic in next patch */
+		r5c_handle_data_cached(sh);
 }
 
 static void r5l_io_run_stripes(struct r5l_io_unit *io)
@@ -494,7 +576,8 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
 	io = log->current_io;
 
 	for (i = 0; i < sh->disks; i++) {
-		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
+		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags) ||
+		    test_bit(R5_InJournal, &sh->dev[i].flags))
 			continue;
 		if (i == sh->pd_idx || i == sh->qd_idx)
 			continue;
@@ -555,6 +638,10 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
 
 		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
 			continue;
+
+		if (test_bit(R5_InJournal, &sh->dev[i].flags))
+			continue;
+
 		write_disks++;
 		/* checksum is already calculated in last run */
 		if (test_bit(STRIPE_LOG_TRAPPED, &sh->state))
@@ -820,7 +907,6 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
 	}
 }
 
-
 static void r5l_do_reclaim(struct r5l_log *log)
 {
 	sector_t reclaim_target = xchg(&log->reclaim_target, 0);
@@ -1201,6 +1287,9 @@ int r5c_handle_stripe_dirtying(struct r5conf *conf,
 			       int disks)
 {
 	struct r5l_log *log = conf->log;
+	int i;
+	struct r5dev *dev;
+	int to_cache = 0;
 
 	if (!log || test_bit(STRIPE_R5C_WRITE_OUT, &sh->state))
 		return -EAGAIN;
@@ -1210,7 +1299,32 @@ int r5c_handle_stripe_dirtying(struct r5conf *conf,
 		r5c_make_stripe_write_out(sh);
 		return -EAGAIN;
 	}
-	BUG();  /* write back logic in next commit */
+
+	for (i = disks; i--; ) {
+		dev = &sh->dev[i];
+		/* if non-overwrite, use the writing-out path (write through) */
+		if (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags) &&
+		    !test_bit(R5_InJournal, &dev->flags)) {
+			r5c_make_stripe_write_out(sh);
+			return -EAGAIN;
+		}
+	}
+
+	for (i = disks; i--; ) {
+		dev = &sh->dev[i];
+		if (dev->towrite) {
+			set_bit(R5_Wantwrite, &dev->flags);
+			set_bit(R5_Wantdrain, &dev->flags);
+			set_bit(R5_LOCKED, &dev->flags);
+			to_cache++;
+		}
+	}
+
+	if (to_cache) {
+		set_bit(STRIPE_OP_BIODRAIN, &s->ops_request);
+		set_bit(STRIPE_LOG_TRAPPED, &sh->state);
+	}
+
 	return 0;
 }
 
@@ -1221,6 +1335,9 @@ int r5c_handle_stripe_dirtying(struct r5conf *conf,
 void r5c_finish_stripe_write_out(struct r5conf *conf,
 				 struct stripe_head *sh)
 {
+	int i;
+	int do_wakeup = 0;
+
 	if (!test_bit(R5_InJournal, &sh->dev[sh->pd_idx].flags))
 		return;
 
@@ -1230,7 +1347,72 @@ void r5c_finish_stripe_write_out(struct r5conf *conf,
 
 	if (conf->log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH)
 		return;
-	BUG();  /* write-back logic in coming patches */
+
+	for (i = sh->disks; i--; ) {
+		clear_bit(R5_InJournal, &sh->dev[i].flags);
+		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+			do_wakeup = 1;
+	}
+
+	if (test_and_clear_bit(STRIPE_FULL_WRITE, &sh->state))
+		if (atomic_dec_and_test(&conf->pending_full_writes))
+			md_wakeup_thread(conf->mddev->thread);
+
+	if (do_wakeup)
+		wake_up(&conf->wait_for_overlap);
+}
+
+int
+r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
+	       struct stripe_head_state *s)
+{
+	int pages = 0;
+	int reserve;
+	int i;
+	int ret = 0;
+
+	BUG_ON(!log);
+
+	for (i = 0; i < sh->disks; i++) {
+		void *addr;
+
+		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
+			continue;
+		addr = kmap_atomic(sh->dev[i].page);
+		sh->dev[i].log_checksum = crc32c_le(log->uuid_checksum,
+						    addr, PAGE_SIZE);
+		kunmap_atomic(addr);
+		pages++;
+	}
+	WARN_ON(pages == 0);
+
+	/*
+	 * The stripe must enter state machine again to call endio, so
+	 * don't delay.
+	 */
+	clear_bit(STRIPE_DELAYED, &sh->state);
+	atomic_inc(&sh->count);
+
+	mutex_lock(&log->io_mutex);
+	/* meta + data */
+	reserve = (1 + pages) << (PAGE_SHIFT - 9);
+	if (!r5l_has_free_space(log, reserve)) {
+		spin_lock(&log->no_space_stripes_lock);
+		list_add_tail(&sh->log_list, &log->no_space_stripes);
+		spin_unlock(&log->no_space_stripes_lock);
+
+		r5l_wake_reclaim(log, reserve);
+	} else {
+		ret = r5l_log_stripe(log, sh, pages, 0);
+		if (ret) {
+			spin_lock_irq(&log->io_list_lock);
+			list_add_tail(&sh->log_list, &log->no_mem_stripes);
+			spin_unlock_irq(&log->io_list_lock);
+		}
+	}
+
+	mutex_unlock(&log->io_mutex);
+	return 0;
 }
 
 
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index abb2c58..ad97103 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -218,8 +218,21 @@ static void raid5_wakeup_stripe_thread(struct stripe_head *sh)
 static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
 			      struct list_head *temp_inactive_list)
 {
+	int i;
+	int uptodate = 0;	/* number of data pages with R5_UPTODATE */
+	int injournal = 0;	/* number of date pages with R5_InJournal */
+
 	BUG_ON(!list_empty(&sh->lru));
 	BUG_ON(atomic_read(&conf->active_stripes)==0);
+
+	if (r5c_is_writeback(conf->log))
+		for (i = sh->disks; i--; ) {
+			if (test_bit(R5_UPTODATE, &sh->dev[i].flags))
+				uptodate++;
+			if (test_bit(R5_InJournal, &sh->dev[i].flags))
+				injournal++;
+		}
+
 	if (test_bit(STRIPE_HANDLE, &sh->state)) {
 		if (test_bit(STRIPE_DELAYED, &sh->state) &&
 		    !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
@@ -245,8 +258,29 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
 			    < IO_THRESHOLD)
 				md_wakeup_thread(conf->mddev->thread);
 		atomic_dec(&conf->active_stripes);
-		if (!test_bit(STRIPE_EXPANDING, &sh->state))
-			list_add_tail(&sh->lru, temp_inactive_list);
+		if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
+			if (!r5c_is_writeback(conf->log))
+				list_add_tail(&sh->lru, temp_inactive_list);
+			else {
+				WARN_ON(test_bit(R5_InJournal, &sh->dev[sh->pd_idx].flags));
+				if (injournal == 0)
+					list_add_tail(&sh->lru, temp_inactive_list);
+				else if (uptodate == conf->raid_disks - conf->max_degraded) {
+					/* full stripe */
+					if (!test_and_set_bit(STRIPE_R5C_FULL_STRIPE, &sh->state))
+						atomic_inc(&conf->r5c_cached_full_stripes);
+					if (test_and_clear_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state))
+						atomic_dec(&conf->r5c_cached_partial_stripes);
+					list_add_tail(&sh->lru, &conf->r5c_full_stripe_list);
+				} else {
+					/* partial stripe */
+					if (!test_and_set_bit(STRIPE_R5C_PARTIAL_STRIPE,
+							      &sh->state))
+						atomic_inc(&conf->r5c_cached_partial_stripes);
+					list_add_tail(&sh->lru, &conf->r5c_partial_stripe_list);
+				}
+			}
+		}
 	}
 }
 
@@ -830,8 +864,18 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 
 	might_sleep();
 
-	if (r5l_write_stripe(conf->log, sh) == 0)
-		return;
+	if (test_bit(STRIPE_R5C_WRITE_OUT, &sh->state)) {
+		/* writing out mode */
+		if (r5l_write_stripe(conf->log, sh) == 0)
+			return;
+	} else {
+		/* caching mode */
+		if (test_bit(STRIPE_LOG_TRAPPED, &sh->state)) {
+			r5c_cache_data(conf->log, sh, s);
+			return;
+		}
+	}
+
 	for (i = disks; i--; ) {
 		int op, op_flags = 0;
 		int replace_only = 0;
@@ -1044,7 +1088,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 static struct dma_async_tx_descriptor *
 async_copy_data(int frombio, struct bio *bio, struct page **page,
 	sector_t sector, struct dma_async_tx_descriptor *tx,
-	struct stripe_head *sh)
+	struct stripe_head *sh, int no_skipcopy)
 {
 	struct bio_vec bvl;
 	struct bvec_iter iter;
@@ -1084,7 +1128,8 @@ async_copy_data(int frombio, struct bio *bio, struct page **page,
 			if (frombio) {
 				if (sh->raid_conf->skip_copy &&
 				    b_offset == 0 && page_offset == 0 &&
-				    clen == STRIPE_SIZE)
+				    clen == STRIPE_SIZE &&
+				    !no_skipcopy)
 					*page = bio_page;
 				else
 					tx = async_memcpy(*page, bio_page, page_offset,
@@ -1166,7 +1211,7 @@ static void ops_run_biofill(struct stripe_head *sh)
 			while (rbi && rbi->bi_iter.bi_sector <
 				dev->sector + STRIPE_SECTORS) {
 				tx = async_copy_data(0, rbi, &dev->page,
-					dev->sector, tx, sh);
+						     dev->sector, tx, sh, 0);
 				rbi = r5_next_bio(rbi, dev->sector);
 			}
 		}
@@ -1293,7 +1338,8 @@ static int set_syndrome_sources(struct page **srcs,
 		if (i == sh->qd_idx || i == sh->pd_idx ||
 		    (srctype == SYNDROME_SRC_ALL) ||
 		    (srctype == SYNDROME_SRC_WANT_DRAIN &&
-		     test_bit(R5_Wantdrain, &dev->flags)) ||
+		     (test_bit(R5_Wantdrain, &dev->flags) ||
+		      test_bit(R5_InJournal, &dev->flags))) ||
 		    (srctype == SYNDROME_SRC_WRITTEN &&
 		     dev->written))
 			srcs[slot] = sh->dev[i].page;
@@ -1472,9 +1518,25 @@ ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu)
 static void ops_complete_prexor(void *stripe_head_ref)
 {
 	struct stripe_head *sh = stripe_head_ref;
+	int i;
 
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
+
+	if (!r5c_is_writeback(sh->raid_conf->log))
+		return;
+
+	/*
+	 * raid5-cache write back uses orig_page during prexor. after prexor,
+	 * it is time to free orig_page
+	 */
+	for (i = sh->disks; i--; )
+		if (sh->dev[i].page != sh->dev[i].orig_page) {
+			struct page *p = sh->dev[i].page;
+
+			sh->dev[i].page = sh->dev[i].orig_page;
+			put_page(p);
+		}
 }
 
 static struct dma_async_tx_descriptor *
@@ -1496,7 +1558,8 @@ ops_run_prexor5(struct stripe_head *sh, struct raid5_percpu *percpu,
 	for (i = disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
 		/* Only process blocks that are known to be uptodate */
-		if (test_bit(R5_Wantdrain, &dev->flags))
+		if (test_bit(R5_Wantdrain, &dev->flags) ||
+		    test_bit(R5_InJournal, &dev->flags))
 			xor_srcs[count++] = dev->page;
 	}
 
@@ -1530,6 +1593,7 @@ ops_run_prexor6(struct stripe_head *sh, struct raid5_percpu *percpu,
 static struct dma_async_tx_descriptor *
 ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 {
+	struct r5conf *conf = sh->raid_conf;
 	int disks = sh->disks;
 	int i;
 	struct stripe_head *head_sh = sh;
@@ -1547,6 +1611,7 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 
 again:
 			dev = &sh->dev[i];
+			clear_bit(R5_InJournal, &dev->flags);
 			spin_lock_irq(&sh->stripe_lock);
 			chosen = dev->towrite;
 			dev->towrite = NULL;
@@ -1566,8 +1631,10 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 					set_bit(R5_Discard, &dev->flags);
 				else {
 					tx = async_copy_data(1, wbi, &dev->page,
-						dev->sector, tx, sh);
-					if (dev->page != dev->orig_page) {
+							     dev->sector, tx, sh,
+							     r5c_is_writeback(conf->log));
+					if (dev->page != dev->orig_page &&
+					    !r5c_is_writeback(conf->log)) {
 						set_bit(R5_SkipCopy, &dev->flags);
 						clear_bit(R5_UPTODATE, &dev->flags);
 						clear_bit(R5_OVERWRITE, &dev->flags);
@@ -1675,7 +1742,8 @@ ops_run_reconstruct5(struct stripe_head *sh, struct raid5_percpu *percpu,
 		xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
-			if (head_sh->dev[i].written)
+			if (head_sh->dev[i].written ||
+			    test_bit(R5_InJournal, &head_sh->dev[i].flags))
 				xor_srcs[count++] = dev->page;
 		}
 	} else {
@@ -2800,12 +2868,29 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
 
+			/*
+			 * Initially, handle_stripe_dirtying decided to run rmw
+			 * and allocates extra page for prexor. However, rcw is
+			 * cheaper later on. We need to free the extra page
+			 * now, because we won't be able to do that in
+			 * ops_complete_prexor().
+			 */
+			if (sh->dev[i].page != sh->dev[i].orig_page) {
+				struct page *p = sh->dev[i].page;
+
+				sh->dev[i].page = sh->dev[i].orig_page;
+				put_page(p);
+			}
+
 			if (dev->towrite) {
 				set_bit(R5_LOCKED, &dev->flags);
 				set_bit(R5_Wantdrain, &dev->flags);
 				if (!expand)
 					clear_bit(R5_UPTODATE, &dev->flags);
 				s->locked++;
+			} else if (test_bit(R5_InJournal, &dev->flags)) {
+				set_bit(R5_LOCKED, &dev->flags);
+				s->locked++;
 			}
 		}
 		/* if we are not expanding this is a proper write request, and
@@ -2845,6 +2930,9 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 				set_bit(R5_LOCKED, &dev->flags);
 				clear_bit(R5_UPTODATE, &dev->flags);
 				s->locked++;
+			} else if (test_bit(R5_InJournal, &dev->flags)) {
+				set_bit(R5_LOCKED, &dev->flags);
+				s->locked++;
 			}
 		}
 		if (!s->locked)
@@ -3519,9 +3607,12 @@ static void handle_stripe_dirtying(struct r5conf *conf,
 	} else for (i = disks; i--; ) {
 		/* would I have to read this buffer for read_modify_write */
 		struct r5dev *dev = &sh->dev[i];
-		if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx) &&
+		if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx ||
+		     test_bit(R5_InJournal, &dev->flags)) &&
 		    !test_bit(R5_LOCKED, &dev->flags) &&
-		    !(test_bit(R5_UPTODATE, &dev->flags) ||
+		    !((test_bit(R5_UPTODATE, &dev->flags) &&
+		       (!test_bit(R5_InJournal, &dev->flags) ||
+			dev->page != dev->orig_page)) ||
 		      test_bit(R5_Wantcompute, &dev->flags))) {
 			if (test_bit(R5_Insync, &dev->flags))
 				rmw++;
@@ -3533,13 +3624,15 @@ static void handle_stripe_dirtying(struct r5conf *conf,
 		    i != sh->pd_idx && i != sh->qd_idx &&
 		    !test_bit(R5_LOCKED, &dev->flags) &&
 		    !(test_bit(R5_UPTODATE, &dev->flags) ||
-		    test_bit(R5_Wantcompute, &dev->flags))) {
+		      test_bit(R5_InJournal, &dev->flags) ||
+		      test_bit(R5_Wantcompute, &dev->flags))) {
 			if (test_bit(R5_Insync, &dev->flags))
 				rcw++;
 			else
 				rcw += 2*disks;
 		}
 	}
+
 	pr_debug("for sector %llu, rmw=%d rcw=%d\n",
 		(unsigned long long)sh->sector, rmw, rcw);
 	set_bit(STRIPE_HANDLE, &sh->state);
@@ -3551,10 +3644,23 @@ static void handle_stripe_dirtying(struct r5conf *conf,
 					  (unsigned long long)sh->sector, rmw);
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
-			if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx) &&
+			if (test_bit(R5_InJournal, &dev->flags) &&
+			    dev->page == dev->orig_page) {
+				/* alloc page for prexor */
+				dev->page = alloc_page(GFP_NOIO);
+
+				/* will handle failure in a later patch*/
+				BUG_ON(!dev->page);
+			}
+
+			if ((dev->towrite ||
+			     i == sh->pd_idx || i == sh->qd_idx ||
+			     test_bit(R5_InJournal, &dev->flags)) &&
 			    !test_bit(R5_LOCKED, &dev->flags) &&
-			    !(test_bit(R5_UPTODATE, &dev->flags) ||
-			    test_bit(R5_Wantcompute, &dev->flags)) &&
+			    !((test_bit(R5_UPTODATE, &dev->flags) &&
+			       (!test_bit(R5_InJournal, &dev->flags) ||
+				dev->page != dev->orig_page)) ||
+			      test_bit(R5_Wantcompute, &dev->flags)) &&
 			    test_bit(R5_Insync, &dev->flags)) {
 				if (test_bit(STRIPE_PREREAD_ACTIVE,
 					     &sh->state)) {
@@ -3580,6 +3686,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
 			    i != sh->pd_idx && i != sh->qd_idx &&
 			    !test_bit(R5_LOCKED, &dev->flags) &&
 			    !(test_bit(R5_UPTODATE, &dev->flags) ||
+			      test_bit(R5_InJournal, &dev->flags) ||
 			      test_bit(R5_Wantcompute, &dev->flags))) {
 				rcw++;
 				if (test_bit(R5_Insync, &dev->flags) &&
@@ -3619,7 +3726,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
 	 */
 	if ((s->req_compute || !test_bit(STRIPE_COMPUTE_RUN, &sh->state)) &&
 	    (s->locked == 0 && (rcw == 0 || rmw == 0) &&
-	    !test_bit(STRIPE_BIT_DELAY, &sh->state)))
+	     !test_bit(STRIPE_BIT_DELAY, &sh->state)))
 		schedule_reconstruction(sh, s, rcw == 0, 0);
 }
 
@@ -4110,6 +4217,8 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
 			if (rdev && !test_bit(Faulty, &rdev->flags))
 				do_recovery = 1;
 		}
+		if (test_bit(R5_InJournal, &dev->flags) && dev->written)
+			s->just_cached++;
 	}
 	if (test_bit(STRIPE_SYNCING, &sh->state)) {
 		/* If there is a failed device being replaced,
@@ -4338,7 +4447,7 @@ static void handle_stripe(struct stripe_head *sh)
 			struct r5dev *dev = &sh->dev[i];
 			if (test_bit(R5_LOCKED, &dev->flags) &&
 				(i == sh->pd_idx || i == sh->qd_idx ||
-				 dev->written)) {
+				 dev->written || test_bit(R5_InJournal, &dev->flags))) {
 				pr_debug("Writing block %d\n", i);
 				set_bit(R5_Wantwrite, &dev->flags);
 				if (prexor)
@@ -4378,6 +4487,10 @@ static void handle_stripe(struct stripe_head *sh)
 				 test_bit(R5_Discard, &qdev->flags))))))
 		handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
 
+	if (s.just_cached)
+		r5c_handle_cached_data_endio(conf, sh, disks, &s.return_bi);
+	r5l_stripe_write_finished(sh);
+
 	/* Now we might consider reading some blocks, either to check/generate
 	 * parity, or to satisfy requests
 	 * or to load a block that is being partially written.
@@ -6476,6 +6589,11 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 	for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
 		INIT_LIST_HEAD(conf->temp_inactive_list + i);
 
+	atomic_set(&conf->r5c_cached_full_stripes, 0);
+	INIT_LIST_HEAD(&conf->r5c_full_stripe_list);
+	atomic_set(&conf->r5c_cached_partial_stripes, 0);
+	INIT_LIST_HEAD(&conf->r5c_partial_stripe_list);
+
 	conf->level = mddev->new_level;
 	conf->chunk_sectors = mddev->new_chunk_sectors;
 	if (raid5_alloc_percpu(conf) != 0)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index b379496..a5d907a 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -263,6 +263,7 @@ struct stripe_head_state {
 	 */
 	int syncing, expanding, expanded, replacing;
 	int locked, uptodate, to_read, to_write, failed, written;
+	int just_cached;
 	int to_fill, compute, req_compute, non_overwrite;
 	int failed_num[2];
 	int p_failed, q_failed;
@@ -350,6 +351,12 @@ enum {
 	STRIPE_R5C_WRITE_OUT,	/* the stripe is in writing-out mode
 				 * see more detail in the raid5-cache.c
 				 */
+	STRIPE_R5C_PARTIAL_STRIPE,	/* in r5c cache (to-be/being handled or
+					 * in conf->r5c_partial_stripe_list)
+					 */
+	STRIPE_R5C_FULL_STRIPE,	/* in r5c cache (to-be/being handled or
+				 * in conf->r5c_full_stripe_list)
+				 */
 };
 
 #define STRIPE_EXPAND_SYNC_FLAGS \
@@ -600,6 +607,12 @@ struct r5conf {
 	 */
 	atomic_t		active_stripes;
 	struct list_head	inactive_list[NR_STRIPE_HASH_LOCKS];
+
+	atomic_t		r5c_cached_full_stripes;
+	struct list_head	r5c_full_stripe_list;
+	atomic_t		r5c_cached_partial_stripes;
+	struct list_head	r5c_partial_stripe_list;
+
 	atomic_t		empty_inactive_list_nr;
 	struct llist_head	released_stripes;
 	wait_queue_head_t	wait_for_quiescent;
@@ -720,4 +733,8 @@ r5c_handle_stripe_dirtying(struct r5conf *conf, struct stripe_head *sh,
 			   struct stripe_head_state *s, int disks);
 extern void
 r5c_finish_stripe_write_out(struct r5conf *conf, struct stripe_head *sh);
+extern void r5c_handle_cached_data_endio(struct r5conf *conf,
+	struct stripe_head *sh, int disks, struct bio_list *return_bi);
+extern int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
+			  struct stripe_head_state *s);
 #endif
-- 
2.9.3


^ permalink raw reply related

* [PATCH v6 03/11] md/r5cache: State machine for raid5-cache write back mode
From: Song Liu @ 2016-11-10 20:46 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu
In-Reply-To: <20161110204623.3484694-1-songliubraving@fb.com>

This patch adds state machine for raid5-cache. With log device, the
raid456 array could operate in two different modes (r5c_journal_mode):
  - write-back (R5C_MODE_WRITE_BACK)
  - write-through (R5C_MODE_WRITE_THROUGH)

Existing code of raid5-cache only has write-through mode. For write-back
cache, it is necessary to extend the state machine.

With write-back cache, every stripe could operate in two different
modes:
  - caching
  - writing-out

In caching mode, the stripe handles writes as:
  - write to journal
  - return IO

In writing-out mode, the stripe behaviors as a stripe in write through
mode R5C_MODE_WRITE_THROUGH.

STRIPE_R5C_WRITE_OUT is added to sh->state to differentiate caching and
writing-out mode.

When the array is write-through, stripes also go between caching mode
and writing-out mode.

Please note: this is a "no-op" patch for raid5-cache write-through
mode.

The following detailed explanation is copied from the raid5-cache.c:

/*
 * raid5 cache state machine
 *
 * With rhe RAID cache, each stripe works in two modes:
 *      - caching mode
 *      - writing-out mode
 *
 * These two modes are controlled by bit STRIPE_R5C_WRITE_OUT:
 *   if STRIPE_R5C_WRITE_OUT == 0, the stripe is in caching mode
 *   if STRIPE_R5C_WRITE_OUT == 1, the stripe is in writing-out mode

 * r5c_make_stripe_write_out() and r5c_finish_stripe_write_out() handles the
 * transition between caching and writing-out mode.
 *
 * Stripes in caching mode do not write the raid disks. Instead, all writes
 * are committed from the log device. Therefore, a stripe in caching mode
 * handles writes as:
 *      - write to log device
 *      - return IO
 *
 * Stripes in writing-out mode handle writes as:
 *      - calculate parity
 *      - write pending data and parity to journal
 *      - write data and parity to raid disks
 *      - return IO for pending writes
 *
 * All stripes starts with caching mode. If the array is write-through
 * (R5C_JOURNAL_MODE_WRITE_THROUGH), all stripes enter writing-out mode for
 * every write in r5c_handle_stripe_dirtying().
 */

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 142 ++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/md/raid5.c       |  20 ++++++-
 drivers/md/raid5.h       |  12 +++-
 3 files changed, 167 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 7ebf665..5876727 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -40,6 +40,47 @@
  */
 #define R5L_POOL_SIZE	4
 
+/*
+ * r5c journal modes of the array: write-back or write-through.
+ * write-through mode has identical behavior as existing log only
+ * implementation.
+ */
+enum r5c_journal_mode {
+	R5C_JOURNAL_MODE_WRITE_THROUGH = 0,
+	R5C_JOURNAL_MODE_WRITE_BACK = 1,
+};
+
+/*
+ * raid5 cache state machine
+ *
+ * With rhe RAID cache, each stripe works in two modes:
+ *	- caching mode
+ *	- writing-out mode
+ *
+ * These two modes are controlled by bit STRIPE_R5C_WRITE_OUT:
+ *   if STRIPE_R5C_WRITE_OUT == 0, the stripe is in caching mode
+ *   if STRIPE_R5C_WRITE_OUT == 1, the stripe is in writing-out mode
+
+ * r5c_make_stripe_write_out() and r5c_finish_stripe_write_out() handles the
+ * transition between caching and writing-out mode.
+ *
+ * Stripes in caching mode do not write the raid disks. Instead, all writes
+ * are committed from the log device. Therefore, a stripe in caching mode
+ * handles writes as:
+ *	- write to log device
+ *	- return IO
+ *
+ * Stripes in writing-out mode handle writes as:
+ *	- calculate parity
+ *	- write pending data and parity to journal
+ *	- write data and parity to raid disks
+ *	- return IO for pending writes
+ *
+ * All stripes starts with caching mode. If the array is write-through
+ * (R5C_JOURNAL_MODE_WRITE_THROUGH), all stripes enter writing-out mode for
+ * every write in r5c_handle_stripe_dirtying().
+ */
+
 struct r5l_log {
 	struct md_rdev *rdev;
 
@@ -96,6 +137,9 @@ struct r5l_log {
 	spinlock_t no_space_stripes_lock;
 
 	bool need_cache_flush;
+
+	/* for r5c_cache */
+	enum r5c_journal_mode r5c_journal_mode;
 };
 
 /*
@@ -133,6 +177,12 @@ enum r5l_io_unit_state {
 	IO_UNIT_STRIPE_END = 3,	/* stripes data finished writing to raid */
 };
 
+bool r5c_is_writeback(struct r5l_log *log)
+{
+	return (log != NULL &&
+		log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_BACK);
+}
+
 static sector_t r5l_ring_add(struct r5l_log *log, sector_t start, sector_t inc)
 {
 	start += inc;
@@ -168,12 +218,54 @@ static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
 	io->state = state;
 }
 
+/*
+ * Put the stripe into writing-out mode by setting STRIPE_R5C_WRITE_OUT.
+ *
+ * Note: when the array is in write-through, each stripe still goes through
+ * caching mode and writing-out mode. In such cases, this function is called
+ * in r5c_handle_stripe_dirtying().
+ */
+static void r5c_make_stripe_write_out(struct stripe_head *sh)
+{
+	struct r5conf *conf = sh->raid_conf;
+	struct r5l_log *log = conf->log;
+
+	if (!log)
+		return;
+	WARN_ON(test_bit(STRIPE_R5C_WRITE_OUT, &sh->state));
+	set_bit(STRIPE_R5C_WRITE_OUT, &sh->state);
+}
+
+/*
+ * Setting proper flags after writing (or flushing) data and/or parity to the
+ * log device. This is called from r5l_log_endio() or r5l_log_flush_endio().
+ */
+static void r5c_finish_cache_stripe(struct stripe_head *sh)
+{
+	struct r5l_log *log = sh->raid_conf->log;
+
+	if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH) {
+		BUG_ON(!test_bit(STRIPE_R5C_WRITE_OUT, &sh->state));
+		/*
+		 * Set R5_InJournal for parity dev[pd_idx]. This means parity
+		 * is in the journal. For RAID 6, it is NOT necessary to set
+		 * the flag for dev[qd_idx], as the two parities are written
+		 * out together.
+		 */
+		set_bit(R5_InJournal, &sh->dev[sh->pd_idx].flags);
+	} else
+		BUG(); /* write back logic in next patch */
+}
+
 static void r5l_io_run_stripes(struct r5l_io_unit *io)
 {
 	struct stripe_head *sh, *next;
 
 	list_for_each_entry_safe(sh, next, &io->stripe_list, log_list) {
 		list_del_init(&sh->log_list);
+
+		r5c_finish_cache_stripe(sh);
+
 		set_bit(STRIPE_HANDLE, &sh->state);
 		raid5_release_stripe(sh);
 	}
@@ -412,18 +504,19 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
 		r5l_append_payload_page(log, sh->dev[i].page);
 	}
 
-	if (sh->qd_idx >= 0) {
+	if (parity_pages == 2) {
 		r5l_append_payload_meta(log, R5LOG_PAYLOAD_PARITY,
 					sh->sector, sh->dev[sh->pd_idx].log_checksum,
 					sh->dev[sh->qd_idx].log_checksum, true);
 		r5l_append_payload_page(log, sh->dev[sh->pd_idx].page);
 		r5l_append_payload_page(log, sh->dev[sh->qd_idx].page);
-	} else {
+	} else if (parity_pages == 1) {
 		r5l_append_payload_meta(log, R5LOG_PAYLOAD_PARITY,
 					sh->sector, sh->dev[sh->pd_idx].log_checksum,
 					0, false);
 		r5l_append_payload_page(log, sh->dev[sh->pd_idx].page);
-	}
+	} else  /* Just writing data, not parity, in caching mode */
+		BUG_ON(parity_pages != 0);
 
 	list_add_tail(&sh->log_list, &io->stripe_list);
 	atomic_inc(&io->pending_stripe);
@@ -455,6 +548,8 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
 		return -EAGAIN;
 	}
 
+	WARN_ON(!test_bit(STRIPE_R5C_WRITE_OUT, &sh->state));
+
 	for (i = 0; i < sh->disks; i++) {
 		void *addr;
 
@@ -1100,6 +1195,45 @@ static void r5l_write_super(struct r5l_log *log, sector_t cp)
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);
 }
 
+int r5c_handle_stripe_dirtying(struct r5conf *conf,
+			       struct stripe_head *sh,
+			       struct stripe_head_state *s,
+			       int disks)
+{
+	struct r5l_log *log = conf->log;
+
+	if (!log || test_bit(STRIPE_R5C_WRITE_OUT, &sh->state))
+		return -EAGAIN;
+
+	if (conf->log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH) {
+		/* write-through mode */
+		r5c_make_stripe_write_out(sh);
+		return -EAGAIN;
+	}
+	BUG();  /* write back logic in next commit */
+	return 0;
+}
+
+/*
+ * clean up the stripe (clear STRIPE_R5C_WRITE_OUT etc.) after the stripe is
+ * committed to RAID disks.
+ */
+void r5c_finish_stripe_write_out(struct r5conf *conf,
+				 struct stripe_head *sh)
+{
+	if (!test_bit(R5_InJournal, &sh->dev[sh->pd_idx].flags))
+		return;
+
+	WARN_ON(!test_bit(STRIPE_R5C_WRITE_OUT, &sh->state));
+	clear_bit(STRIPE_R5C_WRITE_OUT, &sh->state);
+	clear_bit(R5_InJournal, &sh->dev[sh->pd_idx].flags);
+
+	if (conf->log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH)
+		return;
+	BUG();  /* write-back logic in coming patches */
+}
+
+
 static int r5l_load_log(struct r5l_log *log)
 {
 	struct md_rdev *rdev = log->rdev;
@@ -1237,6 +1371,8 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	INIT_LIST_HEAD(&log->no_space_stripes);
 	spin_lock_init(&log->no_space_stripes_lock);
 
+	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
+
 	if (r5l_load_log(log))
 		goto error;
 
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 34895f3..abb2c58 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3496,6 +3496,9 @@ static void handle_stripe_dirtying(struct r5conf *conf,
 	int rmw = 0, rcw = 0, i;
 	sector_t recovery_cp = conf->mddev->recovery_cp;
 
+	if (r5c_handle_stripe_dirtying(conf, sh, s, disks) == 0)
+		return;
+
 	/* Check whether resync is now happening or should start.
 	 * If yes, then the array is dirty (after unclean shutdown or
 	 * initial creation), so parity in some stripes might be inconsistent.
@@ -4386,13 +4389,23 @@ static void handle_stripe(struct stripe_head *sh)
 	    || s.expanding)
 		handle_stripe_fill(sh, &s, disks);
 
-	/* Now to consider new write requests and what else, if anything
-	 * should be read.  We do not handle new writes when:
+	/*
+	 * When the stripe finishes full journal write cycle (write to journal
+	 * and raid disk), this is the clean up procedure so it is ready for
+	 * next operation.
+	 */
+	r5c_finish_stripe_write_out(conf, sh);
+
+	/*
+	 * Now to consider new write requests, cache write back and what else,
+	 * if anything should be read.  We do not handle new writes when:
 	 * 1/ A 'write' operation (copy+xor) is already in flight.
 	 * 2/ A 'check' operation is in flight, as it may clobber the parity
 	 *    block.
+	 * 3/ A r5c cache log write is in flight.
 	 */
-	if (s.to_write && !sh->reconstruct_state && !sh->check_state)
+	if ((s.to_write || test_bit(STRIPE_R5C_WRITE_OUT, &sh->state)) &&
+	    !sh->reconstruct_state && !sh->check_state && !sh->log_io)
 		handle_stripe_dirtying(conf, sh, &s, disks);
 
 	/* maybe we need to check and possibly fix the parity for this stripe
@@ -5110,6 +5123,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	 * data on failed drives.
 	 */
 	if (rw == READ && mddev->degraded == 0 &&
+	    !r5c_is_writeback(conf->log) &&
 	    mddev->reshape_position == MaxSector) {
 		bi = chunk_aligned_read(mddev, bi);
 		if (!bi)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index ffc13c4..b379496 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -313,6 +313,7 @@ enum r5dev_flags {
 			 */
 	R5_Discard,	/* Discard the stripe */
 	R5_SkipCopy,	/* Don't copy data from bio to stripe cache */
+	R5_InJournal,	/* data being written is in the journal device */
 };
 
 /*
@@ -345,7 +346,10 @@ enum {
 	STRIPE_BITMAP_PENDING,	/* Being added to bitmap, don't add
 				 * to batch yet.
 				 */
-	STRIPE_LOG_TRAPPED, /* trapped into log */
+	STRIPE_LOG_TRAPPED,	/* trapped into log */
+	STRIPE_R5C_WRITE_OUT,	/* the stripe is in writing-out mode
+				 * see more detail in the raid5-cache.c
+				 */
 };
 
 #define STRIPE_EXPAND_SYNC_FLAGS \
@@ -710,4 +714,10 @@ extern void r5l_stripe_write_finished(struct stripe_head *sh);
 extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
 extern void r5l_quiesce(struct r5l_log *log, int state);
 extern bool r5l_log_disk_error(struct r5conf *conf);
+extern bool r5c_is_writeback(struct r5l_log *log);
+extern int
+r5c_handle_stripe_dirtying(struct r5conf *conf, struct stripe_head *sh,
+			   struct stripe_head_state *s, int disks);
+extern void
+r5c_finish_stripe_write_out(struct r5conf *conf, struct stripe_head *sh);
 #endif
-- 
2.9.3


^ permalink raw reply related

* [PATCH v6 02/11] md/r5cache: move some code to raid5.h
From: Song Liu @ 2016-11-10 20:46 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu
In-Reply-To: <20161110204623.3484694-1-songliubraving@fb.com>

Move some define and inline functions to raid5.h, so they can be
used in raid5-cache.c

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5.c | 71 -------------------------------------------------
 drivers/md/raid5.h | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 77 insertions(+), 71 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index df88656..34895f3 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -70,19 +70,6 @@ module_param(devices_handle_discard_safely, bool, 0644);
 MODULE_PARM_DESC(devices_handle_discard_safely,
 		 "Set to Y if all devices in each array reliably return zeroes on reads from discarded regions");
 static struct workqueue_struct *raid5_wq;
-/*
- * Stripe cache
- */
-
-#define NR_STRIPES		256
-#define STRIPE_SIZE		PAGE_SIZE
-#define STRIPE_SHIFT		(PAGE_SHIFT - 9)
-#define STRIPE_SECTORS		(STRIPE_SIZE>>9)
-#define	IO_THRESHOLD		1
-#define BYPASS_THRESHOLD	1
-#define NR_HASH			(PAGE_SIZE / sizeof(struct hlist_head))
-#define HASH_MASK		(NR_HASH - 1)
-#define MAX_STRIPE_BATCH	8
 
 static inline struct hlist_head *stripe_hash(struct r5conf *conf, sector_t sect)
 {
@@ -126,64 +113,6 @@ static inline void unlock_all_device_hash_locks_irq(struct r5conf *conf)
 	local_irq_enable();
 }
 
-/* bio's attached to a stripe+device for I/O are linked together in bi_sector
- * order without overlap.  There may be several bio's per stripe+device, and
- * a bio could span several devices.
- * When walking this list for a particular stripe+device, we must never proceed
- * beyond a bio that extends past this device, as the next bio might no longer
- * be valid.
- * This function is used to determine the 'next' bio in the list, given the sector
- * of the current stripe+device
- */
-static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
-{
-	int sectors = bio_sectors(bio);
-	if (bio->bi_iter.bi_sector + sectors < sector + STRIPE_SECTORS)
-		return bio->bi_next;
-	else
-		return NULL;
-}
-
-/*
- * We maintain a biased count of active stripes in the bottom 16 bits of
- * bi_phys_segments, and a count of processed stripes in the upper 16 bits
- */
-static inline int raid5_bi_processed_stripes(struct bio *bio)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	return (atomic_read(segments) >> 16) & 0xffff;
-}
-
-static inline int raid5_dec_bi_active_stripes(struct bio *bio)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	return atomic_sub_return(1, segments) & 0xffff;
-}
-
-static inline void raid5_inc_bi_active_stripes(struct bio *bio)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	atomic_inc(segments);
-}
-
-static inline void raid5_set_bi_processed_stripes(struct bio *bio,
-	unsigned int cnt)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	int old, new;
-
-	do {
-		old = atomic_read(segments);
-		new = (old & 0xffff) | (cnt << 16);
-	} while (atomic_cmpxchg(segments, old, new) != old);
-}
-
-static inline void raid5_set_bi_stripes(struct bio *bio, unsigned int cnt)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	atomic_set(segments, cnt);
-}
-
 /* Find first data disk in a raid6 stripe */
 static inline int raid6_d0(struct stripe_head *sh)
 {
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 57ec49f..ffc13c4 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -410,6 +410,83 @@ struct disk_info {
 	struct md_rdev	*rdev, *replacement;
 };
 
+/*
+ * Stripe cache
+ */
+
+#define NR_STRIPES		256
+#define STRIPE_SIZE		PAGE_SIZE
+#define STRIPE_SHIFT		(PAGE_SHIFT - 9)
+#define STRIPE_SECTORS		(STRIPE_SIZE>>9)
+#define	IO_THRESHOLD		1
+#define BYPASS_THRESHOLD	1
+#define NR_HASH			(PAGE_SIZE / sizeof(struct hlist_head))
+#define HASH_MASK		(NR_HASH - 1)
+#define MAX_STRIPE_BATCH	8
+
+/* bio's attached to a stripe+device for I/O are linked together in bi_sector
+ * order without overlap.  There may be several bio's per stripe+device, and
+ * a bio could span several devices.
+ * When walking this list for a particular stripe+device, we must never proceed
+ * beyond a bio that extends past this device, as the next bio might no longer
+ * be valid.
+ * This function is used to determine the 'next' bio in the list, given the
+ * sector of the current stripe+device
+ */
+static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
+{
+	int sectors = bio_sectors(bio);
+
+	if (bio->bi_iter.bi_sector + sectors < sector + STRIPE_SECTORS)
+		return bio->bi_next;
+	else
+		return NULL;
+}
+
+/*
+ * We maintain a biased count of active stripes in the bottom 16 bits of
+ * bi_phys_segments, and a count of processed stripes in the upper 16 bits
+ */
+static inline int raid5_bi_processed_stripes(struct bio *bio)
+{
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+
+	return (atomic_read(segments) >> 16) & 0xffff;
+}
+
+static inline int raid5_dec_bi_active_stripes(struct bio *bio)
+{
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+
+	return atomic_sub_return(1, segments) & 0xffff;
+}
+
+static inline void raid5_inc_bi_active_stripes(struct bio *bio)
+{
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+
+	atomic_inc(segments);
+}
+
+static inline void raid5_set_bi_processed_stripes(struct bio *bio,
+	unsigned int cnt)
+{
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	int old, new;
+
+	do {
+		old = atomic_read(segments);
+		new = (old & 0xffff) | (cnt << 16);
+	} while (atomic_cmpxchg(segments, old, new) != old);
+}
+
+static inline void raid5_set_bi_stripes(struct bio *bio, unsigned int cnt)
+{
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+
+	atomic_set(segments, cnt);
+}
+
 /* NOTE NR_STRIPE_HASH_LOCKS must remain below 64.
  * This is because we sometimes take all the spinlocks
  * and creating that much locking depth can cause
-- 
2.9.3


^ permalink raw reply related

* [PATCH v6 01/11] md/r5cache: Check array size in r5l_init_log
From: Song Liu @ 2016-11-10 20:46 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu
In-Reply-To: <20161110204623.3484694-1-songliubraving@fb.com>

Currently, r5l_write_stripe checks meta size for each stripe write,
which is not necessary.

With this patch, r5l_init_log checks maximal meta size of the array,
which is (r5l_meta_block + raid_disks x r5l_payload_data_parity).
If this is too big to fit in one page, r5l_init_log aborts.

With current meta data, r5l_log support raid_disks up to 203.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 2e270e6..7ebf665 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -441,7 +441,6 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
 {
 	int write_disks = 0;
 	int data_pages, parity_pages;
-	int meta_size;
 	int reserve;
 	int i;
 	int ret = 0;
@@ -473,15 +472,6 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
 	parity_pages = 1 + !!(sh->qd_idx >= 0);
 	data_pages = write_disks - parity_pages;
 
-	meta_size =
-		((sizeof(struct r5l_payload_data_parity) + sizeof(__le32))
-		 * data_pages) +
-		sizeof(struct r5l_payload_data_parity) +
-		sizeof(__le32) * parity_pages;
-	/* Doesn't work with very big raid array */
-	if (meta_size + sizeof(struct r5l_meta_block) > PAGE_SIZE)
-		return -EINVAL;
-
 	set_bit(STRIPE_LOG_TRAPPED, &sh->state);
 	/*
 	 * The stripe must enter state machine again to finish the write, so
@@ -1185,6 +1175,22 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 
 	if (PAGE_SIZE != 4096)
 		return -EINVAL;
+
+	/*
+	 * The PAGE_SIZE must be big enough to hold 1 r5l_meta_block and
+	 * raid_disks r5l_payload_data_parity.
+	 *
+	 * Write journal and cache does not work for very big array
+	 * (raid_disks > 203)
+	 */
+	if (sizeof(struct r5l_meta_block) +
+	    ((sizeof(struct r5l_payload_data_parity) + sizeof(__le32)) *
+	     conf->raid_disks) > PAGE_SIZE) {
+		pr_err("md/raid:%s: write journal/cache doesn't work for array with %d disks\n",
+		       mdname(conf->mddev), conf->raid_disks);
+		return -EINVAL;
+	}
+
 	log = kzalloc(sizeof(*log), GFP_KERNEL);
 	if (!log)
 		return -ENOMEM;
-- 
2.9.3


^ permalink raw reply related

* [PATCH v6 00/11] raid5-cache: enabling cache features
From: Song Liu @ 2016-11-10 20:46 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu

These are the 5th version of patches to enable write cache part of
raid5-cache. The journal part was released with kernel 4.4.

The caching part uses same disk format of raid456 journal, and provides
acceleration to writes. Write operations are committed (bio_endio) once
the data is secured in journal. Reconstruct and RMW are postponed to
writing-out phase, which is usually not on the critical path.

The changes are organized in 11 patches (details below).

Patch for chunk_aligned_read in earlier RFC is not included yet
(http://marc.info/?l=linux-raid&m=146432700719277). But we may still need
some optimizations later, especially for SSD raid devices.

Changes between v5 and v4 (http://marc.info/?l=linux-raid&m=147629531615172):
 1. Incorporate feedbacks;
 2. Split reclaim patch into 3 smaller patches for (hopefully) easier review;
 3. Rename different modes of a stripe to "caching" and "writing-out";
 4. Remove flag R5_WantCache;
 5. Rename flag R5_InCache to R5_InJournal;
 6. Remove flag STRIPE_R5C_WRITTEN and use R5_InJournal bit of dev[pd_idx].flags instead;
 7. Remove dev_in_cache from stripe_head;
 8. Add logic to handle alloc_page() failure in handle_stripe_dirtying();

Song Liu (11):
  md/r5cache: Check array size in r5l_init_log
  md/r5cache: move some code to raid5.h
  md/r5cache: State machine for raid5-cache write back mode
  md/r5cache: caching mode of r5cache
  md/r5cache: write-out mode and reclaim support
  md/r5cache: sysfs entry r5c_journal_mode
  md/r5cache: refactoring journal recovery code
  md/r5cache: r5cache recovery: part 1
  md/r5cache: r5cache recovery: part 2
  md/r5cache: handle SYNC and FUA
  md/r5cache: handle alloc_page failure

 drivers/md/raid5-cache.c | 1822 +++++++++++++++++++++++++++++++++++++++++-----
 drivers/md/raid5.c       |  322 +++++---
 drivers/md/raid5.h       |  154 +++-
 3 files changed, 1987 insertions(+), 311 deletions(-)

--
2.9.3

^ permalink raw reply

* "creative" bio usage in the RAID code
From: Christoph Hellwig @ 2016-11-10 19:46 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, linux-block

Hi Shaohua,

one of the major issues with Ming Lei's multipage biovec works
is that we can't easily enabled the MD RAID code for it.  I had
a quick chat on that with Chris and Jens and they suggested talking
to you about it.

It's mostly about the RAID1 and RAID10 code which does a lot of funny
things with the bi_iov_vec and bi_vcnt fields, which we'd prefer that
drivers don't touch.  One example is the r1buf_pool_alloc code,
which I think should simply use bio_clone for the MD_RECOVERY_REQUESTED
case, which would also take care of r1buf_pool_free.  I'm not sure
about all the others cases, as some bits don't fully make sense to me,
e.g. why we're trying to do single page I/O out of a bigger bio.

Maybe you have some better ideas what's going on there?

Another not quite as urgent issue is how the RAID5 code abuses
->bi_phys_segments as and outstanding I/O counter, and I have no
really good answer to that either.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox