Linux RAID subsystem development

* Re: RAID Recovery
From: Phil Turmel @ 2017-03-07 15:00 UTC
  To: Adam Goryachev, linux-raid
In-Reply-To: <94b4591a-bda1-529b-3af1-c451b657082f@websitemanagers.com.au>

Hi Adam,

{Please remember to trim repetitive stuff, and interleave.}

On 03/07/2017 09:06 AM, Adam Goryachev wrote:
> BTW, just some more info I've found... either almost the entire
> drives are RAID1 mirrors, or all 4 are RAID1 mirrors:

> Other option, they have been re-initialised/zero'd or similar, and
> that's why all the data is identical (useless). I was hoping to get a
> starting point for where the partition boundaries might have been
> ....

Search the devices for ext2/3/4 superblocks, like so:

dd if=/dev/sdX bs=1M 2>/dev/null |hexdump -C |grep '30  .\+  53 ef 0'

This will take a very long time, and will generate false positives.
You probably would want to use screen or tmux to run these in
parallel in separate processes.

But superblock locations will give you hints as to the rest of the data,
and make it possible to create partitions that will let you copy
stuff off into a new array.
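
The same search can be sketched in C, which makes it easier to turn
each hit into a candidate filesystem start.  This is only a rough
sketch under stated assumptions: the function name is made up for
illustration, the buffer is assumed to begin at a sector-aligned device
offset, and the filesystem is assumed to start on a 512-byte sector
boundary.  The primary ext2/3/4 superblock sits 1024 bytes into the
filesystem and its magic (0xEF53, little-endian) is at byte 0x38 of the
superblock.  Backup superblocks carry the same magic, so some hits will
not correspond to a real filesystem start.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXT_SB_OFFSET    1024u /* primary superblock: 1024 bytes into the fs */
#define EXT_MAGIC_OFFSET 56u   /* s_magic lives at byte 0x38 of the superblock */
#define SECTOR_SIZE      512u

/*
 * Hypothetical helper: scan buf, which holds device bytes starting at the
 * sector-aligned device offset base, for the ext magic bytes 53 ef at a
 * position consistent with a sector-aligned filesystem start.
 * Stores up to max candidate filesystem start offsets in out and returns
 * the number stored.
 */
size_t find_ext_candidates(const uint8_t *buf, size_t len, uint64_t base,
                           uint64_t *out, size_t max)
{
    size_t found = 0;

    assert(base % SECTOR_SIZE == 0);

    /* Only byte 56 of each sector can hold the magic of an aligned fs. */
    for (size_t i = EXT_MAGIC_OFFSET; i + 1 < len && found < max;
         i += SECTOR_SIZE) {
        uint64_t pos = base + i;

        if (buf[i] != 0x53 || buf[i + 1] != 0xEF)
            continue;
        /* A primary superblock cannot start before the device does. */
        if (pos < EXT_SB_OFFSET + EXT_MAGIC_OFFSET)
            continue;
        out[found++] = pos - (EXT_SB_OFFSET + EXT_MAGIC_OFFSET);
    }
    return found;
}
```

Each returned offset is only a hint: check the rest of the superblock
(block count, UUID, last mount time) before creating a partition there.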

Phil


* Re: When will Linux support M2 on RAID ?
From: Christoph Hellwig @ 2017-03-07 15:15 UTC
  To: Austin S. Hemmelgarn
  Cc: Christoph Hellwig, David F., linux-kernel,
	linux-raid@vger.kernel.org
In-Reply-To: <a92c6f16-af2d-9d79-fb09-8490aac050db@gmail.com>

On Tue, Mar 07, 2017 at 09:50:22AM -0500, Austin S. Hemmelgarn wrote:
> He's referring to the RAID mode most modern Intel chipsets have, which (last
> I checked) Linux does not support completely and many OEMs are setting by
> default on new systems because it apparently provides better performance
> than AHCI even for a single device.

It actually provides worse performance.  What it does is that it shoves
up to three NVMe device BARs into the BAR of an AHCI device, and
requires the OS to handle them all using a single driver.  The Money's
on crack at Intel decided to do that to provide their "valuable" RSTe
IP (which is a Windows ATA + RAID driver in a blob, which now has also
grown an NVMe driver).  The only remotely sane thing is to disable it
in the BIOS, and burn all people involved with it.  The next best thing
is to provide a fake PCIe root port driver untangling this before it
hits the driver, but unfortunately Intel is unwilling to either do this
on their own or at least provide enough documentation for others to do
it.


* Re: When will Linux support M2 on RAID ?
From: Dave Jiang @ 2017-03-07 15:46 UTC
  To: Christoph Hellwig, David F.
  Cc: linux-kernel, linux-raid@vger.kernel.org, Williams, Dan J
In-Reply-To: <20170307045200.GA1708@infradead.org>

On 03/06/2017 09:52 PM, Christoph Hellwig wrote:
> On Sun, Mar 05, 2017 at 06:09:42PM -0800, David F. wrote:
>> More and more systems are coming with M2 on RAID and Linux doesn't
>> work unless you change the system out of RAID mode.  This is becoming
>> more and more of a problem.   What is the status of Linux support for
>> the new systems?
> 
> Your message doesn't make sense at all.  MD works on absolutely any
> Linux block device, and I've used it on plenty M.2 form factor devices -
> not that the form factor has anything to do with how Linux would treat
> a device.

I have a feeling he's talking about this [1] issue.

[1]:
http://www.pcworld.com/article/3123075/linux/linux-wont-install-on-your-laptop-blame-intel-not-microsoft.html


* Re: When will Linux support M2 on RAID ?
From: Austin S. Hemmelgarn @ 2017-03-07 15:54 UTC
  To: Christoph Hellwig; +Cc: David F., linux-kernel, linux-raid@vger.kernel.org
In-Reply-To: <20170307151528.GA16216@infradead.org>

On 2017-03-07 10:15, Christoph Hellwig wrote:
> On Tue, Mar 07, 2017 at 09:50:22AM -0500, Austin S. Hemmelgarn wrote:
>> He's referring to the RAID mode most modern Intel chipsets have, which (last
>> I checked) Linux does not support completely and many OEMs are setting by
>> default on new systems because it apparently provides better performance
>> than AHCI even for a single device.
>
> It actually provides worse performance.  What it does is that it shoves
> up to three NVMe device BARs into the BAR of an AHCI device, and
> requires the OS to handle them all using a single driver.  The Money's
> on crack at Intel decided to do that to provide their "valuable" RSTe
> IP (which is a Windows ATA + RAID driver in a blob, which now has also
> grown an NVMe driver).  The only remotely sane thing is to disable it
> in the BIOS, and burn all people involved with it.  The next best thing
> is to provide a fake PCIe root port driver untangling this before it
> hits the driver, but unfortunately Intel is unwilling to either do this
> on their own or at least provide enough documentation for others to do
> it.
>
For NVMe, yeah, it hurts performance horribly.  For SATA devices, though,
it's hit or miss: some setups perform better, some perform worse.

It does have one advantage, though: it lets you put the C drive for a
Windows install on a soft-RAID array insanely easily compared to trying
to do so through Windows itself (although still significantly less
easily than doing the equivalent on Linux...).

The cynic in me is tempted to believe that the OEMs who are turning it
on by default are trying to either:
1. Make their low-end systems look even crappier in terms of performance 
while adding to their marketing checklist (Of the systems I've seen that 
have this on by default, most were cheap ones with really low specs).
2. Actively make it harder to run anything but Windows on their hardware.


* Re: [PATCH v2] mdadm:add checking clustered bitmap in assemble mode
From: jes.sorensen @ 2017-03-07 16:57 UTC
  To: Zhilong Liu; +Cc: linux-raid
In-Reply-To: <20170307031303.3009-1-zlliu@suse.com>

Zhilong Liu <zlliu@suse.com> writes:
> mdadm:Both clustered and internal array don't need
> to specify --bitmap when assembling array.
>
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
> Acked-by: Coly Li <colyli@suse.de>

This one looks good!

Applied!

Thanks,
Jes


* Re: [PATCH 08/29] drivers, md: convert mddev.active from atomic_t to refcount_t
From: Shaohua Li @ 2017-03-07 19:04 UTC
  To: Elena Reshetova
  Cc: peterz, linux-pci, target-devel, linux1394-devel, devel,
	linux-s390, linux-scsi, linux-serial, fcoe-devel, xen-devel,
	open-iscsi, linux-media, Kees Cook, linux-raid, linux-bcache,
	Hans Liljestrand, David Windsor, gregkh, linux-usb, linux-kernel,
	netdev, devel
In-Reply-To: <1488810076-3754-9-git-send-email-elena.reshetova@intel.com>

On Mon, Mar 06, 2017 at 04:20:55PM +0200, Elena Reshetova wrote:
> refcount_t type and corresponding API should be
> used instead of atomic_t when the variable is used as
> a reference counter. This allows to avoid accidental
> refcounter overflows that might lead to use-after-free
> situations.

Looks good. Let me know how you want to route the patch upstream.
 
> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
> Signed-off-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: David Windsor <dwindsor@gmail.com>
> ---
>  drivers/md/md.c | 6 +++---
>  drivers/md/md.h | 3 ++-
>  2 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 985374f..94c8ebf 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -449,7 +449,7 @@ EXPORT_SYMBOL(md_unplug);
>  
>  static inline struct mddev *mddev_get(struct mddev *mddev)
>  {
> -	atomic_inc(&mddev->active);
> +	refcount_inc(&mddev->active);
>  	return mddev;
>  }
>  
> @@ -459,7 +459,7 @@ static void mddev_put(struct mddev *mddev)
>  {
>  	struct bio_set *bs = NULL;
>  
> -	if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
> +	if (!refcount_dec_and_lock(&mddev->active, &all_mddevs_lock))
>  		return;
>  	if (!mddev->raid_disks && list_empty(&mddev->disks) &&
>  	    mddev->ctime == 0 && !mddev->hold_active) {
> @@ -495,7 +495,7 @@ void mddev_init(struct mddev *mddev)
>  	INIT_LIST_HEAD(&mddev->all_mddevs);
>  	setup_timer(&mddev->safemode_timer, md_safemode_timeout,
>  		    (unsigned long) mddev);
> -	atomic_set(&mddev->active, 1);
> +	refcount_set(&mddev->active, 1);
>  	atomic_set(&mddev->openers, 0);
>  	atomic_set(&mddev->active_io, 0);
>  	spin_lock_init(&mddev->lock);
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index b8859cb..4811663 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -22,6 +22,7 @@
>  #include <linux/list.h>
>  #include <linux/mm.h>
>  #include <linux/mutex.h>
> +#include <linux/refcount.h>
>  #include <linux/timer.h>
>  #include <linux/wait.h>
>  #include <linux/workqueue.h>
> @@ -360,7 +361,7 @@ struct mddev {
>  	 */
>  	struct mutex			open_mutex;
>  	struct mutex			reconfig_mutex;
> -	atomic_t			active;		/* general refcount */
> +	refcount_t			active;		/* general refcount */
>  	atomic_t			openers;	/* number of active opens */
>  
>  	int				changed;	/* True if we might need to
> -- 
> 2.7.4
> 


* Re: [PATCH 10/29] drivers, md: convert stripe_head.count from atomic_t to refcount_t
From: Shaohua Li @ 2017-03-07 19:07 UTC
  To: Elena Reshetova
  Cc: peterz, linux-pci, target-devel, linux1394-devel, devel,
	linux-s390, linux-scsi, linux-serial, fcoe-devel, xen-devel,
	open-iscsi, linux-media, Kees Cook, linux-raid, linux-bcache,
	Hans Liljestrand, David Windsor, gregkh, linux-usb, linux-kernel,
	netdev, devel
In-Reply-To: <1488810076-3754-11-git-send-email-elena.reshetova@intel.com>

On Mon, Mar 06, 2017 at 04:20:57PM +0200, Elena Reshetova wrote:
> refcount_t type and corresponding API should be
> used instead of atomic_t when the variable is used as
> a reference counter. This allows to avoid accidental
> refcounter overflows that might lead to use-after-free
> situations.
> 
> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
> Signed-off-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: David Windsor <dwindsor@gmail.com>
> ---
>  drivers/md/raid5-cache.c |  8 +++---
>  drivers/md/raid5.c       | 66 ++++++++++++++++++++++++------------------------
>  drivers/md/raid5.h       |  3 ++-
>  3 files changed, 39 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 3f307be..6c05e12 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c

snip
>  	       sh->check_state, sh->reconstruct_state);
>  
>  	analyse_stripe(sh, &s);
> @@ -4924,7 +4924,7 @@ static void activate_bit_delay(struct r5conf *conf,
>  		struct stripe_head *sh = list_entry(head.next, struct stripe_head, lru);
>  		int hash;
>  		list_del_init(&sh->lru);
> -		atomic_inc(&sh->count);
> +		refcount_inc(&sh->count);
>  		hash = sh->hash_lock_index;
>  		__release_stripe(conf, sh, &temp_inactive_list[hash]);
>  	}
> @@ -5240,7 +5240,7 @@ static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
>  		sh->group = NULL;
>  	}
>  	list_del_init(&sh->lru);
> -	BUG_ON(atomic_inc_return(&sh->count) != 1);
> +	BUG_ON(refcount_inc_not_zero(&sh->count));

This changes the behavior: refcount_inc_not_zero() doesn't increment if the
original value is 0, whereas atomic_inc_return() always increments.
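
To make the difference concrete, here is a small userspace model of the
two calls.  These are hypothetical stand-ins written for illustration,
not the kernel implementations, and the model assumes (as the original
BUG_ON implies) that sh->count is 0 when the stripe is taken off the
list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a counter; not the kernel's atomic_t/refcount_t. */
struct model_counter {
    int val;
};

/* Models atomic_inc_return(): always increments, returns the new value. */
int model_inc_return(struct model_counter *c)
{
    assert(c != NULL);
    return ++c->val;
}

/*
 * Models refcount_inc_not_zero(): increments only when the counter is
 * already nonzero, and reports whether the increment happened.
 */
bool model_inc_not_zero(struct model_counter *c)
{
    assert(c != NULL);
    if (c->val == 0)
        return false;
    c->val++;
    return true;
}
```

Starting from 0, the first helper leaves the counter at 1 and the second
leaves it at 0, so in this model the converted line no longer takes the
reference the old code took.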

Thanks,
Shaohua


* Re: [PATCH] md/r5cache: improve recovery with read ahead page pool
From: Shaohua Li @ 2017-03-07 20:13 UTC
  To: Song Liu; +Cc: linux-raid, shli, neilb, kernel-team, dan.j.williams, hch
In-Reply-To: <20170303210616.56044-1-songliubraving@fb.com>

On Fri, Mar 03, 2017 at 01:06:15PM -0800, Song Liu wrote:
> In r5cache recovery, the journal device is scanned page by page.
> Currently, we use sync_page_io() to read journal device. This is
> not efficient when we have to recover many stripes from the journal.
> 
> To improve the speed of recovery, this patch introduces a read ahead
> page pool (ra_pool) to recovery_ctx. With ra_pool, multiple consecutive
> pages are read in one IO. Then the recovery code reads the journal from
> ra_pool.
> 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  drivers/md/raid5-cache.c | 151 +++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 134 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 3f307be..46afea8 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -1552,6 +1552,8 @@ bool r5l_log_disk_error(struct r5conf *conf)
>  	return ret;
>  }
>  
> +#define R5L_RECOVERY_PAGE_POOL_SIZE 64

I'd use a larger pool, for example 1MB of memory, so that each read is an
optimally sized IO. 1MB should be good for SSDs.

>  struct r5l_recovery_ctx {
>  	struct page *meta_page;		/* current meta */
>  	sector_t meta_total_blocks;	/* total size of current meta and data */
> @@ -1560,18 +1562,130 @@ struct r5l_recovery_ctx {
>  	int data_parity_stripes;	/* number of data_parity stripes */
>  	int data_only_stripes;		/* number of data_only stripes */
>  	struct list_head cached_list;
> +
> +	/*
> +	 * read ahead page pool (ra_pool)
> +	 * in recovery, log is read sequentially. It is not efficient to
> +	 * read every page with sync_page_io(). The read ahead page pool
> +	 * reads multiple pages with one IO, so further log read can
> +	 * just copy data from the pool.
> +	 */
> +	struct page *ra_pool[R5L_RECOVERY_PAGE_POOL_SIZE];
> +	sector_t pool_offset;	/* offset of first page in the pool */
> +	int total_pages;	/* total allocated pages */
> +	int valid_pages;	/* pages with valid data */
> +	struct bio *ra_bio;	/* bio to do the read ahead*/
>  };

snip

> +				  struct r5l_recovery_ctx *ctx,
> +				  struct page *page,
> +				  sector_t offset)
> +{
> +	int ret;
> +
> +	if (offset < ctx->pool_offset ||
> +	    offset >= ctx->pool_offset + ctx->valid_pages * BLOCK_SECTORS) {
> +		ret = r5l_recovery_fetch_ra_pool(log, ctx, offset);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	BUG_ON(offset < ctx->pool_offset ||
> +	       offset >= ctx->pool_offset + ctx->valid_pages * BLOCK_SECTORS);
> +
> +	memcpy(page_address(page),
> +	       page_address(ctx->ra_pool[(offset - ctx->pool_offset) / BLOCK_SECTORS]),

sector_t is u64. A plain divide isn't allowed on 32-bit systems. The compiler
probably optimized this to '>> 3' (BLOCK_SECTORS is 8), but I'd suggest doing
the shift explicitly.
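
A minimal userspace sketch of the explicit-shift form follows.  The
names ra_pool_index and sector_model_t, and the BLOCK_SECTOR_SHIFT
macro as written here, are assumptions made for illustration; the only
fact relied on is that BLOCK_SECTORS is 8, so dividing by it is the same
as shifting right by 3.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SECTORS      8u /* 512-byte sectors per 4096-byte page */
#define BLOCK_SECTOR_SHIFT 3u /* log2(BLOCK_SECTORS); name assumed here */

typedef uint64_t sector_model_t; /* stand-in for the kernel's sector_t */

/*
 * Index of the pool page that holds offset, computed with a shift so no
 * 64-bit division is needed.
 */
unsigned int ra_pool_index(sector_model_t offset, sector_model_t pool_offset)
{
    assert(offset >= pool_offset);
    return (unsigned int)((offset - pool_offset) >> BLOCK_SECTOR_SHIFT);
}
```

Where a real 64-bit division is unavoidable in kernel code, helpers such
as sector_div() or div_u64() are the usual route; for a power-of-two
divisor the shift is simpler.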

> +	       PAGE_SIZE);
> +	return 0;
> +}
> +
>  static int r5l_recovery_read_meta_block(struct r5l_log *log,
>  					struct r5l_recovery_ctx *ctx)
>  {
>  	struct page *page = ctx->meta_page;
>  	struct r5l_meta_block *mb;
>  	u32 crc, stored_crc;
> +	int ret;
>  
> -	if (!sync_page_io(log->rdev, ctx->pos, PAGE_SIZE, page, REQ_OP_READ, 0,
> -			  false))
> -		return -EIO;
> +	ret = r5l_recovery_read_page(log, ctx, page, ctx->pos);
> +	if (ret != 0)
> +		return ret;
>  
>  	mb = page_address(page);
>  	stored_crc = le32_to_cpu(mb->checksum);
> @@ -1653,8 +1767,7 @@ static void r5l_recovery_load_data(struct r5l_log *log,
>  	raid5_compute_sector(conf,
>  			     le64_to_cpu(payload->location), 0,
>  			     &dd_idx, sh);
> -	sync_page_io(log->rdev, log_offset, PAGE_SIZE,
> -		     sh->dev[dd_idx].page, REQ_OP_READ, 0, false);
> +	r5l_recovery_read_page(log, ctx, sh->dev[dd_idx].page, log_offset);
>  	sh->dev[dd_idx].log_checksum =
>  		le32_to_cpu(payload->checksum[0]);
>  	ctx->meta_total_blocks += BLOCK_SECTORS;
> @@ -1673,17 +1786,13 @@ static void r5l_recovery_load_parity(struct r5l_log *log,
>  	struct r5conf *conf = mddev->private;
>  
>  	ctx->meta_total_blocks += BLOCK_SECTORS * conf->max_degraded;
> -	sync_page_io(log->rdev, log_offset, PAGE_SIZE,
> -		     sh->dev[sh->pd_idx].page, REQ_OP_READ, 0, false);
> +	r5l_recovery_read_page(log, ctx, sh->dev[sh->pd_idx].page, log_offset);
>  	sh->dev[sh->pd_idx].log_checksum =
>  		le32_to_cpu(payload->checksum[0]);
>  	set_bit(R5_Wantwrite, &sh->dev[sh->pd_idx].flags);
>  
>  	if (sh->qd_idx >= 0) {
> -		sync_page_io(log->rdev,
> -			     r5l_ring_add(log, log_offset, BLOCK_SECTORS),
> -			     PAGE_SIZE, sh->dev[sh->qd_idx].page,
> -			     REQ_OP_READ, 0, false);
> +		r5l_recovery_read_page(log, ctx, sh->dev[sh->qd_idx].page, log_offset);

The original code reads from 'r5l_ring_add(log, log_offset, BLOCK_SECTORS)',
but now the code reads from log_offset. Is this intended?
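
For reference, a userspace model of where the two parity pages sit in
the log.  The struct and function names are hypothetical, and ring_add
here only mirrors the wrap-around behaviour of r5l_ring_add() as I
understand it: the Q page follows the P page by one block, wrapping at
the end of the journal device.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SECTORS 8u /* sectors per 4096-byte page */

/* Model of a wrap-around add on a ring of device_size sectors. */
uint64_t ring_add(uint64_t device_size, uint64_t start, uint64_t inc)
{
    assert(start < device_size && inc <= device_size);
    start += inc;
    if (start >= device_size)
        start -= device_size;
    return start;
}

/* Hypothetical pair of log offsets for the P and Q parity pages. */
struct parity_offsets {
    uint64_t p;
    uint64_t q;
};

/* P is read at log_offset; Q is one block further along the ring. */
struct parity_offsets parity_page_offsets(uint64_t device_size,
                                          uint64_t log_offset)
{
    struct parity_offsets o;

    o.p = log_offset;
    o.q = ring_add(device_size, log_offset, BLOCK_SECTORS);
    return o;
}
```

Reading both pages from log_offset, as the new line does, would load the
P page into the Q buffer in this model.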

Thanks,
Shaohua


* Re: [PATCH 3/3] md/raid5: sort bios
From: Shaohua Li @ 2017-03-07 20:23 UTC
  To: NeilBrown; +Cc: Shaohua Li, linux-raid, songliubraving, kernel-team
In-Reply-To: <87bmtekiwf.fsf@notabene.neil.brown.name>

On Mon, Mar 06, 2017 at 05:40:16PM +1100, Neil Brown wrote:
> On Fri, Mar 03 2017, Shaohua Li wrote:
> 
> > On Fri, Mar 03, 2017 at 02:43:49PM +1100, Neil Brown wrote:
> >> On Fri, Feb 17 2017, Shaohua Li wrote:
> >> 
> >> > Previous patch (raid5: only dispatch IO from raid5d for harddisk raid)
> >> > defers IO dispatching. The goal is to create a better IO pattern. At that
> >> > time, we didn't sort the deferred IO and hoped the block layer could do IO
> >> > merge and sort. Now the raid5-cache writeback could create a large amount
> >> > of bios. And if we enable multi-thread for stripe handling, we can't
> >> > control when to dispatch IO to raid disks. A lot of the time, we are
> >> > dispatching IO which the block layer can't merge effectively.
> >> >
> >> > This patch moves further for the IO dispatching defer. We accumulate
> >> > bios, but we don't dispatch all the bios after a threshold is met. This
> >> > 'dispatch partial portion of bios' strategy allows bios coming in a
> >> > large time window to be sent to disks together. At the dispatching time,
> >> > there is a large chance the block layer can merge the bios. To make this
> >> > more effective, we dispatch IO in ascending order. This increases
> >> > request merge chance and reduces disk seek.
> >> 
> >> I can see the benefit of batching and sorting requests.
> >> 
> >> I wonder if the extra complexity of grouping together 512 requests, then
> >> submitting the "first" 128 is really worth it.  Have you measured the
> >> value of that?
> >
> > I'm pretty sure I tried. The whole point of dispatching the first 128 is we
> > don't have a better pipeline. Grouping 512 and then dispatching them together
> > definitely improves the IO pattern, but the request accumulation takes time, we
> > will have no IO running in the window.
> 
> But we don't wait for the first batch before we start collecting the
> next batch - do we?  Why would there be a window with no IO running?

We don't. Dispatching 128 instead of 512 is to avoid the case where the first
batch is finished but the second batch hasn't accumulated enough stripes yet. In
that case, we will have no IO running, which is the window I mentioned.
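
As a rough illustration of the 'dispatch a partial portion' idea, here
is a simplified userspace sketch.  It is not the patch's code: the
function name is made up, sectors are plain integers, and it always
dispatches the lowest-sorted entries rather than starting from the last
dispatched position.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: ascending sector order. */
static int cmp_sector(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;

    return (x > y) - (x < y);
}

/*
 * Hypothetical helper: sort the accumulated sectors, copy the lowest
 * batch of them to dispatched, and keep the remainder pending so that
 * later arrivals can still be merged with them.
 * Returns the number of entries dispatched.
 */
size_t dispatch_partial(uint64_t *pending, size_t *npending,
                        uint64_t *dispatched, size_t batch)
{
    size_t n = *npending < batch ? *npending : batch;

    assert(pending != NULL && dispatched != NULL);
    qsort(pending, *npending, sizeof(*pending), cmp_sector);
    memcpy(dispatched, pending, n * sizeof(*pending));
    /* Shift the undispatched tail to the front of the array. */
    memmove(pending, pending + n, (*npending - n) * sizeof(*pending));
    *npending -= n;
    return n;
}
```

With 512 accumulated and a batch of 128, three quarters of the entries
stay queued, so there is always something ready to send when the
previous batch completes.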

> >
> >> If you just submitted every time you got 512 requests, you could use
> >> list_sort() on the bio list and wouldn't need an array.
> >> 
> >> If an array really is best, it would be really nice if "sort" could pass
> >> a 'void*' down to the cmp function,
> >> and it could sort all bios that are
> >> *after* last_bio_pos first, and then the others.  That would make the
> >> code much simpler.  I guess sort() could be changed (list_sort() already
> >> has a 'priv' argument like this).
> >
> > Ok, I'll change this to a list. And add extra pointer to record the last sorted
> > entry. I didn't see the sort uses much time in my profile, but the merge sort
> > looks better. Will do the change.
> 
> I think both sorts are O(N log(N)).
> I had thought that list_sort() would work on a bio_list, but it requires
> a list_head (even though it doesn't use the prev pointer).
> If it worked on a bio_list and if you could just submit the whole batch,
> then using list_sort would have meant that you don't need to allocate a
> table of r5pending_data.
> Now with the struct list_head in there, the data is twice the size.
> 
> I guess that doesn't matter too much.
> 
> It just feels like there should be a cleaner solution, but I cannot find
> it without writing a new sort function (not that it would be so hard to
> do that).

How about the v2 I sent? Using a list avoids the memmove.

Thanks,
Shaohua


* [PATCH v2] blk: improve order of bio handling in generic_make_request()
From: NeilBrown @ 2017-03-07 20:38 UTC
  To: Jens Axboe, Jack Wang
  Cc: linux-block, Mike Snitzer, LKML, linux-raid,
	device-mapper development, Mikulas Patocka, Pavel Machek,
	Lars Ellenberg, Kent Overstreet
In-Reply-To: <a674456d-fb93-437e-c50e-195e7a035ba4@kernel.dk>

To avoid recursion on the kernel stack when stacked block devices
are in use, generic_make_request() will, when called recursively,
queue new requests for later handling.  They will be handled when the
make_request_fn for the current bio completes.

If any bios are submitted by a make_request_fn, these will ultimately
be handled sequentially.  If the handling of one of those generates
further requests, they will be added to the end of the queue.

This strict first-in-first-out behaviour can lead to deadlocks in
various ways, normally because a request might need to wait for a
previous request to the same device to complete.  This can happen when
they share a mempool, and can happen due to interdependencies
particular to the device.  Both md and dm have examples where this happens.

These deadlocks can be eradicated by more selective ordering of bios.
Specifically by handling them in depth-first order.  That is: when the
handling of one bio generates one or more further bios, they are
handled immediately after the parent, before any siblings of the
parent.  That way, when generic_make_request() calls make_request_fn
for some particular device, we can be certain that all previously
submitted requests for that device have been completely handled and are
not waiting for anything in the queue of requests maintained in
generic_make_request().

An easy way to achieve this would be to use a last-in-first-out stack
instead of a queue.  However this will change the order of consecutive
bios submitted by a make_request_fn, which could have unexpected consequences.
Instead we take a slightly more complex approach.
A fresh queue is created for each call to a make_request_fn.  After it completes,
any bios for a different device are placed on the front of the main queue, followed
by any bios for the same device, followed by all bios that were already on
the queue before the make_request_fn was called.
This provides the depth-first approach without reordering bios on the same level.
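
The reassembly step described above can be modelled in userspace as
follows.  This is only a sketch: struct model_bio and reorder_queue are
names invented for illustration, and devices are identified by a plain
integer rather than a request queue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio: the device it targets and a label. */
struct model_bio {
    int dev;
    int id;
};

/*
 * Rebuild the queue after a make_request_fn for cur_dev has run.
 * out receives, in order: fresh bios for other (lower) devices, fresh
 * bios for cur_dev, then the bios that were already pending.  Relative
 * order inside each group is preserved.  Returns the number written.
 */
size_t reorder_queue(int cur_dev,
                     const struct model_bio *fresh, size_t nfresh,
                     const struct model_bio *pending, size_t npending,
                     struct model_bio *out, size_t out_cap)
{
    size_t n = 0;
    size_t i;

    assert(out_cap >= nfresh + npending);

    for (i = 0; i < nfresh; i++)      /* lower level first */
        if (fresh[i].dev != cur_dev)
            out[n++] = fresh[i];
    for (i = 0; i < nfresh; i++)      /* then the same level */
        if (fresh[i].dev == cur_dev)
            out[n++] = fresh[i];
    for (i = 0; i < npending; i++)    /* then whatever was held back */
        out[n++] = pending[i];
    return n;
}
```

Because each group keeps its internal order, two bios submitted back to
back to the same device are still handled in submission order, which a
plain LIFO stack would not guarantee.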

This, by itself, is not enough to remove all deadlocks.  It just makes
it possible for drivers to take the extra step required themselves.

To avoid deadlocks, drivers must never risk waiting for a request
after submitting one to generic_make_request.  This includes never
allocating from a mempool twice in the one call to a make_request_fn.

A common pattern in drivers is to call bio_split() in a loop, handling
the first part and then looping around to possibly split the next part.
Instead, a driver that finds it needs to split a bio should queue
(with generic_make_request) the second part, handle the first part,
and then return.  The new code in generic_make_request will ensure the
requests to underlying bios are processed first, then the second bio
that was split off.  If it splits again, the same process happens.  In
each case one bio will be completely handled before the next one is attempted.

With this in place, it should be possible to disable the
punt_bios_to_rescuer() recovery thread for many block devices, and
eventually it may be possible to remove it completely.

Ref: http://www.spinics.net/lists/raid/msg54680.html
Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com>
Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/blk-core.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

Changes since v1:
 - merge code improvements from Jack Wang
 - more edits to changelog comment
 - add Ref: link.
 - Add some lists to Cc, that should have been there the first time.

diff --git a/block/blk-core.c b/block/blk-core.c
index b9e857f4afe8..9520e82aa78c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2018,17 +2018,34 @@ blk_qc_t generic_make_request(struct bio *bio)
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);

 		if (likely(blk_queue_enter(q, false) == 0)) {
+			struct bio_list hold;
+			struct bio_list lower, same;
+
+			/* Create a fresh bio_list for all subordinate requests */
+			hold = bio_list_on_stack;
+			bio_list_init(&bio_list_on_stack);
 			ret = q->make_request_fn(q, bio);

 			blk_queue_exit(q);

-			bio = bio_list_pop(current->bio_list);
+			/* sort new bios into those for a lower level
+			 * and those for the same level
+			 */
+			bio_list_init(&lower);
+			bio_list_init(&same);
+			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+				if (q == bdev_get_queue(bio->bi_bdev))
+					bio_list_add(&same, bio);
+				else
+					bio_list_add(&lower, bio);
+			/* now assemble so we handle the lowest level first */
+			bio_list_merge(&bio_list_on_stack, &lower);
+			bio_list_merge(&bio_list_on_stack, &same);
+			bio_list_merge(&bio_list_on_stack, &hold);
 		} else {
-			struct bio *bio_next = bio_list_pop(current->bio_list);
-
 			bio_io_error(bio);
-			bio = bio_next;
 		}
+		bio = bio_list_pop(current->bio_list);
 	} while (bio);
 	current->bio_list = NULL; /* deactivate */

-- 
2.12.0


* Re: [PATCH 2/3] md/raid5-cache: bump flush stripe batch size
From: Shaohua Li @ 2017-03-07 20:50 UTC
  To: NeilBrown; +Cc: linux-raid, songliubraving
In-Reply-To: <87efyakjp7.fsf@notabene.neil.brown.name>

On Mon, Mar 06, 2017 at 05:23:00PM +1100, Neil Brown wrote:
> On Fri, Mar 03 2017, Shaohua Li wrote:
> 
> > On Fri, Mar 03, 2017 at 02:03:31PM +1100, Neil Brown wrote:
> >> On Fri, Feb 17 2017, Shaohua Li wrote:
> >> 
> >> > Bump the flush stripe batch size to 2048. For my 12 disks raid
> >> > array, the stripes takes:
> >> > 12 * 4k * 2048 = 96MB
> >> >
> >> > This is still quite small. A hardware raid card generally has 1GB size,
> >> > which we suggest the raid5-cache has similar cache size.
> >> >
> >> > The advantage of a big batch size is we can dispatch a lot of IO in the
> >> > same time, then we can do some scheduling to make better IO pattern.
> >> >
> >> > Last patch prioritizes stripes, so we don't worry about a big flush
> >> > stripe batch will starve normal stripes.
> >> >
> >> > Signed-off-by: Shaohua Li <shli@fb.com>
> >> > ---
> >> >  drivers/md/raid5-cache.c | 2 +-
> >> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >> >
> >> > diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> >> > index 3f307be..b25512c 100644
> >> > --- a/drivers/md/raid5-cache.c
> >> > +++ b/drivers/md/raid5-cache.c
> >> > @@ -43,7 +43,7 @@
> >> >  /* wake up reclaim thread periodically */
> >> >  #define R5C_RECLAIM_WAKEUP_INTERVAL (30 * HZ)
> >> >  /* start flush with these full stripes */
> >> > -#define R5C_FULL_STRIPE_FLUSH_BATCH 256
> >> > +#define R5C_FULL_STRIPE_FLUSH_BATCH 2048
> >> 
> >> Fixed numbers are warning signs... I wonder if there is something better
> >> we could do?   "conf->max_nr_stripes / 4" maybe?  We use that sort of
> >> number elsewhere.
> >> Would that make sense?
> >
> > The code where we check the batch size (in r5c_do_reclaim) already has a check:
> > total_cached > conf->min_nr_stripes * 1 / 2
> > so I think that's ok, no?
> 
> I'm not sure what you are saying.
> 
> I'm suggesting that we get rid of R5C_FULL_STRIPE_FLUSH_BATCH and use a
> number like "conf->max_nr_stripes / 4"
> Are you agreeing, or are you saying that you don't think we need to get
> rid of R5C_FULL_STRIPE_FLUSH_BATCH??

What I mean is we already check min_nr_stripes, which is related to
max_nr_stripes, so we don't need to check max_nr_stripes again. Thinking about
this more, max_nr_stripes / 4 does make more sense if the cache is very big.
I'll change R5C_FULL_STRIPE_FLUSH_BATCH to 'conf->max_nr_stripes / 4'.
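
The arithmetic behind both options, as a small sketch.  The helper
names are hypothetical and the 4k-per-disk-per-stripe figure is taken
from the patch description quoted above.

```c
#include <assert.h>
#include <stdint.h>

#define STRIPE_PAGE_BYTES 4096u /* one 4k page per disk per stripe */

/* Memory covered by one flush batch: disks * 4k * stripes in the batch. */
uint64_t flush_batch_bytes(unsigned int raid_disks, unsigned int batch_stripes)
{
    assert(raid_disks > 0);
    return (uint64_t)raid_disks * STRIPE_PAGE_BYTES * batch_stripes;
}

/* Batch size derived from the stripe cache size instead of a constant. */
unsigned int flush_batch_from_cache(unsigned int max_nr_stripes)
{
    return max_nr_stripes / 4;
}
```

For the 12-disk array in the patch description, a batch of 2048 stripes
covers 12 * 4096 * 2048 bytes, which is the 96MB quoted there; with the
derived form, the same batch size corresponds to a stripe cache of 8192
entries.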

Thanks,
Shaohua


* [PATCH v2] md/r5cache: improve recovery with read ahead page pool
From: Song Liu @ 2017-03-07 21:47 UTC
  To: linux-raid; +Cc: shli, neilb, kernel-team, dan.j.williams, hch, Song Liu

In r5cache recovery, the journal device is scanned page by page.
Currently, we use sync_page_io() to read the journal device. This is
not efficient when we have to recover many stripes from the journal.

To improve the speed of recovery, this patch introduces a read ahead
page pool (ra_pool) to recovery_ctx. With ra_pool, multiple consecutive
pages are read in one IO. Then the recovery code reads the journal from
ra_pool.

With ra_pool, r5l_recovery_ctx has become much bigger. Therefore,
r5l_recovery_log() is refactored so r5l_recovery_ctx is not using
stack space.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 225 +++++++++++++++++++++++++++++++++++++----------
 1 file changed, 178 insertions(+), 47 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 3f307be..73a24ab 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -30,6 +30,7 @@
  * underneath hardware sector size. only works with PAGE_SIZE == 4096
  */
 #define BLOCK_SECTORS (8)
+#define BLOCK_SECTOR_SHIFT (3)
 
 /*
  * log->max_free_space is min(1/4 disk size, 10G reclaimable space).
@@ -1552,6 +1553,8 @@ bool r5l_log_disk_error(struct r5conf *conf)
 	return ret;
 }
 
+#define R5L_RECOVERY_PAGE_POOL_SIZE 256
+
 struct r5l_recovery_ctx {
 	struct page *meta_page;		/* current meta */
 	sector_t meta_total_blocks;	/* total size of current meta and data */
@@ -1560,18 +1563,131 @@ struct r5l_recovery_ctx {
 	int data_parity_stripes;	/* number of data_parity stripes */
 	int data_only_stripes;		/* number of data_only stripes */
 	struct list_head cached_list;
+
+	/*
+	 * read ahead page pool (ra_pool)
+	 * in recovery, log is read sequentially. It is not efficient to
+	 * read every page with sync_page_io(). The read ahead page pool
+	 * reads multiple pages with one IO, so further log read can
+	 * just copy data from the pool.
+	 */
+	struct page *ra_pool[R5L_RECOVERY_PAGE_POOL_SIZE];
+	sector_t pool_offset;	/* offset of first page in the pool */
+	int total_pages;	/* total allocated pages */
+	int valid_pages;	/* pages with valid data */
+	struct bio *ra_bio;	/* bio to do the read ahead */
 };
 
+static int r5l_recovery_allocate_ra_pool(struct r5l_log *log,
+					    struct r5l_recovery_ctx *ctx)
+{
+	struct page *page;
+
+	ctx->ra_bio = bio_alloc_bioset(GFP_KERNEL, BIO_MAX_PAGES, log->bs);
+	if (!ctx->ra_bio)
+		return -ENOMEM;
+
+	ctx->valid_pages = 0;
+	ctx->total_pages = 0;
+	while (ctx->total_pages < R5L_RECOVERY_PAGE_POOL_SIZE) {
+		page = alloc_page(GFP_KERNEL);
+
+		if (!page)
+			break;
+		ctx->ra_pool[ctx->total_pages] = page;
+		ctx->total_pages += 1;
+	}
+
+	if (ctx->total_pages == 0) {
+		bio_put(ctx->ra_bio);
+		return -ENOMEM;
+	}
+
+	ctx->pool_offset = 0;
+	return 0;
+}
+
+static void r5l_recovery_free_ra_pool(struct r5l_log *log,
+					struct r5l_recovery_ctx *ctx)
+{
+	int i;
+
+	for (i = 0; i < ctx->total_pages; ++i)
+		put_page(ctx->ra_pool[i]);
+	bio_put(ctx->ra_bio);
+}
+
+/*
+ * fetch ctx->valid_pages pages from offset
+ * In normal cases, ctx->valid_pages == ctx->total_pages after the call.
+ * However, if the offset is close to the end of the journal device,
+ * ctx->valid_pages could be smaller than ctx->total_pages
+ */
+static int r5l_recovery_fetch_ra_pool(struct r5l_log *log,
+				      struct r5l_recovery_ctx *ctx,
+				      sector_t offset)
+{
+	bio_reset(ctx->ra_bio);
+	ctx->ra_bio->bi_bdev = log->rdev->bdev;
+	bio_set_op_attrs(ctx->ra_bio, REQ_OP_READ, 0);
+	ctx->ra_bio->bi_iter.bi_sector = log->rdev->data_offset + offset;
+
+	ctx->valid_pages = 0;
+	ctx->pool_offset = offset;
+
+	while (ctx->valid_pages < ctx->total_pages) {
+		bio_add_page(ctx->ra_bio,
+			     ctx->ra_pool[ctx->valid_pages], PAGE_SIZE, 0);
+		ctx->valid_pages += 1;
+
+		offset = r5l_ring_add(log, offset, BLOCK_SECTORS);
+
+		if (offset == 0)  /* reached end of the device */
+			break;
+	}
+
+	return submit_bio_wait(ctx->ra_bio);
+}
+
+/*
+ * try read a page from the read ahead page pool, if the page is not in the
+ * pool, call r5l_recovery_fetch_ra_pool
+ */
+static int r5l_recovery_read_page(struct r5l_log *log,
+				  struct r5l_recovery_ctx *ctx,
+				  struct page *page,
+				  sector_t offset)
+{
+	int ret;
+
+	if (offset < ctx->pool_offset ||
+	    offset >= ctx->pool_offset + ctx->valid_pages * BLOCK_SECTORS) {
+		ret = r5l_recovery_fetch_ra_pool(log, ctx, offset);
+		if (ret)
+			return ret;
+	}
+
+	BUG_ON(offset < ctx->pool_offset ||
+	       offset >= ctx->pool_offset + ctx->valid_pages * BLOCK_SECTORS);
+
+	memcpy(page_address(page),
+	       page_address(ctx->ra_pool[(offset - ctx->pool_offset) >>
+					 BLOCK_SECTOR_SHIFT]),
+	       PAGE_SIZE);
+	return 0;
+}
+
 static int r5l_recovery_read_meta_block(struct r5l_log *log,
 					struct r5l_recovery_ctx *ctx)
 {
 	struct page *page = ctx->meta_page;
 	struct r5l_meta_block *mb;
 	u32 crc, stored_crc;
+	int ret;
 
-	if (!sync_page_io(log->rdev, ctx->pos, PAGE_SIZE, page, REQ_OP_READ, 0,
-			  false))
-		return -EIO;
+	ret = r5l_recovery_read_page(log, ctx, page, ctx->pos);
+	if (ret != 0)
+		return ret;
 
 	mb = page_address(page);
 	stored_crc = le32_to_cpu(mb->checksum);
@@ -1653,8 +1769,7 @@ static void r5l_recovery_load_data(struct r5l_log *log,
 	raid5_compute_sector(conf,
 			     le64_to_cpu(payload->location), 0,
 			     &dd_idx, sh);
-	sync_page_io(log->rdev, log_offset, PAGE_SIZE,
-		     sh->dev[dd_idx].page, REQ_OP_READ, 0, false);
+	r5l_recovery_read_page(log, ctx, sh->dev[dd_idx].page, log_offset);
 	sh->dev[dd_idx].log_checksum =
 		le32_to_cpu(payload->checksum[0]);
 	ctx->meta_total_blocks += BLOCK_SECTORS;
@@ -1673,17 +1788,15 @@ static void r5l_recovery_load_parity(struct r5l_log *log,
 	struct r5conf *conf = mddev->private;
 
 	ctx->meta_total_blocks += BLOCK_SECTORS * conf->max_degraded;
-	sync_page_io(log->rdev, log_offset, PAGE_SIZE,
-		     sh->dev[sh->pd_idx].page, REQ_OP_READ, 0, false);
+	r5l_recovery_read_page(log, ctx, sh->dev[sh->pd_idx].page, log_offset);
 	sh->dev[sh->pd_idx].log_checksum =
 		le32_to_cpu(payload->checksum[0]);
 	set_bit(R5_Wantwrite, &sh->dev[sh->pd_idx].flags);
 
 	if (sh->qd_idx >= 0) {
-		sync_page_io(log->rdev,
-			     r5l_ring_add(log, log_offset, BLOCK_SECTORS),
-			     PAGE_SIZE, sh->dev[sh->qd_idx].page,
-			     REQ_OP_READ, 0, false);
+		r5l_recovery_read_page(
+			log, ctx, sh->dev[sh->qd_idx].page,
+			r5l_ring_add(log, log_offset, BLOCK_SECTORS));
 		sh->dev[sh->qd_idx].log_checksum =
 			le32_to_cpu(payload->checksum[1]);
 		set_bit(R5_Wantwrite, &sh->dev[sh->qd_idx].flags);
@@ -1814,14 +1927,15 @@ r5c_recovery_replay_stripes(struct list_head *cached_stripe_list,
 
 /* if matches return 0; otherwise return -EINVAL */
 static int
-r5l_recovery_verify_data_checksum(struct r5l_log *log, struct page *page,
+r5l_recovery_verify_data_checksum(struct r5l_log *log,
+				  struct r5l_recovery_ctx *ctx,
+				  struct page *page,
 				  sector_t log_offset, __le32 log_checksum)
 {
 	void *addr;
 	u32 checksum;
 
-	sync_page_io(log->rdev, log_offset, PAGE_SIZE,
-		     page, REQ_OP_READ, 0, false);
+	r5l_recovery_read_page(log, ctx, page, log_offset);
 	addr = kmap_atomic(page);
 	checksum = crc32c_le(log->uuid_checksum, addr, PAGE_SIZE);
 	kunmap_atomic(addr);
@@ -1853,17 +1967,17 @@ r5l_recovery_verify_data_checksum_for_mb(struct r5l_log *log,
 
 		if (payload->header.type == R5LOG_PAYLOAD_DATA) {
 			if (r5l_recovery_verify_data_checksum(
-				    log, page, log_offset,
+				    log, ctx, page, log_offset,
 				    payload->checksum[0]) < 0)
 				goto mismatch;
 		} else if (payload->header.type == R5LOG_PAYLOAD_PARITY) {
 			if (r5l_recovery_verify_data_checksum(
-				    log, page, log_offset,
+				    log, ctx, page, log_offset,
 				    payload->checksum[0]) < 0)
 				goto mismatch;
 			if (conf->max_degraded == 2 && /* q for RAID 6 */
 			    r5l_recovery_verify_data_checksum(
-				    log, page,
+				    log, ctx, page,
 				    r5l_ring_add(log, log_offset,
 						 BLOCK_SECTORS),
 				    payload->checksum[1]) < 0)
@@ -2241,55 +2355,72 @@ static void r5c_recovery_flush_data_only_stripes(struct r5l_log *log,
 static int r5l_recovery_log(struct r5l_log *log)
 {
 	struct mddev *mddev = log->rdev->mddev;
-	struct r5l_recovery_ctx ctx;
+	struct r5l_recovery_ctx *ctx;
 	int ret;
 	sector_t pos;
 
-	ctx.pos = log->last_checkpoint;
-	ctx.seq = log->last_cp_seq;
-	ctx.meta_page = alloc_page(GFP_KERNEL);
-	ctx.data_only_stripes = 0;
-	ctx.data_parity_stripes = 0;
-	INIT_LIST_HEAD(&ctx.cached_list);
-
-	if (!ctx.meta_page)
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
 		return -ENOMEM;
 
-	ret = r5c_recovery_flush_log(log, &ctx);
-	__free_page(ctx.meta_page);
+	ctx->pos = log->last_checkpoint;
+	ctx->seq = log->last_cp_seq;
+	ctx->data_only_stripes = 0;
+	ctx->data_parity_stripes = 0;
+	INIT_LIST_HEAD(&ctx->cached_list);
+	ctx->meta_page = alloc_page(GFP_KERNEL);
 
-	if (ret)
-		return ret;
+	if (!ctx->meta_page) {
+		ret =  -ENOMEM;
+		goto meta_page;
+	}
+
+	if (r5l_recovery_allocate_ra_pool(log, ctx) != 0) {
+		ret = -ENOMEM;
+		goto ra_pool;
+	}
 
-	pos = ctx.pos;
-	ctx.seq += 10000;
+	ret = r5c_recovery_flush_log(log, ctx);
 
+	if (ret)
+		goto error;
 
-	if ((ctx.data_only_stripes == 0) && (ctx.data_parity_stripes == 0))
+	pos = ctx->pos;
+	ctx->seq += 10000;
+
+	if ((ctx->data_only_stripes == 0) && (ctx->data_parity_stripes == 0))
 		pr_debug("md/raid:%s: starting from clean shutdown\n",
 			 mdname(mddev));
 	else
-		pr_debug("md/raid:%s: recovering %d data-only stripes and %d data-parity stripes\n",
-			 mdname(mddev), ctx.data_only_stripes,
-			 ctx.data_parity_stripes);
-
-	if (ctx.data_only_stripes == 0) {
-		log->next_checkpoint = ctx.pos;
-		r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq++);
-		ctx.pos = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
-	} else if (r5c_recovery_rewrite_data_only_stripes(log, &ctx)) {
+		pr_info("md/raid:%s: recovering %d data-only stripes and %d data-parity stripes\n",
+			 mdname(mddev), ctx->data_only_stripes,
+			 ctx->data_parity_stripes);
+
+	if (ctx->data_only_stripes == 0) {
+		log->next_checkpoint = ctx->pos;
+		r5l_log_write_empty_meta_block(log, ctx->pos, ctx->seq++);
+		ctx->pos = r5l_ring_add(log, ctx->pos, BLOCK_SECTORS);
+	} else if (r5c_recovery_rewrite_data_only_stripes(log, ctx)) {
 		pr_err("md/raid:%s: failed to rewrite stripes to journal\n",
 		       mdname(mddev));
-		return -EIO;
+		ret =  -EIO;
+		goto error;
 	}
 
-	log->log_start = ctx.pos;
-	log->seq = ctx.seq;
+	log->log_start = ctx->pos;
+	log->seq = ctx->seq;
 	log->last_checkpoint = pos;
 	r5l_write_super(log, pos);
 
-	r5c_recovery_flush_data_only_stripes(log, &ctx);
-	return 0;
+	r5c_recovery_flush_data_only_stripes(log, ctx);
+	ret = 0;
+error:
+	r5l_recovery_free_ra_pool(log, ctx);
+ra_pool:
+	__free_page(ctx->meta_page);
+meta_page:
+	kfree(ctx);
+	return ret;
 }
 
 static void r5l_write_super(struct r5l_log *log, sector_t cp)
-- 
2.9.3


^ permalink raw reply related

* [PATCH v3] md/r5cache: improve recovery with read ahead page pool
From: Song Liu @ 2017-03-08  0:49 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, kernel-team, dan.j.williams, hch, Song Liu

In r5cache recovery, the journal device is scanned page by page.
Currently, we use sync_page_io() to read the journal device. This is
not efficient when we have to recover many stripes from the journal.

To improve the speed of recovery, this patch introduces a read ahead
page pool (ra_pool) to recovery_ctx. With ra_pool, multiple consecutive
pages are read in one IO. Then the recovery code reads the journal from
ra_pool.

With ra_pool, r5l_recovery_ctx has become much bigger. Therefore,
r5l_recovery_log() is refactored so that r5l_recovery_ctx is allocated
with kzalloc() and no longer uses stack space.
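
As a rough user-space illustration of the lookup described above (not
the kernel code itself), the sketch below keeps a small pool of pages
and refills it with one bulk read whenever the requested offset falls
outside the pool. The names struct ra_ctx, ra_fetch() and
ra_read_page() are hypothetical, the "device" is just a memory buffer,
and the pool stops at the end of the buffer rather than using the
ring arithmetic of r5l_ring_add().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SECTOR_BYTES  512
#define PAGE_BYTES    4096
#define BLOCK_SECTORS 8	/* one 4096-byte page in 512-byte sectors */
#define POOL_PAGES    4	/* illustrative; the patch uses 256 */

/* Hypothetical stand-in for the ra_pool fields of r5l_recovery_ctx. */
struct ra_ctx {
	const unsigned char *dev;	/* in-memory stand-in for the journal */
	size_t dev_sectors;		/* size in sectors, page aligned */
	unsigned char pool[POOL_PAGES][PAGE_BYTES];
	size_t pool_offset;		/* sector of first page in the pool */
	int valid_pages;		/* pages holding valid data */
	int fetches;			/* number of bulk reads issued */
};

/* Refill the pool starting at 'offset' with one bulk read. */
static void ra_fetch(struct ra_ctx *ctx, size_t offset)
{
	ctx->pool_offset = offset;
	ctx->valid_pages = 0;

	while (ctx->valid_pages < POOL_PAGES) {
		memcpy(ctx->pool[ctx->valid_pages],
		       ctx->dev + offset * SECTOR_BYTES, PAGE_BYTES);
		ctx->valid_pages++;

		offset += BLOCK_SECTORS;
		if (offset >= ctx->dev_sectors)	/* reached end of device */
			break;
	}
	ctx->fetches++;
}

/* Copy one page out of the pool, refilling the pool on a miss. */
static void ra_read_page(struct ra_ctx *ctx, unsigned char *dst,
			 size_t offset)
{
	assert(offset % BLOCK_SECTORS == 0);
	assert(offset + BLOCK_SECTORS <= ctx->dev_sectors);

	if (offset < ctx->pool_offset ||
	    offset >= ctx->pool_offset +
		      (size_t)ctx->valid_pages * BLOCK_SECTORS)
		ra_fetch(ctx, offset);

	memcpy(dst, ctx->pool[(offset - ctx->pool_offset) / BLOCK_SECTORS],
	       PAGE_BYTES);
}
```

With a sequential scan, only every POOL_PAGES-th call to
ra_read_page() pays for a read; the others are a plain memcpy().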

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 223 +++++++++++++++++++++++++++++++++++++----------
 1 file changed, 177 insertions(+), 46 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 3f307be..0d744d5 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -30,6 +30,7 @@
  * underneath hardware sector size. only works with PAGE_SIZE == 4096
  */
 #define BLOCK_SECTORS (8)
+#define BLOCK_SECTOR_SHIFT (3)
 
 /*
  * log->max_free_space is min(1/4 disk size, 10G reclaimable space).
@@ -1552,6 +1553,8 @@ bool r5l_log_disk_error(struct r5conf *conf)
 	return ret;
 }
 
+#define R5L_RECOVERY_PAGE_POOL_SIZE 256
+
 struct r5l_recovery_ctx {
 	struct page *meta_page;		/* current meta */
 	sector_t meta_total_blocks;	/* total size of current meta and data */
@@ -1560,18 +1563,131 @@ struct r5l_recovery_ctx {
 	int data_parity_stripes;	/* number of data_parity stripes */
 	int data_only_stripes;		/* number of data_only stripes */
 	struct list_head cached_list;
+
+	/*
+	 * read ahead page pool (ra_pool)
+	 * in recovery, log is read sequentially. It is not efficient to
+	 * read every page with sync_page_io(). The read ahead page pool
+	 * reads multiple pages with one IO, so further log read can
+	 * just copy data from the pool.
+	 */
+	struct page *ra_pool[R5L_RECOVERY_PAGE_POOL_SIZE];
+	sector_t pool_offset;	/* offset of first page in the pool */
+	int total_pages;	/* total allocated pages */
+	int valid_pages;	/* pages with valid data */
+	struct bio *ra_bio;	/* bio to do the read ahead */
 };
 
+static int r5l_recovery_allocate_ra_pool(struct r5l_log *log,
+					    struct r5l_recovery_ctx *ctx)
+{
+	struct page *page;
+
+	ctx->ra_bio = bio_alloc_bioset(GFP_KERNEL, BIO_MAX_PAGES, log->bs);
+	if (!ctx->ra_bio)
+		return -ENOMEM;
+
+	ctx->valid_pages = 0;
+	ctx->total_pages = 0;
+	while (ctx->total_pages < R5L_RECOVERY_PAGE_POOL_SIZE) {
+		page = alloc_page(GFP_KERNEL);
+
+		if (!page)
+			break;
+		ctx->ra_pool[ctx->total_pages] = page;
+		ctx->total_pages += 1;
+	}
+
+	if (ctx->total_pages == 0) {
+		bio_put(ctx->ra_bio);
+		return -ENOMEM;
+	}
+
+	ctx->pool_offset = 0;
+	return 0;
+}
+
+static void r5l_recovery_free_ra_pool(struct r5l_log *log,
+					struct r5l_recovery_ctx *ctx)
+{
+	int i;
+
+	for (i = 0; i < ctx->total_pages; ++i)
+		put_page(ctx->ra_pool[i]);
+	bio_put(ctx->ra_bio);
+}
+
+/*
+ * fetch ctx->valid_pages pages from offset
+ * In normal cases, ctx->valid_pages == ctx->total_pages after the call.
+ * However, if the offset is close to the end of the journal device,
+ * ctx->valid_pages could be smaller than ctx->total_pages
+ */
+static int r5l_recovery_fetch_ra_pool(struct r5l_log *log,
+				      struct r5l_recovery_ctx *ctx,
+				      sector_t offset)
+{
+	bio_reset(ctx->ra_bio);
+	ctx->ra_bio->bi_bdev = log->rdev->bdev;
+	bio_set_op_attrs(ctx->ra_bio, REQ_OP_READ, 0);
+	ctx->ra_bio->bi_iter.bi_sector = log->rdev->data_offset + offset;
+
+	ctx->valid_pages = 0;
+	ctx->pool_offset = offset;
+
+	while (ctx->valid_pages < ctx->total_pages) {
+		bio_add_page(ctx->ra_bio,
+			     ctx->ra_pool[ctx->valid_pages], PAGE_SIZE, 0);
+		ctx->valid_pages += 1;
+
+		offset = r5l_ring_add(log, offset, BLOCK_SECTORS);
+
+		if (offset == 0)  /* reached end of the device */
+			break;
+	}
+
+	return submit_bio_wait(ctx->ra_bio);
+}
+
+/*
+ * try read a page from the read ahead page pool, if the page is not in the
+ * pool, call r5l_recovery_fetch_ra_pool
+ */
+static int r5l_recovery_read_page(struct r5l_log *log,
+				  struct r5l_recovery_ctx *ctx,
+				  struct page *page,
+				  sector_t offset)
+{
+	int ret;
+
+	if (offset < ctx->pool_offset ||
+	    offset >= ctx->pool_offset + ctx->valid_pages * BLOCK_SECTORS) {
+		ret = r5l_recovery_fetch_ra_pool(log, ctx, offset);
+		if (ret)
+			return ret;
+	}
+
+	BUG_ON(offset < ctx->pool_offset ||
+	       offset >= ctx->pool_offset + ctx->valid_pages * BLOCK_SECTORS);
+
+	memcpy(page_address(page),
+	       page_address(ctx->ra_pool[(offset - ctx->pool_offset) >>
+					 BLOCK_SECTOR_SHIFT]),
+	       PAGE_SIZE);
+	return 0;
+}
+
 static int r5l_recovery_read_meta_block(struct r5l_log *log,
 					struct r5l_recovery_ctx *ctx)
 {
 	struct page *page = ctx->meta_page;
 	struct r5l_meta_block *mb;
 	u32 crc, stored_crc;
+	int ret;
 
-	if (!sync_page_io(log->rdev, ctx->pos, PAGE_SIZE, page, REQ_OP_READ, 0,
-			  false))
-		return -EIO;
+	ret = r5l_recovery_read_page(log, ctx, page, ctx->pos);
+	if (ret != 0)
+		return ret;
 
 	mb = page_address(page);
 	stored_crc = le32_to_cpu(mb->checksum);
@@ -1653,8 +1769,7 @@ static void r5l_recovery_load_data(struct r5l_log *log,
 	raid5_compute_sector(conf,
 			     le64_to_cpu(payload->location), 0,
 			     &dd_idx, sh);
-	sync_page_io(log->rdev, log_offset, PAGE_SIZE,
-		     sh->dev[dd_idx].page, REQ_OP_READ, 0, false);
+	r5l_recovery_read_page(log, ctx, sh->dev[dd_idx].page, log_offset);
 	sh->dev[dd_idx].log_checksum =
 		le32_to_cpu(payload->checksum[0]);
 	ctx->meta_total_blocks += BLOCK_SECTORS;
@@ -1673,17 +1788,15 @@ static void r5l_recovery_load_parity(struct r5l_log *log,
 	struct r5conf *conf = mddev->private;
 
 	ctx->meta_total_blocks += BLOCK_SECTORS * conf->max_degraded;
-	sync_page_io(log->rdev, log_offset, PAGE_SIZE,
-		     sh->dev[sh->pd_idx].page, REQ_OP_READ, 0, false);
+	r5l_recovery_read_page(log, ctx, sh->dev[sh->pd_idx].page, log_offset);
 	sh->dev[sh->pd_idx].log_checksum =
 		le32_to_cpu(payload->checksum[0]);
 	set_bit(R5_Wantwrite, &sh->dev[sh->pd_idx].flags);
 
 	if (sh->qd_idx >= 0) {
-		sync_page_io(log->rdev,
-			     r5l_ring_add(log, log_offset, BLOCK_SECTORS),
-			     PAGE_SIZE, sh->dev[sh->qd_idx].page,
-			     REQ_OP_READ, 0, false);
+		r5l_recovery_read_page(
+			log, ctx, sh->dev[sh->qd_idx].page,
+			r5l_ring_add(log, log_offset, BLOCK_SECTORS));
 		sh->dev[sh->qd_idx].log_checksum =
 			le32_to_cpu(payload->checksum[1]);
 		set_bit(R5_Wantwrite, &sh->dev[sh->qd_idx].flags);
@@ -1814,14 +1927,15 @@ r5c_recovery_replay_stripes(struct list_head *cached_stripe_list,
 
 /* if matches return 0; otherwise return -EINVAL */
 static int
-r5l_recovery_verify_data_checksum(struct r5l_log *log, struct page *page,
+r5l_recovery_verify_data_checksum(struct r5l_log *log,
+				  struct r5l_recovery_ctx *ctx,
+				  struct page *page,
 				  sector_t log_offset, __le32 log_checksum)
 {
 	void *addr;
 	u32 checksum;
 
-	sync_page_io(log->rdev, log_offset, PAGE_SIZE,
-		     page, REQ_OP_READ, 0, false);
+	r5l_recovery_read_page(log, ctx, page, log_offset);
 	addr = kmap_atomic(page);
 	checksum = crc32c_le(log->uuid_checksum, addr, PAGE_SIZE);
 	kunmap_atomic(addr);
@@ -1853,17 +1967,17 @@ r5l_recovery_verify_data_checksum_for_mb(struct r5l_log *log,
 
 		if (payload->header.type == R5LOG_PAYLOAD_DATA) {
 			if (r5l_recovery_verify_data_checksum(
-				    log, page, log_offset,
+				    log, ctx, page, log_offset,
 				    payload->checksum[0]) < 0)
 				goto mismatch;
 		} else if (payload->header.type == R5LOG_PAYLOAD_PARITY) {
 			if (r5l_recovery_verify_data_checksum(
-				    log, page, log_offset,
+				    log, ctx, page, log_offset,
 				    payload->checksum[0]) < 0)
 				goto mismatch;
 			if (conf->max_degraded == 2 && /* q for RAID 6 */
 			    r5l_recovery_verify_data_checksum(
-				    log, page,
+				    log, ctx, page,
 				    r5l_ring_add(log, log_offset,
 						 BLOCK_SECTORS),
 				    payload->checksum[1]) < 0)
@@ -2241,55 +2355,72 @@ static void r5c_recovery_flush_data_only_stripes(struct r5l_log *log,
 static int r5l_recovery_log(struct r5l_log *log)
 {
 	struct mddev *mddev = log->rdev->mddev;
-	struct r5l_recovery_ctx ctx;
+	struct r5l_recovery_ctx *ctx;
 	int ret;
 	sector_t pos;
 
-	ctx.pos = log->last_checkpoint;
-	ctx.seq = log->last_cp_seq;
-	ctx.meta_page = alloc_page(GFP_KERNEL);
-	ctx.data_only_stripes = 0;
-	ctx.data_parity_stripes = 0;
-	INIT_LIST_HEAD(&ctx.cached_list);
-
-	if (!ctx.meta_page)
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
 		return -ENOMEM;
 
-	ret = r5c_recovery_flush_log(log, &ctx);
-	__free_page(ctx.meta_page);
+	ctx->pos = log->last_checkpoint;
+	ctx->seq = log->last_cp_seq;
+	ctx->data_only_stripes = 0;
+	ctx->data_parity_stripes = 0;
+	INIT_LIST_HEAD(&ctx->cached_list);
+	ctx->meta_page = alloc_page(GFP_KERNEL);
 
-	if (ret)
-		return ret;
+	if (!ctx->meta_page) {
+		ret =  -ENOMEM;
+		goto meta_page;
+	}
+
+	if (r5l_recovery_allocate_ra_pool(log, ctx) != 0) {
+		ret = -ENOMEM;
+		goto ra_pool;
+	}
 
-	pos = ctx.pos;
-	ctx.seq += 10000;
+	ret = r5c_recovery_flush_log(log, ctx);
 
+	if (ret)
+		goto error;
 
-	if ((ctx.data_only_stripes == 0) && (ctx.data_parity_stripes == 0))
+	pos = ctx->pos;
+	ctx->seq += 10000;
+
+	if ((ctx->data_only_stripes == 0) && (ctx->data_parity_stripes == 0))
 		pr_debug("md/raid:%s: starting from clean shutdown\n",
 			 mdname(mddev));
 	else
 		pr_debug("md/raid:%s: recovering %d data-only stripes and %d data-parity stripes\n",
-			 mdname(mddev), ctx.data_only_stripes,
-			 ctx.data_parity_stripes);
-
-	if (ctx.data_only_stripes == 0) {
-		log->next_checkpoint = ctx.pos;
-		r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq++);
-		ctx.pos = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
-	} else if (r5c_recovery_rewrite_data_only_stripes(log, &ctx)) {
+			 mdname(mddev), ctx->data_only_stripes,
+			 ctx->data_parity_stripes);
+
+	if (ctx->data_only_stripes == 0) {
+		log->next_checkpoint = ctx->pos;
+		r5l_log_write_empty_meta_block(log, ctx->pos, ctx->seq++);
+		ctx->pos = r5l_ring_add(log, ctx->pos, BLOCK_SECTORS);
+	} else if (r5c_recovery_rewrite_data_only_stripes(log, ctx)) {
 		pr_err("md/raid:%s: failed to rewrite stripes to journal\n",
 		       mdname(mddev));
-		return -EIO;
+		ret =  -EIO;
+		goto error;
 	}
 
-	log->log_start = ctx.pos;
-	log->seq = ctx.seq;
+	log->log_start = ctx->pos;
+	log->seq = ctx->seq;
 	log->last_checkpoint = pos;
 	r5l_write_super(log, pos);
 
-	r5c_recovery_flush_data_only_stripes(log, &ctx);
-	return 0;
+	r5c_recovery_flush_data_only_stripes(log, ctx);
+	ret = 0;
+error:
+	r5l_recovery_free_ra_pool(log, ctx);
+ra_pool:
+	__free_page(ctx->meta_page);
+meta_page:
+	kfree(ctx);
+	return ret;
 }
 
 static void r5l_write_super(struct r5l_log *log, sector_t cp)
-- 
2.9.3


^ permalink raw reply related

* [PATCH 1/2] md/r5cache: handle R5LOG_PAYLOAD_FLUSH in recovery
From: Song Liu @ 2017-03-08  1:44 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, kernel-team, dan.j.williams, hch, Song Liu

This patch adds handling of R5LOG_PAYLOAD_FLUSH in journal recovery.
The next patch adds the logic that generates R5LOG_PAYLOAD_FLUSH when a
stripe flush finishes.

When R5LOG_PAYLOAD_FLUSH is seen in recovery, pending data and parity
will be dropped from recovery. This will reduce the number of stripes
to replay, and thus accelerate the recovery process.
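
The effect on the replay list can be pictured with a small user-space
sketch. Everything here is hypothetical (struct pending_stripe,
drop_flushed()), it uses a flat array where the kernel uses
cached_stripe_list, and it ignores the little-endian conversion the
real code does with le32_to_cpu()/le64_to_cpu(). The entry count is
derived from the payload size in bytes, as in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a stripe waiting to be replayed. */
struct pending_stripe {
	uint64_t sector;	/* stripe sector */
	int live;		/* 1 while the stripe still needs replay */
};

/*
 * Drop every pending stripe named in a flush payload.
 * 'payload_size' is the size of flush_stripes[] in bytes.
 * Returns the number of stripes dropped from the replay set.
 */
static int drop_flushed(struct pending_stripe *stripes, size_t nr_stripes,
			const uint64_t *flush_stripes, uint32_t payload_size)
{
	size_t count = payload_size / sizeof(uint64_t);
	size_t i, j;
	int dropped = 0;

	for (i = 0; i < count; i++) {
		for (j = 0; j < nr_stripes; j++) {
			if (stripes[j].live &&
			    stripes[j].sector == flush_stripes[i]) {
				stripes[j].live = 0;
				dropped++;
			}
		}
	}
	return dropped;
}
```

A flush entry for a stripe that is not pending is simply ignored,
which matches the "if (sh)" check in the hunk below.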

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 47 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 41 insertions(+), 6 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 0d744d5..e69f922 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1957,6 +1957,7 @@ r5l_recovery_verify_data_checksum_for_mb(struct r5l_log *log,
 	sector_t log_offset = r5l_ring_add(log, ctx->pos, BLOCK_SECTORS);
 	struct page *page;
 	struct r5l_payload_data_parity *payload;
+	struct r5l_payload_flush *payload_flush;
 
 	page = alloc_page(GFP_KERNEL);
 	if (!page)
@@ -1964,6 +1965,7 @@ r5l_recovery_verify_data_checksum_for_mb(struct r5l_log *log,
 
 	while (mb_offset < le32_to_cpu(mb->meta_size)) {
 		payload = (void *)mb + mb_offset;
+		payload_flush = (void *)mb + mb_offset;
 
 		if (payload->header.type == R5LOG_PAYLOAD_DATA) {
 			if (r5l_recovery_verify_data_checksum(
@@ -1982,15 +1984,23 @@ r5l_recovery_verify_data_checksum_for_mb(struct r5l_log *log,
 						 BLOCK_SECTORS),
 				    payload->checksum[1]) < 0)
 				goto mismatch;
-		} else /* not R5LOG_PAYLOAD_DATA or R5LOG_PAYLOAD_PARITY */
+		} else if (payload->header.type == R5LOG_PAYLOAD_FLUSH) {
+			/* nothing to do for R5LOG_PAYLOAD_FLUSH here */
+		} else /* not R5LOG_PAYLOAD_DATA/PARITY/FLUSH */
 			goto mismatch;
 
-		log_offset = r5l_ring_add(log, log_offset,
-					  le32_to_cpu(payload->size));
+		if (payload->header.type == R5LOG_PAYLOAD_FLUSH) {
+			mb_offset += sizeof(struct r5l_payload_flush) +
+				le32_to_cpu(payload_flush->size);
+		} else {
+			/* DATA or PARITY payload */
+			log_offset = r5l_ring_add(log, log_offset,
+						  le32_to_cpu(payload->size));
+			mb_offset += sizeof(struct r5l_payload_data_parity) +
+				sizeof(__le32) *
+				(le32_to_cpu(payload->size) >> (PAGE_SHIFT - 9));
+		}
 
-		mb_offset += sizeof(struct r5l_payload_data_parity) +
-			sizeof(__le32) *
-			(le32_to_cpu(payload->size) >> (PAGE_SHIFT - 9));
 	}
 
 	put_page(page);
@@ -2018,6 +2028,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
 	struct r5conf *conf = mddev->private;
 	struct r5l_meta_block *mb;
 	struct r5l_payload_data_parity *payload;
+	struct r5l_payload_flush *payload_flush;
 	int mb_offset;
 	sector_t log_offset;
 	sector_t stripe_sect;
@@ -2043,6 +2054,30 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
 		int dd;
 
 		payload = (void *)mb + mb_offset;
+		payload_flush = (void *)mb + mb_offset;
+
+		if (payload->header.type == R5LOG_PAYLOAD_FLUSH) {
+			int i, count;
+
+			count = le32_to_cpu(payload_flush->size) / sizeof(__le64);
+			for (i = 0; i < count; ++i) {
+				stripe_sect = le64_to_cpu(payload_flush->flush_stripes[i]);
+				sh = r5c_recovery_lookup_stripe(cached_stripe_list,
+								stripe_sect);
+				if (sh) {
+					WARN_ON(test_bit(STRIPE_R5C_CACHING, &sh->state));
+					r5l_recovery_reset_stripe(sh);
+					list_del_init(&sh->lru);
+					raid5_release_stripe(sh);
+				}
+			}
+
+			mb_offset += sizeof(struct r5l_payload_flush) +
+				le32_to_cpu(payload_flush->size);
+			continue;
+		}
+
+		/* DATA or PARITY payload */
 		stripe_sect = (payload->header.type == R5LOG_PAYLOAD_DATA) ?
 			raid5_compute_sector(
 				conf, le64_to_cpu(payload->location), 0, &dd,
-- 
2.9.3


^ permalink raw reply related

* [PATCH 2/2] md/r5cache: generate R5LOG_PAYLOAD_FLUSH
From: Song Liu @ 2017-03-08  1:44 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, kernel-team, dan.j.williams, hch, Song Liu
In-Reply-To: <20170308014422.4019149-1-songliubraving@fb.com>

In r5c_finish_stripe_write_out(), R5LOG_PAYLOAD_FLUSH is appended to
log->current_io.

Appending R5LOG_PAYLOAD_FLUSH in quiesce needs extra writes to the
journal. To simplify the logic, we just skip R5LOG_PAYLOAD_FLUSH in
quiesce.

R5LOG_PAYLOAD_FLUSH supports multiple stripes per payload. However,
the current implementation writes one stripe per R5LOG_PAYLOAD_FLUSH,
which is simpler.
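
To make the size arithmetic concrete, here is a user-space sketch of
appending one single-stripe flush payload to a meta buffer. The names
struct flush_header, PAYLOAD_FLUSH and append_flush() are made up for
illustration, the header layout (16-bit type, 16-bit flags, 32-bit
size) is an assumption based on the fields the patch touches, and
endianness conversion is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_FLUSH 2	/* illustrative type tag, not taken from the patch */

/* Assumed fixed part of a flush payload; stripe sectors follow it. */
struct flush_header {
	uint16_t type;
	uint16_t flags;
	uint32_t size;	/* bytes of stripe sectors that follow */
};

/*
 * Append one flush payload carrying a single stripe sector.
 * Returns 0 on success, -1 if the meta buffer has no room left.
 */
static int append_flush(unsigned char *meta, size_t meta_len,
			size_t *meta_offset, uint64_t sector)
{
	struct flush_header h = { PAYLOAD_FLUSH, 0, sizeof(uint64_t) };
	size_t need = sizeof(h) + sizeof(sector);

	if (*meta_offset + need > meta_len)
		return -1;

	/* memcpy avoids alignment assumptions about the buffer */
	memcpy(meta + *meta_offset, &h, sizeof(h));
	memcpy(meta + *meta_offset + sizeof(h), &sector, sizeof(sector));
	*meta_offset += need;
	return 0;
}
```

One stripe per payload costs a header for every stripe; batching
several sectors behind one header would save space in the meta block
at the price of more bookkeeping.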

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index e69f922..fd0bfea 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -590,7 +590,21 @@ static void r5l_log_endio(struct bio *bio)
 	mempool_free(io->meta_page, log->meta_pool);
 
 	spin_lock_irqsave(&log->io_list_lock, flags);
-	__r5l_set_io_unit_state(io, IO_UNIT_IO_END);
+
+	if (list_empty(&io->stripe_list))
+		/*
+		 * this io_unit only has R5LOG_PAYLOAD_FLUSH, set
+		 * to IO_UNIT_STRIPE_END
+		 */
+		__r5l_set_io_unit_state(io, IO_UNIT_STRIPE_END);
+	else
+		/*
+		 * io_unit with R5LOG_PAYLOAD_FLUSH and also DATA/PARITY
+		 * set to IO_UNIT_IO_END and wait for all stripes get
+		 * handled.
+		 */
+		__r5l_set_io_unit_state(io, IO_UNIT_IO_END);
+
 	if (log->need_cache_flush)
 		r5l_move_to_end_ios(log);
 	else
@@ -843,6 +857,41 @@ static void r5l_append_payload_page(struct r5l_log *log, struct page *page)
 	r5_reserve_log_entry(log, io);
 }
 
+static void r5l_append_flush_payload(struct r5l_log *log, sector_t sect)
+{
+	struct mddev *mddev = log->rdev->mddev;
+	struct r5conf *conf = mddev->private;
+	struct r5l_io_unit *io;
+	struct r5l_payload_flush *payload;
+	int meta_size;
+
+	/*
+	 * payload_flush requires extra writes to the journal.
+	 * To avoid handling the extra IO in quiesce, just skip
+	 * flush_payload
+	 */
+	if (conf->quiesce)
+		return;
+
+	mutex_lock(&log->io_mutex);
+	meta_size = sizeof(struct r5l_payload_flush) + sizeof(__le64);
+
+	if (r5l_get_meta(log, meta_size)) {
+		mutex_unlock(&log->io_mutex);
+		return;
+	}
+
+	/* current implementation is one stripe per flush payload */
+	io = log->current_io;
+	payload = page_address(io->meta_page) + io->meta_offset;
+	payload->header.type = cpu_to_le16(R5LOG_PAYLOAD_FLUSH);
+	payload->header.flags = cpu_to_le16(0);
+	payload->size = cpu_to_le32(sizeof(__le64));
+	payload->flush_stripes[0] = cpu_to_le64(sect);
+	io->meta_offset += meta_size;
+	mutex_unlock(&log->io_mutex);
+}
+
 static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
 			   int data_pages, int parity_pages)
 {
@@ -1466,6 +1515,13 @@ static void r5l_do_reclaim(struct r5l_log *log)
 		     list_empty(&log->finished_ios)))
 			break;
 
+		/*
+		 * In some cases, io_unit with only R5LOG_PAYLOAD_FLUSH
+		 * will stay in finished_ios list. It is necessary to
+		 * complete them before quiesce.
+		 */
+		r5l_complete_finished_ios(log);
+
 		md_wakeup_thread(log->rdev->mddev->thread);
 		wait_event_lock_irq(log->iounit_wait,
 				    r5l_reclaimable_space(log) > reclaimable,
@@ -2784,6 +2840,8 @@ void r5c_finish_stripe_write_out(struct r5conf *conf,
 		atomic_dec(&conf->r5c_flushing_full_stripes);
 		atomic_dec(&conf->r5c_cached_full_stripes);
 	}
+
+	r5l_append_flush_payload(log, sh->sector);
 }
 
 int
-- 
2.9.3


^ permalink raw reply related

* [PATCH V3] md: move bitmap_destroy before __md_stop
From: Guoqing Jiang @ 2017-03-08  2:31 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, shli, Guoqing Jiang

Since we have switched to handling the METADATA_UPDATED msg
synchronously for md-cluster, process_metadata_update now
depends on mddev->thread->wqueue.

With the new change, a clustered raid could possibly hang if
the array received a METADATA_UPDATED msg after the array
unregistered mddev->thread, so we need to stop the clustered
raid (bitmap_destroy -> bitmap_free -> md_cluster_stop) before
unregistering the thread (mddev_detach -> md_unregister_thread).

This change should be safe for non-clustered raid since
all writes are stopped before the destroy. Also in md_run,
we activate the personality (pers->run()) before activating
the bitmap (bitmap_create()). So it is pleasingly symmetric
to stop the bitmap (bitmap_destroy()) before stopping the
personality (__md_stop() calls pers->free()).

But we don't want to break the code that waits for behind IO,
as Shaohua mentioned, so move that code from mddev_detach to
bitmap_destroy. Since we already check the bitmap at the
beginning of bitmap_destroy, just wait for behind_writes to
reach zero if a bitmap exists.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
This version moves the code that waits for behind IO into
bitmap_destroy, so we can now safely call bitmap_destroy before
__md_stop.

 drivers/md/bitmap.c |  9 +++++++++
 drivers/md/md.c     | 13 ++-----------
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index b6fa55a3cff8..89a35bc092dd 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1771,6 +1771,15 @@ void bitmap_destroy(struct mddev *mddev)
 	if (!bitmap) /* there was no bitmap */
 		return;
 
+	/* wait for behind writes to complete */
+	if (atomic_read(&bitmap->behind_writes) > 0) {
+		printk(KERN_INFO "md:%s: behind writes in progress - waiting to stop.\n",
+		       mdname(mddev));
+		/* need to kick something here to make sure I/O goes? */
+		wait_event(bitmap->behind_wait,
+			   atomic_read(&bitmap->behind_writes) == 0);
+	}
+
 	mutex_lock(&mddev->bitmap_info.mutex);
 	spin_lock(&mddev->lock);
 	mddev->bitmap = NULL; /* disconnect from the md device */
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 79a99a1c9ce7..b63ab4f33892 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5534,15 +5534,6 @@ EXPORT_SYMBOL_GPL(md_stop_writes);
 
 static void mddev_detach(struct mddev *mddev)
 {
-	struct bitmap *bitmap = mddev->bitmap;
-	/* wait for behind writes to complete */
-	if (bitmap && atomic_read(&bitmap->behind_writes) > 0) {
-		pr_debug("md:%s: behind writes in progress - waiting to stop.\n",
-			 mdname(mddev));
-		/* need to kick something here to make sure I/O goes? */
-		wait_event(bitmap->behind_wait,
-			   atomic_read(&bitmap->behind_writes) == 0);
-	}
 	if (mddev->pers && mddev->pers->quiesce) {
 		mddev->pers->quiesce(mddev, 1);
 		mddev->pers->quiesce(mddev, 0);
@@ -5574,8 +5565,8 @@ void md_stop(struct mddev *mddev)
 	/* stop the array and free an attached data structures.
 	 * This is called from dm-raid
 	 */
-	__md_stop(mddev);
 	bitmap_destroy(mddev);
+	__md_stop(mddev);
 	if (mddev->bio_set)
 		bioset_free(mddev->bio_set);
 }
@@ -5688,6 +5679,7 @@ static int do_md_stop(struct mddev *mddev, int mode,
 			set_disk_ro(disk, 0);
 
 		__md_stop_writes(mddev);
+		bitmap_destroy(mddev);
 		__md_stop(mddev);
 		mddev->queue->backing_dev_info->congested_fn = NULL;
 
@@ -5713,7 +5705,6 @@ static int do_md_stop(struct mddev *mddev, int mode,
 	if (mode == 0) {
 		pr_info("md: %s stopped.\n", mdname(mddev));
 
-		bitmap_destroy(mddev);
 		if (mddev->bitmap_info.file) {
 			struct file *f = mddev->bitmap_info.file;
 			spin_lock(&mddev->lock);
-- 
2.6.2


^ permalink raw reply related

* Re: [PATCH 24/29] drivers: convert iblock_req.pending from atomic_t to refcount_t
From: Nicholas A. Bellinger @ 2017-03-08  7:37 UTC (permalink / raw)
  To: Elena Reshetova
  Cc: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux1394-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-raid-u79uwXL29TY76Z2rM5mHXA,
	linux-media-u79uwXL29TY76Z2rM5mHXA,
	devel-tBiZLqfeLfOHmIFyCCdPziST3g8Odh+X,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-s390-u79uwXL29TY76Z2rM5mHXA,
	fcoe-devel-s9riP+hp16TNLxjTenLetw,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	open-iscsi-/JYPxA39Uh5TLH3MbocFFw,
	devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b,
	target-devel-u79uwXL29TY76Z2rM5mHXA,
	linux-serial-u79uwXL29TY76Z2rM5mHXA,
	linux-usb-u79uwXL29TY76Z2rM5mHXA, peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Hans Liljestrand, Kees Cook, David Windsor
In-Reply-To: <1488810076-3754-25-git-send-email-elena.reshetova-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Hi Elena,

On Mon, 2017-03-06 at 16:21 +0200, Elena Reshetova wrote:
> refcount_t type and corresponding API should be
> used instead of atomic_t when the variable is used as
> a reference counter. This allows to avoid accidental
> refcounter overflows that might lead to use-after-free
> situations.
> 
> Signed-off-by: Elena Reshetova <elena.reshetova-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Signed-off-by: Hans Liljestrand <ishkamiel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Signed-off-by: Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
> Signed-off-by: David Windsor <dwindsor-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> ---
>  drivers/target/target_core_iblock.c | 12 ++++++------
>  drivers/target/target_core_iblock.h |  3 ++-
>  2 files changed, 8 insertions(+), 7 deletions(-)

For the target_core_iblock part:

Acked-by: Nicholas Bellinger <nab-IzHhD5pYlfBP7FQvKIMDCQ@public.gmane.org>

-- 
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to open-iscsi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to open-iscsi-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
Visit this group at https://groups.google.com/group/open-iscsi.
For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply

* [PATCH 0/4] mdadm:checking level once mode has been set
From: Zhilong Liu @ 2017-03-08  7:48 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu

mdadm: it would be better to check --level earlier,
because otherwise a different prompt is shown when the
user forgets to specify --level, for example:
./mdadm -CR /dev/md0 -b internal -n2 -x1 /dev/loop[0-2]

Signed-off-by: Zhilong Liu <zlliu@suse.com>

diff --git a/Create.c b/Create.c
index 9a951b0..beec29f 100644
--- a/Create.c
+++ b/Create.c
@@ -125,10 +125,6 @@ int Create(struct supertype *st, char *mddev,
 	memset(&info, 0, sizeof(info));
 	if (s->level == UnSet && st && st->ss->default_geometry)
 		st->ss->default_geometry(st, &s->level, NULL, NULL);
-	if (s->level == UnSet) {
-		pr_err("a RAID level is needed to create an array.\n");
-		return 1;
-	}
 	if (s->raiddisks < 4 && s->level == 6) {
 		pr_err("at least 4 raid-devices needed for level 6\n");
 		return 1;
diff --git a/mdadm.c b/mdadm.c
index 19a06db..ad24bdf 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -350,6 +350,12 @@ int main(int argc, char *argv[])
 				pr_err("Must give -a/--add for devices to add: %s\n", optarg);
 				exit(2);
 			}
+			if (devs_found > 0 && s.level == UnSet && !devmode) {
+				if (mode == CREATE || mode == BUILD) {
+					pr_err("a RAID level is needed to create or build an array.\n");
+					exit(2);
+				}
+			}
 			dv = xmalloc(sizeof(*dv));
 			dv->devname = optarg;
 			dv->disposition = devmode;
-- 
2.10.2


^ permalink raw reply related

* [PATCH 1/4] mdadm:bitmap cannot be set twice
From: Zhilong Liu @ 2017-03-08  7:50 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu
In-Reply-To: <20170308074831.24683-1-zlliu@suse.com>

mdadm: it doesn't make sense to set the bitmap twice.

Signed-off-by: Zhilong Liu <zlliu@suse.com>

diff --git a/mdadm.c b/mdadm.c
index b5ac061..eb9a4e9 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -1137,6 +1137,10 @@ int main(int argc, char *argv[])
 		case O(CREATE,Bitmap): /* here we create the bitmap */
 		case O(GROW,'b'):
 		case O(GROW,Bitmap):
+			if (s.bitmap_file) {
+				pr_err("bitmap cannot be set twice. Second value: %s.\n", optarg);
+				exit(2);
+			}
 			if (strcmp(optarg, "internal") == 0 ||
 			    strcmp(optarg, "none") == 0 ||
 			    strchr(optarg, '/') != NULL) {
-- 
2.10.2


^ permalink raw reply related

* [PATCH 2/4] mdadm:external bitmap only supports ext filesystem
From: Zhilong Liu @ 2017-03-08  7:51 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu
In-Reply-To: <20170308074831.24683-1-zlliu@suse.com>

mdadm: ensure that an external bitmap file is
stored on an ext[2-4] file system, because bmap()
in linux/drivers/md/bitmap.c exits directly when
the bitmap file isn't suitable. mdadm should make
users aware of this case and print a message.

Signed-off-by: Zhilong Liu <zlliu@suse.com>
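
A note on the file system test in this patch: statfs() is called without
checking its return value, so if it ever failed, f_type would be read from an
uninitialised struct. Below is a minimal sketch of the same test with the
return value checked. The helper name bitmap_path_on_ext() and the macro
EXT_FS_MAGIC are hypothetical and not part of mdadm; 0xEF53 is the magic the
patch itself compares against.

```c
#include <sys/vfs.h>	/* statfs(), struct statfs (Linux) */

/* Magic number shared by ext2, ext3 and ext4, as used in the patch. */
#define EXT_FS_MAGIC 0xEF53

/*
 * Hypothetical helper, not part of mdadm.
 * Returns 1 if 'path' lives on an ext[2-4] file system,
 * 0 if it lives on some other file system, and -1 if statfs()
 * fails (errno is left as set by statfs()).
 */
int bitmap_path_on_ext(const char *path)
{
	struct statfs sfs;

	if (statfs(path, &sfs) != 0)
		return -1;
	return sfs.f_type == EXT_FS_MAGIC ? 1 : 0;
}
```

A caller could then distinguish "cannot stat" from "wrong file system" and
print a different message for each.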

diff --git a/Create.c b/Create.c
index 2721884..9a951b0 100644
--- a/Create.c
+++ b/Create.c
@@ -831,11 +831,6 @@ int Create(struct supertype *st, char *mddev,
 			goto abort_locked;
 		}
 		bitmap_fd = open(s->bitmap_file, O_RDWR);
-		if (bitmap_fd < 0) {
-			pr_err("weird: %s cannot be openned\n",
-				s->bitmap_file);
-			goto abort_locked;
-		}
 		if (ioctl(mdfd, SET_BITMAP_FILE, bitmap_fd) < 0) {
 			pr_err("Cannot set bitmap file for %s: %s\n",
 				mddev, strerror(errno));
diff --git a/mdadm.c b/mdadm.c
index d6ad8dc..19a06db 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -28,6 +28,7 @@
 #include "mdadm.h"
 #include "md_p.h"
 #include <ctype.h>
+#include <sys/vfs.h>
 
 static int scan_assemble(struct supertype *ss,
 			 struct context *c,
@@ -1143,6 +1144,21 @@ int main(int argc, char *argv[])
 			    strcmp(optarg, "none") == 0 ||
 			    strchr(optarg, '/') != NULL) {
 				s.bitmap_file = optarg;
+				if (strchr(s.bitmap_file, '/') != NULL) {
+					bitmap_fd = open(s.bitmap_file, O_RDWR);
+					if (bitmap_fd < 0) {
+						pr_err("weird: %s cannot be opened\n", s.bitmap_file);
+						exit(2);
+					}
+					close(bitmap_fd);
+					struct statfs ext_bitmap;
+					statfs(s.bitmap_file, &ext_bitmap);
+					if (ext_bitmap.f_type != 0xEF53){
+						pr_err("external bitmap only supports ext[2-4] filesystem, %s.\n",
+							s.bitmap_file);
+						exit(2);
+					}
+				}
 				continue;
 			}
 			if (strcmp(optarg, "clustered") == 0) {
-- 
2.10.2


^ permalink raw reply related

* [PATCH 3/4] mdadm:triggers core dump when stat2devnm return NULL
From: Zhilong Liu @ 2017-03-08  7:52 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu
In-Reply-To: <20170308074831.24683-1-zlliu@suse.com>

Ensure that the device is a block device when the --wait
parameter is used; passing a regular file ('f') or a
directory ('d') triggers a core dump. For example,
./mdadm --wait /dev/md/ dumps core.

Signed-off-by: Zhilong Liu <zlliu@suse.com>
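
The guard this patch adds to Wait() can be sketched in isolation as follows.
S_ISBLK() is the POSIX macro form of the (S_IFMT & st_mode) == S_IFBLK test
used in the diff; the helper name path_is_block_device() is hypothetical and
not part of mdadm.

```c
#include <sys/stat.h>	/* stat(), S_ISBLK() */

/*
 * Hypothetical helper, not part of mdadm.
 * Returns 1 if 'path' is a block device, 0 if it is any other
 * file type (regular file, directory, ...), and -1 if stat() fails.
 */
int path_is_block_device(const char *path)
{
	struct stat stb;

	if (stat(path, &stb) != 0)
		return -1;
	return S_ISBLK(stb.st_mode) ? 1 : 0;
}
```

Rejecting anything that returns 0 or -1 here before looking up a device name
avoids handing a NULL pointer to strcpy().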

diff --git a/Monitor.c b/Monitor.c
index 802a9d9..1900db3 100644
--- a/Monitor.c
+++ b/Monitor.c
@@ -1002,6 +1002,10 @@ int Wait(char *dev)
 			strerror(errno));
 		return 2;
 	}
+	if ((S_IFMT & stb.st_mode) != S_IFBLK) {
+		pr_err("%s is not a block device.\n", dev);
+		return 2;
+	}
 	strcpy(devnm, stat2devnm(&stb));
 
 	while(1) {
diff --git a/lib.c b/lib.c
index b640634..7116298 100644
--- a/lib.c
+++ b/lib.c
@@ -89,9 +89,6 @@ char *devid2kname(int devid)
 
 char *stat2kname(struct stat *st)
 {
-	if ((S_IFMT & st->st_mode) != S_IFBLK)
-		return NULL;
-
 	return devid2kname(st->st_rdev);
 }
 
-- 
2.10.2


^ permalink raw reply related

* [PATCH 5/5] mdadm:checking level once mode has been set
From: Zhilong Liu @ 2017-03-08  8:07 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu
In-Reply-To: <20170308075059.24789-1-zlliu@suse.com>

mdadm: it would be better to check --level earlier,
because otherwise a different prompt is shown when the
user forgets to specify the --level parameter, for example:
./mdadm -CR /dev/md0 -b internal -n2 -x1 /dev/loop[0-2]

Signed-off-by: Zhilong Liu <zlliu@suse.com>

diff --git a/Create.c b/Create.c
index 2721884..50ec85e 100644
--- a/Create.c
+++ b/Create.c
@@ -125,10 +125,6 @@ int Create(struct supertype *st, char *mddev,
 	memset(&info, 0, sizeof(info));
 	if (s->level == UnSet && st && st->ss->default_geometry)
 		st->ss->default_geometry(st, &s->level, NULL, NULL);
-	if (s->level == UnSet) {
-		pr_err("a RAID level is needed to create an array.\n");
-		return 1;
-	}
 	if (s->raiddisks < 4 && s->level == 6) {
 		pr_err("at least 4 raid-devices needed for level 6\n");
 		return 1;
diff --git a/mdadm.c b/mdadm.c
index d6ad8dc..fcb33d1 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -349,6 +349,12 @@ int main(int argc, char *argv[])
 				pr_err("Must give -a/--add for devices to add: %s\n", optarg);
 				exit(2);
 			}
+			if (devs_found > 0 && s.level == UnSet && !devmode) {
+				if (mode == CREATE || mode == BUILD) {
+					pr_err("a RAID level is needed to create or build an array.\n");
+					exit(2);
+				}
+			}
 			dv = xmalloc(sizeof(*dv));
 			dv->devname = optarg;
 			dv->disposition = devmode;
-- 
2.10.2


^ permalink raw reply related

* Re: RAID Recovery
From: Adam Goryachev @ 2017-03-08  9:08 UTC (permalink / raw)
  To: Phil Turmel, linux-raid
In-Reply-To: <1066d485-ebea-d0d2-a198-33b560de99fc@turmel.org>



On 8/3/17 02:00, Phil Turmel wrote:
> Hi Adam,
>
> {Please remember to trim repetitive stuff, and interleave.}
>
> On 03/07/2017 09:06 AM, Adam Goryachev wrote:
>> BTW, just some more info I've found... either almost the entire
>> drives are RAID1 mirrors, or all 4 are RAID1 mirrors:

OK, so I now have the following:
root@ubuntu:~# cmp /dev/sda /dev/sdb
/dev/sda /dev/sdb differ: byte 1000162959365, line 243319233
root@ubuntu:~# cmp /dev/sdc /dev/sdd
/dev/sdc /dev/sdd differ: byte 1000162971653, line 243236929
root@ubuntu:~# cmp /dev/sdb /dev/sdc
/dev/sdb /dev/sdc differ: byte 499637813249, line 54927275

So drives sda/sdb are an almost complete mirror of each other, and drives
sdc/sdd are an almost complete mirror of each other.
On top of that, the first half of all four drives is a complete mirror
(which seems oversized considering a "small" root RAID1 drive....)
Why there is a difference halfway, and whether there are more differences
after that point, I haven't yet checked, but that looks like something to
come back to...
>> Other option, they have been re-initialised/zero'd or similar, and
>> thats why all the data is identical (useless). I was hoping to get a
>> starting point for where the partition boundaries might have been
>> ....
> Search the devices for ext2/3/4 superblocks, like so:
>
> dd if=/dev/sdX bs=1M 2>/dev/null |hexdump -C |grep '30  .\+  53 ef 0'
What is the chance it is an ext2/3/4 based FS? I suppose most NAS would 
use these filesystems... I guess I'll find out soon enough.

> This will take a very long time, and will generate false positives.
Can you advise what to do to "verify" these and work out which ones are 
false positives?
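
One possible way to weed out false positives, sketched under assumptions: a
hit on a hexdump line whose offset ends in 30 puts the candidate superblock
0x30 bytes earlier, and a real ext2/3/4 superblock should have sane values in
a few neighbouring fields. The helper name ext_sb_plausible() is hypothetical;
the field offsets are those of the on-disk ext2 superblock layout. This is
only a heuristic, and unusual ext4 features may violate the bounds used here.

```c
#include <stdint.h>

/* Read little-endian fields from a raw buffer. */
static uint16_t le16(const unsigned char *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const unsigned char *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Hypothetical helper: 'sb' points at 1024 bytes starting at the
 * candidate superblock. Returns 1 if the fields look like a real
 * ext2/3/4 superblock, 0 otherwise.
 */
int ext_sb_plausible(const unsigned char *sb)
{
	uint32_t log_block_size   = le32(sb + 0x18);
	uint32_t blocks_per_group = le32(sb + 0x20);
	uint32_t inodes_per_group = le32(sb + 0x28);
	uint16_t magic            = le16(sb + 0x38);

	if (magic != 0xEF53)
		return 0;
	if (log_block_size > 6)		/* block size 1 KiB .. 64 KiB */
		return 0;
	if (inodes_per_group == 0)
		return 0;
	/* one bitmap block per group covers at most 8 * block_size blocks */
	if (blocks_per_group == 0 ||
	    blocks_per_group > 8u * (1024u << log_block_size))
		return 0;
	return 1;
}
```

The 16-bit field at offset 0x5A (s_block_group_nr) is also worth a look: it is
0 for the primary superblock, which sits 1024 bytes after the start of the
file system, and non-zero for backup copies.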
> You probably would want to use screen or tmux to run these in
> parallel in separate processes.
I'm not sure there is much of a point, since they are mostly duplicates 
of each other. I'm running it on sdd now, and copying sda and sdc to a 
spare drive. I may re-run the command on sdc (and skip the first 
490GB...) if nothing useful is found on sdd.
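
For the skip amount, the cmp output can be converted directly. cmp numbers
bytes starting at 1, so the sketch below subtracts one before dividing; the
function name mib_before_cmp_byte() is made up for illustration.

```c
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)

/*
 * cmp reports 1-based byte numbers, so subtract one to get the
 * 0-based offset. Returns the number of whole MiB that lie
 * entirely before the first differing byte.
 */
uint64_t mib_before_cmp_byte(uint64_t cmp_byte)
{
	return (cmp_byte - 1) / MIB;
}
```

For the sdb/sdc result (byte 499637813249) this gives 476491, which could be
passed as skip= to dd with bs=1M. The offsets hexdump prints would then be
relative to that point rather than to the start of the drive.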
> But superblock locations will give you hints as to the rest of data,
> and make it possible to create partitions that will let you copy
> stuff off into a new array.
>
Sounds good, will see how it goes. Thanks for the advice!

Regards,
Adam

^ permalink raw reply

* RE: [PATCH 10/29] drivers, md: convert stripe_head.count from atomic_t to refcount_t
From: Reshetova, Elena @ 2017-03-08  9:39 UTC (permalink / raw)
  To: Shaohua Li
  Cc: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux1394-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org,
	linux-bcache-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-raid-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	devel-tBiZLqfeLfOHmIFyCCdPziST3g8Odh+X@public.gmane.org,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-s390-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	fcoe-devel-s9riP+hp16TNLxjTenLetw@public.gmane.org,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	open-iscsi-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org,
	devel-gWbeCf7V1WCQmaza687I9uG/Ez6ZCGd0
In-Reply-To: <20170307190759.jnrq66kfpkr4m7zl-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

> On Mon, Mar 06, 2017 at 04:20:57PM +0200, Elena Reshetova wrote:
> > refcount_t type and corresponding API should be
> > used instead of atomic_t when the variable is used as
> > a reference counter. This allows to avoid accidental
> > refcounter overflows that might lead to use-after-free
> > situations.
> >
> > Signed-off-by: Elena Reshetova <elena.reshetova-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> > Signed-off-by: Hans Liljestrand <ishkamiel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> > Signed-off-by: Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
> > Signed-off-by: David Windsor <dwindsor-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> > ---
> >  drivers/md/raid5-cache.c |  8 +++---
> >  drivers/md/raid5.c       | 66 ++++++++++++++++++++++++------------------------
> >  drivers/md/raid5.h       |  3 ++-
> >  3 files changed, 39 insertions(+), 38 deletions(-)
> >
> > diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> > index 3f307be..6c05e12 100644
> > --- a/drivers/md/raid5-cache.c
> > +++ b/drivers/md/raid5-cache.c
> 
> snip
> >  	       sh->check_state, sh->reconstruct_state);
> >
> >  	analyse_stripe(sh, &s);
> > @@ -4924,7 +4924,7 @@ static void activate_bit_delay(struct r5conf *conf,
> >  		struct stripe_head *sh = list_entry(head.next, struct stripe_head, lru);
> >  		int hash;
> >  		list_del_init(&sh->lru);
> > -		atomic_inc(&sh->count);
> > +		refcount_inc(&sh->count);
> >  		hash = sh->hash_lock_index;
> >  		__release_stripe(conf, sh, &temp_inactive_list[hash]);
> >  	}
> > @@ -5240,7 +5240,7 @@ static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
> >  		sh->group = NULL;
> >  	}
> >  	list_del_init(&sh->lru);
> > -	BUG_ON(atomic_inc_return(&sh->count) != 1);
> > +	BUG_ON(refcount_inc_not_zero(&sh->count));
> 
> This changes the behavior. refcount_inc_not_zero doesn't inc if original value is 0

Hm.. So, you want to inc here in any case and BUG if the end result differs from 1. 
So essentially you only want to increment here from zero to one under normal conditions... This is a challenge for refcount_t and goes against its design.
Would it be OK to do just this here instead:

-	BUG_ON(atomic_inc_return(&sh->count) != 1);
+	BUG_ON(refcount_read(&sh->count) != 0);
+	refcount_set(&sh->count, 1);

Do we have an issue with locking in this case? Or maybe it is then better to leave this one to be atomic_t without protection since it isn't a real refcounter as it turns out. 
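
To make the behaviour difference concrete, here is a userspace sketch using
C11 atomics. model_inc_return() and model_inc_not_zero() are hypothetical
stand-ins for illustration, not the kernel implementations, and the
saturation handling of the real refcount_t is deliberately left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model of atomic_inc_return(): always increments, returns the new value. */
int model_inc_return(atomic_int *v)
{
	return atomic_fetch_add(v, 1) + 1;
}

/*
 * Model of "increment unless zero": returns false and leaves the
 * counter untouched when it is 0, otherwise increments and returns true.
 */
bool model_inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		/* on failure 'old' is reloaded with the current value */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}
```

Starting from 0, model_inc_return() leaves the counter at 1 and returns 1, so
the original BUG_ON() check passes. model_inc_not_zero() returns false and
leaves the counter at 0, so a BUG_ON() on its result does not fire, but the
stripe is handed out without a reference.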

Best Regards,
Elena. 

> 
> Thanks,
> Shaohua

^ permalink raw reply

* RE: [PATCH 08/29] drivers, md: convert mddev.active from atomic_t to refcount_t
From: Reshetova, Elena @ 2017-03-08  9:42 UTC (permalink / raw)
  To: Shaohua Li,
	gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux1394-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org,
	linux-bcache-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-raid-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	devel-tBiZLqfeLfOHmIFyCCdPziST3g8Odh+X@public.gmane.org,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-s390-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	fcoe-devel-s9riP+hp16TNLxjTenLetw@public.gmane.org,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	open-iscsi-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org,
	devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b@public.gmane.org,
	target-devel-u79uwXL29TZNg+MwTxZMZA
In-Reply-To: <20170307190449.baceyzzngsz776x7-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

> On Mon, Mar 06, 2017 at 04:20:55PM +0200, Elena Reshetova wrote:
> > refcount_t type and corresponding API should be
> > used instead of atomic_t when the variable is used as
> > a reference counter. This allows to avoid accidental
> > refcounter overflows that might lead to use-after-free
> > situations.
> 
> Looks good. Let me know how do you want to route the patch to upstream.

Greg, you previously mentioned that driver conversions can go via your tree. Does this still apply?
Or should I be asking maintainers to merge these patches via their trees? 
I am not sure which way is correct (and easiest for everyone); please suggest.  

Best Regards,
Elena.
> 
> > Signed-off-by: Elena Reshetova <elena.reshetova-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> > Signed-off-by: Hans Liljestrand <ishkamiel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> > Signed-off-by: Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
> > Signed-off-by: David Windsor <dwindsor-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> > ---
> >  drivers/md/md.c | 6 +++---
> >  drivers/md/md.h | 3 ++-
> >  2 files changed, 5 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 985374f..94c8ebf 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -449,7 +449,7 @@ EXPORT_SYMBOL(md_unplug);
> >
> >  static inline struct mddev *mddev_get(struct mddev *mddev)
> >  {
> > -	atomic_inc(&mddev->active);
> > +	refcount_inc(&mddev->active);
> >  	return mddev;
> >  }
> >
> > @@ -459,7 +459,7 @@ static void mddev_put(struct mddev *mddev)
> >  {
> >  	struct bio_set *bs = NULL;
> >
> > -	if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
> > +	if (!refcount_dec_and_lock(&mddev->active, &all_mddevs_lock))
> >  		return;
> >  	if (!mddev->raid_disks && list_empty(&mddev->disks) &&
> >  	    mddev->ctime == 0 && !mddev->hold_active) {
> > @@ -495,7 +495,7 @@ void mddev_init(struct mddev *mddev)
> >  	INIT_LIST_HEAD(&mddev->all_mddevs);
> >  	setup_timer(&mddev->safemode_timer, md_safemode_timeout,
> >  		    (unsigned long) mddev);
> > -	atomic_set(&mddev->active, 1);
> > +	refcount_set(&mddev->active, 1);
> >  	atomic_set(&mddev->openers, 0);
> >  	atomic_set(&mddev->active_io, 0);
> >  	spin_lock_init(&mddev->lock);
> > diff --git a/drivers/md/md.h b/drivers/md/md.h
> > index b8859cb..4811663 100644
> > --- a/drivers/md/md.h
> > +++ b/drivers/md/md.h
> > @@ -22,6 +22,7 @@
> >  #include <linux/list.h>
> >  #include <linux/mm.h>
> >  #include <linux/mutex.h>
> > +#include <linux/refcount.h>
> >  #include <linux/timer.h>
> >  #include <linux/wait.h>
> >  #include <linux/workqueue.h>
> > @@ -360,7 +361,7 @@ struct mddev {
> >  	 */
> >  	struct mutex			open_mutex;
> >  	struct mutex			reconfig_mutex;
> > -	atomic_t			active;		/* general refcount */
> > +	refcount_t			active;		/* general refcount */
> >  	atomic_t			openers;	/* number of active opens */
> >
> >  	int				changed;	/* True if we might need to
> > --
> > 2.7.4
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply
