Linux RAID subsystem development
 help / color / mirror / Atom feed
* Re: [PATCH v3] mdadm/r5cache: allow adding journal to array without journal
From: jes.sorensen @ 2017-03-28 17:52 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid, shli, neilb, kernel-team, dan.j.williams, hch
In-Reply-To: <20170317233111.2452831-2-songliubraving@fb.com>

Song Liu <songliubraving@fb.com> writes:
> Currently, --add-journal can be only used to recreate broken journal
> for arrays with journal since  creation. As the kernel code getting
> more mature, this constraint is no longer necessary.
>
> This patch allows --add-journal to add journal to array without
> journal.
>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  Manage.c   | 9 ---------
>  mdadm.8.in | 5 ++---
>  2 files changed, 2 insertions(+), 12 deletions(-)
>
> diff --git a/Manage.c b/Manage.c
> index 5c3d2b9..d038b68 100644
> --- a/Manage.c
> +++ b/Manage.c
> @@ -946,7 +946,6 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
>  
>  	/* only add journal to array that supports journaling */
>  	if (dv->disposition == 'j') {
> -		struct mdinfo mdi;
>  		struct mdinfo *mdp;
>  
>  		mdp = sysfs_read(fd, NULL, GET_ARRAY_STATE);
> @@ -960,14 +959,6 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
>  			pr_err("%s is not readonly, cannot add journal.\n", devname);
>  			return -1;
>  		}
> -
> -		sysfs_free(mdp);
> -

Song,

Sorry for nagging you again on this one, however removing
sysfs_free(mdp) means you are leaking mbp here.

Cheers,
Jes

^ permalink raw reply

* Re: [PATCH 1/2] Replace snprintf with strncpy at some places to avoid truncation
From: jes.sorensen @ 2017-03-28 17:55 UTC (permalink / raw)
  To: Xiao Ni; +Cc: ncroxon, artur.paszkiewicz, linux-raid
In-Reply-To: <1489804425-4949-1-git-send-email-xni@redhat.com>

Xiao Ni <xni@redhat.com> writes:
> In gcc7 there are some building errors like:
> directive output may be truncated writing up to 31 bytes into a region of size 24
> snprintf(str, MPB_SIG_LEN, %s, mpb->sig);
>
> It just need to copy one string to target. So use strncpy to replace it.
>
> For this line code: snprintf(str, MPB_SIG_LEN, %s, mpb->sig);
> Because mpb->sig has the content of version after magic, so
> it's better to use strncpy to replace snprintf too.
>
> Signed-off-by: Xiao Ni <xni@redhat.com>
> ---
>  super-intel.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)

Applied!

Thanks,
Jes

^ permalink raw reply

* Re: [PATCH 2/2] mdadm: Forced type conversion to avoid truncation
From: jes.sorensen @ 2017-03-28 17:57 UTC (permalink / raw)
  To: Xiao Ni; +Cc: ncroxon, artur.paszkiewicz, linux-raid
In-Reply-To: <1489804425-4949-2-git-send-email-xni@redhat.com>

Xiao Ni <xni@redhat.com> writes:
> Gcc reports it needs 19 bytes to right to disk->serial. Because the type of argument i
> is int. But the meaning of i is failed disk number. So it doesn't need to use 19 bytes.
> Just add a type conversion to avoid this building error
>
> Signed-off-by: Xiao Ni <xni@redhat.com>
> ---
>  super-intel.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Applied!

I reformatted the commit message to keep it within 80 characters/line,
but no change of content.

Cheers,
Jes

^ permalink raw reply

* Re: [PATCHv2 1/2] super1: ignore failfast flag for setting device role
From: jes.sorensen @ 2017-03-28 18:01 UTC (permalink / raw)
  To: Gioh Kim; +Cc: neilb, linux-raid, linux-kernel, Jack Wang
In-Reply-To: <1490003517-4216-2-git-send-email-gi-oh.kim@profitbricks.com>

Gioh Kim <gi-oh.kim@profitbricks.com> writes:
> There is corner case for setting device role,
> if new device has failfast flag.
> The failfast flag should be ignored.
>
> Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
> Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
> ---
>  super1.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)

Applied!

Ideally I would like to get rid of the hardcoded values and use the bit
defines instead, but the original code did the same, so I am going to
take it.

I am leaving out the second patch for now, per Neil's comments. If you
feel strongly about it, please conveince Neil first :)

Thanks,
Jes

^ permalink raw reply

* [PATCH v4] mdadm/r5cache: allow adding journal to array without journal
From: Song Liu @ 2017-03-28 18:04 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, neilb, kernel-team, dan.j.williams, hch, jes.sorensen,
	Song Liu

Currently, --add-journal can be only used to recreate broken journal
for arrays with journal since  creation. As the kernel code getting
more mature, this constraint is no longer necessary.

This patch allows --add-journal to add journal to array without
journal.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 Manage.c   | 6 ------
 mdadm.8.in | 5 ++---
 2 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/Manage.c b/Manage.c
index 5c3d2b9..5ff5de3 100644
--- a/Manage.c
+++ b/Manage.c
@@ -946,7 +946,6 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
 
 	/* only add journal to array that supports journaling */
 	if (dv->disposition == 'j') {
-		struct mdinfo mdi;
 		struct mdinfo *mdp;
 
 		mdp = sysfs_read(fd, NULL, GET_ARRAY_STATE);
@@ -963,11 +962,6 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
 
 		sysfs_free(mdp);
 
-		tst->ss->getinfo_super(tst, &mdi, NULL);
-		if (mdi.journal_device_required == 0) {
-			pr_err("%s does not support journal device.\n", devname);
-			return -1;
-		}
 		disc.raid_disk = 0;
 	}
 
diff --git a/mdadm.8.in b/mdadm.8.in
index df1d460..9e5e896 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -1474,9 +1474,8 @@ the device is found or <slot>:missing in case the device is not found.
 
 .TP
 .BR \-\-add-journal
-Recreate journal for RAID-4/5/6 array that lost a journal device. In the
-current implementation, this command cannot add a journal to an array
-that had a failed journal. To avoid interrupting on-going write opertions,
+Add journal to an existing array, or recreate journal for RAID-4/5/6 array
+that lost a journal device. To avoid interrupting on-going write opertions,
 .B \-\-add-journal
 only works for array in Read-Only state.
 
-- 
2.9.3


^ permalink raw reply related

* Re: [PATCH v3] mdadm/r5cache: allow adding journal to array without journal
From: Song Liu @ 2017-03-28 18:05 UTC (permalink / raw)
  To: jes.sorensen@gmail.com
  Cc: linux-raid, Shaohua Li, NeilBrown, Kernel Team, Dan Williams,
	hch@infradead.org
In-Reply-To: <wrfjpoh1gufa.fsf@gmail.com>

On Mar 28, 2017, at 10:52 AM, jes.sorensen@gmail.com wrote:
> 
> Song Liu <songliubraving@fb.com> writes:
>> Currently, --add-journal can be only used to recreate broken journal
>> for arrays with journal since  creation. As the kernel code getting
>> more mature, this constraint is no longer necessary.
>> 
>> This patch allows --add-journal to add journal to array without
>> journal.
>> 
>> Signed-off-by: Song Liu <songliubraving@fb.com>
>> ---
>> Manage.c   | 9 ---------
>> mdadm.8.in | 5 ++---
>> 2 files changed, 2 insertions(+), 12 deletions(-)
>> 
>> diff --git a/Manage.c b/Manage.c
>> index 5c3d2b9..d038b68 100644
>> --- a/Manage.c
>> +++ b/Manage.c
>> @@ -946,7 +946,6 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
>> 
>> 	/* only add journal to array that supports journaling */
>> 	if (dv->disposition == 'j') {
>> -		struct mdinfo mdi;
>> 		struct mdinfo *mdp;
>> 
>> 		mdp = sysfs_read(fd, NULL, GET_ARRAY_STATE);
>> @@ -960,14 +959,6 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
>> 			pr_err("%s is not readonly, cannot add journal.\n", devname);
>> 			return -1;
>> 		}
>> -
>> -		sysfs_free(mdp);
>> -
> 
> Song,
> 
> Sorry for nagging you again on this one, however removing
> sysfs_free(mdp) means you are leaking mbp here.
> 
> Cheers,
> Jes

Oops... new patch coming...

Thanks,
Song

^ permalink raw reply

* Re: [PATCH] mdadm/bitmap:fixed typos in comments of bitmap.h
From: jes.sorensen @ 2017-03-28 18:08 UTC (permalink / raw)
  To: Zhilong Liu; +Cc: linux-raid
In-Reply-To: <1490006799-25459-1-git-send-email-zlliu@suse.com>

Zhilong Liu <zlliu@suse.com> writes:
> bitmap.h: fixed trivial typos in comments
>
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
> ---
>  bitmap.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Applied!

Thanks,
Jes

^ permalink raw reply

* Re: [PATCH v4] mdadm/r5cache: allow adding journal to array without journal
From: jes.sorensen @ 2017-03-28 18:15 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid, shli, neilb, kernel-team, dan.j.williams, hch
In-Reply-To: <20170328180444.3385563-1-songliubraving@fb.com>

Song Liu <songliubraving@fb.com> writes:
> Currently, --add-journal can be only used to recreate broken journal
> for arrays with journal since  creation. As the kernel code getting
> more mature, this constraint is no longer necessary.
>
> This patch allows --add-journal to add journal to array without
> journal.
>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  Manage.c   | 6 ------
>  mdadm.8.in | 5 ++---
>  2 files changed, 2 insertions(+), 9 deletions(-)

Applied!

Thanks,
Jes

^ permalink raw reply

* Re: [PATCH v3 3/6] imsm: PPL support
From: jes.sorensen @ 2017-03-28 18:21 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid
In-Reply-To: <b63748c5-0cdc-d332-e4d1-601f7c86ff80@intel.com>

Artur Paszkiewicz <artur.paszkiewicz@intel.com> writes:
> On 03/17/2017 09:11 PM, jes.sorensen@gmail.com wrote:
>> Hi Artur,
>> 
>> This looks odd - sure you don't mean crc32c.c ?
>
> Of course this is a mistake, it should be crc32c.c. Sorry for that. But
> surprisingly it builds correctly.
>
> Thanks,
> Artur

Hi Artur,

I can no longer apply this set. Would you mind sending me a fresh update
so I don't end up botching it?

Thanks,
Jes

^ permalink raw reply

* Re: [mdadm PATCH] udev-md-raid-assembly.rules: Skip non-ready devices
From: jes.sorensen @ 2017-03-28 18:22 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux-RAID, Hannes Reinecke
In-Reply-To: <87fuhz8thr.fsf@notabene.neil.brown.name>

NeilBrown <neilb@suse.com> writes:
> From: Hannes Reinecke <hare@suse.de>
>
> If a device isn't fully initialized (e.g if it should be
> handled by multipathing) it should not be considered for
> md/RAID auto-assembly.  Doing so can cause incorrect results
> such as causing multipath to fail during startup.
>
> There is a convention that the udev environment variable
> SYSTEMD_READY be set to zero for such devices.  So change
> the mdadm rules to ignore devices with SYSTEMD_READY==0.
>
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  udev-md-raid-assembly.rules | 3 +++
>  1 file changed, 3 insertions(+)

Applied!

Thanks,
Jes

^ permalink raw reply

* Re: [mdadm PATCH] Retry HOT_REMOVE_DISK a few times.
From: jes.sorensen @ 2017-03-28 18:26 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux-RAID
In-Reply-To: <87d1d38p47.fsf@notabene.neil.brown.name>

NeilBrown <neilb@suse.com> writes:
> HOT_REMOVE_DISK can fail with EBUSY if there are outstanding
> IO request that have not completed yet.  It can sometimes
> be helpful to wait a little while for these to complete.
>
> We already do this in impose_level() when reshaping a device,
> but not in Manage.c in response to an explicit --remove request.
>
> So create hot_remove_disk() to central this code, and call it
> where-ever it makes sense to wait for a HOT_REMOVE_DISK to succeed.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  Grow.c   |  9 +--------
>  Manage.c |  4 ++--
>  mdadm.h  |  1 +
>  util.c   | 18 ++++++++++++++++++
>  4 files changed, 22 insertions(+), 10 deletions(-)

Applied!

Thanks,
Jes

^ permalink raw reply

* Re: [PATCH v1] mdadm/Build:check the level parameter when build new array
From: jes.sorensen @ 2017-03-28 18:27 UTC (permalink / raw)
  To: Zhilong Liu; +Cc: linux-raid
In-Reply-To: <20170328135227.6081-1-zlliu@suse.com>

Zhilong Liu <zlliu@suse.com> writes:
> check if user forgets to specify the --level
> when build a new array. such as:
> ./mdadm -B /dev/md0 -n2 /dev/loop[0-1]
>
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
> ---
>  Build.c | 4 ++++
>  1 file changed, 4 insertions(+)

Applied!

Thanks,
Jes

^ permalink raw reply

* [PATCH] dm: move dm_table_destroy() to same header as dm_table_create()
From: Brian Norris @ 2017-03-28 18:31 UTC (permalink / raw)
  To: Alasdair Kergon, Mike Snitzer, dm-devel
  Cc: linux-kernel, linux-raid, Brian Norris

If anyone is going to use dm_table_create(), they probably should be
able to use dm_table_destroy() too. Move the dm_table_destroy()
definition outside the private header, near dm_table_create()

Signed-off-by: Brian Norris <briannorris@chromium.org>
---
 drivers/md/dm.h               | 1 -
 include/linux/device-mapper.h | 5 +++++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index f298b01f7ab3..c54d2f1934ce 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -48,7 +48,6 @@ struct dm_md_mempools;
 /*-----------------------------------------------------------------
  * Internal table functions.
  *---------------------------------------------------------------*/
-void dm_table_destroy(struct dm_table *t);
 void dm_table_event_callback(struct dm_table *t,
 			     void (*fn)(void *), void *context);
 struct dm_target *dm_table_get_target(struct dm_table *t, unsigned int index);
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index a7e6903866fd..70cb6af56f67 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -472,6 +472,11 @@ void dm_table_set_type(struct dm_table *t, unsigned type);
 int dm_table_complete(struct dm_table *t);
 
 /*
+ * Destroy the table when finished.
+ */
+void dm_table_destroy(struct dm_table *t);
+
+/*
  * Target may require that it is never sent I/O larger than len.
  */
 int __must_check dm_set_target_max_io_len(struct dm_target *ti, sector_t len);
-- 
2.12.2.564.g063fe858b8-goog


^ permalink raw reply related

* Re: [mdadm PATCH 3/5] Introduce sys_hot_remove_disk()
From: jes.sorensen @ 2017-03-28 18:33 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux-RAID
In-Reply-To: <149058581585.15679.12535145868752526948.stgit@noble>

NeilBrown <neilb@suse.com> writes:
> The new hot_remove_disk() will retry HOT_REMOVE_DISK
> several times in the face of EBUSY.
> However we sometimes remove a device by writing "remove" to the
> "state" attributed.  This should be retried as well.
> So introduce sys_hot_remove_disk() to repeat this action a few times.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  Manage.c |    6 +-----
>  mdadm.h  |    1 +
>  util.c   |   12 ++++++++++++
>  3 files changed, 14 insertions(+), 5 deletions(-)

Applied!

Thanks,
Jes

^ permalink raw reply

* Re: [mdadm PATCH 4/5] Add 'force' flag to *hot_remove_disk().
From: jes.sorensen @ 2017-03-28 18:33 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux-RAID
In-Reply-To: <149058581604.15679.9993360149817038150.stgit@noble>

NeilBrown <neilb@suse.com> writes:
> In rare circumstances, the short period that *hot_remove_disk()
> waits isn't long enough to IO to complete.  This particularly happens
> when a device is failing and many retries are still happening.
>
> We don't want to increase the normal wait time for "mdadm --remove"
> as that might be use just to test if a device is active or not, and a
> delay would be problematic.
> So allow "--force" to mean that mdadm should try extra hard for a
> --remove to complete, waiting up to 5 seconds.
>
> Note that this patch fixes a comment which claim the previous
> wait time was half a second, where it was really 50msec.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  Grow.c   |    2 +-
>  Manage.c |   10 +++++-----
>  mdadm.h  |    4 ++--
>  util.c   |   10 +++++-----
>  4 files changed, 13 insertions(+), 13 deletions(-)

Applied!

Thanks,
Jes

^ permalink raw reply

* Re: [mdadm PATCH 5/5] Detail: handle non-existent arrays better.
From: jes.sorensen @ 2017-03-28 18:34 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux-RAID
In-Reply-To: <149058581623.15679.14330532500949622569.stgit@noble>

NeilBrown <neilb@suse.com> writes:
> If you call "mdadm --detail" with a device file for an array which
> doesn't exist, such as by
>   mknod /dev/md57 b 9 57
>   mdadm --detail /dev/md57
>
> you get an unhelpful message about and inactive RAID0, and return
> status is '0'.  This is confusing.
>
> So catch this possibility and print a more useful message, and
> return a non-zero status.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  Detail.c |    8 ++++++++
>  1 file changed, 8 insertions(+)

Applied!

Thanks,
Jes

^ permalink raw reply

* Re: [PATCH 03/23] sd: implement REQ_OP_WRITE_ZEROES
From: Bart Van Assche @ 2017-03-28 18:50 UTC (permalink / raw)
  To: agk@redhat.com, lars.ellenberg@linbit.com, snitzer@redhat.com,
	hch@lst.de, martin.petersen@oracle.com,
	philipp.reisner@linbit.com, axboe@kernel.dk, shli@kernel.org
  Cc: linux-scsi@vger.kernel.org, dm-devel@redhat.com,
	drbd-dev@lists.linbit.com, linux-block@vger.kernel.org,
	linux-raid@vger.kernel.org
In-Reply-To: <20170323143341.31549-4-hch@lst.de>

On Thu, 2017-03-23 at 10:33 -0400, Christoph Hellwig wrote:
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index af632e350ab4..b6f70a09a301 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -748,7 +748,7 @@ static int sd_setup_unmap_cmnd(struct scsi_cmnd *cmd)
>  	return scsi_init_io(cmd);
>  }
>  
> -static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd)
> +static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd, bool unmap)
>  {
>  	struct scsi_device *sdp = cmd->device;
>  	struct request *rq = cmd->request;
> @@ -765,13 +765,14 @@ static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd)
>  
>  	cmd->cmd_len = 16;
>  	cmd->cmnd[0] = WRITE_SAME_16;
> -	cmd->cmnd[1] = 0x8; /* UNMAP */
> +	if (unmap)
> +		cmd->cmnd[1] = 0x8; /* UNMAP */
>  	put_unaligned_be64(sector, &cmd->cmnd[2]);
>  	put_unaligned_be32(nr_sectors, &cmd->cmnd[10]);

Hello Christoph,

A quote from SBC: "An OPTIMAL UNMAP GRANULARITY field set to a non-zero value
indicates the optimal granularity in logical blocks for unmap requests (e.g.,
an UNMAP command or a WRITE SAME (16) command with the UNMAP bit set to one).
An unmap request with a number of logical blocks that is not a multiple of
this value may result in unmap operations on fewer LBAs than requested."

This means that just like the start and end of a discard must be aligned on a
discard_granularity boundary, WRITE SAME commands with the UNMAP bit set must
also respect that granularity. I think this means that either
__blkdev_issue_zeroout() has to be modified such that it rejects unaligned
REQ_OP_WRITE_ZEROES operations or that blk_bio_write_same_split() has to be
modified such that it generates REQ_OP_WRITEs for the unaligned start and tail.

Bart.

^ permalink raw reply

* Re: [PATCH 03/23] sd: implement REQ_OP_WRITE_ZEROES
From: Mike Snitzer @ 2017-03-28 19:33 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org,
	linux-raid-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	martin.petersen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
	philipp.reisner-63ez5xqkn6DQT0dZR+AlfA@public.gmane.org,
	linux-block-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	lars.ellenberg-63ez5xqkn6DQT0dZR+AlfA@public.gmane.org,
	shli-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
	hch-jcswGhMUV9g@public.gmane.org,
	agk-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	drbd-dev-cunTk1MwBs8qoQakbn7OcQ@public.gmane.org
In-Reply-To: <1490726988.2573.16.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>

On Tue, Mar 28 2017 at  2:50pm -0400,
Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote:

> On Thu, 2017-03-23 at 10:33 -0400, Christoph Hellwig wrote:
> > diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> > index af632e350ab4..b6f70a09a301 100644
> > --- a/drivers/scsi/sd.c
> > +++ b/drivers/scsi/sd.c
> > @@ -748,7 +748,7 @@ static int sd_setup_unmap_cmnd(struct scsi_cmnd *cmd)
> >  	return scsi_init_io(cmd);
> >  }
> >  
> > -static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd)
> > +static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd, bool unmap)
> >  {
> >  	struct scsi_device *sdp = cmd->device;
> >  	struct request *rq = cmd->request;
> > @@ -765,13 +765,14 @@ static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd)
> >  
> >  	cmd->cmd_len = 16;
> >  	cmd->cmnd[0] = WRITE_SAME_16;
> > -	cmd->cmnd[1] = 0x8; /* UNMAP */
> > +	if (unmap)
> > +		cmd->cmnd[1] = 0x8; /* UNMAP */
> >  	put_unaligned_be64(sector, &cmd->cmnd[2]);
> >  	put_unaligned_be32(nr_sectors, &cmd->cmnd[10]);
> 
> Hello Christoph,
> 
> A quote from SBC: "An OPTIMAL UNMAP GRANULARITY field set to a non-zero value
> indicates the optimal granularity in logical blocks for unmap requests (e.g.,
> an UNMAP command or a WRITE SAME (16) command with the UNMAP bit set to one).
> An unmap request with a number of logical blocks that is not a multiple of
> this value may result in unmap operations on fewer LBAs than requested."
> 
> This means that just like the start and end of a discard must be aligned on a
> discard_granularity boundary, WRITE SAME commands with the UNMAP bit set must
> also respect that granularity. I think this means that either
> __blkdev_issue_zeroout() has to be modified such that it rejects unaligned
> REQ_OP_WRITE_ZEROES operations or that blk_bio_write_same_split() has to be
> modified such that it generates REQ_OP_WRITEs for the unaligned start and tail.

That'd get DM thinp off the hook from having to zero the unaligned start
and tail...

^ permalink raw reply

* Re: [PATCH] md:array cannot be opened again after 'md_set_readonly'
From: NeilBrown @ 2017-03-28 21:13 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Zhilong Liu, Guoqing Jiang
In-Reply-To: <1490601145-5865-1-git-send-email-zlliu@suse.com>

[-- Attachment #1: Type: text/plain, Size: 1205 bytes --]

On Mon, Mar 27 2017, Zhilong Liu wrote:

> This is a bug about array cannot be opened again after 'md_set_readonly',
> because the MD_CLOSING bit is still waiting for clear.
> MD_CLOSING should only be set for a short period or time to avoid certain
> races. After the operation that set it completes, it should be cleared.
>
> Reviewed-by: NeilBrown <neilb@suse.com>

No I didn't.  I never reviewed this patch.  I don't agree with this
patch. This patch is wrong.
The flag is set in md_ioctl(), and it should be cleared in md_ioctl().

NeilBrown

> Cc: Guoqing Jiang <gqjiang@suse.com>
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
> ---
>  drivers/md/md.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index f6ae1d6..7f2db7c 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5588,6 +5588,7 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
>  	int err = 0;
>  	int did_freeze = 0;
>  
> +	test_and_clear_bit(MD_CLOSING, &mddev->flags);
>  	if (!test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) {
>  		did_freeze = 1;
>  		set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> -- 
> 2.6.6

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH] md:array cannot be opened again after 'md_set_readonly'
From: NeilBrown @ 2017-03-28 21:18 UTC (permalink / raw)
  To: Zhilong Liu, Shaohua Li; +Cc: shli, linux-raid, Guoqing Jiang
In-Reply-To: <0c5289f3-0160-8d2d-a5d8-da126b2fb468@suse.com>

[-- Attachment #1: Type: text/plain, Size: 1930 bytes --]

On Tue, Mar 28 2017, Zhilong Liu wrote:

> On 03/28/2017 02:22 AM, Shaohua Li wrote:
>> On Mon, Mar 27, 2017 at 03:52:25PM +0800, Zhilong Liu wrote:
>>> This is a bug about array cannot be opened again after 'md_set_readonly',
>>> because the MD_CLOSING bit is still waiting for clear.
>>> MD_CLOSING should only be set for a short period or time to avoid certain
>>> races. After the operation that set it completes, it should be cleared.
>> where is the bit set? Why don't clear it after the operation but clear it in
>> set_readonly?
>>   
>
> it goes two paths after set MD_CLOSING bit, do_md_stop and md_set_readonly,
> the do_md_stop has done the md_clean to clear the mddev->flags, thus I 
> put it in
> md_set_readonly.
> Thanks for your suggestion, is the following changing good for you? here 
> it has
> finished what it needs to do, so clear the MD_CLOSING bit in time.
> if looks good, I would send this in new revision patch.
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index f6ae1d6..e8c1130 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -6868,6 +6868,7 @@ static int md_ioctl(struct block_device *bdev, 
> fmode_t mode,
>                  set_bit(MD_CLOSING, &mddev->flags);
>                  mutex_unlock(&mddev->open_mutex);
>                  sync_blockdev(bdev);
> +               clear_bit(MD_CLOSING, &mddev->flags);

No.

MD_CLOSING is used to prevent a race with md_open().
md_open() blocks on open_mutex(), but we cannot hold that here for
complicated reasons.
So we set MD_CLOSING and need to keep it set at least until
md_set_readonly() or do_md_stop() take that lock again.

Clearing it here just opens up the race again.

It needs to be cleared in md_ioctl, after any call to md_set_readonly()
or do_md_stop().
I've already posted a sample patch which has been tested.  Please don't
do something completely different.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: RAID6 rebuild oddity
From: NeilBrown @ 2017-03-29  4:08 UTC (permalink / raw)
  To: Brad Campbell, Linux-RAID
In-Reply-To: <9eb215dd-437e-ac62-c4df-f3307d8fc4b4@fnarfbargle.com>

[-- Attachment #1: Type: text/plain, Size: 6248 bytes --]

On Fri, Mar 24 2017, Brad Campbell wrote:

> I'm in the process of setting up a new little array. 8 x 6TB drives in a 
> RAID6. While I have the luxury of a long burn in period I've been 
> beating it up and have seen some odd performance anomalies.
>
> I have one in front of me now, so I thought I'd lay out the data and see 
> if anyone has any ideas as to what might be going on.
>
> Here's the current state. I did this by removing and adding /dev/sdb 
> without a write intent bitmap to deliberately cause a rebuild.
>
> /dev/md0:
>          Version : 1.2
>    Creation Time : Wed Mar 22 14:01:41 2017
>       Raid Level : raid6
>       Array Size : 35162348160 (33533.43 GiB 36006.24 GB)
>    Used Dev Size : 5860391360 (5588.90 GiB 6001.04 GB)
>     Raid Devices : 8
>    Total Devices : 8
>      Persistence : Superblock is persistent
>
>      Update Time : Fri Mar 24 15:34:28 2017
>            State : clean, degraded, recovering
>   Active Devices : 7
> Working Devices : 8
>   Failed Devices : 0
>    Spare Devices : 1
>
>           Layout : left-symmetric
>       Chunk Size : 64K
>
>   Rebuild Status : 0% complete
>
>             Name : test:0  (local to host test)
>             UUID : 93a09ba7:f159e9f5:7c478f16:6ca8858e
>           Events : 394
>
>      Number   Major   Minor   RaidDevice State
>         8       8       16        0      spare rebuilding   /dev/sdb
>         1       8       32        1      active sync   /dev/sdc
>         2       8       48        2      active sync   /dev/sdd
>         3       8       64        3      active sync   /dev/sde
>         4       8       80        4      active sync   /dev/sdf
>         5       8       96        5      active sync   /dev/sdg
>         6       8      128        6      active sync   /dev/sdi
>         7       8      144        7      active sync   /dev/sdj
>
>
> Here's the iostat output (hope it doesn't wrap).
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0.05    0.00   10.42    7.85    0.00   81.68
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s 
> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     1.60    0.00    1.40     0.00     8.80 
> 12.57     0.02   12.86    0.00   12.86  12.86   1.80
> sdb               0.00 18835.60    0.00  657.80     0.00 90082.40 
> 273.89     3.72    4.71    0.00    4.71   0.85  55.80
> sdc           20685.80     0.00  244.20    0.00 87659.20     0.00 
> 717.93     8.65   34.10   34.10    0.00   2.15  52.40
> sdd           20664.60     0.00  244.60    0.00 87652.00     0.00 
> 716.70     8.72   34.28   34.28    0.00   2.19  53.60
> sde           20647.80     0.00  240.40    0.00 87556.80     0.00 
> 728.43     9.13   36.54   36.54    0.00   2.30  55.40
> sdf           20622.40     0.00  242.40    0.00 87556.80     0.00 
> 722.42     8.73   34.60   34.60    0.00   2.20  53.40
> sdg           20596.00     0.00  239.20    0.00 87556.80     0.00 
> 732.08     9.32   37.54   37.54    0.00   2.37  56.60
> sdh               0.00     1.60    0.00    1.40     0.00     8.80 
> 12.57     0.01    7.14    0.00    7.14   7.14   1.00
> sdi           20575.80     0.00  238.20    0.00 86999.20     0.00 
> 730.47     8.53   34.06   34.06    0.00   2.20  52.40
> sdj           22860.80     0.00  475.80    0.00 101773.60     0.00 
> 427.80   245.09  513.25  513.25    0.00   2.10 100.00
> md1               0.00     0.00    0.00    0.00     0.00     0.00 
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> md2               0.00     0.00    0.00    2.00     0.00     8.00 
> 8.00     0.00    0.00    0.00    0.00   0.00   0.00
> md0               0.00     0.00    0.00    0.00     0.00     0.00 
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>
> The long and short is /dev/sdj is the last drive in the array, and is 
> getting hit with a completely different read pattern to the other 
> drives, causing it to bottleneck the rebuild process.

sdj is getting twice the utilization of the others but only about 10%
more rKB/sec.  That suggests lots of seeking.

Does "fuser /dev/sdj" report anything funny?

Is there filesystem IO happening? If so, what filesystem?
Have you told the filesystem about the RAID layout?
Maybe the filesystem keeps reading some index blocks that are always on
the same drive.

>
> I *thought* the rebuild process was "read one stripe, calculate the 
> missing bit and write it out to the drive being rebuilt".

Yep, that is what it does.


>
> I've seen this behaviour now a number of times, but this is the first 
> time I've been able to reliably reproduce it. Of course it takes about 
> 20 hours to complete the rebuild, so it's a slow diagnostic process.
>
> I've set the stripe cache size to 8192. Didn't make a dent.
>
> The bottleneck drive seems to change depending on the load. I've seen it 
> happen simply dd'ing the array to /dev/null where the transfer rate 
> slows to < 150MB/s. Stop and restart the transfer and it's back up to 
> 500MB/s.

So the problem moves from drive to drive?  Strongly suggests filesystem
metadata access to me.

>
> I've reproduced this on kernel 4.6.4 & 4.10.5, so I'm not sure what is 
> going on at the moment. There is obviously a sub-optimal read pattern 
> getting fed to sdj. I had a look at it with blocktrace, but went cross 
> eyed trying to figure out what was going on.

If you can capture several seconds of trace on all drives plus the
array, compress it and host it somewhere, I can pick it up and have
look.

NeilBrown

>
> The drives are all on individual lanes on a SAS controller, are set with 
> the deadline scheduler and I can get about 160MB/s sustained from all 
> drives simultaneously using dd.
>
> It's not important, but I thought since I was seeing it and I have a 
> month or so of extra time with this array before it needs to do useful 
> work, I'd ask.
>
> Regards,
> Brad
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH v6 0/4] Broadcom SBA RAID support
From: Anup Patel @ 2017-03-29  6:05 UTC (permalink / raw)
  To: Vinod Koul
  Cc: Jassi Brar, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Dan Williams, Ray Jui, Scott Branden, Jon Mason,
	Rob Rice, BCM Kernel Feedback, dmaengine, Device Tree,
	Linux ARM Kernel, Linux Kernel, linux-crypto, linux-raid
In-Reply-To: <20170321091829.GZ2843@localhost>

On Tue, Mar 21, 2017 at 2:48 PM, Vinod Koul <vinod.koul@intel.com> wrote:
> On Tue, Mar 21, 2017 at 02:17:21PM +0530, Anup Patel wrote:
>> On Tue, Mar 21, 2017 at 2:00 PM, Vinod Koul <vinod.koul@intel.com> wrote:
>> > On Mon, Mar 06, 2017 at 03:13:24PM +0530, Anup Patel wrote:
>> >> The Broadcom SBA RAID is a stream-based device which provides
>> >> RAID5/6 offload.
>> >>
>> >> It requires a SoC specific ring manager (such as Broadcom FlexRM
>> >> ring manager) to provide ring-based programming interface. Due to
>> >> this, the Broadcom SBA RAID driver (mailbox client) implements
>> >> DMA device having one DMA channel using a set of mailbox channels
>> >> provided by Broadcom SoC specific ring manager driver (mailbox
>> >> controller).
>> >>
>> >> The Broadcom SBA RAID hardware requires PQ disk position instead
>> >> of PQ disk coefficient. To address this, we have added raid_gflog
>> >> table which will help driver to convert PQ disk coefficient to PQ
>> >> disk position.
>> >>
>> >> This patchset is based on Linux-4.11-rc1 and depends on patchset
>> >> "[PATCH v5 0/2] Broadcom FlexRM ring manager support"
>> >
>> > Okay I applied and was about to push when I noticed this :(
>> >
>> > So what is the status of this..?
>>
>> PATCH2 is Acked but PATCH1 is under-review. Currently, its
>> v6 of that patchset.
>>
>> The only dependency on that patchset is the changes in
>> brcm-message.h which are required by this BCM-SBA-RAID
>> driver.
>>
>> @Jassi,
>> Can you please have a look at PATCH v6?
>
> And I would need an immutable branch/tag once merged. I am going to keep
> this series pending till then.

The Broadcom FlexRM patchset is pickedup by Jassi and
can be found in mailbox-for-next branch of
git://git.linaro.org/landing-teams/working/fujitsu/integration

Both patchset (Broadcom FlexRM patchset and this one) are
also available in sba-raid-v7 branch of
https://github.com/Broadcom/arm64-linux.git

Regards,
Anup

^ permalink raw reply

* [PATCH] md: update slab_cache before releasing new stripes when stripes resizing
From: Dennis Yang @ 2017-03-29  7:46 UTC (permalink / raw)
  To: linux-raid; +Cc: Dennis Yang

When growing raid5 device on machine with small memory, there is chance that
mdadm will be killed and the following bug report can be observed. The same
bug could also be reproduced in linux-4.10.6.

[57600.075774] BUG: unable to handle kernel NULL pointer dereference at           (null)
[57600.083796] IP: [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
[57600.110378] PGD 421cf067 PUD 4442d067 PMD 0
[57600.114678] Oops: 0002 [#1] SMP
[57600.180799] CPU: 1 PID: 25990 Comm: mdadm Tainted: P           O    4.2.8 #1
[57600.187849] Hardware name: To be filled by O.E.M. To be filled by O.E.M./MAHOBAY, BIOS QV05AR66 03/06/2013
[57600.197490] task: ffff880044e47240 ti: ffff880043070000 task.ti: ffff880043070000
[57600.204963] RIP: 0010:[<ffffffff81a6aa87>]  [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
[57600.213057] RSP: 0018:ffff880043073810  EFLAGS: 00010046
[57600.218359] RAX: 0000000000000000 RBX: 000000000000000c RCX: ffff88011e296dd0
[57600.225486] RDX: 0000000000000001 RSI: ffffe8ffffcb46c0 RDI: 0000000000000000
[57600.232613] RBP: ffff880043073878 R08: ffff88011e5f8170 R09: 0000000000000282
[57600.239739] R10: 0000000000000005 R11: 28f5c28f5c28f5c3 R12: ffff880043073838
[57600.246872] R13: ffffe8ffffcb46c0 R14: 0000000000000000 R15: ffff8800b9706a00
[57600.253999] FS:  00007f576106c700(0000) GS:ffff88011e280000(0000) knlGS:0000000000000000
[57600.262078] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[57600.267817] CR2: 0000000000000000 CR3: 00000000428fe000 CR4: 00000000001406e0
[57600.274942] Stack:
[57600.276949]  ffffffff8114ee35 ffff880043073868 0000000000000282 000000000000eb3f
[57600.284383]  ffffffff81119043 ffff880043073838 ffff880043073838 ffff88003e197b98
[57600.291820]  ffffe8ffffcb46c0 ffff88003e197360 0000000000000286 ffff880043073968
[57600.299254] Call Trace:
[57600.301698]  [<ffffffff8114ee35>] ? cache_flusharray+0x35/0xe0
[57600.307523]  [<ffffffff81119043>] ? __page_cache_release+0x23/0x110
[57600.313779]  [<ffffffff8114eb53>] kmem_cache_free+0x63/0xc0
[57600.319344]  [<ffffffff81579942>] drop_one_stripe+0x62/0x90
[57600.324915]  [<ffffffff81579b5b>] raid5_cache_scan+0x8b/0xb0
[57600.330563]  [<ffffffff8111b98a>] shrink_slab.part.36+0x19a/0x250
[57600.336650]  [<ffffffff8111e38c>] shrink_zone+0x23c/0x250
[57600.342039]  [<ffffffff8111e4f3>] do_try_to_free_pages+0x153/0x420
[57600.348210]  [<ffffffff8111e851>] try_to_free_pages+0x91/0xa0
[57600.353959]  [<ffffffff811145b1>] __alloc_pages_nodemask+0x4d1/0x8b0
[57600.360303]  [<ffffffff8157a30b>] check_reshape+0x62b/0x770
[57600.365866]  [<ffffffff8157a4a5>] raid5_check_reshape+0x55/0xa0
[57600.371778]  [<ffffffff81583df7>] update_raid_disks+0xc7/0x110
[57600.377604]  [<ffffffff81592b73>] md_ioctl+0xd83/0x1b10
[57600.382827]  [<ffffffff81385380>] blkdev_ioctl+0x170/0x690
[57600.388307]  [<ffffffff81195238>] block_ioctl+0x38/0x40
[57600.393525]  [<ffffffff811731c5>] do_vfs_ioctl+0x2b5/0x480
[57600.399010]  [<ffffffff8115e07b>] ? vfs_write+0x14b/0x1f0
[57600.404400]  [<ffffffff811733cc>] SyS_ioctl+0x3c/0x70
[57600.409447]  [<ffffffff81a6ad97>] entry_SYSCALL_64_fastpath+0x12/0x6a
[57600.415875] Code: 00 00 00 00 55 48 89 e5 8b 07 85 c0 74 04 31 c0 5d c3 ba 01 00 00 00 f0 0f b1 17 85 c0 75 ef b0 01 5d c3 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 01 c3 55 89 c6 48 89 e5 e8 85 d1 63 ff 5d
[57600.435460] RIP  [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
[57600.441208]  RSP <ffff880043073810>
[57600.444690] CR2: 0000000000000000
[57600.448000] ---[ end trace cbc6b5cc4bf9831d ]---

The problem is that resize_stripes() releases new stripe_heads before assigning new
slab cache to conf->slab_cache. If the shrinker function raid5_cache_scan() gets called
after resize_stripes() starting releasing new stripes but right before new slab cache
being assigned, it is possible that these new stripe_heads will be freed with the old
slab_cache which was already been destoryed and that triggers this bug.

Signed-off-by: Dennis Yang <dennisyang@qnap.com>
---
 drivers/md/raid5.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6661db2c..172edc1 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2286,6 +2286,10 @@ static int resize_stripes(struct r5conf *conf, int newsize)
 		err = -ENOMEM;
 
 	mutex_unlock(&conf->cache_size_mutex);
+	
+	conf->slab_cache = sc;
+	conf->active_name = 1-conf->active_name;
+
 	/* Step 4, return new stripes to service */
 	while(!list_empty(&newstripes)) {
 		nsh = list_entry(newstripes.next, struct stripe_head, lru);
@@ -2303,8 +2307,6 @@ static int resize_stripes(struct r5conf *conf, int newsize)
 	}
 	/* critical section pass, GFP_NOIO no longer needed */
 
-	conf->slab_cache = sc;
-	conf->active_name = 1-conf->active_name;
 	if (!err)
 		conf->pool_size = newsize;
 	return err;
-- 
1.9.1


^ permalink raw reply related

* [PATCH v2] md/r5cache: gracefully handle journal device errors for writeback mode
From: Song Liu @ 2017-03-29  8:00 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, neilb, kernel-team, dan.j.williams, hch, jes.sorensen,
	Song Liu

For the raid456 with writeback cache, when journal device failed during
normal operation, it is still possible to persist all data, as all
pending data is still in stripe cache. However, it is necessary to handle
journal failure gracefully.

During journal failures, this patch makes the follow changes to land data
in cache to raid disks gracefully:

1. In raid5_remove_disk(), flush all cached stripes;
2. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
   to make progress;
3. In delay_towrite(), only process data in the cache (skip dev->towrite);
4. In __get_priority_stripe(), set try_loprio to true, so no stripe stuck
   in loprio_list
5. In r5l_do_submit_io(), submit io->split_bio first (see inline comments
   for details).

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 27 ++++++++++++++++++---------
 drivers/md/raid5.c       | 28 ++++++++++++++++++++++++----
 2 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index b6194e0..0838617 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -632,20 +632,29 @@ static void r5l_do_submit_io(struct r5l_log *log, struct r5l_io_unit *io)
 	__r5l_set_io_unit_state(io, IO_UNIT_IO_START);
 	spin_unlock_irqrestore(&log->io_list_lock, flags);
 
+	/*
+	 * In case of journal device failures, submit_bio will get error
+	 * and calls endio, then active stripes will continue write
+	 * process. Therefore, it is not necessary to check Faulty bit
+	 * of journal device here.
+	 *
+	 * However, calling r5l_log_endio(current_bio) may change
+	 * split_bio. Therefore, it is necessary to check split_bio before
+	 * submit current_bio.
+	 */
+	if (io->split_bio) {
+		if (io->has_flush)
+			io->split_bio->bi_opf |= REQ_PREFLUSH;
+		if (io->has_fua)
+			io->split_bio->bi_opf |= REQ_FUA;
+		submit_bio(io->split_bio);
+	}
+
 	if (io->has_flush)
 		io->current_bio->bi_opf |= REQ_PREFLUSH;
 	if (io->has_fua)
 		io->current_bio->bi_opf |= REQ_FUA;
 	submit_bio(io->current_bio);
-
-	if (!io->split_bio)
-		return;
-
-	if (io->has_flush)
-		io->split_bio->bi_opf |= REQ_PREFLUSH;
-	if (io->has_fua)
-		io->split_bio->bi_opf |= REQ_FUA;
-	submit_bio(io->split_bio);
 }
 
 /* deferred io_unit will be dispatched here */
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6036d5e..4d3d1ab 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3054,6 +3054,11 @@ sector_t raid5_compute_blocknr(struct stripe_head *sh, int i, int previous)
  *      When LOG_CRITICAL, stripes with injournal == 0 will be sent to
  *      no_space_stripes list.
  *
+ *   3. during journal failure
+ *      In journal failure, we try to flush all cached data to raid disks
+ *      based on data in stripe cache. The array is read-only to upper
+ *      layers, so we would skip all pending writes.
+ *
  */
 static inline bool delay_towrite(struct r5conf *conf,
 				 struct r5dev *dev,
@@ -3067,6 +3072,9 @@ static inline bool delay_towrite(struct r5conf *conf,
 	if (test_bit(R5C_LOG_CRITICAL, &conf->cache_state) &&
 	    s->injournal > 0)
 		return true;
+	/* case 3 above */
+	if (s->log_failed && s->injournal)
+		return true;
 	return false;
 }
 
@@ -4689,10 +4697,15 @@ static void handle_stripe(struct stripe_head *sh)
 	       " to_write=%d failed=%d failed_num=%d,%d\n",
 	       s.locked, s.uptodate, s.to_read, s.to_write, s.failed,
 	       s.failed_num[0], s.failed_num[1]);
-	/* check if the array has lost more than max_degraded devices and,
+	/*
+	 * check if the array has lost more than max_degraded devices and,
 	 * if so, some requests might need to be failed.
+	 *
+	 * When journal device failed (log_failed), we will only process
+	 * the stripe if there is data need write to raid disks
 	 */
-	if (s.failed > conf->max_degraded || s.log_failed) {
+	if (s.failed > conf->max_degraded ||
+	    (s.log_failed && s.injournal == 0)) {
 		sh->check_state = 0;
 		sh->reconstruct_state = 0;
 		break_stripe_batch_list(sh, 0);
@@ -5272,7 +5285,8 @@ static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
 	struct list_head *handle_list = NULL;
 	struct r5worker_group *wg;
 	bool second_try = !r5c_is_writeback(conf->log);
-	bool try_loprio = test_bit(R5C_LOG_TIGHT, &conf->cache_state);
+	bool try_loprio = test_bit(R5C_LOG_TIGHT, &conf->cache_state) ||
+		r5l_log_disk_error(conf);
 
 again:
 	wg = NULL;
@@ -7526,6 +7540,7 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 	int number = rdev->raid_disk;
 	struct md_rdev **rdevp;
 	struct disk_info *p = conf->disks + number;
+	unsigned long flags;
 
 	print_raid5_conf(conf);
 	if (test_bit(Journal, &rdev->flags) && conf->log) {
@@ -7535,7 +7550,12 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 		 * neilb: there is no locking about new writes here,
 		 * so this cannot be safe.
 		 */
-		if (atomic_read(&conf->active_stripes)) {
+		if (atomic_read(&conf->active_stripes) ||
+		    atomic_read(&conf->r5c_cached_full_stripes) ||
+		    atomic_read(&conf->r5c_cached_partial_stripes)) {
+			spin_lock_irqsave(&conf->device_lock, flags);
+			r5c_flush_cache(conf, INT_MAX);
+			spin_unlock_irqrestore(&conf->device_lock, flags);
 			return -EBUSY;
 		}
 		log_exit(conf);
-- 
2.9.3


^ permalink raw reply related

* Re: RAID6 rebuild oddity
From: Brad Campbell @ 2017-03-29  8:12 UTC (permalink / raw)
  To: NeilBrown, Linux-RAID
In-Reply-To: <87zig467z3.fsf@notabene.neil.brown.name>

On 29/03/17 12:08, NeilBrown wrote:

>
> sdj is getting twice the utilization of the others but only about 10%
> more rKB/sec.  That suggests lots of seeking.

Yes, something is not entirely sequential.

> Does "fuser /dev/sdj" report anything funny?

No. No output. As far as I can tell nothing should be touching the disks 
other than the md kernel thread.

> Is there filesystem IO happening? If so, what filesystem?
> Have you told the filesystem about the RAID layout?
> Maybe the filesystem keeps reading some index blocks that are always on
> the same drive.

No. I probably wasn't as clear as I should have been in the initial 
post. There was nothing mounted at the time.

Right now the array contains one large LUKS container (dm-crypt). This 
was mounted and a continuous dd done to the dm device to zero it out :

4111195+1 records in
4111195+1 records out
34487205507072 bytes (34 TB) copied, 57781.1 s, 597 MB/s

So there is no filesystem on the drive.

I failed and removed sdi :

root@test:~# mdadm --fail /dev/md0 /dev/sdi
mdadm: set /dev/sdi faulty in /dev/md0
root@test:~# mdadm --remove /dev/md0 /dev/sdi
mdadm: hot removed /dev/sdi from /dev/md0
root@test:~# mdadm --zero-superblock /dev/sdi
root@test:~# mdadm --add /dev/md0 /dev/sdi
mdadm: added /dev/sdi

Rebooted the machine to remove all tweaks of things like stripe cache 
size, readahead, NCQ and anything else.

I opened the LUKS container, dd'd a meg to the start to write to the 
array and kick off the resync, then closed the LUKS container. At this 
point dm should no longer be touching the drive and I've verified the 
device has gone.

I then ran sync a couple of times and waited a couple of minutes until I 
was positive _nothing_ was touching md0, then ran :

blktrace -w 5 /dev/sd[bcdefgij] /dev/md0

> So the problem moves from drive to drive?  Strongly suggests filesystem
> metadata access to me.

Again, sorry for me not being clear. The situation changes on a resync 
specific basis. For example the reproduction I've done now I popped out 
sdi rather than sdb, and now the bottleneck is sdg. It is the same if 
the exact circumstances remain the same.

>
> If you can capture several seconds of trace on all drives plus the
> array, compress it and host it somewhere, I can pick it up and have
> look.

I've captured 5 seconds. I was overly optimistic initially and tried 20 
seconds, but that resulted in 100M bzipped.

This is 19M. The kernel is a clean 4.10.6 compiled up half an hour ago 
as I forgot to include block tracing in there.

http://www.fnarfbargle.com/private/170329-Resync/resync-blktrace.tar.bz2

root@test:~/bench# mdadm --detail /dev/md0
/dev/md0:
         Version : 1.2
   Creation Time : Wed Mar 22 14:01:41 2017
      Raid Level : raid6
      Array Size : 35162348160 (33533.43 GiB 36006.24 GB)
   Used Dev Size : 5860391360 (5588.90 GiB 6001.04 GB)
    Raid Devices : 8
   Total Devices : 8
     Persistence : Superblock is persistent

     Update Time : Wed Mar 29 16:08:20 2017
           State : clean, degraded, recovering
  Active Devices : 7
Working Devices : 8
  Failed Devices : 0
   Spare Devices : 1

          Layout : left-symmetric
      Chunk Size : 64K

  Rebuild Status : 2% complete

            Name : test:0  (local to host test)
            UUID : 93a09ba7:f159e9f5:7c478f16:6ca8858e
          Events : 715

     Number   Major   Minor   RaidDevice State
        8       8       16        0      active sync   /dev/sdb
        1       8       32        1      active sync   /dev/sdc
        2       8       48        2      active sync   /dev/sdd
        3       8       64        3      active sync   /dev/sde
        4       8       80        4      active sync   /dev/sdf
        5       8       96        5      active sync   /dev/sdg
        9       8      128        6      spare rebuilding   /dev/sdi
        7       8      144        7      active sync   /dev/sdj


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.08    0.00    4.77    4.75    0.00   90.41

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s 
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     1.40    0.00    3.00     0.00    10.60 
7.07     0.01    4.67    0.00    4.67   4.67   1.40
sdb           14432.20     0.00  120.20    0.00 57295.20     0.00 
953.33     2.04   16.71   16.71    0.00   3.23  38.80
sdc           14432.20     0.00  120.20    0.00 57295.20     0.00 
953.33     1.99   16.26   16.26    0.00   3.21  38.60
sdd           14432.20     0.00  120.80    0.00 57602.40     0.00 
953.68     2.14   17.55   17.55    0.00   3.18  38.40
sde           14432.20     0.00  120.80    0.00 57602.40     0.00 
953.68     2.02   16.57   16.57    0.00   3.20  38.60
sdf           14432.20     0.00  120.80    0.00 57602.40     0.00 
953.68     2.08   17.04   17.04    0.00   3.25  39.20
sdg           16135.40     0.00  224.60    0.00 65811.20     0.00 
586.03   126.46  575.26  575.26    0.00   4.45 100.00
sdh               0.00     1.40    0.00    3.00     0.00    10.60 
7.07     0.02    6.67    0.00    6.67   6.67   2.00
sdi               0.00 14982.60    0.00  123.20     0.00 59801.60 
970.81     4.52   35.57    0.00   35.57   3.25  40.00
sdj           14432.40     0.00  120.80    0.00 57603.20     0.00 
953.70     1.84   15.07   15.07    0.00   3.06  37.00
md1               0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00    0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    2.20     0.00     8.80 
8.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00    0.00    0.00   0.00   0.00


Thanks for having a look. It seems odd to me, but I can't figure it out.

Regards,
Brad

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox