Linux RAID subsystem development
 help / color / mirror / Atom feed
* [PATCH v3 2/4] async_tx: Fix DMA_PREP_FENCE usage in do_async_gen_syndrome()
From: Anup Patel @ 2017-02-10  9:07 UTC (permalink / raw)
  To: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar
  Cc: Dan Williams, Ray Jui, Scott Branden, Jon Mason, Rob Rice,
	bcm-kernel-feedback-list, dmaengine, devicetree, linux-arm-kernel,
	linux-kernel, linux-crypto, linux-raid, Anup Patel
In-Reply-To: <1486717628-17580-1-git-send-email-anup.patel@broadcom.com>

The DMA_PREP_FENCE is to be used when preparing Tx descriptor if output
of Tx descriptor is to be used by next/dependent Tx descriptor.

The DMA_PREP_FENSE will not be set correctly in do_async_gen_syndrome()
when calling dma->device_prep_dma_pq() under following conditions:
1. ASYNC_TX_FENCE not set in submit->flags
2. DMA_PREP_FENCE not set in dma_flags
3. src_cnt (= (disks - 2)) is greater than dma_maxpq(dma, dma_flags)

This patch fixes DMA_PREP_FENCE usage in do_async_gen_syndrome() taking
inspiration from do_async_xor() implementation.

Signed-off-by: Anup Patel <anup.patel@broadcom.com>
Reviewed-by: Ray Jui <ray.jui@broadcom.com>
Reviewed-by: Scott Branden <scott.branden@broadcom.com>
---
 crypto/async_tx/async_pq.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/crypto/async_tx/async_pq.c b/crypto/async_tx/async_pq.c
index f83de99..56bd612 100644
--- a/crypto/async_tx/async_pq.c
+++ b/crypto/async_tx/async_pq.c
@@ -62,9 +62,6 @@ do_async_gen_syndrome(struct dma_chan *chan,
 	dma_addr_t dma_dest[2];
 	int src_off = 0;
 
-	if (submit->flags & ASYNC_TX_FENCE)
-		dma_flags |= DMA_PREP_FENCE;
-
 	while (src_cnt > 0) {
 		submit->flags = flags_orig;
 		pq_src_cnt = min(src_cnt, dma_maxpq(dma, dma_flags));
@@ -83,6 +80,8 @@ do_async_gen_syndrome(struct dma_chan *chan,
 			if (cb_fn_orig)
 				dma_flags |= DMA_PREP_INTERRUPT;
 		}
+		if (submit->flags & ASYNC_TX_FENCE)
+			dma_flags |= DMA_PREP_FENCE;
 
 		/* Drivers force forward progress in case they can not provide
 		 * a descriptor
-- 
2.7.4

^ permalink raw reply related

* [PATCH v3 1/4] lib/raid6: Add log-of-2 table for RAID6 HW requiring disk position
From: Anup Patel @ 2017-02-10  9:07 UTC (permalink / raw)
  To: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar
  Cc: devicetree, Anup Patel, Scott Branden, Jon Mason, Ray Jui,
	linux-kernel, linux-raid, bcm-kernel-feedback-list, linux-crypto,
	Rob Rice, dmaengine, Dan Williams, linux-arm-kernel
In-Reply-To: <1486717628-17580-1-git-send-email-anup.patel@broadcom.com>

The raid6_gfexp table represents {2}^n values for 0 <= n < 256. The
Linux async_tx framework pass values from raid6_gfexp as coefficients
for each source to prep_dma_pq() callback of DMA channel with PQ
capability. This creates problem for RAID6 offload engines (such as
Broadcom SBA) which take disk position (i.e. log of {2}) instead of
multiplicative cofficients from raid6_gfexp table.

This patch adds raid6_gflog table having log-of-2 value for any given
x such that 0 <= x < 256. For any given disk coefficient x, the
corresponding disk position is given by raid6_gflog[x]. The RAID6
offload engine driver can use this newly added raid6_gflog table to
get disk position from multiplicative coefficient.

Signed-off-by: Anup Patel <anup.patel@broadcom.com>
Reviewed-by: Scott Branden <scott.branden@broadcom.com>
Reviewed-by: Ray Jui <ray.jui@broadcom.com>
---
 include/linux/raid/pq.h |  1 +
 lib/raid6/mktables.c    | 20 ++++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
index 4d57bba..30f9453 100644
--- a/include/linux/raid/pq.h
+++ b/include/linux/raid/pq.h
@@ -142,6 +142,7 @@ int raid6_select_algo(void);
 extern const u8 raid6_gfmul[256][256] __attribute__((aligned(256)));
 extern const u8 raid6_vgfmul[256][32] __attribute__((aligned(256)));
 extern const u8 raid6_gfexp[256]      __attribute__((aligned(256)));
+extern const u8 raid6_gflog[256]      __attribute__((aligned(256)));
 extern const u8 raid6_gfinv[256]      __attribute__((aligned(256)));
 extern const u8 raid6_gfexi[256]      __attribute__((aligned(256)));
 
diff --git a/lib/raid6/mktables.c b/lib/raid6/mktables.c
index 39787db..e824d08 100644
--- a/lib/raid6/mktables.c
+++ b/lib/raid6/mktables.c
@@ -125,6 +125,26 @@ int main(int argc, char *argv[])
 	printf("EXPORT_SYMBOL(raid6_gfexp);\n");
 	printf("#endif\n");
 
+	/* Compute log-of-2 table */
+	printf("\nconst u8 __attribute__((aligned(256)))\n"
+	       "raid6_gflog[256] =\n" "{\n");
+	for (i = 0; i < 256; i += 8) {
+		printf("\t");
+		for (j = 0; j < 8; j++) {
+			v = 255;
+			for (k = 0; k < 256; k++)
+				if (exptbl[k] == (i + j)) {
+					v = k;
+					break;
+				}
+			printf("0x%02x,%c", v, (j == 7) ? '\n' : ' ');
+		}
+	}
+	printf("};\n");
+	printf("#ifdef __KERNEL__\n");
+	printf("EXPORT_SYMBOL(raid6_gflog);\n");
+	printf("#endif\n");
+
 	/* Compute inverse table x^-1 == x^254 */
 	printf("\nconst u8 __attribute__((aligned(256)))\n"
 	       "raid6_gfinv[256] =\n" "{\n");
-- 
2.7.4

^ permalink raw reply related

* [PATCH v3 0/4] Broadcom SBA RAID support
From: Anup Patel @ 2017-02-10  9:07 UTC (permalink / raw)
  To: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar
  Cc: devicetree, Anup Patel, Scott Branden, Jon Mason, Ray Jui,
	linux-kernel, linux-raid, bcm-kernel-feedback-list, linux-crypto,
	Rob Rice, dmaengine, Dan Williams, linux-arm-kernel

The Broadcom SBA RAID is a stream-based device which provides
RAID5/6 offload.

It requires a SoC specific ring manager (such as Broadcom FlexRM
ring manager) to provide ring-based programming interface. Due to
this, the Broadcom SBA RAID driver (mailbox client) implements
DMA device having one DMA channel using a set of mailbox channels
provided by Broadcom SoC specific ring manager driver (mailbox
controller).

The Broadcom SBA RAID hardware requires PQ disk position instead
of PQ disk coefficient. To address this, we have added raid_gflog
table which will help driver to convert PQ disk coefficient to PQ
disk position.

This patchset is based on Linux-4.10-rc2 and depends on patchset
"[PATCH v4 0/2] Broadcom FlexRM ring manager support"

It is also available at sba-raid-v3 branch of
https://github.com/Broadcom/arm64-linux.git

Changes since v2:
 - Droped patch to handle DMA devices having support for fewer
   PQ coefficients in Linux Async Tx
 - Added work-around in bcm-sba-raid driver to handle unsupported
   PQ coefficients using multiple SBA requests

Changes since v1:
 - Droped patch to add mbox_channel_device() API
 - Used GENMASK and BIT macros wherever possible in bcm-sba-raid driver
 - Replaced C_MDATA macros with static inline functions in
   bcm-sba-raid driver
 - Removed sba_alloc_chan_resources() callback in bcm-sba-raid driver
 - Used dev_err() instead of dev_info() wherever applicable
 - Removed call to sba_issue_pending() from sba_tx_submit() in
   bcm-sba-raid driver
 - Implemented SBA request chaning for handling (len > sba->req_size)
   in bcm-sba-raid driver
 - Implemented device_terminate_all() callback in bcm-sba-raid driver

Anup Patel (4):
  lib/raid6: Add log-of-2 table for RAID6 HW requiring disk position
  async_tx: Fix DMA_PREP_FENCE usage in do_async_gen_syndrome()
  dmaengine: Add Broadcom SBA RAID driver
  dt-bindings: Add DT bindings document for Broadcom SBA RAID driver

 .../devicetree/bindings/dma/brcm,iproc-sba.txt     |   29 +
 crypto/async_tx/async_pq.c                         |    5 +-
 drivers/dma/Kconfig                                |   13 +
 drivers/dma/Makefile                               |    1 +
 drivers/dma/bcm-sba-raid.c                         | 1711 ++++++++++++++++++++
 include/linux/raid/pq.h                            |    1 +
 lib/raid6/mktables.c                               |   20 +
 7 files changed, 1777 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/dma/brcm,iproc-sba.txt
 create mode 100644 drivers/dma/bcm-sba-raid.c

-- 
2.7.4

^ permalink raw reply

* Re: [PATCH 1/5] MD: attach data to each bio
From: Shaohua Li @ 2017-02-10  6:47 UTC (permalink / raw)
  To: NeilBrown; +Cc: Shaohua Li, linux-raid, khlebnikov, hch
In-Reply-To: <87r336tw5l.fsf@notabene.neil.brown.name>

On Fri, Feb 10, 2017 at 05:08:54PM +1100, Neil Brown wrote:
> On Tue, Feb 07 2017, Shaohua Li wrote:
> 
> > Currently MD is rebusing some bio fields. To remove the hack, we attach
> > extra data to each bio. Each personablity can attach extra data to the
> > bios, so we don't need to rebuse bio fields.
> 
> I must say that I don't really like this approach.
> Temporarily modifying ->bi_private and ->bi_end_io seems
> .... intrusive.   I suspect it works, but I wonder if it is really
> robust in the long term.
> 
> How about a different approach..  Your main concern with my first patch
> was that it called md_write_start() and md_write_end() much more often,
> and these performed atomic ops on "global" variables, particular
> writes_pending.
> 
> We could change writes_pending to a per-cpu array which we only count
> occasionally when needed.  As writes_pending is updated often and
> checked rarely, a per-cpu array which is summed on demand seems
> appropriate.
> 
> The following patch is an early draft - it doesn't obviously fail and
> isn't obviously wrong to me.  There is certainly room for improvement
> and may be bugs.
> Next week I'll work on collection the re-factoring into separate
> patches, which are possible good-to-have anyway.

For your first patch, I don't have much concern. It's ok to me. What I don't
like is the bi_phys_segments handling part. The patches add a lot of logic to
handle the reference count. They should work, but I'd say it's not easy to
understand and could be error prone. What we really need is a reference count
for the bio, so let's just add a reference count. That's my logic and it's
simple.

For the modifying bi_private and bi_end_io part, I saw some filesystems are
using this way, at least btrfs. If this is really intrusive, is cloning a bio
better?

Thanks,
Shaohua

> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 8926fb781cdc..883c409902b0 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -64,6 +64,8 @@
>  #include <linux/raid/md_p.h>
>  #include <linux/raid/md_u.h>
>  #include <linux/slab.h>
> +#include <linux/percpu.h>
> +
>  #include <trace/events/block.h>
>  #include "md.h"
>  #include "bitmap.h"
> @@ -2250,6 +2252,81 @@ static void export_array(struct mddev *mddev)
>  	mddev->major_version = 0;
>  }
>  
> +/*
> + * The percpu writes_pending counters are linked with ->checkers and ->lock.
> + * If ->writes_pending can always be decremented without a lock.
> + * It can only be incremented without a lock if ->checkers is 0 and the test+incr
> + * happen in a rcu_readlocked region.
> + * ->checkers can only be changed under ->lock protection.
> + * To determine if ->writes_pending is totally zero, a quick sum without
> + * locks can be performed. If this is non-zero, then the result is final.
> + * Otherwise ->checkers must be incremented and synchronize_rcu() called.
> + * Then a sum calculated under ->lock, and the result is final until the
> + * ->checkers is decremented, or the lock is dropped.
> + *
> + */
> +
> +static bool __writes_pending(struct mddev *mddev)
> +{
> +	long cnt = 0;
> +	int i;
> +	for_each_possible_cpu(i)
> +		cnt += *per_cpu_ptr(mddev->writes_pending_percpu, i);
> +	return cnt != 0;
> +}
> +
> +static bool set_in_sync(struct mddev *mddev)
> +{
> +
> +	WARN_ON_ONCE(!spin_is_locked(&mddev->lock));
> +	if (__writes_pending(mddev)) {
> +		if (mddev->safemode == 1)
> +			mddev->safemode = 0;
> +		return false;
> +	}
> +	if (mddev->in_sync)
> +		return false;
> +
> +	mddev->checkers ++;
> +	spin_unlock(&mddev->lock);
> +	synchronize_rcu();
> +	spin_lock(&mddev->lock);
> +	if (mddev->in_sync) {
> +		/* someone else set it */
> +		mddev->checkers --;
> +		return false;
> +	}
> +
> +	if (! __writes_pending(mddev))
> +		mddev->in_sync = 1;
> +	if (mddev->safemode == 1)
> +		mddev->safemode = 0;
> +	mddev->checkers --;
> +	return mddev->in_sync;
> +}
> +
> +static void inc_writes_pending(struct mddev *mddev)
> +{
> +	rcu_read_lock();
> +	if (mddev->checkers == 0) {
> +		__this_cpu_inc(*mddev->writes_pending_percpu);
> +		rcu_read_unlock();
> +		return;
> +	}
> +	rcu_read_unlock();
> +	/* Need that spinlock */
> +	spin_lock(&mddev->lock);
> +	this_cpu_inc(*mddev->writes_pending_percpu);
> +	spin_unlock(&mddev->lock);
> +}
> +
> +static void zero_writes_pending(struct mddev *mddev)
> +{
> +	int i;
> +	for_each_possible_cpu(i)
> +		*per_cpu_ptr(mddev->writes_pending_percpu, i) = 0;
> +}
> +
>  static void sync_sbs(struct mddev *mddev, int nospares)
>  {
>  	/* Update each superblock (in-memory image), but
> @@ -3583,7 +3660,6 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
>  	mddev->delta_disks = 0;
>  	mddev->reshape_backwards = 0;
>  	mddev->degraded = 0;
> -	spin_unlock(&mddev->lock);
>  
>  	if (oldpers->sync_request == NULL &&
>  	    mddev->external) {
> @@ -3596,8 +3672,8 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
>  		 */
>  		mddev->in_sync = 0;
>  		mddev->safemode_delay = 0;
> -		mddev->safemode = 0;
>  	}
> +	spin_unlock(&mddev->lock);
>  
>  	oldpers->free(mddev, oldpriv);
>  
> @@ -3963,16 +4039,11 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
>  			wake_up(&mddev->sb_wait);
>  			err = 0;
>  		} else /* st == clean */ {
> +			err = 0;
>  			restart_array(mddev);
> -			if (atomic_read(&mddev->writes_pending) == 0) {
> -				if (mddev->in_sync == 0) {
> -					mddev->in_sync = 1;
> -					if (mddev->safemode == 1)
> -						mddev->safemode = 0;
> -					set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
> -				}
> -				err = 0;
> -			} else
> +			if (set_in_sync(mddev))
> +				set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
> +			else if (!mddev->in_sync)
>  				err = -EBUSY;
>  		}
>  		if (!err)
> @@ -4030,15 +4101,9 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
>  			if (err)
>  				break;
>  			spin_lock(&mddev->lock);
> -			if (atomic_read(&mddev->writes_pending) == 0) {
> -				if (mddev->in_sync == 0) {
> -					mddev->in_sync = 1;
> -					if (mddev->safemode == 1)
> -						mddev->safemode = 0;
> -					set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
> -				}
> -				err = 0;
> -			} else
> +			if (set_in_sync(mddev))
> +				set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
> +			else if (!mddev->in_sync)
>  				err = -EBUSY;
>  			spin_unlock(&mddev->lock);
>  		} else
> @@ -4993,6 +5058,9 @@ static void md_free(struct kobject *ko)
>  		del_gendisk(mddev->gendisk);
>  		put_disk(mddev->gendisk);
>  	}
> +	if (mddev->writes_pending_percpu) {
> +		free_percpu(mddev->writes_pending_percpu);
> +	}
>  
>  	kfree(mddev);
>  }
> @@ -5069,6 +5137,13 @@ static int md_alloc(dev_t dev, char *name)
>  	blk_queue_make_request(mddev->queue, md_make_request);
>  	blk_set_stacking_limits(&mddev->queue->limits);
>  
> +	mddev->writes_pending_percpu = alloc_percpu(long);
> +	if (!mddev->writes_pending_percpu) {
> +		blk_cleanup_queue(mddev->queue);
> +		mddev->queue = NULL;
> +		goto abort;
> +	}
> +
>  	disk = alloc_disk(1 << shift);
>  	if (!disk) {
>  		blk_cleanup_queue(mddev->queue);
> @@ -5152,7 +5227,7 @@ static void md_safemode_timeout(unsigned long data)
>  {
>  	struct mddev *mddev = (struct mddev *) data;
>  
> -	if (!atomic_read(&mddev->writes_pending)) {
> +	if (!__writes_pending(mddev)) {
>  		mddev->safemode = 1;
>  		if (mddev->external)
>  			sysfs_notify_dirent_safe(mddev->sysfs_state);
> @@ -5358,7 +5433,7 @@ int md_run(struct mddev *mddev)
>  	} else if (mddev->ro == 2) /* auto-readonly not meaningful */
>  		mddev->ro = 0;
>  
> -	atomic_set(&mddev->writes_pending,0);
> +	zero_writes_pending(mddev);
>  	atomic_set(&mddev->max_corr_read_errors,
>  		   MD_DEFAULT_MAX_CORRECTED_READ_ERRORS);
>  	mddev->safemode = 0;
> @@ -5451,7 +5526,6 @@ static int restart_array(struct mddev *mddev)
>  			return -EINVAL;
>  	}
>  
> -	mddev->safemode = 0;
>  	mddev->ro = 0;
>  	set_disk_ro(disk, 0);
>  	pr_debug("md: %s switched to read-write mode.\n", mdname(mddev));
> @@ -7767,9 +7841,7 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
>  		md_wakeup_thread(mddev->sync_thread);
>  		did_change = 1;
>  	}
> -	atomic_inc(&mddev->writes_pending);
> -	if (mddev->safemode == 1)
> -		mddev->safemode = 0;
> +	inc_writes_pending(mddev);
>  	if (mddev->in_sync) {
>  		spin_lock(&mddev->lock);
>  		if (mddev->in_sync) {
> @@ -7790,12 +7862,17 @@ EXPORT_SYMBOL(md_write_start);
>  
>  void md_write_end(struct mddev *mddev)
>  {
> -	if (atomic_dec_and_test(&mddev->writes_pending)) {
> -		if (mddev->safemode == 2)
> +	this_cpu_dec(*mddev->writes_pending_percpu);
> +	if (mddev->safemode == 2) {
> +		if (!__writes_pending(mddev))
>  			md_wakeup_thread(mddev->thread);
> -		else if (mddev->safemode_delay)
> -			mod_timer(&mddev->safemode_timer, jiffies + mddev->safemode_delay);
> -	}
> +	} else if (mddev->safemode_delay)
> +		/* The roundup() ensure this only performs locking once
> +		 * every ->safemode_delay jiffies
> +		 */
> +		mod_timer(&mddev->safemode_timer,
> +			  roundup(jiffies, mddev->safemode_delay) + mddev->safemode_delay);
> +
>  }
>  EXPORT_SYMBOL(md_write_end);
>  
> @@ -8398,7 +8475,7 @@ void md_check_recovery(struct mddev *mddev)
>  		test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
>  		test_bit(MD_RELOAD_SB, &mddev->flags) ||
>  		(mddev->external == 0 && mddev->safemode == 1) ||
> -		(mddev->safemode == 2 && ! atomic_read(&mddev->writes_pending)
> +		(mddev->safemode == 2 && ! __writes_pending(mddev)
>  		 && !mddev->in_sync && mddev->recovery_cp == MaxSector)
>  		))
>  		return;
> @@ -8450,19 +8527,14 @@ void md_check_recovery(struct mddev *mddev)
>  				md_reload_sb(mddev, mddev->good_device_nr);
>  		}
>  
> -		if (!mddev->external) {
> +		if (!mddev->external && mddev->safemode) {
>  			int did_change = 0;
>  			spin_lock(&mddev->lock);
> -			if (mddev->safemode &&
> -			    !atomic_read(&mddev->writes_pending) &&
> -			    !mddev->in_sync &&
> -			    mddev->recovery_cp == MaxSector) {
> -				mddev->in_sync = 1;
> +			if (mddev->recovery_cp == MaxSector &&
> +			    set_in_sync(mddev)) {
>  				did_change = 1;
>  				set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
>  			}
> -			if (mddev->safemode == 1)
> -				mddev->safemode = 0;
>  			spin_unlock(&mddev->lock);
>  			if (did_change)
>  				sysfs_notify_dirent_safe(mddev->sysfs_state);
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 2a514036a83d..7e41f882d33d 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -404,7 +404,8 @@ struct mddev {
>  							 */
>  	unsigned int			safemode_delay;
>  	struct timer_list		safemode_timer;
> -	atomic_t			writes_pending;
> +	long				*writes_pending_percpu;
> +	int				checkers;	/* # of threads checking writes_pending */
>  	struct request_queue		*queue;	/* for plugging ... */
>  
>  	struct bitmap			*bitmap; /* the bitmap for the device */
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 3c7e106c12a2..45d260326efd 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -7360,7 +7360,7 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
>  		 * we can't wait pending write here, as this is called in
>  		 * raid5d, wait will deadlock.
>  		 */
> -		if (atomic_read(&mddev->writes_pending))
> +		if (!mddev->in_sync)
>  			return -EBUSY;
>  		log = conf->log;
>  		conf->log = NULL;



^ permalink raw reply

* Re: [PATCH 1/5] MD: attach data to each bio
From: NeilBrown @ 2017-02-10  6:08 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: khlebnikov, hch
In-Reply-To: <79515b1372fa1a1813c00ef0d7e0613a4512183d.1486485935.git.shli@fb.com>

[-- Attachment #1: Type: text/plain, Size: 10392 bytes --]

On Tue, Feb 07 2017, Shaohua Li wrote:

> Currently MD is rebusing some bio fields. To remove the hack, we attach
> extra data to each bio. Each personablity can attach extra data to the
> bios, so we don't need to rebuse bio fields.

I must say that I don't really like this approach.
Temporarily modifying ->bi_private and ->bi_end_io seems
.... intrusive.   I suspect it works, but I wonder if it is really
robust in the long term.

How about a different approach..  Your main concern with my first patch
was that it called md_write_start() and md_write_end() much more often,
and these performed atomic ops on "global" variables, particular
writes_pending.

We could change writes_pending to a per-cpu array which we only count
occasionally when needed.  As writes_pending is updated often and
checked rarely, a per-cpu array which is summed on demand seems
appropriate.

The following patch is an early draft - it doesn't obviously fail and
isn't obviously wrong to me.  There is certainly room for improvement
and may be bugs.
Next week I'll work on collection the re-factoring into separate
patches, which are possible good-to-have anyway.

Thoughts?

Thanks,
NeilBrown


diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8926fb781cdc..883c409902b0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -64,6 +64,8 @@
 #include <linux/raid/md_p.h>
 #include <linux/raid/md_u.h>
 #include <linux/slab.h>
+#include <linux/percpu.h>
+
 #include <trace/events/block.h>
 #include "md.h"
 #include "bitmap.h"
@@ -2250,6 +2252,81 @@ static void export_array(struct mddev *mddev)
 	mddev->major_version = 0;
 }
 
+/*
+ * The percpu writes_pending counters are linked with ->checkers and ->lock.
+ * If ->writes_pending can always be decremented without a lock.
+ * It can only be incremented without a lock if ->checkers is 0 and the test+incr
+ * happen in a rcu_readlocked region.
+ * ->checkers can only be changed under ->lock protection.
+ * To determine if ->writes_pending is totally zero, a quick sum without
+ * locks can be performed. If this is non-zero, then the result is final.
+ * Otherwise ->checkers must be incremented and synchronize_rcu() called.
+ * Then a sum calculated under ->lock, and the result is final until the
+ * ->checkers is decremented, or the lock is dropped.
+ *
+ */
+
+static bool __writes_pending(struct mddev *mddev)
+{
+	long cnt = 0;
+	int i;
+	for_each_possible_cpu(i)
+		cnt += *per_cpu_ptr(mddev->writes_pending_percpu, i);
+	return cnt != 0;
+}
+
+static bool set_in_sync(struct mddev *mddev)
+{
+
+	WARN_ON_ONCE(!spin_is_locked(&mddev->lock));
+	if (__writes_pending(mddev)) {
+		if (mddev->safemode == 1)
+			mddev->safemode = 0;
+		return false;
+	}
+	if (mddev->in_sync)
+		return false;
+
+	mddev->checkers ++;
+	spin_unlock(&mddev->lock);
+	synchronize_rcu();
+	spin_lock(&mddev->lock);
+	if (mddev->in_sync) {
+		/* someone else set it */
+		mddev->checkers --;
+		return false;
+	}
+
+	if (! __writes_pending(mddev))
+		mddev->in_sync = 1;
+	if (mddev->safemode == 1)
+		mddev->safemode = 0;
+	mddev->checkers --;
+	return mddev->in_sync;
+}
+
+static void inc_writes_pending(struct mddev *mddev)
+{
+	rcu_read_lock();
+	if (mddev->checkers == 0) {
+		__this_cpu_inc(*mddev->writes_pending_percpu);
+		rcu_read_unlock();
+		return;
+	}
+	rcu_read_unlock();
+	/* Need that spinlock */
+	spin_lock(&mddev->lock);
+	this_cpu_inc(*mddev->writes_pending_percpu);
+	spin_unlock(&mddev->lock);
+}
+
+static void zero_writes_pending(struct mddev *mddev)
+{
+	int i;
+	for_each_possible_cpu(i)
+		*per_cpu_ptr(mddev->writes_pending_percpu, i) = 0;
+}
+
 static void sync_sbs(struct mddev *mddev, int nospares)
 {
 	/* Update each superblock (in-memory image), but
@@ -3583,7 +3660,6 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 	mddev->delta_disks = 0;
 	mddev->reshape_backwards = 0;
 	mddev->degraded = 0;
-	spin_unlock(&mddev->lock);
 
 	if (oldpers->sync_request == NULL &&
 	    mddev->external) {
@@ -3596,8 +3672,8 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 		 */
 		mddev->in_sync = 0;
 		mddev->safemode_delay = 0;
-		mddev->safemode = 0;
 	}
+	spin_unlock(&mddev->lock);
 
 	oldpers->free(mddev, oldpriv);
 
@@ -3963,16 +4039,11 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
 			wake_up(&mddev->sb_wait);
 			err = 0;
 		} else /* st == clean */ {
+			err = 0;
 			restart_array(mddev);
-			if (atomic_read(&mddev->writes_pending) == 0) {
-				if (mddev->in_sync == 0) {
-					mddev->in_sync = 1;
-					if (mddev->safemode == 1)
-						mddev->safemode = 0;
-					set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
-				}
-				err = 0;
-			} else
+			if (set_in_sync(mddev))
+				set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
+			else if (!mddev->in_sync)
 				err = -EBUSY;
 		}
 		if (!err)
@@ -4030,15 +4101,9 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
 			if (err)
 				break;
 			spin_lock(&mddev->lock);
-			if (atomic_read(&mddev->writes_pending) == 0) {
-				if (mddev->in_sync == 0) {
-					mddev->in_sync = 1;
-					if (mddev->safemode == 1)
-						mddev->safemode = 0;
-					set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
-				}
-				err = 0;
-			} else
+			if (set_in_sync(mddev))
+				set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
+			else if (!mddev->in_sync)
 				err = -EBUSY;
 			spin_unlock(&mddev->lock);
 		} else
@@ -4993,6 +5058,9 @@ static void md_free(struct kobject *ko)
 		del_gendisk(mddev->gendisk);
 		put_disk(mddev->gendisk);
 	}
+	if (mddev->writes_pending_percpu) {
+		free_percpu(mddev->writes_pending_percpu);
+	}
 
 	kfree(mddev);
 }
@@ -5069,6 +5137,13 @@ static int md_alloc(dev_t dev, char *name)
 	blk_queue_make_request(mddev->queue, md_make_request);
 	blk_set_stacking_limits(&mddev->queue->limits);
 
+	mddev->writes_pending_percpu = alloc_percpu(long);
+	if (!mddev->writes_pending_percpu) {
+		blk_cleanup_queue(mddev->queue);
+		mddev->queue = NULL;
+		goto abort;
+	}
+
 	disk = alloc_disk(1 << shift);
 	if (!disk) {
 		blk_cleanup_queue(mddev->queue);
@@ -5152,7 +5227,7 @@ static void md_safemode_timeout(unsigned long data)
 {
 	struct mddev *mddev = (struct mddev *) data;
 
-	if (!atomic_read(&mddev->writes_pending)) {
+	if (!__writes_pending(mddev)) {
 		mddev->safemode = 1;
 		if (mddev->external)
 			sysfs_notify_dirent_safe(mddev->sysfs_state);
@@ -5358,7 +5433,7 @@ int md_run(struct mddev *mddev)
 	} else if (mddev->ro == 2) /* auto-readonly not meaningful */
 		mddev->ro = 0;
 
-	atomic_set(&mddev->writes_pending,0);
+	zero_writes_pending(mddev);
 	atomic_set(&mddev->max_corr_read_errors,
 		   MD_DEFAULT_MAX_CORRECTED_READ_ERRORS);
 	mddev->safemode = 0;
@@ -5451,7 +5526,6 @@ static int restart_array(struct mddev *mddev)
 			return -EINVAL;
 	}
 
-	mddev->safemode = 0;
 	mddev->ro = 0;
 	set_disk_ro(disk, 0);
 	pr_debug("md: %s switched to read-write mode.\n", mdname(mddev));
@@ -7767,9 +7841,7 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
 		md_wakeup_thread(mddev->sync_thread);
 		did_change = 1;
 	}
-	atomic_inc(&mddev->writes_pending);
-	if (mddev->safemode == 1)
-		mddev->safemode = 0;
+	inc_writes_pending(mddev);
 	if (mddev->in_sync) {
 		spin_lock(&mddev->lock);
 		if (mddev->in_sync) {
@@ -7790,12 +7862,17 @@ EXPORT_SYMBOL(md_write_start);
 
 void md_write_end(struct mddev *mddev)
 {
-	if (atomic_dec_and_test(&mddev->writes_pending)) {
-		if (mddev->safemode == 2)
+	this_cpu_dec(*mddev->writes_pending_percpu);
+	if (mddev->safemode == 2) {
+		if (!__writes_pending(mddev))
 			md_wakeup_thread(mddev->thread);
-		else if (mddev->safemode_delay)
-			mod_timer(&mddev->safemode_timer, jiffies + mddev->safemode_delay);
-	}
+	} else if (mddev->safemode_delay)
+		/* The roundup() ensure this only performs locking once
+		 * every ->safemode_delay jiffies
+		 */
+		mod_timer(&mddev->safemode_timer,
+			  roundup(jiffies, mddev->safemode_delay) + mddev->safemode_delay);
+
 }
 EXPORT_SYMBOL(md_write_end);
 
@@ -8398,7 +8475,7 @@ void md_check_recovery(struct mddev *mddev)
 		test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
 		test_bit(MD_RELOAD_SB, &mddev->flags) ||
 		(mddev->external == 0 && mddev->safemode == 1) ||
-		(mddev->safemode == 2 && ! atomic_read(&mddev->writes_pending)
+		(mddev->safemode == 2 && ! __writes_pending(mddev)
 		 && !mddev->in_sync && mddev->recovery_cp == MaxSector)
 		))
 		return;
@@ -8450,19 +8527,14 @@ void md_check_recovery(struct mddev *mddev)
 				md_reload_sb(mddev, mddev->good_device_nr);
 		}
 
-		if (!mddev->external) {
+		if (!mddev->external && mddev->safemode) {
 			int did_change = 0;
 			spin_lock(&mddev->lock);
-			if (mddev->safemode &&
-			    !atomic_read(&mddev->writes_pending) &&
-			    !mddev->in_sync &&
-			    mddev->recovery_cp == MaxSector) {
-				mddev->in_sync = 1;
+			if (mddev->recovery_cp == MaxSector &&
+			    set_in_sync(mddev)) {
 				did_change = 1;
 				set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
 			}
-			if (mddev->safemode == 1)
-				mddev->safemode = 0;
 			spin_unlock(&mddev->lock);
 			if (did_change)
 				sysfs_notify_dirent_safe(mddev->sysfs_state);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 2a514036a83d..7e41f882d33d 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -404,7 +404,8 @@ struct mddev {
 							 */
 	unsigned int			safemode_delay;
 	struct timer_list		safemode_timer;
-	atomic_t			writes_pending;
+	long				*writes_pending_percpu;
+	int				checkers;	/* # of threads checking writes_pending */
 	struct request_queue		*queue;	/* for plugging ... */
 
 	struct bitmap			*bitmap; /* the bitmap for the device */
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3c7e106c12a2..45d260326efd 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7360,7 +7360,7 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 		 * we can't wait pending write here, as this is called in
 		 * raid5d, wait will deadlock.
 		 */
-		if (atomic_read(&mddev->writes_pending))
+		if (!mddev->in_sync)
 			return -EBUSY;
 		log = conf->log;
 		conf->log = NULL;

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

* Re: [PATCH v2 2/5] async_tx: Handle DMA devices having support for fewer PQ coefficients
From: Anup Patel @ 2017-02-10  3:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: Mark Rutland, Device Tree, Herbert Xu, Scott Branden, Vinod Koul,
	Ray Jui, Jassi Brar, linux-kernel@vger.kernel.org, linux-raid,
	Jon Mason, Rob Herring, BCM Kernel Feedback, linux-crypto,
	Rob Rice, dmaengine@vger.kernel.org, David S . Miller,
	linux-arm-kernel@lists.infradead.org
In-Reply-To: <CAPcyv4gDq+our1+sAPShBru-qvebo+Zvk0crgSUNXucHGQwZ1Q@mail.gmail.com>

On Thu, Feb 9, 2017 at 10:14 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Thu, Feb 9, 2017 at 1:29 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>> On Wed, Feb 8, 2017 at 9:54 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>> On Wed, Feb 8, 2017 at 12:57 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>>>> On Tue, Feb 7, 2017 at 11:46 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>>>> On Tue, Feb 7, 2017 at 1:02 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>>>>>> On Tue, Feb 7, 2017 at 1:57 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>>>>>> On Tue, Feb 7, 2017 at 12:16 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>>>>>>>> The DMAENGINE framework assumes that if PQ offload is supported by a
>>>>>>>> DMA device then all 256 PQ coefficients are supported. This assumption
>>>>>>>> does not hold anymore because we now have BCM-SBA-RAID offload engine
>>>>>>>> which supports PQ offload with limited number of PQ coefficients.
>>>>>>>>
>>>>>>>> This patch extends async_tx APIs to handle DMA devices with support
>>>>>>>> for fewer PQ coefficients.
>>>>>>>>
>>>>>>>> Signed-off-by: Anup Patel <anup.patel@broadcom.com>
>>>>>>>> Reviewed-by: Scott Branden <scott.branden@broadcom.com>
>>>>>>>
>>>>>>> I don't like this approach. Define an interface for md to query the
>>>>>>> offload engine once at the beginning of time. We should not be adding
>>>>>>> any new extensions to async_tx.
>>>>>>
>>>>>> Even if we do capability checks in Linux MD, we still need a way
>>>>>> for DMAENGINE drivers to advertise number of PQ coefficients
>>>>>> handled by the HW.
>>>>>>
>>>>>> I agree capability checks should be done once in Linux MD but I don't
>>>>>> see why this has to be part of BCM-SBA-RAID driver patches. We need
>>>>>> separate patchsets to address limitations of async_tx framework.
>>>>>
>>>>> Right, separate enabling before we pile on new hardware support to a
>>>>> known broken framework.
>>>>
>>>> Linux Async Tx not broken framework. The issue is:
>>>> 1. Its not complete enough
>>>> 2. Its not optimized for very high through-put offload engines
>>>
>>> I'm not understanding your point. I'm nak'ing this change to add yet
>>> more per-transaction capability checking to async_tx. I don't like the
>>> DMA_HAS_FEWER_PQ_COEF flag, especially since it is equal to
>>> DMA_HAS_PQ_CONTINUE. I'm not asking for all of async_tx's problems to
>>> be fixed before this new hardware support, I'm simply saying we should
>>> start the process of moving offload-engine capability checking to the
>>> raid code.
>>
>> The DMA_HAS_FEWER_PQ_COEF is not equal to
>> DMA_HAS_PQ_CONTINUE.
>
> #define DMA_HAS_PQ_CONTINUE (1 << 15
> #define DMA_HAS_FEWER_PQ_COEF (1 << 15)

You are only looking at the values of these flags.

The semantics of both these flags are different and both
flags are set in different members of "struct dma_device"
The DMA_HAS_PQ_CONTINUE is set in "max_pq" whereas
DMA_HAS_FEWER_PQ_COEF is set in "max_pqcoef".

When DMA_HAS_PQ_CONTINUE is set in "max_pq", it
means that PQ HW is capable of taking P & Q computed
previous txn as input. If DMA_HAS_PQ_CONTINUE is
not supported the async_pq() will pass P & Q computed
by previous txn as sources with coef as g^0.

When DMA_HAS_FEWER_PQ_COEF is set in "max_pqcoef",
it means the PQ HW is not capable of handling all 256 coefs.

>
>> I will try to drop this patch and take care of unsupported PQ
>> coefficients in BCM-SBA-RAID driver itself even if this means
>> doing some computations in BCM-SBA-RAID driver itself.
>
> That should be nak'd as well, please do capability detection in a
> routine that is common to all raid engines.

Thanks for NAKing this patch.

This motivated me to find clean work-around for handling
unsupported PQ coefs in BCM-SBA-RAID driver.

Let's assume max number of PQ coefs supported by PQ HW
is m coefs. Now for any coef n > m, we can use RAID6 math
to get g^n = (g^m)*(g^m)*....*(g^k) where k <= m.

Using the above fact, we can create chained txn for each
source of PQ request in BCM-SBA-RAID driver where each
txn will only compute PQ from one source using above
described RAID6 math. Also, each txn in chained txn will
depend on output of previous txn because we are computing
PQ from one source at a time.

I will drop this patch and send updated BCM-SBA-RAID
driver with above described work-around for handling
unsupported PQ coefs.

Regards,
Anup

^ permalink raw reply

* Re: [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation
From: Shaohua Li @ 2017-02-09 17:13 UTC (permalink / raw)
  To: Wols Lists; +Cc: Artur Paszkiewicz, linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <589C9368.1020602@youngman.org.uk>

On Thu, Feb 09, 2017 at 04:06:00PM +0000, Wols Lists wrote:
> On 08/02/17 05:34, Shaohua Li wrote:
> > so it's the implementation which doesn't flush disk cache before writting new
> > data to disks, right? That makes sense then. I think we should fix this soon.
> > Don't think people will (or remember to) disable write-back cache before using
> > ppl.
> 
> Dunno whether I'm making a fool of myself :-) but might a quick-n-dirty
> workaround be to make enabling PPL also disable write-back cache? With
> maybe a warning not to switch it back on?

We don't have API for this in kernel side. This doesn't mean we can't add it
but since the limitation is in the implementation, we'd better fix the
implementation.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation
From: Shaohua Li @ 2017-02-09 17:09 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <232777f2-d8e9-3d86-cb84-7285c90e5a6f@intel.com>

On Thu, Feb 09, 2017 at 04:35:34PM +0100, Artur Paszkiewicz wrote:
> On 02/08/2017 06:34 AM, Shaohua Li wrote:
> > On Wed, Feb 08, 2017 at 12:58:42PM +0100, Artur Paszkiewicz wrote:
> >> On 02/07/2017 10:42 PM, Shaohua Li wrote:
> >>> On Mon, Jan 30, 2017 at 07:59:49PM +0100, Artur Paszkiewicz wrote:
> >>>> This implements the PPL write logging functionality, using the
> >>>> raid5-cache policy logic introduced in previous patches. The description
> >>>> of PPL is added to the documentation. More details can be found in the
> >>>> comments in raid5-ppl.c.
> >>>>
> >>>> Put the PPL metadata structures to md_p.h because userspace tools
> >>>> (mdadm) will also need to read/write PPL.
> >>>>
> >>>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> >>>> ---
> >>>>  Documentation/admin-guide/md.rst |  53 ++++
> >>>>  drivers/md/Makefile              |   2 +-
> >>>>  drivers/md/raid5-cache.c         |  13 +-
> >>>>  drivers/md/raid5-cache.h         |   8 +
> >>>>  drivers/md/raid5-ppl.c           | 551 +++++++++++++++++++++++++++++++++++++++
> >>>>  drivers/md/raid5.c               |  15 +-
> >>>>  include/uapi/linux/raid/md_p.h   |  26 ++
> >>>>  7 files changed, 661 insertions(+), 7 deletions(-)
> >>>>  create mode 100644 drivers/md/raid5-ppl.c
> >>>>
> >>>> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
> >>>> index e449fb5f277c..7104ef757e73 100644
> >>>> --- a/Documentation/admin-guide/md.rst
> >>>> +++ b/Documentation/admin-guide/md.rst
> >>>> @@ -86,6 +86,9 @@ superblock can be autodetected and run at boot time.
> >>>>  The kernel parameter ``raid=partitionable`` (or ``raid=part``) means
> >>>>  that all auto-detected arrays are assembled as partitionable.
> >>>>  
> >>>> +
> >>>> +.. _dirty_degraded_boot:
> >>>> +
> >>>>  Boot time assembly of degraded/dirty arrays
> >>>>  -------------------------------------------
> >>>>  
> >>>> @@ -176,6 +179,56 @@ and its role in the array.
> >>>>  Once started with RUN_ARRAY, uninitialized spares can be added with
> >>>>  HOT_ADD_DISK.
> >>>>  
> >>>> +.. _ppl:
> >>>> +
> >>>> +Partial Parity Log
> >>>> +------------------
> >>> Please put this part to a separate file under Documentation/md. The admin-guide
> >>> is supposed to serve as a doc for user. The Documentation/md files will server
> >>> as a doc for developer.
> >>
> >> OK, I will move it.
> >>
> >>>> +Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
> >>>> +addressed by PPL is that after a dirty shutdown, parity of a particular stripe
> >>>> +may become inconsistent with data on other member disks. If the array is also
> >>>> +in degraded state, there is no way to recalculate parity, because one of the
> >>>> +disks is missing. This can lead to silent data corruption when rebuilding the
> >>>> +array or using it is as degraded - data calculated from parity for array blocks
> >>>> +that have not been touched by a write request during the unclean shutdown can
> >>>> +be incorrect. Such condition is known as the ``RAID5 Write Hole``. Because of
> >>>> +this, md by default does not allow starting a dirty degraded array, see
> >>>> +:ref:`dirty_degraded_boot`.
> >>>> +
> >>>> +Partial parity for a write operation is the XOR of stripe data chunks not
> >>>> +modified by this write. It is just enough data needed for recovering from the
> >>>> +write hole. XORing partial parity with the modified chunks produces parity for
> >>>> +the stripe, consistent with its state before the write operation, regardless of
> >>>> +which chunk writes have completed. If one of the not modified data disks of
> >>>> +this stripe is missing, this updated parity can be used to recover its
> >>>> +contents. PPL recovery is also performed when starting an array after an
> >>>> +unclean shutdown and all disks are available, eliminating the need to resync
> >>>> +the array. Because of this, using write-intent bitmap and PPL together is not
> >>>> +supported.
> >>>> +
> >>>> +When handling a write request PPL writes partial parity before new data and
> >>>> +parity are dispatched to disks. PPL is a distributed log - it is stored on
> >>>> +array member drives in the metadata area, on the parity drive of a particular
> >>>> +stripe.  It does not require a dedicated journaling drive. Write performance is
> >>>> +reduced by up to 30%-40% but it scales with the number of drives in the array
> >>>> +and the journaling drive does not become a bottleneck or a single point of
> >>>> +failure.
> >>>> +
> >>>> +Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
> >>>> +not a true journal. It does not protect from losing in-flight data, only from
> >>>> +silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
> >>>> +performed for this stripe (parity is not updated). So it is possible to have
> >>>> +arbitrary data in the written part of a stripe if that disk is lost. In such
> >>>> +case the behavior is the same as in plain raid5.
> >>>> +
> >>>> +PPL is available for md version-1 metadata and external (specifically IMSM)
> >>>> +metadata arrays. It can be enabled using mdadm option
> >>>> +``--consistency-policy=ppl``.
> >>>> +
> >>>> +Currently, volatile write-back cache should be disabled on all member drives
> >>>> +when using PPL. Otherwise it cannot guarantee consistency in case of power
> >>>> +failure.
> >>>
> >>> The limitation is weird and dumb. Is this an implementation limitation which
> >>> can be fixed later or a design limitation?
> >>
> >> This can be fixed later and I'm planning to do this. The limitation
> >> results from the fact that partial parity is the xor of "old" data.
> >> During recovery we xor it with "new" data and that produces valid
> >> parity, assuming the "old" data has not changed after dirty shutdown.
> >> But it's not necessarily true, since the "old" data might have been only
> >> in cache and not on persistent storage. So fixing this will require
> >> making sure the old data is on persistent media before writing PPL.
> > 
> > so it's the implementation which doesn't flush disk cache before writting new
> > data to disks, right? That makes sense then. I think we should fix this soon.
> > Don't think people will (or remember to) disable write-back cache before using
> > ppl.
> > 
> >>>> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
> >>>> new file mode 100644
> >>>> index 000000000000..6bc246c80f6b
> >>>> --- /dev/null
> >>>> +++ b/drivers/md/raid5-ppl.c
> >>>> @@ -0,0 +1,551 @@
> >>>> +/*
> >>>> + * Partial Parity Log for closing the RAID5 write hole
> >>>> + * Copyright (c) 2017, Intel Corporation.
> >>>> + *
> >>>> + * This program is free software; you can redistribute it and/or modify it
> >>>> + * under the terms and conditions of the GNU General Public License,
> >>>> + * version 2, as published by the Free Software Foundation.
> >>>> + *
> >>>> + * This program is distributed in the hope it will be useful, but WITHOUT
> >>>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> >>>> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> >>>> + * more details.
> >>>> + */
> >>>> +
> >>>> +#include <linux/kernel.h>
> >>>> +#include <linux/blkdev.h>
> >>>> +#include <linux/slab.h>
> >>>> +#include <linux/crc32c.h>
> >>>> +#include <linux/raid/md_p.h>
> >>>> +#include "md.h"
> >>>> +#include "raid5.h"
> >>>> +#include "raid5-cache.h"
> >>>> +
> >>>> +/*
> >>>> + * PPL consists of a 4KB header (struct ppl_header) and at least 128KB for
> >>>> + * partial parity data. The header contains an array of entries
> >>>> + * (struct ppl_header_entry) which describe the logged write requests.
> >>>> + * Partial parity for the entries comes after the header, written in the same
> >>>> + * sequence as the entries:
> >>>> + *
> >>>> + * Header
> >>>> + *   entry0
> >>>> + *   ...
> >>>> + *   entryN
> >>>> + * PP data
> >>>> + *   PP for entry0
> >>>> + *   ...
> >>>> + *   PP for entryN
> >>>> + *
> >>>> + * Every entry holds a checksum of its partial parity, the header also has a
> >>>> + * checksum of the header itself. Entries for full stripes writes contain no
> >>>> + * partial parity, they only mark the stripes for which parity should be
> >>>> + * recalculated after an unclean shutdown.
> >>>> + *
> >>>> + * A write request is always logged to the PPL instance stored on the parity
> >>>> + * disk of the corresponding stripe. For each member disk there is one r5l_log
> >>>> + * used to handle logging for this disk, independently from others. They are
> >>>> + * grouped in child_logs array in struct ppl_conf, which is assigned to a
> >>>> + * common parent r5l_log. This parent log serves as a proxy and is used in
> >>>> + * raid5 personality code - it is assigned as _the_ log in r5conf->log.
> >>>> + *
> >>>> + * r5l_io_unit represents a full PPL write, meta_page contains the ppl_header.
> >>>> + * PPL entries for logged stripes are added in ppl_log_stripe(). A stripe can
> >>>> + * be appended to the last entry if the chunks to write are the same, otherwise
> >>>> + * a new entry is added. Checksums of entries are calculated incrementally as
> >>>> + * stripes containing partial parity are being added to entries.
> >>>> + * ppl_submit_iounit() calculates the checksum of the header and submits a bio
> >>>> + * containing the meta_page (ppl_header) and partial parity pages (sh->ppl_page)
> >>>> + * for all stripes of the io_unit. When the PPL write completes, the stripes
> >>>> + * associated with the io_unit are released and raid5d starts writing their data
> >>>> + * and parity. When all stripes are written, the io_unit is freed and the next
> >>>> + * can be submitted.
> >>>> + *
> >>>> + * An io_unit is used to gather stripes until it is submitted or becomes full
> >>>> + * (if the maximum number of entries or size of PPL is reached). Another io_unit
> >>>> + * can't be submitted until the previous has completed (PPL and stripe
> >>>> + * data+parity is written). The log->running_ios list tracks all io_units of
> >>>> + * a log (for a single member disk). New io_units are added to the end of the
> >>>> + * list and the first io_unit is submitted, if it is not submitted already.
> >>>> + * The current io_unit accepting new stripes is always the last on the list.
> >>>> + */
> >>>
> >>> I don't like the way that io_unit and r5l_log are reused here. The code below
> >>> shares very little with raid5-cache. There are a lot of fields in the data
> >>> structures which are not used by ppl. It's not a good practice to share data
> >>> structure like this way, so please define ppl its own data structure. The
> >>> workflow between raid5 cache and ppl might be similar, but what we really need
> >>> to share is the hooks to raid5 core.
> >>
> >> Reusing the structures seemed like a good idea at first, but now I must
> >> admit you are right, especially since the structures were extended to
> >> support write-back caching. One instance of r5l_log will still be
> >> necessary though, to use for the hooks. Its private field will be used
> >> for holding the ppl structures. Is that ok?
> > 
> > I'd prefer adding a 'void *log_private' in struct r5conf. raid5-cache and ppl
> > can store whatever data in the log_private. And convert all callbacks to use a
> > 'struct r5conf' parameter. This way raid5 cache and ppl don't need to share the
> > data structure.
> 
> This way also 'struct r5l_policy *policy' will have to be moved from
> r5l_log to r5conf. Maybe it will be better if instead of adding
> log_private and moving this 'policy' field we add a dedicated field for
> ppl in r5conf, something like 'struct ppl_conf *ppl', and eliminate
> r5l_policy? From raid5 core we could just call corresponding functions
> depending on which pointer is set (log or ppl). This way raid5-cache.c
> won't have to be changed at all. It seems appropriate to separate
> raid5-cache and ppl if they won't be sharing any data structures.

It's ok to add ppl specific data in r5conf. Just make sure the raid5 core only
calls one set of hooks if possible and the hooks can call into ppl or
raid5-cache. It's quite unlikely we will add other policies, so it's fine to
eliminate r5l_policy.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v2 2/5] async_tx: Handle DMA devices having support for fewer PQ coefficients
From: Dan Williams @ 2017-02-09 16:44 UTC (permalink / raw)
  To: Anup Patel
  Cc: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar, Ray Jui, Scott Branden, Jon Mason,
	Rob Rice, BCM Kernel Feedback, dmaengine@vger.kernel.org,
	Device Tree, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-crypto, linux-raid
In-Reply-To: <CAALAos-ua6hVpUqjnJSQ=ysSOKrN67toiT3J808uHC4A_ZV3mg@mail.gmail.com>

On Thu, Feb 9, 2017 at 1:29 AM, Anup Patel <anup.patel@broadcom.com> wrote:
> On Wed, Feb 8, 2017 at 9:54 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> On Wed, Feb 8, 2017 at 12:57 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>>> On Tue, Feb 7, 2017 at 11:46 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>>> On Tue, Feb 7, 2017 at 1:02 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>>>>> On Tue, Feb 7, 2017 at 1:57 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>>>>> On Tue, Feb 7, 2017 at 12:16 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>>>>>>> The DMAENGINE framework assumes that if PQ offload is supported by a
>>>>>>> DMA device then all 256 PQ coefficients are supported. This assumption
>>>>>>> does not hold anymore because we now have BCM-SBA-RAID offload engine
>>>>>>> which supports PQ offload with limited number of PQ coefficients.
>>>>>>>
>>>>>>> This patch extends async_tx APIs to handle DMA devices with support
>>>>>>> for fewer PQ coefficients.
>>>>>>>
>>>>>>> Signed-off-by: Anup Patel <anup.patel@broadcom.com>
>>>>>>> Reviewed-by: Scott Branden <scott.branden@broadcom.com>
>>>>>>
>>>>>> I don't like this approach. Define an interface for md to query the
>>>>>> offload engine once at the beginning of time. We should not be adding
>>>>>> any new extensions to async_tx.
>>>>>
>>>>> Even if we do capability checks in Linux MD, we still need a way
>>>>> for DMAENGINE drivers to advertise number of PQ coefficients
>>>>> handled by the HW.
>>>>>
>>>>> I agree capability checks should be done once in Linux MD but I don't
>>>>> see why this has to be part of BCM-SBA-RAID driver patches. We need
>>>>> separate patchsets to address limitations of async_tx framework.
>>>>
>>>> Right, separate enabling before we pile on new hardware support to a
>>>> known broken framework.
>>>
>>> Linux Async Tx not broken framework. The issue is:
>>> 1. Its not complete enough
>>> 2. Its not optimized for very high through-put offload engines
>>
>> I'm not understanding your point. I'm nak'ing this change to add yet
>> more per-transaction capability checking to async_tx. I don't like the
>> DMA_HAS_FEWER_PQ_COEF flag, especially since it is equal to
>> DMA_HAS_PQ_CONTINUE. I'm not asking for all of async_tx's problems to
>> be fixed before this new hardware support, I'm simply saying we should
>> start the process of moving offload-engine capability checking to the
>> raid code.
>
> The DMA_HAS_FEWER_PQ_COEF is not equal to
> DMA_HAS_PQ_CONTINUE.

#define DMA_HAS_PQ_CONTINUE (1 << 15
#define DMA_HAS_FEWER_PQ_COEF (1 << 15)

> I will try to drop this patch and take care of unsupported PQ
> coefficients in BCM-SBA-RAID driver itself even if this means
> doing some computations in BCM-SBA-RAID driver itself.

That should be nak'd as well, please do capability detection in a
routine that is common to all raid engines.

^ permalink raw reply

* Re: [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation
From: Wols Lists @ 2017-02-09 16:06 UTC (permalink / raw)
  To: Shaohua Li, Artur Paszkiewicz; +Cc: linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <20170208053444.b4yfmahmlrjgvv7x@kernel.org>

On 08/02/17 05:34, Shaohua Li wrote:
> so it's the implementation which doesn't flush disk cache before writting new
> data to disks, right? That makes sense then. I think we should fix this soon.
> Don't think people will (or remember to) disable write-back cache before using
> ppl.

Dunno whether I'm making a fool of myself :-) but might a quick-n-dirty
workaround be to make enabling PPL also disable write-back cache? With
maybe a warning not to switch it back on?

Cheers,
Wol

^ permalink raw reply

* Re: [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation
From: Artur Paszkiewicz @ 2017-02-09 15:35 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <20170208053444.b4yfmahmlrjgvv7x@kernel.org>

On 02/08/2017 06:34 AM, Shaohua Li wrote:
> On Wed, Feb 08, 2017 at 12:58:42PM +0100, Artur Paszkiewicz wrote:
>> On 02/07/2017 10:42 PM, Shaohua Li wrote:
>>> On Mon, Jan 30, 2017 at 07:59:49PM +0100, Artur Paszkiewicz wrote:
>>>> This implements the PPL write logging functionality, using the
>>>> raid5-cache policy logic introduced in previous patches. The description
>>>> of PPL is added to the documentation. More details can be found in the
>>>> comments in raid5-ppl.c.
>>>>
>>>> Put the PPL metadata structures to md_p.h because userspace tools
>>>> (mdadm) will also need to read/write PPL.
>>>>
>>>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>>>> ---
>>>>  Documentation/admin-guide/md.rst |  53 ++++
>>>>  drivers/md/Makefile              |   2 +-
>>>>  drivers/md/raid5-cache.c         |  13 +-
>>>>  drivers/md/raid5-cache.h         |   8 +
>>>>  drivers/md/raid5-ppl.c           | 551 +++++++++++++++++++++++++++++++++++++++
>>>>  drivers/md/raid5.c               |  15 +-
>>>>  include/uapi/linux/raid/md_p.h   |  26 ++
>>>>  7 files changed, 661 insertions(+), 7 deletions(-)
>>>>  create mode 100644 drivers/md/raid5-ppl.c
>>>>
>>>> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
>>>> index e449fb5f277c..7104ef757e73 100644
>>>> --- a/Documentation/admin-guide/md.rst
>>>> +++ b/Documentation/admin-guide/md.rst
>>>> @@ -86,6 +86,9 @@ superblock can be autodetected and run at boot time.
>>>>  The kernel parameter ``raid=partitionable`` (or ``raid=part``) means
>>>>  that all auto-detected arrays are assembled as partitionable.
>>>>  
>>>> +
>>>> +.. _dirty_degraded_boot:
>>>> +
>>>>  Boot time assembly of degraded/dirty arrays
>>>>  -------------------------------------------
>>>>  
>>>> @@ -176,6 +179,56 @@ and its role in the array.
>>>>  Once started with RUN_ARRAY, uninitialized spares can be added with
>>>>  HOT_ADD_DISK.
>>>>  
>>>> +.. _ppl:
>>>> +
>>>> +Partial Parity Log
>>>> +------------------
>>> Please put this part to a separate file under Documentation/md. The admin-guide
>>> is supposed to serve as a doc for user. The Documentation/md files will server
>>> as a doc for developer.
>>
>> OK, I will move it.
>>
>>>> +Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
>>>> +addressed by PPL is that after a dirty shutdown, parity of a particular stripe
>>>> +may become inconsistent with data on other member disks. If the array is also
>>>> +in degraded state, there is no way to recalculate parity, because one of the
>>>> +disks is missing. This can lead to silent data corruption when rebuilding the
>>>> +array or using it is as degraded - data calculated from parity for array blocks
>>>> +that have not been touched by a write request during the unclean shutdown can
>>>> +be incorrect. Such condition is known as the ``RAID5 Write Hole``. Because of
>>>> +this, md by default does not allow starting a dirty degraded array, see
>>>> +:ref:`dirty_degraded_boot`.
>>>> +
>>>> +Partial parity for a write operation is the XOR of stripe data chunks not
>>>> +modified by this write. It is just enough data needed for recovering from the
>>>> +write hole. XORing partial parity with the modified chunks produces parity for
>>>> +the stripe, consistent with its state before the write operation, regardless of
>>>> +which chunk writes have completed. If one of the not modified data disks of
>>>> +this stripe is missing, this updated parity can be used to recover its
>>>> +contents. PPL recovery is also performed when starting an array after an
>>>> +unclean shutdown and all disks are available, eliminating the need to resync
>>>> +the array. Because of this, using write-intent bitmap and PPL together is not
>>>> +supported.
>>>> +
>>>> +When handling a write request PPL writes partial parity before new data and
>>>> +parity are dispatched to disks. PPL is a distributed log - it is stored on
>>>> +array member drives in the metadata area, on the parity drive of a particular
>>>> +stripe.  It does not require a dedicated journaling drive. Write performance is
>>>> +reduced by up to 30%-40% but it scales with the number of drives in the array
>>>> +and the journaling drive does not become a bottleneck or a single point of
>>>> +failure.
>>>> +
>>>> +Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
>>>> +not a true journal. It does not protect from losing in-flight data, only from
>>>> +silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
>>>> +performed for this stripe (parity is not updated). So it is possible to have
>>>> +arbitrary data in the written part of a stripe if that disk is lost. In such
>>>> +case the behavior is the same as in plain raid5.
>>>> +
>>>> +PPL is available for md version-1 metadata and external (specifically IMSM)
>>>> +metadata arrays. It can be enabled using mdadm option
>>>> +``--consistency-policy=ppl``.
>>>> +
>>>> +Currently, volatile write-back cache should be disabled on all member drives
>>>> +when using PPL. Otherwise it cannot guarantee consistency in case of power
>>>> +failure.
>>>
>>> The limitation is weird and dumb. Is this an implementation limitation which
>>> can be fixed later or a design limitation?
>>
>> This can be fixed later and I'm planning to do this. The limitation
>> results from the fact that partial parity is the xor of "old" data.
>> During recovery we xor it with "new" data and that produces valid
>> parity, assuming the "old" data has not changed after dirty shutdown.
>> But it's not necessarily true, since the "old" data might have been only
>> in cache and not on persistent storage. So fixing this will require
>> making sure the old data is on persistent media before writing PPL.
> 
> so it's the implementation which doesn't flush disk cache before writting new
> data to disks, right? That makes sense then. I think we should fix this soon.
> Don't think people will (or remember to) disable write-back cache before using
> ppl.
> 
>>>> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
>>>> new file mode 100644
>>>> index 000000000000..6bc246c80f6b
>>>> --- /dev/null
>>>> +++ b/drivers/md/raid5-ppl.c
>>>> @@ -0,0 +1,551 @@
>>>> +/*
>>>> + * Partial Parity Log for closing the RAID5 write hole
>>>> + * Copyright (c) 2017, Intel Corporation.
>>>> + *
>>>> + * This program is free software; you can redistribute it and/or modify it
>>>> + * under the terms and conditions of the GNU General Public License,
>>>> + * version 2, as published by the Free Software Foundation.
>>>> + *
>>>> + * This program is distributed in the hope it will be useful, but WITHOUT
>>>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>>>> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
>>>> + * more details.
>>>> + */
>>>> +
>>>> +#include <linux/kernel.h>
>>>> +#include <linux/blkdev.h>
>>>> +#include <linux/slab.h>
>>>> +#include <linux/crc32c.h>
>>>> +#include <linux/raid/md_p.h>
>>>> +#include "md.h"
>>>> +#include "raid5.h"
>>>> +#include "raid5-cache.h"
>>>> +
>>>> +/*
>>>> + * PPL consists of a 4KB header (struct ppl_header) and at least 128KB for
>>>> + * partial parity data. The header contains an array of entries
>>>> + * (struct ppl_header_entry) which describe the logged write requests.
>>>> + * Partial parity for the entries comes after the header, written in the same
>>>> + * sequence as the entries:
>>>> + *
>>>> + * Header
>>>> + *   entry0
>>>> + *   ...
>>>> + *   entryN
>>>> + * PP data
>>>> + *   PP for entry0
>>>> + *   ...
>>>> + *   PP for entryN
>>>> + *
>>>> + * Every entry holds a checksum of its partial parity, the header also has a
>>>> + * checksum of the header itself. Entries for full stripes writes contain no
>>>> + * partial parity, they only mark the stripes for which parity should be
>>>> + * recalculated after an unclean shutdown.
>>>> + *
>>>> + * A write request is always logged to the PPL instance stored on the parity
>>>> + * disk of the corresponding stripe. For each member disk there is one r5l_log
>>>> + * used to handle logging for this disk, independently from others. They are
>>>> + * grouped in child_logs array in struct ppl_conf, which is assigned to a
>>>> + * common parent r5l_log. This parent log serves as a proxy and is used in
>>>> + * raid5 personality code - it is assigned as _the_ log in r5conf->log.
>>>> + *
>>>> + * r5l_io_unit represents a full PPL write, meta_page contains the ppl_header.
>>>> + * PPL entries for logged stripes are added in ppl_log_stripe(). A stripe can
>>>> + * be appended to the last entry if the chunks to write are the same, otherwise
>>>> + * a new entry is added. Checksums of entries are calculated incrementally as
>>>> + * stripes containing partial parity are being added to entries.
>>>> + * ppl_submit_iounit() calculates the checksum of the header and submits a bio
>>>> + * containing the meta_page (ppl_header) and partial parity pages (sh->ppl_page)
>>>> + * for all stripes of the io_unit. When the PPL write completes, the stripes
>>>> + * associated with the io_unit are released and raid5d starts writing their data
>>>> + * and parity. When all stripes are written, the io_unit is freed and the next
>>>> + * can be submitted.
>>>> + *
>>>> + * An io_unit is used to gather stripes until it is submitted or becomes full
>>>> + * (if the maximum number of entries or size of PPL is reached). Another io_unit
>>>> + * can't be submitted until the previous has completed (PPL and stripe
>>>> + * data+parity is written). The log->running_ios list tracks all io_units of
>>>> + * a log (for a single member disk). New io_units are added to the end of the
>>>> + * list and the first io_unit is submitted, if it is not submitted already.
>>>> + * The current io_unit accepting new stripes is always the last on the list.
>>>> + */
>>>
>>> I don't like the way that io_unit and r5l_log are reused here. The code below
>>> shares very little with raid5-cache. There are a lot of fields in the data
>>> structures which are not used by ppl. It's not a good practice to share data
>>> structure like this way, so please define ppl its own data structure. The
>>> workflow between raid5 cache and ppl might be similar, but what we really need
>>> to share is the hooks to raid5 core.
>>
>> Reusing the structures seemed like a good idea at first, but now I must
>> admit you are right, especially since the structures were extended to
>> support write-back caching. One instance of r5l_log will still be
>> necessary though, to use for the hooks. Its private field will be used
>> for holding the ppl structures. Is that ok?
> 
> I'd prefer adding a 'void *log_private' in struct r5conf. raid5-cache and ppl
> can store whatever data in the log_private. And convert all callbacks to use a
> 'struct r5conf' parameter. This way raid5 cache and ppl don't need to share the
> data structure.

This way also 'struct r5l_policy *policy' will have to be moved from
r5l_log to r5conf. Maybe it will be better if instead of adding
log_private and moving this 'policy' field we add a dedicated field for
ppl in r5conf, something like 'struct ppl_conf *ppl', and eliminate
r5l_policy? From raid5 core we could just call corresponding functions
depending on which pointer is set (log or ppl). This way raid5-cache.c
won't have to be changed at all. It seems appropriate to separate
raid5-cache and ppl if they won't be sharing any data structures.

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH v2 2/5] async_tx: Handle DMA devices having support for fewer PQ coefficients
From: Anup Patel @ 2017-02-09  9:29 UTC (permalink / raw)
  To: Dan Williams
  Cc: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar, Ray Jui, Scott Branden, Jon Mason,
	Rob Rice, BCM Kernel Feedback,
	dmaengine-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Device Tree,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-crypto-u79uwXL29TY76Z2rM5mHXA, linux-raid
In-Reply-To: <CAPcyv4iFJXxvJFrUs2jtwP9GX5NcJ8LiEDHeZ5b1fwjCAToe5w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Wed, Feb 8, 2017 at 9:54 PM, Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> On Wed, Feb 8, 2017 at 12:57 AM, Anup Patel <anup.patel-dY08KVG/lbpWk0Htik3J/w@public.gmane.org> wrote:
>> On Tue, Feb 7, 2017 at 11:46 PM, Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>>> On Tue, Feb 7, 2017 at 1:02 AM, Anup Patel <anup.patel-dY08KVG/lbpWk0Htik3J/w@public.gmane.org> wrote:
>>>> On Tue, Feb 7, 2017 at 1:57 PM, Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>>>>> On Tue, Feb 7, 2017 at 12:16 AM, Anup Patel <anup.patel-dY08KVG/lbpWk0Htik3J/w@public.gmane.org> wrote:
>>>>>> The DMAENGINE framework assumes that if PQ offload is supported by a
>>>>>> DMA device then all 256 PQ coefficients are supported. This assumption
>>>>>> does not hold anymore because we now have BCM-SBA-RAID offload engine
>>>>>> which supports PQ offload with limited number of PQ coefficients.
>>>>>>
>>>>>> This patch extends async_tx APIs to handle DMA devices with support
>>>>>> for fewer PQ coefficients.
>>>>>>
>>>>>> Signed-off-by: Anup Patel <anup.patel-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
>>>>>> Reviewed-by: Scott Branden <scott.branden-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
>>>>>
>>>>> I don't like this approach. Define an interface for md to query the
>>>>> offload engine once at the beginning of time. We should not be adding
>>>>> any new extensions to async_tx.
>>>>
>>>> Even if we do capability checks in Linux MD, we still need a way
>>>> for DMAENGINE drivers to advertise number of PQ coefficients
>>>> handled by the HW.
>>>>
>>>> I agree capability checks should be done once in Linux MD but I don't
>>>> see why this has to be part of BCM-SBA-RAID driver patches. We need
>>>> separate patchsets to address limitations of async_tx framework.
>>>
>>> Right, separate enabling before we pile on new hardware support to a
>>> known broken framework.
>>
>> Linux Async Tx not broken framework. The issue is:
>> 1. Its not complete enough
>> 2. Its not optimized for very high through-put offload engines
>
> I'm not understanding your point. I'm nak'ing this change to add yet
> more per-transaction capability checking to async_tx. I don't like the
> DMA_HAS_FEWER_PQ_COEF flag, especially since it is equal to
> DMA_HAS_PQ_CONTINUE. I'm not asking for all of async_tx's problems to
> be fixed before this new hardware support, I'm simply saying we should
> start the process of moving offload-engine capability checking to the
> raid code.

The DMA_HAS_FEWER_PQ_COEF is not equal to
DMA_HAS_PQ_CONTINUE.

I will try to drop this patch and take care of unsupported PQ
coefficients in BCM-SBA-RAID driver itself even if this means
doing some computations in BCM-SBA-RAID driver itself.

Regards,
Anup
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
From: Binoy Jayan @ 2017-02-09  8:30 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Oded, Ofir, Herbert Xu, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, Linux kernel mailing list, Alasdair Kergon,
	Mike Snitzer, dm-devel, Shaohua Li, linux-raid, Rajendra,
	Milan Broz
In-Reply-To: <CAOtvUMcN8s886978vE6T=AE79KkZsLBnXsmhewnb5PB0UyAG3A@mail.gmail.com>

On 8 February 2017 at 13:02, Gilad Ben-Yossef <gilad@benyossef.com> wrote:
>
> Ran Bonnie++ on it last night  (Luks mode, plain64, Qemu Virt platform
> Arm64) and it works just fine.
>
> Tested-by: Gilad Ben-Yossef <gilad@benyossef.com>
>

Hi Gilad,

Thank you for testing it. Do you have access to a device having crypto
hardware with IV generation capability and associated drivers for let
say, aes with cbc or any another mode? I was wondering if I can customize
it to work with dm-crypt by generating IVs automatically. Please let
me know your thoughts.

Thanks,
Binoy

^ permalink raw reply

* Re: [PATCH 1/5] MD: attach data to each bio
From: Wols Lists @ 2017-02-09  0:09 UTC (permalink / raw)
  To: Guoqing Jiang, Shaohua Li, linux-raid; +Cc: khlebnikov, hch, neilb
In-Reply-To: <589ADFEC.1070902@suse.com>

On 08/02/17 09:07, Guoqing Jiang wrote:
> 
> 
> On 02/08/2017 12:55 AM, Shaohua Li wrote:
>> Currently MD is rebusing some bio fields. To remove the hack, we attach
>> extra data to each bio. Each personablity can attach extra data to the
>> bios, so we don't need to rebuse bio fields.
> 
> rebuse? Maybe it should be reuse, or abuse.
> 
imho abuse

also
c/personablity/personality/

Cheers,
Wol


^ permalink raw reply

* Re: [PATCH v3 6/9] md: add sysfs entries for PPL
From: Shaohua Li @ 2017-02-08 18:14 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <d484e0e5-345e-f35b-e10c-6727e300bfa1@intel.com>

On Wed, Feb 08, 2017 at 12:58:51PM +0100, Artur Paszkiewicz wrote:
> On 02/07/2017 10:49 PM, Shaohua Li wrote:
> > On Mon, Jan 30, 2017 at 07:59:50PM +0100, Artur Paszkiewicz wrote:
> >> Add 'consistency_policy' attribute for array. It indicates how the array
> >> maintains consistency in case of unexpected shutdown.
> >>
> >> Add 'ppl_sector' and 'ppl_size' for rdev, which describe the location
> >> and size of the PPL space on the device.
> >>
> >> These attributes are writable to allow enabling PPL for external
> >> metadata arrays and (later) to enable/disable PPL for a running array.
> >>
> >> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> >> ---
> >>  Documentation/admin-guide/md.rst |  32 ++++++++++-
> >>  drivers/md/md.c                  | 115 +++++++++++++++++++++++++++++++++++++++
> >>  2 files changed, 144 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
> >> index 7104ef757e73..2c153b3d798f 100644
> >> --- a/Documentation/admin-guide/md.rst
> >> +++ b/Documentation/admin-guide/md.rst
> >> @@ -329,14 +329,14 @@ All md devices contain:
> >>       array creation it will default to 0, though starting the array as
> >>       ``clean`` will set it much larger.
> >>  
> >> -   new_dev
> >> +  new_dev
> >>       This file can be written but not read.  The value written should
> >>       be a block device number as major:minor.  e.g. 8:0
> >>       This will cause that device to be attached to the array, if it is
> >>       available.  It will then appear at md/dev-XXX (depending on the
> >>       name of the device) and further configuration is then possible.
> >>  
> >> -   safe_mode_delay
> >> +  safe_mode_delay
> >>       When an md array has seen no write requests for a certain period
> >>       of time, it will be marked as ``clean``.  When another write
> >>       request arrives, the array is marked as ``dirty`` before the write
> >> @@ -345,7 +345,7 @@ All md devices contain:
> >>       period as a number of seconds.  The default is 200msec (0.200).
> >>       Writing a value of 0 disables safemode.
> >>  
> >> -   array_state
> >> +  array_state
> >>       This file contains a single word which describes the current
> >>       state of the array.  In many cases, the state can be set by
> >>       writing the word for the desired state, however some states
> >> @@ -454,7 +454,30 @@ All md devices contain:
> >>       once the array becomes non-degraded, and this fact has been
> >>       recorded in the metadata.
> >>  
> >> +  consistency_policy
> >> +     This indicates how the array maintains consistency in case of unexpected
> >> +     shutdown. It can be:
> >>  
> >> +     none
> >> +       Array has no redundancy information, e.g. raid0, linear.
> >> +
> >> +     resync
> >> +       Full resync is performed and all redundancy is regenerated when the
> >> +       array is started after unclean shutdown.
> >> +
> >> +     bitmap
> >> +       Resync assisted by a write-intent bitmap.
> >> +
> >> +     journal
> >> +       For raid4/5/6, journal device is used to log transactions and replay
> >> +       after unclean shutdown.
> >> +
> >> +     ppl
> >> +       For raid5 only, :ref:`ppl` is used to close the write hole and eliminate
> >> +       resync.
> >> +
> >> +     The accepted values when writing to this file are ``ppl`` and ``resync``,
> >> +     used to enable and disable PPL.
> >>  
> >>  
> >>  As component devices are added to an md array, they appear in the ``md``
> >> @@ -616,6 +639,9 @@ Each directory contains:
> >>  	adds bad blocks without acknowledging them. This is largely
> >>  	for testing.
> >>  
> >> +      ppl_sector, ppl_size
> >> +        Location and size (in sectors) of the space used for Partial Parity Log
> >> +        on this device.
> >>  
> >>  
> >>  An active md device will also contain an entry for each active device
> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> index e96f73572e23..47ab3afe87cb 100644
> >> --- a/drivers/md/md.c
> >> +++ b/drivers/md/md.c
> >> @@ -3205,6 +3205,78 @@ static ssize_t ubb_store(struct md_rdev *rdev, const char *page, size_t len)
> >>  static struct rdev_sysfs_entry rdev_unack_bad_blocks =
> >>  __ATTR(unacknowledged_bad_blocks, S_IRUGO|S_IWUSR, ubb_show, ubb_store);
> >>  
> >> +static ssize_t
> >> +ppl_sector_show(struct md_rdev *rdev, char *page)
> >> +{
> >> +	return sprintf(page, "%llu\n", (unsigned long long)rdev->ppl.sector);
> >> +}
> >> +
> >> +static ssize_t
> >> +ppl_sector_store(struct md_rdev *rdev, const char *buf, size_t len)
> >> +{
> >> +	unsigned long long sector;
> >> +
> >> +	if (kstrtoull(buf, 10, &sector) < 0)
> >> +		return -EINVAL;
> >> +	if (sector != (sector_t)sector)
> >> +	        return -EINVAL;
> >> +
> >> +	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
> >> +	    rdev->raid_disk >= 0)
> >> +		return -EBUSY;
> >> +
> >> +	if (rdev->mddev->persistent) {
> >> +		if (rdev->mddev->major_version == 0)
> >> +			return -EINVAL;
> >> +		if ((sector > rdev->sb_start &&
> >> +		     sector - rdev->sb_start > S16_MAX) ||
> >> +		    (sector < rdev->sb_start &&
> >> +		     rdev->sb_start - sector > -S16_MIN))
> >> +			return -EINVAL;
> >> +		rdev->ppl.offset = sector - rdev->sb_start;
> > 
> > Why do we allow non-external array changes the ppl log position and size?
> 
> To support enabling ppl for a running array. Mdadm should not change the
> metadata if the array is running, so we can't read the ppl position and
> size from the superblock. This is similar to adding an internal bitmap
> with mdadm --grow --bitmap=internal - mdadm writes the bitmap offset to
> "bitmap/location" in sysfs.

So this is to add ppl support in runtime, right? I think more sanity checks
should be done here to make sure the input is valid. For example the
position/size don't overlap with data, bitmap isn't enabled yet.

> >> +	} else if (!rdev->mddev->external) {
> >> +		return -EBUSY;
> >> +	}
> >> +	rdev->ppl.sector = sector;
> >> +	return len;
> >> +}
> >> +
> >> +static struct rdev_sysfs_entry rdev_ppl_sector =
> >> +__ATTR(ppl_sector, S_IRUGO|S_IWUSR, ppl_sector_show, ppl_sector_store);
> >> +
> >> +static ssize_t
> >> +ppl_size_show(struct md_rdev *rdev, char *page)
> >> +{
> >> +	return sprintf(page, "%u\n", rdev->ppl.size);
> >> +}
> >> +
> >> +static ssize_t
> >> +ppl_size_store(struct md_rdev *rdev, const char *buf, size_t len)
> >> +{
> >> +	unsigned int size;
> >> +
> >> +	if (kstrtouint(buf, 10, &size) < 0)
> >> +		return -EINVAL;
> >> +
> >> +	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
> >> +	    rdev->raid_disk >= 0)
> >> +		return -EBUSY;
> >> +
> >> +	if (rdev->mddev->persistent) {
> >> +		if (rdev->mddev->major_version == 0)
> >> +			return -EINVAL;
> >> +		if (size > U16_MAX)
> >> +			return -EINVAL;
> >> +	} else if (!rdev->mddev->external) {
> >> +		return -EBUSY;
> >> +	}
> >> +	rdev->ppl.size = size;
> >> +	return len;
> >> +}
> >> +
> >> +static struct rdev_sysfs_entry rdev_ppl_size =
> >> +__ATTR(ppl_size, S_IRUGO|S_IWUSR, ppl_size_show, ppl_size_store);
> >> +
> >>  static struct attribute *rdev_default_attrs[] = {
> >>  	&rdev_state.attr,
> >>  	&rdev_errors.attr,
> >> @@ -3215,6 +3287,8 @@ static struct attribute *rdev_default_attrs[] = {
> >>  	&rdev_recovery_start.attr,
> >>  	&rdev_bad_blocks.attr,
> >>  	&rdev_unack_bad_blocks.attr,
> >> +	&rdev_ppl_sector.attr,
> >> +	&rdev_ppl_size.attr,
> >>  	NULL,
> >>  };
> >>  static ssize_t
> >> @@ -4951,6 +5025,46 @@ static struct md_sysfs_entry md_array_size =
> >>  __ATTR(array_size, S_IRUGO|S_IWUSR, array_size_show,
> >>         array_size_store);
> >>  
> >> +static ssize_t
> >> +consistency_policy_show(struct mddev *mddev, char *page)
> >> +{
> >> +	int ret;
> >> +
> >> +	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
> >> +		ret = sprintf(page, "journal\n");
> >> +	} else if (test_bit(MD_HAS_PPL, &mddev->flags)) {
> >> +		ret = sprintf(page, "ppl\n");
> >> +	} else if (mddev->bitmap) {
> >> +		ret = sprintf(page, "bitmap\n");
> >> +	} else if (mddev->pers) {
> >> +		if (mddev->pers->sync_request)
> >> +			ret = sprintf(page, "resync\n");
> >> +		else
> >> +			ret = sprintf(page, "none\n");
> >> +	} else {
> >> +		ret = sprintf(page, "unknown\n");
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static ssize_t
> >> +consistency_policy_store(struct mddev *mddev, const char *buf, size_t len)
> >> +{
> >> +	if (mddev->pers) {
> >> +		return -EBUSY;
> >> +	} else if (mddev->external && strncmp(buf, "ppl", 3) == 0) {
> >> +		set_bit(MD_HAS_PPL, &mddev->flags);
> >> +		return len;
> > 
> > Here we should check if ppl position and size are valid before we set the flag.
> > 
> > And shouldn't we move this part to .change_consistency_policy added in patch 9?
> > This is a consistency policy change too.
> 
> Yes, but here the array is not started yet and we don't have the
> personality for which to call .change_consistency_policy. We are not
> changing the consistency policy here, but rather setting it initially
> for an external metadata array.

Ok, makes sense.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v2 2/5] async_tx: Handle DMA devices having support for fewer PQ coefficients
From: Dan Williams @ 2017-02-08 16:24 UTC (permalink / raw)
  To: Anup Patel
  Cc: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar, Ray Jui, Scott Branden, Jon Mason,
	Rob Rice, BCM Kernel Feedback, dmaengine@vger.kernel.org,
	Device Tree, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-crypto, linux-raid
In-Reply-To: <CAALAos9ovab4x=waRgM7G2adRC7syHMkY0Rhf-UPB_+h4w3yNQ@mail.gmail.com>

On Wed, Feb 8, 2017 at 12:57 AM, Anup Patel <anup.patel@broadcom.com> wrote:
> On Tue, Feb 7, 2017 at 11:46 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> On Tue, Feb 7, 2017 at 1:02 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>>> On Tue, Feb 7, 2017 at 1:57 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>>> On Tue, Feb 7, 2017 at 12:16 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>>>>> The DMAENGINE framework assumes that if PQ offload is supported by a
>>>>> DMA device then all 256 PQ coefficients are supported. This assumption
>>>>> does not hold anymore because we now have BCM-SBA-RAID offload engine
>>>>> which supports PQ offload with limited number of PQ coefficients.
>>>>>
>>>>> This patch extends async_tx APIs to handle DMA devices with support
>>>>> for fewer PQ coefficients.
>>>>>
>>>>> Signed-off-by: Anup Patel <anup.patel@broadcom.com>
>>>>> Reviewed-by: Scott Branden <scott.branden@broadcom.com>
>>>>
>>>> I don't like this approach. Define an interface for md to query the
>>>> offload engine once at the beginning of time. We should not be adding
>>>> any new extensions to async_tx.
>>>
>>> Even if we do capability checks in Linux MD, we still need a way
>>> for DMAENGINE drivers to advertise number of PQ coefficients
>>> handled by the HW.
>>>
>>> I agree capability checks should be done once in Linux MD but I don't
>>> see why this has to be part of BCM-SBA-RAID driver patches. We need
>>> separate patchsets to address limitations of async_tx framework.
>>
>> Right, separate enabling before we pile on new hardware support to a
>> known broken framework.
>
> Linux Async Tx not broken framework. The issue is:
> 1. Its not complete enough
> 2. Its not optimized for very high through-put offload engines

I'm not understanding your point. I'm nak'ing this change to add yet
more per-transaction capability checking to async_tx. I don't like the
DMA_HAS_FEWER_PQ_COEF flag, especially since it is equal to
DMA_HAS_PQ_CONTINUE. I'm not asking for all of async_tx's problems to
be fixed before this new hardware support, I'm simply saying we should
start the process of moving offload-engine capability checking to the
raid code.

^ permalink raw reply

* Re: [PATCH v3 6/9] md: add sysfs entries for PPL
From: Artur Paszkiewicz @ 2017-02-08 11:58 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <20170207214914.dfin7ma73lrqz2y2@kernel.org>

On 02/07/2017 10:49 PM, Shaohua Li wrote:
> On Mon, Jan 30, 2017 at 07:59:50PM +0100, Artur Paszkiewicz wrote:
>> Add 'consistency_policy' attribute for array. It indicates how the array
>> maintains consistency in case of unexpected shutdown.
>>
>> Add 'ppl_sector' and 'ppl_size' for rdev, which describe the location
>> and size of the PPL space on the device.
>>
>> These attributes are writable to allow enabling PPL for external
>> metadata arrays and (later) to enable/disable PPL for a running array.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  Documentation/admin-guide/md.rst |  32 ++++++++++-
>>  drivers/md/md.c                  | 115 +++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 144 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
>> index 7104ef757e73..2c153b3d798f 100644
>> --- a/Documentation/admin-guide/md.rst
>> +++ b/Documentation/admin-guide/md.rst
>> @@ -329,14 +329,14 @@ All md devices contain:
>>       array creation it will default to 0, though starting the array as
>>       ``clean`` will set it much larger.
>>  
>> -   new_dev
>> +  new_dev
>>       This file can be written but not read.  The value written should
>>       be a block device number as major:minor.  e.g. 8:0
>>       This will cause that device to be attached to the array, if it is
>>       available.  It will then appear at md/dev-XXX (depending on the
>>       name of the device) and further configuration is then possible.
>>  
>> -   safe_mode_delay
>> +  safe_mode_delay
>>       When an md array has seen no write requests for a certain period
>>       of time, it will be marked as ``clean``.  When another write
>>       request arrives, the array is marked as ``dirty`` before the write
>> @@ -345,7 +345,7 @@ All md devices contain:
>>       period as a number of seconds.  The default is 200msec (0.200).
>>       Writing a value of 0 disables safemode.
>>  
>> -   array_state
>> +  array_state
>>       This file contains a single word which describes the current
>>       state of the array.  In many cases, the state can be set by
>>       writing the word for the desired state, however some states
>> @@ -454,7 +454,30 @@ All md devices contain:
>>       once the array becomes non-degraded, and this fact has been
>>       recorded in the metadata.
>>  
>> +  consistency_policy
>> +     This indicates how the array maintains consistency in case of unexpected
>> +     shutdown. It can be:
>>  
>> +     none
>> +       Array has no redundancy information, e.g. raid0, linear.
>> +
>> +     resync
>> +       Full resync is performed and all redundancy is regenerated when the
>> +       array is started after unclean shutdown.
>> +
>> +     bitmap
>> +       Resync assisted by a write-intent bitmap.
>> +
>> +     journal
>> +       For raid4/5/6, journal device is used to log transactions and replay
>> +       after unclean shutdown.
>> +
>> +     ppl
>> +       For raid5 only, :ref:`ppl` is used to close the write hole and eliminate
>> +       resync.
>> +
>> +     The accepted values when writing to this file are ``ppl`` and ``resync``,
>> +     used to enable and disable PPL.
>>  
>>  
>>  As component devices are added to an md array, they appear in the ``md``
>> @@ -616,6 +639,9 @@ Each directory contains:
>>  	adds bad blocks without acknowledging them. This is largely
>>  	for testing.
>>  
>> +      ppl_sector, ppl_size
>> +        Location and size (in sectors) of the space used for Partial Parity Log
>> +        on this device.
>>  
>>  
>>  An active md device will also contain an entry for each active device
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index e96f73572e23..47ab3afe87cb 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -3205,6 +3205,78 @@ static ssize_t ubb_store(struct md_rdev *rdev, const char *page, size_t len)
>>  static struct rdev_sysfs_entry rdev_unack_bad_blocks =
>>  __ATTR(unacknowledged_bad_blocks, S_IRUGO|S_IWUSR, ubb_show, ubb_store);
>>  
>> +static ssize_t
>> +ppl_sector_show(struct md_rdev *rdev, char *page)
>> +{
>> +	return sprintf(page, "%llu\n", (unsigned long long)rdev->ppl.sector);
>> +}
>> +
>> +static ssize_t
>> +ppl_sector_store(struct md_rdev *rdev, const char *buf, size_t len)
>> +{
>> +	unsigned long long sector;
>> +
>> +	if (kstrtoull(buf, 10, &sector) < 0)
>> +		return -EINVAL;
>> +	if (sector != (sector_t)sector)
>> +	        return -EINVAL;
>> +
>> +	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
>> +	    rdev->raid_disk >= 0)
>> +		return -EBUSY;
>> +
>> +	if (rdev->mddev->persistent) {
>> +		if (rdev->mddev->major_version == 0)
>> +			return -EINVAL;
>> +		if ((sector > rdev->sb_start &&
>> +		     sector - rdev->sb_start > S16_MAX) ||
>> +		    (sector < rdev->sb_start &&
>> +		     rdev->sb_start - sector > -S16_MIN))
>> +			return -EINVAL;
>> +		rdev->ppl.offset = sector - rdev->sb_start;
> 
> Why do we allow non-external array changes the ppl log position and size?

To support enabling ppl for a running array. Mdadm should not change the
metadata if the array is running, so we can't read the ppl position and
size from the superblock. This is similar to adding an internal bitmap
with mdadm --grow --bitmap=internal - mdadm writes the bitmap offset to
"bitmap/location" in sysfs.

>> +	} else if (!rdev->mddev->external) {
>> +		return -EBUSY;
>> +	}
>> +	rdev->ppl.sector = sector;
>> +	return len;
>> +}
>> +
>> +static struct rdev_sysfs_entry rdev_ppl_sector =
>> +__ATTR(ppl_sector, S_IRUGO|S_IWUSR, ppl_sector_show, ppl_sector_store);
>> +
>> +static ssize_t
>> +ppl_size_show(struct md_rdev *rdev, char *page)
>> +{
>> +	return sprintf(page, "%u\n", rdev->ppl.size);
>> +}
>> +
>> +static ssize_t
>> +ppl_size_store(struct md_rdev *rdev, const char *buf, size_t len)
>> +{
>> +	unsigned int size;
>> +
>> +	if (kstrtouint(buf, 10, &size) < 0)
>> +		return -EINVAL;
>> +
>> +	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
>> +	    rdev->raid_disk >= 0)
>> +		return -EBUSY;
>> +
>> +	if (rdev->mddev->persistent) {
>> +		if (rdev->mddev->major_version == 0)
>> +			return -EINVAL;
>> +		if (size > U16_MAX)
>> +			return -EINVAL;
>> +	} else if (!rdev->mddev->external) {
>> +		return -EBUSY;
>> +	}
>> +	rdev->ppl.size = size;
>> +	return len;
>> +}
>> +
>> +static struct rdev_sysfs_entry rdev_ppl_size =
>> +__ATTR(ppl_size, S_IRUGO|S_IWUSR, ppl_size_show, ppl_size_store);
>> +
>>  static struct attribute *rdev_default_attrs[] = {
>>  	&rdev_state.attr,
>>  	&rdev_errors.attr,
>> @@ -3215,6 +3287,8 @@ static struct attribute *rdev_default_attrs[] = {
>>  	&rdev_recovery_start.attr,
>>  	&rdev_bad_blocks.attr,
>>  	&rdev_unack_bad_blocks.attr,
>> +	&rdev_ppl_sector.attr,
>> +	&rdev_ppl_size.attr,
>>  	NULL,
>>  };
>>  static ssize_t
>> @@ -4951,6 +5025,46 @@ static struct md_sysfs_entry md_array_size =
>>  __ATTR(array_size, S_IRUGO|S_IWUSR, array_size_show,
>>         array_size_store);
>>  
>> +static ssize_t
>> +consistency_policy_show(struct mddev *mddev, char *page)
>> +{
>> +	int ret;
>> +
>> +	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
>> +		ret = sprintf(page, "journal\n");
>> +	} else if (test_bit(MD_HAS_PPL, &mddev->flags)) {
>> +		ret = sprintf(page, "ppl\n");
>> +	} else if (mddev->bitmap) {
>> +		ret = sprintf(page, "bitmap\n");
>> +	} else if (mddev->pers) {
>> +		if (mddev->pers->sync_request)
>> +			ret = sprintf(page, "resync\n");
>> +		else
>> +			ret = sprintf(page, "none\n");
>> +	} else {
>> +		ret = sprintf(page, "unknown\n");
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static ssize_t
>> +consistency_policy_store(struct mddev *mddev, const char *buf, size_t len)
>> +{
>> +	if (mddev->pers) {
>> +		return -EBUSY;
>> +	} else if (mddev->external && strncmp(buf, "ppl", 3) == 0) {
>> +		set_bit(MD_HAS_PPL, &mddev->flags);
>> +		return len;
> 
> Here we should check if ppl position and size are valid before we set the flag.
> 
> And shouldn't we move this part to .change_consistency_policy added in patch 9?
> This is a consistency policy change too.

Yes, but here the array is not started yet and we don't have the
personality for which to call .change_consistency_policy. We are not
changing the consistency policy here, but rather setting it initially
for an external metadata array.

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation
From: Artur Paszkiewicz @ 2017-02-08 11:58 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <20170207214225.3jenvleiz3tldzlx@kernel.org>

On 02/07/2017 10:42 PM, Shaohua Li wrote:
> On Mon, Jan 30, 2017 at 07:59:49PM +0100, Artur Paszkiewicz wrote:
>> This implements the PPL write logging functionality, using the
>> raid5-cache policy logic introduced in previous patches. The description
>> of PPL is added to the documentation. More details can be found in the
>> comments in raid5-ppl.c.
>>
>> Put the PPL metadata structures to md_p.h because userspace tools
>> (mdadm) will also need to read/write PPL.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  Documentation/admin-guide/md.rst |  53 ++++
>>  drivers/md/Makefile              |   2 +-
>>  drivers/md/raid5-cache.c         |  13 +-
>>  drivers/md/raid5-cache.h         |   8 +
>>  drivers/md/raid5-ppl.c           | 551 +++++++++++++++++++++++++++++++++++++++
>>  drivers/md/raid5.c               |  15 +-
>>  include/uapi/linux/raid/md_p.h   |  26 ++
>>  7 files changed, 661 insertions(+), 7 deletions(-)
>>  create mode 100644 drivers/md/raid5-ppl.c
>>
>> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
>> index e449fb5f277c..7104ef757e73 100644
>> --- a/Documentation/admin-guide/md.rst
>> +++ b/Documentation/admin-guide/md.rst
>> @@ -86,6 +86,9 @@ superblock can be autodetected and run at boot time.
>>  The kernel parameter ``raid=partitionable`` (or ``raid=part``) means
>>  that all auto-detected arrays are assembled as partitionable.
>>  
>> +
>> +.. _dirty_degraded_boot:
>> +
>>  Boot time assembly of degraded/dirty arrays
>>  -------------------------------------------
>>  
>> @@ -176,6 +179,56 @@ and its role in the array.
>>  Once started with RUN_ARRAY, uninitialized spares can be added with
>>  HOT_ADD_DISK.
>>  
>> +.. _ppl:
>> +
>> +Partial Parity Log
>> +------------------
> Please put this part to a separate file under Documentation/md. The admin-guide
> is supposed to serve as a doc for user. The Documentation/md files will server
> as a doc for developer.

OK, I will move it.

>> +Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
>> +addressed by PPL is that after a dirty shutdown, parity of a particular stripe
>> +may become inconsistent with data on other member disks. If the array is also
>> +in degraded state, there is no way to recalculate parity, because one of the
>> +disks is missing. This can lead to silent data corruption when rebuilding the
>> +array or using it is as degraded - data calculated from parity for array blocks
>> +that have not been touched by a write request during the unclean shutdown can
>> +be incorrect. Such condition is known as the ``RAID5 Write Hole``. Because of
>> +this, md by default does not allow starting a dirty degraded array, see
>> +:ref:`dirty_degraded_boot`.
>> +
>> +Partial parity for a write operation is the XOR of stripe data chunks not
>> +modified by this write. It is just enough data needed for recovering from the
>> +write hole. XORing partial parity with the modified chunks produces parity for
>> +the stripe, consistent with its state before the write operation, regardless of
>> +which chunk writes have completed. If one of the not modified data disks of
>> +this stripe is missing, this updated parity can be used to recover its
>> +contents. PPL recovery is also performed when starting an array after an
>> +unclean shutdown and all disks are available, eliminating the need to resync
>> +the array. Because of this, using write-intent bitmap and PPL together is not
>> +supported.
>> +
>> +When handling a write request PPL writes partial parity before new data and
>> +parity are dispatched to disks. PPL is a distributed log - it is stored on
>> +array member drives in the metadata area, on the parity drive of a particular
>> +stripe.  It does not require a dedicated journaling drive. Write performance is
>> +reduced by up to 30%-40% but it scales with the number of drives in the array
>> +and the journaling drive does not become a bottleneck or a single point of
>> +failure.
>> +
>> +Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
>> +not a true journal. It does not protect from losing in-flight data, only from
>> +silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
>> +performed for this stripe (parity is not updated). So it is possible to have
>> +arbitrary data in the written part of a stripe if that disk is lost. In such
>> +case the behavior is the same as in plain raid5.
>> +
>> +PPL is available for md version-1 metadata and external (specifically IMSM)
>> +metadata arrays. It can be enabled using mdadm option
>> +``--consistency-policy=ppl``.
>> +
>> +Currently, volatile write-back cache should be disabled on all member drives
>> +when using PPL. Otherwise it cannot guarantee consistency in case of power
>> +failure.
> 
> The limitation is weird and dumb. Is this an implementation limitation which
> can be fixed later or a design limitation?

This can be fixed later and I'm planning to do this. The limitation
results from the fact that partial parity is the xor of "old" data.
During recovery we xor it with "new" data and that produces valid
parity, assuming the "old" data has not changed after dirty shutdown.
But it's not necessarily true, since the "old" data might have been only
in cache and not on persistent storage. So fixing this will require
making sure the old data is on persistent media before writing PPL.
 
>> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
>> new file mode 100644
>> index 000000000000..6bc246c80f6b
>> --- /dev/null
>> +++ b/drivers/md/raid5-ppl.c
>> @@ -0,0 +1,551 @@
>> +/*
>> + * Partial Parity Log for closing the RAID5 write hole
>> + * Copyright (c) 2017, Intel Corporation.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms and conditions of the GNU General Public License,
>> + * version 2, as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope it will be useful, but WITHOUT
>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
>> + * more details.
>> + */
>> +
>> +#include <linux/kernel.h>
>> +#include <linux/blkdev.h>
>> +#include <linux/slab.h>
>> +#include <linux/crc32c.h>
>> +#include <linux/raid/md_p.h>
>> +#include "md.h"
>> +#include "raid5.h"
>> +#include "raid5-cache.h"
>> +
>> +/*
>> + * PPL consists of a 4KB header (struct ppl_header) and at least 128KB for
>> + * partial parity data. The header contains an array of entries
>> + * (struct ppl_header_entry) which describe the logged write requests.
>> + * Partial parity for the entries comes after the header, written in the same
>> + * sequence as the entries:
>> + *
>> + * Header
>> + *   entry0
>> + *   ...
>> + *   entryN
>> + * PP data
>> + *   PP for entry0
>> + *   ...
>> + *   PP for entryN
>> + *
>> + * Every entry holds a checksum of its partial parity, the header also has a
>> + * checksum of the header itself. Entries for full stripes writes contain no
>> + * partial parity, they only mark the stripes for which parity should be
>> + * recalculated after an unclean shutdown.
>> + *
>> + * A write request is always logged to the PPL instance stored on the parity
>> + * disk of the corresponding stripe. For each member disk there is one r5l_log
>> + * used to handle logging for this disk, independently from others. They are
>> + * grouped in child_logs array in struct ppl_conf, which is assigned to a
>> + * common parent r5l_log. This parent log serves as a proxy and is used in
>> + * raid5 personality code - it is assigned as _the_ log in r5conf->log.
>> + *
>> + * r5l_io_unit represents a full PPL write, meta_page contains the ppl_header.
>> + * PPL entries for logged stripes are added in ppl_log_stripe(). A stripe can
>> + * be appended to the last entry if the chunks to write are the same, otherwise
>> + * a new entry is added. Checksums of entries are calculated incrementally as
>> + * stripes containing partial parity are being added to entries.
>> + * ppl_submit_iounit() calculates the checksum of the header and submits a bio
>> + * containing the meta_page (ppl_header) and partial parity pages (sh->ppl_page)
>> + * for all stripes of the io_unit. When the PPL write completes, the stripes
>> + * associated with the io_unit are released and raid5d starts writing their data
>> + * and parity. When all stripes are written, the io_unit is freed and the next
>> + * can be submitted.
>> + *
>> + * An io_unit is used to gather stripes until it is submitted or becomes full
>> + * (if the maximum number of entries or size of PPL is reached). Another io_unit
>> + * can't be submitted until the previous has completed (PPL and stripe
>> + * data+parity is written). The log->running_ios list tracks all io_units of
>> + * a log (for a single member disk). New io_units are added to the end of the
>> + * list and the first io_unit is submitted, if it is not submitted already.
>> + * The current io_unit accepting new stripes is always the last on the list.
>> + */
> 
> I don't like the way that io_unit and r5l_log are reused here. The code below
> shares very little with raid5-cache. There are a lot of fields in the data
> structures which are not used by ppl. It's not a good practice to share data
> structure like this way, so please define ppl its own data structure. The
> workflow between raid5 cache and ppl might be similar, but what we really need
> to share is the hooks to raid5 core.

Reusing the structures seemed like a good idea at first, but now I must
admit you are right, especially since the structures were extended to
support write-back caching. One instance of r5l_log will still be
necessary though, to use for the hooks. Its private field will be used
for holding the ppl structures. Is that ok? 

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH v3 4/9] raid5: calculate partial parity for a stripe
From: Artur Paszkiewicz @ 2017-02-08 11:58 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <20170207212528.m6u47y3f2jot2kng@kernel.org>

On 02/07/2017 10:25 PM, Shaohua Li wrote:
> On Mon, Jan 30, 2017 at 07:59:48PM +0100, Artur Paszkiewicz wrote:
>> Attach a page for holding the partial parity data to stripe_head.
>> Allocate it only if mddev has the MD_HAS_PPL flag set.
>>
>> Partial parity is the xor of not modified data chunks of a stripe and is
>> calculated as follows:
>>
>> - reconstruct-write case:
>>   xor data from all not updated disks in a stripe
>>
>> - read-modify-write case:
>>   xor old data and parity from all updated disks in a stripe
>>
>> Implement it using the async_tx API and integrate into raid_run_ops().
>> It must be called when we still have access to old data, so do it when
>> STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
>> stored into sh->ppl_page.
>>
>> Partial parity is not meaningful for full stripe write and is not stored
>> in the log or used for recovery, so don't attempt to calculate it when
>> stripe has STRIPE_FULL_WRITE.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  drivers/md/raid5.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  drivers/md/raid5.h |  2 ++
>>  2 files changed, 100 insertions(+)
>>
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index d1cba941951e..e1e238da32ba 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -466,6 +466,11 @@ static void shrink_buffers(struct stripe_head *sh)
>>  		sh->dev[i].page = NULL;
>>  		put_page(p);
>>  	}
>> +
>> +	if (sh->ppl_page) {
>> +		put_page(sh->ppl_page);
>> +		sh->ppl_page = NULL;
>> +	}
>>  }
>>  
>>  static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
>> @@ -482,6 +487,13 @@ static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
>>  		sh->dev[i].page = page;
>>  		sh->dev[i].orig_page = page;
>>  	}
>> +
>> +	if (test_bit(MD_HAS_PPL, &sh->raid_conf->mddev->flags)) {
> 
> I think the test should be something like:
> 	if (raid5_ppl_enabled())
> 
> Having the feature doesn't mean the feature is enabled.

This flag is set iff PPL is enabled, so this function would only check
the flag anyway. But I can add it to improve readability.

>>  	pr_debug("%s: stripe %llu locked: %d ops_request: %lx\n",
>>  		__func__, (unsigned long long)sh->sector,
>>  		s->locked, s->ops_request);
>> @@ -3105,6 +3175,34 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
>>  	if (*bip && (*bip)->bi_iter.bi_sector < bio_end_sector(bi))
>>  		goto overlap;
>>  
>> +	if (forwrite && test_bit(MD_HAS_PPL, &conf->mddev->flags)) {
>> +		/*
>> +		 * With PPL only writes to consecutive data chunks within a
>> +		 * stripe are allowed. Not really an overlap, but
>> +		 * wait_for_overlap can be used to handle this.
>> +		 */
> 
> Please describe why the data must be consecutive.

In PPL metadata we don't store information about which drives we write
to, only the modified data range. For a single stripe_head we can only
have one PPL entry at a time, which describes one data range. This can
be improved in the future, but will require extending the PPL metadata.

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH v3 3/9] md: superblock changes for PPL
From: Artur Paszkiewicz @ 2017-02-08 11:58 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <20170207212011.5aulcj3znldl6dry@kernel.org>

On 02/07/2017 10:20 PM, Shaohua Li wrote:
> On Mon, Jan 30, 2017 at 07:59:47PM +0100, Artur Paszkiewicz wrote:
>> Include information about PPL location and size into mdp_superblock_1
>> and copy it to/from rdev. Because PPL is mutually exclusive with bitmap,
>> put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for
>> 'feature_map', analogically to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL
>> to mddev->flags to indicate that PPL is enabled on an array.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  drivers/md/md.c                | 15 +++++++++++++++
>>  drivers/md/md.h                |  8 ++++++++
>>  drivers/md/raid0.c             |  3 ++-
>>  drivers/md/raid1.c             |  3 ++-
>>  include/uapi/linux/raid/md_p.h | 18 ++++++++++++++----
>>  5 files changed, 41 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 85ac98417a08..e96f73572e23 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -1566,6 +1566,12 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
>>  	} else if (sb->bblog_offset != 0)
>>  		rdev->badblocks.shift = 0;
>>  
>> +	if (le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL) {
>> +		rdev->ppl.offset = (__s16)le16_to_cpu(sb->ppl.offset);
>> +		rdev->ppl.size = le16_to_cpu(sb->ppl.size);
>> +		rdev->ppl.sector = rdev->sb_start + rdev->ppl.offset;
>> +	}
>> +
>>  	if (!refdev) {
>>  		ret = 1;
>>  	} else {
>> @@ -1678,6 +1684,9 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
>>  
>>  		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL)
>>  			set_bit(MD_HAS_JOURNAL, &mddev->flags);
>> +
>> +		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL)
>> +			set_bit(MD_HAS_PPL, &mddev->flags);
> 
> I think we should check wrong configuration and bail out. For example, both
> MD_FEATURE_PPL and MD_FEATURE_BITMAP_OFFSET set, or both MD_FEATURE_PPL and
> MD_FEATURE_JOURNAL set.

Makes sense. I will add it.

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH 1/5] MD: attach data to each bio
From: Guoqing Jiang @ 2017-02-08  9:07 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: khlebnikov, hch, neilb
In-Reply-To: <79515b1372fa1a1813c00ef0d7e0613a4512183d.1486485935.git.shli@fb.com>



On 02/08/2017 12:55 AM, Shaohua Li wrote:
> Currently MD is rebusing some bio fields. To remove the hack, we attach
> extra data to each bio. Each personablity can attach extra data to the
> bios, so we don't need to rebuse bio fields.

rebuse? Maybe it should be reuse, or abuse.

[snip]

> +static inline void *md_get_per_bio_data(struct bio *bio)
> +{
> +	return ((struct md_per_bio_data *)bio->bi_private) + 1;
> +}
> +
> +extern void md_bio_attach_data(struct mddev *mddev, struct bio *bio);

Why not make raid1_bio_ref_ptr and raid10_bio_ref_ptr as macro
like raid5_get_bio_data? Or add inline before the two funcs since
both of them are called frequently.

Thanks,
Guoqing

^ permalink raw reply

* Re: [PATCH v2 2/5] async_tx: Handle DMA devices having support for fewer PQ coefficients
From: Anup Patel @ 2017-02-08  8:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar, Ray Jui, Scott Branden, Jon Mason,
	Rob Rice, BCM Kernel Feedback, dmaengine@vger.kernel.org,
	Device Tree, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-crypto, linux-raid
In-Reply-To: <CAPcyv4hK_1f8bryq6smPQy7vTnP+QEzmT9wnAEE1xpxq7EAnjQ@mail.gmail.com>

On Tue, Feb 7, 2017 at 11:46 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Tue, Feb 7, 2017 at 1:02 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>> On Tue, Feb 7, 2017 at 1:57 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>> On Tue, Feb 7, 2017 at 12:16 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>>>> The DMAENGINE framework assumes that if PQ offload is supported by a
>>>> DMA device then all 256 PQ coefficients are supported. This assumption
>>>> does not hold anymore because we now have BCM-SBA-RAID offload engine
>>>> which supports PQ offload with limited number of PQ coefficients.
>>>>
>>>> This patch extends async_tx APIs to handle DMA devices with support
>>>> for fewer PQ coefficients.
>>>>
>>>> Signed-off-by: Anup Patel <anup.patel@broadcom.com>
>>>> Reviewed-by: Scott Branden <scott.branden@broadcom.com>
>>>
>>> I don't like this approach. Define an interface for md to query the
>>> offload engine once at the beginning of time. We should not be adding
>>> any new extensions to async_tx.
>>
>> Even if we do capability checks in Linux MD, we still need a way
>> for DMAENGINE drivers to advertise number of PQ coefficients
>> handled by the HW.
>>
>> I agree capability checks should be done once in Linux MD but I don't
>> see why this has to be part of BCM-SBA-RAID driver patches. We need
>> separate patchsets to address limitations of async_tx framework.
>
> Right, separate enabling before we pile on new hardware support to a
> known broken framework.

Linux Async Tx not broken framework. The issue is:
1. Its not complete enough
2. Its not optimized for very high through-put offload engines

Regards,
Anup

^ permalink raw reply

* Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
From: Gilad Ben-Yossef @ 2017-02-08  7:32 UTC (permalink / raw)
  To: Binoy Jayan
  Cc: Oded, Ofir, Herbert Xu, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, Linux kernel mailing list, Alasdair Kergon,
	Mike Snitzer, dm-devel, Shaohua Li, linux-raid, Rajendra,
	Milan Broz
In-Reply-To: <1486463731-6224-1-git-send-email-binoy.jayan@linaro.org>

On Tue, Feb 7, 2017 at 12:35 PM, Binoy Jayan <binoy.jayan@linaro.org> wrote:
> ===============================================================================
> dm-crypt optimization for larger block sizes
> ===============================================================================
>
> Currently, the iv generation algorithms are implemented in dm-crypt.c. The goal
> is to move these algorithms from the dm layer to the kernel crypto layer by
> implementing them as template ciphers so they can be used in relation with
> algorithms like aes, and with multiple modes like cbc, ecb etc. As part of this
> patchset, the iv-generation code is moved from the dm layer to the crypto layer
> and adapt the dm-layer to send a whole 'bio' (as defined in the block layer)
> at a time. Each bio contains the in memory representation of physically
> contiguous disk blocks. Since the bio itself may not be contiguous in main
> memory, the dm layer sets up a chained scatterlist of these blocks split into
> physically contiguous segments in memory so that DMA can be performed.
...
> Binoy Jayan (1):
>   crypto: Add IV generation algorithms
>
>  drivers/md/dm-crypt.c  | 1894 ++++++++++++++++++++++++++++++++++--------------
>  include/crypto/geniv.h |   47 ++
>  2 files changed, 1402 insertions(+), 539 deletions(-)
>  create mode 100644 include/crypto/geniv.h

Ran Bonnie++ on it last night  (Luks mode, plain64, Qemu Virt platform
Arm64) and it works just fine.

Tested-by: Gilad Ben-Yossef <gilad@benyossef.com>

Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

^ permalink raw reply

* Re: [PATCH v2 2/5] async_tx: Handle DMA devices having support for fewer PQ coefficients
From: Anup Patel @ 2017-02-08  6:37 UTC (permalink / raw)
  To: Vinod Koul
  Cc: Dan Williams, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar, Ray Jui, Scott Branden, Jon Mason,
	Rob Rice, BCM Kernel Feedback, dmaengine@vger.kernel.org,
	Device Tree, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-crypto, linux-raid
In-Reply-To: <20170207164207.GI19244@localhost>

On Tue, Feb 7, 2017 at 10:12 PM, Vinod Koul <vinod.koul@intel.com> wrote:
> On Tue, Feb 07, 2017 at 02:32:15PM +0530, Anup Patel wrote:
>> On Tue, Feb 7, 2017 at 1:57 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> > On Tue, Feb 7, 2017 at 12:16 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>> >> The DMAENGINE framework assumes that if PQ offload is supported by a
>> >> DMA device then all 256 PQ coefficients are supported. This assumption
>> >> does not hold anymore because we now have BCM-SBA-RAID offload engine
>> >> which supports PQ offload with limited number of PQ coefficients.
>> >>
>> >> This patch extends async_tx APIs to handle DMA devices with support
>> >> for fewer PQ coefficients.
>> >>
>> >> Signed-off-by: Anup Patel <anup.patel@broadcom.com>
>> >> Reviewed-by: Scott Branden <scott.branden@broadcom.com>
>> >
>> > I don't like this approach. Define an interface for md to query the
>> > offload engine once at the beginning of time. We should not be adding
>> > any new extensions to async_tx.
>>
>> Even if we do capability checks in Linux MD, we still need a way
>> for DMAENGINE drivers to advertise number of PQ coefficients
>> handled by the HW.
>
> If the question is only for advertising caps, then why not do as done
> for dma_get_slave_caps(). you can add dma_get_pq_caps() so that clients (md)
> in this case would know the HW capability.

We have large number of possible capabilities for
DMA slave such as src_addr_widths, dst_addr_widths,
directions, max_burst, residue_granularity, and
descriptor_resue.

The possible capabilities of PQ offload are:
1. Number of PQ sources handled by PQ offload
(Represented by "max_pq" member of "struct dma_device")
2. Number of PQ coefficients handled by PQ offload

The above two PQ capabilities are good enough for
current PQ HW and future PQ HW so we just need a
way to specify number of PQ coefficients.

Till now all of the PQ HW always supported all 256
PQ coefficients so we never felt the need of capability
to specify PQ coefficients. The BCM-SBA-RAID is the
only HW (as far as I know) which does not support all
256 PQ coefficients.

Currently, DMAENGINE drivers use dma_set_maxpq() to
specify number of PQ sources handled by PQ HW and
Linux Async Tx uses dma_maxpq() to get number of
PQ sources.

On similar lines, we added dma_set_maxpqcoef() to
specify number of PQ coefficients and Linux Async Tx
uses dma_maxpqcoef() to get number of PQ coefficients.
If DMAENGINE driver does not specify PQ coefficients
then dma_maxpqcoef() will return 256 assuming all
PQ coefficients are supported. This approach is
backward compatible to existing DMAENGINE APIs
and will not break existing DMAENGINE drivers.

If we add dma_get_pq_caps() similar to the
dma_get_slave_caps() for PQ capabilities then we
will have to use this new method for both of the above
PQ capabilities and we have to change all DMAENGINE
drivers to use new method of specifying PQ capabilities.
I think this is too intrusive and bit overkill because its
very very unlikely to see anymore additions to
PQ capabilities.

Regards,
Anup

^ permalink raw reply

* Re: [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation
From: Shaohua Li @ 2017-02-08  5:34 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <43b52cb7-b9d6-f615-d7c0-d78fdbb9a0cd@intel.com>

On Wed, Feb 08, 2017 at 12:58:42PM +0100, Artur Paszkiewicz wrote:
> On 02/07/2017 10:42 PM, Shaohua Li wrote:
> > On Mon, Jan 30, 2017 at 07:59:49PM +0100, Artur Paszkiewicz wrote:
> >> This implements the PPL write logging functionality, using the
> >> raid5-cache policy logic introduced in previous patches. The description
> >> of PPL is added to the documentation. More details can be found in the
> >> comments in raid5-ppl.c.
> >>
> >> Put the PPL metadata structures to md_p.h because userspace tools
> >> (mdadm) will also need to read/write PPL.
> >>
> >> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> >> ---
> >>  Documentation/admin-guide/md.rst |  53 ++++
> >>  drivers/md/Makefile              |   2 +-
> >>  drivers/md/raid5-cache.c         |  13 +-
> >>  drivers/md/raid5-cache.h         |   8 +
> >>  drivers/md/raid5-ppl.c           | 551 +++++++++++++++++++++++++++++++++++++++
> >>  drivers/md/raid5.c               |  15 +-
> >>  include/uapi/linux/raid/md_p.h   |  26 ++
> >>  7 files changed, 661 insertions(+), 7 deletions(-)
> >>  create mode 100644 drivers/md/raid5-ppl.c
> >>
> >> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
> >> index e449fb5f277c..7104ef757e73 100644
> >> --- a/Documentation/admin-guide/md.rst
> >> +++ b/Documentation/admin-guide/md.rst
> >> @@ -86,6 +86,9 @@ superblock can be autodetected and run at boot time.
> >>  The kernel parameter ``raid=partitionable`` (or ``raid=part``) means
> >>  that all auto-detected arrays are assembled as partitionable.
> >>  
> >> +
> >> +.. _dirty_degraded_boot:
> >> +
> >>  Boot time assembly of degraded/dirty arrays
> >>  -------------------------------------------
> >>  
> >> @@ -176,6 +179,56 @@ and its role in the array.
> >>  Once started with RUN_ARRAY, uninitialized spares can be added with
> >>  HOT_ADD_DISK.
> >>  
> >> +.. _ppl:
> >> +
> >> +Partial Parity Log
> >> +------------------
> > Please put this part to a separate file under Documentation/md. The admin-guide
> > is supposed to serve as a doc for user. The Documentation/md files will server
> > as a doc for developer.
> 
> OK, I will move it.
> 
> >> +Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
> >> +addressed by PPL is that after a dirty shutdown, parity of a particular stripe
> >> +may become inconsistent with data on other member disks. If the array is also
> >> +in degraded state, there is no way to recalculate parity, because one of the
> >> +disks is missing. This can lead to silent data corruption when rebuilding the
> >> +array or using it is as degraded - data calculated from parity for array blocks
> >> +that have not been touched by a write request during the unclean shutdown can
> >> +be incorrect. Such condition is known as the ``RAID5 Write Hole``. Because of
> >> +this, md by default does not allow starting a dirty degraded array, see
> >> +:ref:`dirty_degraded_boot`.
> >> +
> >> +Partial parity for a write operation is the XOR of stripe data chunks not
> >> +modified by this write. It is just enough data needed for recovering from the
> >> +write hole. XORing partial parity with the modified chunks produces parity for
> >> +the stripe, consistent with its state before the write operation, regardless of
> >> +which chunk writes have completed. If one of the not modified data disks of
> >> +this stripe is missing, this updated parity can be used to recover its
> >> +contents. PPL recovery is also performed when starting an array after an
> >> +unclean shutdown and all disks are available, eliminating the need to resync
> >> +the array. Because of this, using write-intent bitmap and PPL together is not
> >> +supported.
> >> +
> >> +When handling a write request PPL writes partial parity before new data and
> >> +parity are dispatched to disks. PPL is a distributed log - it is stored on
> >> +array member drives in the metadata area, on the parity drive of a particular
> >> +stripe.  It does not require a dedicated journaling drive. Write performance is
> >> +reduced by up to 30%-40% but it scales with the number of drives in the array
> >> +and the journaling drive does not become a bottleneck or a single point of
> >> +failure.
> >> +
> >> +Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
> >> +not a true journal. It does not protect from losing in-flight data, only from
> >> +silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
> >> +performed for this stripe (parity is not updated). So it is possible to have
> >> +arbitrary data in the written part of a stripe if that disk is lost. In such
> >> +case the behavior is the same as in plain raid5.
> >> +
> >> +PPL is available for md version-1 metadata and external (specifically IMSM)
> >> +metadata arrays. It can be enabled using mdadm option
> >> +``--consistency-policy=ppl``.
> >> +
> >> +Currently, volatile write-back cache should be disabled on all member drives
> >> +when using PPL. Otherwise it cannot guarantee consistency in case of power
> >> +failure.
> > 
> > The limitation is weird and dumb. Is this an implementation limitation which
> > can be fixed later or a design limitation?
> 
> This can be fixed later and I'm planning to do this. The limitation
> results from the fact that partial parity is the xor of "old" data.
> During recovery we xor it with "new" data and that produces valid
> parity, assuming the "old" data has not changed after dirty shutdown.
> But it's not necessarily true, since the "old" data might have been only
> in cache and not on persistent storage. So fixing this will require
> making sure the old data is on persistent media before writing PPL.

so it's the implementation which doesn't flush disk cache before writting new
data to disks, right? That makes sense then. I think we should fix this soon.
Don't think people will (or remember to) disable write-back cache before using
ppl.

> >> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
> >> new file mode 100644
> >> index 000000000000..6bc246c80f6b
> >> --- /dev/null
> >> +++ b/drivers/md/raid5-ppl.c
> >> @@ -0,0 +1,551 @@
> >> +/*
> >> + * Partial Parity Log for closing the RAID5 write hole
> >> + * Copyright (c) 2017, Intel Corporation.
> >> + *
> >> + * This program is free software; you can redistribute it and/or modify it
> >> + * under the terms and conditions of the GNU General Public License,
> >> + * version 2, as published by the Free Software Foundation.
> >> + *
> >> + * This program is distributed in the hope it will be useful, but WITHOUT
> >> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> >> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> >> + * more details.
> >> + */
> >> +
> >> +#include <linux/kernel.h>
> >> +#include <linux/blkdev.h>
> >> +#include <linux/slab.h>
> >> +#include <linux/crc32c.h>
> >> +#include <linux/raid/md_p.h>
> >> +#include "md.h"
> >> +#include "raid5.h"
> >> +#include "raid5-cache.h"
> >> +
> >> +/*
> >> + * PPL consists of a 4KB header (struct ppl_header) and at least 128KB for
> >> + * partial parity data. The header contains an array of entries
> >> + * (struct ppl_header_entry) which describe the logged write requests.
> >> + * Partial parity for the entries comes after the header, written in the same
> >> + * sequence as the entries:
> >> + *
> >> + * Header
> >> + *   entry0
> >> + *   ...
> >> + *   entryN
> >> + * PP data
> >> + *   PP for entry0
> >> + *   ...
> >> + *   PP for entryN
> >> + *
> >> + * Every entry holds a checksum of its partial parity, the header also has a
> >> + * checksum of the header itself. Entries for full stripes writes contain no
> >> + * partial parity, they only mark the stripes for which parity should be
> >> + * recalculated after an unclean shutdown.
> >> + *
> >> + * A write request is always logged to the PPL instance stored on the parity
> >> + * disk of the corresponding stripe. For each member disk there is one r5l_log
> >> + * used to handle logging for this disk, independently from others. They are
> >> + * grouped in child_logs array in struct ppl_conf, which is assigned to a
> >> + * common parent r5l_log. This parent log serves as a proxy and is used in
> >> + * raid5 personality code - it is assigned as _the_ log in r5conf->log.
> >> + *
> >> + * r5l_io_unit represents a full PPL write, meta_page contains the ppl_header.
> >> + * PPL entries for logged stripes are added in ppl_log_stripe(). A stripe can
> >> + * be appended to the last entry if the chunks to write are the same, otherwise
> >> + * a new entry is added. Checksums of entries are calculated incrementally as
> >> + * stripes containing partial parity are being added to entries.
> >> + * ppl_submit_iounit() calculates the checksum of the header and submits a bio
> >> + * containing the meta_page (ppl_header) and partial parity pages (sh->ppl_page)
> >> + * for all stripes of the io_unit. When the PPL write completes, the stripes
> >> + * associated with the io_unit are released and raid5d starts writing their data
> >> + * and parity. When all stripes are written, the io_unit is freed and the next
> >> + * can be submitted.
> >> + *
> >> + * An io_unit is used to gather stripes until it is submitted or becomes full
> >> + * (if the maximum number of entries or size of PPL is reached). Another io_unit
> >> + * can't be submitted until the previous has completed (PPL and stripe
> >> + * data+parity is written). The log->running_ios list tracks all io_units of
> >> + * a log (for a single member disk). New io_units are added to the end of the
> >> + * list and the first io_unit is submitted, if it is not submitted already.
> >> + * The current io_unit accepting new stripes is always the last on the list.
> >> + */
> > 
> > I don't like the way that io_unit and r5l_log are reused here. The code below
> > shares very little with raid5-cache. There are a lot of fields in the data
> > structures which are not used by ppl. It's not a good practice to share data
> > structure like this way, so please define ppl its own data structure. The
> > workflow between raid5 cache and ppl might be similar, but what we really need
> > to share is the hooks to raid5 core.
> 
> Reusing the structures seemed like a good idea at first, but now I must
> admit you are right, especially since the structures were extended to
> support write-back caching. One instance of r5l_log will still be
> necessary though, to use for the hooks. Its private field will be used
> for holding the ppl structures. Is that ok?

I'd prefer adding a 'void *log_private' in struct r5conf. raid5-cache and ppl
can store whatever data in the log_private. And convert all callbacks to use a
'struct r5conf' parameter. This way raid5 cache and ppl don't need to share the
data structure.

Thanks,
Shaohua

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox