Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
From: Gilad Ben-Yossef @ 2017-02-08  7:32 UTC (permalink / raw)
  To: Binoy Jayan
  Cc: Oded, Ofir, Herbert Xu, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, Linux kernel mailing list, Alasdair Kergon,
	Mike Snitzer, dm-devel, Shaohua Li, linux-raid, Rajendra,
	Milan Broz
In-Reply-To: <1486463731-6224-1-git-send-email-binoy.jayan@linaro.org>

On Tue, Feb 7, 2017 at 12:35 PM, Binoy Jayan <binoy.jayan@linaro.org> wrote:
> ===============================================================================
> dm-crypt optimization for larger block sizes
> ===============================================================================
>
> Currently, the iv generation algorithms are implemented in dm-crypt.c. The goal
> is to move these algorithms from the dm layer to the kernel crypto layer by
> implementing them as template ciphers so they can be used in relation with
> algorithms like aes, and with multiple modes like cbc, ecb etc. As part of this
> patchset, the iv-generation code is moved from the dm layer to the crypto layer
> and adapt the dm-layer to send a whole 'bio' (as defined in the block layer)
> at a time. Each bio contains the in memory representation of physically
> contiguous disk blocks. Since the bio itself may not be contiguous in main
> memory, the dm layer sets up a chained scatterlist of these blocks split into
> physically contiguous segments in memory so that DMA can be performed.
...
> Binoy Jayan (1):
>   crypto: Add IV generation algorithms
>
>  drivers/md/dm-crypt.c  | 1894 ++++++++++++++++++++++++++++++++++--------------
>  include/crypto/geniv.h |   47 ++
>  2 files changed, 1402 insertions(+), 539 deletions(-)
>  create mode 100644 include/crypto/geniv.h

Ran Bonnie++ on it last night  (Luks mode, plain64, Qemu Virt platform
Arm64) and it works just fine.

Tested-by: Gilad Ben-Yossef <gilad@benyossef.com>

Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

^ permalink raw reply

* Re: [PATCH v2 2/5] async_tx: Handle DMA devices having support for fewer PQ coefficients
From: Anup Patel @ 2017-02-08  8:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar, Ray Jui, Scott Branden, Jon Mason,
	Rob Rice, BCM Kernel Feedback, dmaengine@vger.kernel.org,
	Device Tree, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-crypto, linux-raid
In-Reply-To: <CAPcyv4hK_1f8bryq6smPQy7vTnP+QEzmT9wnAEE1xpxq7EAnjQ@mail.gmail.com>

On Tue, Feb 7, 2017 at 11:46 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Tue, Feb 7, 2017 at 1:02 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>> On Tue, Feb 7, 2017 at 1:57 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>> On Tue, Feb 7, 2017 at 12:16 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>>>> The DMAENGINE framework assumes that if PQ offload is supported by a
>>>> DMA device then all 256 PQ coefficients are supported. This assumption
>>>> does not hold anymore because we now have BCM-SBA-RAID offload engine
>>>> which supports PQ offload with limited number of PQ coefficients.
>>>>
>>>> This patch extends async_tx APIs to handle DMA devices with support
>>>> for fewer PQ coefficients.
>>>>
>>>> Signed-off-by: Anup Patel <anup.patel@broadcom.com>
>>>> Reviewed-by: Scott Branden <scott.branden@broadcom.com>
>>>
>>> I don't like this approach. Define an interface for md to query the
>>> offload engine once at the beginning of time. We should not be adding
>>> any new extensions to async_tx.
>>
>> Even if we do capability checks in Linux MD, we still need a way
>> for DMAENGINE drivers to advertise number of PQ coefficients
>> handled by the HW.
>>
>> I agree capability checks should be done once in Linux MD but I don't
>> see why this has to be part of BCM-SBA-RAID driver patches. We need
>> separate patchsets to address limitations of async_tx framework.
>
> Right, separate enabling before we pile on new hardware support to a
> known broken framework.

Linux Async Tx not broken framework. The issue is:
1. Its not complete enough
2. Its not optimized for very high through-put offload engines

Regards,
Anup

^ permalink raw reply

* Re: [PATCH 1/5] MD: attach data to each bio
From: Guoqing Jiang @ 2017-02-08  9:07 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: khlebnikov, hch, neilb
In-Reply-To: <79515b1372fa1a1813c00ef0d7e0613a4512183d.1486485935.git.shli@fb.com>



On 02/08/2017 12:55 AM, Shaohua Li wrote:
> Currently MD is rebusing some bio fields. To remove the hack, we attach
> extra data to each bio. Each personablity can attach extra data to the
> bios, so we don't need to rebuse bio fields.

rebuse? Maybe it should be reuse, or abuse.

[snip]

> +static inline void *md_get_per_bio_data(struct bio *bio)
> +{
> +	return ((struct md_per_bio_data *)bio->bi_private) + 1;
> +}
> +
> +extern void md_bio_attach_data(struct mddev *mddev, struct bio *bio);

Why not make raid1_bio_ref_ptr and raid10_bio_ref_ptr as macro
like raid5_get_bio_data? Or add inline before the two funcs since
both of them are called frequently.

Thanks,
Guoqing

^ permalink raw reply

* Re: [PATCH v3 3/9] md: superblock changes for PPL
From: Artur Paszkiewicz @ 2017-02-08 11:58 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <20170207212011.5aulcj3znldl6dry@kernel.org>

On 02/07/2017 10:20 PM, Shaohua Li wrote:
> On Mon, Jan 30, 2017 at 07:59:47PM +0100, Artur Paszkiewicz wrote:
>> Include information about PPL location and size into mdp_superblock_1
>> and copy it to/from rdev. Because PPL is mutually exclusive with bitmap,
>> put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for
>> 'feature_map', analogically to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL
>> to mddev->flags to indicate that PPL is enabled on an array.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  drivers/md/md.c                | 15 +++++++++++++++
>>  drivers/md/md.h                |  8 ++++++++
>>  drivers/md/raid0.c             |  3 ++-
>>  drivers/md/raid1.c             |  3 ++-
>>  include/uapi/linux/raid/md_p.h | 18 ++++++++++++++----
>>  5 files changed, 41 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 85ac98417a08..e96f73572e23 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -1566,6 +1566,12 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
>>  	} else if (sb->bblog_offset != 0)
>>  		rdev->badblocks.shift = 0;
>>  
>> +	if (le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL) {
>> +		rdev->ppl.offset = (__s16)le16_to_cpu(sb->ppl.offset);
>> +		rdev->ppl.size = le16_to_cpu(sb->ppl.size);
>> +		rdev->ppl.sector = rdev->sb_start + rdev->ppl.offset;
>> +	}
>> +
>>  	if (!refdev) {
>>  		ret = 1;
>>  	} else {
>> @@ -1678,6 +1684,9 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
>>  
>>  		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL)
>>  			set_bit(MD_HAS_JOURNAL, &mddev->flags);
>> +
>> +		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL)
>> +			set_bit(MD_HAS_PPL, &mddev->flags);
> 
> I think we should check wrong configuration and bail out. For example, both
> MD_FEATURE_PPL and MD_FEATURE_BITMAP_OFFSET set, or both MD_FEATURE_PPL and
> MD_FEATURE_JOURNAL set.

Makes sense. I will add it.

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH v3 4/9] raid5: calculate partial parity for a stripe
From: Artur Paszkiewicz @ 2017-02-08 11:58 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <20170207212528.m6u47y3f2jot2kng@kernel.org>

On 02/07/2017 10:25 PM, Shaohua Li wrote:
> On Mon, Jan 30, 2017 at 07:59:48PM +0100, Artur Paszkiewicz wrote:
>> Attach a page for holding the partial parity data to stripe_head.
>> Allocate it only if mddev has the MD_HAS_PPL flag set.
>>
>> Partial parity is the xor of not modified data chunks of a stripe and is
>> calculated as follows:
>>
>> - reconstruct-write case:
>>   xor data from all not updated disks in a stripe
>>
>> - read-modify-write case:
>>   xor old data and parity from all updated disks in a stripe
>>
>> Implement it using the async_tx API and integrate into raid_run_ops().
>> It must be called when we still have access to old data, so do it when
>> STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
>> stored into sh->ppl_page.
>>
>> Partial parity is not meaningful for full stripe write and is not stored
>> in the log or used for recovery, so don't attempt to calculate it when
>> stripe has STRIPE_FULL_WRITE.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  drivers/md/raid5.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  drivers/md/raid5.h |  2 ++
>>  2 files changed, 100 insertions(+)
>>
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index d1cba941951e..e1e238da32ba 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -466,6 +466,11 @@ static void shrink_buffers(struct stripe_head *sh)
>>  		sh->dev[i].page = NULL;
>>  		put_page(p);
>>  	}
>> +
>> +	if (sh->ppl_page) {
>> +		put_page(sh->ppl_page);
>> +		sh->ppl_page = NULL;
>> +	}
>>  }
>>  
>>  static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
>> @@ -482,6 +487,13 @@ static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
>>  		sh->dev[i].page = page;
>>  		sh->dev[i].orig_page = page;
>>  	}
>> +
>> +	if (test_bit(MD_HAS_PPL, &sh->raid_conf->mddev->flags)) {
> 
> I think the test should be something like:
> 	if (raid5_ppl_enabled())
> 
> Having the feature doesn't mean the feature is enabled.

This flag is set iff PPL is enabled, so this function would only check
the flag anyway. But I can add it to improve readability.

>>  	pr_debug("%s: stripe %llu locked: %d ops_request: %lx\n",
>>  		__func__, (unsigned long long)sh->sector,
>>  		s->locked, s->ops_request);
>> @@ -3105,6 +3175,34 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
>>  	if (*bip && (*bip)->bi_iter.bi_sector < bio_end_sector(bi))
>>  		goto overlap;
>>  
>> +	if (forwrite && test_bit(MD_HAS_PPL, &conf->mddev->flags)) {
>> +		/*
>> +		 * With PPL only writes to consecutive data chunks within a
>> +		 * stripe are allowed. Not really an overlap, but
>> +		 * wait_for_overlap can be used to handle this.
>> +		 */
> 
> Please describe why the data must be consecutive.

In PPL metadata we don't store information about which drives we write
to, only the modified data range. For a single stripe_head we can only
have one PPL entry at a time, which describes one data range. This can
be improved in the future, but will require extending the PPL metadata.

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation
From: Artur Paszkiewicz @ 2017-02-08 11:58 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <20170207214225.3jenvleiz3tldzlx@kernel.org>

On 02/07/2017 10:42 PM, Shaohua Li wrote:
> On Mon, Jan 30, 2017 at 07:59:49PM +0100, Artur Paszkiewicz wrote:
>> This implements the PPL write logging functionality, using the
>> raid5-cache policy logic introduced in previous patches. The description
>> of PPL is added to the documentation. More details can be found in the
>> comments in raid5-ppl.c.
>>
>> Put the PPL metadata structures to md_p.h because userspace tools
>> (mdadm) will also need to read/write PPL.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  Documentation/admin-guide/md.rst |  53 ++++
>>  drivers/md/Makefile              |   2 +-
>>  drivers/md/raid5-cache.c         |  13 +-
>>  drivers/md/raid5-cache.h         |   8 +
>>  drivers/md/raid5-ppl.c           | 551 +++++++++++++++++++++++++++++++++++++++
>>  drivers/md/raid5.c               |  15 +-
>>  include/uapi/linux/raid/md_p.h   |  26 ++
>>  7 files changed, 661 insertions(+), 7 deletions(-)
>>  create mode 100644 drivers/md/raid5-ppl.c
>>
>> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
>> index e449fb5f277c..7104ef757e73 100644
>> --- a/Documentation/admin-guide/md.rst
>> +++ b/Documentation/admin-guide/md.rst
>> @@ -86,6 +86,9 @@ superblock can be autodetected and run at boot time.
>>  The kernel parameter ``raid=partitionable`` (or ``raid=part``) means
>>  that all auto-detected arrays are assembled as partitionable.
>>  
>> +
>> +.. _dirty_degraded_boot:
>> +
>>  Boot time assembly of degraded/dirty arrays
>>  -------------------------------------------
>>  
>> @@ -176,6 +179,56 @@ and its role in the array.
>>  Once started with RUN_ARRAY, uninitialized spares can be added with
>>  HOT_ADD_DISK.
>>  
>> +.. _ppl:
>> +
>> +Partial Parity Log
>> +------------------
> Please put this part to a separate file under Documentation/md. The admin-guide
> is supposed to serve as a doc for user. The Documentation/md files will server
> as a doc for developer.

OK, I will move it.

>> +Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
>> +addressed by PPL is that after a dirty shutdown, parity of a particular stripe
>> +may become inconsistent with data on other member disks. If the array is also
>> +in degraded state, there is no way to recalculate parity, because one of the
>> +disks is missing. This can lead to silent data corruption when rebuilding the
>> +array or using it is as degraded - data calculated from parity for array blocks
>> +that have not been touched by a write request during the unclean shutdown can
>> +be incorrect. Such condition is known as the ``RAID5 Write Hole``. Because of
>> +this, md by default does not allow starting a dirty degraded array, see
>> +:ref:`dirty_degraded_boot`.
>> +
>> +Partial parity for a write operation is the XOR of stripe data chunks not
>> +modified by this write. It is just enough data needed for recovering from the
>> +write hole. XORing partial parity with the modified chunks produces parity for
>> +the stripe, consistent with its state before the write operation, regardless of
>> +which chunk writes have completed. If one of the not modified data disks of
>> +this stripe is missing, this updated parity can be used to recover its
>> +contents. PPL recovery is also performed when starting an array after an
>> +unclean shutdown and all disks are available, eliminating the need to resync
>> +the array. Because of this, using write-intent bitmap and PPL together is not
>> +supported.
>> +
>> +When handling a write request PPL writes partial parity before new data and
>> +parity are dispatched to disks. PPL is a distributed log - it is stored on
>> +array member drives in the metadata area, on the parity drive of a particular
>> +stripe.  It does not require a dedicated journaling drive. Write performance is
>> +reduced by up to 30%-40% but it scales with the number of drives in the array
>> +and the journaling drive does not become a bottleneck or a single point of
>> +failure.
>> +
>> +Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
>> +not a true journal. It does not protect from losing in-flight data, only from
>> +silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
>> +performed for this stripe (parity is not updated). So it is possible to have
>> +arbitrary data in the written part of a stripe if that disk is lost. In such
>> +case the behavior is the same as in plain raid5.
>> +
>> +PPL is available for md version-1 metadata and external (specifically IMSM)
>> +metadata arrays. It can be enabled using mdadm option
>> +``--consistency-policy=ppl``.
>> +
>> +Currently, volatile write-back cache should be disabled on all member drives
>> +when using PPL. Otherwise it cannot guarantee consistency in case of power
>> +failure.
> 
> The limitation is weird and dumb. Is this an implementation limitation which
> can be fixed later or a design limitation?

This can be fixed later and I'm planning to do this. The limitation
results from the fact that partial parity is the xor of "old" data.
During recovery we xor it with "new" data and that produces valid
parity, assuming the "old" data has not changed after dirty shutdown.
But it's not necessarily true, since the "old" data might have been only
in cache and not on persistent storage. So fixing this will require
making sure the old data is on persistent media before writing PPL.
 
>> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
>> new file mode 100644
>> index 000000000000..6bc246c80f6b
>> --- /dev/null
>> +++ b/drivers/md/raid5-ppl.c
>> @@ -0,0 +1,551 @@
>> +/*
>> + * Partial Parity Log for closing the RAID5 write hole
>> + * Copyright (c) 2017, Intel Corporation.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms and conditions of the GNU General Public License,
>> + * version 2, as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope it will be useful, but WITHOUT
>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
>> + * more details.
>> + */
>> +
>> +#include <linux/kernel.h>
>> +#include <linux/blkdev.h>
>> +#include <linux/slab.h>
>> +#include <linux/crc32c.h>
>> +#include <linux/raid/md_p.h>
>> +#include "md.h"
>> +#include "raid5.h"
>> +#include "raid5-cache.h"
>> +
>> +/*
>> + * PPL consists of a 4KB header (struct ppl_header) and at least 128KB for
>> + * partial parity data. The header contains an array of entries
>> + * (struct ppl_header_entry) which describe the logged write requests.
>> + * Partial parity for the entries comes after the header, written in the same
>> + * sequence as the entries:
>> + *
>> + * Header
>> + *   entry0
>> + *   ...
>> + *   entryN
>> + * PP data
>> + *   PP for entry0
>> + *   ...
>> + *   PP for entryN
>> + *
>> + * Every entry holds a checksum of its partial parity, the header also has a
>> + * checksum of the header itself. Entries for full stripes writes contain no
>> + * partial parity, they only mark the stripes for which parity should be
>> + * recalculated after an unclean shutdown.
>> + *
>> + * A write request is always logged to the PPL instance stored on the parity
>> + * disk of the corresponding stripe. For each member disk there is one r5l_log
>> + * used to handle logging for this disk, independently from others. They are
>> + * grouped in child_logs array in struct ppl_conf, which is assigned to a
>> + * common parent r5l_log. This parent log serves as a proxy and is used in
>> + * raid5 personality code - it is assigned as _the_ log in r5conf->log.
>> + *
>> + * r5l_io_unit represents a full PPL write, meta_page contains the ppl_header.
>> + * PPL entries for logged stripes are added in ppl_log_stripe(). A stripe can
>> + * be appended to the last entry if the chunks to write are the same, otherwise
>> + * a new entry is added. Checksums of entries are calculated incrementally as
>> + * stripes containing partial parity are being added to entries.
>> + * ppl_submit_iounit() calculates the checksum of the header and submits a bio
>> + * containing the meta_page (ppl_header) and partial parity pages (sh->ppl_page)
>> + * for all stripes of the io_unit. When the PPL write completes, the stripes
>> + * associated with the io_unit are released and raid5d starts writing their data
>> + * and parity. When all stripes are written, the io_unit is freed and the next
>> + * can be submitted.
>> + *
>> + * An io_unit is used to gather stripes until it is submitted or becomes full
>> + * (if the maximum number of entries or size of PPL is reached). Another io_unit
>> + * can't be submitted until the previous has completed (PPL and stripe
>> + * data+parity is written). The log->running_ios list tracks all io_units of
>> + * a log (for a single member disk). New io_units are added to the end of the
>> + * list and the first io_unit is submitted, if it is not submitted already.
>> + * The current io_unit accepting new stripes is always the last on the list.
>> + */
> 
> I don't like the way that io_unit and r5l_log are reused here. The code below
> shares very little with raid5-cache. There are a lot of fields in the data
> structures which are not used by ppl. It's not a good practice to share data
> structure like this way, so please define ppl its own data structure. The
> workflow between raid5 cache and ppl might be similar, but what we really need
> to share is the hooks to raid5 core.

Reusing the structures seemed like a good idea at first, but now I must
admit you are right, especially since the structures were extended to
support write-back caching. One instance of r5l_log will still be
necessary though, to use for the hooks. Its private field will be used
for holding the ppl structures. Is that ok? 

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH v3 6/9] md: add sysfs entries for PPL
From: Artur Paszkiewicz @ 2017-02-08 11:58 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <20170207214914.dfin7ma73lrqz2y2@kernel.org>

On 02/07/2017 10:49 PM, Shaohua Li wrote:
> On Mon, Jan 30, 2017 at 07:59:50PM +0100, Artur Paszkiewicz wrote:
>> Add 'consistency_policy' attribute for array. It indicates how the array
>> maintains consistency in case of unexpected shutdown.
>>
>> Add 'ppl_sector' and 'ppl_size' for rdev, which describe the location
>> and size of the PPL space on the device.
>>
>> These attributes are writable to allow enabling PPL for external
>> metadata arrays and (later) to enable/disable PPL for a running array.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  Documentation/admin-guide/md.rst |  32 ++++++++++-
>>  drivers/md/md.c                  | 115 +++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 144 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
>> index 7104ef757e73..2c153b3d798f 100644
>> --- a/Documentation/admin-guide/md.rst
>> +++ b/Documentation/admin-guide/md.rst
>> @@ -329,14 +329,14 @@ All md devices contain:
>>       array creation it will default to 0, though starting the array as
>>       ``clean`` will set it much larger.
>>  
>> -   new_dev
>> +  new_dev
>>       This file can be written but not read.  The value written should
>>       be a block device number as major:minor.  e.g. 8:0
>>       This will cause that device to be attached to the array, if it is
>>       available.  It will then appear at md/dev-XXX (depending on the
>>       name of the device) and further configuration is then possible.
>>  
>> -   safe_mode_delay
>> +  safe_mode_delay
>>       When an md array has seen no write requests for a certain period
>>       of time, it will be marked as ``clean``.  When another write
>>       request arrives, the array is marked as ``dirty`` before the write
>> @@ -345,7 +345,7 @@ All md devices contain:
>>       period as a number of seconds.  The default is 200msec (0.200).
>>       Writing a value of 0 disables safemode.
>>  
>> -   array_state
>> +  array_state
>>       This file contains a single word which describes the current
>>       state of the array.  In many cases, the state can be set by
>>       writing the word for the desired state, however some states
>> @@ -454,7 +454,30 @@ All md devices contain:
>>       once the array becomes non-degraded, and this fact has been
>>       recorded in the metadata.
>>  
>> +  consistency_policy
>> +     This indicates how the array maintains consistency in case of unexpected
>> +     shutdown. It can be:
>>  
>> +     none
>> +       Array has no redundancy information, e.g. raid0, linear.
>> +
>> +     resync
>> +       Full resync is performed and all redundancy is regenerated when the
>> +       array is started after unclean shutdown.
>> +
>> +     bitmap
>> +       Resync assisted by a write-intent bitmap.
>> +
>> +     journal
>> +       For raid4/5/6, journal device is used to log transactions and replay
>> +       after unclean shutdown.
>> +
>> +     ppl
>> +       For raid5 only, :ref:`ppl` is used to close the write hole and eliminate
>> +       resync.
>> +
>> +     The accepted values when writing to this file are ``ppl`` and ``resync``,
>> +     used to enable and disable PPL.
>>  
>>  
>>  As component devices are added to an md array, they appear in the ``md``
>> @@ -616,6 +639,9 @@ Each directory contains:
>>  	adds bad blocks without acknowledging them. This is largely
>>  	for testing.
>>  
>> +      ppl_sector, ppl_size
>> +        Location and size (in sectors) of the space used for Partial Parity Log
>> +        on this device.
>>  
>>  
>>  An active md device will also contain an entry for each active device
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index e96f73572e23..47ab3afe87cb 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -3205,6 +3205,78 @@ static ssize_t ubb_store(struct md_rdev *rdev, const char *page, size_t len)
>>  static struct rdev_sysfs_entry rdev_unack_bad_blocks =
>>  __ATTR(unacknowledged_bad_blocks, S_IRUGO|S_IWUSR, ubb_show, ubb_store);
>>  
>> +static ssize_t
>> +ppl_sector_show(struct md_rdev *rdev, char *page)
>> +{
>> +	return sprintf(page, "%llu\n", (unsigned long long)rdev->ppl.sector);
>> +}
>> +
>> +static ssize_t
>> +ppl_sector_store(struct md_rdev *rdev, const char *buf, size_t len)
>> +{
>> +	unsigned long long sector;
>> +
>> +	if (kstrtoull(buf, 10, &sector) < 0)
>> +		return -EINVAL;
>> +	if (sector != (sector_t)sector)
>> +	        return -EINVAL;
>> +
>> +	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
>> +	    rdev->raid_disk >= 0)
>> +		return -EBUSY;
>> +
>> +	if (rdev->mddev->persistent) {
>> +		if (rdev->mddev->major_version == 0)
>> +			return -EINVAL;
>> +		if ((sector > rdev->sb_start &&
>> +		     sector - rdev->sb_start > S16_MAX) ||
>> +		    (sector < rdev->sb_start &&
>> +		     rdev->sb_start - sector > -S16_MIN))
>> +			return -EINVAL;
>> +		rdev->ppl.offset = sector - rdev->sb_start;
> 
> Why do we allow non-external array changes the ppl log position and size?

To support enabling ppl for a running array. Mdadm should not change the
metadata if the array is running, so we can't read the ppl position and
size from the superblock. This is similar to adding an internal bitmap
with mdadm --grow --bitmap=internal - mdadm writes the bitmap offset to
"bitmap/location" in sysfs.

>> +	} else if (!rdev->mddev->external) {
>> +		return -EBUSY;
>> +	}
>> +	rdev->ppl.sector = sector;
>> +	return len;
>> +}
>> +
>> +static struct rdev_sysfs_entry rdev_ppl_sector =
>> +__ATTR(ppl_sector, S_IRUGO|S_IWUSR, ppl_sector_show, ppl_sector_store);
>> +
>> +static ssize_t
>> +ppl_size_show(struct md_rdev *rdev, char *page)
>> +{
>> +	return sprintf(page, "%u\n", rdev->ppl.size);
>> +}
>> +
>> +static ssize_t
>> +ppl_size_store(struct md_rdev *rdev, const char *buf, size_t len)
>> +{
>> +	unsigned int size;
>> +
>> +	if (kstrtouint(buf, 10, &size) < 0)
>> +		return -EINVAL;
>> +
>> +	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
>> +	    rdev->raid_disk >= 0)
>> +		return -EBUSY;
>> +
>> +	if (rdev->mddev->persistent) {
>> +		if (rdev->mddev->major_version == 0)
>> +			return -EINVAL;
>> +		if (size > U16_MAX)
>> +			return -EINVAL;
>> +	} else if (!rdev->mddev->external) {
>> +		return -EBUSY;
>> +	}
>> +	rdev->ppl.size = size;
>> +	return len;
>> +}
>> +
>> +static struct rdev_sysfs_entry rdev_ppl_size =
>> +__ATTR(ppl_size, S_IRUGO|S_IWUSR, ppl_size_show, ppl_size_store);
>> +
>>  static struct attribute *rdev_default_attrs[] = {
>>  	&rdev_state.attr,
>>  	&rdev_errors.attr,
>> @@ -3215,6 +3287,8 @@ static struct attribute *rdev_default_attrs[] = {
>>  	&rdev_recovery_start.attr,
>>  	&rdev_bad_blocks.attr,
>>  	&rdev_unack_bad_blocks.attr,
>> +	&rdev_ppl_sector.attr,
>> +	&rdev_ppl_size.attr,
>>  	NULL,
>>  };
>>  static ssize_t
>> @@ -4951,6 +5025,46 @@ static struct md_sysfs_entry md_array_size =
>>  __ATTR(array_size, S_IRUGO|S_IWUSR, array_size_show,
>>         array_size_store);
>>  
>> +static ssize_t
>> +consistency_policy_show(struct mddev *mddev, char *page)
>> +{
>> +	int ret;
>> +
>> +	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
>> +		ret = sprintf(page, "journal\n");
>> +	} else if (test_bit(MD_HAS_PPL, &mddev->flags)) {
>> +		ret = sprintf(page, "ppl\n");
>> +	} else if (mddev->bitmap) {
>> +		ret = sprintf(page, "bitmap\n");
>> +	} else if (mddev->pers) {
>> +		if (mddev->pers->sync_request)
>> +			ret = sprintf(page, "resync\n");
>> +		else
>> +			ret = sprintf(page, "none\n");
>> +	} else {
>> +		ret = sprintf(page, "unknown\n");
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static ssize_t
>> +consistency_policy_store(struct mddev *mddev, const char *buf, size_t len)
>> +{
>> +	if (mddev->pers) {
>> +		return -EBUSY;
>> +	} else if (mddev->external && strncmp(buf, "ppl", 3) == 0) {
>> +		set_bit(MD_HAS_PPL, &mddev->flags);
>> +		return len;
> 
> Here we should check if ppl position and size are valid before we set the flag.
> 
> And shouldn't we move this part to .change_consistency_policy added in patch 9?
> This is a consistency policy change too.

Yes, but here the array is not started yet and we don't have the
personality for which to call .change_consistency_policy. We are not
changing the consistency policy here, but rather setting it initially
for an external metadata array.

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH v2 2/5] async_tx: Handle DMA devices having support for fewer PQ coefficients
From: Dan Williams @ 2017-02-08 16:24 UTC (permalink / raw)
  To: Anup Patel
  Cc: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar, Ray Jui, Scott Branden, Jon Mason,
	Rob Rice, BCM Kernel Feedback, dmaengine@vger.kernel.org,
	Device Tree, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-crypto, linux-raid
In-Reply-To: <CAALAos9ovab4x=waRgM7G2adRC7syHMkY0Rhf-UPB_+h4w3yNQ@mail.gmail.com>

On Wed, Feb 8, 2017 at 12:57 AM, Anup Patel <anup.patel@broadcom.com> wrote:
> On Tue, Feb 7, 2017 at 11:46 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> On Tue, Feb 7, 2017 at 1:02 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>>> On Tue, Feb 7, 2017 at 1:57 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>>> On Tue, Feb 7, 2017 at 12:16 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>>>>> The DMAENGINE framework assumes that if PQ offload is supported by a
>>>>> DMA device then all 256 PQ coefficients are supported. This assumption
>>>>> does not hold anymore because we now have BCM-SBA-RAID offload engine
>>>>> which supports PQ offload with limited number of PQ coefficients.
>>>>>
>>>>> This patch extends async_tx APIs to handle DMA devices with support
>>>>> for fewer PQ coefficients.
>>>>>
>>>>> Signed-off-by: Anup Patel <anup.patel@broadcom.com>
>>>>> Reviewed-by: Scott Branden <scott.branden@broadcom.com>
>>>>
>>>> I don't like this approach. Define an interface for md to query the
>>>> offload engine once at the beginning of time. We should not be adding
>>>> any new extensions to async_tx.
>>>
>>> Even if we do capability checks in Linux MD, we still need a way
>>> for DMAENGINE drivers to advertise number of PQ coefficients
>>> handled by the HW.
>>>
>>> I agree capability checks should be done once in Linux MD but I don't
>>> see why this has to be part of BCM-SBA-RAID driver patches. We need
>>> separate patchsets to address limitations of async_tx framework.
>>
>> Right, separate enabling before we pile on new hardware support to a
>> known broken framework.
>
> Linux Async Tx not broken framework. The issue is:
> 1. Its not complete enough
> 2. Its not optimized for very high through-put offload engines

I'm not understanding your point. I'm nak'ing this change to add yet
more per-transaction capability checking to async_tx. I don't like the
DMA_HAS_FEWER_PQ_COEF flag, especially since it is equal to
DMA_HAS_PQ_CONTINUE. I'm not asking for all of async_tx's problems to
be fixed before this new hardware support, I'm simply saying we should
start the process of moving offload-engine capability checking to the
raid code.

^ permalink raw reply

* Re: [PATCH v3 6/9] md: add sysfs entries for PPL
From: Shaohua Li @ 2017-02-08 18:14 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <d484e0e5-345e-f35b-e10c-6727e300bfa1@intel.com>

On Wed, Feb 08, 2017 at 12:58:51PM +0100, Artur Paszkiewicz wrote:
> On 02/07/2017 10:49 PM, Shaohua Li wrote:
> > On Mon, Jan 30, 2017 at 07:59:50PM +0100, Artur Paszkiewicz wrote:
> >> Add 'consistency_policy' attribute for array. It indicates how the array
> >> maintains consistency in case of unexpected shutdown.
> >>
> >> Add 'ppl_sector' and 'ppl_size' for rdev, which describe the location
> >> and size of the PPL space on the device.
> >>
> >> These attributes are writable to allow enabling PPL for external
> >> metadata arrays and (later) to enable/disable PPL for a running array.
> >>
> >> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> >> ---
> >>  Documentation/admin-guide/md.rst |  32 ++++++++++-
> >>  drivers/md/md.c                  | 115 +++++++++++++++++++++++++++++++++++++++
> >>  2 files changed, 144 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
> >> index 7104ef757e73..2c153b3d798f 100644
> >> --- a/Documentation/admin-guide/md.rst
> >> +++ b/Documentation/admin-guide/md.rst
> >> @@ -329,14 +329,14 @@ All md devices contain:
> >>       array creation it will default to 0, though starting the array as
> >>       ``clean`` will set it much larger.
> >>  
> >> -   new_dev
> >> +  new_dev
> >>       This file can be written but not read.  The value written should
> >>       be a block device number as major:minor.  e.g. 8:0
> >>       This will cause that device to be attached to the array, if it is
> >>       available.  It will then appear at md/dev-XXX (depending on the
> >>       name of the device) and further configuration is then possible.
> >>  
> >> -   safe_mode_delay
> >> +  safe_mode_delay
> >>       When an md array has seen no write requests for a certain period
> >>       of time, it will be marked as ``clean``.  When another write
> >>       request arrives, the array is marked as ``dirty`` before the write
> >> @@ -345,7 +345,7 @@ All md devices contain:
> >>       period as a number of seconds.  The default is 200msec (0.200).
> >>       Writing a value of 0 disables safemode.
> >>  
> >> -   array_state
> >> +  array_state
> >>       This file contains a single word which describes the current
> >>       state of the array.  In many cases, the state can be set by
> >>       writing the word for the desired state, however some states
> >> @@ -454,7 +454,30 @@ All md devices contain:
> >>       once the array becomes non-degraded, and this fact has been
> >>       recorded in the metadata.
> >>  
> >> +  consistency_policy
> >> +     This indicates how the array maintains consistency in case of unexpected
> >> +     shutdown. It can be:
> >>  
> >> +     none
> >> +       Array has no redundancy information, e.g. raid0, linear.
> >> +
> >> +     resync
> >> +       Full resync is performed and all redundancy is regenerated when the
> >> +       array is started after unclean shutdown.
> >> +
> >> +     bitmap
> >> +       Resync assisted by a write-intent bitmap.
> >> +
> >> +     journal
> >> +       For raid4/5/6, journal device is used to log transactions and replay
> >> +       after unclean shutdown.
> >> +
> >> +     ppl
> >> +       For raid5 only, :ref:`ppl` is used to close the write hole and eliminate
> >> +       resync.
> >> +
> >> +     The accepted values when writing to this file are ``ppl`` and ``resync``,
> >> +     used to enable and disable PPL.
> >>  
> >>  
> >>  As component devices are added to an md array, they appear in the ``md``
> >> @@ -616,6 +639,9 @@ Each directory contains:
> >>  	adds bad blocks without acknowledging them. This is largely
> >>  	for testing.
> >>  
> >> +      ppl_sector, ppl_size
> >> +        Location and size (in sectors) of the space used for Partial Parity Log
> >> +        on this device.
> >>  
> >>  
> >>  An active md device will also contain an entry for each active device
> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> index e96f73572e23..47ab3afe87cb 100644
> >> --- a/drivers/md/md.c
> >> +++ b/drivers/md/md.c
> >> @@ -3205,6 +3205,78 @@ static ssize_t ubb_store(struct md_rdev *rdev, const char *page, size_t len)
> >>  static struct rdev_sysfs_entry rdev_unack_bad_blocks =
> >>  __ATTR(unacknowledged_bad_blocks, S_IRUGO|S_IWUSR, ubb_show, ubb_store);
> >>  
> >> +static ssize_t
> >> +ppl_sector_show(struct md_rdev *rdev, char *page)
> >> +{
> >> +	return sprintf(page, "%llu\n", (unsigned long long)rdev->ppl.sector);
> >> +}
> >> +
> >> +static ssize_t
> >> +ppl_sector_store(struct md_rdev *rdev, const char *buf, size_t len)
> >> +{
> >> +	unsigned long long sector;
> >> +
> >> +	if (kstrtoull(buf, 10, &sector) < 0)
> >> +		return -EINVAL;
> >> +	if (sector != (sector_t)sector)
> >> +	        return -EINVAL;
> >> +
> >> +	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
> >> +	    rdev->raid_disk >= 0)
> >> +		return -EBUSY;
> >> +
> >> +	if (rdev->mddev->persistent) {
> >> +		if (rdev->mddev->major_version == 0)
> >> +			return -EINVAL;
> >> +		if ((sector > rdev->sb_start &&
> >> +		     sector - rdev->sb_start > S16_MAX) ||
> >> +		    (sector < rdev->sb_start &&
> >> +		     rdev->sb_start - sector > -S16_MIN))
> >> +			return -EINVAL;
> >> +		rdev->ppl.offset = sector - rdev->sb_start;
> > 
> > Why do we allow non-external array changes the ppl log position and size?
> 
> To support enabling ppl for a running array. Mdadm should not change the
> metadata if the array is running, so we can't read the ppl position and
> size from the superblock. This is similar to adding an internal bitmap
> with mdadm --grow --bitmap=internal - mdadm writes the bitmap offset to
> "bitmap/location" in sysfs.

So this is to add ppl support in runtime, right? I think more sanity checks
should be done here to make sure the input is valid. For example the
position/size don't overlap with data, bitmap isn't enabled yet.

> >> +	} else if (!rdev->mddev->external) {
> >> +		return -EBUSY;
> >> +	}
> >> +	rdev->ppl.sector = sector;
> >> +	return len;
> >> +}
> >> +
> >> +static struct rdev_sysfs_entry rdev_ppl_sector =
> >> +__ATTR(ppl_sector, S_IRUGO|S_IWUSR, ppl_sector_show, ppl_sector_store);
> >> +
> >> +static ssize_t
> >> +ppl_size_show(struct md_rdev *rdev, char *page)
> >> +{
> >> +	return sprintf(page, "%u\n", rdev->ppl.size);
> >> +}
> >> +
> >> +static ssize_t
> >> +ppl_size_store(struct md_rdev *rdev, const char *buf, size_t len)
> >> +{
> >> +	unsigned int size;
> >> +
> >> +	if (kstrtouint(buf, 10, &size) < 0)
> >> +		return -EINVAL;
> >> +
> >> +	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
> >> +	    rdev->raid_disk >= 0)
> >> +		return -EBUSY;
> >> +
> >> +	if (rdev->mddev->persistent) {
> >> +		if (rdev->mddev->major_version == 0)
> >> +			return -EINVAL;
> >> +		if (size > U16_MAX)
> >> +			return -EINVAL;
> >> +	} else if (!rdev->mddev->external) {
> >> +		return -EBUSY;
> >> +	}
> >> +	rdev->ppl.size = size;
> >> +	return len;
> >> +}
> >> +
> >> +static struct rdev_sysfs_entry rdev_ppl_size =
> >> +__ATTR(ppl_size, S_IRUGO|S_IWUSR, ppl_size_show, ppl_size_store);
> >> +
> >>  static struct attribute *rdev_default_attrs[] = {
> >>  	&rdev_state.attr,
> >>  	&rdev_errors.attr,
> >> @@ -3215,6 +3287,8 @@ static struct attribute *rdev_default_attrs[] = {
> >>  	&rdev_recovery_start.attr,
> >>  	&rdev_bad_blocks.attr,
> >>  	&rdev_unack_bad_blocks.attr,
> >> +	&rdev_ppl_sector.attr,
> >> +	&rdev_ppl_size.attr,
> >>  	NULL,
> >>  };
> >>  static ssize_t
> >> @@ -4951,6 +5025,46 @@ static struct md_sysfs_entry md_array_size =
> >>  __ATTR(array_size, S_IRUGO|S_IWUSR, array_size_show,
> >>         array_size_store);
> >>  
> >> +static ssize_t
> >> +consistency_policy_show(struct mddev *mddev, char *page)
> >> +{
> >> +	int ret;
> >> +
> >> +	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
> >> +		ret = sprintf(page, "journal\n");
> >> +	} else if (test_bit(MD_HAS_PPL, &mddev->flags)) {
> >> +		ret = sprintf(page, "ppl\n");
> >> +	} else if (mddev->bitmap) {
> >> +		ret = sprintf(page, "bitmap\n");
> >> +	} else if (mddev->pers) {
> >> +		if (mddev->pers->sync_request)
> >> +			ret = sprintf(page, "resync\n");
> >> +		else
> >> +			ret = sprintf(page, "none\n");
> >> +	} else {
> >> +		ret = sprintf(page, "unknown\n");
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static ssize_t
> >> +consistency_policy_store(struct mddev *mddev, const char *buf, size_t len)
> >> +{
> >> +	if (mddev->pers) {
> >> +		return -EBUSY;
> >> +	} else if (mddev->external && strncmp(buf, "ppl", 3) == 0) {
> >> +		set_bit(MD_HAS_PPL, &mddev->flags);
> >> +		return len;
> > 
> > Here we should check if ppl position and size are valid before we set the flag.
> > 
> > And shouldn't we move this part to .change_consistency_policy added in patch 9?
> > This is a consistency policy change too.
> 
> Yes, but here the array is not started yet and we don't have the
> personality for which to call .change_consistency_policy. We are not
> changing the consistency policy here, but rather setting it initially
> for an external metadata array.

Ok, makes sense.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH 1/5] MD: attach data to each bio
From: Wols Lists @ 2017-02-09  0:09 UTC (permalink / raw)
  To: Guoqing Jiang, Shaohua Li, linux-raid; +Cc: khlebnikov, hch, neilb
In-Reply-To: <589ADFEC.1070902@suse.com>

On 08/02/17 09:07, Guoqing Jiang wrote:
> 
> 
> On 02/08/2017 12:55 AM, Shaohua Li wrote:
>> Currently MD is rebusing some bio fields. To remove the hack, we attach
>> extra data to each bio. Each personablity can attach extra data to the
>> bios, so we don't need to rebuse bio fields.
> 
> rebuse? Maybe it should be reuse, or abuse.
> 
imho abuse

also
c/personablity/personality/

Cheers,
Wol


^ permalink raw reply

* Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
From: Binoy Jayan @ 2017-02-09  8:30 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Oded, Ofir, Herbert Xu, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, Linux kernel mailing list, Alasdair Kergon,
	Mike Snitzer, dm-devel, Shaohua Li, linux-raid, Rajendra,
	Milan Broz
In-Reply-To: <CAOtvUMcN8s886978vE6T=AE79KkZsLBnXsmhewnb5PB0UyAG3A@mail.gmail.com>

On 8 February 2017 at 13:02, Gilad Ben-Yossef <gilad@benyossef.com> wrote:
>
> Ran Bonnie++ on it last night  (Luks mode, plain64, Qemu Virt platform
> Arm64) and it works just fine.
>
> Tested-by: Gilad Ben-Yossef <gilad@benyossef.com>
>

Hi Gilad,

Thank you for testing it. Do you have access to a device having crypto
hardware with IV generation capability and associated drivers for let
say, aes with cbc or any another mode? I was wondering if I can customize
it to work with dm-crypt by generating IVs automatically. Please let
me know your thoughts.

Thanks,
Binoy

^ permalink raw reply

* Re: [PATCH v2 2/5] async_tx: Handle DMA devices having support for fewer PQ coefficients
From: Anup Patel @ 2017-02-09  9:29 UTC (permalink / raw)
  To: Dan Williams
  Cc: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar, Ray Jui, Scott Branden, Jon Mason,
	Rob Rice, BCM Kernel Feedback,
	dmaengine-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Device Tree,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-crypto-u79uwXL29TY76Z2rM5mHXA, linux-raid
In-Reply-To: <CAPcyv4iFJXxvJFrUs2jtwP9GX5NcJ8LiEDHeZ5b1fwjCAToe5w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Wed, Feb 8, 2017 at 9:54 PM, Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> On Wed, Feb 8, 2017 at 12:57 AM, Anup Patel <anup.patel-dY08KVG/lbpWk0Htik3J/w@public.gmane.org> wrote:
>> On Tue, Feb 7, 2017 at 11:46 PM, Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>>> On Tue, Feb 7, 2017 at 1:02 AM, Anup Patel <anup.patel-dY08KVG/lbpWk0Htik3J/w@public.gmane.org> wrote:
>>>> On Tue, Feb 7, 2017 at 1:57 PM, Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>>>>> On Tue, Feb 7, 2017 at 12:16 AM, Anup Patel <anup.patel-dY08KVG/lbpWk0Htik3J/w@public.gmane.org> wrote:
>>>>>> The DMAENGINE framework assumes that if PQ offload is supported by a
>>>>>> DMA device then all 256 PQ coefficients are supported. This assumption
>>>>>> does not hold anymore because we now have BCM-SBA-RAID offload engine
>>>>>> which supports PQ offload with limited number of PQ coefficients.
>>>>>>
>>>>>> This patch extends async_tx APIs to handle DMA devices with support
>>>>>> for fewer PQ coefficients.
>>>>>>
>>>>>> Signed-off-by: Anup Patel <anup.patel-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
>>>>>> Reviewed-by: Scott Branden <scott.branden-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
>>>>>
>>>>> I don't like this approach. Define an interface for md to query the
>>>>> offload engine once at the beginning of time. We should not be adding
>>>>> any new extensions to async_tx.
>>>>
>>>> Even if we do capability checks in Linux MD, we still need a way
>>>> for DMAENGINE drivers to advertise number of PQ coefficients
>>>> handled by the HW.
>>>>
>>>> I agree capability checks should be done once in Linux MD but I don't
>>>> see why this has to be part of BCM-SBA-RAID driver patches. We need
>>>> separate patchsets to address limitations of async_tx framework.
>>>
>>> Right, separate enabling before we pile on new hardware support to a
>>> known broken framework.
>>
>> Linux Async Tx not broken framework. The issue is:
>> 1. Its not complete enough
>> 2. Its not optimized for very high through-put offload engines
>
> I'm not understanding your point. I'm nak'ing this change to add yet
> more per-transaction capability checking to async_tx. I don't like the
> DMA_HAS_FEWER_PQ_COEF flag, especially since it is equal to
> DMA_HAS_PQ_CONTINUE. I'm not asking for all of async_tx's problems to
> be fixed before this new hardware support, I'm simply saying we should
> start the process of moving offload-engine capability checking to the
> raid code.

The DMA_HAS_FEWER_PQ_COEF is not equal to
DMA_HAS_PQ_CONTINUE.

I will try to drop this patch and take care of unsupported PQ
coefficients in BCM-SBA-RAID driver itself even if this means
doing some computations in BCM-SBA-RAID driver itself.

Regards,
Anup
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation
From: Artur Paszkiewicz @ 2017-02-09 15:35 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <20170208053444.b4yfmahmlrjgvv7x@kernel.org>

On 02/08/2017 06:34 AM, Shaohua Li wrote:
> On Wed, Feb 08, 2017 at 12:58:42PM +0100, Artur Paszkiewicz wrote:
>> On 02/07/2017 10:42 PM, Shaohua Li wrote:
>>> On Mon, Jan 30, 2017 at 07:59:49PM +0100, Artur Paszkiewicz wrote:
>>>> This implements the PPL write logging functionality, using the
>>>> raid5-cache policy logic introduced in previous patches. The description
>>>> of PPL is added to the documentation. More details can be found in the
>>>> comments in raid5-ppl.c.
>>>>
>>>> Put the PPL metadata structures to md_p.h because userspace tools
>>>> (mdadm) will also need to read/write PPL.
>>>>
>>>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>>>> ---
>>>>  Documentation/admin-guide/md.rst |  53 ++++
>>>>  drivers/md/Makefile              |   2 +-
>>>>  drivers/md/raid5-cache.c         |  13 +-
>>>>  drivers/md/raid5-cache.h         |   8 +
>>>>  drivers/md/raid5-ppl.c           | 551 +++++++++++++++++++++++++++++++++++++++
>>>>  drivers/md/raid5.c               |  15 +-
>>>>  include/uapi/linux/raid/md_p.h   |  26 ++
>>>>  7 files changed, 661 insertions(+), 7 deletions(-)
>>>>  create mode 100644 drivers/md/raid5-ppl.c
>>>>
>>>> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
>>>> index e449fb5f277c..7104ef757e73 100644
>>>> --- a/Documentation/admin-guide/md.rst
>>>> +++ b/Documentation/admin-guide/md.rst
>>>> @@ -86,6 +86,9 @@ superblock can be autodetected and run at boot time.
>>>>  The kernel parameter ``raid=partitionable`` (or ``raid=part``) means
>>>>  that all auto-detected arrays are assembled as partitionable.
>>>>  
>>>> +
>>>> +.. _dirty_degraded_boot:
>>>> +
>>>>  Boot time assembly of degraded/dirty arrays
>>>>  -------------------------------------------
>>>>  
>>>> @@ -176,6 +179,56 @@ and its role in the array.
>>>>  Once started with RUN_ARRAY, uninitialized spares can be added with
>>>>  HOT_ADD_DISK.
>>>>  
>>>> +.. _ppl:
>>>> +
>>>> +Partial Parity Log
>>>> +------------------
>>> Please put this part to a separate file under Documentation/md. The admin-guide
>>> is supposed to serve as a doc for user. The Documentation/md files will server
>>> as a doc for developer.
>>
>> OK, I will move it.
>>
>>>> +Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
>>>> +addressed by PPL is that after a dirty shutdown, parity of a particular stripe
>>>> +may become inconsistent with data on other member disks. If the array is also
>>>> +in degraded state, there is no way to recalculate parity, because one of the
>>>> +disks is missing. This can lead to silent data corruption when rebuilding the
>>>> +array or using it is as degraded - data calculated from parity for array blocks
>>>> +that have not been touched by a write request during the unclean shutdown can
>>>> +be incorrect. Such condition is known as the ``RAID5 Write Hole``. Because of
>>>> +this, md by default does not allow starting a dirty degraded array, see
>>>> +:ref:`dirty_degraded_boot`.
>>>> +
>>>> +Partial parity for a write operation is the XOR of stripe data chunks not
>>>> +modified by this write. It is just enough data needed for recovering from the
>>>> +write hole. XORing partial parity with the modified chunks produces parity for
>>>> +the stripe, consistent with its state before the write operation, regardless of
>>>> +which chunk writes have completed. If one of the not modified data disks of
>>>> +this stripe is missing, this updated parity can be used to recover its
>>>> +contents. PPL recovery is also performed when starting an array after an
>>>> +unclean shutdown and all disks are available, eliminating the need to resync
>>>> +the array. Because of this, using write-intent bitmap and PPL together is not
>>>> +supported.
>>>> +
>>>> +When handling a write request PPL writes partial parity before new data and
>>>> +parity are dispatched to disks. PPL is a distributed log - it is stored on
>>>> +array member drives in the metadata area, on the parity drive of a particular
>>>> +stripe.  It does not require a dedicated journaling drive. Write performance is
>>>> +reduced by up to 30%-40% but it scales with the number of drives in the array
>>>> +and the journaling drive does not become a bottleneck or a single point of
>>>> +failure.
>>>> +
>>>> +Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
>>>> +not a true journal. It does not protect from losing in-flight data, only from
>>>> +silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
>>>> +performed for this stripe (parity is not updated). So it is possible to have
>>>> +arbitrary data in the written part of a stripe if that disk is lost. In such
>>>> +case the behavior is the same as in plain raid5.
>>>> +
>>>> +PPL is available for md version-1 metadata and external (specifically IMSM)
>>>> +metadata arrays. It can be enabled using mdadm option
>>>> +``--consistency-policy=ppl``.
>>>> +
>>>> +Currently, volatile write-back cache should be disabled on all member drives
>>>> +when using PPL. Otherwise it cannot guarantee consistency in case of power
>>>> +failure.
>>>
>>> The limitation is weird and dumb. Is this an implementation limitation which
>>> can be fixed later or a design limitation?
>>
>> This can be fixed later and I'm planning to do this. The limitation
>> results from the fact that partial parity is the xor of "old" data.
>> During recovery we xor it with "new" data and that produces valid
>> parity, assuming the "old" data has not changed after dirty shutdown.
>> But it's not necessarily true, since the "old" data might have been only
>> in cache and not on persistent storage. So fixing this will require
>> making sure the old data is on persistent media before writing PPL.
> 
> so it's the implementation which doesn't flush disk cache before writting new
> data to disks, right? That makes sense then. I think we should fix this soon.
> Don't think people will (or remember to) disable write-back cache before using
> ppl.
> 
>>>> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
>>>> new file mode 100644
>>>> index 000000000000..6bc246c80f6b
>>>> --- /dev/null
>>>> +++ b/drivers/md/raid5-ppl.c
>>>> @@ -0,0 +1,551 @@
>>>> +/*
>>>> + * Partial Parity Log for closing the RAID5 write hole
>>>> + * Copyright (c) 2017, Intel Corporation.
>>>> + *
>>>> + * This program is free software; you can redistribute it and/or modify it
>>>> + * under the terms and conditions of the GNU General Public License,
>>>> + * version 2, as published by the Free Software Foundation.
>>>> + *
>>>> + * This program is distributed in the hope it will be useful, but WITHOUT
>>>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>>>> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
>>>> + * more details.
>>>> + */
>>>> +
>>>> +#include <linux/kernel.h>
>>>> +#include <linux/blkdev.h>
>>>> +#include <linux/slab.h>
>>>> +#include <linux/crc32c.h>
>>>> +#include <linux/raid/md_p.h>
>>>> +#include "md.h"
>>>> +#include "raid5.h"
>>>> +#include "raid5-cache.h"
>>>> +
>>>> +/*
>>>> + * PPL consists of a 4KB header (struct ppl_header) and at least 128KB for
>>>> + * partial parity data. The header contains an array of entries
>>>> + * (struct ppl_header_entry) which describe the logged write requests.
>>>> + * Partial parity for the entries comes after the header, written in the same
>>>> + * sequence as the entries:
>>>> + *
>>>> + * Header
>>>> + *   entry0
>>>> + *   ...
>>>> + *   entryN
>>>> + * PP data
>>>> + *   PP for entry0
>>>> + *   ...
>>>> + *   PP for entryN
>>>> + *
>>>> + * Every entry holds a checksum of its partial parity, the header also has a
>>>> + * checksum of the header itself. Entries for full stripes writes contain no
>>>> + * partial parity, they only mark the stripes for which parity should be
>>>> + * recalculated after an unclean shutdown.
>>>> + *
>>>> + * A write request is always logged to the PPL instance stored on the parity
>>>> + * disk of the corresponding stripe. For each member disk there is one r5l_log
>>>> + * used to handle logging for this disk, independently from others. They are
>>>> + * grouped in child_logs array in struct ppl_conf, which is assigned to a
>>>> + * common parent r5l_log. This parent log serves as a proxy and is used in
>>>> + * raid5 personality code - it is assigned as _the_ log in r5conf->log.
>>>> + *
>>>> + * r5l_io_unit represents a full PPL write, meta_page contains the ppl_header.
>>>> + * PPL entries for logged stripes are added in ppl_log_stripe(). A stripe can
>>>> + * be appended to the last entry if the chunks to write are the same, otherwise
>>>> + * a new entry is added. Checksums of entries are calculated incrementally as
>>>> + * stripes containing partial parity are being added to entries.
>>>> + * ppl_submit_iounit() calculates the checksum of the header and submits a bio
>>>> + * containing the meta_page (ppl_header) and partial parity pages (sh->ppl_page)
>>>> + * for all stripes of the io_unit. When the PPL write completes, the stripes
>>>> + * associated with the io_unit are released and raid5d starts writing their data
>>>> + * and parity. When all stripes are written, the io_unit is freed and the next
>>>> + * can be submitted.
>>>> + *
>>>> + * An io_unit is used to gather stripes until it is submitted or becomes full
>>>> + * (if the maximum number of entries or size of PPL is reached). Another io_unit
>>>> + * can't be submitted until the previous has completed (PPL and stripe
>>>> + * data+parity is written). The log->running_ios list tracks all io_units of
>>>> + * a log (for a single member disk). New io_units are added to the end of the
>>>> + * list and the first io_unit is submitted, if it is not submitted already.
>>>> + * The current io_unit accepting new stripes is always the last on the list.
>>>> + */
>>>
>>> I don't like the way that io_unit and r5l_log are reused here. The code below
>>> shares very little with raid5-cache. There are a lot of fields in the data
>>> structures which are not used by ppl. It's not a good practice to share data
>>> structure like this way, so please define ppl its own data structure. The
>>> workflow between raid5 cache and ppl might be similar, but what we really need
>>> to share is the hooks to raid5 core.
>>
>> Reusing the structures seemed like a good idea at first, but now I must
>> admit you are right, especially since the structures were extended to
>> support write-back caching. One instance of r5l_log will still be
>> necessary though, to use for the hooks. Its private field will be used
>> for holding the ppl structures. Is that ok?
> 
> I'd prefer adding a 'void *log_private' in struct r5conf. raid5-cache and ppl
> can store whatever data in the log_private. And convert all callbacks to use a
> 'struct r5conf' parameter. This way raid5 cache and ppl don't need to share the
> data structure.

This way also 'struct r5l_policy *policy' will have to be moved from
r5l_log to r5conf. Maybe it will be better if instead of adding
log_private and moving this 'policy' field we add a dedicated field for
ppl in r5conf, something like 'struct ppl_conf *ppl', and eliminate
r5l_policy? From raid5 core we could just call corresponding functions
depending on which pointer is set (log or ppl). This way raid5-cache.c
won't have to be changed at all. It seems appropriate to separate
raid5-cache and ppl if they won't be sharing any data structures.

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation
From: Wols Lists @ 2017-02-09 16:06 UTC (permalink / raw)
  To: Shaohua Li, Artur Paszkiewicz; +Cc: linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <20170208053444.b4yfmahmlrjgvv7x@kernel.org>

On 08/02/17 05:34, Shaohua Li wrote:
> so it's the implementation which doesn't flush disk cache before writting new
> data to disks, right? That makes sense then. I think we should fix this soon.
> Don't think people will (or remember to) disable write-back cache before using
> ppl.

Dunno whether I'm making a fool of myself :-) but might a quick-n-dirty
workaround be to make enabling PPL also disable write-back cache? With
maybe a warning not to switch it back on?

Cheers,
Wol

^ permalink raw reply

* Re: [PATCH v2 2/5] async_tx: Handle DMA devices having support for fewer PQ coefficients
From: Dan Williams @ 2017-02-09 16:44 UTC (permalink / raw)
  To: Anup Patel
  Cc: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar, Ray Jui, Scott Branden, Jon Mason,
	Rob Rice, BCM Kernel Feedback, dmaengine@vger.kernel.org,
	Device Tree, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-crypto, linux-raid
In-Reply-To: <CAALAos-ua6hVpUqjnJSQ=ysSOKrN67toiT3J808uHC4A_ZV3mg@mail.gmail.com>

On Thu, Feb 9, 2017 at 1:29 AM, Anup Patel <anup.patel@broadcom.com> wrote:
> On Wed, Feb 8, 2017 at 9:54 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> On Wed, Feb 8, 2017 at 12:57 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>>> On Tue, Feb 7, 2017 at 11:46 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>>> On Tue, Feb 7, 2017 at 1:02 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>>>>> On Tue, Feb 7, 2017 at 1:57 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>>>>> On Tue, Feb 7, 2017 at 12:16 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>>>>>>> The DMAENGINE framework assumes that if PQ offload is supported by a
>>>>>>> DMA device then all 256 PQ coefficients are supported. This assumption
>>>>>>> does not hold anymore because we now have BCM-SBA-RAID offload engine
>>>>>>> which supports PQ offload with limited number of PQ coefficients.
>>>>>>>
>>>>>>> This patch extends async_tx APIs to handle DMA devices with support
>>>>>>> for fewer PQ coefficients.
>>>>>>>
>>>>>>> Signed-off-by: Anup Patel <anup.patel@broadcom.com>
>>>>>>> Reviewed-by: Scott Branden <scott.branden@broadcom.com>
>>>>>>
>>>>>> I don't like this approach. Define an interface for md to query the
>>>>>> offload engine once at the beginning of time. We should not be adding
>>>>>> any new extensions to async_tx.
>>>>>
>>>>> Even if we do capability checks in Linux MD, we still need a way
>>>>> for DMAENGINE drivers to advertise number of PQ coefficients
>>>>> handled by the HW.
>>>>>
>>>>> I agree capability checks should be done once in Linux MD but I don't
>>>>> see why this has to be part of BCM-SBA-RAID driver patches. We need
>>>>> separate patchsets to address limitations of async_tx framework.
>>>>
>>>> Right, separate enabling before we pile on new hardware support to a
>>>> known broken framework.
>>>
>>> Linux Async Tx not broken framework. The issue is:
>>> 1. Its not complete enough
>>> 2. Its not optimized for very high through-put offload engines
>>
>> I'm not understanding your point. I'm nak'ing this change to add yet
>> more per-transaction capability checking to async_tx. I don't like the
>> DMA_HAS_FEWER_PQ_COEF flag, especially since it is equal to
>> DMA_HAS_PQ_CONTINUE. I'm not asking for all of async_tx's problems to
>> be fixed before this new hardware support, I'm simply saying we should
>> start the process of moving offload-engine capability checking to the
>> raid code.
>
> The DMA_HAS_FEWER_PQ_COEF is not equal to
> DMA_HAS_PQ_CONTINUE.

#define DMA_HAS_PQ_CONTINUE (1 << 15
#define DMA_HAS_FEWER_PQ_COEF (1 << 15)

> I will try to drop this patch and take care of unsupported PQ
> coefficients in BCM-SBA-RAID driver itself even if this means
> doing some computations in BCM-SBA-RAID driver itself.

That should be nak'd as well, please do capability detection in a
routine that is common to all raid engines.

^ permalink raw reply

* Re: [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation
From: Shaohua Li @ 2017-02-09 17:09 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <232777f2-d8e9-3d86-cb84-7285c90e5a6f@intel.com>

On Thu, Feb 09, 2017 at 04:35:34PM +0100, Artur Paszkiewicz wrote:
> On 02/08/2017 06:34 AM, Shaohua Li wrote:
> > On Wed, Feb 08, 2017 at 12:58:42PM +0100, Artur Paszkiewicz wrote:
> >> On 02/07/2017 10:42 PM, Shaohua Li wrote:
> >>> On Mon, Jan 30, 2017 at 07:59:49PM +0100, Artur Paszkiewicz wrote:
> >>>> This implements the PPL write logging functionality, using the
> >>>> raid5-cache policy logic introduced in previous patches. The description
> >>>> of PPL is added to the documentation. More details can be found in the
> >>>> comments in raid5-ppl.c.
> >>>>
> >>>> Put the PPL metadata structures to md_p.h because userspace tools
> >>>> (mdadm) will also need to read/write PPL.
> >>>>
> >>>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> >>>> ---
> >>>>  Documentation/admin-guide/md.rst |  53 ++++
> >>>>  drivers/md/Makefile              |   2 +-
> >>>>  drivers/md/raid5-cache.c         |  13 +-
> >>>>  drivers/md/raid5-cache.h         |   8 +
> >>>>  drivers/md/raid5-ppl.c           | 551 +++++++++++++++++++++++++++++++++++++++
> >>>>  drivers/md/raid5.c               |  15 +-
> >>>>  include/uapi/linux/raid/md_p.h   |  26 ++
> >>>>  7 files changed, 661 insertions(+), 7 deletions(-)
> >>>>  create mode 100644 drivers/md/raid5-ppl.c
> >>>>
> >>>> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
> >>>> index e449fb5f277c..7104ef757e73 100644
> >>>> --- a/Documentation/admin-guide/md.rst
> >>>> +++ b/Documentation/admin-guide/md.rst
> >>>> @@ -86,6 +86,9 @@ superblock can be autodetected and run at boot time.
> >>>>  The kernel parameter ``raid=partitionable`` (or ``raid=part``) means
> >>>>  that all auto-detected arrays are assembled as partitionable.
> >>>>  
> >>>> +
> >>>> +.. _dirty_degraded_boot:
> >>>> +
> >>>>  Boot time assembly of degraded/dirty arrays
> >>>>  -------------------------------------------
> >>>>  
> >>>> @@ -176,6 +179,56 @@ and its role in the array.
> >>>>  Once started with RUN_ARRAY, uninitialized spares can be added with
> >>>>  HOT_ADD_DISK.
> >>>>  
> >>>> +.. _ppl:
> >>>> +
> >>>> +Partial Parity Log
> >>>> +------------------
> >>> Please put this part to a separate file under Documentation/md. The admin-guide
> >>> is supposed to serve as a doc for user. The Documentation/md files will server
> >>> as a doc for developer.
> >>
> >> OK, I will move it.
> >>
> >>>> +Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
> >>>> +addressed by PPL is that after a dirty shutdown, parity of a particular stripe
> >>>> +may become inconsistent with data on other member disks. If the array is also
> >>>> +in degraded state, there is no way to recalculate parity, because one of the
> >>>> +disks is missing. This can lead to silent data corruption when rebuilding the
> >>>> +array or using it is as degraded - data calculated from parity for array blocks
> >>>> +that have not been touched by a write request during the unclean shutdown can
> >>>> +be incorrect. Such condition is known as the ``RAID5 Write Hole``. Because of
> >>>> +this, md by default does not allow starting a dirty degraded array, see
> >>>> +:ref:`dirty_degraded_boot`.
> >>>> +
> >>>> +Partial parity for a write operation is the XOR of stripe data chunks not
> >>>> +modified by this write. It is just enough data needed for recovering from the
> >>>> +write hole. XORing partial parity with the modified chunks produces parity for
> >>>> +the stripe, consistent with its state before the write operation, regardless of
> >>>> +which chunk writes have completed. If one of the not modified data disks of
> >>>> +this stripe is missing, this updated parity can be used to recover its
> >>>> +contents. PPL recovery is also performed when starting an array after an
> >>>> +unclean shutdown and all disks are available, eliminating the need to resync
> >>>> +the array. Because of this, using write-intent bitmap and PPL together is not
> >>>> +supported.
> >>>> +
> >>>> +When handling a write request PPL writes partial parity before new data and
> >>>> +parity are dispatched to disks. PPL is a distributed log - it is stored on
> >>>> +array member drives in the metadata area, on the parity drive of a particular
> >>>> +stripe.  It does not require a dedicated journaling drive. Write performance is
> >>>> +reduced by up to 30%-40% but it scales with the number of drives in the array
> >>>> +and the journaling drive does not become a bottleneck or a single point of
> >>>> +failure.
> >>>> +
> >>>> +Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
> >>>> +not a true journal. It does not protect from losing in-flight data, only from
> >>>> +silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
> >>>> +performed for this stripe (parity is not updated). So it is possible to have
> >>>> +arbitrary data in the written part of a stripe if that disk is lost. In such
> >>>> +case the behavior is the same as in plain raid5.
> >>>> +
> >>>> +PPL is available for md version-1 metadata and external (specifically IMSM)
> >>>> +metadata arrays. It can be enabled using mdadm option
> >>>> +``--consistency-policy=ppl``.
> >>>> +
> >>>> +Currently, volatile write-back cache should be disabled on all member drives
> >>>> +when using PPL. Otherwise it cannot guarantee consistency in case of power
> >>>> +failure.
> >>>
> >>> The limitation is weird and dumb. Is this an implementation limitation which
> >>> can be fixed later or a design limitation?
> >>
> >> This can be fixed later and I'm planning to do this. The limitation
> >> results from the fact that partial parity is the xor of "old" data.
> >> During recovery we xor it with "new" data and that produces valid
> >> parity, assuming the "old" data has not changed after dirty shutdown.
> >> But it's not necessarily true, since the "old" data might have been only
> >> in cache and not on persistent storage. So fixing this will require
> >> making sure the old data is on persistent media before writing PPL.
> > 
> > so it's the implementation which doesn't flush disk cache before writting new
> > data to disks, right? That makes sense then. I think we should fix this soon.
> > Don't think people will (or remember to) disable write-back cache before using
> > ppl.
> > 
> >>>> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
> >>>> new file mode 100644
> >>>> index 000000000000..6bc246c80f6b
> >>>> --- /dev/null
> >>>> +++ b/drivers/md/raid5-ppl.c
> >>>> @@ -0,0 +1,551 @@
> >>>> +/*
> >>>> + * Partial Parity Log for closing the RAID5 write hole
> >>>> + * Copyright (c) 2017, Intel Corporation.
> >>>> + *
> >>>> + * This program is free software; you can redistribute it and/or modify it
> >>>> + * under the terms and conditions of the GNU General Public License,
> >>>> + * version 2, as published by the Free Software Foundation.
> >>>> + *
> >>>> + * This program is distributed in the hope it will be useful, but WITHOUT
> >>>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> >>>> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> >>>> + * more details.
> >>>> + */
> >>>> +
> >>>> +#include <linux/kernel.h>
> >>>> +#include <linux/blkdev.h>
> >>>> +#include <linux/slab.h>
> >>>> +#include <linux/crc32c.h>
> >>>> +#include <linux/raid/md_p.h>
> >>>> +#include "md.h"
> >>>> +#include "raid5.h"
> >>>> +#include "raid5-cache.h"
> >>>> +
> >>>> +/*
> >>>> + * PPL consists of a 4KB header (struct ppl_header) and at least 128KB for
> >>>> + * partial parity data. The header contains an array of entries
> >>>> + * (struct ppl_header_entry) which describe the logged write requests.
> >>>> + * Partial parity for the entries comes after the header, written in the same
> >>>> + * sequence as the entries:
> >>>> + *
> >>>> + * Header
> >>>> + *   entry0
> >>>> + *   ...
> >>>> + *   entryN
> >>>> + * PP data
> >>>> + *   PP for entry0
> >>>> + *   ...
> >>>> + *   PP for entryN
> >>>> + *
> >>>> + * Every entry holds a checksum of its partial parity, the header also has a
> >>>> + * checksum of the header itself. Entries for full stripes writes contain no
> >>>> + * partial parity, they only mark the stripes for which parity should be
> >>>> + * recalculated after an unclean shutdown.
> >>>> + *
> >>>> + * A write request is always logged to the PPL instance stored on the parity
> >>>> + * disk of the corresponding stripe. For each member disk there is one r5l_log
> >>>> + * used to handle logging for this disk, independently from others. They are
> >>>> + * grouped in child_logs array in struct ppl_conf, which is assigned to a
> >>>> + * common parent r5l_log. This parent log serves as a proxy and is used in
> >>>> + * raid5 personality code - it is assigned as _the_ log in r5conf->log.
> >>>> + *
> >>>> + * r5l_io_unit represents a full PPL write, meta_page contains the ppl_header.
> >>>> + * PPL entries for logged stripes are added in ppl_log_stripe(). A stripe can
> >>>> + * be appended to the last entry if the chunks to write are the same, otherwise
> >>>> + * a new entry is added. Checksums of entries are calculated incrementally as
> >>>> + * stripes containing partial parity are being added to entries.
> >>>> + * ppl_submit_iounit() calculates the checksum of the header and submits a bio
> >>>> + * containing the meta_page (ppl_header) and partial parity pages (sh->ppl_page)
> >>>> + * for all stripes of the io_unit. When the PPL write completes, the stripes
> >>>> + * associated with the io_unit are released and raid5d starts writing their data
> >>>> + * and parity. When all stripes are written, the io_unit is freed and the next
> >>>> + * can be submitted.
> >>>> + *
> >>>> + * An io_unit is used to gather stripes until it is submitted or becomes full
> >>>> + * (if the maximum number of entries or size of PPL is reached). Another io_unit
> >>>> + * can't be submitted until the previous has completed (PPL and stripe
> >>>> + * data+parity is written). The log->running_ios list tracks all io_units of
> >>>> + * a log (for a single member disk). New io_units are added to the end of the
> >>>> + * list and the first io_unit is submitted, if it is not submitted already.
> >>>> + * The current io_unit accepting new stripes is always the last on the list.
> >>>> + */
> >>>
> >>> I don't like the way that io_unit and r5l_log are reused here. The code below
> >>> shares very little with raid5-cache. There are a lot of fields in the data
> >>> structures which are not used by ppl. It's not a good practice to share data
> >>> structure like this way, so please define ppl its own data structure. The
> >>> workflow between raid5 cache and ppl might be similar, but what we really need
> >>> to share is the hooks to raid5 core.
> >>
> >> Reusing the structures seemed like a good idea at first, but now I must
> >> admit you are right, especially since the structures were extended to
> >> support write-back caching. One instance of r5l_log will still be
> >> necessary though, to use for the hooks. Its private field will be used
> >> for holding the ppl structures. Is that ok?
> > 
> > I'd prefer adding a 'void *log_private' in struct r5conf. raid5-cache and ppl
> > can store whatever data in the log_private. And convert all callbacks to use a
> > 'struct r5conf' parameter. This way raid5 cache and ppl don't need to share the
> > data structure.
> 
> This way also 'struct r5l_policy *policy' will have to be moved from
> r5l_log to r5conf. Maybe it will be better if instead of adding
> log_private and moving this 'policy' field we add a dedicated field for
> ppl in r5conf, something like 'struct ppl_conf *ppl', and eliminate
> r5l_policy? From raid5 core we could just call corresponding functions
> depending on which pointer is set (log or ppl). This way raid5-cache.c
> won't have to be changed at all. It seems appropriate to separate
> raid5-cache and ppl if they won't be sharing any data structures.

It's ok to add ppl specific data in r5conf. Just make sure the raid5 core only
calls one set of hooks if possible and the hooks can call into ppl or
raid5-cache. It's quite unlikely we will add other policies, so it's fine to
eliminate r5l_policy.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation
From: Shaohua Li @ 2017-02-09 17:13 UTC (permalink / raw)
  To: Wols Lists; +Cc: Artur Paszkiewicz, linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <589C9368.1020602@youngman.org.uk>

On Thu, Feb 09, 2017 at 04:06:00PM +0000, Wols Lists wrote:
> On 08/02/17 05:34, Shaohua Li wrote:
> > so it's the implementation which doesn't flush disk cache before writting new
> > data to disks, right? That makes sense then. I think we should fix this soon.
> > Don't think people will (or remember to) disable write-back cache before using
> > ppl.
> 
> Dunno whether I'm making a fool of myself :-) but might a quick-n-dirty
> workaround be to make enabling PPL also disable write-back cache? With
> maybe a warning not to switch it back on?

We don't have API for this in kernel side. This doesn't mean we can't add it
but since the limitation is in the implementation, we'd better fix the
implementation.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v2 2/5] async_tx: Handle DMA devices having support for fewer PQ coefficients
From: Anup Patel @ 2017-02-10  3:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: Mark Rutland, Device Tree, Herbert Xu, Scott Branden, Vinod Koul,
	Ray Jui, Jassi Brar, linux-kernel@vger.kernel.org, linux-raid,
	Jon Mason, Rob Herring, BCM Kernel Feedback, linux-crypto,
	Rob Rice, dmaengine@vger.kernel.org, David S . Miller,
	linux-arm-kernel@lists.infradead.org
In-Reply-To: <CAPcyv4gDq+our1+sAPShBru-qvebo+Zvk0crgSUNXucHGQwZ1Q@mail.gmail.com>

On Thu, Feb 9, 2017 at 10:14 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Thu, Feb 9, 2017 at 1:29 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>> On Wed, Feb 8, 2017 at 9:54 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>> On Wed, Feb 8, 2017 at 12:57 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>>>> On Tue, Feb 7, 2017 at 11:46 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>>>> On Tue, Feb 7, 2017 at 1:02 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>>>>>> On Tue, Feb 7, 2017 at 1:57 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>>>>>> On Tue, Feb 7, 2017 at 12:16 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>>>>>>>> The DMAENGINE framework assumes that if PQ offload is supported by a
>>>>>>>> DMA device then all 256 PQ coefficients are supported. This assumption
>>>>>>>> does not hold anymore because we now have BCM-SBA-RAID offload engine
>>>>>>>> which supports PQ offload with limited number of PQ coefficients.
>>>>>>>>
>>>>>>>> This patch extends async_tx APIs to handle DMA devices with support
>>>>>>>> for fewer PQ coefficients.
>>>>>>>>
>>>>>>>> Signed-off-by: Anup Patel <anup.patel@broadcom.com>
>>>>>>>> Reviewed-by: Scott Branden <scott.branden@broadcom.com>
>>>>>>>
>>>>>>> I don't like this approach. Define an interface for md to query the
>>>>>>> offload engine once at the beginning of time. We should not be adding
>>>>>>> any new extensions to async_tx.
>>>>>>
>>>>>> Even if we do capability checks in Linux MD, we still need a way
>>>>>> for DMAENGINE drivers to advertise number of PQ coefficients
>>>>>> handled by the HW.
>>>>>>
>>>>>> I agree capability checks should be done once in Linux MD but I don't
>>>>>> see why this has to be part of BCM-SBA-RAID driver patches. We need
>>>>>> separate patchsets to address limitations of async_tx framework.
>>>>>
>>>>> Right, separate enabling before we pile on new hardware support to a
>>>>> known broken framework.
>>>>
>>>> Linux Async Tx not broken framework. The issue is:
>>>> 1. Its not complete enough
>>>> 2. Its not optimized for very high through-put offload engines
>>>
>>> I'm not understanding your point. I'm nak'ing this change to add yet
>>> more per-transaction capability checking to async_tx. I don't like the
>>> DMA_HAS_FEWER_PQ_COEF flag, especially since it is equal to
>>> DMA_HAS_PQ_CONTINUE. I'm not asking for all of async_tx's problems to
>>> be fixed before this new hardware support, I'm simply saying we should
>>> start the process of moving offload-engine capability checking to the
>>> raid code.
>>
>> The DMA_HAS_FEWER_PQ_COEF is not equal to
>> DMA_HAS_PQ_CONTINUE.
>
> #define DMA_HAS_PQ_CONTINUE (1 << 15
> #define DMA_HAS_FEWER_PQ_COEF (1 << 15)

You are only looking at the values of these flags.

The semantics of both these flags are different and both
flags are set in different members of "struct dma_device"
The DMA_HAS_PQ_CONTINUE is set in "max_pq" whereas
DMA_HAS_FEWER_PQ_COEF is set in "max_pqcoef".

When DMA_HAS_PQ_CONTINUE is set in "max_pq", it
means that PQ HW is capable of taking P & Q computed
previous txn as input. If DMA_HAS_PQ_CONTINUE is
not supported the async_pq() will pass P & Q computed
by previous txn as sources with coef as g^0.

When DMA_HAS_FEWER_PQ_COEF is set in "max_pqcoef",
it means the PQ HW is not capable of handling all 256 coefs.

>
>> I will try to drop this patch and take care of unsupported PQ
>> coefficients in BCM-SBA-RAID driver itself even if this means
>> doing some computations in BCM-SBA-RAID driver itself.
>
> That should be nak'd as well, please do capability detection in a
> routine that is common to all raid engines.

Thanks for NAKing this patch.

This motivated me to find clean work-around for handling
unsupported PQ coefs in BCM-SBA-RAID driver.

Let's assume max number of PQ coefs supported by PQ HW
is m coefs. Now for any coef n > m, we can use RAID6 math
to get g^n = (g^m)*(g^m)*....*(g^k) where k <= m.

Using the above fact, we can create chained txn for each
source of PQ request in BCM-SBA-RAID driver where each
txn will only compute PQ from one source using above
described RAID6 math. Also, each txn in chained txn will
depend on output of previous txn because we are computing
PQ from one source at a time.

I will drop this patch and send updated BCM-SBA-RAID
driver with above described work-around for handling
unsupported PQ coefs.

Regards,
Anup

^ permalink raw reply

* Re: [PATCH 1/5] MD: attach data to each bio
From: NeilBrown @ 2017-02-10  6:08 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: khlebnikov, hch
In-Reply-To: <79515b1372fa1a1813c00ef0d7e0613a4512183d.1486485935.git.shli@fb.com>

[-- Attachment #1: Type: text/plain, Size: 10392 bytes --]

On Tue, Feb 07 2017, Shaohua Li wrote:

> Currently MD is rebusing some bio fields. To remove the hack, we attach
> extra data to each bio. Each personablity can attach extra data to the
> bios, so we don't need to rebuse bio fields.

I must say that I don't really like this approach.
Temporarily modifying ->bi_private and ->bi_end_io seems
.... intrusive.   I suspect it works, but I wonder if it is really
robust in the long term.

How about a different approach..  Your main concern with my first patch
was that it called md_write_start() and md_write_end() much more often,
and these performed atomic ops on "global" variables, particular
writes_pending.

We could change writes_pending to a per-cpu array which we only count
occasionally when needed.  As writes_pending is updated often and
checked rarely, a per-cpu array which is summed on demand seems
appropriate.

The following patch is an early draft - it doesn't obviously fail and
isn't obviously wrong to me.  There is certainly room for improvement
and may be bugs.
Next week I'll work on collection the re-factoring into separate
patches, which are possible good-to-have anyway.

Thoughts?

Thanks,
NeilBrown


diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8926fb781cdc..883c409902b0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -64,6 +64,8 @@
 #include <linux/raid/md_p.h>
 #include <linux/raid/md_u.h>
 #include <linux/slab.h>
+#include <linux/percpu.h>
+
 #include <trace/events/block.h>
 #include "md.h"
 #include "bitmap.h"
@@ -2250,6 +2252,81 @@ static void export_array(struct mddev *mddev)
 	mddev->major_version = 0;
 }
 
+/*
+ * The percpu writes_pending counters are linked with ->checkers and ->lock.
+ * If ->writes_pending can always be decremented without a lock.
+ * It can only be incremented without a lock if ->checkers is 0 and the test+incr
+ * happen in a rcu_readlocked region.
+ * ->checkers can only be changed under ->lock protection.
+ * To determine if ->writes_pending is totally zero, a quick sum without
+ * locks can be performed. If this is non-zero, then the result is final.
+ * Otherwise ->checkers must be incremented and synchronize_rcu() called.
+ * Then a sum calculated under ->lock, and the result is final until the
+ * ->checkers is decremented, or the lock is dropped.
+ *
+ */
+
+static bool __writes_pending(struct mddev *mddev)
+{
+	long cnt = 0;
+	int i;
+	for_each_possible_cpu(i)
+		cnt += *per_cpu_ptr(mddev->writes_pending_percpu, i);
+	return cnt != 0;
+}
+
+static bool set_in_sync(struct mddev *mddev)
+{
+
+	WARN_ON_ONCE(!spin_is_locked(&mddev->lock));
+	if (__writes_pending(mddev)) {
+		if (mddev->safemode == 1)
+			mddev->safemode = 0;
+		return false;
+	}
+	if (mddev->in_sync)
+		return false;
+
+	mddev->checkers ++;
+	spin_unlock(&mddev->lock);
+	synchronize_rcu();
+	spin_lock(&mddev->lock);
+	if (mddev->in_sync) {
+		/* someone else set it */
+		mddev->checkers --;
+		return false;
+	}
+
+	if (! __writes_pending(mddev))
+		mddev->in_sync = 1;
+	if (mddev->safemode == 1)
+		mddev->safemode = 0;
+	mddev->checkers --;
+	return mddev->in_sync;
+}
+
+static void inc_writes_pending(struct mddev *mddev)
+{
+	rcu_read_lock();
+	if (mddev->checkers == 0) {
+		__this_cpu_inc(*mddev->writes_pending_percpu);
+		rcu_read_unlock();
+		return;
+	}
+	rcu_read_unlock();
+	/* Need that spinlock */
+	spin_lock(&mddev->lock);
+	this_cpu_inc(*mddev->writes_pending_percpu);
+	spin_unlock(&mddev->lock);
+}
+
+static void zero_writes_pending(struct mddev *mddev)
+{
+	int i;
+	for_each_possible_cpu(i)
+		*per_cpu_ptr(mddev->writes_pending_percpu, i) = 0;
+}
+
 static void sync_sbs(struct mddev *mddev, int nospares)
 {
 	/* Update each superblock (in-memory image), but
@@ -3583,7 +3660,6 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 	mddev->delta_disks = 0;
 	mddev->reshape_backwards = 0;
 	mddev->degraded = 0;
-	spin_unlock(&mddev->lock);
 
 	if (oldpers->sync_request == NULL &&
 	    mddev->external) {
@@ -3596,8 +3672,8 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 		 */
 		mddev->in_sync = 0;
 		mddev->safemode_delay = 0;
-		mddev->safemode = 0;
 	}
+	spin_unlock(&mddev->lock);
 
 	oldpers->free(mddev, oldpriv);
 
@@ -3963,16 +4039,11 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
 			wake_up(&mddev->sb_wait);
 			err = 0;
 		} else /* st == clean */ {
+			err = 0;
 			restart_array(mddev);
-			if (atomic_read(&mddev->writes_pending) == 0) {
-				if (mddev->in_sync == 0) {
-					mddev->in_sync = 1;
-					if (mddev->safemode == 1)
-						mddev->safemode = 0;
-					set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
-				}
-				err = 0;
-			} else
+			if (set_in_sync(mddev))
+				set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
+			else if (!mddev->in_sync)
 				err = -EBUSY;
 		}
 		if (!err)
@@ -4030,15 +4101,9 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
 			if (err)
 				break;
 			spin_lock(&mddev->lock);
-			if (atomic_read(&mddev->writes_pending) == 0) {
-				if (mddev->in_sync == 0) {
-					mddev->in_sync = 1;
-					if (mddev->safemode == 1)
-						mddev->safemode = 0;
-					set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
-				}
-				err = 0;
-			} else
+			if (set_in_sync(mddev))
+				set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
+			else if (!mddev->in_sync)
 				err = -EBUSY;
 			spin_unlock(&mddev->lock);
 		} else
@@ -4993,6 +5058,9 @@ static void md_free(struct kobject *ko)
 		del_gendisk(mddev->gendisk);
 		put_disk(mddev->gendisk);
 	}
+	if (mddev->writes_pending_percpu) {
+		free_percpu(mddev->writes_pending_percpu);
+	}
 
 	kfree(mddev);
 }
@@ -5069,6 +5137,13 @@ static int md_alloc(dev_t dev, char *name)
 	blk_queue_make_request(mddev->queue, md_make_request);
 	blk_set_stacking_limits(&mddev->queue->limits);
 
+	mddev->writes_pending_percpu = alloc_percpu(long);
+	if (!mddev->writes_pending_percpu) {
+		blk_cleanup_queue(mddev->queue);
+		mddev->queue = NULL;
+		goto abort;
+	}
+
 	disk = alloc_disk(1 << shift);
 	if (!disk) {
 		blk_cleanup_queue(mddev->queue);
@@ -5152,7 +5227,7 @@ static void md_safemode_timeout(unsigned long data)
 {
 	struct mddev *mddev = (struct mddev *) data;
 
-	if (!atomic_read(&mddev->writes_pending)) {
+	if (!__writes_pending(mddev)) {
 		mddev->safemode = 1;
 		if (mddev->external)
 			sysfs_notify_dirent_safe(mddev->sysfs_state);
@@ -5358,7 +5433,7 @@ int md_run(struct mddev *mddev)
 	} else if (mddev->ro == 2) /* auto-readonly not meaningful */
 		mddev->ro = 0;
 
-	atomic_set(&mddev->writes_pending,0);
+	zero_writes_pending(mddev);
 	atomic_set(&mddev->max_corr_read_errors,
 		   MD_DEFAULT_MAX_CORRECTED_READ_ERRORS);
 	mddev->safemode = 0;
@@ -5451,7 +5526,6 @@ static int restart_array(struct mddev *mddev)
 			return -EINVAL;
 	}
 
-	mddev->safemode = 0;
 	mddev->ro = 0;
 	set_disk_ro(disk, 0);
 	pr_debug("md: %s switched to read-write mode.\n", mdname(mddev));
@@ -7767,9 +7841,7 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
 		md_wakeup_thread(mddev->sync_thread);
 		did_change = 1;
 	}
-	atomic_inc(&mddev->writes_pending);
-	if (mddev->safemode == 1)
-		mddev->safemode = 0;
+	inc_writes_pending(mddev);
 	if (mddev->in_sync) {
 		spin_lock(&mddev->lock);
 		if (mddev->in_sync) {
@@ -7790,12 +7862,17 @@ EXPORT_SYMBOL(md_write_start);
 
 void md_write_end(struct mddev *mddev)
 {
-	if (atomic_dec_and_test(&mddev->writes_pending)) {
-		if (mddev->safemode == 2)
+	this_cpu_dec(*mddev->writes_pending_percpu);
+	if (mddev->safemode == 2) {
+		if (!__writes_pending(mddev))
 			md_wakeup_thread(mddev->thread);
-		else if (mddev->safemode_delay)
-			mod_timer(&mddev->safemode_timer, jiffies + mddev->safemode_delay);
-	}
+	} else if (mddev->safemode_delay)
+		/* The roundup() ensure this only performs locking once
+		 * every ->safemode_delay jiffies
+		 */
+		mod_timer(&mddev->safemode_timer,
+			  roundup(jiffies, mddev->safemode_delay) + mddev->safemode_delay);
+
 }
 EXPORT_SYMBOL(md_write_end);
 
@@ -8398,7 +8475,7 @@ void md_check_recovery(struct mddev *mddev)
 		test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
 		test_bit(MD_RELOAD_SB, &mddev->flags) ||
 		(mddev->external == 0 && mddev->safemode == 1) ||
-		(mddev->safemode == 2 && ! atomic_read(&mddev->writes_pending)
+		(mddev->safemode == 2 && ! __writes_pending(mddev)
 		 && !mddev->in_sync && mddev->recovery_cp == MaxSector)
 		))
 		return;
@@ -8450,19 +8527,14 @@ void md_check_recovery(struct mddev *mddev)
 				md_reload_sb(mddev, mddev->good_device_nr);
 		}
 
-		if (!mddev->external) {
+		if (!mddev->external && mddev->safemode) {
 			int did_change = 0;
 			spin_lock(&mddev->lock);
-			if (mddev->safemode &&
-			    !atomic_read(&mddev->writes_pending) &&
-			    !mddev->in_sync &&
-			    mddev->recovery_cp == MaxSector) {
-				mddev->in_sync = 1;
+			if (mddev->recovery_cp == MaxSector &&
+			    set_in_sync(mddev)) {
 				did_change = 1;
 				set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
 			}
-			if (mddev->safemode == 1)
-				mddev->safemode = 0;
 			spin_unlock(&mddev->lock);
 			if (did_change)
 				sysfs_notify_dirent_safe(mddev->sysfs_state);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 2a514036a83d..7e41f882d33d 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -404,7 +404,8 @@ struct mddev {
 							 */
 	unsigned int			safemode_delay;
 	struct timer_list		safemode_timer;
-	atomic_t			writes_pending;
+	long				*writes_pending_percpu;
+	int				checkers;	/* # of threads checking writes_pending */
 	struct request_queue		*queue;	/* for plugging ... */
 
 	struct bitmap			*bitmap; /* the bitmap for the device */
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3c7e106c12a2..45d260326efd 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7360,7 +7360,7 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 		 * we can't wait pending write here, as this is called in
 		 * raid5d, wait will deadlock.
 		 */
-		if (atomic_read(&mddev->writes_pending))
+		if (!mddev->in_sync)
 			return -EBUSY;
 		log = conf->log;
 		conf->log = NULL;

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

* Re: [PATCH 1/5] MD: attach data to each bio
From: Shaohua Li @ 2017-02-10  6:47 UTC (permalink / raw)
  To: NeilBrown; +Cc: Shaohua Li, linux-raid, khlebnikov, hch
In-Reply-To: <87r336tw5l.fsf@notabene.neil.brown.name>

On Fri, Feb 10, 2017 at 05:08:54PM +1100, Neil Brown wrote:
> On Tue, Feb 07 2017, Shaohua Li wrote:
> 
> > Currently MD is rebusing some bio fields. To remove the hack, we attach
> > extra data to each bio. Each personablity can attach extra data to the
> > bios, so we don't need to rebuse bio fields.
> 
> I must say that I don't really like this approach.
> Temporarily modifying ->bi_private and ->bi_end_io seems
> .... intrusive.   I suspect it works, but I wonder if it is really
> robust in the long term.
> 
> How about a different approach..  Your main concern with my first patch
> was that it called md_write_start() and md_write_end() much more often,
> and these performed atomic ops on "global" variables, particular
> writes_pending.
> 
> We could change writes_pending to a per-cpu array which we only count
> occasionally when needed.  As writes_pending is updated often and
> checked rarely, a per-cpu array which is summed on demand seems
> appropriate.
> 
> The following patch is an early draft - it doesn't obviously fail and
> isn't obviously wrong to me.  There is certainly room for improvement
> and may be bugs.
> Next week I'll work on collection the re-factoring into separate
> patches, which are possible good-to-have anyway.

For your first patch, I don't have much concern. It's ok to me. What I don't
like is the bi_phys_segments handling part. The patches add a lot of logic to
handle the reference count. They should work, but I'd say it's not easy to
understand and could be error prone. What we really need is a reference count
for the bio, so let's just add a reference count. That's my logic and it's
simple.

For the modifying bi_private and bi_end_io part, I saw some filesystems are
using this way, at least btrfs. If this is really intrusive, is cloning a bio
better?

Thanks,
Shaohua

> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 8926fb781cdc..883c409902b0 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -64,6 +64,8 @@
>  #include <linux/raid/md_p.h>
>  #include <linux/raid/md_u.h>
>  #include <linux/slab.h>
> +#include <linux/percpu.h>
> +
>  #include <trace/events/block.h>
>  #include "md.h"
>  #include "bitmap.h"
> @@ -2250,6 +2252,81 @@ static void export_array(struct mddev *mddev)
>  	mddev->major_version = 0;
>  }
>  
> +/*
> + * The percpu writes_pending counters are linked with ->checkers and ->lock.
> + * If ->writes_pending can always be decremented without a lock.
> + * It can only be incremented without a lock if ->checkers is 0 and the test+incr
> + * happen in a rcu_readlocked region.
> + * ->checkers can only be changed under ->lock protection.
> + * To determine if ->writes_pending is totally zero, a quick sum without
> + * locks can be performed. If this is non-zero, then the result is final.
> + * Otherwise ->checkers must be incremented and synchronize_rcu() called.
> + * Then a sum calculated under ->lock, and the result is final until the
> + * ->checkers is decremented, or the lock is dropped.
> + *
> + */
> +
> +static bool __writes_pending(struct mddev *mddev)
> +{
> +	long cnt = 0;
> +	int i;
> +	for_each_possible_cpu(i)
> +		cnt += *per_cpu_ptr(mddev->writes_pending_percpu, i);
> +	return cnt != 0;
> +}
> +
> +static bool set_in_sync(struct mddev *mddev)
> +{
> +
> +	WARN_ON_ONCE(!spin_is_locked(&mddev->lock));
> +	if (__writes_pending(mddev)) {
> +		if (mddev->safemode == 1)
> +			mddev->safemode = 0;
> +		return false;
> +	}
> +	if (mddev->in_sync)
> +		return false;
> +
> +	mddev->checkers ++;
> +	spin_unlock(&mddev->lock);
> +	synchronize_rcu();
> +	spin_lock(&mddev->lock);
> +	if (mddev->in_sync) {
> +		/* someone else set it */
> +		mddev->checkers --;
> +		return false;
> +	}
> +
> +	if (! __writes_pending(mddev))
> +		mddev->in_sync = 1;
> +	if (mddev->safemode == 1)
> +		mddev->safemode = 0;
> +	mddev->checkers --;
> +	return mddev->in_sync;
> +}
> +
> +static void inc_writes_pending(struct mddev *mddev)
> +{
> +	rcu_read_lock();
> +	if (mddev->checkers == 0) {
> +		__this_cpu_inc(*mddev->writes_pending_percpu);
> +		rcu_read_unlock();
> +		return;
> +	}
> +	rcu_read_unlock();
> +	/* Need that spinlock */
> +	spin_lock(&mddev->lock);
> +	this_cpu_inc(*mddev->writes_pending_percpu);
> +	spin_unlock(&mddev->lock);
> +}
> +
> +static void zero_writes_pending(struct mddev *mddev)
> +{
> +	int i;
> +	for_each_possible_cpu(i)
> +		*per_cpu_ptr(mddev->writes_pending_percpu, i) = 0;
> +}
> +
>  static void sync_sbs(struct mddev *mddev, int nospares)
>  {
>  	/* Update each superblock (in-memory image), but
> @@ -3583,7 +3660,6 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
>  	mddev->delta_disks = 0;
>  	mddev->reshape_backwards = 0;
>  	mddev->degraded = 0;
> -	spin_unlock(&mddev->lock);
>  
>  	if (oldpers->sync_request == NULL &&
>  	    mddev->external) {
> @@ -3596,8 +3672,8 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
>  		 */
>  		mddev->in_sync = 0;
>  		mddev->safemode_delay = 0;
> -		mddev->safemode = 0;
>  	}
> +	spin_unlock(&mddev->lock);
>  
>  	oldpers->free(mddev, oldpriv);
>  
> @@ -3963,16 +4039,11 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
>  			wake_up(&mddev->sb_wait);
>  			err = 0;
>  		} else /* st == clean */ {
> +			err = 0;
>  			restart_array(mddev);
> -			if (atomic_read(&mddev->writes_pending) == 0) {
> -				if (mddev->in_sync == 0) {
> -					mddev->in_sync = 1;
> -					if (mddev->safemode == 1)
> -						mddev->safemode = 0;
> -					set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
> -				}
> -				err = 0;
> -			} else
> +			if (set_in_sync(mddev))
> +				set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
> +			else if (!mddev->in_sync)
>  				err = -EBUSY;
>  		}
>  		if (!err)
> @@ -4030,15 +4101,9 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
>  			if (err)
>  				break;
>  			spin_lock(&mddev->lock);
> -			if (atomic_read(&mddev->writes_pending) == 0) {
> -				if (mddev->in_sync == 0) {
> -					mddev->in_sync = 1;
> -					if (mddev->safemode == 1)
> -						mddev->safemode = 0;
> -					set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
> -				}
> -				err = 0;
> -			} else
> +			if (set_in_sync(mddev))
> +				set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
> +			else if (!mddev->in_sync)
>  				err = -EBUSY;
>  			spin_unlock(&mddev->lock);
>  		} else
> @@ -4993,6 +5058,9 @@ static void md_free(struct kobject *ko)
>  		del_gendisk(mddev->gendisk);
>  		put_disk(mddev->gendisk);
>  	}
> +	if (mddev->writes_pending_percpu) {
> +		free_percpu(mddev->writes_pending_percpu);
> +	}
>  
>  	kfree(mddev);
>  }
> @@ -5069,6 +5137,13 @@ static int md_alloc(dev_t dev, char *name)
>  	blk_queue_make_request(mddev->queue, md_make_request);
>  	blk_set_stacking_limits(&mddev->queue->limits);
>  
> +	mddev->writes_pending_percpu = alloc_percpu(long);
> +	if (!mddev->writes_pending_percpu) {
> +		blk_cleanup_queue(mddev->queue);
> +		mddev->queue = NULL;
> +		goto abort;
> +	}
> +
>  	disk = alloc_disk(1 << shift);
>  	if (!disk) {
>  		blk_cleanup_queue(mddev->queue);
> @@ -5152,7 +5227,7 @@ static void md_safemode_timeout(unsigned long data)
>  {
>  	struct mddev *mddev = (struct mddev *) data;
>  
> -	if (!atomic_read(&mddev->writes_pending)) {
> +	if (!__writes_pending(mddev)) {
>  		mddev->safemode = 1;
>  		if (mddev->external)
>  			sysfs_notify_dirent_safe(mddev->sysfs_state);
> @@ -5358,7 +5433,7 @@ int md_run(struct mddev *mddev)
>  	} else if (mddev->ro == 2) /* auto-readonly not meaningful */
>  		mddev->ro = 0;
>  
> -	atomic_set(&mddev->writes_pending,0);
> +	zero_writes_pending(mddev);
>  	atomic_set(&mddev->max_corr_read_errors,
>  		   MD_DEFAULT_MAX_CORRECTED_READ_ERRORS);
>  	mddev->safemode = 0;
> @@ -5451,7 +5526,6 @@ static int restart_array(struct mddev *mddev)
>  			return -EINVAL;
>  	}
>  
> -	mddev->safemode = 0;
>  	mddev->ro = 0;
>  	set_disk_ro(disk, 0);
>  	pr_debug("md: %s switched to read-write mode.\n", mdname(mddev));
> @@ -7767,9 +7841,7 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
>  		md_wakeup_thread(mddev->sync_thread);
>  		did_change = 1;
>  	}
> -	atomic_inc(&mddev->writes_pending);
> -	if (mddev->safemode == 1)
> -		mddev->safemode = 0;
> +	inc_writes_pending(mddev);
>  	if (mddev->in_sync) {
>  		spin_lock(&mddev->lock);
>  		if (mddev->in_sync) {
> @@ -7790,12 +7862,17 @@ EXPORT_SYMBOL(md_write_start);
>  
>  void md_write_end(struct mddev *mddev)
>  {
> -	if (atomic_dec_and_test(&mddev->writes_pending)) {
> -		if (mddev->safemode == 2)
> +	this_cpu_dec(*mddev->writes_pending_percpu);
> +	if (mddev->safemode == 2) {
> +		if (!__writes_pending(mddev))
>  			md_wakeup_thread(mddev->thread);
> -		else if (mddev->safemode_delay)
> -			mod_timer(&mddev->safemode_timer, jiffies + mddev->safemode_delay);
> -	}
> +	} else if (mddev->safemode_delay)
> +		/* The roundup() ensure this only performs locking once
> +		 * every ->safemode_delay jiffies
> +		 */
> +		mod_timer(&mddev->safemode_timer,
> +			  roundup(jiffies, mddev->safemode_delay) + mddev->safemode_delay);
> +
>  }
>  EXPORT_SYMBOL(md_write_end);
>  
> @@ -8398,7 +8475,7 @@ void md_check_recovery(struct mddev *mddev)
>  		test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
>  		test_bit(MD_RELOAD_SB, &mddev->flags) ||
>  		(mddev->external == 0 && mddev->safemode == 1) ||
> -		(mddev->safemode == 2 && ! atomic_read(&mddev->writes_pending)
> +		(mddev->safemode == 2 && ! __writes_pending(mddev)
>  		 && !mddev->in_sync && mddev->recovery_cp == MaxSector)
>  		))
>  		return;
> @@ -8450,19 +8527,14 @@ void md_check_recovery(struct mddev *mddev)
>  				md_reload_sb(mddev, mddev->good_device_nr);
>  		}
>  
> -		if (!mddev->external) {
> +		if (!mddev->external && mddev->safemode) {
>  			int did_change = 0;
>  			spin_lock(&mddev->lock);
> -			if (mddev->safemode &&
> -			    !atomic_read(&mddev->writes_pending) &&
> -			    !mddev->in_sync &&
> -			    mddev->recovery_cp == MaxSector) {
> -				mddev->in_sync = 1;
> +			if (mddev->recovery_cp == MaxSector &&
> +			    set_in_sync(mddev)) {
>  				did_change = 1;
>  				set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
>  			}
> -			if (mddev->safemode == 1)
> -				mddev->safemode = 0;
>  			spin_unlock(&mddev->lock);
>  			if (did_change)
>  				sysfs_notify_dirent_safe(mddev->sysfs_state);
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 2a514036a83d..7e41f882d33d 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -404,7 +404,8 @@ struct mddev {
>  							 */
>  	unsigned int			safemode_delay;
>  	struct timer_list		safemode_timer;
> -	atomic_t			writes_pending;
> +	long				*writes_pending_percpu;
> +	int				checkers;	/* # of threads checking writes_pending */
>  	struct request_queue		*queue;	/* for plugging ... */
>  
>  	struct bitmap			*bitmap; /* the bitmap for the device */
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 3c7e106c12a2..45d260326efd 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -7360,7 +7360,7 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
>  		 * we can't wait pending write here, as this is called in
>  		 * raid5d, wait will deadlock.
>  		 */
> -		if (atomic_read(&mddev->writes_pending))
> +		if (!mddev->in_sync)
>  			return -EBUSY;
>  		log = conf->log;
>  		conf->log = NULL;



^ permalink raw reply

* [PATCH v3 0/4] Broadcom SBA RAID support
From: Anup Patel @ 2017-02-10  9:07 UTC (permalink / raw)
  To: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar
  Cc: devicetree, Anup Patel, Scott Branden, Jon Mason, Ray Jui,
	linux-kernel, linux-raid, bcm-kernel-feedback-list, linux-crypto,
	Rob Rice, dmaengine, Dan Williams, linux-arm-kernel

The Broadcom SBA RAID is a stream-based device which provides
RAID5/6 offload.

It requires a SoC specific ring manager (such as Broadcom FlexRM
ring manager) to provide ring-based programming interface. Due to
this, the Broadcom SBA RAID driver (mailbox client) implements
DMA device having one DMA channel using a set of mailbox channels
provided by Broadcom SoC specific ring manager driver (mailbox
controller).

The Broadcom SBA RAID hardware requires PQ disk position instead
of PQ disk coefficient. To address this, we have added raid_gflog
table which will help driver to convert PQ disk coefficient to PQ
disk position.

This patchset is based on Linux-4.10-rc2 and depends on patchset
"[PATCH v4 0/2] Broadcom FlexRM ring manager support"

It is also available at sba-raid-v3 branch of
https://github.com/Broadcom/arm64-linux.git

Changes since v2:
 - Droped patch to handle DMA devices having support for fewer
   PQ coefficients in Linux Async Tx
 - Added work-around in bcm-sba-raid driver to handle unsupported
   PQ coefficients using multiple SBA requests

Changes since v1:
 - Droped patch to add mbox_channel_device() API
 - Used GENMASK and BIT macros wherever possible in bcm-sba-raid driver
 - Replaced C_MDATA macros with static inline functions in
   bcm-sba-raid driver
 - Removed sba_alloc_chan_resources() callback in bcm-sba-raid driver
 - Used dev_err() instead of dev_info() wherever applicable
 - Removed call to sba_issue_pending() from sba_tx_submit() in
   bcm-sba-raid driver
 - Implemented SBA request chaning for handling (len > sba->req_size)
   in bcm-sba-raid driver
 - Implemented device_terminate_all() callback in bcm-sba-raid driver

Anup Patel (4):
  lib/raid6: Add log-of-2 table for RAID6 HW requiring disk position
  async_tx: Fix DMA_PREP_FENCE usage in do_async_gen_syndrome()
  dmaengine: Add Broadcom SBA RAID driver
  dt-bindings: Add DT bindings document for Broadcom SBA RAID driver

 .../devicetree/bindings/dma/brcm,iproc-sba.txt     |   29 +
 crypto/async_tx/async_pq.c                         |    5 +-
 drivers/dma/Kconfig                                |   13 +
 drivers/dma/Makefile                               |    1 +
 drivers/dma/bcm-sba-raid.c                         | 1711 ++++++++++++++++++++
 include/linux/raid/pq.h                            |    1 +
 lib/raid6/mktables.c                               |   20 +
 7 files changed, 1777 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/dma/brcm,iproc-sba.txt
 create mode 100644 drivers/dma/bcm-sba-raid.c

-- 
2.7.4

^ permalink raw reply

* [PATCH v3 1/4] lib/raid6: Add log-of-2 table for RAID6 HW requiring disk position
From: Anup Patel @ 2017-02-10  9:07 UTC (permalink / raw)
  To: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar
  Cc: devicetree, Anup Patel, Scott Branden, Jon Mason, Ray Jui,
	linux-kernel, linux-raid, bcm-kernel-feedback-list, linux-crypto,
	Rob Rice, dmaengine, Dan Williams, linux-arm-kernel
In-Reply-To: <1486717628-17580-1-git-send-email-anup.patel@broadcom.com>

The raid6_gfexp table represents {2}^n values for 0 <= n < 256. The
Linux async_tx framework pass values from raid6_gfexp as coefficients
for each source to prep_dma_pq() callback of DMA channel with PQ
capability. This creates problem for RAID6 offload engines (such as
Broadcom SBA) which take disk position (i.e. log of {2}) instead of
multiplicative cofficients from raid6_gfexp table.

This patch adds raid6_gflog table having log-of-2 value for any given
x such that 0 <= x < 256. For any given disk coefficient x, the
corresponding disk position is given by raid6_gflog[x]. The RAID6
offload engine driver can use this newly added raid6_gflog table to
get disk position from multiplicative coefficient.

Signed-off-by: Anup Patel <anup.patel@broadcom.com>
Reviewed-by: Scott Branden <scott.branden@broadcom.com>
Reviewed-by: Ray Jui <ray.jui@broadcom.com>
---
 include/linux/raid/pq.h |  1 +
 lib/raid6/mktables.c    | 20 ++++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
index 4d57bba..30f9453 100644
--- a/include/linux/raid/pq.h
+++ b/include/linux/raid/pq.h
@@ -142,6 +142,7 @@ int raid6_select_algo(void);
 extern const u8 raid6_gfmul[256][256] __attribute__((aligned(256)));
 extern const u8 raid6_vgfmul[256][32] __attribute__((aligned(256)));
 extern const u8 raid6_gfexp[256]      __attribute__((aligned(256)));
+extern const u8 raid6_gflog[256]      __attribute__((aligned(256)));
 extern const u8 raid6_gfinv[256]      __attribute__((aligned(256)));
 extern const u8 raid6_gfexi[256]      __attribute__((aligned(256)));
 
diff --git a/lib/raid6/mktables.c b/lib/raid6/mktables.c
index 39787db..e824d08 100644
--- a/lib/raid6/mktables.c
+++ b/lib/raid6/mktables.c
@@ -125,6 +125,26 @@ int main(int argc, char *argv[])
 	printf("EXPORT_SYMBOL(raid6_gfexp);\n");
 	printf("#endif\n");
 
+	/* Compute log-of-2 table */
+	printf("\nconst u8 __attribute__((aligned(256)))\n"
+	       "raid6_gflog[256] =\n" "{\n");
+	for (i = 0; i < 256; i += 8) {
+		printf("\t");
+		for (j = 0; j < 8; j++) {
+			v = 255;
+			for (k = 0; k < 256; k++)
+				if (exptbl[k] == (i + j)) {
+					v = k;
+					break;
+				}
+			printf("0x%02x,%c", v, (j == 7) ? '\n' : ' ');
+		}
+	}
+	printf("};\n");
+	printf("#ifdef __KERNEL__\n");
+	printf("EXPORT_SYMBOL(raid6_gflog);\n");
+	printf("#endif\n");
+
 	/* Compute inverse table x^-1 == x^254 */
 	printf("\nconst u8 __attribute__((aligned(256)))\n"
 	       "raid6_gfinv[256] =\n" "{\n");
-- 
2.7.4

^ permalink raw reply related

* [PATCH v3 2/4] async_tx: Fix DMA_PREP_FENCE usage in do_async_gen_syndrome()
From: Anup Patel @ 2017-02-10  9:07 UTC (permalink / raw)
  To: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar
  Cc: Dan Williams, Ray Jui, Scott Branden, Jon Mason, Rob Rice,
	bcm-kernel-feedback-list, dmaengine, devicetree, linux-arm-kernel,
	linux-kernel, linux-crypto, linux-raid, Anup Patel
In-Reply-To: <1486717628-17580-1-git-send-email-anup.patel@broadcom.com>

The DMA_PREP_FENCE is to be used when preparing Tx descriptor if output
of Tx descriptor is to be used by next/dependent Tx descriptor.

The DMA_PREP_FENSE will not be set correctly in do_async_gen_syndrome()
when calling dma->device_prep_dma_pq() under following conditions:
1. ASYNC_TX_FENCE not set in submit->flags
2. DMA_PREP_FENCE not set in dma_flags
3. src_cnt (= (disks - 2)) is greater than dma_maxpq(dma, dma_flags)

This patch fixes DMA_PREP_FENCE usage in do_async_gen_syndrome() taking
inspiration from do_async_xor() implementation.

Signed-off-by: Anup Patel <anup.patel@broadcom.com>
Reviewed-by: Ray Jui <ray.jui@broadcom.com>
Reviewed-by: Scott Branden <scott.branden@broadcom.com>
---
 crypto/async_tx/async_pq.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/crypto/async_tx/async_pq.c b/crypto/async_tx/async_pq.c
index f83de99..56bd612 100644
--- a/crypto/async_tx/async_pq.c
+++ b/crypto/async_tx/async_pq.c
@@ -62,9 +62,6 @@ do_async_gen_syndrome(struct dma_chan *chan,
 	dma_addr_t dma_dest[2];
 	int src_off = 0;
 
-	if (submit->flags & ASYNC_TX_FENCE)
-		dma_flags |= DMA_PREP_FENCE;
-
 	while (src_cnt > 0) {
 		submit->flags = flags_orig;
 		pq_src_cnt = min(src_cnt, dma_maxpq(dma, dma_flags));
@@ -83,6 +80,8 @@ do_async_gen_syndrome(struct dma_chan *chan,
 			if (cb_fn_orig)
 				dma_flags |= DMA_PREP_INTERRUPT;
 		}
+		if (submit->flags & ASYNC_TX_FENCE)
+			dma_flags |= DMA_PREP_FENCE;
 
 		/* Drivers force forward progress in case they can not provide
 		 * a descriptor
-- 
2.7.4

^ permalink raw reply related

* [PATCH v3 3/4] dmaengine: Add Broadcom SBA RAID driver
From: Anup Patel @ 2017-02-10  9:07 UTC (permalink / raw)
  To: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar
  Cc: Dan Williams, Ray Jui, Scott Branden, Jon Mason, Rob Rice,
	bcm-kernel-feedback-list, dmaengine, devicetree, linux-arm-kernel,
	linux-kernel, linux-crypto, linux-raid, Anup Patel
In-Reply-To: <1486717628-17580-1-git-send-email-anup.patel@broadcom.com>

The Broadcom stream buffer accelerator (SBA) provides offloading
capabilities for RAID operations. This SBA offload engine is
accessible via Broadcom SoC specific ring manager.

This patch adds Broadcom SBA RAID driver which provides one
DMA device with RAID capabilities using one or more Broadcom
SoC specific ring manager channels. The SBA RAID driver in its
current shape implements memcpy, xor, and pq operations.

Signed-off-by: Anup Patel <anup.patel@broadcom.com>
Reviewed-by: Ray Jui <ray.jui@broadcom.com>
---
 drivers/dma/Kconfig        |   13 +
 drivers/dma/Makefile       |    1 +
 drivers/dma/bcm-sba-raid.c | 1711 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 1725 insertions(+)
 create mode 100644 drivers/dma/bcm-sba-raid.c

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 263495d..bf8fb84 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -99,6 +99,19 @@ config AXI_DMAC
 	  controller is often used in Analog Device's reference designs for FPGA
 	  platforms.
 
+config BCM_SBA_RAID
+	tristate "Broadcom SBA RAID engine support"
+	depends on (ARM64 && MAILBOX && RAID6_PQ) || COMPILE_TEST
+	select DMA_ENGINE
+	select DMA_ENGINE_RAID
+	select ASYNC_TX_ENABLE_CHANNEL_SWITCH
+	default ARCH_BCM_IPROC
+	help
+	  Enable support for Broadcom SBA RAID Engine. The SBA RAID
+	  engine is available on most of the Broadcom iProc SoCs. It
+	  has the capability to offload memcpy, xor and pq computation
+	  for raid5/6.
+
 config COH901318
 	bool "ST-Ericsson COH901318 DMA support"
 	select DMA_ENGINE
diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
index a4fa336..ba96bdd 100644
--- a/drivers/dma/Makefile
+++ b/drivers/dma/Makefile
@@ -17,6 +17,7 @@ obj-$(CONFIG_AMCC_PPC440SPE_ADMA) += ppc4xx/
 obj-$(CONFIG_AT_HDMAC) += at_hdmac.o
 obj-$(CONFIG_AT_XDMAC) += at_xdmac.o
 obj-$(CONFIG_AXI_DMAC) += dma-axi-dmac.o
+obj-$(CONFIG_BCM_SBA_RAID) += bcm-sba-raid.o
 obj-$(CONFIG_COH901318) += coh901318.o coh901318_lli.o
 obj-$(CONFIG_DMA_BCM2835) += bcm2835-dma.o
 obj-$(CONFIG_DMA_JZ4740) += dma-jz4740.o
diff --git a/drivers/dma/bcm-sba-raid.c b/drivers/dma/bcm-sba-raid.c
new file mode 100644
index 0000000..bab9918
--- /dev/null
+++ b/drivers/dma/bcm-sba-raid.c
@@ -0,0 +1,1711 @@
+/*
+ * Copyright (C) 2017 Broadcom
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+/*
+ * Broadcom SBA RAID Driver
+ *
+ * The Broadcom stream buffer accelerator (SBA) provides offloading
+ * capabilities for RAID operations. The SBA offload engine is accessible
+ * via Broadcom SoC specific ring manager. Two or more offload engines
+ * can share same Broadcom SoC specific ring manager due to this Broadcom
+ * SoC specific ring manager driver is implemented as a mailbox controller
+ * driver and offload engine drivers are implemented as mallbox clients.
+ *
+ * Typically, Broadcom SoC specific ring manager will implement larger
+ * number of hardware rings over one or more SBA hardware devices. By
+ * design, the internal buffer size of SBA hardware device is limited
+ * but all offload operations supported by SBA can be broken down into
+ * multiple small size requests and executed parallely on multiple SBA
+ * hardware devices for achieving high through-put.
+ *
+ * The Broadcom SBA RAID driver does not require any register programming
+ * except submitting request to SBA hardware device via mailbox channels.
+ * This driver implements a DMA device with one DMA channel using a set
+ * of mailbox channels provided by Broadcom SoC specific ring manager
+ * driver. To exploit parallelism (as described above), all DMA request
+ * coming to SBA RAID DMA channel are broken down to smaller requests
+ * and submitted to multiple mailbox channels in round-robin fashion.
+ * For having more SBA DMA channels, we can create more SBA device nodes
+ * in Broadcom SoC specific DTS based on number of hardware rings supported
+ * by Broadcom SoC ring manager.
+ */
+
+#include <linux/bitops.h>
+#include <linux/dma-mapping.h>
+#include <linux/dmaengine.h>
+#include <linux/list.h>
+#include <linux/mailbox_client.h>
+#include <linux/mailbox/brcm-message.h>
+#include <linux/module.h>
+#include <linux/of_device.h>
+#include <linux/slab.h>
+#include <linux/raid/pq.h>
+
+#include "dmaengine.h"
+
+/* SBA command helper macros */
+#define SBA_DEC(_d, _s, _m)		(((_d) >> (_s)) & (_m))
+#define SBA_ENC(_d, _v, _s, _m)					\
+		do {						\
+			(_d) &= ~((u64)(_m) << (_s));		\
+			(_d) |= (((u64)(_v) & (_m)) << (_s));	\
+		} while (0)
+
+/* SBA command related defines */
+#define SBA_TYPE_SHIFT					48
+#define SBA_TYPE_MASK					GENMASK(1, 0)
+#define SBA_TYPE_A					0x0
+#define SBA_TYPE_B					0x2
+#define SBA_TYPE_C					0x3
+#define SBA_USER_DEF_SHIFT				32
+#define SBA_USER_DEF_MASK				GENMASK(15, 0)
+#define SBA_R_MDATA_SHIFT				24
+#define SBA_R_MDATA_MASK				GENMASK(7, 0)
+#define SBA_C_MDATA_MS_SHIFT				18
+#define SBA_C_MDATA_MS_MASK				GENMASK(1, 0)
+#define SBA_INT_SHIFT					17
+#define SBA_INT_MASK					BIT(0)
+#define SBA_RESP_SHIFT					16
+#define SBA_RESP_MASK					BIT(0)
+#define SBA_C_MDATA_SHIFT				8
+#define SBA_C_MDATA_MASK				GENMASK(7, 0)
+#define SBA_C_MDATA_BNUMx_SHIFT(__bnum)			(2 * (__bnum))
+#define SBA_C_MDATA_BNUMx_MASK				GENMASK(1, 0)
+#define SBA_C_MDATA_DNUM_SHIFT				5
+#define SBA_C_MDATA_DNUM_MASK				GENMASK(4, 0)
+#define SBA_C_MDATA_LS(__v)				((__v) & 0xff)
+#define SBA_C_MDATA_MS(__v)				(((__v) >> 8) & 0x3)
+#define SBA_CMD_SHIFT					0
+#define SBA_CMD_MASK					GENMASK(3, 0)
+#define SBA_CMD_ZERO_BUFFER				0x4
+#define SBA_CMD_ZERO_ALL_BUFFERS			0x8
+#define SBA_CMD_LOAD_BUFFER				0x9
+#define SBA_CMD_XOR					0xa
+#define SBA_CMD_GALOIS_XOR				0xb
+#define SBA_CMD_WRITE_BUFFER				0xc
+#define SBA_CMD_GALOIS					0xe
+
+/* Driver helper macros */
+#define to_sba_request(tx)		\
+	container_of(tx, struct sba_request, tx)
+#define to_sba_device(dchan)		\
+	container_of(dchan, struct sba_device, dma_chan)
+
+enum sba_request_state {
+	SBA_REQUEST_STATE_FREE = 1,
+	SBA_REQUEST_STATE_ALLOCED = 2,
+	SBA_REQUEST_STATE_PENDING = 3,
+	SBA_REQUEST_STATE_ACTIVE = 4,
+	SBA_REQUEST_STATE_RECEIVED = 5,
+	SBA_REQUEST_STATE_COMPLETED = 6,
+	SBA_REQUEST_STATE_ABORTED = 7,
+};
+
+struct sba_request {
+	/* Global state */
+	struct list_head node;
+	struct sba_device *sba;
+	enum sba_request_state state;
+	bool fence;
+	/* Chained requests management */
+	struct sba_request *first;
+	struct list_head next;
+	unsigned int next_count;
+	atomic_t next_pending_count;
+	/* BRCM message data */
+	void *resp;
+	dma_addr_t resp_dma;
+	struct brcm_sba_command *cmds;
+	struct brcm_message msg;
+	struct dma_async_tx_descriptor tx;
+};
+
+enum sba_version {
+	SBA_VER_1 = 0,
+	SBA_VER_2
+};
+
+struct sba_device {
+	/* Underlying device */
+	struct device *dev;
+	/* DT configuration parameters */
+	enum sba_version ver;
+	/* Derived configuration parameters */
+	u32 max_req;
+	u32 hw_buf_size;
+	u32 hw_resp_size;
+	u32 max_pq_coefs;
+	u32 max_pq_srcs;
+	u32 max_cmd_per_req;
+	u32 max_xor_srcs;
+	u32 max_resp_pool_size;
+	u32 max_cmds_pool_size;
+	/* Maibox client and Mailbox channels */
+	struct mbox_client client;
+	int mchans_count;
+	atomic_t mchans_current;
+	struct mbox_chan **mchans;
+	struct device *mbox_dev;
+	/* DMA device and DMA channel */
+	struct dma_device dma_dev;
+	struct dma_chan dma_chan;
+	/* DMA channel resources */
+	void *resp_base;
+	dma_addr_t resp_dma_base;
+	void *cmds_base;
+	dma_addr_t cmds_dma_base;
+	spinlock_t reqs_lock;
+	struct sba_request *reqs;
+	bool reqs_fence;
+	struct list_head reqs_alloc_list;
+	struct list_head reqs_pending_list;
+	struct list_head reqs_active_list;
+	struct list_head reqs_received_list;
+	struct list_head reqs_completed_list;
+	struct list_head reqs_aborted_list;
+	struct list_head reqs_free_list;
+	int reqs_free_count;
+};
+
+/* ====== C_MDATA helper routines ===== */
+
+static inline u32 sba_cmd_load_c_mdata(u32 b0)
+{
+	return b0 & SBA_C_MDATA_BNUMx_MASK;
+}
+
+static inline u32 sba_cmd_write_c_mdata(u32 b0)
+{
+	return b0 & SBA_C_MDATA_BNUMx_MASK;
+}
+
+static inline u32 sba_cmd_xor_c_mdata(u32 b1, u32 b0)
+{
+	return (b0 & SBA_C_MDATA_BNUMx_MASK) |
+	       ((b1 & SBA_C_MDATA_BNUMx_MASK) << SBA_C_MDATA_BNUMx_SHIFT(1));
+}
+
+static inline u32 sba_cmd_pq_c_mdata(u32 d, u32 b1, u32 b0)
+{
+	return (b0 & SBA_C_MDATA_BNUMx_MASK) |
+	       ((b1 & SBA_C_MDATA_BNUMx_MASK) << SBA_C_MDATA_BNUMx_SHIFT(1)) |
+	       ((d & SBA_C_MDATA_DNUM_MASK) << SBA_C_MDATA_DNUM_SHIFT);
+}
+
+/* ====== Channel resource management routines ===== */
+
+static struct sba_request *sba_alloc_request(struct sba_device *sba)
+{
+	unsigned long flags;
+	struct sba_request *req = NULL;
+
+	spin_lock_irqsave(&sba->reqs_lock, flags);
+
+	if (!list_empty(&sba->reqs_free_list)) {
+		req = list_first_entry(&sba->reqs_free_list,
+				       struct sba_request,
+				       node);
+
+		list_move_tail(&req->node, &sba->reqs_alloc_list);
+		req->state = SBA_REQUEST_STATE_ALLOCED;
+		req->fence = false;
+		req->first = req;
+		INIT_LIST_HEAD(&req->next);
+		req->next_count = 1;
+		atomic_set(&req->next_pending_count, 1);
+
+		sba->reqs_free_count--;
+
+		dma_async_tx_descriptor_init(&req->tx, &sba->dma_chan);
+	}
+
+	spin_unlock_irqrestore(&sba->reqs_lock, flags);
+
+	return req;
+}
+
+/* Note: Must be called with sba->reqs_lock held */
+static void _sba_pending_request(struct sba_device *sba,
+				 struct sba_request *req)
+{
+	req->state = SBA_REQUEST_STATE_PENDING;
+	list_move_tail(&req->node, &sba->reqs_pending_list);
+	if (list_empty(&sba->reqs_active_list))
+		sba->reqs_fence = false;
+}
+
+/* Note: Must be called with sba->reqs_lock held */
+static bool _sba_active_request(struct sba_device *sba,
+				struct sba_request *req)
+{
+	if (list_empty(&sba->reqs_active_list))
+		sba->reqs_fence = false;
+	if (sba->reqs_fence)
+		return false;
+	req->state = SBA_REQUEST_STATE_ACTIVE;
+	list_move_tail(&req->node, &sba->reqs_active_list);
+	if (req->fence)
+		sba->reqs_fence = true;
+	return true;
+}
+
+/* Note: Must be called with sba->reqs_lock held */
+static void _sba_abort_request(struct sba_device *sba,
+			       struct sba_request *req)
+{
+	req->state = SBA_REQUEST_STATE_ABORTED;
+	list_move_tail(&req->node, &sba->reqs_aborted_list);
+	if (list_empty(&sba->reqs_active_list))
+		sba->reqs_fence = false;
+}
+
+/* Note: Must be called with sba->reqs_lock held */
+static void _sba_free_request(struct sba_device *sba,
+			      struct sba_request *req)
+{
+	req->state = SBA_REQUEST_STATE_FREE;
+	list_move_tail(&req->node, &sba->reqs_free_list);
+	if (list_empty(&sba->reqs_active_list))
+		sba->reqs_fence = false;
+	sba->reqs_free_count++;
+}
+
+static void sba_received_request(struct sba_request *req)
+{
+	unsigned long flags;
+	struct sba_device *sba = req->sba;
+
+	spin_lock_irqsave(&sba->reqs_lock, flags);
+	req->state = SBA_REQUEST_STATE_RECEIVED;
+	list_move_tail(&req->node, &sba->reqs_received_list);
+	spin_unlock_irqrestore(&sba->reqs_lock, flags);
+}
+
+static void sba_complete_chained_requests(struct sba_request *req)
+{
+	unsigned long flags;
+	struct sba_request *nreq;
+	struct sba_device *sba = req->sba;
+
+	spin_lock_irqsave(&sba->reqs_lock, flags);
+
+	req->state = SBA_REQUEST_STATE_COMPLETED;
+	list_move_tail(&req->node, &sba->reqs_completed_list);
+	list_for_each_entry(nreq, &req->next, next) {
+		nreq->state = SBA_REQUEST_STATE_COMPLETED;
+		list_move_tail(&nreq->node, &sba->reqs_completed_list);
+	}
+	if (list_empty(&sba->reqs_active_list))
+		sba->reqs_fence = false;
+
+	spin_unlock_irqrestore(&sba->reqs_lock, flags);
+}
+
+static void sba_free_chained_requests(struct sba_request *req)
+{
+	unsigned long flags;
+	struct sba_request *nreq;
+	struct sba_device *sba = req->sba;
+
+	spin_lock_irqsave(&sba->reqs_lock, flags);
+
+	_sba_free_request(sba, req);
+	list_for_each_entry(nreq, &req->next, next) {
+		_sba_free_request(sba, nreq);
+	}
+
+	spin_unlock_irqrestore(&sba->reqs_lock, flags);
+}
+
+static void sba_chain_request(struct sba_request *first,
+			      struct sba_request *req)
+{
+	unsigned long flags;
+	struct sba_device *sba = req->sba;
+
+	spin_lock_irqsave(&sba->reqs_lock, flags);
+
+	list_add_tail(&req->next, &first->next);
+	req->first = first;
+	first->next_count++;
+	atomic_set(&first->next_pending_count, first->next_count);
+
+	spin_unlock_irqrestore(&sba->reqs_lock, flags);
+}
+
+static void sba_cleanup_nonpending_requests(struct sba_device *sba)
+{
+	unsigned long flags;
+	struct sba_request *req, *req1;
+
+	spin_lock_irqsave(&sba->reqs_lock, flags);
+
+	/* Freeup all alloced request */
+	list_for_each_entry_safe(req, req1, &sba->reqs_alloc_list, node) {
+		_sba_free_request(sba, req);
+	}
+
+	/* Freeup all received request */
+	list_for_each_entry_safe(req, req1, &sba->reqs_received_list, node) {
+		_sba_free_request(sba, req);
+	}
+
+	/* Freeup all completed request */
+	list_for_each_entry_safe(req, req1, &sba->reqs_completed_list, node) {
+		_sba_free_request(sba, req);
+	}
+
+	/* Set all active requests as aborted */
+	list_for_each_entry_safe(req, req1, &sba->reqs_active_list, node) {
+		_sba_abort_request(sba, req);
+	}
+
+	/*
+	 * Note: We expect that aborted request will be eventually
+	 * freed by sba_receive_message()
+	 */
+
+	spin_unlock_irqrestore(&sba->reqs_lock, flags);
+}
+
+static void sba_cleanup_pending_requests(struct sba_device *sba)
+{
+	unsigned long flags;
+	struct sba_request *req, *req1;
+
+	spin_lock_irqsave(&sba->reqs_lock, flags);
+
+	/* Freeup all pending request */
+	list_for_each_entry_safe(req, req1, &sba->reqs_pending_list, node) {
+		/* Freeup rest of the pending request */
+		_sba_free_request(sba, req);
+	}
+
+	spin_unlock_irqrestore(&sba->reqs_lock, flags);
+}
+
+/* ====== DMAENGINE callbacks ===== */
+
+static void sba_free_chan_resources(struct dma_chan *dchan)
+{
+	/*
+	 * Channel resources are pre-alloced so we just free-up
+	 * whatever we can so that we can re-use pre-alloced
+	 * channel resources next time.
+	 */
+	sba_cleanup_nonpending_requests(to_sba_device(dchan));
+}
+
+static int sba_device_terminate_all(struct dma_chan *dchan)
+{
+	/* Cleanup all pending requests */
+	sba_cleanup_pending_requests(to_sba_device(dchan));
+
+	return 0;
+}
+
+static int sba_send_mbox_request(struct sba_device *sba,
+				 struct sba_request *req)
+{
+	int mchans_idx, ret = 0;
+
+	/* Select mailbox channel in round-robin fashion */
+	mchans_idx = atomic_inc_return(&sba->mchans_current);
+	mchans_idx = mchans_idx % sba->mchans_count;
+
+	/* Send message for the request */
+	req->msg.error = 0;
+	ret = mbox_send_message(sba->mchans[mchans_idx], &req->msg);
+	if (ret < 0) {
+		dev_err(sba->dev, "send message failed with error %d", ret);
+		return ret;
+	}
+	ret = req->msg.error;
+	if (ret < 0) {
+		dev_err(sba->dev, "message error %d", ret);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void sba_issue_pending(struct dma_chan *dchan)
+{
+	int ret;
+	unsigned long flags;
+	struct sba_request *req, *req1;
+	struct sba_device *sba = to_sba_device(dchan);
+
+	spin_lock_irqsave(&sba->reqs_lock, flags);
+
+	/* Process all pending request */
+	list_for_each_entry_safe(req, req1, &sba->reqs_pending_list, node) {
+		/* Try to make request active */
+		if (!_sba_active_request(sba, req))
+			break;
+
+		/* Send request to mailbox channel */
+		spin_unlock_irqrestore(&sba->reqs_lock, flags);
+		ret = sba_send_mbox_request(sba, req);
+		spin_lock_irqsave(&sba->reqs_lock, flags);
+
+		/* If something went wrong then keep request pending */
+		if (ret < 0) {
+			_sba_pending_request(sba, req);
+			break;
+		}
+	}
+
+	spin_unlock_irqrestore(&sba->reqs_lock, flags);
+}
+
+static dma_cookie_t sba_tx_submit(struct dma_async_tx_descriptor *tx)
+{
+	unsigned long flags;
+	dma_cookie_t cookie;
+	struct sba_device *sba;
+	struct sba_request *req, *nreq;
+
+	if (unlikely(!tx))
+		return -EINVAL;
+
+	sba = to_sba_device(tx->chan);
+	req = to_sba_request(tx);
+
+	/* Assign cookie and mark all chained requests pending */
+	spin_lock_irqsave(&sba->reqs_lock, flags);
+	cookie = dma_cookie_assign(tx);
+	_sba_pending_request(sba, req);
+	list_for_each_entry(nreq, &req->next, next) {
+		_sba_pending_request(sba, nreq);
+	}
+	spin_unlock_irqrestore(&sba->reqs_lock, flags);
+
+	return cookie;
+}
+
+static enum dma_status sba_tx_status(struct dma_chan *dchan,
+				     dma_cookie_t cookie,
+				     struct dma_tx_state *txstate)
+{
+	int mchan_idx;
+	enum dma_status ret;
+	struct sba_device *sba = to_sba_device(dchan);
+
+	for (mchan_idx = 0; mchan_idx < sba->mchans_count; mchan_idx++)
+		mbox_client_peek_data(sba->mchans[mchan_idx]);
+
+	ret = dma_cookie_status(dchan, cookie, txstate);
+	if (ret == DMA_COMPLETE)
+		return ret;
+
+	return dma_cookie_status(dchan, cookie, txstate);
+}
+
+static void sba_fillup_memcpy_msg(struct sba_request *req,
+				  struct brcm_sba_command *cmds,
+				  struct brcm_message *msg,
+				  dma_addr_t msg_offset, size_t msg_len,
+				  dma_addr_t dst, dma_addr_t src)
+{
+	u64 cmd;
+	u32 c_mdata;
+	struct brcm_sba_command *cmdsp = cmds;
+
+	/* Type-B command to load data into buf0 */
+	cmd = 0;
+	SBA_ENC(cmd, SBA_TYPE_B, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+	SBA_ENC(cmd, msg_len,
+		SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+	c_mdata = sba_cmd_load_c_mdata(0);
+	SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+		SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+	SBA_ENC(cmd, SBA_CMD_LOAD_BUFFER,
+		SBA_CMD_SHIFT, SBA_CMD_MASK);
+	cmdsp->cmd = cmd;
+	*cmdsp->cmd_dma = cpu_to_le64(cmd);
+	cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+	cmdsp->data = src + msg_offset;
+	cmdsp->data_len = msg_len;
+	cmdsp++;
+
+	/* Type-A command to write buf0 */
+	cmd = 0;
+	SBA_ENC(cmd, SBA_TYPE_A, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+	SBA_ENC(cmd, msg_len,
+		SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+	SBA_ENC(cmd, 0x1, SBA_RESP_SHIFT, SBA_RESP_MASK);
+	c_mdata = sba_cmd_write_c_mdata(0);
+	SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+		SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+	SBA_ENC(cmd, SBA_CMD_WRITE_BUFFER,
+		SBA_CMD_SHIFT, SBA_CMD_MASK);
+	cmdsp->cmd = cmd;
+	*cmdsp->cmd_dma = cpu_to_le64(cmd);
+	cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+	if (req->sba->hw_resp_size) {
+		cmdsp->flags |= BRCM_SBA_CMD_HAS_RESP;
+		cmdsp->resp = req->resp_dma;
+		cmdsp->resp_len = req->sba->hw_resp_size;
+	}
+	cmdsp->flags |= BRCM_SBA_CMD_HAS_OUTPUT;
+	cmdsp->data = dst + msg_offset;
+	cmdsp->data_len = msg_len;
+	cmdsp++;
+
+	/* Fillup brcm_message */
+	msg->type = BRCM_MESSAGE_SBA;
+	msg->sba.cmds = cmds;
+	msg->sba.cmds_count = cmdsp - cmds;
+	msg->ctx = req;
+	msg->error = 0;
+}
+
+static struct sba_request *
+sba_prep_dma_memcpy_req(struct sba_device *sba,
+			dma_addr_t off, dma_addr_t dst, dma_addr_t src,
+			size_t len, unsigned long flags)
+{
+	struct sba_request *req = NULL;
+
+	/* Alloc new request */
+	req = sba_alloc_request(sba);
+	if (!req)
+		return NULL;
+	req->fence = (flags & DMA_PREP_FENCE) ? true : false;
+
+	/* Fillup request message */
+	sba_fillup_memcpy_msg(req, req->cmds, &req->msg,
+			      off, len, dst, src);
+
+	/* Init async_tx descriptor */
+	req->tx.flags = flags;
+	req->tx.cookie = -EBUSY;
+
+	return req;
+}
+
+static struct dma_async_tx_descriptor *
+sba_prep_dma_memcpy(struct dma_chan *dchan, dma_addr_t dst, dma_addr_t src,
+		    size_t len, unsigned long flags)
+{
+	size_t req_len;
+	dma_addr_t off = 0;
+	struct sba_device *sba = to_sba_device(dchan);
+	struct sba_request *first = NULL, *req;
+
+	/* Create chained requests where each request is upto hw_buf_size */
+	while (len) {
+		req_len = (len < sba->hw_buf_size) ? len : sba->hw_buf_size;
+
+		req = sba_prep_dma_memcpy_req(sba, off, dst, src,
+					      req_len, flags);
+		if (!req) {
+			if (first)
+				sba_free_chained_requests(first);
+			return NULL;
+		}
+
+		if (first)
+			sba_chain_request(first, req);
+		else
+			first = req;
+
+		off += req_len;
+		len -= req_len;
+	}
+
+	return (first) ? &first->tx : NULL;
+}
+
+static void sba_fillup_xor_msg(struct sba_request *req,
+				struct brcm_sba_command *cmds,
+				struct brcm_message *msg,
+				dma_addr_t msg_offset, size_t msg_len,
+				dma_addr_t dst, dma_addr_t *src, u32 src_cnt)
+{
+	u64 cmd;
+	u32 c_mdata;
+	unsigned int i;
+	struct brcm_sba_command *cmdsp = cmds;
+
+	/* Type-B command to load data into buf0 */
+	cmd = 0;
+	SBA_ENC(cmd, SBA_TYPE_B, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+	SBA_ENC(cmd, msg_len,
+		SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+	c_mdata = sba_cmd_load_c_mdata(0);
+	SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+		SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+	SBA_ENC(cmd, SBA_CMD_LOAD_BUFFER,
+		SBA_CMD_SHIFT, SBA_CMD_MASK);
+	cmdsp->cmd = cmd;
+	*cmdsp->cmd_dma = cpu_to_le64(cmd);
+	cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+	cmdsp->data = src[0] + msg_offset;
+	cmdsp->data_len = msg_len;
+	cmdsp++;
+
+	/* Type-B commands to xor data with buf0 and put it back in buf0 */
+	for (i = 1; i < src_cnt; i++) {
+		cmd = 0;
+		SBA_ENC(cmd, SBA_TYPE_B, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+		SBA_ENC(cmd, msg_len,
+			SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+		c_mdata = sba_cmd_xor_c_mdata(0, 0);
+		SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+			SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+		SBA_ENC(cmd, SBA_CMD_XOR, SBA_CMD_SHIFT, SBA_CMD_MASK);
+		cmdsp->cmd = cmd;
+		*cmdsp->cmd_dma = cpu_to_le64(cmd);
+		cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+		cmdsp->data = src[i] + msg_offset;
+		cmdsp->data_len = msg_len;
+		cmdsp++;
+	}
+
+	/* Type-A command to write buf0 */
+	cmd = 0;
+	SBA_ENC(cmd, SBA_TYPE_A, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+	SBA_ENC(cmd, msg_len,
+		SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+	SBA_ENC(cmd, 0x1, SBA_RESP_SHIFT, SBA_RESP_MASK);
+	c_mdata = sba_cmd_write_c_mdata(0);
+	SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+		SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+	SBA_ENC(cmd, SBA_CMD_WRITE_BUFFER,
+		SBA_CMD_SHIFT, SBA_CMD_MASK);
+	cmdsp->cmd = cmd;
+	*cmdsp->cmd_dma = cpu_to_le64(cmd);
+	cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+	if (req->sba->hw_resp_size) {
+		cmdsp->flags |= BRCM_SBA_CMD_HAS_RESP;
+		cmdsp->resp = req->resp_dma;
+		cmdsp->resp_len = req->sba->hw_resp_size;
+	}
+	cmdsp->flags |= BRCM_SBA_CMD_HAS_OUTPUT;
+	cmdsp->data = dst + msg_offset;
+	cmdsp->data_len = msg_len;
+	cmdsp++;
+
+	/* Fillup brcm_message */
+	msg->type = BRCM_MESSAGE_SBA;
+	msg->sba.cmds = cmds;
+	msg->sba.cmds_count = cmdsp - cmds;
+	msg->ctx = req;
+	msg->error = 0;
+}
+
+struct sba_request *
+sba_prep_dma_xor_req(struct sba_device *sba,
+		     dma_addr_t off, dma_addr_t dst, dma_addr_t *src,
+		     u32 src_cnt, size_t len, unsigned long flags)
+{
+	struct sba_request *req = NULL;
+
+	/* Alloc new request */
+	req = sba_alloc_request(sba);
+	if (!req)
+		return NULL;
+	req->fence = (flags & DMA_PREP_FENCE) ? true : false;
+
+	/* Fillup request message */
+	sba_fillup_xor_msg(req, req->cmds, &req->msg,
+			   off, len, dst, src, src_cnt);
+
+	/* Init async_tx descriptor */
+	req->tx.flags = flags;
+	req->tx.cookie = -EBUSY;
+
+	return req;
+}
+
+static struct dma_async_tx_descriptor *
+sba_prep_dma_xor(struct dma_chan *dchan, dma_addr_t dst, dma_addr_t *src,
+		 u32 src_cnt, size_t len, unsigned long flags)
+{
+	size_t req_len;
+	dma_addr_t off = 0;
+	struct sba_device *sba = to_sba_device(dchan);
+	struct sba_request *first = NULL, *req;
+
+	/* Sanity checks */
+	if (unlikely(src_cnt > sba->max_xor_srcs))
+		return NULL;
+
+	/* Create chained requests where each request is upto hw_buf_size */
+	while (len) {
+		req_len = (len < sba->hw_buf_size) ? len : sba->hw_buf_size;
+
+		req = sba_prep_dma_xor_req(sba, off, dst, src, src_cnt,
+					   req_len, flags);
+		if (!req) {
+			if (first)
+				sba_free_chained_requests(first);
+			return NULL;
+		}
+
+		if (first)
+			sba_chain_request(first, req);
+		else
+			first = req;
+
+		off += req_len;
+		len -= req_len;
+	}
+
+	return (first) ? &first->tx : NULL;
+}
+
+static void sba_fillup_pq_msg(struct sba_request *req,
+				bool pq_continue,
+				struct brcm_sba_command *cmds,
+				struct brcm_message *msg,
+				dma_addr_t msg_offset, size_t msg_len,
+				dma_addr_t *dst_p, dma_addr_t *dst_q,
+				const u8 *scf, dma_addr_t *src, u32 src_cnt)
+{
+	u64 cmd;
+	u32 c_mdata;
+	unsigned int i;
+	struct brcm_sba_command *cmdsp = cmds;
+
+	if (pq_continue) {
+		/* Type-B command to load old P into buf0 */
+		if (dst_p) {
+			cmd = 0;
+			SBA_ENC(cmd, SBA_TYPE_B,
+				SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+			SBA_ENC(cmd, msg_len,
+				SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+			c_mdata = sba_cmd_load_c_mdata(0);
+			SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+				SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+			SBA_ENC(cmd, SBA_CMD_LOAD_BUFFER,
+				SBA_CMD_SHIFT, SBA_CMD_MASK);
+			cmdsp->cmd = cmd;
+			*cmdsp->cmd_dma = cpu_to_le64(cmd);
+			cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+			cmdsp->data = *dst_p + msg_offset;
+			cmdsp->data_len = msg_len;
+			cmdsp++;
+		}
+
+		/* Type-B command to load old Q into buf1 */
+		if (dst_q) {
+			cmd = 0;
+			SBA_ENC(cmd, SBA_TYPE_B,
+				SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+			SBA_ENC(cmd, msg_len,
+				SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+			c_mdata = sba_cmd_load_c_mdata(1);
+			SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+				SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+			SBA_ENC(cmd, SBA_CMD_LOAD_BUFFER,
+				SBA_CMD_SHIFT, SBA_CMD_MASK);
+			cmdsp->cmd = cmd;
+			*cmdsp->cmd_dma = cpu_to_le64(cmd);
+			cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+			cmdsp->data = *dst_q + msg_offset;
+			cmdsp->data_len = msg_len;
+			cmdsp++;
+		}
+	} else {
+		/* Type-A command to zero all buffers */
+		cmd = 0;
+		SBA_ENC(cmd, SBA_TYPE_A, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+		SBA_ENC(cmd, msg_len,
+			SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+		SBA_ENC(cmd, SBA_CMD_ZERO_ALL_BUFFERS,
+			SBA_CMD_SHIFT, SBA_CMD_MASK);
+		cmdsp->cmd = cmd;
+		*cmdsp->cmd_dma = cpu_to_le64(cmd);
+		cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+		cmdsp++;
+	}
+
+	/* Type-B commands for generate P onto buf0 and Q onto buf1 */
+	for (i = 0; i < src_cnt; i++) {
+		cmd = 0;
+		SBA_ENC(cmd, SBA_TYPE_B, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+		SBA_ENC(cmd, msg_len,
+			SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+		c_mdata = sba_cmd_pq_c_mdata(raid6_gflog[scf[i]], 1, 0);
+		SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+			SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+		SBA_ENC(cmd, SBA_C_MDATA_MS(c_mdata),
+			SBA_C_MDATA_MS_SHIFT, SBA_C_MDATA_MS_MASK);
+		SBA_ENC(cmd, SBA_CMD_GALOIS_XOR,
+			SBA_CMD_SHIFT, SBA_CMD_MASK);
+		cmdsp->cmd = cmd;
+		*cmdsp->cmd_dma = cpu_to_le64(cmd);
+		cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+		cmdsp->data = src[i] + msg_offset;
+		cmdsp->data_len = msg_len;
+		cmdsp++;
+	}
+
+	/* Type-A command to write buf0 */
+	if (dst_p) {
+		cmd = 0;
+		SBA_ENC(cmd, SBA_TYPE_A, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+		SBA_ENC(cmd, msg_len,
+			SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+		SBA_ENC(cmd, 0x1, SBA_RESP_SHIFT, SBA_RESP_MASK);
+		c_mdata = sba_cmd_write_c_mdata(0);
+		SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+			SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+		SBA_ENC(cmd, SBA_CMD_WRITE_BUFFER,
+			SBA_CMD_SHIFT, SBA_CMD_MASK);
+		cmdsp->cmd = cmd;
+		*cmdsp->cmd_dma = cpu_to_le64(cmd);
+		cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+		if (req->sba->hw_resp_size) {
+			cmdsp->flags |= BRCM_SBA_CMD_HAS_RESP;
+			cmdsp->resp = req->resp_dma;
+			cmdsp->resp_len = req->sba->hw_resp_size;
+		}
+		cmdsp->flags |= BRCM_SBA_CMD_HAS_OUTPUT;
+		cmdsp->data = *dst_p + msg_offset;
+		cmdsp->data_len = msg_len;
+		cmdsp++;
+	}
+
+	/* Type-A command to write buf1 */
+	if (dst_q) {
+		cmd = 0;
+		SBA_ENC(cmd, SBA_TYPE_A, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+		SBA_ENC(cmd, msg_len,
+			SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+		SBA_ENC(cmd, 0x1, SBA_RESP_SHIFT, SBA_RESP_MASK);
+		c_mdata = sba_cmd_write_c_mdata(1);
+		SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+			SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+		SBA_ENC(cmd, SBA_CMD_WRITE_BUFFER,
+			SBA_CMD_SHIFT, SBA_CMD_MASK);
+		cmdsp->cmd = cmd;
+		*cmdsp->cmd_dma = cpu_to_le64(cmd);
+		cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+		if (req->sba->hw_resp_size) {
+			cmdsp->flags |= BRCM_SBA_CMD_HAS_RESP;
+			cmdsp->resp = req->resp_dma;
+			cmdsp->resp_len = req->sba->hw_resp_size;
+		}
+		cmdsp->flags |= BRCM_SBA_CMD_HAS_OUTPUT;
+		cmdsp->data = *dst_q + msg_offset;
+		cmdsp->data_len = msg_len;
+		cmdsp++;
+	}
+
+	/* Fillup brcm_message */
+	msg->type = BRCM_MESSAGE_SBA;
+	msg->sba.cmds = cmds;
+	msg->sba.cmds_count = cmdsp - cmds;
+	msg->ctx = req;
+	msg->error = 0;
+}
+
+struct sba_request *
+sba_prep_dma_pq_req(struct sba_device *sba, dma_addr_t off,
+		    dma_addr_t *dst_p, dma_addr_t *dst_q, dma_addr_t *src,
+		    u32 src_cnt, const u8 *scf, size_t len, unsigned long flags)
+{
+	struct sba_request *req = NULL;
+
+	/* Alloc new request */
+	req = sba_alloc_request(sba);
+	if (!req)
+		return NULL;
+	req->fence = (flags & DMA_PREP_FENCE) ? true : false;
+
+	/* Fillup request messages */
+	sba_fillup_pq_msg(req, dmaf_continue(flags),
+			  req->cmds, &req->msg,
+			  off, len, dst_p, dst_q, scf, src, src_cnt);
+
+	/* Init async_tx descriptor */
+	req->tx.flags = flags;
+	req->tx.cookie = -EBUSY;
+
+	return req;
+}
+
+static void sba_fillup_pq_single_msg(struct sba_request *req,
+				bool pq_continue,
+				struct brcm_sba_command *cmds,
+				struct brcm_message *msg,
+				dma_addr_t msg_offset, size_t msg_len,
+				dma_addr_t *dst_p, dma_addr_t *dst_q,
+				dma_addr_t src, u8 scf)
+{
+	u64 cmd;
+	u32 c_mdata;
+	u8 pos, dpos = raid6_gflog[scf];
+	struct brcm_sba_command *cmdsp = cmds;
+
+	if (!dst_p)
+		goto skip_p;
+
+	if (pq_continue) {
+		/* Type-B command to load old P into buf0 */
+		cmd = 0;
+		SBA_ENC(cmd, SBA_TYPE_B,
+			SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+		SBA_ENC(cmd, msg_len,
+			SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+		c_mdata = sba_cmd_load_c_mdata(0);
+		SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+			SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+		SBA_ENC(cmd, SBA_CMD_LOAD_BUFFER,
+			SBA_CMD_SHIFT, SBA_CMD_MASK);
+		cmdsp->cmd = cmd;
+		*cmdsp->cmd_dma = cpu_to_le64(cmd);
+		cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+		cmdsp->data = *dst_p + msg_offset;
+		cmdsp->data_len = msg_len;
+		cmdsp++;
+
+		/*
+		 * Type-B commands to xor data with buf0 and put it
+		 * back in buf0
+		 */
+		cmd = 0;
+		SBA_ENC(cmd, SBA_TYPE_B, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+		SBA_ENC(cmd, msg_len,
+			SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+		c_mdata = sba_cmd_xor_c_mdata(0, 0);
+		SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+			SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+		SBA_ENC(cmd, SBA_CMD_XOR, SBA_CMD_SHIFT, SBA_CMD_MASK);
+		cmdsp->cmd = cmd;
+		*cmdsp->cmd_dma = cpu_to_le64(cmd);
+		cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+		cmdsp->data = src + msg_offset;
+		cmdsp->data_len = msg_len;
+		cmdsp++;
+	} else {
+		/* Type-B command to load old P into buf0 */
+		cmd = 0;
+		SBA_ENC(cmd, SBA_TYPE_B,
+			SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+		SBA_ENC(cmd, msg_len,
+			SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+		c_mdata = sba_cmd_load_c_mdata(0);
+		SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+			SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+		SBA_ENC(cmd, SBA_CMD_LOAD_BUFFER,
+			SBA_CMD_SHIFT, SBA_CMD_MASK);
+		cmdsp->cmd = cmd;
+		*cmdsp->cmd_dma = cpu_to_le64(cmd);
+		cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+		cmdsp->data = src + msg_offset;
+		cmdsp->data_len = msg_len;
+		cmdsp++;
+	}
+
+	/* Type-A command to write buf0 */
+	cmd = 0;
+	SBA_ENC(cmd, SBA_TYPE_A, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+	SBA_ENC(cmd, msg_len,
+		SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+	SBA_ENC(cmd, 0x1, SBA_RESP_SHIFT, SBA_RESP_MASK);
+	c_mdata = sba_cmd_write_c_mdata(0);
+	SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+		SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+	SBA_ENC(cmd, SBA_CMD_WRITE_BUFFER,
+		SBA_CMD_SHIFT, SBA_CMD_MASK);
+	cmdsp->cmd = cmd;
+	*cmdsp->cmd_dma = cpu_to_le64(cmd);
+	cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+	if (req->sba->hw_resp_size) {
+		cmdsp->flags |= BRCM_SBA_CMD_HAS_RESP;
+		cmdsp->resp = req->resp_dma;
+		cmdsp->resp_len = req->sba->hw_resp_size;
+	}
+	cmdsp->flags |= BRCM_SBA_CMD_HAS_OUTPUT;
+	cmdsp->data = *dst_p + msg_offset;
+	cmdsp->data_len = msg_len;
+	cmdsp++;
+
+skip_p:
+	if (!dst_q)
+		goto skip_q;
+
+	/* Type-A command to zero all buffers */
+	cmd = 0;
+	SBA_ENC(cmd, SBA_TYPE_A, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+	SBA_ENC(cmd, msg_len,
+		SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+	SBA_ENC(cmd, SBA_CMD_ZERO_ALL_BUFFERS,
+		SBA_CMD_SHIFT, SBA_CMD_MASK);
+	cmdsp->cmd = cmd;
+	*cmdsp->cmd_dma = cpu_to_le64(cmd);
+	cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+	cmdsp++;
+
+	if (dpos == 255)
+		goto skip_q_computation;
+	pos = (dpos < req->sba->max_pq_coefs) ?
+		dpos : (req->sba->max_pq_coefs - 1);
+
+	/*
+	 * Type-B command to generate initial Q from data
+	 * and store output into buf0
+	 */
+	cmd = 0;
+	SBA_ENC(cmd, SBA_TYPE_B,
+		SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+	SBA_ENC(cmd, msg_len,
+		SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+	SBA_ENC(cmd, 0x0, SBA_RESP_SHIFT, SBA_RESP_MASK);
+	c_mdata = sba_cmd_pq_c_mdata(pos, 0, 0);
+	SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+		SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+	SBA_ENC(cmd, SBA_C_MDATA_MS(c_mdata),
+		SBA_C_MDATA_MS_SHIFT, SBA_C_MDATA_MS_MASK);
+	SBA_ENC(cmd, SBA_CMD_GALOIS, SBA_CMD_SHIFT, SBA_CMD_MASK);
+	cmdsp->cmd = cmd;
+	*cmdsp->cmd_dma = cpu_to_le64(cmd);
+	cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+	cmdsp->data = src + msg_offset;
+	cmdsp->data_len = msg_len;
+	cmdsp++;
+
+	dpos -= pos;
+
+	/* Multiple Type-A command to generate final Q */
+	while (dpos) {
+		pos = (dpos < req->sba->max_pq_coefs) ?
+			dpos : (req->sba->max_pq_coefs - 1);
+
+		/*
+		 * Type-A command to generate Q with buf0 and
+		 * buf1 store result in buf0
+		 */
+		cmd = 0;
+		SBA_ENC(cmd, SBA_TYPE_A,
+			SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+		SBA_ENC(cmd, msg_len,
+			SBA_USER_DEF_SHIFT,
+			SBA_USER_DEF_MASK);
+		SBA_ENC(cmd, 0x0,
+			SBA_RESP_SHIFT, SBA_RESP_MASK);
+		c_mdata = sba_cmd_pq_c_mdata(pos, 0, 1);
+		SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+			SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+		SBA_ENC(cmd, SBA_C_MDATA_MS(c_mdata),
+			SBA_C_MDATA_MS_SHIFT,
+			SBA_C_MDATA_MS_MASK);
+		SBA_ENC(cmd, SBA_CMD_GALOIS,
+			SBA_CMD_SHIFT, SBA_CMD_MASK);
+		cmdsp->cmd = cmd;
+		*cmdsp->cmd_dma = cpu_to_le64(cmd);
+		cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+		cmdsp++;
+
+		dpos -= pos;
+	};
+
+skip_q_computation:
+	if (pq_continue) {
+		/*
+		 * Type-B command to XOR previous output with
+		 * buf0 and write it into buf0
+		 */
+		cmd = 0;
+		SBA_ENC(cmd, SBA_TYPE_B,
+			SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+		SBA_ENC(cmd, msg_len,
+			SBA_USER_DEF_SHIFT,
+			SBA_USER_DEF_MASK);
+		SBA_ENC(cmd, 0x0,
+			SBA_RESP_SHIFT, SBA_RESP_MASK);
+		c_mdata = sba_cmd_xor_c_mdata(0, 0);
+		SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+			SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+		SBA_ENC(cmd, SBA_CMD_XOR,
+			SBA_CMD_SHIFT, SBA_CMD_MASK);
+		cmdsp->cmd = cmd;
+		*cmdsp->cmd_dma = cpu_to_le64(cmd);
+		cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+		cmdsp->data = *dst_q + msg_offset;
+		cmdsp->data_len = msg_len;
+		cmdsp++;
+	}
+
+	/* Type-A command to write buf0 */
+	cmd = 0;
+	SBA_ENC(cmd, SBA_TYPE_A,
+		SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+	SBA_ENC(cmd, msg_len,
+		SBA_USER_DEF_SHIFT,
+		SBA_USER_DEF_MASK);
+	SBA_ENC(cmd, 0x1,
+		SBA_RESP_SHIFT, SBA_RESP_MASK);
+	c_mdata = sba_cmd_write_c_mdata(0);
+	SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+		SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+	SBA_ENC(cmd, SBA_CMD_WRITE_BUFFER,
+		SBA_CMD_SHIFT, SBA_CMD_MASK);
+	cmdsp->cmd = cmd;
+	*cmdsp->cmd_dma = cpu_to_le64(cmd);
+	cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+	if (req->sba->hw_resp_size) {
+		cmdsp->flags |= BRCM_SBA_CMD_HAS_RESP;
+		cmdsp->resp = req->resp_dma;
+		cmdsp->resp_len = req->sba->hw_resp_size;
+	}
+	cmdsp->flags |= BRCM_SBA_CMD_HAS_OUTPUT;
+	cmdsp->data = *dst_q + msg_offset;
+	cmdsp->data_len = msg_len;
+	cmdsp++;
+
+skip_q:
+	/* Fillup brcm_message */
+	msg->type = BRCM_MESSAGE_SBA;
+	msg->sba.cmds = cmds;
+	msg->sba.cmds_count = cmdsp - cmds;
+	msg->ctx = req;
+	msg->error = 0;
+}
+
+struct sba_request *
+sba_prep_dma_pq_single_req(struct sba_device *sba, dma_addr_t off,
+			   dma_addr_t *dst_p, dma_addr_t *dst_q,
+			   dma_addr_t src, u8 scf, size_t len,
+			   unsigned long flags)
+{
+	struct sba_request *req = NULL;
+
+	/* Alloc new request */
+	req = sba_alloc_request(sba);
+	if (!req)
+		return NULL;
+	req->fence = (flags & DMA_PREP_FENCE) ? true : false;
+
+	/* Fillup request messages */
+	sba_fillup_pq_single_msg(req,  dmaf_continue(flags),
+				 req->cmds, &req->msg, off, len,
+				 dst_p, dst_q, src, scf);
+
+	/* Init async_tx descriptor */
+	req->tx.flags = flags;
+	req->tx.cookie = -EBUSY;
+
+	return req;
+}
+
+static struct dma_async_tx_descriptor *
+sba_prep_dma_pq(struct dma_chan *dchan, dma_addr_t *dst, dma_addr_t *src,
+		u32 src_cnt, const u8 *scf, size_t len, unsigned long flags)
+{
+	u32 i, dst_q_index;
+	size_t req_len;
+	bool slow = false;
+	dma_addr_t off = 0;
+	dma_addr_t *dst_p = NULL, *dst_q = NULL;
+	struct sba_device *sba = to_sba_device(dchan);
+	struct sba_request *first = NULL, *req;
+
+	/* Sanity checks */
+	if (unlikely(src_cnt > sba->max_pq_srcs))
+		return NULL;
+	for (i = 0; i < src_cnt; i++)
+		if (sba->max_pq_coefs <= raid6_gflog[scf[i]])
+			slow = true;
+
+	/* Figure-out P and Q destination addresses */
+	if (!(flags & DMA_PREP_PQ_DISABLE_P))
+		dst_p = &dst[0];
+	if (!(flags & DMA_PREP_PQ_DISABLE_Q))
+		dst_q = &dst[1];
+
+	/* Create chained requests where each request is upto hw_buf_size */
+	while (len) {
+		req_len = (len < sba->hw_buf_size) ? len : sba->hw_buf_size;
+
+		if (slow) {
+			dst_q_index = src_cnt;
+
+			if (dst_q) {
+				for (i = 0; i < src_cnt; i++) {
+					if (*dst_q == src[i]) {
+						dst_q_index = i;
+						break;
+					}
+				}
+			}
+
+			if (dst_q_index < src_cnt) {
+				i = dst_q_index;
+				req = sba_prep_dma_pq_single_req(sba,
+					off, dst_p, dst_q, src[i], scf[i],
+					req_len, flags | DMA_PREP_FENCE);
+				if (!req)
+					goto fail;
+
+				if (first)
+					sba_chain_request(first, req);
+				else
+					first = req;
+
+				flags |= DMA_PREP_CONTINUE;
+			}
+
+			for (i = 0; i < src_cnt; i++) {
+				if (dst_q_index == i)
+					continue;
+
+				req = sba_prep_dma_pq_single_req(sba,
+					off, dst_p, dst_q, src[i], scf[i],
+					req_len, flags | DMA_PREP_FENCE);
+				if (!req)
+					goto fail;
+
+				if (first)
+					sba_chain_request(first, req);
+				else
+					first = req;
+
+				flags |= DMA_PREP_CONTINUE;
+			}
+		} else {
+			req = sba_prep_dma_pq_req(sba, off,
+						  dst_p, dst_q, src, src_cnt,
+						  scf, req_len, flags);
+			if (!req)
+				goto fail;
+
+			if (first)
+				sba_chain_request(first, req);
+			else
+				first = req;
+		}
+
+		off += req_len;
+		len -= req_len;
+	}
+
+	return (first) ? &first->tx : NULL;
+
+fail:
+	if (first)
+		sba_free_chained_requests(first);
+	return NULL;
+}
+
+/* ====== Mailbox callbacks ===== */
+
+static void sba_dma_tx_actions(struct sba_request *req)
+{
+	struct dma_async_tx_descriptor *tx = &req->tx;
+
+	WARN_ON(tx->cookie < 0);
+
+	if (tx->cookie > 0) {
+		dma_cookie_complete(tx);
+
+		/*
+		 * Call the callback (must not sleep or submit new
+		 * operations to this channel)
+		 */
+		if (tx->callback)
+			tx->callback(tx->callback_param);
+
+		dma_descriptor_unmap(tx);
+	}
+
+	/* Run dependent operations */
+	dma_run_dependencies(tx);
+
+	/* If waiting for 'ack' then move to completed list */
+	if (!async_tx_test_ack(&req->tx))
+		sba_complete_chained_requests(req);
+	else
+		sba_free_chained_requests(req);
+}
+
+static void sba_receive_message(struct mbox_client *cl, void *msg)
+{
+	unsigned long flags;
+	struct brcm_message *m = msg;
+	struct sba_request *req = m->ctx, *req1;
+	struct sba_device *sba = req->sba;
+
+	/* Error count if message has error */
+	if (m->error < 0) {
+		dev_err(sba->dev, "%s got message with error %d",
+			dma_chan_name(&sba->dma_chan), m->error);
+	}
+
+	/* Mark request as received */
+	sba_received_request(req);
+
+	/* Wait for all chained requests to be completed */
+	if (atomic_dec_return(&req->first->next_pending_count))
+		goto done;
+
+	/* Point to first request */
+	req = req->first;
+
+	/* Update request */
+	if (req->state == SBA_REQUEST_STATE_RECEIVED)
+		sba_dma_tx_actions(req);
+	else
+		sba_free_chained_requests(req);
+
+	spin_lock_irqsave(&sba->reqs_lock, flags);
+
+	/* Re-check all completed request waiting for 'ack' */
+	list_for_each_entry_safe(req, req1, &sba->reqs_completed_list, node) {
+		spin_unlock_irqrestore(&sba->reqs_lock, flags);
+		sba_dma_tx_actions(req);
+		spin_lock_irqsave(&sba->reqs_lock, flags);
+	}
+
+	spin_unlock_irqrestore(&sba->reqs_lock, flags);
+
+done:
+	/* Try to submit pending request */
+	sba_issue_pending(&sba->dma_chan);
+}
+
+/* ====== Platform driver routines ===== */
+
+static int sba_prealloc_channel_resources(struct sba_device *sba)
+{
+	int i, j, p, ret = 0;
+	struct sba_request *req = NULL;
+
+	sba->resp_base = dma_alloc_coherent(sba->dma_dev.dev,
+					    sba->max_resp_pool_size,
+					    &sba->resp_dma_base, GFP_KERNEL);
+	if (!sba->resp_base)
+		return -ENOMEM;
+
+	sba->cmds_base = dma_alloc_coherent(sba->dma_dev.dev,
+					    sba->max_cmds_pool_size,
+					    &sba->cmds_dma_base, GFP_KERNEL);
+	if (!sba->cmds_base) {
+		ret = -ENOMEM;
+		goto fail_free_resp_pool;
+	}
+
+	spin_lock_init(&sba->reqs_lock);
+	sba->reqs_fence = false;
+	INIT_LIST_HEAD(&sba->reqs_alloc_list);
+	INIT_LIST_HEAD(&sba->reqs_pending_list);
+	INIT_LIST_HEAD(&sba->reqs_active_list);
+	INIT_LIST_HEAD(&sba->reqs_received_list);
+	INIT_LIST_HEAD(&sba->reqs_completed_list);
+	INIT_LIST_HEAD(&sba->reqs_aborted_list);
+	INIT_LIST_HEAD(&sba->reqs_free_list);
+
+	sba->reqs = devm_kcalloc(sba->dev, sba->max_req,
+				 sizeof(*req), GFP_KERNEL);
+	if (!sba->reqs) {
+		ret = -ENOMEM;
+		goto fail_free_cmds_pool;
+	}
+
+	for (i = 0, p = 0; i < sba->max_req; i++) {
+		req = &sba->reqs[i];
+		INIT_LIST_HEAD(&req->node);
+		req->sba = sba;
+		req->state = SBA_REQUEST_STATE_FREE;
+		INIT_LIST_HEAD(&req->next);
+		req->next_count = 1;
+		atomic_set(&req->next_pending_count, 0);
+		req->fence = false;
+		req->resp = sba->resp_base + p;
+		req->resp_dma = sba->resp_dma_base + p;
+		p += sba->hw_resp_size;
+		req->cmds = devm_kcalloc(sba->dev, sba->max_cmd_per_req,
+					 sizeof(*req->cmds), GFP_KERNEL);
+		if (!req->cmds) {
+			ret = -ENOMEM;
+			goto fail_free_cmds_pool;
+		}
+		for (j = 0; j < sba->max_cmd_per_req; j++) {
+			req->cmds[j].cmd = 0;
+			req->cmds[j].cmd_dma = sba->cmds_base +
+				(i * sba->max_cmd_per_req + j) * sizeof(u64);
+			req->cmds[j].cmd_dma_addr = sba->cmds_dma_base +
+				(i * sba->max_cmd_per_req + j) * sizeof(u64);
+			req->cmds[j].flags = 0;
+		}
+		memset(&req->msg, 0, sizeof(req->msg));
+		dma_async_tx_descriptor_init(&req->tx, &sba->dma_chan);
+		req->tx.tx_submit = sba_tx_submit;
+		req->tx.phys = req->resp_dma;
+		list_add_tail(&req->node, &sba->reqs_free_list);
+	}
+
+	sba->reqs_free_count = sba->max_req;
+
+	return 0;
+
+fail_free_cmds_pool:
+	dma_free_coherent(sba->dma_dev.dev,
+			  sba->max_cmds_pool_size,
+			  sba->cmds_base, sba->cmds_dma_base);
+fail_free_resp_pool:
+	dma_free_coherent(sba->dma_dev.dev,
+			  sba->max_resp_pool_size,
+			  sba->resp_base, sba->resp_dma_base);
+	return ret;
+}
+
+static void sba_freeup_channel_resources(struct sba_device *sba)
+{
+	dmaengine_terminate_all(&sba->dma_chan);
+	dma_free_coherent(sba->dma_dev.dev, sba->max_cmds_pool_size,
+			  sba->cmds_base, sba->cmds_dma_base);
+	dma_free_coherent(sba->dma_dev.dev, sba->max_resp_pool_size,
+			  sba->resp_base, sba->resp_dma_base);
+	sba->resp_base = NULL;
+	sba->resp_dma_base = 0;
+}
+
+static int sba_async_register(struct sba_device *sba)
+{
+	int ret;
+	struct dma_device *dma_dev = &sba->dma_dev;
+
+	/* Initialize DMA channel cookie */
+	sba->dma_chan.device = dma_dev;
+	dma_cookie_init(&sba->dma_chan);
+
+	/* Initialize DMA device capability mask */
+	dma_cap_zero(dma_dev->cap_mask);
+	dma_cap_set(DMA_MEMCPY, dma_dev->cap_mask);
+	dma_cap_set(DMA_XOR, dma_dev->cap_mask);
+	dma_cap_set(DMA_PQ, dma_dev->cap_mask);
+
+	/*
+	 * Set mailbox channel device as the base device of
+	 * our dma_device because the actual memory accesses
+	 * will be done by mailbox controller
+	 */
+	dma_dev->dev = sba->mbox_dev;
+
+	/* Set base prep routines */
+	dma_dev->device_free_chan_resources = sba_free_chan_resources;
+	dma_dev->device_terminate_all = sba_device_terminate_all;
+	dma_dev->device_issue_pending = sba_issue_pending;
+	dma_dev->device_tx_status = sba_tx_status;
+
+	/* Set memcpy routines and capability */
+	if (dma_has_cap(DMA_MEMCPY, dma_dev->cap_mask))
+		dma_dev->device_prep_dma_memcpy = sba_prep_dma_memcpy;
+
+	/* Set xor routines and capability */
+	if (dma_has_cap(DMA_XOR, dma_dev->cap_mask)) {
+		dma_dev->device_prep_dma_xor = sba_prep_dma_xor;
+		dma_dev->max_xor = sba->max_xor_srcs;
+	}
+
+	/* Set pq routines and capability */
+	if (dma_has_cap(DMA_PQ, dma_dev->cap_mask)) {
+		dma_dev->device_prep_dma_pq = sba_prep_dma_pq;
+		dma_set_maxpq(dma_dev, sba->max_pq_srcs, 0);
+	}
+
+	/* Initialize DMA device channel list */
+	INIT_LIST_HEAD(&dma_dev->channels);
+	list_add_tail(&sba->dma_chan.device_node, &dma_dev->channels);
+
+	/* Register with Linux async DMA framework*/
+	ret = dma_async_device_register(dma_dev);
+	if (ret) {
+		dev_err(sba->dev, "async device register error %d", ret);
+		return ret;
+	}
+
+	dev_info(sba->dev, "%s capabilities: %s%s%s\n",
+		 dma_chan_name(&sba->dma_chan),
+		 dma_has_cap(DMA_MEMCPY, dma_dev->cap_mask) ? "memcpy " : "",
+		 dma_has_cap(DMA_XOR, dma_dev->cap_mask) ? "xor " : "",
+		 dma_has_cap(DMA_PQ, dma_dev->cap_mask) ? "pq " : "");
+
+	return 0;
+}
+
+static int sba_probe(struct platform_device *pdev)
+{
+	int i, ret = 0, mchans_count;
+	struct sba_device *sba;
+	struct platform_device *mbox_pdev;
+	struct of_phandle_args args;
+
+	/* Allocate main SBA struct */
+	sba = devm_kzalloc(&pdev->dev, sizeof(*sba), GFP_KERNEL);
+	if (!sba)
+		return -ENOMEM;
+
+	sba->dev = &pdev->dev;
+	platform_set_drvdata(pdev, sba);
+
+	/* Determine SBA version from DT compatible string */
+	if (of_device_is_compatible(sba->dev->of_node, "brcm,iproc-sba"))
+		sba->ver = SBA_VER_1;
+	else if (of_device_is_compatible(sba->dev->of_node,
+					 "brcm,iproc-sba-v2"))
+		sba->ver = SBA_VER_2;
+	else
+		return -ENODEV;
+
+	/* Derived Configuration parameters */
+	switch (sba->ver) {
+	case SBA_VER_1:
+		sba->max_req = 1024;
+		sba->hw_buf_size = 4096;
+		sba->hw_resp_size = 8;
+		sba->max_pq_coefs = 6;
+		sba->max_pq_srcs = 6;
+		break;
+	case SBA_VER_2:
+		sba->max_req = 1024;
+		sba->hw_buf_size = 4096;
+		sba->hw_resp_size = 8;
+		sba->max_pq_coefs = 30;
+		/*
+		 * We can support max_pq_srcs == max_pq_coefs because
+		 * we are limited by number of SBA commands that we can
+		 * fit in one message for underlying ring manager HW.
+		 */
+		sba->max_pq_srcs = 12;
+		break;
+	default:
+		return -EINVAL;
+	}
+	sba->max_cmd_per_req = sba->max_pq_srcs + 3;
+	sba->max_xor_srcs = sba->max_cmd_per_req - 1;
+	sba->max_resp_pool_size = sba->max_req * sba->hw_resp_size;
+	sba->max_cmds_pool_size = sba->max_req *
+				  sba->max_cmd_per_req * sizeof(u64);
+
+	/* Setup mailbox client */
+	sba->client.dev			= &pdev->dev;
+	sba->client.rx_callback		= sba_receive_message;
+	sba->client.tx_block		= false;
+	sba->client.knows_txdone	= false;
+	sba->client.tx_tout		= 0;
+
+	/* Number of channels equals number of mailbox channels */
+	ret = of_count_phandle_with_args(pdev->dev.of_node,
+					 "mboxes", "#mbox-cells");
+	if (ret <= 0)
+		return -ENODEV;
+	mchans_count = ret;
+	sba->mchans_count = 0;
+	atomic_set(&sba->mchans_current, 0);
+
+	/* Allocate mailbox channel array */
+	sba->mchans = devm_kcalloc(&pdev->dev, sba->mchans_count,
+				   sizeof(*sba->mchans), GFP_KERNEL);
+	if (!sba->mchans)
+		return -ENOMEM;
+
+	/* Request mailbox channels */
+	for (i = 0; i < mchans_count; i++) {
+		sba->mchans[i] = mbox_request_channel(&sba->client, i);
+		if (IS_ERR(sba->mchans[i])) {
+			ret = PTR_ERR(sba->mchans[i]);
+			goto fail_free_mchans;
+		}
+		sba->mchans_count++;
+	}
+
+	/* Find-out underlying mailbox device */
+	ret = of_parse_phandle_with_args(pdev->dev.of_node,
+					 "mboxes", "#mbox-cells", 0, &args);
+	if (ret)
+		goto fail_free_mchans;
+	mbox_pdev = of_find_device_by_node(args.np);
+	of_node_put(args.np);
+	if (!mbox_pdev) {
+		ret = -ENODEV;
+		goto fail_free_mchans;
+	}
+	sba->mbox_dev = &mbox_pdev->dev;
+
+	/* All mailbox channels should be of same ring manager device */
+	for (i = 1; i < mchans_count; i++) {
+		ret = of_parse_phandle_with_args(pdev->dev.of_node,
+					 "mboxes", "#mbox-cells", i, &args);
+		if (ret)
+			goto fail_free_mchans;
+		mbox_pdev = of_find_device_by_node(args.np);
+		of_node_put(args.np);
+		if (sba->mbox_dev != &mbox_pdev->dev) {
+			ret = -EINVAL;
+			goto fail_free_mchans;
+		}
+	}
+
+	/* Register DMA device with linux async framework */
+	ret = sba_async_register(sba);
+	if (ret)
+		goto fail_free_mchans;
+
+	/* Prealloc channel resource */
+	ret = sba_prealloc_channel_resources(sba);
+	if (ret)
+		goto fail_async_dev_unreg;
+
+	/* Print device info */
+	dev_info(sba->dev, "%s using SBAv%d and %d mailbox channels",
+		 dma_chan_name(&sba->dma_chan), sba->ver+1,
+		 sba->mchans_count);
+
+	return 0;
+
+fail_async_dev_unreg:
+	dma_async_device_unregister(&sba->dma_dev);
+fail_free_mchans:
+	for (i = 0; i < sba->mchans_count; i++)
+		mbox_free_channel(sba->mchans[i]);
+	return ret;
+}
+
+static int sba_remove(struct platform_device *pdev)
+{
+	int i;
+	struct sba_device *sba = platform_get_drvdata(pdev);
+
+	sba_freeup_channel_resources(sba);
+
+	dma_async_device_unregister(&sba->dma_dev);
+
+	for (i = 0; i < sba->mchans_count; i++)
+		mbox_free_channel(sba->mchans[i]);
+
+	return 0;
+}
+
+static const struct of_device_id sba_of_match[] = {
+	{ .compatible = "brcm,iproc-sba", },
+	{ .compatible = "brcm,iproc-sba-v2", },
+	{},
+};
+MODULE_DEVICE_TABLE(of, sba_of_match);
+
+static struct platform_driver sba_driver = {
+	.probe = sba_probe,
+	.remove = sba_remove,
+	.driver = {
+		.name = "bcm-sba-raid",
+		.of_match_table = sba_of_match,
+	},
+};
+module_platform_driver(sba_driver);
+
+MODULE_DESCRIPTION("Broadcom SBA RAID driver");
+MODULE_AUTHOR("Anup Patel <anup.patel@broadcom.com>");
+MODULE_LICENSE("GPL v2");
-- 
2.7.4

^ permalink raw reply related

* [PATCH v3 4/4] dt-bindings: Add DT bindings document for Broadcom SBA RAID driver
From: Anup Patel @ 2017-02-10  9:07 UTC (permalink / raw)
  To: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar
  Cc: Dan Williams, Ray Jui, Scott Branden, Jon Mason, Rob Rice,
	bcm-kernel-feedback-list, dmaengine, devicetree, linux-arm-kernel,
	linux-kernel, linux-crypto, linux-raid, Anup Patel
In-Reply-To: <1486717628-17580-1-git-send-email-anup.patel@broadcom.com>

This patch adds the DT bindings document for newly added Broadcom
SBA RAID driver.

Signed-off-by: Anup Patel <anup.patel@broadcom.com>
Reviewed-by: Ray Jui <ray.jui@broadcom.com>
Reviewed-by: Scott Branden <scott.branden@broadcom.com>
---
 .../devicetree/bindings/dma/brcm,iproc-sba.txt     | 29 ++++++++++++++++++++++
 1 file changed, 29 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/dma/brcm,iproc-sba.txt

diff --git a/Documentation/devicetree/bindings/dma/brcm,iproc-sba.txt b/Documentation/devicetree/bindings/dma/brcm,iproc-sba.txt
new file mode 100644
index 0000000..092913a
--- /dev/null
+++ b/Documentation/devicetree/bindings/dma/brcm,iproc-sba.txt
@@ -0,0 +1,29 @@
+* Broadcom SBA RAID engine
+
+Required properties:
+- compatible: Should be one of the following
+	      "brcm,iproc-sba"
+	      "brcm,iproc-sba-v2"
+  The "brcm,iproc-sba" has support for only 6 PQ coefficients
+  The "brcm,iproc-sba-v2" has support for only 30 PQ coefficients
+- mboxes: List of phandle and mailbox channel specifiers
+
+Example:
+
+raid_mbox: mbox@67400000 {
+	...
+	#mbox-cells = <3>;
+	...
+};
+
+raid0 {
+	compatible = "brcm,iproc-sba-v2";
+	mboxes = <&raid_mbox 0 0x1 0xffff>,
+		 <&raid_mbox 1 0x1 0xffff>,
+		 <&raid_mbox 2 0x1 0xffff>,
+		 <&raid_mbox 3 0x1 0xffff>,
+		 <&raid_mbox 4 0x1 0xffff>,
+		 <&raid_mbox 5 0x1 0xffff>,
+		 <&raid_mbox 6 0x1 0xffff>,
+		 <&raid_mbox 7 0x1 0xffff>;
+};
-- 
2.7.4


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox