* [PATCH v3 2/4] async_tx: Fix DMA_PREP_FENCE usage in do_async_gen_syndrome()
From: Anup Patel @ 2017-02-10 9:07 UTC
To: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
David S . Miller, Jassi Brar
Cc: Dan Williams, Ray Jui, Scott Branden, Jon Mason, Rob Rice,
bcm-kernel-feedback-list, dmaengine, devicetree, linux-arm-kernel,
linux-kernel, linux-crypto, linux-raid, Anup Patel
In-Reply-To: <1486717628-17580-1-git-send-email-anup.patel@broadcom.com>
DMA_PREP_FENCE must be set when preparing a Tx descriptor whose output
is to be used by the next/dependent Tx descriptor.
DMA_PREP_FENCE is not set correctly in do_async_gen_syndrome() when
calling dma->device_prep_dma_pq() under the following conditions:
1. ASYNC_TX_FENCE not set in submit->flags
2. DMA_PREP_FENCE not set in dma_flags
3. src_cnt (= (disks - 2)) is greater than dma_maxpq(dma, dma_flags)
This patch fixes DMA_PREP_FENCE usage in do_async_gen_syndrome(), taking
inspiration from the do_async_xor() implementation.
Signed-off-by: Anup Patel <anup.patel@broadcom.com>
Reviewed-by: Ray Jui <ray.jui@broadcom.com>
Reviewed-by: Scott Branden <scott.branden@broadcom.com>
---
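A small user-space model may help illustrate why the placement of the
check matters. This is only a sketch under stated assumptions: the flag
values and the model_fence_bits() helper are hypothetical, and the model
assumes that every chunk except the last one sets ASYNC_TX_FENCE on
submit->flags inside the loop (the behaviour the commit message attributes
to do_async_xor()). It is not the kernel code itself.

```c
/* Hypothetical flag values, valid for this user-space model only. */
#define ASYNC_TX_FENCE	(1u << 0)
#define DMA_PREP_FENCE	(1u << 1)

/*
 * Model of the source-chunking loop. Returns a bitmask in which bit i is
 * set when chunk i would be prepared with DMA_PREP_FENCE.
 * fixed == 0: fence sampled once before the loop (old placement).
 * fixed != 0: fence sampled for every chunk (placement after the patch).
 */
static unsigned int model_fence_bits(int src_cnt, int max_pq,
				     unsigned int flags_orig, int fixed)
{
	unsigned int dma_flags = 0, fenced = 0;
	int chunk = 0;

	/* Old placement: only the caller's original flags are seen. */
	if (!fixed && (flags_orig & ASYNC_TX_FENCE))
		dma_flags |= DMA_PREP_FENCE;

	while (src_cnt > 0) {
		unsigned int submit_flags = flags_orig;
		int pq_src_cnt = src_cnt < max_pq ? src_cnt : max_pq;

		/* Assumption: intermediate chunks request a fence. */
		if (src_cnt > pq_src_cnt)
			submit_flags |= ASYNC_TX_FENCE;

		/* New placement: the per-chunk flags are seen. */
		if (fixed && (submit_flags & ASYNC_TX_FENCE))
			dma_flags |= DMA_PREP_FENCE;

		/* dma_flags accumulates with |=, as in the patch. */
		if (dma_flags & DMA_PREP_FENCE)
			fenced |= 1u << chunk;

		src_cnt -= pq_src_cnt;
		chunk++;
	}

	return fenced;
}
```

In this model, with 10 sources, a maximum of 4 sources per operation and
no fence requested by the caller, the old placement fences no chunk while
the new placement fences the intermediate chunks, which is the case
described by conditions 1 to 3 above.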
crypto/async_tx/async_pq.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/crypto/async_tx/async_pq.c b/crypto/async_tx/async_pq.c
index f83de99..56bd612 100644
--- a/crypto/async_tx/async_pq.c
+++ b/crypto/async_tx/async_pq.c
@@ -62,9 +62,6 @@ do_async_gen_syndrome(struct dma_chan *chan,
dma_addr_t dma_dest[2];
int src_off = 0;
- if (submit->flags & ASYNC_TX_FENCE)
- dma_flags |= DMA_PREP_FENCE;
-
while (src_cnt > 0) {
submit->flags = flags_orig;
pq_src_cnt = min(src_cnt, dma_maxpq(dma, dma_flags));
@@ -83,6 +80,8 @@ do_async_gen_syndrome(struct dma_chan *chan,
if (cb_fn_orig)
dma_flags |= DMA_PREP_INTERRUPT;
}
+ if (submit->flags & ASYNC_TX_FENCE)
+ dma_flags |= DMA_PREP_FENCE;
/* Drivers force forward progress in case they can not provide
* a descriptor
--
2.7.4
* [PATCH v3 3/4] dmaengine: Add Broadcom SBA RAID driver
From: Anup Patel @ 2017-02-10 9:07 UTC
To: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
David S . Miller, Jassi Brar
Cc: Dan Williams, Ray Jui, Scott Branden, Jon Mason, Rob Rice,
bcm-kernel-feedback-list, dmaengine, devicetree, linux-arm-kernel,
linux-kernel, linux-crypto, linux-raid, Anup Patel
In-Reply-To: <1486717628-17580-1-git-send-email-anup.patel@broadcom.com>
The Broadcom stream buffer accelerator (SBA) provides offloading
capabilities for RAID operations. This SBA offload engine is
accessible via the Broadcom SoC specific ring manager.
This patch adds the Broadcom SBA RAID driver, which provides one
DMA device with RAID capabilities using one or more Broadcom
SoC specific ring manager channels. The SBA RAID driver in its
current shape implements memcpy, xor, and pq operations.
Signed-off-by: Anup Patel <anup.patel@broadcom.com>
Reviewed-by: Ray Jui <ray.jui@broadcom.com>
---
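The 64-bit command words built by the driver can be tried out in user
space. The sketch below copies the SBA_ENC()/SBA_DEC() helpers and a few
field definitions from the patch, with GENMASK()/BIT() expanded by hand
and u64 replaced by uint64_t. The sba_model_load_cmd() and
sba_model_write_cmd() helpers are hypothetical names introduced here; they
mirror the field sequence used for the memcpy load and write commands but
are not part of the driver.

```c
#include <stdint.h>

/* User-space copies of the patch's helper macros (u64 -> uint64_t). */
#define SBA_DEC(_d, _s, _m)	(((_d) >> (_s)) & (_m))
#define SBA_ENC(_d, _v, _s, _m)					\
	do {							\
		(_d) &= ~((uint64_t)(_m) << (_s));		\
		(_d) |= (((uint64_t)(_v) & (_m)) << (_s));	\
	} while (0)

/* Field layout taken from the patch; masks expanded by hand. */
#define SBA_TYPE_SHIFT		48
#define SBA_TYPE_MASK		0x3
#define SBA_TYPE_A		0x0
#define SBA_TYPE_B		0x2
#define SBA_USER_DEF_SHIFT	32
#define SBA_USER_DEF_MASK	0xffff
#define SBA_RESP_SHIFT		16
#define SBA_RESP_MASK		0x1
#define SBA_C_MDATA_SHIFT	8
#define SBA_C_MDATA_MASK	0xff
#define SBA_CMD_SHIFT		0
#define SBA_CMD_MASK		0xf
#define SBA_CMD_LOAD_BUFFER	0x9
#define SBA_CMD_WRITE_BUFFER	0xc

/* Hypothetical helper: Type-B command loading len bytes into buffer buf. */
static uint64_t sba_model_load_cmd(uint32_t len, uint32_t buf)
{
	uint64_t cmd = 0;

	SBA_ENC(cmd, SBA_TYPE_B, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
	SBA_ENC(cmd, len, SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
	SBA_ENC(cmd, buf & 0x3, SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
	SBA_ENC(cmd, SBA_CMD_LOAD_BUFFER, SBA_CMD_SHIFT, SBA_CMD_MASK);

	return cmd;
}

/* Hypothetical helper: Type-A command writing buffer buf, response on. */
static uint64_t sba_model_write_cmd(uint32_t len, uint32_t buf)
{
	uint64_t cmd = 0;

	SBA_ENC(cmd, SBA_TYPE_A, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
	SBA_ENC(cmd, len, SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
	SBA_ENC(cmd, 0x1, SBA_RESP_SHIFT, SBA_RESP_MASK);
	SBA_ENC(cmd, buf & 0x3, SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
	SBA_ENC(cmd, SBA_CMD_WRITE_BUFFER, SBA_CMD_SHIFT, SBA_CMD_MASK);

	return cmd;
}
```

For example, SBA_DEC(sba_model_load_cmd(4096, 0), SBA_USER_DEF_SHIFT,
SBA_USER_DEF_MASK) gives back the length 4096. Because SBA_ENC() clears
the field before setting it, a field can also be re-encoded in place.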
drivers/dma/Kconfig | 13 +
drivers/dma/Makefile | 1 +
drivers/dma/bcm-sba-raid.c | 1711 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 1725 insertions(+)
create mode 100644 drivers/dma/bcm-sba-raid.c
diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 263495d..bf8fb84 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -99,6 +99,19 @@ config AXI_DMAC
controller is often used in Analog Device's reference designs for FPGA
platforms.
+config BCM_SBA_RAID
+ tristate "Broadcom SBA RAID engine support"
+ depends on (ARM64 && MAILBOX && RAID6_PQ) || COMPILE_TEST
+ select DMA_ENGINE
+ select DMA_ENGINE_RAID
+ select ASYNC_TX_ENABLE_CHANNEL_SWITCH
+ default ARCH_BCM_IPROC
+ help
+ Enable support for Broadcom SBA RAID Engine. The SBA RAID
+ engine is available on most of the Broadcom iProc SoCs. It
+ has the capability to offload memcpy, xor and pq computation
+ for raid5/6.
+
config COH901318
bool "ST-Ericsson COH901318 DMA support"
select DMA_ENGINE
diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
index a4fa336..ba96bdd 100644
--- a/drivers/dma/Makefile
+++ b/drivers/dma/Makefile
@@ -17,6 +17,7 @@ obj-$(CONFIG_AMCC_PPC440SPE_ADMA) += ppc4xx/
obj-$(CONFIG_AT_HDMAC) += at_hdmac.o
obj-$(CONFIG_AT_XDMAC) += at_xdmac.o
obj-$(CONFIG_AXI_DMAC) += dma-axi-dmac.o
+obj-$(CONFIG_BCM_SBA_RAID) += bcm-sba-raid.o
obj-$(CONFIG_COH901318) += coh901318.o coh901318_lli.o
obj-$(CONFIG_DMA_BCM2835) += bcm2835-dma.o
obj-$(CONFIG_DMA_JZ4740) += dma-jz4740.o
diff --git a/drivers/dma/bcm-sba-raid.c b/drivers/dma/bcm-sba-raid.c
new file mode 100644
index 0000000..bab9918
--- /dev/null
+++ b/drivers/dma/bcm-sba-raid.c
@@ -0,0 +1,1711 @@
+/*
+ * Copyright (C) 2017 Broadcom
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+/*
+ * Broadcom SBA RAID Driver
+ *
+ * The Broadcom stream buffer accelerator (SBA) provides offloading
+ * capabilities for RAID operations. The SBA offload engine is accessible
+ * via the Broadcom SoC specific ring manager. Two or more offload engines
+ * can share the same ring manager; for this reason, the ring manager
+ * driver is implemented as a mailbox controller driver and the offload
+ * engine drivers are implemented as mailbox clients.
+ *
+ * Typically, the Broadcom SoC specific ring manager implements a large
+ * number of hardware rings over one or more SBA hardware devices. By
+ * design, the internal buffer size of an SBA hardware device is limited,
+ * but all offload operations supported by SBA can be broken down into
+ * multiple small requests and executed in parallel on multiple SBA
+ * hardware devices to achieve high throughput.
+ *
+ * The Broadcom SBA RAID driver does not require any register programming;
+ * it only submits requests to the SBA hardware device via mailbox channels.
+ * This driver implements a DMA device with one DMA channel using a set
+ * of mailbox channels provided by the Broadcom SoC specific ring manager
+ * driver. To exploit parallelism (as described above), all DMA requests
+ * coming to the SBA RAID DMA channel are broken down into smaller requests
+ * and submitted to multiple mailbox channels in round-robin fashion.
+ * To have more SBA DMA channels, we can create more SBA device nodes
+ * in the Broadcom SoC specific DTS based on the number of hardware rings
+ * supported by the Broadcom SoC ring manager.
+ */
+
+#include <linux/bitops.h>
+#include <linux/dma-mapping.h>
+#include <linux/dmaengine.h>
+#include <linux/list.h>
+#include <linux/mailbox_client.h>
+#include <linux/mailbox/brcm-message.h>
+#include <linux/module.h>
+#include <linux/of_device.h>
+#include <linux/slab.h>
+#include <linux/raid/pq.h>
+
+#include "dmaengine.h"
+
+/* SBA command helper macros */
+#define SBA_DEC(_d, _s, _m) (((_d) >> (_s)) & (_m))
+#define SBA_ENC(_d, _v, _s, _m) \
+ do { \
+ (_d) &= ~((u64)(_m) << (_s)); \
+ (_d) |= (((u64)(_v) & (_m)) << (_s)); \
+ } while (0)
+
+/* SBA command related defines */
+#define SBA_TYPE_SHIFT 48
+#define SBA_TYPE_MASK GENMASK(1, 0)
+#define SBA_TYPE_A 0x0
+#define SBA_TYPE_B 0x2
+#define SBA_TYPE_C 0x3
+#define SBA_USER_DEF_SHIFT 32
+#define SBA_USER_DEF_MASK GENMASK(15, 0)
+#define SBA_R_MDATA_SHIFT 24
+#define SBA_R_MDATA_MASK GENMASK(7, 0)
+#define SBA_C_MDATA_MS_SHIFT 18
+#define SBA_C_MDATA_MS_MASK GENMASK(1, 0)
+#define SBA_INT_SHIFT 17
+#define SBA_INT_MASK BIT(0)
+#define SBA_RESP_SHIFT 16
+#define SBA_RESP_MASK BIT(0)
+#define SBA_C_MDATA_SHIFT 8
+#define SBA_C_MDATA_MASK GENMASK(7, 0)
+#define SBA_C_MDATA_BNUMx_SHIFT(__bnum) (2 * (__bnum))
+#define SBA_C_MDATA_BNUMx_MASK GENMASK(1, 0)
+#define SBA_C_MDATA_DNUM_SHIFT 5
+#define SBA_C_MDATA_DNUM_MASK GENMASK(4, 0)
+#define SBA_C_MDATA_LS(__v) ((__v) & 0xff)
+#define SBA_C_MDATA_MS(__v) (((__v) >> 8) & 0x3)
+#define SBA_CMD_SHIFT 0
+#define SBA_CMD_MASK GENMASK(3, 0)
+#define SBA_CMD_ZERO_BUFFER 0x4
+#define SBA_CMD_ZERO_ALL_BUFFERS 0x8
+#define SBA_CMD_LOAD_BUFFER 0x9
+#define SBA_CMD_XOR 0xa
+#define SBA_CMD_GALOIS_XOR 0xb
+#define SBA_CMD_WRITE_BUFFER 0xc
+#define SBA_CMD_GALOIS 0xe
+
+/* Driver helper macros */
+#define to_sba_request(tx) \
+ container_of(tx, struct sba_request, tx)
+#define to_sba_device(dchan) \
+ container_of(dchan, struct sba_device, dma_chan)
+
+enum sba_request_state {
+ SBA_REQUEST_STATE_FREE = 1,
+ SBA_REQUEST_STATE_ALLOCED = 2,
+ SBA_REQUEST_STATE_PENDING = 3,
+ SBA_REQUEST_STATE_ACTIVE = 4,
+ SBA_REQUEST_STATE_RECEIVED = 5,
+ SBA_REQUEST_STATE_COMPLETED = 6,
+ SBA_REQUEST_STATE_ABORTED = 7,
+};
+
+struct sba_request {
+ /* Global state */
+ struct list_head node;
+ struct sba_device *sba;
+ enum sba_request_state state;
+ bool fence;
+ /* Chained requests management */
+ struct sba_request *first;
+ struct list_head next;
+ unsigned int next_count;
+ atomic_t next_pending_count;
+ /* BRCM message data */
+ void *resp;
+ dma_addr_t resp_dma;
+ struct brcm_sba_command *cmds;
+ struct brcm_message msg;
+ struct dma_async_tx_descriptor tx;
+};
+
+enum sba_version {
+ SBA_VER_1 = 0,
+ SBA_VER_2
+};
+
+struct sba_device {
+ /* Underlying device */
+ struct device *dev;
+ /* DT configuration parameters */
+ enum sba_version ver;
+ /* Derived configuration parameters */
+ u32 max_req;
+ u32 hw_buf_size;
+ u32 hw_resp_size;
+ u32 max_pq_coefs;
+ u32 max_pq_srcs;
+ u32 max_cmd_per_req;
+ u32 max_xor_srcs;
+ u32 max_resp_pool_size;
+ u32 max_cmds_pool_size;
+ /* Mailbox client and Mailbox channels */
+ struct mbox_client client;
+ int mchans_count;
+ atomic_t mchans_current;
+ struct mbox_chan **mchans;
+ struct device *mbox_dev;
+ /* DMA device and DMA channel */
+ struct dma_device dma_dev;
+ struct dma_chan dma_chan;
+ /* DMA channel resources */
+ void *resp_base;
+ dma_addr_t resp_dma_base;
+ void *cmds_base;
+ dma_addr_t cmds_dma_base;
+ spinlock_t reqs_lock;
+ struct sba_request *reqs;
+ bool reqs_fence;
+ struct list_head reqs_alloc_list;
+ struct list_head reqs_pending_list;
+ struct list_head reqs_active_list;
+ struct list_head reqs_received_list;
+ struct list_head reqs_completed_list;
+ struct list_head reqs_aborted_list;
+ struct list_head reqs_free_list;
+ int reqs_free_count;
+};
+
+/* ====== C_MDATA helper routines ===== */
+
+static inline u32 sba_cmd_load_c_mdata(u32 b0)
+{
+ return b0 & SBA_C_MDATA_BNUMx_MASK;
+}
+
+static inline u32 sba_cmd_write_c_mdata(u32 b0)
+{
+ return b0 & SBA_C_MDATA_BNUMx_MASK;
+}
+
+static inline u32 sba_cmd_xor_c_mdata(u32 b1, u32 b0)
+{
+ return (b0 & SBA_C_MDATA_BNUMx_MASK) |
+ ((b1 & SBA_C_MDATA_BNUMx_MASK) << SBA_C_MDATA_BNUMx_SHIFT(1));
+}
+
+static inline u32 sba_cmd_pq_c_mdata(u32 d, u32 b1, u32 b0)
+{
+ return (b0 & SBA_C_MDATA_BNUMx_MASK) |
+ ((b1 & SBA_C_MDATA_BNUMx_MASK) << SBA_C_MDATA_BNUMx_SHIFT(1)) |
+ ((d & SBA_C_MDATA_DNUM_MASK) << SBA_C_MDATA_DNUM_SHIFT);
+}
+
+/* ====== Channel resource management routines ===== */
+
+static struct sba_request *sba_alloc_request(struct sba_device *sba)
+{
+ unsigned long flags;
+ struct sba_request *req = NULL;
+
+ spin_lock_irqsave(&sba->reqs_lock, flags);
+
+ if (!list_empty(&sba->reqs_free_list)) {
+ req = list_first_entry(&sba->reqs_free_list,
+ struct sba_request,
+ node);
+
+ list_move_tail(&req->node, &sba->reqs_alloc_list);
+ req->state = SBA_REQUEST_STATE_ALLOCED;
+ req->fence = false;
+ req->first = req;
+ INIT_LIST_HEAD(&req->next);
+ req->next_count = 1;
+ atomic_set(&req->next_pending_count, 1);
+
+ sba->reqs_free_count--;
+
+ dma_async_tx_descriptor_init(&req->tx, &sba->dma_chan);
+ }
+
+ spin_unlock_irqrestore(&sba->reqs_lock, flags);
+
+ return req;
+}
+
+/* Note: Must be called with sba->reqs_lock held */
+static void _sba_pending_request(struct sba_device *sba,
+ struct sba_request *req)
+{
+ req->state = SBA_REQUEST_STATE_PENDING;
+ list_move_tail(&req->node, &sba->reqs_pending_list);
+ if (list_empty(&sba->reqs_active_list))
+ sba->reqs_fence = false;
+}
+
+/* Note: Must be called with sba->reqs_lock held */
+static bool _sba_active_request(struct sba_device *sba,
+ struct sba_request *req)
+{
+ if (list_empty(&sba->reqs_active_list))
+ sba->reqs_fence = false;
+ if (sba->reqs_fence)
+ return false;
+ req->state = SBA_REQUEST_STATE_ACTIVE;
+ list_move_tail(&req->node, &sba->reqs_active_list);
+ if (req->fence)
+ sba->reqs_fence = true;
+ return true;
+}
+
+/* Note: Must be called with sba->reqs_lock held */
+static void _sba_abort_request(struct sba_device *sba,
+ struct sba_request *req)
+{
+ req->state = SBA_REQUEST_STATE_ABORTED;
+ list_move_tail(&req->node, &sba->reqs_aborted_list);
+ if (list_empty(&sba->reqs_active_list))
+ sba->reqs_fence = false;
+}
+
+/* Note: Must be called with sba->reqs_lock held */
+static void _sba_free_request(struct sba_device *sba,
+ struct sba_request *req)
+{
+ req->state = SBA_REQUEST_STATE_FREE;
+ list_move_tail(&req->node, &sba->reqs_free_list);
+ if (list_empty(&sba->reqs_active_list))
+ sba->reqs_fence = false;
+ sba->reqs_free_count++;
+}
+
+static void sba_received_request(struct sba_request *req)
+{
+ unsigned long flags;
+ struct sba_device *sba = req->sba;
+
+ spin_lock_irqsave(&sba->reqs_lock, flags);
+ req->state = SBA_REQUEST_STATE_RECEIVED;
+ list_move_tail(&req->node, &sba->reqs_received_list);
+ spin_unlock_irqrestore(&sba->reqs_lock, flags);
+}
+
+static void sba_complete_chained_requests(struct sba_request *req)
+{
+ unsigned long flags;
+ struct sba_request *nreq;
+ struct sba_device *sba = req->sba;
+
+ spin_lock_irqsave(&sba->reqs_lock, flags);
+
+ req->state = SBA_REQUEST_STATE_COMPLETED;
+ list_move_tail(&req->node, &sba->reqs_completed_list);
+ list_for_each_entry(nreq, &req->next, next) {
+ nreq->state = SBA_REQUEST_STATE_COMPLETED;
+ list_move_tail(&nreq->node, &sba->reqs_completed_list);
+ }
+ if (list_empty(&sba->reqs_active_list))
+ sba->reqs_fence = false;
+
+ spin_unlock_irqrestore(&sba->reqs_lock, flags);
+}
+
+static void sba_free_chained_requests(struct sba_request *req)
+{
+ unsigned long flags;
+ struct sba_request *nreq;
+ struct sba_device *sba = req->sba;
+
+ spin_lock_irqsave(&sba->reqs_lock, flags);
+
+ _sba_free_request(sba, req);
+ list_for_each_entry(nreq, &req->next, next) {
+ _sba_free_request(sba, nreq);
+ }
+
+ spin_unlock_irqrestore(&sba->reqs_lock, flags);
+}
+
+static void sba_chain_request(struct sba_request *first,
+ struct sba_request *req)
+{
+ unsigned long flags;
+ struct sba_device *sba = req->sba;
+
+ spin_lock_irqsave(&sba->reqs_lock, flags);
+
+ list_add_tail(&req->next, &first->next);
+ req->first = first;
+ first->next_count++;
+ atomic_set(&first->next_pending_count, first->next_count);
+
+ spin_unlock_irqrestore(&sba->reqs_lock, flags);
+}
+
+static void sba_cleanup_nonpending_requests(struct sba_device *sba)
+{
+ unsigned long flags;
+ struct sba_request *req, *req1;
+
+ spin_lock_irqsave(&sba->reqs_lock, flags);
+
+ /* Free up all allocated requests */
+ list_for_each_entry_safe(req, req1, &sba->reqs_alloc_list, node) {
+ _sba_free_request(sba, req);
+ }
+
+ /* Free up all received requests */
+ list_for_each_entry_safe(req, req1, &sba->reqs_received_list, node) {
+ _sba_free_request(sba, req);
+ }
+
+ /* Free up all completed requests */
+ list_for_each_entry_safe(req, req1, &sba->reqs_completed_list, node) {
+ _sba_free_request(sba, req);
+ }
+
+ /* Set all active requests as aborted */
+ list_for_each_entry_safe(req, req1, &sba->reqs_active_list, node) {
+ _sba_abort_request(sba, req);
+ }
+
+ /*
+ * Note: We expect that aborted requests will eventually be
+ * freed by sba_receive_message()
+ */
+
+ spin_unlock_irqrestore(&sba->reqs_lock, flags);
+}
+
+static void sba_cleanup_pending_requests(struct sba_device *sba)
+{
+ unsigned long flags;
+ struct sba_request *req, *req1;
+
+ spin_lock_irqsave(&sba->reqs_lock, flags);
+
+ /* Free up all pending requests */
+ list_for_each_entry_safe(req, req1, &sba->reqs_pending_list, node) {
+ /* Free up this pending request */
+ _sba_free_request(sba, req);
+ }
+
+ spin_unlock_irqrestore(&sba->reqs_lock, flags);
+}
+
+/* ====== DMAENGINE callbacks ===== */
+
+static void sba_free_chan_resources(struct dma_chan *dchan)
+{
+ /*
+ * Channel resources are pre-allocated, so we just free up
+ * whatever we can so that we can re-use the pre-allocated
+ * channel resources next time.
+ */
+ sba_cleanup_nonpending_requests(to_sba_device(dchan));
+}
+
+static int sba_device_terminate_all(struct dma_chan *dchan)
+{
+ /* Cleanup all pending requests */
+ sba_cleanup_pending_requests(to_sba_device(dchan));
+
+ return 0;
+}
+
+static int sba_send_mbox_request(struct sba_device *sba,
+ struct sba_request *req)
+{
+ int mchans_idx, ret = 0;
+
+ /* Select mailbox channel in round-robin fashion */
+ mchans_idx = atomic_inc_return(&sba->mchans_current);
+ mchans_idx = mchans_idx % sba->mchans_count;
+
+ /* Send message for the request */
+ req->msg.error = 0;
+ ret = mbox_send_message(sba->mchans[mchans_idx], &req->msg);
+ if (ret < 0) {
+ dev_err(sba->dev, "send message failed with error %d", ret);
+ return ret;
+ }
+ ret = req->msg.error;
+ if (ret < 0) {
+ dev_err(sba->dev, "message error %d", ret);
+ return ret;
+ }
+
+ return 0;
+}
+
+static void sba_issue_pending(struct dma_chan *dchan)
+{
+ int ret;
+ unsigned long flags;
+ struct sba_request *req, *req1;
+ struct sba_device *sba = to_sba_device(dchan);
+
+ spin_lock_irqsave(&sba->reqs_lock, flags);
+
+ /* Process all pending requests */
+ list_for_each_entry_safe(req, req1, &sba->reqs_pending_list, node) {
+ /* Try to make request active */
+ if (!_sba_active_request(sba, req))
+ break;
+
+ /* Send request to mailbox channel */
+ spin_unlock_irqrestore(&sba->reqs_lock, flags);
+ ret = sba_send_mbox_request(sba, req);
+ spin_lock_irqsave(&sba->reqs_lock, flags);
+
+ /* If something went wrong then keep request pending */
+ if (ret < 0) {
+ _sba_pending_request(sba, req);
+ break;
+ }
+ }
+
+ spin_unlock_irqrestore(&sba->reqs_lock, flags);
+}
+
+static dma_cookie_t sba_tx_submit(struct dma_async_tx_descriptor *tx)
+{
+ unsigned long flags;
+ dma_cookie_t cookie;
+ struct sba_device *sba;
+ struct sba_request *req, *nreq;
+
+ if (unlikely(!tx))
+ return -EINVAL;
+
+ sba = to_sba_device(tx->chan);
+ req = to_sba_request(tx);
+
+ /* Assign cookie and mark all chained requests pending */
+ spin_lock_irqsave(&sba->reqs_lock, flags);
+ cookie = dma_cookie_assign(tx);
+ _sba_pending_request(sba, req);
+ list_for_each_entry(nreq, &req->next, next) {
+ _sba_pending_request(sba, nreq);
+ }
+ spin_unlock_irqrestore(&sba->reqs_lock, flags);
+
+ return cookie;
+}
+
+static enum dma_status sba_tx_status(struct dma_chan *dchan,
+ dma_cookie_t cookie,
+ struct dma_tx_state *txstate)
+{
+ int mchan_idx;
+ enum dma_status ret;
+ struct sba_device *sba = to_sba_device(dchan);
+
+ for (mchan_idx = 0; mchan_idx < sba->mchans_count; mchan_idx++)
+ mbox_client_peek_data(sba->mchans[mchan_idx]);
+
+ ret = dma_cookie_status(dchan, cookie, txstate);
+ if (ret == DMA_COMPLETE)
+ return ret;
+
+ return dma_cookie_status(dchan, cookie, txstate);
+}
+
+static void sba_fillup_memcpy_msg(struct sba_request *req,
+ struct brcm_sba_command *cmds,
+ struct brcm_message *msg,
+ dma_addr_t msg_offset, size_t msg_len,
+ dma_addr_t dst, dma_addr_t src)
+{
+ u64 cmd;
+ u32 c_mdata;
+ struct brcm_sba_command *cmdsp = cmds;
+
+ /* Type-B command to load data into buf0 */
+ cmd = 0;
+ SBA_ENC(cmd, SBA_TYPE_B, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+ SBA_ENC(cmd, msg_len,
+ SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+ c_mdata = sba_cmd_load_c_mdata(0);
+ SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+ SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+ SBA_ENC(cmd, SBA_CMD_LOAD_BUFFER,
+ SBA_CMD_SHIFT, SBA_CMD_MASK);
+ cmdsp->cmd = cmd;
+ *cmdsp->cmd_dma = cpu_to_le64(cmd);
+ cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+ cmdsp->data = src + msg_offset;
+ cmdsp->data_len = msg_len;
+ cmdsp++;
+
+ /* Type-A command to write buf0 */
+ cmd = 0;
+ SBA_ENC(cmd, SBA_TYPE_A, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+ SBA_ENC(cmd, msg_len,
+ SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+ SBA_ENC(cmd, 0x1, SBA_RESP_SHIFT, SBA_RESP_MASK);
+ c_mdata = sba_cmd_write_c_mdata(0);
+ SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+ SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+ SBA_ENC(cmd, SBA_CMD_WRITE_BUFFER,
+ SBA_CMD_SHIFT, SBA_CMD_MASK);
+ cmdsp->cmd = cmd;
+ *cmdsp->cmd_dma = cpu_to_le64(cmd);
+ cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+ if (req->sba->hw_resp_size) {
+ cmdsp->flags |= BRCM_SBA_CMD_HAS_RESP;
+ cmdsp->resp = req->resp_dma;
+ cmdsp->resp_len = req->sba->hw_resp_size;
+ }
+ cmdsp->flags |= BRCM_SBA_CMD_HAS_OUTPUT;
+ cmdsp->data = dst + msg_offset;
+ cmdsp->data_len = msg_len;
+ cmdsp++;
+
+ /* Fillup brcm_message */
+ msg->type = BRCM_MESSAGE_SBA;
+ msg->sba.cmds = cmds;
+ msg->sba.cmds_count = cmdsp - cmds;
+ msg->ctx = req;
+ msg->error = 0;
+}
+
+static struct sba_request *
+sba_prep_dma_memcpy_req(struct sba_device *sba,
+ dma_addr_t off, dma_addr_t dst, dma_addr_t src,
+ size_t len, unsigned long flags)
+{
+ struct sba_request *req = NULL;
+
+ /* Alloc new request */
+ req = sba_alloc_request(sba);
+ if (!req)
+ return NULL;
+ req->fence = (flags & DMA_PREP_FENCE) ? true : false;
+
+ /* Fillup request message */
+ sba_fillup_memcpy_msg(req, req->cmds, &req->msg,
+ off, len, dst, src);
+
+ /* Init async_tx descriptor */
+ req->tx.flags = flags;
+ req->tx.cookie = -EBUSY;
+
+ return req;
+}
+
+static struct dma_async_tx_descriptor *
+sba_prep_dma_memcpy(struct dma_chan *dchan, dma_addr_t dst, dma_addr_t src,
+ size_t len, unsigned long flags)
+{
+ size_t req_len;
+ dma_addr_t off = 0;
+ struct sba_device *sba = to_sba_device(dchan);
+ struct sba_request *first = NULL, *req;
+
+ /* Create chained requests where each request is up to hw_buf_size */
+ while (len) {
+ req_len = (len < sba->hw_buf_size) ? len : sba->hw_buf_size;
+
+ req = sba_prep_dma_memcpy_req(sba, off, dst, src,
+ req_len, flags);
+ if (!req) {
+ if (first)
+ sba_free_chained_requests(first);
+ return NULL;
+ }
+
+ if (first)
+ sba_chain_request(first, req);
+ else
+ first = req;
+
+ off += req_len;
+ len -= req_len;
+ }
+
+ return (first) ? &first->tx : NULL;
+}
+
+static void sba_fillup_xor_msg(struct sba_request *req,
+ struct brcm_sba_command *cmds,
+ struct brcm_message *msg,
+ dma_addr_t msg_offset, size_t msg_len,
+ dma_addr_t dst, dma_addr_t *src, u32 src_cnt)
+{
+ u64 cmd;
+ u32 c_mdata;
+ unsigned int i;
+ struct brcm_sba_command *cmdsp = cmds;
+
+ /* Type-B command to load data into buf0 */
+ cmd = 0;
+ SBA_ENC(cmd, SBA_TYPE_B, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+ SBA_ENC(cmd, msg_len,
+ SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+ c_mdata = sba_cmd_load_c_mdata(0);
+ SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+ SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+ SBA_ENC(cmd, SBA_CMD_LOAD_BUFFER,
+ SBA_CMD_SHIFT, SBA_CMD_MASK);
+ cmdsp->cmd = cmd;
+ *cmdsp->cmd_dma = cpu_to_le64(cmd);
+ cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+ cmdsp->data = src[0] + msg_offset;
+ cmdsp->data_len = msg_len;
+ cmdsp++;
+
+ /* Type-B commands to xor data with buf0 and put it back in buf0 */
+ for (i = 1; i < src_cnt; i++) {
+ cmd = 0;
+ SBA_ENC(cmd, SBA_TYPE_B, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+ SBA_ENC(cmd, msg_len,
+ SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+ c_mdata = sba_cmd_xor_c_mdata(0, 0);
+ SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+ SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+ SBA_ENC(cmd, SBA_CMD_XOR, SBA_CMD_SHIFT, SBA_CMD_MASK);
+ cmdsp->cmd = cmd;
+ *cmdsp->cmd_dma = cpu_to_le64(cmd);
+ cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+ cmdsp->data = src[i] + msg_offset;
+ cmdsp->data_len = msg_len;
+ cmdsp++;
+ }
+
+ /* Type-A command to write buf0 */
+ cmd = 0;
+ SBA_ENC(cmd, SBA_TYPE_A, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+ SBA_ENC(cmd, msg_len,
+ SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+ SBA_ENC(cmd, 0x1, SBA_RESP_SHIFT, SBA_RESP_MASK);
+ c_mdata = sba_cmd_write_c_mdata(0);
+ SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+ SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+ SBA_ENC(cmd, SBA_CMD_WRITE_BUFFER,
+ SBA_CMD_SHIFT, SBA_CMD_MASK);
+ cmdsp->cmd = cmd;
+ *cmdsp->cmd_dma = cpu_to_le64(cmd);
+ cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+ if (req->sba->hw_resp_size) {
+ cmdsp->flags |= BRCM_SBA_CMD_HAS_RESP;
+ cmdsp->resp = req->resp_dma;
+ cmdsp->resp_len = req->sba->hw_resp_size;
+ }
+ cmdsp->flags |= BRCM_SBA_CMD_HAS_OUTPUT;
+ cmdsp->data = dst + msg_offset;
+ cmdsp->data_len = msg_len;
+ cmdsp++;
+
+ /* Fillup brcm_message */
+ msg->type = BRCM_MESSAGE_SBA;
+ msg->sba.cmds = cmds;
+ msg->sba.cmds_count = cmdsp - cmds;
+ msg->ctx = req;
+ msg->error = 0;
+}
+
+struct sba_request *
+sba_prep_dma_xor_req(struct sba_device *sba,
+ dma_addr_t off, dma_addr_t dst, dma_addr_t *src,
+ u32 src_cnt, size_t len, unsigned long flags)
+{
+ struct sba_request *req = NULL;
+
+ /* Alloc new request */
+ req = sba_alloc_request(sba);
+ if (!req)
+ return NULL;
+ req->fence = (flags & DMA_PREP_FENCE) ? true : false;
+
+ /* Fillup request message */
+ sba_fillup_xor_msg(req, req->cmds, &req->msg,
+ off, len, dst, src, src_cnt);
+
+ /* Init async_tx descriptor */
+ req->tx.flags = flags;
+ req->tx.cookie = -EBUSY;
+
+ return req;
+}
+
+static struct dma_async_tx_descriptor *
+sba_prep_dma_xor(struct dma_chan *dchan, dma_addr_t dst, dma_addr_t *src,
+ u32 src_cnt, size_t len, unsigned long flags)
+{
+ size_t req_len;
+ dma_addr_t off = 0;
+ struct sba_device *sba = to_sba_device(dchan);
+ struct sba_request *first = NULL, *req;
+
+ /* Sanity checks */
+ if (unlikely(src_cnt > sba->max_xor_srcs))
+ return NULL;
+
+ /* Create chained requests where each request is up to hw_buf_size */
+ while (len) {
+ req_len = (len < sba->hw_buf_size) ? len : sba->hw_buf_size;
+
+ req = sba_prep_dma_xor_req(sba, off, dst, src, src_cnt,
+ req_len, flags);
+ if (!req) {
+ if (first)
+ sba_free_chained_requests(first);
+ return NULL;
+ }
+
+ if (first)
+ sba_chain_request(first, req);
+ else
+ first = req;
+
+ off += req_len;
+ len -= req_len;
+ }
+
+ return (first) ? &first->tx : NULL;
+}
+
+static void sba_fillup_pq_msg(struct sba_request *req,
+ bool pq_continue,
+ struct brcm_sba_command *cmds,
+ struct brcm_message *msg,
+ dma_addr_t msg_offset, size_t msg_len,
+ dma_addr_t *dst_p, dma_addr_t *dst_q,
+ const u8 *scf, dma_addr_t *src, u32 src_cnt)
+{
+ u64 cmd;
+ u32 c_mdata;
+ unsigned int i;
+ struct brcm_sba_command *cmdsp = cmds;
+
+ if (pq_continue) {
+ /* Type-B command to load old P into buf0 */
+ if (dst_p) {
+ cmd = 0;
+ SBA_ENC(cmd, SBA_TYPE_B,
+ SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+ SBA_ENC(cmd, msg_len,
+ SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+ c_mdata = sba_cmd_load_c_mdata(0);
+ SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+ SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+ SBA_ENC(cmd, SBA_CMD_LOAD_BUFFER,
+ SBA_CMD_SHIFT, SBA_CMD_MASK);
+ cmdsp->cmd = cmd;
+ *cmdsp->cmd_dma = cpu_to_le64(cmd);
+ cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+ cmdsp->data = *dst_p + msg_offset;
+ cmdsp->data_len = msg_len;
+ cmdsp++;
+ }
+
+ /* Type-B command to load old Q into buf1 */
+ if (dst_q) {
+ cmd = 0;
+ SBA_ENC(cmd, SBA_TYPE_B,
+ SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+ SBA_ENC(cmd, msg_len,
+ SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+ c_mdata = sba_cmd_load_c_mdata(1);
+ SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+ SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+ SBA_ENC(cmd, SBA_CMD_LOAD_BUFFER,
+ SBA_CMD_SHIFT, SBA_CMD_MASK);
+ cmdsp->cmd = cmd;
+ *cmdsp->cmd_dma = cpu_to_le64(cmd);
+ cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+ cmdsp->data = *dst_q + msg_offset;
+ cmdsp->data_len = msg_len;
+ cmdsp++;
+ }
+ } else {
+ /* Type-A command to zero all buffers */
+ cmd = 0;
+ SBA_ENC(cmd, SBA_TYPE_A, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+ SBA_ENC(cmd, msg_len,
+ SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+ SBA_ENC(cmd, SBA_CMD_ZERO_ALL_BUFFERS,
+ SBA_CMD_SHIFT, SBA_CMD_MASK);
+ cmdsp->cmd = cmd;
+ *cmdsp->cmd_dma = cpu_to_le64(cmd);
+ cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+ cmdsp++;
+ }
+
+ /* Type-B commands to generate P into buf0 and Q into buf1 */
+ for (i = 0; i < src_cnt; i++) {
+ cmd = 0;
+ SBA_ENC(cmd, SBA_TYPE_B, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+ SBA_ENC(cmd, msg_len,
+ SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+ c_mdata = sba_cmd_pq_c_mdata(raid6_gflog[scf[i]], 1, 0);
+ SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+ SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+ SBA_ENC(cmd, SBA_C_MDATA_MS(c_mdata),
+ SBA_C_MDATA_MS_SHIFT, SBA_C_MDATA_MS_MASK);
+ SBA_ENC(cmd, SBA_CMD_GALOIS_XOR,
+ SBA_CMD_SHIFT, SBA_CMD_MASK);
+ cmdsp->cmd = cmd;
+ *cmdsp->cmd_dma = cpu_to_le64(cmd);
+ cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+ cmdsp->data = src[i] + msg_offset;
+ cmdsp->data_len = msg_len;
+ cmdsp++;
+ }
+
+ /* Type-A command to write buf0 */
+ if (dst_p) {
+ cmd = 0;
+ SBA_ENC(cmd, SBA_TYPE_A, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+ SBA_ENC(cmd, msg_len,
+ SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+ SBA_ENC(cmd, 0x1, SBA_RESP_SHIFT, SBA_RESP_MASK);
+ c_mdata = sba_cmd_write_c_mdata(0);
+ SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+ SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+ SBA_ENC(cmd, SBA_CMD_WRITE_BUFFER,
+ SBA_CMD_SHIFT, SBA_CMD_MASK);
+ cmdsp->cmd = cmd;
+ *cmdsp->cmd_dma = cpu_to_le64(cmd);
+ cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+ if (req->sba->hw_resp_size) {
+ cmdsp->flags |= BRCM_SBA_CMD_HAS_RESP;
+ cmdsp->resp = req->resp_dma;
+ cmdsp->resp_len = req->sba->hw_resp_size;
+ }
+ cmdsp->flags |= BRCM_SBA_CMD_HAS_OUTPUT;
+ cmdsp->data = *dst_p + msg_offset;
+ cmdsp->data_len = msg_len;
+ cmdsp++;
+ }
+
+ /* Type-A command to write buf1 */
+ if (dst_q) {
+ cmd = 0;
+ SBA_ENC(cmd, SBA_TYPE_A, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+ SBA_ENC(cmd, msg_len,
+ SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+ SBA_ENC(cmd, 0x1, SBA_RESP_SHIFT, SBA_RESP_MASK);
+ c_mdata = sba_cmd_write_c_mdata(1);
+ SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+ SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+ SBA_ENC(cmd, SBA_CMD_WRITE_BUFFER,
+ SBA_CMD_SHIFT, SBA_CMD_MASK);
+ cmdsp->cmd = cmd;
+ *cmdsp->cmd_dma = cpu_to_le64(cmd);
+ cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+ if (req->sba->hw_resp_size) {
+ cmdsp->flags |= BRCM_SBA_CMD_HAS_RESP;
+ cmdsp->resp = req->resp_dma;
+ cmdsp->resp_len = req->sba->hw_resp_size;
+ }
+ cmdsp->flags |= BRCM_SBA_CMD_HAS_OUTPUT;
+ cmdsp->data = *dst_q + msg_offset;
+ cmdsp->data_len = msg_len;
+ cmdsp++;
+ }
+
+ /* Fillup brcm_message */
+ msg->type = BRCM_MESSAGE_SBA;
+ msg->sba.cmds = cmds;
+ msg->sba.cmds_count = cmdsp - cmds;
+ msg->ctx = req;
+ msg->error = 0;
+}
+
+struct sba_request *
+sba_prep_dma_pq_req(struct sba_device *sba, dma_addr_t off,
+ dma_addr_t *dst_p, dma_addr_t *dst_q, dma_addr_t *src,
+ u32 src_cnt, const u8 *scf, size_t len, unsigned long flags)
+{
+ struct sba_request *req = NULL;
+
+ /* Alloc new request */
+ req = sba_alloc_request(sba);
+ if (!req)
+ return NULL;
+ req->fence = (flags & DMA_PREP_FENCE) ? true : false;
+
+ /* Fillup request messages */
+ sba_fillup_pq_msg(req, dmaf_continue(flags),
+ req->cmds, &req->msg,
+ off, len, dst_p, dst_q, scf, src, src_cnt);
+
+ /* Init async_tx descriptor */
+ req->tx.flags = flags;
+ req->tx.cookie = -EBUSY;
+
+ return req;
+}
+
+static void sba_fillup_pq_single_msg(struct sba_request *req,
+ bool pq_continue,
+ struct brcm_sba_command *cmds,
+ struct brcm_message *msg,
+ dma_addr_t msg_offset, size_t msg_len,
+ dma_addr_t *dst_p, dma_addr_t *dst_q,
+ dma_addr_t src, u8 scf)
+{
+ u64 cmd;
+ u32 c_mdata;
+ u8 pos, dpos = raid6_gflog[scf];
+ struct brcm_sba_command *cmdsp = cmds;
+
+ if (!dst_p)
+ goto skip_p;
+
+ if (pq_continue) {
+ /* Type-B command to load old P into buf0 */
+ cmd = 0;
+ SBA_ENC(cmd, SBA_TYPE_B,
+ SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+ SBA_ENC(cmd, msg_len,
+ SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+ c_mdata = sba_cmd_load_c_mdata(0);
+ SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+ SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+ SBA_ENC(cmd, SBA_CMD_LOAD_BUFFER,
+ SBA_CMD_SHIFT, SBA_CMD_MASK);
+ cmdsp->cmd = cmd;
+ *cmdsp->cmd_dma = cpu_to_le64(cmd);
+ cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+ cmdsp->data = *dst_p + msg_offset;
+ cmdsp->data_len = msg_len;
+ cmdsp++;
+
+ /*
+ * Type-B commands to xor data with buf0 and put it
+ * back in buf0
+ */
+ cmd = 0;
+ SBA_ENC(cmd, SBA_TYPE_B, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+ SBA_ENC(cmd, msg_len,
+ SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+ c_mdata = sba_cmd_xor_c_mdata(0, 0);
+ SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+ SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+ SBA_ENC(cmd, SBA_CMD_XOR, SBA_CMD_SHIFT, SBA_CMD_MASK);
+ cmdsp->cmd = cmd;
+ *cmdsp->cmd_dma = cpu_to_le64(cmd);
+ cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+ cmdsp->data = src + msg_offset;
+ cmdsp->data_len = msg_len;
+ cmdsp++;
+ } else {
+ /* Type-B command to load data into buf0 */
+ cmd = 0;
+ SBA_ENC(cmd, SBA_TYPE_B,
+ SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+ SBA_ENC(cmd, msg_len,
+ SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+ c_mdata = sba_cmd_load_c_mdata(0);
+ SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+ SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+ SBA_ENC(cmd, SBA_CMD_LOAD_BUFFER,
+ SBA_CMD_SHIFT, SBA_CMD_MASK);
+ cmdsp->cmd = cmd;
+ *cmdsp->cmd_dma = cpu_to_le64(cmd);
+ cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+ cmdsp->data = src + msg_offset;
+ cmdsp->data_len = msg_len;
+ cmdsp++;
+ }
+
+ /* Type-A command to write buf0 */
+ cmd = 0;
+ SBA_ENC(cmd, SBA_TYPE_A, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+ SBA_ENC(cmd, msg_len,
+ SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+ SBA_ENC(cmd, 0x1, SBA_RESP_SHIFT, SBA_RESP_MASK);
+ c_mdata = sba_cmd_write_c_mdata(0);
+ SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+ SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+ SBA_ENC(cmd, SBA_CMD_WRITE_BUFFER,
+ SBA_CMD_SHIFT, SBA_CMD_MASK);
+ cmdsp->cmd = cmd;
+ *cmdsp->cmd_dma = cpu_to_le64(cmd);
+ cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+ if (req->sba->hw_resp_size) {
+ cmdsp->flags |= BRCM_SBA_CMD_HAS_RESP;
+ cmdsp->resp = req->resp_dma;
+ cmdsp->resp_len = req->sba->hw_resp_size;
+ }
+ cmdsp->flags |= BRCM_SBA_CMD_HAS_OUTPUT;
+ cmdsp->data = *dst_p + msg_offset;
+ cmdsp->data_len = msg_len;
+ cmdsp++;
+
+skip_p:
+ if (!dst_q)
+ goto skip_q;
+
+ /* Type-A command to zero all buffers */
+ cmd = 0;
+ SBA_ENC(cmd, SBA_TYPE_A, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+ SBA_ENC(cmd, msg_len,
+ SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+ SBA_ENC(cmd, SBA_CMD_ZERO_ALL_BUFFERS,
+ SBA_CMD_SHIFT, SBA_CMD_MASK);
+ cmdsp->cmd = cmd;
+ *cmdsp->cmd_dma = cpu_to_le64(cmd);
+ cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+ cmdsp++;
+
+ if (dpos == 255)
+ goto skip_q_computation;
+ pos = (dpos < req->sba->max_pq_coefs) ?
+ dpos : (req->sba->max_pq_coefs - 1);
+
+ /*
+ * Type-B command to generate initial Q from data
+ * and store output into buf0
+ */
+ cmd = 0;
+ SBA_ENC(cmd, SBA_TYPE_B,
+ SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+ SBA_ENC(cmd, msg_len,
+ SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+ SBA_ENC(cmd, 0x0, SBA_RESP_SHIFT, SBA_RESP_MASK);
+ c_mdata = sba_cmd_pq_c_mdata(pos, 0, 0);
+ SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+ SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+ SBA_ENC(cmd, SBA_C_MDATA_MS(c_mdata),
+ SBA_C_MDATA_MS_SHIFT, SBA_C_MDATA_MS_MASK);
+ SBA_ENC(cmd, SBA_CMD_GALOIS, SBA_CMD_SHIFT, SBA_CMD_MASK);
+ cmdsp->cmd = cmd;
+ *cmdsp->cmd_dma = cpu_to_le64(cmd);
+ cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+ cmdsp->data = src + msg_offset;
+ cmdsp->data_len = msg_len;
+ cmdsp++;
+
+ dpos -= pos;
+
+ /* Multiple Type-A command to generate final Q */
+ while (dpos) {
+ pos = (dpos < req->sba->max_pq_coefs) ?
+ dpos : (req->sba->max_pq_coefs - 1);
+
+ /*
+ * Type-A command to generate Q with buf0 and
+ * buf1 store result in buf0
+ */
+ cmd = 0;
+ SBA_ENC(cmd, SBA_TYPE_A,
+ SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+ SBA_ENC(cmd, msg_len,
+ SBA_USER_DEF_SHIFT,
+ SBA_USER_DEF_MASK);
+ SBA_ENC(cmd, 0x0,
+ SBA_RESP_SHIFT, SBA_RESP_MASK);
+ c_mdata = sba_cmd_pq_c_mdata(pos, 0, 1);
+ SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+ SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+ SBA_ENC(cmd, SBA_C_MDATA_MS(c_mdata),
+ SBA_C_MDATA_MS_SHIFT,
+ SBA_C_MDATA_MS_MASK);
+ SBA_ENC(cmd, SBA_CMD_GALOIS,
+ SBA_CMD_SHIFT, SBA_CMD_MASK);
+ cmdsp->cmd = cmd;
+ *cmdsp->cmd_dma = cpu_to_le64(cmd);
+ cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+ cmdsp++;
+
+ dpos -= pos;
+ }
+
+skip_q_computation:
+ if (pq_continue) {
+ /*
+ * Type-B command to XOR previous output with
+ * buf0 and write it into buf0
+ */
+ cmd = 0;
+ SBA_ENC(cmd, SBA_TYPE_B,
+ SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+ SBA_ENC(cmd, msg_len,
+ SBA_USER_DEF_SHIFT,
+ SBA_USER_DEF_MASK);
+ SBA_ENC(cmd, 0x0,
+ SBA_RESP_SHIFT, SBA_RESP_MASK);
+ c_mdata = sba_cmd_xor_c_mdata(0, 0);
+ SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+ SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+ SBA_ENC(cmd, SBA_CMD_XOR,
+ SBA_CMD_SHIFT, SBA_CMD_MASK);
+ cmdsp->cmd = cmd;
+ *cmdsp->cmd_dma = cpu_to_le64(cmd);
+ cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+ cmdsp->data = *dst_q + msg_offset;
+ cmdsp->data_len = msg_len;
+ cmdsp++;
+ }
+
+ /* Type-A command to write buf0 */
+ cmd = 0;
+ SBA_ENC(cmd, SBA_TYPE_A,
+ SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+ SBA_ENC(cmd, msg_len,
+ SBA_USER_DEF_SHIFT,
+ SBA_USER_DEF_MASK);
+ SBA_ENC(cmd, 0x1,
+ SBA_RESP_SHIFT, SBA_RESP_MASK);
+ c_mdata = sba_cmd_write_c_mdata(0);
+ SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+ SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+ SBA_ENC(cmd, SBA_CMD_WRITE_BUFFER,
+ SBA_CMD_SHIFT, SBA_CMD_MASK);
+ cmdsp->cmd = cmd;
+ *cmdsp->cmd_dma = cpu_to_le64(cmd);
+ cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+ if (req->sba->hw_resp_size) {
+ cmdsp->flags |= BRCM_SBA_CMD_HAS_RESP;
+ cmdsp->resp = req->resp_dma;
+ cmdsp->resp_len = req->sba->hw_resp_size;
+ }
+ cmdsp->flags |= BRCM_SBA_CMD_HAS_OUTPUT;
+ cmdsp->data = *dst_q + msg_offset;
+ cmdsp->data_len = msg_len;
+ cmdsp++;
+
+skip_q:
+ /* Fillup brcm_message */
+ msg->type = BRCM_MESSAGE_SBA;
+ msg->sba.cmds = cmds;
+ msg->sba.cmds_count = cmdsp - cmds;
+ msg->ctx = req;
+ msg->error = 0;
+}
+
+struct sba_request *
+sba_prep_dma_pq_single_req(struct sba_device *sba, dma_addr_t off,
+ dma_addr_t *dst_p, dma_addr_t *dst_q,
+ dma_addr_t src, u8 scf, size_t len,
+ unsigned long flags)
+{
+ struct sba_request *req = NULL;
+
+ /* Alloc new request */
+ req = sba_alloc_request(sba);
+ if (!req)
+ return NULL;
+ req->fence = (flags & DMA_PREP_FENCE) ? true : false;
+
+ /* Fillup request messages */
+ sba_fillup_pq_single_msg(req, dmaf_continue(flags),
+ req->cmds, &req->msg, off, len,
+ dst_p, dst_q, src, scf);
+
+ /* Init async_tx descriptor */
+ req->tx.flags = flags;
+ req->tx.cookie = -EBUSY;
+
+ return req;
+}
+
+static struct dma_async_tx_descriptor *
+sba_prep_dma_pq(struct dma_chan *dchan, dma_addr_t *dst, dma_addr_t *src,
+ u32 src_cnt, const u8 *scf, size_t len, unsigned long flags)
+{
+ u32 i, dst_q_index;
+ size_t req_len;
+ bool slow = false;
+ dma_addr_t off = 0;
+ dma_addr_t *dst_p = NULL, *dst_q = NULL;
+ struct sba_device *sba = to_sba_device(dchan);
+ struct sba_request *first = NULL, *req;
+
+ /* Sanity checks */
+ if (unlikely(src_cnt > sba->max_pq_srcs))
+ return NULL;
+ for (i = 0; i < src_cnt; i++)
+ if (sba->max_pq_coefs <= raid6_gflog[scf[i]])
+ slow = true;
+
+ /* Figure-out P and Q destination addresses */
+ if (!(flags & DMA_PREP_PQ_DISABLE_P))
+ dst_p = &dst[0];
+ if (!(flags & DMA_PREP_PQ_DISABLE_Q))
+ dst_q = &dst[1];
+
+ /* Create chained requests where each request is up to hw_buf_size */
+ while (len) {
+ req_len = (len < sba->hw_buf_size) ? len : sba->hw_buf_size;
+
+ if (slow) {
+ dst_q_index = src_cnt;
+
+ if (dst_q) {
+ for (i = 0; i < src_cnt; i++) {
+ if (*dst_q == src[i]) {
+ dst_q_index = i;
+ break;
+ }
+ }
+ }
+
+ if (dst_q_index < src_cnt) {
+ i = dst_q_index;
+ req = sba_prep_dma_pq_single_req(sba,
+ off, dst_p, dst_q, src[i], scf[i],
+ req_len, flags | DMA_PREP_FENCE);
+ if (!req)
+ goto fail;
+
+ if (first)
+ sba_chain_request(first, req);
+ else
+ first = req;
+
+ flags |= DMA_PREP_CONTINUE;
+ }
+
+ for (i = 0; i < src_cnt; i++) {
+ if (dst_q_index == i)
+ continue;
+
+ req = sba_prep_dma_pq_single_req(sba,
+ off, dst_p, dst_q, src[i], scf[i],
+ req_len, flags | DMA_PREP_FENCE);
+ if (!req)
+ goto fail;
+
+ if (first)
+ sba_chain_request(first, req);
+ else
+ first = req;
+
+ flags |= DMA_PREP_CONTINUE;
+ }
+ } else {
+ req = sba_prep_dma_pq_req(sba, off,
+ dst_p, dst_q, src, src_cnt,
+ scf, req_len, flags);
+ if (!req)
+ goto fail;
+
+ if (first)
+ sba_chain_request(first, req);
+ else
+ first = req;
+ }
+
+ off += req_len;
+ len -= req_len;
+ }
+
+ return (first) ? &first->tx : NULL;
+
+fail:
+ if (first)
+ sba_free_chained_requests(first);
+ return NULL;
+}
+
+/* ====== Mailbox callbacks ===== */
+
+static void sba_dma_tx_actions(struct sba_request *req)
+{
+ struct dma_async_tx_descriptor *tx = &req->tx;
+
+ WARN_ON(tx->cookie < 0);
+
+ if (tx->cookie > 0) {
+ dma_cookie_complete(tx);
+
+ /*
+ * Call the callback (must not sleep or submit new
+ * operations to this channel)
+ */
+ if (tx->callback)
+ tx->callback(tx->callback_param);
+
+ dma_descriptor_unmap(tx);
+ }
+
+ /* Run dependent operations */
+ dma_run_dependencies(tx);
+
+ /* If waiting for 'ack' then move to completed list */
+ if (!async_tx_test_ack(&req->tx))
+ sba_complete_chained_requests(req);
+ else
+ sba_free_chained_requests(req);
+}
+
+static void sba_receive_message(struct mbox_client *cl, void *msg)
+{
+ unsigned long flags;
+ struct brcm_message *m = msg;
+ struct sba_request *req = m->ctx, *req1;
+ struct sba_device *sba = req->sba;
+
+ /* Error count if message has error */
+ if (m->error < 0) {
+ dev_err(sba->dev, "%s got message with error %d\n",
+ dma_chan_name(&sba->dma_chan), m->error);
+ }
+
+ /* Mark request as received */
+ sba_received_request(req);
+
+ /* Wait for all chained requests to be completed */
+ if (atomic_dec_return(&req->first->next_pending_count))
+ goto done;
+
+ /* Point to first request */
+ req = req->first;
+
+ /* Update request */
+ if (req->state == SBA_REQUEST_STATE_RECEIVED)
+ sba_dma_tx_actions(req);
+ else
+ sba_free_chained_requests(req);
+
+ spin_lock_irqsave(&sba->reqs_lock, flags);
+
+ /* Re-check all completed request waiting for 'ack' */
+ list_for_each_entry_safe(req, req1, &sba->reqs_completed_list, node) {
+ spin_unlock_irqrestore(&sba->reqs_lock, flags);
+ sba_dma_tx_actions(req);
+ spin_lock_irqsave(&sba->reqs_lock, flags);
+ }
+
+ spin_unlock_irqrestore(&sba->reqs_lock, flags);
+
+done:
+ /* Try to submit pending request */
+ sba_issue_pending(&sba->dma_chan);
+}
+
+/* ====== Platform driver routines ===== */
+
+static int sba_prealloc_channel_resources(struct sba_device *sba)
+{
+ int i, j, p, ret = 0;
+ struct sba_request *req = NULL;
+
+ sba->resp_base = dma_alloc_coherent(sba->dma_dev.dev,
+ sba->max_resp_pool_size,
+ &sba->resp_dma_base, GFP_KERNEL);
+ if (!sba->resp_base)
+ return -ENOMEM;
+
+ sba->cmds_base = dma_alloc_coherent(sba->dma_dev.dev,
+ sba->max_cmds_pool_size,
+ &sba->cmds_dma_base, GFP_KERNEL);
+ if (!sba->cmds_base) {
+ ret = -ENOMEM;
+ goto fail_free_resp_pool;
+ }
+
+ spin_lock_init(&sba->reqs_lock);
+ sba->reqs_fence = false;
+ INIT_LIST_HEAD(&sba->reqs_alloc_list);
+ INIT_LIST_HEAD(&sba->reqs_pending_list);
+ INIT_LIST_HEAD(&sba->reqs_active_list);
+ INIT_LIST_HEAD(&sba->reqs_received_list);
+ INIT_LIST_HEAD(&sba->reqs_completed_list);
+ INIT_LIST_HEAD(&sba->reqs_aborted_list);
+ INIT_LIST_HEAD(&sba->reqs_free_list);
+
+ sba->reqs = devm_kcalloc(sba->dev, sba->max_req,
+ sizeof(*req), GFP_KERNEL);
+ if (!sba->reqs) {
+ ret = -ENOMEM;
+ goto fail_free_cmds_pool;
+ }
+
+ for (i = 0, p = 0; i < sba->max_req; i++) {
+ req = &sba->reqs[i];
+ INIT_LIST_HEAD(&req->node);
+ req->sba = sba;
+ req->state = SBA_REQUEST_STATE_FREE;
+ INIT_LIST_HEAD(&req->next);
+ req->next_count = 1;
+ atomic_set(&req->next_pending_count, 0);
+ req->fence = false;
+ req->resp = sba->resp_base + p;
+ req->resp_dma = sba->resp_dma_base + p;
+ p += sba->hw_resp_size;
+ req->cmds = devm_kcalloc(sba->dev, sba->max_cmd_per_req,
+ sizeof(*req->cmds), GFP_KERNEL);
+ if (!req->cmds) {
+ ret = -ENOMEM;
+ goto fail_free_cmds_pool;
+ }
+ for (j = 0; j < sba->max_cmd_per_req; j++) {
+ req->cmds[j].cmd = 0;
+ req->cmds[j].cmd_dma = sba->cmds_base +
+ (i * sba->max_cmd_per_req + j) * sizeof(u64);
+ req->cmds[j].cmd_dma_addr = sba->cmds_dma_base +
+ (i * sba->max_cmd_per_req + j) * sizeof(u64);
+ req->cmds[j].flags = 0;
+ }
+ memset(&req->msg, 0, sizeof(req->msg));
+ dma_async_tx_descriptor_init(&req->tx, &sba->dma_chan);
+ req->tx.tx_submit = sba_tx_submit;
+ req->tx.phys = req->resp_dma;
+ list_add_tail(&req->node, &sba->reqs_free_list);
+ }
+
+ sba->reqs_free_count = sba->max_req;
+
+ return 0;
+
+fail_free_cmds_pool:
+ dma_free_coherent(sba->dma_dev.dev,
+ sba->max_cmds_pool_size,
+ sba->cmds_base, sba->cmds_dma_base);
+fail_free_resp_pool:
+ dma_free_coherent(sba->dma_dev.dev,
+ sba->max_resp_pool_size,
+ sba->resp_base, sba->resp_dma_base);
+ return ret;
+}
+
+static void sba_freeup_channel_resources(struct sba_device *sba)
+{
+ dmaengine_terminate_all(&sba->dma_chan);
+ dma_free_coherent(sba->dma_dev.dev, sba->max_cmds_pool_size,
+ sba->cmds_base, sba->cmds_dma_base);
+ dma_free_coherent(sba->dma_dev.dev, sba->max_resp_pool_size,
+ sba->resp_base, sba->resp_dma_base);
+ sba->resp_base = NULL;
+ sba->resp_dma_base = 0;
+}
+
+static int sba_async_register(struct sba_device *sba)
+{
+ int ret;
+ struct dma_device *dma_dev = &sba->dma_dev;
+
+ /* Initialize DMA channel cookie */
+ sba->dma_chan.device = dma_dev;
+ dma_cookie_init(&sba->dma_chan);
+
+ /* Initialize DMA device capability mask */
+ dma_cap_zero(dma_dev->cap_mask);
+ dma_cap_set(DMA_MEMCPY, dma_dev->cap_mask);
+ dma_cap_set(DMA_XOR, dma_dev->cap_mask);
+ dma_cap_set(DMA_PQ, dma_dev->cap_mask);
+
+ /*
+ * Set mailbox channel device as the base device of
+ * our dma_device because the actual memory accesses
+ * will be done by mailbox controller
+ */
+ dma_dev->dev = sba->mbox_dev;
+
+ /* Set base prep routines */
+ dma_dev->device_free_chan_resources = sba_free_chan_resources;
+ dma_dev->device_terminate_all = sba_device_terminate_all;
+ dma_dev->device_issue_pending = sba_issue_pending;
+ dma_dev->device_tx_status = sba_tx_status;
+
+ /* Set memcpy routines and capability */
+ if (dma_has_cap(DMA_MEMCPY, dma_dev->cap_mask))
+ dma_dev->device_prep_dma_memcpy = sba_prep_dma_memcpy;
+
+ /* Set xor routines and capability */
+ if (dma_has_cap(DMA_XOR, dma_dev->cap_mask)) {
+ dma_dev->device_prep_dma_xor = sba_prep_dma_xor;
+ dma_dev->max_xor = sba->max_xor_srcs;
+ }
+
+ /* Set pq routines and capability */
+ if (dma_has_cap(DMA_PQ, dma_dev->cap_mask)) {
+ dma_dev->device_prep_dma_pq = sba_prep_dma_pq;
+ dma_set_maxpq(dma_dev, sba->max_pq_srcs, 0);
+ }
+
+ /* Initialize DMA device channel list */
+ INIT_LIST_HEAD(&dma_dev->channels);
+ list_add_tail(&sba->dma_chan.device_node, &dma_dev->channels);
+
+ /* Register with Linux async DMA framework */
+ ret = dma_async_device_register(dma_dev);
+ if (ret) {
+ dev_err(sba->dev, "async device register error %d\n", ret);
+ return ret;
+ }
+
+ dev_info(sba->dev, "%s capabilities: %s%s%s\n",
+ dma_chan_name(&sba->dma_chan),
+ dma_has_cap(DMA_MEMCPY, dma_dev->cap_mask) ? "memcpy " : "",
+ dma_has_cap(DMA_XOR, dma_dev->cap_mask) ? "xor " : "",
+ dma_has_cap(DMA_PQ, dma_dev->cap_mask) ? "pq " : "");
+
+ return 0;
+}
+
+static int sba_probe(struct platform_device *pdev)
+{
+ int i, ret = 0, mchans_count;
+ struct sba_device *sba;
+ struct platform_device *mbox_pdev;
+ struct of_phandle_args args;
+
+ /* Allocate main SBA struct */
+ sba = devm_kzalloc(&pdev->dev, sizeof(*sba), GFP_KERNEL);
+ if (!sba)
+ return -ENOMEM;
+
+ sba->dev = &pdev->dev;
+ platform_set_drvdata(pdev, sba);
+
+ /* Determine SBA version from DT compatible string */
+ if (of_device_is_compatible(sba->dev->of_node, "brcm,iproc-sba"))
+ sba->ver = SBA_VER_1;
+ else if (of_device_is_compatible(sba->dev->of_node,
+ "brcm,iproc-sba-v2"))
+ sba->ver = SBA_VER_2;
+ else
+ return -ENODEV;
+
+ /* Derived Configuration parameters */
+ switch (sba->ver) {
+ case SBA_VER_1:
+ sba->max_req = 1024;
+ sba->hw_buf_size = 4096;
+ sba->hw_resp_size = 8;
+ sba->max_pq_coefs = 6;
+ sba->max_pq_srcs = 6;
+ break;
+ case SBA_VER_2:
+ sba->max_req = 1024;
+ sba->hw_buf_size = 4096;
+ sba->hw_resp_size = 8;
+ sba->max_pq_coefs = 30;
+ /*
+ * We can't support max_pq_srcs == max_pq_coefs because
+ * we are limited by number of SBA commands that we can
+ * fit in one message for underlying ring manager HW.
+ */
+ sba->max_pq_srcs = 12;
+ break;
+ default:
+ return -EINVAL;
+ }
+ sba->max_cmd_per_req = sba->max_pq_srcs + 3;
+ sba->max_xor_srcs = sba->max_cmd_per_req - 1;
+ sba->max_resp_pool_size = sba->max_req * sba->hw_resp_size;
+ sba->max_cmds_pool_size = sba->max_req *
+ sba->max_cmd_per_req * sizeof(u64);
+
+ /* Setup mailbox client */
+ sba->client.dev = &pdev->dev;
+ sba->client.rx_callback = sba_receive_message;
+ sba->client.tx_block = false;
+ sba->client.knows_txdone = false;
+ sba->client.tx_tout = 0;
+
+ /* Number of channels equals number of mailbox channels */
+ ret = of_count_phandle_with_args(pdev->dev.of_node,
+ "mboxes", "#mbox-cells");
+ if (ret <= 0)
+ return -ENODEV;
+ mchans_count = ret;
+ sba->mchans_count = 0;
+ atomic_set(&sba->mchans_current, 0);
+
+ /* Allocate mailbox channel array */
+ sba->mchans = devm_kcalloc(&pdev->dev, mchans_count,
+ sizeof(*sba->mchans), GFP_KERNEL);
+ if (!sba->mchans)
+ return -ENOMEM;
+
+ /* Request mailbox channels */
+ for (i = 0; i < mchans_count; i++) {
+ sba->mchans[i] = mbox_request_channel(&sba->client, i);
+ if (IS_ERR(sba->mchans[i])) {
+ ret = PTR_ERR(sba->mchans[i]);
+ goto fail_free_mchans;
+ }
+ sba->mchans_count++;
+ }
+
+ /* Find-out underlying mailbox device */
+ ret = of_parse_phandle_with_args(pdev->dev.of_node,
+ "mboxes", "#mbox-cells", 0, &args);
+ if (ret)
+ goto fail_free_mchans;
+ mbox_pdev = of_find_device_by_node(args.np);
+ of_node_put(args.np);
+ if (!mbox_pdev) {
+ ret = -ENODEV;
+ goto fail_free_mchans;
+ }
+ sba->mbox_dev = &mbox_pdev->dev;
+
+ /* All mailbox channels should be of same ring manager device */
+ for (i = 1; i < mchans_count; i++) {
+ ret = of_parse_phandle_with_args(pdev->dev.of_node,
+ "mboxes", "#mbox-cells", i, &args);
+ if (ret)
+ goto fail_free_mchans;
+ mbox_pdev = of_find_device_by_node(args.np);
+ of_node_put(args.np);
+ if (sba->mbox_dev != &mbox_pdev->dev) {
+ ret = -EINVAL;
+ goto fail_free_mchans;
+ }
+ }
+
+ /* Register DMA device with linux async framework */
+ ret = sba_async_register(sba);
+ if (ret)
+ goto fail_free_mchans;
+
+ /* Prealloc channel resource */
+ ret = sba_prealloc_channel_resources(sba);
+ if (ret)
+ goto fail_async_dev_unreg;
+
+ /* Print device info */
+ dev_info(sba->dev, "%s using SBAv%d and %d mailbox channels\n",
+ dma_chan_name(&sba->dma_chan), sba->ver+1,
+ sba->mchans_count);
+
+ return 0;
+
+fail_async_dev_unreg:
+ dma_async_device_unregister(&sba->dma_dev);
+fail_free_mchans:
+ for (i = 0; i < sba->mchans_count; i++)
+ mbox_free_channel(sba->mchans[i]);
+ return ret;
+}
+
+static int sba_remove(struct platform_device *pdev)
+{
+ int i;
+ struct sba_device *sba = platform_get_drvdata(pdev);
+
+ sba_freeup_channel_resources(sba);
+
+ dma_async_device_unregister(&sba->dma_dev);
+
+ for (i = 0; i < sba->mchans_count; i++)
+ mbox_free_channel(sba->mchans[i]);
+
+ return 0;
+}
+
+static const struct of_device_id sba_of_match[] = {
+ { .compatible = "brcm,iproc-sba", },
+ { .compatible = "brcm,iproc-sba-v2", },
+ {},
+};
+MODULE_DEVICE_TABLE(of, sba_of_match);
+
+static struct platform_driver sba_driver = {
+ .probe = sba_probe,
+ .remove = sba_remove,
+ .driver = {
+ .name = "bcm-sba-raid",
+ .of_match_table = sba_of_match,
+ },
+};
+module_platform_driver(sba_driver);
+
+MODULE_DESCRIPTION("Broadcom SBA RAID driver");
+MODULE_AUTHOR("Anup Patel <anup.patel@broadcom.com>");
+MODULE_LICENSE("GPL v2");
--
2.7.4
^ permalink raw reply related
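A note on the Q generation in sba_fillup_pq_single_msg() above: as I read the patch, a coefficient whose log (dpos) does not fit into one command is applied as a sequence of smaller multiplications whose exponents sum to dpos. The userspace sketch below only illustrates that splitting. split_gf_exponent() and its parameters are hypothetical names, not part of the driver.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Split exponent dpos into steps that are each < max_coefs, mirroring
 * the "pos = min(dpos, max_pq_coefs - 1)" logic of the patch. One step is
 * always produced, even for dpos == 0, like the initial Type-B command.
 * Returns the number of steps, or 0 if steps[] is too small.
 */
static size_t split_gf_exponent(unsigned int dpos, unsigned int max_coefs,
				unsigned int *steps, size_t max_steps)
{
	size_t n = 0;

	assert(max_coefs >= 2);	/* with 1, every step would be 0 */

	do {
		unsigned int pos = (dpos < max_coefs) ? dpos : max_coefs - 1;

		if (n == max_steps)
			return 0;	/* caller's array is too small */
		steps[n++] = pos;
		dpos -= pos;
	} while (dpos);

	return n;
}
```

With max_coefs = 6, an exponent of 12 becomes the steps 5, 5 and 2, which would correspond to one Type-B command followed by two Type-A commands.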
* [PATCH v3 4/4] dt-bindings: Add DT bindings document for Broadcom SBA RAID driver
From: Anup Patel @ 2017-02-10 9:07 UTC (permalink / raw)
To: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
David S . Miller, Jassi Brar
Cc: Dan Williams, Ray Jui, Scott Branden, Jon Mason, Rob Rice,
bcm-kernel-feedback-list, dmaengine, devicetree, linux-arm-kernel,
linux-kernel, linux-crypto, linux-raid, Anup Patel
In-Reply-To: <1486717628-17580-1-git-send-email-anup.patel@broadcom.com>
This patch adds the DT bindings document for the newly added Broadcom
SBA RAID driver.
Signed-off-by: Anup Patel <anup.patel@broadcom.com>
Reviewed-by: Ray Jui <ray.jui@broadcom.com>
Reviewed-by: Scott Branden <scott.branden@broadcom.com>
---
.../devicetree/bindings/dma/brcm,iproc-sba.txt | 29 ++++++++++++++++++++++
1 file changed, 29 insertions(+)
create mode 100644 Documentation/devicetree/bindings/dma/brcm,iproc-sba.txt
diff --git a/Documentation/devicetree/bindings/dma/brcm,iproc-sba.txt b/Documentation/devicetree/bindings/dma/brcm,iproc-sba.txt
new file mode 100644
index 0000000..092913a
--- /dev/null
+++ b/Documentation/devicetree/bindings/dma/brcm,iproc-sba.txt
@@ -0,0 +1,29 @@
+* Broadcom SBA RAID engine
+
+Required properties:
+- compatible: Should be one of the following
+ "brcm,iproc-sba"
+ "brcm,iproc-sba-v2"
+ The "brcm,iproc-sba" has support for only 6 PQ coefficients
+ The "brcm,iproc-sba-v2" has support for only 30 PQ coefficients
+- mboxes: List of phandle and mailbox channel specifiers
+
+Example:
+
+raid_mbox: mbox@67400000 {
+ ...
+ #mbox-cells = <3>;
+ ...
+};
+
+raid0 {
+ compatible = "brcm,iproc-sba-v2";
+ mboxes = <&raid_mbox 0 0x1 0xffff>,
+ <&raid_mbox 1 0x1 0xffff>,
+ <&raid_mbox 2 0x1 0xffff>,
+ <&raid_mbox 3 0x1 0xffff>,
+ <&raid_mbox 4 0x1 0xffff>,
+ <&raid_mbox 5 0x1 0xffff>,
+ <&raid_mbox 6 0x1 0xffff>,
+ <&raid_mbox 7 0x1 0xffff>;
+};
--
2.7.4
^ permalink raw reply related
* [PATCH v1 0/5] md: use bio_clone_fast()
From: Ming Lei @ 2017-02-10 10:56 UTC (permalink / raw)
To: Shaohua Li, Jens Axboe, linux-kernel, linux-raid, linux-block,
Christoph Hellwig, NeilBrown
Cc: Ming Lei
Hi,
This patchset replaces bio_clone() with bio_clone_fast() in
bio_clone_mddev() because:
1) bio_clone_mddev() is used in the raid normal I/O path and not in
the resync I/O path, and all direct accesses to the bvec table in
raid happen on resync I/O only, except for write behind in raid1.
Write behind is treated specially, so the replacement is safe.
2) for write behind, bio_clone() is kept, but this patchset
introduces bio_clone_bioset_partial() to clone just one specific
bvec range instead of the whole table, so write behind is improved
too.
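To make the distinction concrete, here is a simplified userspace sketch of the two clone flavours. It is not kernel code: struct mini_bio, struct vec and the helper names are hypothetical stand-ins. It only shows why a user that replaces pages needs its own table while everyone else can share the source's table.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; not the kernel's struct bio. */
struct vec {
	void *page;
	unsigned int len;
};

struct mini_bio {
	struct vec *vecs;	/* segment table */
	unsigned int nr;	/* number of entries in vecs */
	int owns_vecs;		/* non-zero if vecs is freed with this bio */
};

/* Fast clone: share the source's table, so it must not be modified. */
static struct mini_bio *clone_fast(const struct mini_bio *src)
{
	struct mini_bio *b = malloc(sizeof(*b));

	if (!b)
		return NULL;
	b->vecs = src->vecs;
	b->nr = src->nr;
	b->owns_vecs = 0;
	return b;
}

/* Full clone: private copy of the table, entries may be replaced. */
static struct mini_bio *clone_full(const struct mini_bio *src)
{
	struct mini_bio *b = malloc(sizeof(*b));

	if (!b)
		return NULL;
	/* Allocate at least one entry so an empty source is not an error. */
	b->vecs = malloc((src->nr ? src->nr : 1) * sizeof(*b->vecs));
	if (!b->vecs) {
		free(b);
		return NULL;
	}
	memcpy(b->vecs, src->vecs, src->nr * sizeof(*b->vecs));
	b->nr = src->nr;
	b->owns_vecs = 1;
	return b;
}

static void mini_bio_free(struct mini_bio *b)
{
	if (!b)
		return;
	if (b->owns_vecs)
		free(b->vecs);
	free(b);
}
```

Swapping a page pointer in the result of clone_full() leaves the source untouched, which is the property write behind relies on. Doing the same on the result of clone_fast() would corrupt the source.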
V1:
1) don't introduce bio_clone_slow_mddev_partial()
2) return failure if mddev->bio_set can't be created
3) remove check in bio_clone_mddev() as suggested by
Christoph Hellwig.
4) rename bio_clone_mddev() as bio_clone_fast_mddev()
Ming Lei (5):
block: introduce bio_clone_bioset_partial()
md/raid1: use bio_clone_bioset_partial() in case of write behind
md: fail if mddev->bio_set can't be created
md: remove unnecessary check on mddev
md: fast clone bio in bio_clone_mddev()
block/bio.c | 61 +++++++++++++++++++++++++++++++++++++++++------------
drivers/md/faulty.c | 2 +-
drivers/md/md.c | 14 ++++++------
drivers/md/md.h | 4 ++--
drivers/md/raid1.c | 26 ++++++++++++++++-------
drivers/md/raid10.c | 11 +++++-----
drivers/md/raid5.c | 4 ++--
include/linux/bio.h | 11 ++++++++--
8 files changed, 92 insertions(+), 41 deletions(-)
--
2.7.4
Thanks,
Ming
^ permalink raw reply
* [PATCH v1 1/5] block: introduce bio_clone_bioset_partial()
From: Ming Lei @ 2017-02-10 10:56 UTC (permalink / raw)
To: Shaohua Li, Jens Axboe, linux-kernel, linux-raid, linux-block,
Christoph Hellwig, NeilBrown
Cc: Ming Lei
In-Reply-To: <1486724177-14817-1-git-send-email-tom.leiming@gmail.com>
md still needs a full bio clone (not the fast version) for write behind,
and it is more efficient to use bio_clone_bioset_partial() there.
The idea is simple: copy only the bvec range specified by the
parameters.
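As a rough illustration of "copy only the specified range", the userspace sketch below walks an array of segments and emits the pieces that cover the bytes [offset, offset + size). struct seg and seg_clone_partial() are hypothetical names. The kernel code does this with bvec_iter and __bio_for_each_segment() instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio_vec: one contiguous segment. */
struct seg {
	size_t off;	/* offset of the data inside its page */
	size_t len;	/* number of bytes in this segment */
};

/*
 * Copy the part of src[0..n) that covers bytes [offset, offset + size)
 * into dst and return the number of segments written. dst must have room
 * for n entries.
 */
static size_t seg_clone_partial(const struct seg *src, size_t n,
				size_t offset, size_t size, struct seg *dst)
{
	size_t i, out = 0;

	assert(src && dst);

	for (i = 0; i < n && size; i++) {
		size_t skip, take;

		/* Skip segments that lie entirely before the offset. */
		if (offset >= src[i].len) {
			offset -= src[i].len;
			continue;
		}

		/* Trim the front of the first segment and the tail of the last. */
		skip = offset;
		take = src[i].len - skip;
		if (take > size)
			take = size;

		dst[out].off = src[i].off + skip;
		dst[out].len = take;
		out++;

		offset = 0;
		size -= take;
	}

	return out;
}
```

For three 4096-byte segments, offset 1024 and size 4096 give two output segments: the last 3072 bytes of the first one and the first 1024 bytes of the second.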
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
block/bio.c | 61 +++++++++++++++++++++++++++++++++++++++++------------
include/linux/bio.h | 11 ++++++++--
2 files changed, 57 insertions(+), 15 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 4b564d0c3e29..5eec5e08417f 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -625,21 +625,20 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
}
EXPORT_SYMBOL(bio_clone_fast);
-/**
- * bio_clone_bioset - clone a bio
- * @bio_src: bio to clone
- * @gfp_mask: allocation priority
- * @bs: bio_set to allocate from
- *
- * Clone bio. Caller will own the returned bio, but not the actual data it
- * points to. Reference count of returned bio will be one.
- */
-struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
- struct bio_set *bs)
+static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
+ struct bio_set *bs, int offset,
+ int size)
{
struct bvec_iter iter;
struct bio_vec bv;
struct bio *bio;
+ struct bvec_iter iter_src = bio_src->bi_iter;
+
+ /* for supporting partial clone */
+ if (offset || size != bio_src->bi_iter.bi_size) {
+ bio_advance_iter(bio_src, &iter_src, offset);
+ iter_src.bi_size = size;
+ }
/*
* Pre immutable biovecs, __bio_clone() used to just do a memcpy from
@@ -663,7 +662,8 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
* __bio_clone_fast() anyways.
*/
- bio = bio_alloc_bioset(gfp_mask, bio_segments(bio_src), bs);
+ bio = bio_alloc_bioset(gfp_mask, __bio_segments(bio_src,
+ &iter_src), bs);
if (!bio)
return NULL;
bio->bi_bdev = bio_src->bi_bdev;
@@ -680,7 +680,7 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
break;
default:
- bio_for_each_segment(bv, bio_src, iter)
+ __bio_for_each_segment(bv, bio_src, iter, iter_src)
bio->bi_io_vec[bio->bi_vcnt++] = bv;
break;
}
@@ -699,9 +699,44 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
return bio;
}
+
+/**
+ * bio_clone_bioset - clone a bio
+ * @bio_src: bio to clone
+ * @gfp_mask: allocation priority
+ * @bs: bio_set to allocate from
+ *
+ * Clone bio. Caller will own the returned bio, but not the actual data it
+ * points to. Reference count of returned bio will be one.
+ */
+struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
+ struct bio_set *bs)
+{
+ return __bio_clone_bioset(bio_src, gfp_mask, bs, 0,
+ bio_src->bi_iter.bi_size);
+}
EXPORT_SYMBOL(bio_clone_bioset);
/**
+ * bio_clone_bioset_partial - clone a partial bio
+ * @bio_src: bio to clone
+ * @gfp_mask: allocation priority
+ * @bs: bio_set to allocate from
+ * @offset: byte offset in @bio_src at which the clone starts
+ * @size: size of the cloned bio in bytes
+ *
+ * Clone bio. Caller will own the returned bio, but not the actual data it
+ * points to. Reference count of returned bio will be one.
+ */
+struct bio *bio_clone_bioset_partial(struct bio *bio_src, gfp_t gfp_mask,
+ struct bio_set *bs, int offset,
+ int size)
+{
+ return __bio_clone_bioset(bio_src, gfp_mask, bs, offset, size);
+}
+EXPORT_SYMBOL(bio_clone_bioset_partial);
+
+/**
* bio_add_pc_page - attempt to add page to bio
* @q: the target queue
* @bio: destination bio
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7cf8a6c70a3f..8e521194f6fc 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -183,7 +183,7 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
#define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
-static inline unsigned bio_segments(struct bio *bio)
+static inline unsigned __bio_segments(struct bio *bio, struct bvec_iter *bvec)
{
unsigned segs = 0;
struct bio_vec bv;
@@ -205,12 +205,17 @@ static inline unsigned bio_segments(struct bio *bio)
break;
}
- bio_for_each_segment(bv, bio, iter)
+ __bio_for_each_segment(bv, bio, iter, *bvec)
segs++;
return segs;
}
+static inline unsigned bio_segments(struct bio *bio)
+{
+ return __bio_segments(bio, &bio->bi_iter);
+}
+
/*
* get a reference to a bio, so it won't disappear. the intended use is
* something like:
@@ -384,6 +389,8 @@ extern void bio_put(struct bio *);
extern void __bio_clone_fast(struct bio *, struct bio *);
extern struct bio *bio_clone_fast(struct bio *, gfp_t, struct bio_set *);
extern struct bio *bio_clone_bioset(struct bio *, gfp_t, struct bio_set *bs);
+extern struct bio *bio_clone_bioset_partial(struct bio *, gfp_t,
+ struct bio_set *, int, int);
extern struct bio_set *fs_bio_set;
--
2.7.4
^ permalink raw reply related
* [PATCH v1 2/5] md/raid1: use bio_clone_bioset_partial() in case of write behind
From: Ming Lei @ 2017-02-10 10:56 UTC (permalink / raw)
To: Shaohua Li, Jens Axboe, linux-kernel, linux-raid, linux-block,
Christoph Hellwig, NeilBrown
Cc: Ming Lei
In-Reply-To: <1486724177-14817-1-git-send-email-tom.leiming@gmail.com>
Write behind needs to replace pages in the bio's bvecs, so we have
to clone a fresh bio with a new bvec table; use the newly introduced
bio_clone_bioset_partial() for that.
For the other bio_clone_mddev() cases, we will use fast clone since
they don't need to touch the bvec table.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
drivers/md/raid1.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 830ff2b20346..4d7852c6ae97 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1341,13 +1341,12 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
first_clone = 1;
for (i = 0; i < disks; i++) {
- struct bio *mbio;
+ struct bio *mbio = NULL;
+ int offset;
if (!r1_bio->bios[i])
continue;
- mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
- bio_trim(mbio, r1_bio->sector - bio->bi_iter.bi_sector,
- max_sectors);
+ offset = r1_bio->sector - bio->bi_iter.bi_sector;
if (first_clone) {
/* do behind I/O ?
@@ -1357,8 +1356,13 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
if (bitmap &&
(atomic_read(&bitmap->behind_writes)
< mddev->bitmap_info.max_write_behind) &&
- !waitqueue_active(&bitmap->behind_wait))
+ !waitqueue_active(&bitmap->behind_wait)) {
+ mbio = bio_clone_bioset_partial(bio, GFP_NOIO,
+ mddev->bio_set,
+ offset << 9,
+ max_sectors << 9);
alloc_behind_pages(mbio, r1_bio);
+ }
bitmap_startwrite(bitmap, r1_bio->sector,
r1_bio->sectors,
@@ -1366,6 +1370,12 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
&r1_bio->state));
first_clone = 0;
}
+
+ if (!mbio) {
+ mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+ bio_trim(mbio, offset, max_sectors);
+ }
+
if (r1_bio->behind_bvecs) {
struct bio_vec *bvec;
int j;
--
2.7.4
^ permalink raw reply related
* [PATCH v1 3/5] md: fail if mddev->bio_set can't be created
From: Ming Lei @ 2017-02-10 10:56 UTC (permalink / raw)
To: Shaohua Li, Jens Axboe, linux-kernel, linux-raid, linux-block,
Christoph Hellwig, NeilBrown
Cc: Ming Lei
In-Reply-To: <1486724177-14817-1-git-send-email-tom.leiming@gmail.com>
The current behaviour is to fall back to allocating the
bio from 'fs_bio_set', which isn't correct
because it might cause a deadlock.
So this patch simply returns failure if mddev->bio_set
can't be created.
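The pattern is the usual "create the private pool or fail" one. A minimal userspace sketch follows, with hypothetical stand-ins (struct pool for the bio_set, struct dev for the mddev) and an allocator hook that exists only so the failure path can be exercised:

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct bio_set and struct mddev. */
struct pool {
	int nr_entries;
};

struct dev {
	struct pool *pool;
};

/* Allocator hook so that a failing allocation can be simulated. */
typedef struct pool *(*pool_create_fn)(int nr_entries);

static struct pool *pool_create(int nr_entries)
{
	struct pool *p = malloc(sizeof(*p));

	if (p)
		p->nr_entries = nr_entries;
	return p;
}

/* Always fails; only useful for exercising the error path. */
static struct pool *pool_create_fail(int nr_entries)
{
	(void)nr_entries;
	return NULL;
}

/* Create the private pool on first use; never fall back to a shared one. */
static int dev_run(struct dev *d, pool_create_fn create)
{
	if (!d->pool) {
		d->pool = create(2);
		if (!d->pool)
			return -ENOMEM;
	}
	return 0;
}
```

Callers then see -ENOMEM up front instead of silently sharing a global pool later.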
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
drivers/md/md.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 4c1b82defa78..3425c2b779a6 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5270,8 +5270,11 @@ int md_run(struct mddev *mddev)
sysfs_notify_dirent_safe(rdev->sysfs_state);
}
- if (mddev->bio_set == NULL)
+ if (mddev->bio_set == NULL) {
mddev->bio_set = bioset_create(BIO_POOL_SIZE, 0);
+ if (!mddev->bio_set)
+ return -ENOMEM;
+ }
spin_lock(&pers_lock);
pers = find_pers(mddev->level, mddev->clevel);
--
2.7.4
^ permalink raw reply related
* [PATCH v1 4/5] md: remove unnecessary check on mddev
From: Ming Lei @ 2017-02-10 10:56 UTC (permalink / raw)
To: Shaohua Li, Jens Axboe, linux-kernel, linux-raid, linux-block,
Christoph Hellwig, NeilBrown
Cc: Ming Lei
In-Reply-To: <1486724177-14817-1-git-send-email-tom.leiming@gmail.com>
mddev is never NULL and neither is ->bio_set, so
remove the check.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
drivers/md/md.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3425c2b779a6..2835f09b9e71 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -193,9 +193,6 @@ EXPORT_SYMBOL_GPL(bio_alloc_mddev);
struct bio *bio_clone_mddev(struct bio *bio, gfp_t gfp_mask,
struct mddev *mddev)
{
- if (!mddev || !mddev->bio_set)
- return bio_clone(bio, gfp_mask);
-
return bio_clone_bioset(bio, gfp_mask, mddev->bio_set);
}
EXPORT_SYMBOL_GPL(bio_clone_mddev);
--
2.7.4
^ permalink raw reply related
* [PATCH v1 5/5] md: fast clone bio in bio_clone_mddev()
From: Ming Lei @ 2017-02-10 10:56 UTC (permalink / raw)
To: Shaohua Li, Jens Axboe, linux-kernel, linux-raid, linux-block,
Christoph Hellwig, NeilBrown
Cc: Ming Lei
In-Reply-To: <1486724177-14817-1-git-send-email-tom.leiming@gmail.com>
First, bio_clone_mddev() is used in the normal RAID I/O path and not
in the resync I/O path.
Second, all direct access to the bvec table in RAID happens in
resync I/O, except for raid1 write-behind, where we still
use bio_clone() to allocate a new bvec table.
So this patch replaces bio_clone() with bio_clone_fast()
in bio_clone_mddev().
Also rename bio_clone_mddev() to bio_clone_fast_mddev(), as
suggested by Christoph Hellwig.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
drivers/md/faulty.c | 2 +-
drivers/md/md.c | 6 +++---
drivers/md/md.h | 4 ++--
drivers/md/raid1.c | 8 ++++----
drivers/md/raid10.c | 11 +++++------
drivers/md/raid5.c | 4 ++--
6 files changed, 17 insertions(+), 18 deletions(-)
diff --git a/drivers/md/faulty.c b/drivers/md/faulty.c
index 685aa2d77e25..f80e7b8f8c40 100644
--- a/drivers/md/faulty.c
+++ b/drivers/md/faulty.c
@@ -214,7 +214,7 @@ static void faulty_make_request(struct mddev *mddev, struct bio *bio)
}
}
if (failit) {
- struct bio *b = bio_clone_mddev(bio, GFP_NOIO, mddev);
+ struct bio *b = bio_clone_fast_mddev(bio, GFP_NOIO, mddev);
b->bi_bdev = conf->rdev->bdev;
b->bi_private = bio;
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2835f09b9e71..d45e8d1382ad 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -190,12 +190,12 @@ struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
}
EXPORT_SYMBOL_GPL(bio_alloc_mddev);
-struct bio *bio_clone_mddev(struct bio *bio, gfp_t gfp_mask,
+struct bio *bio_clone_fast_mddev(struct bio *bio, gfp_t gfp_mask,
struct mddev *mddev)
{
- return bio_clone_bioset(bio, gfp_mask, mddev->bio_set);
+ return bio_clone_fast(bio, gfp_mask, mddev->bio_set);
}
-EXPORT_SYMBOL_GPL(bio_clone_mddev);
+EXPORT_SYMBOL_GPL(bio_clone_fast_mddev);
/*
* We have a system wide 'event count' that is incremented
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 968bbe72b237..88d0a101fb4c 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -673,8 +673,8 @@ extern void md_rdev_clear(struct md_rdev *rdev);
extern void mddev_suspend(struct mddev *mddev);
extern void mddev_resume(struct mddev *mddev);
-extern struct bio *bio_clone_mddev(struct bio *bio, gfp_t gfp_mask,
- struct mddev *mddev);
+extern struct bio *bio_clone_fast_mddev(struct bio *bio, gfp_t gfp_mask,
+ struct mddev *mddev);
extern struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
struct mddev *mddev);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 4d7852c6ae97..9e0b5a5ec0bc 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1108,7 +1108,7 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
r1_bio->read_disk = rdisk;
r1_bio->start_next_window = 0;
- read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+ read_bio = bio_clone_fast_mddev(bio, GFP_NOIO, mddev);
bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector,
max_sectors);
@@ -1372,7 +1372,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
}
if (!mbio) {
- mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+ mbio = bio_clone_fast_mddev(bio, GFP_NOIO, mddev);
bio_trim(mbio, offset, max_sectors);
}
@@ -2283,7 +2283,7 @@ static int narrow_write_error(struct r1bio *r1_bio, int i)
wbio->bi_vcnt = vcnt;
} else {
- wbio = bio_clone_mddev(r1_bio->master_bio, GFP_NOIO, mddev);
+ wbio = bio_clone_fast_mddev(r1_bio->master_bio, GFP_NOIO, mddev);
}
bio_set_op_attrs(wbio, REQ_OP_WRITE, 0);
@@ -2421,7 +2421,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
const unsigned long do_sync
= r1_bio->master_bio->bi_opf & REQ_SYNC;
r1_bio->read_disk = disk;
- bio = bio_clone_mddev(r1_bio->master_bio, GFP_NOIO, mddev);
+ bio = bio_clone_fast_mddev(r1_bio->master_bio, GFP_NOIO, mddev);
bio_trim(bio, r1_bio->sector - bio->bi_iter.bi_sector,
max_sectors);
r1_bio->bios[r1_bio->read_disk] = bio;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 6bc5c2a85160..406d6651fd4c 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1132,7 +1132,7 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
}
slot = r10_bio->read_slot;
- read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+ read_bio = bio_clone_fast_mddev(bio, GFP_NOIO, mddev);
bio_trim(read_bio, r10_bio->sector - bio->bi_iter.bi_sector,
max_sectors);
@@ -1406,7 +1406,7 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
int d = r10_bio->devs[i].devnum;
if (r10_bio->devs[i].bio) {
struct md_rdev *rdev = conf->mirrors[d].rdev;
- mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+ mbio = bio_clone_fast_mddev(bio, GFP_NOIO, mddev);
bio_trim(mbio, r10_bio->sector - bio->bi_iter.bi_sector,
max_sectors);
r10_bio->devs[i].bio = mbio;
@@ -1457,7 +1457,7 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
smp_mb();
rdev = conf->mirrors[d].rdev;
}
- mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+ mbio = bio_clone_fast_mddev(bio, GFP_NOIO, mddev);
bio_trim(mbio, r10_bio->sector - bio->bi_iter.bi_sector,
max_sectors);
r10_bio->devs[i].repl_bio = mbio;
@@ -2565,7 +2565,7 @@ static int narrow_write_error(struct r10bio *r10_bio, int i)
if (sectors > sect_to_write)
sectors = sect_to_write;
/* Write at 'sector' for 'sectors' */
- wbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+ wbio = bio_clone_fast_mddev(bio, GFP_NOIO, mddev);
bio_trim(wbio, sector - bio->bi_iter.bi_sector, sectors);
wsector = r10_bio->devs[i].addr + (sector - r10_bio->sector);
wbio->bi_iter.bi_sector = wsector +
@@ -2641,8 +2641,7 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
mdname(mddev),
bdevname(rdev->bdev, b),
(unsigned long long)r10_bio->sector);
- bio = bio_clone_mddev(r10_bio->master_bio,
- GFP_NOIO, mddev);
+ bio = bio_clone_fast_mddev(r10_bio->master_bio, GFP_NOIO, mddev);
bio_trim(bio, r10_bio->sector - bio->bi_iter.bi_sector, max_sectors);
r10_bio->devs[slot].bio = bio;
r10_bio->devs[slot].rdev = rdev;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 34f76615d620..b0bf647dd414 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5056,9 +5056,9 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
return 0;
}
/*
- * use bio_clone_mddev to make a copy of the bio
+ * use bio_clone_fast_mddev to make a copy of the bio
*/
- align_bi = bio_clone_mddev(raid_bio, GFP_NOIO, mddev);
+ align_bi = bio_clone_fast_mddev(raid_bio, GFP_NOIO, mddev);
if (!align_bi)
return 0;
/*
--
2.7.4
^ permalink raw reply related
* Re: [PATCH v3 3/4] dmaengine: Add Broadcom SBA RAID driver
From: Dan Williams @ 2017-02-10 17:50 UTC (permalink / raw)
To: Anup Patel
Cc: Mark Rutland, Device Tree, Herbert Xu, Scott Branden, Vinod Koul,
Ray Jui, Jassi Brar, linux-kernel@vger.kernel.org, linux-raid,
Jon Mason, Rob Herring, BCM Kernel Feedback, linux-crypto,
Rob Rice, dmaengine@vger.kernel.org, David S . Miller,
linux-arm-kernel@lists.infradead.org
In-Reply-To: <1486717628-17580-4-git-send-email-anup.patel@broadcom.com>
On Fri, Feb 10, 2017 at 1:07 AM, Anup Patel <anup.patel@broadcom.com> wrote:
> The Broadcom stream buffer accelerator (SBA) provides offloading
> capabilities for RAID operations. This SBA offload engine is
> accessible via Broadcom SoC specific ring manager.
>
> This patch adds Broadcom SBA RAID driver which provides one
> DMA device with RAID capabilities using one or more Broadcom
> SoC specific ring manager channels. The SBA RAID driver in its
> current shape implements memcpy, xor, and pq operations.
>
> Signed-off-by: Anup Patel <anup.patel@broadcom.com>
> Reviewed-by: Ray Jui <ray.jui@broadcom.com>
> ---
> drivers/dma/Kconfig | 13 +
> drivers/dma/Makefile | 1 +
> drivers/dma/bcm-sba-raid.c | 1711 ++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 1725 insertions(+)
> create mode 100644 drivers/dma/bcm-sba-raid.c
>
> diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
> index 263495d..bf8fb84 100644
> --- a/drivers/dma/Kconfig
> +++ b/drivers/dma/Kconfig
> @@ -99,6 +99,19 @@ config AXI_DMAC
> controller is often used in Analog Device's reference designs for FPGA
> platforms.
>
> +config BCM_SBA_RAID
> + tristate "Broadcom SBA RAID engine support"
> + depends on (ARM64 && MAILBOX && RAID6_PQ) || COMPILE_TEST
> + select DMA_ENGINE
> + select DMA_ENGINE_RAID
> + select ASYNC_TX_ENABLE_CHANNEL_SWITCH
ASYNC_TX_ENABLE_CHANNEL_SWITCH violates the DMA mapping API and
Russell has warned it's especially problematic on ARM [1]. If you
need channel switching for this offload engine to be useful then you
need to move DMA mapping and channel switching responsibilities to MD
itself.
[1]: http://lists.infradead.org/pipermail/linux-arm-kernel/2011-January/036753.html
[..]
> diff --git a/drivers/dma/bcm-sba-raid.c b/drivers/dma/bcm-sba-raid.c
> new file mode 100644
> index 0000000..bab9918
> --- /dev/null
> +++ b/drivers/dma/bcm-sba-raid.c
> @@ -0,0 +1,1711 @@
> +/*
> + * Copyright (C) 2017 Broadcom
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +/*
> + * Broadcom SBA RAID Driver
> + *
> + * The Broadcom stream buffer accelerator (SBA) provides offloading
> + * capabilities for RAID operations. The SBA offload engine is accessible
> + * via Broadcom SoC specific ring manager. Two or more offload engines
> + * can share same Broadcom SoC specific ring manager due to this Broadcom
> + * SoC specific ring manager driver is implemented as a mailbox controller
> + * driver and offload engine drivers are implemented as mailbox clients.
> + *
> + * Typically, Broadcom SoC specific ring manager will implement larger
> + * number of hardware rings over one or more SBA hardware devices. By
> + * design, the internal buffer size of SBA hardware device is limited
> + * but all offload operations supported by SBA can be broken down into
> + * multiple small size requests and executed in parallel on multiple SBA
> + * hardware devices for achieving high throughput.
> + *
> + * The Broadcom SBA RAID driver does not require any register programming
> + * except submitting request to SBA hardware device via mailbox channels.
> + * This driver implements a DMA device with one DMA channel using a set
> + * of mailbox channels provided by Broadcom SoC specific ring manager
> + * driver. To exploit parallelism (as described above), all DMA request
> + * coming to SBA RAID DMA channel are broken down to smaller requests
> + * and submitted to multiple mailbox channels in round-robin fashion.
> + * For having more SBA DMA channels, we can create more SBA device nodes
> + * in Broadcom SoC specific DTS based on number of hardware rings supported
> + * by Broadcom SoC ring manager.
> + */
> +
> +#include <linux/bitops.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/dmaengine.h>
> +#include <linux/list.h>
> +#include <linux/mailbox_client.h>
> +#include <linux/mailbox/brcm-message.h>
> +#include <linux/module.h>
> +#include <linux/of_device.h>
> +#include <linux/slab.h>
> +#include <linux/raid/pq.h>
> +
> +#include "dmaengine.h"
> +
> +/* SBA command helper macros */
> +#define SBA_DEC(_d, _s, _m) (((_d) >> (_s)) & (_m))
> +#define SBA_ENC(_d, _v, _s, _m) \
> + do { \
> + (_d) &= ~((u64)(_m) << (_s)); \
> + (_d) |= (((u64)(_v) & (_m)) << (_s)); \
> + } while (0)
Reusing a macro argument multiple times is problematic, consider
SBA_ENC(..., arg++, ...), and hiding assignments in a macro makes this
hard to read. The compiler should inline it properly if you just make
this a function that returns a value. You could also mark it __pure.
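A minimal sketch of the function-based alternative suggested above, written as userspace C so it compiles on its own. The names sba_dec() and sba_enc() are hypothetical, uint64_t stands in for the kernel's u64, and the __pure annotation is left out; this is an illustration, not the code that went into the driver.

```c
#include <stdint.h>

/* Hypothetical replacement for SBA_DEC(): extract the field at shift s, mask m. */
static inline uint64_t sba_dec(uint64_t d, unsigned int s, uint64_t m)
{
	return (d >> s) & m;
}

/*
 * Hypothetical replacement for SBA_ENC(): return d with the field at
 * shift s / mask m replaced by v. Every argument is evaluated exactly
 * once, and the caller performs the assignment explicitly.
 */
static inline uint64_t sba_enc(uint64_t d, uint64_t v, unsigned int s, uint64_t m)
{
	d &= ~(m << s);		/* clear the old field */
	d |= (v & m) << s;	/* insert the new value */
	return d;
}
```

A call site would then read cmd = sba_enc(cmd, val, shift, mask), so the assignment is visible and an argument with side effects is evaluated only once.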
[..]
> +
> +static struct sba_request *sba_alloc_request(struct sba_device *sba)
> +{
> + unsigned long flags;
> + struct sba_request *req = NULL;
> +
> + spin_lock_irqsave(&sba->reqs_lock, flags);
> +
> + if (!list_empty(&sba->reqs_free_list)) {
> + req = list_first_entry(&sba->reqs_free_list,
> + struct sba_request,
> + node);
You could use list_first_entry_or_null() here.
[..]
> +
> +/* Note: Must be called with sba->reqs_lock held */
> +static void _sba_pending_request(struct sba_device *sba,
> + struct sba_request *req)
> +{
You can validate the locking assumptions here with
lockdep_assert_held(&sba->reqs_lock).
[..]
> +
> +static void sba_cleanup_nonpending_requests(struct sba_device *sba)
> +{
> + unsigned long flags;
> + struct sba_request *req, *req1;
> +
> + spin_lock_irqsave(&sba->reqs_lock, flags);
> +
> + /* Freeup all alloced request */
> + list_for_each_entry_safe(req, req1, &sba->reqs_alloc_list, node) {
> + _sba_free_request(sba, req);
> + }
> +
> + /* Freeup all received request */
> + list_for_each_entry_safe(req, req1, &sba->reqs_received_list, node) {
> + _sba_free_request(sba, req);
> + }
> +
> + /* Freeup all completed request */
> + list_for_each_entry_safe(req, req1, &sba->reqs_completed_list, node) {
> + _sba_free_request(sba, req);
> + }
> +
> + /* Set all active requests as aborted */
> + list_for_each_entry_safe(req, req1, &sba->reqs_active_list, node) {
> + _sba_abort_request(sba, req);
> + }
In some parts of the driver you leave off unneeded braces, like the for
loop in sba_prep_dma_pq(), and in some cases you include them. I'd say
remove them if they're not necessary, but either way make it
consistent across the driver.
[..]
> +
> +static struct dma_async_tx_descriptor *
> +sba_prep_dma_pq(struct dma_chan *dchan, dma_addr_t *dst, dma_addr_t *src,
> + u32 src_cnt, const u8 *scf, size_t len, unsigned long flags)
> +{
> + u32 i, dst_q_index;
> + size_t req_len;
> + bool slow = false;
> + dma_addr_t off = 0;
> + dma_addr_t *dst_p = NULL, *dst_q = NULL;
> + struct sba_device *sba = to_sba_device(dchan);
> + struct sba_request *first = NULL, *req;
> +
> + /* Sanity checks */
> + if (unlikely(src_cnt > sba->max_pq_srcs))
> + return NULL;
> + for (i = 0; i < src_cnt; i++)
> + if (sba->max_pq_coefs <= raid6_gflog[scf[i]])
> + slow = true;
Thanks, yes, I do think this is cleaner here than in async_tx itself.
[..]
> +static void sba_receive_message(struct mbox_client *cl, void *msg)
> +{
> + unsigned long flags;
> + struct brcm_message *m = msg;
> + struct sba_request *req = m->ctx, *req1;
> + struct sba_device *sba = req->sba;
> +
> + /* Error count if message has error */
> + if (m->error < 0) {
> + dev_err(sba->dev, "%s got message with error %d",
> + dma_chan_name(&sba->dma_chan), m->error);
> + }
> +
> + /* Mark request as received */
> + sba_received_request(req);
> +
> + /* Wait for all chained requests to be completed */
> + if (atomic_dec_return(&req->first->next_pending_count))
> + goto done;
> +
> + /* Point to first request */
> + req = req->first;
> +
> + /* Update request */
> + if (req->state == SBA_REQUEST_STATE_RECEIVED)
> + sba_dma_tx_actions(req);
> + else
> + sba_free_chained_requests(req);
> +
> + spin_lock_irqsave(&sba->reqs_lock, flags);
> +
> + /* Re-check all completed request waiting for 'ack' */
> + list_for_each_entry_safe(req, req1, &sba->reqs_completed_list, node) {
> + spin_unlock_irqrestore(&sba->reqs_lock, flags);
> + sba_dma_tx_actions(req);
You've now required all callback paths to be hardirq safe whereas
previously the callbacks only assumed softirq exclusion. Have you run
this with CONFIG_PROVE_LOCKING enabled?
^ permalink raw reply
* [PATCH 1/2] md/raid5-cache: stripe reclaim only counts valid stripes
From: Shaohua Li @ 2017-02-11 0:18 UTC (permalink / raw)
To: linux-raid; +Cc: neilb, Song Liu
When log space is tight, we try to reclaim stripes from the log head. Some
stripes can't be reclaimed right now if certain conditions are
met. We skip such stripes but accidentally count them, which might cause
no stripes to be reclaimed. Fix this by only counting valid stripes.
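The effect of where the counter is incremented can be sketched in plain C. This is a simplified userspace model, not the kernel code: the eligible[] array stands in for the STRIPE_HANDLE and sh->count checks, group stands in for R5C_RECLAIM_STRIPE_GROUP, and both function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Old behaviour: every visited stripe is counted, flushed or not. */
static int reclaim_count_all(const bool *eligible, size_t n, int group)
{
	int count = 0, flushed = 0;

	for (size_t i = 0; i < n; i++) {
		if (eligible[i])
			flushed++;	/* stands in for r5c_flush_stripe() */
		if (count++ >= group)
			break;
	}
	return flushed;
}

/* Patched behaviour: only stripes that were actually flushed are counted. */
static int reclaim_count_valid(const bool *eligible, size_t n, int group)
{
	int count = 0, flushed = 0;

	for (size_t i = 0; i < n; i++) {
		if (eligible[i]) {
			flushed++;	/* stands in for r5c_flush_stripe() */
			if (count++ >= group)
				break;
		}
	}
	return flushed;
}
```

With three ineligible stripes ahead of two eligible ones and a group of 2, the first variant gives up before flushing anything, while the second flushes both eligible stripes.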
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
drivers/md/raid5-cache.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index caae853..a01f4da 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1418,9 +1418,9 @@ static void r5c_do_reclaim(struct r5conf *conf)
!test_bit(STRIPE_HANDLE, &sh->state) &&
atomic_read(&sh->count) == 0) {
r5c_flush_stripe(conf, sh);
+ if (count++ >= R5C_RECLAIM_STRIPE_GROUP)
+ break;
}
- if (count++ >= R5C_RECLAIM_STRIPE_GROUP)
- break;
}
spin_unlock(&conf->device_lock);
spin_unlock_irqrestore(&log->stripe_in_journal_lock, flags);
--
2.9.3
^ permalink raw reply related
* [PATCH 2/2] md/raid5-cache: exclude reclaiming stripes in reclaim check
From: Shaohua Li @ 2017-02-11 0:18 UTC (permalink / raw)
To: linux-raid; +Cc: neilb, Song Liu
In-Reply-To: <a8a29c2eceb8a7d68080b95c7df04f0663b358bd.1486772085.git.shli@fb.com>
Stripes which are being reclaimed are still counted as cached
stripes. The reclaim takes time. r5c_do_reclaim isn't aware of these
stripes and does unnecessary stripe reclaim. In practice, I saw
stripes reclaimed one at a time, which causes a bad IO pattern. Fix
this by excluding the stripes being reclaimed from the check.
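The adjusted accounting can be sketched as two small pure functions. This is a userspace model under assumptions: struct r5c_counts and both helper names are hypothetical, and plain ints replace the atomic_t counters that the patch reads with atomic_read().

```c
/* Hypothetical snapshot of the counters read at the top of r5c_do_reclaim(). */
struct r5c_counts {
	int cached_partial;
	int cached_full;
	int flushing_partial;
	int flushing_full;
};

/* Cached stripes that are not already being written out to the RAID disks. */
static int r5c_effective_cached(const struct r5c_counts *c)
{
	return c->cached_partial + c->cached_full -
	       c->flushing_full - c->flushing_partial;
}

/* Full stripes that are still waiting to be flushed. */
static int r5c_effective_full(const struct r5c_counts *c)
{
	return c->cached_full - c->flushing_full;
}
```

For example, with 100 partial and 300 full stripes cached, of which 40 and 250 are already flushing, the effective total is 110 rather than 400. Assuming min_nr_stripes were 256, the unadjusted total would exceed the 3/4 threshold of 192 and trigger another round of reclaim, while the adjusted total stays below even the 1/2 threshold of 128.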
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
drivers/md/raid5-cache.c | 14 ++++++++++++--
drivers/md/raid5.c | 2 ++
drivers/md/raid5.h | 2 ++
3 files changed, 16 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index a01f4da..3f307be 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1327,6 +1327,10 @@ static void r5c_flush_stripe(struct r5conf *conf, struct stripe_head *sh)
atomic_inc(&conf->active_stripes);
r5c_make_stripe_write_out(sh);
+ if (test_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state))
+ atomic_inc(&conf->r5c_flushing_partial_stripes);
+ else
+ atomic_inc(&conf->r5c_flushing_full_stripes);
raid5_release_stripe(sh);
}
@@ -1369,12 +1373,16 @@ static void r5c_do_reclaim(struct r5conf *conf)
unsigned long flags;
int total_cached;
int stripes_to_flush;
+ int flushing_partial, flushing_full;
if (!r5c_is_writeback(log))
return;
+ flushing_partial = atomic_read(&conf->r5c_flushing_partial_stripes);
+ flushing_full = atomic_read(&conf->r5c_flushing_full_stripes);
total_cached = atomic_read(&conf->r5c_cached_partial_stripes) +
- atomic_read(&conf->r5c_cached_full_stripes);
+ atomic_read(&conf->r5c_cached_full_stripes) -
+ flushing_full - flushing_partial;
if (total_cached > conf->min_nr_stripes * 3 / 4 ||
atomic_read(&conf->empty_inactive_list_nr) > 0)
@@ -1384,7 +1392,7 @@ static void r5c_do_reclaim(struct r5conf *conf)
*/
stripes_to_flush = R5C_RECLAIM_STRIPE_GROUP;
else if (total_cached > conf->min_nr_stripes * 1 / 2 ||
- atomic_read(&conf->r5c_cached_full_stripes) >
+ atomic_read(&conf->r5c_cached_full_stripes) - flushing_full >
R5C_FULL_STRIPE_FLUSH_BATCH)
/*
* if stripe cache pressure moderate, or if there is many full
@@ -2601,11 +2609,13 @@ void r5c_finish_stripe_write_out(struct r5conf *conf,
if (test_and_clear_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state)) {
BUG_ON(atomic_read(&conf->r5c_cached_partial_stripes) == 0);
+ atomic_dec(&conf->r5c_flushing_partial_stripes);
atomic_dec(&conf->r5c_cached_partial_stripes);
}
if (test_and_clear_bit(STRIPE_R5C_FULL_STRIPE, &sh->state)) {
BUG_ON(atomic_read(&conf->r5c_cached_full_stripes) == 0);
+ atomic_dec(&conf->r5c_flushing_full_stripes);
atomic_dec(&conf->r5c_cached_full_stripes);
}
}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 467ad4f..5f08ef1 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6839,6 +6839,8 @@ static struct r5conf *setup_conf(struct mddev *mddev)
INIT_LIST_HEAD(&conf->r5c_full_stripe_list);
atomic_set(&conf->r5c_cached_partial_stripes, 0);
INIT_LIST_HEAD(&conf->r5c_partial_stripe_list);
+ atomic_set(&conf->r5c_flushing_full_stripes, 0);
+ atomic_set(&conf->r5c_flushing_partial_stripes, 0);
conf->level = mddev->new_level;
conf->chunk_sectors = mddev->new_chunk_sectors;
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 6a99fb5e..1e2f35c 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -666,6 +666,8 @@ struct r5conf {
struct list_head r5c_full_stripe_list;
atomic_t r5c_cached_partial_stripes;
struct list_head r5c_partial_stripe_list;
+ atomic_t r5c_flushing_full_stripes;
+ atomic_t r5c_flushing_partial_stripes;
atomic_t empty_inactive_list_nr;
struct llist_head released_stripes;
--
2.9.3
^ permalink raw reply related
* Re: [PATCH v1 0/5] md: use bio_clone_fast()
From: Shaohua Li @ 2017-02-11 0:38 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-kernel, linux-raid, linux-block,
Christoph Hellwig, NeilBrown
In-Reply-To: <1486724177-14817-1-git-send-email-tom.leiming@gmail.com>
On Fri, Feb 10, 2017 at 06:56:12PM +0800, Ming Lei wrote:
> Hi,
>
> This patchset replaces bio_clone() with bio_clone_fast() in
> bio_clone_mddev() because:
>
> 1) bio_clone_mddev() is used in raid normal I/O and isn't in
> resync I/O path, and all the direct access to bvec table in
> raid happens on resync I/O only except for write behind of raid1.
> Write behind is treated specially, so the replacement is safe.
>
> 2) for write behind, bio_clone() is kept, but this patchset
> introduces bio_clone_bioset_partial() to just clone one specific
> bvecs range instead of whole table. Then write behind is improved
> too.
Thanks! this patch set looks good to me.
Jens,
can you look at the first patch? If it's ok, I'll carry it in my tree.
Thanks,
Shaohua
> V1:
> 1) don't introduce bio_clone_slow_mddev_partial()
> 2) return failure if mddev->bio_set can't be created
> 3) remove check in bio_clone_mddev() as suggested by
> Christoph Hellwig.
> 4) rename bio_clone_mddev() as bio_clone_fast_mddev()
>
>
> Ming Lei (5):
> block: introduce bio_clone_bioset_partial()
> md/raid1: use bio_clone_bioset_partial() in case of write behind
> md: fail if mddev->bio_set can't be created
> md: remove unnecessary check on mddev
> md: fast clone bio in bio_clone_mddev()
>
> block/bio.c | 61 +++++++++++++++++++++++++++++++++++++++++------------
> drivers/md/faulty.c | 2 +-
> drivers/md/md.c | 14 ++++++------
> drivers/md/md.h | 4 ++--
> drivers/md/raid1.c | 26 ++++++++++++++++-------
> drivers/md/raid10.c | 11 +++++-----
> drivers/md/raid5.c | 4 ++--
> include/linux/bio.h | 11 ++++++++--
> 8 files changed, 92 insertions(+), 41 deletions(-)
>
> --
> 2.7.4
>
> Thanks,
> Ming
^ permalink raw reply
* Re: [PATCH V2] MD: add doc for raid5-cache
From: Nix @ 2017-02-12 0:16 UTC (permalink / raw)
To: Shaohua Li
Cc: linux-raid, antlists, philip, songliubraving, neilb,
jure.erznoznik, rramesh2400
In-Reply-To: <cac53388fd5903a006792c0165063d63eb66079d.1486408891.git.shli@fb.com>
On 6 Feb 2017, Shaohua Li stated:
> +write-back mode:
> +
> +write-back mode fixes the 'write hole' issue too, since all write data is
> +cached on cache disk. But the main goal of 'write-back' cache is to speed up
> +write. If a write crosses all RAID disks of a stripe, we call it full-stripe
> +write. For non-full-stripe writes, MD must read old data before the new parity
> +can be calculated. These synchronous reads hurt write throughput. Some writes
> +which are sequential but not dispatched in the same time will suffer from this
> +overhead too. Write-back cache will aggregate the data and flush the data to
> +RAID disks only after the data becomes a full stripe write. This will
> +completely avoid the overhead, so it's very helpful for some workloads. A
> +typical workload which does sequential write followed by fsync is an example.
> +
> +In write-back mode, MD reports IO completion to upper layer (usually
> +filesystems) right after the data hits cache disk. The data is flushed to raid
> +disks later after specific conditions met. So cache disk failure will cause
> +data loss.
> +
> +In write-back mode, MD also caches data in memory. The memory cache includes
> +the same data stored on cache disk, so a power loss doesn't cause data loss.
> +The memory cache size has performance impact for the array. It's recommended
> +the size is big. A user can configure the size by:
> +
> +echo "2048" > /sys/block/md0/md/stripe_cache_size
I'm missing something. Won't a big stripe_cache_size have the same
effect on reducing the read size of RMW as the writeback cache has?
That's the entire point of it: to remember stripes so you don't need to
take the R hit so often. I mean, sure, it won't survive a power loss: is
this just to avoid RMWs for the first write after a power loss to
stripes that were previously written before the power loss? Or is it
because the raid5-cache can be much bigger than the in-memory cache,
caching many thousands of stripes? (in which case, the raid5-cache is
preferable for any workload in which random or sub-stripe sequential
writes are scattered across very many distinct stripes rather than being
concentrated in a few, or a few dozen. This is probably a very common
case even for things like compilations or git checkouts, because new
file creation tends to be fairly scattered: every new object file might
well be in a different stripe from every other, so virtually every write
of less than the stripe size would have to block on the completion of a
read.)
(... this question is because I'm re-entering the world of md5 after
years wandering in the wilderness of hardware RAID: the writethrough
mode looks very compelling, particularly now your docs have described
how big it needs to be, or rather how big it doesn't need to be. But I
don't quite see the point of writeback mode yet.)
Hm. This is probably also a reason to keep your stripes not too large:
it's more likely that smallish writes will fill whole stripes and avoid
the read entirely. I was considering it pointless to make the stripe
size smaller than the average size of a disk track (if you can figure
that out these days), but making it much smaller seems like it's still
worthwhile.
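As a rough worked example of the sizes involved: a RAID5 full-stripe write covers one chunk on every data disk, so its size is the chunk size times the number of devices minus one. A small sketch, with a hypothetical function name:

```c
#include <stdint.h>

/*
 * Data carried by one full RAID5 stripe, in KiB: one chunk per device,
 * minus the one chunk per stripe that holds parity.
 */
static uint64_t raid5_full_stripe_kib(uint64_t chunk_kib, unsigned int raid_devices)
{
	return chunk_kib * (raid_devices - 1);
}
```

With 128 KiB chunks on 4 devices a full stripe is 384 KiB; with 512 KiB chunks on 10 devices it is 4608 KiB, so a write has to be much larger before it fills a whole stripe and avoids the read.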
Does anyone have recentish performance figures on the effect of changing
chunk, and thus, stripe sizes on things like file creations for a range
of sizes, or is picking a stripe size, stripe cache size, and readahead
value still basically guesswork like it was when I did this last? The
RAID performance pages show figures all over the shop, with most people
apparently agreeing on chunk sizes of 128--256KiB and *nobody* agreeing
on readahead or stripe cache sizes :( is there anything resembling a
consensus here yet?
--
NULL && (void)
^ permalink raw reply
* RAID 5 --assemble doesn't recognize all overlays as component devices (was: RAID 5 reshape stalled at 77.5% - next steps??)
From: George Rapp @ 2017-02-12 0:32 UTC (permalink / raw)
To: Linux-RAID; +Cc: Matthew Krumwiede
Previous thread: http://marc.info/?l=linux-raid&m=148564798430138&w=2
-- to summarize, while adding two drives to a RAID 5 array, one of the
existing RAID 5 component drives failed, causing the reshape progress
to stall at 77.5%. I removed the previous thread from this message to
conserve space -- before resolving that situation, another problem has
arisen.
We have cloned and replaced the failed /dev/sdg with "ddrescue --force
-r3 -n /dev/sdh /dev/sde c/sdh-sde-recovery.log"; copied in below, or
viewable via https://app.box.com/v/sdh-sde-recovery . The failing
device was removed from the server, and the RAID component partition
on the cloned drive is now /dev/sdg4.
I've also created and run a script to create overlay files per
https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID.
The script name is make_RAID_overlays, and it's also copied in below,
or viewable via https://app.box.com/v/make-RAID-overlays
When trying to assemble the array using the overlays, I now get the
following output:
# export MDADM_GROW_ALLOW_OLD=1
# mdadm --assemble /dev/md4
--backup-file=/home/gwr/2017/2017-01/md4_backup__2017-01-25
/dev/mapper/sde4 /dev/mapper/sdf4 /dev/mapper/sdh4 /dev/mapper/sdl4
/dev/mapper/sdg4 /dev/mapper/sdk4 /dev/mapper/sdi4 /dev/mapper/sdj4
/dev/mapper/sdb4 /dev/mapper/sdd4
mdadm: accepting backup with timestamp 1485366772 for array with
timestamp 1485439616
mdadm: /dev/md4 assembled from 9 drives - not enough to start the
array while not clean - consider --force.
As you can count, there are ten devices (mapping to overlay files) but
mdadm is only recognizing 9 of them as drives. By process of
elimination, I determined that /dev/mapper/sdg4 can be removed from
that command without changing the result:
# mdadm --assemble /dev/md4
--backup-file=/home/gwr/2017/2017-01/md4_backup__2017-01-25
/dev/mapper/sde4 /dev/mapper/sdf4 /dev/mapper/sdh4 /dev/mapper/sdl4
/dev/mapper/sdk4 /dev/mapper/sdi4 /dev/mapper/sdj4 /dev/mapper/sdb4
/dev/mapper/sdd4
mdadm: accepting backup with timestamp 1485366772 for array with
timestamp 1485439616
mdadm: /dev/md4 assembled from 9 drives - not enough to start the
array while not clean - consider --force.
The output of mdadm --examine (in file raid_status_2017-02-11, copied
in below or viewable via https://app.box.com/v/raid-status-2017-02-11)
doesn't show anything obvious that I can see that would cause
/dev/mapper/sdg4 not to be recognized as a RAID component. The only
discrepancy is that /dev/mapper/sdg4 is showing a State: value of
clean, while all other devices are showing State: active, and the
Array State is different (AAAARAAAA. vs. AAAAAAAAA.)
Any thoughts? Am I creating overlays incorrectly?
-=-=- sdh-sde-recovery.log
# Command line: ddrescue --force -r3 -n /dev/sdh /dev/sde c/sdh-sde-recovery.log
# Start time: 2017-01-29 18:43:17
# Current time: 2017-01-30 02:20:21
# Finished
# current_pos current_status
0x1D1C1115E00 +
# pos size status
0x00000000 0x1D099900000 +
0x1D099900000 0x00000200 -
0x1D099900200 0x0001FC00 /
0x1D09991FE00 0x00000200 -
0x1D099920000 0x00020000 +
0x1D099940000 0x00000200 -
0x1D099940200 0x1277D5C00 /
0x1D1C1115E00 0x00000200 -
-=-=- make_RAID_overlays
#!/usr/bin/sh
UUID=359d41dc:a2e506e3:5e802a49:a84ef89c
DEVICES=$(cat /proc/partitions | parallel --tagstring {5} --colsep ' +' mdadm -E /dev/{5} |grep $UUID | grep -v "dm-" | parallel --colsep '\t' echo /dev/{1})
parallel 'test -e /dev/loop{#} || mknod -m 660 /dev/loop{#} b 7 {#}' ::: $DEVICES
parallel truncate -s8G overlay-{/} ::: $DEVICES
parallel 'size=$(blockdev --getsize {}); loop=$(losetup -f --show -- overlay-{/}); echo 0 $size snapshot {} $loop P 8 | dmsetup create {/}' ::: $DEVICES
OVERLAYS=$(parallel echo /dev/mapper/{/} ::: $DEVICES)
dmsetup status
echo export DEVICES=\"$DEVICES\"
echo export OVERLAYS=\"$OVERLAYS\"
-=-=- raid_status_2017-02-11
/dev/sdb4:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x4
Array UUID : 359d41dc:a2e506e3:5e802a49:a84ef89c
Name : localhost.localdomain:4
Creation Time : Thu Feb 17 14:54:06 2011
Raid Level : raid5
Raid Devices : 10
Avail Dev Size : 3844265615 (1833.09 GiB 1968.26 GB)
Array Size : 17299187712 (16497.79 GiB 17714.37 GB)
Used Dev Size : 3844263936 (1833.09 GiB 1968.26 GB)
Data Offset : 2048 sectors
Super Offset : 0 sectors
Unused Space : before=1968 sectors, after=1679 sectors
State : active
Device UUID : af5226e3:757f9f97:55728ff9:197fc03d
Reshape pos'n : 13412689920 (12791.34 GiB 13734.59 GB)
Delta Devices : 2 (8->10)
Update Time : Thu Jan 26 08:06:56 2017
Bad Block Log : 512 entries available at offset 72 sectors
Checksum : c1a3ed5b - correct
Events : 3957775
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 8
Array State : AAAAAAAAA. ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdd4:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x16
Array UUID : 359d41dc:a2e506e3:5e802a49:a84ef89c
Name : localhost.localdomain:4
Creation Time : Thu Feb 17 14:54:06 2011
Raid Level : raid5
Raid Devices : 10
Avail Dev Size : 3844265615 (1833.09 GiB 1968.26 GB)
Array Size : 17299187712 (16497.79 GiB 17714.37 GB)
Used Dev Size : 3844263936 (1833.09 GiB 1968.26 GB)
Data Offset : 2048 sectors
Super Offset : 0 sectors
Recovery Offset : 2980597760 sectors
Unused Space : before=1968 sectors, after=1679 sectors
State : active
Device UUID : 87bc1358:efcfbf2a:226a9241:dcd1c54e
Reshape pos'n : 13412689920 (12791.34 GiB 13734.59 GB)
Delta Devices : 2 (8->10)
Update Time : Thu Jan 26 08:06:56 2017
Bad Block Log : 512 entries available at offset 72 sectors
Checksum : d462424f - correct
Events : 3957775
Layout : left-symmetric
Chunk Size : 512K
Device Role : Replacement device 4
Array State : AAAAAAAAA. ('A' == active, '.' == missing, 'R' == replacing)
/dev/sde4:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x4
Array UUID : 359d41dc:a2e506e3:5e802a49:a84ef89c
Name : localhost.localdomain:4
Creation Time : Thu Feb 17 14:54:06 2011
Raid Level : raid5
Raid Devices : 10
Avail Dev Size : 3844263936 (1833.09 GiB 1968.26 GB)
Array Size : 17299187712 (16497.79 GiB 17714.37 GB)
Data Offset : 2048 sectors
Super Offset : 0 sectors
Unused Space : before=1976 sectors, after=0 sectors
State : active
Device UUID : 4340ee15:6ab2d65a:bc21b9b0:9b285385
Reshape pos'n : 13412689920 (12791.34 GiB 13734.59 GB)
Delta Devices : 2 (8->10)
Update Time : Thu Jan 26 08:06:56 2017
Checksum : b6575fe9 - correct
Events : 3957775
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 0
Array State : AAAAAAAAA. ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdf4:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x4
Array UUID : 359d41dc:a2e506e3:5e802a49:a84ef89c
Name : localhost.localdomain:4
Creation Time : Thu Feb 17 14:54:06 2011
Raid Level : raid5
Raid Devices : 10
Avail Dev Size : 3844263936 (1833.09 GiB 1968.26 GB)
Array Size : 17299187712 (16497.79 GiB 17714.37 GB)
Data Offset : 2048 sectors
Super Offset : 0 sectors
Unused Space : before=1976 sectors, after=0 sectors
State : active
Device UUID : 8aea344d:3d8180a3:7762f06b:ca3d5957
Reshape pos'n : 13412689920 (12791.34 GiB 13734.59 GB)
Delta Devices : 2 (8->10)
Update Time : Thu Jan 26 08:06:56 2017
Checksum : c3852eee - correct
Events : 3957775
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 1
Array State : AAAAAAAAA. ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdg4:
Magic : a92b4efc
Version : 1.1
Feature Map : 0xc
Array UUID : 359d41dc:a2e506e3:5e802a49:a84ef89c
Name : localhost.localdomain:4
Creation Time : Thu Feb 17 14:54:06 2011
Raid Level : raid5
Raid Devices : 10
Avail Dev Size : 3844263936 (1833.09 GiB 1968.26 GB)
Array Size : 17299187712 (16497.79 GiB 17714.37 GB)
Data Offset : 2048 sectors
Super Offset : 0 sectors
Unused Space : before=1968 sectors, after=0 sectors
State : clean
Device UUID : 00e5ef75:50fc5750:a4ea2e21:13757690
Reshape pos'n : 13412689920 (12791.34 GiB 13734.59 GB)
Delta Devices : 2 (8->10)
Update Time : Thu Jan 26 07:49:59 2017
Bad Block Log : 512 entries available at offset 72 sectors - bad blocks present.
Checksum : 877a6050 - correct
Events : 3957772
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 4
Array State : AAAARAAAA. ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdh4:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x4
Array UUID : 359d41dc:a2e506e3:5e802a49:a84ef89c
Name : localhost.localdomain:4
Creation Time : Thu Feb 17 14:54:06 2011
Raid Level : raid5
Raid Devices : 10
Avail Dev Size : 3844263936 (1833.09 GiB 1968.26 GB)
Array Size : 17299187712 (16497.79 GiB 17714.37 GB)
Data Offset : 2048 sectors
Super Offset : 0 sectors
Unused Space : before=1976 sectors, after=0 sectors
State : active
Device UUID : 25472a02:7ad445e8:2aae650f:783346f7
Reshape pos'n : 13412689920 (12791.34 GiB 13734.59 GB)
Delta Devices : 2 (8->10)
Update Time : Thu Jan 26 08:06:56 2017
Checksum : a22029 - correct
Events : 3957775
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 2
Array State : AAAAAAAAA. ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdi4:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x4
Array UUID : 359d41dc:a2e506e3:5e802a49:a84ef89c
Name : localhost.localdomain:4
Creation Time : Thu Feb 17 14:54:06 2011
Raid Level : raid5
Raid Devices : 10
Avail Dev Size : 3844264112 (1833.09 GiB 1968.26 GB)
Array Size : 17299187712 (16497.79 GiB 17714.37 GB)
Used Dev Size : 3844263936 (1833.09 GiB 1968.26 GB)
Data Offset : 2048 sectors
Super Offset : 0 sectors
Unused Space : before=1976 sectors, after=176 sectors
State : active
Device UUID : c06c4647:8e9bf793:db91ea2b:2f6313b6
Reshape pos'n : 13412689920 (12791.34 GiB 13734.59 GB)
Delta Devices : 2 (8->10)
Update Time : Thu Jan 26 08:06:56 2017
Checksum : ccc220f5 - correct
Events : 3957775
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 6
Array State : AAAAAAAAA. ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdj4:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x4
Array UUID : 359d41dc:a2e506e3:5e802a49:a84ef89c
Name : localhost.localdomain:4
Creation Time : Thu Feb 17 14:54:06 2011
Raid Level : raid5
Raid Devices : 10
Avail Dev Size : 3844264112 (1833.09 GiB 1968.26 GB)
Array Size : 17299187712 (16497.79 GiB 17714.37 GB)
Used Dev Size : 3844263936 (1833.09 GiB 1968.26 GB)
Data Offset : 2048 sectors
Super Offset : 0 sectors
Unused Space : before=1976 sectors, after=176 sectors
State : active
Device UUID : c05f73af:10ae7c6f:d8ce6da0:ea06dfe6
Reshape pos'n : 13412689920 (12791.34 GiB 13734.59 GB)
Delta Devices : 2 (8->10)
Update Time : Thu Jan 26 08:06:56 2017
Checksum : b5c3072f - correct
Events : 3957775
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 7
Array State : AAAAAAAAA. ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdk4:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x4
Array UUID : 359d41dc:a2e506e3:5e802a49:a84ef89c
Name : localhost.localdomain:4
Creation Time : Thu Feb 17 14:54:06 2011
Raid Level : raid5
Raid Devices : 10
Avail Dev Size : 3844264112 (1833.09 GiB 1968.26 GB)
Array Size : 17299187712 (16497.79 GiB 17714.37 GB)
Used Dev Size : 3844263936 (1833.09 GiB 1968.26 GB)
Data Offset : 2048 sectors
Super Offset : 0 sectors
Unused Space : before=1968 sectors, after=176 sectors
State : active
Device UUID : d0daf993:42ee5ab3:a32fb0c2:471e06bb
Reshape pos'n : 13412689920 (12791.34 GiB 13734.59 GB)
Delta Devices : 2 (8->10)
Update Time : Thu Jan 26 08:06:56 2017
Bad Block Log : 512 entries available at offset 72 sectors
Checksum : d4993ae5 - correct
Events : 3957775
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 5
Array State : AAAAAAAAA. ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdl4:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x4
Array UUID : 359d41dc:a2e506e3:5e802a49:a84ef89c
Name : localhost.localdomain:4
Creation Time : Thu Feb 17 14:54:06 2011
Raid Level : raid5
Raid Devices : 10
Avail Dev Size : 3844264112 (1833.09 GiB 1968.26 GB)
Array Size : 17299187712 (16497.79 GiB 17714.37 GB)
Used Dev Size : 3844263936 (1833.09 GiB 1968.26 GB)
Data Offset : 2048 sectors
Super Offset : 0 sectors
Unused Space : before=1976 sectors, after=176 sectors
State : active
Device UUID : dc42a9bd:400f4d25:8e2d74b5:25ee7478
Reshape pos'n : 13412689920 (12791.34 GiB 13734.59 GB)
Delta Devices : 2 (8->10)
Update Time : Thu Jan 26 08:06:56 2017
Checksum : 2065916e - correct
Events : 3957775
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 3
Array State : AAAAAAAAA. ('A' == active, '.' == missing, 'R' == replacing)
--
George Rapp (Pataskala, OH) Home: george.rapp -- at -- gmail.com
LinkedIn profile: https://www.linkedin.com/in/georgerapp
Phone: +1 740 936 RAPP (740 936 7277)
* Re: ANNOUNCE: mdadm 4.0 - A tool for managing md Soft RAID under Linux
From: zhilong @ 2017-02-13 5:08 UTC (permalink / raw)
To: Jes Sorensen, Bruce Dubbs, Brown, Neil
Cc: Guoqing Jiang, Shaohua Li, linux-raid@vger.kernel.org, LKML
In-Reply-To: <4eb83aa0-6245-d499-f18a-e8456aad9f98@gmail.com>
Hi, Jes;
On 01/13/2017 12:41 AM, Jes Sorensen wrote:
> On 01/11/17 23:24, Guoqing Jiang wrote:
>>
>> On 01/12/2017 12:59 AM, Jes Sorensen wrote:
>>> On 01/11/17 11:52, Shaohua Li wrote:
>>>> On Tue, Jan 10, 2017 at 11:49:04AM -0600, Bruce Dubbs wrote:
>>>>> Jes Sorensen wrote:
>>>>>> I am pleased to announce the availability of
>>>>>> mdadm version 4.0
>>>>>>
>>>>>> It is available at the usual places:
>>>>>> http://www.kernel.org/pub/linux/utils/raid/mdadm/
>>>>>> and via git at
>>>>>> git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git
>>>>>> http://git.kernel.org/cgit/utils/mdadm/
>>>>>>
>>>>>> The update in major version number primarily indicates this is a
>>>>>> release by it's new maintainer. In addition it contains a large number
>>>>>> of fixes in particular for IMSM RAID and clustered RAID support. In
>>>>>> addition this release includes support for IMSM 4k sector drives,
>>>>>> failfast and better documentation for journaled RAID.
>>>>> Thank you for the new release. Unfortunately I get 9 failures
>>>>> running the
>>>>> test suite:
>>>>>
>>>>> tests/00raid1... FAILED
>>>>> tests/07autoassemble... FAILED
>>>>> tests/07changelevels... FAILED
>>>>> tests/07revert-grow... FAILED
>>>>> tests/07revert-inplace... FAILED
>>>>> tests/07testreshape5... FAILED
>>>>> tests/10ddf-fail-twice... FAILED
>>>>> tests/20raid5journal... FAILED
>>>>> tests/10ddf-incremental-wrong-order... FAILED
>>>> Yep, several tests usually fail. It appears some checks aren't always
>>>> good. At
>>>> least the 'check' function for reshape/resync isn't reliable in my
>>>> test, I saw
>>>> 07changelevelintr fails frequently.
>>> That is my experience as well - some of them are affected by the kernel
>>> version too. We probably need to look into making them more reliable.
>> If possible, it could be a potential topic for lsf/mm raid discussion as
>> Coly suggested
>> in previous mail.
>>
>> Is current test can run the test for different raid level, say, "./test
>> --raidtype=raid1" could
>> execute all the *r1* tests, does it make sense to do it if we don't
>> support it now.
> We could have a discussion about this at LSF/MM, if someone is willing
> to sponsor getting it accepted and we can get the right people there.
>
> Note that the test suite also allows you to run all the 01 tests by
> specifying ./test 01. I do like to see the test suite improved and made
> more resilient.
I'm sorry for my late response; I'm just back at work today from
vacation. Over the past months I have learned about and worked on the
cluster-md feature, and I have drafted a test suite for it. Please refer to
https://github.com/zhilongliu/clustermd-autotest
I'm very willing to help improve the mdadm testing part, and I also
want to improve the cluster-md test suite; all comments are welcome.
> Cheers,
> Jes
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
Thanks very much,
-Zhilong
* Re: ANNOUNCE: mdadm 4.0 - A tool for managing md Soft RAID under Linux
From: zhilong @ 2017-02-13 5:54 UTC (permalink / raw)
To: Jes Sorensen, Bruce Dubbs, Brown, Neil
Cc: Guoqing Jiang, Shaohua Li, linux-raid@vger.kernel.org, LKML
In-Reply-To: <6504747e-c49e-dc54-64a6-bce2220daffc@suse.com>
On 02/13/2017 01:08 PM, zhilong wrote:
> Hi, Jes;
>
>
> On 01/13/2017 12:41 AM, Jes Sorensen wrote:
>> On 01/11/17 23:24, Guoqing Jiang wrote:
>>>
>>> On 01/12/2017 12:59 AM, Jes Sorensen wrote:
>>>> On 01/11/17 11:52, Shaohua Li wrote:
>>>>> On Tue, Jan 10, 2017 at 11:49:04AM -0600, Bruce Dubbs wrote:
>>>>>> Jes Sorensen wrote:
>>>>>>> I am pleased to announce the availability of
>>>>>>> mdadm version 4.0
>>>>>>>
>>>>>>> It is available at the usual places:
>>>>>>> http://www.kernel.org/pub/linux/utils/raid/mdadm/
>>>>>>> and via git at
>>>>>>> git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git
>>>>>>> http://git.kernel.org/cgit/utils/mdadm/
>>>>>>>
>>>>>>> The update in major version number primarily indicates this is a
>>>>>>> release by it's new maintainer. In addition it contains a large
>>>>>>> number
>>>>>>> of fixes in particular for IMSM RAID and clustered RAID
>>>>>>> support. In
>>>>>>> addition this release includes support for IMSM 4k sector drives,
>>>>>>> failfast and better documentation for journaled RAID.
>>>>>> Thank you for the new release. Unfortunately I get 9 failures
>>>>>> running the
>>>>>> test suite:
>>>>>>
>>>>>> tests/00raid1... FAILED
>>>>>> tests/07autoassemble... FAILED
>>>>>> tests/07changelevels... FAILED
>>>>>> tests/07revert-grow... FAILED
>>>>>> tests/07revert-inplace... FAILED
>>>>>> tests/07testreshape5... FAILED
>>>>>> tests/10ddf-fail-twice... FAILED
>>>>>> tests/20raid5journal... FAILED
>>>>>> tests/10ddf-incremental-wrong-order... FAILED
>>>>> Yep, several tests usually fail. It appears some checks aren't always
>>>>> good. At
>>>>> least the 'check' function for reshape/resync isn't reliable in my
>>>>> test, I saw
>>>>> 07changelevelintr fails frequently.
>>>> That is my experience as well - some of them are affected by the
>>>> kernel
>>>> version too. We probably need to look into making them more reliable.
>>> If possible, it could be a potential topic for lsf/mm raid
>>> discussion as
>>> Coly suggested
>>> in previous mail.
>>>
>>> Is current test can run the test for different raid level, say, "./test
>>> --raidtype=raid1" could
>>> execute all the *r1* tests, does it make sense to do it if we don't
>>> support it now.
>> We could have a discussion about this at LSF/MM, if someone is willing
>> to sponsor getting it accepted and we can get the right people there.
>>
>> Note that the test suite also allows you to run all the 01 tests by
>> specifying ./test 01. I do like to see the test suite improved and made
>> more resilient.
> I'm sorry for my late response; I'm just back at work today from
> vacation. Over the past months I have learned about and worked on the
> cluster-md feature, and I have drafted a test suite for it. Please refer to
> https://github.com/zhilongliu/clustermd-autotest
> I'm very willing to help improve the mdadm testing part, and I also
> want to improve the cluster-md test suite; all comments are welcome.
>
I will keep making the cluster-md test scripts more and more stable, and
eventually propose integrating them into the mdadm test suite. :-)
Best regards,
-Zhilong
>> Cheers,
>> Jes
>>
>
> Thanks very much,
> -Zhilong
* Re: [PATCH 1/5] MD: attach data to each bio
From: Christoph Hellwig @ 2017-02-13 7:37 UTC (permalink / raw)
To: NeilBrown; +Cc: Shaohua Li, linux-raid, khlebnikov, hch
In-Reply-To: <87r336tw5l.fsf@notabene.neil.brown.name>
On Fri, Feb 10, 2017 at 05:08:54PM +1100, NeilBrown wrote:
> I must say that I don't really like this approach.
> Temporarily modifying ->bi_private and ->bi_end_io seems
> .... intrusive. I suspect it works, but I wonder if it is really
> robust in the long term.
>
> How about a different approach.. Your main concern with my first patch
> was that it called md_write_start() and md_write_end() much more often,
> and these performed atomic ops on "global" variables, particular
> writes_pending.
>
> We could change writes_pending to a per-cpu array which we only count
> occasionally when needed. As writes_pending is updated often and
> checked rarely, a per-cpu array which is summed on demand seems
> appropriate.
FYI, I much prefer your original approach; it's much closer to how
the rest of the block stack works.
* Enable the skip_copy feature will results in data integrity issue in raid5 degraded mode.
From: Chien Lee @ 2017-02-13 9:07 UTC (permalink / raw)
To: linux-raid, NeilBrown, shli, owner-linux-raid
Hello,
Recently we found a bug in the skip_copy feature in raid5 degraded mode.
Initially, we enabled skip_copy to speed up the system's
write performance. But when the system runs continuous database
read/write I/O in raid5 degraded mode, MongoDB detects
checksum errors and generates the related debug log. The
testing details follow.
a. Enable skip_copy
--> Checksum error logs from Mongo DB
2017-02-06T11:54:56.537+0800 E STORAGE [conn7] WiredTiger (0)
[1486353296:537114][52:0x7f98396a4700],
file:collection-110-3235234017846331078.wt, WT_CURSOR.next: read
checksum error for 4096B block at offset 61440: calculated block
checksum of 1363526237 doesn't match expected checksum of 2969711960
b. Disable skip_copy
--> Mongo DB has no checksum error.
Our repeated database I/O testing makes us fairly sure that this is a bug.
With skip_copy enabled, raid5/raid6 in degraded mode always
triggers the MongoDB checksum error in less than one
hour. With skip_copy disabled, the error never occurs.
Moreover, because skip_copy
only affects the write path and not the read path, I think
the bug is caused by writes in degraded mode while skip_copy
is enabled.
Could you kindly offer some help or ideas about the root cause and a solution?
Thanks,
--
Chien Lee
* Re: [PATCH v3 3/4] dmaengine: Add Broadcom SBA RAID driver
From: Anup Patel @ 2017-02-13 9:13 UTC (permalink / raw)
To: Dan Williams
Cc: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
David S . Miller, Jassi Brar, Ray Jui, Scott Branden, Jon Mason,
Rob Rice, BCM Kernel Feedback, dmaengine@vger.kernel.org,
Device Tree, linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-crypto, linux-raid
In-Reply-To: <CAPcyv4hE5gDiHhfaiHDHbhA2xKa45UdzKcSxnQXK-W92sr3Z1g@mail.gmail.com>
On Fri, Feb 10, 2017 at 11:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Fri, Feb 10, 2017 at 1:07 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>> The Broadcom stream buffer accelerator (SBA) provides offloading
>> capabilities for RAID operations. This SBA offload engine is
>> accessible via Broadcom SoC specific ring manager.
>>
>> This patch adds Broadcom SBA RAID driver which provides one
>> DMA device with RAID capabilities using one or more Broadcom
>> SoC specific ring manager channels. The SBA RAID driver in its
>> current shape implements memcpy, xor, and pq operations.
>>
>> Signed-off-by: Anup Patel <anup.patel@broadcom.com>
>> Reviewed-by: Ray Jui <ray.jui@broadcom.com>
>> ---
>> drivers/dma/Kconfig | 13 +
>> drivers/dma/Makefile | 1 +
>> drivers/dma/bcm-sba-raid.c | 1711 ++++++++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 1725 insertions(+)
>> create mode 100644 drivers/dma/bcm-sba-raid.c
>>
>> diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
>> index 263495d..bf8fb84 100644
>> --- a/drivers/dma/Kconfig
>> +++ b/drivers/dma/Kconfig
>> @@ -99,6 +99,19 @@ config AXI_DMAC
>> controller is often used in Analog Device's reference designs for FPGA
>> platforms.
>>
>> +config BCM_SBA_RAID
>> + tristate "Broadcom SBA RAID engine support"
>> + depends on (ARM64 && MAILBOX && RAID6_PQ) || COMPILE_TEST
>> + select DMA_ENGINE
>> + select DMA_ENGINE_RAID
>> + select ASYNC_TX_ENABLE_CHANNEL_SWITCH
>
> ASYNC_TX_ENABLE_CHANNEL_SWITCH violates the DMA mapping API and
> Russell has warned it's especially problematic on ARM [1]. If you
> need channel switching for this offload engine to be useful then you
> need to move DMA mapping and channel switching responsibilities to MD
> itself.
>
> [1]: http://lists.infradead.org/pipermail/linux-arm-kernel/2011-January/036753.html
Actually, the driver works fine with or without
ASYNC_TX_ENABLE_CHANNEL_SWITCH enabled,
so I am fine with removing the dependency on this config option.
>
>
> [..]
>> diff --git a/drivers/dma/bcm-sba-raid.c b/drivers/dma/bcm-sba-raid.c
>> new file mode 100644
>> index 0000000..bab9918
>> --- /dev/null
>> +++ b/drivers/dma/bcm-sba-raid.c
>> @@ -0,0 +1,1711 @@
>> +/*
>> + * Copyright (C) 2017 Broadcom
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + */
>> +
>> +/*
>> + * Broadcom SBA RAID Driver
>> + *
>> + * The Broadcom stream buffer accelerator (SBA) provides offloading
>> + * capabilities for RAID operations. The SBA offload engine is accessible
>> + * via Broadcom SoC specific ring manager. Two or more offload engines
>> + * can share same Broadcom SoC specific ring manager due to this Broadcom
>> + * SoC specific ring manager driver is implemented as a mailbox controller
>> + * driver and offload engine drivers are implemented as mailbox clients.
>> + *
>> + * Typically, Broadcom SoC specific ring manager will implement larger
>> + * number of hardware rings over one or more SBA hardware devices. By
>> + * design, the internal buffer size of SBA hardware device is limited
>> + * but all offload operations supported by SBA can be broken down into
>> + * multiple small size requests and executed parallely on multiple SBA
>> + * hardware devices for achieving high through-put.
>> + *
>> + * The Broadcom SBA RAID driver does not require any register programming
>> + * except submitting request to SBA hardware device via mailbox channels.
>> + * This driver implements a DMA device with one DMA channel using a set
>> + * of mailbox channels provided by Broadcom SoC specific ring manager
>> + * driver. To exploit parallelism (as described above), all DMA request
>> + * coming to SBA RAID DMA channel are broken down to smaller requests
>> + * and submitted to multiple mailbox channels in round-robin fashion.
>> + * For having more SBA DMA channels, we can create more SBA device nodes
>> + * in Broadcom SoC specific DTS based on number of hardware rings supported
>> + * by Broadcom SoC ring manager.
>> + */
>> +
>> +#include <linux/bitops.h>
>> +#include <linux/dma-mapping.h>
>> +#include <linux/dmaengine.h>
>> +#include <linux/list.h>
>> +#include <linux/mailbox_client.h>
>> +#include <linux/mailbox/brcm-message.h>
>> +#include <linux/module.h>
>> +#include <linux/of_device.h>
>> +#include <linux/slab.h>
>> +#include <linux/raid/pq.h>
>> +
>> +#include "dmaengine.h"
>> +
>> +/* SBA command helper macros */
>> +#define SBA_DEC(_d, _s, _m) (((_d) >> (_s)) & (_m))
>> +#define SBA_ENC(_d, _v, _s, _m) \
>> + do { \
>> + (_d) &= ~((u64)(_m) << (_s)); \
>> + (_d) |= (((u64)(_v) & (_m)) << (_s)); \
>> + } while (0)
>
> Reusing a macro argument multiple times is problematic, consider
> SBA_ENC(..., arg++, ...), and hiding assignments in a macro make this
> hard to read. The compiler should inline it properly if you just make
> this a function that returns a value. You could also mark it __pure.
OK, I will make SBA_ENC a "static inline __pure" function.
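
The review point above can be sketched in plain C. This is only an illustration, not the code from any later revision of the patch: the names sba_cmd_enc() and sba_cmd_dec() are hypothetical, and uint64_t stands in for the kernel's u64. Because the encoder is a function that returns the new value, each argument is evaluated exactly once and the assignment is visible at the call site.

```c
#include <stdint.h>

/* Hypothetical helper: extract the field selected by `mask` at bit `shift`. */
static inline uint64_t sba_cmd_dec(uint64_t cmd, unsigned int shift,
				   uint64_t mask)
{
	return (cmd >> shift) & mask;
}

/*
 * Hypothetical helper: return `cmd` with the field at `shift` replaced
 * by `val`. In the kernel this could additionally be marked __pure,
 * since the result depends only on the arguments.
 */
static inline uint64_t sba_cmd_enc(uint64_t cmd, uint64_t val,
				   unsigned int shift, uint64_t mask)
{
	cmd &= ~(mask << shift);	/* clear the old field */
	cmd |= (val & mask) << shift;	/* insert the new value */
	return cmd;
}
```

A caller would then write cmd = sba_cmd_enc(cmd, val, shift, mask); an argument with a side effect, such as arg++, is evaluated once rather than once per macro expansion site.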
>
> [..]
>> +
>> +static struct sba_request *sba_alloc_request(struct sba_device *sba)
>> +{
>> + unsigned long flags;
>> + struct sba_request *req = NULL;
>> +
>> + spin_lock_irqsave(&sba->reqs_lock, flags);
>> +
>> + if (!list_empty(&sba->reqs_free_list)) {
>> + req = list_first_entry(&sba->reqs_free_list,
>> + struct sba_request,
>> + node);
>
> You could use list_first_entry_or_null() here.
OK, will use this.
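
A user-space sketch of what the suggestion amounts to follows. The kernel's real macro is list_first_entry_or_null(ptr, type, member) from <linux/list.h>; everything below (the *_sketch names and the tiny list implementation) is a hypothetical stand-in so the example builds outside the kernel, and the reqs_lock locking from the driver is deliberately left out.

```c
#include <stddef.h>

/* Minimal stand-in for the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static inline void list_init_sketch(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static inline void list_add_tail_sketch(struct list_head *n,
					struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Stand-in for list_first_entry_or_null(): NULL when the list is empty. */
#define first_entry_or_null_sketch(head, type, member)			\
	((head)->next != (head)						\
	 ? (type *)((char *)(head)->next - offsetof(type, member))	\
	 : NULL)

struct request_sketch {
	int id;
	struct list_head node;
};

/* Take the first free request, or return NULL if none is available. */
static inline struct request_sketch *
alloc_request_sketch(struct list_head *free_list)
{
	struct request_sketch *req =
		first_entry_or_null_sketch(free_list, struct request_sketch,
					   node);

	if (req) {
		/* Unlink the request from the free list. */
		req->node.prev->next = req->node.next;
		req->node.next->prev = req->node.prev;
		list_init_sketch(&req->node);
	}
	return req;
}
```

The explicit list_empty() check and the nested block in the quoted code collapse into a single lookup followed by a NULL test.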
>
> [..]
>> +
>> +/* Note: Must be called with sba->reqs_lock held */
>> +static void _sba_pending_request(struct sba_device *sba,
>> + struct sba_request *req)
>> +{
>
> You can validate the locking assumptions here with
> lockdep_assert_held(&sba->reqs_lock).
OK, will try this.
>
> [..]
>> +
>> +static void sba_cleanup_nonpending_requests(struct sba_device *sba)
>> +{
>> + unsigned long flags;
>> + struct sba_request *req, *req1;
>> +
>> + spin_lock_irqsave(&sba->reqs_lock, flags);
>> +
>> + /* Freeup all alloced request */
>> + list_for_each_entry_safe(req, req1, &sba->reqs_alloc_list, node) {
>> + _sba_free_request(sba, req);
>> + }
>> +
>> + /* Freeup all received request */
>> + list_for_each_entry_safe(req, req1, &sba->reqs_received_list, node) {
>> + _sba_free_request(sba, req);
>> + }
>> +
>> + /* Freeup all completed request */
>> + list_for_each_entry_safe(req, req1, &sba->reqs_completed_list, node) {
>> + _sba_free_request(sba, req);
>> + }
>> +
>> + /* Set all active requests as aborted */
>> + list_for_each_entry_safe(req, req1, &sba->reqs_active_list, node) {
>> + _sba_abort_request(sba, req);
>> + }
>
> In some parts of the driver you leave off unneeded braces like the for
> loop in sba_prep_dma_pq(), and in some case you include them. I'd say
> remove them if they're not necessary, but either way make it
> consistent across the driver.
I think I relied too much on checkpatch.pl to catch this
kind of coding-style issue.
I will fix this. Thanks for catching it.
>
> [..]
>> +
>> +static struct dma_async_tx_descriptor *
>> +sba_prep_dma_pq(struct dma_chan *dchan, dma_addr_t *dst, dma_addr_t *src,
>> + u32 src_cnt, const u8 *scf, size_t len, unsigned long flags)
>> +{
>> + u32 i, dst_q_index;
>> + size_t req_len;
>> + bool slow = false;
>> + dma_addr_t off = 0;
>> + dma_addr_t *dst_p = NULL, *dst_q = NULL;
>> + struct sba_device *sba = to_sba_device(dchan);
>> + struct sba_request *first = NULL, *req;
>> +
>> + /* Sanity checks */
>> + if (unlikely(src_cnt > sba->max_pq_srcs))
>> + return NULL;
>> + for (i = 0; i < src_cnt; i++)
>> + if (sba->max_pq_coefs <= raid6_gflog[scf[i]])
>> + slow = true;
>
> Thanks, yes, I do think this is cleaner here than in async_tx itself.
>
> [..]
>> +static void sba_receive_message(struct mbox_client *cl, void *msg)
>> +{
>> + unsigned long flags;
>> + struct brcm_message *m = msg;
>> + struct sba_request *req = m->ctx, *req1;
>> + struct sba_device *sba = req->sba;
>> +
>> + /* Error count if message has error */
>> + if (m->error < 0) {
>> + dev_err(sba->dev, "%s got message with error %d",
>> + dma_chan_name(&sba->dma_chan), m->error);
>> + }
>> +
>> + /* Mark request as received */
>> + sba_received_request(req);
>> +
>> + /* Wait for all chained requests to be completed */
>> + if (atomic_dec_return(&req->first->next_pending_count))
>> + goto done;
>> +
>> + /* Point to first request */
>> + req = req->first;
>> +
>> + /* Update request */
>> + if (req->state == SBA_REQUEST_STATE_RECEIVED)
>> + sba_dma_tx_actions(req);
>> + else
>> + sba_free_chained_requests(req);
>> +
>> + spin_lock_irqsave(&sba->reqs_lock, flags);
>> +
>> + /* Re-check all completed request waiting for 'ack' */
>> + list_for_each_entry_safe(req, req1, &sba->reqs_completed_list, node) {
>> + spin_unlock_irqrestore(&sba->reqs_lock, flags);
>> + sba_dma_tx_actions(req);
>
> You've now required all callback paths to be hardirq safe whereas
> previously the callbacks only assumed softirq exclusion. Have you run
> this with CONFIG_PROVE_LOCKING enabled?
We have run stress tests on the driver with multiple threads
trying to submit txn.
I will certainly try CONFIG_PROVE_LOCKING to be
doubly sure.
Thanks,
Anup
* Re: [PATCH 1/5] MD: attach data to each bio
From: NeilBrown @ 2017-02-13 9:32 UTC (permalink / raw)
Cc: Shaohua Li, linux-raid, khlebnikov, hch
In-Reply-To: <20170213073724.GA16666@lst.de>
On Mon, Feb 13 2017, Christoph Hellwig wrote:
> On Fri, Feb 10, 2017 at 05:08:54PM +1100, NeilBrown wrote:
>> I must say that I don't really like this approach.
>> Temporarily modifying ->bi_private and ->bi_end_io seems
>> .... intrusive. I suspect it works, but I wonder if it is really
>> robust in the long term.
>>
>> How about a different approach.. Your main concern with my first patch
>> was that it called md_write_start() and md_write_end() much more often,
>> and these performed atomic ops on "global" variables, particular
>> writes_pending.
>>
>> We could change writes_pending to a per-cpu array which we only count
>> occasionally when needed. As writes_pending is updated often and
>> checked rarely, a per-cpu array which is summed on demand seems
>> appropriate.
>
> FYI, I much prefer your original approach; it's much closer to how
> the rest of the block stack works.
I probably wasn't clear, but my intention was to stick with my original
approach, but make it more acceptable by removing the extra cost of
cache-line-bouncing that Shaohua correctly identified.
i.e. this patch was a preliminary to improve the original series.
Thanks,
NeilBrown
* Re: [PATCH 1/5] MD: attach data to each bio
From: NeilBrown @ 2017-02-13 9:49 UTC (permalink / raw)
To: Shaohua Li; +Cc: Shaohua Li, linux-raid, khlebnikov, hch
In-Reply-To: <20170210064715.tpr5mzvccmgxz2af@kernel.org>
On Thu, Feb 09 2017, Shaohua Li wrote:
> On Fri, Feb 10, 2017 at 05:08:54PM +1100, Neil Brown wrote:
>> On Tue, Feb 07 2017, Shaohua Li wrote:
>>
>> > Currently MD is rebusing some bio fields. To remove the hack, we attach
>> > extra data to each bio. Each personablity can attach extra data to the
>> > bios, so we don't need to rebuse bio fields.
>>
>> I must say that I don't really like this approach.
>> Temporarily modifying ->bi_private and ->bi_end_io seems
>> .... intrusive. I suspect it works, but I wonder if it is really
>> robust in the long term.
>>
>> How about a different approach.. Your main concern with my first patch
>> was that it called md_write_start() and md_write_end() much more often,
>> and these performed atomic ops on "global" variables, particular
>> writes_pending.
>>
>> We could change writes_pending to a per-cpu array which we only count
>> occasionally when needed. As writes_pending is updated often and
>> checked rarely, a per-cpu array which is summed on demand seems
>> appropriate.
>>
>> The following patch is an early draft - it doesn't obviously fail and
>> isn't obviously wrong to me. There is certainly room for improvement
>> and may be bugs.
>> Next week I'll work on collection the re-factoring into separate
>> patches, which are possible good-to-have anyway.
>
> For your first patch, I don't have much concern. It's ok to me. What I don't
> like is the bi_phys_segments handling part. The patches add a lot of logic to
> handle the reference count. They should work, but I'd say it's not easy to
> understand and could be error prone. What we really need is a reference count
> for the bio, so let's just add a reference count. That's my logic and it's
> simple.
We already have two reference counts, and you want to add a third one.
bi_phys_segments is currently used for two related purposes.
It counts the number of stripe_heads currently attached to the bio so
that when the count reaches zero:
1/ ->writes_pending can be decremented
2/ bio_endio() can be called.
When the code was written, the __bi_remaining counter didn't exist. Now
it does and it is integrated with bio_endio() so it should make the code
easier to understand if we just use bio_endio() rather than doing our own
accounting.
That just leaves '1'. We can easily decrement ->writes_pending directly
instead of decrementing a per-bio refcount, and then when it reaches
zero, decrement ->writes_pending. As you pointed out, that comes with a
cost. If ->writes_pending is changed to a per-cpu array which is summed
on demand, the cost goes away.
Having an extra refcount in the bio just adds a level of indirection
that doesn't (that I can see) provide actual value.
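
The "per-cpu array which is summed on demand" idea can be sketched in user-space C as follows. This is an illustration under stated assumptions, not the md implementation: the type and function names are hypothetical, a fixed NR_SLOTS stands in for the number of CPUs, the caller passes the CPU index explicitly, and real kernel code would use the per-cpu infrastructure rather than a hand-padded array.

```c
#define NR_SLOTS	4	/* assumed number of CPUs for the sketch */
#define CACHE_LINE	64	/* assumed cache line size in bytes */

/* One counter per CPU, padded so slots do not share a cache line. */
struct pcpu_slot_sketch {
	long count;
	char pad[CACHE_LINE - sizeof(long)];
};

struct writes_pending_sketch {
	struct pcpu_slot_sketch slot[NR_SLOTS];
};

/* Hot path: touch only the local slot, no shared cache line bounces. */
static inline void wp_inc_sketch(struct writes_pending_sketch *wp,
				 unsigned int cpu)
{
	wp->slot[cpu % NR_SLOTS].count++;
}

static inline void wp_dec_sketch(struct writes_pending_sketch *wp,
				 unsigned int cpu)
{
	wp->slot[cpu % NR_SLOTS].count--;
}

/* Cold path: walk every slot only when the total is actually needed. */
static inline long wp_sum_sketch(const struct writes_pending_sketch *wp)
{
	long total = 0;
	unsigned int i;

	for (i = 0; i < NR_SLOTS; i++)
		total += wp->slot[i].count;
	return total;
}
```

Note that a write may start on one CPU and complete on another, so an individual slot can go negative; only the sum is meaningful. A real implementation also has to keep the sum stable against concurrent updates while it is being checked, which this sketch does not attempt.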
>
> For the modifying bi_private and bi_end_io part, I saw some filesystems are
> using this way, at least btrfs. If this is really intrusive, is cloning a bio
> better?
The bio belongs to the filesystem. It allocated it and can do whatever
it likes with bi_end_io and bi_private. I don't think a block device
driver should ever change bi_private or bi_end_io of a bio that it was
passed (if it allocates its own bios, it can of course change those).
I don't think cloning the bio would really help, though you could
probably make something work.
Thanks,
NeilBrown
* LSI RAID
From: Gandalf Corvotempesta @ 2017-02-13 10:33 UTC (permalink / raw)
To: linux-raid
Hi all,
Silly question: I've read that LSI/PERC hardware controllers support DDF.
Would it be possible for mdadm to use a RAID array created with an LSI/PERC
controller that supports DDF?
* Re: [PATCH v1 3/5] md: fail if mddev->bio_set can't be created
From: Christoph Hellwig @ 2017-02-13 13:45 UTC (permalink / raw)
To: Ming Lei
Cc: Shaohua Li, Jens Axboe, linux-kernel, linux-raid, linux-block,
Christoph Hellwig, NeilBrown
In-Reply-To: <1486724177-14817-4-git-send-email-tom.leiming@gmail.com>
Looks fine,
Reviewed-by: Christoph Hellwig <hch@lst.de>
but this really needs to be patch 2 in this series.
* Re: [PATCH v1 1/5] block: introduce bio_clone_bioset_partial()
From: Christoph Hellwig @ 2017-02-13 13:46 UTC (permalink / raw)
To: Ming Lei
Cc: Shaohua Li, Jens Axboe, linux-kernel, linux-raid, linux-block,
Christoph Hellwig, NeilBrown
In-Reply-To: <1486724177-14817-2-git-send-email-tom.leiming@gmail.com>
On Fri, Feb 10, 2017 at 06:56:13PM +0800, Ming Lei wrote:
> md still needs bio clone (not the fast version) for behind write,
> and it is more efficient to use bio_clone_bioset_partial().
>
> The idea is simple: just copy the bvecs range specified by the
> parameters.
Given how few users bio_clone_bioset has I wonder if we shouldn't
simply add the two new arguments to it instead of adding another
indirection.
Otherwise looks fine:
Reviewed-by: Christoph Hellwig <hch@lst.de>
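For readers following along, the "copy only the specified bvecs range" idea
can be sketched in userspace roughly as below. This is not the patch's
implementation: struct seg and seg_copy_range() are hypothetical stand-ins
for a bio_vec and the partial clone, and the sketch assumes the range is
given as a byte offset and a byte size into the source segments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio_vec: one contiguous piece of a buffer. */
struct seg {
	size_t off;	/* offset of this piece within its page */
	size_t len;	/* length of this piece in bytes */
};

/*
 * Copy the byte range [offset, offset + size) of src[] into dst[],
 * trimming the first and last segment as needed. Returns the number
 * of segments written; stops early if dst[] is full.
 */
static size_t seg_copy_range(const struct seg *src, size_t nsrc,
			     struct seg *dst, size_t ndst,
			     size_t offset, size_t size)
{
	size_t n = 0;

	for (size_t i = 0; i < nsrc && size > 0; i++) {
		struct seg s = src[i];

		if (offset >= s.len) {	/* segment lies before the range */
			offset -= s.len;
			continue;
		}
		s.off += offset;	/* trim the front of the first segment */
		s.len -= offset;
		offset = 0;
		if (s.len > size)	/* trim the tail of the last segment */
			s.len = size;
		if (n == ndst)
			break;
		dst[n++] = s;
		size -= s.len;
	}
	return n;
}
```

Only the segment descriptors are copied, never the data they point at,
which is why a partial clone is cheaper than cloning the whole bio and
then trimming it.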