Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: [PATCH v4 3/4] dmaengine: Add Broadcom SBA RAID driver
From: Dan Williams @ 2017-02-15  7:25 UTC (permalink / raw)
  To: Anup Patel
  Cc: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar, Ray Jui, Scott Branden, Jon Mason,
	Rob Rice, BCM Kernel Feedback,
	dmaengine-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Device Tree,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-crypto-u79uwXL29TY76Z2rM5mHXA, linux-raid
In-Reply-To: <CAALAos-txDCs3QZ4HGBNicOD8t49NPT6E8RpjVMccMn1D-UTgQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, Feb 14, 2017 at 11:03 PM, Anup Patel <anup.patel-dY08KVG/lbpWk0Htik3J/w@public.gmane.org> wrote:
> On Wed, Feb 15, 2017 at 12:13 PM, Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>> On Tue, Feb 14, 2017 at 10:25 PM, Anup Patel <anup.patel-dY08KVG/lbpWk0Htik3J/w@public.gmane.org> wrote:
>>> On Tue, Feb 14, 2017 at 10:04 PM, Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>>>> On Mon, Feb 13, 2017 at 10:51 PM, Anup Patel <anup.patel-dY08KVG/lbpWk0Htik3J/w@public.gmane.org> wrote:
>>>>> The Broadcom stream buffer accelerator (SBA) provides offloading
>>>>> capabilities for RAID operations. This SBA offload engine is
>>>>> accessible via Broadcom SoC specific ring manager.
>>>>>
>>>>> This patch adds Broadcom SBA RAID driver which provides one
>>>>> DMA device with RAID capabilities using one or more Broadcom
>>>>> SoC specific ring manager channels. The SBA RAID driver in its
>>>>> current shape implements memcpy, xor, and pq operations.
>>>>>
>>>>> Signed-off-by: Anup Patel <anup.patel-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
>>>>> Reviewed-by: Ray Jui <ray.jui-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
>>>>> ---
>>>>>  drivers/dma/Kconfig        |   13 +
>>>>>  drivers/dma/Makefile       |    1 +
>>>>>  drivers/dma/bcm-sba-raid.c | 1694 ++++++++++++++++++++++++++++++++++++++++++++
>>>>>  3 files changed, 1708 insertions(+)
>>>>>  create mode 100644 drivers/dma/bcm-sba-raid.c
>>>>>
>>>>> diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
>>>>> index 263495d..bf8fb84 100644
>>>>> --- a/drivers/dma/Kconfig
>>>>> +++ b/drivers/dma/Kconfig
>>>>> @@ -99,6 +99,19 @@ config AXI_DMAC
>>>>>           controller is often used in Analog Device's reference designs for FPGA
>>>>>           platforms.
>>>>>
>>>>> +config BCM_SBA_RAID
>>>>> +       tristate "Broadcom SBA RAID engine support"
>>>>> +       depends on (ARM64 && MAILBOX && RAID6_PQ) || COMPILE_TEST
>>>>> +       select DMA_ENGINE
>>>>> +       select DMA_ENGINE_RAID
>>>>> +       select ASYNC_TX_ENABLE_CHANNEL_SWITCH
>>>>
>>>> I thought you agreed to drop this. Its usage is broken.
>>>
>>> If ASYNC_TX_ENABLE_CHANNEL_SWITCH is not selected
>>> then async_dma_find_channel() will only try to find channel
>>> with DMA_ASYNC_TX capability.
>>>
>>> The DMA_ASYNC_TX capability is set by
>>> dma_async_device_register() when all Async Tx
>>> capabilities are supported by a DMA devices namely
>>> DMA_INTERRUPT, DMA_MEMCPY, DMA_XOR,
>>> DMA_XOR_VAL, DMA_PQ, and DMA_PQ_VAL.
>>>
>>> We only support DMA_MEMCPY, DMA_XOR, and
>>> DMA_PQ capabilities in BCM-SBA-RAID driver so
>>> DMA_ASYNC_TX capability is never set for the
>>> DMA device registered by BCM-SBA-RAID driver.
>>>
>>> Due to above, if ASYNC_TX_ENABLE_CHANNEL_SWITCH
>>> is not selected then Async Tx APIs fail to find DMA
>>> channel provided by BCM-SBA-RAID hence the
>>> option ASYNC_TX_ENABLE_CHANNEL_SWITCH is
>>> required for BCM-SBA-RAID.
>>>
>>> The DMA mappings are violated by channel switching
>>> only if we switch form DMA channel A to DMA channel
>>> B and both these DMA channels have different underlying
>>> "struct device". In most of the cases DMA mappings
>>> are not violated because DMA channels having
>>> Async Tx capabilities are provided using same
>>> underlying "struct device".
>>
>> No, fix the infrastructure. Do not put local hack in your driver for
>> this global problem [1].
>
> There is no hack in the driver. We need
> ASYNC_TX_ENABLE_CHANNEL_SWITCH
> based on current state of dmaengine framework.
>
> The framework should be fixed as separate patchset.
>
> We have other RAID drivers such as xgene-dma and
> mv_xor_v2 who also require
> ASYNC_TX_ENABLE_CHANNEL_SWITCH due
> to same reason.
>
> Fixing the framework and improving framework is
> a ongoing process. I don't see why that should
> stop this patchset.
>

Because this driver is turning on a dangerous compile time option and
is not using the functionality. If this silicon IP block appears in
another product in the future paired with another DMA engine then the
assumptions about a safe/single dma-device is violated.

The realization of how async_tx was breaking DMA mapping api
assumptions came after some of these dma-drivers were added to the
kernel. We should stop making the problem worse.

I should have submitted a patch like the below at the time we
discovered this problem, but unfortunately it languished when I
stopped maintaining the iop-adma and ioat drivers.

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 263495d0adbd..6b30eb9ad125 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -35,6 +35,7 @@ comment "DMA Devices"

 #core
 config ASYNC_TX_ENABLE_CHANNEL_SWITCH
+       depends on BROKEN
        bool

 config ARCH_HAS_ASYNC_TX_FIND_CHANNEL
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH v4 3/4] dmaengine: Add Broadcom SBA RAID driver
From: Anup Patel @ 2017-02-15  8:33 UTC (permalink / raw)
  To: Dan Williams
  Cc: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar, Ray Jui, Scott Branden, Jon Mason,
	Rob Rice, BCM Kernel Feedback, dmaengine@vger.kernel.org,
	Device Tree, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-crypto, linux-raid
In-Reply-To: <CAPcyv4h83vgeB82KmEEFV196MsWdteRqypY5_tcSyaMM_QwYMA@mail.gmail.com>

On Wed, Feb 15, 2017 at 12:55 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Tue, Feb 14, 2017 at 11:03 PM, Anup Patel <anup.patel@broadcom.com> wrote:
>> On Wed, Feb 15, 2017 at 12:13 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>> On Tue, Feb 14, 2017 at 10:25 PM, Anup Patel <anup.patel@broadcom.com> wrote:
>>>> On Tue, Feb 14, 2017 at 10:04 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>>>> On Mon, Feb 13, 2017 at 10:51 PM, Anup Patel <anup.patel@broadcom.com> wrote:
>>>>>> The Broadcom stream buffer accelerator (SBA) provides offloading
>>>>>> capabilities for RAID operations. This SBA offload engine is
>>>>>> accessible via Broadcom SoC specific ring manager.
>>>>>>
>>>>>> This patch adds Broadcom SBA RAID driver which provides one
>>>>>> DMA device with RAID capabilities using one or more Broadcom
>>>>>> SoC specific ring manager channels. The SBA RAID driver in its
>>>>>> current shape implements memcpy, xor, and pq operations.
>>>>>>
>>>>>> Signed-off-by: Anup Patel <anup.patel@broadcom.com>
>>>>>> Reviewed-by: Ray Jui <ray.jui@broadcom.com>
>>>>>> ---
>>>>>>  drivers/dma/Kconfig        |   13 +
>>>>>>  drivers/dma/Makefile       |    1 +
>>>>>>  drivers/dma/bcm-sba-raid.c | 1694 ++++++++++++++++++++++++++++++++++++++++++++
>>>>>>  3 files changed, 1708 insertions(+)
>>>>>>  create mode 100644 drivers/dma/bcm-sba-raid.c
>>>>>>
>>>>>> diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
>>>>>> index 263495d..bf8fb84 100644
>>>>>> --- a/drivers/dma/Kconfig
>>>>>> +++ b/drivers/dma/Kconfig
>>>>>> @@ -99,6 +99,19 @@ config AXI_DMAC
>>>>>>           controller is often used in Analog Device's reference designs for FPGA
>>>>>>           platforms.
>>>>>>
>>>>>> +config BCM_SBA_RAID
>>>>>> +       tristate "Broadcom SBA RAID engine support"
>>>>>> +       depends on (ARM64 && MAILBOX && RAID6_PQ) || COMPILE_TEST
>>>>>> +       select DMA_ENGINE
>>>>>> +       select DMA_ENGINE_RAID
>>>>>> +       select ASYNC_TX_ENABLE_CHANNEL_SWITCH
>>>>>
>>>>> I thought you agreed to drop this. Its usage is broken.
>>>>
>>>> If ASYNC_TX_ENABLE_CHANNEL_SWITCH is not selected
>>>> then async_dma_find_channel() will only try to find channel
>>>> with DMA_ASYNC_TX capability.
>>>>
>>>> The DMA_ASYNC_TX capability is set by
>>>> dma_async_device_register() when all Async Tx
>>>> capabilities are supported by a DMA devices namely
>>>> DMA_INTERRUPT, DMA_MEMCPY, DMA_XOR,
>>>> DMA_XOR_VAL, DMA_PQ, and DMA_PQ_VAL.
>>>>
>>>> We only support DMA_MEMCPY, DMA_XOR, and
>>>> DMA_PQ capabilities in BCM-SBA-RAID driver so
>>>> DMA_ASYNC_TX capability is never set for the
>>>> DMA device registered by BCM-SBA-RAID driver.
>>>>
>>>> Due to above, if ASYNC_TX_ENABLE_CHANNEL_SWITCH
>>>> is not selected then Async Tx APIs fail to find DMA
>>>> channel provided by BCM-SBA-RAID hence the
>>>> option ASYNC_TX_ENABLE_CHANNEL_SWITCH is
>>>> required for BCM-SBA-RAID.
>>>>
>>>> The DMA mappings are violated by channel switching
>>>> only if we switch form DMA channel A to DMA channel
>>>> B and both these DMA channels have different underlying
>>>> "struct device". In most of the cases DMA mappings
>>>> are not violated because DMA channels having
>>>> Async Tx capabilities are provided using same
>>>> underlying "struct device".
>>>
>>> No, fix the infrastructure. Do not put local hack in your driver for
>>> this global problem [1].
>>
>> There is no hack in the driver. We need
>> ASYNC_TX_ENABLE_CHANNEL_SWITCH
>> based on current state of dmaengine framework.
>>
>> The framework should be fixed as separate patchset.
>>
>> We have other RAID drivers such as xgene-dma and
>> mv_xor_v2 who also require
>> ASYNC_TX_ENABLE_CHANNEL_SWITCH due
>> to same reason.
>>
>> Fixing the framework and improving framework is
>> a ongoing process. I don't see why that should
>> stop this patchset.
>>
>
> Because this driver is turning on a dangerous compile time option and
> is not using the functionality. If this silicon IP block appears in
> another product in the future paired with another DMA engine then the
> assumptions about a safe/single dma-device is violated.
>
> The realization of how async_tx was breaking DMA mapping api
> assumptions came after some of these dma-drivers were added to the
> kernel. We should stop making the problem worse.
>
> I should have submitted a patch like the below at the time we
> discovered this problem, but unfortunately it languished when I
> stopped maintaining the iop-adma and ioat drivers.
>
> diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
> index 263495d0adbd..6b30eb9ad125 100644
> --- a/drivers/dma/Kconfig
> +++ b/drivers/dma/Kconfig
> @@ -35,6 +35,7 @@ comment "DMA Devices"
>
>  #core
>  config ASYNC_TX_ENABLE_CHANNEL_SWITCH
> +       depends on BROKEN
>         bool
>
>  config ARCH_HAS_ASYNC_TX_FIND_CHANNEL

Instead of selecting
ASYNC_TX_ENABLE_CHANNEL_SWITCH,
we can select the following in BCM_SBA_RAID config
option:
1. ASYNC_TX_DISABLE_XOR_VAL
2. ASYNC_TX_DISABLE_PQ_VAL

This will satisfy the needs of
dma_async_device_register() when
ASYNC_TX_ENABLE_CHANNEL_SWITCH is
not selected.

Will this be acceptable ??

Regards,
Anup

^ permalink raw reply

* [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: colyli @ 2017-02-15 16:35 UTC (permalink / raw)
  To: linux-raid
  Cc: Coly Li, Shaohua Li, Neil Brown, Johannes Thumshirn,
	Guoqing Jiang

'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
introduces a sliding resync window for raid1 I/O barrier, this idea limits
I/O barriers to happen only inside a slidingresync window, for regular
I/Os out of this resync window they don't need to wait for barrier any
more. On large raid1 device, it helps a lot to improve parallel writing
I/O throughput when there are background resync I/Os performing at
same time.

The idea of sliding resync widow is awesome, but code complexity is a
challenge. Sliding resync window requires several veriables to work
collectively, this is complexed and very hard to make it work correctly.
Just grep "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches
to fix the original resync window patch. This is not the end, any further
related modification may easily introduce more regreassion.

Therefore I decide to implement a much simpler raid1 I/O barrier, by
removing resync window code, I believe life will be much easier.

The brief idea of the simpler barrier is,
 - Do not maintain a logbal unique resync window
 - Use multiple hash buckets to reduce I/O barrier conflictions, regular
   I/O only has to wait for a resync I/O when both them have same barrier
   bucket index, vice versa.
 - I/O barrier can be recuded to an acceptable number if there are enought
   barrier buckets

Here I explain how the barrier buckets are designed,
 - BARRIER_UNIT_SECTOR_SIZE
   The whole LBA address space of a raid1 device is divided into multiple
   barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
   Bio request won't go across border of barrier unit size, that means
   maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
   For random I/O 64MB is large enough for both read and write requests,
   for sequential I/O considering underlying block layer may merge them
   into larger requests, 64MB is still good enough.
   Neil also points out that for resync operation, "we want the resync to
   move from region to region fairly quickly so that the slowness caused
   by having to synchronize with the resync is averaged out over a fairly
   small time frame". For full speed resync, 64MB should take less then 1
   second. When resync is competing with other I/O, it could take up a few
   minutes. Therefore 64MB size is fairly good range for resync.

 - BARRIER_BUCKETS_NR
   There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
        #define BARRIER_BUCKETS_NR_BITS   (PAGE_SHIFT - 2)
        #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
   this patch makes the bellowed members of struct r1conf from integer
   to array of integers,
        -       int                     nr_pending;
        -       int                     nr_waiting;
        -       int                     nr_queued;
        -       int                     barrier;
        +       int                     *nr_pending;
        +       int                     *nr_waiting;
        +       int                     *nr_queued;
        +       int                     *barrier;
   number of the array elements is defined as BARRIER_BUCKETS_NR. For 4KB
   kernel space page size, (PAGE_SHIFT - 2) indecates there are 1024 I/O
   barrier buckets, and each array of integers occupies single memory page.
   1024 means for a request which is smaller than the I/O barrier unit size
   has ~0.1% chance to wait for resync to pause, which is quite a small
   enough fraction. Also requesting single memory page is more friendly to
   kernel page allocator than larger memory size.

 - I/O barrier bucket is indexed by bio start sector
   If multiple I/O requests hit different I/O barrier units, they only need
   to compete I/O barrier with other I/Os which hit the same I/O barrier
   bucket index with each other. The index of a barrier bucket which a
   bio should look for is calculated by sector_to_idx() which is defined
   in raid1.h as an inline function,
        static inline int sector_to_idx(sector_t sector)
        {
                return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
                                BARRIER_BUCKETS_NR_BITS);
        }
   Here sector_nr is the start sector number of a bio.

 - Single bio won't go across boundary of a I/O barrier unit
   If a request goes across boundary of barrier unit, it will be split. A
   bio may be split in raid1_make_request() or raid1_sync_request(), if
   sectors returned by align_to_barrier_unit_end() is small than original
   bio size.

Comparing to single sliding resync window,
 - Currently resync I/O grows linearly, therefore regular and resync I/O
   will have confliction within a single barrier units. So the I/O
   behavior is similar to single sliding resync window.
 - But a barrier unit bucket is shared by all barrier units with identical
   barrier uinit index, the probability of confliction might be higher
   than single sliding resync window, in condition that writing I/Os
   always hit barrier units which have identical barrier bucket indexs with
   the resync I/Os. This is a very rare condition in real I/O work loads,
   I cannot imagine how it could happen in practice.
 - Therefore we can achieve a good enough low confliction rate with much
   simpler barrier algorithm and implementation.

There are two changes should be noticed,
 - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
   single loop, it looks like this,
        spin_lock_irqsave(&conf->device_lock, flags);
        conf->nr_queued[idx]--;
        spin_unlock_irqrestore(&conf->device_lock, flags);
   This change generates more spin lock operations, but in next patch of
   this patch set, it will be replaced by a single line code,
        atomic_dec(&conf->nr_queueud[idx]);
   So we don't need to worry about spin lock cost here.
 - Mainline raid1 code split original raid1_make_request() into
   raid1_read_request() and raid1_write_request(). If the original bio
   goes across an I/O barrier unit size, this bio will be split before
   calling raid1_read_request() or raid1_write_request(),  this change
   the code logic more simple and clear.
 - In this patch wait_barrier() is moved from raid1_make_request() to
   raid1_write_request(). In raid_read_request(), original wait_barrier()
   is replaced by raid1_read_request().
   The differnece is wait_read_barrier() only waits if array is frozen,
   using different barrier function in different code path makes the code
   more clean and easy to read.
Changelog
V3:
- Rebase the patch against latest upstream kernel code.
- Many fixes by review comments from Neil,
  - Back to use pointers to replace arraries in struct r1conf
  - Remove total_barriers from struct r1conf
  - Add more patch comments to explain how/why the values of
    BARRIER_UNIT_SECTOR_SIZE and BARRIER_BUCKETS_NR are decided.
  - Use get_unqueued_pending() to replace get_all_pendings() and
    get_all_queued()
  - Increase bucket number from 512 to 1024
- Change code comments format by review from Shaohua.
V2:
- Use bio_split() to split the orignal bio if it goes across barrier unit
  bounday, to make the code more simple, by suggestion from Shaohua and
  Neil.
- Use hash_long() to replace original linear hash, to avoid a possible
  confilict between resync I/O and sequential write I/O, by suggestion from
  Shaohua.
- Add conf->total_barriers to record barrier depth, which is used to
  control number of parallel sync I/O barriers, by suggestion from Shaohua.
- In V1 patch the bellowed barrier buckets related members in r1conf are
  allocated in memory page. To make the code more simple, V2 patch moves
  the memory space into struct r1conf, like this,
        -       int                     nr_pending;
        -       int                     nr_waiting;
        -       int                     nr_queued;
        -       int                     barrier;
        +       int                     nr_pending[BARRIER_BUCKETS_NR];
        +       int                     nr_waiting[BARRIER_BUCKETS_NR];
        +       int                     nr_queued[BARRIER_BUCKETS_NR];
        +       int                     barrier[BARRIER_BUCKETS_NR];
  This change is by the suggestion from Shaohua.
- Remove some inrelavent code comments, by suggestion from Guoqing.
- Add a missing wait_barrier() before jumping to retry_write, in
  raid1_make_write_request().
V1:
- Original RFC patch for comments

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/raid1.c | 447 ++++++++++++++++++++++++++++++-----------------------
 drivers/md/raid1.h |  42 +++--
 2 files changed, 275 insertions(+), 214 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7b0f647..4234494 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -71,9 +71,8 @@
  */
 static int max_queued_requests = 1024;
 
-static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
-			  sector_t bi_sector);
-static void lower_barrier(struct r1conf *conf);
+static void allow_barrier(struct r1conf *conf, sector_t sector_nr);
+static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
 
 #define raid1_log(md, fmt, args...)				\
 	do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0)
@@ -100,7 +99,6 @@ static void r1bio_pool_free(void *r1_bio, void *data)
 #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9)
 #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW)
 #define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9)
-#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
 
 static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
 {
@@ -215,7 +213,7 @@ static void put_buf(struct r1bio *r1_bio)
 
 	mempool_free(r1_bio, conf->r1buf_pool);
 
-	lower_barrier(conf);
+	lower_barrier(conf, r1_bio->sector);
 }
 
 static void reschedule_retry(struct r1bio *r1_bio)
@@ -223,10 +221,12 @@ static void reschedule_retry(struct r1bio *r1_bio)
 	unsigned long flags;
 	struct mddev *mddev = r1_bio->mddev;
 	struct r1conf *conf = mddev->private;
+	int idx;
 
+	idx = sector_to_idx(r1_bio->sector);
 	spin_lock_irqsave(&conf->device_lock, flags);
 	list_add(&r1_bio->retry_list, &conf->retry_list);
-	conf->nr_queued ++;
+	conf->nr_queued[idx]++;
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 
 	wake_up(&conf->wait_barrier);
@@ -243,7 +243,6 @@ static void call_bio_endio(struct r1bio *r1_bio)
 	struct bio *bio = r1_bio->master_bio;
 	int done;
 	struct r1conf *conf = r1_bio->mddev->private;
-	sector_t start_next_window = r1_bio->start_next_window;
 	sector_t bi_sector = bio->bi_iter.bi_sector;
 
 	if (bio->bi_phys_segments) {
@@ -269,7 +268,7 @@ static void call_bio_endio(struct r1bio *r1_bio)
 		 * Wake up any possible resync thread that waits for the device
 		 * to go idle.
 		 */
-		allow_barrier(conf, start_next_window, bi_sector);
+		allow_barrier(conf, bi_sector);
 	}
 }
 
@@ -517,6 +516,25 @@ static void raid1_end_write_request(struct bio *bio)
 		bio_put(to_put);
 }
 
+static sector_t align_to_barrier_unit_end(sector_t start_sector,
+					  sector_t sectors)
+{
+	sector_t len;
+
+	WARN_ON(sectors == 0);
+	/*
+	 * len is the number of sectors from start_sector to end of the
+	 * barrier unit which start_sector belongs to.
+	 */
+	len = round_up(start_sector + 1, BARRIER_UNIT_SECTOR_SIZE) -
+	      start_sector;
+
+	if (len > sectors)
+		len = sectors;
+
+	return len;
+}
+
 /*
  * This routine returns the disk from which the requested read should
  * be done. There is a per-array 'next expected sequential IO' sector
@@ -813,168 +831,168 @@ static void flush_pending_writes(struct r1conf *conf)
  */
 static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
 {
+	int idx = sector_to_idx(sector_nr);
+
 	spin_lock_irq(&conf->resync_lock);
 
 	/* Wait until no block IO is waiting */
-	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
+	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
 			    conf->resync_lock);
 
 	/* block any new IO from starting */
-	conf->barrier++;
-	conf->next_resync = sector_nr;
+	conf->barrier[idx]++;
 
 	/* For these conditions we must wait:
 	 * A: while the array is in frozen state
-	 * B: while barrier >= RESYNC_DEPTH, meaning resync reach
-	 *    the max count which allowed.
-	 * C: next_resync + RESYNC_SECTORS > start_next_window, meaning
-	 *    next resync will reach to the window which normal bios are
-	 *    handling.
-	 * D: while there are any active requests in the current window.
+	 * B: while conf->nr_pending[idx] is not 0, meaning regular I/O
+	 *    existing in corresponding I/O barrier bucket.
+	 * C: while conf->barrier[idx] >= RESYNC_DEPTH, meaning reaches
+	 *    max resync count which allowed on current I/O barrier bucket.
 	 */
 	wait_event_lock_irq(conf->wait_barrier,
 			    !conf->array_frozen &&
-			    conf->barrier < RESYNC_DEPTH &&
-			    conf->current_window_requests == 0 &&
-			    (conf->start_next_window >=
-			     conf->next_resync + RESYNC_SECTORS),
+			     !conf->nr_pending[idx] &&
+			     conf->barrier[idx] < RESYNC_DEPTH,
 			    conf->resync_lock);
 
-	conf->nr_pending++;
+	conf->nr_pending[idx]++;
 	spin_unlock_irq(&conf->resync_lock);
 }
 
-static void lower_barrier(struct r1conf *conf)
+static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
 {
 	unsigned long flags;
-	BUG_ON(conf->barrier <= 0);
+	int idx = sector_to_idx(sector_nr);
+
+	BUG_ON(conf->barrier[idx] <= 0);
+
 	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->barrier--;
-	conf->nr_pending--;
+	conf->barrier[idx]--;
+	conf->nr_pending[idx]--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
 }
 
-static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
+static void _wait_barrier(struct r1conf *conf, int idx)
 {
-	bool wait = false;
-
-	if (conf->array_frozen || !bio)
-		wait = true;
-	else if (conf->barrier && bio_data_dir(bio) == WRITE) {
-		if ((conf->mddev->curr_resync_completed
-		     >= bio_end_sector(bio)) ||
-		    (conf->start_next_window + NEXT_NORMALIO_DISTANCE
-		     <= bio->bi_iter.bi_sector))
-			wait = false;
-		else
-			wait = true;
+	spin_lock_irq(&conf->resync_lock);
+	if (conf->array_frozen || conf->barrier[idx]) {
+		conf->nr_waiting[idx]++;
+		/* Wait for the barrier to drop. */
+		wait_event_lock_irq(
+			conf->wait_barrier,
+			!conf->array_frozen && !conf->barrier[idx],
+			conf->resync_lock);
+		conf->nr_waiting[idx]--;
 	}
 
-	return wait;
+	conf->nr_pending[idx]++;
+	spin_unlock_irq(&conf->resync_lock);
 }
 
-static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
+static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
 {
-	sector_t sector = 0;
+	int idx = sector_to_idx(sector_nr);
 
 	spin_lock_irq(&conf->resync_lock);
-	if (need_to_wait_for_sync(conf, bio)) {
-		conf->nr_waiting++;
-		/* Wait for the barrier to drop.
-		 * However if there are already pending
-		 * requests (preventing the barrier from
-		 * rising completely), and the
-		 * per-process bio queue isn't empty,
-		 * then don't wait, as we need to empty
-		 * that queue to allow conf->start_next_window
-		 * to increase.
-		 */
-		raid1_log(conf->mddev, "wait barrier");
-		wait_event_lock_irq(conf->wait_barrier,
-				    !conf->array_frozen &&
-				    (!conf->barrier ||
-				     ((conf->start_next_window <
-				       conf->next_resync + RESYNC_SECTORS) &&
-				      current->bio_list &&
-				      !bio_list_empty(current->bio_list))),
-				    conf->resync_lock);
-		conf->nr_waiting--;
-	}
-
-	if (bio && bio_data_dir(bio) == WRITE) {
-		if (bio->bi_iter.bi_sector >= conf->next_resync) {
-			if (conf->start_next_window == MaxSector)
-				conf->start_next_window =
-					conf->next_resync +
-					NEXT_NORMALIO_DISTANCE;
-
-			if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE)
-			    <= bio->bi_iter.bi_sector)
-				conf->next_window_requests++;
-			else
-				conf->current_window_requests++;
-			sector = conf->start_next_window;
-		}
+	if (conf->array_frozen) {
+		conf->nr_waiting[idx]++;
+		/* Wait for array to unfreeze */
+		wait_event_lock_irq(
+			conf->wait_barrier,
+			!conf->array_frozen,
+			conf->resync_lock);
+		conf->nr_waiting[idx]--;
 	}
 
-	conf->nr_pending++;
+	conf->nr_pending[idx]++;
 	spin_unlock_irq(&conf->resync_lock);
-	return sector;
 }
 
-static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
-			  sector_t bi_sector)
+static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
+{
+	int idx = sector_to_idx(sector_nr);
+
+	_wait_barrier(conf, idx);
+}
+
+static void wait_all_barriers(struct r1conf *conf)
+{
+	int idx;
+
+	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		_wait_barrier(conf, idx);
+}
+
+static void _allow_barrier(struct r1conf *conf, int idx)
 {
 	unsigned long flags;
 
 	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->nr_pending--;
-	if (start_next_window) {
-		if (start_next_window == conf->start_next_window) {
-			if (conf->start_next_window + NEXT_NORMALIO_DISTANCE
-			    <= bi_sector)
-				conf->next_window_requests--;
-			else
-				conf->current_window_requests--;
-		} else
-			conf->current_window_requests--;
-
-		if (!conf->current_window_requests) {
-			if (conf->next_window_requests) {
-				conf->current_window_requests =
-					conf->next_window_requests;
-				conf->next_window_requests = 0;
-				conf->start_next_window +=
-					NEXT_NORMALIO_DISTANCE;
-			} else
-				conf->start_next_window = MaxSector;
-		}
-	}
+	conf->nr_pending[idx]--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
 }
 
+static void allow_barrier(struct r1conf *conf, sector_t sector_nr)
+{
+	int idx = sector_to_idx(sector_nr);
+
+	_allow_barrier(conf, idx);
+}
+
+static void allow_all_barriers(struct r1conf *conf)
+{
+	int idx;
+
+	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		_allow_barrier(conf, idx);
+}
+
+/* conf->resync_lock should be held */
+static int get_unqueued_pending(struct r1conf *conf)
+{
+	int idx, ret;
+
+	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		ret += conf->nr_pending[idx] - conf->nr_queued[idx];
+
+	return ret;
+}
+
 static void freeze_array(struct r1conf *conf, int extra)
 {
-	/* stop syncio and normal IO and wait for everything to
+	/* Stop sync I/O and normal I/O and wait for everything to
 	 * go quite.
-	 * We wait until nr_pending match nr_queued+extra
-	 * This is called in the context of one normal IO request
-	 * that has failed. Thus any sync request that might be pending
-	 * will be blocked by nr_pending, and we need to wait for
-	 * pending IO requests to complete or be queued for re-try.
-	 * Thus the number queued (nr_queued) plus this request (extra)
-	 * must match the number of pending IOs (nr_pending) before
-	 * we continue.
+	 * This is called in two situations:
+	 * 1) management command handlers (reshape, remove disk, quiesce).
+	 * 2) one normal I/O request failed.
+
+	 * After array_frozen is set to 1, new sync IO will be blocked at
+	 * raise_barrier(), and new normal I/O will blocked at _wait_barrier()
+	 * or wait_read_barrier(). The flying I/Os will either complete or be
+	 * queued. When everything goes quite, there are only queued I/Os left.
+
+	 * Every flying I/O contributes to a conf->nr_pending[idx], idx is the
+	 * barrier bucket index which this I/O request hits. When all sync and
+	 * normal I/O are queued, sum of all conf->nr_pending[] will match sum
+	 * of all conf->nr_queued[]. But normal I/O failure is an exception,
+	 * in handle_read_error(), we may call freeze_array() before trying to
+	 * fix the read error. In this case, the error read I/O is not queued,
+	 * so get_unqueued_pending() == 1.
+	 *
+	 * Therefore before this function returns, we need to wait until
+	 * get_unqueued_pendings(conf) gets equal to extra. For
+	 * normal I/O context, extra is 1, in rested situations extra is 0.
 	 */
 	spin_lock_irq(&conf->resync_lock);
 	conf->array_frozen = 1;
 	raid1_log(conf->mddev, "wait freeze");
-	wait_event_lock_irq_cmd(conf->wait_barrier,
-				conf->nr_pending == conf->nr_queued+extra,
-				conf->resync_lock,
-				flush_pending_writes(conf));
+	wait_event_lock_irq_cmd(
+		conf->wait_barrier,
+		get_unqueued_pending(conf) == extra,
+		conf->resync_lock,
+		flush_pending_writes(conf));
 	spin_unlock_irq(&conf->resync_lock);
 }
 static void unfreeze_array(struct r1conf *conf)
@@ -1070,11 +1088,11 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	kfree(plug);
 }
 
-static void raid1_read_request(struct mddev *mddev, struct bio *bio,
-				 struct r1bio *r1_bio)
+static void raid1_read_request(struct mddev *mddev, struct bio *bio)
 {
 	struct r1conf *conf = mddev->private;
 	struct raid1_info *mirror;
+	struct r1bio *r1_bio;
 	struct bio *read_bio;
 	struct bitmap *bitmap = mddev->bitmap;
 	const int op = bio_op(bio);
@@ -1083,7 +1101,34 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 	int max_sectors;
 	int rdisk;
 
-	wait_barrier(conf, bio);
+	/*
+	 * Still need barrier for READ in case that whole
+	 * array is frozen.
+	 */
+	wait_read_barrier(conf, bio->bi_iter.bi_sector);
+	bitmap = mddev->bitmap;
+
+	/*
+	 * make_request() can abort the operation when read-ahead is being
+	 * used and no empty request is available.
+	 *
+	 */
+	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
+	r1_bio->master_bio = bio;
+	r1_bio->sectors = bio_sectors(bio);
+	r1_bio->state = 0;
+	r1_bio->mddev = mddev;
+	r1_bio->sector = bio->bi_iter.bi_sector;
+	/*
+	 * We might need to issue multiple reads to different
+	 * devices if there are bad blocks around, so we keep
+	 * track of the number of reads in bio->bi_phys_segments.
+	 * If this is 0, there is only one r1_bio and no locking
+	 * will be needed when requests complete.  If it is
+	 * non-zero, then it is the number of not-completed requests.
+	 */
+	bio->bi_phys_segments = 0;
+	bio_clear_flag(bio, BIO_SEG_VALID);
 
 read_again:
 	rdisk = read_balance(conf, r1_bio, &max_sectors);
@@ -1106,7 +1151,6 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 			   atomic_read(&bitmap->behind_writes) == 0);
 	}
 	r1_bio->read_disk = rdisk;
-	r1_bio->start_next_window = 0;
 
 	read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
 	bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector,
@@ -1163,10 +1207,10 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 		generic_make_request(read_bio);
 }
 
-static void raid1_write_request(struct mddev *mddev, struct bio *bio,
-				struct r1bio *r1_bio)
+static void raid1_write_request(struct mddev *mddev, struct bio *bio)
 {
 	struct r1conf *conf = mddev->private;
+	struct r1bio *r1_bio;
 	int i, disks;
 	struct bitmap *bitmap = mddev->bitmap;
 	unsigned long flags;
@@ -1180,7 +1224,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	int first_clone;
 	int sectors_handled;
 	int max_sectors;
-	sector_t start_next_window;
 
 	/*
 	 * Register the new request and wait if the reconstruction
@@ -1216,7 +1259,29 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 		}
 		finish_wait(&conf->wait_barrier, &w);
 	}
-	start_next_window = wait_barrier(conf, bio);
+	wait_barrier(conf, bio->bi_iter.bi_sector);
+
+	/*
+	 * make_request() can abort the operation when read-ahead is being
+	 * used and no empty request is available.
+	 *
+	 */
+	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
+	r1_bio->master_bio = bio;
+	r1_bio->sectors = bio_sectors(bio);
+	r1_bio->state = 0;
+	r1_bio->mddev = mddev;
+	r1_bio->sector = bio->bi_iter.bi_sector;
+	/*
+	 * We might need to issue multiple writes to different
+	 * devices if there are bad blocks around, so we keep
+	 * track of the number of writes in bio->bi_phys_segments.
+	 * If this is 0, there is only one r1_bio and no locking
+	 * will be needed when requests complete.  If it is
+	 * non-zero, then it is the number of not-completed requests.
+	 */
+	bio->bi_phys_segments = 0;
+	bio_clear_flag(bio, BIO_SEG_VALID);
 
 	if (conf->pending_count >= max_queued_requests) {
 		md_wakeup_thread(mddev->thread);
@@ -1237,7 +1302,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 
 	disks = conf->raid_disks * 2;
  retry_write:
-	r1_bio->start_next_window = start_next_window;
 	blocked_rdev = NULL;
 	rcu_read_lock();
 	max_sectors = r1_bio->sectors;
@@ -1304,25 +1368,15 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	if (unlikely(blocked_rdev)) {
 		/* Wait for this device to become unblocked */
 		int j;
-		sector_t old = start_next_window;
 
 		for (j = 0; j < i; j++)
 			if (r1_bio->bios[j])
 				rdev_dec_pending(conf->mirrors[j].rdev, mddev);
 		r1_bio->state = 0;
-		allow_barrier(conf, start_next_window, bio->bi_iter.bi_sector);
+		allow_barrier(conf, bio->bi_iter.bi_sector);
 		raid1_log(mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
 		md_wait_for_blocked_rdev(blocked_rdev, mddev);
-		start_next_window = wait_barrier(conf, bio);
-		/*
-		 * We must make sure the multi r1bios of bio have
-		 * the same value of bi_phys_segments
-		 */
-		if (bio->bi_phys_segments && old &&
-		    old != start_next_window)
-			/* Wait for the former r1bio(s) to complete */
-			wait_event(conf->wait_barrier,
-				   bio->bi_phys_segments == 1);
+		wait_barrier(conf, bio->bi_iter.bi_sector);
 		goto retry_write;
 	}
 
@@ -1447,36 +1501,26 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 
 static void raid1_make_request(struct mddev *mddev, struct bio *bio)
 {
-	struct r1conf *conf = mddev->private;
-	struct r1bio *r1_bio;
+	void (*make_request_fn)(struct mddev *mddev, struct bio *bio);
+	struct bio *split;
+	sector_t sectors;
 
-	/*
-	 * make_request() can abort the operation when read-ahead is being
-	 * used and no empty request is available.
-	 *
-	 */
-	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
+	make_request_fn = (bio_data_dir(bio) == READ) ?
+			  raid1_read_request : raid1_write_request;
 
-	r1_bio->master_bio = bio;
-	r1_bio->sectors = bio_sectors(bio);
-	r1_bio->state = 0;
-	r1_bio->mddev = mddev;
-	r1_bio->sector = bio->bi_iter.bi_sector;
-
-	/*
-	 * We might need to issue multiple reads to different devices if there
-	 * are bad blocks around, so we keep track of the number of reads in
-	 * bio->bi_phys_segments.  If this is 0, there is only one r1_bio and
-	 * no locking will be needed when requests complete.  If it is
-	 * non-zero, then it is the number of not-completed requests.
-	 */
-	bio->bi_phys_segments = 0;
-	bio_clear_flag(bio, BIO_SEG_VALID);
+	/* if bio exceeds barrier unit boundary, split it */
+	do {
+		sectors = align_to_barrier_unit_end(
+				bio->bi_iter.bi_sector, bio_sectors(bio));
+		if (sectors < bio_sectors(bio)) {
+			split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
+			bio_chain(split, bio);
+		} else {
+			split = bio;
+		}
 
-	if (bio_data_dir(bio) == READ)
-		raid1_read_request(mddev, bio, r1_bio);
-	else
-		raid1_write_request(mddev, bio, r1_bio);
+		make_request_fn(mddev, split);
+	} while (split != bio);
 }
 
 static void raid1_status(struct seq_file *seq, struct mddev *mddev)
@@ -1567,19 +1611,11 @@ static void print_conf(struct r1conf *conf)
 
 static void close_sync(struct r1conf *conf)
 {
-	wait_barrier(conf, NULL);
-	allow_barrier(conf, 0, 0);
+	wait_all_barriers(conf);
+	allow_all_barriers(conf);
 
 	mempool_destroy(conf->r1buf_pool);
 	conf->r1buf_pool = NULL;
-
-	spin_lock_irq(&conf->resync_lock);
-	conf->next_resync = MaxSector - 2 * NEXT_NORMALIO_DISTANCE;
-	conf->start_next_window = MaxSector;
-	conf->current_window_requests +=
-		conf->next_window_requests;
-	conf->next_window_requests = 0;
-	spin_unlock_irq(&conf->resync_lock);
 }
 
 static int raid1_spare_active(struct mddev *mddev)
@@ -2326,8 +2362,9 @@ static void handle_sync_write_finished(struct r1conf *conf, struct r1bio *r1_bio
 
 static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
 {
-	int m;
+	int m, idx;
 	bool fail = false;
+
 	for (m = 0; m < conf->raid_disks * 2 ; m++)
 		if (r1_bio->bios[m] == IO_MADE_GOOD) {
 			struct md_rdev *rdev = conf->mirrors[m].rdev;
@@ -2353,7 +2390,8 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
 	if (fail) {
 		spin_lock_irq(&conf->device_lock);
 		list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
-		conf->nr_queued++;
+		idx = sector_to_idx(r1_bio->sector);
+		conf->nr_queued[idx]++;
 		spin_unlock_irq(&conf->device_lock);
 		md_wakeup_thread(conf->mddev->thread);
 	} else {
@@ -2475,6 +2513,7 @@ static void raid1d(struct md_thread *thread)
 	struct r1conf *conf = mddev->private;
 	struct list_head *head = &conf->retry_list;
 	struct blk_plug plug;
+	int idx;
 
 	md_check_recovery(mddev);
 
@@ -2482,17 +2521,17 @@ static void raid1d(struct md_thread *thread)
 	    !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
 		LIST_HEAD(tmp);
 		spin_lock_irqsave(&conf->device_lock, flags);
-		if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
-			while (!list_empty(&conf->bio_end_io_list)) {
-				list_move(conf->bio_end_io_list.prev, &tmp);
-				conf->nr_queued--;
-			}
-		}
+		if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
+			list_splice_init(&conf->bio_end_io_list, &tmp);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 		while (!list_empty(&tmp)) {
 			r1_bio = list_first_entry(&tmp, struct r1bio,
 						  retry_list);
 			list_del(&r1_bio->retry_list);
+			idx = sector_to_idx(r1_bio->sector);
+			spin_lock_irqsave(&conf->device_lock, flags);
+			conf->nr_queued[idx]--;
+			spin_unlock_irqrestore(&conf->device_lock, flags);
 			if (mddev->degraded)
 				set_bit(R1BIO_Degraded, &r1_bio->state);
 			if (test_bit(R1BIO_WriteError, &r1_bio->state))
@@ -2513,7 +2552,8 @@ static void raid1d(struct md_thread *thread)
 		}
 		r1_bio = list_entry(head->prev, struct r1bio, retry_list);
 		list_del(head->prev);
-		conf->nr_queued--;
+		idx = sector_to_idx(r1_bio->sector);
+		conf->nr_queued[idx]--;
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 
 		mddev = r1_bio->mddev;
@@ -2552,7 +2592,6 @@ static int init_resync(struct r1conf *conf)
 					  conf->poolinfo);
 	if (!conf->r1buf_pool)
 		return -ENOMEM;
-	conf->next_resync = 0;
 	return 0;
 }
 
@@ -2581,6 +2620,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 	int still_degraded = 0;
 	int good_sectors = RESYNC_SECTORS;
 	int min_bad = 0; /* number of sectors that are bad in all devices */
+	int idx = sector_to_idx(sector_nr);
 
 	if (!conf->r1buf_pool)
 		if (init_resync(conf))
@@ -2630,7 +2670,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 	 * If there is non-resync activity waiting for a turn, then let it
 	 * though before starting on this new sync request.
 	 */
-	if (conf->nr_waiting)
+	if (conf->nr_waiting[idx])
 		schedule_timeout_uninterruptible(1);
 
 	/* we are incrementing sector_nr below. To be safe, we check against
@@ -2657,6 +2697,8 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 	r1_bio->sector = sector_nr;
 	r1_bio->state = 0;
 	set_bit(R1BIO_IsSync, &r1_bio->state);
+	/* make sure good_sectors won't go across barrier unit boundary */
+	good_sectors = align_to_barrier_unit_end(sector_nr, good_sectors);
 
 	for (i = 0; i < conf->raid_disks * 2; i++) {
 		struct md_rdev *rdev;
@@ -2887,6 +2929,26 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 	if (!conf)
 		goto abort;
 
+	conf->nr_pending = kcalloc(BARRIER_BUCKETS_NR,
+				   sizeof(int), GFP_KERNEL);
+	if (!conf->nr_pending)
+		goto abort;
+
+	conf->nr_waiting = kcalloc(BARRIER_BUCKETS_NR,
+				   sizeof(int), GFP_KERNEL);
+	if (!conf->nr_waiting)
+		goto abort;
+
+	conf->nr_queued = kcalloc(BARRIER_BUCKETS_NR,
+				  sizeof(int), GFP_KERNEL);
+	if (!conf->nr_queued)
+		goto abort;
+
+	conf->barrier = kcalloc(BARRIER_BUCKETS_NR,
+				sizeof(int), GFP_KERNEL);
+	if (!conf->barrier)
+		goto abort;
+
 	conf->mirrors = kzalloc(sizeof(struct raid1_info)
 				* mddev->raid_disks * 2,
 				 GFP_KERNEL);
@@ -2942,9 +3004,6 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 	conf->pending_count = 0;
 	conf->recovery_disabled = mddev->recovery_disabled - 1;
 
-	conf->start_next_window = MaxSector;
-	conf->current_window_requests = conf->next_window_requests = 0;
-
 	err = -EIO;
 	for (i = 0; i < conf->raid_disks * 2; i++) {
 
@@ -2987,6 +3046,10 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 		kfree(conf->mirrors);
 		safe_put_page(conf->tmppage);
 		kfree(conf->poolinfo);
+		kfree(conf->nr_pending);
+		kfree(conf->nr_waiting);
+		kfree(conf->nr_queued);
+		kfree(conf->barrier);
 		kfree(conf);
 	}
 	return ERR_PTR(err);
@@ -3088,6 +3151,10 @@ static void raid1_free(struct mddev *mddev, void *priv)
 	kfree(conf->mirrors);
 	safe_put_page(conf->tmppage);
 	kfree(conf->poolinfo);
+	kfree(conf->nr_pending);
+	kfree(conf->nr_waiting);
+	kfree(conf->nr_queued);
+	kfree(conf->barrier);
 	kfree(conf);
 }
 
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index c52ef42..d3faf30 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -1,6 +1,14 @@
 #ifndef _RAID1_H
 #define _RAID1_H
 
+/* each barrier unit size is 64MB fow now
+ * note: it must be larger than RESYNC_DEPTH
+ */
+#define BARRIER_UNIT_SECTOR_BITS	17
+#define BARRIER_UNIT_SECTOR_SIZE	(1<<17)
+#define BARRIER_BUCKETS_NR_BITS		(PAGE_SHIFT - 2)
+#define BARRIER_BUCKETS_NR		(1<<BARRIER_BUCKETS_NR_BITS)
+
 struct raid1_info {
 	struct md_rdev	*rdev;
 	sector_t	head_position;
@@ -35,25 +43,6 @@ struct r1conf {
 						 */
 	int			raid_disks;
 
-	/* During resync, read_balancing is only allowed on the part
-	 * of the array that has been resynced.  'next_resync' tells us
-	 * where that is.
-	 */
-	sector_t		next_resync;
-
-	/* When raid1 starts resync, we divide array into four partitions
-	 * |---------|--------------|---------------------|-------------|
-	 *        next_resync   start_next_window       end_window
-	 * start_next_window = next_resync + NEXT_NORMALIO_DISTANCE
-	 * end_window = start_next_window + NEXT_NORMALIO_DISTANCE
-	 * current_window_requests means the count of normalIO between
-	 *   start_next_window and end_window.
-	 * next_window_requests means the count of normalIO after end_window.
-	 * */
-	sector_t		start_next_window;
-	int			current_window_requests;
-	int			next_window_requests;
-
 	spinlock_t		device_lock;
 
 	/* list of 'struct r1bio' that need to be processed by raid1d,
@@ -79,10 +68,10 @@ struct r1conf {
 	 */
 	wait_queue_head_t	wait_barrier;
 	spinlock_t		resync_lock;
-	int			nr_pending;
-	int			nr_waiting;
-	int			nr_queued;
-	int			barrier;
+	int			*nr_pending;
+	int			*nr_waiting;
+	int			*nr_queued;
+	int			*barrier;
 	int			array_frozen;
 
 	/* Set to 1 if a full sync is needed, (fresh device added).
@@ -135,7 +124,6 @@ struct r1bio {
 						 * in this BehindIO request
 						 */
 	sector_t		sector;
-	sector_t		start_next_window;
 	int			sectors;
 	unsigned long		state;
 	struct mddev		*mddev;
@@ -185,4 +173,10 @@ enum r1bio_state {
 	R1BIO_WriteError,
 	R1BIO_FailFast,
 };
+
+static inline int sector_to_idx(sector_t sector)
+{
+	return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
+			 BARRIER_BUCKETS_NR_BITS);
+}
 #endif
-- 
2.6.6


^ permalink raw reply related

* [PATCH V3 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
From: colyli @ 2017-02-15 16:35 UTC (permalink / raw)
  To: linux-raid
  Cc: Coly Li, Shaohua Li, Hannes Reinecke, Neil Brown,
	Johannes Thumshirn, Guoqing Jiang
In-Reply-To: <1487176523-109075-1-git-send-email-colyli@suse.de>

When I run a parallel reading performan testing on a md raid1 device with
two NVMe SSDs, I observe very bad throughput in supprise: by fio with 64KB
block size, 40 seq read I/O jobs, 128 iodepth, overall throughput is
only 2.7GB/s, this is around 50% of the idea performance number.

The perf reports locking contention happens at allow_barrier() and
wait_barrier() code,
 - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
         + 89.92% allow_barrier
         + 9.34% __wake_up
 - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
   - _raw_spin_lock_irq
         - 100.00% wait_barrier

The reason is, in these I/O barrier related functions,
 - raise_barrier()
 - lower_barrier()
 - wait_barrier()
 - allow_barrier()
They always hold conf->resync_lock firstly, even there are only regular
reading I/Os and no resync I/O at all. This is a huge performance penalty.

The solution is a lockless-like algorithm in I/O barrier code, and only
holding conf->resync_lock when it has to.

The original idea is from Hannes Reinecke, and Neil Brown provides
comments to improve it. I continue to work on it, and make the patch into
current form.

In the new simpler raid1 I/O barrier implementation, there are two
wait barrier functions,
 - wait_barrier()
   Which calls _wait_barrier(), is used for regular write I/O. If there is
   resync I/O happening on the same I/O barrier bucket, or the whole
   array is frozen, task will wait until no barrier on same barrier bucket,
   or the whold array is unfreezed.
 - wait_read_barrier()
   Since regular read I/O won't interfere with resync I/O (read_balance()
   will make sure only uptodate data will be read out), it is unnecessary
   to wait for barrier in regular read I/Os, waiting in only necessary
   when the whole array is frozen.

The operations on conf->nr_pending[idx], conf->nr_waiting[idx], conf->
barrier[idx] are very carefully designed in raise_barrier(),
lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
avoid unnecessary spin locks in these functions. Once conf->
nr_pengding[idx] is increased, a resync I/O with same barrier bucket index
has to wait in raise_barrier(). Then in _wait_barrier() if no barrier
raised in same barrier bucket index and array is not frozen, the regular
I/O doesn't need to hold conf->resync_lock, it can just increase
conf->nr_pending[idx], and return to its caller. wait_read_barrier() is
very similar to _wait_barrier(), the only difference is it only waits when
array is frozen. For heavy parallel reading I/Os, the lockless I/O barrier
code almostly gets rid of all spin lock cost.

This patch significantly improves raid1 reading peroformance. From my
testing, a raid1 device built by two NVMe SSD, runs fio with 64KB
blocksize, 40 seq read I/O jobs, 128 iodepth, overall throughput
increases from 2.7GB/s to 4.6GB/s (+70%).

Changelog
V3:
- Add smp_mb__after_atomic() as Shaohua and Neil suggested.
- Change conf->nr_queued[] from atomic_t to int.
- Change conf->array_frozen from atomic_t back to int, and use
  READ_ONCE(conf->array_frozen) to check value of conf->array_frozen
  in _wait_barrier() and wait_read_barrier().
- In _wait_barrier() and wait_read_barrier(), add a call to
  wake_up(&conf->wait_barrier) after atomic_dec(&conf->nr_pending[idx]),
  to fix a deadlock between  _wait_barrier()/wait_read_barrier and
  freeze_array().
V2:
- Remove a spin_lock/unlock pair in raid1d().
- Add more code comments to explain why there is no racy when checking two
  atomic_t variables at same time.
V1:
- Original RFC patch for comments.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/raid1.c | 157 +++++++++++++++++++++++++++++++++++++----------------
 drivers/md/raid1.h |   6 +-
 2 files changed, 114 insertions(+), 49 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 4234494..2ac2650 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -836,11 +836,21 @@ static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
 	spin_lock_irq(&conf->resync_lock);
 
 	/* Wait until no block IO is waiting */
-	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
+	wait_event_lock_irq(conf->wait_barrier,
+			    !atomic_read(&conf->nr_waiting[idx]),
 			    conf->resync_lock);
 
 	/* block any new IO from starting */
-	conf->barrier[idx]++;
+	atomic_inc(&conf->barrier[idx]);
+	/*
+	 * In raise_barrier() we firstly increase conf->barrier[idx] then
+	 * check conf->nr_pending[idx]. In _wait_barrier() we firstly
+	 * increase conf->nr_pending[idx] then check conf->barrier[idx].
+	 * A memory barrier here to make sure conf->nr_pending[idx] won't
+	 * be fetched before conf->barrier[idx] is increased. Otherwise
+	 * there will be a race between raise_barrier() and _wait_barrier().
+	 */
+	smp_mb__after_atomic();
 
 	/* For these conditions we must wait:
 	 * A: while the array is in frozen state
@@ -851,42 +861,81 @@ static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
 	 */
 	wait_event_lock_irq(conf->wait_barrier,
 			    !conf->array_frozen &&
-			     !conf->nr_pending[idx] &&
-			     conf->barrier[idx] < RESYNC_DEPTH,
+			     !atomic_read(&conf->nr_pending[idx]) &&
+			     atomic_read(&conf->barrier[idx]) < RESYNC_DEPTH,
 			    conf->resync_lock);
 
-	conf->nr_pending[idx]++;
+	atomic_inc(&conf->nr_pending[idx]);
 	spin_unlock_irq(&conf->resync_lock);
 }
 
 static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
 {
-	unsigned long flags;
 	int idx = sector_to_idx(sector_nr);
 
-	BUG_ON(conf->barrier[idx] <= 0);
+	BUG_ON(atomic_read(&conf->barrier[idx]) <= 0);
 
-	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->barrier[idx]--;
-	conf->nr_pending[idx]--;
-	spin_unlock_irqrestore(&conf->resync_lock, flags);
+	atomic_dec(&conf->barrier[idx]);
+	atomic_dec(&conf->nr_pending[idx]);
 	wake_up(&conf->wait_barrier);
 }
 
 static void _wait_barrier(struct r1conf *conf, int idx)
 {
-	spin_lock_irq(&conf->resync_lock);
-	if (conf->array_frozen || conf->barrier[idx]) {
-		conf->nr_waiting[idx]++;
-		/* Wait for the barrier to drop. */
-		wait_event_lock_irq(
-			conf->wait_barrier,
-			!conf->array_frozen && !conf->barrier[idx],
-			conf->resync_lock);
-		conf->nr_waiting[idx]--;
-	}
+	/*
+	 * We need to increase conf->nr_pending[idx] very early here,
+	 * then raise_barrier() can be blocked when it waits for
+	 * conf->nr_pending[idx] to be 0. Then we can avoid holding
+	 * conf->resync_lock when there is no barrier raised in same
+	 * barrier unit bucket. Also if the array is frozen, I/O
+	 * should be blocked until array is unfrozen.
+	 */
+	atomic_inc(&conf->nr_pending[idx]);
+	/*
+	 * In _wait_barrier() we firstly increase conf->nr_pending[idx], then
+	 * check conf->barrier[idx]. In raise_barrier() we firstly increase
+	 * conf->barrier[idx], then check conf->nr_pending[idx]. A memory
+	 * barrier is necessary here to make sure conf->barrier[idx] won't be
+	 * fetched before conf->nr_pending[idx] is increased. Otherwise there
+	 * will be a race between _wait_barrier() and raise_barrier().
+	 */
+	smp_mb__after_atomic();
+
+	/*
+	 * Don't worry about checking two atomic_t variables at same time
+	 * here. If during we check conf->barrier[idx], the array is
+	 * frozen (conf->array_frozen is 1), and chonf->barrier[idx] is
+	 * 0, it is safe to return and make the I/O continue. Because the
+	 * array is frozen, all I/O returned here will eventually complete
+	 * or be queued, no race will happen. See code comment in
+	 * frozen_array().
+	 */
+	if (!READ_ONCE(conf->array_frozen) &&
+	    !atomic_read(&conf->barrier[idx]))
+		return;
 
-	conf->nr_pending[idx]++;
+	/*
+	 * After holding conf->resync_lock, conf->nr_pending[idx]
+	 * should be decreased before waiting for barrier to drop.
+	 * Otherwise, we may encounter a race condition because
+	 * raise_barrer() might be waiting for conf->nr_pending[idx]
+	 * to be 0 at same time.
+	 */
+	spin_lock_irq(&conf->resync_lock);
+	atomic_inc(&conf->nr_waiting[idx]);
+	atomic_dec(&conf->nr_pending[idx]);
+	/*
+	 * In case freeze_array() is waiting for
+	 * get_unqueued_pending() == extra
+	 */
+	wake_up(&conf->wait_barrier);
+	/* Wait for the barrier in same barrier unit bucket to drop. */
+	wait_event_lock_irq(conf->wait_barrier,
+			    !conf->array_frozen &&
+			     !atomic_read(&conf->barrier[idx]),
+			    conf->resync_lock);
+	atomic_inc(&conf->nr_pending[idx]);
+	atomic_dec(&conf->nr_waiting[idx]);
 	spin_unlock_irq(&conf->resync_lock);
 }
 
@@ -894,18 +943,32 @@ static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
 {
 	int idx = sector_to_idx(sector_nr);
 
-	spin_lock_irq(&conf->resync_lock);
-	if (conf->array_frozen) {
-		conf->nr_waiting[idx]++;
-		/* Wait for array to unfreeze */
-		wait_event_lock_irq(
-			conf->wait_barrier,
-			!conf->array_frozen,
-			conf->resync_lock);
-		conf->nr_waiting[idx]--;
-	}
+	/*
+	 * Very similar to _wait_barrier(). The difference is, for read
+	 * I/O we don't need wait for sync I/O, but if the whole array
+	 * is frozen, the read I/O still has to wait until the array is
+	 * unfrozen. Since there is no ordering requirement with
+	 * conf->barrier[idx] here, memory barrier is unnecessary as well.
+	 */
+	atomic_inc(&conf->nr_pending[idx]);
 
-	conf->nr_pending[idx]++;
+	if (!READ_ONCE(conf->array_frozen))
+		return;
+
+	spin_lock_irq(&conf->resync_lock);
+	atomic_inc(&conf->nr_waiting[idx]);
+	atomic_dec(&conf->nr_pending[idx]);
+	/*
+	 * In case freeze_array() is waiting for
+	 * get_unqueued_pending() == extra
+	 */
+	wake_up(&conf->wait_barrier);
+	/* Wait for array to be unfrozen */
+	wait_event_lock_irq(conf->wait_barrier,
+			    !conf->array_frozen,
+			    conf->resync_lock);
+	atomic_inc(&conf->nr_pending[idx]);
+	atomic_dec(&conf->nr_waiting[idx]);
 	spin_unlock_irq(&conf->resync_lock);
 }
 
@@ -926,11 +989,7 @@ static void wait_all_barriers(struct r1conf *conf)
 
 static void _allow_barrier(struct r1conf *conf, int idx)
 {
-	unsigned long flags;
-
-	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->nr_pending[idx]--;
-	spin_unlock_irqrestore(&conf->resync_lock, flags);
+	atomic_dec(&conf->nr_pending[idx]);
 	wake_up(&conf->wait_barrier);
 }
 
@@ -955,11 +1014,14 @@ static int get_unqueued_pending(struct r1conf *conf)
 	int idx, ret;
 
 	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
-		ret += conf->nr_pending[idx] - conf->nr_queued[idx];
+		ret += atomic_read(&conf->nr_pending[idx]) -
+			conf->nr_queued[idx];
 
 	return ret;
 }
 
+#define FREEZE_TIMEOUT_JIFFIES 10
+
 static void freeze_array(struct r1conf *conf, int extra)
 {
 	/* Stop sync I/O and normal I/O and wait for everything to
@@ -1000,8 +1062,8 @@ static void unfreeze_array(struct r1conf *conf)
 	/* reverse the effect of the freeze */
 	spin_lock_irq(&conf->resync_lock);
 	conf->array_frozen = 0;
-	wake_up(&conf->wait_barrier);
 	spin_unlock_irq(&conf->resync_lock);
+	wake_up(&conf->wait_barrier);
 }
 
 /* duplicate the data pages for behind I/O
@@ -2393,6 +2455,11 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
 		idx = sector_to_idx(r1_bio->sector);
 		conf->nr_queued[idx]++;
 		spin_unlock_irq(&conf->device_lock);
+		/*
+		 * In case freeze_array() is waiting for condition
+		 * get_unqueued_pending() == extra to be true.
+		 */
+		wake_up(&conf->wait_barrier);
 		md_wakeup_thread(conf->mddev->thread);
 	} else {
 		if (test_bit(R1BIO_WriteError, &r1_bio->state))
@@ -2529,9 +2596,7 @@ static void raid1d(struct md_thread *thread)
 						  retry_list);
 			list_del(&r1_bio->retry_list);
 			idx = sector_to_idx(r1_bio->sector);
-			spin_lock_irqsave(&conf->device_lock, flags);
 			conf->nr_queued[idx]--;
-			spin_unlock_irqrestore(&conf->device_lock, flags);
 			if (mddev->degraded)
 				set_bit(R1BIO_Degraded, &r1_bio->state);
 			if (test_bit(R1BIO_WriteError, &r1_bio->state))
@@ -2670,7 +2735,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 	 * If there is non-resync activity waiting for a turn, then let it
 	 * though before starting on this new sync request.
 	 */
-	if (conf->nr_waiting[idx])
+	if (atomic_read(&conf->nr_waiting[idx]))
 		schedule_timeout_uninterruptible(1);
 
 	/* we are incrementing sector_nr below. To be safe, we check against
@@ -2930,12 +2995,12 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 		goto abort;
 
 	conf->nr_pending = kcalloc(BARRIER_BUCKETS_NR,
-				   sizeof(int), GFP_KERNEL);
+				   sizeof(atomic_t), GFP_KERNEL);
 	if (!conf->nr_pending)
 		goto abort;
 
 	conf->nr_waiting = kcalloc(BARRIER_BUCKETS_NR,
-				   sizeof(int), GFP_KERNEL);
+				   sizeof(atomic_t), GFP_KERNEL);
 	if (!conf->nr_waiting)
 		goto abort;
 
@@ -2945,7 +3010,7 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 		goto abort;
 
 	conf->barrier = kcalloc(BARRIER_BUCKETS_NR,
-				sizeof(int), GFP_KERNEL);
+				sizeof(atomic_t), GFP_KERNEL);
 	if (!conf->barrier)
 		goto abort;
 
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index d3faf30..1d9d279 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -68,10 +68,10 @@ struct r1conf {
 	 */
 	wait_queue_head_t	wait_barrier;
 	spinlock_t		resync_lock;
-	int			*nr_pending;
-	int			*nr_waiting;
+	atomic_t		*nr_pending;
+	atomic_t		*nr_waiting;
 	int			*nr_queued;
-	int			*barrier;
+	atomic_t		*barrier;
 	int			array_frozen;
 
 	/* Set to 1 if a full sync is needed, (fresh device added).
-- 
2.6.6


^ permalink raw reply related

* Re: [PATCH V3 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
From: Coly Li @ 2017-02-15 17:15 UTC (permalink / raw)
  To: linux-raid
  Cc: Shaohua Li, Hannes Reinecke, Neil Brown, Johannes Thumshirn,
	Guoqing Jiang
In-Reply-To: <1487176523-109075-2-git-send-email-colyli@suse.de>

On 2017/2/16 上午12:35, colyli@suse.de wrote:
[snip]
>  
> @@ -955,11 +1014,14 @@ static int get_unqueued_pending(struct r1conf *conf)
>  	int idx, ret;
>  
>  	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
> -		ret += conf->nr_pending[idx] - conf->nr_queued[idx];
> +		ret += atomic_read(&conf->nr_pending[idx]) -
> +			conf->nr_queued[idx];
>  
>  	return ret;
>  }
>  
> +#define FREEZE_TIMEOUT_JIFFIES 10
> +

The above line will be removed in next version, please ignore it.

[snip]

Coly

^ permalink raw reply

* [PATCH] dm space map metadata: constify dm_space_map structures
From: Bhumika Goyal @ 2017-02-15 18:13 UTC (permalink / raw)
  To: julia.lawall, agk, snitzer, dm-devel, shli, linux-raid,
	linux-kernel
  Cc: Bhumika Goyal

Declare dm_space_map structures as const as they are only passed as an
argument to the function memcpy. This argument is of type const void *,
so dm_space_map structures having this property can be declared as
const.

File size before:
   text	   data	    bss	    dec	    hex	filename
   4889	    240	      0	   5129	   1409 dm-space-map-metadata.o

File size after:
   text	   data	    bss	    dec	    hex	filename
   5139	      0	      0	   5139	   1413 dm-space-map-metadata.o

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
---
 drivers/md/persistent-data/dm-space-map-metadata.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/persistent-data/dm-space-map-metadata.c b/drivers/md/persistent-data/dm-space-map-metadata.c
index 20557e2..4aed69d 100644
--- a/drivers/md/persistent-data/dm-space-map-metadata.c
+++ b/drivers/md/persistent-data/dm-space-map-metadata.c
@@ -544,7 +544,7 @@ static int sm_metadata_copy_root(struct dm_space_map *sm, void *where_le, size_t
 
 static int sm_metadata_extend(struct dm_space_map *sm, dm_block_t extra_blocks);
 
-static struct dm_space_map ops = {
+static const struct dm_space_map ops = {
 	.destroy = sm_metadata_destroy,
 	.extend = sm_metadata_extend,
 	.get_nr_blocks = sm_metadata_get_nr_blocks,
@@ -671,7 +671,7 @@ static int sm_bootstrap_copy_root(struct dm_space_map *sm, void *where,
 	return -EINVAL;
 }
 
-static struct dm_space_map bootstrap_ops = {
+static const struct dm_space_map bootstrap_ops = {
 	.destroy = sm_bootstrap_destroy,
 	.extend = sm_bootstrap_extend,
 	.get_nr_blocks = sm_bootstrap_get_nr_blocks,
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH v2 0/5] md: use bio_clone_fast()
From: Shaohua Li @ 2017-02-15 19:19 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-kernel, linux-raid, linux-block,
	Christoph Hellwig, NeilBrown
In-Reply-To: <1487086143-10255-1-git-send-email-tom.leiming@gmail.com>

On Tue, Feb 14, 2017 at 11:28:58PM +0800, Ming Lei wrote:
> Hi,
> 
> This patches replaces bio_clone() with bio_fast_clone() in
> bio_clone_mddev() because:
> 
> 1) bio_clone_mddev() is used in raid normal I/O and isn't in
> resync I/O path, and all the direct access to bvec table in
> raid happens on resync I/O only except for write behind of raid1.
> Write behind is treated specially, so the replacement is safe.
> 
> 2) for write behind, bio_clone() is kept, but this patchset
> introduces bio_clone_bioset_partial() to just clone one specific 
> bvecs range instead of whole table. Then write behind is improved
> too.
> 
> V2:
> 	1) move 3rd patch into 2nd one
> 	2) kill  bio_clone_mddev() and use  bio_clone_fast() directly
> 	3) define local variable 'offset' as sector_t in raid1_write_request()
> 
> V1:
> 	1) don't introduce bio_clone_slow_mddev_partial()
> 	2) return failure if mddev->bio_set can't be created
> 	3) remove check in bio_clone_mddev() as suggested by
> 	Christoph Hellwig.
> 	4) rename bio_clone_mddev() as bio_clone_fast_mddev()

Applied, thanks!
 
> Ming Lei (5):
>   block: introduce bio_clone_bioset_partial()
>   md: fail if mddev->bio_set can't be created
>   md/raid1: use bio_clone_bioset_partial() in case of write behind
>   md: remove unnecessary check on mddev
>   md: fast clone bio in bio_clone_mddev()
> 
>  block/bio.c         | 61 +++++++++++++++++++++++++++++++++++++++++------------
>  drivers/md/faulty.c |  2 +-
>  drivers/md/md.c     | 15 ++++---------
>  drivers/md/md.h     |  2 --
>  drivers/md/raid1.c  | 28 +++++++++++++++++-------
>  drivers/md/raid10.c | 11 +++++-----
>  drivers/md/raid5.c  |  4 ++--
>  include/linux/bio.h | 11 ++++++++--
>  8 files changed, 89 insertions(+), 45 deletions(-)
> 
> -- 
> 2.7.4
> 

^ permalink raw reply

* Re: [PATCH v2 5/5] md: fast clone bio in bio_clone_mddev()
From: Shaohua Li @ 2017-02-15 19:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, Jens Axboe, linux-kernel, linux-raid, linux-block,
	NeilBrown
In-Reply-To: <20170214160109.GA32705@infradead.org>

On Tue, Feb 14, 2017 at 08:01:09AM -0800, Christoph Hellwig wrote:
> On Tue, Feb 14, 2017 at 11:29:03PM +0800, Ming Lei wrote:
> > Firstly bio_clone_mddev() is used in raid normal I/O and isn't
> > in resync I/O path.
> > 
> > Secondly all the direct access to bvec table in raid happens on
> > resync I/O except for write behind of raid1, in which we still
> > use bio_clone() for allocating new bvec table.
> > 
> > So this patch replaces bio_clone() with bio_clone_fast()
> > in bio_clone_mddev().
> > 
> > Also kill bio_clone_mddev() and call bio_clone_fast() directly, as
> > suggested by Christoph Hellwig.
> > 
> > Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> 
> Looks fine,
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> Btw, can you also tack on another patch to kill bio_alloc_mddev
> to be consistent?

I'll take care of this

^ permalink raw reply

* mdadm manpage: add missing unit for --array-size
From: G.raud @ 2017-02-15 19:25 UTC (permalink / raw)
  To: linux-raid

The description of --array-size does not give the default unit (but only
those corresponding to a suffix of M or G).  Please mention that the
default unit is the KiB (and that K is accepted as a suffix).  Since
mdadm -E gives some sizes in sectors it is certainly not evident that
the default unit is the KiB.

Thanks

^ permalink raw reply

* Re: Which API for md state monitoring - sysfs vs. /proc/mdstat vs. GET_DISK_INFO ioctl
From: Anthony Youngman @ 2017-02-15 21:21 UTC (permalink / raw)
  To: Tim Small, linux-raid; +Cc: shli
In-Reply-To: <994ffd50-9041-a5e0-bbd9-1db704f6ad63@buttersideup.com>

Another user-space tool to look at is xosview. The current release 
version uses /proc/mdstat (and is terminally broken as a result), but 
it's being updated to use /sys/block/...

Cheers,
Wol

On 14/02/17 14:43, Tim Small wrote:
> Hello,
>
> A design question for a user-space RAID monitoring tool.  The tool is
> intended to export machine-readable md state so that the system owners
> can creating automated alert policies etc.
>
> Should it get its information on md device state from /proc/mdstat, or
> /sys/block/*/md/ as a data source, or should it use the ioctl interface
> that mdadm uses (on the basis that this is more complete / heavily tested).
>
> I assume the issues are:
>
> . usability (/proc/mdstat is intended to be human-readable so machine
> parsing it seems bleugh - it's also incomplete?)
>
> . stability (perhaps /sys/block/*/md/ is less stable than the ioctl or
> proc interface?  Or perhaps it's just fine and won't break in
> backward-incompatible ways?).
>
> Any thoughts?
>
> Should I see which way the recent "add bad block flag to disk state"
> patch-set thread goes?
>
> Tim.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* Re: [PATCH v2 5/5] md: fast clone bio in bio_clone_mddev()
From: Shaohua Li @ 2017-02-15 23:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, Jens Axboe, linux-kernel, linux-raid, linux-block,
	NeilBrown
In-Reply-To: <20170215192025.fo7jsew7pvagp5hv@kernel.org>

On Wed, Feb 15, 2017 at 11:20:25AM -0800, Shaohua Li wrote:
> On Tue, Feb 14, 2017 at 08:01:09AM -0800, Christoph Hellwig wrote:
> > On Tue, Feb 14, 2017 at 11:29:03PM +0800, Ming Lei wrote:
> > > Firstly bio_clone_mddev() is used in raid normal I/O and isn't
> > > in resync I/O path.
> > > 
> > > Secondly all the direct access to bvec table in raid happens on
> > > resync I/O except for write behind of raid1, in which we still
> > > use bio_clone() for allocating new bvec table.
> > > 
> > > So this patch replaces bio_clone() with bio_clone_fast()
> > > in bio_clone_mddev().
> > > 
> > > Also kill bio_clone_mddev() and call bio_clone_fast() directly, as
> > > suggested by Christoph Hellwig.
> > > 
> > > Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> > 
> > Looks fine,
> > 
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > 
> > Btw, can you also tack on another patch to kill bio_alloc_mddev
> > to be consistent?
> 
> I'll take care of this

Hmm, bio_alloc_mddev is a little different, it could be called without the
bio_set. We can only guarantee the bio_set is valid for bio for md personality
not for metadata. Fully fixing the bio_set issue will need rework mddev struct
allocation, don't think it's worthy doing. So below is one removing the export
of the wrap.

From eaf70c32e5cca68110a0cf7a0311ef0ac97f4d42 Mon Sep 17 00:00:00 2001
Message-Id: <eaf70c32e5cca68110a0cf7a0311ef0ac97f4d42.1487200103.git.shli@fb.com>
From: Shaohua Li <shli@fb.com>
Date: Wed, 15 Feb 2017 11:41:21 -0800
Subject: [PATCH] md: don't export bio_alloc_mddev

When mddev is running, it has a valid bio_set, so the bio_alloc_mddev
wrap is unncessary. The problem is we run some IO (eg, load superblock)
before mddev is fully initialized. At that time bio_set isn't
initialized. Moving bio_set creation to md_alloc doesn't help either,
because dm-raid doesn't use it.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/md.c     | 12 ++++++------
 drivers/md/md.h     |  2 --
 drivers/md/raid1.c  |  2 +-
 drivers/md/raid10.c |  2 +-
 4 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 55e7e7a..bef1863 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -171,11 +171,12 @@ static const struct block_device_operations md_fops;
 
 static int start_readonly;
 
-/* bio_clone_mddev
- * like bio_clone, but with a local bio set
+/*
+ * This is only required for meta data related bio, where bio_set might not be
+ * initialized yet. dm-raid doesn't use md_alloc, so we can't alloc bio_set
+ * there.
  */
-
-struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
+static struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
 			    struct mddev *mddev)
 {
 	struct bio *b;
@@ -188,7 +189,6 @@ struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
 		return NULL;
 	return b;
 }
-EXPORT_SYMBOL_GPL(bio_alloc_mddev);
 
 /*
  * We have a system wide 'event count' that is incremented
@@ -393,7 +393,7 @@ static void submit_flushes(struct work_struct *ws)
 			atomic_inc(&rdev->nr_pending);
 			atomic_inc(&rdev->nr_pending);
 			rcu_read_unlock();
-			bi = bio_alloc_mddev(GFP_NOIO, 0, mddev);
+			bi = bio_alloc_bioset(GFP_NOIO, 0, mddev->bio_set);
 			bi->bi_end_io = md_end_flush;
 			bi->bi_private = rdev;
 			bi->bi_bdev = rdev->bdev;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index b8859cb..44fe227 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -673,8 +673,6 @@ extern void md_rdev_clear(struct md_rdev *rdev);
 
 extern void mddev_suspend(struct mddev *mddev);
 extern void mddev_resume(struct mddev *mddev);
-extern struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
-				   struct mddev *mddev);
 
 extern void md_unplug(struct blk_plug_cb *cb, bool from_schedule);
 extern void md_reload_sb(struct mddev *mddev, int raid_disk);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index ad5c948..2180a9a 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2281,7 +2281,7 @@ static int narrow_write_error(struct r1bio *r1_bio, int i)
 				vcnt--;
 			}
 
-			wbio = bio_alloc_mddev(GFP_NOIO, vcnt, mddev);
+			wbio = bio_alloc_bioset(GFP_NOIO, vcnt, mddev->bio_set);
 			memcpy(wbio->bi_io_vec, vec, vcnt * sizeof(struct bio_vec));
 
 			wbio->bi_vcnt = vcnt;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index ade7d69..a1f8e98 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -4473,7 +4473,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr,
 		return sectors_done;
 	}
 
-	read_bio = bio_alloc_mddev(GFP_KERNEL, RESYNC_PAGES, mddev);
+	read_bio = bio_alloc_bioset(GFP_KERNEL, RESYNC_PAGES, mddev->bio_set);
 
 	read_bio->bi_bdev = rdev->bdev;
 	read_bio->bi_iter.bi_sector = (r10_bio->devs[r10_bio->read_slot].addr
-- 
2.9.3

^ permalink raw reply related

* mdadm: how to move superblock 1.0 on reduced components
From: G.raud @ 2017-02-15 23:41 UTC (permalink / raw)
  To: linux-raid

I cannot find information about this topic on the web.

Reducing the size of the components of an md array keeps the superblock
1.0 at the same position (which avoids data corruption and maybe is
required for md to find it).  However how can one then reduce the size
of the underlying device without losing the superblock?

In an array with redundancy re-addind reduced components one after the
other is a rather safe solution, I suppose, but it is time consuming.
Would it be possible to update the superblock version by doing that,
either by mixing different superblock versions by giving --metadata with
--add or by modifying the partitions defining the re-added components so
that the new devices start at the right offset for a superblock 1.2 (and
later using the following solution, re-creating the array with the
original partitions and a 1.2 superblock at the right offset with
--data-offset)?

Re-creating all components after a reduction of the underlying devices
seems possible by using --assume-clean.  What to look for to avoid data
loss?

mdadm already supports enlarging a 1.0 component, which requires moving
the superblock, so the code to reduce one would probably not require
much efforts.  Could it be added in the future?

I would like to attempt to move the superblocks 1.0 manually, but I lack
some information:

1) Should the reserved space before the superblock be moved along?

2) Is there metadata inside a superblock 1.0 to update after the move
   (that another program would not correct automatically)?

3) How is the superblock detected?  Is there only one position possible
   for a given chunk size on a given component size?  Should the
   remaining (padding) sectors of the underlying device be zeroed?

Thanks

^ permalink raw reply

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Shaohua Li @ 2017-02-16  2:22 UTC (permalink / raw)
  To: colyli
  Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn,
	Guoqing Jiang
In-Reply-To: <1487176523-109075-1-git-send-email-colyli@suse.de>

On Thu, Feb 16, 2017 at 12:35:22AM +0800, colyli@suse.de wrote:
> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
> introduces a sliding resync window for raid1 I/O barrier, this idea limits
> I/O barriers to happen only inside a slidingresync window, for regular
> I/Os out of this resync window they don't need to wait for barrier any
> more. On large raid1 device, it helps a lot to improve parallel writing
> I/O throughput when there are background resync I/Os performing at
> same time.
> 
> The idea of sliding resync widow is awesome, but code complexity is a
> challenge. Sliding resync window requires several veriables to work
> collectively, this is complexed and very hard to make it work correctly.
> Just grep "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches
> to fix the original resync window patch. This is not the end, any further
> related modification may easily introduce more regreassion.
> 
> Therefore I decide to implement a much simpler raid1 I/O barrier, by
> removing resync window code, I believe life will be much easier.
> 
> The brief idea of the simpler barrier is,
>  - Do not maintain a logbal unique resync window
>  - Use multiple hash buckets to reduce I/O barrier conflictions, regular
>    I/O only has to wait for a resync I/O when both them have same barrier
>    bucket index, vice versa.
>  - I/O barrier can be recuded to an acceptable number if there are enought
>    barrier buckets
> 
> Here I explain how the barrier buckets are designed,
>  - BARRIER_UNIT_SECTOR_SIZE
>    The whole LBA address space of a raid1 device is divided into multiple
>    barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
>    Bio request won't go across border of barrier unit size, that means
>    maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
>    For random I/O 64MB is large enough for both read and write requests,
>    for sequential I/O considering underlying block layer may merge them
>    into larger requests, 64MB is still good enough.
>    Neil also points out that for resync operation, "we want the resync to
>    move from region to region fairly quickly so that the slowness caused
>    by having to synchronize with the resync is averaged out over a fairly
>    small time frame". For full speed resync, 64MB should take less then 1
>    second. When resync is competing with other I/O, it could take up a few
>    minutes. Therefore 64MB size is fairly good range for resync.
> 
>  - BARRIER_BUCKETS_NR
>    There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
>         #define BARRIER_BUCKETS_NR_BITS   (PAGE_SHIFT - 2)
>         #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
>    this patch makes the bellowed members of struct r1conf from integer
>    to array of integers,
>         -       int                     nr_pending;
>         -       int                     nr_waiting;
>         -       int                     nr_queued;
>         -       int                     barrier;
>         +       int                     *nr_pending;
>         +       int                     *nr_waiting;
>         +       int                     *nr_queued;
>         +       int                     *barrier;
>    number of the array elements is defined as BARRIER_BUCKETS_NR. For 4KB
>    kernel space page size, (PAGE_SHIFT - 2) indecates there are 1024 I/O
>    barrier buckets, and each array of integers occupies single memory page.
>    1024 means for a request which is smaller than the I/O barrier unit size
>    has ~0.1% chance to wait for resync to pause, which is quite a small
>    enough fraction. Also requesting single memory page is more friendly to
>    kernel page allocator than larger memory size.
> 
>  - I/O barrier bucket is indexed by bio start sector
>    If multiple I/O requests hit different I/O barrier units, they only need
>    to compete I/O barrier with other I/Os which hit the same I/O barrier
>    bucket index with each other. The index of a barrier bucket which a
>    bio should look for is calculated by sector_to_idx() which is defined
>    in raid1.h as an inline function,
>         static inline int sector_to_idx(sector_t sector)
>         {
>                 return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
>                                 BARRIER_BUCKETS_NR_BITS);
>         }
>    Here sector_nr is the start sector number of a bio.
> 
>  - Single bio won't go across boundary of a I/O barrier unit
>    If a request goes across boundary of barrier unit, it will be split. A
>    bio may be split in raid1_make_request() or raid1_sync_request(), if
>    sectors returned by align_to_barrier_unit_end() is small than original
>    bio size.
> 
> Comparing to single sliding resync window,
>  - Currently resync I/O grows linearly, therefore regular and resync I/O
>    will have confliction within a single barrier units. So the I/O
>    behavior is similar to single sliding resync window.
>  - But a barrier unit bucket is shared by all barrier units with identical
>    barrier uinit index, the probability of confliction might be higher
>    than single sliding resync window, in condition that writing I/Os
>    always hit barrier units which have identical barrier bucket indexs with
>    the resync I/Os. This is a very rare condition in real I/O work loads,
>    I cannot imagine how it could happen in practice.
>  - Therefore we can achieve a good enough low confliction rate with much
>    simpler barrier algorithm and implementation.
> 
> There are two changes should be noticed,
>  - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
>    single loop, it looks like this,
>         spin_lock_irqsave(&conf->device_lock, flags);
>         conf->nr_queued[idx]--;
>         spin_unlock_irqrestore(&conf->device_lock, flags);
>    This change generates more spin lock operations, but in next patch of
>    this patch set, it will be replaced by a single line code,
>         atomic_dec(&conf->nr_queueud[idx]);
>    So we don't need to worry about spin lock cost here.
>  - Mainline raid1 code split original raid1_make_request() into
>    raid1_read_request() and raid1_write_request(). If the original bio
>    goes across an I/O barrier unit size, this bio will be split before
>    calling raid1_read_request() or raid1_write_request(),  this change
>    the code logic more simple and clear.
>  - In this patch wait_barrier() is moved from raid1_make_request() to
>    raid1_write_request(). In raid_read_request(), original wait_barrier()
>    is replaced by raid1_read_request().
>    The differnece is wait_read_barrier() only waits if array is frozen,
>    using different barrier function in different code path makes the code
>    more clean and easy to read.
> Changelog
> V3:
> - Rebase the patch against latest upstream kernel code.
> - Many fixes by review comments from Neil,
>   - Back to use pointers to replace arraries in struct r1conf
>   - Remove total_barriers from struct r1conf
>   - Add more patch comments to explain how/why the values of
>     BARRIER_UNIT_SECTOR_SIZE and BARRIER_BUCKETS_NR are decided.
>   - Use get_unqueued_pending() to replace get_all_pendings() and
>     get_all_queued()
>   - Increase bucket number from 512 to 1024
> - Change code comments format by review from Shaohua.
> V2:
> - Use bio_split() to split the orignal bio if it goes across barrier unit
>   bounday, to make the code more simple, by suggestion from Shaohua and
>   Neil.
> - Use hash_long() to replace original linear hash, to avoid a possible
>   confilict between resync I/O and sequential write I/O, by suggestion from
>   Shaohua.
> - Add conf->total_barriers to record barrier depth, which is used to
>   control number of parallel sync I/O barriers, by suggestion from Shaohua.
> - In V1 patch the bellowed barrier buckets related members in r1conf are
>   allocated in memory page. To make the code more simple, V2 patch moves
>   the memory space into struct r1conf, like this,
>         -       int                     nr_pending;
>         -       int                     nr_waiting;
>         -       int                     nr_queued;
>         -       int                     barrier;
>         +       int                     nr_pending[BARRIER_BUCKETS_NR];
>         +       int                     nr_waiting[BARRIER_BUCKETS_NR];
>         +       int                     nr_queued[BARRIER_BUCKETS_NR];
>         +       int                     barrier[BARRIER_BUCKETS_NR];
>   This change is by the suggestion from Shaohua.
> - Remove some inrelavent code comments, by suggestion from Guoqing.
> - Add a missing wait_barrier() before jumping to retry_write, in
>   raid1_make_write_request().
> V1:
> - Original RFC patch for comments

Looks good, two minor issues.

>  
> -static void raid1_read_request(struct mddev *mddev, struct bio *bio,
> -				 struct r1bio *r1_bio)
> +static void raid1_read_request(struct mddev *mddev, struct bio *bio)
>  {
>  	struct r1conf *conf = mddev->private;
>  	struct raid1_info *mirror;
> +	struct r1bio *r1_bio;
>  	struct bio *read_bio;
>  	struct bitmap *bitmap = mddev->bitmap;
>  	const int op = bio_op(bio);
> @@ -1083,7 +1101,34 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>  	int max_sectors;
>  	int rdisk;
>  
> -	wait_barrier(conf, bio);
> +	/*
> +	 * Still need barrier for READ in case that whole
> +	 * array is frozen.
> +	 */
> +	wait_read_barrier(conf, bio->bi_iter.bi_sector);
> +	bitmap = mddev->bitmap;
> +
> +	/*
> +	 * make_request() can abort the operation when read-ahead is being
> +	 * used and no empty request is available.
> +	 *
> +	 */
> +	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
> +	r1_bio->master_bio = bio;
> +	r1_bio->sectors = bio_sectors(bio);
> +	r1_bio->state = 0;
> +	r1_bio->mddev = mddev;
> +	r1_bio->sector = bio->bi_iter.bi_sector;

This part looks unnecessary complicated. If you change raid1_make_request to
something like __raid1_make_reques, add a new raid1_make_request and do bio
split there, then call __raid1_make_request for each splitted bio, then you
don't need to duplicate the r1_bio allocation parts for read/write.

> diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
> index c52ef42..d3faf30 100644
> --- a/drivers/md/raid1.h
> +++ b/drivers/md/raid1.h
> @@ -1,6 +1,14 @@
>  #ifndef _RAID1_H
>  #define _RAID1_H
>  
> +/* each barrier unit size is 64MB fow now
> + * note: it must be larger than RESYNC_DEPTH
> + */
> +#define BARRIER_UNIT_SECTOR_BITS	17
> +#define BARRIER_UNIT_SECTOR_SIZE	(1<<17)
> +#define BARRIER_BUCKETS_NR_BITS		(PAGE_SHIFT - 2)

maybe write this as (PAGE_SHIFT - ilog2(sizeof(int)))? To be honest, I don't
think it really matters if the array is PAGE_SIZE length, maybe just specify a
const here.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH V3 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
From: Shaohua Li @ 2017-02-16  2:25 UTC (permalink / raw)
  To: colyli
  Cc: linux-raid, Shaohua Li, Hannes Reinecke, Neil Brown,
	Johannes Thumshirn, Guoqing Jiang
In-Reply-To: <1487176523-109075-2-git-send-email-colyli@suse.de>

On Thu, Feb 16, 2017 at 12:35:23AM +0800, colyli@suse.de wrote:
> When I run a parallel reading performan testing on a md raid1 device with
> two NVMe SSDs, I observe very bad throughput in supprise: by fio with 64KB
> block size, 40 seq read I/O jobs, 128 iodepth, overall throughput is
> only 2.7GB/s, this is around 50% of the idea performance number.
> 
> The perf reports locking contention happens at allow_barrier() and
> wait_barrier() code,
>  - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
>    - _raw_spin_lock_irqsave
>          + 89.92% allow_barrier
>          + 9.34% __wake_up
>  - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
>    - _raw_spin_lock_irq
>          - 100.00% wait_barrier
> 
> The reason is, in these I/O barrier related functions,
>  - raise_barrier()
>  - lower_barrier()
>  - wait_barrier()
>  - allow_barrier()
> They always hold conf->resync_lock firstly, even there are only regular
> reading I/Os and no resync I/O at all. This is a huge performance penalty.
> 
> The solution is a lockless-like algorithm in I/O barrier code, and only
> holding conf->resync_lock when it has to.
> 
> The original idea is from Hannes Reinecke, and Neil Brown provides
> comments to improve it. I continue to work on it, and make the patch into
> current form.
> 
> In the new simpler raid1 I/O barrier implementation, there are two
> wait barrier functions,
>  - wait_barrier()
>    Which calls _wait_barrier(), is used for regular write I/O. If there is
>    resync I/O happening on the same I/O barrier bucket, or the whole
>    array is frozen, task will wait until no barrier on same barrier bucket,
>    or the whold array is unfreezed.
>  - wait_read_barrier()
>    Since regular read I/O won't interfere with resync I/O (read_balance()
>    will make sure only uptodate data will be read out), it is unnecessary
>    to wait for barrier in regular read I/Os, waiting in only necessary
>    when the whole array is frozen.
> 
> The operations on conf->nr_pending[idx], conf->nr_waiting[idx], conf->
> barrier[idx] are very carefully designed in raise_barrier(),
> lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
> avoid unnecessary spin locks in these functions. Once conf->
> nr_pengding[idx] is increased, a resync I/O with same barrier bucket index
> has to wait in raise_barrier(). Then in _wait_barrier() if no barrier
> raised in same barrier bucket index and array is not frozen, the regular
> I/O doesn't need to hold conf->resync_lock, it can just increase
> conf->nr_pending[idx], and return to its caller. wait_read_barrier() is
> very similar to _wait_barrier(), the only difference is it only waits when
> array is frozen. For heavy parallel reading I/Os, the lockless I/O barrier
> code almostly gets rid of all spin lock cost.
> 
> This patch significantly improves raid1 reading peroformance. From my
> testing, a raid1 device built by two NVMe SSD, runs fio with 64KB
> blocksize, 40 seq read I/O jobs, 128 iodepth, overall throughput
> increases from 2.7GB/s to 4.6GB/s (+70%).
> 
> Changelog
> V3:
> - Add smp_mb__after_atomic() as Shaohua and Neil suggested.
> - Change conf->nr_queued[] from atomic_t to int.

I missed this part. In the code, the nr_queued sometimes is protected by
device_lock, sometimes (raid1d) no protection at all. Can you explain this?

Thanks,
Shaohua

^ permalink raw reply

* [PATCH] async_tx: deprecate broken support for channel switching
From: Dan Williams @ 2017-02-16  2:42 UTC (permalink / raw)
  To: vinod.koul
  Cc: Thomas Petazzoni, Anup Patel, Rameshwar Prasad Sahu, Russell King,
	linux-kernel, linux-raid, dmaengine, Anatolij Gustschin,
	Saeed Bishara

Back in 2011, Russell pointed out that the "async_tx channel switch"
capability was violating expectations of the dma mapping api [1]. At the
time the existing uses were reviewed as still usable, but that longer
term we needed a rework of the raid offload implementation. While some
of the framework for a fixed implementation was introduced in 2012 [2],
the wider rewrite never materialized.

There continues to be interest in raid offload with new dma/raid engine
drivers being submitted. Those drivers must not build on top of the
broken channel switching capability.

Prevent async_tx from using an offload engine if the channel switching
capability is enabled. This still allows the engine to be used for other
purposes, but the broken way async_tx uses these engines for raid will
be disabled. For configurations where this causes a performance
regression the only solution is to start the work of eliminating the
async_tx api and moving channel management into the raid code directly
where it can manage marshalling an operation stream between multiple dma
channels.

[1]: http://lists.infradead.org/pipermail/linux-arm-kernel/2011-January/036753.html
[2]: https://lkml.org/lkml/2012/12/6/71

Cc: Anatolij Gustschin <agust@denx.de>
Cc: Anup Patel <anup.patel@broadcom.com>
Cc: Rameshwar Prasad Sahu <rsahu@apm.com>
Cc: Saeed Bishara <saeed.bishara@gmail.com>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Reported-by: Russell King <linux@armlinux.org.uk>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/async_tx.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/async_tx.h b/include/linux/async_tx.h
index 388574ea38ed..28e3cf1465ab 100644
--- a/include/linux/async_tx.h
+++ b/include/linux/async_tx.h
@@ -87,7 +87,7 @@ struct async_submit_ctl {
 	void *scribble;
 };

-#ifdef CONFIG_DMA_ENGINE
+#if defined(CONFIG_DMA_ENGINE) && !defined(CONFIG_ASYNC_TX_CHANNEL_SWITCH)
 #define async_tx_issue_pending_all dma_issue_pending_all

 /**

^ permalink raw reply related

* [md PATCH 00/14] remove all abuse of bi_phys_segments
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch

This series removes all abuse of bi_phys_segments from
md/raid.
Most of the effort is for raid5 - raid1 and raid10 are
comparatively straight forward.

I've also include some optimization of md_write_start() which is now
used more heavily.  This is the most "interesting" code, so it is at
the end, allowing it to easily be postponed.

The raid1 parts will conflict with Coly's raid1 resync changes.  The
conflicts aren't serious.  Which ever goes in last can easily be
adjusted to work with the other.

As you can see below, we save over 100 lines of code :-)

I've reviewed these a couple of times and done some light testing.
More review and testing would be welcome.

Thanks,
NeilBrown



---

NeilBrown (14):
      md/raid5: use md_write_start to count stripes, not bios
      md/raid5: simplfy delaying of writes while metadata is updated.
      md/raid5: call bio_endio() directly rather than queueing for later.
      block: trace completion of all bios.
      md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter
      md/raid5: remove over-loading of ->bi_phys_segments.
      Revert "md/raid5: limit request size according to implementation limits"
      md/raid1,raid10: move rXbio accounting closer to allocation.
      md/raid10: stop using bi_phys_segments
      md/raid1: stop using bi_phys_segment
      md/raid5: don't test ->writes_pending in raid5_remove_disk
      md: factor out set_in_sync()
      md: close a race with setting mddev->in_sync
      MD: use per-cpu counter for writes_pending


 block/bio.c              |    3 +
 drivers/md/dm.c          |    1 
 drivers/md/md.c          |  161 ++++++++++++++++++++++++++++++++------------
 drivers/md/md.h          |    3 +
 drivers/md/raid1.c       |  106 ++++++++++-------------------
 drivers/md/raid10.c      |   82 +++++++----------------
 drivers/md/raid5-cache.c |   14 +---
 drivers/md/raid5.c       |  167 ++++++++++++----------------------------------
 drivers/md/raid5.h       |   51 +-------------
 9 files changed, 235 insertions(+), 353 deletions(-)

--
Signature


^ permalink raw reply

* [md PATCH 01/14] md/raid5: use md_write_start to count stripes, not bios
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch
In-Reply-To: <148721992248.7521.17160361058957519076.stgit@noble>

We use md_write_start() to increase the count of pending writes, and
md_write_end() to decrement the count.  We currently count bios
submitted to md/raid5.  Change it count stripe_heads that a WRITE bio
has been attached to.

So now, raid5_make_request() calls md_write_start() and then
md_write_end() to keep the count elevated during the setup of the
request.

add_stripe_bio() calls md_write_start() for each stripe_head, and the
completion routines always call md_write_end(), instead of only
calling it when raid5_dec_bi_active_stripes() returns 0.
make_discard_request also calls md_write_start/end().

The parallel between md_write_{start,end} and use of bi_phys_segments
can be seen in that:
 Whenever we set bi_phys_segments to 1, we now call md_write_start.
 Whenever we increment it on non-read requests with
   raid5_inc_bi_active_stripes(), we now call md_write_start().
 Whenever we decrement bi_phys_segments on non-read requsts with
    raid5_dec_bi_active_stripes(), we now call md_write_end().

This reduces our dependence on keeping a per-bio count of active
stripes in bi_phys_segments.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5-cache.c |    2 +-
 drivers/md/raid5.c       |   27 +++++++++++++--------------
 2 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 302dea3296ba..4b211f914d17 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -265,8 +265,8 @@ r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev,
 	while (wbi && wbi->bi_iter.bi_sector <
 	       dev->sector + STRIPE_SECTORS) {
 		wbi2 = r5_next_bio(wbi, dev->sector);
+		md_write_end(conf->mddev);
 		if (!raid5_dec_bi_active_stripes(wbi)) {
-			md_write_end(conf->mddev);
 			bio_list_add(return_bi, wbi);
 		}
 		wbi = wbi2;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3c7e106c12a2..760b726943c9 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3075,6 +3075,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 		bi->bi_next = *bip;
 	*bip = bi;
 	raid5_inc_bi_active_stripes(bi);
+	md_write_start(conf->mddev, bi);
 
 	if (forwrite) {
 		/* check if page is covered */
@@ -3198,10 +3199,9 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
 
 			bi->bi_error = -EIO;
-			if (!raid5_dec_bi_active_stripes(bi)) {
-				md_write_end(conf->mddev);
+			md_write_end(conf->mddev);
+			if (!raid5_dec_bi_active_stripes(bi))
 				bio_list_add(return_bi, bi);
-			}
 			bi = nextbi;
 		}
 		if (bitmap_end)
@@ -3222,10 +3222,9 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector);
 
 			bi->bi_error = -EIO;
-			if (!raid5_dec_bi_active_stripes(bi)) {
-				md_write_end(conf->mddev);
+			md_write_end(conf->mddev);
+			if (!raid5_dec_bi_active_stripes(bi))
 				bio_list_add(return_bi, bi);
-			}
 			bi = bi2;
 		}
 
@@ -3582,10 +3581,9 @@ static void handle_stripe_clean_event(struct r5conf *conf,
 				while (wbi && wbi->bi_iter.bi_sector <
 					dev->sector + STRIPE_SECTORS) {
 					wbi2 = r5_next_bio(wbi, dev->sector);
-					if (!raid5_dec_bi_active_stripes(wbi)) {
-						md_write_end(conf->mddev);
+					md_write_end(conf->mddev);
+					if (!raid5_dec_bi_active_stripes(wbi))
 						bio_list_add(return_bi, wbi);
-					}
 					wbi = wbi2;
 				}
 				bitmap_endwrite(conf->mddev->bitmap, sh->sector,
@@ -5268,6 +5266,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 
 	bi->bi_next = NULL;
 	bi->bi_phys_segments = 1; /* over-loaded to count active stripes */
+	md_write_start(mddev, bi);
 
 	stripe_sectors = conf->chunk_sectors *
 		(conf->raid_disks - conf->max_degraded);
@@ -5314,6 +5313,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 			sh->dev[d].towrite = bi;
 			set_bit(R5_OVERWRITE, &sh->dev[d].flags);
 			raid5_inc_bi_active_stripes(bi);
+			md_write_start(mddev, bi);
 			sh->overwrite_disks++;
 		}
 		spin_unlock_irq(&sh->stripe_lock);
@@ -5336,9 +5336,9 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 		release_stripe_plug(mddev, sh);
 	}
 
+	md_write_end(mddev);
 	remaining = raid5_dec_bi_active_stripes(bi);
 	if (remaining == 0) {
-		md_write_end(mddev);
 		bio_endio(bi);
 	}
 }
@@ -5373,8 +5373,6 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 		do_flush = bi->bi_opf & REQ_PREFLUSH;
 	}
 
-	md_write_start(mddev, bi);
-
 	/*
 	 * If array is degraded, better not do chunk aligned read because
 	 * later we might have to read it again in order to reconstruct
@@ -5397,6 +5395,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	last_sector = bio_end_sector(bi);
 	bi->bi_next = NULL;
 	bi->bi_phys_segments = 1;	/* over-loaded to count active stripes */
+	md_write_start(mddev, bi);
 
 	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
@@ -5531,11 +5530,11 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	}
 	finish_wait(&conf->wait_for_overlap, &w);
 
+	if (rw == WRITE)
+		md_write_end(mddev);
 	remaining = raid5_dec_bi_active_stripes(bi);
 	if (remaining == 0) {
 
-		if ( rw == WRITE )
-			md_write_end(mddev);
 
 		trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
 					 bi, 0);



^ permalink raw reply related

* [md PATCH 02/14] md/raid5: simplfy delaying of writes while metadata is updated.
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch
In-Reply-To: <148721992248.7521.17160361058957519076.stgit@noble>

If a device fails during a write, we must ensure the failure is
recorded in the metadata before the completion of the write is
acknowleged.

Commit c3cce6cda162 ("md/raid5: ensure device failure recorded before
write request returns.")  added code for this, but it was
unnecessarily complicated.  We already had similar functionality for
handling updates to the bad-block-list, thanks to Commit de393cdea66c
("md: make it easier to wait for bad blocks to be acknowledged.")

So revert most of the former commit, and instead avoid collecting
completed writes if MD_CHANGE_PENDING is set.  raid5d() will then flush
the metadata and retry the stripe_head.

We check MD_CHANGE_PENDING *after* analyse_stripe() as it could be set
asynchronously.  After analyse_stripe(), we have collected stable data
about the state of devices, which will be used to make decisions.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5.c |   27 ++++-----------------------
 drivers/md/raid5.h |    3 ---
 2 files changed, 4 insertions(+), 26 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 760b726943c9..154593e0afbe 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4492,7 +4492,8 @@ static void handle_stripe(struct stripe_head *sh)
 	if (test_bit(STRIPE_LOG_TRAPPED, &sh->state))
 		goto finish;
 
-	if (s.handle_bad_blocks) {
+	if (s.handle_bad_blocks ||
+	    test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
 		set_bit(STRIPE_HANDLE, &sh->state);
 		goto finish;
 	}
@@ -4822,15 +4823,8 @@ static void handle_stripe(struct stripe_head *sh)
 			md_wakeup_thread(conf->mddev->thread);
 	}
 
-	if (!bio_list_empty(&s.return_bi)) {
-		if (test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
-			spin_lock_irq(&conf->device_lock);
-			bio_list_merge(&conf->return_bi, &s.return_bi);
-			spin_unlock_irq(&conf->device_lock);
-			md_wakeup_thread(conf->mddev->thread);
-		} else
-			return_io(&s.return_bi);
-	}
+	if (!bio_list_empty(&s.return_bi))
+		return_io(&s.return_bi);
 
 	clear_bit_unlock(STRIPE_ACTIVE, &sh->state);
 }
@@ -6055,18 +6049,6 @@ static void raid5d(struct md_thread *thread)
 
 	md_check_recovery(mddev);
 
-	if (!bio_list_empty(&conf->return_bi) &&
-	    !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
-		struct bio_list tmp = BIO_EMPTY_LIST;
-		spin_lock_irq(&conf->device_lock);
-		if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
-			bio_list_merge(&tmp, &conf->return_bi);
-			bio_list_init(&conf->return_bi);
-		}
-		spin_unlock_irq(&conf->device_lock);
-		return_io(&tmp);
-	}
-
 	blk_start_plug(&plug);
 	handled = 0;
 	spin_lock_irq(&conf->device_lock);
@@ -6705,7 +6687,6 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 	INIT_LIST_HEAD(&conf->hold_list);
 	INIT_LIST_HEAD(&conf->delayed_list);
 	INIT_LIST_HEAD(&conf->bitmap_list);
-	bio_list_init(&conf->return_bi);
 	init_llist_head(&conf->released_stripes);
 	atomic_set(&conf->active_stripes, 0);
 	atomic_set(&conf->preread_active_stripes, 0);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 1440fa26e296..1ef71273b9b2 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -621,9 +621,6 @@ struct r5conf {
 	int			skip_copy; /* Don't copy data from bio to stripe cache */
 	struct list_head	*last_hold; /* detect hold_list promotions */
 
-	/* bios to have bi_end_io called after metadata is synced */
-	struct bio_list		return_bi;
-
 	atomic_t		reshape_stripes; /* stripes with pending writes for reshape */
 	/* unfortunately we need two cache names as we temporarily have
 	 * two caches.



^ permalink raw reply related

* [md PATCH 03/14] md/raid5: call bio_endio() directly rather than queueing for later.
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch
In-Reply-To: <148721992248.7521.17160361058957519076.stgit@noble>

We currently gather bios that need to be returned into a bio_list
and call bio_endio() on them all together.
The original reason for this was to avoid making the calls while
holding a spinlock.
Locking has changed a lot since then, and that reason is no longer
valid.

So discard return_io() and various return_bi lists, and just call
bio_endio() directly as needed.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5-cache.c |   13 +++++--------
 drivers/md/raid5.c       |   38 ++++++++++----------------------------
 drivers/md/raid5.h       |    3 +--
 3 files changed, 16 insertions(+), 38 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 4b211f914d17..1b972a172800 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -255,8 +255,7 @@ static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
 }
 
 static void
-r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev,
-			      struct bio_list *return_bi)
+r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev)
 {
 	struct bio *wbi, *wbi2;
 
@@ -266,23 +265,21 @@ r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev,
 	       dev->sector + STRIPE_SECTORS) {
 		wbi2 = r5_next_bio(wbi, dev->sector);
 		md_write_end(conf->mddev);
-		if (!raid5_dec_bi_active_stripes(wbi)) {
-			bio_list_add(return_bi, wbi);
-		}
+		if (!raid5_dec_bi_active_stripes(wbi))
+			bio_endio(wbi);
 		wbi = wbi2;
 	}
 }
 
 void r5c_handle_cached_data_endio(struct r5conf *conf,
-	  struct stripe_head *sh, int disks, struct bio_list *return_bi)
+				  struct stripe_head *sh, int disks)
 {
 	int i;
 
 	for (i = sh->disks; i--; ) {
 		if (sh->dev[i].written) {
 			set_bit(R5_UPTODATE, &sh->dev[i].flags);
-			r5c_return_dev_pending_writes(conf, &sh->dev[i],
-						      return_bi);
+			r5c_return_dev_pending_writes(conf, &sh->dev[i]);
 			bitmap_endwrite(conf->mddev->bitmap, sh->sector,
 					STRIPE_SECTORS,
 					!test_bit(STRIPE_DEGRADED, &sh->state),
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 154593e0afbe..a6edca65b561 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -154,17 +154,6 @@ static int raid6_idx_to_slot(int idx, struct stripe_head *sh,
 	return slot;
 }
 
-static void return_io(struct bio_list *return_bi)
-{
-	struct bio *bi;
-	while ((bi = bio_list_pop(return_bi)) != NULL) {
-		bi->bi_iter.bi_size = 0;
-		trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
-					 bi, 0);
-		bio_endio(bi);
-	}
-}
-
 static void print_raid5_conf (struct r5conf *conf);
 
 static int stripe_operations_active(struct stripe_head *sh)
@@ -1175,7 +1164,6 @@ async_copy_data(int frombio, struct bio *bio, struct page **page,
 static void ops_complete_biofill(void *stripe_head_ref)
 {
 	struct stripe_head *sh = stripe_head_ref;
-	struct bio_list return_bi = BIO_EMPTY_LIST;
 	int i;
 
 	pr_debug("%s: stripe %llu\n", __func__,
@@ -1200,15 +1188,13 @@ static void ops_complete_biofill(void *stripe_head_ref)
 				dev->sector + STRIPE_SECTORS) {
 				rbi2 = r5_next_bio(rbi, dev->sector);
 				if (!raid5_dec_bi_active_stripes(rbi))
-					bio_list_add(&return_bi, rbi);
+					bio_endio(rbi);
 				rbi = rbi2;
 			}
 		}
 	}
 	clear_bit(STRIPE_BIOFILL_RUN, &sh->state);
 
-	return_io(&return_bi);
-
 	set_bit(STRIPE_HANDLE, &sh->state);
 	raid5_release_stripe(sh);
 }
@@ -3152,8 +3138,7 @@ static void stripe_set_idx(sector_t stripe, struct r5conf *conf, int previous,
 
 static void
 handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
-				struct stripe_head_state *s, int disks,
-				struct bio_list *return_bi)
+		     struct stripe_head_state *s, int disks)
 {
 	int i;
 	BUG_ON(sh->batch_head);
@@ -3201,7 +3186,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			bi->bi_error = -EIO;
 			md_write_end(conf->mddev);
 			if (!raid5_dec_bi_active_stripes(bi))
-				bio_list_add(return_bi, bi);
+				bio_endio(bi);
 			bi = nextbi;
 		}
 		if (bitmap_end)
@@ -3224,7 +3209,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			bi->bi_error = -EIO;
 			md_write_end(conf->mddev);
 			if (!raid5_dec_bi_active_stripes(bi))
-				bio_list_add(return_bi, bi);
+				bio_endio(bi);
 			bi = bi2;
 		}
 
@@ -3250,7 +3235,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 
 				bi->bi_error = -EIO;
 				if (!raid5_dec_bi_active_stripes(bi))
-					bio_list_add(return_bi, bi);
+					bio_endio(bi);
 				bi = nextbi;
 			}
 		}
@@ -3549,7 +3534,7 @@ static void break_stripe_batch_list(struct stripe_head *head_sh,
  * never LOCKED, so we don't need to test 'failed' directly.
  */
 static void handle_stripe_clean_event(struct r5conf *conf,
-	struct stripe_head *sh, int disks, struct bio_list *return_bi)
+	struct stripe_head *sh, int disks)
 {
 	int i;
 	struct r5dev *dev;
@@ -3583,7 +3568,7 @@ static void handle_stripe_clean_event(struct r5conf *conf,
 					wbi2 = r5_next_bio(wbi, dev->sector);
 					md_write_end(conf->mddev);
 					if (!raid5_dec_bi_active_stripes(wbi))
-						bio_list_add(return_bi, wbi);
+						bio_endio(wbi);
 					wbi = wbi2;
 				}
 				bitmap_endwrite(conf->mddev->bitmap, sh->sector,
@@ -4526,7 +4511,7 @@ static void handle_stripe(struct stripe_head *sh)
 		sh->reconstruct_state = 0;
 		break_stripe_batch_list(sh, 0);
 		if (s.to_read+s.to_write+s.written)
-			handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
+			handle_failed_stripe(conf, sh, &s, disks);
 		if (s.syncing + s.replacing)
 			handle_failed_sync(conf, sh, &s);
 	}
@@ -4592,10 +4577,10 @@ static void handle_stripe(struct stripe_head *sh)
 			     && !test_bit(R5_LOCKED, &qdev->flags)
 			     && (test_bit(R5_UPTODATE, &qdev->flags) ||
 				 test_bit(R5_Discard, &qdev->flags))))))
-		handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
+		handle_stripe_clean_event(conf, sh, disks);
 
 	if (s.just_cached)
-		r5c_handle_cached_data_endio(conf, sh, disks, &s.return_bi);
+		r5c_handle_cached_data_endio(conf, sh, disks);
 	r5l_stripe_write_finished(sh);
 
 	/* Now we might consider reading some blocks, either to check/generate
@@ -4823,9 +4808,6 @@ static void handle_stripe(struct stripe_head *sh)
 			md_wakeup_thread(conf->mddev->thread);
 	}
 
-	if (!bio_list_empty(&s.return_bi))
-		return_io(&s.return_bi);
-
 	clear_bit_unlock(STRIPE_ACTIVE, &sh->state);
 }
 
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 1ef71273b9b2..9051631cb7f0 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -272,7 +272,6 @@ struct stripe_head_state {
 	int dec_preread_active;
 	unsigned long ops_request;
 
-	struct bio_list return_bi;
 	struct md_rdev *blocked_rdev;
 	int handle_bad_blocks;
 	int log_failed;
@@ -776,7 +775,7 @@ extern void r5c_release_extra_page(struct stripe_head *sh);
 extern void r5c_use_extra_page(struct stripe_head *sh);
 extern void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
 extern void r5c_handle_cached_data_endio(struct r5conf *conf,
-	struct stripe_head *sh, int disks, struct bio_list *return_bi);
+	struct stripe_head *sh, int disks);
 extern int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
 			  struct stripe_head_state *s);
 extern void r5c_make_stripe_write_out(struct stripe_head *sh);



^ permalink raw reply related

* [md PATCH 04/14] block: trace completion of all bios.
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch
In-Reply-To: <148721992248.7521.17160361058957519076.stgit@noble>

Currently only dm and raid5 bios trigger trace_block_bio_complete.
Rather than sprinkling trace calls around all drivers that might
want them, add the trace call to bio_endio() so that all
drivers benefit.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/bio.c        |    3 +++
 drivers/md/dm.c    |    1 -
 drivers/md/raid5.c |    8 --------
 3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 2b375020fc49..984c0afa935a 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1813,6 +1813,9 @@ void bio_endio(struct bio *bio)
 		goto again;
 	}
 
+	if (bio->bi_bdev)
+		trace_block_bio_complete(bdev_get_queue(bio->bi_bdev),
+					 bio, bio->bi_error);
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio);
 }
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3086da5664f3..e151aefa0917 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -807,7 +807,6 @@ static void dec_pending(struct dm_io *io, int error)
 			queue_io(md, bio);
 		} else {
 			/* done with normal IO or empty flush */
-			trace_block_bio_complete(md->queue, bio, io_error);
 			bio->bi_error = io_error;
 			bio_endio(bio);
 		}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a6edca65b561..2fbf939c1a15 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4943,8 +4943,6 @@ static void raid5_align_endio(struct bio *bi)
 	rdev_dec_pending(rdev, conf->mddev);
 
 	if (!error) {
-		trace_block_bio_complete(bdev_get_queue(raid_bi->bi_bdev),
-					 raid_bi, 0);
 		bio_endio(raid_bi);
 		if (atomic_dec_and_test(&conf->active_aligned_reads))
 			wake_up(&conf->wait_for_quiescent);
@@ -5510,10 +5508,6 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 		md_write_end(mddev);
 	remaining = raid5_dec_bi_active_stripes(bi);
 	if (remaining == 0) {
-
-
-		trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
-					 bi, 0);
 		bio_endio(bi);
 	}
 }
@@ -5921,8 +5915,6 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 	}
 	remaining = raid5_dec_bi_active_stripes(raid_bio);
 	if (remaining == 0) {
-		trace_block_bio_complete(bdev_get_queue(raid_bio->bi_bdev),
-					 raid_bio, 0);
 		bio_endio(raid_bio);
 	}
 	if (atomic_dec_and_test(&conf->active_aligned_reads))



^ permalink raw reply related

* [md PATCH 05/14] md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch
In-Reply-To: <148721992248.7521.17160361058957519076.stgit@noble>

md/raid5 needs to keep track of how many stripe_heads are processing a
bio so that it can delay calling bio_endio() until all stripe_heads
have completed.  It currently uses 16 bits of ->bi_phys_segments for
this purpose.

16 bits is only enough for 256M requests, and it is possible for a
single bio to be larger than this, which causes problems.  Also, the
bio struct contains a larger counter, __bi_remaining, which has a
purpose very similar to the purpose of our counter.  So stop using
->bi_phys_segments, and instead use __bi_remaining.

This means we don't need to initialize the counter, as our caller
initializes it to '1'.  It also means we can call bio_endio() directly
as it tests this counter internally.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5-cache.c |    3 +--
 drivers/md/raid5.c       |   50 +++++++++++-----------------------------------
 drivers/md/raid5.h       |   17 +---------------
 3 files changed, 14 insertions(+), 56 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 1b972a172800..5c282ae71cbd 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -265,8 +265,7 @@ r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev)
 	       dev->sector + STRIPE_SECTORS) {
 		wbi2 = r5_next_bio(wbi, dev->sector);
 		md_write_end(conf->mddev);
-		if (!raid5_dec_bi_active_stripes(wbi))
-			bio_endio(wbi);
+		bio_endio(wbi);
 		wbi = wbi2;
 	}
 }
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 2fbf939c1a15..905abf081acf 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1187,8 +1187,7 @@ static void ops_complete_biofill(void *stripe_head_ref)
 			while (rbi && rbi->bi_iter.bi_sector <
 				dev->sector + STRIPE_SECTORS) {
 				rbi2 = r5_next_bio(rbi, dev->sector);
-				if (!raid5_dec_bi_active_stripes(rbi))
-					bio_endio(rbi);
+				bio_endio(rbi);
 				rbi = rbi2;
 			}
 		}
@@ -3027,14 +3026,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 		(unsigned long long)bi->bi_iter.bi_sector,
 		(unsigned long long)sh->sector);
 
-	/*
-	 * If several bio share a stripe. The bio bi_phys_segments acts as a
-	 * reference count to avoid race. The reference count should already be
-	 * increased before this function is called (for example, in
-	 * raid5_make_request()), so other bio sharing this stripe will not free the
-	 * stripe. If a stripe is owned by one stripe, the stripe lock will
-	 * protect it.
-	 */
 	spin_lock_irq(&sh->stripe_lock);
 	/* Don't allow new IO added to stripes in batch list */
 	if (sh->batch_head)
@@ -3060,7 +3051,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 	if (*bip)
 		bi->bi_next = *bip;
 	*bip = bi;
-	raid5_inc_bi_active_stripes(bi);
+	bio_inc_remaining(bi);
 	md_write_start(conf->mddev, bi);
 
 	if (forwrite) {
@@ -3185,8 +3176,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 
 			bi->bi_error = -EIO;
 			md_write_end(conf->mddev);
-			if (!raid5_dec_bi_active_stripes(bi))
-				bio_endio(bi);
+			bio_endio(bi);
 			bi = nextbi;
 		}
 		if (bitmap_end)
@@ -3208,8 +3198,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 
 			bi->bi_error = -EIO;
 			md_write_end(conf->mddev);
-			if (!raid5_dec_bi_active_stripes(bi))
-				bio_endio(bi);
+			bio_endio(bi);
 			bi = bi2;
 		}
 
@@ -3234,8 +3223,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 					r5_next_bio(bi, sh->dev[i].sector);
 
 				bi->bi_error = -EIO;
-				if (!raid5_dec_bi_active_stripes(bi))
-					bio_endio(bi);
+				bio_endio(bi);
 				bi = nextbi;
 			}
 		}
@@ -3567,8 +3555,7 @@ static void handle_stripe_clean_event(struct r5conf *conf,
 					dev->sector + STRIPE_SECTORS) {
 					wbi2 = r5_next_bio(wbi, dev->sector);
 					md_write_end(conf->mddev);
-					if (!raid5_dec_bi_active_stripes(wbi))
-						bio_endio(wbi);
+					bio_endio(wbi);
 					wbi = wbi2;
 				}
 				bitmap_endwrite(conf->mddev->bitmap, sh->sector,
@@ -4913,7 +4900,7 @@ static struct bio *remove_bio_from_retry(struct r5conf *conf)
 		 * this sets the active strip count to 1 and the processed
 		 * strip count to zero (upper 8 bits)
 		 */
-		raid5_set_bi_stripes(bi, 1); /* biased count of active stripes */
+		raid5_set_bi_processed_stripes(bi, 0);
 	}
 
 	return bi;
@@ -5228,7 +5215,6 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 	struct r5conf *conf = mddev->private;
 	sector_t logical_sector, last_sector;
 	struct stripe_head *sh;
-	int remaining;
 	int stripe_sectors;
 
 	if (mddev->reshape_position != MaxSector)
@@ -5239,7 +5225,6 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 	last_sector = bi->bi_iter.bi_sector + (bi->bi_iter.bi_size>>9);
 
 	bi->bi_next = NULL;
-	bi->bi_phys_segments = 1; /* over-loaded to count active stripes */
 	md_write_start(mddev, bi);
 
 	stripe_sectors = conf->chunk_sectors *
@@ -5286,7 +5271,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 				continue;
 			sh->dev[d].towrite = bi;
 			set_bit(R5_OVERWRITE, &sh->dev[d].flags);
-			raid5_inc_bi_active_stripes(bi);
+			bio_inc_remaining(bi);
 			md_write_start(mddev, bi);
 			sh->overwrite_disks++;
 		}
@@ -5311,10 +5296,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 	}
 
 	md_write_end(mddev);
-	remaining = raid5_dec_bi_active_stripes(bi);
-	if (remaining == 0) {
-		bio_endio(bi);
-	}
+	bio_endio(bi);
 }
 
 static void raid5_make_request(struct mddev *mddev, struct bio * bi)
@@ -5325,7 +5307,6 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	sector_t logical_sector, last_sector;
 	struct stripe_head *sh;
 	const int rw = bio_data_dir(bi);
-	int remaining;
 	DEFINE_WAIT(w);
 	bool do_prepare;
 	bool do_flush = false;
@@ -5368,7 +5349,6 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	logical_sector = bi->bi_iter.bi_sector & ~((sector_t)STRIPE_SECTORS-1);
 	last_sector = bio_end_sector(bi);
 	bi->bi_next = NULL;
-	bi->bi_phys_segments = 1;	/* over-loaded to count active stripes */
 	md_write_start(mddev, bi);
 
 	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
@@ -5506,10 +5486,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 
 	if (rw == WRITE)
 		md_write_end(mddev);
-	remaining = raid5_dec_bi_active_stripes(bi);
-	if (remaining == 0) {
-		bio_endio(bi);
-	}
+	bio_endio(bi);
 }
 
 static sector_t raid5_size(struct mddev *mddev, sector_t sectors, int raid_disks);
@@ -5874,7 +5851,6 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 	int dd_idx;
 	sector_t sector, logical_sector, last_sector;
 	int scnt = 0;
-	int remaining;
 	int handled = 0;
 
 	logical_sector = raid_bio->bi_iter.bi_sector &
@@ -5913,10 +5889,8 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 		raid5_release_stripe(sh);
 		handled++;
 	}
-	remaining = raid5_dec_bi_active_stripes(raid_bio);
-	if (remaining == 0) {
-		bio_endio(raid_bio);
-	}
+	bio_endio(raid_bio);
+
 	if (atomic_dec_and_test(&conf->active_aligned_reads))
 		wake_up(&conf->wait_for_quiescent);
 	return handled;
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 9051631cb7f0..3018a33693ab 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -481,8 +481,7 @@ static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
 }
 
 /*
- * We maintain a biased count of active stripes in the bottom 16 bits of
- * bi_phys_segments, and a count of processed stripes in the upper 16 bits
+ * We maintain a count of processed stripes in the upper 16 bits
  */
 static inline int raid5_bi_processed_stripes(struct bio *bio)
 {
@@ -491,20 +490,6 @@ static inline int raid5_bi_processed_stripes(struct bio *bio)
 	return (atomic_read(segments) >> 16) & 0xffff;
 }
 
-static inline int raid5_dec_bi_active_stripes(struct bio *bio)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-
-	return atomic_sub_return(1, segments) & 0xffff;
-}
-
-static inline void raid5_inc_bi_active_stripes(struct bio *bio)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-
-	atomic_inc(segments);
-}
-
 static inline void raid5_set_bi_processed_stripes(struct bio *bio,
 	unsigned int cnt)
 {



^ permalink raw reply related

* [md PATCH 06/14] md/raid5: remove over-loading of ->bi_phys_segments.
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch
In-Reply-To: <148721992248.7521.17160361058957519076.stgit@noble>

When a read request, which bypassed the cache, fails, we need to retry
it through the cache.
This involves attaching it to a sequence of stripe_heads, and it may not
be possible to get all the stripe_heads we need at once.
We do what we can, and record how far we got in ->bi_phys_segments so
we can pick up again later.

There is only ever one bio which may have a non-zero offset stored in
->bi_phys_segments, the one that is either active in the single thread
which calls retry_aligned_read(), or is in conf->retry_read_aligned
waiting for retry_aligned_read() to be called again.

So we only need to store one offset value.  This can be in a local
variable passed between remove_bio_from_retry() and
retry_aligned_read(), or in the r5conf structure next to the
->retry_read_aligned pointer.

Storing it there allows the last usage of ->bi_phys_segments to be
removed from md/raid5.c.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5.c |   24 ++++++++++++------------
 drivers/md/raid5.h |   30 +-----------------------------
 2 files changed, 13 insertions(+), 41 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 905abf081acf..f93e8fddbb23 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4883,12 +4883,14 @@ static void add_bio_to_retry(struct bio *bi,struct r5conf *conf)
 	md_wakeup_thread(conf->mddev->thread);
 }
 
-static struct bio *remove_bio_from_retry(struct r5conf *conf)
+static struct bio *remove_bio_from_retry(struct r5conf *conf,
+					 unsigned int *offset)
 {
 	struct bio *bi;
 
 	bi = conf->retry_read_aligned;
 	if (bi) {
+		*offset = conf->retry_read_offset;
 		conf->retry_read_aligned = NULL;
 		return bi;
 	}
@@ -4896,11 +4898,7 @@ static struct bio *remove_bio_from_retry(struct r5conf *conf)
 	if(bi) {
 		conf->retry_read_aligned_list = bi->bi_next;
 		bi->bi_next = NULL;
-		/*
-		 * this sets the active strip count to 1 and the processed
-		 * strip count to zero (upper 8 bits)
-		 */
-		raid5_set_bi_processed_stripes(bi, 0);
+		*offset = 0;
 	}
 
 	return bi;
@@ -5835,7 +5833,8 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n
 	return STRIPE_SECTORS;
 }
 
-static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
+static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio,
+			       unsigned int offset)
 {
 	/* We may not be able to submit a whole bio at once as there
 	 * may not be enough stripe_heads available.
@@ -5864,7 +5863,7 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 		     sector += STRIPE_SECTORS,
 		     scnt++) {
 
-		if (scnt < raid5_bi_processed_stripes(raid_bio))
+		if (scnt < offset)
 			/* already done this stripe */
 			continue;
 
@@ -5872,15 +5871,15 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 
 		if (!sh) {
 			/* failed to get a stripe - must wait */
-			raid5_set_bi_processed_stripes(raid_bio, scnt);
 			conf->retry_read_aligned = raid_bio;
+			conf->retry_read_offset = scnt;
 			return handled;
 		}
 
 		if (!add_stripe_bio(sh, raid_bio, dd_idx, 0, 0)) {
 			raid5_release_stripe(sh);
-			raid5_set_bi_processed_stripes(raid_bio, scnt);
 			conf->retry_read_aligned = raid_bio;
+			conf->retry_read_offset = scnt;
 			return handled;
 		}
 
@@ -6003,6 +6002,7 @@ static void raid5d(struct md_thread *thread)
 	while (1) {
 		struct bio *bio;
 		int batch_size, released;
+		unsigned int offset;
 
 		released = release_stripe_list(conf, conf->temp_inactive_list);
 		if (released)
@@ -6020,10 +6020,10 @@ static void raid5d(struct md_thread *thread)
 		}
 		raid5_activate_delayed(conf);
 
-		while ((bio = remove_bio_from_retry(conf))) {
+		while ((bio = remove_bio_from_retry(conf, &offset))) {
 			int ok;
 			spin_unlock_irq(&conf->device_lock);
-			ok = retry_aligned_read(conf, bio);
+			ok = retry_aligned_read(conf, bio, offset);
 			spin_lock_irq(&conf->device_lock);
 			if (!ok)
 				break;
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 3018a33693ab..a4ef02176afb 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -480,35 +480,6 @@ static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
 		return NULL;
 }
 
-/*
- * We maintain a count of processed stripes in the upper 16 bits
- */
-static inline int raid5_bi_processed_stripes(struct bio *bio)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-
-	return (atomic_read(segments) >> 16) & 0xffff;
-}
-
-static inline void raid5_set_bi_processed_stripes(struct bio *bio,
-	unsigned int cnt)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	int old, new;
-
-	do {
-		old = atomic_read(segments);
-		new = (old & 0xffff) | (cnt << 16);
-	} while (atomic_cmpxchg(segments, old, new) != old);
-}
-
-static inline void raid5_set_bi_stripes(struct bio *bio, unsigned int cnt)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-
-	atomic_set(segments, cnt);
-}
-
 /* NOTE NR_STRIPE_HASH_LOCKS must remain below 64.
  * This is because we sometimes take all the spinlocks
  * and creating that much locking depth can cause
@@ -596,6 +567,7 @@ struct r5conf {
 	struct list_head	delayed_list; /* stripes that have plugged requests */
 	struct list_head	bitmap_list; /* stripes delaying awaiting bitmap update */
 	struct bio		*retry_read_aligned; /* currently retrying aligned bios   */
+	unsigned int		retry_read_offset; /* sector offset into retry_read_aligned */
 	struct bio		*retry_read_aligned_list; /* aligned bios retry list  */
 	atomic_t		preread_active_stripes; /* stripes with scheduled io */
 	atomic_t		active_aligned_reads;



^ permalink raw reply related

* [md PATCH 08/14] md/raid1, raid10: move rXbio accounting closer to allocation.
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch
In-Reply-To: <148721992248.7521.17160361058957519076.stgit@noble>

When raid1 or raid10 need find they will need to allocate
a new r1bio/r10bio, in order to work around a known bad block,
the account for the allocation well before the allocation
is made.  This separation makes the correctness less obvious and
requires comments.

The accounting needs to be a little before: before the first rXbio is
submitted, but that is all.

So move the accounting down to where it makes more sense.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid1.c  |   23 ++++++++++-------------
 drivers/md/raid10.c |   22 +++++++++-------------
 2 files changed, 19 insertions(+), 26 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7b0f647bcccb..c1d0675880fb 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1326,18 +1326,9 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 		goto retry_write;
 	}
 
-	if (max_sectors < r1_bio->sectors) {
-		/* We are splitting this write into multiple parts, so
-		 * we need to prepare for allocating another r1_bio.
-		 */
+	if (max_sectors < r1_bio->sectors)
 		r1_bio->sectors = max_sectors;
-		spin_lock_irq(&conf->device_lock);
-		if (bio->bi_phys_segments == 0)
-			bio->bi_phys_segments = 2;
-		else
-			bio->bi_phys_segments++;
-		spin_unlock_irq(&conf->device_lock);
-	}
+
 	sectors_handled = r1_bio->sector + max_sectors - bio->bi_iter.bi_sector;
 
 	atomic_set(&r1_bio->remaining, 1);
@@ -1426,10 +1417,16 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	 * as it could result in the bio being freed.
 	 */
 	if (sectors_handled < bio_sectors(bio)) {
-		r1_bio_write_done(r1_bio);
-		/* We need another r1_bio.  It has already been counted
+		/* We need another r1_bio, which must be accounted
 		 * in bio->bi_phys_segments
 		 */
+		spin_lock_irq(&conf->device_lock);
+		if (bio->bi_phys_segments == 0)
+			bio->bi_phys_segments = 2;
+		else
+			bio->bi_phys_segments++;
+		spin_unlock_irq(&conf->device_lock);
+		r1_bio_write_done(r1_bio);
 		r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
 		r1_bio->master_bio = bio;
 		r1_bio->sectors = bio_sectors(bio) - sectors_handled;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 1920756828df..9258cbe233bb 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1383,18 +1383,8 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 		goto retry_write;
 	}
 
-	if (max_sectors < r10_bio->sectors) {
-		/* We are splitting this into multiple parts, so
-		 * we need to prepare for allocating another r10_bio.
-		 */
+	if (max_sectors < r10_bio->sectors)
 		r10_bio->sectors = max_sectors;
-		spin_lock_irq(&conf->device_lock);
-		if (bio->bi_phys_segments == 0)
-			bio->bi_phys_segments = 2;
-		else
-			bio->bi_phys_segments++;
-		spin_unlock_irq(&conf->device_lock);
-	}
 	sectors_handled = r10_bio->sector + max_sectors -
 		bio->bi_iter.bi_sector;
 
@@ -1491,10 +1481,16 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 	 */
 
 	if (sectors_handled < bio_sectors(bio)) {
-		one_write_done(r10_bio);
-		/* We need another r10_bio.  It has already been counted
+		/* We need another r10_bio and it needs to be counted
 		 * in bio->bi_phys_segments.
 		 */
+		spin_lock_irq(&conf->device_lock);
+		if (bio->bi_phys_segments == 0)
+			bio->bi_phys_segments = 2;
+		else
+			bio->bi_phys_segments++;
+		spin_unlock_irq(&conf->device_lock);
+		one_write_done(r10_bio);
 		r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
 
 		r10_bio->master_bio = bio;



^ permalink raw reply related

* [md PATCH 07/14] Revert "md/raid5: limit request size according to implementation limits"
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch
In-Reply-To: <148721992248.7521.17160361058957519076.stgit@noble>

This reverts commit e8d7c33232e5fdfa761c3416539bc5b4acd12db5.

Now that raid5 doesn't abuse bi_phys_segments any more, we no longer
need to impose these limits.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5.c |    9 ---------
 1 file changed, 9 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f93e8fddbb23..c96fca2c6a98 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7100,15 +7100,6 @@ static int raid5_run(struct mddev *mddev)
 			stripe = (stripe | (stripe-1)) + 1;
 		mddev->queue->limits.discard_alignment = stripe;
 		mddev->queue->limits.discard_granularity = stripe;
-
-		/*
-		 * We use 16-bit counter of active stripes in bi_phys_segments
-		 * (minus one for over-loaded initialization)
-		 */
-		blk_queue_max_hw_sectors(mddev->queue, 0xfffe * STRIPE_SECTORS);
-		blk_queue_max_discard_sectors(mddev->queue,
-					      0xfffe * STRIPE_SECTORS);
-
 		/*
 		 * unaligned part of discard request will be ignored, so can't
 		 * guarantee discard_zeroes_data



^ permalink raw reply related

* [md PATCH 10/14] md/raid1: stop using bi_phys_segment
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch
In-Reply-To: <148721992248.7521.17160361058957519076.stgit@noble>

Change to use bio->__bi_remaining to count number of r1bio attached
to a bio.
See precious raid10 patch for more details.

inc_pending is a little more complicated in raid1 as we need to adjust
next_window_requests or current_window_requests.

The wait_event() call if start_next_window is no longer needed as each
r1_bio is completed separately.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid1.c |   99 ++++++++++++++++++----------------------------------
 1 file changed, 34 insertions(+), 65 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index c1d0675880fb..5a57111c7bc9 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -241,36 +241,19 @@ static void reschedule_retry(struct r1bio *r1_bio)
 static void call_bio_endio(struct r1bio *r1_bio)
 {
 	struct bio *bio = r1_bio->master_bio;
-	int done;
 	struct r1conf *conf = r1_bio->mddev->private;
 	sector_t start_next_window = r1_bio->start_next_window;
 	sector_t bi_sector = bio->bi_iter.bi_sector;
 
-	if (bio->bi_phys_segments) {
-		unsigned long flags;
-		spin_lock_irqsave(&conf->device_lock, flags);
-		bio->bi_phys_segments--;
-		done = (bio->bi_phys_segments == 0);
-		spin_unlock_irqrestore(&conf->device_lock, flags);
-		/*
-		 * make_request() might be waiting for
-		 * bi_phys_segments to decrease
-		 */
-		wake_up(&conf->wait_barrier);
-	} else
-		done = 1;
-
 	if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
 		bio->bi_error = -EIO;
 
-	if (done) {
-		bio_endio(bio);
-		/*
-		 * Wake up any possible resync thread that waits for the device
-		 * to go idle.
-		 */
-		allow_barrier(conf, start_next_window, bi_sector);
-	}
+	bio_endio(bio);
+	/*
+	 * Wake up any possible resync thread that waits for the device
+	 * to go idle.
+	 */
+	allow_barrier(conf, start_next_window, bi_sector);
 }
 
 static void raid_end_bio_io(struct r1bio *r1_bio)
@@ -923,6 +906,26 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
 	return sector;
 }
 
+static void inc_pending(struct r1conf *conf, sector_t start_next_window,
+			sector_t bi_sector)
+{
+	/* The current request requires multiple r1_bio, so
+	 * we need to increment the pending count, and the corresponding
+	 * window count.
+	 */
+	spin_lock(&conf->resync_lock);
+	conf->nr_pending++;
+	if (start_next_window == conf->start_next_window) {
+		if (conf->start_next_window + NEXT_NORMALIO_DISTANCE
+		    <= bi_sector)
+			conf->next_window_requests++;
+		else
+			conf->current_window_requests++;
+	} else
+		conf->current_window_requests++;
+	spin_unlock(&conf->resync_lock);
+}
+
 static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
 			  sector_t bi_sector)
 {
@@ -1137,12 +1140,7 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 		sectors_handled = (r1_bio->sector + max_sectors
 				   - bio->bi_iter.bi_sector);
 		r1_bio->sectors = max_sectors;
-		spin_lock_irq(&conf->device_lock);
-		if (bio->bi_phys_segments == 0)
-			bio->bi_phys_segments = 2;
-		else
-			bio->bi_phys_segments++;
-		spin_unlock_irq(&conf->device_lock);
+		bio_inc_remaining(bio);
 
 		/*
 		 * Cannot call generic_make_request directly as that will be
@@ -1304,7 +1302,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	if (unlikely(blocked_rdev)) {
 		/* Wait for this device to become unblocked */
 		int j;
-		sector_t old = start_next_window;
 
 		for (j = 0; j < i; j++)
 			if (r1_bio->bios[j])
@@ -1314,15 +1311,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 		raid1_log(mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
 		md_wait_for_blocked_rdev(blocked_rdev, mddev);
 		start_next_window = wait_barrier(conf, bio);
-		/*
-		 * We must make sure the multi r1bios of bio have
-		 * the same value of bi_phys_segments
-		 */
-		if (bio->bi_phys_segments && old &&
-		    old != start_next_window)
-			/* Wait for the former r1bio(s) to complete */
-			wait_event(conf->wait_barrier,
-				   bio->bi_phys_segments == 1);
 		goto retry_write;
 	}
 
@@ -1417,22 +1405,18 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	 * as it could result in the bio being freed.
 	 */
 	if (sectors_handled < bio_sectors(bio)) {
-		/* We need another r1_bio, which must be accounted
-		 * in bio->bi_phys_segments
-		 */
-		spin_lock_irq(&conf->device_lock);
-		if (bio->bi_phys_segments == 0)
-			bio->bi_phys_segments = 2;
-		else
-			bio->bi_phys_segments++;
-		spin_unlock_irq(&conf->device_lock);
+		/* We need another r1_bio, which must be counted */
+		sector_t sect = bio->bi_iter.bi_sector + sectors_handled;
+
+		inc_pending(conf, start_next_window, sect);
+		bio_inc_remaining(bio);
 		r1_bio_write_done(r1_bio);
 		r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
 		r1_bio->master_bio = bio;
 		r1_bio->sectors = bio_sectors(bio) - sectors_handled;
 		r1_bio->state = 0;
 		r1_bio->mddev = mddev;
-		r1_bio->sector = bio->bi_iter.bi_sector + sectors_handled;
+		r1_bio->sector = sect;
 		goto retry_write;
 	}
 
@@ -1460,16 +1444,6 @@ static void raid1_make_request(struct mddev *mddev, struct bio *bio)
 	r1_bio->mddev = mddev;
 	r1_bio->sector = bio->bi_iter.bi_sector;
 
-	/*
-	 * We might need to issue multiple reads to different devices if there
-	 * are bad blocks around, so we keep track of the number of reads in
-	 * bio->bi_phys_segments.  If this is 0, there is only one r1_bio and
-	 * no locking will be needed when requests complete.  If it is
-	 * non-zero, then it is the number of not-completed requests.
-	 */
-	bio->bi_phys_segments = 0;
-	bio_clear_flag(bio, BIO_SEG_VALID);
-
 	if (bio_data_dir(bio) == READ)
 		raid1_read_request(mddev, bio, r1_bio);
 	else
@@ -2434,12 +2408,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 			int sectors_handled = (r1_bio->sector + max_sectors
 					       - mbio->bi_iter.bi_sector);
 			r1_bio->sectors = max_sectors;
-			spin_lock_irq(&conf->device_lock);
-			if (mbio->bi_phys_segments == 0)
-				mbio->bi_phys_segments = 2;
-			else
-				mbio->bi_phys_segments++;
-			spin_unlock_irq(&conf->device_lock);
+			bio_inc_remaining(mbio);
 			trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
 					      bio, bio_dev, bio_sector);
 			generic_make_request(bio);



^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox