From: NeilBrown <neilb@suse.com>
To: Shaohua Li <shli@kernel.org>, NeilBrown <neilb@suse.de>
Cc: colyli@suse.de, linux-raid@vger.kernel.org,
Shaohua Li <shli@fb.com>, Johannes Thumshirn <jthumshirn@suse.de>,
Guoqing Jiang <gqjiang@suse.com>
Subject: Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
Date: Mon, 20 Feb 2017 10:42:09 +1100 [thread overview]
Message-ID: <87vas5sq7i.fsf@notabene.neil.brown.name> (raw)
In-Reply-To: <20170217194117.cgbdm247p5p4gj6v@kernel.org>
On Fri, Feb 17 2017, Shaohua Li wrote:
> On Thu, Feb 16, 2017 at 06:04:00PM +1100, NeilBrown wrote:
>> On Thu, Feb 16 2017, colyli@suse.de wrote:
>>
>> > 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
>> > introduces a sliding resync window for raid1 I/O barrier, this idea limits
>> > I/O barriers to happen only inside a sliding resync window, for regular
>> > I/Os out of this resync window they don't need to wait for barrier any
>> > more. On a large raid1 device, it helps a lot to improve parallel writing
>> > I/O throughput when there are background resync I/Os performing at the
>> > same time.
>> >
>> > The idea of the sliding resync window is awesome, but code complexity is a
>> > challenge. Sliding resync window requires several veriables to work
>>
>> variables
>>
>> > collectively, this is complex and very hard to make it work correctly.
>> > Just grep "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches
>> > to fix the original resync window patch. This is not the end, any further
>> > related modification may easily introduce more regressions.
>> >
>> > Therefore I decide to implement a much simpler raid1 I/O barrier, by
>> > removing the resync window code; I believe life will be much easier.
>> >
>> > The brief idea of the simpler barrier is,
>> > - Do not maintain a logbal unique resync window
>>
>> global
>>
>> > - Use multiple hash buckets to reduce I/O barrier conflictions, regular
>>
>> conflicts
>>
>> > I/O only has to wait for a resync I/O when both of them have the same
>> > barrier bucket index, and vice versa.
>> > - I/O barrier can be recuded to an acceptable number if there are enought
>>
>> reduced
>> enough
>>
>> > barrier buckets
>> >
>> > Here I explain how the barrier buckets are designed,
>> > - BARRIER_UNIT_SECTOR_SIZE
>> > The whole LBA address space of a raid1 device is divided into multiple
>> > barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
>> > Bio request won't go across border of barrier unit size, that means
>>
>> requests
>>
>> > maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
>> > For random I/O 64MB is large enough for both read and write requests,
>> > for sequential I/O considering underlying block layer may merge them
>> > into larger requests, 64MB is still good enough.
>> > Neil also points out that for resync operation, "we want the resync to
>> > move from region to region fairly quickly so that the slowness caused
>> > by having to synchronize with the resync is averaged out over a fairly
>> > small time frame". For full speed resync, 64MB should take less than 1
>> > second. When resync is competing with other I/O, it could take up to a few
>> > minutes. Therefore a 64MB size is a fairly good range for resync.
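
(For reference, a minimal sketch of the unit-size definitions implied above --
2^17 sectors of 512 bytes each is 64MB; the exact form of the #defines is not
quoted in this hunk, so treat this as an illustration:)

    #define BARRIER_UNIT_SECTOR_BITS    17
    #define BARRIER_UNIT_SECTOR_SIZE    (1<<BARRIER_UNIT_SECTOR_BITS)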
>> >
>> > - BARRIER_BUCKETS_NR
>> > There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
>> > #define BARRIER_BUCKETS_NR_BITS (PAGE_SHIFT - 2)
>> > #define BARRIER_BUCKETS_NR (1<<BARRIER_BUCKETS_NR_BITS)
>> > this patch changes the following members of struct r1conf from integer
>> > to array of integers,
>> > - int nr_pending;
>> > - int nr_waiting;
>> > - int nr_queued;
>> > - int barrier;
>> > + int *nr_pending;
>> > + int *nr_waiting;
>> > + int *nr_queued;
>> > + int *barrier;
>> > The number of array elements is defined as BARRIER_BUCKETS_NR. For a 4KB
>> > kernel page size, (PAGE_SHIFT - 2) indicates there are 1024 I/O
>> > barrier buckets, and each array of integers occupies a single memory page.
>> > With 1024 buckets, a request which is smaller than the I/O barrier unit size
>> > has only a ~0.1% chance of having to wait for resync to pause, which is a
>> > small enough fraction. Also, requesting a single memory page is friendlier
>> > to the kernel page allocator than a larger allocation.
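
(For illustration only -- the setup_conf() changes are not quoted here, but
allocating the four per-bucket counter arrays would presumably look something
like this, one 4KB page per array on 4KB-page systems; error handling omitted:)

    conf->nr_pending = kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL);
    conf->nr_waiting = kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL);
    conf->nr_queued  = kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL);
    conf->barrier    = kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL);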
>> >
>> > - I/O barrier bucket is indexed by bio start sector
>> > If multiple I/O requests hit different I/O barrier units, they only need
>> > to compete for the I/O barrier with other I/Os which hit the same I/O
>> > barrier bucket index. The index of the barrier bucket which a
>> > bio should look for is calculated by sector_to_idx() which is defined
>> > in raid1.h as an inline function,
>> > static inline int sector_to_idx(sector_t sector)
>> > {
>> > return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
>> > BARRIER_BUCKETS_NR_BITS);
>> > }
>> > Here sector is the start sector number of a bio.
>>
>> "hash_long() is used so that sequential writes in are region of the
>> array which is not being resynced will not consistently align with
>> the buckets that are being sequentially resynced, as described below"
>>
>> >
>> > - A single bio won't go across the boundary of an I/O barrier unit
>> > If a request goes across the boundary of a barrier unit, it will be split. A
>> > bio may be split in raid1_make_request() or raid1_sync_request(), if
>> > sectors returned by align_to_barrier_unit_end() is small than original
>>
>> smaller
>>
>> > bio size.
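
(Again as an illustration based on the description, align_to_barrier_unit_end()
presumably boils down to "return at most the number of sectors up to the next
barrier-unit boundary":)

    static sector_t align_to_barrier_unit_end(sector_t start_sector,
                                              sector_t sectors)
    {
        sector_t len;

        /* sectors from start_sector to the end of its barrier unit */
        len = round_up(start_sector + 1, BARRIER_UNIT_SECTOR_SIZE) -
              start_sector;
        return min_t(sector_t, sectors, len);
    }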
>> >
>> > Compared to the single sliding resync window,
>> > - Currently resync I/O grows linearly, therefore regular and resync I/O
>> > will have confliction within a single barrier units. So the I/O
>>
>> ... will conflict within ...
>>
>> > behavior is similar to single sliding resync window.
>> > - But a barrier unit bucket is shared by all barrier units with identical
>> > barrier unit index, the probability of conflict might be higher
>> > than with a single sliding resync window, in the case that writing I/Os
>> > always hit barrier units which have identical barrier bucket indexes to
>> > the resync I/Os. This is a very rare condition in real I/O workloads;
>> > I cannot imagine how it could happen in practice.
>> > - Therefore we can achieve a good enough low confliction rate with much
>>
>> ... low conflict rate ...
>>
>> > simpler barrier algorithm and implementation.
>> >
>> > There are two changes that should be noticed,
>> > - In raid1d(), I change the code to decrease conf->nr_queued[idx] in a
>> > single loop; it looks like this,
>> > spin_lock_irqsave(&conf->device_lock, flags);
>> > conf->nr_queued[idx]--;
>> > spin_unlock_irqrestore(&conf->device_lock, flags);
>> > This change generates more spin lock operations, but in the next patch of
>> > this patch set, it will be replaced by a single line of code,
>> > atomic_dec(&conf->nr_queued[idx]);
>> > So we don't need to worry about spin lock cost here.
>> > - Mainline raid1 code split the original raid1_make_request() into
>> > raid1_read_request() and raid1_write_request(). If the original bio
>> > crosses an I/O barrier unit boundary, it will be split before
>> > calling raid1_read_request() or raid1_write_request(); this change
>> > makes the code logic simpler and clearer.
>> > - In this patch wait_barrier() is moved from raid1_make_request() to
>> > raid1_write_request(). In raid1_read_request(), the original wait_barrier()
>> > is replaced by wait_read_barrier().
>> > The difference is that wait_read_barrier() only waits if the array is
>> > frozen; using different barrier functions in different code paths makes
>> > the code cleaner and easier to read.
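
(To make the distinction concrete: a rough sketch of what the read-side barrier
presumably reduces to in this patch -- the field names are the existing r1conf
ones, but this is an illustration, not the actual hunk:)

    static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
    {
        int idx = sector_to_idx(sector_nr);

        spin_lock_irq(&conf->resync_lock);
        /* read I/O only waits while the whole array is frozen */
        wait_event_lock_irq(conf->wait_barrier,
                            !conf->array_frozen,
                            conf->resync_lock);
        conf->nr_pending[idx]++;
        spin_unlock_irq(&conf->resync_lock);
    }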
>>
>> Thank you for putting the effort into writing a comprehensive change
>> description. I really appreciate it.
>>
>> >
>> > @@ -1447,36 +1501,26 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>> >
>> > static void raid1_make_request(struct mddev *mddev, struct bio *bio)
>> > {
>> > - struct r1conf *conf = mddev->private;
>> > - struct r1bio *r1_bio;
>> > + void (*make_request_fn)(struct mddev *mddev, struct bio *bio);
>> > + struct bio *split;
>> > + sector_t sectors;
>> >
>> > - /*
>> > - * make_request() can abort the operation when read-ahead is being
>> > - * used and no empty request is available.
>> > - *
>> > - */
>> > - r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
>> > + make_request_fn = (bio_data_dir(bio) == READ) ?
>> > + raid1_read_request : raid1_write_request;
>> >
>> > - r1_bio->master_bio = bio;
>> > - r1_bio->sectors = bio_sectors(bio);
>> > - r1_bio->state = 0;
>> > - r1_bio->mddev = mddev;
>> > - r1_bio->sector = bio->bi_iter.bi_sector;
>> > -
>> > - /*
>> > - * We might need to issue multiple reads to different devices if there
>> > - * are bad blocks around, so we keep track of the number of reads in
>> > - * bio->bi_phys_segments. If this is 0, there is only one r1_bio and
>> > - * no locking will be needed when requests complete. If it is
>> > - * non-zero, then it is the number of not-completed requests.
>> > - */
>> > - bio->bi_phys_segments = 0;
>> > - bio_clear_flag(bio, BIO_SEG_VALID);
>> > + /* if bio exceeds barrier unit boundary, split it */
>> > + do {
>> > + sectors = align_to_barrier_unit_end(
>> > + bio->bi_iter.bi_sector, bio_sectors(bio));
>> > + if (sectors < bio_sectors(bio)) {
>> > + split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
>> > + bio_chain(split, bio);
>> > + } else {
>> > + split = bio;
>> > + }
>> >
>> > - if (bio_data_dir(bio) == READ)
>> > - raid1_read_request(mddev, bio, r1_bio);
>> > - else
>> > - raid1_write_request(mddev, bio, r1_bio);
>> > + make_request_fn(mddev, split);
>> > + } while (split != bio);
>> > }
>>
>> I know you are going to change this as Shaohua wants the splitting
>> to happen in a separate function, which I agree with, but there is
>> something else wrong here.
>> Calling bio_split/bio_chain repeatedly in a loop is dangerous.
>> It is OK for simple devices, but when one request can wait for another
>> request to the same device it can deadlock.
>> This can happen with raid1. If a resync request calls raise_barrier()
>> between one request and the next, then the next has to wait for the
>> resync request, which has to wait for the first request.
>> As the first request will be stuck in the queue in
>> generic_make_request(), you get a deadlock.
>> It is much safer to:
>>
>> if (need to split) {
>> split = bio_split(bio, ...)
>> bio_chain(...)
>> make_request_fn(mddev, split);
>> generic_make_request(bio);
>> } else
>> make_request_fn(mddev, bio);
>>
>> This way we first process the initial section of the bio (in 'split')
>> which will queue some requests to the underlying devices. These
>> requests will be queued in generic_make_request.
>> Then we queue the remainder of the bio, which will be added to the end
>> of the generic_make_request queue.
>> Then we return.
>> generic_make_request() will pop the lower-level device requests off the
>> queue and handle them first. Then it will process the remainder
>> of the original bio once the first section has been fully processed.
>
> Good point! raid10 has the same problem. It looks like this doesn't solve the
> issue for devices stacked three levels deep, though.
>
> I know you guys are working on this issue in the block layer. Should we fix
> the issue on the MD side (for 2-level stacked devices) or wait for the block
> layer patch?
We cannot fix everything at the block layer, or at the individual device
layer. We need changes in both.
I think that looping over bios in a device driver is wrong and can
easily lead to deadlocks. We should remove that from md.
If the block layer gets fixed the way I want it to, then we could move
the generic_make_request() call earlier, so that the above could become
>> if (need to split) {
>> split = bio_split(bio, ...)
>> bio_chain(...)
>> generic_make_request(bio);
>> bio = split
>> }
>> make_request_fn(mddev, bio);
which is slightly simpler. But the original would still work.
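
Spelling the original form out against the identifiers used in this patch, it
would be roughly (an untested sketch):

    sectors = align_to_barrier_unit_end(bio->bi_iter.bi_sector,
                                        bio_sectors(bio));
    if (sectors < bio_sectors(bio)) {
        split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
        bio_chain(split, bio);
        make_request_fn(mddev, split);    /* handle the first barrier unit now */
        generic_make_request(bio);        /* remainder goes back on the queue */
    } else
        make_request_fn(mddev, bio);
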
So yes, I think we need this change in md/raid1. I suspect that if
you built a kernel with a smaller BARRIER_UNIT_SECTOR_BITS - e.g. 4 -
you could very easily trigger a deadlock with md/raid1 on scsi.
At 17, it is not quite so easy, but is a real possibility.
I've had similar deadlocks reported before when the code wasn't quite
careful enough.
NeilBrown