From: Shaohua Li <shli@kernel.org>
To: colyli@suse.de
Cc: linux-raid@vger.kernel.org, Shaohua Li <shli@fb.com>,
Neil Brown <neilb@suse.de>,
Johannes Thumshirn <jthumshirn@suse.de>,
Guoqing Jiang <gqjiang@suse.com>
Subject: Re: [PATCH V4 1/2] RAID1: a new I/O barrier implementation to remove resync window
Date: Fri, 17 Feb 2017 12:00:36 -0800 [thread overview]
Message-ID: <20170217200036.2ca5iotzhc4oauwa@kernel.org> (raw)
In-Reply-To: <1487358357-123924-1-git-send-email-colyli@suse.de>
On Sat, Feb 18, 2017 at 03:05:56AM +0800, colyli@suse.de wrote:
> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
> introduces a sliding resync window for raid1 I/O barrier, this idea limits
> I/O barriers to happen only inside a slidingresync window, for regular
> I/Os out of this resync window they don't need to wait for barrier any
> more. On large raid1 device, it helps a lot to improve parallel writing
> I/O throughput when there are background resync I/Os performing at
> same time.
>
> The idea of sliding resync widow is awesome, but code complexity is a
> challenge. Sliding resync window requires several variables to work
> collectively, this is complexed and very hard to make it work correctly.
> Just grep "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches
> to fix the original resync window patch. This is not the end, any further
> related modification may easily introduce more regreassion.
>
> Therefore I decide to implement a much simpler raid1 I/O barrier, by
> removing resync window code, I believe life will be much easier.
>
> The brief idea of the simpler barrier is,
> - Do not maintain a global unique resync window
> - Use multiple hash buckets to reduce I/O barrier conflicts, regular
> I/O only has to wait for a resync I/O when both them have same barrier
> bucket index, vice versa.
> - I/O barrier can be reduced to an acceptable number if there are enough
> barrier buckets
>
> Here I explain how the barrier buckets are designed,
> - BARRIER_UNIT_SECTOR_SIZE
> The whole LBA address space of a raid1 device is divided into multiple
> barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
> Bio requests won't go across border of barrier unit size, that means
> maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
> For random I/O 64MB is large enough for both read and write requests,
> for sequential I/O considering underlying block layer may merge them
> into larger requests, 64MB is still good enough.
> Neil also points out that for resync operation, "we want the resync to
> move from region to region fairly quickly so that the slowness caused
> by having to synchronize with the resync is averaged out over a fairly
> small time frame". For full speed resync, 64MB should take less then 1
> second. When resync is competing with other I/O, it could take up a few
> minutes. Therefore 64MB size is fairly good range for resync.
>
> - BARRIER_BUCKETS_NR
> There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
> #define BARRIER_BUCKETS_NR_BITS (PAGE_SHIFT - 2)
> #define BARRIER_BUCKETS_NR (1<<BARRIER_BUCKETS_NR_BITS)
> this patch makes the bellowed members of struct r1conf from integer
> to array of integers,
> - int nr_pending;
> - int nr_waiting;
> - int nr_queued;
> - int barrier;
> + int *nr_pending;
> + int *nr_waiting;
> + int *nr_queued;
> + int *barrier;
> number of the array elements is defined as BARRIER_BUCKETS_NR. For 4KB
> kernel space page size, (PAGE_SHIFT - 2) indecates there are 1024 I/O
> barrier buckets, and each array of integers occupies single memory page.
> 1024 means for a request which is smaller than the I/O barrier unit size
> has ~0.1% chance to wait for resync to pause, which is quite a small
> enough fraction. Also requesting single memory page is more friendly to
> kernel page allocator than larger memory size.
>
> - I/O barrier bucket is indexed by bio start sector
> If multiple I/O requests hit different I/O barrier units, they only need
> to compete I/O barrier with other I/Os which hit the same I/O barrier
> bucket index with each other. The index of a barrier bucket which a
> bio should look for is calculated by sector_to_idx() which is defined
> in raid1.h as an inline function,
> static inline int sector_to_idx(sector_t sector)
> {
> return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
> BARRIER_BUCKETS_NR_BITS);
> }
> Here sector_nr is the start sector number of a bio.
>
> - Single bio won't go across boundary of a I/O barrier unit
> If a request goes across boundary of barrier unit, it will be split. A
> bio may be split in raid1_make_request() or raid1_sync_request(), if
> sectors returned by align_to_barrier_unit_end() is smaller than
> original bio size.
>
> Comparing to single sliding resync window,
> - Currently resync I/O grows linearly, therefore regular and resync I/O
> will conflict within a single barrier units. So the I/O behavior is
> similar to single sliding resync window.
> - But a barrier unit bucket is shared by all barrier units with identical
> barrier uinit index, the probability of conflict might be higher
> than single sliding resync window, in condition that writing I/Os
> always hit barrier units which have identical barrier bucket indexs with
> the resync I/Os. This is a very rare condition in real I/O work loads,
> I cannot imagine how it could happen in practice.
> - Therefore we can achieve a good enough low conflict rate with much
> simpler barrier algorithm and implementation.
>
> There are two changes should be noticed,
> - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
> single loop, it looks like this,
> spin_lock_irqsave(&conf->device_lock, flags);
> conf->nr_queued[idx]--;
> spin_unlock_irqrestore(&conf->device_lock, flags);
> This change generates more spin lock operations, but in next patch of
> this patch set, it will be replaced by a single line code,
> atomic_dec(&conf->nr_queueud[idx]);
> So we don't need to worry about spin lock cost here.
> - Mainline raid1 code split original raid1_make_request() into
> raid1_read_request() and raid1_write_request(). If the original bio
> goes across an I/O barrier unit size, this bio will be split before
> calling raid1_read_request() or raid1_write_request(), this change
> the code logic more simple and clear.
> - In this patch wait_barrier() is moved from raid1_make_request() to
> raid1_write_request(). In raid_read_request(), original wait_barrier()
> is replaced by raid1_read_request().
> The differnece is wait_read_barrier() only waits if array is frozen,
> using different barrier function in different code path makes the code
> more clean and easy to read.
> Changelog
> V4:
> - Add alloc_r1bio() to remove redundant r1bio memory allocation code.
> - Fix many typos in patch comments.
> - Use (PAGE_SHIFT - ilog2(sizeof(int))) to define BARRIER_BUCKETS_NR_BITS.
> V3:
> - Rebase the patch against latest upstream kernel code.
> - Many fixes by review comments from Neil,
> - Back to use pointers to replace arraries in struct r1conf
> - Remove total_barriers from struct r1conf
> - Add more patch comments to explain how/why the values of
> BARRIER_UNIT_SECTOR_SIZE and BARRIER_BUCKETS_NR are decided.
> - Use get_unqueued_pending() to replace get_all_pendings() and
> get_all_queued()
> - Increase bucket number from 512 to 1024
> - Change code comments format by review from Shaohua.
> V2:
> - Use bio_split() to split the orignal bio if it goes across barrier unit
> bounday, to make the code more simple, by suggestion from Shaohua and
> Neil.
> - Use hash_long() to replace original linear hash, to avoid a possible
> confilict between resync I/O and sequential write I/O, by suggestion from
> Shaohua.
> - Add conf->total_barriers to record barrier depth, which is used to
> control number of parallel sync I/O barriers, by suggestion from Shaohua.
> - In V1 patch the bellowed barrier buckets related members in r1conf are
> allocated in memory page. To make the code more simple, V2 patch moves
> the memory space into struct r1conf, like this,
> - int nr_pending;
> - int nr_waiting;
> - int nr_queued;
> - int barrier;
> + int nr_pending[BARRIER_BUCKETS_NR];
> + int nr_waiting[BARRIER_BUCKETS_NR];
> + int nr_queued[BARRIER_BUCKETS_NR];
> + int barrier[BARRIER_BUCKETS_NR];
> This change is by the suggestion from Shaohua.
> - Remove some inrelavent code comments, by suggestion from Guoqing.
> - Add a missing wait_barrier() before jumping to retry_write, in
> raid1_make_write_request().
> V1:
> - Original RFC patch for comments
Thanks, I applied the two patches. For the deadlock issue Neil pointed out, I
don't want it to hold this patch, we can fix it later after we figured out how
to fix. The r1bio allocation is guaranteed to success because it's mempool
allocation, so I deleted the warn statement.
Neil,
Since you reviewed the patches a lot I added your 'reviewed-by'. Please let me know if
you don't want to add it.
Thanks,
Shaohua
next prev parent reply other threads:[~2017-02-17 20:00 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-02-17 19:05 [PATCH V4 1/2] RAID1: a new I/O barrier implementation to remove resync window colyli
2017-02-17 19:05 ` [PATCH V4 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code colyli
2017-02-17 20:00 ` Shaohua Li [this message]
2017-02-18 2:56 ` [PATCH V4 1/2] RAID1: a new I/O barrier implementation to remove resync window Coly Li
2017-02-20 0:31 ` NeilBrown
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170217200036.2ca5iotzhc4oauwa@kernel.org \
--to=shli@kernel.org \
--cc=colyli@suse.de \
--cc=gqjiang@suse.com \
--cc=jthumshirn@suse.de \
--cc=linux-raid@vger.kernel.org \
--cc=neilb@suse.de \
--cc=shli@fb.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).