From: Coly Li <colyli@suse.de>
To: linux-raid@vger.kernel.org
Cc: Coly Li <colyli@suse.de>, Shaohua Li <shli@fb.com>,
Neil Brown <neilb@suse.de>,
Johannes Thumshirn <jthumshirn@suse.de>,
Guoqing Jiang <gqjiang@suse.com>
Subject: [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
Date: Tue, 22 Nov 2016 05:54:00 +0800
Message-ID: <1479765241-15528-1-git-send-email-colyli@suse.de>
Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")
introduces a sliding resync window for the raid1 I/O barrier. The idea
limits I/O barriers to happen only inside a sliding resync window;
regular I/Os outside this resync window no longer need to wait for the
barrier. On a large raid1 device this helps a lot to improve parallel
write throughput when background resync I/O is running at the same
time.
The idea of the sliding resync window is awesome, but there are several
challenges which are very difficult to solve,
- code complexity
The sliding resync window requires several variables to work
collectively; this is complex and very hard to get right. Just grep
"Fixes: 79ef3a8aa1" in the kernel git log: there are 8 more patches
fixing the original resync window patch, and this is not the end, any
further related modification may easily introduce more regressions.
- multiple sliding resync windows
Currently the raid1 code only has a single sliding resync window, so we
cannot do parallel resync with the current I/O barrier implementation.
Implementing multiple resync windows would be much more complex, and
very hard to get correct.
Therefore I decided to implement a much simpler raid1 I/O barrier by
removing the resync window code; I believe life will be much easier.
The brief idea of the simpler barrier is,
- Do not maintain a global unique resync window
- Use multiple hash buckets to reduce I/O barrier conflicts; a regular
I/O only has to wait for a resync I/O when both of them hit the same
barrier bucket index, and vice versa.
- I/O barrier conflicts can be reduced to an acceptable number if there
are enough barrier buckets
Here I explain how the barrier buckets are designed,
- BARRIER_UNIT_SECTOR_SIZE
The whole LBA address space of a raid1 device is divided into multiple
barrier units, each of size BARRIER_UNIT_SECTOR_SIZE.
A bio request won't cross the border of a barrier unit, which means the
maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 bytes.
- BARRIER_BUCKETS_NR
There are BARRIER_BUCKETS_NR buckets in total. If multiple I/O requests
hit different barrier units, they only need to compete for the I/O
barrier with other I/Os that hit the same barrier bucket index. The
index of the barrier bucket a bio should look at is calculated by
get_barrier_bucket_idx(),
	(sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR
where sector is the start sector number of the bio.
align_to_barrier_unit_end() makes sure the final bio sent into
generic_make_request() won't exceed the border of the barrier unit size
(a small worked example follows this list).
- BARRIER_BUCKETS_NR
The number of barrier buckets is defined by,
	#define BARRIER_BUCKETS_NR	(PAGE_SIZE/sizeof(long))
For a 4KB page size there are 512 buckets for each raid1 device, which
means the probability of a barrier conflict under fully random I/O may
be reduced down to 1/512.
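A quick worked example with the numbers above (my own illustration,
assuming a 4KB page and sizeof(long) == 8; the authoritative helper is
get_barrier_bucket_idx() in the patch below):
	/* one barrier unit: 1 << 17 sectors = 131072 sectors = 64MB */
	BARRIER_UNIT_SECTOR_SIZE << 9 == (1 << 17) * 512 == 64MB
	/* a bio starting at sector 300000 lies in barrier unit 2 ... */
	300000 >> 17 == 2
	/* ... and with 4096 / 8 == 512 buckets it maps to bucket 2 */
	2 % 512 == 2
So two I/Os only contend on the barrier when their start sectors land in
barrier units that share the same bucket index.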
Comparing to the single sliding resync window,
- Currently resync I/O grows linearly, so regular and resync I/O will
only conflict within a single barrier unit. This is similar to the
single sliding resync window.
- But a barrier bucket is shared by all barrier units with the same
barrier bucket index, so the probability of a conflict might be higher
than with the single sliding resync window, in the condition that write
I/Os always hit barrier units which have the same barrier bucket index
as the resync I/O. This is a very rare condition in real I/O workloads;
I cannot imagine how it could happen in practice.
- Therefore we can achieve a good enough low conflict rate with a much
simpler barrier algorithm and implementation.
If the user has a (really) large raid1 device, for example 10PB in
size, we may just increase the bucket number BARRIER_BUCKETS_NR. For now
this is a macro; it could become a variable defined at raid1 creation
time in the future.
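As a rough estimate for that case (my own arithmetic, not part of the
patch): a 10PB array divided into 64MB barrier units gives on the order
of 1.7*10^8 units; with the default 512 buckets each bucket then covers
roughly 3*10^5 units, so raising the bucket count would further thin out
barrier conflicts on such an array.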
There are a few changes which should be noticed,
- In raid1d(), I change the code to decrease conf->nr_queued[idx] inside
the single loop, it looks like this,
	spin_lock_irqsave(&conf->device_lock, flags);
	conf->nr_queued[idx]--;
	spin_unlock_irqrestore(&conf->device_lock, flags);
This change generates more spin lock operations, but in the next patch
of this patch set it will be replaced by a single line of code,
	atomic_dec(&conf->nr_queued[idx]);
so we don't need to worry about the spin lock cost here.
- In raid1_make_request(), wait_barrier() is replaced by,
  a) wait_read_barrier(): wait for the barrier in the regular read I/O
     code path
  b) wait_barrier(): wait for the barrier in the regular write I/O code
     path
The difference is that wait_read_barrier() only waits if the array is
frozen. I am not able to combine them into one function, because they
must receive different data types in their argument lists.
- align_to_barrier_unit_end() is called to make sure both regular and
resync I/O won't go across the border of a barrier unit.
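As an illustration of the intended clamping (my own example, assuming
the 64MB barrier unit above; the exact split bookkeeping is done via
bi_phys_segments as in the existing code): a write starting at sector
131000 with 200 sectors would cross from barrier unit 0 (sectors
0-131071) into unit 1, so at most 131072 - 131000 = 72 sectors are
issued for the first r1_bio and the remaining 128 sectors are handled
as a separate r1_bio against the next barrier unit.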
Open question:
- Needs review from an md clustering developer; I don't touch the
related code for now.
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
---
drivers/md/raid1.c | 389 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------
drivers/md/raid1.h | 42 +++++-------
2 files changed, 242 insertions(+), 189 deletions(-)
Index: linux-raid1/drivers/md/raid1.c
===================================================================
--- linux-raid1.orig/drivers/md/raid1.c
+++ linux-raid1/drivers/md/raid1.c
@@ -66,9 +66,8 @@
*/
static int max_queued_requests = 1024;
-static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
- sector_t bi_sector);
-static void lower_barrier(struct r1conf *conf);
+static void allow_barrier(struct r1conf *conf, sector_t sector_nr);
+static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
static void * r1bio_pool_alloc(gfp_t gfp_flags, void *data)
{
@@ -92,7 +91,6 @@ static void r1bio_pool_free(void *r1_bio
#define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9)
#define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW)
#define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9)
-#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
{
@@ -198,6 +196,7 @@ static void put_buf(struct r1bio *r1_bio
{
struct r1conf *conf = r1_bio->mddev->private;
int i;
+ sector_t sector_nr = r1_bio->sector;
for (i = 0; i < conf->raid_disks * 2; i++) {
struct bio *bio = r1_bio->bios[i];
@@ -207,7 +206,7 @@ static void put_buf(struct r1bio *r1_bio
mempool_free(r1_bio, conf->r1buf_pool);
- lower_barrier(conf);
+ lower_barrier(conf, sector_nr);
}
static void reschedule_retry(struct r1bio *r1_bio)
@@ -215,10 +214,15 @@ static void reschedule_retry(struct r1bi
unsigned long flags;
struct mddev *mddev = r1_bio->mddev;
struct r1conf *conf = mddev->private;
+ sector_t sector_nr;
+ long idx;
+
+ sector_nr = r1_bio->sector;
+ idx = get_barrier_bucket_idx(sector_nr);
spin_lock_irqsave(&conf->device_lock, flags);
list_add(&r1_bio->retry_list, &conf->retry_list);
- conf->nr_queued ++;
+ conf->nr_queued[idx]++;
spin_unlock_irqrestore(&conf->device_lock, flags);
wake_up(&conf->wait_barrier);
@@ -235,8 +239,6 @@ static void call_bio_endio(struct r1bio
struct bio *bio = r1_bio->master_bio;
int done;
struct r1conf *conf = r1_bio->mddev->private;
- sector_t start_next_window = r1_bio->start_next_window;
- sector_t bi_sector = bio->bi_iter.bi_sector;
if (bio->bi_phys_segments) {
unsigned long flags;
@@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio
if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
bio->bi_error = -EIO;
- if (done) {
+ if (done)
bio_endio(bio);
- /*
- * Wake up any possible resync thread that waits for the device
- * to go idle.
- */
- allow_barrier(conf, start_next_window, bi_sector);
- }
}
static void raid_end_bio_io(struct r1bio *r1_bio)
{
struct bio *bio = r1_bio->master_bio;
+ struct r1conf *conf = r1_bio->mddev->private;
/* if nobody has done the final endio yet, do it now */
if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
@@ -278,6 +275,12 @@ static void raid_end_bio_io(struct r1bio
call_bio_endio(r1_bio);
}
+
+ /*
+ * Wake up any possible resync thread that waits for the device
+ * to go idle.
+ */
+ allow_barrier(conf, r1_bio->sector);
free_r1bio(r1_bio);
}
@@ -311,6 +314,7 @@ static int find_bio_disk(struct r1bio *r
return mirror;
}
+/* bi_end_io callback for regular READ bio */
static void raid1_end_read_request(struct bio *bio)
{
int uptodate = !bio->bi_error;
@@ -490,6 +494,25 @@ static void raid1_end_write_request(stru
bio_put(to_put);
}
+static sector_t align_to_barrier_unit_end(sector_t start_sector,
+ sector_t sectors)
+{
+ sector_t len;
+
+ WARN_ON(sectors == 0);
+ /* len is the number of sectors from start_sector to end of the
+ * barrier unit which start_sector belongs to.
+ */
+ len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
+ (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
+ start_sector;
+
+ if (len > sectors)
+ len = sectors;
+
+ return len;
+}
+
/*
* This routine returns the disk from which the requested read should
* be done. There is a per-array 'next expected sequential IO' sector
@@ -691,6 +714,7 @@ static int read_balance(struct r1conf *c
conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
}
rcu_read_unlock();
+ sectors = align_to_barrier_unit_end(this_sector, sectors);
*max_sectors = sectors;
return best_disk;
@@ -779,168 +803,174 @@ static void flush_pending_writes(struct
* there is no normal IO happeing. It must arrange to call
* lower_barrier when the particular background IO completes.
*/
+
static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
{
+ long idx = get_barrier_bucket_idx(sector_nr);
+
spin_lock_irq(&conf->resync_lock);
/* Wait until no block IO is waiting */
- wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
+ wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
conf->resync_lock);
/* block any new IO from starting */
- conf->barrier++;
- conf->next_resync = sector_nr;
+ conf->barrier[idx]++;
/* For these conditions we must wait:
* A: while the array is in frozen state
- * B: while barrier >= RESYNC_DEPTH, meaning resync reach
- * the max count which allowed.
- * C: next_resync + RESYNC_SECTORS > start_next_window, meaning
- * next resync will reach to the window which normal bios are
- * handling.
- * D: while there are any active requests in the current window.
+ * B: while conf->nr_pending[idx] is not 0, meaning regular I/O
+ * existing in sector number ranges corresponding to idx.
+ * C: while conf->barrier[idx] >= RESYNC_DEPTH, meaning resync reaches
+ * the max count allowed in sector number ranges
+ * corresponding to idx.
*/
wait_event_lock_irq(conf->wait_barrier,
- !conf->array_frozen &&
- conf->barrier < RESYNC_DEPTH &&
- conf->current_window_requests == 0 &&
- (conf->start_next_window >=
- conf->next_resync + RESYNC_SECTORS),
+ !conf->array_frozen && !conf->nr_pending[idx] &&
+ conf->barrier[idx] < RESYNC_DEPTH,
conf->resync_lock);
-
- conf->nr_pending++;
+ conf->nr_pending[idx]++;
spin_unlock_irq(&conf->resync_lock);
}
-static void lower_barrier(struct r1conf *conf)
+static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
{
unsigned long flags;
- BUG_ON(conf->barrier <= 0);
+ long idx = get_barrier_bucket_idx(sector_nr);
+
+ BUG_ON(conf->barrier[idx] <= 0);
spin_lock_irqsave(&conf->resync_lock, flags);
- conf->barrier--;
- conf->nr_pending--;
+ conf->barrier[idx]--;
+ conf->nr_pending[idx]--;
spin_unlock_irqrestore(&conf->resync_lock, flags);
wake_up(&conf->wait_barrier);
}
-static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
+/* A regular I/O should wait when,
+ * - The whole array is frozen (both READ and WRITE)
+ * - bio is WRITE and in same barrier bucket conf->barrier[idx] raised
+ */
+static void _wait_barrier(struct r1conf *conf, long idx)
{
- bool wait = false;
-
- if (conf->array_frozen || !bio)
- wait = true;
- else if (conf->barrier && bio_data_dir(bio) == WRITE) {
- if ((conf->mddev->curr_resync_completed
- >= bio_end_sector(bio)) ||
- (conf->next_resync + NEXT_NORMALIO_DISTANCE
- <= bio->bi_iter.bi_sector))
- wait = false;
- else
- wait = true;
+ spin_lock_irq(&conf->resync_lock);
+ if (conf->array_frozen || conf->barrier[idx]) {
+ conf->nr_waiting[idx]++;
+ /* Wait for the barrier to drop. */
+ wait_event_lock_irq(
+ conf->wait_barrier,
+ !conf->array_frozen && !conf->barrier[idx],
+ conf->resync_lock);
+ conf->nr_waiting[idx]--;
}
- return wait;
+ conf->nr_pending[idx]++;
+ spin_unlock_irq(&conf->resync_lock);
}
-static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
+static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
{
- sector_t sector = 0;
+ long idx = get_barrier_bucket_idx(sector_nr);
spin_lock_irq(&conf->resync_lock);
- if (need_to_wait_for_sync(conf, bio)) {
- conf->nr_waiting++;
- /* Wait for the barrier to drop.
- * However if there are already pending
- * requests (preventing the barrier from
- * rising completely), and the
- * per-process bio queue isn't empty,
- * then don't wait, as we need to empty
- * that queue to allow conf->start_next_window
- * to increase.
- */
- wait_event_lock_irq(conf->wait_barrier,
- !conf->array_frozen &&
- (!conf->barrier ||
- ((conf->start_next_window <
- conf->next_resync + RESYNC_SECTORS) &&
- current->bio_list &&
- !bio_list_empty(current->bio_list))),
- conf->resync_lock);
- conf->nr_waiting--;
- }
-
- if (bio && bio_data_dir(bio) == WRITE) {
- if (bio->bi_iter.bi_sector >= conf->next_resync) {
- if (conf->start_next_window == MaxSector)
- conf->start_next_window =
- conf->next_resync +
- NEXT_NORMALIO_DISTANCE;
-
- if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE)
- <= bio->bi_iter.bi_sector)
- conf->next_window_requests++;
- else
- conf->current_window_requests++;
- sector = conf->start_next_window;
- }
+ if (conf->array_frozen) {
+ conf->nr_waiting[idx]++;
+ /* Wait for array to unfreeze */
+ wait_event_lock_irq(
+ conf->wait_barrier,
+ !conf->array_frozen,
+ conf->resync_lock);
+ conf->nr_waiting[idx]--;
}
-
- conf->nr_pending++;
+ conf->nr_pending[idx]++;
spin_unlock_irq(&conf->resync_lock);
- return sector;
}
-static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
- sector_t bi_sector)
+static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
+{
+ long idx = get_barrier_bucket_idx(sector_nr);
+
+ _wait_barrier(conf, idx);
+}
+
+static void wait_all_barriers(struct r1conf *conf)
+{
+ long idx;
+
+ for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+ _wait_barrier(conf, idx);
+}
+
+static void _allow_barrier(struct r1conf *conf, long idx)
{
unsigned long flags;
spin_lock_irqsave(&conf->resync_lock, flags);
- conf->nr_pending--;
- if (start_next_window) {
- if (start_next_window == conf->start_next_window) {
- if (conf->start_next_window + NEXT_NORMALIO_DISTANCE
- <= bi_sector)
- conf->next_window_requests--;
- else
- conf->current_window_requests--;
- } else
- conf->current_window_requests--;
-
- if (!conf->current_window_requests) {
- if (conf->next_window_requests) {
- conf->current_window_requests =
- conf->next_window_requests;
- conf->next_window_requests = 0;
- conf->start_next_window +=
- NEXT_NORMALIO_DISTANCE;
- } else
- conf->start_next_window = MaxSector;
- }
- }
+ conf->nr_pending[idx]--;
spin_unlock_irqrestore(&conf->resync_lock, flags);
wake_up(&conf->wait_barrier);
}
+static void allow_barrier(struct r1conf *conf, sector_t sector_nr)
+{
+ long idx = get_barrier_bucket_idx(sector_nr);
+
+ _allow_barrier(conf, idx);
+}
+
+static void allow_all_barriers(struct r1conf *conf)
+{
+ long idx;
+
+ for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+ _allow_barrier(conf, idx);
+}
+
+
+/* conf->resync_lock should be held */
+static int get_all_pendings(struct r1conf *conf)
+{
+ long idx;
+ int ret;
+
+ for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+ ret += conf->nr_pending[idx];
+ return ret;
+}
+
+/* conf->resync_lock should be held */
+static int get_all_queued(struct r1conf *conf)
+{
+ long idx;
+ int ret;
+
+ for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+ ret += conf->nr_queued[idx];
+ return ret;
+}
+
static void freeze_array(struct r1conf *conf, int extra)
{
/* stop syncio and normal IO and wait for everything to
* go quite.
- * We wait until nr_pending match nr_queued+extra
+ * We wait until get_all_pending() matches get_all_queued()+extra,
+ * which means sum of conf->nr_pending[] matches sum of
+ * conf->nr_queued[] plus extra (which might be 0 or 1).
* This is called in the context of one normal IO request
* that has failed. Thus any sync request that might be pending
- * will be blocked by nr_pending, and we need to wait for
+ * will be blocked by a conf->nr_pending[idx] which the idx depends
+ * on the request's sector number, and we need to wait for
* pending IO requests to complete or be queued for re-try.
- * Thus the number queued (nr_queued) plus this request (extra)
- * must match the number of pending IOs (nr_pending) before
- * we continue.
+ * Thus the number queued (sum of conf->nr_queued[]) plus this
+ * request (extra) must match the number of pending IOs (sum
+ * of conf->nr_pending[]) before we continue.
*/
spin_lock_irq(&conf->resync_lock);
conf->array_frozen = 1;
- wait_event_lock_irq_cmd(conf->wait_barrier,
- conf->nr_pending == conf->nr_queued+extra,
- conf->resync_lock,
- flush_pending_writes(conf));
+ wait_event_lock_irq_cmd(
+ conf->wait_barrier,
+ get_all_pendings(conf) == get_all_queued(conf)+extra,
+ conf->resync_lock,
+ flush_pending_writes(conf));
spin_unlock_irq(&conf->resync_lock);
}
static void unfreeze_array(struct r1conf *conf)
@@ -1031,6 +1061,7 @@ static void raid1_unplug(struct blk_plug
kfree(plug);
}
+
static void raid1_make_request(struct mddev *mddev, struct bio * bio)
{
struct r1conf *conf = mddev->private;
@@ -1051,7 +1082,6 @@ static void raid1_make_request(struct md
int first_clone;
int sectors_handled;
int max_sectors;
- sector_t start_next_window;
/*
* Register the new request and wait if the reconstruction
@@ -1087,8 +1117,6 @@ static void raid1_make_request(struct md
finish_wait(&conf->wait_barrier, &w);
}
- start_next_window = wait_barrier(conf, bio);
-
bitmap = mddev->bitmap;
/*
@@ -1121,6 +1149,14 @@ static void raid1_make_request(struct md
int rdisk;
read_again:
+ /* Still need barrier for READ in case that whole
+ * array is frozen.
+ */
+ wait_read_barrier(conf, r1_bio->sector);
+ /* max_sectors from read_balance is modified to not
+ * go across the border of the barrier unit which
+ * r1_bio->sector is in.
+ */
rdisk = read_balance(conf, r1_bio, &max_sectors);
if (rdisk < 0) {
@@ -1140,7 +1176,6 @@ read_again:
atomic_read(&bitmap->behind_writes) == 0);
}
r1_bio->read_disk = rdisk;
- r1_bio->start_next_window = 0;
read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector,
@@ -1211,7 +1246,7 @@ read_again:
disks = conf->raid_disks * 2;
retry_write:
- r1_bio->start_next_window = start_next_window;
+ wait_barrier(conf, r1_bio->sector);
blocked_rdev = NULL;
rcu_read_lock();
max_sectors = r1_bio->sectors;
@@ -1279,27 +1314,17 @@ read_again:
if (unlikely(blocked_rdev)) {
/* Wait for this device to become unblocked */
int j;
- sector_t old = start_next_window;
for (j = 0; j < i; j++)
if (r1_bio->bios[j])
rdev_dec_pending(conf->mirrors[j].rdev, mddev);
r1_bio->state = 0;
- allow_barrier(conf, start_next_window, bio->bi_iter.bi_sector);
+ allow_barrier(conf, r1_bio->sector);
md_wait_for_blocked_rdev(blocked_rdev, mddev);
- start_next_window = wait_barrier(conf, bio);
- /*
- * We must make sure the multi r1bios of bio have
- * the same value of bi_phys_segments
- */
- if (bio->bi_phys_segments && old &&
- old != start_next_window)
- /* Wait for the former r1bio(s) to complete */
- wait_event(conf->wait_barrier,
- bio->bi_phys_segments == 1);
goto retry_write;
}
+ max_sectors = align_to_barrier_unit_end(r1_bio->sector, max_sectors);
if (max_sectors < r1_bio->sectors) {
/* We are splitting this write into multiple parts, so
* we need to prepare for allocating another r1_bio.
@@ -1495,19 +1520,11 @@ static void print_conf(struct r1conf *co
static void close_sync(struct r1conf *conf)
{
- wait_barrier(conf, NULL);
- allow_barrier(conf, 0, 0);
+ wait_all_barriers(conf);
+ allow_all_barriers(conf);
mempool_destroy(conf->r1buf_pool);
conf->r1buf_pool = NULL;
-
- spin_lock_irq(&conf->resync_lock);
- conf->next_resync = MaxSector - 2 * NEXT_NORMALIO_DISTANCE;
- conf->start_next_window = MaxSector;
- conf->current_window_requests +=
- conf->next_window_requests;
- conf->next_window_requests = 0;
- spin_unlock_irq(&conf->resync_lock);
}
static int raid1_spare_active(struct mddev *mddev)
@@ -1787,7 +1804,7 @@ static int fix_sync_read_error(struct r1
struct bio *bio = r1_bio->bios[r1_bio->read_disk];
sector_t sect = r1_bio->sector;
int sectors = r1_bio->sectors;
- int idx = 0;
+ long idx = 0;
while(sectors) {
int s = sectors;
@@ -1983,6 +2000,14 @@ static void process_checks(struct r1bio
}
}
+/* If there is no error encountered during sync writing out, there are two
+ * places to destroy r1_bio:
+ * - sync_request_write(): if all wbios complete even before returning
+ * back to its caller.
+ * - end_sync_write(): when all remaining sync writes are done
+ * When an error is encountered in the above functions, r1_bio will
+ * be handed to handle_sync_write_finish() by reschedule_retry().
+ */
static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
{
struct r1conf *conf = mddev->private;
@@ -2244,6 +2269,9 @@ static void handle_write_finished(struct
{
int m;
bool fail = false;
+ sector_t sector_nr;
+ long idx;
+
for (m = 0; m < conf->raid_disks * 2 ; m++)
if (r1_bio->bios[m] == IO_MADE_GOOD) {
struct md_rdev *rdev = conf->mirrors[m].rdev;
@@ -2269,7 +2297,9 @@ static void handle_write_finished(struct
if (fail) {
spin_lock_irq(&conf->device_lock);
list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
- conf->nr_queued++;
+ sector_nr = r1_bio->sector;
+ idx = get_barrier_bucket_idx(sector_nr);
+ conf->nr_queued[idx]++;
spin_unlock_irq(&conf->device_lock);
md_wakeup_thread(conf->mddev->thread);
} else {
@@ -2380,6 +2410,8 @@ static void raid1d(struct md_thread *thr
struct r1conf *conf = mddev->private;
struct list_head *head = &conf->retry_list;
struct blk_plug plug;
+ sector_t sector_nr;
+ long idx;
md_check_recovery(mddev);
@@ -2387,17 +2419,19 @@ static void raid1d(struct md_thread *thr
!test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
LIST_HEAD(tmp);
spin_lock_irqsave(&conf->device_lock, flags);
- if (!test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
- while (!list_empty(&conf->bio_end_io_list)) {
- list_move(conf->bio_end_io_list.prev, &tmp);
- conf->nr_queued--;
- }
- }
+ if (!test_bit(MD_CHANGE_PENDING, &mddev->flags))
+ list_splice_init(&conf->bio_end_io_list, &tmp);
spin_unlock_irqrestore(&conf->device_lock, flags);
+
while (!list_empty(&tmp)) {
r1_bio = list_first_entry(&tmp, struct r1bio,
retry_list);
list_del(&r1_bio->retry_list);
+ sector_nr = r1_bio->sector;
+ idx = get_barrier_bucket_idx(sector_nr);
+ spin_lock_irqsave(&conf->device_lock, flags);
+ conf->nr_queued[idx]--;
+ spin_unlock_irqrestore(&conf->device_lock, flags);
if (mddev->degraded)
set_bit(R1BIO_Degraded, &r1_bio->state);
if (test_bit(R1BIO_WriteError, &r1_bio->state))
@@ -2418,7 +2452,9 @@ static void raid1d(struct md_thread *thr
}
r1_bio = list_entry(head->prev, struct r1bio, retry_list);
list_del(head->prev);
- conf->nr_queued--;
+ sector_nr = r1_bio->sector;
+ idx = get_barrier_bucket_idx(sector_nr);
+ conf->nr_queued[idx]--;
spin_unlock_irqrestore(&conf->device_lock, flags);
mddev = r1_bio->mddev;
@@ -2457,7 +2493,6 @@ static int init_resync(struct r1conf *co
conf->poolinfo);
if (!conf->r1buf_pool)
return -ENOMEM;
- conf->next_resync = 0;
return 0;
}
@@ -2486,6 +2521,7 @@ static sector_t raid1_sync_request(struc
int still_degraded = 0;
int good_sectors = RESYNC_SECTORS;
int min_bad = 0; /* number of sectors that are bad in all devices */
+ long idx = get_barrier_bucket_idx(sector_nr);
if (!conf->r1buf_pool)
if (init_resync(conf))
@@ -2535,7 +2571,7 @@ static sector_t raid1_sync_request(struc
* If there is non-resync activity waiting for a turn, then let it
* though before starting on this new sync request.
*/
- if (conf->nr_waiting)
+ if (conf->nr_waiting[idx])
schedule_timeout_uninterruptible(1);
/* we are incrementing sector_nr below. To be safe, we check against
@@ -2562,6 +2598,8 @@ static sector_t raid1_sync_request(struc
r1_bio->sector = sector_nr;
r1_bio->state = 0;
set_bit(R1BIO_IsSync, &r1_bio->state);
+ /* make sure good_sectors won't go across barrier unit border */
+ good_sectors = align_to_barrier_unit_end(sector_nr, good_sectors);
for (i = 0; i < conf->raid_disks * 2; i++) {
struct md_rdev *rdev;
@@ -2786,6 +2824,22 @@ static struct r1conf *setup_conf(struct
if (!conf)
goto abort;
+ conf->nr_pending = kzalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!conf->nr_pending)
+ goto abort;
+
+ conf->nr_waiting = kzalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!conf->nr_waiting)
+ goto abort;
+
+ conf->nr_queued = kzalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!conf->nr_queued)
+ goto abort;
+
+ conf->barrier = kzalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!conf->barrier)
+ goto abort;
+
conf->mirrors = kzalloc(sizeof(struct raid1_info)
* mddev->raid_disks * 2,
GFP_KERNEL);
@@ -2841,9 +2895,6 @@ static struct r1conf *setup_conf(struct
conf->pending_count = 0;
conf->recovery_disabled = mddev->recovery_disabled - 1;
- conf->start_next_window = MaxSector;
- conf->current_window_requests = conf->next_window_requests = 0;
-
err = -EIO;
for (i = 0; i < conf->raid_disks * 2; i++) {
@@ -2890,6 +2941,10 @@ static struct r1conf *setup_conf(struct
kfree(conf->mirrors);
safe_put_page(conf->tmppage);
kfree(conf->poolinfo);
+ kfree(conf->nr_pending);
+ kfree(conf->nr_waiting);
+ kfree(conf->nr_queued);
+ kfree(conf->barrier);
kfree(conf);
}
return ERR_PTR(err);
@@ -2992,6 +3047,10 @@ static void raid1_free(struct mddev *mdd
kfree(conf->mirrors);
safe_put_page(conf->tmppage);
kfree(conf->poolinfo);
+ kfree(conf->nr_pending);
+ kfree(conf->nr_waiting);
+ kfree(conf->nr_queued);
+ kfree(conf->barrier);
kfree(conf);
}
Index: linux-raid1/drivers/md/raid1.h
===================================================================
--- linux-raid1.orig/drivers/md/raid1.h
+++ linux-raid1/drivers/md/raid1.h
@@ -1,6 +1,20 @@
#ifndef _RAID1_H
#define _RAID1_H
+/* each barrier unit size is 64MB for now
+ * note: it must be larger than RESYNC_DEPTH
+ */
+#define BARRIER_UNIT_SECTOR_BITS 17
+#define BARRIER_UNIT_SECTOR_SIZE (1<<17)
+#define BARRIER_BUCKETS_NR (PAGE_SIZE/sizeof(long))
+
+/* will use bit shift later */
+static inline long get_barrier_bucket_idx(sector_t sector)
+{
+ return (long)(sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR;
+
+}
+
struct raid1_info {
struct md_rdev *rdev;
sector_t head_position;
@@ -35,25 +49,6 @@ struct r1conf {
*/
int raid_disks;
- /* During resync, read_balancing is only allowed on the part
- * of the array that has been resynced. 'next_resync' tells us
- * where that is.
- */
- sector_t next_resync;
-
- /* When raid1 starts resync, we divide array into four partitions
- * |---------|--------------|---------------------|-------------|
- * next_resync start_next_window end_window
- * start_next_window = next_resync + NEXT_NORMALIO_DISTANCE
- * end_window = start_next_window + NEXT_NORMALIO_DISTANCE
- * current_window_requests means the count of normalIO between
- * start_next_window and end_window.
- * next_window_requests means the count of normalIO after end_window.
- * */
- sector_t start_next_window;
- int current_window_requests;
- int next_window_requests;
-
spinlock_t device_lock;
/* list of 'struct r1bio' that need to be processed by raid1d,
@@ -79,10 +74,10 @@ struct r1conf {
*/
wait_queue_head_t wait_barrier;
spinlock_t resync_lock;
- int nr_pending;
- int nr_waiting;
- int nr_queued;
- int barrier;
+ int *nr_pending;
+ int *nr_waiting;
+ int *nr_queued;
+ int *barrier;
int array_frozen;
/* Set to 1 if a full sync is needed, (fresh device added).
@@ -135,7 +130,6 @@ struct r1bio {
* in this BehindIO request
*/
sector_t sector;
- sector_t start_next_window;
int sectors;
unsigned long state;
struct mddev *mddev;