Linux RAID subsystem development
 help / color / mirror / Atom feed
* Re: [v2 PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Coly Li @ 2017-01-16  9:06 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn,
	Guoqing Jiang
In-Reply-To: <87o9zlksvh.fsf@notabene.neil.brown.name>

On 2017/1/6 上午7:08, NeilBrown wrote:
> On Wed, Dec 28 2016, Coly Li wrote:
> 
>> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of
>> iobarrier.")' introduces a sliding resync window for raid1 I/O
>> barrier, this idea limits I/O barriers to happen only inside a
>> slidingresync window, for regular I/Os out of this resync window
>> they don't need to wait for barrier any more. On large raid1
>> device, it helps a lot to improve parallel writing I/O throughput
>> when there are background resync I/Os performing at same time.
>> 
>> The idea of sliding resync widow is awesome, but there are
>> several challenges are very difficult to solve, - code
>> complexity Sliding resync window requires several veriables to
>> work collectively, this is complexed and very hard to make it
>> work correctly. Just grep "Fixes: 79ef3a8aa1" in kernel git log,
>> there are 8 more patches to fix the original resync window patch.
>> This is not the end, any further related modification may easily
>> introduce more regreassion. - multiple sliding resync windows 
>> Currently raid1 code only has a single sliding resync window, we
>> cannot do parallel resync with current I/O barrier
>> implementation. Implementing multiple resync windows are much
>> more complexed, and very hard to make it correctly.
> 
> I think I've asked this before, but why do you think that parallel 
> resync might ever be a useful idea?  I don't think it makes any
> sense, so it is wrong for you use it as part of the justification
> for this patch. Just don't mention it at all unless you have a
> genuine expectation that it would really be a good thing, in which
> case: explain the value.
> 

I will remove this from the patch log. Thanks for your suggestion.


>> 
>> Therefore I decide to implement a much simpler raid1 I/O barrier,
>> by removing resync window code, I believe life will be much
>> easier.
>> 
>> The brief idea of the simpler barrier is, - Do not maintain a
>> logbal unique resync window - Use multiple hash buckets to reduce
>> I/O barrier conflictions, regular I/O only has to wait for a
>> resync I/O when both them have same barrier bucket index, vice
>> versa. - I/O barrier can be recuded to an acceptable number if
>> there are enought barrier buckets
>> 
>> Here I explain how the barrier buckets are designed, -
>> BARRIER_UNIT_SECTOR_SIZE The whole LBA address space of a raid1
>> device is divided into multiple barrier units, by the size of
>> BARRIER_UNIT_SECTOR_SIZE. Bio request won't go across border of
>> barrier unit size, that means maximum bio size is
>> BARRIER_UNIT_SECTOR_SIZE<<9 in bytes.
> 
> It would be good to say here what number you chose, and why you
> chose it. You have picked 64MB.  This divides a 1TB device into
> 4096 regions. Any write request must fit into one of these regions,
> so we mustn't make the region too small, else we would get the
> benefits for sending large requests down.
> 
> We want the resync to move from region to region fairly quickly so
> that the slowness caused by having to synchronize with the resync
> is averaged out overa fairly small time frame.  At full speed, 64MB
> should take less than 1 second.  When resync is competing with
> other IO, it could easily take up to a minute(?).  I think that is
> a fairly good range.
> 
> So I think 64MB is probably a very good choice.  I just would like
> to see the justification clearly stated.

I see, I will add text to explain why I choose 64MB bucket unit size.

A reason for 64MB is just as you mentioned, that's the trade off
between memory consume and hash conflict rate. I did some calculation
for md raid1 targe size from 1TB to 10PB, and the maximum I/O
throughput of NVMe SSD, finally I deside a bucket size between
64~128MB, bucket number between 512~1024 are proper numbers.

I will explain the calculation in detail in next version patch.

> 
>> - BARRIER_BUCKETS_NR There are BARRIER_BUCKETS_NR buckets in
>> total, which is defined by, #define BARRIER_BUCKETS_NR_BITS   9 
>> #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
> 
> Why 512 buckets?  What are the tradeoffs? More buckets means more
> memory consumed for counters. Fewer buckets means more false
> sharing. With 512 buckets, a request which is smaller than the
> region size has a 0.2% chance of having to wait for resync to
> pause.  I think that is quite a small enough fraction. I think you
> originally chose the number of buckets so that a set of 4-byte
> counters fits exactly into a page.  I think that is still a good 
> guideline, so I would have #define BARRIER_BUCKETS_NR_BITS
> (PAGE_SHIFT - 2) (which makes it 10 ...).
> 

Good suggestion, 1024 buckets makes less hash conflict in each bucket.
I will change it in next version patch.


>> if multiple I/O requests hit different barrier units, they only
>> need to compete I/O barrier with other I/Os which hit the same
>> barrier bucket index with each other. The index of a barrier
>> bucket which a bio should look for is calculated by, int idx =
>> hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS)
> 
> This isn't right.  You have to divide by BARRIER_UNIT_SECTOR_SIZE
> first. int idx = hash_long(sector_nr >> BARRIER_UNIT_SECTOR_BITS,
> BARRIER_BUCKETS_NR_BITS);
> 

Oops, thanks for catching this. I will fix it in next version patch.


>> that sector_nr is the start sector number of a bio. We use
>> function align_to_barrier_unit_end() to calculate sectors number
>> from sector_nr to the next barrier unit size boundary, if the
>> requesting bio size goes across the boundary, we split the bio in
>> raid1_make_request(), to make sure the finall bio sent into
>> generic_make_request() won't exceed barrier unit boundary.
>> 
>> Comparing to single sliding resync window, - Currently resync I/O
>> grows linearly, therefore regular and resync I/O will have
>> confliction within a single barrier units. So it is similar to 
>> single sliding resync window. - But a barrier unit bucket is
>> shared by all barrier units with identical barrier uinit index,
>> the probability of confliction might be higher than single
>> sliding resync window, in condition that writing I/Os always hit
>> barrier units which have identical barrier bucket index with the
>> resync I/Os. This is a very rare condition in real I/O work
>> loads, I cannot imagine how it could happen in practice. -
>> Therefore we can achieve a good enough low confliction rate with
>> much simpler barrier algorithm and implementation.
>> 
>> If user has a (realy) large raid1 device, for example 10PB size,
>> we may just increase the buckets number BARRIER_BUCKETS_NR. Now
>> this is a macro, it is possible to be a
>> raid1-created-time-defined variable in future.
> 
> Why?  Why would a large array require more buckets?  Are you just 
> guessing, or do you see some concrete reason for there to be a 
> relationship between the size of the array and the number of
> buckets? If you can see a connection, please state it.  If not,
> don't mention it.
> 

This is a assumption. OK I will remove these text from the patch log.


>> 
>> There are two changes should be noticed, - In raid1d(), I change
>> the code to decrease conf->nr_pending[idx] into single loop, it
>> looks like this, spin_lock_irqsave(&conf->device_lock, flags); 
>> conf->nr_queued[idx]--; 
>> spin_unlock_irqrestore(&conf->device_lock, flags); This change
>> generates more spin lock operations, but in next patch of this
>> patch set, it will be replaced by a single line code, 
>> atomic_dec(conf->nr_queueud[idx]); So we don't need to worry
>> about spin lock cost here. - Original function
>> raid1_make_request() is split into two functions, -
>> raid1_make_read_request(): handles regular read request and
>> calls wait_read_barrier() for I/O barrier. -
>> raid1_make_write_request(): handles regular write request and
>> calls wait_barrier() for I/O barrier. The differnece is
>> wait_read_barrier() only waits if array is frozen, using
>> different barrier function in different code path makes the code 
>> more clean and easy to read. - align_to_barrier_unit_end() is
>> called to make sure both regular and resync I/O won't go across a
>> barrier unit boundary.
>> 
>> Changelog V1: - Original RFC patch for comments V2: - Use
>> bio_split() to split the orignal bio if it goes across barrier
>> unit bounday, to make the code more simple, by suggestion from
>> Shaohua and Neil. - Use hash_long() to replace original linear
>> hash, to avoid a possible confilict between resync I/O and
>> sequential write I/O, by suggestion from Shaohua. - Add
>> conf->total_barriers to record barrier depth, which is used to 
>> control number of parallel sync I/O barriers, by suggestion from
>> Shaohua.
> 
> I really don't think this is needed. As long as RESYNC_DEPTH *
> RESYNC_SECTORS is less than BARRIER_UNIT_SECTOR_SIZE just testing
> again ->barrier[idx] will ensure the number of barrier requests
> never exceeds RESYNC_DEPTH*2.  That is sufficient.
> 
> Also, I think the reason for imposing the RESYNC_DEPTH limit is to
> make sure regular IO never has to wait too long for pending resync
> requests to flush.  With the simple test, regular IO will never
> need to wait for more than RESYNC_DEPTH requests to complete.
> 
> So I think have this field brings no valid, and is potentially
> confusing.
> 

Ok, I will remove conf->total_barriers and back to the original
implementation. IMHO, I think this is a threshold for hard disk, for
SSD this limitation could be much larger, that's why the original
version there is no conf->total_barrier.

I will fix in next version patch.


>> - In V1 patch the bellowed barrier buckets related members in
>> r1conf are allocated in memory page. To make the code more
>> simple, V2 patch moves the memory space into struct r1conf, like
>> this, -       int                     nr_pending; -       int
>> nr_waiting; -       int                     nr_queued; -
>> int                     barrier; +       int
>> nr_pending[BARRIER_BUCKETS_NR]; +       int
>> nr_waiting[BARRIER_BUCKETS_NR]; +       int
>> nr_queued[BARRIER_BUCKETS_NR]; +       int
>> barrier[BARRIER_BUCKETS_NR];
> 
> I don't like this.  It makes the r1conf 4 pages is size, most of
> which is wasted.  A 4-page allocation is more likely to fail than a
> few 1-page allocations. I think these should be:
>> +       int                     *nr_pending; +       int
>> *nr_waiting; +       int                     *nr_queued; +
>> int                     *barrier;
> 
> Then use kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL) to
> allocate each array.   I think this approach addresses Shaohua's 
> concerns without requiring a multi-page allocation.
> 

Very constructive suggestion. Yes, I will do this change in next
version patch.

>> This change is by the suggestion from Shaohua. - Remove some
>> inrelavent code comments, by suggestion from Guoqing. - Add a
>> missing wait_barrier() before jumping to retry_write, in 
>> raid1_make_write_request().
>> 
>> Signed-off-by: Coly Li <colyli@suse.de> Cc: Shaohua Li
>> <shli@fb.com> Cc: Neil Brown <neilb@suse.de> Cc: Johannes
>> Thumshirn <jthumshirn@suse.de> Cc: Guoqing Jiang
>> <gqjiang@suse.com> --- drivers/md/raid1.c | 485
>> ++++++++++++++++++++++++++++++----------------------- 
>> drivers/md/raid1.h |  37 ++-- 2 files changed, 291 insertions(+),
>> 231 deletions(-)
>> 
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index
>> a1f3fbe..5813656 100644 --- a/drivers/md/raid1.c +++
>> b/drivers/md/raid1.c @@ -67,9 +67,8 @@ */ static int
>> max_queued_requests = 1024;
>> 
>> -static void allow_barrier(struct r1conf *conf, sector_t
>> start_next_window, -			  sector_t bi_sector); -static void
>> lower_barrier(struct r1conf *conf); +static void
>> allow_barrier(struct r1conf *conf, sector_t sector_nr); +static
>> void lower_barrier(struct r1conf *conf, sector_t sector_nr);
>> 
>> #define raid1_log(md, fmt, args...)				\ do { if ((md)->queue)
>> blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while
>> (0) @@ -96,7 +95,6 @@ static void r1bio_pool_free(void *r1_bio,
>> void *data) #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9) 
>> #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW) #define
>> CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9) 
>> -#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
>> 
>> static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data) { @@
>> -211,7 +209,7 @@ static void put_buf(struct r1bio *r1_bio)
>> 
>> mempool_free(r1_bio, conf->r1buf_pool);
>> 
>> -	lower_barrier(conf); +	lower_barrier(conf, r1_bio->sector); }
>> 
>> static void reschedule_retry(struct r1bio *r1_bio) @@ -219,10
>> +217,12 @@ static void reschedule_retry(struct r1bio *r1_bio) 
>> unsigned long flags; struct mddev *mddev = r1_bio->mddev; struct
>> r1conf *conf = mddev->private; +	int idx;
>> 
>> +	idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS); 
>> spin_lock_irqsave(&conf->device_lock, flags); 
>> list_add(&r1_bio->retry_list, &conf->retry_list); -
>> conf->nr_queued ++; +	conf->nr_queued[idx]++; 
>> spin_unlock_irqrestore(&conf->device_lock, flags);
>> 
>> wake_up(&conf->wait_barrier); @@ -239,8 +239,6 @@ static void
>> call_bio_endio(struct r1bio *r1_bio) struct bio *bio =
>> r1_bio->master_bio; int done; struct r1conf *conf =
>> r1_bio->mddev->private; -	sector_t start_next_window =
>> r1_bio->start_next_window; -	sector_t bi_sector =
>> bio->bi_iter.bi_sector;
>> 
>> if (bio->bi_phys_segments) { unsigned long flags; @@ -265,7
>> +263,7 @@ static void call_bio_endio(struct r1bio *r1_bio) * Wake
>> up any possible resync thread that waits for the device * to go
>> idle. */ -		allow_barrier(conf, start_next_window, bi_sector); +
>> allow_barrier(conf, bio->bi_iter.bi_sector);
> 
> Why did you change this to use "bio->bi_iter.bi_sector" instead of 
> "bi_sector"?
> 
> I assume you thought it was an optimization that you would just
> slip in.  Can't hurt, right?
> 
> Just before this line is: bio_endio(bio); and that might cause the
> bio to be freed.  So your code could access freed memory.
> 
> Please be *very* cautious when making changes that are not
> directly related to the purpose of the patch.

Copied, I will fix it in next version patch. This is a regression I
introduced when I use bio_split() and move the location of
allow_barrier(). Thanks for catching this.

>> } }
>> 
>> @@ -513,6 +511,25 @@ static void raid1_end_write_request(struct
>> bio *bio) bio_put(to_put); }
>> 
>> +static sector_t align_to_barrier_unit_end(sector_t
>> start_sector, +					  sector_t sectors) +{ +	sector_t len; + +
>> WARN_ON(sectors == 0); +	/* len is the number of sectors from
>> start_sector to end of the +	 * barrier unit which start_sector
>> belongs to. +	 */ +	len = ((start_sector + sectors +
>> (1<<BARRIER_UNIT_SECTOR_BITS) - 1) & +
>> (~(BARRIER_UNIT_SECTOR_SIZE - 1))) - +	      start_sector;
> 
> This would be better as
> 
> len = round_up(start_sector+1, BARRIER_UNIT_SECTOR_SIZE) -
> start_sector;
> 

Aha! Yes, I will modify the code this way. Thanks for the suggestion.

> 
>> + +	if (len > sectors) +		len = sectors; + +	return len; +} + /* 
>> * This routine returns the disk from which the requested read
>> should * be done. There is a per-array 'next expected sequential
>> IO' sector @@ -809,168 +826,179 @@ static void
>> flush_pending_writes(struct r1conf *conf) */ static void
>> raise_barrier(struct r1conf *conf, sector_t sector_nr) { +	int
>> idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS); + 
>> spin_lock_irq(&conf->resync_lock);
>> 
>> /* Wait until no block IO is waiting */ -
>> wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting, +
>> wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx], 
>> conf->resync_lock);
>> 
>> /* block any new IO from starting */ -	conf->barrier++; -
>> conf->next_resync = sector_nr; +	conf->barrier[idx]++; +
>> conf->total_barriers++;
>> 
>> /* For these conditions we must wait: * A: while the array is in
>> frozen state -	 * B: while barrier >= RESYNC_DEPTH, meaning
>> resync reach -	 *    the max count which allowed. -	 * C:
>> next_resync + RESYNC_SECTORS > start_next_window, meaning -	 *
>> next resync will reach to the window which normal bios are -	 *
>> handling. -	 * D: while there are any active requests in the
>> current window. +	 * B: while conf->nr_pending[idx] is not 0,
>> meaning regular I/O +	 *    existing in sector number ranges
>> corresponding to idx. +	 * C: while conf->total_barriers >=
>> RESYNC_DEPTH, meaning resync reach +	 *    the max count which
>> allowed on the whole raid1 device. */ 
>> wait_event_lock_irq(conf->wait_barrier, !conf->array_frozen && -
>> conf->barrier < RESYNC_DEPTH && -
>> conf->current_window_requests == 0 && -
>> (conf->start_next_window >= -			     conf->next_resync +
>> RESYNC_SECTORS), +			     !conf->nr_pending[idx] && +
>> conf->total_barriers < RESYNC_DEPTH, conf->resync_lock);
>> 
>> -	conf->nr_pending++; +	conf->nr_pending[idx]++; 
>> spin_unlock_irq(&conf->resync_lock); }
>> 
>> -static void lower_barrier(struct r1conf *conf) +static void
>> lower_barrier(struct r1conf *conf, sector_t sector_nr) { unsigned
>> long flags; -	BUG_ON(conf->barrier <= 0); +	int idx =
>> hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS); + +
>> BUG_ON((conf->barrier[idx] <= 0) || conf->total_barriers <= 0); 
>> + spin_lock_irqsave(&conf->resync_lock, flags); -
>> conf->barrier--; -	conf->nr_pending--; +	conf->barrier[idx]--; +
>> conf->total_barriers--; +	conf->nr_pending[idx]--; 
>> spin_unlock_irqrestore(&conf->resync_lock, flags); 
>> wake_up(&conf->wait_barrier); }
>> 
>> -static bool need_to_wait_for_sync(struct r1conf *conf, struct
>> bio *bio) +static void _wait_barrier(struct r1conf *conf, int
>> idx) { -	bool wait = false; - -	if (conf->array_frozen || !bio) -
>> wait = true; -	else if (conf->barrier && bio_data_dir(bio) ==
>> WRITE) { -		if ((conf->mddev->curr_resync_completed -		     >=
>> bio_end_sector(bio)) || -		    (conf->start_next_window +
>> NEXT_NORMALIO_DISTANCE -		     <= bio->bi_iter.bi_sector)) -
>> wait = false; -		else -			wait = true; +
>> spin_lock_irq(&conf->resync_lock); +	if (conf->array_frozen ||
>> conf->barrier[idx]) { +		conf->nr_waiting[idx]++; +		/* Wait for
>> the barrier to drop. */ +		wait_event_lock_irq( +
>> conf->wait_barrier, +			!conf->array_frozen &&
>> !conf->barrier[idx], +			conf->resync_lock); +
>> conf->nr_waiting[idx]--; }
>> 
>> -	return wait; +	conf->nr_pending[idx]++; +
>> spin_unlock_irq(&conf->resync_lock); }
>> 
>> -static sector_t wait_barrier(struct r1conf *conf, struct bio
>> *bio) +static void wait_read_barrier(struct r1conf *conf,
>> sector_t sector_nr) { -	sector_t sector = 0; +	long idx =
>> hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
>> 
>> spin_lock_irq(&conf->resync_lock); -	if
>> (need_to_wait_for_sync(conf, bio)) { -		conf->nr_waiting++; -		/*
>> Wait for the barrier to drop. -		 * However if there are already
>> pending -		 * requests (preventing the barrier from -		 * rising
>> completely), and the -		 * per-process bio queue isn't empty, -
>> * then don't wait, as we need to empty -		 * that queue to allow
>> conf->start_next_window -		 * to increase. -		 */ -
>> raid1_log(conf->mddev, "wait barrier"); -
>> wait_event_lock_irq(conf->wait_barrier, -
>> !conf->array_frozen && -				    (!conf->barrier || -
>> ((conf->start_next_window < -				       conf->next_resync +
>> RESYNC_SECTORS) && -				      current->bio_list && -
>> !bio_list_empty(current->bio_list))), -
>> conf->resync_lock); -		conf->nr_waiting--; -	} - -	if (bio &&
>> bio_data_dir(bio) == WRITE) { -		if (bio->bi_iter.bi_sector >=
>> conf->next_resync) { -			if (conf->start_next_window ==
>> MaxSector) -				conf->start_next_window = -					conf->next_resync
>> + -					NEXT_NORMALIO_DISTANCE; - -			if
>> ((conf->start_next_window + NEXT_NORMALIO_DISTANCE) -			    <=
>> bio->bi_iter.bi_sector) -				conf->next_window_requests++; -
>> else -				conf->current_window_requests++; -			sector =
>> conf->start_next_window; -		} +	if (conf->array_frozen) { +
>> conf->nr_waiting[idx]++; +		/* Wait for array to unfreeze */ +
>> wait_event_lock_irq( +			conf->wait_barrier, +
>> !conf->array_frozen, +			conf->resync_lock); +
>> conf->nr_waiting[idx]--; }
>> 
>> -	conf->nr_pending++; +	conf->nr_pending[idx]++; 
>> spin_unlock_irq(&conf->resync_lock); -	return sector; }
>> 
>> -static void allow_barrier(struct r1conf *conf, sector_t
>> start_next_window, -			  sector_t bi_sector) +static void
>> wait_barrier(struct r1conf *conf, sector_t sector_nr) +{ +	int
>> idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS); + +
>> _wait_barrier(conf, idx); +} + +static void
>> wait_all_barriers(struct r1conf *conf) +{ +	int idx; + +	for (idx
>> = 0; idx < BARRIER_BUCKETS_NR; idx++) +		_wait_barrier(conf,
>> idx); +} + +static void _allow_barrier(struct r1conf *conf, int
>> idx) { unsigned long flags;
>> 
>> spin_lock_irqsave(&conf->resync_lock, flags); -
>> conf->nr_pending--; -	if (start_next_window) { -		if
>> (start_next_window == conf->start_next_window) { -			if
>> (conf->start_next_window + NEXT_NORMALIO_DISTANCE -			    <=
>> bi_sector) -				conf->next_window_requests--; -			else -
>> conf->current_window_requests--; -		} else -
>> conf->current_window_requests--; - -		if
>> (!conf->current_window_requests) { -			if
>> (conf->next_window_requests) { -				conf->current_window_requests
>> = -					conf->next_window_requests; -
>> conf->next_window_requests = 0; -				conf->start_next_window += -
>> NEXT_NORMALIO_DISTANCE; -			} else -				conf->start_next_window =
>> MaxSector; -		} -	} +	conf->nr_pending[idx]--; 
>> spin_unlock_irqrestore(&conf->resync_lock, flags); 
>> wake_up(&conf->wait_barrier); }
>> 
>> +static void allow_barrier(struct r1conf *conf, sector_t
>> sector_nr) +{ +	int idx = hash_long(sector_nr,
>> BARRIER_BUCKETS_NR_BITS); + +	_allow_barrier(conf, idx); +} + 
>> +static void allow_all_barriers(struct r1conf *conf) +{ +	int
>> idx; + +	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++) +
>> _allow_barrier(conf, idx); +} + +/* conf->resync_lock should be
>> held */ +static int get_all_pendings(struct r1conf *conf) +{ +
>> int idx, ret; + +	for (ret = 0, idx = 0; idx <
>> BARRIER_BUCKETS_NR; idx++) +		ret += conf->nr_pending[idx]; +
>> return ret; +} + +/* conf->resync_lock should be held */ +static
>> int get_all_queued(struct r1conf *conf) +{ +	int idx, ret; + +
>> for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++) +		ret +=
>> conf->nr_queued[idx]; +	return ret; +} + static void
>> freeze_array(struct r1conf *conf, int extra) { -	/* stop syncio
>> and normal IO and wait for everything to +	/* Stop sync I/O and
>> normal I/O and wait for everything to * go quite. -	 * We wait
>> until nr_pending match nr_queued+extra -	 * This is called in the
>> context of one normal IO request -	 * that has failed. Thus any
>> sync request that might be pending -	 * will be blocked by
>> nr_pending, and we need to wait for -	 * pending IO requests to
>> complete or be queued for re-try. -	 * Thus the number queued
>> (nr_queued) plus this request (extra) -	 * must match the number
>> of pending IOs (nr_pending) before -	 * we continue. +	 * This is
>> called in two situations: +	 * 1) management command handlers
>> (reshape, remove disk, quiesce). +	 * 2) one normal I/O request
>> failed. + +	 * After array_frozen is set to 1, new sync IO will
>> be blocked at +	 * raise_barrier(), and new normal I/O will
>> blocked at _wait_barrier(). +	 * The flying I/Os will either
>> complete or be queued. When everything +	 * goes quite, there are
>> only queued I/Os left. + +	 * Every flying I/O contributes to a
>> conf->nr_pending[idx], idx is the +	 * barrier bucket index which
>> this I/O request hits. When all sync and +	 * normal I/O are
>> queued, sum of all conf->nr_pending[] will match sum +	 * of all
>> conf->nr_queued[]. But normal I/O failure is an exception, +	 *
>> in handle_read_error(), we may call freeze_array() before trying
>> to +	 * fix the read error. In this case, the error read I/O is
>> not queued, +	 * so get_all_pending() == get_all_queued() + 1. +
>> * +	 * Therefore before this function returns, we need to wait
>> until +	 * get_all_pendings(conf) gets equal to
>> get_all_queued(conf)+extra. For +	 * normal I/O context, extra is
>> 1, in rested situations extra is 0. */ 
>> spin_lock_irq(&conf->resync_lock); conf->array_frozen = 1; 
>> raid1_log(conf->mddev, "wait freeze"); -
>> wait_event_lock_irq_cmd(conf->wait_barrier, -				conf->nr_pending
>> == conf->nr_queued+extra, -				conf->resync_lock, -
>> flush_pending_writes(conf)); +	wait_event_lock_irq_cmd( +
>> conf->wait_barrier, +		get_all_pendings(conf) ==
>> get_all_queued(conf)+extra, +		conf->resync_lock, +
>> flush_pending_writes(conf)); 
>> spin_unlock_irq(&conf->resync_lock); } static void
>> unfreeze_array(struct r1conf *conf) @@ -1066,64 +1094,23 @@
>> static void raid1_unplug(struct blk_plug_cb *cb, bool
>> from_schedule) kfree(plug); }
>> 
>> -static void raid1_make_request(struct mddev *mddev, struct bio *
>> bio) +static void raid1_make_read_request(struct mddev *mddev,
>> struct bio *bio) { struct r1conf *conf = mddev->private; struct
>> raid1_info *mirror; struct r1bio *r1_bio; struct bio *read_bio; -
>> int i, disks; struct bitmap *bitmap; -	unsigned long flags; const
>> int op = bio_op(bio); -	const int rw = bio_data_dir(bio); const
>> unsigned long do_sync = (bio->bi_opf & REQ_SYNC); -	const
>> unsigned long do_flush_fua = (bio->bi_opf & -						(REQ_PREFLUSH
>> | REQ_FUA)); -	struct md_rdev *blocked_rdev; -	struct blk_plug_cb
>> *cb; -	struct raid1_plug_cb *plug = NULL; -	int first_clone; int
>> sectors_handled; int max_sectors; -	sector_t start_next_window; +
>> int rdisk;
>> 
>> -	/* -	 * Register the new request and wait if the
>> reconstruction -	 * thread has put up a bar for new requests. -
>> * Continue immediately if no resync is active currently. +	/*
>> Still need barrier for READ in case that whole +	 * array is
>> frozen. */ - -	md_write_start(mddev, bio); /* wait on superblock
>> update early */ - -	if (bio_data_dir(bio) == WRITE && -
>> ((bio_end_sector(bio) > mddev->suspend_lo && -
>> bio->bi_iter.bi_sector < mddev->suspend_hi) || -
>> (mddev_is_clustered(mddev) && -
>> md_cluster_ops->area_resyncing(mddev, WRITE, -
>> bio->bi_iter.bi_sector, bio_end_sector(bio))))) { -		/* As the
>> suspend_* range is controlled by -		 * userspace, we want an
>> interruptible -		 * wait. -		 */ -		DEFINE_WAIT(w); -		for (;;)
>> { -			flush_signals(current); -
>> prepare_to_wait(&conf->wait_barrier, -					&w,
>> TASK_INTERRUPTIBLE); -			if (bio_end_sector(bio) <=
>> mddev->suspend_lo || -			    bio->bi_iter.bi_sector >=
>> mddev->suspend_hi || -			    (mddev_is_clustered(mddev) && -
>> !md_cluster_ops->area_resyncing(mddev, WRITE, -
>> bio->bi_iter.bi_sector, bio_end_sector(bio)))) -				break; -
>> schedule(); -		} -		finish_wait(&conf->wait_barrier, &w); -	} - -
>> start_next_window = wait_barrier(conf, bio); - +
>> wait_read_barrier(conf, bio->bi_iter.bi_sector); bitmap =
>> mddev->bitmap;
>> 
>> /* @@ -1149,12 +1136,9 @@ static void raid1_make_request(struct
>> mddev *mddev, struct bio * bio) bio->bi_phys_segments = 0; 
>> bio_clear_flag(bio, BIO_SEG_VALID);
>> 
>> -	if (rw == READ) { /* * read balancing logic: */ -		int rdisk; 
>> - read_again: rdisk = read_balance(conf, r1_bio, &max_sectors);
>> 
>> @@ -1176,7 +1160,6 @@ static void raid1_make_request(struct mddev
>> *mddev, struct bio * bio) atomic_read(&bitmap->behind_writes) ==
>> 0); } r1_bio->read_disk = rdisk; -		r1_bio->start_next_window =
>> 0;
>> 
>> read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev); 
>> bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector, @@
>> -1232,11 +1215,89 @@ static void raid1_make_request(struct mddev
>> *mddev, struct bio * bio) } else generic_make_request(read_bio); 
>> return; +} + +static void raid1_make_write_request(struct mddev
>> *mddev, struct bio *bio) +{ +	struct r1conf *conf =
>> mddev->private; +	struct r1bio *r1_bio; +	int i, disks; +	struct
>> bitmap *bitmap; +	unsigned long flags; +	const int op =
>> bio_op(bio); +	const unsigned long do_sync = (bio->bi_opf &
>> REQ_SYNC); +	const unsigned long do_flush_fua = (bio->bi_opf & +
>> (REQ_PREFLUSH | REQ_FUA)); +	struct md_rdev *blocked_rdev; +
>> struct blk_plug_cb *cb; +	struct raid1_plug_cb *plug = NULL; +
>> int first_clone; +	int sectors_handled; +	int max_sectors; + +
>> /* +	 * Register the new request and wait if the reconstruction +
>> * thread has put up a bar for new requests. +	 * Continue
>> immediately if no resync is active currently. +	 */ + +
>> md_write_start(mddev, bio); /* wait on superblock update early
>> */ + +	if (((bio_end_sector(bio) > mddev->suspend_lo && +
>> bio->bi_iter.bi_sector < mddev->suspend_hi) || +
>> (mddev_is_clustered(mddev) && +
>> md_cluster_ops->area_resyncing(mddev, WRITE, +
>> bio->bi_iter.bi_sector, bio_end_sector(bio))))) { +		/* As the
>> suspend_* range is controlled by +		 * userspace, we want an
>> interruptible +		 * wait. +		 */ +		DEFINE_WAIT(w); + +		for (;;)
>> { +			flush_signals(current); +
>> prepare_to_wait(&conf->wait_barrier, +					&w,
>> TASK_INTERRUPTIBLE); +			if (bio_end_sector(bio) <=
>> mddev->suspend_lo || +			    bio->bi_iter.bi_sector >=
>> mddev->suspend_hi || +			    (mddev_is_clustered(mddev) && +
>> !md_cluster_ops->area_resyncing( +						mddev, +						WRITE, +
>> bio->bi_iter.bi_sector, +						bio_end_sector(bio)))) +
>> break; +			schedule(); +		} +		finish_wait(&conf->wait_barrier,
>> &w); }
>> 
>> +	wait_barrier(conf, bio->bi_iter.bi_sector); +	bitmap =
>> mddev->bitmap; + /* -	 * WRITE: +	 * make_request() can abort the
>> operation when read-ahead is being +	 * used and no empty request
>> is available. +	 * +	 */ +	r1_bio =
>> mempool_alloc(conf->r1bio_pool, GFP_NOIO); + +	r1_bio->master_bio
>> = bio; +	r1_bio->sectors = bio_sectors(bio); +	r1_bio->state =
>> 0; +	r1_bio->mddev = mddev; +	r1_bio->sector =
>> bio->bi_iter.bi_sector; + +	/* We might need to issue multiple
>> reads to different +	 * devices if there are bad blocks around,
>> so we keep +	 * track of the number of reads in
>> bio->bi_phys_segments. +	 * If this is 0, there is only one
>> r1_bio and no locking +	 * will be needed when requests complete.
>> If it is +	 * non-zero, then it is the number of not-completed
>> requests.
> 
> This comment mentions "reads".  It should probably be changed to
> discuss what happens to "writes" since this is
> raid1_make_write_request().
> 

Yes, I will fix this.


>> */ +	bio->bi_phys_segments = 0; +	bio_clear_flag(bio,
>> BIO_SEG_VALID); + if (conf->pending_count >= max_queued_requests)
>> { md_wakeup_thread(mddev->thread); raid1_log(mddev, "wait
>> queued"); @@ -1256,7 +1317,6 @@ static void
>> raid1_make_request(struct mddev *mddev, struct bio * bio)
>> 
>> disks = conf->raid_disks * 2; retry_write: -
>> r1_bio->start_next_window = start_next_window; blocked_rdev =
>> NULL; rcu_read_lock(); max_sectors = r1_bio->sectors; @@ -1324,25
>> +1384,15 @@ static void raid1_make_request(struct mddev *mddev,
>> struct bio * bio) if (unlikely(blocked_rdev)) { /* Wait for this
>> device to become unblocked */ int j; -		sector_t old =
>> start_next_window;
>> 
>> for (j = 0; j < i; j++) if (r1_bio->bios[j]) 
>> rdev_dec_pending(conf->mirrors[j].rdev, mddev); r1_bio->state =
>> 0; -		allow_barrier(conf, start_next_window,
>> bio->bi_iter.bi_sector); +		allow_barrier(conf,
>> bio->bi_iter.bi_sector); raid1_log(mddev, "wait rdev %d blocked",
>> blocked_rdev->raid_disk); md_wait_for_blocked_rdev(blocked_rdev,
>> mddev); -		start_next_window = wait_barrier(conf, bio); -		/* -
>> * We must make sure the multi r1bios of bio have -		 * the same
>> value of bi_phys_segments -		 */ -		if (bio->bi_phys_segments &&
>> old && -		    old != start_next_window) -			/* Wait for the
>> former r1bio(s) to complete */ -
>> wait_event(conf->wait_barrier, -				   bio->bi_phys_segments ==
>> 1); +		wait_barrier(conf, bio->bi_iter.bi_sector); goto
>> retry_write; }
>> 
>> @@ -1464,6 +1514,31 @@ static void raid1_make_request(struct
>> mddev *mddev, struct bio * bio) wake_up(&conf->wait_barrier); }
>> 
>> +static void raid1_make_request(struct mddev *mddev, struct bio
>> *bio) +{ +	void (*make_request_fn)(struct mddev *mddev, struct
>> bio *bio); +	struct bio *split; +	sector_t sectors; + +
>> make_request_fn = (bio_data_dir(bio) == READ) ? +
>> raid1_make_read_request : +			  raid1_make_write_request; + +	/*
>> if bio exceeds barrier unit boundary, split it */ +	do { +
>> sectors = align_to_barrier_unit_end(bio->bi_iter.bi_sector, +
>> bio_sectors(bio)); +		if (sectors < bio_sectors(bio)) { +			split
>> = bio_split(bio, sectors, GFP_NOIO, fs_bio_set); +
>> bio_chain(split, bio); +		} else { +			split = bio; +		} + +
>> make_request_fn(mddev, split); +	} while (split != bio); +} + 
>> static void raid1_status(struct seq_file *seq, struct mddev
>> *mddev) { struct r1conf *conf = mddev->private; @@ -1552,19
>> +1627,11 @@ static void print_conf(struct r1conf *conf)
>> 
>> static void close_sync(struct r1conf *conf) { -
>> wait_barrier(conf, NULL); -	allow_barrier(conf, 0, 0); +
>> wait_all_barriers(conf); +	allow_all_barriers(conf);
>> 
>> mempool_destroy(conf->r1buf_pool); conf->r1buf_pool = NULL; - -
>> spin_lock_irq(&conf->resync_lock); -	conf->next_resync =
>> MaxSector - 2 * NEXT_NORMALIO_DISTANCE; -	conf->start_next_window
>> = MaxSector; -	conf->current_window_requests += -
>> conf->next_window_requests; -	conf->next_window_requests = 0; -
>> spin_unlock_irq(&conf->resync_lock); }
>> 
>> static int raid1_spare_active(struct mddev *mddev) @@ -2311,8
>> +2378,9 @@ static void handle_sync_write_finished(struct r1conf
>> *conf, struct r1bio *r1_bio
>> 
>> static void handle_write_finished(struct r1conf *conf, struct
>> r1bio *r1_bio) { -	int m; +	int m, idx; bool fail = false; + for
>> (m = 0; m < conf->raid_disks * 2 ; m++) if (r1_bio->bios[m] ==
>> IO_MADE_GOOD) { struct md_rdev *rdev = conf->mirrors[m].rdev; @@
>> -2338,7 +2406,8 @@ static void handle_write_finished(struct
>> r1conf *conf, struct r1bio *r1_bio) if (fail) { 
>> spin_lock_irq(&conf->device_lock); list_add(&r1_bio->retry_list,
>> &conf->bio_end_io_list); -		conf->nr_queued++; +		idx =
>> hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS); +
>> conf->nr_queued[idx]++; spin_unlock_irq(&conf->device_lock); 
>> md_wakeup_thread(conf->mddev->thread); } else { @@ -2460,6
>> +2529,7 @@ static void raid1d(struct md_thread *thread) struct
>> r1conf *conf = mddev->private; struct list_head *head =
>> &conf->retry_list; struct blk_plug plug; +	int idx;
>> 
>> md_check_recovery(mddev);
>> 
>> @@ -2467,17 +2537,18 @@ static void raid1d(struct md_thread
>> *thread) !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) { 
>> LIST_HEAD(tmp); spin_lock_irqsave(&conf->device_lock, flags); -
>> if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) { -
>> while (!list_empty(&conf->bio_end_io_list)) { -
>> list_move(conf->bio_end_io_list.prev, &tmp); -
>> conf->nr_queued--; -			} -		} +		if
>> (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) +
>> list_splice_init(&conf->bio_end_io_list, &tmp); 
>> spin_unlock_irqrestore(&conf->device_lock, flags); while
>> (!list_empty(&tmp)) { r1_bio = list_first_entry(&tmp, struct
>> r1bio, retry_list); list_del(&r1_bio->retry_list); +			idx =
>> hash_long(r1_bio->sector, +					BARRIER_BUCKETS_NR_BITS); +
>> spin_lock_irqsave(&conf->device_lock, flags); +
>> conf->nr_queued[idx]--; +
>> spin_unlock_irqrestore(&conf->device_lock, flags); if
>> (mddev->degraded) set_bit(R1BIO_Degraded, &r1_bio->state); if
>> (test_bit(R1BIO_WriteError, &r1_bio->state)) @@ -2498,7 +2569,8
>> @@ static void raid1d(struct md_thread *thread) } r1_bio =
>> list_entry(head->prev, struct r1bio, retry_list); 
>> list_del(head->prev); -		conf->nr_queued--; +		idx =
>> hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS); +
>> conf->nr_queued[idx]--; 
>> spin_unlock_irqrestore(&conf->device_lock, flags);
>> 
>> mddev = r1_bio->mddev; @@ -2537,7 +2609,6 @@ static int
>> init_resync(struct r1conf *conf) conf->poolinfo); if
>> (!conf->r1buf_pool) return -ENOMEM; -	conf->next_resync = 0; 
>> return 0; }
>> 
>> @@ -2566,6 +2637,7 @@ static sector_t raid1_sync_request(struct
>> mddev *mddev, sector_t sector_nr, int still_degraded = 0; int
>> good_sectors = RESYNC_SECTORS; int min_bad = 0; /* number of
>> sectors that are bad in all devices */ +	int idx =
>> hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
>> 
>> if (!conf->r1buf_pool) if (init_resync(conf)) @@ -2615,7 +2687,7
>> @@ static sector_t raid1_sync_request(struct mddev *mddev,
>> sector_t sector_nr, * If there is non-resync activity waiting for
>> a turn, then let it * though before starting on this new sync
>> request. */ -	if (conf->nr_waiting) +	if (conf->nr_waiting[idx]) 
>> schedule_timeout_uninterruptible(1);
>> 
>> /* we are incrementing sector_nr below. To be safe, we check
>> against @@ -2642,6 +2714,8 @@ static sector_t
>> raid1_sync_request(struct mddev *mddev, sector_t sector_nr, 
>> r1_bio->sector = sector_nr; r1_bio->state = 0; 
>> set_bit(R1BIO_IsSync, &r1_bio->state); +	/* make sure
>> good_sectors won't go across barrier unit boundary */ +
>> good_sectors = align_to_barrier_unit_end(sector_nr,
>> good_sectors);
>> 
>> for (i = 0; i < conf->raid_disks * 2; i++) { struct md_rdev
>> *rdev; @@ -2927,9 +3001,6 @@ static struct r1conf
>> *setup_conf(struct mddev *mddev) conf->pending_count = 0; 
>> conf->recovery_disabled = mddev->recovery_disabled - 1;
>> 
>> -	conf->start_next_window = MaxSector; -
>> conf->current_window_requests = conf->next_window_requests = 0; 
>> - err = -EIO; for (i = 0; i < conf->raid_disks * 2; i++) {
>> 
>> diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h index
>> c52ef42..817115d 100644 --- a/drivers/md/raid1.h +++
>> b/drivers/md/raid1.h @@ -1,6 +1,14 @@ #ifndef _RAID1_H #define
>> _RAID1_H
>> 
>> +/* each barrier unit size is 64MB fow now + * note: it must be
>> larger than RESYNC_DEPTH + */ +#define BARRIER_UNIT_SECTOR_BITS
>> 17 +#define BARRIER_UNIT_SECTOR_SIZE	(1<<17) +#define
>> BARRIER_BUCKETS_NR_BITS		9 +#define BARRIER_BUCKETS_NR
>> (1<<BARRIER_BUCKETS_NR_BITS) + struct raid1_info { struct md_rdev
>> *rdev; sector_t	head_position; @@ -35,25 +43,6 @@ struct r1conf
>> { */ int			raid_disks;
>> 
>> -	/* During resync, read_balancing is only allowed on the part -
>> * of the array that has been resynced.  'next_resync' tells us -
>> * where that is. -	 */ -	sector_t		next_resync; - -	/* When raid1
>> starts resync, we divide array into four partitions -	 *
>> |---------|--------------|---------------------|-------------| -
>> *        next_resync   start_next_window       end_window -	 *
>> start_next_window = next_resync + NEXT_NORMALIO_DISTANCE -	 *
>> end_window = start_next_window + NEXT_NORMALIO_DISTANCE -	 *
>> current_window_requests means the count of normalIO between -	 *
>> start_next_window and end_window. -	 * next_window_requests means
>> the count of normalIO after end_window. -	 * */ -	sector_t
>> start_next_window; -	int			current_window_requests; -	int
>> next_window_requests; - spinlock_t		device_lock;
>> 
>> /* list of 'struct r1bio' that need to be processed by raid1d, @@
>> -79,10 +68,11 @@ struct r1conf { */ wait_queue_head_t
>> wait_barrier; spinlock_t		resync_lock; -	int			nr_pending; -	int
>> nr_waiting; -	int			nr_queued; -	int			barrier; +	int
>> nr_pending[BARRIER_BUCKETS_NR]; +	int
>> nr_waiting[BARRIER_BUCKETS_NR]; +	int
>> nr_queued[BARRIER_BUCKETS_NR]; +	int
>> barrier[BARRIER_BUCKETS_NR]; +	int			total_barriers; int
>> array_frozen;
>> 
>> /* Set to 1 if a full sync is needed, (fresh device added). @@
>> -135,7 +125,6 @@ struct r1bio { * in this BehindIO request */ 
>> sector_t		sector; -	sector_t		start_next_window; int			sectors; 
>> unsigned long		state; struct mddev		*mddev; --

Thanks for your review, I will update all the fixes in next version patch.

Coly


^ permalink raw reply

* MDADM grow /dev/md0 - chunk size - II
From: J. Cassidy @ 2017-01-16 15:12 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb

Hello all/Neil,


no ideas anyone?


Any ideas much appreciated.


JC



^ permalink raw reply

* Re: PROBLEM: Kernel BUG with raid5 soft + Xen + DRBD - invalid opcode
From: MasterPrenium @ 2017-01-17  1:54 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-raid, xen-users, MasterPrenium@gmail.com, linux-kernel,
	xen-devel
In-Reply-To: <20170109224435.sfyrvkxhajgrq2i5@kernel.org>

Hi Shaohua,

I've made some new little tests, maybe it can help.

- I tried creating the RAID 5 stack with only 2 drives (mdadm --create 
/dev/md10 --raid-devices=3 --level=5 /dev/sdc1 /dev/sdd1 missing).
The same issue is happening.
- but one time (still with 2/3 drives), I was not able to crash the 
kernel, with exactly the same procedure as previous. Even with 
re-creating filesystems ect.
In order to re-produce the BUG I had to re-create the array.

Can this be linked to this message ? :
[  155.667456] md10: Warning: Device sdc1 is misaligned

I don't know how to "align" a drive in a RAID stack... The partition is 
correctly align (as "parted" says).

- In another test (still 2/3 drives in the stack), I didn't got the 
kernel crash, but I had 100% io wait on cpu. Trying to reboot, finally 
give me this printk messages : http://pastebin.com/uzVHUUrC

If you have any patch to give me (maybe something to be more verbose 
about the issue), please tell me, I'll test it as it's a really blocking 
issue...

Best regards,

MasterPrenium


Le 09/01/2017 à 23:44, Shaohua Li a écrit :
> On Sun, Jan 08, 2017 at 02:31:15PM +0100, MasterPrenium wrote:
>> Hello,
>>
>> Replies below + :
>> - I don't know if this can help but after the crash, when the system
>> reboots, the Raid 5 stack is re-synchronizing
>> [   37.028239] md10: Warning: Device sdc1 is misaligned
>> [   37.028541] created bitmap (15 pages) for device md10
>> [   37.030433] md10: bitmap initialized from disk: read 1 pages, set 59 of
>> 29807 bits
>>
>> - Sometimes the kernel completely crash (lost serial + network connection),
>> sometimes only got the "BUG" dump, but still have network access (but a
>> reboot is impossible, need to reset the system).
>>
>> - You can find blktrace here (while running fio), I hope it's complete since
>> the end of the file is when the kernel crashed : https://goo.gl/X9jZ50
> Looks most are normal full stripe writes.
>   
>>> I'm trying to reproduce, but no success. So
>>> ext4->btrfs->raid5, crash
>>> btrfs->raid5, no crash
>>> right? does subvolume matter? When you create the raid5 array, does adding
>>> '--assume-clean' option change the behavior? I'd like to narrow down the issue.
>>> If you can capture the blktrace to the raid5 array, it would be great to hint
>>> us what kind of IO it is.
>> Yes Correct.
>> The subvolume doesn't matter.
>> -- assume-clean doesn't change the behaviour.
> so it's not a resync issue.
>
>> Don't forget that the system needs to be running on xen to crash, without
>> (on native kernel) it doesn't crash (or at least, I was not able to make it
>> crash).
>>>> Regarding your patch, I can't find it. Is it the one sent by Konstantin
>>>> Khlebnikov ?
>>> Right.
>> It doesn't help :(. Maybe the crash is happening a little bit later.
> ok, the patch is unlikely helpful, since the IO size isn't very big.
>
> Don't have good idea yet. My best guess so far is virtual machine introduces
> extra delay, which might trigger some race conditions which aren't seen in
> native.  I'll check if I could find something locally.
>
> Thanks,
> Shaohua


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply

* performance of raid5 on fast devices
From: Jake Yao @ 2017-01-17  2:35 UTC (permalink / raw)
  To: linux-raid

I have a raid5 array on 4 NVMe drives, and the performance on the
array is only marginally better than a single drive. Unlike a similar
raid5 array on 4 SAS SSD or HDD,  the performance on array is 3x
better than a single drive, which is expected.

It looks like when the single kernel thread associated with the raid
device running at 100%, the array performance hit its peak. This can
happen easily for fast devices like NVMe.

This can reproduced by creating a raid5 with 4 ramdisks as well, and
comparing performance on the array and one ramdisk. Sometimes the
performance on the array is worse than a single ramdisk.

The kernel version is 4.9.0-rc3 and mdadm is release 3.4, no write
journal is configured.

Is this a known issue?

Please cc me on the email as I am not on the mail list.

Thanks!

^ permalink raw reply

* Re: performance of raid5 on fast devices
From: Stan Hoeppner @ 2017-01-17  3:10 UTC (permalink / raw)
  To: Jake Yao, linux-raid
In-Reply-To: <CA+Dh761_kVPEcRdFEAo72Wif_t-G7RpgyoKGetEFveyMpaYD1w@mail.gmail.com>

On 01/16/2017 08:35 PM, Jake Yao wrote:
> I have a raid5 array on 4 NVMe drives, and the performance on the
> array is only marginally better than a single drive. Unlike a similar
> raid5 array on 4 SAS SSD or HDD,  the performance on array is 3x
> better than a single drive, which is expected.
>
> It looks like when the single kernel thread associated with the raid
> device running at 100%, the array performance hit its peak. This can
> happen easily for fast devices like NVMe.
The md raid personalities are limited to a single kernel write thread.  
Work is in progress to alleviate this bottleneck by using multiple write 
threads.  When it will hit mainline I don't know.

> This can reproduced by creating a raid5 with 4 ramdisks as well, and
> comparing performance on the array and one ramdisk. Sometimes the
> performance on the array is worse than a single ramdisk.
>
> The kernel version is 4.9.0-rc3 and mdadm is release 3.4, no write
> journal is configured.
>
> Is this a known issue?
>
> Please cc me on the email as I am not on the mail list.
>
> Thanks!



^ permalink raw reply

* Re: performance of raid5 on fast devices
From: Coly Li @ 2017-01-17  5:04 UTC (permalink / raw)
  To: Jake Yao; +Cc: Stan Hoeppner, linux-raid
In-Reply-To: <65f0f922-fae0-0135-463d-86115cffbf33@hardwarefreak.org>

On 2017/1/17 上午11:10, Stan Hoeppner wrote:
> On 01/16/2017 08:35 PM, Jake Yao wrote:
>> I have a raid5 array on 4 NVMe drives, and the performance on the
>> array is only marginally better than a single drive. Unlike a similar
>> raid5 array on 4 SAS SSD or HDD,  the performance on array is 3x
>> better than a single drive, which is expected.
>>
>> It looks like when the single kernel thread associated with the raid
>> device running at 100%, the array performance hit its peak. This can
>> happen easily for fast devices like NVMe.
> The md raid personalities are limited to a single kernel write thread. 
> Work is in progress to alleviate this bottleneck by using multiple write
> threads.  When it will hit mainline I don't know.

If you want 8 writing threads, and your md raid5 device is /dev/md0, you
may have a try with,
	echo 8 > /sys/block/md0/md/group_thread_cnt

> 
>> This can reproduced by creating a raid5 with 4 ramdisks as well, and
>> comparing performance on the array and one ramdisk. Sometimes the
>> performance on the array is worse than a single ramdisk.
>>
>> The kernel version is 4.9.0-rc3 and mdadm is release 3.4, no write
>> journal is configured.
>>
>> Is this a known issue?

It was, but you are on 4.9 kernel, group_thread_cnt should work for you.

Coly


^ permalink raw reply

* Re: performance of raid5 on fast devices
From: Roman Mamedov @ 2017-01-17  5:10 UTC (permalink / raw)
  To: Jake Yao; +Cc: linux-raid
In-Reply-To: <CA+Dh761_kVPEcRdFEAo72Wif_t-G7RpgyoKGetEFveyMpaYD1w@mail.gmail.com>

On Mon, 16 Jan 2017 21:35:21 -0500
Jake Yao <jgyao1@gmail.com> wrote:

> I have a raid5 array on 4 NVMe drives, and the performance on the
> array is only marginally better than a single drive. Unlike a similar
> raid5 array on 4 SAS SSD or HDD,  the performance on array is 3x
> better than a single drive, which is expected.
> 
> It looks like when the single kernel thread associated with the raid
> device running at 100%, the array performance hit its peak. This can
> happen easily for fast devices like NVMe.
> 
> This can reproduced by creating a raid5 with 4 ramdisks as well, and
> comparing performance on the array and one ramdisk. Sometimes the
> performance on the array is worse than a single ramdisk.
> 
> The kernel version is 4.9.0-rc3 and mdadm is release 3.4, no write
> journal is configured.
> 
> Is this a known issue?

How do you measure the performance?

Sure it may be CPU-bound in the end, but also why not try the usual
optimization tricks, such as:

  * increase your stripe_cache_size, it's not uncommon that this can speed up
    linear writes by as much as several times;

  * if you meant reads, you could look into read-ahead settings for the array;

  * and in both cases, try experimenting with different stripe sizes (if you
    were using 512K, try with 64K stripes).

-- 
With respect,
Roman

^ permalink raw reply

* Soft-Raid 0 Performance | Transfer two Data-Streams (CPU+FPGA) to the same Soft-Raid
From: Eric Schwarz @ 2017-01-17 15:00 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <21298bd64ef0dc83b29bd45362ead46c@sw-optimization.com>

Hello mailing list,

I have got two questions:

1.) I have set-up a softraid (raid level 0) with mdadm using two M.2 
modules. For one module the throughput is ~350MB/s (no mdadm) for two 
modules the throughput is ~500MB/s which is less than factor 1,5 of the 
throughput of a single drive. The filesystem used is ext4. Is there 
someone having some values for comparison? For me the throughput gain 
seems to be too little. The test was done using a HP Z840 workstation.

2.) We want to configure a softraid (raid level 0) with mdadm which can 
be used from within Linux but also it should be possible to write data 
to the raid w/ DMA directly from the FPGA which is also connected to the 
PCIe bus as a slave as well as the M.2 modules. How can that be achieved 
using existing kernel infrastructure?

Many thanks for helpful replies
Eric

^ permalink raw reply

* Re: performance of raid5 on fast devices
From: Jake Yao @ 2017-01-17 15:22 UTC (permalink / raw)
  To: Coly Li; +Cc: Stan Hoeppner, linux-raid
In-Reply-To: <f72f7be4-9b2d-6bf9-1b61-eb066ef35343@suse.de>

Thanks for the response.

It helps a little by increasing group_thread_cnt, but not to the
extend of 3x expected.  It looks like the single kernel thread is
still the bottleneck.

On Tue, Jan 17, 2017 at 12:04 AM, Coly Li <colyli@suse.de> wrote:
> On 2017/1/17 上午11:10, Stan Hoeppner wrote:
>> On 01/16/2017 08:35 PM, Jake Yao wrote:
>>> I have a raid5 array on 4 NVMe drives, and the performance on the
>>> array is only marginally better than a single drive. Unlike a similar
>>> raid5 array on 4 SAS SSD or HDD,  the performance on array is 3x
>>> better than a single drive, which is expected.
>>>
>>> It looks like when the single kernel thread associated with the raid
>>> device running at 100%, the array performance hit its peak. This can
>>> happen easily for fast devices like NVMe.
>> The md raid personalities are limited to a single kernel write thread.
>> Work is in progress to alleviate this bottleneck by using multiple write
>> threads.  When it will hit mainline I don't know.
>
> If you want 8 writing threads, and your md raid5 device is /dev/md0, you
> may have a try with,
>         echo 8 > /sys/block/md0/md/group_thread_cnt
>
>>
>>> This can reproduced by creating a raid5 with 4 ramdisks as well, and
>>> comparing performance on the array and one ramdisk. Sometimes the
>>> performance on the array is worse than a single ramdisk.
>>>
>>> The kernel version is 4.9.0-rc3 and mdadm is release 3.4, no write
>>> journal is configured.
>>>
>>> Is this a known issue?
>
> It was, but you are on 4.9 kernel, group_thread_cnt should work for you.
>
> Coly
>

^ permalink raw reply

* Re: performance of raid5 on fast devices
From: Jake Yao @ 2017-01-17 15:28 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-raid
In-Reply-To: <20170117101043.78e80bdc@natsu>

Thanks for the response.

I am using fio for performance measurement.

The chunk size of raid5 array is 32K, and the block size in fio is set
to 96K(3x chunk size) which is also the optimal_io_size, ioengine is
set to libaio with direct IO.

Increasing stripe_cache_size does not help much, and it looks like the
write is limited by the single kernel thread as mentioned earlier.


On Tue, Jan 17, 2017 at 12:10 AM, Roman Mamedov <rm@romanrm.net> wrote:
> On Mon, 16 Jan 2017 21:35:21 -0500
> Jake Yao <jgyao1@gmail.com> wrote:
>
>> I have a raid5 array on 4 NVMe drives, and the performance on the
>> array is only marginally better than a single drive. Unlike a similar
>> raid5 array on 4 SAS SSD or HDD,  the performance on array is 3x
>> better than a single drive, which is expected.
>>
>> It looks like when the single kernel thread associated with the raid
>> device running at 100%, the array performance hit its peak. This can
>> happen easily for fast devices like NVMe.
>>
>> This can reproduced by creating a raid5 with 4 ramdisks as well, and
>> comparing performance on the array and one ramdisk. Sometimes the
>> performance on the array is worse than a single ramdisk.
>>
>> The kernel version is 4.9.0-rc3 and mdadm is release 3.4, no write
>> journal is configured.
>>
>> Is this a known issue?
>
> How do you measure the performance?
>
> Sure it may be CPU-bound in the end, but also why not try the usual
> optimization tricks, such as:
>
>   * increase your stripe_cache_size, it's not uncommon that this can speed up
>     linear writes by as much as several times;
>
>   * if you meant reads, you could look into read-ahead settings for the array;
>
>   * and in both cases, try experimenting with different stripe sizes (if you
>     were using 512K, try with 64K stripes).
>
> --
> With respect,
> Roman

^ permalink raw reply

* Re: Recommendation on new system Arrays
From: Benjammin2068 @ 2017-01-17 18:57 UTC (permalink / raw)
  To: Linux-RAID
In-Reply-To: <030ca00a-f024-14a0-d5b9-a9497811910e@turmel.org>

On 01/12/2017 09:39 AM, Phil Turmel wrote:
> Hi Ben,
>
> {Convention on kernel.org is reply-to-all -- please do.}
>
> On 01/07/2017 11:04 PM, Benjammin2068 wrote:
>
>> Also in other news.... (maybe someone from this list can help since they've run into it before)
>>
>> The motherboard of the computer has its own SATA controller as usual... as well as this Avago SAS controller.
>>
>> When CentOS 7 boots up, it enumerates the external SAS drives starting at /dev/sda instead of the motherboard's drives.
>>
>> The OCD in me wants the MB's SATA drives as /dev/sda-sdd.
>>
>> Where does one even control that enumeration order? udev? Eeks.
> You cannot control the order.  Period.  It is pseudo-random based on
> hardware responses to driver loads.  New kernels can have entirely
> different orders of devices, and if parallel loads are allowed, you can
> have one controller get sd[aceg] while another gets sd[bdfh]. The entire
> system of LABEL= and UUID= support via initramfs and blkid is intended
> make fstab and other utilities deterministic in spite of varying names.
>
> mdadm has always been resistant to naming problems thanks to its device
> #s and UUIDs in the superblocks.

Yea, I got that and understand... just the OCD in me. (sigh)

As for Reply/Reply-All -- most of the lists I happen to subscribe to "reply-to-list" (through one mechanism or another) as the default for "reply"... :(

The habit is hard to break.

Thanks!

 -Ben

^ permalink raw reply

* [PATCH] Fix oddity where mdadm did not recognise a relative path
From: Wols Lists @ 2017-01-17 19:07 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 0 bytes --]



[-- Attachment #2: 0001-Fix-oddity-where-mdadm-did-not-recognise-a-relative-.patch --]
[-- Type: text/x-patch, Size: 1254 bytes --]

From 4ce784307a9004124392ce48432960d7ca94d0bf Mon Sep 17 00:00:00 2001
From: Wol <anthony@youngman.org.uk>
Date: Tue, 17 Jan 2017 17:47:05 +0000
Subject: [PATCH] Fix oddity where mdadm did not recognise a relative path

mdadm assumed that a pathname started with a "/", while an array
name didn't. This alters the logic so that if the first character
is not a "/" it tries to open an array, and if that fails it drops
through to the pathname code rather than terminating immediately
with an error.

Signed-off-by: Wol <anthony@youngman.org.uk>
---
 mdadm.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mdadm.c b/mdadm.c
index c3a265b..b5d89e4 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -1899,12 +1899,12 @@ static int misc_list(struct mddev_dev *devlist,
 			rv |= SetAction(dv->devname, c->action);
 			continue;
 		}
-		if (dv->devname[0] == '/')
-			mdfd = open_mddev(dv->devname, 1);
-		else {
-			mdfd = open_dev(dv->devname);
-			if (mdfd < 0)
-				pr_err("Cannot open %s\n", dv->devname);
+		switch(dv->devname[0] == '/') {
+			case 0:
+				mdfd = open_dev(dv->devname);
+				if (mdfd >= 0) break;
+			case 1:
+				mdfd = open_mddev(dv->devname, 1);  
 		}
 		if (mdfd>=0) {
 			switch(dv->disposition) {
-- 
2.7.3


^ permalink raw reply related

* Re: performance of raid5 on fast devices
From: Heinz Mauelshagen @ 2017-01-17 21:04 UTC (permalink / raw)
  To: Jake Yao, Roman Mamedov; +Cc: linux-raid
In-Reply-To: <CA+Dh763Ttvnt9CSEUZ0QwPOLCaZq6VVXmj4OoheR7WTO6o4Pbw@mail.gmail.com>

Jake et al,

I took the oportunity to measure raid5 on a 4x NVME here with
variations of group_thread_cnt={0..10} minimal
stripe_cache_size={256,512,1024,2048,4096,8192,16384,32768}

This is on an X-99 with Intel E5-2640 and kernel 4.9.3-200.fc25.x86_64.

Highest active stripe count logged < 17K.


fio job/sections used:
----------------------------
[r-md0]
ioengine=libaio
iodepth=40
rw=read
bs=4096K
direct=1
size=4G
numjobs=8
filename=/dev/md0

[w-md0]
ioengine=libaio
iodepth=40
rw=write
bs=4096K
direct=1
size=4G
numjobs=8
filename=/dev/md0


Baseline performance seen with raid0:
---------------------------------------------------
md0 : active raid0 dm-350[3] dm-349[2] dm-348[1] dm-347[0]
       33521664 blocks super 1.2 32k chunks

READ: io=32768MB, aggrb=8202.3MB/s, minb=1025.3MB/s, maxb=1217.7MB/s, 
mint=3364msec, maxt=3995msec
WRITE: io=32768MB, aggrb=5746.8MB/s, minb=735584KB/s, maxb=836685KB/s, 
mint=5013msec, maxt=5702msec


Performance with raid5:
--------------------------------
md0 : active raid5 dm-350[3] dm-349[2] dm-348[1] dm-347[0]
       25141248 blocks super 1.2 level 5, 32k chunk, algorithm 2 [4/4] 
[UUUU]


READ: io=32768MB, aggrb=7375.3MB/s, minb=944025KB/s, maxb=1001.1MB/s, 
mint=4088msec, maxt=4443msec


Write results for group_thread_cnt/stripe_cache_size variations:
------------------------------------------------------------------------------------
0/256  -> WRITE: io=32768MB, aggrb=1296.4MB/s, minb=165927KB/s, 
maxb=167644KB/s, mint=25019msec, maxt=25278msec
1/256  -> WRITE: io=32768MB, aggrb=2152.6MB/s, minb=275524KB/s, 
maxb=278654KB/s, mint=15052msec, maxt=15223msec
2/256  -> WRITE: io=32768MB, aggrb=3177.4MB/s, minb=406700KB/s, 
maxb=415854KB/s, mint=10086msec, maxt=10313msec
3/256  -> WRITE: io=32768MB, aggrb=4026.6MB/s, minb=515397KB/s, 
maxb=524222KB/s, mint=8001msec, maxt=8138msec
4/256  -> WRITE: io=32768MB, aggrb=4172.2MB/s, minb=534034KB/s, 
maxb=552609KB/s, mint=7590msec, maxt=7854msec  *
5/256  -> WRITE: io=32768MB, aggrb=4166.9MB/s, minb=533355KB/s, 
maxb=547845KB/s, mint=7656msec, maxt=7864msec
6/256  -> WRITE: io=32768MB, aggrb=4189.3MB/s, minb=536218KB/s, 
maxb=556126KB/s, mint=7542msec, maxt=7822msec
7/256  -> WRITE: io=32768MB, aggrb=4192.5MB/s, minb=536630KB/s, 
maxb=560810KB/s, mint=7479msec, maxt=7816msec
8/256  -> WRITE: io=32768MB, aggrb=4185.2MB/s, minb=535807KB/s, 
maxb=562389KB/s, mint=7458msec, maxt=7828msec
9/256  -> WRITE: io=32768MB, aggrb=4192.1MB/s, minb=536699KB/s, 
maxb=577966KB/s, mint=7257msec, maxt=7815msec
10/256 -> WRITE: io=32768MB, aggrb=4182.3MB/s, minb=535329KB/s, 
maxb=568256KB/s, mint=7381msec, maxt=7835msec

0/512 -> WRITE: io=32768MB, aggrb=1297.8MB/s, minb=166025KB/s, 
maxb=167664KB/s, mint=25016msec, maxt=25263msec
1/512 -> WRITE: io=32768MB, aggrb=2148.5MB/s, minb=275000KB/s, 
maxb=278044KB/s, mint=15085msec, maxt=15252msec
2/512 -> WRITE: io=32768MB, aggrb=3158.4MB/s, minb=404270KB/s, 
maxb=411407KB/s, mint=10195msec, maxt=10375msec
3/512 -> WRITE: io=32768MB, aggrb=4102.7MB/s, minb=525141KB/s, 
maxb=539738KB/s, mint=7771msec, maxt=7987msec
4/512 -> WRITE: io=32768MB, aggrb=4162.8MB/s, minb=532745KB/s, 
maxb=541759KB/s, mint=7742msec, maxt=7873msec     *
5/512 -> WRITE: io=32768MB, aggrb=4178.6MB/s, minb=534851KB/s, 
maxb=549856KB/s, mint=7628msec, maxt=7842msec
6/512 -> WRITE: io=32768MB, aggrb=4167.4MB/s, minb=533422KB/s, 
maxb=562314KB/s, mint=7459msec, maxt=7863msec
7/512 -> WRITE: io=32768MB, aggrb=4192.1MB/s, minb=536699KB/s, 
maxb=566338KB/s, mint=7406msec, maxt=7815msec
8/512 -> WRITE: io=32768MB, aggrb=4189.8MB/s, minb=536287KB/s, 
maxb=558644KB/s, mint=7508msec, maxt=7821msec
9/512 -> WRITE: io=32768MB, aggrb=4165.8MB/s, minb=533219KB/s, 
maxb=559837KB/s, mint=7492msec, maxt=7866msec
10/512 -> WRITE: io=32768MB, aggrb=4177.2MB/s, minb=534783KB/s, 
maxb=570188KB/s, mint=7356msec, maxt=7843msec

0/1024 -> WRITE: io=32768MB, aggrb=1288.6MB/s, minb=164935KB/s, 
maxb=166877KB/s, mint=25134msec, maxt=25430msec
1/1024 -> WRITE: io=32768MB, aggrb=2218.5MB/s, minb=283955KB/s, 
maxb=289842KB/s, mint=14471msec, maxt=14771msec
2/1024 -> WRITE: io=32768MB, aggrb=3186.1MB/s, minb=407926KB/s, 
maxb=420903KB/s, mint=9965msec, maxt=10282msec
3/1024 -> WRITE: io=32768MB, aggrb=4107.4MB/s, minb=525733KB/s, 
maxb=538836KB/s, mint=7784msec, maxt=7978msec
4/1024 -> WRITE: io=32768MB, aggrb=4146.9MB/s, minb=530790KB/s, 
maxb=550505KB/s, mint=7619msec, maxt=7902msec
5/1024 -> WRITE: io=32768MB, aggrb=4160.5MB/s, minb=532542KB/s, 
maxb=550795KB/s, mint=7615msec, maxt=7876msec  *
6/1024 -> WRITE: io=32768MB, aggrb=4174.3MB/s, minb=534306KB/s, 
maxb=558942KB/s, mint=7504msec, maxt=7850msec
7/1024 -> WRITE: io=32768MB, aggrb=4189.8MB/s, minb=536287KB/s, 
maxb=556864KB/s, mint=7532msec, maxt=7821msec
8/1024 -> WRITE: io=32768MB, aggrb=4188.2MB/s, minb=536081KB/s, 
maxb=561035KB/s, mint=7476msec, maxt=7824msec
9/1024 -> WRITE: io=32768MB, aggrb=4167.4MB/s, minb=533422KB/s, 
maxb=567872KB/s, mint=7386msec, maxt=7863msec
10/1024 -> WRITE: io=32768MB, aggrb=4188.2MB/s, minb=536081KB/s, 
maxb=569878KB/s, mint=7360msec, maxt=7824msec

0/2048 -> WRITE: io=32768MB, aggrb=1265.7MB/s, minb=162004KB/s, 
maxb=166111KB/s, mint=25250msec, maxt=25890msec
1/2048 -> WRITE: io=32768MB, aggrb=2239.5MB/s, minb=286652KB/s, 
maxb=290846KB/s, mint=14421msec, maxt=14632msec
2/2048 -> WRITE: io=32768MB, aggrb=3184.5MB/s, minb=407609KB/s, 
maxb=413150KB/s, mint=10152msec, maxt=10290msec
3/2048 -> WRITE: io=32768MB, aggrb=4213.5MB/s, minb=539321KB/s, 
maxb=557901KB/s, mint=7518msec, maxt=7777msec     *
4/2048 -> WRITE: io=32768MB, aggrb=4168.5MB/s, minb=533558KB/s, 
maxb=543162KB/s, mint=7722msec, maxt=7861msec
5/2048 -> WRITE: io=32768MB, aggrb=4185.5MB/s, minb=535739KB/s, 
maxb=549352KB/s, mint=7635msec, maxt=7829msec
6/2048 -> WRITE: io=32768MB, aggrb=4181.8MB/s, minb=535260KB/s, 
maxb=553338KB/s, mint=7580msec, maxt=7836msec
7/2048 -> WRITE: io=32768MB, aggrb=4215.7MB/s, minb=539599KB/s, 
maxb=566109KB/s, mint=7409msec, maxt=7773msec
8/2048 -> WRITE: io=32768MB, aggrb=4200.5MB/s, minb=537662KB/s, 
maxb=568102KB/s, mint=7383msec, maxt=7801msec
9/2048 -> WRITE: io=32768MB, aggrb=4184.1MB/s, minb=535671KB/s, 
maxb=574483KB/s, mint=7301msec, maxt=7830msec
10/2048 -> WRITE: io=32768MB, aggrb=4172.7MB/s, minb=534102KB/s, 
maxb=567641KB/s, mint=7389msec, maxt=7853msec

0/4096 -> WRITE: io=32768MB, aggrb=1264.8MB/s, minb=161879KB/s, 
maxb=168588KB/s, mint=24879msec, maxt=25910msec
1/4096 -> WRITE: io=32768MB, aggrb=2349.4MB/s, minb=300710KB/s, 
maxb=312541KB/s, mint=13420msec, maxt=13948msec
2/4096 -> WRITE: io=32768MB, aggrb=3387.6MB/s, minb=433609KB/s, 
maxb=441877KB/s, mint=9492msec, maxt=9673msec
3/4096 -> WRITE: io=32768MB, aggrb=4182.3MB/s, minb=535329KB/s, 
maxb=552390KB/s, mint=7593msec, maxt=7835msec    *
4/4096 -> WRITE: io=32768MB, aggrb=4170.2MB/s, minb=533762KB/s, 
maxb=560061KB/s, mint=7489msec, maxt=7858msec
5/4096 -> WRITE: io=32768MB, aggrb=4179.6MB/s, minb=534919KB/s, 
maxb=548490KB/s, mint=7647msec, maxt=7841msec
6/4096 -> WRITE: io=32768MB, aggrb=4183.4MB/s, minb=535465KB/s, 
maxb=549208KB/s, mint=7637msec, maxt=7833msec
7/4096 -> WRITE: io=32768MB, aggrb=4174.9MB/s, minb=534374KB/s, 
maxb=557530KB/s, mint=7523msec, maxt=7849msec
8/4096 -> WRITE: io=32768MB, aggrb=4178.6MB/s, minb=534851KB/s, 
maxb=570188KB/s, mint=7356msec, maxt=7842msec
9/4096 -> WRITE: io=32768MB, aggrb=4180.2MB/s, minb=535056KB/s, 
maxb=570110KB/s, mint=7357msec, maxt=7839msec
10/4096 -> WRITE: io=32768MB, aggrb=4183.9MB/s, minb=535534KB/s, 
maxb=574640KB/s, mint=7299msec, maxt=7832msec

0/8192 -> WRITE: io=32768MB, aggrb=1260.9MB/s, minb=161381KB/s, 
maxb=171511KB/s, mint=24455msec, maxt=25990msec
1/8192 -> WRITE: io=32768MB, aggrb=2368.5MB/s, minb=303166KB/s, 
maxb=320444KB/s, mint=13089msec, maxt=13835msec
2/8192 -> WRITE: io=32768MB, aggrb=3408.8MB/s, minb=436225KB/s, 
maxb=458544KB/s, mint=9147msec, maxt=9615msec
3/8192 -> WRITE: io=32768MB, aggrb=4219.5MB/s, minb=540085KB/s, 
maxb=564585KB/s, mint=7429msec, maxt=7766msec     *
4/8192 -> WRITE: io=32768MB, aggrb=4208.6MB/s, minb=538698KB/s, 
maxb=570653KB/s, mint=7350msec, maxt=7786msec
5/8192 -> WRITE: io=32768MB, aggrb=4200.5MB/s, minb=537662KB/s, 
maxb=562013KB/s, mint=7463msec, maxt=7801msec
6/8192 -> WRITE: io=32768MB, aggrb=4189.3MB/s, minb=536218KB/s, 
maxb=585387KB/s, mint=7165msec, maxt=7822msec
7/8192 -> WRITE: io=32768MB, aggrb=4184.5MB/s, minb=535602KB/s, 
maxb=579323KB/s, mint=7240msec, maxt=7831msec
8/8192 -> WRITE: io=32768MB, aggrb=4186.6MB/s, minb=535876KB/s, 
maxb=572132KB/s, mint=7331msec, maxt=7827msec
9/8192 -> WRITE: io=32768MB, aggrb=4176.5MB/s, minb=534578KB/s, 
maxb=598246KB/s, mint=7011msec, maxt=7846msec
10/8192 -> WRITE: io=32768MB, aggrb=4184.1MB/s, minb=535671KB/s, 
maxb=580285KB/s, mint=7228msec, maxt=7830msec

0/16384 -> WRITE: io=32768MB, aggrb=1281.0MB/s, minb=163968KB/s, 
maxb=183542KB/s, mint=22852msec, maxt=25580msec
1/16384 -> WRITE: io=32768MB, aggrb=2451.8MB/s, minb=313827KB/s, 
maxb=337787KB/s, mint=12417msec, maxt=13365msec
2/16384 -> WRITE: io=32768MB, aggrb=3409.5MB/s, minb=436406KB/s, 
maxb=468532KB/s, mint=8952msec, maxt=9611msec
3/16384 -> WRITE: io=32768MB, aggrb=4192.5MB/s, minb=536630KB/s, 
maxb=566721KB/s, mint=7401msec, maxt=7816msec   *
4/16384 -> WRITE: io=32768MB, aggrb=4172.2MB/s, minb=534034KB/s, 
maxb=581089KB/s, mint=7218msec, maxt=7854msec
5/16384 -> WRITE: io=32768MB, aggrb=4175.4MB/s, minb=534442KB/s, 
maxb=587108KB/s, mint=7144msec, maxt=7848msec
6/16384 -> WRITE: io=32768MB, aggrb=4188.2MB/s, minb=536081KB/s, 
maxb=585224KB/s, mint=7167msec, maxt=7824msec
7/16384 -> WRITE: io=32768MB, aggrb=4173.8MB/s, minb=534238KB/s, 
maxb=591330KB/s, mint=7093msec, maxt=7851msec
8/16384 -> WRITE: io=32768MB, aggrb=4163.2MB/s, minb=532880KB/s, 
maxb=590165KB/s, mint=7107msec, maxt=7871msec
9/16384 -> WRITE: io=32768MB, aggrb=4166.9MB/s, minb=533355KB/s, 
maxb=608664KB/s, mint=6891msec, maxt=7864msec
10/16384 -> WRITE: io=32768MB, aggrb=4157.9MB/s, minb=532204KB/s, 
maxb=594768KB/s, mint=7052msec, maxt=7881msec

0/32768 -> WRITE: io=32768MB, aggrb=1288.1MB/s, minb=164980KB/s, 
maxb=189026KB/s, mint=22189msec, maxt=25423msec
1/32768 -> WRITE: io=32768MB, aggrb=2443.6MB/s, minb=312774KB/s, 
maxb=348624KB/s, mint=12031msec, maxt=13410msec
2/32768 -> WRITE: io=32768MB, aggrb=3467.1MB/s, minb=443888KB/s, 
maxb=484722KB/s, mint=8653msec, maxt=9449msec
3/32768 -> WRITE: io=32768MB, aggrb=4131.2MB/s, minb=528782KB/s, 
maxb=572444KB/s, mint=7327msec, maxt=7932msec    *
4/32768 -> WRITE: io=32768MB, aggrb=4082.8MB/s, minb=522589KB/s, 
maxb=606990KB/s, mint=6910msec, maxt=8026msec
5/32768 -> WRITE: io=32768MB, aggrb=3985.5MB/s, minb=510131KB/s, 
maxb=578046KB/s, mint=7256msec, maxt=8222msec
6/32768 -> WRITE: io=32768MB, aggrb=3937.2MB/s, minb=504062KB/s, 
maxb=591914KB/s, mint=7086msec, maxt=8321msec
7/32768 -> WRITE: io=32768MB, aggrb=4012.3MB/s, minb=513567KB/s, 
maxb=583028KB/s, mint=7194msec, maxt=8167msec
8/32768 -> WRITE: io=32768MB, aggrb=3944.2MB/s, minb=504851KB/s, 
maxb=567257KB/s, mint=7394msec, maxt=8308msec
9/32768 -> WRITE: io=32768MB, aggrb=3930.1MB/s, minb=503155KB/s, 
maxb=580687KB/s, mint=7223msec, maxt=8336msec
10/32768 -> WRITE: io=32768MB, aggrb=3965.2MB/s, minb=507539KB/s, 
maxb=599443KB/s, mint=6997msec, maxt=8264msec


Analysis:
-----------
- the amount of minimum stripe cache entries doesn't cause much 
variation as expected
- writing threads cause significant performance enhancement
- seen best results with 3 or 4 writing threads which correlates well to 
the # of stripes


Did you provide your fio job(s) for comparision yet?

Regards,
Heinz

P.S.: write performance tested with the following script:

#!/bin/sh

MD=md0

for s in 256 512 1024 2048 4096 8192 16384 32768
do
         echo $s > /sys/block/$MD/md/stripe_cache_size

         for t in {0..10}
         do
                 echo $t > /sys/block/$MD/md/group_thread_cnt
                 echo -n "$t/$s -> "
                 fio  --section=w-md0 fio_md0.job 2>&1|grep "aggrb="|sed 
's/^ *//'
         done
done



On 01/17/2017 04:28 PM, Jake Yao wrote:
> Thanks for the response.
>
> I am using fio for performance measurement.
>
> The chunk size of raid5 array is 32K, and the block size in fio is set
> to 96K(3x chunk size) which is also the optimal_io_size, ioengine is
> set to libaio with direct IO.
>
> Increasing stripe_cache_size does not help much, and it looks like the
> write is limited by the single kernel thread as mentioned earlier.
>
>
> On Tue, Jan 17, 2017 at 12:10 AM, Roman Mamedov <rm@romanrm.net> wrote:
>> On Mon, 16 Jan 2017 21:35:21 -0500
>> Jake Yao <jgyao1@gmail.com> wrote:
>>
>>> I have a raid5 array on 4 NVMe drives, and the performance on the
>>> array is only marginally better than a single drive. Unlike a similar
>>> raid5 array on 4 SAS SSD or HDD,  the performance on array is 3x
>>> better than a single drive, which is expected.
>>>
>>> It looks like when the single kernel thread associated with the raid
>>> device running at 100%, the array performance hit its peak. This can
>>> happen easily for fast devices like NVMe.
>>>
>>> This can reproduced by creating a raid5 with 4 ramdisks as well, and
>>> comparing performance on the array and one ramdisk. Sometimes the
>>> performance on the array is worse than a single ramdisk.
>>>
>>> The kernel version is 4.9.0-rc3 and mdadm is release 3.4, no write
>>> journal is configured.
>>>
>>> Is this a known issue?
>> How do you measure the performance?
>>
>> Sure it may be CPU-bound in the end, but also why not try the usual
>> optimization tricks, such as:
>>
>>    * increase your stripe_cache_size, it's not uncommon that this can speed up
>>      linear writes by as much as several times;
>>
>>    * if you meant reads, you could look into read-ahead settings for the array;
>>
>>    * and in both cases, try experimenting with different stripe sizes (if you
>>      were using 512K, try with 64K stripes).
>>
>> --
>> With respect,
>> Roman
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* [patch] block: add blktrace C events for bio-based drivers
From: Jeff Moyer @ 2017-01-17 21:57 UTC (permalink / raw)
  To: axboe, linux-block
  Cc: agk, snitzer, dm-devel, shli, linux-kernel, linux-raid, hch

Only a few bio-based drivers actually generate blktrace completion
(C) events.  Instead of changing all bio-based drivers to call
trace_block_bio_complete, move the tracing to bio_complete, and remove
the explicit tracing from the few drivers that actually do it.  After
this patch, there is exactly one caller of trace_block_bio_complete
and one caller of trace_block_rq_complete.  More importantly, all
bio-based drivers now generate C events, which is useful for
performance analysis.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
---

Testing: I made sure that request-based drivers don't see duplicate
completions, and that bio-based drivers show both Q and C events.  I
haven't tested all affected drivers or combinations, though.

diff --git a/block/bio.c b/block/bio.c
index 2b37502..ba5daad 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1785,16 +1785,7 @@ static inline bool bio_remaining_done(struct bio *bio)
 	return false;
 }
 
-/**
- * bio_endio - end I/O on a bio
- * @bio:	bio
- *
- * Description:
- *   bio_endio() will end I/O on the whole bio. bio_endio() is the preferred
- *   way to end I/O on a bio. No one should call bi_end_io() directly on a
- *   bio unless they own it and thus know that it has an end_io function.
- **/
-void bio_endio(struct bio *bio)
+void __bio_endio(struct bio *bio)
 {
 again:
 	if (!bio_remaining_done(bio))
@@ -1816,6 +1807,22 @@ void bio_endio(struct bio *bio)
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio);
 }
+
+/**
+ * bio_endio - end I/O on a bio
+ * @bio:	bio
+ *
+ * Description:
+ *   bio_endio() will end I/O on the whole bio. bio_endio() is the preferred
+ *   way to end I/O on a bio. No one should call bi_end_io() directly on a
+ *   bio unless they own it and thus know that it has an end_io function.
+ **/
+void bio_endio(struct bio *bio)
+{
+	trace_block_bio_complete(bdev_get_queue(bio->bi_bdev),
+				 bio, bio->bi_error);
+	__bio_endio(bio);
+}
 EXPORT_SYMBOL(bio_endio);
 
 /**
diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c..f77f2d9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -153,7 +153,7 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 
 	/* don't actually finish bio if it's part of flush sequence */
 	if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ))
-		bio_endio(bio);
+		__bio_endio(bio);
 }
 
 void blk_dump_rq_flags(struct request *rq, char *msg)
@@ -1947,7 +1947,7 @@ generic_make_request_checks(struct bio *bio)
 	err = -EOPNOTSUPP;
 end_io:
 	bio->bi_error = err;
-	bio_endio(bio);
+	__bio_endio(bio);
 	return false;
 }
 
diff --git a/block/blk.h b/block/blk.h
index 041185e..1c9b50a 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -57,6 +57,7 @@ int blk_init_rl(struct request_list *rl, struct request_queue *q,
 		gfp_t gfp_mask);
 void blk_exit_rl(struct request_list *rl);
 void init_request_from_bio(struct request *req, struct bio *bio);
+void __bio_endio(struct bio *bio);
 void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 			struct bio *bio);
 void blk_queue_bypass_start(struct request_queue *q);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3086da5..e151aef 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -807,7 +807,6 @@ static void dec_pending(struct dm_io *io, int error)
 			queue_io(md, bio);
 		} else {
 			/* done with normal IO or empty flush */
-			trace_block_bio_complete(md->queue, bio, io_error);
 			bio->bi_error = io_error;
 			bio_endio(bio);
 		}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 36c13e4..17b4e06 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -159,8 +159,6 @@ static void return_io(struct bio_list *return_bi)
 	struct bio *bi;
 	while ((bi = bio_list_pop(return_bi)) != NULL) {
 		bi->bi_iter.bi_size = 0;
-		trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
-					 bi, 0);
 		bio_endio(bi);
 	}
 }
@@ -4902,8 +4900,6 @@ static void raid5_align_endio(struct bio *bi)
 	rdev_dec_pending(rdev, conf->mddev);
 
 	if (!error) {
-		trace_block_bio_complete(bdev_get_queue(raid_bi->bi_bdev),
-					 raid_bi, 0);
 		bio_endio(raid_bi);
 		if (atomic_dec_and_test(&conf->active_aligned_reads))
 			wake_up(&conf->wait_for_quiescent);
@@ -5470,8 +5466,6 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 		if ( rw == WRITE )
 			md_write_end(mddev);
 
-		trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
-					 bi, 0);
 		bio_endio(bi);
 	}
 }
@@ -5878,11 +5872,9 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 		handled++;
 	}
 	remaining = raid5_dec_bi_active_stripes(raid_bio);
-	if (remaining == 0) {
-		trace_block_bio_complete(bdev_get_queue(raid_bio->bi_bdev),
-					 raid_bio, 0);
+	if (remaining == 0)
 		bio_endio(raid_bio);
-	}
+
 	if (atomic_dec_and_test(&conf->active_aligned_reads))
 		wake_up(&conf->wait_for_quiescent);
 	return handled;

^ permalink raw reply related

* Re: [patch] block: add blktrace C events for bio-based drivers
From: Jens Axboe @ 2017-01-17 22:07 UTC (permalink / raw)
  To: Jeff Moyer, linux-block
  Cc: agk, snitzer, dm-devel, shli, linux-kernel, linux-raid, hch
In-Reply-To: <x49ziips61y.fsf@segfault.boston.devel.redhat.com>

On 01/17/2017 01:57 PM, Jeff Moyer wrote:
> Only a few bio-based drivers actually generate blktrace completion
> (C) events.  Instead of changing all bio-based drivers to call
> trace_block_bio_complete, move the tracing to bio_complete, and remove
> the explicit tracing from the few drivers that actually do it.  After
> this patch, there is exactly one caller of trace_block_bio_complete
> and one caller of trace_block_rq_complete.  More importantly, all
> bio-based drivers now generate C events, which is useful for
> performance analysis.

I like the change, hate the naming. I'd prefer one of two things:

- Add bio_endio_complete() instead. That name sucks too, the
  important part is flipping the __name() to have a trace
  version instead.

- Mark the bio as trace completed, and keep the naming. Since
  it's only off the completion path, that can be just marking
  the bi_flags non-atomically.

I probably prefer the latter.

-- 
Jens Axboe


^ permalink raw reply

* Re: [patch] block: add blktrace C events for bio-based drivers
From: Jeff Moyer @ 2017-01-17 22:39 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, agk, snitzer, dm-devel, shli, linux-kernel,
	linux-raid, hch
In-Reply-To: <f3044f42-3134-c577-068f-bd2750528b32@kernel.dk>

Jens Axboe <axboe@kernel.dk> writes:

> On 01/17/2017 01:57 PM, Jeff Moyer wrote:
>> Only a few bio-based drivers actually generate blktrace completion
>> (C) events.  Instead of changing all bio-based drivers to call
>> trace_block_bio_complete, move the tracing to bio_complete, and remove
>> the explicit tracing from the few drivers that actually do it.  After
>> this patch, there is exactly one caller of trace_block_bio_complete
>> and one caller of trace_block_rq_complete.  More importantly, all
>> bio-based drivers now generate C events, which is useful for
>> performance analysis.
>
> I like the change, hate the naming. I'd prefer one of two things:
>
> - Add bio_endio_complete() instead. That name sucks too, the
>   important part is flipping the __name() to have a trace
>   version instead.

I had also considered bio_endio_notrace().

> - Mark the bio as trace completed, and keep the naming. Since
>   it's only off the completion path, that can be just marking
>   the bi_flags non-atomically.
>
> I probably prefer the latter.

Hmm, okay.  I'll take a crack at that.

Thanks!
Jeff

^ permalink raw reply

* [RFC PATCH v2] DM: dm-inplace-compress: inplace compressed DM target
From: Ram Pai @ 2017-01-17 23:59 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-raid; +Cc: agk, snitzer, corbet, shli, hbabu

This  patch provides a generic device-mapper inplace compression
device.  Originally written by Shaohua Li.
https://www.redhat.com/archives/dm-devel/2013-December/msg00143.html

I have optimized and hardened the code.

Testing:
-------
This compression block device  is  tested in the following scenarios
a) backing a ext4 filesystem 
b) backing swap
Its tested on a PPC64 system only.
More    testing  is  needed on different architectures.

TODO:
-----
For testing, the code was modified to use GFP_ATOMIC allocations when   device
is   used as swap. A more dynamic mechanism is needed to switch allocation
strategy based on usage. Probably a sysfs interface?

Version v1:
	Comments from Alasdair have been incorporated.
	https://www.redhat.com/archives/dm-devel/2013-December/msg00144.html

Version v2:
	All patches are merged into a single patch.
	Major code re-arrangement.
	Data and metablocks allocated based on the length of the device
	map rather than the size of the backing device.
        Size   of   each entry   in  the bitmap array is explicitly set
	 to 32bits.
	Attempt  to  reuse  the  provided  bio  buffer  space   instead  of
	 allocating a new one.

Your comments to improve the code is very much appreciated.

Ram Pai (1):
  From: Shaohua Li <shli@kernel.org>

 .../device-mapper/dm-inplace-compress.txt          |  139 ++
 drivers/md/Kconfig                                 |    6 +
 drivers/md/Makefile                                |    2 +
 drivers/md/dm-inplace-compress.c                   | 2104 ++++++++++++++++++++
 drivers/md/dm-inplace-compress.h                   |  185 ++
 5 files changed, 2436 insertions(+)
 create mode 100644 Documentation/device-mapper/dm-inplace-compress.txt
 create mode 100644 drivers/md/dm-inplace-compress.c
 create mode 100644 drivers/md/dm-inplace-compress.h

-- 
1.8.3.1

^ permalink raw reply

* [RFC PATCH v2 1/1] DM: inplace compressed DM target
From: Ram Pai @ 2017-01-17 23:59 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-raid; +Cc: agk, snitzer, corbet, shli, hbabu
In-Reply-To: <1484697546-20917-1-git-send-email-linuxram@us.ibm.com>

This is a simple DM target supporting inplace compression. Its best
suited for SSD. The underlying disk must support 512B sector size.
The target only supports 4k sector size.

Disk layout:
|super|...meta...|..data...|

Store unit is 4k (a block). Super is 1 block, which stores meta  and
data size and compression algorithm. Meta is a bitmap. For each data
 block, there are 5 bits meta.

Data:

Data of   a block is compressed. Compressed  data  is round up to 512B,
which is the payload. In disk, payload is  stored at  the beginning of
logical sector  of the block. Let's look  at an example.  Say we store
data to block A, which  is in sector  B(A*8), its orginal  size is 4k,
compressed size is  1500.    Compressed     data (CD)  will  use three
sectors (512B). The three  sectors  are the  payload. Payload  will be
stored at sector B.

---------------------------------------------------
... | CD1 | CD2 | CD3 |   |   |   |   |    | ...
---------------------------------------------------
    ^B    ^B+1  ^B+2                  ^B+7 ^B+8

For this block, we will not use sector B+3 to B+7 (a hole). We use four
meta  bits  to  present payload  size. The compressed size (1500) isn't
stored in meta directly. Instead, we  store  it  at  the last 32bits of
payload. In this  example, we store it at the  end  of  sector  B+2. If
compressed size + sizeof(32bits)  crosses a   sector, payload size will
increase one sector.  If payload  uses 8 sectors, we store uncompressed
data directly.

If IO size is bigger than one block, we can store the data as an extent.
Data of the  whole extent will compressed and stored in the similar way
like above.  The first  block of the extent is the head, all others are
the tail.  If extent is 1 block,  the  block  is head. We have 1 bit of
meta to present if a  block  is  head  or  tail. If 4 meta bits of head
block can't  store  extent payload size, we will borrow tail block meta
bits to  store  payload  size.   Max  allowd extent size is 128k, so we
don't compress/decompress too big size data.

Meta:
Modifying   data   will modify meta too. Meta will be written(flush) to
disk   depending   on   meta   write   policy. We support writeback and
writethrough mode.  In  writeback mode, meta will be written to disk in
an interval or a  FLUSH  request.  In  writethrough mode, data and meta
data will be written to disk together.

Advantages:

1. Simple. Since  we  store  compressed  data  in-place,  we don't need
   complicated disk data management.
2. Efficient. For  each  4k, we only need 5 bits meta. 1T data will use
less than 200M meta, so we  can  load  all meta into memory. And actual
compression size is in payload. So   if  IO doesn't need RMW and we use
writeback meta flush, we don't  need  extra IO for meta.

Disadvantages:

1. hole. Since we   store  compressed data in-place, there are a lot of
   holes (in above  example,  B+3 - B+7) Hole can impact IO, because we
   can't do IO merge.

2. 1:1 size. Compression  doesn't  change disk  size. If disk is 1T, we
   can only store 1T data even we do compression.

But this is for SSD only. Generally SSD firmware has a FTL layer to map
disk  sectors  to flash nand. High end SSD firmware has filesystem-like
FTL.

1. hole. Disk has a lot of holes, but SSD FTL   can   still  store data
   contiguous in nand. Even if we can't do IO   merge in  OS layer, SSD
   firmware can do it.

2. 1:1 size. On one side, we write compressed data to SSD, which means
   less  data is  written to SSD. This will be very helpful to improve
   SSD garbage collection, and  so write speed and life cycle. So even
   this is a problem, the target  is still helpful. On the other side,
   advanced SSD FTL can easily do thin provision. For example, if nand
   is   1T   and   we   let   SSD   report   it   as   2T,   and   use
   the  SSD  as  compressed target. In such SSD, we don't have the 1:1
   size issue.

So even if   SSD   FTL   cannot   map   non-contiguous disk sectors to
contiguous nand, the compression target can still function well.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Ram Pai <ram.n.pai@gmail.com>
---
 .../device-mapper/dm-inplace-compress.txt          |  139 ++
 drivers/md/Kconfig                                 |    6 +
 drivers/md/Makefile                                |    2 +
 drivers/md/dm-inplace-compress.c                   | 2104 ++++++++++++++++++++
 drivers/md/dm-inplace-compress.h                   |  185 ++
 5 files changed, 2436 insertions(+)
 create mode 100644 Documentation/device-mapper/dm-inplace-compress.txt
 create mode 100644 drivers/md/dm-inplace-compress.c
 create mode 100644 drivers/md/dm-inplace-compress.h

diff --git a/Documentation/device-mapper/dm-inplace-compress.txt b/Documentation/device-mapper/dm-inplace-compress.txt
new file mode 100644
index 0000000..c2eefb9
--- /dev/null
+++ b/Documentation/device-mapper/dm-inplace-compress.txt
@@ -0,0 +1,139 @@
+Device-Mapper's "inplace-compress" target provides inplace compression of block
+devices using the kernel compression API.
+
+Parameters: <device path> \
+	[ <#opt_params writethough> ]
+	[ <#opt_params <writeback> <meta_commit_delay> ]
+	[ <#opt_params compressor> <type> ]
+
+
+<writethrough>
+    Write data and metadata together.
+
+<writeback> <meta_commit_delay>
+    Write metadata every 'meta_commit_delay' interval.
+
+<device path>
+    This is the device that is going to be used as backend and contains the
+    compressed data.  You can specify it as a path like /dev/xxx or a device
+    number <major>:<minor>.
+
+<compressor> <type>
+    Choose the compressor algorithm. 'lzo' and '842'
+    compressors are supported.
+
+Example scripts
+===============
+
+create a inplace-compress block device using lzo compression. Write metadata
+and data together.
+[[
+#!/bin/sh
+# Create a inplace-compress device using dmsetup
+device=$1  #your backing storage eg: /dev/sdc1
+size=80000 #size of your new compressed block device
+dmsetup create comp1 --table "0 $size inplacecompress $device
+		writethrough compressor lzo"
+]]
+
+
+create a inplace-compress block device using nx-842 hardware compression. Write
+metadata periodially every 5sec.
+
+[[
+#!/bin/sh
+# Create a inplace-compress device using dmsetup
+device=$1  #your backing storage eg: /dev/sdc1
+size=80000 #size of your new compressed block device
+dmsetup create comp1 --table "0 $size inplacecompress $device
+		writeback 5 compressor 842"
+]]
+
+Description
+===========
+    This is a simple DM target supporting inplace compression. Its best suited for
+    SSD. The underlying disk must support 512B sector size, the target only
+    supports 4k sector size.
+
+    Disk layout:
+    |super|...meta...|..data...|
+
+    Store unit is 4k (a block). Super is 1 block, which stores meta and data
+    size and compression algorithm. Meta is a bitmap. For each data block,
+    there are 5 bits meta.
+
+    Data:
+
+    Data of a block is compressed. Compressed data is round up to 512B, which
+    is the payload. In disk, payload is stored at the beginning of logical
+    sector of the block. Let's look at an example. Say we store data to block
+    A, which is in sector B(A*8), its orginal size is 4k, compressed size is
+    1500. Compressed data (CD) will use 3 sectors (512B). The 3 sectors are the
+    payload. Payload will be stored at sector B.
+
+    ---------------------------------------------------
+    ... | CD1 | CD2 | CD3 |   |   |   |   |    | ...
+    ---------------------------------------------------
+        ^B    ^B+1  ^B+2                  ^B+7 ^B+8
+
+    For this block, we will not use sector B+3 to B+7 (a hole). We use 4 meta
+    bits to present payload size. The compressed size (1500) isn't stored in
+    meta directly. Instead, we store it at the last 32bits of payload. In this
+    example, we store it at the end of sector B+2. If compressed size +
+    sizeof(32bits) crosses a sector, payload size will increase one sector. If
+    payload uses 8 sectors, we store uncompressed data directly.
+
+    If IO size is bigger than one block, we can store the data as an extent.
+    Data of the whole extent will compressed and stored in the similar way like
+    above.  The first block of the extent is the head, all others are the tail.
+    If extent is 1 block, the block is head. We have 1 bit of meta to present
+    if a block is head or tail. If 4 meta bits of head block can't store extent
+    payload size, we will borrow tail block meta bits to store payload size.
+    Max allowd extent size is 128k, so we don't compress/decompress too big
+    size data.
+
+    Meta:
+    Modifying data will modify meta too. Meta will be written(flush) to disk
+    depending on meta write policy. We support writeback and writethrough mode.
+    In writeback mode, meta will be written to disk in an interval or a FLUSH
+    request.  In writethrough mode, data and meta data will be written to disk
+    together.
+
+    Advantages:
+
+    1. Simple. Since we store compressed data in-place, we don't need complicated
+    disk data management.
+    2. Efficient. For each 4k, we only need 5 bits meta. 1T data will use less than
+    200M meta, so we can load all meta into memory. And actual compression size is
+    in payload. So if IO doesn't need RMW and we use writeback meta flush, we don't
+    need extra IO for meta.
+
+    Disadvantages:
+
+    1. hole. Since we store compressed data in-place, there are a lot of holes
+    (in above example, B+3 - B+7) Hole can impact IO, because we can't do IO
+    merge.
+
+    2. 1:1 size. Compression doesn't change disk size. If disk is 1T, we can
+    only store 1T data even we do compression.
+
+    But this is for SSD only. Generally SSD firmware has a FTL layer to map
+    disk sectors to flash nand. High end SSD firmware has filesystem-like FTL.
+
+    1. hole. Disk has a lot of holes, but SSD FTL can still store data continuous
+    in nand. Even if we can't do IO merge in OS layer, SSD firmware can do it.
+
+    2. 1:1 size. On one side, we write compressed data to SSD, which means less
+    data is written to SSD. This will be very helpful to improve SSD garbage
+    collection, and so write speed and life cycle. So even this is a problem, the
+    target is still helpful. On the other side, advanced SSD FTL can easily do thin
+    provision. For example, if nand is 1T and we let SSD report it as 2T, and use
+    the SSD as compressed target. In such SSD, we don't have the 1:1 size issue.
+
+    So even if SSD FTL cannot map non-continuous disk sectors to continuous nand,
+    the compression target can still function well.
+
+
+Author:
+	Shaohua Li <shli@fusionio.com>
+	Ram Pai <ram.n.pai@gmail.com>
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index b7767da..2eece2a 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -508,4 +508,10 @@ config DM_LOG_WRITES
 
 	  If unsure, say N.
 
+config DM_INPLACE_COMPRESS
+	tristate "Inplace Compression target"
+	depends on BLK_DEV_DM
+	---help---
+	  Allow volume managers to compress data for SSD.
+
 endif # MD
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 3cbda1a..4525482 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -59,6 +59,8 @@ obj-$(CONFIG_DM_CACHE_SMQ)	+= dm-cache-smq.o
 obj-$(CONFIG_DM_CACHE_CLEANER)	+= dm-cache-cleaner.o
 obj-$(CONFIG_DM_ERA)		+= dm-era.o
 obj-$(CONFIG_DM_LOG_WRITES)	+= dm-log-writes.o
+obj-$(CONFIG_DM_LOG_WRITES)	+= dm-log-writes.o
+obj-$(CONFIG_DM_INPLACE_COMPRESS)	+= dm-inplace-compress.o
 
 ifeq ($(CONFIG_DM_UEVENT),y)
 dm-mod-objs			+= dm-uevent.o
diff --git a/drivers/md/dm-inplace-compress.c b/drivers/md/dm-inplace-compress.c
new file mode 100644
index 0000000..75ff28d
--- /dev/null
+++ b/drivers/md/dm-inplace-compress.c
@@ -0,0 +1,2104 @@
+/*
+ *  device mapper compression block device.
+ *
+ *  Released under GPL v2.
+ *
+ */
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/device-mapper.h>
+#include <linux/dm-io.h>
+#include <linux/crypto.h>
+#include <linux/lzo.h>
+#include <linux/kthread.h>
+#include <linux/page-flags.h>
+#include <linux/completion.h>
+#include <linux/vmalloc.h>
+#include "dm-inplace-compress.h"
+
+#define DM_MSG_PREFIX "dm-inplace-compress"
+
+static struct dm_icomp_compressor_data compressors[] = {
+	[DMCP_COMP_ALG_LZO] = {
+		.name = "lzo",
+		.comp_len = lzo_comp_len,
+		.max_comp_len = lzo_max_comp_len,
+	},
+	[DMCP_COMP_ALG_842] = {
+		.name = "842",
+		.comp_len = nx842_comp_len,
+		.max_comp_len = nx842_max_comp_len,
+	},
+};
+
+static int default_compressor = -1;
+#define DMCP_ALGO_LENGTH 9
+static char dm_icomp_algorithm[DMCP_ALGO_LENGTH] = "lzo";
+static struct kparam_string dm_icomp_compressor_kparam = {
+	.string =	dm_icomp_algorithm,
+	.maxlen =	sizeof(dm_icomp_algorithm),
+};
+static int dm_icomp_compressor_param_set(const char *,
+		const struct kernel_param *);
+static struct kernel_param_ops dm_icomp_compressor_param_ops = {
+	.set =	dm_icomp_compressor_param_set,
+	.get =	param_get_string,
+};
+module_param_cb(compress_algorithm, &dm_icomp_compressor_param_ops,
+		&dm_icomp_compressor_kparam, 0644);
+
+#define SET_REQ_STAGE(req, value) (req->stage = value)
+#define GET_REQ_STAGE(req) req->stage
+
+static int dm_icomp_get_compressor(const char *s)
+{
+	int r, val_len;
+
+	if (crypto_has_comp(s, 0, 0)) {
+		for (r = 0; r < ARRAY_SIZE(compressors); r++) {
+			val_len = strlen(compressors[r].name);
+			if (strncmp(s, compressors[r].name, val_len) == 0)
+				return r;
+		}
+	}
+	return -1;
+}
+
+static int dm_icomp_compressor_param_set(const char *val,
+		const struct kernel_param *kp)
+{
+	int ret;
+	char str[kp->str->maxlen], *s;
+	int val_len = strlen(val)+1;
+
+	strlcpy(str, val, val_len);
+	s = strim(str);
+	ret = dm_icomp_get_compressor(s);
+	if (ret < 0) {
+		DMWARN("Compressor %s not supported", s);
+		return -1;
+	}
+	DMINFO("compressor  is %s", s);
+	default_compressor = ret;
+	strlcpy(dm_icomp_algorithm, compressors[ret].name,
+		sizeof(dm_icomp_algorithm));
+	return 0;
+}
+
+static const struct kernel_param_ops dm_icomp_alloc_param_ops = {
+	.set    = param_set_ulong,
+	.get    = param_get_ulong,
+};
+
+static atomic64_t dm_icomp_total_alloc_size;
+#define DMCP_ALLOC(s) {atomic64_add(s, &dm_icomp_total_alloc_size); }
+#define DMCP_FREE_ALLOC(s) {atomic64_sub(s, &dm_icomp_total_alloc_size); }
+module_param_cb(dm_icomp_total_alloc_size, &dm_icomp_alloc_param_ops,
+				&dm_icomp_total_alloc_size, 0644);
+
+static atomic64_t dm_icomp_total_bio_save;
+#define DMCP_ALLOC_SAVE(s) {atomic64_add(s, &dm_icomp_total_bio_save); }
+module_param_cb(dm_icomp_total_bio_save, &dm_icomp_alloc_param_ops,
+				&dm_icomp_total_bio_save, 0644);
+
+
+static struct kmem_cache *dm_icomp_req_cachep;
+static struct kmem_cache *dm_icomp_io_range_cachep;
+static struct kmem_cache *dm_icomp_meta_io_cachep;
+
+static struct dm_icomp_io_worker dm_icomp_io_workers[NR_CPUS];
+static struct workqueue_struct *dm_icomp_wq;
+
+/*
+ * return the meta data bits corresponding to a block
+ * @block_index : the index of the block
+ */
+static u8 dm_icomp_get_meta(struct dm_icomp_info *info, u64 block_index)
+{
+	u64 first_bit = block_index * DMCP_META_BITS;
+	int bits, offset;
+	u32 data;
+	u8  ret = 0;
+
+	offset = first_bit & (DMCP_BITS_PER_ENTRY-1);
+	bits = min_t(u32, DMCP_META_BITS, DMCP_BITS_PER_ENTRY - offset);
+
+	data = (u32)info->meta_bitmap[first_bit >> DMCP_META_BITS];
+	ret = (data >> offset) & ((1 << bits) - 1);
+
+	if (bits < DMCP_META_BITS) {
+		data = info->meta_bitmap[(first_bit >> DMCP_META_BITS) + 1];
+		bits = DMCP_META_BITS - bits;
+		ret |= (data & ((1 << bits) - 1)) << (DMCP_META_BITS - bits);
+	}
+	return ret;
+}
+
+
+static void dm_icomp_mark_page(struct dm_icomp_info *info, u32 *addr,
+				bool dirty_meta)
+{
+	struct page *page;
+
+	page = vmalloc_to_page(addr);
+	if (!page)
+		return;
+	if (dirty_meta)
+		SetPageDirty(page);
+	else
+		ClearPageDirty(page);
+}
+
+/*
+ * set the meta data bits corresponding to a block
+ * @block_index : the index of the block
+ * @meta        : the meta data bits.
+ */
+static void dm_icomp_set_meta(struct dm_icomp_info *info, u64 block_index,
+		u8 meta, bool dirty_meta)
+{
+	u64 first_bit = block_index * DMCP_META_BITS;
+	int bits, offset;
+	u32 data;
+
+	offset = first_bit & (DMCP_BITS_PER_ENTRY-1);
+	bits = min_t(u32, DMCP_META_BITS, DMCP_BITS_PER_ENTRY - offset);
+
+
+	data = (u32)info->meta_bitmap[first_bit >> DMCP_META_BITS];
+	data &= ~(((1 << bits) - 1) << offset);
+	data |= (meta & ((1 << bits) - 1)) << offset;
+	info->meta_bitmap[first_bit >> DMCP_META_BITS] = (u32)data;
+
+	if (info->write_mode == DMCP_WRITE_BACK)
+		dm_icomp_mark_page(info,
+			&info->meta_bitmap[first_bit >> DMCP_META_BITS],
+			dirty_meta);
+
+	if (bits < DMCP_META_BITS) {
+		meta >>= bits;
+		data = (u32)
+			info->meta_bitmap[(first_bit >> DMCP_META_BITS) + 1];
+		bits = DMCP_META_BITS - bits;
+		data = (data >> bits) << bits;
+		data |= meta & ((1 << bits) - 1);
+		info->meta_bitmap[(first_bit >> DMCP_META_BITS) + 1] =
+				(u32)data;
+
+		if (info->write_mode == DMCP_WRITE_BACK)
+			dm_icomp_mark_page(info,
+			&info->meta_bitmap[(first_bit >> DMCP_META_BITS) + 1],
+			dirty_meta);
+	}
+}
+
+
+/*
+ * set the meta data bits corresponding to an extent
+ * @block : the index of the block
+ * @logical_blocks: the number of blocks in the extent
+ * @sectors: the number of sectors holding the compressed
+ *		data
+ */
+static void dm_icomp_set_extent(struct dm_icomp_req *req, u64 block,
+	u16 logical_blocks, sector_t data_sectors)
+{
+	int i;
+	u8 data;
+
+	for (i = 0; i < logical_blocks; i++) {
+		data = min_t(sector_t, data_sectors, 8);
+		data_sectors -= data;
+		if (i != 0)
+			data |= DMCP_TAIL_MASK;
+		/* For FUA, we write out meta data directly */
+		dm_icomp_set_meta(req->info, block + i, data,
+					!(req->bio->bi_opf & REQ_FUA));
+	}
+}
+
+/*
+ * get the meta data bits corresponding to an extent
+ * @block_index : the index of the block
+ * @logical_blocks: return the number of blocks in the extent
+ * @sectors: return the number of sectors holding the compressed
+ *		data
+ */
+static void dm_icomp_get_extent(struct dm_icomp_info *info, u64 block_index,
+	u64 *first_block_index, u16 *logical_sectors, u16 *data_sectors)
+{
+	u8 data;
+
+	data = dm_icomp_get_meta(info, block_index);
+	while (data & DMCP_TAIL_MASK) {
+		block_index--;
+		data = dm_icomp_get_meta(info, block_index);
+	}
+	*first_block_index = block_index;
+	*logical_sectors = DMCP_BYTES_TO_SECTOR(DMCP_BLOCK_SIZE);
+	*data_sectors = data & DMCP_LENGTH_MASK;
+	block_index++;
+	while (block_index < info->data_blocks) {
+		data = dm_icomp_get_meta(info, block_index);
+		if (!(data & DMCP_TAIL_MASK))
+			break;
+		*logical_sectors += DMCP_BYTES_TO_SECTOR(DMCP_BLOCK_SIZE);
+		*data_sectors += data & DMCP_LENGTH_MASK;
+		block_index++;
+	}
+}
+
+/*
+ * return the super block
+ */
+static int dm_icomp_access_super(struct dm_icomp_info *info, void *addr,
+		int op, int flag)
+{
+	struct dm_io_region region;
+	struct dm_io_request req;
+	unsigned long io_error = 0;
+	int ret;
+
+	region.bdev = info->dev->bdev;
+	region.sector = 0;
+	region.count = DMCP_BYTES_TO_SECTOR(DMCP_BLOCK_SIZE);
+
+	req.bi_op = op;
+	req.bi_op_flags = flag;
+	req.mem.type = DM_IO_KMEM;
+	req.mem.offset = 0;
+	req.mem.ptr.addr = addr;
+	req.notify.fn = NULL;
+	req.client = info->io_client;
+
+	ret = dm_io(&req, 1, &region, &io_error);
+	if (ret || io_error)
+		return -EIO;
+	return 0;
+}
+
+static void dm_icomp_meta_io_done(unsigned long error, void *context)
+{
+	struct dm_icomp_meta_io *meta_io = context;
+
+	meta_io->fn(meta_io->data, error);
+	kmem_cache_free(dm_icomp_meta_io_cachep, meta_io);
+}
+
+/*
+ * write meta data to the meta blocks in the backing store.
+ */
+static int dm_icomp_write_meta(struct dm_icomp_info *info, u64 start_page,
+	u64 end_page, void *data,
+	void (*fn)(void *data, unsigned long error), int rw, int flags)
+{
+	struct dm_icomp_meta_io *meta_io;
+	sector_t sector, last_sector, last_meta_sector = info->data_start-1;
+
+	WARN_ON(end_page > info->meta_bitmap_pages);
+
+	sector = DMCP_META_START_SECTOR + (start_page << (PAGE_SHIFT - 9));
+	WARN_ON(sector > last_meta_sector);
+	if (sector > last_meta_sector) {
+		fn(data, -EINVAL);
+		return -EINVAL;
+	}
+	last_sector = sector + ((end_page - start_page) << (PAGE_SHIFT - 9));
+	if (last_sector > last_meta_sector)
+		last_sector = last_meta_sector;
+
+
+	meta_io = kmem_cache_alloc(dm_icomp_meta_io_cachep, GFP_NOIO);
+	if (!meta_io) {
+		fn(data, -ENOMEM);
+		return -ENOMEM;
+	}
+	meta_io->data = data;
+	meta_io->fn = fn;
+
+	meta_io->io_region.bdev = info->dev->bdev;
+
+
+	meta_io->io_region.sector = sector;
+	meta_io->io_region.count = last_sector - sector + 1;
+	atomic64_add(DMCP_SECTOR_TO_BYTES(meta_io->io_region.count),
+				&info->meta_write_size);
+
+	meta_io->io_req.bi_op = rw;
+	meta_io->io_req.bi_op_flags = flags;
+	meta_io->io_req.mem.type = DM_IO_VMA;
+	meta_io->io_req.mem.offset = 0;
+	meta_io->io_req.mem.ptr.addr = info->meta_bitmap +
+						(start_page << PAGE_SHIFT);
+	meta_io->io_req.notify.fn = dm_icomp_meta_io_done;
+	meta_io->io_req.notify.context = meta_io;
+	meta_io->io_req.client = info->io_client;
+
+	dm_io(&meta_io->io_req, 1, &meta_io->io_region, NULL);
+	return 0;
+}
+
+struct writeback_flush_data {
+	struct completion complete;
+	atomic_t cnt;
+};
+
+static void writeback_flush_io_done(void *data, unsigned long error)
+{
+	struct writeback_flush_data *wb = data;
+
+	if (atomic_dec_return(&wb->cnt))
+		return;
+	complete(&wb->complete);
+}
+
+static void dm_icomp_flush_dirty_meta(struct dm_icomp_info *info,
+			struct writeback_flush_data *data)
+{
+	struct page *page;
+	u64 start = 0, index;
+	u32 pending = 0, cnt = 0;
+	bool dirty;
+	struct blk_plug plug;
+
+	blk_start_plug(&plug);
+	for (index = 0; index < info->meta_bitmap_pages; index++, cnt++) {
+		if (cnt == 256) {
+			cnt = 0;
+			cond_resched();
+		}
+
+		page = vmalloc_to_page((char *)(info->meta_bitmap) +
+					(index << PAGE_SHIFT));
+		if (!page)
+			DMWARN("Uable to find page for block=%llu", index);
+		dirty = TestClearPageDirty(page);
+
+		if (pending == 0 && dirty) {
+			start = index;
+			pending++;
+			continue;
+		} else if (pending == 0)
+			continue;
+		else if (pending > 0 && dirty) {
+			pending++;
+			continue;
+		}
+
+		/* pending > 0 && !dirty */
+		atomic_inc(&data->cnt);
+		dm_icomp_write_meta(info, start, start + pending, data,
+			writeback_flush_io_done, REQ_OP_WRITE, WRITE);
+		pending = 0;
+	}
+
+	if (pending > 0) {
+		atomic_inc(&data->cnt);
+		dm_icomp_write_meta(info, start, start + pending, data,
+			writeback_flush_io_done, REQ_OP_WRITE, WRITE);
+	}
+	blkdev_issue_flush(info->dev->bdev, GFP_NOIO, NULL);
+	blk_finish_plug(&plug);
+}
+
+static int dm_icomp_meta_writeback_thread(void *data)
+{
+	struct dm_icomp_info *info = data;
+	struct writeback_flush_data wb;
+
+	atomic_set(&wb.cnt, 1);
+	init_completion(&wb.complete);
+
+	while (!kthread_should_stop()) {
+		schedule_timeout_interruptible(
+			msecs_to_jiffies(info->writeback_delay * 1000));
+		dm_icomp_flush_dirty_meta(info, &wb);
+	}
+
+	dm_icomp_flush_dirty_meta(info, &wb);
+
+	writeback_flush_io_done(&wb, 0);
+	wait_for_completion(&wb.complete);
+	return 0;
+}
+
+static int dm_icomp_init_meta(struct dm_icomp_info *info, bool new)
+{
+	struct dm_io_region region;
+	struct dm_io_request req;
+	unsigned long io_error = 0;
+	struct blk_plug plug;
+	int ret;
+	ssize_t len = DIV_ROUND_UP_ULL(info->meta_bitmap_bits,
+			DMCP_BITS_PER_ENTRY);
+
+	len *= (DMCP_BITS_PER_ENTRY >> 3);
+
+	region.bdev = info->dev->bdev;
+	region.sector = DMCP_META_START_SECTOR;
+	region.count = DMCP_BYTES_TO_SECTOR(round_up(len,
+				DMCP_SECTOR_SIZE));
+
+	req.mem.type = DM_IO_VMA;
+	req.mem.offset = 0;
+	req.mem.ptr.addr = info->meta_bitmap;
+	req.notify.fn = NULL;
+	req.client = info->io_client;
+
+	blk_start_plug(&plug);
+	if (new) {
+		memset(info->meta_bitmap, 0, len);
+		req.bi_op = REQ_OP_WRITE;
+		req.bi_op_flags = REQ_FUA;
+		ret = dm_io(&req, 1, &region, &io_error);
+	} else {
+		req.bi_op = REQ_OP_READ;
+		req.bi_op_flags = READ;
+		ret = dm_io(&req, 1, &region, &io_error);
+	}
+	blk_finish_plug(&plug);
+
+	if (ret || io_error) {
+		info->ti->error = "Access metadata error";
+		return -EIO;
+	}
+
+	if (info->write_mode == DMCP_WRITE_BACK) {
+		info->writeback_tsk = kthread_run(
+			dm_icomp_meta_writeback_thread,
+			info, "dm_icomp_writeback");
+		if (!info->writeback_tsk) {
+			info->ti->error = "Create writeback thread error";
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int dm_icomp_alloc_compressor(struct dm_icomp_info *info)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		info->tfm[i] = crypto_alloc_comp(
+			compressors[info->comp_alg].name, 0, 0);
+		if (IS_ERR(info->tfm[i])) {
+			info->tfm[i] = NULL;
+			goto err;
+		}
+	}
+	return 0;
+err:
+	for_each_possible_cpu(i) {
+		if (info->tfm[i]) {
+			crypto_free_comp(info->tfm[i]);
+			info->tfm[i] = NULL;
+		}
+	}
+	return -ENOMEM;
+}
+
+static void dm_icomp_free_compressor(struct dm_icomp_info *info)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		if (info->tfm[i]) {
+			crypto_free_comp(info->tfm[i]);
+			info->tfm[i] = NULL;
+		}
+	}
+}
+
+/*
+ * create a new super block and initialize its contents.
+ */
+static int dm_icomp_read_or_create_super(struct dm_icomp_info *info)
+{
+	void *addr;
+	struct dm_icomp_super_block *super;
+	u64 total_blocks, data_blocks, meta_blocks;
+	bool new_super = false;
+	int ret;
+	ssize_t len;
+
+	info->total_sector = DMCP_BYTES_TO_SECTOR(
+			i_size_read(info->dev->bdev->bd_inode));
+	total_blocks = DMCP_SECTOR_TO_BLOCK(info->total_sector) - 1;
+
+	data_blocks =  DMCP_SECTOR_TO_BLOCK(info->ti->len);
+	meta_blocks =  ((data_blocks * DMCP_META_BITS) +
+			((DMCP_BLOCK_SIZE * 8) - 1)) / (DMCP_BLOCK_SIZE * 8);
+
+	info->data_blocks = data_blocks;
+	info->data_start = DMCP_BLOCK_TO_SECTOR(1 + meta_blocks);
+
+	DMINFO(" info->data_start=%u info->data_blocks=%llu",
+		(unsigned int)info->data_start, info->data_blocks);
+	DMINFO(" metablocks=%llu total_blocks=%llu",
+		meta_blocks, total_blocks);
+
+	if (DMCP_BLOCK_TO_SECTOR(data_blocks + meta_blocks + 1)
+			>= info->total_sector) {
+		info->ti->error =
+			"Insufficient sectors to satisfy requested size";
+		return -ENOMEM;
+	}
+
+	addr = kzalloc(DMCP_BLOCK_SIZE, GFP_KERNEL);
+	if (!addr) {
+		info->ti->error = "Cannot allocate super";
+		return -ENOMEM;
+	}
+
+	super = addr;
+	ret = dm_icomp_access_super(info, addr, REQ_OP_READ, 0);
+	if (ret)
+		goto out;
+
+	if (le64_to_cpu(super->magic) == DMCP_SUPER_MAGIC) {
+		if (le64_to_cpu(super->meta_blocks) != meta_blocks ||
+		    le64_to_cpu(super->data_blocks) != data_blocks) {
+			info->ti->error = "Super is invalid";
+			ret = -EINVAL;
+			goto out;
+		}
+		if (!crypto_has_comp(compressors[info->comp_alg].name,
+					0, 0)) {
+			info->ti->error =
+				"Compressor algorithm doesn't support";
+			ret = -EINVAL;
+			goto out;
+		}
+	} else {
+		super->magic = cpu_to_le64(DMCP_SUPER_MAGIC);
+		super->meta_blocks = cpu_to_le64(meta_blocks);
+		super->data_blocks = cpu_to_le64(data_blocks);
+		super->comp_alg = default_compressor;
+		ret = dm_icomp_access_super(info, addr, REQ_OP_WRITE,
+				REQ_FUA);
+		if (ret) {
+			info->ti->error = "Access super fails";
+			goto out;
+		}
+		new_super = true;
+	}
+
+	if (dm_icomp_alloc_compressor(info)) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	info->meta_bitmap_bits = data_blocks * DMCP_META_BITS;
+	len = DIV_ROUND_UP_ULL(info->meta_bitmap_bits, DMCP_BITS_PER_ENTRY);
+	len *= (DMCP_BITS_PER_ENTRY >> 3);
+	info->meta_bitmap_pages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	info->meta_bitmap = vzalloc(info->meta_bitmap_pages * PAGE_SIZE);
+	if (!info->meta_bitmap) {
+		ret = -ENOMEM;
+		goto bitmap_err;
+	}
+
+	ret = dm_icomp_init_meta(info, new_super);
+	if (ret)
+		goto meta_err;
+
+	return 0;
+meta_err:
+	vfree(info->meta_bitmap);
+bitmap_err:
+	dm_icomp_free_compressor(info);
+out:
+	kfree(addr);
+	return ret;
+}
+
+/*
+ * <dev> [ <writethough>/<writeback> <meta_commit_delay> ]
+ *	 [ <compressor> <type> ]
+ */
+static int dm_icomp_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+	struct dm_icomp_info *info;
+	char mode[15];
+	int par = 0;
+	int ret, i;
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info) {
+		ti->error = "dm-inplace-compress: Cannot allocate context";
+		return -ENOMEM;
+	}
+	info->ti = ti;
+	info->comp_alg = default_compressor;
+	while (++par < argc) {
+		if (sscanf(argv[par], "%s", mode) != 1) {
+			ti->error = "Invalid argument";
+			ret = -EINVAL;
+			goto err_para;
+		}
+
+		if (strcmp(mode, "writeback") == 0) {
+			info->write_mode = DMCP_WRITE_BACK;
+			if (sscanf(argv[++par], "%u",
+				 &info->writeback_delay) != 1) {
+				ti->error = "Invalid argument";
+				ret = -EINVAL;
+				goto err_para;
+			}
+		} else if (strcmp(mode, "writethrough") == 0) {
+			info->write_mode = DMCP_WRITE_THROUGH;
+		} else if (strcmp(mode, "compressor") == 0) {
+			if (sscanf(argv[++par], "%s", mode) != 1) {
+				ti->error = "Invalid argument";
+				ret = -EINVAL;
+				goto err_para;
+			}
+			ret = dm_icomp_get_compressor(mode);
+			if (ret >= 0) {
+				DMINFO("compressor  is %s", mode);
+				info->comp_alg = ret;
+			} else {
+				ti->error = "Unsupported compressor";
+				ret = -EINVAL;
+				goto err_para;
+			}
+		}
+	}
+
+	if (dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
+							&info->dev)) {
+		ti->error = "Can't get device";
+		ret = -EINVAL;
+		goto err_para;
+	}
+
+	info->io_client = dm_io_client_create();
+	if (!info->io_client) {
+		ti->error = "Can't create io client";
+		ret = -EINVAL;
+		goto err_ioclient;
+	}
+
+	if (bdev_logical_block_size(info->dev->bdev) != 512) {
+		ti->error = "Can't logical block size too big";
+		ret = -EINVAL;
+		goto err_blocksize;
+	}
+
+	ret = dm_icomp_read_or_create_super(info);
+	if (ret)
+		goto err_blocksize;
+
+	for (i = 0; i < BITMAP_HASH_LEN; i++) {
+		info->bitmap_locks[i].io_running = 0;
+		spin_lock_init(&info->bitmap_locks[i].wait_lock);
+		INIT_LIST_HEAD(&info->bitmap_locks[i].wait_list);
+	}
+
+	atomic64_set(&info->compressed_write_size, 0);
+	atomic64_set(&info->uncompressed_write_size, 0);
+	atomic64_set(&info->meta_write_size, 0);
+	atomic64_set(&dm_icomp_total_alloc_size, 0);
+	atomic64_set(&dm_icomp_total_bio_save, 0);
+
+	ti->num_flush_bios = 1;
+	/* ti->num_discard_bios = 1; */
+	ti->private = info;
+	return 0;
+err_blocksize:
+	dm_io_client_destroy(info->io_client);
+err_ioclient:
+	dm_put_device(ti, info->dev);
+err_para:
+	kfree(info);
+	return ret;
+}
+
+static void dm_icomp_dtr(struct dm_target *ti)
+{
+	struct dm_icomp_info *info = ti->private;
+
+	if (info->write_mode == DMCP_WRITE_BACK)
+		kthread_stop(info->writeback_tsk);
+	dm_icomp_free_compressor(info);
+	vfree(info->meta_bitmap);
+	dm_io_client_destroy(info->io_client);
+	dm_put_device(ti, info->dev);
+	kfree(info);
+}
+
+/*
+ * return the range lock to this block.
+ */
+static struct dm_icomp_hash_lock *dm_icomp_block_hash_lock(
+		struct dm_icomp_info *info, u64 block_index)
+{
+	return &info->bitmap_locks[(block_index >> BITMAP_HASH_SHIFT) &
+			BITMAP_HASH_MASK];
+}
+
+/*
+ * unlock the io range correspondingg to this block.
+ */
+static struct dm_icomp_hash_lock *dm_icomp_trylock_block(
+		struct dm_icomp_info *info,
+		struct dm_icomp_req *req, u64 block_index)
+{
+	struct dm_icomp_hash_lock *hash_lock;
+
+	hash_lock = dm_icomp_block_hash_lock(req->info, block_index);
+
+	spin_lock_irq(&hash_lock->wait_lock);
+	if (!hash_lock->io_running) {
+		hash_lock->io_running = 1;
+		spin_unlock_irq(&hash_lock->wait_lock);
+		return hash_lock;
+	}
+	list_add_tail(&req->sibling, &hash_lock->wait_list);
+	spin_unlock_irq(&hash_lock->wait_lock);
+	return NULL;
+}
+
+static void dm_icomp_queue_req_list(struct dm_icomp_info *info,
+	 struct list_head *list);
+
+static void dm_icomp_unlock_block(struct dm_icomp_info *info,
+	struct dm_icomp_req *req, struct dm_icomp_hash_lock *hash_lock)
+{
+	LIST_HEAD(pending_list);
+	unsigned long flags;
+
+	spin_lock_irqsave(&hash_lock->wait_lock, flags);
+	/* wakeup all pending reqs to avoid live lock */
+	list_splice_init(&hash_lock->wait_list, &pending_list);
+	hash_lock->io_running = 0;
+	spin_unlock_irqrestore(&hash_lock->wait_lock, flags);
+
+	dm_icomp_queue_req_list(info, &pending_list);
+}
+
+/*
+ * lock all the range locks corresponding to this io request.
+ */
+static int dm_icomp_lock_req_range(struct dm_icomp_req *req)
+{
+	u64 block_index, first_block_index;
+	u64 first_lock_block, second_lock_block;
+	u16 logical_sectors, data_sectors;
+
+	block_index = DMCP_SECTOR_TO_BLOCK(req->bio->bi_iter.bi_sector);
+	req->locks[0] = dm_icomp_trylock_block(req->info, req, block_index);
+	if (!req->locks[0])
+		return 0;
+	dm_icomp_get_extent(req->info, block_index, &first_block_index,
+				&logical_sectors, &data_sectors);
+	if (dm_icomp_block_hash_lock(req->info, first_block_index) !=
+						req->locks[0]) {
+		dm_icomp_unlock_block(req->info, req, req->locks[0]);
+		first_lock_block = first_block_index;
+		second_lock_block = block_index;
+		goto two_locks;
+	}
+
+	block_index = DMCP_SECTOR_TO_BLOCK(bio_end_sector(req->bio) - 1);
+	dm_icomp_get_extent(req->info, block_index, &first_block_index,
+				&logical_sectors, &data_sectors);
+	first_block_index += DMCP_SECTOR_TO_BLOCK(logical_sectors);
+	if (dm_icomp_block_hash_lock(req->info, first_block_index) !=
+						req->locks[0]) {
+		second_lock_block = first_block_index;
+		goto second_lock;
+	}
+	req->locked_locks = 1;
+	return 1;
+
+two_locks:
+	req->locks[0] = dm_icomp_trylock_block(req->info, req,
+		first_lock_block);
+	if (!req->locks[0])
+		return 0;
+second_lock:
+	req->locks[1] = dm_icomp_trylock_block(req->info, req,
+				second_lock_block);
+	if (!req->locks[1]) {
+		dm_icomp_unlock_block(req->info, req, req->locks[0]);
+		return 0;
+	}
+	/* Don't need check if meta is changed */
+	req->locked_locks = 2;
+	return 1;
+}
+
+
+
+/*
+ * unlock all the range locks corresponding to this io request.
+ */
+static void dm_icomp_unlock_req_range(struct dm_icomp_req *req)
+{
+	int i;
+
+	for (i = req->locked_locks - 1; i >= 0; i--)
+		dm_icomp_unlock_block(req->info, req, req->locks[i]);
+}
+
+static void dm_icomp_queue_req(struct dm_icomp_info *info,
+		struct dm_icomp_req *req)
+{
+	unsigned long flags;
+	struct dm_icomp_io_worker *worker = &dm_icomp_io_workers[req->cpu];
+
+	spin_lock_irqsave(&worker->lock, flags);
+	list_add_tail(&req->sibling, &worker->pending);
+	spin_unlock_irqrestore(&worker->lock, flags);
+
+	queue_work_on(req->cpu, dm_icomp_wq, &worker->work);
+}
+
+static void dm_icomp_queue_req_list(struct dm_icomp_info *info,
+		struct list_head *list)
+{
+	struct dm_icomp_req *req;
+
+	while (!list_empty(list)) {
+		req = list_first_entry(list, struct dm_icomp_req, sibling);
+		list_del_init(&req->sibling);
+		dm_icomp_queue_req(info, req);
+	}
+}
+
+static void dm_icomp_get_req(struct dm_icomp_req *req)
+{
+	atomic_inc(&req->io_pending);
+}
+
+/*
+ * Use GFP_ATOMIC allocations if the device
+ * is used as a swap device.
+ *
+ * TODO: need a better solution.
+ * maybe a sysfs interface to change allocation types?
+ */
+#define ALLOC_FLAG GFP_NOIO
+
+static void *dm_icomp_kmalloc(size_t size)
+{
+	void *addr = kmalloc(size, ALLOC_FLAG);
+
+	if (!addr)
+		return NULL;
+	DMCP_ALLOC(size);
+	return addr;
+}
+
+static void *dm_icomp_krealloc(void *ptr, size_t size,
+		size_t origsize)
+{
+	void *addr = krealloc(ptr, size, ALLOC_FLAG);
+
+	if (!addr)
+		return NULL;
+	DMCP_FREE_ALLOC(origsize);
+	DMCP_ALLOC(size);
+	return addr;
+}
+
+static int dm_icomp_alloc_compbuffer(struct dm_icomp_io_range *io, int size)
+{
+	void *addr = dm_icomp_kmalloc(size+DMCP_SECTOR_SIZE);
+
+	if (!addr)
+		return 1;
+
+	io->comp_real_data = addr;
+	io->comp_kmap	= false;
+	io->comp_data   = io->io_req.mem.ptr.addr = PTR_ALIGN(addr,
+				DMCP_SECTOR_SIZE);
+	io->comp_len	= size+DMCP_SECTOR_SIZE;
+	return 0;
+}
+
+static int dm_icomp_realloc_compbuffer(struct dm_icomp_io_range *io, int size)
+{
+	void *addr = dm_icomp_krealloc(io->comp_real_data,
+			size+DMCP_SECTOR_SIZE, io->comp_len);
+	if (!addr)
+		return 1;
+
+	DMCP_FREE_ALLOC(io->comp_len);
+	DMCP_ALLOC(size+DMCP_SECTOR_SIZE);
+	io->comp_real_data = addr;
+	io->comp_kmap	   = false;
+	io->comp_data      = io->io_req.mem.ptr.addr = PTR_ALIGN(addr,
+				DMCP_SECTOR_SIZE);
+	io->comp_len	   = size+DMCP_SECTOR_SIZE;
+	return 0;
+}
+
+static void dm_icomp_kfree(void *addr, unsigned int size)
+{
+	kfree(addr);
+	DMCP_FREE_ALLOC(size);
+}
+
+static void dm_icomp_release_decomp_buffer(struct dm_icomp_io_range *io)
+{
+	if (!io->decomp_data)
+		return;
+
+	if (io->decomp_kmap)
+		kunmap(io->decomp_real_data);
+	else
+		dm_icomp_kfree(io->decomp_real_data, io->decomp_len);
+
+	io->decomp_data = io->decomp_real_data = NULL;
+	io->decomp_len  = 0;
+	io->decomp_kmap = false;
+}
+
+static void dm_icomp_release_comp_buffer(struct dm_icomp_io_range *io)
+{
+	if (!io->comp_data)
+		return;
+
+	if (io->comp_kmap)
+		kunmap(io->comp_real_data);
+	else
+		dm_icomp_kfree(io->comp_real_data, io->comp_len);
+
+	io->comp_real_data = io->comp_data = NULL;
+	io->comp_len = 0;
+	io->comp_kmap = false;
+}
+
+static void dm_icomp_free_io_range(struct dm_icomp_io_range *io)
+{
+	dm_icomp_release_decomp_buffer(io);
+	dm_icomp_release_comp_buffer(io);
+	kmem_cache_free(dm_icomp_io_range_cachep, io);
+}
+
+static void dm_icomp_put_req(struct dm_icomp_req *req)
+{
+	struct dm_icomp_io_range *io;
+
+	if (atomic_dec_return(&req->io_pending))
+		return;
+
+	if (GET_REQ_STAGE(req) == STAGE_INIT) /* waiting for locking */
+		return;
+
+	if (GET_REQ_STAGE(req) == STAGE_READ_DECOMP ||
+	    GET_REQ_STAGE(req) == STAGE_WRITE_COMP)
+		SET_REQ_STAGE(req, STAGE_DONE);
+
+	if (!!!req->result && GET_REQ_STAGE(req) != STAGE_DONE) {
+		dm_icomp_queue_req(req->info, req);
+		return;
+	}
+
+	while (!list_empty(&req->all_io)) {
+		io = list_entry(req->all_io.next,
+			struct dm_icomp_io_range, next);
+		list_del(&io->next);
+		dm_icomp_free_io_range(io);
+	}
+
+	dm_icomp_unlock_req_range(req);
+
+	req->bio->bi_error = req->result;
+
+	bio_endio(req->bio);
+	kmem_cache_free(dm_icomp_req_cachep, req);
+}
+
+static void dm_icomp_bio_copy(struct bio *bio, off_t bio_off, void *buf,
+		ssize_t len, bool to_buf)
+{
+	struct bio_vec bv;
+	struct bvec_iter iter;
+	off_t buf_off = 0;
+	ssize_t size;
+	void *addr;
+
+	WARN_ON(bio_off + len > DMCP_SECTOR_TO_BYTES(bio_sectors(bio)));
+
+	bio_for_each_segment(bv, bio, iter) {
+		int length = bv.bv_len;
+
+		if (bio_off > length) {
+			bio_off -= length;
+			continue;
+		}
+		addr = kmap_atomic(bv.bv_page);
+		size = min_t(ssize_t, len, length - bio_off);
+		if (to_buf)
+			memcpy(buf + buf_off, addr + bio_off + bv.bv_offset,
+			size);
+		else
+			memcpy(addr + bio_off + bv.bv_offset, buf + buf_off,
+			size);
+		kunmap_atomic(addr);
+		bio_off = 0;
+		buf_off += size;
+
+		if (len <= size)
+			break;
+
+		len -= size;
+	}
+}
+
+static void dm_icomp_io_range_done(unsigned long error, void *context)
+{
+	struct dm_icomp_io_range *io = context;
+
+	if (error)
+		io->req->result = error;
+
+	dm_icomp_put_req(io->req);
+}
+
+static inline int dm_icomp_compressor_len(struct dm_icomp_info *info, int len)
+{
+	if (compressors[info->comp_alg].comp_len)
+		return compressors[info->comp_alg].comp_len(len);
+	return len;
+}
+
+static inline int dm_icomp_compressor_maxlen(struct dm_icomp_info *info,
+		int len)
+{
+	if (compressors[info->comp_alg].max_comp_len)
+		return compressors[info->comp_alg].max_comp_len(len);
+	return len;
+}
+
+/*
+ * caller should set region.sector, region.count. bi_rw. IO always to/from
+ * comp_data
+ */
+static struct dm_icomp_io_range *dm_icomp_create_io_range(
+		struct dm_icomp_req *req)
+{
+	struct dm_icomp_io_range *io;
+
+	io = kmem_cache_alloc(dm_icomp_io_range_cachep, GFP_NOIO);
+	if (!io)
+		return NULL;
+
+	io->io_req.notify.fn = dm_icomp_io_range_done;
+	io->io_req.notify.context = io;
+	io->io_req.client = req->info->io_client;
+	io->io_req.mem.type = DM_IO_KMEM;
+	io->io_req.mem.offset = 0;
+
+	io->io_region.bdev = req->info->dev->bdev;
+	io->req = req;
+
+	io->comp_data = io->comp_real_data =
+			io->decomp_data = io->decomp_real_data = NULL;
+
+	io->data_bytes = io->comp_len =
+			io->decomp_len = io->logical_bytes = 0;
+
+	io->comp_kmap = io->decomp_kmap = false;
+	return io;
+}
+
+
+/*
+ * return an address, within the bio. The address corresponds to
+ * the requested offset 'bio_off' and is contiguous of size 'len'
+ */
+static void *get_addr(struct bio *bio,  int len, u64 bio_off, u64 *offset)
+{
+	struct bio_vec bv;
+	struct bvec_iter iter;
+	void *addr;
+
+	bio_for_each_segment(bv, bio, iter) {
+		int length = bv.bv_len;
+
+		if (bio_off > length) {
+			bio_off -= length;
+			continue;
+		}
+		addr = bv.bv_page;
+		if (bv.bv_offset + bio_off + len >= length) {
+			*offset = bv.bv_offset + bio_off;
+			return kmap(addr);
+		}
+		break;
+	}
+	return NULL;
+}
+
+
+/*
+ * create a io range for tracking  predominantly a read request.
+ * @req		: the read request
+ * @comp_len	: allocation size of the compress buffer
+ * @decomp_len	: allocation size of the decompress buffer
+ * @actual_comp_len : real size of the compress data
+ * @bio_off	: offset within the bio read buffer this request corresponds to.
+ *		try to reuse and read into the bio buffer. -1 means don't reuse.
+ */
+static struct dm_icomp_io_range *dm_icomp_create_io_read_range(
+		struct dm_icomp_req *req, int comp_len, int decomp_len,
+		long bio_off, int actual_comp_len)
+{
+	struct bio *bio = req->bio;
+	void *addr = NULL;
+	struct dm_icomp_io_range *io = dm_icomp_create_io_range(req);
+	u64 offset;
+
+	if (!io)
+		return NULL;
+
+	WARN_ON(comp_len % DMCP_SECTOR_SIZE);
+
+	/* try reusing the bio if possible */
+	if (bio_off >= 0) {
+		addr = get_addr(bio, comp_len, (u64)bio_off, &offset);
+		if (addr) {
+			io->comp_real_data =  addr;
+			io->comp_data = io->io_req.mem.ptr.addr = addr + offset;
+			io->comp_kmap = true;
+			io->comp_len  = comp_len;
+		}
+	}
+
+	if (!addr && dm_icomp_alloc_compbuffer(io, comp_len)) {
+		kmem_cache_free(dm_icomp_io_range_cachep, io);
+		return NULL;
+	}
+
+	io->data_bytes	= actual_comp_len;  /* NOTE, this value can change */
+
+	/*
+	 * note requested length for decompress buffer. Do not allocate it yet.
+	 * Value once set is final.
+	 */
+	io->logical_bytes = decomp_len;
+
+	return io;
+}
+
+/*
+ *  ensure that the io range has all its buffers; of the correct size,
+ *  allocated.
+ */
+static int dm_icomp_update_io_read_range(struct dm_icomp_io_range *io)
+{
+	WARN_ON(!io->comp_data);
+	WARN_ON(io->decomp_data || io->decomp_len);
+	io->decomp_data = dm_icomp_kmalloc(io->logical_bytes);
+	if (!io->decomp_data)
+		return 1;
+	io->decomp_real_data = io->decomp_data;
+	io->decomp_len = io->logical_bytes;
+	io->decomp_kmap = false;
+	return 0;
+}
+
+/*
+ *  resize the comp buffer to its largest possible size.
+ */
+static int dm_icomp_mod_to_max_io_range(struct dm_icomp_info *info,
+			 struct dm_icomp_io_range *io)
+{
+	unsigned int maxlen = dm_icomp_compressor_maxlen(info, io->decomp_len);
+
+	WARN_ON(maxlen > io->logical_bytes);
+
+	if (io->comp_kmap) {
+		WARN_ON(io->comp_kmap);
+		kunmap(io->comp_real_data);
+		io->comp_kmap = false;
+		io->comp_real_data = io->comp_data = NULL;
+	}
+
+	if (dm_icomp_realloc_compbuffer(io, maxlen)) {
+		DMWARN("allocation failure ");
+		io->comp_len = 0;
+		return -ENOSPC;
+	}
+	io->comp_len = maxlen;
+	return 0;
+}
+
+/*
+ * create a io range for tracking a write request.
+ * @req		: the write request
+ * @count	: size of the write in sectors.
+ * @offset	: offset within the bio read buffer this request correspond to.
+ */
+static struct dm_icomp_io_range *dm_icomp_create_io_write_range(
+	struct dm_icomp_req *req, sector_t offset, sector_t count)
+{
+	struct bio *bio = req->bio;
+	int size  = DMCP_SECTOR_TO_BYTES(count);
+	u64 of;
+	int comp_len = dm_icomp_compressor_len(req->info, size);
+	void *addr;
+	struct dm_icomp_io_range *io = dm_icomp_create_io_range(req);
+
+	if (!io)
+		return NULL;
+
+	WARN_ON(io->comp_data);
+
+	if (dm_icomp_alloc_compbuffer(io, comp_len)) {
+		kmem_cache_free(dm_icomp_io_range_cachep, io);
+		return NULL;
+	}
+
+	/* we donot know the size of the compress segment yet. */
+	io->data_bytes = 0;
+
+
+	WARN_ON(io->decomp_data);
+
+	io->decomp_kmap = false;
+
+	/* try reusing the bio buffer for decomp data. */
+	addr = get_addr(bio, size, DMCP_SECTOR_TO_BYTES(offset), &of);
+	if (addr)
+		io->decomp_kmap = true;
+	else
+		addr  = dm_icomp_kmalloc(size);
+
+	if (!addr) {
+		dm_icomp_kfree(io->comp_data, comp_len);
+		kmem_cache_free(dm_icomp_io_range_cachep, io);
+		return NULL;
+	}
+
+	io->logical_bytes = io->decomp_len = size;
+
+	if (io->decomp_kmap) {
+		io->decomp_real_data = addr;
+		io->decomp_data = addr + of;
+		DMCP_ALLOC_SAVE(size);
+	} else {
+		io->decomp_data = io->decomp_real_data = addr;
+		dm_icomp_bio_copy(req->bio, DMCP_SECTOR_TO_BYTES(offset),
+			io->decomp_data, size, true);
+	}
+
+	return io;
+}
+
+static unsigned int round_to_next_sector(unsigned int val)
+{
+	unsigned int c = round_up(val, DMCP_SECTOR_SIZE);
+
+	if ((c - val) < 2*sizeof(u32))
+		c += DMCP_SECTOR_SIZE;
+	return c;
+}
+
+/*
+ * compress and store the data in compress buffer.
+ * return value:
+ * < 0 : error
+ * == 0 : ok
+ * == 1 : ok, but comp/decomp is skipped
+ * Compressed data size is roundup of 512, which makes the payload.
+ * We store the actual compressed len in the last u32 of the payload.
+ * If there is no free space, we add 512 to the payload size.
+ */
+static int dm_icomp_io_range_compress(struct dm_icomp_info *info,
+		struct dm_icomp_io_range *io, unsigned int *comp_len)
+{
+	unsigned int actual_comp_len = io->comp_len;
+	u32 *addr;
+	struct crypto_comp *tfm =  info->tfm[get_cpu()];
+	unsigned int decomp_len = io->logical_bytes;
+	int ret;
+
+	actual_comp_len = io->comp_len;
+	ret = crypto_comp_compress(tfm, io->decomp_data, decomp_len,
+		io->comp_data, &actual_comp_len);
+
+	if (ret || round_to_next_sector(actual_comp_len) > io->comp_len) {
+		ret = dm_icomp_mod_to_max_io_range(info, io);
+		if (!ret) {
+			actual_comp_len = io->comp_len;
+			ret = crypto_comp_compress(tfm, io->decomp_data,
+				decomp_len, io->comp_data,
+				&actual_comp_len);
+		}
+	}
+
+	put_cpu();
+
+	if (ret < 0)
+		DMINFO("CO Error %d ", ret);
+
+	atomic64_add(decomp_len, &info->uncompressed_write_size);
+	io->data_bytes = *comp_len = round_to_next_sector(actual_comp_len);
+	if (ret || decomp_len < *comp_len) {
+		*comp_len = decomp_len;
+		memcpy(io->comp_data, io->decomp_data, *comp_len);
+		atomic64_add(*comp_len, &info->compressed_write_size);
+	} else {
+		atomic64_add(*comp_len, &info->compressed_write_size);
+		addr = (u32 *)((char *)io->comp_data + *comp_len);
+		addr--;
+		*addr = cpu_to_le32(actual_comp_len);
+		addr--;
+		*addr = cpu_to_le32(DMCP_COMPRESS_MAGIC);
+	}
+
+	return 0;
+}
+
+/*
+ * decompress and store the data in decompress buffer.
+ * return value:
+ * < 0 : error
+ * == 0 : ok
+ */
+static int dm_icomp_io_range_decompress(struct dm_icomp_info *info,
+		struct dm_icomp_io_range *io, unsigned int *decomp_len)
+{
+	struct crypto_comp *tfm;
+	u32 *addr;
+	int ret;
+	int comp_len = io->data_bytes;
+
+	WARN_ON(!io->data_bytes);
+
+	if (comp_len == io->logical_bytes) {
+		memcpy(io->decomp_data, io->comp_data, comp_len);
+		*decomp_len = comp_len;
+		return 0;
+	}
+
+	WARN_ON(io->comp_data != io->io_req.mem.ptr.addr);
+
+	addr = (u32 *)((char *)(io->comp_data) + comp_len);
+	addr--;
+	comp_len = le32_to_cpu(*addr);
+	addr--;
+
+	if (le32_to_cpu(*addr) == DMCP_COMPRESS_MAGIC) {
+		tfm = info->tfm[get_cpu()];
+		*decomp_len = io->logical_bytes;
+		ret = crypto_comp_decompress(tfm, io->comp_data, comp_len,
+			io->decomp_data, decomp_len);
+		WARN_ON(*decomp_len != io->decomp_len);
+		put_cpu();
+		if (ret)
+			return -EINVAL;
+		return 0;
+	}
+
+	DMWARN("Decompress Error ");
+	return -1;
+}
+
+/*
+ *  fill the bio with the corresponding decompressed data.
+ */
+static void dm_icomp_handle_read_decomp(struct dm_icomp_req *req)
+{
+	struct dm_icomp_io_range *io;
+	off_t bio_off = 0;
+	int ret;
+	sector_t bio_len  = DMCP_SECTOR_TO_BYTES(bio_sectors(req->bio));
+
+	SET_REQ_STAGE(req, STAGE_READ_DECOMP);
+
+	if (req->result)
+		return;
+
+	list_for_each_entry(io, &req->all_io, next) {
+		ssize_t dst_off = 0, src_off = 0, len;
+		unsigned int decomp_len;
+
+		io->io_region.sector -= req->info->data_start;
+
+		if (io->io_region.sector >=
+				req->bio->bi_iter.bi_sector)
+			dst_off = DMCP_SECTOR_TO_BYTES(
+				io->io_region.sector -
+				req->bio->bi_iter.bi_sector);
+		else
+			src_off = DMCP_SECTOR_TO_BYTES(
+				req->bio->bi_iter.bi_sector -
+				io->io_region.sector);
+
+		if (dm_icomp_update_io_read_range(io)) {
+			req->result = -EIO;
+			return;
+		}
+
+		/* Do decomp here */
+		ret = dm_icomp_io_range_decompress(req->info, io, &decomp_len);
+		if (ret < 0) {
+			dm_icomp_release_decomp_buffer(io);
+			dm_icomp_release_comp_buffer(io);
+			req->result = -EIO;
+			return;
+		}
+
+		len = min_t(ssize_t,
+			max_t(ssize_t, decomp_len - src_off, 0),
+			max_t(ssize_t, bio_len - dst_off, 0));
+
+		dm_icomp_bio_copy(req->bio, dst_off,
+			   io->decomp_data + src_off, len, false);
+
+		/* io range in all_io list is ordered for read IO */
+		while (bio_off < dst_off) {
+			ssize_t size = min_t(ssize_t, PAGE_SIZE,
+					dst_off - bio_off);
+			dm_icomp_bio_copy(req->bio, bio_off, empty_zero_page,
+					size, false);
+			bio_off += size;
+		}
+
+		bio_off = dst_off + len;
+		dm_icomp_release_decomp_buffer(io);
+		dm_icomp_release_comp_buffer(io);
+	}
+
+	while (bio_off < bio_len) {
+		ssize_t size = min_t(ssize_t, PAGE_SIZE, (bio_len - bio_off));
+
+		dm_icomp_bio_copy(req->bio, bio_off, empty_zero_page,
+			size, false);
+		bio_off += size;
+	}
+}
+
+
+/*
+ * read an extent
+ * @req        : the read request
+ * @block      : the block to be read
+ * @logical_sectors   : no of sectors occupied by the decompressed data
+ * @data_sectors      : no of sectors occupied by the compressed data
+ * @may_resize : the compress data size may change during its life.
+ */
+static void dm_icomp_read_one_extent(struct dm_icomp_req *req, u64 block,
+	u16 logical_sectors, u16 data_sectors, bool may_resize)
+{
+	struct dm_icomp_io_range *io;
+	long bio_off = 0, comp_len;
+	int actual_comp_len = DMCP_SECTOR_TO_BYTES(data_sectors);
+
+	if (may_resize) {
+		/*
+		 * TODO: replace this with actual_comp_len.
+		 * comp_len = actual_comp_len;
+		 * There was some corruption observed with smaller
+		 * buffers. Hence we request this large size.
+		 */
+		comp_len = DMCP_SECTOR_TO_BYTES(logical_sectors);
+		bio_off =  -1;
+	} else {
+		comp_len = actual_comp_len;
+		bio_off = DMCP_BLOCK_TO_SECTOR(block) -
+				req->bio->bi_iter.bi_sector;
+	}
+
+	io = dm_icomp_create_io_read_range(req, comp_len,
+		DMCP_SECTOR_TO_BYTES(logical_sectors),
+		bio_off,
+		actual_comp_len);
+	if (!io) {
+		req->result = -EIO;
+		return;
+	}
+
+	dm_icomp_get_req(req);
+	list_add_tail(&io->next, &req->all_io);
+
+	io->io_region.sector = DMCP_BLOCK_TO_SECTOR(block) +
+				req->info->data_start;
+	io->io_region.count = data_sectors;
+	io->io_req.mem.ptr.addr = io->comp_data;
+	io->io_req.mem.type = DM_IO_KMEM;
+	io->io_req.mem.offset = 0;
+	io->io_req.bi_op = REQ_OP_READ;
+	io->io_req.bi_op_flags = (req->bio->bi_opf & REQ_FUA);
+
+	WARN_ON((io->io_region.sector + io->io_region.count)
+		>= req->info->total_sector);
+
+	dm_io(&io->io_req, 1, &io->io_region, NULL);
+}
+
+
+/*
+ * read the data corresponding to this request.
+ * @req   : the request.
+ * @reuse : the read data may be modified. So plan accordingly.
+ */
+static void dm_icomp_handle_read_existing(struct dm_icomp_req *req, bool reuse)
+{
+	u64 block_index, first_block_index;
+	u16 logical_sectors, data_sectors;
+
+	SET_REQ_STAGE(req, STAGE_READ_EXISTING);
+
+	block_index = DMCP_SECTOR_TO_BLOCK(req->bio->bi_iter.bi_sector);
+
+	while (!!!req->result &&
+		(block_index <= DMCP_SECTOR_TO_BLOCK(
+				bio_end_sector(req->bio)-1)) &&
+		(block_index < req->info->data_blocks)) {
+
+		dm_icomp_get_extent(req->info, block_index, &first_block_index,
+			&logical_sectors, &data_sectors);
+
+		if (data_sectors)
+			dm_icomp_read_one_extent(req, first_block_index,
+				logical_sectors, data_sectors, reuse);
+
+		block_index = first_block_index +
+				DMCP_SECTOR_TO_BLOCK(logical_sectors);
+	}
+}
+
+/*
+ * read existing data
+ */
+static void dm_icomp_handle_read_read_existing(struct dm_icomp_req *req)
+{
+	dm_icomp_handle_read_existing(req, false);
+
+	if (req->result)
+		return;
+
+	/* A shortcut if all data is in already */
+	if (list_empty(&req->all_io))
+		dm_icomp_handle_read_decomp(req);
+}
+
+static void dm_icomp_handle_read_request(struct dm_icomp_req *req)
+{
+	dm_icomp_get_req(req);
+
+	if (GET_REQ_STAGE(req) == STAGE_INIT) {
+		if (!dm_icomp_lock_req_range(req)) {
+			dm_icomp_put_req(req);
+			return;
+		}
+		dm_icomp_handle_read_read_existing(req);
+	} else if (GET_REQ_STAGE(req) == STAGE_READ_EXISTING) {
+		dm_icomp_handle_read_decomp(req);
+	}
+
+	dm_icomp_put_req(req);
+}
+
+static void dm_icomp_write_meta_done(void *context, unsigned long error)
+{
+	struct dm_icomp_req *req = context;
+
+	dm_icomp_put_req(req);
+}
+
+static u64 dm_icomp_block_meta_page_index(u64 block, bool end)
+{
+	u64 bits = block * DMCP_META_BITS - !!end;
+	/*
+	 * >> 5; 32 bits per entry
+	 * << 2; each entry is 4 bytes
+	 * >> PAGE_SHIFT; PAGE_SHIFT pages
+	 */
+	return bits >> (5 - 2 + PAGE_SHIFT);
+}
+
+
+/*
+ * write compressed data to the backing storage.
+ * @io : io range
+ * @sector_start : the sector on backing storage to which the
+ *	compressed data needs to be written.
+ * @meta_start: the page index of the bits corresponding to
+ * @meta_end  : start and end blocks.
+ */
+static int dm_icomp_compress_write(struct dm_icomp_io_range *io,
+		sector_t sector_start, u64 *meta_start, u64 *meta_end)
+{
+	struct dm_icomp_req *req = io->req;
+	sector_t count = DMCP_BYTES_TO_SECTOR(io->decomp_len);
+	unsigned int comp_len, ret;
+	u64 page_index;
+
+	/* comp_data must be able to accommadate a larger compress buffer */
+	ret = dm_icomp_io_range_compress(req->info, io, &comp_len);
+	if (ret < 0) {
+		req->result = -EIO;
+		return -EIO;
+	}
+	WARN_ON(comp_len > io->comp_len);
+
+	dm_icomp_get_req(req);
+
+	io->io_req.bi_op = REQ_OP_WRITE;
+	io->io_req.bi_op_flags = (req->bio->bi_opf & REQ_FUA);
+	io->io_req.mem.ptr.addr = io->comp_data;
+	io->io_req.mem.type = DM_IO_KMEM;
+	io->io_req.mem.offset = 0;
+	io->io_region.count = DMCP_BYTES_TO_SECTOR(comp_len);
+	io->io_region.sector = sector_start + req->info->data_start;
+
+	dm_icomp_release_decomp_buffer(io);
+
+
+	WARN_ON((io->io_region.sector + io->io_region.count)
+			>= req->info->total_sector);
+	dm_io(&io->io_req, 1, &io->io_region, NULL);
+
+	/* update the meta data bits */
+	dm_icomp_set_extent(req, DMCP_SECTOR_TO_BLOCK(sector_start),
+		DMCP_SECTOR_TO_BLOCK(count), DMCP_BYTES_TO_SECTOR(comp_len));
+
+	page_index = dm_icomp_block_meta_page_index(
+		DMCP_SECTOR_TO_BLOCK(sector_start), false);
+	if (*meta_start > page_index)
+		*meta_start = page_index;
+
+	page_index = dm_icomp_block_meta_page_index(
+			DMCP_SECTOR_TO_BLOCK(sector_start + count), true);
+	if (*meta_end < page_index)
+		*meta_end = page_index;
+	return 0;
+}
+
+/*
+ * modify and write compressed data to the backing storage.
+ * @io : io range
+ * @meta_start: the page index of the bits corresponding to
+ * @meta_end  : start and end blocks.
+ */
+static int dm_icomp_handle_write_modify(struct dm_icomp_io_range *io,
+	u64 *meta_start, u64 *meta_end)
+{
+	struct dm_icomp_req *req = io->req;
+	sector_t bio_start, bio_end, buf_start, buf_end, overlap;
+	off_t bio_off, buf_off;
+	int ret;
+	unsigned int decomp_len;
+
+	io->io_region.sector -= req->info->data_start;
+
+	/* decompress original data */
+	if (dm_icomp_update_io_read_range(io)) {
+		req->result = -EIO;
+		return -EIO;
+	}
+
+	ret = dm_icomp_io_range_decompress(req->info, io, &decomp_len);
+	if (ret < 0) {
+		req->result = -EINVAL;
+		return -EIO;
+	}
+
+	bio_start = req->bio->bi_iter.bi_sector;
+	bio_end = bio_end_sector(req->bio) - 1;
+
+	buf_start = io->io_region.sector;
+	buf_end = buf_start + DMCP_BYTES_TO_SECTOR(decomp_len) - 1;
+
+	/* if no overlap, nothing to do. Just return */
+	if (bio_start >= buf_end || bio_end <= buf_start)
+		return 0;
+
+	bio_off = (buf_start > bio_start) ?  (buf_start - bio_start) : 0;
+	buf_off = (bio_start > buf_start) ?  (bio_start - buf_start) : 0;
+
+	/*
+	 * overlap = sizeof(block1) + sizeof(block2) - sizeof(left_side_shift) -
+	 *		sizeof(right_side_shift)  / 2  +  1
+	 */
+	overlap  =  (((bio_end - bio_start) + (buf_end - buf_start) -
+		abs(buf_end - bio_end) - abs(buf_start - bio_start)) >> 1) + 1;
+
+
+	dm_icomp_bio_copy(req->bio, DMCP_SECTOR_TO_BYTES(bio_off),
+		   io->decomp_data + DMCP_SECTOR_TO_BYTES(buf_off),
+		   DMCP_SECTOR_TO_BYTES(overlap), true);
+
+	return dm_icomp_compress_write(io, io->io_region.sector,
+			meta_start, meta_end);
+}
+
+
+/*
+ * create and write new extents. Each extent is not more than
+ * 256 sectors.
+ * @req : the request
+ * @sec_start: the start sector of the request
+ * @total  : the total sectors
+ * @list  : collect each 256 sector size io request in this list
+ * @meta_start: the page index of the bits corresponding to
+ * @meta_end  : start and end blocks.
+ *
+ */
+static void dm_icomp_handle_write_create(struct dm_icomp_req *req,
+	sector_t sec_start, sector_t total,
+	struct list_head *list, u64 *meta_start, u64 *meta_end)
+{
+	struct dm_icomp_io_range *io;
+	sector_t count, offset = 0;
+	int ret;
+
+	while (total) {
+
+		/* max i/o is 128kbytes i.e 256 sectors */
+		count = min_t(sector_t, total, 256);
+		io = dm_icomp_create_io_write_range(req, offset, count);
+		if (!io) {
+			req->result = -EIO;
+			return;
+		}
+
+		ret = dm_icomp_compress_write(io, sec_start, meta_start,
+			meta_end);
+		if (ret) {
+			dm_icomp_free_io_range(io);
+			return;
+		}
+
+
+		list_add_tail(&io->next, list);
+		total -= count;
+		sec_start += count;
+		offset += count;
+
+	}
+}
+
+/*
+ *  handle the write request.
+ */
+static void dm_icomp_handle_write_comp(struct dm_icomp_req *req)
+{
+	struct dm_icomp_io_range *io;
+	sector_t io_start, req_start, req_end;
+	u64 meta_start = -1L, meta_end = 0;
+	LIST_HEAD(newlist);
+
+	SET_REQ_STAGE(req, STAGE_WRITE_COMP);
+
+	if (req->result)
+		return;
+
+	req_start = req->bio->bi_iter.bi_sector;
+	list_for_each_entry(io, &req->all_io, next) {
+
+		io_start = io->io_region.sector - req->info->data_start;
+
+		if (req_start < io_start) {
+			/* fill the gap */
+			dm_icomp_handle_write_create(req, req_start,
+				(io_start - req_start), &newlist,
+				&meta_start, &meta_end);
+		}
+
+		dm_icomp_handle_write_modify(io, &meta_start, &meta_end);
+
+		req_start = io_start + DMCP_BYTES_TO_SECTOR(io->logical_bytes);
+	}
+
+	req_end =  bio_end_sector(req->bio);
+	if (req_start < req_end) {
+		/* fill the gap */
+		dm_icomp_handle_write_create(req, req_start,
+			 req_end-req_start, &newlist, &meta_start,
+			&meta_end);
+	}
+
+	list_splice_tail(&newlist, &req->all_io);
+
+	if (req->info->write_mode == DMCP_WRITE_THROUGH ||
+				(req->bio->bi_opf & REQ_FUA)) {
+		if (meta_start == -1)
+			return;
+		dm_icomp_get_req(req);
+		dm_icomp_write_meta(req->info, meta_start,
+			meta_end+1, req,
+			dm_icomp_write_meta_done,
+			REQ_OP_WRITE, req->bio->bi_opf);
+	}
+}
+
+/*
+ *  read the data, modify and write it back to the backing store.
+ */
+static void dm_icomp_handle_write_read_existing(struct dm_icomp_req *req)
+{
+	dm_icomp_handle_read_existing(req, true);
+	if (req->result)
+		return;
+
+	if (list_empty(&req->all_io))
+		dm_icomp_handle_write_comp(req);
+}
+
+static void dm_icomp_handle_write_request(struct dm_icomp_req *req)
+{
+	dm_icomp_get_req(req);
+
+	if (GET_REQ_STAGE(req) == STAGE_INIT) {
+		if (!dm_icomp_lock_req_range(req)) {
+			dm_icomp_put_req(req);
+			return;
+		}
+		dm_icomp_handle_write_read_existing(req);
+	} else if (GET_REQ_STAGE(req) == STAGE_READ_EXISTING) {
+		dm_icomp_handle_write_comp(req);
+	}
+
+	dm_icomp_put_req(req);
+}
+
+/* For writeback mode */
+static void dm_icomp_handle_flush_request(struct dm_icomp_req *req)
+{
+	struct writeback_flush_data wb;
+
+	atomic_set(&wb.cnt, 1);
+	init_completion(&wb.complete);
+
+	dm_icomp_flush_dirty_meta(req->info, &wb);
+
+	writeback_flush_io_done(&wb, 0);
+	wait_for_completion(&wb.complete);
+
+	req->bio->bi_error = 0;
+	bio_endio(req->bio);
+	kmem_cache_free(dm_icomp_req_cachep, req);
+}
+
+static void dm_icomp_handle_request(struct dm_icomp_req *req)
+{
+	if (req->bio->bi_opf & REQ_PREFLUSH)
+		dm_icomp_handle_flush_request(req);
+	else if (op_is_write(bio_op(req->bio)))
+		dm_icomp_handle_write_request(req);
+	else
+		dm_icomp_handle_read_request(req);
+}
+
+static void dm_icomp_do_request_work(struct work_struct *work)
+{
+	struct dm_icomp_io_worker *worker = container_of(work,
+				struct dm_icomp_io_worker, work);
+	LIST_HEAD(list);
+	struct dm_icomp_req *req;
+	struct blk_plug plug;
+	bool repeat;
+
+	blk_start_plug(&plug);
+again:
+	spin_lock_irq(&worker->lock);
+	list_splice_init(&worker->pending, &list);
+	spin_unlock_irq(&worker->lock);
+
+	repeat = !list_empty(&list);
+	while (!list_empty(&list)) {
+		req = list_first_entry(&list, struct dm_icomp_req, sibling);
+		list_del(&req->sibling);
+
+		schedule();
+		dm_icomp_handle_request(req);
+	}
+	if (repeat)
+		goto again;
+	blk_finish_plug(&plug);
+}
+
+static bool valid_request(struct bio *bio, struct dm_icomp_info *info)
+{
+	sector_t dev_end	=  info->ti->len;
+	sector_t req_end	=  bio_end_sector(bio) - 1;
+
+	return (req_end <= dev_end);
+}
+
+static int dm_icomp_map(struct dm_target *ti, struct bio *bio)
+{
+	struct dm_icomp_info *info = ti->private;
+	struct dm_icomp_req *req;
+
+	if ((bio->bi_opf & REQ_PREFLUSH) &&
+			info->write_mode == DMCP_WRITE_THROUGH) {
+		bio->bi_bdev = info->dev->bdev;
+		return DM_MAPIO_REMAPPED;
+	}
+
+	if (!(bio->bi_opf & REQ_PREFLUSH) && !valid_request(bio, info)) {
+		req->bio = bio;
+		req->bio->bi_error = -EINVAL;
+		bio_endio(req->bio);
+		return DM_MAPIO_SUBMITTED;
+	}
+
+	req = kmem_cache_alloc(dm_icomp_req_cachep, GFP_NOIO);
+	if (!req)
+		return -EIO;
+
+	req->bio = bio;
+	req->info = info;
+	atomic_set(&req->io_pending, 0);
+	INIT_LIST_HEAD(&req->all_io);
+	req->result = 0;
+	SET_REQ_STAGE(req, STAGE_INIT);
+	req->locked_locks = 0;
+
+	req->cpu = raw_smp_processor_id();
+	dm_icomp_queue_req(info, req);
+
+	return DM_MAPIO_SUBMITTED;
+}
+
+static void dm_icomp_status(struct dm_target *ti, status_type_t type,
+	  unsigned int status_flags, char *result, unsigned int maxlen)
+{
+	struct dm_icomp_info *info = ti->private;
+	unsigned int sz = 0;
+
+	switch (type) {
+	case STATUSTYPE_INFO:
+		DMEMIT("%lu %lu %lu",
+			atomic64_read(&info->uncompressed_write_size),
+			atomic64_read(&info->compressed_write_size),
+			atomic64_read(&info->meta_write_size));
+		break;
+	case STATUSTYPE_TABLE:
+		if (info->write_mode == DMCP_WRITE_BACK)
+			DMEMIT("%s %s %d", info->dev->name, "writeback",
+				info->writeback_delay);
+		else
+			DMEMIT("%s %s", info->dev->name, "writethrough");
+		break;
+	}
+}
+
+static int dm_icomp_iterate_devices(struct dm_target *ti,
+				  iterate_devices_callout_fn fn, void *data)
+{
+	struct dm_icomp_info *info = ti->private;
+
+	return fn(ti, info->dev, info->data_start,
+		DMCP_BLOCK_TO_SECTOR(info->data_blocks), data);
+}
+
+static void dm_icomp_io_hints(struct dm_target *ti,
+			    struct queue_limits *limits)
+{
+	/* No blk_limits_logical_block_size */
+	limits->logical_block_size = limits->physical_block_size =
+	limits->io_min = DMCP_BLOCK_SIZE;
+	limits->max_hw_sectors = DMCP_BYTES_TO_SECTOR(DMCP_MAX_SIZE);
+}
+
+static struct target_type dm_icomp_target = {
+	.name   = "inplacecompress",
+	.version = {1, 0, 0},
+	.module = THIS_MODULE,
+	.ctr    = dm_icomp_ctr,
+	.dtr    = dm_icomp_dtr,
+	.map    = dm_icomp_map,
+	.status = dm_icomp_status,
+	.iterate_devices = dm_icomp_iterate_devices,
+	.io_hints = dm_icomp_io_hints,
+};
+
+static int __init dm_icomp_init(void)
+{
+	int r;
+	int arr_size = ARRAY_SIZE(compressors);
+
+	for (r = 0; r < arr_size; r++)
+		if (crypto_has_comp(compressors[r].name, 0, 0))
+			break;
+	if (r >= arr_size) {
+		DMWARN("No crypto compressors are supported");
+		return -EINVAL;
+	}
+	default_compressor = r;
+	strlcpy(dm_icomp_algorithm, compressors[r].name,
+			sizeof(dm_icomp_algorithm));
+
+	r = -ENOMEM;
+	dm_icomp_req_cachep = kmem_cache_create("dm_icomp_requests",
+		sizeof(struct dm_icomp_req), 0, 0, NULL);
+	if (!dm_icomp_req_cachep) {
+		DMWARN("Can't create request cache");
+		goto err;
+	}
+
+	dm_icomp_io_range_cachep = kmem_cache_create("dm_icomp_io_range",
+		sizeof(struct dm_icomp_io_range), 0, 0, NULL);
+	if (!dm_icomp_io_range_cachep) {
+		DMWARN("Can't create io_range cache");
+		goto err;
+	}
+
+	dm_icomp_meta_io_cachep = kmem_cache_create("dm_icomp_meta_io",
+		sizeof(struct dm_icomp_meta_io), 0, 0, NULL);
+	if (!dm_icomp_meta_io_cachep) {
+		DMWARN("Can't create meta_io cache");
+		goto err;
+	}
+
+	dm_icomp_wq = alloc_workqueue("dm_icomp_io",
+		WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
+	if (!dm_icomp_wq) {
+		DMWARN("Can't create io workqueue");
+		goto err;
+	}
+
+	r = dm_register_target(&dm_icomp_target);
+	if (r < 0) {
+		DMWARN("target registration failed");
+		goto err;
+	}
+
+	for_each_possible_cpu(r) {
+		INIT_LIST_HEAD(&dm_icomp_io_workers[r].pending);
+		spin_lock_init(&dm_icomp_io_workers[r].lock);
+		INIT_WORK(&dm_icomp_io_workers[r].work,
+				dm_icomp_do_request_work);
+	}
+	return 0;
+err:
+	kmem_cache_destroy(dm_icomp_req_cachep);
+	kmem_cache_destroy(dm_icomp_io_range_cachep);
+	kmem_cache_destroy(dm_icomp_meta_io_cachep);
+	if (dm_icomp_wq)
+		destroy_workqueue(dm_icomp_wq);
+
+	return r;
+}
+
+static void __exit dm_icomp_exit(void)
+{
+	dm_unregister_target(&dm_icomp_target);
+	kmem_cache_destroy(dm_icomp_req_cachep);
+	kmem_cache_destroy(dm_icomp_io_range_cachep);
+	kmem_cache_destroy(dm_icomp_meta_io_cachep);
+	destroy_workqueue(dm_icomp_wq);
+}
+
+module_init(dm_icomp_init);
+module_exit(dm_icomp_exit);
+
+MODULE_AUTHOR("Shaohua Li <shli@kernel.org>");
+MODULE_DESCRIPTION(DM_NAME " target with data inplace-compression");
+MODULE_LICENSE("GPL");
diff --git a/drivers/md/dm-inplace-compress.h b/drivers/md/dm-inplace-compress.h
new file mode 100644
index 0000000..ab28e51
--- /dev/null
+++ b/drivers/md/dm-inplace-compress.h
@@ -0,0 +1,185 @@
+#ifndef __DM_INPLACE_COMPRESS_H__
+#define __DM_INPLACE_COMPRESS_H__
+#include <linux/types.h>
+
+#define DMCP_SUPER_MAGIC 0x106526c206506c09
+#define DMCP_COMPRESS_MAGIC 0xfaceecaf
+struct dm_icomp_super_block {
+	__le64 magic;
+	__le64 meta_blocks;
+	__le64 data_blocks;
+	u8 comp_alg;
+} __packed;
+
+#define DMCP_COMP_ALG_LZO 1
+#define DMCP_COMP_ALG_842 0
+
+#ifdef __KERNEL__
+/*
+ * Minium logical size of this target is 4096 byte, which is a block.
+ * Data of a block is compressed. Compressed data is round up to 512B, which is
+ * the payload. For each block, we have 5 bits meta data. bit 0 - 3 stands
+ * payload length(0 - 8 sectors). If compressed payload length is 8 sectors, we
+ * just store uncompressed data. Actual compressed data length is stored at the
+ * last 32 bits of payload if data is compressed. In disk, payload is stored at
+ * the beginning of logical sector of the block. If IO size is bigger than one
+ * block, we store the whole data as an extent. Bit 4 stands tail for an
+ * extent. Max allowed extent size is 128k.
+ */
+#define DMCP_BLOCK_SHIFT	12
+#define DMCP_BLOCK_SIZE		(1 << DMCP_BLOCK_SHIFT)
+#define DMCP_SECTOR_SHIFT	SECTOR_SHIFT
+#define DMCP_SECTOR_SIZE	(1 << SECTOR_SHIFT)
+#define DMCP_BLOCK_SECTOR_SHIFT (DMCP_BLOCK_SHIFT - DMCP_SECTOR_SHIFT)
+#define DMCP_BLOCK_TO_SECTOR(b) ((b) << DMCP_BLOCK_SECTOR_SHIFT)
+#define DMCP_SECTOR_TO_BLOCK(s) ((s) >> DMCP_BLOCK_SECTOR_SHIFT)
+#define DMCP_SECTOR_TO_BYTES(s) ((s) << DMCP_SECTOR_SHIFT)
+#define DMCP_BYTES_TO_SECTOR(b) ((b) >> DMCP_SECTOR_SHIFT)
+#define DMCP_BYTES_TO_BLOCK(b)	((b) >> DMCP_BLOCK_SHIFT)
+
+#define DMCP_MIN_SIZE	DMCP_BLOCK_SIZE
+#define DMCP_MAX_SIZE	(128 * 2 * DMCP_SECTOR_SIZE) /* 128k */
+
+#define DMCP_BITS_PER_ENTRY	32
+#define DMCP_META_BITS		5
+#define DMCP_LENGTH_BITS	4
+#define DMCP_TAIL_MASK		(1 << DMCP_LENGTH_BITS)
+#define DMCP_LENGTH_MASK	(DMCP_TAIL_MASK - 1)
+
+#define DMCP_META_START_SECTOR (DMCP_BLOCK_SIZE >> DMCP_SECTOR_SHIFT)
+
+enum DMCP_WRITE_MODE {
+	DMCP_WRITE_BACK,
+	DMCP_WRITE_THROUGH,
+};
+
+/*
+ * a lock spans 128 blocks i.e 512kbytes.
+ * max I/O is 128K, which can at-most span two locks.
+ */
+#define BITMAP_HASH_SHIFT 7
+#define BITMAP_HASH_LEN (1<<6)
+#define BITMAP_HASH_MASK (BITMAP_HASH_LEN - 1)
+struct dm_icomp_hash_lock {
+	int io_running;
+	spinlock_t wait_lock;
+	struct list_head wait_list;
+};
+
+struct dm_icomp_info {
+	struct dm_target *ti;
+	struct dm_dev *dev;
+
+	int comp_alg;
+	struct crypto_comp *tfm[NR_CPUS];
+
+	sector_t total_sector;	/* total sectors in the backing store */
+	sector_t data_start;
+	u64 data_blocks;
+	u64 no_of_sectors;
+
+	u32 *meta_bitmap;
+	u64 meta_bitmap_bits;
+	u64 meta_bitmap_pages;
+	struct dm_icomp_hash_lock bitmap_locks[BITMAP_HASH_LEN];
+
+	enum DMCP_WRITE_MODE write_mode;
+	unsigned int writeback_delay; /* second */
+	struct task_struct *writeback_tsk;
+	struct dm_io_client *io_client;
+
+	atomic64_t compressed_write_size;
+	atomic64_t uncompressed_write_size;
+	atomic64_t meta_write_size;
+};
+
+struct dm_icomp_meta_io {
+	struct dm_io_request io_req;
+	struct dm_io_region io_region;
+	void *data;
+	void (*fn)(void *data, unsigned long error);
+};
+
+struct dm_icomp_io_range {
+	struct dm_io_request io_req;
+	struct dm_io_region io_region;
+	bool decomp_kmap;	     /* Is the decomp_data kmapped'? */
+	void *decomp_data;
+	void *decomp_real_data;      /* holds the actual start of the buffer */
+	unsigned int decomp_len;     /* actual allocated/mapped length */
+	unsigned int logical_bytes;  /* decompressed size of the extent */
+	bool comp_kmap;		     /* Is the comp_data kmapped'? */
+	void *comp_data;
+	void *comp_real_data;	     /* holds the actual start of the buffer */
+	unsigned int comp_len;	     /* actual allocated/mapped length */
+	unsigned int data_bytes;     /* compressed size of the extent */
+	struct list_head next;
+	struct dm_icomp_req *req;
+};
+
+enum DMCP_REQ_STAGE {
+	STAGE_INIT,
+	STAGE_READ_EXISTING,
+	STAGE_READ_DECOMP,
+	STAGE_WRITE_COMP,
+	STAGE_DONE,
+};
+
+struct dm_icomp_req {
+	struct bio *bio;
+	struct dm_icomp_info *info;
+	struct list_head sibling;
+	struct list_head all_io;
+	atomic_t io_pending;
+	enum DMCP_REQ_STAGE stage;
+	struct dm_icomp_hash_lock *locks[2];
+	int locked_locks;
+	int result;
+	int cpu;
+	struct work_struct work;
+};
+
+struct dm_icomp_io_worker {
+	struct list_head pending;
+	spinlock_t lock;
+	struct work_struct work;
+};
+
+struct dm_icomp_compressor_data {
+	char *name;
+	int (*comp_len)(int comp_len);
+	int (*max_comp_len)(int comp_len);
+};
+
+static inline int lzo_comp_len(int comp_len)
+{
+	/* lzo compression overshoots the comp buffer
+	 * if the buffer size is insufficient.
+	 * Once that bug is fixed we can return half
+	 * the length.
+	 *
+	 * return lzo1x_worst_compress(comp_len) >> 1;
+	 *
+	 * For now its the full length.
+	 */
+	return lzo1x_worst_compress(comp_len);
+}
+
+static inline int lzo_max_comp_len(int comp_len)
+{
+	return lzo1x_worst_compress(comp_len);
+}
+
+static inline int nx842_comp_len(int comp_len)
+{
+	return (comp_len >> 1);
+}
+
+static inline int nx842_max_comp_len(int comp_len)
+{
+	return comp_len;
+}
+
+#endif /* __KERNEL__ */
+
+#endif /* __DM_INPLACE_COMPRESS_H__ */
-- 
1.8.3.1


^ permalink raw reply related

* [RFC PATCH v3] IV Generation algorithms for dm-crypt
From: Binoy Jayan @ 2017-01-18  9:40 UTC (permalink / raw)
  To: Oded, Ofir
  Cc: Herbert Xu, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, linux-kernel, Alasdair Kergon, Mike Snitzer,
	dm-devel, Shaohua Li, linux-raid, Rajendra, Milan Broz, Gilad,
	Binoy Jayan

===============================================================================
GENIV Template cipher
===============================================================================

Currently, the iv generation algorithms are implemented in dm-crypt.c. The goal
is to move these algorithms from the dm layer to the kernel crypto layer by
implementing them as template ciphers so they can be used in relation with
algorithms like aes, and with multiple modes like cbc, ecb etc. As part of this
patchset, the iv-generation code is moved from the dm layer to the crypto layer
and adapt the dm-layer to send a whole 'bio' (as defined in the block layer)
at a time. Each bio contains the in memory representation of physically
contiguous disk blocks. Since the bio itself may not be contiguous in main
memory, the dm layer sets up a chained scatterlist of these blocks split into
physically contiguous segments in memory so that DMA can be performed.

One challenge in doing so is that the IVs are generated based on a 512-byte
sector number. This infact limits the block sizes to 512 bytes. But this should
not be a problem if a hardware with iv generation support is used. The geniv
itself splits the segments into sectors so it could choose the IV based on
sector number. But it could be modelled in hardware effectively by not
splitting up the segments in the bio.

Another challenge faced is that dm-crypt has an option to use multiple keys.
The key selection is done based on the sector number. If the whole bio is
encrypted / decrypted with the same key, the encrypted volumes will not be
compatible with the original dm-crypt [without the changes]. So, the key
selection code is moved to crypto layer so the neighboring sectors are
encrypted with a different key.

The dm layer allocates space for iv. The hardware drivers can choose to make
use of this space to generate their IVs sequentially or allocate it on their
own. This can be moved to crypto layer too. Postponing this decision until
the requirement to integrate milan's changes are clear.

Interface to the crypto layer - include/crypto/geniv.h

Revisions:

v1: https://patchwork.kernel.org/patch/9439175
v2: https://patchwork.kernel.org/patch/9471923

v2 --> v3
----------

1. Moved iv algorithms in dm-crypt.c for control
2. Key management code moved from dm layer to cryto layer
   so that cipher instance selection can be made depending on key_index
3. The revision v2 had scatterlist nodes created for every sector in the bio.
   It is modified to create only once scatterlist node to reduce memory
   foot print. Synchronous requests are processed sequentially. Asynchronous
   requests are processed in parallel and is freed in the async callback.
4. Changed allocation for sub-requests using mempool

v1 --> v2
----------

1. dm-crypt changes to process larger block sizes (one segment in a bio)
2. Incorporated changes w.r.t. comments from Herbert.

Binoy Jayan (1):
  crypto: Add IV generation algorithms

 drivers/md/dm-crypt.c  | 1891 ++++++++++++++++++++++++++++++++++--------------
 include/crypto/geniv.h |   47 ++
 2 files changed, 1399 insertions(+), 539 deletions(-)
 create mode 100644 include/crypto/geniv.h

-- 
Binoy Jayan

^ permalink raw reply

* [RFC PATCH v3] crypto: Add IV generation algorithms
From: Binoy Jayan @ 2017-01-18  9:40 UTC (permalink / raw)
  To: Oded, Ofir
  Cc: Herbert Xu, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, linux-kernel, Alasdair Kergon, Mike Snitzer,
	dm-devel, Shaohua Li, linux-raid, Rajendra, Milan Broz, Gilad,
	Binoy Jayan
In-Reply-To: <1484732425-10319-1-git-send-email-binoy.jayan@linaro.org>

Currently, the iv generation algorithms are implemented in dm-crypt.c.
The goal is to move these algorithms from the dm layer to the kernel
crypto layer by implementing them as template ciphers so they can be
implemented in hardware for performance. As part of this patchset, the
iv-generation code is moved from the dm layer to the crypto layer and
adapt the dm-layer to send a whole 'bio' (as defined in the block layer)
at a time. Each bio contains an in memory representation of physically
contiguous disk blocks. The dm layer sets up a chained scatterlist of
these blocks split into physically contiguous segments in memory so that
DMA can be performed. Also, the key management code is moved from dm layer
to the cryto layer since the key selection for encrypting neighboring
sectors depend on the keycount.

Synchronous crypto requests to encrypt/decrypt a sector are processed
sequentially. Asynchronous requests if processed in parallel, are freed
in the async callback. The dm layer allocates space for iv. The hardware
implementations can choose to make use of this space to generate their IVs
sequentially or allocate it on their own.
Interface to the crypto layer - include/crypto/geniv.h

Signed-off-by: Binoy Jayan <binoy.jayan@linaro.org>
---
 drivers/md/dm-crypt.c  | 1891 ++++++++++++++++++++++++++++++++++--------------
 include/crypto/geniv.h |   47 ++
 2 files changed, 1399 insertions(+), 539 deletions(-)
 create mode 100644 include/crypto/geniv.h

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 7c6c572..7275b0f 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -32,170 +32,113 @@
 #include <crypto/algapi.h>
 #include <crypto/skcipher.h>
 #include <keys/user-type.h>
-
 #include <linux/device-mapper.h>
-
-#define DM_MSG_PREFIX "crypt"
-
-/*
- * context holding the current state of a multi-part conversion
- */
-struct convert_context {
-	struct completion restart;
-	struct bio *bio_in;
-	struct bio *bio_out;
-	struct bvec_iter iter_in;
-	struct bvec_iter iter_out;
-	sector_t cc_sector;
-	atomic_t cc_pending;
-	struct skcipher_request *req;
+#include <crypto/internal/skcipher.h>
+#include <linux/backing-dev.h>
+#include <linux/log2.h>
+#include <crypto/geniv.h>
+
+#define DM_MSG_PREFIX		"crypt"
+#define MAX_SG_LIST		(BIO_MAX_PAGES * 8)
+#define MIN_IOS			64
+#define LMK_SEED_SIZE		64 /* hash + 0 */
+#define TCW_WHITENING_SIZE	16
+
+struct geniv_ctx;
+struct geniv_req_ctx;
+
+/* Sub request for each of the skcipher_request's for a segment */
+struct geniv_subreq {
+	struct skcipher_request req CRYPTO_MINALIGN_ATTR;
+	struct scatterlist src;
+	struct scatterlist dst;
+	int n;
+	struct geniv_req_ctx *rctx;
 };
 
-/*
- * per bio private data
- */
-struct dm_crypt_io {
-	struct crypt_config *cc;
-	struct bio *base_bio;
-	struct work_struct work;
-
-	struct convert_context ctx;
-
-	atomic_t io_pending;
-	int error;
-	sector_t sector;
-
-	struct rb_node rb_node;
-} CRYPTO_MINALIGN_ATTR;
-
-struct dm_crypt_request {
-	struct convert_context *ctx;
-	struct scatterlist sg_in;
-	struct scatterlist sg_out;
+struct geniv_req_ctx {
+	struct geniv_subreq *subreq;
+	bool is_write;
 	sector_t iv_sector;
+	unsigned int nents;
+	u8 *iv;
+	struct completion restart;
+	atomic_t req_pending;
+	struct skcipher_request *req;
 };
 
-struct crypt_config;
-
 struct crypt_iv_operations {
-	int (*ctr)(struct crypt_config *cc, struct dm_target *ti,
-		   const char *opts);
-	void (*dtr)(struct crypt_config *cc);
-	int (*init)(struct crypt_config *cc);
-	int (*wipe)(struct crypt_config *cc);
-	int (*generator)(struct crypt_config *cc, u8 *iv,
-			 struct dm_crypt_request *dmreq);
-	int (*post)(struct crypt_config *cc, u8 *iv,
-		    struct dm_crypt_request *dmreq);
+	int (*ctr)(struct geniv_ctx *ctx);
+	void (*dtr)(struct geniv_ctx *ctx);
+	int (*init)(struct geniv_ctx *ctx);
+	int (*wipe)(struct geniv_ctx *ctx);
+	int (*generator)(struct geniv_ctx *ctx,
+			 struct geniv_req_ctx *rctx,
+			 struct geniv_subreq *subreq);
+	int (*post)(struct geniv_ctx *ctx,
+		    struct geniv_req_ctx *rctx,
+		    struct geniv_subreq *subreq);
 };
 
-struct iv_essiv_private {
+struct geniv_essiv_private {
 	struct crypto_ahash *hash_tfm;
 	u8 *salt;
 };
 
-struct iv_benbi_private {
+struct geniv_benbi_private {
 	int shift;
 };
 
-#define LMK_SEED_SIZE 64 /* hash + 0 */
-struct iv_lmk_private {
+struct geniv_lmk_private {
 	struct crypto_shash *hash_tfm;
 	u8 *seed;
 };
 
-#define TCW_WHITENING_SIZE 16
-struct iv_tcw_private {
+struct geniv_tcw_private {
 	struct crypto_shash *crc32_tfm;
 	u8 *iv_seed;
 	u8 *whitening;
 };
 
-/*
- * Crypt: maps a linear range of a block device
- * and encrypts / decrypts at the same time.
- */
-enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID,
-	     DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD };
-
-/*
- * The fields in here must be read only after initialization.
- */
-struct crypt_config {
-	struct dm_dev *dev;
-	sector_t start;
-
-	/*
-	 * pool for per bio private data, crypto requests and
-	 * encryption requeusts/buffer pages
-	 */
-	mempool_t *req_pool;
-	mempool_t *page_pool;
-	struct bio_set *bs;
-	struct mutex bio_alloc_lock;
-
-	struct workqueue_struct *io_queue;
-	struct workqueue_struct *crypt_queue;
-
-	struct task_struct *write_thread;
-	wait_queue_head_t write_thread_wait;
-	struct rb_root write_tree;
-
+struct geniv_ctx {
+	unsigned int tfms_count;
+	struct crypto_skcipher *child;
+	struct crypto_skcipher **tfms;
+	char *ivmode;
+	unsigned int iv_size;
+	char *ivopts;
 	char *cipher;
-	char *cipher_string;
-	char *key_string;
-
+	char *ciphermode;
 	const struct crypt_iv_operations *iv_gen_ops;
 	union {
-		struct iv_essiv_private essiv;
-		struct iv_benbi_private benbi;
-		struct iv_lmk_private lmk;
-		struct iv_tcw_private tcw;
+		struct geniv_essiv_private essiv;
+		struct geniv_benbi_private benbi;
+		struct geniv_lmk_private lmk;
+		struct geniv_tcw_private tcw;
 	} iv_gen_private;
-	sector_t iv_offset;
-	unsigned int iv_size;
-
-	/* ESSIV: struct crypto_cipher *essiv_tfm */
 	void *iv_private;
-	struct crypto_skcipher **tfms;
-	unsigned tfms_count;
-
-	/*
-	 * Layout of each crypto request:
-	 *
-	 *   struct skcipher_request
-	 *      context
-	 *      padding
-	 *   struct dm_crypt_request
-	 *      padding
-	 *   IV
-	 *
-	 * The padding is added so that dm_crypt_request and the IV are
-	 * correctly aligned.
-	 */
-	unsigned int dmreq_start;
-
-	unsigned int per_bio_data_size;
-
-	unsigned long flags;
+	struct crypto_skcipher *tfm;
+	mempool_t *subreq_pool;
 	unsigned int key_size;
+	unsigned int key_extra_size;
 	unsigned int key_parts;      /* independent parts in key buffer */
-	unsigned int key_extra_size; /* additional keys length */
-	u8 key[0];
+	enum setkey_op keyop;
+	char *msg;
+	u8 *key;
 };
 
-#define MIN_IOS        64
-
-static void clone_init(struct dm_crypt_io *, struct bio *);
-static void kcryptd_queue_crypt(struct dm_crypt_io *io);
-static u8 *iv_of_dmreq(struct crypt_config *cc, struct dm_crypt_request *dmreq);
+static struct crypto_skcipher *any_tfm(struct geniv_ctx *ctx)
+{
+	return ctx->tfms[0];
+}
 
-/*
- * Use this to access cipher attributes that are the same for each CPU.
- */
-static struct crypto_skcipher *any_tfm(struct crypt_config *cc)
+static inline
+struct geniv_req_ctx *geniv_req_ctx(struct skcipher_request *req)
 {
-	return cc->tfms[0];
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	unsigned long align = crypto_skcipher_alignmask(tfm);
+
+	return (void *) PTR_ALIGN((u8 *) skcipher_request_ctx(req), align + 1);
 }
 
 /*
@@ -245,44 +188,50 @@ static struct crypto_skcipher *any_tfm(struct crypt_config *cc)
  * http://article.gmane.org/gmane.linux.kernel.device-mapper.dm-crypt/454
  */
 
-static int crypt_iv_plain_gen(struct crypt_config *cc, u8 *iv,
-			      struct dm_crypt_request *dmreq)
+static int crypt_iv_plain_gen(struct geniv_ctx *ctx,
+			      struct geniv_req_ctx *rctx,
+			      struct geniv_subreq *subreq)
 {
-	memset(iv, 0, cc->iv_size);
-	*(__le32 *)iv = cpu_to_le32(dmreq->iv_sector & 0xffffffff);
+	u8 *iv = rctx->iv;
+
+	memset(iv, 0, ctx->iv_size);
+	*(__le32 *)iv = cpu_to_le32(rctx->iv_sector & 0xffffffff);
 
 	return 0;
 }
 
-static int crypt_iv_plain64_gen(struct crypt_config *cc, u8 *iv,
-				struct dm_crypt_request *dmreq)
+static int crypt_iv_plain64_gen(struct geniv_ctx *ctx,
+				struct geniv_req_ctx *rctx,
+				struct geniv_subreq *subreq)
 {
-	memset(iv, 0, cc->iv_size);
-	*(__le64 *)iv = cpu_to_le64(dmreq->iv_sector);
+	u8 *iv = rctx->iv;
+
+	memset(iv, 0, ctx->iv_size);
+	*(__le64 *)iv = cpu_to_le64(rctx->iv_sector);
 
 	return 0;
 }
 
 /* Initialise ESSIV - compute salt but no local memory allocations */
-static int crypt_iv_essiv_init(struct crypt_config *cc)
+static int crypt_iv_essiv_init(struct geniv_ctx *ctx)
 {
-	struct iv_essiv_private *essiv = &cc->iv_gen_private.essiv;
-	AHASH_REQUEST_ON_STACK(req, essiv->hash_tfm);
+	struct geniv_essiv_private *essiv = &ctx->iv_gen_private.essiv;
 	struct scatterlist sg;
 	struct crypto_cipher *essiv_tfm;
 	int err;
+	AHASH_REQUEST_ON_STACK(req, essiv->hash_tfm);
 
-	sg_init_one(&sg, cc->key, cc->key_size);
+	sg_init_one(&sg, ctx->key, ctx->key_size);
 	ahash_request_set_tfm(req, essiv->hash_tfm);
 	ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP, NULL, NULL);
-	ahash_request_set_crypt(req, &sg, essiv->salt, cc->key_size);
+	ahash_request_set_crypt(req, &sg, essiv->salt, ctx->key_size);
 
 	err = crypto_ahash_digest(req);
 	ahash_request_zero(req);
 	if (err)
 		return err;
 
-	essiv_tfm = cc->iv_private;
+	essiv_tfm = ctx->iv_private;
 
 	err = crypto_cipher_setkey(essiv_tfm, essiv->salt,
 			    crypto_ahash_digestsize(essiv->hash_tfm));
@@ -293,16 +242,16 @@ static int crypt_iv_essiv_init(struct crypt_config *cc)
 }
 
 /* Wipe salt and reset key derived from volume key */
-static int crypt_iv_essiv_wipe(struct crypt_config *cc)
+static int crypt_iv_essiv_wipe(struct geniv_ctx *ctx)
 {
-	struct iv_essiv_private *essiv = &cc->iv_gen_private.essiv;
-	unsigned salt_size = crypto_ahash_digestsize(essiv->hash_tfm);
+	struct geniv_essiv_private *essiv = &ctx->iv_gen_private.essiv;
+	unsigned int salt_size = crypto_ahash_digestsize(essiv->hash_tfm);
 	struct crypto_cipher *essiv_tfm;
 	int r, err = 0;
 
 	memset(essiv->salt, 0, salt_size);
 
-	essiv_tfm = cc->iv_private;
+	essiv_tfm = ctx->iv_private;
 	r = crypto_cipher_setkey(essiv_tfm, essiv->salt, salt_size);
 	if (r)
 		err = r;
@@ -311,42 +260,40 @@ static int crypt_iv_essiv_wipe(struct crypt_config *cc)
 }
 
 /* Set up per cpu cipher state */
-static struct crypto_cipher *setup_essiv_cpu(struct crypt_config *cc,
-					     struct dm_target *ti,
-					     u8 *salt, unsigned saltsize)
+static struct crypto_cipher *setup_essiv_cpu(struct geniv_ctx *ctx,
+					     u8 *salt, unsigned int saltsize)
 {
 	struct crypto_cipher *essiv_tfm;
 	int err;
 
 	/* Setup the essiv_tfm with the given salt */
-	essiv_tfm = crypto_alloc_cipher(cc->cipher, 0, CRYPTO_ALG_ASYNC);
+	essiv_tfm = crypto_alloc_cipher(ctx->cipher, 0, CRYPTO_ALG_ASYNC);
+
 	if (IS_ERR(essiv_tfm)) {
-		ti->error = "Error allocating crypto tfm for ESSIV";
+		DMERR("Error allocating crypto tfm for ESSIV\n");
 		return essiv_tfm;
 	}
 
 	if (crypto_cipher_blocksize(essiv_tfm) !=
-	    crypto_skcipher_ivsize(any_tfm(cc))) {
-		ti->error = "Block size of ESSIV cipher does "
-			    "not match IV size of block cipher";
+	    crypto_skcipher_ivsize(any_tfm(ctx))) {
+		DMERR("Block size of ESSIV cipher does not match IV size of block cipher\n");
 		crypto_free_cipher(essiv_tfm);
 		return ERR_PTR(-EINVAL);
 	}
 
 	err = crypto_cipher_setkey(essiv_tfm, salt, saltsize);
 	if (err) {
-		ti->error = "Failed to set key for ESSIV cipher";
+		DMERR("Failed to set key for ESSIV cipher\n");
 		crypto_free_cipher(essiv_tfm);
 		return ERR_PTR(err);
 	}
-
 	return essiv_tfm;
 }
 
-static void crypt_iv_essiv_dtr(struct crypt_config *cc)
+static void crypt_iv_essiv_dtr(struct geniv_ctx *ctx)
 {
 	struct crypto_cipher *essiv_tfm;
-	struct iv_essiv_private *essiv = &cc->iv_gen_private.essiv;
+	struct geniv_essiv_private *essiv = &ctx->iv_gen_private.essiv;
 
 	crypto_free_ahash(essiv->hash_tfm);
 	essiv->hash_tfm = NULL;
@@ -354,52 +301,50 @@ static void crypt_iv_essiv_dtr(struct crypt_config *cc)
 	kzfree(essiv->salt);
 	essiv->salt = NULL;
 
-	essiv_tfm = cc->iv_private;
+	essiv_tfm = ctx->iv_private;
 
 	if (essiv_tfm)
 		crypto_free_cipher(essiv_tfm);
 
-	cc->iv_private = NULL;
+	ctx->iv_private = NULL;
 }
 
-static int crypt_iv_essiv_ctr(struct crypt_config *cc, struct dm_target *ti,
-			      const char *opts)
+static int crypt_iv_essiv_ctr(struct geniv_ctx *ctx)
 {
 	struct crypto_cipher *essiv_tfm = NULL;
 	struct crypto_ahash *hash_tfm = NULL;
 	u8 *salt = NULL;
 	int err;
 
-	if (!opts) {
-		ti->error = "Digest algorithm missing for ESSIV mode";
+	if (!ctx->ivopts) {
+		DMERR("Digest algorithm missing for ESSIV mode\n");
 		return -EINVAL;
 	}
 
 	/* Allocate hash algorithm */
-	hash_tfm = crypto_alloc_ahash(opts, 0, CRYPTO_ALG_ASYNC);
+	hash_tfm = crypto_alloc_ahash(ctx->ivopts, 0, CRYPTO_ALG_ASYNC);
 	if (IS_ERR(hash_tfm)) {
-		ti->error = "Error initializing ESSIV hash";
 		err = PTR_ERR(hash_tfm);
+		DMERR("Error initializing ESSIV hash. err=%d\n", err);
 		goto bad;
 	}
 
 	salt = kzalloc(crypto_ahash_digestsize(hash_tfm), GFP_KERNEL);
 	if (!salt) {
-		ti->error = "Error kmallocing salt storage in ESSIV";
 		err = -ENOMEM;
 		goto bad;
 	}
 
-	cc->iv_gen_private.essiv.salt = salt;
-	cc->iv_gen_private.essiv.hash_tfm = hash_tfm;
+	ctx->iv_gen_private.essiv.salt = salt;
+	ctx->iv_gen_private.essiv.hash_tfm = hash_tfm;
 
-	essiv_tfm = setup_essiv_cpu(cc, ti, salt,
+	essiv_tfm = setup_essiv_cpu(ctx, salt,
 				crypto_ahash_digestsize(hash_tfm));
 	if (IS_ERR(essiv_tfm)) {
-		crypt_iv_essiv_dtr(cc);
+		crypt_iv_essiv_dtr(ctx);
 		return PTR_ERR(essiv_tfm);
 	}
-	cc->iv_private = essiv_tfm;
+	ctx->iv_private = essiv_tfm;
 
 	return 0;
 
@@ -410,70 +355,73 @@ static int crypt_iv_essiv_ctr(struct crypt_config *cc, struct dm_target *ti,
 	return err;
 }
 
-static int crypt_iv_essiv_gen(struct crypt_config *cc, u8 *iv,
-			      struct dm_crypt_request *dmreq)
+static int crypt_iv_essiv_gen(struct geniv_ctx *ctx,
+			      struct geniv_req_ctx *rctx,
+			      struct geniv_subreq *subreq)
 {
-	struct crypto_cipher *essiv_tfm = cc->iv_private;
+	u8 *iv = rctx->iv;
+	struct crypto_cipher *essiv_tfm = ctx->iv_private;
 
-	memset(iv, 0, cc->iv_size);
-	*(__le64 *)iv = cpu_to_le64(dmreq->iv_sector);
+	memset(iv, 0, ctx->iv_size);
+	*(__le64 *)iv = cpu_to_le64(rctx->iv_sector);
 	crypto_cipher_encrypt_one(essiv_tfm, iv, iv);
 
 	return 0;
 }
 
-static int crypt_iv_benbi_ctr(struct crypt_config *cc, struct dm_target *ti,
-			      const char *opts)
+static int crypt_iv_benbi_ctr(struct geniv_ctx *ctx)
 {
-	unsigned bs = crypto_skcipher_blocksize(any_tfm(cc));
+	unsigned int bs = crypto_skcipher_blocksize(any_tfm(ctx));
 	int log = ilog2(bs);
 
 	/* we need to calculate how far we must shift the sector count
-	 * to get the cipher block count, we use this shift in _gen */
+	 * to get the cipher block count, we use this shift in _gen
+	 */
 
 	if (1 << log != bs) {
-		ti->error = "cypher blocksize is not a power of 2";
+		DMERR("cypher blocksize is not a power of 2\n");
 		return -EINVAL;
 	}
 
 	if (log > 9) {
-		ti->error = "cypher blocksize is > 512";
+		DMERR("cypher blocksize is > 512\n");
 		return -EINVAL;
 	}
 
-	cc->iv_gen_private.benbi.shift = 9 - log;
+	ctx->iv_gen_private.benbi.shift = 9 - log;
 
 	return 0;
 }
 
-static void crypt_iv_benbi_dtr(struct crypt_config *cc)
-{
-}
-
-static int crypt_iv_benbi_gen(struct crypt_config *cc, u8 *iv,
-			      struct dm_crypt_request *dmreq)
+static int crypt_iv_benbi_gen(struct geniv_ctx *ctx,
+			      struct geniv_req_ctx *rctx,
+			      struct geniv_subreq *subreq)
 {
+	u8 *iv = rctx->iv;
 	__be64 val;
 
-	memset(iv, 0, cc->iv_size - sizeof(u64)); /* rest is cleared below */
+	memset(iv, 0, ctx->iv_size - sizeof(u64)); /* rest is cleared below */
 
-	val = cpu_to_be64(((u64)dmreq->iv_sector << cc->iv_gen_private.benbi.shift) + 1);
-	put_unaligned(val, (__be64 *)(iv + cc->iv_size - sizeof(u64)));
+	val = cpu_to_be64(((u64) rctx->iv_sector <<
+			  ctx->iv_gen_private.benbi.shift) + 1);
+	put_unaligned(val, (__be64 *)(iv + ctx->iv_size - sizeof(u64)));
 
 	return 0;
 }
 
-static int crypt_iv_null_gen(struct crypt_config *cc, u8 *iv,
-			     struct dm_crypt_request *dmreq)
+static int crypt_iv_null_gen(struct geniv_ctx *ctx,
+			     struct geniv_req_ctx *rctx,
+			     struct geniv_subreq *subreq)
 {
-	memset(iv, 0, cc->iv_size);
+	u8 *iv = rctx->iv;
 
+	memset(iv, 0, ctx->iv_size);
 	return 0;
 }
 
-static void crypt_iv_lmk_dtr(struct crypt_config *cc)
+static void crypt_iv_lmk_dtr(struct geniv_ctx *ctx)
 {
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
+	struct geniv_lmk_private *lmk = &ctx->iv_gen_private.lmk;
 
 	if (lmk->hash_tfm && !IS_ERR(lmk->hash_tfm))
 		crypto_free_shash(lmk->hash_tfm);
@@ -483,49 +431,49 @@ static void crypt_iv_lmk_dtr(struct crypt_config *cc)
 	lmk->seed = NULL;
 }
 
-static int crypt_iv_lmk_ctr(struct crypt_config *cc, struct dm_target *ti,
-			    const char *opts)
+static int crypt_iv_lmk_ctr(struct geniv_ctx *ctx)
 {
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
+	struct geniv_lmk_private *lmk = &ctx->iv_gen_private.lmk;
 
 	lmk->hash_tfm = crypto_alloc_shash("md5", 0, 0);
 	if (IS_ERR(lmk->hash_tfm)) {
-		ti->error = "Error initializing LMK hash";
+		DMERR("Error initializing LMK hash; err=%ld\n",
+		      PTR_ERR(lmk->hash_tfm));
 		return PTR_ERR(lmk->hash_tfm);
 	}
 
 	/* No seed in LMK version 2 */
-	if (cc->key_parts == cc->tfms_count) {
+	if (ctx->key_parts == ctx->tfms_count) {
 		lmk->seed = NULL;
 		return 0;
 	}
 
 	lmk->seed = kzalloc(LMK_SEED_SIZE, GFP_KERNEL);
 	if (!lmk->seed) {
-		crypt_iv_lmk_dtr(cc);
-		ti->error = "Error kmallocing seed storage in LMK";
+		crypt_iv_lmk_dtr(ctx);
+		DMERR("Error kmallocing seed storage in LMK\n");
 		return -ENOMEM;
 	}
 
 	return 0;
 }
 
-static int crypt_iv_lmk_init(struct crypt_config *cc)
+static int crypt_iv_lmk_init(struct geniv_ctx *ctx)
 {
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
-	int subkey_size = cc->key_size / cc->key_parts;
+	struct geniv_lmk_private *lmk = &ctx->iv_gen_private.lmk;
+	int subkey_size = ctx->key_size / ctx->key_parts;
 
 	/* LMK seed is on the position of LMK_KEYS + 1 key */
 	if (lmk->seed)
-		memcpy(lmk->seed, cc->key + (cc->tfms_count * subkey_size),
+		memcpy(lmk->seed, ctx->key + (ctx->tfms_count * subkey_size),
 		       crypto_shash_digestsize(lmk->hash_tfm));
 
 	return 0;
 }
 
-static int crypt_iv_lmk_wipe(struct crypt_config *cc)
+static int crypt_iv_lmk_wipe(struct geniv_ctx *ctx)
 {
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
+	struct geniv_lmk_private *lmk = &ctx->iv_gen_private.lmk;
 
 	if (lmk->seed)
 		memset(lmk->seed, 0, LMK_SEED_SIZE);
@@ -533,15 +481,14 @@ static int crypt_iv_lmk_wipe(struct crypt_config *cc)
 	return 0;
 }
 
-static int crypt_iv_lmk_one(struct crypt_config *cc, u8 *iv,
-			    struct dm_crypt_request *dmreq,
-			    u8 *data)
+static int crypt_iv_lmk_one(struct geniv_ctx *ctx, u8 *iv,
+			    struct geniv_req_ctx *rctx, u8 *data)
 {
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
-	SHASH_DESC_ON_STACK(desc, lmk->hash_tfm);
+	struct geniv_lmk_private *lmk = &ctx->iv_gen_private.lmk;
 	struct md5_state md5state;
 	__le32 buf[4];
 	int i, r;
+	SHASH_DESC_ON_STACK(desc, lmk->hash_tfm);
 
 	desc->tfm = lmk->hash_tfm;
 	desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP;
@@ -562,8 +509,9 @@ static int crypt_iv_lmk_one(struct crypt_config *cc, u8 *iv,
 		return r;
 
 	/* Sector is cropped to 56 bits here */
-	buf[0] = cpu_to_le32(dmreq->iv_sector & 0xFFFFFFFF);
-	buf[1] = cpu_to_le32((((u64)dmreq->iv_sector >> 32) & 0x00FFFFFF) | 0x80000000);
+	buf[0] = cpu_to_le32(rctx->iv_sector & 0xFFFFFFFF);
+	buf[1] = cpu_to_le32((((u64)rctx->iv_sector >> 32) & 0x00FFFFFF)
+			     | 0x80000000);
 	buf[2] = cpu_to_le32(4024);
 	buf[3] = 0;
 	r = crypto_shash_update(desc, (u8 *)buf, sizeof(buf));
@@ -577,50 +525,54 @@ static int crypt_iv_lmk_one(struct crypt_config *cc, u8 *iv,
 
 	for (i = 0; i < MD5_HASH_WORDS; i++)
 		__cpu_to_le32s(&md5state.hash[i]);
-	memcpy(iv, &md5state.hash, cc->iv_size);
+	memcpy(iv, &md5state.hash, ctx->iv_size);
 
 	return 0;
 }
 
-static int crypt_iv_lmk_gen(struct crypt_config *cc, u8 *iv,
-			    struct dm_crypt_request *dmreq)
+static int crypt_iv_lmk_gen(struct geniv_ctx *ctx,
+			    struct geniv_req_ctx *rctx,
+			    struct geniv_subreq *subreq)
 {
 	u8 *src;
+	u8 *iv = rctx->iv;
 	int r = 0;
 
-	if (bio_data_dir(dmreq->ctx->bio_in) == WRITE) {
-		src = kmap_atomic(sg_page(&dmreq->sg_in));
-		r = crypt_iv_lmk_one(cc, iv, dmreq, src + dmreq->sg_in.offset);
+	if (rctx->is_write) {
+		src = kmap_atomic(sg_page(&subreq->src));
+		r = crypt_iv_lmk_one(ctx, iv, rctx, src + subreq->src.offset);
 		kunmap_atomic(src);
 	} else
-		memset(iv, 0, cc->iv_size);
+		memset(iv, 0, ctx->iv_size);
 
 	return r;
 }
 
-static int crypt_iv_lmk_post(struct crypt_config *cc, u8 *iv,
-			     struct dm_crypt_request *dmreq)
+static int crypt_iv_lmk_post(struct geniv_ctx *ctx,
+			     struct geniv_req_ctx *rctx,
+			     struct geniv_subreq *subreq)
 {
 	u8 *dst;
+	u8 *iv = rctx->iv;
 	int r;
 
-	if (bio_data_dir(dmreq->ctx->bio_in) == WRITE)
+	if (rctx->is_write)
 		return 0;
 
-	dst = kmap_atomic(sg_page(&dmreq->sg_out));
-	r = crypt_iv_lmk_one(cc, iv, dmreq, dst + dmreq->sg_out.offset);
+	dst = kmap_atomic(sg_page(&subreq->dst));
+	r = crypt_iv_lmk_one(ctx, iv, rctx, dst + subreq->dst.offset);
 
 	/* Tweak the first block of plaintext sector */
 	if (!r)
-		crypto_xor(dst + dmreq->sg_out.offset, iv, cc->iv_size);
+		crypto_xor(dst + subreq->dst.offset, iv, ctx->iv_size);
 
 	kunmap_atomic(dst);
 	return r;
 }
 
-static void crypt_iv_tcw_dtr(struct crypt_config *cc)
+static void crypt_iv_tcw_dtr(struct geniv_ctx *ctx)
 {
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
 
 	kzfree(tcw->iv_seed);
 	tcw->iv_seed = NULL;
@@ -632,64 +584,65 @@ static void crypt_iv_tcw_dtr(struct crypt_config *cc)
 	tcw->crc32_tfm = NULL;
 }
 
-static int crypt_iv_tcw_ctr(struct crypt_config *cc, struct dm_target *ti,
-			    const char *opts)
+static int crypt_iv_tcw_ctr(struct geniv_ctx *ctx)
 {
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
 
-	if (cc->key_size <= (cc->iv_size + TCW_WHITENING_SIZE)) {
-		ti->error = "Wrong key size for TCW";
+	if (ctx->key_size <= (ctx->iv_size + TCW_WHITENING_SIZE)) {
+		DMERR("Wrong key size (%d) for TCW. Choose a value > %d bytes\n",
+			ctx->key_size,
+			ctx->iv_size + TCW_WHITENING_SIZE);
 		return -EINVAL;
 	}
 
 	tcw->crc32_tfm = crypto_alloc_shash("crc32", 0, 0);
 	if (IS_ERR(tcw->crc32_tfm)) {
-		ti->error = "Error initializing CRC32 in TCW";
+		DMERR("Error initializing CRC32 in TCW; err=%ld\n",
+			PTR_ERR(tcw->crc32_tfm));
 		return PTR_ERR(tcw->crc32_tfm);
 	}
 
-	tcw->iv_seed = kzalloc(cc->iv_size, GFP_KERNEL);
+	tcw->iv_seed = kzalloc(ctx->iv_size, GFP_KERNEL);
 	tcw->whitening = kzalloc(TCW_WHITENING_SIZE, GFP_KERNEL);
 	if (!tcw->iv_seed || !tcw->whitening) {
-		crypt_iv_tcw_dtr(cc);
-		ti->error = "Error allocating seed storage in TCW";
+		crypt_iv_tcw_dtr(ctx);
+		DMERR("Error allocating seed storage in TCW\n");
 		return -ENOMEM;
 	}
 
 	return 0;
 }
 
-static int crypt_iv_tcw_init(struct crypt_config *cc)
+static int crypt_iv_tcw_init(struct geniv_ctx *ctx)
 {
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
-	int key_offset = cc->key_size - cc->iv_size - TCW_WHITENING_SIZE;
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
+	int key_offset = ctx->key_size - ctx->iv_size - TCW_WHITENING_SIZE;
 
-	memcpy(tcw->iv_seed, &cc->key[key_offset], cc->iv_size);
-	memcpy(tcw->whitening, &cc->key[key_offset + cc->iv_size],
+	memcpy(tcw->iv_seed, &ctx->key[key_offset], ctx->iv_size);
+	memcpy(tcw->whitening, &ctx->key[key_offset + ctx->iv_size],
 	       TCW_WHITENING_SIZE);
 
 	return 0;
 }
 
-static int crypt_iv_tcw_wipe(struct crypt_config *cc)
+static int crypt_iv_tcw_wipe(struct geniv_ctx *ctx)
 {
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
 
-	memset(tcw->iv_seed, 0, cc->iv_size);
+	memset(tcw->iv_seed, 0, ctx->iv_size);
 	memset(tcw->whitening, 0, TCW_WHITENING_SIZE);
 
 	return 0;
 }
 
-static int crypt_iv_tcw_whitening(struct crypt_config *cc,
-				  struct dm_crypt_request *dmreq,
-				  u8 *data)
+static int crypt_iv_tcw_whitening(struct geniv_ctx *ctx,
+				  struct geniv_req_ctx *rctx, u8 *data)
 {
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
-	__le64 sector = cpu_to_le64(dmreq->iv_sector);
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
+	__le64 sector = cpu_to_le64(rctx->iv_sector);
 	u8 buf[TCW_WHITENING_SIZE];
-	SHASH_DESC_ON_STACK(desc, tcw->crc32_tfm);
 	int i, r;
+	SHASH_DESC_ON_STACK(desc, tcw->crc32_tfm);
 
 	/* xor whitening with sector number */
 	memcpy(buf, tcw->whitening, TCW_WHITENING_SIZE);
@@ -713,99 +666,1006 @@ static int crypt_iv_tcw_whitening(struct crypt_config *cc,
 	crypto_xor(&buf[0], &buf[12], 4);
 	crypto_xor(&buf[4], &buf[8], 4);
 
-	/* apply whitening (8 bytes) to whole sector */
-	for (i = 0; i < ((1 << SECTOR_SHIFT) / 8); i++)
-		crypto_xor(data + i * 8, buf, 8);
-out:
-	memzero_explicit(buf, sizeof(buf));
-	return r;
-}
+	/* apply whitening (8 bytes) to whole sector */
+	for (i = 0; i < (SECTOR_SIZE / 8); i++)
+		crypto_xor(data + i * 8, buf, 8);
+out:
+	memzero_explicit(buf, sizeof(buf));
+	return r;
+}
+
+static int crypt_iv_tcw_gen(struct geniv_ctx *ctx,
+			    struct geniv_req_ctx *rctx,
+			    struct geniv_subreq *subreq)
+{
+	u8 *iv = rctx->iv;
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
+	__le64 sector = cpu_to_le64(rctx->iv_sector);
+	u8 *src;
+	int r = 0;
+
+	/* Remove whitening from ciphertext */
+	if (!rctx->is_write) {
+		src = kmap_atomic(sg_page(&subreq->src));
+		r = crypt_iv_tcw_whitening(ctx, rctx,
+					   src + subreq->src.offset);
+		kunmap_atomic(src);
+	}
+
+	/* Calculate IV */
+	memcpy(iv, tcw->iv_seed, ctx->iv_size);
+	crypto_xor(iv, (u8 *)&sector, 8);
+	if (ctx->iv_size > 8)
+		crypto_xor(&iv[8], (u8 *)&sector, ctx->iv_size - 8);
+
+	return r;
+}
+
+static int crypt_iv_tcw_post(struct geniv_ctx *ctx,
+			     struct geniv_req_ctx *rctx,
+			     struct geniv_subreq *subreq)
+{
+	u8 *dst;
+	int r;
+
+	if (!rctx->is_write)
+		return 0;
+
+	/* Apply whitening on ciphertext */
+	dst = kmap_atomic(sg_page(&subreq->dst));
+	r = crypt_iv_tcw_whitening(ctx, rctx, dst + subreq->dst.offset);
+	kunmap_atomic(dst);
+
+	return r;
+}
+
+static const struct crypt_iv_operations crypt_iv_plain_ops = {
+	.generator = crypt_iv_plain_gen
+};
+
+static const struct crypt_iv_operations crypt_iv_plain64_ops = {
+	.generator = crypt_iv_plain64_gen
+};
+
+static const struct crypt_iv_operations crypt_iv_essiv_ops = {
+	.ctr       = crypt_iv_essiv_ctr,
+	.dtr       = crypt_iv_essiv_dtr,
+	.init      = crypt_iv_essiv_init,
+	.wipe      = crypt_iv_essiv_wipe,
+	.generator = crypt_iv_essiv_gen
+};
+
+static const struct crypt_iv_operations crypt_iv_benbi_ops = {
+	.ctr	   = crypt_iv_benbi_ctr,
+	.generator = crypt_iv_benbi_gen
+};
+
+static const struct crypt_iv_operations crypt_iv_null_ops = {
+	.generator = crypt_iv_null_gen
+};
+
+static const struct crypt_iv_operations crypt_iv_lmk_ops = {
+	.ctr	   = crypt_iv_lmk_ctr,
+	.dtr	   = crypt_iv_lmk_dtr,
+	.init	   = crypt_iv_lmk_init,
+	.wipe	   = crypt_iv_lmk_wipe,
+	.generator = crypt_iv_lmk_gen,
+	.post	   = crypt_iv_lmk_post
+};
+
+static const struct crypt_iv_operations crypt_iv_tcw_ops = {
+	.ctr	   = crypt_iv_tcw_ctr,
+	.dtr	   = crypt_iv_tcw_dtr,
+	.init	   = crypt_iv_tcw_init,
+	.wipe	   = crypt_iv_tcw_wipe,
+	.generator = crypt_iv_tcw_gen,
+	.post	   = crypt_iv_tcw_post
+};
+
+static int geniv_setkey_set(struct geniv_ctx *ctx)
+{
+	int ret = 0;
+
+	if (ctx->iv_gen_ops && ctx->iv_gen_ops->init)
+		ret = ctx->iv_gen_ops->init(ctx);
+	return ret;
+}
+
+static int geniv_setkey_wipe(struct geniv_ctx *ctx)
+{
+	int ret = 0;
+
+	if (ctx->iv_gen_ops && ctx->iv_gen_ops->wipe) {
+		ret = ctx->iv_gen_ops->wipe(ctx);
+		if (ret)
+			return ret;
+	}
+	return ret;
+}
+
+static int geniv_init_iv(struct geniv_ctx *ctx)
+{
+	int ret = -EINVAL;
+
+	DMDEBUG("IV Generation algorithm : %s\n", ctx->ivmode);
+
+	if (ctx->ivmode == NULL)
+		ctx->iv_gen_ops = NULL;
+	else if (strcmp(ctx->ivmode, "plain") == 0)
+		ctx->iv_gen_ops = &crypt_iv_plain_ops;
+	else if (strcmp(ctx->ivmode, "plain64") == 0)
+		ctx->iv_gen_ops = &crypt_iv_plain64_ops;
+	else if (strcmp(ctx->ivmode, "essiv") == 0)
+		ctx->iv_gen_ops = &crypt_iv_essiv_ops;
+	else if (strcmp(ctx->ivmode, "benbi") == 0)
+		ctx->iv_gen_ops = &crypt_iv_benbi_ops;
+	else if (strcmp(ctx->ivmode, "null") == 0)
+		ctx->iv_gen_ops = &crypt_iv_null_ops;
+	else if (strcmp(ctx->ivmode, "lmk") == 0)
+		ctx->iv_gen_ops = &crypt_iv_lmk_ops;
+	else if (strcmp(ctx->ivmode, "tcw") == 0) {
+		ctx->iv_gen_ops = &crypt_iv_tcw_ops;
+		ctx->key_parts += 2; /* IV + whitening */
+		ctx->key_extra_size = ctx->iv_size + TCW_WHITENING_SIZE;
+	} else {
+		ret = -EINVAL;
+		DMERR("Invalid IV mode %s\n", ctx->ivmode);
+		goto end;
+	}
+
+	/* Allocate IV */
+	if (ctx->iv_gen_ops && ctx->iv_gen_ops->ctr) {
+		ret = ctx->iv_gen_ops->ctr(ctx);
+		if (ret < 0) {
+			DMERR("Error creating IV for %s\n", ctx->ivmode);
+			goto end;
+		}
+	}
+
+	/* Initialize IV (set keys for ESSIV etc) */
+	if (ctx->iv_gen_ops && ctx->iv_gen_ops->init) {
+		ret = ctx->iv_gen_ops->init(ctx);
+		if (ret < 0)
+			DMERR("Error creating IV for %s\n", ctx->ivmode);
+	}
+	ret = 0;
+end:
+	return ret;
+}
+
+static void geniv_free_tfms(struct geniv_ctx *ctx)
+{
+	unsigned int i;
+
+	if (!ctx->tfms)
+		return;
+
+	for (i = 0; i < ctx->tfms_count; i++)
+		if (ctx->tfms[i] && !IS_ERR(ctx->tfms[i])) {
+			crypto_free_skcipher(ctx->tfms[i]);
+			ctx->tfms[i] = NULL;
+		}
+
+	kfree(ctx->tfms);
+	ctx->tfms = NULL;
+}
+
+/* Allocate memory for the underlying cipher algorithm. Ex: cbc(aes)
+ */
+
+static int geniv_alloc_tfms(struct crypto_skcipher *parent,
+			    struct geniv_ctx *ctx)
+{
+	unsigned int i, reqsize, align;
+	int err = 0;
+
+	ctx->tfms = kcalloc(ctx->tfms_count, sizeof(struct crypto_skcipher *),
+			   GFP_KERNEL);
+	if (!ctx->tfms) {
+		err = -ENOMEM;
+		goto end;
+	}
+
+	/* First instance is already allocated in geniv_init_tfm */
+	ctx->tfms[0] = ctx->child;
+	for (i = 1; i < ctx->tfms_count; i++) {
+		ctx->tfms[i] = crypto_alloc_skcipher(ctx->ciphermode, 0, 0);
+		if (IS_ERR(ctx->tfms[i])) {
+			err = PTR_ERR(ctx->tfms[i]);
+			geniv_free_tfms(ctx);
+			goto end;
+		}
+
+		/* Setup the current cipher's request structure */
+		align = crypto_skcipher_alignmask(parent);
+		align &= ~(crypto_tfm_ctx_alignment() - 1);
+		reqsize = align + sizeof(struct geniv_req_ctx) +
+			  crypto_skcipher_reqsize(ctx->tfms[i]);
+		crypto_skcipher_set_reqsize(parent, reqsize);
+	}
+
+end:
+	return err;
+}
+
+/* Initialize the cipher's context with the key, ivmode and other parameters.
+ * Also allocate IV generation template ciphers and initialize them.
+ */
+
+static int geniv_setkey_init(struct crypto_skcipher *parent,
+			     struct geniv_key_info *info)
+{
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(parent);
+	int ret = -ENOMEM;
+
+	ctx->iv_size = crypto_skcipher_ivsize(parent);
+	ctx->tfms_count = info->tfms_count;
+	ctx->key = info->key;
+	ctx->key_size = info->key_size;
+	ctx->key_parts = info->key_parts;
+	ctx->ivopts = info->ivopts;
+
+	ret = geniv_alloc_tfms(parent, ctx);
+	if (ret)
+		goto end;
+
+	ret = geniv_init_iv(ctx);
+
+end:
+	return ret;
+}
+
+static int geniv_setkey_tfms(struct crypto_skcipher *parent,
+			     struct geniv_ctx *ctx,
+			     struct geniv_key_info *info)
+{
+	unsigned int subkey_size;
+	int ret = 0, i;
+
+	/* Ignore extra keys (which are used for IV etc) */
+	subkey_size = (ctx->key_size - ctx->key_extra_size)
+		      >> ilog2(ctx->tfms_count);
+
+	for (i = 0; i < ctx->tfms_count; i++) {
+		struct crypto_skcipher *child = ctx->tfms[i];
+		char *subkey = ctx->key + (subkey_size) * i;
+
+		crypto_skcipher_clear_flags(child, CRYPTO_TFM_REQ_MASK);
+		crypto_skcipher_set_flags(child,
+					  crypto_skcipher_get_flags(parent) &
+					  CRYPTO_TFM_REQ_MASK);
+		ret = crypto_skcipher_setkey(child, subkey, subkey_size);
+		if (ret) {
+			DMERR("Error setting key for tfms[%d]\n", i);
+			break;
+		}
+		crypto_skcipher_set_flags(parent,
+					  crypto_skcipher_get_flags(child) &
+					  CRYPTO_TFM_RES_MASK);
+	}
+
+	return ret;
+}
+
+static int geniv_setkey(struct crypto_skcipher *parent,
+			const u8 *key, unsigned int keylen)
+{
+	int err = 0;
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(parent);
+	struct geniv_key_info *info = (struct geniv_key_info *) key;
+
+	DMDEBUG("SETKEY Operation : %d\n", info->keyop);
+
+	switch (info->keyop) {
+	case SETKEY_OP_INIT:
+		err = geniv_setkey_init(parent, info);
+		break;
+	case SETKEY_OP_SET:
+		err = geniv_setkey_set(ctx);
+		break;
+	case SETKEY_OP_WIPE:
+		err = geniv_setkey_wipe(ctx);
+		break;
+	}
+
+	if (err)
+		goto end;
+
+	err = geniv_setkey_tfms(parent, ctx, info);
+
+end:
+	return err;
+}
+
+static void geniv_async_done(struct crypto_async_request *async_req, int error);
+
+static int geniv_alloc_subreq(struct skcipher_request *req,
+			      struct geniv_ctx *ctx,
+			      struct geniv_req_ctx *rctx)
+{
+	int key_index, r = 0;
+	struct skcipher_request *sreq;
+
+	if (!rctx->subreq) {
+		rctx->subreq = mempool_alloc(ctx->subreq_pool, GFP_NOIO);
+		if (!rctx->subreq)
+			r = -ENOMEM;
+	}
+
+	sreq = &rctx->subreq->req;
+	rctx->subreq->rctx = rctx;
+
+	key_index = rctx->iv_sector & (ctx->tfms_count - 1);
+
+	skcipher_request_set_tfm(sreq, ctx->tfms[key_index]);
+	skcipher_request_set_callback(sreq, req->base.flags,
+				      geniv_async_done, rctx->subreq);
+	return r;
+}
+
+/* Asynchronous IO completion callback for each sector in a segment. When all
+ * pending i/o are completed the parent cipher's async function is called.
+ */
+
+static void geniv_async_done(struct crypto_async_request *async_req, int error)
+{
+	struct geniv_subreq *subreq =
+		(struct geniv_subreq *) async_req->data;
+	struct geniv_req_ctx *rctx = subreq->rctx;
+	struct skcipher_request *req = rctx->req;
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	/*
+	 * A request from crypto driver backlog is going to be processed now,
+	 * finish the completion and continue in crypt_convert().
+	 * (Callback will be called for the second time for this request.)
+	 */
+
+	if (error == -EINPROGRESS) {
+		complete(&rctx->restart);
+		return;
+	}
+
+	if (!error && ctx->iv_gen_ops && ctx->iv_gen_ops->post)
+		error = ctx->iv_gen_ops->post(ctx, rctx, subreq);
+
+	mempool_free(subreq, ctx->subreq_pool);
+
+	/* req_pending needs to be checked before req->base.complete is called
+	 * as we need 'req_pending' to be equal to 1 to ensure all subrequests
+	 * are processed.
+	 */
+	if (!atomic_dec_and_test(&rctx->req_pending)) {
+		/* Call the parent cipher's completion function */
+		skcipher_request_complete(req, error);
+	}
+}
+
+static unsigned int geniv_get_sectors(struct scatterlist *sg1,
+				      struct scatterlist *sg2,
+				      unsigned int segments)
+{
+	unsigned int i, n1, n2, nents;
+
+	n1 = n2 = 0;
+	for (i = 0; i < segments ; i++)
+		n1 += sg1[i].length / SECTOR_SIZE;
+
+	for (i = 0; i < segments ; i++)
+		n2 += sg2[i].length / SECTOR_SIZE;
+
+	nents = n1 > n2 ? n1 : n2;
+	return nents;
+}
+
+/* Iterate scatterlist of segments to retrieve the 512-byte sectors so that
+ * unique IVs could be generated for each 512-byte sector. This split may not
+ * be necessary e.g. when these ciphers are modelled in hardware, where it can
+ * make use of the hardware's IV generation capabilities.
+ */
+
+static int geniv_iter_block(struct skcipher_request *req,
+			    struct geniv_subreq *subreq,
+			    struct geniv_req_ctx *rctx,
+			    unsigned int *seg_no,
+			    unsigned int *done)
+
+{
+	unsigned int srcoff, dstoff, len, rem;
+	struct scatterlist *src1, *dst1, *src2, *dst2;
+
+	if (unlikely(*seg_no >= rctx->nents))
+		return 0; /* done */
+
+	src1 = &req->src[*seg_no];
+	dst1 = &req->dst[*seg_no];
+	src2 = &subreq->src;
+	dst2 = &subreq->dst;
+
+	if (*done >= src1->length) {
+		(*seg_no)++;
+
+		if (*seg_no >= rctx->nents)
+			return 0; /* done */
+
+		src1 = &req->src[*seg_no];
+		dst1 = &req->dst[*seg_no];
+		*done = 0;
+	}
+
+	srcoff = src1->offset + *done;
+	dstoff = dst1->offset + *done;
+	rem = src1->length - *done;
+
+	len = rem > SECTOR_SIZE ? SECTOR_SIZE : rem;
+
+	DMDEBUG("segment:(%d/%u), srcoff:%d, dstoff:%d, done:%d, rem:%d\n",
+		*seg_no + 1, rctx->nents, srcoff, dstoff, *done, rem);
+
+	sg_set_page(src2, sg_page(src1), len, srcoff);
+	sg_set_page(dst2, sg_page(dst1), len, dstoff);
+
+	*done += len;
+
+	return len; /* bytes returned */
+}
+
+/* Common encryt/decrypt function for geniv template cipher. Before the crypto
+ * operation, it splits the memory segments (in the scatterlist) into 512 byte
+ * sectors. The initialization vector(IV) used is based on a unique sector
+ * number which is generated here.
+ */
+static inline int geniv_crypt(struct skcipher_request *req, bool encrypt)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct geniv_req_ctx *rctx = geniv_req_ctx(req);
+	struct geniv_req_info *rinfo = (struct geniv_req_info *) req->iv;
+	int i, bytes, cryptlen, ret = 0;
+	unsigned int sectors, segno = 0, done = 0;
+	char *str __maybe_unused = encrypt ? "encrypt" : "decrypt";
+
+	/* Instance of 'struct geniv_req_info' is stored in IV ptr */
+	rctx->is_write = rinfo->is_write;
+	rctx->iv_sector = rinfo->iv_sector;
+	rctx->nents = rinfo->nents;
+	rctx->iv = rinfo->iv;
+	rctx->req = req;
+	rctx->subreq = NULL;
+	cryptlen = req->cryptlen;
+
+	DMDEBUG("geniv:%s: starting sector=%d, #segments=%u\n", str,
+		(unsigned int) rctx->iv_sector, rctx->nents);
+
+	sectors = geniv_get_sectors(req->src, req->dst, rctx->nents);
+
+	init_completion(&rctx->restart);
+	atomic_set(&rctx->req_pending, 1);
+
+	for (i = 0; i < sectors; i++) {
+		struct geniv_subreq *subreq;
+
+		ret = geniv_alloc_subreq(req, ctx, rctx);
+		if (ret)
+			goto end;
+
+		subreq = rctx->subreq;
+		subreq->rctx = rctx;
+
+		atomic_inc(&rctx->req_pending);
+		bytes = geniv_iter_block(req, subreq, rctx, &segno, &done);
+
+		if (bytes == 0)
+			break;
+
+		cryptlen -= bytes;
+
+		if (ctx->iv_gen_ops)
+			ret = ctx->iv_gen_ops->generator(ctx, rctx, subreq);
+
+		if (ret < 0) {
+			DMERR("Error in generating IV ret: %d\n", ret);
+			goto end;
+		}
+
+		skcipher_request_set_crypt(&subreq->req, &subreq->src,
+					   &subreq->dst, bytes, rctx->iv);
+
+		if (encrypt)
+			ret = crypto_skcipher_encrypt(&subreq->req);
+
+		else
+			ret = crypto_skcipher_decrypt(&subreq->req);
+
+		if (!ret && ctx->iv_gen_ops && ctx->iv_gen_ops->post)
+			ret = ctx->iv_gen_ops->post(ctx, rctx, subreq);
+
+		switch (ret) {
+		/*
+		 * The request was queued by a crypto driver
+		 * but the driver request queue is full, let's wait.
+		 */
+		case -EBUSY:
+			wait_for_completion(&rctx->restart);
+			reinit_completion(&rctx->restart);
+			/* fall through */
+		/*
+		 * The request is queued and processed asynchronously,
+		 * completion function geniv_async_done() is called.
+		 */
+		case -EINPROGRESS:
+			/* Marking this NULL lets the creation of a new sub-
+			 * request when 'geniv_alloc_subreq' is called.
+			 */
+			rctx->subreq = NULL;
+			rctx->iv_sector++;
+			cond_resched();
+			break;
+		/*
+		 * The request was already processed (synchronously).
+		 */
+		case 0:
+			atomic_dec(&rctx->req_pending);
+			rctx->iv_sector++;
+			cond_resched();
+			continue;
+
+		/* There was an error while processing the request. */
+		default:
+			atomic_dec(&rctx->req_pending);
+			return ret;
+		}
+
+		if (ret)
+			break;
+	}
+
+	if (rctx->subreq && atomic_read(&rctx->req_pending) == 1) {
+		DMDEBUG("geniv:%s: Freeing sub request\n", str);
+		mempool_free(rctx->subreq, ctx->subreq_pool);
+	}
+
+end:
+	return ret;
+}
+
+static int geniv_encrypt(struct skcipher_request *req)
+{
+	return geniv_crypt(req, true);
+}
+
+static int geniv_decrypt(struct skcipher_request *req)
+{
+	return geniv_crypt(req, false);
+}
+
+static int geniv_init_tfm(struct crypto_skcipher *tfm)
+{
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(tfm);
+	const int psize = sizeof(struct geniv_subreq);
+	unsigned int reqsize, align;
+	char *algname, *chainmode;
+	int ret = 0;
+
+	algname = (char *) crypto_tfm_alg_name(crypto_skcipher_tfm(tfm));
+	ctx->ciphermode = kmalloc(CRYPTO_MAX_ALG_NAME, GFP_KERNEL);
+	if (!ctx->ciphermode) {
+		ret = -ENOMEM;
+		goto end;
+	}
+
+	/* Parse algorithm name 'ivmode(chainmode(cipher))' */
+	ctx->ivmode	= strsep(&algname, "(");
+	chainmode	= strsep(&algname, "(");
+	ctx->cipher	= strsep(&algname, ")");
+
+	snprintf(ctx->ciphermode, CRYPTO_MAX_ALG_NAME, "%s(%s)",
+		 chainmode, ctx->cipher);
+
+	DMDEBUG("ciphermode=%s, ivmode=%s\n", ctx->ciphermode, ctx->ivmode);
+
+	/*
+	 * Usually the underlying cipher instances are spawned here, but since
+	 * the value of tfms_count (which is equal to the key_count) is not
+	 * known yet, create only one instance and delay the creation of the
+	 * rest of the instances of the underlying cipher 'cbc(aes)' until
+	 * the setkey operation is invoked.
+	 * The first instance created i.e. ctx->child will later be assigned as
+	 * the 1st element in the array ctx->tfms. Creation of atleast one
+	 * instance of the cipher is necessary to be created here to uncover
+	 * any errors earlier than during the setkey operation later where the
+	 * remaining instances are created.
+	 */
+	ctx->child = crypto_alloc_skcipher(ctx->ciphermode, 0, 0);
+	if (IS_ERR(ctx->child)) {
+		ret = PTR_ERR(ctx->child);
+		DMERR("Failed to create skcipher %s. err %d\n",
+		      ctx->ciphermode, ret);
+		goto end;
+	}
+
+	/* Setup the current cipher's request structure */
+	align = crypto_skcipher_alignmask(tfm);
+	align &= ~(crypto_tfm_ctx_alignment() - 1);
+	reqsize = align + sizeof(struct geniv_req_ctx)
+			+ crypto_skcipher_reqsize(ctx->child);
+	crypto_skcipher_set_reqsize(tfm, reqsize);
+
+	/* create memory pool for sub-request structure */
+	ctx->subreq_pool = mempool_create_kmalloc_pool(MIN_IOS, psize);
+	if (!ctx->subreq_pool) {
+		ret = -ENOMEM;
+		DMERR("Could not allocate crypt sub-request mempool\n");
+	}
+end:
+	return ret;
+}
+
+static void geniv_exit_tfm(struct crypto_skcipher *tfm)
+{
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	if (ctx->iv_gen_ops && ctx->iv_gen_ops->dtr)
+		ctx->iv_gen_ops->dtr(ctx);
+
+	mempool_destroy(ctx->subreq_pool);
+	geniv_free_tfms(ctx);
+	kfree(ctx->ciphermode);
+}
+
+static void geniv_free(struct skcipher_instance *inst)
+{
+	struct crypto_skcipher_spawn *spawn = skcipher_instance_ctx(inst);
+
+	crypto_drop_skcipher(spawn);
+	kfree(inst);
+}
+
+static int geniv_create(struct crypto_template *tmpl,
+			struct rtattr **tb, char *algname)
+{
+	struct crypto_attr_type *algt;
+	struct skcipher_instance *inst;
+	struct skcipher_alg *alg;
+	struct crypto_skcipher_spawn *spawn;
+	const char *cipher_name;
+	int err;
+
+	algt = crypto_get_attr_type(tb);
+
+	if (IS_ERR(algt))
+		return PTR_ERR(algt);
+
+	if ((algt->type ^ CRYPTO_ALG_TYPE_SKCIPHER) & algt->mask)
+		return -EINVAL;
+
+	cipher_name = crypto_attr_alg_name(tb[1]);
+
+	if (IS_ERR(cipher_name))
+		return PTR_ERR(cipher_name);
+
+	inst = kzalloc(sizeof(*inst) + sizeof(*spawn), GFP_KERNEL);
+	if (!inst)
+		return -ENOMEM;
+
+	spawn = skcipher_instance_ctx(inst);
+
+	crypto_set_skcipher_spawn(spawn, skcipher_crypto_instance(inst));
+	err = crypto_grab_skcipher(spawn, cipher_name, 0,
+				    crypto_requires_sync(algt->type,
+							 algt->mask));
+
+	if (err)
+		goto err_free_inst;
+
+	alg = crypto_spawn_skcipher_alg(spawn);
+
+	err = -EINVAL;
+
+	/* Only support blocks of size which is of a power of 2 */
+	if (!is_power_of_2(alg->base.cra_blocksize))
+		goto err_drop_spawn;
+
+	/* algname: essiv, base.cra_name: cbc(aes) */
+	err = -ENAMETOOLONG;
+	if (snprintf(inst->alg.base.cra_name, CRYPTO_MAX_ALG_NAME, "%s(%s)",
+		     algname, alg->base.cra_name) >= CRYPTO_MAX_ALG_NAME)
+		goto err_drop_spawn;
+	if (snprintf(inst->alg.base.cra_driver_name, CRYPTO_MAX_ALG_NAME,
+		     "%s(%s)", algname, alg->base.cra_driver_name) >=
+	    CRYPTO_MAX_ALG_NAME)
+		goto err_drop_spawn;
+
+	inst->alg.base.cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER;
+	inst->alg.base.cra_priority = alg->base.cra_priority;
+	inst->alg.base.cra_blocksize = alg->base.cra_blocksize;
+	inst->alg.base.cra_alignmask = alg->base.cra_alignmask;
+	inst->alg.base.cra_flags = alg->base.cra_flags & CRYPTO_ALG_ASYNC;
+	inst->alg.ivsize = alg->base.cra_blocksize;
+	inst->alg.chunksize = crypto_skcipher_alg_chunksize(alg);
+	inst->alg.min_keysize = crypto_skcipher_alg_min_keysize(alg);
+	inst->alg.max_keysize = crypto_skcipher_alg_max_keysize(alg);
+
+	inst->alg.setkey = geniv_setkey;
+	inst->alg.encrypt = geniv_encrypt;
+	inst->alg.decrypt = geniv_decrypt;
+
+	inst->alg.base.cra_ctxsize = sizeof(struct geniv_ctx);
+
+	inst->alg.init = geniv_init_tfm;
+	inst->alg.exit = geniv_exit_tfm;
+
+	inst->free = geniv_free;
+
+	err = skcipher_register_instance(tmpl, inst);
+	if (err)
+		goto err_drop_spawn;
+
+out:
+	return err;
+
+err_drop_spawn:
+	crypto_drop_skcipher(spawn);
+err_free_inst:
+	kfree(inst);
+	goto out;
+}
+
+static int crypto_plain_create(struct crypto_template *tmpl,
+			       struct rtattr **tb)
+{
+	return geniv_create(tmpl, tb, "plain");
+}
+
+static int crypto_plain64_create(struct crypto_template *tmpl,
+				 struct rtattr **tb)
+{
+	return geniv_create(tmpl, tb, "plain64");
+}
+
+static int crypto_essiv_create(struct crypto_template *tmpl,
+			       struct rtattr **tb)
+{
+	return geniv_create(tmpl, tb, "essiv");
+}
+
+static int crypto_benbi_create(struct crypto_template *tmpl,
+			       struct rtattr **tb)
+{
+	return geniv_create(tmpl, tb, "benbi");
+}
+
+static int crypto_null_create(struct crypto_template *tmpl,
+			      struct rtattr **tb)
+{
+	return geniv_create(tmpl, tb, "null");
+}
+
+static int crypto_lmk_create(struct crypto_template *tmpl,
+			     struct rtattr **tb)
+{
+	return geniv_create(tmpl, tb, "lmk");
+}
+
+static int crypto_tcw_create(struct crypto_template *tmpl,
+			     struct rtattr **tb)
+{
+	return geniv_create(tmpl, tb, "tcw");
+}
+
+static struct crypto_template crypto_plain_tmpl = {
+	.name   = "plain",
+	.create = crypto_plain_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_plain64_tmpl = {
+	.name   = "plain64",
+	.create = crypto_plain64_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_essiv_tmpl = {
+	.name   = "essiv",
+	.create = crypto_essiv_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_benbi_tmpl = {
+	.name   = "benbi",
+	.create = crypto_benbi_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_null_tmpl = {
+	.name   = "null",
+	.create = crypto_null_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_lmk_tmpl = {
+	.name   = "lmk",
+	.create = crypto_lmk_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_tcw_tmpl = {
+	.name   = "tcw",
+	.create = crypto_tcw_create,
+	.module = THIS_MODULE,
+};
+
+static int __init geniv_register_algs(void)
+{
+	int err;
+
+	err = crypto_register_template(&crypto_plain_tmpl);
+	if (err)
+		goto out;
+
+	err = crypto_register_template(&crypto_plain64_tmpl);
+	if (err)
+		goto out_undo_plain;
+
+	err = crypto_register_template(&crypto_essiv_tmpl);
+	if (err)
+		goto out_undo_plain64;
+
+	err = crypto_register_template(&crypto_benbi_tmpl);
+	if (err)
+		goto out_undo_essiv;
+
+	err = crypto_register_template(&crypto_null_tmpl);
+	if (err)
+		goto out_undo_benbi;
+
+	err = crypto_register_template(&crypto_lmk_tmpl);
+	if (err)
+		goto out_undo_null;
+
+	err = crypto_register_template(&crypto_tcw_tmpl);
+	if (!err)
+		goto out;
+
+	crypto_unregister_template(&crypto_lmk_tmpl);
+out_undo_null:
+	crypto_unregister_template(&crypto_null_tmpl);
+out_undo_benbi:
+	crypto_unregister_template(&crypto_benbi_tmpl);
+out_undo_essiv:
+	crypto_unregister_template(&crypto_essiv_tmpl);
+out_undo_plain64:
+	crypto_unregister_template(&crypto_plain64_tmpl);
+out_undo_plain:
+	crypto_unregister_template(&crypto_plain_tmpl);
+out:
+	return err;
+}
+
+static void __exit geniv_deregister_algs(void)
+{
+	crypto_unregister_template(&crypto_plain_tmpl);
+	crypto_unregister_template(&crypto_plain64_tmpl);
+	crypto_unregister_template(&crypto_essiv_tmpl);
+	crypto_unregister_template(&crypto_benbi_tmpl);
+	crypto_unregister_template(&crypto_null_tmpl);
+	crypto_unregister_template(&crypto_lmk_tmpl);
+	crypto_unregister_template(&crypto_tcw_tmpl);
+}
+
+/* End of geniv template cipher algorithms */
+
+/*
+ * context holding the current state of a multi-part conversion
+ */
+struct convert_context {
+	struct completion restart;
+	struct bio *bio_in;
+	struct bio *bio_out;
+	struct bvec_iter iter_in;
+	struct bvec_iter iter_out;
+	sector_t cc_sector;
+	atomic_t cc_pending;
+	struct skcipher_request *req;
+};
+
+/*
+ * per bio private data
+ */
+struct dm_crypt_io {
+	struct crypt_config *cc;
+	struct bio *base_bio;
+	struct work_struct work;
+
+	struct convert_context ctx;
 
-static int crypt_iv_tcw_gen(struct crypt_config *cc, u8 *iv,
-			    struct dm_crypt_request *dmreq)
-{
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
-	__le64 sector = cpu_to_le64(dmreq->iv_sector);
-	u8 *src;
-	int r = 0;
+	atomic_t io_pending;
+	int error;
+	sector_t sector;
 
-	/* Remove whitening from ciphertext */
-	if (bio_data_dir(dmreq->ctx->bio_in) != WRITE) {
-		src = kmap_atomic(sg_page(&dmreq->sg_in));
-		r = crypt_iv_tcw_whitening(cc, dmreq, src + dmreq->sg_in.offset);
-		kunmap_atomic(src);
-	}
+	struct rb_node rb_node;
+} CRYPTO_MINALIGN_ATTR;
 
-	/* Calculate IV */
-	memcpy(iv, tcw->iv_seed, cc->iv_size);
-	crypto_xor(iv, (u8 *)&sector, 8);
-	if (cc->iv_size > 8)
-		crypto_xor(&iv[8], (u8 *)&sector, cc->iv_size - 8);
+struct dm_crypt_request {
+	struct convert_context *ctx;
+	struct scatterlist *sg_in;
+	struct scatterlist *sg_out;
+	sector_t iv_sector;
+};
 
-	return r;
-}
+struct crypt_config;
 
-static int crypt_iv_tcw_post(struct crypt_config *cc, u8 *iv,
-			     struct dm_crypt_request *dmreq)
-{
-	u8 *dst;
-	int r;
+/*
+ * Crypt: maps a linear range of a block device
+ * and encrypts / decrypts at the same time.
+ */
+enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID,
+	     DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD };
 
-	if (bio_data_dir(dmreq->ctx->bio_in) != WRITE)
-		return 0;
+/*
+ * The fields in here must be read only after initialization.
+ */
+struct crypt_config {
+	struct dm_dev *dev;
+	sector_t start;
 
-	/* Apply whitening on ciphertext */
-	dst = kmap_atomic(sg_page(&dmreq->sg_out));
-	r = crypt_iv_tcw_whitening(cc, dmreq, dst + dmreq->sg_out.offset);
-	kunmap_atomic(dst);
+	/*
+	 * pool for per bio private data, crypto requests and
+	 * encryption requeusts/buffer pages
+	 */
+	mempool_t *req_pool;
+	mempool_t *page_pool;
+	struct bio_set *bs;
+	struct mutex bio_alloc_lock;
 
-	return r;
-}
+	struct workqueue_struct *io_queue;
+	struct workqueue_struct *crypt_queue;
 
-static const struct crypt_iv_operations crypt_iv_plain_ops = {
-	.generator = crypt_iv_plain_gen
-};
+	struct task_struct *write_thread;
+	wait_queue_head_t write_thread_wait;
+	struct rb_root write_tree;
 
-static const struct crypt_iv_operations crypt_iv_plain64_ops = {
-	.generator = crypt_iv_plain64_gen
-};
+	char *cipher;
+	char *cipher_string;
+	char *key_string;
 
-static const struct crypt_iv_operations crypt_iv_essiv_ops = {
-	.ctr       = crypt_iv_essiv_ctr,
-	.dtr       = crypt_iv_essiv_dtr,
-	.init      = crypt_iv_essiv_init,
-	.wipe      = crypt_iv_essiv_wipe,
-	.generator = crypt_iv_essiv_gen
-};
+	sector_t iv_offset;
+	unsigned int iv_size;
 
-static const struct crypt_iv_operations crypt_iv_benbi_ops = {
-	.ctr	   = crypt_iv_benbi_ctr,
-	.dtr	   = crypt_iv_benbi_dtr,
-	.generator = crypt_iv_benbi_gen
-};
+	/* ESSIV: struct crypto_cipher *essiv_tfm */
+	void *iv_private;
+	struct crypto_skcipher *tfm;
+	unsigned int tfms_count;
 
-static const struct crypt_iv_operations crypt_iv_null_ops = {
-	.generator = crypt_iv_null_gen
-};
+	/*
+	 * Layout of each crypto request:
+	 *
+	 *   struct skcipher_request
+	 *      context
+	 *      padding
+	 *   struct dm_crypt_request
+	 *      padding
+	 *   IV
+	 *
+	 * The padding is added so that dm_crypt_request and the IV are
+	 * correctly aligned.
+	 */
+	unsigned int dmreq_start;
 
-static const struct crypt_iv_operations crypt_iv_lmk_ops = {
-	.ctr	   = crypt_iv_lmk_ctr,
-	.dtr	   = crypt_iv_lmk_dtr,
-	.init	   = crypt_iv_lmk_init,
-	.wipe	   = crypt_iv_lmk_wipe,
-	.generator = crypt_iv_lmk_gen,
-	.post	   = crypt_iv_lmk_post
-};
+	unsigned int per_bio_data_size;
 
-static const struct crypt_iv_operations crypt_iv_tcw_ops = {
-	.ctr	   = crypt_iv_tcw_ctr,
-	.dtr	   = crypt_iv_tcw_dtr,
-	.init	   = crypt_iv_tcw_init,
-	.wipe	   = crypt_iv_tcw_wipe,
-	.generator = crypt_iv_tcw_gen,
-	.post	   = crypt_iv_tcw_post
+	unsigned long flags;
+	unsigned int key_size;
+	unsigned int key_parts;      /* independent parts in key buffer */
+	unsigned int key_extra_size; /* additional keys length */
+	u8 key[0];
 };
 
+static void clone_init(struct dm_crypt_io *, struct bio *);
+static void kcryptd_queue_crypt(struct dm_crypt_io *io);
+static u8 *iv_of_dmreq(struct crypt_config *cc, struct dm_crypt_request *dmreq);
+
 static void crypt_convert_init(struct crypt_config *cc,
 			       struct convert_context *ctx,
 			       struct bio *bio_out, struct bio *bio_in,
@@ -837,53 +1697,7 @@ static u8 *iv_of_dmreq(struct crypt_config *cc,
 		       struct dm_crypt_request *dmreq)
 {
 	return (u8 *)ALIGN((unsigned long)(dmreq + 1),
-		crypto_skcipher_alignmask(any_tfm(cc)) + 1);
-}
-
-static int crypt_convert_block(struct crypt_config *cc,
-			       struct convert_context *ctx,
-			       struct skcipher_request *req)
-{
-	struct bio_vec bv_in = bio_iter_iovec(ctx->bio_in, ctx->iter_in);
-	struct bio_vec bv_out = bio_iter_iovec(ctx->bio_out, ctx->iter_out);
-	struct dm_crypt_request *dmreq;
-	u8 *iv;
-	int r;
-
-	dmreq = dmreq_of_req(cc, req);
-	iv = iv_of_dmreq(cc, dmreq);
-
-	dmreq->iv_sector = ctx->cc_sector;
-	dmreq->ctx = ctx;
-	sg_init_table(&dmreq->sg_in, 1);
-	sg_set_page(&dmreq->sg_in, bv_in.bv_page, 1 << SECTOR_SHIFT,
-		    bv_in.bv_offset);
-
-	sg_init_table(&dmreq->sg_out, 1);
-	sg_set_page(&dmreq->sg_out, bv_out.bv_page, 1 << SECTOR_SHIFT,
-		    bv_out.bv_offset);
-
-	bio_advance_iter(ctx->bio_in, &ctx->iter_in, 1 << SECTOR_SHIFT);
-	bio_advance_iter(ctx->bio_out, &ctx->iter_out, 1 << SECTOR_SHIFT);
-
-	if (cc->iv_gen_ops) {
-		r = cc->iv_gen_ops->generator(cc, iv, dmreq);
-		if (r < 0)
-			return r;
-	}
-
-	skcipher_request_set_crypt(req, &dmreq->sg_in, &dmreq->sg_out,
-				   1 << SECTOR_SHIFT, iv);
-
-	if (bio_data_dir(ctx->bio_in) == WRITE)
-		r = crypto_skcipher_encrypt(req);
-	else
-		r = crypto_skcipher_decrypt(req);
-
-	if (!r && cc->iv_gen_ops && cc->iv_gen_ops->post)
-		r = cc->iv_gen_ops->post(cc, iv, dmreq);
-
-	return r;
+		crypto_skcipher_alignmask(cc->tfm) + 1);
 }
 
 static void kcryptd_async_done(struct crypto_async_request *async_req,
@@ -892,12 +1706,10 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
 static void crypt_alloc_req(struct crypt_config *cc,
 			    struct convert_context *ctx)
 {
-	unsigned key_index = ctx->cc_sector & (cc->tfms_count - 1);
-
 	if (!ctx->req)
 		ctx->req = mempool_alloc(cc->req_pool, GFP_NOIO);
 
-	skcipher_request_set_tfm(ctx->req, cc->tfms[key_index]);
+	skcipher_request_set_tfm(ctx->req, cc->tfm);
 
 	/*
 	 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
@@ -920,57 +1732,98 @@ static void crypt_free_req(struct crypt_config *cc,
 /*
  * Encrypt / decrypt data from one bio to another one (can be the same one)
  */
-static int crypt_convert(struct crypt_config *cc,
-			 struct convert_context *ctx)
+
+static int crypt_convert_bio(struct crypt_config *cc,
+			     struct convert_context *ctx)
 {
+	unsigned int cryptlen, n1, n2, nents, i = 0, bytes = 0;
+	struct skcipher_request *req;
+	struct dm_crypt_request *dmreq;
+	struct geniv_req_info rinfo;
+	struct bio_vec bv_in, bv_out;
 	int r;
+	u8 *iv;
 
 	atomic_set(&ctx->cc_pending, 1);
+	crypt_alloc_req(cc, ctx);
+
+	req = ctx->req;
+	dmreq = dmreq_of_req(cc, req);
+	iv = iv_of_dmreq(cc, dmreq);
 
-	while (ctx->iter_in.bi_size && ctx->iter_out.bi_size) {
+	n1 = bio_segments(ctx->bio_in);
+	n2 = bio_segments(ctx->bio_in);
+	nents = n1 > n2 ? n1 : n2;
+	nents = nents > MAX_SG_LIST ? MAX_SG_LIST : nents;
+	cryptlen = ctx->iter_in.bi_size;
 
-		crypt_alloc_req(cc, ctx);
+	DMDEBUG("dm-crypt:%s: segments:[in=%u, out=%u] bi_size=%u\n",
+		bio_data_dir(ctx->bio_in) == WRITE ? "write" : "read",
+		n1, n2, cryptlen);
 
-		atomic_inc(&ctx->cc_pending);
+	dmreq->sg_in  = kcalloc(nents, sizeof(struct scatterlist), GFP_KERNEL);
+	dmreq->sg_out = kcalloc(nents, sizeof(struct scatterlist), GFP_KERNEL);
+	if (!dmreq->sg_in || !dmreq->sg_out) {
+		DMERR("dm-crypt: Failed to allocate scatterlist\n");
+		r = -ENOMEM;
+		goto end;
+	}
+	dmreq->ctx = ctx;
 
-		r = crypt_convert_block(cc, ctx, ctx->req);
+	sg_init_table(dmreq->sg_in, nents);
+	sg_init_table(dmreq->sg_out, nents);
 
-		switch (r) {
-		/*
-		 * The request was queued by a crypto driver
-		 * but the driver request queue is full, let's wait.
-		 */
-		case -EBUSY:
-			wait_for_completion(&ctx->restart);
-			reinit_completion(&ctx->restart);
-			/* fall through */
-		/*
-		 * The request is queued and processed asynchronously,
-		 * completion function kcryptd_async_done() will be called.
-		 */
-		case -EINPROGRESS:
-			ctx->req = NULL;
-			ctx->cc_sector++;
-			continue;
-		/*
-		 * The request was already processed (synchronously).
-		 */
-		case 0:
-			atomic_dec(&ctx->cc_pending);
-			ctx->cc_sector++;
-			cond_resched();
-			continue;
+	while (ctx->iter_in.bi_size && ctx->iter_out.bi_size && i < nents) {
+		bv_in = bio_iter_iovec(ctx->bio_in, ctx->iter_in);
+		bv_out = bio_iter_iovec(ctx->bio_out, ctx->iter_out);
 
-		/* There was an error while processing the request. */
-		default:
-			atomic_dec(&ctx->cc_pending);
-			return r;
-		}
+		sg_set_page(&dmreq->sg_in[i], bv_in.bv_page, bv_in.bv_len,
+			    bv_in.bv_offset);
+		sg_set_page(&dmreq->sg_out[i], bv_out.bv_page, bv_out.bv_len,
+			    bv_out.bv_offset);
+
+		bio_advance_iter(ctx->bio_in, &ctx->iter_in, bv_in.bv_len);
+		bio_advance_iter(ctx->bio_out, &ctx->iter_out, bv_out.bv_len);
+
+		bytes += bv_in.bv_len;
+		i++;
 	}
 
-	return 0;
+	DMDEBUG("dm-crypt: Processed %u of %u bytes\n", bytes, cryptlen);
+
+	rinfo.is_write = bio_data_dir(ctx->bio_in) == WRITE;
+	rinfo.iv_sector = ctx->cc_sector;
+	rinfo.nents = nents;
+	rinfo.iv = iv;
+
+	skcipher_request_set_crypt(req, dmreq->sg_in, dmreq->sg_out,
+				   bytes, &rinfo);
+
+	if (bio_data_dir(ctx->bio_in) == WRITE)
+		r = crypto_skcipher_encrypt(req);
+	else
+		r = crypto_skcipher_decrypt(req);
+
+	switch (r) {
+	/* The request was queued so wait. */
+	case -EBUSY:
+		wait_for_completion(&ctx->restart);
+		reinit_completion(&ctx->restart);
+		/* fall through */
+	/*
+	 * The request is queued and processed asynchronously,
+	 * completion function kcryptd_async_done() is called.
+	 */
+	case -EINPROGRESS:
+		ctx->req = NULL;
+		cond_resched();
+		break;
+	}
+end:
+	return r;
 }
 
+
 static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone);
 
 /*
@@ -1070,11 +1923,17 @@ static void crypt_dec_pending(struct dm_crypt_io *io)
 {
 	struct crypt_config *cc = io->cc;
 	struct bio *base_bio = io->base_bio;
+	struct dm_crypt_request *dmreq;
 	int error = io->error;
 
 	if (!atomic_dec_and_test(&io->io_pending))
 		return;
 
+	dmreq = dmreq_of_req(cc, io->ctx.req);
+	DMDEBUG("dm-crypt: Freeing scatterlists [sync]\n");
+	kfree(dmreq->sg_in);
+	kfree(dmreq->sg_out);
+
 	if (io->ctx.req)
 		crypt_free_req(cc, io->ctx.req, base_bio);
 
@@ -1313,7 +2172,7 @@ static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)
 	sector += bio_sectors(clone);
 
 	crypt_inc_pending(io);
-	r = crypt_convert(cc, &io->ctx);
+	r = crypt_convert_bio(cc, &io->ctx);
 	if (r)
 		io->error = -EIO;
 	crypt_finished = atomic_dec_and_test(&io->ctx.cc_pending);
@@ -1343,7 +2202,8 @@ static void kcryptd_crypt_read_convert(struct dm_crypt_io *io)
 	crypt_convert_init(cc, &io->ctx, io->base_bio, io->base_bio,
 			   io->sector);
 
-	r = crypt_convert(cc, &io->ctx);
+	r = crypt_convert_bio(cc, &io->ctx);
+
 	if (r < 0)
 		io->error = -EIO;
 
@@ -1371,12 +2231,13 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
 		return;
 	}
 
-	if (!error && cc->iv_gen_ops && cc->iv_gen_ops->post)
-		error = cc->iv_gen_ops->post(cc, iv_of_dmreq(cc, dmreq), dmreq);
-
 	if (error < 0)
 		io->error = -EIO;
 
+	DMDEBUG("dm-crypt: Freeing scatterlists and request struct [async]\n");
+	kfree(dmreq->sg_in);
+	kfree(dmreq->sg_out);
+
 	crypt_free_req(cc, req_of_dmreq(cc, dmreq), io->base_bio);
 
 	if (!atomic_dec_and_test(&ctx->cc_pending))
@@ -1430,62 +2291,38 @@ static int crypt_decode_key(u8 *key, char *hex, unsigned int size)
 	return 0;
 }
 
-static void crypt_free_tfms(struct crypt_config *cc)
+static void crypt_free_tfm(struct crypt_config *cc)
 {
-	unsigned i;
-
-	if (!cc->tfms)
+	if (!cc->tfm)
 		return;
 
-	for (i = 0; i < cc->tfms_count; i++)
-		if (cc->tfms[i] && !IS_ERR(cc->tfms[i])) {
-			crypto_free_skcipher(cc->tfms[i]);
-			cc->tfms[i] = NULL;
-		}
+	if (cc->tfm && !IS_ERR(cc->tfm))
+		crypto_free_skcipher(cc->tfm);
 
-	kfree(cc->tfms);
-	cc->tfms = NULL;
+	cc->tfm = NULL;
 }
 
-static int crypt_alloc_tfms(struct crypt_config *cc, char *ciphermode)
+static int crypt_alloc_tfm(struct crypt_config *cc, char *ciphermode)
 {
-	unsigned i;
 	int err;
 
-	cc->tfms = kzalloc(cc->tfms_count * sizeof(struct crypto_skcipher *),
-			   GFP_KERNEL);
-	if (!cc->tfms)
-		return -ENOMEM;
-
-	for (i = 0; i < cc->tfms_count; i++) {
-		cc->tfms[i] = crypto_alloc_skcipher(ciphermode, 0, 0);
-		if (IS_ERR(cc->tfms[i])) {
-			err = PTR_ERR(cc->tfms[i]);
-			crypt_free_tfms(cc);
-			return err;
-		}
+	cc->tfm = crypto_alloc_skcipher(ciphermode, 0, 0);
+	if (IS_ERR(cc->tfm)) {
+		err = PTR_ERR(cc->tfm);
+		crypt_free_tfm(cc);
+		return err;
 	}
 
 	return 0;
 }
 
-static int crypt_setkey(struct crypt_config *cc)
+static inline int crypt_setkey(struct crypt_config *cc, enum setkey_op keyop,
+			       char *ivopts)
 {
-	unsigned subkey_size;
-	int err = 0, i, r;
-
-	/* Ignore extra keys (which are used for IV etc) */
-	subkey_size = (cc->key_size - cc->key_extra_size) >> ilog2(cc->tfms_count);
-
-	for (i = 0; i < cc->tfms_count; i++) {
-		r = crypto_skcipher_setkey(cc->tfms[i],
-					   cc->key + (i * subkey_size),
-					   subkey_size);
-		if (r)
-			err = r;
-	}
+	DECLARE_GENIV_KEY(kinfo, keyop, cc->tfms_count, cc->key, cc->key_size,
+			  cc->key_parts, ivopts);
 
-	return err;
+	return crypto_skcipher_setkey(cc->tfm, (u8 *) &kinfo, sizeof(kinfo));
 }
 
 #ifdef CONFIG_KEYS
@@ -1498,7 +2335,9 @@ static bool contains_whitespace(const char *str)
 	return false;
 }
 
-static int crypt_set_keyring_key(struct crypt_config *cc, const char *key_string)
+static int crypt_set_keyring_key(struct crypt_config *cc,
+				 const char *key_string,
+				 enum setkey_op keyop, char *ivopts)
 {
 	char *new_key_string, *key_desc;
 	int ret;
@@ -1559,7 +2398,7 @@ static int crypt_set_keyring_key(struct crypt_config *cc, const char *key_string
 	/* clear the flag since following operations may invalidate previously valid key */
 	clear_bit(DM_CRYPT_KEY_VALID, &cc->flags);
 
-	ret = crypt_setkey(cc);
+	ret = crypt_setkey(cc, keyop, ivopts);
 
 	/* wipe the kernel key payload copy in each case */
 	memset(cc->key, 0, cc->key_size * sizeof(u8));
@@ -1599,7 +2438,9 @@ static int get_key_size(char **key_string)
 
 #else
 
-static int crypt_set_keyring_key(struct crypt_config *cc, const char *key_string)
+static int crypt_set_keyring_key(struct crypt_config *cc,
+				 const char *key_string,
+				 enum setkey_op keyop, char *ivopts)
 {
 	return -EINVAL;
 }
@@ -1611,7 +2452,8 @@ static int get_key_size(char **key_string)
 
 #endif
 
-static int crypt_set_key(struct crypt_config *cc, char *key)
+static int crypt_set_key(struct crypt_config *cc, enum setkey_op keyop,
+			 char *key, char *ivopts)
 {
 	int r = -EINVAL;
 	int key_string_len = strlen(key);
@@ -1622,7 +2464,7 @@ static int crypt_set_key(struct crypt_config *cc, char *key)
 
 	/* ':' means the key is in kernel keyring, short-circuit normal key processing */
 	if (key[0] == ':') {
-		r = crypt_set_keyring_key(cc, key + 1);
+		r = crypt_set_keyring_key(cc, key + 1, keyop, ivopts);
 		goto out;
 	}
 
@@ -1636,7 +2478,7 @@ static int crypt_set_key(struct crypt_config *cc, char *key)
 	if (cc->key_size && crypt_decode_key(cc->key, key, cc->key_size) < 0)
 		goto out;
 
-	r = crypt_setkey(cc);
+	r = crypt_setkey(cc, keyop, ivopts);
 	if (!r)
 		set_bit(DM_CRYPT_KEY_VALID, &cc->flags);
 
@@ -1647,6 +2489,17 @@ static int crypt_set_key(struct crypt_config *cc, char *key)
 	return r;
 }
 
+static int crypt_init_key(struct dm_target *ti, char *key, char *ivopts)
+{
+	struct crypt_config *cc = ti->private;
+	int ret;
+
+	ret = crypt_set_key(cc, SETKEY_OP_INIT, key, ivopts);
+	if (ret < 0)
+		ti->error = "Error decoding and setting key";
+	return ret;
+}
+
 static int crypt_wipe_key(struct crypt_config *cc)
 {
 	clear_bit(DM_CRYPT_KEY_VALID, &cc->flags);
@@ -1654,7 +2507,7 @@ static int crypt_wipe_key(struct crypt_config *cc)
 	kzfree(cc->key_string);
 	cc->key_string = NULL;
 
-	return crypt_setkey(cc);
+	return crypt_setkey(cc, SETKEY_OP_WIPE, NULL);
 }
 
 static void crypt_dtr(struct dm_target *ti)
@@ -1674,7 +2527,7 @@ static void crypt_dtr(struct dm_target *ti)
 	if (cc->crypt_queue)
 		destroy_workqueue(cc->crypt_queue);
 
-	crypt_free_tfms(cc);
+	crypt_free_tfm(cc);
 
 	if (cc->bs)
 		bioset_free(cc->bs);
@@ -1682,9 +2535,6 @@ static void crypt_dtr(struct dm_target *ti)
 	mempool_destroy(cc->page_pool);
 	mempool_destroy(cc->req_pool);
 
-	if (cc->iv_gen_ops && cc->iv_gen_ops->dtr)
-		cc->iv_gen_ops->dtr(cc);
-
 	if (cc->dev)
 		dm_put_device(ti, cc->dev);
 
@@ -1762,22 +2612,30 @@ static int crypt_ctr_cipher(struct dm_target *ti,
 	if (!cipher_api)
 		goto bad_mem;
 
-	ret = snprintf(cipher_api, CRYPTO_MAX_ALG_NAME,
-		       "%s(%s)", chainmode, cipher);
+create_cipher:
+	/* For those ciphers which do not support IVs,
+	 * use the 'null' template cipher
+	 */
+
+	if (!ivmode)
+		ivmode = "null";
+
+	ret = snprintf(cipher_api, CRYPTO_MAX_ALG_NAME, "%s(%s(%s))",
+		       ivmode, chainmode, cipher);
 	if (ret < 0) {
 		kfree(cipher_api);
 		goto bad_mem;
 	}
 
 	/* Allocate cipher */
-	ret = crypt_alloc_tfms(cc, cipher_api);
+	ret = crypt_alloc_tfm(cc, cipher_api);
 	if (ret < 0) {
 		ti->error = "Error allocating crypto tfm";
 		goto bad;
 	}
 
 	/* Initialize IV */
-	cc->iv_size = crypto_skcipher_ivsize(any_tfm(cc));
+	cc->iv_size = crypto_skcipher_ivsize(cc->tfm);
 	if (cc->iv_size)
 		/* at least a 64 bit sector number should fit in our buffer */
 		cc->iv_size = max(cc->iv_size,
@@ -1785,23 +2643,10 @@ static int crypt_ctr_cipher(struct dm_target *ti,
 	else if (ivmode) {
 		DMWARN("Selected cipher does not support IVs");
 		ivmode = NULL;
+		goto create_cipher;
 	}
 
-	/* Choose ivmode, see comments at iv code. */
-	if (ivmode == NULL)
-		cc->iv_gen_ops = NULL;
-	else if (strcmp(ivmode, "plain") == 0)
-		cc->iv_gen_ops = &crypt_iv_plain_ops;
-	else if (strcmp(ivmode, "plain64") == 0)
-		cc->iv_gen_ops = &crypt_iv_plain64_ops;
-	else if (strcmp(ivmode, "essiv") == 0)
-		cc->iv_gen_ops = &crypt_iv_essiv_ops;
-	else if (strcmp(ivmode, "benbi") == 0)
-		cc->iv_gen_ops = &crypt_iv_benbi_ops;
-	else if (strcmp(ivmode, "null") == 0)
-		cc->iv_gen_ops = &crypt_iv_null_ops;
-	else if (strcmp(ivmode, "lmk") == 0) {
-		cc->iv_gen_ops = &crypt_iv_lmk_ops;
+	if (strcmp(ivmode, "lmk") == 0) {
 		/*
 		 * Version 2 and 3 is recognised according
 		 * to length of provided multi-key string.
@@ -1813,39 +2658,14 @@ static int crypt_ctr_cipher(struct dm_target *ti,
 			cc->key_extra_size = cc->key_size / cc->key_parts;
 		}
 	} else if (strcmp(ivmode, "tcw") == 0) {
-		cc->iv_gen_ops = &crypt_iv_tcw_ops;
 		cc->key_parts += 2; /* IV + whitening */
 		cc->key_extra_size = cc->iv_size + TCW_WHITENING_SIZE;
-	} else {
-		ret = -EINVAL;
-		ti->error = "Invalid IV mode";
-		goto bad;
 	}
 
 	/* Initialize and set key */
-	ret = crypt_set_key(cc, key);
-	if (ret < 0) {
-		ti->error = "Error decoding and setting key";
+	ret = crypt_init_key(ti, key, ivopts);
+	if (ret < 0)
 		goto bad;
-	}
-
-	/* Allocate IV */
-	if (cc->iv_gen_ops && cc->iv_gen_ops->ctr) {
-		ret = cc->iv_gen_ops->ctr(cc, ti, ivopts);
-		if (ret < 0) {
-			ti->error = "Error creating IV";
-			goto bad;
-		}
-	}
-
-	/* Initialize IV (set keys for ESSIV etc) */
-	if (cc->iv_gen_ops && cc->iv_gen_ops->init) {
-		ret = cc->iv_gen_ops->init(cc);
-		if (ret < 0) {
-			ti->error = "Error initialising IV";
-			goto bad;
-		}
-	}
 
 	ret = 0;
 bad:
@@ -1901,20 +2721,20 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 		goto bad;
 
 	cc->dmreq_start = sizeof(struct skcipher_request);
-	cc->dmreq_start += crypto_skcipher_reqsize(any_tfm(cc));
+	cc->dmreq_start += crypto_skcipher_reqsize(cc->tfm);
 	cc->dmreq_start = ALIGN(cc->dmreq_start, __alignof__(struct dm_crypt_request));
 
-	if (crypto_skcipher_alignmask(any_tfm(cc)) < CRYPTO_MINALIGN) {
+	if (crypto_skcipher_alignmask(cc->tfm) < CRYPTO_MINALIGN) {
 		/* Allocate the padding exactly */
 		iv_size_padding = -(cc->dmreq_start + sizeof(struct dm_crypt_request))
-				& crypto_skcipher_alignmask(any_tfm(cc));
+				& crypto_skcipher_alignmask(cc->tfm);
 	} else {
 		/*
 		 * If the cipher requires greater alignment than kmalloc
 		 * alignment, we don't know the exact position of the
 		 * initialization vector. We must assume worst case.
 		 */
-		iv_size_padding = crypto_skcipher_alignmask(any_tfm(cc));
+		iv_size_padding = crypto_skcipher_alignmask(cc->tfm);
 	}
 
 	ret = -ENOMEM;
@@ -2072,8 +2892,9 @@ static int crypt_map(struct dm_target *ti, struct bio *bio)
 	if (bio_data_dir(io->base_bio) == READ) {
 		if (kcryptd_io_read(io, GFP_NOWAIT))
 			kcryptd_queue_read(io);
-	} else
+	} else {
 		kcryptd_queue_crypt(io);
+	}
 
 	return DM_MAPIO_SUBMITTED;
 }
@@ -2155,7 +2976,7 @@ static void crypt_resume(struct dm_target *ti)
 static int crypt_message(struct dm_target *ti, unsigned argc, char **argv)
 {
 	struct crypt_config *cc = ti->private;
-	int key_size, ret = -EINVAL;
+	int key_size;
 
 	if (argc < 2)
 		goto error;
@@ -2173,19 +2994,9 @@ static int crypt_message(struct dm_target *ti, unsigned argc, char **argv)
 				return -EINVAL;
 			}
 
-			ret = crypt_set_key(cc, argv[2]);
-			if (ret)
-				return ret;
-			if (cc->iv_gen_ops && cc->iv_gen_ops->init)
-				ret = cc->iv_gen_ops->init(cc);
-			return ret;
+			return crypt_set_key(cc, SETKEY_OP_SET, argv[2], NULL);
 		}
 		if (argc == 2 && !strcasecmp(argv[1], "wipe")) {
-			if (cc->iv_gen_ops && cc->iv_gen_ops->wipe) {
-				ret = cc->iv_gen_ops->wipe(cc);
-				if (ret)
-					return ret;
-			}
 			return crypt_wipe_key(cc);
 		}
 	}
@@ -2216,7 +3027,7 @@ static void crypt_io_hints(struct dm_target *ti, struct queue_limits *limits)
 
 static struct target_type crypt_target = {
 	.name   = "crypt",
-	.version = {1, 15, 0},
+	.version = {1, 16, 0},
 	.module = THIS_MODULE,
 	.ctr    = crypt_ctr,
 	.dtr    = crypt_dtr,
@@ -2234,6 +3045,7 @@ static int __init dm_crypt_init(void)
 {
 	int r;
 
+	geniv_register_algs();
 	r = dm_register_target(&crypt_target);
 	if (r < 0)
 		DMERR("register failed %d", r);
@@ -2244,6 +3056,7 @@ static int __init dm_crypt_init(void)
 static void __exit dm_crypt_exit(void)
 {
 	dm_unregister_target(&crypt_target);
+	geniv_deregister_algs();
 }
 
 module_init(dm_crypt_init);
diff --git a/include/crypto/geniv.h b/include/crypto/geniv.h
new file mode 100644
index 0000000..b472507
--- /dev/null
+++ b/include/crypto/geniv.h
@@ -0,0 +1,47 @@
+/*
+ * geniv: common interface for IV generation algorithms
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+#ifndef _CRYPTO_GENIV_
+#define _CRYPTO_GENIV_
+
+#define SECTOR_SIZE		(1 << SECTOR_SHIFT)
+
+enum setkey_op {
+	SETKEY_OP_INIT,
+	SETKEY_OP_SET,
+	SETKEY_OP_WIPE,
+};
+
+struct geniv_key_info {
+	enum setkey_op keyop;
+	unsigned int tfms_count;
+	u8 *key;
+	unsigned int key_size;
+	unsigned int key_parts;
+	char *ivopts;
+};
+
+#define DECLARE_GENIV_KEY(c, op, n, k, sz, kp, opts)	\
+	struct geniv_key_info c = {			\
+		.keyop = op,				\
+		.tfms_count = n,			\
+		.key = k,				\
+		.key_size = sz,				\
+		.key_parts = kp,			\
+		.ivopts = opts,				\
+	}
+
+struct geniv_req_info {
+	bool is_write;
+	sector_t iv_sector;
+	unsigned int nents;
+	u8 *iv;
+};
+
+#endif
-- 
Binoy Jayan

^ permalink raw reply related

* Re: [RFC PATCH v3] crypto: Add IV generation algorithms
From: Gilad Ben-Yossef @ 2017-01-18 15:21 UTC (permalink / raw)
  To: Binoy Jayan
  Cc: Oded, Ofir, Herbert Xu, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, linux-kernel, Alasdair Kergon, Mike Snitzer,
	dm-devel, Shaohua Li, linux-raid, Rajendra, Milan Broz
In-Reply-To: <1484732425-10319-2-git-send-email-binoy.jayan@linaro.org>

[-- Attachment #1: Type: text/plain, Size: 4693 bytes --]

Hi Binoy,


On Wed, Jan 18, 2017 at 11:40 AM, Binoy Jayan <binoy.jayan@linaro.org> wrote:
> Currently, the iv generation algorithms are implemented in dm-crypt.c.
> The goal is to move these algorithms from the dm layer to the kernel
> crypto layer by implementing them as template ciphers so they can be
> implemented in hardware for performance. As part of this patchset, the
> iv-generation code is moved from the dm layer to the crypto layer and
> adapt the dm-layer to send a whole 'bio' (as defined in the block layer)
> at a time. Each bio contains an in memory representation of physically
> contiguous disk blocks. The dm layer sets up a chained scatterlist of
> these blocks split into physically contiguous segments in memory so that
> DMA can be performed. Also, the key management code is moved from dm layer
> to the cryto layer since the key selection for encrypting neighboring
> sectors depend on the keycount.
>
> Synchronous crypto requests to encrypt/decrypt a sector are processed
> sequentially. Asynchronous requests if processed in parallel, are freed
> in the async callback. The dm layer allocates space for iv. The hardware
> implementations can choose to make use of this space to generate their IVs
> sequentially or allocate it on their own.
> Interface to the crypto layer - include/crypto/geniv.h
>
> Signed-off-by: Binoy Jayan <binoy.jayan@linaro.org>
> ---

I have some review comments and a bug report -

<snip>

>   */
> -static int crypt_convert(struct crypt_config *cc,
> -                        struct convert_context *ctx)
> +
> +static int crypt_convert_bio(struct crypt_config *cc,
> +                            struct convert_context *ctx)
>  {
> +       unsigned int cryptlen, n1, n2, nents, i = 0, bytes = 0;
> +       struct skcipher_request *req;
> +       struct dm_crypt_request *dmreq;
> +       struct geniv_req_info rinfo;
> +       struct bio_vec bv_in, bv_out;
>         int r;
> +       u8 *iv;
>
>         atomic_set(&ctx->cc_pending, 1);
> +       crypt_alloc_req(cc, ctx);
> +
> +       req = ctx->req;
> +       dmreq = dmreq_of_req(cc, req);
> +       iv = iv_of_dmreq(cc, dmreq);
>
> -       while (ctx->iter_in.bi_size && ctx->iter_out.bi_size) {
> +       n1 = bio_segments(ctx->bio_in);
> +       n2 = bio_segments(ctx->bio_in);


I'm pretty sure this needs to be

 n2 = bio_segments(ctx->bio_out);

> +       nents = n1 > n2 ? n1 : n2;
> +       nents = nents > MAX_SG_LIST ? MAX_SG_LIST : nents;
> +       cryptlen = ctx->iter_in.bi_size;
>
> -               crypt_alloc_req(cc, ctx);
> +       DMDEBUG("dm-crypt:%s: segments:[in=%u, out=%u] bi_size=%u\n",
> +               bio_data_dir(ctx->bio_in) == WRITE ? "write" : "read",
> +               n1, n2, cryptlen);
>
<Snip>

>
> -               /* There was an error while processing the request. */
> -               default:
> -                       atomic_dec(&ctx->cc_pending);
> -                       return r;
> -               }
> +               sg_set_page(&dmreq->sg_in[i], bv_in.bv_page, bv_in.bv_len,
> +                           bv_in.bv_offset);
> +               sg_set_page(&dmreq->sg_out[i], bv_out.bv_page, bv_out.bv_len,
> +                           bv_out.bv_offset);
> +
> +               bio_advance_iter(ctx->bio_in, &ctx->iter_in, bv_in.bv_len);
> +               bio_advance_iter(ctx->bio_out, &ctx->iter_out, bv_out.bv_len);
> +
> +               bytes += bv_in.bv_len;
> +               i++;
>         }
>
> -       return 0;
> +       DMDEBUG("dm-crypt: Processed %u of %u bytes\n", bytes, cryptlen);
> +
> +       rinfo.is_write = bio_data_dir(ctx->bio_in) == WRITE;

Please consider wrapping the above boolean expression in parenthesis.


> +       rinfo.iv_sector = ctx->cc_sector;
> +       rinfo.nents = nents;
> +       rinfo.iv = iv;
> +
> +       skcipher_request_set_crypt(req, dmreq->sg_in, dmreq->sg_out,

Also, where do the scatterlist src2 and dst2 that you use
sg_set_page() get sg_init_table() called on?
I couldn't figure it out...

Last but not least, when performing the following sequence on Arm64
(on latest Qemu Virt platform) -

1. cryptsetup luksFormat fs3.img
2. cryptsetup open --type luks fs3.img croot
3. mke2fs /dev/mapper/croot


[ fs3.img is a 16MB file for loopback ]

The attached kernel panic happens. The same does not occur without the patch.

Let me know if you need any additional information to recreate it.

I've tried to debug it a little but did not came up with anything
useful aside from above review notes.

Thanks!


-- 
Gilad Ben-Yossef
Chief Coffee Drinker

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

[-- Attachment #2: panic.txt --]
[-- Type: text/plain, Size: 8333 bytes --]

# 
# 
# mk
mkdir     mkdosfs   mke2fs    mkfifo    mknod     mkpasswd  mkswap    mktemp
# mk
mkdir     mkdosfs   mke2fs    mkfifo    mknod     mkpasswd  mkswap    mktemp
# mke2fs /dev/mapper/croot 
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
3584 inodes, 14336 blocks
716 blocks (5%) reserved for the super user
First data block=1
Maximum filesystem blocks=262144
2 block groups
8192 blocks per group, 8192 fragments per group
1792 inodes per group
Superblock backups stored on blocks:
	8193
[ 1288.475373] Unable to handle kernel paging request at virtual address 00001028
[ 1288.483568] pgd = ffff0000093ed000
[ 1288.483708] [00001028] *pgd=00000000beffe003, *pud=00000000beffd003, *pmd=0000000000000000
[ 1288.484071] Internal error: Oops: 96000006 [#1] PREEMPT SMP
[ 1288.484266] Modules linked in:
[ 1288.484526] CPU: 0 PID: 16 Comm: kworker/u2:1 Not tainted 4.10.0-rc4-00805-g9a4c309 #32
[ 1288.484712] Hardware name: linux,dummy-virt (DT)
[ 1288.485233] Workqueue: kcryptd kcryptd_crypt
[ 1288.485385] task: ffff80007c4abe80 task.stack: ffff80007c4dc000
[ 1288.485572] PC is at scatterwalk_copychunks+0x144/0x1e8
[ 1288.485750] LR is at scatterwalk_copychunks+0xdc/0x1e8
[ 1288.485904] pc : [<ffff00000834f46c>] lr : [<ffff00000834f404>] pstate: 20000145
[ 1288.486115] sp : ffff80007c4dfa50
[ 1288.486264] x29: ffff80007c4dfa50 x28: 0000000000000001 
[ 1288.486539] x27: ffff80007c4abe80 x26: ffff80007bacab81 
[ 1288.486709] x25: ffff80007c4dfb70 x24: 000000000000000f 
[ 1288.486850] x23: 0000000000000001 x22: 0000000000000001 
[ 1288.486991] x21: 000081ffffffffff x20: ffff80007c4abe80 
[ 1288.487362] x19: ffff80007c4abe80 x18: 0000000000000001 
[ 1288.487544] x17: 00000000004b3768 x16: ffff0000081ec678 
[ 1288.487696] x15: ffffffffffffffff x14: ffff80007bab7c88 
[ 1288.487846] x13: 0000000100000000 x12: 0000000000000000 
[ 1288.487995] x11: 0000000000000010 x10: 0000000000000200 
[ 1288.488178] x9 : 0000000000000000 x8 : 0000000000000000 
[ 1288.488339] x7 : 0000000000000001 x6 : ffff800000040201 
[ 1288.488540] x5 : ffff80007ba85b68 x4 : 0000000000001008 
[ 1288.488690] x3 : 0000000000001008 x2 : 0000000000001008 
[ 1288.488869] x1 : 0000000000000001 x0 : ffff80007ba838a0 
[ 1288.489187] 
[ 1288.489350] Process kworker/u2:1 (pid: 16, stack limit = 0xffff80007c4dc000)
[ 1288.489646] Stack: (0xffff80007c4dfa50 to 0xffff80007c4e0000)
[ 1288.489881] fa40:                                   ffff80007c4dfac0 ffff0000083525d8
[ 1288.490134] fa60: ffff80007c4dfb38 0000000000000000 00000000000001f0 ffff80007ba859f0
[ 1288.490329] fa80: ffff80007babe900 0000000000000000 ffff80007b45adc8 ffff000008cec000
[ 1288.490548] faa0: ffff80007b45ad98 0000000000000200 ffff80007c4dfb38 0000000100000001
[ 1288.490764] fac0: ffff80007c4dfb00 ffff0000080b9774 000000000000000a ffff80007ba85ae8
[ 1288.490962] fae0: 0000000000000000 ffff80007ba859f0 ffff80007babe900 ffff0000080b9764
[ 1288.491155] fb00: ffff80007c4dfbd0 ffff000008391840 ffff80007ba83880 ffff80007babea80
[ 1288.491372] fb20: 0000000000000008 ffff80007b45ad00 ffff80007c4dfb60 00000000ffffffc8
[ 1288.491586] fb40: ffff80007bacab80 ffff80007c4dfba0 ffff80007bacab80 ffff80007ba83880
[ 1288.491798] fb60: ffff80007b45ae90 0000000000000010 ffff80007ba838a0 0000000000000001
[ 1288.491991] fb80: 0000000000000200 0000000000000000 0000000000001000 0000000000000000
[ 1288.492183] fba0: ffff80007bacab80 ffff80007b45ae80 ffff80007b45ae80 ffff800200000010
[ 1288.492394] fbc0: 0000001000000010 ffff800000000007 ffff80007c4dfbf0 ffff0000087969c4
[ 1288.492609] fbe0: ffff80007b45ad80 ffff80007ba83800 ffff80007c4dfc70 ffff000008796688
[ 1288.492836] fc00: 0000000000000000 ffff80007b45ad00 0000000000001000 fffffffffffffff8
[ 1288.493020] fc20: ffff80007b45ac78 ffff80007b45ac30 ffff80007b45ae60 0000000000001000
[ 1288.493213] fc40: 0000000000000001 0000000000000001 ffff000009373f75 ffff80007b45ada8
[ 1288.493482] fc60: 0000000000000001 0000020000000000 ffff80007c4dfd20 ffff000008797480
[ 1288.495928] fc80: ffff80007ba83f00 ffff80007b45ac30 ffff80007b45ac10 ffff80007c408000
[ 1288.496370] fca0: 0000000000000000 ffff80007b45ac10 ffff0000092a7edb ffff80007b45ac00
[ 1288.496762] fcc0: ffff80007c408078 ffff80007c4082a8 0000000000000001 ffff80007ba09800
[ 1288.497182] fce0: ffff00000924b000 ffff0000080e6e10 ffff80007c4dfd80 ffff000000001000
[ 1288.497590] fd00: ffff80007efdf800 0000000000000000 ffff000000000001 ffff80007b45ae80
[ 1288.498014] fd20: ffff80007c4dfdc0 ffff0000080d9d94 0000000000000020 ffff80007c4c5800
[ 1288.498412] fd40: ffff80007ba7dd00 ffff80007c408000 0000000000000000 ffff80007b45ac10
[ 1288.498787] fd60: ffff0000092a7edb ffff80007c408020 ffff80007c408078 ffff80007c4082a8
[ 1288.499147] fd80: ffff80007c4dfde0 ffff000008971800 ffff80007c4abe80 ffff80007c408000
[ 1288.499373] fda0: ffff80007c4c5830 ffff80007c408020 ffff000009287000 ffff80007b45ac30
[ 1288.499571] fdc0: ffff80007c4dfe00 ffff0000080d9f80 ffff80007c4c5800 ffff80007c408000
[ 1288.499766] fde0: ffff80007c4c5830 ffff80007c408020 ffff000009287000 ffff80007c4abe80
[ 1288.499963] fe00: ffff80007c4dfe60 ffff0000080dfe30 ffff80007c4c3b00 ffff80007c4c3680
[ 1288.500159] fe20: ffff00000938fb08 ffff80007c4abe80 ffff000008c50570 ffff80007c4c5800
[ 1288.500373] fe40: ffff0000080d9f38 ffff80007c4c3b38 ffff80007c487d20 0000000000000000
[ 1288.500569] fe60: 0000000000000000 ffff000008082ec0 ffff0000080dfd40 ffff80007c4c3680
[ 1288.500777] fe80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1288.500975] fea0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1288.501204] fec0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1288.501401] fee0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1288.501621] ff00: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1288.501879] ff20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1288.502115] ff40: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1288.502329] ff60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1288.502530] ff80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1288.502742] ffa0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1288.502996] ffc0: 0000000000000000 0000000000000005 0000000000000000 0000000000000000
[ 1288.503304] ffe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1288.503761] Call trace:
[ 1288.504274] Exception stack(0xffff80007c4df880 to 0xffff80007c4df9b0)
[ 1288.504572] f880: ffff80007c4abe80 0001000000000000 ffff80007c4dfa50 ffff00000834f46c
[ 1288.504771] f8a0: ffff80007c4df8d0 ffff0000080f410c ffff80007ba09800 ffff80007efdf8e0
[ 1288.504968] f8c0: ffff80007c4df8d0 ffff0000080f4124 ffff80007c4df8e0 ffff0000083ca188
[ 1288.505239] f8e0: ffff80007c4df960 ffff0000083cce80 ffff000009395375 ffff0000089e5510
[ 1288.505439] f900: 0000000000000020 00000000000003e0 00000000fffffff8 ffff000009395328
[ 1288.505693] f920: ffff80007ba838a0 0000000000000001 0000000000001008 0000000000001008
[ 1288.505878] f940: 0000000000001008 ffff80007ba85b68 ffff800000040201 0000000000000001
[ 1288.506150] f960: 0000000000000000 0000000000000000 0000000000000200 0000000000000010
[ 1288.506455] f980: 0000000000000000 0000000100000000 ffff80007bab7c88 ffffffffffffffff
[ 1288.507201] f9a0: ffff0000081ec678 00000000004b3768
[ 1288.507466] [<ffff00000834f46c>] scatterwalk_copychunks+0x144/0x1e8
[ 1288.507627] [<ffff0000083525d8>] skcipher_walk_done+0x250/0x2b0
[ 1288.507781] [<ffff0000080b9774>] xts_decrypt+0x84/0xb0
[ 1288.507913] [<ffff000008391840>] simd_skcipher_decrypt+0x70/0xa8
[ 1288.508088] [<ffff0000087969c4>] geniv_decrypt+0x1b4/0x320
[ 1288.508226] [<ffff000008796688>] crypt_convert_bio+0x790/0x7e0
[ 1288.508368] [<ffff000008797480>] kcryptd_crypt+0x2f0/0x358
[ 1288.508504] [<ffff0000080d9d94>] process_one_work+0x1bc/0x360
[ 1288.508646] [<ffff0000080d9f80>] worker_thread+0x48/0x480
[ 1288.508778] [<ffff0000080dfe30>] kthread+0xf0/0x120
[ 1288.508905] [<ffff000008082ec0>] ret_from_fork+0x10/0x50
[ 1288.509154] Code: d34c7c42 f9400003 927ef463 8b021862 (f9401044) 
[ 1288.509756] ---[ end trace 8de15ab91f16458a ]---
[ 1288.509951] note: kworker/u2:1[16] exited with preempt_count 1


^ permalink raw reply

* Re: [patch] block: add blktrace C events for bio-based drivers
From: Jeff Moyer @ 2017-01-18 16:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-raid, snitzer, linux-kernel, linux-block, dm-devel, shli,
	hch, agk
In-Reply-To: <f3044f42-3134-c577-068f-bd2750528b32@kernel.dk>

Hi, Jens,

Jens Axboe <axboe@kernel.dk> writes:

> I like the change, hate the naming. I'd prefer one of two things:
>
> - Add bio_endio_complete() instead. That name sucks too, the
>   important part is flipping the __name() to have a trace
>   version instead.

ITYM a notrace version.  By default, we want tracing for bio_endio.  The
only callers that need the inverse of that are in the request-based
path, and there are only 2 of them.

> - Mark the bio as trace completed, and keep the naming. Since
>   it's only off the completion path, that can be just marking
>   the bi_flags non-atomically.

One issue with this is in generic_make_request_checks, where we can call
bio_endio without having called trace_block_bio_queue (so you could get
a C event with no corresponding Q).  To address that, we could make the
flag indicate that trace_block_bio_queue was performed, and clear it in
bio_complete, like so:

	if (test_and_clear_bit(BIO_QUEUE_TRACED, &bio->bi_flags))
		trace_block_bio_complete(...);

That would solve the problem of duplicate completions, but requires
setting the flag in the submission path and clearing it in the
completion path.  I think the former can be done with just a
bio_set_flag (i.e. non-atomic), right?  Of course, where to stick that
bio_set_flag call is another bike-shedding discussion waiting to happen
(i.e. does it go in the tracepoint itself?).

Alternatively, we could set the trace_completed flag in the paths where
we end I/O without having done the trace_block_bio_queue, but that seems
way uglier to me.

Can you think of any other options?  If we're choosing from the above,
my preference is for adding the bio_endio_notrace(), since it's so much
simpler.

Cheers,
Jeff

^ permalink raw reply

* Re: performance of raid5 on fast devices
From: Jake Yao @ 2017-01-18 19:25 UTC (permalink / raw)
  To: Heinz Mauelshagen; +Cc: Roman Mamedov, linux-raid
In-Reply-To: <27fb7a94-27d0-b319-30b8-daf4dab97415@redhat.com>

It is interesting. I do not see the similar behavior with the change
of group_thread_cnt.

The raid5 I have is following:

md125 : active raid5 nvme0n1p1[0] nvme2n1p1[2] nvme1n1p1[1] nvme3n1p1[4]
      943325184 blocks super 1.2 level 5, 32k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 0/3 pages [0KB], 65536KB chunk

/dev/md125:
        Version : 1.2
  Creation Time : Thu Dec 15 20:11:46 2016
     Raid Level : raid5
     Array Size : 943325184 (899.63 GiB 965.96 GB)
  Used Dev Size : 314441728 (299.88 GiB 321.99 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Wed Jan 18 16:24:52 2017
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 32K

           Name : localhost:nvme  (local to host localhost)
           UUID : 477a94af:79f5a10a:0d513dc6:7f5e670d
         Events : 108

    Number   Major   Minor   RaidDevice State
       0     259        6        0      active sync   /dev/nvme0n1p1
       1     259        8        1      active sync   /dev/nvme1n1p1
       2     259        9        2      active sync   /dev/nvme2n1p1
       4     259        1        3      active sync   /dev/nvme3n1p1

The fio config is:

[global]
ioengine=libaio
iodepth=64
bs=96K
direct=1
thread=1
time_based=1
runtime=20
numjobs=1
loops=1
group_reporting=1
exitall

[nvme_md_wrt]
rw=write
filename=/dev/md125

[nvme_single_wrt]
rw=write
filename=/dev/nvme1n1p2

With changing group_thread_cnt, I got following:

0 -> WRITE: io=40643MB, aggrb=2031.1MB/s, minb=2031.1MB/s,
maxb=2031.1MB/s, mint=20002msec, maxt=20002msec
1 -> WRITE: io=43740MB, aggrb=2186.7MB/s, minb=2186.7MB/s,
maxb=2186.7MB/s, mint=20003msec, maxt=20003msec
2 -> WRITE: io=43805MB, aggrb=2189.1MB/s, minb=2189.1MB/s,
maxb=2189.1MB/s, mint=20003msec, maxt=20003msec
3 -> WRITE: io=43763MB, aggrb=2187.9MB/s, minb=2187.9MB/s,
maxb=2187.9MB/s, mint=20003msec, maxt=20003msec
4 -> WRITE: io=43767MB, aggrb=2188.2MB/s, minb=2188.2MB/s,
maxb=2188.2MB/s, mint=20002msec, maxt=20002msec
5 -> WRITE: io=43767MB, aggrb=2188.4MB/s, minb=2188.4MB/s,
maxb=2188.4MB/s, mint=20003msec, maxt=20003msec
6 -> WRITE: io=43776MB, aggrb=2188.5MB/s, minb=2188.5MB/s,
maxb=2188.5MB/s, mint=20003msec, maxt=20003msec
7 -> WRITE: io=43758MB, aggrb=2187.6MB/s, minb=2187.6MB/s,
maxb=2187.6MB/s, mint=20003msec, maxt=20003msec
8 -> WRITE: io=43766MB, aggrb=2187.1MB/s, minb=2187.1MB/s,
maxb=2187.1MB/s, mint=20003msec, maxt=20003msec

In the test run,  the md125_raid5 kernel thread running close to 100%
during the test, and all the kworker threads at around 10%

My system is a VM with 6 cpus running on ESXi with NVMe drives passthru.

I am wondering why the difference.

Thanks!


On Tue, Jan 17, 2017 at 4:04 PM, Heinz Mauelshagen <heinzm@redhat.com> wrote:
> Jake et al,
>
> I took the oportunity to measure raid5 on a 4x NVME here with
> variations of group_thread_cnt={0..10} minimal
> stripe_cache_size={256,512,1024,2048,4096,8192,16384,32768}
>
> This is on an X-99 with Intel E5-2640 and kernel 4.9.3-200.fc25.x86_64.
>
> Highest active stripe count logged < 17K.
>
>
> fio job/sections used:
> ----------------------------
> [r-md0]
> ioengine=libaio
> iodepth=40
> rw=read
> bs=4096K
> direct=1
> size=4G
> numjobs=8
> filename=/dev/md0
>
> [w-md0]
> ioengine=libaio
> iodepth=40
> rw=write
> bs=4096K
> direct=1
> size=4G
> numjobs=8
> filename=/dev/md0
>
>
> Baseline performance seen with raid0:
> ---------------------------------------------------
> md0 : active raid0 dm-350[3] dm-349[2] dm-348[1] dm-347[0]
>       33521664 blocks super 1.2 32k chunks
>
> READ: io=32768MB, aggrb=8202.3MB/s, minb=1025.3MB/s, maxb=1217.7MB/s,
> mint=3364msec, maxt=3995msec
> WRITE: io=32768MB, aggrb=5746.8MB/s, minb=735584KB/s, maxb=836685KB/s,
> mint=5013msec, maxt=5702msec
>
>
> Performance with raid5:
> --------------------------------
> md0 : active raid5 dm-350[3] dm-349[2] dm-348[1] dm-347[0]
>       25141248 blocks super 1.2 level 5, 32k chunk, algorithm 2 [4/4] [UUUU]
>
>
> READ: io=32768MB, aggrb=7375.3MB/s, minb=944025KB/s, maxb=1001.1MB/s,
> mint=4088msec, maxt=4443msec
>
>
> Write results for group_thread_cnt/stripe_cache_size variations:
> ------------------------------------------------------------------------------------
> 0/256  -> WRITE: io=32768MB, aggrb=1296.4MB/s, minb=165927KB/s,
> maxb=167644KB/s, mint=25019msec, maxt=25278msec
> 1/256  -> WRITE: io=32768MB, aggrb=2152.6MB/s, minb=275524KB/s,
> maxb=278654KB/s, mint=15052msec, maxt=15223msec
> 2/256  -> WRITE: io=32768MB, aggrb=3177.4MB/s, minb=406700KB/s,
> maxb=415854KB/s, mint=10086msec, maxt=10313msec
> 3/256  -> WRITE: io=32768MB, aggrb=4026.6MB/s, minb=515397KB/s,
> maxb=524222KB/s, mint=8001msec, maxt=8138msec
> 4/256  -> WRITE: io=32768MB, aggrb=4172.2MB/s, minb=534034KB/s,
> maxb=552609KB/s, mint=7590msec, maxt=7854msec  *
> 5/256  -> WRITE: io=32768MB, aggrb=4166.9MB/s, minb=533355KB/s,
> maxb=547845KB/s, mint=7656msec, maxt=7864msec
> 6/256  -> WRITE: io=32768MB, aggrb=4189.3MB/s, minb=536218KB/s,
> maxb=556126KB/s, mint=7542msec, maxt=7822msec
> 7/256  -> WRITE: io=32768MB, aggrb=4192.5MB/s, minb=536630KB/s,
> maxb=560810KB/s, mint=7479msec, maxt=7816msec
> 8/256  -> WRITE: io=32768MB, aggrb=4185.2MB/s, minb=535807KB/s,
> maxb=562389KB/s, mint=7458msec, maxt=7828msec
> 9/256  -> WRITE: io=32768MB, aggrb=4192.1MB/s, minb=536699KB/s,
> maxb=577966KB/s, mint=7257msec, maxt=7815msec
> 10/256 -> WRITE: io=32768MB, aggrb=4182.3MB/s, minb=535329KB/s,
> maxb=568256KB/s, mint=7381msec, maxt=7835msec
>
> 0/512 -> WRITE: io=32768MB, aggrb=1297.8MB/s, minb=166025KB/s,
> maxb=167664KB/s, mint=25016msec, maxt=25263msec
> 1/512 -> WRITE: io=32768MB, aggrb=2148.5MB/s, minb=275000KB/s,
> maxb=278044KB/s, mint=15085msec, maxt=15252msec
> 2/512 -> WRITE: io=32768MB, aggrb=3158.4MB/s, minb=404270KB/s,
> maxb=411407KB/s, mint=10195msec, maxt=10375msec
> 3/512 -> WRITE: io=32768MB, aggrb=4102.7MB/s, minb=525141KB/s,
> maxb=539738KB/s, mint=7771msec, maxt=7987msec
> 4/512 -> WRITE: io=32768MB, aggrb=4162.8MB/s, minb=532745KB/s,
> maxb=541759KB/s, mint=7742msec, maxt=7873msec     *
> 5/512 -> WRITE: io=32768MB, aggrb=4178.6MB/s, minb=534851KB/s,
> maxb=549856KB/s, mint=7628msec, maxt=7842msec
> 6/512 -> WRITE: io=32768MB, aggrb=4167.4MB/s, minb=533422KB/s,
> maxb=562314KB/s, mint=7459msec, maxt=7863msec
> 7/512 -> WRITE: io=32768MB, aggrb=4192.1MB/s, minb=536699KB/s,
> maxb=566338KB/s, mint=7406msec, maxt=7815msec
> 8/512 -> WRITE: io=32768MB, aggrb=4189.8MB/s, minb=536287KB/s,
> maxb=558644KB/s, mint=7508msec, maxt=7821msec
> 9/512 -> WRITE: io=32768MB, aggrb=4165.8MB/s, minb=533219KB/s,
> maxb=559837KB/s, mint=7492msec, maxt=7866msec
> 10/512 -> WRITE: io=32768MB, aggrb=4177.2MB/s, minb=534783KB/s,
> maxb=570188KB/s, mint=7356msec, maxt=7843msec
>
> 0/1024 -> WRITE: io=32768MB, aggrb=1288.6MB/s, minb=164935KB/s,
> maxb=166877KB/s, mint=25134msec, maxt=25430msec
> 1/1024 -> WRITE: io=32768MB, aggrb=2218.5MB/s, minb=283955KB/s,
> maxb=289842KB/s, mint=14471msec, maxt=14771msec
> 2/1024 -> WRITE: io=32768MB, aggrb=3186.1MB/s, minb=407926KB/s,
> maxb=420903KB/s, mint=9965msec, maxt=10282msec
> 3/1024 -> WRITE: io=32768MB, aggrb=4107.4MB/s, minb=525733KB/s,
> maxb=538836KB/s, mint=7784msec, maxt=7978msec
> 4/1024 -> WRITE: io=32768MB, aggrb=4146.9MB/s, minb=530790KB/s,
> maxb=550505KB/s, mint=7619msec, maxt=7902msec
> 5/1024 -> WRITE: io=32768MB, aggrb=4160.5MB/s, minb=532542KB/s,
> maxb=550795KB/s, mint=7615msec, maxt=7876msec  *
> 6/1024 -> WRITE: io=32768MB, aggrb=4174.3MB/s, minb=534306KB/s,
> maxb=558942KB/s, mint=7504msec, maxt=7850msec
> 7/1024 -> WRITE: io=32768MB, aggrb=4189.8MB/s, minb=536287KB/s,
> maxb=556864KB/s, mint=7532msec, maxt=7821msec
> 8/1024 -> WRITE: io=32768MB, aggrb=4188.2MB/s, minb=536081KB/s,
> maxb=561035KB/s, mint=7476msec, maxt=7824msec
> 9/1024 -> WRITE: io=32768MB, aggrb=4167.4MB/s, minb=533422KB/s,
> maxb=567872KB/s, mint=7386msec, maxt=7863msec
> 10/1024 -> WRITE: io=32768MB, aggrb=4188.2MB/s, minb=536081KB/s,
> maxb=569878KB/s, mint=7360msec, maxt=7824msec
>
> 0/2048 -> WRITE: io=32768MB, aggrb=1265.7MB/s, minb=162004KB/s,
> maxb=166111KB/s, mint=25250msec, maxt=25890msec
> 1/2048 -> WRITE: io=32768MB, aggrb=2239.5MB/s, minb=286652KB/s,
> maxb=290846KB/s, mint=14421msec, maxt=14632msec
> 2/2048 -> WRITE: io=32768MB, aggrb=3184.5MB/s, minb=407609KB/s,
> maxb=413150KB/s, mint=10152msec, maxt=10290msec
> 3/2048 -> WRITE: io=32768MB, aggrb=4213.5MB/s, minb=539321KB/s,
> maxb=557901KB/s, mint=7518msec, maxt=7777msec     *
> 4/2048 -> WRITE: io=32768MB, aggrb=4168.5MB/s, minb=533558KB/s,
> maxb=543162KB/s, mint=7722msec, maxt=7861msec
> 5/2048 -> WRITE: io=32768MB, aggrb=4185.5MB/s, minb=535739KB/s,
> maxb=549352KB/s, mint=7635msec, maxt=7829msec
> 6/2048 -> WRITE: io=32768MB, aggrb=4181.8MB/s, minb=535260KB/s,
> maxb=553338KB/s, mint=7580msec, maxt=7836msec
> 7/2048 -> WRITE: io=32768MB, aggrb=4215.7MB/s, minb=539599KB/s,
> maxb=566109KB/s, mint=7409msec, maxt=7773msec
> 8/2048 -> WRITE: io=32768MB, aggrb=4200.5MB/s, minb=537662KB/s,
> maxb=568102KB/s, mint=7383msec, maxt=7801msec
> 9/2048 -> WRITE: io=32768MB, aggrb=4184.1MB/s, minb=535671KB/s,
> maxb=574483KB/s, mint=7301msec, maxt=7830msec
> 10/2048 -> WRITE: io=32768MB, aggrb=4172.7MB/s, minb=534102KB/s,
> maxb=567641KB/s, mint=7389msec, maxt=7853msec
>
> 0/4096 -> WRITE: io=32768MB, aggrb=1264.8MB/s, minb=161879KB/s,
> maxb=168588KB/s, mint=24879msec, maxt=25910msec
> 1/4096 -> WRITE: io=32768MB, aggrb=2349.4MB/s, minb=300710KB/s,
> maxb=312541KB/s, mint=13420msec, maxt=13948msec
> 2/4096 -> WRITE: io=32768MB, aggrb=3387.6MB/s, minb=433609KB/s,
> maxb=441877KB/s, mint=9492msec, maxt=9673msec
> 3/4096 -> WRITE: io=32768MB, aggrb=4182.3MB/s, minb=535329KB/s,
> maxb=552390KB/s, mint=7593msec, maxt=7835msec    *
> 4/4096 -> WRITE: io=32768MB, aggrb=4170.2MB/s, minb=533762KB/s,
> maxb=560061KB/s, mint=7489msec, maxt=7858msec
> 5/4096 -> WRITE: io=32768MB, aggrb=4179.6MB/s, minb=534919KB/s,
> maxb=548490KB/s, mint=7647msec, maxt=7841msec
> 6/4096 -> WRITE: io=32768MB, aggrb=4183.4MB/s, minb=535465KB/s,
> maxb=549208KB/s, mint=7637msec, maxt=7833msec
> 7/4096 -> WRITE: io=32768MB, aggrb=4174.9MB/s, minb=534374KB/s,
> maxb=557530KB/s, mint=7523msec, maxt=7849msec
> 8/4096 -> WRITE: io=32768MB, aggrb=4178.6MB/s, minb=534851KB/s,
> maxb=570188KB/s, mint=7356msec, maxt=7842msec
> 9/4096 -> WRITE: io=32768MB, aggrb=4180.2MB/s, minb=535056KB/s,
> maxb=570110KB/s, mint=7357msec, maxt=7839msec
> 10/4096 -> WRITE: io=32768MB, aggrb=4183.9MB/s, minb=535534KB/s,
> maxb=574640KB/s, mint=7299msec, maxt=7832msec
>
> 0/8192 -> WRITE: io=32768MB, aggrb=1260.9MB/s, minb=161381KB/s,
> maxb=171511KB/s, mint=24455msec, maxt=25990msec
> 1/8192 -> WRITE: io=32768MB, aggrb=2368.5MB/s, minb=303166KB/s,
> maxb=320444KB/s, mint=13089msec, maxt=13835msec
> 2/8192 -> WRITE: io=32768MB, aggrb=3408.8MB/s, minb=436225KB/s,
> maxb=458544KB/s, mint=9147msec, maxt=9615msec
> 3/8192 -> WRITE: io=32768MB, aggrb=4219.5MB/s, minb=540085KB/s,
> maxb=564585KB/s, mint=7429msec, maxt=7766msec     *
> 4/8192 -> WRITE: io=32768MB, aggrb=4208.6MB/s, minb=538698KB/s,
> maxb=570653KB/s, mint=7350msec, maxt=7786msec
> 5/8192 -> WRITE: io=32768MB, aggrb=4200.5MB/s, minb=537662KB/s,
> maxb=562013KB/s, mint=7463msec, maxt=7801msec
> 6/8192 -> WRITE: io=32768MB, aggrb=4189.3MB/s, minb=536218KB/s,
> maxb=585387KB/s, mint=7165msec, maxt=7822msec
> 7/8192 -> WRITE: io=32768MB, aggrb=4184.5MB/s, minb=535602KB/s,
> maxb=579323KB/s, mint=7240msec, maxt=7831msec
> 8/8192 -> WRITE: io=32768MB, aggrb=4186.6MB/s, minb=535876KB/s,
> maxb=572132KB/s, mint=7331msec, maxt=7827msec
> 9/8192 -> WRITE: io=32768MB, aggrb=4176.5MB/s, minb=534578KB/s,
> maxb=598246KB/s, mint=7011msec, maxt=7846msec
> 10/8192 -> WRITE: io=32768MB, aggrb=4184.1MB/s, minb=535671KB/s,
> maxb=580285KB/s, mint=7228msec, maxt=7830msec
>
> 0/16384 -> WRITE: io=32768MB, aggrb=1281.0MB/s, minb=163968KB/s,
> maxb=183542KB/s, mint=22852msec, maxt=25580msec
> 1/16384 -> WRITE: io=32768MB, aggrb=2451.8MB/s, minb=313827KB/s,
> maxb=337787KB/s, mint=12417msec, maxt=13365msec
> 2/16384 -> WRITE: io=32768MB, aggrb=3409.5MB/s, minb=436406KB/s,
> maxb=468532KB/s, mint=8952msec, maxt=9611msec
> 3/16384 -> WRITE: io=32768MB, aggrb=4192.5MB/s, minb=536630KB/s,
> maxb=566721KB/s, mint=7401msec, maxt=7816msec   *
> 4/16384 -> WRITE: io=32768MB, aggrb=4172.2MB/s, minb=534034KB/s,
> maxb=581089KB/s, mint=7218msec, maxt=7854msec
> 5/16384 -> WRITE: io=32768MB, aggrb=4175.4MB/s, minb=534442KB/s,
> maxb=587108KB/s, mint=7144msec, maxt=7848msec
> 6/16384 -> WRITE: io=32768MB, aggrb=4188.2MB/s, minb=536081KB/s,
> maxb=585224KB/s, mint=7167msec, maxt=7824msec
> 7/16384 -> WRITE: io=32768MB, aggrb=4173.8MB/s, minb=534238KB/s,
> maxb=591330KB/s, mint=7093msec, maxt=7851msec
> 8/16384 -> WRITE: io=32768MB, aggrb=4163.2MB/s, minb=532880KB/s,
> maxb=590165KB/s, mint=7107msec, maxt=7871msec
> 9/16384 -> WRITE: io=32768MB, aggrb=4166.9MB/s, minb=533355KB/s,
> maxb=608664KB/s, mint=6891msec, maxt=7864msec
> 10/16384 -> WRITE: io=32768MB, aggrb=4157.9MB/s, minb=532204KB/s,
> maxb=594768KB/s, mint=7052msec, maxt=7881msec
>
> 0/32768 -> WRITE: io=32768MB, aggrb=1288.1MB/s, minb=164980KB/s,
> maxb=189026KB/s, mint=22189msec, maxt=25423msec
> 1/32768 -> WRITE: io=32768MB, aggrb=2443.6MB/s, minb=312774KB/s,
> maxb=348624KB/s, mint=12031msec, maxt=13410msec
> 2/32768 -> WRITE: io=32768MB, aggrb=3467.1MB/s, minb=443888KB/s,
> maxb=484722KB/s, mint=8653msec, maxt=9449msec
> 3/32768 -> WRITE: io=32768MB, aggrb=4131.2MB/s, minb=528782KB/s,
> maxb=572444KB/s, mint=7327msec, maxt=7932msec    *
> 4/32768 -> WRITE: io=32768MB, aggrb=4082.8MB/s, minb=522589KB/s,
> maxb=606990KB/s, mint=6910msec, maxt=8026msec
> 5/32768 -> WRITE: io=32768MB, aggrb=3985.5MB/s, minb=510131KB/s,
> maxb=578046KB/s, mint=7256msec, maxt=8222msec
> 6/32768 -> WRITE: io=32768MB, aggrb=3937.2MB/s, minb=504062KB/s,
> maxb=591914KB/s, mint=7086msec, maxt=8321msec
> 7/32768 -> WRITE: io=32768MB, aggrb=4012.3MB/s, minb=513567KB/s,
> maxb=583028KB/s, mint=7194msec, maxt=8167msec
> 8/32768 -> WRITE: io=32768MB, aggrb=3944.2MB/s, minb=504851KB/s,
> maxb=567257KB/s, mint=7394msec, maxt=8308msec
> 9/32768 -> WRITE: io=32768MB, aggrb=3930.1MB/s, minb=503155KB/s,
> maxb=580687KB/s, mint=7223msec, maxt=8336msec
> 10/32768 -> WRITE: io=32768MB, aggrb=3965.2MB/s, minb=507539KB/s,
> maxb=599443KB/s, mint=6997msec, maxt=8264msec
>
>
> Analysis:
> -----------
> - the amount of minimum stripe cache entries doesn't cause much variation as
> expected
> - writing threads cause significant performance enhancement
> - seen best results with 3 or 4 writing threads which correlates well to the
> # of stripes
>
>
> Did you provide your fio job(s) for comparision yet?
>
> Regards,
> Heinz
>
> P.S.: write performance tested with the following script:
>
> #!/bin/sh
>
> MD=md0
>
> for s in 256 512 1024 2048 4096 8192 16384 32768
> do
>         echo $s > /sys/block/$MD/md/stripe_cache_size
>
>         for t in {0..10}
>         do
>                 echo $t > /sys/block/$MD/md/group_thread_cnt
>                 echo -n "$t/$s -> "
>                 fio  --section=w-md0 fio_md0.job 2>&1|grep "aggrb="|sed 's/^
> *//'
>         done
> done
>
>
>
>
> On 01/17/2017 04:28 PM, Jake Yao wrote:
>>
>> Thanks for the response.
>>
>> I am using fio for performance measurement.
>>
>> The chunk size of raid5 array is 32K, and the block size in fio is set
>> to 96K(3x chunk size) which is also the optimal_io_size, ioengine is
>> set to libaio with direct IO.
>>
>> Increasing stripe_cache_size does not help much, and it looks like the
>> write is limited by the single kernel thread as mentioned earlier.
>>
>>
>> On Tue, Jan 17, 2017 at 12:10 AM, Roman Mamedov <rm@romanrm.net> wrote:
>>>
>>> On Mon, 16 Jan 2017 21:35:21 -0500
>>> Jake Yao <jgyao1@gmail.com> wrote:
>>>
>>>> I have a raid5 array on 4 NVMe drives, and the performance on the
>>>> array is only marginally better than a single drive. Unlike a similar
>>>> raid5 array on 4 SAS SSD or HDD,  the performance on array is 3x
>>>> better than a single drive, which is expected.
>>>>
>>>> It looks like when the single kernel thread associated with the raid
>>>> device running at 100%, the array performance hit its peak. This can
>>>> happen easily for fast devices like NVMe.
>>>>
>>>> This can reproduced by creating a raid5 with 4 ramdisks as well, and
>>>> comparing performance on the array and one ramdisk. Sometimes the
>>>> performance on the array is worse than a single ramdisk.
>>>>
>>>> The kernel version is 4.9.0-rc3 and mdadm is release 3.4, no write
>>>> journal is configured.
>>>>
>>>> Is this a known issue?
>>>
>>> How do you measure the performance?
>>>
>>> Sure it may be CPU-bound in the end, but also why not try the usual
>>> optimization tricks, such as:
>>>
>>>    * increase your stripe_cache_size, it's not uncommon that this can
>>> speed up
>>>      linear writes by as much as several times;
>>>
>>>    * if you meant reads, you could look into read-ahead settings for the
>>> array;
>>>
>>>    * and in both cases, try experimenting with different stripe sizes (if
>>> you
>>>      were using 512K, try with 64K stripes).
>>>
>>> --
>>> With respect,
>>> Roman
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>

^ permalink raw reply

* [PATCH 1/2] md/r5cache: disable write back for degraded raid6
From: Song Liu @ 2017-01-18 23:56 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu, Jes.Sorensen

raid6 handles write differently in degraded mode. Specifically,
handle_stripe_fill() is called for writes. As a result, write
back cache has very little performance benefit for degraded
raid6. (On the other hand, write back cache does help sequential
writes on degraded raid4 and raid5).

Write back cache for degraded mode also introduces data integrity
corner cases. This is mostly because handle_stripe_fill() is
called on write. To avoid handling these corner cases, this patch
disables write back cache for degraded raid6.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 4957297..b31ae41 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2371,6 +2371,16 @@ int r5c_try_caching_write(struct r5conf *conf,
 		set_bit(STRIPE_R5C_CACHING, &sh->state);
 	}
 
+	/*
+	 * When raid6 array runs in degraded mode, handle_stripe_fill() is
+	 * called on every write. So write back cache doesn't help the
+	 * performance. To simplify the code, do write-through.
+	 */
+	if (conf->level == 6 && s->failed) {
+		r5c_make_stripe_write_out(sh);
+		return -EAGAIN;
+	}
+
 	for (i = disks; i--; ) {
 		dev = &sh->dev[i];
 		/* if non-overwrite, use writing-out phase */
-- 
2.9.3


^ permalink raw reply related

* [PATCH 2/2] md/r5cache: improve journal device efficiency
From: Song Liu @ 2017-01-18 23:56 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu, Jes.Sorensen
In-Reply-To: <20170118235650.2430923-1-songliubraving@fb.com>

It is important to be able to flush all stripes in raid5-cache.
Therefore, we need reserve some space on the journal device for
these flushes. If flush operation includes pending writes to the
stripe, we need to reserve (conf->raid_disk + 1) pages per stripe
for the flush out. This reduces the efficiency of journal space.
If we exclude these pending writes from flush operation, we only
need (conf->max_degraded + 1) pages per stripe.

With this patch, when log space is critical (R5C_LOG_CRITICAL=1),
pending writes will be excluded from stripe flush out. Therefore,
we can reduce reserved space for flush out and thus improve journal
device efficiency.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 15 +++------------
 drivers/md/raid5.c       | 42 ++++++++++++++++++++++++++++++++----------
 2 files changed, 35 insertions(+), 22 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index b31ae41..c027f1b 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -387,17 +387,8 @@ void r5c_check_cached_full_stripe(struct r5conf *conf)
 /*
  * Total log space (in sectors) needed to flush all data in cache
  *
- * Currently, writing-out phase automatically includes all pending writes
- * to the same sector. So the reclaim of each stripe takes up to
- * (conf->raid_disks + 1) pages of log space.
- *
- * To totally avoid deadlock due to log space, the code reserves
- * (conf->raid_disks + 1) pages for each stripe in cache, which is not
- * necessary in most cases.
- *
- * To improve this, we will need writing-out phase to be able to NOT include
- * pending writes, which will reduce the requirement to
- * (conf->max_degraded + 1) pages per stripe in cache.
+ * To flush all stripes in cache, we need (conf->max_degraded + 1)
+ * pages per stripe in cache.
  */
 static sector_t r5c_log_required_to_flush_cache(struct r5conf *conf)
 {
@@ -406,7 +397,7 @@ static sector_t r5c_log_required_to_flush_cache(struct r5conf *conf)
 	if (!r5c_is_writeback(log))
 		return 0;
 
-	return BLOCK_SECTORS * (conf->raid_disks + 1) *
+	return BLOCK_SECTORS * (conf->max_degraded + 1) *
 		atomic_read(&log->stripe_in_journal_count);
 }
 
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 193acd3..55a6156 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2953,13 +2953,35 @@ sector_t raid5_compute_blocknr(struct stripe_head *sh, int i, int previous)
  *      like to flush data in journal to RAID disks first, so complex rmw
  *      is handled in the write patch (handle_stripe_dirtying).
  *
- *   2. to be added
+ *   2. when journal space is critical (R5C_LOG_CRITICAL=1)
+ *
+ *      It is important to be able to flush all stripes in raid5-cache.
+ *      Therefore, we need reserve some space on the journal device for
+ *      these flushes. If flush operation includes pending writes to the
+ *      stripe, we need to reserve (conf->raid_disk + 1) pages per stripe
+ *      for the flush out. If we exclude these pending writes from flush
+ *      operation, we only need (conf->max_degraded + 1) pages per stripe.
+ *      Therefore, excluding pending writes in these cases enables more
+ *      efficient use of the journal device.
+ *
+ *      Note: To make sure the stripe makes progress, we only delay
+ *      towrite for stripes with data already in journal (injournal > 0).
+ *      When LOG_CRITICAL, stripes with injournal == 0 will be sent to
+ *      no_space_stripes list.
  */
-static inline bool delay_towrite(struct r5dev *dev,
-				   struct stripe_head_state *s)
+static inline bool delay_towrite(struct r5conf *conf,
+				 struct r5dev *dev,
+				 struct stripe_head_state *s)
 {
-	return dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags) &&
-		!test_bit(R5_Insync, &dev->flags) && s->injournal;
+	/* case 1 above */
+	if (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags) &&
+	    !test_bit(R5_Insync, &dev->flags) && s->injournal)
+		return true;
+	/* case 2 above */
+	if (test_bit(R5C_LOG_CRITICAL, &conf->cache_state) &&
+	    s->injournal > 0)
+		return true;
+	return false;
 }
 
 static void
@@ -2982,7 +3004,7 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
 
-			if (dev->towrite && !delay_towrite(dev, s)) {
+			if (dev->towrite && !delay_towrite(conf, dev, s)) {
 				set_bit(R5_LOCKED, &dev->flags);
 				set_bit(R5_Wantdrain, &dev->flags);
 				if (!expand)
@@ -3733,7 +3755,7 @@ static int handle_stripe_dirtying(struct r5conf *conf,
 	} else for (i = disks; i--; ) {
 		/* would I have to read this buffer for read_modify_write */
 		struct r5dev *dev = &sh->dev[i];
-		if (((dev->towrite && !delay_towrite(dev, s)) ||
+		if (((dev->towrite && !delay_towrite(conf, dev, s)) ||
 		     i == sh->pd_idx || i == sh->qd_idx ||
 		     test_bit(R5_InJournal, &dev->flags)) &&
 		    !test_bit(R5_LOCKED, &dev->flags) &&
@@ -3757,8 +3779,8 @@ static int handle_stripe_dirtying(struct r5conf *conf,
 		}
 	}
 
-	pr_debug("for sector %llu, rmw=%d rcw=%d\n",
-		(unsigned long long)sh->sector, rmw, rcw);
+	pr_debug("for sector %llu state 0x%lx, rmw=%d rcw=%d\n",
+		 (unsigned long long)sh->sector, sh->state, rmw, rcw);
 	set_bit(STRIPE_HANDLE, &sh->state);
 	if ((rmw < rcw || (rmw == rcw && conf->rmw_level == PARITY_PREFER_RMW)) && rmw > 0) {
 		/* prefer read-modify-write, but need to get some data */
@@ -3798,7 +3820,7 @@ static int handle_stripe_dirtying(struct r5conf *conf,
 
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
-			if (((dev->towrite && !delay_towrite(dev, s)) ||
+			if (((dev->towrite && !delay_towrite(conf, dev, s)) ||
 			     i == sh->pd_idx || i == sh->qd_idx ||
 			     test_bit(R5_InJournal, &dev->flags)) &&
 			    !test_bit(R5_LOCKED, &dev->flags) &&
-- 
2.9.3


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox