Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: NeilBrown @ 2016-12-12  0:59 UTC (permalink / raw)
  To: Jinpu Wang, linux-raid, Shaohua Li; +Cc: Nate Dailey
In-Reply-To: <CAMGffE=1kp7gMhG+dFxTvhD6VkRuSfbuXbf9vYSWwGG+=vvhDA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 184 bytes --]

On Sat, Nov 26 2016, Jinpu Wang wrote:
> [  810.270860]  [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0

What is this?  I cannot find that function in the upstream kernel.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: What is the solution for USB HDs
From: Phil Turmel @ 2016-12-12  3:17 UTC (permalink / raw)
  To: Adam Goryachev, juca, linux-raid
In-Reply-To: <171542c9-1ab8-0993-733d-fbb95467f512@websitemanagers.com.au>

On 12/11/2016 05:14 PM, Adam Goryachev wrote:
> On 12/12/16 04:14, juca@juan-carlos.info wrote:
> If you are still having problems after that, then please try to post
> more details on what happens when the drives vanish (eg, kernel logs,
> system logs, etc).
> 
> Also, you might consider an alternative SBC, some have SATA ports
> already available (RPi are the only SBC's I've used, but there are many
> other options/variations).

There's also a kernel boot parameter that can cut off all USB
auto-suspend, which I suspect is part of the problem.  It has been known
for a long time that USB is not robust enough to be trusted in MD arrays.

{ looking around..... }

usbcore.autosuspend=-1

is what you need in your bootloader or builtin to the kernel.

Phil

^ permalink raw reply

* Re: raid0 vs. mkfs
From: NeilBrown @ 2016-12-12  3:17 UTC (permalink / raw)
  To: Coly Li, Shaohua Li; +Cc: Avi Kivity, linux-raid, linux-block
In-Reply-To: <14867138-b0ea-fb58-dae6-70f30a3ddcc8@suse.de>

[-- Attachment #1: Type: text/plain, Size: 3524 bytes --]

On Fri, Dec 09 2016, Coly Li wrote:

> On 2016/12/9 上午3:19, Shaohua Li wrote:
>> On Fri, Dec 09, 2016 at 12:44:57AM +0800, Coly Li wrote:
>>> On 2016/12/8 上午12:59, Shaohua Li wrote:
>>>> On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote:
>>> [snip]
>>>> Thanks for doing this, Coly! For raid0, this totally makes sense. The raid0
>>>> zones make things a little complicated though. I just had a brief look of your
>>>> proposed patch, which looks really complicated. I'd suggest something like
>>>> this:
>>>> 1. split the bio according to zone boundary.
>>>> 2. handle the splitted bio. since the bio is within zone range, calculating
>>>> the start and end sector for each rdev should be easy.
>>>>
>>>
>>> Hi Shaohua,
>>>
>>> Thanks for your suggestion! I try to modify the code by your suggestion,
>>> it is even more hard to make the code that way ...
>>>
>>> Because even split bios for each zone, all the corner cases still exist
>>> and should be taken care in every zoon. The code will be more complicated.
>> 
>> Not sure why it makes the code more complicated. Probably I'm wrong, but Just
>> want to make sure we are in the same page: split the bio according to zone
>> boundary, then handle the splitted bio separately. Calculating end/start point
>> of each rdev for the new bio within a zone should be simple. we then clone a
>> bio for each rdev and dispatch. So for example:
>> Disk 0: D0 D2 D4 D6 D7
>> Disk 1: D1 D3 D5
>> zone 0 is from D0 - D5, zone 1 is from D6 - D7
>> If bio is from D1 to D7, we split it to 2 bios, one is D1 - D5, the other D6 - D7.
>> For D1 - D5, we dispatch 2 bios. D1 - D5 for disk 1, D2 - D4 for disk 0
>> For D6 - D7, we just dispatch to disk 0.
>> What kind of corner case makes this more complicated?
>>  
>
> Let me explain the corner cases.
>
> When upper layer code issues a DISCARD bio, the bio->bi_iter.bi_sector
> may not be chunk size aligned, and bio->bi_iter.bi_size may not be
> (chunk_sects*nb_dev) sectors aligned. In raid0, we can't simply round
> up/down them into chunk size aligned number, otherwise data
> lost/corruption will happen.
>
> Therefore for each DISCARD bio that raid0_make_request() receive, the
> beginning and ending parts of this bio should be treat very carefully.
> All the corner cases *come from here*, they are not about number of
> zones or rdevs, it is about whether bio->bi_iter.bi_sector and
> bio->bi_iter.bi_size are chunk size aligned or not.
>
> - beginning of the bio
>   If bio->bi_iter.bi_sector is not chunk size aligned, current raid0
> code will split the beginning part into split bio which only contains
> sectors from bio->bi_iter.sector to next chunk size aligned offset, and
> issue this bio by generic_make_request(). But in
> discard_large_discard_bio() we can't issue the split bio now, we have to
> record lenth of this split bio into a per-device structure, and issue a

Why?

Why cannot you just split of the start of the bio and chain it with the
rest of the bio?

If the bio doesn't start at the beginning of a stripe, just split of the
first (partial) chunk exactly was we currently do.

If it does start at the beginning of a stripe, then split off a whole
number of stripes and allocate one bio for each device.  Chain those to
the original bio together with any remainder (which isn't a whole
stripe).

I think that if you make use of bio_split() and bio_chain() properly,
the code will be much simpler.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* [PATCH 1/1] IMSM: Add support for Non-Intel NVMe drives under VMD
From: Pawel Baldysiak @ 2016-12-12 10:28 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Pawel Baldysiak

This patch adds checking if platform (preOS) supports
non-Intel NVMe drives under VMD domain,
and - if so - allow creating IMSM Raid Volume
with those drives.

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
---
 platform-intel.h |  7 +++++++
 super-intel.c    | 12 +++++++++---
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/platform-intel.h b/platform-intel.h
index a8ae85f..4851074 100644
--- a/platform-intel.h
+++ b/platform-intel.h
@@ -99,6 +99,8 @@ struct imsm_orom {
 	#define IMSM_OROM_CAPABILITIES_Rohi (1 << 5)
 	#define IMSM_OROM_CAPABILITIES_ReadPatrol (1 << 6)
 	#define IMSM_OROM_CAPABILITIES_XorHw (1 << 7)
+	#define IMSM_OROM_CAPABILITIES_SKUMode ((1 << 8)|(1 << 9))
+	#define IMSM_OROM_CAPABILITIES_TPV (1 << 10)
 } __attribute__((packed));
 
 static inline int imsm_orom_has_raid0(const struct imsm_orom *orom)
@@ -184,6 +186,11 @@ static inline int imsm_orom_is_nvme(const struct imsm_orom *orom)
 			sizeof(orom->signature)) == 0;
 }
 
+static inline int imsm_orom_has_tpv_support(const struct imsm_orom *orom)
+{
+	return !!(orom->driver_features & IMSM_OROM_CAPABILITIES_TPV);
+}
+
 enum sys_dev_type {
 	SYS_DEV_UNKNOWN = 0,
 	SYS_DEV_SAS,
diff --git a/super-intel.c b/super-intel.c
index 0407d43..cee6951 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -2199,9 +2199,6 @@ static int print_vmd_attached_devs(struct sys_dev *hba)
 			continue;
 
 		sprintf(path, "/sys/bus/pci/drivers/nvme/%s", ent->d_name);
-		/* if not a intel NVMe - skip it*/
-		if (devpath_to_vendor(path) != 0x8086)
-			continue;
 
 		rp = realpath(path, NULL);
 		if (!rp)
@@ -2416,6 +2413,8 @@ static int detail_platform_imsm(int verbose, int enumerate_only, char *controlle
 	for (entry = orom_entries; entry; entry = entry->next) {
 		if (entry->type == SYS_DEV_VMD) {
 			print_imsm_capability(&entry->orom);
+			printf(" 3rd party NVMe :%s supported\n",
+			    imsm_orom_has_tpv_support(&entry->orom)?"":" not");
 			for (hba = list; hba; hba = hba->next) {
 				if (hba->type == SYS_DEV_VMD) {
 					char buf[PATH_MAX];
@@ -5609,6 +5608,13 @@ static int add_to_super_imsm(struct supertype *st, mdu_disk_info_t *dk,
 						"\tRAID 0 is the only supported configuration for this type of x8 device.\n");
 				break;
 			}
+		} else if (super->hba->type == SYS_DEV_VMD && super->orom &&
+		    !imsm_orom_has_tpv_support(super->orom)) {
+			pr_err("\tPlatform configuration does not support non-Intel NVMe drives.\n"
+			       "\tPlease refer to Intel(R) RSTe user guide.\n");
+			free(dd->devname);
+			free(dd);
+			return 1;
 		}
 	}
 
-- 
2.9.3


^ permalink raw reply related

* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-12-12 13:10 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey
In-Reply-To: <87a8c20xpg.fsf@notabene.neil.brown.name>

On Mon, Dec 12, 2016 at 1:59 AM, NeilBrown <neilb@suse.com> wrote:
> On Sat, Nov 26 2016, Jinpu Wang wrote:
>> [  810.270860]  [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0
>
> What is this?  I cannot find that function in the upstream kernel.
>
> NeilBrown

Hi Neil,

blk_prologue_bio is our internal extension to gather some stats, sorry
not informed before.

It does below, code snip only:
+static void bio_endio_complete_diskstat(struct bio *clone)
+{
+       struct bio *bio = clone->bi_private;
+       struct block_device *bdev = bio->bi_bdev;
+       struct hd_struct *hdp = bdev->bd_part;
+       unsigned long start = clone->bi_start_time;
+       const int rw = !!bio_data_dir(bio);
+       int cpu, err = clone->bi_error;
+
+       bio_put(clone);
+       blk_end_io_acct()
+       bio->bi_error = err;
+       bio_endio(bio);
+}
+
+blk_qc_t blk_prologue_bio(struct request_queue *q, struct bio *bio)
+{
+       struct block_device *bdev = bio->bi_bdev;
+       struct hd_struct *hdp = bdev->bd_part;
+       struct bio *clone;
+
+       if (!hdp->additional_diskstat)
+               return q->custom_make_request_fn(q, bio);
+       clone = bio_clone_fast(bio, GFP_NOWAIT | __GFP_NOWARN, NULL);
+       if (unlikely(!clone))
+               return q->custom_make_request_fn(q, bio);
+       clone->bi_start_time = jiffies;
+       if (hdp->bio_mode_diskstat) {
+               int rw = !!bio_data_dir(bio);
+
+               generic_start_io_acct(rw, bio_sectors(clone), hdp);
+       }
+       clone->bi_end_io = bio_endio_complete_diskstat;
+       clone->bi_private = bio;
+       return q->custom_make_request_fn(q, clone);
+}

ＩＭＨＯ， it seems unrelated, but I will rerun my test without this change.

-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@profitbricks.com
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

^ permalink raw reply

* Re: [PATCH 1/1] IMSM: Add support for Non-Intel NVMe drives under VMD
From: Jes Sorensen @ 2016-12-12 19:23 UTC (permalink / raw)
  To: Pawel Baldysiak; +Cc: linux-raid
In-Reply-To: <20161212102844.26697-1-pawel.baldysiak@intel.com>

Pawel Baldysiak <pawel.baldysiak@intel.com> writes:
> This patch adds checking if platform (preOS) supports
> non-Intel NVMe drives under VMD domain,
> and - if so - allow creating IMSM Raid Volume
> with those drives.
>
> Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
> ---
>  platform-intel.h |  7 +++++++
>  super-intel.c    | 12 +++++++++---
>  2 files changed, 16 insertions(+), 3 deletions(-)

Applied!

Thanks,
Jes

^ permalink raw reply

* Re: [PATCH] imsm: set generation number when reading superblock
From: Jes Sorensen @ 2016-12-12 19:25 UTC (permalink / raw)
  To: Mariusz Dabrowski; +Cc: linux-raid
In-Reply-To: <1481195568-12973-1-git-send-email-mariusz.dabrowski@intel.com>

Mariusz Dabrowski <mariusz.dabrowski@intel.com> writes:
> IMSM doesn't set 'events' field with generation number, so sometimes mdadm
> tries to re-assembly container using metadata which isn't most recent (e. g.
> from spare disk).
>
> Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
> ---
>  super-intel.c | 1 +
>  1 file changed, 1 insertion(+)

Applied!

Thanks,
Jes

^ permalink raw reply

* Re: [PATCH] Use disk sector size value to set offset for reading GPT
From: Jes Sorensen @ 2016-12-12 19:26 UTC (permalink / raw)
  To: Mariusz Dabrowski; +Cc: linux-raid
In-Reply-To: <1481195595-13140-1-git-send-email-mariusz.dabrowski@intel.com>

Mariusz Dabrowski <mariusz.dabrowski@intel.com> writes:
> mdadm is using invalid byte-offset while reading GPT header to get
> partition info (size, first sector, last sector etc.). Now this offset
> is hardcoded to 512 bytes and it is not valid for disks with sector
> size different than 512 bytes because MBR and GPT headers are aligned
> to LBA, so valid offset for 4k drives is 4096 bytes.
>
> Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
> ---
>  super-gpt.c | 10 ++++++++++
>  util.c      |  7 ++++++-
>  2 files changed, 16 insertions(+), 1 deletion(-)

Applied!

Thanks,
Jes

^ permalink raw reply

* Re: [PATCH] Always return last partition end address in 512B blocks
From: Jes Sorensen @ 2016-12-12 19:30 UTC (permalink / raw)
  To: Mariusz Dabrowski; +Cc: linux-raid
In-Reply-To: <1481195606-13227-1-git-send-email-mariusz.dabrowski@intel.com>

Mariusz Dabrowski <mariusz.dabrowski@intel.com> writes:
> For 4K disks 'endofpart' is an index of the last 4K sector used by partition.
> mdadm is using number of 512-byte sectors, so value returned by
> get_last_partition_end must be multiplied by 8 for devices with 4K sectors.
> Also, unused 'ret' variable has been removed.
>
> Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
> ---
>  util.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)

I am a little wary of this since I don't have any 4k sector drivers to
test with, so I will come yelling at you if this breaks something :)

> diff --git a/util.c b/util.c
> index 883eaa4..46c1280 100644
> --- a/util.c
> +++ b/util.c
> @@ -1430,6 +1430,7 @@ static int get_last_partition_end(int fd, unsigned long long *endofpart)
>  	struct MBR boot_sect;
>  	unsigned long long curr_part_end;
>  	unsigned part_nr;
> +	unsigned int sector_size;
>  	int retval = 0;
>  
>  	*endofpart = 0;
> @@ -1469,6 +1470,9 @@ static int get_last_partition_end(int fd, unsigned long long *endofpart)
>  		/* Unknown partition table */
>  		retval = -1;
>  	}
> +	/* calculate number of 512-byte blocks */
> +	if (get_dev_sector_size(fd, NULL, &sector_size))
> +		*endofpart *= sector_size/512;

I really would prefer to see a parenthesis around this here and use of
proper spacing.

Thanks,
Jes

^ permalink raw reply

* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: NeilBrown @ 2016-12-12 21:53 UTC (permalink / raw)
  To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Nate Dailey
In-Reply-To: <CAMGffEkQ5puQEE0FVfLMV1of1OYOhLMhHdUTAy-YLYPkzBzjDg@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 835 bytes --]

On Tue, Dec 13 2016, Jinpu Wang wrote:

> On Mon, Dec 12, 2016 at 1:59 AM, NeilBrown <neilb@suse.com> wrote:
>> On Sat, Nov 26 2016, Jinpu Wang wrote:
>>> [  810.270860]  [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0
>>
>> What is this?  I cannot find that function in the upstream kernel.
>>
>> NeilBrown
>
> Hi Neil,
>
> blk_prologue_bio is our internal extension to gather some stats, sorry
> not informed before.

Ahhh.

....
> +       return q->custom_make_request_fn(q, clone);

I haven't heard of custom_make_request_fn before either.

> +}
>
> ＩＭＨＯ， it seems unrelated, but I will rerun my test without this change.

Yes, please re-test with an unmodified upstream kernel (and always
report *exactly* what kernel you are running.  I cannot analyse code
that I cannot see).

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH] dm: Avoid sleeping while holding the dm_bufio lock
From: Doug Anderson @ 2016-12-13  0:08 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alasdair Kergon, Mike Snitzer, Shaohua Li, Dmitry Torokhov,
	linux-kernel@vger.kernel.org, linux-raid, dm-devel,
	David Rientjes, Sonny Rao, Guenter Roeck
In-Reply-To: <alpine.LRH.2.02.1612081758250.25149@file01.intranet.prod.int.rdu2.redhat.com>

Hi,

On Thu, Dec 8, 2016 at 3:20 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Wed, 7 Dec 2016, Doug Anderson wrote:
>
>> Hi,
>>
>> On Wed, Nov 23, 2016 at 12:57 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> > Hi
>> >
>> > The GFP_NOIO allocation frees clean cached pages. The GFP_NOWAIT
>> > allocation doesn't. Your patch would incorrectly reuse buffers in a
>> > situation when the memory is filled with clean cached pages.
>> >
>> > Here I'm proposing an alternate patch that first tries GFP_NOWAIT
>> > allocation, then drops the lock and tries GFP_NOIO allocation.
>> >
>> > Note that the root cause why you are seeing this stacktrace is, that your
>> > block device is congested - i.e. there are too many requests in the
>> > device's queue - and note that fixing this wait won't fix the root cause
>> > (congested device).
>> >
>> > The congestion limits are set in blk_queue_congestion_threshold to 7/8 to
>> > 13/16 size of the nr_requests value.
>> >
>> > If you don't want your device to report the congested status, you can
>> > increase /sys/block/<device>/queue/nr_requests - you should test if your
>> > chromebook is faster of slower with this setting increased. But note that
>> > this setting won't increase the IO-per-second of the device.
>>
>> Cool, thanks for the insight!
>>
>> Can you clarify which block device is relevant here?  Is this the DM
>> block device, the underlying block device, or the swap block device?
>> I'm not at all an expert on DM, but I think we have:
>>
>> 1. /dev/mmcblk0 - the underlying storage device.
>> 2. /dev/dm-0 - The verity device that's run atop /dev/mmcblk0p3
>> 3. /dev/zram0 - Our swap device
>
> The /dev/mmcblk0 device is congested. You can see the number of requests
> in /sys/block/mmcblk0/inflight

Hrm.  Doing some tests doesn't show that to be true.

I ran this command while doing my test (the "balloon" test described
in my patch, not the running the computer as a real user):

  while true;
    do grep -v '0 *0' /sys/block/*/inflight;
    sleep .1;
  done

The largest numbers I ever saw was:

  /sys/block/mmcblk0/inflight:       6        0

...and in one case when I saw a "long bufio" mesage I was seeing:

/sys/block/zram0/inflight:       1        0
/sys/block/zram0/inflight:       0        1
/sys/block/zram0/inflight:       0        1
/sys/block/zram0/inflight:       1        1
/sys/block/zram0/inflight:       0        1


>> As stated in the original email, I'm running on a downstream kernel
>> (kernel-4.4) with bunches of local patches, so it's plausible that
>> things have changed in the meantime, but:
>>
>> * At boot time the "nr_requests" for all block devices was 128
>> * I was unable to set the "nr_requests" for dm-0 and zram0 (it just
>> gives an error in sysfs).
>> * When I set "nr_requests" to 4096 for /dev/mmcblk0 it didn't seem to
>> affect the problem.
>
> The eMMC has some IOPS and the IOPS can't be improved. Use faster block
> device - but it will cost more money.

It's so strange since when I see this failure I'm not even doing lots
of eMMC traffic.  As per above my swap goes to zram and there's no
disk backing, so there should be very minimal disk usage.

If I run iostat now after having been running my test for a while (and
having seen a bunch of "long bufio" messages, you can see that there
hasn't exactly been lots of activity.

localhost debug # iostat
Linux 4.4.21 (localhost)        12/09/16        _aarch64_       (6 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.11   33.87   36.63    1.78    0.00   21.61

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
loop0             2.74        39.01        68.95      33529      59264
loop1             0.07         1.23         0.00       1055          0
loop2             0.04         0.13         0.00        112          0
loop3             0.00         0.00         0.00          4          0
loop4             0.00         0.01         0.00          8          0
loop5             0.00         0.01         0.00          8          0
mmcblk1           0.56        44.45         0.00      38202          0
mmcblk0          14.37       636.29        38.74     546888      33300
mmcblk0rpmb       0.01         0.03         0.00         28          0
mmcblk0boot1      0.03         0.15         0.00        128          0
mmcblk0boot0      0.03         0.15         0.00        128          0
dm-0             11.50       419.25         0.00     360344          0
dm-1              2.15        36.53        68.95      31401      59264
zram0         42103.26     84111.64     84301.42   72293956   72457068

---

OK, so I just put a printk in wait_iff_congested() and it didn't show
me waiting for the timeout (!).  I know that I saw
wait_iff_congested() in the originally reproduction of this problem,
but it appears that in my little "balloon" reproduction it's not
actually involved...


...I dug further and it appears that __alloc_pages_direct_reclaim() is
actually what's slow.  Specifically it looks as if shrink_zone() can
actually take quite a while.  As I've said, I'm not an expert on the
memory manager but I'm not convinced that it's wrong for the direct
reclaim path to be pretty slow at times, especially when I'm putting
an abnormally high amount of stress on it.

I'm going to take this as further evidence that the patch being
discussed in this thread is a good one (AKA don't hold the dm bufio
lock while allocating memory).  :)  If it's unexpected that
shrink_zone() might take several seconds when under extreme memory
pressure then I can do some additional digging.  Do note that I am
running with "zram" and remember that I'm on an ancient 4.4-based
kernel, so perhaps one of those two factors causes problems.


> If you want to handle such a situation where you run 4 tasks each eating
> 900MB, just use more memory, don't expect that this will work smoothly on
> 4GB machine.
>
> If you want to protect the chromebook from runaway memory allocations, you
> can detect this situation in some watchdog process and either kill the
> process that consumes most memory with the kill syscall or trigger the
> kernel OOM killer by writing 'f' to /proc/sysrq-trigger.
>
> The question is what you really want - handle this situation smoothly
> (then, you must add more memory) or protect chromeOS from applications
> allocating too much memory?

These are definitely important things to think about.  In the end user
bug report, though, we were seeing this super slow / janky behavior
_without_ a runaway process--they were just using memory (and swap)
normally.  I've simulated it with the "balloon" process, but users
were seeing it without anything so crazy.


-Doug

^ permalink raw reply

* DO YOU NEED A LOAN??
From: bancoleite @ 2016-12-13  5:29 UTC (permalink / raw)
  To: Recipients

Are you in need of a loan? Apply for more details.

^ permalink raw reply

* [PATCH] md/raid5-cache: removes unnecessary write-through mode judgments
From: JackieLiu @ 2016-12-13  5:55 UTC (permalink / raw)
  To: songliubraving; +Cc: shli, linux-raid, JackieLiu

The write-through mode has been returned in front of the function,
do not need to do it again.

Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
---
 drivers/md/raid5-cache.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 6d1a150..4dd8e4e 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2418,9 +2418,6 @@ void r5c_finish_stripe_write_out(struct r5conf *conf,
 	if (do_wakeup)
 		wake_up(&conf->wait_for_overlap);
 
-	if (conf->log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH)
-		return;
-
 	spin_lock_irq(&conf->log->stripe_in_journal_lock);
 	list_del_init(&sh->r5c);
 	spin_unlock_irq(&conf->log->stripe_in_journal_lock);
-- 
2.10.2




^ permalink raw reply related

* (user) Help needed: mdadm seems to constantly touch my disks
From: Jure Erznožnik @ 2016-12-13  8:13 UTC (permalink / raw)
  To: linux-raid

First of all, I apologise if this mail list is not intended for layman
help, but this is what I am and I couldn't get an explanation
elsewhere.

My problem is that (as it seems) mdadm is touching HDD superblocks
once per second, once at address 8 (sectors), next at address 16.
Total traffic is kilobytes per second, writes only, no other
detectable traffic.

I have detailed the problem here:
http://unix.stackexchange.com/questions/329477/

Shortened:
kubuntu 16.10 4.8.0-30-generic #32, mdadm v3.4 2016-01-28
My configuration: 4 spinning platters (/dev/sd[cdef]) assembled into a
raid5 array, then bcache set to cache (hopefully) everything
(cache_mode = writeback, sequential_cutoff = 0). On top of bcache
volume I have set up lvm.

* iostat shows traffic on sd[cdef] and md0
* iotop shows no traffic
* iosnoop shows COMM=[idle, md0_raid5, kworker] as processes working
on the disk. Blocks reported are 8, 16 (data size a few KB) and
18446744073709500000 (data size 0). That last one must be some virtual
thingie as the disks are nowhere near that large.
* enabling block_dump shows md0_raid5 process writing to block 8 (1
sectors) and 16 (8 sectors)

This touching is caused by any write into the array and goes on for
quite a while after the write has been done (a couple of hours for
60GB of writes). When services actually work with the array, this
becomes pretty much constant.

What am I observing and is there any way of stopping it?

Thanks,
Jure

^ permalink raw reply

* [RFC PATCH v2] IV Generation algorithms for dm-crypt
From: Binoy Jayan @ 2016-12-13  8:49 UTC (permalink / raw)
  To: Oded, Ofir
  Cc: Herbert Xu, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, linux-kernel, Alasdair Kergon, Mike Snitzer,
	dm-devel, Shaohua Li, linux-raid, Rajendra, Binoy Jayan

===============================================================================
GENIV Template cipher
===============================================================================

Currently, the iv generation algorithms are implemented in dm-crypt.c. The goal
is to move these algorithms from the dm layer to the kernel crypto layer by
implementing them as template ciphers so they can be used in relation with
algorithms like aes, and with multiple modes like cbc, ecb etc. As part of this
patchset, the iv-generation code is moved from the dm layer to the crypto layer
and adapt the dm-layer to send a whole 'bio' (as defined in the block layer)
at a time. Each bio contains the in memory representation of physically
contiguous disk blocks. Since the bio itself may not be contiguous in main
memory, the dm layer sets up a chained scatterlist of these blocks split into
physically contiguous segments in memory so that DMA can be performed.

v1 --> v2
----------

1. dm-crypt changes to process larger block sizes (one segment in a bio)
2. Incorporated changes w.r.t. comments from Herbert.

v1: https://patchwork.kernel.org/patch/9439175

One challenge in doing so is that the IVs are generated based on a 512-byte
sector number. This infact limits the block sizes to 512 bytes. But this should
not be a problem if a hardware with iv generation support is used. The geniv
itself splits the segments into sectors so it could choose the IV based on
sector number. But it could be modelled in hardware effectively by not
splitting up the segments in the bio.

Another challenge faced is that dm-crypt has an option to use multiple keys.
The key selection is done based on the sector number. If the whole bio is
encrypted / decrypted with the same key, the encrypted volumes will not be
compatible with the original dm-crypt [without the changes].

The following ASCII art decomposes the kernel crypto API layers when using the
skcipher with the automated IV generation. The shown example is used by the
DM layer. For other use cases of cbc(aes), the ASCII art applies as well, but
the caller may not use the same with a separate IV generator. In this case, the
caller must generate the IV. The depicted example decomposes <ivgen>(cbc(aes))
based on the generic C implementations (geniv.c, cbc.c and aes-generic.c).
The generic implementation depicts the dependency between the templates ciphers
used in implementing geniv using the kernel crypto API.
Here, <geniv> indicates one of the following algorithms:

1. plain
2. plain64
3. essiv
4. benbi
5. null
6. lmk
7. tcw

It is possible that some streamlined cipher implementations (like AES-NI)
provide implementations merging aspects which in the view of the kernel crypto
API cannot be decomposed into layers any more. Each block in the following
ASCII art is an independent cipher instance obtained from the kernel crypto
API. Each block is accessed by the caller or by other blocks using the API
functions defined by the kernel crypto API for the cipher implementation type.
The blocks below indicate the cipher type as well as the specific logic
implemented in the cipher.

The ASCII art picture also indicates the call structure, i.e. who calls which
component. The arrows point to the invoked block where the caller uses the API
applicable to the cipher type specified for the block. For the purpose of
illustration, here we take the example of the aes mode 'cbc'. However, the IV
generation algorithm could be used with other aes modes like ecb as well.

-------------------------------------------------------------------------------
Geniv implementation
-------------------------------------------------------------------------------

NB: The ASCII art below is best viewed in a fixed-width font.

                             crypt_convert_block()              (DM Layer)
                                      |
                                      | (1)
                                      |
                                      v
+------------+   +-----------+   +-----------+       +-----------+
|            |   |           |   |           |  (2)  |           |
|  skcipher  |   | skcipher  |   | skcipher  |----+  | skcipher  |   Blocks for
| (plain/64) |   | (benbi)   |   | (essiv)   |    |  |  (null)   |   lmk, tcw
+------------+   +-----------+   +-----------+    |  +-----------+ 
     |                |               |           v         |
     | (3)            | (3)      (3)  |    +-----------+    |
     |                |               |    |           |    |
     |                |               |    |   ahash   |    | (3)
     |                |               |    |           |    |
     |                |               |    +-----------+    |
     |                |               v                     |     (Crypto API
     |                |         +-----------+               |        Layer)
     |                v         |           |               |
     +------------------------> |  skcipher | <-------------+
                                |   (cbc)   |
                                +-----------+  (AES Mode Template cipher)
                                      | (4)
                                      v
                               +-----------+
                               |           |
                               |   cipher  |   (Base generic-AES cipher)
                               |   (aes)   |
                               +-----------+

The following call sequence is applicable when the DM layer triggers an
encryption operation with the crypt_convert_block() function. During
configuration, the administrator sets up the use of <geniv>(cbc(aes)) as the
template cipher. 'geniv' can be one among plain, plain64, essiv, benbi, null,
lmk, or tcw which are all implemented as seperate templates. 
The following are the template ciphers implemented as part of 'geniv.c'

1. plain(cbc(aes))
2. plain64(cbc(aes))
3. essiv(cbc(aes))
4. benbi(cbc(aes))
5. null(cbc(aes))
6. lmk(cbc(aes))
7. tcw(cbc(aes))

The following call sequence is now depicted in the ASCII art above:

1. crypt_convert_block invokes crypto_skcipher_encrypt() to trigger encryption
   operation of a single block (i.e. sector) with the IV same as the sector no.
   For example, with essiv, the IV generation implementation is registered with
   a call to 'crypto_register_template(&crypto_essiv_tmpl)'

2. During instantiation of the 'geniv' handle, the IV generation algorithm is
   instantiated. For the purpose of illustration, we take the example of essiv.
   In this case, the ahash cipher is instantiated to calculate the hash of the
   sector to generate the IV.

3. Now, geniv uses the skcipher api calls to invoke the associated cipher. In
   our case, during the instantiation of geniv, the cipher handle for cbc is
   provided to geniv. The geniv skcipher type implementation now invokes the
   skcipher api with the instantiated cbc(aes) cipher handle. During the
   instantiation of the cbc(aes) cipher, the cipher type generic-aes is also
   instantiated. That means that the SKCIPHER implementation of cbc(aes) only
   implements the Cipher-block chaining mode. After performing block chaining
   operation, the cipher implementation of aes is invoked. The skcipher of
   cbc(aes) now invokes the cipher api with the aes cipher handle to encrypt
   one block.

-------------------------------------------------------------------------------
To be discussed
-------------------------------------------------------------------------------

1. Changes to testmgr.c
2. How to use multiple keys.
   When using multiple keys with the original dm-crypt, the key selection is
   made based on the sector number as:

   key_index = sector & (key_count - 1)

   This restricts the usage of the same key for encrypting/decrypting a single
   bio. One way to solve this is to move the key management code from dm-crypt
   to cryto layer. But this seems tricky when using template ciphers because,
   when multiple ciphers are instantiated from dm layer, each cipher instance
   set with a unique subkey (part of the bigger master key) and these instances
   themselves do not have access to each other's instances or contexts. This
   way, a single instance cannot encryt/decrypt a whole bio. I have not been
   able to get around this problem.

-------------------------------------------------------------------------------
Test procedure
-------------------------------------------------------------------------------

The algorithms are tested using 'cryptsetup' utility to create LUKS
compatible volumes on Qemu.

NB: '/dev/sdb' is a second disk volume (configured in qemu)

# One time setup - Format the device compatible with LUKS.
# Choose one of the following IV generation alorithms at a time
cryptsetup -y -c aes-cbc-plain -s 256 --hash sha256 luksFormat /dev/sdb
cryptsetup -y -c aes-cbc-plain64 -s 256 --hash sha256 luksFormat /dev/sdb
cryptsetup -y -c aes-cbc-essiv:sha256 -s 256 --hash sha256 luksFormat /dev/sdb
cryptsetup -y -c aes-cbc-benbi -s 256 --hash sha256 luksFormat /dev/sdb
cryptsetup -y -c aes-cbc-null -s 256 --hash sha256 luksFormat /dev/sdb
cryptsetup -y -c aes-cbc-lmk -s 256 --hash sha256 luksFormat /dev/sdb
cryptsetup -y -c aes-cbc-tcw -s 256 --hash sha256 luksFormat /dev/sdb

# With a keycount >= 2
# These were tested ok but not compatible with the older dm-crypt
# because the keys were selected based on the IV number when multiple
# keys are involved. This needs to be fixed.

cryptsetup -y -c aes:2-cbc-plain -s 256 --hash sha256 luksFormat /dev/sdb
cryptsetup -y -c aes:2-cbc-plain64 -s 256 --hash sha256 luksFormat /dev/sdb
cryptsetup -y -c aes:2-cbc-essiv:sha256 -s 256 --hash sha256 luksFormat /dev/sdb
cryptsetup -y -c aes:2-cbc-null -s 256 --hash sha256 luksFormat /dev/sdb
cryptsetup -y -c aes:2-cbc-lmk -s 256 --hash sha256 luksFormat /dev/sdb

# Add additional key - optional
cryptsetup luksAddKey /dev/sdb

# The above lists only a limited number of tests with the aes cipher.
# The IV generation algorithms may also be tested with other ciphers as well.

cryptsetup luksDump --dump-master-key /dev/sdb

# create a luks volume and open the device
cryptsetup luksOpen /dev/sdb crypt_fun
dmsetup table --showkeys

# Write some data to the device
cat data.txt > /dev/mapper/crypt_fun

# Read 100 bytes back
dd if=/dev/mapper/crypt_fun of=out.txt bs=100 count=1
cat out.txt

mkfs.ext4 -j /dev/mapper/crypt_fun

# Mount if fs creation succeeds
mount -t ext4 /dev/mapper/crypt_fun /mnt

<-- Use the encrypted file system -->

umount /mnt
cryptsetup luksClose crypt_fun
cryptsetup luksRemoveKey /dev/sdb

This seems to work well. The file system mounts successfully and the files
written to in the file system remain persistent across reboots.

Binoy Jayan (1):
  crypto: Add IV generation algorithms

 crypto/Kconfig         |   10 +
 crypto/Makefile        |    1 +
 crypto/geniv.c         | 1294 ++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/dm-crypt.c  |  894 +++++----------------------------
 include/crypto/geniv.h |   60 +++
 5 files changed, 1499 insertions(+), 760 deletions(-)
 create mode 100644 crypto/geniv.c
 create mode 100644 include/crypto/geniv.h

-- 
Binoy Jayan

^ permalink raw reply

* [RFC PATCH v2] crypto: Add IV generation algorithms
From: Binoy Jayan @ 2016-12-13  8:49 UTC (permalink / raw)
  To: Oded, Ofir
  Cc: Herbert Xu, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, linux-kernel, Alasdair Kergon, Mike Snitzer,
	dm-devel, Shaohua Li, linux-raid, Rajendra, Binoy Jayan
In-Reply-To: <1481618949-20086-1-git-send-email-binoy.jayan@linaro.org>

Currently, the iv generation algorithms are implemented in dm-crypt.c.
The goal is to move these algorithms from the dm layer to the kernel
crypto layer by implementing them as template ciphers so they can be
implemented in hardware for performance. As part of this patchset, the
iv-generation code is moved from the dm layer to the crypto layer and
adapt the dm-layer to send a whole 'bio' (as defined in the block layer)
at a time. Each bio contains the in memory representation of physically
contiguous disk blocks. The dm layer sets up a chained scatterlist of
these blocks split into physically contiguous segments in memory so that
DMA can be performed. The iv generation algorithms implemented in geniv.c
include plain, plain64, essiv, benbi, null, lmk and tcw.

When using multiple keys with the original dm-crypt, the key selection is
made based on the sector number as:

key_index = sector & (key_count - 1)

This restricts the usage of the same key for encrypting/decrypting a
single bio. One way to solve this is to move the key management code from
dm-crypt to cryto layer. But this seems tricky when using template ciphers
because, when multiple ciphers are instantiated from dm layer, each cipher
instance set with a unique subkey (part of the bigger master key) and
these instances themselves do not have access to each other's instances
or contexts. This way, a single instance cannot encryt/decrypt a whole bio.
This has to be fixed.

Not-signed-off-by: Binoy Jayan <binoy.jayan@linaro.org>
---
 crypto/Kconfig         |   10 +
 crypto/Makefile        |    1 +
 crypto/geniv.c         | 1294 ++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/dm-crypt.c  |  894 +++++----------------------------
 include/crypto/geniv.h |   60 +++
 5 files changed, 1499 insertions(+), 760 deletions(-)
 create mode 100644 crypto/geniv.c
 create mode 100644 include/crypto/geniv.h

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 84d7148..dc33a33 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -326,6 +326,16 @@ config CRYPTO_CTS
 	  This mode is required for Kerberos gss mechanism support
 	  for AES encryption.
 
+config CRYPTO_GENIV
+	tristate "IV Generation for dm-crypt"
+	select CRYPTO_BLKCIPHER
+	help
+	  GENIV: IV Generation for dm-crypt
+	  Algorithms to generate Initialization Vector for ciphers
+	  used by dm-crypt.  The iv generation algorithms implemented
+	  as part of geniv include plain, plain64, essiv, benbi, null,
+	  lmk and tcw.
+
 config CRYPTO_ECB
 	tristate "ECB support"
 	select CRYPTO_BLKCIPHER
diff --git a/crypto/Makefile b/crypto/Makefile
index bd6a029..627ec76 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -75,6 +75,7 @@ obj-$(CONFIG_CRYPTO_TGR192) += tgr192.o
 obj-$(CONFIG_CRYPTO_GF128MUL) += gf128mul.o
 obj-$(CONFIG_CRYPTO_ECB) += ecb.o
 obj-$(CONFIG_CRYPTO_CBC) += cbc.o
+obj-$(CONFIG_CRYPTO_GENIV) += geniv.o
 obj-$(CONFIG_CRYPTO_PCBC) += pcbc.o
 obj-$(CONFIG_CRYPTO_CTS) += cts.o
 obj-$(CONFIG_CRYPTO_LRW) += lrw.o
diff --git a/crypto/geniv.c b/crypto/geniv.c
new file mode 100644
index 0000000..ac81a49
--- /dev/null
+++ b/crypto/geniv.c
@@ -0,0 +1,1294 @@
+/*
+ * geniv: IV generation algorithms
+ *
+ * Copyright (c) 2016, Linaro Ltd.
+ * Copyright (C) 2006-2015 Red Hat, Inc. All rights reserved.
+ * Copyright (C) 2013 Milan Broz <gmazyland@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+
+#include <crypto/algapi.h>
+#include <crypto/internal/skcipher.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/log2.h>
+#include <linux/module.h>
+#include <linux/scatterlist.h>
+#include <linux/slab.h>
+#include <linux/completion.h>
+#include <linux/crypto.h>
+#include <linux/workqueue.h>
+#include <linux/backing-dev.h>
+#include <linux/atomic.h>
+#include <linux/rbtree.h>
+#include <crypto/hash.h>
+#include <crypto/md5.h>
+#include <crypto/algapi.h>
+#include <crypto/skcipher.h>
+#include <asm/unaligned.h>
+#include <crypto/geniv.h>
+
+struct geniv_ctx;
+struct crypto_geniv_req_ctx;
+
+/* Sub request for each of the skcipher_request's for a segment */
+struct crypto_geniv_subreq {
+	struct skcipher_request req CRYPTO_MINALIGN_ATTR;
+	struct crypto_geniv_req_ctx *rctx;
+};
+
+struct crypto_geniv_req_ctx {
+	struct crypto_geniv_subreq *subreqs;
+	struct scatterlist *src;
+	struct scatterlist *dst;
+	bool is_write;
+	sector_t iv_sector;
+	unsigned int nents;
+	u8 *iv;
+	struct completion restart;
+	atomic_t req_pending;
+	struct skcipher_request *req;
+};
+
+struct geniv_operations {
+	int (*ctr)(struct geniv_ctx *ctx);
+	void (*dtr)(struct geniv_ctx *ctx);
+	int (*init)(struct geniv_ctx *ctx);
+	int (*wipe)(struct geniv_ctx *ctx);
+	int (*generator)(struct geniv_ctx *ctx,
+			 struct crypto_geniv_req_ctx *rctx, int n);
+	int (*post)(struct geniv_ctx *ctx,
+		    struct crypto_geniv_req_ctx *rctx, int n);
+};
+
+struct geniv_essiv_private {
+	struct crypto_ahash *hash_tfm;
+	u8 *salt;
+};
+
+struct geniv_benbi_private {
+	int shift;
+};
+
+struct geniv_lmk_private {
+	struct crypto_shash *hash_tfm;
+	u8 *seed;
+};
+
+struct geniv_tcw_private {
+	struct crypto_shash *crc32_tfm;
+	u8 *iv_seed;
+	u8 *whitening;
+};
+
+struct geniv_ctx {
+	struct crypto_skcipher *child;
+	unsigned int tfms_count;
+	char *ivmode;
+	unsigned int iv_size;
+	char *ivopts;
+	char *cipher;
+	struct geniv_operations *iv_gen_ops;
+	union {
+		struct geniv_essiv_private essiv;
+		struct geniv_benbi_private benbi;
+		struct geniv_lmk_private lmk;
+		struct geniv_tcw_private tcw;
+	} iv_gen_private;
+	void *iv_private;
+	struct crypto_skcipher *tfm;
+	unsigned int key_size;
+	unsigned int key_extra_size;
+	unsigned int key_parts;      /* independent parts in key buffer */
+	enum setkey_op keyop;
+	char *msg;
+	u8 *key;
+};
+
+static struct crypto_skcipher *any_tfm(struct geniv_ctx *ctx)
+{
+	return ctx->tfm;
+}
+
+static inline
+struct crypto_geniv_req_ctx *geniv_req_ctx(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	unsigned long align = crypto_skcipher_alignmask(tfm);
+
+	return (void *) PTR_ALIGN((u8 *)skcipher_request_ctx(req), align + 1);
+}
+
+static int crypt_iv_plain_gen(struct geniv_ctx *ctx,
+			      struct crypto_geniv_req_ctx *rctx, int n)
+{
+	u8 *iv = rctx->iv;
+
+	memset(iv, 0, ctx->iv_size);
+	*(__le32 *)iv = cpu_to_le32(rctx->iv_sector & 0xffffffff);
+
+	return 0;
+}
+
+static int crypt_iv_plain64_gen(struct geniv_ctx *ctx,
+				struct crypto_geniv_req_ctx *rctx, int n)
+{
+	u8 *iv = rctx->iv;
+
+	memset(iv, 0, ctx->iv_size);
+	*(__le64 *)iv = cpu_to_le64(rctx->iv_sector);
+
+	return 0;
+}
+
+/* Initialise ESSIV - compute salt but no local memory allocations */
+static int crypt_iv_essiv_init(struct geniv_ctx *ctx)
+{
+	struct geniv_essiv_private *essiv = &ctx->iv_gen_private.essiv;
+	struct scatterlist sg;
+	struct crypto_cipher *essiv_tfm;
+	int err;
+	AHASH_REQUEST_ON_STACK(req, essiv->hash_tfm);
+
+	sg_init_one(&sg, ctx->key, ctx->key_size);
+	ahash_request_set_tfm(req, essiv->hash_tfm);
+	ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP, NULL, NULL);
+	ahash_request_set_crypt(req, &sg, essiv->salt, ctx->key_size);
+
+	err = crypto_ahash_digest(req);
+	ahash_request_zero(req);
+	if (err)
+		return err;
+
+	essiv_tfm = ctx->iv_private;
+
+	err = crypto_cipher_setkey(essiv_tfm, essiv->salt,
+			    crypto_ahash_digestsize(essiv->hash_tfm));
+	if (err)
+		return err;
+
+	return 0;
+}
+
+/* Wipe salt and reset key derived from volume key */
+static int crypt_iv_essiv_wipe(struct geniv_ctx *ctx)
+{
+	struct geniv_essiv_private *essiv = &ctx->iv_gen_private.essiv;
+	unsigned int salt_size = crypto_ahash_digestsize(essiv->hash_tfm);
+	struct crypto_cipher *essiv_tfm;
+	int r, err = 0;
+
+	memset(essiv->salt, 0, salt_size);
+
+	essiv_tfm = ctx->iv_private;
+	r = crypto_cipher_setkey(essiv_tfm, essiv->salt, salt_size);
+	if (r)
+		err = r;
+
+	return err;
+}
+
+/* Set up per cpu cipher state */
+static struct crypto_cipher *setup_essiv_cpu(struct geniv_ctx *ctx,
+					     u8 *salt, unsigned int saltsize)
+{
+	struct crypto_cipher *essiv_tfm;
+	int err;
+
+	/* Setup the essiv_tfm with the given salt */
+	essiv_tfm = crypto_alloc_cipher(ctx->cipher, 0, CRYPTO_ALG_ASYNC);
+
+	if (IS_ERR(essiv_tfm)) {
+		pr_err("Error allocating crypto tfm for ESSIV\n");
+		return essiv_tfm;
+	}
+
+	if (crypto_cipher_blocksize(essiv_tfm) !=
+	    crypto_skcipher_ivsize(any_tfm(ctx))) {
+		pr_err("Block size of ESSIV cipher does not match IV size of block cipher\n");
+		crypto_free_cipher(essiv_tfm);
+		return ERR_PTR(-EINVAL);
+	}
+
+	err = crypto_cipher_setkey(essiv_tfm, salt, saltsize);
+	if (err) {
+		pr_err("Failed to set key for ESSIV cipher\n");
+		crypto_free_cipher(essiv_tfm);
+		return ERR_PTR(err);
+	}
+	return essiv_tfm;
+}
+
+static void crypt_iv_essiv_dtr(struct geniv_ctx *ctx)
+{
+	struct crypto_cipher *essiv_tfm;
+	struct geniv_essiv_private *essiv = &ctx->iv_gen_private.essiv;
+
+	crypto_free_ahash(essiv->hash_tfm);
+	essiv->hash_tfm = NULL;
+
+	kzfree(essiv->salt);
+	essiv->salt = NULL;
+
+	essiv_tfm = ctx->iv_private;
+
+	if (essiv_tfm)
+		crypto_free_cipher(essiv_tfm);
+
+	ctx->iv_private = NULL;
+}
+
+static int crypt_iv_essiv_ctr(struct geniv_ctx *ctx)
+{
+	struct crypto_cipher *essiv_tfm = NULL;
+	struct crypto_ahash *hash_tfm = NULL;
+	u8 *salt = NULL;
+	int err;
+
+	if (!ctx->ivopts) {
+		pr_err("Digest algorithm missing for ESSIV mode\n");
+		return -EINVAL;
+	}
+
+	/* Allocate hash algorithm */
+	hash_tfm = crypto_alloc_ahash(ctx->ivopts, 0, CRYPTO_ALG_ASYNC);
+	if (IS_ERR(hash_tfm)) {
+		err = PTR_ERR(hash_tfm);
+		pr_err("Error initializing ESSIV hash. err=%d\n", err);
+		goto bad;
+	}
+
+	salt = kzalloc(crypto_ahash_digestsize(hash_tfm), GFP_KERNEL);
+	if (!salt) {
+		err = -ENOMEM;
+		goto bad;
+	}
+
+	ctx->iv_gen_private.essiv.salt = salt;
+	ctx->iv_gen_private.essiv.hash_tfm = hash_tfm;
+
+	essiv_tfm = setup_essiv_cpu(ctx, salt,
+				crypto_ahash_digestsize(hash_tfm));
+	if (IS_ERR(essiv_tfm)) {
+		crypt_iv_essiv_dtr(ctx);
+		return PTR_ERR(essiv_tfm);
+	}
+	ctx->iv_private = essiv_tfm;
+
+	return 0;
+
+bad:
+	if (hash_tfm && !IS_ERR(hash_tfm))
+		crypto_free_ahash(hash_tfm);
+	kfree(salt);
+	return err;
+}
+
+static int crypt_iv_essiv_gen(struct geniv_ctx *ctx,
+			      struct crypto_geniv_req_ctx *rctx, int n)
+{
+	u8 *iv = rctx->iv;
+	struct crypto_cipher *essiv_tfm = ctx->iv_private;
+
+	memset(iv, 0, ctx->iv_size);
+	*(__le64 *)iv = cpu_to_le64(rctx->iv_sector);
+	crypto_cipher_encrypt_one(essiv_tfm, iv, iv);
+
+	return 0;
+}
+
+static int crypt_iv_benbi_ctr(struct geniv_ctx *ctx)
+{
+	unsigned int bs = crypto_skcipher_blocksize(any_tfm(ctx));
+	int log = ilog2(bs);
+
+	/* we need to calculate how far we must shift the sector count
+	 * to get the cipher block count, we use this shift in _gen
+	 */
+
+	if (1 << log != bs) {
+		pr_err("cypher blocksize is not a power of 2\n");
+		return -EINVAL;
+	}
+
+	if (log > 9) {
+		pr_err("cypher blocksize is > 512\n");
+		return -EINVAL;
+	}
+
+	ctx->iv_gen_private.benbi.shift = 9 - log;
+
+	return 0;
+}
+
+static int crypt_iv_benbi_gen(struct geniv_ctx *ctx,
+			      struct crypto_geniv_req_ctx *rctx, int n)
+{
+	u8 *iv = rctx->iv;
+	__be64 val;
+
+	memset(iv, 0, ctx->iv_size - sizeof(u64)); /* rest is cleared below */
+
+	val = cpu_to_be64(((u64) rctx->iv_sector <<
+			  ctx->iv_gen_private.benbi.shift) + 1);
+	put_unaligned(val, (__be64 *)(iv + ctx->iv_size - sizeof(u64)));
+
+	return 0;
+}
+
+static int crypt_iv_null_gen(struct geniv_ctx *ctx,
+			     struct crypto_geniv_req_ctx *rctx, int n)
+{
+	u8 *iv = rctx->iv;
+
+	memset(iv, 0, ctx->iv_size);
+	return 0;
+}
+
+static void crypt_iv_lmk_dtr(struct geniv_ctx *ctx)
+{
+	struct geniv_lmk_private *lmk = &ctx->iv_gen_private.lmk;
+
+	if (lmk->hash_tfm && !IS_ERR(lmk->hash_tfm))
+		crypto_free_shash(lmk->hash_tfm);
+	lmk->hash_tfm = NULL;
+
+	kzfree(lmk->seed);
+	lmk->seed = NULL;
+}
+
+static int crypt_iv_lmk_ctr(struct geniv_ctx *ctx)
+{
+	struct geniv_lmk_private *lmk = &ctx->iv_gen_private.lmk;
+
+	lmk->hash_tfm = crypto_alloc_shash("md5", 0, 0);
+	if (IS_ERR(lmk->hash_tfm)) {
+		pr_err("Error initializing LMK hash; err=%ld\n",
+				PTR_ERR(lmk->hash_tfm));
+		return PTR_ERR(lmk->hash_tfm);
+	}
+
+	/* No seed in LMK version 2 */
+	if (ctx->key_parts == ctx->tfms_count) {
+		lmk->seed = NULL;
+		return 0;
+	}
+
+	lmk->seed = kzalloc(LMK_SEED_SIZE, GFP_KERNEL);
+	if (!lmk->seed) {
+		crypt_iv_lmk_dtr(ctx);
+		pr_err("Error kmallocing seed storage in LMK\n");
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int crypt_iv_lmk_init(struct geniv_ctx *ctx)
+{
+	struct geniv_lmk_private *lmk = &ctx->iv_gen_private.lmk;
+	int subkey_size = ctx->key_size / ctx->key_parts;
+
+	/* LMK seed is on the position of LMK_KEYS + 1 key */
+	if (lmk->seed)
+		memcpy(lmk->seed, ctx->key + (ctx->tfms_count * subkey_size),
+		       crypto_shash_digestsize(lmk->hash_tfm));
+
+	return 0;
+}
+
+static int crypt_iv_lmk_wipe(struct geniv_ctx *ctx)
+{
+	struct geniv_lmk_private *lmk = &ctx->iv_gen_private.lmk;
+
+	if (lmk->seed)
+		memset(lmk->seed, 0, LMK_SEED_SIZE);
+
+	return 0;
+}
+
+static int crypt_iv_lmk_one(struct geniv_ctx *ctx, u8 *iv,
+			    struct crypto_geniv_req_ctx *rctx, u8 *data)
+{
+	struct geniv_lmk_private *lmk = &ctx->iv_gen_private.lmk;
+	struct md5_state md5state;
+	__le32 buf[4];
+	int i, r;
+	SHASH_DESC_ON_STACK(desc, lmk->hash_tfm);
+
+	desc->tfm = lmk->hash_tfm;
+	desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP;
+
+	r = crypto_shash_init(desc);
+	if (r)
+		return r;
+
+	if (lmk->seed) {
+		r = crypto_shash_update(desc, lmk->seed, LMK_SEED_SIZE);
+		if (r)
+			return r;
+	}
+
+	/* Sector is always 512B, block size 16, add data of blocks 1-31 */
+	r = crypto_shash_update(desc, data + 16, 16 * 31);
+	if (r)
+		return r;
+
+	/* Sector is cropped to 56 bits here */
+	buf[0] = cpu_to_le32(rctx->iv_sector & 0xFFFFFFFF);
+	buf[1] = cpu_to_le32((((u64)rctx->iv_sector >> 32) & 0x00FFFFFF)
+			     | 0x80000000);
+	buf[2] = cpu_to_le32(4024);
+	buf[3] = 0;
+	r = crypto_shash_update(desc, (u8 *)buf, sizeof(buf));
+	if (r)
+		return r;
+
+	/* No MD5 padding here */
+	r = crypto_shash_export(desc, &md5state);
+	if (r)
+		return r;
+
+	for (i = 0; i < MD5_HASH_WORDS; i++)
+		__cpu_to_le32s(&md5state.hash[i]);
+	memcpy(iv, &md5state.hash, ctx->iv_size);
+
+	return 0;
+}
+
+static int crypt_iv_lmk_gen(struct geniv_ctx *ctx,
+			    struct crypto_geniv_req_ctx *rctx, int n)
+{
+	u8 *src;
+	u8 *iv = rctx->iv;
+	int r = 0;
+
+	if (rctx->is_write) {
+		src = kmap_atomic(sg_page(&rctx->src[n]));
+		r = crypt_iv_lmk_one(ctx, iv, rctx, src + rctx->src[n].offset);
+		kunmap_atomic(src);
+	} else
+		memset(iv, 0, ctx->iv_size);
+
+	return r;
+}
+
+static int crypt_iv_lmk_post(struct geniv_ctx *ctx,
+			     struct crypto_geniv_req_ctx *rctx, int n)
+{
+	u8 *dst;
+	u8 *iv = rctx->iv;
+	int r;
+
+	if (rctx->is_write)
+		return 0;
+
+	dst = kmap_atomic(sg_page(&rctx->dst[n]));
+	r = crypt_iv_lmk_one(ctx, iv, rctx, dst + rctx->dst[n].offset);
+
+	/* Tweak the first block of plaintext sector */
+	if (!r)
+		crypto_xor(dst + rctx->dst[n].offset, iv, ctx->iv_size);
+
+	kunmap_atomic(dst);
+	return r;
+}
+
+static void crypt_iv_tcw_dtr(struct geniv_ctx *ctx)
+{
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
+
+	kzfree(tcw->iv_seed);
+	tcw->iv_seed = NULL;
+	kzfree(tcw->whitening);
+	tcw->whitening = NULL;
+
+	if (tcw->crc32_tfm && !IS_ERR(tcw->crc32_tfm))
+		crypto_free_shash(tcw->crc32_tfm);
+	tcw->crc32_tfm = NULL;
+}
+
+static int crypt_iv_tcw_ctr(struct geniv_ctx *ctx)
+{
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
+
+	if (ctx->key_size <= (ctx->iv_size + TCW_WHITENING_SIZE)) {
+		pr_err("Wrong key size (%d) for TCW. Choose a value > %d bytes\n",
+			ctx->key_size,
+			ctx->iv_size + TCW_WHITENING_SIZE);
+		return -EINVAL;
+	}
+
+	tcw->crc32_tfm = crypto_alloc_shash("crc32", 0, 0);
+	if (IS_ERR(tcw->crc32_tfm)) {
+		pr_err("Error initializing CRC32 in TCW; err=%ld\n",
+			PTR_ERR(tcw->crc32_tfm));
+		return PTR_ERR(tcw->crc32_tfm);
+	}
+
+	tcw->iv_seed = kzalloc(ctx->iv_size, GFP_KERNEL);
+	tcw->whitening = kzalloc(TCW_WHITENING_SIZE, GFP_KERNEL);
+	if (!tcw->iv_seed || !tcw->whitening) {
+		crypt_iv_tcw_dtr(ctx);
+		pr_err("Error allocating seed storage in TCW\n");
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int crypt_iv_tcw_init(struct geniv_ctx *ctx)
+{
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
+	int key_offset = ctx->key_size - ctx->iv_size - TCW_WHITENING_SIZE;
+
+	memcpy(tcw->iv_seed, &ctx->key[key_offset], ctx->iv_size);
+	memcpy(tcw->whitening, &ctx->key[key_offset + ctx->iv_size],
+	       TCW_WHITENING_SIZE);
+
+	return 0;
+}
+
+static int crypt_iv_tcw_wipe(struct geniv_ctx *ctx)
+{
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
+
+	memset(tcw->iv_seed, 0, ctx->iv_size);
+	memset(tcw->whitening, 0, TCW_WHITENING_SIZE);
+
+	return 0;
+}
+
+static int crypt_iv_tcw_whitening(struct geniv_ctx *ctx,
+				  struct crypto_geniv_req_ctx *rctx, u8 *data)
+{
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
+	__le64 sector = cpu_to_le64(rctx->iv_sector);
+	u8 buf[TCW_WHITENING_SIZE];
+	int i, r;
+	SHASH_DESC_ON_STACK(desc, tcw->crc32_tfm);
+
+	/* xor whitening with sector number */
+	memcpy(buf, tcw->whitening, TCW_WHITENING_SIZE);
+	crypto_xor(buf, (u8 *)&sector, 8);
+	crypto_xor(&buf[8], (u8 *)&sector, 8);
+
+	/* calculate crc32 for every 32bit part and xor it */
+	desc->tfm = tcw->crc32_tfm;
+	desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP;
+	for (i = 0; i < 4; i++) {
+		r = crypto_shash_init(desc);
+		if (r)
+			goto out;
+		r = crypto_shash_update(desc, &buf[i * 4], 4);
+		if (r)
+			goto out;
+		r = crypto_shash_final(desc, &buf[i * 4]);
+		if (r)
+			goto out;
+	}
+	crypto_xor(&buf[0], &buf[12], 4);
+	crypto_xor(&buf[4], &buf[8], 4);
+
+	/* apply whitening (8 bytes) to whole sector */
+	for (i = 0; i < (SECTOR_SIZE / 8); i++)
+		crypto_xor(data + i * 8, buf, 8);
+out:
+	memzero_explicit(buf, sizeof(buf));
+	return r;
+}
+
+static int crypt_iv_tcw_gen(struct geniv_ctx *ctx,
+			    struct crypto_geniv_req_ctx *rctx, int n)
+{
+	u8 *iv = rctx->iv;
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
+	__le64 sector = cpu_to_le64(rctx->iv_sector);
+	u8 *src;
+	int r = 0;
+
+	/* Remove whitening from ciphertext */
+	if (!rctx->is_write) {
+		src = kmap_atomic(sg_page(&rctx->src[n]));
+		r = crypt_iv_tcw_whitening(ctx, rctx,
+					   src + rctx->src[n].offset);
+		kunmap_atomic(src);
+	}
+
+	/* Calculate IV */
+	memcpy(iv, tcw->iv_seed, ctx->iv_size);
+	crypto_xor(iv, (u8 *)&sector, 8);
+	if (ctx->iv_size > 8)
+		crypto_xor(&iv[8], (u8 *)&sector, ctx->iv_size - 8);
+
+	return r;
+}
+
+static int crypt_iv_tcw_post(struct geniv_ctx *ctx,
+			     struct crypto_geniv_req_ctx *rctx, int n)
+{
+	u8 *dst;
+	int r;
+
+	if (!rctx->is_write)
+		return 0;
+
+	/* Apply whitening on ciphertext */
+	dst = kmap_atomic(sg_page(&rctx->dst[n]));
+	r = crypt_iv_tcw_whitening(ctx, rctx, dst + rctx->dst[n].offset);
+	kunmap_atomic(dst);
+
+	return r;
+}
+
+static struct geniv_operations crypt_iv_plain_ops = {
+	.generator = crypt_iv_plain_gen
+};
+
+static struct geniv_operations crypt_iv_plain64_ops = {
+	.generator = crypt_iv_plain64_gen
+};
+
+static struct geniv_operations crypt_iv_essiv_ops = {
+	.ctr       = crypt_iv_essiv_ctr,
+	.dtr       = crypt_iv_essiv_dtr,
+	.init      = crypt_iv_essiv_init,
+	.wipe      = crypt_iv_essiv_wipe,
+	.generator = crypt_iv_essiv_gen
+};
+
+static struct geniv_operations crypt_iv_benbi_ops = {
+	.ctr	   = crypt_iv_benbi_ctr,
+	.generator = crypt_iv_benbi_gen
+};
+
+static struct geniv_operations crypt_iv_null_ops = {
+	.generator = crypt_iv_null_gen
+};
+
+static struct geniv_operations crypt_iv_lmk_ops = {
+	.ctr	   = crypt_iv_lmk_ctr,
+	.dtr	   = crypt_iv_lmk_dtr,
+	.init	   = crypt_iv_lmk_init,
+	.wipe	   = crypt_iv_lmk_wipe,
+	.generator = crypt_iv_lmk_gen,
+	.post	   = crypt_iv_lmk_post
+};
+
+static struct geniv_operations crypt_iv_tcw_ops = {
+	.ctr	   = crypt_iv_tcw_ctr,
+	.dtr	   = crypt_iv_tcw_dtr,
+	.init	   = crypt_iv_tcw_init,
+	.wipe	   = crypt_iv_tcw_wipe,
+	.generator = crypt_iv_tcw_gen,
+	.post	   = crypt_iv_tcw_post
+};
+
+static int geniv_setkey_set(struct geniv_ctx *ctx)
+{
+	int ret = 0;
+
+	if (ctx->iv_gen_ops && ctx->iv_gen_ops->init)
+		ret = ctx->iv_gen_ops->init(ctx);
+	return ret;
+}
+
+static int geniv_setkey_wipe(struct geniv_ctx *ctx)
+{
+	int ret = 0;
+
+	if (ctx->iv_gen_ops && ctx->iv_gen_ops->wipe) {
+		ret = ctx->iv_gen_ops->wipe(ctx);
+		if (ret)
+			return ret;
+	}
+	return ret;
+}
+
+static int geniv_setkey_init_ctx(struct geniv_ctx *ctx)
+{
+	int ret = -EINVAL;
+
+	pr_debug("IV Generation algorithm : %s\n", ctx->ivmode);
+
+	if (ctx->ivmode == NULL)
+		ctx->iv_gen_ops = NULL;
+	else if (strcmp(ctx->ivmode, "plain") == 0)
+		ctx->iv_gen_ops = &crypt_iv_plain_ops;
+	else if (strcmp(ctx->ivmode, "plain64") == 0)
+		ctx->iv_gen_ops = &crypt_iv_plain64_ops;
+	else if (strcmp(ctx->ivmode, "essiv") == 0)
+		ctx->iv_gen_ops = &crypt_iv_essiv_ops;
+	else if (strcmp(ctx->ivmode, "benbi") == 0)
+		ctx->iv_gen_ops = &crypt_iv_benbi_ops;
+	else if (strcmp(ctx->ivmode, "null") == 0)
+		ctx->iv_gen_ops = &crypt_iv_null_ops;
+	else if (strcmp(ctx->ivmode, "lmk") == 0)
+		ctx->iv_gen_ops = &crypt_iv_lmk_ops;
+	else if (strcmp(ctx->ivmode, "tcw") == 0) {
+		ctx->iv_gen_ops = &crypt_iv_tcw_ops;
+		ctx->key_parts += 2; /* IV + whitening */
+		ctx->key_extra_size = ctx->iv_size + TCW_WHITENING_SIZE;
+	} else {
+		ret = -EINVAL;
+		pr_err("Invalid IV mode %s\n", ctx->ivmode);
+		goto end;
+	}
+
+	/* Allocate IV */
+	if (ctx->iv_gen_ops && ctx->iv_gen_ops->ctr) {
+		ret = ctx->iv_gen_ops->ctr(ctx);
+		if (ret < 0) {
+			pr_err("Error creating IV for %s\n", ctx->ivmode);
+			goto end;
+		}
+	}
+
+	/* Initialize IV (set keys for ESSIV etc) */
+	if (ctx->iv_gen_ops && ctx->iv_gen_ops->init) {
+		ret = ctx->iv_gen_ops->init(ctx);
+		if (ret < 0)
+			pr_err("Error creating IV for %s\n", ctx->ivmode);
+	}
+	ret = 0;
+end:
+	return ret;
+}
+
+/* Initialize the cipher's context with the key, ivmode and other parameters.
+ * Also allocate IV generation template ciphers and initialize them.
+ */
+
+static int geniv_setkey_init(struct crypto_skcipher *parent,
+			     struct geniv_key_info *info)
+{
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(parent);
+
+	ctx->tfm = parent;
+	ctx->iv_size = crypto_skcipher_ivsize(parent);
+	ctx->tfms_count = info->tfms_count;
+	ctx->cipher = info->cipher;
+	ctx->key = info->key;
+	ctx->key_size = info->key_size;
+	ctx->key_parts = info->key_parts;
+	ctx->ivmode = info->ivmode;
+	ctx->ivopts = info->ivopts;
+	return geniv_setkey_init_ctx(ctx);
+}
+
+static int crypto_geniv_setkey(struct crypto_skcipher *parent,
+				const u8 *key, unsigned int keylen)
+{
+	int err;
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(parent);
+	struct crypto_skcipher *child = ctx->child;
+	struct geniv_key_info *info = (struct geniv_key_info *) key;
+
+	pr_debug("SETKEY Operation : %d\n", info->keyop);
+
+	switch (info->keyop) {
+	case SETKEY_OP_INIT:
+		err = geniv_setkey_init(parent, info);
+		break;
+	case SETKEY_OP_SET:
+		err = geniv_setkey_set(ctx);
+		break;
+	case SETKEY_OP_WIPE:
+		err = geniv_setkey_wipe(ctx);
+		break;
+	}
+
+	crypto_skcipher_clear_flags(child, CRYPTO_TFM_REQ_MASK);
+	crypto_skcipher_set_flags(child, crypto_skcipher_get_flags(parent) &
+					 CRYPTO_TFM_REQ_MASK);
+	err = crypto_skcipher_setkey(child, info->subkey, info->subkey_size);
+	crypto_skcipher_set_flags(parent, crypto_skcipher_get_flags(child) &
+					  CRYPTO_TFM_RES_MASK);
+	return err;
+}
+
+/* Asynchronous IO completion callback for each sector in a segment. When all
+ * pending i/o are completed the parent cipher's async function is called.
+ */
+
+static void geniv_async_done(struct crypto_async_request *async_req, int error)
+{
+	struct crypto_geniv_subreq *subreq =
+		(struct crypto_geniv_subreq *) async_req->data;
+	struct crypto_geniv_req_ctx *rctx = subreq->rctx;
+	struct skcipher_request *req = rctx->req;
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(tfm);
+	int n = subreq - rctx->subreqs;
+
+	/*
+	 * A request from crypto driver backlog is going to be processed now,
+	 * finish the completion and continue in crypt_convert().
+	 * (Callback will be called for the second time for this request.)
+	 */
+	if (error == -EINPROGRESS) {
+		complete(&rctx->restart);
+		return;
+	}
+
+	if (!error && ctx->iv_gen_ops && ctx->iv_gen_ops->post)
+		error = ctx->iv_gen_ops->post(ctx, rctx, n);
+
+	/* req_pending needs to be checked before req->base.complete is called
+	 * as we need 'req_pending' to be equal to 1 to ensure all subrequests
+	 * are processed before freeing subreq array
+	 */
+	if (!atomic_dec_and_test(&rctx->req_pending)) {
+		/* Call the parent cipher's completion function */
+		skcipher_request_complete(req, error);
+		kfree(rctx->subreqs);
+		kfree(rctx->src);
+		kfree(rctx->dst);
+	}
+}
+
+/* Split scatterlist of segments into scatterlist of sectors so that unique IVs
+ * could be generated for each 512-byte sector. This split may not be necessary
+ * for example when these ciphers are modelled in hardware, where in can make
+ * use of the hardware's IV generation capabilities.
+ */
+static unsigned int geniv_split_req(struct scatterlist *sg,
+				    struct scatterlist **sg2_ptr,
+				    unsigned int segments)
+{
+	unsigned int i, j, nents = 0, off, len;
+	struct scatterlist *sg2;
+
+	for (i = 0; i < segments ; i++)
+		nents += sg[i].length / SECTOR_SIZE;
+
+	pr_debug("geniv: splitting scatterlist with %d segments into %d ents\n",
+		 segments, nents);
+	sg2 = kcalloc(nents, sizeof(struct scatterlist), GFP_KERNEL);
+	*sg2_ptr = sg2;
+	for (i = 0, j = 0; i < segments ; i++) {
+
+		off = sg[i].offset;
+		len = sg[i].length;
+
+		for (; len > 0; j++) {
+			sg_set_page(&sg2[j], sg_page(&sg[i]), SECTOR_SIZE, off);
+			off += SECTOR_SIZE;
+			len -= SECTOR_SIZE;
+		}
+	}
+	return nents;
+}
+
+/* Common encryt/decrypt function for geniv template cipher. Before the crypto
+ * operation, it splits the memory segments (in the scatterlist) into 512 byte
+ * sectors. The initialization vector(IV) used is based on a unique sector
+ * number which is generated here.
+ */
+static inline int crypto_geniv_crypt(struct skcipher_request *req, bool encrypt)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct crypto_geniv_req_ctx *rctx = geniv_req_ctx(req);
+	struct crypto_geniv_subreq *subreqs;
+	struct geniv_req_info *rinfo = (struct geniv_req_info *) req->iv;
+	int i, bytes, cryptlen, ret = 0, n1, n2;
+	char *str = encrypt ? "encrypt" : "decrypt";
+
+	/* Instance of 'struct geniv_req_info' is stored in IV ptr */
+	rctx->is_write = rinfo->is_write;
+	rctx->iv_sector = rinfo->iv_sector;
+	rctx->nents = rinfo->nents;
+	rctx->iv = rinfo->iv;
+
+	pr_debug("geniv:%s: starting sector=%d, #segments=%u\n", str,
+		 (unsigned int) rctx->iv_sector, rctx->nents);
+
+	cryptlen = req->cryptlen;
+	n1 = geniv_split_req(req->src, &rctx->src, rctx->nents);
+	n2 = geniv_split_req(req->dst, &rctx->dst, rctx->nents);
+	rctx->nents = n1 > n2 ? n1 : n2;
+
+	subreqs = kcalloc(rctx->nents, sizeof(struct crypto_geniv_subreq),
+			  GFP_KERNEL);
+	rctx->subreqs = subreqs;
+	rctx->req = req;
+
+	init_completion(&rctx->restart);
+	atomic_set(&rctx->req_pending, 1);
+	for (i = 0; i < rctx->nents; i++) {
+		struct skcipher_request *subreq = &subreqs[i].req;
+
+		subreqs[i].rctx = rctx;
+		atomic_inc(&rctx->req_pending);
+		if (ctx->iv_gen_ops)
+			ret = ctx->iv_gen_ops->generator(ctx, rctx, i);
+
+		if (ret < 0) {
+			pr_err("Error in generating IV ret: %d\n", ret);
+			goto end;
+		}
+
+		skcipher_request_set_tfm(subreq, ctx->child);
+		skcipher_request_set_callback(subreq, req->base.flags,
+					      geniv_async_done, &subreqs[i]);
+
+		bytes = cryptlen < SECTOR_SIZE ? cryptlen : SECTOR_SIZE;
+
+		skcipher_request_set_crypt(subreq, &rctx->src[i],
+					   &rctx->dst[i], bytes, rctx->iv);
+		cryptlen -= bytes;
+
+		if (encrypt)
+			ret = crypto_skcipher_encrypt(subreq);
+		else
+			ret = crypto_skcipher_decrypt(subreq);
+
+
+		if (!ret && ctx->iv_gen_ops && ctx->iv_gen_ops->post)
+			ret = ctx->iv_gen_ops->post(ctx, rctx, i);
+
+		switch (ret) {
+		/*
+		 * The request was queued by a crypto driver
+		 * but the driver request queue is full, let's wait.
+		 */
+		case -EBUSY:
+			wait_for_completion(&rctx->restart);
+			reinit_completion(&rctx->restart);
+			/* fall through */
+		/*
+		 * The request is queued and processed asynchronously,
+		 * completion function geniv_async_done() is called.
+		 */
+		case -EINPROGRESS:
+			rctx->iv_sector++;
+			cond_resched();
+			break;
+		/*
+		 * The request was already processed (synchronously).
+		 */
+		case 0:
+			atomic_dec(&rctx->req_pending);
+			rctx->iv_sector++;
+			cond_resched();
+			continue;
+
+		/* There was an error while processing the request. */
+		default:
+			atomic_dec(&rctx->req_pending);
+			return ret;
+		}
+
+		if (ret)
+			break;
+	}
+
+	if (atomic_read(&rctx->req_pending) == 1) {
+		pr_debug("geniv:%s: Freeing subreq and scatterlists\n", str);
+		kfree(subreqs);
+		kfree(rctx->src);
+		kfree(rctx->dst);
+	}
+
+end:
+	return ret;
+}
+
+static int crypto_geniv_encrypt(struct skcipher_request *req)
+{
+	return crypto_geniv_crypt(req, true);
+}
+
+static int crypto_geniv_decrypt(struct skcipher_request *req)
+{
+	return crypto_geniv_crypt(req, false);
+}
+
+static int crypto_geniv_init_tfm(struct crypto_skcipher *tfm)
+{
+	struct skcipher_instance *inst = skcipher_alg_instance(tfm);
+	struct crypto_skcipher_spawn *spawn = skcipher_instance_ctx(inst);
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct crypto_skcipher *cipher;
+	unsigned long align;
+	unsigned int reqsize;
+
+	cipher = crypto_spawn_skcipher2(spawn);
+	if (IS_ERR(cipher))
+		return PTR_ERR(cipher);
+
+	ctx->child = cipher;
+
+	/* Setup the current cipher's request structure */
+	align = crypto_skcipher_alignmask(tfm);
+	align &= ~(crypto_tfm_ctx_alignment() - 1);
+	reqsize = align + sizeof(struct crypto_geniv_req_ctx) +
+		  crypto_skcipher_reqsize(cipher);
+	crypto_skcipher_set_reqsize(tfm, reqsize);
+
+	return 0;
+}
+
+static void crypto_geniv_exit_tfm(struct crypto_skcipher *tfm)
+{
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	if (ctx->iv_gen_ops && ctx->iv_gen_ops->dtr)
+		ctx->iv_gen_ops->dtr(ctx);
+
+	crypto_free_skcipher(ctx->child);
+}
+
+static void crypto_geniv_free(struct skcipher_instance *inst)
+{
+	struct crypto_skcipher_spawn *spawn = skcipher_instance_ctx(inst);
+
+	crypto_drop_skcipher(spawn);
+	kfree(inst);
+}
+
+static int crypto_geniv_create(struct crypto_template *tmpl,
+				 struct rtattr **tb, char *algname)
+{
+	struct crypto_attr_type *algt;
+	struct skcipher_instance *inst;
+	struct skcipher_alg *alg;
+	struct crypto_skcipher_spawn *spawn;
+	const char *cipher_name;
+	int err;
+
+	algt = crypto_get_attr_type(tb);
+
+	if (IS_ERR(algt))
+		return PTR_ERR(algt);
+
+	if ((algt->type ^ CRYPTO_ALG_TYPE_SKCIPHER) & algt->mask)
+		return -EINVAL;
+
+	cipher_name = crypto_attr_alg_name(tb[1]);
+
+	if (IS_ERR(cipher_name))
+		return PTR_ERR(cipher_name);
+
+	inst = kzalloc(sizeof(*inst) + sizeof(*spawn), GFP_KERNEL);
+	if (!inst)
+		return -ENOMEM;
+
+	spawn = skcipher_instance_ctx(inst);
+
+	crypto_set_skcipher_spawn(spawn, skcipher_crypto_instance(inst));
+	err = crypto_grab_skcipher2(spawn, cipher_name, 0,
+				    crypto_requires_sync(algt->type,
+							 algt->mask));
+
+	if (err)
+		goto err_free_inst;
+
+	alg = crypto_spawn_skcipher_alg(spawn);
+
+	/* We only support 16-byte blocks. */
+	err = -EINVAL;
+
+	if (!is_power_of_2(alg->base.cra_blocksize))
+		goto err_drop_spawn;
+
+	err = -ENAMETOOLONG;
+	if (snprintf(inst->alg.base.cra_name, CRYPTO_MAX_ALG_NAME, "%s(%s)",
+		     algname, alg->base.cra_name) >= CRYPTO_MAX_ALG_NAME)
+		goto err_drop_spawn;
+	if (snprintf(inst->alg.base.cra_driver_name, CRYPTO_MAX_ALG_NAME,
+		     "%s(%s)", algname, alg->base.cra_driver_name) >=
+	    CRYPTO_MAX_ALG_NAME)
+		goto err_drop_spawn;
+
+	inst->alg.base.cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER;
+	inst->alg.base.cra_priority = alg->base.cra_priority;
+	inst->alg.base.cra_blocksize = alg->base.cra_blocksize;
+	inst->alg.base.cra_alignmask = alg->base.cra_alignmask;
+	inst->alg.base.cra_flags = alg->base.cra_flags & CRYPTO_ALG_ASYNC;
+	inst->alg.ivsize = alg->base.cra_blocksize;
+	inst->alg.chunksize = crypto_skcipher_alg_chunksize(alg);
+	inst->alg.min_keysize = crypto_skcipher_alg_min_keysize(alg);
+	inst->alg.max_keysize = crypto_skcipher_alg_max_keysize(alg);
+
+	inst->alg.setkey = crypto_geniv_setkey;
+	inst->alg.encrypt = crypto_geniv_encrypt;
+	inst->alg.decrypt = crypto_geniv_decrypt;
+
+	inst->alg.base.cra_ctxsize = sizeof(struct geniv_ctx);
+
+	inst->alg.init = crypto_geniv_init_tfm;
+	inst->alg.exit = crypto_geniv_exit_tfm;
+
+	inst->free = crypto_geniv_free;
+
+	err = skcipher_register_instance(tmpl, inst);
+	if (err)
+		goto err_drop_spawn;
+
+out:
+	return err;
+
+err_drop_spawn:
+	crypto_drop_skcipher(spawn);
+err_free_inst:
+	kfree(inst);
+	goto out;
+}
+
+static int crypto_plain_create(struct crypto_template *tmpl,
+				struct rtattr **tb)
+{
+	return crypto_geniv_create(tmpl, tb, "plain");
+}
+
+static int crypto_plain64_create(struct crypto_template *tmpl,
+				struct rtattr **tb)
+{
+	return crypto_geniv_create(tmpl, tb, "plain64");
+}
+
+static int crypto_essiv_create(struct crypto_template *tmpl,
+				struct rtattr **tb)
+{
+	return crypto_geniv_create(tmpl, tb, "essiv");
+}
+
+static int crypto_benbi_create(struct crypto_template *tmpl,
+				struct rtattr **tb)
+{
+	return crypto_geniv_create(tmpl, tb, "benbi");
+}
+
+static int crypto_null_create(struct crypto_template *tmpl,
+				struct rtattr **tb)
+{
+	return crypto_geniv_create(tmpl, tb, "null");
+}
+
+static int crypto_lmk_create(struct crypto_template *tmpl,
+				struct rtattr **tb)
+{
+	return crypto_geniv_create(tmpl, tb, "lmk");
+}
+
+static int crypto_tcw_create(struct crypto_template *tmpl,
+				struct rtattr **tb)
+{
+	return crypto_geniv_create(tmpl, tb, "tcw");
+}
+
+static struct crypto_template crypto_plain_tmpl = {
+	.name   = "plain",
+	.create = crypto_plain_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_plain64_tmpl = {
+	.name   = "plain64",
+	.create = crypto_plain64_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_essiv_tmpl = {
+	.name   = "essiv",
+	.create = crypto_essiv_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_benbi_tmpl = {
+	.name   = "benbi",
+	.create = crypto_benbi_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_null_tmpl = {
+	.name   = "null",
+	.create = crypto_null_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_lmk_tmpl = {
+	.name   = "lmk",
+	.create = crypto_lmk_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_tcw_tmpl = {
+	.name   = "tcw",
+	.create = crypto_tcw_create,
+	.module = THIS_MODULE,
+};
+
+static int __init crypto_geniv_module_init(void)
+{
+	int err;
+
+	err = crypto_register_template(&crypto_plain_tmpl);
+	if (err)
+		goto out;
+
+	err = crypto_register_template(&crypto_plain64_tmpl);
+	if (err)
+		goto out_undo_plain;
+
+	err = crypto_register_template(&crypto_essiv_tmpl);
+	if (err)
+		goto out_undo_plain64;
+
+	err = crypto_register_template(&crypto_benbi_tmpl);
+	if (err)
+		goto out_undo_essiv;
+
+	err = crypto_register_template(&crypto_null_tmpl);
+	if (err)
+		goto out_undo_benbi;
+
+	err = crypto_register_template(&crypto_lmk_tmpl);
+	if (err)
+		goto out_undo_null;
+
+	err = crypto_register_template(&crypto_tcw_tmpl);
+	if (!err)
+		goto out;
+
+	crypto_unregister_template(&crypto_lmk_tmpl);
+out_undo_null:
+	crypto_unregister_template(&crypto_null_tmpl);
+out_undo_benbi:
+	crypto_unregister_template(&crypto_benbi_tmpl);
+out_undo_essiv:
+	crypto_unregister_template(&crypto_essiv_tmpl);
+out_undo_plain64:
+	crypto_unregister_template(&crypto_plain64_tmpl);
+out_undo_plain:
+	crypto_unregister_template(&crypto_plain_tmpl);
+out:
+	return err;
+}
+
+static void __exit crypto_geniv_module_exit(void)
+{
+	crypto_unregister_template(&crypto_plain_tmpl);
+	crypto_unregister_template(&crypto_plain64_tmpl);
+	crypto_unregister_template(&crypto_essiv_tmpl);
+	crypto_unregister_template(&crypto_benbi_tmpl);
+	crypto_unregister_template(&crypto_null_tmpl);
+	crypto_unregister_template(&crypto_lmk_tmpl);
+	crypto_unregister_template(&crypto_tcw_tmpl);
+}
+
+module_init(crypto_geniv_module_init);
+module_exit(crypto_geniv_module_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("IV generation algorithms");
+MODULE_ALIAS_CRYPTO("geniv");
+
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index a276883..1d565d8 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -29,10 +29,12 @@
 #include <crypto/md5.h>
 #include <crypto/algapi.h>
 #include <crypto/skcipher.h>
+#include <crypto/geniv.h>
 
 #include <linux/device-mapper.h>
 
 #define DM_MSG_PREFIX "crypt"
+#define MAX_SG_LIST 1024
 
 /*
  * context holding the current state of a multi-part conversion
@@ -67,47 +69,13 @@ struct dm_crypt_io {
 
 struct dm_crypt_request {
 	struct convert_context *ctx;
-	struct scatterlist sg_in;
-	struct scatterlist sg_out;
+	struct scatterlist *sg_in;
+	struct scatterlist *sg_out;
 	sector_t iv_sector;
 };
 
 struct crypt_config;
 
-struct crypt_iv_operations {
-	int (*ctr)(struct crypt_config *cc, struct dm_target *ti,
-		   const char *opts);
-	void (*dtr)(struct crypt_config *cc);
-	int (*init)(struct crypt_config *cc);
-	int (*wipe)(struct crypt_config *cc);
-	int (*generator)(struct crypt_config *cc, u8 *iv,
-			 struct dm_crypt_request *dmreq);
-	int (*post)(struct crypt_config *cc, u8 *iv,
-		    struct dm_crypt_request *dmreq);
-};
-
-struct iv_essiv_private {
-	struct crypto_ahash *hash_tfm;
-	u8 *salt;
-};
-
-struct iv_benbi_private {
-	int shift;
-};
-
-#define LMK_SEED_SIZE 64 /* hash + 0 */
-struct iv_lmk_private {
-	struct crypto_shash *hash_tfm;
-	u8 *seed;
-};
-
-#define TCW_WHITENING_SIZE 16
-struct iv_tcw_private {
-	struct crypto_shash *crc32_tfm;
-	u8 *iv_seed;
-	u8 *whitening;
-};
-
 /*
  * Crypt: maps a linear range of a block device
  * and encrypts / decrypts at the same time.
@@ -141,13 +109,6 @@ struct crypt_config {
 	char *cipher;
 	char *cipher_string;
 
-	struct crypt_iv_operations *iv_gen_ops;
-	union {
-		struct iv_essiv_private essiv;
-		struct iv_benbi_private benbi;
-		struct iv_lmk_private lmk;
-		struct iv_tcw_private tcw;
-	} iv_gen_private;
 	sector_t iv_offset;
 	unsigned int iv_size;
 
@@ -241,567 +202,6 @@ static struct crypto_skcipher *any_tfm(struct crypt_config *cc)
  * http://article.gmane.org/gmane.linux.kernel.device-mapper.dm-crypt/454
  */
 
-static int crypt_iv_plain_gen(struct crypt_config *cc, u8 *iv,
-			      struct dm_crypt_request *dmreq)
-{
-	memset(iv, 0, cc->iv_size);
-	*(__le32 *)iv = cpu_to_le32(dmreq->iv_sector & 0xffffffff);
-
-	return 0;
-}
-
-static int crypt_iv_plain64_gen(struct crypt_config *cc, u8 *iv,
-				struct dm_crypt_request *dmreq)
-{
-	memset(iv, 0, cc->iv_size);
-	*(__le64 *)iv = cpu_to_le64(dmreq->iv_sector);
-
-	return 0;
-}
-
-/* Initialise ESSIV - compute salt but no local memory allocations */
-static int crypt_iv_essiv_init(struct crypt_config *cc)
-{
-	struct iv_essiv_private *essiv = &cc->iv_gen_private.essiv;
-	AHASH_REQUEST_ON_STACK(req, essiv->hash_tfm);
-	struct scatterlist sg;
-	struct crypto_cipher *essiv_tfm;
-	int err;
-
-	sg_init_one(&sg, cc->key, cc->key_size);
-	ahash_request_set_tfm(req, essiv->hash_tfm);
-	ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP, NULL, NULL);
-	ahash_request_set_crypt(req, &sg, essiv->salt, cc->key_size);
-
-	err = crypto_ahash_digest(req);
-	ahash_request_zero(req);
-	if (err)
-		return err;
-
-	essiv_tfm = cc->iv_private;
-
-	err = crypto_cipher_setkey(essiv_tfm, essiv->salt,
-			    crypto_ahash_digestsize(essiv->hash_tfm));
-	if (err)
-		return err;
-
-	return 0;
-}
-
-/* Wipe salt and reset key derived from volume key */
-static int crypt_iv_essiv_wipe(struct crypt_config *cc)
-{
-	struct iv_essiv_private *essiv = &cc->iv_gen_private.essiv;
-	unsigned salt_size = crypto_ahash_digestsize(essiv->hash_tfm);
-	struct crypto_cipher *essiv_tfm;
-	int r, err = 0;
-
-	memset(essiv->salt, 0, salt_size);
-
-	essiv_tfm = cc->iv_private;
-	r = crypto_cipher_setkey(essiv_tfm, essiv->salt, salt_size);
-	if (r)
-		err = r;
-
-	return err;
-}
-
-/* Set up per cpu cipher state */
-static struct crypto_cipher *setup_essiv_cpu(struct crypt_config *cc,
-					     struct dm_target *ti,
-					     u8 *salt, unsigned saltsize)
-{
-	struct crypto_cipher *essiv_tfm;
-	int err;
-
-	/* Setup the essiv_tfm with the given salt */
-	essiv_tfm = crypto_alloc_cipher(cc->cipher, 0, CRYPTO_ALG_ASYNC);
-	if (IS_ERR(essiv_tfm)) {
-		ti->error = "Error allocating crypto tfm for ESSIV";
-		return essiv_tfm;
-	}
-
-	if (crypto_cipher_blocksize(essiv_tfm) !=
-	    crypto_skcipher_ivsize(any_tfm(cc))) {
-		ti->error = "Block size of ESSIV cipher does "
-			    "not match IV size of block cipher";
-		crypto_free_cipher(essiv_tfm);
-		return ERR_PTR(-EINVAL);
-	}
-
-	err = crypto_cipher_setkey(essiv_tfm, salt, saltsize);
-	if (err) {
-		ti->error = "Failed to set key for ESSIV cipher";
-		crypto_free_cipher(essiv_tfm);
-		return ERR_PTR(err);
-	}
-
-	return essiv_tfm;
-}
-
-static void crypt_iv_essiv_dtr(struct crypt_config *cc)
-{
-	struct crypto_cipher *essiv_tfm;
-	struct iv_essiv_private *essiv = &cc->iv_gen_private.essiv;
-
-	crypto_free_ahash(essiv->hash_tfm);
-	essiv->hash_tfm = NULL;
-
-	kzfree(essiv->salt);
-	essiv->salt = NULL;
-
-	essiv_tfm = cc->iv_private;
-
-	if (essiv_tfm)
-		crypto_free_cipher(essiv_tfm);
-
-	cc->iv_private = NULL;
-}
-
-static int crypt_iv_essiv_ctr(struct crypt_config *cc, struct dm_target *ti,
-			      const char *opts)
-{
-	struct crypto_cipher *essiv_tfm = NULL;
-	struct crypto_ahash *hash_tfm = NULL;
-	u8 *salt = NULL;
-	int err;
-
-	if (!opts) {
-		ti->error = "Digest algorithm missing for ESSIV mode";
-		return -EINVAL;
-	}
-
-	/* Allocate hash algorithm */
-	hash_tfm = crypto_alloc_ahash(opts, 0, CRYPTO_ALG_ASYNC);
-	if (IS_ERR(hash_tfm)) {
-		ti->error = "Error initializing ESSIV hash";
-		err = PTR_ERR(hash_tfm);
-		goto bad;
-	}
-
-	salt = kzalloc(crypto_ahash_digestsize(hash_tfm), GFP_KERNEL);
-	if (!salt) {
-		ti->error = "Error kmallocing salt storage in ESSIV";
-		err = -ENOMEM;
-		goto bad;
-	}
-
-	cc->iv_gen_private.essiv.salt = salt;
-	cc->iv_gen_private.essiv.hash_tfm = hash_tfm;
-
-	essiv_tfm = setup_essiv_cpu(cc, ti, salt,
-				crypto_ahash_digestsize(hash_tfm));
-	if (IS_ERR(essiv_tfm)) {
-		crypt_iv_essiv_dtr(cc);
-		return PTR_ERR(essiv_tfm);
-	}
-	cc->iv_private = essiv_tfm;
-
-	return 0;
-
-bad:
-	if (hash_tfm && !IS_ERR(hash_tfm))
-		crypto_free_ahash(hash_tfm);
-	kfree(salt);
-	return err;
-}
-
-static int crypt_iv_essiv_gen(struct crypt_config *cc, u8 *iv,
-			      struct dm_crypt_request *dmreq)
-{
-	struct crypto_cipher *essiv_tfm = cc->iv_private;
-
-	memset(iv, 0, cc->iv_size);
-	*(__le64 *)iv = cpu_to_le64(dmreq->iv_sector);
-	crypto_cipher_encrypt_one(essiv_tfm, iv, iv);
-
-	return 0;
-}
-
-static int crypt_iv_benbi_ctr(struct crypt_config *cc, struct dm_target *ti,
-			      const char *opts)
-{
-	unsigned bs = crypto_skcipher_blocksize(any_tfm(cc));
-	int log = ilog2(bs);
-
-	/* we need to calculate how far we must shift the sector count
-	 * to get the cipher block count, we use this shift in _gen */
-
-	if (1 << log != bs) {
-		ti->error = "cypher blocksize is not a power of 2";
-		return -EINVAL;
-	}
-
-	if (log > 9) {
-		ti->error = "cypher blocksize is > 512";
-		return -EINVAL;
-	}
-
-	cc->iv_gen_private.benbi.shift = 9 - log;
-
-	return 0;
-}
-
-static void crypt_iv_benbi_dtr(struct crypt_config *cc)
-{
-}
-
-static int crypt_iv_benbi_gen(struct crypt_config *cc, u8 *iv,
-			      struct dm_crypt_request *dmreq)
-{
-	__be64 val;
-
-	memset(iv, 0, cc->iv_size - sizeof(u64)); /* rest is cleared below */
-
-	val = cpu_to_be64(((u64)dmreq->iv_sector << cc->iv_gen_private.benbi.shift) + 1);
-	put_unaligned(val, (__be64 *)(iv + cc->iv_size - sizeof(u64)));
-
-	return 0;
-}
-
-static int crypt_iv_null_gen(struct crypt_config *cc, u8 *iv,
-			     struct dm_crypt_request *dmreq)
-{
-	memset(iv, 0, cc->iv_size);
-
-	return 0;
-}
-
-static void crypt_iv_lmk_dtr(struct crypt_config *cc)
-{
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
-
-	if (lmk->hash_tfm && !IS_ERR(lmk->hash_tfm))
-		crypto_free_shash(lmk->hash_tfm);
-	lmk->hash_tfm = NULL;
-
-	kzfree(lmk->seed);
-	lmk->seed = NULL;
-}
-
-static int crypt_iv_lmk_ctr(struct crypt_config *cc, struct dm_target *ti,
-			    const char *opts)
-{
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
-
-	lmk->hash_tfm = crypto_alloc_shash("md5", 0, 0);
-	if (IS_ERR(lmk->hash_tfm)) {
-		ti->error = "Error initializing LMK hash";
-		return PTR_ERR(lmk->hash_tfm);
-	}
-
-	/* No seed in LMK version 2 */
-	if (cc->key_parts == cc->tfms_count) {
-		lmk->seed = NULL;
-		return 0;
-	}
-
-	lmk->seed = kzalloc(LMK_SEED_SIZE, GFP_KERNEL);
-	if (!lmk->seed) {
-		crypt_iv_lmk_dtr(cc);
-		ti->error = "Error kmallocing seed storage in LMK";
-		return -ENOMEM;
-	}
-
-	return 0;
-}
-
-static int crypt_iv_lmk_init(struct crypt_config *cc)
-{
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
-	int subkey_size = cc->key_size / cc->key_parts;
-
-	/* LMK seed is on the position of LMK_KEYS + 1 key */
-	if (lmk->seed)
-		memcpy(lmk->seed, cc->key + (cc->tfms_count * subkey_size),
-		       crypto_shash_digestsize(lmk->hash_tfm));
-
-	return 0;
-}
-
-static int crypt_iv_lmk_wipe(struct crypt_config *cc)
-{
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
-
-	if (lmk->seed)
-		memset(lmk->seed, 0, LMK_SEED_SIZE);
-
-	return 0;
-}
-
-static int crypt_iv_lmk_one(struct crypt_config *cc, u8 *iv,
-			    struct dm_crypt_request *dmreq,
-			    u8 *data)
-{
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
-	SHASH_DESC_ON_STACK(desc, lmk->hash_tfm);
-	struct md5_state md5state;
-	__le32 buf[4];
-	int i, r;
-
-	desc->tfm = lmk->hash_tfm;
-	desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP;
-
-	r = crypto_shash_init(desc);
-	if (r)
-		return r;
-
-	if (lmk->seed) {
-		r = crypto_shash_update(desc, lmk->seed, LMK_SEED_SIZE);
-		if (r)
-			return r;
-	}
-
-	/* Sector is always 512B, block size 16, add data of blocks 1-31 */
-	r = crypto_shash_update(desc, data + 16, 16 * 31);
-	if (r)
-		return r;
-
-	/* Sector is cropped to 56 bits here */
-	buf[0] = cpu_to_le32(dmreq->iv_sector & 0xFFFFFFFF);
-	buf[1] = cpu_to_le32((((u64)dmreq->iv_sector >> 32) & 0x00FFFFFF) | 0x80000000);
-	buf[2] = cpu_to_le32(4024);
-	buf[3] = 0;
-	r = crypto_shash_update(desc, (u8 *)buf, sizeof(buf));
-	if (r)
-		return r;
-
-	/* No MD5 padding here */
-	r = crypto_shash_export(desc, &md5state);
-	if (r)
-		return r;
-
-	for (i = 0; i < MD5_HASH_WORDS; i++)
-		__cpu_to_le32s(&md5state.hash[i]);
-	memcpy(iv, &md5state.hash, cc->iv_size);
-
-	return 0;
-}
-
-static int crypt_iv_lmk_gen(struct crypt_config *cc, u8 *iv,
-			    struct dm_crypt_request *dmreq)
-{
-	u8 *src;
-	int r = 0;
-
-	if (bio_data_dir(dmreq->ctx->bio_in) == WRITE) {
-		src = kmap_atomic(sg_page(&dmreq->sg_in));
-		r = crypt_iv_lmk_one(cc, iv, dmreq, src + dmreq->sg_in.offset);
-		kunmap_atomic(src);
-	} else
-		memset(iv, 0, cc->iv_size);
-
-	return r;
-}
-
-static int crypt_iv_lmk_post(struct crypt_config *cc, u8 *iv,
-			     struct dm_crypt_request *dmreq)
-{
-	u8 *dst;
-	int r;
-
-	if (bio_data_dir(dmreq->ctx->bio_in) == WRITE)
-		return 0;
-
-	dst = kmap_atomic(sg_page(&dmreq->sg_out));
-	r = crypt_iv_lmk_one(cc, iv, dmreq, dst + dmreq->sg_out.offset);
-
-	/* Tweak the first block of plaintext sector */
-	if (!r)
-		crypto_xor(dst + dmreq->sg_out.offset, iv, cc->iv_size);
-
-	kunmap_atomic(dst);
-	return r;
-}
-
-static void crypt_iv_tcw_dtr(struct crypt_config *cc)
-{
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
-
-	kzfree(tcw->iv_seed);
-	tcw->iv_seed = NULL;
-	kzfree(tcw->whitening);
-	tcw->whitening = NULL;
-
-	if (tcw->crc32_tfm && !IS_ERR(tcw->crc32_tfm))
-		crypto_free_shash(tcw->crc32_tfm);
-	tcw->crc32_tfm = NULL;
-}
-
-static int crypt_iv_tcw_ctr(struct crypt_config *cc, struct dm_target *ti,
-			    const char *opts)
-{
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
-
-	if (cc->key_size <= (cc->iv_size + TCW_WHITENING_SIZE)) {
-		ti->error = "Wrong key size for TCW";
-		return -EINVAL;
-	}
-
-	tcw->crc32_tfm = crypto_alloc_shash("crc32", 0, 0);
-	if (IS_ERR(tcw->crc32_tfm)) {
-		ti->error = "Error initializing CRC32 in TCW";
-		return PTR_ERR(tcw->crc32_tfm);
-	}
-
-	tcw->iv_seed = kzalloc(cc->iv_size, GFP_KERNEL);
-	tcw->whitening = kzalloc(TCW_WHITENING_SIZE, GFP_KERNEL);
-	if (!tcw->iv_seed || !tcw->whitening) {
-		crypt_iv_tcw_dtr(cc);
-		ti->error = "Error allocating seed storage in TCW";
-		return -ENOMEM;
-	}
-
-	return 0;
-}
-
-static int crypt_iv_tcw_init(struct crypt_config *cc)
-{
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
-	int key_offset = cc->key_size - cc->iv_size - TCW_WHITENING_SIZE;
-
-	memcpy(tcw->iv_seed, &cc->key[key_offset], cc->iv_size);
-	memcpy(tcw->whitening, &cc->key[key_offset + cc->iv_size],
-	       TCW_WHITENING_SIZE);
-
-	return 0;
-}
-
-static int crypt_iv_tcw_wipe(struct crypt_config *cc)
-{
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
-
-	memset(tcw->iv_seed, 0, cc->iv_size);
-	memset(tcw->whitening, 0, TCW_WHITENING_SIZE);
-
-	return 0;
-}
-
-static int crypt_iv_tcw_whitening(struct crypt_config *cc,
-				  struct dm_crypt_request *dmreq,
-				  u8 *data)
-{
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
-	__le64 sector = cpu_to_le64(dmreq->iv_sector);
-	u8 buf[TCW_WHITENING_SIZE];
-	SHASH_DESC_ON_STACK(desc, tcw->crc32_tfm);
-	int i, r;
-
-	/* xor whitening with sector number */
-	memcpy(buf, tcw->whitening, TCW_WHITENING_SIZE);
-	crypto_xor(buf, (u8 *)&sector, 8);
-	crypto_xor(&buf[8], (u8 *)&sector, 8);
-
-	/* calculate crc32 for every 32bit part and xor it */
-	desc->tfm = tcw->crc32_tfm;
-	desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP;
-	for (i = 0; i < 4; i++) {
-		r = crypto_shash_init(desc);
-		if (r)
-			goto out;
-		r = crypto_shash_update(desc, &buf[i * 4], 4);
-		if (r)
-			goto out;
-		r = crypto_shash_final(desc, &buf[i * 4]);
-		if (r)
-			goto out;
-	}
-	crypto_xor(&buf[0], &buf[12], 4);
-	crypto_xor(&buf[4], &buf[8], 4);
-
-	/* apply whitening (8 bytes) to whole sector */
-	for (i = 0; i < ((1 << SECTOR_SHIFT) / 8); i++)
-		crypto_xor(data + i * 8, buf, 8);
-out:
-	memzero_explicit(buf, sizeof(buf));
-	return r;
-}
-
-static int crypt_iv_tcw_gen(struct crypt_config *cc, u8 *iv,
-			    struct dm_crypt_request *dmreq)
-{
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
-	__le64 sector = cpu_to_le64(dmreq->iv_sector);
-	u8 *src;
-	int r = 0;
-
-	/* Remove whitening from ciphertext */
-	if (bio_data_dir(dmreq->ctx->bio_in) != WRITE) {
-		src = kmap_atomic(sg_page(&dmreq->sg_in));
-		r = crypt_iv_tcw_whitening(cc, dmreq, src + dmreq->sg_in.offset);
-		kunmap_atomic(src);
-	}
-
-	/* Calculate IV */
-	memcpy(iv, tcw->iv_seed, cc->iv_size);
-	crypto_xor(iv, (u8 *)&sector, 8);
-	if (cc->iv_size > 8)
-		crypto_xor(&iv[8], (u8 *)&sector, cc->iv_size - 8);
-
-	return r;
-}
-
-static int crypt_iv_tcw_post(struct crypt_config *cc, u8 *iv,
-			     struct dm_crypt_request *dmreq)
-{
-	u8 *dst;
-	int r;
-
-	if (bio_data_dir(dmreq->ctx->bio_in) != WRITE)
-		return 0;
-
-	/* Apply whitening on ciphertext */
-	dst = kmap_atomic(sg_page(&dmreq->sg_out));
-	r = crypt_iv_tcw_whitening(cc, dmreq, dst + dmreq->sg_out.offset);
-	kunmap_atomic(dst);
-
-	return r;
-}
-
-static struct crypt_iv_operations crypt_iv_plain_ops = {
-	.generator = crypt_iv_plain_gen
-};
-
-static struct crypt_iv_operations crypt_iv_plain64_ops = {
-	.generator = crypt_iv_plain64_gen
-};
-
-static struct crypt_iv_operations crypt_iv_essiv_ops = {
-	.ctr       = crypt_iv_essiv_ctr,
-	.dtr       = crypt_iv_essiv_dtr,
-	.init      = crypt_iv_essiv_init,
-	.wipe      = crypt_iv_essiv_wipe,
-	.generator = crypt_iv_essiv_gen
-};
-
-static struct crypt_iv_operations crypt_iv_benbi_ops = {
-	.ctr	   = crypt_iv_benbi_ctr,
-	.dtr	   = crypt_iv_benbi_dtr,
-	.generator = crypt_iv_benbi_gen
-};
-
-static struct crypt_iv_operations crypt_iv_null_ops = {
-	.generator = crypt_iv_null_gen
-};
-
-static struct crypt_iv_operations crypt_iv_lmk_ops = {
-	.ctr	   = crypt_iv_lmk_ctr,
-	.dtr	   = crypt_iv_lmk_dtr,
-	.init	   = crypt_iv_lmk_init,
-	.wipe	   = crypt_iv_lmk_wipe,
-	.generator = crypt_iv_lmk_gen,
-	.post	   = crypt_iv_lmk_post
-};
-
-static struct crypt_iv_operations crypt_iv_tcw_ops = {
-	.ctr	   = crypt_iv_tcw_ctr,
-	.dtr	   = crypt_iv_tcw_dtr,
-	.init	   = crypt_iv_tcw_init,
-	.wipe	   = crypt_iv_tcw_wipe,
-	.generator = crypt_iv_tcw_gen,
-	.post	   = crypt_iv_tcw_post
-};
-
 static void crypt_convert_init(struct crypt_config *cc,
 			       struct convert_context *ctx,
 			       struct bio *bio_out, struct bio *bio_in,
@@ -836,52 +236,6 @@ static u8 *iv_of_dmreq(struct crypt_config *cc,
 		crypto_skcipher_alignmask(any_tfm(cc)) + 1);
 }
 
-static int crypt_convert_block(struct crypt_config *cc,
-			       struct convert_context *ctx,
-			       struct skcipher_request *req)
-{
-	struct bio_vec bv_in = bio_iter_iovec(ctx->bio_in, ctx->iter_in);
-	struct bio_vec bv_out = bio_iter_iovec(ctx->bio_out, ctx->iter_out);
-	struct dm_crypt_request *dmreq;
-	u8 *iv;
-	int r;
-
-	dmreq = dmreq_of_req(cc, req);
-	iv = iv_of_dmreq(cc, dmreq);
-
-	dmreq->iv_sector = ctx->cc_sector;
-	dmreq->ctx = ctx;
-	sg_init_table(&dmreq->sg_in, 1);
-	sg_set_page(&dmreq->sg_in, bv_in.bv_page, 1 << SECTOR_SHIFT,
-		    bv_in.bv_offset);
-
-	sg_init_table(&dmreq->sg_out, 1);
-	sg_set_page(&dmreq->sg_out, bv_out.bv_page, 1 << SECTOR_SHIFT,
-		    bv_out.bv_offset);
-
-	bio_advance_iter(ctx->bio_in, &ctx->iter_in, 1 << SECTOR_SHIFT);
-	bio_advance_iter(ctx->bio_out, &ctx->iter_out, 1 << SECTOR_SHIFT);
-
-	if (cc->iv_gen_ops) {
-		r = cc->iv_gen_ops->generator(cc, iv, dmreq);
-		if (r < 0)
-			return r;
-	}
-
-	skcipher_request_set_crypt(req, &dmreq->sg_in, &dmreq->sg_out,
-				   1 << SECTOR_SHIFT, iv);
-
-	if (bio_data_dir(ctx->bio_in) == WRITE)
-		r = crypto_skcipher_encrypt(req);
-	else
-		r = crypto_skcipher_decrypt(req);
-
-	if (!r && cc->iv_gen_ops && cc->iv_gen_ops->post)
-		r = cc->iv_gen_ops->post(cc, iv, dmreq);
-
-	return r;
-}
-
 static void kcryptd_async_done(struct crypto_async_request *async_req,
 			       int error);
 
@@ -916,57 +270,94 @@ static void crypt_free_req(struct crypt_config *cc,
 /*
  * Encrypt / decrypt data from one bio to another one (can be the same one)
  */
-static int crypt_convert(struct crypt_config *cc,
-			 struct convert_context *ctx)
+
+static int crypt_convert_bio(struct crypt_config *cc,
+			     struct convert_context *ctx)
 {
+	unsigned int cryptlen, n1, n2, nents, i = 0, bytes = 0;
+	struct skcipher_request *req;
+	struct dm_crypt_request *dmreq;
+	struct geniv_req_info rinfo;
+	struct bio_vec bv_in, bv_out;
 	int r;
+	u8 *iv;
 
 	atomic_set(&ctx->cc_pending, 1);
+	crypt_alloc_req(cc, ctx);
 
-	while (ctx->iter_in.bi_size && ctx->iter_out.bi_size) {
+	req = ctx->req;
+	dmreq = dmreq_of_req(cc, req);
+	iv = iv_of_dmreq(cc, dmreq);
 
-		crypt_alloc_req(cc, ctx);
+	n1 = bio_segments(ctx->bio_in);
+	n2 = bio_segments(ctx->bio_in);
+	nents = n1 > n2 ? n1 : n2;
+	nents = nents > MAX_SG_LIST ? MAX_SG_LIST : nents;
+	cryptlen = ctx->iter_in.bi_size;
 
-		atomic_inc(&ctx->cc_pending);
+	DMDEBUG("dm-crypt:%s: segments:[in=%u, out=%u] bi_size=%u\n",
+		bio_data_dir(ctx->bio_in) == WRITE ? "write" : "read",
+		n1, n2, cryptlen);
 
-		r = crypt_convert_block(cc, ctx, ctx->req);
+	dmreq->sg_in  = kcalloc(nents, sizeof(struct scatterlist), GFP_KERNEL);
+	dmreq->sg_out = kcalloc(nents, sizeof(struct scatterlist), GFP_KERNEL);
 
-		switch (r) {
-		/*
-		 * The request was queued by a crypto driver
-		 * but the driver request queue is full, let's wait.
-		 */
-		case -EBUSY:
-			wait_for_completion(&ctx->restart);
-			reinit_completion(&ctx->restart);
-			/* fall through */
-		/*
-		 * The request is queued and processed asynchronously,
-		 * completion function kcryptd_async_done() will be called.
-		 */
-		case -EINPROGRESS:
-			ctx->req = NULL;
-			ctx->cc_sector++;
-			continue;
-		/*
-		 * The request was already processed (synchronously).
-		 */
-		case 0:
-			atomic_dec(&ctx->cc_pending);
-			ctx->cc_sector++;
-			cond_resched();
-			continue;
-
-		/* There was an error while processing the request. */
-		default:
-			atomic_dec(&ctx->cc_pending);
-			return r;
-		}
+	dmreq->ctx = ctx;
+
+	sg_init_table(dmreq->sg_in, nents);
+	sg_init_table(dmreq->sg_out, nents);
+
+	while (ctx->iter_in.bi_size && ctx->iter_out.bi_size && i < nents) {
+		bv_in = bio_iter_iovec(ctx->bio_in, ctx->iter_in);
+		bv_out = bio_iter_iovec(ctx->bio_out, ctx->iter_out);
+
+		sg_set_page(&dmreq->sg_in[i], bv_in.bv_page, bv_in.bv_len,
+			    bv_in.bv_offset);
+		sg_set_page(&dmreq->sg_out[i], bv_out.bv_page, bv_out.bv_len,
+			    bv_out.bv_offset);
+
+		bio_advance_iter(ctx->bio_in, &ctx->iter_in, bv_in.bv_len);
+		bio_advance_iter(ctx->bio_out, &ctx->iter_out, bv_out.bv_len);
+
+		bytes += bv_in.bv_len;
+		i++;
 	}
 
-	return 0;
+	DMDEBUG("dm-crypt: Processed %u of %u bytes\n", bytes, cryptlen);
+
+	rinfo.is_write = bio_data_dir(ctx->bio_in) == WRITE;
+	rinfo.iv_sector = ctx->cc_sector;
+	rinfo.nents = nents;
+	rinfo.iv = iv;
+
+	skcipher_request_set_crypt(req, dmreq->sg_in, dmreq->sg_out,
+				   bytes, &rinfo);
+
+	if (bio_data_dir(ctx->bio_in) == WRITE)
+		r = crypto_skcipher_encrypt(req);
+	else
+		r = crypto_skcipher_decrypt(req);
+
+	switch (r) {
+	/* The request was queued so wait. */
+	case -EBUSY:
+		wait_for_completion(&ctx->restart);
+		reinit_completion(&ctx->restart);
+		/* fall through */
+	/*
+	 * The request is queued and processed asynchronously,
+	 * completion function kcryptd_async_done() is called.
+	 */
+	case -EINPROGRESS:
+		ctx->req = NULL;
+		cond_resched();
+		break;
+	}
+
+	return r;
 }
 
+
 static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone);
 
 /*
@@ -1072,11 +463,17 @@ static void crypt_dec_pending(struct dm_crypt_io *io)
 {
 	struct crypt_config *cc = io->cc;
 	struct bio *base_bio = io->base_bio;
+	struct dm_crypt_request *dmreq;
 	int error = io->error;
 
 	if (!atomic_dec_and_test(&io->io_pending))
 		return;
 
+	dmreq = dmreq_of_req(cc, io->ctx.req);
+	DMDEBUG("dm-crypt: Freeing scatterlists [sync]\n");
+	kfree(dmreq->sg_in);
+	kfree(dmreq->sg_out);
+
 	if (io->ctx.req)
 		crypt_free_req(cc, io->ctx.req, base_bio);
 
@@ -1315,7 +712,7 @@ static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)
 	sector += bio_sectors(clone);
 
 	crypt_inc_pending(io);
-	r = crypt_convert(cc, &io->ctx);
+	r = crypt_convert_bio(cc, &io->ctx);
 	if (r)
 		io->error = -EIO;
 	crypt_finished = atomic_dec_and_test(&io->ctx.cc_pending);
@@ -1345,7 +742,8 @@ static void kcryptd_crypt_read_convert(struct dm_crypt_io *io)
 	crypt_convert_init(cc, &io->ctx, io->base_bio, io->base_bio,
 			   io->sector);
 
-	r = crypt_convert(cc, &io->ctx);
+	r = crypt_convert_bio(cc, &io->ctx);
+
 	if (r < 0)
 		io->error = -EIO;
 
@@ -1373,12 +771,13 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
 		return;
 	}
 
-	if (!error && cc->iv_gen_ops && cc->iv_gen_ops->post)
-		error = cc->iv_gen_ops->post(cc, iv_of_dmreq(cc, dmreq), dmreq);
-
 	if (error < 0)
 		io->error = -EIO;
 
+	DMDEBUG("dm-crypt: Freeing scatterlists and request struct [async]\n");
+	kfree(dmreq->sg_in);
+	kfree(dmreq->sg_out);
+
 	crypt_free_req(cc, req_of_dmreq(cc, dmreq), io->base_bio);
 
 	if (!atomic_dec_and_test(&ctx->cc_pending))
@@ -1471,7 +870,8 @@ static int crypt_alloc_tfms(struct crypt_config *cc, char *ciphermode)
 	return 0;
 }
 
-static int crypt_setkey_allcpus(struct crypt_config *cc)
+static int crypt_setkey_allcpus(struct crypt_config *cc, enum setkey_op keyop,
+				char *ivmode, char *ivopts)
 {
 	unsigned subkey_size;
 	int err = 0, i, r;
@@ -1480,9 +880,13 @@ static int crypt_setkey_allcpus(struct crypt_config *cc)
 	subkey_size = (cc->key_size - cc->key_extra_size) >> ilog2(cc->tfms_count);
 
 	for (i = 0; i < cc->tfms_count; i++) {
-		r = crypto_skcipher_setkey(cc->tfms[i],
-					   cc->key + (i * subkey_size),
-					   subkey_size);
+		DECLARE_GENIV_KEY(kinfo, keyop, cc->tfms_count, cc->cipher,
+				  cc->key, cc->key_size,
+				  cc->key + (subkey_size * i), subkey_size,
+				  cc->key_parts, ivmode, ivopts);
+
+		r = crypto_skcipher_setkey(cc->tfms[i], (u8 *) &kinfo,
+					   sizeof(kinfo));
 		if (r)
 			err = r;
 	}
@@ -1490,7 +894,8 @@ static int crypt_setkey_allcpus(struct crypt_config *cc)
 	return err;
 }
 
-static int crypt_set_key(struct crypt_config *cc, char *key)
+static int crypt_set_key(struct crypt_config *cc, enum setkey_op keyop,
+			 char *key, char *ivmode, char *ivopts)
 {
 	int r = -EINVAL;
 	int key_string_len = strlen(key);
@@ -1508,7 +913,7 @@ static int crypt_set_key(struct crypt_config *cc, char *key)
 
 	set_bit(DM_CRYPT_KEY_VALID, &cc->flags);
 
-	r = crypt_setkey_allcpus(cc);
+	r = crypt_setkey_allcpus(cc, keyop, ivmode, ivopts);
 
 out:
 	/* Hex key string not needed after here, so wipe it. */
@@ -1517,12 +922,24 @@ static int crypt_set_key(struct crypt_config *cc, char *key)
 	return r;
 }
 
+static int crypt_init_all_cpus(struct dm_target *ti, char *key,
+			       char *ivmode, char *ivopts)
+{
+	struct crypt_config *cc = ti->private;
+	int ret;
+
+	ret = crypt_set_key(cc, SETKEY_OP_INIT, key, ivmode, ivopts);
+	if (ret < 0)
+		ti->error = "Error decoding and setting key";
+	return ret;
+}
+
 static int crypt_wipe_key(struct crypt_config *cc)
 {
 	clear_bit(DM_CRYPT_KEY_VALID, &cc->flags);
 	memset(&cc->key, 0, cc->key_size * sizeof(u8));
 
-	return crypt_setkey_allcpus(cc);
+	return crypt_setkey_allcpus(cc, SETKEY_OP_WIPE, NULL, NULL);
 }
 
 static void crypt_dtr(struct dm_target *ti)
@@ -1550,9 +967,6 @@ static void crypt_dtr(struct dm_target *ti)
 	mempool_destroy(cc->page_pool);
 	mempool_destroy(cc->req_pool);
 
-	if (cc->iv_gen_ops && cc->iv_gen_ops->dtr)
-		cc->iv_gen_ops->dtr(cc);
-
 	if (cc->dev)
 		dm_put_device(ti, cc->dev);
 
@@ -1629,8 +1043,16 @@ static int crypt_ctr_cipher(struct dm_target *ti,
 	if (!cipher_api)
 		goto bad_mem;
 
-	ret = snprintf(cipher_api, CRYPTO_MAX_ALG_NAME,
-		       "%s(%s)", chainmode, cipher);
+create_cipher:
+	/* For those ciphers which do not support IVs,
+	 * use the 'null' template cipher
+	 */
+
+	if (!ivmode)
+		ivmode = "null";
+
+	ret = snprintf(cipher_api, CRYPTO_MAX_ALG_NAME, "%s(%s(%s))",
+		       ivmode, chainmode, cipher);
 	if (ret < 0) {
 		kfree(cipher_api);
 		goto bad_mem;
@@ -1652,23 +1074,10 @@ static int crypt_ctr_cipher(struct dm_target *ti,
 	else if (ivmode) {
 		DMWARN("Selected cipher does not support IVs");
 		ivmode = NULL;
+		goto create_cipher;
 	}
 
-	/* Choose ivmode, see comments at iv code. */
-	if (ivmode == NULL)
-		cc->iv_gen_ops = NULL;
-	else if (strcmp(ivmode, "plain") == 0)
-		cc->iv_gen_ops = &crypt_iv_plain_ops;
-	else if (strcmp(ivmode, "plain64") == 0)
-		cc->iv_gen_ops = &crypt_iv_plain64_ops;
-	else if (strcmp(ivmode, "essiv") == 0)
-		cc->iv_gen_ops = &crypt_iv_essiv_ops;
-	else if (strcmp(ivmode, "benbi") == 0)
-		cc->iv_gen_ops = &crypt_iv_benbi_ops;
-	else if (strcmp(ivmode, "null") == 0)
-		cc->iv_gen_ops = &crypt_iv_null_ops;
-	else if (strcmp(ivmode, "lmk") == 0) {
-		cc->iv_gen_ops = &crypt_iv_lmk_ops;
+	if (strcmp(ivmode, "lmk") == 0) {
 		/*
 		 * Version 2 and 3 is recognised according
 		 * to length of provided multi-key string.
@@ -1680,39 +1089,14 @@ static int crypt_ctr_cipher(struct dm_target *ti,
 			cc->key_extra_size = cc->key_size / cc->key_parts;
 		}
 	} else if (strcmp(ivmode, "tcw") == 0) {
-		cc->iv_gen_ops = &crypt_iv_tcw_ops;
 		cc->key_parts += 2; /* IV + whitening */
 		cc->key_extra_size = cc->iv_size + TCW_WHITENING_SIZE;
-	} else {
-		ret = -EINVAL;
-		ti->error = "Invalid IV mode";
-		goto bad;
 	}
 
 	/* Initialize and set key */
-	ret = crypt_set_key(cc, key);
-	if (ret < 0) {
-		ti->error = "Error decoding and setting key";
+	ret = crypt_init_all_cpus(ti, key, ivmode, ivopts);
+	if (ret < 0)
 		goto bad;
-	}
-
-	/* Allocate IV */
-	if (cc->iv_gen_ops && cc->iv_gen_ops->ctr) {
-		ret = cc->iv_gen_ops->ctr(cc, ti, ivopts);
-		if (ret < 0) {
-			ti->error = "Error creating IV";
-			goto bad;
-		}
-	}
-
-	/* Initialize IV (set keys for ESSIV etc) */
-	if (cc->iv_gen_ops && cc->iv_gen_ops->init) {
-		ret = cc->iv_gen_ops->init(cc);
-		if (ret < 0) {
-			ti->error = "Error initialising IV";
-			goto bad;
-		}
-	}
 
 	ret = 0;
 bad:
@@ -1934,8 +1318,9 @@ static int crypt_map(struct dm_target *ti, struct bio *bio)
 	if (bio_data_dir(io->base_bio) == READ) {
 		if (kcryptd_io_read(io, GFP_NOWAIT))
 			kcryptd_queue_read(io);
-	} else
+	} else {
 		kcryptd_queue_crypt(io);
+	}
 
 	return DM_MAPIO_SUBMITTED;
 }
@@ -2014,7 +1399,6 @@ static void crypt_resume(struct dm_target *ti)
 static int crypt_message(struct dm_target *ti, unsigned argc, char **argv)
 {
 	struct crypt_config *cc = ti->private;
-	int ret = -EINVAL;
 
 	if (argc < 2)
 		goto error;
@@ -2025,19 +1409,9 @@ static int crypt_message(struct dm_target *ti, unsigned argc, char **argv)
 			return -EINVAL;
 		}
 		if (argc == 3 && !strcasecmp(argv[1], "set")) {
-			ret = crypt_set_key(cc, argv[2]);
-			if (ret)
-				return ret;
-			if (cc->iv_gen_ops && cc->iv_gen_ops->init)
-				ret = cc->iv_gen_ops->init(cc);
-			return ret;
+			return crypt_set_key(cc, SETKEY_OP_SET, argv[2], 0, 0);
 		}
 		if (argc == 2 && !strcasecmp(argv[1], "wipe")) {
-			if (cc->iv_gen_ops && cc->iv_gen_ops->wipe) {
-				ret = cc->iv_gen_ops->wipe(cc);
-				if (ret)
-					return ret;
-			}
 			return crypt_wipe_key(cc);
 		}
 	}
diff --git a/include/crypto/geniv.h b/include/crypto/geniv.h
new file mode 100644
index 0000000..df9f953
--- /dev/null
+++ b/include/crypto/geniv.h
@@ -0,0 +1,60 @@
+/*
+ * geniv: common data structures for IV generation algorithms
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+#ifndef _CRYPTO_GENIV_
+#define _CRYPTO_GENIV_
+
+#define SECTOR_SHIFT		9
+#define SECTOR_SIZE		(1 << SECTOR_SHIFT)
+
+#define LMK_SEED_SIZE		64 /* hash + 0 */
+#define TCW_WHITENING_SIZE	16
+
+enum setkey_op {
+	SETKEY_OP_INIT,
+	SETKEY_OP_SET,
+	SETKEY_OP_WIPE,
+};
+
+struct geniv_key_info {
+	enum setkey_op keyop;
+	unsigned int tfms_count;
+	char *cipher;
+	u8 *key;
+	u8 *subkey;
+	unsigned int key_size;
+	unsigned int subkey_size;
+	unsigned int key_parts;
+	char *ivmode;
+	char *ivopts;
+};
+
+#define DECLARE_GENIV_KEY(c, op, n, p, k, sz, skey, ssz, kp, m, opts)	\
+	struct geniv_key_info c = {					\
+		.keyop = op,						\
+		.tfms_count = n,					\
+		.cipher = p,						\
+		.key = k,						\
+		.key_size = sz,						\
+		.subkey = skey,						\
+		.subkey_size = ssz,					\
+		.key_parts = kp,					\
+		.ivmode = m,						\
+		.ivopts = opts,						\
+	}
+
+struct geniv_req_info {
+	bool is_write;
+	sector_t iv_sector;
+	unsigned int nents;
+	u8 *iv;
+};
+
+#endif
+
-- 
Binoy Jayan


^ permalink raw reply related

* Re: [RFC PATCH v2] crypto: Add IV generation algorithms
From: Milan Broz @ 2016-12-13 10:01 UTC (permalink / raw)
  To: Binoy Jayan, Oded, Ofir
  Cc: Herbert Xu, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, linux-kernel, Alasdair Kergon, Mike Snitzer,
	dm-devel, Shaohua Li, linux-raid, Rajendra
In-Reply-To: <1481618949-20086-2-git-send-email-binoy.jayan@linaro.org>

On 12/13/2016 09:49 AM, Binoy Jayan wrote:
> Currently, the iv generation algorithms are implemented in dm-crypt.c.
> The goal is to move these algorithms from the dm layer to the kernel
> crypto layer by implementing them as template ciphers so they can be
> implemented in hardware for performance. As part of this patchset, the
> iv-generation code is moved from the dm layer to the crypto layer and
> adapt the dm-layer to send a whole 'bio' (as defined in the block layer)
> at a time. Each bio contains the in memory representation of physically
> contiguous disk blocks. The dm layer sets up a chained scatterlist of
> these blocks split into physically contiguous segments in memory so that
> DMA can be performed. The iv generation algorithms implemented in geniv.c
> include plain, plain64, essiv, benbi, null, lmk and tcw.
...

Just few notes...

The lmk (loopAES) and tcw (old TrueCrypt mode) IVs are in fact hacks,
it is combination of IV and modification of CBC mode (in post calls).

It is used only in these disk-encryption systems, it does not make sense
to use it elsewhere (moreover tcw IV is not safe, it is here only
for compatibility reasons).

I think that IV generators should not modify or read encrypted data directly,
it should only generate IV.

By the move everything to cryptoAPI we are basically introducing some strange mix
of IV and modes there, I wonder how this is going to be maintained.
Anyway, Herbert should say if it is ok...

But I would imagine that cryptoAPI should implement only "real" IV generators
and these disk-encryption add-ons keep inside dmcrypt.
(and dmcrypt should probably use normal IVs through crypto API and
build some wrapper around for hacks...)

...

> 
> When using multiple keys with the original dm-crypt, the key selection is
> made based on the sector number as:
> 
> key_index = sector & (key_count - 1)
> 
> This restricts the usage of the same key for encrypting/decrypting a
> single bio. One way to solve this is to move the key management code from
> dm-crypt to cryto layer. But this seems tricky when using template ciphers
> because, when multiple ciphers are instantiated from dm layer, each cipher
> instance set with a unique subkey (part of the bigger master key) and
> these instances themselves do not have access to each other's instances
> or contexts. This way, a single instance cannot encryt/decrypt a whole bio.
> This has to be fixed.

I really do not think the disk encryption key management should be moved
outside of dm-crypt. We cannot then change key structure later easily.

But it is not only key management, you are basically moving much more internals
outside of dm-crypt:

> +struct geniv_ctx {
> +	struct crypto_skcipher *child;
> +	unsigned int tfms_count;
> +	char *ivmode;
> +	unsigned int iv_size;
> +	char *ivopts;
> +	char *cipher;
> +	struct geniv_operations *iv_gen_ops;
> +	union {
> +		struct geniv_essiv_private essiv;
> +		struct geniv_benbi_private benbi;
> +		struct geniv_lmk_private lmk;
> +		struct geniv_tcw_private tcw;
> +	} iv_gen_private;
> +	void *iv_private;
> +	struct crypto_skcipher *tfm;
> +	unsigned int key_size;
> +	unsigned int key_extra_size;
> +	unsigned int key_parts;      /* independent parts in key buffer */

^^^ these key sizes you probably mean by key management.
It is based on way how the key is currently sent into kernel
(one hexa string in ioctl that needs to be split) and have to be changed in future.

> +	enum setkey_op keyop;
> +	char *msg;
> +	u8 *key;

Milan

^ permalink raw reply

* [PATCH v2] Always return last partition end address in 512B blocks
From: Mariusz Dabrowski @ 2016-12-13 13:31 UTC (permalink / raw)
  To: linux-raid; +Cc: Jes.Sorensen, Mariusz Dabrowski

For 4K disks 'endofpart' is an index of the last 4K sector used by partition.
mdadm is using number of 512-byte sectors, so value returned by
get_last_partition_end must be multiplied by 8 for devices with 4K sectors.
Also, unused 'ret' variable has been removed.

Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
---
 util.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/util.c b/util.c
index 883eaa4..d510613 100644
--- a/util.c
+++ b/util.c
@@ -1430,6 +1430,7 @@ static int get_last_partition_end(int fd, unsigned long long *endofpart)
 	struct MBR boot_sect;
 	unsigned long long curr_part_end;
 	unsigned part_nr;
+	unsigned int sector_size;
 	int retval = 0;
 
 	*endofpart = 0;
@@ -1469,6 +1470,9 @@ static int get_last_partition_end(int fd, unsigned long long *endofpart)
 		/* Unknown partition table */
 		retval = -1;
 	}
+	/* calculate number of 512-byte blocks */
+	if (get_dev_sector_size(fd, NULL, &sector_size))
+		*endofpart *= (sector_size / 512);
  abort:
 	return retval;
 }
@@ -1480,9 +1484,8 @@ int check_partitions(int fd, char *dname, unsigned long long freesize,
 	 * Check where the last partition ends
 	 */
 	unsigned long long endofpart;
-	int ret;
 
-	if ((ret = get_last_partition_end(fd, &endofpart)) > 0) {
+	if (get_last_partition_end(fd, &endofpart) > 0) {
 		/* There appears to be a partition table here */
 		if (freesize == 0) {
 			/* partitions will not be visible in new device */
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH v2] Always return last partition end address in 512B blocks
From: Jes Sorensen @ 2016-12-13 14:09 UTC (permalink / raw)
  To: Mariusz Dabrowski; +Cc: linux-raid
In-Reply-To: <1481635862-829-1-git-send-email-mariusz.dabrowski@intel.com>

Mariusz Dabrowski <mariusz.dabrowski@intel.com> writes:
> For 4K disks 'endofpart' is an index of the last 4K sector used by partition.
> mdadm is using number of 512-byte sectors, so value returned by
> get_last_partition_end must be multiplied by 8 for devices with 4K sectors.
> Also, unused 'ret' variable has been removed.
>
> Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
> ---
>  util.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)

Applied!

Thanks,
Jes

^ permalink raw reply

* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-12-13 15:08 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey
In-Reply-To: <87oa0gzuej.fsf@notabene.neil.brown.name>

Hi Neil,

On Mon, Dec 12, 2016 at 10:53 PM, NeilBrown <neilb@suse.com> wrote:
> On Tue, Dec 13 2016, Jinpu Wang wrote:
>
>> On Mon, Dec 12, 2016 at 1:59 AM, NeilBrown <neilb@suse.com> wrote:
>>> On Sat, Nov 26 2016, Jinpu Wang wrote:
>>>> [  810.270860]  [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0
>>>
>>> What is this?  I cannot find that function in the upstream kernel.
>>>
>>> NeilBrown
>>
>> Hi Neil,
>>
>> blk_prologue_bio is our internal extension to gather some stats, sorry
>> not informed before.
>
> Ahhh.
>
> ....
>> +       return q->custom_make_request_fn(q, clone);
>
> I haven't heard of custom_make_request_fn before either.
>
>> +}
>>
>> ＩＭＨＯ， it seems unrelated, but I will rerun my test without this change.
>
> Yes, please re-test with an unmodified upstream kernel (and always
> report *exactly* what kernel you are running.  I cannot analyse code
> that I cannot see).
>
> NeilBrown

As you suggested, I re-run same test with 4.4.36 with no our own patch on MD.
I can still reproduce the same bug, nr_pending on heathy leg(loop1) is till 1.

4.4.36 kernel:
crash> bt 4069
PID: 4069   TASK: ffff88022b4f8d00  CPU: 3   COMMAND: "md2_raid1"
 #0 [ffff8800b77d3bf8] __schedule at ffffffff81811453
 #1 [ffff8800b77d3c50] schedule at ffffffff81811c30
 #2 [ffff8800b77d3c68] freeze_array at ffffffffa07ee17e [raid1]
 #3 [ffff8800b77d3cc0] handle_read_error at ffffffffa07f093b [raid1]
 #4 [ffff8800b77d3d68] raid1d at ffffffffa07f10a6 [raid1]
 #5 [ffff8800b77d3e60] md_thread at ffffffffa04dee80 [md_mod]
 #6 [ffff8800b77d3ed0] kthread at ffffffff81075fb6
 #7 [ffff8800b77d3f50] ret_from_fork at ffffffff818157df
crash> bt 2558
bt: invalid task or pid value: 2558
crash> bt 4558
PID: 4558   TASK: ffff88022b550d00  CPU: 3   COMMAND: "fio"
 #0 [ffff88022c287710] __schedule at ffffffff81811453
 #1 [ffff88022c287768] schedule at ffffffff81811c30
 #2 [ffff88022c287780] wait_barrier at ffffffffa07ee044 [raid1]
 #3 [ffff88022c2877e8] make_request at ffffffffa07efc65 [raid1]
 #4 [ffff88022c2878d0] md_make_request at ffffffffa04df609 [md_mod]
 #5 [ffff88022c287928] generic_make_request at ffffffff813fd3de
 #6 [ffff88022c287970] submit_bio at ffffffff813fd522
 #7 [ffff88022c2879b8] do_blockdev_direct_IO at ffffffff811d32a7
 #8 [ffff88022c287be8] __blockdev_direct_IO at ffffffff811d3b6e
 #9 [ffff88022c287c10] blkdev_direct_IO at ffffffff811ce2d7
#10 [ffff88022c287c38] generic_file_direct_write at ffffffff81132c90
#11 [ffff88022c287cb0] __generic_file_write_iter at ffffffff81132e1d
#12 [ffff88022c287d08] blkdev_write_iter at ffffffff811ce597
#13 [ffff88022c287d68] aio_run_iocb at ffffffff811deca6
#14 [ffff88022c287e68] do_io_submit at ffffffff811dfbaa
#15 [ffff88022c287f40] sys_io_submit at ffffffff811dfe4b
#16 [ffff88022c287f50] entry_SYSCALL_64_fastpath at ffffffff81815497
    RIP: 00007f63b1362737  RSP: 00007ffff7eb17f8  RFLAGS: 00000206
    RAX: ffffffffffffffda  RBX: 00007f63a142a000  RCX: 00007f63b1362737
    RDX: 0000000001179b58  RSI: 0000000000000001  RDI: 00007f63b1f4a000
    RBP: 0000000000000512   R8: 0000000000000001   R9: 0000000001171fa0
    R10: 00007f639ef84000  R11: 0000000000000206  R12: 0000000000000001
    R13: 0000000000000200  R14: 000000003a2d3000  R15: 0000000000000001
    ORIG_RAX: 00000000000000d1  CS: 0033  SS: 002b

crash> struct r1conf 0xffff880037362100
struct r1conf {
  mddev = 0xffff880037352800,
  mirrors = 0xffff88022c209c00,
  raid_disks = 2,
  next_resync = 18446744073709527039,
  start_next_window = 18446744073709551615,
  current_window_requests = 0,
  next_window_requests = 0,
  device_lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      }
    }
  },
  retry_list = {
    next = 0xffff8801ce757740,
    prev = 0xffff8801b1b79140
  },
  bio_end_io_list = {
    next = 0xffff8801ce7d9ac0,
    prev = 0xffff88022838f4c0
  },
  pending_bio_list = {
    head = 0x0,
    tail = 0x0
  },
  pending_count = 0,
  wait_barrier = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    task_list = {
      next = 0xffff8801f6d87818,
      prev = 0xffff88022c2877a8
    }
  },
  resync_lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      }
    }
  },
  nr_pending = 2086,
  nr_waiting = 97,
  nr_queued = 2084,
  barrier = 0,
  array_frozen = 1,
  fullsync = 0,
  recovery_disabled = 1,
  poolinfo = 0xffff8802330be390,
  r1bio_pool = 0xffff88022bdf54e0,
  r1buf_pool = 0x0,
  tmppage = 0xffffea0000dcee40,
  thread = 0x0,
  cluster_sync_low = 0,
  cluster_sync_high = 0
}
crash>
crash> struct raid1_info 0xffff88022c209c00
struct raid1_info {
  rdev = 0xffff880231635800,
  head_position = 1318965,
  next_seq_sect = 252597,
  seq_start = 252342
}
crash> struct raid1_info 0xffff88022c209c20
struct raid1_info {
  rdev = 0xffff88023166ce00,
  head_position = 1585216,
  next_seq_sect = 839992,
  seq_start = 839977
}
crash> struct md_rdev 0xffff880231635800
struct md_rdev {
  same_set = {
    next = 0xffff880037352818,
    prev = 0xffff88023166ce00
  },
  sectors = 2095104,
  mddev = 0xffff880037352800,
  last_events = 41325652,
  meta_bdev = 0x0,
  bdev = 0xffff880235c2aa40,
  sb_page = 0xffffea0002dd98c0,
  bb_page = 0xffffea0002e48f80,
  sb_loaded = 1,
  sb_events = 205,
  data_offset = 2048,
  new_data_offset = 2048,
  sb_start = 8,
  sb_size = 512,
  preferred_minor = 65535,
  kobj = {
    name = 0xffff8802341cdef0 "dev-loop1",
    entry = {
      next = 0xffff880231635880,
      prev = 0xffff880231635880
    },
    parent = 0xffff880037352850,
    kset = 0x0,
    ktype = 0xffffffffa04f3020 <rdev_ktype>,
    sd = 0xffff880233e3b8e8,
    kref = {
      refcount = {
        counter = 1
      }
    },
    state_initialized = 1,
    state_in_sysfs = 1,
    state_add_uevent_sent = 0,
    state_remove_uevent_sent = 0,
    uevent_suppress = 0
  },
  flags = 2,
  blocked_wait = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    task_list = {
      next = 0xffff8802316358c8,
      prev = 0xffff8802316358c8
    }
  },
desc_nr = 0,
  raid_disk = 0,
  new_raid_disk = 0,
  saved_raid_disk = -1,
  {
    recovery_offset = 0,
    journal_tail = 0
  },
  nr_pending = {
    counter = 1
  },
  read_errors = {
    counter = 0
  },
  last_read_error = {
    tv_sec = 0,
    tv_nsec = 0
  },
  corrected_errors = {
    counter = 0
  },
  del_work = {
    data = {
      counter = 0
    },
    entry = {
      next = 0x0,
      prev = 0x0
    },
    func = 0x0
  },
  sysfs_state = 0xffff880233e3b960,
  badblocks = {
    count = 0,
    unacked_exist = 0,
    shift = 0,
    page = 0xffff88022c0d6000,
    changed = 0,
    lock = {
      seqcount = {
        sequence = 264
      },
      lock = {
        {
          rlock = {
            raw_lock = {
              val = {
                counter = 0
              }
            }
          }
        }
      }
    },
    sector = 0,
    size = 0
  }
}
struct md_rdev {
  same_set = {
    next = 0xffff880231635800,
    prev = 0xffff880037352818
  },
  sectors = 2095104,
  mddev = 0xffff880037352800,
  last_events = 10875407,
  meta_bdev = 0x0,
  bdev = 0xffff880234a86a40,
  sb_page = 0xffffea00089e0ac0,
  bb_page = 0xffffea0007db4980,
  sb_loaded = 1,
  sb_events = 204,
  data_offset = 2048,
  new_data_offset = 2048,
  sb_start = 8,
  sb_size = 512,
  preferred_minor = 65535,
  kobj = {
    name = 0xffff88022c100e30 "dev-ibnbd0",
    entry = {
      next = 0xffff88023166ce80,
      prev = 0xffff88023166ce80
    },
    parent = 0xffff880037352850,
    kset = 0x0,
    ktype = 0xffffffffa04f3020 <rdev_ktype>,
    sd = 0xffff8800b6539e10,
    kref = {
      refcount = {
        counter = 1
      }
    },
    state_initialized = 1,
    state_in_sysfs = 1,
    state_add_uevent_sent = 0,
    state_remove_uevent_sent = 0,
    uevent_suppress = 0
  },
  flags = 581,
  blocked_wait = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    task_list = {
      next = 0xffff88023166cec8,
      prev = 0xffff88023166cec8
    }
  },
  desc_nr = 1,
  raid_disk = 1,
  new_raid_disk = 0,
  saved_raid_disk = -1,
  {
    recovery_offset = 18446744073709551615,
    journal_tail = 18446744073709551615
  },
  nr_pending = {
    counter = 2073
  },
  read_errors = {
    counter = 0
  },
  last_read_error = {
    tv_sec = 0,
    tv_nsec = 0
  },
  corrected_errors = {
    counter = 0
  },
  del_work = {
    data = {
      counter = 0
    },
    entry = {
      next = 0x0,
      prev = 0x0
    },
    func = 0x0
  },
  sysfs_state = 0xffff8800b6539e88,
  badblocks = {
    count = 1,
    unacked_exist = 0,
    shift = 0,
    page = 0xffff880099ced000,
    changed = 0,
    lock = {
      seqcount = {
        sequence = 4
      },
      lock = {
        {
          rlock = {
            raw_lock = {
              val = {
                counter = 0
              }
            }
          }
        }
      }
    },
    sector = 80,
    size = 8
  }
}


-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@profitbricks.com
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

^ permalink raw reply

* Re: [PATCH v2 00/12] Partial Parity Log for MD RAID 5
From: Jes Sorensen @ 2016-12-13 15:25 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Artur Paszkiewicz, NeilBrown, linux-raid
In-Reply-To: <20161207170906.sthdxswufwk5fqhn@kernel.org>

Shaohua Li <shli@kernel.org> writes:
> On Wed, Dec 07, 2016 at 03:36:01PM +0100, Artur Paszkiewicz wrote:
>> On 12/07/2016 01:32 AM, NeilBrown wrote:
>> > 
>> > I would expect to see as description of what a PPL actually is and how
>> > it works here... but there is none.
>> > 
>> > The change-log for patch 06 has a tiny bit more information which is
>> > just enough to be able to start trying to understand the code, but it
>> > isn't much.
>> > And none of this description gets into the code, or into the
>> > Documentation/.  This makes it hard to review and hard to maintain.
>> > 
>> > Remember: if you want people to review you code, it is in your interest
>> > to make it easy.  That means give lots of details.
>> 
>> Hi Neil,
>> 
>> Thank you for taking the time to look at this and for your feedback. I
>> didn't try to make it hard to review... Sometimes it's easy to forget
>> how non-obvious things are after looking at them for too long :) I will
>> improve the descriptions and address the issues that you found in the
>> next version of the patches.
>
> Havn't looked at the patches yet, being busy recently, sorry! When you repost
> these, I'd like to know why we need another log for raid5 considering we
> already had one to fix similar issue. What's the good/bad side of this new log?
> There is such feature in Intel RSTe doesn't sound like a technical reason we
> should support this.

Shaohua,

Any further thought on these patches? I am considering doing a release
of mdadm early in the new year. it would be nice to include these
patches if the feature is going in.

As for supporting it, if IMSM supports it and it is used in the field,
then it seems legitimate for Linux to support it too. Just like we
support so many other obscure pieces of hardware :)

Cheers,
Jes

^ permalink raw reply

* Re: What is the solution for USB HDs
From: Wols Lists @ 2016-12-13 15:53 UTC (permalink / raw)
  To: Phil Turmel, Adam Goryachev, juca, linux-raid
In-Reply-To: <9481982b-bfda-644e-e160-74b94bbbec0c@turmel.org>

On 12/12/16 03:17, Phil Turmel wrote:
> On 12/11/2016 05:14 PM, Adam Goryachev wrote:
>> > On 12/12/16 04:14, juca@juan-carlos.info wrote:
>> > If you are still having problems after that, then please try to post
>> > more details on what happens when the drives vanish (eg, kernel logs,
>> > system logs, etc).
>> > 
>> > Also, you might consider an alternative SBC, some have SATA ports
>> > already available (RPi are the only SBC's I've used, but there are many
>> > other options/variations).
> There's also a kernel boot parameter that can cut off all USB
> auto-suspend, which I suspect is part of the problem.  It has been known
> for a long time that USB is not robust enough to be trusted in MD arrays.

Okay, I've just parroted it on the wiki, but the wiki does recommend
against USB disks, precisely because every mention (very few) I've seen
says it's a bad idea - because of suspend.

Cheers,
Wol

^ permalink raw reply

* Upcoming mdadm release
From: Jes Sorensen @ 2016-12-13 16:09 UTC (permalink / raw)
  To: linux-raid

Hi,

I am planning to do my first release of mdadm early in the new year,
expect early-mid January.

If you have patches pending, now would be a good time to finish them up
and start pushing them to the list.

If you already pushed patches and I missed them, now would also be a
good time to start nagging me.

Cheers,
Jes

^ permalink raw reply

* Re: [PATCH] Use disk sector size value to set offset for reading GPT
From: Wols Lists @ 2016-12-13 16:38 UTC (permalink / raw)
  To: Mariusz Dabrowski, linux-raid; +Cc: jes.sorensen
In-Reply-To: <1481195595-13140-1-git-send-email-mariusz.dabrowski@intel.com>

On 08/12/16 11:13, Mariusz Dabrowski wrote:
> mdadm is using invalid byte-offset while reading GPT header to get
> partition info (size, first sector, last sector etc.). Now this offset
> is hardcoded to 512 bytes and it is not valid for disks with sector
> size different than 512 bytes because MBR and GPT headers are aligned
> to LBA, so valid offset for 4k drives is 4096 bytes.
> 
> Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>

Could this be behind the couple of incidents recently, where an array
has been moved from one machine to another, and the GPT has disappeared?

I know I've been following the threads, and I've been puzzled in that
I've thought "nothing should be writing there!"

If it is, what mdadm/kernels are affected, and I can put it on the wiki,
writing the page up about corrupted disks is next on my list.

Cheers,
Wol


^ permalink raw reply

* Re: [PATCH] dm: Avoid sleeping while holding the dm_bufio lock
From: Doug Anderson @ 2016-12-13 22:01 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alasdair Kergon, Mike Snitzer, Shaohua Li, Dmitry Torokhov,
	linux-kernel@vger.kernel.org, linux-raid, dm-devel,
	David Rientjes, Sonny Rao, Guenter Roeck
In-Reply-To: <CAD=FV=U+1VA23s1WMn4srH5U3Cjuz+0OLhq1qD5qw0oTE34xFQ@mail.gmail.com>

Hi,

On Mon, Dec 12, 2016 at 4:08 PM, Doug Anderson <dianders@chromium.org> wrote:
> OK, so I just put a printk in wait_iff_congested() and it didn't show
> me waiting for the timeout (!).  I know that I saw
> wait_iff_congested() in the originally reproduction of this problem,
> but it appears that in my little "balloon" reproduction it's not
> actually involved...
>
>
> ...I dug further and it appears that __alloc_pages_direct_reclaim() is
> actually what's slow.  Specifically it looks as if shrink_zone() can
> actually take quite a while.  As I've said, I'm not an expert on the
> memory manager but I'm not convinced that it's wrong for the direct
> reclaim path to be pretty slow at times, especially when I'm putting
> an abnormally high amount of stress on it.
>
> I'm going to take this as further evidence that the patch being
> discussed in this thread is a good one (AKA don't hold the dm bufio
> lock while allocating memory).  :)  If it's unexpected that
> shrink_zone() might take several seconds when under extreme memory
> pressure then I can do some additional digging.  Do note that I am
> running with "zram" and remember that I'm on an ancient 4.4-based
> kernel, so perhaps one of those two factors causes problems.

Sadly, I couldn't get this go as just "the way things were" in case
there was some major speedup to be had here.  :-P

I tracked this down to shrink_list() taking 1 ms per call (perhaps
because I have HZ=1000?) and in shrink_lruvec() the outer loop ran
many thousands of times.  Thus the total time taken by shrink_lruvec()
could easily be many seconds.

Wow, interesting, when I change HZ to 100 instead of 1000 then the
behavior changes quite a bit.  I can still get my bufio lock warning
easily, but all of a sudden shrink_lruvec() isn't slow.  :-P

OK, really truly going to stop digging further now...  ;)  Presumably
reporting weird behaviors with old kernels doesn't help anyone in
mainline, and I can buy the whole "memory accesses are slow when you
start thrashing the system" argument.

-Doug

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox