Linux RAID subsystem development


* Re: Recovering a RAID6 after all disks were disconnected
From: Giuseppe Bilotta @ 2016-12-23 21:14 UTC
  To: NeilBrown; +Cc: John Stoffel, linux-raid
In-Reply-To: <CAOxFTcxaC1WOj7HeD5bRaPKV93fQZ6X-mBtHOFcQmPwWfjPxDQ@mail.gmail.com>

On Fri, Dec 23, 2016 at 5:17 PM, Giuseppe Bilotta
<giuseppe.bilotta@gmail.com> wrote:
>
> Now I wonder if it would be possible to combine this approach with
> something that simply hacked the metadata of each disk to re-establish
> the correct disk order to make it possible to reassemble this
> particular array without recreating anything. Are problems such as
> mine common enough to warrant making this kind of verified
> reassembly from assumed-clean disks easier?

Actually, now that the correct order is verified, I would like to know
why re-creating the array using mdadm -C --assume-clean with the disks
in the correct order works (the RAID is then accessible, and I can
read data off of it).

However, if I simply hand-edit the metadata to assign the correct
device order to the disks (I do this by restoring the correct device
roles in the dev_roles table, at the entries corresponding to the
disks' dev_numbers, in the correct order, and then adjusting the
checksum accordingly) and then assemble the array, I get I/O errors
accessing the array contents, even though raid6check doesn't report
issues.
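
The role-and-checksum edit described above can be sketched in Python as
follows. This is only an illustration, not the procedure actually used
in this thread. It assumes the v1.x superblock layout as commonly
documented for the kernel's md_p.h (sb_csum at byte 216, max_dev at byte
220, dev_roles[] at byte 256, all little-endian), and those offsets
should be checked against the kernel headers before relying on them. The
helper names are hypothetical, the roles are written starting at table
index 0 for simplicity, and the code works on an in-memory copy only.

```python
import struct

# Assumed field offsets in the v1.x superblock (verify against md_p.h).
SB_CSUM_OFFSET = 216    # 32-bit checksum field
MAX_DEV_OFFSET = 220    # 32-bit count of dev_roles entries
DEV_ROLES_OFFSET = 256  # start of the 16-bit dev_roles[] table


def calc_sb_1_csum(sb):
    """Sum the superblock as 32-bit LE words with sb_csum treated as zero,
    add a trailing 16-bit value if present, then fold to 32 bits."""
    (max_dev,) = struct.unpack_from("<I", sb, MAX_DEV_OFFSET)
    size = DEV_ROLES_OFFSET + 2 * max_dev
    if len(sb) < size:
        raise ValueError("buffer shorter than superblock")
    buf = bytearray(sb[:size])
    buf[SB_CSUM_OFFSET:SB_CSUM_OFFSET + 4] = b"\0\0\0\0"
    n_words = size // 4
    total = sum(w for (w,) in struct.iter_unpack("<I", bytes(buf[:4 * n_words])))
    if size % 4 == 2:
        total += struct.unpack_from("<H", buf, 4 * n_words)[0]
    return ((total & 0xFFFFFFFF) + (total >> 32)) & 0xFFFFFFFF


def set_roles(sb, roles):
    """Hypothetical helper: return a copy of sb with dev_roles[0..n) replaced
    and the checksum field recomputed. Nothing is written to disk."""
    buf = bytearray(sb)
    struct.pack_into("<%dH" % len(roles), buf, DEV_ROLES_OFFSET, *roles)
    struct.pack_into("<I", buf, SB_CSUM_OFFSET, calc_sb_1_csum(bytes(buf)))
    return bytes(buf)
```

A consistent checksum only shows that the superblock is internally
valid. It says nothing about whether other fields also differ from those
of a freshly created array.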

In the 'hacked dev role' case, the dmesg reads:

[  +0.002057] md: bind<dm-2>
[  +0.000936] md: bind<dm-1>
[  +0.000932] md: bind<dm-0>
[  +0.000925] md: bind<dm-3>
[  +0.001443] md/raid:md112: device dm-3 operational as raid disk 0
[  +0.000540] md/raid:md112: device dm-0 operational as raid disk 3
[  +0.000710] md/raid:md112: device dm-1 operational as raid disk 2
[  +0.000508] md/raid:md112: device dm-2 operational as raid disk 1
[  +0.009716] md/raid:md112: allocated 4374kB
[  +0.000555] md/raid:md112: raid level 6 active with 4 out of 4 devices, algorithm 2
[  +0.000531] RAID conf printout:
[  +0.000001]  --- level:6 rd:4 wd:4
[  +0.000001]  disk 0, o:1, dev:dm-3
[  +0.000001]  disk 1, o:1, dev:dm-2
[  +0.000000]  disk 2, o:1, dev:dm-1
[  +0.000001]  disk 3, o:1, dev:dm-0
[  +0.000449] created bitmap (22 pages) for device md112
[  +0.001865] md112: bitmap initialized from disk: read 2 pages, set 5 of 44711 bits
[  +0.533458] md112: detected capacity change from 0 to 6000916561920
[  +0.004194] Buffer I/O error on dev md112, logical block 0, async page read
[  +0.003450] Buffer I/O error on dev md112, logical block 0, async page read
[  +0.001953] Buffer I/O error on dev md112, logical block 0, async page read
[  +0.001978] Buffer I/O error on dev md112, logical block 0, async page read
[  +0.001852] ldm_validate_partition_table(): Disk read failed.
[  +0.001889] Buffer I/O error on dev md112, logical block 0, async page read
[  +0.001875] Buffer I/O error on dev md112, logical block 0, async page read
[  +0.001834] Buffer I/O error on dev md112, logical block 0, async page read
[  +0.001596] Buffer I/O error on dev md112, logical block 0, async page read
[  +0.001551] Dev md112: unable to read RDB block 0
[  +0.001293] Buffer I/O error on dev md112, logical block 0, async page read
[  +0.001284] Buffer I/O error on dev md112, logical block 0, async page read
[  +0.001307]  md112: unable to read partition table


So the array assembles, and raid6check reports no error, but the data
is actually inaccessible. Am I missing other aspects of the metadata
that need to be restored?


-- 
Giuseppe "Oblomov" Bilotta


* Re: Raid5 performance issue
From: Peter Grandi @ 2016-12-23 19:27 UTC
  To: Linux RAID
In-Reply-To: <"H0000071000db2da.1482500583.sx.f1-outsourcing.eu*"@MHS>

> I have grown a raid5 over the years with drives and resized
> partitions, now I have upgraded to centos7 (from centos5). And
> I have the impression the speed is not what it used to be.

Yes. Speed most likely has dropped a lot, while performance has
probably stayed the same or improved. There is a large tradeoff
between flexibility and speed.

> Can this be because of some misalignment?

Plus a couple of other major reasons.

> How can this be verified?

While reading (or, even worse, writing) files sequentially on the
filesystem contained in that RAID set, the 'iostat -dkzyx 1'
output will show lots of random accesses and read-modify-write cycles.
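
Besides watching iostat, one narrow aspect of alignment can be checked
directly: whether each member partition starts on a sensible boundary.
The Python sketch below is a minimal illustration under stated
assumptions. It relies on the partition start that sysfs reports in
512-byte units, uses a 4096-byte boundary (4 KiB physical sectors, an
assumption about the drives) unless told otherwise, and the
disk/partition names passed to read_start_sector() are hypothetical.

```python
SECTOR_BYTES = 512  # sysfs reports partition starts in 512-byte units


def is_aligned(start_sector, boundary_bytes=4096):
    """True if a partition starting at start_sector begins on a
    boundary_bytes boundary."""
    return (start_sector * SECTOR_BYTES) % boundary_bytes == 0


def read_start_sector(disk, partition):
    """Read the partition start from sysfs, e.g. disk='sdm',
    partition='sdm1' (hypothetical member names)."""
    with open("/sys/block/%s/%s/start" % (disk, partition)) as f:
        return int(f.read())


if __name__ == "__main__":
    # Sector 63 was a common starting sector on older installations.
    for start in (63, 2048):
        print(start, is_aligned(start), is_aligned(start, 512 * 1024))
```

This only covers where the partitions start. It does not account for the
md data offset inside each member or for the filesystem's stripe
settings.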

Your only sensible option is to dump the content, recreate the
RAID set and reformat the filesystem, and reload.


* Re: Raid5 performance issue
From: Doug Dumitru @ 2016-12-23 19:24 UTC
  To: Marc Roos; +Cc: linux-raid
In-Reply-To: <H0000071000db2da.1482500583.sx.f1-outsourcing.eu*@MHS>

Mr. Roos,

It is very hard to get an array "to speed" without hitting it at very
high queue depths.  In this area, spinning disks and SSDs actually
behave quite differently.

With hard drives, I suspect your single-disk tests are taking
advantage of the disks' on-controller cache, which is doing read-ahead
and thus streaming.  With the array in place, you are probably doing
512K reads (check the array chunk size), so the disks will see bursts
of 512K reads with big gaps.  The gaps are large enough that the
platter has rotated too far, so despite the caching you wait a full
rotation.  This is just a guess.

You can test this hypothesis by running the test with block sizes that
are exactly the stripe size (or multiples thereof).  Check
/sys/block/md?/queue/optimal_io_size.  This should be ( number of drives
- number of parity drives - number of spares ) * chunk size.  This
might be a really large number, so the block stack will cut the
requests up anyway (there is a 1M limit for struct bio in most
layers), but with HDDs the scheduler should have time to do some
magic.
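
As a worked example of the formula above, here is a minimal Python
sketch. The figures are assumptions for illustration only: seven member
drives (the number of disks in the dd loop quoted below), one parity
drive for RAID5, no spares, and the 512K chunk size guessed at above.
The function name is hypothetical.

```python
def full_stripe_bytes(drives, parity_drives, spares, chunk_bytes):
    """(drives - parity drives - spares) * chunk size, in bytes."""
    data_drives = drives - parity_drives - spares
    if data_drives < 1:
        raise ValueError("no data drives left")
    return data_drives * chunk_bytes


if __name__ == "__main__":
    stripe = full_stripe_bytes(drives=7, parity_drives=1, spares=0,
                               chunk_bytes=512 * 1024)
    # 6 data drives * 512 KiB per chunk
    print(stripe, "bytes =", stripe // (1024 * 1024), "MiB")
```

Under those assumptions a full stripe is 3 MiB, so a dd run with bs=3M
(or a multiple of it) would issue stripe-sized requests.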

You might actually do better on this test with smaller chunk sizes.
Then again, this test is far from representative of a production
workload, so tuning for it might be folly.

Doug Dumitru



On Fri, Dec 23, 2016 at 5:43 AM, Marc Roos <M.Roos@f1-outsourcing.eu> wrote:
>
> I have grown a raid5 over the years with drives and resized partitions,
> now I have upgraded to centos7 (from centos5). And I have the impression
> the speed is not what it used to be.
>
> Can this be because of some misalignment? How can this be verified?
>
>
> If I monitor the individual disks with dstat it reads the raid drives at
> very low speeds
>
> dd if=/dev/md21 of=/dev/null bs=1M count=1500 iflag=direct
> 1500+0 records in
> 1500+0 records out
> 1572864000 bytes (1.6 GB) copied, 19.5879 s, 80.3 MB/s
>
>
>    0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0
>    0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0
>    0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0
>  256k    0 : 320k    0 : 320k    0 : 192k    0 : 256k    0 : 320k    0 : 256k    0
> 4672k    0 :4672k    0 :4672k    0 :4800k    0 :4672k    0 :4672k    0 :4736k    0
>   11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0
>   10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0
>   10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0
>   13M    0 :  13M    0 :  13M    0 :  13M    0 :  13M    0 :  13M    0 :  13M    0
>   10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0
>   11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0
>   19M    0 :  19M    0 :  19M    0 :  19M    0 :  19M    0 :  19M    0 :  19M    0
> 9984k    0 :9792k    0 :9792k    0 :9792k    0 :9984k    0 :9984k    0 :9856k    0
>   13M    0 :  13M    0 :  13M    0 :  13M    0 :  13M    0 :  13M    0 :  13M    0
>   11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0
>   12M    0 :  12M    0 :  12M    0 :  12M    0 :  12M    0 :  12M    0 :  12M    0
> 7872k    0 :7744k    0 :7808k    0 :7744k    0 :7936k    0 :7744k    0 :7744k    0
>   11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0
>   19M    0 :  19M    0 :  19M    0 :  19M    0 :  19M    0 :  19M    0 :  19M    0
> 7488k    0 :7360k    0 :7296k    0 :7360k    0 :7296k    0 :7296k    0 :7296k    0
>   10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0
>   14M    0 :  14M    0 :  14M    0 :  14M    0 :  14M    0 :  14M    0 :  14M    0
> 9472k    0 :9536k    0 :9536k    0 :9536k    0 :9472k    0 :9536k    0 :9472k    0
>    0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0
>
> When I test the individual disks with
>
> for disk in sdm sdl sdi sde sdk sdf sdd; do dd if=/dev/$disk of=/dev/null bs=1M count=1500 iflag=direct & done
>
> [root@san2 ~]# 1500+0 records in
> 1500+0 records out
> 1572864000 bytes (1.6 GB) copied, 8.96022 s, 176 MB/s
> 1500+0 records in
> 1500+0 records out
> 1572864000 bytes (1.6 GB) copied, 9.59289 s, 164 MB/s
> 1500+0 records in
> 1500+0 records out
> 1572864000 bytes (1.6 GB) copied, 10.0863 s, 156 MB/s
> 1500+0 records in
> 1500+0 records out
> 1572864000 bytes (1.6 GB) copied, 10.5833 s, 149 MB/s
> 1500+0 records in
> 1500+0 records out
> 1572864000 bytes (1.6 GB) copied, 10.6084 s, 148 MB/s
> 1500+0 records in
> 1500+0 records out
> 1572864000 bytes (1.6 GB) copied, 11.0205 s, 143 MB/s
> 1500+0 records in
> 1500+0 records out
> 1572864000 bytes (1.6 GB) copied, 11.3199 s, 139 MB/s
>
>
>
>    0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0
>    0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0
> 4096k    0 :5120k    0 :  32M    0 : 512k    0 :  29M    0 :  35M    0 :5120k    0
>   62M    0 :  51M    0 : 157M    0 : 145M    0 : 144M    0 : 153M    0 :  38M    0
>  153M    0 : 148M    0 : 158M    0 : 174M    0 : 135M    0 : 151M    0 : 150M    0
>  152M    0 : 144M    0 : 154M    0 : 179M    0 : 150M    0 : 146M    0 : 149M    0
>  149M    0 : 147M    0 : 155M    0 : 186M    0 : 148M    0 : 155M    0 : 157M    0
>  156M    0 : 128M    0 : 154M    0 : 188M    0 : 136M    0 : 153M    0 : 155M    0
>  159M    0 : 136M    0 : 157M    0 : 206M    0 : 147M    0 : 155M    0 : 151M    0
>  153M    0 : 147M    0 : 162M    0 : 153M    0 : 144M    0 : 127M    0 : 147M    0
>  153M    0 : 138M    0 : 159M    0 : 153M    0 : 134M    0 : 145M    0 : 146M    0
>  147M    0 : 144M    0 : 154M    0 : 116M    0 : 144M    0 : 153M    0 : 143M    0
>  154M    0 : 150M    0 :  60M    0 :   0     0 : 141M    0 : 131M    0 : 153M    0
>   61M    0 : 147M    0 :   0     0 :   0     0 :  51M    0 :   0     0 : 109M    0
>    0     0 :  17M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0
>



-- 
Doug Dumitru
EasyCo LLC


* PROBLEM: Kernel BUG with raid5 soft + Xen + DRBD - invalid opcode
From: MasterPrenium @ 2016-12-23 18:25 UTC
  To: linux-kernel, xen-users
  Cc: linux-raid, shli, MasterPrenium@gmail.com, xen-devel

Hello Guys,

I'm having some trouble on a new system I'm setting up. I'm getting a kernel BUG message that seems to be related to the use of Xen (when I boot the system _without_ Xen, I don't get any crash).
Here is the configuration:
- 3x hard drives in software RAID 5, created by mdadm
- On top of it, DRBD for replication to another node (active/passive cluster)
- On top of it, a BTRFS filesystem with a few subvolumes
- On top of it, Xen VMs running.

The BUG happens when I generate "heavy" I/O (20MB/s with an rsync, for example) on the RAID5 stack.
I have to reset the system to make it work again.

Reproducible: ALWAYS (under that I/O, it crashes within 2-5 minutes). Also reproducible on another system with the same hardware.

Kernel versions impacted (at least): kernel-4.4.26, kernel-4.8.15, kernel-4.9.0

Here are the dmesg errors:
[  937.123220] ------------[ cut here ]------------
[  937.127549] kernel BUG at drivers/md/raid5.c:527!
[  937.131891] invalid opcode: 0000 [#1] SMP
[  937.136216] Modules linked in: x86_pkg_temp_thermal coretemp crc32c_intel aesni_intel aes_x86_64 ablk_helper mei_me mei mpt3sas
[  937.145665] CPU: 2 PID: 9704 Comm: kworker/u16:8 Not tainted 4.9.0-gentoo #2
[  937.150293] Hardware name: Supermicro Super Server/X10SDV-4C-7TP4F, BIOS 1.0b 11/21/2016
[  937.155531] Workqueue: drbd0_submit do_submit
[  937.160506] task: ffff88026b0b2940 task.stack: ffffc9000a66c000
[  937.164115] RIP: e030:[<ffffffff819e1fc1>]  [<ffffffff819e1fc1>] raid5_get_active_stripe+0x5e1/0x670
[  937.169584] RSP: e02b:ffffc9000a66fa58  EFLAGS: 00010086
[  937.175070] RAX: 0000000000000000 RBX: ffff880249d50000 RCX: ffff8802648bb5d0
[  937.180640] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff880249d50000
[  937.185505] RBP: ffffc9000a66faf0 R08: ffff8801f4813288 R09: 0000000000000000
[  937.190631] R10: 0000000000000288 R11: 0000000000000000 R12: 0000000000000000
[  937.196030] R13: 000000001e773e88 R14: ffff880249d50000 R15: ffff8802648bb400
[  937.202011] FS:  0000000000000000(0000) GS:ffff880270c80000(0000) knlGS:ffff880270c80000
[  937.206628] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[  937.212372] CR2: 00007f68a101b520 CR3: 0000000257875000 CR4: 0000000000042660
[  937.217538] Stack:
[  937.223361]  ffff8802648bb400 ffff880269550b40 0000000000000000 0000000166cf3800
[  937.229103]  000000001e773e88 ffff8802648bb5d0 0000000000000001 0000000000000000
[  937.233707]  ffff8802648bb40c 0000000000000001 ffffc9000a66faf0 ffff880047cba958
[  937.239736] Call Trace:
[  937.244406]  [<ffffffff819e21cd>] raid5_make_request+0x17d/0xdf0
[  937.250345]  [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
[  937.256173]  [<ffffffff81a09c03>] md_make_request+0xe3/0x220
[  937.261031]  [<ffffffff81483e9b>] generic_make_request+0xcb/0x1a0
[  937.265615]  [<ffffffff81732537>] drbd_send_and_submit+0x497/0x1310
[  937.271605]  [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
[  937.276726]  [<ffffffff817339ba>] send_and_submit_pending+0x6a/0x90
[  937.282292]  [<ffffffff81733e43>] do_submit+0x463/0x550
[  937.288333]  [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
[  937.293205]  [<ffffffff81095400>] process_one_work+0x170/0x420
[  937.298982]  [<ffffffff810957d3>] worker_thread+0x123/0x500
[  937.304154]  [<ffffffff810956b0>] ? process_one_work+0x420/0x420
[  937.310314]  [<ffffffff810956b0>] ? process_one_work+0x420/0x420
[  937.316013]  [<ffffffff8109b135>] kthread+0xc5/0xe0
[  937.320918]  [<ffffffff8102c815>] ? __switch_to+0x355/0x7a0
[  937.327029]  [<ffffffff8109b070>] ? kthread_park+0x60/0x60
[  937.331994]  [<ffffffff81ccbbc5>] ret_from_fork+0x25/0x30
[  937.338068] Code: 85 d0 fb ff ff f0 41 80 8f 98 02 00 00 04 e9 c2 fb ff ff f3 90 41 8b 47 70 a8 01 75 f6 89 45 a4 e9 e2 fd ff ff 0f 0b 0f 0b 0f 0b <0f> 0b 49 89 d6 e9 e1 fa ff ff 49 8b 82 e8 01 00 00 4d 8b 8a e0
[  937.349579] RIP  [<ffffffff819e1fc1>] raid5_get_active_stripe+0x5e1/0x670
[  937.355290]  RSP <ffffc9000a66fa58>
[  937.386587] ---[ end trace b870be01f61065a5 ]---
[  941.931453] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  941.937139] IP: [<ffffffff810bcaa6>] __wake_up_common+0x26/0x80
[  941.943106] PGD 252dde067
[  941.943219] PUD 252ee7067
[  941.950107] PMD 0

[  941.956080] Oops: 0000 [#2] SMP
[  941.961919] Modules linked in: x86_pkg_temp_thermal coretemp crc32c_intel aesni_intel aes_x86_64 ablk_helper mei_me mei mpt3sas
[  941.974933] CPU: 2 PID: 9704 Comm: kworker/u16:8 Tainted: G      D         4.9.0-gentoo #2
[  941.982080] Hardware name: Supermicro Super Server/X10SDV-4C-7TP4F, BIOS 1.0b 11/21/2016
[  941.989296] task: ffff88026b0b2940 task.stack: ffffc9000a66c000
[  941.996831] RIP: e030:[<ffffffff810bcaa6>]  [<ffffffff810bcaa6>] __wake_up_common+0x26/0x80
[  942.004391] RSP: e02b:ffffc9000a66fe50  EFLAGS: 00010086
[  942.011818] RAX: 0000000000000200 RBX: ffffc9000a66ff18 RCX: 0000000000000000
[  942.019290] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffc9000a66ff18
[  942.026779] RBP: ffffc9000a66fe88 R08: 0000000000000000 R09: 0000000000000000
[  942.034246] R10: 0000000000000008 R11: 0000000000000001 R12: ffffc9000a66ff20
[  942.041693] R13: 0000000000000200 R14: 0000000000000000 R15: 0000000000000003
[  942.049166] FS:  0000000000000000(0000) GS:ffff880270c80000(0000) knlGS:ffff880270c80000
[  942.056599] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[  942.063953] CR2: 0000000000000028 CR3: 0000000257875000 CR4: 0000000000042660
[  942.070841] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[  942.074862] BUG: unable to handle kernel paging request at ffffc9000234f8f8
[  942.078910] IP: [<ffffc9000234f8f8>] 0xffffc9000234f8f8
[  942.082961] PGD 1e9840067
[  942.083010] PUD 1e983f067
[  942.086963] PMD 26b42c067
[  942.086978] PTE 801000026b66c067

[  942.094822] Oops: 0011 [#3] SMP
[  942.098734] Modules linked in: x86_pkg_temp_thermal coretemp crc32c_intel aesni_intel aes_x86_64 ablk_helper mei_me mei mpt3sas
[  942.107222] CPU: 2 PID: 9704 Comm: kworker/u16:8 Tainted: G      D         4.9.0-gentoo #2
[  942.111581] Hardware name: Supermicro Super Server/X10SDV-4C-7TP4F, BIOS 1.0b 11/21/2016
[  942.116050] task: ffff88026b0b2940 task.stack: ffffc9000a66c000
[  942.120530] RIP: e030:[<ffffc9000234f8f8>]  [<ffffc9000234f8f8>] 0xffffc9000234f8f8
[  942.125019] RSP: e02b:ffffc9000a66fb80  EFLAGS: 00010082
[  942.129534] RAX: 0000000000000041 RBX: 0000000000042660 RCX: 0000000000000006
[  942.134355] RDX: 0000000000000041 RSI: ffffffff824e00a0 RDI: ffff880270c8dd80
[  942.138921] RBP: ffffc9000a66fbe0 R08: 0000000000000000 R09: 0000000000000000
[  942.143564] R10: 0000000000000008 R11: 0000000000000001 R12: 0000000080050033
[  942.148172] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  942.152837] FS:  0000000000000000(0000) GS:ffff880270c80000(0000) knlGS:ffff880270c80000
[  942.157525] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[  942.162213] CR2: 0000000000000028 CR3: 0000000257875000 CR4: 0000000000042660
[  942.166954] Stack:
[  942.171576]  0000000257875000 0000000000000028 ffff880270c80000 ffff880270c80000
[  942.176267]  0000000000000000 0000e0330a66c000 0000000000000000 ffffc9000a66fda8
[  942.180918]  0000000000000000 ffffc9000a66fda8 0000000000000000 0000000000000000
[  942.185521] Call Trace:
[  942.190043]  [<ffffffff810302ad>] show_regs+0x2d/0x180
[  942.194605]  [<ffffffff81030725>] __die+0xa5/0xf0
[  942.199050]  [<ffffffff8106041e>] no_context+0x14e/0x3d0
[  942.203562]  [<ffffffff81060798>] __bad_area_nosemaphore+0xf8/0x1c0
[  942.208002]  [<ffffffff8106086f>] bad_area_nosemaphore+0xf/0x20
[  942.212478]  [<ffffffff81061034>] __do_page_fault+0x84/0x4b0
[  942.216797]  [<ffffffff8106148c>] do_page_fault+0x2c/0x40
[  942.221021]  [<ffffffff81ccd808>] page_fault+0x28/0x30
[  942.225184]  [<ffffffff810bcaa6>] ? __wake_up_common+0x26/0x80
[  942.229287]  [<ffffffff810bcb0e>] __wake_up_locked+0xe/0x10
[  942.233366]  [<ffffffff810bd4d2>] complete+0x32/0x50
[  942.237330]  [<ffffffff8107a500>] mm_release+0xc0/0x160
[  942.241216]  [<ffffffff81080206>] do_exit+0x136/0xb50
[  942.245056]  [<ffffffff81ccdc07>] rewind_stack_do_exit+0x17/0x20
[  942.248933] Code: c9 ff ff c0 cf 74 b7 01 88 ff ff 00 30 cf 66 02 88 ff ff 00 00 00 00 00 00 00 00 40 29 57 6b 02 88 ff ff b0 cf 0b 81 ff ff ff ff <70> fb 66 0a 00 c9 ff ff 88 b6 8b 64 02 88 ff ff 00 00 00 00 01
[  942.257683] RIP  [<ffffc9000234f8f8>] 0xffffc9000234f8f8
[  942.261814]  RSP <ffffc9000a66fb80>
[  942.265860] CR2: ffffc9000234f8f8
[  942.269830] ---[ end trace b870be01f61065a6 ]---
[  942.293603] Fixing recursive fault but reboot is needed!
[  962.926746] INFO: rcu_sched detected stalls on CPUs/tasks:
[  962.930582]  4-...: (1 GPs behind) idle=deb/140000000000000/0 softirq=51234/51234 fqs=5195
[  962.934400]  (detected by 1, t=21002 jiffies, g=26732, c=26731, q=173)
[  962.938231] Task dump for CPU 4:
[  962.942054] md10_raid5      R  running task    13232  2654      2 0x00000008
[  962.945939]  ffff880270d0dcc0 ffff880270ed8ec0 000000000306bc88 0000000000000000
[  962.949899]  0000000000000220 ffff8802648bb40c 0000000000000002 ffff8802648bb708
[  962.953781]  0000000000000001 ffffc9000306bcc8 ffffffff81ccb884 ffff8802648bb400
[  962.957570] Call Trace:
[  962.961272]  [<ffffffff81ccb884>] ? _raw_spin_lock_irqsave+0x54/0x60
[  962.964943]  [<ffffffff819d87f4>] ? release_inactive_stripe_list+0x44/0x180
[  962.968604]  [<ffffffff819e5469>] ? handle_active_stripes.isra.56+0x169/0x440
[  962.972253]  [<ffffffff819e5ae1>] ? raid5d+0x3a1/0x730
[  962.975825]  [<ffffffff81a094d3>] ? md_thread+0xf3/0x100
[  962.979360]  [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
[  962.982900]  [<ffffffff81a093e0>] ? find_pers+0x70/0x70
[  962.986392]  [<ffffffff8109b135>] ? kthread+0xc5/0xe0
[  962.989881]  [<ffffffff8102c815>] ? __switch_to+0x355/0x7a0
[  962.993382]  [<ffffffff8109b070>] ? kthread_park+0x60/0x60
[  962.996858]  [<ffffffff81ccbbc5>] ? ret_from_fork+0x25/0x30
[ 1025.932534] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1025.936027]  4-...: (1 GPs behind) idle=deb/140000000000000/0 softirq=51234/51234 fqs=20780
[ 1025.939486]  (detected by 0, t=84014 jiffies, g=26732, c=26731, q=742)
[ 1025.942969] Task dump for CPU 4:
[ 1025.946373] md10_raid5      R  running task    13232  2654      2 0x00000008
[ 1025.949909]  ffff880270d0dcc0 ffff880270ed8ec0 000000000306bc88 0000000000000000
[ 1025.953451]  0000000000000220 ffff8802648bb40c 0000000000000002 ffff8802648bb708
[ 1025.957015]  0000000000000001 ffffc9000306bcc8 ffffffff81ccb884 ffff8802648bb400
[ 1025.960601] Call Trace:
[ 1025.964139]  [<ffffffff81ccb884>] ? _raw_spin_lock_irqsave+0x54/0x60
[ 1025.967724]  [<ffffffff819d87f4>] ? release_inactive_stripe_list+0x44/0x180
[ 1025.971351]  [<ffffffff819e5469>] ? handle_active_stripes.isra.56+0x169/0x440
[ 1025.975001]  [<ffffffff819e5ae1>] ? raid5d+0x3a1/0x730
[ 1025.978598]  [<ffffffff81a094d3>] ? md_thread+0xf3/0x100
[ 1025.982255]  [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
[ 1025.985875]  [<ffffffff81a093e0>] ? find_pers+0x70/0x70
[ 1025.989496]  [<ffffffff8109b135>] ? kthread+0xc5/0xe0
[ 1025.993117]  [<ffffffff8102c815>] ? __switch_to+0x355/0x7a0
[ 1025.996707]  [<ffffffff8109b070>] ? kthread_park+0x60/0x60
[ 1026.000354]  [<ffffffff81ccbbc5>] ? ret_from_fork+0x25/0x30
[ 1088.937436] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1088.941109]  4-...: (1 GPs behind) idle=deb/140000000000000/0 softirq=51234/51234 fqs=36280
[ 1088.944649]  (detected by 0, t=147019 jiffies, g=26732, c=26731, q=1328)
[ 1088.948180] Task dump for CPU 4:
[ 1088.951671] md10_raid5      R  running task    13232  2654      2 0x00000008
[ 1088.955296]  ffff880270d0dcc0 ffff880270ed8ec0 000000000306bc88 0000000000000000
[ 1088.958963]  0000000000000220 ffff8802648bb40c 0000000000000002 ffff8802648bb708
[ 1088.962665]  0000000000000001 ffffc9000306bcc8 ffffffff81ccb884 ffff8802648bb400
[ 1088.966301] Call Trace:
[ 1088.969868]  [<ffffffff81ccb884>] ? _raw_spin_lock_irqsave+0x54/0x60
[ 1088.973451]  [<ffffffff819d87f4>] ? release_inactive_stripe_list+0x44/0x180
[ 1088.977020]  [<ffffffff819e5469>] ? handle_active_stripes.isra.56+0x169/0x440
[ 1088.980572]  [<ffffffff819e5ae1>] ? raid5d+0x3a1/0x730
[ 1088.984066]  [<ffffffff81a094d3>] ? md_thread+0xf3/0x100
[ 1088.987549]  [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
[ 1088.991011]  [<ffffffff81a093e0>] ? find_pers+0x70/0x70
[ 1088.994444]  [<ffffffff8109b135>] ? kthread+0xc5/0xe0
[ 1088.997815]  [<ffffffff8102c815>] ? __switch_to+0x355/0x7a0
[ 1089.001181]  [<ffffffff8109b070>] ? kthread_park+0x60/0x60
[ 1089.004534]  [<ffffffff81ccbbc5>] ? ret_from_fork+0x25/0x30

(Another log here : http://pastebin.com/maxGFc1z)

Xen versions affected (at least): xen-4.6, xen-4.7, xen-4.8
xen-tools same version

Userland is a gentoo linux box.

Kernel .config : http://pastebin.com/p0EcHjbu

All built with: gcc (Gentoo 4.9.3 p1.5, pie-0.6.4) 4.9.3

-> scripts/ver_linux
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.

Linux Node_1 4.9.0-gentoo #2 SMP Fri Dec 23 16:37:48 CET 2016 x86_64 Intel(R) Xeon(R) CPU D-1518 @ 2.20GHz GenuineIntel GNU/Linux

GNU C                   4.9.3
GNU Make                4.1
Binutils                2.25.1
Util-linux              2.26.2
Mount                   2.26.2
Module-init-tools       22
E2fsprogs               1.43.3
Linux C Library         2.22
Dynamic linker (ldd)    2.22
Linux C++ Library       6.0.20
Procps                  3.3.12
Net-tools               1.60
Kbd                     2.0.3
Console-tools           2.0.3
Sh-utils                8.25
Udev                    220
Modules Loaded          ablk_helper aesni_intel aes_x86_64 coretemp crc32c_intel mei mei_me mpt3sas x86_pkg_temp_thermal

-> System is a SuperMicro Motherboard X10SDV-4C-7TP4F with 8GB of DDR 4 ECC Registered memory


Any help would be greatly appreciated!

Thanks,


* Re: Recovering a RAID6 after all disks were disconnected
From: Giuseppe Bilotta @ 2016-12-23 16:17 UTC
  To: NeilBrown; +Cc: John Stoffel, linux-raid
In-Reply-To: <87k2arpmvt.fsf@notabene.neil.brown.name>

On Fri, Dec 23, 2016 at 12:25 AM, NeilBrown <neilb@suse.com> wrote:
> On Fri, Dec 23 2016, Giuseppe Bilotta wrote:
>> I also wrote a small script to test all combinations (nothing smart,
>> really, simply enumeration of combos, but I'll consider putting it up
>> on the wiki as well), and I was actually surprised by the results. To
>> test if the RAID was being re-created correctly with each combination,
>> I used `file -s` on the RAID, and verified that the results made
>> sense. I am surprised to find out that there are multiple combinations
>> that make sense (note that the disk names are shifted by one compared
>> to previous emails due to a machine lockup that required a reboot,
>> after which another disk butted in and changed the order):
>>
>> trying /dev/sdd /dev/sdf /dev/sde /dev/sdg
>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>> (needs journal recovery) (extents) (large files) (huge files)
>>
>> trying /dev/sdd /dev/sdf /dev/sdg /dev/sde
>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>> (needs journal recovery) (extents) (large files) (huge files)
>>
>> trying /dev/sde /dev/sdf /dev/sdd /dev/sdg
>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>> (needs journal recovery) (extents) (large files) (huge files)
>>
>> trying /dev/sde /dev/sdf /dev/sdg /dev/sdd
>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>> (needs journal recovery) (extents) (large files) (huge files)
>>
>> trying /dev/sdg /dev/sdf /dev/sde /dev/sdd
>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>> (needs journal recovery) (extents) (large files) (huge files)
>>
>> trying /dev/sdg /dev/sdf /dev/sdd /dev/sde
>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>> (needs journal recovery) (extents) (large files) (huge files)
>>
>> So there are six out of 24 combinations that make sense, at least for
>> the first block. I know from the pre-fail dmesg that the g-f-e-d order
>> should be the correct one, but now I'm left wondering if there is a
>> better way to verify this (other than manually sampling files to see
>> if they make sense), or if the left-symmetric layout on a RAID6 simply
>> allows some of the disk positions to be swapped without loss of data.

> Your script has reported all arrangements with /dev/sdf as the second
> device.  Presumably that is where the single block you are reading
> resides.

That makes sense.

> To check if a RAID6 arrangement is credible, you can try the raid6check
> program that is included in the mdadm source release.  There is a man
> page.
> If the order of devices is not correct raid6check will tell you about
> it.

That's a wonderful small utility, thanks for making it known to me!
Checking even just a small number of stripes was enough in this case,
as the expected combination (g f e d) was the only one that produced
no errors.
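
The enumeration described above can be sketched in Python as follows.
This is not the script used in this thread, only a minimal dry-run
illustration: it builds every ordering of the four member devices and
prints the corresponding "mdadm --create ... --assume-clean" command
line without running anything. The function names are hypothetical, and
a real re-creation would also have to pass the original metadata
version, chunk size, layout and data offset, which are deliberately left
out here.

```python
from itertools import permutations


def candidate_orders(devices):
    """All possible orderings of the member devices."""
    return list(permutations(devices))


def create_command(md_device, order, level=6):
    """Build (but do not run) an mdadm re-create command for one ordering."""
    return ["mdadm", "--create", md_device, "--assume-clean",
            "--level=%d" % level, "--raid-devices=%d" % len(order)] + list(order)


if __name__ == "__main__":
    members = ["/dev/sdd", "/dev/sde", "/dev/sdf", "/dev/sdg"]
    for order in candidate_orders(members):
        # Dry run: print the command instead of executing it.
        print(" ".join(create_command("/dev/md111", order)))
```

Four devices give 24 orderings, and exactly six of them keep /dev/sdf in
the second position, which matches the six arrangements for which
`file -s` returned a plausible result.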

Now I wonder if it would be possible to combine this approach with
something that simply hacked the metadata of each disk to re-establish
the correct disk order to make it possible to reassemble this
particular array without recreating anything. Are problems such as
mine common enough to warrant making this kind of verified
reassembly from assumed-clean disks easier?

-- 
Giuseppe "Oblomov" Bilotta


* Re: SAS disk from RAID card (no RAID mode) problems
From: Jes Sorensen @ 2016-12-23 14:21 UTC
  To: IW News; +Cc: linux-raid
In-Reply-To: <2326bdf0-2948-5cbf-3033-27ed41803e23@imagedworld.com>

IW News <news@imagedworld.com> writes:
> Hello,
>
> First message here.
>
> After looking for a solution without any luck I have found this
> list. I hope someone can help me with this.
>
> I have an ASUS P6T Deluxe with a MARVELL 88SE63xx SAS RAID controller.
> There are two identical 400GB SAS SSD drives attached to it. One of
> them has a Windows 10 installation, the other one Linux.
> Grub is installed on the second disk.
>
> Windows works as expected, but I have problems with the Linux
> installation: the desktop environment freezes for a few seconds once in
> a while. This occurs with Mint Cinnamon, OpenSuSe KDE, Ubuntu and
> Manjaro KDE. All of them are current installations. I'm now working in
> an up-to-date Manjaro KDE (kernel 4.9.0).
> When the temporary freezes occur, the mouse pointer moves and some
> windows are updated correctly, but others are not and the DE stops working.
> When this happens, I always get a system log like this:
>
> ______________________________________________________
> 23/12/16 8:29    kernel    sas: Enter sas_scsi_recover_host busy: 6 failed: 6
> 23/12/16 8:29    kernel    sas: trying to find task 0xffff8bda8d2bd900
> 23/12/16 8:29    kernel    sas: sas_scsi_find_task: aborting task 0xffff8bda8d2bd900
> 23/12/16 8:29    kernel    sas: task done but aborted
> 23/12/16 8:29    kernel    sas: sas_scsi_find_task: task 0xffff8bda8d2bd900 is done
> 23/12/16 8:29    kernel    sas: sas_eh_handle_sas_errors: task 0xffff8bda8d2bd900 is done
[snip]
> Sometimes shorter, sometimes longer.
> Could this be a controller/drive/cable problem?
> Any thoughts?

Doesn't look like a Linux RAID issue, but much more like a driver
problem. Possibly caused by cables or controller issues.

You probably want to reach out to the linux-scsi list.

Jes


* Raid5 performance issue
From: Marc Roos @ 2016-12-23 13:43 UTC
  To: linux-raid


I have grown a raid5 over the years with drives and resized partitions, 
now I have upgraded to centos7 (from centos5). And I have the impression 
the speed is not what it used to be. 

Can this be because of some misalignment? How can this be verified?


If I monitor the individual disks with dstat it reads the raid drives at 
very low speeds

dd if=/dev/md21 of=/dev/null bs=1M count=1500 iflag=direct
1500+0 records in
1500+0 records out
1572864000 bytes (1.6 GB) copied, 19.5879 s, 80.3 MB/s


   0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0
   0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0
   0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0
 256k    0 : 320k    0 : 320k    0 : 192k    0 : 256k    0 : 320k    0 : 256k    0
4672k    0 :4672k    0 :4672k    0 :4800k    0 :4672k    0 :4672k    0 :4736k    0
  11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0
  10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0
  10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0
  13M    0 :  13M    0 :  13M    0 :  13M    0 :  13M    0 :  13M    0 :  13M    0
  10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0
  11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0
  19M    0 :  19M    0 :  19M    0 :  19M    0 :  19M    0 :  19M    0 :  19M    0
9984k    0 :9792k    0 :9792k    0 :9792k    0 :9984k    0 :9984k    0 :9856k    0
  13M    0 :  13M    0 :  13M    0 :  13M    0 :  13M    0 :  13M    0 :  13M    0
  11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0
  12M    0 :  12M    0 :  12M    0 :  12M    0 :  12M    0 :  12M    0 :  12M    0
7872k    0 :7744k    0 :7808k    0 :7744k    0 :7936k    0 :7744k    0 :7744k    0
  11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0 :  11M    0
  19M    0 :  19M    0 :  19M    0 :  19M    0 :  19M    0 :  19M    0 :  19M    0
7488k    0 :7360k    0 :7296k    0 :7360k    0 :7296k    0 :7296k    0 :7296k    0
  10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0 :  10M    0
  14M    0 :  14M    0 :  14M    0 :  14M    0 :  14M    0 :  14M    0 :  14M    0
9472k    0 :9536k    0 :9536k    0 :9536k    0 :9472k    0 :9536k    0 :9472k    0
   0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0

When I test the individual disks with

for disk in sdm sdl sdi sde sdk sdf sdd; do dd if=/dev/$disk of=/dev/null bs=1M count=1500 iflag=direct & done

[root@san2 ~]# 1500+0 records in
1500+0 records out
1572864000 bytes (1.6 GB) copied, 8.96022 s, 176 MB/s
1500+0 records in
1500+0 records out
1572864000 bytes (1.6 GB) copied, 9.59289 s, 164 MB/s
1500+0 records in
1500+0 records out
1572864000 bytes (1.6 GB) copied, 10.0863 s, 156 MB/s
1500+0 records in
1500+0 records out
1572864000 bytes (1.6 GB) copied, 10.5833 s, 149 MB/s
1500+0 records in
1500+0 records out
1572864000 bytes (1.6 GB) copied, 10.6084 s, 148 MB/s
1500+0 records in
1500+0 records out
1572864000 bytes (1.6 GB) copied, 11.0205 s, 143 MB/s
1500+0 records in
1500+0 records out
1572864000 bytes (1.6 GB) copied, 11.3199 s, 139 MB/s



   0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0
   0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0
4096k    0 :5120k    0 :  32M    0 : 512k    0 :  29M    0 :  35M    0 :5120k    0
  62M    0 :  51M    0 : 157M    0 : 145M    0 : 144M    0 : 153M    0 :  38M    0
 153M    0 : 148M    0 : 158M    0 : 174M    0 : 135M    0 : 151M    0 : 150M    0
 152M    0 : 144M    0 : 154M    0 : 179M    0 : 150M    0 : 146M    0 : 149M    0
 149M    0 : 147M    0 : 155M    0 : 186M    0 : 148M    0 : 155M    0 : 157M    0
 156M    0 : 128M    0 : 154M    0 : 188M    0 : 136M    0 : 153M    0 : 155M    0
 159M    0 : 136M    0 : 157M    0 : 206M    0 : 147M    0 : 155M    0 : 151M    0
 153M    0 : 147M    0 : 162M    0 : 153M    0 : 144M    0 : 127M    0 : 147M    0
 153M    0 : 138M    0 : 159M    0 : 153M    0 : 134M    0 : 145M    0 : 146M    0
 147M    0 : 144M    0 : 154M    0 : 116M    0 : 144M    0 : 153M    0 : 143M    0
 154M    0 : 150M    0 :  60M    0 :   0     0 : 141M    0 : 131M    0 : 153M    0
  61M    0 : 147M    0 :   0     0 :   0     0 :  51M    0 :   0     0 : 109M    0
   0     0 :  17M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -. 
F1 Outsourcing Development Sp. z o.o.
Poland 

t:  +48 (0)124466845
f:  +48 (0)124466843
e:  marc@f1-outsourcing.eu




* Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
From: Lars Ellenberg @ 2016-12-23 11:45 UTC (permalink / raw)
  To: Michael Wang
  Cc: Jens Axboe, linux-block, Martin K. Petersen, Mike Snitzer,
	Peter Zijlstra, Jiri Kosina, Ming Lei, Kirill A. Shutemov,
	NeilBrown, linux-kernel, linux-raid, Takashi Iwai, linux-bcache,
	Zheng Liu, Kent Overstreet, Keith Busch, dm-devel, Shaohua Li,
	Ingo Molnar, Alasdair Kergon, Roland Kammerer
In-Reply-To: <76d9bf14-d848-4405-8358-3771c0a93d39@profitbricks.com>

On Fri, Dec 23, 2016 at 09:49:53AM +0100, Michael Wang wrote:
> Dear Maintainers
> 
> I'd like to ask about the status of this patch, since we hit the
> issue too during our testing on md raid1.
> 
> The split remainder bio_A was queued ahead, followed by bio_B for the
> lower device. At this moment raid starts freezing; the loop takes
> out bio_A first and delivers it, which hangs since raid is
> freezing, while the freeze never ends since it is waiting for
> bio_B to finish, and bio_B is still on the queue, waiting for
> bio_A to finish...
> 
> We're looking for a good solution and found that this patch
> had already progressed a lot, but we can't find it in linux-next,
> so we'd like to ask: is it still planned to get this fix
> upstream?

I don't see why not, I'd even like to have it in older kernels,
but did not have the time and energy to push it.

Thanks for the bump.

	Lars

On 07/11/2016 04:10 PM, Lars Ellenberg wrote:
> For a long time, generic_make_request() has converted recursion into
> iteration by queuing recursive arguments on current->bio_list.
> 
> This is convenient for stacking drivers:
> the top-most driver takes the originally submitted bio,
> and re-submits a re-mapped version of it, or one or more clones,
> or one or more newly allocated bios, to its backend(s). These
> are then simply processed in turn, and each can again queue
> more "backend-bios" until we reach the bottom of the driver stack,
> and actually dispatch to the real backend device.
> 
> Any stacking driver ->make_request_fn() could expect that,
> once it returns, any backend-bios it submitted via recursive calls
> to generic_make_request() would now be processed and dispatched, before
> the current task would call into this driver again.
> 
> This was changed by commit
>   54efd50 block: make generic_make_request handle arbitrarily sized bios
> 
> Drivers may call blk_queue_split() inside their ->make_request_fn(),
> which may split the current bio into a front-part to be dealt with
> immediately, and a remainder-part, which may need to be split even
> further. That remainder-part will simply also be pushed to
> current->bio_list, and would end up being head-of-queue, in front
> of any backend-bios the current make_request_fn() might submit during
> processing of the front-part.
> 
> Which means the current task would immediately end up back in the same
> make_request_fn() of the same driver again, before any of its backend
> bios have even been processed.
> 
> This can lead to resource starvation deadlock.
> Drivers could avoid this by learning to not need blk_queue_split(),
> or by submitting their backend bios in a different context (dedicated
> kernel thread, work_queue context, ...). Or by playing funny re-ordering
> games with entries on current->bio_list.
> 
> Instead, I suggest distinguishing between recursive calls to
> generic_make_request(), and pushing back the remainder part in
> blk_queue_split(), by pointing current->bio_lists to a
> 	struct recursion_to_iteration_bio_lists {
> 		struct bio_list recursion;
> 		struct bio_list queue;
> 	}
> 
> By providing each q->make_request_fn() with an empty "recursion"
> bio_list, then merging any recursively submitted bios to the
> head of the "queue" list, we can make the recursion-to-iteration
> logic in generic_make_request() process deepest level bios first,
> and "sibling" bios of the same level in "natural" order.
> 
> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
> ---
>  block/bio.c               | 20 +++++++++++--------
>  block/blk-core.c          | 49 +++++++++++++++++++++++++----------------------
>  block/blk-merge.c         |  5 ++++-
>  drivers/md/bcache/btree.c | 12 ++++++------
>  drivers/md/dm-bufio.c     |  2 +-
>  drivers/md/raid1.c        |  5 ++---
>  drivers/md/raid10.c       |  5 ++---
>  include/linux/bio.h       | 25 ++++++++++++++++++++++++
>  include/linux/sched.h     |  4 ++--
>  9 files changed, 80 insertions(+), 47 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 848cd35..c2606fd 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -366,12 +366,16 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>  	 */
>  
>  	bio_list_init(&punt);
> -	bio_list_init(&nopunt);
>  
> -	while ((bio = bio_list_pop(current->bio_list)))
> +	bio_list_init(&nopunt);
> +	while ((bio = bio_list_pop(&current->bio_lists->recursion)))
>  		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> +	current->bio_lists->recursion = nopunt;
>  
> -	*current->bio_list = nopunt;
> +	bio_list_init(&nopunt);
> +	while ((bio = bio_list_pop(&current->bio_lists->queue)))
> +		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> +	current->bio_lists->queue = nopunt;
>  
>  	spin_lock(&bs->rescue_lock);
>  	bio_list_merge(&bs->rescue_list, &punt);
> @@ -453,13 +457,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  		 *
>  		 * We solve this, and guarantee forward progress, with a rescuer
>  		 * workqueue per bio_set. If we go to allocate and there are
> -		 * bios on current->bio_list, we first try the allocation
> -		 * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
> -		 * bios we would be blocking to the rescuer workqueue before
> -		 * we retry with the original gfp_flags.
> +		 * bios on current->bio_lists->{recursion,queue}, we first try the
> +		 * allocation without __GFP_DIRECT_RECLAIM; if that fails, we
> +		 * punt those bios we would be blocking to the rescuer
> +		 * workqueue before we retry with the original gfp_flags.
>  		 */
>  
> -		if (current->bio_list && !bio_list_empty(current->bio_list))
> +		if (current_has_pending_bios())
>  			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
>  
>  		p = mempool_alloc(bs->bio_pool, gfp_mask);
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3cfd67d..2886a59b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2040,7 +2040,7 @@ end_io:
>   */
>  blk_qc_t generic_make_request(struct bio *bio)
>  {
> -	struct bio_list bio_list_on_stack;
> +	struct recursion_to_iteration_bio_lists bio_lists_on_stack;
>  	blk_qc_t ret = BLK_QC_T_NONE;
>  
>  	if (!generic_make_request_checks(bio))
> @@ -2049,15 +2049,20 @@ blk_qc_t generic_make_request(struct bio *bio)
>  	/*
>  	 * We only want one ->make_request_fn to be active at a time, else
>  	 * stack usage with stacked devices could be a problem.  So use
> -	 * current->bio_list to keep a list of requests submited by a
> -	 * make_request_fn function.  current->bio_list is also used as a
> +	 * current->bio_lists to keep a list of requests submited by a
> +	 * make_request_fn function.  current->bio_lists is also used as a
>  	 * flag to say if generic_make_request is currently active in this
>  	 * task or not.  If it is NULL, then no make_request is active.  If
>  	 * it is non-NULL, then a make_request is active, and new requests
> -	 * should be added at the tail
> +	 * should be added at the tail of current->bio_lists->recursion;
> +	 * bios resulting from a call to blk_queue_split() from
> +	 * within ->make_request_fn() should be pushed back to the head of
> +	 * current->bio_lists->queue.
> +	 * After the current ->make_request_fn() returns, .recursion will be
> +	 * merged back to the head of .queue.
>  	 */
> -	if (current->bio_list) {
> -		bio_list_add(current->bio_list, bio);
> +	if (current->bio_lists) {
> +		bio_list_add(&current->bio_lists->recursion, bio);
>  		goto out;
>  	}
>  
> @@ -2066,35 +2071,33 @@ blk_qc_t generic_make_request(struct bio *bio)
>  	 * Before entering the loop, bio->bi_next is NULL (as all callers
>  	 * ensure that) so we have a list with a single bio.
>  	 * We pretend that we have just taken it off a longer list, so
> -	 * we assign bio_list to a pointer to the bio_list_on_stack,
> -	 * thus initialising the bio_list of new bios to be
> -	 * added.  ->make_request() may indeed add some more bios
> -	 * through a recursive call to generic_make_request.  If it
> -	 * did, we find a non-NULL value in bio_list and re-enter the loop
> -	 * from the top.  In this case we really did just take the bio
> -	 * of the top of the list (no pretending) and so remove it from
> -	 * bio_list, and call into ->make_request() again.
> +	 * we assign bio_list to a pointer to the bio_lists_on_stack,
> +	 * thus initialising the bio_lists of new bios to be added.
> +	 * ->make_request() may indeed add some more bios to .recursion
> +	 * through a recursive call to generic_make_request.  If it did,
> +	 * we find a non-NULL value in .recursion, merge .recursion back to the
> +	 * head of .queue, and re-enter the loop from the top.  In this case we
> +	 * really did just take the bio of the top of the list (no pretending)
> +	 * and so remove it from .queue, and call into ->make_request() again.
>  	 */
>  	BUG_ON(bio->bi_next);
> -	bio_list_init(&bio_list_on_stack);
> -	current->bio_list = &bio_list_on_stack;
> +	bio_list_init(&bio_lists_on_stack.queue);
> +	current->bio_lists = &bio_lists_on_stack;
>  	do {
>  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>  
>  		if (likely(blk_queue_enter(q, false) == 0)) {
> +			bio_list_init(&bio_lists_on_stack.recursion);
>  			ret = q->make_request_fn(q, bio);
> -
>  			blk_queue_exit(q);
> -
> -			bio = bio_list_pop(current->bio_list);
> +			bio_list_merge_head(&bio_lists_on_stack.queue,
> +					&bio_lists_on_stack.recursion);
>  		} else {
> -			struct bio *bio_next = bio_list_pop(current->bio_list);
> -
>  			bio_io_error(bio);
> -			bio = bio_next;
>  		}
> +		bio = bio_list_pop(&current->bio_lists->queue);
>  	} while (bio);
> -	current->bio_list = NULL; /* deactivate */
> +	current->bio_lists = NULL; /* deactivate */
>  
>  out:
>  	return ret;
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index c265348..df96327 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -172,6 +172,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
>  	struct bio *split, *res;
>  	unsigned nsegs;
>  
> +	BUG_ON(!current->bio_lists);
>  	if (bio_op(*bio) == REQ_OP_DISCARD)
>  		split = blk_bio_discard_split(q, *bio, bs, &nsegs);
>  	else if (bio_op(*bio) == REQ_OP_WRITE_SAME)
> @@ -190,7 +191,9 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
>  
>  		bio_chain(split, *bio);
>  		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
> -		generic_make_request(*bio);
> +		/* push back remainder, it may later be split further */
> +		bio_list_add_head(&current->bio_lists->queue, *bio);
> +		/* and fake submission of a suitably sized piece */
>  		*bio = split;
>  	}
>  }
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 76f7534..731ec3b 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -450,7 +450,7 @@ void __bch_btree_node_write(struct btree *b, struct closure *parent)
>  
>  	trace_bcache_btree_write(b);
>  
> -	BUG_ON(current->bio_list);
> +	BUG_ON(current->bio_lists);
>  	BUG_ON(b->written >= btree_blocks(b));
>  	BUG_ON(b->written && !i->keys);
>  	BUG_ON(btree_bset_first(b)->seq != i->seq);
> @@ -544,7 +544,7 @@ static void bch_btree_leaf_dirty(struct btree *b, atomic_t *journal_ref)
>  
>  	/* Force write if set is too big */
>  	if (set_bytes(i) > PAGE_SIZE - 48 &&
> -	    !current->bio_list)
> +	    !current->bio_lists)
>  		bch_btree_node_write(b, NULL);
>  }
>  
> @@ -889,7 +889,7 @@ static struct btree *mca_alloc(struct cache_set *c, struct btree_op *op,
>  {
>  	struct btree *b;
>  
> -	BUG_ON(current->bio_list);
> +	BUG_ON(current->bio_lists);
>  
>  	lockdep_assert_held(&c->bucket_lock);
>  
> @@ -976,7 +976,7 @@ retry:
>  	b = mca_find(c, k);
>  
>  	if (!b) {
> -		if (current->bio_list)
> +		if (current->bio_lists)
>  			return ERR_PTR(-EAGAIN);
>  
>  		mutex_lock(&c->bucket_lock);
> @@ -2127,7 +2127,7 @@ static int bch_btree_insert_node(struct btree *b, struct btree_op *op,
>  
>  	return 0;
>  split:
> -	if (current->bio_list) {
> +	if (current->bio_lists) {
>  		op->lock = b->c->root->level + 1;
>  		return -EAGAIN;
>  	} else if (op->lock <= b->c->root->level) {
> @@ -2209,7 +2209,7 @@ int bch_btree_insert(struct cache_set *c, struct keylist *keys,
>  	struct btree_insert_op op;
>  	int ret = 0;
>  
> -	BUG_ON(current->bio_list);
> +	BUG_ON(current->bio_lists);
>  	BUG_ON(bch_keylist_empty(keys));
>  
>  	bch_btree_op_init(&op.op, 0);
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index 6571c81..ba0c325 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -174,7 +174,7 @@ static inline int dm_bufio_cache_index(struct dm_bufio_client *c)
>  #define DM_BUFIO_CACHE(c)	(dm_bufio_caches[dm_bufio_cache_index(c)])
>  #define DM_BUFIO_CACHE_NAME(c)	(dm_bufio_cache_names[dm_bufio_cache_index(c)])
>  
> -#define dm_bufio_in_request()	(!!current->bio_list)
> +#define dm_bufio_in_request()	(!!current->bio_lists)
>  
>  static void dm_bufio_lock(struct dm_bufio_client *c)
>  {
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 10e53cd..38790e3 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -876,8 +876,7 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
>  				    (!conf->barrier ||
>  				     ((conf->start_next_window <
>  				       conf->next_resync + RESYNC_SECTORS) &&
> -				      current->bio_list &&
> -				      !bio_list_empty(current->bio_list))),
> +				      current_has_pending_bios())),
>  				    conf->resync_lock);
>  		conf->nr_waiting--;
>  	}
> @@ -1014,7 +1013,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
>  	struct r1conf *conf = mddev->private;
>  	struct bio *bio;
>  
> -	if (from_schedule || current->bio_list) {
> +	if (from_schedule || current->bio_lists) {
>  		spin_lock_irq(&conf->device_lock);
>  		bio_list_merge(&conf->pending_bio_list, &plug->pending);
>  		conf->pending_count += plug->pending_cnt;
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 245640b..13a5341 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -945,8 +945,7 @@ static void wait_barrier(struct r10conf *conf)
>  		wait_event_lock_irq(conf->wait_barrier,
>  				    !conf->barrier ||
>  				    (conf->nr_pending &&
> -				     current->bio_list &&
> -				     !bio_list_empty(current->bio_list)),
> +				     current_has_pending_bios()),
>  				    conf->resync_lock);
>  		conf->nr_waiting--;
>  	}
> @@ -1022,7 +1021,7 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
>  	struct r10conf *conf = mddev->private;
>  	struct bio *bio;
>  
> -	if (from_schedule || current->bio_list) {
> +	if (from_schedule || current->bio_lists) {
>  		spin_lock_irq(&conf->device_lock);
>  		bio_list_merge(&conf->pending_bio_list, &plug->pending);
>  		conf->pending_count += plug->pending_cnt;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index b7e1a008..2f8a361 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -541,6 +541,24 @@ struct bio_list {
>  	struct bio *tail;
>  };
>  
> +/* for generic_make_request() */
> +struct recursion_to_iteration_bio_lists {
> +	/* For stacking drivers submitting to their respective backend,
> +	 * bios are added to the tail of .recursion, which is re-initialized
> +	 * before each call to ->make_request_fn() and after that returns,
> +	 * the whole .recursion queue is then merged back to the head of .queue.
> +	 *
> +	 * The recursion-to-iteration logic in generic_make_request() will
> +	 * peel off of .queue.head, processing bios in deepest-level-first
> +	 * "natural" order. */
> +	struct bio_list recursion;
> +
> +	/* This keeps a list of to-be-processed bios.
> +	 * The "remainder" part resulting from calling blk_queue_split()
> +	 * will be pushed back to its head. */
> +	struct bio_list queue;
> +};
> +
>  static inline int bio_list_empty(const struct bio_list *bl)
>  {
>  	return bl->head == NULL;
> @@ -551,6 +569,13 @@ static inline void bio_list_init(struct bio_list *bl)
>  	bl->head = bl->tail = NULL;
>  }
>  
> +static inline bool current_has_pending_bios(void)
> +{
> +	return current->bio_lists &&
> +		(!bio_list_empty(&current->bio_lists->queue) ||
> +		 !bio_list_empty(&current->bio_lists->recursion));
> +}
> +
>  #define BIO_EMPTY_LIST	{ NULL, NULL }
>  
>  #define bio_list_for_each(bio, bl) \
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 6e42ada..146eedc 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -128,7 +128,7 @@ struct sched_attr {
>  
>  struct futex_pi_state;
>  struct robust_list_head;
> -struct bio_list;
> +struct recursion_to_iteration_bio_lists;
>  struct fs_struct;
>  struct perf_event_context;
>  struct blk_plug;
> @@ -1727,7 +1727,7 @@ struct task_struct {
>  	void *journal_info;
>  
>  /* stacked block device info */
> -	struct bio_list *bio_list;
> +	struct recursion_to_iteration_bio_lists *bio_lists;
>  
>  #ifdef CONFIG_BLOCK
>  /* stack plugging */
> 
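
For anyone following along, the ordering that the quoted commit message
describes can be tried out in a small userspace model. This is only a
sketch in Python, not kernel code: the "md" and "disk" device names, the
MAX_SECTORS limit and the function name are made up for illustration, and
a bio is reduced to a (device, start, end) tuple. The two deques stand in
for the .queue and .recursion lists of the patch.

```python
from collections import deque

MAX_SECTORS = 4  # hypothetical per-bio size limit of the stacking driver


def two_list_order(start, end):
    """Model the patched loop: a .queue list plus a per-call .recursion list.

    Returns the bios in the order they reach a make_request_fn, as
    (device, start, end) tuples. "md" is a stacking driver, "disk" is
    its backend device.
    """
    queue = deque()              # stands in for bio_lists->queue
    processed = []
    bio = ("md", start, end)
    while bio is not None:
        recursion = deque()      # fresh, empty list for every call
        dev, s, e = bio
        processed.append(bio)
        if dev == "md":
            if e - s > MAX_SECTORS:
                # blk_queue_split(): push the remainder back to the
                # head of .queue instead of resubmitting it.
                queue.appendleft(("md", s + MAX_SECTORS, e))
                e = s + MAX_SECTORS
            # Backend bio for the front part goes to the tail of .recursion.
            recursion.append(("disk", s, e))
        # Like bio_list_merge_head(): .recursion is placed in front of
        # .queue, keeping the order of the submitted bios.
        queue.extendleft(reversed(recursion))
        bio = queue.popleft() if queue else None
    return processed


if __name__ == "__main__":
    for entry in two_list_order(0, 8):
        print(entry)
```

With an 8-sector bio and a 4-sector limit, the model dispatches the
backend bio for the front part before the stacking driver sees the
remainder again, which is the deepest-level-first order the patch aims
for.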


* Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
From: Michael Wang @ 2016-12-23  8:49 UTC (permalink / raw)
  To: Lars Ellenberg, Jens Axboe
  Cc: NeilBrown, linux-raid, Martin K. Petersen, Mike Snitzer,
	Peter Zijlstra, Jiri Kosina, Ming Lei, linux-kernel, Zheng Liu,
	linux-block, Takashi Iwai, linux-bcache, Ingo Molnar,
	Alasdair Kergon, Keith Busch, dm-devel, Shaohua Li,
	Kent Overstreet, Kirill A. Shutemov, Roland Kammerer
In-Reply-To: <20160711141042.GY13335@soda.linbit>

Dear Maintainers

I'd like to ask about the status of this patch, since we hit the
issue too during our testing on md raid1.

The split remainder bio_A was queued ahead, followed by bio_B for the
lower device. At this moment raid starts freezing; the loop takes
out bio_A first and delivers it, which hangs since raid is
freezing, while the freeze never ends since it is waiting for
bio_B to finish, and bio_B is still on the queue, waiting for
bio_A to finish...
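
To illustrate the ordering, here is a rough userspace sketch in Python
(not kernel code) of the single-list loop. The "md" and "disk" device
names, the MAX_SECTORS limit and the function name are hypothetical, and
a bio is reduced to a (device, start, end) tuple. It only shows the
queueing order; the blocking in the raid barrier is not modelled.

```python
from collections import deque

MAX_SECTORS = 4  # hypothetical per-bio size limit of the stacking driver


def single_list_order(start, end):
    """Model the pre-patch loop: one FIFO list per task.

    Returns the bios in the order they reach a make_request_fn, as
    (device, start, end) tuples. "md" is a stacking driver, "disk" is
    its backend device.
    """
    bio_list = deque()           # stands in for current->bio_list
    processed = []
    bio = ("md", start, end)
    while bio is not None:
        dev, s, e = bio
        processed.append(bio)
        if dev == "md":
            if e - s > MAX_SECTORS:
                # blk_queue_split() resubmits the remainder, which is
                # appended to the list before any backend bio.
                bio_list.append(("md", s + MAX_SECTORS, e))
                e = s + MAX_SECTORS
            # The backend bio for the front part is queued behind it.
            bio_list.append(("disk", s, e))
        bio = bio_list.popleft() if bio_list else None
    return processed


if __name__ == "__main__":
    for entry in single_list_order(0, 8):
        print(entry)
```

For an 8-sector bio and a 4-sector limit, the remainder (bio_A above)
reaches the stacking driver again before the backend bio for the front
part (bio_B) has been dispatched.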

We're looking for a good solution and found that this patch
had already progressed a lot, but we can't find it in linux-next,
so we'd like to ask: is it still planned to get this fix
upstream?

Regards,
Michael Wang


On 07/11/2016 04:10 PM, Lars Ellenberg wrote:
> For a long time, generic_make_request() has converted recursion into
> iteration by queuing recursive arguments on current->bio_list.
> 
> This is convenient for stacking drivers:
> the top-most driver takes the originally submitted bio,
> and re-submits a re-mapped version of it, or one or more clones,
> or one or more newly allocated bios, to its backend(s). These
> are then simply processed in turn, and each can again queue
> more "backend-bios" until we reach the bottom of the driver stack,
> and actually dispatch to the real backend device.
> 
> Any stacking driver ->make_request_fn() could expect that,
> once it returns, any backend-bios it submitted via recursive calls
> to generic_make_request() would now be processed and dispatched, before
> the current task would call into this driver again.
> 
> This was changed by commit
>   54efd50 block: make generic_make_request handle arbitrarily sized bios
> 
> Drivers may call blk_queue_split() inside their ->make_request_fn(),
> which may split the current bio into a front-part to be dealt with
> immediately, and a remainder-part, which may need to be split even
> further. That remainder-part will simply also be pushed to
> current->bio_list, and would end up being head-of-queue, in front
> of any backend-bios the current make_request_fn() might submit during
> processing of the front-part.
> 
> Which means the current task would immediately end up back in the same
> make_request_fn() of the same driver again, before any of its backend
> bios have even been processed.
> 
> This can lead to resource starvation deadlock.
> Drivers could avoid this by learning to not need blk_queue_split(),
> or by submitting their backend bios in a different context (dedicated
> kernel thread, work_queue context, ...). Or by playing funny re-ordering
> games with entries on current->bio_list.
> 
> Instead, I suggest distinguishing between recursive calls to
> generic_make_request(), and pushing back the remainder part in
> blk_queue_split(), by pointing current->bio_lists to a
> 	struct recursion_to_iteration_bio_lists {
> 		struct bio_list recursion;
> 		struct bio_list queue;
> 	}
> 
> By providing each q->make_request_fn() with an empty "recursion"
> bio_list, then merging any recursively submitted bios to the
> head of the "queue" list, we can make the recursion-to-iteration
> logic in generic_make_request() process deepest level bios first,
> and "sibling" bios of the same level in "natural" order.
> 
> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
> ---
>  block/bio.c               | 20 +++++++++++--------
>  block/blk-core.c          | 49 +++++++++++++++++++++++++----------------------
>  block/blk-merge.c         |  5 ++++-
>  drivers/md/bcache/btree.c | 12 ++++++------
>  drivers/md/dm-bufio.c     |  2 +-
>  drivers/md/raid1.c        |  5 ++---
>  drivers/md/raid10.c       |  5 ++---
>  include/linux/bio.h       | 25 ++++++++++++++++++++++++
>  include/linux/sched.h     |  4 ++--
>  9 files changed, 80 insertions(+), 47 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 848cd35..c2606fd 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -366,12 +366,16 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>  	 */
>  
>  	bio_list_init(&punt);
> -	bio_list_init(&nopunt);
>  
> -	while ((bio = bio_list_pop(current->bio_list)))
> +	bio_list_init(&nopunt);
> +	while ((bio = bio_list_pop(&current->bio_lists->recursion)))
>  		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> +	current->bio_lists->recursion = nopunt;
>  
> -	*current->bio_list = nopunt;
> +	bio_list_init(&nopunt);
> +	while ((bio = bio_list_pop(&current->bio_lists->queue)))
> +		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> +	current->bio_lists->queue = nopunt;
>  
>  	spin_lock(&bs->rescue_lock);
>  	bio_list_merge(&bs->rescue_list, &punt);
> @@ -453,13 +457,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  		 *
>  		 * We solve this, and guarantee forward progress, with a rescuer
>  		 * workqueue per bio_set. If we go to allocate and there are
> -		 * bios on current->bio_list, we first try the allocation
> -		 * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
> -		 * bios we would be blocking to the rescuer workqueue before
> -		 * we retry with the original gfp_flags.
> +		 * bios on current->bio_lists->{recursion,queue}, we first try the
> +		 * allocation without __GFP_DIRECT_RECLAIM; if that fails, we
> +		 * punt those bios we would be blocking to the rescuer
> +		 * workqueue before we retry with the original gfp_flags.
>  		 */
>  
> -		if (current->bio_list && !bio_list_empty(current->bio_list))
> +		if (current_has_pending_bios())
>  			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
>  
>  		p = mempool_alloc(bs->bio_pool, gfp_mask);
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3cfd67d..2886a59b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2040,7 +2040,7 @@ end_io:
>   */
>  blk_qc_t generic_make_request(struct bio *bio)
>  {
> -	struct bio_list bio_list_on_stack;
> +	struct recursion_to_iteration_bio_lists bio_lists_on_stack;
>  	blk_qc_t ret = BLK_QC_T_NONE;
>  
>  	if (!generic_make_request_checks(bio))
> @@ -2049,15 +2049,20 @@ blk_qc_t generic_make_request(struct bio *bio)
>  	/*
>  	 * We only want one ->make_request_fn to be active at a time, else
>  	 * stack usage with stacked devices could be a problem.  So use
> -	 * current->bio_list to keep a list of requests submited by a
> -	 * make_request_fn function.  current->bio_list is also used as a
> +	 * current->bio_lists to keep a list of requests submited by a
> +	 * make_request_fn function.  current->bio_lists is also used as a
>  	 * flag to say if generic_make_request is currently active in this
>  	 * task or not.  If it is NULL, then no make_request is active.  If
>  	 * it is non-NULL, then a make_request is active, and new requests
> -	 * should be added at the tail
> +	 * should be added at the tail of current->bio_lists->recursion;
> +	 * bios resulting from a call to blk_queue_split() from
> +	 * within ->make_request_fn() should be pushed back to the head of
> +	 * current->bio_lists->queue.
> +	 * After the current ->make_request_fn() returns, .recursion will be
> +	 * merged back to the head of .queue.
>  	 */
> -	if (current->bio_list) {
> -		bio_list_add(current->bio_list, bio);
> +	if (current->bio_lists) {
> +		bio_list_add(&current->bio_lists->recursion, bio);
>  		goto out;
>  	}
>  
> @@ -2066,35 +2071,33 @@ blk_qc_t generic_make_request(struct bio *bio)
>  	 * Before entering the loop, bio->bi_next is NULL (as all callers
>  	 * ensure that) so we have a list with a single bio.
>  	 * We pretend that we have just taken it off a longer list, so
> -	 * we assign bio_list to a pointer to the bio_list_on_stack,
> -	 * thus initialising the bio_list of new bios to be
> -	 * added.  ->make_request() may indeed add some more bios
> -	 * through a recursive call to generic_make_request.  If it
> -	 * did, we find a non-NULL value in bio_list and re-enter the loop
> -	 * from the top.  In this case we really did just take the bio
> -	 * of the top of the list (no pretending) and so remove it from
> -	 * bio_list, and call into ->make_request() again.
> +	 * we assign bio_list to a pointer to the bio_lists_on_stack,
> +	 * thus initialising the bio_lists of new bios to be added.
> +	 * ->make_request() may indeed add some more bios to .recursion
> +	 * through a recursive call to generic_make_request.  If it did,
> +	 * we find a non-NULL value in .recursion, merge .recursion back to the
> +	 * head of .queue, and re-enter the loop from the top.  In this case we
> +	 * really did just take the bio of the top of the list (no pretending)
> +	 * and so remove it from .queue, and call into ->make_request() again.
>  	 */
>  	BUG_ON(bio->bi_next);
> -	bio_list_init(&bio_list_on_stack);
> -	current->bio_list = &bio_list_on_stack;
> +	bio_list_init(&bio_lists_on_stack.queue);
> +	current->bio_lists = &bio_lists_on_stack;
>  	do {
>  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>  
>  		if (likely(blk_queue_enter(q, false) == 0)) {
> +			bio_list_init(&bio_lists_on_stack.recursion);
>  			ret = q->make_request_fn(q, bio);
> -
>  			blk_queue_exit(q);
> -
> -			bio = bio_list_pop(current->bio_list);
> +			bio_list_merge_head(&bio_lists_on_stack.queue,
> +					&bio_lists_on_stack.recursion);
>  		} else {
> -			struct bio *bio_next = bio_list_pop(current->bio_list);
> -
>  			bio_io_error(bio);
> -			bio = bio_next;
>  		}
> +		bio = bio_list_pop(&current->bio_lists->queue);
>  	} while (bio);
> -	current->bio_list = NULL; /* deactivate */
> +	current->bio_lists = NULL; /* deactivate */
>  
>  out:
>  	return ret;
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index c265348..df96327 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -172,6 +172,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
>  	struct bio *split, *res;
>  	unsigned nsegs;
>  
> +	BUG_ON(!current->bio_lists);
>  	if (bio_op(*bio) == REQ_OP_DISCARD)
>  		split = blk_bio_discard_split(q, *bio, bs, &nsegs);
>  	else if (bio_op(*bio) == REQ_OP_WRITE_SAME)
> @@ -190,7 +191,9 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
>  
>  		bio_chain(split, *bio);
>  		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
> -		generic_make_request(*bio);
> +		/* push back remainder, it may later be split further */
> +		bio_list_add_head(&current->bio_lists->queue, *bio);
> +		/* and fake submission of a suitably sized piece */
>  		*bio = split;
>  	}
>  }
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 76f7534..731ec3b 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -450,7 +450,7 @@ void __bch_btree_node_write(struct btree *b, struct closure *parent)
>  
>  	trace_bcache_btree_write(b);
>  
> -	BUG_ON(current->bio_list);
> +	BUG_ON(current->bio_lists);
>  	BUG_ON(b->written >= btree_blocks(b));
>  	BUG_ON(b->written && !i->keys);
>  	BUG_ON(btree_bset_first(b)->seq != i->seq);
> @@ -544,7 +544,7 @@ static void bch_btree_leaf_dirty(struct btree *b, atomic_t *journal_ref)
>  
>  	/* Force write if set is too big */
>  	if (set_bytes(i) > PAGE_SIZE - 48 &&
> -	    !current->bio_list)
> +	    !current->bio_lists)
>  		bch_btree_node_write(b, NULL);
>  }
>  
> @@ -889,7 +889,7 @@ static struct btree *mca_alloc(struct cache_set *c, struct btree_op *op,
>  {
>  	struct btree *b;
>  
> -	BUG_ON(current->bio_list);
> +	BUG_ON(current->bio_lists);
>  
>  	lockdep_assert_held(&c->bucket_lock);
>  
> @@ -976,7 +976,7 @@ retry:
>  	b = mca_find(c, k);
>  
>  	if (!b) {
> -		if (current->bio_list)
> +		if (current->bio_lists)
>  			return ERR_PTR(-EAGAIN);
>  
>  		mutex_lock(&c->bucket_lock);
> @@ -2127,7 +2127,7 @@ static int bch_btree_insert_node(struct btree *b, struct btree_op *op,
>  
>  	return 0;
>  split:
> -	if (current->bio_list) {
> +	if (current->bio_lists) {
>  		op->lock = b->c->root->level + 1;
>  		return -EAGAIN;
>  	} else if (op->lock <= b->c->root->level) {
> @@ -2209,7 +2209,7 @@ int bch_btree_insert(struct cache_set *c, struct keylist *keys,
>  	struct btree_insert_op op;
>  	int ret = 0;
>  
> -	BUG_ON(current->bio_list);
> +	BUG_ON(current->bio_lists);
>  	BUG_ON(bch_keylist_empty(keys));
>  
>  	bch_btree_op_init(&op.op, 0);
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index 6571c81..ba0c325 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -174,7 +174,7 @@ static inline int dm_bufio_cache_index(struct dm_bufio_client *c)
>  #define DM_BUFIO_CACHE(c)	(dm_bufio_caches[dm_bufio_cache_index(c)])
>  #define DM_BUFIO_CACHE_NAME(c)	(dm_bufio_cache_names[dm_bufio_cache_index(c)])
>  
> -#define dm_bufio_in_request()	(!!current->bio_list)
> +#define dm_bufio_in_request()	(!!current->bio_lists)
>  
>  static void dm_bufio_lock(struct dm_bufio_client *c)
>  {
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 10e53cd..38790e3 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -876,8 +876,7 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
>  				    (!conf->barrier ||
>  				     ((conf->start_next_window <
>  				       conf->next_resync + RESYNC_SECTORS) &&
> -				      current->bio_list &&
> -				      !bio_list_empty(current->bio_list))),
> +				      current_has_pending_bios())),
>  				    conf->resync_lock);
>  		conf->nr_waiting--;
>  	}
> @@ -1014,7 +1013,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
>  	struct r1conf *conf = mddev->private;
>  	struct bio *bio;
>  
> -	if (from_schedule || current->bio_list) {
> +	if (from_schedule || current->bio_lists) {
>  		spin_lock_irq(&conf->device_lock);
>  		bio_list_merge(&conf->pending_bio_list, &plug->pending);
>  		conf->pending_count += plug->pending_cnt;
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 245640b..13a5341 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -945,8 +945,7 @@ static void wait_barrier(struct r10conf *conf)
>  		wait_event_lock_irq(conf->wait_barrier,
>  				    !conf->barrier ||
>  				    (conf->nr_pending &&
> -				     current->bio_list &&
> -				     !bio_list_empty(current->bio_list)),
> +				     current_has_pending_bios()),
>  				    conf->resync_lock);
>  		conf->nr_waiting--;
>  	}
> @@ -1022,7 +1021,7 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
>  	struct r10conf *conf = mddev->private;
>  	struct bio *bio;
>  
> -	if (from_schedule || current->bio_list) {
> +	if (from_schedule || current->bio_lists) {
>  		spin_lock_irq(&conf->device_lock);
>  		bio_list_merge(&conf->pending_bio_list, &plug->pending);
>  		conf->pending_count += plug->pending_cnt;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index b7e1a008..2f8a361 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -541,6 +541,24 @@ struct bio_list {
>  	struct bio *tail;
>  };
>  
> +/* for generic_make_request() */
> +struct recursion_to_iteration_bio_lists {
> +	/* For stacking drivers submitting to their respective backend,
> +	 * bios are added to the tail of .recursion, which is re-initialized
> +	 * before each call to ->make_request_fn() and after that returns,
> +	 * the whole .recursion queue is then merged back to the head of .queue.
> +	 *
> +	 * The recursion-to-iteration logic in generic_make_request() will
> +	 * peel off of .queue.head, processing bios in deepest-level-first
> +	 * "natural" order. */
> +	struct bio_list recursion;
> +
> +	/* This keeps a list of to-be-processed bios.
> +	 * The "remainder" part resulting from calling blk_queue_split()
> +	 * will be pushed back to its head. */
> +	struct bio_list queue;
> +};
> +
>  static inline int bio_list_empty(const struct bio_list *bl)
>  {
>  	return bl->head == NULL;
> @@ -551,6 +569,13 @@ static inline void bio_list_init(struct bio_list *bl)
>  	bl->head = bl->tail = NULL;
>  }
>  
> +static inline bool current_has_pending_bios(void)
> +{
> +	return current->bio_lists &&
> +		(!bio_list_empty(&current->bio_lists->queue) ||
> +		 !bio_list_empty(&current->bio_lists->recursion));
> +}
> +
>  #define BIO_EMPTY_LIST	{ NULL, NULL }
>  
>  #define bio_list_for_each(bio, bl) \
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 6e42ada..146eedc 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -128,7 +128,7 @@ struct sched_attr {
>  
>  struct futex_pi_state;
>  struct robust_list_head;
> -struct bio_list;
> +struct recursion_to_iteration_bio_lists;
>  struct fs_struct;
>  struct perf_event_context;
>  struct blk_plug;
> @@ -1727,7 +1727,7 @@ struct task_struct {
>  	void *journal_info;
>  
>  /* stacked block device info */
> -	struct bio_list *bio_list;
> +	struct recursion_to_iteration_bio_lists *bio_lists;
>  
>  #ifdef CONFIG_BLOCK
>  /* stack plugging */
> 
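
The two-list scheme in the quoted patch can be modelled in user space.
What follows is only a rough sketch under stated assumptions: the list
helpers are simplified re-implementations of their kernel counterparts,
and the "driver" (every bio above a fixed depth submits two children) is
entirely hypothetical. It shows that merging .recursion back onto the
head of .queue after each ->make_request_fn() call yields depth-first
processing.

```c
#include <stddef.h>

/* Toy stand-ins for the kernel types; only what the sketch needs. */
struct bio {
	int id;            /* label used to record processing order */
	int depth;         /* number of stacking layers already passed */
	struct bio *next;
};

struct bio_list {
	struct bio *head, *tail;
};

struct two_lists {         /* models recursion_to_iteration_bio_lists */
	struct bio_list recursion;
	struct bio_list queue;
};

static void list_init(struct bio_list *bl)
{
	bl->head = bl->tail = NULL;
}

static void list_add_tail(struct bio_list *bl, struct bio *bio)
{
	bio->next = NULL;
	if (bl->tail)
		bl->tail->next = bio;
	else
		bl->head = bio;
	bl->tail = bio;
}

/* Splice all of src in front of dst. */
static void list_merge_head(struct bio_list *dst, struct bio_list *src)
{
	if (!src->head)
		return;
	if (dst->head)
		src->tail->next = dst->head;
	else
		dst->tail = src->tail;
	dst->head = src->head;
}

static struct bio *list_pop(struct bio_list *bl)
{
	struct bio *bio = bl->head;

	if (bio) {
		bl->head = bio->next;
		if (!bl->head)
			bl->tail = NULL;
		bio->next = NULL;
	}
	return bio;
}

#define TOY_MAX_DEPTH 2
#define TOY_POOL 16

/*
 * Hypothetical driver stack: every bio above TOY_MAX_DEPTH "submits"
 * two children by appending them to .recursion.  Records the ids in
 * processing order in order[] and returns the number of bios handled.
 */
size_t toy_generic_make_request(int *order, size_t max)
{
	struct bio pool[TOY_POOL];
	size_t used = 0, n = 0;
	struct two_lists lists;
	struct bio *bio = &pool[used++];

	bio->id = 1;
	bio->depth = 0;
	bio->next = NULL;
	list_init(&lists.queue);
	do {
		list_init(&lists.recursion);
		if (n < max)
			order[n] = bio->id;
		n++;
		/* stands in for q->make_request_fn(q, bio) */
		if (bio->depth < TOY_MAX_DEPTH) {
			for (int i = 1; i <= 2 && used < TOY_POOL; i++) {
				struct bio *child = &pool[used++];

				child->id = bio->id * 10 + i;
				child->depth = bio->depth + 1;
				list_add_tail(&lists.recursion, child);
			}
		}
		list_merge_head(&lists.queue, &lists.recursion);
		bio = list_pop(&lists.queue);
	} while (bio);
	return n;
}
```

With the toy driver above, the root bio 1 is followed by 11, 111, 112,
12, 121, 122: each bio's children are handled before its siblings.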

^ permalink raw reply

* SAS disk from RAID card (no RAID mode) problems
From: IW News @ 2016-12-23  8:01 UTC (permalink / raw)
  To: linux-raid

Hello,

First message here.

After looking for a solution without any luck I have found this list. I 
hope someone can help me with this.

I have an ASUS P6T Deluxe with a MARVELL 88SE63xx SAS RAID controller.
There are two identical 400GB SAS SSD drives attached to it. One of them
has a Windows 10 installation, the other one Linux.
Grub is installed on the second disk.

Windows works as expected, but I have problems with the Linux
installation: the desktop environment freezes for a few seconds once in a
while. This occurs with Mint Cinnamon, OpenSuSe KDE, Ubuntu and Manjaro
KDE. All of them are current installations. I'm now running an
up-to-date Manjaro KDE (kernel 4.9.0).
When the temporary freezes occur the mouse pointer still moves and some
windows are updated correctly, but others are not and the DE stops responding.
Whenever this happens I get a system log like this:

______________________________________________________
23/12/16 8:29    kernel    sas: Enter sas_scsi_recover_host busy: 6 
failed: 6
23/12/16 8:29    kernel    sas: trying to find task 0xffff8bda8d2bd900
23/12/16 8:29    kernel    sas: sas_scsi_find_task: aborting task 
0xffff8bda8d2bd900
23/12/16 8:29    kernel    sas: task done but aborted
23/12/16 8:29    kernel    sas: sas_scsi_find_task: task 
0xffff8bda8d2bd900 is done
23/12/16 8:29    kernel    sas: sas_eh_handle_sas_errors: task 
0xffff8bda8d2bd900 is done
23/12/16 8:29    kernel    sas: trying to find task 0xffff8bda8d2bce00
23/12/16 8:29    kernel    sas: sas_scsi_find_task: aborting task 
0xffff8bda8d2bce00
23/12/16 8:29    kernel    sas: task done but aborted
23/12/16 8:29    kernel    sas: sas_scsi_find_task: task 
0xffff8bda8d2bce00 is done
23/12/16 8:29    kernel    sas: sas_eh_handle_sas_errors: task 
0xffff8bda8d2bce00 is done
23/12/16 8:29    kernel    sas: trying to find task 0xffff8bda8d2bd700
23/12/16 8:29    kernel    sas: sas_scsi_find_task: aborting task 
0xffff8bda8d2bd700
23/12/16 8:29    kernel    sas: task done but aborted
23/12/16 8:29    kernel    sas: sas_scsi_find_task: task 
0xffff8bda8d2bd700 is done
23/12/16 8:29    kernel    sas: sas_eh_handle_sas_errors: task 
0xffff8bda8d2bd700 is done
23/12/16 8:29    kernel    sas: trying to find task 0xffff8bda8d2bde00
23/12/16 8:29    kernel    sas: sas_scsi_find_task: aborting task 
0xffff8bda8d2bde00
23/12/16 8:29    kernel    sas: task done but aborted
23/12/16 8:29    kernel    sas: sas_scsi_find_task: task 
0xffff8bda8d2bde00 is done
23/12/16 8:29    kernel    sas: sas_eh_handle_sas_errors: task 
0xffff8bda8d2bde00 is done
23/12/16 8:29    kernel    sas: trying to find task 0xffff8bda8d2bdc00
23/12/16 8:29    kernel    sas: sas_scsi_find_task: aborting task 
0xffff8bda8d2bdc00
23/12/16 8:29    kernel    sas: task done but aborted
23/12/16 8:29    kernel    sas: sas_scsi_find_task: task 
0xffff8bda8d2bdc00 is done
23/12/16 8:29    kernel    sas: sas_eh_handle_sas_errors: task 
0xffff8bda8d2bdc00 is done
23/12/16 8:29    kernel    sas: trying to find task 0xffff8bdcb0298800
23/12/16 8:29    kernel    sas: sas_scsi_find_task: aborting task 
0xffff8bdcb0298800
23/12/16 8:29    kernel    sas: task done but aborted
23/12/16 8:29    kernel    sas: sas_scsi_find_task: task 
0xffff8bdcb0298800 is done
23/12/16 8:29    kernel    sas: sas_eh_handle_sas_errors: task 
0xffff8bdcb0298800 is done
23/12/16 8:29    kernel    sas: --- Exit sas_scsi_recover_host: busy: 0 
failed: 6 tries: 1
23/12/16 8:29    kernel    drivers/scsi/mvsas/mv_sas.c 1694:reuse same 
slot, retry command.
23/12/16 8:29    kernel    drivers/scsi/mvsas/mv_sas.c 1694:reuse same 
slot, retry command.
23/12/16 8:29    kernel    drivers/scsi/mvsas/mv_sas.c 1694:reuse same 
slot, retry command.
23/12/16 8:29    kernel    sas: Enter sas_scsi_recover_host busy: 2 
failed: 2
23/12/16 8:29    kernel    sas: trying to find task 0xffff8bdb9453ae00
23/12/16 8:29    kernel    sas: sas_scsi_find_task: aborting task 
0xffff8bdb9453ae00
23/12/16 8:29    kernel    sas: task done but aborted
23/12/16 8:29    kernel    sas: sas_scsi_find_task: task 
0xffff8bdb9453ae00 is done
23/12/16 8:29    kernel    sas: sas_eh_handle_sas_errors: task 
0xffff8bdb9453ae00 is done
23/12/16 8:29    kernel    sas: trying to find task 0xffff8bdb9453b000
23/12/16 8:29    kernel    sas: sas_scsi_find_task: aborting task 
0xffff8bdb9453b000
23/12/16 8:29    kernel    sas: task done but aborted
23/12/16 8:29    kernel    sas: sas_scsi_find_task: task 
0xffff8bdb9453b000 is done
23/12/16 8:29    kernel    sas: sas_eh_handle_sas_errors: task 
0xffff8bdb9453b000 is done
23/12/16 8:29    kernel    sas: --- Exit sas_scsi_recover_host: busy: 0 
failed: 2 tries: 1
__________________________________________________________________________________________

Sometimes the log is shorter, sometimes longer.
Could this be a controller, drive, or cable problem?
Any thoughts?

Thanks in advance.

^ permalink raw reply

* Re: [RFC PATCH v2] crypto: Add IV generation algorithms
From: Herbert Xu @ 2016-12-23  7:51 UTC (permalink / raw)
  To: Binoy Jayan
  Cc: Milan Broz, Oded, Ofir, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, Linux kernel mailing list, Alasdair Kergon,
	Mike Snitzer, dm-devel, Shaohua Li, linux-raid, Rajendra
In-Reply-To: <CAHv-k_-EKq2g=Wb+YkPVp9gCXTDEyrxZhQpT5JMSw=WtZ1OC9w@mail.gmail.com>

On Thu, Dec 22, 2016 at 04:25:12PM +0530, Binoy Jayan wrote:
>
> > It doesn't have to live outside of dm-crypt.  You can register
> > these IV generators from there if you really want.
> 
> Sorry, but I didn't understand this part.

What I mean is that moving the IV generators into the crypto API
does not mean the dm-crypt team giving up control over them.  You
could continue to keep them within the dm-crypt code base and
still register them through the normal crypto API mechanism.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* using the raid6check report
From: Eyal Lebedinsky @ 2016-12-23  0:56 UTC (permalink / raw)
  To: list linux-raid

From time to time I get non-zero mismatch_count in the weekly scrub. The way I handle
it is to run a check around the stripe (I have a background job printing the mismatch
count and /proc/mdstat regularly) which should report the same count.

I now drill into the fs to find which files use this area, deal with them and delete
the bad ones. I then run a repair on that small area.

I have now found out about raid6check, which can actually tell me which disk holds the bad data.
This is something RAID6 should be able to do, assuming a single error.
Hoping it is one bad disk, the simple solution now is to recover the bad stripe on
that disk.

Will a 'repair' rewrite the bad disk or just create fresh P+Q which may just make the
bad data invisible to a 'check'? I recall this being the case in the past.

'man md' still says
	For RAID5/RAID6 new parity blocks are written
I think RAID6 can do better.
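
For reference, the reason RAID6 can point at a single bad data block is
that P and Q give two independent syndromes. The following is a minimal
sketch for one byte position, assuming the usual RAID6 field (GF(2^8),
polynomial 0x11d, generator 2); the function names and return codes are
made up for illustration and are not taken from raid6check or the
kernel.

```c
#include <stdint.h>

enum {
	RAID6_OK      = -1,   /* P and Q both match */
	RAID6_BAD_P   = -2,   /* only P disagrees */
	RAID6_BAD_Q   = -3,   /* only Q disagrees */
	RAID6_UNKNOWN = -4    /* not explainable by one bad block */
};

/* Multiply by the generator (2) in GF(2^8) with polynomial 0x11d. */
static uint8_t gf_mul2(uint8_t a)
{
	return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
}

/* Discrete log base 2; v must be non-zero. */
static int gf_log2(uint8_t v)
{
	uint8_t x = 1;

	for (int i = 0; i < 255; i++) {
		if (x == v)
			return i;
		x = gf_mul2(x);
	}
	return -1;            /* only reached for v == 0 */
}

/* P = XOR of all data bytes, Q = sum of 2^i * data[i] (Horner form). */
void raid6_pq(const uint8_t *data, int ndata, uint8_t *p, uint8_t *q)
{
	uint8_t pp = 0, qq = 0;

	for (int i = ndata - 1; i >= 0; i--) {
		pp ^= data[i];
		qq = (uint8_t)(gf_mul2(qq) ^ data[i]);
	}
	*p = pp;
	*q = qq;
}

/*
 * Returns the index of the single bad data byte, or one of the
 * RAID6_* codes above.  If data[z] is off by e, then the P syndrome
 * is e and the Q syndrome is 2^z * e, so z = log(dq) - log(dp).
 */
int raid6_locate(const uint8_t *data, int ndata,
		 uint8_t p_stored, uint8_t q_stored)
{
	uint8_t p, q, dp, dq;
	int z;

	raid6_pq(data, ndata, &p, &q);
	dp = p ^ p_stored;
	dq = q ^ q_stored;
	if (!dp && !dq)
		return RAID6_OK;
	if (dp && !dq)
		return RAID6_BAD_P;
	if (!dp && dq)
		return RAID6_BAD_Q;
	z = (gf_log2(dq) - gf_log2(dp) + 255) % 255;
	return z < ndata ? z : RAID6_UNKNOWN;
}
```

This only holds under the single-error assumption; with two or more bad
blocks in a stripe the computed index is meaningless.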

TIA

-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au)

^ permalink raw reply

* [PATCH] md/r5cache: fix spelling mistake on "recoverying"
From: Colin King @ 2016-12-23  0:52 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: linux-kernel

From: Colin Ian King <colin.king@canonical.com>

Trivial fix to spelling mistake "recoverying" to "recovering" in
pr_debug message.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/md/raid5-cache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index bff1b4a..0e8ed2c 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2170,7 +2170,7 @@ static int r5l_recovery_log(struct r5l_log *log)
 		pr_debug("md/raid:%s: starting from clean shutdown\n",
 			 mdname(mddev));
 	else {
-		pr_debug("md/raid:%s: recoverying %d data-only stripes and %d data-parity stripes\n",
+		pr_debug("md/raid:%s: recovering %d data-only stripes and %d data-parity stripes\n",
 			 mdname(mddev), ctx.data_only_stripes,
 			 ctx.data_parity_stripes);
 
-- 
2.10.2

^ permalink raw reply related

* Re: Recovering a RAID6 after all disks were disconnected
From: NeilBrown @ 2016-12-22 23:25 UTC (permalink / raw)
  To: Giuseppe Bilotta, John Stoffel; +Cc: linux-raid
In-Reply-To: <CAOxFTcwYujHCHeiJiwUc4aA5RpN_9ocoBGfLS1kfgaz=sKSOSQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4557 bytes --]

On Fri, Dec 23 2016, Giuseppe Bilotta wrote:

> Hello again,
>
> On Thu, Dec 8, 2016 at 8:02 PM, John Stoffel <john@stoffel.org> wrote:
>>
>> Sorry for not getting back to you sooner, I've been under the weather
>> lately.  And I'm NOT an expert on this, but it's good you've made
>> copies of the disks.
>
> Don't worry about the timing, as you can see I haven't had much time
> to dedicate to the recovery of this RAID either. As you can see, it
> was not that urgent ;-)
>
>
>> Giuseppe> Here it is. Notice that this is the result of -E _after_ the attempted
>> Giuseppe> re-add while the RAID was running, which marked all the disks as
>> Giuseppe> spares:
>>
>> Yeah, this is probably a bad state.  I would suggest you try to just
>> assemble the disks in various orders using your clones:
>>
>>    mdadm -A /dev/md0 /dev/sdc /dev/sdd /dev/sde /dev/sdf
>>
>> And then mix up the order until you get a working array.  You might
>> also want to try assembling using the 'missing' flag for the original
>> disk which dropped out of the array, so that just the three good disks
>> are used.  This might take a while to test all the possible
>> permutations.
>>
>> You might also want to look back in the archives of this mailing
>> list.  Phil Turmel has some great advice and howto guides for this.
>> You can do the test assembles using loop back devices so that you
>> don't write to the originals, or even to the clones.
>
> I've used the instructions on using overlays with dmsetup + sparse
> files on the RAID wiki
> https://raid.wiki.kernel.org/index.php/Recovering_a_damaged_RAID
> to experiment with the recovery (and just to be sure, I set the
> original disks read-only using blockdev; might be worth adding this to
> the wiki).
>
> I also wrote a small script to test all combinations (nothing smart,
> really, simply enumeration of combos, but I'll consider putting it up
> on the wiki as well), and I was actually surprised by the results. To
> test if the RAID was being re-created correctly with each combination,
> I used `file -s` on the RAID, and verified that the results made
> sense. I am surprised to find out that there are multiple combinations
> that make sense (note that the disk names are shifted by one compared
> to previous emails due to a machine lockup that required a reboot and
> another disk butting in and changing the order):
>
> trying /dev/sdd /dev/sdf /dev/sde /dev/sdg
> /dev/md111: Linux rev 1.0 ext4 filesystem data,
> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
> (needs journal recovery) (extents) (large files) (huge files)
>
> trying /dev/sdd /dev/sdf /dev/sdg /dev/sde
> /dev/md111: Linux rev 1.0 ext4 filesystem data,
> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
> (needs journal recovery) (extents) (large files) (huge files)
>
> trying /dev/sde /dev/sdf /dev/sdd /dev/sdg
> /dev/md111: Linux rev 1.0 ext4 filesystem data,
> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
> (needs journal recovery) (extents) (large files) (huge files)
>
> trying /dev/sde /dev/sdf /dev/sdg /dev/sdd
> /dev/md111: Linux rev 1.0 ext4 filesystem data,
> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
> (needs journal recovery) (extents) (large files) (huge files)
>
> trying /dev/sdg /dev/sdf /dev/sde /dev/sdd
> /dev/md111: Linux rev 1.0 ext4 filesystem data,
> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
> (needs journal recovery) (extents) (large files) (huge files)
>
> trying /dev/sdg /dev/sdf /dev/sdd /dev/sde
> /dev/md111: Linux rev 1.0 ext4 filesystem data,
> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
> (needs journal recovery) (extents) (large files) (huge files)
>
> So there are six out of 24 combinations that make sense, at least for
> the first block. I know from the pre-fail dmesg that the g-f-e-d order
> should be the correct one, but now I'm left wondering if there is a
> better way to verify this (other than manually sampling files to see
> if they make sense), or if the left-symmetric layout on a RAID6 simply
> allows some of the disk positions to be swapped without loss of data.
>

Your script has reported all arrangements with /dev/sdf as the second
device.  Presumably that is where the single block you are reading
resides.
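
The count of six follows directly: with one of the four devices pinned
to the second slot, the other three can be arranged in 3! = 6 ways. A
small sketch that enumerates the orders and counts those with a given
device in a given slot (device numbers and function names here are
purely illustrative):

```c
#include <stddef.h>

#define MAX_DEVS 8

/*
 * Recursively permute order[k..n-1] by swapping, and count the complete
 * orders that have device `dev` in slot `pos` (pos < 0: count them all).
 */
static size_t walk(int *order, size_t k, size_t n, int pos, int dev)
{
	size_t count = 0;

	if (k == n)
		return (pos < 0 || order[pos] == dev) ? 1 : 0;

	for (size_t i = k; i < n; i++) {
		int tmp = order[k];

		order[k] = order[i];
		order[i] = tmp;
		count += walk(order, k + 1, n, pos, dev);
		tmp = order[k];          /* undo the swap */
		order[k] = order[i];
		order[i] = tmp;
	}
	return count;
}

/* Number of orderings of n devices with `dev` fixed in slot `pos`. */
size_t count_orders(size_t n, int pos, int dev)
{
	int order[MAX_DEVS];

	if (n > MAX_DEVS)
		return 0;
	for (size_t i = 0; i < n; i++)
		order[i] = (int)i;
	return walk(order, 0, n, pos, dev);
}
```

For four devices, count_orders(4, -1, 0) gives all 24 orders and
count_orders(4, 1, 2) gives the 6 orders with device 2 in the second
slot, which matches the six hits in the script output.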

To check if a RAID6 arrangement is credible, you can try the raid6check
program that is included in the mdadm source release.  There is a man
page.
If the order of devices is not correct, raid6check will tell you about
it.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: Recovering a RAID6 after all disks were disconnected
From: Giuseppe Bilotta @ 2016-12-22 23:11 UTC (permalink / raw)
  To: John Stoffel; +Cc: linux-raid
In-Reply-To: <22601.44638.79418.124438@quad.stoffel.home>

Hello again,

On Thu, Dec 8, 2016 at 8:02 PM, John Stoffel <john@stoffel.org> wrote:
>
> Sorry for not getting back to you sooner, I've been under the weather
> lately.  And I'm NOT an expert on this, but it's good you've made
> copies of the disks.

Don't worry about the timing, as you can see I haven't had much time
to dedicate to the recovery of this RAID either. As you can see, it
was not that urgent ;-)

> Giuseppe> Here it is. Notice that this is the result of -E _after_ the attempted
> Giuseppe> re-add while the RAID was running, which marked all the disks as
> Giuseppe> spares:
>
> Yeah, this is probably a bad state.  I would suggest you try to just
> assemble the disks in various orders using your clones:
>
>    mdadm -A /dev/md0 /dev/sdc /dev/sdd /dev/sde /dev/sdf
>
> And then mix up the order until you get a working array.  You might
> also want to try assembling using the 'missing' flag for the original
> disk which dropped out of the array, so that just the three good disks
> are used.  This might take a while to test all the possible
> permutations.
>
> You might also want to look back in the archives of this mailing
> list.  Phil Turmel has some great advice and howto guides for this.
> You can do the test assembles using loop back devices so that you
> don't write to the originals, or even to the clones.

I've used the instructions on using overlays with dmsetup + sparse
files on the RAID wiki
https://raid.wiki.kernel.org/index.php/Recovering_a_damaged_RAID
to experiment with the recovery (and just to be sure, I set the
original disks read-only using blockdev; might be worth adding this to
the wiki).

I also wrote a small script to test all combinations (nothing smart,
really, simply enumeration of combos, but I'll consider putting it up
on the wiki as well), and I was actually surprised by the results. To
test if the RAID was being re-created correctly with each combination,
I used `file -s` on the RAID, and verified that the results made
sense. I am surprised to find out that there are multiple combinations
that make sense (note that the disk names are shifted by one compared
to previous emails due to a machine lockup that required a reboot and
another disk butting in and changing the order):

trying /dev/sdd /dev/sdf /dev/sde /dev/sdg
/dev/md111: Linux rev 1.0 ext4 filesystem data,
UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
(needs journal recovery) (extents) (large files) (huge files)

trying /dev/sdd /dev/sdf /dev/sdg /dev/sde
/dev/md111: Linux rev 1.0 ext4 filesystem data,
UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
(needs journal recovery) (extents) (large files) (huge files)

trying /dev/sde /dev/sdf /dev/sdd /dev/sdg
/dev/md111: Linux rev 1.0 ext4 filesystem data,
UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
(needs journal recovery) (extents) (large files) (huge files)

trying /dev/sde /dev/sdf /dev/sdg /dev/sdd
/dev/md111: Linux rev 1.0 ext4 filesystem data,
UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
(needs journal recovery) (extents) (large files) (huge files)

trying /dev/sdg /dev/sdf /dev/sde /dev/sdd
/dev/md111: Linux rev 1.0 ext4 filesystem data,
UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
(needs journal recovery) (extents) (large files) (huge files)

trying /dev/sdg /dev/sdf /dev/sdd /dev/sde
/dev/md111: Linux rev 1.0 ext4 filesystem data,
UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
(needs journal recovery) (extents) (large files) (huge files)

So there are six out of 24 combinations that make sense, at least for
the first block. I know from the pre-fail dmesg that the g-f-e-d order
should be the correct one, but now I'm left wondering if there is a
better way to verify this (other than manually sampling files to see
if they make sense), or if the left-symmetric layout on a RAID6 simply
allows some of the disk positions to be swapped without loss of data.

-- 
Giuseppe "Oblomov" Bilotta

^ permalink raw reply

* Re: [PATCH RESEND] IMSM: Do not update metadata if not able to migrate
From: Jes Sorensen @ 2016-12-22 17:20 UTC (permalink / raw)
  To: Pawel Baldysiak; +Cc: linux-raid
In-Reply-To: <20161222121047.32469-1-pawel.baldysiak@intel.com>

Pawel Baldysiak <pawel.baldysiak@intel.com> writes:
> This patch prevents mdadm from updating metadata if migration is
> not possible. The same check is done in analyse_change(),
> but by that point the metadata has already been modified.
>
> Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
> ---
>  super-intel.c | 5 +++++
>  1 file changed, 5 insertions(+)

Applied!

Thanks,
Jes

^ permalink raw reply

* Re: [mdadm PATCH] Make get_component_size() work with named array.
From: Jes Sorensen @ 2016-12-22 17:19 UTC (permalink / raw)
  To: NeilBrown; +Cc: Robert LeBlanc, linux-raid
In-Reply-To: <87wpespv4s.fsf@notabene.neil.brown.name>

NeilBrown <neilb@suse.com> writes:
> get_component_size() still assumes that all arrays are
>  /sys/block/md%d or /sys/block/md_d%d
> and so doesn't work with e.g. /sys/block/md_foo.
>
> This causes "mdadm --detail" to report
>    Used Dev Size : unknown
> and causes problems when adding spares and in other circumstances.
>
> So change it to use stat2devnm() which does the right thing with all
> types of array names.
>
> Reported-and-tested-by: Robert LeBlanc <robert@leblancnet.us>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  sysfs.c | 10 ++--------
>  1 file changed, 2 insertions(+), 8 deletions(-)

Applied!

Thanks,
Jes

^ permalink raw reply

* [PATCH RESEND] IMSM: Do not update metadata if not able to migrate
From: Pawel Baldysiak @ 2016-12-22 12:10 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Pawel Baldysiak

This patch prevents mdadm from updating metadata if migration is
not possible. The same check is done in analyse_change(),
but by that point the metadata has already been modified.

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
---
 super-intel.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/super-intel.c b/super-intel.c
index 0407d43..5e58672 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -10808,6 +10808,11 @@ enum imsm_reshape_type imsm_analyze_change(struct supertype *st,
 			pr_err("Error. Chunk size change for RAID 10 is not supported.\n");
 			change = -1;
 			goto analyse_change_exit;
+		} else if (info.component_size % (geo->chunksize/512)) {
+			pr_err("New chunk size (%dK) does not evenly divide device size (%lluk). Aborting...\n",
+			       geo->chunksize/1024, info.component_size/2);
+			change = -1;
+			goto analyse_change_exit;
 		}
 		change = CH_MIGRATION;
 	} else {
-- 
2.9.3
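
The added condition can be read in isolation as below. This is only a
sketch: it assumes, as the "/2" in the error message suggests, that the
component size is counted in 512-byte sectors and the chunk size in
bytes. The helper name is made up, and the zero guard is an addition
that is not part of the patch.

```c
#include <stdbool.h>

/*
 * component_sectors: device size in 512-byte sectors (assumption)
 * chunk_bytes:       requested chunk size in bytes
 *
 * Returns true when the device size is a whole number of chunks,
 * i.e. when the migration check in the patch would NOT abort.
 */
bool chunk_divides_component(unsigned long long component_sectors,
			     unsigned int chunk_bytes)
{
	unsigned int chunk_sectors = chunk_bytes / 512;

	/* Guard against division by zero; not present in the patch. */
	if (chunk_sectors == 0)
		return false;

	return component_sectors % chunk_sectors == 0;
}
```

For example, a 64K chunk is 128 sectors, so a component of 1024 sectors
passes while one of 1000 sectors does not.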


^ permalink raw reply related

* Re: [RFC PATCH v2] crypto: Add IV generation algorithms
From: Binoy Jayan @ 2016-12-22 10:55 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Milan Broz, Oded, Ofir, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, Linux kernel mailing list, Alasdair Kergon,
	Mike Snitzer, dm-devel, Shaohua Li, linux-raid, Rajendra
In-Reply-To: <20161222085509.GA2160@gondor.apana.org.au>

Hi Herbert,

On 22 December 2016 at 14:25, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Tue, Dec 13, 2016 at 11:01:08AM +0100, Milan Broz wrote:
>>
>> By moving everything to the crypto API we are basically introducing some strange mix
>> of IV and modes there; I wonder how this is going to be maintained.
>> Anyway, Herbert should say if it is ok...
>
> Well, there is precedent in how we do the IPsec IV generation.  In
> that case the IV generators too are completely specific to that
> application, i.e., IPsec.  However, the way we structured it allowed
> us to have one single entry path from the IPsec stack into the
> crypto layer regardless of whether you are using AEAD or traditional
> encryption/hashing algorithms.
>
> For IPsec we make the IV generators behave like normal AEAD
> algorithms, except that they take the sequence number as the IV.
>
> The goals here are obviously different.  However, by employing
> the same method as we do in IPsec, it appears to me that you
> can effectively process multiple blocks at once instead of having
> to supply one block at a time due to the IV generation issue.

Thank you for clarifying that part. :)
So, I hope we can also consider algorithms like lmk and tcw as IV generation
algorithms, even though they manipulate encrypted data directly?

>> I really do not think the disk encryption key management should be moved
>> outside of dm-crypt. We cannot then change key structure later easily.

I agree with this too; the only problem arises when multiple keys are involved
(where the master key is split into 2 or more), and the key selection is made
with a modular division of the (512-byte) sector number by the number of keys.
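
To make the key selection concrete, here is a minimal sketch of "sector
number modulo number of keys". The function name is hypothetical, and
the real dm-crypt code may well use a mask instead when the key count
is a power of two; this only illustrates the arithmetic described
above.

```c
#include <stdint.h>

/*
 * Pick which of nkeys sub-keys applies to a given 512-byte sector.
 * Returns 0 when nkeys is 0 so the caller never divides by zero.
 */
unsigned int select_key_index(uint64_t sector, unsigned int nkeys)
{
	if (nkeys == 0)
		return 0;
	return (unsigned int)(sector % nkeys);
}
```

With 64 keys, sectors 0, 64 and 128 all use key 0, while sector 130
uses key 2. For a power-of-two key count this is the same as
sector & (nkeys - 1).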

> It doesn't have to live outside of dm-crypt.  You can register
> these IV generators from there if you really want.

Sorry, but I didn't understand this part.

Thanks,
Binoy

^ permalink raw reply

* RE: dm-crypt optimization
From: Ofir Drang @ 2016-12-22 10:14 UTC (permalink / raw)
  To: Herbert Xu, Binoy Jayan
  Cc: Milan Broz, Oded Golombek, Arnd Bergmann, Mark Brown,
	Alasdair Kergon, David S. Miller, private-kwg@linaro.org,
	dm-devel@redhat.com, linux-crypto@vger.kernel.org, Rajendra,
	Linux kernel mailing list, linux-raid@vger.kernel.org, Shaohua Li,
	Mike Snitzer
In-Reply-To: <20161222085927.GB2160@gondor.apana.org.au>



-----Original Message-----
From: Herbert Xu [mailto:herbert@gondor.apana.org.au]
Sent: Thursday, December 22, 2016 10:59 AM
To: Binoy Jayan
Cc: Milan Broz; Oded Golombek; Ofir Drang; Arnd Bergmann; Mark Brown; Alasdair Kergon; David S. Miller; private-kwg@linaro.org; dm-devel@redhat.com; linux-crypto@vger.kernel.org; Rajendra; Linux kernel mailing list; linux-raid@vger.kernel.org; Shaohua Li; Mike Snitzer
Subject: Re: dm-crypt optimization

On Thu, Dec 22, 2016 at 01:55:59PM +0530, Binoy Jayan wrote:
>>
>> > Support of bigger block sizes would be unsafe without additional
>> > mechanism that provides atomic writes of multiple sectors. Maybe it
>> > applies to 4k as well on some devices though...)
>>
>> Did you mean write to the crypto output buffers or the actual disk write?
>> I didn't quite understand how the block size for encryption affects
>> atomic writes as it is the block layer which handles them. As far as
>> dm-crypt is concerned, it just encrypts/decrypts a 'struct bio'
>> instance and submits the IO operation to the block layer.

>I think Milan's talking about increasing the real block size, which would obviously require the hardware to be able to write that out atomically, as otherwise it breaks the crypto.
>
>But if we can instead do the IV generation within the crypto API, then the block size won't be an issue at all.  Because you can supply as many blocks as you want and they would be processed block-by-block.
>
>Now there is a disadvantage to this approach, and that is you have to wait for the whole thing to be encrypted before you can start doing the IO.  I'm not sure how big a problem that is but if it is bad enough to affect performance, we can look into adding some form of partial completion to the crypto API.
>
>Cheers,

But assuming we have a hardware accelerator that knows how to handle the IV generation for each sector, it makes sense to send the hardware the maximum block size, as this allows us to better utilize the hardware and offload the software. So, if possible, we need to provide a generic interface that can make the best use of hardware accelerators.

Thx Ofir
--
Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: dm-crypt optimization
From: Herbert Xu @ 2016-12-22  8:59 UTC (permalink / raw)
  To: Binoy Jayan
  Cc: Milan Broz, Oded, Ofir, Arnd Bergmann, Mark Brown,
	Alasdair Kergon, David S. Miller, private-kwg, dm-devel,
	linux-crypto, Rajendra, Linux kernel mailing list, linux-raid,
	Shaohua Li, Mike Snitzer
In-Reply-To: <CAHv-k_-K9dOiM+Pm_wqVJrvmNYhjtS82-emKVZ8OjsoMHf+7hg@mail.gmail.com>

On Thu, Dec 22, 2016 at 01:55:59PM +0530, Binoy Jayan wrote:
>
> > Support of bigger block sizes would be unsafe without additional mechanism that provides
> > atomic writes of multiple sectors. Maybe it applies to 4k as well on some devices though...)
> 
> Did you mean write to the crypto output buffers or the actual disk write?
> I didn't quite understand how the block size for encryption affects atomic
> writes as it is the block layer which handles them. As far as dm-crypt is,
> concerned it just encrypts/decrypts a 'struct bio' instance and submits the IO
> operation to the block layer.

I think Milan's talking about increasing the real block size, which
would obviously require the hardware to be able to write that out
atomically, as otherwise it breaks the crypto.

But if we can instead do the IV generation within the crypto API,
then the block size won't be an issue at all.  Because you can
supply as many blocks as you want and they would be processed
block-by-block.
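
A minimal sketch of that idea, in Python for brevity: the caller hands over one multi-sector buffer plus the starting sector, and the per-sector IV is derived inside the encryption routine. The IV follows the plain64 convention (64-bit little-endian sector number, zero-padded). The names `encrypt_request` and `toy_encrypt_sector` are hypothetical, and the cipher is an insecure stand-in, not AES and not the kernel crypto API.

```python
import hashlib
import struct

SECTOR_SIZE = 512  # dm-crypt's traditional encryption unit, in bytes


def plain64_iv(sector: int) -> bytes:
    """plain64-style IV: 64-bit little-endian sector number, zero-padded to 16 bytes."""
    return struct.pack("<Q", sector) + bytes(8)


def toy_encrypt_sector(key: bytes, iv: bytes, data: bytes) -> bytes:
    """Hypothetical stand-in cipher (NOT secure, not AES): XOR the data with a
    SHA-256-derived keystream so the example needs only the standard library."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + iv + struct.pack("<I", counter)).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def encrypt_request(key: bytes, first_sector: int, data: bytes) -> bytes:
    """Accept a whole multi-sector buffer in one call, derive each sector's IV
    internally, and process the buffer sector by sector."""
    if len(data) % SECTOR_SIZE:
        raise ValueError("buffer must be a whole number of sectors")
    out = bytearray()
    for i in range(len(data) // SECTOR_SIZE):
        chunk = data[i * SECTOR_SIZE:(i + 1) * SECTOR_SIZE]
        out += toy_encrypt_sector(key, plain64_iv(first_sector + i), chunk)
    return bytes(out)
```

One call over four sectors produces the same bytes as four single-sector calls, which is why the caller no longer has to split the request itself.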

Now there is a disadvantage to this approach, and that is you
have to wait for the whole thing to be encrypted before you can 
start doing the IO.  I'm not sure how big a problem that is but
if it is bad enough to affect performance, we can look into adding
some form of partial completion to the crypto API.
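
The ordering difference can be sketched as follows (Python, purely illustrative). Both function names, the `encrypt`/`submit` callables, and the `log` list are hypothetical; nothing here reflects an existing crypto API interface.

```python
from typing import Callable, List


def encrypt_all_then_submit(
    sectors: List[bytes],
    encrypt: Callable[[bytes], bytes],
    submit: Callable[[bytes], None],
    log: List[str],
) -> None:
    """Whole-request completion: no I/O starts until every sector is encrypted."""
    encrypted = []
    for s in sectors:
        encrypted.append(encrypt(s))
        log.append("encrypt")
    for e in encrypted:
        submit(e)
        log.append("submit")


def encrypt_with_partial_completion(
    sectors: List[bytes],
    encrypt: Callable[[bytes], bytes],
    submit: Callable[[bytes], None],
    log: List[str],
) -> None:
    """Hypothetical partial completion: each sector is handed to the I/O path
    as soon as it has been encrypted."""
    for s in sectors:
        e = encrypt(s)
        log.append("encrypt")
        submit(e)
        log.append("submit")
```

Both variants submit the same data in the same order; only the point at which the first I/O can be issued differs.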

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [RFC PATCH v2] crypto: Add IV generation algorithms
From: Herbert Xu @ 2016-12-22  8:55 UTC (permalink / raw)
  To: Milan Broz
  Cc: Binoy Jayan, Oded, Ofir, David S. Miller, linux-crypto,
	Mark Brown, Arnd Bergmann, linux-kernel, Alasdair Kergon,
	Mike Snitzer, dm-devel, Shaohua Li, linux-raid, Rajendra
In-Reply-To: <d6d92865-98fa-4d02-035f-9080bc265c35@gmail.com>

On Tue, Dec 13, 2016 at 11:01:08AM +0100, Milan Broz wrote:
>
> By the move everything to cryptoAPI we are basically introducing some strange mix
> of IV and modes there, I wonder how this is going to be maintained.
> Anyway, Herbert should say if it is ok...

Well, there is precedent in how we do the IPsec IV generation.  In
that case the IV generators too are completely specific to that
application, i.e., IPsec.  However, the way we structured it allowed
us to have one single entry path from the IPsec stack into the
crypto layer regardless of whether you are using AEAD or traditional
encryption/hashing algorithms.

For IPsec we make the IV generators behave like normal AEAD
algorithms, except that they take the sequence number as the IV.

The goals here are obviously different.  However, by employing
the same method as we do in IPsec, it appears to me that you
can effectively process multiple blocks at once instead of having
to supply one block at a time due to the IV generation issue.
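
As a rough illustration of that wrapper pattern (Python, not kernel code): an IV generator wraps any inner cipher and presents the same entry point, except that the caller passes a sequence number where the IV would go. `seq_iv_wrapper` and `xor_cipher` are hypothetical names, and the inner cipher is an insecure placeholder.

```python
import struct
from typing import Callable

# An inner cipher is any callable (key, iv, data) -> bytes.
Cipher = Callable[[bytes, bytes, bytes], bytes]
# A wrapped cipher takes a starting sequence number in place of the IV.
SeqCipher = Callable[[bytes, int, bytes], bytes]


def xor_cipher(key: bytes, iv: bytes, data: bytes) -> bytes:
    """Hypothetical placeholder cipher (NOT secure): XOR with a repeating key+IV pad."""
    pad = key + iv
    return bytes(b ^ pad[i % len(pad)] for i, b in enumerate(data))


def seq_iv_wrapper(inner: Cipher, unit: int) -> SeqCipher:
    """Wrap `inner` so that the caller supplies a sequence number, not an IV.

    The wrapper splits the data into `unit`-byte pieces and derives the IV for
    piece n from (seq + n), so one call can cover many blocks.
    """
    def wrapped(key: bytes, seq: int, data: bytes) -> bytes:
        out = bytearray()
        for n, off in enumerate(range(0, len(data), unit)):
            iv = struct.pack("<Q", seq + n) + bytes(8)
            out += inner(key, iv, data[off:off + unit])
        return bytes(out)
    return wrapped
```

The caller's code path stays the same whichever inner cipher is wrapped, which is the "single entry path" property described above.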

> I really do not think the disk encryption key management should be moved
> outside of dm-crypt. We cannot then change key structure later easily.

It doesn't have to live outside of dm-crypt.  You can register
these IV generators from there if you really want.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-12-22  8:35 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey
In-Reply-To: <878tr8rgc3.fsf@notabene.neil.brown.name>

On Thu, Dec 22, 2016 at 12:51 AM, NeilBrown <neilb@suse.com> wrote:
> On Wed, Dec 21 2016, Jinpu Wang wrote:
>
>>>
>> Thanks, it does help a lot, I attached the patch I'm still testing,
>> but so far so good.
>> Could you check if I got it right?
>
> Yes, that looks exactly right.
> I guess one of us should try to push it upstream... maybe next year :-)
>
> Thanks,
> NeilBrown

Thanks, Neil. I will try to push to upstream next year!
Happy holidays!

Cheers!
-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@profitbricks.com
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

^ permalink raw reply

* Re: dm-crypt optimization
From: Binoy Jayan @ 2016-12-22  8:25 UTC (permalink / raw)
  To: Milan Broz
  Cc: Oded, Ofir, Herbert Xu, Arnd Bergmann, Mark Brown,
	Alasdair Kergon, David S. Miller, private-kwg, dm-devel,
	linux-crypto, Rajendra, Linux kernel mailing list, linux-raid,
	Shaohua Li, Mike Snitzer
In-Reply-To: <bf5e7237-2f5c-3fc6-c7d6-38c3b13ac2c3@gmail.com>

Hi Milan,

On 21 December 2016 at 18:17, Milan Broz <gmazyland@gmail.com> wrote:

> So the core problem is that your crypto accelerator can operate efficiently only
> with bigger batch sizes.

Thank you for the reply. Yes, bigger block sizes would be quite an
improvement.

> How big blocks your crypto hw need to be able to operate more efficiently?
> What about 4k blocks (no batches), could it be usable trade-off?

The benchmark results for Qualcomm Snapdragon SoCs (mentioned below) show
significant improvement with 4K blocks, but in batches of all such contiguous
segments in the block layer's request queue, in the form of a chained
scatterlist.
However, it uses the algorithm 'aes-xts' instead of the conventional
'essiv-cbc-aes' used in dm-crypt. Also, it uses the device mapper target
dm-req-crypt instead of dm-crypt.

http://nelenkov.blogspot.in/2015/05/hardware-accelerated-disk-encryption-in.html
Section : 'Performance'

It reports an IO rate of 46.3MB/s compared to an IO rate of 25.1MB/s while
using a software-based FDE (based on dm-crypt).  But I am not sure how genuine
this data is or how it was tested.

Since Qualcomm SoCs use a hardware-backed keystore for managing keys, and since
there is no easy way to make dm-crypt work with Qualcomm's engines, I do not
have solid benchmark data to show improved performance when using 4k blocks.

> With some (backward incompatible) changes in LUKS format I would like to see support
> for encryption blocks equivalent to sectors size, so it basically means for 4k drive 4k
> encryption block.
> (This should decrease overhead, now is everything processed on 512 blocks only.)
>
> Support of bigger block sizes would be unsafe without additional mechanism that provides
> atomic writes of multiple sectors. Maybe it applies to 4k as well on some devices though...)

Did you mean write to the crypto output buffers or the actual disk write?
I didn't quite understand how the block size for encryption affects atomic
writes as it is the block layer which handles them. As far as dm-crypt is
concerned, it just encrypts/decrypts a 'struct bio' instance and submits the IO
operation to the block layer.

> The above is not going against your proposal, I am just curious if this is enough
> to provide better performance on your hw accelerator or not.

Maybe I will be able to procure an open crypto board and get back to you with
some results. Or maybe I can show even a marginal improvement with a software
algorithm by avoiding the crypto overhead for every 512 bytes.
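
For scale, a small Python calculation of how many separate crypto requests an I/O needs at a given encryption block size. The helper name `crypto_requests` is hypothetical, and this only counts requests; it says nothing about measured throughput.

```python
def crypto_requests(io_bytes: int, enc_block_bytes: int) -> int:
    """Number of separate crypto requests needed to cover one I/O of
    `io_bytes` when each request handles `enc_block_bytes`."""
    if enc_block_bytes <= 0 or io_bytes % enc_block_bytes:
        raise ValueError("I/O size must be a multiple of the encryption block size")
    return io_bytes // enc_block_bytes


# One 4 KiB page: eight requests at 512 bytes, one request at 4 KiB.
per_page_512 = crypto_requests(4096, 512)
per_page_4k = crypto_requests(4096, 4096)

# A 1 MiB transfer at each encryption block size.
per_mib_512 = crypto_requests(1024 * 1024, 512)
per_mib_4k = crypto_requests(1024 * 1024, 4096)
```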

-Binoy

^ permalink raw reply

* [mdadm PATCH] Make get_component_size() work with named array.
From: NeilBrown @ 2016-12-22  2:14 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: Robert LeBlanc, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1267 bytes --]

get_component_size() still assumes that all arrays are
 /sys/block/md%d or /sys/block/md_d%d
and so doesn't work with e.g. /sys/block/md_foo.

This causes "mdadm --detail" to report
   Used Dev Size : unknown
and causes problems when adding spares and in other circumstances.

So change it to use stat2devnm() which does the right thing with all
types of array names.

Reported-and-tested-by: Robert LeBlanc <robert@leblancnet.us>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 sysfs.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/sysfs.c b/sysfs.c
index 84c7348526c9..b0657a04b3a3 100644
--- a/sysfs.c
+++ b/sysfs.c
@@ -400,14 +400,8 @@ unsigned long long get_component_size(int fd)
 	int n;
 	if (fstat(fd, &stb))
 		return 0;
-	if (major(stb.st_rdev) != (unsigned)get_mdp_major())
-		snprintf(fname, MAX_SYSFS_PATH_LEN,
-			"/sys/block/md%d/md/component_size",
-			(int)minor(stb.st_rdev));
-	else
-		snprintf(fname, MAX_SYSFS_PATH_LEN,
-			"/sys/block/md_d%d/md/component_size",
-			(int)minor(stb.st_rdev)>>MdpMinorShift);
+	snprintf(fname, MAX_SYSFS_PATH_LEN,
+		 "/sys/block/%s/md/component_size", stat2devnm(&stb));
 	fd = open(fname, O_RDONLY);
 	if (fd < 0)
 		return 0;
-- 
2.11.0

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
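
The lookup the patch switches to can be sketched outside mdadm as follows (Python, for illustration only). It assumes the kernel name is recovered from the `/sys/dev/block/MAJOR:MINOR` symlink; that is an assumption of this sketch, not a statement about how stat2devnm() is implemented. The function names and the `sysfs_root` parameter are hypothetical, and the value is returned exactly as read from sysfs, with no unit conversion.

```python
import os


def devnm_from_rdev(st_rdev: int, sysfs_root: str = "/sys") -> str:
    """Resolve a block device number to its kernel name (e.g. md0, md_d0, md_foo)
    via the /sys/dev/block/MAJOR:MINOR symlink."""
    link = os.path.join(
        sysfs_root, "dev", "block", f"{os.major(st_rdev)}:{os.minor(st_rdev)}"
    )
    return os.path.basename(os.readlink(link))


def component_size_path(devnm: str, sysfs_root: str = "/sys") -> str:
    """Build the sysfs path from the name alone, with no md%d / md_d%d special cases."""
    return os.path.join(sysfs_root, "block", devnm, "md", "component_size")


def read_component_size(st_rdev: int, sysfs_root: str = "/sys") -> int:
    """Return the raw integer stored in component_size, or 0 on any failure."""
    try:
        path = component_size_path(devnm_from_rdev(st_rdev, sysfs_root), sysfs_root)
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return 0
```

A caller would typically pass `os.fstat(fd).st_rdev` for an open array device.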

^ permalink raw reply related
