* Re: [PATCH] mdadm: add man page for --add-journal
From: Song Liu @ 2016-08-15 17:16 UTC
To: Jes Sorensen, Adam Goryachev
Cc: linux-raid@vger.kernel.org, yizhan@redhat.com, Shaohua Li
In-Reply-To: <wrfj60r25c15.fsf@redhat.com>
Thanks Adam and Jes.
These look good to me.
PS: we will make “add-journal” more flexible, and revise the man page accordingly.
Song
>> On 8/15/16, 7:42 AM, "Jes Sorensen" <Jes.Sorensen@redhat.com> wrote:
Adam Goryachev <mailinglists@websitemanagers.com.au> writes:
> On 13/08/2016 00:58, Jes Sorensen wrote:
>> Song Liu <songliubraving@fb.com> writes:
>>> Add the following to man page:
>>>
>>> --add-journal
>>> Recreate journal for RAID-4/5/6 array that losts journal
>>> devices. In current implementation, this command cannot
>>> add journal to an array that had failed journal. To
>>> avoid interrupting on-going write opertions,
>>> --add-journal only works for array in Read-Only state.
>>>
>>> Reported-by: Yi Zhang <yizhan@redhat.com>
>>> Signed-off-by: Song Liu <songliubraving@fb.com>
>>> Signed-off-by: Shaohua Li <shli@fb.com>
>>> ---
>>> mdadm.8.in | 8 ++++++++
>>> 1 file changed, 8 insertions(+)
>> Applied, with a few minor mods.
>>
>> I changed it to say this, I hope you are fine with that:
>>
>> "Recreate journal for RAID-4/5/6 array that lost a journal device. In the
>> current implementation, this command cannot add a journal to an array
>> that had a failed journal. To avoid interrupting on-going write
>> opertions, "
> I think this might be more correct:
>
> "Recreate journal for RAID-4/5/6 array that lost a journal device. In the
> current implementation, this command cannot add a journal to an array
> that *has* a failed journal. To avoid interrupting on-going write
> *operations*, "
>
>
> Note that the two modified words are marked with asterisks.
> "has" means currently: if the array had (past) a failed journal, but that
> has already been fixed, then it currently has a working journal, and so I
> assume this patch is not relevant. It only applies if the array is
> currently missing a journal...
> The second, "operations", is just a typo fix...
>
> Hope you don't mind my jumping in here. I can't help much with code,
> but hopefully this contribution is still helpful.
If Song is happy with this and you send me a patch, I'll be happy to
apply it.
Cheers,
Jes
* Re: read errors with md RAID5 array
From: Chris Murphy @ 2016-08-15 16:23 UTC
To: Tim Small; +Cc: Chris Murphy, linux-raid@vger.kernel.org
In-Reply-To: <b353161b-255e-e359-e3f4-800eac848847@buttersideup.com>
On Mon, Aug 15, 2016 at 8:42 AM, Tim Small <tim@buttersideup.com> wrote:
> On 15/08/16 14:57, Chris Murphy wrote:
>> $ sudo smartctl -l scterc <dev> ## for each device used in the array
>> $ sudo cat /sys/block/<dev>/device/timeout ## for each device used in the array
>
> These were all reporting:
>
> SCT Error Recovery Control:
> Read: Disabled
> Write: Disabled
You failed to provide the value for the 2nd command. Is it something
other than 30 for each device?
>
> However I'm not sure how this would cause a read error from the md
> device itself? There are no timeout/reset messages in the kernel logs
> for the underlying SATA devices?
Nevertheless it's a misconfiguration that inhibits proper read error
reporting by the drive, thereby preventing the md driver from fixing
bad sectors via writing good data over them and causing the drive
firmware to sort it out. So you should issue 'smartctl -l scterc,70,70
<dev>' for all devices and make sure this is made persistent at boot
time.
>
> To check, I've set the ERC on all drives to 6.5 seconds for both reads
> and writes, and restarted the "dd if=/dev/md2 of=/dev/null
> conv=noerror", and it's just produced read failures at exactly the same
> places, with no further kernel messages.
Well it isn't really a read error, it's a buffer io error that happens
to be triggered when reading, so it's a little more specific than a
read error. It sounds to me like you've run into a bug, or there's some kind
of hardware problem somewhere. It might be helpful if you provide the
entire dmesg from boot until the first error message, as well as the
stuff Andreas asked for.
--
Chris Murphy
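As a rough illustration of the point above: smartctl's scterc values are given in tenths of a second, while /sys/block/<dev>/device/timeout is in seconds, so the drive's recovery limit should stay below the kernel's command timeout if a bad sector is to come back as a read error that md can act on. The helper below is only a sketch under those assumptions. The function name and the convention that 0 means "ERC disabled" are made up for this example and are not part of smartctl or the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: does the drive give up on a bad sector before the
 * kernel's command timer expires?
 *
 * erc_deciseconds: value as passed to "smartctl -l scterc,N,N"
 *                  (tenths of a second); 0 is used here to mean "disabled".
 * timeout_seconds: contents of /sys/block/<dev>/device/timeout (seconds).
 */
static bool erc_below_kernel_timeout(int erc_deciseconds, int timeout_seconds)
{
    assert(timeout_seconds > 0);

    /* With ERC disabled the recovery time is unbounded, so treat it as unsafe. */
    if (erc_deciseconds <= 0)
        return false;

    /* Compare in the same unit: tenths of a second. */
    return erc_deciseconds < timeout_seconds * 10;
}
```

With the values discussed in this thread, erc_below_kernel_timeout(70, 30) holds (7.0 s against 30 s), while the "Disabled" setting reported earlier does not pass the check.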
* Re: read errors with md RAID5 array
From: Andreas Klauer @ 2016-08-15 14:59 UTC
To: Tim Small; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <78aab29c-9b7b-ff8b-f9c7-fab286f16243@buttersideup.com>
On Mon, Aug 15, 2016 at 02:12:23PM +0100, Tim Small wrote:
> I'm seeing some strange read errors whilst reading from an md RAID5
> array (3x 2TB SATA Drives, Intel AHCI controller).
mdadm --examine and --examine-badblocks for all disks/partitions?
> One of the underlying devices is reporting some "pending sectors"
smartctl -a for all disks?
Regards
Andreas Klauer
* Re: read errors with md RAID5 array
From: Tim Small @ 2016-08-15 14:42 UTC
To: Chris Murphy; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <CAJCQCtSggAsQMSH9Ufa=Vjmu13cBm58+Q-RBZ4WRXm9UF=Yd+Q@mail.gmail.com>
On 15/08/16 14:57, Chris Murphy wrote:
> $ sudo smartctl -l scterc <dev> ## for each device used in the array
> $ sudo cat /sys/block/<dev>/device/timeout ## for each device used in the array
These were all reporting:
SCT Error Recovery Control:
Read: Disabled
Write: Disabled
However I'm not sure how this would cause a read error from the md
device itself? There are no timeout/reset messages in the kernel logs
for the underlying SATA devices?
To check, I've set the ERC on all drives to 6.5 seconds for both reads
and writes, and restarted the "dd if=/dev/md2 of=/dev/null
conv=noerror", and it's just produced read failures at exactly the same
places, with no further kernel messages.
Some scenarios:
1. These are write-hole locations, and the md driver has recorded this
and is failing I/O here (didn't know it did this, and a quick read
through the raid5 code couldn't see this, BICBW as I was just skimming it).
2. Two underlying drives have I/O problems at these locations (but then
why no errors in kernel logs?).
3. Something's bad in the block or ATA layer.
... or something else.
Cheers,
Tim.
* Re: [PATCH] mdadm: add man page for --add-journal
From: Jes Sorensen @ 2016-08-15 14:42 UTC
To: Adam Goryachev; +Cc: Song Liu, linux-raid, yizhan, Shaohua Li
In-Reply-To: <defa4365-a771-f4e8-4375-b67bc933767b@websitemanagers.com.au>
Adam Goryachev <mailinglists@websitemanagers.com.au> writes:
> On 13/08/2016 00:58, Jes Sorensen wrote:
>> Song Liu <songliubraving@fb.com> writes:
>>> Add the following to man page:
>>>
>>> --add-journal
>>> Recreate journal for RAID-4/5/6 array that losts journal
>>> devices. In current implementation, this command cannot
>>> add journal to an array that had failed journal. To
>>> avoid interrupting on-going write opertions,
>>> --add-journal only works for array in Read-Only state.
>>>
>>> Reported-by: Yi Zhang <yizhan@redhat.com>
>>> Signed-off-by: Song Liu <songliubraving@fb.com>
>>> Signed-off-by: Shaohua Li <shli@fb.com>
>>> ---
>>> mdadm.8.in | 8 ++++++++
>>> 1 file changed, 8 insertions(+)
>> Applied, with a few minor mods.
>>
>> I changed it to say this, I hope you are fine with that:
>>
>> "Recreate journal for RAID-4/5/6 array that lost a journal device. In the
>> current implementation, this command cannot add a journal to an array
>> that had a failed journal. To avoid interrupting on-going write
>> opertions, "
> I think this might be more correct:
>
> "Recreate journal for RAID-4/5/6 array that lost a journal device. In the
> current implementation, this command cannot add a journal to an array
> that *has* a failed journal. To avoid interrupting on-going write
> *operations*, "
>
>
> Note that the two modified words are marked with asterisks.
> "has" means currently: if the array had (past) a failed journal, but that
> has already been fixed, then it currently has a working journal, and so I
> assume this patch is not relevant. It only applies if the array is
> currently missing a journal...
> The second, "operations", is just a typo fix...
>
> Hope you don't mind my jumping in here. I can't help much with code,
> but hopefully this contribution is still helpful.
If Song is happy with this and you send me a patch, I'll be happy to
apply it.
Cheers,
Jes
* Re: read errors with md RAID5 array
From: Chris Murphy @ 2016-08-15 13:57 UTC
To: Tim Small; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <78aab29c-9b7b-ff8b-f9c7-fab286f16243@buttersideup.com>
$ sudo smartctl -l scterc <dev> ## for each device used in the array
$ sudo cat /sys/block/<dev>/device/timeout ## for each device used in the array
Chris Murphy
* read errors with md RAID5 array
From: Tim Small @ 2016-08-15 13:12 UTC
To: linux-raid@vger.kernel.org
I'm seeing some strange read errors whilst reading from an md RAID5
array (3x 2TB SATA Drives, Intel AHCI controller).
One of the underlying devices is reporting some "pending sectors" via
SMART, so I triggered a check (via the sync_action pseudo-file), but
when this didn't decrease the unreadable sector count, I just did:
dd if=/dev/md2 of=/dev/null conv=noerror
This results in:
[ 1466.586612] buffer_io_error: 85 callbacks suppressed
[ 1466.586617] Buffer I/O error on dev md2, logical block 7057384, async page read
[ 1466.824085] Buffer I/O error on dev md2, logical block 7057384, async page read
[ 1466.986397] Buffer I/O error on dev md2, logical block 7057384, async page read
[ 1467.143073] Buffer I/O error on dev md2, logical block 7057384, async page read
[ 1467.305265] Buffer I/O error on dev md2, logical block 7057384, async page read
[ 1467.465493] Buffer I/O error on dev md2, logical block 7057384, async page read
[ 1467.623860] Buffer I/O error on dev md2, logical block 7057384, async page read
[ 1467.774287] Buffer I/O error on dev md2, logical block 7057384, async page read
[ 1467.934768] Buffer I/O error on dev md2, logical block 7057385, async page read
[ 1468.097099] Buffer I/O error on dev md2, logical block 7057385, async page read
[ 1569.197498] buffer_io_error: 198 callbacks suppressed
[ 1569.197503] Buffer I/O error on dev md2, logical block 8124804, async page read
[ 1569.443257] Buffer I/O error on dev md2, logical block 8124804, async page read
[ 1569.597697] Buffer I/O error on dev md2, logical block 8124804, async page read
[ 1569.760507] Buffer I/O error on dev md2, logical block 8124804, async page read
[ 1569.924565] Buffer I/O error on dev md2, logical block 8124804, async page read
[ 1570.087074] Buffer I/O error on dev md2, logical block 8124804, async page read
[ 1570.241459] Buffer I/O error on dev md2, logical block 8124804, async page read
[ 1570.407910] Buffer I/O error on dev md2, logical block 8124804, async page read
[ 1570.570488] Buffer I/O error on dev md2, logical block 8124805, async page read
[ 1570.732574] Buffer I/O error on dev md2, logical block 8124805, async page read
I'm not getting any accompanying reports of underlying SATA read errors,
nor apparently any attempt to correct unreadable sectors.
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md2 : active raid5 sda2[0] sdd2[3] sdc2[1]
3885793280 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
bitmap: 0/15 pages [0KB], 65536KB chunk
unused devices: <none>
I thought perhaps the array was aware of a RAID5 write hole and was failing
reads, but this would seem to contradict that:
# cat /sys/block/md2/md/mismatch_cnt
0
... unless that's not the way to detect such errors?
# uname -a
Linux magic 4.4.0-34-generic #53-Ubuntu SMP Wed Jul 27 16:06:39 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
This is the current Ubuntu LTS kernel. Were there any known md, or
block layer problems with the 4.4 kernel? Should I try with the latest
mainline kernel, or am I missing something else entirely?
Tim.
* Re: [PATCH] mdadm: add man page for --add-journal
From: Adam Goryachev @ 2016-08-13 5:13 UTC
To: Jes Sorensen, Song Liu; +Cc: linux-raid, yizhan, Shaohua Li
In-Reply-To: <wrfjshua825o.fsf@redhat.com>
On 13/08/2016 00:58, Jes Sorensen wrote:
> Song Liu <songliubraving@fb.com> writes:
>> Add the following to man page:
>>
>> --add-journal
>> Recreate journal for RAID-4/5/6 array that losts journal
>> devices. In current implementation, this command cannot
>> add journal to an array that had failed journal. To
>> avoid interrupting on-going write opertions,
>> --add-journal only works for array in Read-Only state.
>>
>> Reported-by: Yi Zhang <yizhan@redhat.com>
>> Signed-off-by: Song Liu <songliubraving@fb.com>
>> Signed-off-by: Shaohua Li <shli@fb.com>
>> ---
>> mdadm.8.in | 8 ++++++++
>> 1 file changed, 8 insertions(+)
> Applied, with a few minor mods.
>
> I changed it to say this, I hope you are fine with that:
>
> "Recreate journal for RAID-4/5/6 array that lost a journal device. In the
> current implementation, this command cannot add a journal to an array
> that had a failed journal. To avoid interrupting on-going write
> opertions, "
I think this might be more correct:
"Recreate journal for RAID-4/5/6 array that lost a journal device. In the
current implementation, this command cannot add a journal to an array
that *has* a failed journal. To avoid interrupting on-going write
*operations*, "
Note that the two modified words are marked with asterisks.
"has" means currently: if the array had (past) a failed journal, but that
has already been fixed, then it currently has a working journal, and so I
assume this patch is not relevant. It only applies if the array is
currently missing a journal...
The second, "operations", is just a typo fix...
Hope you don't mind my jumping in here. I can't help much with code, but
hopefully this contribution is still helpful.
Regards,
Adam
> If I botched it up please let me know.
>
> Jes
>
>
>> diff --git a/mdadm.8.in b/mdadm.8.in
>> index 1a04bd1..a335c53 100644
>> --- a/mdadm.8.in
>> +++ b/mdadm.8.in
>> @@ -1444,6 +1444,14 @@ number. The receiving node must acknowledge this message
>> with \-\-cluster\-confirm. Valid arguments are <slot>:<devicename> in case
>> the device is found or <slot>:missing in case the device is not found.
>>
>> +.TP
>> +.BR \-\-add-journal
>> +Recreate journal for RAID-4/5/6 array that losts journal devices. In current
>> +implementation, this command cannot add journal to an array that had failed
>> +journal. To avoid interrupting on-going write opertions,
>> +.B \-\-add-journal
>> +only works for array in Read-Only state.
>> +
>> .P
>> Each of these options requires that the first device listed is the array
>> to be acted upon, and the remainder are component devices to be added,
* [PATCH v2 6/6] (DO NOT APPLY) lib/raid6: Add unroll by 8 to AVX512 optimized xor_syndrome functions.
From: Gayatri Kammela @ 2016-08-13 1:03 UTC
To: linux-raid
Cc: shli, linux-kernel, ravi.v.shankar, Gayatri Kammela,
H . Peter Anvin, Jim Kukunas, Fenghua Yu, Megha Dey
In-Reply-To: <1471050204-26361-1-git-send-email-gayatri.kammela@intel.com>
Optimize RAID6 xor_syndrome functions by further unrolling by 8 to take
advantage of all the 32 ZMM registers.
Note: In theory the avx512 unroll-by-8 xor_syndrome function should perform
better than the rest of the xor_syndrome functions, but it is outperformed
by the avx512 unroll-by-4 xor_syndrome function when tested in userspace.
This is posted for reference only, to allow others to make their own
experiments.
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jim Kukunas <james.t.kukunas@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Megha Dey <megha.dey@linux.intel.com>
Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
---
lib/raid6/avx512.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 234 insertions(+), 1 deletion(-)
diff --git a/lib/raid6/avx512.c b/lib/raid6/avx512.c
index 221ce46362cf..59e67366a84d 100644
--- a/lib/raid6/avx512.c
+++ b/lib/raid6/avx512.c
@@ -729,9 +729,242 @@ static void raid6_avx5128_gen_syndrome(int disks, size_t bytes, void **ptrs)
kernel_fpu_end();
}
+static void raid6_avx5128_xor_syndrome(int disks, int start, int stop,
+ size_t bytes, void **ptrs)
+{
+ u8 **dptr = (u8 **)ptrs;
+ u8 *p, *q;
+ int d, z, z0;
+
+ z0 = stop; /* P/Q right side optimization */
+ p = dptr[disks-2]; /* XOR parity */
+ q = dptr[disks-1]; /* RS syndrome */
+
+ kernel_fpu_begin();
+
+ asm volatile("vmovdqa64 %0,%%zmm0"
+ : : "m" (raid6_avx512_constants.x1d[0]));
+
+ for (d = 0 ; d < bytes ; d += 512) {
+ asm volatile("vmovdqa64 %0,%%zmm4\n\t"
+ "vmovdqa64 %1,%%zmm6\n\t"
+ "vmovdqa64 %2,%%zmm12\n\t"
+ "vmovdqa64 %3,%%zmm14\n\t"
+ "vmovdqa64 %4,%%zmm20\n\t"
+ "vmovdqa64 %5,%%zmm22\n\t"
+ "vmovdqa64 %6,%%zmm28\n\t"
+ "vmovdqa64 %7,%%zmm30\n\t"
+ "vmovdqa64 %8,%%zmm2\n\t"
+ "vmovdqa64 %9,%%zmm3\n\t"
+ "vmovdqa64 %10,%%zmm10\n\t"
+ "vmovdqa64 %11,%%zmm11\n\t"
+ "vmovdqa64 %12,%%zmm16\n\t"
+ "vmovdqa64 %13,%%zmm18\n\t"
+ "vmovdqa64 %14,%%zmm24\n\t"
+ "vmovdqa64 %15,%%zmm26\n\t"
+ "vpxorq %%zmm4,%%zmm2,%%zmm2\n\t"
+ "vpxorq %%zmm6,%%zmm3,%%zmm3\n\t"
+ "vpxorq %%zmm12,%%zmm10,%%zmm10\n\t"
+ "vpxorq %%zmm14,%%zmm11,%%zmm11\n\t"
+ "vpxorq %%zmm20,%%zmm16,%%zmm16\n\t"
+ "vpxorq %%zmm22,%%zmm18,%%zmm18\n\t"
+ "vpxorq %%zmm28,%%zmm24,%%zmm24\n\t"
+ "vpxorq %%zmm30,%%zmm26,%%zmm26"
+ :
+ : "m" (dptr[z0][d]), "m" (dptr[z0][d+64]),
+ "m" (dptr[z0][d+128]), "m" (dptr[z0][d+192]),
+ "m" (dptr[z0][d+256]), "m" (dptr[z0][d+320]),
+ "m" (dptr[z0][d+384]), "m" (dptr[z0][d+448]),
+ "m" (p[d]), "m" (p[d+64]), "m" (p[d+128]),
+ "m" (p[d+192]), "m" (p[d+256]), "m" (p[d+320]),
+ "m" (p[d+384]), "m" (p[d+448]));
+ /* P/Q data pages */
+ for (z = z0-1 ; z >= start ; z--) {
+ asm volatile("prefetchnta %0\n\t"
+ "prefetchnta %2\n\t"
+ "prefetchnta %4\n\t"
+ "prefetchnta %6\n\t"
+ "vpxorq %%zmm21,%%zmm21,%%zmm21\n\t"
+ "vpxorq %%zmm23,%%zmm23,%%zmm23\n\t"
+ "vpxorq %%zmm29,%%zmm29,%%zmm29\n\t"
+ "vpxorq %%zmm31,%%zmm31,%%zmm31\n\t"
+ "vpxorq %%zmm5,%%zmm5,%%zmm5\n\t"
+ "vpxorq %%zmm7,%%zmm7,%%zmm7\n\t"
+ "vpxorq %%zmm13,%%zmm13,%%zmm13\n\t"
+ "vpxorq %%zmm15,%%zmm15,%%zmm15\n\t"
+ "vpcmpgtb %%zmm4,%%zmm5,%%k1\n\t"
+ "vpcmpgtb %%zmm6,%%zmm7,%%k2\n\t"
+ "vpcmpgtb %%zmm12,%%zmm13,%%k3\n\t"
+ "vpcmpgtb %%zmm14,%%zmm15,%%k4\n\t"
+ "vpmovm2b %%k1,%%zmm5\n\t"
+ "vpmovm2b %%k2,%%zmm7\n\t"
+ "vpmovm2b %%k3,%%zmm13\n\t"
+ "vpmovm2b %%k4,%%zmm15\n\t"
+ "vpcmpgtb %%zmm20,%%zmm21,%%k5\n\t"
+ "vpcmpgtb %%zmm22,%%zmm23,%%k6\n\t"
+ "vpcmpgtb %%zmm28,%%zmm29,%%k7\n\t"
+ "vpcmpgtb %%zmm30,%%zmm31,%%k1\n\t"
+ "vpmovm2b %%k5,%%zmm21\n\t"
+ "vpmovm2b %%k6,%%zmm23\n\t"
+ "vpmovm2b %%k7,%%zmm29\n\t"
+ "vpmovm2b %%k1,%%zmm31\n\t"
+ "vpaddb %%zmm4,%%zmm4,%%zmm4\n\t"
+ "vpaddb %%zmm6,%%zmm6,%%zmm6\n\t"
+ "vpaddb %%zmm12,%%zmm12,%%zmm12\n\t"
+ "vpaddb %%zmm14,%%zmm14,%%zmm14\n\t"
+ "vpaddb %%zmm20,%%zmm20,%%zmm20\n\t"
+ "vpaddb %%zmm22,%%zmm22,%%zmm22\n\t"
+ "vpaddb %%zmm28,%%zmm28,%%zmm28\n\t"
+ "vpaddb %%zmm30,%%zmm30,%%zmm30\n\t"
+ "vpandq %%zmm0,%%zmm5,%%zmm5\n\t"
+ "vpandq %%zmm0,%%zmm7,%%zmm7\n\t"
+ "vpandq %%zmm0,%%zmm13,%%zmm13\n\t"
+ "vpandq %%zmm0,%%zmm15,%%zmm15\n\t"
+ "vpandq %%zmm0,%%zmm21,%%zmm21\n\t"
+ "vpandq %%zmm0,%%zmm23,%%zmm23\n\t"
+ "vpandq %%zmm0,%%zmm29,%%zmm29\n\t"
+ "vpandq %%zmm0,%%zmm31,%%zmm31\n\t"
+ "vpxorq %%zmm5,%%zmm4,%%zmm4\n\t"
+ "vpxorq %%zmm7,%%zmm6,%%zmm6\n\t"
+ "vpxorq %%zmm13,%%zmm12,%%zmm12\n\t"
+ "vpxorq %%zmm15,%%zmm14,%%zmm14\n\t"
+ "vpxorq %%zmm21,%%zmm20,%%zmm20\n\t"
+ "vpxorq %%zmm23,%%zmm22,%%zmm22\n\t"
+ "vpxorq %%zmm29,%%zmm28,%%zmm28\n\t"
+ "vpxorq %%zmm31,%%zmm30,%%zmm30\n\t"
+ "vmovdqa64 %0,%%zmm5\n\t"
+ "vmovdqa64 %1,%%zmm7\n\t"
+ "vmovdqa64 %2,%%zmm13\n\t"
+ "vmovdqa64 %3,%%zmm15\n\t"
+ "vmovdqa64 %4,%%zmm21\n\t"
+ "vmovdqa64 %5,%%zmm23\n\t"
+ "vmovdqa64 %6,%%zmm29\n\t"
+ "vmovdqa64 %7,%%zmm31\n\t"
+ "vpxorq %%zmm5,%%zmm2,%%zmm2\n\t"
+ "vpxorq %%zmm7,%%zmm3,%%zmm3\n\t"
+ "vpxorq %%zmm13,%%zmm10,%%zmm10\n\t"
+ "vpxorq %%zmm15,%%zmm11,%%zmm11\n\t"
+ "vpxorq %%zmm21,%%zmm16,%%zmm16\n\t"
+ "vpxorq %%zmm23,%%zmm18,%%zmm18\n\t"
+ "vpxorq %%zmm29,%%zmm24,%%zmm24\n\t"
+ "vpxorq %%zmm31,%%zmm26,%%zmm26\n\t"
+ "vpxorq %%zmm5,%%zmm4,%%zmm4\n\t"
+ "vpxorq %%zmm7,%%zmm6,%%zmm6\n\t"
+ "vpxorq %%zmm13,%%zmm12,%%zmm12\n\t"
+ "vpxorq %%zmm15,%%zmm14,%%zmm14\n\t"
+ "vpxorq %%zmm21,%%zmm20,%%zmm20\n\t"
+ "vpxorq %%zmm23,%%zmm22,%%zmm22\n\t"
+ "vpxorq %%zmm29,%%zmm28,%%zmm28\n\t"
+ "vpxorq %%zmm31,%%zmm30,%%zmm30"
+ :
+ : "m" (dptr[z][d]), "m" (dptr[z][d+64]),
+ "m" (dptr[z][d+128]),
+ "m" (dptr[z][d+192]),
+ "m" (dptr[z][d+256]),
+ "m" (dptr[z][d+320]),
+ "m" (dptr[z][d+384]),
+ "m" (dptr[z][d+448]));
+ }
+ asm volatile("prefetchnta %0\n\t"
+ "prefetchnta %1\n\t"
+ "prefetchnta %2\n\t"
+ "prefetchnta %3"
+ :
+ : "m" (q[d]), "m" (q[d+128]), "m" (q[d+256]),
+ "m" (q[d+384]));
+ /* P/Q left side optimization */
+ for (z = start-1 ; z >= 0 ; z--) {
+ asm volatile("vpxorq %%zmm5,%%zmm5,%%zmm5\n\t"
+ "vpxorq %%zmm7,%%zmm7,%%zmm7\n\t"
+ "vpxorq %%zmm13,%%zmm13,%%zmm13\n\t"
+ "vpxorq %%zmm15,%%zmm15,%%zmm15\n\t"
+ "vpxorq %%zmm21,%%zmm21,%%zmm21\n\t"
+ "vpxorq %%zmm23,%%zmm23,%%zmm23\n\t"
+ "vpxorq %%zmm29,%%zmm29,%%zmm29\n\t"
+ "vpxorq %%zmm31,%%zmm31,%%zmm31\n\t"
+ "vpcmpgtb %%zmm4,%%zmm5,%%k1\n\t"
+ "vpcmpgtb %%zmm6,%%zmm7,%%k2\n\t"
+ "vpcmpgtb %%zmm12,%%zmm13,%%k3\n\t"
+ "vpcmpgtb %%zmm14,%%zmm15,%%k4\n\t"
+ "vpmovm2b %%k1,%%zmm5\n\t"
+ "vpmovm2b %%k2,%%zmm7\n\t"
+ "vpmovm2b %%k3,%%zmm13\n\t"
+ "vpmovm2b %%k4,%%zmm15\n\t"
+ "vpcmpgtb %%zmm20,%%zmm21,%%k5\n\t"
+ "vpcmpgtb %%zmm22,%%zmm23,%%k6\n\t"
+ "vpcmpgtb %%zmm28,%%zmm29,%%k7\n\t"
+ "vpcmpgtb %%zmm30,%%zmm31,%%k1\n\t"
+ "vpmovm2b %%k5,%%zmm21\n\t"
+ "vpmovm2b %%k6,%%zmm23\n\t"
+ "vpmovm2b %%k7,%%zmm29\n\t"
+ "vpmovm2b %%k1,%%zmm31\n\t"
+ "vpaddb %%zmm4,%%zmm4,%%zmm4\n\t"
+ "vpaddb %%zmm6,%%zmm6,%%zmm6\n\t"
+ "vpaddb %%zmm12,%%zmm12,%%zmm12\n\t"
+ "vpaddb %%zmm14,%%zmm14,%%zmm14\n\t"
+ "vpaddb %%zmm20,%%zmm20,%%zmm20\n\t"
+ "vpaddb %%zmm22,%%zmm22,%%zmm22\n\t"
+ "vpaddb %%zmm28,%%zmm28,%%zmm28\n\t"
+ "vpaddb %%zmm30,%%zmm30,%%zmm30\n\t"
+ "vpandq %%zmm0,%%zmm5,%%zmm5\n\t"
+ "vpandq %%zmm0,%%zmm7,%%zmm7\n\t"
+ "vpandq %%zmm0,%%zmm13,%%zmm13\n\t"
+ "vpandq %%zmm0,%%zmm15,%%zmm15\n\t"
+ "vpandq %%zmm0,%%zmm21,%%zmm21\n\t"
+ "vpandq %%zmm0,%%zmm23,%%zmm23\n\t"
+ "vpandq %%zmm0,%%zmm29,%%zmm29\n\t"
+ "vpandq %%zmm0,%%zmm31,%%zmm31\n\t"
+ "vpxorq %%zmm5,%%zmm4,%%zmm4\n\t"
+ "vpxorq %%zmm7,%%zmm6,%%zmm6\n\t"
+ "vpxorq %%zmm13,%%zmm12,%%zmm12\n\t"
+ "vpxorq %%zmm15,%%zmm14,%%zmm14\n\t"
+ "vpxorq %%zmm21,%%zmm20,%%zmm20\n\t"
+ "vpxorq %%zmm23,%%zmm22,%%zmm22\n\t"
+ "vpxorq %%zmm29,%%zmm28,%%zmm28\n\t"
+ "vpxorq %%zmm31,%%zmm30,%%zmm30"
+ :
+ : );
+ }
+ asm volatile("vmovntdq %%zmm2,%0\n\t"
+ "vmovntdq %%zmm3,%1\n\t"
+ "vmovntdq %%zmm10,%2\n\t"
+ "vmovntdq %%zmm11,%3\n\t"
+ "vmovntdq %%zmm16,%4\n\t"
+ "vmovntdq %%zmm18,%5\n\t"
+ "vmovntdq %%zmm24,%6\n\t"
+ "vmovntdq %%zmm26,%7"
+ :
+ : "m" (p[d]), "m" (p[d+64]), "m" (p[d+128]),
+ "m" (p[d+192]), "m" (p[d+256]), "m" (p[d+320]),
+ "m" (p[d+384]), "m" (p[d+448]));
+ asm volatile("vpxorq %0,%%zmm4,%%zmm4\n\t"
+ "vpxorq %1,%%zmm6,%%zmm6\n\t"
+ "vpxorq %2,%%zmm12,%%zmm12\n\t"
+ "vpxorq %3,%%zmm14,%%zmm14\n\t"
+ "vpxorq %4,%%zmm20,%%zmm20\n\t"
+ "vpxorq %5,%%zmm22,%%zmm22\n\t"
+ "vpxorq %6,%%zmm28,%%zmm28\n\t"
+ "vpxorq %7,%%zmm30,%%zmm30\n\t"
+ "vmovntdq %%zmm4,%0\n\t"
+ "vmovntdq %%zmm6,%1\n\t"
+ "vmovntdq %%zmm12,%2\n\t"
+ "vmovntdq %%zmm14,%3\n\t"
+ "vmovntdq %%zmm20,%4\n\t"
+ "vmovntdq %%zmm22,%5\n\t"
+ "vmovntdq %%zmm28,%6\n\t"
+ "vmovntdq %%zmm30,%7"
+ :
+ : "m" (q[d]), "m" (q[d+64]), "m" (q[d+128]),
+ "m" (q[d+192]), "m" (q[d+256]), "m" (q[d+320]),
+ "m" (q[d+384]), "m" (q[d+448]));
+ }
+ asm volatile("sfence" : : : "memory");
+ kernel_fpu_end();
+}
+
const struct raid6_calls raid6_avx512x8 = {
raid6_avx5128_gen_syndrome,
- NULL, /* XOR not yet implemented */
+ raid6_avx5128_xor_syndrome,
raid6_have_avx512,
"avx512x8",
1 /* Has cache hints */
--
2.7.4
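For readers following the assembly, the per-byte work done by the unrolled loop above can be sketched in plain C. This is an illustrative scalar sketch only, not code from the patch: the names gf_mul2 and xor_syndrome_scalar are invented here, and the 0x1d constant is assumed to be what raid6_avx512_constants.x1d holds (the low byte of the usual RAID-6 polynomial 0x11d). The comments map each step to the instructions it stands in for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply one byte by 2 in GF(2^8), reducing with 0x1d (assumed x1d value). */
static uint8_t gf_mul2(uint8_t v)
{
    uint8_t mask = (v & 0x80) ? 0xff : 0x00;      /* vpcmpgtb + vpmovm2b */
    return (uint8_t)((v << 1) ^ (mask & 0x1d));   /* vpaddb, vpandq, vpxorq */
}

/* Hypothetical scalar counterpart of raid6_avx5128_xor_syndrome(). */
static void xor_syndrome_scalar(int disks, int start, int stop,
                                size_t bytes, uint8_t **dptr)
{
    uint8_t *p = dptr[disks - 2];   /* XOR parity */
    uint8_t *q = dptr[disks - 1];   /* RS syndrome */

    assert(disks >= 3 && start >= 0 && start <= stop && stop <= disks - 3);

    for (size_t d = 0; d < bytes; d++) {
        /* Right side optimization: begin at disk "stop", skip disks above it. */
        uint8_t wp = dptr[stop][d];
        uint8_t wq = wp;

        /* P/Q data pages: fold in disks stop-1 .. start. */
        for (int z = stop - 1; z >= start; z--) {
            wq = (uint8_t)(gf_mul2(wq) ^ dptr[z][d]);
            wp ^= dptr[z][d];
        }

        /* Left side optimization: only keep multiplying Q by 2. */
        for (int z = start - 1; z >= 0; z--)
            wq = gf_mul2(wq);

        /* Read-modify-write of the existing parity. */
        p[d] ^= wp;
        q[d] ^= wq;
    }
}
```

The disks outside start..stop are never read, which is what the "right side" and "left side" comments in the patch refer to.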
* [PATCH v2 5/6] (DO NOT APPLY) lib/raid6: Add unroll by 8 to AVX512 optimized gen_syndrome functions
From: Gayatri Kammela @ 2016-08-13 1:03 UTC
To: linux-raid
Cc: shli, linux-kernel, ravi.v.shankar, Gayatri Kammela,
H . Peter Anvin, Jim Kukunas, Fenghua Yu, Megha Dey
In-Reply-To: <1471050204-26361-1-git-send-email-gayatri.kammela@intel.com>
Optimize RAID6 gen_syndrome functions by further unrolling by 8 to take
advantage of all the 32 ZMM registers.
Note: In theory the avx512 unroll-by-8 gen_syndrome function should perform
better than the rest of the gen_syndrome functions, but it is outperformed
by the avx512 unroll-by-4 gen_syndrome function when tested in user space
as well as kernel space.
This is posted for reference only, to allow others to make their own
experiments.
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jim Kukunas <james.t.kukunas@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Megha Dey <megha.dey@linux.intel.com>
Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
---
include/linux/raid/pq.h | 1 +
lib/raid6/algos.c | 1 +
lib/raid6/avx512.c | 172 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 174 insertions(+)
diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
index 1abd89584568..b4db38eb053a 100644
--- a/include/linux/raid/pq.h
+++ b/include/linux/raid/pq.h
@@ -105,6 +105,7 @@ extern const struct raid6_calls raid6_avx2x4;
extern const struct raid6_calls raid6_avx512x1;
extern const struct raid6_calls raid6_avx512x2;
extern const struct raid6_calls raid6_avx512x4;
+extern const struct raid6_calls raid6_avx512x8;
extern const struct raid6_calls raid6_tilegx8;
struct raid6_recov_calls {
diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
index 149d947a4fec..85ba18acad00 100644
--- a/lib/raid6/algos.c
+++ b/lib/raid6/algos.c
@@ -67,6 +67,7 @@ const struct raid6_calls * const raid6_algos[] = {
&raid6_avx512x1,
&raid6_avx512x2,
&raid6_avx512x4,
+ &raid6_avx512x8,
#endif
#endif
#ifdef CONFIG_ALTIVEC
diff --git a/lib/raid6/avx512.c b/lib/raid6/avx512.c
index f524a7972006..221ce46362cf 100644
--- a/lib/raid6/avx512.c
+++ b/lib/raid6/avx512.c
@@ -564,6 +564,178 @@ const struct raid6_calls raid6_avx512x4 = {
"avx512x4",
1 /* Has cache hints */
};
+
+/*
+ * Unrolled-by-8 AVX512 implementation
+ */
+static void raid6_avx5128_gen_syndrome(int disks, size_t bytes, void **ptrs)
+{
+ u8 **dptr = (u8 **)ptrs;
+ u8 *p, *q;
+ int d, z, z0;
+
+ z0 = disks - 3; /* Highest data disk */
+ p = dptr[z0+1]; /* XOR parity */
+ q = dptr[z0+2]; /* RS syndrome */
+
+ kernel_fpu_begin();
+
+ asm volatile("vmovdqa64 %0,%%zmm0\n\t"
+ "vpxorq %%zmm1,%%zmm1,%%zmm1\n\t" /* Zero temp */
+ "vpxorq %%zmm2,%%zmm2,%%zmm2\n\t" /* P[0] */
+ "vpxorq %%zmm3,%%zmm3,%%zmm3\n\t" /* P[1] */
+ "vpxorq %%zmm4,%%zmm4,%%zmm4\n\t" /* Q[0] */
+ "vpxorq %%zmm6,%%zmm6,%%zmm6\n\t" /* Q[1] */
+ "vpxorq %%zmm10,%%zmm10,%%zmm10\n\t" /* P[2] */
+ "vpxorq %%zmm11,%%zmm11,%%zmm11\n\t" /* P[3] */
+ "vpxorq %%zmm12,%%zmm12,%%zmm12\n\t" /* Q[2] */
+ "vpxorq %%zmm14,%%zmm14,%%zmm14\n\t" /* Q[3] */
+ "vpxorq %%zmm16,%%zmm16,%%zmm16\n\t" /* P[4] */
+ "vpxorq %%zmm18,%%zmm18,%%zmm18\n\t" /* P[5] */
+ "vpxorq %%zmm20,%%zmm20,%%zmm20\n\t" /* Q[4] */
+ "vpxorq %%zmm22,%%zmm22,%%zmm22\n\t" /* Q[5] */
+ "vpxorq %%zmm24,%%zmm24,%%zmm24\n\t" /* P[6] */
+ "vpxorq %%zmm26,%%zmm26,%%zmm26\n\t" /* P[7] */
+ "vpxorq %%zmm28,%%zmm28,%%zmm28\n\t" /* Q[6] */
+ "vpxorq %%zmm30,%%zmm30,%%zmm30" /* Q[7] */
+ :
+ : "m" (raid6_avx512_constants.x1d[0]));
+
+ for (d = 0; d < bytes; d += 512) {
+ for (z = z0; z >= 0; z--) {
+ asm volatile("prefetchnta %0\n\t"
+ "prefetchnta %1\n\t"
+ "prefetchnta %2\n\t"
+ "prefetchnta %3\n\t"
+ "vpcmpgtb %%zmm4,%%zmm1,%%k1\n\t"
+ "vpcmpgtb %%zmm6,%%zmm1,%%k2\n\t"
+ "vpcmpgtb %%zmm12,%%zmm1,%%k3\n\t"
+ "vpcmpgtb %%zmm14,%%zmm1,%%k4\n\t"
+ "vpmovm2b %%k1,%%zmm5\n\t"
+ "vpmovm2b %%k2,%%zmm7\n\t"
+ "vpmovm2b %%k3,%%zmm13\n\t"
+ "vpmovm2b %%k4,%%zmm15\n\t"
+ "vpaddb %%zmm4,%%zmm4,%%zmm4\n\t"
+ "vpaddb %%zmm6,%%zmm6,%%zmm6\n\t"
+ "vpaddb %%zmm12,%%zmm12,%%zmm12\n\t"
+ "vpaddb %%zmm14,%%zmm14,%%zmm14\n\t"
+ "vpandq %%zmm0,%%zmm5,%%zmm5\n\t"
+ "vpandq %%zmm0,%%zmm7,%%zmm7\n\t"
+ "vpandq %%zmm0,%%zmm13,%%zmm13\n\t"
+ "vpandq %%zmm0,%%zmm15,%%zmm15\n\t"
+ "vpxorq %%zmm5,%%zmm4,%%zmm4\n\t"
+ "vpxorq %%zmm7,%%zmm6,%%zmm6\n\t"
+ "vpxorq %%zmm13,%%zmm12,%%zmm12\n\t"
+ "vpxorq %%zmm15,%%zmm14,%%zmm14\n\t"
+ "vmovdqa64 %0,%%zmm5\n\t"
+ "vmovdqa64 %1,%%zmm7\n\t"
+ "vmovdqa64 %2,%%zmm13\n\t"
+ "vmovdqa64 %3,%%zmm15\n\t"
+ "vpxorq %%zmm5,%%zmm2,%%zmm2\n\t"
+ "vpxorq %%zmm7,%%zmm3,%%zmm3\n\t"
+ "vpxorq %%zmm13,%%zmm10,%%zmm10\n\t"
+ "vpxorq %%zmm15,%%zmm11,%%zmm11\n\t"
+ "vpxorq %%zmm5,%%zmm4,%%zmm4\n\t"
+ "vpxorq %%zmm7,%%zmm6,%%zmm6\n\t"
+ "vpxorq %%zmm13,%%zmm12,%%zmm12\n\t"
+ "vpxorq %%zmm15,%%zmm14,%%zmm14\n\t"
+ "prefetchnta %4\n\t"
+ "prefetchnta %5\n\t"
+ "prefetchnta %6\n\t"
+ "prefetchnta %7\n\t"
+ "vpcmpgtb %%zmm20,%%zmm1,%%k5\n\t"
+ "vpcmpgtb %%zmm22,%%zmm1,%%k6\n\t"
+ "vpcmpgtb %%zmm28,%%zmm1,%%k7\n\t"
+ "vpcmpgtb %%zmm30,%%zmm1,%%k1\n\t"
+ "vpmovm2b %%k5,%%zmm21\n\t"
+ "vpmovm2b %%k6,%%zmm23\n\t"
+ "vpmovm2b %%k7,%%zmm29\n\t"
+ "vpmovm2b %%k1,%%zmm31\n\t"
+ "vpaddb %%zmm20,%%zmm20,%%zmm20\n\t"
+ "vpaddb %%zmm22,%%zmm22,%%zmm22\n\t"
+ "vpaddb %%zmm28,%%zmm28,%%zmm28\n\t"
+ "vpaddb %%zmm30,%%zmm30,%%zmm30\n\t"
+ "vpandq %%zmm0,%%zmm21,%%zmm21\n\t"
+ "vpandq %%zmm0,%%zmm23,%%zmm23\n\t"
+ "vpandq %%zmm0,%%zmm29,%%zmm29\n\t"
+ "vpandq %%zmm0,%%zmm31,%%zmm31\n\t"
+ "vpxorq %%zmm21,%%zmm20,%%zmm20\n\t"
+ "vpxorq %%zmm23,%%zmm22,%%zmm22\n\t"
+ "vpxorq %%zmm29,%%zmm28,%%zmm28\n\t"
+ "vpxorq %%zmm31,%%zmm30,%%zmm30\n\t"
+ "vmovdqa64 %4,%%zmm21\n\t"
+ "vmovdqa64 %5,%%zmm23\n\t"
+ "vmovdqa64 %6,%%zmm29\n\t"
+ "vmovdqa64 %7,%%zmm31\n\t"
+ "vpxorq %%zmm21,%%zmm16,%%zmm16\n\t"
+ "vpxorq %%zmm23,%%zmm18,%%zmm18\n\t"
+ "vpxorq %%zmm29,%%zmm24,%%zmm24\n\t"
+ "vpxorq %%zmm31,%%zmm26,%%zmm26\n\t"
+ "vpxorq %%zmm21,%%zmm20,%%zmm20\n\t"
+ "vpxorq %%zmm23,%%zmm22,%%zmm22\n\t"
+ "vpxorq %%zmm29,%%zmm28,%%zmm28\n\t"
+ "vpxorq %%zmm31,%%zmm30,%%zmm30"
+ :
+ : "m" (dptr[z][d]), "m" (dptr[z][d+64]),
+ "m" (dptr[z][d+128]),
+ "m" (dptr[z][d+192]),
+ "m" (dptr[z][d+256]),
+ "m" (dptr[z][d+320]),
+ "m" (dptr[z][d+384]),
+ "m" (dptr[z][d+448]));
+ }
+ asm volatile("vmovntdq %%zmm2,%0\n\t"
+ "vpxorq %%zmm2,%%zmm2,%%zmm2\n\t"
+ "vmovntdq %%zmm3,%1\n\t"
+ "vpxorq %%zmm3,%%zmm3,%%zmm3\n\t"
+ "vmovntdq %%zmm10,%2\n\t"
+ "vpxorq %%zmm10,%%zmm10,%%zmm10\n\t"
+ "vmovntdq %%zmm11,%3\n\t"
+ "vpxorq %%zmm11,%%zmm11,%%zmm11\n\t"
+ "vmovntdq %%zmm4,%4\n\t"
+ "vpxorq %%zmm4,%%zmm4,%%zmm4\n\t"
+ "vmovntdq %%zmm6,%5\n\t"
+ "vpxorq %%zmm6,%%zmm6,%%zmm6\n\t"
+ "vmovntdq %%zmm12,%6\n\t"
+ "vpxorq %%zmm12,%%zmm12,%%zmm12\n\t"
+ "vmovntdq %%zmm14,%7\n\t"
+ "vpxorq %%zmm14,%%zmm14,%%zmm14\n\t"
+ "vmovntdq %%zmm16,%8\n\t"
+ "vpxorq %%zmm16,%%zmm16,%%zmm16\n\t"
+ "vmovntdq %%zmm18,%9\n\t"
+ "vpxorq %%zmm18,%%zmm18,%%zmm18\n\t"
+ "vmovntdq %%zmm24,%10\n\t"
+ "vpxorq %%zmm24,%%zmm24,%%zmm24\n\t"
+ "vmovntdq %%zmm26,%11\n\t"
+ "vpxorq %%zmm26,%%zmm26,%%zmm26\n\t"
+ "vmovntdq %%zmm20,%12\n\t"
+ "vpxorq %%zmm20,%%zmm20,%%zmm20\n\t"
+ "vmovntdq %%zmm22,%13\n\t"
+ "vpxorq %%zmm22,%%zmm22,%%zmm22\n\t"
+ "vmovntdq %%zmm28,%14\n\t"
+ "vpxorq %%zmm28,%%zmm28,%%zmm28\n\t"
+ "vmovntdq %%zmm30,%15\n\t"
+ "vpxorq %%zmm30,%%zmm30,%%zmm30"
+ :
+ : "m" (p[d]), "m" (p[d+64]), "m" (p[d+128]),
+ "m" (p[d+192]), "m" (q[d]), "m" (q[d+64]),
+ "m" (q[d+128]), "m" (q[d+192]), "m" (p[d+256]),
+ "m" (p[d+320]), "m" (p[d+384]), "m" (p[d+448]),
+ "m" (q[d+256]), "m" (q[d+320]), "m" (q[d+384]),
+ "m" (q[d+448]));
+ }
+
+ asm volatile("sfence" : : : "memory");
+ kernel_fpu_end();
+}
+
+const struct raid6_calls raid6_avx512x8 = {
+ raid6_avx5128_gen_syndrome,
+ NULL, /* XOR not yet implemented */
+ raid6_have_avx512,
+ "avx512x8",
+ 1 /* Has cache hints */
+};
#endif
#endif /* CONFIG_AS_AVX512 */
--
2.7.4
* [PATCH v2 4/6] lib/raid6: Add AVX512 optimized xor_syndrome functions
From: Gayatri Kammela @ 2016-08-13 1:03 UTC
To: linux-raid
Cc: shli, linux-kernel, ravi.v.shankar, Gayatri Kammela,
H . Peter Anvin, Jim Kukunas, Fenghua Yu, Megha Dey
In-Reply-To: <1471050204-26361-1-git-send-email-gayatri.kammela@intel.com>
Optimize RAID6 xor_syndrome functions to take advantage of the 512-bit
ZMM integer instructions introduced in AVX512.
The AVX512-optimized xor_syndrome functions are based on sse2.c,
written by hpa.
The patch was tested and benchmarked before submission on
hardware that supports the required AVX512 instructions.
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jim Kukunas <james.t.kukunas@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Megha Dey <megha.dey@linux.intel.com>
Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
---
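For readers following the inline assembly below, here is a scalar sketch of what
one xor_syndrome pass computes per byte. It is a reference model only, not code
from this patch: the names gf_mul2 and ref_xor_syndrome are hypothetical, and the
sketch assumes the usual RAID-6 field GF(2^8) with polynomial 0x11d (the 0x1d
constant loaded into zmm0). The blocks in ptrs[start..stop] are folded into the
existing P and Q; disks above stop are skipped, and disks below start only
contribute multiplications by 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply by 2 in GF(2^8), polynomial 0x11d (assumed RAID-6 field). */
static uint8_t gf_mul2(uint8_t v)
{
	return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0x00));
}

/*
 * Hypothetical scalar model of xor_syndrome: whatever is stored in
 * ptrs[start..stop] is XORed into the existing P and Q, in place.
 */
static void ref_xor_syndrome(int disks, int start, int stop,
			     size_t bytes, uint8_t **ptrs)
{
	uint8_t *p = ptrs[disks - 2];	/* XOR parity */
	uint8_t *q = ptrs[disks - 1];	/* RS syndrome */

	assert(0 <= start && start <= stop && stop <= disks - 3);

	for (size_t d = 0; d < bytes; d++) {
		/* Right side: disks above 'stop' are never touched */
		uint8_t wq = ptrs[stop][d];
		uint8_t wp = wq ^ p[d];

		/* Data pages between start and stop */
		for (int z = stop - 1; z >= start; z--) {
			uint8_t wd = ptrs[z][d];

			wq = gf_mul2(wq) ^ wd;
			wp ^= wd;
		}
		/* Left side: multiply only, no data is read */
		for (int z = start - 1; z >= 0; z--)
			wq = gf_mul2(wq);

		p[d] = wp;
		q[d] ^= wq;
	}
}
```

Because both updates are pure XORs, applying the same call twice restores the
original P and Q, which makes a convenient sanity check for the vector versions.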
lib/raid6/avx512.c | 281 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 278 insertions(+), 3 deletions(-)
diff --git a/lib/raid6/avx512.c b/lib/raid6/avx512.c
index b1188a6e51a6..f524a7972006 100644
--- a/lib/raid6/avx512.c
+++ b/lib/raid6/avx512.c
@@ -103,9 +103,68 @@ static void raid6_avx5121_gen_syndrome(int disks, size_t bytes, void **ptrs)
kernel_fpu_end();
}
+static void raid6_avx5121_xor_syndrome(int disks, int start, int stop,
+ size_t bytes, void **ptrs)
+{
+ u8 **dptr = (u8 **)ptrs;
+ u8 *p, *q;
+ int d, z, z0;
+
+ z0 = stop; /* P/Q right side optimization */
+ p = dptr[disks-2]; /* XOR parity */
+ q = dptr[disks-1]; /* RS syndrome */
+
+ kernel_fpu_begin();
+
+ asm volatile("vmovdqa64 %0,%%zmm0"
+ : : "m" (raid6_avx512_constants.x1d[0]));
+
+ for (d = 0 ; d < bytes ; d += 64) {
+ asm volatile("vmovdqa64 %0,%%zmm4\n\t"
+ "vmovdqa64 %1,%%zmm2\n\t"
+ "vpxorq %%zmm4,%%zmm2,%%zmm2"
+ :
+ : "m" (dptr[z0][d]), "m" (p[d]));
+ /* P/Q data pages */
+ for (z = z0-1 ; z >= start ; z--) {
+ asm volatile("vpxorq %%zmm5,%%zmm5,%%zmm5\n\t"
+ "vpcmpgtb %%zmm4,%%zmm5,%%k1\n\t"
+ "vpmovm2b %%k1,%%zmm5\n\t"
+ "vpaddb %%zmm4,%%zmm4,%%zmm4\n\t"
+ "vpandq %%zmm0,%%zmm5,%%zmm5\n\t"
+ "vpxorq %%zmm5,%%zmm4,%%zmm4\n\t"
+ "vmovdqa64 %0,%%zmm5\n\t"
+ "vpxorq %%zmm5,%%zmm2,%%zmm2\n\t"
+ "vpxorq %%zmm5,%%zmm4,%%zmm4"
+ :
+ : "m" (dptr[z][d]));
+ }
+ /* P/Q left side optimization */
+ for (z = start-1 ; z >= 0 ; z--) {
+ asm volatile("vpxorq %%zmm5,%%zmm5,%%zmm5\n\t"
+ "vpcmpgtb %%zmm4,%%zmm5,%%k1\n\t"
+ "vpmovm2b %%k1,%%zmm5\n\t"
+ "vpaddb %%zmm4,%%zmm4,%%zmm4\n\t"
+ "vpandq %%zmm0,%%zmm5,%%zmm5\n\t"
+ "vpxorq %%zmm5,%%zmm4,%%zmm4"
+ :
+ : );
+ }
+ asm volatile("vpxorq %0,%%zmm4,%%zmm4\n\t"
+ /* Don't use movntdq for r/w memory area < cache line */
+ "vmovdqa64 %%zmm4,%0\n\t"
+ "vmovdqa64 %%zmm2,%1"
+ :
+ : "m" (q[d]), "m" (p[d]));
+ }
+
+ asm volatile("sfence" : : : "memory");
+ kernel_fpu_end();
+}
+
const struct raid6_calls raid6_avx512x1 = {
raid6_avx5121_gen_syndrome,
- NULL, /* XOR not yet implemented */
+ raid6_avx5121_xor_syndrome,
raid6_have_avx512,
"avx512x1",
1 /* Has cache hints */
@@ -176,9 +235,93 @@ static void raid6_avx5122_gen_syndrome(int disks, size_t bytes, void **ptrs)
kernel_fpu_end();
}
+static void raid6_avx5122_xor_syndrome(int disks, int start, int stop,
+ size_t bytes, void **ptrs)
+{
+ u8 **dptr = (u8 **)ptrs;
+ u8 *p, *q;
+ int d, z, z0;
+
+ z0 = stop; /* P/Q right side optimization */
+ p = dptr[disks-2]; /* XOR parity */
+ q = dptr[disks-1]; /* RS syndrome */
+
+ kernel_fpu_begin();
+
+ asm volatile("vmovdqa64 %0,%%zmm0"
+ : : "m" (raid6_avx512_constants.x1d[0]));
+
+ for (d = 0 ; d < bytes ; d += 128) {
+ asm volatile("vmovdqa64 %0,%%zmm4\n\t"
+ "vmovdqa64 %1,%%zmm6\n\t"
+ "vmovdqa64 %2,%%zmm2\n\t"
+ "vmovdqa64 %3,%%zmm3\n\t"
+ "vpxorq %%zmm4,%%zmm2,%%zmm2\n\t"
+ "vpxorq %%zmm6,%%zmm3,%%zmm3"
+ :
+ : "m" (dptr[z0][d]), "m" (dptr[z0][d+64]),
+ "m" (p[d]), "m" (p[d+64]));
+ /* P/Q data pages */
+ for (z = z0-1 ; z >= start ; z--) {
+ asm volatile("vpxorq %%zmm5,%%zmm5,%%zmm5\n\t"
+ "vpxorq %%zmm7,%%zmm7,%%zmm7\n\t"
+ "vpcmpgtb %%zmm4,%%zmm5,%%k1\n\t"
+ "vpcmpgtb %%zmm6,%%zmm7,%%k2\n\t"
+ "vpmovm2b %%k1,%%zmm5\n\t"
+ "vpmovm2b %%k2,%%zmm7\n\t"
+ "vpaddb %%zmm4,%%zmm4,%%zmm4\n\t"
+ "vpaddb %%zmm6,%%zmm6,%%zmm6\n\t"
+ "vpandq %%zmm0,%%zmm5,%%zmm5\n\t"
+ "vpandq %%zmm0,%%zmm7,%%zmm7\n\t"
+ "vpxorq %%zmm5,%%zmm4,%%zmm4\n\t"
+ "vpxorq %%zmm7,%%zmm6,%%zmm6\n\t"
+ "vmovdqa64 %0,%%zmm5\n\t"
+ "vmovdqa64 %1,%%zmm7\n\t"
+ "vpxorq %%zmm5,%%zmm2,%%zmm2\n\t"
+ "vpxorq %%zmm7,%%zmm3,%%zmm3\n\t"
+ "vpxorq %%zmm5,%%zmm4,%%zmm4\n\t"
+ "vpxorq %%zmm7,%%zmm6,%%zmm6"
+ :
+ : "m" (dptr[z][d]), "m" (dptr[z][d+64]));
+ }
+ /* P/Q left side optimization */
+ for (z = start-1 ; z >= 0 ; z--) {
+ asm volatile("vpxorq %%zmm5,%%zmm5,%%zmm5\n\t"
+ "vpxorq %%zmm7,%%zmm7,%%zmm7\n\t"
+ "vpcmpgtb %%zmm4,%%zmm5,%%k1\n\t"
+ "vpcmpgtb %%zmm6,%%zmm7,%%k2\n\t"
+ "vpmovm2b %%k1,%%zmm5\n\t"
+ "vpmovm2b %%k2,%%zmm7\n\t"
+ "vpaddb %%zmm4,%%zmm4,%%zmm4\n\t"
+ "vpaddb %%zmm6,%%zmm6,%%zmm6\n\t"
+ "vpandq %%zmm0,%%zmm5,%%zmm5\n\t"
+ "vpandq %%zmm0,%%zmm7,%%zmm7\n\t"
+ "vpxorq %%zmm5,%%zmm4,%%zmm4\n\t"
+ "vpxorq %%zmm7,%%zmm6,%%zmm6"
+ :
+ : );
+ }
+ asm volatile("vpxorq %0,%%zmm4,%%zmm4\n\t"
+ "vpxorq %1,%%zmm6,%%zmm6\n\t"
+ /* Don't use movntdq for r/w
+ * memory area < cache line
+ */
+ "vmovdqa64 %%zmm4,%0\n\t"
+ "vmovdqa64 %%zmm6,%1\n\t"
+ "vmovdqa64 %%zmm2,%2\n\t"
+ "vmovdqa64 %%zmm3,%3"
+ :
+ : "m" (q[d]), "m" (q[d+64]), "m" (p[d]),
+ "m" (p[d+64]));
+ }
+
+ asm volatile("sfence" : : : "memory");
+ kernel_fpu_end();
+}
+
const struct raid6_calls raid6_avx512x2 = {
raid6_avx5122_gen_syndrome,
- NULL, /* XOR not yet implemented */
+ raid6_avx5122_xor_syndrome,
raid6_have_avx512,
"avx512x2",
1 /* Has cache hints */
@@ -282,9 +425,141 @@ static void raid6_avx5124_gen_syndrome(int disks, size_t bytes, void **ptrs)
kernel_fpu_end();
}
+static void raid6_avx5124_xor_syndrome(int disks, int start, int stop,
+ size_t bytes, void **ptrs)
+{
+ u8 **dptr = (u8 **)ptrs;
+ u8 *p, *q;
+ int d, z, z0;
+
+ z0 = stop; /* P/Q right side optimization */
+ p = dptr[disks-2]; /* XOR parity */
+ q = dptr[disks-1]; /* RS syndrome */
+
+ kernel_fpu_begin();
+
+ asm volatile("vmovdqa64 %0,%%zmm0"
+ :: "m" (raid6_avx512_constants.x1d[0]));
+
+ for (d = 0 ; d < bytes ; d += 256) {
+ asm volatile("vmovdqa64 %0,%%zmm4\n\t"
+ "vmovdqa64 %1,%%zmm6\n\t"
+ "vmovdqa64 %2,%%zmm12\n\t"
+ "vmovdqa64 %3,%%zmm14\n\t"
+ "vmovdqa64 %4,%%zmm2\n\t"
+ "vmovdqa64 %5,%%zmm3\n\t"
+ "vmovdqa64 %6,%%zmm10\n\t"
+ "vmovdqa64 %7,%%zmm11\n\t"
+ "vpxorq %%zmm4,%%zmm2,%%zmm2\n\t"
+ "vpxorq %%zmm6,%%zmm3,%%zmm3\n\t"
+ "vpxorq %%zmm12,%%zmm10,%%zmm10\n\t"
+ "vpxorq %%zmm14,%%zmm11,%%zmm11"
+ :
+ : "m" (dptr[z0][d]), "m" (dptr[z0][d+64]),
+ "m" (dptr[z0][d+128]), "m" (dptr[z0][d+192]),
+ "m" (p[d]), "m" (p[d+64]), "m" (p[d+128]),
+ "m" (p[d+192]));
+ /* P/Q data pages */
+ for (z = z0-1 ; z >= start ; z--) {
+ asm volatile("vpxorq %%zmm5,%%zmm5,%%zmm5\n\t"
+ "vpxorq %%zmm7,%%zmm7,%%zmm7\n\t"
+ "vpxorq %%zmm13,%%zmm13,%%zmm13\n\t"
+ "vpxorq %%zmm15,%%zmm15,%%zmm15\n\t"
+ "prefetchnta %0\n\t"
+ "prefetchnta %2\n\t"
+ "vpcmpgtb %%zmm4,%%zmm5,%%k1\n\t"
+ "vpcmpgtb %%zmm6,%%zmm7,%%k2\n\t"
+ "vpcmpgtb %%zmm12,%%zmm13,%%k3\n\t"
+ "vpcmpgtb %%zmm14,%%zmm15,%%k4\n\t"
+ "vpmovm2b %%k1,%%zmm5\n\t"
+ "vpmovm2b %%k2,%%zmm7\n\t"
+ "vpmovm2b %%k3,%%zmm13\n\t"
+ "vpmovm2b %%k4,%%zmm15\n\t"
+ "vpaddb %%zmm4,%%zmm4,%%zmm4\n\t"
+ "vpaddb %%zmm6,%%zmm6,%%zmm6\n\t"
+ "vpaddb %%zmm12,%%zmm12,%%zmm12\n\t"
+ "vpaddb %%zmm14,%%zmm14,%%zmm14\n\t"
+ "vpandq %%zmm0,%%zmm5,%%zmm5\n\t"
+ "vpandq %%zmm0,%%zmm7,%%zmm7\n\t"
+ "vpandq %%zmm0,%%zmm13,%%zmm13\n\t"
+ "vpandq %%zmm0,%%zmm15,%%zmm15\n\t"
+ "vpxorq %%zmm5,%%zmm4,%%zmm4\n\t"
+ "vpxorq %%zmm7,%%zmm6,%%zmm6\n\t"
+ "vpxorq %%zmm13,%%zmm12,%%zmm12\n\t"
+ "vpxorq %%zmm15,%%zmm14,%%zmm14\n\t"
+ "vmovdqa64 %0,%%zmm5\n\t"
+ "vmovdqa64 %1,%%zmm7\n\t"
+ "vmovdqa64 %2,%%zmm13\n\t"
+ "vmovdqa64 %3,%%zmm15\n\t"
+ "vpxorq %%zmm5,%%zmm2,%%zmm2\n\t"
+ "vpxorq %%zmm7,%%zmm3,%%zmm3\n\t"
+ "vpxorq %%zmm13,%%zmm10,%%zmm10\n\t"
+ "vpxorq %%zmm15,%%zmm11,%%zmm11\n\t"
+ "vpxorq %%zmm5,%%zmm4,%%zmm4\n\t"
+ "vpxorq %%zmm7,%%zmm6,%%zmm6\n\t"
+ "vpxorq %%zmm13,%%zmm12,%%zmm12\n\t"
+ "vpxorq %%zmm15,%%zmm14,%%zmm14"
+ :
+ : "m" (dptr[z][d]), "m" (dptr[z][d+64]),
+ "m" (dptr[z][d+128]),
+ "m" (dptr[z][d+192]));
+ }
+ asm volatile("prefetchnta %0\n\t"
+ "prefetchnta %1\n\t"
+ :
+ : "m" (q[d]), "m" (q[d+128]));
+ /* P/Q left side optimization */
+ for (z = start-1 ; z >= 0 ; z--) {
+ asm volatile("vpxorq %%zmm5,%%zmm5,%%zmm5\n\t"
+ "vpxorq %%zmm7,%%zmm7,%%zmm7\n\t"
+ "vpxorq %%zmm13,%%zmm13,%%zmm13\n\t"
+ "vpxorq %%zmm15,%%zmm15,%%zmm15\n\t"
+ "vpcmpgtb %%zmm4,%%zmm5,%%k1\n\t"
+ "vpcmpgtb %%zmm6,%%zmm7,%%k2\n\t"
+ "vpcmpgtb %%zmm12,%%zmm13,%%k3\n\t"
+ "vpcmpgtb %%zmm14,%%zmm15,%%k4\n\t"
+ "vpmovm2b %%k1,%%zmm5\n\t"
+ "vpmovm2b %%k2,%%zmm7\n\t"
+ "vpmovm2b %%k3,%%zmm13\n\t"
+ "vpmovm2b %%k4,%%zmm15\n\t"
+ "vpaddb %%zmm4,%%zmm4,%%zmm4\n\t"
+ "vpaddb %%zmm6,%%zmm6,%%zmm6\n\t"
+ "vpaddb %%zmm12,%%zmm12,%%zmm12\n\t"
+ "vpaddb %%zmm14,%%zmm14,%%zmm14\n\t"
+ "vpandq %%zmm0,%%zmm5,%%zmm5\n\t"
+ "vpandq %%zmm0,%%zmm7,%%zmm7\n\t"
+ "vpandq %%zmm0,%%zmm13,%%zmm13\n\t"
+ "vpandq %%zmm0,%%zmm15,%%zmm15\n\t"
+ "vpxorq %%zmm5,%%zmm4,%%zmm4\n\t"
+ "vpxorq %%zmm7,%%zmm6,%%zmm6\n\t"
+ "vpxorq %%zmm13,%%zmm12,%%zmm12\n\t"
+ "vpxorq %%zmm15,%%zmm14,%%zmm14"
+ :
+ : );
+ }
+ asm volatile("vmovntdq %%zmm2,%0\n\t"
+ "vmovntdq %%zmm3,%1\n\t"
+ "vmovntdq %%zmm10,%2\n\t"
+ "vmovntdq %%zmm11,%3\n\t"
+ "vpxorq %4,%%zmm4,%%zmm4\n\t"
+ "vpxorq %5,%%zmm6,%%zmm6\n\t"
+ "vpxorq %6,%%zmm12,%%zmm12\n\t"
+ "vpxorq %7,%%zmm14,%%zmm14\n\t"
+ "vmovntdq %%zmm4,%4\n\t"
+ "vmovntdq %%zmm6,%5\n\t"
+ "vmovntdq %%zmm12,%6\n\t"
+ "vmovntdq %%zmm14,%7"
+ :
+ : "m" (p[d]), "m" (p[d+64]), "m" (p[d+128]),
+ "m" (p[d+192]), "m" (q[d]), "m" (q[d+64]),
+ "m" (q[d+128]), "m" (q[d+192]));
+ }
+ asm volatile("sfence" : : : "memory");
+ kernel_fpu_end();
+}
const struct raid6_calls raid6_avx512x4 = {
raid6_avx5124_gen_syndrome,
- NULL, /* XOR not yet implemented */
+ raid6_avx5124_xor_syndrome,
raid6_have_avx512,
"avx512x4",
1 /* Has cache hints */
--
2.7.4
* [PATCH v2 3/6] lib/raid6/test/Makefile: Add avx512 gen_syndrome and recovery functions
From: Gayatri Kammela @ 2016-08-13 1:03 UTC
To: linux-raid
Cc: shli, linux-kernel, ravi.v.shankar, Gayatri Kammela,
H . Peter Anvin, Jim Kukunas, Fenghua Yu, Megha Dey
In-Reply-To: <1471050204-26361-1-git-send-email-gayatri.kammela@intel.com>
Add the avx512 gen_syndrome and recovery functions to the test Makefile
so that the code can be compiled and tested in userspace.
This patch was tested in userspace, and a performance improvement was
observed.
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jim Kukunas <james.t.kukunas@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
---
lib/raid6/test/Makefile | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/lib/raid6/test/Makefile b/lib/raid6/test/Makefile
index 29090f3db677..2c7b60edea04 100644
--- a/lib/raid6/test/Makefile
+++ b/lib/raid6/test/Makefile
@@ -32,10 +32,13 @@ ifeq ($(ARCH),arm64)
endif
ifeq ($(IS_X86),yes)
- OBJS += mmx.o sse1.o sse2.o avx2.o recov_ssse3.o recov_avx2.o
+ OBJS += mmx.o sse1.o sse2.o avx2.o recov_ssse3.o recov_avx2.o avx512.o recov_avx512.o
CFLAGS += $(shell echo "vpbroadcastb %xmm0, %ymm1" | \
gcc -c -x assembler - >&/dev/null && \
rm ./-.o && echo -DCONFIG_AS_AVX2=1)
+ CFLAGS += $(shell echo "vpmovm2b %k1, %zmm5" | \
+ gcc -c -x assembler - >&/dev/null && \
+ rm ./-.o && echo -DCONFIG_AS_AVX512=1)
else ifeq ($(HAS_NEON),yes)
OBJS += neon.o neon1.o neon2.o neon4.o neon8.o
CFLAGS += -DCONFIG_KERNEL_MODE_NEON=1
--
2.7.4
* [PATCH v2 2/6] lib/raid6: Add AVX512 optimized recovery functions
From: Gayatri Kammela @ 2016-08-13 1:03 UTC
To: linux-raid
Cc: shli, linux-kernel, ravi.v.shankar, Gayatri Kammela, Jim Kukunas,
H . Peter Anvin, Fenghua Yu, Megha Dey
In-Reply-To: <1471050204-26361-1-git-send-email-gayatri.kammela@intel.com>
Optimize RAID6 recovery functions to take advantage of
the 512-bit ZMM integer instructions introduced in AVX512.
The AVX512-optimized recovery functions are based
on recov_avx2.c, written by Jim Kukunas.
This patch was tested and benchmarked before submission on
hardware that supports the required AVX512 instructions.
Cc: Jim Kukunas <james.t.kukunas@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
---
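The recovery code below multiplies by a GF(2^8) constant using two 16-entry
lookups (vpshufb on the low and high nibbles) followed by an XOR. A scalar sketch
of that idea follows. It is an illustration only: gf_mul, build_nibble_table,
nibble_mul and check_nibble_mul are hypothetical names, and the 32-byte table
layout (low-nibble products first, high-nibble products at offset 16) is an
assumption inferred from the qmul[0]/qmul[16] and pbmul[0]/pbmul[16] operands.

```c
#include <assert.h>
#include <stdint.h>

/* Bitwise GF(2^8) multiply, polynomial 0x11d (reference only). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
	uint8_t r = 0;

	while (b) {
		if (b & 1)
			r ^= a;
		a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0x00));
		b >>= 1;
	}
	return r;
}

/*
 * Build a 32-byte table for constant c: entries 0..15 hold c * n for
 * the low nibble n, entries 16..31 hold c * (n << 4) for the high nibble.
 */
static void build_nibble_table(uint8_t c, uint8_t tbl[32])
{
	for (int n = 0; n < 16; n++) {
		tbl[n] = gf_mul(c, (uint8_t)n);
		tbl[16 + n] = gf_mul(c, (uint8_t)(n << 4));
	}
}

/* One byte of what the two table lookups plus the XOR compute. */
static uint8_t nibble_mul(const uint8_t tbl[32], uint8_t x)
{
	return tbl[x & 0x0f] ^ tbl[16 + (x >> 4)];
}

/* Exhaustive self-check of the nibble split for one constant. */
static void check_nibble_mul(uint8_t c)
{
	uint8_t tbl[32];

	build_nibble_table(c, tbl);
	for (int x = 0; x < 256; x++)
		assert(nibble_mul(tbl, (uint8_t)x) == gf_mul(c, (uint8_t)x));
}
```

The split works because multiplication in GF(2^8) distributes over XOR, so
c * x equals c * (x & 0x0f) XOR c * (x & 0xf0).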
include/linux/raid/pq.h | 1 +
lib/raid6/Makefile | 2 +-
lib/raid6/algos.c | 3 +
lib/raid6/recov_avx512.c | 388 +++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 393 insertions(+), 1 deletion(-)
create mode 100644 lib/raid6/recov_avx512.c
diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
index 0c529a55b52e..1abd89584568 100644
--- a/include/linux/raid/pq.h
+++ b/include/linux/raid/pq.h
@@ -118,6 +118,7 @@ struct raid6_recov_calls {
extern const struct raid6_recov_calls raid6_recov_intx1;
extern const struct raid6_recov_calls raid6_recov_ssse3;
extern const struct raid6_recov_calls raid6_recov_avx2;
+extern const struct raid6_recov_calls raid6_recov_avx512;
extern const struct raid6_calls raid6_neonx1;
extern const struct raid6_calls raid6_neonx2;
diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
index 8948268d47b4..cd05ee1fb809 100644
--- a/lib/raid6/Makefile
+++ b/lib/raid6/Makefile
@@ -3,7 +3,7 @@ obj-$(CONFIG_RAID6_PQ) += raid6_pq.o
raid6_pq-y += algos.o recov.o tables.o int1.o int2.o int4.o \
int8.o int16.o int32.o
-raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o avx2.o avx512.o
+raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o avx2.o avx512.o recov_avx512.o
raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o
raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o
raid6_pq-$(CONFIG_TILEGX) += tilegx8.o
diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
index f5f090c52dd9..149d947a4fec 100644
--- a/lib/raid6/algos.c
+++ b/lib/raid6/algos.c
@@ -98,6 +98,9 @@ void (*raid6_datap_recov)(int, size_t, int, void **);
EXPORT_SYMBOL_GPL(raid6_datap_recov);
const struct raid6_recov_calls *const raid6_recov_algos[] = {
+#ifdef CONFIG_AS_AVX512
+ &raid6_recov_avx512,
+#endif
#ifdef CONFIG_AS_AVX2
&raid6_recov_avx2,
#endif
diff --git a/lib/raid6/recov_avx512.c b/lib/raid6/recov_avx512.c
new file mode 100644
index 000000000000..625aafa33b61
--- /dev/null
+++ b/lib/raid6/recov_avx512.c
@@ -0,0 +1,388 @@
+/*
+ * Copyright (C) 2016 Intel Corporation
+ *
+ * Author: Gayatri Kammela <gayatri.kammela@intel.com>
+ * Author: Megha Dey <megha.dey@linux.intel.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ *
+ */
+
+#ifdef CONFIG_AS_AVX512
+
+#include <linux/raid/pq.h>
+#include "x86.h"
+
+static int raid6_has_avx512(void)
+{
+ return boot_cpu_has(X86_FEATURE_AVX2) &&
+ boot_cpu_has(X86_FEATURE_AVX) &&
+ boot_cpu_has(X86_FEATURE_AVX512F) &&
+ boot_cpu_has(X86_FEATURE_AVX512BW) &&
+ boot_cpu_has(X86_FEATURE_AVX512VL) &&
+ boot_cpu_has(X86_FEATURE_AVX512DQ);
+}
+
+static void raid6_2data_recov_avx512(int disks, size_t bytes, int faila,
+ int failb, void **ptrs)
+{
+ u8 *p, *q, *dp, *dq;
+ const u8 *pbmul; /* P multiplier table for B data */
+ const u8 *qmul; /* Q multiplier table (for both) */
+ const u8 x0f = 0x0f;
+
+ p = (u8 *)ptrs[disks-2];
+ q = (u8 *)ptrs[disks-1];
+
+ /*
+ * Compute syndrome with zero for the missing data pages
+ * Use the dead data pages as temporary storage for
+ * delta p and delta q
+ */
+
+ dp = (u8 *)ptrs[faila];
+ ptrs[faila] = (void *)raid6_empty_zero_page;
+ ptrs[disks-2] = dp;
+ dq = (u8 *)ptrs[failb];
+ ptrs[failb] = (void *)raid6_empty_zero_page;
+ ptrs[disks-1] = dq;
+
+ raid6_call.gen_syndrome(disks, bytes, ptrs);
+
+ /* Restore pointer table */
+ ptrs[faila] = dp;
+ ptrs[failb] = dq;
+ ptrs[disks-2] = p;
+ ptrs[disks-1] = q;
+
+ /* Now, pick the proper data tables */
+ pbmul = raid6_vgfmul[raid6_gfexi[failb-faila]];
+ qmul = raid6_vgfmul[raid6_gfinv[raid6_gfexp[faila] ^
+ raid6_gfexp[failb]]];
+
+ kernel_fpu_begin();
+
+ /* zmm7 = x0f[64] */
+ asm volatile("vpbroadcastb %0, %%zmm7" : : "m" (x0f));
+
+ while (bytes) {
+#ifdef CONFIG_X86_64
+ asm volatile("vmovdqa64 %0, %%zmm1\n\t"
+ "vmovdqa64 %1, %%zmm9\n\t"
+ "vmovdqa64 %2, %%zmm0\n\t"
+ "vmovdqa64 %3, %%zmm8\n\t"
+ "vpxorq %4, %%zmm1, %%zmm1\n\t"
+ "vpxorq %5, %%zmm9, %%zmm9\n\t"
+ "vpxorq %6, %%zmm0, %%zmm0\n\t"
+ "vpxorq %7, %%zmm8, %%zmm8"
+ :
+ : "m" (q[0]), "m" (q[64]), "m" (p[0]),
+ "m" (p[64]), "m" (dq[0]), "m" (dq[64]),
+ "m" (dp[0]), "m" (dp[64]));
+
+ /*
+ * 1 = dq[0] ^ q[0]
+ * 9 = dq[64] ^ q[64]
+ * 0 = dp[0] ^ p[0]
+ * 8 = dp[64] ^ p[64]
+ */
+
+ asm volatile("vbroadcasti64x2 %0, %%zmm4\n\t"
+ "vbroadcasti64x2 %1, %%zmm5"
+ :
+ : "m" (qmul[0]), "m" (qmul[16]));
+
+ asm volatile("vpsraw $4, %%zmm1, %%zmm3\n\t"
+ "vpsraw $4, %%zmm9, %%zmm12\n\t"
+ "vpandq %%zmm7, %%zmm1, %%zmm1\n\t"
+ "vpandq %%zmm7, %%zmm9, %%zmm9\n\t"
+ "vpandq %%zmm7, %%zmm3, %%zmm3\n\t"
+ "vpandq %%zmm7, %%zmm12, %%zmm12\n\t"
+ "vpshufb %%zmm9, %%zmm4, %%zmm14\n\t"
+ "vpshufb %%zmm1, %%zmm4, %%zmm4\n\t"
+ "vpshufb %%zmm12, %%zmm5, %%zmm15\n\t"
+ "vpshufb %%zmm3, %%zmm5, %%zmm5\n\t"
+ "vpxorq %%zmm14, %%zmm15, %%zmm15\n\t"
+ "vpxorq %%zmm4, %%zmm5, %%zmm5"
+ :
+ : );
+
+ /*
+ * 5 = qx[0]
+ * 15 = qx[64]
+ */
+
+ asm volatile("vbroadcasti64x2 %0, %%zmm4\n\t"
+ "vbroadcasti64x2 %1, %%zmm1\n\t"
+ "vpsraw $4, %%zmm0, %%zmm2\n\t"
+ "vpsraw $4, %%zmm8, %%zmm6\n\t"
+ "vpandq %%zmm7, %%zmm0, %%zmm3\n\t"
+ "vpandq %%zmm7, %%zmm8, %%zmm14\n\t"
+ "vpandq %%zmm7, %%zmm2, %%zmm2\n\t"
+ "vpandq %%zmm7, %%zmm6, %%zmm6\n\t"
+ "vpshufb %%zmm14, %%zmm4, %%zmm12\n\t"
+ "vpshufb %%zmm3, %%zmm4, %%zmm4\n\t"
+ "vpshufb %%zmm6, %%zmm1, %%zmm13\n\t"
+ "vpshufb %%zmm2, %%zmm1, %%zmm1\n\t"
+ "vpxorq %%zmm4, %%zmm1, %%zmm1\n\t"
+ "vpxorq %%zmm12, %%zmm13, %%zmm13"
+ :
+ : "m" (pbmul[0]), "m" (pbmul[16]));
+
+ /*
+ * 1 = pbmul[px[0]]
+ * 13 = pbmul[px[64]]
+ */
+ asm volatile("vpxorq %%zmm5, %%zmm1, %%zmm1\n\t"
+ "vpxorq %%zmm15, %%zmm13, %%zmm13"
+ :
+ : );
+
+ /*
+ * 1 = db = DQ
+ * 13 = db[64] = DQ[64]
+ */
+ asm volatile("vmovdqa64 %%zmm1, %0\n\t"
+ "vmovdqa64 %%zmm13,%1\n\t"
+ "vpxorq %%zmm1, %%zmm0, %%zmm0\n\t"
+ "vpxorq %%zmm13, %%zmm8, %%zmm8"
+ :
+ : "m" (dq[0]), "m" (dq[64]));
+
+ asm volatile("vmovdqa64 %%zmm0, %0\n\t"
+ "vmovdqa64 %%zmm8, %1"
+ :
+ : "m" (dp[0]), "m" (dp[64]));
+
+ bytes -= 128;
+ p += 128;
+ q += 128;
+ dp += 128;
+ dq += 128;
+#else
+ asm volatile("vmovdqa64 %0, %%zmm1\n\t"
+ "vmovdqa64 %1, %%zmm0\n\t"
+ "vpxorq %2, %%zmm1, %%zmm1\n\t"
+ "vpxorq %3, %%zmm0, %%zmm0"
+ :
+ : "m" (*q), "m" (*p), "m"(*dq), "m" (*dp));
+
+ /* 1 = dq ^ q; 0 = dp ^ p */
+
+ asm volatile("vbroadcasti64x2 %0, %%zmm4\n\t"
+ "vbroadcasti64x2 %1, %%zmm5"
+ :
+ : "m" (qmul[0]), "m" (qmul[16]));
+
+ /*
+ * 1 = dq ^ q
+ * 3 = (dq ^ q) >> 4
+ */
+ asm volatile("vpsraw $4, %%zmm1, %%zmm3\n\t"
+ "vpandq %%zmm7, %%zmm1, %%zmm1\n\t"
+ "vpandq %%zmm7, %%zmm3, %%zmm3\n\t"
+ "vpshufb %%zmm1, %%zmm4, %%zmm4\n\t"
+ "vpshufb %%zmm3, %%zmm5, %%zmm5\n\t"
+ "vpxorq %%zmm4, %%zmm5, %%zmm5"
+ :
+ : );
+
+ /* 5 = qx */
+
+ asm volatile("vbroadcasti64x2 %0, %%zmm4\n\t"
+ "vbroadcasti64x2 %1, %%zmm1"
+ :
+ : "m" (pbmul[0]), "m" (pbmul[16]));
+
+ asm volatile("vpsraw $4, %%zmm0, %%zmm2\n\t"
+ "vpandq %%zmm7, %%zmm0, %%zmm3\n\t"
+ "vpandq %%zmm7, %%zmm2, %%zmm2\n\t"
+ "vpshufb %%zmm3, %%zmm4, %%zmm4\n\t"
+ "vpshufb %%zmm2, %%zmm1, %%zmm1\n\t"
+ "vpxorq %%zmm4, %%zmm1, %%zmm1"
+ :
+ : );
+
+ /* 1 = pbmul[px] */
+ asm volatile("vpxorq %%zmm5, %%zmm1, %%zmm1\n\t"
+ /* 1 = db = DQ */
+ "vmovdqa64 %%zmm1, %0\n\t"
+ :
+ : "m" (dq[0]));
+
+ asm volatile("vpxorq %%zmm1, %%zmm0, %%zmm0\n\t"
+ "vmovdqa64 %%zmm0, %0"
+ :
+ : "m" (dp[0]));
+
+ bytes -= 64;
+ p += 64;
+ q += 64;
+ dp += 64;
+ dq += 64;
+#endif
+ }
+
+ kernel_fpu_end();
+}
+
+static void raid6_datap_recov_avx512(int disks, size_t bytes, int faila,
+ void **ptrs)
+{
+ u8 *p, *q, *dq;
+ const u8 *qmul; /* Q multiplier table */
+ const u8 x0f = 0x0f;
+
+ p = (u8 *)ptrs[disks-2];
+ q = (u8 *)ptrs[disks-1];
+
+ /*
+ * Compute syndrome with zero for the missing data page
+ * Use the dead data page as temporary storage for delta q
+ */
+
+ dq = (u8 *)ptrs[faila];
+ ptrs[faila] = (void *)raid6_empty_zero_page;
+ ptrs[disks-1] = dq;
+
+ raid6_call.gen_syndrome(disks, bytes, ptrs);
+
+ /* Restore pointer table */
+ ptrs[faila] = dq;
+ ptrs[disks-1] = q;
+
+ /* Now, pick the proper data tables */
+ qmul = raid6_vgfmul[raid6_gfinv[raid6_gfexp[faila]]];
+
+ kernel_fpu_begin();
+
+ asm volatile("vpbroadcastb %0, %%zmm7" : : "m" (x0f));
+
+ while (bytes) {
+#ifdef CONFIG_X86_64
+ asm volatile("vmovdqa64 %0, %%zmm3\n\t"
+ "vmovdqa64 %1, %%zmm8\n\t"
+ "vpxorq %2, %%zmm3, %%zmm3\n\t"
+ "vpxorq %3, %%zmm8, %%zmm8"
+ :
+ : "m" (dq[0]), "m" (dq[64]), "m" (q[0]),
+ "m" (q[64]));
+
+ /*
+ * 3 = q[0] ^ dq[0]
+ * 8 = q[64] ^ dq[64]
+ */
+ asm volatile("vbroadcasti64x2 %0, %%zmm0\n\t"
+ "vmovapd %%zmm0, %%zmm13\n\t"
+ "vbroadcasti64x2 %1, %%zmm1\n\t"
+ "vmovapd %%zmm1, %%zmm14"
+ :
+ : "m" (qmul[0]), "m" (qmul[16]));
+
+ asm volatile("vpsraw $4, %%zmm3, %%zmm6\n\t"
+ "vpsraw $4, %%zmm8, %%zmm12\n\t"
+ "vpandq %%zmm7, %%zmm3, %%zmm3\n\t"
+ "vpandq %%zmm7, %%zmm8, %%zmm8\n\t"
+ "vpandq %%zmm7, %%zmm6, %%zmm6\n\t"
+ "vpandq %%zmm7, %%zmm12, %%zmm12\n\t"
+ "vpshufb %%zmm3, %%zmm0, %%zmm0\n\t"
+ "vpshufb %%zmm8, %%zmm13, %%zmm13\n\t"
+ "vpshufb %%zmm6, %%zmm1, %%zmm1\n\t"
+ "vpshufb %%zmm12, %%zmm14, %%zmm14\n\t"
+ "vpxorq %%zmm0, %%zmm1, %%zmm1\n\t"
+ "vpxorq %%zmm13, %%zmm14, %%zmm14"
+ :
+ : );
+
+ /*
+ * 1 = qmul[q[0] ^ dq[0]]
+ * 14 = qmul[q[64] ^ dq[64]]
+ */
+ asm volatile("vmovdqa64 %0, %%zmm2\n\t"
+ "vmovdqa64 %1, %%zmm12\n\t"
+ "vpxorq %%zmm1, %%zmm2, %%zmm2\n\t"
+ "vpxorq %%zmm14, %%zmm12, %%zmm12"
+ :
+ : "m" (p[0]), "m" (p[64]));
+
+ /*
+ * 2 = p[0] ^ qmul[q[0] ^ dq[0]]
+ * 12 = p[64] ^ qmul[q[64] ^ dq[64]]
+ */
+
+ asm volatile("vmovdqa64 %%zmm1, %0\n\t"
+ "vmovdqa64 %%zmm14, %1\n\t"
+ "vmovdqa64 %%zmm2, %2\n\t"
+ "vmovdqa64 %%zmm12,%3"
+ :
+ : "m" (dq[0]), "m" (dq[64]), "m" (p[0]),
+ "m" (p[64]));
+
+ bytes -= 128;
+ p += 128;
+ q += 128;
+ dq += 128;
+#else
+ asm volatile("vmovdqa64 %0, %%zmm3\n\t"
+ "vpxorq %1, %%zmm3, %%zmm3"
+ :
+ : "m" (dq[0]), "m" (q[0]));
+
+ /* 3 = q ^ dq */
+
+ asm volatile("vbroadcasti64x2 %0, %%zmm0\n\t"
+ "vbroadcasti64x2 %1, %%zmm1"
+ :
+ : "m" (qmul[0]), "m" (qmul[16]));
+
+ asm volatile("vpsraw $4, %%zmm3, %%zmm6\n\t"
+ "vpandq %%zmm7, %%zmm3, %%zmm3\n\t"
+ "vpandq %%zmm7, %%zmm6, %%zmm6\n\t"
+ "vpshufb %%zmm3, %%zmm0, %%zmm0\n\t"
+ "vpshufb %%zmm6, %%zmm1, %%zmm1\n\t"
+ "vpxorq %%zmm0, %%zmm1, %%zmm1"
+ :
+ : );
+
+ /* 1 = qmul[q ^ dq] */
+
+ asm volatile("vmovdqa64 %0, %%zmm2\n\t"
+ "vpxorq %%zmm1, %%zmm2, %%zmm2"
+ :
+ : "m" (p[0]));
+
+ /* 2 = p ^ qmul[q ^ dq] */
+
+ asm volatile("vmovdqa64 %%zmm1, %0\n\t"
+ "vmovdqa64 %%zmm2, %1"
+ :
+ : "m" (dq[0]), "m" (p[0]));
+
+ bytes -= 64;
+ p += 64;
+ q += 64;
+ dq += 64;
+#endif
+ }
+
+ kernel_fpu_end();
+}
+
+const struct raid6_recov_calls raid6_recov_avx512 = {
+ .data2 = raid6_2data_recov_avx512,
+ .datap = raid6_datap_recov_avx512,
+ .valid = raid6_has_avx512,
+#ifdef CONFIG_X86_64
+ .name = "avx512x2",
+#else
+ .name = "avx512x1",
+#endif
+ .priority = 3,
+};
+
+#else
+#warning "your version of binutils lacks AVX512 support"
+#endif
--
2.7.4
* [PATCH v2 1/6] lib/raid6: Add AVX512 optimized gen_syndrome functions
From: Gayatri Kammela @ 2016-08-13 1:03 UTC
To: linux-raid
Cc: shli, linux-kernel, ravi.v.shankar, Gayatri Kammela,
H . Peter Anvin, Jim Kukunas, Fenghua Yu, Megha Dey
In-Reply-To: <1471050204-26361-1-git-send-email-gayatri.kammela@intel.com>
Optimize RAID6 gen_syndrome functions to take advantage of
the 512-bit ZMM integer instructions introduced in AVX512.
The AVX512-optimized gen_syndrome functions are based
on avx2.c, written by Yuanhan Liu, and sse2.c, written by hpa.
The patch was tested and benchmarked before submission on
hardware that supports the required AVX512 instructions.
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jim Kukunas <james.t.kukunas@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
---
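As a reading aid for the assembly in this patch, the following scalar sketch
shows what gen_syndrome computes for each byte position: P is the XOR of all data
blocks, and Q is built by Horner's method with repeated multiplication by 2 in
GF(2^8). It is not code from the patch. The names gf_mul2 and ref_gen_syndrome
are hypothetical, and the field polynomial 0x11d is assumed from the 0x1d
constant in raid6_avx512_constants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply one byte by 2 in GF(2^8), polynomial 0x11d (assumed). */
static uint8_t gf_mul2(uint8_t v)
{
	/* vpcmpgtb against zero + vpmovm2b: 0xff where the top bit is set */
	uint8_t mask = (v & 0x80) ? 0xff : 0x00;

	/* vpaddb doubles the byte; vpandq/vpxorq fold in 0x1d on overflow */
	return (uint8_t)((v << 1) ^ (mask & 0x1d));
}

/*
 * Hypothetical scalar model of gen_syndrome: ptrs[0..disks-3] are data
 * blocks, ptrs[disks-2] receives P and ptrs[disks-1] receives Q.
 */
static void ref_gen_syndrome(int disks, size_t bytes, uint8_t **ptrs)
{
	int z0 = disks - 3;		/* highest data disk */
	uint8_t *p, *q;

	assert(disks >= 3);
	p = ptrs[z0 + 1];		/* XOR parity */
	q = ptrs[z0 + 2];		/* RS syndrome */

	for (size_t d = 0; d < bytes; d++) {
		uint8_t wp = ptrs[z0][d];
		uint8_t wq = wp;

		/* Horner's method: Q = (...(D[z0]*2 ^ D[z0-1])*2 ...) ^ D[0] */
		for (int z = z0 - 1; z >= 0; z--) {
			uint8_t wd = ptrs[z][d];

			wp ^= wd;
			wq = gf_mul2(wq) ^ wd;
		}
		p[d] = wp;
		q[d] = wq;
	}
}
```

The vector versions process 64, 128 or 256 bytes per outer iteration, but each
byte lane follows the same recurrence.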
arch/x86/Makefile | 5 +-
include/linux/raid/pq.h | 3 +
lib/raid6/Makefile | 2 +-
lib/raid6/algos.c | 9 ++
lib/raid6/avx512.c | 294 ++++++++++++++++++++++++++++++++++++++++++++++++
lib/raid6/x86.h | 10 ++
6 files changed, 320 insertions(+), 3 deletions(-)
create mode 100644 lib/raid6/avx512.c
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 830ed391e7ef..2d449337a360 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -163,11 +163,12 @@ asinstr += $(call as-instr,pshufb %xmm0$(comma)%xmm0,-DCONFIG_AS_SSSE3=1)
asinstr += $(call as-instr,crc32l %eax$(comma)%eax,-DCONFIG_AS_CRC32=1)
avx_instr := $(call as-instr,vxorps %ymm0$(comma)%ymm1$(comma)%ymm2,-DCONFIG_AS_AVX=1)
avx2_instr :=$(call as-instr,vpbroadcastb %xmm0$(comma)%ymm1,-DCONFIG_AS_AVX2=1)
+avx512_instr :=$(call as-instr,vpmovm2b %k1$(comma)%zmm5,-DCONFIG_AS_AVX512=1)
sha1_ni_instr :=$(call as-instr,sha1msg1 %xmm0$(comma)%xmm1,-DCONFIG_AS_SHA1_NI=1)
sha256_ni_instr :=$(call as-instr,sha256msg1 %xmm0$(comma)%xmm1,-DCONFIG_AS_SHA256_NI=1)
-KBUILD_AFLAGS += $(cfi) $(cfi-sigframe) $(cfi-sections) $(asinstr) $(avx_instr) $(avx2_instr) $(sha1_ni_instr) $(sha256_ni_instr)
-KBUILD_CFLAGS += $(cfi) $(cfi-sigframe) $(cfi-sections) $(asinstr) $(avx_instr) $(avx2_instr) $(sha1_ni_instr) $(sha256_ni_instr)
+KBUILD_AFLAGS += $(cfi) $(cfi-sigframe) $(cfi-sections) $(asinstr) $(avx_instr) $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr)
+KBUILD_CFLAGS += $(cfi) $(cfi-sigframe) $(cfi-sections) $(asinstr) $(avx_instr) $(avx2_instr) $(avx512_instr) $(sha1_ni_instr) $(sha256_ni_instr)
LDFLAGS := -m elf_$(UTS_MACHINE)
diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
index a0118d5929a9..0c529a55b52e 100644
--- a/include/linux/raid/pq.h
+++ b/include/linux/raid/pq.h
@@ -102,6 +102,9 @@ extern const struct raid6_calls raid6_altivec8;
extern const struct raid6_calls raid6_avx2x1;
extern const struct raid6_calls raid6_avx2x2;
extern const struct raid6_calls raid6_avx2x4;
+extern const struct raid6_calls raid6_avx512x1;
+extern const struct raid6_calls raid6_avx512x2;
+extern const struct raid6_calls raid6_avx512x4;
extern const struct raid6_calls raid6_tilegx8;
struct raid6_recov_calls {
diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
index 3b10a48fa040..8948268d47b4 100644
--- a/lib/raid6/Makefile
+++ b/lib/raid6/Makefile
@@ -3,7 +3,7 @@ obj-$(CONFIG_RAID6_PQ) += raid6_pq.o
raid6_pq-y += algos.o recov.o tables.o int1.o int2.o int4.o \
int8.o int16.o int32.o
-raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o avx2.o
+raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o avx2.o avx512.o
raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o
raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o
raid6_pq-$(CONFIG_TILEGX) += tilegx8.o
diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
index 975c6e0434bd..f5f090c52dd9 100644
--- a/lib/raid6/algos.c
+++ b/lib/raid6/algos.c
@@ -49,6 +49,10 @@ const struct raid6_calls * const raid6_algos[] = {
&raid6_avx2x1,
&raid6_avx2x2,
#endif
+#ifdef CONFIG_AS_AVX512
+ &raid6_avx512x1,
+ &raid6_avx512x2,
+#endif
#endif
#if defined(__x86_64__) && !defined(__arch_um__)
&raid6_sse2x1,
@@ -59,6 +63,11 @@ const struct raid6_calls * const raid6_algos[] = {
&raid6_avx2x2,
&raid6_avx2x4,
#endif
+#ifdef CONFIG_AS_AVX512
+ &raid6_avx512x1,
+ &raid6_avx512x2,
+ &raid6_avx512x4,
+#endif
#endif
#ifdef CONFIG_ALTIVEC
&raid6_altivec1,
diff --git a/lib/raid6/avx512.c b/lib/raid6/avx512.c
new file mode 100644
index 000000000000..b1188a6e51a6
--- /dev/null
+++ b/lib/raid6/avx512.c
@@ -0,0 +1,294 @@
+/* -*- linux-c -*- --------------------------------------------------------
+ *
+ * Copyright (C) 2016 Intel Corporation
+ *
+ * Author: Gayatri Kammela <gayatri.kammela@intel.com>
+ * Author: Megha Dey <megha.dey@linux.intel.com>
+ *
+ * Based on avx2.c: Copyright 2012 Yuanhan Liu All Rights Reserved
+ * Based on sse2.c: Copyright 2002 H. Peter Anvin - All Rights Reserved
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, Inc., 53 Temple Place Ste 330,
+ * Boston MA 02111-1307, USA; either version 2 of the License, or
+ * (at your option) any later version; incorporated herein by reference.
+ *
+ * -----------------------------------------------------------------------
+ */
+
+/*
+ * AVX512 implementation of RAID-6 syndrome functions
+ *
+ */
+
+#ifdef CONFIG_AS_AVX512
+
+#include <linux/raid/pq.h>
+#include "x86.h"
+
+static const struct raid6_avx512_constants {
+ u64 x1d[8];
+} raid6_avx512_constants __aligned(512) = {
+ { 0x1d1d1d1d1d1d1d1dULL, 0x1d1d1d1d1d1d1d1dULL,
+ 0x1d1d1d1d1d1d1d1dULL, 0x1d1d1d1d1d1d1d1dULL,
+ 0x1d1d1d1d1d1d1d1dULL, 0x1d1d1d1d1d1d1d1dULL,
+ 0x1d1d1d1d1d1d1d1dULL, 0x1d1d1d1d1d1d1d1dULL,},
+};
+
+static int raid6_have_avx512(void)
+{
+ return boot_cpu_has(X86_FEATURE_AVX2) &&
+ boot_cpu_has(X86_FEATURE_AVX) &&
+ boot_cpu_has(X86_FEATURE_AVX512F) &&
+ boot_cpu_has(X86_FEATURE_AVX512BW) &&
+ boot_cpu_has(X86_FEATURE_AVX512VL) &&
+ boot_cpu_has(X86_FEATURE_AVX512DQ);
+}
+
+static void raid6_avx5121_gen_syndrome(int disks, size_t bytes, void **ptrs)
+{
+ u8 **dptr = (u8 **)ptrs;
+ u8 *p, *q;
+ int d, z, z0;
+
+ z0 = disks - 3; /* Highest data disk */
+ p = dptr[z0+1]; /* XOR parity */
+ q = dptr[z0+2]; /* RS syndrome */
+
+ kernel_fpu_begin();
+
+ asm volatile("vmovdqa64 %0,%%zmm0\n\t"
+ "vpxorq %%zmm1,%%zmm1,%%zmm1" /* Zero temp */
+ :
+ : "m" (raid6_avx512_constants.x1d[0]));
+
+ for (d = 0; d < bytes; d += 64) {
+ asm volatile("prefetchnta %0\n\t"
+ "vmovdqa64 %0,%%zmm2\n\t" /* P[0] */
+ "prefetchnta %1\n\t"
+ "vmovdqa64 %%zmm2,%%zmm4\n\t" /* Q[0] */
+ "vmovdqa64 %1,%%zmm6"
+ :
+ : "m" (dptr[z0][d]), "m" (dptr[z0-1][d]));
+ for (z = z0-2; z >= 0; z--) {
+ asm volatile("prefetchnta %0\n\t"
+ "vpcmpgtb %%zmm4,%%zmm1,%%k1\n\t"
+ "vpmovm2b %%k1,%%zmm5\n\t"
+ "vpaddb %%zmm4,%%zmm4,%%zmm4\n\t"
+ "vpandq %%zmm0,%%zmm5,%%zmm5\n\t"
+ "vpxorq %%zmm5,%%zmm4,%%zmm4\n\t"
+ "vpxorq %%zmm6,%%zmm2,%%zmm2\n\t"
+ "vpxorq %%zmm6,%%zmm4,%%zmm4\n\t"
+ "vmovdqa64 %0,%%zmm6"
+ :
+ : "m" (dptr[z][d]));
+ }
+ asm volatile("vpcmpgtb %%zmm4,%%zmm1,%%k1\n\t"
+ "vpmovm2b %%k1,%%zmm5\n\t"
+ "vpaddb %%zmm4,%%zmm4,%%zmm4\n\t"
+ "vpandq %%zmm0,%%zmm5,%%zmm5\n\t"
+ "vpxorq %%zmm5,%%zmm4,%%zmm4\n\t"
+ "vpxorq %%zmm6,%%zmm2,%%zmm2\n\t"
+ "vpxorq %%zmm6,%%zmm4,%%zmm4\n\t"
+ "vmovntdq %%zmm2,%0\n\t"
+ "vpxorq %%zmm2,%%zmm2,%%zmm2\n\t"
+ "vmovntdq %%zmm4,%1\n\t"
+ "vpxorq %%zmm4,%%zmm4,%%zmm4"
+ :
+ : "m" (p[d]), "m" (q[d]));
+ }
+
+ asm volatile("sfence" : : : "memory");
+ kernel_fpu_end();
+}
+
+const struct raid6_calls raid6_avx512x1 = {
+ raid6_avx5121_gen_syndrome,
+ NULL, /* XOR not yet implemented */
+ raid6_have_avx512,
+ "avx512x1",
+ 1 /* Has cache hints */
+};
+
+/*
+ * Unrolled-by-2 AVX512 implementation
+ */
+static void raid6_avx5122_gen_syndrome(int disks, size_t bytes, void **ptrs)
+{
+ u8 **dptr = (u8 **)ptrs;
+ u8 *p, *q;
+ int d, z, z0;
+
+ z0 = disks - 3; /* Highest data disk */
+ p = dptr[z0+1]; /* XOR parity */
+ q = dptr[z0+2]; /* RS syndrome */
+
+ kernel_fpu_begin();
+
+ asm volatile("vmovdqa64 %0,%%zmm0\n\t"
+ "vpxorq %%zmm1,%%zmm1,%%zmm1" /* Zero temp */
+ :
+ : "m" (raid6_avx512_constants.x1d[0]));
+
+ /* We uniformly assume a single prefetch covers at least 64 bytes */
+ for (d = 0; d < bytes; d += 128) {
+ asm volatile("prefetchnta %0\n\t"
+ "prefetchnta %1\n\t"
+ "vmovdqa64 %0,%%zmm2\n\t" /* P[0] */
+ "vmovdqa64 %1,%%zmm3\n\t" /* P[1] */
+ "vmovdqa64 %%zmm2,%%zmm4\n\t" /* Q[0] */
+ "vmovdqa64 %%zmm3,%%zmm6" /* Q[1] */
+ :
+ : "m" (dptr[z0][d]), "m" (dptr[z0][d+64]));
+ for (z = z0-1; z >= 0; z--) {
+ asm volatile("prefetchnta %0\n\t"
+ "prefetchnta %1\n\t"
+ "vpcmpgtb %%zmm4,%%zmm1,%%k1\n\t"
+ "vpcmpgtb %%zmm6,%%zmm1,%%k2\n\t"
+ "vpmovm2b %%k1,%%zmm5\n\t"
+ "vpmovm2b %%k2,%%zmm7\n\t"
+ "vpaddb %%zmm4,%%zmm4,%%zmm4\n\t"
+ "vpaddb %%zmm6,%%zmm6,%%zmm6\n\t"
+ "vpandq %%zmm0,%%zmm5,%%zmm5\n\t"
+ "vpandq %%zmm0,%%zmm7,%%zmm7\n\t"
+ "vpxorq %%zmm5,%%zmm4,%%zmm4\n\t"
+ "vpxorq %%zmm7,%%zmm6,%%zmm6\n\t"
+ "vmovdqa64 %0,%%zmm5\n\t"
+ "vmovdqa64 %1,%%zmm7\n\t"
+ "vpxorq %%zmm5,%%zmm2,%%zmm2\n\t"
+ "vpxorq %%zmm7,%%zmm3,%%zmm3\n\t"
+ "vpxorq %%zmm5,%%zmm4,%%zmm4\n\t"
+ "vpxorq %%zmm7,%%zmm6,%%zmm6"
+ :
+ : "m" (dptr[z][d]), "m" (dptr[z][d+64]));
+ }
+ asm volatile("vmovntdq %%zmm2,%0\n\t"
+ "vmovntdq %%zmm3,%1\n\t"
+ "vmovntdq %%zmm4,%2\n\t"
+ "vmovntdq %%zmm6,%3"
+ :
+ : "m" (p[d]), "m" (p[d+64]), "m" (q[d]),
+ "m" (q[d+64]));
+ }
+
+ asm volatile("sfence" : : : "memory");
+ kernel_fpu_end();
+}
+
+const struct raid6_calls raid6_avx512x2 = {
+ raid6_avx5122_gen_syndrome,
+ NULL, /* XOR not yet implemented */
+ raid6_have_avx512,
+ "avx512x2",
+ 1 /* Has cache hints */
+};
+
+#ifdef CONFIG_X86_64
+
+/*
+ * Unrolled-by-4 AVX512 implementation
+ */
+static void raid6_avx5124_gen_syndrome(int disks, size_t bytes, void **ptrs)
+{
+ u8 **dptr = (u8 **)ptrs;
+ u8 *p, *q;
+ int d, z, z0;
+
+ z0 = disks - 3; /* Highest data disk */
+ p = dptr[z0+1]; /* XOR parity */
+ q = dptr[z0+2]; /* RS syndrome */
+
+ kernel_fpu_begin();
+
+ asm volatile("vmovdqa64 %0,%%zmm0\n\t"
+ "vpxorq %%zmm1,%%zmm1,%%zmm1\n\t" /* Zero temp */
+ "vpxorq %%zmm2,%%zmm2,%%zmm2\n\t" /* P[0] */
+ "vpxorq %%zmm3,%%zmm3,%%zmm3\n\t" /* P[1] */
+ "vpxorq %%zmm4,%%zmm4,%%zmm4\n\t" /* Q[0] */
+ "vpxorq %%zmm6,%%zmm6,%%zmm6\n\t" /* Q[1] */
+ "vpxorq %%zmm10,%%zmm10,%%zmm10\n\t" /* P[2] */
+ "vpxorq %%zmm11,%%zmm11,%%zmm11\n\t" /* P[3] */
+ "vpxorq %%zmm12,%%zmm12,%%zmm12\n\t" /* Q[2] */
+ "vpxorq %%zmm14,%%zmm14,%%zmm14" /* Q[3] */
+ :
+ : "m" (raid6_avx512_constants.x1d[0]));
+
+ for (d = 0; d < bytes; d += 256) {
+ for (z = z0; z >= 0; z--) {
+ asm volatile("prefetchnta %0\n\t"
+ "prefetchnta %1\n\t"
+ "prefetchnta %2\n\t"
+ "prefetchnta %3\n\t"
+ "vpcmpgtb %%zmm4,%%zmm1,%%k1\n\t"
+ "vpcmpgtb %%zmm6,%%zmm1,%%k2\n\t"
+ "vpcmpgtb %%zmm12,%%zmm1,%%k3\n\t"
+ "vpcmpgtb %%zmm14,%%zmm1,%%k4\n\t"
+ "vpmovm2b %%k1,%%zmm5\n\t"
+ "vpmovm2b %%k2,%%zmm7\n\t"
+ "vpmovm2b %%k3,%%zmm13\n\t"
+ "vpmovm2b %%k4,%%zmm15\n\t"
+ "vpaddb %%zmm4,%%zmm4,%%zmm4\n\t"
+ "vpaddb %%zmm6,%%zmm6,%%zmm6\n\t"
+ "vpaddb %%zmm12,%%zmm12,%%zmm12\n\t"
+ "vpaddb %%zmm14,%%zmm14,%%zmm14\n\t"
+ "vpandq %%zmm0,%%zmm5,%%zmm5\n\t"
+ "vpandq %%zmm0,%%zmm7,%%zmm7\n\t"
+ "vpandq %%zmm0,%%zmm13,%%zmm13\n\t"
+ "vpandq %%zmm0,%%zmm15,%%zmm15\n\t"
+ "vpxorq %%zmm5,%%zmm4,%%zmm4\n\t"
+ "vpxorq %%zmm7,%%zmm6,%%zmm6\n\t"
+ "vpxorq %%zmm13,%%zmm12,%%zmm12\n\t"
+ "vpxorq %%zmm15,%%zmm14,%%zmm14\n\t"
+ "vmovdqa64 %0,%%zmm5\n\t"
+ "vmovdqa64 %1,%%zmm7\n\t"
+ "vmovdqa64 %2,%%zmm13\n\t"
+ "vmovdqa64 %3,%%zmm15\n\t"
+ "vpxorq %%zmm5,%%zmm2,%%zmm2\n\t"
+ "vpxorq %%zmm7,%%zmm3,%%zmm3\n\t"
+ "vpxorq %%zmm13,%%zmm10,%%zmm10\n\t"
+ "vpxorq %%zmm15,%%zmm11,%%zmm11\n\t"
+ "vpxorq %%zmm5,%%zmm4,%%zmm4\n\t"
+ "vpxorq %%zmm7,%%zmm6,%%zmm6\n\t"
+ "vpxorq %%zmm13,%%zmm12,%%zmm12\n\t"
+ "vpxorq %%zmm15,%%zmm14,%%zmm14"
+ :
+ : "m" (dptr[z][d]), "m" (dptr[z][d+64]),
+ "m" (dptr[z][d+128]), "m" (dptr[z][d+192]));
+ }
+ asm volatile("vmovntdq %%zmm2,%0\n\t"
+ "vpxorq %%zmm2,%%zmm2,%%zmm2\n\t"
+ "vmovntdq %%zmm3,%1\n\t"
+ "vpxorq %%zmm3,%%zmm3,%%zmm3\n\t"
+ "vmovntdq %%zmm10,%2\n\t"
+ "vpxorq %%zmm10,%%zmm10,%%zmm10\n\t"
+ "vmovntdq %%zmm11,%3\n\t"
+ "vpxorq %%zmm11,%%zmm11,%%zmm11\n\t"
+ "vmovntdq %%zmm4,%4\n\t"
+ "vpxorq %%zmm4,%%zmm4,%%zmm4\n\t"
+ "vmovntdq %%zmm6,%5\n\t"
+ "vpxorq %%zmm6,%%zmm6,%%zmm6\n\t"
+ "vmovntdq %%zmm12,%6\n\t"
+ "vpxorq %%zmm12,%%zmm12,%%zmm12\n\t"
+ "vmovntdq %%zmm14,%7\n\t"
+ "vpxorq %%zmm14,%%zmm14,%%zmm14"
+ :
+ : "m" (p[d]), "m" (p[d+64]), "m" (p[d+128]),
+ "m" (p[d+192]), "m" (q[d]), "m" (q[d+64]),
+ "m" (q[d+128]), "m" (q[d+192]));
+ }
+
+ asm volatile("sfence" : : : "memory");
+ kernel_fpu_end();
+}
+
+const struct raid6_calls raid6_avx512x4 = {
+ raid6_avx5124_gen_syndrome,
+ NULL, /* XOR not yet implemented */
+ raid6_have_avx512,
+ "avx512x4",
+ 1 /* Has cache hints */
+};
+#endif
+
+#endif /* CONFIG_AS_AVX512 */
diff --git a/lib/raid6/x86.h b/lib/raid6/x86.h
index 8fe9d9662abb..834d268a4b05 100644
--- a/lib/raid6/x86.h
+++ b/lib/raid6/x86.h
@@ -46,6 +46,16 @@ static inline void kernel_fpu_end(void)
#define X86_FEATURE_SSSE3 (4*32+ 9) /* Supplemental SSE-3 */
#define X86_FEATURE_AVX (4*32+28) /* Advanced Vector Extensions */
#define X86_FEATURE_AVX2 (9*32+ 5) /* AVX2 instructions */
+#define X86_FEATURE_AVX512F (9*32+16) /* AVX-512 Foundation */
+#define X86_FEATURE_AVX512DQ (9*32+17) /* AVX-512 DQ (Double/Quad granular)
+ * Instructions
+ */
+#define X86_FEATURE_AVX512BW (9*32+30) /* AVX-512 BW (Byte/Word granular)
+ * Instructions
+ */
+#define X86_FEATURE_AVX512VL (9*32+31) /* AVX-512 VL (128/256 Vector Length)
+ * Extensions
+ */
#define X86_FEATURE_MMXEXT (1*32+22) /* AMD MMX extensions */
/* Should work well enough on modern CPUs for testing */
--
2.7.4
^ permalink raw reply related
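The X86_FEATURE_* values added to lib/raid6/x86.h above follow the usual "word*32 + bit" encoding, where word 9 holds the CPUID leaf 7 (EBX) bits. A minimal sketch of how such a value can be decoded and tested is shown here. The macro names, the words[] array and both helper functions are made up for illustration and are not the kernel's boot_cpu_has(); only the four newly added bits are checked, not the full list from raid6_have_avx512().

```c
#include <stdint.h>

/* Same "word*32 + bit" encoding as the defines in the patch. */
#define FEATURE(word, bit)	((word) * 32 + (bit))
#define FEAT_AVX512F	FEATURE(9, 16)
#define FEAT_AVX512DQ	FEATURE(9, 17)
#define FEAT_AVX512BW	FEATURE(9, 30)
#define FEAT_AVX512VL	FEATURE(9, 31)

/* words[] is a hypothetical array of 32-bit capability words. */
static int has_feature(const uint32_t *words, int feature)
{
	return (words[feature / 32] >> (feature % 32)) & 1;
}

/* All four AVX-512 subsets added by the patch must be present. */
static int have_avx512_bits(const uint32_t *words)
{
	return has_feature(words, FEAT_AVX512F) &&
	       has_feature(words, FEAT_AVX512DQ) &&
	       has_feature(words, FEAT_AVX512BW) &&
	       has_feature(words, FEAT_AVX512VL);
}
```

With words[9] holding bits 16, 17, 30 and 31, have_avx512_bits() returns 1; clearing any one of them makes it return 0.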
* [PATCH v2 0/6] Add AVX512 optimized gen_syndrome, xor_syndrome and recovery functions
From: Gayatri Kammela @ 2016-08-13 1:03 UTC (permalink / raw)
To: linux-raid; +Cc: shli, linux-kernel, ravi.v.shankar, Gayatri Kammela
This is the version 2 patch series for adding AVX512 optimized gen_syndrome,
xor_syndrome and recovery functions.
Optimization of RAID6 using AVX512 instructions should improve the
RAID6 performance. These patches have been tested, and an improvement
in performance was observed.
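For readers who want to see what gen_syndrome computes without the vector instructions, here is a byte-at-a-time reference sketch. It assumes the layout used by the patch (data blocks first, then P, then Q) and the RAID-6 field polynomial 0x11d. The names gf_mul2 and ref_gen_syndrome are made up for this example, and the pointer type is uint8_t ** rather than the kernel's void **.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply one GF(2^8) element by 2, reducing with 0x1d on overflow. */
static uint8_t gf_mul2(uint8_t v)
{
	/* 0xff where the top bit is set (the vpcmpgtb/vpmovm2b step) */
	uint8_t mask = (v & 0x80) ? 0xff : 0x00;

	/* shift left (the vpaddb step), then fold in 0x1d under the mask */
	return (uint8_t)((v << 1) ^ (mask & 0x1d));
}

/*
 * ptrs[0..disks-3] are data blocks, ptrs[disks-2] is P (XOR parity),
 * ptrs[disks-1] is Q (Reed-Solomon syndrome).
 */
static void ref_gen_syndrome(int disks, size_t bytes, uint8_t **ptrs)
{
	int z0 = disks - 3;		/* highest data disk */
	uint8_t *p = ptrs[z0 + 1];
	uint8_t *q = ptrs[z0 + 2];
	size_t d;
	int z;

	assert(disks >= 3);		/* at least one data disk */

	for (d = 0; d < bytes; d++) {
		uint8_t wp = ptrs[z0][d];
		uint8_t wq = wp;

		/* Horner's rule: Q = (Q * 2) ^ data, P = P ^ data */
		for (z = z0 - 1; z >= 0; z--) {
			wq = gf_mul2(wq) ^ ptrs[z][d];
			wp ^= ptrs[z][d];
		}
		p[d] = wp;
		q[d] = wq;
	}
}
```

The AVX512 versions perform the same two updates on 64 bytes per register, which is why a scalar routine like this can serve as a cross-check in a userspace test.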
Changes since v1:
1) Added xor_syndrome functions to avx512 optimized raid6.
Gayatri Kammela (6):
lib/raid6: Add AVX512 optimized gen_syndrome functions
lib/raid6: Add AVX512 optimized recovery functions
lib/raid6/test/Makefile: Add avx512 gen_syndrome and recovery
functions
lib/raid6: Add AVX512 optimized xor_syndrome functions
(DO NOT APPLY) lib/raid6: Add unroll by 8 to AVX512 optimized
gen_syndrome functions
(DO NOT APPLY) lib/raid6: Add unroll by 8 to AVX512 optimized
xor_syndrome functions.
arch/x86/Makefile | 5 +-
include/linux/raid/pq.h | 5 +
lib/raid6/Makefile | 2 +-
lib/raid6/algos.c | 13 +
lib/raid6/avx512.c | 972 +++++++++++++++++++++++++++++++++++++++++++++++
lib/raid6/recov_avx512.c | 388 +++++++++++++++++++
lib/raid6/test/Makefile | 5 +-
lib/raid6/x86.h | 10 +
8 files changed, 1396 insertions(+), 4 deletions(-)
create mode 100644 lib/raid6/avx512.c
create mode 100644 lib/raid6/recov_avx512.c
--
2.7.4
^ permalink raw reply
* best kernel ??
From: bobzer @ 2016-08-13 0:00 UTC (permalink / raw)
To: linux-raid
Hi,
I need to update my kernel to get full ext4 support.
I'm on Debian, so I compiled the latest stable kernel, 4.7, and also mdadm 3.4.
Everything was fine until I ran ./test --keep-going, where I got a lot of errors.
I saw that kernels between 4.1 and 4.4 have RAID problems:
http://www.linuxfromscratch.org/blfs/view/svn/postlfs/mdadm.html
So I'm wondering which kernel I should use.
My Debian is jessie (8) and it's a VM in ESX, so I don't need to
support a lot of things, but at least RAID :-)
What version do you run?
Any advice?
For the compilation, I simply downloaded the latest version, untarred it,
checked that mdadm was stopped, and ran:
make
make install
./test --keep-going
What I don't understand is that I ran the test a few times
and never got the same result.
thanks
^ permalink raw reply
* Re: [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs
From: Kent Overstreet @ 2016-08-12 16:36 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Eric Wheeler, Ming Lei, Jens Axboe, linux-kernel, linux-block,
linux-bcache, linux-raid, Sebastian Roesner, 4.3+, Shaohua Li
In-Reply-To: <20160811140253.GA2867@infradead.org>
On Thu, Aug 11, 2016 at 07:02:53AM -0700, Christoph Hellwig wrote:
> Please just fix bcache to not submit bios larger than BIO_MAX_PAGES for
> now, until we can support such callers in general and enable common
> used code to do so.
Christoph, what's wrong with Ming's patch? Leaving bcache aside, just
considering the block layer, do you think that patch is the wrong approach?
^ permalink raw reply
* Re: [PATCH] mdadm: put journal device in right place of --detail
From: Jes Sorensen @ 2016-08-12 14:59 UTC (permalink / raw)
To: Song Liu; +Cc: linux-raid, yizhan, Shaohua Li
In-Reply-To: <1470960853-2859579-1-git-send-email-songliubraving@fb.com>
Song Liu <songliubraving@fb.com> writes:
> When there are failed HDDs, the journal device is shown in the wrong place
> of --detail:
>
> Number Major Minor RaidDevice State
> 4 8 24 - journal /dev/sdb8
> 1 8 18 1 active sync /dev/sdb2
> 2 8 19 2 active sync /dev/sdb3
> 3 8 21 3 active sync /dev/sdb5
>
> 0 8 17 - faulty /dev/sdb1
>
> This patch fixed the output as:
>
> Number Major Minor RaidDevice State
> - 0 0 0 removed
> 1 8 18 1 active sync /dev/sdb2
> 2 8 19 2 active sync /dev/sdb3
> 3 8 21 3 active sync /dev/sdb5
>
> 0 8 17 - faulty /dev/sdb1
> 4 8 24 - journal /dev/sdb8
>
> Reported-by: Yi Zhang <yizhan@redhat.com>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
> Detail.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
Applied, thanks!
Jes
PS: If you CC me directly in mdadm patches I am likely to see them
faster, ie. before I get to the linux-raid list.
^ permalink raw reply
* Re: [PATCH] mdadm: add man page for --add-journal
From: Jes Sorensen @ 2016-08-12 14:58 UTC (permalink / raw)
To: Song Liu; +Cc: linux-raid, yizhan, Shaohua Li
In-Reply-To: <1470960604-2852450-1-git-send-email-songliubraving@fb.com>
Song Liu <songliubraving@fb.com> writes:
> Add the following to man page:
>
> --add-journal
> Recreate journal for RAID-4/5/6 array that losts journal
> devices. In current implementation, this command cannot
> add journal to an array that had failed journal. To
> avoid interrupting on-going write opertions,
> --add-journal only works for array in Read-Only state.
>
> Reported-by: Yi Zhang <yizhan@redhat.com>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
> mdadm.8.in | 8 ++++++++
> 1 file changed, 8 insertions(+)
Applied, with a few minor mods.
I changed it to say this, I hope you are fine with that:
"Recreate journal for RAID-4/5/6 array that lost a journal device. In the
current implementation, this command cannot add a journal to an array
that had a failed journal. To avoid interrupting on-going write
opertions, "
If I botched it up please let me know.
Jes
>
> diff --git a/mdadm.8.in b/mdadm.8.in
> index 1a04bd1..a335c53 100644
> --- a/mdadm.8.in
> +++ b/mdadm.8.in
> @@ -1444,6 +1444,14 @@ number. The receiving node must acknowledge this message
> with \-\-cluster\-confirm. Valid arguments are <slot>:<devicename> in case
> the device is found or <slot>:missing in case the device is not found.
>
> +.TP
> +.BR \-\-add-journal
> +Recreate journal for RAID-4/5/6 array that losts journal devices. In current
> +implementation, this command cannot add journal to an array that had failed
> +journal. To avoid interrupting on-going write opertions,
> +.B \-\-add-journal
> +only works for array in Read-Only state.
> +
> .P
> Each of these options requires that the first device listed is the array
> to be acted upon, and the remainder are component devices to be added,
^ permalink raw reply
* Re: Unable to convert raid1 to raid5
From: Wols Lists @ 2016-08-12 12:37 UTC (permalink / raw)
To: NeilBrown; +Cc: mdraid
In-Reply-To: <87a8gilqko.fsf@notabene.neil.brown.name>
On 12/08/16 02:32, NeilBrown wrote:
> On Sun, Aug 07 2016, Wols Lists wrote:
>
>> On 07/08/16 01:32, Glenn Enright wrote:
>>> On 7/08/2016 12:01 pm, "Wols Lists" <antlists@youngman.org.uk
>>> <mailto:antlists@youngman.org.uk>> wrote:
>>>>
>>>> On 05/08/16 21:16, Wols Lists wrote:
>>>>> In my testing of xosview, I've been mucking about with a vm and raid.
>>>>> xosview is looking quite promising (I've got a few comments about it,
>>>>> but never mind).
>>>>>
>>>>> BUT. In mucking about with raid 1, I increased my raid devices to three.
>>>>> I now just can NOT convert the array to raid 5! I've been mucking around
>>>>> with all sorts of things trying to get it to work, but finally two error
>>>>> messages make things clear.
>>>>>
>>>> Following up to myself - suddenly thought "I know what's wrong". So I
>>>> stopped the array, and of course couldn't access it, it was no longer
>>>> there. So I assembled but didn't run it, and it worked fine.
>>>>
>>>> Simples, once you realise what's wrong - you can ADD devices to a
>>>> running array, but you can't REMOVE them.
>>>>
>>>> Cheers,
>>>> Wol
>>>>
>>
>>>
>>> You can remove em if you mark em as failed first. Eg
>>>
>>> mdadm /dev/mdx --fail /dev/sdc1 --remove /dev/sdc1
>>>
>>> Best, Glenn
>>>
>> Except - if you read my original post - I was trying to TOTALLY remove
>> the device!
>>
>> mdadm --grow --raid-devices=2
>>
>> THAT was the problem - I had a 3-device mirror, and you can't convert
>> that to raid5! Even if you've --fail --remove'd the third device!
>>
>> In other words, "--grow --raid-devices=more" will work on a running
>> device, "--grow --raid-devices=less" will only work on an array that is
>> built but not running.
>
> I don't believe this is correct, and I could not reproduce your results in
> quick tests.
>
> If the array is not running, then you cannot reshape it at all.
>
> You can reduce the number of devices in a RAID1 at any time as long as
> the number of active devices is not greater than the number of devices
> requested.
>
> /dev/md0 has 3 working devices:
>
> # mdadm -G /dev/md0 -n2
> mdadm: failed to set raid disks
> # mdadm /dev/md0 -f /dev/loop0
> mdadm: set /dev/loop0 faulty in /dev/md0
> # mdadm -G /dev/md0 -n2
> raid_disks for /dev/md0 set to 2
>
That's my error - you've done pretty much exactly the same as me, except
my second try also failed, whereas yours has succeeded. Why? (although I
did --fail, --remove).
>
>>
>> I now have the problem that my "--grow --level=5" has fallen foul of the
>> "reshape stuck at zero" problem, and I can now neither run the array,
>> nor get the reshape working ... :-(
>
> We really need to fix this ... if only I knew how to reproduce it.
Mikael gave me the force-revert-reshape syntax, which worked a treat
when I did it his way (I'd tried to do it already, but obviously didn't
get the magic incantation right :-)
>
As you'll see from my other thread, this is a test array (I was trying
to test the new raid features of xosview :-), so there's no urgency from
my point of view, but yes, I want to know what's going wrong, too. I
want to write all this up as documentation :-)
My problem is I've only got one computer with decent-enough grunt to run
the VM, and getting uninterrupted access to it (wife, two grand-kids)
isn't always easy. I'll get back to it asap - Mikael is on the case, but
I need to reset it back the way it was and try the reshape again.
I hope it's not the SuSE kernel - I'm running the latest mdadm from your
git repository.
> NeilBrown
>
Cheers,
Wol
^ permalink raw reply
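The rule Neil states in the quoted text (the device count of a RAID1 can be reduced as long as the number of active devices is not greater than the number requested) can be written as a one-line predicate. This is only a sketch of that sentence; the function name is made up, and it is not the check that mdadm or the kernel actually performs.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: may a RAID1 with 'active' working devices be
 * shrunk to 'requested' raid devices, per the rule quoted above?
 */
static bool raid1_can_shrink_to(int active, int requested)
{
	return requested >= 1 && active <= requested;
}
```

This matches the transcript: with three working devices, a request for two is refused; after one device is marked faulty, the same request succeeds.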
* Re: [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs
From: Ming Lei @ 2016-08-12 11:12 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Eric Wheeler, Jens Axboe, Linux Kernel Mailing List, linux-block,
open list:BCACHE (BLOCK LAYER CACHE),
open list:SOFTWARE RAID (Multiple Disks) SUPPORT, Kent Overstreet,
Sebastian Roesner, 4.3+, Shaohua Li
In-Reply-To: <20160811140253.GA2867@infradead.org>
On Thu, Aug 11, 2016 at 10:02 PM, Christoph Hellwig <hch@infradead.org> wrote:
> Please just fix bcache to not submit bios larger than BIO_MAX_PAGES for
> now, until we can support such callers in general and enable common
> used code to do so.
IMO it can't be done efficiently in bcache, because bcache would need to
figure out how many bvecs a bio includes.
This patch (block: make sure big bio is splitted into at most 256 bvecs)
can support such callers.
Also, this kind of usage does simplify drivers.
Thanks,
Ming
> --
> To unsubscribe from this list: send the line "unsubscribe linux-block" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [PATCH] raid10: record correct address of bad block
From: Tomasz Majchrzak @ 2016-08-12 9:03 UTC (permalink / raw)
To: linux-raid; +Cc: shli, aleksey.obitotskiy, pawel.baldysiak, artur.paszkiewicz
For a failed write request, record the block address on the device, not
the block address in the array.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
---
drivers/md/raid10.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index cfa96b5..d18b26d 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2449,6 +2449,7 @@ static int narrow_write_error(struct r10bio *r10_bio, int i)
int block_sectors;
sector_t sector;
+ sector_t data_offset;
int sectors;
int sect_to_write = r10_bio->sectors;
int ok = 1;
@@ -2462,6 +2463,7 @@ static int narrow_write_error(struct r10bio *r10_bio, int i)
sectors = ((r10_bio->sector + block_sectors)
& ~(sector_t)(block_sectors - 1))
- sector;
+ data_offset = choose_data_offset(r10_bio, rdev);
while (sect_to_write) {
struct bio *wbio;
@@ -2471,13 +2473,12 @@ static int narrow_write_error(struct r10bio *r10_bio, int i)
wbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
bio_trim(wbio, sector - bio->bi_iter.bi_sector, sectors);
wbio->bi_iter.bi_sector = (r10_bio->devs[i].addr+
- choose_data_offset(r10_bio, rdev) +
- (sector - r10_bio->sector));
+ data_offset + (sector - r10_bio->sector));
wbio->bi_bdev = rdev->bdev;
if (submit_bio_wait(WRITE, wbio) < 0)
/* Failure! */
- ok = rdev_set_badblocks(rdev, sector,
- sectors, 0)
+ ok = rdev_set_badblocks(rdev, wbio->bi_iter.bi_sector -
+ data_offset, sectors, 0)
&& ok;
bio_put(wbio);
--
1.8.3.1
^ permalink raw reply related
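The address arithmetic in the raid10 patch above can be followed with a small standalone sketch. The struct and function names are stand-ins invented for this example; the fields mirror r10_bio->sector, r10_bio->devs[i].addr and the value returned by choose_data_offset(). It assumes, as the patch does, that the bad-block address is passed without data_offset.

```c
#include <stdint.h>

typedef uint64_t sector_t;

/* Hypothetical stand-ins for the values the patch works with. */
struct write_chunk {
	sector_t array_sector;	/* start of the request in the array */
	sector_t dev_addr;	/* start of the request on the device */
	sector_t data_offset;	/* where the data area begins on the device */
};

/* Sector the retried write is submitted to (wbio->bi_iter.bi_sector). */
static sector_t submit_sector(const struct write_chunk *c, sector_t sector)
{
	return c->dev_addr + c->data_offset + (sector - c->array_sector);
}

/* Sector recorded as bad after the patch: device address, no data_offset. */
static sector_t badblock_sector(const struct write_chunk *c, sector_t sector)
{
	return submit_sector(c, sector) - c->data_offset;
}
```

For example, with array_sector = 1000, dev_addr = 300 and data_offset = 2048, a failure at array sector 1008 is recorded at device sector 308, whereas the old code would have recorded 1008.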
* [PATCH V2 10/10] md-cluster: make resync lock also could be interruptted
From: Guoqing Jiang @ 2016-08-12 5:42 UTC (permalink / raw)
To: linux-raid; +Cc: shli, Guoqing Jiang
In-Reply-To: <1470980563-26062-1-git-send-email-gqjiang@suse.com>
When one node is performing resync or recovery, other nodes
can't get the resync lock and may block for a while before
they hold the lock, so the array can't be stopped immediately
in this scenario.
To allow the array to be stopped quickly, check MD_CLOSING
in dlm_lock_sync_interruptible so that the lock request can
be interrupted.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
drivers/md/md-cluster.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 149d19a..16d892d 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -159,7 +159,8 @@ static int dlm_lock_sync_interruptible(struct dlm_lock_resource *res, int mode,
return ret;
wait_event(res->sync_locking, res->sync_locking_done
- || kthread_should_stop());
+ || kthread_should_stop()
+ || test_bit(MD_CLOSING, &mddev->flags));
if (!res->sync_locking_done) {
/*
* the convert queue contains the lock request when request is
@@ -1033,7 +1034,7 @@ static void metadata_update_cancel(struct mddev *mddev)
static int resync_start(struct mddev *mddev)
{
struct md_cluster_info *cinfo = mddev->cluster_info;
- return dlm_lock_sync(cinfo->resync_lockres, DLM_LOCK_EX);
+ return dlm_lock_sync_interruptible(cinfo->resync_lockres, DLM_LOCK_EX, mddev);
}
static int resync_info_update(struct mddev *mddev, sector_t lo, sector_t hi)
--
2.6.2
^ permalink raw reply related
* [PATCH V2 09/10] md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hang
From: Guoqing Jiang @ 2016-08-12 5:42 UTC (permalink / raw)
To: linux-raid; +Cc: shli, Guoqing Jiang
In-Reply-To: <1470980563-26062-1-git-send-email-gqjiang@suse.com>
When a node leaves the cluster, its bitmap needs to be
synced by another node, so the "md*_recover" thread is triggered
for that purpose. However, with the steps below, tasks can
hang on either B or C.
1. Node A creates a resyncing cluster raid1, which is assembled on
the other two nodes (B and C).
2. Stop the array on B and C.
3. Stop the array on A.
linux44:~ # ps aux|grep md|grep D
root 5938 0.0 0.1 19852 1964 pts/0 D+ 14:52 0:00 mdadm -S md0
root 5939 0.0 0.0 0 0 ? D 14:52 0:00 [md0_recover]
linux44:~ # cat /proc/5939/stack
[<ffffffffa04cf321>] dlm_lock_sync+0x71/0x90 [md_cluster]
[<ffffffffa04d0705>] recover_bitmaps+0x125/0x220 [md_cluster]
[<ffffffffa052105d>] md_thread+0x16d/0x180 [md_mod]
[<ffffffff8107ad94>] kthread+0xb4/0xc0
[<ffffffff8152a518>] ret_from_fork+0x58/0x90
linux44:~ # cat /proc/5938/stack
[<ffffffff8107afde>] kthread_stop+0x6e/0x120
[<ffffffffa0519da0>] md_unregister_thread+0x40/0x80 [md_mod]
[<ffffffffa04cfd20>] leave+0x70/0x120 [md_cluster]
[<ffffffffa0525e24>] md_cluster_stop+0x14/0x30 [md_mod]
[<ffffffffa05269ab>] bitmap_free+0x14b/0x150 [md_mod]
[<ffffffffa0523f3b>] do_md_stop+0x35b/0x5a0 [md_mod]
[<ffffffffa0524e83>] md_ioctl+0x873/0x1590 [md_mod]
[<ffffffff81288464>] blkdev_ioctl+0x214/0x7d0
[<ffffffff811dd3dd>] block_ioctl+0x3d/0x40
[<ffffffff811b92d4>] do_vfs_ioctl+0x2d4/0x4b0
[<ffffffff811b9538>] SyS_ioctl+0x88/0xa0
[<ffffffff8152a5c9>] system_call_fastpath+0x16/0x1b
The problem is that recover_bitmaps can't reliably abort
when the thread is unregistered. So dlm_lock_sync_interruptible
is introduced; it detects whether the thread should stop, which
fixes the problem.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
drivers/md/md-cluster.c | 37 ++++++++++++++++++++++++++++++++++++-
1 file changed, 36 insertions(+), 1 deletion(-)
diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 03a51e7..149d19a 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -10,6 +10,7 @@
#include <linux/module.h>
+#include <linux/kthread.h>
#include <linux/dlm.h>
#include <linux/sched.h>
#include <linux/raid/md_p.h>
@@ -144,6 +145,40 @@ static int dlm_unlock_sync(struct dlm_lock_resource *res)
return dlm_lock_sync(res, DLM_LOCK_NL);
}
+/* A variation of dlm_lock_sync which allows the lock request to
+ * be interrupted */
+static int dlm_lock_sync_interruptible(struct dlm_lock_resource *res, int mode,
+ struct mddev *mddev)
+{
+ int ret = 0;
+
+ ret = dlm_lock(res->ls, mode, &res->lksb,
+ res->flags, res->name, strlen(res->name),
+ 0, sync_ast, res, res->bast);
+ if (ret)
+ return ret;
+
+ wait_event(res->sync_locking, res->sync_locking_done
+ || kthread_should_stop());
+ if (!res->sync_locking_done) {
+ /*
+ * the convert queue contains the lock request when request is
+ * interrupted, and sync_ast could still be run, so need to
+ * cancel the request and reset completion
+ */
+ ret = dlm_unlock(res->ls, res->lksb.sb_lkid, DLM_LKF_CANCEL, &res->lksb, res);
+ res->sync_locking_done = false;
+ if (unlikely(ret != 0))
+ pr_info("failed to cancel previous lock request "
+ "%s return %d\n", res->name, ret);
+ return -EPERM;
+ } else
+ res->sync_locking_done = false;
+ if (res->lksb.sb_status == 0)
+ res->mode = mode;
+ return res->lksb.sb_status;
+}
+
static struct dlm_lock_resource *lockres_init(struct mddev *mddev,
char *name, void (*bastfn)(void *arg, int mode), int with_lvb)
{
@@ -276,7 +311,7 @@ static void recover_bitmaps(struct md_thread *thread)
goto clear_bit;
}
- ret = dlm_lock_sync(bm_lockres, DLM_LOCK_PW);
+ ret = dlm_lock_sync_interruptible(bm_lockres, DLM_LOCK_PW, mddev);
if (ret) {
pr_err("md-cluster: Could not DLM lock %s: %d\n",
str, ret);
--
2.6.2
^ permalink raw reply related
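The control flow of dlm_lock_sync_interruptible in the patch above can be modelled in plain C without any kernel or DLM calls. This is a simplified sketch: the struct, the cancelled counter and both function names are made up for illustration, and the real code sleeps in wait_event() and calls dlm_unlock() with DLM_LKF_CANCEL where the sketch only counts.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the state the patch looks at. */
struct lock_state {
	bool sync_locking_done;	/* set by the AST callback when the lock is granted */
	int sb_status;		/* status reported for the lock request */
	int cancelled;		/* counts cancel requests, in place of dlm_unlock() */
};

/* The condition wait_event() sleeps on in patch 09/10. */
static bool wait_is_over(const struct lock_state *s, bool should_stop)
{
	return s->sync_locking_done || should_stop;
}

/* What happens once the wait returns. */
static int finish_lock(struct lock_state *s)
{
	if (!s->sync_locking_done) {
		/* woken by a stop request: cancel the pending lock request */
		s->cancelled++;
		s->sync_locking_done = false;
		return -EPERM;
	}
	/* lock request completed: reset the flag for the next user */
	s->sync_locking_done = false;
	return s->sb_status;
}
```

A wait that ends only because the thread was asked to stop yields -EPERM and one cancel; a wait that ends because the AST ran yields the lock status and leaves the flag cleared.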
* [PATCH V2 08/10] md-cluster: convert the completion to wait queue
From: Guoqing Jiang @ 2016-08-12 5:42 UTC (permalink / raw)
To: linux-raid; +Cc: shli, Guoqing Jiang
In-Reply-To: <1470980563-26062-1-git-send-email-gqjiang@suse.com>
Previously, we used a completion to synchronize between requesting the
dlm lock and sync_ast. However, we would have to expose completion.wait
and completion.done in dlm_lock_sync_interruptible (introduced
later), which is not a common usage of completion, so convert
the related code to a wait queue.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
drivers/md/md-cluster.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 8972413..03a51e7 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -25,7 +25,8 @@ struct dlm_lock_resource {
struct dlm_lksb lksb;
char *name; /* lock name. */
uint32_t flags; /* flags to pass to dlm_lock() */
- struct completion completion; /* completion for synchronized locking */
+ wait_queue_head_t sync_locking; /* wait queue for synchronized locking */
+ bool sync_locking_done;
void (*bast)(void *arg, int mode); /* blocking AST function pointer*/
struct mddev *mddev; /* pointing back to mddev. */
int mode;
@@ -118,7 +119,8 @@ static void sync_ast(void *arg)
struct dlm_lock_resource *res;
res = arg;
- complete(&res->completion);
+ res->sync_locking_done = true;
+ wake_up(&res->sync_locking);
}
static int dlm_lock_sync(struct dlm_lock_resource *res, int mode)
@@ -130,7 +132,8 @@ static int dlm_lock_sync(struct dlm_lock_resource *res, int mode)
0, sync_ast, res, res->bast);
if (ret)
return ret;
- wait_for_completion(&res->completion);
+ wait_event(res->sync_locking, res->sync_locking_done);
+ res->sync_locking_done = false;
if (res->lksb.sb_status == 0)
res->mode = mode;
return res->lksb.sb_status;
@@ -151,7 +154,8 @@ static struct dlm_lock_resource *lockres_init(struct mddev *mddev,
res = kzalloc(sizeof(struct dlm_lock_resource), GFP_KERNEL);
if (!res)
return NULL;
- init_completion(&res->completion);
+ init_waitqueue_head(&res->sync_locking);
+ res->sync_locking_done = false;
res->ls = cinfo->lockspace;
res->mddev = mddev;
res->mode = DLM_LOCK_IV;
@@ -205,7 +209,7 @@ static void lockres_free(struct dlm_lock_resource *res)
if (unlikely(ret != 0))
pr_err("failed to unlock %s return %d\n", res->name, ret);
else
- wait_for_completion(&res->completion);
+ wait_event(res->sync_locking, res->sync_locking_done);
kfree(res->name);
kfree(res->lksb.sb_lvbptr);
--
2.6.2
^ permalink raw reply related