Re: OT: Processor recommendation for RAID6

Linux RAID subsystem development

* Re: OT: Processor recommendation for RAID6
       [not found] <CAAMCDedmGUcWY=9Nb36gXoo0+F82rhq=-6yKZ1xPf74Gj0mq7Q () mail ! gmail ! com>
@ 2021-04-06 21:05 ` Hans Henrik Happe
  2021-04-08 13:01   ` Gal Ofri
  0 siblings, 1 reply; 6+ messages in thread
From: Hans Henrik Happe @ 2021-04-06 21:05 UTC (permalink / raw)
  To: linux-raid

On 02.04.2021 16.45, Roger Heflin wrote:
> On Fri, Apr 2, 2021 at 4:13 AM Paul Menzel <pmenzel@molgen.mpg.de> wrote:
>>
>> Dear Linux folks,
>>
>>
> 
>>> Are these values a good benchmark for comparing processors?
>>
>> After two years, yes they are. I created 16 10 GB files in `/dev/shm`,
>> set them up as loop devices, and created a RAID6. For resync speed it
>> makes a difference.
>>
>> 2 x AMD EPYC 7601 32-Core Processor:    34671K/sec
>> 2 x Intel Xeon Gold 6248 CPU @ 2.50GHz: 87533K/sec
>>
>> So, the current state of affairs seems to be, that AVX512 instructions
>> do help for software RAIDs, if you want fast rebuild/resync times.
>> Getting, for example, a four core/eight thread Intel Xeon Gold 5222
>> might be useful.
>>
>> Now, the question remains, if AMD processors could make it up with
>> higher performance, or better optimized code, or if AVX512 instructions
>> are a must,
>>
>> […]
>>
>>
>> Kind regards,
>>
>> Paul
>>
>>
>> PS: Here are the commands on the AMD EPYC system:
>>
>> ```
>> $ for i in $(seq 1 16); do truncate -s 10G /dev/shm/vdisk$i.img; done
>> $ for i in /dev/shm/v*.img; do sudo losetup --find --show $i; done
>> /dev/loop0
>> /dev/loop1
>> /dev/loop2
>> /dev/loop3
>> /dev/loop4
>> /dev/loop5
>> /dev/loop6
>> /dev/loop7
>> /dev/loop8
>> /dev/loop9
>> /dev/loop10
>> /dev/loop11
>> /dev/loop12
>> /dev/loop13
>> /dev/loop14
>> /dev/loop15
>> $ sudo mdadm --create /dev/md1 --level=6 --raid-devices=16
>> /dev/loop{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
>> mdadm: Defaulting to version 1.2 metadata
>> mdadm: array /dev/md1 started.
>> $ more /proc/mdstat
>> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
>> [multipath]
>> md1 : active raid6 loop15[15] loop14[14] loop13[13] loop12[12]
>> loop11[11] loop10[10] loop9[9] loop8[8] loop7[7] loop6[6] loop5[5]
>> loop4[4] loop3[3] loop2[2] loop1[1] loop0[0]
>>        146671616 blocks super 1.2 level 6, 512k chunk, algorithm 2
>> [16/16] [UUUUUUUUUUUUUUUU]
>>        [>....................]  resync =  3.9% (416880/10476544)
>> finish=5.6min speed=29777K/sec
>>
>> unused devices: <none>
>> $ more /proc/mdstat
>> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
>> [multipath]
>> md1 : active raid6 loop15[15] loop14[14] loop13[13] loop12[12]
>> loop11[11] loop10[10] loop9[9] loop8[8] loop7[7] loop6[6] loop5[5]
>> loop4[4] loop3[3] loop2[2] loop1[1] loop0[0]
>>        146671616 blocks super 1.2 level 6, 512k chunk, algorithm 2
>> [16/16] [UUUUUUUUUUUUUUUU]
>>        [>....................]  resync =  4.1% (439872/10476544)
>> finish=5.3min speed=31419K/sec
>> $ sudo mdadm -S /dev/md1
>> mdadm: stopped /dev/md1
>> $ sudo losetup -D
>> $ sudo rm /dev/shm/vdisk*.img
>> ```
> 
> 
> I think you are testing something else.  Your speeds are way below
> what the raw processor can do. You are probably testing memory
> speed/numa arch differences between the 2.
> 
> On the intel arch there are 2 numa nodes total with 4 channels each, so the
> system has 8 usable channels of bandwidth, but an allocation on a
> single numa node will only have 4 channels usable (ddr4-2933).
> 
> On the epyc there are 8 numa nodes with 2 channels each (ddr4-2666),
> so any single memory allocation will have only 2 channels available
> and if the accesses are across the numa bus will be slower.
> 
> So (4*2933)/(2*2666) = 2.20, and 2.20 * 34671 = 76286 (fairly close to your results).
> 
> How the allocation for memory works depends a lot on how much ram you
> actually have per numa node and how much for the whole machine.  But
> any single block for any single device should be on a single numa node
> almost all of the time.
> 
> You might want to drop the cache before the test, run numactl
> --hardware to see how much memory is free per numa node, then rerun
> the test and at the end of the test, before the stop, run numactl --hardware
> again to see how it was spread across numa nodes.  Even if it spreads
> it across multiple numa nodes that may well mean that on the epyc case
> you are running with several numa nodes where the main raid processes
> are running against remote numa nodes, and because intel only has 2
> then there is a decent chance that it is only running on 1 most of the
> time (so no remote memory).  I have also seen in benchmarks I have run
> on 2P and 4P intel machines that interleaved on a 2P single thread job
> is faster than running on a single numa node's memory (with the process
> pinned to a single cpu on one of the numa nodes, memory interleaved
> over both), but on a 4P/4numa node machine interleaving slows it down
> significantly.  And in the default case any single write/read of a
> block is likely only on a single numa node so that specific read/write
> is constrained by a single numa node bandwidth giving an advantage to
> fewer faster/bigger numa nodes and less remote memory.
> 
> Outside of rebooting and forcing the entire machine to interleave I am
> not sure how to get shm to interleave.   It might be a good enough
> test to just force the epyc to interleave and see if the benchmark
> result changes in any way.  If the result does change repeat on the
> intel.  Overall for the most part the raid would not be able to use
> very many cpu anyway, so a bigger machine with more numa nodes may
> slow down the overall rate.


I don't think it's a memory issue. I can read from a similar /dev/shm
setup at ~20GB/s on a single EPYC Rome.

I've also experienced slow sync behavior on otherwise idle CentOS7/8
systems. Even tried kernel-lt.x86_64 5.4.95-1.el8.elrepo. Setting
speed_limit_min helps, but that should not be needed when the system is
not doing other I/O or compute.
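
For reference, the limits mentioned here are exposed under /proc/sys/dev/raid/ as speed_limit_min and speed_limit_max (KiB/s, per device). A minimal sketch for inspecting them; the function name `show_md_speed_limits` and its optional directory argument are assumptions made for this example only:

```shell
#!/usr/bin/env bash
# Print the md resync speed limits (KiB/s per device).
# The directory argument is an illustrative assumption so the function can be
# pointed at a copy of the files; it defaults to the real sysctl location.
show_md_speed_limits() {
    local dir="${1:-/proc/sys/dev/raid}"
    local name
    for name in speed_limit_min speed_limit_max; do
        if [ -r "$dir/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
        else
            printf '%s=unavailable\n' "$name"
        fi
    done
}

# Raising the minimum (as root), for example to 500000 KiB/s:
#   echo 500000 > /proc/sys/dev/raid/speed_limit_min
```

Calling `show_md_speed_limits` without arguments reads the live values; the value 500000 in the comment is only a placeholder, not a recommendation.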

I first noticed this with a RAID6 of SAS3 HDDs being far from fast
enough for sync. Writes were also very bad, with less than 1GB/s for
10-18 disk sets, no matter how many writers.

With 16 writers on the described loop setup I get this 'perf top' output
(similar on HDDs):

  33.92%  [kernel]                  [k] native_queued_spin_lock_slowpath
   7.05%  [kernel]                  [k] async_copy_data.isra.61
   5.81%  [kernel]                  [k] memcpy
   2.44%  [kernel]                  [k] read_tsc
   1.65%  [kernel]                  [k] analyse_stripe
   1.64%  [kernel]                  [k] native_sched_clock
   1.32%  [kernel]                  [k] raid6_avx22_gen_syndrome
   1.14%  [kernel]                  [k] generic_make_request_checks
   1.11%  [kernel]                  [k] _raw_spin_unlock_irqrestore
   1.07%  [kernel]                  [k] native_irq_return_iret
   1.06%  [kernel]                  [k] add_stripe_bio
   1.03%  [kernel]                  [k] raid5_compute_blocknr
   0.99%  [kernel]                  [k] _raw_spin_lock_irq
   0.88%  [kernel]                  [k] raid5_compute_sector
   0.82%  [kernel]                  [k] select_task_rq_fair
   0.81%  [kernel]                  [k] _raw_spin_lock_irqsave
   0.71%  [kernel]                  [k] blk_mq_make_request
   0.70%  [kernel]                  [k] raid5_get_active_stripe
   0.68%  [kernel]                  [k] bio_reset
   0.67%  [kernel]                  [k] percpu_counter_add_batch
   0.63%  [kernel]                  [k] ktime_get
   0.61%  [kernel]                  [k] llist_reverse_order
   0.59%  [kernel]                  [k] ops_run_io
   0.59%  [kernel]                  [k] release_stripe_plug
   0.50%  [kernel]                  [k] raid5_make_request
   0.48%  [kernel]                  [k] raid5_release_stripe
   0.45%  [kernel]                  [k] loop_queue_work
   0.44%  [kernel]                  [k] llist_add_batch
   0.42%  [kernel]                  [k] sched_clock_cpu
   0.41%  [kernel]                  [k] blk_mq_dispatch_rq_list
   0.39%  [kernel]                  [k] default_send_IPI_single_phys
   0.38%  [kernel]                  [k] do_iter_readv_writev
   0.38%  [kernel]                  [k] bio_endio
   0.37%  [kernel]                  [k] md_write_inc

I'm not sure if it is expected that 'native_queued_spin_lock_slowpath'
is so dominant. Seems to increase when adding more and more writers.
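
To put a number on the lock overhead, the spinlock-related symbols of a saved listing can be summed. A rough sketch, assuming the 'perf top' lines were saved as text in the format shown above; the function name `spinlock_share` and the symbol pattern are illustrative choices, not an exhaustive list of lock symbols:

```shell
#!/usr/bin/env bash
# Sum the overhead percentages of spinlock symbols in perf-style lines read
# from stdin, e.g. "  33.92%  [kernel]  [k] native_queued_spin_lock_slowpath".
# The pattern is an illustrative assumption; adjust it to the symbols of interest.
spinlock_share() {
    awk '
        $NF ~ /spin_(lock|unlock)/ {
            pct = $1
            sub(/%$/, "", pct)      # strip the trailing percent sign
            total += pct
        }
        END { printf "%.2f\n", total }
    '
}
```

Fed with the listing above, the four matching entries (33.92, 1.11, 0.99 and 0.81) add up to 36.83%.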

BTW RAID0 does not have this issue.

Cheers,
Hans Henrik

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: OT: Processor recommendation for RAID6
  2021-04-06 21:05 ` OT: Processor recommendation for RAID6 Hans Henrik Happe
@ 2021-04-08 13:01   ` Gal Ofri
  0 siblings, 0 replies; 6+ messages in thread
From: Gal Ofri @ 2021-04-08 13:01 UTC (permalink / raw)
  To: happe; +Cc: linux-raid

On Tue, 6 Apr 2021 23:05:46 +0200
Hans Henrik Happe <happe@nbi.dk> wrote:

> […]
> I'm not sure if it is expected that 'native_queued_spin_lock_slowpath'
> is so dominant. Seems to increase when adding more and more writers.
> 
> BTW RAID0 does not have this issue.
> 
> Cheers,
> Hans Henrik

Hey,

I'm not sure if that helps much with your processor dilemma, but I'd like to confirm your findings:
I spent a while investigating raid6 (/raid5) performance recently. That spinlock is indeed the bottleneck, especially when you go beyond 8 cpu cores.
I tested some io workloads and found a limit of ~1.6M iops for rand-read-16k and ~100k iops for rand-write-4k, and a limit of 1~2 GB/s throughput on seq writes (e.g. 128k).
* For some reason I get only ~1.2M iops running random-read-4k.
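
As a quick sanity check on figures like these, IOPS can be converted to throughput by multiplying with the block size. A small sketch; the function name `iops_to_gbps` is made up for this example, and it assumes that "k" means 1024 bytes and that GB means 10^9 bytes:

```shell
#!/usr/bin/env bash
# Convert an IOPS figure to throughput.
# $1 = IOPS, $2 = block size in KiB; prints GB/s (10^9 bytes per second).
# The unit conventions are assumptions, see the note above the block.
iops_to_gbps() {
    awk -v iops="$1" -v kib="$2" \
        'BEGIN { printf "%.2f\n", iops * kib * 1024 / 1e9 }'
}
```

For example, `iops_to_gbps 100000 4` prints 0.41 and `iops_to_gbps 1600000 16` prints 26.21.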

The bottleneck originates in the device_lock, which clearly dominates the code, even in code paths that could, in principle, run in parallel.
I also spent the time "defusing" that lock, and I have a p.o.c. with ~1.9M/250k iops (rand 4k read/write), and ~25 GB/s throughput on rand-read-16k.
I'm working on making this p.o.c. ready for sharing here, but meanwhile feel free to contact me privately for an unstable version, if you find it useful for your needs.

Will update soon anyway.
Cheers,
Gal Ofri


[parent not found: <16ceff73-1257-fc3d-aade-43656c7216e7@molgen.mpg.de>]

[parent not found: <12e8f7f1-6655-9f0b-72b1-0908f229dcac@molgen.mpg.de>]

* Re: OT: Processor recommendation for RAID6
       [not found] ` <12e8f7f1-6655-9f0b-72b1-0908f229dcac@molgen.mpg.de>
@ 2021-04-02  9:09   ` Paul Menzel
  2021-04-02 14:45     ` Roger Heflin
  0 siblings, 1 reply; 6+ messages in thread
From: Paul Menzel @ 2021-04-02  9:09 UTC (permalink / raw)
  To: linux-raid; +Cc: LKML

Dear Linux folks,


Am 08.04.19 um 18:34 schrieb Paul Menzel:

> On 04/08/19 12:33, Paul Menzel wrote:
> 
>> Can you share your experiences on which processors you chose for
>> your RAID6 systems? I am particularly interested in Intel
>> alternatives. Are AMD EPYC processors good alternatives for file
>> servers? What about ARM and POWER?
>>
>> We currently use the HBA Adaptec Smart Storage PQI 12G SAS/PCIe 3
>> (rev 01), Dell systems and rotating disks.
>>
>> For example, Dell PowerEdge R730 with 40x E5-2687W v3 @ 3.10GHz,
>> 192 GB of memory, Linux 4.14.87 and XFS file system. (The processor
>> looks too powerful for the system. At least the processor usage
>> is at most one or two threads.)
>>
>> ```
>> [    0.394710] raid6: sse2x1   gen() 11441 MB/s
>> [    0.416710] raid6: sse2x1   xor()  8099 MB/s
>> [    0.438713] raid6: sse2x2   gen() 13359 MB/s
>> [    0.460710] raid6: sse2x2   xor()  8910 MB/s
>> [    0.482712] raid6: sse2x4   gen() 16128 MB/s
>> [    0.504710] raid6: sse2x4   xor() 10009 MB/s
>> [    0.526710] raid6: avx2x1   gen() 22242 MB/s
>> [    0.548709] raid6: avx2x1   xor() 15406 MB/s
>> [    0.570710] raid6: avx2x2   gen() 25699 MB/s
>> [    0.592710] raid6: avx2x2   xor() 16521 MB/s
>> [    0.614709] raid6: avx2x4   gen() 29847 MB/s
>> [    0.636710] raid6: avx2x4   xor() 18617 MB/s
>> [    0.642001] raid6: using algorithm avx2x4 gen() 29847 MB/s
>> [    0.648000] raid6: .... xor() 18617 MB/s, rmw enabled
>> [    0.654001] raid6: using avx2x2 recovery algorithm
>> ```

[…]

> Maybe some more data. AVX512 from Intel processors really seems to
> make a difference in the Linux tests. But also
> 
> ### Intel Xeon W-2145 (3.7 GHz) with Linux 4.19.19
> 
> ```
> $ dmesg | grep -e raid6 -e smpboot
> [    0.118880] smpboot: Allowing 16 CPUs, 0 hotplug CPUs
> [    0.379291] smpboot: CPU0: Intel(R) Xeon(R) W-2145 CPU @ 3.70GHz (family: 0x6, model: 0x55, stepping: 0x4)
> [    0.398245] smpboot: Max logical packages: 1
> [    0.398618] smpboot: Total of 16 processors activated (118400.00 BogoMIPS)
> [    0.426597] raid6: sse2x1   gen() 13144 MB/s
> [    0.443601] raid6: sse2x1   xor()  9962 MB/s
> [    0.460602] raid6: sse2x2   gen() 16863 MB/s
> [    0.477606] raid6: sse2x2   xor() 11425 MB/s
> [    0.494609] raid6: sse2x4   gen() 19089 MB/s
> [    0.511613] raid6: sse2x4   xor() 11988 MB/s
> [    0.528614] raid6: avx2x1   gen() 26285 MB/s
> [    0.545617] raid6: avx2x1   xor() 19335 MB/s
> [    0.562620] raid6: avx2x2   gen() 33953 MB/s
> [    0.579624] raid6: avx2x2   xor() 21255 MB/s
> [    0.596627] raid6: avx2x4   gen() 38492 MB/s
> [    0.613629] raid6: avx2x4   xor() 19722 MB/s
> [    0.630633] raid6: avx512x1 gen() 37621 MB/s
> [    0.647636] raid6: avx512x1 xor() 21017 MB/s
> [    0.664639] raid6: avx512x2 gen() 46859 MB/s
> [    0.681642] raid6: avx512x2 xor() 26173 MB/s
> [    0.698645] raid6: avx512x4 gen() 54210 MB/s
> [    0.715648] raid6: avx512x4 xor() 28041 MB/s
> [    0.716019] raid6: using algorithm avx512x4 gen() 54210 MB/s
> [    0.716244] raid6: .... xor() 28041 MB/s, rmw enabled
> [    0.716648] raid6: using avx512x2 recovery algorithm
> ```
> 
> ### AMD EPYC Linux 4.19.19 (up to 2.6 GHz according to `lscpu`)
> 
> ```
> $ dmesg | grep -e raid6 -e smpboot
> [    0.000000] smpboot: Allowing 128 CPUs, 0 hotplug CPUs
> [    0.122478] smpboot: CPU0: AMD EPYC 7601 32-Core Processor (family: 0x17, model: 0x1, stepping: 0x2)
> [    0.364480] smpboot: Max logical packages: 2
> [    0.366489] smpboot: Total of 128 processors activated (561529.72 BogoMIPS)
> [    0.503630] raid6: sse2x1   gen()  6136 MB/s
> [    0.524630] raid6: sse2x1   xor()  5931 MB/s
> [    0.545627] raid6: sse2x2   gen() 12941 MB/s
> [    0.566628] raid6: sse2x2   xor()  8173 MB/s
> [    0.587629] raid6: sse2x4   gen() 13089 MB/s
> [    0.608627] raid6: sse2x4   xor()  7318 MB/s
> [    0.629627] raid6: avx2x1   gen() 15164 MB/s
> [    0.650626] raid6: avx2x1   xor() 10990 MB/s
> [    0.671627] raid6: avx2x2   gen() 20316 MB/s
> [    0.692625] raid6: avx2x2   xor() 11886 MB/s
> [    0.713625] raid6: avx2x4   gen() 20726 MB/s
> [    0.734628] raid6: avx2x4   xor() 10095 MB/s
> [    0.739479] raid6: using algorithm avx2x4 gen() 20726 MB/s
> [    0.745479] raid6: .... xor() 10095 MB/s, rmw enabled
> [    0.750479] raid6: using avx2x2 recovery algorithm
> ```
> 
> Are these values a good benchmark for comparing processors?

After two years, yes they are. I created 16 10 GB files in `/dev/shm`,
set them up as loop devices, and created a RAID6. For resync speed it
makes a difference.

2 x AMD EPYC 7601 32-Core Processor:    34671K/sec
2 x Intel Xeon Gold 6248 CPU @ 2.50GHz: 87533K/sec
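
To set these resync numbers next to the boot-time raid6 benchmark, the selected algorithm and its gen() rate can be pulled out of the kernel log. A sketch that assumes the log format quoted above; `raid6_selected_gen` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Read dmesg-style lines on stdin and print "<algorithm> <MB/s>" for the
# "raid6: using algorithm ..." line. Assumes the message format shown in
# this thread (Linux 4.14/4.19); other kernels may word it differently.
raid6_selected_gen() {
    sed -n 's/.*raid6: using algorithm \([^ ]*\) gen() \([0-9]*\) MB\/s.*/\1 \2/p'
}

# Usage:
#   dmesg | raid6_selected_gen
```

For the logs quoted above this yields avx512x4 at 54210 MB/s (Xeon W-2145) and avx2x4 at 20726 MB/s (EPYC 7601), a ratio of about 2.6, while the two resync speeds differ by a factor of about 2.5. Note that the resync test ran on a different Intel system (Xeon Gold 6248).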

So, the current state of affairs seems to be that AVX512 instructions
do help for software RAIDs if you want fast rebuild/resync times.
Getting, for example, a four core/eight thread Intel Xeon Gold 5222
might be useful.

Now, the question remains whether AMD processors could make up for it with
higher performance or better optimized code, or whether AVX512 instructions
are a must,

[…]


Kind regards,

Paul


PS: Here are the commands on the AMD EPYC system:

```
$ for i in $(seq 1 16); do truncate -s 10G /dev/shm/vdisk$i.img; done
$ for i in /dev/shm/v*.img; do sudo losetup --find --show $i; done
/dev/loop0
/dev/loop1
/dev/loop2
/dev/loop3
/dev/loop4
/dev/loop5
/dev/loop6
/dev/loop7
/dev/loop8
/dev/loop9
/dev/loop10
/dev/loop11
/dev/loop12
/dev/loop13
/dev/loop14
/dev/loop15
$ sudo mdadm --create /dev/md1 --level=6 --raid-devices=16 /dev/loop{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
$ more /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 loop15[15] loop14[14] loop13[13] loop12[12] loop11[11] loop10[10] loop9[9] loop8[8] loop7[7] loop6[6] loop5[5] loop4[4] loop3[3] loop2[2] loop1[1] loop0[0]
       146671616 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
       [>....................]  resync =  3.9% (416880/10476544) finish=5.6min speed=29777K/sec

unused devices: <none>
$ more /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 loop15[15] loop14[14] loop13[13] loop12[12] loop11[11] loop10[10] loop9[9] loop8[8] loop7[7] loop6[6] loop5[5] loop4[4] loop3[3] loop2[2] loop1[1] loop0[0]
       146671616 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
       [>....................]  resync =  4.1% (439872/10476544) finish=5.3min speed=31419K/sec
$ sudo mdadm -S /dev/md1
mdadm: stopped /dev/md1
$ sudo losetup -D
$ sudo rm /dev/shm/vdisk*.img
```
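
To record the resync speed without reading it off by hand, the speed field can be extracted from /proc/mdstat. A minimal sketch; the function name `md_resync_speed` is an assumption for illustration, and it expects the `speed=...K/sec` format shown in the transcript above:

```shell
#!/usr/bin/env bash
# Print the resync/recovery speed in K/sec for each matching line of
# mdstat-style input on stdin.
md_resync_speed() {
    sed -n 's/.*speed=\([0-9]*\)K\/sec.*/\1/p'
}

# Example: sample once per second while the array resyncs
#   while grep -q resync /proc/mdstat; do
#       md_resync_speed < /proc/mdstat
#       sleep 1
#   done
```

Averaging the samples over the whole run should give a steadier figure than a single reading taken a few percent into the resync.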


* Re: OT: Processor recommendation for RAID6
  2021-04-02  9:09   ` Paul Menzel
@ 2021-04-02 14:45     ` Roger Heflin
  2021-04-07 13:46       ` Paul Menzel
  0 siblings, 1 reply; 6+ messages in thread
From: Roger Heflin @ 2021-04-02 14:45 UTC (permalink / raw)
  To: Paul Menzel; +Cc: Linux RAID, LKML

On Fri, Apr 2, 2021 at 4:13 AM Paul Menzel <pmenzel@molgen.mpg.de> wrote:
>
> Dear Linux folks,
>
>

> > Are these values a good benchmark for comparing processors?
>
> After two years, yes they are. I created 16 10 GB files in `/dev/shm`,
> set them up as loop devices, and created a RAID6. For resync speed it
> makes a difference.
>
> 2 x AMD EPYC 7601 32-Core Processor:    34671K/sec
> 2 x Intel Xeon Gold 6248 CPU @ 2.50GHz: 87533K/sec
>
> So, the current state of affairs seems to be, that AVX512 instructions
> do help for software RAIDs, if you want fast rebuild/resync times.
> Getting, for example, a four core/eight thread Intel Xeon Gold 5222
> might be useful.
>
> Now, the question remains, if AMD processors could make it up with
> higher performance, or better optimized code, or if AVX512 instructions
> are a must,
>
> […]
>
>
> Kind regards,
>
> Paul
>
>
> PS: Here are the commands on the AMD EPYC system:
>
> ```
> $ for i in $(seq 1 16); do truncate -s 10G /dev/shm/vdisk$i.img; done
> $ for i in /dev/shm/v*.img; do sudo losetup --find --show $i; done
> /dev/loop0
> /dev/loop1
> /dev/loop2
> /dev/loop3
> /dev/loop4
> /dev/loop5
> /dev/loop6
> /dev/loop7
> /dev/loop8
> /dev/loop9
> /dev/loop10
> /dev/loop11
> /dev/loop12
> /dev/loop13
> /dev/loop14
> /dev/loop15
> $ sudo mdadm --create /dev/md1 --level=6 --raid-devices=16
> /dev/loop{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
> mdadm: Defaulting to version 1.2 metadata
> mdadm: array /dev/md1 started.
> $ more /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
> [multipath]
> md1 : active raid6 loop15[15] loop14[14] loop13[13] loop12[12]
> loop11[11] loop10[10] loop9[9] loop8[8] loop7[7] loop6[6] loop5[5]
> loop4[4] loop3[3] loop2[2] loop1[1] loop0[0]
>        146671616 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [16/16] [UUUUUUUUUUUUUUUU]
>        [>....................]  resync =  3.9% (416880/10476544)
> finish=5.6min speed=29777K/sec
>
> unused devices: <none>
> $ more /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
> [multipath]
> md1 : active raid6 loop15[15] loop14[14] loop13[13] loop12[12]
> loop11[11] loop10[10] loop9[9] loop8[8] loop7[7] loop6[6] loop5[5]
> loop4[4] loop3[3] loop2[2] loop1[1] loop0[0]
>        146671616 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [16/16] [UUUUUUUUUUUUUUUU]
>        [>....................]  resync =  4.1% (439872/10476544)
> finish=5.3min speed=31419K/sec
> $ sudo mdadm -S /dev/md1
> mdadm: stopped /dev/md1
> $ sudo losetup -D
> $ sudo rm /dev/shm/vdisk*.img
> ```


I think you are testing something else.  Your speeds are way below
what the raw processor can do. You are probably testing memory
speed/numa arch differences between the 2.

On the intel arch there are 2 numa nodes total with 4 channels each, so the
system has 8 usable channels of bandwidth, but an allocation on a
single numa node will only have 4 channels usable (ddr4-2933).

On the epyc there are 8 numa nodes with 2 channels each (ddr4-2666),
so any single memory allocation will have only 2 channels available,
and accesses that go across the numa bus will be slower.

So (4*2933)/(2*2666) = 2.20, and 2.20 * 34671 = 76286 (fairly close to your results).
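
Spelled out, the estimate scales the EPYC result by the ratio of per-node memory bandwidth (channels times transfer rate). A sketch of that arithmetic; the function name `scaled_estimate` is illustrative only:

```shell
#!/usr/bin/env bash
# $1,$2 = channels and MT/s of the faster node; $3,$4 = the same for the
# slower node; $5 = measured speed on the slower system.
# Prints "<ratio> <estimate>", with the estimate truncated to an integer.
scaled_estimate() {
    awk -v c1="$1" -v r1="$2" -v c2="$3" -v r2="$4" -v base="$5" \
        'BEGIN {
            ratio = (c1 * r1) / (c2 * r2)
            printf "%.2f %d\n", ratio, int(ratio * base)
        }'
}
```

With the numbers above, `scaled_estimate 4 2933 2 2666 34671` prints `2.20 76286`, compared with the measured 87533K/sec.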

How the allocation for memory works depends a lot on how much ram you
actually have per numa node and how much for the whole machine.  But
any single block for any single device should be on a single numa node
almost all of the time.

You might want to drop the cache before the test, run numactl
--hardware to see how much memory is free per numa node, then rerun
the test and, at the end of the test before the stop, run numactl --hardware
again to see how it was spread across numa nodes.  Even if it spreads
it across multiple numa nodes, that may well mean that in the epyc case
you are running with several numa nodes where the main raid processes
are running against remote numa nodes, and because intel only has 2
then there is a decent chance that it is only running on 1 most of the
time (so no remote memory).  I have also seen in benchmarks I have run
on 2P and 4P intel machines that interleaved on a 2P single thread job
is faster than running on a single numa node's memory (with the process
pinned to a single cpu on one of the numa nodes, memory interleaved
over both), but on a 4P/4 numa node machine interleaving slows it down
significantly.  And in the default case any single write/read of a
block is likely only on a single numa node, so that specific read/write
is constrained by a single numa node's bandwidth, giving an advantage to
fewer faster/bigger numa nodes and less remote memory.
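
The before/after comparison suggested here can be scripted. A rough sketch, assuming the usual `numactl --hardware` output with lines of the form `node N free: X MB`; the function name `numa_free_mb` and the file names in the comments are hypothetical:

```shell
#!/usr/bin/env bash
# Read `numactl --hardware` output on stdin and print "<node> <free MB>"
# for every node.
numa_free_mb() {
    awk '$1 == "node" && $3 == "free:" { print $2, $4 }'
}

# Intended use (the cache drop needs root):
#   sync; echo 3 > /proc/sys/vm/drop_caches
#   numactl --hardware | numa_free_mb > before.txt
#   ... create the array and let it resync ...
#   numactl --hardware | numa_free_mb > after.txt    # before `mdadm -S`
#   paste before.txt after.txt
```

The drop in free memory per node between the two snapshots should show where the /dev/shm pages ended up.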

Outside of rebooting and forcing the entire machine to interleave I am
not sure how to get shm to interleave.   It might be a good enough
test to just force the epyc to interleave and see if the benchmark
result changes in any way.  If the result does change repeat on the
intel.  Overall for the most part the raid would not be able to use
very many cpus anyway, so a bigger machine with more numa nodes may
slow down the overall rate.
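
One option that may avoid a reboot: tmpfs accepts an `mpol=` mount option (on kernels built with CONFIG_NUMA), so /dev/shm could be remounted with an interleave policy before the image files are created. This is an untested sketch and not something tried in this thread; the helper name `shm_interleave_cmd` is illustrative, and it only prints the command so it can be reviewed first:

```shell
#!/usr/bin/env bash
# Print (not run) a remount command that switches a tmpfs mount to an
# interleaved NUMA allocation policy.
# $1 = tmpfs mount point, defaults to /dev/shm.
shm_interleave_cmd() {
    local mnt="${1:-/dev/shm}"
    printf 'mount -o remount,mpol=interleave %s\n' "$mnt"
}
```

The printed command would be run as root before the `truncate` step. Pages that are already allocated would presumably stay where they are, so the image files should be created after the remount.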


* Re: OT: Processor recommendation for RAID6
  2021-04-02 14:45     ` Roger Heflin
@ 2021-04-07 13:46       ` Paul Menzel
  2021-04-08  0:12         ` Roger Heflin
  0 siblings, 1 reply; 6+ messages in thread
From: Paul Menzel @ 2021-04-07 13:46 UTC (permalink / raw)
  To: Roger Heflin; +Cc: linux-raid, LKML, it+linux-raid

Dear Roger,


Thank you for your response.


Am 02.04.21 um 16:45 schrieb Roger Heflin:
> On Fri, Apr 2, 2021 at 4:13 AM Paul Menzel wrote:

>>> Are these values a good benchmark for comparing processors?
>>
>> After two years, yes they are. I created 16 10 GB files in `/dev/shm`,
>> set them up as loop devices, and created a RAID6. For resync speed it
>> makes difference.
>>
>> 2 x AMD EPYC 7601 32-Core Processor:    34671K/sec
>> 2 x Intel Xeon Gold 6248 CPU @ 2.50GHz: 87533K/sec
>>
>> So, the current state of affairs seems to be, that AVX512 instructions
>> do help for software RAIDs, if you want fast rebuild/resync times.
>> Getting, for example, a four core/eight thread Intel Xeon Gold 5222
>> might be useful.
>>
>> Now, the question remains, if AMD processors could make it up with
>> higher performance, or better optimized code, or if AVX512 instructions
>> are a must,
>>
>> […]

>> PS: Here are the commands on the AMD EPYC system:
>>
>> ```
>> $ for i in $(seq 1 16); do truncate -s 10G /dev/shm/vdisk$i.img; done
>> $ for i in /dev/shm/v*.img; do sudo losetup --find --show $i; done
>> […]
>> $ sudo mdadm --create /dev/md1 --level=6 --raid-devices=16 /dev/loop{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
>> mdadm: Defaulting to version 1.2 metadata
>> mdadm: array /dev/md1 started.
>> $ more /proc/mdstat
>> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
>> md1 : active raid6 loop15[15] loop14[14] loop13[13] loop12[12] loop11[11] loop10[10] loop9[9] loop8[8] loop7[7] loop6[6] loop5[5] loop4[4] loop3[3] loop2[2] loop1[1] loop0[0]
>>         146671616 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>         [>....................]  resync =  3.9% (416880/10476544) finish=5.6min speed=29777K/sec
>>
>> unused devices: <none>
>> $ more /proc/mdstat
>> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
>> md1 : active raid6 loop15[15] loop14[14] loop13[13] loop12[12] loop11[11] loop10[10] loop9[9] loop8[8] loop7[7] loop6[6] loop5[5] loop4[4] loop3[3] loop2[2] loop1[1] loop0[0]
>>         146671616 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>         [>....................]  resync =  4.1% (439872/10476544) finish=5.3min speed=31419K/sec
>> $ sudo mdadm -S /dev/md1
>> mdadm: stopped /dev/md1
>> $ sudo losetup -D
>> $ sudo rm /dev/shm/vdisk*.img
> 
> I think you are testing something else.  Your speeds are way below
> what the raw processor can do. You are probably testing memory
> speed/numa arch differences between the 2.
> 
> On the intel arch there are 2 numa nodes total with 4 channels each, so the
> system has 8 usable channels of bandwidth, but an allocation on a
> single numa node will only have 4 channels usable (ddr4-2933).
> 
> On the epyc there are 8 numa nodes with 2 channels each (ddr4-2666),
> so any single memory allocation will have only 2 channels available
> and if the accesses are across the numa bus will be slower.
> 
> So (4*2933)/(2*2666) = 2.20, and 2.20 * 34671 = 76286 (fairly close to your results).
> 
> How the allocation for memory works depends a lot on how much ram you
> actually have per numa node and how much for the whole machine.  But
> any single block for any single device should be on a single numa node
> almost all of the time.
> 
> You might want to drop the cache before the test, run numactl
> --hardware to see how much memory is free per numa node, then rerun
> the test, and at the end of the test, before the stop, run numactl --hardware
> again to see how it was spread across numa nodes.  Even if it spreads
> it across multiple numa nodes that may well mean that in the epyc case
> you are running with several numa nodes where the main raid processes
> are running against remote numa nodes, and because intel only has 2
> then there is a decent chance that it is only running on 1 most of the
> time (so no remote memory).  I have also seen in benchmarks I have run
> on 2P and 4P intel machines that interleaved on a 2P single thread job
> is faster than running on a single numa node's memory (with the process
> pinned to a single cpu on one of the numa nodes, memory interleaved
> over both), but on a 4P/4numa node machine interleaving slows it down
> significantly.  And in the default case any single write/read of a
> block is likely only on a single numa node so that specific read/write
> is constrained by a single numa node bandwidth giving an advantage to
> fewer faster/bigger numa nodes and less remote memory.
> 
> Outside of rebooting and forcing the entire machine to interleave I am
> not sure how to get shm to interleave.   It might be a good enough
> test to just force the epyc to interleave and see if the benchmark
> result changes in any way.  If the result does change repeat on the
> intel.  Overall for the most part the raid would not be able to use
> very many cpus anyway, so a bigger machine with more numa nodes may
> slow down the overall rate.
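
The channel arithmetic in the quoted analysis can be rechecked with a short sketch. It assumes, as the estimate itself does, that resync speed scales with channels per NUMA node times DDR4 transfer rate; the function and variable names are illustrative only.

```shell
# Ratio of per-node memory bandwidth, Intel (4 channels, DDR4-2933)
# versus EPYC (2 channels, DDR4-2666), applied to the EPYC resync speed.
predicted_speed() {
    awk 'BEGIN {
        intel = 4 * 2933          # channels * MT/s on one Intel NUMA node
        epyc  = 2 * 2666          # channels * MT/s on one EPYC NUMA node
        ratio = intel / epyc
        # %d truncates the scaled EPYC figure (34671K/sec) to an integer
        printf "ratio=%.2f predicted=%dK/sec\n", ratio, ratio * 34671
    }'
}
predicted_speed
```

The predicted value can then be compared with the 87533K/sec measured on the Xeon Gold 6248 system.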

Thank you for the analysis. If I find the time, I will try your
suggestions. In the meantime, I won’t test in `/dev/shm`. Our
servers with 256+ GB RAM are only two-socket systems with a lot of
cores/threads, but I didn’t have controllers and disks handy for testing.

I quickly tested this on two desktop machines.

On a Dell OptiPlex 5055 with AMD Ryzen 5 PRO 1500 (max 3.5 GHz), 16 GB
memory, and 16 loop-mounted 512 MB files in `/dev/shm`, Linux 5.12.0-rc6
reports about 60000K/sec.

```
$ more /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 loop15[15] loop14[14] loop13[13] loop12[12] loop11[11] loop10[10] loop9[9] loop8[8] loop7[7] loop6[6] loop5[5] loop4[4] loop3[3] loop2[2] loop1[1] loop0[0]
       7311360 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
       [===================>.]  resync = 95.6% (500704/522240) finish=0.0min speed=62588K/sec

unused devices: <none>
```

On a Dell Precision 3620 with Intel i7-7700 @ 3.6 GHz and 32 GB memory,
Linux 5.12.0-rc3 reports 110279K/sec.

```
$ more /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 loop15[15] loop14[14] loop13[13] loop12[12] loop11[11] loop10[10] loop9[9] loop8[8] loop7[7] loop6[6] loop5[5] loop4[4] loop3[3] loop2[2] loop1[1] loop0[0]
       7311360 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
       [================>....]  resync = 84.3% (441116/522240) finish=0.0min speed=110279K/sec

unused devices: <none>
```

I have no idea whether this is due to the smaller files or to the
processor/system (single-thread performance?).
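
Single readings of `speed=` fluctuate (29777K/sec and 31419K/sec in the two readings quoted above), so sampling the value repeatedly gives a steadier basis for comparing machines. A minimal sketch, with `mdstat_speed` as a made-up helper name:

```shell
# Print the resync speed in K/sec found in mdstat-style text on stdin.
# (mdstat_speed is a hypothetical helper name.)
mdstat_speed() {
    sed -n 's/.*speed=\([0-9]*\)K\/sec.*/\1/p'
}

# Sample every 10 seconds while a resync is running (commented out:
# only useful while an md array is actually resyncing):
#   while grep -q resync /proc/mdstat; do
#       mdstat_speed < /proc/mdstat
#       sleep 10
#   done
```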

On a Dell T440/021KCD (firmware 2.9.3) with two Intel Xeon Gold 5222 CPU
@ 3.80GHz (AVX512), 128 GB memory, Adaptec Smart Storage PQI 12G
SAS/PCIe 3 (HBA1100) and sixteen 8 TB Seagate ST8000NM001A disks, Linux
5.4.97 reports over 130000K/sec.

```
$ more /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md0 : active raid6 sdr[15] sdq[14] sdp[13] sdo[12] sdn[11] sdm[10] sdl[9] sdk[8] sdj[7] sdi[6] sdh[5] sdg[4] sdf[3] sde[2] sdd[1] sdc[0]
       109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
       [=>...................]  resync =  5.7% (452697636/7813894144) finish=938.1min speed=130767K/sec
       bitmap: 56/59 pages [224KB], 65536KB chunk

unused devices: <none>
$ sudo perf top
[…]
   15.97%  [kernel]            [k] xor_avx_5
   12.78%  [kernel]            [k] analyse_stripe
   11.90%  [kernel]            [k] memcmp
    7.71%  [kernel]            [k] ops_run_io
    4.75%  [kernel]            [k] blk_rq_map_sg
    4.41%  [kernel]            [k] raid6_avx5124_gen_syndrome
    3.36%  [kernel]            [k] bio_advance
    3.03%  [kernel]            [k] raid5_get_active_stripe
    3.00%  [kernel]            [k] raid5_end_read_request
    2.85%  [kernel]            [k] xor_avx_3
    1.72%  [kernel]            [k] blk_update_request
[…]
```
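
In the profile above, `raid6_avx5124_gen_syndrome` accounts for only 4.41% while `xor_avx_5` is at 15.97%, so it can be worth confirming what the CPU and kernel actually offer before crediting AVX512. A minimal sketch, with `has_avx512` as a made-up helper name; the assumption that the kernel log contains `raid6:` lines from the boot-time algorithm benchmark holds on typical systems but depends on log retention.

```shell
# Report whether cpuinfo-style text on stdin lists the avx512f flag.
# (has_avx512 is a hypothetical helper name.)
has_avx512() {
    if grep -qw avx512f; then echo yes; else echo no; fi
}

# On a live system (commented out; reading the kernel log may need root):
#   has_avx512 < /proc/cpuinfo
#   dmesg | grep -i 'raid6:'
```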

This is also much faster than the `/dev/shm` results on the Dell
PowerEdge T640 with two Intel Xeon Gold 6248 @ 2.50 GHz.

So, for the purpose of this thread, tests need to be done on real disks
and not on loop-mounted files in memory.


Kind regards,

Paul

* Re: OT: Processor recommendation for RAID6
  2021-04-07 13:46       ` Paul Menzel
@ 2021-04-08  0:12         ` Roger Heflin
  0 siblings, 0 replies; 6+ messages in thread
From: Roger Heflin @ 2021-04-08  0:12 UTC
  To: Paul Menzel; +Cc: Linux RAID, LKML, it+linux-raid

I ran some tests on a 4-socket intel box (gold 6152 I think) with files
in tmpfs, and with the files interleaved 4-way (I think) got roughly the
same speeds you got on your intels with defaults.

I also tested on my 6 core/4500u ryzen and got almost the same
speed (slightly slower) as on your large ryzen boxes with many numa
nodes, so it has to be effectively only using a single numa node and a
single cpu.

I did test my 4500u ryzen machine with fewer cores enabled: 1 core
got 18M, 2 cores got 23M, and 3 got 32M, so it did not appear to scale
past 3 cores.
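
As a rough sketch, the figures above translate into speedup and per-core efficiency relative to the 1-core run as follows (the function name is illustrative; the three speeds are the ones reported above):

```shell
# Speedup and per-core efficiency relative to the 1-core run, using the
# reported figures 18M, 23M and 32M for 1, 2 and 3 cores.
core_scaling() {
    awk 'BEGIN {
        n = split("18 23 32", speed, " ")
        for (c = 1; c <= n; c++) {
            speedup = speed[c] / speed[1]
            printf "cores=%d speedup=%.2f efficiency=%.2f\n", c, speedup, speedup / c
        }
    }'
}
core_scaling
```

An efficiency well below 1.00 at two and three cores fits the picture of a resync that is mostly limited by something other than core count.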

I also tested on an ancient a8-5600k and it was almost the same speed as
the ryzen.

Judging from the calls, there must be a lot of memory reading.  And I got
the same speed using shm, using tmpfs, using tmpfs+hugepages and using
files on a disk that should have been in file cache.

end of thread, other threads:[~2021-04-08 13:01 UTC | newest]

Thread overview: 6+ messages
     [not found] <CAAMCDedmGUcWY=9Nb36gXoo0+F82rhq=-6yKZ1xPf74Gj0mq7Q () mail ! gmail ! com>
2021-04-06 21:05 ` OT: Processor recommendation for RAID6 Hans Henrik Happe
2021-04-08 13:01   ` Gal Ofri
     [not found] <16ceff73-1257-fc3d-aade-43656c7216e7@molgen.mpg.de>
     [not found] ` <12e8f7f1-6655-9f0b-72b1-0908f229dcac@molgen.mpg.de>
2021-04-02  9:09   ` Paul Menzel
2021-04-02 14:45     ` Roger Heflin
2021-04-07 13:46       ` Paul Menzel
2021-04-08  0:12         ` Roger Heflin
