On 4/2/07, Dan Williams wrote:
> On 3/30/07, Raz Ben-Jehuda(caro) wrote:
> > Please see below.
> >
> > On 8/28/06, Neil Brown wrote:
> > > On Sunday August 13, raziebe@gmail.com wrote:
> > > > well ... me again
> > > >
> > > > Following your advice....
> > > >
> > > > I added a deadline for every WRITE stripe head when it is created.
> > > > In raid5_activate_delayed I check whether the deadline has expired,
> > > > and if it has not, the stripe head is left delayed rather than being
> > > > moved to preread-active mode.
> > > >
> > > > This small fix (and a few others elsewhere in the code) reduced the
> > > > amount of reads to zero with dd, but with no improvement in
> > > > throughput. With random access to the raid (buffers aligned to the
> > > > stripe width and of stripe-width size), however, there is an
> > > > improvement of at least 20%.
> > > >
> > > > The problem is that the user must know what he is doing; otherwise,
> > > > if the deadline is too long (say 100 ms), performance is reduced.
> > >
> > > So if I understand you correctly, you are delaying write requests to
> > > partial stripes slightly (your 'deadline') and this is sometimes
> > > giving you a 20% improvement ?
> > >
> > > I'm not surprised that you could get some improvement.  20% is quite
> > > surprising.  It would be worth following through with this to make
> > > that improvement generally available.
> > >
> > > As you say, picking a time in milliseconds is very error prone.  We
> > > really need to come up with something more natural.
> > > I had hoped that the 'unplug' infrastructure would provide the right
> > > thing, but apparently not.  Maybe unplug is just being called too
> > > often.
> > >
> > > I'll see if I can duplicate this myself and find out what is really
> > > going on.
> > >
> > > Thanks for the report.
> > >
> > > NeilBrown
> >
> > Neil, hello. I am sorry for this interval; I was abruptly assigned to
> > a different project.
> >
> > 1. I took another look at the raid5 delay patch I wrote a while ago.
> > I ported it to 2.6.17 and tested it. It appears to work, and when used
> > correctly it eliminates the read penalty.
> >
> > 2. Benchmarks.
> > Configuration: I am testing a 3-disk raid5 with a 1 MB chunk size.
> > IOs are synchronous and non-buffered (O_DIRECT), 2 MB in size, and
> > always aligned to the beginning of a stripe. The kernel is 2.6.17.
> > stripe_delay was set to 10 ms.
> >
> > Attached is the simple_write code.
> >
> > command: simple_write /dev/md1 2048 0 1000
> > simple_write issues raw sequential writes (O_DIRECT) of 2048 kilobytes,
> > 1000 times, starting from offset zero.
> >
> > Benchmark before the patch:
> >
> > Device:    tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> > sda    1848.00      8384.00     50992.00       8384      50992
> > sdb    1995.00     12424.00     51008.00      12424      51008
> > sdc    1698.00      8160.00     51000.00       8160      51000
> > sdd       0.00         0.00         0.00          0          0
> > md0       0.00         0.00         0.00          0          0
> > md1     450.00         0.00    102400.00          0     102400
> >
> > Benchmark after the patch:
> >
> > Device:    tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> > sda     389.11         0.00    128530.69          0     129816
> > sdb     381.19         0.00    129354.46          0     130648
> > sdc     383.17         0.00    128530.69          0     129816
> > sdd       0.00         0.00         0.00          0          0
> > md0       0.00         0.00         0.00          0          0
> > md1    1140.59         0.00    259548.51          0     262144
> >
> > As one can see, no additional reads were done. One can actually
> > calculate the raid's utilization: (n-1)/n * (single-disk throughput
> > with 1 MB writes). Here each of the three disks runs at its 1 MB-write
> > speed while md1 delivers about twice the single-disk rate, i.e. 2/3 of
> > the raw bandwidth goes to data.
> >
> > 3. The patch code.
> > The kernel tested above was 2.6.17. The patch is against 2.6.20.2
> > because I noticed big code differences between 17 and 20.x. This patch
> > was not tested on 2.6.20.2, but it is essentially the same. I have not
> > tested (yet) degraded mode or any other non-common paths.
>
> This is along the same lines of what I am working on, new cache
> policies for raid5/6, so I want to give it a try as well.
> Unfortunately gmail has mangled your patch. Can you resend as an
> attachment?
>
> patch: **** malformed patch at line 10:
> (&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
>
> Thanks,
> Dan

Dan, hello. Attached are the patches. Also, I have added another test
unit: random_writev (a stripped-down sketch of it is appended at the end
of this mail). It is not much code, but it does the work. It tests
writing a vector; it shows the same results as writing with a single
buffer. What are the new cache policies?

Please note! I have not indented the patch, nor followed the
instructions in the SubmittingPatches document. If Neil approves this
patch or parts of it, I will do so.
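The core of the change is small. Stripped of the tunable plumbing, it
amounts to roughly the sketch below against the 2.6.20 raid5 code. This
is an illustration of the idea, not the patch itself: the deadline field
on the stripe head (set to jiffies plus the configured deadline when a
write stripe is first delayed) and the stripe_deadline tunable are the
additions, and the names are only how I describe it here.

/* drivers/md/raid5.c (fragment, illustrative only) */
static void raid5_activate_delayed(raid5_conf_t *conf)
{
	if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) {
		struct stripe_head *sh, *tmp;

		list_for_each_entry_safe(sh, tmp, &conf->delayed_list, lru) {
			/*
			 * The change: leave a delayed write stripe alone until its
			 * deadline expires, so a sequential writer gets a chance to
			 * fill the whole stripe and the read-modify-write is avoided.
			 * sh->deadline is the field the patch adds.
			 */
			if (!time_after(jiffies, sh->deadline))
				continue;

			/* Expired: activate it exactly as the existing code does. */
			list_del_init(&sh->lru);
			clear_bit(STRIPE_DELAYED, &sh->state);
			if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
				atomic_inc(&conf->preread_active_stripes);
			list_add_tail(&sh->lru, &conf->handle_list);
		}
	}
}

The rest of the patch is essentially just the tunable itself and setting
sh->deadline when a write stripe head is delayed.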
Benchmark 3: testing an 8-disk raid5.

Tyan NUMA dual-CPU (AMD) machine with 8 SATA Maxtor disks; the
controller is a Promise in JBOD mode.

raid conf:
md1 : active raid5 sda2[0] sdh1[7] sdg1[6] sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb2[1]
      3404964864 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]

In order to achieve zero reads I had to tune the deadline to 20 ms (so
long?). stripe_cache_size is 256, which is exactly what is needed to get
a full stripe hit with this configuration.

command: random_writev /dev/md1 7168 0 3000 10000

iostat snapshot:

avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.00    0.00   21.00   29.00   50.00

Device:    tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
hda       0.00         0.00         0.00          0          0
md0       0.00         0.00         0.00          0          0
sda     234.34         0.00     50400.00          0      49896
sdb     235.35         0.00     50658.59          0      50152
sdc     242.42         0.00     51014.14          0      50504
sdd     246.46         0.00     50755.56          0      50248
sde     248.48         0.00     51272.73          0      50760
sdf     245.45         0.00     50755.56          0      50248
sdg     244.44         0.00     50755.56          0      50248
sdh     245.45         0.00     50755.56          0      50248
md1    1407.07         0.00    347741.41          0     344264

Try setting stripe_cache_size to 255 and you will notice the delay. Try
lowering stripe_deadline and you will notice how the number of reads
grows.

Cheers
--
Raz
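For completeness, the essence of random_writev is just this:
stripe-aligned, full-stripe, O_DIRECT vectored writes at random offsets.
Below is a minimal sketch of that idea, not the attached tool; the
argument list and the fixed iovec count are simplified, and error
handling is kept to a minimum.

/* sketch of the random_writev idea, not the attached source */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define IOVCNT 4  /* buffers per writev(); illustrative */

int main(int argc, char **argv)
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s <device> <stripe_kb> <count>\n", argv[0]);
		return 1;
	}
	size_t stripe = (size_t)atol(argv[2]) * 1024;  /* full stripe width, e.g. 7168 KB */
	long count = atol(argv[3]);
	size_t seg = stripe / IOVCNT;                  /* each iovec covers part of the stripe */

	int fd = open(argv[1], O_WRONLY | O_DIRECT);
	if (fd < 0) { perror("open"); return 1; }

	off_t dev_size = lseek(fd, 0, SEEK_END);
	long nstripes = dev_size / stripe;
	if (nstripes <= 0) { fprintf(stderr, "device smaller than one stripe\n"); return 1; }

	struct iovec iov[IOVCNT];
	for (int i = 0; i < IOVCNT; i++) {
		/* O_DIRECT needs aligned buffers; 4096 covers 512/4096-byte sectors */
		if (posix_memalign(&iov[i].iov_base, 4096, seg)) {
			perror("posix_memalign"); return 1;
		}
		memset(iov[i].iov_base, 'R', seg);
		iov[i].iov_len = seg;
	}

	for (long n = 0; n < count; n++) {
		/* random but always stripe-aligned, so raid5 never sees a partial stripe */
		off_t off = (off_t)(random() % nstripes) * stripe;
		if (lseek(fd, off, SEEK_SET) < 0 || writev(fd, iov, IOVCNT) < 0) {
			perror("writev"); return 1;
		}
	}
	close(fd);
	return 0;
}

Build it with plain gcc and point it at the md device. Because every
write starts on a stripe boundary and covers a whole stripe, any reads
you see in iostat come from raid5 itself, not from the test.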