From: "Raz Ben-Jehuda(caro)" <raziebe@gmail.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Neil Brown <neilb@suse.de>,
Linux RAID Mailing List <linux-raid@vger.kernel.org>
Subject: Re: raid5 write performance
Date: Mon, 2 Apr 2007 17:13:43 +0300 [thread overview]
Message-ID: <5d96567b0704020713s6ae02102kaeca12edfb663cb1@mail.gmail.com> (raw)
In-Reply-To: <e9c3a7c20704011608lb1fcb55j628a7d1172812869@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 6706 bytes --]
On 4/2/07, Dan Williams <dan.j.williams@intel.com> wrote:
> On 3/30/07, Raz Ben-Jehuda(caro) <raziebe@gmail.com> wrote:
> > Please see below.
> >
> > On 8/28/06, Neil Brown <neilb@suse.de> wrote:
> > > On Sunday August 13, raziebe@gmail.com wrote:
> > > > well ... me again
> > > >
> > > > Following your advice....
> > > >
> > > > I added a deadline for every WRITE stripe head when it is created.
> > > > In raid5_activate_delayed I check whether the deadline has expired
> > > > and, if not, set the sh to preread-active mode.
> > > >
> > > > This small fix (and a few other places in the code) reduced the
> > > > amount of reads to zero with dd, but with no improvement to
> > > > throughput.  But with random access to the raid (buffers aligned to
> > > > the stripe width and of the same size as the stripe width) there is
> > > > an improvement of at least 20%.
> > > >
> > > > Problem is that a user must know what he is doing, else there would
> > > > be a reduction in performance if the deadline is too long (say 100 ms).
> > >
> > > So if I understand you correctly, you are delaying write requests to
> > > partial stripes slightly (your 'deadline') and this is sometimes
> > > giving you a 20% improvement ?
> > >
> > > I'm not surprised that you could get some improvement. 20% is quite
> > > surprising. It would be worth following through with this to make
> > > that improvement generally available.
> > >
> > > As you say, picking a time in milliseconds is very error prone. We
> > > really need to come up with something more natural.
> > > I had hoped that the 'unplug' infrastructure would provide the right
> > > thing, but apparently not. Maybe unplug is just being called too
> > > often.
> > >
> > > I'll see if I can duplicate this myself and find out what is really
> > > going on.
> > >
> > > Thanks for the report.
> > >
> > > NeilBrown
> > >
> >
> > Hello Neil.  I am sorry for the delay; I was abruptly assigned to
> > a different project.
> >
> > 1.
> > I took another look at the raid5 delay patch I wrote a while ago.
> > I ported it to 2.6.17 and tested it.  It appears to work, and when
> > used correctly it eliminates the read penalty.
> >
> > 2. Benchmarks.
> > Configuration:
> > I am testing a raid5 of 3 disks with a 1MB chunk size.  IOs are
> > synchronous and non-buffered (O_DIRECT), 2 MB in size, and always
> > aligned to the beginning of a stripe.  The kernel is 2.6.17.  The
> > stripe_delay was set to 10ms.
> >
> > Attached is the simple_write code.
> >
> > command:
> > simple_write /dev/md1 2048 0 1000
> > simple_write does raw writes (O_DIRECT) sequentially, starting from
> > offset zero, 2048 kilobytes at a time, 1000 times.
> >
> > Benchmark before the patch (iostat):
> >
> > Device:         tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> > sda         1848.00      8384.00     50992.00       8384      50992
> > sdb         1995.00     12424.00     51008.00      12424      51008
> > sdc         1698.00      8160.00     51000.00       8160      51000
> > sdd            0.00         0.00         0.00          0          0
> > md0            0.00         0.00         0.00          0          0
> > md1          450.00         0.00    102400.00          0     102400
> >
> >
> > Benchmark after the patch (iostat):
> >
> > Device:         tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> > sda          389.11         0.00    128530.69          0     129816
> > sdb          381.19         0.00    129354.46          0     130648
> > sdc          383.17         0.00    128530.69          0     129816
> > sdd            0.00         0.00         0.00          0          0
> > md0            0.00         0.00         0.00          0          0
> > md1         1140.59         0.00    259548.51          0     262144
> >
> > As one can see, no additional reads were done.  One can actually
> > calculate the raid's utilization: (n-1)/n of the aggregate single-disk
> > throughput with 1M writes.  Here, with n = 3, md1's ~259,500 blocks/s
> > written is roughly 2/3 of the ~386,400 blocks/s summed across sda, sdb
> > and sdc.
> >
> >
> > 3. The patch code.
> > The kernel tested above was 2.6.17.  The patch is against 2.6.20.2,
> > because I noticed big code differences between 17 and 20.x.  The patch
> > was not tested on 2.6.20.2, but it is essentially the same.  I have not
> > tested (yet) degraded mode or any other non-common paths.
> >
> This is along the same lines as what I am working on (new cache
> policies for raid5/6), so I want to give it a try as well.
> Unfortunately gmail has mangled your patch. Can you resend as an
> attachment?
>
> patch: **** malformed patch at line 10:
> (&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
>
> Thanks,
> Dan
>
Hello Dan.
Attached are the patches.  I have also added another test unit, random_writev.
It is not much code, but it does the job: it tests writing a vector, and
it shows the same results as writing from a single buffer.
What are the new cache policies?

Please note: I have not indented the patch or followed the
SubmittingPatches document.  If Neil approves this patch, or parts of
it, I will do so.
# Benchmark 3: Testing an 8-disk raid5.
Tyan NUMA dual-CPU (AMD) machine with 8 SATA Maxtor disks; the
controller is a Promise in JBOD mode.
raid conf:
md1 : active raid5 sda2[0] sdh1[7] sdg1[6] sdf1[5] sde1[4] sdd1[3]
      sdc1[2] sdb2[1]
      3404964864 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]
In order to achieve zero reads I had to tune the deadline to 20ms (so
long?).  stripe_cache_size is 256, which is exactly what is needed to
perform a full stripe hit with this configuration.
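A quick sanity check on that number, assuming the usual STRIPE_SIZE of
one 4 KB page per member disk per stripe head: holding one chunk-deep
write in the cache takes 1024 KB / 4 KB = 256 stripe heads, and each
7168 KB request below fills all seven data chunks of one such stripe,
so 256 stripe heads are exactly enough to gather a full-stripe write
before parity is computed.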
command: random_writev /dev/md1 7168 0 3000 10000
iostat snapshot:
avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.00    0.00   21.00   29.00   50.00

Device:         tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
hda            0.00         0.00         0.00          0          0
md0            0.00         0.00         0.00          0          0
sda          234.34         0.00     50400.00          0      49896
sdb          235.35         0.00     50658.59          0      50152
sdc          242.42         0.00     51014.14          0      50504
sdd          246.46         0.00     50755.56          0      50248
sde          248.48         0.00     51272.73          0      50760
sdf          245.45         0.00     50755.56          0      50248
sdg          244.44         0.00     50755.56          0      50248
sdh          245.45         0.00     50755.56          0      50248
md1         1407.07         0.00    347741.41          0     344264
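As a rough check of the earlier (n-1)/n estimate: md1 is written at
about 347,700 blocks/s while the eight disks together take about
406,400 blocks/s, a ratio of roughly 0.86, close to the ideal
7/8 = 0.875 for an 8-disk raid5; the shortfall is presumably the
random seeking.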
Try setting stripe_cache_size to 255 and you will notice the delay.
Try lowering the stripe_deadline and you will notice how the amount
of reads grows.
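For reference, both knobs are plain md sysfs attributes, so, assuming
the usual layout where stripe_cache_size already lives, they can be
adjusted at run time with something like:
  echo 256 > /sys/block/md1/md/stripe_cache_size
  echo 20  > /sys/block/md1/md/stripe_deadline    (milliseconds; 0, the default, disables the extra delay)
  cat /sys/block/md1/md/stripe_deadline
The store routine in the patch accepts values from 0 to 10000 ms.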
Cheers
--
Raz
[-- Attachment #2: raid5_write.c.patch --]
[-- Type: text/x-patch, Size: 5182 bytes --]
diff -ruN -X linux-2.6.20.2/Documentation/dontdiff linux-2.6.20.2/drivers/md/raid5.c linux-2.6.20.2-raid/drivers/md/raid5.c
--- linux-2.6.20.2/drivers/md/raid5.c 2007-03-09 20:58:04.000000000 +0200
+++ linux-2.6.20.2-raid/drivers/md/raid5.c 2007-03-30 12:37:55.000000000 +0300
@@ -65,6 +65,7 @@
#define NR_HASH (PAGE_SIZE / sizeof(struct hlist_head))
#define HASH_MASK (NR_HASH - 1)
+
#define stripe_hash(conf, sect) (&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
/* bio's attached to a stripe+device for I/O are linked together in bi_sector
@@ -234,6 +235,8 @@
sh->sector = sector;
sh->pd_idx = pd_idx;
sh->state = 0;
+ sh->active_preread_jiffies =
+ msecs_to_jiffies( atomic_read(&conf->deadline_ms) )+ jiffies;
sh->disks = disks;
@@ -628,6 +631,7 @@
clear_bit(R5_LOCKED, &sh->dev[i].flags);
set_bit(STRIPE_HANDLE, &sh->state);
+ sh->active_preread_jiffies = jiffies;
release_stripe(sh);
return 0;
}
@@ -1255,8 +1259,11 @@
bip = &sh->dev[dd_idx].towrite;
if (*bip == NULL && sh->dev[dd_idx].written == NULL)
firstwrite = 1;
- } else
+ } else{
bip = &sh->dev[dd_idx].toread;
+ sh->active_preread_jiffies = jiffies;
+ }
+
while (*bip && (*bip)->bi_sector < bi->bi_sector) {
if ((*bip)->bi_sector + ((*bip)->bi_size >> 9) > bi->bi_sector)
goto overlap;
@@ -2437,13 +2444,27 @@
-static void raid5_activate_delayed(raid5_conf_t *conf)
+static struct stripe_head* raid5_activate_delayed(raid5_conf_t *conf)
{
if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) {
while (!list_empty(&conf->delayed_list)) {
struct list_head *l = conf->delayed_list.next;
struct stripe_head *sh;
sh = list_entry(l, struct stripe_head, lru);
+
+ if( time_before(jiffies,sh->active_preread_jiffies) ){
+ PRINTK("deadline : no expire sec=%lld %8u %8u\n",
+ (unsigned long long) sh->sector,
+ jiffies_to_msecs(sh->active_preread_jiffies),
+ jiffies_to_msecs(jiffies));
+ return sh;
+ }
+ else{
+ PRINTK("deadline: expire:sec=%lld %8u %8u\n",
+ (unsigned long long)sh->sector,
+ jiffies_to_msecs(sh->active_preread_jiffies),
+ jiffies_to_msecs(jiffies));
+ }
list_del_init(l);
clear_bit(STRIPE_DELAYED, &sh->state);
if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
@@ -2451,6 +2472,7 @@
list_add_tail(&sh->lru, &conf->handle_list);
}
}
+ return NULL;
}
static void activate_bit_delay(raid5_conf_t *conf)
@@ -3191,7 +3213,7 @@
*/
static void raid5d (mddev_t *mddev)
{
- struct stripe_head *sh;
+ struct stripe_head *sh,*delayed_sh=NULL;
raid5_conf_t *conf = mddev_to_conf(mddev);
int handled;
@@ -3218,8 +3240,10 @@
atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD &&
!blk_queue_plugged(mddev->queue) &&
!list_empty(&conf->delayed_list))
- raid5_activate_delayed(conf);
-
+ delayed_sh=raid5_activate_delayed(conf);
+
+ if(delayed_sh) break;
+
while ((bio = remove_bio_from_retry(conf))) {
int ok;
spin_unlock_irq(&conf->device_lock);
@@ -3254,9 +3278,51 @@
unplug_slaves(mddev);
PRINTK("--- raid5d inactive\n");
+ if (delayed_sh){
+ long wakeup=delayed_sh->active_preread_jiffies-jiffies;
+ PRINTK("--- raid5d inactive sleep for %d\n",
+ jiffies_to_msecs(wakeup) );
+ if (wakeup>0)
+ mddev->thread->timeout = wakeup;
+ }
+}
+
+static ssize_t
+raid5_show_stripe_deadline(mddev_t *mddev, char *page)
+{
+ raid5_conf_t *conf = mddev_to_conf(mddev);
+ if (conf)
+ return sprintf(page, "%d\n", atomic_read(&conf->deadline_ms));
+ else
+ return 0;
}
static ssize_t
+raid5_store_stripe_deadline(mddev_t *mddev, const char *page, size_t len)
+{
+ raid5_conf_t *conf = mddev_to_conf(mddev);
+ char *end;
+ int new;
+ if (len >= PAGE_SIZE)
+ return -EINVAL;
+ if (!conf)
+ return -ENODEV;
+ new = simple_strtoul(page, &end, 10);
+ if (!*page || (*end && *end != '\n') )
+ return -EINVAL;
+ if (new < 0 || new > 10000)
+ return -EINVAL;
+ atomic_set(&conf->deadline_ms,new);
+ return len;
+}
+
+static struct md_sysfs_entry
+raid5_stripe_deadline = __ATTR(stripe_deadline, S_IRUGO | S_IWUSR,
+ raid5_show_stripe_deadline,
+ raid5_store_stripe_deadline);
+
+
+static ssize_t
raid5_show_stripe_cache_size(mddev_t *mddev, char *page)
{
raid5_conf_t *conf = mddev_to_conf(mddev);
@@ -3297,6 +3363,9 @@
return len;
}
+
+
+
static struct md_sysfs_entry
raid5_stripecache_size = __ATTR(stripe_cache_size, S_IRUGO | S_IWUSR,
raid5_show_stripe_cache_size,
@@ -3318,8 +3387,10 @@
static struct attribute *raid5_attrs[] = {
&raid5_stripecache_size.attr,
&raid5_stripecache_active.attr,
+ &raid5_stripe_deadline.attr,
NULL,
};
+
static struct attribute_group raid5_attrs_group = {
.name = NULL,
.attrs = raid5_attrs,
@@ -3567,6 +3638,8 @@
blk_queue_merge_bvec(mddev->queue, raid5_mergeable_bvec);
+ atomic_set(&conf->deadline_ms,0);
+
return 0;
abort:
if (conf) {
[-- Attachment #3: raid5_write.h.patch --]
[-- Type: text/x-patch, Size: 765 bytes --]
diff -ruN -X linux-2.6.20.2/Documentation/dontdiff linux-2.6.20.2/include/linux/raid/raid5.h linux-2.6.20.2-raid/include/linux/raid/raid5.h
--- linux-2.6.20.2/include/linux/raid/raid5.h 2007-03-09 20:58:04.000000000 +0200
+++ linux-2.6.20.2-raid/include/linux/raid/raid5.h 2007-03-30 00:25:38.000000000 +0200
@@ -136,6 +136,7 @@
spinlock_t lock;
int bm_seq; /* sequence number for bitmap flushes */
int disks; /* disks in stripe */
+ unsigned long active_preread_jiffies;
struct r5dev {
struct bio req;
struct bio_vec vec;
@@ -254,6 +255,7 @@
* Free stripes pool
*/
atomic_t active_stripes;
+ atomic_t deadline_ms;
struct list_head inactive_list;
wait_queue_head_t wait_for_stripe;
wait_queue_head_t wait_for_overlap;
[-- Attachment #4: random_writev.cpp --]
[-- Type: text/x-c++src, Size: 2203 bytes --]
#define _LARGEFILE64_SOURCE
#include <iostream>
#include <stdio.h>
#include <string>
#include <string.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <errno.h>
using namespace std;
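/*
 * random_writev: issues random O_DIRECT writes to a block device with
 * writev(), each request split into 1 MiB iovec segments and kept
 * aligned to the request size.  For the benchmark above,
 *   random_writev /dev/md1 7168 0 3000 10000
 * writes 7168 KB (seven 1 MiB segments) per request, starting at
 * offset 0, treating the device as 3000 GB, for 10000 iterations.
 */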
int main (int argc, char *argv[])
{
	if (argc < 6) {
		cout << "usage: <device name> <size to write in kb> <offset in kb> <diskSizeGB> <loops>" << endl;
		return 0;
	}

	char *dev_name = argv[1];
	int fd = open(dev_name, O_LARGEFILE | O_DIRECT | O_WRONLY);
	if (fd < 0) {
		perror("open ");
		return -1;
	}

	long long write_sz_bytes  = ((long long)atoi(argv[2])) << 10;
	long long offset_sz_bytes = ((long long)atoi(argv[3])) << 10;
	long long diskSizeBytes   = ((long long)atoi(argv[4])) << 30;
	int loops = atoi(argv[5]);

	/* split the request into 1 MiB iovec segments */
	struct iovec vec[10];
	int blocks = (int)(write_sz_bytes >> 20);
	if (blocks <= 0 || blocks > 10) {
		cout << "size to write must be between 1 MB and 10 MB" << endl;
		return -1;
	}
	for (int i = 0; i < blocks; i++) {
		/* valloc gives page-aligned buffers, required for O_DIRECT */
		char *buffer = (char *)valloc(1 << 20);
		if (!buffer) {
			perror("alloc ");
			return -1;
		}
		vec[i].iov_base = buffer;
		vec[i].iov_len  = 1 << 20;
		memset(buffer, 0x00, 1 << 20);
	}

	int ret = 0;
	while (loops-- > 0) {
		if (lseek64(fd, offset_sz_bytes, SEEK_SET) < 0) {
			printf("failed on lseek, offset=%lld\n", offset_sz_bytes);
			return 0;
		}
		ret = writev(fd, vec, blocks);
		if (ret != write_sz_bytes) {
			perror("failed to write ");
			printf("write size=%lld offset=%lld\n", write_sz_bytes, offset_sz_bytes);
			return -1;
		}
		/* pick a new random offset, wrapped into the device and aligned
		   to the write size so every request stays aligned to a stripe */
		long long rnd = (long long)random();
		offset_sz_bytes = write_sz_bytes * (rnd % diskSizeBytes);
		if (offset_sz_bytes > diskSizeBytes) {
			offset_sz_bytes = (offset_sz_bytes - diskSizeBytes) % diskSizeBytes;
			offset_sz_bytes = (offset_sz_bytes / write_sz_bytes) * write_sz_bytes;
		}
		printf("writing %d bytes at offset %lld\n", ret, offset_sz_bytes);
	}
	return 0;
}