From: "Raz Ben-Jehuda(caro)" <raziebe@gmail.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Neil Brown <neilb@suse.de>,
Linux RAID Mailing List <linux-raid@vger.kernel.org>
Subject: Re: raid5 write performance
Date: Mon, 2 Apr 2007 17:13:43 +0300 [thread overview]
Message-ID: <5d96567b0704020713s6ae02102kaeca12edfb663cb1@mail.gmail.com> (raw)
In-Reply-To: <e9c3a7c20704011608lb1fcb55j628a7d1172812869@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 6706 bytes --]
On 4/2/07, Dan Williams <dan.j.williams@intel.com> wrote:
> On 3/30/07, Raz Ben-Jehuda(caro) <raziebe@gmail.com> wrote:
> > Please see below.
> >
> > On 8/28/06, Neil Brown <neilb@suse.de> wrote:
> > > On Sunday August 13, raziebe@gmail.com wrote:
> > > > well ... me again
> > > >
> > > > Following your advice....
> > > >
> > > > I added a deadline for every WRITE stripe head when it is created.
> > > > In raid5_activate_delayed I check whether the deadline has expired
> > > > and, if not, set the sh to preread-active mode.
> > > >
> > > > This small fix (and a few other places in the code) reduced the
> > > > amount of reads to zero with dd, but with no improvement to
> > > > throughput.  But with random access to the raid (buffers aligned to
> > > > the stripe width and of the same size as the stripe width) there is
> > > > an improvement of at least 20%.
> > > >
> > > > Problem is that a user must know what he is doing, else there would
> > > > be a reduction in performance if the deadline is too long (say 100 ms).
> > >
> > > So if I understand you correctly, you are delaying write requests to
> > > partial stripes slightly (your 'deadline') and this is sometimes
> > > giving you a 20% improvement ?
> > >
> > > I'm not surprised that you could get some improvement. 20% is quite
> > > surprising. It would be worth following through with this to make
> > > that improvement generally available.
> > >
> > > As you say, picking a time in milliseconds is very error prone. We
> > > really need to come up with something more natural.
> > > I had hoped that the 'unplug' infrastructure would provide the right
> > > thing, but apparently not. Maybe unplug is just being called too
> > > often.
> > >
> > > I'll see if I can duplicate this myself and find out what is really
> > > going on.
> > >
> > > Thanks for the report.
> > >
> > > NeilBrown
> > >
> >
> > Hello Neil.  I am sorry for the delay; I was abruptly assigned to
> > a different project.
> >
> > 1.
> > I took another look at the raid5 delay patch I wrote a while ago.
> > I ported it to 2.6.17 and tested it.  It appears to work, and when
> > used correctly it eliminates the read penalty.
> >
> > 2. Benchmarks.
> > Configuration:
> > I am testing a raid5 of 3 disks with a 1MB chunk size.  IOs are
> > synchronous and non-buffered (O_DIRECT), 2 MB in size, and always
> > aligned to the beginning of a stripe.  The kernel is 2.6.17.  The
> > stripe_delay was set to 10ms.
> >
> > Attached is the simple_write code.
> >
> > command:
> > simple_write /dev/md1 2048 0 1000
> > simple_write does raw writes (O_DIRECT) sequentially, starting from
> > offset zero, 2048 kilobytes at a time, 1000 times.
> >
> > Benchmark before the patch (iostat):
> >
> > Device:         tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> > sda         1848.00      8384.00     50992.00       8384      50992
> > sdb         1995.00     12424.00     51008.00      12424      51008
> > sdc         1698.00      8160.00     51000.00       8160      51000
> > sdd            0.00         0.00         0.00          0          0
> > md0            0.00         0.00         0.00          0          0
> > md1          450.00         0.00    102400.00          0     102400
> >
> >
> > Benchmark after the patch (iostat):
> >
> > Device:         tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> > sda          389.11         0.00    128530.69          0     129816
> > sdb          381.19         0.00    129354.46          0     130648
> > sdc          383.17         0.00    128530.69          0     129816
> > sdd            0.00         0.00         0.00          0          0
> > md0            0.00         0.00         0.00          0          0
> > md1         1140.59         0.00    259548.51          0     262144
> >
> > As one can see, no additional reads were done.  One can actually
> > calculate the raid's utilization: (n-1)/n of the aggregate single-disk
> > throughput with 1M writes.  Here, with n = 3, md1's ~259,500 blocks/s
> > written is roughly 2/3 of the ~386,400 blocks/s summed across sda, sdb
> > and sdc.
> >
> >
> > 3. The patch code.
> > The kernel tested above was 2.6.17.  The patch is against 2.6.20.2,
> > because I noticed big code differences between 17 and 20.x.  The patch
> > was not tested on 2.6.20.2, but it is essentially the same.  I have not
> > tested (yet) degraded mode or any other non-common paths.
> >
> This is along the same lines as what I am working on (new cache
> policies for raid5/6), so I want to give it a try as well.
> Unfortunately gmail has mangled your patch. Can you resend as an
> attachment?
>
> patch: **** malformed patch at line 10:
> (&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
>
> Thanks,
> Dan
>
Hello Dan.
Attached are the patches.  I have also added another test unit, random_writev.
It is not much code, but it does the job: it tests writing a vector, and
it shows the same results as writing from a single buffer.
What are the new cache policies?

Please note: I have not indented the patch or followed the
SubmittingPatches document.  If Neil approves this patch, or parts of
it, I will do so.
# Benchmark 3: Testing an 8-disk raid5.
Tyan NUMA dual-CPU (AMD) machine with 8 SATA Maxtor disks; the
controller is a Promise in JBOD mode.
raid conf:
md1 : active raid5 sda2[0] sdh1[7] sdg1[6] sdf1[5] sde1[4] sdd1[3]
      sdc1[2] sdb2[1]
      3404964864 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]
In order to achieve zero reads I had to tune the deadline to 20ms (so
long?).  stripe_cache_size is 256, which is exactly what is needed to
perform a full stripe hit with this configuration.
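A quick sanity check on that number, assuming the usual STRIPE_SIZE of
one 4 KB page per member disk per stripe head: holding one chunk-deep
write in the cache takes 1024 KB / 4 KB = 256 stripe heads, and each
7168 KB request below fills all seven data chunks of one such stripe,
so 256 stripe heads are exactly enough to gather a full-stripe write
before parity is computed.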
command: random_writev /dev/md1 7168 0 3000 10000
iostat snapshot:
avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.00    0.00   21.00   29.00   50.00

Device:         tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
hda            0.00         0.00         0.00          0          0
md0            0.00         0.00         0.00          0          0
sda          234.34         0.00     50400.00          0      49896
sdb          235.35         0.00     50658.59          0      50152
sdc          242.42         0.00     51014.14          0      50504
sdd          246.46         0.00     50755.56          0      50248
sde          248.48         0.00     51272.73          0      50760
sdf          245.45         0.00     50755.56          0      50248
sdg          244.44         0.00     50755.56          0      50248
sdh          245.45         0.00     50755.56          0      50248
md1         1407.07         0.00    347741.41          0     344264
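As a rough check of the earlier (n-1)/n estimate: md1 is written at
about 347,700 blocks/s while the eight disks together take about
406,400 blocks/s, a ratio of roughly 0.86, close to the ideal
7/8 = 0.875 for an 8-disk raid5; the shortfall is presumably the
random seeking.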
Try setting stripe_cache_size to 255 and you will notice the delay.
Try lowering the stripe_deadline and you will notice how the amount
of reads grows.
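For reference, both knobs are plain md sysfs attributes, so, assuming
the usual layout where stripe_cache_size already lives, they can be
adjusted at run time with something like:
  echo 256 > /sys/block/md1/md/stripe_cache_size
  echo 20  > /sys/block/md1/md/stripe_deadline    (milliseconds; 0, the default, disables the extra delay)
  cat /sys/block/md1/md/stripe_deadline
The store routine in the patch accepts values from 0 to 10000 ms.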
Cheers
--
Raz
[-- Attachment #2: raid5_write.c.patch --]
[-- Type: text/x-patch, Size: 5182 bytes --]
diff -ruN -X linux-2.6.20.2/Documentation/dontdiff linux-2.6.20.2/drivers/md/raid5.c linux-2.6.20.2-raid/drivers/md/raid5.c
--- linux-2.6.20.2/drivers/md/raid5.c 2007-03-09 20:58:04.000000000 +0200
+++ linux-2.6.20.2-raid/drivers/md/raid5.c 2007-03-30 12:37:55.000000000 +0300
@@ -65,6 +65,7 @@
#define NR_HASH (PAGE_SIZE / sizeof(struct hlist_head))
#define HASH_MASK (NR_HASH - 1)
+
#define stripe_hash(conf, sect) (&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
/* bio's attached to a stripe+device for I/O are linked together in bi_sector
@@ -234,6 +235,8 @@
sh->sector = sector;
sh->pd_idx = pd_idx;
sh->state = 0;
+ sh->active_preread_jiffies =
+ msecs_to_jiffies( atomic_read(&conf->deadline_ms) )+ jiffies;
sh->disks = disks;
@@ -628,6 +631,7 @@
clear_bit(R5_LOCKED, &sh->dev[i].flags);
set_bit(STRIPE_HANDLE, &sh->state);
+ sh->active_preread_jiffies = jiffies;
release_stripe(sh);
return 0;
}
@@ -1255,8 +1259,11 @@
bip = &sh->dev[dd_idx].towrite;
if (*bip == NULL && sh->dev[dd_idx].written == NULL)
firstwrite = 1;
- } else
+ } else{
bip = &sh->dev[dd_idx].toread;
+ sh->active_preread_jiffies = jiffies;
+ }
+
while (*bip && (*bip)->bi_sector < bi->bi_sector) {
if ((*bip)->bi_sector + ((*bip)->bi_size >> 9) > bi->bi_sector)
goto overlap;
@@ -2437,13 +2444,27 @@
-static void raid5_activate_delayed(raid5_conf_t *conf)
+static struct stripe_head* raid5_activate_delayed(raid5_conf_t *conf)
{
if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) {
while (!list_empty(&conf->delayed_list)) {
struct list_head *l = conf->delayed_list.next;
struct stripe_head *sh;
sh = list_entry(l, struct stripe_head, lru);
+
+ if( time_before(jiffies,sh->active_preread_jiffies) ){
+ PRINTK("deadline : no expire sec=%lld %8u %8u\n",
+ (unsigned long long) sh->sector,
+ jiffies_to_msecs(sh->active_preread_jiffies),
+ jiffies_to_msecs(jiffies));
+ return sh;
+ }
+ else{
+ PRINTK("deadline: expire:sec=%lld %8u %8u\n",
+ (unsigned long long)sh->sector,
+ jiffies_to_msecs(sh->active_preread_jiffies),
+ jiffies_to_msecs(jiffies));
+ }
list_del_init(l);
clear_bit(STRIPE_DELAYED, &sh->state);
if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
@@ -2451,6 +2472,7 @@
list_add_tail(&sh->lru, &conf->handle_list);
}
}
+ return NULL;
}
static void activate_bit_delay(raid5_conf_t *conf)
@@ -3191,7 +3213,7 @@
*/
static void raid5d (mddev_t *mddev)
{
- struct stripe_head *sh;
+ struct stripe_head *sh,*delayed_sh=NULL;
raid5_conf_t *conf = mddev_to_conf(mddev);
int handled;
@@ -3218,8 +3240,10 @@
atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD &&
!blk_queue_plugged(mddev->queue) &&
!list_empty(&conf->delayed_list))
- raid5_activate_delayed(conf);
-
+ delayed_sh=raid5_activate_delayed(conf);
+
+ if(delayed_sh) break;
+
while ((bio = remove_bio_from_retry(conf))) {
int ok;
spin_unlock_irq(&conf->device_lock);
@@ -3254,9 +3278,51 @@
unplug_slaves(mddev);
PRINTK("--- raid5d inactive\n");
+ if (delayed_sh){
+ long wakeup=delayed_sh->active_preread_jiffies-jiffies;
+ PRINTK("--- raid5d inactive sleep for %d\n",
+ jiffies_to_msecs(wakeup) );
+ if (wakeup>0)
+ mddev->thread->timeout = wakeup;
+ }
+}
+
+static ssize_t
+raid5_show_stripe_deadline(mddev_t *mddev, char *page)
+{
+ raid5_conf_t *conf = mddev_to_conf(mddev);
+ if (conf)
+ return sprintf(page, "%d\n", atomic_read(&conf->deadline_ms));
+ else
+ return 0;
}
static ssize_t
+raid5_store_stripe_deadline(mddev_t *mddev, const char *page, size_t len)
+{
+ raid5_conf_t *conf = mddev_to_conf(mddev);
+ char *end;
+ int new;
+ if (len >= PAGE_SIZE)
+ return -EINVAL;
+ if (!conf)
+ return -ENODEV;
+ new = simple_strtoul(page, &end, 10);
+ if (!*page || (*end && *end != '\n') )
+ return -EINVAL;
+ if (new < 0 || new > 10000)
+ return -EINVAL;
+ atomic_set(&conf->deadline_ms,new);
+ return len;
+}
+
+static struct md_sysfs_entry
+raid5_stripe_deadline = __ATTR(stripe_deadline, S_IRUGO | S_IWUSR,
+ raid5_show_stripe_deadline,
+ raid5_store_stripe_deadline);
+
+
+static ssize_t
raid5_show_stripe_cache_size(mddev_t *mddev, char *page)
{
raid5_conf_t *conf = mddev_to_conf(mddev);
@@ -3297,6 +3363,9 @@
return len;
}
+
+
+
static struct md_sysfs_entry
raid5_stripecache_size = __ATTR(stripe_cache_size, S_IRUGO | S_IWUSR,
raid5_show_stripe_cache_size,
@@ -3318,8 +3387,10 @@
static struct attribute *raid5_attrs[] = {
&raid5_stripecache_size.attr,
&raid5_stripecache_active.attr,
+ &raid5_stripe_deadline.attr,
NULL,
};
+
static struct attribute_group raid5_attrs_group = {
.name = NULL,
.attrs = raid5_attrs,
@@ -3567,6 +3638,8 @@
blk_queue_merge_bvec(mddev->queue, raid5_mergeable_bvec);
+ atomic_set(&conf->deadline_ms,0);
+
return 0;
abort:
if (conf) {
[-- Attachment #3: raid5_write.h.patch --]
[-- Type: text/x-patch, Size: 765 bytes --]
diff -ruN -X linux-2.6.20.2/Documentation/dontdiff linux-2.6.20.2/include/linux/raid/raid5.h linux-2.6.20.2-raid/include/linux/raid/raid5.h
--- linux-2.6.20.2/include/linux/raid/raid5.h 2007-03-09 20:58:04.000000000 +0200
+++ linux-2.6.20.2-raid/include/linux/raid/raid5.h 2007-03-30 00:25:38.000000000 +0200
@@ -136,6 +136,7 @@
spinlock_t lock;
int bm_seq; /* sequence number for bitmap flushes */
int disks; /* disks in stripe */
+ unsigned long active_preread_jiffies;
struct r5dev {
struct bio req;
struct bio_vec vec;
@@ -254,6 +255,7 @@
* Free stripes pool
*/
atomic_t active_stripes;
+ atomic_t deadline_ms;
struct list_head inactive_list;
wait_queue_head_t wait_for_stripe;
wait_queue_head_t wait_for_overlap;
[-- Attachment #4: random_writev.cpp --]
[-- Type: text/x-c++src, Size: 2203 bytes --]
#define _LARGEFILE64_SOURCE
#include <iostream>
#include <stdio.h>
#include <string>
#include <string.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <errno.h>
using namespace std;
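/*
 * random_writev: issues random O_DIRECT writes to a block device with
 * writev(), each request split into 1 MiB iovec segments and kept
 * aligned to the request size.  For the benchmark above,
 *   random_writev /dev/md1 7168 0 3000 10000
 * writes 7168 KB (seven 1 MiB segments) per request, starting at
 * offset 0, treating the device as 3000 GB, for 10000 iterations.
 */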
int main (int argc, char *argv[])
{
	if (argc < 6) {
		cout << "usage: <device name> <size to write in kb> <offset in kb> <diskSizeGB> <loops>" << endl;
		return 0;
	}

	char *dev_name = argv[1];
	int fd = open(dev_name, O_LARGEFILE | O_DIRECT | O_WRONLY);
	if (fd < 0) {
		perror("open ");
		return -1;
	}

	long long write_sz_bytes  = ((long long)atoi(argv[2])) << 10;
	long long offset_sz_bytes = ((long long)atoi(argv[3])) << 10;
	long long diskSizeBytes   = ((long long)atoi(argv[4])) << 30;
	int loops = atoi(argv[5]);

	/* split the request into 1 MiB iovec segments */
	struct iovec vec[10];
	int blocks = (int)(write_sz_bytes >> 20);
	if (blocks <= 0 || blocks > 10) {
		cout << "size to write must be between 1 MB and 10 MB" << endl;
		return -1;
	}
	for (int i = 0; i < blocks; i++) {
		/* valloc gives page-aligned buffers, required for O_DIRECT */
		char *buffer = (char *)valloc(1 << 20);
		if (!buffer) {
			perror("alloc ");
			return -1;
		}
		vec[i].iov_base = buffer;
		vec[i].iov_len  = 1 << 20;
		memset(buffer, 0x00, 1 << 20);
	}

	int ret = 0;
	while (loops-- > 0) {
		if (lseek64(fd, offset_sz_bytes, SEEK_SET) < 0) {
			printf("failed on lseek, offset=%lld\n", offset_sz_bytes);
			return 0;
		}
		ret = writev(fd, vec, blocks);
		if (ret != write_sz_bytes) {
			perror("failed to write ");
			printf("write size=%lld offset=%lld\n", write_sz_bytes, offset_sz_bytes);
			return -1;
		}
		/* pick a new random offset, wrapped into the device and aligned
		   to the write size so every request stays aligned to a stripe */
		long long rnd = (long long)random();
		offset_sz_bytes = write_sz_bytes * (rnd % diskSizeBytes);
		if (offset_sz_bytes > diskSizeBytes) {
			offset_sz_bytes = (offset_sz_bytes - diskSizeBytes) % diskSizeBytes;
			offset_sz_bytes = (offset_sz_bytes / write_sz_bytes) * write_sz_bytes;
		}
		printf("writing %d bytes at offset %lld\n", ret, offset_sz_bytes);
	}
	return 0;
}