2.6.17 odd hd slow down

public inbox for linux-kernel@vger.kernel.org

* 2.6.17 odd hd slow down
@ 2006-09-06  9:45 Damon LaCrosse
  2006-09-06 10:57 ` Jens Axboe
  0 siblings, 1 reply; 4+ messages in thread
From: Damon LaCrosse @ 2006-09-06  9:45 UTC
  To: linux-kernel


Hi all,
while experimenting with the 2.6 block device layer I noticed that, under some
circumstances, disk throughput drops sharply. Strangely enough, my IDE disk
shows a significant performance drop with certain access patterns.

After further investigation I was able to reproduce this behavior with the
code below, which shows a substantial hard-disk slowdown when the step value is
set greater than 8. The results are far below the disk's real speed
(30~70MB/sec), which is measured correctly at low STEP values (<8). In
particular, with a step of 512 or above, the overall throughput of the disk
falls below 2MB/sec.

At first I suspected a side effect of the queue plug/unplug mechanism: with
scattered accesses, each bio would have to wait for the unplug timeout. So I
added the BIO_RW_SYNC flag, which - AFAIK - should force the queue to unplug.
Unfortunately nothing changed.

Now, as it is quite possible that I'm missing something, the question is: is
there an effective way of doing scattered disk accesses using bios? In other
words, how can I fix the following program to get full disk speed for
steps > 8?

TIA!
Damon

PS: below are the results for various step/scheduler combinations, along with
some configuration details.

# hdparm -I /dev/hda

ATA device, with non-removable media
        Model Number:       Maxtor 6Y080P0
        Firmware Revision:  YAR41BW0
Standards:
        Supported: 7 6 5 4
        Likely used: 7
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:  160086528
        device size with M = 1024*1024:       78167 MBytes
        device size with M = 1000*1000:       81964 MBytes (81 GB)
Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 1
        Standby timer values: spec'd by Standard, no device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 16
        Advanced power management level: unknown setting (0x0000)
        Recommended acoustic management value: 192, current value: 254
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5 udma6
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns

# uname -a
Linux 2.6.17.1 #2 SMP PREEMPT i686 Intel(R) Xeon(TM) CPU 2.80GHz GNU/Linux

ANTICIPATORY SCHEDULER

STEP (hs)	CYCLES		WRITTEN (MB)	ELAPSED (s)	SPEED (MB/s)
1		61954		242		3		75.432
2		59394		232		3		71.3032
3		16473		64		3		21.843
4		52482		205		3		62.3135
5		14448		56		3		18.1951
6		13617		53		3		17.1732
7		12849		50		3		16.1695
8		47874		187		3		56.2823
9		2569		10		3		3.468
10		2608		10		3		3.716
11		2416		9		3		2.3085
12		2576		10		3		3.468
13		2480		9		3		3.222
14		2424		9		3		2.3084
15		2616		10		3		3.738
16		2288		8		3		2.2619
32		2376		9		3		2.2849
64		2400		9		3		2.3059
128		2408		9		3		2.3098
256		1384		5		3		1.2104
512		1048		4		3		1.761

DEADLINE SCHEDULER

STEP (hs)	CYCLES		WRITTEN (MB)	ELAPSED (s)	SPEED (MB/s)
1		61955		242		3		75.736
2		59907		234		3		72.1307
3		16473		64		3		21.843
4		52994		207		3		63.1816
5		14330		55		3		18.1526
6		13569		53		3		17.1476
7		12817		50		3		16.1618
8		47618		186		3		56.1991
9		2625		10		3		3.734
10		2472		9		3		3.185
11		2512		9		3		3.371
12		2624		10		3		3.764
13		2392		9		3		2.3051
14		2472		9		3		2.3214
15		2664		10		3		3.863
16		2512		9		3		3.305
32		2448		9		3		3.10
64		2520		9		3		3.375
128		2417		9		3		2.3017
256		1305		5		3		1.1776
512		1160		4		3		1.1258

CFQ SCHEDULER

STEP (hs)	CYCLES		WRITTEN (MB)	ELAPSED (s)	SPEED (MB/s)
1		62850		245		3		76.1395
2		60416		236		3		73.940
3		15970		62		3		20.1902
4		53225		207		3		63.2719
5		14945		58		3		19.865
6		14250		55		3		18.1160
7		13682		53		3		17.1986
8		47870		186		3		56.2472
9		2529		9		3		3.170
10		2576		10		3		3.477
11		2472		9		3		3.44
12		2672		10		3		3.933
13		2481		9		3		3.256
14		2592		10		3		3.627
15		2512		9		3		3.386
16		2688		10		3		3.1008
32		2384		9		3		2.2996
64		2320		9		3		2.2734
128		2720		10		3		3.1130
256		1265		4		3		1.1664
512		1088		4		3		1.768

NOOP SCHEDULER

STEP (hs)	CYCLES		WRITTEN (MB)	ELAPSED (s)	SPEED (MB/s)
1		20987		81		3		27.413
2		19974		78		3		25.2373
3		16434		64		3		21.712
4		18541		72		3		23.2482
5		14217		55		3		18.1067
6		13625		53		3		17.1729
7		12489		48		3		16.337
8		48898		191		3		57.3135
9		2560		10		3		3.499
10		2568		10		3		3.332
11		2472		9		3		3.161
12		2568		10		3		3.371
13		2352		9		3		2.2875
14		2584		10		3		3.487
15		2320		9		3		2.2740
16		2544		9		3		3.481
32		2344		9		3		2.2832
64		2416		9		3		2.3069
128		2328		9		3		2.2649
256		1360		5		3		1.2010
512		1440		5		3		1.2190

--- empty       2006-09-05 00:16:24.000000000 +0200
+++ test.c      2006-09-05 00:16:49.000000000 +0200
@@ -0,0 +1,145 @@
+#include <linux/module.h>
+#include <linux/timer.h>
+#include <linux/bio.h>
+
+#define START(t) ({                                            \
+               struct timeval __tv;                            \
+               do_gettimeofday(&__tv);                         \
+               (t) = timeval_to_ns(&__tv);                     \
+       })
+
+#define STOP(t) ({                                             \
+               struct timeval __tv;                            \
+               do_gettimeofday(&__tv);                         \
+               (t) = timeval_to_ns(&__tv) - (t);               \
+       })
+
+DECLARE_WAIT_QUEUE_HEAD(wait);
+atomic_t errors, busy;
+int halt;
+
+void stop_write(unsigned long arg)
+{
+       halt = 1;
+}
+
+int endio(struct bio *bio, unsigned int bytes_done, int error)
+{
+       if (bio->bi_size) {
+               return 1;
+       }
+
+       if (error || !test_bit(BIO_UPTODATE, &bio->bi_flags)) {
+               atomic_inc(&errors);
+       }
+
+       if (atomic_dec_and_test(&busy)) {
+               wake_up(&wait);
+       }
+
+       return 0;
+}
+
+int do_write(struct block_device *bdev,
+            struct page *zero, unsigned long expires, int step)
+{
+       DEFINE_TIMER(timer, stop_write, expires, (unsigned long) NULL);
+       int i;
+
+       add_timer(&timer);
+
+       for (halt = i = 0; !halt; i++) {
+               struct bio *bio = bio_alloc(GFP_NOIO, 1);
+               if (bio) {
+                       atomic_inc(&busy);
+
+                       bio->bi_bdev = bdev;
+                       bio->bi_sector = step * i;
+                       bio_add_page(bio, zero, PAGE_SIZE, 0);
+                       bio->bi_end_io = endio;
+                       submit_bio((1 << BIO_RW) | (1 << BIO_RW_SYNC), bio);
+               } else {
+                       atomic_inc(&errors);
+               }
+       }
+
+       wait_event(wait, !atomic_read(&busy));
+
+       return i;
+}
+
+int write(struct block_device *bdev, int secs, int step)
+{
+       struct page *zero;
+
+       s64 time;
+       unsigned long space;
+       int cycles;
+
+       zero = alloc_page(GFP_KERNEL);
+       if (!zero) {
+               return -ENOMEM;
+       }
+
+       memset(kmap(zero), 0, PAGE_SIZE);
+       kunmap(zero);
+
+       atomic_set(&errors, 0);
+       atomic_set(&busy, 0);
+
+       START(time);
+
+       cycles = do_write(bdev, zero, jiffies + secs * HZ, step);
+
+       STOP(time);
+
+       put_page(zero);
+
+       (void) do_div(time, 1000000);
+
+       space = ((unsigned long) cycles * 1000 * (PAGE_SIZE >> 10)) >> 10;
+
+       printk("%d\t\t%d\t\t%lu\t\t%lu\t\t%lu.%-3lu\n",
+              step, cycles, space / 1000,
+              (unsigned long ) time / 1000,
+              space / (unsigned long) time,
+              space % (unsigned long) time);
+
+       return 0;
+}
+
+static int __init init(void)
+{
+       struct block_device *bdev;
+       int i, err;
+
+       bdev = open_bdev_excl("/dev/hda", 0, THIS_MODULE);
+       if (IS_ERR(bdev)) {
+               printk("device won't open!\n");
+               return PTR_ERR(bdev);
+       }
+
+       printk("STEP (hs)\tCYCLES\t\tWRITTEN (MB)\tELAPSED (s)\tSPEED (MB/s)\n");
+
+       for (i = 1; i < 16; i++) {
+               err = write(bdev, 3, i);
+               if (err < 0) {
+                       printk("%d\t-\t\t-\t\t-\t\t-\n", i);
+               }
+       }
+
+       for (; i < 1024; i <<= 1) {
+               err = write(bdev, 3, i);
+               if (err < 0) {
+                       printk("%d\t-\t\t-\t\t-\t\t-\n", i);
+               }
+       }
+
+       close_bdev_excl(bdev);
+
+       return -EIO;
+}
+
+module_init(init);
+
+MODULE_LICENSE("GPL v2");
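
A note on the SPEED column: the printk() in the patch prints space % time
after the dot, i.e. the raw remainder of the division rather than decimal
digits, which is presumably why the tables contain values such as 71.3032 and
3.10. Below is a minimal user-space sketch of one way to print three real
decimal places with integer math only. The name format_speed() and the sample
inputs are illustrative assumptions, not part of the original module.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format space / time_ms as "<whole>.<3 decimals>" using integers only.
 * `space` is MB * 1000 and `time_ms` is elapsed milliseconds, the same
 * units write() in the patch ends up with, so the quotient is MB/s.
 * Returns the number of characters written, as snprintf() does.
 */
int format_speed(char *buf, size_t len,
                 unsigned long space, unsigned long time_ms)
{
    unsigned long whole, frac;

    assert(time_ms > 0);                        /* avoid division by zero */
    whole = space / time_ms;
    /* Scale the remainder to thousandths before dividing again. */
    frac = (space % time_ms) * 1000 / time_ms;

    /* %03lu zero-pads, so a fraction of 3/1000 prints as ".003". */
    return snprintf(buf, len, "%lu.%03lu", whole, frac);
}
```

With the illustrative inputs space = 242007 and time_ms = 3221, the remainder
is 432 and the formatted result is "75.134", where the original format string
would print "75.432".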



* Re: 2.6.17 odd hd slow down
  2006-09-06  9:45 2.6.17 odd hd slow down Damon LaCrosse
@ 2006-09-06 10:57 ` Jens Axboe
  2006-09-06 12:06   ` Damon LaCrosse
  2006-09-06 12:06   ` Damon LaCrosse
  0 siblings, 2 replies; 4+ messages in thread
From: Jens Axboe @ 2006-09-06 10:57 UTC
  To: Damon LaCrosse; +Cc: linux-kernel

On Wed, Sep 06 2006, Damon LaCrosse wrote:
> 
> Hi all,
> experimenting a little the 2.6 block device layer I detected under
> some circumstances a net slowness in the disk throughput. Strangely
> enough, in fact, my IDE disk reported a significant performance drop
> off in correspondence of certain access patterns.
> 
> Following further investigations I was able to simulate this ill
> behavior in the following piece of code, clearly showing a non
> negligible hard-disk slow down when the step value is set greater than
> 8. These result in fact far below the hard-disk real speed
> (30~70MB/sec), as correctly measured instead in correspondence of low
> STEP values (<8). In particular, with step of 512 or above, the
> overall performance scored by the disk results below 2MB/sec.

You are effectively approaching seeky writes; I bet it's just the drive
firmware biting you. Repeat the test on a different drive and see whether
you get an identical pattern.
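
For repeating the access pattern on another drive without loading a module,
the following is a rough user-space sketch, not the original test. It mirrors
bio->bi_sector = step * i with pwrite(). The name strided_write() is
hypothetical, and unlike the module it goes through the page cache (unless
the caller opens the target with O_DIRECT and an aligned buffer), so its
timings are not directly comparable to the tables above.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SECTOR_SIZE 512     /* bytes per sector, as in the thread */
#define BLOCK_SIZE  4096    /* bytes per write, PAGE_SIZE on i686 */

/*
 * Write `count` zeroed 4096-byte blocks to fd, with block i starting at
 * sector step * i. Returns the number of blocks written, or -1 on the
 * first failed or short write.
 */
long strided_write(int fd, int step, long count)
{
    char buf[BLOCK_SIZE];
    long i;

    assert(step > 0 && count >= 0);
    memset(buf, 0, sizeof(buf));

    for (i = 0; i < count; i++) {
        off_t off = (off_t)step * i * SECTOR_SIZE;

        if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t)sizeof(buf))
            return -1;
    }

    /* Flush so a caller timing this call waits for the device as well. */
    if (fsync(fd) != 0)
        return -1;

    return count;
}
```

A caller would open the target read-write, record the time before and after
strided_write(), and divide the bytes written by the elapsed time. Pointing
this at a raw disk overwrites its contents, just as the module does.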

> At first I thought to a side-effect of the queue plug/unplug
> mechanism: the scattered accesses involve the unplug timeout to each
> bio. So, I added the BIO_RW_SYNC flag that - AFAIK - should force the
> queue unplugging. Unfortunately nothing changes.
> 
> Now, as it is quite possible that I'm missing something, the question
> is: is there an effective way of doing scattered disk accesses using
> bios? In other words, how can I fix the following program in order to
> get disk full speed for steps > 8?

I don't think the io path has anything to do with this. Why are you
expecting non-sequential writes to continue to be fast? They won't be.

-- 
Jens Axboe



* Re: 2.6.17 odd hd slow down
  2006-09-06 10:57 ` Jens Axboe
@ 2006-09-06 12:06   ` Damon LaCrosse
  2006-09-06 12:06   ` Damon LaCrosse
  1 sibling, 0 replies; 4+ messages in thread
From: Damon LaCrosse @ 2006-09-06 12:06 UTC
  To: linux-kernel

Jens Axboe <axboe@kernel.dk> writes:

   On Wed, Sep 06 2006, Damon LaCrosse wrote:
   > 
   > Hi all,
   > experimenting a little the 2.6 block device layer I detected under
   > some circumstances a net slowness in the disk throughput. Strangely
   > enough, in fact, my IDE disk reported a significant performance drop
   > off in correspondence of certain access patterns.
   > 
   > Following further investigations I was able to simulate this ill
   > behavior in the following piece of code, clearly showing a non
   > negligible hard-disk slow down when the step value is set greater than
   > 8. These result in fact far below the hard-disk real speed
   > (30~70MB/sec), as correctly measured instead in correspondence of low
   > STEP values (<8). In particular, with step of 512 or above, the
   > overall performance scored by the disk results below 2MB/sec.

   You are effectively approaching seeky writes, I bet it's just the drive
   firmware biting you. Repeat the test on a different drive, and see if
   you see an identical pattern.

Hi, thank you for your prompt answer! 

Unfortunately I only have a bunch of Maxtors at the moment, but I'll do it ASAP.

   > At first I thought to a side-effect of the queue plug/unplug
   > mechanism: the scattered accesses involve the unplug timeout to each
   > bio. So, I added the BIO_RW_SYNC flag that - AFAIK - should force the
   > queue unplugging. Unfortunately nothing changes.
   > 
   > Now, as it is quite possible that I'm missing something, the question
   > is: is there an effective way of doing scattered disk accesses using
   > bios? In other words, how can I fix the following program in order to
   > get disk full speed for steps > 8?

   I don't think the io path has anything to do with this. Why are you
   expecting non-sequential writes to continue to be fast? They wont be.

Well, as I said, I'm probably missing something, so don't blame me ;-)

First, because of the performance holes between steps 1, 2, 4 and 8. If you
look carefully below you'll find a performance drop at steps 5, 6 and 7 but not
at step 8; in my experiments, step == sector (512B) and block_size ==
page_size (4096B), so you actually write only the even blocks when step=8. If
this were really a seek problem, shouldn't performance go down linearly with
the step size (i.e. without holes)?
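
The layout implied by those sizes can be worked out directly. This is a small
sketch of the arithmetic, assuming 512-byte sectors, 4096-byte writes and
bi_sector = step * i as in the posted module; the function names are
illustrative. By this arithmetic, step 8 gives back-to-back 4096-byte writes
(gap 0) rather than every other block, steps below 8 overlap the previous
write, and steps above 8 skip space between writes.

```c
#include <assert.h>
#include <stdio.h>

#define SECTOR_SIZE 512L    /* bytes per sector, as stated in the thread */
#define BLOCK_SIZE  4096L   /* bytes written per bio, PAGE_SIZE on i686 */

/*
 * Bytes between the end of write i and the start of write i + 1.
 * Negative: consecutive writes overlap. Zero: back to back.
 * Positive: that many bytes are skipped between writes.
 */
long gap_bytes(int step)
{
    assert(step > 0);
    return (long)step * SECTOR_SIZE - BLOCK_SIZE;
}

/* 1 if the stride in bytes divides the 4096-byte block size evenly. */
int stride_divides_block(int step)
{
    assert(step > 0);
    return BLOCK_SIZE % ((long)step * SECTOR_SIZE) == 0;
}

/* Print one summary row for a step value. */
void describe_step(int step)
{
    printf("step %3d: stride %6ld B, gap %6ld B, divides block: %s\n",
           step, (long)step * SECTOR_SIZE, gap_bytes(step),
           stride_divides_block(step) ? "yes" : "no");
}
```

Calling describe_step() for steps 1 to 16 shows that the strides dividing 4096
evenly are exactly steps 1, 2, 4 and 8, the fast rows in the anticipatory,
deadline and CFQ tables. That is only an observation about the pattern, not an
explanation of the slowdown.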

Second, by the end of the submit_bio() loop nearly every bi_end_io() has
already been serviced: AFAIK submit_bio() is asynchronous, so IMHO there is a
chance that it stalls on the request queue before submitting the bio to the
driver.

Third, the disk doesn't look stressed: it blinks idly and no significant head
movement can be noticed. It is under far more stress during DVD playback than
in this case.

Bye
Damon

ANTICIPATORY SCHEDULER

STEP (hs)	CYCLES		WRITTEN (MB)	ELAPSED (s)	SPEED (MB/s)
4		52482		205		3		62.3135 <-----
5		14448		56		3		18.1951
6		13617		53		3		17.1732
7		12849		50		3		16.1695
8		47874		187		3		56.2823 <-----
9		2569		10		3		3.468
10		2608		10		3		3.716

