public inbox for linux-kernel@vger.kernel.org
* [patch 1/21] random fixes
@ 2002-08-11  7:38 Andrew Morton
  2002-08-11  7:56 ` Alexander Viro
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Andrew Morton @ 2002-08-11  7:38 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: lkml

Sorry, but there's a ton of stuff here.  It ends up as a 4600 line
diff.  Some code dating back to 2.5.24.  It's almost all performance
work and it has been very painful getting its effectiveness tested
on the big machines; the main problem has been getting them booting
2.5 at all.  The results still are not as conclusive as I'd like,
but the signs are good, and there are no other proposals around to
fix these problems.



This one is mainly a resend.

- I changed the sector_t thing in max_block to use davem's approach. 
  I agree with Anton, but making it explicit doesn't hurt.

- Remove a dead comment in copy_strings.

Old stuff:

- Remove the IO error warning in end_buffer_io_sync().  Failed READA
  attempts trigger it.

- Emit a warning when ext2 mounts an ext3 filesystem.

  We have had quite a few problem reports related to this, mainly
  arising from initrd problems.  And mount(8) tends to report the
  fstype from /etc/fstab rather than reporting what has really
  happened.

Fixes some bogosity which I added to max_block():

- `size' doesn't need to be sector_t

- `retval' should not be initialised to "~0U" because that is
  0x00000000ffffffff when widened to a 64-bit sector_t.

- Allocate task_structs with GFP_KERNEL, as discussed.

- Convert the EXPORT_SYMBOL for generic_file_direct_IO() to
  EXPORT_SYMBOL_GPL.  That was only exported as a practicality for the
  raw driver.

- Make the loop thread run balance_dirty_pages() after dirtying the
  backing file.  So it will perform writeback of the backing file when
  dirty memory levels are high.  Export balance_dirty_pages to GPL
  modules for this.

  This makes loop work a lot better - I suspect it broke when callers
  of balance_dirty_pages() started writing back only their own queue.

  There are many page allocation failures under heavy loop writeout. 
  Coming from blk_queue_bounce()'s allocation from the page_pool
  mempool.  So...

- Disable page allocation warnings around the initial atomic
  allocation attempt in mempool_alloc() - the one where __GFP_WAIT and
  __GFP_IO were turned off.  That one can easily fail.

- Add some commentary in block_write_full_page()
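
The `retval' point in the max_block() list above can be reproduced in
userspace.  This is only a sketch: sector_sketch_t is a hypothetical
stand-in for a 64-bit sector_t (not the kernel typedef), and it assumes
unsigned int is 32 bits wide, as on the usual Linux targets.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: stand-in for a 64-bit sector_t, not the kernel typedef. */
typedef uint64_t sector_sketch_t;

/* ~0U is computed as a 32-bit unsigned int, then zero-extended. */
sector_sketch_t all_ones_narrow(void)
{
	return ~0U;
}

/* Casting first makes the complement happen at the full 64-bit width. */
sector_sketch_t all_ones_wide(void)
{
	return ~((sector_sketch_t)0);
}

/* Self-check that a reader can call from main(). */
void check_widths(void)
{
	assert(all_ones_narrow() == 0x00000000ffffffffULL);
	assert(all_ones_wide() == 0xffffffffffffffffULL);
}
```

The explicit cast gives the intended "no limit" value whatever the width
of the type turns out to be.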


 drivers/block/loop.c |    2 ++
 fs/block_dev.c       |    6 +++---
 fs/buffer.c          |   13 +++++++++++--
 fs/exec.c            |    5 -----
 fs/ext2/super.c      |    3 +++
 kernel/fork.c        |    5 +++--
 kernel/ksyms.c       |    2 +-
 mm/mempool.c         |    2 ++
 mm/page-writeback.c  |    1 +
 9 files changed, 26 insertions(+), 13 deletions(-)

--- 2.5.31/fs/buffer.c~misc	Sat Aug 10 23:23:35 2002
+++ 2.5.31-akpm/fs/buffer.c	Sat Aug 10 23:23:35 2002
@@ -180,7 +180,10 @@ void end_buffer_io_sync(struct buffer_he
 	if (uptodate) {
 		set_buffer_uptodate(bh);
 	} else {
-		buffer_io_error(bh);
+		/*
+		 * This happens, due to failed READA attempts.
+		 * buffer_io_error(bh);
+		 */
 		clear_buffer_uptodate(bh);
 	}
 	unlock_buffer(bh);
@@ -2283,7 +2286,13 @@ int block_write_full_page(struct page *p
 		return -EIO;
 	}
 
-	/* The page straddles i_size */
+	/*
+	 * The page straddles i_size.  It must be zeroed out on each and every
+	 * writepage invocation because it may be mmapped.  "A file is mapped
+	 * in multiples of the page size.  For a file that is not a multiple of
+	 * the  page size, the remaining memory is zeroed when mapped, and
+	 * writes to that region are not written out to the file."
+	 */
 	kaddr = kmap(page);
 	memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
 	flush_dcache_page(page);
--- 2.5.31/fs/block_dev.c~misc	Sat Aug 10 23:23:35 2002
+++ 2.5.31-akpm/fs/block_dev.c	Sat Aug 10 23:23:35 2002
@@ -26,12 +26,12 @@
 
 static sector_t max_block(struct block_device *bdev)
 {
-	sector_t retval = ~0U;
+	sector_t retval = ~((sector_t)0);
 	loff_t sz = bdev->bd_inode->i_size;
 
 	if (sz) {
-		sector_t size = block_size(bdev);
-		unsigned sizebits = blksize_bits(size);
+		unsigned int size = block_size(bdev);
+		unsigned int sizebits = blksize_bits(size);
 		retval = (sz >> sizebits);
 	}
 	return retval;
--- 2.5.31/fs/ext2/super.c~misc	Sat Aug 10 23:23:35 2002
+++ 2.5.31-akpm/fs/ext2/super.c	Sat Aug 10 23:23:35 2002
@@ -698,6 +698,9 @@ static int ext2_fill_super(struct super_
 			printk(KERN_ERR "EXT2-fs: get root inode failed\n");
 		goto failed_mount2;
 	}
+	if (EXT2_HAS_COMPAT_FEATURE(sb, EXT3_FEATURE_COMPAT_HAS_JOURNAL))
+		ext2_warning(sb, __FUNCTION__,
+			"mounting ext3 filesystem as ext2\n");
 	ext2_setup_super (sb, es, sb->s_flags & MS_RDONLY);
 	return 0;
 failed_mount2:
--- 2.5.31/kernel/fork.c~misc	Sat Aug 10 23:23:35 2002
+++ 2.5.31-akpm/kernel/fork.c	Sat Aug 10 23:23:35 2002
@@ -106,9 +106,10 @@ static struct task_struct *dup_task_stru
 	struct thread_info *ti;
 
 	ti = alloc_thread_info();
-	if (!ti) return NULL;
+	if (!ti)
+		return NULL;
 
-	tsk = kmem_cache_alloc(task_struct_cachep,GFP_ATOMIC);
+	tsk = kmem_cache_alloc(task_struct_cachep, GFP_KERNEL);
 	if (!tsk) {
 		free_thread_info(ti);
 		return NULL;
--- 2.5.31/kernel/ksyms.c~misc	Sat Aug 10 23:23:35 2002
+++ 2.5.31-akpm/kernel/ksyms.c	Sat Aug 10 23:23:35 2002
@@ -340,7 +340,7 @@ EXPORT_SYMBOL(register_disk);
 EXPORT_SYMBOL(read_dev_sector);
 EXPORT_SYMBOL(init_buffer);
 EXPORT_SYMBOL(wipe_partitions);
-EXPORT_SYMBOL(generic_file_direct_IO);
+EXPORT_SYMBOL_GPL(generic_file_direct_IO);
 
 /* tty routines */
 EXPORT_SYMBOL(tty_hangup);
--- 2.5.31/drivers/block/loop.c~misc	Sat Aug 10 23:23:35 2002
+++ 2.5.31-akpm/drivers/block/loop.c	Sat Aug 10 23:23:35 2002
@@ -74,6 +74,7 @@
 #include <linux/slab.h>
 #include <linux/loop.h>
 #include <linux/suspend.h>
+#include <linux/writeback.h>
 #include <linux/buffer_head.h>		/* for invalidate_bdev() */
 
 #include <asm/uaccess.h>
@@ -235,6 +236,7 @@ do_lo_send(struct loop_device *lo, struc
 	up(&mapping->host->i_sem);
 out:
 	kunmap(bvec->bv_page);
+	balance_dirty_pages(mapping);
 	return ret;
 
 unlock:
--- 2.5.31/mm/page-writeback.c~misc	Sat Aug 10 23:23:35 2002
+++ 2.5.31-akpm/mm/page-writeback.c	Sat Aug 10 23:23:35 2002
@@ -133,6 +133,7 @@ void balance_dirty_pages(struct address_
 	if (!writeback_in_progress(bdi) && ps.nr_dirty > background_thresh)
 		pdflush_operation(background_writeout, 0);
 }
+EXPORT_SYMBOL_GPL(balance_dirty_pages);
 
 /**
  * balance_dirty_pages_ratelimited - balance dirty memory state
--- 2.5.31/mm/mempool.c~misc	Sat Aug 10 23:23:35 2002
+++ 2.5.31-akpm/mm/mempool.c	Sat Aug 10 23:23:35 2002
@@ -189,7 +189,9 @@ void * mempool_alloc(mempool_t *pool, in
 	int gfp_nowait = gfp_mask & ~(__GFP_WAIT | __GFP_IO);
 
 repeat_alloc:
+	current->flags |= PF_NOWARN;
 	element = pool->alloc(gfp_nowait, pool->pool_data);
+	current->flags &= ~PF_NOWARN;
 	if (likely(element != NULL))
 		return element;
 
--- 2.5.31/fs/exec.c~misc	Sat Aug 10 23:23:40 2002
+++ 2.5.31-akpm/fs/exec.c	Sat Aug 10 23:24:12 2002
@@ -209,11 +209,6 @@ int copy_strings(int argc,char ** argv, 
 		/* XXX: add architecture specific overflow check here. */ 
 		pos = bprm->p;
 
-		/*
-		 * The only sleeping function which we are allowed to call in
-		 * this loop is copy_from_user().  Otherwise, copy_user_state
-		 * could get trashed.
-		 */
 		while (len > 0) {
 			int i, new, err;
 			int offset, bytes_to_copy;


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 1/21] random fixes
  2002-08-11  7:38 [patch 1/21] random fixes Andrew Morton
@ 2002-08-11  7:56 ` Alexander Viro
  2002-08-11 14:29 ` Adam Kropelin
  2002-08-14  8:35 ` William Lee Irwin III
  2 siblings, 0 replies; 21+ messages in thread
From: Alexander Viro @ 2002-08-11  7:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linus Torvalds, lkml



On Sun, 11 Aug 2002, Andrew Morton wrote:

>  	flush_dcache_page(page);
> --- 2.5.31/fs/block_dev.c~misc	Sat Aug 10 23:23:35 2002
> +++ 2.5.31-akpm/fs/block_dev.c	Sat Aug 10 23:23:35 2002
> @@ -26,12 +26,12 @@
>  
>  static sector_t max_block(struct block_device *bdev)
>  {
> -	sector_t retval = ~0U;
> +	sector_t retval = ~((sector_t)0);
>  	loff_t sz = bdev->bd_inode->i_size;
>  
>  	if (sz) {
> -		sector_t size = block_size(bdev);
> -		unsigned sizebits = blksize_bits(size);
> +		unsigned int size = block_size(bdev);
> +		unsigned int sizebits = blksize_bits(size);
>  		retval = (sz >> sizebits);

Ugh.  Why do we have all that stuff, anyway?

	bdev->bd_inode->i_size >> bdev->bd_inode->i_blkbits

should work just fine...
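
A rough userspace sketch of the two computations being compared, for
anyone following along.  The function names are hypothetical, and
blksize_bits_sketch() is a plain log2 stand-in rather than the kernel's
blksize_bits().  Whether i_blkbits always tracks the current block size
of the device is an assumption here, not something this thread settles.
Note that the shortcut has no special case for a zero i_size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in: log2 of a power-of-two block size (512 -> 9). */
unsigned int blksize_bits_sketch(unsigned int size)
{
	unsigned int bits = 0;

	assert(size != 0 && (size & (size - 1)) == 0);	/* power of two */
	while (size > 1) {
		size >>= 1;
		bits++;
	}
	return bits;
}

/* Shape of max_block() in the patch: "no limit" when the size is zero. */
uint64_t max_block_via_size(int64_t i_size, unsigned int block_size)
{
	if (i_size == 0)
		return ~((uint64_t)0);
	return (uint64_t)i_size >> blksize_bits_sketch(block_size);
}

/* The suggested shortcut: shift by an already-cached bit count. */
uint64_t max_block_via_blkbits(int64_t i_size, unsigned int i_blkbits)
{
	return (uint64_t)i_size >> i_blkbits;
}
```

For a 1 MiB device with 4096-byte blocks both forms give 256 blocks; they
differ only for i_size == 0.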



* Re: [patch 1/21] random fixes
  2002-08-11  7:38 [patch 1/21] random fixes Andrew Morton
  2002-08-11  7:56 ` Alexander Viro
@ 2002-08-11 14:29 ` Adam Kropelin
  2002-08-11 18:09   ` Andrew Morton
  2002-08-14  8:35 ` William Lee Irwin III
  2 siblings, 1 reply; 21+ messages in thread
From: Adam Kropelin @ 2002-08-11 14:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml

On Sun, Aug 11, 2002 at 12:38:19AM -0700, Andrew Morton wrote:
> Sorry, but there's a ton of stuff here.  It ends up as a 4600 line
> diff.  Some code dating back to 2.5.24.  It's almost all performance

Andrew,

Nearly all the patches against mm/vmscan.c are failing when applied
to the 2.5.31 Linus just released. Are these patches against a
slightly older BK rev?

--Adam



* Re: [patch 1/21] random fixes
  2002-08-11 14:29 ` Adam Kropelin
@ 2002-08-11 18:09   ` Andrew Morton
  2002-08-12  0:27     ` Adam Kropelin
  2002-08-12  2:54     ` Adam Kropelin
  0 siblings, 2 replies; 21+ messages in thread
From: Andrew Morton @ 2002-08-11 18:09 UTC (permalink / raw)
  To: Adam Kropelin; +Cc: lkml

Adam Kropelin wrote:
> 
> On Sun, Aug 11, 2002 at 12:38:19AM -0700, Andrew Morton wrote:
> > Sorry, but there's a ton of stuff here.  It ends up as a 4600 line
> > diff.  Some code dating back to 2.5.24.  It's almost all performance
> 
> Andrew,
> 
> Nearly all the patches against mm/vmscan.c are failing when applied
> to the 2.5.31 Linus just released. Are these patches against a
> slightly older BK rev?

Gee I hope not.

Try getting them from http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/,
or the big rollup http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/everything.gz


* Re: [patch 1/21] random fixes
  2002-08-11 18:09   ` Andrew Morton
@ 2002-08-12  0:27     ` Adam Kropelin
  2002-08-12  0:41       ` Rik van Riel
  2002-08-12  4:58       ` Andrew Morton
  2002-08-12  2:54     ` Adam Kropelin
  1 sibling, 2 replies; 21+ messages in thread
From: Adam Kropelin @ 2002-08-12  0:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml

On Sun, Aug 11, 2002 at 11:09:02AM -0700, Andrew Morton wrote:
> Adam Kropelin wrote:
> > 
> > On Sun, Aug 11, 2002 at 12:38:19AM -0700, Andrew Morton wrote:
> > > Sorry, but there's a ton of stuff here.  It ends up as a 4600 line
> > > diff.  Some code dating back to 2.5.24.  It's almost all performance
> > 
> > Andrew,
> > 
> > Nearly all the patches against mm/vmscan.c are failing when applied
> > to the 2.5.31 Linus just released. Are these patches against a
> > slightly older BK rev?
> 
> Gee I hope not.
> 
> Try getting them from http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/,
> or the big rollup http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/everything.gz

The big rollup applied fine, thanks.

I did a bit of testing since I've always thought 2.4 (and 2.5) writeout behavior
left something to be desired. Testbed was a SMP x86 (2xPPro-200) with 160 MB
of RAM. I used everyone's favorite 2.5 scapegoat: IDE, with a single not-very-
fast IBM disk. Filesystem was ext3 in data=ordered mode. Test workload was an
inbound (from the point of view of the system under test) FTP transfer of a
600 MB iso image. All test runs were from a clean boot with all unnecessary
services shut down.

Results (average of 4 runs):

2.5.31-akpm: 2m 43s
2.5.31:      2m 33s
2.4.19:      2m 18s

`vmstat 1` shows some differences, especially with respect to 2.4 vs. 2.5. In
about 40% of the cases when the bo drops to (near) 0, the machine stalled (FTP
transfer halted, vmstat output paused, etc.). With 2.5.31-akpm, the stalls were
about 3-4 seconds in length. With 2.5.31, the stalls were of the same duration,
but slightly less frequent. With 2.4.19, the stalls were very frequent (closer
to 70% of the time bo hit 0), but were only 1-2 seconds in duration.

Below are representative samples of `vmstat 1` for each kernel during the test.
(Note that the low cache usage in the 2.5.31 sample is because the snapshot is
from early in the run when the cache is still filling.)

Let me know if I can provide more information...

--Adam

2.5.31-akpm:
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  2  1    112   3436      0 140956   0   0     4 15480 5454   400   0  39  60
 0  2  1    112   3436      0 140956   0   0     0  7696 1093    69   0   2  98
 0  2  1    112   3436      0 140956   0   0     0  6268 1084    85   0  31  69
 1  0  2    112   2476      0 142012   0   0     0     4 2863   250   0  23  77
 1  0  0    112   3080      0 142080   0   0     0    68 6730   485   0  46  53
 0  1  1    112   2940      0 141968   0   0     0 11720 5025   340   1  33  67
 0  1  1    112   2936      0 141968   0   0     0   264 1085    45   0   1  99
 1  0  1    112   2812      0 142344   0   0     0    52 3104   203   0  18  82
 0  0  0    112   3300      0 141972   0   0     0     4 6761   469   1  42  57
 1  0  0    112   3492      0 141684   0   0     0     0 6859   495   1  42  56
 0  1  1    112   3548      0 141204   0   0     0 15508 4769   328   0  31  69
 0  1  1    112   3544      0 141204   0   0     0  2268 1081    63   0   2  98
 0  0  0    112   2436      0 142248   0   0     0    56 2006   147   0  10  90
 1  0  0    112   2952      0 142328   0   0     0     4 6760   452   1  43  56
 1  0  1    112   3432      0 141716   0   0     0     0 6955   464   1  42  57
 0  1  1    112   2940      0 141816   0   0     0 15612 4301   262   0  28  72
 0  1  1    112   2932      0 141816   0   0     0   588 1095    78   0   2  98
 1  0  0    112   2620      0 142660   0   0     0    52 4554   314   1  30  69
 1  0  0    112   3420      0 141808   0   0     0     4 6673   465   0  43  57
 0  0  0    112   2628      0 142456   0   0     0     4 6931   491   1  44  55

2.5.31:
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  0  0      0 118940      0  28240   0   0     4     0 4171   256   1  21  78
 1  0  0      0 110904      0  36036   0   0     0     0 8937   590   1  53  46
 1  0  0      0 103260      0  43452   0   0     0     0 8558   559   1  50  49
 0  0  0      0  97100      0  49424   0   0     0     0 6919   460   1  41  58
 0  1  1      0  96048      0  50104   0   0     0 21036 1798    67   0   9  90
 0  1  1      0  96044      0  50104   0   0     0  3888 1087    55   0   2  98
 0  1  1      0  96044      0  50104   0   0     0     0 1081    65   0   1  99
 1  0  0      0  91516      0  54544   0   0     0    72 5305   352   0  33  67
 0  0  0      0  85392      0  60560   0   0     0     0 6972   458   0  44  56
 0  0  1      0  79344      0  66476   0   0     0 10788 6384  3173   1  48  50
 1  0  0      0  73296      0  72416   0   0     0    44 6705  1392   1  49  50
 0  0  0      0  67156      0  78444   0   0     0     0 6975   475   1  62  37
 1  0  0      0  61392      0  84104   0   0     0     0 6603   442   0  37  62
 0  1  1      0  55272      0  90016   0   0     0 15500 6940   451   1  42  57
 0  1  1      0  55272      0  90016   0   0     0  7696 1123    13   0   3  97

2.4.19:
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  0  0      0   4384   2124 140132   0   0     0    52 6961   645   0  54  45
 1  0  0      0   4372   2132 140024   0   0     0     0 6994   653   1  50  50
 0  1  1      0   4360   2136 139916   0   0     0  3956 6189   577   1  44  55
 0  1  1      0   4360   2136 139916   0   0     0  8196  223    14   0   2  97
 0  0  1      0   4344   2140 139908   0   0     0  6080 1189    90   0   9  91
 0  1  1      0   4440   2140 139764   0   0     4  7296 5902   557   0  43  57
 1  0  0      0   4360   2144 140044   0   0     0    56 3515   307   0  29  71
 0  1  1      0   4468   2144 139936   0   0     0  4036 5672   519   0  42  57
 0  1  1      0   4468   2144 139936   0   0     0  7960  220    14   0   1  99
 1  0  1      0   4464   2144 139980   0   0     0  5160 2073   178   0  17  82
 1  0  0      0   4396   2164 140092   0   0     0  3148 6965   656   1  51  48
 1  0  0      0   4396   2164 140068   0   0     0     0 7193   656   1  44  54
 0  2  1      0   4384   2164 139996   0   0     0  5848 4923   454   1  37  62
 0  2  1      0   4384   2164 139996   0   0     0  6148  222    10   0   0  99
 1  0  1      0   4400   2168 139900   0   0     0  7400 2961   258   0  24  75
 1  0  0      0   4464   2184 140004   0   0     0    52 7076   659   1  51  48
 1  0  0      0   4452   2184 139936   0   0     0     0 6960   638   0  54  46
 0  1  2      0   4404   2188 139932   0   0     0  5968 4332   399   0  30  69
 0  1  1      0   4404   2188 139932   0   0     0  4804  222    12   0   1  99



* Re: [patch 1/21] random fixes
  2002-08-12  0:27     ` Adam Kropelin
@ 2002-08-12  0:41       ` Rik van Riel
  2002-08-12  4:58       ` Andrew Morton
  1 sibling, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2002-08-12  0:41 UTC (permalink / raw)
  To: Adam Kropelin; +Cc: Andrew Morton, lkml

On Sun, 11 Aug 2002, Adam Kropelin wrote:

> fast IBM disk. Filesystem was ext3 in data=ordered mode. Test workload
> was an inbound (from the point of view of the system under test) FTP
> transfer of a 600 MB iso image. All test runs were from a clean boot
> with all unnecessary services shut down.

> machine stalled (FTP transfer halted, vmstat output paused, etc.). With
> 2.5.31-akpm, the stalls were about 3-4 seconds in length. With 2.5.31,
> the stalls were of the same duration, but slightly less frequent. With

Definitely some writeout silliness.  Why would we ever stop
writing pages to disk while a transfer is going on and then
suddenly decide to stall the system because pages are being
dirtied at a rate faster than we write them ?

If we can smooth out the writing we can keep the disks busy
all the time and should in theory perform better. I wonder
why Andrew made the writeout in 2.5 _more_ bursty ...

regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/



* Re: [patch 1/21] random fixes
  2002-08-11 18:09   ` Andrew Morton
  2002-08-12  0:27     ` Adam Kropelin
@ 2002-08-12  2:54     ` Adam Kropelin
  2002-08-12  3:40       ` Andrew Morton
  1 sibling, 1 reply; 21+ messages in thread
From: Adam Kropelin @ 2002-08-12  2:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml

FYI, just got this while un-tarring a kernel tree with 2.5.31+everything.gz:
(no nvidia ;)

--Adam

ksymoops 2.4.1 on i686 2.5.31-akpm.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.5.31-akpm/ (default)
     -m /boot/System.map-2.5.31-akpm (default)

Warning: You did not tell me where to find symbol information.  I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc.  ksymoops -h explains the options.

No modules in ksyms, skipping objects
Warning (read_lsmod): no symbols in lsmod, is /proc/modules a valid lsmod file?
Warning (compare_maps): ksyms_base symbol GPLONLY___wake_up_sync not found in System.map.  Ignoring ksyms_base entry
Warning (compare_maps): ksyms_base symbol GPLONLY_balance_dirty_pages not found in System.map.  Ignoring ksyms_base entry
Warning (compare_maps): ksyms_base symbol GPLONLY_generic_file_direct_IO not found in System.map.  Ignoring ksyms_base entry
Warning (compare_maps): ksyms_base symbol GPLONLY_idle_cpu not found in System.map.  Ignoring ksyms_base entry
Warning (compare_maps): ksyms_base symbol GPLONLY_set_cpus_allowed not found in System.map.  Ignoring ksyms_base entry
kernel BUG at page_alloc.c:98!
invalid operand: 0000
CPU:    1
EIP:    0010:[<c0132503>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010282
eax: c89d5840   ebx: c10c7000   ecx: 00000000   edx: 00000000
esi: c51f5e70   edi: 00000005   ebp: 00000010   esp: c51f5e14
ds: 0018   es: 0018   ss: 0018
Stack: 00009000 c1000018 c1123238 c1028018 c0313c60 00000206 ffffffff 00001a66 
       00000000 00000008 c51f5e70 00000005 00000010 c0132f7a c10caa48 00000009 
       c0130e1b c51f5e6c 00000000 c89d5e20 c2f88dd0 00000000 00000009 c10570e8 
Call Trace: [<c0132f7a>] [<c0130e1b>] [<c0129f01>] [<c0114791>] [<c0165c04>] 
   [<c0116569>] [<c011b3f9>] [<c0111370>] [<c0107183>] 
Code: 0f 0b 62 00 85 b6 2c c0 8b 03 ba 04 00 00 00 83 e0 10 74 1d 

>>EIP; c0132503 <__free_pages_ok+93/300>   <=====
Trace; c0132f7a <__pagevec_free+1a/20>
Trace; c0130e1b <__pagevec_release+fb/110>
Trace; c0129f01 <exit_mmap+1a1/280>
Trace; c0114791 <default_wake_function+21/40>
Trace; c0165c04 <ext3_release_file+14/20>
Trace; c0116569 <mmput+49/70>
Trace; c011b3f9 <do_exit+d9/2c0>
Trace; c0111370 <smp_apic_timer_interrupt+e0/120>
Trace; c0107183 <syscall_call+7/b>
Code;  c0132503 <__free_pages_ok+93/300>
00000000 <_EIP>:
Code;  c0132503 <__free_pages_ok+93/300>   <=====
   0:   0f 0b                     ud2a      <=====
Code;  c0132505 <__free_pages_ok+95/300>
   2:   62 00                     bound  %eax,(%eax)
Code;  c0132507 <__free_pages_ok+97/300>
   4:   85 b6 2c c0 8b 03         test   %esi,0x38bc02c(%esi)
Code;  c013250d <__free_pages_ok+9d/300>
   a:   ba 04 00 00 00            mov    $0x4,%edx
Code;  c0132512 <__free_pages_ok+a2/300>
   f:   83 e0 10                  and    $0x10,%eax
Code;  c0132515 <__free_pages_ok+a5/300>
  12:   74 1d                     je     31 <_EIP+0x31> c0132534 <__free_pages_ok+c4/300>


7 warnings issued.  Results may not be reliable.



* Re: [patch 1/21] random fixes
  2002-08-12  2:54     ` Adam Kropelin
@ 2002-08-12  3:40       ` Andrew Morton
  0 siblings, 0 replies; 21+ messages in thread
From: Andrew Morton @ 2002-08-12  3:40 UTC (permalink / raw)
  To: Adam Kropelin; +Cc: lkml

Adam Kropelin wrote:
> 
> FYI, just got this while un-tarring a kernel tree with 2.5.31+everything.gz:
> (no nvidia ;)
> 

That'll be this one:

	        BUG_ON(page->pte.chain != NULL);

we've had a few reports of this dribbling in since rmap went in.  But
nothing repeatable enough for it to be hunted down.

But we do have a repeatable inconsistency happening with ntpd and
memory pressure.  That may be related, but in that case it's probably
related to mlock().

So.  An open bug, alas.


* Re: [patch 1/21] random fixes
  2002-08-12  0:27     ` Adam Kropelin
  2002-08-12  0:41       ` Rik van Riel
@ 2002-08-12  4:58       ` Andrew Morton
  2002-08-13  0:26         ` Adam Kropelin
  1 sibling, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2002-08-12  4:58 UTC (permalink / raw)
  To: Adam Kropelin; +Cc: lkml

Adam Kropelin wrote:
> 
> ...
> I did a bit of testing since I've always thought 2.4 (and 2.5) writeout behavior
> left something to be desired. Testbed was a SMP x86 (2xPPro-200) with 160 MB
> of RAM. I used everyone's favorite 2.5 scapegoat: IDE, with a single not-very-
> fast IBM disk. Filesystem was ext3 in data=ordered mode.

ext3 performs its own writeback alongside the core kernel's writeback
decisions, so that complicates things.

> Test workload was an
> inbound (from the point of view of the system under test) FTP transfer of a
> 600 MB iso image. All test runs were from a clean boot with all unnecessary
> services shut down.
> 
> Results (average of 4 runs):
> 
> 2.5.31-akpm: 2m 43s
> 2.5.31:      2m 33s
> 2.4.19:      2m 18s

yes.  For this workload (10 mbyte/sec ftp transfer onto a >20 meg/sec
disk) the application should never block on IO - all writeback should 
happen via pdflush.

2.4 starts background writeback at 30% dirty and synchronous writeback
at 60% dirty.

2.5 starts background writeback at 40% dirty and synchronous writeback
at 50% dirty.
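
The two threshold pairs above can be compared with a small model.  This
is a simplified sketch, not the kernel's logic: it looks only at a single
"percent dirty" figure, the names are hypothetical, and using a strict
">" at both thresholds is an assumption.

```c
#include <assert.h>

enum writeback_action { WB_NONE, WB_BACKGROUND, WB_SYNCHRONOUS };

/* Hypothetical container for the two percentages quoted above. */
struct dirty_thresholds {
	unsigned int background_pct;	/* background writeback starts */
	unsigned int sync_pct;		/* the writer is made to wait */
};

/* Decide what a writer would see at a given dirty-memory percentage. */
enum writeback_action classify_dirty(unsigned int dirty_pct,
				     struct dirty_thresholds t)
{
	assert(t.background_pct <= t.sync_pct);
	if (dirty_pct > t.sync_pct)
		return WB_SYNCHRONOUS;
	if (dirty_pct > t.background_pct)
		return WB_BACKGROUND;
	return WB_NONE;
}
```

With { 30, 60 } a writer at 55% dirty still only triggers background
writeback; with { 40, 50 } the same writer is already throttled.  The
narrower window leaves background writeback less room to get ahead of
the writer.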

You can make 2.5 use the 2.4 settings with

cd /proc/sys/vm
echo 30 > dirty_background_ratio 
echo 60 > dirty_async_ratio 
echo 70 > dirty_sync_ratio 

and I expect you'll find that fixes it up.  Setting dirty_background_ratio
to 10% will make it even better.  But it will hurt dbench numbers at
certain client counts, which is a national emergency.

Sigh.  I don't know what the right numbers are.  There aren't any; that's
the problem with magic numbers.  That part of the kernel is making writeback
and throttling decisions in total ignorance of the overall state of
the system.

Worst comes to worst, we can set the 2.5 knobs at the same level as the
2.4 ones, but I'd much prefer that we come up with something dynamic.

In fact, I'd be inclined to set the background ratio much lower than
2.4, and to hell with dbench.  Because the lower level is better for
real programs, as you've observed.

Care to tune and retest?


* Re: [patch 1/21] random fixes
  2002-08-12  4:58       ` Andrew Morton
@ 2002-08-13  0:26         ` Adam Kropelin
  2002-08-13  0:49           ` Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Adam Kropelin @ 2002-08-13  0:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml, riel

On Sun, Aug 11, 2002 at 09:58:22PM -0700, Andrew Morton wrote:
> ext3 performs its own writeback alongside the core kernel's writeback
> decisions, so that complicates things.

I ran the test after mounting the partition as ext2 and saw a slight
decrease in performance (7-10 seconds over the duration of the test), but I
did not have time to run more than once so this could be a fluke. In general,
the `vmstat 1` output looked the same to me.

> > Results (average of 4 runs):
> > 
> > 2.5.31-akpm: 2m 43s
> > 2.5.31:      2m 33s
> > 2.4.19:      2m 18s
> 
> yes.  For this workload (10 mbyte/sec ftp transfer onto a >20 meg/sec
> disk) the application should never block on IO - all writeback should 
> happen via pdflush.

> You can make 2.5 use the 2.4 settings with
> 
> cd /proc/sys/vm
> echo 30 > dirty_background_ratio 
> echo 60 > dirty_async_ratio 
> echo 70 > dirty_sync_ratio 

These settings bring -akpm in line with stock 2.5.31, but they are both
still slower than 2.4.19 (which itself could do better, I think).

> and I expect you'll find that fixes it up.  Setting dirty_background_ratio
> to 10% will make it even better.  But it will hurt dbench numbers at

No real change at 10%. It's consistently a second or two faster than -akpm is at
30%, but not a drastic change.

> certain client counts, which is a national emergency.
> 
> Sigh.  I don't know what the right numbers are.  There aren't any; that's
> the problem with magic numbers.  That part of the kernel is making writeback
> and throttling decisions in total ignorance of the overall state of
> the system.

It certainly seems something is amiss. If we could actually manage to keep
the disk busy (and this is a fairly slow disk), we'd do wonderfully. But with
a 2-3 second pause every 4-5 seconds, we're transferring data barely 50% of the
time. (Yes, the pause is long enough the disk activity LED actually goes out.)
The short-term average transfer rate over the FTP connection is very
respectable for older hardware: 7-8 MB/s. But with the stalls, the overall
throughput is just over 4 MB/s.
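
The arithmetic behind those figures, as a rough sketch.  It assumes the
image is exactly 600 MB and takes 7.5 MB/s as the midpoint of the
short-term rate; both are approximations, and the function names are
only illustrative.

```c
#include <assert.h>

/* Average throughput in MB/s for size_mb moved in min:sec of wall time. */
double avg_mb_per_s(double size_mb, unsigned int min, unsigned int sec)
{
	unsigned int total = min * 60 + sec;

	assert(total > 0);
	return size_mb / total;
}

/* Fraction of wall time spent transferring, given the burst rate. */
double duty_cycle(double avg_rate, double burst_rate)
{
	assert(burst_rate > 0.0);
	return avg_rate / burst_rate;
}
```

For example, avg_mb_per_s(600, 2, 18) is about 4.35 MB/s for 2.4.19 and
avg_mb_per_s(600, 2, 43) is about 3.68 MB/s for 2.5.31-akpm.  Against a
7.5 MB/s burst rate that is a duty cycle of roughly 58% and 49%
respectively.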

> In fact, I'd be inclined to set the background ratio much lower than
> 2.4, and to hell with dbench.  Because the lower level is better for
> real programs, as you've observed.

> Care to tune and retest?

Absolutely. I'll try whatever ideas/patches you want to throw at me.

BTW, full `vmstat 1` logs are available for all these tests if you want them.

--Adam



* Re: [patch 1/21] random fixes
  2002-08-13  0:26         ` Adam Kropelin
@ 2002-08-13  0:49           ` Andrew Morton
  2002-08-13  2:25             ` Adam Kropelin
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2002-08-13  0:49 UTC (permalink / raw)
  To: Adam Kropelin; +Cc: lkml, riel

Adam Kropelin wrote:
> 
> ...
> > You can make 2.5 use the 2.4 settings with
> >
> > cd /proc/sys/vm
> > echo 30 > dirty_background_ratio
> > echo 60 > dirty_async_ratio
> > echo 70 > dirty_sync_ratio
> 
> These settings bring -akpm in line with stock 2.5.31, but they are both
> still slower than 2.4.19 (which itself could do better, I think).

In that case I'm confounded.  It worked sweetly for me.  Just

	wget ftp://other-machine/600-meg-file

on a machine booted with mem=160m.  Took 63 seconds over 100bT,
steady column of writes in vmstat.

Which ftp client are you using?  And can you strace it, to see how
much data it's writing per system call?


* Re: [patch 1/21] random fixes
  2002-08-13  0:49           ` Andrew Morton
@ 2002-08-13  2:25             ` Adam Kropelin
  2002-08-13  3:03               ` Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Adam Kropelin @ 2002-08-13  2:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml, riel

On Mon, Aug 12, 2002 at 05:49:40PM -0700, Andrew Morton wrote:
> Adam Kropelin wrote:
> > 
> > ...
> > > You can make 2.5 use the 2.4 settings with
> > >
> > > cd /proc/sys/vm
> > > echo 30 > dirty_background_ratio
> > > echo 60 > dirty_async_ratio
> > > echo 70 > dirty_sync_ratio
> > 
> > These settings bring -akpm in line with stock 2.5.31, but they are both
> > still slower than 2.4.19 (which itself could do better, I think).
> 
> In that case I'm confounded.  It worked sweetly for me.  Just

> Which ftp client are you using?  And can you strace it, to see how
> much data it's writing per system call?

Actually, I'm running an FTP server on the testbed machine and pushing the
data from a client on another (much faster) machine. I straced the server
(redhat wu-ftpd2.6.1-20) and it looks like 8 KB reads/writes.

After the transfer gets going...

1329  read(8, "v&X\205:\327.+\310/a\335\24Sa\361c\243\r\244\260~\264z"..., 8192) = 8192
1329  write(7, "v&X\205:\327.+\310/a\335\24Sa\361c\243\r\244\260~\264z"..., 8192) = 8192
1329  rt_sigaction(SIGALRM, {0x804b030, [ALRM], SA_RESTART|0x4000000}, {0x804b030, [ALRM], SA_RESTART|0x4000000}, 8) = 0
1329  alarm(1200)                       = 1200
1329  read(8, "\335\235\335\35}\335]\375\17\373|\324VS[\r\266Af\333\246"..., 8192) = 8192
1329  write(7, "\335\235\335\35}\335]\375\17\373|\324VS[\r\266Af\333\246"..., 8192) = 8192
1329  rt_sigaction(SIGALRM, {0x804b030, [ALRM], SA_RESTART|0x4000000}, {0x804b030, [ALRM], SA_RESTART|0x4000000}, 8) = 0
1329  alarm(1200)                       = 1200
1329  read(8, "\302\365SV4\24{*\341\336\24\213\242\363\307\36\274\377"..., 8192) = 8192
1329  write(7, "\302\365SV4\24{*\341\336\24\213\242\363\307\36\274\377"..., 8192) = 8192
1329  rt_sigaction(SIGALRM, {0x804b030, [ALRM], SA_RESTART|0x4000000}, {0x804b030, [ALRM], SA_RESTART|0x4000000}, 8) = 0
1329  alarm(1200)                       = 1200

...etc.
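
One way to summarise a trace like the one above, as a minimal sketch only:
the helper name write_sizes is made up for illustration, and it assumes
strace's usual "call(args) = retval" line format, with or without a
leading PID column.

```shell
# Hypothetical helper: summarise successful write() calls found in strace
# output on stdin. Assumes lines of the form "write(fd, ..., len) = ret".
write_sizes() {
    awk '
        /(^|[ \t])write\(/ && $NF ~ /^[0-9]+$/ {
            n++; total += $NF        # $NF is the syscall return value
        }
        END {
            if (n) printf "%d writes, %d bytes avg\n", n, total / n
            else   print "no writes seen"
        }'
}
```

Something like "strace -p PID -e trace=write 2>&1 | write_sizes" would feed
it a live trace; failed calls (return value -1) are skipped.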

Following your method and wget'ting from a remote server seems to do 
a bit better (just watching vmstat since I can't compare timings against
my original method). wget seems to read 8K and write it in two 4K writes.
Don't know if this has anything to do with things... Pauses are still
there and the disc activity light still goes out several times per minute
coincident with the pauses.

--Adam


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 1/21] random fixes
  2002-08-13  2:25             ` Adam Kropelin
@ 2002-08-13  3:03               ` Andrew Morton
  2002-08-13  4:10                 ` Adam Kropelin
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2002-08-13  3:03 UTC (permalink / raw)
  To: Adam Kropelin; +Cc: lkml, riel

Adam Kropelin wrote:
> 
> On Mon, Aug 12, 2002 at 05:49:40PM -0700, Andrew Morton wrote:
> > Adam Kropelin wrote:
> > >
> > > ...
> > > > You can make 2.5 use the 2.4 settings with
> > > >
> > > > cd /proc/sys/vm
> > > > echo 30 > dirty_background_ratio
> > > > echo 60 > dirty_async_ratio
> > > > echo 70 > dirty_sync_ratio
> > >
> > > These settings bring -akpm in line with stock 2.5.31, but they are both
> > > still slower than 2.4.19 (which itself could do better, I think).
> >
> > In that case I'm confounded.  It worked sweetly for me.  Just
> 
> > Which ftp client are you using?  And can you strace it, to see how
> > much data it's writing per system call?
> 
> Actually, I'm running an FTP server on the testbed machine and pushing the
> data from a client on another (much faster) machine. I straced the server
> (redhat wu-ftpd2.6.1-20) and it looks like 8 KB reads/writes.
> 

OK, tried that against a slow disk (13 megs/sec write bandwidth).  2.5.31,
default writeback settings.

ext3 is misbehaving:

 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  2  2   5104   4376      0 134016   0   0     0 21620 2888  1966   0   5  95
 0  0  2   5104   4448      0 134224   0   0     0 11420 4787  4004   0   8  92
 1  0  0   5104   4464      0 134776   0   0     0   100 13133 12564   1  24  75
 1  0  0   5104   4440      0 134716   0   0     8     0 13281 12660   1  23  76
 0  0  0   5104   4480      0 134448   0   0    56     0 13272 13022   1  22  77
 0  1  2   5104   4592      0 133880   0   0     0 27200 2598  1596   0   5  95
 0  1  2   5104   4588      0 133880   0   0     0 11544 1127   128   0   2  98
 0  0  1   5104   4356      0 134388   0   0     0   692 10383  9839   0  21  79
 1  0  0   5104   4368      0 134836   0   0     0   108 13115 12912   1  25  74
 0  0  0   5104   4360      0 134556   0   0    36    68 11829 11687   1  20  79

and takes 86 seconds.

When the server is writing to ext2, it is good:

 1  0  0   5104   4364      0 135248   0   0    56 12380 13316 16547   1  17  82
 0  0  0   5104   4388      0 135296   0   0     0 12324 13310 16488   1  16  83
 1  0  0   5104   4056      0 135600   0   0     0 12344 13300 16521   1  15  84
 0  0  0   5104   4368      0 135264   0   0     0 12324 13293 16480   0  16  84
 1  0  0   5104   4428      0 135184   0   0     0  8216 13306 16514   1  16  83
 0  0  0   5104   4396      0 135172   0   0    48 12380 13296 16444   1  16  83
 0  0  0   5104   4392      0 135148   0   0    56 12324 13304 16461   1  16  82
 1  0  0   5104   4396      0 135196   0   0     0 12324 13297 16468   1  17  82
 1  0  0   5104   4444      0 135116   0   0     0 12348 13304 16511   1  18  81

and the transfer takes 54 seconds, which is wirespeed.

The ext3 stall is going to require some thought - it's waiting on a previous
transaction commit so it can get in and modify an inode block again.

Are you _sure_ it was bad with ext2?    How long does

	dd if=/dev/zero of=foo bs=1M count=600 ; sync

take against that disk?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 1/21] random fixes
  2002-08-13  3:03               ` Andrew Morton
@ 2002-08-13  4:10                 ` Adam Kropelin
  2002-08-13  5:25                   ` Andreas Dilger
  2002-08-13  5:32                   ` Andrew Morton
  0 siblings, 2 replies; 21+ messages in thread
From: Adam Kropelin @ 2002-08-13  4:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml, riel

On Mon, Aug 12, 2002 at 08:03:34PM -0700, Andrew Morton wrote:
> Adam Kropelin wrote:
> > Actually, I'm running an FTP server on the testbed machine and pushing the
> > data from a client on another (much faster) machine. I straced the server
> > (redhat wu-ftpd2.6.1-20) and it looks like 8 KB reads/writes.
> > 
> 
> OK, tried that against a slow disk (13 megs/sec write bandwidth).  2.5.31,
> default writeback settings.
> 
> ext3 is misbehaving:
> and takes 86 seconds.
> 
> When the server is writing to ext2, it is good:
> and the transfer takes 54 seconds, which is wirespeed.
> 
> Are you _sure_ it was bad with ext2?

Yes.

[root@devbox adk0212] mount
/dev/hda3 on / type ext2 (rw)
none on /proc type proc (rw)
/dev/hda1 on /boot type ext2 (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
none on /dev/shm type tmpfs (rw)

   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  1  1    120   4360      0 141132   0   0     0  9804 6775   564   0  45  55
 0  1  1    120   4344      0 141132   0   0     0     0 1083    20   0   0  99
 0  0  0    120   4364      0 141116   0   0     0    40 2098   156   0  11  89
 0  0  0    120   4384      0 141368   0   0     0     4 7013   594   0  52  47
 0  0  0    120   4360      0 141416   0   0     0     0 6914   589   1  56  43
 0  1  1    120   4464      0 140856   0   0     0 15420 6235   520   0  42  58
 0  1  1    120   4456      0 140856   0   0     0  3240 1094    36   0   2  98
 1  0  0    120   4428      0 140844   0   0     0    52 1151    70   0   4  96
 1  0  0    120   4440      0 141356   0   0     0     4 6810   541   1  42  57
 0  0  0    120   4464      0 141320   0   0     0     0 6894   553   1  40  58
 0  1  1    120   4396      0 140840   0   0     0 15508 6018   466   0  40  59
 0  1  1    120   4388      0 140840   0   0     0  1608 1093    57   0   2  98
 0  0  0    120   4404      0 140832   0   0     0    52 2350   165   0  12  87
 0  0  0    120   4460      0 141380   0   0     0     4 7040   564   1  42  57
 1  0  0    120   4356      0 141372   0   0     0     4 7073   570   1  45  54
 0  1  1    120   4360      0 140916   0   0     0 15404 5541   437   1  36  63
 0  1  1    120   4356      0 140916   0   0     0  2832 1084    55   0   1  99
 0  0  0    120   4356      0 140904   0   0     0    48 1614   125   0   8  91
 0  0  1    120   4380      0 141412   0   0     0     4 6888   552   1  43  56
 1  0  0    120   4232      0 141476   0   0     4     0 6857   556   1  40  58
 0  1  1    120   4352      0 140988   0   0     0 13700 5148   449   0  35  65

Is it possible that the darn thing is mounted ext3 even though fstab and mount
agree that it's ext2?

> How long does
> 
> 	dd if=/dev/zero of=foo bs=1M count=600 ; sync
> 
> take against that disk?

1m 23s  (I said it was a slow disk ;)

Even during that, the writeout was inconsistent (but a lot better than during
the FTP transfer):

   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  1  3   1784   2180      0 141072   0   0     0  5220 1070    19   0   6  93
 0  1  2   1784   2248      0 141020   0   0     0  8064 1066    23   0   8  92
 1  0  3   1784   2296      0 141008   0   0     0  8436 1132    36   0  12  87
 0  1  3   1784   2300      0 141004   0   0     0  6828 1072   164   0  24  75
 1  0  2   1784   2988      0 140336   0   0     0  4664 1071   144   0  21  79
 1  0  2   1784   2616      0 140700   0   0     0 12944 2688   102   0   5  95
 0  1  3   1784   2296      0 141036   0   0     0 10048 1076   125   1  21  78
 0  1  1   1784   3284      0 140048   0   0     4  5504 1064   143   0  19  80
 0  1  1   1784   3284      0 140048   0   0     0     0 1064    51   0   1  99
 0  1  1   1784   3284      0 140048   0   0     0     0 1058    23   0   1  99
 1  1  3   1812   2312      0 141236   0  28     0 22892 2495   131   0  10  90
 0  2  3   1812   3204      0 140340   0   0     4  7736 1065    81   0  25  75
 0  2  3   1812   3204      0 140340   0   0     0  3848 1062    52   0   9  90
 0  2  3   1812   3204      0 140340   0   0     0  7696 1059    50   0   2  98
 0  1  3   1812   3196      0 140336   0   0     4  3976 1061    58   0  20  80
 0  1  3   1812   3312      0 140208   0   0     0  7944 1065    25   0   4  96
 0  1  2   1812   3308      0 140208   0   0     0  3844 1065    32   0   1  99
 0  1  2   1812   3308      0 140208   0   0     0  2956 1056    43   0   3  97
 0  1  2   1812   3268      0 140248   0   0     4  5548 1059    64   0   5  94
 0  1  2   1812   3268      0 140252   0   0     0   236 1065    56   0   4  96
 0  1  2   1812   3268      0 140252   0   0     0     0 1058    42   0   1  99

(all of the above discussion was 2.5.31 stock with default writeout settings)

I've been trying these sorts of tests on this machine for over a year now,
with various disk subsystems, and I have *never* seen anything as nice and
consistent as the ext2 writeout you quoted. Maybe this machine is cursed.

--Adam


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 1/21] random fixes
  2002-08-13  4:10                 ` Adam Kropelin
@ 2002-08-13  5:25                   ` Andreas Dilger
  2002-08-13 12:37                     ` Adam Kropelin
  2002-08-13  5:32                   ` Andrew Morton
  1 sibling, 1 reply; 21+ messages in thread
From: Andreas Dilger @ 2002-08-13  5:25 UTC (permalink / raw)
  To: Adam Kropelin; +Cc: Andrew Morton, lkml, riel

On Aug 13, 2002  00:10 -0400, Adam Kropelin wrote:
> On Mon, Aug 12, 2002 at 08:03:34PM -0700, Andrew Morton wrote:
> > Are you _sure_ it was bad with ext2?
> 
> Yes.
> 
> [root@devbox adk0212] mount
> /dev/hda3 on / type ext2 (rw)
> /dev/hda1 on /boot type ext2 (rw)
> 
> Is it possible that the darn thing is mounted ext3 even though fstab and mount
> agree that it's ext2?

Yes, if you have a journal on your root filesystem, then it will be mounted
as ext3 regardless of what it says in /etc/fstab.  Since "mount" also
looks in /etc/fstab for writing the entry in /etc/mtab _after_ the root
filesystem is mounted, the output from "mount" can also be bogus.  You
need to check /proc/mounts to see the real answer.
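
A minimal sketch of that check, under a few assumptions: the helper name
real_fstype is hypothetical, the optional second argument exists only so
it can be pointed at a saved copy of the file, and it takes the last
matching entry as the one in effect.

```shell
# Hypothetical helper: print the filesystem type the kernel reports for a
# mount point. Fields in /proc/mounts: device mountpoint fstype options ...
# A later entry for the same mount point hides earlier ones, so the last
# match is the one reported.
real_fstype() {
    mountpoint=$1
    mounts_file=${2:-/proc/mounts}    # override only for testing
    awk -v mp="$mountpoint" '
        $2 == mp { fs = $3 }
        END { if (fs != "") print fs }' "$mounts_file"
}
```

Running "real_fstype /" and comparing the result with what "mount" prints
should show whether the two disagree.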

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 1/21] random fixes
  2002-08-13  4:10                 ` Adam Kropelin
  2002-08-13  5:25                   ` Andreas Dilger
@ 2002-08-13  5:32                   ` Andrew Morton
  2002-08-13 15:39                     ` Daniel Egger
  2002-08-14  0:01                     ` Adam Kropelin
  1 sibling, 2 replies; 21+ messages in thread
From: Andrew Morton @ 2002-08-13  5:32 UTC (permalink / raw)
  To: Adam Kropelin; +Cc: lkml, riel

Adam Kropelin wrote:
> 
> On Mon, Aug 12, 2002 at 08:03:34PM -0700, Andrew Morton wrote:
> > Adam Kropelin wrote:
> > > Actually, I'm running an FTP server on the testbed machine and pushing the
> > > data from a client on another (much faster) machine. I straced the server
> > > (redhat wu-ftpd2.6.1-20) and it looks like 8 KB reads/writes.
> > >
> >
> > OK, tried that against a slow disk (13 megs/sec write bandwidth).  2.5.31,
> > default writeback settings.
> >
> > ext3 is misbehaving:
> > and takes 86 seconds.
> >
> > When the server is writing to ext2, it is good:
> > and the transfer takes 54 seconds, which is wirespeed.
> >
> > Are you _sure_ it was bad with ext2?
> 
> Yes.
> 
> [root@devbox adk0212] mount
> /dev/hda3 on / type ext2 (rw)
> none on /proc type proc (rw)
> /dev/hda1 on /boot type ext2 (rw)
> none on /dev/pts type devpts (rw,gid=5,mode=620)
> none on /dev/shm type tmpfs (rw)
> 
>    procs                      memory    swap          io     system         cpu
>  r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
>  0  1  1    120   4360      0 141132   0   0     0  9804 6775   564   0  45  55
>  0  1  1    120   4344      0 141132   0   0     0     0 1083    20   0   0  99
>  0  0  0    120   4364      0 141116   0   0     0    40 2098   156   0  11  89
>  0  0  0    120   4384      0 141368   0   0     0     4 7013   594   0  52  47
>  0  0  0    120   4360      0 141416   0   0     0     0 6914   589   1  56  43
>  0  1  1    120   4464      0 140856   0   0     0 15420 6235   520   0  42  58
>  0  1  1    120   4456      0 140856   0   0     0  3240 1094    36   0   2  98
>  1  0  0    120   4428      0 140844   0   0     0    52 1151    70   0   4  96
>  1  0  0    120   4440      0 141356   0   0     0     4 6810   541   1  42  57
>  0  0  0    120   4464      0 141320   0   0     0     0 6894   553   1  40  58
>  0  1  1    120   4396      0 140840   0   0     0 15508 6018   466   0  40  59
>  0  1  1    120   4388      0 140840   0   0     0  1608 1093    57   0   2  98
>  0  0  0    120   4404      0 140832   0   0     0    52 2350   165   0  12  87
>  0  0  0    120   4460      0 141380   0   0     0     4 7040   564   1  42  57
>  1  0  0    120   4356      0 141372   0   0     0     4 7073   570   1  45  54
> ...

Sure looks like ext3.

> 
> Is it possible that the darn thing is mounted ext3 even though fstab and mount
> agree that it's ext2?

Yes.  Although it's usually the other way round. "How come it keeps running
fsck even though mount says ext3?".

Take a look in /proc/mounts.

> > How long does
> >
> >       dd if=/dev/zero of=foo bs=1M count=600 ; sync
> >
> > take against that disk?
> 
> 1m 23s  (I said it was a slow disk ;)

gack.  I've seen pencils which can write faster than that.

So your wirespeed actually exceeds the disk speed.  That changes things.
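
For reference, the arithmetic behind that remark, as a rough sketch (the
helper name throughput_mbs is invented here): 600 MB in 1m 23s works out
to roughly 7.2 MB/s, a little over half of the 13 megs/sec measured on
the disk used for the earlier test.

```shell
# Hypothetical helper: average throughput in MB/s for SIZE_MB written in
# SECONDS, printed with one decimal place.
throughput_mbs() {
    awk -v mb="$1" -v s="$2" 'BEGIN { printf "%.1f\n", mb / s }'
}

# Example: throughput_mbs 600 83    (the dd run above, 1m 23s = 83 s)
```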

The kernel *has* to stall the generator of dirty data.  We can make
the stalls shorter, and more frequent.  Go into drivers/block/ll_rw_blk.c
and see where it's initialising batch_requests.  Just change it to

	batch_requests = 1;

batch_requests needs to die anyhow...

And in fs/mpage.c, set RATELIMIT_PAGES to 16.

The application has to block, but the disk should certainly never
fall idle.  I'll play with this a bit.  IDE ceased to be an option
in 2.5.30, which does not aid this effort.
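
A quick way to check for an idle disk in a vmstat log, as a sketch only:
idle_samples is a made-up name, and the default column number assumes the
16-column vmstat layout quoted in this thread, where "bo" is field 11.

```shell
# Hypothetical helper: count vmstat samples (stdin) whose "bo" value is at
# or below LIMIT (default 0). COL defaults to 11, the position of "bo" in
# the layout quoted in this thread; other vmstat versions may differ.
idle_samples() {
    limit=${1:-0}
    col=${2:-11}
    awk -v limit="$limit" -v col="$col" '
        $1 ~ /^[0-9]+$/ { rows++; if ($col + 0 <= limit) idle++ }
        END { printf "%d of %d samples idle\n", idle, rows }'
}
```

Header lines are skipped because their first field is not a number. A
small non-zero limit (say 100) also catches the near-idle seconds.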

> I've been trying these sorts of tests on this machine for over a year now,
> with various disk subsystems, and I have *never* seen anything as nice and
> consistent as the ext2 writeout you quoted. Maybe this machine is cursed.
> 

Lumpy writeback is pretty common.  As is bad latency during writeout.
It's quite tricky to get these things balanced out, and it's easy to
fix one thing and break another.  Not a lot of effort has been put into
fine tuning 2.5 for smoothness and latency thus far.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 1/21] random fixes
  2002-08-13  5:25                   ` Andreas Dilger
@ 2002-08-13 12:37                     ` Adam Kropelin
  2002-08-13 17:21                       ` Andreas Dilger
  0 siblings, 1 reply; 21+ messages in thread
From: Adam Kropelin @ 2002-08-13 12:37 UTC (permalink / raw)
  To: Andrew Morton, lkml, riel; +Cc: Andreas Dilger

On Mon, Aug 12, 2002 at 11:25:59PM -0600, Andreas Dilger wrote:
> On Aug 13, 2002  00:10 -0400, Adam Kropelin wrote:
> > On Mon, Aug 12, 2002 at 08:03:34PM -0700, Andrew Morton wrote:
> > > Are you _sure_ it was bad with ext2?
> > 
> > Yes.
> > 
> > [root@devbox adk0212] mount
> > /dev/hda3 on / type ext2 (rw)
> > /dev/hda1 on /boot type ext2 (rw)
> > 
> > Is it possible that the darn thing is mounted ext3 even though fstab and mount
> > agree that it's ext2?
> 
> Yes, if you have a journal on your root filesystem, then it will be mounted
> as ext3 regardless of what it says in /etc/fstab.  Since "mount" also
> looks in /etc/fstab for writing the entry in /etc/mtab _after_ the root
> filesystem is mounted, the output from "mount" can also be bogus.  You
> need to check /proc/mounts to see the real answer.

Ahhh, carp.

It's still ext3, precisely as you describe.

*/me hangs head in shame*

When I get home tonight I'll reboot with a rescue disk and blow away the
journal. *That* should fix its little red wagon.

--Adam


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 1/21] random fixes
  2002-08-13  5:32                   ` Andrew Morton
@ 2002-08-13 15:39                     ` Daniel Egger
  2002-08-14  0:01                     ` Adam Kropelin
  1 sibling, 0 replies; 21+ messages in thread
From: Daniel Egger @ 2002-08-13 15:39 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml

On Tue, 2002-08-13 at 07:32, Andrew Morton wrote:

> > 1m 23s  (I said it was a slow disk ;) 
> gack.  I've seen pencils which can write faster than that.

Interesting. Even up-to-date notebooks are not much faster on an ext3 fs:

egger@sonja:/localstuff/temp$ time dd if=/dev/zero of=foo bs=1M count=600 ; sync
600+0 records in
600+0 records out

real    0m58.375s
user    0m0.010s
sys     0m4.930s

> So your wirespeed actually exceeds the disk speed.  That changes things.

That is easy to achieve, especially now that mainstream machines ship
with GigE.
 
-- 
Servus,
       Daniel


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 1/21] random fixes
  2002-08-13 12:37                     ` Adam Kropelin
@ 2002-08-13 17:21                       ` Andreas Dilger
  0 siblings, 0 replies; 21+ messages in thread
From: Andreas Dilger @ 2002-08-13 17:21 UTC (permalink / raw)
  To: Adam Kropelin; +Cc: Andrew Morton, lkml, riel

On Aug 13, 2002  08:37 -0400, Adam Kropelin wrote:
> On Mon, Aug 12, 2002 at 11:25:59PM -0600, Andreas Dilger wrote:
> > Yes, if you have a journal on your root filesystem, then it will be mounted
> > as ext3 regardless of what it says in /etc/fstab.  Since "mount" also
> > looks in /etc/fstab for writing the entry in /etc/mtab _after_ the root
> > filesystem is mounted, the output from "mount" can also be bogus.  You
> > need to check /proc/mounts to see the real answer.
> 
> Ahhh, carp.
> 
> It's still ext3, precisely as you describe.
> 
> */me hangs head in shame*
> 
> When I get home tonight I'll reboot with a rescue disk and blow away the
> journal. *That* should fix its little red wagon.

Or, you can optionally use the "rootfstype=ext2" kernel option, to avoid
the need to remove and then re-create the journal.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 1/21] random fixes
  2002-08-13  5:32                   ` Andrew Morton
  2002-08-13 15:39                     ` Daniel Egger
@ 2002-08-14  0:01                     ` Adam Kropelin
  1 sibling, 0 replies; 21+ messages in thread
From: Adam Kropelin @ 2002-08-14  0:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml, riel

On Mon, Aug 12, 2002 at 10:32:11PM -0700, Andrew Morton wrote:
> Adam Kropelin wrote:
> > 
> > On Mon, Aug 12, 2002 at 08:03:34PM -0700, Andrew Morton wrote:
> > > OK, tried that against a slow disk (13 megs/sec write bandwidth).  2.5.31,
> > > default writeback settings.
> > >
> > > ext3 is misbehaving:
> > > and takes 86 seconds.
> > >
> > > When the server is writing to ext2, it is good:
> > > and the transfer takes 54 seconds, which is wirespeed.
> > >
> > > Are you _sure_ it was bad with ext2?
> > 
> > Yes.

...but I was wrong.

> Sure looks like ext3.

..it was.

*Actually* switching to ext2 (rather than just pretending) made a
tremendous difference. New numbers:

2.5.31-stock: 1m 49s
2.5.31-akpm: 1m 50s
2.4.19-stock: 1m 34s

...but, applying the writeout threshold settings you suggested:

2.5.31-stock: 1m 34s
2.5.31-akpm: 1m 34s

(That's with dirty_background at 30%; 10% turned in the same numbers
as 30% did.)
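
For anyone repeating the runs, the threshold settings can be wrapped up
as below. This is only a sketch: set_24_writeback is a made-up name, the
directory argument is there so it can be tried against a scratch
directory, and the file names are the ones quoted earlier in the thread
for 2.5.31.

```shell
# Hypothetical helper: write the 2.4-style writeback thresholds quoted
# earlier in the thread. DIR defaults to /proc/sys/vm (needs root there).
set_24_writeback() {
    dir=${1:-/proc/sys/vm}
    echo 30 > "$dir/dirty_background_ratio" &&
    echo 60 > "$dir/dirty_async_ratio" &&
    echo 70 > "$dir/dirty_sync_ratio"
}
```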

Presumably with the disk as the bottleneck, the -akpm changes aren't
expected to do much. At least they're not degrading anything.

> So your wirespeed actually exceeds the disk speed.  That changes things.
...
> 
> 	batch_requests = 1;
> And in fs/mpage.c, set RATELIMIT_PAGES to 16.

These changes didn't have as much effect as the threshold tweaks:

2.5.31-stock: 1m 39s

..unless I added in the threshold tweaks as well, in which case:

2.5.31-stock: 1m 34s

...which is the same as the threshold tweaks alone.

> The application has to block, but the disk should certainly never
> fall idle.  I'll play with this a bit.  IDE ceased to be an option
> in 2.5.30, which does not aid this effort.

With ext2 and the threshold tweaks it never becomes idle. That is clearly
an ext3 issue now.

> fix one thing and break another.  Not a lot of effort has been put into
> fine tuning 2.5 for smoothness and latency thus far.

Understandably. I think it says a lot already that an untuned development
kernel can match the current release kernel. I'm sure once 2.5 gets into
the tweak 'n tune cycle we'll see it beating 2.4 hands down.

Actually 2.5 writeout to ext2 is far smoother than 2.4 already:

2.4.19:
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  0  2      0   4400   1788 140520   0   0     0  7776 7434   892   2  47  51
 1  0  2      0   4408   1796 140492   0   0     0  7868 7315   873   0  50  50
 1  0  3      0   4428   1804 140484   0   0     0 10496 7327   877   3  56  41
 1  0  2      0   4372   1812 140516   0   0     0  8132 7239   872   0  53  47
 1  0  0      0   4408   1816 140460   0   0     4  5876 2415   255   0  17  83
 1  0  0      0   4376   1824 140528   0   0     0     0 7555   894   1  42  56
 0  0  2      0   4376   1832 140512   0   0     0  4096 7589   858   1  52  47
 1  0  1      0   4416   1840 140464   0   0     0  8052 7229   879   1  51  47
 0  0  1      0   4380   1848 140496   0   0     0 10180 7183   863   1  49  50
 1  0  1      0   4348   1856 140500   0   0     0  8080 7240   852   1  49  50
 1  0  1      0   4464   1864 140408   0   0     0  4504 7309   886   1  47  51
 0  0  1      0   4444   1872 140400   0   0     0  7284 7459   873   1  51  48
 0  0  3      0   4380   1880 140440   0   0     0 10184 7428   895   1  50  49
 1  0  1      0   4428   1888 140400   0   0     0  8092 7308   867   0  52  48

2.5.31:
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  0      0   7404      0 137796   0   0     0  4108 6933  1176   1  43  56
 1  0  0      0   4384      0 141048   0   0     0  8216 6918  1293   1  42  57
 0  0  0    104   4392      0 141472   0 104     0  4212 6909  1211   1  53  45
 0  0  1    120   4440      0 141488   0  16     0  8232 6860  1233   1  61  38
 1  0  1    120   4352      0 141628   0   0     0  4108 6810  1137   2  38  60
 0  0  0    120   4468      0 141508   0   0     0  8216 6848  1114   0  40  59
 1  0  0    120   4352      0 141608   0   0     0  4108 6817  1091   1  39  60
 0  0  1    120   4464      0 141528   0   0     0  8216 6846  1090   1  39  60
 0  0  0    120   4412      0 141568   0   0     0  4108 6836  1056   1  39  60
 0  0  1    120   4388      0 141588   0   0     0  8216 6863  1088   1  41  58
 1  0  0    120   4392      0 141608   0   0     0  4108 6899  1162   1  41  58
 0  0  0    120   4428      0 141572   0   0     0  8216 6917  1085   2  40  58
 0  0  0    120   4416      0 141592   0   0     0  4208 6887  1097   1  40  59

The oscillation between 8 MB/s and 4 MB/s is a little odd, but it's very
consistent and averages out to about 6 MB/s, which is exactly what the FTP
session is doing.
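
The average can be checked mechanically. A minimal sketch, assuming the
16-column vmstat layout shown above (where "bo" is field 11) and reading
bo as KB/s; avg_bo is a made-up helper name. The thirteen 2.5.31 samples
above come to about 6020.

```shell
# Hypothetical helper: mean of the "bo" column over vmstat samples on stdin.
# COL defaults to 11, matching the layout quoted in this thread; header
# lines are skipped because their first field is not a number.
avg_bo() {
    col=${1:-11}
    awk -v col="$col" '
        $1 ~ /^[0-9]+$/ { n++; sum += $col }
        END { if (n) printf "%.0f\n", sum / n }'
}
```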

Thanks for your insight and patience. I'm always excited to see another batch
of akpm patches show up on the list. If I can run other tests to help you, let
me know.

--Adam


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 1/21] random fixes
  2002-08-11  7:38 [patch 1/21] random fixes Andrew Morton
  2002-08-11  7:56 ` Alexander Viro
  2002-08-11 14:29 ` Adam Kropelin
@ 2002-08-14  8:35 ` William Lee Irwin III
  2 siblings, 0 replies; 21+ messages in thread
From: William Lee Irwin III @ 2002-08-14  8:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linus Torvalds, lkml

On Sun, Aug 11, 2002 at 12:38:19AM -0700, Andrew Morton wrote:
> Sorry, but there's a ton of stuff here.  It ends up as a 4600 line
> diff.  Some code dating back to 2.5.24.  It's almost all performance
> work and it has been very painful getting its effectiveness tested
> on the big machines; the main problem has been getting them booting
> 2.5 at all.  The results still are not as conclusive as I'd like,
> but the signs are good, and there are no other proposals around to
> fix these problems.

dbench 256 on a 16x/16G numaq:

Throughput 50.7526 MB/sec (NB=63.4408 MB/sec  507.526 MBit/sec)  256 procs


c013bf74 13251607 72.928      .text.lock.highmem
c013b7d0 1606972  8.84371     kunmap_high
c013b5dc 1211097  6.66507     kmap_high
c012f260 459420   2.52834     generic_file_write
c0114820 166854   0.918253    scheduler_tick
c012e53c 166773   0.917808    file_read_actor
c0105394 125561   0.691004    default_idle
c013bcbc 75623    0.416179    blk_queue_bounce
c013564c 72289    0.397831    rmqueue
c01113b8 69062    0.380071    smp_apic_timer_interrupt
c0143cec 64782    0.356517    block_prepare_write
c014330c 53426    0.294021    __block_prepare_write
c0142ee8 39892    0.219539    create_empty_buffers
c012dec0 39161    0.215516    unlock_page
c01143d8 38648    0.212693    load_balance
c013b558 34840    0.191736    flush_all_zero_pkmaps
c0135d28 33414    0.183888    page_cache_release
c013429c 22753    0.125217    lru_cache_add
c0135b10 20326    0.111861    __alloc_pages
c0143d98 19833    0.109148    generic_commit_write
c012dcb4 18150    0.0998855   add_to_page_cache
c0140044 17758    0.0977282   vfs_write

^ permalink raw reply	[flat|nested] 21+ messages in thread

Thread overview: 21+ messages
2002-08-11  7:38 [patch 1/21] random fixes Andrew Morton
2002-08-11  7:56 ` Alexander Viro
2002-08-11 14:29 ` Adam Kropelin
2002-08-11 18:09   ` Andrew Morton
2002-08-12  0:27     ` Adam Kropelin
2002-08-12  0:41       ` Rik van Riel
2002-08-12  4:58       ` Andrew Morton
2002-08-13  0:26         ` Adam Kropelin
2002-08-13  0:49           ` Andrew Morton
2002-08-13  2:25             ` Adam Kropelin
2002-08-13  3:03               ` Andrew Morton
2002-08-13  4:10                 ` Adam Kropelin
2002-08-13  5:25                   ` Andreas Dilger
2002-08-13 12:37                     ` Adam Kropelin
2002-08-13 17:21                       ` Andreas Dilger
2002-08-13  5:32                   ` Andrew Morton
2002-08-13 15:39                     ` Daniel Egger
2002-08-14  0:01                     ` Adam Kropelin
2002-08-12  2:54     ` Adam Kropelin
2002-08-12  3:40       ` Andrew Morton
2002-08-14  8:35 ` William Lee Irwin III
