* 2.4.6 and ext3-2.4-0.9.1-246
@ 2001-07-10 9:47 Mike Black
2001-07-10 17:52 ` Andreas Dilger
2001-07-11 4:08 ` Andrew Morton
0 siblings, 2 replies; 23+ messages in thread
From: Mike Black @ 2001-07-10 9:47 UTC (permalink / raw)
To: linux-kernel@vger.kernel.org
I started testing 2.4.6 with ext3-2.4-0.9.1-246 yesterday morning and
immediately hit a wall.
Testing on an SMP kernel -- dual IDE RAID1 set -- the system temporarily
locked up (the telnet window stops until disk I/O is complete).
I'm using tiobench-0.3.2 and do have unmaskirq turned on, so it
shouldn't be IRQ contention.
/dev/hda:
multcount = 0 (off)
I/O support = 1 (32-bit)
unmaskirq = 1 (on)
using_dma = 1 (on)
keepsettings = 0 (off)
nowerr = 0 (off)
readonly = 0 (off)
readahead = 8 (on)
/dev/hdc:
multcount = 0 (off)
I/O support = 1 (32-bit)
unmaskirq = 1 (on)
using_dma = 1 (on)
keepsettings = 0 (off)
nowerr = 0 (off)
readonly = 0 (off)
readahead = 8 (on)
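For reference, settings like the ones listed above are read and set with hdparm; a minimal sketch, using /dev/hda as in the listing (changing the values requires root, so those lines are shown only as comments):

```shell
# Query current IDE settings; fall back to a message if the tool or
# the device is absent on this machine:
hdparm /dev/hda 2>/dev/null || echo "hdparm or /dev/hda not available here"

# Applying the values shown above would be (root required):
#   hdparm -c1 -d1 -u1 -a8 /dev/hda   # 32-bit I/O, DMA on, unmask IRQ, readahead 8
```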
Investigating this a bit, I noticed that kswapd was taking a LOT of CPU time
(although there was only 10Meg in swap). The swap files are located on the
RAID1 IDE set.
So...I moved the swapfiles to my SCSI subsystem (also EXT3 at this point)
and tested again.
Smoother, although there was still quite a bit of jerkiness in the telnet
window.
So... swap on IDE/RAID1/EXT3 was a bad idea -- I'd say things were 80% better
once swap was moved off the IDE system to SCSI.
Here's my RAID1/IDE benchmark with EXT3
...oops, spoke too soon.
The tiobench.pl run locked up at 8 threads (after doing 1, 2, and 4). I had
to do an ALT-SYSRQ-B, as all windows were dead, although I could still get a
login prompt.
It really looks like tiobench is a good stress tester for ext3.
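As an aside, the ALT-SYSRQ-B above is the magic-SysRq "reboot immediately" key; whether SysRq is honored at all is visible through the standard /proc interface. A read-only sketch (the trigger line really reboots the machine, so it stays commented out):

```shell
# Non-zero means magic SysRq is (at least partly) enabled:
cat /proc/sys/kernel/sysrq 2>/dev/null || echo "no sysrq interface here"

# The same 'b' (boot/reboot) action from a root shell -- do NOT run casually:
#   echo b > /proc/sysrq-trigger
```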
________________________________________
Michael D. Black Principal Engineer
mblack@csihq.com 321-676-2923,x203
http://www.csihq.com Computer Science Innovations
http://www.csihq.com/~mike My home page
FAX 321-676-2355
* Re: 2.4.6 and ext3-2.4-0.9.1-246
2001-07-10 9:47 2.4.6 and ext3-2.4-0.9.1-246 Mike Black
@ 2001-07-10 17:52 ` Andreas Dilger
[not found] ` <018101c1096a$17e2afc0$b6562341@cfl.rr.com>
2001-07-11 4:08 ` Andrew Morton
1 sibling, 1 reply; 23+ messages in thread
From: Andreas Dilger @ 2001-07-10 17:52 UTC (permalink / raw)
To: Mike Black; +Cc: linux-kernel@vger.kernel.org, Ext2 development mailing list
Mike Black writes:
> I started testing 2.4.6 with ext3-2.4-0.9.1-246 yesterday morning and
> immediately hit a wall.
>
> Testing on an SMP kernel -- dual IDE RAID1 set the system temporarily
> locked up (telnet window stops until disk I/O is complete).
> Investigating this some I noticed that kswapd was taking a LOT of CPU time
> (although there was only 10Meg in swap). The swap files are located on the
> RAID1 IDE set.
Are you saying you have swap _files_, or is that a typo? Not that this
is illegal or anything, but it sure is a waste of CPU/disk performance. If
you are swapping to a file on a journaled filesystem, you have a huge amount
of unnecessary overhead. Better to have a swap partition and avoid the fs
altogether.
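A minimal sketch of that suggestion (the device and file names here are hypothetical examples; mkswap/swapon on real devices require root, so those lines are comments):

```shell
# Base swap on a dedicated partition bypasses the filesystem entirely:
#   mkswap /dev/hdb2 && swapon /dev/hdb2      # root; hypothetical partition

# "Extra" swap can still live in a file, created like this:
dd if=/dev/zero of=/tmp/extra.swap bs=1M count=16 2>/dev/null
chmod 600 /tmp/extra.swap
ls -l /tmp/extra.swap
#   mkswap /tmp/extra.swap && swapon /tmp/extra.swap   # root, to enable it
```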
It is also possible that there are still problems with the core kernel swap
code, and they are just more noticeable when swapping on ext3. What form of
journaling are you using? Ordered, writeback, or full data journaling?
> Here's my RAID1/IDE benchmark with EXT3
> ...oops, spoke too soon.
> The tiobench.pl locked up on 8 threads (after doing 1, 2, & 4). I had to do
> an ALT-SYSRQ-B, as all windows were dead, although I could get a login prompt.
I've CC'd this to ext2-devel, where the core ext3 developers are more likely
to see it.
Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert
* Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246
[not found] ` <018101c1096a$17e2afc0$b6562341@cfl.rr.com>
@ 2001-07-10 18:17 ` Stephen C. Tweedie
2001-07-10 18:27 ` Mike Black
2001-07-10 18:51 ` Andreas Dilger
1 sibling, 1 reply; 23+ messages in thread
From: Stephen C. Tweedie @ 2001-07-10 18:17 UTC (permalink / raw)
To: Mike Black
Cc: Andreas Dilger, linux-kernel@vger.kernel.org,
Ext2 development mailing list
Hi,
On Tue, Jul 10, 2001 at 01:59:40PM -0400, Mike Black wrote:
> Yep -- I said __files__ -- I'm less concerned about performance than
> reliability -- I don't think you can RAID1 a swap partition can you?
You can on 2.4. 2.2 would let you do it but it was unsafe --- swap
could interact badly with raid reconstruction. 2.4 should be OK.
> Also,
> having it in files allows me to easily add more swap as needed.
> As far as journalling mode goes, I just used tune2fs to put a journal on with
> default parameters, so I assume that's full journaling.
The swap code bypasses filesystem writes: all it does is ask the
filesystem where on disk the data resides, then it performs I/O
straight to those disk blocks. The data journaling mode doesn't
really matter there.
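The lookup described above is essentially bmap: ask the filesystem which disk blocks back the file, then submit I/O to them directly. Userspace can inspect the same kind of mapping with filefrag from e2fsprogs; a sketch (the tool, or the mapping support it needs, may be absent, hence the fallback):

```shell
# Create a small file and ask the filesystem for its block layout:
dd if=/dev/zero of=/tmp/mapdemo bs=4k count=8 2>/dev/null
filefrag -v /tmp/mapdemo 2>/dev/null || echo "filefrag unavailable or fs unsupported"
```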
Cheers,
Stephen
* Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246
2001-07-10 18:17 ` [Ext2-devel] " Stephen C. Tweedie
@ 2001-07-10 18:27 ` Mike Black
2001-07-10 18:29 ` Stephen C. Tweedie
0 siblings, 1 reply; 23+ messages in thread
From: Mike Black @ 2001-07-10 18:27 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Andreas Dilger, linux-kernel@vger.kernel.org,
Ext2 development mailing list
So it sounds like there's no advantage, then, to a swap partition vs. a file?
----- Original Message -----
From: "Stephen C. Tweedie" <sct@redhat.com>
To: "Mike Black" <mblack@csihq.com>
Cc: "Andreas Dilger" <adilger@turbolinux.com>; "linux-kernel@vger.kernel.or"
<linux-kernel@vger.kernel.org>; "Ext2 development mailing list"
<ext2-devel@lists.sourceforge.net>
Sent: Tuesday, July 10, 2001 2:17 PM
Subject: Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246
> Hi,
>
> On Tue, Jul 10, 2001 at 01:59:40PM -0400, Mike Black wrote:
> > Yep -- I said __files__ -- I'm less concerned about performance than
> > reliability -- I don't think you can RAID1 a swap partition can you?
>
> You can on 2.4. 2.2 would let you do it but it was unsafe --- swap
> could interact badly with raid reconstruction. 2.4 should be OK.
>
> > Also,
> > having it in files allows me to easily add more swap as needed.
> > As far as journalling mode I just used tune2fs to put a journal on with
> > default parameters so I assume that's full journaling.
>
> The swap code bypasses filesystem writes: all it does is to ask the
> filesystem where on disk the data resides, then it performs IO
> straight to those disk blocks. The data journaling mode doesn't
> really matter there.
>
> Cheers,
> Stephen
* Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246
2001-07-10 18:27 ` Mike Black
@ 2001-07-10 18:29 ` Stephen C. Tweedie
0 siblings, 0 replies; 23+ messages in thread
From: Stephen C. Tweedie @ 2001-07-10 18:29 UTC (permalink / raw)
To: Mike Black
Cc: Stephen C. Tweedie, Andreas Dilger, linux-kernel@vger.kernel.org,
Ext2 development mailing list
Hi,
On Tue, Jul 10, 2001 at 02:27:00PM -0400, Mike Black wrote:
> So it sounds like there's no advantage, then, to a swap partition vs. a file?
There are --- the cost of accessing the metadata to do the file
block-number lookup, and the fragmentation you can get in files, both add
to the cost of swap files compared to partitions.
Cheers,
Stephen
* Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246
[not found] ` <018101c1096a$17e2afc0$b6562341@cfl.rr.com>
2001-07-10 18:17 ` [Ext2-devel] " Stephen C. Tweedie
@ 2001-07-10 18:51 ` Andreas Dilger
1 sibling, 0 replies; 23+ messages in thread
From: Andreas Dilger @ 2001-07-10 18:51 UTC (permalink / raw)
To: Mike Black
Cc: Andreas Dilger, linux-kernel@vger.kernel.org,
Ext2 development mailing list
Mike Black writes:
> Also, having it in files allows me to easily add more swap as needed.
But you usually have a base amount of swap, and only add more files as
needed, right? Put the base swap into its own partition, and then only
put "extra" swap into files.
> As far as journalling mode goes, I just used tune2fs to put a journal on with
> default parameters, so I assume that's full journaling.
The default is metadata-only journaling mode, called ordered data mode.
This gives the best performance in many cases, and also prevents you
from getting garbage in new/modified files after a crash. It does so
by forcing data writes to the disk before the metadata changes in the
journal are committed to the filesystem. This is also good for performance
because you separate writes to the journal from writes to the fs to avoid
excess seeking.
There is full data journaling, which roughly halves I/O throughput, because
it writes everything once to the journal and then again to the filesystem.
There is also writeback journaling mode, which writes all data directly
to the disk whenever it can, independent of metadata writes to the journal
and filesystem. This allows the possibility of bad data in files if a
change was added to the journal/fs but the data write did not make it to
disk. This is the only mode that reiserfs can use right now, but Chris
Mason is working on getting ordered data mode to work as well.
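On the ext3 that was eventually merged, these three modes are selected with the data= mount option; an illustrative /etc/fstab fragment (the device and mount point are hypothetical, and the exact syntax for this early patch may differ):

```
# Pick exactly one of these lines for a given filesystem:
/dev/md0  /data  ext3  data=ordered    0 2   # default: metadata journaled, data flushed first
/dev/md0  /data  ext3  data=writeback  0 2   # fastest; stale data possible after a crash
/dev/md0  /data  ext3  data=journal    0 2   # everything written twice (journal, then fs)
```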
Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert
* Re: 2.4.6 and ext3-2.4-0.9.1-246
2001-07-10 9:47 2.4.6 and ext3-2.4-0.9.1-246 Mike Black
2001-07-10 17:52 ` Andreas Dilger
@ 2001-07-11 4:08 ` Andrew Morton
2001-07-11 12:16 ` Mike Black
1 sibling, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2001-07-11 4:08 UTC (permalink / raw)
To: Mike Black; +Cc: linux-kernel@vger.kernel.org, ext2-devel
Mike Black wrote:
>
> I started testing 2.4.6 with ext3-2.4-0.9.1-246 yesterday morning and
> immediately hit a wall.
>
> Testing on an SMP kernel -- dual IDE RAID1 set the system temporarily
> locked up (telnet window stops until disk I/O is complete).
Mike, we're going to need a lot more detail to reproduce this.
Let me describe how I didn't reproduce it and perhaps
you can point out any differences:
- Kernel 2.4.6+ext3-2.4-0.9.1.
- Two 4gig IDE partitions on separate disks combined into a
RAID1 device.
- 64 megs of memory (32meg lowmem, 32meg highmem)
- 1 gig swapfile on the ext3 raid device.
- Ran ./tiobench.pl --threads 16
That's a *lot* more aggressive than your setup, yet
it ran to completion quite happily.
I'd be particularly interested in knowing how much memory
you're using. It certainly sounds like you're experiencing
memory exhaustion. ext3's ability to recover from out-of-memory
situations was weakened recently so as to reduce our impact
on core kernel code. I'll be generating an incremental patch
which puts that code back in.
In the meantime, could you please retest with this somewhat lame
alternative?
--- linux-2.4.6/mm/vmscan.c Wed Jul 4 18:21:32 2001
+++ lk-ext3/mm/vmscan.c Wed Jul 11 14:03:10 2001
@@ -852,6 +870,9 @@ static int do_try_to_free_pages(unsigned
* list, so this is a relatively cheap operation.
*/
if (free_shortage()) {
+ extern void shrink_journal_memory(void);
+
+ shrink_journal_memory();
ret += page_launder(gfp_mask, user);
shrink_dcache_memory(DEF_PRIORITY, gfp_mask);
shrink_icache_memory(DEF_PRIORITY, gfp_mask);
* Re: 2.4.6 and ext3-2.4-0.9.1-246
2001-07-11 4:08 ` Andrew Morton
@ 2001-07-11 12:16 ` Mike Black
2001-07-11 15:36 ` Andrew Morton
0 siblings, 1 reply; 23+ messages in thread
From: Mike Black @ 2001-07-11 12:16 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel@vger.kernel.org, ext2-devel
My system is:
Dual 1Ghz PIII
2G RAM
2x2G swapfiles
And I ran tiobench as tiobench.pl --size 4000 (twice memory).
Methinks SMP is probably the biggest difference in this list.
I ran this on another "smaller" memory (still dual CPU though) machine and
noticed this on top:
12983 root 15 0 548 544 448 D 73.6 0.2 0:11 tiotest
3 root 18 0 0 0 0 SW 72.6 0.0 0:52 kswapd
kswapd is taking an awful lot of CPU time. I'm not sure why it should be
hitting swap at all.
I noticed similar behavior even with NO swap -- kswapd was still chewing up
time.
________________________________________
Michael D. Black Principal Engineer
mblack@csihq.com 321-676-2923,x203
http://www.csihq.com Computer Science Innovations
http://www.csihq.com/~mike My home page
FAX 321-676-2355
----- Original Message -----
From: "Andrew Morton" <andrewm@uow.edu.au>
To: "Mike Black" <mblack@csihq.com>
Cc: "linux-kernel@vger.kernel.or" <linux-kernel@vger.kernel.org>;
<ext2-devel@lists.sourceforge.net>
Sent: Wednesday, July 11, 2001 12:08 AM
Subject: Re: 2.4.6 and ext3-2.4-0.9.1-246
Mike Black wrote:
>
> I started testing 2.4.6 with ext3-2.4-0.9.1-246 yesterday morning and
> immediately hit a wall.
>
> Testing on an SMP kernel -- dual IDE RAID1 set the system temporarily
> locked up (telnet window stops until disk I/O is complete).
Mike, we're going to need a lot more detail to reproduce this.
Let me describe how I didn't reproduce it and perhaps
you can point out any differences:
- Kernel 2.4.6+ext3-2.4-0.9.1.
- Two 4gig IDE partitions on separate disks combined into a
RAID1 device.
- 64 megs of memory (32meg lowmem, 32meg highmem)
- 1 gig swapfile on the ext3 raid device.
- Ran ./tiobench.pl --threads 16
That's a *lot* more aggressive than your setup, yet
it ran to completion quite happily.
I'd be particularly interested in knowing how much memory
you're using. It certainly sounds like you're experiencing
memory exhaustion. ext3's ability to recover from out-of-memory
situations was weakened recently so as to reduce our impact
on core kernel code. I'll be generating an incremental patch
which puts that code back in.
In the meantime, could you please retest with this somewhat lame
alternative?
--- linux-2.4.6/mm/vmscan.c Wed Jul 4 18:21:32 2001
+++ lk-ext3/mm/vmscan.c Wed Jul 11 14:03:10 2001
@@ -852,6 +870,9 @@ static int do_try_to_free_pages(unsigned
* list, so this is a relatively cheap operation.
*/
if (free_shortage()) {
+ extern void shrink_journal_memory(void);
+
+ shrink_journal_memory();
ret += page_launder(gfp_mask, user);
shrink_dcache_memory(DEF_PRIORITY, gfp_mask);
shrink_icache_memory(DEF_PRIORITY, gfp_mask);
* Re: 2.4.6 and ext3-2.4-0.9.1-246
2001-07-11 12:16 ` Mike Black
@ 2001-07-11 15:36 ` Andrew Morton
2001-07-12 10:54 ` Mike Black
0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2001-07-11 15:36 UTC (permalink / raw)
To: Mike Black; +Cc: linux-kernel@vger.kernel.org, ext2-devel
Mike Black wrote:
>
> My system is:
> Dual 1Ghz PIII
> 2G RAM
> 2x2G swapfiles
> And I ran tiobench as tiobench.pl --size 4000 (twice memory)
>
> Methinks SMP is probably the biggest difference in this list.
No, the problem is in RAID1. The buffer allocation in there is nowhere
near strong enough for these loads.
> I ran this on another "smaller" memory (still dual CPU though) machine and
> noticed this on top:
>
> 12983 root 15 0 548 544 448 D 73.6 0.2 0:11 tiotest
> 3 root 18 0 0 0 0 SW 72.6 0.0 0:52 kswapd
>
> kswapd is taking an awful lot of CPU time. Not sure why it should be
> hitting swap at all.
It's not trying to swap stuff out - it's trying to find pages
to recycle. kswapd often goes berserk like this. I think it
was a design objective.
For me, RAID1 works OK with tiobench, but it is trivially deadlockable
with other workloads. The usual failure mode is for bdflush to be
stuck in raid1_alloc_r1bh() - can't allocate any more r1bh's, can't
move dirty buffers to disk. Dead.
The below patch increases the size of the reserved r1bh pool, scales it
by PAGE_CACHE_SIZE and introduces a reservation policy for PF_FLUSH
callers (ie: bdflush). That fixes the raid1_alloc_r1bh() deadlocks.
bdflush can also deadlock in raid1_alloc_bh(), trying to allocate
buffer_heads. So we do the same thing there.
Putting swap on RAID1 would definitely have exacerbated the problem.
The last thing we want to do when we're trying to push stuff out
of memory is to have to allocate more of it. So I allowed PF_MEMALLOC
tasks to bite into the reserves as well.
Please, if you have time, apply and retest.
--- linux-2.4.6/include/linux/sched.h Wed May 2 22:00:07 2001
+++ lk-ext3/include/linux/sched.h Thu Jul 12 01:03:20 2001
@@ -413,7 +418,7 @@ struct task_struct {
#define PF_SIGNALED 0x00000400 /* killed by a signal */
#define PF_MEMALLOC 0x00000800 /* Allocating memory */
#define PF_VFORK 0x00001000 /* Wake up parent in mm_release */
-
+#define PF_FLUSH 0x00002000 /* Flushes buffers to disk */
#define PF_USEDFPU 0x00100000 /* task used FPU this quantum (SMP) */
/*
--- linux-2.4.6/include/linux/raid/raid1.h Tue Dec 12 08:20:08 2000
+++ lk-ext3/include/linux/raid/raid1.h Thu Jul 12 01:15:39 2001
@@ -37,12 +37,12 @@ struct raid1_private_data {
/* buffer pool */
/* buffer_heads that we have pre-allocated have b_pprev -> &freebh
* and are linked into a stack using b_next
- * raid1_bh that are pre-allocated have R1BH_PreAlloc set.
* All these variable are protected by device_lock
*/
struct buffer_head *freebh;
int freebh_cnt; /* how many are on the list */
struct raid1_bh *freer1;
+ unsigned freer1_cnt;
struct raid1_bh *freebuf; /* each bh_req has a page allocated */
md_wait_queue_head_t wait_buffer;
@@ -87,5 +87,4 @@ struct raid1_bh {
/* bits for raid1_bh.state */
#define R1BH_Uptodate 1
#define R1BH_SyncPhase 2
-#define R1BH_PreAlloc 3 /* this was pre-allocated, add to free list */
#endif
--- linux-2.4.6/fs/buffer.c Wed Jul 4 18:21:31 2001
+++ lk-ext3/fs/buffer.c Thu Jul 12 01:03:57 2001
@@ -2685,6 +2748,7 @@ int bdflush(void *sem)
sigfillset(&tsk->blocked);
recalc_sigpending(tsk);
spin_unlock_irq(&tsk->sigmask_lock);
+ current->flags |= PF_FLUSH;
up((struct semaphore *)sem);
@@ -2726,6 +2790,7 @@ int kupdate(void *sem)
siginitsetinv(&current->blocked, sigmask(SIGCONT) | sigmask(SIGSTOP));
recalc_sigpending(tsk);
spin_unlock_irq(&tsk->sigmask_lock);
+ current->flags |= PF_FLUSH;
up((struct semaphore *)sem);
--- linux-2.4.6/drivers/md/raid1.c Wed Jul 4 18:21:26 2001
+++ lk-ext3/drivers/md/raid1.c Thu Jul 12 01:28:58 2001
@@ -51,6 +51,28 @@ static mdk_personality_t raid1_personali
static md_spinlock_t retry_list_lock = MD_SPIN_LOCK_UNLOCKED;
struct raid1_bh *raid1_retry_list = NULL, **raid1_retry_tail;
+/*
+ * We need to scale the number of reserved buffers by the page size
+ * to make writepage()s successful. --akpm
+ */
+#define R1_BLOCKS_PP (PAGE_CACHE_SIZE / 1024)
+#define FREER1_MEMALLOC_RESERVED (16 * R1_BLOCKS_PP)
+
+/*
+ * Return true if the caller may take a bh from the list.
+ * PF_FLUSH and PF_MEMALLOC tasks are allowed to use the reserves, because
+ * they're trying to *free* some memory.
+ *
+ * Requires that conf->device_lock be held.
+ */
+static int may_take_bh(raid1_conf_t *conf, int cnt)
+{
+ int min_free = (current->flags & (PF_FLUSH|PF_MEMALLOC)) ?
+ cnt :
+ (cnt + FREER1_MEMALLOC_RESERVED * conf->raid_disks);
+ return conf->freebh_cnt >= min_free;
+}
+
static struct buffer_head *raid1_alloc_bh(raid1_conf_t *conf, int cnt)
{
/* return a linked list of "cnt" struct buffer_heads.
@@ -62,7 +84,7 @@ static struct buffer_head *raid1_alloc_b
while(cnt) {
struct buffer_head *t;
md_spin_lock_irq(&conf->device_lock);
- if (conf->freebh_cnt >= cnt)
+ if (may_take_bh(conf, cnt))
while (cnt) {
t = conf->freebh;
conf->freebh = t->b_next;
@@ -83,7 +105,7 @@ static struct buffer_head *raid1_alloc_b
cnt--;
} else {
PRINTK("raid1: waiting for %d bh\n", cnt);
- wait_event(conf->wait_buffer, conf->freebh_cnt >= cnt);
+ wait_event(conf->wait_buffer, may_take_bh(conf, cnt));
}
}
return bh;
@@ -96,9 +118,9 @@ static inline void raid1_free_bh(raid1_c
while (bh) {
struct buffer_head *t = bh;
bh=bh->b_next;
- if (t->b_pprev == NULL)
+ if (conf->freebh_cnt >= FREER1_MEMALLOC_RESERVED) {
kfree(t);
- else {
+ } else {
t->b_next= conf->freebh;
conf->freebh = t;
conf->freebh_cnt++;
@@ -108,29 +130,6 @@ static inline void raid1_free_bh(raid1_c
wake_up(&conf->wait_buffer);
}
-static int raid1_grow_bh(raid1_conf_t *conf, int cnt)
-{
- /* allocate cnt buffer_heads, possibly less if kalloc fails */
- int i = 0;
-
- while (i < cnt) {
- struct buffer_head *bh;
- bh = kmalloc(sizeof(*bh), GFP_KERNEL);
- if (!bh) break;
- memset(bh, 0, sizeof(*bh));
-
- md_spin_lock_irq(&conf->device_lock);
- bh->b_pprev = &conf->freebh;
- bh->b_next = conf->freebh;
- conf->freebh = bh;
- conf->freebh_cnt++;
- md_spin_unlock_irq(&conf->device_lock);
-
- i++;
- }
- return i;
-}
-
static int raid1_shrink_bh(raid1_conf_t *conf, int cnt)
{
/* discard cnt buffer_heads, if we can find them */
@@ -147,7 +146,16 @@ static int raid1_shrink_bh(raid1_conf_t
md_spin_unlock_irq(&conf->device_lock);
return i;
}
-
+
+/*
+ * Return true if the caller may take a raid1_bh from the list.
+ * Requires that conf->device_lock be held.
+ */
+static int may_take_r1bh(raid1_conf_t *conf)
+{
+ return ((conf->freer1_cnt > FREER1_MEMALLOC_RESERVED) ||
+ (current->flags & (PF_FLUSH|PF_MEMALLOC))) && conf->freer1;
+}
static struct raid1_bh *raid1_alloc_r1bh(raid1_conf_t *conf)
{
@@ -155,8 +163,9 @@ static struct raid1_bh *raid1_alloc_r1bh
do {
md_spin_lock_irq(&conf->device_lock);
- if (conf->freer1) {
+ if (may_take_r1bh(conf)) {
r1_bh = conf->freer1;
+ conf->freer1_cnt--;
conf->freer1 = r1_bh->next_r1;
r1_bh->next_r1 = NULL;
r1_bh->state = 0;
@@ -170,7 +179,7 @@ static struct raid1_bh *raid1_alloc_r1bh
memset(r1_bh, 0, sizeof(*r1_bh));
return r1_bh;
}
- wait_event(conf->wait_buffer, conf->freer1);
+ wait_event(conf->wait_buffer, may_take_r1bh(conf));
} while (1);
}
@@ -178,49 +187,30 @@ static inline void raid1_free_r1bh(struc
{
struct buffer_head *bh = r1_bh->mirror_bh_list;
raid1_conf_t *conf = mddev_to_conf(r1_bh->mddev);
+ unsigned long flags;
r1_bh->mirror_bh_list = NULL;
- if (test_bit(R1BH_PreAlloc, &r1_bh->state)) {
- unsigned long flags;
- spin_lock_irqsave(&conf->device_lock, flags);
+ spin_lock_irqsave(&conf->device_lock, flags);
+ if (conf->freer1_cnt < FREER1_MEMALLOC_RESERVED) {
r1_bh->next_r1 = conf->freer1;
conf->freer1 = r1_bh;
+ conf->freer1_cnt++;
spin_unlock_irqrestore(&conf->device_lock, flags);
} else {
+ spin_unlock_irqrestore(&conf->device_lock, flags);
kfree(r1_bh);
}
raid1_free_bh(conf, bh);
}
-static int raid1_grow_r1bh (raid1_conf_t *conf, int cnt)
-{
- int i = 0;
-
- while (i < cnt) {
- struct raid1_bh *r1_bh;
- r1_bh = (struct raid1_bh*)kmalloc(sizeof(*r1_bh), GFP_KERNEL);
- if (!r1_bh)
- break;
- memset(r1_bh, 0, sizeof(*r1_bh));
-
- md_spin_lock_irq(&conf->device_lock);
- set_bit(R1BH_PreAlloc, &r1_bh->state);
- r1_bh->next_r1 = conf->freer1;
- conf->freer1 = r1_bh;
- md_spin_unlock_irq(&conf->device_lock);
-
- i++;
- }
- return i;
-}
-
static void raid1_shrink_r1bh(raid1_conf_t *conf)
{
md_spin_lock_irq(&conf->device_lock);
while (conf->freer1) {
struct raid1_bh *r1_bh = conf->freer1;
conf->freer1 = r1_bh->next_r1;
+ conf->freer1_cnt--; /* pedantry */
kfree(r1_bh);
}
md_spin_unlock_irq(&conf->device_lock);
@@ -1610,21 +1600,6 @@ static int raid1_run (mddev_t *mddev)
goto out_free_conf;
}
-
- /* pre-allocate some buffer_head structures.
- * As a minimum, 1 r1bh and raid_disks buffer_heads
- * would probably get us by in tight memory situations,
- * but a few more is probably a good idea.
- * For now, try 16 r1bh and 16*raid_disks bufferheads
- * This will allow at least 16 concurrent reads or writes
- * even if kmalloc starts failing
- */
- if (raid1_grow_r1bh(conf, 16) < 16 ||
- raid1_grow_bh(conf, 16*conf->raid_disks)< 16*conf->raid_disks) {
- printk(MEM_ERROR, mdidx(mddev));
- goto out_free_conf;
- }
-
for (i = 0; i < MD_SB_DISKS; i++) {
descriptor = sb->disks+i;
@@ -1713,6 +1688,8 @@ out_free_conf:
raid1_shrink_r1bh(conf);
raid1_shrink_bh(conf, conf->freebh_cnt);
raid1_shrink_buffers(conf);
+ if (conf->freer1_cnt != 0)
+ BUG();
kfree(conf);
mddev->private = NULL;
out:
* Re: 2.4.6 and ext3-2.4-0.9.1-246
2001-07-11 15:36 ` Andrew Morton
@ 2001-07-12 10:54 ` Mike Black
2001-07-12 11:34 ` Andrew Morton
0 siblings, 1 reply; 23+ messages in thread
From: Mike Black @ 2001-07-12 10:54 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel@vger.kernel.org, ext2-devel
Nope -- still locked up on 8 threads... however, it's apparently not RAID1
causing this.
I'm repeating this now on my SCSI 7x36G RAID5 set and seeing similar
behavior. It's a little better, though, since it's SCSI.
Since IDE hits the CPU harder, the system appeared to lock up for a lot
longer -- it might have finished, but I couldn't afford to wait that long.
The CPU is hitting 100% system usage, which makes it appear as though it is
locked up.
I've got a vmstat running in a window and it pauses a lot. When I was
testing the IDE RAID1 it paused (locked?) for a LONG time.
But it is recovering from the 100% system usage, and here's what it has so
far:
tiobench.pl --size 4000
Size is MB, BlkSz is Bytes, Read, Write, and Seeks are MB/sec
File Block Num Seq Read Rand Read Seq Write Rand Write
Dir Size Size Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
------- ------ ------- --- ----------- ----------- ----------- -----------
. 4000 4096 1 64.71 51.4% 0.826 2.00% 21.78 32.7% 1.218 0.85%
. 4000 4096 2 23.28 21.7% 0.935 1.76% 7.374 39.1% 1.261 0.96%
. 4000 4096 4 20.74 20.7% 1.087 2.50% 5.399 46.8% 1.278 1.09%
It's banging like crazy on the 8-thread run and I'm trying to let it finish
but it's really slow and non-responsive.
Here's the latest vmstat (10 second increments):
procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff   cache  si  so    bi    bo   in    cs  us  sy  id
 4 82  6      0   3604   4244 1902520   0   0     6  3342  386   347   0 100   0
 2 84  6      0   3620   4252 1902488   0   0     1  1506  173    17   0  99   1
 4 82  6      0   3636   4228 1902472   0   0     1  3749  448   237   0  84  16
 8 79  7      0   3620   4236 1902448   0   0     2   966  199    56   0  98   2
 4 80  6      0   3620   4252 1902336   0   0     1  1040  330   557   0 100   0
 0 86  5      0   3624   4252 1902332   0   0     0   627  335   725   0  98   2
15 75  5      0   3624   4264 1902636   0   0     1   953  494   182   0  90  10
16 76  6      0   3564   4280 1902748   0   0     1  1581  595   354   0  87  13
11 80  6      0   3564   4292 1902740   0   0     1  1337  174    67   0 100   0
18 74  6      0   3560   4308 1902716   0   0     0   703  313   353   0 100   0
 7 78  7      0   3560   4324 1902632   0   0     5  2181  301   626   0 100   0
 7 79  7      0   3560   4332 1902628   0   0     1   732  351   163   0 100   0
11 81  8      0   3224   4324 1902968   0   0     0     1  280   214   0 100   0
 9 76  7      0   3560   4332 1902624   0   0     0   569  270    83   0 100   0
 6 77  6      0   2832   4336 1903340   0   0     0   910  281   268   0 100   0
 3 83  7      0   3564   4336 1902604   0   0     0   487  281   130   0 100   0
17 77  7      0   3560   4344 1902600   0   0     0  1056  377   102   0 100   0
 9 76  7      0   3560   4364 1902256   0   0     1  3030  517   696   0 100   0
11 75  6      0   3560   4384 1902328   0   0     1  2145  230   131   0 100   0
12 72  7      0   3560   4416 1902296   0   0    16  2487  493    82   0  99   1
 9 76  6      0   3560   4424 1902084   0   0    82  1938  423  1124   0 100   0
________________________________________
Michael D. Black Principal Engineer
mblack@csihq.com 321-676-2923,x203
http://www.csihq.com Computer Science Innovations
http://www.csihq.com/~mike My home page
FAX 321-676-2355
----- Original Message -----
From: "Andrew Morton" <andrewm@uow.edu.au>
To: "Mike Black" <mblack@csihq.com>
Cc: "linux-kernel@vger.kernel.or" <linux-kernel@vger.kernel.org>;
<ext2-devel@lists.sourceforge.net>
Sent: Wednesday, July 11, 2001 11:36 AM
Subject: Re: 2.4.6 and ext3-2.4-0.9.1-246
Mike Black wrote:
>
> My system is:
> Dual 1Ghz PIII
> 2G RAM
> 2x2G swapfiles
> And I ran tiobench as tiobench.pl --size 4000 (twice memory)
>
> Methinks SMP is probably the biggest difference in this list.
No, the problem is in RAID1. The buffer allocation in there is nowhere
near strong enough for these loads.
> I ran this on another "smaller" memory (still dual CPU though) machine and
> noticed this on top:
>
> 12983 root 15 0 548 544 448 D 73.6 0.2 0:11 tiotest
> 3 root 18 0 0 0 0 SW 72.6 0.0 0:52 kswapd
>
> kswapd is taking an awful lot of CPU time. Not sure why it should be
> hitting swap at all.
It's not trying to swap stuff out - it's trying to find pages
to recycle. kswapd often goes berserk like this. I think it
was a design objective.
For me, RAID1 works OK with tiobench, but it is trivially deadlockable
with other workloads. The usual failure mode is for bdflush to be
stuck in raid1_alloc_r1bh() - can't allocate any more r1bh's, can't
move dirty buffers to disk. Dead.
The below patch increases the size of the reserved r1bh pool, scales it
by PAGE_CACHE_SIZE and introduces a reservation policy for PF_FLUSH
callers (ie: bdflush). That fixes the raid1_alloc_r1bh() deadlocks.
bdflush can also deadlock in raid1_alloc_bh(), trying to allocate
buffer_heads. So we do the same thing there.
Putting swap on RAID1 would definitely have exacerbated the problem.
The last thing we want to do when we're trying to push stuff out
of memory is to have to allocate more of it. So I allowed PF_MEMALLOC
tasks to bite into the reserves as well.
Please, if you have time, apply and retest.
--- linux-2.4.6/include/linux/sched.h Wed May 2 22:00:07 2001
+++ lk-ext3/include/linux/sched.h Thu Jul 12 01:03:20 2001
@@ -413,7 +418,7 @@ struct task_struct {
#define PF_SIGNALED 0x00000400 /* killed by a signal */
#define PF_MEMALLOC 0x00000800 /* Allocating memory */
#define PF_VFORK 0x00001000 /* Wake up parent in mm_release */
-
+#define PF_FLUSH 0x00002000 /* Flushes buffers to disk */
#define PF_USEDFPU 0x00100000 /* task used FPU this quantum (SMP) */
/*
--- linux-2.4.6/include/linux/raid/raid1.h Tue Dec 12 08:20:08 2000
+++ lk-ext3/include/linux/raid/raid1.h Thu Jul 12 01:15:39 2001
@@ -37,12 +37,12 @@ struct raid1_private_data {
/* buffer pool */
/* buffer_heads that we have pre-allocated have b_pprev -> &freebh
* and are linked into a stack using b_next
- * raid1_bh that are pre-allocated have R1BH_PreAlloc set.
* All these variable are protected by device_lock
*/
struct buffer_head *freebh;
int freebh_cnt; /* how many are on the list */
struct raid1_bh *freer1;
+ unsigned freer1_cnt;
struct raid1_bh *freebuf; /* each bh_req has a page allocated */
md_wait_queue_head_t wait_buffer;
@@ -87,5 +87,4 @@ struct raid1_bh {
/* bits for raid1_bh.state */
#define R1BH_Uptodate 1
#define R1BH_SyncPhase 2
-#define R1BH_PreAlloc 3 /* this was pre-allocated, add to free list */
#endif
--- linux-2.4.6/fs/buffer.c Wed Jul 4 18:21:31 2001
+++ lk-ext3/fs/buffer.c Thu Jul 12 01:03:57 2001
@@ -2685,6 +2748,7 @@ int bdflush(void *sem)
sigfillset(&tsk->blocked);
recalc_sigpending(tsk);
spin_unlock_irq(&tsk->sigmask_lock);
+ current->flags |= PF_FLUSH;
up((struct semaphore *)sem);
@@ -2726,6 +2790,7 @@ int kupdate(void *sem)
siginitsetinv(&current->blocked, sigmask(SIGCONT) | sigmask(SIGSTOP));
recalc_sigpending(tsk);
spin_unlock_irq(&tsk->sigmask_lock);
+ current->flags |= PF_FLUSH;
up((struct semaphore *)sem);
--- linux-2.4.6/drivers/md/raid1.c Wed Jul 4 18:21:26 2001
+++ lk-ext3/drivers/md/raid1.c Thu Jul 12 01:28:58 2001
@@ -51,6 +51,28 @@ static mdk_personality_t raid1_personali
static md_spinlock_t retry_list_lock = MD_SPIN_LOCK_UNLOCKED;
struct raid1_bh *raid1_retry_list = NULL, **raid1_retry_tail;
+/*
+ * We need to scale the number of reserved buffers by the page size
+ * to make writepage()s successful. --akpm
+ */
+#define R1_BLOCKS_PP (PAGE_CACHE_SIZE / 1024)
+#define FREER1_MEMALLOC_RESERVED (16 * R1_BLOCKS_PP)
+
+/*
+ * Return true if the caller may take a bh from the list.
+ * PF_FLUSH and PF_MEMALLOC tasks are allowed to use the reserves, because
+ * they're trying to *free* some memory.
+ *
+ * Requires that conf->device_lock be held.
+ */
+static int may_take_bh(raid1_conf_t *conf, int cnt)
+{
+ int min_free = (current->flags & (PF_FLUSH|PF_MEMALLOC)) ?
+ cnt :
+ (cnt + FREER1_MEMALLOC_RESERVED * conf->raid_disks);
+ return conf->freebh_cnt >= min_free;
+}
+
static struct buffer_head *raid1_alloc_bh(raid1_conf_t *conf, int cnt)
{
/* return a linked list of "cnt" struct buffer_heads.
@@ -62,7 +84,7 @@ static struct buffer_head *raid1_alloc_b
while(cnt) {
struct buffer_head *t;
md_spin_lock_irq(&conf->device_lock);
- if (conf->freebh_cnt >= cnt)
+ if (may_take_bh(conf, cnt))
while (cnt) {
t = conf->freebh;
conf->freebh = t->b_next;
@@ -83,7 +105,7 @@ static struct buffer_head *raid1_alloc_b
cnt--;
} else {
PRINTK("raid1: waiting for %d bh\n", cnt);
- wait_event(conf->wait_buffer, conf->freebh_cnt >= cnt);
+ wait_event(conf->wait_buffer, may_take_bh(conf, cnt));
}
}
return bh;
@@ -96,9 +118,9 @@ static inline void raid1_free_bh(raid1_c
while (bh) {
struct buffer_head *t = bh;
bh=bh->b_next;
- if (t->b_pprev == NULL)
+ if (conf->freebh_cnt >= FREER1_MEMALLOC_RESERVED) {
kfree(t);
- else {
+ } else {
t->b_next= conf->freebh;
conf->freebh = t;
conf->freebh_cnt++;
@@ -108,29 +130,6 @@ static inline void raid1_free_bh(raid1_c
wake_up(&conf->wait_buffer);
}
-static int raid1_grow_bh(raid1_conf_t *conf, int cnt)
-{
- /* allocate cnt buffer_heads, possibly less if kalloc fails */
- int i = 0;
-
- while (i < cnt) {
- struct buffer_head *bh;
- bh = kmalloc(sizeof(*bh), GFP_KERNEL);
- if (!bh) break;
- memset(bh, 0, sizeof(*bh));
-
- md_spin_lock_irq(&conf->device_lock);
- bh->b_pprev = &conf->freebh;
- bh->b_next = conf->freebh;
- conf->freebh = bh;
- conf->freebh_cnt++;
- md_spin_unlock_irq(&conf->device_lock);
-
- i++;
- }
- return i;
-}
-
static int raid1_shrink_bh(raid1_conf_t *conf, int cnt)
{
/* discard cnt buffer_heads, if we can find them */
@@ -147,7 +146,16 @@ static int raid1_shrink_bh(raid1_conf_t
md_spin_unlock_irq(&conf->device_lock);
return i;
}
-
+
+/*
+ * Return true if the caller may take a raid1_bh from the list.
+ * Requires that conf->device_lock be held.
+ */
+static int may_take_r1bh(raid1_conf_t *conf)
+{
+ return ((conf->freer1_cnt > FREER1_MEMALLOC_RESERVED) ||
+ (current->flags & (PF_FLUSH|PF_MEMALLOC))) && conf->freer1;
+}
static struct raid1_bh *raid1_alloc_r1bh(raid1_conf_t *conf)
{
@@ -155,8 +163,9 @@ static struct raid1_bh *raid1_alloc_r1bh
do {
md_spin_lock_irq(&conf->device_lock);
- if (conf->freer1) {
+ if (may_take_r1bh(conf)) {
r1_bh = conf->freer1;
+ conf->freer1_cnt--;
conf->freer1 = r1_bh->next_r1;
r1_bh->next_r1 = NULL;
r1_bh->state = 0;
@@ -170,7 +179,7 @@ static struct raid1_bh *raid1_alloc_r1bh
memset(r1_bh, 0, sizeof(*r1_bh));
return r1_bh;
}
- wait_event(conf->wait_buffer, conf->freer1);
+ wait_event(conf->wait_buffer, may_take_r1bh(conf));
} while (1);
}
@@ -178,49 +187,30 @@ static inline void raid1_free_r1bh(struc
{
struct buffer_head *bh = r1_bh->mirror_bh_list;
raid1_conf_t *conf = mddev_to_conf(r1_bh->mddev);
+ unsigned long flags;
r1_bh->mirror_bh_list = NULL;
- if (test_bit(R1BH_PreAlloc, &r1_bh->state)) {
- unsigned long flags;
- spin_lock_irqsave(&conf->device_lock, flags);
+ spin_lock_irqsave(&conf->device_lock, flags);
+ if (conf->freer1_cnt < FREER1_MEMALLOC_RESERVED) {
r1_bh->next_r1 = conf->freer1;
conf->freer1 = r1_bh;
+ conf->freer1_cnt++;
spin_unlock_irqrestore(&conf->device_lock, flags);
} else {
+ spin_unlock_irqrestore(&conf->device_lock, flags);
kfree(r1_bh);
}
raid1_free_bh(conf, bh);
}
-static int raid1_grow_r1bh (raid1_conf_t *conf, int cnt)
-{
- int i = 0;
-
- while (i < cnt) {
- struct raid1_bh *r1_bh;
- r1_bh = (struct raid1_bh*)kmalloc(sizeof(*r1_bh), GFP_KERNEL);
- if (!r1_bh)
- break;
- memset(r1_bh, 0, sizeof(*r1_bh));
-
- md_spin_lock_irq(&conf->device_lock);
- set_bit(R1BH_PreAlloc, &r1_bh->state);
- r1_bh->next_r1 = conf->freer1;
- conf->freer1 = r1_bh;
- md_spin_unlock_irq(&conf->device_lock);
-
- i++;
- }
- return i;
-}
-
static void raid1_shrink_r1bh(raid1_conf_t *conf)
{
md_spin_lock_irq(&conf->device_lock);
while (conf->freer1) {
struct raid1_bh *r1_bh = conf->freer1;
conf->freer1 = r1_bh->next_r1;
+ conf->freer1_cnt--; /* pedantry */
kfree(r1_bh);
}
md_spin_unlock_irq(&conf->device_lock);
@@ -1610,21 +1600,6 @@ static int raid1_run (mddev_t *mddev)
goto out_free_conf;
}
-
- /* pre-allocate some buffer_head structures.
- * As a minimum, 1 r1bh and raid_disks buffer_heads
- * would probably get us by in tight memory situations,
- * but a few more is probably a good idea.
- * For now, try 16 r1bh and 16*raid_disks bufferheads
- * This will allow at least 16 concurrent reads or writes
- * even if kmalloc starts failing
- */
- if (raid1_grow_r1bh(conf, 16) < 16 ||
- raid1_grow_bh(conf, 16*conf->raid_disks)< 16*conf->raid_disks) {
- printk(MEM_ERROR, mdidx(mddev));
- goto out_free_conf;
- }
-
for (i = 0; i < MD_SB_DISKS; i++) {
descriptor = sb->disks+i;
@@ -1713,6 +1688,8 @@ out_free_conf:
raid1_shrink_r1bh(conf);
raid1_shrink_bh(conf, conf->freebh_cnt);
raid1_shrink_buffers(conf);
+ if (conf->freer1_cnt != 0)
+ BUG();
kfree(conf);
mddev->private = NULL;
out:
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: 2.4.6 and ext3-2.4-0.9.1-246
2001-07-12 10:54 ` Mike Black
@ 2001-07-12 11:34 ` Andrew Morton
2001-07-13 12:22 ` Mike Black
0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2001-07-12 11:34 UTC (permalink / raw)
To: Mike Black; +Cc: linux-kernel@vger.kernel.org, ext2-devel
Mike Black wrote:
>
> Nope -- still locked up on 8 threads....however...it's apparently not RAID1
> causing this.
Well, aside from the RAID problems which we're triggering, you're
seeing interactions between RAID, ext3 and the VM. There's
another raid1 patch here - please test it.
> I'm repeating this now on my SCSI 7x36G RAID5 set and seeing similar
> behavior. It's a little better though since its SCSI.
RAID5 had a bug which would cause long stalls - ext3 triggered
it. It's fixed in 2.4.7-pre. I include that diff here, although
it'd be surprising if you were hitting it with that workload.
> ...
> I've got a vmstat running in a window and it pauses a lot. When I was
> testing the IDE RAID1 it paused (locked?) for a LONG time.
That's typical behaviour for an out-of-memory condition.
--- linux-2.4.6/drivers/md/raid1.c Wed Jul 4 18:21:26 2001
+++ lk-ext3/drivers/md/raid1.c Thu Jul 12 15:27:09 2001
@@ -46,6 +46,30 @@
#define PRINTK(x...) do { } while (0)
#endif
+#define __raid1_wait_event(wq, condition) \
+do { \
+ wait_queue_t __wait; \
+ init_waitqueue_entry(&__wait, current); \
+ \
+ add_wait_queue(&wq, &__wait); \
+ for (;;) { \
+ set_current_state(TASK_UNINTERRUPTIBLE); \
+ if (condition) \
+ break; \
+ run_task_queue(&tq_disk); \
+ schedule(); \
+ } \
+ current->state = TASK_RUNNING; \
+ remove_wait_queue(&wq, &__wait); \
+} while (0)
+
+#define raid1_wait_event(wq, condition) \
+do { \
+ if (condition) \
+ break; \
+ __raid1_wait_event(wq, condition); \
+} while (0)
+
static mdk_personality_t raid1_personality;
static md_spinlock_t retry_list_lock = MD_SPIN_LOCK_UNLOCKED;
@@ -83,7 +107,7 @@ static struct buffer_head *raid1_alloc_b
cnt--;
} else {
PRINTK("raid1: waiting for %d bh\n", cnt);
- wait_event(conf->wait_buffer, conf->freebh_cnt >= cnt);
+ raid1_wait_event(conf->wait_buffer, conf->freebh_cnt >= cnt);
}
}
return bh;
@@ -170,7 +194,7 @@ static struct raid1_bh *raid1_alloc_r1bh
memset(r1_bh, 0, sizeof(*r1_bh));
return r1_bh;
}
- wait_event(conf->wait_buffer, conf->freer1);
+ raid1_wait_event(conf->wait_buffer, conf->freer1);
} while (1);
}
--- linux-2.4.6/drivers/md/raid5.c Wed Jul 4 18:21:26 2001
+++ lk-ext3/drivers/md/raid5.c Thu Jul 12 21:31:55 2001
@@ -66,10 +66,11 @@ static inline void __release_stripe(raid
BUG();
if (atomic_read(&conf->active_stripes)==0)
BUG();
- if (test_bit(STRIPE_DELAYED, &sh->state))
- list_add_tail(&sh->lru, &conf->delayed_list);
- else if (test_bit(STRIPE_HANDLE, &sh->state)) {
- list_add_tail(&sh->lru, &conf->handle_list);
+ if (test_bit(STRIPE_HANDLE, &sh->state)) {
+ if (test_bit(STRIPE_DELAYED, &sh->state))
+ list_add_tail(&sh->lru, &conf->delayed_list);
+ else
+ list_add_tail(&sh->lru, &conf->handle_list);
md_wakeup_thread(conf->thread);
} else {
if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
@@ -1167,10 +1168,9 @@ static void raid5_unplug_device(void *da
raid5_activate_delayed(conf);
- if (conf->plugged) {
- conf->plugged = 0;
- md_wakeup_thread(conf->thread);
- }
+ conf->plugged = 0;
+ md_wakeup_thread(conf->thread);
+
spin_unlock_irqrestore(&conf->device_lock, flags);
}
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: 2.4.6 and ext3-2.4-0.9.1-246
2001-07-12 11:34 ` Andrew Morton
@ 2001-07-13 12:22 ` Mike Black
2001-07-13 13:54 ` Mike Black
0 siblings, 1 reply; 23+ messages in thread
From: Mike Black @ 2001-07-13 12:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel@vger.kernel.org, ext2-devel
I haven't done the RAID5 patch yet but I think one big problem is ext3
interaction with kswapd
My tiobench finally completed
tiobench.pl --size 4000
Size is MB, BlkSz is Bytes, Read, Write, and Seeks are MB/sec
File Block Num Seq Read Rand Read Seq Write Rand Write
Dir Size Size Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
------- ------ ------- --- ----------- ----------- ----------- -----------
. 4000 4096 1 64.71 51.4% 0.826 2.00% 21.78 32.7% 1.218 0.85%
. 4000 4096 2 23.28 21.7% 0.935 1.76% 7.374 39.1% 1.261 0.96%
. 4000 4096 4 20.74 20.7% 1.087 2.50% 5.399 46.8% 1.278 1.09%
. 4000 4096 8 18.60 19.1% 1.265 2.67% 3.106 63.6% 1.286 1.17%
The CPU culprit is kswapd...this is apparently why the system appears to
lock up.
I don't even have swap turned on.
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
3 root 14 0 0 0 0 RW 85.6 0.0 39:50 kswapd
And...when I switch back to ext2 and do the same test: kswapd barely gets
used at all:
tiobench.pl --size 4000
Size is MB, BlkSz is Bytes, Read, Write, and Seeks are MB/sec
File Block Num Seq Read Rand Read Seq Write Rand Write
Dir Size Size Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
------- ------ ------- --- ----------- ----------- ----------- -----------
. 4000 4096 1 62.54 46.0% 0.806 2.27% 29.97 27.7% 1.343 0.94%
. 4000 4096 2 56.10 46.9% 1.030 3.03% 28.18 26.7% 1.320 1.30%
. 4000 4096 4 39.46 35.0% 1.204 3.34% 17.16 16.2% 1.309 1.28%
. 4000 4096 8 33.80 31.0% 1.384 3.74% 14.26 13.7% 1.309 1.21%
So...my question is why does ext3 cause kswapd to go nuts?
________________________________________
Michael D. Black Principal Engineer
mblack@csihq.com 321-676-2923,x203
http://www.csihq.com Computer Science Innovations
http://www.csihq.com/~mike My home page
FAX 321-676-2355
----- Original Message -----
From: "Andrew Morton" <andrewm@uow.edu.au>
To: "Mike Black" <mblack@csihq.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>;
<ext2-devel@lists.sourceforge.net>
Sent: Thursday, July 12, 2001 7:34 AM
Subject: Re: 2.4.6 and ext3-2.4-0.9.1-246
Mike Black wrote:
>
> Nope -- still locked up on 8 threads....however...it's apparently not RAID1
> causing this.
Well, aside from the RAID problems which we're triggering, you're
seeing interactions between RAID, ext3 and the VM. There's
another raid1 patch here - please test it.
> I'm repeating this now on my SCSI 7x36G RAID5 set and seeing similar
> behavior. It's a little better though since its SCSI.
RAID5 had a bug which would cause long stalls - ext3 triggered
it. It's fixed in 2.4.7-pre. I include that diff here, although
it'd be surprising if you were hitting it with that workload.
> ...
> I've got a vmstat running in a window and it pauses a lot. When I was
> testing the IDE RAID1 it paused (locked?) for a LONG time.
That's typical behaviour for an out-of-memory condition.
--- linux-2.4.6/drivers/md/raid1.c Wed Jul 4 18:21:26 2001
+++ lk-ext3/drivers/md/raid1.c Thu Jul 12 15:27:09 2001
@@ -46,6 +46,30 @@
#define PRINTK(x...) do { } while (0)
#endif
+#define __raid1_wait_event(wq, condition) \
+do { \
+ wait_queue_t __wait; \
+ init_waitqueue_entry(&__wait, current); \
+ \
+ add_wait_queue(&wq, &__wait); \
+ for (;;) { \
+ set_current_state(TASK_UNINTERRUPTIBLE); \
+ if (condition) \
+ break; \
+ run_task_queue(&tq_disk); \
+ schedule(); \
+ } \
+ current->state = TASK_RUNNING; \
+ remove_wait_queue(&wq, &__wait); \
+} while (0)
+
+#define raid1_wait_event(wq, condition) \
+do { \
+ if (condition) \
+ break; \
+ __raid1_wait_event(wq, condition); \
+} while (0)
+
static mdk_personality_t raid1_personality;
static md_spinlock_t retry_list_lock = MD_SPIN_LOCK_UNLOCKED;
@@ -83,7 +107,7 @@ static struct buffer_head *raid1_alloc_b
cnt--;
} else {
PRINTK("raid1: waiting for %d bh\n", cnt);
- wait_event(conf->wait_buffer, conf->freebh_cnt >= cnt);
+ raid1_wait_event(conf->wait_buffer, conf->freebh_cnt >= cnt);
}
}
return bh;
@@ -170,7 +194,7 @@ static struct raid1_bh *raid1_alloc_r1bh
memset(r1_bh, 0, sizeof(*r1_bh));
return r1_bh;
}
- wait_event(conf->wait_buffer, conf->freer1);
+ raid1_wait_event(conf->wait_buffer, conf->freer1);
} while (1);
}
--- linux-2.4.6/drivers/md/raid5.c Wed Jul 4 18:21:26 2001
+++ lk-ext3/drivers/md/raid5.c Thu Jul 12 21:31:55 2001
@@ -66,10 +66,11 @@ static inline void __release_stripe(raid
BUG();
if (atomic_read(&conf->active_stripes)==0)
BUG();
- if (test_bit(STRIPE_DELAYED, &sh->state))
- list_add_tail(&sh->lru, &conf->delayed_list);
- else if (test_bit(STRIPE_HANDLE, &sh->state)) {
- list_add_tail(&sh->lru, &conf->handle_list);
+ if (test_bit(STRIPE_HANDLE, &sh->state)) {
+ if (test_bit(STRIPE_DELAYED, &sh->state))
+ list_add_tail(&sh->lru, &conf->delayed_list);
+ else
+ list_add_tail(&sh->lru, &conf->handle_list);
md_wakeup_thread(conf->thread);
} else {
if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
@@ -1167,10 +1168,9 @@ static void raid5_unplug_device(void *da
raid5_activate_delayed(conf);
- if (conf->plugged) {
- conf->plugged = 0;
- md_wakeup_thread(conf->thread);
- }
+ conf->plugged = 0;
+ md_wakeup_thread(conf->thread);
+
spin_unlock_irqrestore(&conf->device_lock, flags);
}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: 2.4.6 and ext3-2.4-0.9.1-246
2001-07-13 12:22 ` Mike Black
@ 2001-07-13 13:54 ` Mike Black
2001-07-13 14:15 ` Andrew Morton
2001-07-13 16:30 ` Stephen C. Tweedie
0 siblings, 2 replies; 23+ messages in thread
From: Mike Black @ 2001-07-13 13:54 UTC (permalink / raw)
To: Mike Black, Andrew Morton; +Cc: linux-kernel@vger.kernel.org, ext2-devel
I give up! I'm getting file system corruption now on the ext3 partition...
and I've got a kernel oops (soon to be decoded) This is the worst file
corruption I've ever seen other than having a disk go bad.
I'm removing ext3 for now.
________________________________________
Michael D. Black Principal Engineer
mblack@csihq.com 321-676-2923,x203
http://www.csihq.com Computer Science Innovations
http://www.csihq.com/~mike My home page
FAX 321-676-2355
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: 2.4.6 and ext3-2.4-0.9.1-246
2001-07-13 13:54 ` Mike Black
@ 2001-07-13 14:15 ` Andrew Morton
2001-07-13 17:30 ` Mike Black
2001-07-13 16:30 ` Stephen C. Tweedie
1 sibling, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2001-07-13 14:15 UTC (permalink / raw)
To: Mike Black; +Cc: linux-kernel@vger.kernel.org, ext2-devel
Mike Black wrote:
>
> I give up! I'm getting file system corruption now on the ext3 partition...
> and I've got a kernel oops (soon to be decoded) This is the worst file
> corruption I've ever seen other than having a disk go bad.
There was a truncate-related bug fixed in 0.9.2. What workload
were you using at the time?
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246
2001-07-13 13:54 ` Mike Black
2001-07-13 14:15 ` Andrew Morton
@ 2001-07-13 16:30 ` Stephen C. Tweedie
2001-07-13 17:27 ` Steve Lord
1 sibling, 1 reply; 23+ messages in thread
From: Stephen C. Tweedie @ 2001-07-13 16:30 UTC (permalink / raw)
To: Mike Black
Cc: Andrew Morton, linux-kernel@vger.kernel.org, ext2-devel,
Stephen Tweedie
Hi,
On Fri, Jul 13, 2001 at 09:54:56AM -0400, Mike Black wrote:
> I give up! I'm getting file system corruption now on the ext3 partition...
> and I've got a kernel oops (soon to be decoded)
Please, do send details. We already know that the VM has a hard job
under load, and journaling exacerbates that --- ext3 cannot always
write to disk without first allocating more memory, and the VM simply
doesn't have a mechanism for dealing with that reliably. It seems to
be compounded by (a) 2.4 having less write throttling than 2.2 had,
and (b) the zoned allocator getting confused about which zones
actually need to be recycled.
It's not just ext3 --- highmem bounce buffering and soft raid buffers
have the same problem, and work around it by doing their own internal
preallocation of emergency buffers. Loop devices and nbd will have a
similar problem if you use those for swap or writable mmaps, as will
NFS.
One proposed suggestion is to do per-zone memory reservations for the
VM's use: Ben LaHaise has prototype code for that and we'll be testing
to see if it makes for an improvement when used with ext3.
Cheers,
Stephen
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246
2001-07-13 16:30 ` Stephen C. Tweedie
@ 2001-07-13 17:27 ` Steve Lord
2001-07-13 17:33 ` Stephen C. Tweedie
0 siblings, 1 reply; 23+ messages in thread
From: Steve Lord @ 2001-07-13 17:27 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Mike Black, Andrew Morton, linux-kernel@vger.kernel.org,
ext2-devel
> Hi,
>
> On Fri, Jul 13, 2001 at 09:54:56AM -0400, Mike Black wrote:
> > I give up! I'm getting file system corruption now on the ext3 partition...
> > and I've got a kernel oops (soon to be decoded)
>
> Please, do send details. We already know that the VM has a hard job
> under load, and journaling exacerbates that --- ext3 cannot always
> write to disk without first allocating more memory, and the VM simply
> doesn't have a mechanism for dealing with that reliably. It seems to
> be compounded by (a) 2.4 having less write throttling than 2.2 had,
> and (b) the zoned allocator getting confused about which zones
> actually need to be recycled.
We seem to have managed to keep XFS going without the memory reservation
scheme - and the way we do I/O on metadata right now means there is always
a memory allocation in that path. At the moment the only thing I can kill
the system with is make -j bzImage; it eventually grinds to a halt with
the swapper waiting for a request slot in the block layer but the system
is in such a mess that I have not been able to diagnose it further than
that.
A lot of careful use of GFP flags on memory allocation was necessary to
get to this point; the GFP_NOIO and GFP_NOFS flags finally made this
deadlock-clean.
Steve
>
> It's not just ext3 --- highmem bounce buffering and soft raid buffers
> have the same problem, and work around it by doing their own internal
> preallocation of emergency buffers. Loop devices and nbd will have a
> similar problem if you use those for swap or writable mmaps, as will
> NFS.
>
> One proposed suggestion is to do per-zone memory reservations for the
> VM's use: Ben LaHaise has prototype code for that and we'll be testing
> to see if it makes for an improvement when used with ext3.
>
> Cheers,
> Stephen
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: 2.4.6 and ext3-2.4-0.9.1-246
2001-07-13 14:15 ` Andrew Morton
@ 2001-07-13 17:30 ` Mike Black
2001-07-13 17:38 ` [Ext2-devel] " Stephen C. Tweedie
0 siblings, 1 reply; 23+ messages in thread
From: Mike Black @ 2001-07-13 17:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel@vger.kernel.org, ext2-devel
Here's the oops:
Message on console:
yeti kernel: EXT3-fs error (device md(9,0)): ext3_new_inode: reserved inode
or inode > inodes count - block_group = 0,inode=1
Here line 575:
J_ASSERT_JH(jh, !buffer_locked(jh2bh(jh)));
Kernel BUG at transaction.c:575!
invalid operand: 0000
CPU: 1
EIP: 0010:[<c015b21d>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010282
eax: 00000021 ebx: df83e850 ecx: 00000001 edx: 00000001
esi: d13fa880 edi: cf83e850 ebp: f7856600 esp: f73e5cac
ds: 0018 es: 0018 ss: 0018
Process syslogd (pid: 57, stackpage=f73e5000)
Stack: c0245fcb c0246140 0000023f f7856600 d13fa880 cf83e850 f78576694 c3217a58
00000000 00000000 00000000 d73f42a0 c015b689 d13fa880 cf83e850 00000000
00000912 f7856800 00000913 f73e5d34 c01529e9 d13fa880 f784eec0 d13fa880
Call Trace: c015b689 c01529e9 c01540ee c01543eb c01546ac c0154952 c015b62a
c0135694 c0154b46 c0135c88 c01364d6 c0154f96 c0154ad4 c01270b2
c0154f96 c0154ad4 c01270b2 c01531be c01331b6 c01531a4 c01332c5 c0106c7b
Code: 0f 0b 83 c4 0c f0 fe 0d a0 aa 28 c0 0f 88 35 f5 0c 00 8b 53
>>EIP; c015b21d <do_get_write_access+205/638> <=====
Trace; c015b689 <journal_get_write_access+39/5c>
Trace; c01529e9 <ext3_new_block+349/55c>
Trace; c01540ee <ext3_alloc_block+1e/24>
Trace; c01543eb <ext3_alloc_branch+3f/24c>
Trace; c01546ac <ext3_splice_branch+b4/130>
Trace; c0154952 <ext3_get_block_handle+22a/3ac>
Trace; c015b62a <do_get_write_access+612/638>
Code; c015b21d <do_get_write_access+205/638>
00000000 <_EIP>:
Code; c015b21d <do_get_write_access+205/638> <=====
0: 0f 0b ud2a <=====
Code; c015b21f <do_get_write_access+207/638>
2: 83 c4 0c add $0xc,%esp
Code; c015b222 <do_get_write_access+20a/638>
5: f0 fe 0d a0 aa 28 c0 lock decb 0xc028aaa0
Code; c015b229 <do_get_write_access+211/638>
c: 0f 88 35 f5 0c 00 js cf547 <_EIP+0xcf547> c022a764 <stext_lock+33bc/92c6>
Code; c015b22f <do_get_write_access+217/638>
12: 8b 53 00 mov 0x0(%ebx),%edx
________________________________________
Michael D. Black Principal Engineer
mblack@csihq.com 321-676-2923,x203
http://www.csihq.com Computer Science Innovations
http://www.csihq.com/~mike My home page
FAX 321-676-2355
----- Original Message -----
From: "Andrew Morton" <andrewm@uow.edu.au>
To: "Mike Black" <mblack@csihq.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>;
<ext2-devel@lists.sourceforge.net>
Sent: Friday, July 13, 2001 10:15 AM
Subject: Re: 2.4.6 and ext3-2.4-0.9.1-246
Mike Black wrote:
>
> I give up! I'm getting file system corruption now on the ext3 partition...
> and I've got a kernel oops (soon to be decoded) This is the worst file
> corruption I've ever seen other than having a disk go bad.
There was a truncate-related bug fixed in 0.9.2. What workload
were you using at the time?
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246
2001-07-13 17:27 ` Steve Lord
@ 2001-07-13 17:33 ` Stephen C. Tweedie
0 siblings, 0 replies; 23+ messages in thread
From: Stephen C. Tweedie @ 2001-07-13 17:33 UTC (permalink / raw)
To: Steve Lord
Cc: Stephen C. Tweedie, Mike Black, Andrew Morton,
linux-kernel@vger.kernel.org, ext2-devel
Hi,
On Fri, Jul 13, 2001 at 12:27:33PM -0500, Steve Lord wrote:
> We seem to have managed to keep XFS going without the memory reservation
> scheme - and the way we do I/O on metadata right now means there is always
> a memory allocation in that path. At the moment the only thing I can kill
> the system with is make -j bzImage; it eventually grinds to a halt with
> the swapper waiting for a request slot in the block layer but the system
> is in such a mess that I have not been able to diagnose it further than
> that.
That I can certainly reproduce. Cerberus is also capable of
reproducing a stall after a few hours. I'm usually seeing
create_buffers() as the place where the kswapd itself is deadlocked,
though.
It does need a _really_ high load to trigger this, though. A 200
client dbench won't do it --- there are too many IOs slowing each
process up. However, Cerberus, which produces a high VM load in
parallel with the IO stress, seems pretty good at triggering the
problem.
It also does seem worse on highmem machines, almost certainly because
of zone imbalance problems: Marcelo has just started looking more
closely at that as a result of the fine-grained VM monitoring patch
(which showed clearly just how imbalanced the VM can get when you have
a highmem zone).
> A lot of careful use of GFP flags on memory allocation was necessary to
> get to this point; the GFP_NOIO and GFP_NOFS flags finally made this
> deadlock-clean.
Yep, with GFP_NOFS I don't see any fs deadlocks, but I'm still seeing
the swapper blocked inside create_buffers (not within ext3 at all).
That's not good.
Cheers,
Stephen
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246
2001-07-13 17:30 ` Mike Black
@ 2001-07-13 17:38 ` Stephen C. Tweedie
2001-07-14 10:42 ` Mike Black
0 siblings, 1 reply; 23+ messages in thread
From: Stephen C. Tweedie @ 2001-07-13 17:38 UTC (permalink / raw)
To: Mike Black; +Cc: Andrew Morton, linux-kernel@vger.kernel.org, ext2-devel
Hi,
On Fri, Jul 13, 2001 at 01:30:34PM -0400, Mike Black wrote:
> Here's the oops:
> Message on console:
> yeti kernel: EXT3-fs error (device md(9,0)): ext3_new_inode: reserved inode
> or inode > inodes count - block_group = 0,inode=1
>
> Here line 575:
> J_ASSERT_JH(jh, !buffer_locked(jh2bh(jh)));
Many thanks. Were there any other log messages at all?
Cheers,
Stephen
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246
2001-07-13 17:38 ` [Ext2-devel] " Stephen C. Tweedie
@ 2001-07-14 10:42 ` Mike Black
2001-07-14 10:53 ` Andrew Morton
2001-07-14 11:58 ` Andrew Morton
0 siblings, 2 replies; 23+ messages in thread
From: Mike Black @ 2001-07-14 10:42 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: Andrew Morton, linux-kernel@vger.kernel.org, ext2-devel
Only when I rebooted and fsck ran :-(
----- Original Message -----
From: "Stephen C. Tweedie" <sct@redhat.com>
To: "Mike Black" <mblack@csihq.com>
Cc: "Andrew Morton" <andrewm@uow.edu.au>; "linux-kernel@vger.kernel.org"
<linux-kernel@vger.kernel.org>; <ext2-devel@lists.sourceforge.net>
Sent: Friday, July 13, 2001 1:38 PM
Subject: Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246
> Hi,
>
> On Fri, Jul 13, 2001 at 01:30:34PM -0400, Mike Black wrote:
> > Here's the oops:
> > Message on console:
> > yeti kernel: EXT3-fs error (device md(9,0)): ext3_new_inode: reserved inode
> > or inode > inodes count - block_group = 0,inode=1
> >
> > Here line 575:
> > J_ASSERT_JH(jh, !buffer_locked(jh2bh(jh)));
>
> Many thanks. Were there any other log messages at all?
>
> Cheers,
> Stephen
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246
2001-07-14 10:42 ` Mike Black
@ 2001-07-14 10:53 ` Andrew Morton
2001-07-14 11:58 ` Andrew Morton
1 sibling, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2001-07-14 10:53 UTC (permalink / raw)
To: Mike Black; +Cc: Stephen C. Tweedie, linux-kernel@vger.kernel.org, ext2-devel
Mike Black wrote:
>
> Only when I rebooted and fsck ran :-(
>
What version of ext3 was it?
It's quite easy to reproduce the raid5/VM problems here - the
system slows to a crawl with the disk only using about 1/10th
of its bandwidth. Much worse if highmem is enabled.
Does this match your observations?
-
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246
2001-07-14 10:42 ` Mike Black
2001-07-14 10:53 ` Andrew Morton
@ 2001-07-14 11:58 ` Andrew Morton
2001-07-16 18:23 ` Stephen C. Tweedie
1 sibling, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2001-07-14 11:58 UTC (permalink / raw)
To: Mike Black; +Cc: Stephen C. Tweedie, linux-kernel@vger.kernel.org, ext2-devel
Mike Black wrote:
>
> Ummm...that would be the version(s) mentioned in the subject line???? :-)
doh.
OK, there was a nasty bug in 0.9.1 which I was not able to trigger
in a solid month's testing. But others with more worthy hardware
were able to find it quite quickly. Stephen fixed it in 0.9.2.
I don't know if it explains the failure you saw. This:
EXT3-fs error (device md(9,0)): ext3_new_inode: reserved
inode or inode > inodes count - block_group = 0,inode=1
is nasty. The LRU cache of inode bitmaps got wrecked. Ugly.
Maybe one more try?
> My .config has
> # CONFIG_NOHIGHMEM is not set
> CONFIG_HIGHMEM4G=y
> # CONFIG_HIGHMEM64G is not set
> CONFIG_HIGHMEM=y
> I've got 2G of RAM
>
> And the main thing I noticed was kswapd going nuts -- this was NOT observed
> with the same tiobench on ext2 (same filesystem). The performance with ext3
> reduced by about 66% on two threads -- and I think that is due to kswapd
> hogging CPU time.
Yup. I've nailed this one - it's lovely.
I'll be back.
-
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246
2001-07-14 11:58 ` Andrew Morton
@ 2001-07-16 18:23 ` Stephen C. Tweedie
0 siblings, 0 replies; 23+ messages in thread
From: Stephen C. Tweedie @ 2001-07-16 18:23 UTC (permalink / raw)
To: Andrew Morton
Cc: Mike Black, Stephen C. Tweedie, linux-kernel@vger.kernel.org,
ext2-devel
Hi,
On Sat, Jul 14, 2001 at 09:58:42PM +1000, Andrew Morton wrote:
> OK, there was a nasty bug in 0.9.1 which I was not able to trigger
> in a solid month's testing. But others with more worthy hardware
> were able to find it quite quickly.
It would depend very much on the workload. The problem would only
occur if you had two tasks collide when trying to allocate a block at
the same time, which essentially means doing mmap writes in the middle
of a sparse file. Most workloads would not ever trigger that no
matter how much you tried.
> Stephen fixed it in 0.9.2.
> I don't know if it explains the failure you saw.
Me neither, but it could conceivably do so. The worst case scenario
as an immediate result of that bug would be corruption in the middle
of an indirect block. We used to see that on ext2 on kernels before
2.4.3 as a result of a similar bug there, and the side effects of the
bug were often severe --- if an indirect block is corrupted this way,
then on subsequent delete, you can end up freeing arbitrary parts of
the fs and all bets are off beyond that.
With the 0.9.2 fix in place, I've seen no such problems with any
stress tests, although the VM problems being discussed elsewhere do
still sometimes cause things to stall for a while or lock up totally
after a few hours.
Cheers,
Stephen
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2001-07-16 18:23 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-07-10 9:47 2.4.6 and ext3-2.4-0.9.1-246 Mike Black
2001-07-10 17:52 ` Andreas Dilger
[not found] ` <018101c1096a$17e2afc0$b6562341@cfl.rr.com>
2001-07-10 18:17 ` [Ext2-devel] " Stephen C. Tweedie
2001-07-10 18:27 ` Mike Black
2001-07-10 18:29 ` Stephen C. Tweedie
2001-07-10 18:51 ` Andreas Dilger
2001-07-11 4:08 ` Andrew Morton
2001-07-11 12:16 ` Mike Black
2001-07-11 15:36 ` Andrew Morton
2001-07-12 10:54 ` Mike Black
2001-07-12 11:34 ` Andrew Morton
2001-07-13 12:22 ` Mike Black
2001-07-13 13:54 ` Mike Black
2001-07-13 14:15 ` Andrew Morton
2001-07-13 17:30 ` Mike Black
2001-07-13 17:38 ` [Ext2-devel] " Stephen C. Tweedie
2001-07-14 10:42 ` Mike Black
2001-07-14 10:53 ` Andrew Morton
2001-07-14 11:58 ` Andrew Morton
2001-07-16 18:23 ` Stephen C. Tweedie
2001-07-13 16:30 ` Stephen C. Tweedie
2001-07-13 17:27 ` Steve Lord
2001-07-13 17:33 ` Stephen C. Tweedie