public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7
  2000-12-08  9:58 [found?] " Alexander Viro
@ 2000-12-08 18:11 ` Alexander Viro
  2000-12-08 18:48   ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Alexander Viro @ 2000-12-08 18:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David Woodhouse, linux-kernel

Folks, see if the following patch helps. AFAICS it closes a pretty real
race - we could call block_write_full_page() for a page that has sync
IO in progress and blindly change ->b_end_io callbacks on the bh with
pending requests. With a little bit of bad luck they would complete before
we got to ll_rw_block(), thus leading to extra UnlockPage(). All it takes
is a pageout on a page that got partial write recently - if some fragments
were still unmapped we get to call get_block() on them and it can easily
block, providing a decent window for that race.

Fix: postpone changing ->b_end_io until the call of ll_rw_block(); if by
the time of ll_rw_block() some fragments will still have IO in progress -
wait on them.

Comments?
						Cheers,
							Al

--- buffer.c	Fri Dec  8 16:19:53 2000
+++ buffer.c.new	Fri Dec  8 16:26:44 2000
@@ -1577,6 +1577,26 @@
  * "Dirty" is valid only with the last case (mapped+uptodate).
  */
 
+static void write_array_async(struct page *page, struct buffer_head **p, int n)
+{
+	int i;
+	if (!n) {
+		UnlockPage(page);
+		return;
+	}
+	/*
+	 * If there are pending requests on these guys - wait before changing
+	 * ->b_end_io.
+	 */
+	for (i=0; i<n; i++) {
+		wait_on_buffer(p[i]);
+		set_bit(BH_Uptodate, &p[i]->b_state);
+		set_bit(BH_Dirty, &p[i]->b_state);
+		p[i]->b_end_io = end_buffer_io_async;
+	}
+	ll_rw_block(WRITE, n, p);
+}
+
 /*
  * block_write_full_page() is SMP-safe - currently it's still
  * being called with the kernel lock held, but the code is ready.
@@ -1616,28 +1636,17 @@
 			if (buffer_new(bh))
 				unmap_underlying_metadata(bh);
 		}
-		set_bit(BH_Uptodate, &bh->b_state);
-		set_bit(BH_Dirty, &bh->b_state);
-		bh->b_end_io = end_buffer_io_async;
 		atomic_inc(&bh->b_count);
 		arr[nr++] = bh;
 		bh = bh->b_this_page;
 		block++;
 	} while (bh != head);
 
-	if (nr) {
-		ll_rw_block(WRITE, nr, arr);
-	} else {
-		UnlockPage(page);
-	}
+	write_array_async(page, arr, nr);
 	SetPageUptodate(page);
 	return 0;
 out:
-	if (nr) {
-		ll_rw_block(WRITE, nr, arr);
-	} else {
-		UnlockPage(page);
-	}
+	write_array_async(page, arr, nr);
 	ClearPageUptodate(page);
 	return err;
 }

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7
  2000-12-08 18:11 ` [PATCH] " Alexander Viro
@ 2000-12-08 18:48   ` Linus Torvalds
  2000-12-08 19:13     ` Alexander Viro
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2000-12-08 18:48 UTC (permalink / raw)
  To: Alexander Viro; +Cc: David Woodhouse, linux-kernel



On Fri, 8 Dec 2000, Alexander Viro wrote:
> 
> Fix: postpone changing ->b_end_io until the call of ll_rw_block(); if by
> the time of ll_rw_block() some fragments will still have IO in progress -
> wait on them.
> 
> Comments?

Yes.

On the other hand, I have this suspicion that there is an even simpler
solution: stop using the end_buffer_io_sync version for writes
altogether.

It's really only __block_prepare_write() that can mark the buffers for
sync writes, and even that case is fairly bogus: it really only wants to
mark the non-upto-date buffers that it wants to _read_ for sync IO, it
just "knows" that it is ok to change every buffer it sees. It isn't.

Moving the 

	bh->b_end_io = end_buffer_io_sync;

into the read-path should be sufficient (hell, we should move the
"ll_rw_block(READ, 1, &bh)" away, and make it one single read with

	ll_rw_block(READ, wait_bh-wait, wait);

at the end of the loop.

If we do it that way, there is no way the two can clash, because a
non-up-to-date buffer head won't ever be touched by the write path, so we
can't get this kind of confusion.

Al? I'd really prefer this alternative instead. Do you see any problems
with it?

(New rule that makes 100% sense: NOBODY sets "bh->b_end_io" on any buffer
that he isn't actually going to do IO on.  And if there is pending IO on
the buffer, it had better only be of the same type as the one you're going
to do).

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7
  2000-12-08 18:48   ` Linus Torvalds
@ 2000-12-08 19:13     ` Alexander Viro
  2000-12-08 19:39       ` Linus Torvalds
  2000-12-09 15:37       ` David Woodhouse
  0 siblings, 2 replies; 21+ messages in thread
From: Alexander Viro @ 2000-12-08 19:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David Woodhouse, linux-kernel



On Fri, 8 Dec 2000, Linus Torvalds wrote:

> It's really only __block_prepare_write() that can mark the buffers for
> sync writes, and even that case is fairly bogus: it really only wants to
> mark the non-upto-date buffers that it wants to _read_ for sync IO, it
> just "knows" that it is ok to change every buffer it sees. It isn't.
> 
> Moving the 
> 
> 	bh->b_end_io = end_buffer_io_sync;
> 
> into the read-path should be sufficient (hell, we should move the
> "ll_rw_block(READ, 1, &bh)" away, and make it one single read with
> 
> 	ll_rw_block(READ, wait_bh-wait, wait);
> 
> at the end of the loop.

Erm... So you want to make ->commit_write() page-unlocking? Fine with me,
but that will make for somewhat bigger patch. Hey, _you_ are in position
to change the locking rules, freeze or not, so if it's OK with you...

> If we do it that way, there is no way the two can clash, because a
> non-up-to-date buffer head won't ever be touched by the write path, so we
> can't get this kind of confusion.
> 
> Al? I'd really prefer this alternative instead. Do you see any problems
> with it?

If I understood you right - no problems.  I can do such patch, just give
me a couple of hours to recheck the locking rules.  BTW, what do you
think about the following:
	* add_to_page_cache() is not exported and never used. Kill?
	* __lock_page() is called only in lock_page(). Make the latter
inlined wrapper or kill the former?
	* rw_swap_page_base() would be simpler if it didn't leave unlocking the
page to caller - right now all callers do
	if (!rw_swap_page_base(...))
		UnlockPage(page);
	* ramfs_writepage() needs UnlockPage()
	* udf_adinicb_writepage() doesn't
	* brw_page() needs liposuction. _Badly_. Just read it.
	* Documentations/filesystems/Locking needs an update.
I can toss any subset of the above into patch - just tell...
 
> (New rule that makes 100% sense: NOBODY sets "bh->b_end_io" on any buffer
> that he isn't actually going to do IO on.  And if there is pending IO on
> the buffer, it had better only be of the same type as the one you're going
> to do).

Hrm. Why not move setting ->b_end_io to ll_rw_block() while we are at it?
Or into ll_rw_block() wrapper...
							Cheers,
								Al

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7
  2000-12-08 19:13     ` Alexander Viro
@ 2000-12-08 19:39       ` Linus Torvalds
  2000-12-08 19:50         ` Daniel Phillips
  2000-12-08 22:30         ` Alexander Viro
  2000-12-09 15:37       ` David Woodhouse
  1 sibling, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2000-12-08 19:39 UTC (permalink / raw)
  To: Alexander Viro; +Cc: David Woodhouse, linux-kernel



On Fri, 8 Dec 2000, Alexander Viro wrote:
> 
> Erm... So you want to make ->commit_write() page-unlocking? Fine with me,
> but that will make for somewhat bigger patch. Hey, _you_ are in position
> to change the locking rules, freeze or not, so if it's OK with you...

No.

Read the code a bit more.

commit_write() doesn't start any IO at all. It only marks the buffer
dirty.

The dirty flush-out works the way it has always worked.

(You're right in that the dirty flusher should make sure to change
b_end_io when they do their write-outs - that code used to just depend on 
the buffer always having b_end_io magically set)

> Hrm. Why not move setting ->b_end_io to ll_rw_block() while we are at it?
> Or into ll_rw_block() wrapper...

That's too big a change right now, but yes, in theory. That would make
sure that nobody could ever forget to set the "completion" handler.

In the meantime, let's just enforce it for ll_rw_block: make sure that
there is a 1:1 mapping between setting b_end_io and doing a ll_rw_block
(and let's make sure that there are no more of these non-local rules any
more).

Btw, I also think that the dirty buffer flushing should get the page lock.
Right now it touches the buffer list without holding the lock on the page
that the buffer is on, which means that there is really nothign that
prevents it from racing with the page-based writeout. I don't like that:
it does hold the LRU list lock, so the list state itself is ok, but it
does actually touch part of the buffer state that is not really protected
by the lock.

I think it ends up being ok because ll_rw_block will ignore buffers that
get output twice, and the rest of the state is handled with atomic
operations (b_count and b_flags), but it's not really proper. If
flush_dirty_buffers() got the page lock, everything would be protected
properly. Thoughts?

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7
  2000-12-08 19:39       ` Linus Torvalds
@ 2000-12-08 19:50         ` Daniel Phillips
  2000-12-08 21:17           ` Linus Torvalds
  2000-12-08 22:30         ` Alexander Viro
  1 sibling, 1 reply; 21+ messages in thread
From: Daniel Phillips @ 2000-12-08 19:50 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro; +Cc: David Woodhouse, linux-kernel

On Fri, 08 Dec 2000, Linus Torvalds wrote:
> Btw, I also think that the dirty buffer flushing should get the page lock.
> Right now it touches the buffer list without holding the lock on the page
> that the buffer is on, which means that there is really nothign that
> prevents it from racing with the page-based writeout. I don't like that:
> it does hold the LRU list lock, so the list state itself is ok, but it
> does actually touch part of the buffer state that is not really protected
> by the lock.
> 
> I think it ends up being ok because ll_rw_block will ignore buffers that
> get output twice, and the rest of the state is handled with atomic
> operations (b_count and b_flags), but it's not really proper. If
> flush_dirty_buffers() got the page lock, everything would be protected
> properly. Thoughts?

This is great when you have buffersize==pagesize.  When there are
multiple buffers per page it means that some of the buffers might have
to wait for flushing just because bdflush started IO on some other
buffer on the same page.  Oh well.  The common case improves in terms
being proveably correct and the uncommon case gets worse a tiny bit. 
It sounds like a win.

We are moving steadily away from buffer oriented I/O anyway, and I
think we can optimize the case of buffersize!=pagesize in 2.5 by being a
little more careful about how getblk hands out buffers.  Getblk
could force all buffers on the same page to be on the same
major/minor, or even better, to be physically close together.  Or
more simply, flush_dirty_buffers could check for other dirty buffers on
the same page and initiate I/O at the same time.  Or both strategies.

This is by way of buttressing an argument in favor of simplicity
and reliabilty, i.e., being religious about taking the page lock, even
when there's a slight cost in the short term.

-- 
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7
  2000-12-08 19:50         ` Daniel Phillips
@ 2000-12-08 21:17           ` Linus Torvalds
  0 siblings, 0 replies; 21+ messages in thread
From: Linus Torvalds @ 2000-12-08 21:17 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Alexander Viro, David Woodhouse, linux-kernel



On Fri, 8 Dec 2000, Daniel Phillips wrote:
>
> [ flush-buffers taking the page lock ]
> 
> This is great when you have buffersize==pagesize.  When there are
> multiple buffers per page it means that some of the buffers might have
> to wait for flushing just because bdflush started IO on some other
> buffer on the same page.  Oh well.  The common case improves in terms
> being proveably correct and the uncommon case gets worse a tiny bit. 
> It sounds like a win.

Also, I think that we should strive for a setup where most of the dirty
buffer flushing is done through "page_launder()" instead of using
sync_buffers all that much at all. 

I'm convinced that the page LRU list is as least as good as, if not better
than, the dirty buffer timestamp stuff. And as we need to have the page
LRU for other reasons anyway, I'd like the long-range plan to be to get
rid of the buffer LRU completely. It wastes memory and increases
complexity for very little gain, I think.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7
       [not found] <90rjbg$8eso1$1@fido.engr.sgi.com>
@ 2000-12-08 22:22 ` Rajagopal Ananthanarayanan
  0 siblings, 0 replies; 21+ messages in thread
From: Rajagopal Ananthanarayanan @ 2000-12-08 22:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds wrote:
> 
> On Fri, 8 Dec 2000, Daniel Phillips wrote:
> >
> > [ flush-buffers taking the page lock ]
> >
> > This is great when you have buffersize==pagesize.  When there are
> > multiple buffers per page it means that some of the buffers might have
> > to wait for flushing just because bdflush started IO on some other
> > buffer on the same page.  Oh well.  The common case improves in terms
> > being proveably correct and the uncommon case gets worse a tiny bit.
> > It sounds like a win.
> 
> Also, I think that we should strive for a setup where most of the dirty
> buffer flushing is done through "page_launder()" instead of using
> sync_buffers all that much at all.
> 
> I'm convinced that the page LRU list is as least as good as, if not better
> than, the dirty buffer timestamp stuff. And as we need to have the page
> LRU for other reasons anyway, I'd like the long-range plan to be to get
> rid of the buffer LRU completely. It wastes memory and increases
> complexity for very little gain, I think.
> 

I think flushing pages instead of buffers is a good direction to take.
Two things:

1. currently bdflush is setup to use page_launder only
   under memory pressure (if (free_shortage() ... )
   Do you think that it should call page_launder regardless?

2. There are two operations here:
	a. starting a write-back, periodically.
        b. freeing a page, which may involve taking the page
            out of a inode mapping, etc. IOW, what page_launder does.
   bdflush primarily does (a). If we want to move to page-oriented
   flushing, we atleast need extra information in the _page_ structure
   as to whether it is time to flush the page back.


--------------------------------------------------------------------------
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--------------------------------------------------------------------------
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7
  2000-12-08 19:39       ` Linus Torvalds
  2000-12-08 19:50         ` Daniel Phillips
@ 2000-12-08 22:30         ` Alexander Viro
  2000-12-08 22:42           ` Linus Torvalds
  1 sibling, 1 reply; 21+ messages in thread
From: Alexander Viro @ 2000-12-08 22:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David Woodhouse, linux-kernel



On Fri, 8 Dec 2000, Linus Torvalds wrote:

> 
> 
> On Fri, 8 Dec 2000, Alexander Viro wrote:
> > 
> > Erm... So you want to make ->commit_write() page-unlocking? Fine with me,
> > but that will make for somewhat bigger patch. Hey, _you_ are in position
> > to change the locking rules, freeze or not, so if it's OK with you...
> 
> No.
> 
> Read the code a bit more.
> 
> commit_write() doesn't start any IO at all. It only marks the buffer
> dirty.

I'm quite aware of that fact ;-) However, you said 

   On the other hand, I have this suspicion that there is an even simpler
   solution: stop using the end_buffer_io_sync version for writes
   altogether.

If that happens (i.e. if write requests resulting from prepare_write()/
commit_write()/bdflush sequence become async) we must stop unlocking pages
after commit_write(). Essentially it would become unlocker of the same
kind as readpage() and writepage() - callers must assume that page submitted
to commit_write() will eventually be unlocked.

> The dirty flush-out works the way it has always worked.
> 
> (You're right in that the dirty flusher should make sure to change
> b_end_io when they do their write-outs - that code used to just depend on 
> the buffer always having b_end_io magically set)

Hmm... IDGI. Do you want the request resulting from commit_write() ("resulting"
!= "issued") to stay sync (in that case we still need to change
block_write_full_page() since the analysis stands - it dirties blocks
unconditionally) or you want them to go async (in that case we need to change
commit_write() callers)? Could you clarify?

> Btw, I also think that the dirty buffer flushing should get the page lock.
> Right now it touches the buffer list without holding the lock on the page
> that the buffer is on, which means that there is really nothign that
> prevents it from racing with the page-based writeout. I don't like that:
> it does hold the LRU list lock, so the list state itself is ok, but it
> does actually touch part of the buffer state that is not really protected
> by the lock.
> 
> I think it ends up being ok because ll_rw_block will ignore buffers that
> get output twice, and the rest of the state is handled with atomic
> operations (b_count and b_flags), but it's not really proper. If
> flush_dirty_buffers() got the page lock, everything would be protected
> properly. Thoughts?

Umm... I don't think that we need to do it on per-page basis. We need some
exclusion, all right, but I'ld rather provide it from block_write_full_page()
and its ilk. Hell knows...
							Cheers,
								Al

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7
  2000-12-08 22:30         ` Alexander Viro
@ 2000-12-08 22:42           ` Linus Torvalds
  2000-12-09  4:59             ` Alexander Viro
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2000-12-08 22:42 UTC (permalink / raw)
  To: Alexander Viro; +Cc: David Woodhouse, linux-kernel



On Fri, 8 Dec 2000, Alexander Viro wrote:
> 
> I'm quite aware of that fact ;-) However, you said 
> 
>    On the other hand, I have this suspicion that there is an even simpler
>    solution: stop using the end_buffer_io_sync version for writes
>    altogether.
> 
> If that happens (i.e. if write requests resulting from prepare_write()/
> commit_write()/bdflush sequence become async) we must stop unlocking pages
> after commit_write(). Essentially it would become unlocker of the same
> kind as readpage() and writepage() - callers must assume that page submitted
> to commit_write() will eventually be unlocked.

You're right, we can't do that for anonymous buffers right now. Mea culpa.

Looking more at this issue, I suspect that the easiest pretty solution
that everybody can probably agree is reasonable is to either pass down the
end-of-io callback to ll_rw_block as you suggested, or, preferably by just
forcing the _caller_ to do the buffer locking, and just do the b_end_io
stuff inside the buffer lock and get rid of all the races that way
instead (and make ll_rw_block() verify that the buffers it is passed are
always locked).

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7
  2000-12-08 22:42           ` Linus Torvalds
@ 2000-12-09  4:59             ` Alexander Viro
  2000-12-09  8:45               ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Alexander Viro @ 2000-12-09  4:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David Woodhouse, linux-kernel



On Fri, 8 Dec 2000, Linus Torvalds wrote:

> Looking more at this issue, I suspect that the easiest pretty solution
> that everybody can probably agree is reasonable is to either pass down the
> end-of-io callback to ll_rw_block as you suggested, or, preferably by just
> forcing the _caller_ to do the buffer locking, and just do the b_end_io
> stuff inside the buffer lock and get rid of all the races that way
> instead (and make ll_rw_block() verify that the buffers it is passed are
> always locked).

Hmm... I've looked through the ll_rw_block() callers and yes, it seems
to be doable. BTW, we probably want a helper function that would do
lock+ll_rw_block+wait - it will simplify the life in filesystems.
I'll do a patch tonight.
						Cheers,
							Al
PS: alpha-testers needed for sysvfs patches - right now the thing is racey
as hell and unlike minixfs I can't test it myself. It's more or less
straightforward port of minixfs patch that went into the tree sometime
ago, so it shouldn't be too bad, but... IOW, if somebody has boxen to
test the thing on - yell.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7
  2000-12-09  4:59             ` Alexander Viro
@ 2000-12-09  8:45               ` Linus Torvalds
  2000-12-09  8:56                 ` Linus Torvalds
                                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Linus Torvalds @ 2000-12-09  8:45 UTC (permalink / raw)
  To: Alexander Viro; +Cc: David Woodhouse, Kernel Mailing List


On Fri, 8 Dec 2000, Alexander Viro wrote:
> On Fri, 8 Dec 2000, Linus Torvalds wrote:
> 
> > Looking more at this issue, I suspect that the easiest pretty solution
> > that everybody can probably agree is reasonable is to either pass down the
> > end-of-io callback to ll_rw_block as you suggested, or, preferably by just
> > forcing the _caller_ to do the buffer locking, and just do the b_end_io
> > stuff inside the buffer lock and get rid of all the races that way
> > instead (and make ll_rw_block() verify that the buffers it is passed are
> > always locked).
> 
> Hmm... I've looked through the ll_rw_block() callers and yes, it seems
> to be doable.

Looking at this, there's a _lot_ of them.

I've taken a test-approach that is fairly simple:

 - get rid of the old "ll_rw_block()", because it was inherently racey wrt
   bh->b_end_io, I think we all agree that changing bh->b_end_io without
   holding any locks at all is fairly dangerous.

 - instead, a simple "submit_bh()" thing, that takes only one locked
   buffer head at a time, and queues it for IO. The usage would basically
   be

		lock_buffer(bh);
		bh->b_end_io = completion_callback;
		submit_bh(READ|WRITE, bh);

   which is a pretty clean and simple interface that has no obvious
   races - submit_bh() will set bh->b_end_io to the completion callback.

 - BUT BUT BUT: Because of tons of old users of ll_rw_block(), we
   introduce a totally new ll_rw_block() that has the same old interface,
   but basically does

	void ll_rw_block(int op, int nr, struct buffer_head **array)
	{
		int i;

		for (i = 0; i < nr; i++) {
			struct buffer_head * bh = array[i];
			lock_buffer(bh);
			bh->b_end_io = end_buffer_io_sync;
			submit_bh(op, bh);
		}
	}

   Again, the above avoids the race (we never touch b_end_io except with
   the buffer lock held), and allows all old uses of "ll_rw_block()"
   _except_ the ones that wanted to do the fancy async callbacks.

The advantage? All the regular old code that isn't fancy (ie the low-level
filesystems, bread(), breada() etc) will get a working ll_rw_block() with
the semantics they want, and we can pretty much prove that they can never
get an async handler even by mistake.

And the (few) clever routines in fs/buffer.c that currently use
ll_rw_block() with async handlers can just be converted to use the
submit_bh() interface.

This is a preliminary patch that I have not compiled and probably breaks,
but you get the idea. I'm going to sleep, to survive another night with
three small kids.

If somebody wants to run with this, please try it out, but realize that
it's a work-in-progress. And realize that bugs in this area tend to
corrupt filesystems very quickly indeed. I would strongly advice against
trying this patch out without really grokking what it does and feeling
confident that it actually works.

NOTE NOTE NOTE! I've tried to keep the patch small and simple. I've tried
to make all the changes truly straightforward and obvious. I want this bug
squashed, and I don't want to see it again. But I _still_ think this is a
dangerous patch until somebody like Al has given it a green light. Caveat
Emptor.

		Linus

-----
diff -u --recursive t12p7/linux/drivers/block/ll_rw_blk.c linux/drivers/block/ll_rw_blk.c
--- t12p7/linux/drivers/block/ll_rw_blk.c	Thu Dec  7 15:56:25 2000
+++ linux/drivers/block/ll_rw_blk.c	Sat Dec  9 00:40:35 2000
@@ -885,6 +885,36 @@
 	while (q->make_request_fn(q, rw, bh));
 }
 
+
+/*
+ * Submit a buffer head for IO.
+ */
+void submit_bh(int rw, struct buffer_head * bh)
+{
+	if (!test_bit(BH_Lock, &bh->b_state))
+		BUG();
+
+	set_bit(BH_Req, &bh->b_state);
+
+	/*
+	 * First step, 'identity mapping' - RAID or LVM might
+	 * further remap this.
+	 */
+	bh->b_rdev = bh->b_dev;
+	bh->b_rsector = bh->b_blocknr * (bh->b_size>>9);
+
+	generic_make_request(rw, bh);
+}
+
+/*
+ * Default IO end handler, used by "ll_rw_block()".
+ */
+static void end_buffer_io_sync(struct buffer_head *bh, int uptodate)
+{
+	mark_buffer_uptodate(bh, uptodate);
+	unlock_buffer(bh);
+}
+
 /* This function can be used to request a number of buffers from a block
    device. Currently the only restriction is that all buffers must belong to
    the same device */
@@ -931,7 +961,8 @@
 		if (test_and_set_bit(BH_Lock, &bh->b_state))
 			continue;
 
-		set_bit(BH_Req, &bh->b_state);
+		/* We have the buffer lock */
+		bh->b_end_io = end_buffer_io_sync;
 
 		switch(rw) {
 		case WRITE:
@@ -954,17 +985,9 @@
 	end_io:
 			bh->b_end_io(bh, test_bit(BH_Uptodate, &bh->b_state));
 			continue;
-			
 		}
 
-		/*
-		 * First step, 'identity mapping' - RAID or LVM might
-		 * further remap this.
-		 */
-		bh->b_rdev = bh->b_dev;
-		bh->b_rsector = bh->b_blocknr * (bh->b_size>>9);
-
-		generic_make_request(rw, bh);
+		submit_bh(rw, bh);
 	}
 	return;
 
@@ -972,7 +995,6 @@
 	for (i = 0; i < nr; i++)
 		buffer_IO_error(bhs[i]);
 }
-
 
 #ifdef CONFIG_STRAM_SWAP
 extern int stram_device_init (void);
diff -u --recursive t12p7/linux/fs/buffer.c linux/fs/buffer.c
--- t12p7/linux/fs/buffer.c	Thu Dec  7 15:56:26 2000
+++ linux/fs/buffer.c	Sat Dec  9 00:38:20 2000
@@ -758,12 +758,6 @@
 	bh->b_private = private;
 }
 
-static void end_buffer_io_sync(struct buffer_head *bh, int uptodate)
-{
-	mark_buffer_uptodate(bh, uptodate);
-	unlock_buffer(bh);
-}
-
 static void end_buffer_io_bad(struct buffer_head *bh, int uptodate)
 {
 	mark_buffer_uptodate(bh, uptodate);
@@ -1001,7 +995,7 @@
 	 * and it is clean.
 	 */
 	if (bh) {
-		init_buffer(bh, end_buffer_io_sync, NULL);
+		init_buffer(bh, end_buffer_io_bad, NULL);
 		bh->b_dev = dev;
 		bh->b_blocknr = block;
 		bh->b_state = 1 << BH_Mapped;
@@ -1210,7 +1204,6 @@
 
 	if (buffer_uptodate(bh))
 		return(bh);   
-	else ll_rw_block(READ, 1, &bh);
 
 	blocks = (filesize - pos) >> (9+index);
 
@@ -1225,12 +1218,11 @@
 			brelse(bh);
 			break;
 		}
-		else bhlist[j++] = bh;
+		bhlist[j++] = bh;
 	}
 
 	/* Request the read for these buffers, and then release them. */
-	if (j>1)  
-		ll_rw_block(READA, (j-1), bhlist+1); 
+	ll_rw_block(READ, j, bhlist);
 	for(i=1; i<j; i++)
 		brelse(bhlist[i]);
 
@@ -1439,7 +1431,7 @@
 		block = *(b++);
 
 		tail = bh;
-		init_buffer(bh, end_buffer_io_async, NULL);
+		init_buffer(bh, end_buffer_io_bad, NULL);
 		bh->b_dev = dev;
 		bh->b_blocknr = block;
 
@@ -1586,8 +1578,6 @@
 	int err, i;
 	unsigned long block;
 	struct buffer_head *bh, *head;
-	struct buffer_head *arr[MAX_BUF_PER_PAGE];
-	int nr = 0;
 
 	if (!PageLocked(page))
 		BUG();
@@ -1600,6 +1590,8 @@
 
 	bh = head;
 	i = 0;
+
+	/* Stage 1: make sure we have all the buffers mapped! */
 	do {
 		/*
 		 * If the buffer isn't up-to-date, we can't be sure
@@ -1616,28 +1608,32 @@
 			if (buffer_new(bh))
 				unmap_underlying_metadata(bh);
 		}
-		set_bit(BH_Uptodate, &bh->b_state);
-		set_bit(BH_Dirty, &bh->b_state);
+		bh = bh->b_this_page;
+		block++;
+	} while (bh != head);
+
+	/* Stage 2: lock the buffers, mark them dirty */
+	do {
+		lock_buffer(bh);
 		bh->b_end_io = end_buffer_io_async;
 		atomic_inc(&bh->b_count);
-		arr[nr++] = bh;
+		set_bit(BH_Uptodate, &bh->b_state);
+		set_bit(BH_Dirty, &bh->b_state);
 		bh = bh->b_this_page;
-		block++;
 	} while (bh != head);
 
-	if (nr) {
-		ll_rw_block(WRITE, nr, arr);
-	} else {
-		UnlockPage(page);
-	}
+	/* Stage 3: submit the IO */
+	do {
+		submit_bh(WRITE, bh);
+		bh = bh->b_this_page;		
+	} while (bh != head);
+
+	/* Done - end_buffer_io_async will unlock */
 	SetPageUptodate(page);
 	return 0;
+
 out:
-	if (nr) {
-		ll_rw_block(WRITE, nr, arr);
-	} else {
-		UnlockPage(page);
-	}
+	UnlockPage(page);
 	ClearPageUptodate(page);
 	return err;
 }
@@ -1669,7 +1665,6 @@
 			continue;
 		if (block_start >= to)
 			break;
-		bh->b_end_io = end_buffer_io_sync;
 		if (!buffer_mapped(bh)) {
 			err = get_block(inode, block, bh, 1);
 			if (err)
@@ -1766,7 +1761,6 @@
 	unsigned long iblock, lblock;
 	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
 	unsigned int blocksize, blocks;
-	char *kaddr = NULL;
 	int nr, i;
 
 	if (!PageLocked(page))
@@ -1793,35 +1787,40 @@
 					continue;
 			}
 			if (!buffer_mapped(bh)) {
-				if (!kaddr)
-					kaddr = kmap(page);
-				memset(kaddr + i*blocksize, 0, blocksize);
+				memset(kmap(page) + i*blocksize, 0, blocksize);
 				flush_dcache_page(page);
+				kunmap(page);
 				set_bit(BH_Uptodate, &bh->b_state);
 				continue;
 			}
 		}
 
-		init_buffer(bh, end_buffer_io_async, NULL);
-		atomic_inc(&bh->b_count);
 		arr[nr] = bh;
 		nr++;
 	} while (i++, iblock++, (bh = bh->b_this_page) != head);
 
-	if (nr) {
-		if (Page_Uptodate(page))
-			BUG();
-		ll_rw_block(READ, nr, arr);
-	} else {
+	if (!nr) {
 		/*
 		 * all buffers are uptodate - we can set the page
 		 * uptodate as well.
 		 */
 		SetPageUptodate(page);
 		UnlockPage(page);
+		return 0;
 	}
-	if (kaddr)
-		kunmap(page);
+
+	/* Stage two: lock the buffers */
+	for (i = 0; i < nr; i++) {
+		struct buffer_head * bh = arr[i];
+		lock_buffer(bh);
+		bh->b_end_io = end_buffer_io_async;
+		atomic_inc(&bh->b_count);
+	}
+
+	/* Stage 3: start the IO */
+	for (i = 0; i < nr; i++)
+		submit_bh(READ, arr[i]);
+
 	return 0;
 }
 
@@ -1989,7 +1988,6 @@
 	if (Page_Uptodate(page))
 		set_bit(BH_Uptodate, &bh->b_state);
 
-	bh->b_end_io = end_buffer_io_sync;
 	if (!buffer_uptodate(bh)) {
 		err = -EIO;
 		ll_rw_block(READ, 1, &bh);
@@ -2263,67 +2261,31 @@
  */
 int brw_page(int rw, struct page *page, kdev_t dev, int b[], int size)
 {
-	struct buffer_head *head, *bh, *arr[MAX_BUF_PER_PAGE];
-	int nr, fresh /* temporary debugging flag */, block;
+	struct buffer_head *head, *bh;
 
 	if (!PageLocked(page))
 		panic("brw_page: page not locked for I/O");
-//	ClearPageError(page);
-	/*
-	 * We pretty much rely on the page lock for this, because
-	 * create_page_buffers() might sleep.
-	 */
-	fresh = 0;
-	if (!page->buffers) {
-		create_page_buffers(rw, page, dev, b, size);
-		fresh = 1;
-	}
+
 	if (!page->buffers)
+		create_page_buffers(rw, page, dev, b, size);
+
+	head = bh = page->buffers;
+	if (!head)
 		BUG();
 
-	head = page->buffers;
-	bh = head;
-	nr = 0;
+	/* Stage 1: lock all the buffers */
 	do {
-		block = *(b++);
+		lock_buffer(bh);
+		bh->b_end_io = end_buffer_io_async;
+		atomic_inc(&bh->b_count);
+		bh = bh->b_this_page;
+	} while (bh != head);
 
-		if (fresh && (atomic_read(&bh->b_count) != 0))
-			BUG();
-		if (rw == READ) {
-			if (!fresh)
-				BUG();
-			if (!buffer_uptodate(bh)) {
-				arr[nr++] = bh;
-				atomic_inc(&bh->b_count);
-			}
-		} else { /* WRITE */
-			if (!bh->b_blocknr) {
-				if (!block)
-					BUG();
-				bh->b_blocknr = block;
-			} else {
-				if (!block)
-					BUG();
-			}
-			set_bit(BH_Uptodate, &bh->b_state);
-			set_bit(BH_Dirty, &bh->b_state);
-			arr[nr++] = bh;
-			atomic_inc(&bh->b_count);
-		}
+	/* Stage 2: start the IO */
+	do {
+		submit_bh(rw, bh);
 		bh = bh->b_this_page;
 	} while (bh != head);
-	if ((rw == READ) && nr) {
-		if (Page_Uptodate(page))
-			BUG();
-		ll_rw_block(rw, nr, arr);
-	} else {
-		if (!nr && rw == READ) {
-			SetPageUptodate(page);
-			UnlockPage(page);
-		}
-		if (nr && (rw == WRITE))
-			ll_rw_block(rw, nr, arr);
-	}
 	return 0;
 }
 
diff -u --recursive t12p7/linux/include/linux/fs.h linux/include/linux/fs.h
--- t12p7/linux/include/linux/fs.h	Thu Dec  7 15:56:26 2000
+++ linux/include/linux/fs.h	Sat Dec  9 00:27:41 2000
@@ -1193,6 +1193,7 @@
 extern struct buffer_head * get_hash_table(kdev_t, int, int);
 extern struct buffer_head * getblk(kdev_t, int, int);
 extern void ll_rw_block(int, int, struct buffer_head * bh[]);
+extern void submit_bh(int, struct buffer_head *);
 extern int is_read_only(kdev_t);
 extern void __brelse(struct buffer_head *);
 static inline void brelse(struct buffer_head *buf)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7
  2000-12-09  8:45               ` Linus Torvalds
@ 2000-12-09  8:56                 ` Linus Torvalds
  2000-12-09 10:40                 ` Alexander Viro
  2000-12-09 14:00                 ` David S. Miller
  2 siblings, 0 replies; 21+ messages in thread
From: Linus Torvalds @ 2000-12-09  8:56 UTC (permalink / raw)
  To: Alexander Viro; +Cc: David Woodhouse, Kernel Mailing List



On Sat, 9 Dec 2000, Linus Torvalds wrote:
> 
> This is a preliminary patch that I have not compiled and probably breaks,
> but you get the idea. I'm going to sleep, to survive another night with
> three small kids.

Call me stupid [ Chorus: "You're stupid, Linus" ], but I actually compiled
and booted this remotely. And it came up. Which is a pretty good sign that
it doesn't have anything really broken in it.

Still, it would be good to have a few other sharp eyes looking it over.
Look sclean and obviously correct to me, but then _everything_ I write
always looks obviously correct yo me.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7
  2000-12-09  8:45               ` Linus Torvalds
  2000-12-09  8:56                 ` Linus Torvalds
@ 2000-12-09 10:40                 ` Alexander Viro
  2000-12-09 12:56                   ` Andries Brouwer
  2000-12-09 17:28                   ` Linus Torvalds
  2000-12-09 14:00                 ` David S. Miller
  2 siblings, 2 replies; 21+ messages in thread
From: Alexander Viro @ 2000-12-09 10:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Woodhouse, Ingo Molnar, Mikulas Patocka,
	Kernel Mailing List



On Sat, 9 Dec 2000, Linus Torvalds wrote:

> This is a preliminary patch that I have not compiled and probably breaks,
> but you get the idea. I'm going to sleep, to survive another night with
> three small kids.
> 
> If somebody wants to run with this, please try it out, but realize that
> it's a work-in-progress. And realize that bugs in this area tend to
> corrupt filesystems very quickly indeed. I would strongly advice against
> trying this patch out without really grokking what it does and feeling
> confident that it actually works.
> 
> NOTE NOTE NOTE! I've tried to keep the patch small and simple. I've tried
> to make all the changes truly straightforward and obvious. I want this bug
> squashed, and I don't want to see it again. But I _still_ think this is a
> dangerous patch until somebody like Al has given it a green light. Caveat
> Emptor.
 
> @@ -1001,7 +995,7 @@
>  	 * and it is clean.
>  	 */
>  	if (bh) {
> -		init_buffer(bh, end_buffer_io_sync, NULL);
> +		init_buffer(bh, end_buffer_io_bad, NULL);

	How about NULL?

> @@ -1210,7 +1204,6 @@
[breada()]
	Umm... why do we keep it, in the first place? AFAICS the only
in-tree user is hpfs_map_sector() and it doesn't look like we really
need it there. OTOH, trimming the buffer.c down is definitely nice.
Mikulas?

> @@ -1439,7 +1431,7 @@
[create_page_buffers()]
	Linus, compare the result with create_empty_buffers(). Then look
at the only caller and notice that it will merrily loop over these
buffer_heads. IOW, I propose to mark them mapped and set the ->b_blocknr
in brw_page(), replace create_page_buffers() call with create_empty_buffers()
and let create_page_buffers() go to hell. Oh, and let create_empty_buffers()
take device as an argument, not inode as it does now.

> @@ -1600,6 +1590,8 @@
[__block_write_full_page()]
>  
>  	bh = head;
>  	i = 0;
> +
> +	/* Stage 1: make sure we have all the buffers mapped! */
>  	do {
>  		/*
>  		 * If the buffer isn't up-to-date, we can't be sure

Ahem. Think what will happen if we are sitting over the hole in file,
map some buffers and fail in the middle of the fun. Notice that we
do not submit any IO requests in that case, i.e. we just had created
a random crap in file.

More recovery needed here. I would just break out of the mapping loop,
then proceed as above, but limited the activity to mapped blocks only.
And did if (err) ClearPageUptodate(page) in the end.

> @@ -1669,7 +1665,6 @@
[__block_prepare_write()]

Same problem with recovery - see the patch I've sent recently (handling
get_block() failures).

> @@ -1793,35 +1787,40 @@
[block_read_full_page()]
> -		init_buffer(bh, end_buffer_io_async, NULL);
	Fine
> -		atomic_inc(&bh->b_count);

	Why? It's cleaner the old way - why bother postponing that until we
lock the thing?

> +	/* Stage two: lock the buffers */
> +	for (i = 0; i < nr; i++) {
> +		struct buffer_head * bh = arr[i];
> +		lock_buffer(bh);
> +		bh->b_end_io = end_buffer_io_async;
> +		atomic_inc(&bh->b_count);

See above.

> @@ -2263,67 +2261,31 @@
>   */
>  int brw_page(int rw, struct page *page, kdev_t dev, int b[], int size)

>  	if (!page->buffers)
> +		create_page_buffers(rw, page, dev, b, size);

		create_empty_buffers(page, dev, size);

> +
> +	head = bh = page->buffers;
> +	if (!head)
>  		BUG();
>  
> -	head = page->buffers;
> -	bh = head;
> -	nr = 0;
> +	/* Stage 1: lock all the buffers */
>  	do {
> -		block = *(b++);
> +		lock_buffer(bh);
> +		bh->b_end_io = end_buffer_io_async;

		bh->b_blocknr = *(b++);
		set_bit(BH_Mapped, &bh->b_state);

	Modulo the comments above - fine with me. However, there is stuff in
drivers/md that I don't understand. Ingo, could you comment on the use of
->b_end_io there?

	Another bad thing is in mm/filemap.c::writeout_one_page() - it doesn't
even check for buffers being mapped, let alone attempt to map them.
	Fortunately, ext2 doesn't use it these days, but the rest of block
filesystems... <doubletake> WTF? fsync() merrily ignores mmap()'ed stuff?
I mean, we can mmap() an area, dirty the bloody thing, call fsync() and
get zero traffic. Hmm... OTOH, it might be correct behaviour...
	Anyway, fsync() on block filesystems other than ext2 needs fixing.
Badly. I'll port Stephen's patch to them.

	IMO patch is mostly safe (the worst part is error recovery on
block_write_full_page()), except the writeout_one_page() part and possibly
the RAID stuff.
	Linus, Ingo, Mikulas - comments?
							Cheers,
								Al

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7
  2000-12-09 10:40                 ` Alexander Viro
@ 2000-12-09 12:56                   ` Andries Brouwer
  2000-12-09 13:11                     ` Alexander Viro
  2000-12-09 17:28                   ` Linus Torvalds
  1 sibling, 1 reply; 21+ messages in thread
From: Andries Brouwer @ 2000-12-09 12:56 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Linus Torvalds, David Woodhouse, Ingo Molnar, Mikulas Patocka,
	Kernel Mailing List

On Sat, Dec 09, 2000 at 05:40:47AM -0500, Alexander Viro wrote:

> > @@ -1210,7 +1204,6 @@
> [breada()]
> 	Umm... why do we keep it, in the first place? AFAICS the only
> in-tree user is hpfs_map_sector() and it doesn't look like we really
> need it there. OTOH, trimming the buffer.c down is definitely nice.
> Mikulas?

Throw it out. The number of users has diminished over time.
Recently isofs stopped using breada.
The hpfs use was broken, I fixed it a bit some time ago, but
there is nothing against throwing it out altogether, I think.

Andries
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7
  2000-12-09 12:56                   ` Andries Brouwer
@ 2000-12-09 13:11                     ` Alexander Viro
  2000-12-09 21:25                       ` Mikulas Patocka
  0 siblings, 1 reply; 21+ messages in thread
From: Alexander Viro @ 2000-12-09 13:11 UTC (permalink / raw)
  To: Andries Brouwer
  Cc: Linus Torvalds, David Woodhouse, Ingo Molnar, Mikulas Patocka,
	Kernel Mailing List



On Sat, 9 Dec 2000, Andries Brouwer wrote:

> On Sat, Dec 09, 2000 at 05:40:47AM -0500, Alexander Viro wrote:
> 
> > > @@ -1210,7 +1204,6 @@
> > [breada()]
> > 	Umm... why do we keep it, in the first place? AFAICS the only
> > in-tree user is hpfs_map_sector() and it doesn't look like we really
> > need it there. OTOH, trimming the buffer.c down is definitely nice.
> > Mikulas?
> 
> Throw it out. The number of users has diminished over time.
> Recently isofs stopped using breada.
> The hpfs use was broken, I fixed it a bit some time ago, but
> there is nothing against throwing it out altogether, I think.

I've looked at the use of hpfs_map_sector() (and hpfs_map_4sectors() - sorry)
and it looks like we would be better off doing getblk() on affected sectors
and ll_rw_block() on the whole bunch - we end up calling breada() for
increasing block numbers with decreasing readahead window anyway.

So it probably should go - it gives no real win. Mikulas has the final
word here - he is the HPFS maintainer, so...

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7
  2000-12-09  8:45               ` Linus Torvalds
  2000-12-09  8:56                 ` Linus Torvalds
  2000-12-09 10:40                 ` Alexander Viro
@ 2000-12-09 14:00                 ` David S. Miller
  2 siblings, 0 replies; 21+ messages in thread
From: David S. Miller @ 2000-12-09 14:00 UTC (permalink / raw)
  To: torvalds; +Cc: viro, dwmw2, linux-kernel

   Date: 	Sat, 9 Dec 2000 00:45:51 -0800 (PST)
   From: Linus Torvalds <torvalds@transmeta.com>

    out:
   -	if (nr) {
   -		ll_rw_block(WRITE, nr, arr);
   -	} else {
   -		UnlockPage(page);
   -	}
   +	UnlockPage(page);
	   ClearPageUptodate(page);
	   return err;
    }
   @@ -1669,7 +1665,6 @@

It would seem that you would want to unlock the page _after_ clearing
the uptodate bit to make sure people sleeping on the page do not see
it set by accident.

Later,
David S. Miller
davem@redhat.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7
  2000-12-08 19:13     ` Alexander Viro
  2000-12-08 19:39       ` Linus Torvalds
@ 2000-12-09 15:37       ` David Woodhouse
  1 sibling, 0 replies; 21+ messages in thread
From: David Woodhouse @ 2000-12-09 15:37 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, linux-kernel

On Fri, 8 Dec 2000, Alexander Viro wrote:

> BTW, what do you think about the following:
> 	* add_to_page_cache() is not exported and never used. Kill?

I have my eye on that for execute-in-place of romfs from real ROM chips -
making up struct pages for parts of the ROM chips and inserting them into
the page cache. I'd rather you didn't remove it :)

-- 
dwmw2


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7
  2000-12-09 10:40                 ` Alexander Viro
  2000-12-09 12:56                   ` Andries Brouwer
@ 2000-12-09 17:28                   ` Linus Torvalds
  2000-12-09 18:43                     ` Andi Kleen
  1 sibling, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2000-12-09 17:28 UTC (permalink / raw)
  To: Alexander Viro
  Cc: David Woodhouse, Ingo Molnar, Mikulas Patocka,
	Kernel Mailing List



On Sat, 9 Dec 2000, Alexander Viro wrote:
> 	Fine
> > -		atomic_inc(&bh->b_count);
> 
> 	Why? It's cleaner the old way - why bother postponing that until we
> lock the thing?

I had a rule: we always do the "lock_buffer()" and "b_count increment"
together with setting "b_end_io = end_buffer_io_async". Why? Because that
way we pair up the actions, and I could easily verify that every single
user of "end_buffer_io_async" will increment the count (that is
decremented in "end_buffer_io_async").

We never used to have any rules in this area, and it was sometimes hard to
match up the actions with each other.

> >  int brw_page(int rw, struct page *page, kdev_t dev, int b[], int size)
> 
> >  	if (!page->buffers)
> > +		create_page_buffers(rw, page, dev, b, size);
> 
> 		create_empty_buffers(page, dev, size);

Agreed.

> 	Modulo the comments above - fine with me. However, there is stuff in
> drivers/md that I don't understand. Ingo, could you comment on the use of
> ->b_end_io there?

I already sent him mail about it for the same reason. 

> 	Another bad thing is in mm/filemap.c::writeout_one_page() - it doesn't
> even check for buffers being mapped, let alone attempt to map them.
> 	Fortunately, ext2 doesn't use it these days, but the rest of block
> filesystems... <doubletake> WTF? fsync() merrily ignores mmap()'ed stuff?

fsync() has _always_ ignored mmap'ed stuff. 

If you want to get your mappings synchronized, you absolutely positively
have to call "msync()".

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7
  2000-12-09 17:28                   ` Linus Torvalds
@ 2000-12-09 18:43                     ` Andi Kleen
  0 siblings, 0 replies; 21+ messages in thread
From: Andi Kleen @ 2000-12-09 18:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:


> > 	Modulo the comments above - fine with me. However, there is stuff in
> > drivers/md that I don't understand. Ingo, could you comment on the use of
> > ->b_end_io there?
> 
> I already sent him mail about it for the same reason. 

How about sending mail to all the journaling FS groups too? -- the change is surely
to break all the JFS which use usually b_end_io to submit the data after the journal
has been flushed.


-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7
  2000-12-09 13:11                     ` Alexander Viro
@ 2000-12-09 21:25                       ` Mikulas Patocka
  2000-12-10  1:11                         ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Mikulas Patocka @ 2000-12-09 21:25 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Andries Brouwer, Linus Torvalds, David Woodhouse, Ingo Molnar,
	Kernel Mailing List

On Sat, 9 Dec 2000, Alexander Viro wrote:

> On Sat, 9 Dec 2000, Andries Brouwer wrote:
> 
> > On Sat, Dec 09, 2000 at 05:40:47AM -0500, Alexander Viro wrote:
> > 
> > > > @@ -1210,7 +1204,6 @@
> > > [breada()]
> > > 	Umm... why do we keep it, in the first place? AFAICS the only
> > > in-tree user is hpfs_map_sector() and it doesn't look like we really
> > > need it there. OTOH, trimming the buffer.c down is definitely nice.
> > > Mikulas?
> > 
> > Throw it out. The number of users has diminished over time.
> > Recently isofs stopped using breada.
> > The hpfs use was broken, I fixed it a bit some time ago, but
> > there is nothing against throwing it out altogether, I think.
> 
> I've looked at the use of hpfs_map_sector() (and hpfs_map_4sectors() - sorry)
> and it looks like we would be better off doing getblk() on affected sectors
> and ll_rw_block() on the whole bunch - we end up calling breada() for
> increasing block numbers with decreasing readahead window anyway.
> 
> So it probably should go - it gives no real win. Mikulas has the final
> word here - he is the HPFS maintainer, so...

I did a test. I disabled readahead except for reading all 4 buffers in
map_4sectors.

I observed 14% slowdown on walking directories with find and 4% slowdown
on grepping one my working directory (10M, 281 files).

If you can't make it otherwise you can rip readahead out. If there is a
possibility to keep it, it would be better.

Mikulas


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7
  2000-12-09 21:25                       ` Mikulas Patocka
@ 2000-12-10  1:11                         ` Linus Torvalds
  0 siblings, 0 replies; 21+ messages in thread
From: Linus Torvalds @ 2000-12-10  1:11 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alexander Viro, Andries Brouwer, David Woodhouse, Ingo Molnar,
	Kernel Mailing List



On Sat, 9 Dec 2000, Mikulas Patocka wrote:
> 
> I did a test. I disabled readahead except for reading all 4 buffers in
> map_4sectors.
> 
> I observed 14% slowdown on walking directories with find and 4% slowdown
> on grepping one my working directory (10M, 281 files).
> 
> If you can't make it otherwise you can rip readahead out. If there is a
> possibility to keep it, it would be better.

The absolutely best thing would be to keep the directories in the page
cache. At which point the whole issue becomes pretty moot and we could use
the page-cache readahead code. Which gets the end right, and can handle
stuff that isn't physically contiguous on disk.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2000-12-10  1:43 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <90rjbg$8eso1$1@fido.engr.sgi.com>
2000-12-08 22:22 ` [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7 Rajagopal Ananthanarayanan
2000-12-08  9:58 [found?] " Alexander Viro
2000-12-08 18:11 ` [PATCH] " Alexander Viro
2000-12-08 18:48   ` Linus Torvalds
2000-12-08 19:13     ` Alexander Viro
2000-12-08 19:39       ` Linus Torvalds
2000-12-08 19:50         ` Daniel Phillips
2000-12-08 21:17           ` Linus Torvalds
2000-12-08 22:30         ` Alexander Viro
2000-12-08 22:42           ` Linus Torvalds
2000-12-09  4:59             ` Alexander Viro
2000-12-09  8:45               ` Linus Torvalds
2000-12-09  8:56                 ` Linus Torvalds
2000-12-09 10:40                 ` Alexander Viro
2000-12-09 12:56                   ` Andries Brouwer
2000-12-09 13:11                     ` Alexander Viro
2000-12-09 21:25                       ` Mikulas Patocka
2000-12-10  1:11                         ` Linus Torvalds
2000-12-09 17:28                   ` Linus Torvalds
2000-12-09 18:43                     ` Andi Kleen
2000-12-09 14:00                 ` David S. Miller
2000-12-09 15:37       ` David Woodhouse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox