Re: fsync fixes for 2.4

public inbox for linux-kernel@vger.kernel.org

* Re: fsync fixes for 2.4
@ 2002-07-12 21:52 Griffiths, Richard A
  2002-07-12 22:21 ` Andrew Morton
  2002-07-15 10:07 ` Andrea Arcangeli
  0 siblings, 2 replies; 14+ messages in thread
From: Griffiths, Richard A @ 2002-07-12 21:52 UTC (permalink / raw)
  To: 'Andrea Arcangeli'
  Cc: 'Andrew Morton', 'Marcelo Tosatti',
	'linux-kernel@vger.kernel.org',
	'Carter K. George', 'Don Norton',
	'James S. Tybur', Gross, Mark

Mark is off climbing Mt. Hood, so he asked me to post the data on the fsync
patch.
It appears from these results that there is no appreciable improvement using
the fsync patch - there is a slight loss of top end on 4 adapters 1 drive.
It's worth noting that these numbers are a definite improvement over
generic 2.4.18.
Something has been done right in 2.4.19rc1.  I included an excerpt of the
data that Mark posted for 2.4.18 at the end of this email.

As to scaling - well, with 4 adapters it scales in reverse.  With two
adapters, going from one to three drives scales in the right direction;
from then on it's in decline.
Here's the data.
The system is a dual PCI/66 box with 4 good SCSI cards with up to 6
15KRPM ST318452LC drives per card (dual 1.2GHz Pentium 3 with 2GB RAM,
running kernels with high mem support == OFF).
 
http://www.intel.com/design/servers/scb2/index.htm?iid=ipp_browse+motherbd_scb2&

http://www.adaptec.com/worldwide/product/proddetail.html?sess=no&prodkey=ASC-39160&cat=Products

The benchmark we are using is bonnie++ version 1.02
http://www.coker.com.au/bonnie++

Running on 2.4.19rc1 base, 8KB writes to a 1GB file on an ext2 filesystem:

 2 adapters (on separate PCI buses)
 Drives per card  Total system throughput EXT2        
           1                    105983 KB/sec                 
           2                    179214 KB/sec                  
           3                    180237 KB/sec
           4                    178795 KB/sec
           5                    175484 KB/sec
           6                    172903 KB/sec                 
 
 4 adapters
 Drives per card  Total system throughput EXT2          
           1                     184150 KB/sec                  
           2                     165774 KB/sec                   
           3                     160775 KB/sec
           4                     158326 KB/sec
           5                     157291 KB/sec
           6                     155901 KB/sec

===	===	===	===	===	===

Running on 2.4.19rc1 with the fsync patch:
 2 adapters (on separate PCI buses)
 Drives per card  Total system throughput EXT2        
           1                    107940 KB/sec                 
           2                    176749 KB/sec                  
           3                    181073 KB/sec
           4                    177064 KB/sec
           5                    175080 KB/sec
           6                    173583 KB/sec                 
 
 4 adapters
 Drives per card  Total system throughput EXT2          
           1                     176876 KB/sec                  
           2                     164800 KB/sec                   
           3                     161371 KB/sec
           4                     158792 KB/sec
           5                     156509 KB/sec
           6                     155913 KB/sec    
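
To put the two tables above side by side, here is a minimal sketch of the
arithmetic. The helper names are made up for illustration; the inputs are the
KB/sec figures from the tables.

```c
#include <assert.h>

/* Relative change from the base kernel to the patched kernel, in percent.
 * A negative result means the patched kernel was slower. */
static double pct_change(double base_kb_s, double patched_kb_s)
{
    assert(base_kb_s > 0.0);
    return (patched_kb_s - base_kb_s) / base_kb_s * 100.0;
}

/* Average throughput per drive for a given adapter/drive configuration. */
static double per_drive_kb_s(double total_kb_s, int adapters,
                             int drives_per_card)
{
    assert(adapters > 0 && drives_per_card > 0);
    return total_kb_s / (adapters * drives_per_card);
}
```

With the numbers above, 4 adapters with 1 drive each comes out at roughly
-3.95% for the fsync patch (184150 vs 176876 KB/sec), and 2 adapters with 1
drive each at roughly +1.85%.  On the base kernel with 4 adapters, the
per-drive average drops from 46037.5 KB/sec at 1 drive per card to about
6496 KB/sec at 6 drives per card.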
----------------------------------------------------------------------------
Mark's original data on 2.4.18:

> Running on 2.4.18 base, 8KB writes to 300MB files I get the following data
> when run on the ext 2 file system.
> 
> 2 adapters (on separate PCI buses)
> Drives per card  Total system throughput EXT2          EXT3
>           1                     73357 KB/sec           92095 KB/sec
>           2                    115953 KB/sec          110956 KB/sec
>           3                    132176 KB/sec
>           4                    139578 KB/sec
>           5                    139085 KB/sec
>           6                    140033 KB/sec          106883 KB/sec
> 
> 4 adapters
> Drives per card  Total system throughput EXT2          EXT3
>           1                    125282 KB/sec          121125 KB/sec
>           2                    146632 KB/sec          117575 KB/sec
>           3                    146622 KB/sec
>           4                    142472 KB/sec
>           5                    142560 KB/sec
>           6                    138835 KB/sec          116570 KB/sec



* Re: fsync fixes for 2.4
  2002-07-12 21:52 fsync fixes for 2.4 Griffiths, Richard A
@ 2002-07-12 22:21 ` Andrew Morton
  2002-07-15 10:07 ` Andrea Arcangeli
  1 sibling, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2002-07-12 22:21 UTC (permalink / raw)
  To: Griffiths, Richard A
  Cc: 'Andrea Arcangeli', 'Marcelo Tosatti',
	'linux-kernel@vger.kernel.org',
	'Carter K. George', 'Don Norton',
	'James S. Tybur', Gross, Mark

"Griffiths, Richard A" wrote:
> 
> ...
> Running on 2.4.19rc1 base 8KB writes to a 1GB file  on an
> ext2 filesystem:
> 
>  2 adapters (on separate PCI buses)
>  Drives per card  Total system throughput EXT2
>            1                    105983 KB/sec
>            2                    179214 KB/sec
>            3                    180237 KB/sec
>            4                    178795 KB/sec
>            5                    175484 KB/sec
>            6                    172903 KB/sec
> 
>  4 adapters
>  Drives per card  Total system throughput EXT2
>            1                     184150 KB/sec
>            2                     165774 KB/sec
>            3                     160775 KB/sec
>            4                     158326 KB/sec
>            5                     157291 KB/sec
>            6                     155901 KB/sec

Well I know what the problem is, but I don't know how to fix
it in 2.4.

With 4 adapters and six disks on each, you have 27 processes
which are responsible for submitting IO: the 24 bonnies,
kswapd, bdflush and kupdate.

If one or two of the request queues gets filled up, *all* these
threads hit those queues and go to sleep.  Nobody is submitting
IO for the other queues and they fall idle.

In 2.4, we have a global LRU of dirty buffers and everyone walks
that list in old->new order.  It has a jumble of buffers which
are dirty against all the queues so inevitably, as soon as one
queue fills up it blocks everyone.
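
A toy user-space model of that effect, as a sketch only: the struct, the
function names and the slot counts are invented for illustration and are not
kernel code.  Every dirty buffer just records which queue it is headed for,
and a submitter "sleeps" (stops) as soon as it meets a buffer whose queue is
full.

```c
#include <assert.h>
#include <stddef.h>

#define NQUEUES 4

/* Hypothetical request queue: only tracks how many slots are free. */
struct model_queue {
    int free_slots;
};

/* Walk the global, mixed list in old->new order.  The submitter stops at
 * the first buffer whose queue is full, even if later buffers belong to
 * idle queues.  Returns the number of buffers submitted. */
static size_t walk_global_list(const int *buf_queue, size_t nbufs,
                               struct model_queue *queues)
{
    size_t submitted = 0;

    for (size_t i = 0; i < nbufs; i++) {
        assert(buf_queue[i] >= 0 && buf_queue[i] < NQUEUES);
        struct model_queue *q = &queues[buf_queue[i]];
        if (q->free_slots == 0)
            break;              /* blocked here; the rest never get queued */
        q->free_slots--;
        submitted++;
    }
    return submitted;
}

/* Same list, but the submitter skips buffers that are not for its own
 * queue.  This is the naive skip-over variant; on a real list the search
 * itself is the expensive part. */
static size_t walk_own_queue(const int *buf_queue, size_t nbufs,
                             struct model_queue *queues, int mine)
{
    size_t submitted = 0;

    assert(mine >= 0 && mine < NQUEUES);
    for (size_t i = 0; i < nbufs; i++) {
        if (buf_queue[i] != mine)
            continue;
        if (queues[mine].free_slots == 0)
            break;
        queues[mine].free_slots--;
        submitted++;
    }
    return submitted;
}
```

With a list like {0,0,0,1,2,3,1,2,3} and two free slots per queue, the global
walk stops at the third buffer and leaves queues 1 to 3 completely idle, while
a submitter restricted to queue 1 can still fill its own queue.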

A naive fix would be to get callers of balance_dirty() to skip
over buffers in that queue which do not belong to their blockdev.
But the CPU cost of that search would be astronomical.

A more intrusive fix would be to make callers of balance_dirty()
walk the superblock->inode->i_dirty[_data]_buffers list instead
of the buffer LRU.  That's a sort-of-2.5 approach.

But even that wouldn't help, because then you hit the second
problem: your 24 bonnie threads hit the same queue congestion
in the page reclaim code when they encounter dirty pages on
the page LRU.

I have a fix for the first problem in 2.5.  And the second problem
(the page reclaim code) I have sorta-bandaided.

So.  Hard.

I haven't tested this yet, but you may get some benefit from
this patch:

--- linux-2.4.19-rc1/drivers/block/ll_rw_blk.c	Thu Jul  4 02:01:16 2002
+++ linux-akpm/drivers/block/ll_rw_blk.c	Fri Jul 12 15:28:42 2002
@@ -359,6 +359,7 @@ int blk_grow_request_list(request_queue_
 	q->batch_requests = q->nr_requests / 4;
 	if (q->batch_requests > 32)
 		q->batch_requests = 32;
+	q->batch_requests = 1;
 	spin_unlock_irqrestore(&io_request_lock, flags);
 	return q->nr_requests;
 }
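
The context lines of the hunk show how the default is derived, and the added
line simply overrides it.  A small sketch of that sizing rule follows; the
function names are hypothetical.  What batch_requests controls is not spelled
out in this thread; my understanding (an assumption, check ll_rw_blk.c) is
that it is the number of requests that must be freed before sleepers on a
full queue are woken, so forcing it to 1 would wake submitters as soon as a
single request frees up.

```c
#include <assert.h>

/* Default from the context lines: a quarter of the queue depth,
 * capped at 32. */
static int default_batch_requests(int nr_requests)
{
    assert(nr_requests >= 0);
    int batch = nr_requests / 4;
    if (batch > 32)
        batch = 32;
    return batch;
}

/* With the one-line patch applied, the computed value is discarded. */
static int patched_batch_requests(int nr_requests)
{
    (void)default_batch_requests(nr_requests);
    return 1;
}
```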


* Re: fsync fixes for 2.4
  2002-07-12 21:52 fsync fixes for 2.4 Griffiths, Richard A
  2002-07-12 22:21 ` Andrew Morton
@ 2002-07-15 10:07 ` Andrea Arcangeli
  2002-07-15 18:36   ` Andrew Morton
  2002-07-17 14:44   ` mgross
  1 sibling, 2 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2002-07-15 10:07 UTC (permalink / raw)
  To: Griffiths, Richard A
  Cc: 'Andrew Morton', 'Marcelo Tosatti',
	'linux-kernel@vger.kernel.org',
	'Carter K. George', 'Don Norton',
	'James S. Tybur', Gross, Mark

On Fri, Jul 12, 2002 at 02:52:11PM -0700, Griffiths, Richard A wrote:
> Mark is off climbing Mt. Hood, so he asked me to post the data on the fsync
> patch.
> It appears from these results that there is no appreciable improvement using
> the
> fsync patch - there is a slight loss of top end on 4 adapters 1 drive. 

That's very much expected. As I said, with my new design, by adding an
additional pass (a third pass) I could remove the slight loss that I
expected from the simple patch that puts wait_on_buffer right in the
first pass.

I mentioned this in my first email of the thread, so it looks all right.
For an rc2, accepting the slight loss sounds like the simplest approach.

If you care about it, we can fix it with my new fsync accounting design;
just let me know if you're interested. Personally I'm pretty much fine
with it this way too: as I said in the first email, if we block it's
likely that bdflush is pumping the queue for us. The slowdown is most
probably due to the too-early unplug of the queue generated by the
blocking points.

As for the scaling with async flushes to multiple devices, 2.4 has a
single flushing thread; 2.5, as Andrew said, (partly) fixes this with
multiple pdflush threads, as he explained to me at OLS. The only issue
I've seen in his design is that it works per superblock, so if a
filesystem sits on top of an LVM volume backed by a dozen different hard
disks, only one pdflush will pump those dozen physical request queues,
because the first pdflush entering the superblock forbids other pdflush
threads from working on the same superblock. So the first physical queue
that fills up will prevent pdflush from pushing more dirty pages to the
other, possibly empty, physical queues.

Without LVM or RAID that doesn't matter though, nor does it matter with
hardware RAID.

Andrea


* Re: fsync fixes for 2.4
  2002-07-15 10:07 ` Andrea Arcangeli
@ 2002-07-15 18:36   ` Andrew Morton
  2002-07-17 14:44   ` mgross
  1 sibling, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2002-07-15 18:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Griffiths, Richard A, 'Marcelo Tosatti',
	'linux-kernel@vger.kernel.org',
	'Carter K. George', 'Don Norton',
	'James S. Tybur', Gross, Mark

Andrea Arcangeli wrote:
> 
> ...
> as for the scaling with async flushes to multiple devices, 2.4 has a
> single flushing thread, 2.5 as Andrew said (partly) fixes this as he
> explained me at OLS, with multiple pdflush. The only issue I seen in his
> design is that he works based on superblocks, so if a filesystem is on
> top of a lvm backed by a dozen of different harddisks, only one pdflush
> will pump on those dozen physical request queues, because the first
> pdflush entering the superblock will forbid other pdflush to work on the
> same superblock too. So the first physical queue that is full, will
> forbid pdflush to push more dirty pages to the other possibly empty
> physical queues.

Well.  There's no way in which we can get effective writeback against
200 spindles by relying on pdflush, so that daemon is mainly there
to permit background writeback under light-to-moderate loads.

Once things get heavy, the only sane approach is to use the actual
caller of write(2) as the resource for performing the writeback.
As we're currently doing, in balance_dirty[_pages]().  But the
problem there is that in both 2.4 and 2.5, a caller to that function
can easily get stuck on the wrong queue, and bandwidth really suffers.

I've been working on changing 2.5 so that the write(2) caller no
longer performs a general "writeback of everything" - that caller
instead performs writeback specifically against the queue which
he just dirtied.  Do this by using the address_space->backing_dev_info
as a key during a search across the superblocks and blockdev inodes.
That works quite well.
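
As a rough illustration of "writeback against the queue he just dirtied",
here is a user-space sketch.  Everything in it is an assumption for the sake
of the example: the struct, the function name, and the use of a plain integer
as a stand-in for the address_space->backing_dev_info key.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical inode: which backing device it lives on and how many
 * dirty pages it currently has. */
struct model_inode {
    int bdi;            /* stand-in for address_space->backing_dev_info */
    int dirty_pages;
};

/* Write back at most `budget` pages, touching only inodes whose backing
 * device matches `bdi`.  Returns the number of pages written. */
static int writeback_for_bdi(struct model_inode *inodes, size_t n,
                             int bdi, int budget)
{
    int written = 0;

    assert(budget >= 0);
    for (size_t i = 0; i < n && written < budget; i++) {
        if (inodes[i].bdi != bdi)
            continue;           /* some other queue: leave it alone */
        int chunk = inodes[i].dirty_pages;
        if (chunk > budget - written)
            chunk = budget - written;
        inodes[i].dirty_pages -= chunk;
        written += chunk;
    }
    return written;
}
```

The point of the key is that a writer dirtying device 0 never ends up queueing
I/O for (and sleeping on) device 1's queue.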

But there's still a problem where pdflush goes to writeback a queue
and fills it up, so the userspace program ends up blocking (due to
pdflush's activity) when it really should not.  Still undecided about
what to do about that.

And yes, point taken on the LVM thing.  If the chunk size is reasonably
small (a few megabytes) then we should normally get decent concurrency,
but there will be corner-cases.

-


* Re: fsync fixes for 2.4
  2002-07-15 10:07 ` Andrea Arcangeli
  2002-07-15 18:36   ` Andrew Morton
@ 2002-07-17 14:44   ` mgross
  2002-07-17 20:05     ` Andrea Arcangeli
  1 sibling, 1 reply; 14+ messages in thread
From: mgross @ 2002-07-17 14:44 UTC (permalink / raw)
  To: Andrea Arcangeli, Griffiths, Richard A
  Cc: 'Andrew Morton', 'Marcelo Tosatti',
	'linux-kernel@vger.kernel.org',
	'Carter K. George', 'Don Norton',
	'James S. Tybur', Gross, Mark

On Monday 15 July 2002 06:07 am, Andrea Arcangeli wrote:
> On Fri, Jul 12, 2002 at 02:52:11PM -0700, Griffiths, Richard A wrote:
> > Mark is off climbing Mt. Hood, so he asked me to post the data on the
> > fsync patch.

I was excited to report the significant improvement of 2.4.19rc1+fsync fix 
over 2.4.18 and didn't realize that the improvement was not due to the fsync 
patch.   I'm so glad Richard did a careful check, I was on my way out the 
door for my vacation :)

I would like to know what's so good about 2.4.19rc1 that gives our block I/O 
benchmark that significant improvement over 2.4.18.

> > It appears from these results that there is no appreciable improvement
> > using the
> > fsync patch - there is a slight loss of top end on 4 adapters 1 drive.
>
> that's very much expected, as said with my new design by adding an
> additional pass (third pass), I could remove the slight loss that I
> expected from the simple patch that puts wait_on_buffer right in the
> first pass.
>
> I mentioned this in my first email of the thread, so it looks all right.
> For a rc2 the slight loss sounds like the simplest approach.
>
> If you care about it, with my new fsync accounting design we can fix it,
> just let me know if you're interested about it. Personally I'm pretty
> much fine with it this way too, as said in the first email if we block
> it's likely bdflush is pumping the queue for us. the slowdown is most
> probably due too early unplug of the queue generated by the blocking
> points.

I don't care about the very slight (and possibly in the noise floor of our 
test) reduction in throughput due to the fsync fix.  I think your and 
Andrew's assertion of the bdflush / dirty page handling getting stopped up 
is likely the problem preventing scaling to my personal goal of 250 to 
300MB/sec on our setup.

Thanks,

Mark Gross
PS I had a very nice time on Mount Hood.  I didn't make it to the top this 
time: too much snow had melted off the top of the thing to have a safe attempt 
at the summit.  It was a guided (http://www.timberlinemtguides.com) 3 day 
climb. 


* Re: fsync fixes for 2.4
  2002-07-17 14:44   ` mgross
@ 2002-07-17 20:05     ` Andrea Arcangeli
  0 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2002-07-17 20:05 UTC (permalink / raw)
  To: mgross
  Cc: Griffiths, Richard A, 'Andrew Morton',
	'Marcelo Tosatti', 'linux-kernel@vger.kernel.org',
	'Carter K. George', 'Don Norton',
	'James S. Tybur', Gross, Mark

On Wed, Jul 17, 2002 at 10:44:18AM -0400, mgross wrote:
> On Monday 15 July 2002 06:07 am, Andrea Arcangeli wrote:
> > On Fri, Jul 12, 2002 at 02:52:11PM -0700, Griffiths, Richard A wrote:
> > > Mark is off climbing Mt. Hood, so he asked me to post the data on the
> > > fsync patch.
> 
> I was excited to report the significant improvement of 2.4.19rc1+fsync fix 
> over 2.4.18 and didn't realize that the improvement was not due to the fsync 
> patch.   I'm so glad Richard did a careful check, I was on my way out the 
> door for my vacation :)
> 
> I would like to know what's so good about 2.4.19rc1 that gives our block I/O 
> benchmark that significant improvement over 2.4.18.

That should be the effect of the first part of my vm updates that went into 2.4.19pre.

> > > It appears from these results that there is no appreciable improvement
> > > using the
> > > fsync patch - there is a slight loss of top end on 4 adapters 1 drive.
> >
> > that's very much expected, as said with my new design by adding an
> > additional pass (third pass), I could remove the slight loss that I
> > expected from the simple patch that puts wait_on_buffer right in the
> > first pass.
> >
> > I mentioned this in my first email of the thread, so it looks all right.
> > For a rc2 the slight loss sounds like the simplest approach.
> >
> > If you care about it, with my new fsync accounting design we can fix it,
> > just let me know if you're interested about it. Personally I'm pretty
> > much fine with it this way too, as said in the first email if we block
> > it's likely bdflush is pumping the queue for us. the slowdown is most
> > probably due too early unplug of the queue generated by the blocking
> > points.
> 
> I don't care about the very slight (and possibly in the noise floor of our 
> test) reduction in throughput due to the fsync fix.  I think your's and 
> Andrews'  assertion of the bdflush / dirty page handling getting stopped up 
> is likely the problem preventing scaling to my personal goal of 250 to 
> 300MB/sec on our setup.

Yep, it should be. Actually, running multiple fsyncs from multiple tasks
will more or less work around the single-threaded async flushing of 2.4.

> 
> Thanks,
> 
> Mark Gross
> PS I had a very nice time on mount hood.  I didn't make it to the top this 
> time too much snow had melted off the top of the thing to have a safe attempt 
> at the summit.  It was a guided (http://www.timberlinemtguides.com) 3 day 
> climb. 

:)

Andrea


* fsync fixes for 2.4
@ 2002-07-10 20:20 Andrea Arcangeli
  2002-07-11 20:21 ` Marcelo Tosatti
  2002-07-11 21:57 ` J.A. Magallon
  0 siblings, 2 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2002-07-10 20:20 UTC (permalink / raw)
  To: Andrew Morton, Marcelo Tosatti
  Cc: linux-kernel, Carter K. George, Don Norton, James S. Tybur

At polyserve they found a severe problem with fsync in 2.4.

In short, write_buffer (ll_rw_block in mainline) is a noop if old I/O
is in flight. The buffer can be made dirty while I/O is in flight, and
in that case fsync would return without flushing the dirty buffers at
all. Their proposed fix is straightforward: a wait_on_buffer before the
ll_rw_block guarantees that somebody marked the bh locked _after_ we
wrote to it. This may actually decrease performance because we would
wait in the middle of the flushes, but if the dirty inodes are already
getting written under us, bdflush/kupdate are probably keeping the I/O
pipeline filled, and the wait_on_buffer blockage shouldn't be a common
case (I don't see any difference in some basic tests; bandwidth is fully
used).  If it turns out to be a problem, with my new infrastructure (see
below) it should be easy to do the ll_rw_block during the second pass.
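
The race is easier to see in a toy user-space model.  This is only a sketch:
the struct and the model_* functions are invented stand-ins for buffer_head,
ll_rw_block and wait_on_buffer, and "waiting" is simulated by completing the
in-flight write on the spot.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a buffer_head. */
struct model_bh {
    bool dirty;
    bool locked;        /* true while a write is in flight */
    int  mem_data;      /* contents currently in memory */
    int  inflight;      /* contents captured when the write was queued */
    int  disk_data;     /* contents that have reached the disk */
};

/* Queue a write unless the buffer is already locked (the noop case). */
static bool model_ll_rw_block(struct model_bh *bh)
{
    if (bh->locked || !bh->dirty)
        return false;
    bh->dirty = false;
    bh->locked = true;
    bh->inflight = bh->mem_data;
    return true;
}

/* Waiting is modelled as "the in-flight write completes now". */
static void model_wait_on_buffer(struct model_bh *bh)
{
    if (bh->locked) {
        bh->disk_data = bh->inflight;
        bh->locked = false;
    }
}

/* fsync for one buffer; returns true if the latest data reached disk. */
static bool model_fsync(struct model_bh *bh, bool wait_first)
{
    if (wait_first)
        model_wait_on_buffer(bh);   /* the proposed fix */
    model_ll_rw_block(bh);
    model_wait_on_buffer(bh);
    return bh->disk_data == bh->mem_data;
}

/* Racy starting state: a write of old data (1) is in flight, then the
 * buffer is dirtied again with new data (2). */
static struct model_bh model_racy_buffer(void)
{
    struct model_bh bh = { .dirty = true, .mem_data = 1 };
    bool queued = model_ll_rw_block(&bh);

    assert(queued);
    (void)queued;
    bh.mem_data = 2;
    bh.dirty = true;
    return bh;
}
```

Calling model_fsync() on the racy buffer with wait_first == false returns
false: only the old data is on disk and the buffer is still dirty.  With
wait_first == true the second write is actually queued and the call returns
true.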

The other part of the patch was related to the fact the whole fsync has
to be running inside the i_sem, so there cannot be writes, truncates or
other parallel fsyncs, but there can be writepages in background that
will mark the buffers dirty for example during get_block(create). So it
is also possible the buffer is marked dirty while it's in the private
tmp "fsync" list. The original patch according to the commentary was
trying to detect a dirty buffer during the second pass and to put it
back into the inode dirty list if it was dirty (using refile_buffer btw,
that won't do that and it was a noop in that sense), but that's not
needed because the insert_inode_queue functions are always reinserting
the bh into the inode, so as soon as the buffer will be marked dirty it
will go automatically back into the main inode dirty list. And as said
refile_buffer could do nothing like that anyways, so I just rejected
that part.

To give an example, what will happen is that we flush the bh, so it's
only locked, and we put it into the private local list. Then, before we
wait on it in the second pass over the private local list, it is marked
dirty and goes back into the inode list under us. But that's not a
problem, because it's locked and we'll wait for it later in the expensive
third and last osync pass (this is also mentioned in the last comment
for fsync_buffers_list). I guess either the osync third pass or the
unconditional rolling of the bh every time it's marked dirty has been
overlooked.

However, I really didn't like the unconditional rolling of the bh into
the inode list every time it is marked dirty, nor did I like the osync
pass, which can make fsync wait indefinitely, so I changed a few bits in
that area, also considering that bh->b_inode cannot change until somebody
runs remove_inode_queue(bh).

Now it should be a bit more efficient during writebacks (no cacheline
trashing every time we overwrite a data or metadata bh that is already
dirty in cache) and we'll skip the osync pass. osync_buffers_list stays
there in case anybody wants to use it (even if it's not exported at this
time).  If "somebody" submits the bh itself, so that it only needs to
wait for locked buffers and never to flush dirty buffers, it can use
osync_buffers_list. See the inode.c comments in generic_osync_inode
to know why nobody is using it now (in short, nobody is submitting the bh
directly; they just mark the bh dirty and then we write them during the
generic_osync_inode that calls our fsync_buffers_list, which is now
smart enough not to need the expensive osync).

Last but not least, while reviewing this code I also noticed that fsync
writes the buffers backwards, which is not the best for most hard disks.
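
A sketch of why the order comes out backwards, assuming buffers are dirtied
in ascending block order and queued with head insertion as list_add() does.
The list implementation below is a minimal user-space imitation, not the
kernel's list.h, and the names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly-linked list node carrying a block number. */
struct node {
    int blocknr;
    struct node *prev, *next;
};

static void list_init(struct node *head)
{
    head->prev = head->next = head;
}

/* Insert right after the head, like the kernel's list_add(). */
static void list_add_head(struct node *n, struct node *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

static void list_unlink(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* Drain the list into out[], taking entries from the front (->next) or
 * from the tail (->prev).  Returns the number of entries drained. */
static size_t drain(struct node *head, int from_tail, int *out, size_t cap)
{
    size_t n = 0;

    assert(out != NULL);
    while (head->next != head && n < cap) {
        struct node *e = from_tail ? head->prev : head->next;
        list_unlink(e);
        out[n++] = e->blocknr;
    }
    return n;
}
```

Adding blocks 10, 11, 12 and draining from the front yields 12, 11, 10;
draining from the tail yields 10, 11, 12.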

All of this is fixed by this patch against 2.4.19rc1. 

Comments?

diff -urNp 2.4.19rc1/fs/buffer.c z/fs/buffer.c
--- 2.4.19rc1/fs/buffer.c	Fri Jul  5 12:20:47 2002
+++ z/fs/buffer.c	Wed Jul 10 20:52:41 2002
@@ -587,20 +587,20 @@ struct buffer_head * get_hash_table(kdev
 void buffer_insert_inode_queue(struct buffer_head *bh, struct inode *inode)
 {
 	spin_lock(&lru_list_lock);
-	if (bh->b_inode)
-		list_del(&bh->b_inode_buffers);
-	bh->b_inode = inode;
-	list_add(&bh->b_inode_buffers, &inode->i_dirty_buffers);
+	if (!bh->b_inode) {
+		bh->b_inode = inode;
+		list_add(&bh->b_inode_buffers, &inode->i_dirty_buffers);
+	}
 	spin_unlock(&lru_list_lock);
 }

 void buffer_insert_inode_data_queue(struct buffer_head *bh, struct inode *inode)
 {
 	spin_lock(&lru_list_lock);
-	if (bh->b_inode)
-		list_del(&bh->b_inode_buffers);
-	bh->b_inode = inode;
-	list_add(&bh->b_inode_buffers, &inode->i_dirty_data_buffers);
+	if (!bh->b_inode) {
+		bh->b_inode = inode;
+		list_add(&bh->b_inode_buffers, &inode->i_dirty_data_buffers);
+	}
 	spin_unlock(&lru_list_lock);
 }

@@ -819,37 +819,40 @@ inline void set_buffer_async_io(struct b
  * forever if somebody is actively writing to the file.
  *
  * Do this in two main stages: first we copy dirty buffers to a
- * temporary inode list, queueing the writes as we go.  Then we clean
+ * temporary list, queueing the writes as we go.  Then we clean
  * up, waiting for those writes to complete.
  * 
  * During this second stage, any subsequent updates to the file may end
- * up refiling the buffer on the original inode's dirty list again, so
- * there is a chance we will end up with a buffer queued for write but
- * not yet completed on that list.  So, as a final cleanup we go through
- * the osync code to catch these locked, dirty buffers without requeuing
- * any newly dirty buffers for write.
+ * up marking some of our private bh dirty, so we must refile them
+ * into the original inode's dirty list again during the second stage.
  */
 int fsync_buffers_list(struct list_head *list)
 {
 	struct buffer_head *bh;
-	struct inode tmp;
-	int err = 0, err2;
-	
-	INIT_LIST_HEAD(&tmp.i_dirty_buffers);
+	int err = 0;

+	LIST_HEAD(tmp);
 	spin_lock(&lru_list_lock);

 	while (!list_empty(list)) {
-		bh = BH_ENTRY(list->next);
+		bh = BH_ENTRY(list->prev);
 		list_del(&bh->b_inode_buffers);
 		if (!buffer_dirty(bh) && !buffer_locked(bh))
 			bh->b_inode = NULL;
 		else {
-			bh->b_inode = &tmp;
-			list_add(&bh->b_inode_buffers, &tmp.i_dirty_buffers);
+			list_add(&bh->b_inode_buffers, &tmp);
 			if (buffer_dirty(bh)) {
 				get_bh(bh);
 				spin_unlock(&lru_list_lock);
+				/*
+				 * Wait I/O completion before submitting
+				 * the buffer, to be sure the write will
+				 * be effective on the latest data in
+				 * the buffer. (otherwise - if there's old
+				 * I/O in flight - write_buffer would become
+				 * a noop)
+				 */
+				wait_on_buffer(bh);
 				ll_rw_block(WRITE, 1, &bh);
 				brelse(bh);
 				spin_lock(&lru_list_lock);
@@ -857,9 +860,20 @@ int fsync_buffers_list(struct list_head 
 		}
 	}

-	while (!list_empty(&tmp.i_dirty_buffers)) {
-		bh = BH_ENTRY(tmp.i_dirty_buffers.prev);
-		remove_inode_queue(bh);
+	while (!list_empty(&tmp)) {
+		bh = BH_ENTRY(tmp.prev);
+		list_del(&bh->b_inode_buffers);
+		/*
+		 * If the buffer is been made dirty again
+		 * during the fsync (for example from a ->writepage
+		 * that doesn't take the i_sem), just make sure not
+		 * to lose track of it, put it back the buffer into
+		 * its inode queue.
+		 */
+		if (!buffer_dirty(bh))
+			bh->b_inode = NULL;
+		else
+			list_add(&bh->b_inode_buffers, &bh->b_inode->i_dirty_buffers);
 		get_bh(bh);
 		spin_unlock(&lru_list_lock);
 		wait_on_buffer(bh);
@@ -870,12 +884,8 @@ int fsync_buffers_list(struct list_head 
 	}

 	spin_unlock(&lru_list_lock);
-	err2 = osync_buffers_list(list);

-	if (err)
-		return err;
-	else
-		return err2;
+	return err;
 }

 /*
@@ -887,6 +897,10 @@ int fsync_buffers_list(struct list_head 
  * you dirty the buffers, and then use osync_buffers_list to wait for
  * completion.  Any other dirty buffers which are not yet queued for
  * write will not be flushed to disk by the osync.
+ *
+ * Nobody uses this functionality right now because everybody marks the bh
+ * dirty and then use fsync_buffers_list() to first flush them and then
+ * wait completion on them. (see inode.c generic_osync_inode for more details)
  */
 static int osync_buffers_list(struct list_head *list)
 {

Andrea


* Re: fsync fixes for 2.4
  2002-07-10 20:20 Andrea Arcangeli
@ 2002-07-11 20:21 ` Marcelo Tosatti
  2002-07-11 22:57   ` Andrea Arcangeli
  2002-07-11 21:57 ` J.A. Magallon
  1 sibling, 1 reply; 14+ messages in thread
From: Marcelo Tosatti @ 2002-07-11 20:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-kernel, Carter K. George, Don Norton,
	James S. Tybur



On Wed, 10 Jul 2002, Andrea Arcangeli wrote:

> At polyserve they found a severe problem with fsync in 2.4.
>
> In short the write_buffer (ll_rw_block of mainline) is a noop if old I/O
> is in flight. You know the buffer can be made dirty while I/O is in
> flight, and in such case fsync would return without flushing the dirty
> buffers at all. Their proposed fix is straightforward, just a
> wait_on_buffer before the ll_rw_block will guarantee somebody marked the
> bh locked _after_ we wrote to it.

From what I can see the problem goes like:


thread1				thread2
				writepage(page) (marks the buffers clean, page is
				locked for IO)

mark_buffer_dirty()

fsync()

fsync_buffers_list() finds
the dirtied buffer, but since
it's locked ll_rw_block() returns
without queueing the data.

fsync_buffers_list() waits on the writepage()'s
write to return but not on latest data write.


Is that what you mean, or am I misunderstanding something?



* Re: fsync fixes for 2.4
  2002-07-11 20:21 ` Marcelo Tosatti
@ 2002-07-11 22:57   ` Andrea Arcangeli
  2002-07-12  0:51     ` Marcelo Tosatti
  0 siblings, 1 reply; 14+ messages in thread
From: Andrea Arcangeli @ 2002-07-11 22:57 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Andrew Morton, linux-kernel, Carter K. George, Don Norton,
	James S. Tybur

On Thu, Jul 11, 2002 at 05:21:24PM -0300, Marcelo Tosatti wrote:
> 
> 
> On Wed, 10 Jul 2002, Andrea Arcangeli wrote:
> 
> > At polyserve they found a severe problem with fsync in 2.4.
> >
> > In short the write_buffer (ll_rw_block of mainline) is a noop if old I/O
> > is in flight. You know the buffer can be made dirty while I/O is in
> > flight, and in such case fsync would return without flushing the dirty
> > buffers at all. Their proposed fix is straightforward, just a
> > wait_on_buffer before the ll_rw_block will guarantee somebody marked the
> > bh locked _after_ we wrote to it.
> 
> From what I can see the problem goes like:
> 
> 
> thread1				thread2
> 				writepage(page) (marks the buffers clean, page is
> 				locked for IO)
> 
> mark_buffer_dirty()
> 
> fsync()
> 
> fsync_buffers_list() finds
> the dirtied buffer, but since
> its locked ll_rw_block() returns
> without queueing the data.
> 
> fsync_buffers_list() waits on the writepage()'s
> write to return but not on latest data write.
> 
> 
> Is that what you mean or I'm misunderstanding something?

yes, that's it.

Andrea


* Re: fsync fixes for 2.4
  2002-07-11 22:57   ` Andrea Arcangeli
@ 2002-07-12  0:51     ` Marcelo Tosatti
  2002-07-12  1:52       ` Andrea Arcangeli
  0 siblings, 1 reply; 14+ messages in thread
From: Marcelo Tosatti @ 2002-07-12  0:51 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-kernel, Carter K. George, Don Norton,
	James S. Tybur



On Fri, 12 Jul 2002, Andrea Arcangeli wrote:

> On Thu, Jul 11, 2002 at 05:21:24PM -0300, Marcelo Tosatti wrote:
> >
> >
> > On Wed, 10 Jul 2002, Andrea Arcangeli wrote:
> >
> > > At polyserve they found a severe problem with fsync in 2.4.
> > >
> > > In short the write_buffer (ll_rw_block of mainline) is a noop if old I/O
> > > is in flight. You know the buffer can be made dirty while I/O is in
> > > flight, and in such case fsync would return without flushing the dirty
> > > buffers at all. Their proposed fix is straightforward, just a
> > > wait_on_buffer before the ll_rw_block will guarantee somebody marked the
> > > bh locked _after_ we wrote to it.
> >
> > From what I can see the problem goes like:
> >
> >
> > thread1				thread2
> > 				writepage(page) (marks the buffers clean, page is
> > 				locked for IO)
> >
> > mark_buffer_dirty()
> >
> > fsync()
> >
> > fsync_buffers_list() finds
> > the dirtied buffer, but since
> > its locked ll_rw_block() returns
> > without queueing the data.
> >
> > fsync_buffers_list() waits on the writepage()'s
> > write to return but not on latest data write.
> >
> >
> > Is that what you mean or I'm misunderstanding something?
>
> yes, that's it.

So I'm just going to add wait_on_buffer() on fsync_buffers_list() before the
ll_rw_block().

Nothing else, since all of the other stuff on your patch seems to be
improvements rather than bug fixes. ACK?





* Re: fsync fixes for 2.4
  2002-07-12  0:51     ` Marcelo Tosatti
@ 2002-07-12  1:52       ` Andrea Arcangeli
  2002-07-12  2:59         ` Marcelo Tosatti
  0 siblings, 1 reply; 14+ messages in thread
From: Andrea Arcangeli @ 2002-07-12  1:52 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Andrew Morton, linux-kernel, Carter K. George, Don Norton,
	James S. Tybur

On Thu, Jul 11, 2002 at 09:51:29PM -0300, Marcelo Tosatti wrote:
> 
> 
> On Fri, 12 Jul 2002, Andrea Arcangeli wrote:
> 
> > On Thu, Jul 11, 2002 at 05:21:24PM -0300, Marcelo Tosatti wrote:
> > >
> > >
> > > On Wed, 10 Jul 2002, Andrea Arcangeli wrote:
> > >
> > > > At polyserve they found a severe problem with fsync in 2.4.
> > > >
> > > > In short the write_buffer (ll_rw_block of mainline) is a noop if old I/O
> > > > is in flight. You know the buffer can be made dirty while I/O is in
> > > > flight, and in such case fsync would return without flushing the dirty
> > > > > buffers at all. Their proposed fix is straightforward, just a
> > > > wait_on_buffer before the ll_rw_block will guarantee somebody marked the
> > > > bh locked _after_ we wrote to it.
> > >
> > > > From what I can see the problem goes like:
> > >
> > >
> > > thread1				thread2
> > > 				writepage(page) (marks the buffers clean, page is
> > > 				locked for IO)
> > >
> > > mark_buffer_dirty()
> > >
> > > fsync()
> > >
> > > fsync_buffers_list() finds
> > > the dirtied buffer, but since
> > > > it's locked ll_rw_block() returns
> > > without queueing the data.
> > >
> > > fsync_buffers_list() waits on the writepage()'s
> > > write to return but not on latest data write.
> > >
> > >
> > > > Is that what you mean, or am I misunderstanding something?
> >
> > yes, that's it.
> 
> So I'm just going to add wait_on_page() in fsync_buffers_list() before the
> ll_rw_block().
> 
> Nothing else, since all of the other stuff on your patch seems to be
> improvements rather than bug fixes. ACK?

Agreed, for an rc2 that's certainly the best approach, thanks.

Andrea


* Re: fsync fixes for 2.4
  2002-07-12  1:52       ` Andrea Arcangeli
@ 2002-07-12  2:59         ` Marcelo Tosatti
  0 siblings, 0 replies; 14+ messages in thread
From: Marcelo Tosatti @ 2002-07-12  2:59 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-kernel, Carter K. George, Don Norton,
	James S. Tybur



On Fri, 12 Jul 2002, Andrea Arcangeli wrote:

> agreed, for an rc2 that's certainly the best approach, thanks.

It's already in my BK repo.



* Re: fsync fixes for 2.4
  2002-07-10 20:20 Andrea Arcangeli
  2002-07-11 20:21 ` Marcelo Tosatti
@ 2002-07-11 21:57 ` J.A. Magallon
  2002-07-11 23:00   ` Andrea Arcangeli
  1 sibling, 1 reply; 14+ messages in thread
From: J.A. Magallon @ 2002-07-11 21:57 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel


On 2002.07.10 Andrea Arcangeli wrote:
>At polyserve they found a severe problem with fsync in 2.4.
[patch trimmed]

Does this also apply to -aa, or do other changes make it unnecessary?

-- 
J.A. Magallon             \   Software is like sex: It's better when it's free
mailto:jamagallon@able.es  \                    -- Linus Torvalds, FSF T-shirt
Linux werewolf 2.4.19-rc1-jam2, Mandrake Linux 8.3 (Cooker) for i586
gcc (GCC) 3.1.1 (Mandrake Linux 8.3 3.1.1-0.7mdk)


* Re: fsync fixes for 2.4
  2002-07-11 21:57 ` J.A. Magallon
@ 2002-07-11 23:00   ` Andrea Arcangeli
  0 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2002-07-11 23:00 UTC (permalink / raw)
  To: J.A. Magallon; +Cc: linux-kernel

On Thu, Jul 11, 2002 at 11:57:39PM +0200, J.A. Magallon wrote:
> 
> On 2002.07.10 Andrea Arcangeli wrote:
> >At polyserve they found a severe problem with fsync in 2.4.
> [patch trimmed]
> 
> Does this also apply to -aa, or do other changes make it unnecessary?

I submitted it for mainline before porting it to -aa, so -aa is missing
it too at the moment (I just have it in my devel tree, so it will be
automatically included in the next one).

Andrea


Thread overview: 14+ messages
2002-07-12 21:52 fsync fixes for 2.4 Griffiths, Richard A
2002-07-12 22:21 ` Andrew Morton
2002-07-15 10:07 ` Andrea Arcangeli
2002-07-15 18:36   ` Andrew Morton
2002-07-17 14:44   ` mgross
2002-07-17 20:05     ` Andrea Arcangeli
  -- strict thread matches above, loose matches on Subject: below --
2002-07-10 20:20 Andrea Arcangeli
2002-07-11 20:21 ` Marcelo Tosatti
2002-07-11 22:57   ` Andrea Arcangeli
2002-07-12  0:51     ` Marcelo Tosatti
2002-07-12  1:52       ` Andrea Arcangeli
2002-07-12  2:59         ` Marcelo Tosatti
2002-07-11 21:57 ` J.A. Magallon
2002-07-11 23:00   ` Andrea Arcangeli
