fsync() Performance Issue

* fsync() Performance Issue
@ 2002-04-26 20:28 berthiaume_wayne
  2002-04-29 16:20 ` Russell Coker
  2002-04-30 14:20 ` Oleg Drokin
  0 siblings, 2 replies; 30+ messages in thread
From: berthiaume_wayne @ 2002-04-26 20:28 UTC
  To: reiserfs-list

[-- Attachment #1: Type: text/plain, Size: 1323 bytes --]

	I'm wondering if anyone out there may have some suggestions on how
to improve the performance of a system employing fsync(). I have to be able
to guarantee that every write to my fileserver is on disk when the client has
passed it to the server. Therefore, I have disabled write cache on the disk
and issue an fsync() per file. I'm running 2.4.19-pre7, reiserfs 3.6.25,
without additional patches. I have seen some discussions out here about
various other "speed-up" patches and am wondering if I need to add these to
2.4.19-pre7? And what they are and where can I obtain said patches? Also,
I'm wondering if there is another solution to syncing the data that is
faster than fsync(). Testing, thus far, has shown a large disparity between
running with and without sync. Another idea is to explore another filesystem,
but I'm not exactly excited by the other journaling filesystems out there at
this time. All ideas will be greatly appreciated.

Wayne
EMC Corp
Centera Engineering
4400 Computer Drive
M/S F213
Westboro,  MA    01580

email:       Berthiaume_Wayne@emc.com
voice:       (508) 898-6564
pager:     (888) 769-4578  (numeric)
                8007208398.9801155@pagenet.net  (alpha)
fax:          (508) 898-6388

"One man can make a difference, and every man should try."  - JFK

 <<Wayne Berthiaume (E-mail).vcf>> 


* Re: fsync() Performance Issue
  2002-04-26 20:28 fsync() Performance Issue berthiaume_wayne
@ 2002-04-29 16:20 ` Russell Coker
  2002-04-29 16:30   ` Chris Mason
  2002-04-29 16:32   ` Toby Dickenson
  2002-04-30 14:20 ` Oleg Drokin
  1 sibling, 2 replies; 30+ messages in thread
From: Russell Coker @ 2002-04-29 16:20 UTC
  To: berthiaume_wayne, reiserfs-list

On Fri, 26 Apr 2002 22:28, berthiaume_wayne@emc.com wrote:

It's interesting to note your email address and what it implies...

> 	I'm wondering if anyone out there may have some suggestions on how
> to improve the performance of a system employing fsync(). I have to be able
> to guarantee that every write to my fileserver is on disk when the client
> has passed it to the server. Therefore, I have disabled write cache on the
> disk and issue an fsync() per file. I'm running 2.4.19-pre7, reiserfs
> 3.6.25, without additional patches. I have seen some discussions out here
> about various other "speed-up" patches and am wondering if I need to add
> these to 2.4.19-pre7? And what they are and where can I obtain said
> patches? Also, I'm wondering if there is another solution to syncing the
> data that is faster than fsync(). Testing, thus far, has shown a large
> disparity between running with and without sync. Another idea is to explore
> another filesystem, but I'm not exactly excited by the other journaling
> filesystems out there at this time. All ideas will be greatly appreciated.

These issues have been discussed a few times, but not with any results as 
exciting as you might hope for.  One which was mentioned was using 
fdatasync() instead of fsync().

One thing that has occurred to me (which has not been previously discussed as 
far as I recall) is the possibility for using sync() instead of fsync() if 
you can accumulate a number of files (and therefore replace many fsync()'s 
with one sync() ).
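
The fdatasync() suggestion above can be sketched roughly as follows. This is
only an illustration, assuming a POSIX system; the helper name
write_durably() is hypothetical and is not taken from any patch in this
thread. fdatasync() flushes the file's data but may skip metadata that is not
needed to read the data back (timestamps, for example). Whether that is
measurably faster than fsync() on a given kernel and filesystem is something
to test, not something this sketch shows.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes to path and return only after
 * fdatasync() has reported success. Returns 0 on success, -1 on error. */
int write_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before writing: retry */
            close(fd);
            return -1;
        }
        p += n;                    /* handle short writes */
        len -= (size_t)n;
    }

    /* Flush the data (and only the metadata needed to retrieve it). */
    if (fdatasync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

A caller would acknowledge the client only after write_durably() returns 0.
Swapping fdatasync() for fsync() in the helper gives the behaviour described
in the original post.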

-- 
If you send email to me or to a mailing list that I use which has >4 lines
of legalistic junk at the end then you are specifically authorizing me to do
whatever I wish with the message and all other messages from your domain, by
posting the message you agree that your long legalistic sig is void.


* Re: fsync() Performance Issue
  2002-04-29 16:20 ` Russell Coker
@ 2002-04-29 16:30   ` Chris Mason
  2002-04-29 16:32   ` Toby Dickenson
  1 sibling, 0 replies; 30+ messages in thread
From: Chris Mason @ 2002-04-29 16:30 UTC
  To: Russell Coker; +Cc: berthiaume_wayne, reiserfs-list

On Mon, 2002-04-29 at 12:20, Russell Coker wrote:
> On Fri, 26 Apr 2002 22:28, berthiaume_wayne@emc.com wrote:
> 
> It's interesting to note your email address and what it implies...
> 
> > 	I'm wondering if anyone out there may have some suggestions on how
> > to improve the performance of a system employing fsync(). I have to be able
> > to guarantee that every write to my fileserver is on disk when the client
> > has passed it to the server. Therefore, I have disabled write cache on the
> > disk and issue an fsync() per file. I'm running 2.4.19-pre7, reiserfs
> > 3.6.25, without additional patches. I have seen some discussions out here
> > about various other "speed-up" patches and am wondering if I need to add
> > these to 2.4.19-pre7? And what they are and where can I obtain said
> > patches? Also, I'm wondering if there is another solution to syncing the
> > data that is faster than fsync(). Testing, thus far, has shown a large
> > disparity between running with and without sync. Another idea is to explore
> > another filesystem, but I'm not exactly excited by the other journaling
> > filesystems out there at this time. All ideas will be greatly appreciated.
> 
> These issues have been discussed a few times, but not with any results as 
> exciting as you might hope for.  One which was mentioned was using 
> fdatasync() instead of fsync().

The speedup patches should help fsync some, since they make it much more
likely a commit will be done without the journal lock held.

If all the writes on the FS end up being done through fsync, the data
logging patches might help a lot.  These should be ready for broader
testing this week.

If you are using IDE drives, the write barrier patches are almost enough
to allow you to turn on write caching safely.  They make sure metadata
triggers proper drive cache flushes; I can try to rig up something that
will also trigger a cache flush on data syncs.

-chris




* Re: fsync() Performance Issue
  2002-04-29 16:20 ` Russell Coker
  2002-04-29 16:30   ` Chris Mason
@ 2002-04-29 16:32   ` Toby Dickenson
  2002-04-29 16:45     ` Chris Mason
  2002-04-29 17:56     ` Matthias Andree
  1 sibling, 2 replies; 30+ messages in thread
From: Toby Dickenson @ 2002-04-29 16:32 UTC
  To: Russell Coker; +Cc: berthiaume_wayne, reiserfs-list

On Mon, 29 Apr 2002 18:20:18 +0200, Russell Coker
<russell@coker.com.au> wrote:

>On Fri, 26 Apr 2002 22:28, berthiaume_wayne@emc.com wrote:
>
>It's interesting to note your email address and what it implies...
>
>> 	I'm wondering if anyone out there may have some suggestions on how
>> to improve the performance of a system employing fsync(). I have to be able
>> to guarantee that every write to my fileserver is on disk when the client
>> has passed it to the server. Therefore, I have disabled write cache on the
>> disk and issue an fsync() per file. I'm running 2.4.19-pre7, reiserfs
>> 3.6.25, without additional patches. I have seen some discussions out here
>> about various other "speed-up" patches and am wondering if I need to add
>> these to 2.4.19-pre7? And what they are and where can I obtain said
>> patches? Also, I'm wondering if there is another solution to syncing the
>> data that is faster than fsync(). Testing, thus far, has shown a large
>> disparity between running with and without sync. Another idea is to explore
>> another filesystem, but I'm not exactly excited by the other journaling
>> filesystems out there at this time. All ideas will be greatly appreciated.
>
>These issues have been discussed a few times, but not with any results as 
>exciting as you might hope for.  One which was mentioned was using 
>fdatasync() instead of fsync().
>
>One thing that has occurred to me (which has not been previously discussed as 
>far as I recall) is the possibility for using sync() instead of fsync() if 
>you can accumulate a number of files (and therefore replace many fsync()'s 
>with one sync() ).

I can see

write to file A
write to file B
write to file C
sync

might be faster than

write to file A
fsync A
write to file B
fsync B
write to file C
fsync C

but is it possible for it to be faster than

write to file A
write to file B
write to file C
fsync A
fsync B
fsync C

?



Toby Dickenson
tdickenson@geminidataloggers.com


* Re: fsync() Performance Issue
  2002-04-29 16:32   ` Toby Dickenson
@ 2002-04-29 16:45     ` Chris Mason
  2002-04-29 17:56     ` Matthias Andree
  1 sibling, 0 replies; 30+ messages in thread
From: Chris Mason @ 2002-04-29 16:45 UTC
  To: tdickenson; +Cc: Russell Coker, berthiaume_wayne, reiserfs-list

On Mon, 2002-04-29 at 12:32, Toby Dickenson wrote:

> >One thing that has occurred to me (which has not been previously discussed as 
> >far as I recall) is the possibility for using sync() instead of fsync() if 
> >you can accumulate a number of files (and therefore replace many fsync()'s 
> >with one sync() ).
> 
> I can see
> 
> write to file A
> write to file B
> write to file C
> sync
> 
> might be faster than
> 
> write to file A
> fsync A
> write to file B
> fsync B
> write to file C
> fsync C

Correct.

> 
> but is it possible for it to be faster than
> 
> write to file A
> write to file B
> write to file C
> fsync A
> fsync B
> fsync C

It depends on the rest of the system.  sync() goes through the big LRU
list for the whole box, and fsync() goes through the private list for
just that inode.  If you've got other devices or files with dirty data,
the third case you presented (all the writes first, then one fsync() per
file) will always be the fastest.  For general use, I like this one the
best; it is what the journal code is optimized for.

If files A, B, and C are the only dirty things on the whole box, a
single sync() will be slightly better, mostly due to reduced CPU time.
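
The "all writes first, then one fsync() per file" ordering discussed above
might look like the sketch below. It is a minimal illustration under stated
assumptions: a POSIX system, small files, and a short write treated as an
error. The function name write_batch_then_fsync() and the MAX_BATCH limit are
made up for this example and do not come from the reiserfs code or patches.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_BATCH 16   /* arbitrary limit chosen for this sketch */

/* Hypothetical helper: write every file in the batch first, then fsync()
 * each one. Returns 0 if all files were written and synced, -1 otherwise. */
int write_batch_then_fsync(const char *paths[], const char *data[],
                           const size_t lens[], int count)
{
    int fds[MAX_BATCH];
    int opened = 0;
    int rc = 0;

    if (count < 0 || count > MAX_BATCH)
        return -1;

    /* Phase 1: issue all the writes, keeping the descriptors open. */
    for (int i = 0; i < count; i++) {
        fds[i] = open(paths[i], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fds[i] < 0) {
            rc = -1;
            break;
        }
        opened++;
        /* A short write is simply treated as a failure here. */
        if (write(fds[i], data[i], lens[i]) != (ssize_t)lens[i]) {
            rc = -1;
            break;
        }
    }

    /* Phase 2: sync each file (skipped after a failure), then close it. */
    for (int i = 0; i < opened; i++) {
        if (rc == 0 && fsync(fds[i]) < 0)
            rc = -1;
        if (close(fds[i]) < 0)
            rc = -1;
    }
    return rc;
}
```

The point of the two phases is that fsync() only has to walk each inode's own
dirty list, so unrelated dirty data elsewhere on the box is not dragged in the
way it would be with sync().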

-chris


* Re: fsync() Performance Issue
  2002-04-29 16:32   ` Toby Dickenson
  2002-04-29 16:45     ` Chris Mason
@ 2002-04-29 17:56     ` Matthias Andree
  2002-04-29 18:58       ` Valdis.Kletnieks
  1 sibling, 1 reply; 30+ messages in thread
From: Matthias Andree @ 2002-04-29 17:56 UTC
  To: reiserfs-list

Toby Dickenson <tdickenson@geminidataloggers.com> writes:

> write to file A
> write to file B
> write to file C
> sync

Be careful with this approach. Apart from syncing other processes' dirty
data, sync() does not make the same guarantees as fsync() does.

Barring write cache effects, fsync() only returns after all blocks are
on disk. I'm not sure whether any Linux file systems are affected (and
if so, which), but for portable applications, be aware that sync() may
return prematurely (and is allowed to!).

-- 
Matthias Andree


* Re: fsync() Performance Issue
  2002-04-29 17:56     ` Matthias Andree
@ 2002-04-29 18:58       ` Valdis.Kletnieks
  2002-04-29 18:56         ` Hans Reiser
  0 siblings, 1 reply; 30+ messages in thread
From: Valdis.Kletnieks @ 2002-04-29 18:58 UTC
  To: Matthias Andree; +Cc: reiserfs-list

[-- Attachment #1: Type: text/plain, Size: 986 bytes --]

On Mon, 29 Apr 2002 19:56:59 +0200, Matthias Andree <ma@dt.e-technik.uni-dortmund.de>  said:

> Barring write cache effects, fsync() only returns after all blocks are
> on disk. I'm not sure whether any Linux file systems are affected (and
> if so, which), but for portable applications, be aware that sync() may
> return prematurely (and is allowed to!).

And in fact is the reason for the old "recipe":
  # sync
  # sync
  # sync
  # reboot

On the older Vax 750-class machines, sync could return LONG before the blocks
were all flushed - the second and third syncs were so you were busy typing for
several seconds while the disks whirred.  Failure to understand the typing
speed issue has led at least one otherwise-clued author to recommend:
  # sync;sync;sync
  # reboot

(the distinction being obvious if you think about when the shell reads the
commands, and when it does the fork/exec for each case....)

-- 
				Valdis Kletnieks
				Computer Systems Senior Engineer
				Virginia Tech




* Re: fsync() Performance Issue
  2002-04-29 18:58       ` Valdis.Kletnieks
@ 2002-04-29 18:56         ` Hans Reiser
  0 siblings, 0 replies; 30+ messages in thread
From: Hans Reiser @ 2002-04-29 18:56 UTC
  To: Valdis.Kletnieks; +Cc: Matthias Andree, reiserfs-list

Valdis.Kletnieks@vt.edu wrote:

>On Mon, 29 Apr 2002 19:56:59 +0200, Matthias Andree <ma@dt.e-technik.uni-dortmund.de>  said:
>
>  
>
>>Barring write cache effects, fsync() only returns after all blocks are
>>on disk. I'm not sure whether any Linux file systems are affected (and
>>if so, which), but for portable applications, be aware that sync() may
>>return prematurely (and is allowed to!).
>>    
>>
>
>And in fact is the reason for the old "recipe":
>  # sync
>  # sync
>  # sync
>  # reboot
>
>On the older Vax 750-class machines, sync could return LONG before the blocks
>were all flushed - the second and third syncs were so you were busy typing for
>several seconds while the disks whirred.  Failure to understand the typing
>speed issue has led at least one otherwise-clued author to recommend:
>  # sync;sync;sync
>  # reboot
>
>(the distinction being obvious if you think about when the shell reads the
>commands, and when it does the fork/exec for each case....)
>
>  
>
Finally I understand this.  Doing more than one sync always seemed 
mysterious to me.;-)

Thanks Matthias.

Hans



* Re: fsync() Performance Issue
  2002-04-26 20:28 fsync() Performance Issue berthiaume_wayne
  2002-04-29 16:20 ` Russell Coker
@ 2002-04-30 14:20 ` Oleg Drokin
  2002-04-30 14:27   ` Chris Mason
  2002-05-02  5:07   ` Christian Stuke
  1 sibling, 2 replies; 30+ messages in thread
From: Oleg Drokin @ 2002-04-30 14:20 UTC
  To: berthiaume_wayne; +Cc: reiserfs-list

[-- Attachment #1: Type: text/plain, Size: 1408 bytes --]

Hello!

On Fri, Apr 26, 2002 at 04:28:26PM -0400, berthiaume_wayne@emc.com wrote:
> 	I'm wondering if anyone out there may have some suggestions on how
> to improve the performance of a system employing fsync(). I have to be able
> to guarantee that every write to my fileserver is on disk when the client has
> passed it to the server. Therefore, I have disabled write cache on the disk
> and issue an fsync() per file. I'm running 2.4.19-pre7, reiserfs 3.6.25,
> without additional patches. I have seen some discussions out here about
> various other "speed-up" patches and am wondering if I need to add these to
> 2.4.19-pre7? And what they are and where can I obtain said patches? Also,
> I'm wondering if there is another solution to syncing the data that is
> faster than fsync(). Testing, thus far, has shown a large disparity between
> running with and without sync. Another idea is to explore another filesystem,
> but I'm not exactly excited by the other journaling filesystems out there at
> this time. All ideas will be greatly appreciated.

Attached is a speedup patch for 2.4.19-pre7 that should help your fsync
operations a little. (From Chris Mason).
The filesystem cannot do very much at this point, unfortunately; it ends up
waiting for the disk to finish write operations.

We are also working on other speedup patches that would cover different areas
of write performance itself.

Bye,
    Oleg

[-- Attachment #2: speedup-2.4.19-pre6.diff --]
[-- Type: text/plain, Size: 24408 bytes --]

diff -uNr linux-2.4.19-pre6.o/fs/buffer.c linux-2.4.19-pre6.speedup/fs/buffer.c
--- linux-2.4.19-pre6.o/fs/buffer.c	Mon Apr  8 14:53:24 2002
+++ linux-2.4.19-pre6.speedup/fs/buffer.c	Wed Apr 10 10:43:46 2002
@@ -325,6 +325,8 @@
 	lock_super(sb);
 	if (sb->s_dirt && sb->s_op && sb->s_op->write_super)
 		sb->s_op->write_super(sb);
+	if (sb->s_op && sb->s_op->commit_super)
+		sb->s_op->commit_super(sb);
 	unlock_super(sb);
 	unlock_kernel();
 
@@ -344,7 +346,7 @@
 	lock_kernel();
 	sync_inodes(dev);
 	DQUOT_SYNC(dev);
-	sync_supers(dev);
+	commit_supers(dev);
 	unlock_kernel();
 
 	return sync_buffers(dev, 1);
diff -uNr linux-2.4.19-pre6.o/fs/reiserfs/bitmap.c linux-2.4.19-pre6.speedup/fs/reiserfs/bitmap.c
--- linux-2.4.19-pre6.o/fs/reiserfs/bitmap.c	Mon Apr  8 14:53:24 2002
+++ linux-2.4.19-pre6.speedup/fs/reiserfs/bitmap.c	Wed Apr 10 10:43:46 2002
@@ -122,7 +122,6 @@
   set_sb_free_blocks( rs, sb_free_blocks(rs) + 1 );
 
   journal_mark_dirty (th, s, sbh);
-  s->s_dirt = 1;
 }
 
 void reiserfs_free_block (struct reiserfs_transaction_handle *th, 
@@ -433,7 +432,6 @@
   /* update free block count in super block */
   PUT_SB_FREE_BLOCKS( s, SB_FREE_BLOCKS(s) - init_amount_needed );
   journal_mark_dirty (th, s, SB_BUFFER_WITH_SB (s));
-  s->s_dirt = 1;
 
   return CARRY_ON;
 }
diff -uNr linux-2.4.19-pre6.o/fs/reiserfs/ibalance.c linux-2.4.19-pre6.speedup/fs/reiserfs/ibalance.c
--- linux-2.4.19-pre6.o/fs/reiserfs/ibalance.c	Sat Nov 10 01:18:25 2001
+++ linux-2.4.19-pre6.speedup/fs/reiserfs/ibalance.c	Wed Apr 10 10:43:46 2002
@@ -632,7 +632,6 @@
 		/* use check_internal if new root is an internal node */
 		check_internal (new_root);
 	    /*&&&&&&&&&&&&&&&&&&&&&&*/
-	    tb->tb_sb->s_dirt = 1;
 
 	    /* do what is needed for buffer thrown from tree */
 	    reiserfs_invalidate_buffer(tb, tbSh);
@@ -950,7 +949,6 @@
         PUT_SB_ROOT_BLOCK( tb->tb_sb, tbSh->b_blocknr );
         PUT_SB_TREE_HEIGHT( tb->tb_sb, SB_TREE_HEIGHT(tb->tb_sb) + 1 );
 	do_balance_mark_sb_dirty (tb, tb->tb_sb->u.reiserfs_sb.s_sbh, 1);
-	tb->tb_sb->s_dirt = 1;
     }
 	
     if ( tb->blknum[h] == 2 ) {
diff -uNr linux-2.4.19-pre6.o/fs/reiserfs/journal.c linux-2.4.19-pre6.speedup/fs/reiserfs/journal.c
--- linux-2.4.19-pre6.o/fs/reiserfs/journal.c	Mon Apr  8 14:53:24 2002
+++ linux-2.4.19-pre6.speedup/fs/reiserfs/journal.c	Wed Apr 10 10:44:32 2002
@@ -64,12 +64,15 @@
 */
 static int reiserfs_mounted_fs_count = 0 ;
 
+static struct list_head kreiserfsd_supers = LIST_HEAD_INIT(kreiserfsd_supers);
+
 /* wake this up when you add something to the commit thread task queue */
 DECLARE_WAIT_QUEUE_HEAD(reiserfs_commit_thread_wait) ;
 
 /* wait on this if you need to be sure you task queue entries have been run */
 static DECLARE_WAIT_QUEUE_HEAD(reiserfs_commit_thread_done) ;
 DECLARE_TASK_QUEUE(reiserfs_commit_thread_tq) ;
+DECLARE_MUTEX(kreiserfsd_sem) ;
 
 #define JOURNAL_TRANS_HALF 1018   /* must be correct to keep the desc and commit
 				     structs at 4k */
@@ -576,17 +579,12 @@
 /* lock the current transaction */
 inline static void lock_journal(struct super_block *p_s_sb) {
   PROC_INFO_INC( p_s_sb, journal.lock_journal );
-  while(atomic_read(&(SB_JOURNAL(p_s_sb)->j_wlock)) > 0) {
-    PROC_INFO_INC( p_s_sb, journal.lock_journal_wait );
-    sleep_on(&(SB_JOURNAL(p_s_sb)->j_wait)) ;
-  }
-  atomic_set(&(SB_JOURNAL(p_s_sb)->j_wlock), 1) ;
+  down(&SB_JOURNAL(p_s_sb)->j_lock);
 }
 
 /* unlock the current transaction */
 inline static void unlock_journal(struct super_block *p_s_sb) {
-  atomic_dec(&(SB_JOURNAL(p_s_sb)->j_wlock)) ;
-  wake_up(&(SB_JOURNAL(p_s_sb)->j_wait)) ;
+  up(&SB_JOURNAL(p_s_sb)->j_lock);
 }
 
 /*
@@ -756,7 +754,6 @@
   atomic_set(&(jl->j_commit_flushing), 0) ;
   wake_up(&(jl->j_commit_wait)) ;
 
-  s->s_dirt = 1 ;
   return 0 ;
 }
 
@@ -1220,7 +1217,6 @@
     if (run++ == 0) {
         goto loop_start ;
     }
-
     atomic_set(&(jl->j_flushing), 0) ;
     wake_up(&(jl->j_flush_wait)) ;
     return ret ;
@@ -1250,7 +1246,7 @@
     while(i != start) {
         jl = SB_JOURNAL_LIST(s) + i  ;
         age = CURRENT_TIME - jl->j_timestamp ;
-        if (jl->j_len > 0 && // age >= (JOURNAL_MAX_COMMIT_AGE * 2) && 
+        if (jl->j_len > 0 && age >= JOURNAL_MAX_COMMIT_AGE && 
             atomic_read(&(jl->j_nonzerolen)) > 0 &&
             atomic_read(&(jl->j_commit_left)) == 0) {
 
@@ -1325,6 +1321,10 @@
 static int do_journal_release(struct reiserfs_transaction_handle *th, struct super_block *p_s_sb, int error) {
   struct reiserfs_transaction_handle myth ;
 
+  down(&kreiserfsd_sem);
+  list_del(&p_s_sb->u.reiserfs_sb.s_reiserfs_supers);
+  up(&kreiserfsd_sem);
+
   /* we only want to flush out transactions if we were called with error == 0
   */
   if (!error && !(p_s_sb->s_flags & MS_RDONLY)) {
@@ -1811,10 +1811,6 @@
   jl = SB_JOURNAL_LIST(ct->p_s_sb) + ct->jindex ;
 
   flush_commit_list(ct->p_s_sb, SB_JOURNAL_LIST(ct->p_s_sb) + ct->jindex, 1) ; 
-  if (jl->j_len > 0 && atomic_read(&(jl->j_nonzerolen)) > 0 && 
-      atomic_read(&(jl->j_commit_left)) == 0) {
-    kupdate_one_transaction(ct->p_s_sb, jl) ;
-  }
   reiserfs_kfree(ct->self, sizeof(struct reiserfs_journal_commit_task), ct->p_s_sb) ;
 }
 
@@ -1864,6 +1860,9 @@
 ** then run the per filesystem commit task queue when we wakeup.
 */
 static int reiserfs_journal_commit_thread(void *nullp) {
+  struct list_head *entry, *safe ;
+  struct super_block *s;
+  time_t last_run = 0;
 
   daemonize() ;
 
@@ -1879,6 +1878,18 @@
     while(TQ_ACTIVE(reiserfs_commit_thread_tq)) {
       run_task_queue(&reiserfs_commit_thread_tq) ;
     }
+    if (CURRENT_TIME - last_run > 5) {
+	down(&kreiserfsd_sem);
+	list_for_each_safe(entry, safe, &kreiserfsd_supers) {
+	    s = list_entry(entry, struct super_block, 
+	                   u.reiserfs_sb.s_reiserfs_supers);    
+	    if (!(s->s_flags & MS_RDONLY)) {
+		reiserfs_flush_old_commits(s);
+	    }
+	}
+	up(&kreiserfsd_sem);
+	last_run = CURRENT_TIME;
+    }
 
     /* if there aren't any more filesystems left, break */
     if (reiserfs_mounted_fs_count <= 0) {
@@ -1953,13 +1964,12 @@
   SB_JOURNAL(p_s_sb)->j_last = NULL ;	  
   SB_JOURNAL(p_s_sb)->j_first = NULL ;     
   init_waitqueue_head(&(SB_JOURNAL(p_s_sb)->j_join_wait)) ;
-  init_waitqueue_head(&(SB_JOURNAL(p_s_sb)->j_wait)) ; 
+  sema_init(&SB_JOURNAL(p_s_sb)->j_lock, 1);
 
   SB_JOURNAL(p_s_sb)->j_trans_id = 10 ;  
   SB_JOURNAL(p_s_sb)->j_mount_id = 10 ; 
   SB_JOURNAL(p_s_sb)->j_state = 0 ;
   atomic_set(&(SB_JOURNAL(p_s_sb)->j_jlock), 0) ;
-  atomic_set(&(SB_JOURNAL(p_s_sb)->j_wlock), 0) ;
   SB_JOURNAL(p_s_sb)->j_cnode_free_list = allocate_cnodes(num_cnodes) ;
   SB_JOURNAL(p_s_sb)->j_cnode_free_orig = SB_JOURNAL(p_s_sb)->j_cnode_free_list ;
   SB_JOURNAL(p_s_sb)->j_cnode_free = SB_JOURNAL(p_s_sb)->j_cnode_free_list ? num_cnodes : 0 ;
@@ -1989,6 +1999,7 @@
     kernel_thread((void *)(void *)reiserfs_journal_commit_thread, NULL,
                   CLONE_FS | CLONE_FILES | CLONE_VM) ;
   }
+  list_add(&p_s_sb->u.reiserfs_sb.s_reiserfs_supers, &kreiserfsd_supers);
   return 0 ;
 }
 
@@ -2117,7 +2128,6 @@
   th->t_trans_id = SB_JOURNAL(p_s_sb)->j_trans_id ;
   th->t_caller = "Unknown" ;
   unlock_journal(p_s_sb) ;
-  p_s_sb->s_dirt = 1; 
   return 0 ;
 }
 
@@ -2159,7 +2169,7 @@
     reiserfs_panic(th->t_super, "journal-1577: handle trans id %ld != current trans id %ld\n", 
                    th->t_trans_id, SB_JOURNAL(p_s_sb)->j_trans_id);
   }
-  p_s_sb->s_dirt = 1 ;
+  p_s_sb->s_dirt |= S_SUPER_DIRTY;
 
   prepared = test_and_clear_bit(BH_JPrepared, &bh->b_state) ;
   /* already in this transaction, we are done */
@@ -2407,12 +2417,8 @@
 ** flushes any old transactions to disk
 ** ends the current transaction if it is too old
 **
-** also calls flush_journal_list with old_only == 1, which allows me to reclaim
-** memory and such from the journal lists whose real blocks are all on disk.
-**
-** called by sync_dev_journal from buffer.c
 */
-int flush_old_commits(struct super_block *p_s_sb, int immediate) {
+int reiserfs_flush_old_commits(struct super_block *p_s_sb) {
   int i ;
   int count = 0;
   int start ; 
@@ -2429,8 +2435,7 @@
   /* starting with oldest, loop until we get to the start */
   i = (SB_JOURNAL_LIST_INDEX(p_s_sb) + 1) % JOURNAL_LIST_COUNT ;
   while(i != start) {
-    if (SB_JOURNAL_LIST(p_s_sb)[i].j_len > 0 && ((now - SB_JOURNAL_LIST(p_s_sb)[i].j_timestamp) > JOURNAL_MAX_COMMIT_AGE ||
-       immediate)) {
+    if (SB_JOURNAL_LIST(p_s_sb)[i].j_len > 0 && ((now - SB_JOURNAL_LIST(p_s_sb)[i].j_timestamp) > JOURNAL_MAX_COMMIT_AGE)) {
       /* we have to check again to be sure the current transaction did not change */
       if (i != SB_JOURNAL_LIST_INDEX(p_s_sb))  {
 	flush_commit_list(p_s_sb, SB_JOURNAL_LIST(p_s_sb) + i, 1) ;
@@ -2439,26 +2444,26 @@
     i = (i + 1) % JOURNAL_LIST_COUNT ;
     count++ ;
   }
+
   /* now, check the current transaction.  If there are no writers, and it is too old, finish it, and
   ** force the commit blocks to disk
   */
-  if (!immediate && atomic_read(&(SB_JOURNAL(p_s_sb)->j_wcount)) <= 0 &&  
+  if (atomic_read(&(SB_JOURNAL(p_s_sb)->j_wcount)) <= 0 &&  
      SB_JOURNAL(p_s_sb)->j_trans_start_time > 0 && 
      SB_JOURNAL(p_s_sb)->j_len > 0 && 
      (now - SB_JOURNAL(p_s_sb)->j_trans_start_time) > JOURNAL_MAX_TRANS_AGE) {
     journal_join(&th, p_s_sb, 1) ;
     reiserfs_prepare_for_journal(p_s_sb, SB_BUFFER_WITH_SB(p_s_sb), 1) ;
     journal_mark_dirty(&th, p_s_sb, SB_BUFFER_WITH_SB(p_s_sb)) ;
-    do_journal_end(&th, p_s_sb,1, COMMIT_NOW) ;
-  } else if (immediate) { /* belongs above, but I wanted this to be very explicit as a special case.  If they say to 
-                             flush, we must be sure old transactions hit the disk too. */
-    journal_join(&th, p_s_sb, 1) ;
-    reiserfs_prepare_for_journal(p_s_sb, SB_BUFFER_WITH_SB(p_s_sb), 1) ;
-    journal_mark_dirty(&th, p_s_sb, SB_BUFFER_WITH_SB(p_s_sb)) ;
+
+    /* we're only being called from kreiserfsd, it makes no sense to do
+    ** an async commit so that kreiserfsd can do it later
+    */
     do_journal_end(&th, p_s_sb,1, COMMIT_NOW | WAIT) ;
-  }
-   reiserfs_journal_kupdate(p_s_sb) ;
-   return 0 ;
+  } 
+  reiserfs_journal_kupdate(p_s_sb) ;
+
+  return S_SUPER_DIRTY_COMMIT;
 }
 
 /*
@@ -2497,7 +2502,7 @@
   if (SB_JOURNAL(p_s_sb)->j_len == 0) {
     int wcount = atomic_read(&(SB_JOURNAL(p_s_sb)->j_wcount)) ;
     unlock_journal(p_s_sb) ;
-    if (atomic_read(&(SB_JOURNAL(p_s_sb)->j_jlock))  > 0 && wcount <= 0) {
+    if (atomic_read(&(SB_JOURNAL(p_s_sb)->j_jlock)) > 0 && wcount <= 0) {
       atomic_dec(&(SB_JOURNAL(p_s_sb)->j_jlock)) ;
       wake_up(&(SB_JOURNAL(p_s_sb)->j_join_wait)) ;
     }
@@ -2768,6 +2773,7 @@
   ** it tells us if we should continue with the journal_end, or just return
   */
   if (!check_journal_end(th, p_s_sb, nblocks, flags)) {
+    p_s_sb->s_dirt |= S_SUPER_DIRTY;
     return 0 ;
   }
 
@@ -2937,17 +2943,12 @@
   /* write any buffers that must hit disk before this commit is done */
   fsync_inode_buffers(&(SB_JOURNAL(p_s_sb)->j_dummy_inode)) ;
 
-  /* honor the flush and async wishes from the caller */
+  /* honor the flush wishes from the caller, simple commits can
+  ** be done outside the journal lock, they are done below
+  */
   if (flush) {
-  
     flush_commit_list(p_s_sb, SB_JOURNAL_LIST(p_s_sb) + orig_jindex, 1) ;
     flush_journal_list(p_s_sb,  SB_JOURNAL_LIST(p_s_sb) + orig_jindex , 1) ;  
-  } else if (commit_now) {
-    if (wait_on_commit) {
-      flush_commit_list(p_s_sb, SB_JOURNAL_LIST(p_s_sb) + orig_jindex, 1) ;
-    } else {
-      commit_flush_async(p_s_sb, orig_jindex) ; 
-    }
   }
 
   /* reset journal values for the next transaction */
@@ -3009,6 +3010,16 @@
   atomic_set(&(SB_JOURNAL(p_s_sb)->j_jlock), 0) ;
   /* wake up any body waiting to join. */
   wake_up(&(SB_JOURNAL(p_s_sb)->j_join_wait)) ;
+  
+  if (!flush && commit_now) {
+    if (current->need_resched)
+      schedule() ;
+    if (wait_on_commit) {
+      flush_commit_list(p_s_sb, SB_JOURNAL_LIST(p_s_sb) + orig_jindex, 1) ;
+    } else {
+      commit_flush_async(p_s_sb, orig_jindex) ; 
+    }
+  }
   return 0 ;
 }
 
diff -uNr linux-2.4.19-pre6.o/fs/reiserfs/objectid.c linux-2.4.19-pre6.speedup/fs/reiserfs/objectid.c
--- linux-2.4.19-pre6.o/fs/reiserfs/objectid.c	Mon Apr  8 14:53:24 2002
+++ linux-2.4.19-pre6.speedup/fs/reiserfs/objectid.c	Wed Apr 10 10:43:46 2002
@@ -87,7 +87,6 @@
     }
 
     journal_mark_dirty(th, s, SB_BUFFER_WITH_SB (s));
-    s->s_dirt = 1;
     return unused_objectid;
 }
 
@@ -106,8 +105,6 @@
 
     reiserfs_prepare_for_journal(s, SB_BUFFER_WITH_SB(s), 1) ;
     journal_mark_dirty(th, s, SB_BUFFER_WITH_SB (s)); 
-    s->s_dirt = 1;
-
 
     /* start at the beginning of the objectid map (i = 0) and go to
        the end of it (i = disk_sb->s_oid_cursize).  Linear search is
diff -uNr linux-2.4.19-pre6.o/fs/reiserfs/stree.c linux-2.4.19-pre6.speedup/fs/reiserfs/stree.c
--- linux-2.4.19-pre6.o/fs/reiserfs/stree.c	Mon Apr  8 14:53:24 2002
+++ linux-2.4.19-pre6.speedup/fs/reiserfs/stree.c	Wed Apr 10 10:44:40 2002
@@ -598,26 +598,32 @@
 
 
 
-#ifdef SEARCH_BY_KEY_READA
+#define SEARCH_BY_KEY_READA 32
 
 /* The function is NOT SCHEDULE-SAFE! */
-static void search_by_key_reada (struct super_block * s, int blocknr)
+static void search_by_key_reada (struct super_block * s, 
+                                 struct buffer_head **bh, 
+				 unsigned long *b, int num)
 {
-    struct buffer_head * bh;
+    int i,j;
   
-    if (blocknr == 0)
-	return;
-
-    bh = getblk (s->s_dev, blocknr, s->s_blocksize);
-  
-    if (!buffer_uptodate (bh)) {
-	ll_rw_block (READA, 1, &bh);
+    for (i = 0 ; i < num ; i++) {
+	bh[i] = sb_getblk (s, b[i]);
+	if (buffer_uptodate(bh[i])) {
+	    brelse(bh[i]);
+	    break;
+	}
+	touch_buffer(bh[i]);
+    } 
+    if (i) {
+	ll_rw_block(READA, i, bh);
+    }
+    for(j = 0 ; j < i ; j++) {
+        if (bh[j])
+	    brelse(bh[j]);
     }
-    bh->b_count --;
 }
 
-#endif
-
 /**************************************************************************
  * Algorithm   SearchByKey                                                *
  *             look for item in the Disk S+Tree by its key                *
@@ -660,6 +666,9 @@
     int				n_node_level, n_retval;
     int 			right_neighbor_of_leaf_node;
     int				fs_gen;
+    struct buffer_head *reada_bh[SEARCH_BY_KEY_READA];
+    unsigned long      reada_blocks[SEARCH_BY_KEY_READA];
+    int reada_count = 0;
 
 #ifdef CONFIG_REISERFS_CHECK
     int n_repeat_counter = 0;
@@ -693,11 +702,11 @@
 	fs_gen = get_generation (p_s_sb);
 	expected_level --;
 
-#ifdef SEARCH_BY_KEY_READA
-	/* schedule read of right neighbor */
-	search_by_key_reada (p_s_sb, right_neighbor_of_leaf_node);
-#endif
-
+	/* schedule read of right neighbors */
+	if (reada_count) {
+	    search_by_key_reada (p_s_sb, reada_bh, reada_blocks, reada_count);
+	    reada_count = 0;
+	}
 	/* Read the next tree node, and set the last element in the path to
            have a pointer to it. */
 	if ( ! (p_s_bh = p_s_last_element->pe_buffer =
@@ -785,11 +794,20 @@
 	   position in the node. */
 	n_block_number = B_N_CHILD_NUM(p_s_bh, p_s_last_element->pe_position);
 
-#ifdef SEARCH_BY_KEY_READA
-	/* if we are going to read leaf node, then calculate its right neighbor if possible */
-	if (n_node_level == DISK_LEAF_NODE_LEVEL + 1 && p_s_last_element->pe_position < B_NR_ITEMS (p_s_bh))
-	    right_neighbor_of_leaf_node = B_N_CHILD_NUM(p_s_bh, p_s_last_element->pe_position + 1);
-#endif
+	/* if we are going to read leaf node, then try to find good leaves
+	** for read ahead as well.  Don't bother for stat data though
+	*/
+	if (reiserfs_test4(p_s_sb) && 
+	    n_node_level == DISK_LEAF_NODE_LEVEL + 1 && 
+	    p_s_last_element->pe_position < B_NR_ITEMS (p_s_bh) &&
+	    !is_statdata_cpu_key(p_s_key))
+	{
+	    int pos = p_s_last_element->pe_position;
+	    int limit = B_NR_ITEMS(p_s_bh);
+	    while(pos <= limit && reada_count < SEARCH_BY_KEY_READA) { 
+	        reada_blocks[reada_count++] = B_N_CHILD_NUM(p_s_bh, pos++);
+	    }
+        }
     }
 }
 
diff -uNr linux-2.4.19-pre6.o/fs/reiserfs/super.c linux-2.4.19-pre6.speedup/fs/reiserfs/super.c
--- linux-2.4.19-pre6.o/fs/reiserfs/super.c	Mon Apr  8 14:53:24 2002
+++ linux-2.4.19-pre6.speedup/fs/reiserfs/super.c	Wed Apr 10 10:43:46 2002
@@ -29,23 +29,22 @@
 static int reiserfs_remount (struct super_block * s, int * flags, char * data);
 static int reiserfs_statfs (struct super_block * s, struct statfs * buf);
 
-//
-// a portion of this function, particularly the VFS interface portion,
-// was derived from minix or ext2's analog and evolved as the
-// prototype did. You should be able to tell which portion by looking
-// at the ext2 code and comparing. It's subfunctions contain no code
-// used as a template unless they are so labeled.
-//
+/* kreiserfsd does all the periodic stuff for us */
 static void reiserfs_write_super (struct super_block * s)
 {
+    s->s_dirt = S_SUPER_DIRTY_COMMIT;
+}
 
-  int dirty = 0 ;
-  lock_kernel() ;
-  if (!(s->s_flags & MS_RDONLY)) {
-    dirty = flush_old_commits(s, 1) ;
-  }
-  s->s_dirt = dirty;
-  unlock_kernel() ;
+static void reiserfs_commit_super (struct super_block * s)
+{
+    struct reiserfs_transaction_handle th;
+    lock_kernel() ;
+    if (!(s->s_flags & MS_RDONLY)) {
+	journal_begin(&th, s, 1);
+	journal_end_sync(&th, s, 1);
+	s->s_dirt = 0;
+    }
+    unlock_kernel() ;
 }
 
 //
@@ -413,6 +412,7 @@
   put_super: reiserfs_put_super,
   write_super: reiserfs_write_super,
   write_super_lockfs: reiserfs_write_super_lockfs,
+  commit_super: reiserfs_commit_super,
   unlockfs: reiserfs_unlockfs,
   statfs: reiserfs_statfs,
   remount_fs: reiserfs_remount,
@@ -968,6 +968,7 @@
 
 
     memset (&s->u.reiserfs_sb, 0, sizeof (struct reiserfs_sb_info));
+    INIT_LIST_HEAD(&s->u.reiserfs_sb.s_reiserfs_supers);
 
     if (parse_options ((char *) data, &(s->u.reiserfs_sb.s_mount_opt), &blocks) == 0) {
 	return NULL;
diff -uNr linux-2.4.19-pre6.o/fs/super.c linux-2.4.19-pre6.speedup/fs/super.c
--- linux-2.4.19-pre6.o/fs/super.c	Mon Apr  8 14:53:24 2002
+++ linux-2.4.19-pre6.speedup/fs/super.c	Wed Apr 10 10:43:46 2002
@@ -431,15 +431,68 @@
 	put_super(sb);
 }
 
+/* since we've added the idea of commit_dirty vs regular dirty with
+ * commit_super operation, only use the S_SUPER_DIRTY mask if 
+ * the FS has a commit_super op.
+ */
+static inline int super_dirty(struct super_block *sb)
+{
+	if (sb->s_op && sb->s_op->commit_super) {
+		return sb->s_dirt & S_SUPER_DIRTY;
+	}
+	return sb->s_dirt;
+}
+
+
 static inline void write_super(struct super_block *sb)
 {
 	lock_super(sb);
-	if (sb->s_root && sb->s_dirt)
+	if (sb->s_root && super_dirty(sb))
 		if (sb->s_op && sb->s_op->write_super)
 			sb->s_op->write_super(sb);
 	unlock_super(sb);
 }
 
+static inline void commit_super(struct super_block *sb)
+{
+	lock_super(sb);
+	if (sb->s_root && sb->s_dirt) {
+		if (sb->s_op && sb->s_op->write_super)
+			sb->s_op->write_super(sb);
+		if (sb->s_op && sb->s_op->commit_super)
+			sb->s_op->commit_super(sb);
+	}
+	unlock_super(sb);
+}
+
+void commit_supers(kdev_t dev)
+{
+	struct super_block * sb;
+
+	if (dev) {
+		sb = get_super(dev);
+		if (sb) {
+			if (sb->s_dirt)
+				commit_super(sb);
+			drop_super(sb);
+		}
+	}
+restart:
+	spin_lock(&sb_lock);
+	sb = sb_entry(super_blocks.next);
+	while (sb != sb_entry(&super_blocks))
+		if (sb->s_dirt) {
+			sb->s_count++;
+			spin_unlock(&sb_lock);
+			down_read(&sb->s_umount);
+			commit_super(sb);
+			drop_super(sb);
+			goto restart;
+		} else
+			sb = sb_entry(sb->s_list.next);
+	spin_unlock(&sb_lock);
+}
+
 /*
  * Note: check the dirty flag before waiting, so we don't
  * hold up the sync while mounting a device. (The newly
@@ -462,7 +515,7 @@
 	spin_lock(&sb_lock);
 	sb = sb_entry(super_blocks.next);
 	while (sb != sb_entry(&super_blocks))
-		if (sb->s_dirt) {
+		if (super_dirty(sb)) {
 			sb->s_count++;
 			spin_unlock(&sb_lock);
 			down_read(&sb->s_umount);
diff -uNr linux-2.4.19-pre6.o/include/linux/fs.h linux-2.4.19-pre6.speedup/include/linux/fs.h
--- linux-2.4.19-pre6.o/include/linux/fs.h	Mon Apr  8 14:53:26 2002
+++ linux-2.4.19-pre6.speedup/include/linux/fs.h	Wed Apr 10 10:55:34 2002
@@ -706,6 +706,10 @@
 
 #define sb_entry(list)	list_entry((list), struct super_block, s_list)
 #define S_BIAS (1<<30)
+
+/* flags for the s_dirt field */
+#define S_SUPER_DIRTY 1
+#define S_SUPER_DIRTY_COMMIT 2
 struct super_block {
 	struct list_head	s_list;		/* Keep this first */
 	kdev_t			s_dev;
@@ -918,6 +922,7 @@
 	struct dentry * (*fh_to_dentry)(struct super_block *sb, __u32 *fh, int len, int fhtype, int parent);
 	int (*dentry_to_fh)(struct dentry *, __u32 *fh, int *lenp, int need_parent);
 	int (*show_options)(struct seq_file *, struct vfsmount *);
+	void (*commit_super) (struct super_block *);
 };
 
 /* Inode state bits.. */
@@ -1226,6 +1231,7 @@
 extern int filemap_fdatasync(struct address_space *);
 extern int filemap_fdatawait(struct address_space *);
 extern void sync_supers(kdev_t);
+extern void commit_supers(kdev_t);
 extern int bmap(struct inode *, int);
 extern int notify_change(struct dentry *, struct iattr *);
 extern int permission(struct inode *, int);
diff -uNr linux-2.4.19-pre6.o/include/linux/reiserfs_fs.h linux-2.4.19-pre6.speedup/include/linux/reiserfs_fs.h
--- linux-2.4.19-pre6.o/include/linux/reiserfs_fs.h	Mon Apr  8 14:53:26 2002
+++ linux-2.4.19-pre6.speedup/include/linux/reiserfs_fs.h	Wed Apr 10 10:56:36 2002
@@ -1533,6 +1533,7 @@
 */
 #define JOURNAL_BUFFER(j,n) ((j)->j_ap_blocks[((j)->j_start + (n)) % JOURNAL_BLOCK_COUNT])
 
+int reiserfs_flush_old_commits(struct super_block *);
 void reiserfs_commit_for_inode(struct inode *) ;
 void reiserfs_update_inode_transaction(struct inode *) ;
 void reiserfs_wait_on_write_block(struct super_block *s) ;
diff -uNr linux-2.4.19-pre6.o/include/linux/reiserfs_fs_sb.h linux-2.4.19-pre6.speedup/include/linux/reiserfs_fs_sb.h
--- linux-2.4.19-pre6.o/include/linux/reiserfs_fs_sb.h	Mon Apr  8 14:53:26 2002
+++ linux-2.4.19-pre6.speedup/include/linux/reiserfs_fs_sb.h	Wed Apr 10 10:55:34 2002
@@ -291,8 +291,7 @@
   */
   struct reiserfs_page_list *j_flush_pages ;
   time_t j_trans_start_time ;         /* time this transaction started */
-  wait_queue_head_t j_wait ;         /* wait  journal_end to finish I/O */
-  atomic_t j_wlock ;                       /* lock for j_wait */
+  struct semaphore j_lock ;
   wait_queue_head_t j_join_wait ;    /* wait for current transaction to finish before starting new one */
   atomic_t j_jlock ;                       /* lock for j_join_wait */
   int j_journal_list_index ;	      /* journal list number of the current trans */
@@ -444,6 +443,7 @@
     int s_is_unlinked_ok;
     reiserfs_proc_info_data_t s_proc_info_data;
     struct proc_dir_entry *procdir;
+    struct list_head s_reiserfs_supers;
 };
 
 /* Definitions of reiserfs on-disk properties: */
@@ -510,7 +510,6 @@
 void reiserfs_file_buffer (struct buffer_head * bh, int list);
 int reiserfs_is_super(struct super_block *s)  ;
 int journal_mark_dirty(struct reiserfs_transaction_handle *, struct super_block *, struct buffer_head *bh) ;
-int flush_old_commits(struct super_block *s, int) ;
 int show_reiserfs_locks(void) ;
 int reiserfs_resize(struct super_block *, unsigned long) ;
 
diff -uNr linux-2.4.19-pre6.o/mm/filemap.c linux-2.4.19-pre6.speedup/mm/filemap.c
--- linux-2.4.19-pre6.o/mm/filemap.c	Mon Apr  8 14:53:27 2002
+++ linux-2.4.19-pre6.speedup/mm/filemap.c	Wed Apr 10 10:44:40 2002
@@ -1306,6 +1306,7 @@
 	/* Mark the page referenced, AFTER checking for previous usage.. */
 	SetPageReferenced(page);
 }
+EXPORT_SYMBOL(mark_page_accessed);
 
 /*
  * This is a generic file read routine, and uses the
@@ -2897,6 +2898,14 @@
 	}
 }
 
+static void update_inode_times(struct inode *inode) 
+{
+	time_t now = CURRENT_TIME;
+	if (inode->i_ctime != now || inode->i_mtime != now) {
+	    inode->i_ctime = inode->i_mtime = now;
+	    mark_inode_dirty_sync(inode);
+	} 
+}
 /*
  * Write to a file through the page cache. 
  *
@@ -3026,8 +3035,7 @@
 		goto out;
 
 	remove_suid(inode);
-	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
-	mark_inode_dirty_sync(inode);
+	update_inode_times(inode);
 
 	if (file->f_flags & O_DIRECT)
 		goto o_direct;

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fsync() Performance Issue
  2002-04-30 14:20 ` Oleg Drokin
@ 2002-04-30 14:27   ` Chris Mason
  2002-05-02  5:07   ` Christian Stuke
  1 sibling, 0 replies; 30+ messages in thread
From: Chris Mason @ 2002-04-30 14:27 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: berthiaume_wayne, reiserfs-list

On Tue, 2002-04-30 at 10:20, Oleg Drokin wrote:

> Attached is a speedup patch for 2.4.19-pre7 that should help your fsync
> operations a little. (From Chris Mason).
> Filesystem cannot do very much at this point unfortunately, it is ending up
> waiting for disk to finish write operations.
> 
> Also we are working on other speedup patches that would cover different areas
> of write performance itself.

A newer one (against 2.4.19-pre7) is below.  It has not been through as
much testing on the namesys side, which is why Oleg sent the older one.

Wayne and I have been talking in private mail; he's getting a bunch of
beta patches later today (this speedup, data logging, updated barrier
code), along with instructions for testing.

-chris

# Veritas (Hugh Dickins supplied the patch) sent the bits in
# fs/super.c that allow the FS to leave super->s_dirt set after a
# write_super call.
#
diff -urN --exclude *.orig parent/fs/buffer.c comp/fs/buffer.c
--- parent/fs/buffer.c	Mon Apr 29 10:20:24 2002
+++ comp/fs/buffer.c	Mon Apr 29 10:20:22 2002
@@ -325,6 +325,8 @@
 	lock_super(sb);
 	if (sb->s_dirt && sb->s_op && sb->s_op->write_super)
 		sb->s_op->write_super(sb);
+	if (sb->s_op && sb->s_op->commit_super)
+		sb->s_op->commit_super(sb);
 	unlock_super(sb);
 	unlock_kernel();
 
@@ -344,7 +346,7 @@
 	lock_kernel();
 	sync_inodes(dev);
 	DQUOT_SYNC(dev);
-	sync_supers(dev);
+	commit_supers(dev);
 	unlock_kernel();
 
 	return sync_buffers(dev, 1);
diff -urN --exclude *.orig parent/fs/reiserfs/bitmap.c comp/fs/reiserfs/bitmap.c
--- parent/fs/reiserfs/bitmap.c	Mon Apr 29 10:20:24 2002
+++ comp/fs/reiserfs/bitmap.c	Mon Apr 29 10:20:19 2002
@@ -122,7 +122,6 @@
   set_sb_free_blocks( rs, sb_free_blocks(rs) + 1 );
 
   journal_mark_dirty (th, s, sbh);
-  s->s_dirt = 1;
 }
 
 void reiserfs_free_block (struct reiserfs_transaction_handle *th, 
@@ -433,7 +432,6 @@
   /* update free block count in super block */
   PUT_SB_FREE_BLOCKS( s, SB_FREE_BLOCKS(s) - init_amount_needed );
   journal_mark_dirty (th, s, SB_BUFFER_WITH_SB (s));
-  s->s_dirt = 1;
 
   return CARRY_ON;
 }
diff -urN --exclude *.orig parent/fs/reiserfs/ibalance.c comp/fs/reiserfs/ibalance.c
--- parent/fs/reiserfs/ibalance.c	Mon Apr 29 10:20:24 2002
+++ comp/fs/reiserfs/ibalance.c	Mon Apr 29 10:20:19 2002
@@ -632,7 +632,6 @@
 		/* use check_internal if new root is an internal node */
 		check_internal (new_root);
 	    /*&&&&&&&&&&&&&&&&&&&&&&*/
-	    tb->tb_sb->s_dirt = 1;
 
 	    /* do what is needed for buffer thrown from tree */
 	    reiserfs_invalidate_buffer(tb, tbSh);
@@ -950,7 +949,6 @@
         PUT_SB_ROOT_BLOCK( tb->tb_sb, tbSh->b_blocknr );
         PUT_SB_TREE_HEIGHT( tb->tb_sb, SB_TREE_HEIGHT(tb->tb_sb) + 1 );
 	do_balance_mark_sb_dirty (tb, tb->tb_sb->u.reiserfs_sb.s_sbh, 1);
-	tb->tb_sb->s_dirt = 1;
     }
 	
     if ( tb->blknum[h] == 2 ) {
diff -urN --exclude *.orig parent/fs/reiserfs/journal.c comp/fs/reiserfs/journal.c
--- parent/fs/reiserfs/journal.c	Mon Apr 29 10:20:24 2002
+++ comp/fs/reiserfs/journal.c	Mon Apr 29 10:20:21 2002
@@ -64,12 +64,15 @@
 */
 static int reiserfs_mounted_fs_count = 0 ;
 
+static struct list_head kreiserfsd_supers = LIST_HEAD_INIT(kreiserfsd_supers);
+
 /* wake this up when you add something to the commit thread task queue */
 DECLARE_WAIT_QUEUE_HEAD(reiserfs_commit_thread_wait) ;
 
 /* wait on this if you need to be sure you task queue entries have been run */
 static DECLARE_WAIT_QUEUE_HEAD(reiserfs_commit_thread_done) ;
 DECLARE_TASK_QUEUE(reiserfs_commit_thread_tq) ;
+DECLARE_MUTEX(kreiserfsd_sem) ;
 
 #define JOURNAL_TRANS_HALF 1018   /* must be correct to keep the desc and commit
 				     structs at 4k */
@@ -576,17 +579,12 @@
 /* lock the current transaction */
 inline static void lock_journal(struct super_block *p_s_sb) {
   PROC_INFO_INC( p_s_sb, journal.lock_journal );
-  while(atomic_read(&(SB_JOURNAL(p_s_sb)->j_wlock)) > 0) {
-    PROC_INFO_INC( p_s_sb, journal.lock_journal_wait );
-    sleep_on(&(SB_JOURNAL(p_s_sb)->j_wait)) ;
-  }
-  atomic_set(&(SB_JOURNAL(p_s_sb)->j_wlock), 1) ;
+  down(&SB_JOURNAL(p_s_sb)->j_lock);
 }
 
 /* unlock the current transaction */
 inline static void unlock_journal(struct super_block *p_s_sb) {
-  atomic_dec(&(SB_JOURNAL(p_s_sb)->j_wlock)) ;
-  wake_up(&(SB_JOURNAL(p_s_sb)->j_wait)) ;
+  up(&SB_JOURNAL(p_s_sb)->j_lock);
 }
 
 /*
@@ -756,7 +754,6 @@
   atomic_set(&(jl->j_commit_flushing), 0) ;
   wake_up(&(jl->j_commit_wait)) ;
 
-  s->s_dirt = 1 ;
   return 0 ;
 }
 
@@ -1220,7 +1217,6 @@
     if (run++ == 0) {
         goto loop_start ;
     }
-
     atomic_set(&(jl->j_flushing), 0) ;
     wake_up(&(jl->j_flush_wait)) ;
     return ret ;
@@ -1250,7 +1246,7 @@
     while(i != start) {
         jl = SB_JOURNAL_LIST(s) + i  ;
         age = CURRENT_TIME - jl->j_timestamp ;
-        if (jl->j_len > 0 && // age >= (JOURNAL_MAX_COMMIT_AGE * 2) && 
+        if (jl->j_len > 0 && age >= JOURNAL_MAX_COMMIT_AGE && 
             atomic_read(&(jl->j_nonzerolen)) > 0 &&
             atomic_read(&(jl->j_commit_left)) == 0) {
 
@@ -1325,6 +1321,10 @@
 static int do_journal_release(struct reiserfs_transaction_handle *th, struct super_block *p_s_sb, int error) {
   struct reiserfs_transaction_handle myth ;
 
+  down(&kreiserfsd_sem);
+  list_del(&p_s_sb->u.reiserfs_sb.s_reiserfs_supers);
+  up(&kreiserfsd_sem);
+
   /* we only want to flush out transactions if we were called with error == 0
   */
   if (!error && !(p_s_sb->s_flags & MS_RDONLY)) {
@@ -1811,10 +1811,6 @@
   jl = SB_JOURNAL_LIST(ct->p_s_sb) + ct->jindex ;
 
   flush_commit_list(ct->p_s_sb, SB_JOURNAL_LIST(ct->p_s_sb) + ct->jindex, 1) ; 
-  if (jl->j_len > 0 && atomic_read(&(jl->j_nonzerolen)) > 0 && 
-      atomic_read(&(jl->j_commit_left)) == 0) {
-    kupdate_one_transaction(ct->p_s_sb, jl) ;
-  }
   reiserfs_kfree(ct->self, sizeof(struct reiserfs_journal_commit_task), ct->p_s_sb) ;
 }
 
@@ -1864,6 +1860,9 @@
 ** then run the per filesystem commit task queue when we wakeup.
 */
 static int reiserfs_journal_commit_thread(void *nullp) {
+  struct list_head *entry, *safe ;
+  struct super_block *s;
+  time_t last_run = 0;
 
   daemonize() ;
 
@@ -1879,6 +1878,18 @@
     while(TQ_ACTIVE(reiserfs_commit_thread_tq)) {
       run_task_queue(&reiserfs_commit_thread_tq) ;
     }
+    if (CURRENT_TIME - last_run > 5) {
+	down(&kreiserfsd_sem);
+	list_for_each_safe(entry, safe, &kreiserfsd_supers) {
+	    s = list_entry(entry, struct super_block, 
+	                   u.reiserfs_sb.s_reiserfs_supers);    
+	    if (!(s->s_flags & MS_RDONLY)) {
+		reiserfs_flush_old_commits(s);
+	    }
+	}
+	up(&kreiserfsd_sem);
+	last_run = CURRENT_TIME;
+    }
 
     /* if there aren't any more filesystems left, break */
     if (reiserfs_mounted_fs_count <= 0) {
@@ -1953,13 +1964,12 @@
   SB_JOURNAL(p_s_sb)->j_last = NULL ;	  
   SB_JOURNAL(p_s_sb)->j_first = NULL ;     
   init_waitqueue_head(&(SB_JOURNAL(p_s_sb)->j_join_wait)) ;
-  init_waitqueue_head(&(SB_JOURNAL(p_s_sb)->j_wait)) ; 
+  sema_init(&SB_JOURNAL(p_s_sb)->j_lock, 1);
 
   SB_JOURNAL(p_s_sb)->j_trans_id = 10 ;  
   SB_JOURNAL(p_s_sb)->j_mount_id = 10 ; 
   SB_JOURNAL(p_s_sb)->j_state = 0 ;
   atomic_set(&(SB_JOURNAL(p_s_sb)->j_jlock), 0) ;
-  atomic_set(&(SB_JOURNAL(p_s_sb)->j_wlock), 0) ;
   SB_JOURNAL(p_s_sb)->j_cnode_free_list = allocate_cnodes(num_cnodes) ;
   SB_JOURNAL(p_s_sb)->j_cnode_free_orig = SB_JOURNAL(p_s_sb)->j_cnode_free_list ;
   SB_JOURNAL(p_s_sb)->j_cnode_free = SB_JOURNAL(p_s_sb)->j_cnode_free_list ? num_cnodes : 0 ;
@@ -1989,6 +1999,9 @@
     kernel_thread((void *)(void *)reiserfs_journal_commit_thread, NULL,
                   CLONE_FS | CLONE_FILES | CLONE_VM) ;
   }
+  down(&kreiserfsd_sem);
+  list_add(&p_s_sb->u.reiserfs_sb.s_reiserfs_supers, &kreiserfsd_supers);
+  up(&kreiserfsd_sem);
   return 0 ;
 }
 
@@ -2117,7 +2130,6 @@
   th->t_trans_id = SB_JOURNAL(p_s_sb)->j_trans_id ;
   th->t_caller = "Unknown" ;
   unlock_journal(p_s_sb) ;
-  p_s_sb->s_dirt = 1; 
   return 0 ;
 }
 
@@ -2159,7 +2171,7 @@
     reiserfs_panic(th->t_super, "journal-1577: handle trans id %ld != current trans id %ld\n", 
                    th->t_trans_id, SB_JOURNAL(p_s_sb)->j_trans_id);
   }
-  p_s_sb->s_dirt = 1 ;
+  p_s_sb->s_dirt = 1;
 
   prepared = test_and_clear_bit(BH_JPrepared, &bh->b_state) ;
   /* already in this transaction, we are done */
@@ -2407,12 +2419,8 @@
 ** flushes any old transactions to disk
 ** ends the current transaction if it is too old
 **
-** also calls flush_journal_list with old_only == 1, which allows me to reclaim
-** memory and such from the journal lists whose real blocks are all on disk.
-**
-** called by sync_dev_journal from buffer.c
 */
-int flush_old_commits(struct super_block *p_s_sb, int immediate) {
+int reiserfs_flush_old_commits(struct super_block *p_s_sb) {
   int i ;
   int count = 0;
   int start ; 
@@ -2429,8 +2437,7 @@
   /* starting with oldest, loop until we get to the start */
   i = (SB_JOURNAL_LIST_INDEX(p_s_sb) + 1) % JOURNAL_LIST_COUNT ;
   while(i != start) {
-    if (SB_JOURNAL_LIST(p_s_sb)[i].j_len > 0 && ((now - SB_JOURNAL_LIST(p_s_sb)[i].j_timestamp) > JOURNAL_MAX_COMMIT_AGE ||
-       immediate)) {
+    if (SB_JOURNAL_LIST(p_s_sb)[i].j_len > 0 && ((now - SB_JOURNAL_LIST(p_s_sb)[i].j_timestamp) > JOURNAL_MAX_COMMIT_AGE)) {
       /* we have to check again to be sure the current transaction did not change */
       if (i != SB_JOURNAL_LIST_INDEX(p_s_sb))  {
 	flush_commit_list(p_s_sb, SB_JOURNAL_LIST(p_s_sb) + i, 1) ;
@@ -2439,26 +2446,26 @@
     i = (i + 1) % JOURNAL_LIST_COUNT ;
     count++ ;
   }
+
   /* now, check the current transaction.  If there are no writers, and it is too old, finish it, and
   ** force the commit blocks to disk
   */
-  if (!immediate && atomic_read(&(SB_JOURNAL(p_s_sb)->j_wcount)) <= 0 &&  
+  if (atomic_read(&(SB_JOURNAL(p_s_sb)->j_wcount)) <= 0 &&  
      SB_JOURNAL(p_s_sb)->j_trans_start_time > 0 && 
      SB_JOURNAL(p_s_sb)->j_len > 0 && 
      (now - SB_JOURNAL(p_s_sb)->j_trans_start_time) > JOURNAL_MAX_TRANS_AGE) {
     journal_join(&th, p_s_sb, 1) ;
     reiserfs_prepare_for_journal(p_s_sb, SB_BUFFER_WITH_SB(p_s_sb), 1) ;
     journal_mark_dirty(&th, p_s_sb, SB_BUFFER_WITH_SB(p_s_sb)) ;
-    do_journal_end(&th, p_s_sb,1, COMMIT_NOW) ;
-  } else if (immediate) { /* belongs above, but I wanted this to be very explicit as a special case.  If they say to 
-                             flush, we must be sure old transactions hit the disk too. */
-    journal_join(&th, p_s_sb, 1) ;
-    reiserfs_prepare_for_journal(p_s_sb, SB_BUFFER_WITH_SB(p_s_sb), 1) ;
-    journal_mark_dirty(&th, p_s_sb, SB_BUFFER_WITH_SB(p_s_sb)) ;
+
+    /* we're only being called from kreiserfsd, it makes no sense to do
+    ** an async commit so that kreiserfsd can do it later
+    */
     do_journal_end(&th, p_s_sb,1, COMMIT_NOW | WAIT) ;
-  }
-   reiserfs_journal_kupdate(p_s_sb) ;
-   return 0 ;
+  } 
+  reiserfs_journal_kupdate(p_s_sb) ;
+
+  return p_s_sb->s_dirt;
 }
 
 /*
@@ -2497,7 +2504,7 @@
   if (SB_JOURNAL(p_s_sb)->j_len == 0) {
     int wcount = atomic_read(&(SB_JOURNAL(p_s_sb)->j_wcount)) ;
     unlock_journal(p_s_sb) ;
-    if (atomic_read(&(SB_JOURNAL(p_s_sb)->j_jlock))  > 0 && wcount <= 0) {
+    if (atomic_read(&(SB_JOURNAL(p_s_sb)->j_jlock)) > 0 && wcount <= 0) {
       atomic_dec(&(SB_JOURNAL(p_s_sb)->j_jlock)) ;
       wake_up(&(SB_JOURNAL(p_s_sb)->j_join_wait)) ;
     }
@@ -2768,6 +2775,7 @@
   ** it tells us if we should continue with the journal_end, or just return
   */
   if (!check_journal_end(th, p_s_sb, nblocks, flags)) {
+    p_s_sb->s_dirt = 1;
     return 0 ;
   }
 
diff -urN --exclude *.orig parent/fs/reiserfs/objectid.c comp/fs/reiserfs/objectid.c
--- parent/fs/reiserfs/objectid.c	Mon Apr 29 10:20:24 2002
+++ comp/fs/reiserfs/objectid.c	Mon Apr 29 10:20:19 2002
@@ -87,7 +87,6 @@
     }
 
     journal_mark_dirty(th, s, SB_BUFFER_WITH_SB (s));
-    s->s_dirt = 1;
     return unused_objectid;
 }
 
@@ -106,8 +105,6 @@
 
     reiserfs_prepare_for_journal(s, SB_BUFFER_WITH_SB(s), 1) ;
     journal_mark_dirty(th, s, SB_BUFFER_WITH_SB (s)); 
-    s->s_dirt = 1;
-
 
     /* start at the beginning of the objectid map (i = 0) and go to
        the end of it (i = disk_sb->s_oid_cursize).  Linear search is
diff -urN --exclude *.orig parent/fs/reiserfs/super.c comp/fs/reiserfs/super.c
--- parent/fs/reiserfs/super.c	Mon Apr 29 10:20:24 2002
+++ comp/fs/reiserfs/super.c	Mon Apr 29 10:20:19 2002
@@ -29,23 +29,22 @@
 static int reiserfs_remount (struct super_block * s, int * flags, char * data);
 static int reiserfs_statfs (struct super_block * s, struct statfs * buf);
 
-//
-// a portion of this function, particularly the VFS interface portion,
-// was derived from minix or ext2's analog and evolved as the
-// prototype did. You should be able to tell which portion by looking
-// at the ext2 code and comparing. It's subfunctions contain no code
-// used as a template unless they are so labeled.
-//
+/* kreiserfsd does all the periodic stuff for us */
 static void reiserfs_write_super (struct super_block * s)
 {
+    return;
+}
 
-  int dirty = 0 ;
-  lock_kernel() ;
-  if (!(s->s_flags & MS_RDONLY)) {
-    dirty = flush_old_commits(s, 1) ;
-  }
-  s->s_dirt = dirty;
-  unlock_kernel() ;
+static void reiserfs_commit_super (struct super_block * s)
+{
+    struct reiserfs_transaction_handle th;
+    lock_kernel() ;
+    if (!(s->s_flags & MS_RDONLY)) {
+	journal_begin(&th, s, 1);
+	journal_end_sync(&th, s, 1);
+	s->s_dirt = 0;
+    }
+    unlock_kernel() ;
 }
 
 //
@@ -58,7 +57,6 @@
 static void reiserfs_write_super_lockfs (struct super_block * s)
 {
 
-  int dirty = 0 ;
   struct reiserfs_transaction_handle th ;
   lock_kernel() ;
   if (!(s->s_flags & MS_RDONLY)) {
@@ -68,7 +66,7 @@
     reiserfs_block_writes(&th) ;
     journal_end(&th, s, 1) ;
   }
-  s->s_dirt = dirty;
+  s->s_dirt = 0;
   unlock_kernel() ;
 }
 
@@ -357,6 +355,7 @@
   ** to do a journal_end
   */
   journal_release(&th, s) ;
+  s->s_dirt = 0;
 
   for (i = 0; i < SB_BMAP_NR (s); i ++)
     brelse (SB_AP_BITMAP (s)[i]);
@@ -413,6 +412,7 @@
   put_super: reiserfs_put_super,
   write_super: reiserfs_write_super,
   write_super_lockfs: reiserfs_write_super_lockfs,
+  commit_super: reiserfs_commit_super,
   unlockfs: reiserfs_unlockfs,
   statfs: reiserfs_statfs,
   remount_fs: reiserfs_remount,
@@ -968,6 +968,7 @@
 

     memset (&s->u.reiserfs_sb, 0, sizeof (struct reiserfs_sb_info));
+    INIT_LIST_HEAD(&s->u.reiserfs_sb.s_reiserfs_supers);
 
     if (parse_options ((char *) data, &(s->u.reiserfs_sb.s_mount_opt), &blocks) == 0) {
 	return NULL;
diff -urN --exclude *.orig parent/fs/super.c comp/fs/super.c
--- parent/fs/super.c	Mon Apr 29 10:20:24 2002
+++ comp/fs/super.c	Mon Apr 29 10:20:19 2002
@@ -396,6 +396,7 @@
 	struct file_system_type *fs = s->s_type;
 
 	spin_lock(&sb_lock);
+	s->s_type = NULL;
 	list_del(&s->s_list);
 	list_del(&s->s_instances);
 	spin_unlock(&sb_lock);
@@ -440,12 +441,23 @@
 	unlock_super(sb);
 }
 
+static inline void commit_super(struct super_block *sb)
+{
+	lock_super(sb);
+	if (sb->s_root && sb->s_dirt)
+		if (sb->s_op && sb->s_op->write_super)
+			sb->s_op->write_super(sb);
+		if (sb->s_op && sb->s_op->commit_super)
+			sb->s_op->commit_super(sb);
+	unlock_super(sb);
+}
+
 /*
  * Note: check the dirty flag before waiting, so we don't
  * hold up the sync while mounting a device. (The newly
  * mounted device won't need syncing.)
  */
-void sync_supers(kdev_t dev)
+static void dirty_super_op(kdev_t dev, void (*func)(struct super_block *))
 {
 	struct super_block * sb;
 
@@ -453,25 +465,41 @@
 		sb = get_super(dev);
 		if (sb) {
 			if (sb->s_dirt)
-				write_super(sb);
+				func(sb);
 			drop_super(sb);
 		}
 		return;
 	}
-restart:
 	spin_lock(&sb_lock);
+restart:
 	sb = sb_entry(super_blocks.next);
-	while (sb != sb_entry(&super_blocks))
+	while (sb != sb_entry(&super_blocks)) {
 		if (sb->s_dirt) {
 			sb->s_count++;
 			spin_unlock(&sb_lock);
 			down_read(&sb->s_umount);
-			write_super(sb);
-			drop_super(sb);
-			goto restart;
-		} else
-			sb = sb_entry(sb->s_list.next);
+			func(sb);
+			up_read(&sb->s_umount);
+			spin_lock(&sb_lock);
+			if (!--sb->s_count) {
+				destroy_super(sb);
+				goto restart;
+			} else if (!sb->s_type)
+				goto restart;
+		}
+		sb = sb_entry(sb->s_list.next);
+	}
 	spin_unlock(&sb_lock);
+}
+
+void sync_supers(kdev_t dev)
+{
+    dirty_super_op(dev, write_super);
+}
+
+void commit_supers(kdev_t dev)
+{
+    dirty_super_op(dev, commit_super);
 }
 
 /**
diff -urN --exclude *.orig parent/include/linux/fs.h comp/include/linux/fs.h
--- parent/include/linux/fs.h	Mon Apr 29 10:20:24 2002
+++ comp/include/linux/fs.h	Mon Apr 29 10:20:19 2002
@@ -918,6 +918,7 @@
 	struct dentry * (*fh_to_dentry)(struct super_block *sb, __u32 *fh, int len, int fhtype, int parent);
 	int (*dentry_to_fh)(struct dentry *, __u32 *fh, int *lenp, int need_parent);
 	int (*show_options)(struct seq_file *, struct vfsmount *);
+	void (*commit_super) (struct super_block *);
 };
 
 /* Inode state bits.. */
@@ -1226,6 +1227,7 @@
 extern int filemap_fdatasync(struct address_space *);
 extern int filemap_fdatawait(struct address_space *);
 extern void sync_supers(kdev_t);
+extern void commit_supers(kdev_t);
 extern int bmap(struct inode *, int);
 extern int notify_change(struct dentry *, struct iattr *);
 extern int permission(struct inode *, int);
diff -urN --exclude *.orig parent/include/linux/reiserfs_fs.h comp/include/linux/reiserfs_fs.h
--- parent/include/linux/reiserfs_fs.h	Mon Apr 29 10:20:24 2002
+++ comp/include/linux/reiserfs_fs.h	Mon Apr 29 10:20:19 2002
@@ -1533,6 +1533,7 @@
 */
 #define JOURNAL_BUFFER(j,n) ((j)->j_ap_blocks[((j)->j_start + (n)) % JOURNAL_BLOCK_COUNT])
 
+int reiserfs_flush_old_commits(struct super_block *);
 void reiserfs_commit_for_inode(struct inode *) ;
 void reiserfs_update_inode_transaction(struct inode *) ;
 void reiserfs_wait_on_write_block(struct super_block *s) ;
diff -urN --exclude *.orig parent/include/linux/reiserfs_fs_sb.h comp/include/linux/reiserfs_fs_sb.h
--- parent/include/linux/reiserfs_fs_sb.h	Mon Apr 29 10:20:24 2002
+++ comp/include/linux/reiserfs_fs_sb.h	Mon Apr 29 10:20:21 2002
@@ -291,8 +291,7 @@
   */
   struct reiserfs_page_list *j_flush_pages ;
   time_t j_trans_start_time ;         /* time this transaction started */
-  wait_queue_head_t j_wait ;         /* wait  journal_end to finish I/O */
-  atomic_t j_wlock ;                       /* lock for j_wait */
+  struct semaphore j_lock ;
   wait_queue_head_t j_join_wait ;    /* wait for current transaction to finish before starting new one */
   atomic_t j_jlock ;                       /* lock for j_join_wait */
   int j_journal_list_index ;	      /* journal list number of the current trans */
@@ -444,6 +443,7 @@
     int s_is_unlinked_ok;
     reiserfs_proc_info_data_t s_proc_info_data;
     struct proc_dir_entry *procdir;
+    struct list_head s_reiserfs_supers;
 };
 
 /* Definitions of reiserfs on-disk properties: */
@@ -510,7 +510,6 @@
 void reiserfs_file_buffer (struct buffer_head * bh, int list);
 int reiserfs_is_super(struct super_block *s)  ;
 int journal_mark_dirty(struct reiserfs_transaction_handle *, struct super_block *, struct buffer_head *bh) ;
-int flush_old_commits(struct super_block *s, int) ;
 int show_reiserfs_locks(void) ;
 int reiserfs_resize(struct super_block *, unsigned long) ;
 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fsync() Performance Issue
  2002-04-30 14:20 ` Oleg Drokin
  2002-04-30 14:27   ` Chris Mason
@ 2002-05-02  5:07   ` Christian Stuke
  2002-05-02  6:20     ` Oleg Drokin
  1 sibling, 1 reply; 30+ messages in thread
From: Christian Stuke @ 2002-05-02  5:07 UTC (permalink / raw)
  To: Oleg Drokin, berthiaume_wayne; +Cc: reiserfs-list

Could we have this for 2.4.18+ pending also please?

Chris
----- Original Message -----
From: "Oleg Drokin" <green@namesys.com>
To: <berthiaume_wayne@emc.com>
Cc: <reiserfs-list@namesys.com>
Sent: Tuesday, April 30, 2002 4:20 PM
Subject: Re: [reiserfs-list] fsync() Performance Issue


> Hello!
>
> On Fri, Apr 26, 2002 at 04:28:26PM -0400, berthiaume_wayne@emc.com wrote:
> > I'm wondering if anyone out there may have some suggestions on how
> > to improve the performance of a system employing fsync(). I have to be able
> > to guarantee that every write to my fileserver is on disk when the client has
> > passed it to the server. Therefore, I have disabled write cache on the disk
> > and issue an fsync() per file. I'm running 2.4.19-pre7, reiserfs 3.6.25,
> > without additional patches. I have seen some discussions out here about
> > various other "speed-up" patches and am wondering if I need to add these to
> > 2.4.19-pre7? And what they are and where can I obtain said patches? Also,
> > I'm wondering if there is another solution to syncing the data that is
> > faster than fsync(). Testing, thus far, has shown a large disparity between
> > running with and without sync. Another idea is to explore another filesystem,
> > but I'm not exactly excited by the other journaling filesystems out there at
> > this time. All ideas will be greatly appreciated.
>
> Attached is a speedup patch for 2.4.19-pre7 that should help your fsync
> operations a little. (From Chris Mason).
> Filesystem cannot do very much at this point unfortunately, it is ending up
> waiting for disk to finish write operations.
>
> Also we are working on other speedup patches that would cover different areas
> of write performance itself.
>
> Bye,
>     Oleg
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fsync() Performance Issue
  2002-05-02  5:07   ` Christian Stuke
@ 2002-05-02  6:20     ` Oleg Drokin
  0 siblings, 0 replies; 30+ messages in thread
From: Oleg Drokin @ 2002-05-02  6:20 UTC (permalink / raw)
  To: Christian Stuke; +Cc: berthiaume_wayne, reiserfs-list

Hello!

On Thu, May 02, 2002 at 07:07:18AM +0200, Christian Stuke wrote:
> Could we have this for 2.4.18+ pending also please?

This patch would apply to 2.4.18 + pending patches, I believe.
As for including these patches in the pending queue for 2.4.18, this is
impossible now; it is too big a change, unfortunately. We hope to get something
like this into 2.4.19-pre1+

Bye,
    Oleg

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: fsync() Performance Issue
@ 2002-04-29 17:26 berthiaume_wayne
  0 siblings, 0 replies; 30+ messages in thread
From: berthiaume_wayne @ 2002-04-29 17:26 UTC (permalink / raw)
  To: mason; +Cc: russell, tdickenson, reiserfs-list

	Agreed, it would be better to sync to disk after multiple files
rather than serially; however, so that we need not worry about a power
outage during the process (one of the reasons the disk cache is
disabled), the choice was to fsync() each write.
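
For readers following along, the "fsync() each write" approach described above can be sketched roughly as follows. This is a minimal illustration and not the code used on the fileserver: the helper name write_durable() is hypothetical, and the sketch assumes plain POSIX open()/write()/fsync()/close(). The point it shows is that success is only reported to the caller once fsync() has returned without error.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not from the thread): write len bytes to path and
 * return 0 only after fsync() has reported success; return -1 on error. */
int write_durable(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && (buf != NULL || len == 0));

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing, retry */
            close(fd);
            return -1;
        }
        p += n;                 /* handle short writes */
        len -= (size_t)n;
    }

    /* Block until the kernel reports the file's data as flushed. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

A server built this way would acknowledge the client only when write_durable() returns 0. The cost is one synchronous wait per file, which is the disparity reported at the start of the thread.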

-----Original Message-----
From: Chris Mason [mailto:mason@suse.com]
Sent: Monday, April 29, 2002 12:46 PM
To: tdickenson@geminidataloggers.com
Cc: Russell Coker; berthiaume_wayne@emc.com; reiserfs-list@namesys.com
Subject: Re: [reiserfs-list] fsync() Performance Issue


On Mon, 2002-04-29 at 12:32, Toby Dickenson wrote:

> >One thing that has occurred to me (which has not been previously discussed as
> >far as I recall) is the possibility for using sync() instead of fsync() if
> >you can accumulate a number of files (and therefore replace many fsync()'s
> >with one sync() ).
> 
> I can see
> 
> write to file A
> write to file B
> write to file C
> sync
> 
> might be faster than
> 
> write to file A
> fsync A
> write to file B
> fsync B
> write to file C
> fsync C

Correct.

> 
> but is it possible for it to be faster than
> 
> write to file A
> write to file B
> write to file C
> fsync A
> fsync B
> fsync C

It depends on the rest of the system.  sync() goes through the big lru
list for the whole box, and fsync() goes through the private list for
just that inode.  If you've got other devices or files with dirty data,
case C that you presented will always be the fastest.  For general use,
I like this one the best, it is what the journal code is optimized for.

If files A, B, and C are the only dirty things on the whole box, a
single sync() will be slightly better, mostly due to reduced cpu time.

-chris
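
For illustration only, here is a minimal user-space sketch of the third
ordering discussed above (write every file first, then fsync() each one).
It is not taken from any code in this thread; the function names, the
MAX_BATCH cap, and the file mode are assumptions made for the example.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_BATCH 16	/* arbitrary cap chosen for this sketch */

/* Write all of buf to fd, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Write the same buffer to every path first, then fsync() each file.
 * Returns 0 only if every write and every fsync() succeeded.
 */
int write_batch_then_fsync(const char *const paths[], size_t count,
			   const char *buf, size_t len)
{
	int fds[MAX_BATCH];
	size_t i, opened = 0;
	int rc = 0;

	if (count > MAX_BATCH)
		return -1;

	/* Phase 1: dirty all the files without waiting on the disk. */
	for (i = 0; i < count; i++) {
		fds[i] = open(paths[i], O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (fds[i] < 0) {
			rc = -1;
			break;
		}
		opened++;
		if (write_all(fds[i], buf, len) < 0) {
			rc = -1;
			break;
		}
	}

	/* Phase 2: force each file to disk, then close it. */
	for (i = 0; i < opened; i++) {
		if (rc == 0 && fsync(fds[i]) < 0)
			rc = -1;
		if (close(fds[i]) < 0)
			rc = -1;
	}
	return rc;
}
```

The caller should acknowledge the client only after the function returns 0,
since nothing is guaranteed to be on disk before the fsync() phase finishes.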


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: fsync() Performance Issue
@ 2002-04-30 14:45 berthiaume_wayne
  0 siblings, 0 replies; 30+ messages in thread
From: berthiaume_wayne @ 2002-04-30 14:45 UTC (permalink / raw)
  To: mason, green; +Cc: berthiaume_wayne, reiserfs-list

	Thanks. I'll start putting this one into test.
Wayne.

-----Original Message-----
From: Chris Mason [mailto:mason@suse.com]
Sent: Tuesday, April 30, 2002 10:28 AM
To: Oleg Drokin
Cc: berthiaume_wayne@emc.com; reiserfs-list@namesys.com
Subject: Re: [reiserfs-list] fsync() Performance Issue


On Tue, 2002-04-30 at 10:20, Oleg Drokin wrote:

> Attached is a speedup patch for 2.4.19-pre7 that should help your fsync
> operations a little. (From Chris Mason).
> The filesystem cannot do very much at this point, unfortunately; it ends up
> waiting for the disk to finish write operations.
> 
> Also, we are working on other speedup patches that would cover different
> areas of write performance itself.

A newer one (against 2.4.19-pre7) is below.  It has not been through as
much testing on the namesys side, which is why Oleg sent the older one.

Wayne and I have been talking in private mail, he's getting a bunch of
beta patches later today (this speedup, data logging, updated barrier
code).  Along with instructions for testing.

-chris

# Veritas (Hugh Dickins supplied the patch) sent the bits in
# fs/super.c that allow the FS to leave super->s_dirt set after a
# write_super call.
#
diff -urN --exclude *.orig parent/fs/buffer.c comp/fs/buffer.c
--- parent/fs/buffer.c	Mon Apr 29 10:20:24 2002
+++ comp/fs/buffer.c	Mon Apr 29 10:20:22 2002
@@ -325,6 +325,8 @@
 	lock_super(sb);
 	if (sb->s_dirt && sb->s_op && sb->s_op->write_super)
 		sb->s_op->write_super(sb);
+	if (sb->s_op && sb->s_op->commit_super)
+		sb->s_op->commit_super(sb);
 	unlock_super(sb);
 	unlock_kernel();
 
@@ -344,7 +346,7 @@
 	lock_kernel();
 	sync_inodes(dev);
 	DQUOT_SYNC(dev);
-	sync_supers(dev);
+	commit_supers(dev);
 	unlock_kernel();
 
 	return sync_buffers(dev, 1);
diff -urN --exclude *.orig parent/fs/reiserfs/bitmap.c
comp/fs/reiserfs/bitmap.c
--- parent/fs/reiserfs/bitmap.c	Mon Apr 29 10:20:24 2002
+++ comp/fs/reiserfs/bitmap.c	Mon Apr 29 10:20:19 2002
@@ -122,7 +122,6 @@
   set_sb_free_blocks( rs, sb_free_blocks(rs) + 1 );
 
   journal_mark_dirty (th, s, sbh);
-  s->s_dirt = 1;
 }
 
 void reiserfs_free_block (struct reiserfs_transaction_handle *th, 
@@ -433,7 +432,6 @@
   /* update free block count in super block */
   PUT_SB_FREE_BLOCKS( s, SB_FREE_BLOCKS(s) - init_amount_needed );
   journal_mark_dirty (th, s, SB_BUFFER_WITH_SB (s));
-  s->s_dirt = 1;
 
   return CARRY_ON;
 }
diff -urN --exclude *.orig parent/fs/reiserfs/ibalance.c
comp/fs/reiserfs/ibalance.c
--- parent/fs/reiserfs/ibalance.c	Mon Apr 29 10:20:24 2002
+++ comp/fs/reiserfs/ibalance.c	Mon Apr 29 10:20:19 2002
@@ -632,7 +632,6 @@
 		/* use check_internal if new root is an internal node */
 		check_internal (new_root);
 	    /*&&&&&&&&&&&&&&&&&&&&&&*/
-	    tb->tb_sb->s_dirt = 1;
 
 	    /* do what is needed for buffer thrown from tree */
 	    reiserfs_invalidate_buffer(tb, tbSh);
@@ -950,7 +949,6 @@
         PUT_SB_ROOT_BLOCK( tb->tb_sb, tbSh->b_blocknr );
         PUT_SB_TREE_HEIGHT( tb->tb_sb, SB_TREE_HEIGHT(tb->tb_sb) + 1 );
 	do_balance_mark_sb_dirty (tb, tb->tb_sb->u.reiserfs_sb.s_sbh, 1);
-	tb->tb_sb->s_dirt = 1;
     }
 	
     if ( tb->blknum[h] == 2 ) {
diff -urN --exclude *.orig parent/fs/reiserfs/journal.c
comp/fs/reiserfs/journal.c
--- parent/fs/reiserfs/journal.c	Mon Apr 29 10:20:24 2002
+++ comp/fs/reiserfs/journal.c	Mon Apr 29 10:20:21 2002
@@ -64,12 +64,15 @@
 */
 static int reiserfs_mounted_fs_count = 0 ;
 
+static struct list_head kreiserfsd_supers =
LIST_HEAD_INIT(kreiserfsd_supers);
+
 /* wake this up when you add something to the commit thread task queue */
 DECLARE_WAIT_QUEUE_HEAD(reiserfs_commit_thread_wait) ;
 
 /* wait on this if you need to be sure you task queue entries have been run
*/
 static DECLARE_WAIT_QUEUE_HEAD(reiserfs_commit_thread_done) ;
 DECLARE_TASK_QUEUE(reiserfs_commit_thread_tq) ;
+DECLARE_MUTEX(kreiserfsd_sem) ;
 
 #define JOURNAL_TRANS_HALF 1018   /* must be correct to keep the desc and
commit
 				     structs at 4k */
@@ -576,17 +579,12 @@
 /* lock the current transaction */
 inline static void lock_journal(struct super_block *p_s_sb) {
   PROC_INFO_INC( p_s_sb, journal.lock_journal );
-  while(atomic_read(&(SB_JOURNAL(p_s_sb)->j_wlock)) > 0) {
-    PROC_INFO_INC( p_s_sb, journal.lock_journal_wait );
-    sleep_on(&(SB_JOURNAL(p_s_sb)->j_wait)) ;
-  }
-  atomic_set(&(SB_JOURNAL(p_s_sb)->j_wlock), 1) ;
+  down(&SB_JOURNAL(p_s_sb)->j_lock);
 }
 
 /* unlock the current transaction */
 inline static void unlock_journal(struct super_block *p_s_sb) {
-  atomic_dec(&(SB_JOURNAL(p_s_sb)->j_wlock)) ;
-  wake_up(&(SB_JOURNAL(p_s_sb)->j_wait)) ;
+  up(&SB_JOURNAL(p_s_sb)->j_lock);
 }
 
 /*
@@ -756,7 +754,6 @@
   atomic_set(&(jl->j_commit_flushing), 0) ;
   wake_up(&(jl->j_commit_wait)) ;
 
-  s->s_dirt = 1 ;
   return 0 ;
 }
 
@@ -1220,7 +1217,6 @@
     if (run++ == 0) {
         goto loop_start ;
     }
-
     atomic_set(&(jl->j_flushing), 0) ;
     wake_up(&(jl->j_flush_wait)) ;
     return ret ;
@@ -1250,7 +1246,7 @@
     while(i != start) {
         jl = SB_JOURNAL_LIST(s) + i  ;
         age = CURRENT_TIME - jl->j_timestamp ;
-        if (jl->j_len > 0 && // age >= (JOURNAL_MAX_COMMIT_AGE * 2) && 
+        if (jl->j_len > 0 && age >= JOURNAL_MAX_COMMIT_AGE && 
             atomic_read(&(jl->j_nonzerolen)) > 0 &&
             atomic_read(&(jl->j_commit_left)) == 0) {
 
@@ -1325,6 +1321,10 @@
 static int do_journal_release(struct reiserfs_transaction_handle *th,
struct super_block *p_s_sb, int error) {
   struct reiserfs_transaction_handle myth ;
 
+  down(&kreiserfsd_sem);
+  list_del(&p_s_sb->u.reiserfs_sb.s_reiserfs_supers);
+  up(&kreiserfsd_sem);
+
   /* we only want to flush out transactions if we were called with error ==
0
   */
   if (!error && !(p_s_sb->s_flags & MS_RDONLY)) {
@@ -1811,10 +1811,6 @@
   jl = SB_JOURNAL_LIST(ct->p_s_sb) + ct->jindex ;
 
   flush_commit_list(ct->p_s_sb, SB_JOURNAL_LIST(ct->p_s_sb) + ct->jindex,
1) ; 
-  if (jl->j_len > 0 && atomic_read(&(jl->j_nonzerolen)) > 0 && 
-      atomic_read(&(jl->j_commit_left)) == 0) {
-    kupdate_one_transaction(ct->p_s_sb, jl) ;
-  }
   reiserfs_kfree(ct->self, sizeof(struct reiserfs_journal_commit_task),
ct->p_s_sb) ;
 }
 
@@ -1864,6 +1860,9 @@
 ** then run the per filesystem commit task queue when we wakeup.
 */
 static int reiserfs_journal_commit_thread(void *nullp) {
+  struct list_head *entry, *safe ;
+  struct super_block *s;
+  time_t last_run = 0;
 
   daemonize() ;
 
@@ -1879,6 +1878,18 @@
     while(TQ_ACTIVE(reiserfs_commit_thread_tq)) {
       run_task_queue(&reiserfs_commit_thread_tq) ;
     }
+    if (CURRENT_TIME - last_run > 5) {
+	down(&kreiserfsd_sem);
+	list_for_each_safe(entry, safe, &kreiserfsd_supers) {
+	    s = list_entry(entry, struct super_block, 
+	                   u.reiserfs_sb.s_reiserfs_supers);    
+	    if (!(s->s_flags & MS_RDONLY)) {
+		reiserfs_flush_old_commits(s);
+	    }
+	}
+	up(&kreiserfsd_sem);
+	last_run = CURRENT_TIME;
+    }
 
     /* if there aren't any more filesystems left, break */
     if (reiserfs_mounted_fs_count <= 0) {
@@ -1953,13 +1964,12 @@
   SB_JOURNAL(p_s_sb)->j_last = NULL ;	  
   SB_JOURNAL(p_s_sb)->j_first = NULL ;     
   init_waitqueue_head(&(SB_JOURNAL(p_s_sb)->j_join_wait)) ;
-  init_waitqueue_head(&(SB_JOURNAL(p_s_sb)->j_wait)) ; 
+  sema_init(&SB_JOURNAL(p_s_sb)->j_lock, 1);
 
   SB_JOURNAL(p_s_sb)->j_trans_id = 10 ;  
   SB_JOURNAL(p_s_sb)->j_mount_id = 10 ; 
   SB_JOURNAL(p_s_sb)->j_state = 0 ;
   atomic_set(&(SB_JOURNAL(p_s_sb)->j_jlock), 0) ;
-  atomic_set(&(SB_JOURNAL(p_s_sb)->j_wlock), 0) ;
   SB_JOURNAL(p_s_sb)->j_cnode_free_list = allocate_cnodes(num_cnodes) ;
   SB_JOURNAL(p_s_sb)->j_cnode_free_orig =
SB_JOURNAL(p_s_sb)->j_cnode_free_list ;
   SB_JOURNAL(p_s_sb)->j_cnode_free = SB_JOURNAL(p_s_sb)->j_cnode_free_list
? num_cnodes : 0 ;
@@ -1989,6 +1999,9 @@
     kernel_thread((void *)(void *)reiserfs_journal_commit_thread, NULL,
                   CLONE_FS | CLONE_FILES | CLONE_VM) ;
   }
+  down(&kreiserfsd_sem);
+  list_add(&p_s_sb->u.reiserfs_sb.s_reiserfs_supers, &kreiserfsd_supers);
+  up(&kreiserfsd_sem);
   return 0 ;
 }
 
@@ -2117,7 +2130,6 @@
   th->t_trans_id = SB_JOURNAL(p_s_sb)->j_trans_id ;
   th->t_caller = "Unknown" ;
   unlock_journal(p_s_sb) ;
-  p_s_sb->s_dirt = 1; 
   return 0 ;
 }
 
@@ -2159,7 +2171,7 @@
     reiserfs_panic(th->t_super, "journal-1577: handle trans id %ld !=
current trans id %ld\n", 
                    th->t_trans_id, SB_JOURNAL(p_s_sb)->j_trans_id);
   }
-  p_s_sb->s_dirt = 1 ;
+  p_s_sb->s_dirt = 1;
 
   prepared = test_and_clear_bit(BH_JPrepared, &bh->b_state) ;
   /* already in this transaction, we are done */
@@ -2407,12 +2419,8 @@
 ** flushes any old transactions to disk
 ** ends the current transaction if it is too old
 **
-** also calls flush_journal_list with old_only == 1, which allows me to
reclaim
-** memory and such from the journal lists whose real blocks are all on
disk.
-**
-** called by sync_dev_journal from buffer.c
 */
-int flush_old_commits(struct super_block *p_s_sb, int immediate) {
+int reiserfs_flush_old_commits(struct super_block *p_s_sb) {
   int i ;
   int count = 0;
   int start ; 
@@ -2429,8 +2437,7 @@
   /* starting with oldest, loop until we get to the start */
   i = (SB_JOURNAL_LIST_INDEX(p_s_sb) + 1) % JOURNAL_LIST_COUNT ;
   while(i != start) {
-    if (SB_JOURNAL_LIST(p_s_sb)[i].j_len > 0 && ((now -
SB_JOURNAL_LIST(p_s_sb)[i].j_timestamp) > JOURNAL_MAX_COMMIT_AGE ||
-       immediate)) {
+    if (SB_JOURNAL_LIST(p_s_sb)[i].j_len > 0 && ((now -
SB_JOURNAL_LIST(p_s_sb)[i].j_timestamp) > JOURNAL_MAX_COMMIT_AGE)) {
       /* we have to check again to be sure the current transaction did not
change */
       if (i != SB_JOURNAL_LIST_INDEX(p_s_sb))  {
 	flush_commit_list(p_s_sb, SB_JOURNAL_LIST(p_s_sb) + i, 1) ;
@@ -2439,26 +2446,26 @@
     i = (i + 1) % JOURNAL_LIST_COUNT ;
     count++ ;
   }
+
   /* now, check the current transaction.  If there are no writers, and it
is too old, finish it, and
   ** force the commit blocks to disk
   */
-  if (!immediate && atomic_read(&(SB_JOURNAL(p_s_sb)->j_wcount)) <= 0 &&  
+  if (atomic_read(&(SB_JOURNAL(p_s_sb)->j_wcount)) <= 0 &&  
      SB_JOURNAL(p_s_sb)->j_trans_start_time > 0 && 
      SB_JOURNAL(p_s_sb)->j_len > 0 && 
      (now - SB_JOURNAL(p_s_sb)->j_trans_start_time) >
JOURNAL_MAX_TRANS_AGE) {
     journal_join(&th, p_s_sb, 1) ;
     reiserfs_prepare_for_journal(p_s_sb, SB_BUFFER_WITH_SB(p_s_sb), 1) ;
     journal_mark_dirty(&th, p_s_sb, SB_BUFFER_WITH_SB(p_s_sb)) ;
-    do_journal_end(&th, p_s_sb,1, COMMIT_NOW) ;
-  } else if (immediate) { /* belongs above, but I wanted this to be very
explicit as a special case.  If they say to 
-                             flush, we must be sure old transactions hit
the disk too. */
-    journal_join(&th, p_s_sb, 1) ;
-    reiserfs_prepare_for_journal(p_s_sb, SB_BUFFER_WITH_SB(p_s_sb), 1) ;
-    journal_mark_dirty(&th, p_s_sb, SB_BUFFER_WITH_SB(p_s_sb)) ;
+
+    /* we're only being called from kreiserfsd, it makes no sense to do
+    ** an async commit so that kreiserfsd can do it later
+    */
     do_journal_end(&th, p_s_sb,1, COMMIT_NOW | WAIT) ;
-  }
-   reiserfs_journal_kupdate(p_s_sb) ;
-   return 0 ;
+  } 
+  reiserfs_journal_kupdate(p_s_sb) ;
+
+  return p_s_sb->s_dirt;
 }
 
 /*
@@ -2497,7 +2504,7 @@
   if (SB_JOURNAL(p_s_sb)->j_len == 0) {
     int wcount = atomic_read(&(SB_JOURNAL(p_s_sb)->j_wcount)) ;
     unlock_journal(p_s_sb) ;
-    if (atomic_read(&(SB_JOURNAL(p_s_sb)->j_jlock))  > 0 && wcount <= 0) {
+    if (atomic_read(&(SB_JOURNAL(p_s_sb)->j_jlock)) > 0 && wcount <= 0) {
       atomic_dec(&(SB_JOURNAL(p_s_sb)->j_jlock)) ;
       wake_up(&(SB_JOURNAL(p_s_sb)->j_join_wait)) ;
     }
@@ -2768,6 +2775,7 @@
   ** it tells us if we should continue with the journal_end, or just return
   */
   if (!check_journal_end(th, p_s_sb, nblocks, flags)) {
+    p_s_sb->s_dirt = 1;
     return 0 ;
   }
 
diff -urN --exclude *.orig parent/fs/reiserfs/objectid.c
comp/fs/reiserfs/objectid.c
--- parent/fs/reiserfs/objectid.c	Mon Apr 29 10:20:24 2002
+++ comp/fs/reiserfs/objectid.c	Mon Apr 29 10:20:19 2002
@@ -87,7 +87,6 @@
     }
 
     journal_mark_dirty(th, s, SB_BUFFER_WITH_SB (s));
-    s->s_dirt = 1;
     return unused_objectid;
 }
 
@@ -106,8 +105,6 @@
 
     reiserfs_prepare_for_journal(s, SB_BUFFER_WITH_SB(s), 1) ;
     journal_mark_dirty(th, s, SB_BUFFER_WITH_SB (s)); 
-    s->s_dirt = 1;
-
 
     /* start at the beginning of the objectid map (i = 0) and go to
        the end of it (i = disk_sb->s_oid_cursize).  Linear search is
diff -urN --exclude *.orig parent/fs/reiserfs/super.c
comp/fs/reiserfs/super.c
--- parent/fs/reiserfs/super.c	Mon Apr 29 10:20:24 2002
+++ comp/fs/reiserfs/super.c	Mon Apr 29 10:20:19 2002
@@ -29,23 +29,22 @@
 static int reiserfs_remount (struct super_block * s, int * flags, char *
data);
 static int reiserfs_statfs (struct super_block * s, struct statfs * buf);
 
-//
-// a portion of this function, particularly the VFS interface portion,
-// was derived from minix or ext2's analog and evolved as the
-// prototype did. You should be able to tell which portion by looking
-// at the ext2 code and comparing. It's subfunctions contain no code
-// used as a template unless they are so labeled.
-//
+/* kreiserfsd does all the periodic stuff for us */
 static void reiserfs_write_super (struct super_block * s)
 {
+    return;
+}
 
-  int dirty = 0 ;
-  lock_kernel() ;
-  if (!(s->s_flags & MS_RDONLY)) {
-    dirty = flush_old_commits(s, 1) ;
-  }
-  s->s_dirt = dirty;
-  unlock_kernel() ;
+static void reiserfs_commit_super (struct super_block * s)
+{
+    struct reiserfs_transaction_handle th;
+    lock_kernel() ;
+    if (!(s->s_flags & MS_RDONLY)) {
+	journal_begin(&th, s, 1);
+	journal_end_sync(&th, s, 1);
+	s->s_dirt = 0;
+    }
+    unlock_kernel() ;
 }
 
 //
@@ -58,7 +57,6 @@
 static void reiserfs_write_super_lockfs (struct super_block * s)
 {
 
-  int dirty = 0 ;
   struct reiserfs_transaction_handle th ;
   lock_kernel() ;
   if (!(s->s_flags & MS_RDONLY)) {
@@ -68,7 +66,7 @@
     reiserfs_block_writes(&th) ;
     journal_end(&th, s, 1) ;
   }
-  s->s_dirt = dirty;
+  s->s_dirt = 0;
   unlock_kernel() ;
 }
 
@@ -357,6 +355,7 @@
   ** to do a journal_end
   */
   journal_release(&th, s) ;
+  s->s_dirt = 0;
 
   for (i = 0; i < SB_BMAP_NR (s); i ++)
     brelse (SB_AP_BITMAP (s)[i]);
@@ -413,6 +412,7 @@
   put_super: reiserfs_put_super,
   write_super: reiserfs_write_super,
   write_super_lockfs: reiserfs_write_super_lockfs,
+  commit_super: reiserfs_commit_super,
   unlockfs: reiserfs_unlockfs,
   statfs: reiserfs_statfs,
   remount_fs: reiserfs_remount,
@@ -968,6 +968,7 @@
 

     memset (&s->u.reiserfs_sb, 0, sizeof (struct reiserfs_sb_info));
+    INIT_LIST_HEAD(&s->u.reiserfs_sb.s_reiserfs_supers);
 
     if (parse_options ((char *) data, &(s->u.reiserfs_sb.s_mount_opt),
&blocks) == 0) {
 	return NULL;
diff -urN --exclude *.orig parent/fs/super.c comp/fs/super.c
--- parent/fs/super.c	Mon Apr 29 10:20:24 2002
+++ comp/fs/super.c	Mon Apr 29 10:20:19 2002
@@ -396,6 +396,7 @@
 	struct file_system_type *fs = s->s_type;
 
 	spin_lock(&sb_lock);
+	s->s_type = NULL;
 	list_del(&s->s_list);
 	list_del(&s->s_instances);
 	spin_unlock(&sb_lock);
@@ -440,12 +441,23 @@
 	unlock_super(sb);
 }
 
+static inline void commit_super(struct super_block *sb)
+{
+	lock_super(sb);
+	if (sb->s_root && sb->s_dirt)
+		if (sb->s_op && sb->s_op->write_super)
+			sb->s_op->write_super(sb);
+		if (sb->s_op && sb->s_op->commit_super)
+			sb->s_op->commit_super(sb);
+	unlock_super(sb);
+}
+
 /*
  * Note: check the dirty flag before waiting, so we don't
  * hold up the sync while mounting a device. (The newly
  * mounted device won't need syncing.)
  */
-void sync_supers(kdev_t dev)
+static void dirty_super_op(kdev_t dev, void (*func)(struct super_block *))
 {
 	struct super_block * sb;
 
@@ -453,25 +465,41 @@
 		sb = get_super(dev);
 		if (sb) {
 			if (sb->s_dirt)
-				write_super(sb);
+				func(sb);
 			drop_super(sb);
 		}
 		return;
 	}
-restart:
 	spin_lock(&sb_lock);
+restart:
 	sb = sb_entry(super_blocks.next);
-	while (sb != sb_entry(&super_blocks))
+	while (sb != sb_entry(&super_blocks)) {
 		if (sb->s_dirt) {
 			sb->s_count++;
 			spin_unlock(&sb_lock);
 			down_read(&sb->s_umount);
-			write_super(sb);
-			drop_super(sb);
-			goto restart;
-		} else
-			sb = sb_entry(sb->s_list.next);
+			func(sb);
+			up_read(&sb->s_umount);
+			spin_lock(&sb_lock);
+			if (!--sb->s_count) {
+				destroy_super(sb);
+				goto restart;
+			} else if (!sb->s_type)
+				goto restart;
+		}
+		sb = sb_entry(sb->s_list.next);
+	}
 	spin_unlock(&sb_lock);
+}
+
+void sync_supers(kdev_t dev)
+{
+    dirty_super_op(dev, write_super);
+}
+
+void commit_supers(kdev_t dev)
+{
+    dirty_super_op(dev, commit_super);
 }
 
 /**
diff -urN --exclude *.orig parent/include/linux/fs.h comp/include/linux/fs.h
--- parent/include/linux/fs.h	Mon Apr 29 10:20:24 2002
+++ comp/include/linux/fs.h	Mon Apr 29 10:20:19 2002
@@ -918,6 +918,7 @@
 	struct dentry * (*fh_to_dentry)(struct super_block *sb, __u32 *fh,
int len, int fhtype, int parent);
 	int (*dentry_to_fh)(struct dentry *, __u32 *fh, int *lenp, int
need_parent);
 	int (*show_options)(struct seq_file *, struct vfsmount *);
+	void (*commit_super) (struct super_block *);
 };
 
 /* Inode state bits.. */
@@ -1226,6 +1227,7 @@
 extern int filemap_fdatasync(struct address_space *);
 extern int filemap_fdatawait(struct address_space *);
 extern void sync_supers(kdev_t);
+extern void commit_supers(kdev_t);
 extern int bmap(struct inode *, int);
 extern int notify_change(struct dentry *, struct iattr *);
 extern int permission(struct inode *, int);
diff -urN --exclude *.orig parent/include/linux/reiserfs_fs.h
comp/include/linux/reiserfs_fs.h
--- parent/include/linux/reiserfs_fs.h	Mon Apr 29 10:20:24 2002
+++ comp/include/linux/reiserfs_fs.h	Mon Apr 29 10:20:19 2002
@@ -1533,6 +1533,7 @@
 */
 #define JOURNAL_BUFFER(j,n) ((j)->j_ap_blocks[((j)->j_start + (n)) %
JOURNAL_BLOCK_COUNT])
 
+int reiserfs_flush_old_commits(struct super_block *);
 void reiserfs_commit_for_inode(struct inode *) ;
 void reiserfs_update_inode_transaction(struct inode *) ;
 void reiserfs_wait_on_write_block(struct super_block *s) ;
diff -urN --exclude *.orig parent/include/linux/reiserfs_fs_sb.h
comp/include/linux/reiserfs_fs_sb.h
--- parent/include/linux/reiserfs_fs_sb.h	Mon Apr 29 10:20:24 2002
+++ comp/include/linux/reiserfs_fs_sb.h	Mon Apr 29 10:20:21 2002
@@ -291,8 +291,7 @@
   */
   struct reiserfs_page_list *j_flush_pages ;
   time_t j_trans_start_time ;         /* time this transaction started */
-  wait_queue_head_t j_wait ;         /* wait  journal_end to finish I/O */
-  atomic_t j_wlock ;                       /* lock for j_wait */
+  struct semaphore j_lock ;
   wait_queue_head_t j_join_wait ;    /* wait for current transaction to
finish before starting new one */
   atomic_t j_jlock ;                       /* lock for j_join_wait */
   int j_journal_list_index ;	      /* journal list number of the current
trans */
@@ -444,6 +443,7 @@
     int s_is_unlinked_ok;
     reiserfs_proc_info_data_t s_proc_info_data;
     struct proc_dir_entry *procdir;
+    struct list_head s_reiserfs_supers;
 };
 
 /* Definitions of reiserfs on-disk properties: */
@@ -510,7 +510,6 @@
 void reiserfs_file_buffer (struct buffer_head * bh, int list);
 int reiserfs_is_super(struct super_block *s)  ;
 int journal_mark_dirty(struct reiserfs_transaction_handle *, struct
super_block *, struct buffer_head *bh) ;
-int flush_old_commits(struct super_block *s, int) ;
 int show_reiserfs_locks(void) ;
 int reiserfs_resize(struct super_block *, unsigned long) ;
 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: fsync() Performance Issue
@ 2002-05-03 20:35 berthiaume_wayne
  2002-05-03 22:00 ` Chris Mason
  0 siblings, 1 reply; 30+ messages in thread
From: berthiaume_wayne @ 2002-05-03 20:35 UTC (permalink / raw)
  To: mason; +Cc: reiserfs-list, green

	Chris, I have some quick preliminary results for you. I have
additional testing to perform and haven't run debugreiserfs() yet. If you
have a preference for which tests to run debugreiserfs() let me know.
	Base testing was done against 2.4.13 built on RH 7.1 using the
test_writes.c code I forwarded to you. The system is a Tyan with single
PIII, IDE Promise 20269, Maxtor 160GB drive - write cache disabled. All
numbers are with fsync() and 1KB files. As I said, more testing, i.e.
filesizes, need to be performed.

2.4.13 			
	=> 47.4ms/file
2.4.19-pre7 / no options
	=> 50.3ms/file
2.4.19-pre7 speedup / no options
	=> 46.8ms/file
2.4.19-pre7 speedup, data logging, write barrier / no options
	=> 47.1ms/file
2.4.19-pre7 speedup, data logging, write barrier / data=journal
	=> 25.2ms/file
2.4.19-pre7 speedup, data logging, write barrier / data=journal,barrier=none
	=> 27.8ms/file

	Now what I find interesting are the last two results. The write
cache is disabled on the HDD. These early results are most promising. Next
week I'll be running bonnie++ and our homegrown test against reiserfs and
ext3fs with the same kernel metrics but add filesizes: 10K, 100K, 1M, 10M,
and 100M. 
	One question is will these patches be going into the 2.4 tree and
when?
Regards,
Wayne.  

-----Original Message-----
From: Chris Mason [mailto:mason@suse.com]
Sent: Wednesday, May 01, 2002 9:32 AM
To: berthiaume_wayne@emc.com
Subject: RE: [reiserfs-list] fsync() Performance Issue


Here are some quick results from the test program you sent.  I haven't
yet done the test on an unpatched kernel, so these numbers just show the
improvement from data logging.

As the size of the synchronous write increases, data logging helps
less.  With an 8k file size, data logging decreases total run time and
per file time by 40%.  16k file size it decreases it by 32%.

This was on a scsi drive.

-chris


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: fsync() Performance Issue
  2002-05-03 20:35 berthiaume_wayne
@ 2002-05-03 22:00 ` Chris Mason
  2002-05-04  2:05   ` Hans Reiser
  0 siblings, 1 reply; 30+ messages in thread
From: Chris Mason @ 2002-05-03 22:00 UTC (permalink / raw)
  To: berthiaume_wayne; +Cc: reiserfs-list, green

On Fri, 2002-05-03 at 16:35, berthiaume_wayne@emc.com wrote:
> 	Chris, I have some quick preliminary results for you. I have
> additional testing to perform and haven't run debugreiserfs() yet. If you
> have a preference for which tests to run debugreiserfs() let me know.
> 	Base testing was done against 2.4.13 built on RH 7.1 using the
> test_writes.c code I forwarded to you. The system is a Tyan with single
> PIII, IDE Promise 20269, Maxtor 160GB drive - write cache disabled. All
> numbers are with fsync() and 1KB files. As I said, more testing, i.e.
> filesizes, need to be performed.

> 2.4.19-pre7 speedup, data logging, write barrier / no options
> 	=> 47.1ms/file

Hi Wayne, thanks for sending these along.

I expected a slight improvement over the 2.4.13 code even with the data
logging turned off.  I'm curious to see how it does with the IDE cache
turned on.  With scsi, I see 10-15% better without any options than an
unpatched kernel.

> 2.4.19-pre7 speedup, data logging, write barrier / data=journal
> 	=> 25.2ms/file
> 2.4.19-pre7 speedup, data logging, write barrier / data=journal,barrier=none
> 	=> 27.8ms/file

The barrier option doesn't make much difference because the write cache
is off.  With write cache on, the barrier code should allow you to be
faster than with the caching off, but without risking the data (Jens and
I are working on final fsync safety issues though).

Hans, data=journal turns on the data journaling.  The data journaling
patches also include optimizations to write metadata back to disk in
bigger chunks for tiny transactions (the current method is to write one
transaction's worth back, when a transaction has 3 blocks, this is
pretty slow).

I've put these patches up on:

ftp.suse.com/pub/people/mason/patches/data-logging

> 	One question is will these patches be going into the 2.4 tree and
> when?

The data logging patches are a huge change, but the good news is they
are based on the nesting patches that have been stable for a long time
in the quota code.  I'll probably want a month or more of heavy testing
before I think about submitting them.

-chris

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fsync() Performance Issue
  2002-05-03 22:00 ` Chris Mason
@ 2002-05-04  2:05   ` Hans Reiser
  2002-05-04  5:41     ` Valdis.Kletnieks
  2002-05-04 13:11     ` Chris Mason
  0 siblings, 2 replies; 30+ messages in thread
From: Hans Reiser @ 2002-05-04  2:05 UTC (permalink / raw)
  To: Chris Mason; +Cc: berthiaume_wayne, reiserfs-list, green

Chris Mason wrote:

>On Fri, 2002-05-03 at 16:35, berthiaume_wayne@emc.com wrote:
>  
>
>>	Chris, I have some quick preliminary results for you. I have
>>additional testing to perform and haven't run debugreiserfs() yet. If you
>>have a preference for which tests to run debugreiserfs() let me know.
>>	Base testing was done against 2.4.13 built on RH 7.1 using the
>>test_writes.c code I forwarded to you. The system is a Tyan with single
>>PIII, IDE Promise 20269, Maxtor 160GB drive - write cache disabled. All
>>numbers are with fsync() and 1KB files. As I said, more testing, i.e.
>>filesizes, need to be performed.
>>    
>>
>
>  
>
>>2.4.19-pre7 speedup, data logging, write barrier / no options
>>	=> 47.1ms/file
>>    
>>
>
>Hi Wayne, thanks for sending these along.
>
>I expected a slight improvement over the 2.4.13 code even with the data
>logging turned off.  I'm curious to see how it does with the IDE cache
>turned on.  With scsi, I see 10-15% better without any options than an
>unpatched kernel.
>
>  
>
>>2.4.19-pre7 speedup, data logging, write barrier / data=journal
>>	=> 25.2ms/file
>>2.4.19-pre7 speedup, data logging, write barrier / data=journal,barrier=none
>>	=> 27.8ms/file
>>    
>>
>
>The barrier option doesn't make much difference because the write cache
>is off.  With write cache on, the barrier code should allow you to be
>faster than with the caching off, but without risking the data (Jens and
>I are working on final fsync safety issues though).
>
>Hans, data=journal turns on the data journaling.  The data journaling
>
and the reason it is faster is....

>patches also include optimizations to write metadata back to disk in
>bigger chunks for tiny transactions (the current method is to write one
>transaction's worth back, when a transaction has 3 blocks, this is
>pretty slow).
>
is this a lazy fsync?  If so, then everything makes sense to me, if not, 
I remain uneducated and looking to receive your wisdom;-)

>
>I've put these patches up on:
>
>ftp.suse.com/pub/people/mason/patches/data-logging
>
>  
>
>>	One question is will these patches be going into the 2.4 tree and
>>when?
>>    
>>
>
>The data logging patches are a huge change, but the good news is they
>are based on the nesting patches that have been stable for a long time
>in the quota code.  I'll probably want a month or more of heavy testing
>before I think about submitting them.
>
>-chris
>
>
>
>
>  
>




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fsync() Performance Issue
  2002-05-04  2:05   ` Hans Reiser
@ 2002-05-04  5:41     ` Valdis.Kletnieks
  2002-05-04 13:11     ` Chris Mason
  1 sibling, 0 replies; 30+ messages in thread
From: Valdis.Kletnieks @ 2002-05-04  5:41 UTC (permalink / raw)
  To: Hans Reiser; +Cc: reiserfs-list

[-- Attachment #1: Type: text/plain, Size: 1819 bytes --]

On Sat, 04 May 2002 06:05:32 +0400, Hans Reiser said:
> Chris Mason wrote:
> >Hans, data=journal turns on the data journaling.  The data journaling
> and the reason it is faster is....
> >patches also include optimizations to write metadata back to disk in
> >bigger chunks for tiny transactions (the current method is to write one
> >transaction's worth back, when a transaction has 3 blocks, this is
> >pretty slow).
> >
> is this a lazy fsync?  If so, then everything makes sense to me, if not, 
> I remain uneducated and looking to receive your wisdom;-)

Well.. *semi* lazy - since the bits *ARE* on the oxide, it's followed
the requirements of fsync() not returning till they hit the oxide ;)

We ran into a similar performance issue when we first started deploying
another vendor's data-journalled filesystem (although at 1:30AM, I'm
unable to remember whether it was Sun or Digital).  We finally did a bunch
of benchmarking work, and discovered that it was due to the arm movement
patterns.  

When running on a non-journalled file system, the arm would be seeking back
and forth however the elevator algorithm told it, to write the next blocks
that needed writing (which if you had any fragmentation or writes to
multiple files, meant you were doing seeks all over the place).

However, when running journalled, during periods of high activity,
the head would basically stay parked in the journal area doing sequential
writes to the journal, meaning the *effective* seek time for the next
write would be essentially zero (since the head was over the journal
already) - it would then go and do all the write-behind out to the
"permanent" area of the disk when the disk wasn't otherwise busy (so
you didn't take a performance hit).
-- 
				Valdis Kletnieks
				Computer Systems Senior Engineer
				Virginia Tech

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fsync() Performance Issue
  2002-05-04  2:05   ` Hans Reiser
  2002-05-04  5:41     ` Valdis.Kletnieks
@ 2002-05-04 13:11     ` Chris Mason
  2002-05-04 14:59       ` Hans Reiser
  1 sibling, 1 reply; 30+ messages in thread
From: Chris Mason @ 2002-05-04 13:11 UTC (permalink / raw)
  To: Hans Reiser; +Cc: berthiaume_wayne, reiserfs-list, green

On Fri, 2002-05-03 at 22:05, Hans Reiser wrote:

> >The barrier option doesn't make much difference because the write cache
> >is off.  With write cache on, the barrier code should allow you to be
> >faster than with the caching off, but without risking the data (Jens and
> >I are working on final fsync safety issues though).
> >
> >Hans, data=journal turns on the data journaling.  The data journaling
> >
> and the reason it is faster is....

Any time you append X number of bytes followed by an fsync (or O_SYNC),
you trigger a commit for the modified metadata, and then a seek to write
the data block.  You wait on the log blocks (usually 5 or 6 of them
total in the transaction) and then you wait for the data block to hit
the main disk area.

With data logging, you also write the data block to the log, and that
means you can wait a while to flush it back to the main disk.  This
increases the size of the transaction by 1, but writing 7 blocks to the
log is almost no different from writing 6.

You don't have to seek back to the main disk until you need to flush the
transaction.  The new code also flushes metadata from more than one
transaction at once, leading to less waiting overall.

The new code is smarter about triggering updates to the journal header
block; they happen much less frequently now, leading to fewer seeks and
less waiting.

As the size of the O_SYNC/fsync write increases, the benefit goes down. 
A 16k write only gets around 30% improvement with the new patches. 
Since a 1k write still needs to write a whole block, it should be the
same speed as a 4k write.
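
One way to see why the benefit shrinks with size is a
back-of-the-envelope model.  The sketch below is not measured data:
SEEK_MS and BLOCK_XFER_MS are assumed round numbers, and the model
ignores rotational position, the later flush of logged data back to the
main area, and everything else a real disk does.  Only the count of log
blocks comes from the description above.

```c
/* Illustrative assumptions, not measurements from this thread. */
#define SEEK_MS        8.0   /* assumed cost of one head repositioning   */
#define BLOCK_XFER_MS  0.1   /* assumed time to transfer one 4k block    */
#define LOG_BLOCKS     6     /* "usually 5 or 6" log blocks per commit   */

/* Without data logging: position on the log and write the log blocks,
 * then seek out to the main disk area and write the data blocks. */
static double metadata_only_ms(int data_blocks)
{
    return SEEK_MS + LOG_BLOCKS * BLOCK_XFER_MS
         + SEEK_MS + data_blocks * BLOCK_XFER_MS;
}

/* With data logging: the data blocks ride along in the same sequential
 * log write, so the fsync does not wait for the second seek. */
static double data_logged_ms(int data_blocks)
{
    return SEEK_MS + (LOG_BLOCKS + data_blocks) * BLOCK_XFER_MS;
}

/* Percentage of the wait saved by data logging in this toy model. */
static double improvement_pct(int data_blocks)
{
    double before = metadata_only_ms(data_blocks);
    return 100.0 * (before - data_logged_ms(data_blocks)) / before;
}
```

In this model data logging always saves exactly one seek, so the saving
is a fixed amount of time while the total grows with the number of data
blocks; improvement_pct() therefore falls as data_blocks rises.  The
actual percentages depend entirely on the assumed constants.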

Wayne, you might want to try the 1k test mounted with -o notail.

-chris

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fsync() Performance Issue
  2002-05-04 13:11     ` Chris Mason
@ 2002-05-04 14:59       ` Hans Reiser
  2002-05-06 12:40         ` Chris Mason
  0 siblings, 1 reply; 30+ messages in thread
From: Hans Reiser @ 2002-05-04 14:59 UTC (permalink / raw)
  To: Chris Mason; +Cc: berthiaume_wayne, reiserfs-list, green

Chris Mason wrote:

>On Fri, 2002-05-03 at 22:05, Hans Reiser wrote:
>
>  
>
>>>The barrier option doesn't make much difference because the write cache
>>>is off.  With write cache on, the barrier code should allow you to be
>>>faster than with the caching off, but without risking the data (Jens and
>>>I are working on final fsync safety issues though).
>>>
>>>Hans, data=journal turns on the data journaling.  The data journaling
>>>
>>>      
>>>
>>and the reason it is faster is....
>>    
>>
>
>Any time you append X number of bytes followed by an fsync (or O_SYNC),
>you trigger a commit for the modified metadata, and then a seek to write
>the data block.  You wait on the log blocks (usually 5 or 6 of them
>total in the transaction) and then you wait for the data block to hit
>the main disk area.
>
>With data logging, you also write the data block to the log, and that
>means you can wait a while to flush it back to the main disk.  This
>increases the size of the transaction by 1, but writing 7 blocks to the
>log is almost no different from writing 6.
>
>You don't have to seek back to the main disk until you need to flush the
>transaction.  The new code also flushes metadata from more the one
>transaction at once, leading to less waiting overall.
>
>The new code is smarter about triggering updates to the journal header
>block, it happens much less frequently now, leading to fewer seeks and
>less waiting.
>
>As the size of the O_SYNC/fsync write increases, the benefit goes down. 
>A 16k write only gets around 30% improvement with the new patches. 
>Since a 1k write still needs to write a whole block, it should be the
>same speed as a 4k write.
>
>Wayne, you might want to try the 1k test mounted with -o notail.
>
>-chris
>
>
>
>
>  
>
So how about if you revise fsync so that it always sends data blocks to 
the journal not to the main disk?

Hans



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fsync() Performance Issue
  2002-05-04 14:59       ` Hans Reiser
@ 2002-05-06 12:40         ` Chris Mason
  2002-05-06 13:02           ` Hans Reiser
  2002-05-06 21:21           ` Hans Reiser
  0 siblings, 2 replies; 30+ messages in thread
From: Chris Mason @ 2002-05-06 12:40 UTC (permalink / raw)
  To: Hans Reiser; +Cc: berthiaume_wayne, reiserfs-list, green

On Sat, 2002-05-04 at 10:59, Hans Reiser wrote:
>
> So how about if you revise fsync so that it always sends data blocks to 
> the journal not to the main disk?

This gets a little sticky.

Once you log a block, it might be replayed after a crash.  So, you have
to protect against corner cases like this:

write(file)
fsync(file) ; /* logs modified data blocks */
write(file) ; /* write the same blocks without fsync */
sync ;        /* user expects new version of the blocks on disk */
<crash>

During replay, the logged data blocks overwrite the blocks sent to disk
via sync().

This isn't hard to correct for: every time a buffer is marked dirty, you
check the journal hash tables to see if it is replayable, and if so you
log it instead (the 2.2.x code did this due to tails).  This translates
to increased CPU usage for every write.
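
The corner case and the fix can be shown with a toy model.  Everything
here is hypothetical: struct toy_fs, its in_journal[] flags (standing in
for the journal hash tables) and the function names are for illustration
only and are not reiserfs code.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NBLOCKS 4
#define BLKSZ   16

/* Toy model: a main disk area plus a journal that may hold a logged
 * copy of each block. */
struct toy_fs {
    char disk[NBLOCKS][BLKSZ];     /* contents of the main disk area     */
    char journal[NBLOCKS][BLKSZ];  /* logged copy, valid if in_journal[] */
    bool in_journal[NBLOCKS];      /* stand-in for the journal hash      */
};

static void toy_init(struct toy_fs *fs)
{
    memset(fs, 0, sizeof(*fs));
}

/* fsync() path with data logging: the block goes to the log. */
static void toy_log_block(struct toy_fs *fs, int n, const char *data)
{
    snprintf(fs->journal[n], BLKSZ, "%s", data);
    fs->in_journal[n] = true;
}

/* Plain write() followed by sync.  With check_journal set, a block that
 * is still replayable is logged again instead of written in place. */
static void toy_write_block(struct toy_fs *fs, int n, const char *data,
                            bool check_journal)
{
    if (check_journal && fs->in_journal[n]) {
        toy_log_block(fs, n, data);
        return;
    }
    snprintf(fs->disk[n], BLKSZ, "%s", data);
}

/* Crash recovery: every logged block is copied over the main area. */
static void toy_replay(struct toy_fs *fs)
{
    for (int n = 0; n < NBLOCKS; n++) {
        if (fs->in_journal[n]) {
            memcpy(fs->disk[n], fs->journal[n], BLKSZ);
            fs->in_journal[n] = false;
        }
    }
}
```

If a block is logged with "old", rewritten with "new" using
check_journal = false, and then replayed, the main area ends up holding
"old" again.  With check_journal = true the rewrite goes to the log as
well, so replay leaves "new" in place.  The extra lookup on every write
is the CPU cost mentioned above.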

I'd rather not put it back in because it adds yet another corner case to
maintain for all time.  Most of the fsync/O_SYNC bound applications are
just given their own partition anyway, so most users that need data
logging need it for every write.

-chris

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fsync() Performance Issue
  2002-05-06 12:40         ` Chris Mason
@ 2002-05-06 13:02           ` Hans Reiser
  2002-05-06 21:21           ` Hans Reiser
  1 sibling, 0 replies; 30+ messages in thread
From: Hans Reiser @ 2002-05-06 13:02 UTC (permalink / raw)
  To: Chris Mason; +Cc: berthiaume_wayne, reiserfs-list, green

Chris Mason wrote:

>On Sat, 2002-05-04 at 10:59, Hans Reiser wrote:
>  
>
>>So how about if you revise fsync so that it always sends data blocks to 
>>the journal not to the main disk?
>>    
>>
>
>This gets a little sticky.
>
>Once you log a block, it might be replayed after a crash.  So, you have
>to protect against corner cases like this:
>
>write(file)
>fsync(file) ; /* logs modified data blocks */
>write(file) ; /* write the same blocks without fsync */
>sync ;        /* use expects new version of the blocks on disk */
><crash>
>
>During replay, the logged data blocks overwrite the blocks sent to disk
>via sync().
>
>This isn't hard to correct for, every time a buffer is marked dirty, you
>check the journal hash tables to see if it is replayable, and if so you
>log it instead (the 2.2.x code did this due to tails).  This translates
>to increased CPU usage for every write.
>
Significant increased CPU usage?

>
>I'd rather not put it back in because it adds yet another corner case to
>maintain for all time.  Most of the fsync/O_SYNC bound applications are
>just given their own partition anyway, so most users that need data
>logging need it for every write.
>
most users don't know enough to turn it on....;-)

>
>-chris
>
>
>
>
>
>
>  
>




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fsync() Performance Issue
  2002-05-06 12:40         ` Chris Mason
  2002-05-06 13:02           ` Hans Reiser
@ 2002-05-06 21:21           ` Hans Reiser
  2002-05-06 22:57             ` Chris Mason
  1 sibling, 1 reply; 30+ messages in thread
From: Hans Reiser @ 2002-05-06 21:21 UTC (permalink / raw)
  To: Chris Mason; +Cc: berthiaume_wayne, reiserfs-list, green

Chris Mason wrote:

>On Sat, 2002-05-04 at 10:59, Hans Reiser wrote:
>  
>
>>So how about if you revise fsync so that it always sends data blocks to 
>>the journal not to the main disk?
>>    
>>
>
>This gets a little sticky.
>
>Once you log a block, it might be replayed after a crash.  So, you have
>to protect against corner cases like this:
>
>write(file)
>fsync(file) ; /* logs modified data blocks */
>write(file) ; /* write the same blocks without fsync */
>sync ;        /* use expects new version of the blocks on disk */
><crash>
>
>During replay, the logged data blocks overwrite the blocks sent to disk
>via sync().
>
>This isn't hard to correct for, every time a buffer is marked dirty, you
>check the journal hash tables to see if it is replayable, and if so you
>log it instead (the 2.2.x code did this due to tails).  This translates
>to increased CPU usage for every write.
>
>I'd rather not put it back in because it adds yet another corner case to
>maintain for all time.  Most of the fsync/O_SYNC bound applications are
>just given their own partition anyway, so most users that need data
>logging need it for every write.
>
Does mozilla's mail user agent use fsync?  Should I give it its own 
partition?  I bet it is fsync bound....;-)

Also, I don't think you can reasonably expect most persons to know that 
they should turn data logging on for high fsync performance, even if you 
document it.

Most persons using small fsyncs are using it because the person who 
wrote their application wrote it wrong.  What's more, many of the 
persons who wrote those applications cannot understand that they did it 
wrong even if you tell them (e.g. qmail author reportedly cannot 
understand, sendmail guys now understand but had Kirk McKusick on their 
staff and attending the meeting when I explained it to them so they are 
not very typical....).  

In other words, handling stupidity is an important life skill, and we 
all need to excel at it.;-)

Tell me what your thoughts are on the following:

If you ask randomly selected ReiserFS users (not the reiserfs-list, but 
the ones who would never send you an email....)  the following 
questions, what percentage will answer which choice?

The filesystem you are using is named:

a) the Performance Optimized SuSE FS

b) NTFS

c) FAT

d) ext2

e) ReiserFS

If you want to change reiserfs to use data journaling you must do which:

a) reinstall the reiserfs package using rpm

b) modify /etc/fs.conf

c) reinstall the operating system from scratch, and select different 
options during the install this time

d) reformat your reiserfs partition using mkreiserfs

e) none of the above

f) all of the above except e)

What do you think the chances are that you can convince Hubert that 
every SuSE Enterprise Edition user should be asked at install time if 
they are going to use fsync a lot on each partition, and to use a 
different fstab setting if yes?

I know that you are an experienced sysadmin who was good at it.  Your 
intuition tells you that most sysadmins are like the ones you were 
willing to hire into your group at the university.  They aren't.

Linux needs to be like a telephone.  You plug it in, push buttons, and 
talk.  It works well, but most folks don't know why.

A moderate number of programs are small fsync bound for the simple 
reason that it is simpler to write them that way.    We need to cover 
over their simplistic designs.

So, you have my sympathies Chris, because I believe you that it makes 
the code uglier and it won't be a joy to code and test.  I hope you also 
see that it should be done.

Hans

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fsync() Performance Issue
  2002-05-06 21:21           ` Hans Reiser
@ 2002-05-06 22:57             ` Chris Mason
  2002-05-06 23:41               ` Hans Reiser
  2002-05-07  1:17               ` Manuel Krause
  0 siblings, 2 replies; 30+ messages in thread
From: Chris Mason @ 2002-05-06 22:57 UTC (permalink / raw)
  To: Hans Reiser; +Cc: reiserfs-list

On Mon, 2002-05-06 at 17:21, Hans Reiser wrote:
>
> >I'd rather not put it back in because it adds yet another corner case to
> >maintain for all time.  Most of the fsync/O_SYNC bound applications are
> >just given their own partition anyway, so most users that need data
> >logging need it for every write.
> >
> Does mozilla's mail user agent use fsync?  Should I give it its own 
> partition?  I bet it is fsync bound....;-)

[ I took Wayne off the cc list, he's probably not horribly interested ]

Perhaps, but I'll also bet the fsync performance hit doesn't affect the
performance of the system as a whole.  Remember that data=journal
doesn't make the fsyncs fast, it just makes them faster.

> 
> Most persons using small fsyncs are using it because the person who 
> wrote their application wrote it wrong.  What's more, many of the 
> persons who wrote those applications cannot understand that they did it 
> wrong even if you tell them (e.g. qmail author reportedly cannot 
> understand, sendmail guys now understand but had Kirk McKusick on their 
> staff and attending the meeting when I explained it to them so they are 
> not very typical....).  
> 
> In other words, handling stupidity is an important life skill, and we 
> all need to excell at it.;-)

A real strength to linux is the application designers can talk directly
to their own personal bottlenecks.  Hopefully we reward those that hunt
us down and spend the time convincing us their applications are worth
tuning for.  They then proceed to beat the pants off their competition.

> 
> Tell me what your thoughts are on the following:
> 
> If you ask randomly selected ReiserFS users (not the reiserfs-list, but 
> the ones who would never send you an email....)  the following 
> questions, what percentage will answer which choice?
> 
> The filesystem you are using is named:
> 
> a) the Performance Optimized SuSE FS
> 
> b) NTFS
> 
> c) FAT
> 
> d) ext2
> 
> e) ReiserFS

I believe the ones that know what a filesystem is will answer ReiserFS.
You might get a lot of ext2 answers, just because that's what a lot of
people think the linux filesystem is.

> 
> If you want to change reiserfs to use data journaling you must do which:
> 
> a) reinstall the reiserfs package using rpm
> 
> b) modify /etc/fs.conf
> 
> c) reinstall the operating system from scratch, and select different 
> options during the install this time
> 
> d) reformat your reiserfs partition using mkreiserfs
> 
> e) none of the above
> 
> f) all of the above except e)

These people won't be admins of systems big enough for the difference to
matter.  data journaling is targeted at people with so much load they
would have to buy more hardware to make up for it.  The new option
lowers the price to performance ratio, which is exactly what we want to
do for sendmails, egeneras, lycos, etc.  If it takes my laptop 20ms to
deliver a mail message, cutting the time down to 10ms just won't matter.

> 
> 
> What do you think the chances are that you can convince Hubert that 
> every SuSE Enterprise Edition user should be asked at install time if 
> they are going to use fsync a lot on each partition, and to use a 
> different fstab setting if yes?

Very little, I might tell them to buy the suse email server instead,
since that would have the settings done right.  data=journal is just a
small part of mail server tuning.

> 
> I know that you are an experienced sysadmin who was good at it.  Your 
> intuition tells you that most sysadmins are like the ones you were 
> willing to hire into your group at the university.  They aren't.
> 
> Linux needs to be like a telephone.  You plug it in, push buttons, and 
> talk.  It works well, but most folks don't know why.
> 

Exactly.  I think there are 3 classes of users at play here.

1) Those who don't understand and don't have enough load to notice.
2) Those who don't understand and do have enough load to notice.
3) Those who do understand and do have enough load to notice.

#2 will buy support from someone, and they should be able to configure
the thing right.

#3 will find the docs and do it right themselves.

> A moderate number of programs are small fsync bound for the simple 
> reason that it is simpler to write them that way.    We need to cover 
> over their simplistic designs.
> 
> So, you have my sympathies Chris, because I believe you that it makes 
> the code uglier and it won't be a joy to code and test.  I hope you also 
> see that it should be done.

Mostly, I feel this kind of tuning is a mistake right now.  The patch is
young and there are so many places left to tweak...I'm still at the
stage where much larger improvements are possible, and a better use of
coding time.  Plus, it's monday and it's always more fun to debate than
give in on mondays.

-chris

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fsync() Performance Issue
  2002-05-06 22:57             ` Chris Mason
@ 2002-05-06 23:41               ` Hans Reiser
  2002-05-07  1:17               ` Manuel Krause
  1 sibling, 0 replies; 30+ messages in thread
From: Hans Reiser @ 2002-05-06 23:41 UTC (permalink / raw)
  To: Chris Mason; +Cc: reiserfs-list

Chris Mason wrote:

>On Mon, 2002-05-06 at 17:21, Hans Reiser wrote:
>  
>
>>>I'd rather not put it back in because it adds yet another corner case to
>>>maintain for all time.  Most of the fsync/O_SYNC bound applications are
>>>just given their own partition anyway, so most users that need data
>>>logging need it for every write.
>>>
>>>      
>>>
>>Does mozilla's mail user agent use fsync?  Should I give it its own 
>>partition?  I bet it is fsync bound....;-)
>>    
>>
>
>[ I took Wayne off the cc list, he's probably not horribly interested ]
>
>Perhaps, but I'll also bet the fsync performance hit doesn't affect the
>performance of the system as a whole.
>
 I suspect that on my laptop, downloading emails is disk bound due to 
fsync()....  I haven't measured it, but it "feels" that way.

>
>Mostly, I feel this kind of tuning is a mistake right now.  The patch is
>young and there are so many places left to tweak...I'm still at the
>stage where much larger improvements are possible, and a better use of
>coding time.  Plus, it's monday and it's always more fun to debate than
>give in on mondays.
>
>-chris
>
>
>
>
>  
>

Needing more time to finish analyzing what is going on and what fixes it 
best is always a good reason to defer things....

Hans


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fsync() Performance Issue
  2002-05-06 22:57             ` Chris Mason
  2002-05-06 23:41               ` Hans Reiser
@ 2002-05-07  1:17               ` Manuel Krause
  2002-05-07  2:04                 ` Chris Mason
  1 sibling, 1 reply; 30+ messages in thread
From: Manuel Krause @ 2002-05-07  1:17 UTC (permalink / raw)
  To: Chris Mason, Hans Reiser; +Cc: reiserfs-list

On 05/07/2002 12:57 AM, Chris Mason wrote:

> On Mon, 2002-05-06 at 17:21, Hans Reiser wrote:
> 
>>>I'd rather not put it back in because it adds yet another corner case to
>>>maintain for all time.  Most of the fsync/O_SYNC bound applications are
>>>just given their own partition anyway, so most users that need data
>>>logging need it for every write.
>>>
>>>
>>Does mozilla's mail user agent use fsync?  Should I give it its own 
>>partition?  I bet it is fsync bound....;-)
>>
> 
> [ I took Wayne off the cc list, he's probably not horribly interested ]
> 
> Perhaps, but I'll also bet the fsync performance hit doesn't affect the
> performance of the system as a whole.  Remember that data=journal
> doesn't make the fsyncs fast, it just makes them faster.
> 
> 
>>Most persons using small fsyncs are using it because the person who 
>>wrote their application wrote it wrong.  What's more, many of the 
>>persons who wrote those applications cannot understand that they did it 
>>wrong even if you tell them (e.g. qmail author reportedly cannot 
>>understand, sendmail guys now understand but had Kirk McKusick on their 
>>staff and attending the meeting when I explained it to them so they are 
>>not very typical....).  
>>
>>In other words, handling stupidity is an important life skill, and we 
>>all need to excell at it.;-)
>>
> 
> A real strength to linux is the application designers can talk directly
> to their own personal bottlenecks.  Hopefully we reward those that hunt
> us down and spend the time convincing us their applications are worth
> tuning for.  They then proceed to beat the pants off their competition.
> 
> 
>>Tell me what your thoughts are on the following:
>>
>>If you ask randomly selected ReiserFS users (not the reiserfs-list, but 
>>the ones who would never send you an email....)  the following 
>>questions, what percentage will answer which choice?
>>
>>The filesystem you are using is named:
>>
>>a) the Performance Optimized SuSE FS
>>
>>b) NTFS
>>
>>c) FAT
>>
>>d) ext2
>>
>>e) ReiserFS
>>
> 
> I believe the ones that know what a filesystem is will answer ReiserFS,
> You might get a lot of ext2 answers, just because that's what a lot of
> people think the linux filesystem is.
> 
> 
>>If you want to change reiserfs to use data journaling you must do which:
>>
>>a) reinstall the reiserfs package using rpm
>>
>>b) modify /etc/fs.conf
>>
>>c) reinstall the operating system from scratch, and select different 
>>options during the install this time
>>
>>d) reformat your reiserfs partition using mkreiserfs
>>
>>e) none of the above
>>
>>f) all of the above except e)
>>
> 
> These people won't be admins of systems big enough for the difference to
> matter.  data journaling is targeted at people with so much load they
> would have to buy more hardware to make up for it.  The new option
> lowers the price to performance ratio, which is exactly what we want to
> do for sendmails, egeneras, lycos, etc.  If it takes my laptop 20ms to
> deliver a mail message, cutting the time down to 10ms just won't matter.
> 
> 
>>
>>What do you think the chances are that you can convince Hubert that 
>>every SuSE Enterprise Edition user should be asked at install time if 
>>they are going to use fsync a lot on each partition, and to use a 
>>different fstab setting if yes?
>>
> 
> Very little, I might tell them to buy the suse email server instead,
> since that would have the settings done right.  data=journal is just a
> small part of mail server tuning.
> 
> 
>>I know that you are an experienced sysadmin who was good at it.  Your 
>>intuition tells you that most sysadmins are like the ones you were 
>>willing to hire into your group at the university.  They aren't.
>>
>>Linux needs to be like a telephone.  You plug it in, push buttons, and 
>>talk.  It works well, but most folks don't know why.
>>
>>
> 
> Exactly.  I think there are 3 classes of users at play here.
> 
> 1) Those who don't understand and don't have enough load to notice.
> 2) Those who don't understand and do have enough load to notice.
> 3) Those who do understand and do have enough load to notice.
> 
> #2 will buy support from someone, and they should be able to configure
> the thing right.
> 
> #3 will find the docs and do it right themselves.
> 
> 
>>A moderate number of programs are small fsync bound for the simple 
>>reason that it is simpler to write them that way.    We need to cover 
>>over their simplistic designs.
>>
>>So, you have my sympathies Chris, because I believe you that it makes 
>>the code uglier and it won't be a joy to code and test.  I hope you also 
>>see that it should be done.
>>
> 
> Mostly, I feel this kind of tuning is a mistake right now.  The patch is
> young and there are so many places left to tweak...I'm still at the
> stage where much larger improvements are possible, and a better use of
> coding time.  Plus, it's monday and it's always more fun to debate than
> give in on mondays.
> 
> -chris
> 


Hi, Chris & Hans!

I don't think this kind of destructive discussion will lead to 
anything useful for now. Can you post a diff for 
2.4.19-pre7+latest-related-pending +compound-patch-from-ftp?

I'll try it and report whether it leads to more security and/or less 
performance in my everyday use with NS6 and so on, if there is any difference.


Thanks,

Manuel




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fsync() Performance Issue
  2002-05-07  1:17               ` Manuel Krause
@ 2002-05-07  2:04                 ` Chris Mason
  2002-05-07 20:26                   ` Manuel Krause
  0 siblings, 1 reply; 30+ messages in thread
From: Chris Mason @ 2002-05-07  2:04 UTC (permalink / raw)
  To: Manuel Krause; +Cc: Hans Reiser, reiserfs-list

On Mon, 2002-05-06 at 21:17, Manuel Krause wrote:
> On 05/07/2002 12:57 AM, Chris Mason wrote:
>
> 
> Hi, Chris & Hans!
> 
> Don't think this somekind of destructive discussion would lead to 
> anything useful for now, can you post a diff for 
> 2.4.19-pre7+latest-related-pending +compound-patch-from-ftp?
> 
> I'll try it and report if that leads to more security and/or less 
> performance on my every day use with NS6 and so on if there is any.

The current data logging patches are at:

ftp.suse.com/pub/people/mason/patches/data-logging

They are against 2.4.19-pre7, and contain versions of the major (stable)
speedups.  The patch is pretty big, so I'm not likely to merge with the
namesys pending directories.  The namesys guys add things frequently,
and I think it would get confusing for people trying to figure out which
patches to apply.

The data logging stuff is beta code, if you have a good test bed where
it's ok if things go wrong I can make you a special patch with the
pending stuff merged.

-chris

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fsync() Performance Issue
  2002-05-07  2:04                 ` Chris Mason
@ 2002-05-07 20:26                   ` Manuel Krause
  2002-05-08  1:22                     ` Chris Mason
  0 siblings, 1 reply; 30+ messages in thread
From: Manuel Krause @ 2002-05-07 20:26 UTC (permalink / raw)
  To: Chris Mason; +Cc: Hans Reiser, reiserfs-list

On 05/07/2002 04:04 AM, Chris Mason wrote:

> On Mon, 2002-05-06 at 21:17, Manuel Krause wrote:
> 
>>On 05/07/2002 12:57 AM, Chris Mason wrote:
>>
>>
>>Hi, Chris & Hans!
>>
>>Don't think this somekind of destructive discussion would lead to 
>>anything useful for now, can you post a diff for 
>>2.4.19-pre7+latest-related-pending +compound-patch-from-ftp?
>>
>>I'll try it and report if that leads to more security and/or less 
>>performance on my every day use with NS6 and so on if there is any.
>>
> 
> The current data logging patches are at:
> 
> ftp.suse.com/pub/people/mason/patches/data-logging
> 
> They are against 2.4.19-pre7, and contain versions of the major (stable)
> speedups.  The patch is pretty big, so I'm not likely to merge with the
> namesys pending directories.  The namesys guys add things frequently,
> and I think it would get confusing for people trying to figure out which
> patches to apply.
> 
> The data logging stuff is beta code, if you have a good test bed where
> it's ok if things go wrong I can make you a special patch with the
> pending stuff merged.
> 
> -chris
> 

So, what should I do? So far I have trusted ReiserFS stability more than 
usual, and it has gone o.k. for years. I would not consider my notebook a 
safe test bed machine, but I do know the value of up-to-date backups. ;-)

When the reiserfs speedup patches go into mainline, which they hopefully 
will, you will need to adjust your patches anyway:
I don't think it's good if I adjust them myself, as I don't know what I 
may corrupt in the code. Sometimes it's really annoying to edit some 
patches' .diff files when some maintainers decide to add one 
dumb/unneeded blank space or diff path or /usr/src/linux link. I tried 
to rediff/manually-add  your recently posted speedup-patch (IIRC it was 
with the fsync issue for Wayne) over the last weekend and I stopped 
after facing too many identities with the 
existing+pending+compound-speedup-2(-with-iicache-in) code. So do I do 
with  the latest data-logging patches you pointed me to.

I think I'm not approved enough and in need to do your work, as I had 
enough stupidity to merge aa., akpm. and rml. patches for 2.4.19-pre7. 
Choose your poison, like Hans finally meant under the lines, an addon 
mount option in fstab would never be a problem... :-))

BTW, why do you "not believe iicache should be good to use, because 
similar functionality (with less overhead) can be achieved by 
pagecache." as Oleg wrote today?! Is that true and why isn't that 
realized yet?

Bye, thanks and best regards,

Manuel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: fsync() Performance Issue
  2002-05-07 20:26                   ` Manuel Krause
@ 2002-05-08  1:22                     ` Chris Mason
  0 siblings, 0 replies; 30+ messages in thread
From: Chris Mason @ 2002-05-08  1:22 UTC (permalink / raw)
  To: Manuel Krause; +Cc: Hans Reiser, reiserfs-list

On Tue, 2002-05-07 at 16:26, Manuel Krause wrote:

> > The data logging stuff is beta code, if you have a good test bed where
> > it's ok if things go wrong I can make you a special patch with the
> > pending stuff merged.
> > 
> So, what should I do? I believed in ReiserFS stability so far more than 
> usual and it went o.k. since years. I would not consider my notebook a 
> safe test bed machine but I do know the value of uptodate backups. ;-)

;-) I'd wait a bit for the patches to stabilize.  As the development
slows down, it gets easier for me to spend the time on merging with
various things.  Plus, people like Yura integrate it into their
projects.  

> 
> When the reiserfs speedup patches will go into mainline, what they'll do 
> hopefully, you need to adjust your patches anyways:

Yes.

> BTW, why do you "not believe iicache should be good to use, because 
> similar functionality (with less overhead) can be achieved by 
> pagecache." as Oleg wrote today?! Is that true and why isn't that 
> realized yet?

I think the iicache is solid experimental work, and that it does a great
job of showing a place where reiserfs can improve performance.  Before
I'd suggest including it, I think we need to find the single best mode
of operation and get rid of all the other stuff.

The idea behind the iicache is to maintain a cache of the metadata in
the tree for use later on.  Yura did this by putting extra fields into
the reiserfs part of the inode, but I think it can also be done by just
mapping in the pages corresponding to the metadata.  There are
advantages to each approach, but I think using the pages as the backing
for the iicache will end up cleaner in the end.

Oleg is experimenting a little with this too, Yura deserves a lot of
credit not just for starting the iicache patch, but for maintaining it
over this long period of time.

-chris

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: fsync() Performance Issue
@ 2002-05-06 12:54 berthiaume_wayne
  0 siblings, 0 replies; 30+ messages in thread
From: berthiaume_wayne @ 2002-05-06 12:54 UTC (permalink / raw)
  To: mason; +Cc: reiserfs-list, green

	I'll add the write caching into the test just for info. Until there
is a way to guarantee the data is safe I'll have to go with no write caching
though. I should have all this testing done by the end of the week.

-----Original Message-----
From: Chris Mason [mailto:mason@suse.com]
Sent: Friday, May 03, 2002 6:00 PM
To: berthiaume_wayne@emc.com
Cc: reiserfs-list@namesys.com; green@namesys.com
Subject: RE: [reiserfs-list] fsync() Performance Issue

On Fri, 2002-05-03 at 16:35, berthiaume_wayne@emc.com wrote:
> 	Chris, I have some quick preliminary results for you. I have
> additional testing to perform and haven't run debugreiserfs() yet. If you
> have a preference for which tests to run debugreiserfs() let me know.
> 	Base testing was done against 2.4.13 built on RH 7.1 using the
> test_writes.c code I forwarded to you. The system is a Tyan with single
> PIII, IDE Promise 20269, Maxtor 160GB drive - write cache disabled. All
> numbers are with fsync() and 1KB files. As I said, more testing, i.e.
> filesizes, need to be performed.

> 2.4.19-pre7 speedup, data logging, write barrier / no options
> 	=> 47.1ms/file

Hi Wayne, thanks for sending these along.

I expected a slight improvement over the 2.4.13 code even with the data
logging turned off.  I'm curious to see how it does with the IDE cache
turned on.  With scsi, I see 10-15% better without any options than an
unpatched kernel.

> 2.4.19-pre7 speedup, data logging, write barrier / data=journal
> 	=> 25.2ms/file
> 2.4.19-pre7 speedup, data logging, write barrier /
data=journal,barrier=none
> 	=> 27.8ms/file

The barrier option doesn't make much difference because the write cache
is off.  With write cache on, the barrier code should allow you to be
faster than with the caching off, but without risking the data (Jens and
I are working on final fsync safety issues though).
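
To put the per-file times above in other terms, here is a small sketch
that converts them to throughput.  The two constants are the numbers
Wayne reported for the "no options" and data=journal runs; the function
names are just for illustration.

```c
/* Per-file times quoted above, in milliseconds. */
static const double MS_NO_OPTIONS   = 47.1;  /* patched kernel, no options */
static const double MS_DATA_JOURNAL = 25.2;  /* patched kernel, data=journal */

/* Files completed per second at a given per-file latency. */
static double files_per_second(double ms_per_file)
{
    return 1000.0 / ms_per_file;
}

/* Percentage reduction in time per file going from before to after. */
static double pct_less_time(double before_ms, double after_ms)
{
    return 100.0 * (before_ms - after_ms) / before_ms;
}
```

That works out to roughly 21 files per second without options and
roughly 40 with data=journal, or about 46% less time per file.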

Hans, data=journal turns on the data journaling.  The data journaling
patches also include optimizations to write metadata back to disk in
bigger chunks for tiny transactions (the current method is to write one
transaction's worth back, when a transaction has 3 blocks, this is
pretty slow).

I've put these patches up on:

ftp.suse.com/pub/people/mason/patches/data-logging

> 	One question is will these patches be going into the 2.4 tree and
> when?

The data logging patches are a huge change, but the good news is they
are based on the nesting patches that have been stable for a long time
in the quota code.  I'll probably want a month or more of heavy testing
before I think about submitting them.

-chris


end of thread, other threads:[~2002-05-08  1:22 UTC | newest]

Thread overview: 30+ messages
2002-04-26 20:28 fsync() Performance Issue berthiaume_wayne
2002-04-29 16:20 ` Russell Coker
2002-04-29 16:30   ` Chris Mason
2002-04-29 16:32   ` Toby Dickenson
2002-04-29 16:45     ` Chris Mason
2002-04-29 17:56     ` Matthias Andree
2002-04-29 18:58       ` Valdis.Kletnieks
2002-04-29 18:56         ` Hans Reiser
2002-04-30 14:20 ` Oleg Drokin
2002-04-30 14:27   ` Chris Mason
2002-05-02  5:07   ` Christian Stuke
2002-05-02  6:20     ` Oleg Drokin
  -- strict thread matches above, loose matches on Subject: below --
2002-04-29 17:26 berthiaume_wayne
2002-04-30 14:45 berthiaume_wayne
2002-05-03 20:35 berthiaume_wayne
2002-05-03 22:00 ` Chris Mason
2002-05-04  2:05   ` Hans Reiser
2002-05-04  5:41     ` Valdis.Kletnieks
2002-05-04 13:11     ` Chris Mason
2002-05-04 14:59       ` Hans Reiser
2002-05-06 12:40         ` Chris Mason
2002-05-06 13:02           ` Hans Reiser
2002-05-06 21:21           ` Hans Reiser
2002-05-06 22:57             ` Chris Mason
2002-05-06 23:41               ` Hans Reiser
2002-05-07  1:17               ` Manuel Krause
2002-05-07  2:04                 ` Chris Mason
2002-05-07 20:26                   ` Manuel Krause
2002-05-08  1:22                     ` Chris Mason
2002-05-06 12:54 berthiaume_wayne
