public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* test13-pre6
@ 2000-12-30  0:25 Linus Torvalds
  2000-12-30  0:49 ` test13-pre6 Alexander Viro
                   ` (4 more replies)
  0 siblings, 5 replies; 56+ messages in thread
From: Linus Torvalds @ 2000-12-30  0:25 UTC (permalink / raw)
  To: Kernel Mailing List
  Cc: Alexander Viro, Stephen C. Tweedie, Marco d'Itri,
	Jeff Lightfoot, Dan Aloni, Anton Blanchard


Ok, there's a test13-pre6 out there now, which does a partial sync with
Alan, in addition to hopefully fixing the innd shared mapping writeback
problem for good.  Thanks to Marcelo Tosatti and others..

I've pounded on the shared dirty page writeback logic quite a bit, and
verified (by doing timings with "strace -ttT") that it should do the right
thing with sync/fsync/fdatasync/msync(MS_ASYNC)/msync(MS_SYNC).

The reason I'm fairly confident that the problem is gone for good is that
the dirty page handling is now quite integral to the whole address space
code, and it fell out rather nicely from some mapping host cleanups.

Al: this changes "mapping->host" to be truly defined as a pointer to the
inode that owns the mapping. That's how every user actually _used_ the
host pointer, so this cleans those up to not need any casts. The main
reason, however, is that we needed to have some FS-level anchor for dirty
pages in order to get the correct sync() semantics. If you think it's
worth it to have a notion of an anonymous host we need to add something
else.

Stephen: mind trying your fsync/etc tests on this one, to verify that the
inode dirty stuff is all done right?

Marco d'Itri and everybody else who has seen innd problems (or other
shared map problems): can you verify that test13-pre6 works for you?

			Linus

----
 - pre6:
   - Marc Joosen: BIOS int15/e820 memory query: don't assume %edx
     unchanged by the BIOS. Fixes at least some IBM ThinkPads.
   - Alan Cox: synchronize
   - Marcelo Tosatti & me: properly sync dirty pages
   - Andreas Dilger: proper ext2 compat flag checking

 - pre5:
   - NIIBE Yutaka: SuperH update
   - Geert Uytterhoeven: m68k update
   - David Miller: TCP RTO calc fix, UDP multicast fix etc
   - Duncan Laurie: ServerWorks PIRQ routing definition.
   - mm PageDirty cleanups, added sanity checks, and don't lose the bit. 

 - pre4:
   - Christoph Rohland: shmfs cleanup
   - Nicolas Pitre: don't forget loop.c flags
   - Geert Uytterhoeven: new-style m68k Makefiles
   - Neil Brown: knfsd cleanups, raid5 re-org
   - Andrea Arkangeli: update to LVM-0.9
   - LC Chang: sis900 driver doc update
   - David Miller: netfilter oops fix
   - Andrew Grover: acpi update

 - pre3:
   - Christian Jullien: smc9194: proper dev_kfree_skb_irq
   - Cort Dougan: new-style PowerPC Makefiles
   - Andrew Morton, Petr Vandrovec: fix run_task_queue
   - Christoph Rohland: shmfs for shared memory handling

 - pre2:
   - Kai Germaschewski: ISDN update (including Makefiles)
   - Jens Axboe: cdrom updates
   - Petr Vandrovec; Matrox G450 support
   - Bill Nottingham: fix FAT32 filesystems on 64-bit platforms
   - David Miller: sparc (and other) Makefile fixup
   - Andrea Arkangeli: alpha SMP TLB context fix (and cleanups)
   - Niels Kristian Bech Jensen: checkconfig, USB warnings
   - Andrew Grover: large ACPI update

 - pre1:
   - me: drop support for old-style Makefiles entirely. Big.
   - me: check b_end_io at the IO submission path
   - me: fix "ptep_mkdirty()" (so that swapoff() works correctly)
   - fix fault case in copy_from_user() with a constant size, where
     ((size & 3) == 3)


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: test13-pre6
  2000-12-30  0:25 test13-pre6 Linus Torvalds
@ 2000-12-30  0:49 ` Alexander Viro
  2000-12-30  1:03   ` test13-pre6 Linus Torvalds
  2000-12-30  2:25 ` test13-pre6 Daniel Phillips
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 56+ messages in thread
From: Alexander Viro @ 2000-12-30  0:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kernel Mailing List, Alexander Viro, Stephen C. Tweedie,
	Marco d'Itri, Jeff Lightfoot, Dan Aloni, Anton Blanchard



On Fri, 29 Dec 2000, Linus Torvalds wrote:

> Al: this changes "mapping->host" to be truly defined as a pointer to the
> inode that owns the mapping. That's how every user actually _used_ the
> host pointer, so this cleans those up to not need any casts. The main
> reason, however, is that we needed to have some FS-level anchor for dirty
> pages in order to get the correct sync() semantics. If you think it's
> worth it to have a notion of an anonymous host we need to add something
> else.

Two examples: devices and bitmaps-in-pagecache trick. But both belong to
2.5, so...

BTW, nice timing ;-) -pre6 appeared 5 minutes after I've started testing
sane-s_lock patch (SMP-safe lock_super() and friends, refcount on superblocks,
death of mount_sem, beginning of SMP-safe super.c). Oh, well...

Oblock_super(): what the hell is wait_on_super() doing in fsync_file()?
It gives absolutely no warranties - ->write_super() can easily block, so
it looks very odd.

BTW, while we are dropping the junk from vm_operations_struct, could we lose
->protect() and ->wppage()? Both are never used and never defined. I can
send a diff, indeed, but
ed include/linux/mm.h <<EOF
g/\*protect/d
g/wppage/d
w
q
EOF
should take care of them.
						Cheers,
							Al
It's nice to be back... 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: test13-pre6
  2000-12-30  0:49 ` test13-pre6 Alexander Viro
@ 2000-12-30  1:03   ` Linus Torvalds
  2000-12-30 18:09     ` test13-pre6 Alexander Viro
  0 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2000-12-30  1:03 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Kernel Mailing List, Alexander Viro, Stephen C. Tweedie,
	Marco d'Itri, Jeff Lightfoot, Dan Aloni, Anton Blanchard



On Fri, 29 Dec 2000, Alexander Viro wrote:
> 
> Two examples: devices and bitmaps-in-pagecache trick. But both belong to
> 2.5, so...

Also, they can easily be done with a private inode, if required. So even
in 2.5.x this may not be a major problem.

> BTW, nice timing ;-) -pre6 appeared 5 minutes after I've started testing
> sane-s_lock patch (SMP-safe lock_super() and friends, refcount on superblocks,
> death of mount_sem, beginning of SMP-safe super.c). Oh, well...
> 
> Oblock_super(): what the hell is wait_on_super() doing in fsync_file()?
> It gives absolutely no warranties - ->write_super() can easily block, so
> it looks very odd.

A lot of the superblock locking has been odd. It should probably be a
lock_super() + unlock_super(). At least that's what sync_supers() does.

> BTW, while we are dropping the junk from vm_operations_struct, could we lose
> ->protect() and ->wppage()?

Sure. I think sync() and unmap() fall under that heading too - it used to
do a msync(), but that was before we handled dirty pages directly, so...

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: test13-pre6
  2000-12-30  0:25 test13-pre6 Linus Torvalds
  2000-12-30  0:49 ` test13-pre6 Alexander Viro
@ 2000-12-30  2:25 ` Daniel Phillips
  2000-12-30  3:16   ` test13-pre6 Linus Torvalds
  2000-12-30  3:08 ` test13-pre6 (Fork Bug with Athlons? Temporary Fix) Byron Stanoszek
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 56+ messages in thread
From: Daniel Phillips @ 2000-12-30  2:25 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel

Linus Torvalds wrote:
> 
> Ok, there's a test13-pre6 out there now, which does a partial sync with
> Alan, in addition to hopefully fixing the innd shared mapping writeback
> problem for good.  Thanks to Marcelo Tosatti and others..

After the page_cache_release at line 574 of vmscan.c the page is
unlocked and only owned by the page cache - anything could happen.  How
do you know the set_page_dirty at line 581 is still hitting a valid
page?

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: test13-pre6 (Fork Bug with Athlons? Temporary Fix)
  2000-12-30  0:25 test13-pre6 Linus Torvalds
  2000-12-30  0:49 ` test13-pre6 Alexander Viro
  2000-12-30  2:25 ` test13-pre6 Daniel Phillips
@ 2000-12-30  3:08 ` Byron Stanoszek
  2000-12-30  3:36   ` Linus Torvalds
                     ` (2 more replies)
  2000-12-30  4:21 ` test13-pre6 Dan Aloni
  2001-01-04 20:23 ` test13-pre6 Stephen C. Tweedie
  4 siblings, 3 replies; 56+ messages in thread
From: Byron Stanoszek @ 2000-12-30  3:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kernel Mailing List, Alexander Viro, Stephen C. Tweedie,
	Marco d'Itri, Jeff Lightfoot, Dan Aloni, Anton Blanchard

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4923 bytes --]

On Fri, 29 Dec 2000, Linus Torvalds wrote:

> 
> Ok, there's a test13-pre6 out there now, which does a partial sync with
> Alan, in addition to hopefully fixing the innd shared mapping writeback
> problem for good.  Thanks to Marcelo Tosatti and others..

I've been noticing a problem with the memory context switching conflicting with
fork() on my Athlon. The problem began in the test13-pre2 patch, and because
nobody else has seen this problem (or otherwise reported it) since then, I
felt I should look into it a little further.

I narrowed the problem down to a subset of patches from the MM set in
test13-pre2. Reversing the attached 'context.patch' fixes the problem (only for
i386), but I'm not yet sure why. test13-pre2 and up work without any problems
on an Intel cpu (Pentium 180 & P3 800 tested).

Anyways, I can't seem to find out what really changes with the patch except for
the obvious 'void *segment' changing into a typedef-struct. The only thing I
can think of is that the compiler decodes it differently, but I think I can
safely rule that out. I tried both 2.91.66 and 2.95.2, using both different
types of parameters for P5 & K7 (-march=i586 & -march=i686 -malign-functions=4)
and it still gives the problem on the Athlon. Maybe there's something I've
overlooked in that attached patch. Request for an extra pair of eyes please. :)


Here are the casual symptoms. The parent seems to die as soon as a forked child
exits, which seems to me that a new LDT isn't being initialized correctly:

root:~> ps -aux
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
root         1  1.1  0.4  1228  532 ?        S    21:42   0:05 init [3]
root         2  0.0  0.0     0    0 ?        SW   21:42   0:00 [keventd]
root         3  0.0  0.0     0    0 ?        SW   21:42   0:00 [kswapd]
root         4  0.0  0.0     0    0 ?        SW   21:42   0:00 [kreclaimd]
root         5  0.0  0.0     0    0 ?        SW   21:42   0:00 [bdflush]
root         6  0.0  0.0     0    0 ?        SW   21:42   0:00 [kupdate]
root       289  0.0  0.4  1284  604 ?        S    21:42   0:00 syslogd -m 0
root       299  0.0  0.8  1912 1104 ?        S    21:42   0:00 klogd
root       351  0.0  1.2  9292 1576 ?        S    21:42   0:00 named
root       361  0.0  0.0     0    0 ?        Z    21:42   0:00 [named <defunct>]
root       363  0.0  1.2  9292 1576 ?        S    21:42   0:00 named
root       364  0.0  1.2  9292 1576 ?        S    21:42   0:00 named
root       365  0.0  0.7  2064  936 ?        S    21:42   0:00 /usr/sbin/sshd
..etc
(Note PID 361)

root:~> strace nslookup sunsite.unc.edu
 :
 :
rt_sigaction(SIGINT, {0x4003ce78, ~[], 0x4000000}, NULL, 8) = 0
rt_sigaction(SIGTERM, {0x4003ce78, ~[], 0x4000000}, NULL, 8) = 0
rt_sigaction(SIGPIPE, {SIG_IGN}, NULL, 8) = 0
rt_sigaction(SIGHUP, {SIG_DFL}, NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [HUP INT TERM], NULL, 8) = 0
getpid()                                = 2615
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
close(3)                                = 0
socket(PF_INET6, SOCK_STREAM, 0)        = -1 ENOSYS (Function not implemented)
socket(PF_INET6, SOCK_STREAM, 0)        = -1 ENOSYS (Function not implemented)
socket(PF_INET6, SOCK_STREAM, 0)        = -1 EAFNOSUPPORT (Address family not supported by protocol)--- SIGSEGV (Segmentation fault) ---
+++ killed by SIGSEGV +++


---Example parent/child process:

root:~> tar -xzvvf ../pkgs/zgv-5.2.tar.gz
 :
 :
-rw------- rus/users      1356 2000-06-01 11:46:57 zgv-5.2/INSTALL
-rw------- rus/users     17976 1994-08-23 16:09:05 zgv-5.2/COPYING
-rw------- rus/users      1077 1998-08-26 09:24:31 zgv-5.2/README.fonts
-rw------- rus/users       120 2000-04-22 22:46:49 zgv-5.2/AUTHORS
-rw------- rus/users      3714 2000-01-23 16:29:40 zgv-5.2/SECURITY
Segmentation fault (core dumped)

root:~> strace tar -xzvvf ../pkgs/zgv-5.2.tar.gz
 :
 :
open("zgv-5.2/COPYING", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0600) = 4
write(4, "\t\t    GNU GENERAL PUBLIC LICENSE"..., 9728) = 9728
read(3, "ccept this License.  Therefore, "..., 10240) = 10240
write(4, "ccept this License.  Therefore, "..., 8248) = 8248
close(4)                                = 0
utime("zgv-5.2/COPYING", [2000/12/29-20:21:16, 1994/08/23-16:09:05]) = 0
chown32("zgv-5.2/COPYING", 500, 100)    = 0
write(1, "-rw------- rus/users      1077 1"..., 72-rw------- rus/users      1077 1998-08-26 09:24:31 zgv-5.2/README.fonts
) = 72
open("zgv-5.2/README.fonts", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0600) = 4
write(4, "The copyright for *.bdf (taken f"..., 1024) = 1024
read(3, "\"as\nis\" without express or impli"..., 10240) = 8192
--- SIGCHLD (Child exited) ---
--- SIGSEGV (Segmentation fault) ---
+++ killed by SIGSEGV +++

Ideas, anyone?

 -Byron

-- 
Byron Stanoszek                         Ph: (330) 644-3059
Systems Programmer                      Fax: (330) 644-8110
Commercial Timesharing Inc.             Email: bstanoszek@comtime.com

[-- Attachment #2: Type: TEXT/PLAIN, Size: 8539 bytes --]

diff -u --recursive --new-file v2.4.0-test12/linux/arch/i386/kernel/ldt.c linux/arch/i386/kernel/ldt.c
--- v2.4.0-test12/linux/arch/i386/kernel/ldt.c	Sat May 20 10:39:58 2000
+++ linux/arch/i386/kernel/ldt.c	Fri Dec 15 13:01:59 2000
@@ -31,7 +31,7 @@
 	struct mm_struct * mm = current->mm;
 
 	err = 0;
-	if (!mm->segments)
+	if (!mm->context.segments)
 		goto out;
 
 	size = LDT_ENTRIES*LDT_ENTRY_SIZE;
@@ -39,7 +39,7 @@
 		size = bytecount;
 
 	err = size;
-	if (copy_to_user(ptr, mm->segments, size))
+	if (copy_to_user(ptr, mm->context.segments, size))
 		err = -EFAULT;
 out:
 	return err;
@@ -87,13 +87,12 @@
 	 * limited by MAX_LDT_DESCRIPTORS.
 	 */
 	down(&mm->mmap_sem);
-	if (!mm->segments) {
-		
+	if (!mm->context.segments) {
 		error = -ENOMEM;
-		mm->segments = vmalloc(LDT_ENTRIES*LDT_ENTRY_SIZE);
-		if (!mm->segments)
+		mm->context.segments = vmalloc(LDT_ENTRIES*LDT_ENTRY_SIZE);
+		if (!mm->context.segments)
 			goto out_unlock;
-		memset(mm->segments, 0, LDT_ENTRIES*LDT_ENTRY_SIZE);
+		memset(mm->context.segments, 0, LDT_ENTRIES*LDT_ENTRY_SIZE);
 		
 		if (atomic_read(&mm->mm_users) > 1)
 			printk(KERN_WARNING "LDT allocated for cloned task!\n");
@@ -104,7 +103,7 @@
 		load_LDT(mm);
 	}
 
-	lp = (__u32 *) ((ldt_info.entry_number << 3) + (char *) mm->segments);
+	lp = (__u32 *) ((ldt_info.entry_number << 3) + (char *) mm->context.segments);
 
    	/* Allow LDTs to be cleared by the user. */
    	if (ldt_info.base_addr == 0 && ldt_info.limit == 0) {
diff -u --recursive --new-file v2.4.0-test12/linux/arch/i386/kernel/process.c linux/arch/i386/kernel/process.c
--- v2.4.0-test12/linux/arch/i386/kernel/process.c	Mon Dec 11 17:59:43 2000
+++ linux/arch/i386/kernel/process.c	Fri Dec 15 15:37:09 2000
@@ -412,15 +412,15 @@
 /*
  * No need to lock the MM as we are the last user
  */
-void release_segments(struct mm_struct *mm)
+void destroy_context(struct mm_struct *mm)
 {
-	void * ldt = mm->segments;
+	void * ldt = mm->context.segments;
 
 	/*
 	 * free the LDT
 	 */
 	if (ldt) {
-		mm->segments = NULL;
+		mm->context.segments = NULL;
 		clear_LDT();
 		vfree(ldt);
 	}
@@ -478,7 +478,7 @@
 void release_thread(struct task_struct *dead_task)
 {
 	if (dead_task->mm) {
-		void * ldt = dead_task->mm->segments;
+		void * ldt = dead_task->mm->context.segments;
 
 		// temporary debugging check
 		if (ldt) {
@@ -493,29 +493,24 @@
  * we do not have to muck with descriptors here, that is
  * done in switch_mm() as needed.
  */
-void copy_segments(struct task_struct *p, struct mm_struct *new_mm)
+int init_new_context(struct task_struct *p, struct mm_struct *new_mm)
 {
-	struct mm_struct * old_mm = current->mm;
-	void * old_ldt = old_mm->segments, * ldt;
+	struct mm_struct * old_mm;
+	void *old_ldt, *ldt;
 
-	if (!old_ldt) {
+	ldt = NULL;
+	old_mm = current->mm;
+	if (old_mm && (old_ldt = old_mm->context.segments) != NULL) {
 		/*
-		 * default LDT - use the one from init_task
+		 * Completely new LDT, we initialize it from the parent:
 		 */
-		new_mm->segments = NULL;
-		return;
-	}
-
-	/*
-	 * Completely new LDT, we initialize it from the parent:
-	 */
-	ldt = vmalloc(LDT_ENTRIES*LDT_ENTRY_SIZE);
-	if (!ldt)
-		printk(KERN_WARNING "ldt allocation failed\n");
-	else
+		ldt = vmalloc(LDT_ENTRIES*LDT_ENTRY_SIZE);
+		if (!ldt)
+			return -ENOMEM;
 		memcpy(ldt, old_ldt, LDT_ENTRIES*LDT_ENTRY_SIZE);
-	new_mm->segments = ldt;
-	return;
+	}
+	new_mm->context.segments = ldt;
+	return 0;
 }
 
 /*
diff -u --recursive --new-file v2.4.0-test12/linux/include/asm-i386/desc.h linux/include/asm-i386/desc.h
--- v2.4.0-test12/linux/include/asm-i386/desc.h	Sat Sep  4 13:06:08 1999
+++ linux/include/asm-i386/desc.h	Fri Dec 15 12:40:53 2000
@@ -82,7 +82,7 @@
 extern inline void load_LDT (struct mm_struct *mm)
 {
 	int cpu = smp_processor_id();
-	void *segments = mm->segments;
+	void *segments = mm->context.segments;
 	int count = LDT_ENTRIES;
 
 	if (!segments) {
diff -u --recursive --new-file v2.4.0-test12/linux/include/asm-i386/mmu.h linux/include/asm-i386/mmu.h
--- v2.4.0-test12/linux/include/asm-i386/mmu.h	Wed Dec 31 16:00:00 1969
+++ linux/include/asm-i386/mmu.h	Fri Dec 15 12:38:24 2000
@@ -0,0 +1,12 @@
+#ifndef __i386_MMU_H
+#define __i386_MMU_H
+
+/*
+ * The i386 doesn't have a mmu context, but
+ * we put the segment information here.
+ */
+typedef struct { 
+	void *segments;
+} mm_context_t;
+
+#endif
diff -u --recursive --new-file v2.4.0-test12/linux/include/asm-i386/mmu_context.h linux/include/asm-i386/mmu_context.h
--- v2.4.0-test12/linux/include/asm-i386/mmu_context.h	Fri Sep  8 12:52:41 2000
+++ linux/include/asm-i386/mmu_context.h	Fri Dec 15 18:29:19 2000
@@ -6,11 +6,9 @@
 #include <asm/atomic.h>
 #include <asm/pgalloc.h>
 
-/*
- * possibly do the LDT unload here?
- */
-#define destroy_context(mm)		do { } while(0)
-#define init_new_context(tsk,mm)	0
+/* Segment information */
+extern void destroy_context(struct mm_struct *);
+extern int init_new_context(struct task_struct *, struct mm_struct *);
 
 #ifdef CONFIG_SMP
 
@@ -33,7 +31,7 @@
 		/*
 		 * Re-load LDT if necessary
 		 */
-		if (prev->segments != next->segments)
+		if (prev->context.segments != next->context.segments)
 			load_LDT(next);
 #ifdef CONFIG_SMP
 		cpu_tlbstate[cpu].state = TLBSTATE_OK;
diff -u --recursive --new-file v2.4.0-test12/linux/include/asm-i386/processor.h linux/include/asm-i386/processor.h
--- v2.4.0-test12/linux/include/asm-i386/processor.h	Mon Dec 11 17:59:45 2000
+++ linux/include/asm-i386/processor.h	Fri Dec 15 18:29:18 2000
@@ -427,11 +427,6 @@
  */
 extern int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags);
 
-/* Copy and release all segment info associated with a VM */
-extern void copy_segments(struct task_struct *p, struct mm_struct * mm);
-extern void release_segments(struct mm_struct * mm);
-extern void forget_segments(void);
-
 /*
  * Return saved PC of a blocked thread.
  */
diff -u --recursive --new-file v2.4.0-test12/linux/include/linux/sched.h linux/include/linux/sched.h
--- v2.4.0-test12/linux/include/linux/sched.h	Mon Dec 11 17:59:45 2000
+++ linux/include/linux/sched.h	Fri Dec 15 18:29:19 2000
@@ -18,6 +18,7 @@
 #include <asm/semaphore.h>
 #include <asm/page.h>
 #include <asm/ptrace.h>
+#include <asm/mmu.h>
 
 #include <linux/smp.h>
 #include <linux/tty.h>
@@ -208,7 +209,6 @@
 	int map_count;				/* number of VMAs */
 	struct semaphore mmap_sem;
 	spinlock_t page_table_lock;
-	unsigned long context;
 	unsigned long start_code, end_code, start_data, end_data;
 	unsigned long start_brk, brk, start_stack;
 	unsigned long arg_start, arg_end, env_start, env_end;
@@ -217,11 +217,9 @@
 	unsigned long cpu_vm_mask;
 	unsigned long swap_cnt;	/* number of pages to swap on next pass */
 	unsigned long swap_address;
-	/*
-	 * This is an architecture-specific pointer: the portable
-	 * part of Linux does not know about any segments.
-	 */
-	void * segments;
+
+	/* Architecture-specific MM context */
+	mm_context_t context;
 };
 
 #define INIT_MM(name) \
@@ -235,7 +233,6 @@
 	map_count:	1, 				\
 	mmap_sem:	__MUTEX_INITIALIZER(name.mmap_sem), \
 	page_table_lock: SPIN_LOCK_UNLOCKED, 		\
-	segments:	NULL 				\
 }
 
 struct signal_struct {
diff -u --recursive --new-file v2.4.0-test12/linux/kernel/fork.c linux/kernel/fork.c
--- v2.4.0-test12/linux/kernel/fork.c	Mon Dec 11 17:59:45 2000
+++ linux/kernel/fork.c	Fri Dec 15 12:45:58 2000
@@ -133,11 +133,9 @@
 	mm->mmap_avl = NULL;
 	mm->mmap_cache = NULL;
 	mm->map_count = 0;
-	mm->context = 0;
 	mm->cpu_vm_mask = 0;
 	mm->swap_cnt = 0;
 	mm->swap_address = 0;
-	mm->segments = NULL;
 	pprev = &mm->mmap;
 	for (mpnt = current->mm->mmap ; mpnt ; mpnt = mpnt->vm_next) {
 		struct file *file;
@@ -319,11 +317,6 @@
 	up(&current->mm->mmap_sem);
 	if (retval)
 		goto free_pt;
-
-	/*
-	 * child gets a private LDT (if there was an LDT in the parent)
-	 */
-	copy_segments(tsk, mm);
 
 	if (init_new_context(tsk,mm))
 		goto free_pt;
diff -u --recursive --new-file v2.4.0-test12/linux/mm/mmap.c linux/mm/mmap.c
--- v2.4.0-test12/linux/mm/mmap.c	Mon Dec 11 17:59:45 2000
+++ linux/mm/mmap.c	Fri Dec 15 12:57:54 2000
@@ -885,7 +885,6 @@
 {
 	struct vm_area_struct * mpnt;
 
-	release_segments(mm);
 	spin_lock(&mm->page_table_lock);
 	mpnt = mm->mmap;
 	mm->mmap = mm->mmap_avl = mm->mmap_cache = NULL;

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: test13-pre6
  2000-12-30  2:25 ` test13-pre6 Daniel Phillips
@ 2000-12-30  3:16   ` Linus Torvalds
  2000-12-30 18:58     ` [RFC] Generic deferred file writing Daniel Phillips
  0 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2000-12-30  3:16 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel



On Sat, 30 Dec 2000, Daniel Phillips wrote:

> Linus Torvalds wrote:
> > 
> > Ok, there's a test13-pre6 out there now, which does a partial sync with
> > Alan, in addition to hopefully fixing the innd shared mapping writeback
> > problem for good.  Thanks to Marcelo Tosatti and others..
> 
> After the page_cache_release at line 574 of vmscan.c the page is
> unlocked and only owned by the page cache - anything could happen.  How
> do you know the set_page_dirty at line 581 is still hitting a valid
> page?

Good question.

It should be safe because of the magic return value from "writepage()".

If "writepage()" returns 1, then that implies that the page is locked down
somehow. But you're right, this is ugly, if not outright buggy (maybe the
locked down state could change after the writepage, who knows?).

Moving the test is probably a good idea.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: test13-pre6 (Fork Bug with Athlons? Temporary Fix)
  2000-12-30  3:08 ` test13-pre6 (Fork Bug with Athlons? Temporary Fix) Byron Stanoszek
@ 2000-12-30  3:36   ` Linus Torvalds
  2000-12-30  5:55     ` Andi Kleen
  2000-12-30  5:13   ` Linus Torvalds
  2000-12-30  8:13   ` Graham Murray
  2 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2000-12-30  3:36 UTC (permalink / raw)
  To: Byron Stanoszek
  Cc: Kernel Mailing List, Alexander Viro, Stephen C. Tweedie,
	Marco d'Itri, Jeff Lightfoot, Dan Aloni, Anton Blanchard



On Fri, 29 Dec 2000, Byron Stanoszek wrote:
> 
> I narrowed the problem down to a subset of patches from the MM set in
> test13-pre2. Reversing the attached 'context.patch' fixes the problem (only for
> i386), but I'm not yet sure why. test13-pre2 and up work without any problems
> on an Intel cpu (Pentium 180 & P3 800 tested).

Cool.

Maybe your libc is different on the different machines? Normal programs
shouldn't use segments at all, so I really do not see how this patch could
matter in the least, even if it was completely and utterly buggy (which is
not obvious at first glance).

I wonder why you seem to have an LDT at all..

> Anyways, I can't seem to find out what really changes with the patch except for
> the obvious 'void *segment' changing into a typedef-struct.

Would you mind trying to hunt this down a bit more? In particular, it
would be good to see if the behaviour is the same if you do the typedef
change but leave the other logic alone. That would also cut down on the
purely syntactic changes of the patch.

I'll take a look at the code here.

	Thanks,

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: test13-pre6
  2000-12-30  0:25 test13-pre6 Linus Torvalds
                   ` (2 preceding siblings ...)
  2000-12-30  3:08 ` test13-pre6 (Fork Bug with Athlons? Temporary Fix) Byron Stanoszek
@ 2000-12-30  4:21 ` Dan Aloni
  2001-01-04 20:23 ` test13-pre6 Stephen C. Tweedie
  4 siblings, 0 replies; 56+ messages in thread
From: Dan Aloni @ 2000-12-30  4:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kernel Mailing List, Alexander Viro, Stephen C. Tweedie,
	Marco d'Itri, Jeff Lightfoot, Anton Blanchard

On Fri, 29 Dec 2000, Linus Torvalds wrote:

> Marco d'Itri and everybody else who has seen innd problems (or other
> shared map problems): can you verify that test13-pre6 works for you?

The ->mapping problem seems to be gone in test13-pre5, I'm running this
kernel for over 30 hours now with no glitch, gonna check if it's the same
for test13-pre6.

-- 
Dan Aloni 
dax@karrde.org

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: test13-pre6 (Fork Bug with Athlons? Temporary Fix)
  2000-12-30  3:08 ` test13-pre6 (Fork Bug with Athlons? Temporary Fix) Byron Stanoszek
  2000-12-30  3:36   ` Linus Torvalds
@ 2000-12-30  5:13   ` Linus Torvalds
  2000-12-30  8:13   ` Graham Murray
  2 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2000-12-30  5:13 UTC (permalink / raw)
  To: Byron Stanoszek; +Cc: Kernel Mailing List



Ok, I don't think this is an athlon bug, and I think I've figured out what
the problem is.  For now, you rtemporary fix is probably fine, I'll clean
stuff up a bit and make a nicer patch available tomorrow.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: test13-pre6 (Fork Bug with Athlons? Temporary Fix)
  2000-12-30  3:36   ` Linus Torvalds
@ 2000-12-30  5:55     ` Andi Kleen
  0 siblings, 0 replies; 56+ messages in thread
From: Andi Kleen @ 2000-12-30  5:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Byron Stanoszek, Kernel Mailing List, Alexander Viro,
	Stephen C. Tweedie, Marco d'Itri, Jeff Lightfoot, Dan Aloni,
	Anton Blanchard

On Fri, Dec 29, 2000 at 07:36:21PM -0800, Linus Torvalds wrote:
> Maybe your libc is different on the different machines? Normal programs
> shouldn't use segments at all, so I really do not see how this patch could
> matter in the least, even if it was completely and utterly buggy (which is
> not obvious at first glance).
> 
> I wonder why you seem to have an LDT at all..

glibc 2.2 linuxthreads sets up an LDT to use %gs as thread local data base 
pointer.


-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: test13-pre6 (Fork Bug with Athlons? Temporary Fix)
  2000-12-30  3:08 ` test13-pre6 (Fork Bug with Athlons? Temporary Fix) Byron Stanoszek
  2000-12-30  3:36   ` Linus Torvalds
  2000-12-30  5:13   ` Linus Torvalds
@ 2000-12-30  8:13   ` Graham Murray
  2 siblings, 0 replies; 56+ messages in thread
From: Graham Murray @ 2000-12-30  8:13 UTC (permalink / raw)
  To: linux-kernel

Byron Stanoszek <gandalf@winds.org> writes:

> I narrowed the problem down to a subset of patches from the MM set in
> test13-pre2. Reversing the attached 'context.patch' fixes the problem (only for
> i386), but I'm not yet sure why. test13-pre2 and up work without any problems
> on an Intel cpu (Pentium 180 & P3 800 tested).
[Snip] 
> root       351  0.0  1.2  9292 1576 ?        S    21:42   0:00 named
> root       361  0.0  0.0     0    0 ?        Z    21:42   0:00 [named <defunct>]
> root       363  0.0  1.2  9292 1576 ?        S    21:42   0:00 named
> root       364  0.0  1.2  9292 1576 ?        S    21:42   0:00 named
> root       365  0.0  0.7  2064  936 ?        S    21:42   0:00 /usr/sbin/sshd
> ..etc
> (Note PID 361)

I am seeing the same thing with the [named <defunct>] on a PIII 600,
so it is not Athlon specific. I haven't yet tried test13-pre6 but it
happens with pre3,4,5. So I am still running on test12.

I will try running pre6 then, if it still fails will try with your
context.patch and see if that fixes it.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: test13-pre6
  2000-12-30  1:03   ` test13-pre6 Linus Torvalds
@ 2000-12-30 18:09     ` Alexander Viro
  0 siblings, 0 replies; 56+ messages in thread
From: Alexander Viro @ 2000-12-30 18:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kernel Mailing List, Alexander Viro, Stephen C. Tweedie,
	Peter J. Braam, linux-fsdevel



On Fri, 29 Dec 2000, Linus Torvalds wrote:

> > Oblock_super(): what the hell is wait_on_super() doing in fsync_file()?
> > It gives absolutely no warranties - ->write_super() can easily block, so
> > it looks very odd.
> 
> A lot of the superblock locking has been odd. It should probably be a
> lock_super() + unlock_super(). At least that's what sync_supers() does.

Yeah, that's what I've done, but I'm less than sure that we need anything
of that kind in the first place...

Anyway, under the "it works here" category: current state of sane-s_lock
patch. It will need further cleanup in fs/super.c, but IMO the current
form is already better than code in tree. Comments are more than welcome.

One note on the patch: sb->s_count is amount of extra references to superblock
(== not coming from vfsmounts) + 1 if there is a vfsmount over that superblock.
I.e. same scheme as we use for mm_struct. get_super() bumps the refcount,
put_super() decrements it and frees the superblock if counter hits zero.
add_vfsmnt() bumps refcount if we are adding the first vfsmount.
remove_vfsmnt() bumps refcount _unless_ we are removing the last vfsmount.
I.e. remove_vfsmnt() + put_super() always does the right thing.

List of superblocks contains only "active" ones. kill_super() takes the
superblock out of the list. get_empty_super() became simpler - it doesn't
bother searching the list for "reusable" ones. List is still BKL-protected;
I'll fix that.

We got _two_ sempaphores - one for get_super()/umount() synchronization
(->s_umount) and another for lock_super()/unlock_super() (->s_lock).
I would expect the latter to go into the per-fs part - all synchronization
that is interesting from VFS point of view is done by ->s_umount. The
reason why I've kept s_lock at all is simple - I don't want to mix
changes in fs-private synchronization with the fs-independent stuff.

Patch is against -pre6 and it seems to working here. Comments (and help
in testing) are welcome.

Peter, if you could remind me the uses of get_empty_super() and friends
in your code I would be very grateful - I seriously suspect that we
could use better interface in that area.
							Cheers,
								Al
Patch follows:

diff -urN rc13-pre6/arch/parisc/hpux/sys_hpux.c rc13-pre6-s_lock/arch/parisc/hpux/sys_hpux.c
--- rc13-pre6/arch/parisc/hpux/sys_hpux.c	Tue Dec 12 02:10:00 2000
+++ rc13-pre6-s_lock/arch/parisc/hpux/sys_hpux.c	Sat Dec 30 09:17:25 2000
@@ -109,9 +109,11 @@
 
 	lock_kernel();
 	s = get_super(to_kdev_t(dev));
+	unlock_kernel();
 	if (s == NULL)
 		goto out;
 	err = vfs_statfs(s, &sbuf);
+	put_super(s);
 	if (err)
 		goto out;
 
@@ -124,7 +126,6 @@
 	/* Changed to hpux_ustat:  */
 	err = copy_to_user(ubuf,&tmp,sizeof(struct hpux_ustat)) ? -EFAULT : 0;
 out:
-	unlock_kernel();
 	return err;
 }
 
diff -urN rc13-pre6/arch/sparc/kernel/sys_sunos.c rc13-pre6-s_lock/arch/sparc/kernel/sys_sunos.c
--- rc13-pre6/arch/sparc/kernel/sys_sunos.c	Sun Aug 13 16:56:28 2000
+++ rc13-pre6-s_lock/arch/sparc/kernel/sys_sunos.c	Sat Dec 30 09:17:25 2000
@@ -620,8 +620,6 @@
 };
 
 
-extern dev_t get_unnamed_dev(void);
-extern void put_unnamed_dev(dev_t);
 extern asmlinkage int sys_connect(int fd, struct sockaddr *uservaddr, int addrlen);
 extern asmlinkage int sys_socket(int family, int type, int protocol);
 extern asmlinkage int sys_bind(int fd, struct sockaddr *umyaddr, int addrlen);
diff -urN rc13-pre6/arch/sparc64/kernel/sys_sunos32.c rc13-pre6-s_lock/arch/sparc64/kernel/sys_sunos32.c
--- rc13-pre6/arch/sparc64/kernel/sys_sunos32.c	Tue Dec 12 02:10:04 2000
+++ rc13-pre6-s_lock/arch/sparc64/kernel/sys_sunos32.c	Sat Dec 30 09:17:25 2000
@@ -580,8 +580,6 @@
 	char       *netname;   /* server's netname */
 };
 
-extern dev_t get_unnamed_dev(void);
-extern void put_unnamed_dev(dev_t);
 extern asmlinkage int sys_mount(char *, char *, char *, unsigned long, void *);
 extern asmlinkage int sys_connect(int fd, struct sockaddr *uservaddr, int addrlen);
 extern asmlinkage int sys_socket(int family, int type, int protocol);
diff -urN rc13-pre6/drivers/acorn/block/mfmhd.c rc13-pre6-s_lock/drivers/acorn/block/mfmhd.c
--- rc13-pre6/drivers/acorn/block/mfmhd.c	Wed Nov 29 19:28:26 2000
+++ rc13-pre6-s_lock/drivers/acorn/block/mfmhd.c	Sat Dec 30 09:17:26 2000
@@ -1486,13 +1486,8 @@
 	for (i = maxp - 1; i >= 0; i--) {
 		int minor = start + i;
 		kdev_t devi = MKDEV(MAJOR_NR, minor);
-		struct super_block *sb = get_super(devi);
-
-		sync_dev (devi);
-		if (sb)
-			invalidate_inodes (sb);
-		invalidate_buffers (devi);
 
+		invalidate_dev(devi, 1);
 		mfm_gendisk.part[minor].start_sect = 0;
 		mfm_gendisk.part[minor].nr_sects = 0;
 	}
diff -urN rc13-pre6/drivers/block/DAC960.c rc13-pre6-s_lock/drivers/block/DAC960.c
--- rc13-pre6/drivers/block/DAC960.c	Tue Dec 12 02:10:05 2000
+++ rc13-pre6-s_lock/drivers/block/DAC960.c	Sat Dec 30 09:17:26 2000
@@ -4990,16 +4990,12 @@
 						      PartitionNumber);
 	  int MinorNumber = DAC960_MinorNumber(LogicalDriveNumber,
 					       PartitionNumber);
-	  SuperBlock_T *SuperBlock = get_super(Device);
 	  if (Controller->GenericDiskInfo.part[MinorNumber].nr_sects == 0)
 	    continue;
 	  /*
 	    Flush all changes and invalidate buffered state.
 	  */
-	  sync_dev(Device);
-	  if (SuperBlock != NULL)
-	    invalidate_inodes(SuperBlock);
-	  invalidate_buffers(Device);
+	  invalidate_dev(Device, 1);
 	  /*
 	    Clear existing partition sizes.
 	  */
diff -urN rc13-pre6/drivers/block/acsi.c rc13-pre6-s_lock/drivers/block/acsi.c
--- rc13-pre6/drivers/block/acsi.c	Tue Dec 12 02:10:05 2000
+++ rc13-pre6-s_lock/drivers/block/acsi.c	Sat Dec 30 09:17:26 2000
@@ -1887,12 +1887,7 @@
 	for( i = max_p - 1; i >= 0 ; i-- ) {
 		if (gdev->part[start + i].nr_sects != 0) {
 			kdev_t devp = MKDEV(MAJOR_NR, start + i);
-			struct super_block *sb = get_super(devp);
-
-			fsync_dev(devp);
-			if (sb)
-				invalidate_inodes(sb);
-			invalidate_buffers(devp);
+			invalidate_dev(devp, 2);
 			gdev->part[start + i].nr_sects = 0;
 		}
 		gdev->part[start+i].start_sect = 0;
diff -urN rc13-pre6/drivers/block/amiflop.c rc13-pre6-s_lock/drivers/block/amiflop.c
--- rc13-pre6/drivers/block/amiflop.c	Wed Nov 29 19:28:35 2000
+++ rc13-pre6-s_lock/drivers/block/amiflop.c	Sat Dec 30 09:17:26 2000
@@ -1490,7 +1490,6 @@
 {
 	int drive = inode->i_rdev & 3;
 	static struct floppy_struct getprm;
-	struct super_block * sb;
 
 	switch(cmd){
 	case HDIO_GETGEO:
@@ -1540,10 +1539,7 @@
 		break;
 	case FDFMTEND:
 		floppy_off(drive);
-		sb = get_super(inode->i_rdev);
-		if (sb)
-			invalidate_inodes(sb);
-		invalidate_buffers(inode->i_rdev);
+		invalidate_dev(inode->i_rdev, 0);
 		break;
 	case FDGETPRM:
 		memset((void *)&getprm, 0, sizeof (getprm));
@@ -1677,9 +1673,6 @@
 
 static int floppy_release(struct inode * inode, struct file * filp)
 {
-#ifdef DEBUG
-	struct super_block * sb;
-#endif
 	int drive = MINOR(inode->i_rdev) & 3;
 
 	if (unit[drive].dirty == 1) {
diff -urN rc13-pre6/drivers/block/blkpg.c rc13-pre6-s_lock/drivers/block/blkpg.c
--- rc13-pre6/drivers/block/blkpg.c	Mon Mar 20 01:36:47 2000
+++ rc13-pre6-s_lock/drivers/block/blkpg.c	Sat Dec 30 09:17:26 2000
@@ -157,8 +157,7 @@
 
 	/* partition in use? Incomplete check for now. */
 	devp = MKDEV(MAJOR(dev), minor);
-	if (get_super(devp) ||		/* mounted? */
-	    is_swap_partition(devp))
+	if (is_mounted(devp) || is_swap_partition(devp))
 		return -EBUSY;
 
 	/* all seems OK */
diff -urN rc13-pre6/drivers/block/cciss.c rc13-pre6-s_lock/drivers/block/cciss.c
--- rc13-pre6/drivers/block/cciss.c	Wed Nov 29 21:36:25 2000
+++ rc13-pre6-s_lock/drivers/block/cciss.c	Sat Dec 30 09:17:26 2000
@@ -707,10 +707,7 @@
         for(i=max_p; i>=0; i--) {
                 int minor = start+i;
                 kdev_t devi = MKDEV(MAJOR_NR + ctlr, minor);
-                struct super_block *sb = get_super(devi);
-                sync_dev(devi);
-                if (sb) invalidate_inodes(sb);
-                invalidate_buffers(devi);
+		invalidate_dev(devi, 1);
                 gdev->part[minor].start_sect = 0;
                 gdev->part[minor].nr_sects = 0;
 
diff -urN rc13-pre6/drivers/block/cpqarray.c rc13-pre6-s_lock/drivers/block/cpqarray.c
--- rc13-pre6/drivers/block/cpqarray.c	Wed Nov 29 21:42:57 2000
+++ rc13-pre6-s_lock/drivers/block/cpqarray.c	Sat Dec 30 09:17:26 2000
@@ -1571,10 +1571,7 @@
 	for(i=max_p; i>=0; i--) {
 		int minor = start+i;
 		kdev_t devi = MKDEV(MAJOR_NR + ctlr, minor);
-		struct super_block *sb = get_super(devi);
-		sync_dev(devi);
-		if (sb) invalidate_inodes(sb);
-		invalidate_buffers(devi);
+		invalidate_dev(devi, 1);
 		gdev->part[minor].start_sect = 0;	
 		gdev->part[minor].nr_sects = 0;	
 
diff -urN rc13-pre6/drivers/block/paride/pd.c rc13-pre6-s_lock/drivers/block/paride/pd.c
--- rc13-pre6/drivers/block/paride/pd.c	Tue Jul 25 09:23:16 2000
+++ rc13-pre6-s_lock/drivers/block/paride/pd.c	Sat Dec 30 09:17:26 2000
@@ -593,8 +593,6 @@
         long flags;
         kdev_t devp;
 
-	struct super_block *sb;
-
         unit = DEVICE_NR(dev);
         if ((unit >= PD_UNITS) || (!PD.present)) return -ENODEV;
 
@@ -610,12 +608,7 @@
         for (p=(PD_PARTNS-1);p>=0;p--) {
 		minor = p + unit*PD_PARTNS;
                 devp = MKDEV(MAJOR_NR, minor);
-                fsync_dev(devp);
-
-                sb = get_super(devp);
-                if (sb) invalidate_inodes(sb);
-
-                invalidate_buffers(devp);
+		invalidate_dev(devp, 2);
                 pd_hd[minor].start_sect = 0;
                 pd_hd[minor].nr_sects = 0;
         }
diff -urN rc13-pre6/drivers/block/ps2esdi.c rc13-pre6-s_lock/drivers/block/ps2esdi.c
--- rc13-pre6/drivers/block/ps2esdi.c	Mon Jul 31 16:33:53 2000
+++ rc13-pre6-s_lock/drivers/block/ps2esdi.c	Sat Dec 30 09:17:26 2000
@@ -1180,12 +1180,7 @@
 	     partition >= 0; partition--) {
 		int minor = (start | partition);
 		kdev_t devp = MKDEV(MAJOR_NR, minor);
-		struct super_block * sb = get_super(devp);
-		
-		sync_dev(devp);
-		if (sb)
-			invalidate_inodes(sb);
-		invalidate_buffers(devp);
+		invalidate_dev(devp, 1);
 		ps2esdi_gendisk.part[start + partition].start_sect = 0;
 		ps2esdi_gendisk.part[start + partition].nr_sects = 0;
 	}
diff -urN rc13-pre6/drivers/block/xd.c rc13-pre6-s_lock/drivers/block/xd.c
--- rc13-pre6/drivers/block/xd.c	Wed Nov 29 21:36:26 2000
+++ rc13-pre6-s_lock/drivers/block/xd.c	Sat Dec 30 09:17:26 2000
@@ -394,12 +394,7 @@
 	for (partition = xd_gendisk.max_p - 1; partition >= 0; partition--) {
 		int minor = (start | partition);
 		kdev_t devp = MKDEV(MAJOR_NR, minor);
-		struct super_block * sb = get_super(devp);
-		
-		sync_dev(devp);
-		if (sb)
-			invalidate_inodes(sb);
-		invalidate_buffers(devp);
+		invalidate_dev(devp, 1);
 		xd_gendisk.part[minor].start_sect = 0;
 		xd_gendisk.part[minor].nr_sects = 0;
 	};
diff -urN rc13-pre6/drivers/char/raw.c rc13-pre6-s_lock/drivers/char/raw.c
--- rc13-pre6/drivers/char/raw.c	Wed Nov 29 19:28:41 2000
+++ rc13-pre6-s_lock/drivers/char/raw.c	Sat Dec 30 09:17:26 2000
@@ -103,7 +103,7 @@
 	 */
 	
 	sector_size = 512;
-	if (get_super(rdev) != NULL) {
+	if (is_mounted(rdev)) {
 		if (blksize_size[MAJOR(rdev)])
 			sector_size = blksize_size[MAJOR(rdev)][MINOR(rdev)];
 	} else {
diff -urN rc13-pre6/drivers/i2o/i2o_block.c rc13-pre6-s_lock/drivers/i2o/i2o_block.c
--- rc13-pre6/drivers/i2o/i2o_block.c	Wed Nov 29 21:43:05 2000
+++ rc13-pre6-s_lock/drivers/i2o/i2o_block.c	Sat Dec 30 09:17:27 2000
@@ -900,12 +900,7 @@
 	{
 		int m = minor+i;
 		kdev_t d = MKDEV(MAJOR_NR, m);
-		struct super_block *sb = get_super(d);
-		
-		sync_dev(d);
-		if(sb)
-			invalidate_inodes(sb);
-		invalidate_buffers(d);
+		invalidate_dev(d, 1);
 		i2ob_gendisk.part[m].start_sect = 0;
 		i2ob_gendisk.part[m].nr_sects = 0;
 	}
diff -urN rc13-pre6/drivers/ide/hd.c rc13-pre6-s_lock/drivers/ide/hd.c
--- rc13-pre6/drivers/ide/hd.c	Wed Nov 29 21:43:06 2000
+++ rc13-pre6-s_lock/drivers/ide/hd.c	Sat Dec 30 09:17:27 2000
@@ -881,12 +881,7 @@
 	for (i=max_p - 1; i >=0 ; i--) {
 		int minor = start + i;
 		kdev_t devi = MKDEV(MAJOR_NR, minor);
-		struct super_block *sb = get_super(devi); 
-
-		sync_dev(devi);
-		if (sb)
-			invalidate_inodes(sb);
-		invalidate_buffers(devi);
+		invalidate_dev(devi, 1);
 		gdev->part[minor].start_sect = 0;
 		gdev->part[minor].nr_sects = 0;
 	}
diff -urN rc13-pre6/drivers/ide/ide.c rc13-pre6-s_lock/drivers/ide/ide.c
--- rc13-pre6/drivers/ide/ide.c	Tue Dec 12 02:10:10 2000
+++ rc13-pre6-s_lock/drivers/ide/ide.c	Sat Dec 30 09:17:27 2000
@@ -1761,11 +1761,7 @@
 	for (p = 0; p < (1<<PARTN_BITS); ++p) {
 		if (drive->part[p].nr_sects > 0) {
 			kdev_t devp = MKDEV(major, minor+p);
-			struct super_block * sb = get_super(devp);
-			fsync_dev          (devp);
-			if (sb)
-				invalidate_inodes(sb);
-			invalidate_buffers (devp);
+			invalidate_dev(devp, 2);
 			set_blocksize(devp, 1024);
 		}
 		drive->part[p].start_sect = 0;
@@ -1982,9 +1978,7 @@
 		for (p = 0; p < (1<<PARTN_BITS); ++p) {
 			if (drive->part[p].nr_sects > 0) {
 				kdev_t devp = MKDEV(hwif->major, minor+p);
-				struct super_block * sb = get_super(devp);
-				if (sb) invalidate_inodes(sb);
-				invalidate_buffers (devp);
+				invalidate_dev(devp, 0);
 			}
 		}
 #ifdef CONFIG_PROC_FS
diff -urN rc13-pre6/drivers/md/md.c rc13-pre6-s_lock/drivers/md/md.c
--- rc13-pre6/drivers/md/md.c	Tue Dec 12 02:10:15 2000
+++ rc13-pre6-s_lock/drivers/md/md.c	Sat Dec 30 09:17:27 2000
@@ -1088,7 +1088,7 @@
 	}
 	memset(rdev, 0, sizeof(*rdev));
 
-	if (get_super(newdev)) {
+	if (is_mounted(newdev)) {
 		printk("md: can not import %s, has active inodes!\n",
 			partition_name(newdev));
 		err = -EBUSY;
@@ -1726,7 +1726,7 @@
  	}
  
  	/* this shouldn't be needed as above would have fired */
-	if (!ro && get_super(dev)) {
+	if (!ro && is_mounted(dev)) {
 		printk (STILL_MOUNTED, mdidx(mddev));
 		OUT(-EBUSY);
 	}
diff -urN rc13-pre6/drivers/mtd/ftl.c rc13-pre6-s_lock/drivers/mtd/ftl.c
--- rc13-pre6/drivers/mtd/ftl.c	Tue Dec 12 02:10:19 2000
+++ rc13-pre6-s_lock/drivers/mtd/ftl.c	Sat Dec 30 09:17:27 2000
@@ -915,9 +915,6 @@
 
 static release_t ftl_close(struct inode *inode, struct file *file)
 {
-#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
-    struct super_block *sb = get_super(inode->i_rdev);
-#endif
     int minor = MINOR(inode->i_rdev);
     partition_t *part = myparts[minor >> 4];
     int i;
@@ -925,11 +922,7 @@
     DEBUG(0, "ftl_cs: ftl_close(%d)\n", minor);
 
     /* Flush all writes */
-    fsync_dev(inode->i_rdev);
-#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
-    if (sb) invalidate_inodes(sb);
-#endif
-    invalidate_buffers(inode->i_rdev);
+    invalidate_dev(inode->i_rdev, 2);
 
     /* Wait for any pending erase operations to complete */
     if (part->mtd->sync)
diff -urN rc13-pre6/drivers/mtd/mtdblock.c rc13-pre6-s_lock/drivers/mtd/mtdblock.c
--- rc13-pre6/drivers/mtd/mtdblock.c	Sat Dec 30 07:37:35 2000
+++ rc13-pre6-s_lock/drivers/mtd/mtdblock.c	Sat Dec 30 09:17:27 2000
@@ -355,19 +355,12 @@
 {
 	int dev;
 	struct mtdblk_dev *mtdblk;
-#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
-	struct super_block * sb = get_super(inode->i_rdev);
-#endif
    	DEBUG(MTD_DEBUG_LEVEL1, "mtdblock_release\n");
 
 	if (inode == NULL)
 		release_return(-ENODEV);
    
-	fsync_dev(inode->i_rdev);
-#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
-	if (sb) invalidate_inodes(sb);
-#endif
-	invalidate_buffers(inode->i_rdev);
+	invalidate_dev(inode->i_rdev, 2);
 
 	dev = MINOR(inode->i_rdev);
 	mtdblk = mtdblks[dev];
diff -urN rc13-pre6/drivers/mtd/nftl.c rc13-pre6-s_lock/drivers/mtd/nftl.c
--- rc13-pre6/drivers/mtd/nftl.c	Tue Dec 12 02:10:19 2000
+++ rc13-pre6-s_lock/drivers/mtd/nftl.c	Sat Dec 30 09:17:28 2000
@@ -997,17 +997,13 @@
 
 static int nftl_release(struct inode *inode, struct file *fp)
 {
-	struct super_block *sb = get_super(inode->i_rdev);
 	struct NFTLrecord *thisNFTL;
 
 	thisNFTL = NFTLs[MINOR(inode->i_rdev) / 16];
 
 	DEBUG(MTD_DEBUG_LEVEL2, "NFTL_release\n");
 
-	fsync_dev(inode->i_rdev);
-	if (sb)
-		invalidate_inodes(sb);
-	invalidate_buffers(inode->i_rdev);
+	invalidate_dev(inode->i_rdev, 2);
 
 	if (thisNFTL->mtd->sync)
 		thisNFTL->mtd->sync(thisNFTL->mtd);
diff -urN rc13-pre6/drivers/scsi/sd.c rc13-pre6-s_lock/drivers/scsi/sd.c
--- rc13-pre6/drivers/scsi/sd.c	Wed Nov 29 19:29:16 2000
+++ rc13-pre6-s_lock/drivers/scsi/sd.c	Sat Dec 30 09:17:28 2000
@@ -1262,11 +1262,7 @@
 	for (i = max_p - 1; i >= 0; i--) {
 		int index = start + i;
 		kdev_t devi = MKDEV_SD_PARTITION(index);
-		struct super_block *sb = get_super(devi);
-		sync_dev(devi);
-		if (sb)
-			invalidate_inodes(sb);
-		invalidate_buffers(devi);
+		invalidate_dev(devi, 1);
 		sd_gendisks->part[index].start_sect = 0;
 		sd_gendisks->part[index].nr_sects = 0;
 		/*
@@ -1315,11 +1311,7 @@
 			for (j = max_p - 1; j >= 0; j--) {
 				int index = start + j;
 				kdev_t devi = MKDEV_SD_PARTITION(index);
-				struct super_block *sb = get_super(devi);
-				sync_dev(devi);
-				if (sb)
-					invalidate_inodes(sb);
-				invalidate_buffers(devi);
+				invalidate_dev(devi, 1);
 				sd_gendisks->part[index].start_sect = 0;
 				sd_gendisks->part[index].nr_sects = 0;
 				sd_sizes[index] = 0;
diff -urN rc13-pre6/drivers/scsi/sr.c rc13-pre6-s_lock/drivers/scsi/sr.c
--- rc13-pre6/drivers/scsi/sr.c	Sat Dec 30 07:37:41 2000
+++ rc13-pre6-s_lock/drivers/scsi/sr.c	Sat Dec 30 09:17:28 2000
@@ -874,15 +874,11 @@
 	for (cpnt = scsi_CDs, i = 0; i < sr_template.dev_max; i++, cpnt++)
 		if (cpnt->device == SDp) {
 			kdev_t devi = MKDEV(MAJOR_NR, i);
-			struct super_block *sb = get_super(devi);
-
 			/*
 			 * Since the cdrom is read-only, no need to sync the device.
 			 * We should be kind to our buffer cache, however.
 			 */
-			if (sb)
-				invalidate_inodes(sb);
-			invalidate_buffers(devi);
+			invalidate_dev(devi, 0);
 
 			/*
 			 * Reset things back to a sane state so that one can re-load a new
diff -urN rc13-pre6/drivers/sound/via82cxxx_audio.c rc13-pre6-s_lock/drivers/sound/via82cxxx_audio.c
--- rc13-pre6/drivers/sound/via82cxxx_audio.c	Wed Nov 29 21:37:13 2000
+++ rc13-pre6-s_lock/drivers/sound/via82cxxx_audio.c	Sat Dec 30 09:17:28 2000
@@ -1727,20 +1727,8 @@
 }
 
 
-#ifndef VM_RESERVE
-static int via_mm_swapout (struct page *page, struct file *filp)
-{
-	return 0;
-}
-#endif /* VM_RESERVE */
-
-
 struct vm_operations_struct via_mm_ops = {
 	nopage:		via_mm_nopage,
-
-#ifndef VM_RESERVE
-	swapout:	via_mm_swapout,
-#endif
 };
 
 
@@ -1789,9 +1777,7 @@
 	vma->vm_ops = &via_mm_ops;
 	vma->vm_private_data = card;
 
-#ifdef VM_RESERVE
-	vma->vm_flags |= VM_RESERVE;
-#endif
+	vma->vm_flags |= VM_RESERVED;
 
 	if (rd)
 		card->ch_in.is_mapped = 1;
diff -urN rc13-pre6/fs/block_dev.c rc13-pre6-s_lock/fs/block_dev.c
--- rc13-pre6/fs/block_dev.c	Wed Nov 29 19:29:31 2000
+++ rc13-pre6-s_lock/fs/block_dev.c	Sat Dec 30 09:17:28 2000
@@ -576,8 +576,11 @@
 		bdevname(dev));
 
 	sb = get_super(dev);
-	if (sb && invalidate_inodes(sb))
-		printk("VFS: busy inodes on changed media.\n");
+	if (sb) {
+		if (invalidate_inodes(sb))
+			printk("VFS: busy inodes on changed media.\n");
+		put_super(sb);
+	}
 
 	destroy_buffers(dev);
 
diff -urN rc13-pre6/fs/buffer.c rc13-pre6-s_lock/fs/buffer.c
--- rc13-pre6/fs/buffer.c	Sat Dec 30 07:37:45 2000
+++ rc13-pre6-s_lock/fs/buffer.c	Sat Dec 30 09:17:28 2000
@@ -337,9 +337,10 @@
 
 	/* sync the superblock to buffers */
 	sb = inode->i_sb;
-	wait_on_super(sb);
+	lock_super(sb);
 	if (sb->s_op && sb->s_op->write_super)
 		sb->s_op->write_super(sb);
+	unlock_super(sb);
 
 	/* .. finally sync the buffers to disk */
 	dev = inode->i_dev;
@@ -677,6 +678,19 @@
 	spin_unlock(&lru_list_lock);
 	if (slept)
 		goto retry;
+}
+
+void invalidate_dev(kdev_t dev, int sync_flag)
+{
+	struct super_block *sb = get_super(dev);
+	if (sync_flag == 1)
+		sync_dev(dev);
+	else if (sync_flag == 2)
+		fsync_dev(dev);
+	if (sb)
+		invalidate_inodes(sb);
+	put_super(sb);
+	invalidate_buffers(dev);
 }
 
 void set_blocksize(kdev_t dev, int size)
diff -urN rc13-pre6/fs/coda/dir.c rc13-pre6-s_lock/fs/coda/dir.c
--- rc13-pre6/fs/coda/dir.c	Sat Dec 30 07:37:45 2000
+++ rc13-pre6-s_lock/fs/coda/dir.c	Sat Dec 30 09:17:28 2000
@@ -557,8 +557,8 @@
                 return -ENXIO;
         }
                 
-        *ind = NULL;
         *ind = iget(sbptr, ino);
+	put_super(sbptr);
 
         if ( *ind == NULL ) {
 		printk("coda_inode_grab: iget(dev: %d, ino: %ld) "
diff -urN rc13-pre6/fs/coda/inode.c rc13-pre6-s_lock/fs/coda/inode.c
--- rc13-pre6/fs/coda/inode.c	Sat Dec 30 07:37:45 2000
+++ rc13-pre6-s_lock/fs/coda/inode.c	Sat Dec 30 09:17:28 2000
@@ -97,7 +97,6 @@
 	struct coda_sb_info *sbi = NULL;
 	struct venus_comm *vc = NULL;
         ViceFid fid;
-	kdev_t dev = sb->s_dev;
         int error;
 	int idx;
 	ENTRY;
@@ -139,7 +138,6 @@
         sb->s_blocksize = 1024;	/* XXXXX  what do we put here?? */
         sb->s_blocksize_bits = 10;
         sb->s_magic = CODA_SUPER_MAGIC;
-        sb->s_dev = dev;
         sb->s_op = &coda_super_operations;
 
 	/* get root fid from Venus: this needs the root inode */
diff -urN rc13-pre6/fs/devfs/base.c rc13-pre6-s_lock/fs/devfs/base.c
--- rc13-pre6/fs/devfs/base.c	Wed Nov 29 21:44:16 2000
+++ rc13-pre6-s_lock/fs/devfs/base.c	Sat Dec 30 09:17:28 2000
@@ -2166,8 +2166,11 @@
     printk ( KERN_DEBUG "VFS: Disk change detected on device %s\n",
 	     kdevname (dev) );
     sb = get_super (dev);
-    if ( sb && invalidate_inodes (sb) )
-	printk("VFS: busy inodes on changed media..\n");
+    if (sb) {
+	if (invalidate_inodes (sb))
+		printk("VFS: busy inodes on changed media..\n");
+	put_super(sb);
+    }
     invalidate_buffers (dev);
     /*  Ugly hack to disable messages about unable to read partition table  */
     tmp = warn_no_part;
diff -urN rc13-pre6/fs/dquot.c rc13-pre6-s_lock/fs/dquot.c
--- rc13-pre6/fs/dquot.c	Tue Dec 12 02:10:42 2000
+++ rc13-pre6-s_lock/fs/dquot.c	Sat Dec 30 09:17:29 2000
@@ -326,7 +326,7 @@
         memset(&dquot->dq_dqb, 0, sizeof(struct dqblk));
 }
 
-void invalidate_dquots(kdev_t dev, short type)
+static void invalidate_dquots(struct super_block *sb, short type)
 {
 	struct dquot *dquot, *next;
 	int need_restart;
@@ -336,12 +336,10 @@
 	need_restart = 0;
 	while ((dquot = next) != NULL) {
 		next = dquot->dq_next;
-		if (dquot->dq_dev != dev)
+		if (dquot->dq_sb != sb)
 			continue;
 		if (dquot->dq_type != type)
 			continue;
-		if (!dquot->dq_sb)	/* Already invalidated entry? */
-			continue;
 		if (dquot->dq_flags & DQ_LOCKED) {
 			__wait_on_dquot(dquot);
 
@@ -350,12 +348,10 @@
 			/*
 			 * Make sure it's still the same dquot.
 			 */
-			if (dquot->dq_dev != dev)
+			if (dquot->dq_sb != sb)
 				continue;
 			if (dquot->dq_type != type)
 				continue;
-			if (!dquot->dq_sb)
-				continue;
 		}
 		/*
 		 *  Because inodes needn't to be the only holders of dquot
@@ -1390,7 +1386,7 @@
 }
 
 /* Function in inode.c - remove pointers to dquots in icache */
-extern void remove_dquot_ref(kdev_t, short);
+extern void remove_dquot_ref(struct super_block *, short);
 
 /*
  * Turn quota off on a device. type == -1 ==> quotaoff for all types (umount)
@@ -1415,8 +1411,8 @@
 		reset_enable_flags(dqopt, cnt);
 
 		/* Note: these are blocking operations */
-		remove_dquot_ref(sb->s_dev, cnt);
-		invalidate_dquots(sb->s_dev, cnt);
+		remove_dquot_ref(sb, cnt);
+		invalidate_dquots(sb, cnt);
 
 		/* Wait for any pending IO - remove me as soon as invalidate is more polite */
 		down(&dqopt->dqio_sem);
@@ -1604,6 +1600,8 @@
 	if (sb && sb_has_quota_enabled(sb, type))
 		ret = set_dqblk(sb, id, type, flags, (struct dqblk *) addr);
 out:
+	if (sb)
+		put_super(sb);
 	unlock_kernel();
 	return ret;
 }
diff -urN rc13-pre6/fs/ext2/super.c rc13-pre6-s_lock/fs/ext2/super.c
--- rc13-pre6/fs/ext2/super.c	Sat Dec 30 07:37:46 2000
+++ rc13-pre6-s_lock/fs/ext2/super.c	Sat Dec 30 09:17:29 2000
@@ -75,9 +75,6 @@
 	va_start (args, fmt);
 	vsprintf (error_buf, fmt, args);
 	va_end (args);
-	/* this is to prevent panic from syncing this filesystem */
-	if (sb->s_lock)
-		sb->s_lock=0;
 	sb->s_flags |= MS_RDONLY;
 	panic ("EXT2-fs panic (device %s): %s: %s\n",
 	       bdevname(sb->s_dev), function, error_buf);
diff -urN rc13-pre6/fs/inode.c rc13-pre6-s_lock/fs/inode.c
--- rc13-pre6/fs/inode.c	Sat Dec 30 07:37:47 2000
+++ rc13-pre6-s_lock/fs/inode.c	Sat Dec 30 09:17:29 2000
@@ -1002,14 +1002,13 @@
 void put_dquot_list(struct list_head *);
 int remove_inode_dquot_ref(struct inode *, short, struct list_head *);
 
-void remove_dquot_ref(kdev_t dev, short type)
+void remove_dquot_ref(struct super_block *sb, short type)
 {
-	struct super_block *sb = get_super(dev);
 	struct inode *inode;
 	struct list_head *act_head;
 	LIST_HEAD(tofree_head);
 
-	if (!sb || !sb->dq_op)
+	if (!sb->dq_op)
 		return;	/* nothing to do */
 
 	/* We have to be protected against other CPUs */
diff -urN rc13-pre6/fs/jffs/inode-v23.c rc13-pre6-s_lock/fs/jffs/inode-v23.c
--- rc13-pre6/fs/jffs/inode-v23.c	Sat Dec 30 07:37:48 2000
+++ rc13-pre6-s_lock/fs/jffs/inode-v23.c	Sat Dec 30 09:17:29 2000
@@ -159,7 +159,6 @@
 
 	D1(printk (KERN_NOTICE "jffs_put_super(): Successfully waited on thread.\n"));
 
-	sb->s_dev = 0;
 	jffs_cleanup_control((struct jffs_control *)sb->u.generic_sbp);
 	D1(printk(KERN_NOTICE "JFFS: Successfully unmounted device %s.\n",
 	       kdevname(dev)));
diff -urN rc13-pre6/fs/super.c rc13-pre6-s_lock/fs/super.c
--- rc13-pre6/fs/super.c	Tue Dec 12 02:10:46 2000
+++ rc13-pre6-s_lock/fs/super.c	Sat Dec 30 12:03:30 2000
@@ -41,14 +41,6 @@
 #define __NO_VERSION__
 #include <linux/module.h>
 
-/*
- * We use a semaphore to synchronize all mount/umount
- * activity - imagine the mess if we have a race between
- * unmounting a filesystem and re-mounting it (or something
- * else).
- */
-static DECLARE_MUTEX(mount_sem);
-
 extern void wait_for_keypress(void);
 
 extern int root_mountflags;
@@ -346,6 +338,8 @@
 		INIT_LIST_HEAD(&mnt->mnt_clash);
 	}
 	INIT_LIST_HEAD(&mnt->mnt_mounts);
+	if (list_empty(&sb->s_mounts))
+		atomic_inc(&sb->s_count);
 	list_add(&mnt->mnt_instances, &sb->s_mounts);
 	list_add(&mnt->mnt_list, vfsmntlist.prev);
 	spin_unlock(&dcache_lock);
@@ -411,6 +405,8 @@
 static void remove_vfsmnt(struct vfsmount *mnt)
 {
 	/* First of all, remove it from all lists */
+	if (mnt->mnt_instances.next != mnt->mnt_instances.prev)
+		atomic_inc(&mnt->mnt_sb->s_count);
 	list_del(&mnt->mnt_instances);
 	list_del(&mnt->mnt_clash);
 	list_del(&mnt->mnt_list);
@@ -579,27 +575,11 @@
 #undef FREEROOM
 }
 
-/**
- *	__wait_on_super	- wait on a superblock
- *	@sb: superblock to wait on
- *
- *	Waits for a superblock to become unlocked and then returns. It does
- *	not take the lock. This is an internal function. See wait_on_super().
- */
- 
-void __wait_on_super(struct super_block * sb)
+void put_super(struct super_block *sb)
 {
-	DECLARE_WAITQUEUE(wait, current);
-
-	add_wait_queue(&sb->s_wait, &wait);
-repeat:
-	set_current_state(TASK_UNINTERRUPTIBLE);
-	if (sb->s_lock) {
-		schedule();
-		goto repeat;
-	}
-	remove_wait_queue(&sb->s_wait, &wait);
-	current->state = TASK_RUNNING;
+	up(&sb->s_umount);
+	if (atomic_dec_and_test(&sb->s_count))
+		kfree(sb);
 }
 
 /*
@@ -611,20 +591,23 @@
 {
 	struct super_block * sb;
 
+restart:
 	for (sb = sb_entry(super_blocks.next);
 	     sb != sb_entry(&super_blocks); 
 	     sb = sb_entry(sb->s_list.next)) {
-		if (!sb->s_dev)
-			continue;
 		if (dev && sb->s_dev != dev)
 			continue;
 		if (!sb->s_dirt)
 			continue;
+		atomic_inc(&sb->s_count);
+		down(&sb->s_umount);
 		lock_super(sb);
-		if (sb->s_dev && sb->s_dirt && (!dev || dev == sb->s_dev))
+		if (sb->s_root && sb->s_dirt)
 			if (sb->s_op && sb->s_op->write_super)
 				sb->s_op->write_super(sb);
 		unlock_super(sb);
+		put_super(sb);
+		goto restart;
 	}
 }
 
@@ -646,9 +629,11 @@
 	s = sb_entry(super_blocks.next);
 	while (s != sb_entry(&super_blocks))
 		if (s->s_dev == dev) {
-			wait_on_super(s);
-			if (s->s_dev == dev)
+			atomic_inc(&s->s_count);
+			down(&s->s_umount);
+			if (s->s_root)
 				return s;
+			put_super(s);
 			goto restart;
 		} else
 			s = sb_entry(s->s_list.next);
@@ -668,6 +653,7 @@
         if (s == NULL)
                 goto out;
 	err = vfs_statfs(s, &sbuf);
+	put_super(s);
 	if (err)
 		goto out;
 
@@ -691,42 +677,61 @@
  
 struct super_block *get_empty_super(void)
 {
-	struct super_block *s;
-
-	for (s  = sb_entry(super_blocks.next);
-	     s != sb_entry(&super_blocks); 
-	     s  = sb_entry(s->s_list.next)) {
-		if (s->s_dev)
-			continue;
-		if (!s->s_lock)
-			return s;
-		printk("VFS: empty superblock %p locked!\n", s);
-	}
-	/* Need a new one... */
-	if (nr_super_blocks >= max_super_blocks)
-		return NULL;
-	s = kmalloc(sizeof(struct super_block),  GFP_USER);
+	struct super_block *s = kmalloc(sizeof(struct super_block),  GFP_USER);
 	if (s) {
 		nr_super_blocks++;
 		memset(s, 0, sizeof(struct super_block));
 		INIT_LIST_HEAD(&s->s_dirty);
-		list_add (&s->s_list, super_blocks.prev);
-		init_waitqueue_head(&s->s_wait);
 		INIT_LIST_HEAD(&s->s_files);
 		INIT_LIST_HEAD(&s->s_mounts);
+		sema_init(&s->s_umount, 0);
+		sema_init(&s->s_lock, 1);
+		atomic_set(&s->s_count, 1);
+		list_add (&s->s_list, super_blocks.prev);
 	}
 	return s;
 }
 
+/*
+ * Unnamed block devices are dummy devices used by virtual
+ * filesystems which don't use real block-devices.  -- jrs
+ */
+
+static unsigned int unnamed_dev_in_use[256/(8*sizeof(unsigned int))];
+
+static kdev_t get_unnamed_dev(void)
+{
+	int i;
+
+	for (i = 1; i < 256; i++) {
+		if (!test_and_set_bit(i,unnamed_dev_in_use))
+			return MKDEV(UNNAMED_MAJOR, i);
+	}
+	return 0;
+}
+
+static void put_unnamed_dev(kdev_t dev)
+{
+	if (!dev || MAJOR(dev) != UNNAMED_MAJOR)
+		return;
+	if (test_and_clear_bit(MINOR(dev), unnamed_dev_in_use))
+		return;
+	printk("VFS: put_unnamed_dev: freeing unused device %s\n",
+			kdevname(dev));
+}
+
 static struct super_block * read_super(kdev_t dev, struct block_device *bdev,
 				       struct file_system_type *type, int flags,
 				       void *data, int silent)
 {
 	struct super_block * s;
+	kdev_t rdev = dev ? dev : get_unnamed_dev();
+	if (!rdev)
+		goto fail;
 	s = get_empty_super();
 	if (!s)
-		goto out;
-	s->s_dev = dev;
+		goto fail;
+	s->s_dev = rdev;
 	s->s_bdev = bdev;
 	s->s_flags = flags;
 	s->s_dirt = 0;
@@ -743,48 +748,24 @@
 	/* tell bdcache that we are going to keep this one */
 	if (bdev)
 		atomic_inc(&bdev->bd_count);
-out:
 	return s;
 
 out_fail:
 	s->s_dev = 0;
-	s->s_bdev = 0;
+	if (!dev)
+		put_unnamed_dev(rdev);
+	s->s_bdev = NULL;
 	s->s_type = NULL;
 	unlock_super(s);
+	put_super(s);
+fail:
 	return NULL;
 }
 
-/*
- * Unnamed block devices are dummy devices used by virtual
- * filesystems which don't use real block-devices.  -- jrs
- */
-
-static unsigned int unnamed_dev_in_use[256/(8*sizeof(unsigned int))];
-
-kdev_t get_unnamed_dev(void)
-{
-	int i;
-
-	for (i = 1; i < 256; i++) {
-		if (!test_and_set_bit(i,unnamed_dev_in_use))
-			return MKDEV(UNNAMED_MAJOR, i);
-	}
-	return 0;
-}
-
-void put_unnamed_dev(kdev_t dev)
-{
-	if (!dev || MAJOR(dev) != UNNAMED_MAJOR)
-		return;
-	if (test_and_clear_bit(MINOR(dev), unnamed_dev_in_use))
-		return;
-	printk("VFS: put_unnamed_dev: freeing unused device %s\n",
-			kdevname(dev));
-}
-
 static struct super_block *get_sb_bdev(struct file_system_type *fs_type,
 	char *dev_name, int flags, void * data)
 {
+	DECLARE_MUTEX(mount_sem);
 	struct inode *inode;
 	struct block_device *bdev;
 	struct block_device_operations *bdops;
@@ -809,16 +790,18 @@
 	bdev = inode->i_bdev;
 	bdops = devfs_get_ops ( devfs_get_handle_from_inode (inode) );
 	if (bdops) bdev->bd_op = bdops;
+	dev = to_kdev_t(bdev->bd_dev);
 	/* Done with lookups, semaphore down */
 	down(&mount_sem);
-	dev = to_kdev_t(bdev->bd_dev);
 	sb = get_super(dev);
 	if (sb) {
 		if (fs_type == sb->s_type &&
 		    ((flags ^ sb->s_flags) & MS_RDONLY) == 0) {
+			up(&mount_sem);
 			path_release(&nd);
 			return sb;
 		}
+		put_super(sb);
 	} else {
 		mode_t mode = FMODE_READ; /* we always need it ;-) */
 		if (!(flags & MS_RDONLY))
@@ -833,6 +816,7 @@
 		error = -EINVAL;
 		sb = read_super(dev, bdev, fs_type, flags, data, 0);
 		if (sb) {
+			up(&mount_sem);
 			get_filesystem(fs_type);
 			path_release(&nd);
 			return sb;
@@ -841,29 +825,21 @@
 		blkdev_put(bdev, BDEV_FS);
 	}
 out:
-	path_release(&nd);
 	up(&mount_sem);
+	path_release(&nd);
 	return ERR_PTR(error);
 }
 
 static struct super_block *get_sb_nodev(struct file_system_type *fs_type,
 	int flags, void * data)
 {
-	kdev_t dev;
-	int error = -EMFILE;
-	down(&mount_sem);
-	dev = get_unnamed_dev();
-	if (dev) {
-		struct super_block * sb;
-		error = -EINVAL;
-		sb = read_super(dev, NULL, fs_type, flags, data, 0);
-		if (sb) {
-			get_filesystem(fs_type);
-			return sb;
-		}
-		put_unnamed_dev(dev);
+	struct super_block * sb;
+	int error = -EINVAL;
+	sb = read_super(0, NULL, fs_type, flags, data, 0);
+	if (sb) {
+		get_filesystem(fs_type);
+		return sb;
 	}
-	up(&mount_sem);
 	return ERR_PTR(error);
 }
 
@@ -875,10 +851,11 @@
 	 * Get the superblock of kernel-wide instance, but
 	 * keep the reference to fs_type.
 	 */
-	down(&mount_sem);
 	sb = fs_type->kern_mnt->mnt_sb;
 	if (!sb)
 		BUG();
+	atomic_inc(&sb->s_count);
+	down(&sb->s_umount);
 	get_filesystem(fs_type);
 	do_remount_sb(sb, flags, data);
 	return sb;
@@ -913,6 +890,7 @@
 			"Self-destruct in 5 seconds.  Have a nice day...\n");
 	}
 
+	list_del(&sb->s_list);
 	dev = sb->s_dev;
 	sb->s_dev = 0;		/* Free the superblock */
 	bdev = sb->s_bdev;
@@ -969,21 +947,18 @@
 
 struct vfsmount *kern_mount(struct file_system_type *type)
 {
-	kdev_t dev = get_unnamed_dev();
 	struct super_block *sb;
 	struct vfsmount *mnt;
-	if (!dev)
-		return ERR_PTR(-EMFILE);
-	sb = read_super(dev, NULL, type, 0, NULL, 0);
-	if (!sb) {
-		put_unnamed_dev(dev);
+	sb = read_super(0, NULL, type, 0, NULL, 0);
+	if (!sb)
 		return ERR_PTR(-EINVAL);
-	}
 	mnt = add_vfsmnt(NULL, sb->s_root, NULL);
 	if (!mnt) {
 		kill_super(sb, 0);
+		put_super(sb);
 		return ERR_PTR(-ENOMEM);
 	}
+	put_super(sb);
 	type->kern_mnt = mnt;
 	return mnt;
 }
@@ -993,9 +968,11 @@
 void kern_umount(struct vfsmount *mnt)
 {
 	struct super_block *sb = mnt->mnt_sb;
+	down(&sb->s_umount);
 	spin_lock(&dcache_lock);
 	remove_vfsmnt(mnt);
 	kill_super(sb, 0);
+	put_super(sb);
 }
 
 /*
@@ -1014,6 +991,8 @@
 {
 	struct super_block * sb = mnt->mnt_sb;
 
+	down(&sb->s_umount);
+
 	/*
 	 * No sense to grab the lock for this test, but test itself looks
 	 * somewhat bogus. Suggestions for better replacement?
@@ -1033,6 +1012,7 @@
 		mntput(mnt);
 		if (!(sb->s_flags & MS_RDONLY))
 			retval = do_remount_sb(sb, MS_RDONLY, 0);
+		up(&sb->s_umount);
 		return retval;
 	}
 
@@ -1041,14 +1021,14 @@
 	if (mnt->mnt_instances.next != mnt->mnt_instances.prev) {
 		if (atomic_read(&mnt->mnt_count) > 2) {
 			spin_unlock(&dcache_lock);
-			mntput(mnt);
-			return -EBUSY;
+			goto busy;
 		}
 		if (sb->s_type->fs_flags & FS_SINGLE)
 			put_filesystem(sb->s_type);
 		/* We hold two references, so mntput() is safe */
 		mntput(mnt);
 		remove_vfsmnt(mnt);
+		put_super(sb);
 		return 0;
 	}
 	spin_unlock(&dcache_lock);
@@ -1084,18 +1064,15 @@
 	shrink_dcache_sb(sb);
 	fsync_dev(sb->s_dev);
 
-	if (sb->s_root->d_inode->i_state) {
-		mntput(mnt);
-		return -EBUSY;
-	}
+	if (sb->s_root->d_inode->i_state)
+		goto busy;
 
 	/* Something might grab it again - redo checks */
 
 	spin_lock(&dcache_lock);
 	if (atomic_read(&mnt->mnt_count) > 2) {
 		spin_unlock(&dcache_lock);
-		mntput(mnt);
-		return -EBUSY;
+		goto busy;
 	}
 
 	/* OK, that's the point of no return */
@@ -1103,7 +1080,12 @@
 	remove_vfsmnt(mnt);
 
 	kill_super(sb, umount_root);
+	put_super(sb);
 	return 0;
+busy:
+	mntput(mnt);
+	up(&sb->s_umount);
+	return -EBUSY;
 }
 
 /*
@@ -1141,9 +1123,7 @@
 
 	dput(nd.dentry);
 	/* puts nd.mnt */
-	down(&mount_sem);
 	retval = do_umount(nd.mnt, 0, flags);
-	up(&mount_sem);
 	goto out;
 dput_and_out:
 	path_release(&nd);
@@ -1208,7 +1188,6 @@
 	if (old_nd.mnt->mnt_sb->s_type->fs_flags & FS_SINGLE)
 		get_filesystem(old_nd.mnt->mnt_sb->s_type);
 		
-	down(&mount_sem);
 	/* there we go */
 	down(&new_nd.dentry->d_inode->i_zombie);
 	if (IS_DEADDIR(new_nd.dentry->d_inode))
@@ -1216,7 +1195,6 @@
 	else if (add_vfsmnt(&new_nd, old_nd.dentry, old_nd.mnt->mnt_devname))
 		err = 0;
 	up(&new_nd.dentry->d_inode->i_zombie);
-	up(&mount_sem);
 	if (err && old_nd.mnt->mnt_sb->s_type->fs_flags & FS_SINGLE)
 		put_filesystem(old_nd.mnt->mnt_sb->s_type);
 out2:
@@ -1346,12 +1324,6 @@
 	if (!type_page || !memchr(type_page, 0, PAGE_SIZE))
 		return -EINVAL;
 
-#if 0	/* Can be deleted again. Introduced in patch-2.3.99-pre6 */
-	/* loopback mount? This is special - requires fewer capabilities */
-	if (strcmp(type_page, "bind")==0)
-		return do_loopback(dev_name, dir_name);
-#endif
-
 	/* for the rest we _really_ need capabilities... */
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
@@ -1368,7 +1340,7 @@
 	if (retval)
 		goto fs_out;
 
-	/* get superblock, locks mount_sem on success */
+	/* get superblock, locks ->s_umount on success */
 	if (fstype->fs_flags & FS_NOMOUNT)
 		sb = ERR_PTR(-EINVAL);
 	else if (fstype->fs_flags & FS_REQUIRES_DEV)
@@ -1405,7 +1377,7 @@
 		goto fail;
 	retval = 0;
 unlock_out:
-	up(&mount_sem);
+	put_super(sb);
 dput_out:
 	path_release(&nd);
 fs_out:
@@ -1478,26 +1450,19 @@
 	fs_type = get_fs_type("nfs");
 	if (!fs_type)
 		goto no_nfs;
-	ROOT_DEV = get_unnamed_dev();
-	if (!ROOT_DEV)
-		/*
-		 * Your /linuxrc sucks worse than MSExchange - that's the
-		 * only way you could run out of anon devices at that point.
-		 */
-		goto no_anon;
 	data = nfs_root_data();
 	if (!data)
-		goto no_server;
-	sb = read_super(ROOT_DEV, NULL, fs_type, root_mountflags, data, 1);
-	if (sb)
+		goto no_anon;
+	sb = read_super(0, NULL, fs_type, root_mountflags, data, 1);
+	if (sb) {
 		/*
 		 * We _can_ fail there, but if that will happen we have no
 		 * chance anyway (no memory for vfsmnt and we _will_ need it,
 		 * no matter which fs we try to mount).
 		 */
+		ROOT_DEV = sb->s_dev;
 		goto mount_it;
-no_server:
-	put_unnamed_dev(ROOT_DEV);
+	}
 no_anon:
 	put_filesystem(fs_type);
 no_nfs:
@@ -1607,12 +1572,12 @@
 	}
 	else
 		vfsmnt = add_vfsmnt(NULL, sb->s_root, "/dev/root");
-	/* FIXME: if something will try to umount us right now... */
 	if (vfsmnt) {
 		set_fs_root(current->fs, vfsmnt, sb->s_root);
 		set_fs_pwd(current->fs, vfsmnt, sb->s_root);
 		if (bdev)
 			bdput(bdev); /* sb holds a reference */
+		put_super(sb);
 		return;
 	}
 	panic("VFS: add_vfsmnt failed for root fs");
@@ -1697,7 +1662,6 @@
 	root_mnt = mntget(current->fs->rootmnt);
 	root = dget(current->fs->root);
 	read_unlock(&current->fs->lock);
-	down(&mount_sem);
 	down(&old_nd.dentry->d_inode->i_zombie);
 	error = -ENOENT;
 	if (IS_DEADDIR(new_nd.dentry->d_inode))
@@ -1732,7 +1696,6 @@
 	error = 0;
 out2:
 	up(&old_nd.dentry->d_inode->i_zombie);
-	up(&mount_sem);
 	dput(root);
 	mntput(root_mnt);
 	path_release(&old_nd);
@@ -1765,10 +1728,8 @@
 		if (devfs_nd.mnt->mnt_sb->s_magic == DEVFS_SUPER_MAGIC &&
 		    devfs_nd.dentry == devfs_nd.mnt->mnt_root) {
 			dput(devfs_nd.dentry);
-			down(&mount_sem);
 			/* puts devfs_nd.mnt */
 			do_umount(devfs_nd.mnt, 0, 0);
-			up(&mount_sem);
 		} else 
 			path_release(&devfs_nd);
 	}
diff -urN rc13-pre6/fs/udf/inode.c rc13-pre6-s_lock/fs/udf/inode.c
--- rc13-pre6/fs/udf/inode.c	Tue Dec 12 02:10:46 2000
+++ rc13-pre6-s_lock/fs/udf/inode.c	Sat Dec 30 09:17:29 2000
@@ -203,7 +203,6 @@
 	udf_release_data(bh);
 
 	inode->i_data.a_ops->writepage(page);
-	UnlockPage(page);
 	page_cache_release(page);
 
 	mark_inode_dirty(inode);
diff -urN rc13-pre6/fs/ufs/super.c rc13-pre6-s_lock/fs/ufs/super.c
--- rc13-pre6/fs/ufs/super.c	Sun Sep 17 12:15:20 2000
+++ rc13-pre6-s_lock/fs/ufs/super.c	Sat Dec 30 09:17:29 2000
@@ -230,9 +230,6 @@
 	va_start (args, fmt);
 	vsprintf (error_buf, fmt, args);
 	va_end (args);
-	/* this is to prevent panic from syncing this filesystem */
-	if (sb->s_lock)
-		sb->s_lock = 0;
 	sb->s_flags |= MS_RDONLY;
 	printk (KERN_CRIT "UFS-fs panic (device %s): %s: %s\n",
 		kdevname(sb->s_dev), function, error_buf);
diff -urN rc13-pre6/include/linux/fs.h rc13-pre6-s_lock/include/linux/fs.h
--- rc13-pre6/include/linux/fs.h	Sat Dec 30 07:37:54 2000
+++ rc13-pre6-s_lock/include/linux/fs.h	Sat Dec 30 09:44:34 2000
@@ -667,7 +667,6 @@
 	kdev_t			s_dev;
 	unsigned long		s_blocksize;
 	unsigned char		s_blocksize_bits;
-	unsigned char		s_lock;
 	unsigned char		s_dirt;
 	struct file_system_type	*s_type;
 	struct super_operations	*s_op;
@@ -675,7 +674,9 @@
 	unsigned long		s_flags;
 	unsigned long		s_magic;
 	struct dentry		*s_root;
-	wait_queue_head_t	s_wait;
+	struct semaphore	s_umount;
+	struct semaphore	s_lock;		/* probably goner */
+	atomic_t		s_count;
 
 	struct list_head	s_dirty;	/* dirty inodes */
 	struct list_head	s_files;
@@ -1070,6 +1071,7 @@
 extern void write_inode_now(struct inode *, int);
 extern void sync_dev(kdev_t);
 extern int fsync_dev(kdev_t);
+extern void invalidate_dev(kdev_t, int);
 extern int fsync_inode_buffers(struct inode *);
 extern int osync_inode_buffers(struct inode *);
 extern int inode_has_buffers(struct inode *);
@@ -1265,7 +1267,16 @@
 extern struct file_system_type *get_fs_type(const char *name);
 extern struct super_block *get_super(kdev_t);
 struct super_block *get_empty_super(void);
-extern void put_super(kdev_t);
+extern void put_super(struct super_block *sb);
+static inline int is_mounted(kdev_t dev)
+{
+	struct super_block *sb = get_super(dev);
+	if (sb) {
+		put_super(sb);
+		return 1;
+	}
+	return 0;
+}
 unsigned long generate_cluster(kdev_t, int b[], int);
 unsigned long generate_cluster_swab32(kdev_t, int b[], int);
 extern kdev_t ROOT_DEV;
diff -urN rc13-pre6/include/linux/locks.h rc13-pre6-s_lock/include/linux/locks.h
--- rc13-pre6/include/linux/locks.h	Sat Dec 16 12:30:30 2000
+++ rc13-pre6-s_lock/include/linux/locks.h	Sat Dec 30 09:44:35 2000
@@ -39,30 +39,15 @@
  * a super-block (although even this isn't done right now.
  * nfs may need it).
  */
-extern void __wait_on_super(struct super_block *);
-
-extern inline void wait_on_super(struct super_block * sb)
-{
-	if (sb->s_lock)
-		__wait_on_super(sb);
-}
 
 extern inline void lock_super(struct super_block * sb)
 {
-	if (sb->s_lock)
-		__wait_on_super(sb);
-	sb->s_lock = 1;
+	down(&sb->s_lock);
 }
 
 extern inline void unlock_super(struct super_block * sb)
 {
-	sb->s_lock = 0;
-	/*
-	 * No need of any barrier, we're protected by
-	 * the big kernel lock here... unfortunately :)
-	 */
-	if (waitqueue_active(&sb->s_wait))
-		wake_up(&sb->s_wait);
+	up(&sb->s_lock);
 }
 
 #endif /* _LINUX_LOCKS_H */
diff -urN rc13-pre6/include/linux/quotaops.h rc13-pre6-s_lock/include/linux/quotaops.h
--- rc13-pre6/include/linux/quotaops.h	Sat Dec 16 12:52:02 2000
+++ rc13-pre6-s_lock/include/linux/quotaops.h	Sat Dec 30 09:58:07 2000
@@ -21,7 +21,6 @@
  */
 extern void dquot_initialize(struct inode *inode, short type);
 extern void dquot_drop(struct inode *inode);
-extern void invalidate_dquots(kdev_t dev, short type);
 extern int  quota_off(struct super_block *sb, short type);
 extern int  sync_dquots(kdev_t dev, short type);
 
diff -urN rc13-pre6/kernel/ksyms.c rc13-pre6-s_lock/kernel/ksyms.c
--- rc13-pre6/kernel/ksyms.c	Sat Dec 30 07:37:55 2000
+++ rc13-pre6-s_lock/kernel/ksyms.c	Sat Dec 30 09:17:30 2000
@@ -127,6 +127,8 @@
 EXPORT_SYMBOL(update_atime);
 EXPORT_SYMBOL(get_fs_type);
 EXPORT_SYMBOL(get_super);
+EXPORT_SYMBOL(put_super);
+EXPORT_SYMBOL(invalidate_dev);
 EXPORT_SYMBOL(get_empty_super);
 EXPORT_SYMBOL(getname);
 EXPORT_SYMBOL(names_cachep);
@@ -472,7 +474,6 @@
 
 /* Added to make file system as module */
 EXPORT_SYMBOL(sys_tz);
-EXPORT_SYMBOL(__wait_on_super);
 EXPORT_SYMBOL(file_fsync);
 EXPORT_SYMBOL(fsync_inode_buffers);
 EXPORT_SYMBOL(clear_inode);


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [RFC] Generic deferred file writing
  2000-12-30  3:16   ` test13-pre6 Linus Torvalds
@ 2000-12-30 18:58     ` Daniel Phillips
  2000-12-30 20:05       ` Linus Torvalds
  2000-12-30 20:06       ` Alexander Viro
  0 siblings, 2 replies; 56+ messages in thread
From: Daniel Phillips @ 2000-12-30 18:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

When I saw you put in the if (PageDirty) -->writepage and related code
over the last couple of weeks I was wondering if you realize how close
we are to having generic deferred file writing in the VFS.  I took some
time today to code this little hack and it comes awfully close to doing
the job.  However, *** Warning, do not run this on a machine you care
about, it will mess it up ***.

The advantages of deferred file writing are pretty obvious.  Right now
we are deferring just the writeout of data to the disk, but we can also
defer the disk mapping, so that metadata blocks don't have to stay
around in cache waiting for data blocks to get mapped into them one at
a time - a whole group can be done in one flush.  There is more scope
for the filesystem to optimize allocation - we just need some kind of
->get_blocks call that maps n file blocks in one operation.  Sometimes
we'll be able to write files, read them back and erase them without
ever hitting the disk, or even the filesystem.  Sequences of short
writes will cost a lot less cpu and hit less cache.  There are probably
other advantages as well.

For this to work properly there has to be an analog of kflushd for
pages.  We don't have that yet, and it's some distance away - so I'm
not even beginning to suggest this is a 2.4.0 thing.

In the patch below I didn't try to write the error handling exactly
right and the indenting is wrong.  It's just meant to show the idea
clearly.  Instead of mapping each file page to storage as we normally do
in generic_file_write we just mark it dirty and let page_launder or
Marcelo's new flushing code call the filesystem later.

This code worked well enough to copy some files and directories in uml,
and still have them there after a reboot.  Beyond that - patching it
into test13-pre5 on a real machine resulted in a toasted passwd file,
so there's some work to do yet.

Here it is...

--- 2.4.0-test12.clean/mm/filemap.c	Sat Dec  9 09:13:23 2000
+++ 2.4.0-test12/mm/filemap.c		Sat Dec 30 19:56:24 2000
@@ -2449,6 +2449,16 @@
 			PAGE_BUG(page);
 		}
 
+	if (!offset && bytes == PAGE_SIZE)
+	{
+		kaddr = page_address(page);
+		status = copy_from_user(kaddr+offset, buf, bytes);
+		flush_dcache_page(page);
+		SetPageUptodate(page);
+		SetPageDirty(page); /* set_page_dirty in test13 */
+	}
+	else
+	{
 		status = mapping->a_ops->prepare_write(file, page, offset, offset+bytes);
 		if (status)
 			goto unlock;
@@ -2458,6 +2468,7 @@
 		if (status)
 			goto fail_write;
 		status = mapping->a_ops->commit_write(file, page, offset, offset+bytes);
+	}
 		if (!status)
 			status = bytes;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-30 18:58     ` [RFC] Generic deferred file writing Daniel Phillips
@ 2000-12-30 20:05       ` Linus Torvalds
  2000-12-30 20:06       ` Alexander Viro
  1 sibling, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2000-12-30 20:05 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel



On Sat, 30 Dec 2000, Daniel Phillips wrote:
>
> When I saw you put in the if (PageDirty) -->writepage and related code
> over the last couple of weeks I was wondering if you realize how close
> we are to having generic deferred file writing in the VFS.

I'm very aware of it indeed. 

However, it does break various common assumptions, one of them being
proper error handling. Things like proper detection of quota overflows and
even simple "out of disk space" issues. 

One of the main advantages of deferred writing would be that we could do
temp-files without ever actually doing most of the low-level filesystem
block allocation, but in order to get that advantage we really need to
handle the out-of-disk case gracefully.

I considered doing something like this as a mount option, so that people
could decide on their own whether they want a safe filesystem, or whether
it's ok to do deferred writes. People might find it worth it for /tmp, but
might be unwilling to use it for /var/spool/mail, for example.

(Hmm.. It might be perfectly fine for /vsr/spool/mail - mail delivery
tends to be really careful about doing "fsync()" etc and actually pick up
the errors that way. HOWEVER, before doing that you should expand the
writepage logic to set the page "error" bit for when it fails to write out
a full page - right now we just lose the error completely).

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-30 18:58     ` [RFC] Generic deferred file writing Daniel Phillips
  2000-12-30 20:05       ` Linus Torvalds
@ 2000-12-30 20:06       ` Alexander Viro
  2000-12-30 20:21         ` Linus Torvalds
  1 sibling, 1 reply; 56+ messages in thread
From: Alexander Viro @ 2000-12-30 20:06 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Linus Torvalds, linux-kernel



On Sat, 30 Dec 2000, Daniel Phillips wrote:

> When I saw you put in the if (PageDirty) -->writepage and related code
> over the last couple of weeks I was wondering if you realize how close
> we are to having generic deferred file writing in the VFS.  I took some
> time today to code this little hack and it comes awfully close to doing
> the job.  However, *** Warning, do not run this on a machine you care
> about, it will mess it up ***.
> 
> The advantages of deferred file writing are pretty obvious.  Right now
> we are deferring just the writeout of data to the disk, but we can also
> defer the disk mapping, so that metadata blocks don't have to stay
> around in cache waiting for data blocks to get mapped into them one at
> a time - a whole group can be done in one flush.

Except that we've got file-expanding writes outside of ->i_sem. Thanks, but
no thanks.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-30 20:06       ` Alexander Viro
@ 2000-12-30 20:21         ` Linus Torvalds
  2000-12-30 21:10           ` Andreas Dilger
                             ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Linus Torvalds @ 2000-12-30 20:21 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Daniel Phillips, linux-kernel



On Sat, 30 Dec 2000, Alexander Viro wrote:
> 
> Except that we've got file-expanding writes outside of ->i_sem. Thanks, but
> no thanks.

No, Al, the file size is still updated inside i_sem.

Yes, it will do actual block allocation outside i_sem, but that is already
true of any mmap'ed writes, and has been true for a long long time. So if
we have a bug here (and I don't think we have one), it's not something
new. But the inode semaphore doesn't protect the balloc() data structures
anyway, as they are filesystem-global.

If you're nervous about the effects of "truncate()", then that should be
handled properly by truncate_inode_pages().

In short, I don't see _those_ kinds of issues. I do see error reporting as
a major issue, though. If we need to do proper low-level block allocation
in order to get correct ENOSPC handling, then the win from doing deferred
writes is not very big.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-30 20:21         ` Linus Torvalds
@ 2000-12-30 21:10           ` Andreas Dilger
  2000-12-30 21:46           ` Alexander Viro
  2000-12-30 22:00           ` Eric W. Biederman
  2 siblings, 0 replies; 56+ messages in thread
From: Andreas Dilger @ 2000-12-30 21:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alexander Viro, Daniel Phillips, linux-kernel

Linus writes:
> In short, I don't see _those_ kinds of issues. I do see error reporting as
> a major issue, though. If we need to do proper low-level block allocation
> in order to get correct ENOSPC handling, then the win from doing deferred
> writes is not very big.

It should be relatively light-weight to call into the filesystem simply
to allocate a "count" of blocks needed for the current file.  It may
even be possible to do delayed inode allocation.  This would defer
the inode/block bitmap searching/allocation on ext2 until the file
was actually written - only the free_blocks/free_inodes count in the
superblock would be decremented, and we would get ENOSPC immediately
if we don't have enough space for the file.  On fsck, these values are
recalculated from the group descriptors on ext2, so it wouldn't be a
problem if the system crashed with pre-allocated blocks.

It would definitely be a win on ext2 and XFS, and if it isn't possible
on other filesystems, it should at least not be a loss.

We would need to ensure we also keep enough space for indirect blocks
and such, so we need to pass more information than just the number of
blocks added (i.e. how big the file already is).

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-30 20:21         ` Linus Torvalds
  2000-12-30 21:10           ` Andreas Dilger
@ 2000-12-30 21:46           ` Alexander Viro
  2000-12-30 23:12             ` Daniel Phillips
  2000-12-30 22:00           ` Eric W. Biederman
  2 siblings, 1 reply; 56+ messages in thread
From: Alexander Viro @ 2000-12-30 21:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Phillips, linux-kernel



On Sat, 30 Dec 2000, Linus Torvalds wrote:

> 
> 
> On Sat, 30 Dec 2000, Alexander Viro wrote:
> > 
> > Except that we've got file-expanding writes outside of ->i_sem. Thanks, but
> > no thanks.
> 
> No, Al, the file size is still updated inside i_sem.

Then we are screwed. Look: we call write(). Twice. The second call happens
to overflow the quota. Getting the second chunk of data written and the
first one ending up as a hole is the last thing you would expect, isn't it?

> In short, I don't see _those_ kinds of issues. I do see error reporting as
> a major issue, though. If we need to do proper low-level block allocation
> in order to get correct ENOSPC handling, then the win from doing deferred
> writes is not very big.

Well, see above. I'm pretty nervous about breaking the ordering of metadata
allocation. For pageout() we don't have such ordering. For write() we
certainly do. Notice that reserving disk space upon write() and eating it
later is _very_ messy job - you'll have to take care of situations when
we reserve the space upon write() and get pageout do the real allocation.
Not nice, since pageout has no way in hell to tell whether it is eating
from a reserved area or just flushing the mmaped one. We could keep the
per-bh "reserved" flag to fold that information into the pagecache, but
IMO it's simply not worth the trouble. If some filesystems wants that -
hey, it can do that right now. Just make ->prepare_write() do reservations
and let ->commit_write() mark the page dirty. Then ->writepage() will
eventually flush it.

Again, if one is willing to implement reservation on block level - fine,
there is no need to change anything in VFS or VM. I certainly don't want
to mess with that, but hey, if somebody is into masochism - let them.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-30 20:21         ` Linus Torvalds
  2000-12-30 21:10           ` Andreas Dilger
  2000-12-30 21:46           ` Alexander Viro
@ 2000-12-30 22:00           ` Eric W. Biederman
  2000-12-30 22:44             ` Linus Torvalds
  2000-12-31  1:02             ` Andrea Arcangeli
  2 siblings, 2 replies; 56+ messages in thread
From: Eric W. Biederman @ 2000-12-30 22:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alexander Viro, Daniel Phillips, linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:


> In short, I don't see _those_ kinds of issues. I do see error reporting as
> a major issue, though. If we need to do proper low-level block allocation
> in order to get correct ENOSPC handling, then the win from doing deferred
> writes is not very big.

To get ENOSPC handling 99% correct all we need to do is decrement a counter,
that remembers how many disks blocks are free.  If we need a better
estimate than just the data blocks it should not be hard to add an
extra callback to the filesystem.  

There look to be some interesting cases to handle when we fill up a
filesystem.  Before actually failing and returning ENOSPC the
filesystem might want to fsync itself. And see how correct it's
estimates were.  But that is the rare case and shouldn't affect
performance.

<rant>
In the long term VFS support for deferred writes looks like a major
win.  Knowing how large a file is before we write it to disk allows
very efficient disk organization, and fast file access (esp combined
with an extent based fs).   Support for compressing files in real time
falls out naturally.  Support for filesystems maintain coherency by
never writing the same block back to the same disk location also
appears.
</rant>

One other thing to think about for the VFS/MM layer is limiting the
total number of dirty pages in the system (to what disk pressure shows
the disk can handle), to keep system performance smooth when swapping.
All cases except mmaped files are easy, and they can be handled by a
modified page fault handler that directly puts the dirty bit on the
struct page.  (Except that is buggy with respect to clearing the dirty
bit on the struct page.)  In reality we would have to create a queue
of pointers to dirty pte's from the page fault handler and depending
on a timer or memory pressure move the dirty bits to the actual page.

Combined with the code to make sync and fsync to work on the page
cache we msync would be obsolete?

Of course the most important part is that when all of that is
working, the VFS/MM layer it would be perfect.  World domination
would be achieved.  For who would be caught using an OS with an
imperfect VFS layer :)

Eric

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-30 22:00           ` Eric W. Biederman
@ 2000-12-30 22:44             ` Linus Torvalds
  2000-12-31  0:26               ` Eric W. Biederman
  2000-12-31  1:02             ` Andrea Arcangeli
  1 sibling, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2000-12-30 22:44 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Alexander Viro, Daniel Phillips, linux-kernel



On 30 Dec 2000, Eric W. Biederman wrote:
> 
> One other thing to think about for the VFS/MM layer is limiting the
> total number of dirty pages in the system (to what disk pressure shows
> the disk can handle), to keep system performance smooth when swapping.

This is a separate issue, and I think that it is most closely tied in to
the "RSS limit" kind of patches because of the memory mapping issues. If
you've seen the RSS rlimit patch (it's been posted a few times this week),
then you could think of that modified by a "Resident writable pages Set
Size" approach. Not just for shared mappings - this is also an issue with
limiting swapout.

(I actually don't think that RSS is all that interesting, it's really the
"potentially dirty RSS" that counts for VM behaviour - everything else can
be dropped easily enough)

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-30 21:46           ` Alexander Viro
@ 2000-12-30 23:12             ` Daniel Phillips
  0 siblings, 0 replies; 56+ messages in thread
From: Daniel Phillips @ 2000-12-30 23:12 UTC (permalink / raw)
  To: Alexander Viro, Linus Torvalds; +Cc: linux-kernel

On Sat, 30 Dec 2000, Alexander Viro wrote:
> Well, see above. I'm pretty nervous about breaking the ordering of metadata
> allocation. For pageout() we don't have such ordering. For write() we
> certainly do. Notice that reserving disk space upon write() and eating it
> later is _very_ messy job - you'll have to take care of situations when
> we reserve the space upon write() and get pageout do the real allocation.
> Not nice, since pageout has no way in hell to tell whether it is eating
> from a reserved area or just flushing the mmaped one. We could keep the
> per-bh "reserved" flag to fold that information into the pagecache, but
> IMO it's simply not worth the trouble. If some filesystems wants that -
> hey, it can do that right now. Just make ->prepare_write() do reservations
> and let ->commit_write() mark the page dirty. Then ->writepage() will
> eventually flush it.

This is a refinement of the idea and some abstraction like that is
clearly needed, and maybe that is exactly the right one.  For now I'm
interested in putting this on the table so that we can check the
stability and performance, maybe uncover come more bugs, then start
going after some of the things that need to be done to turn it into a
useful option.

P.S., I humbly apologize for writing (!offset && bytes == PAGE_SIZE)
when I could have just written (bytes == PAGE_SIZE).

-- 
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-30 22:44             ` Linus Torvalds
@ 2000-12-31  0:26               ` Eric W. Biederman
  0 siblings, 0 replies; 56+ messages in thread
From: Eric W. Biederman @ 2000-12-31  0:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alexander Viro, Daniel Phillips, linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:

> On 30 Dec 2000, Eric W. Biederman wrote:
> > 
> > One other thing to think about for the VFS/MM layer is limiting the
> > total number of dirty pages in the system (to what disk pressure shows
> > the disk can handle), to keep system performance smooth when swapping.
> 
> This is a separate issue,  and I think that it is most closely tied in to
> the "RSS limit" kind of patches because of the memory mapping issues. If
> you've seen the RSS rlimit patch (it's been posted a few times this week),
> then you could think of that modified by a "Resident writable pages Set
> Size" approach. 

Building on the RSS limit approach sounds much simpler then they way
I was thinking.

> Not just for shared mappings - this is also an issue with
> limiting swapout.
> 
> (I actually don't think that RSS is all that interesting, it's really the
> "potentially dirty RSS" that counts for VM behaviour - everything else can
> be dropped easily enough)

Definitely.

Now the only tricky bit is how do we sense when we are overloading
the swap disks.  Well that is the next step.  I'll take a look
and see what it takes to keep statistics on dirty mapped pages.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-30 22:00           ` Eric W. Biederman
  2000-12-30 22:44             ` Linus Torvalds
@ 2000-12-31  1:02             ` Andrea Arcangeli
  2000-12-31  1:13               ` Chris Wedgwood
                                 ` (2 more replies)
  1 sibling, 3 replies; 56+ messages in thread
From: Andrea Arcangeli @ 2000-12-31  1:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Alexander Viro, Daniel Phillips, linux-kernel

On Sat, Dec 30, 2000 at 03:00:43PM -0700, Eric W. Biederman wrote:
> To get ENOSPC handling 99% correct all we need to do is decrement a counter,
> that remembers how many disks blocks are free.  If we need a better

Yes, we need to add one field to the in-core superblock to do this accounting.

> estimate than just the data blocks it should not be hard to add an
> extra callback to the filesystem.  

Yes, I was thinking at this callback too. Such a callback is nearly the only
support we need from the filesystem to provide allocate on flush. Allocate on
flush is a pagecache issue, not really a filesystem issue. When a filesystem
doesn't implement such callback we can simply get_block(create) at pagecache
creation time as usual.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-31  1:02             ` Andrea Arcangeli
@ 2000-12-31  1:13               ` Chris Wedgwood
  2000-12-31  1:50               ` Alexander Viro
  2000-12-31  2:09               ` Roman Zippel
  2 siblings, 0 replies; 56+ messages in thread
From: Chris Wedgwood @ 2000-12-31  1:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Eric W. Biederman, Linus Torvalds, Alexander Viro,
	Daniel Phillips, linux-kernel

On Sun, Dec 31, 2000 at 02:02:34AM +0100, Andrea Arcangeli wrote:

    Yes, we need to add one field to the in-core superblock to do
    this accounting.

How does this work for filesystems like reisefs which do tail merging
and other filesystems which might do sub-clock allocation?

We either need more than one field (or a byte counter) + lock or
perhaps a generic fields + callback to the fs itself.
    
    > estimate than just the data blocks it should not be hard to add
    > an extra callback to the filesystem.
    
    Yes, I was thinking at this callback too. Such a callback is
    nearly the only support we need from the filesystem to provide
    allocate on flush.

I think a callback is the way to go.



  --cw
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-31  1:02             ` Andrea Arcangeli
  2000-12-31  1:13               ` Chris Wedgwood
@ 2000-12-31  1:50               ` Alexander Viro
  2000-12-31  2:34                 ` Andrea Arcangeli
  2000-12-31  2:09               ` Roman Zippel
  2 siblings, 1 reply; 56+ messages in thread
From: Alexander Viro @ 2000-12-31  1:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Eric W. Biederman, Linus Torvalds, Daniel Phillips, linux-kernel



On Sun, 31 Dec 2000, Andrea Arcangeli wrote:

> On Sat, Dec 30, 2000 at 03:00:43PM -0700, Eric W. Biederman wrote:
> > To get ENOSPC handling 99% correct all we need to do is decrement a counter,
> > that remembers how many disks blocks are free.  If we need a better
> 
> Yes, we need to add one field to the in-core superblock to do this accounting.

And its meaning for 2/3 of filesystems would be?

> > estimate than just the data blocks it should not be hard to add an
> > extra callback to the filesystem.  
> 
> Yes, I was thinking at this callback too. Such a callback is nearly the only
> support we need from the filesystem to provide allocate on flush. Allocate on
> flush is a pagecache issue, not really a filesystem issue. When a filesystem

I _doubt_ it. If it is a pagecache issue it should apply to NFS. It should
apply to ramfs. It should apply to helluva lot of filesystems that are not
block-based. Pagecache doesn't (and shouldn't) know about blocks.

> doesn't implement such callback we can simply get_block(create) at pagecache
> creation time as usual.

Umm... You do realize that get_block is _not_ visible on pagecache level?
Sure thing, filesystem should be free to use whatever functions it wants
for address_space methods. No arguments here. It should be able whatever
callbacks these functions expect. If filesystem doesn't implement reservation
it should use functions that do not expect such argument. That's it. No
need to invent new methods or shoehorn all block filesystems into the same
scheme.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-31  1:02             ` Andrea Arcangeli
  2000-12-31  1:13               ` Chris Wedgwood
  2000-12-31  1:50               ` Alexander Viro
@ 2000-12-31  2:09               ` Roman Zippel
  2000-12-31  2:28                 ` Linus Torvalds
  2 siblings, 1 reply; 56+ messages in thread
From: Roman Zippel @ 2000-12-31  2:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Eric W. Biederman, Linus Torvalds, Alexander Viro,
	Daniel Phillips, linux-kernel

Hi,

On Sun, 31 Dec 2000, Andrea Arcangeli wrote:

> > estimate than just the data blocks it should not be hard to add an
> > extra callback to the filesystem.  
> 
> Yes, I was thinking at this callback too. Such a callback is nearly the only
> support we need from the filesystem to provide allocate on flush.

Actually the getblock function could be split into 3 functions:
- alloc_block: mostly just decrementing a counter (and quota)
- get_block: allocating a block from the bitmap
- commit_block: inserting the new block into the inode

This would be really useful for streaming, one could get as fast as
possible the block number and the data could be very quickly written,
while keeping the cache usage low. Or streaming directly from a device
to disk also wants to get rid of the data as fast as possible.

bye, Roman

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-31  2:09               ` Roman Zippel
@ 2000-12-31  2:28                 ` Linus Torvalds
  2000-12-31 12:58                   ` Roman Zippel
                                     ` (3 more replies)
  0 siblings, 4 replies; 56+ messages in thread
From: Linus Torvalds @ 2000-12-31  2:28 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Andrea Arcangeli, Eric W. Biederman, Alexander Viro,
	Daniel Phillips, linux-kernel



On Sun, 31 Dec 2000, Roman Zippel wrote:
> 
> On Sun, 31 Dec 2000, Andrea Arcangeli wrote:
> 
> > > estimate than just the data blocks it should not be hard to add an
> > > extra callback to the filesystem.  
> > 
> > Yes, I was thinking at this callback too. Such a callback is nearly the only
> > support we need from the filesystem to provide allocate on flush.
> 
> Actually the getblock function could be split into 3 functions:
> - alloc_block: mostly just decrementing a counter (and quota)
> - get_block: allocating a block from the bitmap
> - commit_block: inserting the new block into the inode
> 
> This would be really useful for streaming, one could get as fast as
> possible the block number and the data could be very quickly written,
> while keeping the cache usage low. Or streaming directly from a device
> to disk also wants to get rid of the data as fast as possible.

Now, to insert a small note of sanity here: I think people are starting to
overdesign stuff.

The fact is that currently the "get_block()" interface that the page cache
helper functions use does NOT have to be very expensive at all.

In fact, in a properly designed filesystem just a bit of low-level caching
would easily make the average "get_block()" be very fast indeed. The fact
that right now ext2 has not been optimized for this is _not_ a reason to
design the VFS layer around a slow get_block() operation.

If you look at the generic block-based writing routines, they are actually
not all that expensive. Any kind of complication is only going to make
those functions more complex, and any performance gained could easily be
lost in extra complexity.

There are only two real advantages to deferred writing:

 - not having to do get_block() at all for temp-files, as we never have to
   do the allocation if we end up removing the file.

   NOTE NOTE NOTE! The overhead for trying to get ENOSPC and quota errors
   right is quite possibly big enough that this advantage is possibly very
   questionable.  It's very possible that people could speed things up
   using this approach, but I also suspect that it is equally (if not
   more) possible to speed things up by just making sure that the
   low-level FS has a fast get_block().

 - Using "global" access patterns to do a better job of "get_block()", ie
   taking advantage of issues with journalling etc and deferring the write
   in order to get a bigger journal.

The second point is completely different, and THIS is where I think there
are potentially real advantages. However, I also think that this is not
actually about deferred writes at all: it's really a question of the
filesystem having access to the page when the physical write is actually
started so that the filesystem might choose to _change_ the allocation it
did - it might have allocated a backing store block at "get_block()" time,
but by the time it actually writes the stuff out to disk it may have
allocated a bigger contiguous area somewhere else for the data..

I really think that the #2 thing is the more interesting one, and that
anybody looking at ext2 should look at just improving the locking and
making the block allocation functions run faster. Which should not be all
that difficult - the last time I looked at the thing it was quite
horrible.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-31  1:50               ` Alexander Viro
@ 2000-12-31  2:34                 ` Andrea Arcangeli
  0 siblings, 0 replies; 56+ messages in thread
From: Andrea Arcangeli @ 2000-12-31  2:34 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Eric W. Biederman, Linus Torvalds, Daniel Phillips, linux-kernel

On Sat, Dec 30, 2000 at 08:50:52PM -0500, Alexander Viro wrote:
> And its meaning for 2/3 of filesystems would be?

It should stay in the private part of the in-core superblock of course.

> I _doubt_ it. If it is a pagecache issue it should apply to NFS. It should
> apply to ramfs. It should apply to helluva lot of filesystems that are not
> block-based. Pagecache doesn't (and shouldn't) know about blocks.

With pagecache I meant the library of pagecache methods in buffer.c. Even
if they are recalled by the lowlevel filesystem code and they can be
overridden by lowlevel filesystem code, they aren't lowlevel filesystem code
but they're infact common code.  We can implement another version of them that
instead of knowing about get_block, also know about another filesystem
callback and when possible it only reserve the space for a delayed allocation
later triggered (in parallel) by future kupdate. They will know about this new
callback in the same way the current standard pagecache library methods knows
about get_block_t. Filesystems implementing this callback will be able to use
those new pagecache library methods.

> it should use functions that do not expect such argument. That's it. No
> need to invent new methods or shoehorn all block filesystems into the same
> scheme.

Of course.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-31  2:28                 ` Linus Torvalds
@ 2000-12-31 12:58                   ` Roman Zippel
  2001-04-21 20:06                     ` Races in affs_unlink(), affs_rmdir() and affs_rename() Alexander Viro
  2000-12-31 14:38                   ` [RFC] Generic deferred file writing Andrea Arcangeli
                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 56+ messages in thread
From: Roman Zippel @ 2000-12-31 12:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Eric W. Biederman, Alexander Viro,
	Daniel Phillips, linux-kernel

Hi,

On Sat, 30 Dec 2000, Linus Torvalds wrote:

> In fact, in a properly designed filesystem just a bit of low-level caching
> would easily make the average "get_block()" be very fast indeed. The fact
> that right now ext2 has not been optimized for this is _not_ a reason to
> design the VFS layer around a slow get_block() operation.
> [..]
> The second point is completely different, and THIS is where I think there
> are potentially real advantages. However, I also think that this is not
> actually about deferred writes at all: it's really a question of the
> filesystem having access to the page when the physical write is actually
> started so that the filesystem might choose to _change_ the allocation it
> did - it might have allocated a backing store block at "get_block()" time,
> but by the time it actually writes the stuff out to disk it may have
> allocated a bigger contiguous area somewhere else for the data..
> 
> I really think that the #2 thing is the more interesting one, and that
> anybody looking at ext2 should look at just improving the locking and
> making the block allocation functions run faster. Which should not be all
> that difficult - the last time I looked at the thing it was quite
> horrible.

What makes get_block business complicated now, is that can be called
recursively: get_block needs to allocate something, what might start new
i/o which calls again get_block.
Writing dirty pages should be a real asynchronous process, but it isn't
right now, as get_block is synchronous. Making get_block asynchronous is
almost impossible, so one usually does it in a separate thread.
So IMO something like this should happen: dirty pages should be put on a
separate list and a thread takes these pages and allocates the buffers for
them and starts the i/o. This had another advantage: get_block wouldn't
really need to do preallocation anymore, the get_block function could work
instead on a number of pages (preallocation would instead happen in the
page cache).
This could make the get_block function and the needed locking very simple, 
e.g. one could use a simple semaphore instead of kernel_lock to protect
getting of multiple blocks instead of only one. Also splitting it into
several tasks can make it faster, so in one step we just do the resource
allocation to guarantee the write, in a separate step we do the real
allocation. If this is done for several pages at once, it can be very 
fast and simple.

bye, Roman

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-31  2:28                 ` Linus Torvalds
  2000-12-31 12:58                   ` Roman Zippel
@ 2000-12-31 14:38                   ` Andrea Arcangeli
  2000-12-31 16:33                     ` Linus Torvalds
  2000-12-31 18:30                   ` Daniel Phillips
  2001-01-02 18:27                   ` Chris Mason
  3 siblings, 1 reply; 56+ messages in thread
From: Andrea Arcangeli @ 2000-12-31 14:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Roman Zippel, Eric W. Biederman, Alexander Viro, Daniel Phillips,
	linux-kernel

On Sat, Dec 30, 2000 at 06:28:39PM -0800, Linus Torvalds wrote:
> There are only two real advantages to deferred writing:
> 
>  - not having to do get_block() at all for temp-files, as we never have to
>    do the allocation if we end up removing the file.
> 
>    NOTE NOTE NOTE! The overhead for trying to get ENOSPC and quota errors
>    right is quite possibly big enough that this advantage is possibly very
>    questionable.  It's very possible that people could speed things up
>    using this approach, but I also suspect that it is equally (if not
>    more) possible to speed things up by just making sure that the
>    low-level FS has a fast get_block().

get_block for large files can be improved using extents, but how can we
implement a fast get_block without restructuring the on-disk format of the
filesystem? (in turn using another filesystem instead of ext2?)

get_block needs to walk all level of inode metadata indirection if they
exists. It has to map the logical page from its (inode) address space to the
physical blocks. If those indirection blocks aren't in cache it has to block
to read them. It doesn't matter how it is actually implemented in core. And
then later as you say those allocated blocks could never get written because
the file may be deleted in the meantime.

With allocate on flush we can run the slow get_block in parallel
asynchronously using a kernel daemon after the page flushtime timeout
triggers. It looks nicer to me. The in-core overhead of the reserved
blocks for delayed allocation should be not relevant at all (and it also
should not need the big kernel lock making the whole write path big lock
free).

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-31 14:38                   ` [RFC] Generic deferred file writing Andrea Arcangeli
@ 2000-12-31 16:33                     ` Linus Torvalds
  2000-12-31 16:50                       ` Andrea Arcangeli
  2000-12-31 16:51                       ` Alexander Viro
  0 siblings, 2 replies; 56+ messages in thread
From: Linus Torvalds @ 2000-12-31 16:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Roman Zippel, Eric W. Biederman, Alexander Viro, Daniel Phillips,
	linux-kernel



On Sun, 31 Dec 2000, Andrea Arcangeli wrote:
> 
> get_block for large files can be improved using extents, but how can we
> implement a fast get_block without restructuring the on-disk format of the
> filesystem? (in turn using another filesystem instead of ext2?)

By doing a better job of caching stuff.

There are multiple levels of caching here. One issue is the question of
"allocate a new block". Go and look how the ext2 block pre-allocation
works, and cry. It should _not_ be a loop that sets one bit at a time, it
should be something that notices when the (u32 *) is zero and grabs the
whole 32 blocks in one go. Instead of defaulting to a pre-allocation of 8
blocks, doing that same expensive thing much too often.

The other thing is that one of the common cases for writing is consecutive
writing to the end of the file. Now, you figure it out: if get_block()
really is a bottle-neck, why not cache the last tree lookup? You'd get a
99% hitrate for that common case.

You should realize that all the block allocation etc code was written for
a very different VFS layer. "get_block()" didn't even exist. We didn't
have SMP issues. We had very different access patterns (the virtual caches
in the page cache makes the accesses to "get_block()" very different, as
the VFS layer keeps track of man mappings on its own.

And get_block() was basically tacked on top of the old code. Al Viro did a
good job of cleaning up some of the issues, but go and look at get_block()
and follow it all the way down into "ext2_new_block()" and back, and I bet
you'll ask yourself why it's so complex. And wonder if it couldn't just be
cleaned up and sped up a lot that way.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-31 16:33                     ` Linus Torvalds
@ 2000-12-31 16:50                       ` Andrea Arcangeli
  2000-12-31 16:51                       ` Alexander Viro
  1 sibling, 0 replies; 56+ messages in thread
From: Andrea Arcangeli @ 2000-12-31 16:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Roman Zippel, Eric W. Biederman, Alexander Viro, Daniel Phillips,
	linux-kernel

On Sun, Dec 31, 2000 at 08:33:01AM -0800, Linus Torvalds wrote:
> By doing a better job of caching stuff.

Caching can happen after we are been slow and we waited for I/O synchronously
the first time (bread).

How can we optimize the first time (when the indirect blocks are out of buffer
cache) without changing the on-disk format? We can't as far I can see.

It's of course fine to optimize the address_space->physical_block resolver
algorithm, because it has to run anyways if we want to write such data to disk
eventually (despite it's asynchronous with allocate on flush, or synchronous
like now).  Probably it's a more sensible optimization than the allocate
on flush thing. But still being able to run the resolver in an asynchronous
manner, in parallel, only at the time we need to flush the page to disk, looks
nicer behaviour to me.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-31 16:33                     ` Linus Torvalds
  2000-12-31 16:50                       ` Andrea Arcangeli
@ 2000-12-31 16:51                       ` Alexander Viro
  2000-12-31 17:12                         ` Linus Torvalds
  1 sibling, 1 reply; 56+ messages in thread
From: Alexander Viro @ 2000-12-31 16:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Roman Zippel, Eric W. Biederman,
	Daniel Phillips, linux-kernel



On Sun, 31 Dec 2000, Linus Torvalds wrote:

> The other thing is that one of the common cases for writing is consecutive
> writing to the end of the file. Now, you figure it out: if get_block()
> really is a bottle-neck, why not cache the last tree lookup? You'd get a
> 99% hitrate for that common case.

Because it is not a bottleneck. The _real_ bottleneck is in ext2_new_block().
Try to profile it and you'll see.

We could diddle with ext2_get_block(). No arguments. But the real source of
PITA is balloc.c, not inode.c. Look at the group descriptor cache code and
weep. That, and bitmaps handling.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-31 16:51                       ` Alexander Viro
@ 2000-12-31 17:12                         ` Linus Torvalds
  0 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2000-12-31 17:12 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Andrea Arcangeli, Roman Zippel, Eric W. Biederman,
	Daniel Phillips, linux-kernel



On Sun, 31 Dec 2000, Alexander Viro wrote:
> 
> On Sun, 31 Dec 2000, Linus Torvalds wrote:
> 
> > The other thing is that one of the common cases for writing is consecutive
> > writing to the end of the file. Now, you figure it out: if get_block()
> > really is a bottle-neck, why not cache the last tree lookup? You'd get a
> > 99% hitrate for that common case.
> 
> Because it is not a bottleneck. The _real_ bottleneck is in ext2_new_block().
> Try to profile it and you'll see.
> 
> We could diddle with ext2_get_block(). No arguments. But the real source of
> PITA is balloc.c, not inode.c. Look at the group descriptor cache code and
> weep. That, and bitmaps handling.

I'm not surprised. Just doign pre-allocation 32 blocks at a time would
probably help. But that code really should be re-written, I think.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-31  2:28                 ` Linus Torvalds
  2000-12-31 12:58                   ` Roman Zippel
  2000-12-31 14:38                   ` [RFC] Generic deferred file writing Andrea Arcangeli
@ 2000-12-31 18:30                   ` Daniel Phillips
  2000-12-31 18:44                     ` Linus Torvalds
  2001-01-02 18:27                   ` Chris Mason
  3 siblings, 1 reply; 56+ messages in thread
From: Daniel Phillips @ 2000-12-31 18:30 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel

Linus Torvalds wrote:
> There are only two real advantages to deferred writing:
> 
>  - not having to do get_block() at all for temp-files, as we never have to
>    do the allocation if we end up removing the file.
> 
>    NOTE NOTE NOTE! The overhead for trying to get ENOSPC and quota errors
>    right is quite possibly big enough that this advantage is possibly very
>    questionable.  It's very possible that people could speed things up
>    using this approach, but I also suspect that it is equally (if not
>    more) possible to speed things up by just making sure that the
>    low-level FS has a fast get_block().

It's not that hard or inefficient to return the ENOSPC from the usual
point.  For example, make a gross overestimate of the space needed for
the write, compare to a cached filesystem free space value less the
amount deferred so far, and fail to take the optimization if it looks
even close.  Also, it's not necessarily bad to defer the ENOSPC to file
close time.  The same thing can happen with failed disk io, and that's
just a fact of life with asynchronous io.  A reliable program needs to
be able to deal with it.  (I think there was a long thread on this not
long ago, regarding filesystem errors returned after a program exits.)

>  - Using "global" access patterns to do a better job of "get_block()", ie
>    taking advantage of issues with journalling etc and deferring the write
>    in order to get a bigger journal.

Another nicety is not having to bother the filesystem at all about
sequences of short writes.

> The second point is completely different, and THIS is where I think there
> are potentially real advantages. However, I also think that this is not
> actually about deferred writes at all: it's really a question of the
> filesystem having access to the page when the physical write is actually
> started so that the filesystem might choose to _change_ the allocation it
> did - it might have allocated a backing store block at "get_block()" time,
> but by the time it actually writes the stuff out to disk it may have
> allocated a bigger contiguous area somewhere else for the data..

You'd most likely be able to measure the overhead under load of doing
the allocation twice, due to the occasional need to read back a
swapped-out metadata block.

XFS does deferred allocation and the Reiser guys are talking about it -
it seems to be something worth doing.  For me the question is, if the
VFS can already do the deferring, is there a good reason for a
filesystem to duplicate that functionality?

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-31 18:30                   ` Daniel Phillips
@ 2000-12-31 18:44                     ` Linus Torvalds
  2000-12-31 19:10                       ` Daniel Phillips
  2000-12-31 21:03                       ` Roman Zippel
  0 siblings, 2 replies; 56+ messages in thread
From: Linus Torvalds @ 2000-12-31 18:44 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel



On Sun, 31 Dec 2000, Daniel Phillips wrote:
> 
> It's not that hard or inefficient to return the ENOSPC from the usual
> point.  For example, make a gross overestimate of the space needed for
> the write, compare to a cached filesystem free space value less the
> amount deferred so far, and fail to take the optimization if it looks
> even close.

Let me repeat myself one more time:

 I do not believe that "get_block()" is as big of a problem as people make
 it out to be.

And more importantly:

 I strongly believe that trying to be clever is detrimental to your
 health. 

 The "clever" approach is to add tons of complexity, have various
 heuristics to try to not overflow, and then try to debug it considering
 that the ENOSPC case is actually rather rare.

 The "intelligent" approach is just to say that if get_block() shows up on
 the performance profiles, then it should be optimized.

I'd rather be intelligent than clever. Optimize get_block(), which in the
case of ext2 seems to be mostly ext2_new_block() and the balloc.c mess.

The argument that Andrea had is bogus: the common case for writes (and
writes is the only part that deferred writing would touch) is re-writing
the whole file, and the IO to look up the metadata is never an issue for
that case. Everything is basically cached and created on-the-fly. IO is
not the issue, being good about new block allocation _is_ the issue.

Don't get me wrong: I like the notion of deferred writes. But I'm also
very pragmatic: I have not heard of a really good argument that makes it
obvious that deferred writes is a major win performance-wise that would
make it worth the complexity.

One form of deferred writes I _do_ like is the mount-time-option form.
Because that one doesn't add complexity. Kind of like the "noatime" mount
option - it can be worth it under some circumstances, and sometimes it's
acceptable to not get 100% unix semantics - at which point deferred writes
have none of the disadvantages of trying to be clever.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-31 18:44                     ` Linus Torvalds
@ 2000-12-31 19:10                       ` Daniel Phillips
  2000-12-31 19:31                         ` Linus Torvalds
  2000-12-31 21:03                       ` Roman Zippel
  1 sibling, 1 reply; 56+ messages in thread
From: Daniel Phillips @ 2000-12-31 19:10 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel

Linus Torvalds wrote:
>  I do not believe that "get_block()" is as big of a problem as people make
>  it out to be.

I didn't mention get_block - disk accesses obviously far outweigh
filesystem cpu/cache usage in overall impact.  The question is, what
happens to disk access patterns when we do the deferred allocation.

> One form of deferred writes I _do_ like is the mount-time-option form.
> Because that one doesn't add complexity. Kind of like the "noatime" mount
> option - it can be worth it under some circumstances, and sometimes it's
> acceptable to not get 100% unix semantics - at which point deferred writes
> have none of the disadvantages of trying to be clever.

And the added attraction of requiring almost no effort.
	
--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-31 19:10                       ` Daniel Phillips
@ 2000-12-31 19:31                         ` Linus Torvalds
  0 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2000-12-31 19:31 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel

On Sun, 31 Dec 2000, Daniel Phillips wrote:

> Linus Torvalds wrote:
> >  I do not believe that "get_block()" is as big of a problem as people make
> >  it out to be.
>
> I didn't mention get_block - disk accesses obviously far outweigh
> filesystem cpu/cache usage in overall impact.  The question is, what
> happens to disk access patterns when we do the deferred allocation.

Note that the deferred allocation is only possible with a full page write.

Go and do statistics on a system of how often this happens, and what the
circumstances are. Just for fun.

I will bet you $5 USD that 99.9% of all such writes are to new files, at
the end of the file. I'm sure you can come up with other usage patterns,
but they'll be special (big databases etc, and I bet that they'll want to
have stuff cached all the time anyway for other reasons).

So I seriously doubt that you'll have much of an IO component to the
writing anyway - except for the "normal" deferred write of actually
writing the stuff out at all.

Now, this is where I agree with you, but I disagree with where most of the
discussion has been going: I _do_ believe that we may want to change block
allocation discussions at write-out-time. That makes sense to me. But that
doesn't really impact "ENOSPC" - the write would not be really "deferred"
by the VM layer, and the filesystem would always be aware of the writer
synchronously.

> > One form of deferred writes I _do_ like is the mount-time-option form.
> > Because that one doesn't add complexity. Kind of like the "noatime" mount
> > option - it can be worth it under some circumstances, and sometimes it's
> > acceptable to not get 100% unix semantics - at which point deferred writes
> > have none of the disadvantages of trying to be clever.
>
> And the added attraction of requiring almost no effort.

Did I mention my belief in the true meaning of "intelligence"?

"Intelligence is the ability to avoid doing work, yet get the work done".

Lazy programmers are the best programmers. Think Tom Sawyer painting the
fence. That's intelligence.

Requireing almost no effort is a big plus in my book.

It's the "clever" programmer I'm afraid of. The one who isn't afraid of
generating complexity, because he has a Plan (capital "P"), and he knows
he can work out the details later.

			Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-31 18:44                     ` Linus Torvalds
  2000-12-31 19:10                       ` Daniel Phillips
@ 2000-12-31 21:03                       ` Roman Zippel
  2000-12-31 21:32                         ` Linus Torvalds
  1 sibling, 1 reply; 56+ messages in thread
From: Roman Zippel @ 2000-12-31 21:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Phillips, linux-kernel

Hi,

On Sun, 31 Dec 2000, Linus Torvalds wrote:

> Let me repeat myself one more time:
> 
>  I do not believe that "get_block()" is as big of a problem as people make
>  it out to be.

The real problem is that get_block() doesn't scale and it's very hard to
do. A recursive per inode-semaphore might help, but it's still a pain to
get it right.

bye, Roman

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-31 21:03                       ` Roman Zippel
@ 2000-12-31 21:32                         ` Linus Torvalds
  0 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2000-12-31 21:32 UTC (permalink / raw)
  To: Roman Zippel; +Cc: Daniel Phillips, linux-kernel



On Sun, 31 Dec 2000, Roman Zippel wrote:
> 
> On Sun, 31 Dec 2000, Linus Torvalds wrote:
> 
> > Let me repeat myself one more time:
> > 
> >  I do not believe that "get_block()" is as big of a problem as people make
> >  it out to be.
> 
> The real problem is that get_block() doesn't scale and it's very hard to
> do. A recursive per inode-semaphore might help, but it's still a pain to
> get it right.

Not true.

There's nothing unscalable in get_block() per se. The only lock we hold is
the per-page lock, which we must hold anyway. get_block() itself does not
need any real locking: you can do it with a simple per-inode spinlock if
you want to (and release the spinlock and try again if you need to fetch
non-cached data blocks).

Sure, the current ext2 _implementation_ sucks. Nobody has ever contested
that. 

Stop re-designing something just because you want to.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
       [not found] <Pine.LNX.4.10.10012311726230.1671-100000@penguin.transmeta.com>
@ 2001-01-01  2:50 ` Roman Zippel
  2001-01-01  3:47   ` Alexander Viro
  0 siblings, 1 reply; 56+ messages in thread
From: Roman Zippel @ 2001-01-01  2:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Phillips, linux-kernel

Hi,

On Sun, 31 Dec 2000, Linus Torvalds wrote:

> 		cached_allocation = NULL;
> 
> 	repeat:
> 		spin_lock();
> 		result = try_to_find_existing();
> 		if (!result) {
> 			if (!cached_allocation) {
> 				spin_unlock();
> 				cached_allocation = allocate_block();
> 				goto repeat;
> 			}
> 			result = cached_allocation;
> 			add_to_datastructures(result);
> 		}
> 		spin_unlock();
> 		return result;
> 
> This is quite standard, and Linux does it in many places. It doesn't have
> to be complex or ugly.

No problem with that.

> Also, I don't see why you claim the current get_block() is recursive and
> hard to use: it obviously isn't. If you look at the current ext2
> get_block(), the way it protects most of its data structures is by the
> super-block-global lock. That wouldn't work if your claims of recursive
> invocation were true. 

I just rechecked that, but I don't see no superblock lock here, it uses
the kernel_lock instead. Although Al could give the definitive answer for
this, he wrote it. :)

> The way the Linux MM works, if the lower levels need to do buffer
> allocations, they will use GFP_BUFFER (which "bread()" does internally),
> which will mean that the VM layer will _not_ call down recursively to
> write stuff out while it's already trying to write something else. This is
> exactly so that filesystems don't have to release and re-try if they don't
> want to.
> 
> In short, I don't see any of your arguments.

Then I must have misunderstood Al. Al?
If you were right here, I would see absolutely no reason for the current
complexity. (Me is a bit confused here.)

bye, Roman

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2001-01-01  2:50 ` [RFC] Generic deferred file writing Roman Zippel
@ 2001-01-01  3:47   ` Alexander Viro
  2001-01-01 12:44     ` Roman Zippel
  2001-01-01 20:00     ` Daniel Phillips
  0 siblings, 2 replies; 56+ messages in thread
From: Alexander Viro @ 2001-01-01  3:47 UTC (permalink / raw)
  To: Roman Zippel; +Cc: Linus Torvalds, Daniel Phillips, linux-kernel



On Mon, 1 Jan 2001, Roman Zippel wrote:

> I just rechecked that, but I don't see no superblock lock here, it uses
> the kernel_lock instead. Although Al could give the definitive answer for
> this, he wrote it. :)

No superblock lock in get_block() proper. Tons of it in the dungheap called
balloc.c. _That's_ where the bottleneck is. BTW, even BKL is easily removable
from get_block() - check /* Reader: */ and /* Writer: */ comments, they mark
the places to put spinlock in.

> > The way the Linux MM works, if the lower levels need to do buffer
> > allocations, they will use GFP_BUFFER (which "bread()" does internally),
> > which will mean that the VM layer will _not_ call down recursively to
> > write stuff out while it's already trying to write something else. This is
> > exactly so that filesystems don't have to release and re-try if they don't
> > want to.
> > 
> > In short, I don't see any of your arguments.
> 
> Then I must have misunderstood Al. Al?
> If you were right here, I would see absolutely no reason for the current
> complexity. (Me is a bit confused here.)

Reread the original thread. GFP_BUFFER protects us from buffer cache
allocations triggering pageouts. It has nothing to the deadlock scenario
that would come from grabbing ->i_sem on pageout.

Sheesh... "Complexity" of ext2_get_block() (down to the ext2_new_block()
calls) is really, really not a problem. Normally it just gives you the
straightforward path. All unrolls are for contention cases and they
are precisely what we have to do there.

Again, scalability problems are in the block allocator, not in the
indirect blocks handling. It's completely independent from get_block().
We overlock. Big way. And the structures we are protecting excessively
are:
	* cylinder group descriptors
	* block bitmaps
	* (to less extent) inode bitmaps and inode table.
That's what needs to be fixed. It has nothing to VFS or VM - purely
internal ext2 business.

Another ext2 issue is reducing the buffer cache pressure - mostly by
moving the directories into page cache. I've posted such patches on
fsdevel and they are applied to the kernel I'm running here. Works
like a charm and allows the rest of metadata stay in cache longer.

GFP_BUFFER _may_ become an issue if we move bitmaps into pagecache.
Then we'll need a per-address_space gfp_mask. Again, patches exist
and had been tested (not right now - I didn't port them to current
tree yet). Bugger if I remember whether they were posted or not - they've
definitely had been mentioned on linux-mm, but IIRC I had sent the
modifications of VM code only to Jens. I can repost them.

Some pieces of balloc.c cleanup had been posted on fsdevel. Check the
archives. They prepare the ground for killing lock_super() contention
on ext2_new_inode(), but that part wasn't there back then.

I will start -bird (aka FS-CURRENT) branch as soon as Linus opens 2.4.
Hopefully by the time of 2.5 it will be tested well enough. Right now
it exists as a large patchset against more or less recent -test<n> and
I'm waiting for slowdown of the changes in main tree to put them all
together.
							Cheers,
								Al

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2001-01-01  3:47   ` Alexander Viro
@ 2001-01-01 12:44     ` Roman Zippel
  2001-01-01 15:16       ` Alexander Viro
  2001-01-01 20:00     ` Daniel Phillips
  1 sibling, 1 reply; 56+ messages in thread
From: Roman Zippel @ 2001-01-01 12:44 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Daniel Phillips, linux-kernel

Hi,

On Sun, 31 Dec 2000, Alexander Viro wrote:

> Reread the original thread. GFP_BUFFER protects us from buffer cache
> allocations triggering pageouts. It has nothing to the deadlock scenario
> that would come from grabbing ->i_sem on pageout.

I don't want to grab i_sem. It was a very, very early idea... :)

> Sheesh... "Complexity" of ext2_get_block() (down to the ext2_new_block()
> calls) is really, really not a problem. Normally it just gives you the
> straightforward path. All unrolls are for contention cases and they
> are precisely what we have to do there.

Maybe complexity is the wrong word, of course the logic in there is
straight forward (once one understood it :) ).
Let me ask it differently and it's now only about indirect block handling.
Is it possible to use a per-inode-indirect-block-semaphore?
The reason for the question is, that I maybe see a different sort of
contention here - live locks. I don't mind that getting of resources and
rechecking if everything went well. The problem is how much resources you
need to get (and to release, if something failed). Somewhere is always a
point, where two threads can't make any progress or one thread can stall
the progress of a second.
To get back to ext2_get_block: IMO such a scenario could happen in the
double or triple indirect block case, when two or more threads try to
allocate/truncate a block here. Maybe my concerns are baseless, but I'd
just like to know, that there isn't a possibility for a DOS attack here.
(BTW that's what I mean with complexity, it's less the logical complexity,
it's more the "runtime complexity").

The other reason for the question is that I'm currently overwork the block
handling in affs, especially the extended block handling, where I'm
implementing a new extended block cache, where I would pretty much prefer
to use a semaphore to protect it. Although I could do it probably without
the semaphore and use a spinlock+rechecking, but it would keep it so much
simpler. (I can post more details about this part on fsdevel if needed /
wanted.)

bye, Roman

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2001-01-01 12:44     ` Roman Zippel
@ 2001-01-01 15:16       ` Alexander Viro
  2001-01-02  3:00         ` Roman Zippel
  0 siblings, 1 reply; 56+ messages in thread
From: Alexander Viro @ 2001-01-01 15:16 UTC (permalink / raw)
  To: Roman Zippel; +Cc: Linus Torvalds, Daniel Phillips, linux-kernel



On Mon, 1 Jan 2001, Roman Zippel wrote:

> The other reason for the question is that I'm currently overwork the block
> handling in affs, especially the extended block handling, where I'm
> implementing a new extended block cache, where I would pretty much prefer
> to use a semaphore to protect it. Although I could do it probably without
> the semaphore and use a spinlock+rechecking, but it would keep it so much
> simpler. (I can post more details about this part on fsdevel if needed /
> wanted.)

But... But with AFFS you _have_ exclusion between block-allocation and
truncate(). It has no sparse files, so pageout will never allocate
anything. I.e. all allocations come from write(2). And both write(2) and
truncate(2) hold i_sem.

Problem with AFFS is on the directory side of that business and there it's
really scary. Block allocation is trivial...

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2001-01-01  3:47   ` Alexander Viro
  2001-01-01 12:44     ` Roman Zippel
@ 2001-01-01 20:00     ` Daniel Phillips
  1 sibling, 0 replies; 56+ messages in thread
From: Daniel Phillips @ 2001-01-01 20:00 UTC (permalink / raw)
  To: Alexander Viro, linux-kernel, Rik van Riel

Alexander Viro wrote:
> GFP_BUFFER _may_ become an issue if we move bitmaps into pagecache.
> Then we'll need a per-address_space gfp_mask. Again, patches exist
> and had been tested (not right now - I didn't port them to current
> tree yet). Bugger if I remember whether they were posted or not - they've
> definitely had been mentioned on linux-mm, but IIRC I had sent the
> modifications of VM code only to Jens. I can repost them.

Please, and I'll ask Rik to post them on the kernelnewbies.org patch
page.  (Rik?)  Putting bitmaps and group descriptors into the page cache
will allow the current adhoc bitmap and groupdesc caching code to be
deleted - the for-real cache should have better lru, be more efficient
to access, not need special locking and won't have an arbitrary limit on
number of cached bitmaps.  About 300 lines of spagetti gone.  I suppose
the group descriptor pages still need to be locked in memory so we can
address them through a table instead of searching the hash.  OK, this
must be what you meant by a 'proper' fix to the ialloc group desc bug I
posted last month, which by the way is *still* there.  How about
applying my patch in the interim?  It's a real bug, it just doesn't
trigger often.

> Some pieces of balloc.c cleanup had been posted on fsdevel. Check the
> archives. They prepare the ground for killing lock_super() contention
> on ext2_new_inode(), but that part wasn't there back then.
> 
> I will start -bird (aka FS-CURRENT) branch as soon as Linus opens 2.4.
> Hopefully by the time of 2.5 it will be tested well enough. Right now
> it exists as a large patchset against more or less recent -test<n> and
> I'm waiting for slowdown of the changes in main tree to put them all
> together.

It would be awfully nice to have those patches available via ftp. 
Web-based mail archives don't make it because you can't generally can't
get the patches out intact - the tabs get expanded and other noise
inserted.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2001-01-01 15:16       ` Alexander Viro
@ 2001-01-02  3:00         ` Roman Zippel
  2001-01-02  5:00           ` Alexander Viro
  0 siblings, 1 reply; 56+ messages in thread
From: Roman Zippel @ 2001-01-02  3:00 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Daniel Phillips, linux-kernel

Hi,

On Mon, 1 Jan 2001, Alexander Viro wrote:

> But... But with AFFS you _have_ exclusion between block-allocation and
> truncate(). It has no sparse files, so pageout will never allocate
> anything. I.e. all allocations come from write(2). And both write(2) and
> truncate(2) hold i_sem.
> 
> Problem with AFFS is on the directory side of that business and there it's
> really scary. Block allocation is trivial...

Block allocation is not my problem right now (and even directory handling
is not that difficult), but I will post somethings about this on fsdevel
later.
But one question is still open, I'd really like an answer for:
Is it possible to use a per-inode-indirect-block-semaphore?

bye, Roman

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2001-01-02  3:00         ` Roman Zippel
@ 2001-01-02  5:00           ` Alexander Viro
  2001-01-02 16:53             ` Roman Zippel
  0 siblings, 1 reply; 56+ messages in thread
From: Alexander Viro @ 2001-01-02  5:00 UTC (permalink / raw)
  To: Roman Zippel; +Cc: Linus Torvalds, Daniel Phillips, linux-kernel



On Tue, 2 Jan 2001, Roman Zippel wrote:

> Block allocation is not my problem right now (and even directory handling
> is not that difficult), but I will post somethings about this on fsdevel
> later.
> But one question is still open, I'd really like an answer for:
> Is it possible to use a per-inode-indirect-block-semaphore?

Depends on a filesystem. Generally you don't want asynchronous operations
to grab semaphores shared with something else. kswapd knows to skip the locked
pages, but that's it - if writepage() is going to block on a semaphore you
will not know what had hit you. And while buffer-cache operations will not
trigger writepage() (grep for GFP_BUFFER and GFP_IO and you'll see) you have
no such warranties for other sources of memory pressure. If one of them
hits while you are holding such semaphore - you are toast.

We probably could pull it off for ext2_truncate() vs. ext2_get_block()
but it would not do us any good. It would give excessive exclusion for
operations that can be done in parallel just fine (example: we have
a hole from 100Kb to 200Kb. Pageouts in that area can be trivially
done i parallel - current code will not even try to do unrolls. With
your locking they will be serialized for no good reason). What for?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2001-01-02  5:00           ` Alexander Viro
@ 2001-01-02 16:53             ` Roman Zippel
  0 siblings, 0 replies; 56+ messages in thread
From: Roman Zippel @ 2001-01-02 16:53 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Daniel Phillips, linux-kernel

Hi,

On Tue, 2 Jan 2001, Alexander Viro wrote:

> Depends on a filesystem. Generally you don't want asynchronous operations
> to grab semaphores shared with something else. kswapd knows to skip the locked
> pages, but that's it - if writepage() is going to block on a semaphore you
> will not know what had hit you. And while buffer-cache operations will not
> trigger writepage() (grep for GFP_BUFFER and GFP_IO and you'll see) you have
> no such warranties for other sources of memory pressure. If one of them
> hits while you are holding such semaphore - you are toast.

I just checked that and you're right, sorry for causing confusion and
thanks for clearing this up.

> We probably could pull it off for ext2_truncate() vs. ext2_get_block()
> but it would not do us any good. It would give excessive exclusion for
> operations that can be done in parallel just fine (example: we have
> a hole from 100Kb to 200Kb. Pageouts in that area can be trivially
> done i parallel - current code will not even try to do unrolls. With
> your locking they will be serialized for no good reason). What for?

Let me come back to the three phases I mentioned earlier:
alloc_block: does only a read-only check whether a block needs to be
allocated or not, this can be done in parallel and only needs the page
lock.
get_block: blocks are now really allocated and this needs locking of the
bitmap.
commit_block: write the allocated blocks to the inode and this now would
use an inode specific semaphore to protect the updates of indirect blocks.

The only problem I see is truncate, but if we move the release of unneeded
indirect blocks to file_close, only new indirect blocks can appear while
the file is open, but they won't change anymore, what would make lots of
the checks easier.

bye, Roman

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC] Generic deferred file writing
  2000-12-31  2:28                 ` Linus Torvalds
                                     ` (2 preceding siblings ...)
  2000-12-31 18:30                   ` Daniel Phillips
@ 2001-01-02 18:27                   ` Chris Mason
  3 siblings, 0 replies; 56+ messages in thread
From: Chris Mason @ 2001-01-02 18:27 UTC (permalink / raw)
  To: Linus Torvalds, Roman Zippel
  Cc: Andrea Arcangeli, Eric W. Biederman, Alexander Viro,
	Daniel Phillips, linux-kernel



On Saturday, December 30, 2000 06:28:39 PM -0800 Linus Torvalds
<torvalds@transmeta.com> wrote:
 
> There are only two real advantages to deferred writing:
> 
>  - not having to do get_block() at all for temp-files, as we never have to
>    do the allocation if we end up removing the file.
> 
>    NOTE NOTE NOTE! The overhead for trying to get ENOSPC and quota errors
>    right is quite possibly big enough that this advantage is possibly very
>    questionable.  It's very possible that people could speed things up
>    using this approach, but I also suspect that it is equally (if not
>    more) possible to speed things up by just making sure that the
>    low-level FS has a fast get_block().
> 
>  - Using "global" access patterns to do a better job of "get_block()", ie
>    taking advantage of issues with journalling etc and deferring the write
>    in order to get a bigger journal.
> 
> The second point is completely different, and THIS is where I think there
> are potentially real advantages. 

Absolutely.  I wrote reiserfs delayed allocation code back in october, and
kind of left it alone until the VM had the callbacks needed to make it
clean (err, less ugly).  I included bunches of optimizations to
reiserfs_get_block, and the most effective one was a cache of block
pointers in the inode to avoid consecutive tree searches.  This was a
locking and an i/o win, for both reading and writing (reiserfs needs this
more than ext2 does)

For growing the file, delayed allocation was a huge bonus.  For all the
reasons you've already discussed, and because writing a file went from this:
 
(reiserfs_get_block is starting/stopping the transaction)
while(bytes_to_write) 
	start_transaction
	allocate block
	insert block pointer
	end_transaction
end

To this:

while(bytes_to_write)
	update counters
end

(delayed alloc routine is starting/stopping trans)
start_transaction
allocate X blocks
insert X block pointers
update counters
end_transaction

A big fat transaction is a happy one ;-)

Anyway, I'll return to the optimizations once things have settled down a
bit, and might give the generic delayed allocation (instead of reiserfs
only code) a try.

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: test13-pre6
  2000-12-30  0:25 test13-pre6 Linus Torvalds
                   ` (3 preceding siblings ...)
  2000-12-30  4:21 ` test13-pre6 Dan Aloni
@ 2001-01-04 20:23 ` Stephen C. Tweedie
  2001-01-04 22:15   ` test13-pre6 stewart
  4 siblings, 1 reply; 56+ messages in thread
From: Stephen C. Tweedie @ 2001-01-04 20:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kernel Mailing List, Alexander Viro, Stephen C. Tweedie,
	Marco d'Itri, Jeff Lightfoot, Dan Aloni, Anton Blanchard

Hi,

On Fri, Dec 29, 2000 at 04:25:43PM -0800, Linus Torvalds wrote:
> 
> Stephen: mind trying your fsync/etc tests on this one, to verify that the
> inode dirty stuff is all done right?

Back from the Scottish Hogmanay celebrations now. :)  I've run my
normal tests on this (based mainly on timing tests which show up
exactly how much is being written to disk for 1000 iterations of
various fsync/fdatasync/O_SYNC and overwrite/append combinations) and
2.4.0-prerelease seems to be doing the Right Thing.

My standard tests for this don't cover msync --- do you want me to
give that a try too?

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: test13-pre6
  2001-01-04 20:23 ` test13-pre6 Stephen C. Tweedie
@ 2001-01-04 22:15   ` stewart
  0 siblings, 0 replies; 56+ messages in thread
From: stewart @ 2001-01-04 22:15 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Kernel Mailing List, Alexander Viro,
	Marco d'Itri, Jeff Lightfoot, Dan Aloni, Anton Blanchard


Stephen,

Have you or can you run these tests directly against a buffered block
device (bypassing the filesystem) and see if it still behaves correctly?
I have a Java app that does this and 2.4.0-prerelease shows a cumulative
sync() time. As I write more data, sync times take longer and longer
and never comes back down. It takes writing 30-50 megs cumulative (many
syncs along the way) to become noticable. Noticable are latencies in   
the 20-30 ms range (up from 0-1 at the start of the test). Eventually  
the test comes to a grinding halt with all of it's time spent in the 
kernel.

I don't know when this happened in 2.4.0-testxx, but 2.2.x does not
show this behavior.

stewart


On Thu, 4 Jan 2001, Stephen C. Tweedie wrote:

> Hi,
> 
> On Fri, Dec 29, 2000 at 04:25:43PM -0800, Linus Torvalds wrote:
> > 
> > Stephen: mind trying your fsync/etc tests on this one, to verify that the
> > inode dirty stuff is all done right?
> 
> Back from the Scottish Hogmanay celebrations now. :)  I've run my
> normal tests on this (based mainly on timing tests which show up
> exactly how much is being written to disk for 1000 iterations of
> various fsync/fdatasync/O_SYNC and overwrite/append combinations) and
> 2.4.0-prerelease seems to be doing the Right Thing.
> 
> My standard tests for this don't cover msync --- do you want me to
> give that a try too?
> 
> --Stephen
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> Please read the FAQ at http://www.tux.org/lkml/
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Races in affs_unlink(), affs_rmdir() and affs_rename()
  2000-12-31 12:58                   ` Roman Zippel
@ 2001-04-21 20:06                     ` Alexander Viro
  2001-04-21 22:16                       ` Roman Zippel
  0 siblings, 1 reply; 56+ messages in thread
From: Alexander Viro @ 2001-04-21 20:06 UTC (permalink / raw)
  To: Roman Zippel; +Cc: Linus Torvalds, linux-kernel

mkdir /A
mkdir /B
mkdir /C
touch /A/a
ln /A/a /B/b
ln /A/a /C/c
rm /A/a &
rm /B/b

can corrupt filesystem. Scenario:

unlink("/B/b") locks /B, removes "b" and unlocks /B. Then it calls
affs_remove_link(), which blocks.

unlink("/A/a") locks /A, removes "a" and unlocks /A. Then it calls
affs_remove_link(). Which locks /B, renames removed entry into "b",
removes old "b" and inserts renamed "a" into /B.

The rest is irrelevant - we're already in it.

Similar race exists between unlink() and rename();

mkdir /A
mkdir /B
mkdir /C
touch /A/a
touch /B/a
ln /A/a /B/b
ln /A/a /C/c
rm /A/a &
mv /B/a /B/b
- similar scenario, different source of affs_remove_header().

Another one: unlink() and rmdir():
mkdir /A
mkdir /B
touch /A/a
ln /A/a /B/a
rm /A/a &
rmdir /B

Since you don't lock /B for affs_empty_dir(), you can hit the
window between removing old /B/a and inserting renamed /A/a into /B.
Notice that VFS _does_ lock /B (->i_zombie), but affs_remove_link()
for /A/a doesn't even look at it.

Same thing for rename()/rmdir() (rmdir victim contains a link to rename
target, apply the previous scenario).
								Al


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Races in affs_unlink(), affs_rmdir() and affs_rename()
  2001-04-21 20:06                     ` Races in affs_unlink(), affs_rmdir() and affs_rename() Alexander Viro
@ 2001-04-21 22:16                       ` Roman Zippel
  2001-04-22  5:53                         ` Alexander Viro
  0 siblings, 1 reply; 56+ messages in thread
From: Roman Zippel @ 2001-04-21 22:16 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Roman Zippel, Linus Torvalds, linux-kernel

Hi,

Alexander Viro wrote:

> unlink("/B/b") locks /B, removes "b" and unlocks /B. Then it calls
> affs_remove_link(), which blocks.
> 
> unlink("/A/a") locks /A, removes "a" and unlocks /A. Then it calls
> affs_remove_link(). Which locks /B, renames removed entry into "b",
> removes old "b" and inserts renamed "a" into /B.
> 
> The rest is irrelevant - we're already in it.

Thanks for finding that one, but it should be easy to fix. I can remove
the parent pointer in aff_remove_hash and check for that before I try to
rename that entry.

> Since you don't lock /B for affs_empty_dir(), you can hit the
> window between removing old /B/a and inserting renamed /A/a into /B.
> Notice that VFS _does_ lock /B (->i_zombie), but affs_remove_link()
> for /A/a doesn't even look at it.

I thought about that one and I know it should be locked. The reason I
don't do right now is, that affs supports hardlinks to dirs. The problem
are especially recursive links, e.g.:

mkdir A
ln A A/B
rm A/B

This is possible with affs, but will already deadlock in vfs.

mkdir A
mkdir A/B
ln A A/B/C
rm A/B/C/A &
rm A/B/C &
rm A/B

Every rm already takes the hash lock of the parent and then I can't
simply also take the hash lock of the dir itself. What I actually want
to do is to insert a reverse is_subdir() check before taking the lock.
On the other hand I was thinking whether I should allow links to dirs at
all and just show them as empty/readonly dirs. For 2.4 that's probably
safer, as it would require a lot of locking changes in vfs and the other
fs to support this properly, particularly moving most of the locking
from vfs into the fs.

bye, Roman

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Races in affs_unlink(), affs_rmdir() and affs_rename()
  2001-04-21 22:16                       ` Roman Zippel
@ 2001-04-22  5:53                         ` Alexander Viro
  2001-04-22 12:57                           ` Roman Zippel
  0 siblings, 1 reply; 56+ messages in thread
From: Alexander Viro @ 2001-04-22  5:53 UTC (permalink / raw)
  To: Roman Zippel; +Cc: Roman Zippel, Linus Torvalds, linux-kernel



On Sun, 22 Apr 2001, Roman Zippel wrote:

> This is possible with affs, but will already deadlock in vfs.
> 
> mkdir A
> mkdir A/B
> ln A A/B/C
> rm A/B/C/A &
> rm A/B/C &
> rm A/B
> 
> Every rm already takes the hash lock of the parent and then I can't
> simply also take the hash lock of the dir itself. What I actually want
> to do is to insert a reverse is_subdir() check before taking the lock.
> On the other hand I was thinking whether I should allow links to dirs at
> all and just show them as empty/readonly dirs. For 2.4 that's probably
> safer, as it would require a lot of locking changes in vfs and the other
> fs to support this properly, particularly moving most of the locking
> from vfs into the fs.

And all that to support a misfeature present in one legacy filesystem?
Don't forget that find et.al. are going to die on loops in directory
tree, so you'd also need to change large chunk of userland code.

By the way, how would you detect the attempts to detach a subtree by
rmdir()/rename() with the multiple links on directories? Again, forget about
the VFS side of that business, the question is how to check that
required change doesn't make on-disk data structures inconsistent.

As far I know even native (AmigaOS) implementation doesn't handle
directory links correctly. So I don't see much point in allowing them
even for compatibility purposes.

Seriously, screw the directory links - getting the thing work correctly
without them has much higher priority.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Races in affs_unlink(), affs_rmdir() and affs_rename()
  2001-04-22  5:53                         ` Alexander Viro
@ 2001-04-22 12:57                           ` Roman Zippel
  2001-04-22 13:15                             ` Alexander Viro
  0 siblings, 1 reply; 56+ messages in thread
From: Roman Zippel @ 2001-04-22 12:57 UTC (permalink / raw)
  To: Alexander Viro; +Cc: linux-kernel

Hi,

Alexander Viro wrote:

> > all and just show them as empty/readonly dirs. For 2.4 that's probably
> > safer, as it would require a lot of locking changes in vfs and the other
> > fs to support this properly, particularly moving most of the locking
> > from vfs into the fs.
> 
> And all that to support a misfeature present in one legacy filesystem?
> Don't forget that find et.al. are going to die on loops in directory
> tree, so you'd also need to change large chunk of userland code.

It's not the only reason, but right now I can't even offer it as a mount
option. On the other hand I could also export them as symlinks.
Anyway, the major reason I'd like to see it is, that it IMO could
simplify locking. All possible locks don't need to be taken in advance,
instead they can be taken when needed. Also unlink/rename of files/dirs
are no specially cases anymore (at least locking wise).
VFS would just operate on dentries and the fs works with the inodes.
With affs I tried to show how it could look on the fs side.

> By the way, how would you detect the attempts to detach a subtree by
> rmdir()/rename() with the multiple links on directories? Again, forget about
> the VFS side of that business, the question is how to check that
> required change doesn't make on-disk data structures inconsistent.

Do you have an example? At the affs side there is no big difference
between link to files or dirs.

bye, Roman

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Races in affs_unlink(), affs_rmdir() and affs_rename()
  2001-04-22 12:57                           ` Roman Zippel
@ 2001-04-22 13:15                             ` Alexander Viro
  0 siblings, 0 replies; 56+ messages in thread
From: Alexander Viro @ 2001-04-22 13:15 UTC (permalink / raw)
  To: Roman Zippel; +Cc: linux-kernel



On Sun, 22 Apr 2001, Roman Zippel wrote:

> instead they can be taken when needed. Also unlink/rename of files/dirs
> are no specially cases anymore (at least locking wise).
> VFS would just operate on dentries and the fs works with the inodes.
> With affs I tried to show how it could look on the fs side.

I will believe it when I see it. So far the code is racy and I don't
see a way to fix the rmdir()/unlink() one without holding two locks at
once.

> > By the way, how would you detect the attempts to detach a subtree by
> > rmdir()/rename() with the multiple links on directories? Again, forget about
> > the VFS side of that business, the question is how to check that
> > required change doesn't make on-disk data structures inconsistent.
> 
> Do you have an example? At the affs side there is no big difference
> between link to files or dirs.

Loop creation:

/A/B and /C/D are links to the same directory. mv /A /C/D/A creates a loop.

Once you have a loop you either have it forever (all directories invloved
are non-empty) _or_ you have to check whether rmdir() is going to make
graph disconnected and I'd like to see how you do it.


^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2001-04-22 13:15 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2000-12-30  0:25 test13-pre6 Linus Torvalds
2000-12-30  0:49 ` test13-pre6 Alexander Viro
2000-12-30  1:03   ` test13-pre6 Linus Torvalds
2000-12-30 18:09     ` test13-pre6 Alexander Viro
2000-12-30  2:25 ` test13-pre6 Daniel Phillips
2000-12-30  3:16   ` test13-pre6 Linus Torvalds
2000-12-30 18:58     ` [RFC] Generic deferred file writing Daniel Phillips
2000-12-30 20:05       ` Linus Torvalds
2000-12-30 20:06       ` Alexander Viro
2000-12-30 20:21         ` Linus Torvalds
2000-12-30 21:10           ` Andreas Dilger
2000-12-30 21:46           ` Alexander Viro
2000-12-30 23:12             ` Daniel Phillips
2000-12-30 22:00           ` Eric W. Biederman
2000-12-30 22:44             ` Linus Torvalds
2000-12-31  0:26               ` Eric W. Biederman
2000-12-31  1:02             ` Andrea Arcangeli
2000-12-31  1:13               ` Chris Wedgwood
2000-12-31  1:50               ` Alexander Viro
2000-12-31  2:34                 ` Andrea Arcangeli
2000-12-31  2:09               ` Roman Zippel
2000-12-31  2:28                 ` Linus Torvalds
2000-12-31 12:58                   ` Roman Zippel
2001-04-21 20:06                     ` Races in affs_unlink(), affs_rmdir() and affs_rename() Alexander Viro
2001-04-21 22:16                       ` Roman Zippel
2001-04-22  5:53                         ` Alexander Viro
2001-04-22 12:57                           ` Roman Zippel
2001-04-22 13:15                             ` Alexander Viro
2000-12-31 14:38                   ` [RFC] Generic deferred file writing Andrea Arcangeli
2000-12-31 16:33                     ` Linus Torvalds
2000-12-31 16:50                       ` Andrea Arcangeli
2000-12-31 16:51                       ` Alexander Viro
2000-12-31 17:12                         ` Linus Torvalds
2000-12-31 18:30                   ` Daniel Phillips
2000-12-31 18:44                     ` Linus Torvalds
2000-12-31 19:10                       ` Daniel Phillips
2000-12-31 19:31                         ` Linus Torvalds
2000-12-31 21:03                       ` Roman Zippel
2000-12-31 21:32                         ` Linus Torvalds
2001-01-02 18:27                   ` Chris Mason
2000-12-30  3:08 ` test13-pre6 (Fork Bug with Athlons? Temporary Fix) Byron Stanoszek
2000-12-30  3:36   ` Linus Torvalds
2000-12-30  5:55     ` Andi Kleen
2000-12-30  5:13   ` Linus Torvalds
2000-12-30  8:13   ` Graham Murray
2000-12-30  4:21 ` test13-pre6 Dan Aloni
2001-01-04 20:23 ` test13-pre6 Stephen C. Tweedie
2001-01-04 22:15   ` test13-pre6 stewart
     [not found] <Pine.LNX.4.10.10012311726230.1671-100000@penguin.transmeta.com>
2001-01-01  2:50 ` [RFC] Generic deferred file writing Roman Zippel
2001-01-01  3:47   ` Alexander Viro
2001-01-01 12:44     ` Roman Zippel
2001-01-01 15:16       ` Alexander Viro
2001-01-02  3:00         ` Roman Zippel
2001-01-02  5:00           ` Alexander Viro
2001-01-02 16:53             ` Roman Zippel
2001-01-01 20:00     ` Daniel Phillips

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox