downgrade_write replacement in remap_file

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* downgrade_write replacement in remap_file_pages
@ 2004-06-08 15:44 Andrea Arcangeli
  2004-06-08 16:31 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Andrea Arcangeli @ 2004-06-08 15:44 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds; +Cc: linux-kernel

Apparently downgrade_write deadlocks the kernel in the mmap_sem
under load. I suspect some rwsem bug. Anyways it's matter of time before
in my tree I replace the rwsem implementation with my spinlock based
common code implementation again that I can understand trivially (unlike
the current code). I don't mind a microoptimization when the code is so
complicated, so I don't mind too much to fix whatever current bug in
downgrade_write.

In the meantime to workaround the deadlock (and to verify if this make
the deadlock go away, which returned a positive result) I implemented
this patch: this doesn't fix downgrade_wite, but I'm posting it here
because I believe it's much better code regardless of the
downgrade_write issue.  With this patch we'll never run down_write again
in production, the down_write will be taken only once when the db or the
simulator startup (during the very first page fault) and never again, in
turn providing (at least in theory) better runtime scalability.

Index: linux-2.5/mm/fremap.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/mm/fremap.c,v
retrieving revision 1.24
diff -u -p -r1.24 fremap.c
--- linux-2.5/mm/fremap.c	23 May 2004 05:07:26 -0000	1.24
+++ linux-2.5/mm/fremap.c	8 Jun 2004 15:38:21 -0000
@@ -161,6 +161,7 @@ asmlinkage long sys_remap_file_pages(uns
 	unsigned long end = start + size;
 	struct vm_area_struct *vma;
 	int err = -EINVAL;
+	int has_write_lock = 0;
 
 	if (__prot)
 		return err;
@@ -181,7 +182,8 @@ asmlinkage long sys_remap_file_pages(uns
 #endif
 
 	/* We need down_write() to change vma->vm_flags. */
-	down_write(&mm->mmap_sem);
+	down_read(&mm->mmap_sem);
+ retry:
 	vma = find_vma(mm, start);
 
 	/*
@@ -198,8 +200,13 @@ asmlinkage long sys_remap_file_pages(uns
 				end <= vma->vm_end) {
 
 		/* Must set VM_NONLINEAR before any pages are populated. */
-		if (pgoff != linear_page_index(vma, start) &&
-		    !(vma->vm_flags & VM_NONLINEAR)) {
+		if (unlikely(pgoff != linear_pgoff && !(vma->vm_flags & VM_NONLINEAR))) {
+			if (!has_write_lock) {
+				up_read(&mm->mmap_sem);
+				down_write(&mm->mmap_sem);
+				has_write_lock = 1;
+				goto retry;
+			}
 			mapping = vma->vm_file->f_mapping;
 			spin_lock(&mapping->i_mmap_lock);
 			flush_dcache_mmap_lock(mapping);
@@ -212,8 +219,6 @@ asmlinkage long sys_remap_file_pages(uns
 			spin_unlock(&mapping->i_mmap_lock);
 		}
 
-		/* ->populate can take a long time, so downgrade the lock. */
-		downgrade_write(&mm->mmap_sem);
 		err = vma->vm_ops->populate(vma, start, size,
 					    vma->vm_page_prot,
 					    pgoff, flags & MAP_NONBLOCK);
@@ -223,10 +228,11 @@ asmlinkage long sys_remap_file_pages(uns
 		 * it after ->populate completes, and that would prevent
 		 * downgrading the lock.  (Locks can't be upgraded).
 		 */
+	}
+	if (likely(!has_write_lock))
 		up_read(&mm->mmap_sem);
-	} else {
+	else
 		up_write(&mm->mmap_sem);
-	}
 
 	return err;
 }

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: downgrade_write replacement in remap_file_pages
  2004-06-08 15:44 downgrade_write replacement in remap_file_pages Andrea Arcangeli
@ 2004-06-08 16:31 ` Andrew Morton
  2004-06-08 16:39   ` Linus Torvalds
  2004-06-08 19:04   ` David Howells
  2004-06-08 17:05 ` David Howells
  2004-06-08 19:36 ` William Lee Irwin III
  2 siblings, 2 replies; 17+ messages in thread
From: Andrew Morton @ 2004-06-08 16:31 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: torvalds, linux-kernel, David Howells

Andrea Arcangeli <andrea@suse.de> wrote:
>
> Apparently downgrade_write deadlocks the kernel in the mmap_sem
> under load. I suspect some rwsem bug. Anyways it's matter of time before
> in my tree I replace the rwsem implementation with my spinlock based
> common code implementation again that I can understand trivially (unlike
> the current code). I don't mind a microoptimization when the code is so
> complicated, so I don't mind too much to fix whatever current bug in
> downgrade_write.

I must say I agree with the sentiments - the current implementation doesn't
have a very attractive complexity/benefit ratio.  But I wrote a
spinlock-based version three years ago too, so I'm biased ;) Certainly it
is bog-simple and fixes up the overflow-at-32768-waiters bug.

I think a spinlock-based implementation would be OK if it was x86-specific,
because x86 spin_unlock is cheap.  But some architectures do atomic ops in
spin_unlock and won't like it.  Plus those architectures which can
implement atomic_add_return() can implement nice versions of rwsems such as
the ppc64 code.  Although ppc64 still seems to have an overflow bug.

So ho-hum.  As a first step, David, could you please take a look into
what's up with downgrade_write()?

(Then again, we need to have a serious think about the overflow bug.  It's
fatal.  Should we fix it?  If so, the current rwsem implementation is
probably unsalvageable).

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: downgrade_write replacement in remap_file_pages
  2004-06-08 16:31 ` Andrew Morton
@ 2004-06-08 16:39   ` Linus Torvalds
  2004-06-08 19:04   ` David Howells
  1 sibling, 0 replies; 17+ messages in thread
From: Linus Torvalds @ 2004-06-08 16:39 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-kernel, David Howells



On Tue, 8 Jun 2004, Andrew Morton wrote:
> 
> So ho-hum.  As a first step, David, could you please take a look into
> what's up with downgrade_write()?

One reason why I like Andrea's approach (regardless of downgrade_write()) 
is that it seems to avoid taking the heavy lock (write) by default. 

		Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: downgrade_write replacement in remap_file_pages
  2004-06-08 15:44 downgrade_write replacement in remap_file_pages Andrea Arcangeli
  2004-06-08 16:31 ` Andrew Morton
@ 2004-06-08 17:05 ` David Howells
  2004-06-08 22:33   ` Andrea Arcangeli
  2004-06-08 19:36 ` William Lee Irwin III
  2 siblings, 1 reply; 17+ messages in thread
From: David Howells @ 2004-06-08 17:05 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, Linus Torvalds, linux-kernel


> Apparently downgrade_write deadlocks the kernel in the mmap_sem
> under load.

Which implementation of rwsems is your kernel using? The spinlock-based one or
the XADD based one? Have you tried the other version?

Have you more than 32767 processes?

Do you have any stack traces?

David

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: downgrade_write replacement in remap_file_pages
  2004-06-08 16:31 ` Andrew Morton
  2004-06-08 16:39   ` Linus Torvalds
@ 2004-06-08 19:04   ` David Howells
  1 sibling, 0 replies; 17+ messages in thread
From: David Howells @ 2004-06-08 19:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, torvalds, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1327 bytes --]

> 
> So ho-hum.  As a first step, David, could you please take a look into
> what's up with downgrade_write()?

Yes. Patch attached. The critical bit is in the hunk beginning at line 95.

It appears my rwsem testsuite didn't have a downgrade write test. I added one
and the bug became obvious.

> (Then again, we need to have a serious think about the overflow bug.  It's
> fatal.  Should we fix it?  If so, the current rwsem implementation is
> probably unsalvageable).

The optimised rwsem implementation is probably okay, provided you don't have a
32-bit machine with silly amounts of memory, for which the 32767 process limit
is probably reasonable. Maybe make it contingent on CONFIG_HIGHMEM on X86?

If you don't want to use an optimised version, there _is_ a spinlock version
as well.

For 64-bit archs like x86_64 and ppc64 this algorithm can easily made to use a
64-bit counter (lib/rwsem.c wants a signed long counter), so it's just a
matter of using XADDQ, LDARX/STDCX or equivalents (I have patches for those
two). I've also got MIPS64 patches somewhere too, though they're horribly out
of date.

Alpha and S390 already do the 64-bit counter thing. And if sparc64 has a
64-bit CAS, that can be used. IA64's rwsems should be 64-bitable too.

Using a 64-bit counter gives you a limit of 2 billion processes.

David



[-- Attachment #2: Type: text/plain, Size: 4087 bytes --]


Signed-Off-By: David Howells <dhowells@redhat.com>
D: Stop downgrade_write() from under-adjusting the rwsem counter in optimised
D: rw-semaphores.

diff -urp linux-2.6.6/lib/rwsem.c linux-2.6.6-keys/lib/rwsem.c
--- linux-2.6.6/lib/rwsem.c	2004-05-11 11:28:57.000000000 +0100
+++ linux-2.6.6-keys/lib/rwsem.c	2004-06-08 19:31:35.550653817 +0100
@@ -29,15 +29,15 @@ void rwsemtrace(struct rw_semaphore *sem
 
 /*
  * handle the lock being released whilst there are processes blocked on it that can now run
- * - if we come here, then:
- *   - the 'active part' of the count (&0x0000ffff) reached zero but has been re-incremented
+ * - if we come here from up_xxxx(), then:
+ *   - the 'active part' of the count (&0x0000ffff) had reached zero (but may have changed)
  *   - the 'waiting part' of the count (&0xffff0000) is negative (and will still be so)
  *   - there must be someone on the queue
  * - the spinlock must be held by the caller
  * - woken process blocks are discarded from the list after having task zeroed
- * - writers are only woken if wakewrite is non-zero
+ * - writers are only woken if downgrading is false
  */
-static inline struct rw_semaphore *__rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)
+static inline struct rw_semaphore *__rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 {
 	struct rwsem_waiter *waiter;
 	struct task_struct *tsk;
@@ -46,10 +46,12 @@ static inline struct rw_semaphore *__rws
 
 	rwsemtrace(sem,"Entering __rwsem_do_wake");
 
-	if (!wakewrite)
+	if (downgrading)
 		goto dont_wake_writers;
 
-	/* only wake someone up if we can transition the active part of the count from 0 -> 1 */
+	/* if we came through an up_xxxx() call, we only only wake someone up
+	 * if we can transition the active part of the count from 0 -> 1
+	 */
  try_again:
 	oldcount = rwsem_atomic_update(RWSEM_ACTIVE_BIAS,sem) - RWSEM_ACTIVE_BIAS;
 	if (oldcount & RWSEM_ACTIVE_MASK)
@@ -78,9 +80,10 @@ static inline struct rw_semaphore *__rws
 	if (waiter->flags & RWSEM_WAITING_FOR_WRITE)
 		goto out;
 
-	/* grant an infinite number of read locks to the readers at the front of the queue
-	 * - note we increment the 'active part' of the count by the number of readers (less one
-	 *   for the activity decrement we've already done) before waking any processes up
+	/* grant an infinite number of read locks to the readers at the front
+	 * of the queue
+	 * - note we increment the 'active part' of the count by the number of
+	 *   readers before waking any processes up
 	 */
  readers_only:
 	woken = 0;
@@ -95,8 +98,10 @@ static inline struct rw_semaphore *__rws
 	} while (waiter->flags & RWSEM_WAITING_FOR_READ);
 
 	loop = woken;
-	woken *= RWSEM_ACTIVE_BIAS-RWSEM_WAITING_BIAS;
-	woken -= RWSEM_ACTIVE_BIAS;
+	woken *= RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
+	if (!downgrading)
+		woken -= RWSEM_ACTIVE_BIAS; /* we'd already done one increment
+					     * earlier */
 	rwsem_atomic_add(woken,sem);
 
 	next = sem->wait_list.next;
@@ -150,7 +155,7 @@ static inline struct rw_semaphore *rwsem
 	 * - it might even be this process, since the waker takes a more active part
 	 */
 	if (!(count & RWSEM_ACTIVE_MASK))
-		sem = __rwsem_do_wake(sem,1);
+		sem = __rwsem_do_wake(sem, 0);
 
 	spin_unlock(&sem->wait_lock);
 
@@ -201,7 +206,7 @@ struct rw_semaphore fastcall __sched *rw
 
 /*
  * handle waking up a waiter on the semaphore
- * - up_read has decremented the active part of the count if we come here
+ * - up_read/up_write has decremented the active part of the count if we come here
  */
 struct rw_semaphore fastcall *rwsem_wake(struct rw_semaphore *sem)
 {
@@ -211,7 +216,7 @@ struct rw_semaphore fastcall *rwsem_wake
 
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem,1);
+		sem = __rwsem_do_wake(sem, 0);
 
 	spin_unlock(&sem->wait_lock);
 
@@ -233,7 +238,7 @@ struct rw_semaphore fastcall *rwsem_down
 
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem,0);
+		sem = __rwsem_do_wake(sem, 1);
 
 	spin_unlock(&sem->wait_lock);
 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: downgrade_write replacement in remap_file_pages
  2004-06-08 15:44 downgrade_write replacement in remap_file_pages Andrea Arcangeli
  2004-06-08 16:31 ` Andrew Morton
  2004-06-08 17:05 ` David Howells
@ 2004-06-08 19:36 ` William Lee Irwin III
  2004-06-08 22:52   ` Andrea Arcangeli
  2004-06-09 12:19   ` [PATCH] A generic_file_sendpage() Alexander Nyberg
  2 siblings, 2 replies; 17+ messages in thread
From: William Lee Irwin III @ 2004-06-08 19:36 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, Linus Torvalds, linux-kernel

On Tue, Jun 08, 2004 at 05:44:38PM +0200, Andrea Arcangeli wrote:
> Apparently downgrade_write deadlocks the kernel in the mmap_sem
> under load. I suspect some rwsem bug. Anyways it's matter of time before
> in my tree I replace the rwsem implementation with my spinlock based
> common code implementation again that I can understand trivially (unlike
> the current code). I don't mind a microoptimization when the code is so
> complicated, so I don't mind too much to fix whatever current bug in
> downgrade_write.
> In the meantime to workaround the deadlock (and to verify if this make
> the deadlock go away, which returned a positive result) I implemented
> this patch: this doesn't fix downgrade_wite, but I'm posting it here
> because I believe it's much better code regardless of the
> downgrade_write issue.  With this patch we'll never run down_write again
> in production, the down_write will be taken only once when the db or the
> simulator startup (during the very first page fault) and never again, in
> turn providing (at least in theory) better runtime scalability.

I've been using something similar since about May 20. However, it was
unclear that it was a deadlock as opposed to semaphore contention from
the reports I got.


On Tue, Jun 08, 2004 at 05:44:38PM +0200, Andrea Arcangeli wrote:
> -		if (pgoff != linear_page_index(vma, start) &&
> -		    !(vma->vm_flags & VM_NONLINEAR)) {
> +		if (unlikely(pgoff != linear_pgoff && !(vma->vm_flags & VM_NONLINEAR))) {

There is no linear_pgoff variable...


-- wli

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: downgrade_write replacement in remap_file_pages
  2004-06-08 17:05 ` David Howells
@ 2004-06-08 22:33   ` Andrea Arcangeli
  0 siblings, 0 replies; 17+ messages in thread
From: Andrea Arcangeli @ 2004-06-08 22:33 UTC (permalink / raw)
  To: David Howells; +Cc: Andrew Morton, Linus Torvalds, linux-kernel

On Tue, Jun 08, 2004 at 06:05:08PM +0100, David Howells wrote:
> 
> > Apparently downgrade_write deadlocks the kernel in the mmap_sem
> > under load.
> 
> Which implementation of rwsems is your kernel using? The spinlock-based one or
> the XADD based one? Have you tried the other version?

stock 2.6 rwsem implementation compiled for PII (I still have tons of
patches to forward port from 2.4-aa, if I would port my rwsem to 2.6 I
would have never noticed this race)

> Have you more than 32767 processes?

no, there should be around 10k processes, sure not more than 20k.

> Do you have any stack traces?

yes:

strace:

open("/proc/6022/stat", O_RDONLY)       = 6
read(6, 

SYSRQ+T

 [<c01fa365>] rwsem_down_read_failed+0x85/0x121
 [<c01a0da9>] .text.lock.array+0x49/0xd0
 [<c01e4412>] avc_has_perm+0x62/0x78
 [<c01e5ba3>] inode_has_perm+0x53/0x90
 [<c019d3b1>] proc_info_read+0x51/0x150
 [<c016ada1>] vfs_read+0xe1/0x130
 [<c016b001>] sys_read+0x91/0xf0

it's the down_read in proc_pid_stat. the workload running at the same
time is heavy remap_file_pages. They're processes so the only race
happens against the /proc filesystem and that's why it hangs there.
Somehow downgrade_write in remap_file_pages races with down_read in
/proc. My patch workarounds the deadlock by not calling downgrade_write,
but I posted it to l-k because my code is better anyways since there's
no good reason to ever call down_write in the fast path (and if we don't
start down_write we don't need downgrade_write anymore). The only thing
bitten by downgrade_write left is xfs.

You can imagine which is the critical apps that triggers this deadlock,
not many apps are using remap_file_pages in production yet, and very few
are going to call it in a flood.

I agree with Andrew the limit of 32k processes needs fixing, but nobody
noticed yet with any real app, so it's a low prio matter.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: downgrade_write replacement in remap_file_pages
  2004-06-08 19:36 ` William Lee Irwin III
@ 2004-06-08 22:52   ` Andrea Arcangeli
  2004-06-09 12:19   ` [PATCH] A generic_file_sendpage() Alexander Nyberg
  1 sibling, 0 replies; 17+ messages in thread
From: Andrea Arcangeli @ 2004-06-08 22:52 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, Linus Torvalds,
	linux-kernel

On Tue, Jun 08, 2004 at 12:36:21PM -0700, William Lee Irwin III wrote:
> On Tue, Jun 08, 2004 at 05:44:38PM +0200, Andrea Arcangeli wrote:
> > -		if (pgoff != linear_page_index(vma, start) &&
> > -		    !(vma->vm_flags & VM_NONLINEAR)) {
> > +		if (unlikely(pgoff != linear_pgoff && !(vma->vm_flags & VM_NONLINEAR))) {
> 
> There is no linear_pgoff variable...

I tested it on a different codebase and I didn't notice this issue while
fixing the rejects, fixing it up is easy, replace linear_pgoff with
linear_page_index(vma, start).

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] A generic_file_sendpage()
  2004-06-08 19:36 ` William Lee Irwin III
  2004-06-08 22:52   ` Andrea Arcangeli
@ 2004-06-09 12:19   ` Alexander Nyberg
  2004-06-10 19:49     ` Pavel Machek
  2004-06-25 19:19     ` Jörn Engel
  1 sibling, 2 replies; 17+ messages in thread
From: Alexander Nyberg @ 2004-06-09 12:19 UTC (permalink / raw)
  To: viro; +Cc: linux-kernel

The sendfile() for all file systems remain unusable as it is right now,
only works for sending data to socket. But that should be as much performance
enhancing as this yes?

Please hit me with cluebat for what I'm missing.
(yes, rebooted between all copying)

-----------------------------------
Normal read/write with 16K buffers:

comp1 with 4 scsi disks sw raid0:
kernie:/mnt/data/playground# time ./copyf tref x1
size: 2097152000

real    2m9.680s
user    0m0.075s
sys     0m21.019s

comp2 with single ide disk:
om3:/home/alex# time ./copyf haha c1
size: 1048576000

real    1m25.104s
user    0m0.042s
sys     0m14.880s


-----------------------------------
Normal read/write with 64K buffers:

comp1 with 4 scsi disks sw raid0:
kernie:/mnt/data/playground# time ./copyf tref x3
size: 2097152000

real    2m11.160s
user    0m0.035s
sys     0m20.745s


comp2 with single ide disk:
om3:/home/alex# time ./copyf haha c3
size: 1048576000

real    1m25.651s
user    0m0.052s
sys     0m14.020s


-----------------------------------
Using sendfile() to copy entire files:

comp1 with 4 scsi disks sw raid0:
kernie:/mnt/data/playground# time ./sendf tref x2
size: 2097152000

real    2m9.645s
user    0m0.001s
sys     0m19.961s

and again:

real    2m9.675s
user    0m0.001s
sys     0m19.271s


comp2 with single ide disk:
om3:/home/alex# time ./sendf haha c2
size: 1048576000

real    1m24.395s
user    0m0.002s
sys     0m13.151s

and again:

real    1m23.781s
user    0m0.001s
sys     0m12.967s



--- include/linux/fs_orig.h     2004-06-09 00:37:29.000000000 +0200
+++ include/linux/fs.h  2004-06-07 18:13:54.000000000 +0200
@@ -1405,6 +1405,7 @@ extern ssize_t do_sync_write(struct file
 ssize_t generic_file_write_nolock(struct file *file, const struct iovec *iov,
                                unsigned long nr_segs, loff_t *ppos);
 extern ssize_t generic_file_sendfile(struct file *, loff_t *, size_t, read_actor_t, void __user *);
+extern ssize_t generic_file_sendpage(struct file *, struct page *, int, size_t, loff_t *, int);
 extern void do_generic_mapping_read(struct address_space *mapping,
                                    struct file_ra_state *, struct file *,
                                    loff_t *, read_descriptor_t *, read_actor_t);
--- mm/filemap_orig.c   2004-06-09 00:37:45.000000000 +0200
+++ mm/filemap.c        2004-06-08 22:19:48.000000000 +0200
@@ -961,7 +961,32 @@ generic_file_read(struct file *filp, cha

 EXPORT_SYMBOL(generic_file_read);

-int file_send_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size)
+ssize_t generic_file_sendpage(struct file *out_file, struct page *page,
+                       int offset, size_t size, loff_t *pos, int more)
+{
+       void *addr;
+       int ret;
+       mm_segment_t old_fs;
+
+       old_fs = get_fs();
+       set_fs(KERNEL_DS);
+
+       addr = kmap(page);
+       if (!addr) {
+               set_fs(old_fs);
+               return -ENOMEM;
+       }
+
+       ret = out_file->f_op->write(out_file, addr + offset, size, pos);
+
+       kunmap(addr);
+
+       set_fs(old_fs);
+       return ret;
+}
+
+int file_send_actor(read_descriptor_t * desc, struct page *page,
+                       unsigned long offset, unsigned long size)
 {
        ssize_t written;
        unsigned long count = desc->count;
--- fs/ext3/file_orig.c 2004-06-09 00:42:50.000000000 +0200
+++ fs/ext3/file.c      2004-06-07 18:12:19.000000000 +0200
@@ -129,6 +129,7 @@ struct file_operations ext3_file_operati
        .release        = ext3_release_file,
        .fsync          = ext3_sync_file,
        .sendfile       = generic_file_sendfile,
+       .sendpage       = generic_file_sendpage,
 };

 struct inode_operations ext3_file_inode_operations = {
@@ -140,4 +141,3 @@ struct inode_operations ext3_file_inode_
        .removexattr    = ext3_removexattr,
        .permission     = ext3_permission,
 };
-



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] A generic_file_sendpage()
  2004-06-09 12:19   ` [PATCH] A generic_file_sendpage() Alexander Nyberg
@ 2004-06-10 19:49     ` Pavel Machek
  2004-06-25 19:19     ` Jörn Engel
  1 sibling, 0 replies; 17+ messages in thread
From: Pavel Machek @ 2004-06-10 19:49 UTC (permalink / raw)
  To: Alexander Nyberg; +Cc: viro, linux-kernel

Hi!

> The sendfile() for all file systems remain unusable as it is right now,
> only works for sending data to socket. But that should be as much performance
> enhancing as this yes?
> 
> Please hit me with cluebat for what I'm missing.
> (yes, rebooted between all copying)
> 

...this will also allow COW filesystems and copy-on-server...
It would be nice if it could go in soon, so userspace can
start using it and COW/copy-on-server can be usefull in 2.8...


				Pavel
-- 
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms         


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] A generic_file_sendpage()
  2004-06-09 12:19   ` [PATCH] A generic_file_sendpage() Alexander Nyberg
  2004-06-10 19:49     ` Pavel Machek
@ 2004-06-25 19:19     ` Jörn Engel
  2004-06-25 19:46       ` viro
  2004-06-25 20:05       ` Andreas Dilger
  1 sibling, 2 replies; 17+ messages in thread
From: Jörn Engel @ 2004-06-25 19:19 UTC (permalink / raw)
  To: Alexander Nyberg; +Cc: viro, linux-kernel, Pavel Machek

On Wed, 9 June 2004 14:19:19 +0200, Alexander Nyberg wrote:
> 
> The sendfile() for all file systems remain unusable as it is right now,
> only works for sending data to socket. But that should be as much performance
> enhancing as this yes?
> 
> Please hit me with cluebat for what I'm missing.
> (yes, rebooted between all copying)

A similar patch I've once written gave about 10% performance boost
with warm caches.  Your measurements are with could caches but still
give a noticeable boost.  Nice.

> --- include/linux/fs_orig.h     2004-06-09 00:37:29.000000000 +0200
> +++ include/linux/fs.h  2004-06-07 18:13:54.000000000 +0200
> @@ -1405,6 +1405,7 @@ extern ssize_t do_sync_write(struct file
>  ssize_t generic_file_write_nolock(struct file *file, const struct iovec *iov,
>                                 unsigned long nr_segs, loff_t *ppos);
>  extern ssize_t generic_file_sendfile(struct file *, loff_t *, size_t, read_actor_t, void __user *);
> +extern ssize_t generic_file_sendpage(struct file *, struct page *, int, size_t, loff_t *, int);
>  extern void do_generic_mapping_read(struct address_space *mapping,
>                                     struct file_ra_state *, struct file *,
>                                     loff_t *, read_descriptor_t *, read_actor_t);
> --- mm/filemap_orig.c   2004-06-09 00:37:45.000000000 +0200
> +++ mm/filemap.c        2004-06-08 22:19:48.000000000 +0200
> @@ -961,7 +961,32 @@ generic_file_read(struct file *filp, cha
> 
>  EXPORT_SYMBOL(generic_file_read);
> 
> +ssize_t generic_file_sendpage(struct file *out_file, struct page *page,
> +                       int offset, size_t size, loff_t *pos, int more)
> +{
> +       void *addr;
> +       int ret;
> +       mm_segment_t old_fs;
> +
> +       old_fs = get_fs();
> +       set_fs(KERNEL_DS);
> +
> +       addr = kmap(page);
> +       if (!addr) {
> +               set_fs(old_fs);
> +               return -ENOMEM;
> +       }
> +
> +       ret = out_file->f_op->write(out_file, addr + offset, size, pos);
> +
> +       kunmap(addr);
> +
> +       set_fs(old_fs);
> +       return ret;
> +}

Your patch is *much* smaller than mine.  Looks lean and mean.  But you
depend on the struct file* passed to generic_file_sendpage().

One of my goals for 2.7 is to get rid of all users of struct file* in
the various read-, write- and send-functions.  Currently, there are
four of them, you would introduce number five.

Is is possible get around using out_file without making the patch much
bigger?

If you need a motivation for this, think cowlinks.  Copying inodes
would be trivial, if you didn't need a fscking file for every copy
operation.  If you actually open(), sendfile() and close(), you end up
with tons of possible errors and a nightmare of cleanup code.  Not
fun.

> --- fs/ext3/file_orig.c 2004-06-09 00:42:50.000000000 +0200
> +++ fs/ext3/file.c      2004-06-07 18:12:19.000000000 +0200
> @@ -129,6 +129,7 @@ struct file_operations ext3_file_operati
> +       .sendpage       = generic_file_sendpage,

Obviously also works for ext2.

Jörn

-- 
They laughed at Galileo.  They laughed at Copernicus.  They laughed at
Columbus. But remember, they also laughed at Bozo the Clown.
-- unknown

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] A generic_file_sendpage()
  2004-06-25 19:19     ` Jörn Engel
@ 2004-06-25 19:46       ` viro
  2004-06-25 20:03         ` Jörn Engel
  2004-06-25 20:05       ` Andreas Dilger
  1 sibling, 1 reply; 17+ messages in thread
From: viro @ 2004-06-25 19:46 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Alexander Nyberg, linux-kernel, Pavel Machek

On Fri, Jun 25, 2004 at 09:19:24PM +0200, Jörn Engel wrote:
> > +       old_fs = get_fs();
> > +       set_fs(KERNEL_DS);

Eeek...  Don't do that, please - set_fs() is not a good thing to add for
no good reason.
 
> Your patch is *much* smaller than mine.  Looks lean and mean.  But you
> depend on the struct file* passed to generic_file_sendpage().
> 
> One of my goals for 2.7 is to get rid of all users of struct file* in
> the various read-, write- and send-functions.  Currently, there are
> four of them, you would introduce number five.

And how, pray tell, are you going to do that on filesystems that keep
part of context in file->private_data?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] A generic_file_sendpage()
  2004-06-25 19:46       ` viro
@ 2004-06-25 20:03         ` Jörn Engel
  2004-06-26  0:53           ` Trond Myklebust
  0 siblings, 1 reply; 17+ messages in thread
From: Jörn Engel @ 2004-06-25 20:03 UTC (permalink / raw)
  To: viro; +Cc: Alexander Nyberg, linux-kernel, Pavel Machek

On Fri, 25 June 2004 20:46:11 +0100, viro@parcelfarce.linux.theplanet.co.uk wrote:
> > 
> > One of my goals for 2.7 is to get rid of all users of struct file* in
> > the various read-, write- and send-functions.  Currently, there are
> > four of them, you would introduce number five.
> 
> And how, pray tell, are you going to do that on filesystems that keep
> part of context in file->private_data?

Not sure.  NFSv3 appears to be fixable, the only context is the UID,
which happens to be stored in the inode as well.  NFSv4 and cifs could
be worse, I didn't look closely yet.  smbfs accesses the dentry, which
has similar effects, but should be fixable as well.

Do you know of any impossible cases?

Jörn

-- 
My second remark is that our intellectual powers are rather geared to
master static relations and that our powers to visualize processes
evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] A generic_file_sendpage()
  2004-06-25 19:19     ` Jörn Engel
  2004-06-25 19:46       ` viro
@ 2004-06-25 20:05       ` Andreas Dilger
  2004-06-25 20:09         ` Jörn Engel
  1 sibling, 1 reply; 17+ messages in thread
From: Andreas Dilger @ 2004-06-25 20:05 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Alexander Nyberg, viro, linux-kernel, Pavel Machek

[-- Attachment #1: Type: text/plain, Size: 864 bytes --]

On Jun 25, 2004  21:19 +0200, Jörn Engel wrote:
> On Wed, 9 June 2004 14:19:19 +0200, Alexander Nyberg wrote:
> > The sendfile() for all file systems remain unusable as it is right now,
> > only works for sending data to socket. But that should be as much
> > performance enhancing as this yes?
> 
> A similar patch I've once written gave about 10% performance boost
> with warm caches.  Your measurements are with could caches but still
> give a noticeable boost.  Nice.

Yes, it would be nice to get this into the kernel, then eventually have
"cp" test if sendfile() works between two files.  That would allow things
like remote file copy for network filesystems, COW, etc much easier.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/             http://members.shaw.ca/golinux/


[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] A generic_file_sendpage()
  2004-06-25 20:05       ` Andreas Dilger
@ 2004-06-25 20:09         ` Jörn Engel
  0 siblings, 0 replies; 17+ messages in thread
From: Jörn Engel @ 2004-06-25 20:09 UTC (permalink / raw)
  To: Alexander Nyberg, viro, linux-kernel, Pavel Machek

On Fri, 25 June 2004 14:05:33 -0600, Andreas Dilger wrote:
> 
> Yes, it would be nice to get this into the kernel, then eventually have
> "cp" test if sendfile() works between two files.  That would allow things
> like remote file copy for network filesystems, COW, etc much easier.

But as long as the patches are as ugly as mine, this is 2.7 work.  In
case anyone missed it:

http://wohnheim.fh-wedel.de/~joern/cowlink/

Jörn

-- 
To my face you have the audacity to advise me to become a thief - the worst
kind of thief that is conceivable, a thief of spiritual things, a thief of
ideas! It is insufferable, intolerable!
-- M. Binet in Scarabouche

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] A generic_file_sendpage()
  2004-06-25 20:03         ` Jörn Engel
@ 2004-06-26  0:53           ` Trond Myklebust
  2004-06-28 11:41             ` Jörn Engel
  0 siblings, 1 reply; 17+ messages in thread
From: Trond Myklebust @ 2004-06-26  0:53 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Alexander Viro, Alexander Nyberg, linux-kernel, Pavel Machek

På fr , 25/06/2004 klokka 16:03, skreiv Jörn Engel:
> Not sure.  NFSv3 appears to be fixable, the only context is the UID,

Huh???? WTF happened to the actual credential?

> which happens to be stored in the inode as well.  NFSv4 and cifs could
> be worse, I didn't look closely yet.  smbfs accesses the dentry, which
> has similar effects, but should be fixable as well.
> 
> Do you know of any impossible cases?

NFS, CIFS, all other networked filesystems that need private context
information beyond what is contained in the struct file. Why?

Trond

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] A generic_file_sendpage()
  2004-06-26  0:53           ` Trond Myklebust
@ 2004-06-28 11:41             ` Jörn Engel
  0 siblings, 0 replies; 17+ messages in thread
From: Jörn Engel @ 2004-06-28 11:41 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Alexander Viro, Alexander Nyberg, linux-kernel, Pavel Machek

On Fri, 25 June 2004 20:53:34 -0400, Trond Myklebust wrote:
> På fr , 25/06/2004 klokka 16:03, skreiv Jörn Engel:
> > Not sure.  NFSv3 appears to be fixable, the only context is the UID,
> 
> Huh???? WTF happened to the actual credential?

No idea, I couldn't find it in the source.

> > which happens to be stored in the inode as well.  NFSv4 and cifs could
> > be worse, I didn't look closely yet.  smbfs accesses the dentry, which
> > has similar effects, but should be fixable as well.
> > 
> > Do you know of any impossible cases?
> 
> NFS, CIFS, all other networked filesystems that need private context
> information beyond what is contained in the struct file. Why?

Darn!  I need to copy an inode.  Currently, I'm opening a two files,
copy between them and close them again.  Open and close are completely
pointless and only complicate things.

The fun part is that cifs would actually benifit from the same things
it complicates.  Oh well.

Jörn

-- 
Geld macht nicht glücklich.
Glück macht nicht satt.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2004-06-28 11:41 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-06-08 15:44 downgrade_write replacement in remap_file_pages Andrea Arcangeli
2004-06-08 16:31 ` Andrew Morton
2004-06-08 16:39   ` Linus Torvalds
2004-06-08 19:04   ` David Howells
2004-06-08 17:05 ` David Howells
2004-06-08 22:33   ` Andrea Arcangeli
2004-06-08 19:36 ` William Lee Irwin III
2004-06-08 22:52   ` Andrea Arcangeli
2004-06-09 12:19   ` [PATCH] A generic_file_sendpage() Alexander Nyberg
2004-06-10 19:49     ` Pavel Machek
2004-06-25 19:19     ` Jörn Engel
2004-06-25 19:46       ` viro
2004-06-25 20:03         ` Jörn Engel
2004-06-26  0:53           ` Trond Myklebust
2004-06-28 11:41             ` Jörn Engel
2004-06-25 20:05       ` Andreas Dilger
2004-06-25 20:09         ` Jörn Engel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox