* NFS client latencies
@ 2005-03-29 23:04 Lee Revell
  2005-03-29 23:18 ` Trond Myklebust
  0 siblings, 1 reply; 45+ messages in thread
From: Lee Revell @ 2005-03-29 23:04 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Ingo Molnar, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 94 bytes --]

I am seeing long latencies in the NFS client code. Attached is a ~1.9 ms
latency trace.

Lee

[-- Attachment #2: nfs-1919us-latency_trace.bz2 --]
[-- Type: application/x-bzip, Size: 30730 bytes --]

^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies
From: Trond Myklebust @ 2005-03-29 23:18 UTC (permalink / raw)
To: Lee Revell; +Cc: Ingo Molnar, linux-kernel

On Tue, 29.03.2005 at 18:04 (-0500), Lee Revell wrote:
> I am seeing long latencies in the NFS client code. Attached is a ~1.9
> ms latency trace.

What kind of workload are you using to produce these numbers?

Cheers,
  Trond

--
Trond Myklebust <trond.myklebust@fys.uio.no>
* Re: NFS client latencies
From: Lee Revell @ 2005-03-29 23:32 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Ingo Molnar, linux-kernel

On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> On Tue, 29.03.2005 at 18:04 (-0500), Lee Revell wrote:
> > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > ms latency trace.
>
> What kind of workload are you using to produce these numbers?
>

Just a kernel compile over NFS.

Lee
* Re: NFS client latencies
From: Trond Myklebust @ 2005-03-29 23:34 UTC (permalink / raw)
To: Lee Revell; +Cc: Ingo Molnar, linux-kernel

On Tue, 29.03.2005 at 18:32 (-0500), Lee Revell wrote:
> On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> > On Tue, 29.03.2005 at 18:04 (-0500), Lee Revell wrote:
> > > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > > ms latency trace.
> >
> > What kind of workload are you using to produce these numbers?
> >
>
> Just a kernel compile over NFS.

In other words, a workload consisting mainly of mmap()ed writes?

Cheers,
  Trond

--
Trond Myklebust <trond.myklebust@fys.uio.no>
* Re: NFS client latencies
From: Lee Revell @ 2005-03-29 23:37 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Ingo Molnar, linux-kernel

On Tue, 2005-03-29 at 18:34 -0500, Trond Myklebust wrote:
> On Tue, 29.03.2005 at 18:32 (-0500), Lee Revell wrote:
> > On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> > > On Tue, 29.03.2005 at 18:04 (-0500), Lee Revell wrote:
> > > > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > > > ms latency trace.
> > >
> > > What kind of workload are you using to produce these numbers?
> > >
> >
> > Just a kernel compile over NFS.
>
> In other words a workload consisting mainly of mmap()ed writes?
>

If you say so... I don't know the NFS internals very well. I did use
"make -j16", on a UP (uniprocessor) machine.

Lee
* Re: NFS client latencies
From: Ingo Molnar @ 2005-03-30 8:02 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Lee Revell, linux-kernel

* Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> On Tue, 29.03.2005 at 18:32 (-0500), Lee Revell wrote:
> > On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> > > On Tue, 29.03.2005 at 18:04 (-0500), Lee Revell wrote:
> > > > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > > > ms latency trace.
> > >
> > > What kind of workload are you using to produce these numbers?
> > >
> >
> > Just a kernel compile over NFS.
>
> In other words a workload consisting mainly of mmap()ed writes?

new files created during a kernel compile are done via open()/write().

looking at the trace it seems that the NFS client code is doing list
walks over ~7000 entries (!), in nfs_list_add_request(). Whatever
protocol/server-side gain there might be due to the sorting and
coalescing, this CPU overhead seems extremely high - more than 1 msec
for this single insertion!

the comment suggests that this is optimized for append writes (which is
quite common, but by far not the only write workload) - but the
worst-case behavior of this code is very bad. How about disabling this
sorting altogether and benchmarking the result? Maybe it would get
comparable coalescing (higher levels do coalesce after all), but vastly
improved CPU utilization on the client side. (Note that the server
itself will do sorting of any write IO anyway, if this is to hit any
persistent storage - and if not, then sorting so aggressively on the
client side makes little sense.)

i think normal NFS benchmarks would not show this effect, as writes are
typically streamed in benchmarks.
But once you have lots of outstanding requests and a write comes out of
order, CPU utilization (and latency) skyrockets.

	Ingo
* Re: NFS client latencies
From: Trond Myklebust @ 2005-03-30 14:11 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Lee Revell, linux-kernel

On Wed, 30.03.2005 at 10:02 (+0200), Ingo Molnar wrote:
> the comment suggests that this is optimized for append writes (which is
> quite common, but by far not the only write workload) - but the
> worst-case behavior of this code is very bad. How about disabling this
> sorting altogether and benchmarking the result? Maybe it would get
> comparable coalescing (higher levels do coalesce after all), but vastly
> improved CPU utilization on the client side. (Note that the server
> itself will do sorting of any write IO anyway, if this is to hit any
> persistent storage - and if not, then sorting so aggressively on the
> client side makes little sense.)

No. Coalescing on the client makes tons of sense. The overhead of
sending 8 RPC requests for 4k writes instead of sending 1 RPC request
for a single 32k write is huge: among other things, you end up tying up
8 RPC slots on the client + 8 nfsd threads on the server instead of just
one of each. You also end up allocating 8 times as much memory for
supporting structures such as rpc_tasks. That sucks when you're in a
low-memory situation and are trying to push dirty pages to the server
as fast as possible...

What we can do instead is to do the same thing that the VM does: use a
radix tree with tags to do the sorting of the nfs_pages when we're
actually building up the list of dirty pages to send. We already have
the radix tree to do that; all we need to do is add the tags and modify
the scanning function to use them.

Cheers,
  Trond

--
Trond Myklebust <trond.myklebust@fys.uio.no>
* Re: NFS client latencies
From: Ingo Molnar @ 2005-03-30 14:20 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Lee Revell, linux-kernel

* Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> > the comment suggests that this is optimized for append writes (which is
> > quite common, but by far not the only write workload) - but the
> > worst-case behavior of this code is very bad. How about disabling this
> > sorting altogether and benchmarking the result? Maybe it would get
> > comparable coalescing (higher levels do coalesce after all), but vastly
> > improved CPU utilization on the client side. (Note that the server
> > itself will do sorting of any write IO anyway, if this is to hit any
> > persistent storage - and if not, then sorting so aggressively on the
> > client side makes little sense.)
>
> No. Coalescing on the client makes tons of sense. The overhead of
> sending 8 RPC requests for 4k writes instead of sending 1 RPC request
> for a single 32k write is huge: among other things, you end up tying up
> 8 RPC slots on the client + 8 nfsd threads on the server instead of just
> one of each.

yes - coalescing a few pages makes sense, but linearly scanning
thousands of entries is entirely pointless.

	Ingo
* Re: NFS client latencies
From: Andrew Morton @ 2005-03-30 19:53 UTC (permalink / raw)
To: Ingo Molnar; +Cc: trond.myklebust, rlrevell, linux-kernel

Ingo Molnar <mingo@elte.hu> wrote:
> > No. Coalescing on the client makes tons of sense. The overhead of
> > sending 8 RPC requests for 4k writes instead of sending 1 RPC request
> > for a single 32k write is huge: among other things, you end up tying up
> > 8 RPC slots on the client + 8 nfsd threads on the server instead of just
> > one of each.
>
> yes - coalescing a few pages makes sense, but linearly scanning
> thousands of entries is entirely pointless.

If each object has a unique ->wb_index then they could all be placed
into a radix tree rather than a list. Use the gang-lookup functions to
perform the sorting.
* Re: NFS client latencies
From: Lee Revell @ 2005-03-30 14:26 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Ingo Molnar, linux-kernel

On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> On Tue, 29.03.2005 at 18:04 (-0500), Lee Revell wrote:
> > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > ms latency trace.
>
> What kind of workload are you using to produce these numbers?
>

Here is the other long latency I am seeing in the NFS client. I posted
this before, but did not cc: the correct people.

It looks like nfs_wait_on_requests is doing thousands of
radix_tree_gang_lookups while holding some lock.

Lee

preemption latency trace v1.1.4 on 2.6.12-rc1-RT-V0.7.41-00
--------------------------------------------------------------------
 latency: 3178 µs, #4095/14224, CPU#0 | (M:preempt VP:0, KP:1, SP:1 HP:1 #P:1)
    -----------------
    | task: ksoftirqd/0-2 (uid:0 nice:-10 policy:0 rt_prio:0)
    -----------------

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /

(T1/#0) <...> 32105 0 3 00000004 00000000 [0011939614227867] 0.000ms (+4137027.445ms): <6500646c> (<61000000>)
(T1/#2) <...> 32105 0 3 00000004 00000002 [0011939614228097] 0.000ms (+0.000ms): __trace_start_sched_wakeup+0x9a/0xd0 <c013150a> (try_to_wake_up+0x94/0x140 <c0110474>)
(T1/#3) <...> 32105 0 3 00000003 00000003 [0011939614228436] 0.000ms (+0.000ms): preempt_schedule+0x11/0x80 <c02b57c1> (try_to_wake_up+0x94/0x140 <c0110474>)
(T3/#4) <...>-32105 0dn.3 0µs : try_to_wake_up+0x11e/0x140 <c01104fe> <<...>-2> (69 76):
(T1/#5) <...> 32105 0 3 00000002 00000005 [0011939614228942] 0.000ms (+0.000ms): preempt_schedule+0x11/0x80 <c02b57c1> (try_to_wake_up+0xf8/0x140 <c01104d8>)
(T1/#6) <...> 32105 0 3 00000002 00000006 [0011939614229130] 0.000ms (+0.000ms): wake_up_process+0x35/0x40 <c0110555> (do_softirq+0x3f/0x50 <c011b05f>)
(T6/#7) <...>-32105 0dn.1 1µs < (1)
(T1/#8) <...> 32105 0 2 00000001 00000008 [0011939614229782] 0.001ms (+0.000ms): radix_tree_gang_lookup+0xe/0x70 <c01e05ee> (nfs_wait_on_requests+0x6d/0x110 <c01c744d>)
(T1/#9) <...> 32105 0 2 00000001 00000009 [0011939614229985] 0.001ms (+0.000ms): __lookup+0xe/0xd0 <c01e051e> (radix_tree_gang_lookup+0x52/0x70 <c01e0632>)
(T1/#10) <...> 32105 0 2 00000001 0000000a [0011939614230480] 0.001ms (+0.000ms): radix_tree_gang_lookup+0xe/0x70 <c01e05ee> (nfs_wait_on_requests+0x6d/0x110 <c01c744d>)
(T1/#11) <...> 32105 0 2 00000001 0000000b [0011939614230634] 0.002ms (+0.000ms): __lookup+0xe/0xd0 <c01e051e> (radix_tree_gang_lookup+0x52/0x70 <c01e0632>)
(T1/#12) <...> 32105 0 2 00000001 0000000c [0011939614230889] 0.002ms (+0.000ms): radix_tree_gang_lookup+0xe/0x70 <c01e05ee> (nfs_wait_on_requests+0x6d/0x110 <c01c744d>)
(T1/#13) <...> 32105 0 2 00000001 0000000d [0011939614231034] 0.002ms (+0.000ms): __lookup+0xe/0xd0 <c01e051e> (radix_tree_gang_lookup+0x52/0x70 <c01e0632>)
(T1/#14) <...> 32105 0 2 00000001 0000000e [0011939614231302] 0.002ms (+0.000ms): radix_tree_gang_lookup+0xe/0x70 <c01e05ee> (nfs_wait_on_requests+0x6d/0x110 <c01c744d>)
(T1/#15) <...> 32105 0 2 00000001 0000000f [0011939614231419] 0.002ms (+0.000ms): __lookup+0xe/0xd0 <c01e051e> (radix_tree_gang_lookup+0x52/0x70 <c01e0632>)

(last two lines just repeat)

> Cheers,
> Trond
* Re: NFS client latencies
From: Trond Myklebust @ 2005-03-30 14:50 UTC (permalink / raw)
To: Lee Revell; +Cc: Ingo Molnar, linux-kernel

On Wed, 30.03.2005 at 09:26 (-0500), Lee Revell wrote:
> On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> > On Tue, 29.03.2005 at 18:04 (-0500), Lee Revell wrote:
> > > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > > ms latency trace.
> >
> > What kind of workload are you using to produce these numbers?
> >
>
> Here is the other long latency I am seeing in the NFS client. I posted
> this before, but did not cc: the correct people.
>
> It looks like nfs_wait_on_requests is doing thousands of
> radix_tree_gang_lookups while holding some lock.

That's normal and cannot be avoided: when writing, we have to look for
the existence of old nfs_page requests. The reason is that if one does
exist, we must either coalesce our new dirty area into it or, if we
can't, we must flush the old request out to the server.

Cheers,
  Trond

--
Trond Myklebust <trond.myklebust@fys.uio.no>
* Re: NFS client latencies
From: Lee Revell @ 2005-03-30 19:50 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Ingo Molnar, linux-kernel

On Wed, 2005-03-30 at 09:50 -0500, Trond Myklebust wrote:
> On Wed, 30.03.2005 at 09:26 (-0500), Lee Revell wrote:
> > On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> > > On Tue, 29.03.2005 at 18:04 (-0500), Lee Revell wrote:
> > > > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > > > ms latency trace.
> > >
> > > What kind of workload are you using to produce these numbers?
> > >
> >
> > Here is the other long latency I am seeing in the NFS client. I posted
> > this before, but did not cc: the correct people.
> >
> > It looks like nfs_wait_on_requests is doing thousands of
> > radix_tree_gang_lookups while holding some lock.
>
> That's normal and cannot be avoided: when writing, we have to look for
> the existence of old nfs_page requests. The reason is that if one does
> exist, we must either coalesce our new dirty area into it or if we
> can't, we must flush the old request out to the server.

But holding a spinlock for 3 ms is not acceptable. _Something_ has to
be done. Can't the lock be dropped and reacquired after processing N
requests, where N is some reasonable number?

Lee
* Re: NFS client latencies
From: Andrew Morton @ 2005-03-30 19:56 UTC (permalink / raw)
To: Trond Myklebust; +Cc: rlrevell, mingo, linux-kernel

Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> On Wed, 30.03.2005 at 09:26 (-0500), Lee Revell wrote:
> > On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> > > On Tue, 29.03.2005 at 18:04 (-0500), Lee Revell wrote:
> > > > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > > > ms latency trace.
> > >
> > > What kind of workload are you using to produce these numbers?
> > >
> >
> > Here is the other long latency I am seeing in the NFS client. I posted
> > this before, but did not cc: the correct people.
> >
> > It looks like nfs_wait_on_requests is doing thousands of
> > radix_tree_gang_lookups while holding some lock.
>
> That's normal and cannot be avoided: when writing, we have to look for
> the existence of old nfs_page requests. The reason is that if one does
> exist, we must either coalesce our new dirty area into it or if we
> can't, we must flush the old request out to the server.

One could use the radix-tree tagging stuff so that the gang lookup only
looks up pages which are !NFS_WBACK_BUSY.
* Re: NFS client latencies
From: Trond Myklebust @ 2005-03-30 21:14 UTC (permalink / raw)
To: Andrew Morton; +Cc: rlrevell, mingo, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 734 bytes --]

On Wed, 30.03.2005 at 11:56 (-0800), Andrew Morton wrote:
> > That's normal and cannot be avoided: when writing, we have to look for
> > the existence of old nfs_page requests. The reason is that if one does
> > exist, we must either coalesce our new dirty area into it or if we
> > can't, we must flush the old request out to the server.
>
> One could use the radix-tree tagging stuff so that the gang lookup only
> looks up pages which are !NFS_WBACK_BUSY.

Yes. Together with the radix tree-based sorting of dirty requests,
that's pretty much what I've spent most of today doing. Lee, could you
see how the attached combined patch changes your latency numbers?

Cheers,
  Trond

--
Trond Myklebust <trond.myklebust@fys.uio.no>

[-- Attachment #2: linux-2.6.12-nfspage_use_tags.dif --]
[-- Type: text/x-patch, Size: 14045 bytes --]

Index: linux-2.6.12-rc1/fs/nfs/inode.c =================================================================== --- linux-2.6.12-rc1.orig/fs/nfs/inode.c +++ linux-2.6.12-rc1/fs/nfs/inode.c @@ -118,7 +118,7 @@ nfs_write_inode(struct inode *inode, int int flags = sync ?
FLUSH_WAIT : 0; int ret; - ret = nfs_commit_inode(inode, 0, 0, flags); + ret = nfs_commit_inode(inode, flags); if (ret < 0) return ret; return 0; Index: linux-2.6.12-rc1/fs/nfs/pagelist.c =================================================================== --- linux-2.6.12-rc1.orig/fs/nfs/pagelist.c +++ linux-2.6.12-rc1/fs/nfs/pagelist.c @@ -112,6 +112,33 @@ void nfs_unlock_request(struct nfs_page } /** + * nfs_set_page_writeback_locked - Lock a request for writeback + * @req: + */ +int nfs_set_page_writeback_locked(struct nfs_page *req) +{ + struct nfs_inode *nfsi = NFS_I(req->wb_context->dentry->d_inode); + + if (!nfs_lock_request(req)) + return 0; + radix_tree_tag_set(&nfsi->nfs_page_tree, req->wb_index, NFS_PAGE_TAG_WRITEBACK); + return 1; +} + +/** + * nfs_clear_page_writeback - Unlock request and wake up sleepers + */ +void nfs_clear_page_writeback(struct nfs_page *req) +{ + struct nfs_inode *nfsi = NFS_I(req->wb_context->dentry->d_inode); + + spin_lock(&nfsi->req_lock); + radix_tree_tag_clear(&nfsi->nfs_page_tree, req->wb_index, NFS_PAGE_TAG_WRITEBACK); + spin_unlock(&nfsi->req_lock); + nfs_unlock_request(req); +} + +/** * nfs_clear_request - Free up all resources allocated to the request * @req: * @@ -151,36 +178,6 @@ nfs_release_request(struct nfs_page *req } /** - * nfs_list_add_request - Insert a request into a sorted list - * @req: request - * @head: head of list into which to insert the request. - * - * Note that the wb_list is sorted by page index in order to facilitate - * coalescing of requests. - * We use an insertion sort that is optimized for the case of appended - * writes. 
- */ -void -nfs_list_add_request(struct nfs_page *req, struct list_head *head) -{ - struct list_head *pos; - -#ifdef NFS_PARANOIA - if (!list_empty(&req->wb_list)) { - printk(KERN_ERR "NFS: Add to list failed!\n"); - BUG(); - } -#endif - list_for_each_prev(pos, head) { - struct nfs_page *p = nfs_list_entry(pos); - if (p->wb_index < req->wb_index) - break; - } - list_add(&req->wb_list, pos); - req->wb_list_head = head; -} - -/** * nfs_wait_on_request - Wait for a request to complete. * @req: request to wait upon. * @@ -243,6 +240,62 @@ nfs_coalesce_requests(struct list_head * return npages; } +#define NFS_SCAN_MAXENTRIES 16 +/** + * nfs_scan_lock_dirty - Scan the radix tree for dirty requests + * @nfsi: NFS inode + * @dst: Destination list + * @idx_start: lower bound of page->index to scan + * @npages: idx_start + npages sets the upper bound to scan. + * + * Moves elements from one of the inode request lists. + * If the number of requests is set to 0, the entire address_space + * starting at index idx_start, is scanned. + * The requests are *not* checked to ensure that they form a contiguous set. 
+ * You must be holding the inode's req_lock when calling this function + */ +int +nfs_scan_lock_dirty(struct nfs_inode *nfsi, struct list_head *dst, + unsigned long idx_start, unsigned int npages) +{ + struct nfs_page *pgvec[NFS_SCAN_MAXENTRIES]; + struct nfs_page *req; + unsigned long idx_end; + int found, i; + int res; + + res = 0; + if (npages == 0) + idx_end = ~0; + else + idx_end = idx_start + npages - 1; + + for (;;) { + found = radix_tree_gang_lookup_tag(&nfsi->nfs_page_tree, + (void **)&pgvec[0], idx_start, NFS_SCAN_MAXENTRIES, + NFS_PAGE_TAG_DIRTY); + if (found <= 0) + break; + for (i = 0; i < found; i++) { + req = pgvec[i]; + if (req->wb_index > idx_end) + goto out; + + idx_start = req->wb_index + 1; + + if (nfs_set_page_writeback_locked(req)) { + radix_tree_tag_clear(&nfsi->nfs_page_tree, + req->wb_index, NFS_PAGE_TAG_DIRTY); + nfs_list_remove_request(req); + nfs_list_add_request(req, dst); + res++; + } + } + } +out: + return res; +} + /** * nfs_scan_list - Scan a list for matching requests * @head: One of the NFS inode request lists @@ -280,7 +333,7 @@ nfs_scan_list(struct list_head *head, st if (req->wb_index > idx_end) break; - if (!nfs_lock_request(req)) + if (!nfs_set_page_writeback_locked(req)) continue; nfs_list_remove_request(req); nfs_list_add_request(req, dst); Index: linux-2.6.12-rc1/fs/nfs/read.c =================================================================== --- linux-2.6.12-rc1.orig/fs/nfs/read.c +++ linux-2.6.12-rc1/fs/nfs/read.c @@ -173,7 +173,6 @@ static int nfs_readpage_async(struct nfs if (len < PAGE_CACHE_SIZE) memclear_highpage_flush(page, len, PAGE_CACHE_SIZE - len); - nfs_lock_request(new); nfs_list_add_request(new, &one_request); nfs_pagein_one(&one_request, inode); return 0; @@ -185,7 +184,6 @@ static void nfs_readpage_release(struct nfs_clear_request(req); nfs_release_request(req); - nfs_unlock_request(req); dprintk("NFS: read done (%s/%Ld %d@%Ld)\n", req->wb_context->dentry->d_inode->i_sb->s_id, @@ -553,7 +551,6 @@ 
readpage_async_filler(void *data, struct } if (len < PAGE_CACHE_SIZE) memclear_highpage_flush(page, len, PAGE_CACHE_SIZE - len); - nfs_lock_request(new); nfs_list_add_request(new, desc->head); return 0; } Index: linux-2.6.12-rc1/fs/nfs/write.c =================================================================== --- linux-2.6.12-rc1.orig/fs/nfs/write.c +++ linux-2.6.12-rc1/fs/nfs/write.c @@ -352,7 +352,7 @@ int nfs_writepages(struct address_space if (err < 0) goto out; } - err = nfs_commit_inode(inode, 0, 0, wb_priority(wbc)); + err = nfs_commit_inode(inode, wb_priority(wbc)); if (err > 0) { wbc->nr_to_write -= err; err = 0; @@ -446,6 +446,8 @@ nfs_mark_request_dirty(struct nfs_page * struct nfs_inode *nfsi = NFS_I(inode); spin_lock(&nfsi->req_lock); + radix_tree_tag_set(&nfsi->nfs_page_tree, + req->wb_index, NFS_PAGE_TAG_DIRTY); nfs_list_add_request(req, &nfsi->dirty); nfsi->ndirty++; spin_unlock(&nfsi->req_lock); @@ -503,13 +505,12 @@ nfs_wait_on_requests(struct inode *inode spin_lock(&nfsi->req_lock); next = idx_start; - while (radix_tree_gang_lookup(&nfsi->nfs_page_tree, (void **)&req, next, 1)) { + while (radix_tree_gang_lookup_tag(&nfsi->nfs_page_tree, (void **)&req, next, 1, NFS_PAGE_TAG_WRITEBACK)) { if (req->wb_index > idx_end) break; next = req->wb_index + 1; - if (!NFS_WBACK_BUSY(req)) - continue; + BUG_ON(!NFS_WBACK_BUSY(req)); atomic_inc(&req->wb_count); spin_unlock(&nfsi->req_lock); @@ -539,7 +540,7 @@ nfs_scan_dirty(struct inode *inode, stru { struct nfs_inode *nfsi = NFS_I(inode); int res; - res = nfs_scan_list(&nfsi->dirty, dst, idx_start, npages); + res = nfs_scan_lock_dirty(nfsi, dst, idx_start, npages); nfsi->ndirty -= res; sub_page_state(nr_dirty,res); if ((nfsi->ndirty == 0) != list_empty(&nfsi->dirty)) @@ -821,7 +822,7 @@ out: #else nfs_inode_remove_request(req); #endif - nfs_unlock_request(req); + nfs_clear_page_writeback(req); } static inline int flush_task_priority(int how) @@ -952,7 +953,7 @@ out_bad: nfs_writedata_free(data); } 
nfs_mark_request_dirty(req); - nfs_unlock_request(req); + nfs_clear_page_writeback(req); return -ENOMEM; } @@ -1002,7 +1003,7 @@ static int nfs_flush_one(struct list_hea struct nfs_page *req = nfs_list_entry(head->next); nfs_list_remove_request(req); nfs_mark_request_dirty(req); - nfs_unlock_request(req); + nfs_clear_page_writeback(req); } return -ENOMEM; } @@ -1029,7 +1030,7 @@ nfs_flush_list(struct list_head *head, i req = nfs_list_entry(head->next); nfs_list_remove_request(req); nfs_mark_request_dirty(req); - nfs_unlock_request(req); + nfs_clear_page_writeback(req); } return error; } @@ -1121,7 +1122,7 @@ static void nfs_writeback_done_full(stru nfs_inode_remove_request(req); #endif next: - nfs_unlock_request(req); + nfs_clear_page_writeback(req); } } @@ -1210,36 +1211,24 @@ static void nfs_commit_rpcsetup(struct l struct nfs_write_data *data, int how) { struct rpc_task *task = &data->task; - struct nfs_page *first, *last; + struct nfs_page *first; struct inode *inode; - loff_t start, end, len; /* Set up the RPC argument and reply structs * NB: take care not to mess about with data->commit et al. */ list_splice_init(head, &data->pages); first = nfs_list_entry(data->pages.next); - last = nfs_list_entry(data->pages.prev); inode = first->wb_context->dentry->d_inode; - /* - * Determine the offset range of requests in the COMMIT call. - * We rely on the fact that data->pages is an ordered list... 
- */ - start = req_offset(first); - end = req_offset(last) + last->wb_bytes; - len = end - start; - /* If 'len' is not a 32-bit quantity, pass '0' in the COMMIT call */ - if (end >= i_size_read(inode) || len < 0 || len > (~((u32)0) >> 1)) - len = 0; - data->inode = inode; data->cred = first->wb_context->cred; data->args.fh = NFS_FH(data->inode); - data->args.offset = start; - data->args.count = len; - data->res.count = len; + /* Note: we always request a commit of the entire inode */ + data->args.offset = 0; + data->args.count = 0; + data->res.count = 0; data->res.fattr = &data->fattr; data->res.verf = &data->verf; @@ -1278,7 +1267,7 @@ nfs_commit_list(struct list_head *head, req = nfs_list_entry(head->next); nfs_list_remove_request(req); nfs_mark_request_commit(req); - nfs_unlock_request(req); + nfs_clear_page_writeback(req); } return -ENOMEM; } @@ -1324,7 +1313,7 @@ nfs_commit_done(struct rpc_task *task) dprintk(" mismatch\n"); nfs_mark_request_dirty(req); next: - nfs_unlock_request(req); + nfs_clear_page_writeback(req); res++; } sub_page_state(nr_unstable,res); @@ -1350,8 +1339,7 @@ static int nfs_flush_inode(struct inode } #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4) -int nfs_commit_inode(struct inode *inode, unsigned long idx_start, - unsigned int npages, int how) +int nfs_commit_inode(struct inode *inode, int how) { struct nfs_inode *nfsi = NFS_I(inode); LIST_HEAD(head); @@ -1359,15 +1347,13 @@ int nfs_commit_inode(struct inode *inode error = 0; spin_lock(&nfsi->req_lock); - res = nfs_scan_commit(inode, &head, idx_start, npages); + res = nfs_scan_commit(inode, &head, 0, 0); + spin_unlock(&nfsi->req_lock); if (res) { - res += nfs_scan_commit(inode, &head, 0, 0); - spin_unlock(&nfsi->req_lock); error = nfs_commit_list(&head, how); - } else - spin_unlock(&nfsi->req_lock); - if (error < 0) - return error; + if (error < 0) + return error; + } return res; } #endif @@ -1389,7 +1375,7 @@ int nfs_sync_inode(struct inode *inode, error = nfs_flush_inode(inode, 
idx_start, npages, how); #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4) if (error == 0) - error = nfs_commit_inode(inode, idx_start, npages, how); + error = nfs_commit_inode(inode, how); #endif } while (error > 0); return error; Index: linux-2.6.12-rc1/include/linux/nfs_fs.h =================================================================== --- linux-2.6.12-rc1.orig/include/linux/nfs_fs.h +++ linux-2.6.12-rc1/include/linux/nfs_fs.h @@ -377,10 +377,10 @@ extern void nfs_commit_done(struct rpc_t */ extern int nfs_sync_inode(struct inode *, unsigned long, unsigned int, int); #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4) -extern int nfs_commit_inode(struct inode *, unsigned long, unsigned int, int); +extern int nfs_commit_inode(struct inode *, int); #else static inline int -nfs_commit_inode(struct inode *inode, unsigned long idx_start, unsigned int npages, int how) +nfs_commit_inode(struct inode *inode, int how) { return 0; } Index: linux-2.6.12-rc1/include/linux/nfs_page.h =================================================================== --- linux-2.6.12-rc1.orig/include/linux/nfs_page.h +++ linux-2.6.12-rc1/include/linux/nfs_page.h @@ -20,12 +20,19 @@ #include <asm/atomic.h> /* + * Valid flags for the radix tree + */ +#define NFS_PAGE_TAG_DIRTY 0 +#define NFS_PAGE_TAG_WRITEBACK 1 + +/* * Valid flags for a dirty buffer */ #define PG_BUSY 0 #define PG_NEED_COMMIT 1 #define PG_NEED_RESCHED 2 +struct nfs_inode; struct nfs_page { struct list_head wb_list, /* Defines state of page: */ *wb_list_head; /* read/write/commit */ @@ -54,14 +61,17 @@ extern void nfs_clear_request(struct nfs extern void nfs_release_request(struct nfs_page *req); -extern void nfs_list_add_request(struct nfs_page *, struct list_head *); - +extern int nfs_scan_lock_dirty(struct nfs_inode *nfsi, struct list_head *dst, + unsigned long idx_start, unsigned int npages); extern int nfs_scan_list(struct list_head *, struct list_head *, unsigned long, unsigned int); extern int 
nfs_coalesce_requests(struct list_head *, struct list_head *, unsigned int); extern int nfs_wait_on_request(struct nfs_page *); extern void nfs_unlock_request(struct nfs_page *req); +extern int nfs_set_page_writeback_locked(struct nfs_page *req); +extern void nfs_clear_page_writeback(struct nfs_page *req); + /* * Lock the page of an asynchronous request without incrementing the wb_count @@ -86,6 +96,18 @@ nfs_lock_request(struct nfs_page *req) return 1; } +/** + * nfs_list_add_request - Insert a request into a list + * @req: request + * @head: head of list into which to insert the request. + */ +static inline void +nfs_list_add_request(struct nfs_page *req, struct list_head *head) +{ + list_add_tail(&req->wb_list, head); + req->wb_list_head = head; +} + /** * nfs_list_remove_request - Remove a request from its wb_list @@ -96,10 +118,6 @@ nfs_list_remove_request(struct nfs_page { if (list_empty(&req->wb_list)) return; - if (!NFS_WBACK_BUSY(req)) { - printk(KERN_ERR "NFS: unlocked request attempted removed from list!\n"); - BUG(); - } list_del_init(&req->wb_list); req->wb_list_head = NULL; } ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-30 21:14 ` Trond Myklebust @ 2005-03-31 2:26 ` Lee Revell 2005-03-31 2:39 ` Andrew Morton 2005-03-31 7:39 ` Ingo Molnar 0 siblings, 2 replies; 45+ messages in thread From: Lee Revell @ 2005-03-31 2:26 UTC (permalink / raw) To: Trond Myklebust; +Cc: Andrew Morton, mingo, linux-kernel [-- Attachment #1: Type: text/plain, Size: 836 bytes --] On Wed, 2005-03-30 at 16:14 -0500, Trond Myklebust wrote: > on den 30.03.2005 Klokka 11:56 (-0800) skreiv Andrew Morton: > > > That's normal and cannot be avoided: when writing, we have to look for > > > the existence of old nfs_page requests. The reason is that if one does > > > exist, we must either coalesce our new dirty area into it or if we > > > can't, we must flush the old request out to the server. > > > > One could use the radix-tree tagging stuff so that the gang lookup only > > looks up pages which are !NFS_WBACK_BUSY. > > Yes. Together with the radix tree-based sorting of dirty requests, > that's pretty much what I've spent most of today doing. Lee, could you > see how the attached combined patch changes your latency numbers? > Different code path, and the latency is worse. See the attached ~7ms trace. Lee [-- Attachment #2: nfs-6776usec-trace.bz2 --] [-- Type: application/x-bzip, Size: 36792 bytes --] ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 2:26 ` Lee Revell @ 2005-03-31 2:39 ` Andrew Morton 2005-03-31 2:47 ` Lee Revell 2005-03-31 7:39 ` Ingo Molnar 1 sibling, 1 reply; 45+ messages in thread From: Andrew Morton @ 2005-03-31 2:39 UTC (permalink / raw) To: Lee Revell; +Cc: trond.myklebust, mingo, linux-kernel Lee Revell <rlrevell@joe-job.com> wrote: > > > Yes. Together with the radix tree-based sorting of dirty requests, > > that's pretty much what I've spent most of today doing. Lee, could you > > see how the attached combined patch changes your latency numbers? > > > > Different code path, and the latency is worse. See the attached ~7ms > trace. Is a bunch of gobbledygook. Hows about you interpret it for us? ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 2:39 ` Andrew Morton @ 2005-03-31 2:47 ` Lee Revell 2005-03-31 3:48 ` Trond Myklebust 2005-03-31 7:03 ` Ingo Molnar 0 siblings, 2 replies; 45+ messages in thread From: Lee Revell @ 2005-03-31 2:47 UTC (permalink / raw) To: Andrew Morton; +Cc: trond.myklebust, mingo, linux-kernel

On Wed, 2005-03-30 at 18:39 -0800, Andrew Morton wrote:
> Lee Revell <rlrevell@joe-job.com> wrote:
> >
> > > > Yes. Together with the radix tree-based sorting of dirty requests,
> > > > that's pretty much what I've spent most of today doing. Lee, could you
> > > > see how the attached combined patch changes your latency numbers?
> > >
> > > Different code path, and the latency is worse. See the attached ~7ms
> > > trace.
>
> Is a bunch of gobbledygook. Hows about you interpret it for us?
>

Sorry. When I summarized them before, Ingo just asked for the full
verbose trace.

The 7 ms are spent in this loop:

radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
nfs_set_page_writeback_locked+0xe/0x60 <c01c357e> (nfs_scan_lock_dirty+0x8d/0xf0 <c01c39fd>)
radix_tree_tag_set+0xe/0xa0 <c01e036e> (nfs_set_page_writeback_locked+0x4b/0x60 <c01c35bb>)
radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
nfs_set_page_writeback_locked+0xe/0x60 <c01c357e> (nfs_scan_lock_dirty+0x8d/0xf0 <c01c39fd>)
radix_tree_tag_set+0xe/0xa0 <c01e036e> (nfs_set_page_writeback_locked+0x4b/0x60 <c01c35bb>)
radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
nfs_set_page_writeback_locked+0xe/0x60 <c01c357e> (nfs_scan_lock_dirty+0x8d/0xf0 <c01c39fd>)
radix_tree_tag_set+0xe/0xa0 <c01e036e> (nfs_set_page_writeback_locked+0x4b/0x60 <c01c35bb>)
radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
nfs_set_page_writeback_locked+0xe/0x60 <c01c357e> (nfs_scan_lock_dirty+0x8d/0xf0 <c01c39fd>)
radix_tree_tag_set+0xe/0xa0 <c01e036e> (nfs_set_page_writeback_locked+0x4b/0x60 <c01c35bb>)
radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
nfs_set_page_writeback_locked+0xe/0x60 <c01c357e> (nfs_scan_lock_dirty+0x8d/0xf0 <c01c39fd>)
radix_tree_tag_set+0xe/0xa0 <c01e036e> (nfs_set_page_writeback_locked+0x4b/0x60 <c01c35bb>)
radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
radix_tree_gang_lookup_tag+0xe/0x80 <c01e074e> (nfs_scan_lock_dirty+0x69/0xf0 <c01c39d9>)
__lookup_tag+0xe/0x130 <c01e061e> (radix_tree_gang_lookup_tag+0x59/0x80 <c01e0799>)

Lee

^ permalink raw reply	[flat|nested] 45+ messages in thread
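[An aside, not part of the thread: the three calls above repeat once per dirty page, because each iteration of `nfs_scan_lock_dirty()` moves one page's tag from DIRTY to WRITEBACK. A toy userspace model of that loop, with hypothetical names and plain bitmaps standing in for the kernel's tagged radix tree:]

```c
#include <stdbool.h>

#define TOY_NPAGES 64

/* Two per-index tags, modelling NFS_PAGE_TAG_DIRTY / NFS_PAGE_TAG_WRITEBACK. */
static bool tag_dirty[TOY_NPAGES];
static bool tag_writeback[TOY_NPAGES];

/* Toy analogue of the nfs_scan_lock_dirty() loop: for every index
 * tagged dirty, set the WRITEBACK tag, clear the DIRTY tag, and count
 * it.  In the kernel the lookup is a gang lookup on the radix tree;
 * here a linear scan of the bitmap stands in for it. */
static int toy_scan_lock_dirty(void)
{
    int moved = 0;
    for (int i = 0; i < TOY_NPAGES; i++) {
        if (!tag_dirty[i])
            continue;
        tag_writeback[i] = true;   /* ~ radix_tree_tag_set(..., WRITEBACK) */
        tag_dirty[i] = false;      /* ~ radix_tree_tag_clear(..., DIRTY)   */
        moved++;
    }
    return moved;
}
```

With many pages dirty, the per-page work is constant but the total is linear in the number of dirty pages, which is exactly why the trace shows the trio repeating for ~7 ms.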
* Re: NFS client latencies 2005-03-31 2:47 ` Lee Revell @ 2005-03-31 3:48 ` Trond Myklebust 2005-03-31 6:59 ` Ingo Molnar 2005-03-31 7:03 ` Ingo Molnar 1 sibling, 1 reply; 45+ messages in thread From: Trond Myklebust @ 2005-03-31 3:48 UTC (permalink / raw) To: Lee Revell; +Cc: Andrew Morton, mingo, linux-kernel

on den 30.03.2005 Klokka 21:47 (-0500) skreiv Lee Revell:
> On Wed, 2005-03-30 at 18:39 -0800, Andrew Morton wrote:
> > Lee Revell <rlrevell@joe-job.com> wrote:
> > >
> > > > > Yes. Together with the radix tree-based sorting of dirty requests,
> > > > > that's pretty much what I've spent most of today doing. Lee, could you
> > > > > see how the attached combined patch changes your latency numbers?
> > > >
> > > > Different code path, and the latency is worse. See the attached ~7ms
> > > > trace.
> >
> > Is a bunch of gobbledygook. Hows about you interpret it for us?
>
> Sorry. When I summarized them before, Ingo just asked for the full
> verbose trace.
>
> The 7 ms are spent in this loop:
>
> radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
> nfs_set_page_writeback_locked+0xe/0x60 <c01c357e> (nfs_scan_lock_dirty+0x8d/0xf0 <c01c39fd>)
> radix_tree_tag_set+0xe/0xa0 <c01e036e> (nfs_set_page_writeback_locked+0x4b/0x60 <c01c35bb>)
> radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)

Which is basically confirming what the guys from Bull already told me, namely that the radix tree tag stuff is much less efficient than what we've got now. I never saw their patches, though, so I was curious to try it for myself.

Cheers,
  Trond
-- 
Trond Myklebust <trond.myklebust@fys.uio.no>

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 3:48 ` Trond Myklebust @ 2005-03-31 6:59 ` Ingo Molnar 2005-03-31 7:15 ` Ingo Molnar 2005-03-31 7:18 ` Andrew Morton 0 siblings, 2 replies; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 6:59 UTC (permalink / raw) To: Trond Myklebust; +Cc: Lee Revell, Andrew Morton, linux-kernel

* Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> > The 7 ms are spent in this loop:
>
> Which is basically confirming what the guys from Bull already told me,
> namely that the radix tree tag stuff is much less efficient than what
> we've got now. I never saw their patches, though, so I was curious to
> try it for myself.

i think the numbers are being misinterpreted. I believe this patch is a big step forward. The big thing is that nfs_list_add_request() is O(1) now - while _a single request addition to the sorted list_ triggered the 1+ msec latency in Lee's previous trace. I.e. the previous trace showed a 1+ msec latency for a single page IO submission, while his new trace shows _thousands_ of pages submitted for NFS IO - which is a very different beast!

i think all it needs now is a lock-breaker in the main radix-lookup loop in nfs_scan_lock_dirty(), or a latency-oriented reduction in the npages argument, to make the loop bounded. That would make the nfs_list_add_request() code unbreakable from a latency POV. The patch looks really great to me, kudos for pulling it off that fast.

(Also, i'd not declare this variant _slower_ based on latencies, unless proven in real benchmarks. A faster implementation can easily have higher latencies if some detail changed - and lots of details changed due to your patch. And i'd not even declare latency that much worse, unless it's been measured with the tracer turned off. (Lee, could you try such a comparison, worst-case latency with and without the patch, with LATENCY_TRACE turned off in the .config but latency timing still enabled?) The latency tracer itself can balloon the latency.)
Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
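[An illustration outside the thread: the O(n)-vs-O(1) contrast Ingo describes corresponds directly to the code change - the old `nfs_list_add_request()` was an insertion sort over the request list (cheap for appended writes, linear otherwise), while the new one is a plain tail insert. A minimal userspace sketch with hypothetical names (`sorted_add`, `tail_add`) and a bare doubly-linked list standing in for the kernel's `list_head`:]

```c
#include <stddef.h>

struct req { unsigned long index; struct req *prev, *next; };
struct rlist { struct req head; };   /* circular list; head is a sentinel */

static void rlist_init(struct rlist *l)
{
    l->head.prev = l->head.next = &l->head;
}

static void insert_between(struct req *r, struct req *a, struct req *b)
{
    r->prev = a; r->next = b; a->next = r; b->prev = r;
}

/* Old scheme: insertion sort keyed on the page index, scanning from the
 * tail (fast for appends, worst-case O(n) per insert - the cost that
 * showed up as a 1+ msec latency for a single request). */
static void sorted_add(struct rlist *l, struct req *r)
{
    struct req *pos = l->head.prev;
    while (pos != &l->head && pos->index >= r->index)
        pos = pos->prev;
    insert_between(r, pos, pos->next);
}

/* New scheme: O(1) tail insert; ordering is recovered later from the
 * radix tree during scanning instead of being maintained on insert. */
static void tail_add(struct rlist *l, struct req *r)
{
    insert_between(r, l->head.prev, &l->head);
}
```

The trade is clear: the sorted list pays on every insert, the radix-tree variant pays once per scan.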
* Re: NFS client latencies 2005-03-31 6:59 ` Ingo Molnar @ 2005-03-31 7:15 ` Ingo Molnar 2005-03-31 7:18 ` Andrew Morton 1 sibling, 0 replies; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 7:15 UTC (permalink / raw) To: Trond Myklebust; +Cc: Lee Revell, Andrew Morton, linux-kernel

* Ingo Molnar <mingo@elte.hu> wrote:

> i think all it needs now is a lock-breaker in the main radix-lookup
> loop in nfs_scan_lock_dirty(), or a latency-oriented reduction in the
> npages argument, to make the loop bounded. [...]

can nfsi->req_lock be dropped within nfs_scan_dirty()? Or does the scanning have to restart in that case? My guess would be that the scanning does not have to be restarted, since we drop the lock after scanning anyway, so all it has to take care of is the consistency of the list itself, and the fact that the whole index range got scanned at a certain point in time.

Such a lock-breaker was hard before because we had a list 'cursor' which could move away while we dropped the lock. But with the radix tree it's an 'index position' now, which is much more invariant.

The patch below attempts this, on top of your patch - but i'm not sure whether ->req_lock is the only lock we hold at that point.

	Ingo

--- linux/fs/nfs/pagelist.c.orig
+++ linux/fs/nfs/pagelist.c
@@ -291,6 +291,7 @@ nfs_scan_lock_dirty(struct nfs_inode *nf
 			res++;
 		}
 	}
+		cond_resched_lock(&nfsi->req_lock);
 	}
 out:
 	return res;

^ permalink raw reply	[flat|nested] 45+ messages in thread
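[An aside, not from the thread: `cond_resched_lock()` drops the spinlock, reschedules if needed, and reacquires it, which caps the length of any one critical section. A rough userspace analogue of the lock-break pattern, with hypothetical names and a pthread mutex standing in for the spinlock; unlike the kernel helper, this sketch breaks the lock unconditionally rather than only when a reschedule is pending:]

```c
#include <pthread.h>
#include <sched.h>

/* Hypothetical analogue of cond_resched_lock(): briefly drop the lock
 * so waiters can get in, then retake it. */
static void lock_break(pthread_mutex_t *lock)
{
    pthread_mutex_unlock(lock);
    sched_yield();               /* give contenders a chance to run */
    pthread_mutex_lock(lock);
}

/* Scan a large array under a lock, breaking the lock periodically so
 * the critical section stays bounded.  Because progress is tracked by
 * index, the scan does not need to restart after a break - the point
 * Ingo makes about the radix tree's 'index position' cursor. */
static int scan_with_breaks(pthread_mutex_t *lock,
                            const int *items, int n, int breakevery)
{
    int matched = 0;
    pthread_mutex_lock(lock);
    for (int i = 0; i < n; i++) {
        if (items[i] > 0)
            matched++;
        if ((i + 1) % breakevery == 0)
            lock_break(lock);    /* latency ceiling: breakevery items */
    }
    pthread_mutex_unlock(lock);
    return matched;
}
```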
* Re: NFS client latencies 2005-03-31 6:59 ` Ingo Molnar 2005-03-31 7:15 ` Ingo Molnar @ 2005-03-31 7:18 ` Andrew Morton 2005-03-31 7:30 ` Ingo Molnar 1 sibling, 1 reply; 45+ messages in thread From: Andrew Morton @ 2005-03-31 7:18 UTC (permalink / raw) To: Ingo Molnar; +Cc: trond.myklebust, rlrevell, linux-kernel Ingo Molnar <mingo@elte.hu> wrote: > > > * Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > > > The 7 ms are spent in this loop: > > > > Which is basically confirming what the guys from Bull already told me, > > namely that the radix tree tag stuff is much less efficient that what > > we've got now. I never saw their patches, though, so I was curious to > > try it for myself. > > i think the numbers are being misinterpreted. I believe this patch is a > big step forward. The big thing is that nfs_list_add_request() is O(1) > now - while _a single request addition to the sorted list_ triggered the > 1+ msec latency in Lee's previous trace. Well. The radix-tree approach's best-case is probably quite a lot worse than the list-based approach's best-case. It hits a lot more cachelines and involves a lot more code. But of course the radix-tree's worst-case will be far better than list's. And presumably that list-based code rarely hits the worst-case, else it would have been changed by now. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 7:18 ` Andrew Morton @ 2005-03-31 7:30 ` Ingo Molnar 2005-03-31 11:58 ` Trond Myklebust 0 siblings, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 7:30 UTC (permalink / raw) To: Andrew Morton; +Cc: trond.myklebust, rlrevell, linux-kernel

* Andrew Morton <akpm@osdl.org> wrote:

> Well. The radix-tree approach's best-case is probably quite a lot
> worse than the list-based approach's best-case. It hits a lot more
> cachelines and involves a lot more code.

The list-based approach's best case is large continuous append writes: no sorting overhead, and light data structures. i'd say this workload should be not that bad under the radix tree either - the gang lookup stuffs a nice vector of 16 pages into an array.

we definitely can say nothing based on the observation that a _single_ page took 1.9 msecs in Lee's previous measurement, while 7700 pages now take 6 msecs to process.

> But of course the radix-tree's worst-case will be far better than
> list's.

the generic VM/pagecache has proven that the radix tree wins hands down for a lot more workloads than the worst-case.

> And presumably that list-based code rarely hits the worst-case, else
> it would have been changed by now.

that was my other point in a previous mail: most write benchmarks do continuous append writes, and CPU overhead easily gets lost in network latency. Also, considering that a good portion of the NFS client's code is still running under the BKL, one would assume that if the BKL hurt performance it would have been changed by now? ;-)

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 7:30 ` Ingo Molnar @ 2005-03-31 11:58 ` Trond Myklebust 2005-03-31 12:34 ` Trond Myklebust 0 siblings, 1 reply; 45+ messages in thread From: Trond Myklebust @ 2005-03-31 11:58 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, rlrevell, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1111 bytes --] to den 31.03.2005 Klokka 09:30 (+0200) skreiv Ingo Molnar: > > And presumably that list-based code rarely hits the worst-case, else > > it would have been changed by now. > > that was my other point in a previous mail: most write benchmarks do > continuous append writes, and CPU overhead easily gets lost in > network > latency. Well. Most _good_ write benchmarks go through a variety of scenarios. The problem with the radix-tree is that in most cases you are swapping a linear scan through only the dirty pages (most of which you want to try to add to the list anyway) for a linear scan through the entire list of pages. Damn! I've forgotten an extra _obvious_ optimization that might help here... See the attached patch. > Also, considering that a good portion of the NFS client's code is still > running under the BKL one would assume if the BKL hurts performance it > would have been changed by now? ;-) Funny you should mention that... Chuck happens to have spent the last fortnight working on removing our BKL-addiction. 
;-)

Cheers,
  Trond
-- 
Trond Myklebust <trond.myklebust@fys.uio.no>

[-- Attachment #2: Type: text/plain, Size: 1722 bytes --]

 fs/nfs/write.c |   28 +++++++++++++++++-----------
 1 files changed, 17 insertions(+), 11 deletions(-)

Index: linux-2.6.12-rc1/fs/nfs/write.c
===================================================================
--- linux-2.6.12-rc1.orig/fs/nfs/write.c
+++ linux-2.6.12-rc1/fs/nfs/write.c
@@ -539,12 +539,15 @@ static int
 nfs_scan_dirty(struct inode *inode, struct list_head *dst, unsigned long idx_start, unsigned int npages)
 {
 	struct nfs_inode *nfsi = NFS_I(inode);
-	int res;
-	res = nfs_scan_lock_dirty(nfsi, dst, idx_start, npages);
-	nfsi->ndirty -= res;
-	sub_page_state(nr_dirty,res);
-	if ((nfsi->ndirty == 0) != list_empty(&nfsi->dirty))
-		printk(KERN_ERR "NFS: desynchronized value of nfs_i.ndirty.\n");
+	int res = 0;
+
+	if (nfsi->ndirty != 0) {
+		res = nfs_scan_lock_dirty(nfsi, dst, idx_start, npages);
+		nfsi->ndirty -= res;
+		sub_page_state(nr_dirty,res);
+		if ((nfsi->ndirty == 0) != list_empty(&nfsi->dirty))
+			printk(KERN_ERR "NFS: desynchronized value of nfs_i.ndirty.\n");
+	}
 	return res;
 }
@@ -563,11 +566,14 @@ static int
 nfs_scan_commit(struct inode *inode, struct list_head *dst, unsigned long idx_start, unsigned int npages)
 {
 	struct nfs_inode *nfsi = NFS_I(inode);
-	int res;
-	res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
-	nfsi->ncommit -= res;
-	if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
-		printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
+	int res;
+
+	if (nfsi->ncommit != 0) {
+		res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
+		nfsi->ncommit -= res;
+		if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
+			printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
+	}
 	return res;
 }
 #endif

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 11:58 ` Trond Myklebust @ 2005-03-31 12:34 ` Trond Myklebust 2005-03-31 13:58 ` Ingo Molnar 0 siblings, 1 reply; 45+ messages in thread From: Trond Myklebust @ 2005-03-31 12:34 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, rlrevell, linux-kernel

to den 31.03.2005 Klokka 06:58 (-0500) skreiv Trond Myklebust:
>
> @@ -563,11 +566,14 @@ static int
>  nfs_scan_commit(struct inode *inode, struct list_head *dst, unsigned long idx_start, unsigned int npages)
>  {
>  	struct nfs_inode *nfsi = NFS_I(inode);
> -	int res;
> -	res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
> -	nfsi->ncommit -= res;
> -	if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
> -		printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
> +	int res;
        ^^^^^^^ Should be "int res = 0;"
> +
> +	if (nfsi->ncommit != 0) {
> +		res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
> +		nfsi->ncommit -= res;
> +		if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
> +			printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
> +	}
>  	return res;
>  }
>  #endif

Cheers,
  Trond
-- 
Trond Myklebust <trond.myklebust@fys.uio.no>

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 12:34 ` Trond Myklebust @ 2005-03-31 13:58 ` Ingo Molnar 2005-03-31 14:32 ` Trond Myklebust 0 siblings, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 13:58 UTC (permalink / raw) To: Trond Myklebust; +Cc: Andrew Morton, rlrevell, linux-kernel [-- Attachment #1: Type: text/plain, Size: 416 bytes --] * Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > + int res; > ^^^^^^^ Should be "int res = 0;" your patch works fine here - but there are still latencies in nfs_scan_commit()/nfs_scan_list(): see the attached 3.7 msec latency trace. It happened during a simple big-file writeout and is easily reproducible. Could the nfsi->commit list searching be replaced with a radix based approach too? Ingo [-- Attachment #2: nfs-3769us-latency.bz2 --] [-- Type: application/x-bzip2, Size: 27069 bytes --] ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 13:58 ` Ingo Molnar @ 2005-03-31 14:32 ` Trond Myklebust 2005-03-31 14:39 ` Ingo Molnar 0 siblings, 1 reply; 45+ messages in thread From: Trond Myklebust @ 2005-03-31 14:32 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, rlrevell, linux-kernel to den 31.03.2005 Klokka 15:58 (+0200) skreiv Ingo Molnar: > * Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > > > + int res; > > ^^^^^^^ Should be "int res = 0;" > > your patch works fine here - but there are still latencies in > nfs_scan_commit()/nfs_scan_list(): see the attached 3.7 msec latency > trace. It happened during a simple big-file writeout and is easily > reproducible. Could the nfsi->commit list searching be replaced with a > radix based approach too? That would be 100% pure overhead. The nfsi->commit list does not need to be sorted and with these patches applied, it no longer is. In fact one of the cleanups I still need to do is to get rid of those redundant checks on wb->index (start is now always set to 0, and end is always ~0UL). So the overhead you are currently seeing should just be that of iterating through the list, locking said requests and adding them to our private list. Cheers, Trond -- Trond Myklebust <trond.myklebust@fys.uio.no> ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 14:32 ` Trond Myklebust @ 2005-03-31 14:39 ` Ingo Molnar 2005-03-31 14:50 ` Ingo Molnar ` (2 more replies) 0 siblings, 3 replies; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 14:39 UTC (permalink / raw) To: Trond Myklebust; +Cc: Andrew Morton, rlrevell, linux-kernel * Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > your patch works fine here - but there are still latencies in > > nfs_scan_commit()/nfs_scan_list(): see the attached 3.7 msec latency > > trace. It happened during a simple big-file writeout and is easily > > reproducible. Could the nfsi->commit list searching be replaced with a > > radix based approach too? > > That would be 100% pure overhead. The nfsi->commit list does not need > to be sorted and with these patches applied, it no longer is. In fact > one of the cleanups I still need to do is to get rid of those > redundant checks on wb->index (start is now always set to 0, and end > is always ~0UL). > > So the overhead you are currently seeing should just be that of > iterating through the list, locking said requests and adding them to > our private list. ah - cool! This was a 100 MB writeout so having 3.7 msecs to process 20K+ pages is not unreasonable. To break the latency, can i just do a simple lock-break, via the patch below? 
	Ingo

--- linux/fs/nfs/pagelist.c.orig
+++ linux/fs/nfs/pagelist.c
@@ -311,8 +311,9 @@ out:
  * You must be holding the inode's req_lock when calling this function
  */
 int
-nfs_scan_list(struct list_head *head, struct list_head *dst,
-	      unsigned long idx_start, unsigned int npages)
+nfs_scan_list(struct nfs_inode *nfsi, struct list_head *head,
+	      struct list_head *dst, unsigned long idx_start,
+	      unsigned int npages)
 {
 	struct list_head	*pos, *tmp;
 	struct nfs_page		*req;
@@ -327,6 +328,8 @@ nfs_scan_list(struct list_head *head, st
 
 	list_for_each_safe(pos, tmp, head) {
 
+		cond_resched_lock(&nfsi->req_lock);
+
 		req = nfs_list_entry(pos);
 
 		if (req->wb_index < idx_start)
--- linux/fs/nfs/write.c.orig
+++ linux/fs/nfs/write.c
@@ -569,7 +569,7 @@ nfs_scan_commit(struct inode *inode, str
 	int res = 0;
 
 	if (nfsi->ncommit != 0) {
-		res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
+		res = nfs_scan_list(nfsi, &nfsi->commit, dst, idx_start, npages);
 		nfsi->ncommit -= res;
 		if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
 			printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
--- linux/include/linux/nfs_page.h.orig
+++ linux/include/linux/nfs_page.h
@@ -63,8 +63,8 @@ extern void nfs_release_request(struct n
 extern	int nfs_scan_lock_dirty(struct nfs_inode *nfsi, struct list_head *dst,
 			unsigned long idx_start, unsigned int npages);
-extern	int nfs_scan_list(struct list_head *, struct list_head *,
-			unsigned long, unsigned int);
+extern	int nfs_scan_list(struct nfs_inode *nfsi, struct list_head *,
+			struct list_head *, unsigned long, unsigned int);
 extern	int nfs_coalesce_requests(struct list_head *, struct list_head *,
 			unsigned int);
 extern	int nfs_wait_on_request(struct nfs_page *);

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 14:39 ` Ingo Molnar @ 2005-03-31 14:50 ` Ingo Molnar 2005-04-01 2:28 ` Lee Revell 2005-03-31 14:54 ` Ingo Molnar 2005-03-31 14:54 ` Trond Myklebust 2 siblings, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 14:50 UTC (permalink / raw) To: Trond Myklebust; +Cc: Andrew Morton, rlrevell, linux-kernel * Ingo Molnar <mingo@elte.hu> wrote: > > So the overhead you are currently seeing should just be that of > > iterating through the list, locking said requests and adding them to > > our private list. > > ah - cool! This was a 100 MB writeout so having 3.7 msecs to process > 20K+ pages is not unreasonable. To break the latency, can i just do a > simple lock-break, via the patch below? with this patch the worst-case latency during NFS writeout is down to 40 usecs (!). Lee: i've uploaded the -42-05 release with this patch included - could you test it on your (no doubt more complex than mine) NFS setup? Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 14:50 ` Ingo Molnar @ 2005-04-01 2:28 ` Lee Revell 2005-04-01 4:30 ` Ingo Molnar 0 siblings, 1 reply; 45+ messages in thread From: Lee Revell @ 2005-04-01 2:28 UTC (permalink / raw) To: Ingo Molnar; +Cc: Trond Myklebust, Andrew Morton, linux-kernel On Thu, 2005-03-31 at 16:50 +0200, Ingo Molnar wrote: > * Ingo Molnar <mingo@elte.hu> wrote: > > > > So the overhead you are currently seeing should just be that of > > > iterating through the list, locking said requests and adding them to > > > our private list. > > > > ah - cool! This was a 100 MB writeout so having 3.7 msecs to process > > 20K+ pages is not unreasonable. To break the latency, can i just do a > > simple lock-break, via the patch below? > > with this patch the worst-case latency during NFS writeout is down to 40 > usecs (!). > > Lee: i've uploaded the -42-05 release with this patch included - could > you test it on your (no doubt more complex than mine) NFS setup? This fixes all the NFS related latency problems I was seeing. Now the longest latency from an NFS kernel compile with "make -j64" is 391 usecs in get_swap_page. Lee ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-04-01 2:28 ` Lee Revell @ 2005-04-01 4:30 ` Ingo Molnar 2005-04-01 16:16 ` Orion Poplawski 2005-04-01 21:18 ` Lee Revell 0 siblings, 2 replies; 45+ messages in thread From: Ingo Molnar @ 2005-04-01 4:30 UTC (permalink / raw) To: Lee Revell; +Cc: Trond Myklebust, Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 985 bytes --]

* Lee Revell <rlrevell@joe-job.com> wrote:

> > > ah - cool! This was a 100 MB writeout so having 3.7 msecs to process
> > > 20K+ pages is not unreasonable. To break the latency, can i just do a
> > > simple lock-break, via the patch below?
> >
> > with this patch the worst-case latency during NFS writeout is down to 40
> > usecs (!).
> >
> > Lee: i've uploaded the -42-05 release with this patch included - could
> > you test it on your (no doubt more complex than mine) NFS setup?
>
> This fixes all the NFS related latency problems I was seeing. Now the
> longest latency from an NFS kernel compile with "make -j64" is 391
> usecs in get_swap_page.

great! The latest patches (-42-08 and later) have the reworked nfs_scan_list() lock-breaker, which should perform similarly.

i bet these NFS patches also improve generic NFS performance on fast networks. I've attached the full patchset with all fixes and improvements included - might be worth a try in -mm?

	Ingo

[-- Attachment #2: nfs-latency.patch --]
[-- Type: text/plain, Size: 15547 bytes --]

--- linux/fs/nfs/inode.c.orig
+++ linux/fs/nfs/inode.c
@@ -118,7 +118,7 @@ nfs_write_inode(struct inode *inode, int
 	int flags = sync ? FLUSH_WAIT : 0;
 	int ret;
 
-	ret = nfs_commit_inode(inode, 0, 0, flags);
+	ret = nfs_commit_inode(inode, flags);
 	if (ret < 0)
 		return ret;
 	return 0;
--- linux/fs/nfs/write.c.orig
+++ linux/fs/nfs/write.c
@@ -352,7 +352,7 @@ int nfs_writepages(struct address_space
 		if (err < 0)
 			goto out;
 	}
-	err = nfs_commit_inode(inode, 0, 0, wb_priority(wbc));
+	err = nfs_commit_inode(inode, wb_priority(wbc));
 	if (err > 0) {
 		wbc->nr_to_write -= err;
 		err = 0;
@@ -446,6 +448,8 @@ nfs_mark_request_dirty(struct nfs_page *
 	struct nfs_inode *nfsi = NFS_I(inode);
 
 	spin_lock(&nfsi->req_lock);
+	radix_tree_tag_set(&nfsi->nfs_page_tree,
+			req->wb_index, NFS_PAGE_TAG_DIRTY);
 	nfs_list_add_request(req, &nfsi->dirty);
 	nfsi->ndirty++;
 	spin_unlock(&nfsi->req_lock);
@@ -503,13 +505,12 @@ nfs_wait_on_requests(struct inode *inode
 	spin_lock(&nfsi->req_lock);
 	next = idx_start;
-	while (radix_tree_gang_lookup(&nfsi->nfs_page_tree, (void **)&req, next, 1)) {
+	while (radix_tree_gang_lookup_tag(&nfsi->nfs_page_tree, (void **)&req, next, 1, NFS_PAGE_TAG_WRITEBACK)) {
 		if (req->wb_index > idx_end)
 			break;
 
 		next = req->wb_index + 1;
-		if (!NFS_WBACK_BUSY(req))
-			continue;
+		BUG_ON(!NFS_WBACK_BUSY(req));
 		atomic_inc(&req->wb_count);
 		spin_unlock(&nfsi->req_lock);
@@ -538,12 +539,15 @@ static int
 nfs_scan_dirty(struct inode *inode, struct list_head *dst, unsigned long idx_start, unsigned int npages)
 {
 	struct nfs_inode *nfsi = NFS_I(inode);
-	int res;
-	res = nfs_scan_list(&nfsi->dirty, dst, idx_start, npages);
-	nfsi->ndirty -= res;
-	sub_page_state(nr_dirty,res);
-	if ((nfsi->ndirty == 0) != list_empty(&nfsi->dirty))
-		printk(KERN_ERR "NFS: desynchronized value of nfs_i.ndirty.\n");
+	int res = 0;
+
+	if (nfsi->ndirty != 0) {
+		res = nfs_scan_lock_dirty(nfsi, dst, idx_start, npages);
+		nfsi->ndirty -= res;
+		sub_page_state(nr_dirty,res);
+		if ((nfsi->ndirty == 0) != list_empty(&nfsi->dirty))
+			printk(KERN_ERR "NFS: desynchronized value of nfs_i.ndirty.\n");
+	}
 	return res;
 }
@@ -562,11 +566,14 @@ static int
 nfs_scan_commit(struct inode *inode, struct list_head *dst, unsigned long idx_start, unsigned int npages)
 {
 	struct nfs_inode *nfsi = NFS_I(inode);
-	int res;
-	res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
-	nfsi->ncommit -= res;
-	if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
-		printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
+	int res = 0;
+
+	if (nfsi->ncommit != 0) {
+		res = nfs_scan_list(nfsi, &nfsi->commit, dst, idx_start, npages);
+		nfsi->ncommit -= res;
+		if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
+			printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
+	}
 	return res;
 }
 #endif
@@ -821,7 +828,7 @@ out:
 #else
 	nfs_inode_remove_request(req);
 #endif
-	nfs_unlock_request(req);
+	nfs_clear_page_writeback(req);
 }
 
 static inline int flush_task_priority(int how)
@@ -952,7 +959,7 @@ out_bad:
 		nfs_writedata_free(data);
 	}
 	nfs_mark_request_dirty(req);
-	nfs_unlock_request(req);
+	nfs_clear_page_writeback(req);
 	return -ENOMEM;
 }
@@ -1002,7 +1009,7 @@ static int nfs_flush_one(struct list_hea
 		struct nfs_page *req = nfs_list_entry(head->next);
 		nfs_list_remove_request(req);
 		nfs_mark_request_dirty(req);
-		nfs_unlock_request(req);
+		nfs_clear_page_writeback(req);
 	}
 	return -ENOMEM;
 }
@@ -1029,7 +1036,7 @@ nfs_flush_list(struct list_head *head, i
 		req = nfs_list_entry(head->next);
 		nfs_list_remove_request(req);
 		nfs_mark_request_dirty(req);
-		nfs_unlock_request(req);
+		nfs_clear_page_writeback(req);
 	}
 	return error;
 }
@@ -1121,7 +1128,7 @@ static void nfs_writeback_done_full(stru
 		nfs_inode_remove_request(req);
 #endif
 	next:
-		nfs_unlock_request(req);
+		nfs_clear_page_writeback(req);
 	}
 }
@@ -1210,36 +1217,24 @@ static void nfs_commit_rpcsetup(struct l
		struct nfs_write_data *data, int how)
 {
 	struct rpc_task		*task = &data->task;
-	struct nfs_page		*first, *last;
+	struct nfs_page		*first;
 	struct inode		*inode;
-	loff_t			start, end, len;
 
 	/* Set up the RPC argument and reply structs
 	 * NB: take care not to mess about with data->commit et al. */
 
 	list_splice_init(head, &data->pages);
 	first = nfs_list_entry(data->pages.next);
-	last = nfs_list_entry(data->pages.prev);
 	inode = first->wb_context->dentry->d_inode;
 
-	/*
-	 * Determine the offset range of requests in the COMMIT call.
-	 * We rely on the fact that data->pages is an ordered list...
-	 */
-	start = req_offset(first);
-	end = req_offset(last) + last->wb_bytes;
-	len = end - start;
-	/* If 'len' is not a 32-bit quantity, pass '0' in the COMMIT call */
-	if (end >= i_size_read(inode) || len < 0 || len > (~((u32)0) >> 1))
-		len = 0;
-
 	data->inode	  = inode;
 	data->cred	  = first->wb_context->cred;
 
 	data->args.fh     = NFS_FH(data->inode);
-	data->args.offset = start;
-	data->args.count  = len;
-	data->res.count   = len;
+	/* Note: we always request a commit of the entire inode */
+	data->args.offset = 0;
+	data->args.count  = 0;
+	data->res.count   = 0;
 	data->res.fattr   = &data->fattr;
 	data->res.verf    = &data->verf;
@@ -1278,7 +1273,7 @@ nfs_commit_list(struct list_head *head, 
 		req = nfs_list_entry(head->next);
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
-		nfs_unlock_request(req);
+		nfs_clear_page_writeback(req);
 	}
 	return -ENOMEM;
 }
@@ -1324,7 +1319,7 @@ nfs_commit_done(struct rpc_task *task)
 			dprintk(" mismatch\n");
 		nfs_mark_request_dirty(req);
 	next:
-		nfs_unlock_request(req);
+		nfs_clear_page_writeback(req);
 		res++;
 	}
 	sub_page_state(nr_unstable,res);
@@ -1350,8 +1345,7 @@ static int nfs_flush_inode(struct inode
 }
 
 #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4)
-int nfs_commit_inode(struct inode *inode, unsigned long idx_start,
-		unsigned int npages, int how)
+int nfs_commit_inode(struct inode *inode, int how)
 {
 	struct nfs_inode *nfsi = NFS_I(inode);
 	LIST_HEAD(head);
@@ -1359,15 +1353,13 @@ int nfs_commit_inode(struct inode
 	error = 0;
 
 	spin_lock(&nfsi->req_lock);
-	res = nfs_scan_commit(inode, &head, idx_start, npages);
+	res = nfs_scan_commit(inode, &head, 0, 0);
+	spin_unlock(&nfsi->req_lock);
 	if (res) {
-		res += nfs_scan_commit(inode, &head, 0, 0);
-		spin_unlock(&nfsi->req_lock);
 		error = nfs_commit_list(&head, how);
-	} else
-		spin_unlock(&nfsi->req_lock);
-	if (error < 0)
-		return error;
+		if (error < 0)
+			return error;
+	}
 	return res;
 }
 #endif
@@ -1389,7 +1381,7 @@ int nfs_sync_inode(struct inode *inode,
 		error = nfs_flush_inode(inode, idx_start, npages, how);
 #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4)
 		if (error == 0)
-			error = nfs_commit_inode(inode, idx_start, npages, how);
+			error = nfs_commit_inode(inode, how);
 #endif
 	} while (error > 0);
 	return error;
--- linux/fs/nfs/read.c.orig
+++ linux/fs/nfs/read.c
@@ -173,7 +173,6 @@ static int nfs_readpage_async(struct nfs
 	if (len < PAGE_CACHE_SIZE)
 		memclear_highpage_flush(page, len, PAGE_CACHE_SIZE - len);
 
-	nfs_lock_request(new);
 	nfs_list_add_request(new, &one_request);
 	nfs_pagein_one(&one_request, inode);
 	return 0;
@@ -185,7 +184,6 @@ static void nfs_readpage_release(struct
 	nfs_clear_request(req);
 	nfs_release_request(req);
-	nfs_unlock_request(req);
 
 	dprintk("NFS: read done (%s/%Ld %d@%Ld)\n",
 			req->wb_context->dentry->d_inode->i_sb->s_id,
@@ -553,7 +551,6 @@ readpage_async_filler(void *data, struct
 	}
 	if (len < PAGE_CACHE_SIZE)
 		memclear_highpage_flush(page, len, PAGE_CACHE_SIZE - len);
-	nfs_lock_request(new);
 	nfs_list_add_request(new, desc->head);
 	return 0;
 }
--- linux/fs/nfs/pagelist.c.orig
+++ linux/fs/nfs/pagelist.c
@@ -112,6 +112,33 @@ void nfs_unlock_request(struct nfs_page
 }
 
 /**
+ * nfs_set_page_writeback_locked - Lock a request for writeback
+ * @req:
+ */
+int nfs_set_page_writeback_locked(struct nfs_page *req)
+{
+	struct nfs_inode *nfsi = NFS_I(req->wb_context->dentry->d_inode);
+
+	if (!nfs_lock_request(req))
+		return 0;
+	radix_tree_tag_set(&nfsi->nfs_page_tree, req->wb_index, NFS_PAGE_TAG_WRITEBACK);
+	return 1;
+}
+
+/**
+ * nfs_clear_page_writeback - Unlock request and wake up sleepers
+ */
+void nfs_clear_page_writeback(struct nfs_page *req)
+{
+	struct nfs_inode *nfsi = NFS_I(req->wb_context->dentry->d_inode);
+
+	spin_lock(&nfsi->req_lock);
+	radix_tree_tag_clear(&nfsi->nfs_page_tree, req->wb_index, NFS_PAGE_TAG_WRITEBACK);
+	spin_unlock(&nfsi->req_lock);
+	nfs_unlock_request(req);
+}
+
+/**
  * nfs_clear_request - Free up all resources allocated to the request
  * @req:
  *
@@ -151,36 +178,6 @@ nfs_release_request(struct nfs_page *req
 }
 
 /**
- * nfs_list_add_request - Insert a request into a sorted list
- * @req: request
- * @head: head of list into which to insert the request.
- *
- * Note that the wb_list is sorted by page index in order to facilitate
- * coalescing of requests.
- * We use an insertion sort that is optimized for the case of appended
- * writes.
- */
-void
-nfs_list_add_request(struct nfs_page *req, struct list_head *head)
-{
-	struct list_head *pos;
-
-#ifdef NFS_PARANOIA
-	if (!list_empty(&req->wb_list)) {
-		printk(KERN_ERR "NFS: Add to list failed!\n");
-		BUG();
-	}
-#endif
-	list_for_each_prev(pos, head) {
-		struct nfs_page	*p = nfs_list_entry(pos);
-		if (p->wb_index < req->wb_index)
-			break;
-	}
-	list_add(&req->wb_list, pos);
-	req->wb_list_head = head;
-}
-
-/**
  * nfs_wait_on_request - Wait for a request to complete.
  * @req: request to wait upon.
  *
@@ -243,6 +240,63 @@ nfs_coalesce_requests(struct list_head *
 	return npages;
 }
 
+#define NFS_SCAN_MAXENTRIES 16
+/**
+ * nfs_scan_lock_dirty - Scan the radix tree for dirty requests
+ * @nfsi: NFS inode
+ * @dst: Destination list
+ * @idx_start: lower bound of page->index to scan
+ * @npages: idx_start + npages sets the upper bound to scan.
+ *
+ * Moves elements from one of the inode request lists.
+ * If the number of requests is set to 0, the entire address_space
+ * starting at index idx_start, is scanned.
+ * The requests are *not* checked to ensure that they form a contiguous set.
+ * You must be holding the inode's req_lock when calling this function + */ +int +nfs_scan_lock_dirty(struct nfs_inode *nfsi, struct list_head *dst, + unsigned long idx_start, unsigned int npages) +{ + struct nfs_page *pgvec[NFS_SCAN_MAXENTRIES]; + struct nfs_page *req; + unsigned long idx_end; + int found, i; + int res; + + res = 0; + if (npages == 0) + idx_end = ~0; + else + idx_end = idx_start + npages - 1; + + for (;;) { + found = radix_tree_gang_lookup_tag(&nfsi->nfs_page_tree, + (void **)&pgvec[0], idx_start, NFS_SCAN_MAXENTRIES, + NFS_PAGE_TAG_DIRTY); + if (found <= 0) + break; + for (i = 0; i < found; i++) { + req = pgvec[i]; + if (req->wb_index > idx_end) + goto out; + + idx_start = req->wb_index + 1; + + if (nfs_set_page_writeback_locked(req)) { + radix_tree_tag_clear(&nfsi->nfs_page_tree, + req->wb_index, NFS_PAGE_TAG_DIRTY); + nfs_list_remove_request(req); + nfs_list_add_request(req, dst); + res++; + } + } + cond_resched_lock(&nfsi->req_lock); + } +out: + return res; +} + /** * nfs_scan_list - Scan a list for matching requests * @head: One of the NFS inode request lists @@ -257,10 +311,12 @@ nfs_coalesce_requests(struct list_head * * You must be holding the inode's req_lock when calling this function */ int -nfs_scan_list(struct list_head *head, struct list_head *dst, - unsigned long idx_start, unsigned int npages) +nfs_scan_list(struct nfs_inode *nfsi, struct list_head *head, + struct list_head *dst, unsigned long idx_start, + unsigned int npages) { - struct list_head *pos, *tmp; + LIST_HEAD(locked); + struct list_head *pos; struct nfs_page *req; unsigned long idx_end; int res; @@ -271,21 +327,22 @@ nfs_scan_list(struct list_head *head, st else idx_end = idx_start + npages - 1; - list_for_each_safe(pos, tmp, head) { + while (!list_empty(head)) { + pos = head->next; req = nfs_list_entry(pos); - if (req->wb_index < idx_start) - continue; - if (req->wb_index > idx_end) - break; - - if (!nfs_lock_request(req)) - continue; - nfs_list_remove_request(req); - 
nfs_list_add_request(req, dst); - res++; + if (!nfs_set_page_writeback_locked(req)) { + list_del(pos); + list_add(&req->wb_list, &locked); + } else { + nfs_list_remove_request(req); + nfs_list_add_request(req, dst); + res++; + } + cond_resched_lock(&nfsi->req_lock); } + list_splice(&locked, head); return res; } --- linux/include/linux/nfs_fs.h.orig +++ linux/include/linux/nfs_fs.h @@ -377,10 +377,10 @@ extern void nfs_commit_done(struct rpc_t */ extern int nfs_sync_inode(struct inode *, unsigned long, unsigned int, int); #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4) -extern int nfs_commit_inode(struct inode *, unsigned long, unsigned int, int); +extern int nfs_commit_inode(struct inode *, int); #else static inline int -nfs_commit_inode(struct inode *inode, unsigned long idx_start, unsigned int npages, int how) +nfs_commit_inode(struct inode *inode, int how) { return 0; } --- linux/include/linux/nfs_page.h.orig +++ linux/include/linux/nfs_page.h @@ -20,12 +20,19 @@ #include <asm/atomic.h> /* + * Valid flags for the radix tree + */ +#define NFS_PAGE_TAG_DIRTY 0 +#define NFS_PAGE_TAG_WRITEBACK 1 + +/* * Valid flags for a dirty buffer */ #define PG_BUSY 0 #define PG_NEED_COMMIT 1 #define PG_NEED_RESCHED 2 +struct nfs_inode; struct nfs_page { struct list_head wb_list, /* Defines state of page: */ *wb_list_head; /* read/write/commit */ @@ -54,14 +61,17 @@ extern void nfs_clear_request(struct nfs extern void nfs_release_request(struct nfs_page *req); -extern void nfs_list_add_request(struct nfs_page *, struct list_head *); - -extern int nfs_scan_list(struct list_head *, struct list_head *, - unsigned long, unsigned int); +extern int nfs_scan_lock_dirty(struct nfs_inode *nfsi, struct list_head *dst, + unsigned long idx_start, unsigned int npages); +extern int nfs_scan_list(struct nfs_inode *nfsi, struct list_head *, + struct list_head *, unsigned long, unsigned int); extern int nfs_coalesce_requests(struct list_head *, struct list_head *, unsigned int); extern int 
nfs_wait_on_request(struct nfs_page *); extern void nfs_unlock_request(struct nfs_page *req); +extern int nfs_set_page_writeback_locked(struct nfs_page *req); +extern void nfs_clear_page_writeback(struct nfs_page *req); + /* * Lock the page of an asynchronous request without incrementing the wb_count @@ -86,6 +96,18 @@ nfs_lock_request(struct nfs_page *req) return 1; } +/** + * nfs_list_add_request - Insert a request into a list + * @req: request + * @head: head of list into which to insert the request. + */ +static inline void +nfs_list_add_request(struct nfs_page *req, struct list_head *head) +{ + list_add_tail(&req->wb_list, head); + req->wb_list_head = head; +} + /** * nfs_list_remove_request - Remove a request from its wb_list @@ -96,10 +118,6 @@ nfs_list_remove_request(struct nfs_page { if (list_empty(&req->wb_list)) return; - if (!NFS_WBACK_BUSY(req)) { - printk(KERN_ERR "NFS: unlocked request attempted removed from list!\n"); - BUG(); - } list_del_init(&req->wb_list); req->wb_list_head = NULL; } ^ permalink raw reply [flat|nested] 45+ messages in thread
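The nfs_scan_lock_dirty() loop in the patch above batches radix-tree gang lookups and advances idx_start past each result, which is what makes a later lock-break safe: the scan can resume exactly where it left off. The userspace C sketch below models that pattern with a plain array standing in for the radix tree and its tags; all names and the batch size are illustrative, and nothing here is a kernel API.

```c
/* Userspace sketch (not kernel code) of the batched tag-scan pattern:
 * pages are tagged "dirty" in an index-ordered structure, and the scanner
 * looks up a small batch at a time, advancing idx_start past the last
 * result so the lock could be dropped between batches without losing
 * its place. */
#include <assert.h>
#include <stddef.h>

#define SCAN_MAXENTRIES 4
#define NPAGES 32

static int dirty_tag[NPAGES];          /* stand-in for radix tree tags */

/* gang lookup: collect up to max indices tagged dirty, starting at start */
static int gang_lookup_dirty(unsigned long start, unsigned long *out, int max)
{
    int found = 0;
    for (unsigned long i = start; i < NPAGES && found < max; i++)
        if (dirty_tag[i])
            out[found++] = i;
    return found;
}

/* Move all dirty pages in [idx_start, idx_end] to dst, batch by batch;
 * between batches a real implementation would call cond_resched_lock()
 * to bound the time the spinlock is held. */
static int scan_dirty(unsigned long idx_start, unsigned long idx_end,
                      unsigned long *dst)
{
    unsigned long batch[SCAN_MAXENTRIES];
    int res = 0;

    for (;;) {
        int found = gang_lookup_dirty(idx_start, batch, SCAN_MAXENTRIES);
        if (found <= 0)
            break;
        for (int i = 0; i < found; i++) {
            if (batch[i] > idx_end)
                return res;
            idx_start = batch[i] + 1;  /* resume point survives a lock-break */
            dirty_tag[batch[i]] = 0;   /* clear tag, "move" the page to dst */
            dst[res++] = batch[i];
        }
        /* cond_resched_lock(&req_lock) would go here */
    }
    return res;
}
```

Because the resume point is an index rather than a list position, concurrent insertions or removals elsewhere in the structure cannot invalidate the scanner's cursor, unlike the list walk this code replaces.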
* Re: NFS client latencies 2005-04-01 4:30 ` Ingo Molnar @ 2005-04-01 16:16 ` Orion Poplawski 2005-04-01 16:33 ` Trond Myklebust 2005-04-01 21:18 ` Lee Revell 1 sibling, 1 reply; 45+ messages in thread From: Orion Poplawski @ 2005-04-01 16:16 UTC (permalink / raw) To: linux-kernel Ingo Molnar wrote: > * Lee Revell <rlrevell@joe-job.com> wrote: >>This fixes all the NFS related latency problems I was seeing. Now the >>longest latency from an NFS kernel compile with "make -j64" is 391 >>usecs in get_swap_page. > > > great! The latest patches (-42-08 and later) have the reworked > nfs_scan_list() lock-breaker, which should perform similarly. > > i bet these NFS patches also improve generic NFS performance on fast > networks. I've attached the full patchset with all fixes and > improvements included - might be worth a try in -mm? > Just a question - would these changes be expected to improve NFS client *read* access at all, or just write? Thanks! ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-04-01 16:16 ` Orion Poplawski @ 2005-04-01 16:33 ` Trond Myklebust 0 siblings, 0 replies; 45+ messages in thread From: Trond Myklebust @ 2005-04-01 16:33 UTC (permalink / raw) To: Orion Poplawski; +Cc: linux-kernel fr den 01.04.2005 Klokka 09:16 (-0700) skreiv Orion Poplawski: > Just a question - would these changes be expected to improve NFS client > *read* access at all, or just write? Just write. Cheers, Trond -- Trond Myklebust <trond.myklebust@fys.uio.no> ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-04-01 4:30 ` Ingo Molnar 2005-04-01 16:16 ` Orion Poplawski @ 2005-04-01 21:18 ` Lee Revell 1 sibling, 0 replies; 45+ messages in thread From: Lee Revell @ 2005-04-01 21:18 UTC (permalink / raw) To: Ingo Molnar; +Cc: Trond Myklebust, Andrew Morton, linux-kernel On Fri, 2005-04-01 at 06:30 +0200, Ingo Molnar wrote: > * Lee Revell <rlrevell@joe-job.com> wrote: > > > > > ah - cool! This was a 100 MB writeout so having 3.7 msecs to process > > > > 20K+ pages is not unreasonable. To break the latency, can i just do a > > > > simple lock-break, via the patch below? > > > > > > with this patch the worst-case latency during NFS writeout is down to 40 > > > usecs (!). > > > > > > Lee: i've uploaded the -42-05 release with this patch included - could > > > you test it on your (no doubt more complex than mine) NFS setup? > > > > This fixes all the NFS related latency problems I was seeing. Now the > > longest latency from an NFS kernel compile with "make -j64" is 391 > > usecs in get_swap_page. > > great! The latest patches (-42-08 and later) have the reworked > nfs_scan_list() lock-breaker, which should perform similarly. > > i bet these NFS patches also improve generic NFS performance on fast > networks. I've attached the full patchset with all fixes and > improvements included - might be worth a try in -mm? With tracing disabled on the C3 (which is the NFS server, don't ask), the maximum latency during the same kernel compile is about 2ms. So tracing overhead probably doubled or tripled the latencies. I'll try again with tracing enabled to determine whether these code paths are related to the NFS server or not. It's either going to be that or the get_swap_page stuff. But the client side is OK now. Lee ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 14:39 ` Ingo Molnar 2005-03-31 14:50 ` Ingo Molnar @ 2005-03-31 14:54 ` Ingo Molnar 2005-03-31 15:00 ` Trond Myklebust 2005-03-31 14:54 ` Trond Myklebust 2 siblings, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 14:54 UTC (permalink / raw) To: Trond Myklebust; +Cc: Andrew Morton, rlrevell, linux-kernel * Ingo Molnar <mingo@elte.hu> wrote: > [...] To break the latency, can i just do a simple lock-break, via the > patch below? i think it's incorrect in its current form, because 'pos' is not necessarily safe and might be removed from the commit_list? the whole loop could be a "while (!list_empty(head))" thing, if it wasnt for the possibility for an nfs_set_page_writeback_locked() to skip an entry. Maybe a local list of 'already processed' requests should be built, and once the main list is emptied, moved back into the main list? Then the lock-break could be moved to the end of the loop. Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 14:54 ` Ingo Molnar @ 2005-03-31 15:00 ` Trond Myklebust 0 siblings, 0 replies; 45+ messages in thread From: Trond Myklebust @ 2005-03-31 15:00 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, rlrevell, linux-kernel to den 31.03.2005 Klokka 16:54 (+0200) skreiv Ingo Molnar: > i think it's incorrect in its current form, because 'pos' is not > necessarily safe and might be removed from the commit_list? Right. > the whole loop could be a "while (!list_empty(head))" thing, if it wasnt > for the possibility for an nfs_set_page_writeback_locked() to skip an > entry. Maybe a local list of 'already processed' requests should be > built, and once the main list is emptied, moved back into the main list? > Then the lock-break could be moved to the end of the loop. Should be pretty much unnecessary. See my previous email. Cheers, Trond -- Trond Myklebust <trond.myklebust@fys.uio.no> ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 14:39 ` Ingo Molnar 2005-03-31 14:50 ` Ingo Molnar 2005-03-31 14:54 ` Ingo Molnar @ 2005-03-31 14:54 ` Trond Myklebust 2005-03-31 14:58 ` Ingo Molnar 2 siblings, 1 reply; 45+ messages in thread From: Trond Myklebust @ 2005-03-31 14:54 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, rlrevell, linux-kernel to den 31.03.2005 Klokka 16:39 (+0200) skreiv Ingo Molnar: > * Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > > > your patch works fine here - but there are still latencies in > > > nfs_scan_commit()/nfs_scan_list(): see the attached 3.7 msec latency > > > trace. It happened during a simple big-file writeout and is easily > > > reproducible. Could the nfsi->commit list searching be replaced with a > > > radix based approach too? > > > > That would be 100% pure overhead. The nfsi->commit list does not need > > to be sorted and with these patches applied, it no longer is. In fact > > one of the cleanups I still need to do is to get rid of those > > redundant checks on wb->index (start is now always set to 0, and end > > is always ~0UL). > > > > So the overhead you are currently seeing should just be that of > > iterating through the list, locking said requests and adding them to > > our private list. > > ah - cool! This was a 100 MB writeout so having 3.7 msecs to process > 20K+ pages is not unreasonable. To break the latency, can i just do a > simple lock-break, via the patch below? Errm... That looks a bit unsafe. What is there then to stop another process from removing "pos" while you are scheduling? It seems to me that you should really start afresh in that case. The good news, though, is that because requests on the "commit" list do not remain locked for long without being removed from the list, you should rarely have to skip them. IOW restarting from the head of the list is pretty much the same as starting from where you left off. 
Cheers,
  Trond

> 	Ingo
>
> --- linux/fs/nfs/pagelist.c.orig
> +++ linux/fs/nfs/pagelist.c
> @@ -311,8 +311,9 @@ out:
>   * You must be holding the inode's req_lock when calling this function
>   */
>  int
> -nfs_scan_list(struct list_head *head, struct list_head *dst,
> -		unsigned long idx_start, unsigned int npages)
> +nfs_scan_list(struct nfs_inode *nfsi, struct list_head *head,
> +		struct list_head *dst, unsigned long idx_start,
> +		unsigned int npages)
>  {
>  	struct list_head *pos, *tmp;
>  	struct nfs_page *req;
> @@ -327,6 +328,8 @@ nfs_scan_list(struct list_head *head, st
>
>  	list_for_each_safe(pos, tmp, head) {
>
> +		cond_resched_lock(&nfsi->req_lock);
> +
>  		req = nfs_list_entry(pos);
>
>  		if (req->wb_index < idx_start)
> --- linux/fs/nfs/write.c.orig
> +++ linux/fs/nfs/write.c
> @@ -569,7 +569,7 @@ nfs_scan_commit(struct inode *inode, str
>  	int res = 0;
>
>  	if (nfsi->ncommit != 0) {
> -		res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
> +		res = nfs_scan_list(nfsi, &nfsi->commit, dst, idx_start, npages);
>  		nfsi->ncommit -= res;
>  		if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
>  			printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
> --- linux/include/linux/nfs_page.h.orig
> +++ linux/include/linux/nfs_page.h
> @@ -63,8 +63,8 @@ extern void nfs_release_request(struct n
>
>  extern int nfs_scan_lock_dirty(struct nfs_inode *nfsi, struct list_head *dst,
>  			unsigned long idx_start, unsigned int npages);
> -extern int nfs_scan_list(struct list_head *, struct list_head *,
> -			unsigned long, unsigned int);
> +extern int nfs_scan_list(struct nfs_inode *nfsi, struct list_head *,
> +			struct list_head *, unsigned long, unsigned int);
>  extern int nfs_coalesce_requests(struct list_head *, struct list_head *,
>  			unsigned int);
>  extern int nfs_wait_on_request(struct nfs_page *);

--
Trond Myklebust <trond.myklebust@fys.uio.no>
* Re: NFS client latencies 2005-03-31 14:54 ` Trond Myklebust @ 2005-03-31 14:58 ` Ingo Molnar 2005-03-31 15:06 ` Trond Myklebust 2005-03-31 15:10 ` Ingo Molnar 0 siblings, 2 replies; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 14:58 UTC (permalink / raw) To: Trond Myklebust; +Cc: Andrew Morton, rlrevell, linux-kernel * Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > Errm... That looks a bit unsafe. What is there then to stop another > process from removing "pos" while you are scheduling? It seems to me > that you should really start afresh in that case. yeah. > The good news, though, is that because requests on the "commit" list > do not remain locked for long without being removed from the list, you > should rarely have to skip them. IOW restarting from the head of the > list is pretty much the same as starting from where you left off. as we've learned it through painful experience on the ext3 side, restarting scans where 'skipping' has to be done is unhealthy. would it be safe to collect locked entries into a separate, local list, so that the restart would only see newly added entries? Then once the moving of all entries has been done, all the locked entries could be added back to the commit_list via one list_add. (can anything happen to those locked entries that would break this method?) Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 14:58 ` Ingo Molnar @ 2005-03-31 15:06 ` Trond Myklebust 2005-03-31 15:10 ` Ingo Molnar 2005-03-31 15:10 ` Ingo Molnar 1 sibling, 1 reply; 45+ messages in thread From: Trond Myklebust @ 2005-03-31 15:06 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, rlrevell, linux-kernel to den 31.03.2005 Klokka 16:58 (+0200) skreiv Ingo Molnar: > would it be safe to collect locked entries into a separate, local list, > so that the restart would only see newly added entries? Then once the > moving of all entries has been done, all the locked entries could be > added back to the commit_list via one list_add. (can anything happen to > those locked entries that would break this method?) You are not allowed to remove an entry from the list if it is locked by someone else. Locking grants temporary ownership of the entry. Cheers, Trond -- Trond Myklebust <trond.myklebust@fys.uio.no> ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 15:06 ` Trond Myklebust @ 2005-03-31 15:10 ` Ingo Molnar 2005-03-31 16:00 ` Trond Myklebust 0 siblings, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 15:10 UTC (permalink / raw) To: Trond Myklebust; +Cc: Andrew Morton, rlrevell, linux-kernel * Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > would it be safe to collect locked entries into a separate, local list, > > so that the restart would only see newly added entries? Then once the > > moving of all entries has been done, all the locked entries could be > > added back to the commit_list via one list_add. (can anything happen to > > those locked entries that would break this method?) > > You are not allowed to remove an entry from the list if it is locked > by someone else. Locking grants temporary ownership of the entry. well, removing a neighboring entry will change the locked entry's link fields (e.g. req->wb_list.prev), so this ownership cannot involve this particular list, can it? Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 15:10 ` Ingo Molnar @ 2005-03-31 16:00 ` Trond Myklebust 0 siblings, 0 replies; 45+ messages in thread From: Trond Myklebust @ 2005-03-31 16:00 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, rlrevell, linux-kernel to den 31.03.2005 Klokka 17:10 (+0200) skreiv Ingo Molnar: > * Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > > > would it be safe to collect locked entries into a separate, local list, > > > so that the restart would only see newly added entries? Then once the > > > moving of all entries has been done, all the locked entries could be > > > added back to the commit_list via one list_add. (can anything happen to > > > those locked entries that would break this method?) > > > > You are not allowed to remove an entry from the list if it is locked > > by someone else. Locking grants temporary ownership of the entry. > > well, removing a neighboring entry will change the locked entry's link > fields (e.g. req->wb_list.prev), so this ownership cannot involve this > particular list, can it? The point is that you are taking something that was previously globally visible and owned by somebody else and making it globally invisible by placing it on a private list. I'm not sure whether or not that is going to cause bugs in the current code (it may actually be safe with the current code), but as far as clean ownership semantics go, it sucks. Cheers, Trond -- Trond Myklebust <trond.myklebust@fys.uio.no> ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 14:58 ` Ingo Molnar 2005-03-31 15:06 ` Trond Myklebust @ 2005-03-31 15:10 ` Ingo Molnar 1 sibling, 0 replies; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 15:10 UTC (permalink / raw) To: Trond Myklebust; +Cc: Andrew Morton, rlrevell, linux-kernel * Ingo Molnar <mingo@elte.hu> wrote: > > The good news, though, is that because requests on the "commit" list > > do not remain locked for long without being removed from the list, you > > should rarely have to skip them. IOW restarting from the head of the > > list is pretty much the same as starting from where you left off. > > as we've learned it through painful experience on the ext3 side, > restarting scans where 'skipping' has to be done is unhealthy. > > would it be safe to collect locked entries into a separate, local > list, so that the restart would only see newly added entries? Then > once the moving of all entries has been done, all the locked entries > could be added back to the commit_list via one list_add. (can anything > happen to those locked entries that would break this method?) i.e. like the patch below (compiles and is lightly tested). It's a pretty straightforward technique when dealing with latencies, and lets us sleep for sure. We dont have to know whether massive amounts of locked entries ever queue up or not. 
(thanks go to Arjan for noticing that it has to be list_splice() not
list_add() :-| )

	Ingo

--- linux/fs/nfs/pagelist.c.orig3
+++ linux/fs/nfs/pagelist.c
@@ -311,10 +311,12 @@ out:
  * You must be holding the inode's req_lock when calling this function
  */
 int
-nfs_scan_list(struct list_head *head, struct list_head *dst,
-		unsigned long idx_start, unsigned int npages)
+nfs_scan_list(struct nfs_inode *nfsi, struct list_head *head,
+		struct list_head *dst, unsigned long idx_start,
+		unsigned int npages)
 {
-	struct list_head *pos, *tmp;
+	LIST_HEAD(locked);
+	struct list_head *pos;
 	struct nfs_page *req;
 	unsigned long idx_end;
 	int res;
@@ -325,21 +327,22 @@ nfs_scan_list(struct list_head *head, st
 	else
 		idx_end = idx_start + npages - 1;
 
-	list_for_each_safe(pos, tmp, head) {
+	while (!list_empty(head)) {
+		pos = head->next;
 		req = nfs_list_entry(pos);
-		if (req->wb_index < idx_start)
-			continue;
-		if (req->wb_index > idx_end)
-			break;
-
-		if (!nfs_set_page_writeback_locked(req))
-			continue;
-		nfs_list_remove_request(req);
-		nfs_list_add_request(req, dst);
-		res++;
+		if (!nfs_set_page_writeback_locked(req)) {
+			list_del(pos);
+			list_add(&req->wb_list, &locked);
+		} else {
+			nfs_list_remove_request(req);
+			nfs_list_add_request(req, dst);
+			res++;
+		}
+		cond_resched_lock(&nfsi->req_lock);
 	}
+	list_splice(&locked, head);
 	return res;
 }
--- linux/fs/nfs/write.c.orig3
+++ linux/fs/nfs/write.c
@@ -569,7 +569,7 @@ nfs_scan_commit(struct inode *inode, str
 	int res = 0;
 
 	if (nfsi->ncommit != 0) {
-		res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
+		res = nfs_scan_list(nfsi, &nfsi->commit, dst, idx_start, npages);
 		nfsi->ncommit -= res;
 		if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
 			printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
--- linux/include/linux/nfs_page.h.orig3
+++ linux/include/linux/nfs_page.h
@@ -63,8 +63,8 @@ extern void nfs_release_request(struct n
 
 extern int nfs_scan_lock_dirty(struct nfs_inode *nfsi, struct list_head *dst,
 			unsigned long idx_start, unsigned int npages);
-extern int nfs_scan_list(struct list_head *, struct list_head *,
-			unsigned long, unsigned int);
+extern int nfs_scan_list(struct nfs_inode *nfsi, struct list_head *,
+			struct list_head *, unsigned long, unsigned int);
 extern int nfs_coalesce_requests(struct list_head *, struct list_head *,
 			unsigned int);
 extern int nfs_wait_on_request(struct nfs_page *);
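The lock-break technique in the patch above can be modelled in a few lines of userspace C: entries that cannot be locked are parked on a private list so the while (!list_empty(head)) loop always shrinks the shared list, and the parked entries are spliced back once the scan completes. Everything below is illustrative, not kernel API: the list helpers mimic <linux/list.h>, and the busy flag stands in for nfs_set_page_writeback_locked() failing.

```c
/* Userspace sketch of the "park skipped entries, splice back" lock-break
 * pattern.  Each loop iteration removes the head entry, so progress is
 * guaranteed and the lock could safely be dropped between iterations. */
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };
#define LIST_INIT(name) { &(name), &(name) }

struct req { struct list_head link; int id; int busy; };

static void list_add_tail(struct list_head *n, struct list_head *h)
{ n->prev = h->prev; n->next = h; h->prev->next = n; h->prev = n; }

static void list_del(struct list_head *e)
{ e->prev->next = e->next; e->next->prev = e->prev; e->next = e->prev = e; }

static int list_empty(const struct list_head *h) { return h->next == h; }

/* Move every entry of src to the front of dst (src is left dangling,
 * as with the kernel's list_splice()). */
static void list_splice(struct list_head *src, struct list_head *dst)
{
    if (!list_empty(src)) {
        struct list_head *first = src->next, *last = src->prev;
        first->prev = dst;
        last->next = dst->next;
        dst->next->prev = last;
        dst->next = first;
    }
}

#define entry_of(ptr) ((struct req *)((char *)(ptr) - offsetof(struct req, link)))

/* Move every lockable request from head to dst; skipped (busy) entries
 * end up back on head.  A real scan would cond_resched_lock() per pass. */
static int scan_list(struct list_head *head, struct list_head *dst)
{
    struct list_head locked = LIST_INIT(locked);
    int res = 0;

    while (!list_empty(head)) {
        struct list_head *pos = head->next;
        struct req *r = entry_of(pos);

        list_del(pos);
        if (r->busy) {
            list_add_tail(pos, &locked);  /* park it, keep making progress */
        } else {
            list_add_tail(pos, dst);
            res++;
        }
        /* lock-break point: safe, pos is already off the shared list */
    }
    list_splice(&locked, head);
    return res;
}
```

Restarting such a scan is harmless: parked entries only reappear on the shared list after the pass completes, so the loop never revisits an entry it has already skipped within one pass.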
* Re: NFS client latencies 2005-03-31 2:47 ` Lee Revell 2005-03-31 3:48 ` Trond Myklebust @ 2005-03-31 7:03 ` Ingo Molnar 1 sibling, 0 replies; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 7:03 UTC (permalink / raw) To: Lee Revell; +Cc: Andrew Morton, trond.myklebust, linux-kernel * Lee Revell <rlrevell@joe-job.com> wrote: > > Is a bunch of gobbledygook. Hows about you interpret it for us? > > Sorry. When I summarized them before, Ingo just asked for the full > verbose trace. please send non-verbose traces if possible. The verbose traces are useful when it's not clear which portion of a really large function is the call site - but they are also alot harder to read. Verbose traces are basically just a debugging mechanism for me, not meant for public consumption. i can add back the instruction-'offset' to the non-verbose trace, that will make even the ext3 traces easily interpretable in the non-verbose format. > The 7 ms are spent in this loop: > > radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>) > radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>) > radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>) > radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>) > radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>) the trace shows thousands of pages getting submitted - each of the line above is at least one new page. The loop is not preemptible right now but that should be easy to bound. Note that your earlier traces showed the list sorting overhead for a _single page_. So it's a huge difference and a huge step forward. Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 2:26 ` Lee Revell 2005-03-31 2:39 ` Andrew Morton @ 2005-03-31 7:39 ` Ingo Molnar 2005-03-31 7:48 ` Ingo Molnar 1 sibling, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 7:39 UTC (permalink / raw) To: Lee Revell; +Cc: Trond Myklebust, Andrew Morton, linux-kernel * Lee Revell <rlrevell@joe-job.com> wrote: > > Yes. Together with the radix tree-based sorting of dirty requests, > > that's pretty much what I've spent most of today doing. Lee, could you > > see how the attached combined patch changes your latency numbers? > > Different code path, and the latency is worse. See the attached ~7ms > trace. could you check the -41-23 -RT kernel at the usual place: http://redhat.com/~mingo/realtime-preempt/ i've added Trond's radix lookup code, plus the lockbreaker patch. (one thing that is not covered yet is nfs_scan_list() - that still scans a list. Trond, is that fixable too?) Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 7:39 ` Ingo Molnar @ 2005-03-31 7:48 ` Ingo Molnar 2005-03-31 7:58 ` Ingo Molnar 0 siblings, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 7:48 UTC (permalink / raw) To: Lee Revell; +Cc: Trond Myklebust, Andrew Morton, linux-kernel [-- Attachment #1: Type: text/plain, Size: 582 bytes --] * Ingo Molnar <mingo@elte.hu> wrote: > could you check the -41-23 -RT kernel at the usual place: > > http://redhat.com/~mingo/realtime-preempt/ > > i've added Trond's radix lookup code, plus the lockbreaker patch. > > (one thing that is not covered yet is nfs_scan_list() - that still scans > a list. Trond, is that fixable too?) ok, find a new latency trace attached (1.6 msecs). I generated write loads, and the nfs_scan_list_dirty() latency is gone and indeed nfs_scan_list() generates the worst latency - while processing 8535 pages in one critical section. Ingo [-- Attachment #2: nfs-1682us-latency.bz2 --] [-- Type: application/x-bzip2, Size: 12565 bytes --] ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 7:48 ` Ingo Molnar @ 2005-03-31 7:58 ` Ingo Molnar 0 siblings, 0 replies; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 7:58 UTC (permalink / raw) To: Lee Revell; +Cc: Trond Myklebust, Andrew Morton, linux-kernel [-- Attachment #1: Type: text/plain, Size: 477 bytes --] * Ingo Molnar <mingo@elte.hu> wrote: > ok, find a new latency trace attached (1.6 msecs). I generated write > loads, and the nfs_scan_list_dirty() latency is gone and indeed > nfs_scan_list() generates the worst latency - while processing 8535 > pages in one critical section. here's a more complete one of 4.4 msecs. (there was a bug in the latency tracer that trimmed latency traces - this one is complete and shows an nfs_scan_list processing of 22908 pages) Ingo [-- Attachment #2: nfs-4481us-latency.bz2 --] [-- Type: application/x-bzip2, Size: 33823 bytes --] ^ permalink raw reply [flat|nested] 45+ messages in thread