* NFS client latencies
@ 2005-03-29 23:04 Lee Revell
  2005-03-29 23:18 ` Trond Myklebust
  0 siblings, 1 reply; 45+ messages in thread
From: Lee Revell @ 2005-03-29 23:04 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Ingo Molnar, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 94 bytes --]

I am seeing long latencies in the NFS client code. Attached is a ~1.9 ms
latency trace.

Lee

[-- Attachment #2: nfs-1919us-latency_trace.bz2 --]
[-- Type: application/x-bzip, Size: 30730 bytes --]

^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies
From: Trond Myklebust @ 2005-03-29 23:18 UTC (permalink / raw)
To: Lee Revell; +Cc: Ingo Molnar, linux-kernel

On Tue, 29.03.2005 at 18:04 (-0500), Lee Revell wrote:
> I am seeing long latencies in the NFS client code. Attached is a ~1.9
> ms latency trace.

What kind of workload are you using to produce these numbers?

Cheers,
  Trond

--
Trond Myklebust <trond.myklebust@fys.uio.no>
* Re: NFS client latencies
From: Lee Revell @ 2005-03-29 23:32 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Ingo Molnar, linux-kernel

On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> On Tue, 29.03.2005 at 18:04 (-0500), Lee Revell wrote:
> > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > ms latency trace.
>
> What kind of workload are you using to produce these numbers?
>

Just a kernel compile over NFS.

Lee
* Re: NFS client latencies
From: Trond Myklebust @ 2005-03-29 23:34 UTC (permalink / raw)
To: Lee Revell; +Cc: Ingo Molnar, linux-kernel

On Tue, 29.03.2005 at 18:32 (-0500), Lee Revell wrote:
> On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> > On Tue, 29.03.2005 at 18:04 (-0500), Lee Revell wrote:
> > > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > > ms latency trace.
> >
> > What kind of workload are you using to produce these numbers?
> >
>
> Just a kernel compile over NFS.

In other words, a workload consisting mainly of mmap()ed writes?

Cheers,
  Trond

--
Trond Myklebust <trond.myklebust@fys.uio.no>
* Re: NFS client latencies
From: Lee Revell @ 2005-03-29 23:37 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Ingo Molnar, linux-kernel

On Tue, 2005-03-29 at 18:34 -0500, Trond Myklebust wrote:
> On Tue, 29.03.2005 at 18:32 (-0500), Lee Revell wrote:
> > On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> > > On Tue, 29.03.2005 at 18:04 (-0500), Lee Revell wrote:
> > > > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > > > ms latency trace.
> > >
> > > What kind of workload are you using to produce these numbers?
> > >
> >
> > Just a kernel compile over NFS.
>
> In other words a workload consisting mainly of mmap()ed writes?
>

If you say so... I don't know the NFS internals very well. I did use
"make -j16", on a UP (uniprocessor) machine.

Lee
* Re: NFS client latencies
From: Ingo Molnar @ 2005-03-30 8:02 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Lee Revell, linux-kernel

* Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> On Tue, 29.03.2005 at 18:32 (-0500), Lee Revell wrote:
> > On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> > > On Tue, 29.03.2005 at 18:04 (-0500), Lee Revell wrote:
> > > > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > > > ms latency trace.
> > >
> > > What kind of workload are you using to produce these numbers?
> > >
> >
> > Just a kernel compile over NFS.
>
> In other words a workload consisting mainly of mmap()ed writes?

new files created during a kernel compile are done via open()/write().

looking at the trace it seems that the NFS client code is doing list
walks over ~7000 entries (!), in nfs_list_add_request(). Whatever
protocol/server-side gain there might be due to the sorting and
coalescing, this CPU overhead seems extremely high - more than 1 msec
for this single insertion!

the comment suggests that this is optimized for append writes (which is
quite common, but by far not the only write workload) - but the
worst-case behavior of this code is very bad. How about disabling this
sorting altogether and benchmarking the result? Maybe it would get
comparable coalescing (higher levels do coalesce after all), but vastly
improved CPU utilization on the client side. (Note that the server
itself will do sorting of any write IO anyway, if this is to hit any
persistent storage - and if not, then sorting so aggressively on the
client side makes little sense.)

i think normal NFS benchmarks would not show this effect, as writes are
typically streamed in benchmarks.
But once you have lots of outstanding requests and a write comes out of
order, CPU utilization (and latency) skyrockets.

	Ingo
* Re: NFS client latencies
From: Trond Myklebust @ 2005-03-30 14:11 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Lee Revell, linux-kernel

On Wed, 30.03.2005 at 10:02 (+0200), Ingo Molnar wrote:
> the comment suggests that this is optimized for append writes (which is
> quite common, but by far not the only write workload) - but the
> worst-case behavior of this code is very bad. How about disabling this
> sorting altogether and benchmarking the result? Maybe it would get
> comparable coalescing (higher levels do coalesce after all), but vastly
> improved CPU utilization on the client side. (Note that the server
> itself will do sorting of any write IO anyway, if this is to hit any
> persistent storage - and if not, then sorting so aggressively on the
> client side makes little sense.)

No. Coalescing on the client makes tons of sense. The overhead of
sending 8 RPC requests for 4k writes instead of sending 1 RPC request
for a single 32k write is huge: among other things, you end up tying up
8 RPC slots on the client + 8 nfsd threads on the server instead of just
one of each. You also end up allocating 8 times as much memory for
supporting structures such as rpc_tasks. That sucks when you're in a
low-memory situation and are trying to push dirty pages to the server
as fast as possible...

What we can do instead is to do the same thing that the VM does: use a
radix tree with tags to do the sorting of the nfs_pages when we're
actually building up the list of dirty pages to send. We already have
the radix tree to do that; all we need to do is add the tags and modify
the scanning function to use them.

Cheers,
  Trond

--
Trond Myklebust <trond.myklebust@fys.uio.no>
* Re: NFS client latencies
From: Ingo Molnar @ 2005-03-30 14:20 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Lee Revell, linux-kernel

* Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> > the comment suggests that this is optimized for append writes (which is
> > quite common, but by far not the only write workload) - but the
> > worst-case behavior of this code is very bad. How about disabling this
> > sorting altogether and benchmarking the result? Maybe it would get
> > comparable coalescing (higher levels do coalesce after all), but vastly
> > improved CPU utilization on the client side. (Note that the server
> > itself will do sorting of any write IO anyway, if this is to hit any
> > persistent storage - and if not, then sorting so aggressively on the
> > client side makes little sense.)
>
> No. Coalescing on the client makes tons of sense. The overhead of
> sending 8 RPC requests for 4k writes instead of sending 1 RPC request
> for a single 32k write is huge: among other things, you end up tying up
> 8 RPC slots on the client + 8 nfsd threads on the server instead of just
> one of each.

yes - coalescing a few pages makes sense, but linearly scanning
thousands of entries is entirely pointless.

	Ingo
* Re: NFS client latencies
From: Andrew Morton @ 2005-03-30 19:53 UTC (permalink / raw)
To: Ingo Molnar; +Cc: trond.myklebust, rlrevell, linux-kernel

Ingo Molnar <mingo@elte.hu> wrote:
> > No. Coalescing on the client makes tons of sense. The overhead of
> > sending 8 RPC requests for 4k writes instead of sending 1 RPC request
> > for a single 32k write is huge: among other things, you end up tying up
> > 8 RPC slots on the client + 8 nfsd threads on the server instead of just
> > one of each.
>
> yes - coalescing a few pages makes sense, but linearly scanning
> thousands of entries is entirely pointless.

If each object has a unique ->wb_index then they could all be placed
into a radix tree rather than a list. Use the gang-lookup functions to
perform the sorting.
* Re: NFS client latencies
From: Lee Revell @ 2005-03-30 14:26 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Ingo Molnar, linux-kernel

On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> On Tue, 29.03.2005 at 18:04 (-0500), Lee Revell wrote:
> > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > ms latency trace.
>
> What kind of workload are you using to produce these numbers?
>

Here is the other long latency I am seeing in the NFS client. I posted
this before, but did not cc: the correct people.

It looks like nfs_wait_on_requests is doing thousands of
radix_tree_gang_lookups while holding some lock.

Lee

preemption latency trace v1.1.4 on 2.6.12-rc1-RT-V0.7.41-00
--------------------------------------------------------------------
 latency: 3178 µs, #4095/14224, CPU#0 | (M:preempt VP:0, KP:1, SP:1 HP:1 #P:1)
    -----------------
    | task: ksoftirqd/0-2 (uid:0 nice:-10 policy:0 rt_prio:0)
    -----------------

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /

(T1/#0) <...> 32105 0 3 00000004 00000000 [0011939614227867] 0.000ms (+4137027.445ms): <6500646c> (<61000000>)
(T1/#2) <...> 32105 0 3 00000004 00000002 [0011939614228097] 0.000ms (+0.000ms): __trace_start_sched_wakeup+0x9a/0xd0 <c013150a> (try_to_wake_up+0x94/0x140 <c0110474>)
(T1/#3) <...> 32105 0 3 00000003 00000003 [0011939614228436] 0.000ms (+0.000ms): preempt_schedule+0x11/0x80 <c02b57c1> (try_to_wake_up+0x94/0x140 <c0110474>)
(T3/#4) <...>-32105 0dn.3 0µs : try_to_wake_up+0x11e/0x140 <c01104fe> <<...>-2> (69 76):
(T1/#5) <...> 32105 0 3 00000002 00000005 [0011939614228942] 0.000ms (+0.000ms): preempt_schedule+0x11/0x80 <c02b57c1> (try_to_wake_up+0xf8/0x140 <c01104d8>)
(T1/#6) <...> 32105 0 3 00000002 00000006 [0011939614229130] 0.000ms (+0.000ms): wake_up_process+0x35/0x40 <c0110555> (do_softirq+0x3f/0x50 <c011b05f>)
(T6/#7) <...>-32105 0dn.1 1µs < (1)
(T1/#8) <...> 32105 0 2 00000001 00000008 [0011939614229782] 0.001ms (+0.000ms): radix_tree_gang_lookup+0xe/0x70 <c01e05ee> (nfs_wait_on_requests+0x6d/0x110 <c01c744d>)
(T1/#9) <...> 32105 0 2 00000001 00000009 [0011939614229985] 0.001ms (+0.000ms): __lookup+0xe/0xd0 <c01e051e> (radix_tree_gang_lookup+0x52/0x70 <c01e0632>)
(T1/#10) <...> 32105 0 2 00000001 0000000a [0011939614230480] 0.001ms (+0.000ms): radix_tree_gang_lookup+0xe/0x70 <c01e05ee> (nfs_wait_on_requests+0x6d/0x110 <c01c744d>)
(T1/#11) <...> 32105 0 2 00000001 0000000b [0011939614230634] 0.002ms (+0.000ms): __lookup+0xe/0xd0 <c01e051e> (radix_tree_gang_lookup+0x52/0x70 <c01e0632>)
(T1/#12) <...> 32105 0 2 00000001 0000000c [0011939614230889] 0.002ms (+0.000ms): radix_tree_gang_lookup+0xe/0x70 <c01e05ee> (nfs_wait_on_requests+0x6d/0x110 <c01c744d>)
(T1/#13) <...> 32105 0 2 00000001 0000000d [0011939614231034] 0.002ms (+0.000ms): __lookup+0xe/0xd0 <c01e051e> (radix_tree_gang_lookup+0x52/0x70 <c01e0632>)
(T1/#14) <...> 32105 0 2 00000001 0000000e [0011939614231302] 0.002ms (+0.000ms): radix_tree_gang_lookup+0xe/0x70 <c01e05ee> (nfs_wait_on_requests+0x6d/0x110 <c01c744d>)
(T1/#15) <...> 32105 0 2 00000001 0000000f [0011939614231419] 0.002ms (+0.000ms): __lookup+0xe/0xd0 <c01e051e> (radix_tree_gang_lookup+0x52/0x70 <c01e0632>)

(last two lines just repeat)

> Cheers,
> Trond
* Re: NFS client latencies
From: Trond Myklebust @ 2005-03-30 14:50 UTC (permalink / raw)
To: Lee Revell; +Cc: Ingo Molnar, linux-kernel

On Wed, 30.03.2005 at 09:26 (-0500), Lee Revell wrote:
> On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> > On Tue, 29.03.2005 at 18:04 (-0500), Lee Revell wrote:
> > > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > > ms latency trace.
> >
> > What kind of workload are you using to produce these numbers?
> >
>
> Here is the other long latency I am seeing in the NFS client. I posted
> this before, but did not cc: the correct people.
>
> It looks like nfs_wait_on_requests is doing thousands of
> radix_tree_gang_lookups while holding some lock.

That's normal and cannot be avoided: when writing, we have to look for
the existence of old nfs_page requests. The reason is that if one does
exist, we must either coalesce our new dirty area into it or, if we
can't, we must flush the old request out to the server.

Cheers,
  Trond

--
Trond Myklebust <trond.myklebust@fys.uio.no>
* Re: NFS client latencies
From: Lee Revell @ 2005-03-30 19:50 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Ingo Molnar, linux-kernel

On Wed, 2005-03-30 at 09:50 -0500, Trond Myklebust wrote:
> On Wed, 30.03.2005 at 09:26 (-0500), Lee Revell wrote:
> > On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> > > On Tue, 29.03.2005 at 18:04 (-0500), Lee Revell wrote:
> > > > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > > > ms latency trace.
> > >
> > > What kind of workload are you using to produce these numbers?
> > >
> >
> > Here is the other long latency I am seeing in the NFS client. I posted
> > this before, but did not cc: the correct people.
> >
> > It looks like nfs_wait_on_requests is doing thousands of
> > radix_tree_gang_lookups while holding some lock.
>
> That's normal and cannot be avoided: when writing, we have to look for
> the existence of old nfs_page requests. The reason is that if one does
> exist, we must either coalesce our new dirty area into it or if we
> can't, we must flush the old request out to the server.

But holding a spinlock for 3 ms is not acceptable. _Something_ has to
be done. Can't the lock be dropped and reacquired after processing N
requests, where N is some reasonable number?

Lee
* Re: NFS client latencies
From: Andrew Morton @ 2005-03-30 19:56 UTC (permalink / raw)
To: Trond Myklebust; +Cc: rlrevell, mingo, linux-kernel

Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> On Wed, 30.03.2005 at 09:26 (-0500), Lee Revell wrote:
> > On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> > > On Tue, 29.03.2005 at 18:04 (-0500), Lee Revell wrote:
> > > > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > > > ms latency trace.
> > >
> > > What kind of workload are you using to produce these numbers?
> > >
> >
> > Here is the other long latency I am seeing in the NFS client. I posted
> > this before, but did not cc: the correct people.
> >
> > It looks like nfs_wait_on_requests is doing thousands of
> > radix_tree_gang_lookups while holding some lock.
>
> That's normal and cannot be avoided: when writing, we have to look for
> the existence of old nfs_page requests. The reason is that if one does
> exist, we must either coalesce our new dirty area into it or if we
> can't, we must flush the old request out to the server.

One could use the radix-tree tagging stuff so that the gang lookup only
looks up pages which are !NFS_WBACK_BUSY.
* Re: NFS client latencies
From: Trond Myklebust @ 2005-03-30 21:14 UTC (permalink / raw)
To: Andrew Morton; +Cc: rlrevell, mingo, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 734 bytes --]

On Wed, 30.03.2005 at 11:56 (-0800), Andrew Morton wrote:
> > That's normal and cannot be avoided: when writing, we have to look for
> > the existence of old nfs_page requests. The reason is that if one does
> > exist, we must either coalesce our new dirty area into it or if we
> > can't, we must flush the old request out to the server.
>
> One could use the radix-tree tagging stuff so that the gang lookup only
> looks up pages which are !NFS_WBACK_BUSY.

Yes. Together with the radix tree-based sorting of dirty requests,
that's pretty much what I've spent most of today doing. Lee, could you
see how the attached combined patch changes your latency numbers?

Cheers,
  Trond

--
Trond Myklebust <trond.myklebust@fys.uio.no>

[-- Attachment #2: linux-2.6.12-nfspage_use_tags.dif --]
[-- Type: text/x-patch, Size: 14045 bytes --]

Index: linux-2.6.12-rc1/fs/nfs/inode.c =================================================================== --- linux-2.6.12-rc1.orig/fs/nfs/inode.c +++ linux-2.6.12-rc1/fs/nfs/inode.c @@ -118,7 +118,7 @@ nfs_write_inode(struct inode *inode, int int flags = sync ?
FLUSH_WAIT : 0; int ret; - ret = nfs_commit_inode(inode, 0, 0, flags); + ret = nfs_commit_inode(inode, flags); if (ret < 0) return ret; return 0; Index: linux-2.6.12-rc1/fs/nfs/pagelist.c =================================================================== --- linux-2.6.12-rc1.orig/fs/nfs/pagelist.c +++ linux-2.6.12-rc1/fs/nfs/pagelist.c @@ -112,6 +112,33 @@ void nfs_unlock_request(struct nfs_page } /** + * nfs_set_page_writeback_locked - Lock a request for writeback + * @req: + */ +int nfs_set_page_writeback_locked(struct nfs_page *req) +{ + struct nfs_inode *nfsi = NFS_I(req->wb_context->dentry->d_inode); + + if (!nfs_lock_request(req)) + return 0; + radix_tree_tag_set(&nfsi->nfs_page_tree, req->wb_index, NFS_PAGE_TAG_WRITEBACK); + return 1; +} + +/** + * nfs_clear_page_writeback - Unlock request and wake up sleepers + */ +void nfs_clear_page_writeback(struct nfs_page *req) +{ + struct nfs_inode *nfsi = NFS_I(req->wb_context->dentry->d_inode); + + spin_lock(&nfsi->req_lock); + radix_tree_tag_clear(&nfsi->nfs_page_tree, req->wb_index, NFS_PAGE_TAG_WRITEBACK); + spin_unlock(&nfsi->req_lock); + nfs_unlock_request(req); +} + +/** * nfs_clear_request - Free up all resources allocated to the request * @req: * @@ -151,36 +178,6 @@ nfs_release_request(struct nfs_page *req } /** - * nfs_list_add_request - Insert a request into a sorted list - * @req: request - * @head: head of list into which to insert the request. - * - * Note that the wb_list is sorted by page index in order to facilitate - * coalescing of requests. - * We use an insertion sort that is optimized for the case of appended - * writes. 
- */ -void -nfs_list_add_request(struct nfs_page *req, struct list_head *head) -{ - struct list_head *pos; - -#ifdef NFS_PARANOIA - if (!list_empty(&req->wb_list)) { - printk(KERN_ERR "NFS: Add to list failed!\n"); - BUG(); - } -#endif - list_for_each_prev(pos, head) { - struct nfs_page *p = nfs_list_entry(pos); - if (p->wb_index < req->wb_index) - break; - } - list_add(&req->wb_list, pos); - req->wb_list_head = head; -} - -/** * nfs_wait_on_request - Wait for a request to complete. * @req: request to wait upon. * @@ -243,6 +240,62 @@ nfs_coalesce_requests(struct list_head * return npages; } +#define NFS_SCAN_MAXENTRIES 16 +/** + * nfs_scan_lock_dirty - Scan the radix tree for dirty requests + * @nfsi: NFS inode + * @dst: Destination list + * @idx_start: lower bound of page->index to scan + * @npages: idx_start + npages sets the upper bound to scan. + * + * Moves elements from one of the inode request lists. + * If the number of requests is set to 0, the entire address_space + * starting at index idx_start, is scanned. + * The requests are *not* checked to ensure that they form a contiguous set. 
+ * You must be holding the inode's req_lock when calling this function + */ +int +nfs_scan_lock_dirty(struct nfs_inode *nfsi, struct list_head *dst, + unsigned long idx_start, unsigned int npages) +{ + struct nfs_page *pgvec[NFS_SCAN_MAXENTRIES]; + struct nfs_page *req; + unsigned long idx_end; + int found, i; + int res; + + res = 0; + if (npages == 0) + idx_end = ~0; + else + idx_end = idx_start + npages - 1; + + for (;;) { + found = radix_tree_gang_lookup_tag(&nfsi->nfs_page_tree, + (void **)&pgvec[0], idx_start, NFS_SCAN_MAXENTRIES, + NFS_PAGE_TAG_DIRTY); + if (found <= 0) + break; + for (i = 0; i < found; i++) { + req = pgvec[i]; + if (req->wb_index > idx_end) + goto out; + + idx_start = req->wb_index + 1; + + if (nfs_set_page_writeback_locked(req)) { + radix_tree_tag_clear(&nfsi->nfs_page_tree, + req->wb_index, NFS_PAGE_TAG_DIRTY); + nfs_list_remove_request(req); + nfs_list_add_request(req, dst); + res++; + } + } + } +out: + return res; +} + /** * nfs_scan_list - Scan a list for matching requests * @head: One of the NFS inode request lists @@ -280,7 +333,7 @@ nfs_scan_list(struct list_head *head, st if (req->wb_index > idx_end) break; - if (!nfs_lock_request(req)) + if (!nfs_set_page_writeback_locked(req)) continue; nfs_list_remove_request(req); nfs_list_add_request(req, dst); Index: linux-2.6.12-rc1/fs/nfs/read.c =================================================================== --- linux-2.6.12-rc1.orig/fs/nfs/read.c +++ linux-2.6.12-rc1/fs/nfs/read.c @@ -173,7 +173,6 @@ static int nfs_readpage_async(struct nfs if (len < PAGE_CACHE_SIZE) memclear_highpage_flush(page, len, PAGE_CACHE_SIZE - len); - nfs_lock_request(new); nfs_list_add_request(new, &one_request); nfs_pagein_one(&one_request, inode); return 0; @@ -185,7 +184,6 @@ static void nfs_readpage_release(struct nfs_clear_request(req); nfs_release_request(req); - nfs_unlock_request(req); dprintk("NFS: read done (%s/%Ld %d@%Ld)\n", req->wb_context->dentry->d_inode->i_sb->s_id, @@ -553,7 +551,6 @@ 
readpage_async_filler(void *data, struct } if (len < PAGE_CACHE_SIZE) memclear_highpage_flush(page, len, PAGE_CACHE_SIZE - len); - nfs_lock_request(new); nfs_list_add_request(new, desc->head); return 0; } Index: linux-2.6.12-rc1/fs/nfs/write.c =================================================================== --- linux-2.6.12-rc1.orig/fs/nfs/write.c +++ linux-2.6.12-rc1/fs/nfs/write.c @@ -352,7 +352,7 @@ int nfs_writepages(struct address_space if (err < 0) goto out; } - err = nfs_commit_inode(inode, 0, 0, wb_priority(wbc)); + err = nfs_commit_inode(inode, wb_priority(wbc)); if (err > 0) { wbc->nr_to_write -= err; err = 0; @@ -446,6 +446,8 @@ nfs_mark_request_dirty(struct nfs_page * struct nfs_inode *nfsi = NFS_I(inode); spin_lock(&nfsi->req_lock); + radix_tree_tag_set(&nfsi->nfs_page_tree, + req->wb_index, NFS_PAGE_TAG_DIRTY); nfs_list_add_request(req, &nfsi->dirty); nfsi->ndirty++; spin_unlock(&nfsi->req_lock); @@ -503,13 +505,12 @@ nfs_wait_on_requests(struct inode *inode spin_lock(&nfsi->req_lock); next = idx_start; - while (radix_tree_gang_lookup(&nfsi->nfs_page_tree, (void **)&req, next, 1)) { + while (radix_tree_gang_lookup_tag(&nfsi->nfs_page_tree, (void **)&req, next, 1, NFS_PAGE_TAG_WRITEBACK)) { if (req->wb_index > idx_end) break; next = req->wb_index + 1; - if (!NFS_WBACK_BUSY(req)) - continue; + BUG_ON(!NFS_WBACK_BUSY(req)); atomic_inc(&req->wb_count); spin_unlock(&nfsi->req_lock); @@ -539,7 +540,7 @@ nfs_scan_dirty(struct inode *inode, stru { struct nfs_inode *nfsi = NFS_I(inode); int res; - res = nfs_scan_list(&nfsi->dirty, dst, idx_start, npages); + res = nfs_scan_lock_dirty(nfsi, dst, idx_start, npages); nfsi->ndirty -= res; sub_page_state(nr_dirty,res); if ((nfsi->ndirty == 0) != list_empty(&nfsi->dirty)) @@ -821,7 +822,7 @@ out: #else nfs_inode_remove_request(req); #endif - nfs_unlock_request(req); + nfs_clear_page_writeback(req); } static inline int flush_task_priority(int how) @@ -952,7 +953,7 @@ out_bad: nfs_writedata_free(data); } 
nfs_mark_request_dirty(req); - nfs_unlock_request(req); + nfs_clear_page_writeback(req); return -ENOMEM; } @@ -1002,7 +1003,7 @@ static int nfs_flush_one(struct list_hea struct nfs_page *req = nfs_list_entry(head->next); nfs_list_remove_request(req); nfs_mark_request_dirty(req); - nfs_unlock_request(req); + nfs_clear_page_writeback(req); } return -ENOMEM; } @@ -1029,7 +1030,7 @@ nfs_flush_list(struct list_head *head, i req = nfs_list_entry(head->next); nfs_list_remove_request(req); nfs_mark_request_dirty(req); - nfs_unlock_request(req); + nfs_clear_page_writeback(req); } return error; } @@ -1121,7 +1122,7 @@ static void nfs_writeback_done_full(stru nfs_inode_remove_request(req); #endif next: - nfs_unlock_request(req); + nfs_clear_page_writeback(req); } } @@ -1210,36 +1211,24 @@ static void nfs_commit_rpcsetup(struct l struct nfs_write_data *data, int how) { struct rpc_task *task = &data->task; - struct nfs_page *first, *last; + struct nfs_page *first; struct inode *inode; - loff_t start, end, len; /* Set up the RPC argument and reply structs * NB: take care not to mess about with data->commit et al. */ list_splice_init(head, &data->pages); first = nfs_list_entry(data->pages.next); - last = nfs_list_entry(data->pages.prev); inode = first->wb_context->dentry->d_inode; - /* - * Determine the offset range of requests in the COMMIT call. - * We rely on the fact that data->pages is an ordered list... 
- */ - start = req_offset(first); - end = req_offset(last) + last->wb_bytes; - len = end - start; - /* If 'len' is not a 32-bit quantity, pass '0' in the COMMIT call */ - if (end >= i_size_read(inode) || len < 0 || len > (~((u32)0) >> 1)) - len = 0; - data->inode = inode; data->cred = first->wb_context->cred; data->args.fh = NFS_FH(data->inode); - data->args.offset = start; - data->args.count = len; - data->res.count = len; + /* Note: we always request a commit of the entire inode */ + data->args.offset = 0; + data->args.count = 0; + data->res.count = 0; data->res.fattr = &data->fattr; data->res.verf = &data->verf; @@ -1278,7 +1267,7 @@ nfs_commit_list(struct list_head *head, req = nfs_list_entry(head->next); nfs_list_remove_request(req); nfs_mark_request_commit(req); - nfs_unlock_request(req); + nfs_clear_page_writeback(req); } return -ENOMEM; } @@ -1324,7 +1313,7 @@ nfs_commit_done(struct rpc_task *task) dprintk(" mismatch\n"); nfs_mark_request_dirty(req); next: - nfs_unlock_request(req); + nfs_clear_page_writeback(req); res++; } sub_page_state(nr_unstable,res); @@ -1350,8 +1339,7 @@ static int nfs_flush_inode(struct inode } #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4) -int nfs_commit_inode(struct inode *inode, unsigned long idx_start, - unsigned int npages, int how) +int nfs_commit_inode(struct inode *inode, int how) { struct nfs_inode *nfsi = NFS_I(inode); LIST_HEAD(head); @@ -1359,15 +1347,13 @@ int nfs_commit_inode(struct inode *inode error = 0; spin_lock(&nfsi->req_lock); - res = nfs_scan_commit(inode, &head, idx_start, npages); + res = nfs_scan_commit(inode, &head, 0, 0); + spin_unlock(&nfsi->req_lock); if (res) { - res += nfs_scan_commit(inode, &head, 0, 0); - spin_unlock(&nfsi->req_lock); error = nfs_commit_list(&head, how); - } else - spin_unlock(&nfsi->req_lock); - if (error < 0) - return error; + if (error < 0) + return error; + } return res; } #endif @@ -1389,7 +1375,7 @@ int nfs_sync_inode(struct inode *inode, error = nfs_flush_inode(inode, 
idx_start, npages, how); #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4) if (error == 0) - error = nfs_commit_inode(inode, idx_start, npages, how); + error = nfs_commit_inode(inode, how); #endif } while (error > 0); return error; Index: linux-2.6.12-rc1/include/linux/nfs_fs.h =================================================================== --- linux-2.6.12-rc1.orig/include/linux/nfs_fs.h +++ linux-2.6.12-rc1/include/linux/nfs_fs.h @@ -377,10 +377,10 @@ extern void nfs_commit_done(struct rpc_t */ extern int nfs_sync_inode(struct inode *, unsigned long, unsigned int, int); #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4) -extern int nfs_commit_inode(struct inode *, unsigned long, unsigned int, int); +extern int nfs_commit_inode(struct inode *, int); #else static inline int -nfs_commit_inode(struct inode *inode, unsigned long idx_start, unsigned int npages, int how) +nfs_commit_inode(struct inode *inode, int how) { return 0; } Index: linux-2.6.12-rc1/include/linux/nfs_page.h =================================================================== --- linux-2.6.12-rc1.orig/include/linux/nfs_page.h +++ linux-2.6.12-rc1/include/linux/nfs_page.h @@ -20,12 +20,19 @@ #include <asm/atomic.h> /* + * Valid flags for the radix tree + */ +#define NFS_PAGE_TAG_DIRTY 0 +#define NFS_PAGE_TAG_WRITEBACK 1 + +/* * Valid flags for a dirty buffer */ #define PG_BUSY 0 #define PG_NEED_COMMIT 1 #define PG_NEED_RESCHED 2 +struct nfs_inode; struct nfs_page { struct list_head wb_list, /* Defines state of page: */ *wb_list_head; /* read/write/commit */ @@ -54,14 +61,17 @@ extern void nfs_clear_request(struct nfs extern void nfs_release_request(struct nfs_page *req); -extern void nfs_list_add_request(struct nfs_page *, struct list_head *); - +extern int nfs_scan_lock_dirty(struct nfs_inode *nfsi, struct list_head *dst, + unsigned long idx_start, unsigned int npages); extern int nfs_scan_list(struct list_head *, struct list_head *, unsigned long, unsigned int); extern int 
nfs_coalesce_requests(struct list_head *, struct list_head *, unsigned int); extern int nfs_wait_on_request(struct nfs_page *); extern void nfs_unlock_request(struct nfs_page *req); +extern int nfs_set_page_writeback_locked(struct nfs_page *req); +extern void nfs_clear_page_writeback(struct nfs_page *req); + /* * Lock the page of an asynchronous request without incrementing the wb_count @@ -86,6 +96,18 @@ nfs_lock_request(struct nfs_page *req) return 1; } +/** + * nfs_list_add_request - Insert a request into a list + * @req: request + * @head: head of list into which to insert the request. + */ +static inline void +nfs_list_add_request(struct nfs_page *req, struct list_head *head) +{ + list_add_tail(&req->wb_list, head); + req->wb_list_head = head; +} + /** * nfs_list_remove_request - Remove a request from its wb_list @@ -96,10 +118,6 @@ nfs_list_remove_request(struct nfs_page { if (list_empty(&req->wb_list)) return; - if (!NFS_WBACK_BUSY(req)) { - printk(KERN_ERR "NFS: unlocked request attempted removed from list!\n"); - BUG(); - } list_del_init(&req->wb_list); req->wb_list_head = NULL; } ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-30 21:14 ` Trond Myklebust @ 2005-03-31 2:26 ` Lee Revell 2005-03-31 2:39 ` Andrew Morton 2005-03-31 7:39 ` Ingo Molnar 0 siblings, 2 replies; 45+ messages in thread From: Lee Revell @ 2005-03-31 2:26 UTC (permalink / raw) To: Trond Myklebust; +Cc: Andrew Morton, mingo, linux-kernel [-- Attachment #1: Type: text/plain, Size: 836 bytes --] On Wed, 2005-03-30 at 16:14 -0500, Trond Myklebust wrote: > on den 30.03.2005 Klokka 11:56 (-0800) skreiv Andrew Morton: > > > That's normal and cannot be avoided: when writing, we have to look for > > > the existence of old nfs_page requests. The reason is that if one does > > > exist, we must either coalesce our new dirty area into it or if we > > > can't, we must flush the old request out to the server. > > > > One could use the radix-tree tagging stuff so that the gang lookup only > > looks up pages which are !NFS_WBACK_BUSY. > > Yes. Together with the radix tree-based sorting of dirty requests, > that's pretty much what I've spent most of today doing. Lee, could you > see how the attached combined patch changes your latency numbers? > Different code path, and the latency is worse. See the attached ~7ms trace. Lee [-- Attachment #2: nfs-6776usec-trace.bz2 --] [-- Type: application/x-bzip, Size: 36792 bytes --] ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 2:26 ` Lee Revell @ 2005-03-31 2:39 ` Andrew Morton 2005-03-31 2:47 ` Lee Revell 2005-03-31 7:39 ` Ingo Molnar 1 sibling, 1 reply; 45+ messages in thread From: Andrew Morton @ 2005-03-31 2:39 UTC (permalink / raw) To: Lee Revell; +Cc: trond.myklebust, mingo, linux-kernel Lee Revell <rlrevell@joe-job.com> wrote: > > > Yes. Together with the radix tree-based sorting of dirty requests, > > that's pretty much what I've spent most of today doing. Lee, could you > > see how the attached combined patch changes your latency numbers? > > > > Different code path, and the latency is worse. See the attached ~7ms > trace. Is a bunch of gobbledygook. Hows about you interpret it for us? ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 2:39 ` Andrew Morton @ 2005-03-31 2:47 ` Lee Revell 2005-03-31 3:48 ` Trond Myklebust 2005-03-31 7:03 ` Ingo Molnar 0 siblings, 2 replies; 45+ messages in thread From: Lee Revell @ 2005-03-31 2:47 UTC (permalink / raw) To: Andrew Morton; +Cc: trond.myklebust, mingo, linux-kernel

On Wed, 2005-03-30 at 18:39 -0800, Andrew Morton wrote:
> Lee Revell <rlrevell@joe-job.com> wrote:
> >
> > > > Yes. Together with the radix tree-based sorting of dirty requests,
> > > > that's pretty much what I've spent most of today doing. Lee, could you
> > > > see how the attached combined patch changes your latency numbers?
> > >
> > > Different code path, and the latency is worse. See the attached ~7ms
> > > trace.
>
> Is a bunch of gobbledygook. Hows about you interpret it for us?
>

Sorry. When I summarized them before, Ingo just asked for the full
verbose trace.

The 7 ms are spent in this loop:

radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
nfs_set_page_writeback_locked+0xe/0x60 <c01c357e> (nfs_scan_lock_dirty+0x8d/0xf0 <c01c39fd>)
radix_tree_tag_set+0xe/0xa0 <c01e036e> (nfs_set_page_writeback_locked+0x4b/0x60 <c01c35bb>)
radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
nfs_set_page_writeback_locked+0xe/0x60 <c01c357e> (nfs_scan_lock_dirty+0x8d/0xf0 <c01c39fd>)
radix_tree_tag_set+0xe/0xa0 <c01e036e> (nfs_set_page_writeback_locked+0x4b/0x60 <c01c35bb>)
radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
nfs_set_page_writeback_locked+0xe/0x60 <c01c357e> (nfs_scan_lock_dirty+0x8d/0xf0 <c01c39fd>)
radix_tree_tag_set+0xe/0xa0 <c01e036e> (nfs_set_page_writeback_locked+0x4b/0x60 <c01c35bb>)
radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
nfs_set_page_writeback_locked+0xe/0x60 <c01c357e> (nfs_scan_lock_dirty+0x8d/0xf0 <c01c39fd>)
radix_tree_tag_set+0xe/0xa0 <c01e036e> (nfs_set_page_writeback_locked+0x4b/0x60 <c01c35bb>)
radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
nfs_set_page_writeback_locked+0xe/0x60 <c01c357e> (nfs_scan_lock_dirty+0x8d/0xf0 <c01c39fd>)
radix_tree_tag_set+0xe/0xa0 <c01e036e> (nfs_set_page_writeback_locked+0x4b/0x60 <c01c35bb>)
radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
radix_tree_gang_lookup_tag+0xe/0x80 <c01e074e> (nfs_scan_lock_dirty+0x69/0xf0 <c01c39d9>)
__lookup_tag+0xe/0x130 <c01e061e> (radix_tree_gang_lookup_tag+0x59/0x80 <c01e0799>)

Lee

^ permalink raw reply	[flat|nested] 45+ messages in thread
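[An aside, not part of the thread: the three calls above repeat once per dirty page, because each iteration of `nfs_scan_lock_dirty()` moves one page's tag from DIRTY to WRITEBACK. A toy userspace model of that loop, with hypothetical names and plain bitmaps standing in for the kernel's tagged radix tree:]

```c
#include <stdbool.h>

#define TOY_NPAGES 64

/* Two per-index tags, modelling NFS_PAGE_TAG_DIRTY / NFS_PAGE_TAG_WRITEBACK. */
static bool tag_dirty[TOY_NPAGES];
static bool tag_writeback[TOY_NPAGES];

/* Toy analogue of the nfs_scan_lock_dirty() loop: for every index
 * tagged dirty, set the WRITEBACK tag, clear the DIRTY tag, and count
 * it.  In the kernel the lookup is a gang lookup on the radix tree;
 * here a linear scan of the bitmap stands in for it. */
static int toy_scan_lock_dirty(void)
{
    int moved = 0;
    for (int i = 0; i < TOY_NPAGES; i++) {
        if (!tag_dirty[i])
            continue;
        tag_writeback[i] = true;   /* ~ radix_tree_tag_set(..., WRITEBACK) */
        tag_dirty[i] = false;      /* ~ radix_tree_tag_clear(..., DIRTY)   */
        moved++;
    }
    return moved;
}
```

With many pages dirty, the per-page work is constant but the total is linear in the number of dirty pages, which is exactly why the trace shows the trio repeating for ~7 ms.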
* Re: NFS client latencies 2005-03-31 2:47 ` Lee Revell @ 2005-03-31 3:48 ` Trond Myklebust 2005-03-31 6:59 ` Ingo Molnar 2005-03-31 7:03 ` Ingo Molnar 1 sibling, 1 reply; 45+ messages in thread From: Trond Myklebust @ 2005-03-31 3:48 UTC (permalink / raw) To: Lee Revell; +Cc: Andrew Morton, mingo, linux-kernel

on den 30.03.2005 Klokka 21:47 (-0500) skreiv Lee Revell:
> On Wed, 2005-03-30 at 18:39 -0800, Andrew Morton wrote:
> > Lee Revell <rlrevell@joe-job.com> wrote:
> > >
> > > > > Yes. Together with the radix tree-based sorting of dirty requests,
> > > > > that's pretty much what I've spent most of today doing. Lee, could you
> > > > > see how the attached combined patch changes your latency numbers?
> > > >
> > > > Different code path, and the latency is worse. See the attached ~7ms
> > > > trace.
> >
> > Is a bunch of gobbledygook. Hows about you interpret it for us?
>
> Sorry. When I summarized them before, Ingo just asked for the full
> verbose trace.
>
> The 7 ms are spent in this loop:
>
> radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
> nfs_set_page_writeback_locked+0xe/0x60 <c01c357e> (nfs_scan_lock_dirty+0x8d/0xf0 <c01c39fd>)
> radix_tree_tag_set+0xe/0xa0 <c01e036e> (nfs_set_page_writeback_locked+0x4b/0x60 <c01c35bb>)
> radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)

Which is basically confirming what the guys from Bull already told me, namely that the radix tree tag stuff is much less efficient than what we've got now. I never saw their patches, though, so I was curious to try it for myself.

Cheers,
  Trond
-- 
Trond Myklebust <trond.myklebust@fys.uio.no>

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 3:48 ` Trond Myklebust @ 2005-03-31 6:59 ` Ingo Molnar 2005-03-31 7:15 ` Ingo Molnar 2005-03-31 7:18 ` Andrew Morton 0 siblings, 2 replies; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 6:59 UTC (permalink / raw) To: Trond Myklebust; +Cc: Lee Revell, Andrew Morton, linux-kernel

* Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> > The 7 ms are spent in this loop:
>
> Which is basically confirming what the guys from Bull already told me,
> namely that the radix tree tag stuff is much less efficient than what
> we've got now. I never saw their patches, though, so I was curious to
> try it for myself.

i think the numbers are being misinterpreted. I believe this patch is a big step forward. The big thing is that nfs_list_add_request() is O(1) now - while _a single request addition to the sorted list_ triggered the 1+ msec latency in Lee's previous trace. I.e. the previous trace showed a 1+ msec latency for a single page IO submission, while his new trace shows _thousands_ of pages submitted for NFS IO - which is a very different beast!

i think all it needs now is a lock-breaker in the main radix-lookup loop in nfs_scan_lock_dirty(), or a latency-oriented reduction in the npages argument, to make the loop bounded. That would make the nfs_list_add_request() code unbreakable from a latency POV. The patch looks really great to me, kudos for pulling it off that fast.

(Also, i'd not declare this variant _slower_ based on latencies, unless proven in real benchmarks. A faster implementation can easily have higher latencies if some detail changed - and lots of details changed due to your patch. And i'd not even declare latency that much worse, unless it's been measured with the tracer turned off. (Lee, could you try such a comparison, worst-case latency with and without the patch, with LATENCY_TRACE turned off in the .config but latency timing still enabled?) The latency tracer itself can balloon the latency.)
Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
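[An illustration outside the thread: the O(n)-vs-O(1) contrast Ingo describes corresponds directly to the code change - the old `nfs_list_add_request()` was an insertion sort over the request list (cheap for appended writes, linear otherwise), while the new one is a plain tail insert. A minimal userspace sketch with hypothetical names (`sorted_add`, `tail_add`) and a bare doubly-linked list standing in for the kernel's `list_head`:]

```c
#include <stddef.h>

struct req { unsigned long index; struct req *prev, *next; };
struct rlist { struct req head; };   /* circular list; head is a sentinel */

static void rlist_init(struct rlist *l)
{
    l->head.prev = l->head.next = &l->head;
}

static void insert_between(struct req *r, struct req *a, struct req *b)
{
    r->prev = a; r->next = b; a->next = r; b->prev = r;
}

/* Old scheme: insertion sort keyed on the page index, scanning from the
 * tail (fast for appends, worst-case O(n) per insert - the cost that
 * showed up as a 1+ msec latency for a single request). */
static void sorted_add(struct rlist *l, struct req *r)
{
    struct req *pos = l->head.prev;
    while (pos != &l->head && pos->index >= r->index)
        pos = pos->prev;
    insert_between(r, pos, pos->next);
}

/* New scheme: O(1) tail insert; ordering is recovered later from the
 * radix tree during scanning instead of being maintained on insert. */
static void tail_add(struct rlist *l, struct req *r)
{
    insert_between(r, l->head.prev, &l->head);
}
```

The trade is clear: the sorted list pays on every insert, the radix-tree variant pays once per scan.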
* Re: NFS client latencies 2005-03-31 6:59 ` Ingo Molnar @ 2005-03-31 7:15 ` Ingo Molnar 2005-03-31 7:18 ` Andrew Morton 1 sibling, 0 replies; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 7:15 UTC (permalink / raw) To: Trond Myklebust; +Cc: Lee Revell, Andrew Morton, linux-kernel

* Ingo Molnar <mingo@elte.hu> wrote:

> i think all it needs now is a lock-breaker in the main radix-lookup
> loop in nfs_scan_lock_dirty(), or a latency-oriented reduction in the
> npages argument, to make the loop bounded. [...]

can nfsi->req_lock be dropped within nfs_scan_dirty()? Or does the scanning have to restart in that case? My guess would be that the scanning does not have to be restarted, since we drop the lock after scanning anyway, so all it has to take care of is the consistency of the list itself, and the fact that the whole index range got scanned at a certain point in time.

Such a lock-breaker was hard before because we had a list 'cursor' which could move away while we dropped the lock. But with the radix tree it's an 'index position' now, which is much more invariant.

The patch below attempts this, on top of your patch - but i'm not sure whether ->req_lock is the only lock we hold at that point.

	Ingo

--- linux/fs/nfs/pagelist.c.orig
+++ linux/fs/nfs/pagelist.c
@@ -291,6 +291,7 @@ nfs_scan_lock_dirty(struct nfs_inode *nf
 			res++;
 		}
 	}
+		cond_resched_lock(&nfsi->req_lock);
 	}
 out:
 	return res;

^ permalink raw reply	[flat|nested] 45+ messages in thread
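[An aside, not from the thread: `cond_resched_lock()` drops the spinlock, reschedules if needed, and reacquires it, which caps the length of any one critical section. A rough userspace analogue of the lock-break pattern, with hypothetical names and a pthread mutex standing in for the spinlock; unlike the kernel helper, this sketch breaks the lock unconditionally rather than only when a reschedule is pending:]

```c
#include <pthread.h>
#include <sched.h>

/* Hypothetical analogue of cond_resched_lock(): briefly drop the lock
 * so waiters can get in, then retake it. */
static void lock_break(pthread_mutex_t *lock)
{
    pthread_mutex_unlock(lock);
    sched_yield();               /* give contenders a chance to run */
    pthread_mutex_lock(lock);
}

/* Scan a large array under a lock, breaking the lock periodically so
 * the critical section stays bounded.  Because progress is tracked by
 * index, the scan does not need to restart after a break - the point
 * Ingo makes about the radix tree's 'index position' cursor. */
static int scan_with_breaks(pthread_mutex_t *lock,
                            const int *items, int n, int breakevery)
{
    int matched = 0;
    pthread_mutex_lock(lock);
    for (int i = 0; i < n; i++) {
        if (items[i] > 0)
            matched++;
        if ((i + 1) % breakevery == 0)
            lock_break(lock);    /* latency ceiling: breakevery items */
    }
    pthread_mutex_unlock(lock);
    return matched;
}
```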
* Re: NFS client latencies 2005-03-31 6:59 ` Ingo Molnar 2005-03-31 7:15 ` Ingo Molnar @ 2005-03-31 7:18 ` Andrew Morton 2005-03-31 7:30 ` Ingo Molnar 1 sibling, 1 reply; 45+ messages in thread From: Andrew Morton @ 2005-03-31 7:18 UTC (permalink / raw) To: Ingo Molnar; +Cc: trond.myklebust, rlrevell, linux-kernel Ingo Molnar <mingo@elte.hu> wrote: > > > * Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > > > The 7 ms are spent in this loop: > > > > Which is basically confirming what the guys from Bull already told me, > > namely that the radix tree tag stuff is much less efficient that what > > we've got now. I never saw their patches, though, so I was curious to > > try it for myself. > > i think the numbers are being misinterpreted. I believe this patch is a > big step forward. The big thing is that nfs_list_add_request() is O(1) > now - while _a single request addition to the sorted list_ triggered the > 1+ msec latency in Lee's previous trace. Well. The radix-tree approach's best-case is probably quite a lot worse than the list-based approach's best-case. It hits a lot more cachelines and involves a lot more code. But of course the radix-tree's worst-case will be far better than list's. And presumably that list-based code rarely hits the worst-case, else it would have been changed by now. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 7:18 ` Andrew Morton @ 2005-03-31 7:30 ` Ingo Molnar 2005-03-31 11:58 ` Trond Myklebust 0 siblings, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 7:30 UTC (permalink / raw) To: Andrew Morton; +Cc: trond.myklebust, rlrevell, linux-kernel

* Andrew Morton <akpm@osdl.org> wrote:

> Well. The radix-tree approach's best-case is probably quite a lot
> worse than the list-based approach's best-case. It hits a lot more
> cachelines and involves a lot more code.

The list-based approach's best case is large continuous append writes: no sorting overhead, and light data structures. i'd say this workload should be not that bad under the radix tree either - the gang lookup stuffs a nice vector of 16 pages into an array.

we definitely can say nothing based on the observation that a _single_ page took 1.9 msecs in Lee's previous measurement, while 7700 pages now take 6 msecs to process.

> But of course the radix-tree's worst-case will be far better than
> list's.

the generic VM/pagecache has proven that the radix tree wins hands down for a lot more workloads than the worst-case.

> And presumably that list-based code rarely hits the worst-case, else
> it would have been changed by now.

that was my other point in a previous mail: most write benchmarks do continuous append writes, and CPU overhead easily gets lost in network latency. Also, considering that a good portion of the NFS client's code is still running under the BKL, one would assume that if the BKL hurt performance it would have been changed by now? ;-)

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 7:30 ` Ingo Molnar @ 2005-03-31 11:58 ` Trond Myklebust 2005-03-31 12:34 ` Trond Myklebust 0 siblings, 1 reply; 45+ messages in thread From: Trond Myklebust @ 2005-03-31 11:58 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, rlrevell, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1111 bytes --] to den 31.03.2005 Klokka 09:30 (+0200) skreiv Ingo Molnar: > > And presumably that list-based code rarely hits the worst-case, else > > it would have been changed by now. > > that was my other point in a previous mail: most write benchmarks do > continuous append writes, and CPU overhead easily gets lost in > network > latency. Well. Most _good_ write benchmarks go through a variety of scenarios. The problem with the radix-tree is that in most cases you are swapping a linear scan through only the dirty pages (most of which you want to try to add to the list anyway) for a linear scan through the entire list of pages. Damn! I've forgotten an extra _obvious_ optimization that might help here... See the attached patch. > Also, considering that a good portion of the NFS client's code is still > running under the BKL one would assume if the BKL hurts performance it > would have been changed by now? ;-) Funny you should mention that... Chuck happens to have spent the last fortnight working on removing our BKL-addiction. 
;-)

Cheers,
  Trond
-- 
Trond Myklebust <trond.myklebust@fys.uio.no>

[-- Attachment #2: Type: text/plain, Size: 1722 bytes --]

 fs/nfs/write.c |   28 +++++++++++++++++-----------
 1 files changed, 17 insertions(+), 11 deletions(-)

Index: linux-2.6.12-rc1/fs/nfs/write.c
===================================================================
--- linux-2.6.12-rc1.orig/fs/nfs/write.c
+++ linux-2.6.12-rc1/fs/nfs/write.c
@@ -539,12 +539,15 @@ static int
 nfs_scan_dirty(struct inode *inode, struct list_head *dst, unsigned long idx_start, unsigned int npages)
 {
 	struct nfs_inode *nfsi = NFS_I(inode);
-	int res;
-	res = nfs_scan_lock_dirty(nfsi, dst, idx_start, npages);
-	nfsi->ndirty -= res;
-	sub_page_state(nr_dirty,res);
-	if ((nfsi->ndirty == 0) != list_empty(&nfsi->dirty))
-		printk(KERN_ERR "NFS: desynchronized value of nfs_i.ndirty.\n");
+	int res = 0;
+
+	if (nfsi->ndirty != 0) {
+		res = nfs_scan_lock_dirty(nfsi, dst, idx_start, npages);
+		nfsi->ndirty -= res;
+		sub_page_state(nr_dirty,res);
+		if ((nfsi->ndirty == 0) != list_empty(&nfsi->dirty))
+			printk(KERN_ERR "NFS: desynchronized value of nfs_i.ndirty.\n");
+	}
 	return res;
 }
@@ -563,11 +566,14 @@ static int
 nfs_scan_commit(struct inode *inode, struct list_head *dst, unsigned long idx_start, unsigned int npages)
 {
 	struct nfs_inode *nfsi = NFS_I(inode);
-	int res;
-	res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
-	nfsi->ncommit -= res;
-	if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
-		printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
+	int res;
+
+	if (nfsi->ncommit != 0) {
+		res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
+		nfsi->ncommit -= res;
+		if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
+			printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
+	}
 	return res;
 }
 #endif

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 11:58 ` Trond Myklebust @ 2005-03-31 12:34 ` Trond Myklebust 2005-03-31 13:58 ` Ingo Molnar 0 siblings, 1 reply; 45+ messages in thread From: Trond Myklebust @ 2005-03-31 12:34 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, rlrevell, linux-kernel

to den 31.03.2005 Klokka 06:58 (-0500) skreiv Trond Myklebust:
>
> @@ -563,11 +566,14 @@ static int
>  nfs_scan_commit(struct inode *inode, struct list_head *dst, unsigned long idx_start, unsigned int npages)
>  {
>  	struct nfs_inode *nfsi = NFS_I(inode);
> -	int res;
> -	res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
> -	nfsi->ncommit -= res;
> -	if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
> -		printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
> +	int res;
        ^^^^^^^ Should be "int res = 0;"
> +
> +	if (nfsi->ncommit != 0) {
> +		res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
> +		nfsi->ncommit -= res;
> +		if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
> +			printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
> +	}
>  	return res;
>  }
>  #endif

Cheers,
  Trond
-- 
Trond Myklebust <trond.myklebust@fys.uio.no>

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 12:34 ` Trond Myklebust @ 2005-03-31 13:58 ` Ingo Molnar 2005-03-31 14:32 ` Trond Myklebust 0 siblings, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 13:58 UTC (permalink / raw) To: Trond Myklebust; +Cc: Andrew Morton, rlrevell, linux-kernel [-- Attachment #1: Type: text/plain, Size: 416 bytes --] * Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > + int res; > ^^^^^^^ Should be "int res = 0;" your patch works fine here - but there are still latencies in nfs_scan_commit()/nfs_scan_list(): see the attached 3.7 msec latency trace. It happened during a simple big-file writeout and is easily reproducible. Could the nfsi->commit list searching be replaced with a radix based approach too? Ingo [-- Attachment #2: nfs-3769us-latency.bz2 --] [-- Type: application/x-bzip2, Size: 27069 bytes --] ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 13:58 ` Ingo Molnar @ 2005-03-31 14:32 ` Trond Myklebust 2005-03-31 14:39 ` Ingo Molnar 0 siblings, 1 reply; 45+ messages in thread From: Trond Myklebust @ 2005-03-31 14:32 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, rlrevell, linux-kernel to den 31.03.2005 Klokka 15:58 (+0200) skreiv Ingo Molnar: > * Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > > > + int res; > > ^^^^^^^ Should be "int res = 0;" > > your patch works fine here - but there are still latencies in > nfs_scan_commit()/nfs_scan_list(): see the attached 3.7 msec latency > trace. It happened during a simple big-file writeout and is easily > reproducible. Could the nfsi->commit list searching be replaced with a > radix based approach too? That would be 100% pure overhead. The nfsi->commit list does not need to be sorted and with these patches applied, it no longer is. In fact one of the cleanups I still need to do is to get rid of those redundant checks on wb->index (start is now always set to 0, and end is always ~0UL). So the overhead you are currently seeing should just be that of iterating through the list, locking said requests and adding them to our private list. Cheers, Trond -- Trond Myklebust <trond.myklebust@fys.uio.no> ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 14:32 ` Trond Myklebust @ 2005-03-31 14:39 ` Ingo Molnar 2005-03-31 14:50 ` Ingo Molnar ` (2 more replies) 0 siblings, 3 replies; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 14:39 UTC (permalink / raw) To: Trond Myklebust; +Cc: Andrew Morton, rlrevell, linux-kernel * Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > your patch works fine here - but there are still latencies in > > nfs_scan_commit()/nfs_scan_list(): see the attached 3.7 msec latency > > trace. It happened during a simple big-file writeout and is easily > > reproducible. Could the nfsi->commit list searching be replaced with a > > radix based approach too? > > That would be 100% pure overhead. The nfsi->commit list does not need > to be sorted and with these patches applied, it no longer is. In fact > one of the cleanups I still need to do is to get rid of those > redundant checks on wb->index (start is now always set to 0, and end > is always ~0UL). > > So the overhead you are currently seeing should just be that of > iterating through the list, locking said requests and adding them to > our private list. ah - cool! This was a 100 MB writeout so having 3.7 msecs to process 20K+ pages is not unreasonable. To break the latency, can i just do a simple lock-break, via the patch below? 
	Ingo

--- linux/fs/nfs/pagelist.c.orig
+++ linux/fs/nfs/pagelist.c
@@ -311,8 +311,9 @@ out:
  * You must be holding the inode's req_lock when calling this function
  */
 int
-nfs_scan_list(struct list_head *head, struct list_head *dst,
-	      unsigned long idx_start, unsigned int npages)
+nfs_scan_list(struct nfs_inode *nfsi, struct list_head *head,
+	      struct list_head *dst, unsigned long idx_start,
+	      unsigned int npages)
 {
 	struct list_head	*pos, *tmp;
 	struct nfs_page		*req;
@@ -327,6 +328,8 @@ nfs_scan_list(struct list_head *head, st
 
 	list_for_each_safe(pos, tmp, head) {
 
+		cond_resched_lock(&nfsi->req_lock);
+
 		req = nfs_list_entry(pos);
 
 		if (req->wb_index < idx_start)
--- linux/fs/nfs/write.c.orig
+++ linux/fs/nfs/write.c
@@ -569,7 +569,7 @@ nfs_scan_commit(struct inode *inode, str
 	int res = 0;
 
 	if (nfsi->ncommit != 0) {
-		res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
+		res = nfs_scan_list(nfsi, &nfsi->commit, dst, idx_start, npages);
 		nfsi->ncommit -= res;
 		if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
 			printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
--- linux/include/linux/nfs_page.h.orig
+++ linux/include/linux/nfs_page.h
@@ -63,8 +63,8 @@ extern void nfs_release_request(struct n
 extern	int nfs_scan_lock_dirty(struct nfs_inode *nfsi, struct list_head *dst,
 			unsigned long idx_start, unsigned int npages);
-extern	int nfs_scan_list(struct list_head *, struct list_head *,
-			unsigned long, unsigned int);
+extern	int nfs_scan_list(struct nfs_inode *nfsi, struct list_head *,
+			struct list_head *, unsigned long, unsigned int);
 extern	int nfs_coalesce_requests(struct list_head *, struct list_head *,
 			unsigned int);
 extern	int nfs_wait_on_request(struct nfs_page *);

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 14:39 ` Ingo Molnar @ 2005-03-31 14:50 ` Ingo Molnar 2005-04-01 2:28 ` Lee Revell 2005-03-31 14:54 ` Ingo Molnar 2005-03-31 14:54 ` Trond Myklebust 2 siblings, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 14:50 UTC (permalink / raw) To: Trond Myklebust; +Cc: Andrew Morton, rlrevell, linux-kernel * Ingo Molnar <mingo@elte.hu> wrote: > > So the overhead you are currently seeing should just be that of > > iterating through the list, locking said requests and adding them to > > our private list. > > ah - cool! This was a 100 MB writeout so having 3.7 msecs to process > 20K+ pages is not unreasonable. To break the latency, can i just do a > simple lock-break, via the patch below? with this patch the worst-case latency during NFS writeout is down to 40 usecs (!). Lee: i've uploaded the -42-05 release with this patch included - could you test it on your (no doubt more complex than mine) NFS setup? Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 14:50 ` Ingo Molnar @ 2005-04-01 2:28 ` Lee Revell 2005-04-01 4:30 ` Ingo Molnar 0 siblings, 1 reply; 45+ messages in thread From: Lee Revell @ 2005-04-01 2:28 UTC (permalink / raw) To: Ingo Molnar; +Cc: Trond Myklebust, Andrew Morton, linux-kernel On Thu, 2005-03-31 at 16:50 +0200, Ingo Molnar wrote: > * Ingo Molnar <mingo@elte.hu> wrote: > > > > So the overhead you are currently seeing should just be that of > > > iterating through the list, locking said requests and adding them to > > > our private list. > > > > ah - cool! This was a 100 MB writeout so having 3.7 msecs to process > > 20K+ pages is not unreasonable. To break the latency, can i just do a > > simple lock-break, via the patch below? > > with this patch the worst-case latency during NFS writeout is down to 40 > usecs (!). > > Lee: i've uploaded the -42-05 release with this patch included - could > you test it on your (no doubt more complex than mine) NFS setup? This fixes all the NFS related latency problems I was seeing. Now the longest latency from an NFS kernel compile with "make -j64" is 391 usecs in get_swap_page. Lee ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-04-01 2:28 ` Lee Revell @ 2005-04-01 4:30 ` Ingo Molnar 2005-04-01 16:16 ` Orion Poplawski 2005-04-01 21:18 ` Lee Revell 0 siblings, 2 replies; 45+ messages in thread From: Ingo Molnar @ 2005-04-01 4:30 UTC (permalink / raw) To: Lee Revell; +Cc: Trond Myklebust, Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 985 bytes --]

* Lee Revell <rlrevell@joe-job.com> wrote:

> > > ah - cool! This was a 100 MB writeout so having 3.7 msecs to process
> > > 20K+ pages is not unreasonable. To break the latency, can i just do a
> > > simple lock-break, via the patch below?
> >
> > with this patch the worst-case latency during NFS writeout is down to 40
> > usecs (!).
> >
> > Lee: i've uploaded the -42-05 release with this patch included - could
> > you test it on your (no doubt more complex than mine) NFS setup?
>
> This fixes all the NFS related latency problems I was seeing. Now the
> longest latency from an NFS kernel compile with "make -j64" is 391
> usecs in get_swap_page.

great! The latest patches (-42-08 and later) have the reworked nfs_scan_list() lock-breaker, which should perform similarly.

i bet these NFS patches also improve generic NFS performance on fast networks. I've attached the full patchset with all fixes and improvements included - might be worth a try in -mm?

	Ingo

[-- Attachment #2: nfs-latency.patch --]
[-- Type: text/plain, Size: 15547 bytes --]

--- linux/fs/nfs/inode.c.orig
+++ linux/fs/nfs/inode.c
@@ -118,7 +118,7 @@ nfs_write_inode(struct inode *inode, int
 	int flags = sync ? FLUSH_WAIT : 0;
 	int ret;
 
-	ret = nfs_commit_inode(inode, 0, 0, flags);
+	ret = nfs_commit_inode(inode, flags);
 	if (ret < 0)
 		return ret;
 	return 0;
--- linux/fs/nfs/write.c.orig
+++ linux/fs/nfs/write.c
@@ -352,7 +352,7 @@ int nfs_writepages(struct address_space
 		if (err < 0)
 			goto out;
 	}
-	err = nfs_commit_inode(inode, 0, 0, wb_priority(wbc));
+	err = nfs_commit_inode(inode, wb_priority(wbc));
 	if (err > 0) {
 		wbc->nr_to_write -= err;
 		err = 0;
@@ -446,6 +448,8 @@ nfs_mark_request_dirty(struct nfs_page *
 	struct nfs_inode *nfsi = NFS_I(inode);
 
 	spin_lock(&nfsi->req_lock);
+	radix_tree_tag_set(&nfsi->nfs_page_tree,
+			req->wb_index, NFS_PAGE_TAG_DIRTY);
 	nfs_list_add_request(req, &nfsi->dirty);
 	nfsi->ndirty++;
 	spin_unlock(&nfsi->req_lock);
@@ -503,13 +505,12 @@ nfs_wait_on_requests(struct inode *inode
 	spin_lock(&nfsi->req_lock);
 	next = idx_start;
-	while (radix_tree_gang_lookup(&nfsi->nfs_page_tree, (void **)&req, next, 1)) {
+	while (radix_tree_gang_lookup_tag(&nfsi->nfs_page_tree, (void **)&req, next, 1, NFS_PAGE_TAG_WRITEBACK)) {
 		if (req->wb_index > idx_end)
 			break;
 
 		next = req->wb_index + 1;
-		if (!NFS_WBACK_BUSY(req))
-			continue;
+		BUG_ON(!NFS_WBACK_BUSY(req));
 		atomic_inc(&req->wb_count);
 		spin_unlock(&nfsi->req_lock);
@@ -538,12 +539,15 @@ static int
 nfs_scan_dirty(struct inode *inode, struct list_head *dst, unsigned long idx_start, unsigned int npages)
 {
 	struct nfs_inode *nfsi = NFS_I(inode);
-	int res;
-	res = nfs_scan_list(&nfsi->dirty, dst, idx_start, npages);
-	nfsi->ndirty -= res;
-	sub_page_state(nr_dirty,res);
-	if ((nfsi->ndirty == 0) != list_empty(&nfsi->dirty))
-		printk(KERN_ERR "NFS: desynchronized value of nfs_i.ndirty.\n");
+	int res = 0;
+
+	if (nfsi->ndirty != 0) {
+		res = nfs_scan_lock_dirty(nfsi, dst, idx_start, npages);
+		nfsi->ndirty -= res;
+		sub_page_state(nr_dirty,res);
+		if ((nfsi->ndirty == 0) != list_empty(&nfsi->dirty))
+			printk(KERN_ERR "NFS: desynchronized value of nfs_i.ndirty.\n");
+	}
 	return res;
 }
@@ -562,11 +566,14 @@ static int
 nfs_scan_commit(struct inode *inode, struct list_head *dst, unsigned long idx_start, unsigned int npages)
 {
 	struct nfs_inode *nfsi = NFS_I(inode);
-	int res;
-	res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
-	nfsi->ncommit -= res;
-	if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
-		printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
+	int res = 0;
+
+	if (nfsi->ncommit != 0) {
+		res = nfs_scan_list(nfsi, &nfsi->commit, dst, idx_start, npages);
+		nfsi->ncommit -= res;
+		if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
+			printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
+	}
 	return res;
 }
 #endif
@@ -821,7 +828,7 @@ out:
 #else
 	nfs_inode_remove_request(req);
 #endif
-	nfs_unlock_request(req);
+	nfs_clear_page_writeback(req);
 }
 
 static inline int flush_task_priority(int how)
@@ -952,7 +959,7 @@ out_bad:
 		nfs_writedata_free(data);
 	}
 	nfs_mark_request_dirty(req);
-	nfs_unlock_request(req);
+	nfs_clear_page_writeback(req);
 	return -ENOMEM;
 }
@@ -1002,7 +1009,7 @@ static int nfs_flush_one(struct list_hea
 		struct nfs_page *req = nfs_list_entry(head->next);
 		nfs_list_remove_request(req);
 		nfs_mark_request_dirty(req);
-		nfs_unlock_request(req);
+		nfs_clear_page_writeback(req);
 	}
 	return -ENOMEM;
 }
@@ -1029,7 +1036,7 @@ nfs_flush_list(struct list_head *head, i
 		req = nfs_list_entry(head->next);
 		nfs_list_remove_request(req);
 		nfs_mark_request_dirty(req);
-		nfs_unlock_request(req);
+		nfs_clear_page_writeback(req);
 	}
 	return error;
 }
@@ -1121,7 +1128,7 @@ static void nfs_writeback_done_full(stru
 		nfs_inode_remove_request(req);
 #endif
 	next:
-		nfs_unlock_request(req);
+		nfs_clear_page_writeback(req);
 	}
 }
@@ -1210,36 +1217,24 @@ static void nfs_commit_rpcsetup(struct l
		struct nfs_write_data *data, int how)
 {
 	struct rpc_task		*task = &data->task;
-	struct nfs_page		*first, *last;
+	struct nfs_page		*first;
 	struct inode		*inode;
-	loff_t			start, end, len;
 
 	/* Set up the RPC argument and reply structs
 	 * NB: take care not to mess about with data->commit et al. */
 
 	list_splice_init(head, &data->pages);
 	first = nfs_list_entry(data->pages.next);
-	last = nfs_list_entry(data->pages.prev);
 	inode = first->wb_context->dentry->d_inode;
 
-	/*
-	 * Determine the offset range of requests in the COMMIT call.
-	 * We rely on the fact that data->pages is an ordered list...
-	 */
-	start = req_offset(first);
-	end = req_offset(last) + last->wb_bytes;
-	len = end - start;
-	/* If 'len' is not a 32-bit quantity, pass '0' in the COMMIT call */
-	if (end >= i_size_read(inode) || len < 0 || len > (~((u32)0) >> 1))
-		len = 0;
-
 	data->inode	  = inode;
 	data->cred	  = first->wb_context->cred;
 
 	data->args.fh     = NFS_FH(data->inode);
-	data->args.offset = start;
-	data->args.count  = len;
-	data->res.count   = len;
+	/* Note: we always request a commit of the entire inode */
+	data->args.offset = 0;
+	data->args.count  = 0;
+	data->res.count   = 0;
 	data->res.fattr   = &data->fattr;
 	data->res.verf    = &data->verf;
@@ -1278,7 +1273,7 @@ nfs_commit_list(struct list_head *head, 
 		req = nfs_list_entry(head->next);
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
-		nfs_unlock_request(req);
+		nfs_clear_page_writeback(req);
 	}
 	return -ENOMEM;
 }
@@ -1324,7 +1319,7 @@ nfs_commit_done(struct rpc_task *task)
 			dprintk(" mismatch\n");
 		nfs_mark_request_dirty(req);
 	next:
-		nfs_unlock_request(req);
+		nfs_clear_page_writeback(req);
 		res++;
 	}
 	sub_page_state(nr_unstable,res);
@@ -1350,8 +1345,7 @@ static int nfs_flush_inode(struct inode
 }
 
 #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4)
-int nfs_commit_inode(struct inode *inode, unsigned long idx_start,
-		unsigned int npages, int how)
+int nfs_commit_inode(struct inode *inode, int how)
 {
 	struct nfs_inode *nfsi = NFS_I(inode);
 	LIST_HEAD(head);
@@ -1359,15 +1353,13 @@ int nfs_commit_inode(struct inode
 	error = 0;
 
 	spin_lock(&nfsi->req_lock);
-	res = nfs_scan_commit(inode, &head, idx_start, npages);
+	res = nfs_scan_commit(inode, &head, 0, 0);
+	spin_unlock(&nfsi->req_lock);
 	if (res) {
-		res += nfs_scan_commit(inode, &head, 0, 0);
-		spin_unlock(&nfsi->req_lock);
 		error = nfs_commit_list(&head, how);
-	} else
-		spin_unlock(&nfsi->req_lock);
-	if (error < 0)
-		return error;
+		if (error < 0)
+			return error;
+	}
 	return res;
 }
 #endif
@@ -1389,7 +1381,7 @@ int nfs_sync_inode(struct inode *inode,
 		error = nfs_flush_inode(inode, idx_start, npages, how);
 #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4)
 		if (error == 0)
-			error = nfs_commit_inode(inode, idx_start, npages, how);
+			error = nfs_commit_inode(inode, how);
 #endif
 	} while (error > 0);
 	return error;
--- linux/fs/nfs/read.c.orig
+++ linux/fs/nfs/read.c
@@ -173,7 +173,6 @@ static int nfs_readpage_async(struct nfs
 	if (len < PAGE_CACHE_SIZE)
 		memclear_highpage_flush(page, len, PAGE_CACHE_SIZE - len);
 
-	nfs_lock_request(new);
 	nfs_list_add_request(new, &one_request);
 	nfs_pagein_one(&one_request, inode);
 	return 0;
@@ -185,7 +184,6 @@ static void nfs_readpage_release(struct
 	nfs_clear_request(req);
 	nfs_release_request(req);
-	nfs_unlock_request(req);
 
 	dprintk("NFS: read done (%s/%Ld %d@%Ld)\n",
 			req->wb_context->dentry->d_inode->i_sb->s_id,
@@ -553,7 +551,6 @@ readpage_async_filler(void *data, struct
 	}
 	if (len < PAGE_CACHE_SIZE)
 		memclear_highpage_flush(page, len, PAGE_CACHE_SIZE - len);
-	nfs_lock_request(new);
 	nfs_list_add_request(new, desc->head);
 	return 0;
 }
--- linux/fs/nfs/pagelist.c.orig
+++ linux/fs/nfs/pagelist.c
@@ -112,6 +112,33 @@ void nfs_unlock_request(struct nfs_page
 }
 
 /**
+ * nfs_set_page_writeback_locked - Lock a request for writeback
+ * @req:
+ */
+int nfs_set_page_writeback_locked(struct nfs_page *req)
+{
+	struct nfs_inode *nfsi = NFS_I(req->wb_context->dentry->d_inode);
+
+	if (!nfs_lock_request(req))
+		return 0;
+	radix_tree_tag_set(&nfsi->nfs_page_tree, req->wb_index, NFS_PAGE_TAG_WRITEBACK);
+	return 1;
+}
+
+/**
+ * nfs_clear_page_writeback - Unlock request and wake up sleepers
+ */
+void nfs_clear_page_writeback(struct nfs_page *req)
+{
+	struct nfs_inode *nfsi = NFS_I(req->wb_context->dentry->d_inode);
+
+	spin_lock(&nfsi->req_lock);
+	radix_tree_tag_clear(&nfsi->nfs_page_tree, req->wb_index, NFS_PAGE_TAG_WRITEBACK);
+	spin_unlock(&nfsi->req_lock);
+	nfs_unlock_request(req);
+}
+
+/**
  * nfs_clear_request - Free up all resources allocated to the request
  * @req:
  *
@@ -151,36 +178,6 @@ nfs_release_request(struct nfs_page *req
 }
 
 /**
- * nfs_list_add_request - Insert a request into a sorted list
- * @req: request
- * @head: head of list into which to insert the request.
- *
- * Note that the wb_list is sorted by page index in order to facilitate
- * coalescing of requests.
- * We use an insertion sort that is optimized for the case of appended
- * writes.
- */
-void
-nfs_list_add_request(struct nfs_page *req, struct list_head *head)
-{
-	struct list_head *pos;
-
-#ifdef NFS_PARANOIA
-	if (!list_empty(&req->wb_list)) {
-		printk(KERN_ERR "NFS: Add to list failed!\n");
-		BUG();
-	}
-#endif
-	list_for_each_prev(pos, head) {
-		struct nfs_page	*p = nfs_list_entry(pos);
-		if (p->wb_index < req->wb_index)
-			break;
-	}
-	list_add(&req->wb_list, pos);
-	req->wb_list_head = head;
-}
-
-/**
  * nfs_wait_on_request - Wait for a request to complete.
  * @req: request to wait upon.
  *
@@ -243,6 +240,63 @@ nfs_coalesce_requests(struct list_head *
 	return npages;
 }
 
+#define NFS_SCAN_MAXENTRIES 16
+/**
+ * nfs_scan_lock_dirty - Scan the radix tree for dirty requests
+ * @nfsi: NFS inode
+ * @dst: Destination list
+ * @idx_start: lower bound of page->index to scan
+ * @npages: idx_start + npages sets the upper bound to scan.
+ *
+ * Moves elements from one of the inode request lists.
+ * If the number of requests is set to 0, the entire address_space
+ * starting at index idx_start, is scanned.
+ * The requests are *not* checked to ensure that they form a contiguous set.
+ * You must be holding the inode's req_lock when calling this function + */ +int +nfs_scan_lock_dirty(struct nfs_inode *nfsi, struct list_head *dst, + unsigned long idx_start, unsigned int npages) +{ + struct nfs_page *pgvec[NFS_SCAN_MAXENTRIES]; + struct nfs_page *req; + unsigned long idx_end; + int found, i; + int res; + + res = 0; + if (npages == 0) + idx_end = ~0; + else + idx_end = idx_start + npages - 1; + + for (;;) { + found = radix_tree_gang_lookup_tag(&nfsi->nfs_page_tree, + (void **)&pgvec[0], idx_start, NFS_SCAN_MAXENTRIES, + NFS_PAGE_TAG_DIRTY); + if (found <= 0) + break; + for (i = 0; i < found; i++) { + req = pgvec[i]; + if (req->wb_index > idx_end) + goto out; + + idx_start = req->wb_index + 1; + + if (nfs_set_page_writeback_locked(req)) { + radix_tree_tag_clear(&nfsi->nfs_page_tree, + req->wb_index, NFS_PAGE_TAG_DIRTY); + nfs_list_remove_request(req); + nfs_list_add_request(req, dst); + res++; + } + } + cond_resched_lock(&nfsi->req_lock); + } +out: + return res; +} + /** * nfs_scan_list - Scan a list for matching requests * @head: One of the NFS inode request lists @@ -257,10 +311,12 @@ nfs_coalesce_requests(struct list_head * * You must be holding the inode's req_lock when calling this function */ int -nfs_scan_list(struct list_head *head, struct list_head *dst, - unsigned long idx_start, unsigned int npages) +nfs_scan_list(struct nfs_inode *nfsi, struct list_head *head, + struct list_head *dst, unsigned long idx_start, + unsigned int npages) { - struct list_head *pos, *tmp; + LIST_HEAD(locked); + struct list_head *pos; struct nfs_page *req; unsigned long idx_end; int res; @@ -271,21 +327,22 @@ nfs_scan_list(struct list_head *head, st else idx_end = idx_start + npages - 1; - list_for_each_safe(pos, tmp, head) { + while (!list_empty(head)) { + pos = head->next; req = nfs_list_entry(pos); - if (req->wb_index < idx_start) - continue; - if (req->wb_index > idx_end) - break; - - if (!nfs_lock_request(req)) - continue; - nfs_list_remove_request(req); - 
nfs_list_add_request(req, dst); - res++; + if (!nfs_set_page_writeback_locked(req)) { + list_del(pos); + list_add(&req->wb_list, &locked); + } else { + nfs_list_remove_request(req); + nfs_list_add_request(req, dst); + res++; + } + cond_resched_lock(&nfsi->req_lock); } + list_splice(&locked, head); return res; } --- linux/include/linux/nfs_fs.h.orig +++ linux/include/linux/nfs_fs.h @@ -377,10 +377,10 @@ extern void nfs_commit_done(struct rpc_t */ extern int nfs_sync_inode(struct inode *, unsigned long, unsigned int, int); #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4) -extern int nfs_commit_inode(struct inode *, unsigned long, unsigned int, int); +extern int nfs_commit_inode(struct inode *, int); #else static inline int -nfs_commit_inode(struct inode *inode, unsigned long idx_start, unsigned int npages, int how) +nfs_commit_inode(struct inode *inode, int how) { return 0; } --- linux/include/linux/nfs_page.h.orig +++ linux/include/linux/nfs_page.h @@ -20,12 +20,19 @@ #include <asm/atomic.h> /* + * Valid flags for the radix tree + */ +#define NFS_PAGE_TAG_DIRTY 0 +#define NFS_PAGE_TAG_WRITEBACK 1 + +/* * Valid flags for a dirty buffer */ #define PG_BUSY 0 #define PG_NEED_COMMIT 1 #define PG_NEED_RESCHED 2 +struct nfs_inode; struct nfs_page { struct list_head wb_list, /* Defines state of page: */ *wb_list_head; /* read/write/commit */ @@ -54,14 +61,17 @@ extern void nfs_clear_request(struct nfs extern void nfs_release_request(struct nfs_page *req); -extern void nfs_list_add_request(struct nfs_page *, struct list_head *); - -extern int nfs_scan_list(struct list_head *, struct list_head *, - unsigned long, unsigned int); +extern int nfs_scan_lock_dirty(struct nfs_inode *nfsi, struct list_head *dst, + unsigned long idx_start, unsigned int npages); +extern int nfs_scan_list(struct nfs_inode *nfsi, struct list_head *, + struct list_head *, unsigned long, unsigned int); extern int nfs_coalesce_requests(struct list_head *, struct list_head *, unsigned int); extern int 
nfs_wait_on_request(struct nfs_page *); extern void nfs_unlock_request(struct nfs_page *req); +extern int nfs_set_page_writeback_locked(struct nfs_page *req); +extern void nfs_clear_page_writeback(struct nfs_page *req); + /* * Lock the page of an asynchronous request without incrementing the wb_count @@ -86,6 +96,18 @@ nfs_lock_request(struct nfs_page *req) return 1; } +/** + * nfs_list_add_request - Insert a request into a list + * @req: request + * @head: head of list into which to insert the request. + */ +static inline void +nfs_list_add_request(struct nfs_page *req, struct list_head *head) +{ + list_add_tail(&req->wb_list, head); + req->wb_list_head = head; +} + /** * nfs_list_remove_request - Remove a request from its wb_list @@ -96,10 +118,6 @@ nfs_list_remove_request(struct nfs_page { if (list_empty(&req->wb_list)) return; - if (!NFS_WBACK_BUSY(req)) { - printk(KERN_ERR "NFS: unlocked request attempted removed from list!\n"); - BUG(); - } list_del_init(&req->wb_list); req->wb_list_head = NULL; } ^ permalink raw reply [flat|nested] 45+ messages in thread
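The nfs_scan_lock_dirty() loop in the patch above batches radix-tree gang lookups and advances idx_start past each result, which is what makes a later lock-break safe: the scan can resume exactly where it left off. The userspace C sketch below models that pattern with a plain array standing in for the radix tree and its tags; all names and the batch size are illustrative, and nothing here is a kernel API.

```c
/* Userspace sketch (not kernel code) of the batched tag-scan pattern:
 * pages are tagged "dirty" in an index-ordered structure, and the scanner
 * looks up a small batch at a time, advancing idx_start past the last
 * result so the lock could be dropped between batches without losing
 * its place. */
#include <assert.h>
#include <stddef.h>

#define SCAN_MAXENTRIES 4
#define NPAGES 32

static int dirty_tag[NPAGES];          /* stand-in for radix tree tags */

/* gang lookup: collect up to max indices tagged dirty, starting at start */
static int gang_lookup_dirty(unsigned long start, unsigned long *out, int max)
{
    int found = 0;
    for (unsigned long i = start; i < NPAGES && found < max; i++)
        if (dirty_tag[i])
            out[found++] = i;
    return found;
}

/* Move all dirty pages in [idx_start, idx_end] to dst, batch by batch;
 * between batches a real implementation would call cond_resched_lock()
 * to bound the time the spinlock is held. */
static int scan_dirty(unsigned long idx_start, unsigned long idx_end,
                      unsigned long *dst)
{
    unsigned long batch[SCAN_MAXENTRIES];
    int res = 0;

    for (;;) {
        int found = gang_lookup_dirty(idx_start, batch, SCAN_MAXENTRIES);
        if (found <= 0)
            break;
        for (int i = 0; i < found; i++) {
            if (batch[i] > idx_end)
                return res;
            idx_start = batch[i] + 1;  /* resume point survives a lock-break */
            dirty_tag[batch[i]] = 0;   /* clear tag, "move" the page to dst */
            dst[res++] = batch[i];
        }
        /* cond_resched_lock(&req_lock) would go here */
    }
    return res;
}
```

Because the resume point is an index rather than a list position, concurrent insertions or removals elsewhere in the structure cannot invalidate the scanner's cursor, unlike the list walk this code replaces.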
* Re: NFS client latencies 2005-04-01 4:30 ` Ingo Molnar @ 2005-04-01 16:16 ` Orion Poplawski 2005-04-01 16:33 ` Trond Myklebust 2005-04-01 21:18 ` Lee Revell 1 sibling, 1 reply; 45+ messages in thread From: Orion Poplawski @ 2005-04-01 16:16 UTC (permalink / raw) To: linux-kernel Ingo Molnar wrote: > * Lee Revell <rlrevell@joe-job.com> wrote: >>This fixes all the NFS related latency problems I was seeing. Now the >>longest latency from an NFS kernel compile with "make -j64" is 391 >>usecs in get_swap_page. > > > great! The latest patches (-42-08 and later) have the reworked > nfs_scan_list() lock-breaker, which should perform similarly. > > i bet these NFS patches also improve generic NFS performance on fast > networks. I've attached the full patchset with all fixes and > improvements included - might be worth a try in -mm? > Just a question - would these changes be expected to improve NFS client *read* access at all, or just write? Thanks! ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-04-01 16:16 ` Orion Poplawski @ 2005-04-01 16:33 ` Trond Myklebust 0 siblings, 0 replies; 45+ messages in thread From: Trond Myklebust @ 2005-04-01 16:33 UTC (permalink / raw) To: Orion Poplawski; +Cc: linux-kernel fr den 01.04.2005 Klokka 09:16 (-0700) skreiv Orion Poplawski: > Just a question - would these changes be expected to improve NFS client > *read* access at all, or just write? Just write. Cheers, Trond -- Trond Myklebust <trond.myklebust@fys.uio.no> ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-04-01 4:30 ` Ingo Molnar 2005-04-01 16:16 ` Orion Poplawski @ 2005-04-01 21:18 ` Lee Revell 1 sibling, 0 replies; 45+ messages in thread From: Lee Revell @ 2005-04-01 21:18 UTC (permalink / raw) To: Ingo Molnar; +Cc: Trond Myklebust, Andrew Morton, linux-kernel On Fri, 2005-04-01 at 06:30 +0200, Ingo Molnar wrote: > * Lee Revell <rlrevell@joe-job.com> wrote: > > > > > ah - cool! This was a 100 MB writeout so having 3.7 msecs to process > > > > 20K+ pages is not unreasonable. To break the latency, can i just do a > > > > simple lock-break, via the patch below? > > > > > > with this patch the worst-case latency during NFS writeout is down to 40 > > > usecs (!). > > > > > > Lee: i've uploaded the -42-05 release with this patch included - could > > > you test it on your (no doubt more complex than mine) NFS setup? > > > > This fixes all the NFS related latency problems I was seeing. Now the > > longest latency from an NFS kernel compile with "make -j64" is 391 > > usecs in get_swap_page. > > great! The latest patches (-42-08 and later) have the reworked > nfs_scan_list() lock-breaker, which should perform similarly. > > i bet these NFS patches also improve generic NFS performance on fast > networks. I've attached the full patchset with all fixes and > improvements included - might be worth a try in -mm? With tracing disabled on the C3 (which is the NFS server, don't ask), the maximum latency during the same kernel compile is about 2ms. So tracing overhead probably doubled or tripled the latencies. I'll try again with tracing enabled to determine whether these code paths are related to the NFS server or not. It's either going to be that or the get_swap_page stuff. But the client side is OK now. Lee ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 14:39 ` Ingo Molnar 2005-03-31 14:50 ` Ingo Molnar @ 2005-03-31 14:54 ` Ingo Molnar 2005-03-31 15:00 ` Trond Myklebust 2005-03-31 14:54 ` Trond Myklebust 2 siblings, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 14:54 UTC (permalink / raw) To: Trond Myklebust; +Cc: Andrew Morton, rlrevell, linux-kernel * Ingo Molnar <mingo@elte.hu> wrote: > [...] To break the latency, can i just do a simple lock-break, via the > patch below? i think it's incorrect in its current form, because 'pos' is not necessarily safe and might be removed from the commit_list? the whole loop could be a "while (!list_empty(head))" thing, if it wasnt for the possibility for an nfs_set_page_writeback_locked() to skip an entry. Maybe a local list of 'already processed' requests should be built, and once the main list is emptied, moved back into the main list? Then the lock-break could be moved to the end of the loop. Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 14:54 ` Ingo Molnar @ 2005-03-31 15:00 ` Trond Myklebust 0 siblings, 0 replies; 45+ messages in thread From: Trond Myklebust @ 2005-03-31 15:00 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, rlrevell, linux-kernel to den 31.03.2005 Klokka 16:54 (+0200) skreiv Ingo Molnar: > i think it's incorrect in its current form, because 'pos' is not > necessarily safe and might be removed from the commit_list? Right. > the whole loop could be a "while (!list_empty(head))" thing, if it wasnt > for the possibility for an nfs_set_page_writeback_locked() to skip an > entry. Maybe a local list of 'already processed' requests should be > built, and once the main list is emptied, moved back into the main list? > Then the lock-break could be moved to the end of the loop. Should be pretty much unnecessary. See my previous email. Cheers, Trond -- Trond Myklebust <trond.myklebust@fys.uio.no> ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 14:39 ` Ingo Molnar 2005-03-31 14:50 ` Ingo Molnar 2005-03-31 14:54 ` Ingo Molnar @ 2005-03-31 14:54 ` Trond Myklebust 2005-03-31 14:58 ` Ingo Molnar 2 siblings, 1 reply; 45+ messages in thread From: Trond Myklebust @ 2005-03-31 14:54 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, rlrevell, linux-kernel to den 31.03.2005 Klokka 16:39 (+0200) skreiv Ingo Molnar: > * Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > > > your patch works fine here - but there are still latencies in > > > nfs_scan_commit()/nfs_scan_list(): see the attached 3.7 msec latency > > > trace. It happened during a simple big-file writeout and is easily > > > reproducible. Could the nfsi->commit list searching be replaced with a > > > radix based approach too? > > > > That would be 100% pure overhead. The nfsi->commit list does not need > > to be sorted and with these patches applied, it no longer is. In fact > > one of the cleanups I still need to do is to get rid of those > > redundant checks on wb->index (start is now always set to 0, and end > > is always ~0UL). > > > > So the overhead you are currently seeing should just be that of > > iterating through the list, locking said requests and adding them to > > our private list. > > ah - cool! This was a 100 MB writeout so having 3.7 msecs to process > 20K+ pages is not unreasonable. To break the latency, can i just do a > simple lock-break, via the patch below? Errm... That looks a bit unsafe. What is there then to stop another process from removing "pos" while you are scheduling? It seems to me that you should really start afresh in that case. The good news, though, is that because requests on the "commit" list do not remain locked for long without being removed from the list, you should rarely have to skip them. IOW restarting from the head of the list is pretty much the same as starting from where you left off. 
Cheers,
  Trond

> 	Ingo
>
> --- linux/fs/nfs/pagelist.c.orig
> +++ linux/fs/nfs/pagelist.c
> @@ -311,8 +311,9 @@ out:
>   * You must be holding the inode's req_lock when calling this function
>   */
>  int
> -nfs_scan_list(struct list_head *head, struct list_head *dst,
> -		unsigned long idx_start, unsigned int npages)
> +nfs_scan_list(struct nfs_inode *nfsi, struct list_head *head,
> +		struct list_head *dst, unsigned long idx_start,
> +		unsigned int npages)
>  {
>  	struct list_head *pos, *tmp;
>  	struct nfs_page *req;
> @@ -327,6 +328,8 @@ nfs_scan_list(struct list_head *head, st
>
>  	list_for_each_safe(pos, tmp, head) {
>
> +		cond_resched_lock(&nfsi->req_lock);
> +
>  		req = nfs_list_entry(pos);
>
>  		if (req->wb_index < idx_start)
> --- linux/fs/nfs/write.c.orig
> +++ linux/fs/nfs/write.c
> @@ -569,7 +569,7 @@ nfs_scan_commit(struct inode *inode, str
>  	int res = 0;
>
>  	if (nfsi->ncommit != 0) {
> -		res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
> +		res = nfs_scan_list(nfsi, &nfsi->commit, dst, idx_start, npages);
>  		nfsi->ncommit -= res;
>  		if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
>  			printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
> --- linux/include/linux/nfs_page.h.orig
> +++ linux/include/linux/nfs_page.h
> @@ -63,8 +63,8 @@ extern void nfs_release_request(struct n
>
>  extern int nfs_scan_lock_dirty(struct nfs_inode *nfsi, struct list_head *dst,
>  			unsigned long idx_start, unsigned int npages);
> -extern int nfs_scan_list(struct list_head *, struct list_head *,
> -			unsigned long, unsigned int);
> +extern int nfs_scan_list(struct nfs_inode *nfsi, struct list_head *,
> +			struct list_head *, unsigned long, unsigned int);
>  extern int nfs_coalesce_requests(struct list_head *, struct list_head *,
>  			unsigned int);
>  extern int nfs_wait_on_request(struct nfs_page *);

--
Trond Myklebust <trond.myklebust@fys.uio.no>
* Re: NFS client latencies 2005-03-31 14:54 ` Trond Myklebust @ 2005-03-31 14:58 ` Ingo Molnar 2005-03-31 15:06 ` Trond Myklebust 2005-03-31 15:10 ` Ingo Molnar 0 siblings, 2 replies; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 14:58 UTC (permalink / raw) To: Trond Myklebust; +Cc: Andrew Morton, rlrevell, linux-kernel * Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > Errm... That looks a bit unsafe. What is there then to stop another > process from removing "pos" while you are scheduling? It seems to me > that you should really start afresh in that case. yeah. > The good news, though, is that because requests on the "commit" list > do not remain locked for long without being removed from the list, you > should rarely have to skip them. IOW restarting from the head of the > list is pretty much the same as starting from where you left off. as we've learned it through painful experience on the ext3 side, restarting scans where 'skipping' has to be done is unhealthy. would it be safe to collect locked entries into a separate, local list, so that the restart would only see newly added entries? Then once the moving of all entries has been done, all the locked entries could be added back to the commit_list via one list_add. (can anything happen to those locked entries that would break this method?) Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 14:58 ` Ingo Molnar @ 2005-03-31 15:06 ` Trond Myklebust 2005-03-31 15:10 ` Ingo Molnar 2005-03-31 15:10 ` Ingo Molnar 1 sibling, 1 reply; 45+ messages in thread From: Trond Myklebust @ 2005-03-31 15:06 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, rlrevell, linux-kernel to den 31.03.2005 Klokka 16:58 (+0200) skreiv Ingo Molnar: > would it be safe to collect locked entries into a separate, local list, > so that the restart would only see newly added entries? Then once the > moving of all entries has been done, all the locked entries could be > added back to the commit_list via one list_add. (can anything happen to > those locked entries that would break this method?) You are not allowed to remove an entry from the list if it is locked by someone else. Locking grants temporary ownership of the entry. Cheers, Trond -- Trond Myklebust <trond.myklebust@fys.uio.no> ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 15:06 ` Trond Myklebust @ 2005-03-31 15:10 ` Ingo Molnar 2005-03-31 16:00 ` Trond Myklebust 0 siblings, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 15:10 UTC (permalink / raw) To: Trond Myklebust; +Cc: Andrew Morton, rlrevell, linux-kernel * Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > would it be safe to collect locked entries into a separate, local list, > > so that the restart would only see newly added entries? Then once the > > moving of all entries has been done, all the locked entries could be > > added back to the commit_list via one list_add. (can anything happen to > > those locked entries that would break this method?) > > You are not allowed to remove an entry from the list if it is locked > by someone else. Locking grants temporary ownership of the entry. well, removing a neighboring entry will change the locked entry's link fields (e.g. req->wb_list.prev), so this ownership cannot involve this particular list, can it? Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 15:10 ` Ingo Molnar @ 2005-03-31 16:00 ` Trond Myklebust 0 siblings, 0 replies; 45+ messages in thread From: Trond Myklebust @ 2005-03-31 16:00 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, rlrevell, linux-kernel to den 31.03.2005 Klokka 17:10 (+0200) skreiv Ingo Molnar: > * Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > > > would it be safe to collect locked entries into a separate, local list, > > > so that the restart would only see newly added entries? Then once the > > > moving of all entries has been done, all the locked entries could be > > > added back to the commit_list via one list_add. (can anything happen to > > > those locked entries that would break this method?) > > > > You are not allowed to remove an entry from the list if it is locked > > by someone else. Locking grants temporary ownership of the entry. > > well, removing a neighboring entry will change the locked entry's link > fields (e.g. req->wb_list.prev), so this ownership cannot involve this > particular list, can it? The point is that you are taking something that was previously globally visible and owned by somebody else and making it globally invisible by placing it on a private list. I'm not sure whether or not that is going to cause bugs in the current code (it may actually be safe with the current code), but as far as clean ownership semantics go, it sucks. Cheers, Trond -- Trond Myklebust <trond.myklebust@fys.uio.no> ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 14:58 ` Ingo Molnar 2005-03-31 15:06 ` Trond Myklebust @ 2005-03-31 15:10 ` Ingo Molnar 1 sibling, 0 replies; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 15:10 UTC (permalink / raw) To: Trond Myklebust; +Cc: Andrew Morton, rlrevell, linux-kernel * Ingo Molnar <mingo@elte.hu> wrote: > > The good news, though, is that because requests on the "commit" list > > do not remain locked for long without being removed from the list, you > > should rarely have to skip them. IOW restarting from the head of the > > list is pretty much the same as starting from where you left off. > > as we've learned it through painful experience on the ext3 side, > restarting scans where 'skipping' has to be done is unhealthy. > > would it be safe to collect locked entries into a separate, local > list, so that the restart would only see newly added entries? Then > once the moving of all entries has been done, all the locked entries > could be added back to the commit_list via one list_add. (can anything > happen to those locked entries that would break this method?) i.e. like the patch below (compiles and is lightly tested). It's a pretty straightforward technique when dealing with latencies, and lets us sleep for sure. We dont have to know whether massive amounts of locked entries ever queue up or not. 
(thanks go to Arjan for noticing that it has to be list_splice() not
list_add() :-| )

	Ingo

--- linux/fs/nfs/pagelist.c.orig3
+++ linux/fs/nfs/pagelist.c
@@ -311,10 +311,12 @@ out:
  * You must be holding the inode's req_lock when calling this function
  */
 int
-nfs_scan_list(struct list_head *head, struct list_head *dst,
-		unsigned long idx_start, unsigned int npages)
+nfs_scan_list(struct nfs_inode *nfsi, struct list_head *head,
+		struct list_head *dst, unsigned long idx_start,
+		unsigned int npages)
 {
-	struct list_head *pos, *tmp;
+	LIST_HEAD(locked);
+	struct list_head *pos;
 	struct nfs_page *req;
 	unsigned long idx_end;
 	int res;
@@ -325,21 +327,22 @@ nfs_scan_list(struct list_head *head, st
 	else
 		idx_end = idx_start + npages - 1;
 
-	list_for_each_safe(pos, tmp, head) {
+	while (!list_empty(head)) {
+		pos = head->next;
 		req = nfs_list_entry(pos);
-		if (req->wb_index < idx_start)
-			continue;
-		if (req->wb_index > idx_end)
-			break;
-
-		if (!nfs_set_page_writeback_locked(req))
-			continue;
-		nfs_list_remove_request(req);
-		nfs_list_add_request(req, dst);
-		res++;
+		if (!nfs_set_page_writeback_locked(req)) {
+			list_del(pos);
+			list_add(&req->wb_list, &locked);
+		} else {
+			nfs_list_remove_request(req);
+			nfs_list_add_request(req, dst);
+			res++;
+		}
+		cond_resched_lock(&nfsi->req_lock);
 	}
+	list_splice(&locked, head);
 	return res;
 }
--- linux/fs/nfs/write.c.orig3
+++ linux/fs/nfs/write.c
@@ -569,7 +569,7 @@ nfs_scan_commit(struct inode *inode, str
 	int res = 0;
 
 	if (nfsi->ncommit != 0) {
-		res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
+		res = nfs_scan_list(nfsi, &nfsi->commit, dst, idx_start, npages);
 		nfsi->ncommit -= res;
 		if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
 			printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
--- linux/include/linux/nfs_page.h.orig3
+++ linux/include/linux/nfs_page.h
@@ -63,8 +63,8 @@ extern void nfs_release_request(struct n
 
 extern int nfs_scan_lock_dirty(struct nfs_inode *nfsi, struct list_head *dst,
 			unsigned long idx_start, unsigned int npages);
-extern int nfs_scan_list(struct list_head *, struct list_head *,
-			unsigned long, unsigned int);
+extern int nfs_scan_list(struct nfs_inode *nfsi, struct list_head *,
+			struct list_head *, unsigned long, unsigned int);
 extern int nfs_coalesce_requests(struct list_head *, struct list_head *,
 			unsigned int);
 extern int nfs_wait_on_request(struct nfs_page *);
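The lock-break technique in the patch above can be modelled in a few lines of userspace C: entries that cannot be locked are parked on a private list so the while (!list_empty(head)) loop always shrinks the shared list, and the parked entries are spliced back once the scan completes. Everything below is illustrative, not kernel API: the list helpers mimic <linux/list.h>, and the busy flag stands in for nfs_set_page_writeback_locked() failing.

```c
/* Userspace sketch of the "park skipped entries, splice back" lock-break
 * pattern.  Each loop iteration removes the head entry, so progress is
 * guaranteed and the lock could safely be dropped between iterations. */
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };
#define LIST_INIT(name) { &(name), &(name) }

struct req { struct list_head link; int id; int busy; };

static void list_add_tail(struct list_head *n, struct list_head *h)
{ n->prev = h->prev; n->next = h; h->prev->next = n; h->prev = n; }

static void list_del(struct list_head *e)
{ e->prev->next = e->next; e->next->prev = e->prev; e->next = e->prev = e; }

static int list_empty(const struct list_head *h) { return h->next == h; }

/* Move every entry of src to the front of dst (src is left dangling,
 * as with the kernel's list_splice()). */
static void list_splice(struct list_head *src, struct list_head *dst)
{
    if (!list_empty(src)) {
        struct list_head *first = src->next, *last = src->prev;
        first->prev = dst;
        last->next = dst->next;
        dst->next->prev = last;
        dst->next = first;
    }
}

#define entry_of(ptr) ((struct req *)((char *)(ptr) - offsetof(struct req, link)))

/* Move every lockable request from head to dst; skipped (busy) entries
 * end up back on head.  A real scan would cond_resched_lock() per pass. */
static int scan_list(struct list_head *head, struct list_head *dst)
{
    struct list_head locked = LIST_INIT(locked);
    int res = 0;

    while (!list_empty(head)) {
        struct list_head *pos = head->next;
        struct req *r = entry_of(pos);

        list_del(pos);
        if (r->busy) {
            list_add_tail(pos, &locked);  /* park it, keep making progress */
        } else {
            list_add_tail(pos, dst);
            res++;
        }
        /* lock-break point: safe, pos is already off the shared list */
    }
    list_splice(&locked, head);
    return res;
}
```

Restarting such a scan is harmless: parked entries only reappear on the shared list after the pass completes, so the loop never revisits an entry it has already skipped within one pass.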
* Re: NFS client latencies 2005-03-31 2:47 ` Lee Revell 2005-03-31 3:48 ` Trond Myklebust @ 2005-03-31 7:03 ` Ingo Molnar 1 sibling, 0 replies; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 7:03 UTC (permalink / raw) To: Lee Revell; +Cc: Andrew Morton, trond.myklebust, linux-kernel * Lee Revell <rlrevell@joe-job.com> wrote: > > Is a bunch of gobbledygook. Hows about you interpret it for us? > > Sorry. When I summarized them before, Ingo just asked for the full > verbose trace. please send non-verbose traces if possible. The verbose traces are useful when it's not clear which portion of a really large function is the call site - but they are also alot harder to read. Verbose traces are basically just a debugging mechanism for me, not meant for public consumption. i can add back the instruction-'offset' to the non-verbose trace, that will make even the ext3 traces easily interpretable in the non-verbose format. > The 7 ms are spent in this loop: > > radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>) > radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>) > radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>) > radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>) > radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>) the trace shows thousands of pages getting submitted - each of the line above is at least one new page. The loop is not preemptible right now but that should be easy to bound. Note that your earlier traces showed the list sorting overhead for a _single page_. So it's a huge difference and a huge step forward. Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 2:26 ` Lee Revell 2005-03-31 2:39 ` Andrew Morton @ 2005-03-31 7:39 ` Ingo Molnar 2005-03-31 7:48 ` Ingo Molnar 1 sibling, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 7:39 UTC (permalink / raw) To: Lee Revell; +Cc: Trond Myklebust, Andrew Morton, linux-kernel * Lee Revell <rlrevell@joe-job.com> wrote: > > Yes. Together with the radix tree-based sorting of dirty requests, > > that's pretty much what I've spent most of today doing. Lee, could you > > see how the attached combined patch changes your latency numbers? > > Different code path, and the latency is worse. See the attached ~7ms > trace. could you check the -41-23 -RT kernel at the usual place: http://redhat.com/~mingo/realtime-preempt/ i've added Trond's radix lookup code, plus the lockbreaker patch. (one thing that is not covered yet is nfs_scan_list() - that still scans a list. Trond, is that fixable too?) Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 7:39 ` Ingo Molnar @ 2005-03-31 7:48 ` Ingo Molnar 2005-03-31 7:58 ` Ingo Molnar 0 siblings, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 7:48 UTC (permalink / raw) To: Lee Revell; +Cc: Trond Myklebust, Andrew Morton, linux-kernel [-- Attachment #1: Type: text/plain, Size: 582 bytes --] * Ingo Molnar <mingo@elte.hu> wrote: > could you check the -41-23 -RT kernel at the usual place: > > http://redhat.com/~mingo/realtime-preempt/ > > i've added Trond's radix lookup code, plus the lockbreaker patch. > > (one thing that is not covered yet is nfs_scan_list() - that still scans > a list. Trond, is that fixable too?) ok, find a new latency trace attached (1.6 msecs). I generated write loads, and the nfs_scan_list_dirty() latency is gone and indeed nfs_scan_list() generates the worst latency - while processing 8535 pages in one critical section. Ingo [-- Attachment #2: nfs-1682us-latency.bz2 --] [-- Type: application/x-bzip2, Size: 12565 bytes --] ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: NFS client latencies 2005-03-31 7:48 ` Ingo Molnar @ 2005-03-31 7:58 ` Ingo Molnar 0 siblings, 0 replies; 45+ messages in thread From: Ingo Molnar @ 2005-03-31 7:58 UTC (permalink / raw) To: Lee Revell; +Cc: Trond Myklebust, Andrew Morton, linux-kernel [-- Attachment #1: Type: text/plain, Size: 477 bytes --] * Ingo Molnar <mingo@elte.hu> wrote: > ok, find a new latency trace attached (1.6 msecs). I generated write > loads, and the nfs_scan_list_dirty() latency is gone and indeed > nfs_scan_list() generates the worst latency - while processing 8535 > pages in one critical section. here's a more complete one of 4.4 msecs. (there was a bug in the latency tracer that trimmed latency traces - this one is complete and shows an nfs_scan_list processing of 22908 pages) Ingo [-- Attachment #2: nfs-4481us-latency.bz2 --] [-- Type: application/x-bzip2, Size: 33823 bytes --] ^ permalink raw reply [flat|nested] 45+ messages in thread