* Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?
From: Peter Staubach @ 2008-06-09 15:03 UTC (permalink / raw)
To: Jeff Layton; +Cc: linux-nfs, nfsv4, nhorman, lhh
In-Reply-To: <20080609103137.2474aabd-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
Jeff Layton wrote:
> Apologies for the long email, but I ran into an interesting problem the
> other day and am looking for some feedback on my general approach to
> fixing it before I spend too much time on it:
>
> We (RH) have a cluster-suite product that some people use for making HA
> NFS services. When our QA folks test this, they often will start up
> some operations that do activity on an NFS mount from the cluster and
> then rapidly do failovers between cluster machines and make sure
> everything keeps moving along. The cluster is designed to not shut down
> nfsd's when a failover occurs. nfsd's are considered a "shared
> resource". It's possible that there could be multiple clustered
> services for NFS-sharing, so when a failover occurs, we just manipulate
> the exports table.
>
> The problem we've run into is that occasionally they fail over to the
> alternate machine and then back very rapidly. Because nfsd's are not
> shut down on failover, sockets are not closed. So what happens is
> something like this on TCP mounts:
>
> - client has NFS mount from clustered NFS service on one server
>
> - service fails over, new server doesn't know anything about the
> existing socket, so it sends a RST back to the client when data
> comes in. Client closes connection and reopens it and does some
> I/O on the socket.
>
> - service fails back to original server. The original socket there
> is still open, but now the TCP sequence numbers are off. When
> packets come into the server we end up with an ACK storm, and the
> client hangs for a long time.
>
> Neil Horman did a good writeup of this problem here for those that
> want the gory details:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=369991#c16
>
> I can think of 3 ways to fix this:
>
> 1) Add something like the recently added "unlock_ip" interface that
> was added for NLM. Maybe a "close_ip" that allows us to close all
> nfsd sockets connected to a given local IP address. So clustering
> software could do something like:
>
> # echo 10.20.30.40 > /proc/fs/nfsd/close_ip
>
> ...and make sure that all of the sockets are closed.
>
> 2) just use the same "unlock_ip" interface and just have it also
> close sockets in addition to dropping locks.
>
> 3) have an nfsd close all non-listening connections when it gets a
> certain signal (maybe SIGUSR1 or something). Connections on a
> sockets that aren't failing over should just get a RST and would
> reopen their connections.
>
> ...my preference would probably be approach #1.
>
> I've only really done some rudimentary perusing of the code, so there
> may be roadblocks with some of these approaches I haven't considered.
> Does anyone have thoughts on the general problem or idea for a solution?
>
> The situation is a bit specific to failover testing -- most people failing
> over don't do it so rapidly, but we'd still like to ensure that this
> problem doesn't occur if someone does do it.
>
> Thanks,
>
This doesn't sound like it would be an NFS specific situation.
Why doesn't TCP handle this, without causing an ACK storm?
Thanx...
ps
^ permalink raw reply
* rapid clustered nfs server failover and hung clients -- how best to close the sockets?
From: Jeff Layton @ 2008-06-09 14:31 UTC (permalink / raw)
To: linux-nfs, nfsv4; +Cc: nhorman, lhh
Apologies for the long email, but I ran into an interesting problem the
other day and am looking for some feedback on my general approach to
fixing it before I spend too much time on it:
We (RH) have a cluster-suite product that some people use for making HA
NFS services. When our QA folks test this, they often will start up
some operations that do activity on an NFS mount from the cluster and
then rapidly do failovers between cluster machines and make sure
everything keeps moving along. The cluster is designed to not shut down
nfsd's when a failover occurs. nfsd's are considered a "shared
resource". It's possible that there could be multiple clustered
services for NFS-sharing, so when a failover occurs, we just manipulate
the exports table.
The problem we've run into is that occasionally they fail over to the
alternate machine and then back very rapidly. Because nfsd's are not
shut down on failover, sockets are not closed. So what happens is
something like this on TCP mounts:
- client has NFS mount from clustered NFS service on one server
- service fails over, new server doesn't know anything about the
existing socket, so it sends a RST back to the client when data
comes in. Client closes connection and reopens it and does some
I/O on the socket.
- service fails back to original server. The original socket there
is still open, but now the TCP sequence numbers are off. When
packets come into the server we end up with an ACK storm, and the
client hangs for a long time.
Neil Horman did a good writeup of this problem here for those that
want the gory details:
https://bugzilla.redhat.com/show_bug.cgi?id=369991#c16
I can think of 3 ways to fix this:
1) Add something like the recently added "unlock_ip" interface that
was added for NLM. Maybe a "close_ip" that allows us to close all
nfsd sockets connected to a given local IP address. So clustering
software could do something like:
# echo 10.20.30.40 > /proc/fs/nfsd/close_ip
...and make sure that all of the sockets are closed.
2) just use the same "unlock_ip" interface and just have it also
close sockets in addition to dropping locks.
3) have an nfsd close all non-listening connections when it gets a
certain signal (maybe SIGUSR1 or something). Connections on a
sockets that aren't failing over should just get a RST and would
reopen their connections.
...my preference would probably be approach #1.
I've only really done some rudimentary perusing of the code, so there
may be roadblocks with some of these approaches I haven't considered.
Does anyone have thoughts on the general problem or idea for a solution?
The situation is a bit specific to failover testing -- most people failing
over don't do it so rapidly, but we'd still like to ensure that this
problem doesn't occur if someone does do it.
Thanks,
--
Jeff Layton <jlayton@redhat.com>
_______________________________________________
NFSv4 mailing list
NFSv4@linux-nfs.org
http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4
^ permalink raw reply
* RE: Problems with large number of clients and reads
From: Weathers, Norman R. @ 2008-06-09 14:19 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: linux-nfs
In-Reply-To: <20080606160922.GG30863@fieldses.org>
-----Original Message-----
From: J. Bruce Fields [mailto:bfields@fieldses.org]
Sent: Fri 6/6/2008 11:09 AM
To: Weathers, Norman R.
Cc: linux-nfs@vger.kernel.org
Subject: Re: Problems with large number of clients and reads
On Tue, Jun 03, 2008 at 01:50:01PM -0500, Norman Weathers wrote:
> Hello all,
>
> We are having some issues with some high throughput servers of ours.
>
> Here is the issue, we are using a vanilla 2.6.22.14 kernel on a node
> with 2 Dual Core Intels (3 GHz) and 16 GB of ram. The files that are
> being served are around 2 GB each, and there are usually 3 to 5 of them
> being read, so once read they fit into memory nicely, and when all is
> working correctly, we have a perfectly filled cache, with almost no disk
> activity.
>
> When we have large NFS activity (say, 600 to 1200 clients) connecting to
> the server(s), they can get into a state where they are using up all of
> memory, but they are dropping cache. slabtop is showing 13 GB of memory
> being used by the size-4096 slab object. We have two ethernet channels
> bonded, so we see in excess of 240 MB/s of data flowing out of the box,
> and all of the sudden, disk activity has risen to 185 MB/s. This
> happens if we are using 8 or more nfs threads. If we limit the threads
> to 6 or less, this doesn't happen. Of course, we are starving clients,
> but at least the jobs that my customers are throwing out there are
> progressing. The question becomes, what is causing the memory to be
> used up by the slab size-4096 object? Why when all of the sudden a
> bunch of clients ask for data does this object grow from 100 MB to 13
> GB? I have set the memory settings to something that I thought was
> reasonable.
>
> Here is some more of the particulars:
>
> sysctl.conf tcp memory settings:
>
> # NFS Tuning Parameters
> sunrpc.udp_slot_table_entries = 128
> sunrpc.tcp_slot_table_entries = 128
> vm.overcommit_ratio = 80
>
> net.core.rmem_max=524288
> net.core.rmem_default=262144
> net.core.wmem_max=524288
> net.core.wmem_default=262144
> net.ipv4.tcp_rmem = 8192 262144 524288
> net.ipv4.tcp_wmem = 8192 262144 524288
> net.ipv4.tcp_sack=0
> net.ipv4.tcp_timestamps=0
> vm.min_free_kbytes=50000
> vm.overcommit_memory=1
> net.ipv4.tcp_reordering=127
>
> # Enable tcp_low_latency
> net.ipv4.tcp_low_latency=1
>
> Here is a current reading from a slabtop of a system where this error is
> happening:
>
> 3007154 3007154 100% 4.00K 3007154 1 12028616K size-4096
>
> Note the size of the object cache, usually it is 50 - 100 MB (I have
> another box with 32 threads and the same settings which is bouncing
> between 50 and 128 MB right now).
>
> I have a lot of client boxes that need access to these servers, and
> would really benefit from having more threads, but if I increase the
> number of threads, it pushes everything out of cache, forcing re-reads,
> and really slows down our jobs.
>
> Any thoughts on this?
>I'd've thought that suggests a leak of memory allocated by kmalloc().
>Does the size-4096 cache decrease eventually, or does it stay that large
>until you reboot?
>--b.
I would agree that it "looks" like a memory leak. If I restart NFS, the size-4096 cache
goes from 12 GB to under 50 MB, but then depending upon how hard the box is utilized, it
starts to climb back up. I have seen it climb back up to 3 or 4 GB right after the
restart, but that is much better because the regular disk cache will grow from the 2 GB
that it was pressured into back to 5 or 8 GB, so all of the files have been reread into
memory and things are progressing smoothly. It is weird. I really think that this has
to do with a lot of connections happening at once, because I can run slabtop and see a
node that is running full out, but only have a couple hundred megs of the size-4096 slab
being used, and then turn around and see another node that is pushing out 245 MB/s and
all of the sudden using over 12 GB of the size-4096. It is very odd... If I lower the
number of threads from a usable 64 to a low of 3 threads, I have less of a chance of the
servers going haywire, to the point of being so loaded they may crash or you cannot
contact them over the network (fortunately, I have serial on these boxes so that I can
get on the nodes if they reach that point). If I run 8 threads, and with enough
clients, I can bring down one of these servers. size-4096 goes through the roof, and
depending on the hour of the day, the server can either crash or becomes unresponsive.
Norman
^ permalink raw reply
* RE: Problems with large number of clients and reads
From: Weathers, Norman R. @ 2008-06-09 13:56 UTC (permalink / raw)
To: chuck.lever; +Cc: Chuck Lever, linux-nfs
In-Reply-To: <48494D5E.9000501@oracle.com>
-----Original Message-----
From: Chuck Lever [mailto:chuck.lever@oracle.com]
Sent: Fri 6/6/2008 9:44 AM
To: Weathers, Norman R.
Cc: Chuck Lever; linux-nfs@vger.kernel.org
Subject: Re: Problems with large number of clients and reads
Norman Weathers wrote:
> On Wed, 2008-06-04 at 09:13 -0500, Norman Weathers wrote:
>> On Wed, 2008-06-04 at 09:49 -0400, Chuck Lever wrote:
>>> Hi Norman-
>>>
>>> On Tue, Jun 3, 2008 at 2:50 PM, Norman Weathers
>>> <norman.r.weathers-496aOtIFJR1B+Kdf37RAV9BPR1lH4CV8@public.gmane.org> wrote:
>>>> Hello all,
>>>>
>>>> We are having some issues with some high throughput servers of ours.
>>>>
>>>> Here is the issue, we are using a vanilla 2.6.22.14 kernel on a node
>>>> with 2 Dual Core Intels (3 GHz) and 16 GB of ram. The files that are
>>>> being served are around 2 GB each, and there are usually 3 to 5 of them
>>>> being read, so once read they fit into memory nicely, and when all is
>>>> working correctly, we have a perfectly filled cache, with almost no disk
>>>> activity.
>>>>
>>>> When we have large NFS activity (say, 600 to 1200 clients) connecting to
>>>> the server(s), they can get into a state where they are using up all of
>>>> memory, but they are dropping cache. slabtop is showing 13 GB of memory
>>>> being used by the size-4096 slab object. We have two ethernet channels
>>>> bonded, so we see in excess of 240 MB/s of data flowing out of the box,
>>>> and all of the sudden, disk activity has risen to 185 MB/s. This
>>>> happens if we are using 8 or more nfs threads. If we limit the threads
>>>> to 6 or less, this doesn't happen. Of course, we are starving clients,
>>>> but at least the jobs that my customers are throwing out there are
>>>> progressing. The question becomes, what is causing the memory to be
>>>> used up by the slab size-4096 object? Why when all of the sudden a
>>>> bunch of clients ask for data does this object grow from 100 MB to 13
>>>> GB? I have set the memory settings to something that I thought was
>>>> reasonable.
>>>>
>>>> Here is some more of the particulars:
>>>>
>>>> sysctl.conf tcp memory settings:
>>>>
>>>> # NFS Tuning Parameters
>>>> sunrpc.udp_slot_table_entries = 128
>>>> sunrpc.tcp_slot_table_entries = 128
>>> I don't have an answer to your size-4096 question, but I do want to
>>> note that setting the slot table entries sysctls has no effect on NFS
>>> servers. It's a client-only setting.
>>>
>>
>> Ok.
>>
>>> Have you tried this experiment on a server where there are no special
>>> memory tuning sysctls?
>> Unfortunately, no. I can try it today.
>>
>
>
> I tried the test with no special memory settings, and I still see the
> same issue. I also have noticed that even with only 3 threads running,
> I can still have times where 11 GB of memory is being used for buffer
> and not for disk cache. It just seems like memory is being used up if
> we have a lot of requests from a lot of clients at once...
>I'm at a loss... but I have another question or two. Is it just memory
>utilization issues that you see on the server, or are there noticeable
>performance problems that crop up when you see this?
We are seeing both, but the performance problem is odd in that we have 20 of these systems
and they slow down a lot whenever one of the other systems has this issue. It is like one
system really starts to load up on connections and requests, but at the same time it pushes
out network wise every last little bit of network packets that it can (2 1 Gb connections pushing
245 MB/s). What else is weird is that if I restart NFS during that time, it generally causes the
memory to settle down and allows connections to move on. (Data is basically striped across these
20 nodes). When a node has "the issue" happen, the other 19 servers slow down from 150 or 180 MB/s
to 50 MB/s or less.
>Did you mention what your physical file system is on the server? Are
>you running it on an LVM or software or hardware RAID?
The file system is XFS, it is on a hardware RAID (HP cciss), running RAID 5, 64 k stripe. I can
push from the file system itself on a linear read ~ 180 MB/s, and with a cached file, I can
easily push out the data.
^ permalink raw reply
* RE: Problems with large number of clients and reads
From: Weathers, Norman R. @ 2008-06-09 13:20 UTC (permalink / raw)
To: Dean Hildebrand; +Cc: linux-nfs
In-Reply-To: <48487F79.4000607@gmail.com>
(I dislike Outlook.... Apologize if I end up messing up the formatting
of the message)
The file system is XFS, about 250 GB per server. I would say that yes
it is managing the cache on the server(s) in question. The servers in
question have 16 GB of memory, and the files being served are 1.9 GB,
about 5 each per server.
-----Original Message-----
From: Dean Hildebrand [mailto:seattleplus@gmail.com]
Sent: Thursday, June 05, 2008 7:06 PM
To: Weathers, Norman R.
Cc: linux-nfs@vger.kernel.org
Subject: Re: Problems with large number of clients and reads
>What is the file system? It is the one managing the cache on the
server.
>Dean
Norman Weathers wrote:
> Hello all,
>
> We are having some issues with some high throughput servers of ours.
>
> Here is the issue, we are using a vanilla 2.6.22.14 kernel on a node
> with 2 Dual Core Intels (3 GHz) and 16 GB of ram. The files that are
> being served are around 2 GB each, and there are usually 3 to 5 of
them
> being read, so once read they fit into memory nicely, and when all is
> working correctly, we have a perfectly filled cache, with almost no
disk
> activity.
>
> When we have large NFS activity (say, 600 to 1200 clients) connecting
to
> the server(s), they can get into a state where they are using up all
of
> memory, but they are dropping cache. slabtop is showing 13 GB of
memory
> being used by the size-4096 slab object. We have two ethernet
channels
> bonded, so we see in excess of 240 MB/s of data flowing out of the
box,
> and all of the sudden, disk activity has risen to 185 MB/s. This
> happens if we are using 8 or more nfs threads. If we limit the
threads
> to 6 or less, this doesn't happen. Of course, we are starving
clients,
> but at least the jobs that my customers are throwing out there are
> progressing. The question becomes, what is causing the memory to be
> used up by the slab size-4096 object? Why when all of the sudden a
> bunch of clients ask for data does this object grow from 100 MB to 13
> GB? I have set the memory settings to something that I thought was
> reasonable.
>
> Here is some more of the particulars:
>
> sysctl.conf tcp memory settings:
>
> # NFS Tuning Parameters
> sunrpc.udp_slot_table_entries = 128
> sunrpc.tcp_slot_table_entries = 128
> vm.overcommit_ratio = 80
>
> net.core.rmem_max=524288
> net.core.rmem_default=262144
> net.core.wmem_max=524288
> net.core.wmem_default=262144
> net.ipv4.tcp_rmem = 8192 262144 524288
> net.ipv4.tcp_wmem = 8192 262144 524288
> net.ipv4.tcp_sack=0
> net.ipv4.tcp_timestamps=0
> vm.min_free_kbytes=50000
> vm.overcommit_memory=1
> net.ipv4.tcp_reordering=127
>
> # Enable tcp_low_latency
> net.ipv4.tcp_low_latency=1
>
> Here is a current reading from a slabtop of a system where this error
is
> happening:
>
> 3007154 3007154 100% 4.00K 3007154 1 12028616K size-4096
>
> Note the size of the object cache, usually it is 50 - 100 MB (I have
> another box with 32 threads and the same settings which is bouncing
> between 50 and 128 MB right now).
>
> I have a lot of client boxes that need access to these servers, and
> would really benefit from having more threads, but if I increase the
> number of threads, it pushes everything out of cache, forcing
re-reads,
> and really slows down our jobs.
>
> Any thoughts on this?
>
>
> Thanks,
>
> Norman Weathers
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs"
in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply
* Re: [PATCH 4/5] knfsd: convert knfsd to kthread API
From: Jeff Layton @ 2008-06-09 13:19 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: linux-nfs, nfsv4, gnb
In-Reply-To: <20080609130830.GA21769@fieldses.org>
On Mon, 9 Jun 2008 09:08:30 -0400
"J. Bruce Fields" <bfields@fieldses.org> wrote:
> On Fri, Jun 06, 2008 at 03:49:48PM -0400, Jeff Layton wrote:
> > On Fri, 6 Jun 2008 14:11:16 -0400
> > Jeff Layton <jlayton@redhat.com> wrote:
> >
> > > On Fri, 6 Jun 2008 13:24:31 -0400
> > > "J. Bruce Fields" <bfields@fieldses.org> wrote:
> > >
> > ...
> > > > How does the module refcounting work after this patch?
> > > >
> > > > --b.
> > > >
> > >
> > > I think I've goofed this part, actually. I was thinking that we didn't
> > > need to bump the refcount here, and that the kernel would realize that
> > > nfsd() hadn't returned and would prevent unloading until it had. This
> > > doesn't seem to be the case. I'll need to go back and add refcounting
> > > back in.
> > >
> >
> > Here's a respun patch that adds back in the module refcounts and also
> > removes the unneeded "err = 0;" at the bottom of the loop. Thoughts?
>
> Looks good to me. I'll apply all 5 (with this version of #4) if noone
> catches something else.
>
> --b.
>
Sounds good. My only concern here is whether moving the __module_get
from the RPC layer to nfsd() itself is OK. I *think* it is since the
nfsctl and /proc/fs/nfsd routines are all part of the nfsd module, so
we're guaranteed to have a reference there anyway, but if there are
potential races then we may want to go back to the old way.
I'd appreciate an ack/nack on this from someone who understands module
refcounts better than me.
--
Jeff Layton <jlayton@redhat.com>
^ permalink raw reply
* Re: [PATCH 4/5] knfsd: convert knfsd to kthread API
From: J. Bruce Fields @ 2008-06-09 13:08 UTC (permalink / raw)
To: Jeff Layton; +Cc: linux-nfs, nfsv4, neilb, gnb-cP1dWloDopni96+mSzHFpQC/G2K4zDHf
In-Reply-To: <20080606154948.303aba28-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
On Fri, Jun 06, 2008 at 03:49:48PM -0400, Jeff Layton wrote:
> On Fri, 6 Jun 2008 14:11:16 -0400
> Jeff Layton <jlayton@redhat.com> wrote:
>
> > On Fri, 6 Jun 2008 13:24:31 -0400
> > "J. Bruce Fields" <bfields@fieldses.org> wrote:
> >
> ...
> > > How does the module refcounting work after this patch?
> > >
> > > --b.
> > >
> >
> > I think I've goofed this part, actually. I was thinking that we didn't
> > need to bump the refcount here, and that the kernel would realize that
> > nfsd() hadn't returned and would prevent unloading until it had. This
> > doesn't seem to be the case. I'll need to go back and add refcounting
> > back in.
> >
>
> Here's a respun patch that adds back in the module refcounts and also
> removes the unneeded "err = 0;" at the bottom of the loop. Thoughts?
Looks good to me. I'll apply all 5 (with this version of #4) if noone
catches something else.
--b.
>
> -----------[snip]--------------
>
> From 794f3137ec5a5a8720b091f00b22807dab8f1be2 Mon Sep 17 00:00:00 2001
> From: Jeff Layton <jlayton@redhat.com>
> Date: Fri, 6 Jun 2008 14:44:23 -0400
> Subject: [PATCH 4/5] knfsd: convert knfsd to kthread API
>
> This patch is rather large, but I couldn't figure out a way to break it
> up that would remain bisectable. It does several things:
>
> - change svc_thread_fn typedef to better match what kthread_create expects
> - change svc_pool_map_set_cpumask to be more kthread friendly. Make it
> take a task arg and and get rid of the "oldmask"
> - have svc_set_num_threads call kthread_create directly
> - eliminate __svc_create_thread
>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
> fs/nfsd/nfssvc.c | 49 +++++++++++++--------
> include/linux/sunrpc/svc.h | 2 +-
> net/sunrpc/svc.c | 102 +++++++++++++++-----------------------------
> 3 files changed, 66 insertions(+), 87 deletions(-)
>
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index 6339cb7..7fdfa23 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -21,6 +21,7 @@
> #include <linux/smp_lock.h>
> #include <linux/freezer.h>
> #include <linux/fs_struct.h>
> +#include <linux/kthread.h>
>
> #include <linux/sunrpc/types.h>
> #include <linux/sunrpc/stats.h>
> @@ -46,7 +47,7 @@
> #define SHUTDOWN_SIGS (sigmask(SIGKILL) | sigmask(SIGHUP) | sigmask(SIGINT) | sigmask(SIGQUIT))
>
> extern struct svc_program nfsd_program;
> -static void nfsd(struct svc_rqst *rqstp);
> +static int nfsd(void *vrqstp);
> struct timeval nfssvc_boot;
> static atomic_t nfsd_busy;
> static unsigned long nfsd_last_call;
> @@ -407,18 +408,20 @@ update_thread_usage(int busy_threads)
> /*
> * This is the NFS server kernel thread
> */
> -static void
> -nfsd(struct svc_rqst *rqstp)
> +static int
> +nfsd(void *vrqstp)
> {
> + struct svc_rqst *rqstp = (struct svc_rqst *) vrqstp;
> struct fs_struct *fsp;
> - int err;
> sigset_t shutdown_mask, allowed_mask;
> + int err, preverr = 0;
> + unsigned int signo;
>
> /* Lock module and set up kernel thread */
> + __module_get(THIS_MODULE);
> mutex_lock(&nfsd_mutex);
> - daemonize("nfsd");
>
> - /* After daemonize() this kernel thread shares current->fs
> + /* At this point, the thread shares current->fs
> * with the init process. We need to create files with a
> * umask of 0 instead of init's umask. */
> fsp = copy_fs_struct(current->fs);
> @@ -433,14 +436,18 @@ nfsd(struct svc_rqst *rqstp)
> siginitsetinv(&shutdown_mask, SHUTDOWN_SIGS);
> siginitsetinv(&allowed_mask, ALLOWED_SIGS);
>
> + /*
> + * thread is spawned with all signals set to SIG_IGN, re-enable
> + * the ones that matter
> + */
> + for (signo = 1; signo <= _NSIG; signo++) {
> + if (!sigismember(&shutdown_mask, signo))
> + allow_signal(signo);
> + }
>
> nfsdstats.th_cnt++;
> -
> - rqstp->rq_task = current;
> -
> mutex_unlock(&nfsd_mutex);
>
> -
> /*
> * We want less throttling in balance_dirty_pages() so that nfs to
> * localhost doesn't cause nfsd to lock up due to all the client's
> @@ -462,15 +469,25 @@ nfsd(struct svc_rqst *rqstp)
> */
> while ((err = svc_recv(rqstp, 60*60*HZ)) == -EAGAIN)
> ;
> - if (err < 0)
> + if (err == -EINTR)
> break;
> + else if (err < 0) {
> + if (err != preverr) {
> + printk(KERN_WARNING "%s: unexpected error "
> + "from svc_recv (%d)\n", __func__, -err);
> + preverr = err;
> + }
> + schedule_timeout_uninterruptible(HZ);
> + continue;
> + }
> +
> update_thread_usage(atomic_read(&nfsd_busy));
> atomic_inc(&nfsd_busy);
>
> /* Lock the export hash tables for reading. */
> exp_readlock();
>
> - /* Process request with signals blocked. */
> + /* Process request with signals blocked. */
> sigprocmask(SIG_SETMASK, &allowed_mask, NULL);
>
> svc_process(rqstp);
> @@ -481,23 +498,19 @@ nfsd(struct svc_rqst *rqstp)
> atomic_dec(&nfsd_busy);
> }
>
> - if (err != -EINTR)
> - printk(KERN_WARNING "nfsd: terminating on error %d\n", -err);
> -
> /* Clear signals before calling svc_exit_thread() */
> flush_signals(current);
>
> mutex_lock(&nfsd_mutex);
> -
> nfsdstats.th_cnt --;
>
> out:
> /* Release the thread */
> svc_exit_thread(rqstp);
> -
> - /* Release module */
> mutex_unlock(&nfsd_mutex);
> +
> module_put_and_exit(0);
> + return 0;
> }
>
> static __be32 map_new_errors(u32 vers, __be32 nfserr)
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 4b54c5f..011d6d8 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -22,7 +22,7 @@
> /*
> * This is the RPC server thread function prototype
> */
> -typedef void (*svc_thread_fn)(struct svc_rqst *);
> +typedef int (*svc_thread_fn)(void *);
>
> /*
> *
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index 7bffaff..f461da2 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -18,6 +18,7 @@
> #include <linux/mm.h>
> #include <linux/interrupt.h>
> #include <linux/module.h>
> +#include <linux/kthread.h>
>
> #include <linux/sunrpc/types.h>
> #include <linux/sunrpc/xdr.h>
> @@ -291,15 +292,14 @@ svc_pool_map_put(void)
>
>
> /*
> - * Set the current thread's cpus_allowed mask so that it
> + * Set the given thread's cpus_allowed mask so that it
> * will only run on cpus in the given pool.
> - *
> - * Returns 1 and fills in oldmask iff a cpumask was applied.
> */
> -static inline int
> -svc_pool_map_set_cpumask(unsigned int pidx, cpumask_t *oldmask)
> +static inline void
> +svc_pool_map_set_cpumask(struct task_struct *task, unsigned int pidx)
> {
> struct svc_pool_map *m = &svc_pool_map;
> + unsigned int node = m->pool_to[pidx];
>
> /*
> * The caller checks for sv_nrpools > 1, which
> @@ -307,26 +307,17 @@ svc_pool_map_set_cpumask(unsigned int pidx, cpumask_t *oldmask)
> */
> BUG_ON(m->count == 0);
>
> - switch (m->mode)
> - {
> - default:
> - return 0;
> + switch (m->mode) {
> case SVC_POOL_PERCPU:
> {
> - unsigned int cpu = m->pool_to[pidx];
> -
> - *oldmask = current->cpus_allowed;
> - set_cpus_allowed_ptr(current, &cpumask_of_cpu(cpu));
> - return 1;
> + set_cpus_allowed_ptr(task, &cpumask_of_cpu(node));
> + break;
> }
> case SVC_POOL_PERNODE:
> {
> - unsigned int node = m->pool_to[pidx];
> node_to_cpumask_ptr(nodecpumask, node);
> -
> - *oldmask = current->cpus_allowed;
> - set_cpus_allowed_ptr(current, nodecpumask);
> - return 1;
> + set_cpus_allowed_ptr(task, nodecpumask);
> + break;
> }
> }
> }
> @@ -579,47 +570,6 @@ out_enomem:
> EXPORT_SYMBOL(svc_prepare_thread);
>
> /*
> - * Create a thread in the given pool. Caller must hold BKL or another lock to
> - * serialize access to the svc_serv struct. On a NUMA or SMP machine, with a
> - * multi-pool serv, the thread will be restricted to run on the cpus belonging
> - * to the pool.
> - */
> -static int
> -__svc_create_thread(svc_thread_fn func, struct svc_serv *serv,
> - struct svc_pool *pool)
> -{
> - struct svc_rqst *rqstp;
> - int error = -ENOMEM;
> - int have_oldmask = 0;
> - cpumask_t uninitialized_var(oldmask);
> -
> - rqstp = svc_prepare_thread(serv, pool);
> - if (IS_ERR(rqstp)) {
> - error = PTR_ERR(rqstp);
> - goto out;
> - }
> -
> - if (serv->sv_nrpools > 1)
> - have_oldmask = svc_pool_map_set_cpumask(pool->sp_id, &oldmask);
> -
> - error = kernel_thread((int (*)(void *)) func, rqstp, 0);
> -
> - if (have_oldmask)
> - set_cpus_allowed(current, oldmask);
> -
> - if (error < 0)
> - goto out_thread;
> - svc_sock_update_bufs(serv);
> - error = 0;
> -out:
> - return error;
> -
> -out_thread:
> - svc_exit_thread(rqstp);
> - goto out;
> -}
> -
> -/*
> * Choose a pool in which to create a new thread, for svc_set_num_threads
> */
> static inline struct svc_pool *
> @@ -688,7 +638,9 @@ found_pool:
> int
> svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
> {
> - struct task_struct *victim;
> + struct svc_rqst *rqstp;
> + struct task_struct *task;
> + struct svc_pool *chosen_pool;
> int error = 0;
> unsigned int state = serv->sv_nrthreads-1;
>
> @@ -704,18 +656,32 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
> /* create new threads */
> while (nrservs > 0) {
> nrservs--;
> - __module_get(serv->sv_module);
> - error = __svc_create_thread(serv->sv_function, serv,
> - choose_pool(serv, pool, &state));
> - if (error < 0) {
> - module_put(serv->sv_module);
> + chosen_pool = choose_pool(serv, pool, &state);
> +
> + rqstp = svc_prepare_thread(serv, chosen_pool);
> + if (IS_ERR(rqstp)) {
> + error = PTR_ERR(rqstp);
> break;
> }
> +
> + task = kthread_create(serv->sv_function, rqstp, serv->sv_name);
> + if (IS_ERR(task)) {
> + error = PTR_ERR(task);
> + svc_exit_thread(rqstp);
> + break;
> + }
> +
> + rqstp->rq_task = task;
> + if (serv->sv_nrpools > 1)
> + svc_pool_map_set_cpumask(task, chosen_pool->sp_id);
> +
> + svc_sock_update_bufs(serv);
> + wake_up_process(task);
> }
> /* destroy old threads */
> while (nrservs < 0 &&
> - (victim = choose_victim(serv, pool, &state)) != NULL) {
> - send_sig(serv->sv_kill_signal, victim, 1);
> + (task = choose_victim(serv, pool, &state)) != NULL) {
> + send_sig(serv->sv_kill_signal, task, 1);
> nrservs++;
> }
>
> --
> 1.5.3.6
>
^ permalink raw reply
* nfsd: accept failed (err 22)
From: Marcelo Leal @ 2008-06-09 0:07 UTC (permalink / raw)
To: linux-nfs
Ok, Ok, it's an old GNU/Linux kernel(2.4.20), but it was working
pretty well for years! Just strong as a rock.
Now, after seeing the message:
nfsd: accept failed (err 22)!
in the logs, some clients start to behave very bad (slow)... most
some rhel tcp clients!
I think that can be a "tcp" problem, maybe some problem in the network.
Looking the code:
...
if (tcpflag && FD_ISSET(tcpsock, ready)) {
len = sizeof(inetpeer);
if ((msgsock = accept(tcpsock,
(struct sockaddr *)&inetpeer, &len)) < 0) {
syslog(LOG_ERR, "accept failed: %m");
return (1);
...
Seems like after restart the server, everything goes fine.
I just want to ask you the following:
1) Am i right, that is a problem with the "tcp" connections?
2) it's a bug in the NFS server (probably solved in newer versions),
or it's a network problem?
3) It's a network problem?
4) may be a client (rhel) problem?
ps.: Many clients stil working pretty well, just some clients show
the slow behavior.
Thanks a lot for your time!
Leal.
--
pOSix rules
^ permalink raw reply
* Re: [PATCH 4/5] knfsd: convert knfsd to kthread API
From: Jeff Layton @ 2008-06-06 19:49 UTC (permalink / raw)
To: Jeff Layton; +Cc: linux-nfs, nfsv4, gnb
In-Reply-To: <20080606141116.07b79b14@tleilax.poochiereds.net>
On Fri, 6 Jun 2008 14:11:16 -0400
Jeff Layton <jlayton@redhat.com> wrote:
> On Fri, 6 Jun 2008 13:24:31 -0400
> "J. Bruce Fields" <bfields@fieldses.org> wrote:
>
...
> > How does the module refcounting work after this patch?
> >
> > --b.
> >
>
> I think I've goofed this part, actually. I was thinking that we didn't
> need to bump the refcount here, and that the kernel would realize that
> nfsd() hadn't returned and would prevent unloading until it had. This
> doesn't seem to be the case. I'll need to go back and add refcounting
> back in.
>
Here's a respun patch that adds back in the module refcounts and also
removes the unneeded "err = 0;" at the bottom of the loop. Thoughts?
-----------[snip]--------------
>From 794f3137ec5a5a8720b091f00b22807dab8f1be2 Mon Sep 17 00:00:00 2001
From: Jeff Layton <jlayton@redhat.com>
Date: Fri, 6 Jun 2008 14:44:23 -0400
Subject: [PATCH 4/5] knfsd: convert knfsd to kthread API
This patch is rather large, but I couldn't figure out a way to break it
up that would remain bisectable. It does several things:
- change svc_thread_fn typedef to better match what kthread_create expects
- change svc_pool_map_set_cpumask to be more kthread friendly. Make it
take a task arg and and get rid of the "oldmask"
- have svc_set_num_threads call kthread_create directly
- eliminate __svc_create_thread
Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
fs/nfsd/nfssvc.c | 49 +++++++++++++--------
include/linux/sunrpc/svc.h | 2 +-
net/sunrpc/svc.c | 102 +++++++++++++++-----------------------------
3 files changed, 66 insertions(+), 87 deletions(-)
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 6339cb7..7fdfa23 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -21,6 +21,7 @@
#include <linux/smp_lock.h>
#include <linux/freezer.h>
#include <linux/fs_struct.h>
+#include <linux/kthread.h>
#include <linux/sunrpc/types.h>
#include <linux/sunrpc/stats.h>
@@ -46,7 +47,7 @@
#define SHUTDOWN_SIGS (sigmask(SIGKILL) | sigmask(SIGHUP) | sigmask(SIGINT) | sigmask(SIGQUIT))
extern struct svc_program nfsd_program;
-static void nfsd(struct svc_rqst *rqstp);
+static int nfsd(void *vrqstp);
struct timeval nfssvc_boot;
static atomic_t nfsd_busy;
static unsigned long nfsd_last_call;
@@ -407,18 +408,20 @@ update_thread_usage(int busy_threads)
/*
* This is the NFS server kernel thread
*/
-static void
-nfsd(struct svc_rqst *rqstp)
+static int
+nfsd(void *vrqstp)
{
+ struct svc_rqst *rqstp = (struct svc_rqst *) vrqstp;
struct fs_struct *fsp;
- int err;
sigset_t shutdown_mask, allowed_mask;
+ int err, preverr = 0;
+ unsigned int signo;
/* Lock module and set up kernel thread */
+ __module_get(THIS_MODULE);
mutex_lock(&nfsd_mutex);
- daemonize("nfsd");
- /* After daemonize() this kernel thread shares current->fs
+ /* At this point, the thread shares current->fs
* with the init process. We need to create files with a
* umask of 0 instead of init's umask. */
fsp = copy_fs_struct(current->fs);
@@ -433,14 +436,18 @@ nfsd(struct svc_rqst *rqstp)
siginitsetinv(&shutdown_mask, SHUTDOWN_SIGS);
siginitsetinv(&allowed_mask, ALLOWED_SIGS);
+ /*
+ * thread is spawned with all signals set to SIG_IGN, re-enable
+ * the ones that matter
+ */
+ for (signo = 1; signo <= _NSIG; signo++) {
+ if (!sigismember(&shutdown_mask, signo))
+ allow_signal(signo);
+ }
nfsdstats.th_cnt++;
-
- rqstp->rq_task = current;
-
mutex_unlock(&nfsd_mutex);
-
/*
* We want less throttling in balance_dirty_pages() so that nfs to
* localhost doesn't cause nfsd to lock up due to all the client's
@@ -462,15 +469,25 @@ nfsd(struct svc_rqst *rqstp)
*/
while ((err = svc_recv(rqstp, 60*60*HZ)) == -EAGAIN)
;
- if (err < 0)
+ if (err == -EINTR)
break;
+ else if (err < 0) {
+ if (err != preverr) {
+ printk(KERN_WARNING "%s: unexpected error "
+ "from svc_recv (%d)\n", __func__, -err);
+ preverr = err;
+ }
+ schedule_timeout_uninterruptible(HZ);
+ continue;
+ }
+
update_thread_usage(atomic_read(&nfsd_busy));
atomic_inc(&nfsd_busy);
/* Lock the export hash tables for reading. */
exp_readlock();
- /* Process request with signals blocked. */
+ /* Process request with signals blocked. */
sigprocmask(SIG_SETMASK, &allowed_mask, NULL);
svc_process(rqstp);
@@ -481,23 +498,19 @@ nfsd(struct svc_rqst *rqstp)
atomic_dec(&nfsd_busy);
}
- if (err != -EINTR)
- printk(KERN_WARNING "nfsd: terminating on error %d\n", -err);
-
/* Clear signals before calling svc_exit_thread() */
flush_signals(current);
mutex_lock(&nfsd_mutex);
-
nfsdstats.th_cnt --;
out:
/* Release the thread */
svc_exit_thread(rqstp);
-
- /* Release module */
mutex_unlock(&nfsd_mutex);
+
module_put_and_exit(0);
+ return 0;
}
static __be32 map_new_errors(u32 vers, __be32 nfserr)
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 4b54c5f..011d6d8 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -22,7 +22,7 @@
/*
* This is the RPC server thread function prototype
*/
-typedef void (*svc_thread_fn)(struct svc_rqst *);
+typedef int (*svc_thread_fn)(void *);
/*
*
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 7bffaff..f461da2 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -18,6 +18,7 @@
#include <linux/mm.h>
#include <linux/interrupt.h>
#include <linux/module.h>
+#include <linux/kthread.h>
#include <linux/sunrpc/types.h>
#include <linux/sunrpc/xdr.h>
@@ -291,15 +292,14 @@ svc_pool_map_put(void)
/*
- * Set the current thread's cpus_allowed mask so that it
+ * Set the given thread's cpus_allowed mask so that it
* will only run on cpus in the given pool.
- *
- * Returns 1 and fills in oldmask iff a cpumask was applied.
*/
-static inline int
-svc_pool_map_set_cpumask(unsigned int pidx, cpumask_t *oldmask)
+static inline void
+svc_pool_map_set_cpumask(struct task_struct *task, unsigned int pidx)
{
struct svc_pool_map *m = &svc_pool_map;
+ unsigned int node = m->pool_to[pidx];
/*
* The caller checks for sv_nrpools > 1, which
@@ -307,26 +307,17 @@ svc_pool_map_set_cpumask(unsigned int pidx, cpumask_t *oldmask)
*/
BUG_ON(m->count == 0);
- switch (m->mode)
- {
- default:
- return 0;
+ switch (m->mode) {
case SVC_POOL_PERCPU:
{
- unsigned int cpu = m->pool_to[pidx];
-
- *oldmask = current->cpus_allowed;
- set_cpus_allowed_ptr(current, &cpumask_of_cpu(cpu));
- return 1;
+ set_cpus_allowed_ptr(task, &cpumask_of_cpu(node));
+ break;
}
case SVC_POOL_PERNODE:
{
- unsigned int node = m->pool_to[pidx];
node_to_cpumask_ptr(nodecpumask, node);
-
- *oldmask = current->cpus_allowed;
- set_cpus_allowed_ptr(current, nodecpumask);
- return 1;
+ set_cpus_allowed_ptr(task, nodecpumask);
+ break;
}
}
}
@@ -579,47 +570,6 @@ out_enomem:
EXPORT_SYMBOL(svc_prepare_thread);
/*
- * Create a thread in the given pool. Caller must hold BKL or another lock to
- * serialize access to the svc_serv struct. On a NUMA or SMP machine, with a
- * multi-pool serv, the thread will be restricted to run on the cpus belonging
- * to the pool.
- */
-static int
-__svc_create_thread(svc_thread_fn func, struct svc_serv *serv,
- struct svc_pool *pool)
-{
- struct svc_rqst *rqstp;
- int error = -ENOMEM;
- int have_oldmask = 0;
- cpumask_t uninitialized_var(oldmask);
-
- rqstp = svc_prepare_thread(serv, pool);
- if (IS_ERR(rqstp)) {
- error = PTR_ERR(rqstp);
- goto out;
- }
-
- if (serv->sv_nrpools > 1)
- have_oldmask = svc_pool_map_set_cpumask(pool->sp_id, &oldmask);
-
- error = kernel_thread((int (*)(void *)) func, rqstp, 0);
-
- if (have_oldmask)
- set_cpus_allowed(current, oldmask);
-
- if (error < 0)
- goto out_thread;
- svc_sock_update_bufs(serv);
- error = 0;
-out:
- return error;
-
-out_thread:
- svc_exit_thread(rqstp);
- goto out;
-}
-
-/*
* Choose a pool in which to create a new thread, for svc_set_num_threads
*/
static inline struct svc_pool *
@@ -688,7 +638,9 @@ found_pool:
int
svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
{
- struct task_struct *victim;
+ struct svc_rqst *rqstp;
+ struct task_struct *task;
+ struct svc_pool *chosen_pool;
int error = 0;
unsigned int state = serv->sv_nrthreads-1;
@@ -704,18 +656,32 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
/* create new threads */
while (nrservs > 0) {
nrservs--;
- __module_get(serv->sv_module);
- error = __svc_create_thread(serv->sv_function, serv,
- choose_pool(serv, pool, &state));
- if (error < 0) {
- module_put(serv->sv_module);
+ chosen_pool = choose_pool(serv, pool, &state);
+
+ rqstp = svc_prepare_thread(serv, chosen_pool);
+ if (IS_ERR(rqstp)) {
+ error = PTR_ERR(rqstp);
break;
}
+
+ task = kthread_create(serv->sv_function, rqstp, serv->sv_name);
+ if (IS_ERR(task)) {
+ error = PTR_ERR(task);
+ svc_exit_thread(rqstp);
+ break;
+ }
+
+ rqstp->rq_task = task;
+ if (serv->sv_nrpools > 1)
+ svc_pool_map_set_cpumask(task, chosen_pool->sp_id);
+
+ svc_sock_update_bufs(serv);
+ wake_up_process(task);
}
/* destroy old threads */
while (nrservs < 0 &&
- (victim = choose_victim(serv, pool, &state)) != NULL) {
- send_sig(serv->sv_kill_signal, victim, 1);
+ (task = choose_victim(serv, pool, &state)) != NULL) {
+ send_sig(serv->sv_kill_signal, task, 1);
nrservs++;
}
--
1.5.3.6
^ permalink raw reply related
* Re: [PATCH 4/5] knfsd: convert knfsd to kthread API
From: J. Bruce Fields @ 2008-06-06 19:13 UTC (permalink / raw)
To: Jeff Layton; +Cc: linux-nfs, nfsv4, neilb, gnb-cP1dWloDopni96+mSzHFpQC/G2K4zDHf
In-Reply-To: <20080606150537.14c9537c-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
On Fri, Jun 06, 2008 at 03:05:37PM -0400, Jeff Layton wrote:
> On Fri, 6 Jun 2008 14:16:12 -0400
> "J. Bruce Fields" <bfields@fieldses.org> wrote:
>
> > On Fri, Jun 06, 2008 at 02:11:16PM -0400, Jeff Layton wrote:
> > > I think I've goofed this part, actually. I was thinking that we didn't
> > > need to bump the refcount here, and that the kernel would realize that
> > > nfsd() hadn't returned and would prevent unloading until it had. This
> > > doesn't seem to be the case. I'll need to go back and add refcounting
> > > back in.
> >
> > OK. If you decide it is needed here, could you double-check the lockd
> > conversion as well? Looks like some refcounting logic might have gotten
> > lost there too.
> >
> > --b.
>
> Full disclosure:
>
> I don't completely understand module refcounts and when we need to take
> a reference. So feel free to set me straight if my comments below are
> wrong :-)
>
> The change to lockd was deliberate and was suggested by Neil Brown, when
> I was working on an earlier version of the lockd-kthread patch:
>
> --------------[snip]------------------
>
> > - module_put_and_exit(0);
> > + module_put(THIS_MODULE);
> > + return 0;
>
> This changes bothers me. Putting the last ref to a module in code
> inside that module is not safe, which is why module_put_and_exit
> exists.
>
> So this module_put is either unsafe or not needed. I think the
> latter.
>
> As you say in the comment, lockd_down now blocks until lockd actually
> exits. As every caller for lockd_down will own a reference to the
> lockd module, the lockd thread no longer needs to own a reference too.
> So I think it is safe to remove the module_put, and also remove the
> __module_get at the top of the lockd function.
>
> --------------[snip]------------------
>
> So I followed his advice and everything seems to be OK. I don't see a way
> to yank out the lockd module while lockd is actually up, since the
> callers of lockd_up() have to have a reference to the lockd module, and
> if those modules go away, then lockd should be down anyway.
Yes, thanks for the reminder--that makes sense.
> This is what led me to think that we didn't need this for nfsd either,
> but that seems to be incorrect. I think nfsd is different because it's
> started directly from userspace. We don't have any persistent module
> references so we need to take them explicitly.
Right.
--b.
^ permalink raw reply
* Re: [PATCH 4/5] knfsd: convert knfsd to kthread API
From: Jeff Layton @ 2008-06-06 19:05 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: linux-nfs, nfsv4, gnb
In-Reply-To: <20080606181612.GB761@fieldses.org>
On Fri, 6 Jun 2008 14:16:12 -0400
"J. Bruce Fields" <bfields@fieldses.org> wrote:
> On Fri, Jun 06, 2008 at 02:11:16PM -0400, Jeff Layton wrote:
> > I think I've goofed this part, actually. I was thinking that we didn't
> > need to bump the refcount here, and that the kernel would realize that
> > nfsd() hadn't returned and would prevent unloading until it had. This
> > doesn't seem to be the case. I'll need to go back and add refcounting
> > back in.
>
> OK. If you decide it is needed here, could you double-check the lockd
> conversion as well? Looks like some refcounting logic might have gotten
> lost there too.
>
> --b.
Full disclosure:
I don't completely understand module refcounts and when we need to take
a reference. So feel free to set me straight if my comments below are
wrong :-)
The change to lockd was deliberate and was suggested by Neil Brown, when
I was working on an earlier version of the lockd-kthread patch:
--------------[snip]------------------
> - module_put_and_exit(0);
> + module_put(THIS_MODULE);
> + return 0;
This changes bothers me. Putting the last ref to a module in code
inside that module is not safe, which is why module_put_and_exit
exists.
So this module_put is either unsafe or not needed. I think the
latter.
As you say in the comment, lockd_down now blocks until lockd actually
exits. As every caller for lockd_down will own a reference to the
lockd module, the lockd thread no longer needs to own a reference too.
So I think it is safe to remove the module_put, and also remove the
__module_get at the top of the lockd function.
--------------[snip]------------------
So I followed his advice and everything seems to be OK. I don't see a way
to yank out the lockd module while lockd is actually up, since the
callers of lockd_up() have to have a reference to the lockd module, and
if those modules go away, then lockd should be down anyway.
This is what led me to think that we didn't need this for nfsd either,
but that seems to be incorrect. I think nfsd is different because it's
started directly from userspace. We don't have any persistent module
references so we need to take them explicitly.
--
Jeff Layton <jlayton@redhat.com>
^ permalink raw reply
* Re: [PATCH 4/5] knfsd: convert knfsd to kthread API
From: J. Bruce Fields @ 2008-06-06 18:16 UTC (permalink / raw)
To: Jeff Layton; +Cc: linux-nfs, nfsv4, gnb
In-Reply-To: <20080606141116.07b79b14@tleilax.poochiereds.net>
On Fri, Jun 06, 2008 at 02:11:16PM -0400, Jeff Layton wrote:
> I think I've goofed this part, actually. I was thinking that we didn't
> need to bump the refcount here, and that the kernel would realize that
> nfsd() hadn't returned and would prevent unloading until it had. This
> doesn't seem to be the case. I'll need to go back and add refcounting
> back in.
OK. If you decide it is needed here, could you double-check the lockd
conversion as well? Looks like some refcounting logic might have gotten
lost there too.
--b.
^ permalink raw reply
* Re: [PATCH 4/5] knfsd: convert knfsd to kthread API
From: Jeff Layton @ 2008-06-06 18:11 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: linux-nfs, nfsv4, gnb
In-Reply-To: <20080606172431.GA761@fieldses.org>
On Fri, 6 Jun 2008 13:24:31 -0400
"J. Bruce Fields" <bfields@fieldses.org> wrote:
> On Fri, Jun 06, 2008 at 11:12:50AM -0400, Jeff Layton wrote:
> > This patch is rather large, but I couldn't figure out a way to break it
> > up that would remain bisectable. It does several things:
> >
> > - change svc_thread_fn typedef to better match what kthread_create expects
> > - change svc_pool_map_set_cpumask to be more kthread friendly. Make it
> > take a task arg and and get rid of the "oldmask"
> > - have svc_set_num_threads call kthread_create directly
> > - eliminate __svc_create_thread
> >
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
> > fs/nfsd/nfssvc.c | 52 ++++++++++++++--------
> > include/linux/sunrpc/svc.h | 2 +-
> > net/sunrpc/svc.c | 102 +++++++++++++++-----------------------------
> > 3 files changed, 68 insertions(+), 88 deletions(-)
> >
> > diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> > index 6339cb7..fe62ec8 100644
> > --- a/fs/nfsd/nfssvc.c
> > +++ b/fs/nfsd/nfssvc.c
> > @@ -21,6 +21,7 @@
> > #include <linux/smp_lock.h>
> > #include <linux/freezer.h>
> > #include <linux/fs_struct.h>
> > +#include <linux/kthread.h>
> >
> > #include <linux/sunrpc/types.h>
> > #include <linux/sunrpc/stats.h>
> > @@ -46,7 +47,7 @@
> > #define SHUTDOWN_SIGS (sigmask(SIGKILL) | sigmask(SIGHUP) | sigmask(SIGINT) | sigmask(SIGQUIT))
> >
> > extern struct svc_program nfsd_program;
> > -static void nfsd(struct svc_rqst *rqstp);
> > +static int nfsd(void *vrqstp);
> > struct timeval nfssvc_boot;
> > static atomic_t nfsd_busy;
> > static unsigned long nfsd_last_call;
> > @@ -407,18 +408,19 @@ update_thread_usage(int busy_threads)
> > /*
> > * This is the NFS server kernel thread
> > */
> > -static void
> > -nfsd(struct svc_rqst *rqstp)
> > +static int
> > +nfsd(void *vrqstp)
> > {
> > + struct svc_rqst *rqstp = (struct svc_rqst *) vrqstp;
> > struct fs_struct *fsp;
> > - int err;
> > sigset_t shutdown_mask, allowed_mask;
> > + int err, preverr = 0;
> > + unsigned int signo;
> >
> > /* Lock module and set up kernel thread */
> > mutex_lock(&nfsd_mutex);
> > - daemonize("nfsd");
> >
> > - /* After daemonize() this kernel thread shares current->fs
> > + /* At this point, the thread shares current->fs
> > * with the init process. We need to create files with a
> > * umask of 0 instead of init's umask. */
> > fsp = copy_fs_struct(current->fs);
> > @@ -433,14 +435,18 @@ nfsd(struct svc_rqst *rqstp)
> > siginitsetinv(&shutdown_mask, SHUTDOWN_SIGS);
> > siginitsetinv(&allowed_mask, ALLOWED_SIGS);
> >
> > + /*
> > + * thread is spawned with all signals set to SIG_IGN, re-enable
> > + * the ones that matter
> > + */
> > + for (signo = 1; signo <= _NSIG; signo++) {
> > + if (!sigismember(&shutdown_mask, signo))
> > + allow_signal(signo);
> > + }
> >
> > nfsdstats.th_cnt++;
> > -
> > - rqstp->rq_task = current;
> > -
> > mutex_unlock(&nfsd_mutex);
> >
> > -
> > /*
> > * We want less throttling in balance_dirty_pages() so that nfs to
> > * localhost doesn't cause nfsd to lock up due to all the client's
> > @@ -462,15 +468,25 @@ nfsd(struct svc_rqst *rqstp)
> > */
> > while ((err = svc_recv(rqstp, 60*60*HZ)) == -EAGAIN)
> > ;
> > - if (err < 0)
> > + if (err == -EINTR)
> > break;
> > + else if (err < 0) {
> > + if (err != preverr) {
> > + printk(KERN_WARNING "%s: unexpected error "
> > + "from svc_recv (%d)\n", __func__, -err);
> > + preverr = err;
> > + }
> > + schedule_timeout_uninterruptible(HZ);
> > + continue;
> > + }
> > +
> > update_thread_usage(atomic_read(&nfsd_busy));
> > atomic_inc(&nfsd_busy);
> >
> > /* Lock the export hash tables for reading. */
> > exp_readlock();
> >
> > - /* Process request with signals blocked. */
> > + /* Process request with signals blocked. */
> > sigprocmask(SIG_SETMASK, &allowed_mask, NULL);
> >
> > svc_process(rqstp);
> > @@ -479,25 +495,23 @@ nfsd(struct svc_rqst *rqstp)
> > exp_readunlock();
> > update_thread_usage(atomic_read(&nfsd_busy));
> > atomic_dec(&nfsd_busy);
> > - }
> >
> > - if (err != -EINTR)
> > - printk(KERN_WARNING "nfsd: terminating on error %d\n", -err);
> > + /* reset err in case kthread_stop is called */
> > + err = 0;
> > + }
>
> There's no use of err here until it's next set in
>
> while ((err = svc_recv(rqstp, 60*60*HZ)) == -EAGAIN)
>
> so I assume the comment and this "err = 0" are artifacts of a previous
> implementation.
>
That's correct. An earlier patch has the main loop as
"while(!kthread_should_stop())". Since we're using signals to take the
thread down though, there's not much point in checking
kthread_should_stop(). I can remove the "err = 0;" there.
> >
> > /* Clear signals before calling svc_exit_thread() */
> > flush_signals(current);
> >
> > mutex_lock(&nfsd_mutex);
> > -
> > nfsdstats.th_cnt --;
> >
> > out:
> > /* Release the thread */
> > svc_exit_thread(rqstp);
> > -
> > - /* Release module */
> > mutex_unlock(&nfsd_mutex);
> > - module_put_and_exit(0);
>
> How does the module refcounting work after this patch?
>
> --b.
>
I think I've goofed this part, actually. I was thinking that we didn't
need to bump the refcount here, and that the kernel would realize that
nfsd() hadn't returned and would prevent unloading until it had. This
doesn't seem to be the case. I'll need to go back and add refcounting
back in.
--
Jeff Layton <jlayton@redhat.com>
^ permalink raw reply
* Re: [PATCH 4/5] knfsd: convert knfsd to kthread API
From: J. Bruce Fields @ 2008-06-06 17:24 UTC (permalink / raw)
To: Jeff Layton; +Cc: linux-nfs, nfsv4, gnb
In-Reply-To: <1212765171-26042-5-git-send-email-jlayton@redhat.com>
On Fri, Jun 06, 2008 at 11:12:50AM -0400, Jeff Layton wrote:
> This patch is rather large, but I couldn't figure out a way to break it
> up that would remain bisectable. It does several things:
>
> - change svc_thread_fn typedef to better match what kthread_create expects
> - change svc_pool_map_set_cpumask to be more kthread friendly. Make it
> take a task arg and and get rid of the "oldmask"
> - have svc_set_num_threads call kthread_create directly
> - eliminate __svc_create_thread
>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
> fs/nfsd/nfssvc.c | 52 ++++++++++++++--------
> include/linux/sunrpc/svc.h | 2 +-
> net/sunrpc/svc.c | 102 +++++++++++++++-----------------------------
> 3 files changed, 68 insertions(+), 88 deletions(-)
>
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index 6339cb7..fe62ec8 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -21,6 +21,7 @@
> #include <linux/smp_lock.h>
> #include <linux/freezer.h>
> #include <linux/fs_struct.h>
> +#include <linux/kthread.h>
>
> #include <linux/sunrpc/types.h>
> #include <linux/sunrpc/stats.h>
> @@ -46,7 +47,7 @@
> #define SHUTDOWN_SIGS (sigmask(SIGKILL) | sigmask(SIGHUP) | sigmask(SIGINT) | sigmask(SIGQUIT))
>
> extern struct svc_program nfsd_program;
> -static void nfsd(struct svc_rqst *rqstp);
> +static int nfsd(void *vrqstp);
> struct timeval nfssvc_boot;
> static atomic_t nfsd_busy;
> static unsigned long nfsd_last_call;
> @@ -407,18 +408,19 @@ update_thread_usage(int busy_threads)
> /*
> * This is the NFS server kernel thread
> */
> -static void
> -nfsd(struct svc_rqst *rqstp)
> +static int
> +nfsd(void *vrqstp)
> {
> + struct svc_rqst *rqstp = (struct svc_rqst *) vrqstp;
> struct fs_struct *fsp;
> - int err;
> sigset_t shutdown_mask, allowed_mask;
> + int err, preverr = 0;
> + unsigned int signo;
>
> /* Lock module and set up kernel thread */
> mutex_lock(&nfsd_mutex);
> - daemonize("nfsd");
>
> - /* After daemonize() this kernel thread shares current->fs
> + /* At this point, the thread shares current->fs
> * with the init process. We need to create files with a
> * umask of 0 instead of init's umask. */
> fsp = copy_fs_struct(current->fs);
> @@ -433,14 +435,18 @@ nfsd(struct svc_rqst *rqstp)
> siginitsetinv(&shutdown_mask, SHUTDOWN_SIGS);
> siginitsetinv(&allowed_mask, ALLOWED_SIGS);
>
> + /*
> + * thread is spawned with all signals set to SIG_IGN, re-enable
> + * the ones that matter
> + */
> + for (signo = 1; signo <= _NSIG; signo++) {
> + if (!sigismember(&shutdown_mask, signo))
> + allow_signal(signo);
> + }
>
> nfsdstats.th_cnt++;
> -
> - rqstp->rq_task = current;
> -
> mutex_unlock(&nfsd_mutex);
>
> -
> /*
> * We want less throttling in balance_dirty_pages() so that nfs to
> * localhost doesn't cause nfsd to lock up due to all the client's
> @@ -462,15 +468,25 @@ nfsd(struct svc_rqst *rqstp)
> */
> while ((err = svc_recv(rqstp, 60*60*HZ)) == -EAGAIN)
> ;
> - if (err < 0)
> + if (err == -EINTR)
> break;
> + else if (err < 0) {
> + if (err != preverr) {
> + printk(KERN_WARNING "%s: unexpected error "
> + "from svc_recv (%d)\n", __func__, -err);
> + preverr = err;
> + }
> + schedule_timeout_uninterruptible(HZ);
> + continue;
> + }
> +
> update_thread_usage(atomic_read(&nfsd_busy));
> atomic_inc(&nfsd_busy);
>
> /* Lock the export hash tables for reading. */
> exp_readlock();
>
> - /* Process request with signals blocked. */
> + /* Process request with signals blocked. */
> sigprocmask(SIG_SETMASK, &allowed_mask, NULL);
>
> svc_process(rqstp);
> @@ -479,25 +495,23 @@ nfsd(struct svc_rqst *rqstp)
> exp_readunlock();
> update_thread_usage(atomic_read(&nfsd_busy));
> atomic_dec(&nfsd_busy);
> - }
>
> - if (err != -EINTR)
> - printk(KERN_WARNING "nfsd: terminating on error %d\n", -err);
> + /* reset err in case kthread_stop is called */
> + err = 0;
> + }
There's no use of err here until it's next set in
while ((err = svc_recv(rqstp, 60*60*HZ)) == -EAGAIN)
so I assume the comment and this "err = 0" are artifacts of a previous
implementation.
>
> /* Clear signals before calling svc_exit_thread() */
> flush_signals(current);
>
> mutex_lock(&nfsd_mutex);
> -
> nfsdstats.th_cnt --;
>
> out:
> /* Release the thread */
> svc_exit_thread(rqstp);
> -
> - /* Release module */
> mutex_unlock(&nfsd_mutex);
> - module_put_and_exit(0);
How does the module refcounting work after this patch?
--b.
> +
> + return 0;
> }
>
> static __be32 map_new_errors(u32 vers, __be32 nfserr)
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 4b54c5f..011d6d8 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -22,7 +22,7 @@
> /*
> * This is the RPC server thread function prototype
> */
> -typedef void (*svc_thread_fn)(struct svc_rqst *);
> +typedef int (*svc_thread_fn)(void *);
>
> /*
> *
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index 7bffaff..f461da2 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -18,6 +18,7 @@
> #include <linux/mm.h>
> #include <linux/interrupt.h>
> #include <linux/module.h>
> +#include <linux/kthread.h>
>
> #include <linux/sunrpc/types.h>
> #include <linux/sunrpc/xdr.h>
> @@ -291,15 +292,14 @@ svc_pool_map_put(void)
>
>
> /*
> - * Set the current thread's cpus_allowed mask so that it
> + * Set the given thread's cpus_allowed mask so that it
> * will only run on cpus in the given pool.
> - *
> - * Returns 1 and fills in oldmask iff a cpumask was applied.
> */
> -static inline int
> -svc_pool_map_set_cpumask(unsigned int pidx, cpumask_t *oldmask)
> +static inline void
> +svc_pool_map_set_cpumask(struct task_struct *task, unsigned int pidx)
> {
> struct svc_pool_map *m = &svc_pool_map;
> + unsigned int node = m->pool_to[pidx];
>
> /*
> * The caller checks for sv_nrpools > 1, which
> @@ -307,26 +307,17 @@ svc_pool_map_set_cpumask(unsigned int pidx, cpumask_t *oldmask)
> */
> BUG_ON(m->count == 0);
>
> - switch (m->mode)
> - {
> - default:
> - return 0;
> + switch (m->mode) {
> case SVC_POOL_PERCPU:
> {
> - unsigned int cpu = m->pool_to[pidx];
> -
> - *oldmask = current->cpus_allowed;
> - set_cpus_allowed_ptr(current, &cpumask_of_cpu(cpu));
> - return 1;
> + set_cpus_allowed_ptr(task, &cpumask_of_cpu(node));
> + break;
> }
> case SVC_POOL_PERNODE:
> {
> - unsigned int node = m->pool_to[pidx];
> node_to_cpumask_ptr(nodecpumask, node);
> -
> - *oldmask = current->cpus_allowed;
> - set_cpus_allowed_ptr(current, nodecpumask);
> - return 1;
> + set_cpus_allowed_ptr(task, nodecpumask);
> + break;
> }
> }
> }
> @@ -579,47 +570,6 @@ out_enomem:
> EXPORT_SYMBOL(svc_prepare_thread);
>
> /*
> - * Create a thread in the given pool. Caller must hold BKL or another lock to
> - * serialize access to the svc_serv struct. On a NUMA or SMP machine, with a
> - * multi-pool serv, the thread will be restricted to run on the cpus belonging
> - * to the pool.
> - */
> -static int
> -__svc_create_thread(svc_thread_fn func, struct svc_serv *serv,
> - struct svc_pool *pool)
> -{
> - struct svc_rqst *rqstp;
> - int error = -ENOMEM;
> - int have_oldmask = 0;
> - cpumask_t uninitialized_var(oldmask);
> -
> - rqstp = svc_prepare_thread(serv, pool);
> - if (IS_ERR(rqstp)) {
> - error = PTR_ERR(rqstp);
> - goto out;
> - }
> -
> - if (serv->sv_nrpools > 1)
> - have_oldmask = svc_pool_map_set_cpumask(pool->sp_id, &oldmask);
> -
> - error = kernel_thread((int (*)(void *)) func, rqstp, 0);
> -
> - if (have_oldmask)
> - set_cpus_allowed(current, oldmask);
> -
> - if (error < 0)
> - goto out_thread;
> - svc_sock_update_bufs(serv);
> - error = 0;
> -out:
> - return error;
> -
> -out_thread:
> - svc_exit_thread(rqstp);
> - goto out;
> -}
> -
> -/*
> * Choose a pool in which to create a new thread, for svc_set_num_threads
> */
> static inline struct svc_pool *
> @@ -688,7 +638,9 @@ found_pool:
> int
> svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
> {
> - struct task_struct *victim;
> + struct svc_rqst *rqstp;
> + struct task_struct *task;
> + struct svc_pool *chosen_pool;
> int error = 0;
> unsigned int state = serv->sv_nrthreads-1;
>
> @@ -704,18 +656,32 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
> /* create new threads */
> while (nrservs > 0) {
> nrservs--;
> - __module_get(serv->sv_module);
> - error = __svc_create_thread(serv->sv_function, serv,
> - choose_pool(serv, pool, &state));
> - if (error < 0) {
> - module_put(serv->sv_module);
> + chosen_pool = choose_pool(serv, pool, &state);
> +
> + rqstp = svc_prepare_thread(serv, chosen_pool);
> + if (IS_ERR(rqstp)) {
> + error = PTR_ERR(rqstp);
> break;
> }
> +
> + task = kthread_create(serv->sv_function, rqstp, serv->sv_name);
> + if (IS_ERR(task)) {
> + error = PTR_ERR(task);
> + svc_exit_thread(rqstp);
> + break;
> + }
> +
> + rqstp->rq_task = task;
> + if (serv->sv_nrpools > 1)
> + svc_pool_map_set_cpumask(task, chosen_pool->sp_id);
> +
> + svc_sock_update_bufs(serv);
> + wake_up_process(task);
> }
> /* destroy old threads */
> while (nrservs < 0 &&
> - (victim = choose_victim(serv, pool, &state)) != NULL) {
> - send_sig(serv->sv_kill_signal, victim, 1);
> + (task = choose_victim(serv, pool, &state)) != NULL) {
> + send_sig(serv->sv_kill_signal, task, 1);
> nrservs++;
> }
>
> --
> 1.5.3.6
>
^ permalink raw reply
* [PATCH 2/2] SUNRPC: Ensure all transports set rq_xtime consistently
From: Chuck Lever @ 2008-06-06 17:22 UTC (permalink / raw)
To: trond.myklebust; +Cc: linux-nfs, chuck.lever
In-Reply-To: <20080606172037.14757.64893.stgit-meopP2rzCrTwdl/1UfZZQIVfYA8g3rJ/@public.gmane.org>
The RPC client uses the rq_xtime field in each RPC request to determine the
round-trip time of the request. Currently, the rq_xtime field is
initialized by each transport just before it starts enqueing a request to
be sent. However, transports do not handle initializing this value
consistently; sometimes they don't initialize it at all.
To make the measurement of request round-trip time consistent for all
RPC client transport capabilities, pull rq_xtime initialization into the
RPC client's generic transport logic. Now all transports will get a
standardized RTT measure automatically, from:
xprt_transmit()
to
xprt_complete_rqst()
This makes round-trip time calculation more accurate for the TCP transport.
The socket ->sendmsg() method can return "-EAGAIN" if the socket's output
buffer is full, so the TCP transport's ->send_request() method may call
the ->sendmsg() method repeatedly until it gets all of the request's bytes
queued in the socket's buffer.
Currently, the TCP transport sets the rq_xtime field every time through
that loop so the final value is the timestamp just before the *last* call
to the underlying socket's ->sendmsg() method. After this patch, the
rq_xtime field contains a timestamp that reflects the time just before the
*first* call to ->sendmsg().
This is consequential under heavy workloads because large requests often
take multiple ->sendmsg() calls to get all the bytes of a request queued.
The TCP transport causes the request to sleep until the remote end of the
socket has received enough bytes to clear space in the socket's local
output buffer. This delay can be quite significant.
The method introduced by this patch is a more accurate measure of RTT
for stream transports, since the server can cause enough back pressure
to delay (ie increase the latency of) requests from the client.
Additionally, this patch corrects the behavior of the RDMA transport, which
entirely neglected to initialize the rq_xtime field. RPC performance
metrics for RDMA transports now display correct RPC request round trip
times.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Tom Talpey <thomas.talpey@netapp.com>
---
net/sunrpc/xprt.c | 1 +
net/sunrpc/xprtsock.c | 2 --
2 files changed, 1 insertions(+), 2 deletions(-)
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 67996bd..99a52aa 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -872,6 +872,7 @@ void xprt_transmit(struct rpc_task *task)
return;
req->rq_connect_cookie = xprt->connect_cookie;
+ req->rq_xtime = jiffies;
status = xprt->ops->send_request(task);
if (status == 0) {
dprintk("RPC: %5u xmit complete\n", task->tk_pid);
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index ddbe981..4486c59 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -579,7 +579,6 @@ static int xs_udp_send_request(struct rpc_task *task)
req->rq_svec->iov_base,
req->rq_svec->iov_len);
- req->rq_xtime = jiffies;
status = xs_sendpages(transport->sock,
xs_addr(xprt),
xprt->addrlen, xdr,
@@ -671,7 +670,6 @@ static int xs_tcp_send_request(struct rpc_task *task)
* to cope with writespace callbacks arriving _after_ we have
* called sendmsg(). */
while (1) {
- req->rq_xtime = jiffies;
status = xs_sendpages(transport->sock,
NULL, 0, xdr, req->rq_bytes_sent);
^ permalink raw reply related
* [PATCH 1/2] NFS: Move fs/nfs/iostat.h to include/linux
From: Chuck Lever @ 2008-06-06 17:22 UTC (permalink / raw)
To: trond.myklebust; +Cc: linux-nfs, chuck.lever
In-Reply-To: <20080606172037.14757.64893.stgit-meopP2rzCrTwdl/1UfZZQIVfYA8g3rJ/@public.gmane.org>
The fs/nfs/iostat.h header has definitions that were designed to be exposed
to user space. Move the header under include/linux so user space can use
the header in applications that read /proc/self/mountstats.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
fs/nfs/client.c | 2
fs/nfs/dir.c | 2
fs/nfs/direct.c | 2
fs/nfs/file.c | 2
fs/nfs/inode.c | 2
fs/nfs/iostat.h | 164 -----------------------------------------
fs/nfs/nfs3proc.c | 2
fs/nfs/nfs4proc.c | 2
fs/nfs/read.c | 2
fs/nfs/super.c | 2
fs/nfs/write.c | 2
include/linux/nfs_iostat.h | 178 ++++++++++++++++++++++++++++++++++++++++++++
12 files changed, 188 insertions(+), 174 deletions(-)
delete mode 100644 fs/nfs/iostat.h
create mode 100644 include/linux/nfs_iostat.h
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index f2a092c..618faa3 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -32,6 +32,7 @@
#include <linux/seq_file.h>
#include <linux/mount.h>
#include <linux/nfs_idmap.h>
+#include <linux/nfs_iostat.h>
#include <linux/vfs.h>
#include <linux/inet.h>
#include <linux/in6.h>
@@ -43,7 +44,6 @@
#include "nfs4_fs.h"
#include "callback.h"
#include "delegation.h"
-#include "iostat.h"
#include "internal.h"
#define NFSDBG_FACILITY NFSDBG_CLIENT
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 425b984..a5cd722 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -28,6 +28,7 @@
#include <linux/sunrpc/clnt.h>
#include <linux/nfs_fs.h>
#include <linux/nfs_mount.h>
+#include <linux/nfs_iostat.h>
#include <linux/pagemap.h>
#include <linux/smp_lock.h>
#include <linux/pagevec.h>
@@ -37,7 +38,6 @@
#include "nfs4_fs.h"
#include "delegation.h"
-#include "iostat.h"
#include "internal.h"
/* #define NFS_DEBUG_VERBOSE 1 */
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 08f6b04..581af62 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -47,6 +47,7 @@
#include <linux/nfs_fs.h>
#include <linux/nfs_page.h>
+#include <linux/nfs_iostat.h>
#include <linux/sunrpc/clnt.h>
#include <asm/system.h>
@@ -54,7 +55,6 @@
#include <asm/atomic.h>
#include "internal.h"
-#include "iostat.h"
#define NFSDBG_FACILITY NFSDBG_VFS
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index bba6181..b47c1f2 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -23,6 +23,7 @@
#include <linux/stat.h>
#include <linux/nfs_fs.h>
#include <linux/nfs_mount.h>
+#include <linux/nfs_iostat.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/pagemap.h>
@@ -34,7 +35,6 @@
#include "delegation.h"
#include "internal.h"
-#include "iostat.h"
#define NFSDBG_FACILITY NFSDBG_FILE
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 596c5d8..5287f37 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -34,6 +34,7 @@
#include <linux/seq_file.h>
#include <linux/mount.h>
#include <linux/nfs_idmap.h>
+#include <linux/nfs_iostat.h>
#include <linux/vfs.h>
#include <linux/inet.h>
#include <linux/nfs_xdr.h>
@@ -44,7 +45,6 @@
#include "nfs4_fs.h"
#include "callback.h"
#include "delegation.h"
-#include "iostat.h"
#include "internal.h"
#define NFSDBG_FACILITY NFSDBG_VFS
diff --git a/fs/nfs/iostat.h b/fs/nfs/iostat.h
deleted file mode 100644
index 6350ecb..0000000
--- a/fs/nfs/iostat.h
+++ /dev/null
@@ -1,164 +0,0 @@
-/*
- * linux/fs/nfs/iostat.h
- *
- * Declarations for NFS client per-mount statistics
- *
- * Copyright (C) 2005, 2006 Chuck Lever <cel@netapp.com>
- *
- * NFS client per-mount statistics provide information about the health of
- * the NFS client and the health of each NFS mount point. Generally these
- * are not for detailed problem diagnosis, but simply to indicate that there
- * is a problem.
- *
- * These counters are not meant to be human-readable, but are meant to be
- * integrated into system monitoring tools such as "sar" and "iostat". As
- * such, the counters are sampled by the tools over time, and are never
- * zeroed after a file system is mounted. Moving averages can be computed
- * by the tools by taking the difference between two instantaneous samples
- * and dividing that by the time between the samples.
- */
-
-#ifndef _NFS_IOSTAT
-#define _NFS_IOSTAT
-
-#define NFS_IOSTAT_VERS "1.0"
-
-/*
- * NFS byte counters
- *
- * 1. SERVER - the number of payload bytes read from or written to the
- * server by the NFS client via an NFS READ or WRITE request.
- *
- * 2. NORMAL - the number of bytes read or written by applications via
- * the read(2) and write(2) system call interfaces.
- *
- * 3. DIRECT - the number of bytes read or written from files opened
- * with the O_DIRECT flag.
- *
- * These counters give a view of the data throughput into and out of the NFS
- * client. Comparing the number of bytes requested by an application with the
- * number of bytes the client requests from the server can provide an
- * indication of client efficiency (per-op, cache hits, etc).
- *
- * These counters can also help characterize which access methods are in
- * use. DIRECT by itself shows whether there is any O_DIRECT traffic.
- * NORMAL + DIRECT shows how much data is going through the system call
- * interface. A large amount of SERVER traffic without much NORMAL or
- * DIRECT traffic shows that applications are using mapped files.
- *
- * NFS page counters
- *
- * These count the number of pages read or written via nfs_readpage(),
- * nfs_readpages(), or their write equivalents.
- */
-enum nfs_stat_bytecounters {
- NFSIOS_NORMALREADBYTES = 0,
- NFSIOS_NORMALWRITTENBYTES,
- NFSIOS_DIRECTREADBYTES,
- NFSIOS_DIRECTWRITTENBYTES,
- NFSIOS_SERVERREADBYTES,
- NFSIOS_SERVERWRITTENBYTES,
- NFSIOS_READPAGES,
- NFSIOS_WRITEPAGES,
- __NFSIOS_BYTESMAX,
-};
-
-/*
- * NFS event counters
- *
- * These counters provide a low-overhead way of monitoring client activity
- * without enabling NFS trace debugging. The counters show the rate at
- * which VFS requests are made, and how often the client invalidates its
- * data and attribute caches. This allows system administrators to monitor
- * such things as how close-to-open is working, and answer questions such
- * as "why are there so many GETATTR requests on the wire?"
- *
- * They also count anamolous events such as short reads and writes, silly
- * renames due to close-after-delete, and operations that change the size
- * of a file (such operations can often be the source of data corruption
- * if applications aren't using file locking properly).
- */
-enum nfs_stat_eventcounters {
- NFSIOS_INODEREVALIDATE = 0,
- NFSIOS_DENTRYREVALIDATE,
- NFSIOS_DATAINVALIDATE,
- NFSIOS_ATTRINVALIDATE,
- NFSIOS_VFSOPEN,
- NFSIOS_VFSLOOKUP,
- NFSIOS_VFSACCESS,
- NFSIOS_VFSUPDATEPAGE,
- NFSIOS_VFSREADPAGE,
- NFSIOS_VFSREADPAGES,
- NFSIOS_VFSWRITEPAGE,
- NFSIOS_VFSWRITEPAGES,
- NFSIOS_VFSGETDENTS,
- NFSIOS_VFSSETATTR,
- NFSIOS_VFSFLUSH,
- NFSIOS_VFSFSYNC,
- NFSIOS_VFSLOCK,
- NFSIOS_VFSRELEASE,
- NFSIOS_CONGESTIONWAIT,
- NFSIOS_SETATTRTRUNC,
- NFSIOS_EXTENDWRITE,
- NFSIOS_SILLYRENAME,
- NFSIOS_SHORTREAD,
- NFSIOS_SHORTWRITE,
- NFSIOS_DELAY,
- __NFSIOS_COUNTSMAX,
-};
-
-#ifdef __KERNEL__
-
-#include <linux/percpu.h>
-#include <linux/cache.h>
-
-struct nfs_iostats {
- unsigned long long bytes[__NFSIOS_BYTESMAX];
- unsigned long events[__NFSIOS_COUNTSMAX];
-} ____cacheline_aligned;
-
-static inline void nfs_inc_server_stats(struct nfs_server *server, enum nfs_stat_eventcounters stat)
-{
- struct nfs_iostats *iostats;
- int cpu;
-
- cpu = get_cpu();
- iostats = per_cpu_ptr(server->io_stats, cpu);
- iostats->events[stat] ++;
- put_cpu_no_resched();
-}
-
-static inline void nfs_inc_stats(struct inode *inode, enum nfs_stat_eventcounters stat)
-{
- nfs_inc_server_stats(NFS_SERVER(inode), stat);
-}
-
-static inline void nfs_add_server_stats(struct nfs_server *server, enum nfs_stat_bytecounters stat, unsigned long addend)
-{
- struct nfs_iostats *iostats;
- int cpu;
-
- cpu = get_cpu();
- iostats = per_cpu_ptr(server->io_stats, cpu);
- iostats->bytes[stat] += addend;
- put_cpu_no_resched();
-}
-
-static inline void nfs_add_stats(struct inode *inode, enum nfs_stat_bytecounters stat, unsigned long addend)
-{
- nfs_add_server_stats(NFS_SERVER(inode), stat, addend);
-}
-
-static inline struct nfs_iostats *nfs_alloc_iostats(void)
-{
- return alloc_percpu(struct nfs_iostats);
-}
-
-static inline void nfs_free_iostats(struct nfs_iostats *stats)
-{
- if (stats != NULL)
- free_percpu(stats);
-}
-
-#endif
-#endif
diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index c3523ad..d461eb3 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -17,8 +17,8 @@
#include <linux/nfs_page.h>
#include <linux/lockd/bind.h>
#include <linux/nfs_mount.h>
+#include <linux/nfs_iostat.h>
-#include "iostat.h"
#include "internal.h"
#define NFSDBG_FACILITY NFSDBG_PROC
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 1293e0a..df93fa8 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -45,6 +45,7 @@
#include <linux/nfs4.h>
#include <linux/nfs_fs.h>
#include <linux/nfs_page.h>
+#include <linux/nfs_iostat.h>
#include <linux/smp_lock.h>
#include <linux/namei.h>
#include <linux/mount.h>
@@ -52,7 +53,6 @@
#include "nfs4_fs.h"
#include "delegation.h"
#include "internal.h"
-#include "iostat.h"
#define NFSDBG_FACILITY NFSDBG_PROC
diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index 40d1798..306363f 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -18,12 +18,12 @@
#include <linux/sunrpc/clnt.h>
#include <linux/nfs_fs.h>
#include <linux/nfs_page.h>
+#include <linux/nfs_iostat.h>
#include <linux/smp_lock.h>
#include <asm/system.h>
#include "internal.h"
-#include "iostat.h"
#define NFSDBG_FACILITY NFSDBG_PAGECACHE
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 2a4a024..ef27035 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -43,6 +43,7 @@
#include <linux/seq_file.h>
#include <linux/mount.h>
#include <linux/nfs_idmap.h>
+#include <linux/nfs_iostat.h>
#include <linux/vfs.h>
#include <linux/inet.h>
#include <linux/in6.h>
@@ -57,7 +58,6 @@
#include "nfs4_fs.h"
#include "callback.h"
#include "delegation.h"
-#include "iostat.h"
#include "internal.h"
#define NFSDBG_FACILITY NFSDBG_VFS
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index c03a039..1bb0593 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -18,13 +18,13 @@
#include <linux/nfs_fs.h>
#include <linux/nfs_mount.h>
#include <linux/nfs_page.h>
+#include <linux/nfs_iostat.h>
#include <linux/backing-dev.h>
#include <asm/uaccess.h>
#include "delegation.h"
#include "internal.h"
-#include "iostat.h"
#define NFSDBG_FACILITY NFSDBG_PAGECACHE
diff --git a/include/linux/nfs_iostat.h b/include/linux/nfs_iostat.h
new file mode 100644
index 0000000..cdfa9de
--- /dev/null
+++ b/include/linux/nfs_iostat.h
@@ -0,0 +1,178 @@
+/*
+ * Declarations for NFS client per-mount statistics
+ *
+ * Copyright (C) 2005, 2006 Chuck Lever <cel@netapp.com>
+ *
+ * NFS client per-mount statistics provide information about the
+ * health of the NFS client and the health of each NFS mount point.
+ * Generally these are not for detailed problem diagnosis, but
+ * simply to indicate that there is a problem.
+ *
+ * These counters are not meant to be human-readable, but are meant
+ * to be integrated into system monitoring tools such as "sar" and
+ * "iostat". As such, the counters are sampled by the tools over
+ * time, and are never zeroed after a file system is mounted.
+ * Moving averages can be computed by the tools by taking the
+ * difference between two instantaneous samples and dividing that
+ * by the time between the samples.
+ */
+
+#ifndef _NFS_IOSTAT
+#define _NFS_IOSTAT
+
+#define NFS_IOSTAT_VERS "1.0"
+
+/*
+ * NFS byte counters
+ *
+ * 1. SERVER - the number of payload bytes read from or written
+ * to the server by the NFS client via an NFS READ or WRITE
+ * request.
+ *
+ * 2. NORMAL - the number of bytes read or written by applications
+ * via the read(2) and write(2) system call interfaces.
+ *
+ * 3. DIRECT - the number of bytes read or written from files
+ * opened with the O_DIRECT flag.
+ *
+ * These counters give a view of the data throughput into and out
+ * of the NFS client. Comparing the number of bytes requested by
+ * an application with the number of bytes the client requests from
+ * the server can provide an indication of client efficiency
+ * (per-op, cache hits, etc).
+ *
+ * These counters can also help characterize which access methods
+ * are in use. DIRECT by itself shows whether there is any O_DIRECT
+ * traffic. NORMAL + DIRECT shows how much data is going through
+ * the system call interface. A large amount of SERVER traffic
+ * without much NORMAL or DIRECT traffic shows that applications
+ * are using mapped files.
+ *
+ * NFS page counters
+ *
+ * These count the number of pages read or written via nfs_readpage(),
+ * nfs_readpages(), or their write equivalents.
+ *
+ * NB: When adding new byte counters, please include the measured
+ * units in the name of each byte counter to help users of this
+ * interface determine what exactly is being counted.
+ */
+enum nfs_stat_bytecounters {
+ NFSIOS_NORMALREADBYTES = 0,
+ NFSIOS_NORMALWRITTENBYTES,
+ NFSIOS_DIRECTREADBYTES,
+ NFSIOS_DIRECTWRITTENBYTES,
+ NFSIOS_SERVERREADBYTES,
+ NFSIOS_SERVERWRITTENBYTES,
+ NFSIOS_READPAGES,
+ NFSIOS_WRITEPAGES,
+ __NFSIOS_BYTESMAX,
+};
+
+/*
+ * NFS event counters
+ *
+ * These counters provide a low-overhead way of monitoring client
+ * activity without enabling NFS trace debugging. The counters
+ * show the rate at which VFS requests are made, and how often the
+ * client invalidates its data and attribute caches. This allows
+ * system administrators to monitor such things as how close-to-open
+ * is working, and answer questions such as "why are there so many
+ * GETATTR requests on the wire?"
+ *
+ * They also count anamolous events such as short reads and writes,
+ * silly renames due to close-after-delete, and operations that
+ * change the size of a file (such operations can often be the
+ * source of data corruption if applications aren't using file
+ * locking properly).
+ */
+enum nfs_stat_eventcounters {
+ NFSIOS_INODEREVALIDATE = 0,
+ NFSIOS_DENTRYREVALIDATE,
+ NFSIOS_DATAINVALIDATE,
+ NFSIOS_ATTRINVALIDATE,
+ NFSIOS_VFSOPEN,
+ NFSIOS_VFSLOOKUP,
+ NFSIOS_VFSACCESS,
+ NFSIOS_VFSUPDATEPAGE,
+ NFSIOS_VFSREADPAGE,
+ NFSIOS_VFSREADPAGES,
+ NFSIOS_VFSWRITEPAGE,
+ NFSIOS_VFSWRITEPAGES,
+ NFSIOS_VFSGETDENTS,
+ NFSIOS_VFSSETATTR,
+ NFSIOS_VFSFLUSH,
+ NFSIOS_VFSFSYNC,
+ NFSIOS_VFSLOCK,
+ NFSIOS_VFSRELEASE,
+ NFSIOS_CONGESTIONWAIT,
+ NFSIOS_SETATTRTRUNC,
+ NFSIOS_EXTENDWRITE,
+ NFSIOS_SILLYRENAME,
+ NFSIOS_SHORTREAD,
+ NFSIOS_SHORTWRITE,
+ NFSIOS_DELAY,
+ __NFSIOS_COUNTSMAX,
+};
+
+#ifdef __KERNEL__
+
+#include <linux/percpu.h>
+#include <linux/cache.h>
+
+struct nfs_iostats {
+ unsigned long long bytes[__NFSIOS_BYTESMAX];
+ unsigned long events[__NFSIOS_COUNTSMAX];
+} ____cacheline_aligned;
+
+static inline void nfs_inc_server_stats(struct nfs_server *server,
+ enum nfs_stat_eventcounters stat)
+{
+ struct nfs_iostats *iostats;
+ int cpu;
+
+ cpu = get_cpu();
+ iostats = per_cpu_ptr(server->io_stats, cpu);
+ iostats->events[stat]++;
+ put_cpu_no_resched();
+}
+
+static inline void nfs_inc_stats(struct inode *inode,
+ enum nfs_stat_eventcounters stat)
+{
+ nfs_inc_server_stats(NFS_SERVER(inode), stat);
+}
+
+static inline void nfs_add_server_stats(struct nfs_server *server,
+ enum nfs_stat_bytecounters stat,
+ unsigned long addend)
+{
+ struct nfs_iostats *iostats;
+ int cpu;
+
+ cpu = get_cpu();
+ iostats = per_cpu_ptr(server->io_stats, cpu);
+ iostats->bytes[stat] += addend;
+ put_cpu_no_resched();
+}
+
+static inline void nfs_add_stats(struct inode *inode,
+ enum nfs_stat_bytecounters stat,
+ unsigned long addend)
+{
+ nfs_add_server_stats(NFS_SERVER(inode), stat, addend);
+}
+
+static inline struct nfs_iostats *nfs_alloc_iostats(void)
+{
+ return alloc_percpu(struct nfs_iostats);
+}
+
+static inline void nfs_free_iostats(struct nfs_iostats *stats)
+{
+ if (stats != NULL)
+ free_percpu(stats);
+}
+
+#endif
+#endif
^ permalink raw reply related
* [PATCH 0/2] NFS/RPC metrics fix-ups for 2.6.27
From: Chuck Lever @ 2008-06-06 17:22 UTC (permalink / raw)
To: trond.myklebust; +Cc: linux-nfs, chuck.lever
Hi Trond-
Here are two patches that address minor issues with NFS and RPC performance
metrics. See the patch descriptions for details. Please consider these
for 2.6.27.
--
corporate: <chuck dot lever at oracle dot com>
^ permalink raw reply
* Re: Problems with large number of clients and reads
From: J. Bruce Fields @ 2008-06-06 16:09 UTC (permalink / raw)
To: Norman Weathers; +Cc: linux-nfs
In-Reply-To: <1212519001.24900.14.camel@hololw58>
On Tue, Jun 03, 2008 at 01:50:01PM -0500, Norman Weathers wrote:
> Hello all,
>
> We are having some issues with some high throughput servers of ours.
>
> Here is the issue, we are using a vanilla 2.6.22.14 kernel on a node
> with 2 Dual Core Intels (3 GHz) and 16 GB of ram. The files that are
> being served are around 2 GB each, and there are usually 3 to 5 of them
> being read, so once read they fit into memory nicely, and when all is
> working correctly, we have a perfectly filled cache, with almost no disk
> activity.
>
> When we have large NFS activity (say, 600 to 1200 clients) connecting to
> the server(s), they can get into a state where they are using up all of
> memory, but they are dropping cache. slabtop is showing 13 GB of memory
> being used by the size-4096 slab object. We have two ethernet channels
> bonded, so we see in excess of 240 MB/s of data flowing out of the box,
> and all of the sudden, disk activity has risen to 185 MB/s. This
> happens if we are using 8 or more nfs threads. If we limit the threads
> to 6 or less, this doesn't happen. Of course, we are starving clients,
> but at least the jobs that my customers are throwing out there are
> progressing. The question becomes, what is causing the memory to be
> used up by the slab size-4096 object? Why when all of the sudden a
> bunch of clients ask for data does this object grow from 100 MB to 13
> GB? I have set the memory settings to something that I thought was
> reasonable.
>
> Here is some more of the particulars:
>
> sysctl.conf tcp memory settings:
>
> # NFS Tuning Parameters
> sunrpc.udp_slot_table_entries = 128
> sunrpc.tcp_slot_table_entries = 128
> vm.overcommit_ratio = 80
>
> net.core.rmem_max=524288
> net.core.rmem_default=262144
> net.core.wmem_max=524288
> net.core.wmem_default=262144
> net.ipv4.tcp_rmem = 8192 262144 524288
> net.ipv4.tcp_wmem = 8192 262144 524288
> net.ipv4.tcp_sack=0
> net.ipv4.tcp_timestamps=0
> vm.min_free_kbytes=50000
> vm.overcommit_memory=1
> net.ipv4.tcp_reordering=127
>
> # Enable tcp_low_latency
> net.ipv4.tcp_low_latency=1
>
> Here is a current reading from a slabtop of a system where this error is
> happening:
>
> 3007154 3007154 100% 4.00K 3007154 1 12028616K size-4096
>
> Note the size of the object cache, usually it is 50 - 100 MB (I have
> another box with 32 threads and the same settings which is bouncing
> between 50 and 128 MB right now).
>
> I have a lot of client boxes that need access to these servers, and
> would really benefit from having more threads, but if I increase the
> number of threads, it pushes everything out of cache, forcing re-reads,
> and really slows down our jobs.
>
> Any thoughts on this?
I'd've thought that suggests a leak of memory allocated by kmalloc().
Does the size-4096 cache decrease eventually, or does it stay that large
until you reboot?
--b.
^ permalink raw reply
* Re: [PATCH 2/3] sunrpc: have pooled services make NUMA-friendly allocations
From: J. Bruce Fields @ 2008-06-06 15:41 UTC (permalink / raw)
To: Jeff Layton; +Cc: linux-kernel, linux-nfs
In-Reply-To: <20080606112145.6a23afca-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
On Fri, Jun 06, 2008 at 11:21:45AM -0400, Jeff Layton wrote:
> On Tue, 03 Jun 2008 07:18:02 -0400
> Jeff Layton <jlayton@redhat.com> wrote:
>
> > Currently, svc_prepare_thread allocates memory using plain kmalloc()
> > and alloc_page() calls, even for threads that are destined to run on
> > different CPUs or NUMA nodes than the current one. Add a function to
> > translate a poolid into a NUMA node, and have svc_prepare_thread and
> > svc_init_buffer allocate memory on those nodes instead.
> >
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
>
>
> --------[snip]--------
> >
> > - rqstp = kzalloc(sizeof(*rqstp), GFP_KERNEL);
> > + rqstp = kzalloc_node(sizeof(*rqstp), GFP_KERNEL, node);
>
> Bruce,
> It looks like AKPM has taken the kzalloc_node patch into -mm. I'd
> like to have you take this set into your tree at some point, but don't
> want you to have to carry that VM patch too. Would you be amenable to me
> changing the above to something like:
>
> /* FIXME: change to kzalloc_node when/if it makes it to mainline */
> rqstp = kmalloc_node(sizeof(*rqstp), GFP_KERNEL | __GFP_ZERO, node);
>
> ...and then we can make the FIXME change when mainline has the new inline?
I'd rather just take a copy of the patch. Perhaps you see a problem I
don't--but if it's really identical to the patch that'll go into
upstream, then they'll merge trivially and there shouldn't be a problem.
--b.
^ permalink raw reply
* Re: [PATCH 2/3] sunrpc: have pooled services make NUMA-friendly allocations
From: Jeff Layton @ 2008-06-06 15:21 UTC (permalink / raw)
To: bfields; +Cc: linux-kernel, linux-nfs
In-Reply-To: <20080603111802.8769.22921.stgit-LRazzElzaNIFmhoHi+V13ACJwEvxM/w9@public.gmane.org>
On Tue, 03 Jun 2008 07:18:02 -0400
Jeff Layton <jlayton@redhat.com> wrote:
> Currently, svc_prepare_thread allocates memory using plain kmalloc()
> and alloc_page() calls, even for threads that are destined to run on
> different CPUs or NUMA nodes than the current one. Add a function to
> translate a poolid into a NUMA node, and have svc_prepare_thread and
> svc_init_buffer allocate memory on those nodes instead.
>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
--------[snip]--------
>
> - rqstp = kzalloc(sizeof(*rqstp), GFP_KERNEL);
> + rqstp = kzalloc_node(sizeof(*rqstp), GFP_KERNEL, node);
Bruce,
It looks like AKPM has taken the kzalloc_node patch into -mm. I'd
like to have you take this set into your tree at some point, but don't
want you to have to carry that VM patch too. Would you be amenable to me
changing the above to something like:
/* FIXME: change to kzalloc_node when/if it makes it to mainline */
rqstp = kmalloc_node(sizeof(*rqstp), GFP_KERNEL | __GFP_ZERO, node);
...and then we can make the FIXME change when mainline has the new inline?
Thanks,
--
Jeff Layton <jlayton@redhat.com>
^ permalink raw reply
* [PATCH 5/5] sunrpc: remove unneeded fields from svc_serv struct
From: Jeff Layton @ 2008-06-06 15:12 UTC (permalink / raw)
To: bfields; +Cc: linux-nfs, nfsv4, neilb, gnb-cP1dWloDopni96+mSzHFpQC/G2K4zDHf
In-Reply-To: <1212765171-26042-5-git-send-email-jlayton@redhat.com>
Remove the sv_module sv_kill_signal fields from the svc_serv struct.
svc_set_num_threads doesn't bother with module reference counts
anymore, and since we don't have a special shutdown signal anymore,
we might as well just standardize on one that's always used to take
down the threads by the kernel (SIGINT).
Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
fs/nfsd/nfssvc.c | 3 +--
include/linux/sunrpc/svc.h | 10 ++--------
net/sunrpc/svc.c | 10 +++-------
3 files changed, 6 insertions(+), 17 deletions(-)
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index fe62ec8..f4b8592 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -236,8 +236,7 @@ int nfsd_create_serv(void)
atomic_set(&nfsd_busy, 0);
nfsd_serv = svc_create_pooled(&nfsd_program, nfsd_max_blksize,
- nfsd_last_thread, nfsd, SIGINT,
- THIS_MODULE);
+ nfsd_last_thread, nfsd);
if (nfsd_serv == NULL)
err = -ENOMEM;
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 011d6d8..cd304c4 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -72,15 +72,10 @@ struct svc_serv {
unsigned int sv_nrpools; /* number of thread pools */
struct svc_pool * sv_pools; /* array of thread pools */
+ /* Callback to use when last thread exits */
void (*sv_shutdown)(struct svc_serv *serv);
- /* Callback to use when last thread
- * exits.
- */
- struct module * sv_module; /* optional module to count when
- * adding threads */
svc_thread_fn sv_function; /* main function for threads */
- int sv_kill_signal; /* signal to kill threads */
};
/*
@@ -388,8 +383,7 @@ struct svc_rqst *svc_prepare_thread(struct svc_serv *serv,
struct svc_pool *pool);
void svc_exit_thread(struct svc_rqst *);
struct svc_serv * svc_create_pooled(struct svc_program *, unsigned int,
- void (*shutdown)(struct svc_serv*),
- svc_thread_fn, int sig, struct module *);
+ void (*shutdown)(struct svc_serv*), svc_thread_fn);
int svc_set_num_threads(struct svc_serv *, struct svc_pool *, int);
void svc_destroy(struct svc_serv *);
int svc_process(struct svc_rqst *);
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index f461da2..a6794a8 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -433,19 +433,15 @@ EXPORT_SYMBOL(svc_create);
struct svc_serv *
svc_create_pooled(struct svc_program *prog, unsigned int bufsize,
- void (*shutdown)(struct svc_serv *serv),
- svc_thread_fn func, int sig, struct module *mod)
+ void (*shutdown)(struct svc_serv *serv), svc_thread_fn func)
{
struct svc_serv *serv;
unsigned int npools = svc_pool_map_get();
serv = __svc_create(prog, bufsize, npools, shutdown);
- if (serv != NULL) {
+ if (serv != NULL)
serv->sv_function = func;
- serv->sv_kill_signal = sig;
- serv->sv_module = mod;
- }
return serv;
}
@@ -681,7 +677,7 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
/* destroy old threads */
while (nrservs < 0 &&
(task = choose_victim(serv, pool, &state)) != NULL) {
- send_sig(serv->sv_kill_signal, task, 1);
+ send_sig(SIGINT, task, 1);
nrservs++;
}
--
1.5.3.6
^ permalink raw reply related
* [PATCH 4/5] knfsd: convert knfsd to kthread API
From: Jeff Layton @ 2008-06-06 15:12 UTC (permalink / raw)
To: bfields; +Cc: linux-nfs, nfsv4, neilb, gnb-cP1dWloDopni96+mSzHFpQC/G2K4zDHf
In-Reply-To: <1212765171-26042-4-git-send-email-jlayton@redhat.com>
This patch is rather large, but I couldn't figure out a way to break it
up that would remain bisectable. It does several things:
- change svc_thread_fn typedef to better match what kthread_create expects
- change svc_pool_map_set_cpumask to be more kthread friendly. Make it
take a task arg and and get rid of the "oldmask"
- have svc_set_num_threads call kthread_create directly
- eliminate __svc_create_thread
Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
fs/nfsd/nfssvc.c | 52 ++++++++++++++--------
include/linux/sunrpc/svc.h | 2 +-
net/sunrpc/svc.c | 102 +++++++++++++++-----------------------------
3 files changed, 68 insertions(+), 88 deletions(-)
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 6339cb7..fe62ec8 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -21,6 +21,7 @@
#include <linux/smp_lock.h>
#include <linux/freezer.h>
#include <linux/fs_struct.h>
+#include <linux/kthread.h>
#include <linux/sunrpc/types.h>
#include <linux/sunrpc/stats.h>
@@ -46,7 +47,7 @@
#define SHUTDOWN_SIGS (sigmask(SIGKILL) | sigmask(SIGHUP) | sigmask(SIGINT) | sigmask(SIGQUIT))
extern struct svc_program nfsd_program;
-static void nfsd(struct svc_rqst *rqstp);
+static int nfsd(void *vrqstp);
struct timeval nfssvc_boot;
static atomic_t nfsd_busy;
static unsigned long nfsd_last_call;
@@ -407,18 +408,19 @@ update_thread_usage(int busy_threads)
/*
* This is the NFS server kernel thread
*/
-static void
-nfsd(struct svc_rqst *rqstp)
+static int
+nfsd(void *vrqstp)
{
+ struct svc_rqst *rqstp = (struct svc_rqst *) vrqstp;
struct fs_struct *fsp;
- int err;
sigset_t shutdown_mask, allowed_mask;
+ int err, preverr = 0;
+ unsigned int signo;
/* Lock module and set up kernel thread */
mutex_lock(&nfsd_mutex);
- daemonize("nfsd");
- /* After daemonize() this kernel thread shares current->fs
+ /* At this point, the thread shares current->fs
* with the init process. We need to create files with a
* umask of 0 instead of init's umask. */
fsp = copy_fs_struct(current->fs);
@@ -433,14 +435,18 @@ nfsd(struct svc_rqst *rqstp)
siginitsetinv(&shutdown_mask, SHUTDOWN_SIGS);
siginitsetinv(&allowed_mask, ALLOWED_SIGS);
+ /*
+ * thread is spawned with all signals set to SIG_IGN, re-enable
+ * the ones that matter
+ */
+ for (signo = 1; signo <= _NSIG; signo++) {
+ if (!sigismember(&shutdown_mask, signo))
+ allow_signal(signo);
+ }
nfsdstats.th_cnt++;
-
- rqstp->rq_task = current;
-
mutex_unlock(&nfsd_mutex);
-
/*
* We want less throttling in balance_dirty_pages() so that nfs to
* localhost doesn't cause nfsd to lock up due to all the client's
@@ -462,15 +468,25 @@ nfsd(struct svc_rqst *rqstp)
*/
while ((err = svc_recv(rqstp, 60*60*HZ)) == -EAGAIN)
;
- if (err < 0)
+ if (err == -EINTR)
break;
+ else if (err < 0) {
+ if (err != preverr) {
+ printk(KERN_WARNING "%s: unexpected error "
+ "from svc_recv (%d)\n", __func__, -err);
+ preverr = err;
+ }
+ schedule_timeout_uninterruptible(HZ);
+ continue;
+ }
+
update_thread_usage(atomic_read(&nfsd_busy));
atomic_inc(&nfsd_busy);
/* Lock the export hash tables for reading. */
exp_readlock();
- /* Process request with signals blocked. */
+ /* Process request with signals blocked. */
sigprocmask(SIG_SETMASK, &allowed_mask, NULL);
svc_process(rqstp);
@@ -479,25 +495,23 @@ nfsd(struct svc_rqst *rqstp)
exp_readunlock();
update_thread_usage(atomic_read(&nfsd_busy));
atomic_dec(&nfsd_busy);
- }
- if (err != -EINTR)
- printk(KERN_WARNING "nfsd: terminating on error %d\n", -err);
+ /* reset err in case kthread_stop is called */
+ err = 0;
+ }
/* Clear signals before calling svc_exit_thread() */
flush_signals(current);
mutex_lock(&nfsd_mutex);
-
nfsdstats.th_cnt --;
out:
/* Release the thread */
svc_exit_thread(rqstp);
-
- /* Release module */
mutex_unlock(&nfsd_mutex);
- module_put_and_exit(0);
+
+ return 0;
}
static __be32 map_new_errors(u32 vers, __be32 nfserr)
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 4b54c5f..011d6d8 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -22,7 +22,7 @@
/*
* This is the RPC server thread function prototype
*/
-typedef void (*svc_thread_fn)(struct svc_rqst *);
+typedef int (*svc_thread_fn)(void *);
/*
*
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 7bffaff..f461da2 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -18,6 +18,7 @@
#include <linux/mm.h>
#include <linux/interrupt.h>
#include <linux/module.h>
+#include <linux/kthread.h>
#include <linux/sunrpc/types.h>
#include <linux/sunrpc/xdr.h>
@@ -291,15 +292,14 @@ svc_pool_map_put(void)
/*
- * Set the current thread's cpus_allowed mask so that it
+ * Set the given thread's cpus_allowed mask so that it
* will only run on cpus in the given pool.
- *
- * Returns 1 and fills in oldmask iff a cpumask was applied.
*/
-static inline int
-svc_pool_map_set_cpumask(unsigned int pidx, cpumask_t *oldmask)
+static inline void
+svc_pool_map_set_cpumask(struct task_struct *task, unsigned int pidx)
{
struct svc_pool_map *m = &svc_pool_map;
+ unsigned int node = m->pool_to[pidx];
/*
* The caller checks for sv_nrpools > 1, which
@@ -307,26 +307,17 @@ svc_pool_map_set_cpumask(unsigned int pidx, cpumask_t *oldmask)
*/
BUG_ON(m->count == 0);
- switch (m->mode)
- {
- default:
- return 0;
+ switch (m->mode) {
case SVC_POOL_PERCPU:
{
- unsigned int cpu = m->pool_to[pidx];
-
- *oldmask = current->cpus_allowed;
- set_cpus_allowed_ptr(current, &cpumask_of_cpu(cpu));
- return 1;
+ set_cpus_allowed_ptr(task, &cpumask_of_cpu(node));
+ break;
}
case SVC_POOL_PERNODE:
{
- unsigned int node = m->pool_to[pidx];
node_to_cpumask_ptr(nodecpumask, node);
-
- *oldmask = current->cpus_allowed;
- set_cpus_allowed_ptr(current, nodecpumask);
- return 1;
+ set_cpus_allowed_ptr(task, nodecpumask);
+ break;
}
}
}
@@ -579,47 +570,6 @@ out_enomem:
EXPORT_SYMBOL(svc_prepare_thread);
/*
- * Create a thread in the given pool. Caller must hold BKL or another lock to
- * serialize access to the svc_serv struct. On a NUMA or SMP machine, with a
- * multi-pool serv, the thread will be restricted to run on the cpus belonging
- * to the pool.
- */
-static int
-__svc_create_thread(svc_thread_fn func, struct svc_serv *serv,
- struct svc_pool *pool)
-{
- struct svc_rqst *rqstp;
- int error = -ENOMEM;
- int have_oldmask = 0;
- cpumask_t uninitialized_var(oldmask);
-
- rqstp = svc_prepare_thread(serv, pool);
- if (IS_ERR(rqstp)) {
- error = PTR_ERR(rqstp);
- goto out;
- }
-
- if (serv->sv_nrpools > 1)
- have_oldmask = svc_pool_map_set_cpumask(pool->sp_id, &oldmask);
-
- error = kernel_thread((int (*)(void *)) func, rqstp, 0);
-
- if (have_oldmask)
- set_cpus_allowed(current, oldmask);
-
- if (error < 0)
- goto out_thread;
- svc_sock_update_bufs(serv);
- error = 0;
-out:
- return error;
-
-out_thread:
- svc_exit_thread(rqstp);
- goto out;
-}
-
-/*
* Choose a pool in which to create a new thread, for svc_set_num_threads
*/
static inline struct svc_pool *
@@ -688,7 +638,9 @@ found_pool:
int
svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
{
- struct task_struct *victim;
+ struct svc_rqst *rqstp;
+ struct task_struct *task;
+ struct svc_pool *chosen_pool;
int error = 0;
unsigned int state = serv->sv_nrthreads-1;
@@ -704,18 +656,32 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
/* create new threads */
while (nrservs > 0) {
nrservs--;
- __module_get(serv->sv_module);
- error = __svc_create_thread(serv->sv_function, serv,
- choose_pool(serv, pool, &state));
- if (error < 0) {
- module_put(serv->sv_module);
+ chosen_pool = choose_pool(serv, pool, &state);
+
+ rqstp = svc_prepare_thread(serv, chosen_pool);
+ if (IS_ERR(rqstp)) {
+ error = PTR_ERR(rqstp);
break;
}
+
+ task = kthread_create(serv->sv_function, rqstp, serv->sv_name);
+ if (IS_ERR(task)) {
+ error = PTR_ERR(task);
+ svc_exit_thread(rqstp);
+ break;
+ }
+
+ rqstp->rq_task = task;
+ if (serv->sv_nrpools > 1)
+ svc_pool_map_set_cpumask(task, chosen_pool->sp_id);
+
+ svc_sock_update_bufs(serv);
+ wake_up_process(task);
}
/* destroy old threads */
while (nrservs < 0 &&
- (victim = choose_victim(serv, pool, &state)) != NULL) {
- send_sig(serv->sv_kill_signal, victim, 1);
+ (task = choose_victim(serv, pool, &state)) != NULL) {
+ send_sig(serv->sv_kill_signal, task, 1);
nrservs++;
}
--
1.5.3.6
^ permalink raw reply related
* [PATCH 2/5] knfsd: clean up nfsd filesystem interfaces
From: Jeff Layton @ 2008-06-06 15:12 UTC (permalink / raw)
To: bfields; +Cc: linux-nfs, nfsv4, neilb, gnb-cP1dWloDopni96+mSzHFpQC/G2K4zDHf
In-Reply-To: <1212765171-26042-2-git-send-email-jlayton@redhat.com>
Several of the nfsd filesystem interfaces allow changes to parameters
that don't have any effect on a running nfsd service. They are only ever
checked when nfsd is started. This patch fixes it so that changes to
those procfiles return -EBUSY if nfsd is already running to make it
clear that changes on the fly don't work.
The patch should also close some relatively harmless races between
changing the info in those interfaces and starting nfsd, since these
variables are being moved under the protection of the nfsd_mutex.
Finally, the nfsv4recoverydir file always returns -EINVAL if read. This
patch fixes it to return the recoverydir path as expected.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
fs/nfsd/nfs4state.c | 17 ++++++++++---
fs/nfsd/nfsctl.c | 66 +++++++++++++++++++++++++++++++++++++++++----------
fs/nfsd/nfssvc.c | 8 ++++++
3 files changed, 74 insertions(+), 17 deletions(-)
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 8799b87..c602aba 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -3249,12 +3249,14 @@ nfs4_state_shutdown(void)
nfs4_unlock_state();
}
+/*
+ * user_recovery_dirname is protected by the nfsd_mutex since it's only
+ * accessed when nfsd is starting.
+ */
static void
nfs4_set_recdir(char *recdir)
{
- nfs4_lock_state();
strcpy(user_recovery_dirname, recdir);
- nfs4_unlock_state();
}
/*
@@ -3278,6 +3280,12 @@ nfs4_reset_recoverydir(char *recdir)
return status;
}
+char *
+nfs4_recoverydir(void)
+{
+ return user_recovery_dirname;
+}
+
/*
* Called when leasetime is changed.
*
@@ -3286,11 +3294,12 @@ nfs4_reset_recoverydir(char *recdir)
* we start to register any changes in lease time. If the administrator
* really wants to change the lease time *now*, they can go ahead and bring
* nfsd down and then back up again after changing the lease time.
+ *
+ * user_lease_time is protected by nfsd_mutex since it's only really accessed
+ * when nfsd is starting
*/
void
nfs4_reset_lease(time_t leasetime)
{
- lock_kernel();
user_lease_time = leasetime;
- unlock_kernel();
}
diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index 049d2a9..2c2eb87 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -509,7 +509,7 @@ out_free:
return rv;
}
-static ssize_t write_versions(struct file *file, char *buf, size_t size)
+static ssize_t __write_versions(struct file *file, char *buf, size_t size)
{
/*
* Format:
@@ -572,6 +572,16 @@ static ssize_t write_versions(struct file *file, char *buf, size_t size)
return len;
}
+static ssize_t write_versions(struct file *file, char *buf, size_t size)
+{
+ ssize_t rv;
+
+ mutex_lock(&nfsd_mutex);
+ rv = __write_versions(file, buf, size);
+ mutex_unlock(&nfsd_mutex);
+ return rv;
+}
+
static ssize_t __write_ports(struct file *file, char *buf, size_t size)
{
if (size == 0) {
@@ -675,6 +685,7 @@ static ssize_t __write_ports(struct file *file, char *buf, size_t size)
static ssize_t write_ports(struct file *file, char *buf, size_t size)
{
ssize_t rv;
+
mutex_lock(&nfsd_mutex);
rv = __write_ports(file, buf, size);
mutex_unlock(&nfsd_mutex);
@@ -714,16 +725,17 @@ static ssize_t write_maxblksize(struct file *file, char *buf, size_t size)
#ifdef CONFIG_NFSD_V4
extern time_t nfs4_leasetime(void);
-static ssize_t write_leasetime(struct file *file, char *buf, size_t size)
+static ssize_t __write_leasetime(struct file *file, char *buf, size_t size)
{
/* if size > 10 seconds, call
* nfs4_reset_lease() then write out the new lease (seconds) as reply
*/
char *mesg = buf;
- int rv;
+ int rv, lease;
if (size > 0) {
- int lease;
+ if (nfsd_serv)
+ return -EBUSY;
rv = get_int(&mesg, &lease);
if (rv)
return rv;
@@ -735,24 +747,52 @@ static ssize_t write_leasetime(struct file *file, char *buf, size_t size)
return strlen(buf);
}
-static ssize_t write_recoverydir(struct file *file, char *buf, size_t size)
+static ssize_t write_leasetime(struct file *file, char *buf, size_t size)
+{
+ ssize_t rv;
+
+ mutex_lock(&nfsd_mutex);
+ rv = __write_leasetime(file, buf, size);
+ mutex_unlock(&nfsd_mutex);
+ return rv;
+}
+
+extern char *nfs4_recoverydir(void);
+
+static ssize_t __write_recoverydir(struct file *file, char *buf, size_t size)
{
char *mesg = buf;
char *recdir;
int len, status;
- if (size == 0 || size > PATH_MAX || buf[size-1] != '\n')
- return -EINVAL;
- buf[size-1] = 0;
+ if (size > 0) {
+ if (nfsd_serv)
+ return -EBUSY;
+ if (size > PATH_MAX || buf[size-1] != '\n')
+ return -EINVAL;
+ buf[size-1] = 0;
- recdir = mesg;
- len = qword_get(&mesg, recdir, size);
- if (len <= 0)
- return -EINVAL;
+ recdir = mesg;
+ len = qword_get(&mesg, recdir, size);
+ if (len <= 0)
+ return -EINVAL;
- status = nfs4_reset_recoverydir(recdir);
+ status = nfs4_reset_recoverydir(recdir);
+ }
+ sprintf(buf, "%s\n", nfs4_recoverydir());
return strlen(buf);
}
+
+static ssize_t write_recoverydir(struct file *file, char *buf, size_t size)
+{
+ ssize_t rv;
+
+ mutex_lock(&nfsd_mutex);
+ rv = __write_recoverydir(file, buf, size);
+ mutex_unlock(&nfsd_mutex);
+ return rv;
+}
+
#endif
/*----------------------------------------------------------------------------*/
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 512bd04..929af23 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -70,6 +70,14 @@ static DEFINE_SPINLOCK(nfsd_call_lock);
* Transitions of the thread count between zero and non-zero are of particular
* interest since the svc_serv needs to be created and initialized at that
* point, or freed.
+ *
+ * Finally, the nfsd_mutex also protects some of the global variables that are
+ * accessed when nfsd starts and that are settable via the write_* routines in
+ * nfsctl.c. In particular:
+ *
+ * user_recovery_dirname
+ * user_lease_time
+ * nfsd_versions
*/
DEFINE_MUTEX(nfsd_mutex);
struct svc_serv *nfsd_serv;
--
1.5.3.6
^ permalink raw reply related
* [PATCH 3/5] knfsd: remove special handling for SIGHUP
From: Jeff Layton @ 2008-06-06 15:12 UTC (permalink / raw)
To: bfields; +Cc: linux-nfs, nfsv4, neilb, gnb-cP1dWloDopni96+mSzHFpQC/G2K4zDHf
In-Reply-To: <1212765171-26042-3-git-send-email-jlayton@redhat.com>
The special handling for SIGHUP in knfsd is a holdover from much
earlier versions of Linux where reloading the export table was
more expensive. That facility is not really needed anymore and
to my knowledge, is seldom-used.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
fs/nfsd/nfssvc.c | 33 ++++++++-------------------------
1 files changed, 8 insertions(+), 25 deletions(-)
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 929af23..6339cb7 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -44,11 +44,6 @@
* when not handling a request. i.e. when waiting
*/
#define SHUTDOWN_SIGS (sigmask(SIGKILL) | sigmask(SIGHUP) | sigmask(SIGINT) | sigmask(SIGQUIT))
-/* if the last thread dies with SIGHUP, then the exports table is
- * left unchanged ( like 2.4-{0-9} ). Any other signal will clear
- * the exports table (like 2.2).
- */
-#define SIG_NOCLEAN SIGHUP
extern struct svc_program nfsd_program;
static void nfsd(struct svc_rqst *rqstp);
@@ -175,7 +170,6 @@ int nfsd_nrthreads(void)
return nfsd_serv->sv_nrthreads;
}
-static int killsig; /* signal that was used to kill last nfsd */
static void nfsd_last_thread(struct svc_serv *serv)
{
/* When last nfsd thread exits we need to do some clean-up */
@@ -186,11 +180,9 @@ static void nfsd_last_thread(struct svc_serv *serv)
nfsd_racache_shutdown();
nfs4_state_shutdown();
- printk(KERN_WARNING "nfsd: last server has exited\n");
- if (killsig != SIG_NOCLEAN) {
- printk(KERN_WARNING "nfsd: unexporting all filesystems\n");
- nfsd_export_flush();
- }
+ printk(KERN_WARNING "nfsd: last server has exited, flushing export "
+ "cache\n");
+ nfsd_export_flush();
}
void nfsd_reset_versions(void)
@@ -242,10 +234,9 @@ int nfsd_create_serv(void)
}
atomic_set(&nfsd_busy, 0);
- nfsd_serv = svc_create_pooled(&nfsd_program,
- nfsd_max_blksize,
- nfsd_last_thread,
- nfsd, SIG_NOCLEAN, THIS_MODULE);
+ nfsd_serv = svc_create_pooled(&nfsd_program, nfsd_max_blksize,
+ nfsd_last_thread, nfsd, SIGINT,
+ THIS_MODULE);
if (nfsd_serv == NULL)
err = -ENOMEM;
@@ -490,17 +481,9 @@ nfsd(struct svc_rqst *rqstp)
atomic_dec(&nfsd_busy);
}
- if (err != -EINTR) {
+ if (err != -EINTR)
printk(KERN_WARNING "nfsd: terminating on error %d\n", -err);
- } else {
- unsigned int signo;
-
- for (signo = 1; signo <= _NSIG; signo++)
- if (sigismember(¤t->pending.signal, signo) &&
- !sigismember(¤t->blocked, signo))
- break;
- killsig = signo;
- }
+
/* Clear signals before calling svc_exit_thread() */
flush_signals(current);
--
1.5.3.6
^ permalink raw reply related
* [PATCH 1/5] knfsd: Replace lock_kernel with a mutex for nfsd thread startup/shutdown locking.
From: Jeff Layton @ 2008-06-06 15:12 UTC (permalink / raw)
To: bfields; +Cc: linux-nfs, nfsv4, gnb
In-Reply-To: <1212765171-26042-1-git-send-email-jlayton@redhat.com>
From: Neil Brown <neilb@suse.de>
This removes the BKL from the RPC service creation codepath. The BKL
really isn't adequate for this job since some of this info needs
protection across sleeps.
Also, add some comments to try and clarify how the locking should work
and to make it clear that the BKL isn't necessary as long as there is
adequate locking between tasks when touching the svc_serv fields.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
fs/nfsd/nfsctl.c | 37 +++++++++++++++++++++++--------------
fs/nfsd/nfssvc.c | 45 ++++++++++++++++++++++++++++++++-------------
include/linux/nfsd/nfsd.h | 1 +
net/sunrpc/svc.c | 15 +++++++++------
4 files changed, 65 insertions(+), 33 deletions(-)
diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index 5ac00c4..049d2a9 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -450,22 +450,26 @@ static ssize_t write_pool_threads(struct file *file, char *buf, size_t size)
int i;
int rv;
int len;
- int npools = nfsd_nrpools();
+ int npools;
int *nthreads;
+ mutex_lock(&nfsd_mutex);
+ npools = nfsd_nrpools();
if (npools == 0) {
/*
* NFS is shut down. The admin can start it by
* writing to the threads file but NOT the pool_threads
* file, sorry. Report zero threads.
*/
+ mutex_unlock(&nfsd_mutex);
strcpy(buf, "0\n");
return strlen(buf);
}
nthreads = kcalloc(npools, sizeof(int), GFP_KERNEL);
+ rv = -ENOMEM;
if (nthreads == NULL)
- return -ENOMEM;
+ goto out_free;
if (size > 0) {
for (i = 0; i < npools; i++) {
@@ -496,10 +500,12 @@ static ssize_t write_pool_threads(struct file *file, char *buf, size_t size)
mesg += len;
}
+ mutex_unlock(&nfsd_mutex);
return (mesg-buf);
out_free:
kfree(nthreads);
+ mutex_unlock(&nfsd_mutex);
return rv;
}
@@ -566,14 +572,13 @@ static ssize_t write_versions(struct file *file, char *buf, size_t size)
return len;
}
-static ssize_t write_ports(struct file *file, char *buf, size_t size)
+static ssize_t __write_ports(struct file *file, char *buf, size_t size)
{
if (size == 0) {
int len = 0;
- lock_kernel();
+
if (nfsd_serv)
len = svc_xprt_names(nfsd_serv, buf, 0);
- unlock_kernel();
return len;
}
/* Either a single 'fd' number is written, in which
@@ -603,9 +608,7 @@ static ssize_t write_ports(struct file *file, char *buf, size_t size)
/* Decrease the count, but don't shutdown the
* the service
*/
- lock_kernel();
nfsd_serv->sv_nrthreads--;
- unlock_kernel();
}
return err < 0 ? err : 0;
}
@@ -614,10 +617,8 @@ static ssize_t write_ports(struct file *file, char *buf, size_t size)
int len = 0;
if (!toclose)
return -ENOMEM;
- lock_kernel();
if (nfsd_serv)
len = svc_sock_names(buf, nfsd_serv, toclose);
- unlock_kernel();
if (len >= 0)
lockd_down();
kfree(toclose);
@@ -655,7 +656,6 @@ static ssize_t write_ports(struct file *file, char *buf, size_t size)
if (sscanf(&buf[1], "%15s %4d", transport, &port) == 2) {
if (port == 0)
return -EINVAL;
- lock_kernel();
if (nfsd_serv) {
xprt = svc_find_xprt(nfsd_serv, transport,
AF_UNSPEC, port);
@@ -666,13 +666,22 @@ static ssize_t write_ports(struct file *file, char *buf, size_t size)
} else
err = -ENOTCONN;
}
- unlock_kernel();
return err < 0 ? err : 0;
}
}
return -EINVAL;
}
+static ssize_t write_ports(struct file *file, char *buf, size_t size)
+{
+ ssize_t rv;
+ mutex_lock(&nfsd_mutex);
+ rv = __write_ports(file, buf, size);
+ mutex_unlock(&nfsd_mutex);
+ return rv;
+}
+
+
int nfsd_max_blksize;
static ssize_t write_maxblksize(struct file *file, char *buf, size_t size)
@@ -691,13 +700,13 @@ static ssize_t write_maxblksize(struct file *file, char *buf, size_t size)
if (bsize > NFSSVC_MAXBLKSIZE)
bsize = NFSSVC_MAXBLKSIZE;
bsize &= ~(1024-1);
- lock_kernel();
+ mutex_lock(&nfsd_mutex);
if (nfsd_serv && nfsd_serv->sv_nrthreads) {
- unlock_kernel();
+ mutex_unlock(&nfsd_mutex);
return -EBUSY;
}
nfsd_max_blksize = bsize;
- unlock_kernel();
+ mutex_unlock(&nfsd_mutex);
}
return sprintf(buf, "%d\n", nfsd_max_blksize);
}
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 941041f..512bd04 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -53,11 +53,27 @@
extern struct svc_program nfsd_program;
static void nfsd(struct svc_rqst *rqstp);
struct timeval nfssvc_boot;
- struct svc_serv *nfsd_serv;
static atomic_t nfsd_busy;
static unsigned long nfsd_last_call;
static DEFINE_SPINLOCK(nfsd_call_lock);
+/*
+ * nfsd_mutex protects nfsd_serv -- both the pointer itself and the members
+ * of the svc_serv struct. In particular, ->sv_nrthreads but also to some
+ * extent ->sv_temp_socks and ->sv_permsocks. It also protects nfsdstats.th_cnt
+ *
+ * If (out side the lock) nfsd_serv is non-NULL, then it must point to a
+ * properly initialised 'struct svc_serv' with ->sv_nrthreads > 0. That number
+ * of nfsd threads must exist and each must listed in ->sp_all_threads in each
+ * entry of ->sv_pools[].
+ *
+ * Transitions of the thread count between zero and non-zero are of particular
+ * interest since the svc_serv needs to be created and initialized at that
+ * point, or freed.
+ */
+DEFINE_MUTEX(nfsd_mutex);
+struct svc_serv *nfsd_serv;
+
#if defined(CONFIG_NFSD_V2_ACL) || defined(CONFIG_NFSD_V3_ACL)
static struct svc_stat nfsd_acl_svcstats;
static struct svc_version * nfsd_acl_version[] = {
@@ -190,13 +206,14 @@ void nfsd_reset_versions(void)
}
}
+
int nfsd_create_serv(void)
{
int err = 0;
- lock_kernel();
+
+ WARN_ON(!mutex_is_locked(&nfsd_mutex));
if (nfsd_serv) {
svc_get(nfsd_serv);
- unlock_kernel();
return 0;
}
if (nfsd_max_blksize == 0) {
@@ -223,7 +240,7 @@ int nfsd_create_serv(void)
nfsd, SIG_NOCLEAN, THIS_MODULE);
if (nfsd_serv == NULL)
err = -ENOMEM;
- unlock_kernel();
+
do_gettimeofday(&nfssvc_boot); /* record boot time */
return err;
}
@@ -282,6 +299,8 @@ int nfsd_set_nrthreads(int n, int *nthreads)
int tot = 0;
int err = 0;
+ WARN_ON(!mutex_is_locked(&nfsd_mutex));
+
if (nfsd_serv == NULL || n <= 0)
return 0;
@@ -316,7 +335,6 @@ int nfsd_set_nrthreads(int n, int *nthreads)
nthreads[0] = 1;
/* apply the new numbers */
- lock_kernel();
svc_get(nfsd_serv);
for (i = 0; i < n; i++) {
err = svc_set_num_threads(nfsd_serv, &nfsd_serv->sv_pools[i],
@@ -325,7 +343,6 @@ int nfsd_set_nrthreads(int n, int *nthreads)
break;
}
svc_destroy(nfsd_serv);
- unlock_kernel();
return err;
}
@@ -334,8 +351,8 @@ int
nfsd_svc(unsigned short port, int nrservs)
{
int error;
-
- lock_kernel();
+
+ mutex_lock(&nfsd_mutex);
dprintk("nfsd: creating service\n");
error = -EINVAL;
if (nrservs <= 0)
@@ -363,7 +380,7 @@ nfsd_svc(unsigned short port, int nrservs)
failure:
svc_destroy(nfsd_serv); /* Release server */
out:
- unlock_kernel();
+ mutex_unlock(&nfsd_mutex);
return error;
}
@@ -399,7 +416,7 @@ nfsd(struct svc_rqst *rqstp)
sigset_t shutdown_mask, allowed_mask;
/* Lock module and set up kernel thread */
- lock_kernel();
+ mutex_lock(&nfsd_mutex);
daemonize("nfsd");
/* After daemonize() this kernel thread shares current->fs
@@ -417,11 +434,13 @@ nfsd(struct svc_rqst *rqstp)
siginitsetinv(&shutdown_mask, SHUTDOWN_SIGS);
siginitsetinv(&allowed_mask, ALLOWED_SIGS);
+
nfsdstats.th_cnt++;
rqstp->rq_task = current;
- unlock_kernel();
+ mutex_unlock(&nfsd_mutex);
+
/*
* We want less throttling in balance_dirty_pages() so that nfs to
@@ -477,7 +496,7 @@ nfsd(struct svc_rqst *rqstp)
/* Clear signals before calling svc_exit_thread() */
flush_signals(current);
- lock_kernel();
+ mutex_lock(&nfsd_mutex);
nfsdstats.th_cnt --;
@@ -486,7 +505,7 @@ out:
svc_exit_thread(rqstp);
/* Release module */
- unlock_kernel();
+ mutex_unlock(&nfsd_mutex);
module_put_and_exit(0);
}
diff --git a/include/linux/nfsd/nfsd.h b/include/linux/nfsd/nfsd.h
index 41d30c9..88d85b9 100644
--- a/include/linux/nfsd/nfsd.h
+++ b/include/linux/nfsd/nfsd.h
@@ -54,6 +54,7 @@ typedef int (*nfsd_dirop_t)(struct inode *, struct dentry *, int, int);
extern struct svc_program nfsd_program;
extern struct svc_version nfsd_version2, nfsd_version3,
nfsd_version4;
+extern struct mutex nfsd_mutex;
extern struct svc_serv *nfsd_serv;
extern struct seq_operations nfs_exports_op;
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 01c7e31..7bffaff 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -461,7 +461,8 @@ svc_create_pooled(struct svc_program *prog, unsigned int bufsize,
EXPORT_SYMBOL(svc_create_pooled);
/*
- * Destroy an RPC service. Should be called with the BKL held
+ * Destroy an RPC service. Should be called with appropriate locking to
+ * protect the sv_nrthreads, sv_permsocks and sv_tempsocks.
*/
void
svc_destroy(struct svc_serv *serv)
@@ -578,9 +579,10 @@ out_enomem:
EXPORT_SYMBOL(svc_prepare_thread);
/*
- * Create a thread in the given pool. Caller must hold BKL.
- * On a NUMA or SMP machine, with a multi-pool serv, the thread
- * will be restricted to run on the cpus belonging to the pool.
+ * Create a thread in the given pool. Caller must hold BKL or another lock to
+ * serialize access to the svc_serv struct. On a NUMA or SMP machine, with a
+ * multi-pool serv, the thread will be restricted to run on the cpus belonging
+ * to the pool.
*/
static int
__svc_create_thread(svc_thread_fn func, struct svc_serv *serv,
@@ -674,7 +676,7 @@ found_pool:
* of threads the given number. If `pool' is non-NULL, applies
* only to threads in that pool, otherwise round-robins between
* all pools. Must be called with a svc_get() reference and
- * the BKL held.
+ * the BKL or another lock to protect access to svc_serv fields.
*
* Destroying threads relies on the service threads filling in
* rqstp->rq_task, which only the nfs ones do. Assumes the serv
@@ -722,7 +724,8 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
EXPORT_SYMBOL(svc_set_num_threads);
/*
- * Called from a server thread as it's exiting. Caller must hold BKL.
+ * Called from a server thread as it's exiting. Caller must hold the BKL or
+ * the "service mutex", whichever is appropriate for the service.
*/
void
svc_exit_thread(struct svc_rqst *rqstp)
--
1.5.3.6
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox