* Problems with large number of clients and reads
@ 2008-06-03 18:50 Norman Weathers
2008-06-04 13:49 ` Chuck Lever
` (2 more replies)
0 siblings, 3 replies; 41+ messages in thread
From: Norman Weathers @ 2008-06-03 18:50 UTC (permalink / raw)
To: linux-nfs
Hello all,
We are having some issues with some of our high-throughput NFS servers.
Here is the setup: we are running a vanilla 2.6.22.14 kernel on a node
with two dual-core Intel CPUs (3 GHz) and 16 GB of RAM. The files
being served are around 2 GB each, and usually 3 to 5 of them are
being read, so once read they fit into memory nicely. When all is
working correctly, we have a perfectly filled cache with almost no disk
activity.
When we have heavy NFS activity (say, 600 to 1200 clients) connecting to
the server(s), they can get into a state where all of memory is in use,
yet they are dropping cache. slabtop is showing 13 GB of memory
being used by the size-4096 slab object. We have two Ethernet channels
bonded, so we see in excess of 240 MB/s of data flowing out of the box,
and all of a sudden disk activity rises to 185 MB/s. This
happens if we are using 8 or more nfs threads. If we limit the threads
to 6 or fewer, this doesn't happen. Of course, we are then starving clients,
but at least the jobs that my customers are submitting are
progressing. The question becomes: what is causing the memory to be
used up by the size-4096 slab object? Why, when a bunch of clients
suddenly ask for data, does this object grow from 100 MB to 13
GB? I have set the memory settings to something that I thought was
reasonable.
Here is some more of the particulars:
sysctl.conf tcp memory settings:
# NFS Tuning Parameters
sunrpc.udp_slot_table_entries = 128
sunrpc.tcp_slot_table_entries = 128
vm.overcommit_ratio = 80
net.core.rmem_max=524288
net.core.rmem_default=262144
net.core.wmem_max=524288
net.core.wmem_default=262144
net.ipv4.tcp_rmem = 8192 262144 524288
net.ipv4.tcp_wmem = 8192 262144 524288
net.ipv4.tcp_sack=0
net.ipv4.tcp_timestamps=0
vm.min_free_kbytes=50000
vm.overcommit_memory=1
net.ipv4.tcp_reordering=127
# Enable tcp_low_latency
net.ipv4.tcp_low_latency=1
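As a rough sanity check on the TCP settings above, the sketch below estimates an upper bound on socket buffer memory if every client held a single TCP connection with its send and receive buffers at the configured default or maximum. The one-connection-per-client model and the helper name socket_buffer_mib are assumptions made for illustration; nothing here was measured on the servers in question.

```python
# Back-of-the-envelope bound on TCP socket buffer memory.
# Assumption (not from the thread): one TCP connection per client,
# each with one send buffer and one receive buffer of the same size.

TCP_BUF_DEFAULT = 262144  # middle value of net.ipv4.tcp_rmem / tcp_wmem above
TCP_BUF_MAX = 524288      # last value of net.ipv4.tcp_rmem / tcp_wmem above


def socket_buffer_mib(clients: int, per_direction_bytes: int) -> float:
    """Return MiB consumed if every client fills both of its buffers."""
    return clients * 2 * per_direction_bytes / 2**20


for clients in (600, 1200):
    print(
        clients,
        "clients:",
        socket_buffer_mib(clients, TCP_BUF_DEFAULT),
        "MiB at default,",
        socket_buffer_mib(clients, TCP_BUF_MAX),
        "MiB at max",
    )
```

Under these assumptions, even 1200 clients at the maximum buffer size come to roughly 1.2 GiB, so socket buffers sized by these sysctls would not by themselves appear to explain 13 GB in size-4096.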
Here is a current reading from a slabtop of a system where this error is
happening:
3007154 3007154 100% 4.00K 3007154 1 12028616K size-4096
Note the size of the object cache; usually it is 50 to 100 MB (I have
another box with 32 threads and the same settings which is bouncing
between 50 and 128 MB right now).
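The slabtop row above can be decoded with a few lines of Python. This is a minimal sketch that assumes slabtop's default column order (OBJS, ACTIVE, USE, OBJ SIZE, SLABS, OBJ/SLAB, CACHE SIZE, NAME); the function name parse_slabtop_row is hypothetical.

```python
# Decode one slabtop row and cross-check the reported cache size.
# Assumed column order (slabtop default):
# OBJS ACTIVE USE OBJ-SIZE SLABS OBJ/SLAB CACHE-SIZE NAME


def parse_slabtop_row(row: str) -> dict:
    """Split a slabtop row into the fields needed for the size check."""
    objs, _active, _use, obj_size, _slabs, _per_slab, cache_size, name = row.split()
    return {
        "name": name,
        "objs": int(objs),
        "obj_size_kib": float(obj_size.rstrip("K")),
        "cache_size_kib": int(cache_size.rstrip("K")),
    }


row = "3007154 3007154 100% 4.00K 3007154 1 12028616K size-4096"
info = parse_slabtop_row(row)

# Object count times object size should match the CACHE SIZE column.
computed_kib = info["objs"] * info["obj_size_kib"]
cache_gib = round(info["cache_size_kib"] / 2**20, 2)
print(info["name"], computed_kib, "KiB, about", cache_gib, "GiB")
```

The product of the object count and the 4 KiB object size reproduces the 12028616K in the CACHE SIZE column, which is about 11.47 GiB out of the 16 GB in the box.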
I have a lot of client boxes that need access to these servers, and
would really benefit from having more threads, but if I increase the
number of threads, it pushes everything out of cache, forcing re-reads,
and really slows down our jobs.
Any thoughts on this?
Thanks,
Norman Weathers
* Re: Problems with large number of clients and reads
2008-06-03 18:50 Problems with large number of clients and reads Norman Weathers
@ 2008-06-04 13:49 ` Chuck Lever
[not found] ` <76bd70e30806040649h53ab5d66x8c3423c551e94f77-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-06-06 0:06 ` Dean Hildebrand
2008-06-06 16:09 ` J. Bruce Fields
2 siblings, 1 reply; 41+ messages in thread
From: Chuck Lever @ 2008-06-04 13:49 UTC (permalink / raw)
To: Norman Weathers; +Cc: linux-nfs
Hi Norman-
On Tue, Jun 3, 2008 at 2:50 PM, Norman Weathers
<norman.r.weathers-496aOtIFJR1B+Kdf37RAV9BPR1lH4CV8@public.gmane.org> wrote:
> # NFS Tuning Parameters
> sunrpc.udp_slot_table_entries = 128
> sunrpc.tcp_slot_table_entries = 128
I don't have an answer to your size-4096 question, but I do want to
note that setting the slot table entries sysctls has no effect on NFS
servers. It's a client-only setting.
Have you tried this experiment on a server where there are no special
memory tuning sysctls?
Can you describe the characteristics of your I/O workload (the
random/sequentialness of it, the size of the I/O requests, the
burstiness, etc)?
What mount options are you using on the clients, and what are your
export options on the server? (Which NFS version are you using)?
And finally, the output of uname -a on the server would be good to include.
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
* Re: Problems with large number of clients and reads
[not found] ` <76bd70e30806040649h53ab5d66x8c3423c551e94f77-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-06-04 14:13 ` Norman Weathers
2008-06-05 18:54 ` Norman Weathers
0 siblings, 1 reply; 41+ messages in thread
From: Norman Weathers @ 2008-06-04 14:13 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-nfs
On Wed, 2008-06-04 at 09:49 -0400, Chuck Lever wrote:
> Hi Norman-
>
> On Tue, Jun 3, 2008 at 2:50 PM, Norman Weathers
> <norman.r.weathers-496aOtIFJR1B+Kdf37RAV9BPR1lH4CV8@public.gmane.org> wrote:
> > # NFS Tuning Parameters
> > sunrpc.udp_slot_table_entries = 128
> > sunrpc.tcp_slot_table_entries = 128
>
> I don't have an answer to your size-4096 question, but I do want to
> note that setting the slot table entries sysctls has no effect on NFS
> servers. It's a client-only setting.
>
Ok.
> Have you tried this experiment on a server where there are no special
> memory tuning sysctls?
Unfortunately, no. I can try it today.
>
> Can you describe the characteristics of your I/O workload (the
> random/sequentialness of it, the size of the I/O requests, the
> burstiness, etc)?
The I/O pattern is somewhat random, but when functioning properly, the
files are small enough to fit into cache. Size per record is ~ 10k (can
be up to 64k).
>
> What mount options are you using on the clients, and what are your
> export options on the server? (Which NFS version are you using)?
NFSv3. Client mount options are:
rw,vers=3,rsize=1048576,wsize=1048576,acregmin=1,acregmax=15,acdirmin=0,acdirmax=0,hard,intr,proto=tcp,timeo=600,retrans=2,addr=hoeptt01
>
> And finally, the output of uname -a on the server would be good to include.
>
Linux hoeptt06 2.6.22.14.SLAB #5 SMP Wed Jan 23 15:45:40 CST 2008 x86_64
x86_64 x86_64 GNU/Linux
* Re: Problems with large number of clients and reads
2008-06-04 14:13 ` Norman Weathers
@ 2008-06-05 18:54 ` Norman Weathers
2008-06-06 14:44 ` Chuck Lever
0 siblings, 1 reply; 41+ messages in thread
From: Norman Weathers @ 2008-06-05 18:54 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-nfs
On Wed, 2008-06-04 at 09:13 -0500, Norman Weathers wrote:
> On Wed, 2008-06-04 at 09:49 -0400, Chuck Lever wrote:
> > Have you tried this experiment on a server where there are no special
> > memory tuning sysctls?
>
> Unfortunately, no. I can try it today.
>
I tried the test with no special memory settings, and I still see the
same issue. I also have noticed that even with only 3 threads running,
I can still have times where 11 GB of memory is being used for buffer
and not for disk cache. It just seems like memory is being used up if
we have a lot of requests from a lot of clients at once...
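To see where memory is going at the moment the problem occurs, the buffer, page cache, and slab figures can be pulled from /proc/meminfo. The sketch below is one way to do that; the field names (MemTotal, MemFree, Buffers, Cached, Slab) are standard /proc/meminfo entries, while the helper names and the sample values are made up for illustration and do not come from the affected servers.

```python
# Summarize how memory is split between buffers, page cache and slab.
# SAMPLE holds invented values; on a live server, pass the contents of
# /proc/meminfo to parse_meminfo() instead.


def parse_meminfo(text: str) -> dict:
    """Map each /proc/meminfo field name to its numeric value (kB for most)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def breakdown(fields: dict) -> dict:
    """Pick out the fields relevant to the cache-versus-slab question."""
    keys = ("MemTotal", "MemFree", "Buffers", "Cached", "Slab")
    return {key: fields.get(key, 0) for key in keys}


SAMPLE = """MemTotal: 16440000 kB
MemFree: 60000 kB
Buffers: 12000 kB
Cached: 900000 kB
Slab: 12500000 kB
"""

summary = breakdown(parse_meminfo(SAMPLE))
for key, value_kb in summary.items():
    print(f"{key:>9}: {value_kb / 2**20:6.2f} GiB")
```

Comparing Cached against Slab in successive readings would show whether the page cache is being squeezed out by slab growth, as the slabtop output suggests.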
* Re: Problems with large number of clients and reads
2008-06-03 18:50 Problems with large number of clients and reads Norman Weathers
2008-06-04 13:49 ` Chuck Lever
@ 2008-06-06 0:06 ` Dean Hildebrand
2008-06-09 13:20 ` Weathers, Norman R.
2008-06-06 16:09 ` J. Bruce Fields
2 siblings, 1 reply; 41+ messages in thread
From: Dean Hildebrand @ 2008-06-06 0:06 UTC (permalink / raw)
To: Norman Weathers; +Cc: linux-nfs
What is the file system? It is the one managing the cache on the server.
Dean
* Re: Problems with large number of clients and reads
2008-06-05 18:54 ` Norman Weathers
@ 2008-06-06 14:44 ` Chuck Lever
2008-06-09 13:56 ` Weathers, Norman R.
0 siblings, 1 reply; 41+ messages in thread
From: Chuck Lever @ 2008-06-06 14:44 UTC (permalink / raw)
To: Norman Weathers; +Cc: Chuck Lever, linux-nfs
Norman Weathers wrote:
> I tried the test with no special memory settings, and I still see the
> same issue. I also have noticed that even with only 3 threads running,
> I can still have times where 11 GB of memory is being used for buffer
> and not for disk cache. It just seems like memory is being used up if
> we have a lot of requests from a lot of clients at once...
I'm at a loss... but I have another question or two. Is it just memory
utilization issues that you see on the server, or are there noticeable
performance problems that crop up when you see this?
Did you mention what your physical file system is on the server? Are
you running it on an LVM or software or hardware RAID?
* Re: Problems with large number of clients and reads
2008-06-03 18:50 Problems with large number of clients and reads Norman Weathers
2008-06-04 13:49 ` Chuck Lever
2008-06-06 0:06 ` Dean Hildebrand
@ 2008-06-06 16:09 ` J. Bruce Fields
2008-06-09 14:19 ` Weathers, Norman R.
2 siblings, 1 reply; 41+ messages in thread
From: J. Bruce Fields @ 2008-06-06 16:09 UTC (permalink / raw)
To: Norman Weathers; +Cc: linux-nfs
On Tue, Jun 03, 2008 at 01:50:01PM -0500, Norman Weathers wrote:
> Hello all,
>
> We are having some issues with some high throughput servers of ours.
>
> Here is the issue, we are using a vanilla 2.6.22.14 kernel on a node
> with 2 Dual Core Intels (3 GHz) and 16 GB of ram. The files that are
> being served are around 2 GB each, and there are usually 3 to 5 of them
> being read, so once read they fit into memory nicely, and when all is
> working correctly, we have a perfectly filled cache, with almost no disk
> activity.
>
> When we have large NFS activity (say, 600 to 1200 clients) connecting to
> the server(s), they can get into a state where they are using up all of
> memory, but they are dropping cache. slabtop is showing 13 GB of memory
> being used by the size-4096 slab object. We have two ethernet channels
> bonded, so we see in excess of 240 MB/s of data flowing out of the box,
> and all of the sudden, disk activity has risen to 185 MB/s. This
> happens if we are using 8 or more nfs threads. If we limit the threads
> to 6 or less, this doesn't happen. Of course, we are starving clients,
> but at least the jobs that my customers are throwing out there are
> progressing. The question becomes, what is causing the memory to be
> used up by the slab size-4096 object? Why when all of the sudden a
> bunch of clients ask for data does this object grow from 100 MB to 13
> GB? I have set the memory settings to something that I thought was
> reasonable.
I'd've thought that suggests a leak of memory allocated by kmalloc().
Does the size-4096 cache decrease eventually, or does it stay that large
until you reboot?
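One way to answer that question is to sample /proc/slabinfo periodically and log the size of the size-4096 cache. The sketch below assumes the slabinfo 2.x column layout (name, active_objs, num_objs, objsize, ...); the function names, the sampling interval, and the sample line (rebuilt from the slabtop reading, with invented tunables) are illustrative assumptions. Reading /proc/slabinfo normally requires root.

```python
import time

# Track one slab cache over time to see whether it ever shrinks.
# Assumed /proc/slabinfo columns: name active_objs num_objs objsize ...


def slab_cache_kib(slabinfo_text: str, cache: str = "size-4096") -> int:
    """Return num_objs * objsize for the named cache in KiB (0 if absent)."""
    for line in slabinfo_text.splitlines():
        fields = line.split()
        if fields and fields[0] == cache:
            num_objs, objsize = int(fields[2]), int(fields[3])
            return num_objs * objsize // 1024
    return 0


def watch(path: str = "/proc/slabinfo", interval_s: int = 60, samples: int = 5) -> None:
    """Print a timestamped size for the cache every interval_s seconds."""
    for _ in range(samples):
        with open(path) as f:
            print(time.strftime("%H:%M:%S"), slab_cache_kib(f.read()), "KiB")
        time.sleep(interval_s)


# Sample line reconstructed from the slabtop row; tunables are invented.
SAMPLE = (
    "size-4096 3007154 3007154 4096 1 1 "
    ": tunables 24 12 8 : slabdata 3007154 3007154 0"
)
print(slab_cache_kib(SAMPLE), "KiB")
```

Calling watch() on the server after the client load has gone away would show whether the cache drains back toward its usual 50 to 100 MB or stays pinned until reboot.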
--b.
* RE: Problems with large number of clients and reads
2008-06-06 0:06 ` Dean Hildebrand
@ 2008-06-09 13:20 ` Weathers, Norman R.
0 siblings, 0 replies; 41+ messages in thread
From: Weathers, Norman R. @ 2008-06-09 13:20 UTC (permalink / raw)
To: Dean Hildebrand; +Cc: linux-nfs
(I dislike Outlook... I apologize if I end up messing up the formatting
of this message.)
The file system is XFS, about 250 GB per server. I would say that yes,
it is managing the cache on the server(s) in question. The servers
have 16 GB of memory, and the files being served are 1.9 GB each,
about 5 per server.
-----Original Message-----
From: Dean Hildebrand [mailto:seattleplus@gmail.com]
Sent: Thursday, June 05, 2008 7:06 PM
To: Weathers, Norman R.
Cc: linux-nfs@vger.kernel.org
Subject: Re: Problems with large number of clients and reads
> What is the file system? It is the one managing the cache on the server.
> Dean
* RE: Problems with large number of clients and reads
2008-06-06 14:44 ` Chuck Lever
@ 2008-06-09 13:56 ` Weathers, Norman R.
0 siblings, 0 replies; 41+ messages in thread
From: Weathers, Norman R. @ 2008-06-09 13:56 UTC (permalink / raw)
To: chuck.lever; +Cc: Chuck Lever, linux-nfs
-----Original Message-----
From: Chuck Lever [mailto:chuck.lever@oracle.com]
Sent: Fri 6/6/2008 9:44 AM
To: Weathers, Norman R.
Cc: Chuck Lever; linux-nfs@vger.kernel.org
Subject: Re: Problems with large number of clients and reads
Norman Weathers wrote:
> I tried the test with no special memory settings, and I still see the
> same issue. I also have noticed that even with only 3 threads running,
> I can still have times where 11 GB of memory is being used for buffer
> and not for disk cache. It just seems like memory is being used up if
> we have a lot of requests from a lot of clients at once...
>I'm at a loss... but I have another question or two. Is it just memory
>utilization issues that you see on the server, or are there noticeable
>performance problems that crop up when you see this?
We are seeing both, but the performance problem is odd: we have 20 of these systems,
and they slow down a lot whenever one of the other systems has this issue. It is as if one
system starts to load up on connections and requests while at the same time pushing out
every last packet it can over the network (two 1 Gb connections pushing 245 MB/s). What
else is weird is that if I restart NFS during that time, it generally causes the memory to
settle down and allows connections to move on. (Data is basically striped across these
20 nodes.) When a node has "the issue" happen, the other 19 servers slow down from 150 or
180 MB/s to 50 MB/s or less.
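
As a rough sanity check on the 245 MB/s figure, the arithmetic below compares it with the raw ceiling of two bonded 1 Gb links. This is only a sketch: it assumes decimal megabytes, ignores Ethernet/IP/TCP overhead, and assumes the bond actually spreads traffic across both links. The function name is made up for illustration.

```python
def line_rate_mb_per_s(links: int, gbit_per_link: float) -> float:
    """Raw ceiling in MB/s (decimal units), ignoring all protocol overhead."""
    bits_per_second = links * gbit_per_link * 1e9
    return bits_per_second / 8 / 1e6


ceiling = line_rate_mb_per_s(2, 1.0)   # two bonded 1 Gb links
observed = 245.0                       # figure reported in this message

print(f"ceiling {ceiling:.0f} MB/s, observed {observed:.0f} MB/s "
      f"({observed / ceiling:.0%} of raw line rate)")
```

In other words, a node in this state is sending at close to everything the two links can carry.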
>Did you mention what your physical file system is on the server? Are
>you running it on an LVM or software or hardware RAID?
The file system is XFS on a hardware RAID (HP cciss), running RAID 5 with a 64 k stripe.
On a linear read I can push ~180 MB/s from the file system itself, and with a cached file
I can easily push out the data.
^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: Problems with large number of clients and reads
2008-06-06 16:09 ` J. Bruce Fields
@ 2008-06-09 14:19 ` Weathers, Norman R.
[not found] ` <0122F800A3B64C449565A9E8C2977010155587-zIGg2qceuZx7uNL6xugVa6xOck334EZe@public.gmane.org>
0 siblings, 1 reply; 41+ messages in thread
From: Weathers, Norman R. @ 2008-06-09 14:19 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: linux-nfs
-----Original Message-----
From: J. Bruce Fields [mailto:bfields@fieldses.org]
Sent: Fri 6/6/2008 11:09 AM
To: Weathers, Norman R.
Cc: linux-nfs@vger.kernel.org
Subject: Re: Problems with large number of clients and reads
On Tue, Jun 03, 2008 at 01:50:01PM -0500, Norman Weathers wrote:
> Hello all,
>
> We are having some issues with some high throughput servers of ours.
>
> Here is the issue, we are using a vanilla 2.6.22.14 kernel on a node
> with 2 Dual Core Intels (3 GHz) and 16 GB of ram. The files that are
> being served are around 2 GB each, and there are usually 3 to 5 of them
> being read, so once read they fit into memory nicely, and when all is
> working correctly, we have a perfectly filled cache, with almost no disk
> activity.
>
> When we have large NFS activity (say, 600 to 1200 clients) connecting to
> the server(s), they can get into a state where they are using up all of
> memory, but they are dropping cache. slabtop is showing 13 GB of memory
> being used by the size-4096 slab object. We have two ethernet channels
> bonded, so we see in excess of 240 MB/s of data flowing out of the box,
> and all of the sudden, disk activity has risen to 185 MB/s. This
> happens if we are using 8 or more nfs threads. If we limit the threads
> to 6 or less, this doesn't happen. Of course, we are starving clients,
> but at least the jobs that my customers are throwing out there are
> progressing. The question becomes, what is causing the memory to be
> used up by the slab size-4096 object? Why when all of the sudden a
> bunch of clients ask for data does this object grow from 100 MB to 13
> GB? I have set the memory settings to something that I thought was
> reasonable.
>
> Here is some more of the particulars:
>
> sysctl.conf tcp memory settings:
>
> # NFS Tuning Parameters
> sunrpc.udp_slot_table_entries = 128
> sunrpc.tcp_slot_table_entries = 128
> vm.overcommit_ratio = 80
>
> net.core.rmem_max=524288
> net.core.rmem_default=262144
> net.core.wmem_max=524288
> net.core.wmem_default=262144
> net.ipv4.tcp_rmem = 8192 262144 524288
> net.ipv4.tcp_wmem = 8192 262144 524288
> net.ipv4.tcp_sack=0
> net.ipv4.tcp_timestamps=0
> vm.min_free_kbytes=50000
> vm.overcommit_memory=1
> net.ipv4.tcp_reordering=127
>
> # Enable tcp_low_latency
> net.ipv4.tcp_low_latency=1
>
> Here is a current reading from a slabtop of a system where this error is
> happening:
>
> 3007154 3007154 100% 4.00K 3007154 1 12028616K size-4096
>
> Note the size of the object cache, usually it is 50 - 100 MB (I have
> another box with 32 threads and the same settings which is bouncing
> between 50 and 128 MB right now).
>
> I have a lot of client boxes that need access to these servers, and
> would really benefit from having more threads, but if I increase the
> number of threads, it pushes everything out of cache, forcing re-reads,
> and really slows down our jobs.
>
> Any thoughts on this?
>I'd've thought that suggests a leak of memory allocated by kmalloc().
>Does the size-4096 cache decrease eventually, or does it stay that large
>until you reboot?
>--b.
I would agree that it "looks" like a memory leak. If I restart NFS, the size-4096 cache
goes from 12 GB to under 50 MB, but then depending upon how hard the box is utilized, it
starts to climb back up. I have seen it climb back up to 3 or 4 GB right after the
restart, but that is much better because the regular disk cache will grow from the 2 GB
that it was pressured into back to 5 or 8 GB, so all of the files have been reread into
memory and things are progressing smoothly. It is weird. I really think that this has
to do with a lot of connections happening at once, because I can run slabtop and see a
node that is running full out, but only have a couple hundred megs of the size-4096 slab
being used, and then turn around and see another node that is pushing out 245 MB/s and
all of a sudden using over 12 GB of the size-4096. It is very odd... If I lower the
number of threads from a usable 64 to a low of 3 threads, I have less of a chance of the
servers going haywire, to the point of being so loaded they may crash or you cannot
contact them over the network (fortunately, I have serial on these boxes so that I can
get on the nodes if they reach that point). If I run 8 threads, and with enough
clients, I can bring down one of these servers. size-4096 goes through the roof, and
depending on the hour of the day, the server can either crash or become unresponsive.
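
For reference, the slabtop line quoted earlier in the thread can be checked with a little arithmetic. The sketch below assumes the usual slabtop column order (OBJS, ACTIVE, USE, OBJ SIZE, SLABS, OBJ/SLAB, CACHE SIZE, NAME) and that "K" means KiB; the function name is made up for illustration.

```python
def parse_slabtop_line(line: str) -> dict:
    """Split one slabtop row into named fields (assumed column order)."""
    objs, active, use, obj_size, slabs, per_slab, cache_size, name = line.split()
    return {
        "objs": int(objs),
        "active": int(active),
        "use_pct": int(use.rstrip("%")),
        "obj_kib": float(obj_size.rstrip("K")),
        "slabs": int(slabs),
        "objs_per_slab": int(per_slab),
        "cache_kib": int(cache_size.rstrip("K")),
        "name": name,
    }


row = parse_slabtop_line(
    "3007154 3007154 100% 4.00K 3007154 1 12028616K size-4096"
)

# Every object is 4 KiB and sits alone in its slab, so objects * 4 KiB
# should equal the reported cache size.
computed_kib = int(row["objs"] * row["obj_kib"])
print(computed_kib == row["cache_kib"])
print(f"{row['cache_kib'] / 1024**2:.2f} GiB held by {row['name']}")
```

So the "12 GB" reading corresponds to roughly three million live 4 KiB objects, every one of them in use.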
Norman
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Problems with large number of clients and reads
[not found] ` <0122F800A3B64C449565A9E8C2977010155587-zIGg2qceuZx7uNL6xugVa6xOck334EZe@public.gmane.org>
@ 2008-06-09 18:53 ` J. Bruce Fields
2008-06-10 14:30 ` Weathers, Norman R.
0 siblings, 1 reply; 41+ messages in thread
From: J. Bruce Fields @ 2008-06-09 18:53 UTC (permalink / raw)
To: Weathers, Norman R.; +Cc: linux-nfs
On Mon, Jun 09, 2008 at 09:19:03AM -0500, Weathers, Norman R. wrote:
> >I'd've thought that suggests a leak of memory allocated by kmalloc().
>
> >Does the size-4096 cache decrease eventually, or does it stay that
> >large until you reboot?
>
> I would agree that it "looks" like a memory leak. If I restart NFS,
> the size-4096 cache goes from 12 GB to under 50 MB,
And restarting nfsd is the only thing you've found that will do this?
(So decreasing the number of threads, or stopping all the clients won't
do anything to the size-4096 number?)
> but then depending
> upon how hard the box is utilized, it starts to climb back up.
> I have
> seen it climb back up to 3 or 4 GB right after the restart, but that
> is much better because the regular disk cache will grow from the 2 GB
> that it was pressured into back to 5 or 8 GB, so all of the files have
> been reread into memory and things are progressing smoothly. It is
> weird. I really think that this has to do with a lot of connections
> happening at once, because I can run slabtop and see a node that is
> running full out, but only have a couple hundred megs of the size-4096
> slab being used, and then turn around and see another node that is
> pushing out 245 MB/s and all of the sudden using over 12 GB of the
> size-4096. It is very odd... If I lower the number of threads from a
> usable 64 to a low of 3 threads, I have less of a chance of the
> servers going haywire, to the point of being so loaded they may crash
> or you cannot contact them over the network (fortunately, I have
> serial on these boxes so that I can get on the nodes if they reach
> that point). If I run 8 threads, and with enough clients, I can bring
> down one of these servers. size-4096 goes through the roof, and
> depending on the hour of the day, the server can either crash or
> becomes unresponsive.
These are doing only NFS v2 and v3? (No v4?)
--b.
^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: Problems with large number of clients and reads
2008-06-09 18:53 ` J. Bruce Fields
@ 2008-06-10 14:30 ` Weathers, Norman R.
[not found] ` <0122F800A3B64C449565A9E8C297701002D75D9F-zIGg2qceuZx7uNL6xugVa6xOck334EZe@public.gmane.org>
0 siblings, 1 reply; 41+ messages in thread
From: Weathers, Norman R. @ 2008-06-10 14:30 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: linux-nfs
> -----Original Message-----
> From: J. Bruce Fields [mailto:bfields@fieldses.org]
> Sent: Monday, June 09, 2008 1:54 PM
> To: Weathers, Norman R.
> Cc: linux-nfs@vger.kernel.org
> Subject: Re: Problems with large number of clients and reads
>
> On Mon, Jun 09, 2008 at 09:19:03AM -0500, Weathers, Norman R. wrote:
> > >I'd've thought that suggests a leak of memory allocated by kmalloc().
> >
> > >Does the size-4096 cache decrease eventually, or does it stay that
> > >large until you reboot?
> >
> > I would agree that it "looks" like a memory leak. If I restart NFS,
> > the size-4096 cache goes from 12 GB to under 50 MB,
>
> And restarting nfsd is the only thing you've found that will do this?
> (So decreasing the number of threads, or stopping all the client won't
> do anything to the size-4096 number?)
Unfortunately, I cannot stop the clients (middle of long running jobs).
I might be able to test this soon. If I have the number of threads high,
yes I can reduce the number of threads and it appears to lower some of
the memory, but even with as little as three threads, the memory usage
climbs very high, just not as high as if there are say 8 threads. When
the memory usage climbs high, it can cause the box to not respond over
the network (ssh, rsh), and even be very sluggish when I am connected
over our serial console to the server(s). This same scenario has been
happening with kernels that I have tried from 2.6.22.x on to the 2.6.25
series. The 2.6.25 series is interesting in that I can push the same
load from a box with the 2.6.25 kernel and not have a load over .3 (with
3 threads), but with the 2.6.22.x kernel, I have a load of over 3 when I
hit the same conditions.
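
For anyone wanting to repeat the thread-count experiment without restarting NFS, here is a minimal sketch. It assumes the nfsd filesystem is mounted at /proc/fs/nfsd and that its `threads` file is what is being adjusted; the helper names are made up, and writing the file needs root.

```python
from pathlib import Path

# Assumption: the nfsd filesystem is mounted at its usual location.
NFSD_THREADS = Path("/proc/fs/nfsd/threads")


def get_nfsd_threads(path: Path = NFSD_THREADS) -> int:
    """Return the number of nfsd threads currently configured."""
    return int(path.read_text().split()[0])


def set_nfsd_threads(count: int, path: Path = NFSD_THREADS) -> None:
    """Ask the kernel for `count` nfsd threads (requires root)."""
    if count < 0:
        raise ValueError("thread count must be non-negative")
    path.write_text(f"{count}\n")
```

Calling `set_nfsd_threads(3)` and then watching size-4096 in slabtop would show whether the cache shrinks without a full restart.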
Also, this is all with the SLAB cache option. SLUB crashes every time I
use it under heavy load.
>
> > but then depending
> > upon how hard the box is utilized, it starts to climb back up.
>
> > I have
> > seen it climb back up to 3 or 4 GB right after the restart, but that
> > is much better because the regular disk cache will grow from the 2 GB
> > that it was pressured into back to 5 or 8 GB, so all of the files have
> > been reread into memory and things are progressing smoothly. It is
> > weird. I really think that this has to do with a lot of connections
> > happening at once, because I can run slabtop and see a node that is
> > running full out, but only have a couple hundred megs of the size-4096
> > slab being used, and then turn around and see another node that is
> > pushing out 245 MB/s and all of the sudden using over 12 GB of the
> > size-4096. It is very odd... If I lower the number of threads from a
> > usable 64 to a low of 3 threads, I have less of a chance of the
> > servers going haywire, to the point of being so loaded they may crash
> > or you cannot contact them over the network (fortunately, I have
> > serial on these boxes so that I can get on the nodes if they reach
> > that point). If I run 8 threads, and with enough clients, I can bring
> > down one of these servers. size-4096 goes through the roof, and
> > depending on the hour of the day, the server can either crash or
> > becomes unresponsive.
>
> These are doing only NFS v2 and v3? (No v4?)
>
> --b.
>
It should only be NFS v3 tcp.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Problems with large number of clients and reads
[not found] ` <0122F800A3B64C449565A9E8C297701002D75D9F-zIGg2qceuZx7uNL6xugVa6xOck334EZe@public.gmane.org>
@ 2008-06-10 17:16 ` J. Bruce Fields
2008-06-10 22:12 ` Weathers, Norman R.
0 siblings, 1 reply; 41+ messages in thread
From: J. Bruce Fields @ 2008-06-10 17:16 UTC (permalink / raw)
To: Weathers, Norman R.; +Cc: linux-nfs
On Tue, Jun 10, 2008 at 09:30:18AM -0500, Weathers, Norman R. wrote:
> Unfortunately, I cannot stop the clients (middle of long running
> jobs). I might be able to test this soon. If I have the number of
> threads high, yes I can reduce the number of threads and it appears to
> lower some of the memory, but even with as little as three threads,
> the memory usage climbs very high, just not as high as if there are
> say 8 threads. When the memory usage climbs high, it can cause the
> box to not respond over the network (ssh, rsh), and even be very
> sluggish when I am connected over our serial console to the server(s).
> This same scenario has been happening with kernels that I have tried
> from 2.6.22.x on to the 2.6.25 series. The 2.6.25 series is
> interesting in that I can push the same load from a box with the
> 2.6.25 kernel and not have a load over .3 (with 3 threads), but with
> the 2.6.22.x kernel, I have a load of over 3 when I hit the same
> conditions.
OK, I think what we want to do is turn on CONFIG_DEBUG_SLAB_LEAK. I've
never used it before, but it looks like it will report which functions
are allocating from each slab cache, which may be exactly what we need
to know. So:
1. Install a kernel with both CONFIG_DEBUG_SLAB ("Debug slab
memory allocations") and CONFIG_DEBUG_SLAB_LEAK ("Memory leak
debugging") turned on. They're both under the "kernel hacking"
section of the kernel config. (If you have a file
/proc/slab_allocators, then you already have these turned on and
you can skip this step.)
2. Do whatever you need to do to reproduce the problem.
3. Get a copy of /proc/slabinfo and /proc/slab_allocators.
Then we can take a look at that and see if it sheds any light.
I think that debugging will hurt the server performance, so you won't
want to keep it turned on all the time.
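
Steps 2 and 3 above could be scripted along these lines. This is a sketch only: it assumes the /proc/slabinfo 2.x layout (name, active_objs, num_objs, objsize as the first four columns), the helper names are made up, and the sample row is built from the slabtop reading earlier in the thread rather than copied from a real slabinfo file.

```python
from pathlib import Path


def snapshot(label: str, dest_dir: str = ".") -> None:
    """Copy both /proc files to <name>_<label>.txt (usually needs root)."""
    for src in ("/proc/slabinfo", "/proc/slab_allocators"):
        data = Path(src).read_text()
        Path(dest_dir, f"{Path(src).name}_{label}.txt").write_text(data)


def parse_slabinfo(text: str) -> dict:
    """Map cache name -> object counts and approximate bytes of objects."""
    caches = {}
    for line in text.splitlines():
        if line.startswith("#") or line.startswith("slabinfo"):
            continue  # skip the version line and the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        active_objs, num_objs, objsize = (int(f) for f in fields[1:4])
        caches[fields[0]] = {
            "active_objs": active_objs,
            "num_objs": num_objs,
            "objsize": objsize,
            # Object payload only; per-slab overhead is not counted here.
            "bytes": num_objs * objsize,
        }
    return caches


# Illustrative sample, not real captured output.
sample = """slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>
size-4096 3007154 3007154 4096 1 1 : tunables 24 12 8 : slabdata 3007154 3007154 0
"""

info = parse_slabinfo(sample)
print(f"size-4096: {info['size-4096']['bytes'] / 1024**3:.2f} GiB")
```

Taking a snapshot before and after the problem shows up makes it easy to diff which caches grew.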
>
> Also, this is all with the SLAB cache option. SLUB crashes everytime
> I use it under heavy load.
Have you reported the SLUB bugs to lkml?
--b.
^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: Problems with large number of clients and reads
2008-06-10 17:16 ` J. Bruce Fields
@ 2008-06-10 22:12 ` Weathers, Norman R.
[not found] ` <0122F800A3B64C449565A9E8C297701002D75DA3-zIGg2qceuZx7uNL6xugVa6xOck334EZe@public.gmane.org>
0 siblings, 1 reply; 41+ messages in thread
From: Weathers, Norman R. @ 2008-06-10 22:12 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: linux-nfs
> -----Original Message-----
> From: J. Bruce Fields [mailto:bfields@fieldses.org]
> Sent: Tuesday, June 10, 2008 12:16 PM
> To: Weathers, Norman R.
> Cc: linux-nfs@vger.kernel.org
> Subject: Re: Problems with large number of clients and reads
>
> On Tue, Jun 10, 2008 at 09:30:18AM -0500, Weathers, Norman R. wrote:
> > Unfortunately, I cannot stop the clients (middle of long running
> > jobs). I might be able to test this soon. If I have the number of
> > threads high, yes I can reduce the number of threads and it appears to
> > lower some of the memory, but even with as little as three threads,
> > the memory usage climbs very high, just not as high as if there are
> > say 8 threads. When the memory usage climbs high, it can cause the
> > box to not respond over the network (ssh, rsh), and even be very
> > sluggish when I am connected over our serial console to the server(s).
> > This same scenario has been happening with kernels that I have tried
> > from 2.6.22.x on to the 2.6.25 series. The 2.6.25 series is
> > interesting in that I can push the same load from a box with the
> > 2.6.25 kernel and not have a load over .3 (with 3 threads), but with
> > the 2.6.22.x kernel, I have a load of over 3 when I hit the same
> > conditions.
>
> OK, I think what we want to do is turn on CONFIG_DEBUG_SLAB_LEAK. I've
> never used it before, but it looks like it will report which functions
> are allocating from each slab cache, which may be exactly what we need
> to know. So:
>
> 1. Install a kernel with both CONFIG_DEBUG_SLAB ("Debug slab
> memory allocations") and CONFIG_DEBUG_SLAB_LEAK ("Memory leak
> debugging") turned on. They're both under the "kernel hacking"
> section of the kernel config. (If you have a file
> /proc/slab_allocators, then you already have these turned on and
> you can skip this step.)
>
> 2. Do whatever you need to do to reproduce the problem.
>
> 3. Get a copy of /proc/slabinfo and /proc/slab_allocators.
>
> Then we can take a look at that and see if it sheds any light.
I have taken several snapshots of the /proc/slab_allocators and
/proc/slabinfo as requested, but since there is a lot of info in them,
and I didn't think anyone wanted to go cross-eyed reading the data in an
email, I have them up on a website:
http://shashi-weathers.net/linux/cluster/NFS/
The order of data collection is:
slab_allocators_bad1.txt and corresponding slabinfo
slab_allocators_after_bad1.txt and corresponding slabinfo
slab_allocators_16_threads.txt and corresponding slabinfo
slab_allocators_16_threads_1.txt and corresponding slabinfo
slab_allocators_32_threads.txt and corresponding slabinfo
slab_allocators_really_bad.txt and corresponding slabinfo.
You will have to forgive my ignorance at this point, but I was looking
through the slabinfo and slab_allocators, and noticed that size-4096
does not show up in slab_allocators... I hope that is by design. You
can see it growing into the gigabytes in the slabinfo files....
>
> I think that debugging will hurt the server performance, so you won't
> want to keep it turned on all the time.
>
> >
> > Also, this is all with the SLAB cache option. SLUB crashes everytime
> > I use it under heavy load.
>
> Have you reported the SLUB bugs to lkml?
No, I haven't yet. I didn't know for sure if I was doing something
wrong, or if SLUB was the problem there. Since the failures, I had gone
back to using SLAB anyway, so .... I probably should...
>
> --b.
>
Norman Weathers
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Problems with large number of clients and reads
[not found] ` <0122F800A3B64C449565A9E8C297701002D75DA3-zIGg2qceuZx7uNL6xugVa6xOck334EZe@public.gmane.org>
@ 2008-06-11 18:46 ` J. Bruce Fields
2008-06-11 19:52 ` J. Bruce Fields
0 siblings, 1 reply; 41+ messages in thread
From: J. Bruce Fields @ 2008-06-11 18:46 UTC (permalink / raw)
To: Weathers, Norman R.; +Cc: linux-nfs
On Tue, Jun 10, 2008 at 05:12:31PM -0500, Weathers, Norman R. wrote:
>
>
> > -----Original Message-----
> > From: J. Bruce Fields [mailto:bfields@fieldses.org]
> > Sent: Tuesday, June 10, 2008 12:16 PM
> > To: Weathers, Norman R.
> > Cc: linux-nfs@vger.kernel.org
> > Subject: Re: Problems with large number of clients and reads
> >
> > On Tue, Jun 10, 2008 at 09:30:18AM -0500, Weathers, Norman R. wrote:
> > > Unfortunately, I cannot stop the clients (middle of long running
> > > jobs). I might be able to test this soon. If I have the number of
> > > threads high, yes I can reduce the number of threads and it appears to
> > > lower some of the memory, but even with as little as three threads,
> > > the memory usage climbs very high, just not as high as if there are
> > > say 8 threads. When the memory usage climbs high, it can cause the
> > > box to not respond over the network (ssh, rsh), and even be very
> > > sluggish when I am connected over our serial console to the server(s).
> > > This same scenario has been happening with kernels that I have tried
> > > from 2.6.22.x on to the 2.6.25 series. The 2.6.25 series is
> > > interesting in that I can push the same load from a box with the
> > > 2.6.25 kernel and not have a load over .3 (with 3 threads), but with
> > > the 2.6.22.x kernel, I have a load of over 3 when I hit the same
> > > conditions.
> >
> > OK, I think what we want to do is turn on CONFIG_DEBUG_SLAB_LEAK. I've
> > never used it before, but it looks like it will report which functions
> > are allocating from each slab cache, which may be exactly what we need
> > to know. So:
> >
> > 1. Install a kernel with both CONFIG_DEBUG_SLAB ("Debug slab
> > memory allocations") and CONFIG_DEBUG_SLAB_LEAK ("Memory leak
> > debugging") turned on. They're both under the "kernel hacking"
> > section of the kernel config. (If you have a file
> > /proc/slab_allocators, then you already have these turned on and
> > you can skip this step.)
> >
> > 2. Do whatever you need to do to reproduce the problem.
> >
> > 3. Get a copy of /proc/slabinfo and /proc/slab_allocators.
> >
> > Then we can take a look at that and see if it sheds any light.
>
>
> I have taken several snapshots of the /proc/slab_allocators and
> /proc/slabinfo as requested, but since there is a lot of info in them,
> and I didn't think anyone wanted to go cross-eyed reading the data in an
> email, I have them up on a website:
>
> http://shashi-weathers.net/linux/cluster/NFS/
Excellent.
>
> The order of data collection is:
>
> slab_allocators_bad1.txt and corresponding slabinfo
> slab_allocators_after_bad1.txt and corresponding slabinfo
> slab_allocators_16_threads.txt and corresponding slabinfo
> slab_allocators_16_threads_1.txt and corresponding slabinfo
> slab_allocators_32_threads.txt and corresponding slabinfo
> slab_allocators_really_bad.txt and corresponding slabinfo.
>
>
> You will have to forgive my ignorance at this point, but I was looking
> through the slabinfo and slab_allocators, and noticed that size-4096
> does not show up in slab_allocators... I hope that is by design. You
> can see it growing into the gigabytes in the slabinfo files....
Argh. OK, I don't understand well enough how this works. Time to ask
someone, I guess....
--b.
>
>
>
> >
> > I think that debugging will hurt the server performance, so you won't
> > want to keep it turned on all the time.
> >
> > >
> > > Also, this is all with the SLAB cache option. SLUB crashes everytime
> > > I use it under heavy load.
> >
> > Have you reported the SLUB bugs to lkml?
>
> No, I haven't yet. I didn't know for sure if I was doing something
> wrong, or if SLUB was the problem there. Since the failures, I had gone
> back to using SLAB anyway, so .... I probably should...
>
> >
> > --b.
> >
>
>
> Norman Weathers
^ permalink raw reply [flat|nested] 41+ messages in thread
* CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
2008-06-11 18:46 ` J. Bruce Fields
@ 2008-06-11 19:52 ` J. Bruce Fields
0 siblings, 0 replies; 41+ messages in thread
From: J. Bruce Fields @ 2008-06-11 19:52 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-nfs, Weathers, Norman R.
I'm probably missing something fundamental--why doesn't
/proc/slab_allocators show any results for size-x where x >= 4096?
Someone's seeing a performance problem with the linux nfs server. One
of the symptoms is the "size-4096" slab cache seems to be out of
control. I assumed that meant that memory allocated by kmalloc() might
be leaking, so figured it might be interesting to turn on
CONFIG_DEBUG_SLAB_LEAK. As far as I can tell what that does is list
kmalloc() callers in /proc/slab_allocators. But that doesn't seem to be
showing any results for size-4096. Can anyone provide a clue?
Thanks!
--b.
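
To make the question concrete, the check being described amounts to listing the caches that appear in /proc/slabinfo but have no entries in /proc/slab_allocators. The sketch below assumes each slab_allocators line looks like `cache-name: count function+offset/size`; the function names in the sample (`foo_alloc`, `bar_alloc`) are hypothetical placeholders, not real kernel symbols or real captured output.

```python
from collections import defaultdict


def parse_slab_allocators(text: str) -> dict:
    """Map cache name -> {caller symbol: outstanding allocation count}."""
    by_cache = defaultdict(dict)
    for line in text.splitlines():
        cache, sep, rest = line.partition(": ")
        if not sep:
            continue
        parts = rest.split()
        if len(parts) < 2 or not parts[0].isdigit():
            continue
        count, caller = int(parts[0]), parts[1]
        by_cache[cache][caller] = by_cache[cache].get(caller, 0) + count
    return dict(by_cache)


def missing_caches(slabinfo_names, allocators: dict) -> list:
    """Caches listed in slabinfo that have no caller records at all."""
    return sorted(set(slabinfo_names) - set(allocators))


# Illustrative sample with hypothetical caller names.
sample = """size-64: 12 foo_alloc+0x1a/0x80
size-2048: 3 bar_alloc+0x22/0x90
"""

allocators = parse_slab_allocators(sample)
print(missing_caches(["size-64", "size-2048", "size-4096"], allocators))
```

Run against the posted snapshots, a check like this would list exactly which size-N caches have no caller records.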
On Wed, Jun 11, 2008 at 02:46:13PM -0400, bfields wrote:
> On Tue, Jun 10, 2008 at 05:12:31PM -0500, Weathers, Norman R. wrote:
> >
> >
> > > -----Original Message-----
> > > From: J. Bruce Fields [mailto:bfields@fieldses.org]
> > > Sent: Tuesday, June 10, 2008 12:16 PM
> > > To: Weathers, Norman R.
> > > Cc: linux-nfs@vger.kernel.org
> > > Subject: Re: Problems with large number of clients and reads
> > >
> > > On Tue, Jun 10, 2008 at 09:30:18AM -0500, Weathers, Norman R. wrote:
> > > > Unfortunately, I cannot stop the clients (middle of long running
> > > > jobs). I might be able to test this soon. If I have the number of
> > > > > threads high, yes I can reduce the number of threads and it appears to
> > > > lower some of the memory, but even with as little as three threads,
> > > > the memory usage climbs very high, just not as high as if there are
> > > > say 8 threads. When the memory usage climbs high, it can cause the
> > > > box to not respond over the network (ssh, rsh), and even be very
> > > > > sluggish when I am connected over our serial console to the server(s).
> > > > This same scenario has been happening with kernels that I have tried
> > > > from 2.6.22.x on to the 2.6.25 series. The 2.6.25 series is
> > > > interesting in that I can push the same load from a box with the
> > > > 2.6.25 kernel and not have a load over .3 (with 3 threads), but with
> > > > the 2.6.22.x kernel, I have a load of over 3 when I hit the same
> > > > conditions.
> > >
> > > > OK, I think what we want to do is turn on CONFIG_DEBUG_SLAB_LEAK. I've
> > > never used it before, but it looks like it will report which functions
> > > are allocating from each slab cache, which may be exactly what we need
> > > to know. So:
> > >
> > > 1. Install a kernel with both CONFIG_DEBUG_SLAB ("Debug slab
> > > memory allocations") and CONFIG_DEBUG_SLAB_LEAK ("Memory leak
> > > debugging") turned on. They're both under the "kernel hacking"
> > > section of the kernel config. (If you have a file
> > > /proc/slab_allocators, then you already have these turned on and
> > > you can skip this step.)
> > >
> > > 2. Do whatever you need to do to reproduce the problem.
> > >
> > > 3. Get a copy of /proc/slabinfo and /proc/slab_allocators.
> > >
> > > Then we can take a look at that and see if it sheds any light.
> >
> >
> > I have taken several snapshots of the /proc/slab_allocators and
> > /proc/slabinfo as requested, but since there is a lot of info in them,
> > and I didn't think anyone wanted to go cross-eyed reading the data in an
> > email, I have them up on a website:
> >
> > http://shashi-weathers.net/linux/cluster/NFS/
>
> Excellent.
>
> >
> > The order of data collection is:
> >
> > slab_allocators_bad1.txt and corresponding slabinfo
> > slab_allocators_after_bad1.txt and corresponding slabinfo
> > slab_allocators_16_threads.txt and corresponding slabinfo
> > slab_allocators_16_threads_1.txt and corresponding slabinfo
> > slab_allocators_32_threads.txt and corresponding slabinfo
> > slab_allocators_really_bad.txt and corresponding slabinfo.
> >
> >
> > You will have to forgive my ignorance at this point, but I was looking
> > through the slabinfo and slab_allocators, and noticed that size-4096
> > does not show up in slab_allocators... I hope that is by design. You
> > can see it growing into the gigabytes in the slabinfo files....
>
> Argh. OK, I don't understand well enough how this works. Time to ask
> someone, I guess....
>
> --b.
>
> >
> >
> >
> > >
> > > I think that debugging will hurt the server performance, so you won't
> > > want to keep it turned on all the time.
> > >
> > > >
> > > > > Also, this is all with the SLAB cache option. SLUB crashes everytime
> > > > I use it under heavy load.
> > >
> > > Have you reported the SLUB bugs to lkml?
> >
> > No, I haven't yet. I didn't know for sure if I was doing something
> > wrong, or if SLUB was the problem there. Since the failures, I had gone
> > back to using SLAB anyway, so .... I probably should...
> >
> > >
> > > --b.
> > >
> >
> >
> > Norman Weathers
^ permalink raw reply [flat|nested] 41+ messages in thread
* CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
@ 2008-06-11 19:52 ` J. Bruce Fields
0 siblings, 0 replies; 41+ messages in thread
From: J. Bruce Fields @ 2008-06-11 19:52 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-nfs, Weathers, Norman R.
I'm probably missing something fundamental--why doesn't
/proc/slab_allocators show any results for size-x where x >= 4096?
Someone's seeing a performance problem with the linux nfs server. One
of the symptoms is the "size-4096" slab cache seems to be out of
control. I assumed that meant that memory allocated by kmalloc() might
be leaking, so figured it might be interesting to turn on
CONFIG_DEBUG_SLAB_LEAK. As far as I can tell what that does is list
kmalloc() callers in /proc/slab_allocators. But that doesn't seem to be
showing any results for size-4096. Can anyone provide a clue?
Thanks!
--b.
On Wed, Jun 11, 2008 at 02:46:13PM -0400, bfields wrote:
> On Tue, Jun 10, 2008 at 05:12:31PM -0500, Weathers, Norman R. wrote:
> >
> >
> > > -----Original Message-----
> > > From: J. Bruce Fields [mailto:bfields@fieldses.org]
> > > Sent: Tuesday, June 10, 2008 12:16 PM
> > > To: Weathers, Norman R.
> > > Cc: linux-nfs@vger.kernel.org
> > > Subject: Re: Problems with large number of clients and reads
> > >
> > > On Tue, Jun 10, 2008 at 09:30:18AM -0500, Weathers, Norman R. wrote:
> > > > Unfortunately, I cannot stop the clients (middle of long running
> > > > jobs). I might be able to test this soon. If I have the number of
> > > > > threads high, yes I can reduce the number of threads and it appears to
> > > > lower some of the memory, but even with as little as three threads,
> > > > the memory usage climbs very high, just not as high as if there are
> > > > say 8 threads. When the memory usage climbs high, it can cause the
> > > > box to not respond over the network (ssh, rsh), and even be very
> > > > > sluggish when I am connected over our serial console to the server(s).
> > > > This same scenario has been happening with kernels that I have tried
> > > > from 2.6.22.x on to the 2.6.25 series. The 2.6.25 series is
> > > > interesting in that I can push the same load from a box with the
> > > > 2.6.25 kernel and not have a load over .3 (with 3 threads), but with
> > > > the 2.6.22.x kernel, I have a load of over 3 when I hit the same
> > > > conditions.
> > >
> > > OK, I think what we want to do is turn on CONFIG_DEBUG_SLAB_LEAK. I've
> > > never used it before, but it looks like it will report which functions
> > > are allocating from each slab cache, which may be exactly what we need
> > > to know. So:
> > >
> > > 1. Install a kernel with both CONFIG_DEBUG_SLAB ("Debug slab
> > > memory allocations") and CONFIG_DEBUG_SLAB_LEAK ("Memory leak
> > > debugging") turned on. They're both under the "kernel hacking"
> > > section of the kernel config. (If you have a file
> > > /proc/slab_allocators, then you already have these turned on and
> > > you can skip this step.)
> > >
> > > 2. Do whatever you need to do to reproduce the problem.
> > >
> > > 3. Get a copy of /proc/slabinfo and /proc/slab_allocators.
> > >
> > > Then we can take a look at that and see if it sheds any light.
> >
> >
> > I have taken several snapshots of the /proc/slab_allocators and
> > /proc/slabinfo as requested, but since there is a lot of info in them,
> > and I didn't think anyone wanted to go cross-eyed reading the data in an
> > email, I have them up on a website:
> >
> > http://shashi-weathers.net/linux/cluster/NFS/
>
> Excellent.
>
> >
> > The order of data collection is:
> >
> > slab_allocators_bad1.txt and corresponding slabinfo
> > slab_allocators_after_bad1.txt and corresponding slabinfo
> > slab_allocators_16_threads.txt and corresponding slabinfo
> > slab_allocators_16_threads_1.txt and corresponding slabinfo
> > slab_allocators_32_threads.txt and corresponding slabinfo
> > slab_allocators_really_bad.txt and corresponding slabinfo.
> >
> >
> > You will have to forgive my ignorance at this point, but I was looking
> > through the slabinfo and slab_allocators, and noticed that size-4096
> > does not show up in slab_allocators... I hope that is by design. You
> > can see it growing into the gigabytes in the slabinfo files....
>
> Argh. OK, I don't understand well enough how this works. Time to ask
> someone, I guess....
>
> --b.
>
> >
> >
> >
> > >
> > > I think that debugging will hurt the server performance, so you won't
> > > want to keep it turned on all the time.
> > >
> > > >
> > > > Also, this is all with the SLAB cache option. SLUB crashes every time
> > > > I use it under heavy load.
> > >
> > > Have you reported the SLUB bugs to lkml?
> >
> > No, I haven't yet. I didn't know for sure if I was doing something
> > wrong, or if SLUB was the problem there. Since the failures, I had gone
> > back to using SLAB anyway, so .... I probably should...
> >
> > >
> > > --b.
> > >
> >
> >
> > Norman Weathers
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
2008-06-11 19:52 ` J. Bruce Fields
@ 2008-06-11 20:09 ` Jeff Layton
-1 siblings, 0 replies; 41+ messages in thread
From: Jeff Layton @ 2008-06-11 20:09 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: linux-kernel, linux-nfs, Weathers, Norman R.
On Wed, 11 Jun 2008 15:52:22 -0400
"J. Bruce Fields" <bfields@fieldses.org> wrote:
> I'm probably missing something fundamental--why doesn't
> /proc/slab_allocators show any results for size-x where x >= 4096?
>
> Someone's seeing a performance problem with the linux nfs server. One
> of the symptoms is the "size-4096" slab cache seems to be out of
> control. I assumed that meant that memory allocated by kmalloc() might
> be leaking, so figured it might be interesting to turn on
> CONFIG_DEBUG_SLAB_LEAK. As far as I can tell what that does is list
> kmalloc() callers in /proc/slab_allocators. But that doesn't seem to be
> showing any results for size-4096. Can anyone provide a clue?
> Thanks!
>
> --b.
>
Hmm...I've never used this, but in kmem_cache_create():
	/*
	 * Enable redzoning and last user accounting, except for caches with
	 * large objects, if the increased size would increase the object size
	 * above the next power of two: caches with object sizes just above a
	 * power of two have a significant amount of internal fragmentation.
	 */
	if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
						2 * sizeof(unsigned long long)))
		flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
...looks like it specifically excludes some caches.
> On Wed, Jun 11, 2008 at 02:46:13PM -0400, bfields wrote:
> > On Tue, Jun 10, 2008 at 05:12:31PM -0500, Weathers, Norman R. wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: J. Bruce Fields [mailto:bfields@fieldses.org]
> > > > Sent: Tuesday, June 10, 2008 12:16 PM
> > > > To: Weathers, Norman R.
> > > > Cc: linux-nfs@vger.kernel.org
> > > > Subject: Re: Problems with large number of clients and reads
> > > >
> > > > On Tue, Jun 10, 2008 at 09:30:18AM -0500, Weathers, Norman R. wrote:
> > > > > Unfortunately, I cannot stop the clients (middle of long running
> > > > > jobs). I might be able to test this soon. If I have the number of
> > > > > threads high, yes I can reduce the number of threads and it appears to
> > > > > lower some of the memory, but even with as little as three threads,
> > > > > the memory usage climbs very high, just not as high as if there are
> > > > > say 8 threads. When the memory usage climbs high, it can cause the
> > > > > box to not respond over the network (ssh, rsh), and even be very
> > > > > sluggish when I am connected over our serial console to the server(s).
> > > > > This same scenario has been happening with kernels that I have tried
> > > > > from 2.6.22.x on to the 2.6.25 series. The 2.6.25 series is
> > > > > interesting in that I can push the same load from a box with the
> > > > > 2.6.25 kernel and not have a load over .3 (with 3 threads), but with
> > > > > the 2.6.22.x kernel, I have a load of over 3 when I hit the same
> > > > > conditions.
> > > >
> > > > OK, I think what we want to do is turn on CONFIG_DEBUG_SLAB_LEAK. I've
> > > > never used it before, but it looks like it will report which functions
> > > > are allocating from each slab cache, which may be exactly what we need
> > > > to know. So:
> > > >
> > > > 1. Install a kernel with both CONFIG_DEBUG_SLAB ("Debug slab
> > > > memory allocations") and CONFIG_DEBUG_SLAB_LEAK ("Memory leak
> > > > debugging") turned on. They're both under the "kernel hacking"
> > > > section of the kernel config. (If you have a file
> > > > /proc/slab_allocators, then you already have these turned on and
> > > > you can skip this step.)
> > > >
> > > > 2. Do whatever you need to do to reproduce the problem.
> > > >
> > > > 3. Get a copy of /proc/slabinfo and /proc/slab_allocators.
> > > >
> > > > Then we can take a look at that and see if it sheds any light.
> > >
> > >
> > > I have taken several snapshots of the /proc/slab_allocators and
> > > /proc/slabinfo as requested, but since there is a lot of info in them,
> > > and I didn't think anyone wanted to go cross-eyed reading the data in an
> > > email, I have them up on a website:
> > >
> > > http://shashi-weathers.net/linux/cluster/NFS/
> >
> > Excellent.
> >
> > >
> > > The order of data collection is:
> > >
> > > slab_allocators_bad1.txt and corresponding slabinfo
> > > slab_allocators_after_bad1.txt and corresponding slabinfo
> > > slab_allocators_16_threads.txt and corresponding slabinfo
> > > slab_allocators_16_threads_1.txt and corresponding slabinfo
> > > slab_allocators_32_threads.txt and corresponding slabinfo
> > > slab_allocators_really_bad.txt and corresponding slabinfo.
> > >
> > >
> > > You will have to forgive my ignorance at this point, but I was looking
> > > through the slabinfo and slab_allocators, and noticed that size-4096
> > > does not show up in slab_allocators... I hope that is by design. You
> > > can see it growing into the gigabytes in the slabinfo files....
> >
> > Argh. OK, I don't understand well enough how this works. Time to ask
> > someone, I guess....
> >
> > --b.
> >
> > >
> > >
> > >
> > > >
> > > > I think that debugging will hurt the server performance, so you won't
> > > > want to keep it turned on all the time.
> > > >
> > > > >
> > > > > Also, this is all with the SLAB cache option. SLUB crashes every time
> > > > > I use it under heavy load.
> > > >
> > > > Have you reported the SLUB bugs to lkml?
> > >
> > > No, I haven't yet. I didn't know for sure if I was doing something
> > > wrong, or if SLUB was the problem there. Since the failures, I had gone
> > > back to using SLAB anyway, so .... I probably should...
> > >
> > > >
> > > > --b.
> > > >
> > >
> > >
> > > Norman Weathers
--
Jeff Layton <jlayton@poochiereds.net>
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
2008-06-11 20:09 ` Jeff Layton
@ 2008-06-11 20:57 ` J. Bruce Fields
-1 siblings, 0 replies; 41+ messages in thread
From: J. Bruce Fields @ 2008-06-11 20:57 UTC (permalink / raw)
To: Jeff Layton; +Cc: linux-kernel, linux-nfs, Weathers, Norman R.
On Wed, Jun 11, 2008 at 04:09:47PM -0400, Jeff Layton wrote:
> On Wed, 11 Jun 2008 15:52:22 -0400
> "J. Bruce Fields" <bfields@fieldses.org> wrote:
>
> > I'm probably missing something fundamental--why doesn't
> > /proc/slab_allocators show any results for size-x where x >= 4096?
> >
> > Someone's seeing a performance problem with the linux nfs server. One
> > of the symptoms is the "size-4096" slab cache seems to be out of
> > control. I assumed that meant that memory allocated by kmalloc() might
> > be leaking, so figured it might be interesting to turn on
> > CONFIG_DEBUG_SLAB_LEAK. As far as I can tell what that does is list
> > kmalloc() callers in /proc/slab_allocators. But that doesn't seem to be
> > showing any results for size-4096. Can anyone provide a clue?
> > Thanks!
> >
> > --b.
> >
>
>
> Hmm...I've never used this, but in kmem_cache_create():
>
> 	/*
> 	 * Enable redzoning and last user accounting, except for caches with
> 	 * large objects, if the increased size would increase the object size
> 	 * above the next power of two: caches with object sizes just above a
> 	 * power of two have a significant amount of internal fragmentation.
> 	 */
> 	if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> 						2 * sizeof(unsigned long long)))
> 		flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
>
>
> ...looks like it specifically excludes some caches.
Ah, I missed that! I'm a little confused as to how those flags affect
the collection of the leak debugging data, but I can verify that the
patch below does result in size-4096 showing up in /proc/slab_allocators;
hopefully there's no negative result beyond the performance penalty.
Norman, do you think you could try applying this and then trying again?
--b.
diff --git a/mm/slab.c b/mm/slab.c
index 06236e4..b379e31 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2202,7 +2202,7 @@ kmem_cache_create (const char *name, size_t size, size_t align,
 	 * above the next power of two: caches with object sizes just above a
 	 * power of two have a significant amount of internal fragmentation.
 	 */
-	if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
+	if (size < 8192 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
 						2 * sizeof(unsigned long long)))
 		flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
 	if (!(flags & SLAB_DESTROY_BY_RCU))
^ permalink raw reply related [flat|nested] 41+ messages in thread
* RE: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
2008-06-11 20:57 ` J. Bruce Fields
@ 2008-06-11 22:46 ` Weathers, Norman R.
-1 siblings, 0 replies; 41+ messages in thread
From: Weathers, Norman R. @ 2008-06-11 22:46 UTC (permalink / raw)
To: J. Bruce Fields, Jeff Layton; +Cc: linux-kernel, linux-nfs
> -----Original Message-----
> From: J. Bruce Fields [mailto:bfields@fieldses.org]
> Sent: Wednesday, June 11, 2008 3:58 PM
> To: Jeff Layton
> Cc: linux-kernel@vger.kernel.org; linux-nfs@vger.kernel.org;
> Weathers, Norman R.
> Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
>
> On Wed, Jun 11, 2008 at 04:09:47PM -0400, Jeff Layton wrote:
> > On Wed, 11 Jun 2008 15:52:22 -0400
> > "J. Bruce Fields" <bfields@fieldses.org> wrote:
> >
> > > I'm probably missing something fundamental--why doesn't
> > > /proc/slab_allocators show any results for size-x where x >= 4096?
> > >
> > > Someone's seeing a performance problem with the linux nfs server. One
> > > of the symptoms is the "size-4096" slab cache seems to be out of
> > > control. I assumed that meant that memory allocated by kmalloc() might
> > > be leaking, so figured it might be interesting to turn on
> > > CONFIG_DEBUG_SLAB_LEAK. As far as I can tell what that does is list
> > > kmalloc() callers in /proc/slab_allocators. But that doesn't seem to be
> > > showing any results for size-4096. Can anyone provide a clue?
> > > Thanks!
> > >
> > > --b.
> > >
> >
> >
> > Hmm...I've never used this, but in kmem_cache_create():
> >
> > 	/*
> > 	 * Enable redzoning and last user accounting, except for caches with
> > 	 * large objects, if the increased size would increase the object size
> > 	 * above the next power of two: caches with object sizes just above a
> > 	 * power of two have a significant amount of internal fragmentation.
> > 	 */
> > 	if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> > 						2 * sizeof(unsigned long long)))
> > 		flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
> >
> >
> > ...looks like it specifically excludes some caches.
>
> Ah, I missed that! I'm a little confused as to how those flags affect
> the collection of the leak debugging data, but I can verify that the
> patch below does result in size-4096 showing up in /proc/slab_allocators;
> hopefully there's no negative result beyond the performance penalty.
>
> Norman, do you think you could try applying this and then trying again?
>
> --b.
I will try and get it patched and retested, but it may be a day or two
before I can get back the information due to production jobs now
running. Once they finish up, I will get back with the info.
Thanks everyone for looking at this, by the way!
>
>
> diff --git a/mm/slab.c b/mm/slab.c
> index 06236e4..b379e31 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -2202,7 +2202,7 @@ kmem_cache_create (const char *name, size_t size, size_t align,
>  	 * above the next power of two: caches with object sizes just above a
>  	 * power of two have a significant amount of internal fragmentation.
>  	 */
> -	if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> +	if (size < 8192 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
>  						2 * sizeof(unsigned long long)))
>  		flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
>  	if (!(flags & SLAB_DESTROY_BY_RCU))
>
Norman Weathers
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
2008-06-11 22:46 ` Weathers, Norman R.
@ 2008-06-11 22:54 ` J. Bruce Fields
-1 siblings, 0 replies; 41+ messages in thread
From: J. Bruce Fields @ 2008-06-11 22:54 UTC (permalink / raw)
To: Weathers, Norman R.; +Cc: Jeff Layton, linux-kernel, linux-nfs
On Wed, Jun 11, 2008 at 05:46:13PM -0500, Weathers, Norman R. wrote:
> I will try and get it patched and retested, but it may be a day or two
> before I can get back the information due to production jobs now
> running. Once they finish up, I will get back with the info.
Understood.
> Thanks everyone for looking at this, by the way!
And thanks for your persistence.
--b.
>
> >
> >
> > diff --git a/mm/slab.c b/mm/slab.c
> > index 06236e4..b379e31 100644
> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -2202,7 +2202,7 @@ kmem_cache_create (const char *name, size_t size, size_t align,
> >  	 * above the next power of two: caches with object sizes just above a
> >  	 * power of two have a significant amount of internal fragmentation.
> >  	 */
> > -	if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> > +	if (size < 8192 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> >  						2 * sizeof(unsigned long long)))
> >  		flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
> >  	if (!(flags & SLAB_DESTROY_BY_RCU))
> >
>
>
> Norman Weathers
^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
2008-06-11 22:54 ` J. Bruce Fields
@ 2008-06-12 19:54 ` Weathers, Norman R.
-1 siblings, 0 replies; 41+ messages in thread
From: Weathers, Norman R. @ 2008-06-12 19:54 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: Jeff Layton, linux-kernel, linux-nfs
> -----Original Message-----
> From: linux-nfs-owner@vger.kernel.org
> [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of J. Bruce Fields
> Sent: Wednesday, June 11, 2008 5:55 PM
> To: Weathers, Norman R.
> Cc: Jeff Layton; linux-kernel@vger.kernel.org;
> linux-nfs@vger.kernel.org
> Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
>
> On Wed, Jun 11, 2008 at 05:46:13PM -0500, Weathers, Norman R. wrote:
> > I will try and get it patched and retested, but it may be a
> day or two
> > before I can get back the information due to production jobs now
> > running. Once they finish up, I will get back with the info.
>
> Understood.
>
I was able to get my big user to cooperate and let me in to get the
information that you needed. The full output from the
/proc/slab_allocators file is at
http://www.shashi-weathers.net/linux/cluster/NFS_DEBUG_2 . The 16
thread case is very interesting. Also, there is a small txt file in the
directory that has some rpc errors, but I imagine the way that I am
running the box (oversubscribed threads) has more to do with the rpc
errors than anything else. For those of you wanting the gist of the
story, the size-4096 slab has the following allocations, one of them
very large:
size-4096: 2 sys_init_module+0x140b/0x1980
size-4096: 1 __vmalloc_area_node+0x188/0x1b0
size-4096: 1 seq_read+0x1d9/0x2e0
size-4096: 1 slabstats_open+0x2b/0x80
size-4096: 5 vc_allocate+0x167/0x190
size-4096: 3 input_allocate_device+0x12/0x80
size-4096: 1 hid_add_field+0x122/0x290
size-4096: 9 reqsk_queue_alloc+0x5f/0xf0
size-4096: 1846825 __alloc_skb+0x7d/0x170
size-4096: 3 alloc_netdev+0x33/0xa0
size-4096: 10 neigh_sysctl_register+0x52/0x2b0
size-4096: 5 devinet_sysctl_register+0x28/0x110
size-4096: 1 pidmap_init+0x15/0x60
size-4096: 1 netlink_proto_init+0x44/0x190
size-4096: 1 ip_rt_init+0xfd/0x2f0
size-4096: 1 cipso_v4_init+0x13/0x70
size-4096: 3 journal_init_revoke+0xe7/0x270 [jbd]
size-4096: 3 journal_init_revoke+0x18a/0x270 [jbd]
size-4096: 2 journal_init_inode+0x84/0x150 [jbd]
size-4096: 2 bnx2_alloc_mem+0x18/0x1f0 [bnx2]
size-4096: 1 joydev_connect+0x53/0x390 [joydev]
size-4096: 13 kmem_alloc+0xb3/0x100 [xfs]
size-4096: 5 addrconf_sysctl_register+0x31/0x130 [ipv6]
size-4096: 7 rpc_clone_client+0x84/0x140 [sunrpc]
size-4096: 3 rpc_create+0x254/0x4d0 [sunrpc]
size-4096: 16 __svc_create_thread+0x53/0x1f0 [sunrpc]
size-4096: 16 __svc_create_thread+0x72/0x1f0 [sunrpc]
size-4096: 1 nfsd_racache_init+0x2e/0x140 [nfsd]
The big one seems to be __alloc_skb. (This is with 16 threads; the box
reports somewhere between 12 and 14 GB of memory in use, of which about
2 to 3 GB is disk cache.) If I were to put any more threads out there,
the server would become almost unresponsive (it was bad enough as it
was).
At the same time, I also noticed this:
skbuff_fclone_cache: 1842524 __alloc_skb+0x50/0x170
Don't know for sure if that is meaningful or not....
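To put the __alloc_skb count in perspective, here is a minimal sketch that
turns one /proc/slab_allocator line into a nominal byte total. The function
names are made up for illustration and are not part of any kernel or tool.
It assumes the "cache: count caller" line layout shown above and that the
number in a "size-N" cache name is the object size in bytes. The real
footprint is likely somewhat larger, since slab debugging adds per-object
overhead.

```c
#include <stdio.h>

/* Parse one line of /proc/slab_allocator output, e.g.
 *   "size-4096: 1846825 __alloc_skb+0x7d/0x170"
 * cache and caller must each point to at least 64 bytes.
 * Returns 1 if all three fields were read, 0 otherwise. */
static int parse_slab_line(const char *line, char *cache,
                           unsigned long long *count, char *caller)
{
    return sscanf(line, " %63[^:]: %llu %63s", cache, count, caller) == 3;
}

/* For the generic "size-N" caches, N is the nominal object size in bytes.
 * Returns 0 for any other cache name, whose object size is not in the name. */
static unsigned long long nominal_object_size(const char *cache)
{
    unsigned long long size = 0;

    if (sscanf(cache, "size-%llu", &size) != 1)
        return 0;
    return size;
}

/* Nominal bytes held by one line: count * object size (0 if size unknown). */
static unsigned long long slab_line_bytes(const char *line)
{
    char cache[64], caller[64];
    unsigned long long count = 0;

    if (!parse_slab_line(line, cache, &count, caller))
        return 0;
    return count * nominal_object_size(cache);
}
```

Fed the __alloc_skb line above, slab_line_bytes() gives 1846825 * 4096 =
7564595200 bytes, or roughly 7 GiB of nominal payload from that one caller.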
> > Thanks everyone for looking at this, by the way!
>
> And thanks for your persistence.
>
> --b.
>
Anytime. This is the part of the job that is fun (except for my
users...). Anyone can watch a system run; it's dealing with the unknown
that makes it interesting.
Norman Weathers
> >
> > >
> > >
> > > diff --git a/mm/slab.c b/mm/slab.c
> > > index 06236e4..b379e31 100644
> > > --- a/mm/slab.c
> > > +++ b/mm/slab.c
> > > @@ -2202,7 +2202,7 @@ kmem_cache_create (const char *name, size_t size, size_t align,
> > >   * above the next power of two: caches with object sizes just above a
> > >   * power of two have a significant amount of internal fragmentation.
> > >   */
> > > - if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> > > + if (size < 8192 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> > >                                         2 * sizeof(unsigned long long)))
> > >       flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
> > >   if (!(flags & SLAB_DESTROY_BY_RCU))
> > >
> >
> >
> > Norman Weathers
> --
> To unsubscribe from this list: send the line "unsubscribe
> linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
2008-06-12 19:54 ` Weathers, Norman R.
@ 2008-06-13 20:15 ` J. Bruce Fields
-1 siblings, 0 replies; 41+ messages in thread
From: J. Bruce Fields @ 2008-06-13 20:15 UTC (permalink / raw)
To: Weathers, Norman R.; +Cc: Jeff Layton, linux-kernel, linux-nfs, Neil Brown
On Thu, Jun 12, 2008 at 02:54:09PM -0500, Weathers, Norman R. wrote:
>
>
> > -----Original Message-----
> > From: linux-nfs-owner@vger.kernel.org
> > [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of J. Bruce Fields
> > Sent: Wednesday, June 11, 2008 5:55 PM
> > To: Weathers, Norman R.
> > Cc: Jeff Layton; linux-kernel@vger.kernel.org;
> > linux-nfs@vger.kernel.org
> > Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
> >
> > On Wed, Jun 11, 2008 at 05:46:13PM -0500, Weathers, Norman R. wrote:
> > > I will try and get it patched and retested, but it may be a day or two
> > > before I can get back the information due to production jobs now
> > > running. Once they finish up, I will get back with the info.
> >
> > Understood.
> >
>
>
> I was able to get my big user to cooperate and let me in to be able to
> get the information that you were needing. The full output from the
> /proc/slab_allocator file is at
> http://www.shashi-weathers.net/linux/cluster/NFS_DEBUG_2 . The 16
> thread case is very interesting. Also, there is a small txt file in the
> directory that has some rpc errors, but I imagine the way that I am
> running the box (oversubscribed threads) has more to do with the rpc
> errors than anything else. For those of you wanting the gist of the
> story, the size-4096 slab has the following very large allocation:
>
> size-4096: 2 sys_init_module+0x140b/0x1980
> size-4096: 1 __vmalloc_area_node+0x188/0x1b0
> size-4096: 1 seq_read+0x1d9/0x2e0
> size-4096: 1 slabstats_open+0x2b/0x80
> size-4096: 5 vc_allocate+0x167/0x190
> size-4096: 3 input_allocate_device+0x12/0x80
> size-4096: 1 hid_add_field+0x122/0x290
> size-4096: 9 reqsk_queue_alloc+0x5f/0xf0
> size-4096: 1846825 __alloc_skb+0x7d/0x170
> size-4096: 3 alloc_netdev+0x33/0xa0
> size-4096: 10 neigh_sysctl_register+0x52/0x2b0
> size-4096: 5 devinet_sysctl_register+0x28/0x110
> size-4096: 1 pidmap_init+0x15/0x60
> size-4096: 1 netlink_proto_init+0x44/0x190
> size-4096: 1 ip_rt_init+0xfd/0x2f0
> size-4096: 1 cipso_v4_init+0x13/0x70
> size-4096: 3 journal_init_revoke+0xe7/0x270 [jbd]
> size-4096: 3 journal_init_revoke+0x18a/0x270 [jbd]
> size-4096: 2 journal_init_inode+0x84/0x150 [jbd]
> size-4096: 2 bnx2_alloc_mem+0x18/0x1f0 [bnx2]
> size-4096: 1 joydev_connect+0x53/0x390 [joydev]
> size-4096: 13 kmem_alloc+0xb3/0x100 [xfs]
> size-4096: 5 addrconf_sysctl_register+0x31/0x130 [ipv6]
> size-4096: 7 rpc_clone_client+0x84/0x140 [sunrpc]
> size-4096: 3 rpc_create+0x254/0x4d0 [sunrpc]
> size-4096: 16 __svc_create_thread+0x53/0x1f0 [sunrpc]
> size-4096: 16 __svc_create_thread+0x72/0x1f0 [sunrpc]
> size-4096: 1 nfsd_racache_init+0x2e/0x140 [nfsd]
>
> The big one seems to be the __alloc_skb. (This is with 16 threads, and
> it says that we are using up somewhere between 12 and 14 GB of memory,
> about 2 to 3 gig of that is disk cache). If I were to put anymore
> threads out there, the server would become almost unresponsive (it was
> bad enough as it was).
>
> At the same time, I also noticed this:
>
> skbuff_fclone_cache: 1842524 __alloc_skb+0x50/0x170
>
> Don't know for sure if that is meaningful or not....
OK, so, starting at net/core/skbuff.c, this means that this memory was
allocated by __alloc_skb() calls with something nonzero in the third
("fclone") argument. The only such caller is alloc_skb_fclone().
Callers of alloc_skb_fclone() include:
sk_stream_alloc_skb:
do_tcp_sendpages
tcp_sendmsg
tcp_fragment
tso_fragment
tcp_mtu_probe
tcp_send_fin
tcp_connect
buf_acquire:
lots of callers in tipc code (whatever that is).
So unless you're using tipc, or you have something in userspace going
haywire (perhaps netstat would help rule that out?), then I suppose
there's something wrong with knfsd's tcp code. Which makes sense, I
guess.
I'd think this sort of allocation would be limited by the number of
sockets times the size of the send and receive buffers.
svc_xprt.c:svc_check_conn_limits() claims to be limiting the number of
sockets to (nrthreads+3)*20. (You aren't hitting the "too many open
connections" printk there, are you?) The total buffer size should be
bounded by something like 4 megs.
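For a sense of scale, here is a small sketch of that connection limit. It
only restates the (nrthreads+3)*20 formula quoted above; it is not the
kernel's svc_check_conn_limits() itself, and the helper names are
hypothetical. It also assumes one TCP connection per client, which may not
hold on every setup.

```c
/* Connection limit as quoted in this thread for svc_check_conn_limits():
 * (nrthreads + 3) * 20. This restates the formula only; it is not the
 * kernel function. */
static unsigned int svc_conn_limit(unsigned int nrthreads)
{
    return (nrthreads + 3) * 20;
}

/* Number of client connections beyond the limit (0 if within it).
 * Assumes each client holds exactly one connection. */
static unsigned int conns_over_limit(unsigned int clients,
                                     unsigned int nrthreads)
{
    unsigned int limit = svc_conn_limit(nrthreads);

    return clients > limit ? clients - limit : 0;
}
```

With 16 threads the limit works out to 380 sockets, and with 6 threads to
180, so under that assumption the 600 to 1200 clients mentioned at the
start of the thread would be well over the limit either way.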
--b.
>
>
>
> > > Thanks everyone for looking at this, by the way!
> >
> > And thanks for your persistence.
> >
> > --b.
> >
>
>
> Anytime. This is the part of the job that is fun (except for my
> users...). Anyone can watch a system run, it's dealing with the unknown
> that makes it interesting.
OK! Because I'm a bit stuck, so this will take some more work....
--b.
>
>
> Norman Weathers
>
>
> > >
> > > >
> > > >
> > > > diff --git a/mm/slab.c b/mm/slab.c
> > > > index 06236e4..b379e31 100644
> > > > --- a/mm/slab.c
> > > > +++ b/mm/slab.c
> > > > @@ -2202,7 +2202,7 @@ kmem_cache_create (const char *name, size_t size, size_t align,
> > > >   * above the next power of two: caches with object sizes just above a
> > > >   * power of two have a significant amount of internal fragmentation.
> > > >   */
> > > > - if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> > > > + if (size < 8192 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> > > >                                         2 * sizeof(unsigned long long)))
> > > >       flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
> > > >   if (!(flags & SLAB_DESTROY_BY_RCU))
> > > >
> > >
> > >
> > > Norman Weathers
^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
2008-06-13 20:15 ` J. Bruce Fields
@ 2008-06-13 21:53 ` Weathers, Norman R.
-1 siblings, 0 replies; 41+ messages in thread
From: Weathers, Norman R. @ 2008-06-13 21:53 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: Jeff Layton, linux-kernel, linux-nfs, Neil Brown
> -----Original Message-----
> From: linux-nfs-owner@vger.kernel.org
> [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of J. Bruce Fields
> Sent: Friday, June 13, 2008 3:16 PM
> To: Weathers, Norman R.
> Cc: Jeff Layton; linux-kernel@vger.kernel.org;
> linux-nfs@vger.kernel.org; Neil Brown
> Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
>
> On Thu, Jun 12, 2008 at 02:54:09PM -0500, Weathers, Norman R. wrote:
> >
> >
> > > -----Original Message-----
> > > From: linux-nfs-owner@vger.kernel.org
> > > [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of J. Bruce Fields
> > > Sent: Wednesday, June 11, 2008 5:55 PM
> > > To: Weathers, Norman R.
> > > Cc: Jeff Layton; linux-kernel@vger.kernel.org;
> > > linux-nfs@vger.kernel.org
> > > Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
> > >
> > > On Wed, Jun 11, 2008 at 05:46:13PM -0500, Weathers, Norman R. wrote:
> > > > I will try and get it patched and retested, but it may be a day or two
> > > > before I can get back the information due to production jobs now
> > > > running. Once they finish up, I will get back with the info.
> > >
> > > Understood.
> > >
> >
> >
> > I was able to get my big user to cooperate and let me in to be able to
> > get the information that you were needing. The full output from the
> > /proc/slab_allocator file is at
> > http://www.shashi-weathers.net/linux/cluster/NFS_DEBUG_2 . The 16
> > thread case is very interesting. Also, there is a small txt file in the
> > directory that has some rpc errors, but I imagine the way that I am
> > running the box (oversubscribed threads) has more to do with the rpc
> > errors than anything else. For those of you wanting the gist of the
> > story, the size-4096 slab has the following very large allocation:
> >
> > size-4096: 2 sys_init_module+0x140b/0x1980
> > size-4096: 1 __vmalloc_area_node+0x188/0x1b0
> > size-4096: 1 seq_read+0x1d9/0x2e0
> > size-4096: 1 slabstats_open+0x2b/0x80
> > size-4096: 5 vc_allocate+0x167/0x190
> > size-4096: 3 input_allocate_device+0x12/0x80
> > size-4096: 1 hid_add_field+0x122/0x290
> > size-4096: 9 reqsk_queue_alloc+0x5f/0xf0
> > size-4096: 1846825 __alloc_skb+0x7d/0x170
> > size-4096: 3 alloc_netdev+0x33/0xa0
> > size-4096: 10 neigh_sysctl_register+0x52/0x2b0
> > size-4096: 5 devinet_sysctl_register+0x28/0x110
> > size-4096: 1 pidmap_init+0x15/0x60
> > size-4096: 1 netlink_proto_init+0x44/0x190
> > size-4096: 1 ip_rt_init+0xfd/0x2f0
> > size-4096: 1 cipso_v4_init+0x13/0x70
> > size-4096: 3 journal_init_revoke+0xe7/0x270 [jbd]
> > size-4096: 3 journal_init_revoke+0x18a/0x270 [jbd]
> > size-4096: 2 journal_init_inode+0x84/0x150 [jbd]
> > size-4096: 2 bnx2_alloc_mem+0x18/0x1f0 [bnx2]
> > size-4096: 1 joydev_connect+0x53/0x390 [joydev]
> > size-4096: 13 kmem_alloc+0xb3/0x100 [xfs]
> > size-4096: 5 addrconf_sysctl_register+0x31/0x130 [ipv6]
> > size-4096: 7 rpc_clone_client+0x84/0x140 [sunrpc]
> > size-4096: 3 rpc_create+0x254/0x4d0 [sunrpc]
> > size-4096: 16 __svc_create_thread+0x53/0x1f0 [sunrpc]
> > size-4096: 16 __svc_create_thread+0x72/0x1f0 [sunrpc]
> > size-4096: 1 nfsd_racache_init+0x2e/0x140 [nfsd]
> >
> > The big one seems to be the __alloc_skb. (This is with 16 threads, and
> > it says that we are using up somewhere between 12 and 14 GB of memory,
> > about 2 to 3 gig of that is disk cache). If I were to put anymore
> > threads out there, the server would become almost unresponsive (it was
> > bad enough as it was).
> >
> > At the same time, I also noticed this:
> >
> > skbuff_fclone_cache: 1842524 __alloc_skb+0x50/0x170
> >
> > Don't know for sure if that is meaningful or not....
>
> OK, so, starting at net/core/skbuff.c, this means that this memory was
> allocated by __alloc_skb() calls with something nonzero in the third
> ("fclone") argument. The only such caller is alloc_skb_fclone().
> Callers of alloc_skb_fclone() include:
>
> sk_stream_alloc_skb:
> do_tcp_sendpages
> tcp_sendmsg
> tcp_fragment
> tso_fragment
Interesting that you should mention tso... We recently went through and
turned on TSO on all of our systems to see if it helped with
performance, so this could have something to do with that. I will
disable TSO on all of the servers, monitor the situation, and see if
that helps with the memory. I think it might help some, but I still
think there may be something else going on in a deep corner...
> tcp_mtu_probe
> tcp_send_fin
> tcp_connect
> buf_acquire:
> lots of callers in tipc code (whatever that is).
>
> So unless you're using tipc, or you have something in userspace going
> haywire (perhaps netstat would help rule that out?), then I suppose
> there's something wrong with knfsd's tcp code. Which makes sense, I
> guess.
>
Not sure what tipc is either....
> I'd think this sort of allocation would be limited by the number of
> sockets times the size of the send and receive buffers.
> svc_xprt.c:svc_check_conn_limits() claims to be limiting the number of
> sockets to (nrthreads+3)*20. (You aren't hitting the "too many open
> connections" printk there, are you?) The total buffer size should be
> bounded by something like 4 megs.
>
> --b.
>
Yes, we are getting a continuous stream of the "too many open
connections" message scrolling across our logs.
> >
> >
> >
> > > > Thanks everyone for looking at this, by the way!
> > >
> > > And thanks for your persistence.
> > >
> > > --b.
> > >
> >
> >
> > Anytime. This is the part of the job that is fun (except for my
> > users...). Anyone can watch a system run, it's dealing with the unknown
> > that makes it interesting.
>
> OK! Because I'm a bit stuck, so this will take some more work....
>
> --b.
>
No problem. I feel good if I exercised some deep corner of the code
and found something that needed to be flushed out. That's what the
experience is all about, isn't it?
> >
> >
> > Norman Weathers
> >
> >
> > > >
> > > > >
> > > > >
> > > > > diff --git a/mm/slab.c b/mm/slab.c
> > > > > index 06236e4..b379e31 100644
> > > > > --- a/mm/slab.c
> > > > > +++ b/mm/slab.c
> > > > > @@ -2202,7 +2202,7 @@ kmem_cache_create (const char *name,
> > > > > size_t size, size_t align,
> > > > > * above the next power of two: caches with object
> > > > > sizes just above a
> > > > > * power of two have a significant amount of internal
> > > > > fragmentation.
> > > > > */
> > > > > - if (size < 4096 || fls(size - 1) == fls(size-1
> + REDZONE_ALIGN +
> > > > > + if (size < 8192 || fls(size - 1) == fls(size-1
> + REDZONE_ALIGN +
> > > > > 2 *
> > > > > sizeof(unsigned long long)))
> > > > > flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
> > > > > if (!(flags & SLAB_DESTROY_BY_RCU))
> > > > >
> > > >
> > > >
> > > > Norman Weathers
^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
@ 2008-06-13 21:53 ` Weathers, Norman R.
0 siblings, 0 replies; 41+ messages in thread
From: Weathers, Norman R. @ 2008-06-13 21:53 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: Jeff Layton, linux-kernel, linux-nfs, Neil Brown
> -----Original Message-----
> From: linux-nfs-owner@vger.kernel.org
> [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of J. Bruce Fields
> Sent: Friday, June 13, 2008 3:16 PM
> To: Weathers, Norman R.
> Cc: Jeff Layton; linux-kernel@vger.kernel.org;
> linux-nfs@vger.kernel.org; Neil Brown
> Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
>
> On Thu, Jun 12, 2008 at 02:54:09PM -0500, Weathers, Norman R. wrote:
> >
> >
> > > -----Original Message-----
> > > From: linux-nfs-owner@vger.kernel.org
> > > [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of J.
> Bruce Fields
> > > Sent: Wednesday, June 11, 2008 5:55 PM
> > > To: Weathers, Norman R.
> > > Cc: Jeff Layton; linux-kernel@vger.kernel.org;
> > > linux-nfs@vger.kernel.org
> > > Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
> > >
> > > On Wed, Jun 11, 2008 at 05:46:13PM -0500, Weathers,
> Norman R. wrote:
> > > > I will try and get it patched and retested, but it may be a
> > > day or two
> > > > before I can get back the information due to production jobs now
> > > > running. Once they finish up, I will get back with the info.
> > >
> > > Understood.
> > >
> >
> >
> > I was able to get my big user to cooperate and let me in to
> be able to
> > get the information that you were needing. The full output from the
> > /proc/slab_allocator file is at
> > http://www.shashi-weathers.net/linux/cluster/NFS_DEBUG_2 . The 16
> > thread case is very interesting. Also, there is a small
> txt file in the
> > directory that has some rpc errors, but I imagine the way that I am
> > running the box (oversubscribed threads) has more to do with the rpc
> > errors than anything else. For those of you wanting the gist of the
> > story, the size-4096 slab has the following very large allocation:
> >
> > size-4096: 2 sys_init_module+0x140b/0x1980
> > size-4096: 1 __vmalloc_area_node+0x188/0x1b0
> > size-4096: 1 seq_read+0x1d9/0x2e0
> > size-4096: 1 slabstats_open+0x2b/0x80
> > size-4096: 5 vc_allocate+0x167/0x190
> > size-4096: 3 input_allocate_device+0x12/0x80
> > size-4096: 1 hid_add_field+0x122/0x290
> > size-4096: 9 reqsk_queue_alloc+0x5f/0xf0
> > size-4096: 1846825 __alloc_skb+0x7d/0x170
> > size-4096: 3 alloc_netdev+0x33/0xa0
> > size-4096: 10 neigh_sysctl_register+0x52/0x2b0
> > size-4096: 5 devinet_sysctl_register+0x28/0x110
> > size-4096: 1 pidmap_init+0x15/0x60
> > size-4096: 1 netlink_proto_init+0x44/0x190
> > size-4096: 1 ip_rt_init+0xfd/0x2f0
> > size-4096: 1 cipso_v4_init+0x13/0x70
> > size-4096: 3 journal_init_revoke+0xe7/0x270 [jbd]
> > size-4096: 3 journal_init_revoke+0x18a/0x270 [jbd]
> > size-4096: 2 journal_init_inode+0x84/0x150 [jbd]
> > size-4096: 2 bnx2_alloc_mem+0x18/0x1f0 [bnx2]
> > size-4096: 1 joydev_connect+0x53/0x390 [joydev]
> > size-4096: 13 kmem_alloc+0xb3/0x100 [xfs]
> > size-4096: 5 addrconf_sysctl_register+0x31/0x130 [ipv6]
> > size-4096: 7 rpc_clone_client+0x84/0x140 [sunrpc]
> > size-4096: 3 rpc_create+0x254/0x4d0 [sunrpc]
> > size-4096: 16 __svc_create_thread+0x53/0x1f0 [sunrpc]
> > size-4096: 16 __svc_create_thread+0x72/0x1f0 [sunrpc]
> > size-4096: 1 nfsd_racache_init+0x2e/0x140 [nfsd]
> >
> > The big one seems to be the __alloc_skb. (This is with 16
> threads, and
> > it says that we are using up somewhere between 12 and 14 GB
> of memory,
> > about 2 to 3 gig of that is disk cache). If I were to put anymore
> > threads out there, the server would become almost
> unresponsive (it was
> > bad enough as it was).
> >
> > At the same time, I also noticed this:
> >
> > skbuff_fclone_cache: 1842524 __alloc_skb+0x50/0x170
> >
> > Don't know for sure if that is meaningful or not....
>
> OK, so, starting at net/core/skbuff.c, this means that this memory was
> allocated by __alloc_skb() calls with something nonzero in the third
> ("fclone") argument. The only such caller is alloc_skb_fclone().
> Callers of alloc_skb_fclone() include:
>
> sk_stream_alloc_skb:
> do_tcp_sendpages
> tcp_sendmsg
> tcp_fragment
> tso_fragment
Interesting you should mention the tso... We recently went through and
turned on TSO on all of our systems, trying it out to see if it helped
with performance... This could be something to do with that. I can try
disabling the tso on all of the servers and see if that helps with the
memory. Actually, I think I will, and I will monitor the situation. I
think it might help some, but I still think there may be something else
going on in a deep corner...
> tcp_mtu_probe
> tcp_send_fin
> tcp_connect
> buf_acquire:
> lots of callers in tipc code (whatever that is).
>
> So unless you're using tipc, or you have something in userspace going
> haywire (perhaps netstat would help rule that out?), then I suppose
> there's something wrong with knfsd's tcp code. Which makes sense, I
> guess.
>
Not for sure what tipc is either....
> I'd think this sort of allocation would be limited by the number of
> sockets times the size of the send and receive buffers.
> svc_xprt.c:svc_check_conn_limits() claims to be limiting the number of
> sockets to (nrthreads+3)*20. (You aren't hitting the "too many open
> connections" printk there, are you?) The total buffer size should be
> bounded by something like 4 megs.
>
> --b.
>
Yes, we are getting a continuous stream of the too many open connections
scrolling across our logs.
> >
> >
> >
> > > > Thanks everyone for looking at this, by the way!
> > >
> > > And thanks for your persistence.
> > >
> > > --b.
> > >
> >
> >
> > Anytime. This is the part of the job that is fun (except for my
> > users...). Anyone can watch a system run, it's dealing
> with the unknown
> > that makes it interesting.
>
> OK! Because I'm a bit stuck, so this will take some more work....
>
> --b.
>
No problems. I feel good if I exercised some deep corner of the code
and found something that needed flushed out, that's what the experience
is all about, isn't it?
> >
> >
> > Norman Weathers
> >
> >
> > > >
> > > > >
> > > > >
> > > > > diff --git a/mm/slab.c b/mm/slab.c
> > > > > index 06236e4..b379e31 100644
> > > > > --- a/mm/slab.c
> > > > > +++ b/mm/slab.c
> > > > > @@ -2202,7 +2202,7 @@ kmem_cache_create (const char *name, size_t size, size_t align,
> > > > >  	 * above the next power of two: caches with object sizes just above a
> > > > >  	 * power of two have a significant amount of internal fragmentation.
> > > > >  	 */
> > > > > -	if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> > > > > +	if (size < 8192 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> > > > >  						2 * sizeof(unsigned long long)))
> > > > >  		flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
> > > > >  	if (!(flags & SLAB_DESTROY_BY_RCU))
> > > > >
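For illustration, a minimal Python sketch of the size check that the patch above touches. It assumes REDZONE_ALIGN and sizeof(unsigned long long) are both 8 bytes (typical for a 64-bit build, but an assumption here) and uses int.bit_length() as a stand-in for the kernel's fls(). The function name gets_debug_flags is made up for this example. It shows why a 4096-byte cache misses the debug flags under the old threshold and picks them up under the new one.

```python
# Sketch of the size check in kmem_cache_create() (mm/slab.c) that the
# patch above modifies.
# Assumptions (not taken from the thread): REDZONE_ALIGN == 8 and
# sizeof(unsigned long long) == 8, as on a typical 64-bit build.
REDZONE_ALIGN = 8
SIZEOF_ULL = 8


def fls(x: int) -> int:
    """1-based position of the highest set bit; fls(0) == 0."""
    return x.bit_length()


def gets_debug_flags(size: int, threshold: int) -> bool:
    """True if SLAB_RED_ZONE | SLAB_STORE_USER would be set for this size."""
    padded = size - 1 + REDZONE_ALIGN + 2 * SIZEOF_ULL
    return size < threshold or fls(size - 1) == fls(padded)


if __name__ == "__main__":
    for threshold in (4096, 8192):
        for size in (1024, 4096, 8192):
            print(threshold, size, gets_debug_flags(size, threshold))
```

With the original threshold of 4096, a 4096-byte object fails both halves of the test, because the added debug padding would push it past the next power of two. Raising the threshold to 8192 lets it through on the first half.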
> > > >
> > > >
> > > > Norman Weathers
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe
> > > linux-nfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > >
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
2008-06-13 21:53 ` Weathers, Norman R.
@ 2008-06-13 22:04 ` J. Bruce Fields
-1 siblings, 0 replies; 41+ messages in thread
From: J. Bruce Fields @ 2008-06-13 22:04 UTC (permalink / raw)
To: Weathers, Norman R.; +Cc: Jeff Layton, linux-kernel, linux-nfs, Neil Brown
On Fri, Jun 13, 2008 at 04:53:31PM -0500, Weathers, Norman R. wrote:
>
>
> > > The big one seems to be the __alloc_skb. (This is with 16 threads, and
> > > it says that we are using up somewhere between 12 and 14 GB of memory,
> > > about 2 to 3 gig of that is disk cache). If I were to put any more
> > > threads out there, the server would become almost unresponsive (it was
> > > bad enough as it was).
> > >
> > > At the same time, I also noticed this:
> > >
> > > skbuff_fclone_cache: 1842524 __alloc_skb+0x50/0x170
> > >
> > > Don't know for sure if that is meaningful or not....
> >
> > OK, so, starting at net/core/skbuff.c, this means that this memory was
> > allocated by __alloc_skb() calls with something nonzero in the third
> > ("fclone") argument. The only such caller is alloc_skb_fclone().
> > Callers of alloc_skb_fclone() include:
> >
> > sk_stream_alloc_skb:
> > do_tcp_sendpages
> > tcp_sendmsg
> > tcp_fragment
> > tso_fragment
>
> Interesting you should mention the tso... We recently went through and
> turned on TSO on all of our systems, trying it out to see if it helped
> with performance... This could be something to do with that. I can try
> disabling the tso on all of the servers and see if that helps with the
> memory. Actually, I think I will, and I will monitor the situation. I
> think it might help some, but I still think there may be something else
> going on in a deep corner...
I'll plead total ignorance about TSO, and it sounds like a long
shot--but sure, it'd be worth trying, thanks.
>
> > tcp_mtu_probe
> > tcp_send_fin
> > tcp_connect
> > buf_acquire:
> > lots of callers in tipc code (whatever that is).
> >
> > So unless you're using tipc, or you have something in userspace going
> > haywire (perhaps netstat would help rule that out?), then I suppose
> > there's something wrong with knfsd's tcp code. Which makes sense, I
> > guess.
> >
>
> Not for sure what tipc is either....
>
> > I'd think this sort of allocation would be limited by the number of
> > sockets times the size of the send and receive buffers.
> > svc_xprt.c:svc_check_conn_limits() claims to be limiting the number of
> > sockets to (nrthreads+3)*20. (You aren't hitting the "too many open
> > connections" printk there, are you?) The total buffer size should be
> > bounded by something like 4 megs.
> >
> > --b.
> >
>
> Yes, we are getting a continuous stream of the too many open connections
> scrolling across our logs.
That's interesting! So we should probably look more closely at the
svc_check_conn_limits() behavior. I wonder whether some pathological
behavior is triggered in the case where you're constantly over the limit
it's trying to enforce.
(Remind me how many active clients you have?)
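
One way to see how far over that limit the server sits is to count established TCP connections to the NFS port. A minimal Python sketch, assuming the server listens on the standard NFS port 2049 and that /proc/net/tcp has its usual layout (hex address:port in the second column, hex state in the fourth, 01 meaning ESTABLISHED). The function name is made up for illustration.

```python
NFS_PORT = 2049          # assumption: standard NFS port
TCP_ESTABLISHED = 0x01   # state code used in /proc/net/tcp


def count_established(proc_net_tcp_text: str, port: int = NFS_PORT) -> int:
    """Count ESTABLISHED sockets whose local port matches `port`."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address looks like "0100007F:0801" (hex IP, hex port)
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        state = int(fields[3], 16)
        if local_port == port and state == TCP_ESTABLISHED:
            count += 1
    return count
```

On the server this would be fed the contents of /proc/net/tcp (and /proc/net/tcp6 for IPv6 clients, which uses the same column layout), and the result compared against (nrthreads+3)*20.
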
> No problems. I feel good if I exercised some deep corner of the code
> and found something that needed flushed out, that's what the experience
> is all about, isn't it?
Yep!
--b.
^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
2008-06-13 22:04 ` J. Bruce Fields
@ 2008-06-13 22:53 ` Weathers, Norman R.
-1 siblings, 0 replies; 41+ messages in thread
From: Weathers, Norman R. @ 2008-06-13 22:53 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: Jeff Layton, linux-kernel, linux-nfs, Neil Brown
> -----Original Message-----
> From: J. Bruce Fields [mailto:bfields@fieldses.org]
> Sent: Friday, June 13, 2008 5:04 PM
> To: Weathers, Norman R.
> Cc: Jeff Layton; linux-kernel@vger.kernel.org;
> linux-nfs@vger.kernel.org; Neil Brown
> Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
>
> On Fri, Jun 13, 2008 at 04:53:31PM -0500, Weathers, Norman R. wrote:
> >
> >
> > > > The big one seems to be the __alloc_skb. (This is with 16 threads, and
> > > > it says that we are using up somewhere between 12 and 14 GB of memory,
> > > > about 2 to 3 gig of that is disk cache). If I were to put any more
> > > > threads out there, the server would become almost unresponsive (it was
> > > > bad enough as it was).
> > > >
> > > > At the same time, I also noticed this:
> > > >
> > > > skbuff_fclone_cache: 1842524 __alloc_skb+0x50/0x170
> > > >
> > > > Don't know for sure if that is meaningful or not....
> > >
> > > OK, so, starting at net/core/skbuff.c, this means that this memory was
> > > allocated by __alloc_skb() calls with something nonzero in the third
> > > ("fclone") argument. The only such caller is alloc_skb_fclone().
> > > Callers of alloc_skb_fclone() include:
> > >
> > > sk_stream_alloc_skb:
> > > do_tcp_sendpages
> > > tcp_sendmsg
> > > tcp_fragment
> > > tso_fragment
> >
> > Interesting you should mention the tso... We recently went through and
> > turned on TSO on all of our systems, trying it out to see if it helped
> > with performance... This could be something to do with that. I can try
> > disabling the tso on all of the servers and see if that helps with the
> > memory. Actually, I think I will, and I will monitor the situation. I
> > think it might help some, but I still think there may be something else
> > going on in a deep corner...
>
> I'll plead total ignorance about TSO, and it sounds like a long
> shot--but sure, it'd be worth trying, thanks.
>
Tried it, not for sure if I like the results yet or not... Didn't seem
to make a huge difference, but here is something that will really make
you want to drink, the 2.6.25.4 kernel does not go into the size-4096
hell. The largest users of slab there are the size-1024 and still the
skbuff_fclone_cache. On a box with 16 threads, it will cache up about 5
GB of disk data, and still use about 6 GB of slab to put the information
out there (without TSO on), but at least it is not causing the disk
cache to be evicted, and it appears to be a little more responsive. If
I up it to 32 or more threads, however, it gets very sluggish, but then
again, I am hitting it with a lot of nodes.
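
For tracking which caches dominate while trying different thread counts, a small Python sketch that ranks caches from /proc/slabinfo by approximate footprint. It assumes the slabinfo 2.x column layout (name, active_objs, num_objs, objsize, ...) and estimates memory as num_objs * objsize, which ignores per-slab overhead. The function name is invented for this example.

```python
def top_slab_caches(slabinfo_text: str, n: int = 5):
    """Return the n largest caches as (name, approx_bytes), biggest first."""
    usage = []
    for line in slabinfo_text.splitlines():
        # Skip the version line and the column header.
        if line.startswith("slabinfo") or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        # Columns: name, active_objs, num_objs, objsize, ...
        name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
        usage.append((name, num_objs * objsize))
    usage.sort(key=lambda item: item[1], reverse=True)
    return usage[:n]
```

Run against the contents of /proc/slabinfo (normally readable only by root), this should put size-4096, size-1024 or skbuff_fclone_cache at the top if they are the main consumers, as described above.
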
> >
> > > tcp_mtu_probe
> > > tcp_send_fin
> > > tcp_connect
> > > buf_acquire:
> > > lots of callers in tipc code (whatever that is).
> > >
> > > So unless you're using tipc, or you have something in userspace going
> > > haywire (perhaps netstat would help rule that out?), then I suppose
> > > there's something wrong with knfsd's tcp code. Which makes sense, I
> > > guess.
> > >
> >
> > Not for sure what tipc is either....
> >
> > > I'd think this sort of allocation would be limited by the number of
> > > sockets times the size of the send and receive buffers.
> > > svc_xprt.c:svc_check_conn_limits() claims to be limiting the number of
> > > sockets to (nrthreads+3)*20. (You aren't hitting the "too many open
> > > connections" printk there, are you?) The total buffer size should be
> > > bounded by something like 4 megs.
> > >
> > > --b.
> > >
> >
> > Yes, we are getting a continuous stream of the too many open connections
> > scrolling across our logs.
>
> That's interesting! So we should probably look more closely at the
> svc_check_conn_limits() behavior. I wonder whether some pathological
> behavior is triggered in the case where you're constantly over the limit
> it's trying to enforce.
>
> (Remind me how many active clients you have?)
>
We currently are hitting with somewhere around 600 to 800 nodes, but it
can go up to over 1000 nodes. We are artificially starving with a
limited number of threads (2 to 3) right now on the older 2.6.22.14
kernel because of that memory issue (which may or may not be tso
related)...
I really want to move forward to the newer kernel, but we had an issue
where clients all of a sudden wouldn't connect, yet other clients
could, to the exact same server NFS export. I had booted the server
into the 2.6.25.4 kernel at the time, and the other admin set us back to
the 2.6.22.14 to see if that was it. The clients started working again,
and he left it there (he also took out my options in the exports file,
no_subtree_check and insecure). I know that we are running over the
number of privileged ports, and we probably need the insecure, but I am
having a hard time wrapping my head around all of the problems at
once....
> > No problems. I feel good if I exercised some deep corner of the code
> > and found something that needed flushed out, that's what the experience
> > is all about, isn't it?
>
> Yep!
>
> --b.
>
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
2008-06-13 22:53 ` Weathers, Norman R.
@ 2008-06-16 17:43 ` J. Bruce Fields
-1 siblings, 0 replies; 41+ messages in thread
From: J. Bruce Fields @ 2008-06-16 17:43 UTC (permalink / raw)
To: Weathers, Norman R.; +Cc: Jeff Layton, linux-kernel, linux-nfs, Neil Brown
On Fri, Jun 13, 2008 at 05:53:20PM -0500, Weathers, Norman R. wrote:
>
>
> > -----Original Message-----
> > From: J. Bruce Fields [mailto:bfields@fieldses.org]
> > Sent: Friday, June 13, 2008 5:04 PM
> > To: Weathers, Norman R.
> > Cc: Jeff Layton; linux-kernel@vger.kernel.org;
> > linux-nfs@vger.kernel.org; Neil Brown
> > Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
> >
> > On Fri, Jun 13, 2008 at 04:53:31PM -0500, Weathers, Norman R. wrote:
> > >
> > >
> > > > > The big one seems to be the __alloc_skb. (This is with 16 threads, and
> > > > > it says that we are using up somewhere between 12 and 14 GB of memory,
> > > > > about 2 to 3 gig of that is disk cache). If I were to put any more
> > > > > threads out there, the server would become almost unresponsive (it was
> > > > > bad enough as it was).
> > > > >
> > > > > At the same time, I also noticed this:
> > > > >
> > > > > skbuff_fclone_cache: 1842524 __alloc_skb+0x50/0x170
> > > > >
> > > > > Don't know for sure if that is meaningful or not....
> > > >
> > > > OK, so, starting at net/core/skbuff.c, this means that this memory was
> > > > allocated by __alloc_skb() calls with something nonzero in the third
> > > > ("fclone") argument. The only such caller is alloc_skb_fclone().
> > > > Callers of alloc_skb_fclone() include:
> > > >
> > > > sk_stream_alloc_skb:
> > > > do_tcp_sendpages
> > > > tcp_sendmsg
> > > > tcp_fragment
> > > > tso_fragment
> > >
> > > Interesting you should mention the tso... We recently went through and
> > > turned on TSO on all of our systems, trying it out to see if it helped
> > > with performance... This could be something to do with that. I can try
> > > disabling the tso on all of the servers and see if that helps with the
> > > memory. Actually, I think I will, and I will monitor the situation. I
> > > think it might help some, but I still think there may be something else
> > > going on in a deep corner...
> >
> > I'll plead total ignorance about TSO, and it sounds like a long
> > shot--but sure, it'd be worth trying, thanks.
> >
>
> Tried it, not for sure if I like the results yet or not... Didn't seem
> to make a huge difference, but here is something that will really make
> you want to drink, the 2.6.25.4 kernel does not go into the size-4096
> hell.
Remind me what the most recent *bad* kernel was of those you tested?
(2.6.25?)
Nothing jumped out at me in a quick skim through the commits from 2.6.25
to 2.6.25.4.
> The largest users of slab there are the size-1024 and still the
> skbuff_fclone_cache. On a box with 16 threads, it will cache up about 5
> GB of disk data, and still use about 6 GB of slab to put the information
> out there (without TSO on), but at least it is not causing the disk
> cache to be evicted, and it appears to be a little more responsive. If
> I up it to 32 or more threads, however, it gets very sluggish, but then
> again, I am hitting it with a lot of nodes.
>
> > >
> > > > tcp_mtu_probe
> > > > tcp_send_fin
> > > > tcp_connect
> > > > buf_acquire:
> > > > lots of callers in tipc code (whatever that is).
> > > >
> > > > So unless you're using tipc, or you have something in userspace going
> > > > haywire (perhaps netstat would help rule that out?), then I suppose
> > > > there's something wrong with knfsd's tcp code. Which makes sense, I
> > > > guess.
> > > >
> > >
> > > Not for sure what tipc is either....
> > >
> > > > I'd think this sort of allocation would be limited by the number of
> > > > sockets times the size of the send and receive buffers.
> > > > svc_xprt.c:svc_check_conn_limits() claims to be limiting the number of
> > > > sockets to (nrthreads+3)*20. (You aren't hitting the "too many open
> > > > connections" printk there, are you?) The total buffer size should be
> > > > bounded by something like 4 megs.
> > > >
> > > > --b.
> > > >
> > >
> > > Yes, we are getting a continuous stream of the too many open connections
> > > scrolling across our logs.
> >
> > That's interesting! So we should probably look more closely at the
> > svc_check_conn_limits() behavior. I wonder whether some pathological
> > behavior is triggered in the case where you're constantly over the limit
> > it's trying to enforce.
> >
> > (Remind me how many active clients you have?)
> >
>
>
> We currently are hitting with somewhere around 600 to 800 nodes, but it
> can go up to over 1000 nodes. We are artificially starving with a
> limited number of threads (2 to 3) right now on the older 2.6.22.14
> kernel because of that memory issue (which may or may not be tso
> related)...
So with that many clients all making requests to the server at once,
we'd start hitting that (serv->sv_nrthreads+3)*20 limit when the number
of threads was set to less than 30-50. That doesn't seem to be the
point where you're seeing a change in behavior, though.
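
To make that arithmetic concrete, a short Python sketch that inverts the (nrthreads+3)*20 formula for the client counts mentioned in this thread. It assumes one TCP connection per client, which is an assumption: a client may hold more than one connection. The function names are made up for illustration.

```python
import math


def conn_limit(nrthreads: int) -> int:
    """Connection limit as quoted from svc_check_conn_limits()."""
    return (nrthreads + 3) * 20


def min_threads_for(clients: int) -> int:
    """Smallest thread count whose limit covers `clients` connections."""
    return max(0, math.ceil(clients / 20) - 3)


if __name__ == "__main__":
    for clients in (600, 800, 1000):
        n = min_threads_for(clients)
        print(f"{clients} clients -> at least {n} threads (limit {conn_limit(n)})")
```

For 600, 800 and 1000 clients this gives 27, 37 and 47 threads, in line with the rough 30-50 figure. With 16 threads the limit is only 380 connections.
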
> I really want to move forward to the newer kernel, but we had an issue
> where clients all of a sudden wouldn't connect, yet other clients
> could, to the exact same server NFS export. I had booted the server
> into the 2.6.25.4 kernel at the time, and the other admin set us back to
> the 2.6.22.14 to see if that was it. The clients started working again,
> and he left it there (he also took out my options in the exports file,
> no_subtree_check and insecure). I know that we are running over the
> number of privileged ports, and we probably need the insecure, but I am
> having a hard time wrapping my head around all of the problems at
> once....
The secure ports limitation should be a problem for a client that does a
lot of nfs mounts, not for a server with a lot of clients.
--b.
^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
2008-06-16 17:43 ` J. Bruce Fields
@ 2008-06-19 15:53 ` Weathers, Norman R.
-1 siblings, 0 replies; 41+ messages in thread
From: Weathers, Norman R. @ 2008-06-19 15:53 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: Jeff Layton, linux-kernel, linux-nfs, Neil Brown
> -----Original Message-----
> From: J. Bruce Fields [mailto:bfields@fieldses.org]
> Sent: Monday, June 16, 2008 12:44 PM
> To: Weathers, Norman R.
> Cc: Jeff Layton; linux-kernel@vger.kernel.org;
> linux-nfs@vger.kernel.org; Neil Brown
> Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
>
> On Fri, Jun 13, 2008 at 05:53:20PM -0500, Weathers, Norman R. wrote:
> >
> >
> > > -----Original Message-----
> > > From: J. Bruce Fields [mailto:bfields@fieldses.org]
> > > Sent: Friday, June 13, 2008 5:04 PM
> > > To: Weathers, Norman R.
> > > Cc: Jeff Layton; linux-kernel@vger.kernel.org;
> > > linux-nfs@vger.kernel.org; Neil Brown
> > > Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
> > >
> > > > On Fri, Jun 13, 2008 at 04:53:31PM -0500, Weathers, Norman R. wrote:
> > > >
> > > >
> > > > > > > The big one seems to be the __alloc_skb. (This is with 16
> > > > > > > threads, and it says that we are using up somewhere between
> > > > > > > 12 and 14 GB of memory, about 2 to 3 gig of that is disk
> > > > > > > cache). If I were to put anymore threads out there, the
> > > > > > > server would become almost unresponsive (it was bad enough
> > > > > > > as it was).
> > > > > >
> > > > > > At the same time, I also noticed this:
> > > > > >
> > > > > > skbuff_fclone_cache: 1842524 __alloc_skb+0x50/0x170
> > > > > >
> > > > > > Don't know for sure if that is meaningful or not....
> > > > >
> > > > > > OK, so, starting at net/core/skbuff.c, this means that this
> > > > > > memory was allocated by __alloc_skb() calls with something
> > > > > > nonzero in the third ("fclone") argument. The only such caller
> > > > > > is alloc_skb_fclone(). Callers of alloc_skb_fclone() include:
> > > > >
> > > > > sk_stream_alloc_skb:
> > > > > do_tcp_sendpages
> > > > > tcp_sendmsg
> > > > > tcp_fragment
> > > > > tso_fragment
> > > >
> > > > > Interesting you should mention the tso... We recently went
> > > > > through and turned on TSO on all of our systems, trying it out
> > > > > to see if it helped with performance... This could be something
> > > > > to do with that. I can try disabling the tso on all of the
> > > > > servers and see if that helps with the memory. Actually, I think
> > > > > I will, and I will monitor the situation. I think it might help
> > > > > some, but I still think there may be something else going on in
> > > > > a deep corner...
> > >
> > > I'll plead total ignorance about TSO, and it sounds like a long
> > > shot--but sure, it'd be worth trying, thanks.
> > >
> >
> > Tried it, not for sure if I like the results yet or not... Didn't seem
> > to make a huge difference, but here is something that will really make
> > you want to drink, the 2.6.25.4 kernel does not go into the size-4096
> > hell.
>
> Remind me what the most recent *bad* kernel was of those you tested?
> (2.6.25?)
>
The kernel that we were really seeing the problem with was 2.6.25.4, but
I think we may have figured out the 4096 problem, and it was probably a
mistake on my part, but it is important for the NFS users to see it so
they don't make the same mistake. I had found some performance tuning
guides, and in trying some of the suggestions, found that the setting
changes did seem to help on some things, but of course I never got to
run a check under full load (800 + clients). A suggestion was to change
the tcp_reordering tunable under /proc/sys/net/ipv4 from the default 3
to 127. We think that this was actually causing the issue. I was able
to trace back through all of the changes, and I changed this setting
back to the default 3, and it immediately fixed the size-4096 hell. It
appears that the reordering just eats into the memory, especially in
high demand situations, and I guess that should make perfect sense if we
are actually buffering up packets for reorder, and we are slamming the
box with thousands of requests per minute.
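
For anyone wanting to check their own servers for the same setting, a minimal sketch in Python follows. It only reads the tunable named above and compares it with the default of 3 quoted in this message. The function names and the `path` parameter are illustrative assumptions, not part of any existing tool.

```python
from pathlib import Path

# Default value as quoted in the message above.
DEFAULT_TCP_REORDERING = 3

# Location of the tunable under /proc/sys/net/ipv4, as named in the message.
TCP_REORDERING_PATH = "/proc/sys/net/ipv4/tcp_reordering"


def read_tcp_reordering(path=TCP_REORDERING_PATH):
    """Return the current tcp_reordering value as an integer."""
    return int(Path(path).read_text().strip())


def reordering_is_default(path=TCP_REORDERING_PATH):
    """True if the tunable is still at the default quoted above."""
    return read_tcp_reordering(path) == DEFAULT_TCP_REORDERING
```

On a Linux server, calling `reordering_is_default()` with no arguments reads the live value; the `path` argument exists only so the helper can be pointed at a saved copy of the file.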
We still have other performance issues now, but it appears to be more of
a bottleneck, the nodes do not appear to be backing off when the servers
are becoming congested.
> Nothing jumped out at me in a quick skim through the commits from 2.6.25
> to 2.6.25.4.
>
> > The largest users of slab there are the size-1024 and still the
> > skbuff_fclone_cache. On a box with 16 threads, it will cache up about
> > 5 GB of disk data, and still use about 6 GB of slab to put the
> > information out there (without TSO on), but at least it is not causing
> > the disk cache to be evicted, and it appears to be a little more
> > responsive. If I up it to 32 or more threads, however, it gets very
> > sluggish, but then again, I am hitting it with a lot of nodes.
> >
> > > >
> > > > > tcp_mtu_probe
> > > > > tcp_send_fin
> > > > > tcp_connect
> > > > > buf_acquire:
> > > > > lots of callers in tipc code (whatever that is).
> > > > >
> > > > > > So unless you're using tipc, or you have something in userspace
> > > > > > going haywire (perhaps netstat would help rule that out?), then
> > > > > > I suppose there's something wrong with knfsd's tcp code. Which
> > > > > > makes sense, I guess.
> > > > >
> > > >
> > > > Not for sure what tipc is either....
> > > >
> > > > > > I'd think this sort of allocation would be limited by the
> > > > > > number of sockets times the size of the send and receive
> > > > > > buffers. svc_xprt.c:svc_check_conn_limits() claims to be
> > > > > > limiting the number of sockets to (nrthreads+3)*20. (You aren't
> > > > > > hitting the "too many open connections" printk there, are you?)
> > > > > > The total buffer size should be bounded by something like 4
> > > > > > megs.
> > > > >
> > > > > --b.
> > > > >
> > > >
> > > > > Yes, we are getting a continuous stream of the too many open
> > > > > connections scrolling across our logs.
> > >
> > > > That's interesting! So we should probably look more closely at the
> > > > svc_check_conn_limits() behavior. I wonder whether some
> > > > pathological behavior is triggered in the case where you're
> > > > constantly over the limit it's trying to enforce.
> > >
> > > (Remind me how many active clients you have?)
> > >
> >
> >
> > We currently are hitting with somewhere around 600 to 800 nodes, but
> > it can go up to over 1000 nodes. We are artificially starving with a
> > limited number of threads (2 to 3) right now on the older 2.6.22.14
> > kernel because of that memory issue (which may or may not be tso
> > related)...
>
> So with that many clients all making requests to the server at once,
> we'd start hitting that (serv->sv_nrthreads+3)*20 limit when the number
> of threads was set to less than 30-50. That doesn't seem to be the
> point where you're seeing a change in behavior, though.
>
We were estimating between 40 and 50 threads was the cut off for being
able to service all of the (current) requests at once. I haven't ramped
back up to that level yet. I wasn't comfortable yet with letting it all
hang back out just in case we get into that hellish mode again, it can
be a pain to try and get into those systems once they are overloaded
(even over serial, sometimes it can just timeout the login). We had to
actually bring online a second option to help alleviate some of the back
congestion because the servers couldn't handle the workload.
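
The arithmetic behind the 30-50 thread estimate can be sketched as below. This is only an illustration of the (nrthreads+3)*20 formula quoted above; it assumes one TCP connection per client, and the function names are made up for the example.

```python
import math


def conn_limit(nrthreads):
    """Socket limit from the formula quoted above: (nrthreads + 3) * 20."""
    return (nrthreads + 3) * 20


def min_threads_for(clients):
    """Smallest thread count whose limit covers `clients` connections.

    Assumes each client holds exactly one connection to the server.
    """
    return max(0, math.ceil(clients / 20) - 3)


if __name__ == "__main__":
    # Client counts mentioned in this thread.
    for clients in (600, 800, 1000):
        n = min_threads_for(clients)
        print(f"{clients} clients -> at least {n} threads (limit {conn_limit(n)})")
```

Under that assumption, 600 clients need 27 threads and 1000 clients need 47, which lines up with the 30-50 range discussed here. With 16 threads the limit is only 380 connections.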
> > I really want to move forward to the newer kernel, but we had an issue
> > where clients all of the sudden wouldn't connect, yet other clients
> > could, to the exact same server NFS export. I had booted the server
> > into the 2.6.25.4 kernel at the time, and the other admin set us back to
> > the 2.6.22.14 to see if that was it. The clients started working again,
> > and he left it there (he also took out my options in the exports file,
> > no_subtree_check and insecure). I know that we are running over the
> > number of privileged ports, and we probably need the insecure, but I am
> > having a hard time wrapping my self around all of the problems at
> > once....
>
> The secure ports limitation should be a problem for a client that does a
> lot of nfs mounts, not for a server with a lot of clients.
>
Ah, OK. That makes sense.
> --b.
>
* Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
2008-06-19 15:53 ` Weathers, Norman R.
@ 2008-06-19 18:46 ` J. Bruce Fields
-1 siblings, 0 replies; 41+ messages in thread
From: J. Bruce Fields @ 2008-06-19 18:46 UTC (permalink / raw)
To: Weathers, Norman R.; +Cc: Jeff Layton, linux-kernel, linux-nfs, Neil Brown
On Thu, Jun 19, 2008 at 10:53:28AM -0500, Weathers, Norman R. wrote:
> The kernel that we were really seeing the problem with was 2.6.25.4, but
> I think we may have figured out the 4096 problem, and it was probably a
> mistake on my part, but it is important for the NFS users to see it so
> they don't make the same mistake. I had found some performance tuning
> guides, and in trying some of the suggestions, found that the setting
> changes did seem to help on some things, but of course I never got to
> run a check under full load (800 + clients). A suggestion was to change
> the tcp_reordering tunable under /proc/sys/net/ipv4 from the default 3
> to 127. We think that this was actually causing the issue. I was able
> to trace back through all of the changes, and I changed this setting
> back to the default 3, and it immediately fixed the size-4096 hell. It
> appears that the reordering just eats into the memory, especially in
> high demand situations, and I guess that should make perfect sense if we
> are actually buffering up packets for reorder, and we are slamming the
> box with thousands of requests per minute.
OK, sounds plausible, though I won't pretend to understand exactly how
that reordering code is using memory.
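
One way to confirm that a tunable change really moves the slab numbers is to sample /proc/slabinfo before and after. The sketch below is a rough illustration, not a tested monitoring tool: it assumes the usual slabinfo column order (name, active_objs, num_objs, objsize, ...) and estimates each cache's footprint as num_objs * objsize, which ignores per-slab overhead. The function names are hypothetical.

```python
def slab_bytes(slabinfo_text):
    """Map cache name -> approximate bytes (num_objs * objsize)."""
    usage = {}
    for line in slabinfo_text.splitlines():
        # Skip the version banner, the column header, and blank lines.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        name = fields[0]
        num_objs = int(fields[2])
        objsize = int(fields[3])
        usage[name] = num_objs * objsize
    return usage


def top_caches(slabinfo_text, n=5):
    """Return the n largest caches as (name, bytes), biggest first."""
    usage = slab_bytes(slabinfo_text)
    return sorted(usage.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Reading /proc/slabinfo usually needs root. Passing the file contents to `top_caches()` before and after changing a setting should show whether size-4096 or skbuff_fclone_cache is the cache that grows.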
> We still have other performance issues now, but it appears to be more of
> a bottleneck, the nodes do not appear to be backing off when the servers
> are becoming congested.
...
> > So with that many clients all making requests to the server at once,
> > we'd start hitting that (serv->sv_nrthreads+3)*20 limit when the
> > number of threads was set to less than 30-50. That doesn't seem to be
> > the point where you're seeing a change in behavior, though.
> >
>
> We were estimating between 40 and 50 threads was the cut off for being
> able to service all of the (current) requests at once. I haven't ramped
> back up to that level yet. I wasn't comfortable yet with letting it all
> hang back out just in case we get into that hellish mode again, it can
> be a pain to try and get into those systems once they are overloaded
> (even over serial, sometimes it can just timeout the login). We had to
> actually bring online a second option to help alleviate some of the back
> congestion because the servers couldn't handle the workload.
Thanks for the update, and let us know if you figure out anything more.
--b.
end of thread, other threads:[~2008-06-19 18:46 UTC | newest]
Thread overview: 41+ messages
2008-06-03 18:50 Problems with large number of clients and reads Norman Weathers
2008-06-04 13:49 ` Chuck Lever
[not found] ` <76bd70e30806040649h53ab5d66x8c3423c551e94f77-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-06-04 14:13 ` Norman Weathers
2008-06-05 18:54 ` Norman Weathers
2008-06-06 14:44 ` Chuck Lever
2008-06-09 13:56 ` Weathers, Norman R.
2008-06-06 0:06 ` Dean Hildebrand
2008-06-09 13:20 ` Weathers, Norman R.
2008-06-06 16:09 ` J. Bruce Fields
2008-06-09 14:19 ` Weathers, Norman R.
[not found] ` <0122F800A3B64C449565A9E8C2977010155587-zIGg2qceuZx7uNL6xugVa6xOck334EZe@public.gmane.org>
2008-06-09 18:53 ` J. Bruce Fields
2008-06-10 14:30 ` Weathers, Norman R.
[not found] ` <0122F800A3B64C449565A9E8C297701002D75D9F-zIGg2qceuZx7uNL6xugVa6xOck334EZe@public.gmane.org>
2008-06-10 17:16 ` J. Bruce Fields
2008-06-10 22:12 ` Weathers, Norman R.
[not found] ` <0122F800A3B64C449565A9E8C297701002D75DA3-zIGg2qceuZx7uNL6xugVa6xOck334EZe@public.gmane.org>
2008-06-11 18:46 ` J. Bruce Fields
2008-06-11 19:52 ` CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger? J. Bruce Fields
2008-06-11 19:52 ` J. Bruce Fields
2008-06-11 20:09 ` Jeff Layton
2008-06-11 20:09 ` Jeff Layton
[not found] ` <20080611160947.5f08fb16-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2008-06-11 20:57 ` J. Bruce Fields
2008-06-11 20:57 ` J. Bruce Fields
2008-06-11 22:46 ` Weathers, Norman R.
2008-06-11 22:46 ` Weathers, Norman R.
[not found] ` <0122F800A3B64C449565A9E8C297701002D75DAA-zIGg2qceuZx7uNL6xugVa6xOck334EZe@public.gmane.org>
2008-06-11 22:54 ` J. Bruce Fields
2008-06-11 22:54 ` J. Bruce Fields
2008-06-12 19:54 ` Weathers, Norman R.
2008-06-12 19:54 ` Weathers, Norman R.
[not found] ` <0122F800A3B64C449565A9E8C297701002D75DAE-zIGg2qceuZx7uNL6xugVa6xOck334EZe@public.gmane.org>
2008-06-13 20:15 ` J. Bruce Fields
2008-06-13 20:15 ` J. Bruce Fields
2008-06-13 21:53 ` Weathers, Norman R.
2008-06-13 21:53 ` Weathers, Norman R.
[not found] ` <0122F800A3B64C449565A9E8C297701002D75DB6-zIGg2qceuZx7uNL6xugVa6xOck334EZe@public.gmane.org>
2008-06-13 22:04 ` J. Bruce Fields
2008-06-13 22:04 ` J. Bruce Fields
2008-06-13 22:53 ` Weathers, Norman R.
2008-06-13 22:53 ` Weathers, Norman R.
[not found] ` <0122F800A3B64C449565A9E8C297701002D75DB7-zIGg2qceuZx7uNL6xugVa6xOck334EZe@public.gmane.org>
2008-06-16 17:43 ` J. Bruce Fields
2008-06-16 17:43 ` J. Bruce Fields
2008-06-19 15:53 ` Weathers, Norman R.
2008-06-19 15:53 ` Weathers, Norman R.
[not found] ` <0122F800A3B64C449565A9E8C297701002D75DD4-zIGg2qceuZx7uNL6xugVa6xOck334EZe@public.gmane.org>
2008-06-19 18:46 ` J. Bruce Fields
2008-06-19 18:46 ` J. Bruce Fields