* Re: [BUG] sky2 broken for Yukon PCI-E Gigabit Ethernet Controller 11ab:4362 (rev 19)
From: Herbert Xu @ 2006-01-27 12:22 UTC (permalink / raw)
To: Knut Petersen; +Cc: shemminger, netdev, linux-kernel, David S. Miller
In-Reply-To: <43D9B8A6.5020200@t-online.de>
[-- Attachment #1: Type: text/plain, Size: 1128 bytes --]
On Fri, Jan 27, 2006 at 07:07:34AM +0100, Knut Petersen wrote:
>
> Well, there are no problems if SuSEfirewall2 is disabled. But have a look
> at the loaded modules:
>
> ipt_MASQUERADE 3968 1
> pppoe 15360 2
> pppox 4616 1 pppoe
OK, although we can't rule out sky2/netfilter from the enquiry, I've
identified two bugs in ppp/pppoe that may be responsible for what you
are seeing. So please try the following patch and let us know if the
problem still exists (or deteriorates/improves).
[PPP]: Fixed hardware RX checksum handling
When we pull the PPP protocol off the skb, we forgot to update the
hardware RX checksum. This may lead to messages such as
dsl0: hw csum failure.
Similarly, we need to clear the hardware checksum flag when we use
the existing packet to store the decompressed result.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
[-- Attachment #2: ppp-rxcsum --]
[-- Type: text/plain, Size: 682 bytes --]
diff --git a/drivers/net/ppp_generic.c b/drivers/net/ppp_generic.c
--- a/drivers/net/ppp_generic.c
+++ b/drivers/net/ppp_generic.c
@@ -1610,6 +1610,8 @@ ppp_receive_nonmp_frame(struct ppp *ppp,
}
else if (!pskb_may_pull(skb, skb->len))
goto err;
+ else
+ skb->ip_summed = CHECKSUM_NONE;
len = slhc_uncompress(ppp->vj, skb->data + 2, skb->len - 2);
if (len <= 0) {
@@ -1690,6 +1692,7 @@ ppp_receive_nonmp_frame(struct ppp *ppp,
kfree_skb(skb);
} else {
skb_pull(skb, 2); /* chop off protocol */
+ skb_postpull_rcsum(skb, skb->data - 2, 2);
skb->dev = ppp->dev;
skb->protocol = htons(npindex_to_ethertype[npi]);
skb->mac.raw = skb->data;
^ permalink raw reply
* Re: [-mm patch] drivers/net/wireless/tiacx/: remove code for WIRELESS_EXT < 18
From: Denis Vlasenko @ 2006-01-27 12:49 UTC (permalink / raw)
To: Johannes Berg
Cc: Adrian Bunk, acx100-devel, John W. Linville, jgarzik, netdev,
linux-kernel
In-Reply-To: <1138362557.5983.26.camel@localhost>
On Friday 27 January 2006 13:49, Johannes Berg wrote:
> On Fri, 2006-01-27 at 12:19 +0200, Denis Vlasenko wrote:
>
> > I very much want to get rid of all remaining compat cruft, and
> > I plan to do it as soon as acx will be present in mainline kernel.
>
> I doubt you'll get it merged with the compat cruft.
What cruft? This?
# grep -r WIRELESS_EXT .
./pci.c: ndev->name, WIRELESS_EXT, UTS_RELEASE);
./common.c: "Wireless extension version:\t" STRING(WIRELESS_EXT) "\n"
./acx_struct.h:#ifdef WIRELESS_EXT
./acx_struct.h:#if WIRELESS_EXT > 15
./ioctl.c: range->we_version_compiled = WIRELESS_EXT;
I consider this to be a really modest amount of compat code
which makes driver users happy (that fraction of it which is not
willing to run -mm).
However, I would remove even that at Jeff's or Andrew's request,
or without anyone's request if acx will be merged to Linus tree.
--
vda
^ permalink raw reply
* Re: [Acx100-devel] Re: [-mm patch] drivers/net/wireless/tiacx/: remove code for WIRELESS_EXT < 18
From: Andreas Mohr @ 2006-01-27 14:46 UTC (permalink / raw)
To: acx100-devel
Cc: Johannes Berg, Adrian Bunk, John W. Linville, jgarzik, netdev,
linux-kernel
In-Reply-To: <200601271449.49226.vda@ilport.com.ua>
Hi,
On Fri, Jan 27, 2006 at 02:49:49PM +0200, Denis Vlasenko wrote:
> On Friday 27 January 2006 13:49, Johannes Berg wrote:
> > On Fri, 2006-01-27 at 12:19 +0200, Denis Vlasenko wrote:
> >
> > > I very much want to get rid of all remaining compat cruft, and
> > > I plan to do it as soon as acx will be present in mainline kernel.
> >
> > I doubt you'll get it merged with the compat cruft.
>
> What cruft? This?
>
> # grep -r WIRELESS_EXT .
> ./pci.c: ndev->name, WIRELESS_EXT, UTS_RELEASE);
> ./common.c: "Wireless extension version:\t" STRING(WIRELESS_EXT) "\n"
> ./acx_struct.h:#ifdef WIRELESS_EXT
> ./acx_struct.h:#if WIRELESS_EXT > 15
> ./ioctl.c: range->we_version_compiled = WIRELESS_EXT;
>
> I consider this to be a really modest amount of compat code
> which makes driver users happy (that fraction of it which is not
> willing to run -mm).
>
> However, I would remove even that at Jeff's or Andrew's request,
> or without anyone's request if acx will be merged to Linus tree.
Indeed, I don't think there should be any discussion at all about this,
since it helps users of our currently still external driver
(not too much longer external, I guess and hope) a lot.
Given that we don't have a stable driver ABI (for way too often discussed
very valid and sane reasons) I really, really think we shouldn't shoot
our foot into pieces by then additionally also bitching about *MINIMAL*
amounts of compatibility code required to keep up with those speedily changing
kernel requirements while our driver isn't included yet.
In the future, I'd like to ask people to be a *bit* more tolerant of newish
compatibility cruft. It's not like we're supporting kernel 2.2.x here still,
our driver is at 2.6.10 at a minimum(!), yet you still want to remove even
those few pieces!
This is simply ridiculous (again, as long as our driver isn't merged, which it
should be soon to improve maintenance).
OK, ending this rather fruitless discussion here. I better get back to hacking,
that's more productive.
Andreas Mohr
^ permalink raw reply
* Re: [BUG] sky2 broken for Yukon PCI-E Gigabit Ethernet Controller 11ab:4362 (rev 19)
From: Patrick McHardy @ 2006-01-27 15:28 UTC (permalink / raw)
To: Herbert Xu, KdF
Cc: Knut Petersen, shemminger, netdev, linux-kernel, David S. Miller,
netfilter-devel
In-Reply-To: <20060127122242.GA32128@gondor.apana.org.au>
[-- Attachment #1: Type: text/plain, Size: 1050 bytes --]
Herbert Xu wrote:
> On Fri, Jan 27, 2006 at 07:07:34AM +0100, Knut Petersen wrote:
>
>>Well, there are no problems if SuSEfirewall2 is disabled. But have a look
>>at the loaded modules:
>>
>>ipt_MASQUERADE 3968 1
>>pppoe 15360 2
>>pppox 4616 1 pppoe
>
>
> OK, although we can't rule out sky2/netfilter from the enquiry, I've
> identified two bugs in ppp/pppoe that may be responsible for what you
> are seeing. So please try the following patch and let us know if the
> problem still exists (or deteriorates/improves).
>
> [PPP]: Fixed hardware RX checksum handling
>
> When we pull the PPP protocol off the skb, we forgot to update the
> hardware RX checksum. This may lead to messages such as
>
> dsl0: hw csum failure.
>
> Similarly, we need to clear the hardware checksum flag when we use
> the existing packet to store the decompressed result.
We had a couple of reports of incorrect hardware checksums with
PPPoE. KdF, can you test Herbert's patch (attached again to this
mail) please?
[-- Attachment #2: ppp-rxcsum --]
[-- Type: text/plain, Size: 682 bytes --]
diff --git a/drivers/net/ppp_generic.c b/drivers/net/ppp_generic.c
--- a/drivers/net/ppp_generic.c
+++ b/drivers/net/ppp_generic.c
@@ -1610,6 +1610,8 @@ ppp_receive_nonmp_frame(struct ppp *ppp,
}
else if (!pskb_may_pull(skb, skb->len))
goto err;
+ else
+ skb->ip_summed = CHECKSUM_NONE;
len = slhc_uncompress(ppp->vj, skb->data + 2, skb->len - 2);
if (len <= 0) {
@@ -1690,6 +1692,7 @@ ppp_receive_nonmp_frame(struct ppp *ppp,
kfree_skb(skb);
} else {
skb_pull(skb, 2); /* chop off protocol */
+ skb_postpull_rcsum(skb, skb->data - 2, 2);
skb->dev = ppp->dev;
skb->protocol = htons(npindex_to_ethertype[npi]);
skb->mac.raw = skb->data;
^ permalink raw reply
* Re: [BUG] sky2 broken for Yukon PCI-E Gigabit Ethernet Controller 11ab:4362 (rev 19)
From: Knut Petersen @ 2006-01-27 16:04 UTC (permalink / raw)
To: Herbert Xu; +Cc: shemminger, netdev, linux-kernel, David S. Miller
In-Reply-To: <20060127122242.GA32128@gondor.apana.org.au>
Herbert Xu wrote:
>When we pull the PPP protocol off the skb, we forgot to update the
>hardware RX checksum. This may lead to messages such as
>
> dsl0: hw csum failure.
>
>Similarly, we need to clear the hardware checksum flag when we use
>the existing packet to store the decompressed result.
>
>Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
>
ACK
That patch seems to solve all my problems with
sky2 / pppoe / SuSE 9.2 Firewall.
Thanks a lot!
cu,
Knut
^ permalink raw reply
* Re: [PATCH 2.6.15-git9a] aoe [1/1]: do not stop retransmit timer when device goes down
From: Al Boldi @ 2006-01-27 16:12 UTC (permalink / raw)
To: Ed L. Cashin; +Cc: linux-kernel, linux-raid, netdev
In-Reply-To: <200601260104.37649.a1426z@gawab.com>
Ed L. Cashin wrote:
> On Thu, Jan 26, 2006 at 01:04:37AM +0300, Al Boldi wrote:
> > Ed L. Cashin wrote:
> > > This patch is a bugfix that follows and depends on the
> > > eight aoe driver patches sent January 19th.
> >
> > Will they also fix this?
> > Or is this an md bug?
>
> No, this patch fixes a bug that would cause an AoE device to be
> totally unusable, so I think mdadm or mkraid would get an error that
> the device was not available before it tried to make a new md device.
>
> > It only happens with aoe.
>
> It looks like in setting up the raid, sysfs_create_link probably has
> this going off:
>
> BUG_ON(!kobj || !kobj->dentry || !name);
>
> > Also, why is aoe slower than nbd?
>
> It wasn't when I tried it. The userland vblade is slow. Maybe that's
> affecting your results?
Why is the userland vblade server slower than the userland nbd-server?
Thanks!
--
Al
^ permalink raw reply
* [GIT PULL] bcm43xx update
From: Michael Buesch @ 2006-01-27 17:42 UTC (permalink / raw)
To: John W. Linville
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, bcm43xx-dev-0fE9KPoRgkgATYTw5x5z8w
[-- Attachment #1: Type: text/plain, Size: 1069 bytes --]
Hi John,
I synced my git repository with the bcm43xx SVN repository.
I will continue to develop the driver in git, so these ugly syncing
will not be needed any longer. I will poke you from time to time
to pull from my repos. (I don't know how the other bcm43xx developers
will handle it. I suggest they set up a git repository, too)
Please do a
git pull git://bu3sch.de/bcm43xx.git master-upstream
to pull the bcm43xx-softmac branch.
Please do a
git pull git://bu3sch.de/bcm43xx.git dscape-upstream
to pull the bcm43xx-dscape branch.
master ShortLog:
Michael Buesch:
[bcm43xx] sync with svn.berlios.de
[bcm43xx] remove linux version compatibility code.
[bcm43xx] Move README file to Documentation directory.
[bcm43xx] remove redundant COPYING file.
dscape ShortLog:
Michael Buesch:
[bcm43xx] sync with svn.berlios.de
[bcm43xx] Move documentation to Documentation subdirectory and move scripts to scripts subdirectory.
[bcm43xx] Remove redundant COPYING file.
--
Greetings Michael.
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply
* Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
From: Ravikiran G Thirumalai @ 2006-01-27 19:52 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Andrew Morton, davem, linux-kernel, shai, netdev, pravins
In-Reply-To: <43D9DFA1.9070802@cosmosbay.com>
On Fri, Jan 27, 2006 at 09:53:53AM +0100, Eric Dumazet wrote:
> Ravikiran G Thirumalai a écrit :
> >Change the atomic_t sockets_allocated member of struct proto to a
> >per-cpu counter.
> >
> >Signed-off-by: Pravin B. Shelar <pravins@calsoftinc.com>
> >Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
> >Signed-off-by: Shai Fultheim <shai@scalex86.org>
> >
> Hi Ravikiran
>
> If I correctly read this patch, I think there is a scalability problem.
>
> On a big SMP machine, read_sockets_allocated() is going to be a real killer.
>
> Say we have 128 Opterons CPUS in a box.
read_sockets_allocated is being invoked when when /proc/net/protocols is read,
which can be assumed as not frequent.
At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only
certain conditions, under memory pressure -- on a large CPU count machine,
you'd have large memory, and I don't think read_sockets_allocated would get
called often. It did not atleast on our 8cpu/16G box. So this should be OK
I think.
There're no 128 CPU Opteron boxes yet afaik ;).
Thanks,
Kiran
^ permalink raw reply
* Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
From: Andrew Morton @ 2006-01-27 20:16 UTC (permalink / raw)
To: Ravikiran G Thirumalai; +Cc: dada1, davem, linux-kernel, shai, netdev, pravins
In-Reply-To: <20060127195227.GA3565@localhost.localdomain>
Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
>
> On Fri, Jan 27, 2006 at 09:53:53AM +0100, Eric Dumazet wrote:
> > Ravikiran G Thirumalai a écrit :
> > >Change the atomic_t sockets_allocated member of struct proto to a
> > >per-cpu counter.
> > >
> > >Signed-off-by: Pravin B. Shelar <pravins@calsoftinc.com>
> > >Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
> > >Signed-off-by: Shai Fultheim <shai@scalex86.org>
> > >
> > Hi Ravikiran
> >
> > If I correctly read this patch, I think there is a scalability problem.
> >
> > On a big SMP machine, read_sockets_allocated() is going to be a real killer.
> >
> > Say we have 128 Opterons CPUS in a box.
>
> read_sockets_allocated is being invoked when when /proc/net/protocols is read,
> which can be assumed as not frequent.
> At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only
> certain conditions, under memory pressure -- on a large CPU count machine,
> you'd have large memory, and I don't think read_sockets_allocated would get
> called often. It did not atleast on our 8cpu/16G box. So this should be OK
> I think.
That being said, the percpu_counters aren't a terribly successful concept
and probably do need a revisit due to the high inaccuracy at high CPU
counts. It might be better to do some generic version of vm_acct_memory()
instead.
If the benchmarks say that we need to. If we cannot observe any problems
in testing of existing code and if we can't demonstrate any benefit from
the patched code then one option is to go off and do something else ;)
^ permalink raw reply
* Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
From: Eric Dumazet @ 2006-01-27 22:30 UTC (permalink / raw)
To: Andrew Morton
Cc: Ravikiran G Thirumalai, davem, linux-kernel, shai, netdev,
pravins
In-Reply-To: <20060127121602.18bc3f25.akpm@osdl.org>
Andrew Morton a écrit :
> Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
>> On Fri, Jan 27, 2006 at 09:53:53AM +0100, Eric Dumazet wrote:
>>> Ravikiran G Thirumalai a écrit :
>>>> Change the atomic_t sockets_allocated member of struct proto to a
>>>> per-cpu counter.
>>>>
>>>> Signed-off-by: Pravin B. Shelar <pravins@calsoftinc.com>
>>>> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
>>>> Signed-off-by: Shai Fultheim <shai@scalex86.org>
>>>>
>>> Hi Ravikiran
>>>
>>> If I correctly read this patch, I think there is a scalability problem.
>>>
>>> On a big SMP machine, read_sockets_allocated() is going to be a real killer.
>>>
>>> Say we have 128 Opterons CPUS in a box.
>> read_sockets_allocated is being invoked when when /proc/net/protocols is read,
>> which can be assumed as not frequent.
>> At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only
>> certain conditions, under memory pressure -- on a large CPU count machine,
>> you'd have large memory, and I don't think read_sockets_allocated would get
>> called often. It did not atleast on our 8cpu/16G box. So this should be OK
>> I think.
>
> That being said, the percpu_counters aren't a terribly successful concept
> and probably do need a revisit due to the high inaccuracy at high CPU
> counts. It might be better to do some generic version of vm_acct_memory()
> instead.
There are several issues here :
alloc_percpu() current implementation is a a waste of ram. (because it uses
slab allocations that have a minimum size of 32 bytes)
Currently we cannot use per_cpu(&some_object, cpu), so a generic version of
vm_acct_memory() would need a rework of percpu.h and maybe this is not
possible on every platform ?
#define per_cpu(var, cpu) (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))
-->
#define per_cpu_name(var) per_cpu__##var
#define per_cpu_addr(var) &per_cpu_name(var)
#define per_cpu(var, cpu) (*RELOC_HIDE(per_cpu_addr(var), __per_cpu_offset[cpu])
But this could render TLS migration difficult...
Eric
^ permalink raw reply
* Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
From: Ravikiran G Thirumalai @ 2006-01-27 22:44 UTC (permalink / raw)
To: Andrew Morton; +Cc: dada1, davem, linux-kernel, shai, netdev, pravins
In-Reply-To: <20060127121602.18bc3f25.akpm@osdl.org>
On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
> >
> > which can be assumed as not frequent.
> > At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only
> > certain conditions, under memory pressure -- on a large CPU count machine,
> > you'd have large memory, and I don't think read_sockets_allocated would get
> > called often. It did not atleast on our 8cpu/16G box. So this should be OK
> > I think.
>
> That being said, the percpu_counters aren't a terribly successful concept
> and probably do need a revisit due to the high inaccuracy at high CPU
> counts. It might be better to do some generic version of vm_acct_memory()
> instead.
AFAICS vm_acct_memory is no better. The deviation on large cpu counts is the
same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...
>
> If the benchmarks say that we need to. If we cannot observe any problems
> in testing of existing code and if we can't demonstrate any benefit from
> the patched code then one option is to go off and do something else ;)
We first tried plain per-CPU counters for memory_allocated, found that reads
on memory_allocated was causing cacheline transfers, and then
switched over to batching. So batching reads is useful. To avoid
inaccuracy, we can maybe change percpu_counter_init to:
void percpu_counter_init(struct percpu_counter *fbc, int maxdev)
the percpu batching limit would then be maxdev/num_possible_cpus. One would
use batching counters only when both reads and writes are frequent. With
the above scheme, we would go fetch cachelines from other cpus for read
often only on large cpu counts, which is not any worse than the global
counter alternative, but it would still be beneficial on smaller machines,
without sacrificing a pre-set deviation.
Comments?
Thanks,
Kiran
^ permalink raw reply
* Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
From: Ravikiran G Thirumalai @ 2006-01-27 22:50 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Andrew Morton, davem, linux-kernel, shai, netdev, pravins
In-Reply-To: <43DA9EFF.1020200@cosmosbay.com>
On Fri, Jan 27, 2006 at 11:30:23PM +0100, Eric Dumazet wrote:
>
> There are several issues here :
>
> alloc_percpu() current implementation is a a waste of ram. (because it uses
> slab allocations that have a minimum size of 32 bytes)
Oh there was a solution for that :).
http://lwn.net/Articles/119532/
I can quickly revive it if there is interest.
Thanks,
Kiran
^ permalink raw reply
* Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
From: Eric Dumazet @ 2006-01-27 22:58 UTC (permalink / raw)
To: Ravikiran G Thirumalai
Cc: Andrew Morton, davem, linux-kernel, shai, netdev, pravins
In-Reply-To: <20060127224433.GB3565@localhost.localdomain>
Ravikiran G Thirumalai a écrit :
> On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:
>> Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
>>> which can be assumed as not frequent.
>>> At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only
>>> certain conditions, under memory pressure -- on a large CPU count machine,
>>> you'd have large memory, and I don't think read_sockets_allocated would get
>>> called often. It did not atleast on our 8cpu/16G box. So this should be OK
>>> I think.
>> That being said, the percpu_counters aren't a terribly successful concept
>> and probably do need a revisit due to the high inaccuracy at high CPU
>> counts. It might be better to do some generic version of vm_acct_memory()
>> instead.
>
> AFAICS vm_acct_memory is no better. The deviation on large cpu counts is the
> same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...
Ah... yes you are right, I read min(16, NR_CPUS*2)
I wonder if it is not a typo... I mean, I understand the more cpus you have,
the less updates on central atomic_t is desirable, but a quadratic offset
seems too much...
Eric
^ permalink raw reply
* Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
From: Andrew Morton @ 2006-01-27 23:01 UTC (permalink / raw)
To: Ravikiran G Thirumalai; +Cc: dada1, davem, linux-kernel, shai, netdev, pravins
In-Reply-To: <20060127224433.GB3565@localhost.localdomain>
Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
>
> On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:
> > Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
> > >
> > > which can be assumed as not frequent.
> > > At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only
> > > certain conditions, under memory pressure -- on a large CPU count machine,
> > > you'd have large memory, and I don't think read_sockets_allocated would get
> > > called often. It did not atleast on our 8cpu/16G box. So this should be OK
> > > I think.
> >
> > That being said, the percpu_counters aren't a terribly successful concept
> > and probably do need a revisit due to the high inaccuracy at high CPU
> > counts. It might be better to do some generic version of vm_acct_memory()
> > instead.
>
> AFAICS vm_acct_memory is no better. The deviation on large cpu counts is the
> same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...
I suppose so. Except vm_acct_memory() has
#define ACCT_THRESHOLD max(16, NR_CPUS * 2)
But if we were to perform similar tuning to percpu_counter, yes, they're
pretty similar.
Oh, and because vm_acct_memory() is counting a singleton object, it can use
DEFINE_PER_CPU rather than alloc_percpu(), so it saves on a bit of kmalloc
overhead.
> >
> > If the benchmarks say that we need to. If we cannot observe any problems
> > in testing of existing code and if we can't demonstrate any benefit from
> > the patched code then one option is to go off and do something else ;)
>
> We first tried plain per-CPU counters for memory_allocated, found that reads
> on memory_allocated was causing cacheline transfers, and then
> switched over to batching. So batching reads is useful. To avoid
> inaccuracy, we can maybe change percpu_counter_init to:
>
> void percpu_counter_init(struct percpu_counter *fbc, int maxdev)
>
> the percpu batching limit would then be maxdev/num_possible_cpus. One would
> use batching counters only when both reads and writes are frequent. With
> the above scheme, we would go fetch cachelines from other cpus for read
> often only on large cpu counts, which is not any worse than the global
> counter alternative, but it would still be beneficial on smaller machines,
> without sacrificing a pre-set deviation.
>
> Comments?
Sounds sane.
^ permalink raw reply
* Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
From: Andrew Morton @ 2006-01-27 23:08 UTC (permalink / raw)
To: kiran, dada1, davem, linux-kernel, shai, netdev, pravins
In-Reply-To: <20060127150106.38b9e041.akpm@osdl.org>
Andrew Morton <akpm@osdl.org> wrote:
>
> Oh, and because vm_acct_memory() is counting a singleton object, it can use
> DEFINE_PER_CPU rather than alloc_percpu(), so it saves on a bit of kmalloc
> overhead.
Actually, I don't think that's true. we're allocating a sizeof(long) with
kmalloc_node() so there shouldn't be memory wastage.
^ permalink raw reply
* Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
From: Andrew Morton @ 2006-01-27 23:16 UTC (permalink / raw)
To: Eric Dumazet; +Cc: kiran, davem, linux-kernel, shai, netdev, pravins
In-Reply-To: <43DAA586.5050609@cosmosbay.com>
Eric Dumazet <dada1@cosmosbay.com> wrote:
>
> Ravikiran G Thirumalai a écrit :
> > On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:
> >> Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
> >>> which can be assumed as not frequent.
> >>> At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only
> >>> certain conditions, under memory pressure -- on a large CPU count machine,
> >>> you'd have large memory, and I don't think read_sockets_allocated would get
> >>> called often. It did not atleast on our 8cpu/16G box. So this should be OK
> >>> I think.
> >> That being said, the percpu_counters aren't a terribly successful concept
> >> and probably do need a revisit due to the high inaccuracy at high CPU
> >> counts. It might be better to do some generic version of vm_acct_memory()
> >> instead.
> >
> > AFAICS vm_acct_memory is no better. The deviation on large cpu counts is the
> > same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...
>
> Ah... yes you are right, I read min(16, NR_CPUS*2)
So did I ;)
> I wonder if it is not a typo... I mean, I understand the more cpus you have,
> the less updates on central atomic_t is desirable, but a quadratic offset
> seems too much...
I'm not sure whether it was a mistake or if I intended it and didn't do the
sums on accuracy :(
An advantage of retaining a spinlock in percpu_counter is that if accuracy
is needed at a low rate (say, /proc reading) we can take the lock and then
go spill each CPU's local count into the main one. It would need to be a
very low rate though. Or we make the cpu-local counters atomic too.
Certainly it's sensible to delegate the tuning to the creator of the
percpu_counter, but it'll be a difficult thing to get right.
^ permalink raw reply
* Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
From: Eric Dumazet @ 2006-01-27 23:21 UTC (permalink / raw)
To: Ravikiran G Thirumalai
Cc: Andrew Morton, davem, linux-kernel, shai, netdev, pravins
In-Reply-To: <20060127225036.GC3565@localhost.localdomain>
Ravikiran G Thirumalai a écrit :
> On Fri, Jan 27, 2006 at 11:30:23PM +0100, Eric Dumazet wrote:
>> There are several issues here :
>>
>> alloc_percpu() current implementation is a a waste of ram. (because it uses
>> slab allocations that have a minimum size of 32 bytes)
>
> Oh there was a solution for that :).
>
> http://lwn.net/Articles/119532/
>
> I can quickly revive it if there is interest.
>
Well, nice work ! :)
Maybe a litle bit complicated if expected percpu space is 50 KB per cpu ?
Why not use a boot time allocated percpu area (as done today in
setup_per_cpu_areas()), but instead of reserving extra space for module's
percpu data, being able to serve alloc_percpu() from this reserved area (ie no
kmalloced data anymore), and keeping your
#define per_cpu_ptr(ptr, cpu) ((__typeof__(ptr)) \
(RELOC_HIDE(ptr, PCPU_BLKSIZE * cpu)))
Some code from kernel/module.c could be reworked to serve both as an allocator
when a module percpudata must be relocated (insmod)/freed (rmmod), and serve
alloc_percpu() for 'dynamic allocations'
Eric
^ permalink raw reply
* Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
From: Ravikiran G Thirumalai @ 2006-01-28 0:01 UTC (permalink / raw)
To: Andrew Morton; +Cc: dada1, davem, linux-kernel, shai, netdev, pravins
In-Reply-To: <20060127150847.48c312c0.akpm@osdl.org>
On Fri, Jan 27, 2006 at 03:08:47PM -0800, Andrew Morton wrote:
> Andrew Morton <akpm@osdl.org> wrote:
> >
> > Oh, and because vm_acct_memory() is counting a singleton object, it can use
> > DEFINE_PER_CPU rather than alloc_percpu(), so it saves on a bit of kmalloc
> > overhead.
>
> Actually, I don't think that's true. we're allocating a sizeof(long) with
> kmalloc_node() so there shouldn't be memory wastage.
Oh yeah there is. Each dynamic per-cpu object would have been atleast
(NR_CPUS * sizeof (void *) + num_cpus_possible * cacheline_size ).
Now kmalloc_node will fall back on size-32 for allocation of long, so
replace the cacheline_size above with 32 -- which then means dynamic per-cpu
data are not on a cacheline boundary anymore (most modern cpus have 64byte/128
byte cache lines) which means per-cpu data could end up false shared....
Kiran
^ permalink raw reply
* Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
From: Andrew Morton @ 2006-01-28 0:26 UTC (permalink / raw)
To: Ravikiran G Thirumalai; +Cc: dada1, davem, linux-kernel, shai, netdev, pravins
In-Reply-To: <20060128000100.GD3565@localhost.localdomain>
Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
>
> On Fri, Jan 27, 2006 at 03:08:47PM -0800, Andrew Morton wrote:
> > Andrew Morton <akpm@osdl.org> wrote:
> > >
> > > Oh, and because vm_acct_memory() is counting a singleton object, it can use
> > > DEFINE_PER_CPU rather than alloc_percpu(), so it saves on a bit of kmalloc
> > > overhead.
> >
> > Actually, I don't think that's true. we're allocating a sizeof(long) with
> > kmalloc_node() so there shouldn't be memory wastage.
>
> Oh yeah there is. Each dynamic per-cpu object would have been atleast
> (NR_CPUS * sizeof (void *) + num_cpus_possible * cacheline_size ).
> Now kmalloc_node will fall back on size-32 for allocation of long, so
> replace the cacheline_size above with 32 -- which then means dynamic per-cpu
> data are not on a cacheline boundary anymore (most modern cpus have 64byte/128
> byte cache lines) which means per-cpu data could end up false shared....
>
OK. But isn't the core of the problem the fact that __alloc_percpu() is
using kmalloc_node() rather than a (new, as-yet-unimplemented)
kmalloc_cpu()? kmalloc_cpu() wouldn't need the L1 cache alignment.
It might be worth creating just a small number of per-cpu slabs (4-byte,
8-byte). A kmalloc_cpu() would just need a per-cpu array of
kmem_cache_t*'s and it'd internally use kmalloc_node(cpu_to_node), no?
Or we could just give __alloc_percpu() a custom, hand-rolled,
not-cacheline-padded sizeof(long) slab per CPU and use that if (size ==
sizeof(long)). Or something.
^ permalink raw reply
* Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
From: Eric Dumazet @ 2006-01-28 0:28 UTC (permalink / raw)
To: Andrew Morton; +Cc: kiran, davem, linux-kernel, shai, netdev, pravins
In-Reply-To: <20060127151635.3a149fe2.akpm@osdl.org>
[-- Attachment #1: Type: text/plain, Size: 1741 bytes --]
Andrew Morton a écrit :
> Eric Dumazet <dada1@cosmosbay.com> wrote:
>> Ravikiran G Thirumalai a écrit :
>>> On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:
>>>> Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
>>>>> which can be assumed as not frequent.
>>>>> At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only
>>>>> certain conditions, under memory pressure -- on a large CPU count machine,
>>>>> you'd have large memory, and I don't think read_sockets_allocated would get
>>>>> called often. It did not atleast on our 8cpu/16G box. So this should be OK
>>>>> I think.
>>>> That being said, the percpu_counters aren't a terribly successful concept
>>>> and probably do need a revisit due to the high inaccuracy at high CPU
>>>> counts. It might be better to do some generic version of vm_acct_memory()
>>>> instead.
>>> AFAICS vm_acct_memory is no better. The deviation on large cpu counts is the
>>> same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...
>> Ah... yes you are right, I read min(16, NR_CPUS*2)
>
> So did I ;)
>
>> I wonder if it is not a typo... I mean, I understand the more cpus you have,
>> the less updates on central atomic_t is desirable, but a quadratic offset
>> seems too much...
>
> I'm not sure whether it was a mistake or if I intended it and didn't do the
> sums on accuracy :(
>
> An advantage of retaining a spinlock in percpu_counter is that if accuracy
> is needed at a low rate (say, /proc reading) we can take the lock and then
> go spill each CPU's local count into the main one. It would need to be a
> very low rate though. Or we make the cpu-local counters atomic too.
We might use atomic_long_t only (and no spinlocks)
Something like this ?
[-- Attachment #2: functions --]
[-- Type: text/plain, Size: 949 bytes --]
struct percpu_counter {
atomic_long_t count;
atomic_long_t *counters;
};
#ifdef CONFIG_SMP
void percpu_counter_mod(struct percpu_counter *fbc, long amount)
{
long old, new;
atomic_long_t *pcount;
pcount = per_cpu_ptr(fbc->counters, get_cpu());
start:
old = atomic_long_read(pcount);
new = old + amount;
if (new >= FBC_BATCH || new <= -FBC_BATCH) {
if (unlikely(atomic_long_cmpxchg(pcount, old, 0) != old))
goto start;
atomic_long_add(new, &fbc->count);
} else
atomic_long_add(amount, pcount);
put_cpu();
}
EXPORT_SYMBOL(percpu_counter_mod);
long percpu_counter_read_accurate(struct percpu_counter *fbc)
{
long res = 0;
int cpu;
atomic_long_t *pcount;
for_each_cpu(cpu) {
pcount = per_cpu_ptr(fbc->counters, cpu);
/* dont dirty cache line if not necessary */
if (atomic_long_read(pcount))
res += atomic_long_xchg(pcount, 0);
}
return res;
}
EXPORT_SYMBOL(percpu_counter_read_accurate);
#endif /* CONFIG_SMP */
^ permalink raw reply
* Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
From: Eric Dumazet @ 2006-01-28 0:35 UTC (permalink / raw)
To: Eric Dumazet
Cc: Andrew Morton, kiran, davem, linux-kernel, shai, netdev, pravins
In-Reply-To: <43DABAA4.8040208@cosmosbay.com>
Eric Dumazet a écrit :
> Andrew Morton a écrit :
>> Eric Dumazet <dada1@cosmosbay.com> wrote:
>>> Ravikiran G Thirumalai a écrit :
>>>> On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:
>>>>> Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
>>>>>> which can be assumed as not frequent. At
>>>>>> sk_stream_mem_schedule(), read_sockets_allocated() is invoked only
>>>>>> certain conditions, under memory pressure -- on a large CPU count
>>>>>> machine, you'd have large memory, and I don't think
>>>>>> read_sockets_allocated would get called often. It did not atleast
>>>>>> on our 8cpu/16G box. So this should be OK I think.
>>>>> That being said, the percpu_counters aren't a terribly successful
>>>>> concept
>>>>> and probably do need a revisit due to the high inaccuracy at high CPU
>>>>> counts. It might be better to do some generic version of
>>>>> vm_acct_memory()
>>>>> instead.
>>>> AFAICS vm_acct_memory is no better. The deviation on large cpu
>>>> counts is the same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...
>>> Ah... yes you are right, I read min(16, NR_CPUS*2)
>>
>> So did I ;)
>>
>>> I wonder if it is not a typo... I mean, I understand the more cpus
>>> you have, the less updates on central atomic_t is desirable, but a
>>> quadratic offset seems too much...
>>
>> I'm not sure whether it was a mistake or if I intended it and didn't
>> do the
>> sums on accuracy :(
>>
>> An advantage of retaining a spinlock in percpu_counter is that if
>> accuracy
>> is needed at a low rate (say, /proc reading) we can take the lock and
>> then
>> go spill each CPU's local count into the main one. It would need to be a
>> very low rate though. Or we make the cpu-local counters atomic too.
>
> We might use atomic_long_t only (and no spinlocks)
> Something like this ?
>
>
> ------------------------------------------------------------------------
>
> struct percpu_counter {
> atomic_long_t count;
> atomic_long_t *counters;
> };
>
> #ifdef CONFIG_SMP
> void percpu_counter_mod(struct percpu_counter *fbc, long amount)
> {
> long old, new;
> atomic_long_t *pcount;
>
> pcount = per_cpu_ptr(fbc->counters, get_cpu());
> start:
> old = atomic_long_read(pcount);
> new = old + amount;
> if (new >= FBC_BATCH || new <= -FBC_BATCH) {
> if (unlikely(atomic_long_cmpxchg(pcount, old, 0) != old))
> goto start;
> atomic_long_add(new, &fbc->count);
> } else
> atomic_long_add(amount, pcount);
>
> put_cpu();
> }
> EXPORT_SYMBOL(percpu_counter_mod);
>
> long percpu_counter_read_accurate(struct percpu_counter *fbc)
> {
> long res = 0;
> int cpu;
> atomic_long_t *pcount;
>
> for_each_cpu(cpu) {
> pcount = per_cpu_ptr(fbc->counters, cpu);
> /* dont dirty cache line if not necessary */
> if (atomic_long_read(pcount))
> res += atomic_long_xchg(pcount, 0);
> }
atomic_long_add(res, &fbc->count);
res = atomic_long_read(&fbc->count);
> return res;
> }
> EXPORT_SYMBOL(percpu_counter_read_accurate);
> #endif /* CONFIG_SMP */
>
^ permalink raw reply
* Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
From: Ravikiran G Thirumalai @ 2006-01-28 0:40 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Andrew Morton, davem, linux-kernel, shai, netdev, pravins
In-Reply-To: <43DAAAE3.2030107@cosmosbay.com>
On Sat, Jan 28, 2006 at 12:21:07AM +0100, Eric Dumazet wrote:
> Ravikiran G Thirumalai a écrit :
> >On Fri, Jan 27, 2006 at 11:30:23PM +0100, Eric Dumazet wrote:
>
> Why not use a boot time allocated percpu area (as done today in
> setup_per_cpu_areas()), but instead of reserving extra space for module's
> percpu data, being able to serve alloc_percpu() from this reserved area (ie
> no kmalloced data anymore), and keeping your
At that time ia64 placed a limit on the max size of per-CPU area
(PERCPU_ENOUGH_ROOM). I think that the limit is still there, But hopefully
64K per-CPU should be enough for static + dynamic + modules?
Let me do a allyesconfig on my box and verify.
Thanks,
Kiran
^ permalink raw reply
* Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
From: Andrew Morton @ 2006-01-28 0:43 UTC (permalink / raw)
To: Eric Dumazet; +Cc: kiran, davem, linux-kernel, shai, netdev, pravins
In-Reply-To: <43DABAA4.8040208@cosmosbay.com>
Eric Dumazet <dada1@cosmosbay.com> wrote:
>
> >
> > An advantage of retaining a spinlock in percpu_counter is that if accuracy
> > is needed at a low rate (say, /proc reading) we can take the lock and then
> > go spill each CPU's local count into the main one. It would need to be a
> > very low rate though. Or we make the cpu-local counters atomic too.
>
> We might use atomic_long_t only (and no spinlocks)
Yup, that's it.
> Something like this ?
>
It'd be a lot neater if we had atomic_long_xchg().
^ permalink raw reply
* Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
From: Eric Dumazet @ 2006-01-28 1:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: kiran, davem, linux-kernel, shai, netdev, pravins
In-Reply-To: <20060127164308.1ea4c3e5.akpm@osdl.org>
[-- Attachment #1: Type: text/plain, Size: 681 bytes --]
Andrew Morton a écrit :
> Eric Dumazet <dada1@cosmosbay.com> wrote:
>>> An advantage of retaining a spinlock in percpu_counter is that if accuracy
>>> is needed at a low rate (say, /proc reading) we can take the lock and then
>>> go spill each CPU's local count into the main one. It would need to be a
>>> very low rate though. Or we make the cpu-local counters atomic too.
>> We might use atomic_long_t only (and no spinlocks)
>
> Yup, that's it.
>
>> Something like this ?
>>
>
> It'd be a lot neater if we had atomic_long_xchg().
You are my guest :)
[PATCH] Add atomic_long_xchg() and atomic_long_cmpxchg() wrappers
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
[-- Attachment #2: atomic.patch --]
[-- Type: text/plain, Size: 882 bytes --]
--- a/include/asm-generic/atomic.h 2006-01-28 02:59:49.000000000 +0100
+++ b/include/asm-generic/atomic.h 2006-01-28 02:57:36.000000000 +0100
@@ -66,6 +66,18 @@
atomic64_sub(i, v);
}
+static inline long atomic_long_xchg(atomic_long_t *l, long val)
+{
+ atomic64_t *v = (atomic64_t *)l;
+ return atomic64_xchg(v, val);
+}
+
+static inline long atomic_long_cmpxchg(atomic_long_t *l, long old, long new)
+{
+ atomic64_t *v = (atomic64_t *)l;
+ return atomic64_cmpxchg(v, old, new);
+}
+
#else
typedef atomic_t atomic_long_t;
@@ -113,5 +125,17 @@
atomic_sub(i, v);
}
+static inline long atomic_long_xchg(atomic_long_t *l, long val)
+{
+ atomic_t *v = (atomic_t *)l;
+ return atomic_xchg(v, val);
+}
+
+static inline long atomic_long_cmpxchg(atomic_long_t *l, long old, long new)
+{
+ atomic_t *v = (atomic_t *)l;
+ return atomic_cmpxchg(v, old, new);
+}
+
#endif
#endif
^ permalink raw reply
* Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
From: Andrew Morton @ 2006-01-28 1:18 UTC (permalink / raw)
To: Eric Dumazet; +Cc: kiran, davem, linux-kernel, shai, netdev, pravins
In-Reply-To: <43DAC46B.3010200@cosmosbay.com>
Eric Dumazet <dada1@cosmosbay.com> wrote:
>
> [PATCH] Add atomic_long_xchg() and atomic_long_cmpxchg() wrappers
>
> ...
>
> +static inline long atomic_long_xchg(atomic_long_t *l, long val)
> +{
> + atomic64_t *v = (atomic64_t *)l;
> + return atomic64_xchg(v, val);
All we need now is some implementations of atomic64_xchg() ;)
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox