Netdev List
 help / color / mirror / Atom feed
* Re: netconf notes and materials
From: David Miller @ 2010-11-08  1:53 UTC (permalink / raw)
  To: roszenrami; +Cc: netdev
In-Reply-To: <AANLkTimZFfft66L983BbOXb4+LN9yzx2e3=t-kfQyM-_@mail.gmail.com>

From: Rami Rosen <roszenrami@gmail.com>
Date: Sun, 7 Nov 2010 21:06:20 +0200

> David,
> 1)  Great, thanks for the link!
> 
> 2)  Regarding your "Linux Networking Futures 2010" slides :
>  You mention in the fifth slide :
> "XFS is in review state, 2.6.38 likely'.
> 
> I suppose you probably mean "XPS" patches,
> the transmit Packet Steering patches by
> Tom Herbert. Or am I wrong and don't know something ?

You're correct, and the typo is mine :-)

^ permalink raw reply

* Re: [PATCH] firewire: net: rate-limit log spam at transmit failure
From: Maxim Levitsky @ 2010-11-08  1:41 UTC (permalink / raw)
  To: Stefan Richter; +Cc: linux1394-devel, netdev@vger.kernel.org
In-Reply-To: <4CD6957C.5030504@s5r6.in-berlin.de>

On Sun, 2010-11-07 at 13:03 +0100, Stefan Richter wrote:
> Maxim Levitsky wrote:
> > I have here my own hack to set the transaction timeout,
> 
> You can use firecontrol to set it on-the-fly.  To set it on node ffc2 i.e. the
> node with phy ID 2, and controller 1 (i.e. libraw1394 "port" 1):
> # echo "w . 2 0xfffff0000018 4 3" | firecontrol 1
> 
> This sets the whole-seconds part to 3, which gives you 3.1 seconds timeout if
> the fractional part at 0xfffff000001c is still at its default value of 0.1
> seconds, i.e. 800 << 19 (subsecond part in units of (1/8000)s shifted by 19).
> 
> # { echo "r . 2 0xfffff0000018 4"; echo "r . 2 0xfffff000001c 4"; } |
> firecontrol 1
> 
> displays the current register value on node ffc2 at port 1.  Type "help" in
> firecontrol for more available firecontrol commands.
Agreed.

But why the timeout is  never set?
What is the default?

I think that timeout_jiffies is never initialized, thus, it is 0 due to
kzalloc.

Also, note that I see here that if I send a TCP stream from one system
to another then the system that recieves the packets (and sends TCP
acks), still overflows the queue (error 10, and confirmed by printks).

Best regards,
	Maxim Levitsky


^ permalink raw reply

* Re: [Security] [SECURITY] Fix leaking of kernel heap addresses via /proc
From: Willy Tarreau @ 2010-11-08  1:00 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, chas, security, pekkas, yoshfuji, netdev, drosenberg,
	jmorris, remi.denis-courmont, kuznet, kaber
In-Reply-To: <20101106.165703.193714684.davem@davemloft.net>

On Sat, Nov 06, 2010 at 04:57:03PM -0700, David Miller wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Sat, 6 Nov 2010 13:50:32 -0700
> 
> > On Saturday, November 6, 2010, Dan Rosenberg <drosenberg@vsecurity.com> wrote:
> >>
> >> Clearly, in most cases we cannot just remove the field from the /proc
> >> output, as this would break a number of userspace programs that rely on
> >> consistency.  However, I propose that we replace the address with a "0"
> >> rather than leaking this information.
> > 
> > I really think it would be much better to use the unidentified number
> > or similar.
> > 
> > Just replacing with zeroes is annoying, and has the potential of
> > losing actual information.
> 
> I would really like to see the specific examples of where this is
> happening, it sounds like something very silly to me.

It has happened to me several times to use an hex editor to check some
socket's parameters (eg: backlog) based on the pointer. Sometimes I had
even change some parameters at runtime as part of debugging sessions.

In fact we could consider than many places that return pointers could
return 0 to normal users and the real value only to root (or any special
capability). I find it important not to reduce the observability of the
kernel for the sake of security.

Regards,
Willy


^ permalink raw reply

* Re: [Security] [SECURITY] Fix leaking of kernel heap addresses via /proc
From: Andi Kleen @ 2010-11-07 23:56 UTC (permalink / raw)
  To: Dan Rosenberg
  Cc: chas3, Andi Kleen, Ted Ts'o, Linus Torvalds,
	davem@davemloft.net, kuznet@ms2.inr.ac.ru, pekkas@netcore.fi,
	jmorris@namei.org, yoshfuji@linux-ipv6.org, kaber@trash.net,
	remi.denis-courmont@nokia.com, netdev@vger.kernel.org,
	security@kernel.org
In-Reply-To: <1289172456.3090.184.camel@Dan>

> The criticism raised so far is that cutting out the pointers entirely
> results in the omission of potentially useful debugging information.  I
> see two viable options to address this: either print out or omit
> addresses based on privileges (CAP_NET_ADMIN, for example), or have it
> controllable via sysctl.  I'm leaning towards the sysctl
> option...thoughts?

I would just remove the pointers from /proc and supply 
gdb macros that extract the equivalent information from /proc/kcore.
This is a bit racy, but for debugging it should be no
problem to run them multiple times as needed.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: [Security] [SECURITY] Fix leaking of kernel heap addresses via /proc
From: Andi Kleen @ 2010-11-07 23:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: chas3, Andi Kleen, Ted Ts'o, Dan Rosenberg,
	davem@davemloft.net, kuznet@ms2.inr.ac.ru, pekkas@netcore.fi,
	jmorris@namei.org, yoshfuji@linux-ipv6.org, kaber@trash.net,
	remi.denis-courmont@nokia.com, netdev@vger.kernel.org,
	security@kernel.org
In-Reply-To: <AANLkTikDeVXcYRXqcq-qD_mf19awnKPTYGZbZguxJLDm@mail.gmail.com>

> So the only question is whether kernel pointers are an information
> leak worth worrying about. Personally, I think it's just damn stupid
> to expose a kernel pointer unless you _have_ to. There is absolutely

Agreed.

> no reason to expose the address of a socket in /proc, perhaps unless
> you're in some super-duper-debugging mode that no sane person would
> ever use outside of specialized network debugging (and I bet that in
> that case you still shouldn't expose it in a _normal_ proc file, but
> in debugfs or something).

In most cases you can get them by running gdb or crash on /proc/kcore 
anyways.

Still overall I suspect there may be a case that making
dmesg root only is a good idea.  At least optionally for the non 
kernel hackers out there.

The early memory map information can be likely used to guess a lot of 
locations and perhaps really make that heap overflow easier to write.
And perhaps some more mapping randomization inside the kernel.

I'm not sure the case is very strong though.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: [Security] [SECURITY] Fix leaking of kernel heap addresses via /proc
From: Dan Rosenberg @ 2010-11-07 23:27 UTC (permalink / raw)
  To: chas3
  Cc: Andi Kleen, Ted Ts'o, Linus Torvalds, davem@davemloft.net,
	kuznet@ms2.inr.ac.ru, pekkas@netcore.fi, jmorris@namei.org,
	yoshfuji@linux-ipv6.org, kaber@trash.net,
	remi.denis-courmont@nokia.com, netdev@vger.kernel.org,
	security@kernel.org
In-Reply-To: <201011072248.oA7MmjKg025857@cmf.nrl.navy.mil>

Based on the feedback so far, I think a few things need to be worked
out.  Firstly, I'm going to separate out printk leaks as an issue to be
worked on at a later time - it seems there is some interest in
evaluating whether dmesg should be restricted, but that is out of scope
of my plans for this patch series.

That leaves the /proc leakage.  I don't think XOR-ing is a viable
option, since it would be relatively easy to infer the constant value.
The criticism raised so far is that cutting out the pointers entirely
results in the omission of potentially useful debugging information.  I
see two viable options to address this: either print out or omit
addresses based on privileges (CAP_NET_ADMIN, for example), or have it
controllable via sysctl.  I'm leaning towards the sysctl
option...thoughts?

-Dan


^ permalink raw reply

* Re: [Security] [SECURITY] Fix leaking of kernel heap addresses via /proc
From: Linus Torvalds @ 2010-11-07 23:22 UTC (permalink / raw)
  To: chas3
  Cc: Andi Kleen, Ted Ts'o, Dan Rosenberg, davem@davemloft.net,
	kuznet@ms2.inr.ac.ru, pekkas@netcore.fi, jmorris@namei.org,
	yoshfuji@linux-ipv6.org, kaber@trash.net,
	remi.denis-courmont@nokia.com, netdev@vger.kernel.org,
	security@kernel.org
In-Reply-To: <201011072248.oA7MmjKg025857@cmf.nrl.navy.mil>

On Sun, Nov 7, 2010 at 2:48 PM, Chas Williams (CONTRACTOR)
<chas@cmf.nrl.navy.mil> wrote:
>
> i suppose one could use idr to map the pointers to unique values.  the
> infiniband code uses this technique>

We already _have_ the unique value that /proc uses - the inode number.
It's what lsof and friends use to match things across different files
anyway.

Why are people arguing about stupid things? If it's wrong to expose
kernel pointers in /proc (and I do think it generally is something we
should try to avoid), then we already do have the natural alternative.
Which happens to be what the patch already does.

So the only question is whether kernel pointers are an information
leak worth worrying about. Personally, I think it's just damn stupid
to expose a kernel pointer unless you _have_ to. There is absolutely
no reason to expose the address of a socket in /proc, perhaps unless
you're in some super-duper-debugging mode that no sane person would
ever use outside of specialized network debugging (and I bet that in
that case you still shouldn't expose it in a _normal_ proc file, but
in debugfs or something).

                          Linus

^ permalink raw reply

* Re: [Security] [SECURITY] Fix leaking of kernel heap addresses via /proc
From: Andi Kleen @ 2010-11-07 23:14 UTC (permalink / raw)
  To: chas3
  Cc: Andi Kleen, Ted Ts'o, Linus Torvalds, Dan Rosenberg,
	davem@davemloft.net, kuznet@ms2.inr.ac.ru, pekkas@netcore.fi,
	jmorris@namei.org, yoshfuji@linux-ipv6.org, kaber@trash.net,
	remi.denis-courmont@nokia.com, netdev@vger.kernel.org,
	security@kernel.org
In-Reply-To: <201011072248.oA7MmjKg025857@cmf.nrl.navy.mil>

On Sun, Nov 07, 2010 at 05:48:45PM -0500, Chas Williams (CONTRACTOR) wrote:
> In message <87sjzcssx5.fsf@basil.nowhere.org>,Andi Kleen writes:
> >Ted Ts'o <tytso@mit.edu> writes:
> >
> >> Are there any userspace programs that might be reasonably expected to
> >> _use_ this information?  If there is, we could just pick a random
> >> number at boot time, and then XOR the heap adddress with that random
> >> number.
> >
> >If any of the addresses can be guessed ever (and that is likely if it's
> >allocated at boot) determining the random value will be trivial
> >for everyone.
> 
> i suppose one could use idr to map the pointers to unique values.  the
> infiniband code uses this technique>

idr requires allocating memory, and it's unclear you can do that
in all the situations where debugging printks are used. And
how would you get the idr table out of a broken kernel? And further
the memory allocations would eventually fill up your memory
if they go on. I don't think idr is a solution.

I don't really have a good solution either. Even if the
the individual pointers were removed from printks (which
probably doesn't make much difference anyways because those
printks usually happen only in unlikely debug situations):

The information about the memory layout early on in dmesg
is probably enough to make a good educated guess about
the locations of standard slab caches on a known kernel
image. For example the first N sockets opened are very likely
easy to guess this way.

I guess one could make dmesg root only, although I personally
use it often as non root.

Or maybe add some more randomization to the buddy allocator.
The drawback of that is that they tend to make
benchmarks unstable due to cache coloring differences.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: [PATCH 2/2 v5] xps: Transmit Packet Steering
From: Tom Herbert @ 2010-11-07 23:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev
In-Reply-To: <1289162455.2478.295.camel@edumazet-laptop>

> Tom, I still dont understand why *xps_maps is here, and not in
> net_device ?
>

Originally it was in the net_device, but that necessitated a reference
be held on the device since we can't destroy the map until all
kobjects were released, but the open reference prevents the kojects
from being released in the first place.

> I am asking because netdev_get_xps_maps(dev) might be slowed down
> because queue 0 state might change often (__QUEUE_STATE_XOFF)
>
Could a add a pointer in the net_device also.

> This means _tx[0] becomes a very hot cache line, needed to access all
> queues (from get_xps_queue())
>
> Other than that, your patch seems fine (not tested yet)
>
> Thanks
>
>
>

^ permalink raw reply

* Re: [RESEND PATCH] virtio-net: init link state correctly
From: Rusty Russell @ 2010-11-07 23:11 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, davem, markmc, kvm, mst
In-Reply-To: <20101105094718.19101.58898.stgit@dhcp-91-158.nay.redhat.com>

On Fri, 5 Nov 2010 08:17:18 pm Jason Wang wrote:
> For device that supports VIRTIO_NET_F_STATUS, there's no need to
> assume the link is up and we need to call nerif_carrier_off() before
> querying device status, otherwise we may get wrong operstate after
> diver was loaded because the link watch event was not fired as
> expected.
> 
> For device that does not support VIRITO_NET_F_STATUS, we could not get
> its status through virtnet_update_status() and what we can only do is
> always assuming the link is up.
> 
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Acked-by: Rusty Russell <rusty@rustcorp.com.au>

Thanks!
Rusty.

^ permalink raw reply

* Re: [PATCH] virtio_net: Fix queue full check
From: Rusty Russell @ 2010-11-07 23:08 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Krishna Kumar2, davem, netdev, yvugenfi
In-Reply-To: <20101104122424.GA29830@redhat.com>

On Thu, 4 Nov 2010 10:54:24 pm Michael S. Tsirkin wrote:
> I thought about this some more.  I think the original
> code is actually correct in returning ENOSPC: indirect
> buffers are nice, but it's a mistake
> to rely on them as a memory allocation might fail.
> 
> And if you look at virtio-net, it is dropping packets
> under memory pressure which is not really a happy outcome:
> the packet will get freed, reallocated and we get another one,
> adding pressure on the allocator instead of releasing it
> until we free up some buffers.
> 
> So I now think we should calculate the capacity
> assuming non-indirect entries, and if we manage to
> use indirect, all the better.

I've long said it's a weakness in the network stack that it insists
drivers stop the tx queue before they *might* run out of room, leading to
worst-case assumptions and underutilization of the tx ring.

However, I lost that debate, and so your patch is the way it's supposed to
work.  The other main indirect user (block) doesn't care as its queue
allows for post-attempt blocking.

I enhanced your commentry a little:

Subject: virtio: return correct capacity to users
Date: Thu, 4 Nov 2010 14:24:24 +0200
From: "Michael S. Tsirkin" <mst@redhat.com>

We can't rely on indirect buffers for capacity
calculations because they need a memory allocation
which might fail.  In particular, virtio_net can get
into this situation under stress, and it drops packets
and performs badly.

So return the number of buffers we can guarantee users.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reported-By: Krishna Kumar2 <krkumar2@in.ibm.com>

Thanks!
Rusty.

^ permalink raw reply

* Re: [Security] [SECURITY] Fix leaking of kernel heap addresses via /proc
From: Chas Williams (CONTRACTOR) @ 2010-11-07 22:48 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ted Ts'o, Linus Torvalds, Dan Rosenberg, davem@davemloft.net,
	kuznet@ms2.inr.ac.ru, pekkas@netcore.fi, jmorris@namei.org,
	yoshfuji@linux-ipv6.org, kaber@trash.net,
	remi.denis-courmont@nokia.com, netdev@vger.kernel.org,
	security@kernel.org
In-Reply-To: <87sjzcssx5.fsf@basil.nowhere.org>

In message <87sjzcssx5.fsf@basil.nowhere.org>,Andi Kleen writes:
>Ted Ts'o <tytso@mit.edu> writes:
>
>> Are there any userspace programs that might be reasonably expected to
>> _use_ this information?  If there is, we could just pick a random
>> number at boot time, and then XOR the heap adddress with that random
>> number.
>
>If any of the addresses can be guessed ever (and that is likely if it's
>allocated at boot) determining the random value will be trivial
>for everyone.

i suppose one could use idr to map the pointers to unique values.  the
infiniband code uses this technique>

^ permalink raw reply

* Re: [PATCH 0/9] Fix leaking of kernel heap addresses in net/
From: Urs Thuermann @ 2010-11-07 21:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Dan Rosenberg, chas, davem, kuznet, pekkas, jmorris, yoshfuji,
	kaber, remi.denis-courmont, netdev, security, stable
In-Reply-To: <1289149416.2478.143.camel@edumazet-laptop>

Eric Dumazet <eric.dumazet@gmail.com> writes:

> You are basically ruining a lot of debugging facilities we use every
> day to find and fix _real_ bugs. The bugs that happen to crash
> machines of our customers.
> 
> If you want to avoid a user reading kernel syslog, why dont you fix
> the problem for non root users able to "dmesg" ? I personally dont
> care.

dmesg and syslog aren't the only problem.  There are files in
/proc/net/ that must be readable by all users in order for a number of
tools to work, e.g. netstat(8), and those files contain kernel
addresses of sock structs.

> There are pretty easy ways to not disclose "information", but your
> way of using '0' for all values is the dumbest idea one could ever
> had.
> 
> A single XOR with a "root only visible, random value chosen at boot"
> would be OK. At least we could continue our work, with litle burden.

Simple XOR with a secret constant doesn't help very much.  Many bits
of that hidden constant can be revealed quite easily.  On x86-32 for
example all addresses are in the high 1 GB address space so the
highest two address bits are 1.  Many structs of type sock are aligned
to 64 or higher powers of two so you can easily infer the lower 6 or
more bits of that hidden XOR constant.  If you know two structs of
alignment 2^n are contiguous in memory you can infer the lower n+1
bits.

I suggest either one of the following two solutions:

1. Make the contents of the proc-files in question dependent on some
   capability, e.g. CAP_SYS_ADMIN or some new CAP_KERNEL_DEBUG.  If
   the process opening the file owns that capability it will see the
   kernel address, otherwise it will see 0.

2. Replace the kernel address by the socket inode number or some other
   identifying value (as our patch for CAN does) and provide some
   mechanism for root only to translate that value to the
   corresponding kernel address.


urs

^ permalink raw reply

* Re: [Security] [SECURITY] Fix leaking of kernel heap addresses via /proc
From: Andi Kleen @ 2010-11-07 21:52 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Linus Torvalds, Dan Rosenberg, chas@cmf.nrl.navy.mil,
	davem@davemloft.net, kuznet@ms2.inr.ac.ru, pekkas@netcore.fi,
	jmorris@namei.org, yoshfuji@linux-ipv6.org, kaber@trash.net,
	remi.denis-courmont@nokia.com, netdev@vger.kernel.org,
	security@kernel.org
In-Reply-To: <20101106234840.GD2935@thunk.org>

Ted Ts'o <tytso@mit.edu> writes:

> Are there any userspace programs that might be reasonably expected to
> _use_ this information?  If there is, we could just pick a random
> number at boot time, and then XOR the heap adddress with that random
> number.

If any of the addresses can be guessed ever (and that is likely if it's
allocated at boot) determining the random value will be trivial
for everyone.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* [BUG]: skge not working (as module) in 2.6.37-rc1
From: Marin Mitov @ 2010-11-07 21:45 UTC (permalink / raw)
  To: Stephen Hemminger, Stephen Hemminger, netdev, linux-kernel,
	David S. Miller

Hi Stephen,

skge as in 2.6.36 (and before) is working.
As in 2.6.37-rc1 it is not:

kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
kernel: IP: [<ffffffffa0005d20>] skge_devinit+0x270/0x2a0 [skge]
kernel: PGD d8657067 PUD d8658067 PMD 0
kernel: Oops: 0002 [#1] PREEMPT SMP
kernel: last sysfs file: /sys/devices/platform/mga_warp.0/firmware/mga_warp.0/loading
kernel: CPU 1
kernel: Modules linked in: skge(+)
kernel:
kernel: Pid: 2005, comm: insmod Not tainted 2.6.37-rc1 #2 A8V/System Product Name
kernel: RIP: 0010:[<ffffffffa0005d20>]  [<ffffffffa0005d20>] skge_devinit+0x270/0x2a0 [skge]
kernel: RSP: 0018:ffff8800ce477cb8  EFLAGS: 00010292
kernel: RAX: 0000000000000000 RBX: ffff88011f2cc800 RCX: ffffffff815ab260
kernel: RDX: ffffffff814e82a8 RSI: 0000000000000046 RDI: ffffffff815ab154
kernel: RBP: ffff8800ce477cd8 R08: 00000000ffffffff R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800daa27480
kernel: R13: 0000000000000000 R14: ffff88011f2ccd80 R15: 0000000000000000
kernel: FS:  00007f2af73f3700(0000) GS:ffff8800dfd00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
kernel: CR2: 0000000000000010 CR3: 00000000d865a000 CR4: 00000000000006e0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
kernel: Process insmod (pid: 2005, threadinfo ffff8800ce476000, task ffff880109875370)
kernel: Stack:
kernel:  ffff88011fe5f800 0000000000000000 ffff8800daa27480 ffff8800daa274e8
kernel:  ffff8800ce477d28 ffffffffa0006f2d ffff8800ce477d08 ffffffff8119555a
kernel:  ffff8800ce477d38 ffff88011fe5f888 ffff88011fe5f800 ffffffffa0008120
kernel: Call Trace:
kernel:  [<ffffffffa0006f2d>] skge_probe+0x27c/0x4a7 [skge]
kernel:  [<ffffffff8119555a>] ? kobject_get+0x1a/0x30
kernel:  [<ffffffff811aa812>] local_pci_probe+0x12/0x20
kernel:  [<ffffffff811aaae0>] pci_device_probe+0x80/0xb0
kernel:  [<ffffffff812337fa>] ? driver_sysfs_add+0x7a/0xb0
kernel:  [<ffffffff81233931>] driver_probe_device+0x81/0x1a0
kernel:  [<ffffffff81233ae3>] __driver_attach+0x93/0xa0
kernel:  [<ffffffff81233a50>] ? __driver_attach+0x0/0xa0
kernel:  [<ffffffff8123301c>] bus_for_each_dev+0x5c/0x90
kernel:  [<ffffffff81233779>] driver_attach+0x19/0x20
kernel:  [<ffffffff81232908>] bus_add_driver+0x198/0x250
kernel:  [<ffffffffa000c000>] ? skge_init_module+0x0/0x3a [skge]
kernel:  [<ffffffff81233dd8>] driver_register+0x78/0x140
kernel:  [<ffffffffa000c000>] ? skge_init_module+0x0/0x3a [skge]
kernel:  [<ffffffff811aad91>] __pci_register_driver+0x51/0xd0
kernel:  [<ffffffff812d4840>] ? dmi_check_system+0x20/0x50
kernel:  [<ffffffffa000c038>] skge_init_module+0x38/0x3a [skge]
kernel:  [<ffffffff810001de>] do_one_initcall+0x3e/0x170
kernel:  [<ffffffff81066192>] sys_init_module+0xb2/0x200
kernel:  [<ffffffff810024ab>] system_call_fastpath+0x16/0x1b
kernel: Code: 39 e1 48 89 df e8 81 a7 33 e1 ba 15 0f 00 00 48 c7 c6 08 7c 00 a0 48 c7 c7 48 73 00 a0 31 c0 e8 c1 aa 39 e1 48 8b 83 00 03 00 00 <f0> 80 48 10 01 ba 17 0f 00 00 48 c7 c6 08 7c 00 a0 48 c7 c7 48
kernel: RIP  [<ffffffffa0005d20>] skge_devinit+0x270/0x2a0 [skge]
kernel:  RSP <ffff8800ce477cb8>
kernel: CR2: 0000000000000010
kernel: ---[ end trace ef29176d9e5b71a4 ]---

Reverting the changes in skge.c (2.6.36 -> 2.6.37-rc1) does not help.
Debugging with many printk embedded in skge_devinit() found the problem is in
netif_stop_queue(dev). Removing the statement (see the patch) helps - skge is working.

But I am not expert in the networking, so I am not sure I have solved the real problem.
May be some changes in the core networking are the real cause of the problem.

And by the way, why one should stop a queue that is not yet (at least explicitly) started? :-)

Best regards.

Marin Mitov

Signed-off-by: Marin Mitov <mitov@issp.bas.bg>

===========================================================
--- a/drivers/net/skge.c	2010-11-07 10:55:22.000000000 +0200
+++ b/drivers/net/skge.c	2010-11-07 20:55:43.000000000 +0200
@@ -3858,7 +3858,6 @@ static struct net_device *skge_devinit(s
 
 	/* device is off until link detection */
 	netif_carrier_off(dev);
-	netif_stop_queue(dev);
 
 	return dev;
 }

^ permalink raw reply

* [PATCH 4/4] firewire: net: throttle TX queue before running out of tlabels
From: Stefan Richter @ 2010-11-07 21:41 UTC (permalink / raw)
  To: linux1394-devel; +Cc: netdev, linux-kernel
In-Reply-To: <tkrat.036c5d10fb83f85d@s5r6.in-berlin.de>

This prevents firewire-net from submitting write requests in fast
succession until failure due to all 64 transaction labels used up for
unfinished split transactions.  The netif_stop/wake_queue API is used
for this purpose.

The high watermark is set to considerably less than 64 (I chose 8) in
order to put less pressure on to responder peers.  Still, responders
which run Linux firewire-net still need to improved to keep up with
even this small amount of outstanding split transactions.  (This is
considering the case of only one responder at a time, or a primary
responder, because this is the most likely case with IP over 1394.)

This is based on the count of unfinished TX datagrams.  Maybe it should
rather be based on the count of unfinished TX fragments?  Many 1394a
CardBus cards accept only 1024k sized packets, i.e. require two fragments
per datagram at the standard MTU of 1500.  However, the chosen watermark
is still well below the maximum number of tlabels.

Without this stop/wake mechanism, datagrams were simply lost whenever
the tlabel pool was exhausted.

I wonder what a good net_device.tx_queue_len value is.  I just set it
to the same value as the chosen watermark for now.  Note that the net
scheduler still lets us exceed the high watermark occasionally.  This
could be prevented by an earlier queued_datagrams check in fwnet_tx with
a NETDEV_TX_BUSY return but such an early check did not actually
improve performance in my testing, i.e. did not reduce busy responder
situations.

Results:
  - I did not see changes to resulting throughput that were discernible
    from the usual measuring noise.
  - With this change, other requesters on the same node now have a
    chance to get spare transaction labels while firewire-net is
    transmitting.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
 drivers/firewire/net.c |   36 ++++++++++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 8 deletions(-)

Index: b/drivers/firewire/net.c
===================================================================
--- a/drivers/firewire/net.c
+++ b/drivers/firewire/net.c
@@ -28,8 +28,14 @@
 #include <asm/unaligned.h>
 #include <net/arp.h>
 
-#define FWNET_MAX_FRAGMENTS	25	/* arbitrary limit */
-#define FWNET_ISO_PAGE_COUNT	(PAGE_SIZE < 16 * 1024 ? 4 : 2)
+/* rx limits */
+#define FWNET_MAX_FRAGMENTS		25 /* arbitrary limit */
+#define FWNET_ISO_PAGE_COUNT		(PAGE_SIZE < 16 * 1024 ? 4 : 2)
+
+/* tx limits */
+#define FWNET_TX_QUEUE_LEN		8 /* ? */
+#define FWNET_MAX_QUEUED_DATAGRAMS	8 /* should keep AT DMA busy enough */
+#define FWNET_MIN_QUEUED_DATAGRAMS	2
 
 #define IEEE1394_BROADCAST_CHANNEL	31
 #define IEEE1394_ALL_NODES		(0xffc0 | 0x003f)
@@ -892,6 +898,20 @@ static void fwnet_free_ptask(struct fwne
 	kmem_cache_free(fwnet_packet_task_cache, ptask);
 }
 
+/* caller must hold dev->lock */
+static void inc_queued_datagrams(struct fwnet_device *dev)
+{
+	if (++dev->queued_datagrams >= FWNET_MAX_QUEUED_DATAGRAMS)
+		netif_stop_queue(dev->netdev);
+}
+
+/* caller must hold dev->lock */
+static void dec_queued_datagrams(struct fwnet_device *dev)
+{
+	if (--dev->queued_datagrams == FWNET_MIN_QUEUED_DATAGRAMS)
+		netif_wake_queue(dev->netdev);
+}
+
 static int fwnet_send_packet(struct fwnet_packet_task *ptask);
 
 static void fwnet_transmit_packet_done(struct fwnet_packet_task *ptask)
@@ -908,7 +928,7 @@ static void fwnet_transmit_packet_done(s
 	/* Check whether we or the networking TX soft-IRQ is last user. */
 	free = (ptask->outstanding_pkts == 0 && ptask->enqueued);
 	if (free)
-		dev->queued_datagrams--;
+		dec_queued_datagrams(dev);
 
 	if (ptask->outstanding_pkts == 0) {
 		dev->netdev->stats.tx_packets++;
@@ -979,7 +999,7 @@ static void fwnet_transmit_packet_failed
 	/* Check whether we or the networking TX soft-IRQ is last user. */
 	free = ptask->enqueued;
 	if (free)
-		dev->queued_datagrams--;
+		dec_queued_datagrams(dev);
 
 	dev->netdev->stats.tx_dropped++;
 	dev->netdev->stats.tx_errors++;
@@ -1064,7 +1084,7 @@ static int fwnet_send_packet(struct fwne
 		if (!free)
 			ptask->enqueued = true;
 		else
-			dev->queued_datagrams--;
+			dec_queued_datagrams(dev);
 
 		spin_unlock_irqrestore(&dev->lock, flags);
 
@@ -1083,7 +1103,7 @@ static int fwnet_send_packet(struct fwne
 	if (!free)
 		ptask->enqueued = true;
 	else
-		dev->queued_datagrams--;
+		dec_queued_datagrams(dev);
 
 	spin_unlock_irqrestore(&dev->lock, flags);
 
@@ -1345,7 +1365,7 @@ static netdev_tx_t fwnet_tx(struct sk_bu
 		max_payload += RFC2374_FRAG_HDR_SIZE;
 	}
 
-	dev->queued_datagrams++;
+	inc_queued_datagrams(dev);
 
 	spin_unlock_irqrestore(&dev->lock, flags);
 
@@ -1415,7 +1435,7 @@ static void fwnet_init_dev(struct net_de
 	net->addr_len		= FWNET_ALEN;
 	net->hard_header_len	= FWNET_HLEN;
 	net->type		= ARPHRD_IEEE1394;
-	net->tx_queue_len	= 10;
+	net->tx_queue_len	= FWNET_TX_QUEUE_LEN;
 	SET_ETHTOOL_OPS(net, &fwnet_ethtool_ops);
 }
 

-- 
Stefan Richter
-=====-==-=- =-== --===
http://arcgraph.de/sr/


------------------------------------------------------------------------------
The Next 800 Companies to Lead America's Growth: New Video Whitepaper
David G. Thomson, author of the best-selling book "Blueprint to a 
Billion" shares his insights and actions to help propel your 
business during the next growth cycle. Listen Now!
http://p.sf.net/sfu/SAP-dev2dev

^ permalink raw reply

* [PATCH 3/4] firewire: net: replace lists by counters
From: Stefan Richter @ 2010-11-07 21:40 UTC (permalink / raw)
  To: linux1394-devel; +Cc: linux-kernel, netdev
In-Reply-To: <tkrat.036c5d10fb83f85d@s5r6.in-berlin.de>

The current transmit code does not at all make use of
  - fwnet_device.packet_list
and only very limited use of
  - fwnet_device.broadcasted_list,
  - fwnet_device.queued_packets.
Their current function is to track whether the TX soft-IRQ finished
dealing with an skb when the AT-req tasklet takes over, and to discard
pending tx datagrams (if there are any) when the local node is removed.

The latter does actually contain a race condition bug with TX soft-IRQ
and AT-req tasklet.

Instead of these lists and the corresponding link in fwnet_packet_task,
  - a flag in fwnet_packet_task to track whether fwnet_tx is done,
  - a counter of queued datagrams in fwnet_device
do the job as well.

The above mentioned theoretic race condition is resolved by letting
fwnet_remove sleep until all datagrams were flushed.  It may sleep
almost arbitrarily long since fwnet_remove is executed in the context of
a multithreaded (concurrency managed) workqueue.

The type of max_payload is changed to u16 here to avoid waste in struct
fwnet_packet_task.  This value cannot exceed 4096 per IEEE 1394:2008
table 16-18 (or 32678 per specification of packet headers, if there is
ever going to be something else than beta mode).

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
 drivers/firewire/net.c |   73 +++++++++++++++--------------------------
 1 file changed, 26 insertions(+), 47 deletions(-)

Index: b/drivers/firewire/net.c
===================================================================
--- a/drivers/firewire/net.c
+++ b/drivers/firewire/net.c
@@ -7,6 +7,7 @@
  */
 
 #include <linux/bug.h>
+#include <linux/delay.h>
 #include <linux/device.h>
 #include <linux/ethtool.h>
 #include <linux/firewire.h>
@@ -170,15 +171,8 @@ struct fwnet_device {
 	struct fw_address_handler handler;
 	u64 local_fifo;
 
-	/* List of packets to be sent */
-	struct list_head packet_list;
-	/*
-	 * List of packets that were broadcasted.  When we get an ISO interrupt
-	 * one of them has been sent
-	 */
-	struct list_head broadcasted_list;
-	/* List of packets that have been sent but not yet acked */
-	struct list_head sent_list;
+	/* Number of tx datagrams that have been queued but not yet acked */
+	int queued_datagrams;
 
 	struct list_head peer_list;
 	struct fw_card *card;
@@ -196,7 +190,7 @@ struct fwnet_peer {
 	unsigned pdg_size;        /* pd_list size */
 
 	u16 datagram_label;       /* outgoing datagram label */
-	unsigned max_payload;     /* includes RFC2374_FRAG_HDR_SIZE overhead */
+	u16 max_payload;          /* includes RFC2374_FRAG_HDR_SIZE overhead */
 	int node_id;
 	int generation;
 	unsigned speed;
@@ -204,22 +198,18 @@ struct fwnet_peer {
 
 /* This is our task struct. It's used for the packet complete callback.  */
 struct fwnet_packet_task {
-	/*
-	 * ptask can actually be on dev->packet_list, dev->broadcasted_list,
-	 * or dev->sent_list depending on its current state.
-	 */
-	struct list_head pt_link;
 	struct fw_transaction transaction;
 	struct rfc2734_header hdr;
 	struct sk_buff *skb;
 	struct fwnet_device *dev;
 
 	int outstanding_pkts;
-	unsigned max_payload;
 	u64 fifo_addr;
 	u16 dest_node;
+	u16 max_payload;
 	u8 generation;
 	u8 speed;
+	u8 enqueued;
 };
 
 /*
@@ -916,9 +906,9 @@ static void fwnet_transmit_packet_done(s
 	ptask->outstanding_pkts--;
 
 	/* Check whether we or the networking TX soft-IRQ is last user. */
-	free = (ptask->outstanding_pkts == 0 && !list_empty(&ptask->pt_link));
+	free = (ptask->outstanding_pkts == 0 && ptask->enqueued);
 	if (free)
-		list_del(&ptask->pt_link);
+		dev->queued_datagrams--;
 
 	if (ptask->outstanding_pkts == 0) {
 		dev->netdev->stats.tx_packets++;
@@ -987,9 +977,9 @@ static void fwnet_transmit_packet_failed
 	ptask->outstanding_pkts = 0;
 
 	/* Check whether we or the networking TX soft-IRQ is last user. */
-	free = !list_empty(&ptask->pt_link);
+	free = ptask->enqueued;
 	if (free)
-		list_del(&ptask->pt_link);
+		dev->queued_datagrams--;
 
 	dev->netdev->stats.tx_dropped++;
 	dev->netdev->stats.tx_errors++;
@@ -1070,9 +1060,11 @@ static int fwnet_send_packet(struct fwne
 		spin_lock_irqsave(&dev->lock, flags);
 
 		/* If the AT tasklet already ran, we may be last user. */
-		free = (ptask->outstanding_pkts == 0 && list_empty(&ptask->pt_link));
+		free = (ptask->outstanding_pkts == 0 && !ptask->enqueued);
 		if (!free)
-			list_add_tail(&ptask->pt_link, &dev->broadcasted_list);
+			ptask->enqueued = true;
+		else
+			dev->queued_datagrams--;
 
 		spin_unlock_irqrestore(&dev->lock, flags);
 
@@ -1087,9 +1079,11 @@ static int fwnet_send_packet(struct fwne
 	spin_lock_irqsave(&dev->lock, flags);
 
 	/* If the AT tasklet already ran, we may be last user. */
-	free = (ptask->outstanding_pkts == 0 && list_empty(&ptask->pt_link));
+	free = (ptask->outstanding_pkts == 0 && !ptask->enqueued);
 	if (!free)
-		list_add_tail(&ptask->pt_link, &dev->sent_list);
+		ptask->enqueued = true;
+	else
+		dev->queued_datagrams--;
 
 	spin_unlock_irqrestore(&dev->lock, flags);
 
@@ -1351,10 +1345,12 @@ static netdev_tx_t fwnet_tx(struct sk_bu
 		max_payload += RFC2374_FRAG_HDR_SIZE;
 	}
 
+	dev->queued_datagrams++;
+
 	spin_unlock_irqrestore(&dev->lock, flags);
 
 	ptask->max_payload = max_payload;
-	INIT_LIST_HEAD(&ptask->pt_link);
+	ptask->enqueued    = 0;
 
 	fwnet_send_packet(ptask);
 
@@ -1500,14 +1496,9 @@ static int fwnet_probe(struct device *_d
 	dev->broadcast_rcv_context = NULL;
 	dev->broadcast_xmt_max_payload = 0;
 	dev->broadcast_xmt_datagramlabel = 0;
-
 	dev->local_fifo = FWNET_NO_FIFO_ADDR;
-
-	INIT_LIST_HEAD(&dev->packet_list);
-	INIT_LIST_HEAD(&dev->broadcasted_list);
-	INIT_LIST_HEAD(&dev->sent_list);
+	dev->queued_datagrams = 0;
 	INIT_LIST_HEAD(&dev->peer_list);
-
 	dev->card = card;
 	dev->netdev = net;
 
@@ -1565,7 +1556,7 @@ static int fwnet_remove(struct device *_
 	struct fwnet_peer *peer = dev_get_drvdata(_dev);
 	struct fwnet_device *dev = peer->dev;
 	struct net_device *net;
-	struct fwnet_packet_task *ptask, *pt_next;
+	int i;
 
 	mutex_lock(&fwnet_device_mutex);
 
@@ -1583,21 +1574,9 @@ static int fwnet_remove(struct device *_
 					      dev->card);
 			fw_iso_context_destroy(dev->broadcast_rcv_context);
 		}
-		list_for_each_entry_safe(ptask, pt_next,
-					 &dev->packet_list, pt_link) {
-			dev_kfree_skb_any(ptask->skb);
-			kmem_cache_free(fwnet_packet_task_cache, ptask);
-		}
-		list_for_each_entry_safe(ptask, pt_next,
-					 &dev->broadcasted_list, pt_link) {
-			dev_kfree_skb_any(ptask->skb);
-			kmem_cache_free(fwnet_packet_task_cache, ptask);
-		}
-		list_for_each_entry_safe(ptask, pt_next,
-					 &dev->sent_list, pt_link) {
-			dev_kfree_skb_any(ptask->skb);
-			kmem_cache_free(fwnet_packet_task_cache, ptask);
-		}
+		for (i = 0; dev->queued_datagrams && i < 5; i++)
+			ssleep(1);
+		WARN_ON(dev->queued_datagrams);
 		list_del(&dev->dev_link);
 
 		free_netdev(net);

-- 
Stefan Richter
-=====-==-=- =-== --===
http://arcgraph.de/sr/

^ permalink raw reply

* [PATCH 2/4] firewire: net: fix memory leaks
From: Stefan Richter @ 2010-11-07 21:40 UTC (permalink / raw)
  To: linux1394-devel; +Cc: netdev, linux-kernel
In-Reply-To: <tkrat.036c5d10fb83f85d@s5r6.in-berlin.de>

a) fwnet_transmit_packet_done used to poison ptask->pt_link by list_del.
If fwnet_send_packet checked later whether it was responsible to clean
up (in the border case that the TX soft IRQ was outpaced by the AT-req
tasklet on another CPU), it missed this because ptask->pt_link was no
longer shown as empty.

b) If fwnet_write_complete got an rcode other than RCODE_COMPLETE, we
missed to free the skb and ptask entirely.

Also, count stats.tx_dropped and stats.tx_errors when rcode != 0.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
 drivers/firewire/net.c |   35 +++++++++++++++++++++++++++++++----
 1 file changed, 31 insertions(+), 4 deletions(-)

Index: b/drivers/firewire/net.c
===================================================================
--- a/drivers/firewire/net.c
+++ b/drivers/firewire/net.c
@@ -917,9 +917,10 @@ static void fwnet_transmit_packet_done(s
 
 	/* Check whether we or the networking TX soft-IRQ is last user. */
 	free = (ptask->outstanding_pkts == 0 && !list_empty(&ptask->pt_link));
+	if (free)
+		list_del(&ptask->pt_link);
 
 	if (ptask->outstanding_pkts == 0) {
-		list_del(&ptask->pt_link);
 		dev->netdev->stats.tx_packets++;
 		dev->netdev->stats.tx_bytes += skb->len;
 	}
@@ -974,6 +975,31 @@ static void fwnet_transmit_packet_done(s
 		fwnet_free_ptask(ptask);
 }
 
+static void fwnet_transmit_packet_failed(struct fwnet_packet_task *ptask)
+{
+	struct fwnet_device *dev = ptask->dev;
+	unsigned long flags;
+	bool free;
+
+	spin_lock_irqsave(&dev->lock, flags);
+
+	/* One fragment failed; don't try to send remaining fragments. */
+	ptask->outstanding_pkts = 0;
+
+	/* Check whether we or the networking TX soft-IRQ is last user. */
+	free = !list_empty(&ptask->pt_link);
+	if (free)
+		list_del(&ptask->pt_link);
+
+	dev->netdev->stats.tx_dropped++;
+	dev->netdev->stats.tx_errors++;
+
+	spin_unlock_irqrestore(&dev->lock, flags);
+
+	if (free)
+		fwnet_free_ptask(ptask);
+}
+
 static void fwnet_write_complete(struct fw_card *card, int rcode,
 				 void *payload, size_t length, void *data)
 {
@@ -981,11 +1007,12 @@ static void fwnet_write_complete(struct 
 
 	ptask = data;
 
-	if (rcode == RCODE_COMPLETE)
+	if (rcode == RCODE_COMPLETE) {
 		fwnet_transmit_packet_done(ptask);
-	else
+	} else {
 		fw_error("fwnet_write_complete: failed: %x\n", rcode);
-		/* ??? error recovery */
+		fwnet_transmit_packet_failed(ptask);
+	}
 }
 
 static int fwnet_send_packet(struct fwnet_packet_task *ptask)

-- 
Stefan Richter
-=====-==-=- =-== --===
http://arcgraph.de/sr/


------------------------------------------------------------------------------
The Next 800 Companies to Lead America's Growth: New Video Whitepaper
David G. Thomson, author of the best-selling book "Blueprint to a 
Billion" shares his insights and actions to help propel your 
business during the next growth cycle. Listen Now!
http://p.sf.net/sfu/SAP-dev2dev

^ permalink raw reply

* [PATCH 1/4] firewire: net: count stats.tx_packets and stats.tx_bytes
From: Stefan Richter @ 2010-11-07 21:39 UTC (permalink / raw)
  To: linux1394-devel; +Cc: netdev, linux-kernel
In-Reply-To: <tkrat.036c5d10fb83f85d@s5r6.in-berlin.de>

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
 drivers/firewire/net.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Index: b/drivers/firewire/net.c
===================================================================
--- a/drivers/firewire/net.c
+++ b/drivers/firewire/net.c
@@ -907,6 +907,7 @@ static int fwnet_send_packet(struct fwne
 static void fwnet_transmit_packet_done(struct fwnet_packet_task *ptask)
 {
 	struct fwnet_device *dev = ptask->dev;
+	struct sk_buff *skb = ptask->skb;
 	unsigned long flags;
 	bool free;
 
@@ -917,8 +918,11 @@ static void fwnet_transmit_packet_done(s
 	/* Check whether we or the networking TX soft-IRQ is last user. */
 	free = (ptask->outstanding_pkts == 0 && !list_empty(&ptask->pt_link));
 
-	if (ptask->outstanding_pkts == 0)
+	if (ptask->outstanding_pkts == 0) {
 		list_del(&ptask->pt_link);
+		dev->netdev->stats.tx_packets++;
+		dev->netdev->stats.tx_bytes += skb->len;
+	}
 
 	spin_unlock_irqrestore(&dev->lock, flags);
 
@@ -927,7 +931,6 @@ static void fwnet_transmit_packet_done(s
 		u16 fg_off;
 		u16 datagram_label;
 		u16 lf;
-		struct sk_buff *skb;
 
 		/* Update the ptask to point to the next fragment and send it */
 		lf = fwnet_get_hdr_lf(&ptask->hdr);
@@ -954,7 +957,7 @@ static void fwnet_transmit_packet_done(s
 			datagram_label = fwnet_get_hdr_dgl(&ptask->hdr);
 			break;
 		}
-		skb = ptask->skb;
+
 		skb_pull(skb, ptask->max_payload);
 		if (ptask->outstanding_pkts > 1) {
 			fwnet_make_sf_hdr(&ptask->hdr, RFC2374_HDR_INTFRAG,

-- 
Stefan Richter
-=====-==-=- =-== --===
http://arcgraph.de/sr/



------------------------------------------------------------------------------
The Next 800 Companies to Lead America's Growth: New Video Whitepaper
David G. Thomson, author of the best-selling book "Blueprint to a 
Billion" shares his insights and actions to help propel your 
business during the next growth cycle. Listen Now!
http://p.sf.net/sfu/SAP-dev2dev

^ permalink raw reply

* [PATCH 0/4] firewire-net updates
From: Stefan Richter @ 2010-11-07 21:38 UTC (permalink / raw)
  To: linux1394-devel; +Cc: netdev, linux-kernel

Here is a repost of yesterday's patches for firewire-net.  I dropped the
error logging limitation.

firewire: net: count stats.tx_packets and stats.tx_bytes
firewire: net: fix memory leaks
firewire: net: replace lists by counters
firewire: net: throttle TX queue before running out of tlabels

 drivers/firewire/net.c |  137 ++++++++++++++++++++++++-----------------
 1 file changed, 83 insertions(+), 54 deletions(-)
-- 
Stefan Richter
-=====-==-=- =-== --===
http://arcgraph.de/sr/



------------------------------------------------------------------------------
The Next 800 Companies to Lead America's Growth: New Video Whitepaper
David G. Thomson, author of the best-selling book "Blueprint to a 
Billion" shares his insights and actions to help propel your 
business during the next growth cycle. Listen Now!
http://p.sf.net/sfu/SAP-dev2dev

^ permalink raw reply

* Registration of NGO for a Grant,contact me for more details.
From: NGO REGISTRATION @ 2010-11-07 18:32 UTC (permalink / raw)


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=_iso-2022-jp$ESC, Size: 0 bytes --]



^ permalink raw reply

* [PATCH] SUNRPC: Simplify rpc_alloc_iostats by removing pointless local variable
From: Jesper Juhl @ 2010-11-07 21:11 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	J. Bruce Fields, Neil Brown, Trond Myklebust, David S. Miller

Hi,

We can simplify net/sunrpc/stats.c::rpc_alloc_iostats() a bit by getting 
rid of the unneeded local variable 'new'.


Please CC me on replies.


Signed-off-by: Jesper Juhl <jj-IYz4IdjRLj0sV2N9l4h3zg@public.gmane.org>
---
 stats.c |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git 
a/net/sunrpc/stats.c b/net/sunrpc/stats.c
index f71a731..80df89d 100644
--- a/net/sunrpc/stats.c
+++ b/net/sunrpc/stats.c
@@ -115,9 +115,7 @@ EXPORT_SYMBOL_GPL(svc_seq_show);
  */
 struct rpc_iostats *rpc_alloc_iostats(struct rpc_clnt *clnt)
 {
-	struct rpc_iostats *new;
-	new = kcalloc(clnt->cl_maxproc, sizeof(struct rpc_iostats), GFP_KERNEL);
-	return new;
+	return kcalloc(clnt->cl_maxproc, sizeof(struct rpc_iostats), GFP_KERNEL);
 }
 EXPORT_SYMBOL_GPL(rpc_alloc_iostats);
 


-- 
Jesper Juhl <jj-IYz4IdjRLj0sV2N9l4h3zg@public.gmane.org>             http://www.chaosbits.net/
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH 2/2 v5] xps: Transmit Packet Steering
From: Eric Dumazet @ 2010-11-07 20:40 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.1.00.1011071146390.29978@pokey.mtv.corp.google.com>

Le dimanche 07 novembre 2010 à 11:52 -0800, Tom Herbert a écrit :
> This patch implements transmit packet steering (XPS) for multiqueue
> devices.  XPS selects a transmit queue during packet transmission based
> on configuration.  This is done by mapping the CPU transmitting the
> packet to a queue.  This is the transmit side analogue to RPS-- where
> RPS is selecting a CPU based on receive queue, XPS selects a queue
> based on the CPU (previously there was an XPS patch from Eric
> Dumazet, but that might more appropriately be called transmit completion
> steering).
> 
> Each transmit queue can be associated with a number of CPUs which will
> use the queue to send packets.  This is configured as a CPU mask on a
> per queue basis in:
> 
> /sys/class/net/eth<n>/queues/tx-<n>/xps_cpus
> 
> The mappings are stored per device in an inverted data structure that
> maps CPUs to queues.  In the netdevice structure this is an array of
> num_possible_cpu structures where each structure holds and array of
> queue_indexes for queues which that CPU can use.
> 
> The benefits of XPS are improved locality in the per queue data
> structures.  Also, transmit completions are more likely to be done
> nearer to the sending thread, so this should promote locality back
> to the socket on free (e.g. UDP).  The benefits of XPS are dependent on
> cache hierarchy, application load, and other factors.  XPS would
> nominally be configured so that a queue would only be shared by CPUs
> which are sharing a cache, the degenerative configuration woud be that
> each CPU has it's own queue.
> 
> Below are some benchmark results which show the potential benfit of
> this patch.  The netperf test has 500 instances of netperf TCP_RR test
> with 1 byte req. and resp.
> 
> bnx2x on 16 core AMD
>    XPS (16 queues, 1 TX queue per CPU)  1234K at 100% CPU
>    No XPS (16 queues)                   996K at 100% CPU
> 
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
>  include/linux/netdevice.h |   32 ++++
>  net/core/dev.c            |   54 +++++++-
>  net/core/net-sysfs.c      |  367 ++++++++++++++++++++++++++++++++++++++++++++-
>  net/core/net-sysfs.h      |    3 +
>  4 files changed, 450 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 072652d..b2ea7c0 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -503,6 +503,13 @@ struct netdev_queue {
>  	struct Qdisc		*qdisc;
>  	unsigned long		state;
>  	struct Qdisc		*qdisc_sleeping;
> +#ifdef CONFIG_RPS
> +	struct netdev_queue	*first;
> +	atomic_t		count;
> +	struct xps_dev_maps	*xps_maps;

Tom, I still dont understand why *xps_maps is here, and not in
net_device ?

I am asking because netdev_get_xps_maps(dev) might be slowed down
because queue 0 state might change often (__QUEUE_STATE_XOFF)

This means _tx[0] becomes a very hot cache line, needed to access all
queues (from get_xps_queue())

Other than that, your patch seems fine (not tested yet)

Thanks



^ permalink raw reply

* [PATCH 2/2 v5] xps: Transmit Packet Steering
From: Tom Herbert @ 2010-11-07 19:52 UTC (permalink / raw)
  To: davem, netdev; +Cc: eric.dumazet

This patch implements transmit packet steering (XPS) for multiqueue
devices.  XPS selects a transmit queue during packet transmission based
on configuration.  This is done by mapping the CPU transmitting the
packet to a queue.  This is the transmit side analogue to RPS-- where
RPS is selecting a CPU based on receive queue, XPS selects a queue
based on the CPU (previously there was an XPS patch from Eric
Dumazet, but that might more appropriately be called transmit completion
steering).

Each transmit queue can be associated with a number of CPUs which will
use the queue to send packets.  This is configured as a CPU mask on a
per queue basis in:

/sys/class/net/eth<n>/queues/tx-<n>/xps_cpus

The mappings are stored per device in an inverted data structure that
maps CPUs to queues.  In the netdevice structure this is an array of
num_possible_cpu structures where each structure holds and array of
queue_indexes for queues which that CPU can use.

The benefits of XPS are improved locality in the per queue data
structures.  Also, transmit completions are more likely to be done
nearer to the sending thread, so this should promote locality back
to the socket on free (e.g. UDP).  The benefits of XPS are dependent on
cache hierarchy, application load, and other factors.  XPS would
nominally be configured so that a queue would only be shared by CPUs
which are sharing a cache, the degenerative configuration woud be that
each CPU has it's own queue.

Below are some benchmark results which show the potential benfit of
this patch.  The netperf test has 500 instances of netperf TCP_RR test
with 1 byte req. and resp.

bnx2x on 16 core AMD
   XPS (16 queues, 1 TX queue per CPU)  1234K at 100% CPU
   No XPS (16 queues)                   996K at 100% CPU

Signed-off-by: Tom Herbert <therbert@google.com>
---
 include/linux/netdevice.h |   32 ++++
 net/core/dev.c            |   54 +++++++-
 net/core/net-sysfs.c      |  367 ++++++++++++++++++++++++++++++++++++++++++++-
 net/core/net-sysfs.h      |    3 +
 4 files changed, 450 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 072652d..b2ea7c0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -503,6 +503,13 @@ struct netdev_queue {
 	struct Qdisc		*qdisc;
 	unsigned long		state;
 	struct Qdisc		*qdisc_sleeping;
+#ifdef CONFIG_RPS
+	struct netdev_queue	*first;
+	atomic_t		count;
+	struct xps_dev_maps	*xps_maps;
+	struct kobject		kobj;
+#endif
+
 /*
  * write mostly part
  */
@@ -530,6 +537,31 @@ struct rps_map {
 #define RPS_MAP_SIZE(_num) (sizeof(struct rps_map) + (_num * sizeof(u16)))
 
 /*
+ * This structure holds an XPS map which can be of variable length.  The
+ * map is an array of queues.
+ */
+struct xps_map {
+	unsigned int len;
+	unsigned int alloc_len;
+	struct rcu_head rcu;
+	u16 queues[0];
+};
+#define XPS_MAP_SIZE(_num) (sizeof(struct xps_map) + (_num * sizeof(u16)))
+#define XPS_MIN_MAP_ALLOC ((L1_CACHE_BYTES - sizeof(struct xps_map))	\
+    / sizeof(u16))
+
+/*
+ * This structure holds all XPS maps for device.  Maps are indexed by CPU.
+ */
+struct xps_dev_maps {
+	struct rcu_head rcu;
+	struct xps_map *cpu_map[0];
+};
+#define XPS_DEV_MAPS_SIZE (sizeof(struct xps_dev_maps) +		\
+    (nr_cpu_ids * sizeof(struct xps_map *)))
+#define netdev_get_xps_maps(dev) ((dev)->_tx[0].xps_maps)
+
+/*
  * The rps_dev_flow structure contains the mapping of a flow to a CPU and the
  * tail pointer for that CPU's input queue at the time of last enqueue.
  */
diff --git a/net/core/dev.c b/net/core/dev.c
index 0ff9df6..502aa50 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2119,6 +2119,43 @@ static inline u16 dev_cap_txqueue(struct net_device *dev, u16 queue_index)
 	return queue_index;
 }
 
+static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
+{
+#ifdef CONFIG_RPS
+	struct xps_dev_maps *dev_maps;
+	struct xps_map *map;
+	int queue_index = -1;
+
+	rcu_read_lock();
+	dev_maps = rcu_dereference(netdev_get_xps_maps(dev));
+	if (dev_maps) {
+		map = rcu_dereference(
+		    dev_maps->cpu_map[raw_smp_processor_id()]);
+		if (map) {
+			if (map->len == 1)
+				queue_index = map->queues[0];
+			else {
+				u32 hash;
+				if (skb->sk && skb->sk->sk_hash)
+					hash = skb->sk->sk_hash;
+				else
+					hash = (__force u16) skb->protocol ^
+					    skb->rxhash;
+				hash = jhash_1word(hash, hashrnd);
+				queue_index = map->queues[
+				    ((u64)hash * map->len) >> 32];
+			}
+			if (unlikely(queue_index >= dev->real_num_tx_queues))
+				queue_index = -1;
+		}
+	}
+	rcu_read_unlock();
+
+	return queue_index;
+#endif
+	return -1;
+}
+
 static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 					struct sk_buff *skb)
 {
@@ -2138,7 +2175,9 @@ static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 		    queue_index >= dev->real_num_tx_queues) {
 			int old_index = queue_index;
 
-			queue_index = skb_tx_hash(dev, skb);
+			queue_index = get_xps_queue(dev, skb);
+			if (queue_index < 0)
+				queue_index = skb_tx_hash(dev, skb);
 
 			if (queue_index != old_index && sk) {
 				struct dst_entry *dst =
@@ -5057,6 +5096,17 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
 		return -ENOMEM;
 	}
 	dev->_tx = tx;
+#ifdef CONFIG_RPS
+	/*
+	 * Set a pointer to first element in the array which holds the
+	 * reference count.
+	 */
+	{
+		int i;
+		for (i = 0; i < count; i++)
+			tx[i].first = tx;
+	}
+#endif
 	return 0;
 }
 
@@ -5621,7 +5671,9 @@ void free_netdev(struct net_device *dev)
 
 	release_net(dev_net(dev));
 
+#ifndef CONFIG_RPS
 	kfree(dev->_tx);
+#endif
 
 	kfree(rcu_dereference_raw(dev->ingress_queue));
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index a5ff5a8..aa77d7f 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -770,18 +770,375 @@ net_rx_queue_update_kobjects(struct net_device *net, int old_num, int new_num)
 	return error;
 }
 
-static int rx_queue_register_kobjects(struct net_device *net)
+/*
+ * netdev_queue sysfs structures and functions.
+ */
+struct netdev_queue_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct netdev_queue *queue,
+	    struct netdev_queue_attribute *attr, char *buf);
+	ssize_t (*store)(struct netdev_queue *queue,
+	    struct netdev_queue_attribute *attr, const char *buf, size_t len);
+};
+#define to_netdev_queue_attr(_attr) container_of(_attr,		\
+    struct netdev_queue_attribute, attr)
+
+#define to_netdev_queue(obj) container_of(obj, struct netdev_queue, kobj)
+
+static ssize_t netdev_queue_attr_show(struct kobject *kobj,
+				      struct attribute *attr, char *buf)
+{
+	struct netdev_queue_attribute *attribute = to_netdev_queue_attr(attr);
+	struct netdev_queue *queue = to_netdev_queue(kobj);
+
+	if (!attribute->show)
+		return -EIO;
+
+	return attribute->show(queue, attribute, buf);
+}
+
+static ssize_t netdev_queue_attr_store(struct kobject *kobj,
+				       struct attribute *attr,
+				       const char *buf, size_t count)
+{
+	struct netdev_queue_attribute *attribute = to_netdev_queue_attr(attr);
+	struct netdev_queue *queue = to_netdev_queue(kobj);
+
+	if (!attribute->store)
+		return -EIO;
+
+	return attribute->store(queue, attribute, buf, count);
+}
+
+static const struct sysfs_ops netdev_queue_sysfs_ops = {
+	.show = netdev_queue_attr_show,
+	.store = netdev_queue_attr_store,
+};
+
+static inline unsigned int get_netdev_queue_index(struct netdev_queue *queue)
+{
+	struct net_device *dev = queue->dev;
+	int i;
+
+	for (i = 0; i < dev->num_tx_queues; i++)
+		if (queue == &dev->_tx[i])
+			break;
+
+	BUG_ON(i >= dev->num_tx_queues);
+
+	return i;
+}
+
+
+static ssize_t show_xps_map(struct netdev_queue *queue,
+			    struct netdev_queue_attribute *attribute, char *buf)
+{
+	struct netdev_queue *first = queue->first;
+	struct xps_dev_maps *dev_maps;
+	cpumask_var_t mask;
+	unsigned long index;
+	size_t len = 0;
+	int i;
+
+	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	index = get_netdev_queue_index(queue);
+
+	rcu_read_lock();
+	dev_maps = rcu_dereference(first->xps_maps);
+	if (dev_maps) {
+		for_each_possible_cpu(i) {
+			struct xps_map *map =
+			    rcu_dereference(dev_maps->cpu_map[i]);
+			if (map) {
+				int j;
+				for (j = 0; j < map->len; j++) {
+					if (map->queues[j] == index) {
+						cpumask_set_cpu(i, mask);
+						break;
+					}
+				}
+			}
+		}
+	}
+	len += cpumask_scnprintf(buf + len, PAGE_SIZE, mask);
+	if (PAGE_SIZE - len < 3) {
+		rcu_read_unlock();
+		free_cpumask_var(mask);
+		return -EINVAL;
+	}
+	rcu_read_unlock();
+
+	free_cpumask_var(mask);
+	len += sprintf(buf + len, "\n");
+	return len;
+}
+
+static void xps_map_release(struct rcu_head *rcu)
+{
+	struct xps_map *map = container_of(rcu, struct xps_map, rcu);
+
+	kfree(map);
+}
+
+static void xps_dev_maps_release(struct rcu_head *rcu)
+{
+	struct xps_dev_maps *dev_maps =
+	    container_of(rcu, struct xps_dev_maps, rcu);
+
+	kfree(dev_maps);
+}
+
+static DEFINE_MUTEX(xps_map_mutex);
+
+static ssize_t store_xps_map(struct netdev_queue *queue,
+		      struct netdev_queue_attribute *attribute,
+		      const char *buf, size_t len)
+{
+	struct netdev_queue *first = queue->first;
+	cpumask_var_t mask;
+	int err, i, cpu, pos, map_len, alloc_len, need_set;
+	unsigned long index;
+	struct xps_map *map, *new_map;
+	struct xps_dev_maps *dev_maps, *new_dev_maps;
+	int nonempty = 0;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	index = get_netdev_queue_index(queue);
+
+	err = bitmap_parse(buf, len, cpumask_bits(mask), nr_cpumask_bits);
+	if (err) {
+		free_cpumask_var(mask);
+		return err;
+	}
+
+	new_dev_maps = kzalloc(max_t(unsigned,
+	    XPS_DEV_MAPS_SIZE, L1_CACHE_BYTES), GFP_KERNEL);
+	if (!new_dev_maps) {
+		free_cpumask_var(mask);
+		return err;
+	}
+
+	mutex_unlock(&xps_map_mutex);
+
+	dev_maps = first->xps_maps;
+
+	for_each_possible_cpu(cpu) {
+		new_map = map = dev_maps ? dev_maps->cpu_map[cpu] : NULL;
+
+		if (map) {
+			for (pos = 0; pos < map->len; pos++)
+				if (map->queues[pos] == index)
+					break;
+			map_len = map->len;
+			alloc_len = map->alloc_len;
+		} else
+			pos = map_len = alloc_len = 0;
+
+		need_set = cpu_isset(cpu, *mask) && cpu_online(cpu);
+
+		if (need_set && pos >= map_len) {
+			/* Need to add queue to this CPU's map */
+			if (map_len >= alloc_len) {
+				alloc_len = alloc_len ?
+				    2 * alloc_len : XPS_MIN_MAP_ALLOC;
+				new_map = kzalloc(XPS_MAP_SIZE(alloc_len),
+				    GFP_KERNEL);
+				if (!new_map)
+					goto error;
+				new_map->alloc_len = alloc_len;
+				for (i = 0; i < map_len; i++)
+					new_map->queues[i] = map->queues[i];
+				new_map->len = map_len;
+			}
+			new_map->queues[new_map->len++] = index;
+		} else if (!need_set && pos < map_len) {
+			/* Need to remove queue from this CPU's map */
+			if (map_len > 1)
+				new_map->queues[pos] =
+				    new_map->queues[--new_map->len];
+			else
+				new_map = NULL;
+		}
+		new_dev_maps->cpu_map[cpu] = new_map;
+	}
+
+	/* Cleanup old maps */
+	for_each_possible_cpu(cpu) {
+		map = dev_maps ? dev_maps->cpu_map[cpu] : NULL;
+		if (map && new_dev_maps->cpu_map[cpu] != map)
+			call_rcu(&map->rcu, xps_map_release);
+		if (new_dev_maps->cpu_map[cpu])
+			nonempty = 1;
+	}
+
+	if (nonempty)
+		rcu_assign_pointer(first->xps_maps, new_dev_maps);
+	else {
+		kfree(new_dev_maps);
+		rcu_assign_pointer(first->xps_maps, NULL);
+	}
+
+	if (dev_maps)
+		call_rcu(&dev_maps->rcu, xps_dev_maps_release);
+
+	mutex_unlock(&xps_map_mutex);
+
+	free_cpumask_var(mask);
+	return len;
+
+error:
+	mutex_unlock(&xps_map_mutex);
+
+	if (new_dev_maps)
+		for_each_possible_cpu(i)
+			kfree(new_dev_maps->cpu_map[i]);
+	kfree(new_dev_maps);
+	free_cpumask_var(mask);
+	return -ENOMEM;
+}
+
+static struct netdev_queue_attribute xps_cpus_attribute =
+    __ATTR(xps_cpus, S_IRUGO | S_IWUSR, show_xps_map, store_xps_map);
+
+static struct attribute *netdev_queue_default_attrs[] = {
+	&xps_cpus_attribute.attr,
+	NULL
+};
+
+static void netdev_queue_release(struct kobject *kobj)
 {
+	struct netdev_queue *queue = to_netdev_queue(kobj);
+	struct netdev_queue *first = queue->first;
+	struct xps_dev_maps *dev_maps;
+	struct xps_map *map;
+	unsigned long index;
+	int i, pos, nonempty = 0;
+
+	index = get_netdev_queue_index(queue);
+
+	mutex_lock(&xps_map_mutex);
+	dev_maps = first->xps_maps;
+
+	for_each_possible_cpu(i) {
+		map  = dev_maps ? dev_maps->cpu_map[i] : NULL;
+		if (!map)
+			continue;
+
+		for (pos = 0; pos < map->len; pos++)
+			if (map->queues[pos] == index)
+				break;
+
+		if (pos < map->len) {
+			if (map->len > 1)
+				map->queues[pos] = map->queues[--map->len];
+			else {
+				rcu_assign_pointer(dev_maps->cpu_map[i],
+				    NULL);
+				call_rcu(&map->rcu, xps_map_release);
+				map = NULL;
+			}
+		}
+
+		if (map)
+			nonempty = 1;
+	}
+
+	if (!nonempty) {
+		rcu_assign_pointer(first->xps_maps, NULL);
+		call_rcu(&dev_maps->rcu, xps_dev_maps_release);
+	}
+	mutex_unlock(&xps_map_mutex);
+
+	if (atomic_dec_and_test(&first->count))
+		kfree(first);
+}
+
+static struct kobj_type netdev_queue_ktype = {
+	.sysfs_ops = &netdev_queue_sysfs_ops,
+	.release = netdev_queue_release,
+	.default_attrs = netdev_queue_default_attrs,
+};
+
+static int netdev_queue_add_kobject(struct net_device *net, int index)
+{
+	struct netdev_queue *queue = net->_tx + index;
+	struct netdev_queue *first = queue->first;
+	struct kobject *kobj = &queue->kobj;
+	int error = 0;
+
+	kobj->kset = net->queues_kset;
+	error = kobject_init_and_add(kobj, &netdev_queue_ktype, NULL,
+	    "tx-%u", index);
+	if (error) {
+		kobject_put(kobj);
+		return error;
+	}
+
+	kobject_uevent(kobj, KOBJ_ADD);
+	atomic_inc(&first->count);
+
+	return error;
+}
+
+int
+netdev_queue_update_kobjects(struct net_device *net, int old_num, int new_num)
+{
+	int i;
+	int error = 0;
+
+	for (i = old_num; i < new_num; i++) {
+		error = netdev_queue_add_kobject(net, i);
+		if (error) {
+			new_num = old_num;
+			break;
+		}
+	}
+
+	while (--i >= new_num)
+		kobject_put(&net->_rx[i].kobj);
+
+	return error;
+}
+
+static int register_queue_kobjects(struct net_device *net)
+{
+	int error = 0, txq = 0, rxq = 0;
+
 	net->queues_kset = kset_create_and_add("queues",
 	    NULL, &net->dev.kobj);
 	if (!net->queues_kset)
 		return -ENOMEM;
-	return net_rx_queue_update_kobjects(net, 0, net->real_num_rx_queues);
+
+	error = net_rx_queue_update_kobjects(net, 0, net->real_num_rx_queues);
+	if (error)
+		goto error;
+	rxq = net->real_num_rx_queues;
+
+	error = netdev_queue_update_kobjects(net, 0,
+					     net->real_num_tx_queues);
+	if (error)
+		goto error;
+	txq = net->real_num_tx_queues;
+
+	return 0;
+
+error:
+	netdev_queue_update_kobjects(net, txq, 0);
+	net_rx_queue_update_kobjects(net, rxq, 0);
+	return error;
 }
 
-static void rx_queue_remove_kobjects(struct net_device *net)
+static void remove_queue_kobjects(struct net_device *net)
 {
 	net_rx_queue_update_kobjects(net, net->real_num_rx_queues, 0);
+	netdev_queue_update_kobjects(net, net->real_num_tx_queues, 0);
 	kset_unregister(net->queues_kset);
 }
 #endif /* CONFIG_RPS */
@@ -884,7 +1241,7 @@ void netdev_unregister_kobject(struct net_device * net)
 	kobject_get(&dev->kobj);
 
 #ifdef CONFIG_RPS
-	rx_queue_remove_kobjects(net);
+	remove_queue_kobjects(net);
 #endif
 
 	device_del(dev);
@@ -925,7 +1282,7 @@ int netdev_register_kobject(struct net_device *net)
 		return error;
 
 #ifdef CONFIG_RPS
-	error = rx_queue_register_kobjects(net);
+	error = register_queue_kobjects(net);
 	if (error) {
 		device_del(dev);
 		return error;
diff --git a/net/core/net-sysfs.h b/net/core/net-sysfs.h
index 778e157..25ec2ee 100644
--- a/net/core/net-sysfs.h
+++ b/net/core/net-sysfs.h
@@ -6,6 +6,9 @@ int netdev_register_kobject(struct net_device *);
 void netdev_unregister_kobject(struct net_device *);
 #ifdef CONFIG_RPS
 int net_rx_queue_update_kobjects(struct net_device *, int old_num, int new_num);
+int netdev_queue_update_kobjects(struct net_device *net,
+				 int old_num, int new_num);
+
 #endif
 
 #endif
-- 
1.7.3.1


^ permalink raw reply related

* [PATCH 1/2 v5] xps: Improvements in TX queue selection
From: Tom Herbert @ 2010-11-07 19:52 UTC (permalink / raw)
  To: davem, netdev; +Cc: eric.dumazet

In dev_pick_tx, don't do work in calculating queue index or setting
the index in the sock unless the device has more than one queue.  This
allows the sock to be set only with a queue index of a multi-queue
device which is desirable if device are stacked like in a tunnel.

We also allow the mapping of a socket to queue to be changed.  To
maintain in order packet transmission a flag (ooo_okay) has been
added to the sk_buff structure.  If a transport layer sets this flag
on a packet, the transmit queue can be changed for the socket.
Presumably, the transport would set this if there was no possbility
of creating OOO packets (for instance, there are no packets in flight
for the socket).  This patch includes the modification in TCP output
for setting this flag.

Signed-off-by: Tom Herbert <therbert@google.com>
---
 include/linux/skbuff.h |    3 ++-
 net/core/dev.c         |   18 +++++++++++-------
 net/ipv4/tcp_output.c  |    5 ++++-
 3 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e6ba898..19f37a6 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -386,9 +386,10 @@ struct sk_buff {
 #else
 	__u8			deliver_no_wcard:1;
 #endif
+	__u8			ooo_okay:1;
 	kmemcheck_bitfield_end(flags2);
 
-	/* 0/14 bit hole */
+	/* 0/13 bit hole */
 
 #ifdef CONFIG_NET_DMA
 	dma_cookie_t		dma_cookie;
diff --git a/net/core/dev.c b/net/core/dev.c
index 0dd54a6..0ff9df6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2125,20 +2125,24 @@ static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 	int queue_index;
 	const struct net_device_ops *ops = dev->netdev_ops;
 
-	if (ops->ndo_select_queue) {
+	if (dev->real_num_tx_queues == 1)
+		queue_index = 0;
+	else if (ops->ndo_select_queue) {
 		queue_index = ops->ndo_select_queue(dev, skb);
 		queue_index = dev_cap_txqueue(dev, queue_index);
 	} else {
 		struct sock *sk = skb->sk;
 		queue_index = sk_tx_queue_get(sk);
-		if (queue_index < 0 || queue_index >= dev->real_num_tx_queues) {
 
-			queue_index = 0;
-			if (dev->real_num_tx_queues > 1)
-				queue_index = skb_tx_hash(dev, skb);
+		if (queue_index < 0 || skb->ooo_okay ||
+		    queue_index >= dev->real_num_tx_queues) {
+			int old_index = queue_index;
 
-			if (sk) {
-				struct dst_entry *dst = rcu_dereference_check(sk->sk_dst_cache, 1);
+			queue_index = skb_tx_hash(dev, skb);
+
+			if (queue_index != old_index && sk) {
+				struct dst_entry *dst =
+				    rcu_dereference_check(sk->sk_dst_cache, 1);
 
 				if (dst && skb_dst(skb) == dst)
 					sk_tx_queue_set(sk, queue_index);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 05b1ecf..2b6eb36 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -822,8 +822,11 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 							   &md5);
 	tcp_header_size = tcp_options_size + sizeof(struct tcphdr);
 
-	if (tcp_packets_in_flight(tp) == 0)
+	if (tcp_packets_in_flight(tp) == 0) {
 		tcp_ca_event(sk, CA_EVENT_TX_START);
+		skb->ooo_okay = 1;
+	} else
+		skb->ooo_okay = 0;
 
 	skb_push(skb, tcp_header_size);
 	skb_reset_transport_header(skb);
-- 
1.7.3.1


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox