Netdev List
 help / color / mirror / Atom feed
* [PATCH] bonding: fix a buffer overflow in bonding_show_queue_id.
From: Nicolas de Pesloüan @ 2010-07-14 22:24 UTC (permalink / raw)
  To: bonding-devel, andy, fubar, davem, netdev; +Cc: Nicolas de Pesloüan

The test for buffer overflow ensures we have room for 6 more bytes.
sprintf, called with %s:%d, slave->dev->name, slave->queue_id may yield
far more than 6 bytes.

The correct test is res > (PAGE_SIZE - IFNAMSIZ - 6) .

Signed-off-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
---
 drivers/net/bonding/bond_sysfs.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index f9a0343..1a99764 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1427,8 +1427,8 @@ static ssize_t bonding_show_queue_id(struct device *d,
 
 	read_lock(&bond->lock);
 	bond_for_each_slave(bond, slave, i) {
-		if (res > (PAGE_SIZE - 6)) {
-			/* not enough space for another interface name */
+		if (res > (PAGE_SIZE - IFNAMSIZ - 6)) {
+			/* not enough space for another interface_name:queue_id pair */
 			if ((PAGE_SIZE - res) > 10)
 				res = PAGE_SIZE - 10;
 			res += sprintf(buf + res, "++more++ ");
-- 
1.7.1




^ permalink raw reply related

* Re: [PATCH] net: skb_tx_hash() fix relative to skb_orphan_try()
From: David Miller @ 2010-07-14 22:33 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1279034660.2634.439.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 13 Jul 2010 17:24:20 +0200

> commit fc6055a5ba31e2 (net: Introduce skb_orphan_try()) added early
> orphaning of skbs.
> 
> This unfortunately added a performance regression in skb_tx_hash() in
> case of stacked devices (bonding, vlans, ...)
> 
> Since skb->sk is now NULL, we cannot access sk->sk_hash anymore to
> spread tx packets to multiple NIC queues on multiqueue devices.
> 
> skb_tx_hash() in this case only uses skb->protocol, same value for all
> flows.
> 
> skb_orphan_try() can copy sk->sk_hash into skb->rxhash and skb_tx_hash()
> can use this saved sk_hash value to compute its internal hash value.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks Eric.

^ permalink raw reply

* Re: Raise initial congestion window size / speedup slow start?
From: Hagen Paul Pfeifer @ 2010-07-14 22:36 UTC (permalink / raw)
  To: Ed W; +Cc: Rick Jones, David Miller, davidsen, linux-kernel, netdev
In-Reply-To: <4C3E34AB.2060405@wildgooses.com>

* Ed W | 2010-07-14 23:05:31 [+0100]:

>Initial cwnd was changed (increased) in the past (rfc3390) and the
>RFC claims that studies then suggested that the benefits were all
>positive. Some reasonably smart people have suggested that it might
>be time to review the status quo again so it doesn't seem completely
>obvious that the current number is optimal?

Do you cite "An Argument for Increasing TCP's Initial Congestion Window"?
People at google stated that a CWND of 10 seems to be fair in their
measurements. 10 because the test setup was equipped with a reasonable large
link capacity? Do they analyse their modification in environments with a small
BDP (e.g. multihop MANET setup, ...)? I am curious, but We will see what
happens if TCPM adopts this.

>That RFC is a subtle read - it appears to give more specific guidance
>on what to do in certain situations, but I'm not sure I see that it
>improves slow start convergence speed for my situation (large RTT)?
>Would you mind highlighting the new bits for those of us a bit newer
>to the subject?

The objection/hint was more of general nature - not specific for larger RTTs.
Environments with larger RTTs are disadvantaged because TCP is ACK clocked.
Half-truth statement for my part because RTT fairness is and was an issue at
the development of new congestion control algorithms: BIC, CUBIC and friends.

>>Partial local issues can already be "fixed" via route specific ip options -
>>see initcwnd.
>
>Oh, excellent.  This seems like exactly what I'm after.  (Thanks
>Stephen Hemminger!)

Great, you are welcome! ;-)


Hagen

^ permalink raw reply

* Re: Raise initial congestion window size / speedup slow start?
From: Hagen Paul Pfeifer @ 2010-07-14 22:40 UTC (permalink / raw)
  To: Rick Jones; +Cc: David Miller, lists, davidsen, linux-kernel, netdev
In-Reply-To: <4C3E37F7.3020607@hp.com>

* Rick Jones | 2010-07-14 15:19:35 [-0700]:

>>;-) sure, but it is often wise to thwart these kind of discussions. It seems
>>these CWND discussions turn up once every other month. ;-)
>
>Which suggests there is a constant "force" out there yet to be rekoned with. :)

;-) I am _not_ unconscious, but the better address for this kind of
discussions is still tcpm.

Hagen

^ permalink raw reply

* Re: Raise initial congestion window size / speedup slow start?
From: Ed W @ 2010-07-14 22:52 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: David Miller, rick.jones2, davidsen, linux-kernel, netdev
In-Reply-To: <20100714221301.GI6682@nuttenaction>


>> Although section 3 of RFC 5681 is a great text, it does not say at all
>> that increasing the initial CWND would lead to fairness issues.
>>      
> Because it is only one side of the medal, probing conservative the available
> link capacity in conjunction with n simultaneous probing TCP/SCTP/DCCP
> instances is another.
>    

So lets define the problem more succinctly:
- New TCP connections are assumed to have no knowledge of current 
network conditions (bah)
- We desire the connection to consume the maximum amount of bandwidth 
possible, but staying ever so fractionally under the maximum link bandwidth

> Currently I know no working link capacity probing approach, without active
> network feedback, to conservatively probing the available link capacity with a
> high CWND. I am curious about any future trends.
>    

Sounds like smarter people than I have played this game, but just to 
chuck out one idea: How about attacking the idea that we have no 
knowledge of network conditions?  After all we have a bunch of 
information about:

1) very good information about the size of the link to the first hop (eg 
the modem/network card reported rate)
2) often a reasonably good idea about the bandwidth to the first 
"restrictive" router along our default path (ie usually the situation is 
there is a pool of high speed network locally, then a more limited 
connectivity between our network and other networks.  We can look at the 
maximum flows through our network device to outside our subnet and infer 
an approximate link speed from that)
3) often moderate quality information about the size of the link between 
us and a specific destination IP

So here goes: the heuristic could be to examine current flows through 
our interface, use this to offer hints to the remote end during SYN 
handshake as to a recommended starting size, and additionally the client 
side can examine the implied RTT of the SYN/ACK to further fine tune the 
initial cwnd?

In practice this could be implemented in other ways such as examining 
recent TCP congestion windows and using some heuristic to start "near" 
those.  Or remembering congestion windows recently used for popular 
destinations?  Also we can benefit the receiver of our data - if we see 
some app open up 16 http connections to some poor server then some of 
those connections will NOT be given large initial cwnd.

Essentially perhaps we can refine our initial cwnd heuristic somewhat if 
we assume better than zero knowledge about the network link?


Out of curiousity, why has it taken so long for active feedback to 
appear?  If every router simply added a hint to the packet as to the max 
bandwidth it can offer then we would appear to be able to make massively 
better decisions on window sizes.  Furthermore routers have the ability 
to put backpressure on classes of traffic as appropriate.  I guess the 
speed at which ECN has been adopted answers the question of why nothing 
more exotic has appeared?

>> But for all we know this side discussion about initial CWND settings
>> could have nothing to do with the issue being reported at the start of
>> this thread. :-)
>>      

Actually the original question was mine and it was literally - can I 
adjust the initial cwnd for users of my very specific satellite network 
which has a high RTT.  I believe Stephen Hemminger has been kind enough 
to recently add the facility to experiment with this to the ip utility 
and so I am now in a position to go do some testing - thanks Stephen


Cheers

Ed W

^ permalink raw reply

* RE: [REGRESSION] e1000e stopped working [MANUALLY BISECTED]
From: Tantilov, Emil S @ 2010-07-14 22:56 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Kirsher, Jeffrey T, netdev@vger.kernel.org, Allan, Bruce W,
	Pieper, Jeffrey E
In-Reply-To: <1278981483.23017.4.camel@localhost.localdomain>

Maxim Levitsky wrote:
> On Mon, 2010-07-12 at 15:23 -0600, Tantilov, Emil S wrote:
>> Maxim Levitsky wrote:
>>> On Mon, 2010-07-05 at 12:58 +0300, Maxim Levitsky wrote:
>>>> On Mon, 2010-07-05 at 01:13 -0700, Jeff Kirsher wrote:
>>>>> On Sun, Jul 4, 2010 at 15:48, Maxim Levitsky
>>>>> <maximlevitsky@gmail.com> wrote:
>>>>>> Did few guesses, and now I see that reverting the below commit
>>>>>> fixes the problem. 
>>>>>> 
>>>>>> "e1000e: Fix/cleanup PHY reset code for ICHx/PCHx"
>>>>>> e98cac447cc1cc418dff1d610a5c79c4f2bdec7f.
>>>>>> 
>>>>>> 
>>>>>> Best regards,
>>>>>>        Maxim Levitsky
>>>>>> 
>>>>>> --
>>>>> 
>>>>> Can you give us till Tuesday to respond?  I know that there are
>>>>> some additional e1000e patches in my queue, which may resolve the
>>>>> issue, but this weekend the power is down to do some
>>>>> infrastructure upgrades which prevents us from doing any
>>>>> investigation.debugging until Tuesday. 
>>>>> 
>>>> 
>>>> Sure.
>>>> 
>>>> Best regards,
>>>> 	Maxim Levitsky
>>>> 
>>> 
>>> Updates?
>> 
>> We are working on reproducing the issue. So far we have not seen the
>> problem when testing with net-next. 
>> 
>> I asked in previous email about some additional info from ethtool
>> (-d, -e, -S) and kernel config. That would help us to narrow it
>> down.  
>> 
>> Thanks,
>> Emil
> I did send -e and -d output.

Sorry, looks like I lost the email with the attachements. 

Could you provide the output of dmesg after the failure occurs?
 
> Since you probably want -S output during failure, I need to recompile
> kernel for that. I will do that soon.
> 
> 
> One question, in two weeks I hope 2.6.35 won't be released?
> If so, I will have enough free time then to narrow down this issue.
> 
> Other solution, is to revert this commit.
> (I have never seen this problem with it reverted).

We have been running reboot tests on 2 separate systems with recent net-next kernels 
using your config and so far no luck in reproducing this issue. 

What is the make model of your system (or MB)?

Thanks,
Emil

^ permalink raw reply

* Re: Raise initial congestion window size / speedup slow start?
From: Hagen Paul Pfeifer @ 2010-07-14 23:01 UTC (permalink / raw)
  To: Ed W; +Cc: David Miller, rick.jones2, davidsen, linux-kernel, netdev
In-Reply-To: <4C3E3F92.2090506@wildgooses.com>

* Ed W | 2010-07-14 23:52:02 [+0100]:

>Out of curiousity, why has it taken so long for active feedback to
>appear?  If every router simply added a hint to the packet as to the
>max bandwidth it can offer then we would appear to be able to make
>massively better decisions on window sizes.  Furthermore routers have
>the ability to put backpressure on classes of traffic as appropriate.
>I guess the speed at which ECN has been adopted answers the question
>of why nothing more exotic has appeared?

It is quite late here so I will quickly write two sentence about ECN: one
month ago Lars Eggers posted a link at the tcpm maillinglist where google (not
really sure if it was google) analysed the employment of ECN - the usage was
really low. Search the PDF, it is quite interesting one.

Hagen



^ permalink raw reply

* Re: Raise initial congestion window size / speedup slow start?
From: Ed W @ 2010-07-14 23:01 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: Rick Jones, David Miller, davidsen, linux-kernel, netdev
In-Reply-To: <20100714223633.GJ6682@nuttenaction>


> Do you cite "An Argument for Increasing TCP's Initial Congestion Window"?
> People at google stated that a CWND of 10 seems to be fair in their
> measurements. 10 because the test setup was equipped with a reasonable large
> link capacity? Do they analyse their modification in environments with a small
> BDP (e.g. multihop MANET setup, ...)? I am curious, but We will see what
> happens if TCPM adopts this.
>    

Well, I personally would shoot for starting from the position of 
assuming better than zero knowledge about our link and incorporating 
that into the initial cwnd estimate...

We know something about the RTT from the syn/ack times, speed of the 
local link and quickly we will learn about median window sizes to other 
destinations, plus additionally the kernel has some knowledge of other 
connections currently in progress.  With all that information perhaps we 
can make a more informed option than just a hard coded magic number? (Oh 
and lets make the option pluggable so that we can soon have 10 different 
kernel options...)

Seems like there is evidence that networks are starting to cluster into groups that would benefit from a range of cwnd options (higher/lower) - perhaps there is some way to choose a reasonable heuristic to cluster these and choose a better starting option?

Cheers

Ed W



^ permalink raw reply

* Re: Raise initial congestion window size / speedup slow start?
From: Ed W @ 2010-07-14 23:05 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: David Miller, rick.jones2, davidsen, linux-kernel, netdev
In-Reply-To: <20100714230100.GL6682@nuttenaction>

On 15/07/2010 00:01, Hagen Paul Pfeifer wrote:
> It is quite late here so I will quickly write two sentence about ECN: one
> month ago Lars Eggers posted a link at the tcpm maillinglist where google (not
> really sure if it was google) analysed the employment of ECN - the usage was
> really low. Search the PDF, it is quite interesting one.
>    

I would speculate that this is because there is a big warning on ECN 
saying that it may cause you to loose customers who can't connect to 
you... Businesses are driven by needing to support the most common case, 
not the most optimal (witness the pain of html development and needing 
to consider IE6...)

What would be more useful is for google to survey how many devices are 
unable to interoperate with ECN and if that number turned out to be 
extremely low, and this fact were advertised, then I suspect we might 
see a mass increase in it's deployment?  I know I have it turned off on 
all my servers because I worry more about loosing one customer than 
improving the experience for all customers...

Cheers

Ed W

^ permalink raw reply

* Re: [PATCH repost] sched: export sched_set/getaffinity to modules
From: Sridhar Samudrala @ 2010-07-14 23:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Oleg Nesterov, Peter Zijlstra, Tejun Heo, Ingo Molnar, netdev,
	lkml, kvm@vger.kernel.org, Andrew Morton, Dmitri Vorobiev,
	Jiri Kosina, Thomas Gleixner, Andi Kleen
In-Reply-To: <20100713110939.GA3446@redhat.com>

On Tue, 2010-07-13 at 14:09 +0300, Michael S. Tsirkin wrote: 
> On Mon, Jul 12, 2010 at 11:59:08PM -0700, Sridhar Samudrala wrote:
> > On 7/4/2010 2:00 AM, Michael S. Tsirkin wrote:
> > >On Fri, Jul 02, 2010 at 11:06:37PM +0200, Oleg Nesterov wrote:
> > >>On 07/02, Peter Zijlstra wrote:
> > >>>On Fri, 2010-07-02 at 11:01 -0700, Sridhar Samudrala wrote:
> > >>>>  Does  it (Tejun's kthread_clone() patch) also  inherit the
> > >>>>cgroup of the caller?
> > >>>Of course, its a simple do_fork() which inherits everything just as you
> > >>>would expect from a similar sys_clone()/sys_fork() call.
> > >>Yes. And I'm afraid it can inherit more than we want. IIUC, this is called
> > >>from ioctl(), right?
> > >>
> > >>Then the new thread becomes the natural child of the caller, and it shares
> > >>->mm with the parent. And files, dup_fd() without CLONE_FS.
> > >>
> > >>Signals. Say, if you send SIGKILL to this new thread, it can't sleep in
> > >>TASK_INTERRUPTIBLE or KILLABLE after that. And this SIGKILL can be sent
> > >>just because the parent gets SIGQUIT or abother coredumpable signal.
> > >>Or the new thread can recieve SIGSTOP via ^Z.
> > >>
> > >>Perhaps this is OK, I do not know. Just to remind that kernel_thread()
> > >>is merely clone(CLONE_VM).
> > >>
> > >>Oleg.
> > >
> > >Right. Doing this might break things like flush.  The signal and exit
> > >behaviour needs to be examined carefully. I am also unsure whether
> > >using such threads might be more expensive than inheriting kthreadd.
> > >
> > Should we just leave it to the userspace to set the cgroup/cpumask
> > after qemu starts the guest and
> > the vhost threads?
> > 
> > Thanks
> > Sridhar
> 
> Yes but we can't trust userspace to do this. It's important
> to do it on thread creation: if we don't, malicious userspace
> can create large amount of work exceeding the cgroup limits.
> 
> And the same applies so the affinity: if the qemu process
> is limited to a set of CPUs, it's important to make
> the kernel thread that does work our behalf limited to the same
> set of CPUs.
> 
> This is not unique to vhost, it's just that virt scenarious are affected
> by this more: people seem to run untrusted applications and expect the
> damage to be contained.

OK. So we want to create a thread that is a child of kthreadd, but inherits the cgroup/cpumask
from the caller. How about an exported kthread function kthread_create_in_current_cg() 
that does this?

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index aabc8a1..e0616f0 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -9,6 +9,9 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
 				   const char namefmt[], ...)
 	__attribute__((format(printf, 3, 4)));
 
+struct task_struct *kthread_create_in_current_cg(int (*threadfn)(void *data),
+						 void *data, char *name);
+
 /**
  * kthread_run - create and wake a thread.
  * @threadfn: the function to run until signal_pending(current).
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 83911c7..ea4e737 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -15,6 +15,7 @@
 #include <linux/module.h>
 #include <linux/mutex.h>
 #include <trace/events/sched.h>
+#include <linux/cgroup.h>
 
 static DEFINE_SPINLOCK(kthread_create_lock);
 static LIST_HEAD(kthread_create_list);
@@ -149,6 +150,42 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
 }
 EXPORT_SYMBOL(kthread_create);
 
+struct task_struct *kthread_create_in_current_cg(int (*threadfn)(void *data),
+						 void *data, char *name)
+{
+	struct task_struct *worker;
+	cpumask_var_t mask;
+	int ret = -ENOMEM;
+
+	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+		goto out_free_mask;
+
+	worker = kthread_create(threadfn, data, "%s-%d", name, current->pid);
+	if (IS_ERR(worker))
+		goto out_free_mask;
+
+	ret = sched_getaffinity(current->pid, mask);
+	if (ret)
+		goto out_stop_worker;
+
+	ret = sched_setaffinity(worker->pid, mask);
+	if (ret)
+		goto out_stop_worker;
+
+	ret = cgroup_attach_task_current_cg(worker);
+	if (ret)
+		goto out_stop_worker;
+
+	return worker;
+
+out_stop_worker:
+	kthread_stop(worker);
+out_free_mask:
+	free_cpumask_var(mask);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(kthread_create_in_current_cg);
+
 /**
  * kthread_bind - bind a just-created kthread to a cpu.
  * @p: thread created by kthread_create().


Thanks
Sridhar


^ permalink raw reply related

* RE: [REGRESSION] e1000e stopped working [MANUALLY BISECTED]
From: Maxim Levitsky @ 2010-07-14 23:33 UTC (permalink / raw)
  To: Tantilov, Emil S
  Cc: Kirsher, Jeffrey T, netdev@vger.kernel.org, Allan, Bruce W,
	Pieper, Jeffrey E
In-Reply-To: <EA929A9653AAE14F841771FB1DE5A1365FF4B05AF9@rrsmsx501.amr.corp.intel.com>

On Wed, 2010-07-14 at 16:56 -0600, Tantilov, Emil S wrote:
> Maxim Levitsky wrote:
> > On Mon, 2010-07-12 at 15:23 -0600, Tantilov, Emil S wrote:
> >> Maxim Levitsky wrote:
> >>> On Mon, 2010-07-05 at 12:58 +0300, Maxim Levitsky wrote:
> >>>> On Mon, 2010-07-05 at 01:13 -0700, Jeff Kirsher wrote:
> >>>>> On Sun, Jul 4, 2010 at 15:48, Maxim Levitsky
> >>>>> <maximlevitsky@gmail.com> wrote:
> >>>>>> Did few guesses, and now I see that reverting the below commit
> >>>>>> fixes the problem. 
> >>>>>> 
> >>>>>> "e1000e: Fix/cleanup PHY reset code for ICHx/PCHx"
> >>>>>> e98cac447cc1cc418dff1d610a5c79c4f2bdec7f.
> >>>>>> 
> >>>>>> 
> >>>>>> Best regards,
> >>>>>>        Maxim Levitsky
> >>>>>> 
> >>>>>> --
> >>>>> 
> >>>>> Can you give us till Tuesday to respond?  I know that there are
> >>>>> some additional e1000e patches in my queue, which may resolve the
> >>>>> issue, but this weekend the power is down to do some
> >>>>> infrastructure upgrades which prevents us from doing any
> >>>>> investigation.debugging until Tuesday. 
> >>>>> 
> >>>> 
> >>>> Sure.
> >>>> 
> >>>> Best regards,
> >>>> 	Maxim Levitsky
> >>>> 
> >>> 
> >>> Updates?
> >> 
> >> We are working on reproducing the issue. So far we have not seen the
> >> problem when testing with net-next. 
> >> 
> >> I asked in previous email about some additional info from ethtool
> >> (-d, -e, -S) and kernel config. That would help us to narrow it
> >> down.  
> >> 
> >> Thanks,
> >> Emil
> > I did send -e and -d output.
> 
> Sorry, looks like I lost the email with the attachements. 
> 
> Could you provide the output of dmesg after the failure occurs?
>  
> > Since you probably want -S output during failure, I need to recompile
> > kernel for that. I will do that soon.
> > 
> > 
> > One question, in two weeks I hope 2.6.35 won't be released?
> > If so, I will have enough free time then to narrow down this issue.
> > 
> > Other solution, is to revert this commit.
> > (I have never seen this problem with it reverted).
> 
> We have been running reboot tests on 2 separate systems with recent net-next kernels 
> using your config and so far no luck in reproducing this issue. 
> 
> What is the make model of your system (or MB)?

the motherboard is Intel DG965RY.

However, I am using vanilla kernel.
net-next might contain further fixes.

I see if net-next works here.

Best regards,
	Maxim Levitsky


^ permalink raw reply

* multiqueue, skb_get_queue_mapping() and netdev_get_tx_queue()
From: Eldon Koyle @ 2010-07-14 23:13 UTC (permalink / raw)
  To: netdev

It looks like there is a potential for an out of bounds index anywhere
skb_get_queue_mapping(skb) (which just returns skb->queue_mapping) is
used to get an index for netdev_get_tx_queue() (and probably other
places) on a device with multiple rx/tx queues.

As I understand it, skb->queue_mapping should contain rx_queue + 1,
which can be out of range for netdev_get_tx_queue (which expects a
0-based index).

Am I misunderstanding something, or should all of these occurrences be
replaced with something more like the following?

static inline u16 skb_get_queue_index(const struct sk_buff *skb)
{
        return skb->queue_mapping ? skb->queue_mapping - 1 : 0;
}

Here is how it is commonly used (which looks incorrect to me):
In net/8021q/vlan_dev.c:
	static netdev_tx_t vlan_dev_hard_start_xmit(struct sk_buff *skb,
						    struct net_device *dev)
	{
		int i = skb_get_queue_mapping(skb);
		struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
	...


And here is some other possibly pertinent code:
In include/linux/netdevice.h:
	static inline
	struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
						 unsigned int index)
	{
		return &dev->_tx[index];
	}

In net/core/dev.c:
	struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
			void (*setup)(struct net_device *), unsigned int queue_count)
	...
		tx = kcalloc(queue_count, sizeof(struct netdev_queue), GFP_KERNEL);
	...
		dev->_tx = tx;
	...

In include/linux/skbuff.h:
	static inline u16 skb_get_queue_mapping(const struct sk_buff *skb)
	{
		return skb->queue_mapping;
	}
	...
	static inline void skb_record_rx_queue(struct sk_buff *skb, u16 rx_queue)
	{
		skb->queue_mapping = rx_queue + 1;
	}

	static inline u16 skb_get_rx_queue(const struct sk_buff *skb)
	{
		return skb->queue_mapping - 1;
	}


-- 
Eldon Koyle
-- 
Politicians are the same all over.  They promise to build a bridge even
where there is no river.
		-- Nikita Khrushchev

^ permalink raw reply

* Re: [PATCH repost] sched: export sched_set/getaffinity to modules
From: Oleg Nesterov @ 2010-07-15  0:05 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: Michael S. Tsirkin, Peter Zijlstra, Tejun Heo, Ingo Molnar,
	netdev, lkml, kvm@vger.kernel.org, Andrew Morton, Dmitri Vorobiev,
	Jiri Kosina, Thomas Gleixner, Andi Kleen
In-Reply-To: <1279149996.32374.5.camel@w-sridhar.beaverton.ibm.com>

On 07/14, Sridhar Samudrala wrote:
>
> OK. So we want to create a thread that is a child of kthreadd, but inherits the cgroup/cpumask
> from the caller. How about an exported kthread function kthread_create_in_current_cg()
> that does this?

Well. I must admit, this looks a bit strange to me ;)

Instead of exporting sched_xxxaffinity() we export the new function
which calls them. And I don't think this new helper is very useful
in general. May be I am wrong...

Oleg.

^ permalink raw reply

* Re: [PATCH] wd: fix memory leak
From: David Miller @ 2010-07-15  0:53 UTC (permalink / raw)
  To: segooon; +Cc: kernel-janitors, joe, netdev
In-Reply-To: <1279020192-9484-1-git-send-email-segooon@gmail.com>

From: Kulikov Vasiliy <segooon@gmail.com>
Date: Tue, 13 Jul 2010 15:23:12 +0400

> Unmap mapped IO in wd_probe1() if register_netdev() failed.
> 
> Signed-off-by: Kulikov Vasiliy <segooon@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH NEXT 1/1] netxen: fix for kdump
From: David Miller @ 2010-07-15  0:55 UTC (permalink / raw)
  To: amit.salecha; +Cc: netdev, ameen.rahman, rajesh.borundia
In-Reply-To: <1279020822-10419-1-git-send-email-amit.salecha@qlogic.com>

From: Amit Kumar Salecha <amit.salecha@qlogic.com>
Date: Tue, 13 Jul 2010 04:33:42 -0700

> From: Rajesh Borundia <rajesh.borundia@qlogic.com>
> 
> When the crash kernel is loaded after crash, the device is in unknown state.
> So reset the device contexts prior to its creation in case of kdump,
> depending upon kernel parameter reset_devices.
> 
> Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com>

Applied, thanks.

^ permalink raw reply

* Re: [patch] net/sched: potential data corruption
From: David Miller @ 2010-07-15  0:56 UTC (permalink / raw)
  To: hadi; +Cc: error27, shemminger, netdev, kernel-janitors, matthew
In-Reply-To: <1279036694.16376.0.camel@bigi>

From: jamal <hadi@cyberus.ca>
Date: Tue, 13 Jul 2010 11:58:14 -0400

> On Tue, 2010-07-13 at 15:21 +0200, Dan Carpenter wrote:
>> The reset_policy() does:
>>         memset(d->tcfd_defdata, 0, SIMP_MAX_DATA);
>>         strlcpy(d->tcfd_defdata, defdata, SIMP_MAX_DATA);
>> 
>> In the original code, the size of d->tcfd_defdata wasn't fixed and if
>> strlen(defdata) was less than 31, reset_policy() would cause memory
>> corruption.
>> 
>> Please Note:  The original alloc_defdata() assumes defdata is 32
>> characters and a NUL terminator while reset_policy() assumes defdata is
>> 31 characters and a NUL.  This patch updates alloc_defdata() to match
>> reset_policy() (ie a shorter string).  I'm not very familiar with this
>> code so please review carefully.
>> 
>> Signed-off-by: Dan Carpenter <error27@gmail.com>
> 
> 
> Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH v2] eth16i: fix memory leak
From: David Miller @ 2010-07-15  0:57 UTC (permalink / raw)
  To: segooon; +Cc: kernel-janitors, miku, shemminger, eric.dumazet, tj, jpirko,
	netdev
In-Reply-To: <1279050563-15759-1-git-send-email-segooon@gmail.com>

From: Kulikov Vasiliy <segooon@gmail.com>
Date: Tue, 13 Jul 2010 23:49:23 +0400

> Free allocated netdev if no probe is expected.
> 
> Signed-off-by: Kulikov Vasiliy <segooon@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next-2.6] xfrm: cleanup of xfrm_input.c.
From: David Miller @ 2010-07-15  0:59 UTC (permalink / raw)
  To: ramirose; +Cc: netdev
In-Reply-To: <AANLkTik6qHgbCVwZfMGXpso4p2DTC9w6U9agdFVl1ZbN@mail.gmail.com>

From: Rami Rosen <ramirose@gmail.com>
Date: Wed, 14 Jul 2010 11:18:41 +0300

> Hi,
>  The patch removes unneeded inclusion of header files
> (linux/module.h, linux/netdevice.h, net/dst.h and net/ip.h)
>  in net/xfrm/xfrm_input.c
> 
> Regards,
> Rami Rosen
> 
> Signed-off-by: Rami Rosen <ramirose@gmail.com>

If you do this, I also want to see you add includes for things like
linux/skbuff.h since data structures such as "struct sk_buff"
are used in this file.

Otherwise, this is how we end up with obscure build failures on
some configurations and not others, either now or in the future
when a similar change is made to some header file.


^ permalink raw reply

* Re: [PATCH net-next-2.6] net/core: neighbour update Oops
From: David Miller @ 2010-07-15  1:02 UTC (permalink / raw)
  To: eric.dumazet; +Cc: rdkehn, netdev
In-Reply-To: <1279035511.2634.456.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 13 Jul 2010 17:38:31 +0200

> Le mardi 13 juillet 2010 à 08:23 -0700, Doug Kehn a écrit :
>> When configuring DMVPN (GRE + openNHRP) and a GRE remote
>> address is configured a kernel Oops is observed.  The
>> obserseved Oops is caused by a NULL header_ops pointer
>> (neigh->dev->header_ops) in neigh_update_hhs() when
>> 
>> void (*update)(struct hh_cache*, const struct net_device*, const unsigned char *)
>> = neigh->dev->header_ops->cache_update;
>> 
>> is executed.  The dev associated with the NULL header_ops is
>> the GRE interface.  This patch guards against the
>> possibility that header_ops is NULL.
>> 
>> This Oops was first observed in kernel version 2.6.26.8.
>> 
>> Signed-off-by: Doug Kehn <rdkehn@yahoo.com>
> 
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied and queued up for -stable, thanks!

^ permalink raw reply

* Re: [PATCH] bonding: fix a buffer overflow in bonding_show_queue_id.
From: David Miller @ 2010-07-15  1:25 UTC (permalink / raw)
  To: nicolas.2p.debian; +Cc: bonding-devel, andy, fubar, netdev
In-Reply-To: <1279146277-9381-1-git-send-email-nicolas.2p.debian@free.fr>

From: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Date: Thu, 15 Jul 2010 00:24:37 +0200

> The test for buffer overflow ensures we have room for 6 more bytes.
> sprintf, called with %s:%d, slave->dev->name, slave->queue_id may yield
> far more than 6 bytes.
> 
> The correct test is res > (PAGE_SIZE - IFNAMSIZ - 6) .
> 
> Signed-off-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>

Applied to net-next-2.6, thanks.

^ permalink raw reply

* Re: [PATCH] Net: ethernet: eth: fix some code style issues
From: David Miller @ 2010-07-15  1:26 UTC (permalink / raw)
  To: chihau; +Cc: eric.dumazet, opurdila, netdev, linux-devel, mitov
In-Reply-To: <1279138937-12985-1-git-send-email-chihau@gmail.com>

From: Chihau Chau <chihau@gmail.com>
Date: Wed, 14 Jul 2010 16:22:17 -0400

> @@ -180,7 +180,8 @@ __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)
>  	 *      seems to set IFF_PROMISC.
>  	 */
>  
> -	else if (1 /*dev->flags&IFF_PROMISC */ ) {
> +	/*dev->flags&IFF_PROMISC */
> +	else if (1) {

This makes the code look worse, not better.

If the commented test is in the parenthesis, it is unambiguous
which piece of code the suggested test is meant for.

^ permalink raw reply

* Re: [PATCH] Net: ethernet: pe2.c: fix EXPORT_SYMBOL macro code style issue
From: David Miller @ 2010-07-15  1:27 UTC (permalink / raw)
  To: chihau; +Cc: tj, netdev, linux-kernel
In-Reply-To: <1279139074-13040-1-git-send-email-chihau@gmail.com>

From: Chihau Chau <chihau@gmail.com>
Date: Wed, 14 Jul 2010 16:24:34 -0400

> From: Chihau Chau <chihau@gmail.com>
> 
> This patch fix a code style issuei, if a function is exported, the
> EXPORT_SYMBOL macro for it should follow immediately after the closing
> function brace line.
> 
> Signed-off-by: Chihau Chau <chihau@gmail.com>

Applied.

^ permalink raw reply

* Re: Raise initial congestion window size / speedup slow start?
From: Bill Fink @ 2010-07-15  2:52 UTC (permalink / raw)
  To: David Miller; +Cc: davidsen, lists, linux-kernel, netdev
In-Reply-To: <20100714.111553.104052157.davem@davemloft.net>

On Wed, 14 Jul 2010, David Miller wrote:

> From: Bill Davidsen <davidsen@tmr.com>
> Date: Wed, 14 Jul 2010 11:21:15 -0400
> 
> > You may have to go into /proc/sys/net/core and crank up the
> > rmem_* settings, depending on your distribution.
> 
> You should never, ever, have to touch the various networking sysctl
> values to get good performance in any normal setup.  If you do, it's a
> bug, report it so we can fix it.
> 
> I cringe every time someone says to do this, so please do me a favor
> and don't spread this further. :-)
> 
> For one thing, TCP dynamically adjusts the socket buffer sizes based
> upon the behavior of traffic on the connection.
> 
> And the TCP memory limit sysctls (not the core socket ones) are sized
> based upon available memory.  They are there to protect you from
> situations such as having so much memory dedicated to socket buffers
> that there is none left to do other things effectively.  It's a
> protective limit, rather than a setting meant to increase or improve
> performance.  So like the others, leave these alone too.

What's normal?  :-)

netem1% cat /proc/version 
Linux version 2.6.30.10-105.2.23.fc11.x86_64 (mockbuild@x86-01.phx2.fedoraproject.org) (gcc version 4.4.1 20090725 (Red Hat 4.4.1-2) (GCC) ) #1 SMP Thu Feb 11 07:06:34 UTC 2010

Linux TCP autotuning across an 80 ms RTT cross country network path:

netem1% nuttcp -T10 -i1 192.168.1.18
   14.1875 MB /   1.00 sec =  119.0115 Mbps     0 retrans
  558.0000 MB /   1.00 sec = 4680.7169 Mbps     0 retrans
  872.8750 MB /   1.00 sec = 7322.3527 Mbps     0 retrans
  869.6875 MB /   1.00 sec = 7295.5478 Mbps     0 retrans
  858.4375 MB /   1.00 sec = 7201.0165 Mbps     0 retrans
  857.3750 MB /   1.00 sec = 7192.2116 Mbps     0 retrans
  865.5625 MB /   1.00 sec = 7260.7193 Mbps     0 retrans
  872.3750 MB /   1.00 sec = 7318.2095 Mbps     0 retrans
  862.7500 MB /   1.00 sec = 7237.2571 Mbps     0 retrans
  857.6250 MB /   1.00 sec = 7194.1864 Mbps     0 retrans

 7504.2771 MB /  10.09 sec = 6236.5068 Mbps 11 %TX 25 %RX 0 retrans 80.59 msRTT

Manually specified 100 MB TCP socket buffer on the same path:

netem1% nuttcp -T10 -i1 -w100m 192.168.1.18
  106.8125 MB /   1.00 sec =  895.9598 Mbps     0 retrans
 1092.0625 MB /   1.00 sec = 9160.3254 Mbps     0 retrans
 1111.2500 MB /   1.00 sec = 9322.6424 Mbps     0 retrans
 1115.4375 MB /   1.00 sec = 9356.2569 Mbps     0 retrans
 1116.4375 MB /   1.00 sec = 9365.6937 Mbps     0 retrans
 1115.3125 MB /   1.00 sec = 9356.2749 Mbps     0 retrans
 1121.2500 MB /   1.00 sec = 9405.6233 Mbps     0 retrans
 1125.5625 MB /   1.00 sec = 9441.6949 Mbps     0 retrans
 1130.0000 MB /   1.00 sec = 9478.7479 Mbps     0 retrans
 1139.0625 MB /   1.00 sec = 9555.8559 Mbps     0 retrans

10258.5120 MB /  10.20 sec = 8440.3558 Mbps 15 %TX 40 %RX 0 retrans 80.59 msRTT

The manually selected TCP socket buffer size both ramps up
quicker and achieves a much higher steady state rate.

					-Bill

^ permalink raw reply

* Re: [PATCH] fec: use interrupt for MDIO completion indication
From: Bryan Wu @ 2010-07-15  3:31 UTC (permalink / raw)
  To: Baruch Siach
  Cc: netdev, linux-arm-kernel, Sascha Hauer, Greg Ungerer,
	Wolfram Sang
In-Reply-To: <006416d38a8e51ba8dd8631613a991528dc7976a.1278918594.git.baruch@tkos.co.il>

Baruch,

Thanks for this patch, we tested on our i.MX51 board with Ubuntu. It works fine.

Wolfram, you can pick up this, too. -;)

-Bryan

On 07/12/2010 03:12 PM, Baruch Siach wrote:
> With the move to phylib (commit e6b043d) I was seeing sporadic "MDIO write
> timeout" messages. Measure of the actual time spent showed latency times of
> more than 1600us.
> 
> This patch uses the MII event indication of the FEC hardware to detect
> completion of MDIO transactions.
> 
> Signed-off-by: Baruch Siach <baruch@tkos.co.il>
> ---
>  drivers/net/fec.c |   55 ++++++++++++++++++++++++----------------------------
>  1 files changed, 25 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/net/fec.c b/drivers/net/fec.c
> index edfff92..89f3393 100644
> --- a/drivers/net/fec.c
> +++ b/drivers/net/fec.c
> @@ -187,6 +187,7 @@ struct fec_enet_private {
>  	int	index;
>  	int	link;
>  	int	full_duplex;
> +	struct	completion mdio_done;
>  };
>  
>  static irqreturn_t fec_enet_interrupt(int irq, void * dev_id);
> @@ -205,7 +206,7 @@ static void fec_stop(struct net_device *dev);
>  #define FEC_MMFR_TA		(2 << 16)
>  #define FEC_MMFR_DATA(v)	(v & 0xffff)
>  
> -#define FEC_MII_TIMEOUT		10000
> +#define FEC_MII_TIMEOUT		1000 /* us */
>  
>  /* Transmitter timeout */
>  #define TX_TIMEOUT (2 * HZ)
> @@ -334,6 +335,11 @@ fec_enet_interrupt(int irq, void * dev_id)
>  			ret = IRQ_HANDLED;
>  			fec_enet_tx(dev);
>  		}
> +
> +		if (int_events & FEC_ENET_MII) {
> +			ret = IRQ_HANDLED;
> +			complete(&fep->mdio_done);
> +		}
>  	} while (int_events);
>  
>  	return ret;
> @@ -608,18 +614,13 @@ spin_unlock:
>  		phy_print_status(phy_dev);
>  }
>  
> -/*
> - * NOTE: a MII transaction is during around 25 us, so polling it...
> - */
>  static int fec_enet_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
>  {
>  	struct fec_enet_private *fep = bus->priv;
> -	int timeout = FEC_MII_TIMEOUT;
> +	unsigned long time_left;
>  
>  	fep->mii_timeout = 0;
> -
> -	/* clear MII end of transfer bit*/
> -	writel(FEC_ENET_MII, fep->hwp + FEC_IEVENT);
> +	init_completion(&fep->mdio_done);
>  
>  	/* start a read op */
>  	writel(FEC_MMFR_ST | FEC_MMFR_OP_READ |
> @@ -627,13 +628,12 @@ static int fec_enet_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
>  		FEC_MMFR_TA, fep->hwp + FEC_MII_DATA);
>  
>  	/* wait for end of transfer */
> -	while (!(readl(fep->hwp + FEC_IEVENT) & FEC_ENET_MII)) {
> -		cpu_relax();
> -		if (timeout-- < 0) {
> -			fep->mii_timeout = 1;
> -			printk(KERN_ERR "FEC: MDIO read timeout\n");
> -			return -ETIMEDOUT;
> -		}
> +	time_left = wait_for_completion_timeout(&fep->mdio_done,
> +			usecs_to_jiffies(FEC_MII_TIMEOUT));
> +	if (time_left == 0) {
> +		fep->mii_timeout = 1;
> +		printk(KERN_ERR "FEC: MDIO read timeout\n");
> +		return -ETIMEDOUT;
>  	}
>  
>  	/* return value */
> @@ -644,12 +644,10 @@ static int fec_enet_mdio_write(struct mii_bus *bus, int mii_id, int regnum,
>  			   u16 value)
>  {
>  	struct fec_enet_private *fep = bus->priv;
> -	int timeout = FEC_MII_TIMEOUT;
> +	unsigned long time_left;
>  
>  	fep->mii_timeout = 0;
> -
> -	/* clear MII end of transfer bit*/
> -	writel(FEC_ENET_MII, fep->hwp + FEC_IEVENT);
> +	init_completion(&fep->mdio_done);
>  
>  	/* start a read op */
>  	writel(FEC_MMFR_ST | FEC_MMFR_OP_READ |
> @@ -658,13 +656,12 @@ static int fec_enet_mdio_write(struct mii_bus *bus, int mii_id, int regnum,
>  		fep->hwp + FEC_MII_DATA);
>  
>  	/* wait for end of transfer */
> -	while (!(readl(fep->hwp + FEC_IEVENT) & FEC_ENET_MII)) {
> -		cpu_relax();
> -		if (timeout-- < 0) {
> -			fep->mii_timeout = 1;
> -			printk(KERN_ERR "FEC: MDIO write timeout\n");
> -			return -ETIMEDOUT;
> -		}
> +	time_left = wait_for_completion_timeout(&fep->mdio_done,
> +			usecs_to_jiffies(FEC_MII_TIMEOUT));
> +	if (time_left == 0) {
> +		fep->mii_timeout = 1;
> +		printk(KERN_ERR "FEC: MDIO write timeout\n");
> +		return -ETIMEDOUT;
>  	}
>  
>  	return 0;
> @@ -1222,7 +1219,8 @@ fec_restart(struct net_device *dev, int duplex)
>  	writel(0, fep->hwp + FEC_R_DES_ACTIVE);
>  
>  	/* Enable interrupts we wish to service */
> -	writel(FEC_ENET_TXF | FEC_ENET_RXF, fep->hwp + FEC_IMASK);
> +	writel(FEC_ENET_TXF | FEC_ENET_RXF | FEC_ENET_MII,
> +			fep->hwp + FEC_IMASK);
>  }
>  
>  static void
> @@ -1242,9 +1240,6 @@ fec_stop(struct net_device *dev)
>  	writel(1, fep->hwp + FEC_ECNTRL);
>  	udelay(10);
>  
> -	/* Clear outstanding MII command interrupts. */
> -	writel(FEC_ENET_MII, fep->hwp + FEC_IEVENT);
> -
>  	writel(fep->phy_speed, fep->hwp + FEC_MII_SPEED);
>  }
>  


^ permalink raw reply

* [PATCH] net: fix problem in reading sock TX queue
From: Tom Herbert @ 2010-07-15  3:48 UTC (permalink / raw)
  To: davem, netdev

Fix problem in reading the tx_queue recorded in a socket.  In
dev_pick_tx, the TX queue is read by doing a check with
sk_tx_queue_recorded on the socket, followed by a sk_tx_queue_get.
The problem is that there is not mutual exclusion across these
calls in the socket so it it is possible that the queue in the
sock can be invalidated after sk_tx_queue_recorded is called so
that sk_tx_queue get returns -1, which sets 65535 in queue_index
and thus dev_pick_tx returns 65536 which is a bogus queue and
can cause crash in dev_queue_xmit.

We fix this by only calling sk_tx_queue_get which does the proper
checks.  The interface is that sk_tx_queue_get returns the TX queue
if the sock argument is non-NULL and TX queue is recorded, else it
returns -1.  sk_tx_queue_recorded is no longer used so it can be
completely removed.

Signed-off-by: Tom Herbert <therbert@google.com>
---
diff --git a/include/net/sock.h b/include/net/sock.h
index 3100e71..a441c9c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1226,12 +1226,7 @@ static inline void sk_tx_queue_clear(struct sock *sk)
 
 static inline int sk_tx_queue_get(const struct sock *sk)
 {
-	return sk->sk_tx_queue_mapping;
-}
-
-static inline bool sk_tx_queue_recorded(const struct sock *sk)
-{
-	return (sk && sk->sk_tx_queue_mapping >= 0);
+	return sk ? sk->sk_tx_queue_mapping : -1;
 }
 
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
diff --git a/net/core/dev.c b/net/core/dev.c
index e2b9fa2..f071252 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2053,12 +2053,11 @@ static inline u16 dev_cap_txqueue(struct net_device *dev, u16 queue_index)
 static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 					struct sk_buff *skb)
 {
-	u16 queue_index;
+	int queue_index;
 	struct sock *sk = skb->sk;
 
-	if (sk_tx_queue_recorded(sk)) {
-		queue_index = sk_tx_queue_get(sk);
-	} else {
+	queue_index = sk_tx_queue_get(sk);
+	if (queue_index < 0) {
 		const struct net_device_ops *ops = dev->netdev_ops;
 
 		if (ops->ndo_select_queue) {

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox