Netdev List


* Re: [trivial PATCH resend 2/2] ixgbe: use PCI_VENDOR_ID_INTEL
From: Jeff Kirsher @ 2012-07-20  7:14 UTC (permalink / raw)
  To: Jon Mason
  Cc: trivial, netdev, linux-kernel, Jesse Brandeburg, Bruce Allan,
	Carolyn Wyborny, Don Skidmore, Greg Rose, Peter P Waskiewicz Jr,
	Alex Duyck, John Ronciak
In-Reply-To: <1342767729-17788-3-git-send-email-jdmason@kudzu.us>

[-- Attachment #1: Type: text/plain, Size: 1251 bytes --]

On Fri, 2012-07-20 at 00:02 -0700, Jon Mason wrote:
> Use PCI_VENDOR_ID_INTEL from pci_ids.h instead of creating its own
> vendor ID #define.
> 
> Signed-off-by: Jon Mason <jdmason@kudzu.us>
> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
> Cc: Bruce Allan <bruce.w.allan@intel.com>
> Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
> Cc: Don Skidmore <donald.c.skidmore@intel.com>
> Cc: Greg Rose <gregory.v.rose@intel.com>
> Cc: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
> Cc: Alex Duyck <alexander.h.duyck@intel.com>
> Cc: John Ronciak <john.ronciak@intel.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |    4 ++--
>  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c |    8 ++++----
>  drivers/net/ethernet/intel/ixgbe/ixgbe_type.h  |    3 ---
>  3 files changed, 6 insertions(+), 9 deletions(-) 

Same goes for this patch as well (not being trivial).

I already have several patches submitted for Dave against ixgbe
currently and I am not sure if there would be any issues with this patch
applying on top of the currently submitted patches.  I will verify this
applies cleanly with no issue to my current net-next tree before I send
my ACK.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* RE: [RFC] r8169 : why SG / TX checksum are default disabled
From: hayeswang @ 2012-07-20  7:14 UTC (permalink / raw)
  To: 'David Miller', eric.dumazet; +Cc: romieu, netdev
In-Reply-To: <20120718.152405.1083396282134539674.davem@davemloft.net>

 Francois Romieu <romieu@fr.zoreil.com>

> >> > A NETDEV_TX_OK return means we accepted the SKB, it doesn't look like
> >> > that's what you are doing in the skb_padto() failure path.
> >> 
> >> ?
> >> 
> >> - skb_padto fails
> >>   (original skb is implicitly freed)
> >> - skb_padto returns error status (!= 0)
> >> - rtl8169_tso_csum returns false
> >> - start_xmit returns NETDEV_TX_OK. 
> >> 
> >> I'll search the missing "!" after some sleep if that's what you are talking
> >> about. Otherwise than that, I don't get it.
> >> 
> > 
> > 
> > Yes, I believe your patch is fine.
> > 
> > In fact, many drivers don't account for the error in their stats.
> 
> My bad, I forgot that skb_padto() frees the SKB on failure.
> 

I find that the total length field of the IP header is modified if the hw
checksum is enabled. Therefore, skb_padto + hw checksum won't work, and a
software checksum is necessary.
 
Best Regards,
Hayes

^ permalink raw reply

* Re: [trivial PATCH resend 2/2] ixgbe: use PCI_VENDOR_ID_INTEL
From: Jeff Kirsher @ 2012-07-20  7:17 UTC (permalink / raw)
  To: Jon Mason
  Cc: trivial, netdev, linux-kernel, Jesse Brandeburg, Bruce Allan,
	Carolyn Wyborny, Don Skidmore, Greg Rose, Peter P Waskiewicz Jr,
	Alex Duyck, John Ronciak
In-Reply-To: <1342767729-17788-3-git-send-email-jdmason@kudzu.us>

[-- Attachment #1: Type: text/plain, Size: 1027 bytes --]

On Fri, 2012-07-20 at 00:02 -0700, Jon Mason wrote:
> Use PCI_VENDOR_ID_INTEL from pci_ids.h instead of creating its own
> vendor ID #define.
> 
> Signed-off-by: Jon Mason <jdmason@kudzu.us>
> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
> Cc: Bruce Allan <bruce.w.allan@intel.com>
> Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
> Cc: Don Skidmore <donald.c.skidmore@intel.com>
> Cc: Greg Rose <gregory.v.rose@intel.com>
> Cc: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
> Cc: Alex Duyck <alexander.h.duyck@intel.com>
> Cc: John Ronciak <john.ronciak@intel.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |    4 ++--
>  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c |    8 ++++----
>  drivers/net/ethernet/intel/ixgbe/ixgbe_type.h  |    3 ---
>  3 files changed, 6 insertions(+), 9 deletions(-) 

This applies cleanly against my current net-next tree and looks fine.

Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* [PATCH net-next] tcp: Return bool instead of int where appropriate
From: Vijay Subramanian @ 2012-07-20  7:32 UTC (permalink / raw)
  To: netdev; +Cc: davem, Vijay Subramanian

Applied to a set of static inline functions in tcp_input.c

Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
---
 net/ipv4/tcp_input.c |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e67d685..21d7f8f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2521,7 +2521,7 @@ static void tcp_cwnd_down(struct sock *sk, int flag)
 /* Nothing was retransmitted or returned timestamp is less
  * than timestamp of the first retransmission.
  */
-static inline int tcp_packet_delayed(const struct tcp_sock *tp)
+static inline bool tcp_packet_delayed(const struct tcp_sock *tp)
 {
 	return !tp->retrans_stamp ||
 		(tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
@@ -2582,7 +2582,7 @@ static void tcp_undo_cwr(struct sock *sk, const bool undo_ssthresh)
 	tp->snd_cwnd_stamp = tcp_time_stamp;
 }
 
-static inline int tcp_may_undo(const struct tcp_sock *tp)
+static inline bool tcp_may_undo(const struct tcp_sock *tp)
 {
 	return tp->undo_marker && (!tp->undo_retrans || tcp_packet_delayed(tp));
 }
@@ -3371,13 +3371,13 @@ static void tcp_ack_probe(struct sock *sk)
 	}
 }
 
-static inline int tcp_ack_is_dubious(const struct sock *sk, const int flag)
+static inline bool tcp_ack_is_dubious(const struct sock *sk, const int flag)
 {
 	return !(flag & FLAG_NOT_DUP) || (flag & FLAG_CA_ALERT) ||
 		inet_csk(sk)->icsk_ca_state != TCP_CA_Open;
 }
 
-static inline int tcp_may_raise_cwnd(const struct sock *sk, const int flag)
+static inline bool tcp_may_raise_cwnd(const struct sock *sk, const int flag)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 	return (!(flag & FLAG_ECE) || tp->snd_cwnd < tp->snd_ssthresh) &&
@@ -3387,7 +3387,7 @@ static inline int tcp_may_raise_cwnd(const struct sock *sk, const int flag)
 /* Check that window update is acceptable.
  * The function assumes that snd_una<=ack<=snd_next.
  */
-static inline int tcp_may_update_window(const struct tcp_sock *tp,
+static inline bool tcp_may_update_window(const struct tcp_sock *tp,
 					const u32 ack, const u32 ack_seq,
 					const u32 nwin)
 {
@@ -4006,7 +4006,7 @@ static int tcp_disordered_ack(const struct sock *sk, const struct sk_buff *skb)
 		(s32)(tp->rx_opt.ts_recent - tp->rx_opt.rcv_tsval) <= (inet_csk(sk)->icsk_rto * 1024) / HZ);
 }
 
-static inline int tcp_paws_discard(const struct sock *sk,
+static inline bool tcp_paws_discard(const struct sock *sk,
 				   const struct sk_buff *skb)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
@@ -4028,7 +4028,7 @@ static inline int tcp_paws_discard(const struct sock *sk,
  * (borrowed from freebsd)
  */
 
-static inline int tcp_sequence(const struct tcp_sock *tp, u32 seq, u32 end_seq)
+static inline bool tcp_sequence(const struct tcp_sock *tp, u32 seq, u32 end_seq)
 {
 	return	!before(end_seq, tp->rcv_wup) &&
 		!after(seq, tp->rcv_nxt + tcp_receive_window(tp));
@@ -5214,7 +5214,7 @@ static __sum16 __tcp_checksum_complete_user(struct sock *sk,
 	return result;
 }
 
-static inline int tcp_checksum_complete_user(struct sock *sk,
+static inline bool tcp_checksum_complete_user(struct sock *sk,
 					     struct sk_buff *skb)
 {
 	return !skb_csum_unnecessary(skb) &&
-- 
1.7.0.4

^ permalink raw reply related

* Re: [PATCH net-next 4/7] sfc: Add support for IEEE-1588 PTP
From: Richard Cochran @ 2012-07-20  7:37 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Stuart Hodgson, David Miller, netdev, linux-net-drivers,
	Andrew Jackson
In-Reply-To: <1342713051.2617.40.camel@bwh-desktop.uk.solarflarecom.com>

On Thu, Jul 19, 2012 at 04:50:51PM +0100, Ben Hutchings wrote:
> On Thu, 2012-07-19 at 16:29 +0100, Stuart Hodgson wrote:
> > On 19/07/12 15:25, Richard Cochran wrote:
> [...]
> > > I am trying to purge the whole SYS thing (only blackfin is left)
> > > because there is a much better way to go about this, namely
> > > synchronizing the system time to the PHC time via an internal PPS
> > > signal.
> > 
> > This may be possible in future. But leads us to another problem
> > where the PPS event that is generated by the PHC subsystem to the 
> > PPS subsystem is stamped with the current system_time. That may
> > be fine for a PPS signal generated from an interrupt but not when
> > the internal PPS event has implicit jitter from the handler/event_queue
> > that we have in the driver.
> [...]
> 
> We can certainly take a timestamp in the hard interrupt handler; in fact
> that's what I originally expected we would do since we have a separate
> MSI-X vector for PTP.  But even hard interrupt handling can be subject
> to substantial jitter.

What kind of jitter do you see or expect?

I did a study of synching system to PHC on a PowerPC system, where the
PPS timestamps varied from about 10 usec (on average under light load)
to over 30 usec (under heavy load).

Even so, it was easy to synchronize the system clock to within about a
microsecond under light load, with heavy load producing about an
additional 6 usec error.

Thanks,
Richard

^ permalink raw reply

* Re: [PATCH net-next] tcp: Return bool instead of int where appropriate
From: Eric Dumazet @ 2012-07-20  8:02 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: netdev, davem
In-Reply-To: <1342769538-4039-1-git-send-email-subramanian.vijay@gmail.com>

On Fri, 2012-07-20 at 00:32 -0700, Vijay Subramanian wrote:
> Applied to a set of static inline functions in tcp_input.c
> 
> Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
> ---
>  net/ipv4/tcp_input.c |   16 ++++++++--------
>  1 files changed, 8 insertions(+), 8 deletions(-)

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* [PATCH] ipv4: show pmtu in route list
From: Julian Anastasov @ 2012-07-20  9:02 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

	Override the metrics with rt_pmtu

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---

	Is this patch still useful if routing cache is removed?

 net/ipv4/route.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 9f7ffbe..d547f6f 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2909,6 +2909,7 @@ static int rt_fill_info(struct net *net,
 	struct nlmsghdr *nlh;
 	unsigned long expires = 0;
 	u32 error;
+	u32 metrics[RTAX_MAX];
 
 	nlh = nlmsg_put(skb, pid, seq, event, sizeof(*r), flags);
 	if (nlh == NULL)
@@ -2953,7 +2954,10 @@ static int rt_fill_info(struct net *net,
 	    nla_put_be32(skb, RTA_GATEWAY, rt->rt_gateway))
 		goto nla_put_failure;
 
-	if (rtnetlink_put_metrics(skb, dst_metrics_ptr(&rt->dst)) < 0)
+	memcpy(metrics, dst_metrics_ptr(&rt->dst), sizeof(metrics));
+	if (rt->rt_pmtu)
+		metrics[RTAX_MTU - 1] = rt->rt_pmtu;
+	if (rtnetlink_put_metrics(skb, metrics) < 0)
 		goto nla_put_failure;
 
 	if (rt->rt_mark &&
-- 
1.7.3.4
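
The copy-then-override pattern in the hunk above can be sketched in
userspace C. This is only an illustration: the SKETCH_* constants and
sketch_fill_metrics() are hypothetical stand-ins, not the kernel's
RTAX_* definitions or rt_fill_info(). The point is that the shared
metrics array is never written; only the local copy gets the PMTU.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins; the real RTAX_* values come from rtnetlink.h. */
enum { SKETCH_RTAX_MTU = 2, SKETCH_RTAX_MAX = 16 };

/* Fill `out` with the shared metrics, overriding the MTU slot when a
 * per-route PMTU is known. Metric N is stored at index N - 1.
 */
static void sketch_fill_metrics(uint32_t out[SKETCH_RTAX_MAX],
				const uint32_t shared[SKETCH_RTAX_MAX],
				uint32_t pmtu)
{
	memcpy(out, shared, sizeof(uint32_t) * SKETCH_RTAX_MAX);
	if (pmtu)
		out[SKETCH_RTAX_MTU - 1] = pmtu;
}
```

A zero pmtu leaves the copied MTU metric untouched, matching the
"if (rt->rt_pmtu)" guard in the patch.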

^ permalink raw reply related

* [PATCH net-next] tcp: use hash_32() in tcp_metrics
From: Eric Dumazet @ 2012-07-20  9:02 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

Fix a missing roundup_pow_of_two(), since tcpmhash_entries is not
guaranteed to be a power of two.

Use hash_32() instead of the custom hash.

tcpmhash_entries should be an unsigned int.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/netns/ipv4.h |    2 +-
 net/ipv4/tcp_metrics.c   |   25 ++++++++++---------------
 2 files changed, 11 insertions(+), 16 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index d909c7f..0ffb8e3 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -40,7 +40,7 @@ struct netns_ipv4 {
 	struct sock		**icmp_sk;
 	struct inet_peer_base	*peers;
 	struct tcpm_hash_bucket	*tcp_metrics_hash;
-	unsigned int		tcp_metrics_hash_mask;
+	unsigned int		tcp_metrics_hash_log;
 	struct netns_frags	frags;
 #ifdef CONFIG_NETFILTER
 	struct xt_table		*iptable_filter;
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 99779ae..992f1bf 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -7,6 +7,7 @@
 #include <linux/slab.h>
 #include <linux/init.h>
 #include <linux/tcp.h>
+#include <linux/hash.h>
 
 #include <net/inet_connection_sock.h>
 #include <net/net_namespace.h>
@@ -228,10 +229,8 @@ static struct tcp_metrics_block *__tcp_get_metrics_req(struct request_sock *req,
 		return NULL;
 	}
 
-	hash ^= (hash >> 24) ^ (hash >> 16) ^ (hash >> 8);
-
 	net = dev_net(dst->dev);
-	hash &= net->ipv4.tcp_metrics_hash_mask;
+	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
 
 	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
@@ -265,10 +264,8 @@ static struct tcp_metrics_block *__tcp_get_metrics_tw(struct inet_timewait_sock
 		return NULL;
 	}
 
-	hash ^= (hash >> 24) ^ (hash >> 16) ^ (hash >> 8);
-
 	net = twsk_net(tw);
-	hash &= net->ipv4.tcp_metrics_hash_mask;
+	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
 
 	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
@@ -302,10 +299,8 @@ static struct tcp_metrics_block *tcp_get_metrics(struct sock *sk,
 		return NULL;
 	}
 
-	hash ^= (hash >> 24) ^ (hash >> 16) ^ (hash >> 8);
-
 	net = dev_net(dst->dev);
-	hash &= net->ipv4.tcp_metrics_hash_mask;
+	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
 
 	tm = __tcp_get_metrics(&addr, net, hash);
 	reclaim = false;
@@ -694,7 +689,7 @@ void tcp_fastopen_cache_set(struct sock *sk, u16 mss,
 	rcu_read_unlock();
 }
 
-static unsigned long tcpmhash_entries;
+static unsigned int tcpmhash_entries;
 static int __init set_tcpmhash_entries(char *str)
 {
 	ssize_t ret;
@@ -702,7 +697,7 @@ static int __init set_tcpmhash_entries(char *str)
 	if (!str)
 		return 0;
 
-	ret = kstrtoul(str, 0, &tcpmhash_entries);
+	ret = kstrtouint(str, 0, &tcpmhash_entries);
 	if (ret)
 		return 0;
 
@@ -712,7 +707,8 @@ __setup("tcpmhash_entries=", set_tcpmhash_entries);
 
 static int __net_init tcp_net_metrics_init(struct net *net)
 {
-	int slots, size;
+	size_t size;
+	unsigned int slots;
 
 	slots = tcpmhash_entries;
 	if (!slots) {
@@ -722,14 +718,13 @@ static int __net_init tcp_net_metrics_init(struct net *net)
 			slots = 8 * 1024;
 	}
 
-	size = slots * sizeof(struct tcpm_hash_bucket);
+	net->ipv4.tcp_metrics_hash_log = order_base_2(slots);
+	size = sizeof(struct tcpm_hash_bucket) << net->ipv4.tcp_metrics_hash_log;
 
 	net->ipv4.tcp_metrics_hash = kzalloc(size, GFP_KERNEL);
 	if (!net->ipv4.tcp_metrics_hash)
 		return -ENOMEM;
 
-	net->ipv4.tcp_metrics_hash_mask = (slots - 1);
-
 	return 0;
 }
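
Why the missing roundup matters can be sketched in userspace C. The
helpers below are hypothetical analogues of order_base_2() and
hash_32(), written only for illustration; the multiplier is an assumed
constant, and the kernel's own definition lives in <linux/hash.h>.
sketch_mask_reachable() counts how many buckets "hash & (slots - 1)"
can ever select.

```c
#include <stdint.h>

/* Illustrative multiplier; the kernel's constant lives in <linux/hash.h>. */
#define SKETCH_GOLDEN_RATIO_32 0x9e370001u

/* Smallest log such that (1 << log) >= n, for n >= 1. */
static unsigned int sketch_order_base_2(unsigned int n)
{
	unsigned int log = 0;

	while (log < 31 && (1u << log) < n)
		log++;
	return log;
}

/* Multiplicative hash that keeps the top `bits` bits of the product. */
static uint32_t sketch_hash_32(uint32_t val, unsigned int bits)
{
	if (bits == 0)
		return 0;	/* a shift by 32 would be undefined */
	return (uint32_t)(val * SKETCH_GOLDEN_RATIO_32) >> (32 - bits);
}

/* Distinct results of "hash & (slots - 1)": one per subset of set mask bits. */
static unsigned int sketch_mask_reachable(unsigned int slots)
{
	unsigned int mask = slots - 1, n = 1;

	for (; mask; mask &= mask - 1)
		n <<= 1;
	return n;
}
```

With tcpmhash_entries=5000, for example, the old mask of 4999 has only
7 bits set, so at most 128 of the 5000 buckets could ever be used.
Rounding up to 2^13 = 8192 slots and indexing with a 13-bit hash avoids
that.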
 

^ permalink raw reply related

* Re: [PATCH net-next 4/7] sfc: Add support for IEEE-1588 PTP
From: Stuart Hodgson @ 2012-07-20  9:15 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Ben Hutchings, David Miller, netdev, linux-net-drivers,
	Andrew Jackson
In-Reply-To: <20120720063102.GB2330@netboy.at.omicron.at>

>>> This code looks like it is trying to find the offset between two
>>> clocks. Is there some reason why you cannot use <linux/timecompare.h>
>>> to accomplish this?
>>
>> This is what the code is doing. <linux/timecompare.h> states
>>
>> "the assumption is that reading the source
>> time is slow and involves equal time for sending the request and
>> receiving the reply"
>>
>> While in our case, even though it is slow, we cannot guarantee the second
>> assumption. The code above takes into account some of the particulars of the sfc
>> hardware and gives us good results. 
> 
> Fair enough, but then maybe a comment mentioning how timecompare is
> unsuitable would be nice to have.
> 

A comment to that effect can be added.

>>> I am trying to purge the whole SYS thing (only blackfin is left)
>>> because there is a much better way to go about this, namely
>>> synchronizing the system time to the PHC time via an internal PPS
>>> signal.
>>

Do you mean using the PPS kernel consumer to govern the system time?

> I don't understand what the issue is here. Can't you just call
> ptp_clock_event, like you already have...
> 
>>>> +static void efx_ptp_pps_worker(struct work_struct *work)
>>>> +{
>>>> +	struct efx_ptp_data *ptp =
>>>> +		container_of(work, struct efx_ptp_data, pps_work);
>>>> +	struct efx_nic *efx = ptp->channel->efx;
>>>> +	struct timespec event_gen_time;
>>>> +	struct ptp_clock_event ptp_pps_evt;
>>>> +	ktime_t gen_time_host;
>>>> +
>>>> +	if (efx_ptp_synchronize(efx, PTP_SYNC_ATTEMPTS))
>>>> +		return;
>>>> +
>>>> +	gen_time_host = ktime_sub(ptp->mc_base_time,
>>>> +				  ptp->host_base_time);
>>>> +	event_gen_time = ktime_to_timespec(gen_time_host);
>>>> +
>>>> +	ptp_pps_evt.type = PTP_CLOCK_EXTTS;
>>>> +	ptp_pps_evt.timestamp = ktime_to_ns(gen_time_host);
>>>> +	ptp_clock_event(ptp->phc_clock, &ptp_pps_evt);
>>>> +}
> 
> ... here?
> 

In order for a PPS to arrive at the kernel consumer ptp_clock_event
needs to be called with PTP_CLOCK_PPS. This then calls pps_get_ts
and stamps the event with the current system time, not the time
that was put into the event.

Using PTP_CLOCK_EXTTS the PPS is visible to userspace via a read
on the phc device and can then be used in our modified ptpd2.

>>>> +static int efx_phc_adjtime(struct ptp_clock_info *ptp, s64 delta)
>>>> +{
> 
> You can set the time here somehow by doing, T' = T + offset, and so...
> 
>>>> +}
> 
>>>> +static int efx_phc_settime(struct ptp_clock_info *ptp,
>>>> +			   const struct timespec *e_ts)
>>>> +{
>>>> +	/* We must provide this function, but we cannot actually set the time */
>>>
>>> Huh? You can adjtime, so must be able to settime, too, right?
>>>
>>> If you have enough range in the RAW timestamp in the MC firmware (like
>>> 64 bits of nanoseconds), and you allow settime, then you can spare the
>>> system time synchronization code altogether.
>>>
>>
>> You will have to elaborate further on this point.
> 
> ... why can't you also just set the time?

Our hardware can only have an offset applied to the clock. In order to set time
we need to know the time now, then work out an offset to get to the target time.
At the point that we apply this offset the clock will have moved on and not be
set to the target time. We can apply some measured average times to the offset
to get closer but with this hardware settime will not leave the NIC clock at the
desired time. 
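
The limitation can be modelled with a small userspace C sketch. It is
an assumption-laden toy, not the sfc driver: struct sketch_clock and
its latency_ns field are hypothetical, and the latency is treated as a
single fixed number rather than something measured.

```c
#include <stdint.h>

/* Hypothetical model of a clock that only supports "add an offset". */
struct sketch_clock {
	int64_t now_ns;		/* current device time */
	int64_t latency_ns;	/* time that passes between reading and adjusting */
};

static int64_t sketch_read(struct sketch_clock *clk)
{
	return clk->now_ns;
}

static void sketch_adjtime(struct sketch_clock *clk, int64_t delta_ns)
{
	clk->now_ns += delta_ns;
}

/* Emulate settime with read + adjtime; returns the residual error in ns. */
static int64_t sketch_settime(struct sketch_clock *clk, int64_t target_ns)
{
	int64_t seen = sketch_read(clk);

	clk->now_ns += clk->latency_ns;		/* the clock keeps running */
	sketch_adjtime(clk, target_ns - seen);
	return clk->now_ns - target_ns;
}
```

In this model the residual error equals whatever time elapsed between
the read and the adjustment, which is why compensating with a measured
average only gets closer to the target rather than hitting it.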

> 
> Thanks,
> Richard

^ permalink raw reply

* Re: [PATCH] net, cgroup: Fix boot failure due to iteration of uninitialized list
From: Srivatsa S. Bhat @ 2012-07-20 10:04 UTC (permalink / raw)
  To: Neil Horman
  Cc: gaofeng, eric.dumazet, davem, linux-kernel, netdev, mark.d.rustad,
	john.r.fastabend, lizefan
In-Reply-To: <20120719164407.GA2963@neilslaptop.think-freely.org>

On 07/19/2012 10:14 PM, Neil Horman wrote:
> On Thu, Jul 19, 2012 at 09:57:37PM +0530, Srivatsa S. Bhat wrote:
>> After commit ef209f15 (net: cgroup: fix access the unallocated memory in
>> netprio cgroup), boot fails with the following NULL pointer dereference:
>>
>> Initializing cgroup subsys devices
>> Initializing cgroup subsys freezer
>> Initializing cgroup subsys net_cls
>> Initializing cgroup subsys blkio
>> Initializing cgroup subsys perf_event
>> Initializing cgroup subsys net_prio
>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000698
>> IP: [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
>> PGD 0
>> Oops: 0000 [#1] SMP
>> CPU 0
>> Modules linked in:
>>
>> Pid: 0, comm: swapper/0 Not tainted 3.5.0-rc7-mandeep #1 IBM IBM System x -[7870C4Q]-/68Y8033
>> RIP: 0010:[<ffffffff8145e8d6>]  [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
>> RSP: 0000:ffffffff81a01ea8  EFLAGS: 00010213
>> RAX: 0000000000000000 RBX: ffffffffffffff10 RCX: 0000000000000000
>> RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffffffff81aa70a0
>> RBP: ffffffff81a01ed8 R08: 0000000000000000 R09: 0000000000000000
>> R10: ffff8808ff8641c0 R11: 6e697a696c616974 R12: 0000000000000001
>> R13: ffff8808ff8641c0 R14: 0000000000000000 R15: 0000000000093970
>> FS:  0000000000000000(0000) GS:ffff8808ffc00000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> CR2: 0000000000000698 CR3: 0000000001a0b000 CR4: 00000000000006b0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> Process swapper/0 (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a13420)
>> Stack:
>>  ffffffff81a01eb8 ffffffff818060ff ffffffff81d75ec8 ffffffff81aa8960
>>  ffffffff81aa8960 ffffffff81b4c2c0 ffffffff81a01ef8 ffffffff81b1cb78
>>  0000000000000018 0000000000000048 ffffffff81a01f18 ffffffff81b1ce13
>> Call Trace:
>>  [<ffffffff81b1cb78>] cgroup_init_subsys+0x83/0x169
>>  [<ffffffff81b1ce13>] cgroup_init+0x36/0x119
>>  [<ffffffff81affef7>] start_kernel+0x3ba/0x3ef
>>  [<ffffffff81aff95b>] ? kernel_init+0x27b/0x27b
>>  [<ffffffff81aff356>] x86_64_start_reservations+0x131/0x136
>>  [<ffffffff81aff45e>] x86_64_start_kernel+0x103/0x112
>> Code: 01 48 3d f8 e1 ec 81 48 8d 98 10 ff ff ff 75 1b eb 73 0f 1f 00 48 8b 83 f0 00 00 00 48 3d f8 e1 ec 81 48 8d 98 10 ff ff ff 74 5a <48> 8b 83 88 07 00 00 48 85 c0 74 de 44 3b 60 10 76 d8 44 89 e6
>> RIP  [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
>>  RSP <ffffffff81a01ea8>
>> CR2: 0000000000000698
>> ---[ end trace a7919e7f17c0a725 ]---
>> Kernel panic - not syncing: Attempted to kill the idle task!
>>
>> The code corresponds to:
>>
>> update_netdev_tables():
>>         for_each_netdev(&init_net, dev) {
>>                 map = rtnl_dereference(dev->priomap);  <---- HERE
>>
>>
>> The list head is initialized in netdev_init(), which is called much
>> later than cgrp_create(). So the problem is that we are calling
>> update_netdev_tables() way too early (in cgrp_create()), which will
>> end up traversing the not-yet-circular linked list. So at some point,
>> the dev pointer will become NULL and hence dev->priomap becomes an
>> invalid access.
>>
>> To fix this, just remove the update_netdev_tables() function entirely,
>> since it appears that write_update_netdev_table() will handle things
>> just fine.
>>
>> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
>> ---
>>
>> Requesting a thorough review of this patch, since I am not sure whether
>> removing update_netdev_tables() is perfectly OK and whether that is the
>> right thing to do.
>>
> We could do this I suppose, but this has already been fixed by
> 734b65417b24d6eea3e3d7457e1f11493890ee1d

Oh good! But don't you think that my patch looks cleaner than that fix?
(Of course, provided that my patch is correct!)

Anyway, I'm happy to see that the boot failure is fixed. But if anyone feels
that the approach of removing the update_netdev_tables() function is correct
and better, I'll be happy to provide a patch on top of the boot-fix that
went upstream.

Regards,
Srivatsa S. Bhat
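
The failure mode described in the quoted analysis above can be sketched
in userspace C. All names are hypothetical stand-ins for the kernel's
list helpers. The offsets 0xf0 and 0x788 used in the note below are
inferred from the oops (RBX = ffffffffffffff10 with RAX = 0, and
CR2 = 0x698) and should be treated as an assumption, not as the actual
struct net_device layout.

```c
#include <stddef.h>
#include <stdint.h>

struct sketch_list_head {
	struct sketch_list_head *next, *prev;
};

/* An initialized empty list is circular: the head points at itself. */
static void sketch_init_list_head(struct sketch_list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Returns the number of nodes, or -1 if the walk hits NULL, which is what
 * happens when the head was only zero-filled and never initialized.
 */
static int sketch_count_nodes(const struct sketch_list_head *head)
{
	const struct sketch_list_head *pos;
	int n = 0;

	for (pos = head->next; pos != head; pos = pos->next) {
		if (!pos)
			return -1;
		n++;
	}
	return n;
}

/* Address touched when a NULL list pointer is converted to its containing
 * object (subtract the member offset) and a field of that object is read.
 */
static uintptr_t sketch_fault_address(uintptr_t list_offset, uintptr_t field_offset)
{
	return (uintptr_t)0 - list_offset + field_offset;
}
```

With a list member at offset 0xf0 and a field at offset 0x788,
sketch_fault_address() gives 0x698, which is consistent with the
faulting address reported in the oops.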

^ permalink raw reply

* Re: [RFC] r8169 : why SG / TX checksum are default disabled
From: Francois Romieu @ 2012-07-20 10:08 UTC (permalink / raw)
  To: hayeswang; +Cc: 'David Miller', eric.dumazet, netdev
In-Reply-To: <EF06A7B3C1E6432BB660CDEF8E1D2A1B@realtek.com.tw>

hayeswang <hayeswang@realtek.com> :
[...]
> I find that the total length field of IP header would be modified if the hw
> checksum is enabled. Therefore, skb_padto + hw checksum wouldn't work.

Ok, my patch completely ignored the fact that skb_padto does not change the
length.

However skb_padto + length adjustment + hw checksum should work (at least in
theory if not in the patch below) ?

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index be4e00f..8d0cc09 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -5740,7 +5740,7 @@ err_out:
 	return -EIO;
 }
 
-static inline void rtl8169_tso_csum(struct rtl8169_private *tp,
+static inline bool rtl8169_tso_csum(struct rtl8169_private *tp,
 				    struct sk_buff *skb, u32 *opts)
 {
 	const struct rtl_tx_desc_info *info = tx_desc_info + tp->txd_version;
@@ -5753,6 +5753,13 @@ static inline void rtl8169_tso_csum(struct rtl8169_private *tp,
 	} else if (skb->ip_summed == CHECKSUM_PARTIAL) {
 		const struct iphdr *ip = ip_hdr(skb);
 
+		if (unlikely(skb->len < ETH_ZLEN &&
+		    (tp->mac_version == RTL_GIGA_MAC_VER_34))) {
+			if (skb_padto(skb, ETH_ZLEN))
+				return false;
+			skb_put(skb, ETH_ZLEN - skb->len);
+		}
+
 		if (ip->protocol == IPPROTO_TCP)
 			opts[offset] |= info->checksum.tcp;
 		else if (ip->protocol == IPPROTO_UDP)
@@ -5760,6 +5767,7 @@ static inline void rtl8169_tso_csum(struct rtl8169_private *tp,
 		else
 			WARN_ON_ONCE(1);
 	}
+	return true;
 }
 
 static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
@@ -5783,25 +5791,26 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 	if (unlikely(le32_to_cpu(txd->opts1) & DescOwn))
 		goto err_stop_0;
 
+	opts[1] = cpu_to_le32(rtl8169_tx_vlan_tag(tp, skb));
+	opts[0] = DescOwn;
+
+	if (!rtl8169_tso_csum(tp, skb, opts))
+		goto err_update_stats_0;
+
 	len = skb_headlen(skb);
 	mapping = dma_map_single(d, skb->data, len, DMA_TO_DEVICE);
 	if (unlikely(dma_mapping_error(d, mapping))) {
 		if (net_ratelimit())
 			netif_err(tp, drv, dev, "Failed to map TX DMA!\n");
-		goto err_dma_0;
+		goto err_free_skb_1;
 	}
 
 	tp->tx_skb[entry].len = len;
 	txd->addr = cpu_to_le64(mapping);
 
-	opts[1] = cpu_to_le32(rtl8169_tx_vlan_tag(tp, skb));
-	opts[0] = DescOwn;
-
-	rtl8169_tso_csum(tp, skb, opts);
-
 	frags = rtl8169_xmit_frags(tp, skb, opts);
 	if (frags < 0)
-		goto err_dma_1;
+		goto err_unmap_2;
 	else if (frags)
 		opts[0] |= FirstFrag;
 	else {
@@ -5849,10 +5858,11 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 
 	return NETDEV_TX_OK;
 
-err_dma_1:
+err_unmap_2:
 	rtl8169_unmap_tx_skb(d, tp->tx_skb + entry, txd);
-err_dma_0:
+err_free_skb_1:
 	dev_kfree_skb(skb);
+err_update_stats_0:
 	dev->stats.tx_dropped++;
 	return NETDEV_TX_OK;
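
The pad-then-adjust-length step can be sketched on a plain buffer in
userspace C. sketch_pad_frame() is a hypothetical helper, not an skb
API; it only mirrors the idea that the padding bytes are zeroed first
and the length is extended afterwards.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_ETH_ZLEN 60	/* minimum Ethernet frame length without FCS */

/* Zero-pad a frame to the minimum length and return the new length, or 0
 * if the buffer cannot hold the padding (the caller drops the frame).
 */
static size_t sketch_pad_frame(unsigned char *buf, size_t len, size_t cap)
{
	if (len >= SKETCH_ETH_ZLEN)
		return len;
	if (cap < SKETCH_ETH_ZLEN)
		return 0;
	memset(buf + len, 0, SKETCH_ETH_ZLEN - len);	/* what skb_padto does */
	return SKETCH_ETH_ZLEN;				/* what skb_put accounts for */
}
```

A 42-byte frame in a 64-byte buffer comes back with length 60 and bytes
42..59 zeroed. A frame that is already long enough is returned
unchanged.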
 

^ permalink raw reply related

* RE: [PATCH V2 resend] ipv6: fix incorrect route 'expires' value passed to userspace
From: David Laight @ 2012-07-20 10:32 UTC (permalink / raw)
  To: Li Wei, David Miller; +Cc: netdev, shemminger
In-Reply-To: <5008B794.7010904@cn.fujitsu.com>

> -	else if (rt->dst.expires - jiffies < INT_MAX)
> -		expires = rt->dst.expires - jiffies;
> +	else if ((long)rt->dst.expires - (long)jiffies > INT_MIN
> +			&& (long)rt->dst.expires - (long)jiffies < INT_MAX)
> +		expires = (long)rt->dst.expires - (long)jiffies;
>  	else
> -		expires = INT_MAX;
> +		expires = time_is_after_jiffies(rt->dst.expires) ? INT_MAX : INT_MIN;

I can't help feeling there is a better way to do this.
Maybe:
	long expires = rt->dst.expires - jiffies;
	if (expires != (int)expires)
		expires = expires > 0 ? INT_MAX : INT_MIN;
Although maybe -INT_MAX instead of INT_MIN.

	David
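
A userspace sketch of the clamping idiom suggested above, assuming long
is wider than int (as on 64-bit Linux) and two's complement conversion.
clamp_expires() and its parameters are hypothetical stand-ins for the
"rt->dst.expires - jiffies" computation, not kernel code.

```c
#include <limits.h>

/* Hypothetical stand-in for "rt->dst.expires - jiffies": both inputs are
 * free-running unsigned counters, so the difference is taken modulo 2^N
 * and then reinterpreted as signed (two's complement assumed).
 */
static int clamp_expires(unsigned long expires, unsigned long now)
{
	long delta = (long)(expires - now);

	/* If the value does not survive a round trip through int, saturate. */
	if (delta != (int)delta)
		return delta > 0 ? INT_MAX : INT_MIN;
	return (int)delta;
}
```

Where long and int have the same width the comparison is always false
and the difference is returned unchanged.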

^ permalink raw reply

* Re: [PATCH] net, cgroup: Fix boot failure due to iteration of uninitialized list
From: Neil Horman @ 2012-07-20 11:00 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: gaofeng, eric.dumazet, davem, linux-kernel, netdev, mark.d.rustad,
	john.r.fastabend, lizefan
In-Reply-To: <50092D3F.5020108@linux.vnet.ibm.com>

On Fri, Jul 20, 2012 at 03:34:47PM +0530, Srivatsa S. Bhat wrote:
> On 07/19/2012 10:14 PM, Neil Horman wrote:
> > On Thu, Jul 19, 2012 at 09:57:37PM +0530, Srivatsa S. Bhat wrote:
> >> After commit ef209f15 (net: cgroup: fix access the unallocated memory in
> >> netprio cgroup), boot fails with the following NULL pointer dereference:
> >>
> >> Initializing cgroup subsys devices
> >> Initializing cgroup subsys freezer
> >> Initializing cgroup subsys net_cls
> >> Initializing cgroup subsys blkio
> >> Initializing cgroup subsys perf_event
> >> Initializing cgroup subsys net_prio
> >> BUG: unable to handle kernel NULL pointer dereference at 0000000000000698
> >> IP: [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
> >> PGD 0
> >> Oops: 0000 [#1] SMP
> >> CPU 0
> >> Modules linked in:
> >>
> >> Pid: 0, comm: swapper/0 Not tainted 3.5.0-rc7-mandeep #1 IBM IBM System x -[7870C4Q]-/68Y8033
> >> RIP: 0010:[<ffffffff8145e8d6>]  [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
> >> RSP: 0000:ffffffff81a01ea8  EFLAGS: 00010213
> >> RAX: 0000000000000000 RBX: ffffffffffffff10 RCX: 0000000000000000
> >> RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffffffff81aa70a0
> >> RBP: ffffffff81a01ed8 R08: 0000000000000000 R09: 0000000000000000
> >> R10: ffff8808ff8641c0 R11: 6e697a696c616974 R12: 0000000000000001
> >> R13: ffff8808ff8641c0 R14: 0000000000000000 R15: 0000000000093970
> >> FS:  0000000000000000(0000) GS:ffff8808ffc00000(0000) knlGS:0000000000000000
> >> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >> CR2: 0000000000000698 CR3: 0000000001a0b000 CR4: 00000000000006b0
> >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> >> Process swapper/0 (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a13420)
> >> Stack:
> >>  ffffffff81a01eb8 ffffffff818060ff ffffffff81d75ec8 ffffffff81aa8960
> >>  ffffffff81aa8960 ffffffff81b4c2c0 ffffffff81a01ef8 ffffffff81b1cb78
> >>  0000000000000018 0000000000000048 ffffffff81a01f18 ffffffff81b1ce13
> >> Call Trace:
> >>  [<ffffffff81b1cb78>] cgroup_init_subsys+0x83/0x169
> >>  [<ffffffff81b1ce13>] cgroup_init+0x36/0x119
> >>  [<ffffffff81affef7>] start_kernel+0x3ba/0x3ef
> >>  [<ffffffff81aff95b>] ? kernel_init+0x27b/0x27b
> >>  [<ffffffff81aff356>] x86_64_start_reservations+0x131/0x136
> >>  [<ffffffff81aff45e>] x86_64_start_kernel+0x103/0x112
> >> Code: 01 48 3d f8 e1 ec 81 48 8d 98 10 ff ff ff 75 1b eb 73 0f 1f 00 48 8b 83 f0 00 00 00 48 3d f8 e1 ec 81 48 8d 98 10 ff ff ff 74 5a <48> 8b 83 88 07 00 00 48 85 c0 74 de 44 3b 60 10 76 d8 44 89 e6
> >> RIP  [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
> >>  RSP <ffffffff81a01ea8>
> >> CR2: 0000000000000698
> >> ---[ end trace a7919e7f17c0a725 ]---
> >> Kernel panic - not syncing: Attempted to kill the idle task!
> >>
> >> The code corresponds to:
> >>
> >> update_netdev_tables():
> >>         for_each_netdev(&init_net, dev) {
> >>                 map = rtnl_dereference(dev->priomap);  <---- HERE
> >>
> >>
> >> The list head is initialized in netdev_init(), which is called much
> >> later than cgrp_create(). So the problem is that we are calling
> >> update_netdev_tables() way too early (in cgrp_create()), which will
> >> end up traversing the not-yet-circular linked list. So at some point,
> >> the dev pointer will become NULL and hence dev->priomap becomes an
> >> invalid access.
> >>
> >> To fix this, just remove the update_netdev_tables() function entirely,
> >> since it appears that write_update_netdev_table() will handle things
> >> just fine.
> >>
> >> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> >> ---
> >>
> >> Requesting a thorough review of this patch, since I am not sure whether
> >> removing update_netdev_tables() is perfectly OK and whether that is the
> >> right thing to do.
> >>
> > We could do this I suppose, but this has already been fixed by
> > 734b65417b24d6eea3e3d7457e1f11493890ee1d
> 
> Oh good! But don't you think that my patch looks cleaner than that fix?
> (Of course, provided that my patch is correct!)
> 
> Anyway, I'm happy to see that the boot failure is fixed. But if anyone feels
> that the approach of removing the update_netdev_tables() function is correct
> and better, I'll be happy to provide a patch on top of the boot-fix that
> went upstream.
> 
We're almost at the end of a release.  The fix that went in has been tested and
fixes the specific problem that was reported, with almost zero likelihood of
causing other regressions. While this fix looks like it might be preferable,
this isn't the time to make a change like this without a lot more testing, as
it may cause unforeseen problems.

There's also a larger issue of initialization order that I'll be looking at in the
next few weeks.  Based on the outcome of that I may roll this change in.
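
For anyone following along, the fault address in the quoted oops can be
reproduced from the register dump alone. What follows is only a sketch: the
two offsets are read off the Code: bytes of that particular build, the helper
names are made up, and reading the 0xf0 offset as the list node inside
struct net_device is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Offsets read off the Code: bytes of the quoted oops, not taken from any
 * header. They are specific to that 3.5.0-rc7 build:
 *   48 8d 98 10 ff ff ff   lea -0xf0(%rax),%rbx   (list node -> entry)
 *   48 8b 83 88 07 00 00   mov 0x788(%rbx),%rax   (load of entry->priomap)
 */
#define NODE_OFFSET    0xf0u   /* assumed: offset of the list node in the entry */
#define PRIOMAP_OFFSET 0x788u  /* assumed: offset of priomap in the entry */

/* Entry pointer the iterator derives from a list node address. */
uint64_t entry_from_node(uint64_t node)
{
	return node - NODE_OFFSET;	/* unsigned wrap-around is intended */
}

/* Address read when the entry's priomap field is loaded. */
uint64_t priomap_load_address(uint64_t entry)
{
	return entry + PRIOMAP_OFFSET;
}

/* Replays the oops: the list head's ->next is still NULL (RAX = 0). */
void replay_quoted_oops(void)
{
	uint64_t entry = entry_from_node(0);

	assert(entry == UINT64_C(0xffffffffffffff10));		/* RBX */
	assert(priomap_load_address(entry) == UINT64_C(0x698));	/* CR2 */
}
```

So a NULL ->next in the not-yet-initialized list head is enough to explain
both RBX and CR2 in the report.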

Thanks!
Neil

> Regards,
> Srivatsa S. Bhat
> 
> 

^ permalink raw reply

* Re: [PATCH] net, cgroup: Fix boot failure due to iteration of uninitialized list
From: Srivatsa S. Bhat @ 2012-07-20 11:18 UTC (permalink / raw)
  To: Neil Horman
  Cc: gaofeng, eric.dumazet, davem, linux-kernel, netdev, mark.d.rustad,
	john.r.fastabend, lizefan
In-Reply-To: <20120720110016.GA22367@hmsreliant.think-freely.org>

On 07/20/2012 04:30 PM, Neil Horman wrote:
> On Fri, Jul 20, 2012 at 03:34:47PM +0530, Srivatsa S. Bhat wrote:
>> On 07/19/2012 10:14 PM, Neil Horman wrote:
>>> On Thu, Jul 19, 2012 at 09:57:37PM +0530, Srivatsa S. Bhat wrote:
>>>> After commit ef209f15 (net: cgroup: fix access the unallocated memory in
>>>> netprio cgroup), boot fails with the following NULL pointer dereference:
>>>>
>>>> Initializing cgroup subsys devices
>>>> Initializing cgroup subsys freezer
>>>> Initializing cgroup subsys net_cls
>>>> Initializing cgroup subsys blkio
>>>> Initializing cgroup subsys perf_event
>>>> Initializing cgroup subsys net_prio
>>>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000698
>>>> IP: [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
>>>> PGD 0
>>>> Oops: 0000 [#1] SMP
>>>> CPU 0
>>>> Modules linked in:
>>>>
>>>> Pid: 0, comm: swapper/0 Not tainted 3.5.0-rc7-mandeep #1 IBM IBM System x -[7870C4Q]-/68Y8033
>>>> RIP: 0010:[<ffffffff8145e8d6>]  [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
>>>> RSP: 0000:ffffffff81a01ea8  EFLAGS: 00010213
>>>> RAX: 0000000000000000 RBX: ffffffffffffff10 RCX: 0000000000000000
>>>> RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffffffff81aa70a0
>>>> RBP: ffffffff81a01ed8 R08: 0000000000000000 R09: 0000000000000000
>>>> R10: ffff8808ff8641c0 R11: 6e697a696c616974 R12: 0000000000000001
>>>> R13: ffff8808ff8641c0 R14: 0000000000000000 R15: 0000000000093970
>>>> FS:  0000000000000000(0000) GS:ffff8808ffc00000(0000) knlGS:0000000000000000
>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>> CR2: 0000000000000698 CR3: 0000000001a0b000 CR4: 00000000000006b0
>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>>> Process swapper/0 (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a13420)
>>>> Stack:
>>>>  ffffffff81a01eb8 ffffffff818060ff ffffffff81d75ec8 ffffffff81aa8960
>>>>  ffffffff81aa8960 ffffffff81b4c2c0 ffffffff81a01ef8 ffffffff81b1cb78
>>>>  0000000000000018 0000000000000048 ffffffff81a01f18 ffffffff81b1ce13
>>>> Call Trace:
>>>>  [<ffffffff81b1cb78>] cgroup_init_subsys+0x83/0x169
>>>>  [<ffffffff81b1ce13>] cgroup_init+0x36/0x119
>>>>  [<ffffffff81affef7>] start_kernel+0x3ba/0x3ef
>>>>  [<ffffffff81aff95b>] ? kernel_init+0x27b/0x27b
>>>>  [<ffffffff81aff356>] x86_64_start_reservations+0x131/0x136
>>>>  [<ffffffff81aff45e>] x86_64_start_kernel+0x103/0x112
>>>> Code: 01 48 3d f8 e1 ec 81 48 8d 98 10 ff ff ff 75 1b eb 73 0f 1f 00 48 8b 83 f0 00 00 00 48 3d f8 e1 ec 81 48 8d 98 10 ff ff ff 74 5a <48> 8b 83 88 07 00 00 48 85 c0 74 de 44 3b 60 10 76 d8 44 89 e6
>>>> RIP  [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
>>>>  RSP <ffffffff81a01ea8>
>>>> CR2: 0000000000000698
>>>> ---[ end trace a7919e7f17c0a725 ]---
>>>> Kernel panic - not syncing: Attempted to kill the idle task!
>>>>
>>>> The code corresponds to:
>>>>
>>>> update_netdev_tables():
>>>>         for_each_netdev(&init_net, dev) {
>>>>                 map = rtnl_dereference(dev->priomap);  <---- HERE
>>>>
>>>>
>>>> The list head is initialized in netdev_init(), which is called much
>>>> later than cgrp_create(). So the problem is that we are calling
>>>> update_netdev_tables() way too early (in cgrp_create()), which will
>>>> end up traversing the not-yet-circular linked list. So at some point,
>>>> the dev pointer will become NULL and hence dev->priomap becomes an
>>>> invalid access.
>>>>
>>>> To fix this, just remove the update_netdev_tables() function entirely,
>>>> since it appears that write_update_netdev_table() will handle things
>>>> just fine.
>>>>
>>>> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
>>>> ---
>>>>
>>>> Requesting a thorough review of this patch, since I am not sure whether
>>>> removing update_netdev_tables() is perfectly OK and whether that is the
>>>> right thing to do.
>>>>
>>> We could do this I suppose, but this has already been fixed by
>>> 734b65417b24d6eea3e3d7457e1f11493890ee1d
>>
>> Oh good! But don't you think that my patch looks cleaner than that fix?
>> (Of course, provided that my patch is correct!)
>>
>> Anyway, I'm happy to see that the boot failure is fixed. But if anyone feels
>> that the approach of removing the update_netdev_tables() function is correct
>> and better, I'll be happy to provide a patch on top of the boot-fix that
>> went upstream.
>>
> We're almost at the end of a release.  The fix that went in has been tested and
> fixes the specific problem that was reported, with almost zero likelihood of
> causing other regressions. While this fix looks like it might be preferable,
> this isn't the time to make a change like this without a lot more testing, as
> it may cause unforeseen problems.
> 

Oh definitely! I didn't mean to suggest doing these changes right away.
It can surely wait.. :)

> There's also a larger issue of initialization order that I'll be looking at in the
> next few weeks.  Based on the outcome of that I may roll this change in.
> 

Sure, thanks!

Regards,
Srivatsa S. Bhat

^ permalink raw reply

* [patch net-next 0/6] net: add team multiqueue support and do a couple of things on the way
From: Jiri Pirko @ 2012-07-20 12:28 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, shemminger, fubar, andy

This patchset collects the changes I made along the way while adding
multiqueue support to the team driver.

Jiri Pirko (6):
  net: honour netif_set_real_num_tx_queues() retval
  rtnl: allow to specify different num for rx and tx queue count
  rtnl: allow to specify number of rx and tx queues on device creation
  net: rename bond_queue_mapping to slave_dev_queue_mapping
  bond_sysfs: use real_num_tx_queues rather than params.tx_queue
  team: add multiqueue support

 drivers/net/bonding/bond_main.c  |   20 ++++++------
 drivers/net/bonding/bond_sysfs.c |    2 +-
 drivers/net/team/team.c          |   65 +++++++++++++++++++++++++++++++++++---
 include/linux/if_link.h          |    2 ++
 include/linux/if_team.h          |    8 +++++
 include/linux/netdevice.h        |    7 +++-
 include/net/rtnetlink.h          |   10 +++---
 include/net/sch_generic.h        |    2 +-
 net/core/rtnetlink.c             |   27 +++++++++++-----
 9 files changed, 114 insertions(+), 29 deletions(-)

-- 
1.7.10.4

^ permalink raw reply

* [patch net-next 1/6] net: honour netif_set_real_num_tx_queues() retval
From: Jiri Pirko @ 2012-07-20 12:28 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, shemminger, fubar, andy
In-Reply-To: <1342787331-1866-1-git-send-email-jiri@resnulli.us>

In netif_copy_real_num_queues() the return value of
netif_set_real_num_tx_queues() should be checked.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 include/linux/netdevice.h |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ab0251d..eb06e58 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2110,7 +2110,12 @@ static inline int netif_set_real_num_rx_queues(struct net_device *dev,
 static inline int netif_copy_real_num_queues(struct net_device *to_dev,
 					     const struct net_device *from_dev)
 {
-	netif_set_real_num_tx_queues(to_dev, from_dev->real_num_tx_queues);
+	int err;
+
+	err = netif_set_real_num_tx_queues(to_dev,
+					   from_dev->real_num_tx_queues);
+	if (err)
+		return err;
 #ifdef CONFIG_RPS
 	return netif_set_real_num_rx_queues(to_dev,
 					    from_dev->real_num_rx_queues);
-- 
1.7.10.4

^ permalink raw reply related

* [patch net-next 2/6] rtnl: allow to specify different num for rx and tx queue count
From: Jiri Pirko @ 2012-07-20 12:28 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, shemminger, fubar, andy
In-Reply-To: <1342787331-1866-1-git-send-email-jiri@resnulli.us>

Also drop the unused function parameters; the callbacks now return an
unsigned queue count and can no longer report an error.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 drivers/net/bonding/bond_main.c |   14 ++++++++------
 include/net/rtnetlink.h         |   10 ++++++----
 net/core/rtnetlink.c            |   16 ++++++++--------
 3 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 3960b1b..f41ddc2 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4845,17 +4845,19 @@ static int bond_validate(struct nlattr *tb[], struct nlattr *data[])
 	return 0;
 }
 
-static int bond_get_tx_queues(struct net *net, struct nlattr *tb[])
+static unsigned int bond_get_num_tx_queues(void)
 {
 	return tx_queues;
 }
 
 static struct rtnl_link_ops bond_link_ops __read_mostly = {
-	.kind		= "bond",
-	.priv_size	= sizeof(struct bonding),
-	.setup		= bond_setup,
-	.validate	= bond_validate,
-	.get_tx_queues	= bond_get_tx_queues,
+	.kind			= "bond",
+	.priv_size		= sizeof(struct bonding),
+	.setup			= bond_setup,
+	.validate		= bond_validate,
+	.get_num_tx_queues	= bond_get_num_tx_queues,
+	.get_num_rx_queues	= bond_get_num_tx_queues, /* Use the same number
+							     as for TX queues */
 };
 
 /* Create a new bond based on the specified name and bonding parameters.
diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h
index bbcfd09..6b00c4f 100644
--- a/include/net/rtnetlink.h
+++ b/include/net/rtnetlink.h
@@ -44,8 +44,10 @@ static inline int rtnl_msg_family(const struct nlmsghdr *nlh)
  *	@get_xstats_size: Function to calculate required room for dumping device
  *			  specific statistics
  *	@fill_xstats: Function to dump device specific statistics
- *	@get_tx_queues: Function to determine number of transmit queues to create when
- *		        creating a new device.
+ *	@get_num_tx_queues: Function to determine number of transmit queues
+ *			    to create when creating a new device.
+ *	@get_num_rx_queues: Function to determine number of receive queues
+ *			    to create when creating a new device.
  */
 struct rtnl_link_ops {
 	struct list_head	list;
@@ -77,8 +79,8 @@ struct rtnl_link_ops {
 	size_t			(*get_xstats_size)(const struct net_device *dev);
 	int			(*fill_xstats)(struct sk_buff *skb,
 					       const struct net_device *dev);
-	int			(*get_tx_queues)(struct net *net,
-						 struct nlattr *tb[]);
+	unsigned int		(*get_num_tx_queues)(void);
+	unsigned int		(*get_num_rx_queues)(void);
 };
 
 extern int	__rtnl_link_register(struct rtnl_link_ops *ops);
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 045db8a..db5a8ad 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1624,17 +1624,17 @@ struct net_device *rtnl_create_link(struct net *src_net, struct net *net,
 {
 	int err;
 	struct net_device *dev;
-	unsigned int num_queues = 1;
+	unsigned int num_tx_queues = 1;
+	unsigned int num_rx_queues = 1;
 
-	if (ops->get_tx_queues) {
-		err = ops->get_tx_queues(src_net, tb);
-		if (err < 0)
-			goto err;
-		num_queues = err;
-	}
+	if (ops->get_num_tx_queues)
+		num_tx_queues = ops->get_num_tx_queues();
+	if (ops->get_num_rx_queues)
+		num_rx_queues = ops->get_num_rx_queues();
 
 	err = -ENOMEM;
-	dev = alloc_netdev_mq(ops->priv_size, ifname, ops->setup, num_queues);
+	dev = alloc_netdev_mqs(ops->priv_size, ifname, ops->setup,
+			       num_tx_queues, num_rx_queues);
 	if (!dev)
 		goto err;
 
-- 
1.7.10.4

^ permalink raw reply related

* [patch net-next 3/6] rtnl: allow to specify number of rx and tx queues on device creation
From: Jiri Pirko @ 2012-07-20 12:28 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, shemminger, fubar, andy
In-Reply-To: <1342787331-1866-1-git-send-email-jiri@resnulli.us>

This patch introduces IFLA_NUM_TX_QUEUES and IFLA_NUM_RX_QUEUES, with
which userspace can set the number of rx and/or tx queues to be allocated
for a newly created netdevice.
When given, these attributes override ops->get_num_[tr]x_queues().
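
The resulting precedence can be sketched outside the kernel roughly as
follows. This is only a model: pick_num_queues(), queue_count_fn and
sixteen_queues() are made-up names, and the has_attr/attr_value pair stands
in for the tb[IFLA_NUM_*_QUEUES] lookup plus nla_get_u32().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: type of the get_num_[tr]x_queues() callbacks. */
typedef unsigned int (*queue_count_fn)(void);

unsigned int pick_num_queues(int has_attr, unsigned int attr_value,
			     queue_count_fn get_default)
{
	if (has_attr)
		return attr_value;	/* userspace value wins */
	if (get_default)
		return get_default();	/* link type's own default */
	return 1;			/* single queue otherwise */
}

/* Stand-in for a driver callback. */
unsigned int sixteen_queues(void)
{
	return 16;
}

void pick_num_queues_examples(void)
{
	assert(pick_num_queues(1, 4, sixteen_queues) == 4);
	assert(pick_num_queues(0, 0, sixteen_queues) == 16);
	assert(pick_num_queues(0, 0, NULL) == 1);
}
```

The same rule is applied independently for the tx and the rx count.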

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 include/linux/if_link.h |    2 ++
 net/core/rtnetlink.c    |   15 +++++++++++++--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index f715750..ac173bd 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -140,6 +140,8 @@ enum {
 	IFLA_EXT_MASK,		/* Extended info mask, VFs, etc */
 	IFLA_PROMISCUITY,	/* Promiscuity count: > 0 means acts PROMISC */
 #define IFLA_PROMISCUITY IFLA_PROMISCUITY
+	IFLA_NUM_TX_QUEUES,
+	IFLA_NUM_RX_QUEUES,
 	__IFLA_MAX
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index db5a8ad..5bb1ebc 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -771,6 +771,8 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(4) /* IFLA_LINK */
 	       + nla_total_size(4) /* IFLA_MASTER */
 	       + nla_total_size(4) /* IFLA_PROMISCUITY */
+	       + nla_total_size(4) /* IFLA_NUM_TX_QUEUES */
+	       + nla_total_size(4) /* IFLA_NUM_RX_QUEUES */
 	       + nla_total_size(1) /* IFLA_OPERSTATE */
 	       + nla_total_size(1) /* IFLA_LINKMODE */
 	       + nla_total_size(ext_filter_mask
@@ -889,6 +891,8 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	    nla_put_u32(skb, IFLA_MTU, dev->mtu) ||
 	    nla_put_u32(skb, IFLA_GROUP, dev->group) ||
 	    nla_put_u32(skb, IFLA_PROMISCUITY, dev->promiscuity) ||
+	    nla_put_u32(skb, IFLA_NUM_TX_QUEUES, dev->num_tx_queues) ||
+	    nla_put_u32(skb, IFLA_NUM_RX_QUEUES, dev->num_rx_queues) ||
 	    (dev->ifindex != dev->iflink &&
 	     nla_put_u32(skb, IFLA_LINK, dev->iflink)) ||
 	    (dev->master &&
@@ -1106,6 +1110,8 @@ const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_AF_SPEC]		= { .type = NLA_NESTED },
 	[IFLA_EXT_MASK]		= { .type = NLA_U32 },
 	[IFLA_PROMISCUITY]	= { .type = NLA_U32 },
+	[IFLA_NUM_TX_QUEUES]	= { .type = NLA_U32 },
+	[IFLA_NUM_RX_QUEUES]	= { .type = NLA_U32 },
 };
 EXPORT_SYMBOL(ifla_policy);
 
@@ -1627,9 +1633,14 @@ struct net_device *rtnl_create_link(struct net *src_net, struct net *net,
 	unsigned int num_tx_queues = 1;
 	unsigned int num_rx_queues = 1;
 
-	if (ops->get_num_tx_queues)
+	if (tb[IFLA_NUM_TX_QUEUES])
+		num_tx_queues = nla_get_u32(tb[IFLA_NUM_TX_QUEUES]);
+	else if (ops->get_num_tx_queues)
 		num_tx_queues = ops->get_num_tx_queues();
-	if (ops->get_num_rx_queues)
+
+	if (tb[IFLA_NUM_RX_QUEUES])
+		num_rx_queues = nla_get_u32(tb[IFLA_NUM_RX_QUEUES]);
+	else if (ops->get_num_rx_queues)
 		num_rx_queues = ops->get_num_rx_queues();
 
 	err = -ENOMEM;
-- 
1.7.10.4

^ permalink raw reply related

* [patch net-next 4/6] net: rename bond_queue_mapping to slave_dev_queue_mapping
From: Jiri Pirko @ 2012-07-20 12:28 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, shemminger, fubar, andy
In-Reply-To: <1342787331-1866-1-git-send-email-jiri@resnulli.us>

As this is going to be used not only by bonding.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 drivers/net/bonding/bond_main.c |    6 +++---
 include/net/sch_generic.h       |    2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index f41ddc2..6fae5f3 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -395,8 +395,8 @@ int bond_dev_queue_xmit(struct bonding *bond, struct sk_buff *skb,
 	skb->dev = slave_dev;
 
 	BUILD_BUG_ON(sizeof(skb->queue_mapping) !=
-		     sizeof(qdisc_skb_cb(skb)->bond_queue_mapping));
-	skb->queue_mapping = qdisc_skb_cb(skb)->bond_queue_mapping;
+		     sizeof(qdisc_skb_cb(skb)->slave_dev_queue_mapping));
+	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
 
 	if (unlikely(netpoll_tx_running(slave_dev)))
 		bond_netpoll_send_skb(bond_get_slave_by_dev(bond, slave_dev), skb);
@@ -4184,7 +4184,7 @@ static u16 bond_select_queue(struct net_device *dev, struct sk_buff *skb)
 	/*
 	 * Save the original txq to restore before passing to the driver
 	 */
-	qdisc_skb_cb(skb)->bond_queue_mapping = skb->queue_mapping;
+	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
 
 	if (unlikely(txq >= dev->real_num_tx_queues)) {
 		do {
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 9d7d54a..d9611e0 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -220,7 +220,7 @@ struct tcf_proto {
 
 struct qdisc_skb_cb {
 	unsigned int		pkt_len;
-	u16			bond_queue_mapping;
+	u16			slave_dev_queue_mapping;
 	u16			_pad;
 	unsigned char		data[20];
 };
-- 
1.7.10.4

^ permalink raw reply related

* [patch net-next 5/6] bond_sysfs: use real_num_tx_queues rather than params.tx_queue
From: Jiri Pirko @ 2012-07-20 12:28 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, shemminger, fubar, andy
In-Reply-To: <1342787331-1866-1-git-send-email-jiri@resnulli.us>

The number of tx queues can now be specified when a bond instance is
created, so it may differ from params.tx_queues. Use real_num_tx_queues
for the boundary check instead.
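
A small model of the difference, for illustration only. struct bond_model
and the qid_ok_*() helpers are hypothetical and merely mirror the two
comparisons; the numbers 16 and 4 are example values.

```c
#include <assert.h>

/* Hypothetical model of the two values involved; not the real struct bonding. */
struct bond_model {
	unsigned int param_tx_queues;		/* module-wide tx_queues parameter */
	unsigned int real_num_tx_queues;	/* what this device actually has */
};

/* Old check: compares against the module parameter. */
int qid_ok_old(const struct bond_model *b, unsigned int qid)
{
	return !(qid > b->param_tx_queues);
}

/* New check: compares against the device's own queue count. */
int qid_ok_new(const struct bond_model *b, unsigned int qid)
{
	return !(qid > b->real_num_tx_queues);
}

void qid_check_example(void)
{
	/* Parameter says 16 queues, but this bond was created with 4. */
	const struct bond_model b = { 16, 4 };

	assert(qid_ok_old(&b, 10));	/* old check lets queue id 10 through */
	assert(!qid_ok_new(&b, 10));	/* new check rejects it */
	assert(qid_ok_new(&b, 4));	/* ids up to the real count still pass */
}
```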

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 drivers/net/bonding/bond_sysfs.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 485bedb..dc15d24 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1495,7 +1495,7 @@ static ssize_t bonding_store_queue_id(struct device *d,
 	/* Check buffer length, valid ifname and queue id */
 	if (strlen(buffer) > IFNAMSIZ ||
 	    !dev_valid_name(buffer) ||
-	    qid > bond->params.tx_queues)
+	    qid > bond->dev->real_num_tx_queues)
 		goto err_no_cmd;
 
 	/* Get the pointer to that interface if it exists */
-- 
1.7.10.4

^ permalink raw reply related

* [patch net-next 6/6] team: add multiqueue support
From: Jiri Pirko @ 2012-07-20 12:28 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, shemminger, fubar, andy
In-Reply-To: <1342787331-1866-1-git-send-email-jiri@resnulli.us>

Largely copied from bonding code.
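
The queue wrap-around in team_select_queue() is the one part that may look
odd at first. In isolation it behaves like a modulo, as this standalone
sketch shows (wrap_txq() is a made-up name, and the assert on a non-zero
queue count is an assumption of the sketch, not something the driver code
checks):

```c
#include <assert.h>
#include <stdint.h>

/* Fold a recorded rx queue index into the range [0, real_num_tx_queues). */
uint16_t wrap_txq(uint16_t txq, uint16_t real_num_tx_queues)
{
	assert(real_num_tx_queues > 0);	/* the loop would never end otherwise */

	while (txq >= real_num_tx_queues)
		txq -= real_num_tx_queues;
	return txq;
}

void wrap_txq_examples(void)
{
	assert(wrap_txq(3, 16) == 3);		/* already in range */
	assert(wrap_txq(16, 16) == 0);		/* first index past the end */
	assert(wrap_txq(37, 16) == 37 % 16);	/* same result as a modulo */
}
```

Repeated subtraction avoids a division in the common case where the index is
already in range or only slightly above it.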

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 drivers/net/team/team.c |   65 +++++++++++++++++++++++++++++++++++++++++++----
 include/linux/if_team.h |    8 ++++++
 2 files changed, 68 insertions(+), 5 deletions(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index 813e131..b104c05 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -27,6 +27,7 @@
 #include <net/rtnetlink.h>
 #include <net/genetlink.h>
 #include <net/netlink.h>
+#include <net/sch_generic.h>
 #include <linux/if_team.h>
 
 #define DRV_NAME "team"
@@ -1121,6 +1122,22 @@ static const struct team_option team_options[] = {
 	},
 };
 
+static struct lock_class_key team_netdev_xmit_lock_key;
+static struct lock_class_key team_netdev_addr_lock_key;
+
+static void team_set_lockdep_class_one(struct net_device *dev,
+				       struct netdev_queue *txq,
+				       void *unused)
+{
+	lockdep_set_class(&txq->_xmit_lock, &team_netdev_xmit_lock_key);
+}
+
+static void team_set_lockdep_class(struct net_device *dev)
+{
+	lockdep_set_class(&dev->addr_list_lock, &team_netdev_addr_lock_key);
+	netdev_for_each_tx_queue(dev, team_set_lockdep_class_one, NULL);
+}
+
 static int team_init(struct net_device *dev)
 {
 	struct team *team = netdev_priv(dev);
@@ -1148,6 +1165,8 @@ static int team_init(struct net_device *dev)
 		goto err_options_register;
 	netif_carrier_off(dev);
 
+	team_set_lockdep_class(dev);
+
 	return 0;
 
 err_options_register:
@@ -1216,6 +1235,29 @@ static netdev_tx_t team_xmit(struct sk_buff *skb, struct net_device *dev)
 	return NETDEV_TX_OK;
 }
 
+static u16 team_select_queue(struct net_device *dev, struct sk_buff *skb)
+{
+	/*
+	 * This helper function exists to help dev_pick_tx get the correct
+	 * destination queue.  Using a helper function skips a call to
+	 * skb_tx_hash and will put the skbs in the queue we expect on their
+	 * way down to the team driver.
+	 */
+	u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
+
+	/*
+	 * Save the original txq to restore before passing to the driver
+	 */
+	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
+
+	if (unlikely(txq >= dev->real_num_tx_queues)) {
+		do {
+			txq -= dev->real_num_tx_queues;
+		} while (txq >= dev->real_num_tx_queues);
+	}
+	return txq;
+}
+
 static void team_change_rx_flags(struct net_device *dev, int change)
 {
 	struct team *team = netdev_priv(dev);
@@ -1469,6 +1511,7 @@ static const struct net_device_ops team_netdev_ops = {
 	.ndo_open		= team_open,
 	.ndo_stop		= team_close,
 	.ndo_start_xmit		= team_xmit,
+	.ndo_select_queue	= team_select_queue,
 	.ndo_change_rx_flags	= team_change_rx_flags,
 	.ndo_set_rx_mode	= team_set_rx_mode,
 	.ndo_set_mac_address	= team_set_mac_address,
@@ -1543,12 +1586,24 @@ static int team_validate(struct nlattr *tb[], struct nlattr *data[])
 	return 0;
 }
 
+static unsigned int team_get_num_tx_queues(void)
+{
+	return TEAM_DEFAULT_NUM_TX_QUEUES;
+}
+
+static unsigned int team_get_num_rx_queues(void)
+{
+	return TEAM_DEFAULT_NUM_RX_QUEUES;
+}
+
 static struct rtnl_link_ops team_link_ops __read_mostly = {
-	.kind		= DRV_NAME,
-	.priv_size	= sizeof(struct team),
-	.setup		= team_setup,
-	.newlink	= team_newlink,
-	.validate	= team_validate,
+	.kind			= DRV_NAME,
+	.priv_size		= sizeof(struct team),
+	.setup			= team_setup,
+	.newlink		= team_newlink,
+	.validate		= team_validate,
+	.get_num_tx_queues	= team_get_num_tx_queues,
+	.get_num_rx_queues	= team_get_num_rx_queues,
 };
 
 
diff --git a/include/linux/if_team.h b/include/linux/if_team.h
index 7fd0cde..6960fc1 100644
--- a/include/linux/if_team.h
+++ b/include/linux/if_team.h
@@ -14,6 +14,7 @@
 #ifdef __KERNEL__
 
 #include <linux/netpoll.h>
+#include <net/sch_generic.h>
 
 struct team_pcpu_stats {
 	u64			rx_packets;
@@ -98,6 +99,10 @@ static inline void team_netpoll_send_skb(struct team_port *port,
 static inline int team_dev_queue_xmit(struct team *team, struct team_port *port,
 				      struct sk_buff *skb)
 {
+	BUILD_BUG_ON(sizeof(skb->queue_mapping) !=
+		     sizeof(qdisc_skb_cb(skb)->slave_dev_queue_mapping));
+	skb_set_queue_mapping(skb, qdisc_skb_cb(skb)->slave_dev_queue_mapping);
+
 	skb->dev = port->dev;
 	if (unlikely(netpoll_tx_running(port->dev))) {
 		team_netpoll_send_skb(port, skb);
@@ -236,6 +241,9 @@ extern void team_options_unregister(struct team *team,
 extern int team_mode_register(const struct team_mode *mode);
 extern void team_mode_unregister(const struct team_mode *mode);
 
+#define TEAM_DEFAULT_NUM_TX_QUEUES 16
+#define TEAM_DEFAULT_NUM_RX_QUEUES 16
+
 #endif /* __KERNEL__ */
 
 #define TEAM_STRING_MAX_LEN 32
-- 
1.7.10.4

^ permalink raw reply related

* [patch iproute2] iplink: add support for num[tr]xqueues
From: Jiri Pirko @ 2012-07-20 12:29 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, shemminger

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 include/linux/if_link.h |    2 ++
 ip/iplink.c             |   20 ++++++++++++++++++++
 man/man8/ip-link.8.in   |   13 +++++++++++++
 3 files changed, 35 insertions(+)

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 00e5868..46f03db 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -140,6 +140,8 @@ enum {
 	IFLA_EXT_MASK,		/* Extended info mask, VFs, etc */
 	IFLA_PROMISCUITY,	/* Promiscuity count: > 0 means acts PROMISC */
 #define IFLA_PROMISCUITY IFLA_PROMISCUITY
+	IFLA_NUM_TX_QUEUES,
+	IFLA_NUM_RX_QUEUES,
 	__IFLA_MAX
 };
 
diff --git a/ip/iplink.c b/ip/iplink.c
index 679091e..0baa128 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -48,6 +48,8 @@ void iplink_usage(void)
 		fprintf(stderr, "                   [ address LLADDR ]\n");
 		fprintf(stderr, "                   [ broadcast LLADDR ]\n");
 		fprintf(stderr, "                   [ mtu MTU ]\n");
+		fprintf(stderr, "                   [ numtxqueues QUEUE_COUNT ]\n");
+		fprintf(stderr, "                   [ numrxqueues QUEUE_COUNT ]\n");
 		fprintf(stderr, "                   type TYPE [ ARGS ]\n");
 		fprintf(stderr, "       ip link delete DEV type TYPE [ ARGS ]\n");
 		fprintf(stderr, "\n");
@@ -279,6 +281,8 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
 	int mtu = -1;
 	int netns = -1;
 	int vf = -1;
+	int numtxqueues = -1;
+	int numrxqueues = -1;
 
 	*group = -1;
 	ret = argc;
@@ -445,6 +449,22 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
 				invarg("Invalid operstate\n", *argv);
 
 			addattr8(&req->n, sizeof(*req), IFLA_OPERSTATE, state);
+		} else if (strcmp(*argv, "numtxqueues") == 0) {
+			NEXT_ARG();
+			if (numtxqueues != -1)
+				duparg("numtxqueues", *argv);
+			if (get_integer(&numtxqueues, *argv, 0))
+				invarg("Invalid \"numtxqueues\" value\n", *argv);
+			addattr_l(&req->n, sizeof(*req), IFLA_NUM_TX_QUEUES,
+				  &numtxqueues, 4);
+		} else if (strcmp(*argv, "numrxqueues") == 0) {
+			NEXT_ARG();
+			if (numrxqueues != -1)
+				duparg("numrxqueues", *argv);
+			if (get_integer(&numrxqueues, *argv, 0))
+				invarg("Invalid \"numrxqueues\" value\n", *argv);
+			addattr_l(&req->n, sizeof(*req), IFLA_NUM_RX_QUEUES,
+				  &numrxqueues, 4);
 		} else {
 			if (strcmp(*argv, "dev") == 0) {
 				NEXT_ARG();
diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index 9386cc6..8a24e51 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -40,6 +40,11 @@ ip-link \- network device configuration
 .RB "[ " mtu
 .IR MTU " ]"
 .br
+.RB "[ " numtxqueues
+.IR QUEUE_COUNT " ]"
+.RB "[ " numrxqueues
+.IR QUEUE_COUNT " ]"
+.br
 .BR type " TYPE"
 .RI "[ " ARGS " ]"
 
@@ -156,6 +161,14 @@ Link types:
 - Ethernet Bridge device
 .in -8
 
+.TP
+.BI numtxqueues " QUEUE_COUNT "
+specifies the number of transmit queues for new device.
+
+.TP
+.BI numrxqueues " QUEUE_COUNT "
+specifies the number of receive queues for new device.
+
 .SS ip link delete - delete virtual link
 .I DEVICE
 specifies the virtual  device to act operate on.
-- 
1.7.10.4

^ permalink raw reply related

* Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
From: Michael S. Tsirkin @ 2012-07-20 12:33 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, habanero, mashirle, kvm, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <1341484194-8108-6-git-send-email-jasowang@redhat.com>

On Thu, Jul 05, 2012 at 06:29:54PM +0800, Jason Wang wrote:
> This patch lets the virtio_net driver negotiate the number of queues it
> wishes to use through the control virtqueue, and exports an ethtool interface
> to let the user tweak it.
> 
> As the current multiqueue virtio-net implementation has optimizations for
> per-cpu virtqueues, only two modes are supported:
> 
> - single queue pair mode
> - multiple queue pairs mode, where the number of queues matches the number of vcpus
> 
> The single queue mode is currently used by default due to a regression of
> multiqueue mode in some tests (especially in the stream test).
> 
> Since the virtio core does not support partially deleting virtqueues, during
> mode switching all the virtqueues are deleted and the driver re-creates
> the virtqueues it will use.
> 
> btw. The queue number negotiation is deferred to .ndo_open() because
> only after feature negotiation can we send the command to the control virtqueue
> (as it may also use event index).
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/net/virtio_net.c   |  171 ++++++++++++++++++++++++++++++++++---------
>  include/linux/virtio_net.h |    7 ++
>  2 files changed, 142 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 7410187..3339eeb 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -88,6 +88,7 @@ struct receive_queue {
>  
>  struct virtnet_info {
>  	u16 num_queue_pairs;		/* # of RX/TX vq pairs */
> +	u16 total_queue_pairs;
>  
>  	struct send_queue *sq[MAX_QUEUES] ____cacheline_aligned_in_smp;
>  	struct receive_queue *rq[MAX_QUEUES] ____cacheline_aligned_in_smp;
> @@ -137,6 +138,8 @@ struct padded_vnet_hdr {
>  	char padding[6];
>  };
>  
> +static const struct ethtool_ops virtnet_ethtool_ops;
> +
>  static inline int txq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
>  {
>  	int ret = virtqueue_get_queue_index(vq);
> @@ -802,22 +805,6 @@ static void virtnet_netpoll(struct net_device *dev)
>  }
>  #endif
>  
> -static int virtnet_open(struct net_device *dev)
> -{
> -	struct virtnet_info *vi = netdev_priv(dev);
> -	int i;
> -
> -	for (i = 0; i < vi->num_queue_pairs; i++) {
> -		/* Make sure we have some buffers: if oom use wq. */
> -		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
> -			queue_delayed_work(system_nrt_wq,
> -					   &vi->rq[i]->refill, 0);
> -		virtnet_napi_enable(vi->rq[i]);
> -	}
> -
> -	return 0;
> -}
> -
>  /*
>   * Send command via the control virtqueue and check status.  Commands
>   * supported by the hypervisor, as indicated by feature bits, should
> @@ -873,6 +860,43 @@ static void virtnet_ack_link_announce(struct virtnet_info *vi)
>  	rtnl_unlock();
>  }
>  
> +static int virtnet_set_queues(struct virtnet_info *vi)
> +{
> +	struct scatterlist sg;
> +	struct net_device *dev = vi->dev;
> +	sg_init_one(&sg, &vi->num_queue_pairs, sizeof(vi->num_queue_pairs));
> +
> +	if (!vi->has_cvq)
> +		return -EINVAL;
> +
> +	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MULTIQUEUE,
> +				  VIRTIO_NET_CTRL_MULTIQUEUE_QNUM, &sg, 1, 0)){
> +		dev_warn(&dev->dev, "Fail to set the number of queue pairs to"
> +			 " %d\n", vi->num_queue_pairs);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int virtnet_open(struct net_device *dev)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	int i;
> +
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		/* Make sure we have some buffers: if oom use wq. */
> +		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
> +			queue_delayed_work(system_nrt_wq,
> +					   &vi->rq[i]->refill, 0);
> +		virtnet_napi_enable(vi->rq[i]);
> +	}
> +
> +	virtnet_set_queues(vi);
> +
> +	return 0;
> +}
> +
>  static int virtnet_close(struct net_device *dev)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
> @@ -1013,12 +1037,6 @@ static void virtnet_get_drvinfo(struct net_device *dev,
>  
>  }
>  
> -static const struct ethtool_ops virtnet_ethtool_ops = {
> -	.get_drvinfo = virtnet_get_drvinfo,
> -	.get_link = ethtool_op_get_link,
> -	.get_ringparam = virtnet_get_ringparam,
> -};
> -
>  #define MIN_MTU 68
>  #define MAX_MTU 65535
>  
> @@ -1235,7 +1253,7 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
>  
>  err:
>  	if (ret && names)
> -		for (i = 0; i < vi->num_queue_pairs * 2; i++)
> +		for (i = 0; i < total_vqs * 2; i++)
>  			kfree(names[i]);
>  
>  	kfree(names);
> @@ -1373,7 +1391,6 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	mutex_init(&vi->config_lock);
>  	vi->config_enable = true;
>  	INIT_WORK(&vi->config_work, virtnet_config_changed_work);
> -	vi->num_queue_pairs = num_queue_pairs;
>  
>  	/* If we can receive ANY GSO packets, we must allocate large ones. */
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
> @@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>  		vi->has_cvq = true;
>  
> +	/* Use single tx/rx queue pair as default */
> +	vi->num_queue_pairs = 1;
> +	vi->total_queue_pairs = num_queue_pairs;
> +
>  	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
>  	err = virtnet_setup_vqs(vi);
>  	if (err)
> @@ -1396,6 +1417,9 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
>  		dev->features |= NETIF_F_HW_VLAN_FILTER;
>  
> +	netif_set_real_num_tx_queues(dev, 1);
> +	netif_set_real_num_rx_queues(dev, 1);
> +
>  	err = register_netdev(dev);
>  	if (err) {
>  		pr_debug("virtio_net: registering device failed\n");
> @@ -1403,7 +1427,7 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	}
>  
>  	/* Last of all, set up some receive buffers. */
> -	for (i = 0; i < num_queue_pairs; i++) {
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
>  		try_fill_recv(vi->rq[i], GFP_KERNEL);
>  
>  		/* If we didn't even get one input buffer, we're useless. */
> @@ -1474,10 +1498,8 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
>  	free_netdev(vi->dev);
>  }
>  
> -#ifdef CONFIG_PM
> -static int virtnet_freeze(struct virtio_device *vdev)
> +static void virtnet_stop(struct virtnet_info *vi)
>  {
> -	struct virtnet_info *vi = vdev->priv;
>  	int i;
>  
>  	/* Prevent config work handler from accessing the device */
> @@ -1493,17 +1515,10 @@ static int virtnet_freeze(struct virtio_device *vdev)
>  		for (i = 0; i < vi->num_queue_pairs; i++)
>  			napi_disable(&vi->rq[i]->napi);
>  
> -
> -	remove_vq_common(vi);
> -
> -	flush_work(&vi->config_work);
> -
> -	return 0;
>  }
>  
> -static int virtnet_restore(struct virtio_device *vdev)
> +static int virtnet_start(struct virtnet_info *vi)
>  {
> -	struct virtnet_info *vi = vdev->priv;
>  	int err, i;
>  
>  	err = virtnet_setup_vqs(vi);
> @@ -1527,6 +1542,29 @@ static int virtnet_restore(struct virtio_device *vdev)
>  
>  	return 0;
>  }
> +
> +#ifdef CONFIG_PM
> +static int virtnet_freeze(struct virtio_device *vdev)
> +{
> +	struct virtnet_info *vi = vdev->priv;
> +
> +	virtnet_stop(vi);
> +
> +	remove_vq_common(vi);
> +
> +	flush_work(&vi->config_work);
> +
> +	return 0;
> +}
> +
> +static int virtnet_restore(struct virtio_device *vdev)
> +{
> +	struct virtnet_info *vi = vdev->priv;
> +
> +	virtnet_start(vi);
> +
> +	return 0;
> +}
>  #endif
>  
>  static struct virtio_device_id id_table[] = {
> @@ -1560,6 +1598,67 @@ static struct virtio_driver virtio_net_driver = {
>  #endif
>  };
>  
> +static int virtnet_set_channels(struct net_device *dev,
> +				struct ethtool_channels *channels)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	u16 queues = channels->rx_count;
> +	unsigned status = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
> +
> +	if (channels->rx_count != channels->tx_count)
> +		return -EINVAL;
> +	/* Only two modes were support currently */

s/were support/are supported/ ?

> +	if (queues != vi->total_queue_pairs && queues != 1)
> +		return -EINVAL;

So userspace has to get the queue number exactly right: only 1 or
total_queue_pairs is accepted. How does it know what the valid
values are?
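
To make the question concrete, here is a minimal standalone sketch (plain
userspace C, not driver code) of the check as I read it, plus how the
accepted values could be derived from the max_rx that get_channels
reports. The function names are hypothetical, and "total" stands in for
vi->total_queue_pairs; this only illustrates the rule and is not a
proposed implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the rx_count/tx_count check in the quoted
 * virtnet_set_channels(); "total" stands in for vi->total_queue_pairs. */
static bool queue_count_accepted(unsigned rx_count, unsigned tx_count,
				 unsigned total)
{
	/* rx and tx counts must match. */
	if (rx_count != tx_count)
		return false;
	/* Only single-queue mode or all queue pairs are accepted. */
	return rx_count == total || rx_count == 1;
}

/* Derive the accepted values from what get_channels reports as max_rx.
 * Fills out[] and returns how many entries are valid (at most 2). */
static size_t accepted_queue_counts(unsigned max_rx, unsigned out[2])
{
	size_t n = 0;

	if (max_rx == 0)
		return 0;
	out[n++] = 1;
	if (max_rx != 1)
		out[n++] = max_rx;
	return n;
}
```

If that reading is right, a tool could call get_channels first and offer
only 1 and max_rx, but nothing in the interface tells it that the
intermediate values are rejected.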

> +	if (!vi->has_cvq)
> +		return -EINVAL;
> +
> +	virtnet_stop(vi);
> +
> +	netif_set_real_num_tx_queues(dev, queues);
> +	netif_set_real_num_rx_queues(dev, queues);
> +
> +	remove_vq_common(vi);
> +	flush_work(&vi->config_work);
> +
> +	vi->num_queue_pairs = queues;
> +	virtnet_start(vi);
> +
> +	vi->vdev->config->finalize_features(vi->vdev);
> +
> +	if (virtnet_set_queues(vi))
> +		status |= VIRTIO_CONFIG_S_FAILED;
> +	else
> +		status |= VIRTIO_CONFIG_S_DRIVER_OK;
> +
> +	vi->vdev->config->set_status(vi->vdev, status);
> +

Why do we need to tweak the device status like that?
Could we just roll the changes back on error instead?
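
A rough sketch of the rollback I have in mind, written as standalone C
so it can be read outside the driver. Everything here is hypothetical:
struct fake_vi models only the queue-pair count, and the apply callback
plays the role of virtnet_set_queues(). Unlike the quoted code, it
returns the error to the caller instead of always returning 0.

```c
#include <errno.h>

/* Hypothetical stand-in for the driver state; only the field that
 * matters for the rollback is modelled. */
struct fake_vi {
	unsigned num_queue_pairs;
};

/* Hypothetical callback playing the role of virtnet_set_queues():
 * returns 0 on success, a negative errno on failure. */
typedef int (*apply_fn)(struct fake_vi *vi);

/* Example stubs so the sketch can be exercised. */
static int apply_ok(struct fake_vi *vi)
{
	(void)vi;
	return 0;
}

static int apply_fail(struct fake_vi *vi)
{
	(void)vi;
	return -EINVAL;
}

static int set_queue_pairs(struct fake_vi *vi, unsigned queues, apply_fn apply)
{
	unsigned old = vi->num_queue_pairs;
	int err;

	vi->num_queue_pairs = queues;
	err = apply(vi);
	if (err) {
		/* Restore the previous, known-good count and report the error. */
		vi->num_queue_pairs = old;
		return err;
	}
	return 0;
}
```

In the real driver the rollback would also have to restore the vqs and
the real_num_{tx,rx}_queues settings, which this sketch leaves out.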

> +	return 0;
> +}
> +
> +static void virtnet_get_channels(struct net_device *dev,
> +				 struct ethtool_channels *channels)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +
> +	channels->max_rx = vi->total_queue_pairs;
> +	channels->max_tx = vi->total_queue_pairs;
> +	channels->max_other = 0;
> +	channels->max_combined = 0;
> +	channels->rx_count = vi->num_queue_pairs;
> +	channels->tx_count = vi->num_queue_pairs;
> +	channels->other_count = 0;
> +	channels->combined_count = 0;
> +}
> +
> +static const struct ethtool_ops virtnet_ethtool_ops = {
> +	.get_drvinfo = virtnet_get_drvinfo,
> +	.get_link = ethtool_op_get_link,
> +	.get_ringparam = virtnet_get_ringparam,
> +	.set_channels = virtnet_set_channels,
> +	.get_channels = virtnet_get_channels,
> +};
> +
>  static int __init init(void)
>  {
>  	return register_virtio_driver(&virtio_net_driver);
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index 60f09ff..0d21e08 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -169,4 +169,11 @@ struct virtio_net_ctrl_mac {
>  #define VIRTIO_NET_CTRL_ANNOUNCE       3
>   #define VIRTIO_NET_CTRL_ANNOUNCE_ACK         0
>  
> +/*
> + * Control multiqueue
> + *
> + */
> +#define VIRTIO_NET_CTRL_MULTIQUEUE       4
> + #define VIRTIO_NET_CTRL_MULTIQUEUE_QNUM         0
> +
>  #endif /* _LINUX_VIRTIO_NET_H */
> -- 
> 1.7.1
