Netdev List


* [net-next 1/9] e1000e: suggest a possible workaround to a device hang on 82577/8
From: Jeff Kirsher @ 2012-05-03  9:56 UTC (permalink / raw)
  To: davem; +Cc: Bruce Allan, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1336038992-3144-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Bruce Allan <bruce.w.allan@intel.com>

There is a known issue in the 82577 and 82578 device that can cause a hang
in the device hardware during traffic stress; the current workaround in the
driver is to disable transmit flow control by default.  If the user enables
transmit flow control and the device hang occurs, provide a message in the
syslog suggesting to re-enable the workaround.

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index c0e211b..e86b524 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1084,6 +1084,10 @@ static void e1000_print_hw_hang(struct work_struct *work)
 	      phy_1000t_status,
 	      phy_ext_status,
 	      pci_status);
+
+	/* Suggest workaround for known h/w issue */
+	if ((hw->mac.type == e1000_pchlan) && (er32(CTRL) & E1000_CTRL_TFCE))
+		e_err("Try turning off Tx pause (flow control) via ethtool\n");
 }
 
 /**
-- 
1.7.7.6
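
The check added above is a plain bit test on the CTRL register. A minimal user-space sketch follows, assuming the Tx flow control enable bit is 0x10000000 (the patch only names E1000_CTRL_TFCE) and using a hypothetical enum in place of hw->mac.type; none of the sketch_* names come from the driver:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value of the Transmit Flow Control Enable bit in CTRL;
 * the patch refers to it only by the name E1000_CTRL_TFCE. */
#define SKETCH_CTRL_TFCE 0x10000000u

/* Hypothetical stand-in for the driver's MAC type enumeration. */
enum sketch_mac_type { SKETCH_MAC_OTHER, SKETCH_MAC_PCHLAN };

/* Returns true when the hang report should also suggest turning
 * Tx pause off: the MAC is a pchlan part and Tx flow control is on. */
static bool sketch_should_suggest_tx_pause_off(enum sketch_mac_type mac,
					       uint32_t ctrl)
{
	return mac == SKETCH_MAC_PCHLAN && (ctrl & SKETCH_CTRL_TFCE) != 0;
}
```

On a running system, Tx pause is usually turned off with `ethtool -A <iface> tx off`.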

^ permalink raw reply related

* [net-next 0/9][pull request] Intel Wired LAN Driver Updates
From: Jeff Kirsher @ 2012-05-03  9:56 UTC (permalink / raw)
  To: davem; +Cc: Jeff Kirsher, netdev, gospo, sassmann

This series of patches contains updates for e1000e and ixgbevf.

The following are changes since commit af94bf6db1d58d26f1cdab145b6312ad363254a6:
  ixgbe: Fix use after free on module remove
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next master

Bruce Allan (2):
  e1000e: suggest a possible workaround to a device hang on 82577/8
  e1000e: cleanup long [read|write]_reg_locked PHY ops function
    pointers

Chris Boot (2):
  e1000e: Disable ASPM L1 on 82574
  e1000e: Remove special case for 82573/82574 ASPM L1 disablement

Greg Rose (3):
  ixgbevf: Add support to recognize 100mb link speed
  ixgbevf: Make sure jumbo frames are set correctly after PF reset
  ixgbevf: Update version string

Matthew Vick (2):
  e1000e: Resolve intermittent negotiation issue on 82574/82583.
  e1000e: Driver workaround for IPv6 Header Extension Erratum.

 drivers/net/ethernet/intel/e1000e/80003es2lan.c   |    8 +++
 drivers/net/ethernet/intel/e1000e/82571.c         |   13 +++++-
 drivers/net/ethernet/intel/e1000e/e1000.h         |   10 ++++
 drivers/net/ethernet/intel/e1000e/ich8lan.c       |   54 ++++++++++-----------
 drivers/net/ethernet/intel/e1000e/netdev.c        |   21 ++------
 drivers/net/ethernet/intel/e1000e/phy.c           |   18 +++++++-
 drivers/net/ethernet/intel/ixgbevf/defines.h      |    2 +
 drivers/net/ethernet/intel/ixgbevf/ethtool.c      |   18 +++++--
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      |    2 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |   30 ++++++------
 drivers/net/ethernet/intel/ixgbevf/vf.c           |   12 ++++-
 11 files changed, 119 insertions(+), 69 deletions(-)

-- 
1.7.7.6

^ permalink raw reply

* Re: [GIT] Networking
From: Jeff Kirsher @ 2012-05-03  9:44 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: David Miller, torvalds, akpm, netdev, linux-kernel, bpoirier, rjw,
	linux-pm, stephen.s.ko
In-Reply-To: <4FA251B5.8080109@linux.vnet.ibm.com>

On Thu, 2012-05-03 at 15:06 +0530, Srivatsa S. Bhat wrote:
> On 05/03/2012 03:02 PM, David Miller wrote:
> 
> > From: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
> > Date: Thu, 03 May 2012 14:36:57 +0530
> > 
> >> https://lkml.org/lkml/2012/4/5/197
> > 
> > Networking patches not posted to netdev and the Intel ethernet driver
> > maintainers will be ignored.
> > 
> 
> 
> Jeff Kirsher (Intel ethernet driver maintainer) already acked that
> patchset (both patches).
> 
> https://lkml.org/lkml/2012/4/5/376
> https://lkml.org/lkml/2012/4/5/377
> 
> One went upstream, the other didn't (yet). That's my concern.
> 
> Regards,
> Srivatsa S. Bhat
> 

The second has not been ignored or dropped.  I pushed one of the patches
and we are finishing up the validation on the second patch.  I should
have the second patch pushed here in the next day or so.  I apologize
for not being able to push both patches at the same time.

Cheers,
Jeff

^ permalink raw reply

* Re: sky2 still badly broken
From: Niccolò Belli @ 2012-05-03  9:40 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20120502115618.04ab8ff9@s6510.linuxnetplumber.net>

On 02/05/2012 20:56, Stephen Hemminger wrote:
> It could be that your switch doesn't do autonegotiation or flow
> control. You are getting receive fifo overflow errors.

I don't have this problem with other NICs. Also, the transfer rate is very
low (sometimes as low as 2 MB/s), while I get ~110 MB/s with other NICs
(on the same switch, of course).

Niccolò

^ permalink raw reply

* Re: [GIT] Networking
From: Srivatsa S. Bhat @ 2012-05-03  9:36 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, akpm, netdev, linux-kernel, bpoirier, jeffrey.t.kirsher,
	rjw, linux-pm, stephen.s.ko
In-Reply-To: <20120503.053246.92708809581536438.davem@davemloft.net>

On 05/03/2012 03:02 PM, David Miller wrote:

> From: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
> Date: Thu, 03 May 2012 14:36:57 +0530
> 
>> https://lkml.org/lkml/2012/4/5/197
> 
> Networking patches not posted to netdev and the Intel ethernet driver
> maintainers will be ignored.
> 


Jeff Kirsher (Intel ethernet driver maintainer) already acked that
patchset (both patches).

https://lkml.org/lkml/2012/4/5/376
https://lkml.org/lkml/2012/4/5/377

One went upstream, the other didn't (yet). That's my concern.

Regards,
Srivatsa S. Bhat

^ permalink raw reply

* [PATCH net-next] net: Fix truesize accounting in skb_gro_receive()
From: Eric Dumazet @ 2012-05-03  9:33 UTC (permalink / raw)
  To: Alexander Duyck, David Miller; +Cc: netdev, jeffrey.t.kirsher
In-Reply-To: <20120503071859.13636.30050.stgit@gitlad.jf.intel.com>

From: Eric Dumazet <edumazet@google.com>

GRO is very optimistic in skb truesize estimates, only taking into
account the used part of fragments.

Be conservative, and use more precise computation, so that bloated GRO
skbs can be collapsed eventually.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 net/core/skbuff.c |   11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9e8caa0..e1f8bba 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2871,6 +2871,7 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 	unsigned int len = skb_gro_len(skb);
 	unsigned int offset = skb_gro_offset(skb);
 	unsigned int headlen = skb_headlen(skb);
+	unsigned int delta_truesize;
 
 	if (p->len + len >= 65536)
 		return -E2BIG;
@@ -2900,11 +2901,14 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 		frag->page_offset += offset;
 		skb_frag_size_sub(frag, offset);
 
+		/* all fragments truesize : remove (head size + sk_buff) */
+		delta_truesize = skb->truesize - SKB_TRUESIZE(skb_end_pointer(skb) - skb->head);
+
 		skb->truesize -= skb->data_len;
 		skb->len -= skb->data_len;
 		skb->data_len = 0;
 
-		NAPI_GRO_CB(skb)->free = 1;
+		NAPI_GRO_CB(skb)->free = NAPI_GRO_FREE;
 		goto done;
 	} else if (skb->head_frag) {
 		int nr_frags = pinfo->nr_frags;
@@ -2929,6 +2933,7 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 		memcpy(frag + 1, skbinfo->frags, sizeof(*frag) * skbinfo->nr_frags);
 		/* We dont need to clear skbinfo->nr_frags here */
 
+		delta_truesize = skb->truesize - SKB_DATA_ALIGN(sizeof(struct sk_buff));
 		NAPI_GRO_CB(skb)->free = NAPI_GRO_FREE_STOLEN_HEAD;
 		goto done;
 	} else if (skb_gro_len(p) != pinfo->gso_size)
@@ -2971,7 +2976,7 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 	p = nskb;
 
 merge:
-	p->truesize += skb->truesize - len;
+	delta_truesize = skb->truesize;
 	if (offset > headlen) {
 		unsigned int eat = offset - headlen;
 
@@ -2991,7 +2996,7 @@ merge:
 done:
 	NAPI_GRO_CB(p)->count++;
 	p->data_len += len;
-	p->truesize += len;
+	p->truesize += delta_truesize;
 	p->len += len;
 
 	NAPI_GRO_CB(skb)->same_flow = 1;
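
To illustrate the difference on the fragment-stealing path, here is a simplified user-space sketch. The struct and function names are hypothetical and do not mirror struct sk_buff; the sketch only contrasts charging the used payload bytes (the old estimate) with charging the memory actually held by the fragments (the estimate after this patch):

```c
#include <stddef.h>

/* Hypothetical, simplified view of a received buffer (not sk_buff). */
struct sketch_skb {
	size_t truesize;      /* total memory charged to this buffer */
	size_t head_overhead; /* linear head area plus buffer metadata */
	size_t payload_len;   /* payload bytes actually in use */
};

/* Old estimate: only the used part of the fragments is added. */
static size_t sketch_delta_old(const struct sketch_skb *skb)
{
	return skb->payload_len;
}

/* Estimate after the patch: everything except the head overhead,
 * i.e. the full memory footprint of the fragments. */
static size_t sketch_delta_new(const struct sketch_skb *skb)
{
	return skb->truesize - skb->head_overhead;
}
```

With a 100-byte payload held in a 2048-byte fragment and 512 bytes of head overhead (truesize 2560), the old estimate adds 100 bytes to the aggregated skb while the new one adds 2048.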

^ permalink raw reply related

* Re: [GIT] Networking
From: David Miller @ 2012-05-03  9:32 UTC (permalink / raw)
  To: srivatsa.bhat
  Cc: torvalds, akpm, netdev, linux-kernel, bpoirier, jeffrey.t.kirsher,
	rjw, linux-pm, stephen.s.ko
In-Reply-To: <4FA24AB1.9010108@linux.vnet.ibm.com>

From: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Date: Thu, 03 May 2012 14:36:57 +0530

> https://lkml.org/lkml/2012/4/5/197

Networking patches not posted to netdev and the Intel ethernet driver
maintainers will be ignored.

^ permalink raw reply

* Re: [PATCH 4/6] tcp: Repair socket queues
From: David Miller @ 2012-05-03  9:31 UTC (permalink / raw)
  To: xemul; +Cc: eric.dumazet, netdev
In-Reply-To: <4FA248E4.7060501@parallels.com>

From: Pavel Emelyanov <xemul@parallels.com>
Date: Thu, 03 May 2012 12:59:16 +0400

> Well, yes, but this ability is given to CAP_NET_ADMIN users only.
> Do you think it's nonetheless worth accounting this allocation into
> the socket's rmem?

Such overly large lengths are often a bug in the application, so it is
better to catch them than to let them silently succeed.

Also, restricting an operation to "privileged" entities does not mean
we should forego resource utilization checks.
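
As an illustration of the kind of check under discussion, here is a minimal user-space sketch of validating a caller-supplied size before adding a header length to it. The cap value and all sketch_* names are hypothetical; real code would consult the socket's receive memory accounting rather than a fixed limit:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cap standing in for a receive-buffer budget. */
#define SKETCH_MAX_REPAIR_PAYLOAD ((size_t)1 << 20)

/* Stores payload + header in *total and returns true, or returns false
 * if the payload exceeds the cap or the addition would wrap around. */
static bool sketch_checked_alloc_size(size_t payload, size_t header,
				      size_t *total)
{
	if (payload > SKETCH_MAX_REPAIR_PAYLOAD)
		return false;
	if (header > SIZE_MAX - payload)
		return false;
	*total = payload + header;
	return true;
}
```

A caller would allocate only when the function returns true, so an oversized or wrapping request fails early instead of reaching the allocator.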

^ permalink raw reply

* Re: [PATCH 4/6] tcp: Repair socket queues
From: Pavel Emelyanov @ 2012-05-03  9:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Netdev List, David Miller
In-Reply-To: <1336036120.10187.7.camel@edumazet-glaptop>

On 05/03/2012 01:08 PM, Eric Dumazet wrote:
> On Thu, 2012-05-03 at 12:59 +0400, Pavel Emelyanov wrote:
>> On 05/02/2012 03:11 PM, Eric Dumazet wrote:
> 
>>> I am not sure any check is performed on 'size' ?
>>
>> No, no checks here.
>>
>>> A caller might trigger OOM or wrap bug.
>>
>> Well, yes, but this ability is given to CAP_NET_ADMIN users only.
>> Do you think it's nonetheless worth accounting this allocation into
>> the socket's rmem?
> 
> Yes, something must be done...
> 
> Might be a good reason to un-inline tcp_try_rmem_schedule(), this fat
> thing...

OK, will try to look at it.

Thanks,
Pavel

^ permalink raw reply

* Re: [PATCH 4/6] tcp: Repair socket queues
From: Eric Dumazet @ 2012-05-03  9:08 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux Netdev List, David Miller
In-Reply-To: <4FA248E4.7060501@parallels.com>

On Thu, 2012-05-03 at 12:59 +0400, Pavel Emelyanov wrote:
> On 05/02/2012 03:11 PM, Eric Dumazet wrote:

> > I am not sure any check is performed on 'size' ?
> 
> No, no checks here.
> 
> > A caller might trigger OOM or wrap bug.
> 
> Well, yes, but this ability is given to CAP_NET_ADMIN users only.
> Do you think it's nonetheless worth accounting this allocation into
> the socket's rmem?

Yes, something must be done...

Might be a good reason to un-inline tcp_try_rmem_schedule(), this fat
thing...

^ permalink raw reply

* Re: vhost-net: is there a race for sock in handle_tx/rx?
From: Liu ping fan @ 2012-05-03  9:08 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, kvm, linux-kernel
In-Reply-To: <20120503084115.GM8266@redhat.com>

Oh, got it. It is a very interesting implementation.

Thanks and regards,
pingfan

On Thu, May 3, 2012 at 4:41 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Thu, May 03, 2012 at 04:33:55PM +0800, Liu ping fan wrote:
>> Hi,
>>
>> While reading the vhost-net code, I found the following:
>>
>> static void handle_tx(struct vhost_net *net)
>> {
>>       struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
>>       unsigned out, in, s;
>>       int head;
>>       struct msghdr msg = {
>>               .msg_name = NULL,
>>               .msg_namelen = 0,
>>               .msg_control = NULL,
>>               .msg_controllen = 0,
>>               .msg_iov = vq->iov,
>>               .msg_flags = MSG_DONTWAIT,
>>       };
>>       size_t len, total_len = 0;
>>       int err, wmem;
>>       size_t hdr_size;
>>       struct socket *sock;
>>       struct vhost_ubuf_ref *uninitialized_var(ubufs);
>>       bool zcopy;
>>
>>       /* TODO: check that we are running from vhost_worker? */
>>       sock = rcu_dereference_check(vq->private_data, 1);
>>       if (!sock)
>>               return;
>>
>>       ----> Here Qemu calls vhost_net_set_backend() to set a new
>>             backend fd and closes @oldsock->file, so that
>>             sock->file refcnt == 0.
>>
>>             Can vhost_worker protect itself from such a situation?
>>             And how?
>>
>>       wmem = atomic_read(&sock->sk->sk_wmem_alloc);
>>        .........................................................................
>>
>> Is it a race?
>>
>> Thanks and regards,
>> pingfan
>
> See comment before void __rcu *private_data in vhost.h
>
>

^ permalink raw reply

* Re: [GIT] Networking
From: Srivatsa S. Bhat @ 2012-05-03  9:06 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, akpm, netdev, linux-kernel, bpoirier, jeffrey.t.kirsher,
	Rafael J. Wysocki, Linux PM mailing list, stephen.s.ko
In-Reply-To: <20120503.025951.1393913795357644010.davem@davemloft.net>

On 05/03/2012 12:29 PM, David Miller wrote:

> 
> It's the usual jumble of small fixes, mostly in drivers, but a few in
> core infrastructure parts and TCP.
> 
> 1) Transfer padding was wrong for full-speed USB in ASIX driver,
>    fix from Ingo van Lil.
> 
> 2) Propagate the negative packet offset fix into the PowerPC BPF JIT.
>    From Jan Seiffert.
> 
> 3) dl2k driver's private ioctls were letting unprivileged tasks make
>    MII writes and other ugly bits like that.  Fix from Jeff Mahoney.
> 
> 4) Fix TX VLAN and RX packet drops in ucc_geth, from Joakim Tjernlund.
> 
> 5) OOPS and network namespace fixes in IPVS from Hans Schillstrom and
>    Julian Anastasov.
> 
> 6) Fix races and sleeping in locked context bugs in drop_monitor, from
>    Neil Horman.
> 
> 7) Fix link status indication in smsc95xx driver, from Paolo Pisati.
> 
> 8) Fix bridge netfilter OOPS, from Peter Huang.
> 
> 9) L2TP sendmsg can return on error conditions with the socket lock
>    held, oops.  Fix from Sasha Levin.
> 
> 10) udp_diag should return meaningful values for socket memory usage,
>     from Shan Wei.
> 
> 11) Eric Dumazet is so awesome he gets his own section:
> 
> 	Socket memory cgroup code (I never should have applied those
> 	patches, grumble...) made erroneous changes to
> 	sk_sockets_allocated_read_positive().  It was changed to
> 	use percpu_counter_sum_positive (which requires BH disabling)
> 	instead of percpu_counter_read_positive (which does not).
> 	Revert back to avoid crashes and lockdep warnings.
> 
> 	Adjust the default tcp_adv_win_scale and tcp_rmem[2] values
> 	to fix throughput regressions.  This is necessary as a result
> 	of our more precise skb->truesize tracking.
> 
> 	Fix SKB leak in netem packet scheduler.
> 
> 12) New device IDs for various bluetooth devices, from Manoj Iyer,
>     AceLan Kao, and Steven Harms.
> 
> 13) Fix command completion race in ipw2200, from Stanislav Yakovlev.
> 
> 14) Fix rtlwifi oops on unload, from Larry Finger.
> 
> 15) Fix hard_mtu when adjusting hard_header_len in smsc95xx driver. From
>     Stephane Fillod.
> 
> 16) ehea driver registers its IRQ before all the necessary state is
>     set up, resulting in crashes.  Fix from Thadeu Lima de Souza
>     Cascardo.
> 
> 17) Fix PHY connection failures in davinci_emac driver, from Anatolij
>     Gustschin.
> 
> 18) Missing break; in switch statement in bluetooth's
>     hci_cmd_complete_evt().  Fix from Szymon Janc.
> 
> 19) Fix queue programming in iwlwifi, from Johannes Berg.
> 
> 20) Interrupt throttling defaults not being actually programmed
>     into the hardware, fix from Jeff Kirsher and Ying Cai.
> 
> 21) TLAN driver SKB encoding in descriptor busted on 64-bit, fix
>     from Benjamin Poirier.
> 
> 22) Fix blind status block RX producer pointer deref in TG3 driver,
>     from Matt Carlson.
> 
> 23) Promisc and multicast are busted on ehea, fixes from Thadeu Lima
>     de Souza Cascardo.
> 
> 24) Fix crashes in 6lowpan, from Alexander Smirnov.
> 
> 25) tcp_complete_cwr() needs to be careful to not rewind the CWND to
>     ssthresh if ssthresh has the "infinite" value.  Fix from Yuchung
>     Cheng.
> 
> Please pull, thanks a lot.
> 
> The following changes since commit 4d634ca35a8b38530b134ae92bc9e3cc9c23c030:
> 
>   Merge branch 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild (2012-04-23 19:45:19 -0700)
> 
> are available in the git repository at:
> 
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git master
> 
> for you to fetch changes up to 5a8887d39e1ba5ee2d4ccb94b14d6f2dce5ddfca:
> 
>   sungem: Fix WakeOnLan (2012-05-03 01:42:55 -0400)
> 
> ----------------------------------------------------------------
> AceLan Kao (2):
>       Bluetooth: Add support for Atheros [13d3:3362]
>       Bluetooth: Add support for AR3012 [0cf3:e004]
> 
> Alexander Duyck (1):
>       ixgbe: Fix a memory leak in IEEE DCB
> 
> Anatolij Gustschin (1):
>       net/davinci_emac: fix failing PHY connect attempts
> 
> Benjamin Poirier (1):
>       tlan: add cast needed for proper 64 bit operation
> 


Oh, looks like even this pull request missed the igb fix from Benjamin.
https://lkml.org/lkml/2012/4/5/197

I don't mean to rush things, but my only concern here is to ensure that
this patch doesn't get lost, because the fix is important, is stable
material (I see warnings/stacktraces during suspend/resume in stable
kernels very frequently and this patch fixes it) and a similar fix for
ixgbe (patch 2/2 in that patchset, https://lkml.org/lkml/2012/4/5/198)
went upstream in a previous -rc (commit 34948a947d), while this one got
left out...

In case the above mentioned patch is already in the pipeline, sorry for
the noise..

Regards,

Srivatsa S. Bhat
IBM Linux Technology Center

^ permalink raw reply

* Re: [PATCH 4/6] tcp: Repair socket queues
From: Pavel Emelyanov @ 2012-05-03  8:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Netdev List, David Miller
In-Reply-To: <1335957064.22133.428.camel@edumazet-glaptop>

On 05/02/2012 03:11 PM, Eric Dumazet wrote:
> On Thu, 2012-04-19 at 17:41 +0400, Pavel Emelyanov wrote:
>> Reading queues under repair mode is done with recvmsg call.
>> The queue-under-repair set by TCP_REPAIR_QUEUE option is used
>> to determine which queue should be read. Thus both send and
>> receive queue can be read with this.
>>
>> Caller must pass the MSG_PEEK flag.
>>
>> Writing to queues is done with sendmsg call and yet again --
>> the repair-queue option can be used to push data into the
>> receive queue.
>>
>> When putting an skb into receive queue a zero tcp header is
>> appended to its head to address the tcp_hdr(skb)->syn and
>> the ->fin checks by the (after repair) tcp_recvmsg. These
>> flags are both set to zero and that's why.
>>
>> The fin cannot be met in the queue while reading the source
>> socket, since the repair only works for closed/established
>> sockets and queueing fin packet always changes its state.
>>
>> The syn in the queue denotes that the respective skb's seq
>> is "off-by-one" as compared to the actual payload length. Thus,
>> at the rcv queue refill we can just drop this flag and set the
>> skb's sequences to precise values.
>>
>> When the repair mode is turned off, the write queue seqs are
>> updated so that the whole queue is considered to be 'already sent,
>> waiting for ACKs' (write_seq = snd_nxt <= snd_una). From the
>> protocol POV the send queue looks like it was sent, but the data
>> between the write_seq and snd_nxt is lost in the network.
>>
>> This helps to avoid another sockoption for setting the snd_nxt
>> sequence. Leaving the whole queue in a 'not yet sent' state (as
>> it will be after sendmsg-s) will not allow receiving any acks
>> from the peer since the ack_seq will be after the snd_nxt. Thus
>> even the ack for the window probe will be dropped and the
>> connection will be 'locked' with the zero peer window.
>>
>> Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
>> ---
>>  net/ipv4/tcp.c        |   89 +++++++++++++++++++++++++++++++++++++++++++++++--
>>  net/ipv4/tcp_output.c |    1 +
>>  2 files changed, 87 insertions(+), 3 deletions(-)
>>
>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> index e38d6f2..47e2f49 100644
>> --- a/net/ipv4/tcp.c
>> +++ b/net/ipv4/tcp.c
>> @@ -912,6 +912,39 @@ static inline int select_size(const struct sock *sk, bool sg)
>>  	return tmp;
>>  }
>>  
>> +static int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
>> +{
>> +	struct sk_buff *skb;
>> +	struct tcp_skb_cb *cb;
>> +	struct tcphdr *th;
>> +
>> +	skb = alloc_skb(size + sizeof(*th), sk->sk_allocation);
> 
> I am not sure any check is performed on 'size' ?

No, no checks here.

> A caller might trigger OOM or wrap bug.

Well, yes, but this ability is given to CAP_NET_ADMIN users only.
Do you think it's nonetheless worth accounting this allocation into
the socket's rmem?

Thanks,
Pavel

^ permalink raw reply

* [PATCH v2] RPS: Sparse connection optimizations - v2
From: Deng-Cheng Zhu @ 2012-05-03  8:56 UTC (permalink / raw)
  To: davem, therbert, netdev; +Cc: eric.dumazet, dczhu

From: Deng-Cheng Zhu <dczhu@mips.com>

Currently, choosing the target CPU to process an incoming packet is based on
skb->rxhash. In the case of sparse connections, this could lead to
relatively low and inconsistent bandwidth while doing network throughput
tests -- CPU selection in the RPS map is imbalanced. Even with the same
hash value, 2 packets could come from different devices.

This patch introduces a feature that allows some flows to select their CPUs
by looping over the RPS CPU maps. Some tests were performed on the MIPS Malta
1004K platform (2 cores, each with 2 VPEs) at 25 MHz with 2 Intel Pro/1000
NICs. The Malta board works as a router between 2 PCs. Using iperf, here
are the results:

       | Original Kernel             | Patched Kernel              |
-------|-----------------------------|-----------------------------|-------
       | SUM    SUM    SUM2   SUM3   | SUM    SUM    SUM2   SUM3   | SUM3
       | 1->2   2->1                 | 1->2   2->1                 | Delta
-------|-----------------------------|-----------------------------|-------
1x  1  | 33.70  29.10  62.80  657.40 | 46.70  46.30  93.00  928.80 | 41.28%
    2  | 46.20  29.30  75.50         | 46.80  46.20  93.00         |
    3  | 25.50  17.60  43.10         | 46.70  45.90  92.60         |
    4  | 38.00  29.10  67.10         | 46.80  46.20  93.00         |
    5  | 46.10  17.30  63.40         | 46.80  46.40  93.20         |
    6  | 36.80  29.00  65.80         | 46.60  46.20  92.80         |
    7  | 46.10  28.10  74.20         | 46.70  46.20  92.90         |
    8  | 46.10  27.90  74.00         | 46.70  46.00  92.70         |
    9  | 36.70  27.80  64.50         | 46.80  46.20  93.00         |
    10 | 38.00  29.00  67.00         | 46.60  46.00  92.60         |
-------|-----------------------------|-----------------------------|-------
2x  1  | 30.90  35.60  66.50  674.32 | 47.40  44.60  92.00  902.80 | 33.88%
    2  | 36.80  17.81  54.61         | 46.30  39.20  85.50         |
    3  | 41.10  17.35  58.45         | 47.40  44.70  92.10         |
    4  | 41.10  35.50  76.60         | 47.50  45.20  92.70         |
    5  | 41.20  35.70  76.90         | 47.50  39.00  86.50         |
    6  | 36.70  40.20  76.90         | 47.40  44.90  92.30         |
    7  | 29.40  18.06  47.46         | 46.90  45.20  92.10         |
    8  | 34.50  40.10  74.60         | 47.00  44.80  91.80         |
    9  | 34.00  35.80  69.80         | 46.40  45.00  91.40         |
    10 | 37.00  35.50  72.50         | 47.40  39.00  86.40         |
-------|-----------------------------|-----------------------------|-------
3x  1  | 45.40  36.90  82.30  774.89 | 45.30  46.90  92.20  895.50 | 15.56%
    2  | 44.00  19.12  63.12         | 45.20  46.50  91.70         |
    3  | 36.90  38.20  75.10         | 45.90  40.60  86.50         |
    4  | 39.20  37.30  76.50         | 45.50  40.30  85.80         |
    5  | 43.30  39.43  82.73         | 45.60  46.10  91.70         |
    6  | 42.70  39.55  82.25         | 45.40  46.30  91.70         |
    7  | 41.20  39.56  80.76         | 45.60  46.20  91.80         |
    8  | 44.60  38.00  82.60         | 45.30  40.30  85.60         |
    9  | 35.43  37.30  72.73         | 45.50  40.50  86.00         |
    10 | 39.70  37.10  76.80         | 45.80  46.70  92.50         |
-------|-----------------------------|-----------------------------|-------
4x  1  | 41.07  35.09  76.16  738.34 | 41.79  45.70  87.49  845.24 | 14.48%
    2  | 38.40  34.92  73.32         | 42.30  40.21  82.51         |
    3  | 33.18  34.76  67.94         | 41.95  44.70  86.65         |
    4  | 41.18  34.81  75.99         | 41.44  39.69  81.13         |
    5  | 34.52  34.46  68.98         | 41.07  39.61  80.68         |
    6  | 41.72  34.15  75.87         | 40.76  45.40  86.16         |
    7  | 38.81  39.43  78.24         | 42.40  45.30  87.70         |
    8  | 40.86  38.08  78.94         | 41.58  44.02  85.60         |
    9  | 34.80  38.82  73.62         | 42.20  39.95  82.15         |
    10 | 30.48  38.80  69.28         | 41.37  43.80  85.17         |
-------|-----------------------------|-----------------------------|-------
6x  1  | 35.59  34.10  69.69  706.58 | 37.28  41.59  78.87  772.02 | 9.26%
    2  | 35.53  39.02  74.55         | 39.42  38.47  77.89         |
    3  | 40.74  31.54  72.28         | 37.12  36.17  73.29         |
    4  | 37.64  35.66  73.30         | 39.16  41.60  80.76         |
    5  | 36.87  31.35  68.22         | 39.83  38.03  77.86         |
    6  | 37.65  34.99  72.64         | 39.72  39.56  79.28         |
    7  | 37.05  38.70  75.75         | 35.72  36.13  71.85         |
    8  | 35.56  34.15  69.71         | 38.24  41.17  79.41         |
    9  | 29.18  31.16  60.34         | 39.81  37.39  77.20         |
    10 | 34.09  36.01  70.10         | 39.88  35.73  75.61         |
-------|-----------------------------|-----------------------------|-------
8x  1  | 31.38  36.37  67.75  677.76 | 38.25  37.38  75.63  739.60 | 9.12%
    2  | 35.77  34.04  69.81         | 36.37  41.39  77.76         |
    3  | 32.53  32.83  65.36         | 34.64  34.54  69.18         |
    4  | 29.67  36.76  66.43         | 38.37  37.45  75.82         |
    5  | 33.99  34.77  68.76         | 35.39  36.71  72.10         |
    6  | 32.31  34.05  66.36         | 34.23  37.65  71.88         |
    7  | 33.37  38.29  71.66         | 38.28  35.32  73.60         |
    8  | 30.83  36.18  67.01         | 38.26  37.32  75.58         |
    9  | 34.37  33.14  67.51         | 35.01  37.81  72.82         |
    10 | 32.74  34.37  67.11         | 34.20  41.03  75.23         |
-------|-----------------------------|-----------------------------|-------
12x 1  | 31.22  32.81  64.03  649.48 | 30.47  37.07  67.54  681.10 | 4.87%
    2  | 29.63  34.46  64.09         | 34.98  35.63  70.61         |
    3  | 32.47  28.61  61.08         | 33.09  35.88  68.97         |
    4  | 32.22  31.01  63.23         | 32.89  36.09  68.98         |
    5  | 29.49  35.92  65.41         | 32.92  33.48  66.40         |
    6  | 32.07  34.29  66.36         | 32.56  34.62  67.18         |
    7  | 31.13  35.65  66.78         | 35.22  36.62  71.84         |
    8  | 32.96  37.00  69.96         | 32.53  37.08  69.61         |
    9  | 28.85  32.59  61.44         | 32.67  34.46  67.13         |
    10 | 32.71  34.39  67.10         | 30.94  31.90  62.84         |
-------|-----------------------------|-----------------------------|-------
16x 1  | 29.55  35.64  65.19  634.00 | 30.03  34.37  64.40  643.42 | 1.49%
    2  | 29.13  32.61  61.74         | 30.86  30.66  61.52         |
    3  | 29.87  34.52  64.39         | 29.53  36.59  66.12         |
    4  | 28.16  30.54  58.70         | 29.01  35.66  64.67         |
    5  | 30.04  34.35  64.39         | 30.72  35.18  65.90         |
    6  | 27.45  36.73  64.18         | 30.81  28.83  59.64         |
    7  | 28.34  38.18  66.52         | 30.71  33.56  64.27         |
    8  | 27.11  38.22  65.33         | 32.35  35.85  68.20         |
    9  | 28.53  32.93  61.46         | 31.21  32.35  63.56         |
    10 | 28.77  33.33  62.10         | 30.99  34.15  65.14         |
-------|-----------------------------|-----------------------------|-------
20x 1  | 30.57  36.96  67.53  641.27 | 30.27  34.99  65.26  617.18 | -3.76%
    2  | 26.23  36.64  62.87         | 28.85  32.50  61.35         |
    3  | 28.84  36.58  65.42         | 28.97  33.79  62.76         |
    4  | 30.59  31.27  61.86         | 27.34  32.83  60.17         |
    5  | 27.91  32.32  60.23         | 28.32  32.82  61.14         |
    6  | 28.77  33.32  62.09         | 26.95  33.08  60.03         |
    7  | 29.60  38.10  67.70         | 28.14  35.74  63.88         |
    8  | 29.84  36.38  66.22         | 29.00  30.01  59.01         |
    9  | 28.68  35.84  64.52         | 27.67  31.44  59.11         |
    10 | 28.16  34.67  62.83         | 30.54  33.93  64.47         |
-------|-----------------------------|-----------------------------|-------
24x 1  | 30.89  34.15  65.05  617.21 | 28.75  33.91  62.66  618.79 | 0.26%
    2  | 30.53  34.38  64.91         | 29.39  31.85  61.24         |
    3  | 28.13  35.20  63.33         | 28.36  34.01  62.37         |
    4  | 29.21  30.46  59.67         | 25.12  34.24  59.36         |
    5  | 24.72  35.46  60.18         | 29.38  32.60  61.98         |
    6  | 28.52  27.00  55.52         | 30.23  35.08  65.32         |
    7  | 25.12  35.46  60.57         | 28.44  31.91  60.35         |
    8  | 27.46  35.93  63.39         | 29.10  34.27  63.37         |
    9  | 27.62  32.56  60.18         | 27.85  34.83  62.68         |
    10 | 30.44  33.99  64.42         | 28.61  30.84  59.46         |
-------|-----------------------------|-----------------------------|-------
28x 1  | 28.30  30.15  58.45  613.21 | 26.97  30.28  57.25  592.80 | -3.33%
    2  | 30.78  31.02  61.80         | 28.27  30.33  58.61         |
    3  | 26.76  34.01  60.77         | 27.89  31.18  59.07         |
    4  | 27.18  32.31  59.49         | 29.42  33.19  62.61         |
    5  | 30.44  35.69  66.13         | 25.56  32.96  58.52         |
    6  | 27.70  30.55  58.25         | 27.94  32.19  60.12         |
    7  | 28.60  34.18  62.77         | 25.18  31.26  56.44         |
    8  | 29.40  31.41  60.81         | 28.78  28.71  57.49         |
    9  | 27.11  34.13  61.24         | 28.65  32.48  61.13         |
    10 | 30.07  33.43  63.50         | 25.99  35.59  61.57         |
-------|-----------------------------|-----------------------------|-------
32x 1  | 27.41  29.16  56.58  590.24 | 27.94  30.75  58.69  584.15 | -1.03%
    2  | 26.54  27.85  54.39         | 28.92  34.46  63.37         |
    3  | 26.68  34.18  60.86         | 25.71  31.12  56.83         |
    4  | 27.31  34.72  62.03         | 26.70  31.35  58.04         |
    5  | 28.82  32.89  61.71         | 27.45  33.83  61.28         |
    6  | 25.49  28.59  54.08         | 27.94  32.06  60.00         |
    7  | 25.80  34.75  60.55         | 26.63  33.22  59.85         |
    8  | 24.39  32.44  56.83         | 26.17  32.27  58.43         |
    9  | 29.33  35.19  64.53         | 24.11  26.43  50.54         |
    10 | 28.02  30.66  58.68         | 25.45  31.67  57.11         |

Note:
1. Data unit: Mbits/sec
2. 1x, 2x...32x: N iperf instances were running in parallel.
3. SUM 1->2: the sum over all instances, with PC1 as the iperf client and
   PC2 as the iperf server. Bidirectional tests were performed as well.
4. Tested with iptables/NAT + RPS (RPS CPU mask is 0xe for both NICs, which
   means CPU1/2/3 are covered).
5. CONFIG_NR_RPS_MAP_LOOPS == 4 by default.
6. Duration for each test: 100 seconds.
7. The results show that the overhead brought in by this feature is limited
   as the number of connections goes higher.
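
For readers checking the last column of the table, the percentages appear to
be the relative change between the two per-group totals. A minimal sketch,
assuming the left total is the baseline run and the right total is the run
with the feature enabled (percent_delta and print_delta are illustrative
names, not part of the patch):

```c
#include <stdio.h>

/* Relative change of the patched total against the baseline total, in percent. */
double percent_delta(double baseline, double patched)
{
	return (patched - baseline) / baseline * 100.0;
}

/* Print one row group in the same style as the table's last column. */
void print_delta(const char *label, double baseline, double patched)
{
	printf("%s %.2f%%\n", label, percent_delta(baseline, patched));
}
```

Calling print_delta("28x", 613.21, 592.80) reproduces the -3.33% shown for
the 28-instance group.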

Reference: http://www.spinics.net/lists/netdev/msg196734.html
Signed-off-by: Deng-Cheng Zhu <dczhu@mips.com>
---
Changes:
v2 - v1:
o Use percpu variables instead of static NR_CPUS array.
o Delete ARCH details -- let user choose optimal masks.
o Move structure definition to header file.

 include/linux/netdevice.h |   12 +++++++++
 net/Kconfig               |   22 ++++++++++++++++
 net/core/dev.c            |   59 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 93 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5cbaa20..22ac47d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -589,6 +589,18 @@ static inline void netdev_queue_numa_node_write(struct netdev_queue *q, int node
 }
 
 #ifdef CONFIG_RPS
+#ifdef CONFIG_RPS_SPARSE_FLOW_OPTIMIZATION
+/*
+ * This structure defines a flow that will be active on a given CPU for a
+ * certain period.
+ */
+struct cpu_flow {
+	struct net_device *dev;
+	u32 rxhash;
+	unsigned long ts;
+};
+#endif
+
 /*
  * This structure holds an RPS map which can be of variable length.  The
  * map is an array of CPUs.
diff --git a/net/Kconfig b/net/Kconfig
index e07272d..d5aa682 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -222,6 +222,28 @@ config RPS
 	depends on SMP && SYSFS && USE_GENERIC_SMP_HELPERS
 	default y
 
+config RPS_SPARSE_FLOW_OPTIMIZATION
+	bool "RPS optimizations for sparse flows"
+	depends on RPS
+	default n
+	---help---
+	  This feature tries to map some network flows to consecutive
+	  CPUs in the RPS map. It adds some per-packet overhead, but
+	  should improve network throughput when the number of
+	  connections is low (e.g. relatively consistent and high
+	  bandwidth in single-connection tests) without much affecting
+	  other cases.
+
+config NR_RPS_MAP_LOOPS
+	int "Number of loops walking RPS map before hash indexing (1-5)"
+	range 1 5
+	depends on RPS_SPARSE_FLOW_OPTIMIZATION
+	default "4"
+	---help---
+	  It defines how many loops to go through the RPS map while
+	  determining target CPU to process the incoming packet. After that,
+	  the decision will fall back on hash indexing the RPS map.
+
 config RFS_ACCEL
 	boolean
 	depends on RPS && GENERIC_HARDIRQS
diff --git a/net/core/dev.c b/net/core/dev.c
index c25d453..92e292b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2698,6 +2698,61 @@ set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	return rflow;
 }
 
+#ifdef CONFIG_RPS_SPARSE_FLOW_OPTIMIZATION
+static DEFINE_PER_CPU(struct cpu_flow [CONFIG_NR_RPS_MAP_LOOPS], cpu_flows);
+static unsigned long hash_active;
+
+#define FLOW_INACTIVE(now, base) (time_after((now), (base) + HZ) || \
+			 unlikely(time_before((now), (base))))
+
+static u16 find_cpu(const struct rps_map *map, const struct sk_buff *skb)
+{
+	struct cpu_flow *flow;
+	u16 cpu;
+	int i, l, do_alloc = 0;
+	unsigned long now = jiffies;
+
+retry:
+	for (l = 0; l < CONFIG_NR_RPS_MAP_LOOPS; l++) {
+		for (i = map->len - 1; i >= 0; i--) {
+			cpu = map->cpus[i];
+			flow = &per_cpu(cpu_flows, cpu)[l];
+
+			if (do_alloc) {
+				if (flow->dev == NULL ||
+				    FLOW_INACTIVE(now, flow->ts)) {
+					flow->dev = skb->dev;
+					flow->rxhash = skb->rxhash;
+					flow->ts = now;
+					return cpu;
+				}
+			} else {
+				/*
+				 * Unlike hash indexing, this avoids packet
+				 * processing imbalance across CPUs.
+				 */
+				if (flow->rxhash == skb->rxhash &&
+				    flow->dev == skb->dev &&
+				    !FLOW_INACTIVE(now, flow->ts)) {
+					flow->ts = now;
+					return cpu;
+				}
+			}
+		}
+	}
+
+	if (FLOW_INACTIVE(now, hash_active) && do_alloc == 0) {
+		do_alloc = 1;
+		goto retry;
+	}
+
+	/* For all other flows */
+	hash_active = now;
+
+	return map->cpus[((u64) skb->rxhash * map->len) >> 32];
+}
+#endif
+
 /*
  * get_rps_cpu is called from netif_receive_skb and returns the target
  * CPU from the RPS map of the receiving queue for a given skb.
@@ -2780,7 +2835,11 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	}
 
 	if (map) {
+#ifdef CONFIG_RPS_SPARSE_FLOW_OPTIMIZATION
+		tcpu = find_cpu(map, skb);
+#else
 		tcpu = map->cpus[((u64) skb->rxhash * map->len) >> 32];
+#endif
 
 		if (cpu_online(tcpu)) {
 			cpu = tcpu;
-- 
1.7.1

^ permalink raw reply related
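
Both the fallback at the end of find_cpu() and the existing get_rps_cpu()
path in the patch above index the RPS map with ((u64) rxhash * len) >> 32.
A minimal userspace sketch of that range reduction follows; hash_to_index is
an illustrative name, not a kernel function, and the types are the stdint
equivalents of the kernel ones.

```c
#include <stdint.h>

/*
 * Map a 32-bit hash onto [0, len) without a modulo: widen the hash to
 * 64 bits, multiply by len, and keep the upper 32 bits of the product.
 * The result is always smaller than len because hash < 2^32.
 */
uint32_t hash_to_index(uint32_t hash, uint32_t len)
{
	return (uint32_t)(((uint64_t)hash * len) >> 32);
}
```

With len == 3, hashes in the lowest third of the 32-bit range land on index
0, the middle third on index 1, and the top third on index 2, so a given flow
hash always selects the same slot of the map.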

* Re: vhost-net: is there a race for sock in handle_tx/rx?
From: Michael S. Tsirkin @ 2012-05-03  8:41 UTC (permalink / raw)
  To: Liu ping fan; +Cc: netdev, kvm, linux-kernel
In-Reply-To: <CAFgQCTtKWR6F3D_mPcGe69HvZbYmmAdXreSWLZQrdi+0T3i2ag@mail.gmail.com>

On Thu, May 03, 2012 at 04:33:55PM +0800, Liu ping fan wrote:
> Hi,
> 
> While reading the vhost-net code, I found the following:
> 
> static void handle_tx(struct vhost_net *net)
> {
> 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> 	unsigned out, in, s;
> 	int head;
> 	struct msghdr msg = {
> 		.msg_name = NULL,
> 		.msg_namelen = 0,
> 		.msg_control = NULL,
> 		.msg_controllen = 0,
> 		.msg_iov = vq->iov,
> 		.msg_flags = MSG_DONTWAIT,
> 	};
> 	size_t len, total_len = 0;
> 	int err, wmem;
> 	size_t hdr_size;
> 	struct socket *sock;
> 	struct vhost_ubuf_ref *uninitialized_var(ubufs);
> 	bool zcopy;
> 
> 	/* TODO: check that we are running from vhost_worker? */
> 	sock = rcu_dereference_check(vq->private_data, 1);
> 	if (!sock)
> 		return;
> 
> 	/*
> 	 * <-- Race window: QEMU calls vhost_net_set_backend() to set a new
> 	 * backend fd and closes @oldsock->file, so the sock->file refcount
> 	 * drops to 0. Can vhost_worker protect itself from this situation,
> 	 * and if so, how?
> 	 */
> 
> 	wmem = atomic_read(&sock->sk->sk_wmem_alloc);
>        .........................................................................
> 
> Is it a race?
> 
> Thanks and regards,
> pingfan

See comment before void __rcu *private_data in vhost.h

^ permalink raw reply

* Re: [PATCH 2/2] ss: implement -M option to get all memory information
From: Shan Wei @ 2012-05-03  8:39 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: xemul, NetDev
In-Reply-To: <20120502120042.5420644a@s6510.linuxnetplumber.net>

Stephen Hemminger said, at 2012/5/3 3:00:

> 
> This looks good, is the skmeminfo a superset of the old meminfo?


Yes, skmeminfo is a superset of the old meminfo.
Using it provides more socket memory information.

> But your code is broken on 64 bit. skmeminfo in kernel is an array of __u32!


OK. here is a new version.

----
[PATCH] ss: use new INET_DIAG_SKMEMINFO option to get more memory information for tcp socket


Signed-off-by: Shan Wei <davidshan@tencent.com>
---
 misc/ss.c |   16 ++++++++++++++--
 1 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 5f70a26..bd60548 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -1336,7 +1336,17 @@ static void tcp_show_info(const struct nlmsghdr *nlh, struct inet_diag_msg *r)
 	parse_rtattr(tb, INET_DIAG_MAX, (struct rtattr*)(r+1),
 		     nlh->nlmsg_len - NLMSG_LENGTH(sizeof(*r)));
 
-	if (tb[INET_DIAG_MEMINFO]) {
+	if (tb[INET_DIAG_SKMEMINFO]) {
+		const __u32 *skmeminfo =  RTA_DATA(tb[INET_DIAG_SKMEMINFO]);
+		printf(" skmem:(r%u,rb%u,t%u,tb%u,f%u,w%u,o%u)",
+			skmeminfo[SK_MEMINFO_RMEM_ALLOC],
+			skmeminfo[SK_MEMINFO_RCVBUF],
+			skmeminfo[SK_MEMINFO_WMEM_ALLOC],
+			skmeminfo[SK_MEMINFO_SNDBUF],
+			skmeminfo[SK_MEMINFO_FWD_ALLOC],
+			skmeminfo[SK_MEMINFO_WMEM_QUEUED],
+			skmeminfo[SK_MEMINFO_OPTMEM]);
+	}else if (tb[INET_DIAG_MEMINFO]) {
 		const struct inet_diag_meminfo *minfo
 			= RTA_DATA(tb[INET_DIAG_MEMINFO]);
 		printf(" mem:(r%u,w%u,f%u,t%u)",
@@ -1505,8 +1515,10 @@ static int tcp_show_netlink(struct filter *f, FILE *dump_fp, int socktype)
 	memset(&req.r, 0, sizeof(req.r));
 	req.r.idiag_family = AF_INET;
 	req.r.idiag_states = f->states;
-	if (show_mem)
+	if (show_mem) {
 		req.r.idiag_ext |= (1<<(INET_DIAG_MEMINFO-1));
+		req.r.idiag_ext |= (1<<(INET_DIAG_SKMEMINFO-1));
+	}
 
 	if (show_tcpinfo) {
 		req.r.idiag_ext |= (1<<(INET_DIAG_INFO-1));
-- 
1.7.1

^ permalink raw reply related
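
The second hunk of the patch above requests extensions by setting bit
(N - 1) of idiag_ext for attribute number N. A minimal sketch of that mask
construction, limited to the bits visible in the hunk; the enum values are
assumed to mirror <linux/inet_diag.h> and should be checked against the
installed headers, and ext_bit/build_ext_mask are illustrative names.

```c
#include <stdint.h>

/* Assumed attribute numbers, mirroring <linux/inet_diag.h>. */
enum {
	EXT_MEMINFO   = 1,	/* INET_DIAG_MEMINFO */
	EXT_INFO      = 2,	/* INET_DIAG_INFO */
	EXT_SKMEMINFO = 7	/* INET_DIAG_SKMEMINFO */
};

/* Extension N is requested by setting bit N - 1 of the request mask. */
uint8_t ext_bit(int ext)
{
	return (uint8_t)(1u << (ext - 1));
}

/* Ask for both memory attributes when memory output is wanted. */
uint8_t build_ext_mask(int show_mem, int show_tcpinfo)
{
	uint8_t mask = 0;

	if (show_mem) {
		mask |= ext_bit(EXT_MEMINFO);
		mask |= ext_bit(EXT_SKMEMINFO);
	}
	if (show_tcpinfo)
		mask |= ext_bit(EXT_INFO);
	return mask;
}
```

Requesting both memory attributes lets the reader prefer SKMEMINFO when the
kernel returns it and fall back to MEMINFO otherwise, which is what the
if/else-if in the first hunk does.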

* Re: [PATCH 2/2] ss: implement -M option to get all memory information
From: Pavel Emelyanov @ 2012-05-03  8:37 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Shan Wei, NetDev
In-Reply-To: <20120502120042.5420644a@s6510.linuxnetplumber.net>

On 05/02/2012 11:00 PM, Stephen Hemminger wrote:
> On Wed, 02 May 2012 17:45:02 +0800
> Shan Wei <shanwei88@gmail.com> wrote:
> 
>> Hi stephen:
>>
>> Stephen Hemminger said, at 2012/4/28 1:21:
>>
>>> Lots of options return more or different information based on kernel
>>> version, probably the biggest example is how stats are processed.
>>
>>
>> how about the following patch?
>>
>> ----
>> [PATCH] ss: use new INET_DIAG_SKMEMINFO option to get memory information for tcp socket
>>
>>
>> Signed-off-by: Shan Wei <davidshan@tencent.com>
>> ---
>>  misc/ss.c |   16 ++++++++++++++--
>>  1 files changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/misc/ss.c b/misc/ss.c
>> index 5f70a26..3cfc9e8 100644
>> --- a/misc/ss.c
>> +++ b/misc/ss.c
>> @@ -1336,7 +1336,17 @@ static void tcp_show_info(const struct nlmsghdr *nlh, struct inet_diag_msg *r)
>>  	parse_rtattr(tb, INET_DIAG_MAX, (struct rtattr*)(r+1),
>>  		     nlh->nlmsg_len - NLMSG_LENGTH(sizeof(*r)));
>>  
>> -	if (tb[INET_DIAG_MEMINFO]) {
>> +	if (tb[INET_DIAG_SKMEMINFO]) {
>> +		const unsigned int *skmeminfo =  RTA_DATA(tb[INET_DIAG_SKMEMINFO]);
>> +		printf(" skmem:(r%u,rb%u,t%u,tb%u,f%u,w%u,o%u)",
>> +			skmeminfo[SK_MEMINFO_RMEM_ALLOC],
>> +			skmeminfo[SK_MEMINFO_RCVBUF],
>> +			skmeminfo[SK_MEMINFO_WMEM_ALLOC],
>> +			skmeminfo[SK_MEMINFO_SNDBUF],
>> +			skmeminfo[SK_MEMINFO_FWD_ALLOC],
>> +			skmeminfo[SK_MEMINFO_WMEM_QUEUED],
>> +			skmeminfo[SK_MEMINFO_OPTMEM]);
>> +	}else if (tb[INET_DIAG_MEMINFO]) {
>>  		const struct inet_diag_meminfo *minfo
>>  			= RTA_DATA(tb[INET_DIAG_MEMINFO]);
>>  		printf(" mem:(r%u,w%u,f%u,t%u)",
>> @@ -1505,8 +1515,10 @@ static int tcp_show_netlink(struct filter *f, FILE *dump_fp, int socktype)
>>  	memset(&req.r, 0, sizeof(req.r));
>>  	req.r.idiag_family = AF_INET;
>>  	req.r.idiag_states = f->states;
>> -	if (show_mem)
>> +	if (show_mem) {
>>  		req.r.idiag_ext |= (1<<(INET_DIAG_MEMINFO-1));
>> +		req.r.idiag_ext |= (1<<(INET_DIAG_SKMEMINFO-1));
>> +	}
>>  
>>  	if (show_tcpinfo) {
>>  		req.r.idiag_ext |= (1<<(INET_DIAG_INFO-1));
> 
> This looks good, is the skmeminfo a superset of the old meminfo?

In terms of the values it returns -- yes, but these two structures are not
binary compatible with each other.

> But your code is broken on 64 bit. skmeminfo in kernel is an array of __u32!

Hmm :( So is the inet_diag_meminfo, which was the prototype for the skmeminfo...
Should we introduce the SKMEMINFO64?

> .
> 

^ permalink raw reply
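
On the width question discussed above, one way to keep the userspace side
explicit is to read the attribute as fixed-width 32-bit values and print
them with the matching format macros. This is only a sketch: the slot order
is assumed to mirror SK_MEMINFO_* in <linux/sock_diag.h>, and format_skmem
is an illustrative helper, not code from ss.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed slot order, mirroring SK_MEMINFO_* in <linux/sock_diag.h>. */
enum {
	RMEM_ALLOC, RCVBUF, WMEM_ALLOC, SNDBUF,
	FWD_ALLOC, WMEM_QUEUED, OPTMEM, SKMEM_VARS
};

/* Format the seven counters in the same field order as the patch. */
int format_skmem(char *buf, size_t len, const uint32_t *m)
{
	return snprintf(buf, len,
			"skmem:(r%" PRIu32 ",rb%" PRIu32 ",t%" PRIu32
			",tb%" PRIu32 ",f%" PRIu32 ",w%" PRIu32
			",o%" PRIu32 ")",
			m[RMEM_ALLOC], m[RCVBUF], m[WMEM_ALLOC],
			m[SNDBUF], m[FWD_ALLOC], m[WMEM_QUEUED],
			m[OPTMEM]);
}
```

Using uint32_t with PRIu32 keeps the element width independent of the
platform's int size, whatever the kernel-side type turns out to be.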

* vhost-net: is there a race for sock in handle_tx/rx?
From: Liu ping fan @ 2012-05-03  8:33 UTC (permalink / raw)
  To: netdev; +Cc: Michael S. Tsirkin, kvm, linux-kernel

Hi,

While reading the vhost-net code, I found the following:

static void handle_tx(struct vhost_net *net)
{
	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
	unsigned out, in, s;
	int head;
	struct msghdr msg = {
		.msg_name = NULL,
		.msg_namelen = 0,
		.msg_control = NULL,
		.msg_controllen = 0,
		.msg_iov = vq->iov,
		.msg_flags = MSG_DONTWAIT,
	};
	size_t len, total_len = 0;
	int err, wmem;
	size_t hdr_size;
	struct socket *sock;
	struct vhost_ubuf_ref *uninitialized_var(ubufs);
	bool zcopy;

	/* TODO: check that we are running from vhost_worker? */
	sock = rcu_dereference_check(vq->private_data, 1);
	if (!sock)
		return;

	/*
	 * <-- Race window: QEMU calls vhost_net_set_backend() to set a new
	 * backend fd and closes @oldsock->file, so the sock->file refcount
	 * drops to 0. Can vhost_worker protect itself from this situation,
	 * and if so, how?
	 */

	wmem = atomic_read(&sock->sk->sk_wmem_alloc);
       .........................................................................

Is it a race?

Thanks and regards,
pingfan

^ permalink raw reply

* Re: [v2 PATCH 4/4] ixgbe: Fix use after free on module remove
From: David Miller @ 2012-05-03  8:22 UTC (permalink / raw)
  To: alexander.h.duyck; +Cc: netdev, jeffrey.t.kirsher, edumazet
In-Reply-To: <20120503071914.13636.31157.stgit@gitlad.jf.intel.com>

From: Alexander Duyck <alexander.h.duyck@intel.com>
Date: Thu, 03 May 2012 00:19:14 -0700

> While testing the TCP changes I had to fix an issue in order to be able to
> load and unload the module.
> 
> The recent patch that added thermal sensor support added a use after free
> bug on module unload with an 82598 adapter in the system.  To resolve the
> issue I have updated the code so that when we free the info_kobj we set it
> back to NULL.
> 
> I suspect there are likely other bugs present, but I will leave that for
> another patch that can undergo more testing.
> 
> I am submitting this directly to net-next since this fixes a fairly serious
> bug that will lock up the ixgbe module until the system is rebooted.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>

Applied.

^ permalink raw reply

* Re: [v2 PATCH 3/4] tcp: move stats merge to the end of tcp_try_coalesce
From: David Miller @ 2012-05-03  8:22 UTC (permalink / raw)
  To: eric.dumazet; +Cc: alexander.h.duyck, netdev, jeffrey.t.kirsher, edumazet
In-Reply-To: <1336031565.3503.25.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 03 May 2012 09:52:45 +0200

> On Thu, 2012-05-03 at 00:19 -0700, Alexander Duyck wrote:
>> This change cleans up the last bits of tcp_try_coalesce so that we only
>> need one goto which jumps to the end of the function.  The idea is to make
>> the code more readable by putting things in a linear order so that we start
>> execution at the top of the function, and end it at the bottom.
>> 
>> I also made a slight tweak to the code for handling frags when we are a
>> clone.  Instead of making it an if (clone) loop else nr_frags = 0 I changed
>> the logic so that if (!clone) we just set the number of frags to 0 which
>> disables the for loop anyway.
>> 
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
 ...
> Thanks a lot Alex, this patch series looks very good.
> 
> Acked-by: Eric Dumazet <edumazet@google.com>

Applied.

^ permalink raw reply

* Re: [v2 PATCH 2/4] tcp: Move code related to head frag in tcp_try_coalesce
From: David Miller @ 2012-05-03  8:22 UTC (permalink / raw)
  To: eric.dumazet; +Cc: alexander.h.duyck, netdev, jeffrey.t.kirsher, edumazet
In-Reply-To: <1336031456.3503.24.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 03 May 2012 09:50:56 +0200

> On Thu, 2012-05-03 at 00:19 -0700, Alexander Duyck wrote:
>> This change reorders the code related to the use of an skb->head_frag so it
>> is placed before we check the rest of the frags.  This allows the code to
>> read more linearly instead of like some sort of loop.
>> 
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
 ...
> Acked-by: Eric Dumazet <edumazet@google.com>

Applied.

^ permalink raw reply

* Re: [v2 PATCH 1/4] tcp: Fix truesize accounting in tcp_try_coalesce
From: David Miller @ 2012-05-03  8:21 UTC (permalink / raw)
  To: eric.dumazet; +Cc: alexander.h.duyck, netdev, jeffrey.t.kirsher, edumazet
In-Reply-To: <1336031334.3503.23.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 03 May 2012 09:48:54 +0200

> On Thu, 2012-05-03 at 00:18 -0700, Alexander Duyck wrote:
>> This patch addresses several issues in the way we were tracking the
>> truesize in tcp_try_coalesce.
>> 
>> First it was using ksize which prevents us from having a 0 sized head frag
>> and getting a usable result.  To resolve that this patch uses the end
>> pointer which is set based off either ksize, or the frag_size supplied in
>> build_skb.  This allows us to compute the original truesize of the entire
>> buffer and remove that value leaving us with just what was added as pages.
>> 
>> The second issue was the use of skb->len if there is a mergeable head frag.
>> We should only need to remove the size of a data-aligned sk_buff from our
>> current skb->truesize to compute the delta for a buffer with a reused head.
>> By using skb->len the value of truesize was being artificially reduced
>> which means that head frags could use more memory than buffers using
>> standard allocations.
>> 
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
 ...
> Acked-by: Eric Dumazet <edumazet@google.com>

Applied.

^ permalink raw reply

* [PATCH] bnx2x: bug fix when loading after SAN boot
From: Ariel Elior @ 2012-05-03  8:22 UTC (permalink / raw)
  To: ariele, davem, netdev; +Cc: Eilon Greenstein

This is a bug fix for an "interface fails to load" issue.
The issue occurs when bnx2x driver loads after UNDI driver was previously
loaded over the chip. In such a scenario the UNDI driver is loaded and operates
in the pre-boot kernel, within its own specific host memory address range.
When the pre-boot stage is complete, the real kernel is loaded, in a new and
distinct host memory address range. The transition from pre-boot stage to boot
is asynchronous from UNDI point of view.

A race condition occurs when UNDI driver triggers a DMAE transaction to valid
host addresses in the pre-boot stage, when control is diverted to the real
kernel. This results in access to illegal addresses by our HW as the addresses
which were valid in the preboot stage are no longer considered valid.
Specifically, the 'was_error' bit in the PCI glue of our device is set. This
causes all following PCI transactions from chip to host to time out (in
accordance with the PCI spec).

Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |   23 +++++++++++++++++++++-
 1 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index e077d25..795fc49 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -9122,13 +9122,34 @@ static int __devinit bnx2x_prev_unload_common(struct bnx2x *bp)
 	return bnx2x_prev_mcp_done(bp);
 }

+/* previous driver DMAE transaction may have occurred when pre-boot stage ended
+ * and boot began, or when kdump kernel was loaded. Either case would invalidate
+ * the addresses of the transaction, resulting in was-error bit set in the pci
+ * causing all hw-to-host pcie transactions to timeout. If this happened we want
+ * to clear the interrupt which detected this from the pglueb and the was done
+ * bit
+ */
+static void __devinit bnx2x_prev_interrupted_dmae(struct bnx2x *bp)
+{
+	u32 val = REG_RD(bp, PGLUE_B_REG_PGLUE_B_INT_STS);
+	if (val & PGLUE_B_PGLUE_B_INT_STS_REG_WAS_ERROR_ATTN) {
+		BNX2X_ERR("was error bit was found to be set in pglueb upon startup. Clearing");
+		REG_WR(bp, PGLUE_B_REG_WAS_ERROR_PF_7_0_CLR, 1 << BP_FUNC(bp));
+	}
+}
+
 static int __devinit bnx2x_prev_unload(struct bnx2x *bp)
 {
 	int time_counter = 10;
 	u32 rc, fw, hw_lock_reg, hw_lock_val;
 	BNX2X_DEV_INFO("Entering Previous Unload Flow\n");

-       /* Release previously held locks */
+	/* clear hw from errors which may have resulted from an interrupted
+	 * dmae transaction.
+	 */
+	bnx2x_prev_interrupted_dmae(bp);
+
+	/* Release previously held locks */
 	hw_lock_reg = (BP_FUNC(bp) <= 5) ?
 		      (MISC_REG_DRIVER_CONTROL_1 + BP_FUNC(bp) * 8) :
 		      (MISC_REG_DRIVER_CONTROL_7 + (BP_FUNC(bp) - 6) * 8);
-- 
1.7.9.GIT

^ permalink raw reply related
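
The patch above uses two per-function computations: the clear value
1 << BP_FUNC(bp) written to the was-error clear register, and the choice of
hardware lock register by function number. A minimal sketch of both follows.
The two offsets are placeholders chosen for illustration (the real values
are defined in the bnx2x register headers), and the function names are
hypothetical.

```c
#include <stdint.h>

/* Placeholder offsets for illustration only, not the real register map. */
#define DRIVER_CONTROL_1_OFFSET 0x1000u
#define DRIVER_CONTROL_7_OFFSET 0x2000u

/* One clear bit per PCI function, as in 1 << BP_FUNC(bp). */
uint32_t was_error_clear_mask(unsigned int func)
{
	return 1u << func;
}

/*
 * Functions 0-5 use the bank starting at CONTROL_1, higher functions
 * the bank starting at CONTROL_7; entries are 8 bytes apart.
 */
uint32_t hw_lock_reg_offset(unsigned int func)
{
	return (func <= 5) ? DRIVER_CONTROL_1_OFFSET + func * 8
			   : DRIVER_CONTROL_7_OFFSET + (func - 6) * 8;
}
```

Because each function owns one bit, clearing the error state for one
function leaves the state of the other functions untouched.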

* Re: [PATCH 3/8] Sometimes the ISDN chip only controls the D-channel
From: David Miller @ 2012-05-03  8:08 UTC (permalink / raw)
  To: kkeil; +Cc: netdev
In-Reply-To: <4FA23438.6090103@linux-pingi.de>

From: Karsten Keil <kkeil@linux-pingi.de>
Date: Thu, 03 May 2012 09:31:04 +0200

> PCM-only mode needs a special protocol and a mechanism to set/get/store
> the PCM slots of the card, and that is what the extra stuff is used for.

It changed the values of some macros which are actually used by the
code.

Then it adds members to structures, and defines, which are completely
unused.

The latter part is completely bogus.

This is the second time I'm saying this again.  I'm not saying it
a third time, instead I'll just ignore you.

^ permalink raw reply

* Re: [v2 PATCH 3/4] tcp: move stats merge to the end of tcp_try_coalesce
From: Eric Dumazet @ 2012-05-03  7:52 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, davem, jeffrey.t.kirsher, edumazet
In-Reply-To: <20120503071909.13636.43086.stgit@gitlad.jf.intel.com>

On Thu, 2012-05-03 at 00:19 -0700, Alexander Duyck wrote:
> This change cleans up the last bits of tcp_try_coalesce so that we only
> need one goto which jumps to the end of the function.  The idea is to make
> the code more readable by putting things in a linear order so that we start
> execution at the top of the function, and end it at the bottom.
> 
> I also made a slight tweak to the code for handling frags when we are a
> clone.  Instead of making it an if (clone) loop else nr_frags = 0 I changed
> the logic so that if (!clone) we just set the number of frags to 0 which
> disables the for loop anyway.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
> 
>  net/ipv4/tcp_input.c |   55 ++++++++++++++++++++++++++------------------------
>  1 files changed, 29 insertions(+), 26 deletions(-)


Thanks a lot Alex, this patch series looks very good.

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply
