Netdev List

Netdev List
 help / color / mirror / Atom feed

* [net-next.git (v2)] stmmac: remove two useless initialisations
From: Giuseppe CAVALLARO @ 2012-06-04 16:36 UTC (permalink / raw)
  To: netdev; +Cc: Giuseppe Cavallaro

This patch removes two useless initialisations in the
stmmac_rx and stmmac_tx functions.
In the former, the count variable was reset twice and in
the stmmac_tx we only need to increment the dirty pointer
w/o setting the entry variable.

v2: review the subject and comment that was not clear in my
first version.

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 7096633..0caae72 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -677,7 +677,7 @@ static void stmmac_tx(struct stmmac_priv *priv)
 
 		priv->hw->desc->release_tx_desc(p);
 
-		entry = (++priv->dirty_tx) % txsize;
+		priv->dirty_tx++;
 	}
 	if (unlikely(netif_queue_stopped(priv->dev) &&
 		     stmmac_tx_avail(priv) > STMMAC_TX_THRESH(priv))) {
@@ -1307,7 +1307,6 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit)
 		display_ring(priv->dma_rx, rxsize);
 	}
 #endif
-	count = 0;
 	while (!priv->hw->desc->get_rx_owner(p)) {
 		int status;
 
-- 
1.7.4.4

^ permalink raw reply related

* Re: [RFC PATCH v1 2/3] net: add VEPA, VEB bridge mode
From: John Fastabend @ 2012-06-04 16:38 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: bhutchings, buytenh, eilong, eric.w.multanen, gregory.v.rose,
	hadi, jeffrey.t.kirsher, mst, netdev, shemminger, sri
In-Reply-To: <OF0D72A103.6F322B98-ON65257A13.00500E50-65257A13.005272D6@in.ibm.com>

On 6/4/2012 7:59 AM, Krishna Kumar2 wrote:
> John Fastabend <john.r.fastabend@intel.com> wrote on 05/30/2012 08:37:06
> AM:
> 
> Some comments below:
> 
>> +static int rtnl_bridge_notify(struct net_device *dev, u16 flags)
>> +{
>> ...
>> +		 if (!flags && master && master->netdev_ops->
> ndo_bridge_getlink)
>> +		 		 err = master->netdev_ops->ndo_bridge_getlink(skb,
> 0, 0, dev);
>> +		 else if (dev->netdev_ops->ndo_bridge_getlink)
>> +		 		 err = dev->netdev_ops->ndo_bridge_getlink(skb, 0,
> 0, dev);
> 
> I think you should do something like:
> 
>         if ((flags == BRIDGE_FLAGS_MASTER) && ...)
>                 ...
> 
> Also you could use BRIDGE_FLAGS_MASTER=1, SELF=2, and use
> "if (flags & BRIDGE_FLAGS_MASTER)" for consistency?

OK this is likely a good thing otherwise user space is a
bit tedious when managing FDB and bridge modes. We do still
need the !flags case to support existing applications though,
(we must maintain existing semantics)

if (!flags || (flags & BRIDGE_FLAGS_MASTER) && ...)
	...
else (flags & BRIDGE_FLAGS_SELF)
	...

> 
> 
> +		 if (!err)
> +		 		 err = rtnl_bridge_notify(dev, flags);
> 
> It is possible to return a reporting error even though
> the operation succeeded. Maybe something that could be
> done here to indicate that the operation succeeded, or
> is that a TODO?
> 

The problem is if rtnl_bridge_notify fails due to memory
constraints or otherwise. In this case the set has already
completed successfully as you note so we should not return
any error. This should fix it if I understand your concern
correctly.

	if (!err)
		rtnl_bridge_notify(dev, flags);
	return err;



>>  static int rtnl_bridge_setlink(struct sk_buff *skb, struct nlmsghdr
> *nlh,
>>                  void *arg)
>>  {
> ..
>> +   if (!flags && dev->master &&
>> +       dev->master->netdev_ops->ndo_bridge_setlink)
>> +      err = dev->master->netdev_ops->ndo_bridge_setlink(dev, nlh);
>> +   else if ((flags & BRIDGE_FLAGS_SELF) &&
>> +         dev->netdev_ops->ndo_bridge_setlink)
> 
> Same usage of MASTER here.

Agreed. Thanks.

> 
> Thanks,
> - KK
> 

^ permalink raw reply

* [PATCH] net/hyperv: Use wait_event on outstanding sends during device removal
From: Haiyang Zhang @ 2012-06-04 16:42 UTC (permalink / raw)
  To: davem, netdev; +Cc: haiyangz, kys, olaf, linux-kernel, devel

Change the busy-waiting/udelay to wait_event on outstanding sends.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
---
 drivers/net/hyperv/hyperv_net.h |    1 +
 drivers/net/hyperv/netvsc.c     |   12 ++++++------
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 4ffcd57..2857ab0 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -478,6 +478,7 @@ struct netvsc_device {
 	u32 nvsp_version;
 
 	atomic_t num_outstanding_sends;
+	wait_queue_head_t wait_drain;
 	bool start_remove;
 	bool destroy;
 	/*
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 8b91947..0c56983 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -42,6 +42,7 @@ static struct netvsc_device *alloc_net_device(struct hv_device *device)
 	if (!net_device)
 		return NULL;
 
+	init_waitqueue_head(&net_device->wait_drain);
 	net_device->start_remove = false;
 	net_device->destroy = false;
 	net_device->dev = device;
@@ -387,12 +388,8 @@ int netvsc_device_remove(struct hv_device *device)
 	spin_unlock_irqrestore(&device->channel->inbound_lock, flags);
 
 	/* Wait for all send completions */
-	while (atomic_read(&net_device->num_outstanding_sends)) {
-		dev_info(&device->device,
-			"waiting for %d requests to complete...\n",
-			atomic_read(&net_device->num_outstanding_sends));
-		udelay(100);
-	}
+	wait_event(net_device->wait_drain,
+		   atomic_read(&net_device->num_outstanding_sends) == 0);
 
 	netvsc_disconnect_vsp(net_device);
 
@@ -486,6 +483,9 @@ static void netvsc_send_completion(struct hv_device *device,
 		num_outstanding_sends =
 			atomic_dec_return(&net_device->num_outstanding_sends);
 
+		if (net_device->destroy && num_outstanding_sends == 0)
+			wake_up(&net_device->wait_drain);
+
 		if (netif_queue_stopped(ndev) && !net_device->start_remove &&
 			(hv_ringbuf_avail_percent(&device->channel->outbound)
 			> RING_AVAIL_PERCENT_HIWATER ||
-- 
1.7.4.1

^ permalink raw reply related

* Re: [PATCH RFC] c_can_pci: generic module for c_can on PCI
From: Alessandro Rubini @ 2012-06-04 16:45 UTC (permalink / raw)
  To: alan
  Cc: federico.vaga, wg, mkl, giancarlo.asnaghi, alan, linux-can,
	netdev, linux-kernel
In-Reply-To: <20120604165619.15ba43bf@pyramind.ukuu.org.uk>

> Anythign wrong with 
> 
> bool aligned32;

I personally think booleans are evil.  But both this and the other
thing:

>> +static u16 c_can_pci_read_reg_aligned_to_16bit(struct c_can_priv *priv,
>> +						void *reg)
> 
> I'm a bit worried this function name might be too short ;)

come from the platform driver this is based on (I already blamed
federico offlist for not preserving authorship of the original file).

So, this file is mostly copied from the platform driver, which is a
duplication of code.  A mandated duplication, given how the thing
is currently laid out: the c_can core driver exports functions that
the other two files are using (the platform and the new pci driver).

In my opinion, it would be much better to have one less layer and no
exports at all. The core driver should be a platform driver, and the
pci driver would just build platform data and register the platform
device.

Sure this isn't up to federico, who has the pci device but cannot
access any boards where the previous driver is used.  What do the
maintainers think? I (or federico :) may propose a reshaping, if
the idea makes sense.

/alessandro, another user of the sta2x11 where c_can_pci lives

^ permalink raw reply

* Re: [PATCH (net.git) 3/3] stmmac: fix driver Kconfig when built as module
From: Ben Hutchings @ 2012-06-04 16:45 UTC (permalink / raw)
  To: Giuseppe CAVALLARO; +Cc: netdev, Rayagond Kokatanur
In-Reply-To: <1338827531-10376-4-git-send-email-peppe.cavallaro@st.com>

On Mon, 2012-06-04 at 18:32 +0200, Giuseppe CAVALLARO wrote:
> This patches fixes the driver when built as dyn module.
> In fact the platform part cannot be built and the probe fails
> (thanks to Bob Liu that reported this bug).
> 
> v2: as D. Miller suggested, it is not necessary to make the
> pci and the platform code mutually exclusive.
> Having both could also help, at built time ,to verify that
> all the code is validated and compiles fine.
> 
> Reported-by: Bob Liu <lliubbo@gmail.com>
> Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
> Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
[...]

What I said in <1337687765.11796.11.camel@deadeye> was:

That's because CONFIG_STMMAC_PLATFORM and CONFIG_STMMAC_PCI are wrongly
declared as tristate in Kconfig.  Change them to bool and it should work
again.

You can't add a 'Reviewed-by' line on the basis of this; you can only
add it if a reviewer writes that line themselves (and if you don't make
further changes).

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH RFC] c_can_pci: generic module for c_can on PCI
From: Federico Vaga @ 2012-06-04 16:45 UTC (permalink / raw)
  To: Alan Cox
  Cc: Wolfgang Grandegger, Marc Kleine-Budde, Giancarlo Asnaghi,
	Alan Cox, Alessandro Rubini, linux-can, netdev, linux-kernel
In-Reply-To: <20120604165619.15ba43bf@pyramind.ukuu.org.uk>

> > +static u16 c_can_pci_read_reg_aligned_to_16bit(struct c_can_priv
> > *priv, +						void *reg)
> 
> I'm a bit worried this function name might be too short ;)

I know :) I was inspired by the same function in c_can_platform.c


About these function I suggest to move them into c_can.c because they 
are the same for c_can_platform.c and c_can_pci.c Then add a new field 
c_can_priv->offset which can be used to shift the register offset 
coherently with the memory alignment. Finally, remove c_can_priv-
>read_reg and c_can_priv->write_reg and use internal c_can.c function to 
read and write registers.

static u16 c_can_read_reg(struct c_can_priv *priv, enum reg index)
{
	return readw(priv->base + (priv->regs[index] << priv->offset));
}
static void c_can_write_reg(struct c_can_priv *priv, enum reg index,
						u16 val)
{
	writew(val, priv->base + (priv->regs[index] << priv->offset));
}


If it's ok, I can made a patch for this in the next days.

> > +	 * do not call pci_disable_device on sta2x11 because it
> > +	 * break all other Bus masters on this EP
> > +	 */
> > +	if(pdev->vendor == PCI_VENDOR_ID_STMICRO &&
> > +	   pdev->device == PCI_DEVICE_ID_STMICRO_CAN)
> > +		goto out;
> 
> Is that the disabling or the dropping it into D3. We have a PCI quirk
> flag for the latter. See "quirk_no_ata_d3". That will also avoid any
> accidents elsewhere. Right now the quirk has "ata" in the name but the
> ata is just historically because we had to quirk various disk
> controllers.

We are investigating if this is still necessary on the current version 
of the board.

-- 
Federico Vaga

^ permalink raw reply

* Re: [PATCH 06/24] TTY: ircomm, revamp locking
From: Alan Cox @ 2012-06-04 16:57 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: gregkh, alan, linux-kernel, jirislaby, Samuel Ortiz, netdev
In-Reply-To: <1338809738-18967-7-git-send-email-jslaby@suse.cz>

On Mon,  4 Jun 2012 13:35:20 +0200
Jiri Slaby <jslaby@suse.cz> wrote:

> Use self->spinlock only for ctrl_skb and tx_skb. TTY stuff is now
> protected by tty_port->lock. This is needed for further cleanup (and
> conversion to tty_port helpers).
> 
> This also closes the race in the end of close.

Ah ok.. fixed here

^ permalink raw reply

* [PATCH] net/9p: Add __force to cast of __user pointer
From: Joe Perches @ 2012-06-04 17:16 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David S. Miller, netdev
In-Reply-To: <1338825881.3979.203.camel@deadeye>

A recent commit that removed unnecessary casts of pointers
to the same type uncovered a missing __force cast.

Add it.

Reported by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Joe Perches <joe@perches.com>
---
 net/9p/client.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/9p/client.c b/net/9p/client.c
index 5cbea90..8260f13 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -1548,7 +1548,7 @@ p9_client_read(struct p9_fid *fid, char *data, char __user *udata, u64 offset,
 			kernel_buf = 1;
 			indata = data;
 		} else
-			indata = udata;
+			indata = (__force char *)udata;
 		/*
 		 * response header len is 11
 		 * PDU Header(7) + IO Size (4)

^ permalink raw reply related

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
From: Michael S. Tsirkin @ 2012-06-04 17:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev
In-Reply-To: <1338822064.2760.1834.camel@edumazet-glaptop>

On Mon, Jun 04, 2012 at 05:01:04PM +0200, Eric Dumazet wrote:
> On Mon, 2012-06-04 at 17:17 +0300, Michael S. Tsirkin wrote:
> 
> > bnx2 and tg3 both do skb_reserve of at least NET_SKB_PAD
> > after build_skb. You are saying it's not a must?
> > 
> 
> 32 would be the minimum. NETS_SKB_PAD is using a cache line (64 bytes
> on most x86 current cpus) to avoid using half a cache line.

OK so if we want to use build_skb we need to pad by at least
32 - 12 bytes or more likely 64-12. We lose 52 bytes per page
but save a copy of 128 bytes for medium sized packets.

> > Hmm so maybe we should teach the hypervisor to write data
> > out at an offset. Interesting.
> > 
> > Another question is about very small packets truesize.
> > build_skb sets truesize to frag_size but isn't
> > this too small? We keep the whole page around, no?
> 
> We keep one page per cpu, at most.
> 
> For example on MTU=1500 and PAGE_SIZE=4096, one page is splitted into
> two fragments, of 1500 + NET_SKB_PAD + align(shared_info), so its good
> enough (this is very close from 2048 'truesize')

I see. virtio allocates full 4K pages which works well for GSO
but it means smaller packets would get large truesize
wasting lots of memory.
So maybe we should keep copying packets < 128 bytes.

> But yes, for some uses (wifi for example), we might use a full page per
> skb, yet underestimate skb->truesize. Hopefully we can track these uses
> and fix them.
> 
> ath9k for example could be changed, to be able to reassemble up to 3
> frags instead of 2 frags, ie extending what did commit
> 0d95521ea74735826cb2e28bebf6a07392c75bfa (ath9k: use split rx
> buffers to get rid of order-1 skb allocations)
> 
> 

^ permalink raw reply

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
From: Eric Dumazet @ 2012-06-04 17:44 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev
In-Reply-To: <20120604172030.GA32205@redhat.com>

On Mon, 2012-06-04 at 20:20 +0300, Michael S. Tsirkin wrote:

> I see. virtio allocates full 4K pages which works well for GSO
> but it means smaller packets would get large truesize
> wasting lots of memory.
> So maybe we should keep copying packets < 128 bytes.

Yet, but restrict the copy to bare minimum if > 128 bytes

See following commit for details (actual changelog bigger than patch
itself)

commit 56138f50d1900b0c3d8647376e37b488b23ba53d
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri May 18 04:48:33 2012 +0000

    iwlwifi: dont pull too much payload in skb head
    
    As iwlwifi use fat skbs, it should not pull too much data in skb->head,
    and particularly no tcp data payload, or splice() is slower, and TCP
    coalescing is disabled. Copying payload to userland also involves at
    least two copies (part from header, part from fragment)
    
    Each layer will pull its header from the fragment as needed.
    
    (on 64bit arches, skb_tailroom(skb) at this point is 192 bytes)
    
    With this patch applied, I have a major reduction of collapsed/pruned
    TCP packets, a nice increase of TCPRcvCoalesce counter, and overall
    better Internet User experience.
    
    Small packets are still using a fragless skb, so that page can be reused
    by the driver.
    

^ permalink raw reply

* Re: [PATCH net-next 5/5] gianfar_ethtool: coding style and whitespace cleanups
From: David Miller @ 2012-06-04 18:08 UTC (permalink / raw)
  To: jan.ceuleers; +Cc: b06378, joe, netdev
In-Reply-To: <1338827516-18425-6-git-send-email-jan.ceuleers@computer.org>

From: Jan Ceuleers <jan.ceuleers@computer.org>
Date: Mon,  4 Jun 2012 18:31:56 +0200

> -					sizeof(struct gfar_mask_entry),
> -					gfar_comp, &gfar_swap);
> +			     sizeof(struct gfar_mask_entry),
> +			     far_comp, &gfar_swap);

Don't send me crap you haven't even compile tested.

I'm tossing this entire patch series out, don't resubmit until you've
tested every single patch.

^ permalink raw reply

* Re: [PATCH (net.git) 0/3 (v2)] stmmac fixes for net.git
From: David Miller @ 2012-06-04 18:11 UTC (permalink / raw)
  To: peppe.cavallaro; +Cc: netdev
In-Reply-To: <1338827531-10376-1-git-send-email-peppe.cavallaro@st.com>

From: Giuseppe CAVALLARO <peppe.cavallaro@st.com>
Date: Mon,  4 Jun 2012 18:32:08 +0200

> These patches fix a problem in the driver when built as dynamic
> module and fix the driver's documentation.
> 
> v2: this patchset has the same patches I had sent before but
>     I removed a patch that did a cleanup (now moved for net-next).
> 
> Giuseppe Cavallaro (3):
>   stmmac: fix driver's doc when run kernel-doc script
>   stmmac: update driver's doc
>   stmmac: fix driver Kconfig when built as module

Still not right, see Ben's feedback.

^ permalink raw reply

* Re: [PATCH] net/hyperv: Use wait_event on outstanding sends during device removal
From: David Miller @ 2012-06-04 18:11 UTC (permalink / raw)
  To: haiyangz; +Cc: netdev, kys, olaf, linux-kernel, devel
In-Reply-To: <1338828158-3133-1-git-send-email-haiyangz@microsoft.com>

From: Haiyang Zhang <haiyangz@microsoft.com>
Date: Mon,  4 Jun 2012 09:42:38 -0700

> Change the busy-waiting/udelay to wait_event on outstanding sends.
> 
> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>

Applied.

^ permalink raw reply

* Re: [net-next.git (v2)] stmmac: remove two useless initialisations
From: David Miller @ 2012-06-04 18:12 UTC (permalink / raw)
  To: peppe.cavallaro; +Cc: netdev
In-Reply-To: <1338827782-11877-1-git-send-email-peppe.cavallaro@st.com>

From: Giuseppe CAVALLARO <peppe.cavallaro@st.com>
Date: Mon,  4 Jun 2012 18:36:22 +0200

> This patch removes two useless initialisations in the
> stmmac_rx and stmmac_tx functions.
> In the former, the count variable was reset twice and in
> the stmmac_tx we only need to increment the dirty pointer
> w/o setting the entry variable.
> 
> v2: review the subject and comment that was not clear in my
> first version.
> 
> Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>

Applied.

^ permalink raw reply

* Re: [PATCH] net/9p: Add __force to cast of __user pointer
From: David Miller @ 2012-06-04 18:12 UTC (permalink / raw)
  To: joe; +Cc: bhutchings, netdev
In-Reply-To: <1338830174.15683.2.camel@joe2Laptop>

From: Joe Perches <joe@perches.com>
Date: Mon, 04 Jun 2012 10:16:14 -0700

> A recent commit that removed unnecessary casts of pointers
> to the same type uncovered a missing __force cast.
> 
> Add it.
> 
> Reported by: Ben Hutchings <bhutchings@solarflare.com>
> Signed-off-by: Joe Perches <joe@perches.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
From: Michael S. Tsirkin @ 2012-06-04 18:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev
In-Reply-To: <1338822064.2760.1834.camel@edumazet-glaptop>

On Mon, Jun 04, 2012 at 05:01:04PM +0200, Eric Dumazet wrote:
> On Mon, 2012-06-04 at 17:17 +0300, Michael S. Tsirkin wrote:
> 
> > bnx2 and tg3 both do skb_reserve of at least NET_SKB_PAD
> > after build_skb. You are saying it's not a must?
> > 
> 
> 32 would be the minimum. NETS_SKB_PAD is using a cache line (64 bytes
> on most x86 current cpus) to avoid using half a cache line.
> 
> > Hmm so maybe we should teach the hypervisor to write data
> > out at an offset. Interesting.
> > 
> > Another question is about very small packets truesize.
> > build_skb sets truesize to frag_size but isn't
> > this too small? We keep the whole page around, no?
> 
> We keep one page per cpu, at most.
> 
> For example on MTU=1500 and PAGE_SIZE=4096, one page is splitted into
> two fragments, of 1500 + NET_SKB_PAD + align(shared_info), so its good
> enough (this is very close from 2048 'truesize')

Yes but if a tcp socket then hangs on, on one of the fragments,
while the other has been freed, the whole page is still
never reused, right?

Doesn't this mean truesize should be 4K?

-- 
MST

^ permalink raw reply

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
From: Michael S. Tsirkin @ 2012-06-04 18:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev
In-Reply-To: <1338831890.2760.1842.camel@edumazet-glaptop>

On Mon, Jun 04, 2012 at 07:44:50PM +0200, Eric Dumazet wrote:
> On Mon, 2012-06-04 at 20:20 +0300, Michael S. Tsirkin wrote:
> 
> > I see. virtio allocates full 4K pages which works well for GSO
> > but it means smaller packets would get large truesize
> > wasting lots of memory.
> > So maybe we should keep copying packets < 128 bytes.
> 
> Yet, but restrict the copy to bare minimum if > 128 bytes
> 
> See following commit for details (actual changelog bigger than patch
> itself)
> 
> commit 56138f50d1900b0c3d8647376e37b488b23ba53d
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Fri May 18 04:48:33 2012 +0000
> 
>     iwlwifi: dont pull too much payload in skb head
>     
>     As iwlwifi use fat skbs, it should not pull too much data in skb->head,
>     and particularly no tcp data payload, or splice() is slower, and TCP
>     coalescing is disabled. Copying payload to userland also involves at
>     least two copies (part from header, part from fragment)
>     
>     Each layer will pull its header from the fragment as needed.
>     
>     (on 64bit arches, skb_tailroom(skb) at this point is 192 bytes)
>     
>     With this patch applied, I have a major reduction of collapsed/pruned
>     TCP packets, a nice increase of TCPRcvCoalesce counter, and overall
>     better Internet User experience.
>     
>     Small packets are still using a fragless skb, so that page can be reused
>     by the driver.
>     

Will take a look, thanks.
By the way, this comment at build_skb:
* @frag_size: size of fragment, or 0 if head was kmalloced

is not very clear to me. Could you clarify what exactly size
of fragment means in this context?

^ permalink raw reply

* Re: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
From: Damian Lukowski @ 2012-06-04 17:50 UTC (permalink / raw)
  To: Jerry Chu; +Cc: Netdev, David Miller, Ilpo Järvinen
In-Reply-To: <CAPshTCiFDUb3CVm-joJW_DRVS+C+_BjFO-Mt3uAwrCzVw8MsVg@mail.gmail.com>

Hi Jerry,

please verify, I understood you correctly.

You have set TCP_RTO_MIN to a lower value, e.g. 0.002 seconds to improve
your internal low-latency traffic. Because of the improvement, R1
timeouts are triggered too fast for external high-RTT traffic. Is that
correct?
If so, may I suggest to set tcp_retries1 to a higher value? For
TCP_RTO_MIN == 0.002 and tcp_retries1 ==  10, R1 will be calculated to
approximately 4 seconds.

Is that ok?

Best regards
 Damian

Am Freitag, den 01.06.2012, 15:58 -0700 schrieb Jerry Chu:
> > From: Damian Lukowski <damian@tvk.rwth-aachen.de>
> > Date: Wed, Aug 26, 2009 at 3:16 AM
> > Subject: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close
> > threshold as a time value.
> > To: Netdev <netdev@vger.kernel.org>
> >
> >
> > RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
> > which may represent a number of allowed retransmissions or a timeout value.
> > Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
> > in number of allowed retransmissions.
> >
> > For any desired threshold R2 (by means of time) one can specify tcp_retries2
> > (by means of number of retransmissions) such that TCP will not time out
> > earlier than R2. This is the case, because the RTO schedule follows a fixed
> > pattern, namely exponential backoff.
> >
> > However, the RTO behaviour is not predictable any more if RTO backoffs can
> > be
> > reverted, as it is the case in the draft
> > "Make TCP more Robust to Long Connectivity Disruptions"
> > (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
> >
> > In the worst case TCP would time out a connection after 3.2 seconds, if the
> > initial RTO equaled MIN_RTO and each backoff has been reverted.
> >
> > This patch introduces a function retransmits_timed_out(N),
> > which calculates the timeout of a TCP connection, assuming an initial
> > RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
> >
> > Whenever timeout decisions are made by comparing the retransmission counter
> > to some value N, this function can be used, instead.
> >
> > The meaning of tcp_retries2 will be changed, as many more RTO
> > retransmissions
> > can occur than the value indicates. However, it yields a timeout which is
> > similar to the one of an unpatched, exponentially backing off TCP in the
> > same
> > scenario. As no application could rely on an RTO greater than MIN_RTO, there
> > should be no risk of a regression.
> 
> This looks like a typical "fix one problem, introducing a few more" patch :(.
> What do you mean by "no application could rely on an RTO greater than
> MIN_RTO..."
> above? How can you make the assumption that RTO is not too far off
> from TCP_RTO_MIN?
> 
> While you tried to address a problem where the retransmission count
> was high but the actual
> timeout duration was too short, have you considered the other case
> around, i.e., the timeout
> duration is long but the retransmission count is too short? This is
> exactly what's happening
> to us with your patch. We've much reduced TCP_RTO_MIN for our internal
> traffic, but not
> noticing your change has severely shortened the R1 & R2 recommended by
> RFC1122 for our
> long haul traffic until now. In many cases R1 threshold was met upon
> the first retrans timeout.
> 
> I think retransmits_timed_out() should check against both time
> duration and retrans count
> (icsk_retransmits).
> 
> Thought?
> 
> Jerry
> 
> >
> > Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
> > ---
> >  include/net/tcp.h    |   18 ++++++++++++++++++
> >  net/ipv4/tcp_timer.c |   11 +++++++----
> >  2 files changed, 25 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > index c35b329..17d1a88 100644
> > --- a/include/net/tcp.h
> > +++ b/include/net/tcp.h
> > @@ -1247,6 +1247,24 @@ static inline struct sk_buff
> > *tcp_write_queue_prev(struct sock *sk, struct sk_bu
> >  #define tcp_for_write_queue_from_safe(skb, tmp, sk)                    \
> >        skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
> >
> > +static inline bool retransmits_timed_out(const struct sock *sk,
> > +                                        unsigned int boundary)
> > +{
> > +       int limit, K;
> > +       if (!inet_csk(sk)->icsk_retransmits)
> > +               return false;
> > +
> > +       K = ilog2(TCP_RTO_MAX/TCP_RTO_MIN);
> > +
> > +       if (boundary <= K)
> > +               limit = ((2 << boundary) - 1) * TCP_RTO_MIN;
> > +       else
> > +               limit = ((2 << K) - 1) * TCP_RTO_MIN +
> > +                       (boundary - K) * TCP_RTO_MAX;
> > +
> > +       return (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= limit;
> > +}
> > +
> >  static inline struct sk_buff *tcp_send_head(struct sock *sk)
> >  {
> >        return sk->sk_send_head;
> > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > index a3ba494..2972d7b 100644
> > --- a/net/ipv4/tcp_timer.c
> > +++ b/net/ipv4/tcp_timer.c
> > @@ -137,13 +137,14 @@ static int tcp_write_timeout(struct sock *sk)
> >  {
> >        struct inet_connection_sock *icsk = inet_csk(sk);
> >        int retry_until;
> > +       bool do_reset;
> >
> >        if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
> >                if (icsk->icsk_retransmits)
> >                        dst_negative_advice(&sk->sk_dst_cache);
> >                retry_until = icsk->icsk_syn_retries ? :
> > sysctl_tcp_syn_retries;
> >        } else {
> > -               if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
> > +               if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
> >                        /* Black hole detection */
> >                        tcp_mtu_probing(icsk, sk);
> >
> > @@ -155,13 +156,15 @@ static int tcp_write_timeout(struct sock *sk)
> >                        const int alive = (icsk->icsk_rto < TCP_RTO_MAX);
> >
> >                        retry_until = tcp_orphan_retries(sk, alive);
> > +                       do_reset = alive ||
> > +                                  !retransmits_timed_out(sk, retry_until);
> >
> > -                       if (tcp_out_of_resources(sk, alive ||
> > icsk->icsk_retransmits < retry_until))
> > +                       if (tcp_out_of_resources(sk, do_reset))
> >                                return 1;
> >                }
> >        }
> >
> > -       if (icsk->icsk_retransmits >= retry_until) {
> > +       if (retransmits_timed_out(sk, retry_until)) {
> >                /* Has it gone just too far? */
> >                tcp_write_err(sk);
> >                return 1;
> > @@ -385,7 +388,7 @@ void tcp_retransmit_timer(struct sock *sk)
> >  out_reset_timer:
> >        icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
> >        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto,
> > TCP_RTO_MAX);
> > -       if (icsk->icsk_retransmits > sysctl_tcp_retries1)
> > +       if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1))
> >                __sk_dst_reset(sk);
> >
> >  out:;
> > --
> > 1.6.3.3
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >

^ permalink raw reply

* tcp wifi upload performance and lots of ACKs
From: Ben Greear @ 2012-06-04 18:29 UTC (permalink / raw)
  To: netdev

I'm going some TCP performance testing on wifi -> LAN interface connections.  With
UDP, we can get around 250Mbps of payload throughput.  With TCP, max is about 80Mbps.

I think the problem is that there are way too many ACK packets, and bi-directional
traffic on wifi interfaces really slows things down.  (About 7000 pkts per second in
upload direction, 2000 pps download.  And the vast majority of the download pkts
are 66 byte ACK pkts from what I can tell.)

Kernel is 3.3.7+

Anyone know of any tuning parameters that would let the receiving socket wait a
bit longer and send more ACK data in fewer packets?

Packet traces and other info available if anyone wants to take a look.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* Re: [PATCH iproute2] Add reference to tc-codel(8) to the SEE ALSO section
From: Stephen Hemminger @ 2012-06-04 19:03 UTC (permalink / raw)
  To: Jan Ceuleers; +Cc: andyqos, netdev
In-Reply-To: <1338666310-18744-1-git-send-email-jan.ceuleers@computer.org>

On Sat,  2 Jun 2012 21:45:10 +0200
Jan Ceuleers <jan.ceuleers@computer.org> wrote:

> Reported-by: Andy Furniss <andyqos@ukfsn.org>
> Signed-off-by: Jan Ceuleers <jan.ceuleers@computer.org>
> ---
>  man/man8/tc.8 |    1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/man/man8/tc.8 b/man/man8/tc.8
> index 6576377..14a1cd8 100644
> --- a/man/man8/tc.8
> +++ b/man/man8/tc.8
> @@ -368,6 +368,7 @@ was written by Alexey N. Kuznetsov and added in Linux 2.2.
>  .SH SEE ALSO
>  .BR tc-cbq (8),
>  .BR tc-choke (8),
> +.BR tc-codel (8),
>  .BR tc-drr (8),
>  .BR tc-htb (8),
>  .BR tc-hfsc (8),

applied.

^ permalink raw reply

* RE: [PATCH] net: compute a more reasonable default ip6_rt_max_size
From: Lubashev, Igor @ 2012-06-04 19:04 UTC (permalink / raw)
  To: David Miller, eric.dumazet@gmail.com
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Arun Sharma
In-Reply-To: <83CE6FF8F6C9B2468A618FC2C51267260F303CD88B@USMBX1.msg.corp.akamai.com>

David and Eric,

Any news about this?  We definitely have many machines that are experiencing abnormal behavior under ipv6 load.  The machines are healthier when the ipv6 route cache is increased to 64K, but I am afraid this is a band-aid that is hiding the actual problems.

So, could you address the concerns about the code in fib6_age?  I can see three potential problems with it:

1.  The meaning/use of NTF_ROUTER flag is inverted in 3.4

2.  A potential NULL-pointer exception in pre-3.4 versions.  In particular, "rt->rt6i_nexthop" (version 2.6.37) is checked for NULL in (almost?) all cases of referencing that field.  I do not know for sure about "dst_get_neighbour_raw(&rt->dst)" in 3.0.32.

3.   In all cases, an RTF_GATEWAY entry with __refcount > 0 may be garbage collected.  That seems like a wrong thing to do.  Is it?

Thank you!

- Igor

-----Original Message-----
From: Lubashev, Igor 
Sent: Wednesday, May 30, 2012 7:50 PM
To: David Miller; Arun Sharma
Cc: eric.dumazet@gmail.com; netdev@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH] net: compute a more reasonable default ip6_rt_max_size

>It's possible that there is a bug somewhere - we didn't get a chance to 
>dig deeper. What we observed is that as we got close to the 4096 limit, 
>some hosts were becoming unreachable. A modest increase in the routing 
>table size made things better.

First of all, we have observed the same thing.

While I am not an expert in this area of the routing code, the function fib6_age in net/ipv6/ip6_fib.c puzzles me.

In kernel version 2.6.37, we have net/ipv6/ip6_fib.c:
static int fib6_age(struct rt6_info *rt, void *arg) {
	unsigned long now = jiffies;

	if (rt->rt6i_flags&RTF_EXPIRES && rt->rt6i_expires) {
		if (time_after(now, rt->rt6i_expires)) {
			RT6_TRACE("expiring %p\n", rt);
			return -1;
		}
		gc_args.more++;
	} else if (rt->rt6i_flags & RTF_CACHE) {
		if (atomic_read(&rt->dst.__refcnt) == 0 &&
		    time_after_eq(now, rt->dst.lastuse + gc_args.timeout)) {
			RT6_TRACE("aging clone %p\n", rt);
			return -1;
		} else if ((rt->rt6i_flags & RTF_GATEWAY) &&
			   (!(rt->rt6i_nexthop->flags & NTF_ROUTER))) {
			RT6_TRACE("purging route %p via non-router but gateway\n",
				  rt);
			return -1;
		}
		gc_args.more++;
	}

	return 0;
}

In kernel 3.0.32, we have net/ipv6/ip6_fib.c:
static int fib6_age(struct rt6_info *rt, void *arg) {
	unsigned long now = jiffies;

	if (rt->rt6i_flags&RTF_EXPIRES && rt->rt6i_expires) {
		if (time_after(now, rt->rt6i_expires)) {
			RT6_TRACE("expiring %p\n", rt);
			return -1;
		}
		gc_args.more++;
	} else if (rt->rt6i_flags & RTF_CACHE) {
		if (atomic_read(&rt->dst.__refcnt) == 0 &&
		    time_after_eq(now, rt->dst.lastuse + gc_args.timeout)) {
			RT6_TRACE("aging clone %p\n", rt);
			return -1;
		} else if ((rt->rt6i_flags & RTF_GATEWAY) &&
			   (!(dst_get_neighbour_raw(&rt->dst)->flags & NTF_ROUTER))) {
			RT6_TRACE("purging route %p via non-router but gateway\n",
				  rt);
			return -1;
		}
		gc_args.more++;
	}

	return 0;
}

In kernel 3.4, we have net/ipv6/ip6_fib.c:
static int fib6_age(struct rt6_info *rt, void *arg) {
	unsigned long now = jiffies;

	if (rt->rt6i_flags & RTF_EXPIRES && rt->dst.expires) {
		if (time_after(now, rt->dst.expires)) {
			RT6_TRACE("expiring %p\n", rt);
			return -1;
		}
		gc_args.more++;
	} else if (rt->rt6i_flags & RTF_CACHE) {
		if (atomic_read(&rt->dst.__refcnt) == 0 &&
		    time_after_eq(now, rt->dst.lastuse + gc_args.timeout)) {
			RT6_TRACE("aging clone %p\n", rt);
			return -1;
		} else if (rt->rt6i_flags & RTF_GATEWAY) {
			struct neighbour *neigh;
			__u8 neigh_flags = 0;

			neigh = dst_neigh_lookup(&rt->dst, &rt->rt6i_gateway);
			if (neigh) {
				neigh_flags = neigh->flags;
				neigh_release(neigh);
			}
			if (neigh_flags & NTF_ROUTER) {
				RT6_TRACE("purging route %p via non-router but gateway\n",
					  rt);
				return -1;
			}
		}
		gc_args.more++;
	}

	return 0;
}

Do we have the meaning of the NTF_ROUTER flag reversed in kernel 3.4?  Or is the opposite use of that flag a fix for the bug in the previous releases? Or is this a bug in kernel 3.4?

Also, could this remove a Gateway entry, if there is no neighbor entry for it (in any of the version of the code)?  Could this try to deference a null pointer in 3.0.32 version of the code (and any version prior to 3.4)?  In general, is this the right place to remove a gateway route that has __refcnt > 0?

I wish I had more expertise in this area of the code to answer questions and not only to pose them.

Thank you,

- Igor

^ permalink raw reply

* Re: [PATCH iproute2 1/3] Add missing can.h
From: Stephen Hemminger @ 2012-06-04 19:10 UTC (permalink / raw)
  To: Rostislav Lisovy; +Cc: netdev, linux-can, lartc, pisa, sojkam1
In-Reply-To: <1338826149-11604-2-git-send-email-lisovy@gmail.com>

On Mon,  4 Jun 2012 18:09:07 +0200
Rostislav Lisovy <lisovy@gmail.com> wrote:

> This header file is slightly modified version copied from
> Linux kernel v. 3.3. It contains defines necessary for working
> with AF_CAN packets.
> 
> Signed-off-by: Rostislav Lisovy <lisovy@gmail.com>

Ok, but I will put in can.h but it will be done by copying the result
of the kernel 'make headers_install' which does the standard
removal of #ifdef kernel pieces. In iproute2, all kernel headers
used should come from this process.

^ permalink raw reply

* Re: [PATCH iproute2 0/3] CAN Filter/Classifier
From: Stephen Hemminger @ 2012-06-04 19:13 UTC (permalink / raw)
  To: Rostislav Lisovy; +Cc: netdev, linux-can, lartc, pisa, sojkam1
In-Reply-To: <1338826149-11604-1-git-send-email-lisovy@gmail.com>

On Mon,  4 Jun 2012 18:09:06 +0200
Rostislav Lisovy <lisovy@gmail.com> wrote:

> This classifier classifies CAN frames (AF_CAN) according to their
> identifiers. This functionality can not be easily achieved with
> existing classifiers, such as u32. This classifier can be used
> with any available qdisc and it is able to classify both SFF
> or EFF frames.
> 
> The filtering rules for EFF frames are stored in an array, which
> is traversed during classification. A bitmap is used to store SFF
> rules -- one bit for each ID.
> 
> More info about the project:
> http://rtime.felk.cvut.cz/can/socketcan-qdisc-final.pdf
> 
> 
> Rostislav Lisovy (3):
>   Add missing can.h
>   CAN Filter/Classifier -- Source code
>   CAN Filter/Classifier -- Documentation
> 
>  include/linux/can.h     |  112 ++++++++++++++++++++++
>  include/linux/pkt_cls.h |   10 ++
>  man/man8/tc-can.8       |   97 +++++++++++++++++++
>  tc/Makefile             |    1 +
>  tc/f_can.c              |  238 +++++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 458 insertions(+)
>  create mode 100644 include/linux/can.h
>  create mode 100644 man/man8/tc-can.8
>  create mode 100644 tc/f_can.c

Please resubmit these when the necessary upstream pieces are in Linus's tree.
That will be during the 3.6 merge window.

^ permalink raw reply

* Re: [PATCH net-next 5/5] gianfar_ethtool: coding style and whitespace cleanups
From: Jan Ceuleers @ 2012-06-04 19:11 UTC (permalink / raw)
  To: David Miller; +Cc: b06378, joe, netdev
In-Reply-To: <20120604.140850.1847853625568125563.davem@davemloft.net>

On 06/04/2012 08:08 PM, David Miller wrote:
> From: Jan Ceuleers <jan.ceuleers@computer.org>
> Date: Mon,  4 Jun 2012 18:31:56 +0200
> 
>> -					sizeof(struct gfar_mask_entry),
>> -					gfar_comp, &gfar_swap);
>> +			     sizeof(struct gfar_mask_entry),
>> +			     far_comp, &gfar_swap);
> 
> Don't send me crap you haven't even compile tested.
> 
> I'm tossing this entire patch series out, don't resubmit until you've
> tested every single patch.
> 

Well then I'm sorry but I won't be resubmitting

^ permalink raw reply

* Re: tcp wifi upload performance and lots of ACKs
From: Daniel Baluta @ 2012-06-04 19:22 UTC (permalink / raw)
  To: Ben Greear; +Cc: netdev
In-Reply-To: <4FCCFE76.3060304@candelatech.com>

On Mon, Jun 4, 2012 at 9:29 PM, Ben Greear <greearb@candelatech.com> wrote:
> I'm going some TCP performance testing on wifi -> LAN interface connections.
>  With
> UDP, we can get around 250Mbps of payload throughput.  With TCP, max is
> about 80Mbps.
>
> I think the problem is that there are way too many ACK packets, and
> bi-directional
> traffic on wifi interfaces really slows things down.  (About 7000 pkts per
> second in
> upload direction, 2000 pps download.  And the vast majority of the download
> pkts
> are 66 byte ACK pkts from what I can tell.)
>
> Kernel is 3.3.7+
>
> Anyone know of any tuning parameters that would let the receiving socket
> wait a
> bit longer and send more ACK data in fewer packets?

An ACK is generated after every second full sized segment or a timeout
expires.

Currently, there is no way to tune these parameters. Here is an experimental
patch [1]. If anyone, thinks that this patch has a chance to get accepted
I will be happily try to further improve it.

>
> Packet traces and other info available if anyone wants to take a look.

thanks,
Daniel.

[1] http://marc.info/?l=linux-netdev&m=131983649130350&w=2

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox