Netdev List
* Re: [V2 PATCH 3/9] macvtap: zerocopy: put page when fail to get all requested user pages
From: Shirley Ma @ 2012-05-15 17:33 UTC (permalink / raw)
  To: Jason Wang; +Cc: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem
In-Reply-To: <20120502034157.11782.66606.stgit@amd-6168-8-1.englab.nay.redhat.com>

On Wed, 2012-05-02 at 11:41 +0800, Jason Wang wrote:
> When get_user_pages_fast() fails to get all requested pages, we could not use
> kfree_skb() to free it as it has not been put in the skb fragments. So we need
> to call put_page() instead.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/net/macvtap.c |    6 ++++--
>  1 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> index 7cb2684..9ab182a 100644
> --- a/drivers/net/macvtap.c
> +++ b/drivers/net/macvtap.c
> @@ -531,9 +531,11 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
>                 size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
>                 num_pages = get_user_pages_fast(base, size, 0, &page[i]);
>                 if ((num_pages != size) ||
> -                   (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
> -                       /* put_page is in skb free */
> +                   (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags)) {
> +                       for (i = 0; i < num_pages; i++)
> +                               put_page(page[i]);
>                         return -EFAULT;
> +               }
>                 truesize = size * PAGE_SIZE;
>                 skb->data_len += len;
>                 skb->len += len; 

Good catch. I don't know why I thought put_page() would be called in
kfree_skb() for these pages, which hadn't been added to the skb frags yet. :(

thanks
Shirley
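
For illustration, the failure path the patch adds can be modelled in user
space roughly as follows. This is only a sketch, not kernel code:
fake_page, pin_pages(), release_pages() and pin_all_or_nothing() are
hypothetical stand-ins for struct page, get_user_pages_fast() and the
put_page() loop.

```c
/* Hypothetical stand-in for struct page: only a reference count. */
struct fake_page {
	int refcount;
};

/*
 * Hypothetical stand-in for get_user_pages_fast(): takes a reference on
 * up to `want` pages but stops after `avail` pages, the way a partial
 * pin would. Returns the number of pages actually pinned.
 */
static int pin_pages(struct fake_page *pool, int avail, int want,
		     struct fake_page **out)
{
	int n = want < avail ? want : avail;
	int i;

	for (i = 0; i < n; i++) {
		pool[i].refcount++;
		out[i] = &pool[i];
	}
	return n;
}

/* Drop the references taken by a (possibly partial) pin. */
static void release_pages(struct fake_page **pages, int num_pages)
{
	int i;

	for (i = 0; i < num_pages; i++)
		pages[i]->refcount--;
}

/*
 * Pin `want` pages or fail cleanly. On a short pin nothing has been
 * attached to an skb yet, so the references must be dropped here.
 * Returns 0 on success, -1 on failure.
 */
static int pin_all_or_nothing(struct fake_page *pool, int avail, int want,
			      struct fake_page **out)
{
	int got = pin_pages(pool, avail, want, out);

	if (got != want) {
		release_pages(out, got);
		return -1;
	}
	return 0;
}
```

The point of the model is that after a failed call every refcount is back
to its starting value, which freeing the skb alone would not achieve for
pages that never became fragments.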


* Re: [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation
From: Shirley Ma @ 2012-05-15 17:26 UTC (permalink / raw)
  To: Jason Wang; +Cc: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem
In-Reply-To: <20120502034144.11782.88947.stgit@amd-6168-8-1.englab.nay.redhat.com>

On Wed, 2012-05-02 at 11:41 +0800, Jason Wang wrote:
> As the skb fragment were pinned/built from user pages, we should
> account the page instead of length for truesize.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/net/macvtap.c |    6 ++++--
>  1 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> index bd4a70d..7cb2684 100644
> --- a/drivers/net/macvtap.c
> +++ b/drivers/net/macvtap.c
> @@ -519,6 +519,7 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
>                 struct page *page[MAX_SKB_FRAGS];
>                 int num_pages;
>                 unsigned long base;
> +               unsigned long truesize;
> 
>                 len = from->iov_len - offset;
>                 if (!len) {
> @@ -533,10 +534,11 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
>                     (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
>                         /* put_page is in skb free */
>                         return -EFAULT;
> +               truesize = size * PAGE_SIZE;

This should be truesize = size * PAGE_SIZE - offset, right?

>                 skb->data_len += len;
>                 skb->len += len;
> -               skb->truesize += len;
> -               atomic_add(len, &skb->sk->sk_wmem_alloc);
> +               skb->truesize += truesize;
> +               atomic_add(truesize, &skb->sk->sk_wmem_alloc);
>                 while (len) {
>                         int off = base & ~PAGE_MASK;
>                         int size = min_t(int, len, PAGE_SIZE - off);
> 
> 
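
As a rough illustration of the accounting under discussion, the page count
used in the patch can be reproduced in user space. This is a sketch that
assumes 4 KiB pages; the EX_* macros and both helper names are
hypothetical and only mirror the expression quoted above.

```c
#define EX_PAGE_SHIFT 12
#define EX_PAGE_SIZE  (1UL << EX_PAGE_SHIFT)
#define EX_PAGE_MASK  (~(EX_PAGE_SIZE - 1))

/* Number of pages touched by the byte range [base, base + len). */
static unsigned long pages_spanned(unsigned long base, unsigned long len)
{
	return ((base & ~EX_PAGE_MASK) + len + ~EX_PAGE_MASK) >> EX_PAGE_SHIFT;
}

/* Memory charged when whole pinned pages are accounted, as in the patch. */
static unsigned long truesize_by_pages(unsigned long base, unsigned long len)
{
	return pages_spanned(base, len) * EX_PAGE_SIZE;
}
```

For example, a 32-byte buffer starting 16 bytes before a page boundary
spans two pages, so it is charged 8192 bytes under this scheme rather
than 32.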


* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Eric Dumazet @ 2012-05-15 17:23 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Kieran Mansley, netdev
In-Reply-To: <1337101280.8512.1108.camel@edumazet-glaptop>

On Tue, 2012-05-15 at 19:01 +0200, Eric Dumazet wrote:

> 
> napi_get_frags() could probably be updated in net-next to use the first
> frag as skb->head to save 512 bytes per skb.

By the way, the GRO_MAX_HEAD definition is way too big: napi_get_frags()
allocates fat skbs (1280 bytes of overhead instead of 768 bytes).

This should be enough:

#define GRO_MAX_HEAD 128


* Re: [V2 PATCH 1/9] macvtap: zerocopy: fix offset calculation when building skb
From: Shirley Ma @ 2012-05-15 17:17 UTC (permalink / raw)
  To: Jason Wang; +Cc: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem
In-Reply-To: <20120502034130.11782.25906.stgit@amd-6168-8-1.englab.nay.redhat.com>

On Wed, 2012-05-02 at 11:41 +0800, Jason Wang wrote:
> This patch fixes the offset calculation when building skb:
> 
> - offset1 were used as skb data offset not vector offset
> - reset offset to zero only when we advance to next vector

I tested the original code in all scenarios, and it worked well.

However, this patch makes the code clearer.

Thanks
Shirley


* Re: [PATCH] xfrm_algo: drop an unnecessary inclusion
From: David Miller @ 2012-05-15 17:14 UTC (permalink / raw)
  To: JBeulich; +Cc: netdev, linux-kernel
In-Reply-To: <4FB2618C0200007800083C99@nat28.tlf.novell.com>

From: "Jan Beulich" <JBeulich@suse.com>
Date: Tue, 15 May 2012 13:00:44 +0100

> For several releases, this has not been needed anymore, as no helper
> functions declared in net/ah.h get implemented by xfrm_algo.c anymore.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Applied.


* Re: [PATCH, v2] xfrm: make xfrm_algo.c a module
From: David Miller @ 2012-05-15 17:14 UTC (permalink / raw)
  To: JBeulich; +Cc: netdev, linux-kernel
In-Reply-To: <4FB260D80200007800083C91@nat28.tlf.novell.com>

From: "Jan Beulich" <JBeulich@suse.com>
Date: Tue, 15 May 2012 12:57:44 +0100

> By making this a standalone config option (auto-selected as needed),
> selecting CRYPTO from here rather than from XFRM (which is boolean)
> allows the core crypto code to become a module again even when XFRM=y.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Applied.


* Re: [PATCH net-next 0/2] ethtool changes
From: David Miller @ 2012-05-15 17:14 UTC (permalink / raw)
  To: manish.chopra
  Cc: bhutchings, netdev, Dept_NX_Linux_NIC_Driver, anirban.chakraborty
In-Reply-To: <1337080419-31786-1-git-send-email-manish.chopra@qlogic.com>

From: Manish Chopra <manish.chopra@qlogic.com>
Date: Tue, 15 May 2012 07:13:37 -0400

> Please apply it to net-next.

All applied, thanks.


* Re: [PATCH v3.4-rc7 regression] net/bond: return int from recv_probe()
From: David Miller @ 2012-05-15 17:08 UTC (permalink / raw)
  To: arnd; +Cc: geert, jbohac, linux-kernel, netdev
In-Reply-To: <201205151056.53621.arnd@arndb.de>


Also there were illegal sequences in your email headers (the netdev
address has a trailing ".") so vger rejected the posting.


* Re: [PATCH v3.4-rc7 regression] net/bond: return int from recv_probe()
From: David Miller @ 2012-05-15 17:07 UTC (permalink / raw)
  To: arnd; +Cc: geert, jbohac, linux-kernel, netdev
In-Reply-To: <201205151056.53621.arnd@arndb.de>

From: Arnd Bergmann <arnd@arndb.de>
Date: Tue, 15 May 2012 10:56:53 +0000

> 13a8e0c8c "bonding: don't increase rx_dropped after processing LACPDUs"
> changed the prototype of the bonding recv_probe handler, but only in
> some places, as identified by this warning:
> 
> drivers/net/bonding/bond_main.c: In function 'bond_handle_frame':
> drivers/net/bonding/bond_main.c:1463:13: error: assignment from incompatible pointer type [-Werror]
> drivers/net/bonding/bond_main.c: In function 'bond_open':
> drivers/net/bonding/bond_main.c:3441:21: error: assignment from incompatible pointer type [-Werror]
> drivers/net/bonding/bond_main.c:3448:20: error: assignment from incompatible pointer type [-Werror]
> 
> To fix this, we can change the remaining prototypes to return an
> integer as well, and always return RX_HANDLER_ANOTHER from rlb_arp_recv.

This has been fixed in Linus's tree for almost 2 days.


* Re: [PATCH v3 6/6] net: sh_eth: use NAPI
From: David Miller @ 2012-05-15 17:05 UTC (permalink / raw)
  To: yoshihiro.shimoda.uh; +Cc: netdev, linux-sh
In-Reply-To: <4FB225F1.20407@renesas.com>

From: "Shimoda, Yoshihiro" <yoshihiro.shimoda.uh@renesas.com>
Date: Tue, 15 May 2012 18:46:25 +0900

> 2012/05/15 14:07, David Miller wrote:
>> From: "Shimoda, Yoshihiro" <yoshihiro.shimoda.uh@renesas.com>
>> Date: Tue, 15 May 2012 13:47:44 +0900
>> 
>>> 2012/05/15 7:50, David Miller wrote:
>>>> You need strict synchronization between your TX queueing and TX
>>>> liberation flows.  So that queue stop and wake are only performed
>>>> at the correct moment.
>>>
>>> I will add netif_queue_stopped() in the sh_eth_poll().
>> 
>> That doesn't fix the bug.  What if someone transmits a packet and
>> fills the TX queue between the netif_queue_stopped() test and the
>> call to netif_wake_queue()?
>> 
>> Adding another test doesn't create the necessary synchronization.
>> 
> 
> Thank you for the reply again.
> I will modify the code as the following. Is it correct?
> 
> 	if (txfree_num) {
> 		netif_tx_lock(ndev);
> 		if (netif_queue_stopped(ndev))
> 			netif_wake_queue(ndev);
> 		netif_tx_unlock(ndev);
> 	}

Yes, and then you don't need that private lock in the start_xmit()
method at all, since that method runs with the tx_lock held.
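
A user-space model of this locking scheme might look like the sketch
below. It is an illustration under stated assumptions, not driver code:
struct tx_ring, ring_xmit_locked() and ring_reclaim() are hypothetical,
a pthread mutex stands in for the netdev TX lock, and a boolean stands in
for the queue-stopped state.

```c
#include <pthread.h>
#include <stdbool.h>

#define RING_SIZE 4

/* Hypothetical model of a TX ring; tx_lock plays the role of the TX lock. */
struct tx_ring {
	pthread_mutex_t tx_lock;
	int in_flight;   /* descriptors handed to the hardware */
	bool stopped;    /* models the queue-stopped state */
};

/*
 * Models start_xmit(). The caller already holds tx_lock, mirroring the
 * point above that the method runs with the tx_lock held.
 * Returns false if the packet could not be queued.
 */
static bool ring_xmit_locked(struct tx_ring *r)
{
	if (r->in_flight == RING_SIZE)
		return false;
	r->in_flight++;
	if (r->in_flight == RING_SIZE)
		r->stopped = true;         /* models netif_stop_queue() */
	return true;
}

/* Models TX completion in the poll handler. */
static void ring_reclaim(struct tx_ring *r, int completed)
{
	if (!completed)
		return;

	pthread_mutex_lock(&r->tx_lock);   /* models netif_tx_lock() */
	r->in_flight -= completed;
	if (r->stopped)
		r->stopped = false;        /* models netif_wake_queue() */
	pthread_mutex_unlock(&r->tx_lock); /* models netif_tx_unlock() */
}
```

Because the stop decision and the wake decision are both made under the
same lock, a transmit cannot fill the ring between the stopped test and
the wake.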


* ibmveth bug?
From: Nishanth Aravamudan @ 2012-05-15 17:01 UTC (permalink / raw)
  To: santil; +Cc: anton, benh, paulus, netdev, linux-kernel

Hi Santiago,

Are you still working on ibmveth?

I've found a very sporadic bug with ibmveth in some testing. PAPR
requires that:

"Validate the Buffer Descriptor of the receive queue buffer (I/O
addresses for entire buffer length starting at the specified I/O
address are translated by the RTCE table, length is a multiple of 16
bytes, and alignment is on a 16 byte boundary) else H_Parameter."

but from what I can tell ibmveth.c is not enforcing this last condition:

	adapter->rx_queue.queue_addr =
		kmalloc(adapter->rx_queue.queue_len, GFP_KERNEL);

	...

	adapter->rx_queue.queue_dma = dma_map_single(dev,
		adapter->rx_queue.queue_addr, adapter->rx_queue.queue_len,
		DMA_BIDIRECTIONAL);

	...

	rxq_desc.fields.address = adapter->rx_queue.queue_dma;

	...
	

	lpar_rc = ibmveth_register_logical_lan(adapter, rxq_desc,
		mac_address);
	netdev_err(netdev, "buffer TCE:0x%llx filter TCE:0x%llx rxq "
	 	"desc:0x%llx MAC:0x%llx\n", adapter->buffer_list_dma,
	 	adapter->filter_list_dma, rxq_desc.desc, mac_address);

And I got on one install attempt:

[ 39.978430] ibmveth 30000004: eth0: h_register_logical_lan failed with -4
[ 39.978449] ibmveth 30000004: eth0: buffer TCE:0x1000 filter TCE:0x10000 rxq desc:0x80006010000200a8 MAC:0x56754de8e904

rxq desc, as you can see, is not 16-byte aligned. kmalloc() only
guarantees 8-byte alignment (as does gcc, I think). Initially, I thought
we could just overallocate the queue_addr and ALIGN() down, but then we
would need to save the original kmalloc pointer in a new struct member
per rx_queue.

So a couple of questions:

1) Is my analysis accurate? :)

2) How gross would it be to save an extra pointer for every rx_queue?

3) Based upon 2), is it better to just go ahead and create our own
kmem_cache (which gets an alignment specified)?

For 3), I started coding this, but couldn't find a clean place to
allocate the kmem_cache itself, as the size of each object depends on
the run-time characteristics (afaict), but needs to be specified at
cache creation time. Any insight you could provide would be great!
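
For reference, the over-allocation idea can be sketched in user space as
below. This is only an illustration: struct aligned_queue and both
helpers are hypothetical, malloc() stands in for kmalloc(), and the start
address is rounded up (rounding down could land before the allocation).

```c
#include <stdint.h>
#include <stdlib.h>

#define QUEUE_ALIGN 16u

/* Hypothetical queue descriptor that keeps both pointers. */
struct aligned_queue {
	void *raw;      /* pointer returned by the allocator, needed to free */
	void *aligned;  /* first 16-byte-aligned address inside the allocation */
	size_t len;     /* usable length starting at `aligned` */
};

/* Over-allocate by QUEUE_ALIGN - 1 bytes and round the start address up. */
static int aligned_queue_alloc(struct aligned_queue *q, size_t len)
{
	uintptr_t addr;

	q->raw = malloc(len + QUEUE_ALIGN - 1);
	if (!q->raw)
		return -1;

	addr = ((uintptr_t)q->raw + QUEUE_ALIGN - 1) &
	       ~(uintptr_t)(QUEUE_ALIGN - 1);
	q->aligned = (void *)addr;
	q->len = len;
	return 0;
}

/* Free through the original pointer, never through the aligned one. */
static void aligned_queue_free(struct aligned_queue *q)
{
	free(q->raw);
	q->raw = NULL;
	q->aligned = NULL;
}
```

The cost is the one extra pointer per queue mentioned in question 2),
plus up to 15 wasted bytes per allocation.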

Thanks,
Nish
 
-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Eric Dumazet @ 2012-05-15 17:01 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Kieran Mansley, netdev
In-Reply-To: <1337100454.2544.25.camel@bwh-desktop.uk.solarflarecom.com>

On Tue, 2012-05-15 at 17:47 +0100, Ben Hutchings wrote:
> On Tue, 2012-05-15 at 18:34 +0200, Eric Dumazet wrote:
> > On Tue, 2012-05-15 at 17:29 +0100, Kieran Mansley wrote:
> > > On Tue, 2012-05-15 at 16:56 +0200, Eric Dumazet wrote:
> > > > 
> > > > Please try latest kernels, this is probably 'fixed'
> > > 
> > > I've just tried with 3.4.0-rc7 and the problem is still reproducible.
> > > It's perhaps harder to reproduce than on 3.3.6 but still there.
> > > 
> > > > What network driver are you using ? 
> > > 
> > > The receiver is using the sfc driver that is included in the kernel
> > > build, together with an SFC 9020 NIC. 
> > > 
> > > Kieran
> > > 
> > 
> > MTU ?
> 
> 1500
> 
> > What is typical skb->truesize of skb given to stack in RX path ?
> >
> > If drivers use PAGE_SIZE fragments, then you are more likely to hit
> > limit.
> 
> We're passing page fragments into GRO.


Yes, I can see drivers/net/ethernet/sfc/rx.c is even lying about
truesize. That explains why you trigger the backlog drop even on 2.6
kernels.

skb->len = rx_buf->len;
skb->data_len = rx_buf->len;
skb->truesize += rx_buf->len; // instead of real frag size

So skb->truesize values are rather small.

napi_get_frags() could probably be updated in net-next to use the first
frag as skb->head to save 512 bytes per skb.

You could try setting tcp_adv_win_scale to -2


* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
From: Shirley Ma @ 2012-05-15 16:50 UTC (permalink / raw)
  To: Jason Wang; +Cc: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem
In-Reply-To: <20120502034254.11782.27314.stgit@amd-6168-8-1.englab.nay.redhat.com>

On Wed, 2012-05-02 at 11:42 +0800, Jason Wang wrote:
> We add used and signal guest in worker thread but did not poll the virtqueue
> during the zero copy callback. This may lead the missing of adding and
> signalling during zerocopy. Solve this by polling the virtqueue and let it
> wakeup the worker during callback.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/vhost/vhost.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 947f00d..7b75fdf 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1604,6 +1604,7 @@ void vhost_zerocopy_callback(void *arg)
>         struct vhost_ubuf_ref *ubufs = ubuf->arg;
>         struct vhost_virtqueue *vq = ubufs->vq;
> 
> +       vhost_poll_queue(&vq->poll);
>         /* set len = 1 to mark this desc buffers done DMA */
>         vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
>         kref_put(&ubufs->kref, vhost_zerocopy_done_signal);

With this change we might end up with redundant vhost_poll_queue() calls.
Do you know in which scenario adding used buffers and signalling could be
missed during zerocopy?

Thanks
Shirley


* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Ben Hutchings @ 2012-05-15 16:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Kieran Mansley, netdev
In-Reply-To: <1337099641.8512.1102.camel@edumazet-glaptop>

On Tue, 2012-05-15 at 18:34 +0200, Eric Dumazet wrote:
> On Tue, 2012-05-15 at 17:29 +0100, Kieran Mansley wrote:
> > On Tue, 2012-05-15 at 16:56 +0200, Eric Dumazet wrote:
> > > 
> > > Please try latest kernels, this is probably 'fixed'
> > 
> > I've just tried with 3.4.0-rc7 and the problem is still reproducible.
> > It's perhaps harder to reproduce than on 3.3.6 but still there.
> > 
> > > What network driver are you using ? 
> > 
> > The receiver is using the sfc driver that is included in the kernel
> > build, together with an SFC 9020 NIC. 
> > 
> > Kieran
> > 
> 
> MTU ?

1500

> What is typical skb->truesize of skb given to stack in RX path ?
>
> If drivers use PAGE_SIZE fragments, then you are more likely to hit
> limit.

We're passing page fragments into GRO.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* Re: [PATCH net-next 0/2] extend sch_mqprio to distribute traffic not only by ETS TC
From: John Fastabend @ 2012-05-15 16:44 UTC (permalink / raw)
  To: Amir Vadai
  Cc: David S. Miller, netdev, Oren Duer, Liran Liss, Jamal Hadi Salim,
	Diego Crupnicoff, Or Gerlitz
In-Reply-To: <4FB15BEC.8040000@mellanox.com>

On 5/14/2012 12:24 PM, Amir Vadai wrote:
>>>> On 5/6/2012 12:05 AM, Amir Vadai wrote:
>>>>> This series comes to revive the discussion initiated on the thread "net:
>>>>> support tx_ring per UP in HW based QoS mechanism" (see
>>>>> http://marc.info/?t=133165957200004&r=1&w=2) with the major issue to be address
>>>>> is - how should sk_prio<=>    TC be done, for both, tagged and untagged traffic.
>>>>> Following is a staged description addressing the background, problem
>>>>> description, current situation, suggestion for the change and implementation of
>>>>> it.

[...]

> John Hi,
> 
> After some internal discussions, it was agreed to line up with your
> approach, to leave mqprio an abstract skb->priority <=> queue set
> mapping and to ignore egress_map if mqprio is enabled.
> 

OK sounds good.

> It would be very nice, if the term 'tc' in kernel code would be
> replaced to queue set, since it is very misleading.
> 

Go ahead and write up a patch. Just be careful not to break existing
user visible API. I agree it is confusing.

> There still might be some small issues with skb_tx_hash for tagged
> traffic, which I will work on tomorrow, and hopefully will send a new
> patch set with the solution.
> 

What are the issues? Let's see a patch.

Thanks,
John


* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Eric Dumazet @ 2012-05-15 16:34 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: netdev
In-Reply-To: <1337099368.1689.47.camel@kjm-desktop.uk.level5networks.com>

On Tue, 2012-05-15 at 17:29 +0100, Kieran Mansley wrote:
> On Tue, 2012-05-15 at 16:56 +0200, Eric Dumazet wrote:
> > 
> > Please try latest kernels, this is probably 'fixed'
> 
> I've just tried with 3.4.0-rc7 and the problem is still reproducible.
> It's perhaps harder to reproduce than on 3.3.6 but still there.
> 
> > What network driver are you using ? 
> 
> The receiver is using the sfc driver that is included in the kernel
> build, together with an SFC 9020 NIC. 
> 
> Kieran
> 

MTU ?

What is typical skb->truesize of skb given to stack in RX path ?

If drivers use PAGE_SIZE fragments, then you are more likely to hit
limit.


* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Kieran Mansley @ 2012-05-15 16:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1337093776.8512.1089.camel@edumazet-glaptop>

On Tue, 2012-05-15 at 16:56 +0200, Eric Dumazet wrote:
> 
> Please try latest kernels, this is probably 'fixed'

I've just tried with 3.4.0-rc7 and the problem is still reproducible.
It's perhaps harder to reproduce than on 3.3.6 but still there.

> What network driver are you using ? 

The receiver is using the sfc driver that is included in the kernel
build, together with an SFC 9020 NIC. 

Kieran


* Re: PROBLEM: Fragmentation issue with 1521 bytes ip packets
From: Eric Dumazet @ 2012-05-15 15:04 UTC (permalink / raw)
  To: Omar Alhassane; +Cc: netdev
In-Reply-To: <CAPFXtPexGpVyGGuKdjpuH+cxOaxa7KeBfmFrs39=NW7LsZ6-bw@mail.gmail.com>

On Tue, 2012-05-15 at 10:00 -0400, Omar Alhassane wrote:
> Hello Folks,
> 
> I think i may have found a problem with the linux networking stack.
> Below is a description of the problem.
> 
> [1.] One line summary of the problem:
> No response to pings of certain sizes.
> 
> [2.] Full description of the problem/report:
> Using hping3, when i ping a linux machine with 1521 bytes ip packets i
> get only one response.
> But when i use 1482 bytes, everything works fine. I've tried this with
> both tcp and udp. The MTU of my interface is 1500.
> [3.] Keywords (i.e., modules, networking, kernel):
> ip, udp, tcp, networking, fragmentation
> [4.] Kernel version (from /proc/version):
> 3.3.1
> [5.] Output of Oops.. message (if applicable) with symbolic information
> [6.] A small shell script or example program which triggers the
> problem (if possible)
> The following commands works only if the target has tcp port 22 open
> 
> hping3 -d 1481 -S -P 22 10.0.30.225 (only one response)
> hping3 -d 1482 -S -P 22 10.0.30.225 (works fine)
> 
> Can somebody confirm if this is a problem?

hping3 bug: all the fragments it sends have the same ID field.

The first 2 frags are reassembled by the remote, which sends a SYNACK.

The following frags are 'ignored' because they have the same ID as the previous
packet.
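
To make the ID issue concrete: a receiver matches IPv4 fragments to a
datagram by source address, destination address, protocol and ID. The
sketch below only models that comparison; struct frag_key and
same_datagram() are hypothetical names, not kernel structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fields that identify which datagram an IPv4 fragment belongs to. */
struct frag_key {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t id;
	uint8_t  protocol;
};

/* True if two fragments would be treated as parts of the same datagram. */
static bool same_datagram(const struct frag_key *a, const struct frag_key *b)
{
	return a->saddr == b->saddr && a->daddr == b->daddr &&
	       a->id == b->id && a->protocol == b->protocol;
}
```

If every probe reuses one ID, fragments of the second probe compare equal
to those of the first, so the receiver cannot tell the two packets apart.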


* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Eric Dumazet @ 2012-05-15 15:00 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: netdev
In-Reply-To: <1337093776.8512.1089.camel@edumazet-glaptop>

On Tue, 2012-05-15 at 16:56 +0200, Eric Dumazet wrote:

> Please try latest kernels, this is probably 'fixed'
> 
> What network driver are you using ?
> 
> 

commit b49960a05e32121d29316cfdf653894b88ac9190
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed May 2 02:28:41 2012 +0000

    tcp: change tcp_adv_win_scale and tcp_rmem[2]
    
    tcp_adv_win_scale default value is 2, meaning we expect a good citizen
    skb to have skb->len / skb->truesize ratio of 75% (3/4)
    
    In 2.6 kernels we (mis)accounted for typical MSS=1460 frame :
    1536 + 64 + 256 = 1856 'estimated truesize', and 1856 * 3/4 = 1392.
    So these skbs were considered as not bloated.
    
    With recent truesize fixes, a typical MSS=1460 frame truesize is now the
    more precise :
    2048 + 256 = 2304. But 2304 * 3/4 = 1728.
    So these skb are not good citizen anymore, because 1460 < 1728
    
    (GRO can escape this problem because it build skbs with a too low
    truesize.)
    
    This also means tcp advertises a too optimistic window for a given
    allocated rcvspace : When receiving frames, sk_rmem_alloc can hit
    sk_rcvbuf limit and we call tcp_prune_queue()/tcp_collapse() too often,
    especially when application is slow to drain its receive queue or in
    case of losses (netperf is fast, scp is slow). This is a major latency
    source.
    
    We should adjust the len/truesize ratio to 50% instead of 75%
    
    This patch :
    
    1) changes tcp_adv_win_scale default to 1 instead of 2
    
    2) increase tcp_rmem[2] limit from 4MB to 6MB to take into account
    better truesize tracking and to allow autotuning tcp receive window to
    reach same value than before. Note that same amount of kernel memory is
    consumed compared to 2.6 kernels.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
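
The arithmetic in the commit message can be checked with a small
user-space model of how tcp_adv_win_scale maps buffer space to window.
This is a sketch: win_from_space() is a hypothetical name, and the
formula follows the documented meaning of the sysctl.

```c
/*
 * A positive scale reserves space / 2^scale for overhead; a zero or
 * negative scale advertises space / 2^(-scale).
 */
static int win_from_space(int space, int adv_win_scale)
{
	if (adv_win_scale <= 0)
		return space >> -adv_win_scale;
	return space - (space >> adv_win_scale);
}
```

With scale 2, a truesize of 1856 yields 1392 and a truesize of 2304
yields 1728, matching the figures above; with scale 1 the same 2304 bytes
yield 1152, so a 1460-byte payload fits again.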


* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Eric Dumazet @ 2012-05-15 14:56 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: netdev
In-Reply-To: <1337092718.1689.45.camel@kjm-desktop.uk.level5networks.com>

On Tue, 2012-05-15 at 15:38 +0100, Kieran Mansley wrote:
> I've been investigating an issue with TCPBacklogDrops being reported
> (and relatively poor performance as a result).  The problem is most
> easily observed on slightly older kernels (e.g. 3.0.13) but is still
> present in 3.3.6, although harder to reproduce.  I've also seen it in
> 2.6 series kernels, so it's not a recent issue.
> 
> The problem occurs at the receiver when a TCP sender with a large
> congestion window is sending at a high rate and the receiving
> application has blocked in a recv() or similar call.  During the stream
> ACKs are being returned to the sender keeping the receive window open
> and so allowing it to carry on sending.  The local socket receive buffer
> gets dynamically increased, and the advertised receive window increases
> similarly.
> 
> [As an aside, it appears as though the total bytes that the receiver
> commits to receiving - i.e. the point at which it stops advertising new
> sequence space - is around double the receive socket buffer.  I'm
> guessing it is committing to receiving the current socket buffer
> (perhaps as there is a pending recv() it knows it will be able to
> immediately empty this) and the next one, but I've not looked into this
> in detail]
> 
> As the socket buffer is approaching full the kernel decides to satisfy
> the recv() call and wake the application.  It will have to copy the data
> to application address space etc.  At this point there is a switch in
> tcp_v4_rcv():
> 
> http://lxr.linux.no/#linux+v3.3.6/net/ipv4/tcp_ipv4.c#L1726
> 
> Before this point, the "if (!sock_owned_by_user(sk)) " will evaluate to
> true, but once it has decided to wake the application I think it will
> evaluate to false and it will drop through to:
> 
> 1739        else if (unlikely(sk_add_backlog(sk, skb))) {
> 1740                bh_unlock_sock(sk);
> 1741                NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
> 1742                goto discard_and_relse;
> 1743        }
> 
> In sk_add_backlog() there is a test to see if the socket's receive
> buffer is full, and if it is, the kernel drops the packets, reporting
> them through netstat as TCPBacklogDrop.  This is despite there being
> potentially megabytes of unused advertised receive window space at this
> point.
> 
> Very shortly afterwards the socket buffer will be empty again (as its
> contents will have been transferred to the user) so this is essentially
> a race and depends on a fast sender to demonstrate it.  It shows up as an
> acute period of drops that are quickly retransmitted and then
> accepted.  
> 
> There are two ways of thinking about this problem: either the receiver
> should be more conservative about the receive window it advertises
> (limiting it to the available receive socket buffer size); or the
> receiver should be more generous with what it will accept on to the
> backlog (matching it to the advertised receive window).  It is the
> discrepancy between advertised receive window and what can be put on the
> backlog that is the root of the problem.  I would be tempted by the
> latter and say that as the backlog is likely to soon make it into the
> receive buffer, it should be allowed to contain a full receive buffer of
> bytes on top of what is currently being removed from the receive buffer
> into the application.
> 
> It is harder to reproduce on recent kernels because the pending recv()
> call gets satisfied very close to the start of a burst, and at this time
> the receive buffer will be mostly empty and so it is less likely that
> any packets in flight will overflow the backlog.  On earlier kernels it
> is easier to reproduce because the pending recv() call didn't return
> until the socket's receive buffer was nearly full, and so it would only
> take a few extra packets to overflow the backlog.
> 
> I have a packet capture to illustrate the problem (taken on 3.0.13) if
> that would be of help.  As I can easily reproduce it I'm also happy to
> make changes and test to see if they improve matters.


Please try latest kernels, this is probably 'fixed'

What network driver are you using ?


* TCPBacklogDrops during aggressive bursts of traffic
From: Kieran Mansley @ 2012-05-15 14:38 UTC (permalink / raw)
  To: netdev

I've been investigating an issue with TCPBacklogDrops being reported
(and relatively poor performance as a result).  The problem is most
easily observed on slightly older kernels (e.g. 3.0.13) but is still
present in 3.3.6, although harder to reproduce.  I've also seen it in
2.6 series kernels, so it's not a recent issue.

The problem occurs at the receiver when a TCP sender with a large
congestion window is sending at a high rate and the receiving
application has blocked in a recv() or similar call.  During the stream
ACKs are being returned to the sender keeping the receive window open
and so allowing it to carry on sending.  The local socket receive buffer
gets dynamically increased, and the advertised receive window increases
similarly.

[As an aside, it appears as though the total bytes that the receiver
commits to receiving - i.e. the point at which it stops advertising new
sequence space - is around double the receive socket buffer.  I'm
guessing it is committing to receiving the current socket buffer
(perhaps as there is a pending recv() it knows it will be able to
immediately empty this) and the next one, but I've not looked into this
in detail]

As the socket buffer is approaching full the kernel decides to satisfy
the recv() call and wake the application.  It will have to copy the data
to application address space etc.  At this point there is a switch in
tcp_v4_rcv():

http://lxr.linux.no/#linux+v3.3.6/net/ipv4/tcp_ipv4.c#L1726

Before this point, the "if (!sock_owned_by_user(sk)) " will evaluate to
true, but once it has decided to wake the application I think it will
evaluate to false and it will drop through to:

1739        else if (unlikely(sk_add_backlog(sk, skb))) {
1740                bh_unlock_sock(sk);
1741                NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
1742                goto discard_and_relse;
1743        }

In sk_add_backlog() there is a test to see if the socket's receive
buffer is full, and if it is, the kernel drops the packets, reporting
them through netstat as TCPBacklogDrop.  This is despite there being
potentially megabytes of unused advertised receive window space at this
point.
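
A simplified model of the kind of test described above is sketched here.
It assumes the check compares receive-queue bytes plus backlog bytes plus
the new packet against the receive buffer limit; struct sock_model and
both helpers are hypothetical and do not reproduce the kernel's
structures.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of the socket accounting involved. */
struct sock_model {
	unsigned int rmem_alloc;   /* bytes charged to the receive queue */
	unsigned int backlog_len;  /* bytes already sitting on the backlog */
	unsigned int rcvbuf;       /* receive buffer limit */
};

/* True if queueing another `truesize` bytes would exceed the limit. */
static bool rcvqueues_full(const struct sock_model *sk, unsigned int truesize)
{
	return sk->rmem_alloc + sk->backlog_len + truesize > sk->rcvbuf;
}

/* Returns 0 if the packet was added to the backlog, -1 if dropped. */
static int add_backlog(struct sock_model *sk, unsigned int truesize)
{
	if (rcvqueues_full(sk, truesize))
		return -1;   /* the case counted as TCPBacklogDrop */
	sk->backlog_len += truesize;
	return 0;
}
```

In this model, a socket whose receive queue is already close to the limit
accepts only a handful of packets on the backlog before dropping,
regardless of how much window was advertised.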

Very shortly afterwards the socket buffer will be empty again (as its
contents will have been transferred to the user) so this is essentially
a race and depends on a fast sender to demonstrate it.  It shows up as an
acute period of drops that are quickly retransmitted and then
accepted.  

There are two ways of thinking about this problem: either the receiver
should be more conservative about the receive window it advertises
(limiting it to the available receive socket buffer size); or the
receiver should be more generous with what it will accept on to the
backlog (matching it to the advertised receive window).  It is the
discrepancy between advertised receive window and what can be put on the
backlog that is the root of the problem.  I would be tempted by the
latter and say that as the backlog is likely to soon make it into the
receive buffer, it should be allowed to contain a full receive buffer of
bytes on top of what is currently being removed from the receive buffer
into the application.

It is harder to reproduce on recent kernels because the pending recv()
call gets satisfied very close to the start of a burst, and at this time
the receive buffer will be mostly empty and so it is less likely that
any packets in flight will overflow the backlog.  On earlier kernels it
is easier to reproduce because the pending recv() call didn't return
until the socket's receive buffer was nearly full, and so it would only
take a few extra packets to overflow the backlog.

I have a packet capture to illustrate the problem (taken on 3.0.13) if
that would be of help.  As I can easily reproduce it I'm also happy to
make changes and test to see if they improve matters.

Thanks

Kieran


* Strange latency spikes/TX network stalls on Sun Fire X4150(x86) and e1000e
From: Denys Fedoryshchenko @ 2012-05-15 14:15 UTC (permalink / raw)
  To: netdev, e1000-devel, jeffrey.t.kirsher, jesse.brandeburg

Hi

I have two identical servers, Sun Fire X4150, both has different 
flavors of Linux, x86_64 and i386.
04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
The interface I am using now:
#ethtool -i eth0
driver: e1000e
version: 1.9.5-k
firmware-version: 2.1-11
bus-info: 0000:04:00.0
There are 2 CPUs, Intel(R) Xeon(R) CPU E5440 @ 2.83GHz.

The i386 machine was acting as NAT and shaper, and as soon as I removed
the shaper from it, I started to experience strange lockups: traffic is
normal for 5-30 seconds, then there is a short lockup of 500-3000 ms
(usually around 1000 ms) with the dropped packets counter increasing. I
suspected it was due to load, but it seems I was wrong.
Recently, on the other server (x86_64, which I use for development), I
upgraded the kernel (it was old, from the 2.6 series), and this
completely idle machine started to show the same latency spikes: while I
am just running mc and, for example, typing in a text editor, I notice
"stalls". After investigating a little more, I also noticed a small
number of drops on the interface. No tcpdump was running. This machine
is idle, and the only traffic on it is some small broadcasts from the
network, my ssh session, and ping.

Dropped packets in ifconfig
           RX packets:3752868 errors:0 dropped:5350 overruns:0 frame:0
The counter sometimes increases when a stall happens.

ethtool -S is clean; it shows no dropped packets.
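
One way to tie the counter to the stalls is to sample it with
timestamps and compare the result with the ping log. A minimal sketch,
assuming the counter shown by ifconfig is the one exposed under
/sys/class/net/<iface>/statistics/rx_dropped; the function names and
the sampling interval are hypothetical.

```python
import time


def read_rx_dropped(iface="eth0"):
    """Read the RX dropped counter for iface from sysfs."""
    path = "/sys/class/net/%s/statistics/rx_dropped" % iface
    with open(path) as f:
        return int(f.read())


def watch(read_counter, samples, interval=0.1, clock=time.time,
          sleep=time.sleep):
    """Sample the counter `samples` times and return a list of
    (timestamp, increase) pairs, one for each sample at which the
    counter grew."""
    events = []
    last = read_counter()
    for _ in range(samples):
        sleep(interval)
        now = read_counter()
        if now > last:
            events.append((clock(), now - last))
        last = now
    return events
```

Calling watch(read_rx_dropped, samples=600) on the affected machine
would cover about a minute; the returned timestamps can then be matched
against the moments when ping shows a stall.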

I tried to check the load (mpstat and perf) and found nothing
suspicious; latencytop doesn't show anything suspicious either.
dropwatch reports a lot of drops, but mostly because of broadcasts and
similar traffic. tcpdump at the moment of such drops doesn't show
anything suspicious.
I changed the qdisc from the default pfifo_fast to bfifo, without any
result.
I also tried: ethtool -K eth0 tso off gso off gro off sg off, with no
result.
The problem occurs on 3.3.6 through 3.4.0-rc7, and most probably on
3.3.0 as well, but I don't remember for sure. I think that on some
kernels, such as 3.1, it probably doesn't occur; I will check this soon,
since the problem cannot always be reproduced reliably. All the tests I
did were on 3.4.0-rc7.

I also ran tcpdump in the background, plus iptables with timestamps,
and at the time a stall occurred it seems I was still receiving packets
properly; with iperf UDP (from some host to this SunFire) no packets
were missing at these moments either. But I am sure the RX interface
errors are increasing.
If I run iperf from the SunFire to a test host, there is packet loss at
the moments when a stall occurs.

I suspect that for some reason the network card stops transmitting, but
I am unable to pinpoint the issue. All other hosts in this network are
fine and don't have such problems.
Can you help me with that, please? I can provide more debug
information, compile with patches, etc. I will also try falling back
to the 3.1 and 3.0 kernels.

Here is how it occurs and how I reproduce it:
I just open a file and start scrolling it in mc, then in another
console I run ping:
[1337089061.844167] 1480 bytes from 194.146.153.20: icmp_req=162 ttl=64 time=0.485 ms
[1337089061.944138] 1480 bytes from 194.146.153.20: icmp_req=163 ttl=64 time=0.470 ms
[1337089062.467759] 1480 bytes from 194.146.153.20: icmp_req=164 ttl=64 time=424 ms
[1337089062.467899] 1480 bytes from 194.146.153.20: icmp_req=165 ttl=64 time=324 ms
[1337089062.468058] 1480 bytes from 194.146.153.20: icmp_req=166 ttl=64 time=214 ms
[1337089062.468161] 1480 bytes from 194.146.153.20: icmp_req=167 ttl=64 time=104 ms
[1337089062.468958] 1480 bytes from 194.146.153.20: icmp_req=168 ttl=64 time=1.15 ms
[1337089062.568604] 1480 bytes from 194.146.153.20: icmp_req=169 ttl=64 time=0.477 ms
[1337089062.668909] 1480 bytes from 194.146.153.20: icmp_req=170 ttl=64 time=0.667 ms
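
The round-trip times in this output can be turned back into approximate
send times. A small sketch, assuming the bracketed timestamp marks the
moment the reply was received, so that send time is roughly timestamp
minus RTT; the names PING_LOG and send_times are made up for this
example.

```python
import re

# Replies 163-168 from the ping output above (wrapped lines joined).
PING_LOG = """\
[1337089061.944138] 1480 bytes from 194.146.153.20: icmp_req=163 ttl=64 time=0.470 ms
[1337089062.467759] 1480 bytes from 194.146.153.20: icmp_req=164 ttl=64 time=424 ms
[1337089062.467899] 1480 bytes from 194.146.153.20: icmp_req=165 ttl=64 time=324 ms
[1337089062.468058] 1480 bytes from 194.146.153.20: icmp_req=166 ttl=64 time=214 ms
[1337089062.468161] 1480 bytes from 194.146.153.20: icmp_req=167 ttl=64 time=104 ms
[1337089062.468958] 1480 bytes from 194.146.153.20: icmp_req=168 ttl=64 time=1.15 ms
"""

LINE = re.compile(r"\[(\d+\.\d+)\].*icmp_req=(\d+).*time=([\d.]+) ms")


def send_times(log):
    """Return {seq: (reply_time, estimated_send_time)} in seconds,
    estimating the send time as reply timestamp minus reported RTT."""
    result = {}
    for match in LINE.finditer(log):
        reply = float(match.group(1))
        seq = int(match.group(2))
        rtt = float(match.group(3)) / 1000.0
        result[seq] = (reply, reply - rtt)
    return result


times = send_times(PING_LOG)
for seq in sorted(times):
    reply, sent = times[seq]
    print("seq %d: sent ~%.3f, reply %.3f" % (seq, sent, reply))
```

Under that assumption, requests 164-167 were generated roughly 100 ms
apart (at about 62.044, 62.144, 62.254 and 62.364), while their replies
all arrived within about half a millisecond of each other. That is what
one would expect if the requests were held somewhere and then released
together.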

Remote host tcpdump:
1337089061.934737 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 163, length 1480
1337089062.458360 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 164, length 1480
1337089062.458380 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 164, length 1480
1337089062.458481 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 165, length 1480
1337089062.458502 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 165, length 1480
1337089062.458606 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 166, length 1480
1337089062.458623 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 166, length 1480
1337089062.458729 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 167, length 1480
1337089062.458745 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 167, length 1480
1337089062.459537 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 168, length 1480
1337089062.459545 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 168, length 1480

Local host(SunFire) tcpdump:
1337089061.844140 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 162, length 1480
1337089061.943661 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 163, length 1480
1337089061.944124 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 163, length 1480
1337089062.465622 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 164, length 1480
1337089062.465630 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 165, length 1480
1337089062.465632 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 166, length 1480
1337089062.465634 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 167, length 1480
1337089062.467730 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 164, length 1480
1337089062.467785 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 168, length 1480
1337089062.467884 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 165, length 1480
1337089062.468035 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 166, length 1480
1337089062.468129 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 167, length 1480
1337089062.468928 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 168, length 1480
1337089062.568112 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 169, length 1480
1337089062.568578 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 169, length 1480

lspci -t
centaur src # lspci -t
-[0000:00]-+-00.0
            +-02.0-[01-05]--+-00.0-[02-04]--+-00.0-[03]--
            |               |               \-02.0-[04]--+-00.0
            |               |                            \-00.1
            |               \-00.3-[05]--
            +-03.0-[06]--
            +-04.0-[07]----00.0
            +-05.0-[08]--
            +-06.0-[09]--
            +-07.0-[0a]--
            +-08.0
            +-10.0
            +-10.1
            +-10.2
            +-11.0
            +-13.0
            +-15.0
            +-16.0
            +-1c.0-[0b]--+-00.0
            |            \-00.1
            +-1d.0
            +-1d.1
            +-1d.2
            +-1d.3
            +-1d.7
            +-1e.0-[0c]----05.0
            +-1f.0
            +-1f.1
            +-1f.2
            \-1f.3
lspci
00:00.0 Host bridge: Intel Corporation 5000P Chipset Memory Controller Hub (rev b1)
00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 2 (rev b1)
00:03.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 3 (rev b1)
00:04.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port 4-5 (rev b1)
00:05.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 5 (rev b1)
00:06.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port 6-7 (rev b1)
00:07.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 7 (rev b1)
00:08.0 System peripheral: Intel Corporation 5000 Series Chipset DMA Engine (rev b1)
00:10.0 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev b1)
00:10.1 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev b1)
00:10.2 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev b1)
00:11.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev b1)
00:13.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev b1)
00:15.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev b1)
00:16.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev b1)
00:1c.0 PCI bridge: Intel Corporation 631xESB/632xESB/3100 Chipset PCI Express Root Port 1 (rev 09)
00:1d.0 USB controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #1 (rev 09)
00:1d.1 USB controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #2 (rev 09)
00:1d.2 USB controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #3 (rev 09)
00:1d.3 USB controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #4 (rev 09)
00:1d.7 USB controller: Intel Corporation 631xESB/632xESB/3100 Chipset EHCI USB2 Controller (rev 09)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
00:1f.0 ISA bridge: Intel Corporation 631xESB/632xESB/3100 Chipset LPC Interface Controller (rev 09)
00:1f.1 IDE interface: Intel Corporation 631xESB/632xESB IDE Controller (rev 09)
00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA AHCI Controller (rev 09)
00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus Controller (rev 09)
01:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Upstream Port (rev 01)
01:00.3 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express to PCI-X Bridge (rev 01)
02:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E1 (rev 01)
02:02.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E3 (rev 01)
04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
07:00.0 RAID bus controller: Adaptec AAC-RAID (rev 09)
0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0c:05.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family


dmesg:
[    4.936885] e1000: Intel(R) PRO/1000 Network Driver - version 7.3.21-k8-NAPI
[    4.936887] e1000: Copyright (c) 1999-2006 Intel Corporation.
[    4.936966] e1000e: Intel(R) PRO/1000 Network Driver - 1.9.5-k
[    4.936967] e1000e: Copyright(c) 1999 - 2012 Intel Corporation.
[    4.938529] e1000e 0000:04:00.0: (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[    4.939598] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
[    4.992246] e1000e 0000:04:00.0: eth0: (PCI Express:2.5GT/s:Width x4) 00:1e:68:04:99:f8
[    4.992657] e1000e 0000:04:00.0: eth0: Intel(R) PRO/1000 Network Connection
[    4.992964] e1000e 0000:04:00.0: eth0: MAC: 5, PHY: 5, PBA No: FFFFFF-0FF
[    4.994745] e1000e 0000:04:00.1: (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[    4.996233] e1000e 0000:04:00.1: irq 66 for MSI/MSI-X
[    5.050901] e1000e 0000:04:00.1: eth1: (PCI Express:2.5GT/s:Width x4) 00:1e:68:04:99:f9
[    5.051317] e1000e 0000:04:00.1: eth1: Intel(R) PRO/1000 Network Connection
[    5.051623] e1000e 0000:04:00.1: eth1: MAC: 5, PHY: 5, PBA No: FFFFFF-0FF
[    5.051857] e1000e 0000:0b:00.0: Disabling ASPM  L1
[    5.052168] e1000e 0000:0b:00.0: (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[    5.052611] e1000e 0000:0b:00.0: irq 67 for MSI/MSI-X
[    5.223454] e1000e 0000:0b:00.0: eth2: (PCI Express:2.5GT/s:Width x4) 00:1e:68:04:99:fa
[    5.223864] e1000e 0000:0b:00.0: eth2: Intel(R) PRO/1000 Network Connection
[    5.224178] e1000e 0000:0b:00.0: eth2: MAC: 0, PHY: 4, PBA No: C83246-002
[    5.224412] e1000e 0000:0b:00.1: Disabling ASPM  L1
[    5.224709] e1000e 0000:0b:00.1: (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[    5.225168] e1000e 0000:0b:00.1: irq 68 for MSI/MSI-X
[    5.397603] e1000e 0000:0b:00.1: eth3: (PCI Express:2.5GT/s:Width x4) 00:1e:68:04:99:fb
[    5.398021] e1000e 0000:0b:00.1: eth3: Intel(R) PRO/1000 Network Connection
[    5.398336] e1000e 0000:0b:00.1: eth3: MAC: 0, PHY: 4, PBA No: C83246-002
[   13.859817] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
[   13.962309] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
[   17.150392] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None

* (unknown), 
From: Omar Alhassane @ 2012-05-15 14:07 UTC (permalink / raw)
  To: netdev

subscribe netdev

* Re: [PATCH NEXT 1/2] linux/ethtool: Added macro ETH_FW_DUMP_DISABLE
From: Ben Hutchings @ 2012-05-15 14:01 UTC (permalink / raw)
  To: Manish Chopra
  Cc: davem, netdev, Dept_NX_Linux_NIC_Driver, anirban.chakraborty
In-Reply-To: <1337080419-31786-2-git-send-email-manish.chopra@qlogic.com>

On Tue, 2012-05-15 at 07:13 -0400, Manish Chopra wrote:
> From: Manish chopra <manish.chopra@qlogic.com>
> 
> o flag field of ethtool_dump structure must be initialized by this macro
> value that is zero, if the firmware dump is disabled.
> by this we can get the firmware dump capability [enable/disable] via ethtool
> 
> Signed-off-by: Manish chopra <manish.chopra@qlogic.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>

> ---
>  include/linux/ethtool.h |    7 ++++++-
>  1 files changed, 6 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
> index 89d68d8..fea2ac0 100644
> --- a/include/linux/ethtool.h
> +++ b/include/linux/ethtool.h
> @@ -661,12 +661,17 @@ struct ethtool_flash {
>   * 	%ETHTOOL_SET_DUMP
>   * @version: FW version of the dump, filled in by driver
>   * @flag: driver dependent flag for dump setting, filled in by driver during
> - * 	  get and filled in by ethtool for set operation
> + *        get and filled in by ethtool for set operation.
> + *        flag must be initialized by macro ETH_FW_DUMP_DISABLE value when
> + *        firmware dump is disabled.
>   * @len: length of dump data, used as the length of the user buffer on entry to
>   * 	 %ETHTOOL_GET_DUMP_DATA and this is returned as dump length by driver
>   * 	 for %ETHTOOL_GET_DUMP_FLAG command
>   * @data: data collected for get dump data operation
>   */
> +
> +#define ETH_FW_DUMP_DISABLE 0
> +
>  struct ethtool_dump {
>  	__u32	cmd;
>  	__u32	version;

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

* PROBLEM: Fragmentation issue with 1521 bytes ip packets
From: Omar Alhassane @ 2012-05-15 14:00 UTC (permalink / raw)
  To: netdev

Hello Folks,

I think I may have found a problem with the Linux networking stack.
Below is a description of the problem.

[1.] One line summary of the problem:
No response to pings of certain sizes.

[2.] Full description of the problem/report:
Using hping3, when I ping a Linux machine with 1521-byte IP packets I
get only one response.
But when I use 1482 bytes, everything works fine. I've tried this with
both TCP and UDP. The MTU of my interface is 1500.
[3.] Keywords (i.e., modules, networking, kernel):
ip, udp, tcp, networking, fragmentation
[4.] Kernel version (from /proc/version):
3.3.1
[5.] Output of Oops.. message (if applicable) with symbolic information
[6.] A small shell script or example program which triggers the
problem (if possible)
The following commands work only if the target has TCP port 22 open

hping3 -d 1481 -S -P 22 10.0.30.225 (only one response)
hping3 -d 1482 -S -P 22 10.0.30.225 (works fine)
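
For reference, the sizes involved in the two commands can be worked out
as follows. This is a sketch only, assuming a 20-byte IPv4 header and a
20-byte TCP header with no options (the headers hping3 actually sends
may differ); the function name is made up for this example. It relies
on the IPv4 rule that every fragment except the last carries a multiple
of 8 bytes.

```python
def fragment_sizes(data_len, mtu=1500, ip_header=20, l4_header=20):
    """Return the IP payload length of each fragment for a packet
    carrying data_len bytes of data behind an l4_header-byte
    transport header."""
    payload = l4_header + data_len
    # Every fragment except the last carries a multiple of 8 bytes.
    per_fragment = (mtu - ip_header) // 8 * 8
    sizes = []
    while payload > per_fragment:
        sizes.append(per_fragment)
        payload -= per_fragment
    sizes.append(payload)
    return sizes


for data_len in (1481, 1482):
    total = 20 + 20 + data_len  # unfragmented IP packet length
    print(data_len, total, fragment_sizes(data_len))
# 1481 1521 [1480, 21]
# 1482 1522 [1480, 22]
```

Under these assumptions both commands exceed the 1500-byte MTU and
produce two fragments; the only difference is one byte in the final
fragment. The arithmetic does not by itself explain the different
behaviour.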

Can somebody confirm if this is a problem?

Thanks
