Netdev List


* Re: tcp timestamp issues with google servers
From: Vijay Subramanian @ 2012-05-22 17:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Miklos Szeredi, netdev, linux-kernel
In-Reply-To: <1337705285.3361.229.camel@edumazet-glaptop>

>> Maybe tcptraceroute[1] can help you figure this out.
>>
>> [1] http://michael.toren.net/code/tcptraceroute/
>
>
> The transparent proxy can intercept TCP connections to port 80/443, and
> let ICMP be NATed by the box.

Just to be clear: tcptraceroute traces the route with TCP SYN packets
instead of the ICMP packets used by vanilla traceroute, precisely
because of the issue you raised.
The idea is that if the connection is being terminated at a middlebox,
the trace will end there. Otherwise, the trace will end at the
destination (Google in this case). This avoids the problem of ICMP
and TCP flows being treated differently by the middlebox.
Is this approach workable?

Thanks,
Vijay

^ permalink raw reply

* net/wanrouter?
From: Joe Perches @ 2012-05-22 17:33 UTC (permalink / raw)
  To: netdev

Does anyone still use this?

^ permalink raw reply

* Re: Strange latency spikes/TX network stalls on Sun Fire X4150(x86) and e1000e
From: Eric Dumazet @ 2012-05-22 17:24 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: Tom Herbert, netdev
In-Reply-To: <eb8cdd693530010d6736baede0cfebd8@visp.net.lb>

On Tue, 2012-05-22 at 20:11 +0300, Denys Fedoryshchenko wrote:

> By the way, if BQL limit is going lower than MTU, is it considered as a 
> bug?
> If yes, i can try to upload 3.4 to some servers and add condition to 
> WARN_ON if limit < 1500.

There is no problem with the BQL limit going lower than the max packet size.

(With TSO, it can be 64K.)

Remember that BQL allows one packet to be sent to the device, regardless
of its size.

The next packet might be blocked and stay in the Qdisc.

If your workload is mostly idle but sends bursts of 3 packets, then
only one is sent immediately.

The following packets have to wait for the TX completion of the first packet.
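
As a rough illustration of the behaviour described above, here is a toy
userspace model in C. It is not the kernel's dql code; the toy_bql names
and the single "stopped" flag are assumptions made for the sketch. It only
shows that a packet is accepted whenever the queue is not stopped, even if
that packet alone exceeds the byte limit.

```c
#include <stdbool.h>

/* Toy model of a byte-limited TX queue; NOT the kernel's dql code. */
struct toy_bql {
	int limit;     /* current byte limit */
	int inflight;  /* bytes handed to the device, not yet completed */
	bool stopped;  /* true once the limit has been exceeded */
};

/* Try to hand one packet to the device. Returns true if it was sent. */
static bool toy_bql_xmit(struct toy_bql *q, int len)
{
	if (q->stopped)
		return false;          /* packet stays in the qdisc */
	q->inflight += len;            /* this packet is always accepted */
	if (q->inflight > q->limit)
		q->stopped = true;     /* block the following packets */
	return true;
}

/* TX completion of len bytes; restart the queue if within the limit. */
static void toy_bql_complete(struct toy_bql *q, int len)
{
	q->inflight -= len;
	if (q->inflight <= q->limit)
		q->stopped = false;
}
```

With a limit of 1000 bytes and a burst of three 1500-byte packets, the
first toy_bql_xmit() call succeeds and the next two fail until
toy_bql_complete() has been called for the first packet.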

^ permalink raw reply

* Re: [RFC] net: skb_head_is_locked() should use skb_header_cloned()
From: Alexander Duyck @ 2012-05-22 17:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1337666034.3361.50.camel@edumazet-glaptop>

On 05/21/2012 10:53 PM, Eric Dumazet wrote:
> Hi David and Alexander
>
> There is no hurry since net-next is closed, but I hit the following
> problem :
>
> When IPv6 conntracking is enabled, code from
> net/ipv6/netfilter/nf_conntrack_reasm.c does a cloning of all skbs to
> build a shadow.
>
> Then we run : (skb here is the head of the 'shadow skb' )
>
> void nf_ct_frag6_output(unsigned int hooknum, struct sk_buff *skb,
>                         struct net_device *in, struct net_device *out,
>                         int (*okfn)(struct sk_buff *))
> {
>         struct sk_buff *s, *s2;
>
>         for (s = NFCT_FRAG6_CB(skb)->orig; s;) {
>                 nf_conntrack_put_reasm(s->nfct_reasm);
>                 nf_conntrack_get_reasm(skb);
>                 s->nfct_reasm = skb;
>
>                 s2 = s->next;
>                 s->next = NULL;
>
>                 NF_HOOK_THRESH(NFPROTO_IPV6, hooknum, s, in, out, okfn,
>                                NF_IP6_PRI_CONNTRACK_DEFRAG + 1);
>                 s = s2;
>         }
>         nf_conntrack_put_reasm(skb);
> }
>
> So when all original skbs are fed to real IPv6 reassembly code, their
> clones are still alive and we hit the condition in skb_try_coalesce() :
>
> if (skb_head_is_locked(from))
> 	return false;
>
> I was wondering if skb_head_is_locked() should be changed to :
>
> if (!skb->head_frag || skb_header_cloned(skb))
> 	return false;
>
> Then we could add skb_header_release() calls on the clones of course in
> net/ipv6/netfilter/nf_conntrack_reasm.c 
>
> Not-Yet-Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  include/linux/skbuff.h |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 0e50171..6509ee1 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -2587,7 +2587,7 @@ static inline bool skb_is_recycleable(const struct sk_buff *skb, int skb_size)
>   */
>  static inline bool skb_head_is_locked(const struct sk_buff *skb)
>  {
> -	return !skb->head_frag || skb_cloned(skb);
> +	return !skb->head_frag || skb_header_cloned(skb);
>  }
>  #endif	/* __KERNEL__ */
>  #endif	/* _LINUX_SKBUFF_H */
>
>
The problem is that the whole reason for checking skb_cloned was to
avoid reference count issues between the skb and the page.  We should
only be using the reference count in one or the other and not both. 
Otherwise we open up the possibility of a data corruption if someone
misinterprets a skb_shinfo()->dataref == 1, or skb_header_cloned
returning false when we have the buffer shared between both the sk_buff
and a page.

The skb_header_cloned check only verifies that the portion between
skb->head and skb->data is currently unused by the other clones. 
It doesn't guarantee that skb->head is not being used by any other
sk_buff.  As such we run the same risk of messing up the dataref
counting if we were to use it.
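
The difference between the two checks can be sketched in userspace C. This
is a simplified model based on how dataref is split into two 16-bit
counters in include/linux/skbuff.h of this era (low half: all references,
high half: payload-only references); the toy_ names are hypothetical and
the real helpers operate on struct sk_buff, not on plain integers.

```c
#include <stdbool.h>

#define TOY_DATAREF_SHIFT 16
#define TOY_DATAREF_MASK  ((1 << TOY_DATAREF_SHIFT) - 1)

/* Is the data area shared with any other sk_buff? */
static bool toy_skb_cloned(bool cloned, int dataref)
{
	return cloned && (dataref & TOY_DATAREF_MASK) != 1;
}

/* Does any other sk_buff still reference the header area? */
static bool toy_skb_header_cloned(bool cloned, int dataref)
{
	if (!cloned)
		return false;
	/* all references minus payload-only references */
	dataref = (dataref & TOY_DATAREF_MASK) - (dataref >> TOY_DATAREF_SHIFT);
	return dataref != 1;
}
```

With two holders of the data, one of which has released the header
(dataref == 2 + (1 << 16)), toy_skb_cloned() still reports sharing while
toy_skb_header_cloned() returns false, which is the gap described above.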

The way I see it there are 2 solutions.  The first would be to just
split the reference counts and make it so that calls like skb_cloned
have to check both dataref and page count if skb->head_frag is set.  The
second option would be to look at something like pskb_expand_head where
we could generate a new head fragment and then memcpy the data over to
that frag in order to "unlock" the head.

Thanks,

Alex

^ permalink raw reply

* Using jiffies for tcp_time_stamp?
From: Srećko Jurić-Kavelj @ 2012-05-22 17:21 UTC (permalink / raw)
  To: netdev
In-Reply-To: <CAACrLC39Xdm3vTKUsjz43ZPyEq_vHxR-_Uf56SjSm+kUqxOqZg@mail.gmail.com>

Hi,

Recently I tackled round trip time estimation of a TCP connection.
After implementing a straight-forward approach (time stamping sending
and receiving of data using clock_gettime) I found this article:
http://linuxgazette.net/136/pfeiffer.html (using getsockopt() to get
struct tcp_info). The tcp_info structure conveniently has a rtt field.

Using the first method I get 1-3 ms RTT, and by using the second I get
>=10 ms RTT.

By looking at the code it's clear that the time stamping is done with
jiffies, and my kernel has CONFIG_HZ=100.

I understand that this is for performance reasons (and the RTT
smoothing filter is implemented with bit shifting operations), but
would using a more precise time stamp have a significant impact on
performance? Since RTT is used to compute RTO, wouldn't there be a
benefit to having a more accurate estimate of this value?
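
To make the granularity effect concrete, here is a small C sketch of what a
tick-based clock does to a short interval. The function name is
hypothetical, and rounding a zero sample up to one tick is an assumption
about why the reported value never drops below one jiffy.

```c
/* Elapsed time between two microsecond timestamps as seen by a clock
 * that only advances once per tick (hz ticks per second). */
static long toy_rtt_jiffies_us(long start_us, long end_us, int hz)
{
	long tick_us = 1000000L / hz;
	long ticks = end_us / tick_us - start_us / tick_us;

	if (ticks == 0)
		ticks = 1;   /* assumption: a zero sample is rounded up to one tick */
	return ticks * tick_us;
}
```

With hz = 100, a real 2 ms round trip (for example from 1000 us to 3000 us)
comes out as 10000 us, while hz = 1000 would report 2000 us for the same
interval.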

Best regards,

Srećko Jurić-Kavelj, dipl.ing. (Ms.E.E)
Research and Teaching Assistant at University of Zagreb
(Faculty of Electrical Engineering and Computing, Department of
Control and Computer Engineering)

E-mail: srecko.juric-kavelj@fer.hr
URL: http://www.fer.hr/srecko.juric-kavelj

Sanctus Hieronymus: "Parce mihi, Domine, quia dalmata sum!"

^ permalink raw reply

* Re: Strange latency spikes/TX network stalls on Sun Fire X4150(x86) and e1000e
From: Denys Fedoryshchenko @ 2012-05-22 17:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tom Herbert, netdev
In-Reply-To: <1337589620.3361.23.camel@edumazet-glaptop>

On 2012-05-21 11:40, Eric Dumazet wrote:
> On Mon, 2012-05-21 at 10:30 +0200, Eric Dumazet wrote:
>> On Mon, 2012-05-21 at 11:06 +0300, Denys Fedoryshchenko wrote:
>>
>> > Not sure it is a lot of time, after all it is 2 x core quad 
>> machine,
>> > should be enough fast for pings.
>> > It will cause stalls on small packets even more seems.
>> >
>> > Tested latest git, net-next, still the same, stalls.
>> > hardware latency detector are silent by the way, so there is no
>> > significant SMI.
>> >
>>
>> I am trying to reproduce your problem here with no luck yet.
>>
>> I wonder of softirq are correctly scheduled on your machine
>>
>
> By the way, fact you have 8 cpus is irrelevant.
>
> Only one cpu has queued the NET_TX_SOFTIRQ softirq (serviced by
> net_tx_action())
>
>
> If this cpu is busy servicing other stuff, no other cpu will help.
>
By the way, if BQL limit is going lower than MTU, is it considered as a 
bug?
If yes, i can try to upload 3.4 to some servers and add condition to 
WARN_ON if limit < 1500.

---
Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.

^ permalink raw reply

* Re: tcp timestamp issues with google servers
From: Rick Jones @ 2012-05-22 16:58 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Eric Dumazet, netdev, linux-kernel
In-Reply-To: <87mx50rz34.fsf@tucsk.pomaz.szeredi.hu>

On 05/22/2012 08:54 AM, Miklos Szeredi wrote:
> Eric Dumazet<eric.dumazet@gmail.com>  writes:
>
>> On Tue, 2012-05-22 at 17:25 +0200, Miklos Szeredi wrote:
>>
>>> So it appears.  The IP address is certainly registered to Google.
>>
>> Good, but you could have a middlebox doing transparent proxying.
>>
>> The SYNACK could be send by this box.
>
> Okay.  Is there a way to find out whether there is a middlebox or not?

The source IP in the trace was a 192.168 IP - is it possible/desirable 
to reproduce the problem without the device doing NAT in the path?

What is your "public" IP address?  Given that, and the IP address to 
which you are connecting, it should be possible to validate the RTT you 
are seeing.  If the destination Google IP address is geographically or 
topologically far enough from your public source IP, that would show 
that the RTT you are seeing is not physically possible, and so would 
suggest there is a middlebox (other than your NAT).  It could not, 
however, show that there is no middlebox.
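
The plausibility check can be written as a one-line bound. This sketch
assumes signals travel at roughly 200 km per millisecond in fiber (about
two thirds of the speed of light) and ignores queueing and processing
delay; the function name is hypothetical.

```c
/* Upper bound on the one-way path length for a given round-trip time,
 * assuming signals travel at about 200 km per millisecond in fiber. */
static double toy_max_distance_km(double rtt_ms)
{
	const double km_per_ms = 200.0;   /* assumed: roughly 2/3 of c */

	return (rtt_ms / 2.0) * km_per_ms;
}
```

For the 2730 usec RTT discussed in this thread, the bound is about 273 km,
so a server farther away than that could not have answered directly.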

rick jones

^ permalink raw reply

* Re: tcp timestamp issues with google servers
From: Eric Dumazet @ 2012-05-22 16:48 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: Miklos Szeredi, netdev, linux-kernel
In-Reply-To: <CAGK4HS9cGnOcoLAL1pggWvMG1B40rWyqc3BHsddUvnii4EvTcQ@mail.gmail.com>

On Tue, 2012-05-22 at 09:38 -0700, Vijay Subramanian wrote:
> > Okay.  Is there a way to find out whether there is a middlebox or not?
> >
> 
> Miklos,
> Maybe tcptraceroute[1] can help you figure this out.
> 
> Hope this helps.
> Vijay
> 
> [1] http://michael.toren.net/code/tcptraceroute/

The transparent proxy can intercept TCP connections to port 80/443 and
let ICMP be NATed by the box.

So it's better to check whether the delay between SYN and SYNACK is roughly
independent of the HTTP server.

If you see a very large range of delays, you can conclude it's not a
transparent proxy.
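
One way to sample that delay from userspace is to time a blocking
connect(), which returns once the SYNACK has arrived. The following is a
minimal sketch using standard POSIX calls; the toy_ function names are
hypothetical, and the measurement includes some local overhead on top of
the SYN to SYNACK delay.

```c
#define _POSIX_C_SOURCE 200112L
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Difference b - a in microseconds. */
static long toy_timespec_diff_us(struct timespec a, struct timespec b)
{
	long long ns = (long long)(b.tv_sec - a.tv_sec) * 1000000000LL +
		       (b.tv_nsec - a.tv_nsec);

	return (long)(ns / 1000);
}

/* Time a blocking TCP connect() to host:port.
 * Returns the elapsed microseconds, or -1 on failure. */
static long toy_connect_time_us(const char *host, const char *port)
{
	struct addrinfo hints, *res;
	struct timespec t0, t1;
	long us = -1;
	int fd;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_INET;
	hints.ai_socktype = SOCK_STREAM;
	if (getaddrinfo(host, port, &hints, &res) != 0)
		return -1;

	fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
	if (fd >= 0) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		if (connect(fd, res->ai_addr, res->ai_addrlen) == 0) {
			clock_gettime(CLOCK_MONOTONIC, &t1);
			us = toy_timespec_diff_us(t0, t1);
		}
		close(fd);
	}
	freeaddrinfo(res);
	return us;
}
```

Calling toy_connect_time_us() for port "80" on several unrelated HTTP
servers and comparing the results gives the range of delays mentioned
above.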

^ permalink raw reply

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Eric Dumazet @ 2012-05-22 16:45 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: Ben Hutchings, netdev
In-Reply-To: <1337704382.1698.53.camel@kjm-desktop.uk.level5networks.com>

On Tue, 2012-05-22 at 17:32 +0100, Kieran Mansley wrote:
> On Tue, 2012-05-22 at 18:12 +0200, Eric Dumazet wrote:
> > 
> > __tcp_select_window() ( more precisely tcp_space() takes into account
> > memory used in receive/ofo queue, but not frames in backlog queue)
> > 
> > So if you send bursts, it might explain TCP stack continues to
> > advertise
> > a too big window, instead of anticipate the problem.
> > 
> > Please try the following patch :
> > 
> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > index e79aa48..82382cb 100644
> > --- a/include/net/tcp.h
> > +++ b/include/net/tcp.h
> > @@ -1042,8 +1042,9 @@ static inline int tcp_win_from_space(int space)
> >  /* Note: caller must be prepared to deal with negative returns */ 
> >  static inline int tcp_space(const struct sock *sk)
> >  {
> > -       return tcp_win_from_space(sk->sk_rcvbuf -
> > -                                 atomic_read(&sk->sk_rmem_alloc));
> > +       int used = atomic_read(&sk->sk_rmem_alloc) +
> > sk->sk_backlog.len;
> > +
> > +       return tcp_win_from_space(sk->sk_rcvbuf - used);
> >  } 
> >  
> >  static inline int tcp_full_space(const struct sock *sk)
> 
> 
> I can give this a try (not sure when - probably later this week) but I
> think this it is back to front.  The patch above will reduce the
> advertised window by sk_backlog.len, but at the time that the window was
> advertised that allowed the dropped packets to be sent the backlog was
> empty.  It is later, when the kernel is waking the application and takes
> the socket lock that the backlog starts to be used and the drop happens.
> But reducing the window advertised at this point is futile - the packets
> that will be dropped are already in flight.
> 

Not really. If we receive these packets while backlog is empty, then the
sender violates TCP rules.

We advertise tcp window directly from memory we are allowed to consume.

(On the premise sender behaves correctly, not sending bytes in small
packets)


> The problem exists because the backlog has a tighter limit on it than
> the receive window does; I think the backlog should be able to accept
> sk_rcvbuf bytes in addition to what is already in the receive buffer (or
> up to the advertised receive window if that's smaller).  At the moment
> it will only accept sk_rcvbuf bytes including what is already in the
> receive buffer.  The logic being that in this case we're using the
> backlog because it's in the process of emptying the receive buffer into
> the application, and so the receive buffer will very soon be empty, and
> so we will very soon be able to accept sk_rcvbuf bytes.  This is evident
> from the packet capture as the kernel stack is quite happy to accept the
> significant quantity of data that arrives as part of the same burst
> immediately after it has dropped a couple of packets.
> 

This is not evident from the capture; you are mistaken.

tcpdump captures packets before the TCP stack; it doesn't say whether they are:

1) queued in the receive or ofo queue
2) queued in the socket backlog
3) dropped because we hit the socket rcvbuf limit

If the socket lock is held by the user, packets are queued to the backlog
or dropped.

Then, when the socket lock is about to be released, we process the backlog.
^ permalink raw reply

* Re: tcp timestamp issues with google servers
From: Vijay Subramanian @ 2012-05-22 16:38 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Eric Dumazet, netdev, linux-kernel
In-Reply-To: <87mx50rz34.fsf@tucsk.pomaz.szeredi.hu>

> Okay.  Is there a way to find out whether there is a middlebox or not?
>

Miklos,
Maybe tcptraceroute[1] can help you figure this out.

Hope this helps.
Vijay

[1] http://michael.toren.net/code/tcptraceroute/

^ permalink raw reply

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Kieran Mansley @ 2012-05-22 16:32 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Hutchings, netdev
In-Reply-To: <1337703170.3361.217.camel@edumazet-glaptop>

On Tue, 2012-05-22 at 18:12 +0200, Eric Dumazet wrote:
> 
> __tcp_select_window() ( more precisely tcp_space() takes into account
> memory used in receive/ofo queue, but not frames in backlog queue)
> 
> So if you send bursts, it might explain TCP stack continues to
> advertise
> a too big window, instead of anticipate the problem.
> 
> Please try the following patch :
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index e79aa48..82382cb 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1042,8 +1042,9 @@ static inline int tcp_win_from_space(int space)
>  /* Note: caller must be prepared to deal with negative returns */ 
>  static inline int tcp_space(const struct sock *sk)
>  {
> -       return tcp_win_from_space(sk->sk_rcvbuf -
> -                                 atomic_read(&sk->sk_rmem_alloc));
> +       int used = atomic_read(&sk->sk_rmem_alloc) +
> sk->sk_backlog.len;
> +
> +       return tcp_win_from_space(sk->sk_rcvbuf - used);
>  } 
>  
>  static inline int tcp_full_space(const struct sock *sk)

I can give this a try (not sure when - probably later this week) but I
think it is back to front.  The patch above will reduce the
advertised window by sk_backlog.len, but at the time that the window was
advertised that allowed the dropped packets to be sent the backlog was
empty.  It is later, when the kernel is waking the application and takes
the socket lock that the backlog starts to be used and the drop happens.
But reducing the window advertised at this point is futile - the packets
that will be dropped are already in flight.

The problem exists because the backlog has a tighter limit on it than
the receive window does; I think the backlog should be able to accept
sk_rcvbuf bytes in addition to what is already in the receive buffer (or
up to the advertised receive window if that's smaller).  At the moment
it will only accept sk_rcvbuf bytes including what is already in the
receive buffer.  The logic being that in this case we're using the
backlog because it's in the process of emptying the receive buffer into
the application, and so the receive buffer will very soon be empty, and
so we will very soon be able to accept sk_rcvbuf bytes.  This is evident
from the packet capture as the kernel stack is quite happy to accept the
significant quantity of data that arrives as part of the same burst
immediately after it has dropped a couple of packets.
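
The two admission rules contrasted above can be modelled in a few lines of
userspace C. This is a toy sketch, not a kernel patch: the function names
are hypothetical, the real kernel check may also account for the size of
the incoming skb, and the "advertised receive window if smaller" case is
left out.

```c
#include <stdbool.h>

/* Behaviour as described above: backlog and receive queue share one
 * sk_rcvbuf budget. */
static bool toy_backlog_full_shared(int rcvbuf, int rmem_alloc, int backlog_len)
{
	return backlog_len + rmem_alloc > rcvbuf;
}

/* Suggested behaviour: the backlog gets sk_rcvbuf bytes of its own, on
 * top of what already sits in the receive queue. */
static bool toy_backlog_full_separate(int rcvbuf, int rmem_alloc, int backlog_len)
{
	(void)rmem_alloc;   /* receive queue usage no longer counts */
	return backlog_len > rcvbuf;
}
```

With rcvbuf = 87380, 80000 bytes in the receive queue and 10000 bytes in
the backlog, the shared rule reports the backlog as full while the
separate rule still accepts data.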

Perhaps it would be easier for me to write a patch to show this
suggested solution?

Kieran

^ permalink raw reply

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Eric Dumazet @ 2012-05-22 16:12 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: Ben Hutchings, netdev
In-Reply-To: <1337699379.1698.30.camel@kjm-desktop.uk.level5networks.com>

On Tue, 2012-05-22 at 16:09 +0100, Kieran Mansley wrote:
> On Tue, 2012-05-22 at 11:30 +0200, Eric Dumazet wrote:
> > Also can you post a pcap capture of problematic flow ?
> 
> I'll email this to you directly. The capture is generated with netserver
> on the system under test, and NetPerf sending from a similar server.
> I've only included the first 1000 frames to keep the capture size down.
> There are 7 retransmissions in that capture, and the TCPBacklogDrops
> counter incremented by 7 during the test, so I'm happy to say they are
> the cause of the drops.
> 
> The system under test was running net-next.
> 
> I've not tried with another NIC (e.g. tg3) but will see if I can find
> one to test.

Or you could change sfc to allow its frames to be coalesced.

> 
> I've got a feeling that the drops might be easier to reproduce if I
> taskset the netserver process to a different package than the one that
> is handling the network interrupt for that NIC.  This fits with my
> earlier theory in that it is likely to increase the overhead of waking
> the user-level process to satisfy the read and so increase the time
> during which received packets could overflow the backlog.  Having a
> relatively aggressive sending TCP also helps, e.g. one that is
> configured to open its congestion window quickly, as this will produce
> more intensive bursts.

__tcp_select_window() (more precisely, tcp_space()) takes into account
memory used in the receive/ofo queues, but not frames in the backlog queue.

So if you send bursts, it might explain why the TCP stack continues to
advertise too big a window instead of anticipating the problem.

Please try the following patch :

diff --git a/include/net/tcp.h b/include/net/tcp.h
index e79aa48..82382cb 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1042,8 +1042,9 @@ static inline int tcp_win_from_space(int space)
 /* Note: caller must be prepared to deal with negative returns */ 
 static inline int tcp_space(const struct sock *sk)
 {
-	return tcp_win_from_space(sk->sk_rcvbuf -
-				  atomic_read(&sk->sk_rmem_alloc));
+	int used = atomic_read(&sk->sk_rmem_alloc) + sk->sk_backlog.len;
+
+	return tcp_win_from_space(sk->sk_rcvbuf - used);
 } 
 
 static inline int tcp_full_space(const struct sock *sk)
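
For readers who want to see the effect of the patch on the advertised
space without building a kernel, here is a userspace model in C. It
mirrors tcp_win_from_space() only for a positive scale factor; the toy_
names and the adv_win_scale parameter (standing in for
sysctl_tcp_adv_win_scale) are assumptions of this sketch.

```c
/* Positive scale factor only; callers here pass non-negative space. */
static int toy_win_from_space(int space, int adv_win_scale)
{
	return space - (space >> adv_win_scale);
}

/* Before the patch: only sk_rmem_alloc is subtracted from sk_rcvbuf. */
static int toy_tcp_space_old(int rcvbuf, int rmem_alloc, int scale)
{
	return toy_win_from_space(rcvbuf - rmem_alloc, scale);
}

/* After the patch: the backlog length is counted as used memory too. */
static int toy_tcp_space_patched(int rcvbuf, int rmem_alloc,
				 int backlog_len, int scale)
{
	int used = rmem_alloc + backlog_len;

	return toy_win_from_space(rcvbuf - used, scale);
}
```

With rcvbuf = 87380, rmem_alloc = 20000, a 30000-byte backlog and a scale
of 1, the old computation yields 33690 and the patched one 18690.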

^ permalink raw reply related

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
From: Shirley Ma @ 2012-05-22 15:55 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, eric.dumazet, netdev, linux-kernel, ebiederm,
	davem
In-Reply-To: <4FBB64F7.5090801@redhat.com>

On Tue, 2012-05-22 at 18:05 +0800, Jason Wang wrote:
> On 05/21/2012 11:42 PM, Shirley Ma wrote:
> > On Mon, 2012-05-21 at 14:05 +0800, Jason Wang wrote:
> >>>> - tx polling depends on skb_orphan() which is often called by
> >> device
> >>>> driver when it place the packet into the queue of the devices
> >> instead
> >>>> of  when the packets were sent. So it was too early for vhost to
> be
> >>>> notified.
> >>> Then do you think it's better to replace with vhost_poll_queue
> here
> >>> instead?
> >> Just like what does this patch do - calling vhost_poll_queue() in
> >> vhost_zerocopy_callback().
> >>>> - it only works when the pending DMAs exceeds VHOST_MAX_PEND,
> it's
> >>>> highly possible that guest needs to be notified when the pending
> >>>> packets
> >>>> isn't so much.
> >>> In which situation the guest needs to be notified when there is no
> >> TX
> >>> besides buffers run out?
> >> Consider guest call virtqueue_enable_cb_delayed() which means it
> only
> >> need to be notified when 3/4 of pending buffers ( about 178 buffers
> >> (256-MAX_SKB_FRAGS-2)*3/4 ) were sent by host. So vhost_net would
> >> notify
> >> guest when about 60 buffers were pending. Since tx polling is only
> >> enabled when pending packets exceeds VHOST_MAX_PEND 128, so tx work
> >> would not be notified to run and guest would never get the
> interrupt
> >> it
> >> expected to re-enable the queue.
> > So it seems we still need vhost_enable_notify() in handle_tx when
> there
> > is no tx in zerocopy case.
> >
> > Do you know which one is more expensive: the cost of
> vhost_poll_queue()
> > in each zerocopy callback or calling vhost_enable_notify()?
> 
> Didn't follow here, do you mean vhost_signal() here? 

I meant removing the code in handle_tx for zerocopy as below:

+	if (zcopy) {
                        /* If more outstanding DMAs, queue the work.
                         * Handle upend_idx wrap around
                         */
                        num_pends = likely(vq->upend_idx >= vq->done_idx) ?
                                    (vq->upend_idx - vq->done_idx) :
                                    (vq->upend_idx + UIO_MAXIOV - vq->done_idx);
+			/* zerocopy vhost_enable_notify is under zerocopy callback
+			 * since it could be too early to notify here */
+			break;
+	}
-                       if (unlikely(num_pends > VHOST_MAX_PEND)) {
-                                tx_poll_start(net, sock);
-                                set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
-                                break;
-                        }
                        if (unlikely(vhost_enable_notify(&net->dev, vq))) {
                                vhost_disable_notify(&net->dev, vq);
                                continue;
                        }
                        break;
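
The wrap-around arithmetic for num_pends in the fragment above can be
checked in isolation. This sketch assumes a ring of 1024 entries (the
value of UIO_MAXIOV as far as I know); the TOY_ and toy_ names are
hypothetical.

```c
#define TOY_UIO_MAXIOV 1024   /* assumed ring size */

/* Number of outstanding DMAs between done_idx and upend_idx,
 * handling upend_idx having wrapped past the end of the ring. */
static int toy_num_pends(int upend_idx, int done_idx)
{
	return upend_idx >= done_idx ?
		upend_idx - done_idx :
		upend_idx + TOY_UIO_MAXIOV - done_idx;
}
```

For example, upend_idx = 3 with done_idx = 1020 gives 7 pending entries
rather than a negative number.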

Thanks
Shirley

^ permalink raw reply

* Re: tcp timestamp issues with google servers
From: Miklos Szeredi @ 2012-05-22 15:54 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, linux-kernel
In-Reply-To: <1337701259.3361.208.camel@edumazet-glaptop>

Eric Dumazet <eric.dumazet@gmail.com> writes:

> On Tue, 2012-05-22 at 17:25 +0200, Miklos Szeredi wrote:
>
>> So it appears.  The IP address is certainly registered to Google.
>
> Good, but you could have a middlebox doing transparent proxying.
>
> The SYNACK could be send by this box.

Okay.  Is there a way to find out whether there is a middlebox or not?

Thanks,
Miklos

^ permalink raw reply

* Re: tcp timestamp issues with google servers
From: Eric Dumazet @ 2012-05-22 15:40 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: netdev, linux-kernel
In-Reply-To: <87zk90s0em.fsf@tucsk.pomaz.szeredi.hu>

On Tue, 2012-05-22 at 17:25 +0200, Miklos Szeredi wrote:

> So it appears.  The IP address is certainly registered to Google.

Good, but you could have a middlebox doing transparent proxying.

The SYNACK could be sent by this box.

^ permalink raw reply

* Re: tcp timestamp issues with google servers
From: Miklos Szeredi @ 2012-05-22 15:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, linux-kernel
In-Reply-To: <1337278363.3403.39.camel@edumazet-glaptop>

Eric Dumazet <eric.dumazet@gmail.com> writes:

> On Thu, 2012-05-17 at 11:39 +0200, Miklos Szeredi wrote:
>> Sometimes connection to google.com, gmail.com and other google servers
>> doesn't work or takes ages to connect.  When this hits it hits all
>> google servers at the same time and it's persistent.  It never happens
>> to anything other than google.  Rebooting helps.  Rarely it goes away
>> spontaneously.
>> 
>> Apparently google is sometimes replying with an invalid TSecr timestamp
>> value (smaller than the one sent in the last packet) and this confuses
>> the Linux TCP stack which either discards the packet or sends a Reset.
>> 
>> Network dump attached.
>> 
>> I found only a couple of references to this issue:
>> 
>> http://gotchas.livejournal.com/3028.html
>> 
>> http://groups.google.com/group/comp.os.linux.networking/browse_thread/thread/29f56feded11b42a
>> 
>> Turning tcp timestamps off fixes the issue:
>> 
>>   sysctl -w net.ipv4.tcp_timestamps=0
>> 
>> Not sure why this happens only to me and a very few others.
>> 
>> It appears to be an issue with google TCP stack (is it a modified
>> stack?) but I thought about issues in my network switch (restarting it
>> doesn't help) or something in the ISP, but those look unlikely.
>> 
>> Any ideas?
>> 
>> Thanks,
>> Miklos
>> 
>> 
>> 
>>   1   0.000000 192.168.28.100 -> 74.125.232.226 TCP 51303 > http [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSV=35355050 TSER=0 WS=5
>>   2   0.002730 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=0 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184565067 TSER=35325344 WS=6
>
>
> Do you really have 2730 usec RTT between you and this (Google ?)
> server ?

So it appears.  The IP address is certainly registered to Google.

Thanks,
Miklos

^ permalink raw reply

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Kieran Mansley @ 2012-05-22 15:09 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Hutchings, netdev
In-Reply-To: <1337679045.3361.154.camel@edumazet-glaptop>

On Tue, 2012-05-22 at 11:30 +0200, Eric Dumazet wrote:
> Also can you post a pcap capture of problematic flow ?

I'll email this to you directly. The capture is generated with netserver
on the system under test, and NetPerf sending from a similar server.
I've only included the first 1000 frames to keep the capture size down.
There are 7 retransmissions in that capture, and the TCPBacklogDrops
counter incremented by 7 during the test, so I'm happy to say they are
the cause of the drops.

The system under test was running net-next.

I've not tried with another NIC (e.g. tg3) but will see if I can find
one to test.

I've got a feeling that the drops might be easier to reproduce if I
taskset the netserver process to a different package than the one that
is handling the network interrupt for that NIC.  This fits with my
earlier theory in that it is likely to increase the overhead of waking
the user-level process to satisfy the read and so increase the time
during which received packets could overflow the backlog.  Having a
relatively aggressive sending TCP also helps, e.g. one that is
configured to open its congestion window quickly, as this will produce
more intensive bursts.

Kieran

^ permalink raw reply

* Re: [PATCH] net: Surpress kmemleak messages on sysctl paths
From: Steven Rostedt @ 2012-05-22 14:51 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: David Miller, LKML, netdev, viro, tixxdz
In-Reply-To: <878vgk1do2.fsf@xmission.com>

On Tue, 2012-05-22 at 08:41 -0600, Eric W. Biederman wrote:
> Steven Rostedt <rostedt@goodmis.org> writes:
> 
> > The network code allocates ctl_table_headers that are used for the life
> > of the kernel. These headers are registered and never unregistered. The
> > head pointer is allocated and not referenced, as it never needs to be
> > unregistered, and the kmemleak detector triggers these as false
> > positives:
> 
> The fix for this should already be merged into Linus's tree from the
> net-next tree for 3.5.

Ah, I didn't look at net-next. I just looked at 3.4 and didn't see
anything. If that's the case, simply ignore :-)

-- Steve

^ permalink raw reply

* Re: WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110()
From: Sergio Correia @ 2012-05-22 14:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1337697203.3361.190.camel@edumazet-glaptop>

Hi Eric,

On Tue, May 22, 2012 at 11:33 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2012-05-22 at 10:57 -0300, Sergio Correia wrote:
>> On Tue, May 22, 2012 at 2:43 AM, Sergio Correia <lists@uece.net> wrote:
>> > So far it has happened only once.
>> > Last commit is 471368557a734c6c486ee757952c902b36e7fd01.
>> >
>> >
>> > [ 3726.624387] ------------[ cut here ]------------
>> > [ 3726.624398] WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110()
>> > [ 3726.624400] Hardware name: N53SV
>> > [ 3726.624402] cleanup rbuf bug: copied A4D1F126 seq A4D1F126 rcvnxt A4D1F126
>> > [ 3726.624404] Modules linked in:
>> > [ 3726.624407] Pid: 1416, comm: transmission-gt Not tainted 3.4.0-git+ #52
>> > [ 3726.624409] Call Trace:
>> > [ 3726.624415]  [<ffffffff81035eba>] warn_slowpath_common+0x7a/0xb0
>> > [ 3726.624419]  [<ffffffff81035f91>] warn_slowpath_fmt+0x41/0x50
>> > [ 3726.624507]  [<ffffffff81aa57a5>] ? sub_preempt_count+0x65/0xc0
>> > [ 3726.624510]  [<ffffffff819101cf>] tcp_cleanup_rbuf+0x4f/0x110
>> > [ 3726.624514]  [<ffffffff819112b7>] tcp_recvmsg+0x637/0xa60
>> > [ 3726.624518]  [<ffffffff81849310>] ? release_sock+0xe0/0x110
>> > [ 3726.624522]  [<ffffffff81934a34>] inet_recvmsg+0x94/0xc0
>> > [ 3726.624534]  [<ffffffff81844792>] sock_aio_read.part.8+0x142/0x170
>> > [ 3726.624537]  [<ffffffff818447c0>] ? sock_aio_read.part.8+0x170/0x170
>> > [ 3726.624540]  [<ffffffff818447e1>] sock_aio_read+0x21/0x30
>> > [ 3726.624544]  [<ffffffff81124b0a>] do_sync_readv_writev+0xca/0x110
>> > [ 3726.624548]  [<ffffffff8140f582>] ? security_file_permission+0x92/0xb0
>> > [ 3726.624552]  [<ffffffff8112425c>] ? rw_verify_area+0x5c/0xe0
>> > [ 3726.624555]  [<ffffffff81124de6>] do_readv_writev+0xd6/0x1e0
>> > [ 3726.624558]  [<ffffffff8184336b>] ? sock_do_ioctl+0x2b/0x70
>> > [ 3726.624562]  [<ffffffff81135abf>] ? do_vfs_ioctl+0x8f/0x530
>> > [ 3726.624566]  [<ffffffff8141278f>] ? file_has_perm+0x8f/0xa0
>> > [ 3726.624569]  [<ffffffff81124f7d>] vfs_readv+0x2d/0x50
>> > [ 3726.624572]  [<ffffffff81124fe5>] sys_readv+0x45/0xb0
>> > [ 3726.624575]  [<ffffffff81aa9062>] system_call_fastpath+0x16/0x1b
>> > [ 3726.624578] ---[ end trace 6dc5d813929e5e6f ]---
>>
>> Checked this morning, and my dmesg now is basically composed of this
>> warning over and over and over.
>
> Is it wifi adapter ?
>

Yes, it's an Atheros AR9285 adapter.
This morning I did a make mrproper before rebuilding the kernel
(should I always do that?), but the warning has just appeared again.

> Please send "netstat -s" output
>

netstat -s
Ip:
    49826 total packets received
    3 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    49771 incoming packets delivered
    36344 requests sent out
Icmp:
    129 ICMP messages received
    1 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 129
    156 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 156
IcmpMsg:
        InType3: 129
        OutType3: 156
Tcp:
    1150 active connections openings
    42 passive connection openings
    63 failed connection attempts
    45 connection resets received
    29 connections established
    25457 segments received
    23088 segments send out
    689 segments retransmited
    0 bad segments received.
    347 resets sent
Udp:
    23927 packets received
    155 packets to unknown port received.
    0 packet receive errors
    12411 packets sent
    0 receive buffer errors
    0 send buffer errors
UdpLite:
TcpExt:
    19 invalid SYN cookies received
    1 resets received for embryonic SYN_RECV sockets
    122 TCP sockets finished time wait in fast timer
    668 delayed acks sent
    Quick ack mode was activated 228 times
    2 packets directly queued to recvmsg prequeue.
    2 bytes directly received in process context from prequeue
    10825 packet headers predicted
    2914 acknowledgments not containing data payload received
    1355 predicted acknowledgments
    3 times recovered from packet loss by selective acknowledgements
    1 congestion windows recovered without slow start by DSACK
    36 congestion windows recovered without slow start after partial ack
    3 fast retransmits
    229 other TCP timeouts
    1 SACK retransmits failed
    441 DSACKs sent for old packets
    11 DSACKs sent for out of order packets
    31 DSACKs received
    35 connections reset due to unexpected data
    32 connections reset due to early user close
    3 connections aborted due to timeout
    TCPDSACKIgnoredNoUndo: 6
    TCPSackShiftFallback: 11
    TCPRcvCoalesce: 8467
IpExt:
    OutMcastPkts: 13
    InBcastPkts: 271
    OutBcastPkts: 184
    InOctets: 53951318
    OutOctets: 5022932
    OutMcastOctets: 2091
    InBcastOctets: 40466
    OutBcastOctets: 32946
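
(Archive note: a minimal sketch, not part of the original message, of how the TCP retransmit ratio could be read off the counters above. The function name `parse_tcp_counters` and the trimmed `SAMPLE` text are hypothetical; the counter labels are copied verbatim from the output above, including its spelling.)

```python
import re


def parse_tcp_counters(text):
    """Collect 'N description' lines from the Tcp: section of `netstat -s` text."""
    counters = {}
    in_tcp = False
    for line in text.splitlines():
        # Section headers such as "Tcp:" or "Udp:" start in column 0.
        if not line.startswith((" ", "\t")):
            in_tcp = line.strip() == "Tcp:"
            continue
        m = re.match(r"\s+(\d+)\s+(.*)", line)
        if in_tcp and m:
            counters[m.group(2).strip()] = int(m.group(1))
    return counters


# Trimmed copy of the counters quoted in this message.
SAMPLE = """Tcp:
    25457 segments received
    23088 segments send out
    689 segments retransmited
Udp:
    23927 packets received
"""

tcp = parse_tcp_counters(SAMPLE)
ratio = tcp["segments retransmited"] / tcp["segments send out"]
print(f"retransmit ratio: {ratio:.2%}")
```

With these numbers, 689 of 23088 sent segments were retransmissions, roughly 3%.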

^ permalink raw reply

* Re: [PATCH] net: Surpress kmemleak messages on sysctl paths
From: Eric W. Biederman @ 2012-05-22 14:41 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: David Miller, LKML, netdev, viro, tixxdz
In-Reply-To: <1337647392.13348.14.camel@gandalf.stny.rr.com>

Steven Rostedt <rostedt@goodmis.org> writes:

> The network code allocates ctl_table_headers that are used for the life
> of the kernel. These headers are registered and never unregistered. The
> head pointer is allocated and not referenced, as it never needs to be
> unregistered, and the kmemleak detector triggers these as false
> positives:

The fix for this should already be merged into Linus's tree from the
net-next tree for 3.5.

Eric

^ permalink raw reply

* Re: WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110()
From: Eric Dumazet @ 2012-05-22 14:33 UTC (permalink / raw)
  To: Sergio Correia; +Cc: netdev
In-Reply-To: <CAJyhjX1VKQpAbqkxnWFNrRSVBuSRYaNQSUXYWqYfoE0GnmVokQ@mail.gmail.com>

On Tue, 2012-05-22 at 10:57 -0300, Sergio Correia wrote:
> On Tue, May 22, 2012 at 2:43 AM, Sergio Correia <lists@uece.net> wrote:
> > So far it has happened only once.
> > Last commit is 471368557a734c6c486ee757952c902b36e7fd01.
> >
> >
> > [ 3726.624387] ------------[ cut here ]------------
> > [ 3726.624398] WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110()
> > [ 3726.624400] Hardware name: N53SV
> > [ 3726.624402] cleanup rbuf bug: copied A4D1F126 seq A4D1F126 rcvnxt A4D1F126
> > [ 3726.624404] Modules linked in:
> > [ 3726.624407] Pid: 1416, comm: transmission-gt Not tainted 3.4.0-git+ #52
> > [ 3726.624409] Call Trace:
> > [ 3726.624415]  [<ffffffff81035eba>] warn_slowpath_common+0x7a/0xb0
> > [ 3726.624419]  [<ffffffff81035f91>] warn_slowpath_fmt+0x41/0x50
> > [ 3726.624507]  [<ffffffff81aa57a5>] ? sub_preempt_count+0x65/0xc0
> > [ 3726.624510]  [<ffffffff819101cf>] tcp_cleanup_rbuf+0x4f/0x110
> > [ 3726.624514]  [<ffffffff819112b7>] tcp_recvmsg+0x637/0xa60
> > [ 3726.624518]  [<ffffffff81849310>] ? release_sock+0xe0/0x110
> > [ 3726.624522]  [<ffffffff81934a34>] inet_recvmsg+0x94/0xc0
> > [ 3726.624534]  [<ffffffff81844792>] sock_aio_read.part.8+0x142/0x170
> > [ 3726.624537]  [<ffffffff818447c0>] ? sock_aio_read.part.8+0x170/0x170
> > [ 3726.624540]  [<ffffffff818447e1>] sock_aio_read+0x21/0x30
> > [ 3726.624544]  [<ffffffff81124b0a>] do_sync_readv_writev+0xca/0x110
> > [ 3726.624548]  [<ffffffff8140f582>] ? security_file_permission+0x92/0xb0
> > [ 3726.624552]  [<ffffffff8112425c>] ? rw_verify_area+0x5c/0xe0
> > [ 3726.624555]  [<ffffffff81124de6>] do_readv_writev+0xd6/0x1e0
> > [ 3726.624558]  [<ffffffff8184336b>] ? sock_do_ioctl+0x2b/0x70
> > [ 3726.624562]  [<ffffffff81135abf>] ? do_vfs_ioctl+0x8f/0x530
> > [ 3726.624566]  [<ffffffff8141278f>] ? file_has_perm+0x8f/0xa0
> > [ 3726.624569]  [<ffffffff81124f7d>] vfs_readv+0x2d/0x50
> > [ 3726.624572]  [<ffffffff81124fe5>] sys_readv+0x45/0xb0
> > [ 3726.624575]  [<ffffffff81aa9062>] system_call_fastpath+0x16/0x1b
> > [ 3726.624578] ---[ end trace 6dc5d813929e5e6f ]---
> 
> Checked this morning, and my dmesg now is basically composed of this
> warning over and over and over.

Is it wifi adapter ?

Please send "netstat -s" output

^ permalink raw reply

* tc filter u32 match
From: Nieścierowicz Adam @ 2012-05-22 13:45 UTC (permalink / raw)
  To: netdev

Hello,

I'm in the process of building a new shaper. While adding support for
802.1q VLANs, I noticed that u32 can match network traffic without
specifying a 4-byte offset. How is this possible?

My environment:

eth2 - network card
eth2.200 - vlan

/sbin/tc filter add dev eth2 parent 1:0 prio 5 handle 35: protocol ip 
u32 divisor 256
/sbin/tc filter add dev eth2 protocol ip parent 1:0 prio 5 u32 ht 800:: 
match ip dst 31.41.208.32/27 hashkey mask 0x000000ff at 16 link 35:
/sbin/tc filter add dev eth2 protocol ip parent 1: prio 1 u32 ht 35:24: 
match ip dst 31.41.208.36 flowid 1:2e5

Here you can see the hits on the rule:
filter parent 1: protocol ip pref 5 u32 fh 35:24:800 order 2048 key ht 
35 bkt 24 flowid 1:2e5  (rule hit 44037 success 44037)
   match 1f29d024/ffffffff at 16 (success 44037 )
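
(Archive note: a minimal sketch, not part of the original message, of how the numbers in the rule above relate to each other. It assumes the bucket is chosen by masking the 32-bit word at offset 16 with the `hashkey mask` and shifting away the mask's trailing zero bits; the function names `u32_key` and `u32_bucket` are hypothetical, and this is a simplification rather than the kernel's code.)

```python
import ipaddress


def u32_key(addr):
    """Return the IPv4 address as the 32-bit value u32 compares against."""
    return int(ipaddress.IPv4Address(addr))


def u32_bucket(addr, mask=0x000000FF, divisor=256):
    """Approximate the hash bucket picked by 'hashkey mask <mask> at 16'."""
    if mask == 0:
        return 0
    # Number of trailing zero bits in the mask (0 for 0x000000ff).
    shift = (mask & -mask).bit_length() - 1
    return ((u32_key(addr) & mask) >> shift) & (divisor - 1)


dst = "31.41.208.36"
print(f"match key: {u32_key(dst):08x}")   # compare with 'match 1f29d024/ffffffff'
print(f"bucket:    {u32_bucket(dst):x}")  # compare with 'ht 35:24:'
```

Under that assumption, the last octet of the destination address (36, or 0x24) selects bucket 24 in table 35, which is consistent with the `ht 35:24:` rule and the `1f29d024` match shown above.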


I found a similar question here:
http://serverfault.com/questions/370795/tc-u32-how-to-match-l2-protocols-in-recent-kernels

Thanks

^ permalink raw reply

* Re: [PATCH v3] drop_monitor: convert to modular building
From: Neil Horman @ 2012-05-22 13:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, bhutchings
In-Reply-To: <1337691919.3361.189.camel@edumazet-glaptop>

On Tue, May 22, 2012 at 03:05:19PM +0200, Eric Dumazet wrote:
> On Thu, 2012-05-17 at 16:21 -0400, Neil Horman wrote:
> > On Thu, May 17, 2012 at 04:09:37PM -0400, David Miller wrote:
> 
> > > 
> > > Applied, although it didn't apply cleanly to net-next.
> > > 
> > 
> > Apologies Dave, should have told you that I was carrying Joe P.'s cleanup patch
> > in my net-next tree as well:
> > http://marc.info/?l=linux-netdev&m=133727344816140&w=2
> > 
> > Since you noted that you had applied it, I applied it myself here.
> > Neil
> > 
> 
> Any plan to autoload drop_monitor module from dropwatch,
> or issuing some advice ?
> 
> # dropwatch -l kas
> Unable to find NET_DM family, dropwatch can't work
> Cleanuing up on socket creation error
> 
> Thanks
> 
I'm looking into that currently, although I was starting to wonder if it's
possible to do with a generic netlink socket.  I can't seem to find any
examples, and I can't use the net-pf-* module alias mechanism that formal
protocols implement, since I don't have a defined address family.  I suppose I
could augment that format to support a net-pf-16-<name> alias, where name is the
name of the genl family that gets registered by the module you're looking for.

Does that seem like a reasonable idea?
Neil

^ permalink raw reply

* Re: WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110()
From: Sergio Correia @ 2012-05-22 13:57 UTC (permalink / raw)
  To: netdev
In-Reply-To: <CAJyhjX22Vna=TwuhD-FqyoXEd9D8wq_VdHbAAWwmqO1hZ34FRA@mail.gmail.com>

On Tue, May 22, 2012 at 2:43 AM, Sergio Correia <lists@uece.net> wrote:
> So far it has happened only once.
> Last commit is 471368557a734c6c486ee757952c902b36e7fd01.
>
>
> [ 3726.624387] ------------[ cut here ]------------
> [ 3726.624398] WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110()
> [ 3726.624400] Hardware name: N53SV
> [ 3726.624402] cleanup rbuf bug: copied A4D1F126 seq A4D1F126 rcvnxt A4D1F126
> [ 3726.624404] Modules linked in:
> [ 3726.624407] Pid: 1416, comm: transmission-gt Not tainted 3.4.0-git+ #52
> [ 3726.624409] Call Trace:
> [ 3726.624415]  [<ffffffff81035eba>] warn_slowpath_common+0x7a/0xb0
> [ 3726.624419]  [<ffffffff81035f91>] warn_slowpath_fmt+0x41/0x50
> [ 3726.624507]  [<ffffffff81aa57a5>] ? sub_preempt_count+0x65/0xc0
> [ 3726.624510]  [<ffffffff819101cf>] tcp_cleanup_rbuf+0x4f/0x110
> [ 3726.624514]  [<ffffffff819112b7>] tcp_recvmsg+0x637/0xa60
> [ 3726.624518]  [<ffffffff81849310>] ? release_sock+0xe0/0x110
> [ 3726.624522]  [<ffffffff81934a34>] inet_recvmsg+0x94/0xc0
> [ 3726.624534]  [<ffffffff81844792>] sock_aio_read.part.8+0x142/0x170
> [ 3726.624537]  [<ffffffff818447c0>] ? sock_aio_read.part.8+0x170/0x170
> [ 3726.624540]  [<ffffffff818447e1>] sock_aio_read+0x21/0x30
> [ 3726.624544]  [<ffffffff81124b0a>] do_sync_readv_writev+0xca/0x110
> [ 3726.624548]  [<ffffffff8140f582>] ? security_file_permission+0x92/0xb0
> [ 3726.624552]  [<ffffffff8112425c>] ? rw_verify_area+0x5c/0xe0
> [ 3726.624555]  [<ffffffff81124de6>] do_readv_writev+0xd6/0x1e0
> [ 3726.624558]  [<ffffffff8184336b>] ? sock_do_ioctl+0x2b/0x70
> [ 3726.624562]  [<ffffffff81135abf>] ? do_vfs_ioctl+0x8f/0x530
> [ 3726.624566]  [<ffffffff8141278f>] ? file_has_perm+0x8f/0xa0
> [ 3726.624569]  [<ffffffff81124f7d>] vfs_readv+0x2d/0x50
> [ 3726.624572]  [<ffffffff81124fe5>] sys_readv+0x45/0xb0
> [ 3726.624575]  [<ffffffff81aa9062>] system_call_fastpath+0x16/0x1b
> [ 3726.624578] ---[ end trace 6dc5d813929e5e6f ]---

Checked this morning, and my dmesg now is basically composed of this
warning over and over and over.
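
(Archive note: a minimal sketch, not part of the original message, for readers decoding the "cleanup rbuf bug" line in the trace. It assumes the check compares the socket's copied sequence number against the end sequence of the first queued segment using wrapping 32-bit arithmetic, and that the warning fires when "copied" is not strictly before "seq". The helper below mirrors that style of comparison in Python; it is not the kernel source.)

```python
def before(seq1, seq2):
    """True if seq1 precedes seq2 in 32-bit wrapping sequence space."""
    diff = (seq1 - seq2) & 0xFFFFFFFF
    # Equivalent to a signed 32-bit (seq1 - seq2) < 0.
    return diff >= 0x80000000


# Values taken from the warning line in the trace.
copied = 0xA4D1F126
seq = 0xA4D1F126

# Assumed warning condition: copied is NOT before seq.
print("would warn:", not before(copied, seq))

# Wraparound example: a sequence just below 2**32 precedes a small one.
print(before(0xFFFFFFF0, 0x00000010))
```

Under that assumption, the three identical values in the log mean the first segment still sitting in the receive queue had already been fully copied to user space when the check ran.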

^ permalink raw reply
