Netdev List


* Re: [PATCH] virtio-net: fix a race on 32bit arches
From: Eric Dumazet @ 2012-06-06 20:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, linux-kernel, virtualization, Stephen Hemminger
In-Reply-To: <1339014259.26966.70.camel@edumazet-glaptop>

On Wed, 2012-06-06 at 22:24 +0200, Eric Dumazet wrote:

> (ndo_get_stats64() is not allowed to sleep, and I can't see how you are
> going to disable napi without sleeping)
> 
> 

In case you wonder, take a look at bond_get_stats() in
drivers/net/bonding/bond_main.c


* Re: Change in alloc_skb() behavior in 3.2+ kernels?
From: Grant Edwards @ 2012-06-06 20:35 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1339014685.26966.73.camel@edumazet-glaptop>

On 2012-06-06, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2012-06-06 at 20:24 +0000, Grant Edwards wrote:
>
>> Is skb_tailroom() guaranteed to be >= the requested size?
>> 
>
> Of course, but only right after alloc_skb().

That's good enough.
     
-- 
Grant Edwards               grant.b.edwards        Yow! Okay ... I'm going
                                  at               home to write the "I HATE
                              gmail.com            RUBIK's CUBE HANDBOOK FOR
                                                   DEAD CAT LOVERS" ...


* Re: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
From: Damian Lukowski @ 2012-06-06 20:35 UTC (permalink / raw)
  To: Jerry Chu; +Cc: Netdev, David Miller, Ilpo Järvinen
In-Reply-To: <CAPshTCjEkyLg+BdYvA3vW4C92rAyfuC_mVEDgrbRz4NDyGh9Ug@mail.gmail.com>

Am Dienstag, den 05.06.2012, 14:22 -0700 schrieb Jerry Chu:
> On Tue, Jun 5, 2012 at 11:39 AM, Damian Lukowski
> <damian@tvk.rwth-aachen.de> wrote:
> > Am Dienstag, den 05.06.2012, 10:42 -0700 schrieb Jerry Chu:
> >> On Mon, Jun 4, 2012 at 4:50 PM, Jerry Chu <hkchu@google.com> wrote:
> >> > Hi Damian,
> >> >
> >> > On Mon, Jun 4, 2012 at 10:50 AM, Damian Lukowski
> >> > <damian@tvk.rwth-aachen.de> wrote:
> >> >> Hi Jerry,
> >> >>
> >> >> please verify that I understood you correctly.
> >> >>
> >> >> You have set TCP_RTO_MIN to a lower value, e.g. 0.002 seconds to improve
> >> >> your internal low-latency traffic. Because of the improvement, R1
> >> >> timeouts are triggered too fast for external high-RTT traffic. Is that
> >> >> correct?
> >> >
> >> > Correct.
> >> >
> >> >> If so, may I suggest setting tcp_retries1 to a higher value? For
> >> >> TCP_RTO_MIN == 0.002 and tcp_retries1 ==  10, R1 will be calculated to
> >> >> approximately 4 seconds.
> >> >
> >> > I think hacking tcp_retries1 is the wrong solution. E.g., 10 retries may be too
> >> > generous for those short RTT flows.
> >> >
> >> > I think the fundamental problem is - the ideal fix for your original RTO revert
> >> > problem should've used the per-flow RTO to compute R1 & R2. But that
> >> > computation may be too expensive so you used TCP_RTO_MIN as an
> >> > approximation - not a good idea IMHO!
> >>
> >> Just realized the correct fix of using the original, non-backoff per flow RTO is
> >> not any more expensive than the current code through ilog2(). What's needed
> >> is a new field "base_rto" to record the original RTO before backoff. I'm leaning
> >> toward this more accurate fix now without any fudge because fudging almost
> >> always causes bugs.
> >
> >
> > The current version of retransmits_timed_out() uses such a field
> > already. I suppose, we can do a combination like the following?
> >
> > -       unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : TCP_RTO_MIN;
> > +       unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : __tcp_set_rto(tcp_sk(sk));
> 
> Yes that could work and we probably don't need a new field for the original RTO.
> 
> But I started wondering what problem you tried to solve initially. The old
> counter (icsk_retransmits) based code was really easy to understand, debug, and
> matched well with the API (sysctl_tcp_retries1, sysctl_tcp_retries2,
> TCP_SYNCNT,...), which are all counter based. Moreover, my simple brain has
> a strong prejudice against complex code unless the complexity is justified.
> 
> Could you point out where backoff revert might happen? (tcp_v4_err() when
> handling ICMP errors?) And for those cases is it possible to either not increment
> icsk_retransmits (as long as it won't get us into infinite
> retransmissions), or invent
> a separate field for the sole purpose of timeout check? Won't that be
> much simpler than your current fix?

Hi,

backoffs are reverted when an RTO retransmission triggers an ICMP
destination unreachable along the path towards the target.
Consider A --- R --- B, where A and B are TCP endpoints, and R is some
router in between. When the link between R and B breaks for a longer
time, A will perform RTO retransmissions if there are outstanding ACKs.
Those packets which arrive at R will hopefully trigger an ICMP response
back towards A, as R has no more route towards B. The ICMP packet is an
indication for A, that the retransmission has not been lost because of
congestion but because of a link outage, and the backoff will be
reverted for the corresponding TCP session. In the best case, every RTO
retransmission triggers an ICMP response, so every backoff is reverted,
and the time between retransmissions remains at the original value.
If icsk_retransmits is decremented at this point within the original
code's logic, the connection might never time out. And we cannot take
tcp_retriesX literally here, as the above scenario would time out after
tcp_retries2 x base_rto, where the base_rto might be as small as 0.2
seconds.
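
To make the arithmetic in this exchange concrete, here is a minimal
userspace sketch of the two ways of deciding "timed out". It is not kernel
code: timeout_limit_ms() mirrors the formula of retransmits_timed_out() in
the quoted patch, while ilog2_u() and reverted_count_limit_ms() are
hypothetical helpers introduced only for illustration. Times are in
milliseconds, and the 200 ms / 120 s values used in the note below are
assumed to be the usual TCP_RTO_MIN / TCP_RTO_MAX.

```c
#include <assert.h>

/* Floor of log2(v) for v >= 1; userspace stand-in for the kernel's ilog2(). */
static unsigned int ilog2_u(unsigned long v)
{
    unsigned int r = 0;

    assert(v >= 1);
    while (v >>= 1)
        r++;
    return r;
}

/*
 * Time threshold (ms) equivalent to `boundary` exponentially backed-off
 * retransmissions, starting at rto_base_ms and capped at rto_max_ms.
 * Same arithmetic as retransmits_timed_out() in the quoted patch.
 */
unsigned long timeout_limit_ms(unsigned int boundary,
                               unsigned long rto_base_ms,
                               unsigned long rto_max_ms)
{
    unsigned int k;

    assert(rto_base_ms > 0 && rto_base_ms <= rto_max_ms);
    k = ilog2_u(rto_max_ms / rto_base_ms);

    if (boundary <= k)
        return ((2UL << boundary) - 1) * rto_base_ms;
    return ((2UL << k) - 1) * rto_base_ms + (boundary - k) * rto_max_ms;
}

/*
 * Elapsed time (ms) until a purely count-based check fires when every
 * backoff is reverted: the timer expires retries + 1 times at a constant
 * RTO (assumption: the check runs on each timer expiry).
 */
unsigned long reverted_count_limit_ms(unsigned int retries,
                                      unsigned long rto_ms)
{
    return (retries + 1UL) * rto_ms;
}
```

Under these assumptions, timeout_limit_ms(15, 200, 120000) is 924600 ms,
while reverted_count_limit_ms(15, 200) is only 3200 ms, the 3.2 seconds
mentioned in the patch description. With a 2 ms base,
timeout_limit_ms(3, 2, 120000) is 30 ms and timeout_limit_ms(10, 2, 120000)
is 4094 ms, which matches the "approximately 4 seconds" figure quoted below.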

I am not sure how an additional counter variable would help. You still
cannot take tcp_retriesX literally. Besides, I think that changing the
socket structure is too heavy machinery, isn't it?

Regards
 Damian

> 
> Best,
> 
> Jerry
> 
> > +       rto_base = rto_base ? : TCP_RTO_MIN;
> >
> > -       if (!inet_csk(sk)->icsk_retransmits)
> > +       if (inet_csk(sk)->icsk_retransmits < boundary)
> >
> >
> > Regards
> >  Damian
> >
> >>
> >> Any comment is welcome. I'm not sure in the existing code if it makes sense
> >> to apply the exponential backoff based computation to thin stream but it's a
> >> separate question so I won't touch it.
> >>
> >> Jerry
> >>
> >> >
> >> > The easiest solution I can see so far is to replace the check
> >> >
> >> > if (!inet_csk(sk)->icsk_retransmits)
> >> >                return false;
> >> >
> >> > at the beginning of retransmits_timed_out() with
> >> >
> >> > if (inet_csk(sk)->icsk_retransmits < boundary)
> >> >                return false;
> >> >
> >> > Best,
> >> >
> >> > Jerry
> >> >
> >> >>
> >> >> Is that ok?
> >> >>
> >> >> Best regards
> >> >>  Damian
> >> >>
> >> >> Am Freitag, den 01.06.2012, 15:58 -0700 schrieb Jerry Chu:
> >> >>> > From: Damian Lukowski <damian@tvk.rwth-aachen.de>
> >> >>> > Date: Wed, Aug 26, 2009 at 3:16 AM
> >> >>> > Subject: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close
> >> >>> > threshold as a time value.
> >> >>> > To: Netdev <netdev@vger.kernel.org>
> >> >>> >
> >> >>> >
> >> >>> > RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
> >> >>> > which may represent a number of allowed retransmissions or a timeout value.
> >> >>> > Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
> >> >>> > in number of allowed retransmissions.
> >> >>> >
> >> >>> > For any desired threshold R2 (by means of time) one can specify tcp_retries2
> >> >>> > (by means of number of retransmissions) such that TCP will not time out
> >> >>> > earlier than R2. This is the case, because the RTO schedule follows a fixed
> >> >>> > pattern, namely exponential backoff.
> >> >>> >
> >> >>> > However, the RTO behaviour is not predictable any more if RTO backoffs can
> >> >>> > be
> >> >>> > reverted, as it is the case in the draft
> >> >>> > "Make TCP more Robust to Long Connectivity Disruptions"
> >> >>> > (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
> >> >>> >
> >> >>> > In the worst case TCP would time out a connection after 3.2 seconds, if the
> >> >>> > initial RTO equaled MIN_RTO and each backoff has been reverted.
> >> >>> >
> >> >>> > This patch introduces a function retransmits_timed_out(N),
> >> >>> > which calculates the timeout of a TCP connection, assuming an initial
> >> >>> > RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
> >> >>> >
> >> >>> > Whenever timeout decisions are made by comparing the retransmission counter
> >> >>> > to some value N, this function can be used, instead.
> >> >>> >
> >> >>> > The meaning of tcp_retries2 will be changed, as many more RTO
> >> >>> > retransmissions
> >> >>> > can occur than the value indicates. However, it yields a timeout which is
> >> >>> > similar to the one of an unpatched, exponentially backing off TCP in the
> >> >>> > same
> >> >>> > scenario. As no application could rely on an RTO greater than MIN_RTO, there
> >> >>> > should be no risk of a regression.
> >> >>>
> >> >>> This looks like a typical "fix one problem, introducing a few more" patch :(.
> >> >>> What do you mean by "no application could rely on an RTO greater than
> >> >>> MIN_RTO..."
> >> >>> above? How can you make the assumption that RTO is not too far off
> >> >>> from TCP_RTO_MIN?
> >> >>>
> >> >>> While you tried to address a problem where the retransmission count
> >> >>> was high but the actual
> >> >>> timeout duration was too short, have you considered the other case
> >> >>> around, i.e., the timeout
> >> >>> duration is long but the retransmission count is too short? This is
> >> >>> exactly what's happening
> >> >>> to us with your patch. We've much reduced TCP_RTO_MIN for our internal
> >> >>> traffic, but not
> >> >>> noticing your change has severely shortened the R1 & R2 recommended by
> >> >>> RFC1122 for our
> >> >>> long haul traffic until now. In many cases R1 threshold was met upon
> >> >>> the first retrans timeout.
> >> >>>
> >> >>> I think retransmits_timed_out() should check against both time
> >> >>> duration and retrans count
> >> >>> (icsk_retransmits).
> >> >>>
> >> >>> Thought?
> >> >>>
> >> >>> Jerry
> >> >>>
> >> >>> >
> >> >>> > Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
> >> >>> > ---
> >> >>> >  include/net/tcp.h    |   18 ++++++++++++++++++
> >> >>> >  net/ipv4/tcp_timer.c |   11 +++++++----
> >> >>> >  2 files changed, 25 insertions(+), 4 deletions(-)
> >> >>> >
> >> >>> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> >> >>> > index c35b329..17d1a88 100644
> >> >>> > --- a/include/net/tcp.h
> >> >>> > +++ b/include/net/tcp.h
> >> >>> > @@ -1247,6 +1247,24 @@ static inline struct sk_buff
> >> >>> > *tcp_write_queue_prev(struct sock *sk, struct sk_bu
> >> >>> >  #define tcp_for_write_queue_from_safe(skb, tmp, sk)                    \
> >> >>> >        skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
> >> >>> >
> >> >>> > +static inline bool retransmits_timed_out(const struct sock *sk,
> >> >>> > +                                        unsigned int boundary)
> >> >>> > +{
> >> >>> > +       int limit, K;
> >> >>> > +       if (!inet_csk(sk)->icsk_retransmits)
> >> >>> > +               return false;
> >> >>> > +
> >> >>> > +       K = ilog2(TCP_RTO_MAX/TCP_RTO_MIN);
> >> >>> > +
> >> >>> > +       if (boundary <= K)
> >> >>> > +               limit = ((2 << boundary) - 1) * TCP_RTO_MIN;
> >> >>> > +       else
> >> >>> > +               limit = ((2 << K) - 1) * TCP_RTO_MIN +
> >> >>> > +                       (boundary - K) * TCP_RTO_MAX;
> >> >>> > +
> >> >>> > +       return (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= limit;
> >> >>> > +}
> >> >>> > +
> >> >>> >  static inline struct sk_buff *tcp_send_head(struct sock *sk)
> >> >>> >  {
> >> >>> >        return sk->sk_send_head;
> >> >>> > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> >> >>> > index a3ba494..2972d7b 100644
> >> >>> > --- a/net/ipv4/tcp_timer.c
> >> >>> > +++ b/net/ipv4/tcp_timer.c
> >> >>> > @@ -137,13 +137,14 @@ static int tcp_write_timeout(struct sock *sk)
> >> >>> >  {
> >> >>> >        struct inet_connection_sock *icsk = inet_csk(sk);
> >> >>> >        int retry_until;
> >> >>> > +       bool do_reset;
> >> >>> >
> >> >>> >        if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
> >> >>> >                if (icsk->icsk_retransmits)
> >> >>> >                        dst_negative_advice(&sk->sk_dst_cache);
> >> >>> >                retry_until = icsk->icsk_syn_retries ? :
> >> >>> > sysctl_tcp_syn_retries;
> >> >>> >        } else {
> >> >>> > -               if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
> >> >>> > +               if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
> >> >>> >                        /* Black hole detection */
> >> >>> >                        tcp_mtu_probing(icsk, sk);
> >> >>> >
> >> >>> > @@ -155,13 +156,15 @@ static int tcp_write_timeout(struct sock *sk)
> >> >>> >                        const int alive = (icsk->icsk_rto < TCP_RTO_MAX);
> >> >>> >
> >> >>> >                        retry_until = tcp_orphan_retries(sk, alive);
> >> >>> > +                       do_reset = alive ||
> >> >>> > +                                  !retransmits_timed_out(sk, retry_until);
> >> >>> >
> >> >>> > -                       if (tcp_out_of_resources(sk, alive ||
> >> >>> > icsk->icsk_retransmits < retry_until))
> >> >>> > +                       if (tcp_out_of_resources(sk, do_reset))
> >> >>> >                                return 1;
> >> >>> >                }
> >> >>> >        }
> >> >>> >
> >> >>> > -       if (icsk->icsk_retransmits >= retry_until) {
> >> >>> > +       if (retransmits_timed_out(sk, retry_until)) {
> >> >>> >                /* Has it gone just too far? */
> >> >>> >                tcp_write_err(sk);
> >> >>> >                return 1;
> >> >>> > @@ -385,7 +388,7 @@ void tcp_retransmit_timer(struct sock *sk)
> >> >>> >  out_reset_timer:
> >> >>> >        icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
> >> >>> >        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto,
> >> >>> > TCP_RTO_MAX);
> >> >>> > -       if (icsk->icsk_retransmits > sysctl_tcp_retries1)
> >> >>> > +       if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1))
> >> >>> >                __sk_dst_reset(sk);
> >> >>> >
> >> >>> >  out:;
> >> >>> > --
> >> >>> > 1.6.3.3
> >> >>> >
> >> >>> >
> >> >>
> >> >>
> >
> >


* Re: [PATCH] virtio-net: fix a race on 32bit arches
From: Ben Hutchings @ 2012-06-06 20:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eric Dumazet, netdev, linux-kernel, virtualization,
	Stephen Hemminger
In-Reply-To: <20120606201620.GA23358@redhat.com>

On Wed, 2012-06-06 at 23:16 +0300, Michael S. Tsirkin wrote:
> On Wed, Jun 06, 2012 at 10:08:09PM +0200, Eric Dumazet wrote:
> > On Wed, 2012-06-06 at 22:58 +0300, Michael S. Tsirkin wrote:
> > 
> > > Absolutely, I am talking about virtio here.  I'm not kicking
> > > u64_stats_sync idea I am just saying that simple locking
> > > would work for virtio and might be better as it
> > > gives us a way to get counters atomically.
> > 
> > Which lock do you own in the RX path ?
> 
> We can just disable napi, everything is updated from napi callback.

Seriously, though: don't do that; this is going to hurt performance for
minimal benefit.

Ben.

> > You'll have to add a lock in fast path. This sounds really a bad choice
> > to me.
> 
> .ndo_get_stats64 is not data path though, is it?
> 

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* Re: Change in alloc_skb() behavior in 3.2+ kernels?
From: Eric Dumazet @ 2012-06-06 20:31 UTC (permalink / raw)
  To: Grant Edwards; +Cc: netdev
In-Reply-To: <jqoe90$fs3$2@dough.gmane.org>

On Wed, 2012-06-06 at 20:24 +0000, Grant Edwards wrote:

> Is skb_tailroom() guaranteed to be >= the requested size?
> 

Of course, but only right after alloc_skb().
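
A toy userspace model may help show why "tailroom >= requested size" is not
the same as "tailroom == requested size". Everything here is hypothetical:
struct toy_skb, the toy_* helpers and the 64-byte rounding are illustrative
assumptions, not the kernel's sk_buff API or its real allocation slack. The
point is only that a producer should fill up to its own frame limit rather
than until the tailroom reaches zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct sk_buff: a buffer with a fill level. */
struct toy_skb {
    unsigned char *data;
    size_t len;       /* bytes used so far */
    size_t capacity;  /* bytes actually allocated, may exceed the request */
};

/* Assumed rounding granularity; real slack is implementation dependent. */
#define TOY_ALLOC_ALIGN 64u

struct toy_skb *toy_alloc_skb(size_t size)
{
    struct toy_skb *skb = malloc(sizeof(*skb));

    if (!skb)
        return NULL;
    /* Round the request up, so capacity >= size but not necessarily equal. */
    skb->capacity = (size + TOY_ALLOC_ALIGN - 1) / TOY_ALLOC_ALIGN
                    * TOY_ALLOC_ALIGN;
    skb->data = malloc(skb->capacity ? skb->capacity : 1);
    if (!skb->data) {
        free(skb);
        return NULL;
    }
    skb->len = 0;
    return skb;
}

void toy_free_skb(struct toy_skb *skb)
{
    if (skb) {
        free(skb->data);
        free(skb);
    }
}

size_t toy_tailroom(const struct toy_skb *skb)
{
    return skb->capacity - skb->len;
}

/* Append up to n zero bytes without exceeding frame_limit; returns bytes accepted. */
size_t toy_put_bounded(struct toy_skb *skb, size_t n, size_t frame_limit)
{
    size_t room;

    assert(frame_limit <= skb->capacity);
    room = frame_limit > skb->len ? frame_limit - skb->len : 0;
    if (n > room)
        n = room;
    memset(skb->data + skb->len, 0, n);
    skb->len += n;
    return n;
}
```

In this model, asking for 1514 bytes yields 1536 bytes of tailroom, so a
loop that fills "until tailroom is zero" would build an oversized frame,
whereas toy_put_bounded() stops at the caller's own 1514-byte limit.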


* Re: Change in alloc_skb() behavior in 3.2+ kernels?
From: Grant Edwards @ 2012-06-06 20:26 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1339011742.26966.44.camel@edumazet-glaptop>

On 2012-06-06, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2012-06-06 at 11:51 -0700, David Miller wrote:
>> From: Grant Edwards <grant.b.edwards@gmail.com>
>> Date: Wed, 6 Jun 2012 18:32:57 +0000 (UTC)
>> 
>> > The kernel module that's started failing fills the allocated sk_buff
>> > until tailroom() indicates it is full and then sends it.  The problem
>> > is that sending a packet with a length of 1850 won't work (it's a
>> > MAC-layer Ethernet packet).
>> 
>> The amount of tailroom an SKB has is implementation dependent.
>> 
>> It's incredibly poor form to rely upon it to determine whether a
>> fully sized frame has been constructed or not.
>> 
>> Please fix the code that does this.
>
> By the way, we had a similar problem, and the fix was :
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=a21d45726acacc963d8baddf74607d9b74e2b723
>
> Grant, depending on the context, you might use skb->avail_size and
> skb_availroom() as well.
>
> Beware skb->avail_size is unioned with skb->{mark|dropcount}

Thanks for the pointer.

-- 
Grant Edwards               grant.b.edwards        Yow! ANN JILLIAN'S HAIR
                                  at               makes LONI ANDERSON'S
                              gmail.com            HAIR look like RICARDO
                                                   MONTALBAN'S HAIR!


* Re: pull request: wireless 2012-06-06
From: David Miller @ 2012-06-06 20:29 UTC (permalink / raw)
  To: linville; +Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <20120606183613.GD2338@tuxdriver.com>

From: "John W. Linville" <linville@tuxdriver.com>
Date: Wed, 6 Jun 2012 14:36:13 -0400

> Here is a batch of wireless/bluetooth fixes intended for 3.5...
 ...
> Please let me know if there are problems!

Pulled, thanks for the detailed rundown, it helps.


* Re: [PATCH] virtio-net: fix a race on 32bit arches
From: Eric Dumazet @ 2012-06-06 20:25 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Michael S. Tsirkin, netdev, linux-kernel, virtualization,
	Stephen Hemminger
In-Reply-To: <1339013979.2836.52.camel@bwh-desktop.uk.solarflarecom.com>

On Wed, 2012-06-06 at 21:19 +0100, Ben Hutchings wrote:
> On Wed, 2012-06-06 at 22:08 +0200, Eric Dumazet wrote:
> > On Wed, 2012-06-06 at 22:58 +0300, Michael S. Tsirkin wrote:
> > 
> > > Absolutely, I am talking about virtio here.  I'm not kicking
> > > u64_stats_sync idea I am just saying that simple locking
> > > would work for virtio and might be better as it
> > > gives us a way to get counters atomically.
> > 
> > Which lock do you own in the RX path ?
> > 
> > You'll have to add a lock in fast path. This sounds really a bad choice
> > to me.
> 
> You have the NAPI 'lock', so when gathering stats you can synchronise
> using napi_disable() ;-)

Nice, this adds one new bug in network stack.

Really guys, can we stop this thread, please ?


* Re: Change in alloc_skb() behavior in 3.2+ kernels?
From: Grant Edwards @ 2012-06-06 20:24 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20120606.131703.870120997635192180.davem@davemloft.net>

On 2012-06-06, David Miller <davem@davemloft.net> wrote:
> From: Grant Edwards <grant.b.edwards@gmail.com>
> Date: Wed, 6 Jun 2012 19:01:40 +0000 (UTC)
>
>> That's what I'll do as soon as I can find a definition of what the API
>> for alloc_skb() actually _is_.  It has clearly changed in the past few
>> years.
>
> And it will continue to change.  There are no stable APIs inside of
> the kernel, none.

Oh, I know (as does anybody who has maintained a driver for more than
a few months).  Right now, I'm just trying to find out what the
current API for alloc_skb() is.

Is skb_tailroom() guaranteed to be >= the requested size?

-- 
Grant Edwards               grant.b.edwards        Yow! I'm wet!  I'm wild!
                                  at               
                              gmail.com            


* Re: [PATCH] virtio-net: fix a race on 32bit arches
From: Eric Dumazet @ 2012-06-06 20:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, linux-kernel, virtualization, Stephen Hemminger
In-Reply-To: <20120606201620.GA23358@redhat.com>

On Wed, 2012-06-06 at 23:16 +0300, Michael S. Tsirkin wrote:
> On Wed, Jun 06, 2012 at 10:08:09PM +0200, Eric Dumazet wrote:
> > On Wed, 2012-06-06 at 22:58 +0300, Michael S. Tsirkin wrote:
> > 
> > > Absolutely, I am talking about virtio here.  I'm not kicking
> > > u64_stats_sync idea I am just saying that simple locking
> > > would work for virtio and might be better as it
> > > gives us a way to get counters atomically.
> > 
> > Which lock do you own in the RX path ?
> 
> We can just disable napi, everything is updated from napi callback.

This is very disruptive, and illegal from ndo_get_stats64()

(ndo_get_stats64() is not allowed to sleep, and I can't see how you are
going to disable napi without sleeping)


* Re: [PATCH] sky2: fix checksum bit management on some chips
From: Stephen Hemminger @ 2012-06-06 20:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Kirill Smelkov, David Miller, Mirko Lindner, netdev
In-Reply-To: <1339013644.26966.63.camel@edumazet-glaptop>

On Wed, 06 Jun 2012 22:14:04 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2012-06-06 at 13:01 -0700, Stephen Hemminger wrote:
> > The newer flavors of Yukon II use a different method for receive
> > checksum offload. This is indicated in the driver by the SKY2_HW_NEW_LE
> > flag. On these newer chips, the BMU_ENA_RX_CHKSUM should not be set.
> > 
> > The driver would incorrectly toggle the bit, enabling the old
> > checksum logic on these chips and causing a BUG_ON() assertion if
> > receive checksum was toggled via ethtool.
> > 
> > Reported-by: Kirill Smelkov <kirr@mns.spb.ru>
> > 
> > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> > 
> > ---
> > Patch against net-next, please apply to net and stable kernels.
> > 
> > --- a/drivers/net/ethernet/marvell/sky2.c	2012-06-06 11:09:38.288440819 -0700
> > +++ b/drivers/net/ethernet/marvell/sky2.c	2012-06-06 11:25:01.275782462 -0700
> > @@ -4381,10 +4381,12 @@ static int sky2_set_features(struct net_
> >  	struct sky2_port *sky2 = netdev_priv(dev);
> >  	netdev_features_t changed = dev->features ^ features;
> >  
> > -	if (changed & NETIF_F_RXCSUM) {
> > -		bool on = features & NETIF_F_RXCSUM;
> > -		sky2_write32(sky2->hw, Q_ADDR(rxqaddr[sky2->port], Q_CSR),
> > -			     on ? BMU_ENA_RX_CHKSUM : BMU_DIS_RX_CHKSUM);
> > +	if ((changed & NETIF_F_RXCSUM) &&
> > +	    !(sky2->hw->flags & SKY2_HW_NEW_LE)) {
> > +		sky2_write32(sky2->hw,
> > +			     Q_ADDR(rxqaddr[sky2->port], Q_CSR),
> > +			     (features & NETIF_F_RXCSUM)
> > +			     ? BMU_ENA_RX_CHKSUM : BMU_DIS_RX_CHKSUM);
> 
> Don't you need to return an error if NETIF_F_RXCSUM could not be
> changed ?
> 
> 

No, on the new chips the feature flag is already checked
in the receive status processing.
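
The decision made by the patched sky2_set_features() can be modelled in a
few lines of plain C. This is a sketch only: the TOY_* names and bit values
are hypothetical placeholders rather than the driver's real constants, and
the function returns the command that would be written instead of touching
hardware.

```c
#include <stdint.h>

/* Hypothetical bit values for illustration; the real ones are in kernel headers. */
#define TOY_F_RXCSUM   (1u << 0)   /* stands in for NETIF_F_RXCSUM */
#define TOY_HW_NEW_LE  (1u << 1)   /* stands in for SKY2_HW_NEW_LE */

enum toy_bmu_cmd {
    TOY_BMU_NONE,            /* leave the queue control register alone */
    TOY_BMU_ENA_RX_CHKSUM,
    TOY_BMU_DIS_RX_CHKSUM
};

/* Which BMU command, if any, a feature change should trigger. */
enum toy_bmu_cmd toy_rxcsum_cmd(uint32_t old_features,
                                uint32_t new_features,
                                uint32_t hw_flags)
{
    uint32_t changed = old_features ^ new_features;

    if (!(changed & TOY_F_RXCSUM))
        return TOY_BMU_NONE;      /* RX checksum setting did not change */
    if (hw_flags & TOY_HW_NEW_LE)
        return TOY_BMU_NONE;      /* newer chips: never touch the old bit */
    return (new_features & TOY_F_RXCSUM) ? TOY_BMU_ENA_RX_CHKSUM
                                         : TOY_BMU_DIS_RX_CHKSUM;
}
```

On hardware flagged as new-LE the model never issues a command, which is
the behaviour the patch is after; the feature bit itself still changes and
is consulted elsewhere, as described above.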


* Re: Change in alloc_skb() behavior in 3.2+ kernels?
From: Grant Edwards @ 2012-06-06 20:22 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20120606.120247.1618312724057709285.davem@davemloft.net>

On 2012-06-06, David Miller <davem@davemloft.net> wrote:
> From: Grant Edwards <grant.b.edwards@gmail.com>
> Date: Wed, 6 Jun 2012 18:59:19 +0000 (UTC)
>
>> At the time it was written (probably 10+ years ago) it was relying on
>> the documented API for alloc_skb() that stated alloc_skb() either
>> returned an sk_buff of the requested size or it failed.
>
> It was never a formal API that we would only allocate 'size'
> amount of tailroom.

Well, somebody forgot to tell whoever wrote the man page for
alloc_skb().  :)

-- 
Grant Edwards               grant.b.edwards        Yow! I'm ZIPPY the PINHEAD
                                  at               and I'm totally committed
                              gmail.com            to the festive mode.


* Re: [PATCH] virtio-net: fix a race on 32bit arches
From: Michael S. Tsirkin @ 2012-06-06 20:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, linux-kernel, virtualization, Stephen Hemminger
In-Reply-To: <1339013171.26966.60.camel@edumazet-glaptop>

On Wed, Jun 06, 2012 at 10:06:11PM +0200, Eric Dumazet wrote:
> On Wed, 2012-06-06 at 21:43 +0300, Michael S. Tsirkin wrote:
> 
> > 1. We are trying to look at counters for purposes of tuning the device.
> > E.g. if ethtool reports packets and bytes, we'd like to calculate
> > average packet size by bytes/packets.
> > 
> > If both counters are read atomically the metric becomes more exact.
> > Not a must but nice to have.
> > 
> 
> metrics are exact right now.

Yes, but they are not synchronised between themselves.
E.g. you can in theory have a report where #of packets > #of bytes.

I know there's no guarantee they are synchronised
on an arbitrary device but if they are, without
slowing fast path, it's nice.


* Re: [PATCH] virtio-net: fix a race on 32bit arches
From: Ben Hutchings @ 2012-06-06 20:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michael S. Tsirkin, Stephen Hemminger, Jason Wang, netdev, rusty,
	linux-kernel, virtualization
In-Reply-To: <1339013289.26966.62.camel@edumazet-glaptop>

On Wed, 2012-06-06 at 22:08 +0200, Eric Dumazet wrote:
> On Wed, 2012-06-06 at 22:58 +0300, Michael S. Tsirkin wrote:
> 
> > Absolutely, I am talking about virtio here.  I'm not kicking
> > u64_stats_sync idea I am just saying that simple locking
> > would work for virtio and might be better as it
> > gives us a way to get counters atomically.
> 
> Which lock do you own in the RX path ?
> 
> You'll have to add a lock in fast path. This sounds really a bad choice
> to me.

You have the NAPI 'lock', so when gathering stats you can synchronise
using napi_disable() ;-)

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* Re: Change in alloc_skb() behavior in 3.2+ kernels?
From: David Miller @ 2012-06-06 20:17 UTC (permalink / raw)
  To: grant.b.edwards; +Cc: netdev
In-Reply-To: <jqo9ej$ao4$1@dough.gmane.org>

From: Grant Edwards <grant.b.edwards@gmail.com>
Date: Wed, 6 Jun 2012 19:01:40 +0000 (UTC)

> That's what I'll do as soon as I can find a definition of what the API
> for alloc_skb() actually _is_.  It has clearly changed in the past few
> years.

And it will continue to change.  There are no stable APIs inside of
the kernel, none.


* Re: [PATCH] virtio-net: fix a race on 32bit arches
From: Michael S. Tsirkin @ 2012-06-06 20:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, linux-kernel, virtualization, Stephen Hemminger
In-Reply-To: <1339013289.26966.62.camel@edumazet-glaptop>

On Wed, Jun 06, 2012 at 10:08:09PM +0200, Eric Dumazet wrote:
> On Wed, 2012-06-06 at 22:58 +0300, Michael S. Tsirkin wrote:
> 
> > Absolutely, I am talking about virtio here.  I'm not kicking
> > u64_stats_sync idea I am just saying that simple locking
> > would work for virtio and might be better as it
> > gives us a way to get counters atomically.
> 
> Which lock do you own in the RX path ?

We can just disable napi, everything is updated from napi callback.

> You'll have to add a lock in fast path. This sounds really a bad choice
> to me.

.ndo_get_stats64 is not data path though, is it?

-- 
MST


* Re: [PATCH] sky2: fix checksum bit management on some chips
From: Eric Dumazet @ 2012-06-06 20:14 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Kirill Smelkov, David Miller, Mirko Lindner, netdev
In-Reply-To: <20120606130130.5a86f94a@nehalam.linuxnetplumber.net>

On Wed, 2012-06-06 at 13:01 -0700, Stephen Hemminger wrote:
> The newer flavors of Yukon II use a different method for receive
> checksum offload. This is indicated in the driver by the SKY2_HW_NEW_LE
> flag. On these newer chips, the BMU_ENA_RX_CHKSUM should not be set.
> 
> The driver would incorrectly toggle the bit, enabling the old
> checksum logic on these chips and causing a BUG_ON() assertion if
> receive checksum was toggled via ethtool.
> 
> Reported-by: Kirill Smelkov <kirr@mns.spb.ru>
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> 
> ---
> Patch against net-next, please apply to net and stable kernels.
> 
> --- a/drivers/net/ethernet/marvell/sky2.c	2012-06-06 11:09:38.288440819 -0700
> +++ b/drivers/net/ethernet/marvell/sky2.c	2012-06-06 11:25:01.275782462 -0700
> @@ -4381,10 +4381,12 @@ static int sky2_set_features(struct net_
>  	struct sky2_port *sky2 = netdev_priv(dev);
>  	netdev_features_t changed = dev->features ^ features;
>  
> -	if (changed & NETIF_F_RXCSUM) {
> -		bool on = features & NETIF_F_RXCSUM;
> -		sky2_write32(sky2->hw, Q_ADDR(rxqaddr[sky2->port], Q_CSR),
> -			     on ? BMU_ENA_RX_CHKSUM : BMU_DIS_RX_CHKSUM);
> +	if ((changed & NETIF_F_RXCSUM) &&
> +	    !(sky2->hw->flags & SKY2_HW_NEW_LE)) {
> +		sky2_write32(sky2->hw,
> +			     Q_ADDR(rxqaddr[sky2->port], Q_CSR),
> +			     (features & NETIF_F_RXCSUM)
> +			     ? BMU_ENA_RX_CHKSUM : BMU_DIS_RX_CHKSUM);

Don't you need to return an error if NETIF_F_RXCSUM could not be
changed ?


* Re: [PATCH] virtio-net: fix a race on 32bit arches
From: Eric Dumazet @ 2012-06-06 20:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, linux-kernel, virtualization, Stephen Hemminger
In-Reply-To: <20120606195814.GA20677@redhat.com>

On Wed, 2012-06-06 at 22:58 +0300, Michael S. Tsirkin wrote:

> Absolutely, I am talking about virtio here.  I'm not kicking
> u64_stats_sync idea I am just saying that simple locking
> would work for virtio and might be better as it
> gives us a way to get counters atomically.

Which lock do you own in the RX path ?

You'll have to add a lock in fast path. This sounds really a bad choice
to me.


* Re: [PATCH] virtio-net: fix a race on 32bit arches
From: Eric Dumazet @ 2012-06-06 20:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, linux-kernel, virtualization, Stephen Hemminger
In-Reply-To: <20120606184351.GA20380@redhat.com>

On Wed, 2012-06-06 at 21:43 +0300, Michael S. Tsirkin wrote:

> 1. We are trying to look at counters for purposes of tuning the device.
> E.g. if ethtool reports packets and bytes, we'd like to calculate
> average packet size by bytes/packets.
> 
> If both counters are read atomically the metric becomes more exact.
> Not a must but nice to have.
> 

metrics are exact right now.

As soon as you read a value, it might already have changed.

Maybe you want to stop_machine() to make sure all the metrics you want
are 'exact' ;)

> 2. 32 bit systems have some overhead because of the seqlock.
> virtio could instead simply keep tx counters in the queue structure, and
> get the tx lock when they are read.
> 

But then you need atomic64 stuff, have you an idea of the cost of such
primitives on 32bit ?

3. use 32bit counters on 32bit arches, as many drivers still do ?
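
For readers unfamiliar with the seqlock-style scheme being debated, the
following userspace sketch shows the idea with C11 atomics. It is a toy
under stated assumptions, not the kernel's u64_stats_sync implementation:
struct toy_stats and the toy_* / split_* helpers are hypothetical names,
there is exactly one writer, and default (sequentially consistent) atomics
are used for simplicity rather than speed. Each 64-bit counter is stored as
two 32-bit halves to mimic what a 32-bit CPU can read in one access.

```c
#include <stdatomic.h>
#include <stdint.h>

/* A 64-bit counter stored as two 32-bit halves, as a 32-bit CPU would see it. */
struct split_u64 {
    _Atomic uint32_t lo;
    _Atomic uint32_t hi;
};

struct toy_stats {
    atomic_uint seq;            /* even: stable, odd: writer is mid-update */
    struct split_u64 packets;
    struct split_u64 bytes;
};

static void split_store(struct split_u64 *c, uint64_t v)
{
    atomic_store(&c->lo, (uint32_t)v);
    atomic_store(&c->hi, (uint32_t)(v >> 32));
}

static uint64_t split_load(struct split_u64 *c)
{
    uint64_t lo = atomic_load(&c->lo);
    uint64_t hi = atomic_load(&c->hi);

    return (hi << 32) | lo;
}

/* Writer side; assumes a single writer (for example one poll context). */
void toy_stats_add(struct toy_stats *s, uint64_t pkts, uint64_t len)
{
    unsigned int seq = atomic_load(&s->seq);

    atomic_store(&s->seq, seq + 1);   /* odd: update in progress */
    split_store(&s->packets, split_load(&s->packets) + pkts);
    split_store(&s->bytes, split_load(&s->bytes) + len);
    atomic_store(&s->seq, seq + 2);   /* even again: update published */
}

/* Reader side; retries until it sees the same even sequence before and after. */
void toy_stats_read(struct toy_stats *s, uint64_t *pkts, uint64_t *len)
{
    unsigned int start;

    do {
        start = atomic_load(&s->seq);
        *pkts = split_load(&s->packets);
        *len = split_load(&s->bytes);
    } while ((start & 1u) || atomic_load(&s->seq) != start);
}
```

The writer never takes a lock, and a reader that overlaps an update simply
retries, so the snapshot it returns has untorn 64-bit values and a packets
count that belongs with the bytes count. The cost on the fast path in this
sketch is the two sequence updates per counter update.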


* Re: [PATCH net-next] fec: Add support for Coldfire M5441x enet-mac.
From: Steven King @ 2012-06-06 20:05 UTC (permalink / raw)
  To: Jan Ceuleers; +Cc: netdev, uClinux development list, Greg Ungerer
In-Reply-To: <4FCF949D.7060402@computer.org>

On Wednesday 06 June 2012 10:34:21 am Jan Ceuleers wrote:
> On 06/06/2012 07:06 PM, Steven King wrote:
> > Add support for the Freescale Coldfire M5441x; as these parts have an
> > enet-mac, add a quirk to distinguish them from the other Coldfire parts
> > so we can use the existing enet-mac support.
>
> Stephen,
>
> You are activating certain functionality based on whether M5441x is
> defined. But where is this being defined? Should this not be added in a
> Kconfig somewhere as a platform option?

Yes.  Hopefully, once I send Greg my updated patches to add support for the 
m5441x, then it will be a selection in the m68k port.  I just happened to 
have these ready to go after David chastised me for sending them too late in 
the last merge cycle...

^ permalink raw reply

* [PATCH ethtool] ethtool: fix to display support for KX4 and KX PHY
From: Ajit Khaparde @ 2012-06-06 20:03 UTC (permalink / raw)
  To: bhutchings; +Cc: netdev


Signed-off-by: Ajit Khaparde <ajit.khaparde@emulex.com>
---
 ethtool.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/ethtool.c b/ethtool.c
index f18f611..546a43a 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -424,6 +424,13 @@ dump_link_caps(const char *prefix, const char *an_prefix, u32 mask)
 	if (mask & ADVERTISED_1000baseT_Full) {
 		did1++; fprintf(stdout, "1000baseT/Full ");
 	}
+	if (did1 && (mask & ADVERTISED_1000baseKX_Full)) {
+		fprintf(stdout, "\n");
+		fprintf(stdout, "	%*s", indent, "");
+	}
+	if (mask & ADVERTISED_1000baseKX_Full) {
+		did1++; fprintf(stdout, "1000baseKX/Full ");
+	}
 	if (did1 && (mask & ADVERTISED_2500baseX_Full)) {
 		fprintf(stdout, "\n");
 		fprintf(stdout, "	%*s", indent, "");
@@ -438,6 +445,13 @@ dump_link_caps(const char *prefix, const char *an_prefix, u32 mask)
 	if (mask & ADVERTISED_10000baseT_Full) {
 		did1++; fprintf(stdout, "10000baseT/Full ");
 	}
+	if (did1 && (mask & ADVERTISED_10000baseKX4_Full)) {
+		fprintf(stdout, "\n");
+		fprintf(stdout, "	%*s", indent, "");
+	}
+	if (mask & ADVERTISED_10000baseKX4_Full) {
+		did1++; fprintf(stdout, "10000baseKX4/Full ");
+	}
 	if (did1 && (mask & ADVERTISED_20000baseMLD2_Full)) {
 		fprintf(stdout, "\n");
 		fprintf(stdout, "	%*s", indent, "");
-- 
1.7.5.4

^ permalink raw reply related

* [PATCH] sky2: fix checksum bit management on some chips
From: Stephen Hemminger @ 2012-06-06 20:01 UTC (permalink / raw)
  To: Kirill Smelkov, David Miller; +Cc: Mirko Lindner, netdev
In-Reply-To: <20120606172036.GA11911@tugrik.mns.mnsspb.ru>

The newer flavors of Yukon II use a different method for receive
checksum offload. This is indicated in the driver by the SKY2_HW_NEW_LE
flag. On these newer chips, the BMU_ENA_RX_CHKSUM should not be set.

The driver would incorrectly toggle the bit, enabling the old
checksum logic on these chips and causing a BUG_ON() assertion if
receive checksum was toggled via ethtool.

Reported-by: Kirill Smelkov <kirr@mns.spb.ru>

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
Patch against net-next, please apply to net and stable kernels.

--- a/drivers/net/ethernet/marvell/sky2.c	2012-06-06 11:09:38.288440819 -0700
+++ b/drivers/net/ethernet/marvell/sky2.c	2012-06-06 11:25:01.275782462 -0700
@@ -4381,10 +4381,12 @@ static int sky2_set_features(struct net_
 	struct sky2_port *sky2 = netdev_priv(dev);
 	netdev_features_t changed = dev->features ^ features;
 
-	if (changed & NETIF_F_RXCSUM) {
-		bool on = features & NETIF_F_RXCSUM;
-		sky2_write32(sky2->hw, Q_ADDR(rxqaddr[sky2->port], Q_CSR),
-			     on ? BMU_ENA_RX_CHKSUM : BMU_DIS_RX_CHKSUM);
+	if ((changed & NETIF_F_RXCSUM) &&
+	    !(sky2->hw->flags & SKY2_HW_NEW_LE)) {
+		sky2_write32(sky2->hw,
+			     Q_ADDR(rxqaddr[sky2->port], Q_CSR),
+			     (features & NETIF_F_RXCSUM)
+			     ? BMU_ENA_RX_CHKSUM : BMU_DIS_RX_CHKSUM);
 	}
 
 	if (changed & NETIF_F_RXHASH)

^ permalink raw reply

* Re: [PATCH] virtio-net: fix a race on 32bit arches
From: Eric Dumazet @ 2012-06-06 20:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, linux-kernel, virtualization, Stephen Hemminger
In-Reply-To: <20120606165754.GA19357@redhat.com>

On Wed, 2012-06-06 at 19:57 +0300, Michael S. Tsirkin wrote:

> So for virtio since all counters get incremented from bh we can
> ensure they are read atomically, simply by reading them
> from the correct CPU with bh disabled.
> And then we don't need u64_stats_sync at all.
> 

Really ? How are you going to read 64bit stats from foreign cpus on
32bit arches, without additional cost in fast path ?

You should read include/linux/u64_stats_sync.h to fully understand the
issues.
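
For readers without the header at hand, the idea can be sketched in
userspace C11. This is only an illustration of the sequence-counter
technique, assuming a single writer per counter set; it is not the kernel
implementation, and every name in it (stats_sketch, stats_add,
stats_read, and so on) is hypothetical. In the kernel the corresponding
helpers are u64_stats_update_begin()/u64_stats_update_end() on the writer
side and u64_stats_fetch_begin()/u64_stats_fetch_retry() on the reader
side. The 64bit values are stored as two 32bit halves to model a 32bit
arch, where a 64bit load can be torn.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-queue stats; each 64-bit value is two 32-bit halves. */
struct stats_sketch {
	atomic_uint seq;		/* even: stable, odd: update in progress */
	_Atomic uint32_t packets_lo, packets_hi;
	_Atomic uint32_t bytes_lo, bytes_hi;
};

struct stats_snapshot {
	uint64_t packets;
	uint64_t bytes;
};

static uint64_t load64(_Atomic uint32_t *lo, _Atomic uint32_t *hi)
{
	uint64_t l = atomic_load_explicit(lo, memory_order_relaxed);
	uint64_t h = atomic_load_explicit(hi, memory_order_relaxed);

	return (h << 32) | l;
}

static void store64(_Atomic uint32_t *lo, _Atomic uint32_t *hi, uint64_t v)
{
	atomic_store_explicit(lo, (uint32_t)v, memory_order_relaxed);
	atomic_store_explicit(hi, (uint32_t)(v >> 32), memory_order_relaxed);
}

/* Writer side (assumed single writer): no lock, just two seq bumps. */
static void stats_add(struct stats_sketch *s, uint64_t bytes)
{
	unsigned int seq = atomic_load_explicit(&s->seq, memory_order_relaxed);

	/* Mark the update as in progress (seq becomes odd). */
	atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);

	store64(&s->packets_lo, &s->packets_hi,
		load64(&s->packets_lo, &s->packets_hi) + 1);
	store64(&s->bytes_lo, &s->bytes_hi,
		load64(&s->bytes_lo, &s->bytes_hi) + bytes);

	/* Publish the update (seq becomes even again). */
	atomic_store_explicit(&s->seq, seq + 2, memory_order_release);
}

/* Reader side, any thread: retry until a stable snapshot is seen. */
static struct stats_snapshot stats_read(struct stats_sketch *s)
{
	struct stats_snapshot snap;
	unsigned int start;

	do {
		start = atomic_load_explicit(&s->seq, memory_order_acquire);
		snap.packets = load64(&s->packets_lo, &s->packets_hi);
		snap.bytes = load64(&s->bytes_lo, &s->bytes_hi);
		atomic_thread_fence(memory_order_acquire);
	} while ((start & 1) ||
		 start != atomic_load_explicit(&s->seq, memory_order_relaxed));

	return snap;
}
```

The writer never blocks and never takes a lock; the cost of a concurrent
update is paid by the reader, which simply retries. Because packets and
bytes are read inside the same retry loop, a snapshot also gives both
values from the same update, which is what a bytes/packets average needs.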

^ permalink raw reply

* Re: [PATCH] virtio-net: fix a race on 32bit arches
From: Michael S. Tsirkin @ 2012-06-06 19:58 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, linux-kernel, virtualization, Stephen Hemminger
In-Reply-To: <1339012441.26966.48.camel@edumazet-glaptop>

On Wed, Jun 06, 2012 at 09:54:01PM +0200, Eric Dumazet wrote:
> On Wed, 2012-06-06 at 21:51 +0300, Michael S. Tsirkin wrote:
> 
> > BTW for cards that do implement the counters in software,
> > under xmit lock, is anything wrong with simply taking the xmit lock
> > when we get the stats instead of the per-cpu trick + seqlock?
> > 
> 
> I still don't understand why you would do that.
> 
> Most modern machines are 64bits, so there is no seqlock overhead,
> nothing at all.
> 
> If you focus on 32bit hardware, just stick to 32bit counters ?

These wrap around.

> Note that most u64_stats_sync users are virtual drivers, without xmit
> lock (LLTX drivers)
> 
> 

Absolutely, I am talking about virtio here.  I'm not kicking the
u64_stats_sync idea, I am just saying that simple locking
would work for virtio and might be better as it
gives us a way to get counters atomically.

-- 
MST

^ permalink raw reply

* Re: [PATCH] virtio-net: fix a race on 32bit arches
From: Eric Dumazet @ 2012-06-06 19:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stephen Hemminger, Jason Wang, netdev, rusty, linux-kernel,
	virtualization
In-Reply-To: <20120606185107.GA20503@redhat.com>

On Wed, 2012-06-06 at 21:51 +0300, Michael S. Tsirkin wrote:

> BTW for cards that do implement the counters in software,
> under xmit lock, is anything wrong with simply taking the xmit lock
> when we get the stats instead of the per-cpu trick + seqlock?
> 

I still don't understand why you would do that.

Most modern machines are 64bits, so there is no seqlock overhead,
nothing at all.

If you focus on 32bit hardware, just stick to 32bit counters ?

Note that most u64_stats_sync users are virtual drivers, without xmit
lock (LLTX drivers)

^ permalink raw reply
