Netdev List
 help / color / mirror / Atom feed
* [PATCH] net: rps: support PPPOE session messages
From: Changli Gao @ 2011-08-19  5:53 UTC (permalink / raw)
  To: David S. Miller; +Cc: Eric Dumazet, Tom Herbert, netdev, Changli Gao

Inspect the payload of PPPOE session messages for the 4 tuples to generate
skb->rxhash.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
---
 net/core/dev.c |    8 ++++++++
 1 file changed, 8 insertions(+)
diff --git a/net/core/dev.c b/net/core/dev.c
index be7ee50..c2442b4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -134,6 +134,7 @@
 #include <linux/inetdevice.h>
 #include <linux/cpu_rmap.h>
 #include <linux/if_tunnel.h>
+#include <linux/if_pppox.h>
 
 #include "net-sysfs.h"
 
@@ -2573,6 +2574,13 @@ again:
 		proto = vlan->h_vlan_encapsulated_proto;
 		nhoff += sizeof(*vlan);
 		goto again;
+	case __constant_htons(ETH_P_PPP_SES):
+		if (!pskb_may_pull(skb, PPPOE_SES_HLEN + nhoff))
+			goto done;
+		proto = *((__be16 *) (skb->data + nhoff +
+				      sizeof(struct pppoe_hdr)));
+		nhoff += PPPOE_SES_HLEN;
+		goto again;
 	default:
 		goto done;
 	}

^ permalink raw reply related

* Re: [PATCH] net: rps: support PPPOE session messages
From: David Miller @ 2011-08-19  6:24 UTC (permalink / raw)
  To: xiaosuo; +Cc: eric.dumazet, therbert, netdev
In-Reply-To: <1313733235-32238-1-git-send-email-xiaosuo@gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Fri, 19 Aug 2011 13:53:55 +0800

> Inspect the payload of PPPOE session messages for the 4 tuples to generate
> skb->rxhash.
> 
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>

Also applied, thank you.

^ permalink raw reply

* Re: protect raw sockets
From: krbmit siso @ 2011-08-19  6:43 UTC (permalink / raw)
  To: Timo Teräs; +Cc: netdev, ipsec-tools-users, ipsec-tools-devel, ikev2-devel
In-Reply-To: <4E4DF979.5030708@iki.fi>

Hi Timo ,
You are absolutely right, I am using it for traffic generator but,
i want it with ESP , so i want to make the best use of underlying kernel
XFRM functionality . It can be provided has an option
in the kernel like eg ..CONFIG_SECURE_RAW for applying IPsec
policy .

Regards
Naveen

2011/8/19 Timo Teräs <timo.teras@iki.fi>:
> On 08/18/2011 06:01 PM, krbmit siso wrote:
>> After adding the below code in net/ipv4/raw.c in function raw_send_hdrinc()
>> I am able to see packet sent using RAW_SOCKET getting protected .
>>
>> Please  let me know how can it be done better and provide it has a feature
>> , so that others can also use it  if  packet sent using RAW_SOCKET
>> needs to be protected.
>
> Raw sockets are raw sockets. They are used to send out network traffic
> that was captured earlier, or to generate test traffic. I don't think
> it makes any sense to apply XFRM policies to them: it might break the
> usage this API was intended for. The whole purpose of raw sockets is to
> bypass kernel side extra handling.
>
> To generate IPsec protected stuff use the normal APIs: regular UDP/TCP
> sockets.
>
> The same applies for sending/receiving IKE packets. You need regular UDP
> socket with IPsec bypass policy.
>
> What's your point in trying to use raw sockets? You should not need to
> use them unless you are implementing a packet capturer or a network
> traffic generator.
>
> - Timo
>

^ permalink raw reply

* Re: [PATCH] Proportional Rate Reduction for TCP.
From: Nandita Dukkipati @ 2011-08-19  7:34 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: David S. Miller, Netdev, Tom Herbert, Yuchung Cheng, Matt Mathis
In-Reply-To: <alpine.DEB.2.00.1108121432240.12780@wel-95.cs.helsinki.fi>

On Fri, Aug 12, 2011 at 4:36 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
>
> > This patch implements Proportional Rate Reduction (PRR) for TCP.
> > PRR is an algorithm that determines TCP's sending rate in fast
> > recovery. PRR avoids excessive window reductions and aims for
> > the actual congestion window size at the end of recovery to be as
> > close as possible to the window determined by the congestion control
> > algorithm. PRR also improves accuracy of the amount of data sent
> > during loss recovery.
> >
> > The patch implements the recommended flavor of PRR called PRR-SSRB
> > (Proportional rate reduction with slow start reduction bound) and
> > replaces the existing rate halving algorithm. PRR improves upon the
> > existing Linux fast recovery under a number of conditions including:
> >   1) burst losses where the losses implicitly reduce the amount of
> > outstanding data (pipe) below the ssthresh value selected by the
> > congestion control algorithm and,
> >   2) losses near the end of short flows where application runs out of
> > data to send.
> >
> > As an example, with the existing rate halving implementation a single
> > loss event can cause a connection carrying short Web transactions to
> > go into the slow start mode after the recovery. This is because during
> > recovery Linux pulls the congestion window down to packets_in_flight+1
> > on every ACK. A short Web response often runs out of new data to send
> > and its pipe reduces to zero by the end of recovery when all its packets
> > are drained from the network. Subsequent HTTP responses using the same
> > connection will have to slow start to raise cwnd to ssthresh. PRR on
> > the other hand aims for the cwnd to be as close as possible to ssthresh
> > by the end of recovery.
> >
> > A description of PRR and a discussion of its performance can be found at
> > the following links:
> > - IETF Draft:
> >     http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01
> > - IETF Slides:
> >     http://www.ietf.org/proceedings/80/slides/tcpm-6.pdf
> >     http://tools.ietf.org/agenda/81/slides/tcpm-2.pdf
> > - Paper to appear in Internet Measurements Conference (IMC) 2011:
> >     Improving TCP Loss Recovery
> >     Nandita Dukkipati, Matt Mathis, Yuchung Cheng
> >
> > Signed-off-by: Nandita Dukkipati <nanditad@google.com>
> > ---
> >  include/linux/tcp.h   |    4 +++
> >  net/ipv4/tcp_input.c  |   63 ++++++++++++++++++++++++++++++++++++++++++++----
> >  net/ipv4/tcp_output.c |    7 ++++-
> >  3 files changed, 67 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> > index 531ede8..dda4f2e1 100644
> > --- a/include/linux/tcp.h
> > +++ b/include/linux/tcp.h
> > @@ -379,6 +379,10 @@ struct tcp_sock {
> >       u32     snd_cwnd_clamp; /* Do not allow snd_cwnd to grow above this */
> >       u32     snd_cwnd_used;
> >       u32     snd_cwnd_stamp;
> > +     u32     prr_cwnd;       /* Congestion window at start of Recovery. */
> > +     u32     prr_delivered;  /* Number of newly delivered packets to
> > +                              * receiver in Recovery. */
> > +     u32     prr_out;        /* Total number of pkts sent during Recovery. */
> >
> >       u32     rcv_wnd;        /* Current receiver window              */
> >       u32     write_seq;      /* Tail(+1) of data held in tcp send buffer */
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index ea0d218..601eff0 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -2830,9 +2830,14 @@ static int tcp_try_undo_loss(struct sock *sk)
> >  static inline void tcp_complete_cwr(struct sock *sk)
> >  {
> >       struct tcp_sock *tp = tcp_sk(sk);
> > -     /* Do not moderate cwnd if it's already undone in cwr or recovery */
> > -     if (tp->undo_marker && tp->snd_cwnd > tp->snd_ssthresh) {
> > -             tp->snd_cwnd = tp->snd_ssthresh;
> > +
> > +     /* Do not moderate cwnd if it's already undone in cwr or recovery. */
> > +     if (tp->undo_marker) {
> > +
> > +             if (inet_csk(sk)->icsk_ca_state == TCP_CA_CWR)
> > +                     tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh);
> > +             else /* PRR */
> > +                     tp->snd_cwnd = tp->snd_ssthresh;
> >               tp->snd_cwnd_stamp = tcp_time_stamp;
> >       }
> >       tcp_ca_event(sk, CA_EVENT_COMPLETE_CWR);
> > @@ -2950,6 +2955,39 @@ void tcp_simple_retransmit(struct sock *sk)
> >  }
> >  EXPORT_SYMBOL(tcp_simple_retransmit);
> >
> > +/* This function implements the PRR algorithm, specifcally the PRR-SSRB
> > + * (proportional rate reduction with slow start reduction bound) as described in
> > + * http://www.ietf.org/id/draft-mathis-tcpm-proportional-rate-reduction-01.txt.
> > + * It computes the number of packets to send (sndcnt) based on packets newly
> > + * delivered:
> > + *   1) If the packets in flight is larger than ssthresh, PRR spreads the
> > + *   cwnd reductions across a full RTT.
> > + *   2) If packets in flight is lower than ssthresh (such as due to excess
> > + *   losses and/or application stalls), do not perform any further cwnd
> > + *   reductions, but instead slow start up to ssthresh.
> > + */
> > +static void tcp_update_cwnd_in_recovery(struct sock *sk, int pkts_delivered,
> > +                                     int fast_rexmit, int flag)
> > +{
> > +     struct tcp_sock *tp = tcp_sk(sk);
> > +     int sndcnt = 0;
> > +     int delta = tp->snd_ssthresh - tcp_packets_in_flight(tp);
> > +
> > +     if (tcp_packets_in_flight(tp) > tp->snd_ssthresh) {
> > +             if (WARN_ON(!tp->prr_cwnd))
> > +                     tp->prr_cwnd = 1;
>
> Thanks, this made me to smile when I realized what kind of positive feedback
> loop it would cause in a heavily congested environment :-). ...Perhaps any
> value below 2 * tp->snd_ssthresh isn't that safe safety relief valve here.
> ...Of course this condition isn't ever hit with the current code base so
> no real harm being done regardless of the value.
>
> > +             sndcnt = DIV_ROUND_UP(tp->prr_delivered * tp->snd_ssthresh,
> > +                                   tp->prr_cwnd) - tp->prr_out;
>
> What about very large windows ...overflow?

Addressed in patch v2.

> Due to not having lower bound here, the resulting window can end up below
> snd_ssthresh in one step if prr_delivered suddently jumps to a value that
> is above prr_cnwd?
>
> Also, snd_ssthresh and prr_cwnd are essentially constants over the
> whole recovery and yet divide is needed whole the time. ...And in practically
> all cases the proportion will be 0.5 (except for the odd number produced
> off-by-one thing)?
>
> > +     } else {
> > +             sndcnt = min_t(int, delta,
> > +                            max_t(int, tp->prr_delivered - tp->prr_out,
> > +                                  pkts_delivered) + 1);
> > +     }
> > +
> > +     sndcnt = max(sndcnt, (fast_rexmit ? 1 : 0));
> > +     tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
> > +}
> > +
> >  /* Process an event, which can update packets-in-flight not trivially.
> >   * Main goal of this function is to calculate new estimate for left_out,
> >   * taking into account both packets sitting in receiver's buffer and
> > @@ -2961,7 +2999,8 @@ EXPORT_SYMBOL(tcp_simple_retransmit);
> >   * It does _not_ decide what to send, it is made in function
> >   * tcp_xmit_retransmit_queue().
> >   */
> > -static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
> > +static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
> > +                               int pkts_delivered, int flag)
> >  {
> >       struct inet_connection_sock *icsk = inet_csk(sk);
> >       struct tcp_sock *tp = tcp_sk(sk);
> > @@ -3111,13 +3150,17 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
> >
> >               tp->bytes_acked = 0;
> >               tp->snd_cwnd_cnt = 0;
> > +             tp->prr_cwnd = tp->snd_cwnd;
>
> prior_cwnd would make this variable about 1000 times easier to read
> without requiring some background.

Patch v2 changes this to prior_cwnd.

>
> While at it, I realized that there is some restore window place in the
> generic code that assumes "prior_cwnd == 2 * tp->snd_ssthresh" which is
> not true with every single cc module.
>
> > +             tp->prr_delivered = 0;
> > +             tp->prr_out = 0;
> >               tcp_set_ca_state(sk, TCP_CA_Recovery);
> >               fast_rexmit = 1;
> >       }
> >
> >       if (do_lost || (tcp_is_fack(tp) && tcp_head_timedout(sk)))
> >               tcp_update_scoreboard(sk, fast_rexmit);
> > -     tcp_cwnd_down(sk, flag);
> > +     tp->prr_delivered += pkts_delivered;
>
> ...Is here a bug in the calculation if the recovery was triggered with
> _a cumulative ACK_ that had enough SACK blocks?
>
> > +     tcp_update_cwnd_in_recovery(sk, pkts_delivered, fast_rexmit, flag);
> >       tcp_xmit_retransmit_queue(sk);
> >  }
> >
> > @@ -3632,6 +3675,11 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
> >       u32 prior_in_flight;
> >       u32 prior_fackets;
> >       int prior_packets;
> > +     int prior_sacked = tp->sacked_out;
> > +     /* pkts_delivered is number of packets newly cumulatively acked or
> > +      * sacked on this ACK.
> > +      */
> > +     int pkts_delivered = 0;
>
> If you need a comment like that on variable declaration, maybe its name
> could be more obvious, pkts_acked_sacked ?

Changed this to newly_acked_sacked as per your suggestion.

>
> >       int frto_cwnd = 0;
> >
> >       /* If the ack is older than previous acks
> > @@ -3703,6 +3751,9 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
> >       /* See if we can take anything off of the retransmit queue. */
> >       flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una);
> > +     pkts_delivered = (prior_packets - prior_sacked) -
> > +                      (tp->packets_out - tp->sacked_out);
> > +
> >       if (tp->frto_counter)
> >               frto_cwnd = tcp_process_frto(sk, flag);
> >       /* Guarantee sacktag reordering detection against wrap-arounds */
> > @@ -3715,7 +3766,7 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
> >                   tcp_may_raise_cwnd(sk, flag))
> >                       tcp_cong_avoid(sk, ack, prior_in_flight);
> >               tcp_fastretrans_alert(sk, prior_packets - tp->packets_out,
> > -                                   flag);
> > +                                   pkts_delivered, flag);
> >       } else {
> >               if ((flag & FLAG_DATA_ACKED) && !frto_cwnd)
> >                       tcp_cong_avoid(sk, ack, prior_in_flight);
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 882e0b0..ca50408 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -1796,11 +1796,13 @@ static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> >               tcp_event_new_data_sent(sk, skb);
> >
> >               tcp_minshall_update(tp, mss_now, skb);
> > -             sent_pkts++;
> > +             sent_pkts += tcp_skb_pcount(skb);
> >
> >               if (push_one)
> >                       break;
> >       }
> > +     if (inet_csk(sk)->icsk_ca_state == TCP_CA_Recovery)
> > +             tp->prr_out += sent_pkts;
>
> Is there some harm done if you get rid of the conditionality?
>
> >       if (likely(sent_pkts)) {
> >               tcp_cwnd_validate(sk);
> > @@ -2294,6 +2296,9 @@ begin_fwd:
> >                       return;
> >               NET_INC_STATS_BH(sock_net(sk), mib_idx);
> >
> > +             if (inet_csk(sk)->icsk_ca_state == TCP_CA_Recovery)
> > +                     tp->prr_out += tcp_skb_pcount(skb);
> > +
>
> ...same here?

No harm done if the condition is removed in either of these case.
However, letting the condition remain for now for ease of readability.

>
> >               if (skb == tcp_write_queue_head(sk))
> >                       inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> >                                                 inet_csk(sk)->icsk_rto,
>
> Besides the points above, I'm not convinced that we really need all the
> across-packet state that is present, some relevant values should be
> available before we do any ACK processing (in-flight, cwnd, and more
> importantly, the difference between them). ...But I haven't yet
> convinced myself of anything particular.
>
> --
>  i.

^ permalink raw reply

* Re: [PATCH] Proportional Rate Reduction for TCP.
From: Nandita Dukkipati @ 2011-08-19  7:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David S. Miller, netdev, Tom Herbert, Yuchung Cheng, Matt Mathis
In-Reply-To: <m262m0lh0q.fsf@firstfloor.org>

On Sat, Aug 13, 2011 at 10:05 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Nandita Dukkipati <nanditad@google.com> writes:
>> +
>> +     if (tcp_packets_in_flight(tp) > tp->snd_ssthresh) {
>> +             if (WARN_ON(!tp->prr_cwnd))
>> +                     tp->prr_cwnd = 1;
>> +             sndcnt = DIV_ROUND_UP(tp->prr_delivered * tp->snd_ssthresh,
>> +                                   tp->prr_cwnd) - tp->prr_out;
>> +     } else {
>> +             sndcnt = min_t(int, delta,
>> +                            max_t(int, tp->prr_delivered -
>> tp->prr_out,
>
> u32s here? This will likely do bad things with large enough windows.

Fixed in patch v2.

>
> The rest looks good to me as code, but I don't claim to fully understand the
> algorithm. Perhaps a sysctl to turn it off and a linux mib counter when
> it triggers would be useful in addition.

There already exists a mib counter when entering Recovery state, so
this will be incremented when PRR is triggered.

Thanks
Nandita

^ permalink raw reply

* [PATCH v2] Proportional Rate Reduction for TCP.
From: Nandita Dukkipati @ 2011-08-19  7:33 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Tom Herbert, Matt Mathis, Yuchung Cheng,
	Nandita Dukkipati
In-Reply-To: <1313134197-5082-1-git-send-email-nanditad@google.com>

This patch implements Proportional Rate Reduction (PRR) for TCP.
PRR is an algorithm that determines TCP's sending rate in fast
recovery. PRR avoids excessive window reductions and aims for
the actual congestion window size at the end of recovery to be as
close as possible to the window determined by the congestion control
algorithm. PRR also improves accuracy of the amount of data sent
during loss recovery.

The patch implements the recommended flavor of PRR called PRR-SSRB
(Proportional rate reduction with slow start reduction bound) and
replaces the existing rate halving algorithm. PRR improves upon the
existing Linux fast recovery under a number of conditions including:
  1) burst losses where the losses implicitly reduce the amount of
outstanding data (pipe) below the ssthresh value selected by the
congestion control algorithm and,
  2) losses near the end of short flows where application runs out of
data to send.

As an example, with the existing rate halving implementation a single
loss event can cause a connection carrying short Web transactions to
go into the slow start mode after the recovery. This is because during
recovery Linux pulls the congestion window down to packets_in_flight+1
on every ACK. A short Web response often runs out of new data to send
and its pipe reduces to zero by the end of recovery when all its packets
are drained from the network. Subsequent HTTP responses using the same
connection will have to slow start to raise cwnd to ssthresh. PRR on
the other hand aims for the cwnd to be as close as possible to ssthresh
by the end of recovery.

A description of PRR and a discussion of its performance can be found at
the following links:
- IETF Draft:
    http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01
- IETF Slides:
    http://www.ietf.org/proceedings/80/slides/tcpm-6.pdf
    http://tools.ietf.org/agenda/81/slides/tcpm-2.pdf
- Paper to appear in Internet Measurements Conference (IMC) 2011:
    Improving TCP Loss Recovery
    Nandita Dukkipati, Matt Mathis, Yuchung Cheng

Signed-off-by: Nandita Dukkipati <nanditad@google.com>
---
Changelog since v1:
- Took care of overflow for large congestion windows in tcp_update_cwnd_in_recovery().
- Renamed prr_cwnd to prior_cwnd.
- Renamed pkts_delivered to newly_acked_sacked.

 include/linux/tcp.h   |    4 +++
 net/ipv4/tcp_input.c  |   61 ++++++++++++++++++++++++++++++++++++++++++++-----
 net/ipv4/tcp_output.c |    7 +++++-
 3 files changed, 65 insertions(+), 7 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 531ede8..6b63b31 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -379,6 +379,10 @@ struct tcp_sock {
 	u32	snd_cwnd_clamp; /* Do not allow snd_cwnd to grow above this */
 	u32	snd_cwnd_used;
 	u32	snd_cwnd_stamp;
+	u32	prior_cwnd;	/* Congestion window at start of Recovery. */
+	u32	prr_delivered;	/* Number of newly delivered packets to
+				 * receiver in Recovery. */
+	u32	prr_out;	/* Total number of pkts sent during Recovery. */
 
  	u32	rcv_wnd;	/* Current receiver window		*/
 	u32	write_seq;	/* Tail(+1) of data held in tcp send buffer */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ea0d218..41e5f43 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2830,9 +2830,14 @@ static int tcp_try_undo_loss(struct sock *sk)
 static inline void tcp_complete_cwr(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	/* Do not moderate cwnd if it's already undone in cwr or recovery */
-	if (tp->undo_marker && tp->snd_cwnd > tp->snd_ssthresh) {
-		tp->snd_cwnd = tp->snd_ssthresh;
+
+	/* Do not moderate cwnd if it's already undone in cwr or recovery. */
+	if (tp->undo_marker) {
+
+		if (inet_csk(sk)->icsk_ca_state == TCP_CA_CWR)
+			tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh);
+		else /* PRR */
+			tp->snd_cwnd = tp->snd_ssthresh;
 		tp->snd_cwnd_stamp = tcp_time_stamp;
 	}
 	tcp_ca_event(sk, CA_EVENT_COMPLETE_CWR);
@@ -2950,6 +2955,40 @@ void tcp_simple_retransmit(struct sock *sk)
 }
 EXPORT_SYMBOL(tcp_simple_retransmit);
 
+/* This function implements the PRR algorithm, specifcally the PRR-SSRB
+ * (proportional rate reduction with slow start reduction bound) as described in
+ * http://www.ietf.org/id/draft-mathis-tcpm-proportional-rate-reduction-01.txt.
+ * It computes the number of packets to send (sndcnt) based on packets newly
+ * delivered:
+ *   1) If the packets in flight is larger than ssthresh, PRR spreads the
+ *	cwnd reductions across a full RTT.
+ *   2) If packets in flight is lower than ssthresh (such as due to excess
+ *	losses and/or application stalls), do not perform any further cwnd
+ *	reductions, but instead slow start up to ssthresh.
+ */
+static void tcp_update_cwnd_in_recovery(struct sock *sk, int newly_acked_sacked,
+					int fast_rexmit, int flag)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	int sndcnt = 0;
+	int delta = tp->snd_ssthresh - tcp_packets_in_flight(tp);
+
+	if (tcp_packets_in_flight(tp) > tp->snd_ssthresh) {
+		if (WARN_ON(!tp->prior_cwnd))
+			tp->prior_cwnd = 1;
+		sndcnt = DIV_ROUND_UP((u64)(tp->prr_delivered *
+					    tp->snd_ssthresh),
+				      (u64)tp->prior_cwnd) - tp->prr_out;
+	} else {
+		sndcnt = min_t(int, delta,
+			       max_t(int, tp->prr_delivered - tp->prr_out,
+				     newly_acked_sacked) + 1);
+	}
+
+	sndcnt = max(sndcnt, (fast_rexmit ? 1 : 0));
+	tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
+}
+
 /* Process an event, which can update packets-in-flight not trivially.
  * Main goal of this function is to calculate new estimate for left_out,
  * taking into account both packets sitting in receiver's buffer and
@@ -2961,7 +3000,8 @@ EXPORT_SYMBOL(tcp_simple_retransmit);
  * It does _not_ decide what to send, it is made in function
  * tcp_xmit_retransmit_queue().
  */
-static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
+static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
+				  int newly_acked_sacked, int flag)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -3111,13 +3151,17 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
 
 		tp->bytes_acked = 0;
 		tp->snd_cwnd_cnt = 0;
+		tp->prior_cwnd = tp->snd_cwnd;
+		tp->prr_delivered = 0;
+		tp->prr_out = 0;
 		tcp_set_ca_state(sk, TCP_CA_Recovery);
 		fast_rexmit = 1;
 	}
 
 	if (do_lost || (tcp_is_fack(tp) && tcp_head_timedout(sk)))
 		tcp_update_scoreboard(sk, fast_rexmit);
-	tcp_cwnd_down(sk, flag);
+	tp->prr_delivered += newly_acked_sacked;
+	tcp_update_cwnd_in_recovery(sk, newly_acked_sacked, fast_rexmit, flag);
 	tcp_xmit_retransmit_queue(sk);
 }
 
@@ -3632,6 +3676,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	u32 prior_in_flight;
 	u32 prior_fackets;
 	int prior_packets;
+	int prior_sacked = tp->sacked_out;
+	int newly_acked_sacked = 0;
 	int frto_cwnd = 0;
 
 	/* If the ack is older than previous acks
@@ -3703,6 +3749,9 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	/* See if we can take anything off of the retransmit queue. */
 	flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una);
 
+	newly_acked_sacked = (prior_packets - prior_sacked) -
+			     (tp->packets_out - tp->sacked_out);
+
 	if (tp->frto_counter)
 		frto_cwnd = tcp_process_frto(sk, flag);
 	/* Guarantee sacktag reordering detection against wrap-arounds */
@@ -3715,7 +3764,7 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 		    tcp_may_raise_cwnd(sk, flag))
 			tcp_cong_avoid(sk, ack, prior_in_flight);
 		tcp_fastretrans_alert(sk, prior_packets - tp->packets_out,
-				      flag);
+				      newly_acked_sacked, flag);
 	} else {
 		if ((flag & FLAG_DATA_ACKED) && !frto_cwnd)
 			tcp_cong_avoid(sk, ack, prior_in_flight);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 882e0b0..ca50408 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1796,11 +1796,13 @@ static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		tcp_event_new_data_sent(sk, skb);
 
 		tcp_minshall_update(tp, mss_now, skb);
-		sent_pkts++;
+		sent_pkts += tcp_skb_pcount(skb);
 
 		if (push_one)
 			break;
 	}
+	if (inet_csk(sk)->icsk_ca_state == TCP_CA_Recovery)
+		tp->prr_out += sent_pkts;
 
 	if (likely(sent_pkts)) {
 		tcp_cwnd_validate(sk);
@@ -2294,6 +2296,9 @@ begin_fwd:
 			return;
 		NET_INC_STATS_BH(sock_net(sk), mib_idx);
 
+		if (inet_csk(sk)->icsk_ca_state == TCP_CA_Recovery)
+			tp->prr_out += tcp_skb_pcount(skb);
+
 		if (skb == tcp_write_queue_head(sk))
 			inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
 						  inet_csk(sk)->icsk_rto,
-- 
1.7.3.1


^ permalink raw reply related

* Re: [omega-g1:10937] Re: [PATCH] net: configurable sysctl parameter "net.core.tcp_lowat" for sk_stream_min_wspace()
From: Jun.Kondo @ 2011-08-19  9:28 UTC (permalink / raw)
  To: David Miller
  Cc: linux-kernel, omega-g1, notsuki, motokazu.kozaki, htaira, netdev,
	tomohiko.takahashi, kotaro.sakai, ken.sugawara
In-Reply-To: <20110814.224714.2061635645365376268.davem@davemloft.net>

You suggested to use non-blocking writes, but we think
we have to rewrite the Apache code if doing so.
That is, we have to make a modification to Apache that
depends on the architecture.
By using this patch, it can be handled by changing the
configuration a little bit on the kernel side for such
applications that it is difficult to do so on application
side.



(2011/08/15 14:47), David Miller wrote:
> From: "Jun.Kondo"<jun.kondo@ctc-g.co.jp>
> Date: Mon, 15 Aug 2011 14:38:11 +0900
>
>> 2. to limit the block time of the write in order to
>> prevent the timeout of upper layer applications
>> even when the connection has low throughput, such
>> as low rate streaming
> Use non-blocking writes if you want this behavior.
>



^ permalink raw reply

* Re: [RFC 0/0] Introducing a generic socket offload framework
From: Alan Cox @ 2011-08-19  9:28 UTC (permalink / raw)
  To: San Mehat
  Cc: davem, mst, rusty, linux-kernel, virtualization, netdev,
	digitaleric, mikew, miche, maccarro
In-Reply-To: <CAPi7mHqV6Pgk5tSt+fJn_2aEZKZbO=aO78JXgw7MvDaW0musoQ@mail.gmail.com>

> I have no desire to change the 'genericness' of sockets.. just the
> opposite - i wish to
> introduce the notion that sockets (can be) completely generic (when
> offloaded) as far as
> the guest is concerned.

I suppose my concern is that you don't want to design for a specific
offload device, your offload might change but the view from the
application side should not differ.

> > This guest only view means you can't use the abstraction for local
> > sockets too.
> >
> 
> To be honest, the way we're attempting to integrate is in such a way
> that you *could*
> offload AF_LOCAL sockets...  but that world gets a bit too much like
> the 'Twilight Zone'
> for my current linkings..

Until you want to be able to have a pair of apps talking that may or may
not be on different systems and may or may not be on a vm host at all, at
which point having the same acceleration between them (a null accelerator
so to speak) would avoid having to add extra paths to the apps.

> > And yes there is still the complicated cases such as 'the routing table
> > has changed from vitual host to via siberia now what' but I don't believe
> > your proposal addresses that either.
> 
> Can you be more specific? If you mean solving the 'keeping your tcp connections
> open to non virtual endpoints across a migration (or whatever)' then
> no it doesn't :)

That was my assumption.


^ permalink raw reply

* Re: [omega-g1:10937] Re: [PATCH] net: configurable sysctl parameter "net.core.tcp_lowat" for sk_stream_min_wspace()
From: David Miller @ 2011-08-19  9:43 UTC (permalink / raw)
  To: jun.kondo
  Cc: linux-kernel, omega-g1, notsuki, motokazu.kozaki, htaira, netdev,
	tomohiko.takahashi, kotaro.sakai, ken.sugawara
In-Reply-To: <4E4E2CCD.6050809@ctc-g.co.jp>

From: "Jun.Kondo" <jun.kondo@ctc-g.co.jp>
Date: Fri, 19 Aug 2011 18:28:45 +0900

> You suggested to use non-blocking writes, but we think
> we have to rewrite the Apache code if doing so.
> That is, we have to make a modification to Apache that
> depends on the architecture.
> By using this patch, it can be handled by changing the
> configuration a little bit on the kernel side for such
> applications that it is difficult to do so on application
> side.

The kernel provides the facilities necessary to achieve your
goals.  It is a userspace problem.

^ permalink raw reply

* [PATCH] ipv6: Fix ipv6_getsockopt for IPV6_2292PKTOPTIONS
From: Daniel Baluta @ 2011-08-19  9:55 UTC (permalink / raw)
  To: davem, kuznet, jmorris, kaber; +Cc: netdev, Daniel Baluta, Sorin Dumitru

IPV6_2292PKTOPTIONS is broken for 32-bit applications running
in COMPAT mode on 64-bit kernels.

The same problem was fixed for IPv4 with the patch:
ipv4: Fix ip_getsockopt for IP_PKTOPTIONS,
commit dd23198e58cd35259dd09e8892bbdb90f1d57748

Signed-off-by: Sorin Dumitru <sdumitru@ixiacom.com>
Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
---
 net/ipv6/ipv6_sockglue.c |    9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 9cb191e..71ce053 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -913,7 +913,7 @@ static int ipv6_getsockopt_sticky(struct sock *sk, struct ipv6_txoptions *opt,
 }
 
 static int do_ipv6_getsockopt(struct sock *sk, int level, int optname,
-		    char __user *optval, int __user *optlen)
+		    char __user *optval, int __user *optlen, unsigned flags)
 {
 	struct ipv6_pinfo *np = inet6_sk(sk);
 	int len;
@@ -962,7 +962,7 @@ static int do_ipv6_getsockopt(struct sock *sk, int level, int optname,
 
 		msg.msg_control = optval;
 		msg.msg_controllen = len;
-		msg.msg_flags = 0;
+		msg.msg_flags = flags;
 
 		lock_sock(sk);
 		skb = np->pktoptions;
@@ -1222,7 +1222,7 @@ int ipv6_getsockopt(struct sock *sk, int level, int optname,
 	if(level != SOL_IPV6)
 		return -ENOPROTOOPT;
 
-	err = do_ipv6_getsockopt(sk, level, optname, optval, optlen);
+	err = do_ipv6_getsockopt(sk, level, optname, optval, optlen, 0);
 #ifdef CONFIG_NETFILTER
 	/* we need to exclude all possible ENOPROTOOPTs except default case */
 	if (err == -ENOPROTOOPT && optname != IPV6_2292PKTOPTIONS) {
@@ -1264,7 +1264,8 @@ int compat_ipv6_getsockopt(struct sock *sk, int level, int optname,
 		return compat_mc_getsockopt(sk, level, optname, optval, optlen,
 			ipv6_getsockopt);
 
-	err = do_ipv6_getsockopt(sk, level, optname, optval, optlen);
+	err = do_ipv6_getsockopt(sk, level, optname, optval, optlen, 
+				 MSG_CMSG_COMPAT);
 #ifdef CONFIG_NETFILTER
 	/* we need to exclude all possible ENOPROTOOPTs except default case */
 	if (err == -ENOPROTOOPT && optname != IPV6_2292PKTOPTIONS) {
-- 
1.7.1


^ permalink raw reply related

* Re: [PATCH] ipv6: Fix ipv6_getsockopt for IPV6_2292PKTOPTIONS
From: David Miller @ 2011-08-19 10:19 UTC (permalink / raw)
  To: dbaluta; +Cc: kuznet, jmorris, kaber, netdev, sdumitru
In-Reply-To: <1313747758-10063-1-git-send-email-dbaluta@ixiacom.com>

From: Daniel Baluta <dbaluta@ixiacom.com>
Date: Fri, 19 Aug 2011 12:55:58 +0300

> IPV6_2292PKTOPTIONS is broken for 32-bit applications running
> in COMPAT mode on 64-bit kernels.
> 
> The same problem was fixed for IPv4 with the patch:
> ipv4: Fix ip_getsockopt for IP_PKTOPTIONS,
> commit dd23198e58cd35259dd09e8892bbdb90f1d57748
> 
> Signed-off-by: Sorin Dumitru <sdumitru@ixiacom.com>
> Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>

Applied, but:

> -	err = do_ipv6_getsockopt(sk, level, optname, optval, optlen);
> +	err = do_ipv6_getsockopt(sk, level, optname, optval, optlen, 

This adds trailing whitespace, which GIT catches and warns about,
please be mindful of this in the future.

I took care of it and fixed it up this time.

^ permalink raw reply

* Re: [PATCH v2] Proportional Rate Reduction for TCP.
From: Ilpo Järvinen @ 2011-08-19 10:25 UTC (permalink / raw)
  To: Nandita Dukkipati
  Cc: David S. Miller, netdev, Tom Herbert, Matt Mathis, Yuchung Cheng
In-Reply-To: <1313739212-2315-1-git-send-email-nanditad@google.com>

On Fri, 19 Aug 2011, Nandita Dukkipati wrote:

> +static void tcp_update_cwnd_in_recovery(struct sock *sk, int newly_acked_sacked,
> +					int fast_rexmit, int flag)
> +{
> +	struct tcp_sock *tp = tcp_sk(sk);
> +	int sndcnt = 0;
> +	int delta = tp->snd_ssthresh - tcp_packets_in_flight(tp);
> +
> +	if (tcp_packets_in_flight(tp) > tp->snd_ssthresh) {
> +		if (WARN_ON(!tp->prior_cwnd))
> +			tp->prior_cwnd = 1;

This should still be made larger to avoid problems if it ever will be 
needed.

> +		sndcnt = DIV_ROUND_UP((u64)(tp->prr_delivered *
> +					    tp->snd_ssthresh),
> +				      (u64)tp->prior_cwnd) - tp->prr_out;

I think you should pick one from include/linux/math64.h instead of letting 
gcc to do / operand all by itself. ...Obviosly then the ROUND_UP part 
needs to be done manually (either using the remainder or pre-addition 
like DIV_ROUND_UP does).

-- 
 i.

^ permalink raw reply

* Re: [PATCH v2] Proportional Rate Reduction for TCP.
From: David Miller @ 2011-08-19 10:26 UTC (permalink / raw)
  To: nanditad; +Cc: netdev, therbert, mattmathis, ycheng
In-Reply-To: <1313739212-2315-1-git-send-email-nanditad@google.com>

From: Nandita Dukkipati <nanditad@google.com>
Date: Fri, 19 Aug 2011 00:33:32 -0700

> @@ -2830,9 +2830,14 @@ static int tcp_try_undo_loss(struct sock *sk)
>  static inline void tcp_complete_cwr(struct sock *sk)
>  {
>  	struct tcp_sock *tp = tcp_sk(sk);
> -	/* Do not moderate cwnd if it's already undone in cwr or recovery */
> -	if (tp->undo_marker && tp->snd_cwnd > tp->snd_ssthresh) {
> -		tp->snd_cwnd = tp->snd_ssthresh;
> +
> +	/* Do not moderate cwnd if it's already undone in cwr or recovery. */
> +	if (tp->undo_marker) {
> +
> +		if (inet_csk(sk)->icsk_ca_state == TCP_CA_CWR)

Please get rid of that empty line before the TCP_CA_CWR case.

> +		sndcnt = DIV_ROUND_UP((u64)(tp->prr_delivered *
> +					    tp->snd_ssthresh),
> +				      (u64)tp->prior_cwnd) - tp->prr_out;

This won't link on 32-bit unless __divdi3 libgcc routine is provided
by the architecture.  To portably do 64-bit division you need to use
do_div() or something based upon it.  Perhaps DIV_ROUND_UP_LL() will
work best in this case.



^ permalink raw reply

* Re: [PATCH net-next v4 af-packet 0/2] Enhance af-packet to provide (near zero)lossless packet capture functionality.
From: David Miller @ 2011-08-19 10:33 UTC (permalink / raw)
  To: loke.chetan; +Cc: netdev
In-Reply-To: <1313116260-1000-1-git-send-email-loke.chetan@gmail.com>

From: Chetan Loke <loke.chetan@gmail.com>
Date: Thu, 11 Aug 2011 22:30:58 -0400

> This patch attempts to:
> 1)Improve network capture visibility by increasing packet density
> 2)Assist in analyzing multiple(aggregated) capture ports.

Just some general comments before I say something about the actual
code.

Claiming loss-less, or almost loss-less, packet capture is something
which is highly dependent upon the hardware and traffic load.

You cannot claim near loss-less packet capture for an i386 with a
10gbit ethernet part going line rate, for example.

So I want you to describe this as what it is, an improvement.  More
specifically it's just a more flexible mmap ring buffer scheme that
reduces the wastage incurred by the fixed buffer size scheme we
have currently.

Next, you need to use a different subject line for the two patches.
Right now you say the same thing in both, and someone scanning the
changesets headers will have no idea what is different about one
change vs. the other.

^ permalink raw reply

* Re: [PATCH net-next v4 af-packet 2/2] Enhance af-packet to provide (near zero)lossless packet capture functionality.
From: David Miller @ 2011-08-19 10:36 UTC (permalink / raw)
  To: loke.chetan; +Cc: netdev
In-Reply-To: <1313116260-1000-3-git-send-email-loke.chetan@gmail.com>

From: Chetan Loke <loke.chetan@gmail.com>
Date: Thu, 11 Aug 2011 22:31:00 -0400

> +struct kbdq_ft_ops {
> +	int num_ops;
> +	void (*ft_ops[2])(void *, void *);
> +};
...
> +	struct kbdq_ft_ops kfops;
 ...
> +static void prb_init_ft_ops(struct kbdq_core *p1,
> +			union tpacket_req_u *req_u)
> +{
> +	p1->kfops.ft_ops[p1->kfops.num_ops++] = prb_fill_vlan_info;
> +
> +	if (req_u->req3.tp_feature_req_word) {
> +		if (req_u->req3.tp_feature_req_word & TP_FT_REQ_FILL_RXHASH)
> +			p1->kfops.ft_ops[p1->kfops.num_ops++] = prb_fill_rxhash;
> +		else
> +			p1->kfops.ft_ops[p1->kfops.num_ops++] =
> +			prb_clear_rxhash;
> +	}
> +}

It is a lot cheaper to just test the flags in-line than do indirect calls.
Indirect calls are very expensive on many cpus.

In fact, since the first op (prb_fill_vlan_info) is unconditional, we eat
the indirect call cost for absolutely no reason at all.

This kfops stuff was not present in your previous changes.  And I'm
going to tell you that if you keep adding things each revision instead
of just fixing the specific items you've received feedback about, a
set of changes this invasive and of this size will never get merged.

Please resist the urge to further tinker with the code, and just
address the feedback we give you.

^ permalink raw reply

* Re: [PATCH net-next] ipv4: one more case for non-local saddr in ICMP
From: David Miller @ 2011-08-19 10:43 UTC (permalink / raw)
  To: ja; +Cc: netdev, herbert
In-Reply-To: <alpine.LFD.2.00.1108151857420.1481@ja.ssi.bg>

From: Julian Anastasov <ja@ssi.bg>
Date: Mon, 15 Aug 2011 19:21:23 +0300 (EEST)

> 
> 	May be there is one more case that we can avoid using
> non-local source for ICMP errors: xfrm_lookup, num_xfrms = 0 when
> reverse "Flow passes untransformed". Avoid using the input route
> if xfrm_lookup returns same dst.
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> ---
> 
> 	In fact, should we use local IP in all cases when
> sending ICMP? I'm asking it for the following case:
> 
> 	Large packet is forwarded but is rejected with ICMP FRAG
> NEEDED. We usually send ICMP with local saddr instead of the
> original non-local destination. What is the role of
> this reverse check? May be after xfrm_decode_session_reverse
> we should use 'fl4_dec.saddr = fl4->saddr;' so that xfrm_lookup
> works with ICMP from local IP? What is right thing to do here?
> I don't see code that looks in the embedded header...

Well.. this relookup behavior is guided by a special transform state
XFRM_STATE_ICMP that the user must explicitly create IPSEC rules for.

Presumably they are going to add real transforms to such special IPSEC
rules, not create NOP ones with no transforms.  And if they do create
such IPSEC state with no transforms, perhaps the intention is to trigger
to use of the non-local source.

The whole thing revolves around how Herbert envisions people implementing
RFC4301 support using this new XFRM_STATE_ICMP thing.

Right?


^ permalink raw reply

* Re: [PATCH] ipv6: Fix ipv6_getsockopt for IPV6_2292PKTOPTIONS
From: Daniel Baluta @ 2011-08-19 11:23 UTC (permalink / raw)
  To: David Miller; +Cc: kuznet, jmorris, kaber, netdev, sdumitru
In-Reply-To: <20110819.031957.1143398061283788162.davem@davemloft.net>

On Fri, Aug 19, 2011 at 1:19 PM, David Miller <davem@davemloft.net> wrote:
> From: Daniel Baluta <dbaluta@ixiacom.com>
> Date: Fri, 19 Aug 2011 12:55:58 +0300
>
>> IPV6_2292PKTOPTIONS is broken for 32-bit applications running
>> in COMPAT mode on 64-bit kernels.
>>
>> The same problem was fixed for IPv4 with the patch:
>> ipv4: Fix ip_getsockopt for IP_PKTOPTIONS,
>> commit dd23198e58cd35259dd09e8892bbdb90f1d57748
>>
>> Signed-off-by: Sorin Dumitru <sdumitru@ixiacom.com>
>> Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
>
> Applied, but:
>
>> -     err = do_ipv6_getsockopt(sk, level, optname, optval, optlen);
>> +     err = do_ipv6_getsockopt(sk, level, optname, optval, optlen,
>
> This adds trailing whitespace, which GIT catches and warns about,
> please be mindful of this in the future.
>
> I took care of it and fixed it up this time.

Thanks David. I will pay more attention to this next time.

Daniel.

^ permalink raw reply

* Traceroute and "ping" sockets: some questions
From: Dmitry Butskoy @ 2011-08-19 11:19 UTC (permalink / raw)
  To: netdev

Hi

I've released new version of the Linux traceroute 2.0.18, which supports 
new (SOCK_DGRAM, IPPROTO_ICMP) sockets, appeared in the kernel 3.0 . 
This way users might perform icmp tracerouting (`-I') without setuid bit 
or cap_net_raw settings of the executable.
(See it at 
http://sourceforge.net/projects/traceroute/files/traceroute/traceroute-2.0.18/)

I would like to ask some questions:

- Currently such "ping" sockets implemented for IPv4 only. Are there any 
plans to implement it for IPv6 as well?

The traceroute-2.0.18 is ready for IPv6 anyway -- just wait for 
appearing it in the kernel without recompile (I hope :) ) -- but I would 
prefer to test it as soon as possible.

- Are there any plans to implement some "rate control" (maybe 
sysctl-configurable too), to restrict unprivileged users to send icmp 
echoes too fast (ie. faster than 200 ms -- the current ping(8) restriction)?


Regards,
Dmitry Butskoy
http://www.fedoraproject.org/wiki/DmitryButskoy

^ permalink raw reply

* Re: Traceroute and "ping" sockets: some questions
From: David Miller @ 2011-08-19 11:38 UTC (permalink / raw)
  To: buc; +Cc: netdev
In-Reply-To: <4E4E46B4.9010909@odu.neva.ru>

From: Dmitry Butskoy <buc@odusz.so-cdu.ru>
Date: Fri, 19 Aug 2011 15:19:16 +0400

> I would like to ask some questions:
> 
> - Currently such "ping" sockets implemented for IPv4 only. Are there any
> - plans to implement it for IPv6 as well?

I think the original authors of the kernel component said they would
work on this, but it hasn't materialized yet.

> - Are there any plans to implement some "rate control" (maybe
> - sysctl-configurable too), to restrict unprivileged users to send icmp
> - echoes too fast (ie. faster than 200 ms -- the current ping(8)
> - restriction)?

Why limit?  He can spam with UDP socket just as easily at any rate
he pleases, and rate limiting is policy issue and there relegated
to netfilter and/or the packet scheduler layer.

^ permalink raw reply

* winner
From: Microsoft @ 2011-08-19  9:07 UTC (permalink / raw)


You have won 500.000 GBP
send your phone number
and address

^ permalink raw reply

* Re: net: rps: support 802.1Q
From: Ben Hutchings @ 2011-08-19 11:54 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, Eric Dumazet, Tom Herbert, netdev
In-Reply-To: <1313730353-25379-1-git-send-email-xiaosuo@gmail.com>

On Fri, 2011-08-19 at 13:05 +0800, Changli Gao wrote:
> For the 802.1Q packets, if the NIC doesn't support hw-accel-vlan-rx, RPS
> won't inspect the internal 4 tuples to generate skb->rxhash, so this kind
> of traffic can't get any benefit from RPS.
> 
> This patch adds the support for 802.1Q to RPS.
[...]
> @@ -2565,6 +2566,13 @@ again:
>  		addr2 = (__force u32) ip6->daddr.s6_addr32[3];
>  		nhoff += 40;
>  		break;
> +	case __constant_htons(ETH_P_8021Q):
> +		if (!pskb_may_pull(skb, sizeof(*vlan) + nhoff))
> +			goto done;
> +		vlan = (const struct vlan_hdr *) (skb->data + nhoff);
> +		proto = vlan->h_vlan_encapsulated_proto;
> +		nhoff += sizeof(*vlan);
> +		goto again;
>  	default:
>  		goto done;
>  	}

Should this really be reading an unlimited number of tags?  What if an
attacker starts sending packets full of VLAN tags?  Since this runs
before netfilter, there would be no way to prevent those packets burning
our CPU time.  And if there are legitimately multiple VLAN tags, they
presumably won't all have the 802.1q Ethertype.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: Traceroute and "ping" sockets: some questions
From: Dmitry Butskoy @ 2011-08-19 12:22 UTC (permalink / raw)
  Cc: netdev
In-Reply-To: <20110819.043816.1648681777223816477.davem@davemloft.net>

David Miller wrote:
>> - Are there any plans to implement some "rate control" (maybe
>> - sysctl-configurable too), to restrict unprivileged users to send icmp
>> - echoes too fast (ie. faster than 200 ms -- the current ping(8)
>> - restriction)?
>>      
> Why limit?  He can spam with UDP socket just as easily at any rate
> he pleases,
>    
Yes, but most cases such UDP is "one-way" spam (until services like 
"echo 7/udp" are enabled).
For icmp ping, we normally receive icmp replies, hence it is 
bidirectional crap. Which was not present before.

Besides that ping(8) is normally present in the system even if C 
development is not installed (ie. user cannot build its spam software at 
the host etc...)

Regards,
Dmitry
http://www.fedoraproject.org/wiki/DmitryButskoy

^ permalink raw reply

* Re: [RFC 0/0] Introducing a generic socket offload framework
From: jamal @ 2011-08-19 12:49 UTC (permalink / raw)
  To: San Mehat
  Cc: davem, mst, rusty, linux-kernel, virtualization, netdev,
	digitaleric, mikew, miche, maccarro
In-Reply-To: <20110818220756.5C93E5C80B@san.sea.corp.google.com>

On Thu, 2011-08-18 at 15:07 -0700, San Mehat wrote:
> TL;DR
> -----
> In this RFC we propose the introduction of the concept of hardware socket
> offload to the Linux kernel. Patches will accompany this RFC in a few days,
> but we felt we had enough on the design to solicit constructive discussion
> from the community at-large.
> 

[..]

> ALTERNATIVE STRATEGIES
> ----------------------
> 
> An alternative strategy for providing similar functionality involves either
> modifying glibc or using LD_PRELOAD tricks to intercept socket calls. We were
> forced to rule this out due to the complexity (and fragility) involved with
> attempting to maintain a general solution compatible across various
> distributions where platform-libraries differ.

Above should have been in your TL;DR;->

LD_PRELOAD is also horrible because of the granularity of the socket
calls;
Having things in the kernel and specifically tagging socket as needing
this feature is much much more manageable.

Tying things to virtualization may miss the big picture because there
are many other use cases for intercepting socket calls, example:
Samir Bellabes <sam@synack.fr> has been trying to get what he refers to
as "personal firewall" (equivalent to the silly windows firewall) which
prompts the user "ping from blah, do you want to allow a response?"
That requires intercepting send/recv calls, prompt the user in possibly
some pop-up and act based on response. It requires looking at content.
He is trying to use selinux for that interface,
but i think this would be the right abstraction.
I hope you plan to support send/recv.
I also hope you add support for SOCK_RAW (and maybe SOCK_PACKET).

Q: If you want this to be transparent to the apps, who/what is doing
the tagging of SOCK_HWASSIST? clearly not the app if you dont want to
change it.

I like the uri abstraction if it doesnt come at the expense of the
app transparency.

cheers,
jamal

^ permalink raw reply

* [net-next 0/6][pull request] Intel Wired LAN Driver Update
From: Jeff Kirsher @ 2011-08-19 13:11 UTC (permalink / raw)
  To: davem; +Cc: Jeff Kirsher, netdev, gospo

The following series contains updates to e1000e and ixgbe.

The following are changes since commit ae1511bf769cafeae5ab61aaf9947a16a22cbd10:
  net: rps: support PPPOE session messages
and are available in the git repository at:
  master.kernel.org:/pub/scm/linux/kernel/git/jkirsher/net-next master

Alexander Duyck (3):
  ixgbe: Refactor transmit map and cleanup routines
  ixgbe: replace reference to CONFIG_FCOE with IXGBE_FCOE
  ixgbe: Cleanup FCOE and VLAN handling in xmit_frame_ring

Amir Hanania (1):
  ixgbe - DDP last user buffer - error to warn

Bruce Allan (2):
  e1000e: convert driver to use extended descriptors
  e1000e: bump driver version number

 drivers/net/ethernet/intel/e1000e/e1000.h       |    3 +-
 drivers/net/ethernet/intel/e1000e/ethtool.c     |    9 +-
 drivers/net/ethernet/intel/e1000e/netdev.c      |  199 +++++----
 drivers/net/ethernet/intel/ixgbe/ixgbe.h        |   27 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_nl.c |    2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c   |   10 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   |  554 ++++++++++++----------
 7 files changed, 445 insertions(+), 359 deletions(-)

-- 
1.7.6


^ permalink raw reply

* [net-next 1/6] e1000e: convert driver to use extended descriptors
From: Jeff Kirsher @ 2011-08-19 13:11 UTC (permalink / raw)
  To: davem; +Cc: Bruce Allan, netdev, gospo, Jeff Kirsher
In-Reply-To: <1313759486-23575-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Bruce Allan <bruce.w.allan@intel.com>

Some features currently not supported by the driver (e.g. RSS) require the
use of extended descriptors, but the driver is setup to only use legacy
descriptors in all modes except for when jumbo frames are enabled on some
parts.  Convert the driver to always use extended descriptors in order to
enable the forthcoming support of these other features.

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000e/e1000.h   |    3 +-
 drivers/net/ethernet/intel/e1000e/ethtool.c |    9 +-
 drivers/net/ethernet/intel/e1000e/netdev.c  |  197 +++++++++++++++------------
 3 files changed, 120 insertions(+), 89 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/e1000.h b/drivers/net/ethernet/intel/e1000e/e1000.h
index 638d175..cbbbff4 100644
--- a/drivers/net/ethernet/intel/e1000e/e1000.h
+++ b/drivers/net/ethernet/intel/e1000e/e1000.h
@@ -456,8 +456,9 @@ struct e1000_info {
 
 #define E1000_RX_DESC_PS(R, i)	    \
 	(&(((union e1000_rx_desc_packet_split *)((R).desc))[i]))
+#define E1000_RX_DESC_EXT(R, i)	    \
+	(&(((union e1000_rx_desc_extended *)((R).desc))[i]))
 #define E1000_GET_DESC(R, i, type)	(&(((struct type *)((R).desc))[i]))
-#define E1000_RX_DESC(R, i)		E1000_GET_DESC(R, i, e1000_rx_desc)
 #define E1000_TX_DESC(R, i)		E1000_GET_DESC(R, i, e1000_tx_desc)
 #define E1000_CONTEXT_DESC(R, i)	E1000_GET_DESC(R, i, e1000_context_desc)
 
diff --git a/drivers/net/ethernet/intel/e1000e/ethtool.c b/drivers/net/ethernet/intel/e1000e/ethtool.c
index 06d88f3..8d3ca85 100644
--- a/drivers/net/ethernet/intel/e1000e/ethtool.c
+++ b/drivers/net/ethernet/intel/e1000e/ethtool.c
@@ -1195,7 +1195,7 @@ static int e1000_setup_desc_rings(struct e1000_adapter *adapter)
 		goto err_nomem;
 	}
 
-	rx_ring->size = rx_ring->count * sizeof(struct e1000_rx_desc);
+	rx_ring->size = rx_ring->count * sizeof(union e1000_rx_desc_extended);
 	rx_ring->desc = dma_alloc_coherent(&pdev->dev, rx_ring->size,
 					   &rx_ring->dma, GFP_KERNEL);
 	if (!rx_ring->desc) {
@@ -1220,7 +1220,7 @@ static int e1000_setup_desc_rings(struct e1000_adapter *adapter)
 	ew32(RCTL, rctl);
 
 	for (i = 0; i < rx_ring->count; i++) {
-		struct e1000_rx_desc *rx_desc = E1000_RX_DESC(*rx_ring, i);
+		union e1000_rx_desc_extended *rx_desc;
 		struct sk_buff *skb;
 
 		skb = alloc_skb(2048 + NET_IP_ALIGN, GFP_KERNEL);
@@ -1238,8 +1238,9 @@ static int e1000_setup_desc_rings(struct e1000_adapter *adapter)
 			ret_val = 8;
 			goto err_nomem;
 		}
-		rx_desc->buffer_addr =
-			cpu_to_le64(rx_ring->buffer_info[i].dma);
+		rx_desc = E1000_RX_DESC_EXT(*rx_ring, i);
+		rx_desc->read.buffer_addr =
+		    cpu_to_le64(rx_ring->buffer_info[i].dma);
 		memset(skb->data, 0x00, skb->len);
 	}
 
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index d0fdb51..55c3cc1 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -192,7 +192,7 @@ static void e1000e_dump(struct e1000_adapter *adapter)
 	struct e1000_buffer *buffer_info;
 	struct e1000_ring *rx_ring = adapter->rx_ring;
 	union e1000_rx_desc_packet_split *rx_desc_ps;
-	struct e1000_rx_desc *rx_desc;
+	union e1000_rx_desc_extended *rx_desc;
 	struct my_u1 {
 		u64 a;
 		u64 b;
@@ -399,41 +399,70 @@ rx_ring_summary:
 		break;
 	default:
 	case 0:
-		/* Legacy Receive Descriptor Format
+		/* Extended Receive Descriptor (Read) Format
 		 *
-		 * +-----------------------------------------------------+
-		 * |                Buffer Address [63:0]                |
-		 * +-----------------------------------------------------+
-		 * | VLAN Tag | Errors | Status 0 | Packet csum | Length |
-		 * +-----------------------------------------------------+
-		 * 63       48 47    40 39      32 31         16 15      0
+		 *   +-----------------------------------------------------+
+		 * 0 |                Buffer Address [63:0]                |
+		 *   +-----------------------------------------------------+
+		 * 8 |                      Reserved                       |
+		 *   +-----------------------------------------------------+
 		 */
-		printk(KERN_INFO "Rl[desc]     [address 63:0  ] "
-		       "[vl er S cks ln] [bi->dma       ] [bi->skb] "
-		       "<-- Legacy format\n");
-		for (i = 0; rx_ring->desc && (i < rx_ring->count); i++) {
-			rx_desc = E1000_RX_DESC(*rx_ring, i);
+		printk(KERN_INFO "R  [desc]      [buf addr 63:0 ] "
+		       "[reserved 63:0 ] [bi->dma       ] "
+		       "[bi->skb] <-- Ext (Read) format\n");
+		/* Extended Receive Descriptor (Write-Back) Format
+		 *
+		 *   63       48 47    32 31    24 23            4 3        0
+		 *   +------------------------------------------------------+
+		 *   |     RSS Hash      |        |               |         |
+		 * 0 +-------------------+  Rsvd  |   Reserved    | MRQ RSS |
+		 *   | Packet   | IP     |        |               |  Type   |
+		 *   | Checksum | Ident  |        |               |         |
+		 *   +------------------------------------------------------+
+		 * 8 | VLAN Tag | Length | Extended Error | Extended Status |
+		 *   +------------------------------------------------------+
+		 *   63       48 47    32 31            20 19               0
+		 */
+		printk(KERN_INFO "RWB[desc]      [cs ipid    mrq] "
+		       "[vt   ln xe  xs] "
+		       "[bi->skb] <-- Ext (Write-Back) format\n");
+
+		for (i = 0; i < rx_ring->count; i++) {
 			buffer_info = &rx_ring->buffer_info[i];
-			u0 = (struct my_u0 *)rx_desc;
-			printk(KERN_INFO "Rl[0x%03X]    %016llX %016llX "
-			       "%016llX %p", i,
-			       (unsigned long long)le64_to_cpu(u0->a),
-			       (unsigned long long)le64_to_cpu(u0->b),
-			       (unsigned long long)buffer_info->dma,
-			       buffer_info->skb);
+			rx_desc = E1000_RX_DESC_EXT(*rx_ring, i);
+			u1 = (struct my_u1 *)rx_desc;
+			staterr = le32_to_cpu(rx_desc->wb.upper.status_error);
+			if (staterr & E1000_RXD_STAT_DD) {
+				/* Descriptor Done */
+				printk(KERN_INFO "RWB[0x%03X]     %016llX "
+				       "%016llX ---------------- %p", i,
+				       (unsigned long long)le64_to_cpu(u1->a),
+				       (unsigned long long)le64_to_cpu(u1->b),
+				       buffer_info->skb);
+			} else {
+				printk(KERN_INFO "R  [0x%03X]     %016llX "
+				       "%016llX %016llX %p", i,
+				       (unsigned long long)le64_to_cpu(u1->a),
+				       (unsigned long long)le64_to_cpu(u1->b),
+				       (unsigned long long)buffer_info->dma,
+				       buffer_info->skb);
+
+				if (netif_msg_pktdata(adapter))
+					print_hex_dump(KERN_INFO, "",
+						       DUMP_PREFIX_ADDRESS, 16,
+						       1,
+						       phys_to_virt
+						       (buffer_info->dma),
+						       adapter->rx_buffer_len,
+						       true);
+			}
+
 			if (i == rx_ring->next_to_use)
 				printk(KERN_CONT " NTU\n");
 			else if (i == rx_ring->next_to_clean)
 				printk(KERN_CONT " NTC\n");
 			else
 				printk(KERN_CONT "\n");
-
-			if (netif_msg_pktdata(adapter))
-				print_hex_dump(KERN_INFO, "",
-					       DUMP_PREFIX_ADDRESS,
-					       16, 1,
-					       phys_to_virt(buffer_info->dma),
-					       adapter->rx_buffer_len, true);
 		}
 	}
 
@@ -519,7 +548,7 @@ static void e1000_rx_checksum(struct e1000_adapter *adapter, u32 status_err,
 }
 
 /**
- * e1000_alloc_rx_buffers - Replace used receive buffers; legacy & extended
+ * e1000_alloc_rx_buffers - Replace used receive buffers
  * @adapter: address of board private structure
  **/
 static void e1000_alloc_rx_buffers(struct e1000_adapter *adapter,
@@ -528,7 +557,7 @@ static void e1000_alloc_rx_buffers(struct e1000_adapter *adapter,
 	struct net_device *netdev = adapter->netdev;
 	struct pci_dev *pdev = adapter->pdev;
 	struct e1000_ring *rx_ring = adapter->rx_ring;
-	struct e1000_rx_desc *rx_desc;
+	union e1000_rx_desc_extended *rx_desc;
 	struct e1000_buffer *buffer_info;
 	struct sk_buff *skb;
 	unsigned int i;
@@ -562,8 +591,8 @@ map_skb:
 			break;
 		}
 
-		rx_desc = E1000_RX_DESC(*rx_ring, i);
-		rx_desc->buffer_addr = cpu_to_le64(buffer_info->dma);
+		rx_desc = E1000_RX_DESC_EXT(*rx_ring, i);
+		rx_desc->read.buffer_addr = cpu_to_le64(buffer_info->dma);
 
 		if (unlikely(!(i & (E1000_RX_BUFFER_WRITE - 1)))) {
 			/*
@@ -697,7 +726,7 @@ static void e1000_alloc_jumbo_rx_buffers(struct e1000_adapter *adapter,
 {
 	struct net_device *netdev = adapter->netdev;
 	struct pci_dev *pdev = adapter->pdev;
-	struct e1000_rx_desc *rx_desc;
+	union e1000_rx_desc_extended *rx_desc;
 	struct e1000_ring *rx_ring = adapter->rx_ring;
 	struct e1000_buffer *buffer_info;
 	struct sk_buff *skb;
@@ -738,8 +767,8 @@ check_page:
 			                                PAGE_SIZE,
 							DMA_FROM_DEVICE);
 
-		rx_desc = E1000_RX_DESC(*rx_ring, i);
-		rx_desc->buffer_addr = cpu_to_le64(buffer_info->dma);
+		rx_desc = E1000_RX_DESC_EXT(*rx_ring, i);
+		rx_desc->read.buffer_addr = cpu_to_le64(buffer_info->dma);
 
 		if (unlikely(++i == rx_ring->count))
 			i = 0;
@@ -774,28 +803,27 @@ static bool e1000_clean_rx_irq(struct e1000_adapter *adapter,
 	struct pci_dev *pdev = adapter->pdev;
 	struct e1000_hw *hw = &adapter->hw;
 	struct e1000_ring *rx_ring = adapter->rx_ring;
-	struct e1000_rx_desc *rx_desc, *next_rxd;
+	union e1000_rx_desc_extended *rx_desc, *next_rxd;
 	struct e1000_buffer *buffer_info, *next_buffer;
-	u32 length;
+	u32 length, staterr;
 	unsigned int i;
 	int cleaned_count = 0;
 	bool cleaned = 0;
 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
 
 	i = rx_ring->next_to_clean;
-	rx_desc = E1000_RX_DESC(*rx_ring, i);
+	rx_desc = E1000_RX_DESC_EXT(*rx_ring, i);
+	staterr = le32_to_cpu(rx_desc->wb.upper.status_error);
 	buffer_info = &rx_ring->buffer_info[i];
 
-	while (rx_desc->status & E1000_RXD_STAT_DD) {
+	while (staterr & E1000_RXD_STAT_DD) {
 		struct sk_buff *skb;
-		u8 status;
 
 		if (*work_done >= work_to_do)
 			break;
 		(*work_done)++;
 		rmb();	/* read descriptor and rx_buffer_info after status DD */
 
-		status = rx_desc->status;
 		skb = buffer_info->skb;
 		buffer_info->skb = NULL;
 
@@ -804,7 +832,7 @@ static bool e1000_clean_rx_irq(struct e1000_adapter *adapter,
 		i++;
 		if (i == rx_ring->count)
 			i = 0;
-		next_rxd = E1000_RX_DESC(*rx_ring, i);
+		next_rxd = E1000_RX_DESC_EXT(*rx_ring, i);
 		prefetch(next_rxd);
 
 		next_buffer = &rx_ring->buffer_info[i];
@@ -817,7 +845,7 @@ static bool e1000_clean_rx_irq(struct e1000_adapter *adapter,
 				 DMA_FROM_DEVICE);
 		buffer_info->dma = 0;
 
-		length = le16_to_cpu(rx_desc->length);
+		length = le16_to_cpu(rx_desc->wb.upper.length);
 
 		/*
 		 * !EOP means multiple descriptors were used to store a single
@@ -826,7 +854,7 @@ static bool e1000_clean_rx_irq(struct e1000_adapter *adapter,
 		 * next frame that _does_ have the EOP bit set, as it is by
 		 * definition only a frame fragment
 		 */
-		if (unlikely(!(status & E1000_RXD_STAT_EOP)))
+		if (unlikely(!(staterr & E1000_RXD_STAT_EOP)))
 			adapter->flags2 |= FLAG2_IS_DISCARDING;
 
 		if (adapter->flags2 & FLAG2_IS_DISCARDING) {
@@ -834,12 +862,12 @@ static bool e1000_clean_rx_irq(struct e1000_adapter *adapter,
 			e_dbg("Receive packet consumed multiple buffers\n");
 			/* recycle */
 			buffer_info->skb = skb;
-			if (status & E1000_RXD_STAT_EOP)
+			if (staterr & E1000_RXD_STAT_EOP)
 				adapter->flags2 &= ~FLAG2_IS_DISCARDING;
 			goto next_desc;
 		}
 
-		if (rx_desc->errors & E1000_RXD_ERR_FRAME_ERR_MASK) {
+		if (staterr & E1000_RXDEXT_ERR_FRAME_ERR_MASK) {
 			/* recycle */
 			buffer_info->skb = skb;
 			goto next_desc;
@@ -877,15 +905,15 @@ static bool e1000_clean_rx_irq(struct e1000_adapter *adapter,
 		skb_put(skb, length);
 
 		/* Receive Checksum Offload */
-		e1000_rx_checksum(adapter,
-				  (u32)(status) |
-				  ((u32)(rx_desc->errors) << 24),
-				  le16_to_cpu(rx_desc->csum), skb);
+		e1000_rx_checksum(adapter, staterr,
+				  le16_to_cpu(rx_desc->wb.lower.hi_dword.
+					      csum_ip.csum), skb);
 
-		e1000_receive_skb(adapter, netdev, skb,status,rx_desc->special);
+		e1000_receive_skb(adapter, netdev, skb, staterr,
+				  rx_desc->wb.upper.vlan);
 
 next_desc:
-		rx_desc->status = 0;
+		rx_desc->wb.upper.status_error &= cpu_to_le32(~0xFF);
 
 		/* return some buffers to hardware, one at a time is too slow */
 		if (cleaned_count >= E1000_RX_BUFFER_WRITE) {
@@ -897,6 +925,8 @@ next_desc:
 		/* use prefetched values */
 		rx_desc = next_rxd;
 		buffer_info = next_buffer;
+
+		staterr = le32_to_cpu(rx_desc->wb.upper.status_error);
 	}
 	rx_ring->next_to_clean = i;
 
@@ -1280,35 +1310,34 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
 	struct net_device *netdev = adapter->netdev;
 	struct pci_dev *pdev = adapter->pdev;
 	struct e1000_ring *rx_ring = adapter->rx_ring;
-	struct e1000_rx_desc *rx_desc, *next_rxd;
+	union e1000_rx_desc_extended *rx_desc, *next_rxd;
 	struct e1000_buffer *buffer_info, *next_buffer;
-	u32 length;
+	u32 length, staterr;
 	unsigned int i;
 	int cleaned_count = 0;
 	bool cleaned = false;
 	unsigned int total_rx_bytes=0, total_rx_packets=0;
 
 	i = rx_ring->next_to_clean;
-	rx_desc = E1000_RX_DESC(*rx_ring, i);
+	rx_desc = E1000_RX_DESC_EXT(*rx_ring, i);
+	staterr = le32_to_cpu(rx_desc->wb.upper.status_error);
 	buffer_info = &rx_ring->buffer_info[i];
 
-	while (rx_desc->status & E1000_RXD_STAT_DD) {
+	while (staterr & E1000_RXD_STAT_DD) {
 		struct sk_buff *skb;
-		u8 status;
 
 		if (*work_done >= work_to_do)
 			break;
 		(*work_done)++;
 		rmb();	/* read descriptor and rx_buffer_info after status DD */
 
-		status = rx_desc->status;
 		skb = buffer_info->skb;
 		buffer_info->skb = NULL;
 
 		++i;
 		if (i == rx_ring->count)
 			i = 0;
-		next_rxd = E1000_RX_DESC(*rx_ring, i);
+		next_rxd = E1000_RX_DESC_EXT(*rx_ring, i);
 		prefetch(next_rxd);
 
 		next_buffer = &rx_ring->buffer_info[i];
@@ -1319,23 +1348,22 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
 			       DMA_FROM_DEVICE);
 		buffer_info->dma = 0;
 
-		length = le16_to_cpu(rx_desc->length);
+		length = le16_to_cpu(rx_desc->wb.upper.length);
 
 		/* errors is only valid for DD + EOP descriptors */
-		if (unlikely((status & E1000_RXD_STAT_EOP) &&
-		    (rx_desc->errors & E1000_RXD_ERR_FRAME_ERR_MASK))) {
-				/* recycle both page and skb */
-				buffer_info->skb = skb;
-				/* an error means any chain goes out the window
-				 * too */
-				if (rx_ring->rx_skb_top)
-					dev_kfree_skb_irq(rx_ring->rx_skb_top);
-				rx_ring->rx_skb_top = NULL;
-				goto next_desc;
+		if (unlikely((staterr & E1000_RXD_STAT_EOP) &&
+			     (staterr & E1000_RXDEXT_ERR_FRAME_ERR_MASK))) {
+			/* recycle both page and skb */
+			buffer_info->skb = skb;
+			/* an error means any chain goes out the window too */
+			if (rx_ring->rx_skb_top)
+				dev_kfree_skb_irq(rx_ring->rx_skb_top);
+			rx_ring->rx_skb_top = NULL;
+			goto next_desc;
 		}
 
 #define rxtop (rx_ring->rx_skb_top)
-		if (!(status & E1000_RXD_STAT_EOP)) {
+		if (!(staterr & E1000_RXD_STAT_EOP)) {
 			/* this descriptor is only the beginning (or middle) */
 			if (!rxtop) {
 				/* this is the beginning of a chain */
@@ -1390,10 +1418,9 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
 		}
 
 		/* Receive Checksum Offload XXX recompute due to CRC strip? */
-		e1000_rx_checksum(adapter,
-		                  (u32)(status) |
-		                  ((u32)(rx_desc->errors) << 24),
-		                  le16_to_cpu(rx_desc->csum), skb);
+		e1000_rx_checksum(adapter, staterr,
+				  le16_to_cpu(rx_desc->wb.lower.hi_dword.
+					      csum_ip.csum), skb);
 
 		/* probably a little skewed due to removing CRC */
 		total_rx_bytes += skb->len;
@@ -1406,11 +1433,11 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
 			goto next_desc;
 		}
 
-		e1000_receive_skb(adapter, netdev, skb, status,
-		                  rx_desc->special);
+		e1000_receive_skb(adapter, netdev, skb, staterr,
+				  rx_desc->wb.upper.vlan);
 
 next_desc:
-		rx_desc->status = 0;
+		rx_desc->wb.upper.status_error &= cpu_to_le32(~0xFF);
 
 		/* return some buffers to hardware, one at a time is too slow */
 		if (unlikely(cleaned_count >= E1000_RX_BUFFER_WRITE)) {
@@ -1422,6 +1449,8 @@ next_desc:
 		/* use prefetched values */
 		rx_desc = next_rxd;
 		buffer_info = next_buffer;
+
+		staterr = le32_to_cpu(rx_desc->wb.upper.status_error);
 	}
 	rx_ring->next_to_clean = i;
 
@@ -2820,6 +2849,10 @@ static void e1000_setup_rctl(struct e1000_adapter *adapter)
 		break;
 	}
 
+	/* Enable Extended Status in all Receive Descriptors */
+	rfctl = er32(RFCTL);
+	rfctl |= E1000_RFCTL_EXTEN;
+
 	/*
 	 * 82571 and greater support packet-split where the protocol
 	 * header is placed in skb->data and the packet data is
@@ -2845,9 +2878,6 @@ static void e1000_setup_rctl(struct e1000_adapter *adapter)
 	if (adapter->rx_ps_pages) {
 		u32 psrctl = 0;
 
-		/* Configure extra packet-split registers */
-		rfctl = er32(RFCTL);
-		rfctl |= E1000_RFCTL_EXTEN;
 		/*
 		 * disable packet split support for IPv6 extension headers,
 		 * because some malformed IPv6 headers can hang the Rx
@@ -2855,8 +2885,6 @@ static void e1000_setup_rctl(struct e1000_adapter *adapter)
 		rfctl |= (E1000_RFCTL_IPV6_EX_DIS |
 			  E1000_RFCTL_NEW_IPV6_EXT_DIS);
 
-		ew32(RFCTL, rfctl);
-
 		/* Enable Packet split descriptors */
 		rctl |= E1000_RCTL_DTYP_PS;
 
@@ -2879,6 +2907,7 @@ static void e1000_setup_rctl(struct e1000_adapter *adapter)
 		ew32(PSRCTL, psrctl);
 	}
 
+	ew32(RFCTL, rfctl);
 	ew32(RCTL, rctl);
 	/* just started the receive unit, no need to restart */
 	adapter->flags &= ~FLAG_RX_RESTART_NOW;
@@ -2904,11 +2933,11 @@ static void e1000_configure_rx(struct e1000_adapter *adapter)
 		adapter->clean_rx = e1000_clean_rx_irq_ps;
 		adapter->alloc_rx_buf = e1000_alloc_rx_buffers_ps;
 	} else if (adapter->netdev->mtu > ETH_FRAME_LEN + ETH_FCS_LEN) {
-		rdlen = rx_ring->count * sizeof(struct e1000_rx_desc);
+		rdlen = rx_ring->count * sizeof(union e1000_rx_desc_extended);
 		adapter->clean_rx = e1000_clean_jumbo_rx_irq;
 		adapter->alloc_rx_buf = e1000_alloc_jumbo_rx_buffers;
 	} else {
-		rdlen = rx_ring->count * sizeof(struct e1000_rx_desc);
+		rdlen = rx_ring->count * sizeof(union e1000_rx_desc_extended);
 		adapter->clean_rx = e1000_clean_rx_irq;
 		adapter->alloc_rx_buf = e1000_alloc_rx_buffers;
 	}
-- 
1.7.6


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox