Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v3 net-next 21/27] net: add a function to get the next private
From: Veaceslav Falico @ 2013-09-17 13:55 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: netdev, jiri, David S. Miller, Eric Dumazet, Alexander Duyck
In-Reply-To: <1379382601.23881.36.camel@deadeye.wl.decadent.org.uk>

On Tue, Sep 17, 2013 at 02:50:01AM +0100, Ben Hutchings wrote:
>On Tue, 2013-09-17 at 02:46 +0200, Veaceslav Falico wrote:
>> It searches for the provided private and returns the next one. If private
>> is not found or next list element is list head - returns NULL.
>
>This is going to take linear time, which is probably OK for a bond that
>has only a very few devices.  But it would likely be a really bad idea
>for, say, a bridge device that could have tens or hundreds of lower
>devices.  So it's not a generically useful function.

Indeed, you're right. I've tried searching or trying to figure out why
others could need it - with no luck. It's really bonding-specific. And, in
any case, if there will be more users of it - it can be changed in the
future.

>
>I think the bonding driver can implement this:
>
>[...]
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -5055,6 +5055,33 @@ void *netdev_lower_dev_get_private(struct net_device *dev,
>>  }
>>  EXPORT_SYMBOL(netdev_lower_dev_get_private);
>>
>> +/* netdev_lower_dev_get_next_private - return the ->private of the list
>> + *				       element whos ->private == private.
>> + * @dev - device to search
>> + * @private - private pointer to search for.
>> + *
>> + * Returns the next ->private pointer, if ->next is not head and private is
>> + * found.
>> + */
>> +extern void *netdev_lower_dev_get_next_private(struct net_device *dev,
>> +					       void *private)
>> +{
>> +	struct netdev_adjacent *lower;
>> +
>> +	list_for_each_entry(lower, &dev->adj_list.lower, list) {
>> +		if (lower->private == private) {
>> +			lower = list_entry(lower->list.next,
>> +					   struct netdev_adjacent, list);
>> +			if (&lower->list == &dev->adj_list.lower)
>> +				return NULL;
>> +			return lower->private;
>> +		}
>> +	}
>> +
>> +	return NULL;
>> +}
>> +EXPORT_SYMBOL(netdev_lower_dev_get_next_private);
>
>using only the functions already exported:
>
>static void *__bond_next_slave(struct net_device *dev, void *private)
>{
>	struct list_head *iter;
>	struct net_device *lower;
>	bool found = false;
>
>	netdev_for_each_lower_dev(dev, lower, iter) {
>		if (found)
>			return netdev_adjacent_get_private(iter);
>		if (netdev_adjacent_get_private(iter) == private)
>			found = true;
>	}
>
>	return NULL;
>}
>
>(not that I've tested it :-).

Yep, I'll think of something like that and send it in the next version,
dropping the current netdev_ variant.

Thanks a lot!

>
>Ben.
>
>> +
>>  static void dev_change_rx_flags(struct net_device *dev, int flags)
>>  {
>>  	const struct net_device_ops *ops = dev->netdev_ops;
>
>-- 
>Ben Hutchings, Staff Engineer, Solarflare
>Not speaking for my employer; that's the marketing department's job.
>They asked us to note that Solarflare product names are trademarked.
>

^ permalink raw reply

* Re: [PATCH  v3 0/6] ipv6: Do route updating for redirect in ndisc layer
From: Hannes Frederic Sowa @ 2013-09-17 13:52 UTC (permalink / raw)
  To: Duan Jiong
  Cc: Daniel Borkmann, Duan Jiong, davem, netdev,
	linux-sctp@vger.kernel.org
In-Reply-To: <523710CC.1090404@gmail.com>

On Mon, Sep 16, 2013 at 10:08:12PM +0800, Duan Jiong wrote:
> 于 2013/9/16 20:22, Daniel Borkmann 写道:
> > On 09/16/2013 01:47 PM, Duan Jiong wrote:
> >> From: Duan Jiong <duanj.fnst@cn.fujitsu.com>
> >>
> >> the ip6_redirect() could be replaced with
> >> ip6_redirect_no_header(), we could always use ip6_redirect()
> >> for route updating in ndisc layer and use the data of the
> >> redirected header option just for finding the socket to be
> >> notified and then notify user in protocols' err_handler.
> > If I get this right, it seems to me that this patchset actually consists of two
> > different kind of changes:
> >
> > 1) Not notifying user space on ICMP redirects (net material)
> > 2) Simplify code for updating route in ndisc layer instead of error handlers (net-next)
> >
> > Also, you do the *actual* change in the very last patch, which means that from
> > patch 1 to 5 we're in an inconsistent and buggy state unless we also apply patch
> > number 6. It should actually be the other way around, that you first do the actual
> > change and then migrate users (also commit messages are quite terse).
> 
> I make the patch set on net tree, not on net-next. Maybe those
> things should be done in two patch sets.

I have thought about going the other direction. Just make it one patch
and update the whole logic atomically in the git tree. So we won't have
any partial logic updates in the tree.

Please also rebase your series onto net from today, because Daniel's
patch regarding updating sk->sk_err in sctp went in yesterday.

> > Moreover, just looking at the SCTP part (sctp_err_lookup() function) ...
> >
> > /* RFC 4960, Appendix C. ICMP Handling
> >  *
> >  * ICMP6) An implementation MUST validate that the Verification Tag
> >  * contained in the ICMP message matches the Verification Tag of
> >  * the peer.  If the Verification Tag is not 0 and does NOT
> >  * match, discard the ICMP message.  If it is 0 and the ICMP
> >  * message contains enough bytes to verify that the chunk type is
> >  * an INIT chunk and that the Initiate Tag matches the tag of the
> >  * peer, continue with ICMP7.  If the ICMP message is too short
> >  * or the chunk type or the Initiate Tag does not match, silently
> >  * discard the packet.
> >  */
> >  
> > ... it seems to me that we would simply ignore such RFC requirements with
> > your patch for the sctp_v6_err() part.
> >
> > Care to elaborate? ;-)
> 
> Sorry, i didn't notice that.
> 
> According to the RFC requirements, it suggests that we can't update
> route for redirect in ndisc layer before calling into sctp_err_lookup(),
> because we must verify the ICMP Message. Now maybe we update route for
> redirect in ndisc layer is wrong.
> 
> Do you have any idea?

IMO updating of routes in ndisc layer is fine. We already accept redirects
(and also change routes) for sctp connections where the redirect packet does
not contain any tag. Also it is debatable if redirects are counted as
icmp packets or merely just belong to the kind of neighbour discovery
packets which just use icmp framing (and so the sctp rfc would not say
anything about them).

As Daniel already said, it would be better to update the commit message
to clarify the reasons why this is ok (with some pointers to RFCs).

Greetings,

  Hannes

^ permalink raw reply

* Re: Why we discard all rtt samples when only some of the acked skbs have been retransmited in processing ack?
From: Neal Cardwell @ 2013-09-17 13:31 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: LovelyLich, Yuchung Cheng, Netdev
In-Reply-To: <1379394682.29845.2.camel@edumazet-glaptop>

On Tue, Sep 17, 2013 at 1:11 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2013-09-17 at 12:01 +0800, LovelyLich wrote:
>> Hi Eric,
>>
>> In tcp_clean_rtx_queue(), we set the flag FLAG_RETRANS_DATA_ACKED when we
>>
>> encounter one ever retransmited skb A. But if there is one( or more) skb B
>>
>> after this retransmited skb, and we calculate the rtt for skb B. The question
>>
>> is because we have set the flag FLAG_RETRANS_DATA_ACKED, and we will just
>>
>> return in tcp_ack_no_tstamp() !
>>
>> Two questions:
>>
>> 1. if we will just ignore all packets in this ack, we do not need to calculate
>>
>> skb B's rtt sample.
>>
>> 2. what I want to know, even if A's rtt sample is not reliable, but B's rtt
>>
>> sample can be trusted. Why we discard it ?
>>
>>
>>
>> Thanks in advanced.
>>
>
> Good point !
>
> Yuchung, what do you think of following patch ?
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 25a89ea..7f12b96 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2971,7 +2971,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
>         struct sk_buff *skb;
>         u32 now = tcp_time_stamp;
>         int fully_acked = true;
> -       int flag = 0;
> +       int flag = FLAG_RETRANS_DATA_ACKED;
>         u32 pkts_acked = 0;
>         u32 reord = tp->packets_out;
>         u32 prior_sacked = tp->sacked_out;
> @@ -3002,7 +3002,6 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
>                 if (sacked & TCPCB_RETRANS) {
>                         if (sacked & TCPCB_SACKED_RETRANS)
>                                 tp->retrans_out -= acked_pcount;
> -                       flag |= FLAG_RETRANS_DATA_ACKED;
>                 } else {
>                         ca_seq_rtt = now - scb->when;
>                         last_ackt = skb->tstamp;
> @@ -3013,6 +3012,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
>                                 reord = min(pkts_acked, reord);
>                         if (!after(scb->end_seq, tp->high_seq))
>                                 flag |= FLAG_ORIG_SACK_ACKED;
> +                       flag &= ~FLAG_RETRANS_DATA_ACKED;
>                 }
>
>                 if (sacked & TCPCB_SACKED_ACKED)

I think the existing logic is better than the patch. If we get a
cumulative ACK that covers a retransmitted packet, then any RTT sample
we try to extract is suspect, and likely to be at least 2x too high.

Consider the following common scenario:

t=0: send pkts 1, 2, 3, 4
t=1*RTT: receive dupack with SACK for pkts 2,3,4; fast retransmit  pkt 1
t=2*RTT: receive cumulative ack for all pkts through 4

With the existing logic, because the ACK we get at t=2*RTT covers the
retransmitted pkt 1, we do not attempt to take an RTT sample.

With that proposed patch, when we get the ACK at t=2*RTT we see that
there are non-retransmitted pkts 2,3,4 being ACKed, so we clear the
FLAG_RETRANS_DATA_ACKED bit and take an RTT sample of 2*RTT. But this
is 2x too big, and will distort our RTT sample.

Note that with Yuchung's recent patch to gather RTT samples from
SACKed packets (59c9af4234b0c21a1ed05cf65bf014d0c1a67bfd "tcp: measure
RTT from new SACK"), we will already be extracting essentially all the
RTT samples that we possibly can out of such scenarios with
retransmitted packets (for OSes that support SACK, which is basically
everyone). In the example above, Yuchung's new SACK-based RTT scheme
will correctly take RTT samples at t=1*RTT for the SACKed packets.

neal

^ permalink raw reply

* Re: [PATCH v3 net-next 07/27] net: add for_each iterators through neighbour lower link's private
From: Veaceslav Falico @ 2013-09-17 12:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, netdev, jiri, David S. Miller, Eric Dumazet,
	Alexander Duyck
In-Reply-To: <1379418182.3457.2.camel@edumazet-glaptop>

On Tue, Sep 17, 2013 at 04:43:02AM -0700, Eric Dumazet wrote:
>On Tue, 2013-09-17 at 09:36 +0200, Veaceslav Falico wrote:
>> On Tue, Sep 17, 2013 at 02:26:43AM +0100, Ben Hutchings wrote:
>> >On Tue, 2013-09-17 at 02:46 +0200, Veaceslav Falico wrote:
>> >[...]
>> >> --- a/net/core/dev.c
>> >> +++ b/net/core/dev.c
>> >> @@ -4537,6 +4537,72 @@ struct net_device *netdev_all_upper_get_next_dev_rcu(struct net_device *dev,
>> >>  }
>> >>  EXPORT_SYMBOL(netdev_all_upper_get_next_dev_rcu);
>> >>
>> >> +/* netdev_lower_get_next_private - Get the next ->private from the
>> >> + *				   lower neighbour list
>> >[...]
>> >
>> >This is not correct kernel-doc syntax.  You must begin the comment like
>> >this:
>> >
>> >/**
>> > * function_name - summary on one physical line, no wrapping allowed
>>
>> I've thought that netdev specifically requires that type of comments*. But
>> I don't have any strong opinion on that, so if needed - can change easily
>> in the next version.
>>
>> Thanks a lot!
>>
>> *Documentation/networking/netdev-FAQ.txt
>>
>> Q: Someone said that the comment style and coding convention is different
>>     for the networking content.  Is this true?
>>
>> A: Yes, in a largely trivial way.  Instead of this:
>>
>>          /*
>>           * foobar blah blah blah
>>           * another line of text
>>           */
>>
>>     it is requested that you make it look like this:
>>
>>          /* foobar blah blah blah
>>           * another line of text
>>           */
>>
>>
>
>Thats for comments, not kernel-doc sections.
>
>You provided :
>
>+/* netdev_lower_get_next_private - Get the next ->private from the
>+ *                                lower neighbour list
>+ * @dev: device
>+ * @iter: list_head ** of the current position
>+ *
>+ * Gets the next netdev_adjacent->private from the dev's lower neighbour
>+ * list, starting from iter position. The caller must hold either hold the
>+ * RTNL lock or its own locking that guarantees that the neighbour lower
>+ * list will remain unchainged. If iter is NULL - return the first private.
>+ */
>
>Which really looks like a kernel-doc section,
>but misses the proper delimiter which is :

Ok, got it - will fix in the next version.

Thanks all!

>
>/**
>
>Not
>
>/*
>
>
>

^ permalink raw reply

* Re: [PATCH net] tcp: fix RTO calculated from cached RTT
From: Eric Dumazet @ 2013-09-17 12:21 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: David Miller, netdev, Eric Dumazet, Yuchung Cheng
In-Reply-To: <1379382260-29303-1-git-send-email-ncardwell@google.com>

On Mon, 2013-09-16 at 21:44 -0400, Neal Cardwell wrote:
> Commit 1b7fdd2ab5852 ("tcp: do not use cached RTT for RTT estimation")
> did not correctly account for the fact that crtt is the RTT shifted
> left 3 bits. Fix the calculation to consistently reflect this fact.
> 
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> ---
>  net/ipv4/tcp_metrics.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)

Acked-by: Eric Dumazet <edumazet@google.com>

Instead of rto = 3*crtt, we had rto = 10*crtt, which was indeed
suboptimal ;)

^ permalink raw reply

* Re: [PATCH net-next] bridge: change the order of actions in addif/delif
From: Eric Dumazet @ 2013-09-17 11:57 UTC (permalink / raw)
  To: Hong Zhiguo; +Cc: netdev, stephen, davem, vyasevic, Hong Zhiguo
In-Reply-To: <1379403883-16219-1-git-send-email-zhiguohong@tencent.com>

On Tue, 2013-09-17 at 15:44 +0800, Hong Zhiguo wrote:
> From: Hong Zhiguo <zhiguohong@tencent.com>
> 
> netdev_rx_handler_register turns on the bridge traffic. It should
> be the last action of br_add_if, after the installation of fdb and
> stp staff. _Maybe_ traffic coming in before preparation of fdb and
> stp is taken care of, but we could make it easier.
> 
> Vise versa, netdev_rx_handler_unregister actually turns off bridge
> traffic from the dev. It should be the first action of br_del_if,
> before fdb and stp staff is uninstalled.
> 
> Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com>
> ---
>  net/bridge/br_if.c | 20 ++++++++------------
>  1 file changed, 8 insertions(+), 12 deletions(-)
> 
> diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
> index c41d5fb..6544154 100644
> --- a/net/bridge/br_if.c
> +++ b/net/bridge/br_if.c
> @@ -133,6 +133,8 @@ static void del_nbp(struct net_bridge_port *p)
>  
>  	sysfs_remove_link(br->ifobj, p->dev->name);
>  
> +	netdev_rx_handler_unregister(dev);
> +
>  	dev_set_promiscuity(dev, -1);
>  
>  	spin_lock_bh(&br->lock);
> @@ -148,8 +150,6 @@ static void del_nbp(struct net_bridge_port *p)
>  
>  	dev->priv_flags &= ~IFF_BRIDGE_PORT;
>  
> -	netdev_rx_handler_unregister(dev);
> -
>  	netdev_upper_dev_unlink(dev, br->dev);
>  

This was really your first patch, and I 'Acked' it.
(OK, changelog can be different now we fixed the bug)

The following should be part of a different patch

>  	br_multicast_del_port(p);
> @@ -336,8 +336,8 @@ int br_add_if(struct net_bridge *br, struct net_device *dev)
>  	if (dev->netdev_ops->ndo_start_xmit == br_dev_xmit)
>  		return -ELOOP;
>  
> -	/* Device is already being bridged */
> -	if (br_port_exists(dev))
> +	/* Device is already being bridged or registered with other handler */
> +	if (br_port_exists(dev) || dev->rx_handler)
>  		return -EBUSY;

rcu_access_pointer(dev->rx_handler)

Anyway, this is risky, because it assumes netdev_rx_handler_register()
implementation wont change.

Just leave the code as is, I don't think its worth it.

^ permalink raw reply

* Re: [PATCH v3 net-next 07/27] net: add for_each iterators through neighbour lower link's private
From: Eric Dumazet @ 2013-09-17 11:43 UTC (permalink / raw)
  To: Veaceslav Falico
  Cc: Ben Hutchings, netdev, jiri, David S. Miller, Eric Dumazet,
	Alexander Duyck
In-Reply-To: <20130917073618.GC18195@redhat.com>

On Tue, 2013-09-17 at 09:36 +0200, Veaceslav Falico wrote:
> On Tue, Sep 17, 2013 at 02:26:43AM +0100, Ben Hutchings wrote:
> >On Tue, 2013-09-17 at 02:46 +0200, Veaceslav Falico wrote:
> >[...]
> >> --- a/net/core/dev.c
> >> +++ b/net/core/dev.c
> >> @@ -4537,6 +4537,72 @@ struct net_device *netdev_all_upper_get_next_dev_rcu(struct net_device *dev,
> >>  }
> >>  EXPORT_SYMBOL(netdev_all_upper_get_next_dev_rcu);
> >>
> >> +/* netdev_lower_get_next_private - Get the next ->private from the
> >> + *				   lower neighbour list
> >[...]
> >
> >This is not correct kernel-doc syntax.  You must begin the comment like
> >this:
> >
> >/**
> > * function_name - summary on one physical line, no wrapping allowed
> 
> I've thought that netdev specifically requires that type of comments*. But
> I don't have any strong opinion on that, so if needed - can change easily
> in the next version.
> 
> Thanks a lot!
> 
> *Documentation/networking/netdev-FAQ.txt
> 
> Q: Someone said that the comment style and coding convention is different
>     for the networking content.  Is this true?
> 
> A: Yes, in a largely trivial way.  Instead of this:
> 
>          /*
>           * foobar blah blah blah
>           * another line of text
>           */
> 
>     it is requested that you make it look like this:
> 
>          /* foobar blah blah blah
>           * another line of text
>           */
> 
> 

Thats for comments, not kernel-doc sections.

You provided :

+/* netdev_lower_get_next_private - Get the next ->private from the
+ *                                lower neighbour list
+ * @dev: device
+ * @iter: list_head ** of the current position
+ *
+ * Gets the next netdev_adjacent->private from the dev's lower neighbour
+ * list, starting from iter position. The caller must hold either hold the
+ * RTNL lock or its own locking that guarantees that the neighbour lower
+ * list will remain unchainged. If iter is NULL - return the first private.
+ */

Which really looks like a kernel-doc section,
but misses the proper delimiter which is :

/**

Not 

/*

^ permalink raw reply

* [PATCH iproute2] tc: support TCA_STATS_RATE_EST64
From: Eric Dumazet @ 2013-09-17 11:19 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

Since linux-3.11, rate estimator can provide TCA_STATS_RATE_EST64
when rate (bytes per second) is above 2^32 (~34 Mbits)

Change tc to use this attribute for high rates.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 tc/tc_util.c |   26 ++++++++++++++++++++------
 tc/tc_util.h |    4 ++--
 2 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/tc/tc_util.c b/tc/tc_util.c
index 8114c97..be3ed07 100644
--- a/tc/tc_util.c
+++ b/tc/tc_util.c
@@ -171,20 +171,24 @@ int get_rate(unsigned *rate, const char *str)
 	return 0;
 }
 
-void print_rate(char *buf, int len, __u32 rate)
+void print_rate(char *buf, int len, __u64 rate)
 {
 	double tmp = (double)rate*8;
 	extern int use_iec;
 
 	if (use_iec) {
-		if (tmp >= 1000.0*1024.0*1024.0)
+		if (tmp >= 1000.0*1024.0*1024.0*1024.0)
+			snprintf(buf, len, "%.0fGibit", tmp/(1024.0*1024.0*1024.0));
+		else if (tmp >= 1000.0*1024.0*1024.0)
 			snprintf(buf, len, "%.0fMibit", tmp/(1024.0*1024.0));
 		else if (tmp >= 1000.0*1024)
 			snprintf(buf, len, "%.0fKibit", tmp/1024);
 		else
 			snprintf(buf, len, "%.0fbit", tmp);
 	} else {
-		if (tmp >= 1000.0*1000000.0)
+		if (tmp >= 1000.0*1000000000.0)
+			snprintf(buf, len, "%.0fGbit", tmp/1000000000.0);
+		else if (tmp >= 1000.0*1000000.0)
 			snprintf(buf, len, "%.0fMbit", tmp/1000000.0);
 		else if (tmp >= 1000.0 * 1000.0)
 			snprintf(buf, len, "%.0fKbit", tmp/1000.0);
@@ -193,7 +197,7 @@ void print_rate(char *buf, int len, __u32 rate)
 	}
 }
 
-char * sprint_rate(__u32 rate, char *buf)
+char * sprint_rate(__u64 rate, char *buf)
 {
 	print_rate(buf, SPRINT_BSIZE-1, rate);
 	return buf;
@@ -460,9 +464,19 @@ void print_tcstats2_attr(FILE *fp, struct rtattr *rta, char *prefix, struct rtat
 			q.drops, q.overlimits, q.requeues);
 	}
 
-	if (tbs[TCA_STATS_RATE_EST]) {
+	if (tbs[TCA_STATS_RATE_EST64]) {
+		struct gnet_stats_rate_est64 re = {0};
+
+		memcpy(&re, RTA_DATA(tbs[TCA_STATS_RATE_EST64]),
+		       MIN(RTA_PAYLOAD(tbs[TCA_STATS_RATE_EST64]),
+			   sizeof(re)));
+		fprintf(fp, "\n%srate %s %llupps ",
+			prefix, sprint_rate(re.bps, b1), re.pps);
+	} else if (tbs[TCA_STATS_RATE_EST]) {
 		struct gnet_stats_rate_est re = {0};
-		memcpy(&re, RTA_DATA(tbs[TCA_STATS_RATE_EST]), MIN(RTA_PAYLOAD(tbs[TCA_STATS_RATE_EST]), sizeof(re)));
+
+		memcpy(&re, RTA_DATA(tbs[TCA_STATS_RATE_EST]),
+		       MIN(RTA_PAYLOAD(tbs[TCA_STATS_RATE_EST]), sizeof(re)));
 		fprintf(fp, "\n%srate %s %upps ",
 			prefix, sprint_rate(re.bps, b1), re.pps);
 	}
diff --git a/tc/tc_util.h b/tc/tc_util.h
index 4f54436..7c3709f 100644
--- a/tc/tc_util.h
+++ b/tc/tc_util.h
@@ -63,12 +63,12 @@ extern int get_size_and_cell(unsigned *size, int *cell_log, char *str);
 extern int get_time(unsigned *time, const char *str);
 extern int get_linklayer(unsigned *val, const char *arg);
 
-extern void print_rate(char *buf, int len, __u32 rate);
+extern void print_rate(char *buf, int len, __u64 rate);
 extern void print_size(char *buf, int len, __u32 size);
 extern void print_qdisc_handle(char *buf, int len, __u32 h);
 extern void print_time(char *buf, int len, __u32 time);
 extern void print_linklayer(char *buf, int len, unsigned linklayer);
-extern char * sprint_rate(__u32 rate, char *buf);
+extern char * sprint_rate(__u64 rate, char *buf);
 extern char * sprint_size(__u32 size, char *buf);
 extern char * sprint_qdisc_handle(__u32 h, char *buf);
 extern char * sprint_tc_classid(__u32 h, char *buf);

^ permalink raw reply related

* Re: [PATCH net-next] gen_estimator: change the lock order for better perfermance
From: Eric Dumazet @ 2013-09-17 11:11 UTC (permalink / raw)
  To: Hong Zhiguo; +Cc: netdev, davem, stephen, Hong Zhiguo
In-Reply-To: <1379407101-18933-1-git-send-email-zhiguohong@tencent.com>

On Tue, 2013-09-17 at 16:38 +0800, Hong Zhiguo wrote:
> From: Hong Zhiguo <zhiguohong@tencent.com>
> 
> e->stats_lock is usually taken by fast path to update stats.
> In the old order, fast path should wait for write_lock(&est_lock).
> Even though it's only one line inside the write_lock(&est_lock),
> but if there's interrupt or page fault, a lot of spin on
> e->stats_lock will be wasted in fast path.

1) net-next is not open

2) There is no way a page fault can happen in this path.

3) This patch is wrong.

Current order is there for good reasons.

Have you really tried LOCKDEP, before sending a patch changing lock
order ?

^ permalink raw reply

* Re: [CFT][PATCH] net: Delay default_device_exit_batch until no devices are unregistering
From: Eric W. Biederman @ 2013-09-17  9:38 UTC (permalink / raw)
  To: Francesco Ruggeri
  Cc: David S. Miller, Eric Dumazet, Jiri Pirko, Alexander Duyck,
	Cong Wang, netdev
In-Reply-To: <CA+HUmGjg9jE1fjSMgkfVtiWVRD3d2DuwfbPU3Owrk_cTCe0FQA@mail.gmail.com>

Francesco Ruggeri <fruggeri@aristanetworks.com> writes:

> On Mon, Sep 16, 2013 at 8:49 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:

>> Francesco could you take a look at this.  I am about 99% certain this is
>> right but I am starting to fade.  So it is entirely possible I missed
>> something.
>
> Same here ...
> The logic looks right to me and I think it should address the original
> issue I ran into.
> Would it make sense to have netdev_unregistering and
> netdev_unregistering_wait be per-namespace, and have
> default_device_exit_batch only wait for the namespaces in net_list? It
> would require some extra loops and locking, but it may help avoid
> unnecessary waits.

It might be worth it, and it certainly would give a good progress
guarantee, as activity in an exiting network namespace is guaranteed to
wind down.  Progress guarantees are always good to have.

The downside of a per netns unregistering count is that we will have
extra wake ups which will could result in extra contention on the
rtnl_lock.

The wait queue definitely makes sense to be global as there only ever
can be a single waiter, because the netns_wq is single threaded and
everything we are doing is happening with the net_mutex held.

The count doesn't necessarily need to be atomic as it can be protected
by the rtnl_lock.

Like I said the logic is a little rough as I was tired. 

I am heading off to New Orleans for Linux Plumbers Conference in a
couple hours so I don't expect I will be particularly responsive
again until next week.

If you could test this patch perhaps refine it I think we are almost at
a final point of fixing this.

Just to be clear my reason for prefering this approach is that because
it adds no extra wait points (we already wait for the rtnl_lock), the
logic is unconditional and explicit and not hidden in the loopback
device's reference count.  Which should allow anyone reading the code
to discover and understand this guarantee.  Although a big fat comment
in default_device_exit_batch that we are guaranteeing we don't allow
the network namespace to exit while there are still network devices in
it (or something to that effect) is probably appropriate.

Eric

> Francesco
>
>>
>>  net/core/dev.c |   12 ++++++++++++
>>  1 files changed, 12 insertions(+), 0 deletions(-)
>>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 5d702fe..c25e6f3 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -5002,10 +5002,13 @@ static int dev_new_index(struct net *net)
>>
>>  /* Delayed registration/unregisteration */
>>  static LIST_HEAD(net_todo_list);
>> +static atomic_t netdev_unregistering = ATOMIC_INIT(0);
>> +static DECLARE_WAIT_QUEUE_HEAD(netdev_unregistering_wait);
>>
>>  static void net_set_todo(struct net_device *dev)
>>  {
>>         list_add_tail(&dev->todo_list, &net_todo_list);
>> +       atomic_inc(&netdev_unregistering);
>>  }
>>
>>  static void rollback_registered_many(struct list_head *head)
>> @@ -5673,6 +5676,9 @@ void netdev_run_todo(void)
>>                 if (dev->destructor)
>>                         dev->destructor(dev);
>>
>> +               if (atomic_dec_and_test(&netdev_unregistering))
>> +                       wake_up(&netdev_unregistering_wait);
>> +
>>                 /* Free network device */
>>                 kobject_put(&dev->dev.kobj);
>>         }
>> @@ -6369,7 +6375,13 @@ static void __net_exit default_device_exit_batch(struct list_head *net_list)
>>         struct net *net;
>>         LIST_HEAD(dev_kill_list);
>>
>> +retry:
>> +       wait_event(netdev_unregistering_wait, (atomic_read(&netdev_unregistering) == 0));
>>         rtnl_lock();
>> +       if (atomic_read(&netdev_unregistering) != 0) {
>> +               __rtnl_unlock();
>> +               goto retry;
>> +       }
>>         list_for_each_entry(net, net_list, exit_list) {
>>                 for_each_netdev_reverse(net, dev) {
>>                         if (dev->rtnl_link_ops)
>> --
>> 1.7.5.4
>>

^ permalink raw reply

* Re: [PATCH  v3 0/6] ipv6: Do route updating for redirect in ndisc layer
From: Daniel Borkmann @ 2013-09-17  9:00 UTC (permalink / raw)
  To: Duan Jiong; +Cc: Duan Jiong, davem, netdev, hannes, linux-sctp@vger.kernel.org
In-Reply-To: <523710CC.1090404@gmail.com>

On 09/16/2013 04:08 PM, Duan Jiong wrote:
> 于 2013/9/16 20:22, Daniel Borkmann 写道:
>> On 09/16/2013 01:47 PM, Duan Jiong wrote:
>>> From: Duan Jiong <duanj.fnst@cn.fujitsu.com>
>>>
>>> the ip6_redirect() could be replaced with
>>> ip6_redirect_no_header(), we could always use ip6_redirect()
>>> for route updating in ndisc layer and use the data of the
>>> redirected header option just for finding the socket to be
>>> notified and then notify user in protocols' err_handler.
>> If I get this right, it seems to me that this patchset actually consists of two
>> different kind of changes:
>>
>> 1) Not notifying user space on ICMP redirects (net material)
>> 2) Simplify code for updating route in ndisc layer instead of error handlers (net-next)
>>
>> Also, you do the *actual* change in the very last patch, which means that from
>> patch 1 to 5 we're in an inconsistent and buggy state unless we also apply patch
>> number 6. It should actually be the other way around, that you first do the actual
>> change and then migrate users (also commit messages are quite terse).
> 
> I make the patch set on net tree, not on net-next. Maybe those
> things should be done in two patch sets.
> 
>> Moreover, just looking at the SCTP part (sctp_err_lookup() function) ...
>>
>> /* RFC 4960, Appendix C. ICMP Handling
>>   *
>>   * ICMP6) An implementation MUST validate that the Verification Tag
>>   * contained in the ICMP message matches the Verification Tag of
>>   * the peer.  If the Verification Tag is not 0 and does NOT
>>   * match, discard the ICMP message.  If it is 0 and the ICMP
>>   * message contains enough bytes to verify that the chunk type is
>>   * an INIT chunk and that the Initiate Tag matches the tag of the
>>   * peer, continue with ICMP7.  If the ICMP message is too short
>>   * or the chunk type or the Initiate Tag does not match, silently
>>   * discard the packet.
>>   */
>>   
>> ... it seems to me that we would simply ignore such RFC requirements with
>> your patch for the sctp_v6_err() part.
>>
>> Care to elaborate? ;-)
> 
> Sorry, i didn't notice that.
> 
> According to the RFC requirements, it suggests that we can't update
> route for redirect in ndisc layer before calling into sctp_err_lookup(),
> because we must verify the ICMP Message. Now maybe we update route for
> redirect in ndisc layer is wrong.

Looking further into RFC4960 [1] ... it says:

Appendix C.  ICMP Handling

   Whenever an ICMP message is received by an SCTP endpoint, the
   following procedures MUST be followed to ensure proper utilization of
   the information being provided by layer 3.

   ICMP1) An implementation MAY ignore all ICMPv4 messages where the
          type field is not set to "Destination Unreachable".

   ICMP2) An implementation MAY ignore all ICMPv6 messages where the
          type field is not "Destination Unreachable", "Parameter
          Problem", or "Packet Too Big".
   ...

So this basically means that only packets in step 2 are interesting for us here,
we *may* ignore the rest. The verification comes at step 6, so we may have
ignored the packet first. ;-) Therefore, I think your proposal should not be an
issue for SCTP.

I'd however prefer that your patch handles this similarly as in sctp_v4_err(),
so that in the default switch-case, we just go to out_unlock.

In any case, your commit message would need to be more elaborate and reflect
"why" it is okay/safe to change that in SCTP or other cases, preferably with
reference to the RFC. Otherwise, looking at the Git history in future will just be
confusing as it won't clarify why it was reasonable doing so this way.

  [1] http://tools.ietf.org/html/rfc4960#appendix-C

^ permalink raw reply

* [PATCH] USBNET: fix handling padding packet
From: Ming Lei @ 2013-09-17  9:10 UTC (permalink / raw)
  To: David S. Miller, Greg Kroah-Hartman
  Cc: Oliver Neukum, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-usb-u79uwXL29TY76Z2rM5mHXA, Ming Lei, Oliver Neukum

Commit 638c5115a7949(USBNET: support DMA SG) introduces DMA SG
if the usb host controller is capable of building packet from
discontinuous buffers, but missed handling padding packet when
building DMA SG.

This patch attachs the pre-allocated padding packet at the
end of the sg list, so padding packet can be sent to device
if drivers require that.

Reported-by: David Laight <David.Laight-JxhZ9S5GRejQT0dZR+AlfA@public.gmane.org>
Cc: Oliver Neukum <oliver-GvhC2dPhHPQdnm+yROfE0A@public.gmane.org>
Signed-off-by: Ming Lei <ming.lei-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
---
 drivers/net/usb/usbnet.c   |   27 +++++++++++++++++++++------
 include/linux/usb/usbnet.h |    1 +
 2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index 7b331e6..4a9bed3 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -1241,7 +1241,9 @@ static int build_dma_sg(const struct sk_buff *skb, struct urb *urb)
 	if (num_sgs == 1)
 		return 0;
 
-	urb->sg = kmalloc(num_sgs * sizeof(struct scatterlist), GFP_ATOMIC);
+	/* reserve one for zero packet */
+	urb->sg = kmalloc((num_sgs + 1) * sizeof(struct scatterlist),
+			  GFP_ATOMIC);
 	if (!urb->sg)
 		return -ENOMEM;
 
@@ -1305,7 +1307,7 @@ netdev_tx_t usbnet_start_xmit (struct sk_buff *skb,
 		if (build_dma_sg(skb, urb) < 0)
 			goto drop;
 	}
-	entry->length = length = urb->transfer_buffer_length;
+	length = urb->transfer_buffer_length;
 
 	/* don't assume the hardware handles USB_ZERO_PACKET
 	 * NOTE:  strictly conforming cdc-ether devices should expect
@@ -1317,15 +1319,18 @@ netdev_tx_t usbnet_start_xmit (struct sk_buff *skb,
 	if (length % dev->maxpacket == 0) {
 		if (!(info->flags & FLAG_SEND_ZLP)) {
 			if (!(info->flags & FLAG_MULTI_PACKET)) {
-				urb->transfer_buffer_length++;
-				if (skb_tailroom(skb)) {
+				length++;
+				if (skb_tailroom(skb) && !dev->can_dma_sg) {
 					skb->data[skb->len] = 0;
 					__skb_put(skb, 1);
-				}
+				} else if (dev->can_dma_sg)
+					sg_set_buf(&urb->sg[urb->num_sgs++],
+							dev->padding_pkt, 1);
 			}
 		} else
 			urb->transfer_flags |= URB_ZERO_PACKET;
 	}
+	entry->length = urb->transfer_buffer_length = length;
 
 	spin_lock_irqsave(&dev->txq.lock, flags);
 	retval = usb_autopm_get_interface_async(dev->intf);
@@ -1509,6 +1514,7 @@ void usbnet_disconnect (struct usb_interface *intf)
 
 	usb_kill_urb(dev->interrupt);
 	usb_free_urb(dev->interrupt);
+	kfree(dev->padding_pkt);
 
 	free_netdev(net);
 }
@@ -1679,9 +1685,16 @@ usbnet_probe (struct usb_interface *udev, const struct usb_device_id *prod)
 	/* initialize max rx_qlen and tx_qlen */
 	usbnet_update_max_qlen(dev);
 
+	if (dev->can_dma_sg && !(info->flags & FLAG_SEND_ZLP) &&
+		!(info->flags & FLAG_MULTI_PACKET)) {
+		dev->padding_pkt = kzalloc(1, GFP_KERNEL);
+		if (!dev->padding_pkt)
+			goto out4;
+	}
+
 	status = register_netdev (net);
 	if (status)
-		goto out4;
+		goto out5;
 	netif_info(dev, probe, dev->net,
 		   "register '%s' at usb-%s-%s, %s, %pM\n",
 		   udev->dev.driver->name,
@@ -1699,6 +1712,8 @@ usbnet_probe (struct usb_interface *udev, const struct usb_device_id *prod)
 
 	return 0;
 
+out5:
+	kfree(dev->padding_pkt);
 out4:
 	usb_free_urb(dev->interrupt);
 out3:
diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h
index 9cb2fe8..e303eef 100644
--- a/include/linux/usb/usbnet.h
+++ b/include/linux/usb/usbnet.h
@@ -42,6 +42,7 @@ struct usbnet {
 	struct usb_host_endpoint *status;
 	unsigned		maxpacket;
 	struct timer_list	delay;
+	const char		*padding_pkt;
 
 	/* protocol/interface state */
 	struct net_device	*net;
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH] vhost/scsi: use vmalloc for order-10 allocation
From: Asias He @ 2013-09-17  8:55 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, netdev, linux-kernel, virtualization, Dan Aloni
In-Reply-To: <1379401998-5131-1-git-send-email-mst@redhat.com>

On Tue, Sep 17, 2013 at 10:21:07AM +0300, Michael S. Tsirkin wrote:
> As vhost scsi device struct is large, if the device is
> created on a busy system, kzalloc() might fail, so this patch does a
> fallback to vzalloc().
> 
> As vmalloc() adds overhead on data-path, add __GFP_REPEAT
> to kzalloc() flags to do this fallback only when really needed.
> 
> Reported-by: Dan Aloni <alonid@stratoscale.com>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

Reviewed-by: Asias He <asias@redhat.com>

> ---
> 
> I put this on my vhost fixes branch, intend to merge for 3.12.
> Dan, could you please confirm this works for you?
> 
>  drivers/vhost/scsi.c | 41 +++++++++++++++++++++++++++--------------
>  1 file changed, 27 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index 4b79a1f..2c30bb0 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -1373,21 +1373,30 @@ static int vhost_scsi_set_features(struct vhost_scsi *vs, u64 features)
>  	return 0;
>  }
>  
> +static void vhost_scsi_free(struct vhost_scsi *vs)
> +{
> +        if (is_vmalloc_addr(vs))
> +                vfree(vs);
> +        else
> +                kfree(vs);
> +}
> +
>  static int vhost_scsi_open(struct inode *inode, struct file *f)
>  {
>  	struct vhost_scsi *vs;
>  	struct vhost_virtqueue **vqs;
> -	int r, i;
> +	int r = -ENOMEM, i;
>  
> -	vs = kzalloc(sizeof(*vs), GFP_KERNEL);
> -	if (!vs)
> -		return -ENOMEM;
> +        vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
> +	if (!vs) {
> +		vs = vzalloc(sizeof(*vs));
> +		if (!vs)
> +			goto err_vs;
> +	}
>  
>  	vqs = kmalloc(VHOST_SCSI_MAX_VQ * sizeof(*vqs), GFP_KERNEL);
> -	if (!vqs) {
> -		kfree(vs);
> -		return -ENOMEM;
> -	}
> +	if (!vqs)
> +		goto err_vqs;
>  
>  	vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
>  	vhost_work_init(&vs->vs_event_work, tcm_vhost_evt_work);
> @@ -1407,14 +1416,18 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
>  
>  	tcm_vhost_init_inflight(vs, NULL);
>  
> -	if (r < 0) {
> -		kfree(vqs);
> -		kfree(vs);
> -		return r;
> -	}
> +	if (r < 0)
> +		goto err_init;
>  
>  	f->private_data = vs;
>  	return 0;
> +
> +err_init:
> +	kfree(vqs);
> +err_vqs:
> +	vhost_scsi_free(vs);
> +err_vs:
> +	return r;
>  }
>  
>  static int vhost_scsi_release(struct inode *inode, struct file *f)
> @@ -1431,7 +1444,7 @@ static int vhost_scsi_release(struct inode *inode, struct file *f)
>  	/* Jobs can re-queue themselves in evt kick handler. Do extra flush. */
>  	vhost_scsi_flush(vs);
>  	kfree(vs->dev.vqs);
> -	kfree(vs);
> +	vhost_scsi_free(vs);
>  	return 0;
>  }
>  
> -- 
> MST

-- 
Asias

^ permalink raw reply

* RE: [PATCH 8/8 V2] rtlwifi: Remove variable 'noise' from rtl_stats struct
From: David Laight @ 2013-09-17  8:48 UTC (permalink / raw)
  To: Larry Finger, linville; +Cc: linux-wireless, netdev
In-Reply-To: <1379357722-17687-9-git-send-email-Larry.Finger@lwfinger.net>

> diff --git a/drivers/net/wireless/rtlwifi/wifi.h b/drivers/net/wireless/rtlwifi/wifi.h
> index cc03e7c..284ee8d 100644
> --- a/drivers/net/wireless/rtlwifi/wifi.h
> +++ b/drivers/net/wireless/rtlwifi/wifi.h
> @@ -1535,7 +1535,6 @@ struct rtl_stats {
>  	u32 mac_time[2];
>  	s8 rssi;
>  	u8 signal;
> -	u8 noise;
>  	u8 rate;		/* hw desc rate */
>  	u8 received_channel;
>  	u8 control;
> --

Is 'struct rtl_stats' exposed to userspace?
(or even to loadable drivers)
If so you probably ought to replace 'noise' with an explicit
pad in order to avoid changing the offsets of the other fields.

	David

^ permalink raw reply

* RE: [RFC PATCH v2 net-next 0/2] BPF and OVS extensions
From: David Laight @ 2013-09-17  8:40 UTC (permalink / raw)
  To: Alexei Starovoitov, David S. Miller, netdev, Eric Dumazet,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Daniel Borkmann, Paul E. McKenney, Xi Wang, David Howells,
	Cong Wang, Jesse Gross, Pravin B Shelar, Ben Pfaff, Thomas Graf,
	dev
In-Reply-To: <1379386119-4157-1-git-send-email-ast@plumgrid.com>

> Patch 1/2: generic BPF extension
> Original A and X 32-bit BPF registers are replaced with ten 64-bit registers.
> bpf opcode encoding kept the same. load/store were generalized to access stack,
> bpf_tables and bpf_context.
> BPF program interfaces to outside world via tables that it can read and write,
> and via bpf_context which is in/out blob of data.
> Other kernel components can provide callbacks to tailor BPF to specific needs.

As has been recently pointed out on some of the NetBSD lists
one of the points about BPF is that the filters are deterministic
and easily proven to both terminate and have no unwanted side effects.

The functionality you are proposing breaks both of these assumptions.

It is understood that parsing IPv6 headers is currently impossible
(needs loops of some form) but a specific helper instruction is
probably a better solution.

What you seem to be proposing is a completely different beast,
so you should call it something else - it isn't BPF.

Whether the Linux kernel needs a (another?) byte code interpreter
is a separate issue.

	David

^ permalink raw reply

* [PATCH net-next] gen_estimator: change the lock order for better perfermance
From: Hong Zhiguo @ 2013-09-17  8:38 UTC (permalink / raw)
  To: netdev; +Cc: davem, stephen, Hong Zhiguo

From: Hong Zhiguo <zhiguohong@tencent.com>

e->stats_lock is usually taken by fast path to update stats.
In the old order, fast path should wait for write_lock(&est_lock).
Even though it's only one line inside the write_lock(&est_lock),
but if there's interrupt or page fault, a lot of spin on
e->stats_lock will be wasted in fast path.

Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com>
---
 net/core/gen_estimator.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
index 6b5b6e7..bad7f5f 100644
--- a/net/core/gen_estimator.c
+++ b/net/core/gen_estimator.c
@@ -120,11 +120,11 @@ static void est_timer(unsigned long arg)
 		u32 npackets;
 		u32 rate;
 
-		spin_lock(e->stats_lock);
 		read_lock(&est_lock);
 		if (e->bstats == NULL)
 			goto skip;
 
+		spin_lock(e->stats_lock);
 		nbytes = e->bstats->bytes;
 		npackets = e->bstats->packets;
 		brate = (nbytes - e->last_bytes)<<(7 - idx);
@@ -136,9 +136,9 @@ static void est_timer(unsigned long arg)
 		e->last_packets = npackets;
 		e->avpps += (rate >> e->ewma_log) - (e->avpps >> e->ewma_log);
 		e->rate_est->pps = (e->avpps+0x1FF)>>10;
+		spin_unlock(e->stats_lock);
 skip:
 		read_unlock(&est_lock);
-		spin_unlock(e->stats_lock);
 	}
 
 	if (!list_empty(&elist[idx].list))
-- 
1.8.1.2

^ permalink raw reply related

* Re: [PATCH net 0/4] bridge: Fix problems around the PVID
From: Toshiaki Makita @ 2013-09-17  8:12 UTC (permalink / raw)
  To: vyasevic
  Cc: Toshiaki Makita, David Miller, netdev, Fernando Luis Vazquez Cao,
	Patrick McHardy
In-Reply-To: <523744A6.5090001@redhat.com>

On Mon, 2013-09-16 at 13:49 -0400, Vlad Yasevich wrote:
> On 09/13/2013 08:06 AM, Toshiaki Makita wrote:
> > On Thu, 2013-09-12 at 16:00 -0400, David Miller wrote:
> >> From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> >> Date: Tue, 10 Sep 2013 19:27:54 +0900
> >>
> >>> There seem to be some undesirable behaviors related with PVID.
> >>> 1. It has no effect assigning PVID to a port. PVID cannot be applied
> >>> to any frame regardless of whether we set it or not.
> >>> 2. FDB entries learned via frames applied PVID are registered with
> >>> VID 0 rather than VID value of PVID.
> >>> 3. We can set 0 or 4095 as a PVID that are not allowed in IEEE 802.1Q.
> >>> This leads interoperational problems such as sending frames with VID
> >>> 4095, which is not allowed in IEEE 802.1Q, and treating frames with VID
> >>> 0 as they belong to VLAN 0, which is expected to be handled as they have
> >>> no VID according to IEEE 802.1Q.
> >>>
> >>> Note: 2nd and 3rd problems are potential and not exposed unless 1st problem
> >>> is fixed, because we cannot activate PVID due to it.
> >>
> >> Please work out the issues in patch #2 with Vlad and resubmit this
> >> series.
> >>
> >> Thank you.
> >
> > I'm hovering between whether we should fix the issue by changing vlan 0
> > interface behavior in 8021q module or enabling a bridge port to sending
> > priority-tagged frames, or another better way.
> >
> > If you could comment it, I'd appreciate it :)
> >
> >
> > BTW, I think what is discussed in patch #2 is another problem about
> > handling priority-tags, and it exists without this patch set applied.
> > It looks like that we should prepare another patch set than this to fix
> > that problem.
> >
> > Should I include patches that fix the priority-tags problem in this
> > patch set and resubmit them all together?
> >
> 
> I am thinking that we might need to do it in bridge and it looks like
> the simplest way to do it is to have default priority regeneration table
> (table 6-5 from 802.1Q doc).
> 
> That way I think we would conform to the spec.
> 
> -vlad

Unfortunately I don't think the default priority regeneration table
resolves the problem because IEEE 802.1Q says that a VLAN-aware bridge
can transmit untagged or VLAN-tagged frames only (the end of section 7.5
and 8.1.7).

No mechanism to send priority-tagged frames is found as far as I can see
the standard. I think the regenerated priority is used for outgoing PCP
field only if egress policy is not untagged (i.e. transmitting as
VLAN-tagged), and unused if untagged (Section 6.9.2 3rd/4th Paragraph).

If we want to transmit priority-tagged frames from a bridge port, I
think we need to implement a new (optional) feature that is above the
standard, as I stated previously.

How do you feel about adding a per-port policy that enables a bridge to
send priority-tagged frames instead of untagged frames when egress
policy for the port is untagged?
With this change, we can transmit frames for a given vlan as either all
untagged, all priority-tagged or all VLAN-tagged.

Thanks,

Toshiaki Makita

> 
> >
> > Thanks,
> >
> > Toshiaki Makita
> >
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe netdev" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >
> >

^ permalink raw reply

* [PATCH net-next] bridge: change the order of actions in addif/delif
From: Hong Zhiguo @ 2013-09-17  7:44 UTC (permalink / raw)
  To: netdev; +Cc: stephen, davem, eric.dumazet, vyasevic, Hong Zhiguo

From: Hong Zhiguo <zhiguohong@tencent.com>

netdev_rx_handler_register turns on the bridge traffic. It should
be the last action of br_add_if, after the installation of fdb and
stp staff. _Maybe_ traffic coming in before preparation of fdb and
stp is taken care of, but we could make it easier.

Vise versa, netdev_rx_handler_unregister actually turns off bridge
traffic from the dev. It should be the first action of br_del_if,
before fdb and stp staff is uninstalled.

Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com>
---
 net/bridge/br_if.c | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index c41d5fb..6544154 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -133,6 +133,8 @@ static void del_nbp(struct net_bridge_port *p)
 
 	sysfs_remove_link(br->ifobj, p->dev->name);
 
+	netdev_rx_handler_unregister(dev);
+
 	dev_set_promiscuity(dev, -1);
 
 	spin_lock_bh(&br->lock);
@@ -148,8 +150,6 @@ static void del_nbp(struct net_bridge_port *p)
 
 	dev->priv_flags &= ~IFF_BRIDGE_PORT;
 
-	netdev_rx_handler_unregister(dev);
-
 	netdev_upper_dev_unlink(dev, br->dev);
 
 	br_multicast_del_port(p);
@@ -336,8 +336,8 @@ int br_add_if(struct net_bridge *br, struct net_device *dev)
 	if (dev->netdev_ops->ndo_start_xmit == br_dev_xmit)
 		return -ELOOP;
 
-	/* Device is already being bridged */
-	if (br_port_exists(dev))
+	/* Device is already being bridged or registered with other handler */
+	if (br_port_exists(dev) || dev->rx_handler)
 		return -EBUSY;
 
 	/* No bridging devices that dislike that (e.g. wireless) */
@@ -371,10 +371,6 @@ int br_add_if(struct net_bridge *br, struct net_device *dev)
 	if (err)
 		goto err4;
 
-	err = netdev_rx_handler_register(dev, br_handle_frame, p);
-	if (err)
-		goto err5;
-
 	dev->priv_flags |= IFF_BRIDGE_PORT;
 
 	dev_disable_lro(dev);
@@ -394,8 +390,6 @@ int br_add_if(struct net_bridge *br, struct net_device *dev)
 		br_stp_enable_port(p);
 	spin_unlock_bh(&br->lock);
 
-	br_ifinfo_notify(RTM_NEWLINK, p);
-
 	if (changed_addr)
 		call_netdevice_notifiers(NETDEV_CHANGEADDR, br->dev);
 
@@ -404,12 +398,14 @@ int br_add_if(struct net_bridge *br, struct net_device *dev)
 	if (br_fdb_insert(br, p, dev->dev_addr, 0))
 		netdev_err(dev, "failed insert local address bridge forwarding table\n");
 
+	/* rx_handler is already tested, no fail here */
+	netdev_rx_handler_register(dev, br_handle_frame, p);
+
+	br_ifinfo_notify(RTM_NEWLINK, p);
 	kobject_uevent(&p->kobj, KOBJ_ADD);
 
 	return 0;
 
-err5:
-	netdev_upper_dev_unlink(dev, br->dev);
 err4:
 	br_netpoll_disable(p);
 err3:
-- 
1.8.1.2

^ permalink raw reply related

* Re: [PATCHv3 linux-next] hrtimer: Add notifier when clock_was_set was called
From: Fan Du @ 2013-09-17  7:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Miller, herbert, steffen.klassert, dborkman, linux-kernel,
	netdev, John Stultz
In-Reply-To: <alpine.DEB.2.02.1309161059220.4089@ionos.tec.linutronix.de>



On 2013年09月16日 17:01, Thomas Gleixner wrote:
> On Mon, 16 Sep 2013, Fan Du wrote:
>> On 2013年09月13日 22:32, Thomas Gleixner wrote:
>>> On Fri, 13 Sep 2013, Fan Du wrote:
>>>> (2) What I have been bugging you around here for this long time is really
>>>> the
>>>> second
>>>>       problem, I'm sorry I didn't make it clearly to you and others, which
>>>> is
>>>> below:
>>>>
>>>>       Why using wall clock time to calculate soft/hard IPsec events when
>>>> xfrm_state timer
>>>>       out happens in its timeout handler? Because even if xfrm_state using
>>>> CLOCK_BOOTTIME,
>>>>       system wall clock time changing will surely disturb soft/hard IPsec
>>>> events, which
>>>>       you raised your concern about in (*a*).
>>>
>>> No CLOCK_BOOTTIME is not affected by wall clock time changes. It's
>>> basically CLOCK_MONOTONIC, it just keeps counting the suspend time as
>>> well. So without a suspend/resume cycle CLOCK_MONOTONIC and
>>> CLOCK_BOOTTIME are the same. After a suspend/resume cycle
>>> CLOCK_BOOTTIME will be N seconds ahead of CLOCK_MONOTONIC, where N is
>>> the number of seconds spent in suspend.
>>
>> Sorry for the ambiguity. Yes, CLOCK_BOOTTIME is monotonic *and* counting
>> suspend/resume time as well as not affected by wall clock time change.
>>
>> Using CLOCK_BOOTTIME guarantees b- will timeout in a right manner, i.e., not
>> affected by wall clock time change, as well as keep the timer valid when
>> suspend/resume.
>>
>> A classic way of using timer is:
>>   a- Arm a timer with specified interval
>>   b- Wait for the timer to timeout
>>   c- After the timeout, notify the event to other place in the timer handler.
>>
>> IPsec key lifetime timer does its this way:
>>   a- Arm a timer with specified interval
>>   b- Wait for the timer to timeout
>>   c- After timeout, in the timer handler, using wall clock time to calculate
>>     which kind of event to notify user(soft or hard for both key use time,
>>     and key created time). So the problem arises at this stage if wall clock
>>     time changed.
>
> And why on earth must it use wall clock time for selecting the event
> type?

/*shivering... sorry to bother you again.*/

Yep, it's a bit twisted. This parts of codes come a long way from v2.5.67
Beyond this fact, I barely know its original design goal by doing so.
The first version of this patch is to remove the wall clock time things
by myself, it's a pity that the feedback is not very welcome at the end.
So later on adding notifier at clock_was_set is suggested by David.

> Thanks,
>
> 	tglx

-- 
浮沉随浪只记今朝笑

--fan

^ permalink raw reply

* Re: [PATCH v3 net-next 07/27] net: add for_each iterators through neighbour lower link's private
From: Veaceslav Falico @ 2013-09-17  7:36 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: netdev, jiri, David S. Miller, Eric Dumazet, Alexander Duyck
In-Reply-To: <1379381203.23881.22.camel@deadeye.wl.decadent.org.uk>

On Tue, Sep 17, 2013 at 02:26:43AM +0100, Ben Hutchings wrote:
>On Tue, 2013-09-17 at 02:46 +0200, Veaceslav Falico wrote:
>[...]
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -4537,6 +4537,72 @@ struct net_device *netdev_all_upper_get_next_dev_rcu(struct net_device *dev,
>>  }
>>  EXPORT_SYMBOL(netdev_all_upper_get_next_dev_rcu);
>>
>> +/* netdev_lower_get_next_private - Get the next ->private from the
>> + *				   lower neighbour list
>[...]
>
>This is not correct kernel-doc syntax.  You must begin the comment like
>this:
>
>/**
> * function_name - summary on one physical line, no wrapping allowed

I've thought that netdev specifically requires that type of comments*. But
I don't have any strong opinion on that, so if needed - can change easily
in the next version.

Thanks a lot!

*Documentation/networking/netdev-FAQ.txt

Q: Someone said that the comment style and coding convention is different
    for the networking content.  Is this true?

A: Yes, in a largely trivial way.  Instead of this:

         /*
          * foobar blah blah blah
          * another line of text
          */

    it is requested that you make it look like this:

         /* foobar blah blah blah
          * another line of text
          */

>
>Ben.
>
>-- 
>Ben Hutchings, Staff Engineer, Solarflare
>Not speaking for my employer; that's the marketing department's job.
>They asked us to note that Solarflare product names are trademarked.
>

^ permalink raw reply

* Re: [PATCH v3 net-next 02/27] net: add RCU variant to search for netdev_adjacent link
From: Veaceslav Falico @ 2013-09-17  7:34 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: netdev, jiri, David S. Miller, Eric Dumazet, Alexander Duyck,
	Cong Wang
In-Reply-To: <1379380998.23881.20.camel@deadeye.wl.decadent.org.uk>

On Tue, Sep 17, 2013 at 02:23:18AM +0100, Ben Hutchings wrote:
>On Tue, 2013-09-17 at 02:46 +0200, Veaceslav Falico wrote:
>> Currently we have only the RTNL flavour, however we can traverse it while
>> holding only RCU, so add the RCU search. Add only one function that will be
>> used further, other functions can be added easily afterwards, if anyone
>> would need them.
>[...]
>> +static struct netdev_adjacent *__netdev_find_adj_rcu(struct net_device *dev,
>> +						     struct net_device *adj_dev,
>> +						     bool upper, bool neighbour)
>> +{
>> +	struct netdev_adjacent *adj;
>> +	struct list_head *adj_list;
>> +
>> +	if (neighbour)
>> +		adj_list = upper ? &dev->adj_list.upper :
>> +				   &dev->adj_list.lower;
>> +	else
>> +		adj_list = upper ? &dev->all_adj_list.upper :
>> +				   &dev->all_adj_list.lower;
>
>I think it would be clearer to make adj_list a parameter, rather than
>taking the dev and a pair of booleans.

Hm, true, indeed, will try to change in the next version.

Thank you!

>
>Ben.
>
>> +	list_for_each_entry_rcu(adj, adj_list, list) {
>> +		if (adj->dev == adj_dev)
>> +			return adj;
>> +	}
>> +	return NULL;
>> +}
>> +
>> +static struct netdev_adjacent *__netdev_lower_find_rcu(struct net_device *dev,
>> +							struct net_device *ldev)
>> +{
>> +	return __netdev_find_adj_rcu(dev, ldev, false, true);
>> +}
>> +
>>  static struct netdev_adjacent *__netdev_find_adj(struct net_device *dev,
>>  						 struct net_device *adj_dev,
>>  						 bool upper, bool neighbour)
>
>-- 
>Ben Hutchings, Staff Engineer, Solarflare
>Not speaking for my employer; that's the marketing department's job.
>They asked us to note that Solarflare product names are trademarked.
>

^ permalink raw reply

* Re: [PATCH net-next v4] ptp: add the PTP_SYS_OFFSET ioctl to the testptp program
From: Dong Zhu @ 2013-09-17  7:32 UTC (permalink / raw)
  To: Richard Cochran; +Cc: Rob Landley, netdev, linux-doc, linux-kernel
In-Reply-To: <20130915084811.GA10163@netboy>

Hi Richard,

I developed a new patch and added a method to estimate the time offset
between system and phc clock.

Could you help reviewing it again ? If it do make sense I hope it could
be accepted.

Thanks !

>From e524e3b68f3df3cd91acd814490d092bad05386b Mon Sep 17 00:00:00 2001

This patch add a method into testptp.c to measure the time offset
between phc and system clock through the ioctl PTP_SYS_OFFSET.

Signed-off-by: Dong Zhu <bluezhudong@gmail.com>
---
 Documentation/ptp/testptp.c | 65 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 63 insertions(+), 2 deletions(-)

diff --git a/Documentation/ptp/testptp.c b/Documentation/ptp/testptp.c
index f59ded0..a74d0a8 100644
--- a/Documentation/ptp/testptp.c
+++ b/Documentation/ptp/testptp.c
@@ -100,6 +100,11 @@ static long ppb_to_scaled_ppm(int ppb)
 	return (long) (ppb * 65.536);
 }
 
+static int64_t pctns(struct ptp_clock_time *t)
+{
+	return t->sec * 1000000000LL + t->nsec;
+}
+
 static void usage(char *progname)
 {
 	fprintf(stderr,
@@ -112,6 +117,8 @@ static void usage(char *progname)
 		" -f val     adjust the ptp clock frequency by 'val' ppb\n"
 		" -g         get the ptp clock time\n"
 		" -h         prints this message\n"
+		" -k val     measure the time offset between system and phc clock\n"
+		"            for 'val' times (Maximum 25)\n"
 		" -p val     enable output with a period of 'val' nanoseconds\n"
 		" -P val     enable or disable (val=1|0) the system clock PPS\n"
 		" -s         set the ptp clock time from the system time\n"
@@ -133,8 +140,12 @@ int main(int argc, char *argv[])
 	struct itimerspec timeout;
 	struct sigevent sigevent;
 
+	struct ptp_clock_time *pct;
+	struct ptp_sys_offset *sysoff;
+
+
 	char *progname;
-	int c, cnt, fd;
+	int i, c, cnt, fd;
 
 	char *device = DEVICE;
 	clockid_t clkid;
@@ -144,14 +155,19 @@ int main(int argc, char *argv[])
 	int extts = 0;
 	int gettime = 0;
 	int oneshot = 0;
+	int pct_offset = 0;
+	int n_samples = 0;
 	int periodic = 0;
 	int perout = -1;
 	int pps = -1;
 	int settime = 0;
 
+	int64_t t1, t2, tp;
+	int64_t interval, offset;
+
 	progname = strrchr(argv[0], '/');
 	progname = progname ? 1+progname : argv[0];
-	while (EOF != (c = getopt(argc, argv, "a:A:cd:e:f:ghp:P:sSt:v"))) {
+	while (EOF != (c = getopt(argc, argv, "a:A:cd:e:f:ghk:p:P:sSt:v"))) {
 		switch (c) {
 		case 'a':
 			oneshot = atoi(optarg);
@@ -174,6 +190,10 @@ int main(int argc, char *argv[])
 		case 'g':
 			gettime = 1;
 			break;
+		case 'k':
+			pct_offset = 1;
+			n_samples = atoi(optarg);
+			break;
 		case 'p':
 			perout = atoi(optarg);
 			break;
@@ -376,6 +396,47 @@ int main(int argc, char *argv[])
 		}
 	}
 
+	if (pct_offset) {
+		if (n_samples <= 0 || n_samples > 25) {
+			puts("n_samples should be between 1 and 25");
+			usage(progname);
+			return -1;
+		}
+
+		sysoff = calloc(1, sizeof(*sysoff));
+		if (!sysoff) {
+			perror("calloc");
+			return -1;
+		}
+		sysoff->n_samples = n_samples;
+
+		if (ioctl(fd, PTP_SYS_OFFSET, sysoff))
+			perror("PTP_SYS_OFFSET");
+		else
+			puts("system and phc clock time offset request okay");
+
+		pct = &sysoff->ts[0];
+		for (i = 0; i < sysoff->n_samples; i++) {
+			t1 = pctns(pct+2*i);
+			tp = pctns(pct+2*i+1);
+			t2 = pctns(pct+2*i+2);
+			interval = t2 - t1;
+			offset = (t2 + t1) / 2 - tp;
+
+			printf("system time: %ld.%ld\n",
+				(pct+2*i)->sec, (pct+2*i)->nsec);
+			printf("phc    time: %ld.%ld\n",
+				(pct+2*i+1)->sec, (pct+2*i+1)->nsec);
+			printf("system time: %ld.%ld\n",
+				(pct+2*i+2)->sec, (pct+2*i+2)->nsec);
+			printf("system/phc clock time offset is %ld ns\n"
+				"system     clock time delay  is %ld ns\n",
+				offset, interval);
+		}
+
+		free(sysoff);
+	}
+
 	close(fd);
 	return 0;
 }
-- 
1.7.11.7


-- 
Best Regards,
Dong Zhu

^ permalink raw reply related

* [PATCH] vhost/scsi: use vmalloc for order-10 allocation
From: Michael S. Tsirkin @ 2013-09-17  7:21 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, netdev, virtualization, Dan Aloni

As vhost scsi device struct is large, if the device is
created on a busy system, kzalloc() might fail, so this patch does a
fallback to vzalloc().

As vmalloc() adds overhead on data-path, add __GFP_REPEAT
to kzalloc() flags to do this fallback only when really needed.

Reported-by: Dan Aloni <alonid@stratoscale.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---

I put this on my vhost fixes branch, intend to merge for 3.12.
Dan, could you please confirm this works for you?

 drivers/vhost/scsi.c | 41 +++++++++++++++++++++++++++--------------
 1 file changed, 27 insertions(+), 14 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 4b79a1f..2c30bb0 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1373,21 +1373,30 @@ static int vhost_scsi_set_features(struct vhost_scsi *vs, u64 features)
 	return 0;
 }
 
+static void vhost_scsi_free(struct vhost_scsi *vs)
+{
+        if (is_vmalloc_addr(vs))
+                vfree(vs);
+        else
+                kfree(vs);
+}
+
 static int vhost_scsi_open(struct inode *inode, struct file *f)
 {
 	struct vhost_scsi *vs;
 	struct vhost_virtqueue **vqs;
-	int r, i;
+	int r = -ENOMEM, i;
 
-	vs = kzalloc(sizeof(*vs), GFP_KERNEL);
-	if (!vs)
-		return -ENOMEM;
+        vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	if (!vs) {
+		vs = vzalloc(sizeof(*vs));
+		if (!vs)
+			goto err_vs;
+	}
 
 	vqs = kmalloc(VHOST_SCSI_MAX_VQ * sizeof(*vqs), GFP_KERNEL);
-	if (!vqs) {
-		kfree(vs);
-		return -ENOMEM;
-	}
+	if (!vqs)
+		goto err_vqs;
 
 	vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
 	vhost_work_init(&vs->vs_event_work, tcm_vhost_evt_work);
@@ -1407,14 +1416,18 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 
 	tcm_vhost_init_inflight(vs, NULL);
 
-	if (r < 0) {
-		kfree(vqs);
-		kfree(vs);
-		return r;
-	}
+	if (r < 0)
+		goto err_init;
 
 	f->private_data = vs;
 	return 0;
+
+err_init:
+	kfree(vqs);
+err_vqs:
+	vhost_scsi_free(vs);
+err_vs:
+	return r;
 }
 
 static int vhost_scsi_release(struct inode *inode, struct file *f)
@@ -1431,7 +1444,7 @@ static int vhost_scsi_release(struct inode *inode, struct file *f)
 	/* Jobs can re-queue themselves in evt kick handler. Do extra flush. */
 	vhost_scsi_flush(vs);
 	kfree(vs->dev.vqs);
-	kfree(vs);
+	vhost_scsi_free(vs);
 	return 0;
 }
 
-- 
MST

^ permalink raw reply related

* [PATCHv2 net] xfrm: Guard IPsec anti replay window against replay bitmap
From: Fan Du @ 2013-09-17  7:14 UTC (permalink / raw)
  To: steffen.klassert; +Cc: davem, netdev

For legacy IPsec anti replay mechanism:

bitmap in struct xfrm_replay_state could only provide a 32 bits
window size limit in current design, thus user level parameter
sadb_sa_replay should honor this limit, otherwise misleading
outputs("replay=244") by setkey -D will be:

192.168.25.2 192.168.22.2
	esp mode=transport spi=147561170(0x08cb9ad2) reqid=0(0x00000000)
	E: aes-cbc  9a8d7468 7655cf0b 719d27be b0ddaac2
	A: hmac-sha1  2d2115c2 ebf7c126 1c54f186 3b139b58 264a7331
	seq=0x00000000 replay=244 flags=0x00000000 state=mature
	created: Sep 17 14:00:00 2013	current: Sep 17 14:00:22 2013
	diff: 22(s)	hard: 30(s)	soft: 26(s)
	last: Sep 17 14:00:00 2013	hard: 0(s)	soft: 0(s)
	current: 1408(bytes)	hard: 0(bytes)	soft: 0(bytes)
	allocated: 22	hard: 0	soft: 0
	sadb_seq=1 pid=4854 refcnt=0
192.168.22.2 192.168.25.2
	esp mode=transport spi=255302123(0x0f3799eb) reqid=0(0x00000000)
	E: aes-cbc  6485d990 f61a6bd5 e5660252 608ad282
	A: hmac-sha1  0cca811a eb4fa893 c47ae56c 98f6e413 87379a88
	seq=0x00000000 replay=244 flags=0x00000000 state=mature
	created: Sep 17 14:00:00 2013	current: Sep 17 14:00:22 2013
	diff: 22(s)	hard: 30(s)	soft: 26(s)
	last: Sep 17 14:00:00 2013	hard: 0(s)	soft: 0(s)
	current: 1408(bytes)	hard: 0(bytes)	soft: 0(bytes)
	allocated: 22	hard: 0	soft: 0
	sadb_seq=0 pid=4854 refcnt=0

And also, optimizing xfrm_replay_check window checking by setting the
desirable x->props.replay_window with only doing the comparison once
for all when xfrm_state is first born.

Signed-off-by: Fan Du <fan.du@windriver.com>
---
v2:
   Update state replay_window created through netlink advised by Steffen.

---
 net/key/af_key.c       |    3 ++-
 net/xfrm/xfrm_replay.c |    3 +--
 net/xfrm/xfrm_user.c   |    3 ++-
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/key/af_key.c b/net/key/af_key.c
index 9d58537..911ef03 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -1098,7 +1098,8 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net,
 
 	x->id.proto = proto;
 	x->id.spi = sa->sadb_sa_spi;
-	x->props.replay_window = sa->sadb_sa_replay;
+	x->props.replay_window = min_t(unsigned int, sa->sadb_sa_replay,
+					(sizeof(x->replay.bitmap) * 8));
 	if (sa->sadb_sa_flags & SADB_SAFLAGS_NOECN)
 		x->props.flags |= XFRM_STATE_NOECN;
 	if (sa->sadb_sa_flags & SADB_SAFLAGS_DECAP_DSCP)
diff --git a/net/xfrm/xfrm_replay.c b/net/xfrm/xfrm_replay.c
index 8dafe6d3..eeca388 100644
--- a/net/xfrm/xfrm_replay.c
+++ b/net/xfrm/xfrm_replay.c
@@ -129,8 +129,7 @@ static int xfrm_replay_check(struct xfrm_state *x,
 		return 0;
 
 	diff = x->replay.seq - seq;
-	if (diff >= min_t(unsigned int, x->props.replay_window,
-			  sizeof(x->replay.bitmap) * 8)) {
+	if (diff >= x->props.replay_window) {
 		x->stats.replay_window++;
 		goto err;
 	}
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 3f565e4..56d3e45 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -446,7 +446,8 @@ static void copy_from_user_state(struct xfrm_state *x, struct xfrm_usersa_info *
 	memcpy(&x->sel, &p->sel, sizeof(x->sel));
 	memcpy(&x->lft, &p->lft, sizeof(x->lft));
 	x->props.mode = p->mode;
-	x->props.replay_window = p->replay_window;
+	x->props.replay_window = min_t(unsigned int, p->replay_window,
+					sizeof(x->replay.bitmap) * 8);
 	x->props.reqid = p->reqid;
 	x->props.family = p->family;
 	memcpy(&x->props.saddr, &p->saddr, sizeof(x->props.saddr));
-- 
1.7.9.5

^ permalink raw reply related

* Re: [PATCH net] xfrm: Guard IPsec anti replay window against replay bitmap
From: Fan Du @ 2013-09-17  7:12 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: davem, netdev
In-Reply-To: <20130917065647.GO7660@secunet.com>



On 2013年09月17日 14:56, Steffen Klassert wrote:
> On Tue, Sep 17, 2013 at 02:26:05PM +0800, Fan Du wrote:
>>
>> diff --git a/net/key/af_key.c b/net/key/af_key.c
>> index 9d58537..911ef03 100644
>> --- a/net/key/af_key.c
>> +++ b/net/key/af_key.c
>> @@ -1098,7 +1098,8 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net,
>>
>>   	x->id.proto = proto;
>>   	x->id.spi = sa->sadb_sa_spi;
>> -	x->props.replay_window = sa->sadb_sa_replay;
>> +	x->props.replay_window = min_t(unsigned int, sa->sadb_sa_replay,
>> +					(sizeof(x->replay.bitmap) * 8));
>>   	if (sa->sadb_sa_flags&  SADB_SAFLAGS_NOECN)
>>   		x->props.flags |= XFRM_STATE_NOECN;
>>   	if (sa->sadb_sa_flags&  SADB_SAFLAGS_DECAP_DSCP)
>> diff --git a/net/xfrm/xfrm_replay.c b/net/xfrm/xfrm_replay.c
>> index 8dafe6d3..eeca388 100644
>> --- a/net/xfrm/xfrm_replay.c
>> +++ b/net/xfrm/xfrm_replay.c
>> @@ -129,8 +129,7 @@ static int xfrm_replay_check(struct xfrm_state *x,
>>   		return 0;
>>
>>   	diff = x->replay.seq - seq;
>> -	if (diff>= min_t(unsigned int, x->props.replay_window,
>> -			  sizeof(x->replay.bitmap) * 8)) {
>> +	if (diff>= x->props.replay_window) {
>
> So x->props.replay_window will be valid if the state was added with the
> pfkey interface, but what if the netlink interface was used? You should
> also update the netlink part to always hold a valid replay window.
>

Smell positively, v2 in seconds。。。

Thanks, Steffen.


-- 
浮沉随浪只记今朝笑

--fan

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox