Netdev List

Netdev List
 help / color / mirror / Atom feed

* [patch net-next V3 0/4] net: allow to change carrier from userspace
From: Jiri Pirko @ 2012-12-19  9:35 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, bhutchings, mirqus, shemminger, greearb, fbl,
	john.r.fastabend

This is basically a V3 of a repost of my previous patchset:
"[patch net-next-2.6 0/2] net: allow to change carrier via sysfs" from Aug 30

The way net-sysfs stores values changed and this patchset reflects it.
Also, I exposed carrier via rtnetlink iface.

So far, only dummy driver uses carrier change ndo. In very near future
team driver will use that as well.

V2->V3:
 - updated ndo_change_carrier comment by Dan Williams

V1->v2:
 - added bigger comment to ndo and also note to operstate.txt documentation
   stating the clear purpose of this iface

Jiri Pirko (4):
  net: add change_carrier netdev op
  net: allow to change carrier via sysfs
  rtnl: expose carrier value with possibility to set it
  dummy: implement carrier change

 Documentation/networking/operstates.txt |  4 ++++
 drivers/net/dummy.c                     | 10 ++++++++++
 include/linux/netdevice.h               | 12 ++++++++++++
 include/uapi/linux/if_link.h            |  1 +
 net/core/dev.c                          | 19 +++++++++++++++++++
 net/core/net-sysfs.c                    | 15 ++++++++++++++-
 net/core/rtnetlink.c                    | 10 ++++++++++
 7 files changed, 70 insertions(+), 1 deletion(-)

-- 
1.8.0

^ permalink raw reply

* [patch net-next 1/4] net: add change_carrier netdev op
From: Jiri Pirko @ 2012-12-19  9:35 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, bhutchings, mirqus, shemminger, greearb, fbl,
	john.r.fastabend
In-Reply-To: <1355909756-2389-1-git-send-email-jiri@resnulli.us>

This allows a driver to register change_carrier callback which will be
called whenever user will like to change carrier state. This is useful
for devices like dummy, gre, team and so on.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 include/linux/netdevice.h | 12 ++++++++++++
 net/core/dev.c            | 19 +++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 02e0f6b..22ca4a8 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -891,6 +891,14 @@ struct netdev_fcoe_hbainfo {
  * int (*ndo_bridge_setlink)(struct net_device *dev, struct nlmsghdr *nlh)
  * int (*ndo_bridge_getlink)(struct sk_buff *skb, u32 pid, u32 seq,
  *			     struct net_device *dev)
+ *
+ * int (*ndo_change_carrier)(struct net_device *dev, bool new_carrier);
+ *	Called to change device carrier. Soft-devices (like dummy, team, etc)
+ *	which do not represent real hardware may define this to allow their
+ *	userspace components to manage their virtual carrier state. Devices
+ *	that determine carrier state from physical hardware properties (eg
+ *	network cables) or protocol-dependent mechanisms (eg
+ *	USB_CDC_NOTIFY_NETWORK_CONNECTION) should NOT implement this function.
  */
 struct net_device_ops {
 	int			(*ndo_init)(struct net_device *dev);
@@ -1008,6 +1016,8 @@ struct net_device_ops {
 	int			(*ndo_bridge_getlink)(struct sk_buff *skb,
 						      u32 pid, u32 seq,
 						      struct net_device *dev);
+	int			(*ndo_change_carrier)(struct net_device *dev,
+						      bool new_carrier);
 };
 
 /*
@@ -2194,6 +2204,8 @@ extern int		dev_set_mtu(struct net_device *, int);
 extern void		dev_set_group(struct net_device *, int);
 extern int		dev_set_mac_address(struct net_device *,
 					    struct sockaddr *);
+extern int		dev_change_carrier(struct net_device *,
+					   bool new_carrier);
 extern int		dev_hard_start_xmit(struct sk_buff *skb,
 					    struct net_device *dev,
 					    struct netdev_queue *txq);
diff --git a/net/core/dev.c b/net/core/dev.c
index d0cbc93..268a714 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5027,6 +5027,25 @@ int dev_set_mac_address(struct net_device *dev, struct sockaddr *sa)
 }
 EXPORT_SYMBOL(dev_set_mac_address);
 
+/**
+ *	dev_change_carrier - Change device carrier
+ *	@dev: device
+ *	@new_carries: new value
+ *
+ *	Change device carrier
+ */
+int dev_change_carrier(struct net_device *dev, bool new_carrier)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (!ops->ndo_change_carrier)
+		return -EOPNOTSUPP;
+	if (!netif_device_present(dev))
+		return -ENODEV;
+	return ops->ndo_change_carrier(dev, new_carrier);
+}
+EXPORT_SYMBOL(dev_change_carrier);
+
 /*
  *	Perform the SIOCxIFxxx calls, inside rcu_read_lock()
  */
-- 
1.8.0

^ permalink raw reply related

* [patch net-next 2/4] net: allow to change carrier via sysfs
From: Jiri Pirko @ 2012-12-19  9:35 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, bhutchings, mirqus, shemminger, greearb, fbl,
	john.r.fastabend
In-Reply-To: <1355909756-2389-1-git-send-email-jiri@resnulli.us>

Make carrier writable

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 net/core/net-sysfs.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 334efd5..7eda40a 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -126,6 +126,19 @@ static ssize_t show_broadcast(struct device *dev,
 	return -EINVAL;
 }
 
+static int change_carrier(struct net_device *net, unsigned long new_carrier)
+{
+	if (!netif_running(net))
+		return -EINVAL;
+	return dev_change_carrier(net, (bool) new_carrier);
+}
+
+static ssize_t store_carrier(struct device *dev, struct device_attribute *attr,
+			 const char *buf, size_t len)
+{
+	return netdev_store(dev, attr, buf, len, change_carrier);
+}
+
 static ssize_t show_carrier(struct device *dev,
 			    struct device_attribute *attr, char *buf)
 {
@@ -331,7 +344,7 @@ static struct device_attribute net_class_attributes[] = {
 	__ATTR(link_mode, S_IRUGO, show_link_mode, NULL),
 	__ATTR(address, S_IRUGO, show_address, NULL),
 	__ATTR(broadcast, S_IRUGO, show_broadcast, NULL),
-	__ATTR(carrier, S_IRUGO, show_carrier, NULL),
+	__ATTR(carrier, S_IRUGO | S_IWUSR, show_carrier, store_carrier),
 	__ATTR(speed, S_IRUGO, show_speed, NULL),
 	__ATTR(duplex, S_IRUGO, show_duplex, NULL),
 	__ATTR(dormant, S_IRUGO, show_dormant, NULL),
-- 
1.8.0

^ permalink raw reply related

* [patch net-next 3/4] rtnl: expose carrier value with possibility to set it
From: Jiri Pirko @ 2012-12-19  9:35 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, bhutchings, mirqus, shemminger, greearb, fbl,
	john.r.fastabend
In-Reply-To: <1355909756-2389-1-git-send-email-jiri@resnulli.us>

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 Documentation/networking/operstates.txt |  4 ++++
 include/uapi/linux/if_link.h            |  1 +
 net/core/rtnetlink.c                    | 10 ++++++++++
 3 files changed, 15 insertions(+)

diff --git a/Documentation/networking/operstates.txt b/Documentation/networking/operstates.txt
index 1a77a3c..9769457 100644
--- a/Documentation/networking/operstates.txt
+++ b/Documentation/networking/operstates.txt
@@ -88,6 +88,10 @@ set this flag. On netif_carrier_off(), the scheduler stops sending
 packets. The name 'carrier' and the inversion are historical, think of
 it as lower layer.
 
+Note that for certain kind of soft-devices, which are not managing any
+real hardware, there is possible to set this bit from userpsace.
+One should use TVL IFLA_CARRIER to do so.
+
 netif_carrier_ok() can be used to query that bit.
 
 __LINK_STATE_DORMANT, maps to IFF_DORMANT:
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 60f3b6b..c4edfe1 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -142,6 +142,7 @@ enum {
 #define IFLA_PROMISCUITY IFLA_PROMISCUITY
 	IFLA_NUM_TX_QUEUES,
 	IFLA_NUM_RX_QUEUES,
+	IFLA_CARRIER,
 	__IFLA_MAX
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 1868625..2ef7a56 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -780,6 +780,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(4) /* IFLA_MTU */
 	       + nla_total_size(4) /* IFLA_LINK */
 	       + nla_total_size(4) /* IFLA_MASTER */
+	       + nla_total_size(1) /* IFLA_CARRIER */
 	       + nla_total_size(4) /* IFLA_PROMISCUITY */
 	       + nla_total_size(4) /* IFLA_NUM_TX_QUEUES */
 	       + nla_total_size(4) /* IFLA_NUM_RX_QUEUES */
@@ -909,6 +910,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	     nla_put_u32(skb, IFLA_LINK, dev->iflink)) ||
 	    (dev->master &&
 	     nla_put_u32(skb, IFLA_MASTER, dev->master->ifindex)) ||
+	    nla_put_u8(skb, IFLA_CARRIER, netif_carrier_ok(dev)) ||
 	    (dev->qdisc &&
 	     nla_put_string(skb, IFLA_QDISC, dev->qdisc->ops->id)) ||
 	    (dev->ifalias &&
@@ -1108,6 +1110,7 @@ const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_MTU]		= { .type = NLA_U32 },
 	[IFLA_LINK]		= { .type = NLA_U32 },
 	[IFLA_MASTER]		= { .type = NLA_U32 },
+	[IFLA_CARRIER]		= { .type = NLA_U8 },
 	[IFLA_TXQLEN]		= { .type = NLA_U32 },
 	[IFLA_WEIGHT]		= { .type = NLA_U32 },
 	[IFLA_OPERSTATE]	= { .type = NLA_U8 },
@@ -1438,6 +1441,13 @@ static int do_setlink(struct net_device *dev, struct ifinfomsg *ifm,
 		modified = 1;
 	}
 
+	if (tb[IFLA_CARRIER]) {
+		err = dev_change_carrier(dev, nla_get_u8(tb[IFLA_CARRIER]));
+		if (err)
+			goto errout;
+		modified = 1;
+	}
+
 	if (tb[IFLA_TXQLEN])
 		dev->tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]);
 
-- 
1.8.0

^ permalink raw reply related

* [patch net-next 4/4] dummy: implement carrier change
From: Jiri Pirko @ 2012-12-19  9:35 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, bhutchings, mirqus, shemminger, greearb, fbl,
	john.r.fastabend
In-Reply-To: <1355909756-2389-1-git-send-email-jiri@resnulli.us>

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 drivers/net/dummy.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/net/dummy.c b/drivers/net/dummy.c
index c260af5..42aa54a 100644
--- a/drivers/net/dummy.c
+++ b/drivers/net/dummy.c
@@ -100,6 +100,15 @@ static void dummy_dev_uninit(struct net_device *dev)
 	free_percpu(dev->dstats);
 }
 
+static int dummy_change_carrier(struct net_device *dev, bool new_carrier)
+{
+	if (new_carrier)
+		netif_carrier_on(dev);
+	else
+		netif_carrier_off(dev);
+	return 0;
+}
+
 static const struct net_device_ops dummy_netdev_ops = {
 	.ndo_init		= dummy_dev_init,
 	.ndo_uninit		= dummy_dev_uninit,
@@ -108,6 +117,7 @@ static const struct net_device_ops dummy_netdev_ops = {
 	.ndo_set_rx_mode	= set_multicast_list,
 	.ndo_set_mac_address	= eth_mac_addr,
 	.ndo_get_stats64	= dummy_get_stats64,
+	.ndo_change_carrier	= dummy_change_carrier,
 };
 
 static void dummy_setup(struct net_device *dev)
-- 
1.8.0

^ permalink raw reply related

* Re: [RFC PATCH net-next 0/5] Ease netns management for userland
From: Nicolas Dichtel @ 2012-12-19  9:47 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: netdev, davem, aatteka
In-Reply-To: <87zk1g8tnq.fsf@xmission.com>

Le 14/12/2012 17:50, Eric W. Biederman a écrit :
> Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:
>
>> Le 13/12/2012 20:08, Eric W. Biederman a écrit :
>
>>> No.  The difficulty monitoring which network namespaces are being used
>>> is an unintended side effect.
>> Why is netlink a bad idea? Having a way to know all existing netns is a start
>> point to monitor netns, isn't it?
>
> In the same way that having a neighbour table that contains all existing
> ip address to mac addresses mappings is a starting point to monitor all
> existing hosts.
>
> All does not scale.
>
> All removes a lot of perfectly valid use cases like checkpoint-restart,
> and nesting containers.
>
> All as different from what is already implemented requires implementing
> yet another namespace to put the names of all into it.  We have enough
> namespaces now thank you very much.
>
> An unfiltered global list is about as interesting to use as putting
> all files in /.  Sure you know which directory you put your file in but
> which file is it?
>
> What has already been implemented should be roughly as good for
> monitoring as what is available with lsof.
>
> And of course there is the fact that a global list of anything that is
> the same from every perspective violates the principle of relativity,
> and is in contradiction with the phsical reality in which we exist.
>
> So there is no way that having a global all inclusive list of network
> namespaces makes the least lick of sense and I really don't want to
> think about it.

Thank you for your explanations and your patience, this is very useful.


Nicolas

^ permalink raw reply

* RE: TCP delayed ACK heuristic
From: David Laight @ 2012-12-19  9:52 UTC (permalink / raw)
  To: Rick Jones
  Cc: Cong Wang, netdev, Ben Greear, David Miller, Eric Dumazet,
	Stephen Hemminger, Thomas Graf
In-Reply-To: <50D0ADD4.7030903@hp.com>

> > I've seen problems when the sending side is doing (I think)
> > 'slow start' with Nagle disabled.
> > The sender would only send 4 segments before waiting for an
> > ACK - even when it had more than a full sized segment waiting.
> > Sender was Linux 2.6.something (probably low 20s).
> > I changed the application flow to send data in the reverse
> > direction to avoid the problem.
> > That was on a ~0 delay local connection - which means that
> > there is almost never outstanding data, and the 'slow start'
> > happened almost all the time.
> > Nagle is completely the wrong algorithm for the data flow.
> 
> If Nagle was already disabled, why the last sentence?  And from your
> description, even if Nagle were enabled, I would think that it was
> remote ACK+cwnd behaviour getting in your way, not Nagle, given that
> Nagle is to be decided on a user-send by user-send basis and release
> queued data (to the mercies of other heuristics) when it gets to be an
> MSS-worth.

With Nagle enabled the first segment is sent, the following ones
get buffered until full segments can be sent. Although (probably)
only 4 segments will be sent (1 small and 3 full) the 3rd of these
does generate an ack.

> ... but it sounds like you have an actual
> application looking to do that??

We are relaying data packets received over multiple SS7 signalling
links (64k hdlc) over a TCP connection. The connection will be local,
in some cases the host ethernet MAC, switch, and target cpu are all
on the same PCI(e) card (MII crossover links).
While a delay of a millisecond or two wouldn't matter (1ms is 8 byte
times) the Nagle delay is far too long - and since the data isn't
command/response the Nagle would delay happen a lot.

One of the conformance tests managed to make the system 'busy'.
Since all it does is make one 64k channel busy it shouldn't have
been able to generate a backlog of receive data - but it managed to
get over 100 data packets unacked (app level ack).

> Allowing a byte-limit-cwnd's worth of single-byte-payload TCP segments
> could easily be seen as being rather anti-social :)

If the actual RTT is almost zero (as in our case) and the network
really shouldn't be dropping packets the it doesn't matter.

I suspect that if the tx rate is faster than the RTT then the
'slow start' turns off and you can get a lot of small segments
in flight. But when the RTT is zero 'slow start' almost always
applies and you only send 4.

> And forcing/maintaining the original segment boundaries in
> retransmissions for small packets isn't such a hot idea either.

True, not splitting them might be useful, but to need to avoid
merges.

	David

^ permalink raw reply

* Re: [PATCH v2] netlink: align attributes on 64-bits
From: Nicolas Dichtel @ 2012-12-19 11:22 UTC (permalink / raw)
  To: Thomas Graf; +Cc: bhutchings, netdev, davem, David.Laight
In-Reply-To: <20121218125735.GG27746@casper.infradead.org>

Le 18/12/2012 13:57, Thomas Graf a écrit :
> On 12/17/12 at 05:49pm, Nicolas Dichtel wrote:
>> @@ -492,6 +492,15 @@ static inline struct nlmsghdr *nlmsg_put_answer(struct sk_buff *skb,
>>    */
>>   static inline struct sk_buff *nlmsg_new(size_t payload, gfp_t flags)
>>   {
>> +	/* Because attributes may be aligned on 64-bits boundary with fake
>> +	 * attribute (type 0, size 4 (attributes are 32-bits align by default)),
>> +	 * an exact payload size cannot be calculated. Hence, we need to reserve
>> +	 * more space for these attributes.
>> +	 * 128 is arbitrary: it allows to align up to 32 attributes.
>> +	 */
>> +	if (payload < NLMSG_DEFAULT_SIZE)
>> +		payload = min(payload + 128, (size_t)NLMSG_DEFAULT_SIZE);
>
> This is doomed to fail eventually. A netlink message may carry
> hundreds of attributes eventually. See my suggestion below.
>
>> diff --git a/lib/nlattr.c b/lib/nlattr.c
>> index 18eca78..7440a80 100644
>> --- a/lib/nlattr.c
>> +++ b/lib/nlattr.c
>> @@ -450,9 +450,18 @@ EXPORT_SYMBOL(__nla_put_nohdr);
>>    */
>>   int nla_put(struct sk_buff *skb, int attrtype, int attrlen, const void *data)
>>   {
>> -	if (unlikely(skb_tailroom(skb) < nla_total_size(attrlen)))
>> +	int align = IS_ALIGNED((unsigned long)skb_tail_pointer(skb), 8) ? 0 : 4;
>
> This does not look right. In order for the attribute data to be
> aligned properly you would need to check skb_tail_pointer(skb) +
> NLA_HDRLEN for proper alignment or you end up aligning the
> attribute header.
>
> How about we change nla_total_size() to return the size with
> needed padding taken into account. That should fix the message
> size caluclation problem and we only need to reserve room for
> the initial padding to align the very first attribute.
>
> Below is an untested patch that does this. What do you think?
>
> diff --git a/include/net/netlink.h b/include/net/netlink.h
> index 9690b0f..7ce8e76 100644
> --- a/include/net/netlink.h
> +++ b/include/net/netlink.h
> @@ -114,7 +114,6 @@
>    * Attribute Length Calculations:
>    *   nla_attr_size(payload)		length of attribute w/o padding
>    *   nla_total_size(payload)		length of attribute w/ padding
> - *   nla_padlen(payload)		length of padding
>    *
>    * Attribute Payload Access:
>    *   nla_data(nla)			head of attribute payload
> @@ -492,6 +491,14 @@ static inline struct nlmsghdr *nlmsg_put_answer(struct sk_buff *skb,
>    */
>   static inline struct sk_buff *nlmsg_new(size_t payload, gfp_t flags)
>   {
> +	/* If an exact size if specified, reserve some additional space to
> +	 * align the first attribute, all subsequent attributes should have
> +	 * padding accounted for.
> +	 */
> +	if (payload != NLMSG_DEFAULT_SIZE)
> +		payload = min_t(size_t, payload + NLA_ATTR_ALIGN,
> +				NLMSG_DEFAULT_SIZE);
> +
>   	return alloc_skb(nlmsg_total_size(payload), flags);
>   }
>
> @@ -653,16 +660,12 @@ static inline int nla_attr_size(int payload)
>    */
>   static inline int nla_total_size(int payload)
>   {
> -	return NLA_ALIGN(nla_attr_size(payload));
> -}
> +	size_t len = NLA_ALIGN(nla_attr_size(payload));
>
> -/**
> - * nla_padlen - length of padding at the tail of attribute
> - * @payload: length of payload
> - */
> -static inline int nla_padlen(int payload)
> -{
> -	return nla_total_size(payload) - nla_attr_size(payload);
> +	if (!IS_ALIGNED(len, NLA_ATTR_ALIGN))
> +		len = ALIGN(len + NLA_HDRLEN, NLA_ATTR_ALIGN);
> +
> +	return len;
>   }
>
>   /**
> diff --git a/include/uapi/linux/netlink.h b/include/uapi/linux/netlink.h
> index 78d5b8a..1856729 100644
> --- a/include/uapi/linux/netlink.h
> +++ b/include/uapi/linux/netlink.h
> @@ -149,5 +149,10 @@ struct nlattr {
>   #define NLA_ALIGN(len)		(((len) + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1))
>   #define NLA_HDRLEN		((int) NLA_ALIGN(sizeof(struct nlattr)))
>
> +/* Padding attribute type, added automatically to align attributes,
> + * must be ignored by readers. */
> +#define NLA_PADDING		0
> +#define NLA_ATTR_ALIGN		8
> +
>
>   #endif /* _UAPI__LINUX_NETLINK_H */
> diff --git a/lib/nlattr.c b/lib/nlattr.c
> index 18eca78..b09473c 100644
> --- a/lib/nlattr.c
> +++ b/lib/nlattr.c
> @@ -322,18 +322,36 @@ int nla_strcmp(const struct nlattr *nla, const char *str)
>    * Adds a netlink attribute header to a socket buffer and reserves
>    * room for the payload but does not copy it.
>    *
> + * May add a padding attribute of type NLA_PADDING before the
> + * real attribute to ensure proper alignment.
> + *
>    * The caller is responsible to ensure that the skb provides enough
>    * tailroom for the attribute header and payload.
>    */
>   struct nlattr *__nla_reserve(struct sk_buff *skb, int attrtype, int attrlen)
>   {
>   	struct nlattr *nla;
> +	size_t offset;
> +
> +	offset = (size_t) skb_tail_pointer(skb);
> +	if (!IS_ALIGNED(offset + NLA_HDRLEN, NLA_ATTR_ALIGN)) {
> +		struct nlattr *pad;
> +		size_t padlen;
> +
> +		padlen = nla_total_size(offset) - offset - NLA_HDRLEN;
Here padlen will return 4, which is wrong: padlen + NLA_HDRLEN = 8, alignment is 
the same than before. Here is a proposal fix:

diff --git a/lib/nlattr.c b/lib/nlattr.c
index e4f0329..1556313 100644
--- a/lib/nlattr.c
+++ b/lib/nlattr.c
@@ -338,7 +338,10 @@ struct nlattr *__nla_reserve(struct sk_buff *skb, int 
attrtype, int attrlen)
  		struct nlattr *pad;
  		size_t padlen;

-		padlen = nla_total_size(offset) - offset -  NLA_HDRLEN;
+		/* We need to remove NLA_HDRLEN two times: one time for the
+		 * attribute hdr and one time for the pad attribute hdr.
+		 */
+		padlen = nla_total_size(offset) - offset -  2 * NLA_HDRLEN;
  		pad = (struct nlattr *) skb_put(skb, nla_attr_size(padlen));
  		pad->nla_type = 0;
  		pad->nla_len = nla_attr_size(padlen);

With this patch, it seems goods. attribute are always aligned on 8 bytes. Also
I did not notice any problem with size calculation (I try some ip link, ip xfrm, 
ip [m]route).

Do you want to make more tests? Or will your repost the full patch?
I can do it if you don't have time.

^ permalink raw reply related

* [PATCH 2/2] drivers/net: Use of_match_ptr() macro in smsc911x.c
From: Sachin Kamat @ 2012-12-19 11:17 UTC (permalink / raw)
  To: netdev; +Cc: steve.glendinning, davem, sachin.kamat, patches, nico
In-Reply-To: <1355915830-29481-1-git-send-email-sachin.kamat@linaro.org>

Add CONFIG_OF guard and use of_match_ptr macro.

Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
---
Compile tested on linux-next.
---
 drivers/net/ethernet/smsc/smsc911x.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/smsc/smsc911x.c b/drivers/net/ethernet/smsc/smsc911x.c
index 4616bf2..e112877 100644
--- a/drivers/net/ethernet/smsc/smsc911x.c
+++ b/drivers/net/ethernet/smsc/smsc911x.c
@@ -2575,11 +2575,13 @@ static const struct dev_pm_ops smsc911x_pm_ops = {
 #define SMSC911X_PM_OPS NULL
 #endif
 
+#ifdef CONFIG_OF
 static const struct of_device_id smsc911x_dt_ids[] = {
 	{ .compatible = "smsc,lan9115", },
 	{ /* sentinel */ }
 };
 MODULE_DEVICE_TABLE(of, smsc911x_dt_ids);
+#endif
 
 static struct platform_driver smsc911x_driver = {
 	.probe = smsc911x_drv_probe,
@@ -2588,7 +2590,7 @@ static struct platform_driver smsc911x_driver = {
 		.name	= SMSC_CHIPNAME,
 		.owner	= THIS_MODULE,
 		.pm	= SMSC911X_PM_OPS,
-		.of_match_table = smsc911x_dt_ids,
+		.of_match_table = of_match_ptr(smsc911x_dt_ids),
 	},
 };
 
-- 
1.7.4.1

^ permalink raw reply related

* Re: [PATCH] xen/netfront: improve truesize tracking
From: Sander Eikelenboom @ 2012-12-19 11:34 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Eric Dumazet, netdev@vger.kernel.org, Konrad Rzeszutek Wilk,
	annie li, xen-devel@lists.xensource.com
In-Reply-To: <1355844398.14620.254.camel@zakaz.uk.xensource.com>


Tuesday, December 18, 2012, 4:26:38 PM, you wrote:

> On Tue, 2012-12-18 at 15:12 +0000, Eric Dumazet wrote:
>> On Tue, 2012-12-18 at 13:51 +0000, Ian Campbell wrote:
>> > Using RX_COPY_THRESHOLD is incorrect if the SKB is actually smaller
>> > than that. We have already accounted for this in
>> > NETFRONT_SKB_CB(skb)->pull_to so use that instead.
>> > 
>> > Fixes WARN_ON from skb_try_coalesce.
>> > 
>> > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
>> > Cc: Sander Eikelenboom <linux@eikelenboom.it>
>> > Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> > Cc: annie li <annie.li@oracle.com>
>> > Cc: xen-devel@lists.xensource.com
>> > Cc: netdev@vger.kernel.org
>> > Cc: stable@kernel.org # 3.7.x only
>> > ---
>> >  drivers/net/xen-netfront.c |   15 +++++----------
>> >  1 files changed, 5 insertions(+), 10 deletions(-)
>> > 
>> > diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
>> > index caa0110..b06ef81 100644
>> > --- a/drivers/net/xen-netfront.c
>> > +++ b/drivers/net/xen-netfront.c
>> > @@ -971,17 +971,12 @@ err:
>> >              * overheads. Here, we add the size of the data pulled
>> >              * in xennet_fill_frags().
>> >              *
>> > -            * We also adjust for any unused space in the main
>> > -            * data area by subtracting (RX_COPY_THRESHOLD -
>> > -            * len). This is especially important with drivers
>> > -            * which split incoming packets into header and data,
>> > -            * using only 66 bytes of the main data area (see the
>> > -            * e1000 driver for example.)  On such systems,
>> > -            * without this last adjustement, our achievable
>> > -            * receive throughout using the standard receive
>> > -            * buffer size was cut by 25%(!!!).
>> > +            * We also adjust for the __pskb_pull_tail done in
>> > +            * handle_incoming_queue which pulls data from the
>> > +            * frags into the head area, which is already
>> > +            * accounted in RX_COPY_THRESHOLD.
>> >              */
>> > -           skb->truesize += skb->data_len - RX_COPY_THRESHOLD;
>> > +           skb->truesize += skb->data_len - NETFRONT_SKB_CB(skb)->pull_to;
>> >             skb->len += skb->data_len;
>> >  
>> >             if (rx->flags & XEN_NETRXF_csum_blank)
>> 
>> 
>> But skb truesize is not what you think.

> Indeed, it seems I was completely backwards about what it means!

>> You must account the exact memory used by this skb, not only the used
>> part of it.
>> 
>> At the very minimum, it should be
>> 
>> skb->truesize += skb->data_len;
>> 
>> But it really should be the allocated size of the fragment.
>> 
>> If its a page, then its a page, even if you use one single byte in it.

> So actually we want += PAGE_SIZE * skb_shinfo(skb)->nr_frags ?

> Sander, can you try that change?

Hi Ian,

It ran overnight and i haven't seen the warn_once trigger.
(but i also didn't with the previous patch)

--
Sander

> Ian.

^ permalink raw reply

* Re: RFC  [PATCH] iproute2:  temporary solution to fix xt breakage
From: Jamal Hadi Salim @ 2012-12-19 11:36 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Hasan Chowdhury, Jan Engelhardt, Yury Stankevich,
	netdev@vger.kernel.org, pablo, netfilter-devel
In-Reply-To: <20121217081203.3dc324c8@nehalam.linuxnetplumber.net>

On 12-12-17 11:12 AM, Stephen Hemminger wrote:

>
> Maybe xtables should have stable API/ABI and use shim routines there?

Thats the general direction being taken now with this last changes...

cheers,
jamal



^ permalink raw reply

* Re: tc ipt action
From: Jamal Hadi Salim @ 2012-12-19 11:43 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Pablo Neira Ayuso, Yury Stankevich, shemonc,
	netdev@vger.kernel.org, netfilter-devel
In-Reply-To: <alpine.LNX.2.01.1212181426590.24597@nerf07.vanv.qr>

On 12-12-18 08:58 AM, Jan Engelhardt wrote:
>

> Chains can store multiple targets, so no loss.

Nice.

> 1. table
>
> First, I think some targets need to relax their restrictions, such as
> with xt_DSCP.

Saw your other patch to get rid of mangle hardcoding.

> Then, only a handful of extensions remain: CT, <all NATs>,
> TPROXY and REJECT. Would anyone want to call these from act_ipt?
> I doubt it. :)
>

Tempted to say tproxy.

> 2. hooks
>
> Extensions with hook limit: <NAT>, TPROXY, REJECT, CLASSIFY.
> Again, I don't quite see the value of attempting to NAT from act_ipt.
> CLASSIFY {c|sh?}ould be relaxed, unless I am missing something.
>


I could live with that. It would be an improvement over whats there 
today. I would prefer however for this to be an improvement over
act_xt.c i posted as opposed to have even more interfaces for xt.
We've suffered enough already ;-> i.e add your patches on top.

cheers,
jamal


^ permalink raw reply

* Re: RFC [PATCH] iproute2: temporary solution to fix xt breakage
From: Jamal Hadi Salim @ 2012-12-19 11:44 UTC (permalink / raw)
  To: Hasan Chowdhury
  Cc: Stephen Hemminger, Jan Engelhardt, Yury Stankevich,
	netdev@vger.kernel.org, pablo, netfilter-devel
In-Reply-To: <CAASe=fR6Hm2dxp=1wDchtrzqnaH6qacHpg2wrsqLfmGpPbQ9Fg@mail.gmail.com>

On 12-12-18 09:45 AM, Hasan Chowdhury wrote:
> Hi Jamal,
>
> Thanks for all the help and the information. I will keep tune myself so
> when the proper path from kernel side will show up I will integrate it
> into my system to test it.
>

Yikes. I guess i never posted that? Will do it shortly.

cheers,
jamal


^ permalink raw reply

* Re: [GIT PULL net-next 04/17] ndisc: Introduce ndisc_fill_redirect_hdr_option().
From: Bjørn Mork @ 2012-12-19 11:47 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki; +Cc: davem, netdev
In-Reply-To: <50D04B4B.7060002@linux-ipv6.org>

YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> writes:

> Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
> ---
>  net/ipv6/ndisc.c |   21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
> index a181113..0a4f3a9 100644
> --- a/net/ipv6/ndisc.c
> +++ b/net/ipv6/ndisc.c
> @@ -1332,6 +1332,19 @@ static void ndisc_redirect_rcv(struct sk_buff *skb)
>  	icmpv6_notify(skb, NDISC_REDIRECT, 0, 0);
>  }
>  
> +static u8 *ndisc_fill_redirect_hdr_option(u8 *opt, struct sk_buff *orig_skb,
> +					  int rd_len)
> +{
> +	memset(opt, 0, 8);
> +	*(opt++) = ND_OPT_REDIRECT_HDR;
> +	*(opt++) = (rd_len >> 3);
> +	opt += 6;
> +
> +	memcpy(opt, ipv6_hdr(orig_skb), rd_len - 8);
> +
> +	return opt;
> +}
> +

I realize that it doesn't currently matter, but the above modification
of "opt" looks like a bug-waiting-to-happen to me.

>  void ndisc_send_redirect(struct sk_buff *skb, const struct in6_addr *target)
>  {
>  	struct net_device *dev = skb->dev;
> @@ -1461,12 +1474,8 @@ void ndisc_send_redirect(struct sk_buff *skb, const struct in6_addr *target)
>  	 *	build redirect option and copy skb over to the new packet.
>  	 */
>  
> -	memset(opt, 0, 8);
> -	*(opt++) = ND_OPT_REDIRECT_HDR;
> -	*(opt++) = (rd_len >> 3);
> -	opt += 6;
> -
> -	memcpy(opt, ipv6_hdr(skb), rd_len - 8);
> +	if (rd_len)
> +		opt = ndisc_fill_redirect_hdr_option(opt, skb, rd_len);


I understand that opt isn't currently used after this, but if it ever is
then it is going to come as big a surprise that this implies opt += 8;

This was previously quite clear when the code was inline, but it becomes
problematic when it is factored out.


Bjørn

^ permalink raw reply

* [PATCH 1/2] drivers/net: Use of_match_ptr() macro in smc91x.c
From: Sachin Kamat @ 2012-12-19 11:17 UTC (permalink / raw)
  To: netdev; +Cc: steve.glendinning, davem, sachin.kamat, patches, nico

This eliminates having an #ifdef returning NULL for the case
when OF is disabled.

Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
---
Compile tested on linux-next.
---
 drivers/net/ethernet/smsc/smc91x.c |    4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/smsc/smc91x.c b/drivers/net/ethernet/smsc/smc91x.c
index 022b45b..a670d23 100644
--- a/drivers/net/ethernet/smsc/smc91x.c
+++ b/drivers/net/ethernet/smsc/smc91x.c
@@ -2386,8 +2386,6 @@ static const struct of_device_id smc91x_match[] = {
 	{},
 };
 MODULE_DEVICE_TABLE(of, smc91x_match);
-#else
-#define smc91x_match NULL
 #endif
 
 static struct dev_pm_ops smc_drv_pm_ops = {
@@ -2402,7 +2400,7 @@ static struct platform_driver smc_driver = {
 		.name	= CARDNAME,
 		.owner	= THIS_MODULE,
 		.pm	= &smc_drv_pm_ops,
-		.of_match_table = smc91x_match,
+		.of_match_table = of_match_ptr(smc91x_match),
 	},
 };
 
-- 
1.7.4.1

^ permalink raw reply related

* [PATCH] pkt_sched: act_xt support new Xtables interface
From: Jamal Hadi Salim @ 2012-12-19 11:56 UTC (permalink / raw)
  To: Hasan Chowdhury
  Cc: Stephen Hemminger, Jan Engelhardt, Yury Stankevich,
	netdev@vger.kernel.org, pablo, netfilter-devel
In-Reply-To: <50D1A8A7.1090002@mojatatu.com>

[-- Attachment #1: Type: text/plain, Size: 485 bytes --]


To be applied pending more testing.

Attached. Sorry, I thought I had sent this out over the weekend.
I have done basic testing with a single mark and sending pings to
update stats which can then displayed for the mark.

Hasan/Yury, if you test this please use the latest iproute2 with only 
the first patch I posted (originally from Hasan). Hasan please use that
patch not your version - if theres anything wrong we can find out sooner
before the patch becomes final.

cheers,
jamal

[-- Attachment #2: xt-p1 --]
[-- Type: text/plain, Size: 10173 bytes --]

commit 82330cc874429c63bd0e476e413a79ebab3da350
Author: Jamal Hadi Salim <hadi@mojatatu.com>
Date:   Wed Dec 19 06:23:28 2012 -0500

    Fix iptables/xtables ABI changes. We will eventually replace
    act_ipt with act_xt since only very few targets still support the
    old xtables interface
    
    Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>

diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 235e01a..1693973 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -578,12 +578,25 @@ config NET_ACT_MIRRED
 config NET_ACT_IPT
         tristate "IPtables targets"
         depends on NET_CLS_ACT && NETFILTER && IP_NF_IPTABLES
+	select NET_ACT_XT
         ---help---
 	  Say Y here to be able to invoke iptables targets after successful
-	  classification.
+	  classification. Better yet choose NET_ACT_XT since this version
+	  will eventually be obsoleted.
 
 	  To compile this code as a module, choose M here: the
 	  module will be called act_ipt.
+config NET_ACT_XT
+        tristate "New IPtables targets"
+        depends on NET_CLS_ACT && NETFILTER && IP_NF_IPTABLES
+        ---help---
+	  Say Y here to be able to invoke iptables targets after successful
+	  classification using the new xtables mechanism. This mechanism
+	  will eventually replace NET_ACT_IPT
+
+	  To compile this code as a module, choose M here: the
+	  module will be called act_xt.
+
 
 config NET_ACT_NAT
         tristate "Stateless NAT"
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 978cbf0..10a1136 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_NET_ACT_POLICE)	+= act_police.o
 obj-$(CONFIG_NET_ACT_GACT)	+= act_gact.o
 obj-$(CONFIG_NET_ACT_MIRRED)	+= act_mirred.o
 obj-$(CONFIG_NET_ACT_IPT)	+= act_ipt.o
+obj-$(CONFIG_NET_ACT_XT)	+= act_xt.o
 obj-$(CONFIG_NET_ACT_NAT)	+= act_nat.o
 obj-$(CONFIG_NET_ACT_PEDIT)	+= act_pedit.o
 obj-$(CONFIG_NET_ACT_SIMP)	+= act_simple.o
diff --git a/net/sched/act_xt.c b/net/sched/act_xt.c
new file mode 100644
index 0000000..589cfe6
--- /dev/null
+++ b/net/sched/act_xt.c
@@ -0,0 +1,324 @@
+/*
+ * net/sched/act_xt.c     iptables target interface
+ *
+ *TODO: Add other tables. For now we only support the ipv4 table targets
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Copyright:	Jamal Hadi Salim (2002-12)
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <linux/rtnetlink.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <linux/tc_act/tc_ipt.h>
+#include <net/tc_act/tc_ipt.h>
+
+#include <linux/netfilter_ipv4/ip_tables.h>
+
+#define IPT_TAB_MASK     15
+static struct tcf_common *tcf_ipt_ht[IPT_TAB_MASK + 1];
+static u32 ipt_idx_gen;
+static DEFINE_RWLOCK(ipt_lock);
+
+static struct tcf_hashinfo ipt_hash_info = {
+	.htab = tcf_ipt_ht,
+	.hmask = IPT_TAB_MASK,
+	.lock = &ipt_lock,
+};
+
+static int ipt_init_target(struct xt_entry_target *t, char *table,
+			   unsigned int hook)
+{
+	struct xt_tgchk_param par;
+	struct xt_target *target;
+	int ret = 0;
+
+	target = xt_request_find_target(AF_INET, t->u.user.name,
+					t->u.user.revision);
+	if (IS_ERR(target))
+		return PTR_ERR(target);
+
+	t->u.kernel.target = target;
+	par.table = table;
+	par.entryinfo = NULL;
+	par.target = target;
+	par.targinfo = t->data;
+	par.hook_mask = hook;
+	par.family = NFPROTO_IPV4;
+
+	ret = xt_check_target(&par, t->u.target_size - sizeof(*t), 0, false);
+	if (ret < 0) {
+		module_put(t->u.kernel.target->me);
+		return ret;
+	}
+	return 0;
+}
+
+static void ipt_destroy_target(struct xt_entry_target *t)
+{
+	struct xt_tgdtor_param par = {
+		.target = t->u.kernel.target,
+		.targinfo = t->data,
+	};
+	if (par.target->destroy != NULL)
+		par.target->destroy(&par);
+	module_put(par.target->me);
+}
+
+static int tcf_ipt_release(struct tcf_ipt *ipt, int bind)
+{
+	int ret = 0;
+	if (ipt) {
+		if (bind)
+			ipt->tcf_bindcnt--;
+		ipt->tcf_refcnt--;
+		if (ipt->tcf_bindcnt <= 0 && ipt->tcf_refcnt <= 0) {
+			ipt_destroy_target(ipt->tcfi_t);
+			kfree(ipt->tcfi_tname);
+			kfree(ipt->tcfi_t);
+			tcf_hash_destroy(&ipt->common, &ipt_hash_info);
+			ret = ACT_P_DELETED;
+		}
+	}
+	return ret;
+}
+
+static const struct nla_policy ipt_policy[TCA_IPT_MAX + 1] = {
+	[TCA_IPT_TABLE] = {.type = NLA_STRING,.len = IFNAMSIZ},
+	[TCA_IPT_HOOK] = {.type = NLA_U32},
+	[TCA_IPT_INDEX] = {.type = NLA_U32},
+	[TCA_IPT_TARG] = {.len = sizeof(struct xt_entry_target)},
+};
+
+static int tcf_ipt_init(struct nlattr *nla, struct nlattr *est,
+			struct tc_action *a, int ovr, int bind)
+{
+	struct nlattr *tb[TCA_IPT_MAX + 1];
+	struct tcf_ipt *ipt;
+	struct tcf_common *pc;
+	struct xt_entry_target *td, *t;
+	char *tname;
+	int ret = 0, err;
+	u32 hook = 0;
+	u32 index = 0;
+
+	if (nla == NULL)
+		return -EINVAL;
+
+	err = nla_parse_nested(tb, TCA_IPT_MAX, nla, ipt_policy);
+	if (err < 0)
+		return err;
+
+	if (tb[TCA_IPT_HOOK] == NULL)
+		return -EINVAL;
+	if (tb[TCA_IPT_TARG] == NULL)
+		return -EINVAL;
+
+	td = (struct xt_entry_target *)nla_data(tb[TCA_IPT_TARG]);
+	if (nla_len(tb[TCA_IPT_TARG]) < td->u.target_size)
+		return -EINVAL;
+
+	if (tb[TCA_IPT_INDEX] != NULL)
+		index = nla_get_u32(tb[TCA_IPT_INDEX]);
+
+	pc = tcf_hash_check(index, a, bind, &ipt_hash_info);
+	if (!pc) {
+		pc = tcf_hash_create(index, est, a, sizeof(*ipt), bind,
+				     &ipt_idx_gen, &ipt_hash_info);
+		if (IS_ERR(pc))
+			return PTR_ERR(pc);
+		ret = ACT_P_CREATED;
+	} else {
+		if (!ovr) {
+			tcf_ipt_release(to_ipt(pc), bind);
+			return -EEXIST;
+		}
+	}
+	ipt = to_ipt(pc);
+
+	hook = nla_get_u32(tb[TCA_IPT_HOOK]);
+
+	err = -ENOMEM;
+	tname = kmalloc(IFNAMSIZ, GFP_KERNEL);
+	if (unlikely(!tname))
+		goto err1;
+	if (tb[TCA_IPT_TABLE] == NULL ||
+	    nla_strlcpy(tname, tb[TCA_IPT_TABLE], IFNAMSIZ) >= IFNAMSIZ)
+		strcpy(tname, "mangle");
+
+	t = kmemdup(td, td->u.target_size, GFP_KERNEL);
+	if (unlikely(!t))
+		goto err2;
+
+	err = ipt_init_target(t, tname, hook);
+	if (err < 0)
+		goto err3;
+
+	spin_lock_bh(&ipt->tcf_lock);
+	if (ret != ACT_P_CREATED) {
+		ipt_destroy_target(ipt->tcfi_t);
+		kfree(ipt->tcfi_tname);
+		kfree(ipt->tcfi_t);
+	}
+	ipt->tcfi_tname = tname;
+	ipt->tcfi_t = t;
+	ipt->tcfi_hook = hook;
+	spin_unlock_bh(&ipt->tcf_lock);
+	if (ret == ACT_P_CREATED)
+		tcf_hash_insert(pc, &ipt_hash_info);
+	return ret;
+
+err3:
+	kfree(t);
+err2:
+	kfree(tname);
+err1:
+	if (ret == ACT_P_CREATED) {
+		if (est)
+			gen_kill_estimator(&pc->tcfc_bstats,
+					   &pc->tcfc_rate_est);
+		kfree_rcu(pc, tcfc_rcu);
+	}
+	return err;
+}
+
+static int tcf_ipt_cleanup(struct tc_action *a, int bind)
+{
+	struct tcf_ipt *ipt = a->priv;
+	return tcf_ipt_release(ipt, bind);
+}
+
+static int tcf_ipt(struct sk_buff *skb, const struct tc_action *a,
+		   struct tcf_result *res)
+{
+	int ret = 0, result = 0;
+	struct tcf_ipt *ipt = a->priv;
+	struct xt_action_param par;
+
+	if (skb_cloned(skb)) {
+		if (pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
+			return TC_ACT_UNSPEC;
+	}
+
+	spin_lock(&ipt->tcf_lock);
+
+	ipt->tcf_tm.lastuse = jiffies;
+	bstats_update(&ipt->tcf_bstats, skb);
+
+	/* yes, we have to worry about both in and out dev
+	 * worry later - danger - this API seems to have changed
+	 * from earlier kernels
+	 */
+	par.in = skb->dev;
+	par.out = NULL;
+	par.hooknum = ipt->tcfi_hook;
+	par.target = ipt->tcfi_t->u.kernel.target;
+	par.targinfo = ipt->tcfi_t->data;
+	ret = par.target->target(skb, &par);
+
+	switch (ret) {
+	case NF_ACCEPT:
+		result = TC_ACT_OK;
+		break;
+	case NF_DROP:
+		result = TC_ACT_SHOT;
+		ipt->tcf_qstats.drops++;
+		break;
+	case XT_CONTINUE:
+		result = TC_ACT_PIPE;
+		break;
+	default:
+		net_notice_ratelimited
+		    ("tc filter: Bogus netfilter code %d assume ACCEPT\n", ret);
+		result = TC_POLICE_OK;
+		break;
+	}
+	spin_unlock(&ipt->tcf_lock);
+	return result;
+
+}
+
+static int tcf_ipt_dump(struct sk_buff *skb, struct tc_action *a, int bind,
+			int ref)
+{
+	unsigned char *b = skb_tail_pointer(skb);
+	struct tcf_ipt *ipt = a->priv;
+	struct xt_entry_target *t;
+	struct tcf_t tm;
+	struct tc_cnt c;
+
+	/* for simple targets kernel size == user size
+	 * user name = target name
+	 * for foolproof you need to not assume this
+	 */
+
+	t = kmemdup(ipt->tcfi_t, ipt->tcfi_t->u.user.target_size, GFP_ATOMIC);
+	if (unlikely(!t))
+		goto nla_put_failure;
+
+	c.bindcnt = ipt->tcf_bindcnt - bind;
+	c.refcnt = ipt->tcf_refcnt - ref;
+	strcpy(t->u.user.name, ipt->tcfi_t->u.kernel.target->name);
+
+	if (nla_put(skb, TCA_IPT_TARG, ipt->tcfi_t->u.user.target_size, t) ||
+	    nla_put_u32(skb, TCA_IPT_INDEX, ipt->tcf_index) ||
+	    nla_put_u32(skb, TCA_IPT_HOOK, ipt->tcfi_hook) ||
+	    nla_put(skb, TCA_IPT_CNT, sizeof(struct tc_cnt), &c) ||
+	    nla_put_string(skb, TCA_IPT_TABLE, ipt->tcfi_tname))
+		goto nla_put_failure;
+	tm.install = jiffies_to_clock_t(jiffies - ipt->tcf_tm.install);
+	tm.lastuse = jiffies_to_clock_t(jiffies - ipt->tcf_tm.lastuse);
+	tm.expires = jiffies_to_clock_t(ipt->tcf_tm.expires);
+	if (nla_put(skb, TCA_IPT_TM, sizeof(tm), &tm))
+		goto nla_put_failure;
+	kfree(t);
+	return skb->len;
+
+nla_put_failure:
+	nlmsg_trim(skb, b);
+	kfree(t);
+	return -1;
+}
+
+static struct tc_action_ops act_ipt_ops = {
+	.kind = "xt",
+	.hinfo = &ipt_hash_info,
+	.type = TCA_ACT_IPT,
+	.capab = TCA_CAP_NONE,
+	.owner = THIS_MODULE,
+	.act = tcf_ipt,
+	.dump = tcf_ipt_dump,
+	.cleanup = tcf_ipt_cleanup,
+	.lookup = tcf_hash_search,
+	.init = tcf_ipt_init,
+	.walk = tcf_generic_walker
+};
+
+MODULE_AUTHOR("Jamal Hadi Salim(2002-12)");
+MODULE_DESCRIPTION("New Iptables target actions");
+MODULE_LICENSE("GPL");
+
+static int __init ipt_init_module(void)
+{
+	return tcf_register_action(&act_ipt_ops);
+}
+
+static void __exit ipt_cleanup_module(void)
+{
+	tcf_unregister_action(&act_ipt_ops);
+}
+
+module_init(ipt_init_module);
+module_exit(ipt_cleanup_module);

^ permalink raw reply related

* PMTU discovery is broken on kernel 3.7.1 for UDP sockets
From: Yurij M. Plotnikov @ 2012-12-19 13:10 UTC (permalink / raw)
  To: netdev, Ben Hutchings, Alexandra N. Kossovsky

On kernel 3.7.1 I get strange behaviour of IP_MTU_DISCOVER socket 
option. The behaviour in case of IP_PMTUDISC_DO and IP_PMTUDISC_WANT 
values of IP_MTU_DISCOVER socket option on SOCK_DGRAM socket are the 
same and packet is always sent with "Don't Fragment" bit in case of 
IP_PMTUDISC_WANT. Also, the value of IP_MTU socket option is not updated.

This can be reproduced with 3 hosts configuration. Let it be the hosts: 
host_A, host_B and host _C. host_A via interface eth1 connected with 
host_B via intefaces eth1. Let host_C via interface eth1 connected with 
host_B via interface eth2. Also Lets address 10.0.1.1/24 is assigned to 
eth1 on host_A; 10.0.1.2/24 is assigned to eth1 on host_B; 10.0.2.1/24 
is assigned to eth2 on host_B; 10.0.2.2/24 is assigned to eth1 on 
host_C. Also there are two routes: "10.0.2.2 via 10.0.1.2 dev eth1" on 
host_A and "10.0.1.1 via 10.0.2.1 dev eth1" on host_C. Also forwarding 
is on on host_B. So we have the following picture:

host_A-eth1(10.0.1.1)<-->(10.0.1.2)eth1-host_B-eth2(10.0.2.1)<-->(10.0.2.2)eth1-host_C

MTU is equal to 1500 on all involved interfaces. Then we make the 
followign steps:

on host_A:
1. socket(SOCK_DGRAM) -> 6
2. bind(6, 10.0.1.1:25630) -> 0
on host_C:
3. socket(SOCK_DGRAM) -> 5
4. bind(5, 10.0.2.2:25631) -> 0
on host_A:
5. connect(6, 10.0.2.2:25631) -> 0
on host_C:
6. connect(5, 10.0.1.1:25630) -> 0
on host_A
7. getsockopt(6,IP_MTU) -> 0 // Returns that MTU is 1500
8. getsockopt(6,IP_MTU_DISCOVER) -> 0 // Returns that default value is 
IP_PMTUDISC_WANT

On eth2 on host_B and on eth1 on host_C change MTU from 1500 to 750. 
Wait for a while.

9. send(6, lenght=1400) -> 1400 // the packet is sent with "Don't 
Fragment" bit, tcpdump on eth1 on host_B shows it
10. sleep(5);
11. send(6, length=1400) -> -1 with EMSGSIZE
12. sleep(5);
13. getsockopt(6,IP_MTU) -> 0 // Returns that MTU is 1500 once again. So 
value is not updated.
14. send(6, lenght=1400) -> 1400 // the packet one again is sent with 
"Don't Fragment" bit, tcpdump on eth1 on host_B shows it

So "Don't Fragment" bit is always set for the packets in case when value 
of IP_MTU_DISCOVER is IP_PMTUDISC_WANT. If at step 8 we change 
IP_MTU_DISCOVER value from IP_PMTUDISC_WANT to IP_PMTUDISC_DO we have 
the same picture. The value of IP_MTU socket options is still 1500 at 
step 13 in this case.

^ permalink raw reply

* Re: PMTU discovery is broken on kernel 3.7.1 for UDP sockets
From: Ben Hutchings @ 2012-12-19 13:35 UTC (permalink / raw)
  To: Yurij M. Plotnikov; +Cc: netdev, Alexandra N. Kossovsky
In-Reply-To: <50D1BCC0.2000208@oktetlabs.ru>

On Wed, 2012-12-19 at 17:10 +0400, Yurij M. Plotnikov wrote:
> On kernel 3.7.1 I get strange behaviour of IP_MTU_DISCOVER socket 
> option. The behaviour in case of IP_PMTUDISC_DO and IP_PMTUDISC_WANT 
> values of IP_MTU_DISCOVER socket option on SOCK_DGRAM socket are the 
> same and packet is always sent with "Don't Fragment" bit in case of 
> IP_PMTUDISC_WANT. Also, the value of IP_MTU socket option is not updated.

You could try reverting:

commit ee9a8f7ab2edf801b8b514c310455c94acc232f6
Author: Steffen Klassert <steffen.klassert@secunet.com>
Date:   Mon Oct 8 00:56:54 2012 +0000

    ipv4: Don't report stale pmtu values to userspace
    
    We report cached pmtu values even if they are already expired.
    Change this to not report these values after they are expired
    and fix a race in the expire time calculation, as suggested by
    Eric Dumazet.

Still, PMTU information is not supposed to expire for 10 minutes...

[...]
> On eth2 on host_B and on eth1 on host_C change MTU from 1500 to 750. 
> Wait for a while.
> 
> 9. send(6, lenght=1400) -> 1400 // the packet is sent with "Don't 
> Fragment" bit, tcpdump on eth1 on host_B shows it
> 10. sleep(5);
> 11. send(6, length=1400) -> -1 with EMSGSIZE
> 12. sleep(5);
> 13. getsockopt(6,IP_MTU) -> 0 // Returns that MTU is 1500 once again. So 
> value is not updated.
[...]

What if you read this option immediately before the sleep(5)?

Ben.

^ permalink raw reply

* Re: [PATCH V2 00/12] Add basic VLAN support to bridges
From: Vlad Yasevich @ 2012-12-19 14:13 UTC (permalink / raw)
  To: Shmulik Ladkani; +Cc: netdev, shemminger, davem, or.gerlitz, jhs, mst
In-Reply-To: <20121219101006.7086faef@pixies.home.jungo.com>

On 12/19/2012 03:10 AM, Shmulik Ladkani wrote:
> Thanks Vlad,
>
> On Tue, 18 Dec 2012 14:00:51 -0500 Vlad Yasevich <vyasevic@redhat.com> wrote:
>> A single vlan may also be designated as untagged.  Any untagged traffic
>> recieved by the port will be assigned to this vlan.
>
> Why the "untagged vlan" is per-bridge global?
> Usually, 802.1q switches define the PVID (port's VID) which controls
> the value of VID, in case ingress frame is either untagged or
> priority-tagged (per port configuration).
> This gives greater flexibility.

It's not.  There is a per port untagged pointer where you can designate 
which VLAN is untagged/native on a port.  The bride interface itself
can also function as a port, so it gets its own untagged pointer so
it can behave similar to port.

>
>> Any traffic exiting
>> the port with a VID matching the untagged vlan will exit untagged (the
>> bridge will strip the vlan header).  This is similar to "Native Vlan" support
>> available in most switches.
>
> 802.1q switches usually allow conifguring per-vlan, per-port
> tagged/untagged egress policy: each vid has its port membership map and
> an accompanying port egress-policy map.
> This gives great flexibility defining all sorts of configurations.

Right, and that's what's provided here.
  * Each VLAN has port membership map (net_bridge_vlan.portgroup).
  * Each port has a list of vlans configured as well 
(net_port_vlan.vlan_list).
  * Each port also has a single vlan that can be untagged 
(net_bridge_port.untagged).
  * The bridge also has a single untagged vlan (net_bridge.untagged)

The limitation (in switches as well) is that only a single VLAN
may be untagged on any 1 port.  If you have more then 1, you don't know
which VLAN the untagged traffic belongs to.

>
> Personally, I'd prefer a fully flexible vlan bridge allowing all sorts
> of configurations (as available in 802.1q switches).
>
> What's the reason limiting such configurations?

So, what do you see that's missing?

-vlad

>
> Regards,
> Shmulik
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* [PATCH net] net: qmi_wwan: add ZTE MF880
From: Bjørn Mork @ 2012-12-19 14:15 UTC (permalink / raw)
  To: netdev; +Cc: linux-usb, Bjørn Mork

The driver description files gives these names to the vendor specific
functions on this modem:

 diag: VID_19D2&PID_0284&MI_00
 nmea: VID_19D2&PID_0284&MI_01
 at:   VID_19D2&PID_0284&MI_02
 mdm:  VID_19D2&PID_0284&MI_03
 net:  VID_19D2&PID_0284&MI_04

Signed-off-by: Bjørn Mork <bjorn@mork.no>
---
 drivers/net/usb/qmi_wwan.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 9b950f5..91d7cb9 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -433,6 +433,7 @@ static const struct usb_device_id products[] = {
 	{QMI_FIXED_INTF(0x19d2, 0x0199, 1)},	/* ZTE MF820S */
 	{QMI_FIXED_INTF(0x19d2, 0x0200, 1)},
 	{QMI_FIXED_INTF(0x19d2, 0x0257, 3)},	/* ZTE MF821 */
+	{QMI_FIXED_INTF(0x19d2, 0x0284, 4)},	/* ZTE MF880 */
 	{QMI_FIXED_INTF(0x19d2, 0x0326, 4)},	/* ZTE MF821D */
 	{QMI_FIXED_INTF(0x19d2, 0x1008, 4)},	/* ZTE (Vodafone) K3570-Z */
 	{QMI_FIXED_INTF(0x19d2, 0x1010, 4)},	/* ZTE (Vodafone) K3571-Z */
-- 
1.7.10.4

^ permalink raw reply related

* Re: PMTU discovery is broken on kernel 3.7.1 for UDP sockets
From: Yurij M. Plotnikov @ 2012-12-19 14:27 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev, Alexandra N. Kossovsky
In-Reply-To: <1355924119.2676.6.camel@bwh-desktop.uk.solarflarecom.com>

On 12/19/12 17:35, Ben Hutchings wrote:
> On Wed, 2012-12-19 at 17:10 +0400, Yurij M. Plotnikov wrote:
>    
>> On kernel 3.7.1 I get strange behaviour of IP_MTU_DISCOVER socket
>> option. The behaviour in case of IP_PMTUDISC_DO and IP_PMTUDISC_WANT
>> values of IP_MTU_DISCOVER socket option on SOCK_DGRAM socket are the
>> same and packet is always sent with "Don't Fragment" bit in case of
>> IP_PMTUDISC_WANT. Also, the value of IP_MTU socket option is not updated.
>>      
> You could try reverting:
>
> commit ee9a8f7ab2edf801b8b514c310455c94acc232f6
> Author: Steffen Klassert<steffen.klassert@secunet.com>
> Date:   Mon Oct 8 00:56:54 2012 +0000
>
>      ipv4: Don't report stale pmtu values to userspace
>
>      We report cached pmtu values even if they are already expired.
>      Change this to not report these values after they are expired
>      and fix a race in the expire time calculation, as suggested by
>      Eric Dumazet.
>
> Still, PMTU information is not supposed to expire for 10 minutes...
>
>    
With reverted commit there is no such problem on 3.7.1: IP_MTU is 
updated and DF is set only for the first packet in case of 
IP_PMTUDISC_WANT.
> [...]
>    
>> On eth2 on host_B and on eth1 on host_C change MTU from 1500 to 750.
>> Wait for a while.
>>
>> 9. send(6, lenght=1400) ->  1400 // the packet is sent with "Don't
>> Fragment" bit, tcpdump on eth1 on host_B shows it
>> 10. sleep(5);
>> 11. send(6, length=1400) ->  -1 with EMSGSIZE
>> 12. sleep(5);
>> 13. getsockopt(6,IP_MTU) ->  0 // Returns that MTU is 1500 once again. So
>> value is not updated.
>>      
> [...]
>
> What if you read this option immediately before the sleep(5)?
>    
It still returns that MTU is 1500.

Yurij.

^ permalink raw reply

* Re: [PATCH V2 09/12] bridge: Add the ability to configure untagged vlans
From: Vlad Yasevich @ 2012-12-19 14:50 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, shemminger, davem, or.gerlitz, jhs
In-Reply-To: <20121218231049.GD1135@redhat.com>

On 12/18/2012 06:10 PM, Michael S. Tsirkin wrote:
> On Tue, Dec 18, 2012 at 06:03:25PM -0500, Vlad Yasevich wrote:
>> On 12/18/2012 06:01 PM, Michael S. Tsirkin wrote:
>>> On Tue, Dec 18, 2012 at 02:01:00PM -0500, Vlad Yasevich wrote:
>>>> A user may designate a certain vlan as untagged.  This means that
>>>> any ingress frame is assigned to this vlan and any forwarding decisions
>>>> are made with this vlan in mind.  On egress, any frames tagged/labeled
>>>> with untagged vlan have the vlan tag removed and are send as regular
>>>> ethernet frames.
>>>>
>>>> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
>>>> ---
>>>>   include/uapi/linux/if_bridge.h |    3 +
>>>>   net/bridge/br_if.c             |  146 +++++++++++++++++++++++++++++++++++++---
>>>>   net/bridge/br_netlink.c        |    6 +-
>>>>   net/bridge/br_private.h        |    2 +
>>>>   4 files changed, 144 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h
>>>> index d0b4f5c..988d858 100644
>>>> --- a/include/uapi/linux/if_bridge.h
>>>> +++ b/include/uapi/linux/if_bridge.h
>>>> @@ -127,6 +127,9 @@ enum {
>>>>   	BR_VLAN_DEL,
>>>>   };
>>>>
>>>> +#define BRIDGE_VLAN_INFO_MASTER		1
>>>> +#define BRIDGE_VLAN_INFO_UNTAGGED	2
>>>> +
>>>>   struct bridge_vlan_info {
>>>>   	u16 op_code;
>>>>   	u16 flags;
>>>> diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
>>>> index 57bbb35..14563fb 100644
>>>> --- a/net/bridge/br_if.c
>>>> +++ b/net/bridge/br_if.c
>>>> @@ -108,6 +108,34 @@ static void br_vlan_put(struct net_bridge_vlan *vlan)
>>>>   		br_vlan_destroy(vlan);
>>>>   }
>>>>
>>>> +/* Must be protected by RTNL */
>>>> +static void br_vlan_add_untagged(struct net_bridge *br,
>>>> +				 struct net_bridge_vlan *vlan)
>>>> +{
>>>> +	ASSERT_RTNL();
>>>> +	if (br->untagged == vlan)
>>>> +		return;
>>>> +	else if (br->untagged) {
>>>> +		/* Untagged vlan is already set on the master,
>>>> +		 * so drop the ref since we'll be replacing it.
>>>> +		 */
>>>> +		br_vlan_put(br->untagged);
>>>> +	}
>>>> +	br_vlan_hold(vlan);
>>>> +	rcu_assign_pointer(br->untagged, vlan);
>>>
>>> Is there a reason for rcu here but not else where? If all users are under
>>> rtnl you can just assign in a simple way.
>>> If not then rcu_dereference_protected would be more appropriate.
>>
>> Everywhere that the pointer changes rcu_assign_pointer is used.
>>
>> Now, if we hold an RTNL, we can technically read the pointer with
>> rcu since it's guaranteed not to change since it only changes under
>> RTNL.
>> I'll check that this is consistent.
>
> Check what rcu_dereference_protected does. It's really just
> an explicit way to say "this is accessed without rcu because I have
> this lock".

Looks like the helper rtnl_dereference() already does what I need.  I'll 
use that.

Thanks
-vlad

>
>> If I access the pointer without rtnl, it's always inside rcu
>> critical section and with rcu_dereference().
>>
>> I thought those were the basic rules of rcu.  Did that change?
>>
>> -vlad
>
>
>
>>>
>>>> +}
>>>> +
>>>> +/* Must be protected by RTNL */
>>>> +static void br_vlan_del_untagged(struct net_bridge *br,
>>>> +				 struct net_bridge_vlan *vlan)
>>>> +{
>>>> +	ASSERT_RTNL();
>>>> +	if (br->untagged == vlan) {
>>>> +		br_vlan_put(vlan);
>>>> +		rcu_assign_pointer(br->untagged, NULL);
>>>> +	}
>>>> +}
>>>> +
>>>>   struct net_bridge_vlan *br_vlan_find(struct net_bridge *br, u16 vid)
>>>>   {
>>>>   	struct net_bridge_vlan *vlan;
>>>> @@ -132,7 +160,7 @@ struct net_bridge_vlan *br_vlan_add(struct net_bridge *br, u16 vid,
>>>>
>>>>   	vlan = br_vlan_find(br, vid);
>>>>   	if (vlan)
>>>> -		return vlan;
>>>> +		goto untagged;
>>>>
>>>>   	vlan = kzalloc(sizeof(struct net_bridge_vlan), GFP_KERNEL);
>>>>   	if (!vlan)
>>>> @@ -141,7 +169,7 @@ struct net_bridge_vlan *br_vlan_add(struct net_bridge *br, u16 vid,
>>>>   	vlan->vid = vid;
>>>>   	atomic_set(&vlan->refcnt, 1);
>>>>
>>>> -	if (flags & BRIDGE_FLAGS_SELF) {
>>>> +	if (flags & BRIDGE_VLAN_INFO_MASTER) {
>>>>   		/* Set bit 0 that is associated with the bridge master
>>>>   		 * device.  Port numbers start with 1.
>>>>   		 */
>>>> @@ -149,15 +177,24 @@ struct net_bridge_vlan *br_vlan_add(struct net_bridge *br, u16 vid,
>>>>   	}
>>>>
>>>>   	hlist_add_head_rcu(&vlan->hlist, &br->vlan_hlist[br_vlan_hash(vid)]);
>>>> +
>>>> +untagged:
>>>> +	if (flags & BRIDGE_VLAN_INFO_UNTAGGED)
>>>> +		br_vlan_add_untagged(br, vlan);
>>>> +
>>>>   	return vlan;
>>>>   }
>>>>
>>>>   /* Must be protected by RTNL */
>>>> -static void br_vlan_del(struct net_bridge_vlan *vlan, u16 flags)
>>>> +static void br_vlan_del(struct net_bridge *br, struct net_bridge_vlan *vlan,
>>>> +			u16 flags)
>>>>   {
>>>>   	ASSERT_RTNL();
>>>>
>>>> -	if (flags & BRIDGE_FLAGS_SELF) {
>>>> +	if (flags & BRIDGE_VLAN_INFO_UNTAGGED)
>>>> +		br_vlan_del_untagged(br, vlan);
>>>> +
>>>> +	if (flags & BRIDGE_VLAN_INFO_MASTER) {
>>>>   		/* Clear bit 0 that is associated with the bridge master
>>>>   		 * device.
>>>>   		 */
>>>> @@ -172,6 +209,14 @@ static void br_vlan_del(struct net_bridge_vlan *vlan, u16 flags)
>>>>
>>>>   	vlan->vid = BR_INVALID_VID;
>>>>
>>>> +	/* If, for whatever reason, bridge still has a ref on this vlan
>>>> +	 * through the @untagged pointer, drop that ref and clear untagged.
>>>> +	 */
>>>> +	if (br->untagged == vlan) {
>>>> +		br_vlan_put(vlan);
>>>> +		rcu_assign_pointer(br->untagged, NULL);
>>>> +	}
>>>> +
>>>>   	/* Drop the self-ref to trigger descrution. */
>>>>   	br_vlan_put(vlan);
>>>>   }
>>>> @@ -187,7 +232,7 @@ int br_vlan_delete(struct net_bridge *br, u16 vid, u16 flags)
>>>>   	if (!vlan)
>>>>   		return -ENOENT;
>>>>
>>>> -	br_vlan_del(vlan, flags);
>>>> +	br_vlan_del(br, vlan, flags);
>>>>   	return 0;
>>>>   }
>>>>
>>>> @@ -204,7 +249,9 @@ static void br_vlan_flush(struct net_bridge *br)
>>>>   	for (i = 0; i < BR_VID_HASH_SIZE; i++) {
>>>>   		hlist_for_each_entry_safe(vlan, node, tmp,
>>>>   					  &br->vlan_hlist[i], hlist) {
>>>> -			br_vlan_del(vlan, BRIDGE_FLAGS_SELF);
>>>> +			br_vlan_del(br, vlan,
>>>> +				    (BRIDGE_VLAN_INFO_MASTER |
>>>> +				     BRIDGE_VLAN_INFO_UNTAGGED));
>>>>   		}
>>>>   	}
>>>>   }
>>>> @@ -224,10 +271,70 @@ struct net_port_vlan *nbp_vlan_find(const struct net_bridge_port *p, u16 vid)
>>>>   	return NULL;
>>>>   }
>>>>
>>>> +static int nbp_vlan_add_untagged(struct net_bridge_port *p,
>>>> +			  struct net_bridge_vlan *vlan,
>>>> +			  u16 flags)
>>>> +{
>>>> +	struct net_device *dev = p->dev;
>>>> +
>>>> +	if (p->untagged) {
>>>> +		/* Port already has untagged vlan set.  Drop the ref
>>>> +		 * to the old one since we'll be replace it.
>>>> +		 */
>>>> +		br_vlan_put(p->untagged);
>>>> +	} else {
>>>> +		int err;
>>>> +
>>>> +		/* Add vid 0 to filter if filter is available. */
>>>> +		if ((dev->features & NETIF_F_HW_VLAN_FILTER) &&
>>>> +		    dev->netdev_ops->ndo_vlan_rx_add_vid &&
>>>> +		    dev->netdev_ops->ndo_vlan_rx_kill_vid) {
>>>> +			err = dev->netdev_ops->ndo_vlan_rx_add_vid(dev, 0);
>>>> +			if (err)
>>>> +				return err;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	/* This VLAN is handled as untagged/native. Save an
>>>> +	 * additional ref.
>>>> +	 */
>>>> +	br_vlan_hold(vlan);
>>>> +	rcu_assign_pointer(p->untagged, vlan);
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static void nbp_vlan_delete_untagged(struct net_bridge_port *p,
>>>> +				     struct net_bridge_vlan *vlan)
>>>> +{
>>>> +	if (p->untagged != vlan)
>>>> +		return;
>>>> +
>>>> +	/* Remove VLAN from the device filter if it is supported. */
>>>> +	if ((p->dev->features & NETIF_F_HW_VLAN_FILTER) &&
>>>> +	    p->dev->netdev_ops->ndo_vlan_rx_kill_vid) {
>>>> +		int err;
>>>> +
>>>> +		err = p->dev->netdev_ops->ndo_vlan_rx_kill_vid(p->dev, 0);
>>>> +		if (err) {
>>>> +			pr_warn("failed to kill vid %d for device %s\n",
>>>> +				vlan->vid, p->dev->name);
>>>> +		}
>>>> +	}
>>>> +
>>>> +	/* If this VLAN is currently functioning as untagged, clear it.
>>>> +	 * It's safe to drop the refcount, since the vlan is still held
>>>> +	 * by the port.
>>>> +	 */
>>>> +	br_vlan_put(vlan);
>>>> +	rcu_assign_pointer(p->untagged, NULL);
>>>> +
>>>> +}
>>>> +
>>>>   /* Must be protected by RTNL */
>>>>   int nbp_vlan_add(struct net_bridge_port *p, u16 vid, u16 flags)
>>>>   {
>>>> -	struct net_port_vlan *pve;
>>>> +	struct net_port_vlan *pve = NULL;
>>>>   	struct net_bridge_vlan *vlan;
>>>>   	struct net_device *dev = p->dev;
>>>>   	int err;
>>>> @@ -275,11 +382,21 @@ int nbp_vlan_add(struct net_bridge_port *p, u16 vid, u16 flags)
>>>>   	set_bit(p->port_no, vlan->port_bitmap);
>>>>
>>>>   	list_add_tail_rcu(&pve->list, &p->vlan_list);
>>>> +
>>>> +	if (flags & BRIDGE_VLAN_INFO_UNTAGGED) {
>>>> +		err = nbp_vlan_add_untagged(p, vlan, flags);
>>>> +		if (err)
>>>> +			goto del_vlan;
>>>> +	}
>>>> +
>>>>   	return 0;
>>>>
>>>>   clean_up:
>>>>   	kfree(pve);
>>>> -	br_vlan_del(vlan, flags);
>>>> +	br_vlan_del(p->br, vlan, flags);
>>>> +	return err;
>>>> +del_vlan:
>>>> +	nbp_vlan_delete(p, vid, flags);
>>>>   	return err;
>>>>   }
>>>>
>>>> @@ -296,6 +413,9 @@ int nbp_vlan_delete(struct net_bridge_port *p, u16 vid, u16 flags)
>>>>   	if (!pve)
>>>>   		return -ENOENT;
>>>>
>>>> +	if (flags & BRIDGE_VLAN_INFO_UNTAGGED)
>>>> +		nbp_vlan_delete_untagged(p, pve->vlan);
>>>> +
>>>>   	/* Remove VLAN from the device filter if it is supported. */
>>>>   	if ((dev->features & NETIF_F_HW_VLAN_FILTER) &&
>>>>   	    dev->netdev_ops->ndo_vlan_rx_kill_vid) {
>>>> @@ -306,6 +426,7 @@ int nbp_vlan_delete(struct net_bridge_port *p, u16 vid, u16 flags)
>>>>   			pr_warn("failed to kill vid %d for device %s\n",
>>>>   				vid, dev->name);
>>>>   	}
>>>> +
>>>>   	pve->vid = BR_INVALID_VID;
>>>>
>>>>   	vlan = pve->vlan;
>>>> @@ -316,7 +437,7 @@ int nbp_vlan_delete(struct net_bridge_port *p, u16 vid, u16 flags)
>>>>   	list_del_rcu(&pve->list);
>>>>   	kfree_rcu(pve, rcu);
>>>>
>>>> -	br_vlan_del(vlan, flags);
>>>> +	br_vlan_del(p->br, vlan, flags);
>>>>
>>>>   	return 0;
>>>>   }
>>>> @@ -328,8 +449,11 @@ static void nbp_vlan_flush(struct net_bridge_port *p)
>>>>
>>>>   	ASSERT_RTNL();
>>>>
>>>> -	list_for_each_entry_safe(pve, tmp, &p->vlan_list, list)
>>>> -		nbp_vlan_delete(p, pve->vid, BRIDGE_FLAGS_SELF);
>>>> +	list_for_each_entry_safe(pve, tmp, &p->vlan_list, list)  {
>>>> +		nbp_vlan_delete(p, pve->vid,
>>>> +				(BRIDGE_VLAN_INFO_MASTER |
>>>> +				 BRIDGE_VLAN_INFO_UNTAGGED));
>>>> +	}
>>>>   }
>>>>
>>>>   static void release_nbp(struct kobject *kobj)
>>>> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
>>>> index 9cf2879..1b302ce 100644
>>>> --- a/net/bridge/br_netlink.c
>>>> +++ b/net/bridge/br_netlink.c
>>>> @@ -199,7 +199,8 @@ static int br_afspec(struct net_bridge *br, struct net_bridge_port *p,
>>>>   			if (p)
>>>>   				err = nbp_vlan_add(p, vinfo->vid, vinfo->flags);
>>>>   			else {
>>>> -				u16 flags = vinfo->flags | BRIDGE_FLAGS_SELF;
>>>> +				u16 flags = vinfo->flags |
>>>> +					    BRIDGE_VLAN_INFO_MASTER;
>>>>   				if (!br_vlan_add(br, vinfo->vid, flags))
>>>>   					err = -ENOMEM;
>>>>   			}
>>>> @@ -210,7 +211,8 @@ static int br_afspec(struct net_bridge *br, struct net_bridge_port *p,
>>>>   				err = nbp_vlan_delete(p, vinfo->vid,
>>>>   						      vinfo->flags);
>>>>   			else {
>>>> -				u16 flags = vinfo->flags | BRIDGE_FLAGS_SELF;
>>>> +				u16 flags = vinfo->flags |
>>>> +					    BRIDGE_VLAN_INFO_MASTER;
>>>>   				err = br_vlan_delete(br, vinfo->vid, flags);
>>>>   			}
>>>>   			break;
>>>> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
>>>> index cc75212..9328463 100644
>>>> --- a/net/bridge/br_private.h
>>>> +++ b/net/bridge/br_private.h
>>>> @@ -179,6 +179,7 @@ struct net_bridge_port
>>>>   	struct netpoll			*np;
>>>>   #endif
>>>>   	struct list_head		vlan_list;
>>>> +	struct net_bridge_vlan __rcu	*untagged;
>>>>   };
>>>>
>>>>   #define br_port_exists(dev) (dev->priv_flags & IFF_BRIDGE_PORT)
>>>> @@ -298,6 +299,7 @@ struct net_bridge
>>>>   	struct timer_list		gc_timer;
>>>>   	struct kobject			*ifobj;
>>>>   	struct hlist_head		vlan_hlist[BR_VID_HASH_SIZE];
>>>> +	struct net_bridge_vlan __rcu	*untagged;
>>>>   };
>>>>
>>>>   struct br_input_skb_cb {
>>>> --
>>>> 1.7.7.6

^ permalink raw reply

* Re: [PATCH] pkt_sched: act_xt support new Xtables interface
From: Jan Engelhardt @ 2012-12-19 15:52 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Hasan Chowdhury, Stephen Hemminger, Yury Stankevich,
	netdev@vger.kernel.org, pablo, netfilter-devel
In-Reply-To: <50D1AB7E.5060000@mojatatu.com>


On Wednesday 2012-12-19 12:56, Jamal Hadi Salim wrote:
>
> To be applied pending more testing.
>
> Attached. Sorry, I thought I had sent this out over the weekend.
> I have done basic testing with a single mark and sending pings to
> update stats which can then displayed for the mark.
>
> diffstat xt-p1
> Kconfig  |   15 ++
> Makefile |    1 
> act_xt.c |  324 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 339 insertions(+), 1 deletion(-)

Humm... that's a huge patch for what seems to be equal to act_ipt.c
Let's do a cross-diff:

--- act_ipt.c	2012-10-25 19:49:25.372191795 +0200
+++ act_xt.c	2012-12-19 16:48:22.052419730 +0100
@@ -2 +2 @@
- * net/sched/ipt.c     iptables target interface
+ * net/sched/act_xt.c     iptables target interface
@@ -11 +11 @@
- * Copyright:	Jamal Hadi Salim (2002-4)
+ * Copyright:	Jamal Hadi Salim (2002-12)
@@ -30 +29,0 @@
-
@@ -42 +41,2 @@ static struct tcf_hashinfo ipt_hash_info
-static int ipt_init_target(struct xt_entry_target *t, char *table, unsigned int hook)
+static int ipt_init_target(struct xt_entry_target *t, char *table,
+			   unsigned int hook)
@@ -243,2 +243,2 @@ static int tcf_ipt(struct sk_buff *skb,
-		net_notice_ratelimited("tc filter: Bogus netfilter code %d assume ACCEPT\n",
-				       ret);
+		net_notice_ratelimited
+		    ("tc filter: Bogus netfilter code %d assume ACCEPT\n", ret);
@@ -253 +253,2 @@ static int tcf_ipt(struct sk_buff *skb,
-static int tcf_ipt_dump(struct sk_buff *skb, struct tc_action *a, int bind, int ref)
+static int tcf_ipt_dump(struct sk_buff *skb, struct tc_action *a, int bind,
+			int ref)
@@ -295 +296 @@ static struct tc_action_ops act_ipt_ops
-	.kind		=	"ipt",
+	.kind = "xt",
@@ -308,2 +309,2 @@ static struct tc_action_ops act_ipt_ops
-MODULE_AUTHOR("Jamal Hadi Salim(2002-4)");
-MODULE_DESCRIPTION("Iptables target actions");
+MODULE_AUTHOR("Jamal Hadi Salim(2002-12)");
+MODULE_DESCRIPTION("New Iptables target actions");


Is that [the set of hunks] all? Then I would instead suggest
something like:


diff --git a/net/sched/act_ipt.c b/net/sched/act_ipt.c
index 58fb3c7..f92a007 100644
--- a/net/sched/act_ipt.c
+++ b/net/sched/act_ipt.c
@@ -305,18 +305,43 @@ static struct tc_action_ops act_ipt_ops = {
 	.walk		=	tcf_generic_walker
 };
 
+static struct tc_action_ops act_xt_ops = {
+	.kind		=	"xt",
+	.hinfo		=	&ipt_hash_info,
+	.type		=	TCA_ACT_IPT,
+	.capab		=	TCA_CAP_NONE,
+	.owner		=	THIS_MODULE,
+	.act		=	tcf_ipt,
+	.dump		=	tcf_ipt_dump,
+	.cleanup	=	tcf_ipt_cleanup,
+	.lookup		=	tcf_hash_search,
+	.init		=	tcf_ipt_init,
+	.walk		=	tcf_generic_walker
+};
+
 MODULE_AUTHOR("Jamal Hadi Salim(2002-4)");
 MODULE_DESCRIPTION("Iptables target actions");
 MODULE_LICENSE("GPL");
+MODULE_ALIAS("act_xt");
 
 static int __init ipt_init_module(void)
 {
-	return tcf_register_action(&act_ipt_ops);
+	int ret;
+	ret = tcf_register_action(&act_ipt_ops);
+	if (ret < 0)
+		return ret;
+	ret = tcf_register_action(&xt_ipt_ops);
+	if (ret < 0) {
+		tcf_unregister_action(&act_ipt_ops);
+		return ret;
+	}
+	return 0;
 }
 
 static void __exit ipt_cleanup_module(void)
 {
 	tcf_unregister_action(&act_ipt_ops);
+	tcf_unregister_action(&act_xt_ops);
 }
 
 module_init(ipt_init_module);

^ permalink raw reply related

* Re: [PATCH 4/4] FEC: Add time stamping code and a PTP hardware clock
From: Ben Hutchings @ 2012-12-19 15:53 UTC (permalink / raw)
  To: Frank Li, Sascha Hauer
  Cc: Richard Cochran, Shawn Guo, Frank Li, lznua, linux-arm-kernel,
	netdev, davem
In-Reply-To: <20121218070420.GA2946@netboy.at.omicron.at>

On Tue, Dec 18, 2012 at 08:04:20AM +0100, Richard Cochran wrote:
> On Mon, Dec 17, 2012 at 09:02:32PM +0100, Sascha Hauer wrote:
> > This leaves an option in the tree which can be used to break FEC on
> > i.MX3/5.
> > 
> > 	depends on !SOC_IMX31 && !SOC_IMX35 && !SOC_IMX5
> > 
> > might be an option, but given that this patch seems to have bypassed any
> > review I feel more like reverting it.
> 
> Instead of reverting, I suggest finding a solution (Frank) to let the
> code work when it can work and to prevent it when it cannot. This
> could be kconfig, DT, or run time probing of silicon revisions, but I
> don't have access to this hardware, and so I can't really say how to
> fix it.
[...]

Please implement run-time probing.  A different configuration for
each SoC is just not sustainable for distributions.

Ben.

-- 
Ben Hutchings
We get into the habit of living before acquiring the habit of thinking.
                                                              - Albert Camus

^ permalink raw reply

* Re: [RFC PATCH] fix IP_ECN_set_ce
From: Eric Dumazet @ 2012-12-19 16:14 UTC (permalink / raw)
  To: roy.qing.li; +Cc: netdev
In-Reply-To: <1355898095-7444-1-git-send-email-roy.qing.li@gmail.com>

On Wed, 2012-12-19 at 14:21 +0800, roy.qing.li@gmail.com wrote:
> From: Li RongQing <roy.qing.li@gmail.com>
> 
> 1. ECN uses the two least significant (right-most) bits of the DiffServ
> field in the IPv4, so it should be in iph->tos, not in (iph->tos+1)
> 
> 2. When setting CE, we should check if ECN Capable Transport supports,
> both 10 and 01 mean ECN Capable Transport, so only check 10 is not enough
>     00: Non ECN-Capable Transport — Non-ECT
>     10: ECN Capable Transport — ECT(0)
>     01: ECN Capable Transport — ECT(1)
>     11: Congestion Encountered — CE
> 
> 3. Remove the misunderstand comment
> 
> 4. fix the checksum computation
> 
> Signed-off-by: Li RongQing <roy.qing.li@gmail.com>

This is total crap.

Its perfectly clear to me and compiler generates fast code.

If you don't understand this code, please don't touch it.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox