Netdev List
 help / color / mirror / Atom feed
* Re: [RFC net-next 0/3] devconf: New infrastructure for setting pre-load parameters for a module
From: Greg KH @ 2015-01-08 16:46 UTC (permalink / raw)
  To: Hadar Hen Zion
  Cc: David S. Miller, netdev, Amir Vadai, Yevgeny Petrilin, Or Gerlitz,
	shannon.nelson, Doug Ledford, greearb
In-Reply-To: <1420730220-20224-1-git-send-email-hadarh@mellanox.com>

On Thu, Jan 08, 2015 at 05:16:57PM +0200, Hadar Hen Zion wrote:
> Hi,
>     
> When configuring a device at an early boot stage, most kernel drivers
> use module parameters (the parameters' settings can be determined in
> modprobe.d config files).

Which is a bad idea, as you have learned :)

> These parameters are difficult to manage, and one of the reasons is that
> module parameters are set per driver and not per device (NICs using the
> same driver cannot be set with different configurations).
> Furthermore, using other existing configuration tools like ethtool,
> ifconfig, ip link commands or sysfs entries is not applicable, since
> they all rely on having a netdev already set up.
> 
> In the past, 'request_firmware' solution for configuration parameters
> was suggested by Shannon Nelson from Intel[1]. The idea was rejected by
> Greg KH, who claimed it was abusive of the request_firmware mechanism.
> Greg suggested using configfs for device configuration instead (as done
> by the USB gadget driver).
> 
> As a solution, we introduce a new kernel infrastructure using configfs
> to allow the configuration of the device. The goal is to set low-level
> device functionality that needs to be sorted out before a module is
> loaded.

Ick, really?  drivers should never need to be configured like this, if
so, then something is wrong, they should "just work" by default.  What
are you needing to "configure" that can't be determined by something
like a device tree?

greg k-h

^ permalink raw reply

* Re: [RFC net-next 0/3] devconf: New infrastructure for setting pre-load parameters for a module
From: Amir Vadai @ 2015-01-08 17:11 UTC (permalink / raw)
  To: Greg KH, Hadar Hen Zion
  Cc: David S. Miller, netdev@vger.kernel.org, Yevgeny Petrilin,
	Or Gerlitz, shannon.nelson@intel.com, Doug Ledford,
	greearb@candelatech.com
In-Reply-To: <20150108164618.GA11696@kroah.com>

On 1/8/2015 6:46 PM, Greg KH wrote:
> On Thu, Jan 08, 2015 at 05:16:57PM +0200, Hadar Hen Zion wrote:
>> Hi,
>>     
>> When configuring a device at an early boot stage, most kernel drivers
>> use module parameters (the parameters' settings can be determined in
>> modprobe.d config files).
> 
> Which is a bad idea, as you have learned :)
> 
>> These parameters are difficult to manage, and one of the reasons is that
>> module parameters are set per driver and not per device (NICs using the
>> same driver cannot be set with different configurations).
>> Furthermore, using other existing configuration tools like ethtool,
>> ifconfig, ip link commands or sysfs entries is not applicable, since
>> they all rely on having a netdev already set up.
>>
>> In the past, 'request_firmware' solution for configuration parameters
>> was suggested by Shannon Nelson from Intel[1]. The idea was rejected by
>> Greg KH, who claimed it was abusive of the request_firmware mechanism.
>> Greg suggested using configfs for device configuration instead (as done
>> by the USB gadget driver).
>>
>> As a solution, we introduce a new kernel infrastructure using configfs
>> to allow the configuration of the device. The goal is to set low-level
>> device functionality that needs to be sorted out before a module is
>> loaded.
> 
> Ick, really?  drivers should never need to be configured like this, if
> so, then something is wrong, they should "just work" by default.  What
> are you needing to "configure" that can't be determined by something
> like a device tree?
Ick indeed - but we can't find anything better.

For example, we have devices that can act as either netdev or as an
Infiniband device.
The driver consists of a core to handle the hardware, and higher layer
drivers - one for Ethernet and one for Infiniband.
Today the selection is done through a module parameter. according to it
the relevant higher level driver is loaded, and the device is
initialized. You don't want to have a default of netdev, and in every
installation that needs Infiniband, a netdev will be created, removed
and only then the Infiniband device will appear.
This is only one example to configuration that needs to be known before
the hardware is initialized, and be persistent across boots.

We can have a 2 stages loading. First load the core, wait for user
input, and only then configure the device and load the right upper layer
driver according to the user input (configfs/sysfs).
Since other vendors seems to need this capability too, we thought it
would be better to make it generic - and this is this what this RFC is
about.

Amir




> 
> greg k-h
> 

^ permalink raw reply

* Loan Offer at 3%
From: Cashland Financial Service @ 2015-01-08 16:57 UTC (permalink / raw)
  To: Recipients

Loan Offer at 3%, Feel Free to REPLY back to us for more info.

^ permalink raw reply

* Re: [PATCH net-next v2] tcp: avoid reducing cwnd when ACK+DSACK is received
From: Eric Dumazet @ 2015-01-08 17:17 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Sébastien Barré, David Miller, Netdev, Gregory Detal,
	Nandita Dukkipati, Yuchung Cheng
In-Reply-To: <1420734325.5947.61.camel@edumazet-glaptop2.roam.corp.google.com>

On Thu, 2015-01-08 at 08:25 -0800, Eric Dumazet wrote:
> On Thu, 2015-01-08 at 10:49 -0500, Neal Cardwell wrote:
> > On Thu, Jan 8, 2015 at 10:39 AM, Sébastien Barré
> > <sebastien.barre@uclouvain.be> wrote:
> > > What do you and Neal think ?
> > 
> > My preference is to have the whole expression detecting the case where
> > the receiver got the probe packet encoded in a single expression. I
> > don't have a strong feeling about whether it should be stored in a
> > bool (to reduce the size of the diff) or written directly into the if
> > () expression (to reduce the size of the code). I'll defer to Eric on
> > which he thinks is better. :-)
> 
> There is no shame using helpers with nice names to help understand this
> TCP stack. Even if the helper is used exactly once.
> 
> In this case, it seems we test 2 different conditions, so this could use
> 2 helpers with self describing names.
> 
> When I see : 
> 
> > +     if (((ack == tp->tlp_high_seq) &&
> > +          !(flag & (FLAG_SND_UNA_ADVANCED |
> > +                    FLAG_NOT_DUP | FLAG_DATA_SACKED))) ||
> > +         (!before(ack, tp->tlp_high_seq) && (flag & FLAG_DSACKING_ACK))) {
> >               tp->tlp_high_seq = 0;
> 
> My brain hurts.

BTW, tlp_high_seq is used no matter what (initial value is 0).

While the probability ACK seq being exactly 0 is small (1 out of 2^32),
probability that !before(ack, tp->tlp_high_seq) being a false positive
is very high. Initial sequence number are supposed to be random.

I wonder if setting tp->tlp_high_seq to 0 is the right thing to do.

^ permalink raw reply

* Re: [PATCH net-next v2] tcp: avoid reducing cwnd when ACK+DSACK is received
From: Eric Dumazet @ 2015-01-08 17:27 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Sébastien Barré, David Miller, Netdev, Gregory Detal,
	Nandita Dukkipati, Yuchung Cheng
In-Reply-To: <1420737469.5947.66.camel@edumazet-glaptop2.roam.corp.google.com>

On Thu, 2015-01-08 at 09:17 -0800, Eric Dumazet wrote:

> BTW, tlp_high_seq is used no matter what (initial value is 0).
> 
> While the probability ACK seq being exactly 0 is small (1 out of 2^32),
> probability that !before(ack, tp->tlp_high_seq) being a false positive
> is very high. Initial sequence number are supposed to be random.
> 

Oh well, calls to tcp_process_tlp_ack() are guarded by :

if (tp->tlp_high_seq)
    tcp_process_tlp_ack(sk, ack, flag);

^ permalink raw reply

* [PATCH net 1/1] tipc: fix bug in broadcast retransmit code
From: Jon Maloy @ 2015-01-08 17:27 UTC (permalink / raw)
  To: davem
  Cc: netdev, Paul Gortmaker, erik.hugne, ying.xue, maloy,
	tipc-discussion, Jon Maloy

In commit 58dc55f25631178ee74cd27185956a8f7dcb3e32 ("tipc: use generic
SKB list APIs to manage link transmission queue") we replace all list
traversal loops with the macros skb_queue_walk() or
skb_queue_walk_safe(). While the previous loops were based on the
assumption that the list was NULL-terminated, the standard macros
stop when the iterator reaches the list head, which is non-NULL.

In the function bclink_retransmit_pkt() this macro replacement has
lead to a bug. When we receive a BCAST STATE_MSG we unconditionally
call the function bclink_retransmit_pkt(), whether there really is
anything to retransmit or not, assuming that the sequence number
comparisons will lead to the correct behavior. However, if the
transmission queue is empty, or if there are no eligible buffers in
the transmission queue, we will by mistake pass the list head pointer
to the function tipc_link_retransmit(). Since the list head is not a
valid sk_buff, this leads to a crash.

In this commit we fix this by only calling tipc_link_retransmit()
if we actually found eligible buffers in the transmission queue.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
---
 net/tipc/bcast.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
index 96ceefe..a9e174f 100644
--- a/net/tipc/bcast.c
+++ b/net/tipc/bcast.c
@@ -220,10 +220,11 @@ static void bclink_retransmit_pkt(u32 after, u32 to)
 	struct sk_buff *skb;
 
 	skb_queue_walk(&bcl->outqueue, skb) {
-		if (more(buf_seqno(skb), after))
+		if (more(buf_seqno(skb), after)) {
+			tipc_link_retransmit(bcl, skb, mod(to - after));
 			break;
+		}
 	}
-	tipc_link_retransmit(bcl, skb, mod(to - after));
 }
 
 /**
-- 
1.9.1

^ permalink raw reply related

* Re: [net-next PATCH v1 02/11] net: flow_table: add flow, delete flow
From: Jiri Pirko @ 2015-01-08 17:39 UTC (permalink / raw)
  To: John Fastabend; +Cc: tgraf, sfeldma, jhs, simon.horman, netdev, davem, andy
In-Reply-To: <20141231194615.31070.18038.stgit@nitbit.x32>

Wed, Dec 31, 2014 at 08:46:16PM CET, john.fastabend@gmail.com wrote:
>Now that the device capabilities are exposed we can add support to
>add and delete flows from the tables.
>
>The two operations are
>
>table_set_flows :
>
>  The set flow operations is used to program a set of flows into a
>  hardware device table. The message is consumed via netlink encoded
>  message which is then decoded into a null terminated  array of
>  flow entry structures. A flow entry structure is defined as
>
>     struct net_flow_flow {
>			  int table_id;
>			  int uid;
>			  int priority;
>			  struct net_flow_field_ref *matches;
>			  struct net_flow_action *actions;
>     }
>
>  The table id is the _uid_ returned from 'get_tables' operatoins.
>  Matches is a set of match criteria for packets with a logical AND
>  operation done on the set so packets match the entire criteria.
>  Actions provide a set of actions to perform when the flow rule is
>  hit. Both matches and actions are null terminated arrays.
>
>  The flows are configured in hardware using an ndo op. We do not
>  provide a commit operation at the moment and expect hardware
>  commits the flows one at a time. Future work may require a commit
>  operation to tell the hardware we are done loading flow rules. On
>  some hardware this will help bulk updates.
>
>  Its possible for hardware to return an error from a flow set
>  operation. This can occur for many reasons both transient and
>  resource constraints. We have different error handling strategies
>  built in and listed here,
>
>    *_ERROR_ABORT      abort on first error with errmsg
>
>    *_ERROR_CONTINUE   continue programming flows no errmsg
>
>    *_ERROR_ABORT_LOG  abort on first error and return flow that
> 		       failed to user space in reply msg
>
>    *_ERROR_CONT_LOG   continue programming flows and return a list
>		       of flows that failed to user space in a reply
>		       msg.
>
>  notably missing is a rollback error strategy. I don't have a
>  use for this in software yet but the strategy can be added with
>  *_ERROR_ROLLBACK for example.
>
>table_del_flows
>
>  The delete flow operation uses the same structures and error
>  handling strategies as the table_set_flows operations. Although on
>  delete messges ommit the matches/actions arrays because they are
>  not needed to lookup the flow.
>
>Also thanks to Simon Horman for fixes and other help.
>
>Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>---
> include/linux/if_flow.h      |   21 ++
> include/linux/netdevice.h    |    8 +
> include/uapi/linux/if_flow.h |   49 ++++
> net/core/flow_table.c        |  501 ++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 579 insertions(+)
>
>diff --git a/include/linux/if_flow.h b/include/linux/if_flow.h
>index 1b6c1ea..20fa752 100644
>--- a/include/linux/if_flow.h
>+++ b/include/linux/if_flow.h
>@@ -90,4 +90,25 @@ struct net_flow_tbl_node {
> 	__u32 flags;
> 	struct net_flow_jump_table *jump;
> };
>+
>+/**
>+ * @struct net_flow_flow
>+ * @brief describes the match/action entry
>+ *
>+ * @uid unique identifier for flow
>+ * @priority priority to execute flow match/action in table
>+ * @match null terminated set of match uids match criteria
>+ * @actoin null terminated set of action uids to apply to match
>+ *
>+ * Flows must match all entries in match set.
>+ */
>+struct net_flow_flow {
>+	int table_id;
>+	int uid;
>+	int priority;
>+	struct net_flow_field_ref *matches;
>+	struct net_flow_action *actions;
>+};
>+
>+int net_flow_put_flow(struct sk_buff *skb, struct net_flow_flow *flow);
> #endif
>diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>index 3c3c856..be8d4e4 100644
>--- a/include/linux/netdevice.h
>+++ b/include/linux/netdevice.h
>@@ -1197,6 +1197,14 @@ struct net_device_ops {
> 	struct net_flow_header	**(*ndo_flow_get_headers)(struct net_device *dev);
> 	struct net_flow_hdr_node **(*ndo_flow_get_hdr_graph)(struct net_device *dev);
> 	struct net_flow_tbl_node **(*ndo_flow_get_tbl_graph)(struct net_device *dev);
>+	int		        (*ndo_flow_get_flows)(struct sk_buff *skb,
>+						      struct net_device *dev,
>+						      int table,
>+						      int min, int max);
>+	int		        (*ndo_flow_set_flows)(struct net_device *dev,
>+						      struct net_flow_flow *f);
>+	int		        (*ndo_flow_del_flows)(struct net_device *dev,
>+						      struct net_flow_flow *f);
> #endif
> };
> 
>diff --git a/include/uapi/linux/if_flow.h b/include/uapi/linux/if_flow.h
>index 2acdb38..125cdc6 100644
>--- a/include/uapi/linux/if_flow.h
>+++ b/include/uapi/linux/if_flow.h
>@@ -329,6 +329,48 @@ enum {
> #define NET_FLOW_TABLE_GRAPH_MAX (__NET_FLOW_TABLE_GRAPH_MAX - 1)
> 
> enum {
>+	NET_FLOW_NET_FLOW_UNSPEC,
>+	NET_FLOW_FLOW,
>+	__NET_FLOW_NET_FLOW_MAX,
>+};
>+#define NET_FLOW_NET_FLOW_MAX (__NET_FLOW_NET_FLOW_MAX - 1)
>+
>+enum {
>+	NET_FLOW_TABLE_FLOWS_UNSPEC,
>+	NET_FLOW_TABLE_FLOWS_TABLE,
>+	NET_FLOW_TABLE_FLOWS_MINPRIO,
>+	NET_FLOW_TABLE_FLOWS_MAXPRIO,
>+	NET_FLOW_TABLE_FLOWS_FLOWS,
>+	__NET_FLOW_TABLE_FLOWS_MAX,
>+};
>+#define NET_FLOW_TABLE_FLOWS_MAX (__NET_FLOW_TABLE_FLOWS_MAX - 1)
>+
>+enum {
>+	/* Abort with normal errmsg */
>+	NET_FLOW_FLOWS_ERROR_ABORT,
>+	/* Ignore errors and continue without logging */
>+	NET_FLOW_FLOWS_ERROR_CONTINUE,
>+	/* Abort and reply with invalid flow fields */
>+	NET_FLOW_FLOWS_ERROR_ABORT_LOG,
>+	/* Continue and reply with list of invalid flows */
>+	NET_FLOW_FLOWS_ERROR_CONT_LOG,
>+	__NET_FLOWS_FLOWS_ERROR_MAX,
>+};
>+#define NET_FLOWS_FLOWS_ERROR_MAX (__NET_FLOWS_FLOWS_ERROR_MAX - 1)
>+
>+enum {
>+	NET_FLOW_ATTR_UNSPEC,
>+	NET_FLOW_ATTR_ERROR,
>+	NET_FLOW_ATTR_TABLE,
>+	NET_FLOW_ATTR_UID,
>+	NET_FLOW_ATTR_PRIORITY,
>+	NET_FLOW_ATTR_MATCHES,
>+	NET_FLOW_ATTR_ACTIONS,
>+	__NET_FLOW_ATTR_MAX,
>+};
>+#define NET_FLOW_ATTR_MAX (__NET_FLOW_ATTR_MAX - 1)
>+
>+enum {
> 	NET_FLOW_IDENTIFIER_IFINDEX, /* net_device ifindex */
> };
> 
>@@ -343,6 +385,9 @@ enum {
> 	NET_FLOW_HEADER_GRAPH,
> 	NET_FLOW_TABLE_GRAPH,
> 
>+	NET_FLOW_FLOWS,
>+	NET_FLOW_FLOWS_ERROR,
>+
> 	__NET_FLOW_MAX,
> 	NET_FLOW_MAX = (__NET_FLOW_MAX - 1),
> };
>@@ -354,6 +399,10 @@ enum {
> 	NET_FLOW_TABLE_CMD_GET_HDR_GRAPH,
> 	NET_FLOW_TABLE_CMD_GET_TABLE_GRAPH,
> 
>+	NET_FLOW_TABLE_CMD_GET_FLOWS,
>+	NET_FLOW_TABLE_CMD_SET_FLOWS,
>+	NET_FLOW_TABLE_CMD_DEL_FLOWS,
>+
> 	__NET_FLOW_CMD_MAX,
> 	NET_FLOW_CMD_MAX = (__NET_FLOW_CMD_MAX - 1),
> };
>diff --git a/net/core/flow_table.c b/net/core/flow_table.c
>index ec3f06d..f4cf293 100644
>--- a/net/core/flow_table.c
>+++ b/net/core/flow_table.c
>@@ -774,6 +774,489 @@ static int net_flow_cmd_get_table_graph(struct sk_buff *skb,
> 	return genlmsg_reply(msg, info);
> }
> 
>+static struct sk_buff *net_flow_build_flows_msg(struct net_device *dev,
>+						u32 portid, int seq, u8 cmd,
>+						int min, int max, int table)
>+{
>+	struct genlmsghdr *hdr;
>+	struct nlattr *flows;
>+	struct sk_buff *skb;
>+	int err = -ENOBUFS;
>+
>+	skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
>+	if (!skb)
>+		return ERR_PTR(-ENOBUFS);
>+
>+	hdr = genlmsg_put(skb, portid, seq, &net_flow_nl_family, 0, cmd);
>+	if (!hdr)
>+		goto out;
>+
>+	if (nla_put_u32(skb,
>+			NET_FLOW_IDENTIFIER_TYPE,
>+			NET_FLOW_IDENTIFIER_IFINDEX) ||
>+	    nla_put_u32(skb, NET_FLOW_IDENTIFIER, dev->ifindex)) {
>+		err = -ENOBUFS;
>+		goto out;
>+	}
>+
>+	flows = nla_nest_start(skb, NET_FLOW_FLOWS);
>+	if (!flows) {
>+		err = -EMSGSIZE;
>+		goto out;
>+	}
>+
>+	err = dev->netdev_ops->ndo_flow_get_flows(skb, dev, table, min, max);
>+	if (err < 0)
>+		goto out_cancel;
>+
>+	nla_nest_end(skb, flows);
>+
>+	err = genlmsg_end(skb, hdr);
>+	if (err < 0)
>+		goto out;
>+
>+	return skb;
>+out_cancel:
>+	nla_nest_cancel(skb, flows);
>+out:
>+	nlmsg_free(skb);
>+	return ERR_PTR(err);
>+}
>+
>+static const
>+struct nla_policy net_flow_table_flows_policy[NET_FLOW_TABLE_FLOWS_MAX + 1] = {
>+	[NET_FLOW_TABLE_FLOWS_TABLE]   = { .type = NLA_U32,},
>+	[NET_FLOW_TABLE_FLOWS_MINPRIO] = { .type = NLA_U32,},
>+	[NET_FLOW_TABLE_FLOWS_MAXPRIO] = { .type = NLA_U32,},
>+	[NET_FLOW_TABLE_FLOWS_FLOWS]   = { .type = NLA_NESTED,},
>+};
>+
>+static int net_flow_table_cmd_get_flows(struct sk_buff *skb,
>+					struct genl_info *info)
>+{
>+	struct nlattr *tb[NET_FLOW_TABLE_FLOWS_MAX+1];
>+	int table, min = -1, max = -1;
>+	struct net_device *dev;
>+	struct sk_buff *msg;
>+	int err = -EINVAL;
>+
>+	dev = net_flow_get_dev(info);
>+	if (!dev)
>+		return -EINVAL;
>+
>+	if (!dev->netdev_ops->ndo_flow_get_flows) {
>+		dev_put(dev);
>+		return -EOPNOTSUPP;
>+	}
>+
>+	if (!info->attrs[NET_FLOW_IDENTIFIER_TYPE] ||
>+	    !info->attrs[NET_FLOW_IDENTIFIER] ||
>+	    !info->attrs[NET_FLOW_FLOWS])
>+		goto out;
>+
>+	err = nla_parse_nested(tb, NET_FLOW_TABLE_FLOWS_MAX,
>+			       info->attrs[NET_FLOW_FLOWS],
>+			       net_flow_table_flows_policy);
>+	if (err)
>+		goto out;
>+
>+	if (!tb[NET_FLOW_TABLE_FLOWS_TABLE])
>+		goto out;
>+
>+	table = nla_get_u32(tb[NET_FLOW_TABLE_FLOWS_TABLE]);
>+
>+	if (tb[NET_FLOW_TABLE_FLOWS_MINPRIO])
>+		min = nla_get_u32(tb[NET_FLOW_TABLE_FLOWS_MINPRIO]);
>+	if (tb[NET_FLOW_TABLE_FLOWS_MAXPRIO])
>+		max = nla_get_u32(tb[NET_FLOW_TABLE_FLOWS_MAXPRIO]);
>+
>+	msg = net_flow_build_flows_msg(dev,
>+				       info->snd_portid,
>+				       info->snd_seq,
>+				       NET_FLOW_TABLE_CMD_GET_FLOWS,
>+				       min, max, table);
>+	dev_put(dev);
>+
>+	if (IS_ERR(msg))
>+		return PTR_ERR(msg);
>+
>+	return genlmsg_reply(msg, info);
>+out:
>+	dev_put(dev);
>+	return err;
>+}
>+
>+static struct sk_buff *net_flow_start_errmsg(struct net_device *dev,
>+					     struct genlmsghdr **hdr,
>+					     u32 portid, int seq, u8 cmd)
>+{
>+	struct genlmsghdr *h;
>+	struct sk_buff *skb;
>+
>+	skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
>+	if (!skb)
>+		return ERR_PTR(-EMSGSIZE);
>+
>+	h = genlmsg_put(skb, portid, seq, &net_flow_nl_family, 0, cmd);
>+	if (!h)
>+		return ERR_PTR(-EMSGSIZE);
>+
>+	if (nla_put_u32(skb,
>+			NET_FLOW_IDENTIFIER_TYPE,
>+			NET_FLOW_IDENTIFIER_IFINDEX) ||
>+	    nla_put_u32(skb, NET_FLOW_IDENTIFIER, dev->ifindex))
>+		return ERR_PTR(-EMSGSIZE);
>+
>+	*hdr = h;
>+	return skb;
>+}
>+
>+static struct sk_buff *net_flow_end_flow_errmsg(struct sk_buff *skb,
>+						struct genlmsghdr *hdr)
>+{
>+	int err;
>+
>+	err = genlmsg_end(skb, hdr);
>+	if (err < 0) {
>+		nlmsg_free(skb);
>+		return ERR_PTR(err);
>+	}
>+
>+	return skb;
>+}
>+
>+static int net_flow_put_flow_action(struct sk_buff *skb,
>+				    struct net_flow_action *a)
>+{
>+	struct nlattr *action, *sigs;
>+	int i, err = 0;
>+
>+	action = nla_nest_start(skb, NET_FLOW_ACTION);
>+	if (!action)
>+		return -EMSGSIZE;
>+
>+	if (nla_put_u32(skb, NET_FLOW_ACTION_ATTR_UID, a->uid))
>+		return -EMSGSIZE;
>+
>+	if (!a->args)
>+		goto done;
>+
>+	for (i = 0; a->args[i].type; i++) {
>+		sigs = nla_nest_start(skb, NET_FLOW_ACTION_ATTR_SIGNATURE);
>+		if (!sigs) {
>+			nla_nest_cancel(skb, action);
>+			return -EMSGSIZE;
>+		}
>+
>+		err = net_flow_put_act_types(skb, a[i].args);
>+		if (err) {
>+			nla_nest_cancel(skb, action);
>+			nla_nest_cancel(skb, sigs);
>+			return err;
>+		}
>+		nla_nest_end(skb, sigs);
>+	}
>+
>+done:
>+	nla_nest_end(skb, action);
>+	return 0;
>+}
>+
>+int net_flow_put_flow(struct sk_buff *skb, struct net_flow_flow *flow)
>+{
>+	struct nlattr *flows, *matches;
>+	struct nlattr *actions = NULL; /* must be null to unwind */
>+	int err, j, i = 0;
>+
>+	flows = nla_nest_start(skb, NET_FLOW_FLOW);
>+	if (!flows)
>+		goto put_failure;
>+
>+	if (nla_put_u32(skb, NET_FLOW_ATTR_TABLE, flow->table_id) ||
>+	    nla_put_u32(skb, NET_FLOW_ATTR_UID, flow->uid) ||
>+	    nla_put_u32(skb, NET_FLOW_ATTR_PRIORITY, flow->priority))
>+		goto flows_put_failure;
>+
>+	if (flow->matches) {
>+		matches = nla_nest_start(skb, NET_FLOW_ATTR_MATCHES);
>+		if (!matches)
>+			goto flows_put_failure;
>+
>+		for (j = 0; flow->matches && flow->matches[j].header; j++) {
>+			struct net_flow_field_ref *f = &flow->matches[j];
>+
>+			if (!f->header)
>+				continue;
>+
>+			nla_put(skb, NET_FLOW_FIELD_REF, sizeof(*f), f);
>+		}
>+		nla_nest_end(skb, matches);
>+	}
>+
>+	if (flow->actions) {
>+		actions = nla_nest_start(skb, NET_FLOW_ATTR_ACTIONS);
>+		if (!actions)
>+			goto flows_put_failure;
>+
>+		for (i = 0; flow->actions && flow->actions[i].uid; i++) {
>+			err = net_flow_put_flow_action(skb, &flow->actions[i]);
>+			if (err) {
>+				nla_nest_cancel(skb, actions);
>+				goto flows_put_failure;
>+			}
>+		}
>+		nla_nest_end(skb, actions);
>+	}
>+
>+	nla_nest_end(skb, flows);
>+	return 0;
>+
>+flows_put_failure:
>+	nla_nest_cancel(skb, flows);
>+put_failure:
>+	return -EMSGSIZE;
>+}
>+EXPORT_SYMBOL(net_flow_put_flow);
>+
>+static int net_flow_get_field(struct net_flow_field_ref *field,
>+			      struct nlattr *nla)
>+{
>+	if (nla_type(nla) != NET_FLOW_FIELD_REF)
>+		return -EINVAL;
>+
>+	if (nla_len(nla) < sizeof(*field))
>+		return -EINVAL;
>+
>+	*field = *(struct net_flow_field_ref *)nla_data(nla);
>+	return 0;
>+}
>+
>+static int net_flow_get_action(struct net_flow_action *a, struct nlattr *attr)
>+{
>+	struct nlattr *act[NET_FLOW_ACTION_ATTR_MAX+1];
>+	struct nlattr *args;
>+	int rem;
>+	int err, count = 0;
>+
>+	if (nla_type(attr) != NET_FLOW_ACTION) {
>+		pr_warn("%s: expected NET_FLOW_ACTION\n", __func__);
>+		return 0;
>+	}
>+
>+	err = nla_parse_nested(act, NET_FLOW_ACTION_ATTR_MAX,
>+			       attr, net_flow_action_policy);
>+	if (err < 0)
>+		return err;
>+
>+	if (!act[NET_FLOW_ACTION_ATTR_UID] ||
>+	    !act[NET_FLOW_ACTION_ATTR_SIGNATURE])
>+		return -EINVAL;
>+
>+	a->uid = nla_get_u32(act[NET_FLOW_ACTION_ATTR_UID]);
>+
>+	nla_for_each_nested(args, act[NET_FLOW_ACTION_ATTR_SIGNATURE], rem)
>+		count++; /* unoptimized max possible */
>+
>+	a->args = kcalloc(count + 1,
>+			  sizeof(struct net_flow_action_arg),
>+			  GFP_KERNEL);
>+	count = 0;
>+
>+	nla_for_each_nested(args, act[NET_FLOW_ACTION_ATTR_SIGNATURE], rem) {
>+		if (nla_type(args) != NET_FLOW_ACTION_ARG)
>+			continue;
>+
>+		if (nla_len(args) < sizeof(struct net_flow_action_arg)) {
>+			kfree(a->args);
>+			return -EINVAL;
>+		}
>+
>+		a->args[count] = *(struct net_flow_action_arg *)nla_data(args);
>+	}
>+	return 0;
>+}
>+
>+static const
>+struct nla_policy net_flow_flow_policy[NET_FLOW_ATTR_MAX + 1] = {
>+	[NET_FLOW_ATTR_TABLE]		= { .type = NLA_U32 },
>+	[NET_FLOW_ATTR_UID]		= { .type = NLA_U32 },
>+	[NET_FLOW_ATTR_PRIORITY]	= { .type = NLA_U32 },
>+	[NET_FLOW_ATTR_MATCHES]		= { .type = NLA_NESTED },
>+	[NET_FLOW_ATTR_ACTIONS]		= { .type = NLA_NESTED },
>+};
>+
>+static int net_flow_get_flow(struct net_flow_flow *flow, struct nlattr *attr)
>+{
>+	struct nlattr *f[NET_FLOW_ATTR_MAX+1];
>+	struct nlattr *attr2;
>+	int rem, err;
>+	int count = 0;
>+
>+	err = nla_parse_nested(f, NET_FLOW_ATTR_MAX,
>+			       attr, net_flow_flow_policy);
>+	if (err < 0)
>+		return -EINVAL;
>+
>+	if (!f[NET_FLOW_ATTR_TABLE] || !f[NET_FLOW_ATTR_UID] ||
>+	    !f[NET_FLOW_ATTR_PRIORITY])
>+		return -EINVAL;
>+
>+	flow->table_id = nla_get_u32(f[NET_FLOW_ATTR_TABLE]);
>+	flow->uid = nla_get_u32(f[NET_FLOW_ATTR_UID]);
>+	flow->priority = nla_get_u32(f[NET_FLOW_ATTR_PRIORITY]);
>+
>+	flow->matches = NULL;
>+	flow->actions = NULL;
>+
>+	if (f[NET_FLOW_ATTR_MATCHES]) {
>+		nla_for_each_nested(attr2, f[NET_FLOW_ATTR_MATCHES], rem)
>+			count++;
>+
>+		/* Null terminated list of matches */
>+		flow->matches = kcalloc(count + 1,
>+					sizeof(struct net_flow_field_ref),
>+					GFP_KERNEL);
>+		if (!flow->matches)
>+			return -ENOMEM;
>+
>+		count = 0;
>+		nla_for_each_nested(attr2, f[NET_FLOW_ATTR_MATCHES], rem) {
>+			err = net_flow_get_field(&flow->matches[count], attr2);
>+			if (err) {
>+				kfree(flow->matches);
>+				return err;
>+			}
>+			count++;
>+		}
>+	}
>+
>+	if (f[NET_FLOW_ATTR_ACTIONS]) {
>+		count = 0;
>+		nla_for_each_nested(attr2, f[NET_FLOW_ATTR_ACTIONS], rem)
>+			count++;
>+
>+		/* Null terminated list of actions */
>+		flow->actions = kcalloc(count + 1,
>+					sizeof(struct net_flow_action),
>+					GFP_KERNEL);
>+		if (!flow->actions) {
>+			kfree(flow->matches);
>+			return -ENOMEM;
>+		}
>+
>+		count = 0;
>+		nla_for_each_nested(attr2, f[NET_FLOW_ATTR_ACTIONS], rem) {
>+			err = net_flow_get_action(&flow->actions[count], attr2);
>+			if (err) {
>+				kfree(flow->matches);
>+				kfree(flow->actions);
>+				return err;
>+			}
>+			count++;
>+		}
>+	}
>+
>+	return 0;
>+}
>+
>+static int net_flow_table_cmd_flows(struct sk_buff *recv_skb,
>+				    struct genl_info *info)
>+{
>+	int rem, err_handle = NET_FLOW_FLOWS_ERROR_ABORT;
>+	struct sk_buff *skb = NULL;
>+	struct net_flow_flow this;
>+	struct genlmsghdr *hdr;
>+	struct net_device *dev;
>+	struct nlattr *flow, *flows;
>+	int cmd = info->genlhdr->cmd;
>+	int err = -EOPNOTSUPP;

I don't like the inconsistency in var naming. Sometimes, "flow" is of type
struct nlattr, sometimes it is of type struct net_flow_flow
(net_flow_get_flow). It is slightly confusing.

>+
>+	dev = net_flow_get_dev(info);
>+	if (!dev)
>+		return -EINVAL;
>+
>+	if (!dev->netdev_ops->ndo_flow_set_flows ||
>+	    !dev->netdev_ops->ndo_flow_del_flows)
>+		goto out;
>+
>+	if (!info->attrs[NET_FLOW_IDENTIFIER_TYPE] ||
>+	    !info->attrs[NET_FLOW_IDENTIFIER] ||
>+	    !info->attrs[NET_FLOW_FLOWS]) {
>+		err = -EINVAL;
>+		goto out;
>+	}
>+
>+	if (info->attrs[NET_FLOW_FLOWS_ERROR])
>+		err_handle = nla_get_u32(info->attrs[NET_FLOW_FLOWS_ERROR]);
>+
>+	nla_for_each_nested(flow, info->attrs[NET_FLOW_FLOWS], rem) {
>+		if (nla_type(flow) != NET_FLOW_FLOW)
>+			continue;
>+
>+		err = net_flow_get_flow(&this, flow);
>+		if (err)
>+			goto out;
>+
>+		switch (cmd) {
>+		case NET_FLOW_TABLE_CMD_SET_FLOWS:
>+			err = dev->netdev_ops->ndo_flow_set_flows(dev, &this);
>+			break;
>+		case NET_FLOW_TABLE_CMD_DEL_FLOWS:
>+			err = dev->netdev_ops->ndo_flow_del_flows(dev, &this);
>+			break;
>+		default:
>+			err = -EOPNOTSUPP;
>+			break;
>+		}
>+
>+		if (err && err_handle != NET_FLOW_FLOWS_ERROR_CONTINUE) {
>+			if (!skb) {
>+				skb = net_flow_start_errmsg(dev, &hdr,
>+							    info->snd_portid,
>+							    info->snd_seq,
>+							    cmd);
>+				if (IS_ERR(skb)) {
>+					err = PTR_ERR(skb);
>+					goto out_plus_free;
>+				}
>+
>+				flows = nla_nest_start(skb, NET_FLOW_FLOWS);
>+				if (!flows) {
>+					err = -EMSGSIZE;
>+					goto out_plus_free;
>+				}
>+			}
>+
>+			net_flow_put_flow(skb, &this);
>+		}
>+
>+		/* Cleanup flow */
>+		kfree(this.matches);
>+		kfree(this.actions);
>+
>+		if (err && err_handle == NET_FLOW_FLOWS_ERROR_ABORT)
>+			goto out;
>+	}
>+
>+	dev_put(dev);
>+
>+	if (skb) {
>+		nla_nest_end(skb, flows);
>+		net_flow_end_flow_errmsg(skb, hdr);
>+		return genlmsg_reply(skb, info);
>+	}
>+	return 0;
>+
>+out_plus_free:
>+	kfree(this.matches);
>+	kfree(this.actions);

	Maybe this can be done by some "flow_free" helper...

>+out:
>+	if (skb)
>+		nlmsg_free(skb);
>+	dev_put(dev);
>+	return -EINVAL;
>+}
>+
> static const struct nla_policy net_flow_cmd_policy[NET_FLOW_MAX + 1] = {
> 	[NET_FLOW_IDENTIFIER_TYPE] = {.type = NLA_U32, },
> 	[NET_FLOW_IDENTIFIER]	   = {.type = NLA_U32, },
>@@ -815,6 +1298,24 @@ static const struct genl_ops net_flow_table_nl_ops[] = {
> 		.policy = net_flow_cmd_policy,
> 		.flags = GENL_ADMIN_PERM,
> 	},
>+	{
>+		.cmd = NET_FLOW_TABLE_CMD_GET_FLOWS,
>+		.doit = net_flow_table_cmd_get_flows,
>+		.policy = net_flow_cmd_policy,
>+		.flags = GENL_ADMIN_PERM,
>+	},
>+	{
>+		.cmd = NET_FLOW_TABLE_CMD_SET_FLOWS,
>+		.doit = net_flow_table_cmd_flows,
>+		.policy = net_flow_cmd_policy,
>+		.flags = GENL_ADMIN_PERM,
>+	},
>+	{
>+		.cmd = NET_FLOW_TABLE_CMD_DEL_FLOWS,
>+		.doit = net_flow_table_cmd_flows,
>+		.policy = net_flow_cmd_policy,
>+		.flags = GENL_ADMIN_PERM,
>+	},
> };
> 
> static int __init net_flow_nl_module_init(void)
>

^ permalink raw reply

* Re: [net-next PATCH v1 03/11] net: flow_table: add apply action argument to tables
From: Jiri Pirko @ 2015-01-08 17:41 UTC (permalink / raw)
  To: John Fastabend; +Cc: tgraf, sfeldma, jhs, simon.horman, netdev, davem, andy
In-Reply-To: <20141231194642.31070.14445.stgit@nitbit.x32>

Wed, Dec 31, 2014 at 08:46:44PM CET, john.fastabend@gmail.com wrote:
>Actions may not always be applied after exiting a table. For example
>some pipelines may accumulate actions and then apply them at the end
>of a pipeline.
>
>To model this we use a table type called APPLY. Tables who share an
>apply identifier have their actions applied in one step.

Why this is a separate patch? Perhaps this can be squashed to one of the
previous ones?

>
>Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>---
> include/linux/if_flow.h      |    1 +
> include/uapi/linux/if_flow.h |    1 +
> net/core/flow_table.c        |    1 +
> 3 files changed, 3 insertions(+)
>
>diff --git a/include/linux/if_flow.h b/include/linux/if_flow.h
>index 20fa752..a042a3d 100644
>--- a/include/linux/if_flow.h
>+++ b/include/linux/if_flow.h
>@@ -67,6 +67,7 @@ struct net_flow_table {
> 	char name[NET_FLOW_NAMSIZ];
> 	int uid;
> 	int source;
>+	int apply_action;
> 	int size;
> 	struct net_flow_field_ref *matches;
> 	int *actions;
>diff --git a/include/uapi/linux/if_flow.h b/include/uapi/linux/if_flow.h
>index 125cdc6..3c1a860 100644
>--- a/include/uapi/linux/if_flow.h
>+++ b/include/uapi/linux/if_flow.h
>@@ -265,6 +265,7 @@ enum {
> 	NET_FLOW_TABLE_ATTR_NAME,
> 	NET_FLOW_TABLE_ATTR_UID,
> 	NET_FLOW_TABLE_ATTR_SOURCE,
>+	NET_FLOW_TABLE_ATTR_APPLY,
> 	NET_FLOW_TABLE_ATTR_SIZE,
> 	NET_FLOW_TABLE_ATTR_MATCHES,
> 	NET_FLOW_TABLE_ATTR_ACTIONS,
>diff --git a/net/core/flow_table.c b/net/core/flow_table.c
>index f4cf293..97cdf92 100644
>--- a/net/core/flow_table.c
>+++ b/net/core/flow_table.c
>@@ -223,6 +223,7 @@ static int net_flow_put_table(struct net_device *dev,
> 	if (nla_put_string(skb, NET_FLOW_TABLE_ATTR_NAME, t->name) ||
> 	    nla_put_u32(skb, NET_FLOW_TABLE_ATTR_UID, t->uid) ||
> 	    nla_put_u32(skb, NET_FLOW_TABLE_ATTR_SOURCE, t->source) ||
>+	    nla_put_u32(skb, NET_FLOW_TABLE_ATTR_APPLY, t->apply_action) ||
> 	    nla_put_u32(skb, NET_FLOW_TABLE_ATTR_SIZE, t->size))
> 		return -EMSGSIZE;
> 
>

^ permalink raw reply

* [PATCH iproute2] ss: Filter inet dgram sockets with established state by default
From: Vadim Kochan @ 2015-01-08 17:32 UTC (permalink / raw)
  To: netdev; +Cc: Vadim Kochan

From: Vadim Kochan <vadim4j@gmail.com>

As inet dgram sockets (udp, raw) can call connect(...)  - they
might be set in ESTABLISHED state. So keep the original behaviour of
'ss' which filtered them by ESTABLISHED state by default. So:

    $ ss -u

    or

    $ ss -w

Will show only ESTABLISHED UDP sockets by default.

Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
---
 misc/ss.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 08d210a..015d829 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -170,11 +170,11 @@ static const struct filter default_dbs[MAX_DB] = {
 		.families = (1 << AF_INET) | (1 << AF_INET6),
 	},
 	[UDP_DB] = {
-		.states   = (1 << SS_CLOSE),
+		.states   = (1 << SS_ESTABLISHED),
 		.families = (1 << AF_INET) | (1 << AF_INET6),
 	},
 	[RAW_DB] = {
-		.states   = (1 << SS_CLOSE),
+		.states   = (1 << SS_ESTABLISHED),
 		.families = (1 << AF_INET) | (1 << AF_INET6),
 	},
 	[UNIX_DG_DB] = {
-- 
2.1.3

^ permalink raw reply related

* Re: [RFC net-next 0/3] devconf: New infrastructure for setting pre-load parameters for a module
From: Greg KH @ 2015-01-08 17:47 UTC (permalink / raw)
  To: Amir Vadai
  Cc: Hadar Hen Zion, David S. Miller, netdev@vger.kernel.org,
	Yevgeny Petrilin, Or Gerlitz, shannon.nelson@intel.com,
	Doug Ledford, greearb@candelatech.com
In-Reply-To: <54AEBA28.1030301@mellanox.com>

On Thu, Jan 08, 2015 at 07:11:04PM +0200, Amir Vadai wrote:
> On 1/8/2015 6:46 PM, Greg KH wrote:
> > On Thu, Jan 08, 2015 at 05:16:57PM +0200, Hadar Hen Zion wrote:
> >> Hi,
> >>     
> >> When configuring a device at an early boot stage, most kernel drivers
> >> use module parameters (the parameters' settings can be determined in
> >> modprobe.d config files).
> > 
> > Which is a bad idea, as you have learned :)
> > 
> >> These parameters are difficult to manage, and one of the reasons is that
> >> module parameters are set per driver and not per device (NICs using the
> >> same driver cannot be set with different configurations).
> >> Furthermore, using other existing configuration tools like ethtool,
> >> ifconfig, ip link commands or sysfs entries is not applicable, since
> >> they all rely on having a netdev already set up.
> >>
> >> In the past, 'request_firmware' solution for configuration parameters
> >> was suggested by Shannon Nelson from Intel[1]. The idea was rejected by
> >> Greg KH, who claimed it was abusive of the request_firmware mechanism.
> >> Greg suggested using configfs for device configuration instead (as done
> >> by the USB gadget driver).
> >>
> >> As a solution, we introduce a new kernel infrastructure using configfs
> >> to allow the configuration of the device. The goal is to set low-level
> >> device functionality that needs to be sorted out before a module is
> >> loaded.
> > 
> > Ick, really?  drivers should never need to be configured like this, if
> > so, then something is wrong, they should "just work" by default.  What
> > are you needing to "configure" that can't be determined by something
> > like a device tree?
> Ick indeed - but we can't find anything better.
> 
> For example, we have devices that can act as either netdev or as an
> Infiniband device.
> The driver consists of a core to handle the hardware, and higher layer
> drivers - one for Ethernet and one for Infiniband.
> Today the selection is done through a module parameter. according to it
> the relevant higher level driver is loaded, and the device is
> initialized. You don't want to have a default of netdev, and in every
> installation that needs Infiniband, a netdev will be created, removed
> and only then the Infiniband device will appear.

But that really isn't an issue, right?  Who cares if it happens that
way?

Or just don't do anything until that "higher level" driver is loaded,
either ib or network.

> This is only one example to configuration that needs to be known before
> the hardware is initialized, and be persistent across boots.
> 
> We can have a 2 stages loading. First load the core, wait for user
> input, and only then configure the device and load the right upper layer
> driver according to the user input (configfs/sysfs).

Why not just wait for an "upper layer" driver to be loaded?  It doesn't
make it dynamic, and forces you to have a module manually be loaded, but
would solve this problem.

Or just don't do anything until configfs is set up for your devices,
after the module is loaded.  This allows drivers to be built into the
kernel if people really want that type of configuration to work
properly.

> Since other vendors seems to need this capability too, we thought it
> would be better to make it generic - and this is this what this RFC is
> about.

What other vendors have such horribly designed hardware / kernel modules
that need this type of thing?  :)

thanks,

greg k-h

^ permalink raw reply

* Re: TCP connection issues against Amazon S3
From: Erik Grinaker @ 2015-01-08 17:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Yuchung Cheng, linux-kernel@vger.kernel.org, netdev
In-Reply-To: <1420666401.5947.35.camel@edumazet-glaptop2.roam.corp.google.com>

On 07 Jan 2015, at 21:33, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2015-01-07 at 20:37 +0000, Erik Grinaker wrote:
> I agree. I have contacted Amazon about this, but am not too hopeful
>> for a quick fix; they have been promising SACK-support on their
>> loadbalancers since 2006, for example.
>> 
>> That said, since this change breaks a service as popular as S3, it
>> might be worth reconsidering.
> 
> Which change are you talking about ? Have you done a bisection to
> clearly identify the patch exposing this sender bug ?

FWIW, I've done a bisection, and it’s triggered by this change:

https://github.com/torvalds/linux/commit/4e4f1fc226816905c937f9b29dabe351075dfe0f

> We are not going to stick TCP stack to 20th century and buggy peers or
> middleboxes, sorry.

That’s fair enough.

^ permalink raw reply

* [PATCH net-next] netfilter: nat_snmp_basic: Use generic dump facility
From: Joe Perches @ 2015-01-08 17:56 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netdev, LKML

Use the generic facility instead of a static hex_dump routine.
Delete the now unused static routine.

Now emits the hex dump at KERN_DEBUG instead of default.

Shrinks the kernel ever so trivially.

$ size net/ipv4/netfilter/nf_nat_snmp_basic.o*
   text	   data	    bss	    dec	    hex	filename
   5948	    352	      8	   6308	   18a4	net/ipv4/netfilter/nf_nat_snmp_basic.o.defconfig.new
   5985	    352	      8	   6345	   18c9	net/ipv4/netfilter/nf_nat_snmp_basic.o.defconfig.old

Signed-off-by: Joe Perches <joe@perches.com>
---
 net/ipv4/netfilter/nf_nat_snmp_basic.c | 14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/net/ipv4/netfilter/nf_nat_snmp_basic.c b/net/ipv4/netfilter/nf_nat_snmp_basic.c
index 7c67667..5c8d3b5 100644
--- a/net/ipv4/netfilter/nf_nat_snmp_basic.c
+++ b/net/ipv4/netfilter/nf_nat_snmp_basic.c
@@ -998,18 +998,6 @@ err_id_free:
  *
  *****************************************************************************/
 
-static void hex_dump(const unsigned char *buf, size_t len)
-{
-	size_t i;
-
-	for (i = 0; i < len; i++) {
-		if (i && !(i % 16))
-			printk("\n");
-		printk("%02x ", *(buf + i));
-	}
-	printk("\n");
-}
-
 /*
  * Parse and mangle SNMP message according to mapping.
  * (And this is the fucking 'basic' method).
@@ -1026,7 +1014,7 @@ static int snmp_parse_mangle(unsigned char *msg,
 	struct snmp_object *obj;
 
 	if (debug > 1)
-		hex_dump(msg, len);
+		print_hex_dump_bytes("", DUMP_PREFIX_NONE, msg, len);
 
 	asn1_open(&ctx, msg, len);
 

^ permalink raw reply related

* Re: [net-next PATCH v1 00/11] A flow API
From: Jiri Pirko @ 2015-01-08 18:03 UTC (permalink / raw)
  To: John Fastabend; +Cc: tgraf, sfeldma, jhs, simon.horman, netdev, davem, andy
In-Reply-To: <20141231194057.31070.5244.stgit@nitbit.x32>

Wed, Dec 31, 2014 at 08:45:19PM CET, john.fastabend@gmail.com wrote:
>So... I could continue to mull over this and tweak bits and pieces
>here and there but I decided its best to get a wider group of folks
>looking at it and hopefulyl with any luck using it so here it is.
>
>This set creates a new netlink family and set of messages to configure
>flow tables in hardware. I tried to make the commit messages
>reasonably verbose at least in the flow_table patches.
>
>What we get at the end of this series is a working API to get device
>capabilities and program flows using the rocker switch.
>
>I created a user space tool 'flow' that I use to configure and query
>the devices it is posted here,
>
>	https://github.com/jrfastab/iprotue2-flow-tool
>
>For now it is a stand-alone tool but once the kernel bits get sorted
>out (I'm guessing there will need to be a few versions of this series
>to get it right) I would like to port it into the iproute2 package.
>This way we can keep all of our tooling in one package see 'bridge'
>for example.
>
>As far as testing, I've tested various combinations of tables and
>rules on the rocker switch and it seems to work. I have not tested
>100% of the rocker code paths though. It would be great to get some
>sort of automated framework around the API to do this. I don't
>think should gate the inclusion of the API though.
>
>I could use some help reviewing,
>
>  (a) error paths and netlink validation code paths
>
>  (b) Break down of structures vs netlink attributes. I
>      am trying to balance flexibility given by having
>      netlinnk TLV attributes vs conciseness. So some
>      things are passed as structures.
>
>  (c) are there any devices that have pipelines that we
>      can't represent with this API? It would be good to
>      know about these so we can design it in probably
>      in a future series.
>
>For some examples and maybe a bit more illustrative description I
>posted a quickly typed up set of notes on github io pages. Here we
>can show the description along with images produced by the flow tool
>showing the pipeline. Once we settle a bit more on the API we should
>probably do a clean up of this and other threads happening and commit
>something to the Documentation directory.
>
> http://jrfastab.github.io/jekyll/update/2014/12/21/flow-api.html
>
>Finally I have more patches to add support for creating and destroying
>tables. This allows users to define the pipeline at runtime rather
>than statically as rocker does now. After this set gets some traction
>I'll look at pushing them in a next round. However it likely requires
>adding another "world" to rocker. Another piece that I want to add is
>a description of the actions and metadata. This way user space can
>"learn" what an action is and how metadata interacts with the system.
>This work is under development.
>
>Thanks! Any comments/feedback always welcome.
>
>And also thanks to everyone who helped with this flow API so far. All
>the folks at Dusseldorf LPC, OVS summit Santa Clara, P4 authors for
>some inspiration, the collection of IETF FoRCES documents I mulled
>over, Netfilter workshop where I started to realize fixing ethtool
>was most likely not going to work, etc.
>
>---
>
>John Fastabend (11):
>      net: flow_table: create interface for hw match/action tables
>      net: flow_table: add flow, delete flow
>      net: flow_table: add apply action argument to tables
>      rocker: add pipeline model for rocker switch
>      net: rocker: add set flow rules
>      net: rocker: add group_id slices and drop explicit goto
>      net: rocker: add multicast path to bridging
>      net: rocker: add get flow API operation
>      net: rocker: add cookie to group acls and use flow_id to set cookie
>      net: rocker: have flow api calls set cookie value
>      net: rocker: implement delete flow routine

Truly impressive work John (including the "flow" tool, documentation).
Hat's off.

Currently, all is very userspace oriented and I understand the reason.
I also understand why Jamal is a bit nervous from that fact. I am as well..
Correct me if I'm wrong but this amount of "direct hw access" is
unprecedented. There have been kernel here to cover the hw differencies,
I wonder if there is any way to continue in this direction with flows...

What I would love to see in this initial patchset is "the internal user".
For example tc. The tc code could query the capabilities and decide what
"flows" to put into hw tables.

Jiri

^ permalink raw reply

* Re: TCP connection issues against Amazon S3
From: Eric Dumazet @ 2015-01-08 18:15 UTC (permalink / raw)
  To: Erik Grinaker; +Cc: Yuchung Cheng, linux-kernel@vger.kernel.org, netdev
In-Reply-To: <DCC8B9AB-D1D7-4722-9B65-93AB5A559763@bengler.no>

On Thu, 2015-01-08 at 17:47 +0000, Erik Grinaker wrote:

> FWIW, I've done a bisection, and it’s triggered by this change:
> 
> https://github.com/torvalds/linux/commit/4e4f1fc226816905c937f9b29dabe351075dfe0f


This totally makes sense, thanks for doing the bisection !

> 
> > We are not going to stick TCP stack to 20th century and buggy peers or
> > middleboxes, sorry.
> 
> That’s fair enough.

Strange thing is that sender does not misbehave at the beginning when
receiver window is still small. Only after a while.

It would be nice to know more details about sender OS/version.

Thanks.

^ permalink raw reply

* Re: [patch iproute2 2/2] iplink: print out addrgenmode attribute
From: Stephen Hemminger @ 2015-01-08 18:19 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, thaller
In-Reply-To: <20150108070439.GB1860@nanopsycho.orion>

On Thu, 8 Jan 2015 08:04:39 +0100
Jiri Pirko <jiri@resnulli.us> wrote:

> Thu, Jan 08, 2015 at 12:10:36AM CET, stephen@networkplumber.org wrote:
> >On Tue,  6 Jan 2015 17:23:46 +0100
> >Jiri Pirko <jiri@resnulli.us> wrote:
> >
> >> addrgenmode is currently write only by ip. So display this information
> >> if provided by kernel as well.
> >> 
> >> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
> >
> >Patch does not apply to current iproute2 git
> 
> I made that against net-next branch. Which branch should I use for new
> features?
> 

net-next hasn't been rebased to master. plus unless change is specific to
kernel net-next it should be against master.

^ permalink raw reply

* Re: bridge-utils : bridge fdb replace undocumented
From: Stephen Hemminger @ 2015-01-08 18:22 UTC (permalink / raw)
  To: Mathieu Rohon; +Cc: netdev
In-Reply-To: <CAAfiWQqCv9=nFRNuxHPb+hRVbNywiEN_bQJHaxXc29hka0iUaw@mail.gmail.com>

On Thu, 8 Jan 2015 13:52:01 +0100
Mathieu Rohon <mathieu.rohon@gmail.com> wrote:

> Hi,
> 
> the command :
> #bridge fdb replace
> 
> can be useful to replace a learned Mac address by a static one. We'd
> like to use this command to solve an openstack bug :
> https://bugs.launchpad.net/neutron/+bug/1367999
> 
> However it is not documented in the manpage of bridge-utilis version 1.5.9.
> 
> Also, I don't know what is the minimum version to have this features.
> 
> regards,
> Mathieu

Bridge command is part of iproute2 not bridge-utils.

^ permalink raw reply

* Re: route/max_size sysctl in ipv4
From: Ani Sinha @ 2015-01-08 18:39 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev@vger.kernel.org
In-Reply-To: <1420695665.5947.44.camel@edumazet-glaptop2.roam.corp.google.com>

On Wed, Jan 7, 2015 at 9:41 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2015-01-07 at 19:40 -0800, Ani Sinha wrote:
>
>> true but at least those scripts which does only read the values will
>> continue to function in a child namespace. In our case, our script
>> only reads the max_size sysctl but since it operates from within a
>> child namespace, it doesn't find it.
>>
>> thoughts?
>
> Fix your script,

But the whole point of keeping this sysctl still around is not to
break the userland scripts! If we are trying to achieve that, I think
we should go one step further and make sure that the sysctl is still
available for reading even in child namespaces (as it used to be the
case).

^ permalink raw reply

* Re: TCP connection issues against Amazon S3
From: Rick Jones @ 2015-01-08 18:52 UTC (permalink / raw)
  To: Eric Dumazet, Erik Grinaker
  Cc: Yuchung Cheng, linux-kernel@vger.kernel.org, netdev
In-Reply-To: <1420740936.5947.70.camel@edumazet-glaptop2.roam.corp.google.com>


> Strange thing is that sender does not misbehave at the beginning when
> receiver window is still small. Only after a while.

Just guessing, but when the receiver window is small, the sender cannot 
get a large quantity of data out there at once, so any string of lost 
packets will tend to be smaller.  If the sender is relying on the RTO to 
trigger the retransmits, and is not resetting his RTO until the clean 
ACK of a segment sent after snd_nxt when the loss is detected, the 
smaller loss strings will not get to the rather large RTO values seen in 
the trace before curl gives-up.  It may be that the sender is indeed 
misbehaving at the beginning, just that it isn't noticeable?

Different but perhaps related observation/question - without timestamps 
(which we don't have in this case), isn't there a certain ambiguity 
about arriving out-of-order segments? One doesn't really know if they 
are out-of-order because the network is re-ordering, or because they are 
retransmissions of segments we've not yet seen at the receiver.

rick

^ permalink raw reply

* Re: route/max_size sysctl in ipv4
From: Eric Dumazet @ 2015-01-08 19:03 UTC (permalink / raw)
  To: Ani Sinha; +Cc: David Miller, netdev@vger.kernel.org
In-Reply-To: <CAOxq_8OPj4twFg2U93hFNSjeiZ5We5DgeCo-W5hSbXJvQmZfyQ@mail.gmail.com>

On Thu, 2015-01-08 at 10:39 -0800, Ani Sinha wrote:

> But the whole point of keeping this sysctl still around is not to
> break the userland scripts! If we are trying to achieve that, I think
> we should go one step further and make sure that the sysctl is still
> available for reading even in child namespaces (as it used to be the
> case).


network namespaces make some sysctl disappear.

That is how it is since day 0.
Some scripts might even use this knowledge.

It is too late to change something now. Nobody but you ever complained.

If you want to use network namespaces, you have to adapt your scripts.

Nobody claimed network namespaces were totally transparent.

^ permalink raw reply

* Re: [patch net-next] tc: add BPF based action
From: Alexei Starovoitov @ 2015-01-08 19:04 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Daniel Borkmann, Network Development, David S. Miller, jhs,
	Stephen Hemminger

On Wed, Jan 7, 2015 at 11:26 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>
>>On the other hand, I would understand if it's at some point in
>>time eBPF which would f.e. mangle the packet, but the API you
>>propose is clearly classic BPF. ;)
>
> Exactly. I would like to extend cls_bpf and act_bpf to handle eBPF right
> after. That is the point.

I say that connecting it with classic BPF is not a prerequisite
to use eBPF in there. Invocation place may be the same,
but the way to pass the program will be different.
For classic with just pass the whole program, whereas
for eBPF we'll be likely passing fd.
Theoretically we can pass eBPF as vanilla program
as well that doesn't have map access, but verifier check
will only be binary (accept or reject). Which is not user
friendly. Even rejection of classic BPF is hard to decipher.
Especially when only language for classic is assembler
and poor users have no easy way to know what was
wrong with the program. Therefore I like bpf syscall
as a main and only interface to load the programs
and pass prog_fd to places where they suppose to run.
Having syscall as center place to load programs
also helps with accounting, since root will be able
to do something like 'lsmod' to see all loaded programs.
Anyway, that's a conversion for later...

As Daniel pointed out I think some better articulation
on what classic bpf programs will do here is needed.
It seems they will work as pre-filter on an action?
Few examples would help to understand use cases...

^ permalink raw reply

* Re: [RFC net-next 0/3] devconf: New infrastructure for setting pre-load parameters for a module
From: Amir Vadai @ 2015-01-08 19:14 UTC (permalink / raw)
  To: Greg KH
  Cc: Amir Vadai, Hadar Hen Zion, David S. Miller,
	netdev@vger.kernel.org, Yevgeny Petrilin, Or Gerlitz,
	shannon.nelson@intel.com, Doug Ledford, greearb@candelatech.com
In-Reply-To: <20150108174707.GA13218@kroah.com>

On Thu, Jan 8, 2015 at 7:47 PM, Greg KH <gregkh@linuxfoundation.org> wrote:
> On Thu, Jan 08, 2015 at 07:11:04PM +0200, Amir Vadai wrote:
>> On 1/8/2015 6:46 PM, Greg KH wrote:
>> > On Thu, Jan 08, 2015 at 05:16:57PM +0200, Hadar Hen Zion wrote:
>> >> Hi,
>> >>
>> >> When configuring a device at an early boot stage, most kernel drivers
>> >> use module parameters (the parameters' settings can be determined in
>> >> modprobe.d config files).
>> >
>> > Which is a bad idea, as you have learned :)
>> >
>> >> These parameters are difficult to manage, and one of the reasons is that
>> >> module parameters are set per driver and not per device (NICs using the
>> >> same driver cannot be set with different configurations).
>> >> Furthermore, using other existing configuration tools like ethtool,
>> >> ifconfig, ip link commands or sysfs entries is not applicable, since
>> >> they all rely on having a netdev already set up.
>> >>
>> >> In the past, 'request_firmware' solution for configuration parameters
>> >> was suggested by Shannon Nelson from Intel[1]. The idea was rejected by
>> >> Greg KH, who claimed it was abusive of the request_firmware mechanism.
>> >> Greg suggested using configfs for device configuration instead (as done
>> >> by the USB gadget driver).
>> >>
>> >> As a solution, we introduce a new kernel infrastructure using configfs
>> >> to allow the configuration of the device. The goal is to set low-level
>> >> device functionality that needs to be sorted out before a module is
>> >> loaded.
>> >
>> > Ick, really?  drivers should never need to be configured like this, if
>> > so, then something is wrong, they should "just work" by default.  What
>> > are you needing to "configure" that can't be determined by something
>> > like a device tree?
>> Ick indeed - but we can't find anything better.
>>
>> For example, we have devices that can act as either netdev or as an
>> Infiniband device.
>> The driver consists of a core to handle the hardware, and higher layer
>> drivers - one for Ethernet and one for Infiniband.
>> Today the selection is done through a module parameter. according to it
>> the relevant higher level driver is loaded, and the device is
>> initialized. You don't want to have a default of netdev, and in every
>> installation that needs Infiniband, a netdev will be created, removed
>> and only then the Infiniband device will appear.
>
> But that really isn't an issue, right?  Who cares if it happens that
> way?
>
> Or just don't do anything until that "higher level" driver is loaded,
> either ib or network.
>
>> This is only one example to configuration that needs to be known before
>> the hardware is initialized, and be persistent across boots.
>>
>> We can have a 2 stages loading. First load the core, wait for user
>> input, and only then configure the device and load the right upper layer
>> driver according to the user input (configfs/sysfs).
>
> Why not just wait for an "upper layer" driver to be loaded?  It doesn't
> make it dynamic, and forces you to have a module manually be loaded, but
> would solve this problem.
>
> Or just don't do anything until configfs is set up for your devices,
> after the module is loaded.  This allows drivers to be built into the
> kernel if people really want that type of configuration to work
> properly.
Yes this is the idea in this RFC, and we're suggesting a mechanism to
make it general, so other vendors that would like to have a persistent
state for their devices could do it.

>
>> Since other vendors seems to need this capability too, we thought it
>> would be better to make it generic - and this is this what this RFC is
>> about.
>
> What other vendors have such horribly designed hardware / kernel modules
> that need this type of thing?  :)
Intel [1], Atheros [2]. I can dig in the mailing list for some more.

[1] - https://lkml.org/lkml/2013/1/10/606
[2] - http://lists.openwall.net/netdev/2014/06/16/53

I agree that whenever possible the hardware should have a good default
state. But sometimes the user has some useful information that could
be used during initialization and make driver loading process much
more efficient and clean.

Let me know if you prefer us to contribute a generic solution, or to
have a configfs with 2 stages loading for our driver only. I thought
that a generic solution is better, but would be happy to hear what
other vendors and experienced kernel developers have to say.

Thanks,
Amir

>
> thanks,
>
> greg k-h
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC net-next 0/3] devconf: New infrastructure for setting pre-load parameters for a module
From: Greg KH @ 2015-01-08 19:25 UTC (permalink / raw)
  To: Amir Vadai
  Cc: Hadar Hen Zion, David S. Miller, netdev@vger.kernel.org,
	Yevgeny Petrilin, Or Gerlitz, shannon.nelson@intel.com,
	Doug Ledford, greearb@candelatech.com
In-Reply-To: <CAPcc5PivsUKr9Q3VN1NFn3xJiB339EdPV=DEiLie8d9QVgOmOg@mail.gmail.com>

On Thu, Jan 08, 2015 at 09:14:32PM +0200, Amir Vadai wrote:
> On Thu, Jan 8, 2015 at 7:47 PM, Greg KH <gregkh@linuxfoundation.org> wrote:
> > On Thu, Jan 08, 2015 at 07:11:04PM +0200, Amir Vadai wrote:
> >> On 1/8/2015 6:46 PM, Greg KH wrote:
> >> > On Thu, Jan 08, 2015 at 05:16:57PM +0200, Hadar Hen Zion wrote:
> >> >> Hi,
> >> >>
> >> >> When configuring a device at an early boot stage, most kernel drivers
> >> >> use module parameters (the parameters' settings can be determined in
> >> >> modprobe.d config files).
> >> >
> >> > Which is a bad idea, as you have learned :)
> >> >
> >> >> These parameters are difficult to manage, and one of the reasons is that
> >> >> module parameters are set per driver and not per device (NICs using the
> >> >> same driver cannot be set with different configurations).
> >> >> Furthermore, using other existing configuration tools like ethtool,
> >> >> ifconfig, ip link commands or sysfs entries is not applicable, since
> >> >> they all rely on having a netdev already set up.
> >> >>
> >> >> In the past, 'request_firmware' solution for configuration parameters
> >> >> was suggested by Shannon Nelson from Intel[1]. The idea was rejected by
> >> >> Greg KH, who claimed it was abusive of the request_firmware mechanism.
> >> >> Greg suggested using configfs for device configuration instead (as done
> >> >> by the USB gadget driver).
> >> >>
> >> >> As a solution, we introduce a new kernel infrastructure using configfs
> >> >> to allow the configuration of the device. The goal is to set low-level
> >> >> device functionality that needs to be sorted out before a module is
> >> >> loaded.
> >> >
> >> > Ick, really?  drivers should never need to be configured like this, if
> >> > so, then something is wrong, they should "just work" by default.  What
> >> > are you needing to "configure" that can't be determined by something
> >> > like a device tree?
> >> Ick indeed - but we can't find anything better.
> >>
> >> For example, we have devices that can act as either netdev or as an
> >> Infiniband device.
> >> The driver consists of a core to handle the hardware, and higher layer
> >> drivers - one for Ethernet and one for Infiniband.
> >> Today the selection is done through a module parameter. according to it
> >> the relevant higher level driver is loaded, and the device is
> >> initialized. You don't want to have a default of netdev, and in every
> >> installation that needs Infiniband, a netdev will be created, removed
> >> and only then the Infiniband device will appear.
> >
> > But that really isn't an issue, right?  Who cares if it happens that
> > way?
> >
> > Or just don't do anything until that "higher level" driver is loaded,
> > either ib or network.
> >
> >> This is only one example to configuration that needs to be known before
> >> the hardware is initialized, and be persistent across boots.
> >>
> >> We can have a 2 stages loading. First load the core, wait for user
> >> input, and only then configure the device and load the right upper layer
> >> driver according to the user input (configfs/sysfs).
> >
> > Why not just wait for an "upper layer" driver to be loaded?  It doesn't
> > make it dynamic, and forces you to have a module manually be loaded, but
> > would solve this problem.
> >
> > Or just don't do anything until configfs is set up for your devices,
> > after the module is loaded.  This allows drivers to be built into the
> > kernel if people really want that type of configuration to work
> > properly.
> Yes this is the idea in this RFC, and we're suggesting a mechanism to
> make it general, so other vendors that would like to have a persistent
> state for their devices could do it.

I don't think _any_ devices should be doing this, so please don't make
it "generic".  Just use configfs in your module and document it
properly, like other modules that use configfs.

> >> Since other vendors seems to need this capability too, we thought it
> >> would be better to make it generic - and this is this what this RFC is
> >> about.
> >
> > What other vendors have such horribly designed hardware / kernel modules
> > that need this type of thing?  :)
> Intel [1], Atheros [2]. I can dig in the mailing list for some more.
> 
> [1] - https://lkml.org/lkml/2013/1/10/606
> [2] - http://lists.openwall.net/netdev/2014/06/16/53
> 
> I agree that whenever possible the hardware should have a good default
> state. But sometimes the user has some useful information that could
> be used during initialization and make driver loading process much
> more efficient and clean.

But you are talking about loading config info before the driver is
loaded, which goes backwards for how we automatically load modules when
the hardware is found.

> Let me know if you prefer us to contribute a generic solution, or to
> have a configfs with 2 stages loading for our driver only. I thought
> that a generic solution is better, but would be happy to hear what
> other vendors and experienced kernel developers have to say.

A generic solution would be nice, but so far I just see network specific
drivers that can be configured after the driver is loaded, and so should
use the network specific configuration tools like everyone else.

So please just have the module default to a network device and then
configure things from there, after you are loaded, if you need it and
want to do something "special".  Otherwise you are moving backwards for
how autoload drivers work and that is not good.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
From: Jesse Gross @ 2015-01-08 19:55 UTC (permalink / raw)
  To: Fan Du
  Cc: Du, Fan, Thomas Graf, davem@davemloft.net, Michael S. Tsirkin,
	Jason Wang, netdev@vger.kernel.org, fw@strlen.de,
	dev@openvswitch.org, pshelar@nicira.com
In-Reply-To: <54AE506A.5020207@gmail.com>

On Thu, Jan 8, 2015 at 1:39 AM, Fan Du <fengyuleidian0615@gmail.com> wrote:
> 于 2015年01月08日 04:52, Jesse Gross 写道:
>>>
>>> My understanding is:
>>> >controller sets the forwarding rules into kernel datapath, any flow not
>>> >matching
>>> >with the rules are threw to controller by upcall. Once the rule decision
>>> > is
>>> >made
>>> >by controller, then, this flow packet is pushed down to datapath to be
>>> >forwarded
>>> >again according to the new rule.
>>> >
>>> >So I'm not sure whether pushing the over-MTU-sized packet or pushing the
>>> >forged ICMP
>>> >without encapsulation to controller is required by current ovs
>>> >implementation. By doing
>>> >so, such over-MTU-sized packet is treated as a event for the controller
>>> > to
>>> >be take
>>> >care of.
>>
>> If flows are implementing routing (again, they are doing things like
>> decrementing the TTL) then it is necessary for them to also handle
>> this situation using some potentially new primitives (like a size
>> check). Otherwise you end up with issues like the ones that I
>> mentioned above like needing to forge addresses because you don't know
>> what the correct ones are.
>
>
> Thanks for explaining, Jesse!
>
> btw, I don't get it about "to forge addresses", building ICMP message
> with Guest packet doesn't require to forge address when not encapsulating
> ICMP message with outer headers.

Your patch has things like this (for the inner IP header):

+                               new_ip->saddr = orig_ip->daddr;
+                               new_ip->daddr = orig_ip->saddr;

These addresses are owned by the endpoints, not the host generating
generating the ICMP message, so I would consider that to be forging
addresses.

> If the flows aren't doing things to
>>
>> implement routing, then you really have a flat L2 network and you
>> shouldn't be doing this type of behavior at all as I described in the
>> original plan.
>
>
> For flows implementing routing scenario:
> First of all, over-MTU-sized packet could only be detected once the flow
> as been consulted(each port could implement a 'check' hook to do this),
> and just before send to the actual port.
>
> Then pushing the over-MTU-sized packet back to controller, it's the
> controller
> who will will decide whether to build ICMP message, or whatever routing
> behaviour
> it may take. And sent it back with the port information. This ICMP message
> will
> travel back to Guest.
>
> Why does the flow has to use primitive like a "check size"? "check size"
> will only take effect after do_output. I'm not very clear with this
> approach.

Checking the size obviously needs to be an action that would take
place before outputting in order for it to have any effect. Attaching
a check to a port does not fit in very well with the other primitives
of OVS, so I think an action is the obvious place to put it.

> And not all scenario involving flow with routing behaviour, just set up a
> vxlan tunnel, and attach KVM guest or Docker onto it for playing or
> developing.
> This wouldn't necessarily require user to set additional specific flows to
> make
> over-MTU-sized packet pass through the tunnel correctly. In such scenario, I
> think the original patch in this thread to fragment tunnel packet is still
> needed
> OR workout a generic component to build ICMP for all type tunnel in L2
> level.
> Both of those will act as a backup plan as there is no such specific flow as
> default.

In these cases, we should find a way to adjust the MTU, preferably
automatically using virtio.

^ permalink raw reply

* Re: route/max_size sysctl in ipv4
From: Ani Sinha @ 2015-01-08 20:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev@vger.kernel.org
In-Reply-To: <1420743808.5947.73.camel@edumazet-glaptop2.roam.corp.google.com>

On Thu, Jan 8, 2015 at 11:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:

>
> It is too late to change something now. Nobody but you ever complained.

Well, that is a strange argument! Nobody complained for a long time
that live capturing with vlan tagged packets were broken in tcpdump
because how the kernel started putting the tags in the aux data. But
someone eventually one day did notice that the issue was brought up.

>
> If you want to use network namespaces, you have to adapt your scripts.
>
> Nobody claimed network namespaces were totally transparent.
>

I see. I am going back to an old thread here where Linus says that the
#1 rule is:

""We don't regress user space"

https://lkml.org/lkml/2013/7/16/565

Breaking scripts seems to me to fall into the category of regressing
userspace. Or may be we can treat these sysctls more softly since they
are not strictly speaking linux ABIs.

^ permalink raw reply

* Re: route/max_size sysctl in ipv4
From: Eric Dumazet @ 2015-01-08 20:18 UTC (permalink / raw)
  To: Ani Sinha; +Cc: David Miller, netdev@vger.kernel.org
In-Reply-To: <CAOxq_8M=Ros56U_VbUVumjS226jHs1DK0A-K9aRksTMu88H_sg@mail.gmail.com>

On Thu, 2015-01-08 at 12:13 -0800, Ani Sinha wrote:

> I see. I am going back to an old thread here where Linus says that the
> #1 rule is:
> 
> ""We don't regress user space"
> 
> https://lkml.org/lkml/2013/7/16/565

Thats totally different.

Look, I am going to stop the discussion here, because it seems you know
better than me.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox