Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net-next 4/7] ixgbe: protect vxlan_get_rx_port in ixgbe_service_task with rtnl_lock
From: Hannes Frederic Sowa @ 2016-04-18 19:19 UTC (permalink / raw)
  To: netdev
  Cc: jesse, Jeff Kirsher, Jesse Brandeburg, Shannon Nelson,
	Carolyn Wyborny, Don Skidmore, Bruce Allan, John Ronciak,
	Mitch Williams
In-Reply-To: <1461007188-1603-1-git-send-email-hannes@stressinduktion.org>

vxlan_get_rx_port requires rtnl_lock to be held.

Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: Shannon Nelson <shannon.nelson@intel.com>
Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
Cc: Don Skidmore <donald.c.skidmore@intel.com>
Cc: Bruce Allan <bruce.w.allan@intel.com>
Cc: John Ronciak <john.ronciak@intel.com>
Cc: Mitch Williams <mitch.a.williams@intel.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 2976df77bf14f5..b2f2cf40f06a87 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7192,10 +7192,12 @@ static void ixgbe_service_task(struct work_struct *work)
 		return;
 	}
 #ifdef CONFIG_IXGBE_VXLAN
+	rtnl_lock();
 	if (adapter->flags2 & IXGBE_FLAG2_VXLAN_REREG_NEEDED) {
 		adapter->flags2 &= ~IXGBE_FLAG2_VXLAN_REREG_NEEDED;
 		vxlan_get_rx_port(adapter->netdev);
 	}
+	rtnl_unlock();
 #endif /* CONFIG_IXGBE_VXLAN */
 	ixgbe_reset_subtask(adapter);
 	ixgbe_phy_interrupt_subtask(adapter);
-- 
2.5.5

^ permalink raw reply related

* [PATCH net-next 5/7] qlcnic: protect qlicnic_attach_func with rtnl_lock
From: Hannes Frederic Sowa @ 2016-04-18 19:19 UTC (permalink / raw)
  To: netdev; +Cc: jesse, Dept-GELinuxNICDev
In-Reply-To: <1461007188-1603-1-git-send-email-hannes@stressinduktion.org>

qlcnic_attach_func requires rtnl_lock to be held.

Cc: Dept-GELinuxNICDev@qlogic.com
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
index 1205f6f9c94173..1c29105b6c364f 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
@@ -3952,8 +3952,14 @@ static pci_ers_result_t qlcnic_82xx_io_error_detected(struct pci_dev *pdev,
 
 static pci_ers_result_t qlcnic_82xx_io_slot_reset(struct pci_dev *pdev)
 {
-	return qlcnic_attach_func(pdev) ? PCI_ERS_RESULT_DISCONNECT :
-				PCI_ERS_RESULT_RECOVERED;
+	pci_ers_result_t res;
+
+	rtnl_lock();
+	res = qlcnic_attach_func(pdev) ? PCI_ERS_RESULT_DISCONNECT :
+					 PCI_ERS_RESULT_RECOVERED;
+	rtnl_unlock();
+
+	return res;
 }
 
 static void qlcnic_82xx_io_resume(struct pci_dev *pdev)
-- 
2.5.5

^ permalink raw reply related

* [PATCH net-next 6/7] vxlan: break dependency with netdev drivers
From: Hannes Frederic Sowa @ 2016-04-18 19:19 UTC (permalink / raw)
  To: netdev; +Cc: jesse
In-Reply-To: <1461007188-1603-1-git-send-email-hannes@stressinduktion.org>

Currently all drivers depend and autoload the vxlan module because how
vxlan_get_rx_port is linked into them. Remove this dependency:

By using a new event type in the netdevice notifier call chain we proxy
the request from the drivers to flush and resetup the vxlan ports not
directly via function call but by the already existing netdevice
notifier call chain.

I added a separate new event type, NETDEV_OFFLOAD_PUSH_VXLAN, to do so.
We don't need to save those ids, as the event type field is an unsigned
long and using specialized event types for this purpose seemed to be a
more elegant way. This also comes in beneficial if in future we want to
add offloading knobs for vxlan.

Cc: Jesse Gross <jesse@kernel.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 drivers/net/vxlan.c       | 14 +++++++++-----
 include/linux/netdevice.h |  1 +
 include/net/vxlan.h       |  6 ++----
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index c2e22c2532a1a8..6fb93b57a72489 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2527,7 +2527,7 @@ static struct device_type vxlan_type = {
  * supply the listening VXLAN udp ports. Callers are expected
  * to implement the ndo_add_vxlan_port.
  */
-void vxlan_get_rx_port(struct net_device *dev)
+static void vxlan_push_rx_ports(struct net_device *dev)
 {
 	struct vxlan_sock *vs;
 	struct net *net = dev_net(dev);
@@ -2536,6 +2536,9 @@ void vxlan_get_rx_port(struct net_device *dev)
 	__be16 port;
 	unsigned int i;
 
+	if (!dev->netdev_ops->ndo_add_vxlan_port)
+		return;
+
 	spin_lock(&vn->sock_lock);
 	for (i = 0; i < PORT_HASH_SIZE; ++i) {
 		hlist_for_each_entry_rcu(vs, &vn->sock_list[i], hlist) {
@@ -2547,7 +2550,6 @@ void vxlan_get_rx_port(struct net_device *dev)
 	}
 	spin_unlock(&vn->sock_lock);
 }
-EXPORT_SYMBOL_GPL(vxlan_get_rx_port);
 
 /* Initialize the device structure. */
 static void vxlan_setup(struct net_device *dev)
@@ -3283,20 +3285,22 @@ static void vxlan_handle_lowerdev_unregister(struct vxlan_net *vn,
 	unregister_netdevice_many(&list_kill);
 }
 
-static int vxlan_lowerdev_event(struct notifier_block *unused,
-				unsigned long event, void *ptr)
+static int vxlan_netdevice_event(struct notifier_block *unused,
+				 unsigned long event, void *ptr)
 {
 	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
 	struct vxlan_net *vn = net_generic(dev_net(dev), vxlan_net_id);
 
 	if (event == NETDEV_UNREGISTER)
 		vxlan_handle_lowerdev_unregister(vn, dev);
+	else if (event == NETDEV_OFFLOAD_PUSH_VXLAN)
+		vxlan_push_rx_ports(dev);
 
 	return NOTIFY_DONE;
 }
 
 static struct notifier_block vxlan_notifier_block __read_mostly = {
-	.notifier_call = vxlan_lowerdev_event,
+	.notifier_call = vxlan_netdevice_event,
 };
 
 static __net_init int vxlan_init_net(struct net *net)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a3bb534576a383..d4c8cd424f8dbc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2244,6 +2244,7 @@ struct netdev_lag_lower_state_info {
 #define NETDEV_BONDING_INFO	0x0019
 #define NETDEV_PRECHANGEUPPER	0x001A
 #define NETDEV_CHANGELOWERSTATE	0x001B
+#define NETDEV_OFFLOAD_PUSH_VXLAN	0x001C
 
 int register_netdevice_notifier(struct notifier_block *nb);
 int unregister_netdevice_notifier(struct notifier_block *nb);
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index d442eb3129cde4..673e9f9e6da768 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -390,13 +390,11 @@ static inline __be32 vxlan_compute_rco(unsigned int start, unsigned int offset)
 	return vni_field;
 }
 
-#if IS_ENABLED(CONFIG_VXLAN)
-void vxlan_get_rx_port(struct net_device *netdev);
-#else
 static inline void vxlan_get_rx_port(struct net_device *netdev)
 {
+	ASSERT_RTNL();
+	call_netdevice_notifiers(NETDEV_OFFLOAD_PUSH_VXLAN, netdev);
 }
-#endif
 
 static inline unsigned short vxlan_get_sk_family(struct vxlan_sock *vs)
 {
-- 
2.5.5

^ permalink raw reply related

* [PATCH net-next 7/7] geneve: break dependency with netdev drivers
From: Hannes Frederic Sowa @ 2016-04-18 19:19 UTC (permalink / raw)
  To: netdev; +Cc: jesse
In-Reply-To: <1461007188-1603-1-git-send-email-hannes@stressinduktion.org>

Equivalent to "vxlan: break dependency with netdev drivers", don't
autoload geneve module in case the driver is loaded. Instead make the
coupling weaker by using netdevice notifiers as proxy.

Cc: Jesse Gross <jesse@kernel.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 drivers/net/geneve.c      | 31 ++++++++++++++++++++++++++++---
 include/linux/netdevice.h |  1 +
 include/net/geneve.h      |  6 ++----
 3 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index efbc7ceedc3a13..59314cf47584e4 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -1172,7 +1172,7 @@ static struct device_type geneve_type = {
  * supply the listening GENEVE udp ports. Callers are expected
  * to implement the ndo_add_geneve_port.
  */
-void geneve_get_rx_port(struct net_device *dev)
+static void geneve_push_rx_ports(struct net_device *dev)
 {
 	struct net *net = dev_net(dev);
 	struct geneve_net *gn = net_generic(net, geneve_net_id);
@@ -1181,6 +1181,9 @@ void geneve_get_rx_port(struct net_device *dev)
 	struct sock *sk;
 	__be16 port;
 
+	if (!dev->netdev_ops->ndo_add_geneve_port)
+		return;
+
 	rcu_read_lock();
 	list_for_each_entry_rcu(gs, &gn->sock_list, list) {
 		sk = gs->sock->sk;
@@ -1190,7 +1193,6 @@ void geneve_get_rx_port(struct net_device *dev)
 	}
 	rcu_read_unlock();
 }
-EXPORT_SYMBOL_GPL(geneve_get_rx_port);
 
 /* Initialize the device structure. */
 static void geneve_setup(struct net_device *dev)
@@ -1538,6 +1540,21 @@ struct net_device *geneve_dev_create_fb(struct net *net, const char *name,
 }
 EXPORT_SYMBOL_GPL(geneve_dev_create_fb);
 
+static int geneve_netdevice_event(struct notifier_block *unused,
+				  unsigned long event, void *ptr)
+{
+	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+
+	if (event == NETDEV_OFFLOAD_PUSH_GENEVE)
+		geneve_push_rx_ports(dev);
+
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block geneve_notifier_block __read_mostly = {
+	.notifier_call = geneve_netdevice_event,
+};
+
 static __net_init int geneve_init_net(struct net *net)
 {
 	struct geneve_net *gn = net_generic(net, geneve_net_id);
@@ -1590,11 +1607,18 @@ static int __init geneve_init_module(void)
 	if (rc)
 		goto out1;
 
-	rc = rtnl_link_register(&geneve_link_ops);
+	rc = register_netdevice_notifier(&geneve_notifier_block);
 	if (rc)
 		goto out2;
 
+	rc = rtnl_link_register(&geneve_link_ops);
+	if (rc)
+		goto out3;
+
 	return 0;
+
+out3:
+	unregister_netdevice_notifier(&geneve_notifier_block);
 out2:
 	unregister_pernet_subsys(&geneve_net_ops);
 out1:
@@ -1605,6 +1629,7 @@ late_initcall(geneve_init_module);
 static void __exit geneve_cleanup_module(void)
 {
 	rtnl_link_unregister(&geneve_link_ops);
+	unregister_netdevice_notifier(&geneve_notifier_block);
 	unregister_pernet_subsys(&geneve_net_ops);
 }
 module_exit(geneve_cleanup_module);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d4c8cd424f8dbc..1f6d5db471a24f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2245,6 +2245,7 @@ struct netdev_lag_lower_state_info {
 #define NETDEV_PRECHANGEUPPER	0x001A
 #define NETDEV_CHANGELOWERSTATE	0x001B
 #define NETDEV_OFFLOAD_PUSH_VXLAN	0x001C
+#define NETDEV_OFFLOAD_PUSH_GENEVE	0x001D
 
 int register_netdevice_notifier(struct notifier_block *nb);
 int unregister_netdevice_notifier(struct notifier_block *nb);
diff --git a/include/net/geneve.h b/include/net/geneve.h
index e6c23dc765f7ec..cb544a530146c2 100644
--- a/include/net/geneve.h
+++ b/include/net/geneve.h
@@ -62,13 +62,11 @@ struct genevehdr {
 	struct geneve_opt options[];
 };
 
-#if IS_ENABLED(CONFIG_GENEVE)
-void geneve_get_rx_port(struct net_device *netdev);
-#else
 static inline void geneve_get_rx_port(struct net_device *netdev)
 {
+	ASSERT_RTNL();
+	call_netdevice_notifiers(NETDEV_OFFLOAD_PUSH_GENEVE, netdev);
 }
-#endif
 
 #ifdef CONFIG_INET
 struct net_device *geneve_dev_create_fb(struct net *net, const char *name,
-- 
2.5.5

^ permalink raw reply related

* RE: [Intel-wired-lan] [PATCH net-next V2 2/2] intel: ixgbevf: Support Windows hosts (Hyper-V)
From: KY Srinivasan @ 2016-04-18 19:42 UTC (permalink / raw)
  To: Joe Perches, Alexander Duyck
  Cc: olaf@aepfle.de, Netdev, Jason Wang, linux-kernel@vger.kernel.org,
	jackm@mellanox.com, yevgenyp@mellanox.com, John Ronciak,
	intel-wired-lan, eli@mellanox.com, Robo Bot,
	devel@linuxdriverproject.org, David Miller
In-Reply-To: <1460998819.1917.10.camel@perches.com>



> -----Original Message-----
> From: Joe Perches [mailto:joe@perches.com]
> Sent: Monday, April 18, 2016 10:00 AM
> To: KY Srinivasan <kys@microsoft.com>; Alexander Duyck
> <alexander.duyck@gmail.com>
> Cc: David Miller <davem@davemloft.net>; Netdev
> <netdev@vger.kernel.org>; linux-kernel@vger.kernel.org;
> devel@linuxdriverproject.org; olaf@aepfle.de; Robo Bot
> <apw@canonical.com>; Jason Wang <jasowang@redhat.com>;
> eli@mellanox.com; jackm@mellanox.com; yevgenyp@mellanox.com; John
> Ronciak <john.ronciak@intel.com>; intel-wired-lan <intel-wired-
> lan@lists.osuosl.org>
> Subject: Re: [Intel-wired-lan] [PATCH net-next V2 2/2] intel: ixgbevf: Support
> Windows hosts (Hyper-V)
> 
> On Mon, 2016-04-18 at 16:52 +0000, KY Srinivasan wrote:
> []
> > > > +bool ixgbevf_on_hyperv(struct ixgbe_hw *hw)
> > > > +{
> > > > +       if (hw->mbx.ops.check_for_msg == NULL)
> > > > +               return true;
> > > > +       else
> > > > +               return false;
> > > > +}
> 
> trivia:
> 
> bool func(...)
> {
> 	if (<bar>)
> 		return true;
> 	else
> 		return false;
> }
> 
> can generally be written as:
> 
> bool func(...)
> {
> 	return <bar>;
> }

Thanks Joe; will update.

K. Y

^ permalink raw reply

* Re: [PATCH net-next 0/8] allow bpf attach to tracepoints
From: Alexei Starovoitov @ 2016-04-18 19:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, David S . Miller, Ingo Molnar, Daniel Borkmann,
	Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik, Brendan Gregg,
	netdev, linux-kernel, kernel-team
In-Reply-To: <20160418121311.10c88768@gandalf.local.home>

On 4/18/16 9:13 AM, Steven Rostedt wrote:
> On Mon, 4 Apr 2016 21:52:46 -0700
> Alexei Starovoitov <ast@fb.com> wrote:
>
>> Hi Steven, Peter,
>>
>> last time we discussed bpf+tracepoints it was a year ago [1] and the reason
>> we didn't proceed with that approach was that bpf would make arguments
>> arg1, arg2 to trace_xx(arg1, arg2) call to be exposed to bpf program
>> and that was considered unnecessary extension of abi. Back then I wanted
>> to avoid the cost of buffer alloc and field assign part in all
>> of the tracepoints, but looks like when optimized the cost is acceptable.
>> So this new apporach doesn't expose any new abi to bpf program.
>> The program is looking at tracepoint fields after they were copied
>> by perf_trace_xx() and described in /sys/kernel/debug/tracing/events/xxx/format
>
> Does this mean that ftrace could use this ability as well? As all the
> current filtering of ftrace was done after it was copied to the buffer,
> and that was what you wanted to avoid.

yeah, it could be added to ftrace as well, but it won't be as effective
as perf_trace, since the cost of trace_event_buffer_reserve() in
trace_event_raw_event_() handler is significantly higher than 
perf_trace_buf_alloc() in perf_trace_().
Then from the program point of view it wouldn't care how that memory
is allocated, so the user tools will just use perf_trace_() style.
The only use case I see for bpf with ftrace's tracepoint handler
is to work as an actual filter, but we already have filters there...
so not clear to me of the actual value of adding bpf to ftrace's
tracepoint handler.
At the same time there is whole ftrace as function tracer. That is
very lucrative field for bpf to plug into ;)

As far as 2nd part of your question about copying. Yeah, it adds to
the cost, so kprobe sometimes is faster than perf_trace tracepoint
that is copying a lot of args which are not going to be used.

^ permalink raw reply

* Re: [PATCH] netfilter: ctnetlink: add more #ifdef around unused code
From: Arnd Bergmann @ 2016-04-18 20:04 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Patrick McHardy, Jozsef Kadlecsik, David S. Miller,
	Daniel Borkmann, Ken-ichirou MATSUZAWA, netfilter-devel, coreteam,
	netdev, linux-kernel
In-Reply-To: <20160418184336.GA16586@salvia>

On Monday 18 April 2016 20:43:36 Pablo Neira Ayuso wrote:
> On Mon, Apr 18, 2016 at 08:33:15PM +0200, Arnd Bergmann wrote:
> > On Monday 18 April 2016 20:16:59 Pablo Neira Ayuso wrote:
> > > On Sat, Apr 16, 2016 at 10:17:43PM +0200, Arnd Bergmann wrote:
> > > > A recent patch removed many 'inline' annotations for static
> > > > functions in this file, which has caused warnings for functions
> > > > that are not used in a given configuration, in particular when
> > > > CONFIG_NF_CONNTRACK_EVENTS is disabled:
> > > > 
> > > > nf_conntrack_netlink.c:572:15: 'ctnetlink_timestamp_size' defined but not used
> > > > nf_conntrack_netlink.c:546:15: 'ctnetlink_acct_size' defined but not used
> > > > nf_conntrack_netlink.c:339:12: 'ctnetlink_label_size' defined but not used
> > > 
> > > Arnd, thanks for the fix.
> > > 
> > > I'm planning to push this though:
> > > 
> > > http://patchwork.ozlabs.org/patch/610820/
> > > 
> > > This is restoring the inlines for the size calculation functions, but
> > > I think that's ok. They are rather small and they're called from the
> > > event notification path (ie. packet path), so the compiler just place
> > > them out of the way when not needed and we calm down the gcc warning.
> > 
> > Looks good. I'll put this in my randconfig builder to replace my own
> > patch and will let you know if you missed something.
> 
> Thanks, will wait for your ack then.

100 clean randconfig builds later, looks good to me.

Acked-by: Arnd Bergmann <arnd@arndb.de>

^ permalink raw reply

* [PATCH net-next] net: dsa: kill circular reference with slave priv
From: Vivien Didelot @ 2016-04-18 20:10 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, Vivien Didelot

The dsa_slave_priv structure does not need a pointer to its net_device.
Kill it.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
---
 net/dsa/dsa_priv.h | 5 -----
 net/dsa/slave.c    | 9 ++++-----
 2 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 1d1a546..dfa3377 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -22,11 +22,6 @@ struct dsa_device_ops {
 };
 
 struct dsa_slave_priv {
-	/*
-	 * The linux network interface corresponding to this
-	 * switch port.
-	 */
-	struct net_device	*dev;
 	struct sk_buff *	(*xmit)(struct sk_buff *skb,
 					struct net_device *dev);
 
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 2dae0d0..3b6750f 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -673,10 +673,10 @@ static void dsa_slave_get_ethtool_stats(struct net_device *dev,
 	struct dsa_slave_priv *p = netdev_priv(dev);
 	struct dsa_switch *ds = p->parent;
 
-	data[0] = p->dev->stats.tx_packets;
-	data[1] = p->dev->stats.tx_bytes;
-	data[2] = p->dev->stats.rx_packets;
-	data[3] = p->dev->stats.rx_bytes;
+	data[0] = dev->stats.tx_packets;
+	data[1] = dev->stats.tx_bytes;
+	data[2] = dev->stats.rx_packets;
+	data[3] = dev->stats.rx_bytes;
 	if (ds->drv->get_ethtool_stats != NULL)
 		ds->drv->get_ethtool_stats(ds, p->port, data + 4);
 }
@@ -1063,7 +1063,6 @@ int dsa_slave_create(struct dsa_switch *ds, struct device *parent,
 	slave_dev->vlan_features = master->vlan_features;
 
 	p = netdev_priv(slave_dev);
-	p->dev = slave_dev;
 	p->parent = ds;
 	p->port = port;
 
-- 
2.8.0

^ permalink raw reply related

* Re: [PATCH net-next] macvlan: fix failure during registration
From: Francesco Ruggeri @ 2016-04-18 20:10 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: netdev, David S. Miller, Mahesh Bandewar
In-Reply-To: <87ega2sqhf.fsf@x220.int.ebiederm.org>

On Mon, Apr 18, 2016 at 11:48 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> These interactions all seem a little bit funny.  At a quick skim it
> would make more sense to increment the port count in macvlan_init,
> and completely remove the need to mess with port counts anywhere except
> macvlan_init and macvlan_uninit.

Thanks Eric, let me try that.

>
> If for some reason that can't be done the code can easily look at
> dev->reg_state.  If dev->reg_state == NETREG_UNITIALIZED it should
> be exactly the same as your new flag being set.
>

I am not sure that in macvlan_uninit one can tell whether it is being
invoked in the context of a failed register_netdevice (if that is what
you meant).

In case of register_netdevice failing in
call_netdevice_notifiers(NETDEV_REGISTER) the call sequence is:

macvlan_common_newlink
  register_netdevice
    call_netdevice_notifiers(NETDEV_REGISTER, dev) (<== fail here)
      rollback_registered(dev);
        rollback_registered_many
          dev->reg_state = NETREG_UNREGISTERING
          dev->netdev_ops->ndo_uninit(dev)
      dev->reg_state = NETREG_UNREGISTERED;

Francesco

^ permalink raw reply

* Re: [PATCH] netfilter: ctnetlink: add more #ifdef around unused code
From: Pablo Neira Ayuso @ 2016-04-18 20:14 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Patrick McHardy, Jozsef Kadlecsik, David S. Miller,
	Daniel Borkmann, Ken-ichirou MATSUZAWA, netfilter-devel, coreteam,
	netdev, linux-kernel
In-Reply-To: <3838598.xdNY1BW5d9@wuerfel>

On Mon, Apr 18, 2016 at 10:04:43PM +0200, Arnd Bergmann wrote:
> On Monday 18 April 2016 20:43:36 Pablo Neira Ayuso wrote:
> > On Mon, Apr 18, 2016 at 08:33:15PM +0200, Arnd Bergmann wrote:
> > > On Monday 18 April 2016 20:16:59 Pablo Neira Ayuso wrote:
> > > > On Sat, Apr 16, 2016 at 10:17:43PM +0200, Arnd Bergmann wrote:
> > > > > A recent patch removed many 'inline' annotations for static
> > > > > functions in this file, which has caused warnings for functions
> > > > > that are not used in a given configuration, in particular when
> > > > > CONFIG_NF_CONNTRACK_EVENTS is disabled:
> > > > > 
> > > > > nf_conntrack_netlink.c:572:15: 'ctnetlink_timestamp_size' defined but not used
> > > > > nf_conntrack_netlink.c:546:15: 'ctnetlink_acct_size' defined but not used
> > > > > nf_conntrack_netlink.c:339:12: 'ctnetlink_label_size' defined but not used
> > > > 
> > > > Arnd, thanks for the fix.
> > > > 
> > > > I'm planning to push this though:
> > > > 
> > > > http://patchwork.ozlabs.org/patch/610820/
> > > > 
> > > > This is restoring the inlines for the size calculation functions, but
> > > > I think that's ok. They are rather small and they're called from the
> > > > event notification path (ie. packet path), so the compiler just place
> > > > them out of the way when not needed and we calm down the gcc warning.
> > > 
> > > Looks good. I'll put this in my randconfig builder to replace my own
> > > patch and will let you know if you missed something.
> > 
> > Thanks, will wait for your ack then.
> 
> 100 clean randconfig builds later, looks good to me.
> 
> Acked-by: Arnd Bergmann <arnd@arndb.de>

Thanks! I have applied this to nf-next.

^ permalink raw reply

* Greetings!!!
From: andreas123 @ 2016-04-18 20:16 UTC (permalink / raw)
  To: Mr. Eric

Hi, how are you? My name is J Eric Denials, External Financial Auditor at Lloyds Banking Group plc., London. It is a pleasure to contact you at this time through this medium. I have a cool and legitimate deal to do with you as you're a foreigner, it will be mutually beneficial to both. If you’re interested, urgently, get back to me for more explanation about the deal.
Regards
Eric

^ permalink raw reply

* Re: [PATCH net-next 2/8] perf, bpf: allow bpf programs attach to tracepoints
From: Steven Rostedt @ 2016-04-18 20:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peter Zijlstra, David S . Miller, Ingo Molnar, Daniel Borkmann,
	Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik, Brendan Gregg,
	netdev, linux-kernel, kernel-team
In-Reply-To: <1459831974-2891931-3-git-send-email-ast@fb.com>

On Mon, 4 Apr 2016 21:52:48 -0700
Alexei Starovoitov <ast@fb.com> wrote:

> introduce BPF_PROG_TYPE_TRACEPOINT program type and allow it to be
> attached to tracepoints.
> The tracepoint will copy the arguments in the per-cpu buffer and pass
> it to the bpf program as its first argument.
> The layout of the fields can be discovered by doing
> 'cat /sys/kernel/debug/tracing/events/sched/sched_switch/format'
> prior to the compilation of the program with exception that first 8 bytes
> are reserved and not accessible to the program. This area is used to store
> the pointer to 'struct pt_regs' which some of the bpf helpers will use:
> +---------+
> | 8 bytes | hidden 'struct pt_regs *' (inaccessible to bpf program)
> +---------+
> | N bytes | static tracepoint fields defined in tracepoint/format (bpf readonly)
> +---------+
> | dynamic | __dynamic_array bytes of tracepoint (inaccessible to bpf yet)
> +---------+
> 
> Not that all of the fields are already dumped to user space via perf ring buffer
> and some application access it directly without consulting tracepoint/format.
> Same rule applies here: static tracepoint fields should only be accessed
> in a format defined in tracepoint/format. The order of fields and
> field sizes are not an ABI.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  include/trace/perf.h            | 18 ++++++++++++++----
>  include/uapi/linux/bpf.h        |  1 +
>  kernel/events/core.c            | 13 +++++++++----
>  kernel/trace/trace_event_perf.c |  3 +++
>  4 files changed, 27 insertions(+), 8 deletions(-)
> 
> diff --git a/include/trace/perf.h b/include/trace/perf.h
> index 26486fcd74ce..55feb69c873f 100644
> --- a/include/trace/perf.h
> +++ b/include/trace/perf.h
> @@ -37,18 +37,19 @@ perf_trace_##call(void *__data, proto)					\
>  	struct trace_event_call *event_call = __data;			\
>  	struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
>  	struct trace_event_raw_##call *entry;				\
> +	struct bpf_prog *prog = event_call->prog;			\
>  	struct pt_regs *__regs;						\
>  	u64 __addr = 0, __count = 1;					\
>  	struct task_struct *__task = NULL;				\
>  	struct hlist_head *head;					\
>  	int __entry_size;						\
>  	int __data_size;						\
> -	int rctx;							\
> +	int rctx, event_type;						\
>  									\
>  	__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
>  									\
>  	head = this_cpu_ptr(event_call->perf_events);			\
> -	if (__builtin_constant_p(!__task) && !__task &&			\
> +	if (!prog && __builtin_constant_p(!__task) && !__task &&	\
>  				hlist_empty(head))			\
>  		return;							\
>  									\
> @@ -56,8 +57,9 @@ perf_trace_##call(void *__data, proto)					\
>  			     sizeof(u64));				\
>  	__entry_size -= sizeof(u32);					\
>  									\
> -	entry = perf_trace_buf_prepare(__entry_size,			\
> -			event_call->event.type, &__regs, &rctx);	\
> +	event_type = prog ? TRACE_EVENT_TYPE_MAX : event_call->event.type; \

Can you move this into perf_trace_entry_prepare?

> +	entry = perf_trace_buf_prepare(__entry_size, event_type,	\
> +				       &__regs, &rctx);			\
>  	if (!entry)							\
>  		return;							\
>  									\
> @@ -67,6 +69,14 @@ perf_trace_##call(void *__data, proto)					\
>  									\
>  	{ assign; }							\
>  									\
> +	if (prog) {							\
> +		*(struct pt_regs **)entry = __regs;			\
> +		if (!trace_call_bpf(prog, entry) || hlist_empty(head)) { \
> +			perf_swevent_put_recursion_context(rctx);	\
> +			return;						\
> +		}							\
> +		memset(&entry->ent, 0, sizeof(entry->ent));		\
> +	}								\

And perhaps this into perf_trace_buf_submit()?

Tracepoints are a major cause of bloat, and the reasons for these
prepare and submit functions is to move code out of the macros. Every
tracepoint in the kernel (1000 and counting) will include this code.
I've already had complaints that each tracepoint can add up to 5k to
the core.

-- Steve


>  	perf_trace_buf_submit(entry, __entry_size, rctx, __addr,	\
>  		__count, __regs, head, __task);				\
>  }
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 23917bb47bf3..70eda5aeb304 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -92,6 +92,7 @@ enum bpf_prog_type {
>  	BPF_PROG_TYPE_KPROBE,
>  	BPF_PROG_TYPE_SCHED_CLS,
>  	BPF_PROG_TYPE_SCHED_ACT,
> +	BPF_PROG_TYPE_TRACEPOINT,
>  };
>  
>  #define BPF_PSEUDO_MAP_FD	1
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index de24fbce5277..58fc9a7d1562 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6725,12 +6725,13 @@ int perf_swevent_get_recursion_context(void)
>  }
>  EXPORT_SYMBOL_GPL(perf_swevent_get_recursion_context);
>  
> -inline void perf_swevent_put_recursion_context(int rctx)
> +void perf_swevent_put_recursion_context(int rctx)
>  {
>  	struct swevent_htable *swhash = this_cpu_ptr(&swevent_htable);
>  
>  	put_recursion_context(swhash->recursion, rctx);
>  }
> +EXPORT_SYMBOL_GPL(perf_swevent_put_recursion_context);
>  
>  void ___perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)
>  {
> @@ -7104,6 +7105,7 @@ static void perf_event_free_filter(struct perf_event *event)
>  
>  static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
>  {
> +	bool is_kprobe, is_tracepoint;
>  	struct bpf_prog *prog;
>  
>  	if (event->attr.type != PERF_TYPE_TRACEPOINT)
> @@ -7112,15 +7114,18 @@ static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
>  	if (event->tp_event->prog)
>  		return -EEXIST;
>  
> -	if (!(event->tp_event->flags & TRACE_EVENT_FL_UKPROBE))
> -		/* bpf programs can only be attached to u/kprobes */
> +	is_kprobe = event->tp_event->flags & TRACE_EVENT_FL_UKPROBE;
> +	is_tracepoint = event->tp_event->flags & TRACE_EVENT_FL_TRACEPOINT;
> +	if (!is_kprobe && !is_tracepoint)
> +		/* bpf programs can only be attached to u/kprobe or tracepoint */
>  		return -EINVAL;
>  
>  	prog = bpf_prog_get(prog_fd);
>  	if (IS_ERR(prog))
>  		return PTR_ERR(prog);
>  
> -	if (prog->type != BPF_PROG_TYPE_KPROBE) {
> +	if ((is_kprobe && prog->type != BPF_PROG_TYPE_KPROBE) ||
> +	    (is_tracepoint && prog->type != BPF_PROG_TYPE_TRACEPOINT)) {
>  		/* valid fd, but invalid bpf program type */
>  		bpf_prog_put(prog);
>  		return -EINVAL;
> diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
> index 7a68afca8249..7ada829029d3 100644
> --- a/kernel/trace/trace_event_perf.c
> +++ b/kernel/trace/trace_event_perf.c
> @@ -284,6 +284,9 @@ void *perf_trace_buf_prepare(int size, unsigned short type,
>  		*regs = this_cpu_ptr(&__perf_regs[*rctxp]);
>  	raw_data = this_cpu_ptr(perf_trace_buf[*rctxp]);
>  
> +	if (type == TRACE_EVENT_TYPE_MAX)
> +		return raw_data;
> +
>  	/* zero the dead bytes from align to not leak stack to user */
>  	memset(&raw_data[size - sizeof(u64)], 0, sizeof(u64));
>  

^ permalink raw reply

* Re: [PATCH net-next 0/8] allow bpf attach to tracepoints
From: Steven Rostedt @ 2016-04-18 20:47 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peter Zijlstra, David S . Miller, Ingo Molnar, Daniel Borkmann,
	Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik, Brendan Gregg,
	netdev, linux-kernel, kernel-team
In-Reply-To: <57153ACF.9090105@fb.com>

On Mon, 18 Apr 2016 12:51:43 -0700
Alexei Starovoitov <ast@fb.com> wrote:

> yeah, it could be added to ftrace as well, but it won't be as effective
> as perf_trace, since the cost of trace_event_buffer_reserve() in
> trace_event_raw_event_() handler is significantly higher than 
> perf_trace_buf_alloc() in perf_trace_().

Right, because ftrace optimizes the case where all traces make it into
the ring buffer (zero copy). Of course, if a filter is a attached, it
would be trivial to add a case to copy first into a per cpu context
buffer, and only copy into the ring buffer if the filter succeeds. Hmm,
that may actually be something I want to do regardless with the current
filter system.

> Then from the program point of view it wouldn't care how that memory
> is allocated, so the user tools will just use perf_trace_() style.
> The only use case I see for bpf with ftrace's tracepoint handler
> is to work as an actual filter, but we already have filters there...

But wasn't it shown that eBPF could speed up the current filters? If we
hook into eBPF then we could replace what we have with a faster filter.

> so not clear to me of the actual value of adding bpf to ftrace's
> tracepoint handler.

Well, you wouldn't because you don't use ftrace ;-) But I'm sure others
might.

> At the same time there is whole ftrace as function tracer. That is
> very lucrative field for bpf to plug into ;)

Which could get this as a side-effect of this change.

-- Steve

^ permalink raw reply

* [PATCH net-next] tcp-tso: do not split TSO packets at retransmit time
From: Eric Dumazet @ 2016-04-18 20:56 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Neal Cardwell, Yuchung Cheng, Eric Dumazet, Eric Dumazet

Linux TCP stack painfully segments all TSO/GSO packets before retransmits.

This was fine back in the days when TSO/GSO were emerging, with their
bugs, but we believe the dark age is over.

Keeping big packets in write queues, but also in stack traversal
has a lot of benefits.
 - Less memory overhead, because write queues have less skbs
 - Less cpu overhead at ACK processing.
 - Better SACK processing, as lot of studies mentioned how
   awful linux was at this ;)
 - Less cpu overhead to send the rtx packets
   (IP stack traversal, netfilter traversal, qdisc, drivers...)
 - Better latencies in presence of losses.
 - Smaller spikes in fq like packet schedulers, as retransmits
   are not constrained by TCP Small Queues.

1 % packet losses are common today, and at 100Gbit speeds, this
translates to ~80,000 losses per second. If we are unlucky and
first MSS of a 45-MSS TSO is lost, we are cooking 44 MSS segments
at rtx instead of a single 44-MSS TSO packet.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/tcp.h     |  4 ++--
 net/ipv4/tcp_input.c  |  2 +-
 net/ipv4/tcp_output.c | 64 +++++++++++++++++++++++----------------------------
 net/ipv4/tcp_timer.c  |  4 ++--
 4 files changed, 34 insertions(+), 40 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index fd40f8c64d5f..0dc272dcd772 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -538,8 +538,8 @@ __u32 cookie_v6_init_sequence(const struct sk_buff *skb, __u16 *mss);
 void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
 			       int nonagle);
 bool tcp_may_send_now(struct sock *sk);
-int __tcp_retransmit_skb(struct sock *, struct sk_buff *);
-int tcp_retransmit_skb(struct sock *, struct sk_buff *);
+int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs);
+int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs);
 void tcp_retransmit_timer(struct sock *sk);
 void tcp_xmit_retransmit_queue(struct sock *);
 void tcp_simple_retransmit(struct sock *);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 90e0d9256b74..729e489b5608 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5543,7 +5543,7 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
 	if (data) { /* Retransmit unacked data in SYN */
 		tcp_for_write_queue_from(data, sk) {
 			if (data == tcp_send_head(sk) ||
-			    __tcp_retransmit_skb(sk, data))
+			    __tcp_retransmit_skb(sk, data, 1))
 				break;
 		}
 		tcp_rearm_rto(sk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6451b83d81e9..4876b256a70a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2266,7 +2266,7 @@ void tcp_send_loss_probe(struct sock *sk)
 	if (WARN_ON(!skb || !tcp_skb_pcount(skb)))
 		goto rearm_timer;
 
-	if (__tcp_retransmit_skb(sk, skb))
+	if (__tcp_retransmit_skb(sk, skb, 1))
 		goto rearm_timer;
 
 	/* Record snd_nxt for loss detection. */
@@ -2551,17 +2551,17 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
  * state updates are done by the caller.  Returns non-zero if an
  * error occurred which prevented the send.
  */
-int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
+int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
 {
-	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
 	unsigned int cur_mss;
-	int err;
+	int diff, len, err;
+
 
-	/* Inconslusive MTU probe */
-	if (icsk->icsk_mtup.probe_size) {
+	/* Inconclusive MTU probe */
+	if (icsk->icsk_mtup.probe_size)
 		icsk->icsk_mtup.probe_size = 0;
-	}
 
 	/* Do not sent more than we queued. 1/4 is reserved for possible
 	 * copying overhead: fragmentation, tunneling, mangling etc.
@@ -2594,30 +2594,27 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 	    TCP_SKB_CB(skb)->seq != tp->snd_una)
 		return -EAGAIN;
 
-	if (skb->len > cur_mss) {
-		if (tcp_fragment(sk, skb, cur_mss, cur_mss, GFP_ATOMIC))
+	len = cur_mss * segs;
+	if (skb->len > len) {
+		if (tcp_fragment(sk, skb, len, cur_mss, GFP_ATOMIC))
 			return -ENOMEM; /* We'll try again later. */
 	} else {
-		int oldpcount = tcp_skb_pcount(skb);
+		if (skb_unclone(skb, GFP_ATOMIC))
+			return -ENOMEM;
 
-		if (unlikely(oldpcount > 1)) {
-			if (skb_unclone(skb, GFP_ATOMIC))
-				return -ENOMEM;
-			tcp_init_tso_segs(skb, cur_mss);
-			tcp_adjust_pcount(sk, skb, oldpcount - tcp_skb_pcount(skb));
-		}
+		diff = tcp_skb_pcount(skb);
+		tcp_set_skb_tso_segs(skb, cur_mss);
+		diff -= tcp_skb_pcount(skb);
+		if (diff)
+			tcp_adjust_pcount(sk, skb, diff);
+		if (skb->len < cur_mss)
+			tcp_retrans_try_collapse(sk, skb, cur_mss);
 	}
 
 	/* RFC3168, section 6.1.1.1. ECN fallback */
 	if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN_ECN) == TCPHDR_SYN_ECN)
 		tcp_ecn_clear_syn(sk, skb);
 
-	tcp_retrans_try_collapse(sk, skb, cur_mss);
-
-	/* Make a copy, if the first transmission SKB clone we made
-	 * is still in somebody's hands, else make a clone.
-	 */
-
 	/* make sure skb->data is aligned on arches that require it
 	 * and check if ack-trimming & collapsing extended the headroom
 	 * beyond what csum_start can cover.
@@ -2633,20 +2630,22 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 	}
 
 	if (likely(!err)) {
+		segs = tcp_skb_pcount(skb);
+
 		TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
 		/* Update global TCP statistics. */
-		TCP_INC_STATS(sock_net(sk), TCP_MIB_RETRANSSEGS);
+		TCP_ADD_STATS(sock_net(sk), TCP_MIB_RETRANSSEGS, segs);
 		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
 			NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
-		tp->total_retrans++;
+		tp->total_retrans += segs;
 	}
 	return err;
 }
 
-int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
+int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	int err = __tcp_retransmit_skb(sk, skb);
+	int err = __tcp_retransmit_skb(sk, skb, segs);
 
 	if (err == 0) {
 #if FASTRETRANS_DEBUG > 0
@@ -2737,6 +2736,7 @@ void tcp_xmit_retransmit_queue(struct sock *sk)
 
 	tcp_for_write_queue_from(skb, sk) {
 		__u8 sacked = TCP_SKB_CB(skb)->sacked;
+		int segs;
 
 		if (skb == tcp_send_head(sk))
 			break;
@@ -2744,14 +2744,8 @@ void tcp_xmit_retransmit_queue(struct sock *sk)
 		if (!hole)
 			tp->retransmit_skb_hint = skb;
 
-		/* Assume this retransmit will generate
-		 * only one packet for congestion window
-		 * calculation purposes.  This works because
-		 * tcp_retransmit_skb() will chop up the
-		 * packet to be MSS sized and all the
-		 * packet counting works out.
-		 */
-		if (tcp_packets_in_flight(tp) >= tp->snd_cwnd)
+		segs = tp->snd_cwnd - tcp_packets_in_flight(tp);
+		if (segs <= 0)
 			return;
 
 		if (fwd_rexmitting) {
@@ -2788,7 +2782,7 @@ begin_fwd:
 		if (sacked & (TCPCB_SACKED_ACKED|TCPCB_SACKED_RETRANS))
 			continue;
 
-		if (tcp_retransmit_skb(sk, skb))
+		if (tcp_retransmit_skb(sk, skb, segs))
 			return;
 
 		NET_INC_STATS_BH(sock_net(sk), mib_idx);
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 49bc474f8e35..373b03e78aaa 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -404,7 +404,7 @@ void tcp_retransmit_timer(struct sock *sk)
 			goto out;
 		}
 		tcp_enter_loss(sk);
-		tcp_retransmit_skb(sk, tcp_write_queue_head(sk));
+		tcp_retransmit_skb(sk, tcp_write_queue_head(sk), 1);
 		__sk_dst_reset(sk);
 		goto out_reset_timer;
 	}
@@ -436,7 +436,7 @@ void tcp_retransmit_timer(struct sock *sk)
 
 	tcp_enter_loss(sk);
 
-	if (tcp_retransmit_skb(sk, tcp_write_queue_head(sk)) > 0) {
+	if (tcp_retransmit_skb(sk, tcp_write_queue_head(sk), 1) > 0) {
 		/* Retransmission failed because of local congestion,
 		 * do not backoff.
 		 */
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH net-next v5] rtnetlink: add new RTM_GETSTATS message to dump link stats
From: Roopa Prabhu @ 2016-04-18 21:10 UTC (permalink / raw)
  To: netdev; +Cc: jhs, davem, tgraf, nicolas.dichtel

From: Roopa Prabhu <roopa@cumulusnetworks.com>

This patch adds a new RTM_GETSTATS message to query link stats via netlink
from the kernel. RTM_NEWLINK also dumps stats today, but RTM_NEWLINK
returns a lot more than just stats and is expensive in some cases when
frequent polling for stats from userspace is a common operation.

RTM_GETSTATS is an attempt to provide a light weight netlink message
to explicity query only link stats from the kernel on an interface.
The idea is to also keep it extensible so that new kinds of stats can be
added to it in the future.

This patch adds the following attribute for NETDEV stats:
struct nla_policy ifla_stats_policy[IFLA_STATS_MAX + 1] = {
        [IFLA_STATS_LINK_64]  = { .len = sizeof(struct rtnl_link_stats64) },
};

Like any other rtnetlink message, RTM_GETSTATS can be used to get stats of
a single interface or all interfaces with NLM_F_DUMP.

Future possible new types of stat attributes:
link af stats:
    - IFLA_STATS_LINK_IPV6  (nested. for ipv6 stats)
    - IFLA_STATS_LINK_MPLS  (nested. for mpls/mdev stats)
extended stats:
    - IFLA_STATS_LINK_EXTENDED (nested. extended software netdev stats like bridge,
      vlan, vxlan etc)
    - IFLA_STATS_LINK_HW_EXTENDED (nested. extended hardware stats which are
      available via ethtool today)

This patch also declares a filter mask for all stat attributes.
User has to provide a mask of stats attributes to query. filter mask
can be specified in the new hdr 'struct if_stats_msg' for stats messages.
Other important field in the header is the ifindex.

This api can also include attributes for global stats (eg tcp) in the future.
When global stats are included in a stats msg, the ifindex in the header
must be zero. A single stats message cannot contain both global and
netdev specific stats. To easily distinguish them, netdev specific stat
attributes name are prefixed with IFLA_STATS_LINK_

Without any attributes in the filter_mask, no stats will be returned.

This patch has been tested with mofified iproute2 ifstat.

Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
RFC to v1 (apologies for the delay in sending this version out. busy days):
        - Addressed feedback from Dave
                - removed rtnl_link_stats
                - Added hdr struct if_stats_msg to carry ifindex and
                  filter mask
                - new macro IFLA_STATS_FILTER_BIT(ATTR) for filter mask
        - split the ipv6 patch into a separate patch, need some more eyes on it
        - prefix attributes with IFLA_STATS instead of IFLA_LINK_STATS for
          shorter attribute names

v2:
        - move IFLA_STATS_INET6 declaration to the inet6 patch
        - get rid of RTM_DELSTATS
        - mark ipv6 patch RFC. It can be used as an example for
          other AF stats like stats

v3:
        - add required padding to the if_stats_msg structure(suggested by jamal)
        - rename netdev stat attributes with IFLA_STATS_LINK prefix
          so that they are easily distinguishable with global
          stats in the future (after global stats discussion with thomas)
        - get rid of unnecessary copy when getting stats with dev_get_stats
          (suggested by dave)

v4:
        - dropped calcit and af stats from this patch. Will add it
          back when it becomes necessary and with the first af stats
          patch
        - add check for null filter in dump and return -EINVAL:
          this follows rtnl_fdb_dump in returning an error.
          But since netlink_dump does not propagate the error
          to the user, the user will not see an error and
          but will also not see any data. This is consistent with
          other kinds of dumps.

v5:
        - fix selinux nlmsgtab to account for new RTM_*STATS messages

 include/uapi/linux/if_link.h   |  23 +++++++
 include/uapi/linux/rtnetlink.h |   5 ++
 net/core/rtnetlink.c           | 150 +++++++++++++++++++++++++++++++++++++++++
 security/selinux/nlmsgtab.c    |   4 +-
 4 files changed, 181 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index bb3a90b..0762f35 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -781,4 +781,27 @@ enum {
 
 #define IFLA_HSR_MAX (__IFLA_HSR_MAX - 1)
 
+/* STATS section */
+
+struct if_stats_msg {
+	__u8  family;
+	__u8  pad1;
+	__u16 pad2;
+	__u32 ifindex;
+	__u32 filter_mask;
+};
+
+/* A stats attribute can be netdev specific or a global stat.
+ * For netdev stats, lets use the prefix IFLA_STATS_LINK_*
+ */
+enum {
+	IFLA_STATS_UNSPEC,
+	IFLA_STATS_LINK_64,
+	__IFLA_STATS_MAX,
+};
+
+#define IFLA_STATS_MAX (__IFLA_STATS_MAX - 1)
+
+#define IFLA_STATS_FILTER_BIT(ATTR)	(1 << (ATTR))
+
 #endif /* _UAPI_LINUX_IF_LINK_H */
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index ca764b5..cc885c4 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -139,6 +139,11 @@ enum {
 	RTM_GETNSID = 90,
 #define RTM_GETNSID RTM_GETNSID
 
+	RTM_NEWSTATS = 92,
+#define RTM_NEWSTATS RTM_NEWSTATS
+	RTM_GETSTATS = 94,
+#define RTM_GETSTATS RTM_GETSTATS
+
 	__RTM_MAX,
 #define RTM_MAX		(((__RTM_MAX + 3) & ~3) - 1)
 };
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a75f7e9..fe35102 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3451,6 +3451,153 @@ out:
 	return err;
 }
 
+static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev,
+			       int type, u32 pid, u32 seq, u32 change,
+			       unsigned int flags, unsigned int filter_mask)
+{
+	struct if_stats_msg *ifsm;
+	struct nlmsghdr *nlh;
+	struct nlattr *attr;
+
+	ASSERT_RTNL();
+
+	nlh = nlmsg_put(skb, pid, seq, type, sizeof(*ifsm), flags);
+	if (!nlh)
+		return -EMSGSIZE;
+
+	ifsm = nlmsg_data(nlh);
+	ifsm->ifindex = dev->ifindex;
+	ifsm->filter_mask = filter_mask;
+
+	if (filter_mask & IFLA_STATS_FILTER_BIT(IFLA_STATS_LINK_64)) {
+		struct rtnl_link_stats64 *sp;
+
+		attr = nla_reserve(skb, IFLA_STATS_LINK_64,
+				   sizeof(struct rtnl_link_stats64));
+		if (!attr)
+			goto nla_put_failure;
+
+		sp = nla_data(attr);
+		dev_get_stats(dev, sp);
+	}
+
+	nlmsg_end(skb, nlh);
+
+	return 0;
+
+nla_put_failure:
+	nlmsg_cancel(skb, nlh);
+
+	return -EMSGSIZE;
+}
+
+static const struct nla_policy ifla_stats_policy[IFLA_STATS_MAX + 1] = {
+	[IFLA_STATS_LINK_64]	= { .len = sizeof(struct rtnl_link_stats64) },
+};
+
+static size_t if_nlmsg_stats_size(const struct net_device *dev,
+				  u32 filter_mask)
+{
+	size_t size = 0;
+
+	if (filter_mask & IFLA_STATS_FILTER_BIT(IFLA_STATS_LINK_64))
+		size += nla_total_size(sizeof(struct rtnl_link_stats64));
+
+	return size;
+}
+
+static int rtnl_stats_get(struct sk_buff *skb, struct nlmsghdr *nlh)
+{
+	struct net *net = sock_net(skb->sk);
+	struct if_stats_msg *ifsm;
+	struct net_device *dev = NULL;
+	struct sk_buff *nskb;
+	u32 filter_mask;
+	int err;
+
+	ifsm = nlmsg_data(nlh);
+	if (ifsm->ifindex > 0)
+		dev = __dev_get_by_index(net, ifsm->ifindex);
+	else
+		return -EINVAL;
+
+	if (!dev)
+		return -ENODEV;
+
+	filter_mask = ifsm->filter_mask;
+	if (!filter_mask)
+		return -EINVAL;
+
+	nskb = nlmsg_new(if_nlmsg_stats_size(dev, filter_mask), GFP_KERNEL);
+	if (!nskb)
+		return -ENOBUFS;
+
+	err = rtnl_fill_statsinfo(nskb, dev, RTM_NEWSTATS,
+				  NETLINK_CB(skb).portid, nlh->nlmsg_seq, 0,
+				  0, filter_mask);
+	if (err < 0) {
+		/* -EMSGSIZE implies BUG in if_nlmsg_stats_size */
+		WARN_ON(err == -EMSGSIZE);
+		kfree_skb(nskb);
+	} else {
+		err = rtnl_unicast(nskb, net, NETLINK_CB(skb).portid);
+	}
+
+	return err;
+}
+
+static int rtnl_stats_dump(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	struct net *net = sock_net(skb->sk);
+	struct if_stats_msg *ifsm;
+	int h, s_h;
+	int idx = 0, s_idx;
+	struct net_device *dev;
+	struct hlist_head *head;
+	unsigned int flags = NLM_F_MULTI;
+	u32 filter_mask = 0;
+	int err;
+
+	s_h = cb->args[0];
+	s_idx = cb->args[1];
+
+	cb->seq = net->dev_base_seq;
+
+	ifsm = nlmsg_data(cb->nlh);
+	filter_mask = ifsm->filter_mask;
+	if (!filter_mask)
+		return -EINVAL;
+
+	for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
+		idx = 0;
+		head = &net->dev_index_head[h];
+		hlist_for_each_entry(dev, head, index_hlist) {
+			if (idx < s_idx)
+				goto cont;
+			err = rtnl_fill_statsinfo(skb, dev, RTM_NEWSTATS,
+						  NETLINK_CB(cb->skb).portid,
+						  cb->nlh->nlmsg_seq, 0,
+						  flags, filter_mask);
+			/* If we ran out of room on the first message,
+			 * we're in trouble
+			 */
+			WARN_ON((err == -EMSGSIZE) && (skb->len == 0));
+
+			if (err < 0)
+				goto out;
+
+			nl_dump_check_consistent(cb, nlmsg_hdr(skb));
+cont:
+			idx++;
+		}
+	}
+out:
+	cb->args[1] = idx;
+	cb->args[0] = h;
+
+	return skb->len;
+}
+
 /* Process one rtnetlink message. */
 
 static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
@@ -3600,4 +3747,7 @@ void __init rtnetlink_init(void)
 	rtnl_register(PF_BRIDGE, RTM_GETLINK, NULL, rtnl_bridge_getlink, NULL);
 	rtnl_register(PF_BRIDGE, RTM_DELLINK, rtnl_bridge_dellink, NULL, NULL);
 	rtnl_register(PF_BRIDGE, RTM_SETLINK, rtnl_bridge_setlink, NULL, NULL);
+
+	rtnl_register(PF_UNSPEC, RTM_GETSTATS, rtnl_stats_get, rtnl_stats_dump,
+		      NULL);
 }
diff --git a/security/selinux/nlmsgtab.c b/security/selinux/nlmsgtab.c
index 8495b93..1714633 100644
--- a/security/selinux/nlmsgtab.c
+++ b/security/selinux/nlmsgtab.c
@@ -76,6 +76,8 @@ static struct nlmsg_perm nlmsg_route_perms[] =
 	{ RTM_NEWNSID,		NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
 	{ RTM_DELNSID,		NETLINK_ROUTE_SOCKET__NLMSG_READ  },
 	{ RTM_GETNSID,		NETLINK_ROUTE_SOCKET__NLMSG_READ  },
+	{ RTM_NEWSTATS,		NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
+	{ RTM_GETSTATS,		NETLINK_ROUTE_SOCKET__NLMSG_READ  },
 };
 
 static struct nlmsg_perm nlmsg_tcpdiag_perms[] =
@@ -155,7 +157,7 @@ int selinux_nlmsg_lookup(u16 sclass, u16 nlmsg_type, u32 *perm)
 	switch (sclass) {
 	case SECCLASS_NETLINK_ROUTE_SOCKET:
 		/* RTM_MAX always point to RTM_SETxxxx, ie RTM_NEWxxx + 3 */
-		BUILD_BUG_ON(RTM_MAX != (RTM_NEWNSID + 3));
+		BUILD_BUG_ON(RTM_MAX != (RTM_NEWSTATS + 3));
 		err = nlmsg_perm(nlmsg_type, perm, nlmsg_route_perms,
 				 sizeof(nlmsg_route_perms));
 		break;
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH net-next v4] rtnetlink: add new RTM_GETSTATS message to dump link stats
From: roopa @ 2016-04-18 21:11 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, jhs, tgraf
In-Reply-To: <20160418.122807.412860510045727785.davem@davemloft.net>

On 4/18/16, 9:28 AM, David Miller wrote:
> From: Roopa Prabhu <roopa@cumulusnetworks.com>
> Date: Sun, 17 Apr 2016 22:35:35 -0700
>
>> This patch adds a new RTM_GETSTATS message to query link stats via netlink
>> from the kernel. RTM_NEWLINK also dumps stats today, but RTM_NEWLINK
>> returns a lot more than just stats and is expensive in some cases when
>> frequent polling for stats from userspace is a common operation.
> The following build failure needs to be resolved:
>
> security/selinux/nlmsgtab.c: In function ‘selinux_nlmsg_lookup’:
> include/linux/compiler.h:506:38: error: call to ‘__compiletime_assert_158’ declared with attribute error: BUILD_BUG_ON failed: RTM_MAX != (RTM_NEWNSID + 3)
sorry, did not know about this check. sent v5 just now..

^ permalink raw reply

* Re: [PATCH net-next] tcp-tso: do not split TSO packets at retransmit time
From: Yuchung Cheng @ 2016-04-18 21:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S . Miller, netdev, Neal Cardwell, Eric Dumazet
In-Reply-To: <1461012972-15757-1-git-send-email-edumazet@google.com>

On Mon, Apr 18, 2016 at 1:56 PM, Eric Dumazet <edumazet@google.com> wrote:
> Linux TCP stack painfully segments all TSO/GSO packets before retransmits.
>
> This was fine back in the days when TSO/GSO were emerging, with their
> bugs, but we believe the dark age is over.
>
> Keeping big packets in write queues, but also in stack traversal
> has a lot of benefits.
>  - Less memory overhead, because write queues have less skbs
>  - Less cpu overhead at ACK processing.
>  - Better SACK processing, as lot of studies mentioned how
>    awful linux was at this ;)
>  - Less cpu overhead to send the rtx packets
>    (IP stack traversal, netfilter traversal, qdisc, drivers...)
>  - Better latencies in presence of losses.
>  - Smaller spikes in fq like packet schedulers, as retransmits
>    are not constrained by TCP Small Queues.
>
> 1 % packet losses are common today, and at 100Gbit speeds, this
> translates to ~80,000 losses per second. If we are unlucky and
> first MSS of a 45-MSS TSO is lost, we are cooking 44 MSS segments
> at rtx instead of a single 44-MSS TSO packet.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
> ---
>  include/net/tcp.h     |  4 ++--
>  net/ipv4/tcp_input.c  |  2 +-
>  net/ipv4/tcp_output.c | 64 +++++++++++++++++++++++----------------------------
>  net/ipv4/tcp_timer.c  |  4 ++--
>  4 files changed, 34 insertions(+), 40 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index fd40f8c64d5f..0dc272dcd772 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -538,8 +538,8 @@ __u32 cookie_v6_init_sequence(const struct sk_buff *skb, __u16 *mss);
>  void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
>                                int nonagle);
>  bool tcp_may_send_now(struct sock *sk);
> -int __tcp_retransmit_skb(struct sock *, struct sk_buff *);
> -int tcp_retransmit_skb(struct sock *, struct sk_buff *);
> +int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs);
> +int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs);
>  void tcp_retransmit_timer(struct sock *sk);
>  void tcp_xmit_retransmit_queue(struct sock *);
>  void tcp_simple_retransmit(struct sock *);
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 90e0d9256b74..729e489b5608 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5543,7 +5543,7 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
>         if (data) { /* Retransmit unacked data in SYN */
>                 tcp_for_write_queue_from(data, sk) {
>                         if (data == tcp_send_head(sk) ||
> -                           __tcp_retransmit_skb(sk, data))
> +                           __tcp_retransmit_skb(sk, data, 1))
>                                 break;
>                 }
>                 tcp_rearm_rto(sk);
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 6451b83d81e9..4876b256a70a 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2266,7 +2266,7 @@ void tcp_send_loss_probe(struct sock *sk)
>         if (WARN_ON(!skb || !tcp_skb_pcount(skb)))
>                 goto rearm_timer;
>
> -       if (__tcp_retransmit_skb(sk, skb))
> +       if (__tcp_retransmit_skb(sk, skb, 1))
>                 goto rearm_timer;
>
>         /* Record snd_nxt for loss detection. */
> @@ -2551,17 +2551,17 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
>   * state updates are done by the caller.  Returns non-zero if an
>   * error occurred which prevented the send.
>   */
> -int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
> +int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
>  {
> -       struct tcp_sock *tp = tcp_sk(sk);
>         struct inet_connection_sock *icsk = inet_csk(sk);
> +       struct tcp_sock *tp = tcp_sk(sk);
>         unsigned int cur_mss;
> -       int err;
> +       int diff, len, err;
> +
>
> -       /* Inconslusive MTU probe */
> -       if (icsk->icsk_mtup.probe_size) {
> +       /* Inconclusive MTU probe */
> +       if (icsk->icsk_mtup.probe_size)
>                 icsk->icsk_mtup.probe_size = 0;
> -       }
>
>         /* Do not sent more than we queued. 1/4 is reserved for possible
>          * copying overhead: fragmentation, tunneling, mangling etc.
> @@ -2594,30 +2594,27 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
>             TCP_SKB_CB(skb)->seq != tp->snd_una)
>                 return -EAGAIN;
>
> -       if (skb->len > cur_mss) {
> -               if (tcp_fragment(sk, skb, cur_mss, cur_mss, GFP_ATOMIC))
> +       len = cur_mss * segs;
> +       if (skb->len > len) {
> +               if (tcp_fragment(sk, skb, len, cur_mss, GFP_ATOMIC))
>                         return -ENOMEM; /* We'll try again later. */
>         } else {
> -               int oldpcount = tcp_skb_pcount(skb);
> +               if (skb_unclone(skb, GFP_ATOMIC))
> +                       return -ENOMEM;
>
> -               if (unlikely(oldpcount > 1)) {
> -                       if (skb_unclone(skb, GFP_ATOMIC))
> -                               return -ENOMEM;
> -                       tcp_init_tso_segs(skb, cur_mss);
> -                       tcp_adjust_pcount(sk, skb, oldpcount - tcp_skb_pcount(skb));
> -               }
> +               diff = tcp_skb_pcount(skb);
> +               tcp_set_skb_tso_segs(skb, cur_mss);
> +               diff -= tcp_skb_pcount(skb);
> +               if (diff)
> +                       tcp_adjust_pcount(sk, skb, diff);
> +               if (skb->len < cur_mss)
> +                       tcp_retrans_try_collapse(sk, skb, cur_mss);
>         }
>
>         /* RFC3168, section 6.1.1.1. ECN fallback */
>         if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN_ECN) == TCPHDR_SYN_ECN)
>                 tcp_ecn_clear_syn(sk, skb);
>
> -       tcp_retrans_try_collapse(sk, skb, cur_mss);
> -
> -       /* Make a copy, if the first transmission SKB clone we made
> -        * is still in somebody's hands, else make a clone.
> -        */
> -
>         /* make sure skb->data is aligned on arches that require it
>          * and check if ack-trimming & collapsing extended the headroom
>          * beyond what csum_start can cover.
> @@ -2633,20 +2630,22 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
>         }
>
>         if (likely(!err)) {
> +               segs = tcp_skb_pcount(skb);
> +
>                 TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
>                 /* Update global TCP statistics. */
> -               TCP_INC_STATS(sock_net(sk), TCP_MIB_RETRANSSEGS);
> +               TCP_ADD_STATS(sock_net(sk), TCP_MIB_RETRANSSEGS, segs);
>                 if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
>                         NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
> -               tp->total_retrans++;
> +               tp->total_retrans += segs;
>         }
>         return err;
>  }
>
> -int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
> +int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
>  {
>         struct tcp_sock *tp = tcp_sk(sk);
> -       int err = __tcp_retransmit_skb(sk, skb);
> +       int err = __tcp_retransmit_skb(sk, skb, segs);
>
>         if (err == 0) {
>  #if FASTRETRANS_DEBUG > 0
> @@ -2737,6 +2736,7 @@ void tcp_xmit_retransmit_queue(struct sock *sk)
>
>         tcp_for_write_queue_from(skb, sk) {
>                 __u8 sacked = TCP_SKB_CB(skb)->sacked;
> +               int segs;
>
>                 if (skb == tcp_send_head(sk))
>                         break;
> @@ -2744,14 +2744,8 @@ void tcp_xmit_retransmit_queue(struct sock *sk)
>                 if (!hole)
>                         tp->retransmit_skb_hint = skb;
>
> -               /* Assume this retransmit will generate
> -                * only one packet for congestion window
> -                * calculation purposes.  This works because
> -                * tcp_retransmit_skb() will chop up the
> -                * packet to be MSS sized and all the
> -                * packet counting works out.
> -                */
> -               if (tcp_packets_in_flight(tp) >= tp->snd_cwnd)
> +               segs = tp->snd_cwnd - tcp_packets_in_flight(tp);
> +               if (segs <= 0)
>                         return;
>
>                 if (fwd_rexmitting) {
> @@ -2788,7 +2782,7 @@ begin_fwd:
>                 if (sacked & (TCPCB_SACKED_ACKED|TCPCB_SACKED_RETRANS))
>                         continue;
>
> -               if (tcp_retransmit_skb(sk, skb))
> +               if (tcp_retransmit_skb(sk, skb, segs))
>                         return;
>
>                 NET_INC_STATS_BH(sock_net(sk), mib_idx);
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 49bc474f8e35..373b03e78aaa 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -404,7 +404,7 @@ void tcp_retransmit_timer(struct sock *sk)
>                         goto out;
>                 }
>                 tcp_enter_loss(sk);
> -               tcp_retransmit_skb(sk, tcp_write_queue_head(sk));
> +               tcp_retransmit_skb(sk, tcp_write_queue_head(sk), 1);
>                 __sk_dst_reset(sk);
>                 goto out_reset_timer;
>         }
> @@ -436,7 +436,7 @@ void tcp_retransmit_timer(struct sock *sk)
>
>         tcp_enter_loss(sk);
>
> -       if (tcp_retransmit_skb(sk, tcp_write_queue_head(sk)) > 0) {
> +       if (tcp_retransmit_skb(sk, tcp_write_queue_head(sk), 1) > 0) {
>                 /* Retransmission failed because of local congestion,
>                  * do not backoff.
>                  */
> --
> 2.8.0.rc3.226.g39d4020
>

^ permalink raw reply

* Re: [PATCH net-next 0/8] allow bpf attach to tracepoints
From: Alexei Starovoitov @ 2016-04-18 21:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, David S . Miller, Ingo Molnar, Daniel Borkmann,
	Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik, Brendan Gregg,
	netdev, linux-kernel, kernel-team
In-Reply-To: <20160418164713.337cd66f@gandalf.local.home>

On 4/18/16 1:47 PM, Steven Rostedt wrote:
> On Mon, 18 Apr 2016 12:51:43 -0700
> Alexei Starovoitov <ast@fb.com> wrote:
>
>
>> yeah, it could be added to ftrace as well, but it won't be as effective
>> as perf_trace, since the cost of trace_event_buffer_reserve() in
>> trace_event_raw_event_() handler is significantly higher than
>> perf_trace_buf_alloc() in perf_trace_().
>
> Right, because ftrace optimizes the case where all traces make it into
> the ring buffer (zero copy). Of course, if a filter is a attached, it
> would be trivial to add a case to copy first into a per cpu context
> buffer, and only copy into the ring buffer if the filter succeeds. Hmm,
> that may actually be something I want to do regardless with the current
> filter system.
>
>> Then from the program point of view it wouldn't care how that memory
>> is allocated, so the user tools will just use perf_trace_() style.
>> The only use case I see for bpf with ftrace's tracepoint handler
>> is to work as an actual filter, but we already have filters there...
>
> But wasn't it shown that eBPF could speed up the current filters? If we
> hook into eBPF then we could replace what we have with a faster filter.

yes. Long ago I had a patch to accelerate filter_match_preds(),
but then abandoned it, since, as you said, ftrace is primarily targeting
streaming these events to user space and faster filtering probably
won't be much noticeable to the end user.
So far all the tools around bpf have been capitalizing on
the ability to aggregate data in the kernel.

>> At the same time there is whole ftrace as function tracer. That is
>> very lucrative field for bpf to plug into ;)
>
> Which could get this as a side-effect of this change.

not really as side-effect. That would be the new thing to design.
I only have few rough ideas on how to approach that.

^ permalink raw reply

* Re: [PATCH net-next v5] rtnetlink: add new RTM_GETSTATS message to dump link stats
From: Eric Dumazet @ 2016-04-18 21:35 UTC (permalink / raw)
  To: Roopa Prabhu; +Cc: netdev, jhs, davem, tgraf, nicolas.dichtel
In-Reply-To: <1461013819-23223-1-git-send-email-roopa@cumulusnetworks.com>

On Mon, 2016-04-18 at 14:10 -0700, Roopa Prabhu wrote:

> +	if (filter_mask & IFLA_STATS_FILTER_BIT(IFLA_STATS_LINK_64)) {
> +		struct rtnl_link_stats64 *sp;
> +
> +		attr = nla_reserve(skb, IFLA_STATS_LINK_64,
> +				   sizeof(struct rtnl_link_stats64));
> +		if (!attr)
> +			goto nla_put_failure;
> +
> +		sp = nla_data(attr);

Are you sure we have a guarantee that sp is aligned to u64 fields ?

x86 does not care, but some arches would have a potential misalign
access here.


> +		dev_get_stats(dev, sp);
> +	}

^ permalink raw reply

* Re: [PATCH net-next 2/8] perf, bpf: allow bpf programs attach to tracepoints
From: Alexei Starovoitov @ 2016-04-18 21:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, David S . Miller, Ingo Molnar, Daniel Borkmann,
	Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik, Brendan Gregg,
	netdev, linux-kernel, kernel-team
In-Reply-To: <20160418162905.220df2f4@gandalf.local.home>

On 4/18/16 1:29 PM, Steven Rostedt wrote:
> On Mon, 4 Apr 2016 21:52:48 -0700
> Alexei Starovoitov <ast@fb.com> wrote:
>
>> introduce BPF_PROG_TYPE_TRACEPOINT program type and allow it to be
>> attached to tracepoints.
>> The tracepoint will copy the arguments in the per-cpu buffer and pass
>> it to the bpf program as its first argument.
>> The layout of the fields can be discovered by doing
>> 'cat /sys/kernel/debug/tracing/events/sched/sched_switch/format'
>> prior to the compilation of the program with exception that first 8 bytes
>> are reserved and not accessible to the program. This area is used to store
>> the pointer to 'struct pt_regs' which some of the bpf helpers will use:
>> +---------+
>> | 8 bytes | hidden 'struct pt_regs *' (inaccessible to bpf program)
>> +---------+
>> | N bytes | static tracepoint fields defined in tracepoint/format (bpf readonly)
>> +---------+
>> | dynamic | __dynamic_array bytes of tracepoint (inaccessible to bpf yet)
>> +---------+
>>
>> Not that all of the fields are already dumped to user space via perf ring buffer
>> and some application access it directly without consulting tracepoint/format.
>> Same rule applies here: static tracepoint fields should only be accessed
>> in a format defined in tracepoint/format. The order of fields and
>> field sizes are not an ABI.
>>
>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>> ---
...
>> -	entry = perf_trace_buf_prepare(__entry_size,			\
>> -			event_call->event.type, &__regs, &rctx);	\
>> +	event_type = prog ? TRACE_EVENT_TYPE_MAX : event_call->event.type; \
>
> Can you move this into perf_trace_entry_prepare?

that's the old version.
The last one are commits 1e1dcd93b46 and 98b5c2c65c295 in net-next.

>> +	if (prog) {							\
>> +		*(struct pt_regs **)entry = __regs;			\
>> +		if (!trace_call_bpf(prog, entry) || hlist_empty(head)) { \
>> +			perf_swevent_put_recursion_context(rctx);	\
>> +			return;						\
>> +		}							\
>> +		memset(&entry->ent, 0, sizeof(entry->ent));		\
>> +	}								\
>
> And perhaps this into perf_trace_buf_submit()?
>
> Tracepoints are a major cause of bloat, and the reasons for these
> prepare and submit functions is to move code out of the macros. Every
> tracepoint in the kernel (1000 and counting) will include this code.
> I've already had complaints that each tracepoint can add up to 5k to
> the core.

I was worried about this too, but single 'if' and two calls
(as in commit 98b5c2c65c295) is a better way, since it's faster, cleaner
and doesn't need to refactor the whole perf_trace_buf_submit() to pass
extra event_call argument to it.
perf_trace_buf_submit() is already ugly with 8 arguments!
Passing more args or creating a struct to pass args only going to
hurt performance without much reduction in .text size.
tinyfication folks will disable tracepoints anyway.
Note that the most common case is bpf returning 0 and not even
calling perf_trace_buf_submit() which is already slow due
to so many args passed on stack.
This stuff is called million times a second, so every instruction
counts.

^ permalink raw reply

* Re: [PATCH net-next] macvlan: fix failure during registration
From: Eric W. Biederman @ 2016-04-18 21:33 UTC (permalink / raw)
  To: Francesco Ruggeri; +Cc: netdev, David S. Miller, Mahesh Bandewar
In-Reply-To: <CA+HUmGhUqXn35x6Nnhcrk+WH8PuiaxwsrZNy-NsJ94wybCukGA@mail.gmail.com>

Francesco Ruggeri <fruggeri@arista.com> writes:

> On Mon, Apr 18, 2016 at 11:48 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> These interactions all seem a little bit funny.  At a quick skim it
>> would make more sense to increment the port count in macvlan_init,
>> and completely remove the need to mess with port counts anywhere except
>> macvlan_init and macvlan_uninit.
>
> Thanks Eric, let me try that.
>
>>
>> If for some reason that can't be done the code can easily look at
>> dev->reg_state.  If dev->reg_state == NETREG_UNITIALIZED it should
>> be exactly the same as your new flag being set.
>>
>
> I am not sure that in macvlan_uninit one can tell whether it is being
> invoked in the context of a failed register_netdevice (if that is what
> you meant).
>
> In case of register_netdevice failing in
> call_netdevice_notifiers(NETDEV_REGISTER) the call sequence is:
>
> macvlan_common_newlink
>   register_netdevice
>     call_netdevice_notifiers(NETDEV_REGISTER, dev) (<== fail here)
>       rollback_registered(dev);
>         rollback_registered_many
>           dev->reg_state = NETREG_UNREGISTERING
>           dev->netdev_ops->ndo_uninit(dev)
>       dev->reg_state = NETREG_UNREGISTERED;
>

The code I have looks a little different.  But that is a good point.

But please see if you can get macvlan_init to do the necessary work.
That should simplify everything, and make clever games unnecessary.

Eric

^ permalink raw reply

* Re: [PATCH] macsec: fix crypto Kconfig dependency
From: Stephen Rothwell @ 2016-04-18 21:44 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, arnd, sd, hannes, johannes, netdev, linux-kernel
In-Reply-To: <20160418.124316.1862068166928198574.davem@davemloft.net>

Hi Dave,

On Mon, 18 Apr 2016 12:43:16 -0400 (EDT) David Miller <davem@davemloft.net> wrote:
>
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Mon, 18 Apr 2016 18:43:36 +0800
> 
> > Right, the problem is that nothing within crypto ever selects
> > CRYPTO since it's also used as a way of hiding the crypto menu
> > options.  
> 
> As far as I understand it, this won't help.  Because selects do not
> trigger other selects and dependencies.

My understanding is that selects trigger other selects, but do not
honour dependencies.  So, in general, you should not select a symbol
that has dependencies (unless you also have the same dependencies or
select those dependencies).

-- 
Cheers,
Stephen Rothwell

^ permalink raw reply

* Re: [PATCH nf] netfilter: ipv6: Orphan skbs in nf_ct_frag6_gather()
From: Joe Stringer @ 2016-04-18 21:49 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Florian Westphal, David Laight, netfilter-devel@vger.kernel.org,
	diproiettod@vmware.com, netdev@vger.kernel.org
In-Reply-To: <20160418183546.GA4289@salvia>

On 18 April 2016 at 11:35, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Thu, Apr 14, 2016 at 05:35:39PM -0700, Joe Stringer wrote:
>> On 14 April 2016 at 03:35, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>> > On Thu, Apr 14, 2016 at 10:40:15AM +0200, Florian Westphal wrote:
>> >> David Laight <David.Laight@ACULAB.COM> wrote:
>> >> > From: Joe Stringer
>> >> > > Sent: 13 April 2016 19:10
>> >> > > This is the IPv6 equivalent of commit 8282f27449bf ("inet: frag: Always
>> >> > > orphan skbs inside ip_defrag()").
>> >> > >
>> >> > > Prior to commit 029f7f3b8701 ("netfilter: ipv6: nf_defrag: avoid/free
>> >> > > clone operations"), ipv6 fragments sent to nf_ct_frag6_gather() would be
>> >> > > cloned (implicitly orphaning) prior to queueing for reassembly. As such,
>> >> > > when the IPv6 message is eventually reassembled, the skb->sk for all
>> >> > > fragments would be NULL. After that commit was introduced, rather than
>> >> > > cloning, the original skbs were queued directly without orphaning. The
>> >> > > end result is that all frags except for the first and last may have a
>> >> > > socket attached.
>> >> >
>> >> > I'd have thought that the queued fragments would still want to be
>> >> > resource-counted against the socket (I think that is what skb->sk is for).
>> >>
>> >> No, ipv4/ipv6 reasm has its own accouting.
>> >>
>> >> > Although I can't imagine why IPv6 reassembly is happening on skb
>> >> > associated with a socket.
>> >>
>> >> Right, thats a much more interesting question -- both ipv4 and
>> >> ipv6 orphan skbs before NF_HOOK prerouting trip.
>> >>
>> >> (That being said, I don't mind the patch, I'm just be curious how this
>> >>  can happen).
>> >
>> > If this change is specific to get this working in ovs and its
>> > conntrack support, then I don't think this belong to core
>> > infrastructure. This should be fixed in ovs instead.
>>
>> I admit I've only been able to reproduce it with OVS. My main reason
>> for proposing the fix this way was just because this is what the IPv4
>> code does, so I figured IPv6 should be consistent with that.
>
> You mean that this is what you did in 029f7f3b8701 to fix this, right?
>
> But we shouldn't add code to the core that is OVS specific for no
> reason. We don't need this orphan from ipv4 and ipv6 as Florian
> indicated.

That makes complete sense to me. I was wondering whether the original
IPv4 fix was the correct one from the nf core perspective - but it
seems like perhaps it is, given that ip_defrag() has a lot more users
from a lot more different paths which are likely relying on the skb to
be orphaned. In comparison, in the IPv6 path, nf_ct_frag6_gather() is
only called from OVS and the netfilter hooks; the hooks already do the
orphaning, so it's more consistent for OVS to do it as well.

> Is there any chance you can fix this from OVS and its conntrack glue
> code? Thanks.

Sure, I'll resend the patch making the fix in OVS.

^ permalink raw reply

* [PATCHv2 net] openvswitch: Orphan skbs before IPv6 defrag
From: Joe Stringer @ 2016-04-18 21:51 UTC (permalink / raw)
  To: netdev; +Cc: Joe Stringer, fw, netfilter-devel, diproiettod, pablo

This is the IPv6 counterpart to commit 8282f27449bf ("inet: frag: Always
orphan skbs inside ip_defrag()").

Prior to commit 029f7f3b8701 ("netfilter: ipv6: nf_defrag: avoid/free
clone operations"), ipv6 fragments sent to nf_ct_frag6_gather() would be
cloned (implicitly orphaning) prior to queueing for reassembly. As such,
when the IPv6 message is eventually reassembled, the skb->sk for all
fragments would be NULL. After that commit was introduced, rather than
cloning, the original skbs were queued directly without orphaning. The
end result is that all frags except for the first and last may have a
socket attached.

This commit explicitly orphans such skbs during nf_ct_frag6_gather() to
prevent BUG_ON(skb->sk) during a later call to ip6_fragment().

kernel BUG at net/ipv6/ip6_output.c:631!
[...]
Call Trace:
 <IRQ>
 [<ffffffff810be8f7>] ? __lock_acquire+0x927/0x20a0
 [<ffffffffa042c7c0>] ? do_output.isra.28+0x1b0/0x1b0 [openvswitch]
 [<ffffffff810bb8a2>] ? __lock_is_held+0x52/0x70
 [<ffffffffa042c587>] ovs_fragment+0x1f7/0x280 [openvswitch]
 [<ffffffff810bdab5>] ? mark_held_locks+0x75/0xa0
 [<ffffffff817be416>] ? _raw_spin_unlock_irqrestore+0x36/0x50
 [<ffffffff81697ea0>] ? dst_discard_out+0x20/0x20
 [<ffffffff81697e80>] ? dst_ifdown+0x80/0x80
 [<ffffffffa042c703>] do_output.isra.28+0xf3/0x1b0 [openvswitch]
 [<ffffffffa042d279>] do_execute_actions+0x709/0x12c0 [openvswitch]
 [<ffffffffa04340a4>] ? ovs_flow_stats_update+0x74/0x1e0 [openvswitch]
 [<ffffffffa04340d1>] ? ovs_flow_stats_update+0xa1/0x1e0 [openvswitch]
 [<ffffffff817be387>] ? _raw_spin_unlock+0x27/0x40
 [<ffffffffa042de75>] ovs_execute_actions+0x45/0x120 [openvswitch]
 [<ffffffffa0432d65>] ovs_dp_process_packet+0x85/0x150 [openvswitch]
 [<ffffffff817be387>] ? _raw_spin_unlock+0x27/0x40
 [<ffffffffa042def4>] ovs_execute_actions+0xc4/0x120 [openvswitch]
 [<ffffffffa0432d65>] ovs_dp_process_packet+0x85/0x150 [openvswitch]
 [<ffffffffa04337f2>] ? key_extract+0x442/0xc10 [openvswitch]
 [<ffffffffa043b26d>] ovs_vport_receive+0x5d/0xb0 [openvswitch]
 [<ffffffff810be8f7>] ? __lock_acquire+0x927/0x20a0
 [<ffffffff810be8f7>] ? __lock_acquire+0x927/0x20a0
 [<ffffffff810be8f7>] ? __lock_acquire+0x927/0x20a0
 [<ffffffff817be416>] ? _raw_spin_unlock_irqrestore+0x36/0x50
 [<ffffffffa043c11d>] internal_dev_xmit+0x6d/0x150 [openvswitch]
 [<ffffffffa043c0b5>] ? internal_dev_xmit+0x5/0x150 [openvswitch]
 [<ffffffff8168fb5f>] dev_hard_start_xmit+0x2df/0x660
 [<ffffffff8168f5ea>] ? validate_xmit_skb.isra.105.part.106+0x1a/0x2b0
 [<ffffffff81690925>] __dev_queue_xmit+0x8f5/0x950
 [<ffffffff81690080>] ? __dev_queue_xmit+0x50/0x950
 [<ffffffff810bdab5>] ? mark_held_locks+0x75/0xa0
 [<ffffffff81690990>] dev_queue_xmit+0x10/0x20
 [<ffffffff8169a418>] neigh_resolve_output+0x178/0x220
 [<ffffffff81752759>] ? ip6_finish_output2+0x219/0x7b0
 [<ffffffff81752759>] ip6_finish_output2+0x219/0x7b0
 [<ffffffff817525a5>] ? ip6_finish_output2+0x65/0x7b0
 [<ffffffff816cde2b>] ? ip_idents_reserve+0x6b/0x80
 [<ffffffff8175488f>] ? ip6_fragment+0x93f/0xc50
 [<ffffffff81754af1>] ip6_fragment+0xba1/0xc50
 [<ffffffff81752540>] ? ip6_flush_pending_frames+0x40/0x40
 [<ffffffff81754c6b>] ip6_finish_output+0xcb/0x1d0
 [<ffffffff81754dcf>] ip6_output+0x5f/0x1a0
 [<ffffffff81754ba0>] ? ip6_fragment+0xc50/0xc50
 [<ffffffff81797fbd>] ip6_local_out+0x3d/0x80
 [<ffffffff817554df>] ip6_send_skb+0x2f/0xc0
 [<ffffffff817555bd>] ip6_push_pending_frames+0x4d/0x50
 [<ffffffff817796cc>] icmpv6_push_pending_frames+0xac/0xe0
 [<ffffffff8177a4be>] icmpv6_echo_reply+0x42e/0x500
 [<ffffffff8177acbf>] icmpv6_rcv+0x4cf/0x580
 [<ffffffff81755ac7>] ip6_input_finish+0x1a7/0x690
 [<ffffffff81755925>] ? ip6_input_finish+0x5/0x690
 [<ffffffff817567a0>] ip6_input+0x30/0xa0
 [<ffffffff81755920>] ? ip6_rcv_finish+0x1a0/0x1a0
 [<ffffffff817557ce>] ip6_rcv_finish+0x4e/0x1a0
 [<ffffffff8175640f>] ipv6_rcv+0x45f/0x7c0
 [<ffffffff81755fe6>] ? ipv6_rcv+0x36/0x7c0
 [<ffffffff81755780>] ? ip6_make_skb+0x1c0/0x1c0
 [<ffffffff8168b649>] __netif_receive_skb_core+0x229/0xb80
 [<ffffffff810bdab5>] ? mark_held_locks+0x75/0xa0
 [<ffffffff8168c07f>] ? process_backlog+0x6f/0x230
 [<ffffffff8168bfb6>] __netif_receive_skb+0x16/0x70
 [<ffffffff8168c088>] process_backlog+0x78/0x230
 [<ffffffff8168c0ed>] ? process_backlog+0xdd/0x230
 [<ffffffff8168db43>] net_rx_action+0x203/0x480
 [<ffffffff810bdab5>] ? mark_held_locks+0x75/0xa0
 [<ffffffff817c156e>] __do_softirq+0xde/0x49f
 [<ffffffff81752768>] ? ip6_finish_output2+0x228/0x7b0
 [<ffffffff817c070c>] do_softirq_own_stack+0x1c/0x30
 <EOI>
 [<ffffffff8106f88b>] do_softirq.part.18+0x3b/0x40
 [<ffffffff8106f946>] __local_bh_enable_ip+0xb6/0xc0
 [<ffffffff81752791>] ip6_finish_output2+0x251/0x7b0
 [<ffffffff81754af1>] ? ip6_fragment+0xba1/0xc50
 [<ffffffff816cde2b>] ? ip_idents_reserve+0x6b/0x80
 [<ffffffff8175488f>] ? ip6_fragment+0x93f/0xc50
 [<ffffffff81754af1>] ip6_fragment+0xba1/0xc50
 [<ffffffff81752540>] ? ip6_flush_pending_frames+0x40/0x40
 [<ffffffff81754c6b>] ip6_finish_output+0xcb/0x1d0
 [<ffffffff81754dcf>] ip6_output+0x5f/0x1a0
 [<ffffffff81754ba0>] ? ip6_fragment+0xc50/0xc50
 [<ffffffff81797fbd>] ip6_local_out+0x3d/0x80
 [<ffffffff817554df>] ip6_send_skb+0x2f/0xc0
 [<ffffffff817555bd>] ip6_push_pending_frames+0x4d/0x50
 [<ffffffff81778558>] rawv6_sendmsg+0xa28/0xe30
 [<ffffffff81719097>] ? inet_sendmsg+0xc7/0x1d0
 [<ffffffff817190d6>] inet_sendmsg+0x106/0x1d0
 [<ffffffff81718fd5>] ? inet_sendmsg+0x5/0x1d0
 [<ffffffff8166d078>] sock_sendmsg+0x38/0x50
 [<ffffffff8166d4d6>] SYSC_sendto+0xf6/0x170
 [<ffffffff8100201b>] ? trace_hardirqs_on_thunk+0x1b/0x1d
 [<ffffffff8166e38e>] SyS_sendto+0xe/0x10
 [<ffffffff817bebe5>] entry_SYSCALL_64_fastpath+0x18/0xa8
Code: 06 48 83 3f 00 75 26 48 8b 87 d8 00 00 00 2b 87 d0 00 00 00 48 39 d0 72 14 8b 87 e4 00 00 00 83 f8 01 75 09 48 83 7f 18 00 74 9a <0f> 0b 41 8b 86 cc 00 00 00 49 8#
RIP  [<ffffffff8175468a>] ip6_fragment+0x73a/0xc50
 RSP <ffff880072803120>

Fixes: 029f7f3b8701 ("netfilter: ipv6: nf_defrag: avoid/free clone
operations")
Reported-by: Daniele Di Proietto <diproiettod@vmware.com>
Signed-off-by: Joe Stringer <joe@ovn.org>
---
 net/openvswitch/conntrack.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 1b9d286756be..b5fea1101faa 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -367,6 +367,7 @@ static int handle_fragments(struct net *net, struct sw_flow_key *key,
 	} else if (key->eth.type == htons(ETH_P_IPV6)) {
 		enum ip6_defrag_users user = IP6_DEFRAG_CONNTRACK_IN + zone;
 
+		skb_orphan(skb);
 		memset(IP6CB(skb), 0, sizeof(struct inet6_skb_parm));
 		err = nf_ct_frag6_gather(net, skb, user);
 		if (err)
-- 
2.1.4

^ permalink raw reply related

* [PATCH] net: w5100: don't build spi driver without w5100
From: Arnd Bergmann @ 2016-04-18 21:58 UTC (permalink / raw)
  To: David S. Miller, Akinobu Mita, Paul Gortmaker, Arnd Bergmann
  Cc: netdev, linux-kernel

The w5100-spi driver front-end only makes sense when the w5100
core driver is enabled, not for a configuration that only has w5300:

drivers/net/built-in.o: In function `w5100_spi_remove':
drivers/net/ethernet/wiznet/w5100-spi.c:277: undefined reference to `w5100_remove'
drivers/net/built-in.o: In function `w5100_spi_probe':
drivers/net/ethernet/wiznet/w5100-spi.c:272: undefined reference to `w5100_probe'
drivers/net/built-in.o: In function `w5200_spi_init':
drivers/net/ethernet/wiznet/w5100-spi.c:125: undefined reference to `w5100_ops_priv'
drivers/net/built-in.o: In function `w5200_spi_readbulk':
drivers/net/ethernet/wiznet/w5100-spi.c:125: undefined reference to `w5100_ops_priv'
drivers/net/built-in.o: In function `w5200_spi_writebulk':
drivers/net/ethernet/wiznet/w5100-spi.c:125: undefined reference to `w5100_ops_priv'
drivers/net/built-in.o:(.data+0x3ed1c): undefined reference to `w5100_pm_ops'

This adds an appropriate Kconfig dependency.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 630cf09751fe ("net: w5100: support SPI interface mode")
---
 drivers/net/ethernet/wiznet/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/wiznet/Kconfig b/drivers/net/ethernet/wiznet/Kconfig
index 1f15376e9856..f3385a1999a2 100644
--- a/drivers/net/ethernet/wiznet/Kconfig
+++ b/drivers/net/ethernet/wiznet/Kconfig
@@ -71,7 +71,7 @@ endchoice
 
 config WIZNET_W5100_SPI
 	tristate "WIZnet W5100/W5200 Ethernet support for SPI mode"
-	depends on WIZNET_BUS_ANY
+	depends on WIZNET_BUS_ANY && WIZNET_W5100
 	depends on SPI
 	---help---
 	  In SPI mode host system accesses registers using SPI protocol
-- 
2.7.0

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox