Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next] bpf: selftests: include <sys/resource.h> to fix build error
From: Daniel Borkmann @ 2016-11-28 12:22 UTC (permalink / raw)
  To: Colin King, Alexei Starovoitov, Shuah Khan, netdev
  Cc: linux-kernel, linux-kselftest
In-Reply-To: <20161128114541.10829-1-colin.king@canonical.com>

On 11/28/2016 12:45 PM, Colin King wrote:
> From: Colin Ian King <colin.king@canonical.com>
>
> Fix incomplete type build error on struct rlimit by including
> <sys/resource.h>, fixes:
>
> test_lru_map.c:552:9: error: variable ‘r’ has initializer
>   but incomplete type
>    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
>           ^
> test_lru_map.c:552:21: error: ‘RLIM_INFINITY’ undeclared
>   (first use in this function)
>    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
>
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Thanks for the patch, fixed here already:

https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=e00c7b216f34444252f3771f7d4ed48d4f032636

^ permalink raw reply

* Re: [PATCH net-next v3 2/4] dt-bindings: net: add EEE capability constants
From: Andreas Färber @ 2016-11-28 12:21 UTC (permalink / raw)
  To: Jerome Brunet, netdev, devicetree
  Cc: Florian Fainelli, Carlo Caione, Kevin Hilman, Giuseppe Cavallaro,
	Alexandre TORGUE, Martin Blumenstingl, Andre Roth, Andrew Lunn,
	Neil Armstrong, linux-amlogic, linux-arm-kernel, linux-kernel
In-Reply-To: <1480326409-25419-3-git-send-email-jbrunet@baylibre.com>

Am 28.11.2016 um 10:46 schrieb Jerome Brunet:
> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
> ---
>  include/dt-bindings/net/mdio.h | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
>  create mode 100644 include/dt-bindings/net/mdio.h

Tested-by: Andreas Färber <afaerber@suse.de>

Regards,
Andreas

-- 
SUSE Linux GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply

* Re: [PATCH net-next 0/5] net: add protocol level recvmmsg support
From: Jesper Dangaard Brouer @ 2016-11-28 12:21 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, David S. Miller, Eric Dumazet, Hannes Frederic Sowa,
	Sabrina Dubroca, brouer
In-Reply-To: <1480330358.6718.13.camel@redhat.com>

On Mon, 28 Nov 2016 11:52:38 +0100
Paolo Abeni <pabeni@redhat.com> wrote:

> Hi Jesper,
> 
> On Fri, 2016-11-25 at 18:37 +0100, Jesper Dangaard Brouer wrote:
> > > The measured performance delta is as follow:
> > > 
> > > 		before		after
> > > 		(Kpps)		(Kpps)
> > > 
> > > udp flood[1]	570		1800(+215%)
> > > max tput[2]	1850		3500(+89%)
> > > single queue[3]	1850		1630(-11%)
> > > 
> > > [1] line rate flood using multiple 64 bytes packets and multiple flows  
> > 
> > Is [1] sending multiple flow in the a single UDP-sink?  
> 
> Yes, in the test scenario [1] there are multiple UDP flows using 16
> different rx queues on the receiver host, and a single user space
> reader.
> 
> > > [2] like [1], but using the minimum number of flows to saturate the user space
> > >  sink, that is 1 flow for the old kernel and 3 for the patched one.
> > >  the tput increases since the contention on the rx lock is low.
> > > [3] like [1] but using a single flow with both old and new kernel. All the
> > >  packets land on the same rx queue and there is a single ksoftirqd instance
> > >  running  
> > 
> > It is important to know, if ksoftirqd and the UDP-sink runs on the same CPU?  
> 
> No pinning is enforced. The scheduler moves the user space process on a
> different cpu in respect to the ksoftriqd kernel thread.

This floating userspace process can cause a high variation between test
runs.  On my system, the performance drops to approx 600Kpps when
ksoftirqd and udp_sink share the same CPU.

Quick run with your patches applied:

Sender: pktgen with big packets
 ./pktgen_sample03_burst_single_flow.sh -i mlx5p2 -d 198.18.50.1 \
   -m 7c:fe:90:c7:b1:cf -t1 -b128 -s 1472

Forced CPU0 for both ksoftirq and udp_sink

# taskset -c 0 ./udp_sink --count $((10**7)) --port 9 --repeat 1
                                ns      pps             cycles 
recvMmsg/32  	run: 0 10000000	1667.93	599547.16	6685
recvmsg   	run: 0 10000000	1810.70	552273.39	7257
read      	run: 0 10000000	1634.72	611723.95	6552
recvfrom  	run: 0 10000000	1585.06	630891.39	6353

 
> > > The regression in the single queue scenario is actually due to the improved
> > > performance of the recvmmsg() syscall: the user space process is now
> > > significantly faster than the ksoftirqd process so that the latter needs often
> > > to wake up the user space process.  
> > 
> > When measuring these things, make sure that we/you measure both the packets
> > actually received in the userspace UDP-sink, and also measure packets
> > RX processed by ksoftirq (and I often also look at what HW got delivered).
> > Some times, when userspace is too slow, the kernel can/will drop packets.
> > 
> > It is actually quite easily verified with cmdline:
> > 
> >  nstat > /dev/null && sleep 1  && nstat
> > 
> > For HW measurements I use the tool ethtool_stats.pl:
> >  https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl  
> 
> We collected the UDP stats for all the three scenarios; we have lot of
> drop in test[1] and little, by design, in test[2]. In test [3], with the
> patched kernel, the drops are 0: ksoftirqd is way slower than the user
> space sink. 
> 
> > > Since ksoftirqd is the bottle-neck is such scenario, overall this causes a
> > > tput reduction. In a real use case, where the udp sink is performing some
> > > actual processing of the received data, such regression is unlikely to really
> > > have an effect.  
> > 
> > My experience is that the performance of RX UDP is affected by:
> >  * if socket is connected or not (yes, RX side also)
> >  * state of /proc/sys/net/ipv4/ip_early_demux
> > 
> > You don't need to run with all the combinations, but it would be nice
> > if you specify what config your have based your measurements on (and
> > keep them stable in your runs).
> > 
> > I've actually implemented the "--connect" option to my udp_sink
> > program[1] today, but I've not pushed it yet, if you are interested.  
> 
> The reported numbers are all gathered with unconnected sockets and early
> demux enabled.
> 
> We also used connected socket for test[3], with relative little
> difference (the tput increased for both unpatched and patched kernel, 
> and the difference was roughly the same).

When I use connected sockets (RX side) and ip_early_demux enabled, I do
see a performance boost for recvmmsg.  With these patches applied,
forced ksoftirqd on CPU0 and udp_sink on CPU2, pktgen single flow
sending size 1472 bytes.

$ sysctl net/ipv4/ip_early_demux
net.ipv4.ip_early_demux = 1

$ grep -H . /proc/sys/net/core/{r,w}mem_max
/proc/sys/net/core/rmem_max:1048576
/proc/sys/net/core/wmem_max:1048576

# taskset -c 2 ./udp_sink --count $((10**7)) --port 9 --repeat 1
#                               ns      pps             cycles
recvMmsg/32  	run: 0 10000000	462.51	2162095.23	1853
recvmsg   	run: 0 10000000	536.47	1864041.75	2150
read      	run: 0 10000000	492.01	2032460.71	1972
recvfrom  	run: 0 10000000	553.94	1805262.84	2220

# taskset -c 2 ./udp_sink --count $((10**7)) --port 9 --repeat 1 --connect
#                               ns      pps             cycles
recvMmsg/32  	run: 0 10000000	405.15	2468225.03	1623
recvmsg   	run: 0 10000000	548.23	1824049.58	2197
read      	run: 0 10000000	489.76	2041825.27	1962
recvfrom  	run: 0 10000000	466.18	2145091.77	1868

My theory is that by enabling connect'ed RX socket, the ksoftirqd gets
faster (no fib_lookup) and is no-longer a bottleneck.  This is
confirmed by the nstat output below.

Below: unconnected
 $ nstat > /dev/null && sleep 1  && nstat
 #kernel
 IpInReceives                    2143944            0.0
 IpInDelivers                    2143945            0.0
 UdpInDatagrams                  2143944            0.0
 IpExtInOctets                   3125889306         0.0
 IpExtInNoECTPkts                2143956            0.0

Below: connected
 $ nstat > /dev/null && sleep 1  && nstat
 #kernel
 IpInReceives                    2925155            0.0
 IpInDelivers                    2925156            0.0
 UdpInDatagrams                  2440925            0.0
 UdpInErrors                     484230             0.0
 UdpRcvbufErrors                 484230             0.0
 IpExtInOctets                   4264896402         0.0
 IpExtInNoECTPkts                2925170            0.0

This is a 50Gbit/s link, and IpInReceives correspondent to approx 35Gbit/s.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH net-next v3 1/4] net: phy: add an option to disable EEE advertisement
From: Andreas Färber @ 2016-11-28 12:20 UTC (permalink / raw)
  To: Jerome Brunet, netdev
  Cc: devicetree, Florian Fainelli, Alexandre TORGUE, Andrew Lunn,
	Martin Blumenstingl, Kevin Hilman, Neil Armstrong, linux-kernel,
	Andre Roth, linux-amlogic, Carlo Caione, Giuseppe Cavallaro,
	linux-arm-kernel
In-Reply-To: <1480326409-25419-2-git-send-email-jbrunet@baylibre.com>

Am 28.11.2016 um 10:46 schrieb Jerome Brunet:
> This patch adds an option to disable EEE advertisement in the generic PHY
> by providing a mask of prohibited modes corresponding to the value found in
> the MDIO_AN_EEE_ADV register.
> 
> On some platforms, PHY Low power idle seems to be causing issues, even
> breaking the link some cases. The patch provides a convenient way for these
> platforms to disable EEE advertisement and work around the issue.
> 
> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
> ---
>  drivers/net/phy/phy.c        |  3 ++
>  drivers/net/phy/phy_device.c | 80 +++++++++++++++++++++++++++++++++++++++-----
>  include/linux/phy.h          |  3 ++
>  3 files changed, 77 insertions(+), 9 deletions(-)

Tested-by: Andreas Färber <afaerber@suse.de>

With this and the corresponding .dts change, Odroid-C2 survived a
317-package zypper update for me.

Thanks,
Andreas

-- 
SUSE Linux GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply

* [PATCH] stmmac: reduce code duplication getting basic descriptors
From: Pavel Machek @ 2016-11-28 12:17 UTC (permalink / raw)
  To: David Miller, Andrew Morton, alexandre.torgue
  Cc: peppe.cavallaro, netdev, linux-kernel
In-Reply-To: <20161124.110416.198867271899443489.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 3052 bytes --]


Remove code duplication getting basic descriptors.
    
Signed-off-by: Pavel Machek <pavel@denx.de>

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index f7133d0..ed20668 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -929,6 +929,17 @@ static int stmmac_set_bfsize(int mtu, int bufsize)
 	return ret;
 }
 
+static inline struct dma_desc *stmmac_tx_desc(struct stmmac_priv *priv, int i)
+{
+	struct dma_desc *p;
+	if (priv->extend_desc)
+		p = &((priv->dma_etx + i)->basic);
+	else
+		p = priv->dma_tx + i;
+	return p;
+}
+
+
 /**
  * stmmac_clear_descriptors - clear descriptors
  * @priv: driver private structure
@@ -1078,11 +1089,7 @@ static int init_dma_desc_rings(struct net_device *dev, gfp_t flags)
 
 	/* TX INITIALIZATION */
 	for (i = 0; i < DMA_TX_SIZE; i++) {
-		struct dma_desc *p;
-		if (priv->extend_desc)
-			p = &((priv->dma_etx + i)->basic);
-		else
-			p = priv->dma_tx + i;
+		struct dma_desc *p = stmmac_tx_desc(priv, i);
 
 		if (priv->synopsys_id >= DWMAC_CORE_4_00) {
 			p->des0 = 0;
@@ -1129,12 +1136,7 @@ static void dma_free_tx_skbufs(struct stmmac_priv *priv)
 	int i;
 
 	for (i = 0; i < DMA_TX_SIZE; i++) {
-		struct dma_desc *p;
-
-		if (priv->extend_desc)
-			p = &((priv->dma_etx + i)->basic);
-		else
-			p = priv->dma_tx + i;
+		struct dma_desc *p = stmmac_tx_desc(priv, i);
 
 		if (priv->tx_skbuff_dma[i].buf) {
 			if (priv->tx_skbuff_dma[i].map_as_page)
@@ -1314,14 +1316,9 @@ static void __stmmac_tx_clean(struct stmmac_priv *priv)
 
 	while (entry != priv->cur_tx) {
 		struct sk_buff *skb = priv->tx_skbuff[entry];
-		struct dma_desc *p;
+		struct dma_desc *p = stmmac_tx_desc(priv, entry);
 		int status;
 
-		if (priv->extend_desc)
-			p = (struct dma_desc *)(priv->dma_etx + entry);
-		else
-			p = priv->dma_tx + entry;
-
 		status = priv->hw->desc->tx_status(&priv->dev->stats,
 						      &priv->xstats, p,
 						      priv->ioaddr);
@@ -2227,11 +2224,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	csum_insertion = (skb->ip_summed == CHECKSUM_PARTIAL);
 
-	if (likely(priv->extend_desc))
-		desc = (struct dma_desc *)(priv->dma_etx + entry);
-	else
-		desc = priv->dma_tx + entry;
-
+	desc = stmmac_tx_desc(priv, entry);
 	first = desc;
 
 	priv->tx_skbuff[first_entry] = skb;
@@ -2254,10 +2247,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 
 		entry = STMMAC_GET_ENTRY(entry, DMA_TX_SIZE);
 
-		if (likely(priv->extend_desc))
-			desc = (struct dma_desc *)(priv->dma_etx + entry);
-		else
-			desc = priv->dma_tx + entry;
+		desc = stmmac_tx_desc(priv, entry);
 
 		des = skb_frag_dma_map(priv->device, frag, 0, len,
 				       DMA_TO_DEVICE);

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply related

* Re: stmmac ethernet in kernel 4.9-rc6: coalescing related pauses.
From: Pavel Machek @ 2016-11-28 12:13 UTC (permalink / raw)
  To: David Miller, alexandre.torgue; +Cc: peppe.cavallaro, netdev, linux-kernel
In-Reply-To: <20161124.110416.198867271899443489.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 1893 bytes --]

Hi!

> >> drivers/net/ethernet/stmicro/stmmac/common.h
> >> #define STMMAC_COAL_TX_TIMER    40000
> >> #define STMMAC_MAX_COAL_TX_TICK 100000
> >> #define STMMAC_TX_MAX_FRAMES    256
> >> 
> >> If I lower the parameters, delays are gone, but I get netdev watchdog
> >> backtrace followed by broken driver.
> >> 
> >> Any ideas what is going on there?
> > 
> > 4.9-rc6 still has the delays. With the
> > 
> > #define STMMAC_COAL_TX_TIMER 1000
> > #define STMMAC_TX_MAX_FRAMES 2
> > 
> > settings, delays go away, and driver still works. (It fails fairly
> > fast in 4.4). Good news. But the question still is: what is going on
> > there?
> 
> 256 packets looks way too large for being a trigger for aborting the
> TX coalescing timer.
> 
> Looking more deeply into this, the driver is using non-highres timers
> to implement the TX coalescing.  This simply cannot work.
> 
> 1 HZ, which is the lowest granularity of non-highres timers in the
> kernel, is variable as well as already too large of a delay for
> effective TX coalescing.
> 
> I seriously think that the TX coalescing support should be ripped out
> or disabled entirely until it is implemented properly in this
> driver.

Hmm. I still don't understand how the coalescing is supposed to do any
good in this driver.

As long as transmits happen, it seems driver is not recycling used DMA
descriptors. When the transmits stops, it waits for 40msec, then
recycles them. We'll run out of DMA descriptors in 5msec at 100mbit
speeds.

Giuseppe Cavallaro, Alexandre Torgue: could you comment here? Can you
apply the cleanup patches? The driver is marked as supported, but I
see no reactions on email, and bugzilla is down :-(.

Thanks,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply

* [PATCH] stmmac: fix comments, make debug output consistent
From: Pavel Machek @ 2016-11-28 11:55 UTC (permalink / raw)
  To: David Miller, alexandre.torgue
  Cc: peppe.cavallaro, netdev, linux-kernel, Andrew Morton
In-Reply-To: <20161124.110416.198867271899443489.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 3421 bytes --]


Fix comments, add some new, and make debugfs output consistent.
    
Signed-off-by: Pavel Machek <pavel@denx.de>

diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
index a61de04..6074d97 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -44,6 +44,7 @@
 #define	DWMAC_CORE_4_00	0x40
 #define STMMAC_CHAN0	0	/* Always supported and default for all chips */
 
+/* These need to be power of two, and >= 4 */
 #define DMA_TX_SIZE 512
 #define DMA_RX_SIZE 512
 #define STMMAC_GET_ENTRY(x, size)	((x + 1) & (size - 1))
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 9dfdbe0..f7133d0 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -108,8 +108,8 @@ module_param(eee_timer, int, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(eee_timer, "LPI tx expiration time in msec");
 #define STMMAC_LPI_T(x) (jiffies + msecs_to_jiffies(x))
 
-/* By default the driver will use the ring mode to manage tx and rx descriptors
- * but passing this value so user can force to use the chain instead of the ring
+/* By default the driver will use the ring mode to manage tx and rx descriptors,
+ * but allow user to force to use the chain instead of the ring
  */
 static unsigned int chain_mode;
 module_param(chain_mode, int, S_IRUGO);
@@ -2966,6 +2966,8 @@ static int stmmac_sysfs_ring_open(struct inode *inode, struct file *file)
 	return single_open(file, stmmac_sysfs_ring_read, inode->i_private);
 }
 
+/* Debugfs files, should appear in /sys/kernel/debug/stmmaceth/eth0 */
+
 static const struct file_operations stmmac_rings_status_fops = {
 	.owner = THIS_MODULE,
 	.open = stmmac_sysfs_ring_open,
@@ -2988,11 +2990,11 @@ static int stmmac_sysfs_dma_cap_read(struct seq_file *seq, void *v)
 	seq_printf(seq, "\tDMA HW features\n");
 	seq_printf(seq, "==============================\n");
 
-	seq_printf(seq, "\t10/100 Mbps %s\n",
+	seq_printf(seq, "\t10/100 Mbps: %s\n",
 		   (priv->dma_cap.mbps_10_100) ? "Y" : "N");
-	seq_printf(seq, "\t1000 Mbps %s\n",
+	seq_printf(seq, "\t1000 Mbps: %s\n",
 		   (priv->dma_cap.mbps_1000) ? "Y" : "N");
-	seq_printf(seq, "\tHalf duple %s\n",
+	seq_printf(seq, "\tHalf duplex: %s\n",
 		   (priv->dma_cap.half_duplex) ? "Y" : "N");
 	seq_printf(seq, "\tHash Filter: %s\n",
 		   (priv->dma_cap.hash_filter) ? "Y" : "N");
@@ -3010,9 +3012,9 @@ static int stmmac_sysfs_dma_cap_read(struct seq_file *seq, void *v)
 		   (priv->dma_cap.rmon) ? "Y" : "N");
 	seq_printf(seq, "\tIEEE 1588-2002 Time Stamp: %s\n",
 		   (priv->dma_cap.time_stamp) ? "Y" : "N");
-	seq_printf(seq, "\tIEEE 1588-2008 Advanced Time Stamp:%s\n",
+	seq_printf(seq, "\tIEEE 1588-2008 Advanced Time Stamp: %s\n",
 		   (priv->dma_cap.atime_stamp) ? "Y" : "N");
-	seq_printf(seq, "\t802.3az - Energy-Efficient Ethernet (EEE) %s\n",
+	seq_printf(seq, "\t802.3az - Energy-Efficient Ethernet (EEE): %s\n",
 		   (priv->dma_cap.eee) ? "Y" : "N");
 	seq_printf(seq, "\tAV features: %s\n", (priv->dma_cap.av) ? "Y" : "N");
 	seq_printf(seq, "\tChecksum Offload in TX: %s\n",

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply related

* Re: [PATCH] stmmac ethernet: remove cut & paste code
From: Pavel Machek @ 2016-11-28 11:50 UTC (permalink / raw)
  To: Joe Perches
  Cc: peppe.cavallaro, netdev, kernel list, ezequiel, sonic.zhang,
	fabrice.gasnier
In-Reply-To: <1480026433.19726.17.camel@perches.com>

[-- Attachment #1: Type: text/plain, Size: 1737 bytes --]

On Thu 2016-11-24 14:27:13, Joe Perches wrote:
> On Thu, 2016-11-24 at 22:44 +0100, Pavel Machek wrote:
> > On Thu 2016-11-24 12:05:25, Joe Perches wrote:
> > > On Thu, 2016-11-24 at 12:05 +0100, Pavel Machek wrote:
> > > > Remove duplicate code from _tx routines.
> > > 
> > > trivia:
> > > 
> > > > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > > 
> > > []
> > > > @@ -1960,6 +1960,38 @@ static void stmmac_tso_allocator(struct stmmac_priv *priv, unsigned int des,
> > > >  	}
> > > >  }
> > > >  
> > > > +static void stmmac_xmit_common(struct sk_buff *skb, struct net_device *dev, int nfrags, struct dma_desc *desc)
> > > > +{
> > > > +	struct stmmac_priv *priv = netdev_priv(dev);
> > > > +
> > > > +	if (unlikely(stmmac_tx_avail(priv) <= (MAX_SKB_FRAGS + 1))) {
> > > > +		if (netif_msg_hw(priv))
> > > > +			pr_debug("%s: stop transmitted packets\n", __func__);
> > > 
> > > 		netif_dbg(priv, hw, dev, "%s: stop transmitted packets\n",
> > > 			  __func__);
> > 
> > Not now. Modifying the code while de-duplicating would be bad idea.
> 
> Too many people think overly granular patches are the
> best and only way to make changes.

> Deduplication and consolidation can happen simultaneously.

Can, but should not at this point. Please take a look at the driver in
question before commenting on trivial printk style.

Feel free to do your favourite cleanup on whole tree, or per-driver
basis. Doing it on per-message basis would be wrong thing to do.

Thanks,
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply

* [PATCH net-next] bpf: selftests: include <sys/resource.h> to fix build error
From: Colin King @ 2016-11-28 11:45 UTC (permalink / raw)
  To: Alexei Starovoitov, Shuah Khan, netdev; +Cc: linux-kernel, linux-kselftest

From: Colin Ian King <colin.king@canonical.com>

Fix incomplete type build error on struct rlimit by including
<sys/resource.h>, fixes:

test_lru_map.c:552:9: error: variable ‘r’ has initializer
 but incomplete type
  struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
         ^
test_lru_map.c:552:21: error: ‘RLIM_INFINITY’ undeclared
 (first use in this function)
  struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 tools/testing/selftests/bpf/test_lru_map.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/bpf/test_lru_map.c b/tools/testing/selftests/bpf/test_lru_map.c
index 627757e..fde54a2 100644
--- a/tools/testing/selftests/bpf/test_lru_map.c
+++ b/tools/testing/selftests/bpf/test_lru_map.c
@@ -13,6 +13,7 @@
 #include <assert.h>
 #include <sched.h>
 #include <sys/wait.h>
+#include <sys/resource.h>
 #include <stdlib.h>
 #include <time.h>
 #include "bpf_sys.h"
-- 
2.10.2

^ permalink raw reply related

* Re: [PATCH] vxlan: fix a potential issue when create a new vxlan fdb entry.
From: Jiri Benc @ 2016-11-28 11:37 UTC (permalink / raw)
  To: Haishuang Yan
  Cc: David S. Miller, Hannes Frederic Sowa, Pravin B Shelar, netdev,
	linux-kernel
In-Reply-To: <1480316543-23298-1-git-send-email-yanhaishuang@cmss.chinamobile.com>

On Mon, 28 Nov 2016 15:02:23 +0800, Haishuang Yan wrote:
> vxlan_fdb_append may return error, so add the proper check,
> otherwise it will cause memory leak.
> 
> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
> ---
>  drivers/net/vxlan.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> index 21e92be..3b7b237 100644
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -611,6 +611,7 @@ static int vxlan_fdb_create(struct vxlan_dev *vxlan,
>  	struct vxlan_rdst *rd = NULL;
>  	struct vxlan_fdb *f;
>  	int notify = 0;
> +	int rc = 0;

The initialization to 0 should not be needed. Looks good otherwise.

Thanks,

 Jiri

^ permalink raw reply

* Re: SNAT --random & fully is not actually random for ips
From: Denys Fedoryshchenko @ 2016-11-28 11:35 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Linux Kernel Network Developers, netfilter-devel
In-Reply-To: <20161128112955.GA1691@salvia>

On 2016-11-28 13:29, Pablo Neira Ayuso wrote:
> On Mon, Nov 28, 2016 at 01:12:07PM +0200, Denys Fedoryshchenko wrote:
>> On 2016-11-28 13:06, Pablo Neira Ayuso wrote:
>> >Why does your patch reverts NF_NAT_RANGE_PROTO_RANDOM_FULLY?
>> 
>> Ops, sorry i just did mistake with files, actually it is in reverse ( 
>> did
>> this patch, and it worked properly with it, with random source ip).
> 
> Oh, I see 8)
> 
>> --- nf_nat_core.c	2016-11-21 09:11:59.000000000 +0000
>> +++ nf_nat_core.c.new	2016-11-28 09:55:54.000000000 +0000
>> @@ -282,9 +282,13 @@
>>  	 * client coming from the same IP (some Internet Banking sites
>>  	 * like this), even across reboots.
>>  	 */
>> -	j = jhash2((u32 *)&tuple->src.u3, sizeof(tuple->src.u3) / 
>> sizeof(u32),
>> +	if (range->flags & NF_NAT_RANGE_PROTO_RANDOM_FULLY) {
>> +	    j = prandom_u32();
>> +	} else {
>> +	    j = jhash2((u32 *)&tuple->src.u3, sizeof(tuple->src.u3) / 
>> sizeof(u32),
>>  		   range->flags & NF_NAT_RANGE_PERSISTENT ?
>>  			0 : (__force u32)tuple->dst.u3.all[max] ^ zone->id);
>> +	}
>> 
>>  	full_range = false;
>>  	for (i = 0; i <= max; i++) {
>> 
>> This is current situation, RANDOM_FULLY actually does prandom_u32 for 
>> source
>> port only, but not for IP.
>> IP kept as persistent and kind of predictable, because hash function 
>> based
>> on source ip.
>> 
>> Sure i did tried to specify any combination of flags, but looking to
>> "find_best_ips_proto" function, it wont have any effect.
> 
> IIRC the original intention on random-fully was to cover only ports.
> Did you interpret from git history otherwise? Otherwise, safe
> procedure is to add a new flag.
No, seems i didnt read man page well, sorry.
I will check it, maybe will try to add new option and submit a patch, 
still studying impact on "balancing" with this change, seems it works 
great.
But not really sure such thing needed for someone else, actually some 
might have privacy concerns as well, and can use such option for 
privacy.

^ permalink raw reply

* Re: [PATCH net-next 1/2] sfc: separate out SFC4000 ("Falcon") support into new sfc-falcon driver
From: Edward Cree @ 2016-11-28 11:29 UTC (permalink / raw)
  To: kbuild test robot; +Cc: kbuild-all, linux-net-drivers, davem, bkenward, netdev
In-Reply-To: <201611260855.aKMo9XWI%fengguang.wu@intel.com>

On 26/11/16 00:58, kbuild test robot wrote:
> Hi Edward,
>
> [auto build test ERROR on net-next/master]
>
> url:    https://github.com/0day-ci/linux/commits/Edward-Cree/sfc-split-out-Falcon-driver/20161126-033439
> config: i386-randconfig-h1-11260702 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386 
>
> Note: the linux-review/Edward-Cree/sfc-split-out-Falcon-driver/20161126-033439 HEAD 738d215da7cf33cb4f2916dfba4fdb1558829e5a builds fine.
>       It only hurts bisectibility.
>
> All error/warnings (new ones prefixed by >>):
>
>    drivers/net/ethernet/sfc/falcon/built-in.o: In function `tenxpress_set_id_led':
>>> (.text+0x28db4): multiple definition of `tenxpress_set_id_led'
>    drivers/net/ethernet/sfc/built-in.o:(.text+0x3bac7): first defined here
Right, I have to do at least some of the rip-stuff-out in the first patch, else
building both drivers built-in breaks.  (I only tested building them as modules,
that was silly of me.)
Wondering if the right thing to do is just squash both patches together: it
could hardly make the series _less_ reviewable, and it might help git recognise
more of the copies as file renames.
Opinions?
-Ed

^ permalink raw reply

* Re: SNAT --random & fully is not actually random for ips
From: Pablo Neira Ayuso @ 2016-11-28 11:29 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: Linux Kernel Network Developers, netfilter-devel
In-Reply-To: <fb3151804cc3b2c57e69d1059b3417ee@nuclearcat.com>

On Mon, Nov 28, 2016 at 01:12:07PM +0200, Denys Fedoryshchenko wrote:
> On 2016-11-28 13:06, Pablo Neira Ayuso wrote:
> >Why does your patch reverts NF_NAT_RANGE_PROTO_RANDOM_FULLY?
>
> Ops, sorry i just did mistake with files, actually it is in reverse ( did
> this patch, and it worked properly with it, with random source ip).

Oh, I see 8)

> --- nf_nat_core.c	2016-11-21 09:11:59.000000000 +0000
> +++ nf_nat_core.c.new	2016-11-28 09:55:54.000000000 +0000
> @@ -282,9 +282,13 @@
>  	 * client coming from the same IP (some Internet Banking sites
>  	 * like this), even across reboots.
>  	 */
> -	j = jhash2((u32 *)&tuple->src.u3, sizeof(tuple->src.u3) / sizeof(u32),
> +	if (range->flags & NF_NAT_RANGE_PROTO_RANDOM_FULLY) {
> +	    j = prandom_u32();
> +	} else {
> +	    j = jhash2((u32 *)&tuple->src.u3, sizeof(tuple->src.u3) / sizeof(u32),
>  		   range->flags & NF_NAT_RANGE_PERSISTENT ?
>  			0 : (__force u32)tuple->dst.u3.all[max] ^ zone->id);
> +	}
> 
>  	full_range = false;
>  	for (i = 0; i <= max; i++) {
> 
> This is current situation, RANDOM_FULLY actually does prandom_u32 for source
> port only, but not for IP.
> IP kept as persistent and kind of predictable, because hash function based
> on source ip.
> 
> Sure i did tried to specify any combination of flags, but looking to
> "find_best_ips_proto" function, it wont have any effect.

IIRC the original intention on random-fully was to cover only ports.
Did you interpret from git history otherwise? Otherwise, safe
procedure is to add a new flag.

^ permalink raw reply

* Re: [PATCH 1/1] NET: usb: cdc_ncm: adding MBIM RESET_FUNCTION request and modifying ncm bind common code
From: Daniele Palmas @ 2016-11-28 11:23 UTC (permalink / raw)
  To: Bjørn Mork; +Cc: Oliver Neukum, linux-usb, netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <87fume7xg5.fsf-3F4PFWf5pNjpjLOzFPqGjWGXanvQGlWp@public.gmane.org>

2016-11-26 22:17 GMT+01:00 Bjørn Mork <bjorn-yOkvZcmFvRU@public.gmane.org>:
> Bjørn Mork <bjorn-yOkvZcmFvRU@public.gmane.org> writes:
>
>> Finally, I found my modems (or at least a number of them) again today.
>> But I'm sorry to say, that the troublesome Huawei E3372h-153 is still
>> giving us a hard time.  It does not work with your patch. The symptom is
>> the same as earlier:  The modem returns MBIM frames with 32bit headers.
>>
>> So for now, I have to NAK this patch.
>>
>> I am sure we can find a good solution that makes all of these modems
>> work, but I cannot support a patch that breaks previously working
>> configurations. Sorry.  I'll do a few experiments and see if there is a
>> simple fix for this.  Otherwise we'll probably have to do the quirk
>> game.
>
>
> This is a proof-of-concept only, but it appears to be working.  Please
> test with your device(s) too.  It's still mostly your code, as you can
> see.

Sorry, this does not work :-(

The problem is always in the altsetting toggle: if I comment that
part, your patch is working fine.

I'm now wondering if the safest thing is to add a very simple quirk in
cdc_mbim that makes this toggle not to be applied with the buggy
modems and then unconditionally add the RESET_FUNCTION request in
cdc_mbim_bind after cdc_ncm_bind_common has been executed.

Daniele

>
> If this turns out to work, then I believe we should refactor
> cdc_ncm_init() and cdc_ncm_bind_common() to make the whole
> initialisation sequence a bit cleaner.  And maybe also include
> cdc_mbim_bind().  Ideally, the MBIM specific RESET should happen there
> instead of "polluting" the NCM driver with MBIM specific code.
>
> But anyway:  The sequence that seems to work for both the  E3372h-153
> and the EM7455 is
>
>  USB_CDC_GET_NTB_PARAMETERS
>  USB_CDC_RESET_FUNCTION
>  usb_set_interface(dev->udev, 'data interface no', 0);
>  remaining parts of cdc_ncm_init(), excluding USB_CDC_GET_NTB_PARAMETERS
>  usb_set_interface(dev->udev, 'data interface no', 'data alt setting');
>
> without any additional delay between the two usb_set_interface() calls.
> So the major difference from your patch is that I moved the two control
> requests out of cdc_ncm_init() to allow running them _before_ setting
> the data interface to altsetting 0.
>
> But maybe I was just lucky.  This was barely proof tested.  Needs a lot
> more testing and cleanups as suggested.  I'd appreciate it if you
> continued that, as I don't really have any time for it...
>
> FWIW, I also ran a quick test with a D-Link DWM-156A7 (Mediatek MBIM
> firmware) and a Huawei E367 (Qualcomm device with early Huawei MBIM
> firmware, distinctly different from the E3372h-153 and most other
> MBIM devices I've seen)
>
>
>
> Bjørn
>
> ---
>  drivers/net/usb/cdc_ncm.c    | 48 ++++++++++++++++++++++++++++----------------
>  include/uapi/linux/usb/cdc.h |  1 +
>  2 files changed, 32 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/net/usb/cdc_ncm.c b/drivers/net/usb/cdc_ncm.c
> index 877c9516e781..be019cbf1719 100644
> --- a/drivers/net/usb/cdc_ncm.c
> +++ b/drivers/net/usb/cdc_ncm.c
> @@ -488,16 +488,6 @@ static int cdc_ncm_init(struct usbnet *dev)
>         u8 iface_no = ctx->control->cur_altsetting->desc.bInterfaceNumber;
>         int err;
>
> -       err = usbnet_read_cmd(dev, USB_CDC_GET_NTB_PARAMETERS,
> -                             USB_TYPE_CLASS | USB_DIR_IN
> -                             |USB_RECIP_INTERFACE,
> -                             0, iface_no, &ctx->ncm_parm,
> -                             sizeof(ctx->ncm_parm));
> -       if (err < 0) {
> -               dev_err(&dev->intf->dev, "failed GET_NTB_PARAMETERS\n");
> -               return err; /* GET_NTB_PARAMETERS is required */
> -       }
> -
>         /* set CRC Mode */
>         if (cdc_ncm_flags(dev) & USB_CDC_NCM_NCAP_CRC_MODE) {
>                 dev_dbg(&dev->intf->dev, "Setting CRC mode off\n");
> @@ -837,12 +827,43 @@ int cdc_ncm_bind_common(struct usbnet *dev, struct usb_interface *intf, u8 data_
>                 }
>         }
>
> +       iface_no = ctx->control->cur_altsetting->desc.bInterfaceNumber;
> +       temp = usbnet_read_cmd(dev, USB_CDC_GET_NTB_PARAMETERS,
> +                              USB_TYPE_CLASS | USB_DIR_IN
> +                              | USB_RECIP_INTERFACE,
> +                              0, iface_no, &ctx->ncm_parm,
> +                              sizeof(ctx->ncm_parm));
> +       if (temp < 0) {
> +               dev_err(&dev->intf->dev, "failed GET_NTB_PARAMETERS\n");
> +               goto error; /* GET_NTB_PARAMETERS is required */
> +       }
> +
> +       /* Some modems (e.g. Telit LE922A6) need to reset the MBIM function
> +        * or they will fail to work properly.
> +        * For details on RESET_FUNCTION request see document
> +        * "USB Communication Class Subclass Specification for MBIM"
> +        * RESET_FUNCTION should be harmless for all the other MBIM modems
> +        */
> +       if (cdc_ncm_comm_intf_is_mbim(ctx->control->cur_altsetting)) {
> +               temp = usbnet_write_cmd(dev, USB_CDC_RESET_FUNCTION,
> +                                       USB_TYPE_CLASS | USB_DIR_OUT
> +                                       | USB_RECIP_INTERFACE,
> +                                       0, iface_no, NULL, 0);
> +               if (temp < 0)
> +                       dev_err(&dev->intf->dev, "failed RESET_FUNCTION\n");
> +       }
> +
>         iface_no = ctx->data->cur_altsetting->desc.bInterfaceNumber;
>
>         /* Reset data interface. Some devices will not reset properly
>          * unless they are configured first.  Toggle the altsetting to
>          * force a reset
> +        * This is applied only to ncm devices, since it has been verified
> +        * to cause issues with some MBIM modems (e.g. Telit LE922A6).
> +        * MBIM devices reset is achieved using MBIM request RESET_FUNCTION
> +        * in cdc_ncm_init
>          */
> +
>         usb_set_interface(dev->udev, iface_no, data_altsetting);
>         temp = usb_set_interface(dev->udev, iface_no, 0);
>         if (temp) {
> @@ -854,13 +875,6 @@ int cdc_ncm_bind_common(struct usbnet *dev, struct usb_interface *intf, u8 data_
>         if (cdc_ncm_init(dev))
>                 goto error2;
>
> -       /* Some firmwares need a pause here or they will silently fail
> -        * to set up the interface properly.  This value was decided
> -        * empirically on a Sierra Wireless MC7455 running 02.08.02.00
> -        * firmware.
> -        */
> -       usleep_range(10000, 20000);
> -
>         /* configure data interface */
>         temp = usb_set_interface(dev->udev, iface_no, data_altsetting);
>         if (temp) {
> diff --git a/include/uapi/linux/usb/cdc.h b/include/uapi/linux/usb/cdc.h
> index e2bc417b243b..30258fb229e6 100644
> --- a/include/uapi/linux/usb/cdc.h
> +++ b/include/uapi/linux/usb/cdc.h
> @@ -231,6 +231,7 @@ struct usb_cdc_mbim_extended_desc {
>
>  #define USB_CDC_SEND_ENCAPSULATED_COMMAND      0x00
>  #define USB_CDC_GET_ENCAPSULATED_RESPONSE      0x01
> +#define USB_CDC_RESET_FUNCTION                 0x05
>  #define USB_CDC_REQ_SET_LINE_CODING            0x20
>  #define USB_CDC_REQ_GET_LINE_CODING            0x21
>  #define USB_CDC_REQ_SET_CONTROL_LINE_STATE     0x22
> --
> 2.10.2
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Crash due to mutex genl_lock called from RCU context
From: Herbert Xu @ 2016-11-28 11:22 UTC (permalink / raw)
  To: Cong Wang
  Cc: Eric Dumazet, Subash Abhinov Kasiviswanathan, Thomas Graf,
	Linux Kernel Network Developers
In-Reply-To: <CAM_iQpXB_LRqE83cWwU0ZDckLWPrMcZg-yD4+NKy+rBVz3YAyw@mail.gmail.com>

On Sun, Nov 27, 2016 at 10:53:21PM -0800, Cong Wang wrote:
>
> I just took a deeper look, some user calls rhashtable_destroy() in ->done(),
> so even removing that genl lock is not enough, perhaps we should just
> move it to a work struct like what Daniel does for the tcf_proto, but that is
> ugly... I don't know if RCU provides any API to execute the callback in process
> context.

I looked into doing it without a worker struct, but basically it
means that we'd have to add more crap to the common code path for
what is essentially a very rare case.

So I think we should go with the worker struct because all we lose
is a bit of memory.

---8<---
netlink: Call cb->done from a worker thread

The cb->done interface expects to be called in process context.
This was broken by the netlink RCU conversion.  This patch fixes
it by adding a worker struct to make the cb->done call where
necessary.

Fixes: 21e4902aea80 ("netlink: Lockless lookup with RCU grace...")
Reported-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 62bea45..602e5eb 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -322,14 +322,11 @@ static void netlink_skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
 	sk_mem_charge(sk, skb->truesize);
 }
 
-static void netlink_sock_destruct(struct sock *sk)
+static void __netlink_sock_destruct(struct sock *sk)
 {
 	struct netlink_sock *nlk = nlk_sk(sk);
 
 	if (nlk->cb_running) {
-		if (nlk->cb.done)
-			nlk->cb.done(&nlk->cb);
-
 		module_put(nlk->cb.module);
 		kfree_skb(nlk->cb.skb);
 	}
@@ -346,6 +343,28 @@ static void netlink_sock_destruct(struct sock *sk)
 	WARN_ON(nlk_sk(sk)->groups);
 }
 
+static void netlink_sock_destruct_work(struct work_struct *work)
+{
+	struct netlink_sock *nlk = container_of(work, struct netlink_sock,
+						work);
+
+	nlk->cb.done(&nlk->cb);
+	__netlink_sock_destruct(&nlk->sk);
+}
+
+static void netlink_sock_destruct(struct sock *sk)
+{
+	struct netlink_sock *nlk = nlk_sk(sk);
+
+	if (nlk->cb_running && nlk->cb.done) {
+		INIT_WORK(&nlk->work, netlink_sock_destruct_work);
+		schedule_work(&nlk->work);
+		return;
+	}
+
+	__netlink_sock_destruct(sk);
+}
+
 /* This lock without WQ_FLAG_EXCLUSIVE is good on UP and it is _very_ bad on
  * SMP. Look, when several writers sleep and reader wakes them up, all but one
  * immediately hit write lock and grab all the cpus. Exclusive sleep solves
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index 3cfd6cc..4fdb383 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -3,6 +3,7 @@
 
 #include <linux/rhashtable.h>
 #include <linux/atomic.h>
+#include <linux/workqueue.h>
 #include <net/sock.h>
 
 #define NLGRPSZ(x)	(ALIGN(x, sizeof(unsigned long) * 8) / 8)
@@ -33,6 +34,7 @@ struct netlink_sock {
 
 	struct rhash_head	node;
 	struct rcu_head		rcu;
+	struct work_struct	work;
 };
 
 static inline struct netlink_sock *nlk_sk(struct sock *sk)

-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related

* Re: SNAT --random & fully is not actually random for ips
From: Denys Fedoryshchenko @ 2016-11-28 11:12 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Linux Kernel Network Developers, netfilter-devel
In-Reply-To: <20161128110651.GA1024@salvia>

On 2016-11-28 13:06, Pablo Neira Ayuso wrote:
> On Mon, Nov 28, 2016 at 12:45:59PM +0200, Denys Fedoryshchenko wrote:
>> Hello,
>> 
>> I noticed that if i specify -j SNAT with options --random 
>> --random-fully
>> still it keeps persistence for source IP.
> 
> So you specify both?
> 
>> Actually truly random src ip required in some scenarios like links 
>> balanced
>> by IPs, but seems since 2012 at least it is not possible.
>> 
>> But actually if i do something like:
>> --- nf_nat_core.c.new	2016-11-28 09:55:54.000000000 +0000
>> +++ nf_nat_core.c	2016-11-21 09:11:59.000000000 +0000
>> @@ -282,13 +282,9 @@
>>  	 * client coming from the same IP (some Internet Banking sites
>>  	 * like this), even across reboots.
>>  	 */
>> -	if (range->flags & NF_NAT_RANGE_PROTO_RANDOM_FULLY) {
>> -	    j = prandom_u32();
>> -	} else {
>> -	    j = jhash2((u32 *)&tuple->src.u3, sizeof(tuple->src.u3) / 
>> sizeof(u32),
>> +	j = jhash2((u32 *)&tuple->src.u3, sizeof(tuple->src.u3) / 
>> sizeof(u32),
>>  		   range->flags & NF_NAT_RANGE_PERSISTENT ?
>>  			0 : (__force u32)tuple->dst.u3.all[max] ^ zone->id);
>> -	}
>> 
>>  	full_range = false;
>>  	for (i = 0; i <= max; i++) {
>> 
>> It works as intended. But i guess to not break compatibility it is 
>> better
>> should be introduced as new option?
>> Or maybe there is no really need for such option?
> 
> Why does your patch reverts NF_NAT_RANGE_PROTO_RANDOM_FULLY?
Ops, sorry i just did mistake with files, actually it is in reverse ( 
did this patch, and it worked properly with it, with random source ip).
--- nf_nat_core.c	2016-11-21 09:11:59.000000000 +0000
+++ nf_nat_core.c.new	2016-11-28 09:55:54.000000000 +0000
@@ -282,9 +282,13 @@
  	 * client coming from the same IP (some Internet Banking sites
  	 * like this), even across reboots.
  	 */
-	j = jhash2((u32 *)&tuple->src.u3, sizeof(tuple->src.u3) / sizeof(u32),
+	if (range->flags & NF_NAT_RANGE_PROTO_RANDOM_FULLY) {
+	    j = prandom_u32();
+	} else {
+	    j = jhash2((u32 *)&tuple->src.u3, sizeof(tuple->src.u3) / 
sizeof(u32),
  		   range->flags & NF_NAT_RANGE_PERSISTENT ?
  			0 : (__force u32)tuple->dst.u3.all[max] ^ zone->id);
+	}

  	full_range = false;
  	for (i = 0; i <= max; i++) {

This is current situation, RANDOM_FULLY actually does prandom_u32 for 
source port only, but not for IP.
IP kept as persistent and kind of predictable, because hash function 
based on source ip.

Sure i did tried to specify any combination of flags, but looking to 
"find_best_ips_proto" function, it wont have any effect.

^ permalink raw reply

* SNAT --random & fully is not actually random for ips
From: Denys Fedoryshchenko @ 2016-11-28 10:45 UTC (permalink / raw)
  To: Linux Kernel Network Developers, Pablo Neira Ayuso

Hello,

I noticed that if i specify -j SNAT with options --random --random-fully 
still it keeps persistence for source IP.
Actually truly random src ip required in some scenarios like links 
balanced by IPs, but seems since 2012 at least it is not possible.

But actually if i do something like:
--- nf_nat_core.c.new	2016-11-28 09:55:54.000000000 +0000
+++ nf_nat_core.c	2016-11-21 09:11:59.000000000 +0000
@@ -282,13 +282,9 @@
  	 * client coming from the same IP (some Internet Banking sites
  	 * like this), even across reboots.
  	 */
-	if (range->flags & NF_NAT_RANGE_PROTO_RANDOM_FULLY) {
-	    j = prandom_u32();
-	} else {
-	    j = jhash2((u32 *)&tuple->src.u3, sizeof(tuple->src.u3) / 
sizeof(u32),
+	j = jhash2((u32 *)&tuple->src.u3, sizeof(tuple->src.u3) / sizeof(u32),
  		   range->flags & NF_NAT_RANGE_PERSISTENT ?
  			0 : (__force u32)tuple->dst.u3.all[max] ^ zone->id);
-	}

  	full_range = false;
  	for (i = 0; i <= max; i++) {

It works as intended. But i guess to not break compatibility it is 
better should be introduced as new option?
Or maybe there is no really need for such option?

^ permalink raw reply

* Re: SNAT --random & fully is not actually random for ips
From: Pablo Neira Ayuso @ 2016-11-28 11:06 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: Linux Kernel Network Developers, netfilter-devel
In-Reply-To: <97a6a1c557f0f1e6d55d8d09b326f8b1@nuclearcat.com>

On Mon, Nov 28, 2016 at 12:45:59PM +0200, Denys Fedoryshchenko wrote:
> Hello,
> 
> I noticed that if i specify -j SNAT with options --random --random-fully
> still it keeps persistence for source IP.

So you specify both?

> Actually truly random src ip required in some scenarios like links balanced
> by IPs, but seems since 2012 at least it is not possible.
>
> But actually if i do something like:
> --- nf_nat_core.c.new	2016-11-28 09:55:54.000000000 +0000
> +++ nf_nat_core.c	2016-11-21 09:11:59.000000000 +0000
> @@ -282,13 +282,9 @@
>  	 * client coming from the same IP (some Internet Banking sites
>  	 * like this), even across reboots.
>  	 */
> -	if (range->flags & NF_NAT_RANGE_PROTO_RANDOM_FULLY) {
> -	    j = prandom_u32();
> -	} else {
> -	    j = jhash2((u32 *)&tuple->src.u3, sizeof(tuple->src.u3) / sizeof(u32),
> +	j = jhash2((u32 *)&tuple->src.u3, sizeof(tuple->src.u3) / sizeof(u32),
>  		   range->flags & NF_NAT_RANGE_PERSISTENT ?
>  			0 : (__force u32)tuple->dst.u3.all[max] ^ zone->id);
> -	}
> 
>  	full_range = false;
>  	for (i = 0; i <= max; i++) {
> 
> It works as intended. But i guess to not break compatibility it is better
> should be introduced as new option?
> Or maybe there is no really need for such option?

Why does your patch reverts NF_NAT_RANGE_PROTO_RANDOM_FULLY?

^ permalink raw reply

* [PATCH] bpf: cgroup: fix documentation of __cgroup_bpf_update()
From: Daniel Mack @ 2016-11-28 11:04 UTC (permalink / raw)
  To: ast; +Cc: davem, netdev, roszenrami, cgroups, Daniel Mack

There's a 'not' missing in one paragraph. Add it.

Signed-off-by: Daniel Mack <daniel@zonque.org>
Reported-by: Rami Rosen <roszenrami@gmail.com>
Fixes: 3007098494be ("cgroup: add support for eBPF programs")
---
 kernel/bpf/cgroup.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index a0ab43f..b708e6e 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -70,9 +70,9 @@ void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent)
  * releases the one that is currently attached, if any. @prog is then made
  * the effective program of type @type in that cgroup.
  *
- * If @prog is %NULL, the currently attached program of type @type is released,
- * and the effective program of the parent cgroup (if any) is inherited to
- * @cgrp.
+ * If @prog is not %NULL, the currently attached program of type @type is
+ * released, and the effective program of the parent cgroup (if any) is
+ * inherited to @cgrp.
  *
  * Then, the descendants of @cgrp are walked and the effective program for
  * each of them is set to the effective program of @cgrp unless the
-- 
2.7.4

^ permalink raw reply related

* [PATCH net V2] net/sched: pedit: make sure that offset is valid
From: Amir Vadai @ 2016-11-28 10:56 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Cong Wang, Jamal Hadi Salim, Or Gerlitz, Hadar Har-Zion,
	Jiri Pirko, Amir Vadai

Add a validation function to make sure offset is valid:
1. Not below skb head (could happen when offset is negative).
2. Validate both 'offset' and 'at'.

Signed-off-by: Amir Vadai <amir@vadai.me>
---
Hi Dave,

Please pull to -stable branches.

Changes from V0:
- Add a validation to the 'at' value (this is used as an offset too)
- Instead of validating the output of skb_header_pointer(), make sure that the
	offset is good before calling it.

Thanks,
Amir
 net/sched/act_pedit.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/net/sched/act_pedit.c b/net/sched/act_pedit.c
index b54d56d4959b..cf9b2fe8eac6 100644
--- a/net/sched/act_pedit.c
+++ b/net/sched/act_pedit.c
@@ -108,6 +108,17 @@ static void tcf_pedit_cleanup(struct tc_action *a, int bind)
 	kfree(keys);
 }
 
+static bool offset_valid(struct sk_buff *skb, int offset)
+{
+	if (offset > 0 && offset > skb->len)
+		return false;
+
+	if  (offset < 0 && -offset > skb_headroom(skb))
+		return false;
+
+	return true;
+}
+
 static int tcf_pedit(struct sk_buff *skb, const struct tc_action *a,
 		     struct tcf_result *res)
 {
@@ -134,6 +145,11 @@ static int tcf_pedit(struct sk_buff *skb, const struct tc_action *a,
 			if (tkey->offmask) {
 				char *d, _d;
 
+				if (!offset_valid(skb, off + tkey->at)) {
+					pr_info("tc filter pedit 'at' offset %d out of bounds\n",
+						off + tkey->at);
+					goto bad;
+				}
 				d = skb_header_pointer(skb, off + tkey->at, 1,
 						       &_d);
 				if (!d)
@@ -146,10 +162,10 @@ static int tcf_pedit(struct sk_buff *skb, const struct tc_action *a,
 					" offset must be on 32 bit boundaries\n");
 				goto bad;
 			}
-			if (offset > 0 && offset > skb->len) {
-				pr_info("tc filter pedit"
-					" offset %d can't exceed pkt length %d\n",
-				       offset, skb->len);
+
+			if (!offset_valid(skb, off + offset)) {
+				pr_info("tc filter pedit offset %d out of bounds\n",
+					offset);
 				goto bad;
 			}
 
-- 
2.10.2

^ permalink raw reply related

* Re: ip manpage comments
From: Phil Sutter @ 2016-11-28 10:53 UTC (permalink / raw)
  To: Jon LaBadie; +Cc: netdev
In-Reply-To: <20161127024932.GA9954@cyber.jgcomp.com>

Hi,

On Sat, Nov 26, 2016 at 09:49:32PM -0500, Jon LaBadie wrote:
> Though not new to *nix, I am new to using the ip(8) command.
> Thus some of my historical assumptions about ip may be wrong.
> 
> It seems that an inclusive manpage for the ip command was
> broken up into a shorter ip(8) manpage and 15 or more
> ip-<subcommand>(8) manpages.  I'm basing this assumption
> on long, inclusive manpages on https://linux.die.net/man/8/ip
> and CentOS 6 while CentOS 7 and Fedora 24 each have the
> sub-divided style.
> 
> I won't debate the wisdom of this subdivision, only comment
> on how it was done.
> 
> The ip(8) manpage make no mention of additional subordinate
> documents.  The listing of the additional documents in the
> See Also section is insufficient.  This section is typically
> used to mention related commands and other sources of reference
> materials such as info docs, wikis, blogs, or mailing lists.

In what aspect do you think it's insufficient? Only because you don't
think it's the appropriate place to list them or are there any other
reasons? Where do you think would be a better place for them?

Looking at the description of 'SEE ALSO' section in man-pages.7, it is
expected to find "A comma-separated list of related man pages" there.

> When one does investigate one of the subordinate manpages,
> they do not state that they document subcommands of the
> ip command.  In fact, on the ip-address(8) manpage it says
> 
>   The `ip address command' ...   (quotes added)
> 
> My first thought was "typo", this is the manpage about the
> "ip-address" command.  Of course there is no ip-address command.
> But "ip address" is not a command either, it is the "ip" command
> with an argument.

Well, yes and no. While it is true that the command itself on most
systems is 'ip' and 'address' is just the first parameter, by the
internal design 'ip' is merely a dispatcher to the actual subcommand
given. An interesting fact is that main() in ip.c checks argv[0] and in
case it's longer than two characters assumes the remaining part is the
subcommand (i.e. first parameter). So by symlinking 'ip' to 'ipaddress'
I can use the latter as an alternative to calling 'ip address'. You
might argue that the man page then should be named ipaddress.8, but
there is a reason for it not to be like this: Given that most
distributions don't make use of the symlink possibility, the dashed name
allows for opening ip-address.8 using 'man ip address' which is in my
opinion far more intuitive than having to use 'man ipaddress'. (Of
course nothing prevents one from having an ipaddress.8 redirecting to
ip-address.8.)

> There are several commands that have broken their manpage into
> several manpages.  Two which come to mind are zsh(1) and perl(1).
> The authors of those pages clearly state on the primary manpage
> that this is an overview page and give clear pointers to the
> additional manpages as well as additional documentation.  I would
> recommend reorganizing the ip(8) manpage in a similar fashion.

Well, go ahead if you like. Patches are appreciated, and if you put me
in Cc, I'll promise to help by reviewing them.

> Thank you for consideration of my opinion and for the development
> of an awesome tool.

Please keep in mind that you are talking to a community here, not tech
support of a product you bought. :)

Cheers, Phil

^ permalink raw reply

* Re: [PATCH net-next 0/5] net: add protocol level recvmmsg support
From: Paolo Abeni @ 2016-11-28 10:52 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, David S. Miller, Eric Dumazet, Hannes Frederic Sowa,
	Sabrina Dubroca
In-Reply-To: <20161125183711.675fa4a7@redhat.com>

Hi Jesper,

On Fri, 2016-11-25 at 18:37 +0100, Jesper Dangaard Brouer wrote:
> > The measured performance delta is as follow:
> > 
> > 		before		after
> > 		(Kpps)		(Kpps)
> > 
> > udp flood[1]	570		1800(+215%)
> > max tput[2]	1850		3500(+89%)
> > single queue[3]	1850		1630(-11%)
> > 
> > [1] line rate flood using multiple 64 bytes packets and multiple flows
> 
> Is [1] sending multiple flow in the a single UDP-sink?

Yes, in the test scenario [1] there are multiple UDP flows using 16
different rx queues on the receiver host, and a single user space
reader.

> > [2] like [1], but using the minimum number of flows to saturate the user space
> >  sink, that is 1 flow for the old kernel and 3 for the patched one.
> >  the tput increases since the contention on the rx lock is low.
> > [3] like [1] but using a single flow with both old and new kernel. All the
> >  packets land on the same rx queue and there is a single ksoftirqd instance
> >  running
> 
> It is important to know, if ksoftirqd and the UDP-sink runs on the same CPU?

No pinning is enforced. The scheduler moves the user space process on a
different cpu in respect to the ksoftriqd kernel thread.

> > The regression in the single queue scenario is actually due to the improved
> > performance of the recvmmsg() syscall: the user space process is now
> > significantly faster than the ksoftirqd process so that the latter needs often
> > to wake up the user space process.
> 
> When measuring these things, make sure that we/you measure both the packets
> actually received in the userspace UDP-sink, and also measure packets
> RX processed by ksoftirq (and I often also look at what HW got delivered).
> Some times, when userspace is too slow, the kernel can/will drop packets.
> 
> It is actually quite easily verified with cmdline:
> 
>  nstat > /dev/null && sleep 1  && nstat
> 
> For HW measurements I use the tool ethtool_stats.pl:
>  https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

We collected the UDP stats for all the three scenarios; we have lot of
drop in test[1] and little, by design, in test[2]. In test [3], with the
patched kernel, the drops are 0: ksoftirqd is way slower than the user
space sink. 

> > Since ksoftirqd is the bottle-neck is such scenario, overall this causes a
> > tput reduction. In a real use case, where the udp sink is performing some
> > actual processing of the received data, such regression is unlikely to really
> > have an effect.
> 
> My experience is that the performance of RX UDP is affected by:
>  * if socket is connected or not (yes, RX side also)
>  * state of /proc/sys/net/ipv4/ip_early_demux
> 
> You don't need to run with all the combinations, but it would be nice
> if you specify what config your have based your measurements on (and
> keep them stable in your runs).
> 
> I've actually implemented the "--connect" option to my udp_sink
> program[1] today, but I've not pushed it yet, if you are interested.

The reported numbers are all gathered with unconnected sockets and early
demux enabled.

We also used connected socket for test[3], with relative little
difference (the tput increased for both unpatched and patched kernel, 
and the difference was roughly the same).

Paolo

^ permalink raw reply

* Re: [PATCH net] net, sched: respect rcu grace period on cls destruction
From: Daniel Borkmann @ 2016-11-28 10:51 UTC (permalink / raw)
  To: paulmck
  Cc: Cong Wang, David Miller, John Fastabend, Roi Dayan, ast,
	Hannes Frederic Sowa, Jiri Pirko, Linux Kernel Network Developers
In-Reply-To: <20161128104736.GX31360@linux.vnet.ibm.com>

On 11/28/2016 11:47 AM, Paul E. McKenney wrote:
> On Mon, Nov 28, 2016 at 10:09:16AM +0100, Daniel Borkmann wrote:
>> On 11/28/2016 07:57 AM, Cong Wang wrote:
[...]
>>> The ugly part is the work struct, I am not an RCU expert so don't know if we
>>> have any API to execute an RCU callback in process context. Paul?
>>
>> Same way we do this in BPF with prog destruction, by the way. I'm not aware
>> of any callback API for RCU that lets you do this, but maybe Paul knows.
>
> RCU callbacks are always executed in softirq context, so yes, you do need
> to use something like a work struct.  (Or a wakeup to a kthread or
> whatever.)

Ok, thanks for the confirmation!

Daniel

^ permalink raw reply

* Re: [PATCH net] net, sched: respect rcu grace period on cls destruction
From: Paul E. McKenney @ 2016-11-28 10:47 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Cong Wang, David Miller, John Fastabend, Roi Dayan, ast,
	Hannes Frederic Sowa, Jiri Pirko, Linux Kernel Network Developers
In-Reply-To: <583BF43C.4020103@iogearbox.net>

On Mon, Nov 28, 2016 at 10:09:16AM +0100, Daniel Borkmann wrote:
> On 11/28/2016 07:57 AM, Cong Wang wrote:
> >On Sat, Nov 26, 2016 at 4:18 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>Roi reported a crash in flower where tp->root was NULL in ->classify()
> >>callbacks. Reason is that in ->destroy() tp->root is set to NULL via
> >>RCU_INIT_POINTER(). It's problematic for some of the classifiers, because
> >>this doesn't respect RCU grace period for them, and as a result, still
> >>outstanding readers from tc_classify() will try to blindly dereference
> >>a NULL tp->root.
> >>
> >>The tp->root object is strictly private to the classifier implementation
> >>and holds internal data the core such as tc_ctl_tfilter() doesn't know
> >>about. Within some classifiers, such as cls_bpf, cls_basic, etc, tp->root
> >>is only checked for NULL in ->get() callback, but nowhere else. This is
> >>misleading and seemed to be copied from old classifier code that was not
> >>cleaned up properly. For example, d3fa76ee6b4a ("[NET_SCHED]: cls_basic:
> >>fix NULL pointer dereference") moved tp->root initialization into ->init()
> >>routine, where before it was part of ->change(), so ->get() had to deal
> >>with tp->root being NULL back then, so that was indeed a valid case, after
> >>d3fa76ee6b4a, not really anymore. We used to set tp->root to NULL long
> >>ago in ->destroy(), see 47a1a1d4be29 ("pkt_sched: remove unnecessary xchg()
> >>in packet classifiers"); but the NULLifying was reintroduced with the
> >>RCUification, but it's not correct for every classifier implementation.
> >>
> >>In the cases that are fixed here with one exception of cls_cgroup, tp->root
> >>object is allocated and initialized inside ->init() callback, which is always
> >>performed at a point in time after we allocate a new tp, which means tp and
> >>thus tp->root was not globally visible in the tp chain yet (see tc_ctl_tfilter()).
> >>Also, on destruction tp->root is strictly kfree_rcu()'ed in ->destroy()
> >>handler, same for the tp which is kfree_rcu()'ed right when we return
> >>from ->destroy() in tcf_destroy(). This means, the head object's lifetime
> >>for such classifiers is always tied to the tp lifetime. The RCU callback
> >>invocation for the two kfree_rcu() could be out of order, but that's fine
> >>since both are independent.
> >>
> >>Dropping the RCU_INIT_POINTER(tp->root, NULL) for these classifiers here
> >>means that 1) we don't need a useless NULL check in fast-path and, 2) that
> >>outstanding readers of that tp in tc_classify() can still execute under
> >>respect with RCU grace period as it is actually expected.
> >>
> >>Things that haven't been touched here: cls_fw and cls_route. They each
> >>handle tp->root being NULL in ->classify() path for historic reasons, so
> >>their ->destroy() implementation can stay as is. If someone actually
> >>cares, they could get cleaned up at some point to avoid the test in fast
> >>path. cls_u32 doesn't set tp->root to NULL. For cls_rsvp, I just added a
> >>!head should anyone actually be using/testing it, so it at least aligns with
> >>cls_fw and cls_route. For cls_flower we additionally need to defer rhashtable
> >>destruction (to a sleepable context) after RCU grace period as concurrent
> >>readers might still access it. (Note that in this case we need to hold module
> >>reference to keep work callback address intact, since we only wait on module
> >>unload for all call_rcu()s to finish.)
> >>
> >>This fixes one race to bring RCU grace period guarantees back. Next step
> >>as worked on by Cong however is to fix 1e052be69d04 ("net_sched: destroy
> >>proto tp when all filters are gone") to get the order of unlinking the tp
> >>in tc_ctl_tfilter() for the RTM_DELTFILTER case right by moving
> >>RCU_INIT_POINTER() before tcf_destroy() and let the notification for
> >>removal be done through the prior ->delete() callback. Both are independant
> >>issues. Once we have that right, we can then clean tp->root up for a number
> >>of classifiers by not making them RCU pointers, which requires a new callback
> >>(->uninit) that is triggered from tp's RCU callback, where we just kfree()
> >>tp->root from there.
> >
> >Looks good to my eyes,
> >
> >Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
> 
> Thanks for the review (also to John)!
> 
> >The ugly part is the work struct, I am not an RCU expert so don't know if we
> >have any API to execute an RCU callback in process context. Paul?
> 
> Same way we do this in BPF with prog destruction, by the way. I'm not aware
> of any callback API for RCU that lets you do this, but maybe Paul knows.

RCU callbacks are always executed in softirq context, so yes, you do need
to use something like a work struct.  (Or a wakeup to a kthread or
whatever.)

							Thanx, Paul

^ permalink raw reply

* RE: PAYMENT
From: oficina @ 2016-11-28  7:23 UTC (permalink / raw)


[-- Attachment #1: Type: text/plain, Size: 295 bytes --]

Dear sir
Greetings,

Please sorry for our late response due to holiday
.
Please kindly note that customer has transfered all due payments to your 
USD bank account today
Kindly see Bank TT copy in attachment below and check with your bankers.

Best Regards ,

Weng wei

Assistant Account Manager

[-- Attachment #2: pdf-icon-copy-min_png.zip --]
[-- Type: application/zip, Size: 1202914 bytes --]

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox