Netdev List

* Re: [PATCH net] bpf: split eBPF out of NET
From: David Miller @ 2014-10-27 23:10 UTC (permalink / raw)
  To: ast
  Cc: geert, josh, mingo, rostedt, hannes, edumazet, dborkman, netdev,
	linux-kernel
In-Reply-To: <1414114868-28228-1-git-send-email-ast@plumgrid.com>

From: Alexei Starovoitov <ast@plumgrid.com>
Date: Thu, 23 Oct 2014 18:41:08 -0700

> introduce two configs:
> - hidden CONFIG_BPF to select eBPF interpreter that classic socket filters
>   depend on
> - visible CONFIG_BPF_SYSCALL (default off) that tracing and sockets can use
> 
> that solves several problems:
> - tracing and others that wish to use eBPF don't need to depend on NET.
>   They can use BPF_SYSCALL to allow loading from userspace or select BPF
>   to use it directly from kernel in NET-less configs.
> - in 3.18 programs cannot be attached to events yet, so don't force it on
> - when the rest of eBPF infra is there in 3.19+, it's still useful to
>   switch it off to minimize kernel size
> 
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> ---
> 
> bloat-o-meter on x64 shows:
> add/remove: 0/60 grow/shrink: 0/2 up/down: 0/-15601 (-15601)
> 
> tested with many different config combinations. Hopefully didn't miss anything.

Applied with two changes:

1) boolean --> bool
2) Moved bloat-o-meter and testing information into commit message.

Thanks.

* Re: [PATCH net-next v3 0/5] cleanup on resource check
From: David Miller @ 2014-10-27 23:16 UTC (permalink / raw)
  To: varkabhadram; +Cc: netdev, sergei.shtylyov, varkab
In-Reply-To: <1414116730-4590-1-git-send-email-varkab@cdac.in>

From: Varka Bhadram <varkabhadram@gmail.com>
Date: Fri, 24 Oct 2014 07:42:05 +0530

> This series removes the duplication of sanity check for
> platform_get_resource() return resource. It will be checked 
> with devm_ioremap_resource()
> 
> changes since v2:
> 	- Merge #1 and #2 patches into single patch
> 	- remove the comment
> 
> changes since v1:
> 	- remove NULL dereference on resource_size()

Series applied, thanks.

* Re: [PATCH net-next 2/2] udp: Reset flow table for flows over unconnected sockets
From: Eric Dumazet @ 2014-10-27 23:19 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, Linux Netdev List
In-Reply-To: <CA+mtBx_V3WT1bbXY9F731GNdDdb3+ebHwj9hRyVEFynAPYhSXg@mail.gmail.com>

On Mon, 2014-10-27 at 12:36 -0700, Tom Herbert wrote:

> Please try this patch and provide real data to support your points.
> 

Yep. This is not good, I confirm my fear.

Google servers are shifting to serve both TCP & UDP traffic (QUIC
protocol), with an increasing UDP load.

Millions of packets per second per host, from millions of different
sources...

And your patch voids the RFS table and adds another cache miss in the
fast path of the UDP rx path, which is already too expensive.

> If a TCP connection is hot it will continually refresh the table for
> that connection, if connection becomes idle it only takes one received
> packet to restore the CPU. The only time there could be a persistent
> problem is if collision rate is high (which probably means table is
> too small).

RFS already has a low hit/miss ratio; this patch helps neither UDP nor
TCP.

Ideally, RFS should be enabled on a per-protocol basis, not through a
protocol-agnostic u32 flow hash.

Whatever strategy you implement, as long as different protocols share a
common hash table, it won't be perfect for mixed workloads.

The fundamental problem is that when a UDP packet arrives, it's not
possible to know whether it belongs to a 'flow' or not, unless we perform
an expensive lookup, and then the RPS/RFS cost becomes prohibitive.

For TCP, on the other hand, the current RFS cache miss rate is good
enough, because nearly all packets are for connected flows. We eventually
get bad steering for
<not yet established> flows where the stack performs poorly anyway.

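To make the shared-table concern above concrete, here is a minimal sketch
in C. It is not the kernel's RFS code: the table size and the names
flow_table, flow_table_record() and flow_table_lookup() are hypothetical,
chosen only to show how a slot indexed by a masked u32 hash can be
overwritten by unrelated traffic.

```c
#include <stdint.h>

/* Hypothetical, deliberately tiny table; size must be a power of two. */
#define FLOW_TABLE_SIZE 8u
#define NO_CPU 0xffffu

/* One slot per masked hash value; the slot holds the CPU to steer to. */
struct flow_table {
	uint16_t cpu[FLOW_TABLE_SIZE];
};

static void flow_table_init(struct flow_table *t)
{
	for (unsigned int i = 0; i < FLOW_TABLE_SIZE; i++)
		t->cpu[i] = NO_CPU;
}

/* Record that the flow with this hash was last consumed on 'cpu'.
 * Nothing here knows the protocol: TCP and UDP share the same slots.
 */
static void flow_table_record(struct flow_table *t, uint32_t hash,
			      uint16_t cpu)
{
	t->cpu[hash & (FLOW_TABLE_SIZE - 1)] = cpu;
}

/* Return the CPU recorded for this hash, or NO_CPU if the slot is empty. */
static uint16_t flow_table_lookup(const struct flow_table *t, uint32_t hash)
{
	return t->cpu[hash & (FLOW_TABLE_SIZE - 1)];
}
```

Under these assumptions, a TCP flow with hash 0x1003 and a UDP datagram
with hash 0xabc3 both land in slot 3, so recording the UDP packet's CPU
replaces the TCP flow's entry. A real table is far larger, but the same
overwrite is what makes a protocol-agnostic table imperfect for mixed
workloads.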

* [PATCH v2] net: ethernet: realtek: atp: checkpatch errors and warnings corrected
From: Roberto Medina @ 2014-10-27 23:51 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, Roberto Medina

From: Roberto Medina <robertoxmed@gmail.com>

Correct several coding style errors and warnings reported by checkpatch.
Compile tested.

Signed-off-by: Roberto Medina <robertoxmed@gmail.com>

---
 drivers/net/ethernet/realtek/atp.h | 246 +++++++++++++++++++------------------
 1 file changed, 127 insertions(+), 119 deletions(-)

diff --git a/drivers/net/ethernet/realtek/atp.h b/drivers/net/ethernet/realtek/atp.h
index 040b137..32497f0 100644
--- a/drivers/net/ethernet/realtek/atp.h
+++ b/drivers/net/ethernet/realtek/atp.h
@@ -6,10 +6,10 @@
 
 /* The header prepended to received packets. */
 struct rx_header {
-    ushort pad;			/* Pad. */
-    ushort rx_count;
-    ushort rx_status;		/* Unknown bit assignments :-<.  */
-    ushort cur_addr;		/* Apparently the current buffer address(?) */
+	ushort pad;		/* Pad. */
+	ushort rx_count;
+	ushort rx_status;	/* Unknown bit assignments :-<.  */
+	ushort cur_addr;	/* Apparently the current buffer address(?) */
 };
 
 #define PAR_DATA	0
@@ -29,22 +29,25 @@ struct rx_header {
 #define RdAddr	0xC0
 #define HNib	0x10
 
-enum page0_regs
-{
-    /* The first six registers hold the ethernet physical station address. */
-    PAR0 = 0, PAR1 = 1, PAR2 = 2, PAR3 = 3, PAR4 = 4, PAR5 = 5,
-    TxCNT0 = 6, TxCNT1 = 7,		/* The transmit byte count. */
-    TxSTAT = 8, RxSTAT = 9,		/* Tx and Rx status. */
-    ISR = 10, IMR = 11,			/* Interrupt status and mask. */
-    CMR1 = 12,				/* Command register 1. */
-    CMR2 = 13,				/* Command register 2. */
-    MODSEL = 14,			/* Mode select register. */
-    MAR = 14,				/* Memory address register (?). */
-    CMR2_h = 0x1d, };
-
-enum eepage_regs
-{ PROM_CMD = 6, PROM_DATA = 7 };	/* Note that PROM_CMD is in the "high" bits. */
+enum page0_regs {
+	/* The first six registers hold
+	 * the ethernet physical station address.
+	 */
+	PAR0 = 0, PAR1 = 1, PAR2 = 2, PAR3 = 3, PAR4 = 4, PAR5 = 5,
+	TxCNT0 = 6, TxCNT1 = 7,		/* The transmit byte count. */
+	TxSTAT = 8, RxSTAT = 9,		/* Tx and Rx status. */
+	ISR = 10, IMR = 11,		/* Interrupt status and mask. */
+	CMR1 = 12,			/* Command register 1. */
+	CMR2 = 13,			/* Command register 2. */
+	MODSEL = 14,		/* Mode select register. */
+	MAR = 14,			/* Memory address register (?). */
+	CMR2_h = 0x1d,
+};
 
+enum eepage_regs {
+	PROM_CMD = 6,
+	PROM_DATA = 7	/* Note that PROM_CMD is in the "high" bits. */
+};
 
 #define ISR_TxOK	0x01
 #define ISR_RxOK	0x04
@@ -72,141 +75,146 @@ enum eepage_regs
 #define CMR2h_Normal	2	/* Accept physical and broadcast address. */
 #define CMR2h_PROMISC	3	/* Promiscuous mode. */
 
-/* An inline function used below: it differs from inb() by explicitly return an unsigned
-   char, saving a truncation. */
+/* An inline function used below: it differs from inb() by explicitly
+ * return an unsigned char, saving a truncation.
+ */
 static inline unsigned char inbyte(unsigned short port)
 {
-    unsigned char _v;
-    __asm__ __volatile__ ("inb %w1,%b0" :"=a" (_v):"d" (port));
-    return _v;
+	unsigned char _v;
+
+	__asm__ __volatile__ ("inb %w1,%b0" : "=a" (_v) : "d" (port));
+	return _v;
 }
 
 /* Read register OFFSET.
-   This command should always be terminated with read_end(). */
+ * This command should always be terminated with read_end().
+ */
 static inline unsigned char read_nibble(short port, unsigned char offset)
 {
-    unsigned char retval;
-    outb(EOC+offset, port + PAR_DATA);
-    outb(RdAddr+offset, port + PAR_DATA);
-    inbyte(port + PAR_STATUS);		/* Settling time delay */
-    retval = inbyte(port + PAR_STATUS);
-    outb(EOC+offset, port + PAR_DATA);
-
-    return retval;
+	unsigned char retval;
+
+	outb(EOC+offset, port + PAR_DATA);
+	outb(RdAddr+offset, port + PAR_DATA);
+	inbyte(port + PAR_STATUS);	/* Settling time delay */
+	retval = inbyte(port + PAR_STATUS);
+	outb(EOC+offset, port + PAR_DATA);
+
+	return retval;
 }
 
 /* Functions for bulk data read.  The interrupt line is always disabled. */
 /* Get a byte using read mode 0, reading data from the control lines. */
 static inline unsigned char read_byte_mode0(short ioaddr)
 {
-    unsigned char low_nib;
-
-    outb(Ctrl_LNibRead, ioaddr + PAR_CONTROL);
-    inbyte(ioaddr + PAR_STATUS);
-    low_nib = (inbyte(ioaddr + PAR_STATUS) >> 3) & 0x0f;
-    outb(Ctrl_HNibRead, ioaddr + PAR_CONTROL);
-    inbyte(ioaddr + PAR_STATUS);	/* Settling time delay -- needed!  */
-    inbyte(ioaddr + PAR_STATUS);	/* Settling time delay -- needed!  */
-    return low_nib | ((inbyte(ioaddr + PAR_STATUS) << 1) & 0xf0);
+	unsigned char low_nib;
+
+	outb(Ctrl_LNibRead, ioaddr + PAR_CONTROL);
+	inbyte(ioaddr + PAR_STATUS);
+	low_nib = (inbyte(ioaddr + PAR_STATUS) >> 3) & 0x0f;
+	outb(Ctrl_HNibRead, ioaddr + PAR_CONTROL);
+	inbyte(ioaddr + PAR_STATUS);	/* Settling time delay -- needed!  */
+	inbyte(ioaddr + PAR_STATUS);	/* Settling time delay -- needed!  */
+	return low_nib | ((inbyte(ioaddr + PAR_STATUS) << 1) & 0xf0);
 }
 
 /* The same as read_byte_mode0(), but does multiple inb()s for stability. */
 static inline unsigned char read_byte_mode2(short ioaddr)
 {
-    unsigned char low_nib;
-
-    outb(Ctrl_LNibRead, ioaddr + PAR_CONTROL);
-    inbyte(ioaddr + PAR_STATUS);
-    low_nib = (inbyte(ioaddr + PAR_STATUS) >> 3) & 0x0f;
-    outb(Ctrl_HNibRead, ioaddr + PAR_CONTROL);
-    inbyte(ioaddr + PAR_STATUS);	/* Settling time delay -- needed!  */
-    return low_nib | ((inbyte(ioaddr + PAR_STATUS) << 1) & 0xf0);
+	unsigned char low_nib;
+
+	outb(Ctrl_LNibRead, ioaddr + PAR_CONTROL);
+	inbyte(ioaddr + PAR_STATUS);
+	low_nib = (inbyte(ioaddr + PAR_STATUS) >> 3) & 0x0f;
+	outb(Ctrl_HNibRead, ioaddr + PAR_CONTROL);
+	inbyte(ioaddr + PAR_STATUS);	/* Settling time delay -- needed!  */
+	return low_nib | ((inbyte(ioaddr + PAR_STATUS) << 1) & 0xf0);
 }
 
 /* Read a byte through the data register. */
 static inline unsigned char read_byte_mode4(short ioaddr)
 {
-    unsigned char low_nib;
+	unsigned char low_nib;
 
-    outb(RdAddr | MAR, ioaddr + PAR_DATA);
-    low_nib = (inbyte(ioaddr + PAR_STATUS) >> 3) & 0x0f;
-    outb(RdAddr | HNib | MAR, ioaddr + PAR_DATA);
-    return low_nib | ((inbyte(ioaddr + PAR_STATUS) << 1) & 0xf0);
+	outb(RdAddr | MAR, ioaddr + PAR_DATA);
+	low_nib = (inbyte(ioaddr + PAR_STATUS) >> 3) & 0x0f;
+	outb(RdAddr | HNib | MAR, ioaddr + PAR_DATA);
+	return low_nib | ((inbyte(ioaddr + PAR_STATUS) << 1) & 0xf0);
 }
 
 /* Read a byte through the data register, double reading to allow settling. */
 static inline unsigned char read_byte_mode6(short ioaddr)
 {
-    unsigned char low_nib;
-
-    outb(RdAddr | MAR, ioaddr + PAR_DATA);
-    inbyte(ioaddr + PAR_STATUS);
-    low_nib = (inbyte(ioaddr + PAR_STATUS) >> 3) & 0x0f;
-    outb(RdAddr | HNib | MAR, ioaddr + PAR_DATA);
-    inbyte(ioaddr + PAR_STATUS);
-    return low_nib | ((inbyte(ioaddr + PAR_STATUS) << 1) & 0xf0);
+	unsigned char low_nib;
+
+	outb(RdAddr | MAR, ioaddr + PAR_DATA);
+	inbyte(ioaddr + PAR_STATUS);
+	low_nib = (inbyte(ioaddr + PAR_STATUS) >> 3) & 0x0f;
+	outb(RdAddr | HNib | MAR, ioaddr + PAR_DATA);
+	inbyte(ioaddr + PAR_STATUS);
+	return low_nib | ((inbyte(ioaddr + PAR_STATUS) << 1) & 0xf0);
 }
 
 static inline void
 write_reg(short port, unsigned char reg, unsigned char value)
 {
-    unsigned char outval;
-    outb(EOC | reg, port + PAR_DATA);
-    outval = WrAddr | reg;
-    outb(outval, port + PAR_DATA);
-    outb(outval, port + PAR_DATA);	/* Double write for PS/2. */
-
-    outval &= 0xf0;
-    outval |= value;
-    outb(outval, port + PAR_DATA);
-    outval &= 0x1f;
-    outb(outval, port + PAR_DATA);
-    outb(outval, port + PAR_DATA);
-
-    outb(EOC | outval, port + PAR_DATA);
+	unsigned char outval;
+
+	outb(EOC | reg, port + PAR_DATA);
+	outval = WrAddr | reg;
+	outb(outval, port + PAR_DATA);
+	outb(outval, port + PAR_DATA);	/* Double write for PS/2. */
+
+	outval &= 0xf0;
+	outval |= value;
+	outb(outval, port + PAR_DATA);
+	outval &= 0x1f;
+	outb(outval, port + PAR_DATA);
+	outb(outval, port + PAR_DATA);
+
+	outb(EOC | outval, port + PAR_DATA);
 }
 
 static inline void
 write_reg_high(short port, unsigned char reg, unsigned char value)
 {
-    unsigned char outval = EOC | HNib | reg;
+	unsigned char outval = EOC | HNib | reg;
 
-    outb(outval, port + PAR_DATA);
-    outval &= WrAddr | HNib | 0x0f;
-    outb(outval, port + PAR_DATA);
-    outb(outval, port + PAR_DATA);	/* Double write for PS/2. */
+	outb(outval, port + PAR_DATA);
+	outval &= WrAddr | HNib | 0x0f;
+	outb(outval, port + PAR_DATA);
+	outb(outval, port + PAR_DATA);	/* Double write for PS/2. */
 
-    outval = WrAddr | HNib | value;
-    outb(outval, port + PAR_DATA);
-    outval &= HNib | 0x0f;		/* HNib | value */
-    outb(outval, port + PAR_DATA);
-    outb(outval, port + PAR_DATA);
+	outval = WrAddr | HNib | value;
+	outb(outval, port + PAR_DATA);
+	outval &= HNib | 0x0f;		/* HNib | value */
+	outb(outval, port + PAR_DATA);
+	outb(outval, port + PAR_DATA);
 
-    outb(EOC | HNib | outval, port + PAR_DATA);
+	outb(EOC | HNib | outval, port + PAR_DATA);
 }
 
 /* Write a byte out using nibble mode.  The low nibble is written first. */
 static inline void
 write_reg_byte(short port, unsigned char reg, unsigned char value)
 {
-    unsigned char outval;
-    outb(EOC | reg, port + PAR_DATA); 	/* Reset the address register. */
-    outval = WrAddr | reg;
-    outb(outval, port + PAR_DATA);
-    outb(outval, port + PAR_DATA);	/* Double write for PS/2. */
-
-    outb((outval & 0xf0) | (value & 0x0f), port + PAR_DATA);
-    outb(value & 0x0f, port + PAR_DATA);
-    value >>= 4;
-    outb(value, port + PAR_DATA);
-    outb(0x10 | value, port + PAR_DATA);
-    outb(0x10 | value, port + PAR_DATA);
-
-    outb(EOC  | value, port + PAR_DATA); 	/* Reset the address register. */
+	unsigned char outval;
+
+	outb(EOC | reg, port + PAR_DATA); /* Reset the address register. */
+	outval = WrAddr | reg;
+	outb(outval, port + PAR_DATA);
+	outb(outval, port + PAR_DATA);	/* Double write for PS/2. */
+
+	outb((outval & 0xf0) | (value & 0x0f), port + PAR_DATA);
+	outb(value & 0x0f, port + PAR_DATA);
+	value >>= 4;
+	outb(value, port + PAR_DATA);
+	outb(0x10 | value, port + PAR_DATA);
+	outb(0x10 | value, port + PAR_DATA);
+
+	outb(EOC  | value, port + PAR_DATA); /* Reset the address register. */
 }
 
-/*
- * Bulk data writes to the packet buffer.  The interrupt line remains enabled.
+/* Bulk data writes to the packet buffer.  The interrupt line remains enabled.
  * The first, faster method uses only the dataport (data modes 0, 2 & 4).
  * The second (backup) method uses data and control regs (modes 1, 3 & 5).
  * It should only be needed when there is skew between the individual data
@@ -214,28 +222,28 @@ write_reg_byte(short port, unsigned char reg, unsigned char value)
  */
 static inline void write_byte_mode0(short ioaddr, unsigned char value)
 {
-    outb(value & 0x0f, ioaddr + PAR_DATA);
-    outb((value>>4) | 0x10, ioaddr + PAR_DATA);
+	outb(value & 0x0f, ioaddr + PAR_DATA);
+	outb((value>>4) | 0x10, ioaddr + PAR_DATA);
 }
 
 static inline void write_byte_mode1(short ioaddr, unsigned char value)
 {
-    outb(value & 0x0f, ioaddr + PAR_DATA);
-    outb(Ctrl_IRQEN | Ctrl_LNibWrite, ioaddr + PAR_CONTROL);
-    outb((value>>4) | 0x10, ioaddr + PAR_DATA);
-    outb(Ctrl_IRQEN | Ctrl_HNibWrite, ioaddr + PAR_CONTROL);
+	outb(value & 0x0f, ioaddr + PAR_DATA);
+	outb(Ctrl_IRQEN | Ctrl_LNibWrite, ioaddr + PAR_CONTROL);
+	outb((value>>4) | 0x10, ioaddr + PAR_DATA);
+	outb(Ctrl_IRQEN | Ctrl_HNibWrite, ioaddr + PAR_CONTROL);
 }
 
 /* Write 16bit VALUE to the packet buffer: the same as above just doubled. */
 static inline void write_word_mode0(short ioaddr, unsigned short value)
 {
-    outb(value & 0x0f, ioaddr + PAR_DATA);
-    value >>= 4;
-    outb((value & 0x0f) | 0x10, ioaddr + PAR_DATA);
-    value >>= 4;
-    outb(value & 0x0f, ioaddr + PAR_DATA);
-    value >>= 4;
-    outb((value & 0x0f) | 0x10, ioaddr + PAR_DATA);
+	outb(value & 0x0f, ioaddr + PAR_DATA);
+	value >>= 4;
+	outb((value & 0x0f) | 0x10, ioaddr + PAR_DATA);
+	value >>= 4;
+	outb(value & 0x0f, ioaddr + PAR_DATA);
+	value >>= 4;
+	outb((value & 0x0f) | 0x10, ioaddr + PAR_DATA);
 }
 
 /*  EEPROM_Ctrl bits. */
@@ -248,10 +256,10 @@ static inline void write_word_mode0(short ioaddr, unsigned short value)
 
 /* Delay between EEPROM clock transitions. */
 #define eeprom_delay(ticks) \
-do { int _i = 40; while (--_i > 0) { __SLOW_DOWN_IO; }} while (0)
+do { int _i = 40; while (--_i > 0) { __SLOW_DOWN_IO; } } while (0)
 
 /* The EEPROM commands include the alway-set leading bit. */
 #define EE_WRITE_CMD(offset)	(((5 << 6) + (offset)) << 17)
-#define EE_READ(offset) 	(((6 << 6) + (offset)) << 17)
+#define EE_READ(offset)		(((6 << 6) + (offset)) << 17)
 #define EE_ERASE(offset)	(((7 << 6) + (offset)) << 17)
 #define EE_CMD_SIZE	27	/* The command+address+data size. */
-- 
2.1.2
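
For readers unfamiliar with the nibble-mode transfers in the header above,
the following is a small standalone sketch of the bit arithmetic only. It
does no port I/O. The helper names nibbles_to_byte() and
word_to_port_writes() are hypothetical, and HNIB_FLAG is assumed to mirror
the HNib value (0x10) defined in atp.h.

```c
#include <stdint.h>

/* Assumed to match HNib in atp.h: marks a write as the high nibble. */
#define HNIB_FLAG 0x10

/* Combine two status-port reads into one byte, using the same shifts and
 * masks as read_byte_mode0(): each read carries a nibble in bits 3..6.
 */
static uint8_t nibbles_to_byte(uint8_t status_low, uint8_t status_high)
{
	uint8_t low_nib = (status_low >> 3) & 0x0f;	/* -> data bits 0..3 */
	uint8_t high_nib = (status_high << 1) & 0xf0;	/* -> data bits 4..7 */

	return low_nib | high_nib;
}

/* Split a 16-bit value into the four data-port writes that
 * write_word_mode0() issues, lowest nibble first.
 */
static void word_to_port_writes(uint16_t value, uint8_t out[4])
{
	out[0] = value & 0x0f;
	value >>= 4;
	out[1] = (value & 0x0f) | HNIB_FLAG;
	value >>= 4;
	out[2] = value & 0x0f;
	value >>= 4;
	out[3] = (value & 0x0f) | HNIB_FLAG;
}
```

For example, status reads of 0x28 and 0x50 decode to the byte 0xA5, and
unrelated status bits outside 3..6 are masked away. Splitting 0xBEEF yields
the writes 0x0F, 0x1E, 0x0E, 0x1B.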

* Re: [Bug 86851] New: Reproducible panic on heavy UDP traffic
From: Eric Dumazet @ 2014-10-28  0:16 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Patrick McLean, Florian Westphal, Stephen Hemminger, netdev
In-Reply-To: <544ECFFA.8080402@redhat.com>

On Tue, 2014-10-28 at 00:06 +0100, Nikolay Aleksandrov wrote:

> Great! Thanks for testing.
> As I said earlier we have a valid case that can hit the WARN_ON in
> inet_evict_frag().
> Anyhow, Eric would you mind posting the patch officially ?
> If you'd like me to remove the WARN_ON() in a separate one just let me
> know, otherwise feel free to remove it in the fix for the race.

Nikolay, please take ownership of this patch. I am busy with other work
at the moment, thanks!

* Re: [PATCH net] bpf: split eBPF out of NET
From: Alexei Starovoitov @ 2014-10-28  0:18 UTC (permalink / raw)
  To: David Miller
  Cc: Geert Uytterhoeven, Josh Triplett, Ingo Molnar, Steven Rostedt,
	Hannes Frederic Sowa, Eric Dumazet, Daniel Borkmann,
	Network Development, LKML
In-Reply-To: <20141027.191043.246099210901442100.davem@davemloft.net>

On Mon, Oct 27, 2014 at 4:10 PM, David Miller <davem@davemloft.net> wrote:
> From: Alexei Starovoitov <ast@plumgrid.com>
> Date: Thu, 23 Oct 2014 18:41:08 -0700
>
>> introduce two configs:
>> - hidden CONFIG_BPF to select eBPF interpreter that classic socket filters
>>   depend on
>> - visible CONFIG_BPF_SYSCALL (default off) that tracing and sockets can use
>>
>> that solves several problems:
>> - tracing and others that wish to use eBPF don't need to depend on NET.
>>   They can use BPF_SYSCALL to allow loading from userspace or select BPF
>>   to use it directly from kernel in NET-less configs.
>> - in 3.18 programs cannot be attached to events yet, so don't force it on
>> - when the rest of eBPF infra is there in 3.19+, it's still useful to
>>   switch it off to minimize kernel size
>>
>> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
>> ---
>>
>> bloat-o-meter on x64 shows:
>> add/remove: 0/60 grow/shrink: 0/2 up/down: 0/-15601 (-15601)
>>
>> tested with many different config combinations. Hopefully didn't miss anything.
>
> Applied with two changes:
>
> 1) boolean --> bool
> 2) Moved bloat-o-meter and testing information into commit message.
>
> Thanks.

Thank you for taking care of it!

* Re: [PATCH] ovs: Turn vports with dependencies into separate modules
From: Pravin Shelar @ 2014-10-28  0:27 UTC (permalink / raw)
  To: Thomas Graf; +Cc: dev@openvswitch.org, netdev
In-Reply-To: <20141027214722.GA2783@casper.infradead.org>

On Mon, Oct 27, 2014 at 2:47 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 10/27/14 at 10:14am, Pravin Shelar wrote:
>> On Fri, Oct 24, 2014 at 2:57 PM, Thomas Graf <tgraf@suug.ch> wrote:
>> > I was referring to how many other kernel APIs have been designed, a
>> > registration API allowing a vport to be implemented exclusively in the
>> > scope of a single file tends to be cleaner than having to touch multiple
>> > files and maintaining an init list.
>> >
>> This has never been an issue in openvswitch. Plus we do not need a
>> loadable vport module to fix this issue.
>>
>> > It also allows for OVS to be built into vmlinuz while vports can
>> > remain as modules even if vxlan itself is built as a module.
>> >
>>
>> What is the problem with the current OVS built into the kernel?
>
> What I mean specifically is the following dependency logic which will
> no longer be required:
>
> depends on NET_IPGRE_DEMUX && !(OPENVSWITCH=y && NET_IPGRE_DEMUX=m)
>
> The patch also brings additional flexibility to users of
> distributions. Distros typically ship something like an allmodconfig
> so a user can either run openvswitch.ko with all encaps compiled in
> or not run openvswitch.ko. With vports as module, a user can blacklist
> a certain encap type.
>
> Another advantage is obviously that users can run additional vport
> types on top of their distribution kernels.
>
> Is there anything specific that you are concerned with in regard
> to this proposed change?

There is not a lot of OVS vport code, and making it a pluggable module
does not save much space. Even with this patch, a user cannot load an
arbitrary vport type, since we still need to define the type in the kernel
interface and add support in the userspace netdev layer. Therefore this
patch adds complexity without much gain.

* Re: [PATCH RFC 1/4] virtio_net: pass vi around
From: Rusty Russell @ 2014-10-28  0:27 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel; +Cc: netdev, virtualization
In-Reply-To: <1414099656-28090-1-git-send-email-mst@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> writes:
> Too many places poke at [rs]q->vq->vdev->priv just to get
> the vi structure.  Let's just pass the pointer around: seems
> cleaner, and might even be faster.

Agreed, it's neater.

Acked-by: Rusty Russell <rusty@rustcorp.com.au>

Thanks,
Rusty.

>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/net/virtio_net.c | 36 +++++++++++++++++++-----------------
>  1 file changed, 19 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 57cbc7d..36f3dfc 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -241,11 +241,11 @@ static unsigned long mergeable_buf_to_ctx(void *buf, unsigned int truesize)
>  }
>  
>  /* Called from bottom half context */
> -static struct sk_buff *page_to_skb(struct receive_queue *rq,
> +static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> +				   struct receive_queue *rq,
>  				   struct page *page, unsigned int offset,
>  				   unsigned int len, unsigned int truesize)
>  {
> -	struct virtnet_info *vi = rq->vq->vdev->priv;
>  	struct sk_buff *skb;
>  	struct skb_vnet_hdr *hdr;
>  	unsigned int copy, hdr_len, hdr_padded_len;
> @@ -328,12 +328,13 @@ static struct sk_buff *receive_small(void *buf, unsigned int len)
>  }
>  
>  static struct sk_buff *receive_big(struct net_device *dev,
> +				   struct virtnet_info *vi,
>  				   struct receive_queue *rq,
>  				   void *buf,
>  				   unsigned int len)
>  {
>  	struct page *page = buf;
> -	struct sk_buff *skb = page_to_skb(rq, page, 0, len, PAGE_SIZE);
> +	struct sk_buff *skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
>  
>  	if (unlikely(!skb))
>  		goto err;
> @@ -359,7 +360,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  	int offset = buf - page_address(page);
>  	unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
>  
> -	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
> +	struct sk_buff *head_skb = page_to_skb(vi, rq, page, offset, len,
> +					       truesize);
>  	struct sk_buff *curr_skb = head_skb;
>  
>  	if (unlikely(!curr_skb))
> @@ -433,9 +435,9 @@ err_buf:
>  	return NULL;
>  }
>  
> -static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
> +static void receive_buf(struct virtnet_info *vi, struct receive_queue *rq,
> +			void *buf, unsigned int len)
>  {
> -	struct virtnet_info *vi = rq->vq->vdev->priv;
>  	struct net_device *dev = vi->dev;
>  	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
>  	struct sk_buff *skb;
> @@ -459,9 +461,9 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
>  	if (vi->mergeable_rx_bufs)
>  		skb = receive_mergeable(dev, vi, rq, (unsigned long)buf, len);
>  	else if (vi->big_packets)
> -		skb = receive_big(dev, rq, buf, len);
> +		skb = receive_big(dev, vi, rq, buf, len);
>  	else
> -		skb = receive_small(buf, len);
> +		skb = receive_small(vi, buf, len);
>  
>  	if (unlikely(!skb))
>  		return;
> @@ -530,9 +532,9 @@ frame_err:
>  	dev_kfree_skb(skb);
>  }
>  
> -static int add_recvbuf_small(struct receive_queue *rq, gfp_t gfp)
> +static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq,
> +			     gfp_t gfp)
>  {
> -	struct virtnet_info *vi = rq->vq->vdev->priv;
>  	struct sk_buff *skb;
>  	struct skb_vnet_hdr *hdr;
>  	int err;
> @@ -655,9 +657,9 @@ static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
>   * before we're receiving packets, or from refill_work which is
>   * careful to disable receiving (using napi_disable).
>   */
> -static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
> +static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq,
> +			  gfp_t gfp)
>  {
> -	struct virtnet_info *vi = rq->vq->vdev->priv;
>  	int err;
>  	bool oom;
>  
> @@ -668,7 +670,7 @@ static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
>  		else if (vi->big_packets)
>  			err = add_recvbuf_big(rq, gfp);
>  		else
> -			err = add_recvbuf_small(rq, gfp);
> +			err = add_recvbuf_small(vi, rq, gfp);
>  
>  		oom = err == -ENOMEM;
>  		if (err)
> @@ -717,7 +719,7 @@ static void refill_work(struct work_struct *work)
>  		struct receive_queue *rq = &vi->rq[i];
>  
>  		napi_disable(&rq->napi);
> -		still_empty = !try_fill_recv(rq, GFP_KERNEL);
> +		still_empty = !try_fill_recv(vi, rq, GFP_KERNEL);
>  		virtnet_napi_enable(rq);
>  
>  		/* In theory, this can happen: if we don't get any buffers in
> @@ -736,12 +738,12 @@ static int virtnet_receive(struct receive_queue *rq, int budget)
>  
>  	while (received < budget &&
>  	       (buf = virtqueue_get_buf(rq->vq, &len)) != NULL) {
> -		receive_buf(rq, buf, len);
> +		receive_buf(vi, rq, buf, len);
>  		received++;
>  	}
>  
>  	if (rq->vq->num_free > virtqueue_get_vring_size(rq->vq) / 2) {
> -		if (!try_fill_recv(rq, GFP_ATOMIC))
> +		if (!try_fill_recv(vi, rq, GFP_ATOMIC))
>  			schedule_delayed_work(&vi->refill, 0);
>  	}
>  
> @@ -817,7 +819,7 @@ static int virtnet_open(struct net_device *dev)
>  	for (i = 0; i < vi->max_queue_pairs; i++) {
>  		if (i < vi->curr_queue_pairs)
>  			/* Make sure we have some buffers: if oom use wq. */
> -			if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
> +			if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
>  				schedule_delayed_work(&vi->refill, 0);
>  		virtnet_napi_enable(&vi->rq[i]);
>  	}
> -- 
> MST
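
As a standalone illustration of the refactoring pattern in the patch
quoted above, here is a reduced sketch. The structs are stand-ins holding
only the pointers needed for the example, not the real kernel definitions,
and handle_old() and handle_new() are hypothetical helpers.

```c
/* Stand-in types: only the fields needed to show the pointer chain. */
struct virtnet_info { int id; };
struct virtio_device { void *priv; };
struct virtqueue { struct virtio_device *vdev; };
struct receive_queue { struct virtqueue *vq; };

/* Before: each helper recovers vi through three dependent loads. */
static int handle_old(struct receive_queue *rq)
{
	struct virtnet_info *vi = rq->vq->vdev->priv;

	return vi->id;
}

/* After: the caller already holds vi and passes it down explicitly. */
static int handle_new(struct virtnet_info *vi, struct receive_queue *rq)
{
	(void)rq;	/* unused in this reduced example */
	return vi->id;
}
```

Both helpers reach the same structure. The second form avoids the pointer
chase, at the cost of one extra parameter per function.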

* Re: [PATCH RFC 2/4] virtio_net: get rid of virtio_net_hdr/skb_vnet_hdr
From: Rusty Russell @ 2014-10-28  0:27 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel; +Cc: netdev, virtualization
In-Reply-To: <1414099656-28090-2-git-send-email-mst@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> writes:
> virtio 1.0 doesn't use virtio_net_hdr anymore, and in fact, it's not
> really useful since virtio_net_hdr_mrg_rxbuf includes that as the first
> field anyway.
>
> Let's drop it, precalculate header len and store within vi instead.
>
> This way we can also remove struct skb_vnet_hdr.

Yes, this is definitely a win.

Acked-by: Rusty Russell <rusty@rustcorp.com.au>

Thanks,
Rusty.
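
The padding change in the quoted patch (6 bytes down to 4) follows from
the header sizes, and a short sketch can check the arithmetic. The structs
below are local stand-ins that assume the usual virtio-net header layout
(two 8-bit fields and four 16-bit fields, plus a 16-bit num_buffers in the
mergeable variant). The names legacy_hdr, mrg_hdr, padded_old, padded_new
and padding_for() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the basic header: 1 + 1 + 4 * 2 = 10 bytes. */
struct legacy_hdr {
	uint8_t flags;
	uint8_t gso_type;
	uint16_t hdr_len;
	uint16_t gso_size;
	uint16_t csum_start;
	uint16_t csum_offset;
};

/* Stand-in for the mergeable-buffer header: 10 + 2 = 12 bytes. */
struct mrg_hdr {
	struct legacy_hdr hdr;
	uint16_t num_buffers;
};

/* Padded layouts before and after the patch. */
struct padded_old {
	struct legacy_hdr hdr;
	char padding[6];
};

struct padded_new {
	struct mrg_hdr hdr;
	char padding[4];
};

/* Bytes needed so that hdr_size + padding is a multiple of align. */
static size_t padding_for(size_t hdr_size, size_t align)
{
	return (align - hdr_size % align) % align;
}
```

With a 16-byte alignment target, padding_for() gives 6 for the 10-byte
header and 4 for the 12-byte one, so both padded structs come to 16 bytes
and the data that follows stays 16-byte aligned.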

>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/net/virtio_net.c | 88 ++++++++++++++++++++++--------------------------
>  1 file changed, 40 insertions(+), 48 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 36f3dfc..a795a23 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -123,6 +123,9 @@ struct virtnet_info {
>  	/* Host can handle any s/g split between our header and packet data */
>  	bool any_header_sg;
>  
> +	/* Packet virtio header size */
> +	u8 hdr_len;
> +
>  	/* Active statistics */
>  	struct virtnet_stats __percpu *stats;
>  
> @@ -139,21 +142,14 @@ struct virtnet_info {
>  	struct notifier_block nb;
>  };
>  
> -struct skb_vnet_hdr {
> -	union {
> -		struct virtio_net_hdr hdr;
> -		struct virtio_net_hdr_mrg_rxbuf mhdr;
> -	};
> -};
> -
>  struct padded_vnet_hdr {
> -	struct virtio_net_hdr hdr;
> +	struct virtio_net_hdr_mrg_rxbuf hdr;
>  	/*
> -	 * virtio_net_hdr should be in a separated sg buffer because of a
> -	 * QEMU bug, and data sg buffer shares same page with this header sg.
> -	 * This padding makes next sg 16 byte aligned after virtio_net_hdr.
> +	 * hdr is in a separate sg buffer, and data sg buffer shares same page
> +	 * with this header sg. This padding makes next sg 16 byte aligned
> +	 * after the header.
>  	 */
> -	char padding[6];
> +	char padding[4];
>  };
>  
>  /* Converting between virtqueue no. and kernel tx/rx queue no.
> @@ -179,9 +175,9 @@ static int rxq2vq(int rxq)
>  	return rxq * 2;
>  }
>  
> -static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
> +static inline struct virtio_net_hdr_mrg_rxbuf *skb_vnet_hdr(struct sk_buff *skb)
>  {
> -	return (struct skb_vnet_hdr *)skb->cb;
> +	return (struct virtio_net_hdr_mrg_rxbuf *)skb->cb;
>  }
>  
>  /*
> @@ -247,7 +243,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>  				   unsigned int len, unsigned int truesize)
>  {
>  	struct sk_buff *skb;
> -	struct skb_vnet_hdr *hdr;
> +	struct virtio_net_hdr_mrg_rxbuf *hdr;
>  	unsigned int copy, hdr_len, hdr_padded_len;
>  	char *p;
>  
> @@ -260,13 +256,11 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>  
>  	hdr = skb_vnet_hdr(skb);
>  
> -	if (vi->mergeable_rx_bufs) {
> -		hdr_len = sizeof hdr->mhdr;
> -		hdr_padded_len = sizeof hdr->mhdr;
> -	} else {
> -		hdr_len = sizeof hdr->hdr;
> +	hdr_len = vi->hdr_len;
> +	if (vi->mergeable_rx_bufs)
> +		hdr_padded_len = sizeof *hdr;
> +	else
>  		hdr_padded_len = sizeof(struct padded_vnet_hdr);
> -	}
>  
>  	memcpy(hdr, p, hdr_len);
>  
> @@ -317,11 +311,11 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>  	return skb;
>  }
>  
> -static struct sk_buff *receive_small(void *buf, unsigned int len)
> +static struct sk_buff *receive_small(struct virtnet_info *vi, void *buf, unsigned int len)
>  {
>  	struct sk_buff * skb = buf;
>  
> -	len -= sizeof(struct virtio_net_hdr);
> +	len -= vi->hdr_len;
>  	skb_trim(skb, len);
>  
>  	return skb;
> @@ -354,8 +348,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  					 unsigned int len)
>  {
>  	void *buf = mergeable_ctx_to_buf_address(ctx);
> -	struct skb_vnet_hdr *hdr = buf;
> -	u16 num_buf = virtio16_to_cpu(rq->vq->vdev, hdr->mhdr.num_buffers);
> +	struct virtio_net_hdr_mrg_rxbuf *hdr = buf;
> +	u16 num_buf = virtio16_to_cpu(vi->vdev, hdr->num_buffers);
>  	struct page *page = virt_to_head_page(buf);
>  	int offset = buf - page_address(page);
>  	unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
> @@ -373,8 +367,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  		if (unlikely(!ctx)) {
>  			pr_debug("%s: rx error: %d buffers out of %d missing\n",
>  				 dev->name, num_buf,
> -				 virtio16_to_cpu(rq->vq->vdev,
> -						 hdr->mhdr.num_buffers));
> +				 virtio16_to_cpu(vi->vdev,
> +						 hdr->num_buffers));
>  			dev->stats.rx_length_errors++;
>  			goto err_buf;
>  		}
> @@ -441,7 +435,7 @@ static void receive_buf(struct virtnet_info *vi, struct receive_queue *rq,
>  	struct net_device *dev = vi->dev;
>  	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
>  	struct sk_buff *skb;
> -	struct skb_vnet_hdr *hdr;
> +	struct virtio_net_hdr_mrg_rxbuf *hdr;
>  
>  	if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
>  		pr_debug("%s: short packet %i\n", dev->name, len);
> @@ -536,7 +530,7 @@ static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq,
>  			     gfp_t gfp)
>  {
>  	struct sk_buff *skb;
> -	struct skb_vnet_hdr *hdr;
> +	struct virtio_net_hdr_mrg_rxbuf *hdr;
>  	int err;
>  
>  	skb = __netdev_alloc_skb_ip_align(vi->dev, GOOD_PACKET_LEN, gfp);
> @@ -547,7 +541,7 @@ static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq,
>  
>  	hdr = skb_vnet_hdr(skb);
>  	sg_init_table(rq->sg, MAX_SKB_FRAGS + 2);
> -	sg_set_buf(rq->sg, &hdr->hdr, sizeof hdr->hdr);
> +	sg_set_buf(rq->sg, hdr, vi->hdr_len);
>  	skb_to_sgvec(skb, rq->sg + 1, 0, skb->len);
>  
>  	err = virtqueue_add_inbuf(rq->vq, rq->sg, 2, skb, gfp);
> @@ -557,7 +551,8 @@ static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq,
>  	return err;
>  }
>  
> -static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
> +static int add_recvbuf_big(struct virtnet_info *vi, struct receive_queue *rq,
> +			   gfp_t gfp)
>  {
>  	struct page *first, *list = NULL;
>  	char *p;
> @@ -588,8 +583,8 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
>  	p = page_address(first);
>  
>  	/* rq->sg[0], rq->sg[1] share the same page */
> -	/* a separated rq->sg[0] for virtio_net_hdr only due to QEMU bug */
> -	sg_set_buf(&rq->sg[0], p, sizeof(struct virtio_net_hdr));
> +	/* a separated rq->sg[0] for header - required in case !any_header_sg */
> +	sg_set_buf(&rq->sg[0], p, vi->hdr_len);
>  
>  	/* rq->sg[1] for data packet, from offset */
>  	offset = sizeof(struct padded_vnet_hdr);
> @@ -668,7 +663,7 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq,
>  		if (vi->mergeable_rx_bufs)
>  			err = add_recvbuf_mergeable(rq, gfp);
>  		else if (vi->big_packets)
> -			err = add_recvbuf_big(rq, gfp);
> +			err = add_recvbuf_big(vi, rq, gfp);
>  		else
>  			err = add_recvbuf_small(vi, rq, gfp);
>  
> @@ -848,18 +843,14 @@ static void free_old_xmit_skbs(struct send_queue *sq)
>  
>  static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
>  {
> -	struct skb_vnet_hdr *hdr;
> +	struct virtio_net_hdr_mrg_rxbuf *hdr;
>  	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
>  	struct virtnet_info *vi = sq->vq->vdev->priv;
>  	unsigned num_sg;
> -	unsigned hdr_len;
> +	unsigned hdr_len = vi->hdr_len;
>  	bool can_push;
>  
>  	pr_debug("%s: xmit %p %pM\n", vi->dev->name, skb, dest);
> -	if (vi->mergeable_rx_bufs)
> -		hdr_len = sizeof hdr->mhdr;
> -	else
> -		hdr_len = sizeof hdr->hdr;
>  
>  	can_push = vi->any_header_sg &&
>  		!((unsigned long)skb->data & (__alignof__(*hdr) - 1)) &&
> @@ -867,7 +858,7 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
>  	/* Even if we can, don't push here yet as this would skew
>  	 * csum_start offset below. */
>  	if (can_push)
> -		hdr = (struct skb_vnet_hdr *)(skb->data - hdr_len);
> +		hdr = (struct virtio_net_hdr_mrg_rxbuf *)(skb->data - hdr_len);
>  	else
>  		hdr = skb_vnet_hdr(skb);
>  
> @@ -902,7 +893,7 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
>  	}
>  
>  	if (vi->mergeable_rx_bufs)
> -		hdr->mhdr.num_buffers = 0;
> +		hdr->num_buffers = 0;
>  
>  	sg_init_table(sq->sg, MAX_SKB_FRAGS + 2);
>  	if (can_push) {
> @@ -1773,18 +1764,19 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
>  		vi->mergeable_rx_bufs = true;
>  
> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
> +		vi->hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +	else
> +		vi->hdr_len = sizeof(struct virtio_net_hdr);
> +
>  	if (virtio_has_feature(vdev, VIRTIO_F_ANY_LAYOUT))
>  		vi->any_header_sg = true;
>  
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>  		vi->has_cvq = true;
>  
> -	if (vi->any_header_sg) {
> -		if (vi->mergeable_rx_bufs)
> -			dev->needed_headroom = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> -		else
> -			dev->needed_headroom = sizeof(struct virtio_net_hdr);
> -	}
> +	if (vi->any_header_sg)
> +		dev->needed_headroom = vi->hdr_len;
>  
>  	/* Use single tx/rx queue pair as default */
>  	vi->curr_queue_pairs = 1;
> -- 
> MST

^ permalink raw reply

* Re: [PATCH RFC 4/4] virtio_net: bigger header when VERSION_1 is set
From: Rusty Russell @ 2014-10-28  0:28 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel; +Cc: netdev, virtualization
In-Reply-To: <1414099656-28090-4-git-send-email-mst@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> writes:
> With VERSION_1 virtio_net uses same header size
> whether mergeable buffers are enabled or not.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

These two are great too, thanks:

Acked-by: Rusty Russell <rusty@rustcorp.com.au>

Cheers,
Rusty.

> ---
>  drivers/net/virtio_net.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 9c6d50f..a2fe340 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1764,7 +1764,8 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
>  		vi->mergeable_rx_bufs = true;
>  
> -	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF) ||
> +	    virtio_has_feature(vdev, VIRTIO_F_VERSION_1))
>  		vi->hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
>  	else
>  		vi->hdr_len = sizeof(struct virtio_net_hdr);
> -- 
> MST
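
The two virtio_net patches quoted above replace per-call sizeof
decisions with a single header length chosen at probe time. A minimal
userspace sketch of that idea follows. The struct names vnet_hdr and
vnet_hdr_mrg_rxbuf and the helpers pick_hdr_len() and payload_len() are
hypothetical stand-ins for illustration, not the driver's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative copy of the basic virtio-net header fields. */
struct vnet_hdr {
	uint8_t  flags;
	uint8_t  gso_type;
	uint16_t hdr_len;
	uint16_t gso_size;
	uint16_t csum_start;
	uint16_t csum_offset;
};

/* Same header followed by the buffer count used with mergeable rx buffers. */
struct vnet_hdr_mrg_rxbuf {
	struct vnet_hdr hdr;
	uint16_t num_buffers;
};

/* Hypothetical helper: choose the header length once, from the negotiated
 * features. With VERSION_1 the larger header is used even without
 * mergeable buffers, as in patch 4/4. */
size_t pick_hdr_len(bool mrg_rxbuf, bool version_1)
{
	if (mrg_rxbuf || version_1)
		return sizeof(struct vnet_hdr_mrg_rxbuf);
	return sizeof(struct vnet_hdr);
}

/* Hypothetical helper: bytes left after stripping the header, in the
 * spirit of "len -= vi->hdr_len" in receive_small(). */
size_t payload_len(size_t total, size_t hdr_len)
{
	assert(total >= hdr_len);
	return total - hdr_len;
}
```

Every caller then reads the stored length instead of re-deriving it
from feature flags, which is what vi->hdr_len does in the patch.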

^ permalink raw reply

* [GIT PULL nf-next] IPVS Updates for v3.19
From: Simon Horman @ 2014-10-28  0:59 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Simon Horman

Hi Pablo,

please consider these IPVS updates for v3.19.

The single patch in this series fixes some minor fallout from adding
support for IPv6 real servers in IPv4 virtual-services and vice versa.

It should not have any run-time effect other than perhaps saving a few cycles.


The following changes since commit 61ed53deb1c6a4386d8710dbbfcee8779c381931:

  Merge tag 'ntb-3.18' of git://github.com/jonmason/ntb (2014-10-19 12:58:22 -0700)

are available in the git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next.git tags/ipvs-for-v3.19

for you to fetch changes up to d7701089118d23bfed03bad0a6b5cc5115990c9e:

  ipvs: remove unnecessary assignment in __ip_vs_get_out_rt (2014-10-28 09:50:06 +0900)

----------------------------------------------------------------
Alex Gartrell (1):
      ipvs: remove unnecessary assignment in __ip_vs_get_out_rt

 net/netfilter/ipvs/ip_vs_xmit.c | 1 -
 1 file changed, 1 deletion(-)

^ permalink raw reply

* [PATCH nf-next] ipvs: remove unnecessary assignment in __ip_vs_get_out_rt
From: Simon Horman @ 2014-10-28  0:59 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Alex Gartrell, Simon Horman
In-Reply-To: <1414457960-20864-1-git-send-email-horms@verge.net.au>

From: Alex Gartrell <agartrell@fb.com>

It is a precondition of the function that daddr be equal to dest->addr.ip
if dest is non-NULL, so this additional assignment is just confusing for
stupid engineers like me.

Signed-off-by: Alex Gartrell <agartrell@fb.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 net/netfilter/ipvs/ip_vs_xmit.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 91f17c1..5efa597 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -293,7 +293,6 @@ __ip_vs_get_out_rt(int skb_af, struct sk_buff *skb, struct ip_vs_dest *dest,
 				  &dest->addr.ip, &dest_dst->dst_saddr.ip,
 				  atomic_read(&rt->dst.__refcnt));
 		}
-		daddr = dest->addr.ip;
 		if (ret_saddr)
 			*ret_saddr = dest_dst->dst_saddr.ip;
 	} else {
-- 
2.1.1


^ permalink raw reply related

* [PATCH nf] ipvs: Avoid null-pointer deref in debug code
From: Simon Horman @ 2014-10-28  1:05 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Alex Gartrell, Simon Horman
In-Reply-To: <1414458334-22479-1-git-send-email-horms@verge.net.au>

From: Alex Gartrell <agartrell@fb.com>

Use daddr instead of reaching into dest.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Gartrell <agartrell@fb.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 net/netfilter/ipvs/ip_vs_xmit.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 91f17c1..437a366 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -316,7 +316,7 @@ __ip_vs_get_out_rt(int skb_af, struct sk_buff *skb, struct ip_vs_dest *dest,
 	if (unlikely(crosses_local_route_boundary(skb_af, skb, rt_mode,
 						  local))) {
 		IP_VS_DBG_RL("We are crossing local and non-local addresses"
-			     " daddr=%pI4\n", &dest->addr.ip);
+			     " daddr=%pI4\n", &daddr);
 		goto err_put;
 	}
 
@@ -458,7 +458,7 @@ __ip_vs_get_out_rt_v6(int skb_af, struct sk_buff *skb, struct ip_vs_dest *dest,
 	if (unlikely(crosses_local_route_boundary(skb_af, skb, rt_mode,
 						  local))) {
 		IP_VS_DBG_RL("We are crossing local and non-local addresses"
-			     " daddr=%pI6\n", &dest->addr.in6);
+			     " daddr=%pI6\n", daddr);
 		goto err_put;
 	}
 
-- 
2.1.1
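
The fix above logs the local daddr value instead of reaching through
dest, which may be NULL on this path. A minimal userspace sketch of the
pattern follows. struct dest and format_daddr() are hypothetical names,
and the address is assumed to be in host byte order for simplicity
(the kernel's %pI4 takes a pointer to a network-order address).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the destination object; may be absent. */
struct dest {
	uint32_t addr;
};

/* Format the destination address for a debug message. Only daddr is
 * read for formatting, so a NULL dest is harmless here. */
int format_daddr(char *buf, size_t len, const struct dest *dest,
		 uint32_t daddr)
{
	/* Assumed precondition: when dest is set, it agrees with daddr. */
	assert(dest == NULL || dest->addr == daddr);

	return snprintf(buf, len, "daddr=%u.%u.%u.%u",
			(unsigned int)((daddr >> 24) & 0xff),
			(unsigned int)((daddr >> 16) & 0xff),
			(unsigned int)((daddr >> 8) & 0xff),
			(unsigned int)(daddr & 0xff));
}
```

Calling format_daddr() with dest == NULL is safe because the pointer is
never dereferenced for the message itself.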


^ permalink raw reply related

* [GIT PULL nf] IPVS Fixes for v3.18
From: Simon Horman @ 2014-10-28  1:05 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Simon Horman

Hi Pablo,

please consider this fix for v3.18.

It fixes a null-pointer dereference that may occur when logging
errors.

This problem was introduced by 4a4739d56b0 ("ipvs: Pull out
crosses_local_route_boundary logic") in v3.17-rc5. As such I would
also like it considered for 3.17-stable.


The following changes since commit 7965ee93719921ea5978f331da653dfa2d7b99f5:

  netfilter: nft_compat: fix wrong target lookup in nft_target_select_ops() (2014-10-27 22:17:46 +0100)

are available in the git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs.git tags/ipvs-fixes-for-v3.18

for you to fetch changes up to 3d53666b40007b55204ee8890618da79a20c9940:

  ipvs: Avoid null-pointer deref in debug code (2014-10-28 09:48:31 +0900)

----------------------------------------------------------------
Alex Gartrell (1):
      ipvs: Avoid null-pointer deref in debug code

 net/netfilter/ipvs/ip_vs_xmit.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

^ permalink raw reply

* Re: [PATCH net-next 2/2] udp: Reset flow table for flows over unconnected sockets
From: Tom Herbert @ 2014-10-28  1:09 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Linux Netdev List
In-Reply-To: <1414451970.2922.27.camel@edumazet-glaptop2.roam.corp.google.com>

On Mon, Oct 27, 2014 at 4:19 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2014-10-27 at 12:36 -0700, Tom Herbert wrote:
>
>> Please try this patch and provide real data to support your points.
>>
>
> Yep. This is not good, I confirm my fear.
>
> Google servers are shifting to serve both TCP & UDP traffic (QUIC
> protocol), with an increasing UDP load.
>
> Millions of packets per second per host, from millions of different
> sources...
>
This indicates nothing about the merits of this patch. Nevertheless,
in order to avoid further rat-holing, and since this patch does change
a long-standing behavior, I will respin to make it enabled only by
sysctl.

Tom

> And your patch voids the RFS table, adds another cache miss in fast path
> for UDP rx path which is already too expensive.
>
>
>> If a TCP connection is hot it will continually refresh the table for
>> that connection, if connection becomes idle it only takes one received
>> packet to restore the CPU. The only time there could be a persistent
>> problem is if collision rate is high (which probably means table is
>> too small).
>
>
> RFS already has a low hit/miss rate; this patch helps neither
> UDP nor TCP.
>
> Ideally, RFS should be enabled on a per-protocol basis, not with an
> agnostic u32 flow hash.
>
> Whatever strategy you implement, as long as different protocols share a
> common hash table, it won't be perfect for mixed workloads.
>
> The fundamental problem is that when a UDP packet comes, it's not possible
> to know if it's a 'flow' or 'not', unless we perform an expensive lookup,
> and then RPS/RFS cost becomes prohibitive.
>
> While for TCP, the current RFS cache miss is good enough, because about
> all packets are for connected flows. We eventually have bad steering for
> <not yet established> flows where the stack performs poorly anyway.
>
>
>

^ permalink raw reply

* Re: [PATCH] bridge: Add support for IEEE 802.11 Proxy ARP
From: Stephen Hemminger @ 2014-10-28  1:20 UTC (permalink / raw)
  To: Kyeyoon Park; +Cc: davem, jouni, netdev
In-Reply-To: <1414100957-8288-1-git-send-email-kyeyoonp@qca.qualcomm.com>

On Thu, 23 Oct 2014 14:49:17 -0700
Kyeyoon Park <kyeyoonp@qca.qualcomm.com> wrote:

> From: Kyeyoon Park <kyeyoonp@codeaurora.org>
> 
> This feature is defined in IEEE Std 802.11-2012, 10.23.13. It allows
> the AP devices to keep track of the hardware-address-to-IP-address
> mapping of the mobile devices within the WLAN network.
> 
> The AP will learn this mapping via observing DHCP, ARP, and NS/NA
> frames. When a request for such information is made (i.e. ARP request,
> Neighbor Solicitation), the AP will respond on behalf of the
> associated mobile device. In the process of doing so, the AP will drop
> the multicast request frame that was intended to go out to the wireless
> medium.
> 
> It was recommended at the LKS workshop to do this implementation in
> the bridge layer. vxlan.c is already doing something very similar.
> The DHCP snooping code will be added to the userspace application
> (hostapd) per the recommendation.
> 
> This RFC commit is only for IPv4. A similar approach in the bridge
> layer will be taken for IPv6 as well.
> 
> Signed-off-by: Kyeyoon Park <kyeyoonp@codeaurora.org>

Looks good. Maybe at some point VXLAN and bridge should share
more code or at least the same options.

I am a little worried that this could be DoS'd.

^ permalink raw reply

* [PATCH] mac80211_hwsim: release driver when ieee80211_register_hw fails
From: Junjie Mao @ 2014-10-28  1:31 UTC (permalink / raw)
  To: Martin Pitt
  Cc: Junjie Mao, Fengguang Wu, linux-wireless, netdev, linux-kernel

The driver is not released when ieee80211_register_hw fails in
mac80211_hwsim_create_radio, leading to the access to the unregistered (and
possibly freed) device in platform_driver_unregister:

[    0.447547] mac80211_hwsim: ieee80211_register_hw failed (-2)
[    0.448292] ------------[ cut here ]------------
[    0.448854] WARNING: CPU: 0 PID: 1 at ../include/linux/kref.h:47 kobject_get+0x33/0x50()
[    0.449839] CPU: 0 PID: 1 Comm: swapper Not tainted 3.17.0-00001-gdd46990-dirty #2
[    0.450813] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[    0.451512]  00000000 00000000 78025e38 7967c6c6 78025e68 7905e09b 7988b480 00000000
[    0.452579]  00000001 79887d62 0000002f 79170bb3 79170bb3 78397008 79ac9d74 00000001
[    0.453614]  78025e78 7905e15d 00000009 00000000 78025e84 79170bb3 78397000 78025e8c
[    0.454632] Call Trace:
[    0.454921]  [<7967c6c6>] dump_stack+0x16/0x18
[    0.455453]  [<7905e09b>] warn_slowpath_common+0x6b/0x90
[    0.456067]  [<79170bb3>] ? kobject_get+0x33/0x50
[    0.456612]  [<79170bb3>] ? kobject_get+0x33/0x50
[    0.457155]  [<7905e15d>] warn_slowpath_null+0x1d/0x20
[    0.457748]  [<79170bb3>] kobject_get+0x33/0x50
[    0.458274]  [<7925824f>] get_device+0xf/0x20
[    0.458779]  [<7925b5cd>] driver_detach+0x3d/0xa0
[    0.459331]  [<7925a3ff>] bus_remove_driver+0x8f/0xb0
[    0.459927]  [<7925bf80>] ? class_unregister+0x40/0x80
[    0.460660]  [<7925bad7>] driver_unregister+0x47/0x50
[    0.461248]  [<7925c033>] ? class_destroy+0x13/0x20
[    0.461824]  [<7925d07b>] platform_driver_unregister+0xb/0x10
[    0.462507]  [<79b51ba0>] init_mac80211_hwsim+0x3e8/0x3f9
[    0.463161]  [<79b30c58>] do_one_initcall+0x106/0x1a9
[    0.463758]  [<79b517b8>] ? if_spi_init_module+0xac/0xac
[    0.464393]  [<79b517b8>] ? if_spi_init_module+0xac/0xac
[    0.465001]  [<79071935>] ? parse_args+0x2f5/0x480
[    0.465569]  [<7906b41e>] ? __usermodehelper_set_disable_depth+0x3e/0x50
[    0.466345]  [<79b30dd9>] kernel_init_freeable+0xde/0x17d
[    0.466972]  [<79b304d6>] ? do_early_param+0x7a/0x7a
[    0.467546]  [<79677b1b>] kernel_init+0xb/0xe0
[    0.468072]  [<79075f42>] ? schedule_tail+0x12/0x40
[    0.468658]  [<79686580>] ret_from_kernel_thread+0x20/0x30
[    0.469303]  [<79677b10>] ? rest_init+0xc0/0xc0
[    0.469829] ---[ end trace ad8ac403ff8aef5c ]---
[    0.470509] ------------[ cut here ]------------
[    0.471047] WARNING: CPU: 0 PID: 1 at ../kernel/locking/lockdep.c:3161 __lock_acquire.isra.22+0x7aa/0xb00()
[    0.472163] DEBUG_LOCKS_WARN_ON(id >= MAX_LOCKDEP_KEYS)
[    0.472774] CPU: 0 PID: 1 Comm: swapper Tainted: G        W      3.17.0-00001-gdd46990-dirty #2
[    0.473815] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[    0.474492]  78025de0 78025de0 78025da0 7967c6c6 78025dd0 7905e09b 79888931 78025dfc
[    0.475515]  00000001 79888a93 00000c59 7907f33a 7907f33a 78028000 fffe9d09 00000000
[    0.476519]  78025de8 7905e10e 00000009 78025de0 79888931 78025dfc 78025e24 7907f33a
[    0.477523] Call Trace:
[    0.477821]  [<7967c6c6>] dump_stack+0x16/0x18
[    0.478352]  [<7905e09b>] warn_slowpath_common+0x6b/0x90
[    0.478976]  [<7907f33a>] ? __lock_acquire.isra.22+0x7aa/0xb00
[    0.479658]  [<7907f33a>] ? __lock_acquire.isra.22+0x7aa/0xb00
[    0.480417]  [<7905e10e>] warn_slowpath_fmt+0x2e/0x30
[    0.480479]  [<7907f33a>] __lock_acquire.isra.22+0x7aa/0xb00
[    0.480479]  [<79078aa5>] ? sched_clock_cpu+0xb5/0xf0
[    0.480479]  [<7907fd06>] lock_acquire+0x56/0x70
[    0.480479]  [<7925b5e8>] ? driver_detach+0x58/0xa0
[    0.480479]  [<79682d11>] mutex_lock_nested+0x61/0x2a0
[    0.480479]  [<7925b5e8>] ? driver_detach+0x58/0xa0
[    0.480479]  [<7925b5e8>] ? driver_detach+0x58/0xa0
[    0.480479]  [<7925b5e8>] driver_detach+0x58/0xa0
[    0.480479]  [<7925a3ff>] bus_remove_driver+0x8f/0xb0
[    0.480479]  [<7925bf80>] ? class_unregister+0x40/0x80
[    0.480479]  [<7925bad7>] driver_unregister+0x47/0x50
[    0.480479]  [<7925c033>] ? class_destroy+0x13/0x20
[    0.480479]  [<7925d07b>] platform_driver_unregister+0xb/0x10
[    0.480479]  [<79b51ba0>] init_mac80211_hwsim+0x3e8/0x3f9
[    0.480479]  [<79b30c58>] do_one_initcall+0x106/0x1a9
[    0.480479]  [<79b517b8>] ? if_spi_init_module+0xac/0xac
[    0.480479]  [<79b517b8>] ? if_spi_init_module+0xac/0xac
[    0.480479]  [<79071935>] ? parse_args+0x2f5/0x480
[    0.480479]  [<7906b41e>] ? __usermodehelper_set_disable_depth+0x3e/0x50
[    0.480479]  [<79b30dd9>] kernel_init_freeable+0xde/0x17d
[    0.480479]  [<79b304d6>] ? do_early_param+0x7a/0x7a
[    0.480479]  [<79677b1b>] kernel_init+0xb/0xe0
[    0.480479]  [<79075f42>] ? schedule_tail+0x12/0x40
[    0.480479]  [<79686580>] ret_from_kernel_thread+0x20/0x30
[    0.480479]  [<79677b10>] ? rest_init+0xc0/0xc0
[    0.480479] ---[ end trace ad8ac403ff8aef5d ]---
[    0.495478] BUG: unable to handle kernel paging request at 00200200
[    0.496257] IP: [<79682de5>] mutex_lock_nested+0x135/0x2a0
[    0.496923] *pde = 00000000
[    0.497290] Oops: 0002 [#1]
[    0.497653] CPU: 0 PID: 1 Comm: swapper Tainted: G        W      3.17.0-00001-gdd46990-dirty #2
[    0.498659] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[    0.499321] task: 78028000 ti: 78024000 task.ti: 78024000
[    0.499955] EIP: 0060:[<79682de5>] EFLAGS: 00010097 CPU: 0
[    0.500620] EIP is at mutex_lock_nested+0x135/0x2a0
[    0.501145] EAX: 00200200 EBX: 78397434 ECX: 78397460 EDX: 78025e70
[    0.501816] ESI: 00000246 EDI: 78028000 EBP: 78025e8c ESP: 78025e54
[    0.502497]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[    0.503076] CR0: 8005003b CR2: 00200200 CR3: 01b9d000 CR4: 00000690
[    0.503773] Stack:
[    0.503998]  00000000 00000001 00000000 7925b5e8 78397460 7925b5e8 78397474 78397460
[    0.504944]  00200200 11111111 78025e70 78397000 79ac9d74 00000001 78025ea0 7925b5e8
[    0.505451]  79ac9d74 fffffffe 00000001 78025ebc 7925a3ff 7a251398 78025ec8 7925bf80
[    0.505451] Call Trace:
[    0.505451]  [<7925b5e8>] ? driver_detach+0x58/0xa0
[    0.505451]  [<7925b5e8>] ? driver_detach+0x58/0xa0
[    0.505451]  [<7925b5e8>] driver_detach+0x58/0xa0
[    0.505451]  [<7925a3ff>] bus_remove_driver+0x8f/0xb0
[    0.505451]  [<7925bf80>] ? class_unregister+0x40/0x80
[    0.505451]  [<7925bad7>] driver_unregister+0x47/0x50
[    0.505451]  [<7925c033>] ? class_destroy+0x13/0x20
[    0.505451]  [<7925d07b>] platform_driver_unregister+0xb/0x10
[    0.505451]  [<79b51ba0>] init_mac80211_hwsim+0x3e8/0x3f9
[    0.505451]  [<79b30c58>] do_one_initcall+0x106/0x1a9
[    0.505451]  [<79b517b8>] ? if_spi_init_module+0xac/0xac
[    0.505451]  [<79b517b8>] ? if_spi_init_module+0xac/0xac
[    0.505451]  [<79071935>] ? parse_args+0x2f5/0x480
[    0.505451]  [<7906b41e>] ? __usermodehelper_set_disable_depth+0x3e/0x50
[    0.505451]  [<79b30dd9>] kernel_init_freeable+0xde/0x17d
[    0.505451]  [<79b304d6>] ? do_early_param+0x7a/0x7a
[    0.505451]  [<79677b1b>] kernel_init+0xb/0xe0
[    0.505451]  [<79075f42>] ? schedule_tail+0x12/0x40
[    0.505451]  [<79686580>] ret_from_kernel_thread+0x20/0x30
[    0.505451]  [<79677b10>] ? rest_init+0xc0/0xc0
[    0.505451] Code: 89 d8 e8 cf 9b 9f ff 8b 4f 04 8d 55 e4 89 d8 e8 72 9d 9f ff 8d 43 2c 89 c1 89 45 d8 8b 43 30 8d 55 e4 89 53 30 89 4d e4 89 45 e8 <89> 10 8b 55 dc 8b 45 e0 89 7d ec e8 db af 9f ff eb 11 90 31 c0
[    0.505451] EIP: [<79682de5>] mutex_lock_nested+0x135/0x2a0 SS:ESP 0068:78025e54
[    0.505451] CR2: 0000000000200200
[    0.505451] ---[ end trace ad8ac403ff8aef5e ]---
[    0.505451] Kernel panic - not syncing: Fatal exception

Fixes: 9ea927748ced ("mac80211_hwsim: Register and bind to driver")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Junjie Mao <eternal.n08@gmail.com>
---
 drivers/net/wireless/mac80211_hwsim.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/mac80211_hwsim.c b/drivers/net/wireless/mac80211_hwsim.c
index babbdc1ce741..c9ad4cf1adfb 100644
--- a/drivers/net/wireless/mac80211_hwsim.c
+++ b/drivers/net/wireless/mac80211_hwsim.c
@@ -1987,7 +1987,7 @@ static int mac80211_hwsim_create_radio(int channels, const char *reg_alpha2,
 	if (err != 0) {
 		printk(KERN_DEBUG "mac80211_hwsim: device_bind_driver failed (%d)\n",
 		       err);
-		goto failed_hw;
+		goto failed_bind;
 	}

 	skb_queue_head_init(&data->pending);
@@ -2183,6 +2183,8 @@ static int mac80211_hwsim_create_radio(int channels, const char *reg_alpha2,
 	return idx;

 failed_hw:
+	device_release_driver(data->dev);
+failed_bind:
 	device_unregister(data->dev);
 failed_drvdata:
 	ieee80211_free_hw(hw);
--
1.9.3
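
The patch above adds a second unwind label so that each failure point
undoes exactly the steps that already succeeded. A minimal userspace
sketch of that goto-based unwinding follows. struct radio, the fail_at
parameter and create_radio() are hypothetical, with boolean flags
standing in for the device, driver binding and hw registration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical resource state for one simulated radio. */
struct radio {
	bool dev_registered;
	bool driver_bound;
	bool hw_registered;
};

/* fail_at: 0 = no failure, 1 = bind fails, 2 = hw registration fails. */
int create_radio(struct radio *r, int fail_at)
{
	assert(r != NULL);
	*r = (struct radio){ false, false, false };

	r->dev_registered = true;	/* step 1: device created */

	if (fail_at == 1)
		goto failed_bind;	/* nothing bound yet */
	r->driver_bound = true;		/* step 2: driver bound */

	if (fail_at == 2)
		goto failed_hw;		/* the bind must be undone too */
	r->hw_registered = true;	/* step 3: hw registered */
	return 0;

failed_hw:
	r->driver_bound = false;	/* mirrors device_release_driver() */
failed_bind:
	r->dev_registered = false;	/* mirrors device_unregister() */
	return -1;
}
```

Labels are ordered so that a later failure falls through the cleanup of
every earlier step. Jumping to failed_bind from the hw failure path
would leave the driver bound, which is the bug the patch fixes.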

^ permalink raw reply related

* Re: irq disable in __netdev_alloc_frag() ?
From: Christoph Lameter @ 2014-10-28  2:30 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, Alexander Duyck, Alexei Starovoitov, Eric Dumazet,
	Network Development
In-Reply-To: <20141027213523.799da09c@redhat.com>

On Mon, 27 Oct 2014, Jesper Dangaard Brouer wrote:

> > Same could be done with some kmem_cache_alloc() : SLAB uses hard irq
> > masking while some caches are never used from hard irq context.
>
> Sounds interesting.

SLUB does not disable interrupts in the fast paths.

^ permalink raw reply

* Re: irq disable in __netdev_alloc_frag() ?
From: Eric Dumazet @ 2014-10-28  2:46 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Jesper Dangaard Brouer, Eric Dumazet, Alexander Duyck,
	Alexei Starovoitov, Network Development
In-Reply-To: <alpine.DEB.2.11.1410272129530.21936@gentwo.org>

On Mon, Oct 27, 2014 at 7:30 PM, Christoph Lameter <cl@linux.com> wrote:
> On Mon, 27 Oct 2014, Jesper Dangaard Brouer wrote:
>
>> > Same could be done with some kmem_cache_alloc() : SLAB uses hard irq
>> > masking while some caches are never used from hard irq context.
>>
>> Sounds interesting.
>
> SLUB does not disable interrupts in the fast paths.
>

Unfortunately, SLUB is more expensive than SLAB for many networking workloads.

The cost of disabling interrupts is pure noise compared to cache line misses.

SLUB has poor behavior compared to SLAB with alien caches,
even with the side effect that 'struct page' is 64 bytes aligned
instead of being 56 bytes with SLAB

Note that I am not doing SLUB/SLAB tests every day, so it might be
better nowadays.

^ permalink raw reply

* [PATCH net-next] tcp: allow for bigger reordering level
From: Eric Dumazet @ 2014-10-28  4:45 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Yaogong Wang

From: Eric Dumazet <edumazet@google.com>

While testing upcoming Yaogong patch (converting out of order queue
into an RB tree), I hit the max reordering level of linux TCP stack.

Reordering level was limited to 127 for no good reason, and some
network setups [1] can easily reach this limit and get limited
throughput.

Allow a new max limit of 300, and add a sysctl to allow admins to even
allow bigger (or lower) values if needed.

[1] Aggregation of links, per packet load balancing, fabrics not doing
 deep packet inspections, alternative TCP congestion modules...

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yaogong Wang <wygivan@google.com>
---
 Documentation/networking/bonding.txt   |    7 ++-----
 Documentation/networking/ip-sysctl.txt |   10 +++++++++-
 include/linux/tcp.h                    |    4 ++--
 include/net/tcp.h                      |    4 +---
 net/ipv4/sysctl_net_ipv4.c             |    7 +++++++
 net/ipv4/tcp_input.c                   |    3 ++-
 6 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index eeb5b2e97bedac5ce910a06cc03bf42035d544d4..7ddd70df4d9aaa76b5806bf4a74fd1583ba7e198 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -2230,11 +2230,8 @@ balance-rr: This mode is the only mode that will permit a single
 
 	It is possible to adjust TCP/IP's congestion limits by
 	altering the net.ipv4.tcp_reordering sysctl parameter.  The
-	usual default value is 3, and the maximum useful value is 127.
-	For a four interface balance-rr bond, expect that a single
-	TCP/IP stream will utilize no more than approximately 2.3
-	interface's worth of throughput, even after adjusting
-	tcp_reordering.
+	usual default value is 3. But keep in mind TCP stack is able
+	to automatically increase this when it detects reorders.
 
 	Note that the fraction of packets that will be delivered out of
 	order is highly variable, and is unlikely to be zero.  The level
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 0307e2875f2159cb669b741f9d6a949618c3a055..9028b879a97baebc29832c42694896361ecfba03 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -376,9 +376,17 @@ tcp_orphan_retries - INTEGER
 	may consume significant resources. Cf. tcp_max_orphans.
 
 tcp_reordering - INTEGER
-	Maximal reordering of packets in a TCP stream.
+	Initial reordering level of packets in a TCP stream.
+	TCP stack can then dynamically adjust flow reordering level
+	between this initial value and tcp_max_reordering
 	Default: 3
 
+tcp_max_reordering - INTEGER
+	Maximal reordering level of packets in a TCP stream.
+	300 is a fairly conservative value, but you might increase it
+	if paths are using per packet load balancing (like bonding rr mode)
+	Default: 300
+
 tcp_retrans_collapse - BOOLEAN
 	Bug-to-bug compatibility with some broken printers.
 	On retransmit try to send bigger packets to work around bugs in
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index c2dee7deefa8cb32af530d20e5aa32a61b10ce68..f566b8567892ef0bb213de0540b37cfc6ac03ca0 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -204,10 +204,10 @@ struct tcp_sock {
 
 	u16	urg_data;	/* Saved octet of OOB data and control flags */
 	u8	ecn_flags;	/* ECN status bits.			*/
-	u8	reordering;	/* Packet reordering metric.		*/
+	u8	keepalive_probes; /* num of allowed keep alive probes	*/
+	u32	reordering;	/* Packet reordering metric.		*/
 	u32	snd_up;		/* Urgent pointer		*/
 
-	u8	keepalive_probes; /* num of allowed keep alive probes	*/
 /*
  *      Options received (usually on last packet, some only on SYN packets).
  */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index c73fc145ee4533c3f65adf5370e9c0348dfb4395..3a35b1500359446d98ee9f1cd0b55d34ac66d477 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -70,9 +70,6 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 /* After receiving this amount of duplicate ACKs fast retransmit starts. */
 #define TCP_FASTRETRANS_THRESH 3
 
-/* Maximal reordering. */
-#define TCP_MAX_REORDERING	127
-
 /* Maximal number of ACKs sent quickly to accelerate slow-start. */
 #define TCP_MAX_QUICKACKS	16U
 
@@ -252,6 +249,7 @@ extern int sysctl_tcp_abort_on_overflow;
 extern int sysctl_tcp_max_orphans;
 extern int sysctl_tcp_fack;
 extern int sysctl_tcp_reordering;
+extern int sysctl_tcp_max_reordering;
 extern int sysctl_tcp_dsack;
 extern long sysctl_tcp_mem[3];
 extern int sysctl_tcp_wmem[3];
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index b3c53c8b331efc3d5cf6437fd3ec7634a154263c..e0ee384a448fb0e6eb5b957d98dbcb272ea97edb 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -496,6 +496,13 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec
 	},
 	{
+		.procname	= "tcp_max_reordering",
+		.data		= &sysctl_tcp_max_reordering,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+	{
 		.procname	= "tcp_dsack",
 		.data		= &sysctl_tcp_dsack,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a12b455928e52211efdc6b471ef54de6218f5df0..9a18cdd633f37e6a805f0f096edece0b0852bc20 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -81,6 +81,7 @@ int sysctl_tcp_window_scaling __read_mostly = 1;
 int sysctl_tcp_sack __read_mostly = 1;
 int sysctl_tcp_fack __read_mostly = 1;
 int sysctl_tcp_reordering __read_mostly = TCP_FASTRETRANS_THRESH;
+int sysctl_tcp_max_reordering __read_mostly = 300;
 EXPORT_SYMBOL(sysctl_tcp_reordering);
 int sysctl_tcp_dsack __read_mostly = 1;
 int sysctl_tcp_app_win __read_mostly = 31;
@@ -833,7 +834,7 @@ static void tcp_update_reordering(struct sock *sk, const int metric,
 	if (metric > tp->reordering) {
 		int mib_idx;
 
-		tp->reordering = min(TCP_MAX_REORDERING, metric);
+		tp->reordering = min(sysctl_tcp_max_reordering, metric);
 
 		/* This exciting event is worth to be remembered. 8) */
 		if (ts)
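
The patch above both raises the cap and widens tp->reordering from u8
to u32. A minimal userspace sketch of why both changes go together
follows. update_reordering() and store_in_u8() are hypothetical
helpers, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Raise the reordering level to the observed metric, capped at
 * max_reordering, in the spirit of tcp_update_reordering(). */
uint32_t update_reordering(uint32_t current, int metric, int max_reordering)
{
	assert(max_reordering >= 0);

	if (metric > 0 && (uint32_t)metric > current) {
		int capped = metric < max_reordering ? metric : max_reordering;

		return (uint32_t)capped;
	}
	return current;
}

/* What an 8-bit field would keep of the same level (value modulo 256). */
uint8_t store_in_u8(uint32_t level)
{
	return (uint8_t)level;
}
```

With a cap of 300, an observed metric of 500 yields a level of 300. An
8-bit field cannot hold that value: store_in_u8(300) keeps only the low
eight bits, so the wider field is needed once the cap exceeds 255.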

^ permalink raw reply related

* Re: [PATCH] ovs: Turn vports with dependencies into separate modules
From: David Miller @ 2014-10-28  4:48 UTC (permalink / raw)
  To: pshelar; +Cc: tgraf, dev, netdev
In-Reply-To: <CALnjE+p5b5EzLkY6_6J7jvwyf9rUdu-JGbdf5r3bDuhgndoeeg@mail.gmail.com>

From: Pravin Shelar <pshelar@nicira.com>
Date: Mon, 27 Oct 2014 17:27:11 -0700

> On Mon, Oct 27, 2014 at 2:47 PM, Thomas Graf <tgraf@suug.ch> wrote:
>> The patch also brings additional flexibility to users of
>> distributions. Distros typically ship something like an allmodconfig
>> so a user can either run openvswitch.ko with all encaps compiled in
>> or not run openvswitch.ko. With vports as module, a user can blacklist
>> a certain encap type.
>>
>> Another advantage is obviously that users can run additional vport
>> types on top of their distribution kernels.
>>
>> Is there anything specific that you are concerned with in regard
>> to this proposed change?
> 
> OVS vport code is not a lot and making it a pluggable module does not save
> much space.

People don't blacklist modules to "save space".

^ permalink raw reply

* Re: [PATCH net-next 2/2] udp: Reset flow table for flows over unconnected sockets
From: David Miller @ 2014-10-28  4:51 UTC (permalink / raw)
  To: therbert; +Cc: eric.dumazet, netdev
In-Reply-To: <CA+mtBx_eQKOkM-0PEXG2WEMosXDtqHgwT3j7NnQpP62KdZeJKQ@mail.gmail.com>

From: Tom Herbert <therbert@google.com>
Date: Mon, 27 Oct 2014 18:09:25 -0700

> This indicates nothing about the merits of this patch. Nevertheless,
> in order to avoid further rat-holing and since this patch does change
> a long-standing behavior I will respin to make it enabled only by
> sysctl.

Kind of disappointed on my end that you haven't addressed Eric's
main point, which is that:

1) A hash table shared between protocols will perform poorly for
   mixed workloads, which are becoming increasingly common.

2) UDP is fundamentally different from TCP in that the issue of
   'flow' vs. 'non-flow' packets

I personally do not see you avoiding this conversation by simply
hiding the new behavior behind a sysctl; I still want you to address
it before I apply anything.

^ permalink raw reply

* Re: irq disable in __netdev_alloc_frag() ?
From: David Miller @ 2014-10-28  4:56 UTC (permalink / raw)
  To: edumazet; +Cc: cl, brouer, eric.dumazet, alexander.duyck, ast, netdev
In-Reply-To: <CANn89i+U0=YrwoUSASejsS37EiXO7dKR25Vx04at3PqGA1EpHA@mail.gmail.com>

From: Eric Dumazet <edumazet@google.com>
Date: Mon, 27 Oct 2014 19:46:20 -0700

> Unfortunately, SLUB is more expensive than SLAB for many networking
> workloads.
> 
> The cost of disabling interrupts is pure noise compared to cache
> line misses.
> 
> SLUB has poor behavior compared to SLAB with alien caches, even with
> the side effect that 'struct page' is 64 bytes aligned instead of
> being 56 bytes with SLAB

And SLAB completely shits itself when lots of memory gets cached up on
a foreign node.

This discussion has happened many times. SLAB may be faster when things
work out nicely, but it acts poorly wrt. keeping foreign memory from
being cached too aggressively.

And there is a cost for that, which is that foreign memory has to be
properly balanced back to its home node.

^ permalink raw reply

* [PATCH net 0/1] cnic: Update the rcu_access_pointer() usages
From: Nilesh Javali @ 2014-10-28  5:18 UTC (permalink / raw)
  To: davem
  Cc: netdev, Dept-GELinuxNICDev, sudarsana.kalluru, vikas.chaudhary,
	giridhar.malavali, tej.parkash

This patch updates the rcu_access_pointer usages:

Tej Parkash (1):
      cnic: Update the rcu_access_pointer() usages

 drivers/net/ethernet/broadcom/cnic.c |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

Please apply this patch to net.

Thanks,
Nilesh

^ permalink raw reply

* [PATCH net 1/1] cnic: Update the rcu_access_pointer() usages
From: Nilesh Javali @ 2014-10-28  5:18 UTC (permalink / raw)
  To: davem
  Cc: netdev, Dept-GELinuxNICDev, sudarsana.kalluru, vikas.chaudhary,
	giridhar.malavali, tej.parkash
In-Reply-To: <1414473495-24790-1-git-send-email-nilesh.javali@qlogic.com>

From: Tej Parkash <tej.parkash@qlogic.com>

1. Remove the rcu_read_lock/unlock around rcu_access_pointer
2. Replace the rcu_dereference with rcu_access_pointer

Signed-off-by: Tej Parkash <tej.parkash@qlogic.com>
---
 drivers/net/ethernet/broadcom/cnic.c |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/cnic.c b/drivers/net/ethernet/broadcom/cnic.c
index 23f23c9..f05fab6 100644
--- a/drivers/net/ethernet/broadcom/cnic.c
+++ b/drivers/net/ethernet/broadcom/cnic.c
@@ -382,10 +382,8 @@ static int cnic_iscsi_nl_msg_recv(struct cnic_dev *dev, u32 msg_type,
 		if (l5_cid >= MAX_CM_SK_TBL_SZ)
 			break;
 
-		rcu_read_lock();
 		if (!rcu_access_pointer(cp->ulp_ops[CNIC_ULP_L4])) {
 			rc = -ENODEV;
-			rcu_read_unlock();
 			break;
 		}
 		csk = &cp->csk_tbl[l5_cid];
@@ -414,7 +412,6 @@ static int cnic_iscsi_nl_msg_recv(struct cnic_dev *dev, u32 msg_type,
 			}
 		}
 		csk_put(csk);
-		rcu_read_unlock();
 		rc = 0;
 	}
 	}
@@ -615,7 +612,7 @@ static int cnic_unregister_device(struct cnic_dev *dev, int ulp_type)
 		cnic_send_nlmsg(cp, ISCSI_KEVENT_IF_DOWN, NULL);
 
 	mutex_lock(&cnic_lock);
-	if (rcu_dereference(cp->ulp_ops[ulp_type])) {
+	if (rcu_access_pointer(cp->ulp_ops[ulp_type])) {
 		RCU_INIT_POINTER(cp->ulp_ops[ulp_type], NULL);
 		cnic_put(dev);
 	} else {
-- 
1.5.6
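
The distinction the patch relies on is that a pointer which is only compared
against NULL, and never dereferenced, does not need a read-side critical
section. The userspace model below illustrates that pattern with C11 atomics.
It is a sketch, not kernel code: ulp_slot, slot_is_registered() and
unregister_model() are hypothetical names, and the relaxed atomic operations
only approximate rcu_access_pointer() and RCU_INIT_POINTER(..., NULL).

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an ops structure published through RCU. */
struct ulp_ops_model {
	int id;
};

/* Models one cp->ulp_ops[] slot; static storage starts out as NULL. */
static _Atomic(struct ulp_ops_model *) ulp_slot;

/*
 * Models rcu_access_pointer(): the value is fetched only to be compared
 * against NULL and is never dereferenced, so no read-side lock is taken.
 */
static bool slot_is_registered(void)
{
	return atomic_load_explicit(&ulp_slot, memory_order_relaxed) != NULL;
}

/*
 * Models the unregister path, assuming the caller already holds the
 * update-side lock (cnic_lock in the driver): test the slot, then clear it.
 * Returns 0 on success and -1 if nothing was registered.
 */
static int unregister_model(void)
{
	if (!slot_is_registered())
		return -1;
	atomic_store_explicit(&ulp_slot, NULL, memory_order_relaxed);
	return 0;
}
```

A reader that goes on to dereference the pointer would still need
rcu_dereference() inside rcu_read_lock()/rcu_read_unlock(); this model
deliberately leaves that case out.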

^ permalink raw reply related
