* Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
From: Eric Dumazet @ 2014-02-08 13:53 UTC (permalink / raw)
To: Thomas Glanzmann
Cc: John Ogness, Eric Dumazet, David S. Miller, Nicholas A. Bellinger,
target-devel, Linux Network Development, LKML
In-Reply-To: <20140208133744.GA20512@glanzmann.de>
On Sat, 2014-02-08 at 14:37 +0100, Thomas Glanzmann wrote:
> Hello Eric,
>
> It fixes my case, but if you look at the round trip time it is not even
> close to what it used to be. So while this fixes my problem I'm still for
> disabling it by default.
>
> https://thomas.glanzmann.de/tmp/tcp_auto_corking_on_patched.pcap.bz2
> https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-14:36:25.png
Very nice.
Now we have to check your NIC and how TX completion is performed.
What is your NIC model and driver ?
^ permalink raw reply
* Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
From: Thomas Glanzmann @ 2014-02-08 13:58 UTC (permalink / raw)
To: Eric Dumazet
Cc: John Ogness, Eric Dumazet, David S. Miller, Nicholas A. Bellinger,
target-devel, Linux Network Development, LKML
In-Reply-To: <1391867614.10160.89.camel@edumazet-glaptop2.roam.corp.google.com>
Hello Eric,
> What is your NIC model and driver?
I have four Intel Corporation I350 Gigabit Network Connection (rev 01).
(node-62) [~/work/linux-2.6] lspci -v | pbot
http://pbot.rmdir.de/rgu6yHMBDVQpflMmbcJACg
(node-62) [~/work/linux-2.6] ip a s | pbot
http://pbot.rmdir.de/xJjRT8u-ekC6mrWgl09ZtQ
(node-62) [~/work/linux-2.6] dmesg | pbot
http://pbot.rmdir.de/MigrSPtxGmp0fI1CRgXsHw
I do 802.3ad link aggregation layer 2 hash with two network cards to one
switch.
I'm running:
Linux node-62 3.14.0-rc1+ #23 SMP Sat Feb 8 14:27:47 CET 2014 x86_64 GNU/Linux
Driver: igb
If you need remote access to the machine let me know.
Cheers,
Thomas
^ permalink raw reply
* Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
From: Eric Dumazet @ 2014-02-08 14:09 UTC (permalink / raw)
To: Thomas Glanzmann
Cc: John Ogness, Eric Dumazet, David S. Miller, Nicholas A. Bellinger,
target-devel, Linux Network Development, LKML
In-Reply-To: <20140208133744.GA20512@glanzmann.de>
On Sat, 2014-02-08 at 14:37 +0100, Thomas Glanzmann wrote:
>
> It fixes my case, but if you look at the round trip time it is not even
> close to what it used to be. So while this fixes my problem I'm still for
> disabling it by default.
>
> https://thomas.glanzmann.de/tmp/tcp_auto_corking_on_patched.pcap.bz2
This pcap was taken on which host ?
10.101.99.5 or 10.101.0.13 ?
^ permalink raw reply
* Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
From: Thomas Glanzmann @ 2014-02-08 14:12 UTC (permalink / raw)
To: Eric Dumazet
Cc: John Ogness, Eric Dumazet, David S. Miller, Nicholas A. Bellinger,
target-devel, Linux Network Development, LKML
In-Reply-To: <1391868564.10160.91.camel@edumazet-glaptop2.roam.corp.google.com>
Hello Eric,
[RESEND: dropped CC accidentally]
> 10.101.99.5 or 10.101.0.13?
10.101.99.5 (iSCSI Target)
tcpdump -i bond0.101 -s 0 -w /tmp/tcp_auto_corking_on_patched.pcap host esx-03.v101.campusvl.de
Cheers,
Thomas
^ permalink raw reply
* Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
From: Eric Dumazet @ 2014-02-08 14:13 UTC (permalink / raw)
To: Thomas Glanzmann
Cc: John Ogness, Eric Dumazet, David S. Miller, Nicholas A. Bellinger,
target-devel, Linux Network Development, LKML
In-Reply-To: <1391867404.10160.88.camel@edumazet-glaptop2.roam.corp.google.com>
On Sat, 2014-02-08 at 05:50 -0800, Eric Dumazet wrote:
> On Sat, 2014-02-08 at 05:33 -0800, Eric Dumazet wrote:
> > On Sat, 2014-02-08 at 05:14 -0800, Eric Dumazet wrote:
> > > Here is the combined patch, could you test it ?
> >
> > Also make sure you have commit a181ceb501b31b4bf8812a5c84c716cc31d82c2d
> > ("tcp: autocork should not hold first packet in write queue")
> > in your tree.
> >
> >
>
> BTW this problem demonstrates there is room for improvement in iSCSI,
> using MSG_MORE to avoid sending two small segments in separate frames.
>
> [1] 00:32:35.726568 IP 10.101.99.5.3260 > 10.101.0.13.27778: Flags [P.], seq 145:193, ack 144, win 235, options [nop,nop,TS val 4294960733 ecr 385385], length 48
> [2] 00:32:35.838074 IP 10.101.0.13.27778 > 10.101.99.5.3260: Flags [.], ack 193, win 514, options [nop,nop,TS val 385396 ecr 4294960733], length 0
> [3] 00:32:35.838099 IP 10.101.99.5.3260 > 10.101.0.13.27778: Flags [P.], seq 193:705, ack 144, win 235, options [nop,nop,TS val 4294960761 ecr 385396], length 512
>
> [1] & [3] could be coalesced, and [2] would be avoided.
>
With the fix, the new pcap is more explicit about this suboptimal behavior:
05:34:16.280900 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack 54353, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0
05:34:16.280949 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq 5328:5376, ack 54353, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48
05:34:16.280982 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54353:54401, ack 5376, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48
05:34:16.281000 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54401:54913, ack 5376, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512
05:34:16.281107 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack 54913, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0
05:34:16.281157 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq 5376:5424, ack 54913, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48
05:34:16.281190 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54913:54961, ack 5424, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48
05:34:16.281208 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54961:55473, ack 5424, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512
05:34:16.281337 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack 55473, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0
05:34:16.281390 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq 5424:5472, ack 55473, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48
05:34:16.281423 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 55473:55521, ack 5472, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48
05:34:16.281440 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 55521:56033, ack 5472, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512
^ permalink raw reply
* Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
From: Thomas Glanzmann @ 2014-02-08 14:19 UTC (permalink / raw)
To: Eric Dumazet
Cc: John Ogness, Eric Dumazet, David S. Miller, Nicholas A. Bellinger,
target-devel, Linux Network Development, LKML
In-Reply-To: <1391868816.10160.93.camel@edumazet-glaptop2.roam.corp.google.com>
Hello Eric,
> > BTW this problem demonstrates there is room for improvement in iSCSI,
> > using MSG_MORE to avoid sending two small segments in separate frames.
> With the fix, new pcap is more explicit about this suboptimal behavior :
> 05:34:16.280900 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack 54353, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0
> 05:34:16.280949 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq 5328:5376, ack 54353, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48
> 05:34:16.280982 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54353:54401, ack 5376, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48
> 05:34:16.281000 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54401:54913, ack 5376, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512
> 05:34:16.281107 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack 54913, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0
> 05:34:16.281157 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq 5376:5424, ack 54913, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48
> 05:34:16.281190 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54913:54961, ack 5424, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48
> 05:34:16.281208 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54961:55473, ack 5424, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512
> 05:34:16.281337 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack 55473, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0
> 05:34:16.281390 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq 5424:5472, ack 55473, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48
> 05:34:16.281423 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 55473:55521, ack 5472, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48
> 05:34:16.281440 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 55521:56033, ack 5472, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512
I get the idea. However, I'm a little confused: when I do a 'git grep
MSG_MORE' I don't see many references in the Linux kernel that use it at
all. So do you have an example for me of where this flag needs to be
applied?
Cheers,
Thomas
^ permalink raw reply
* Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
From: Eric Dumazet @ 2014-02-08 14:30 UTC (permalink / raw)
To: Thomas Glanzmann
Cc: John Ogness, Eric Dumazet, David S. Miller, Nicholas A. Bellinger,
target-devel, Linux Network Development, LKML
In-Reply-To: <20140208141905.GG20512@glanzmann.de>
On Sat, 2014-02-08 at 15:19 +0100, Thomas Glanzmann wrote:
> Hello Eric,
> I get the idea. However, I'm a little confused: when I do a 'git grep
> MSG_MORE' I don't see many references in the Linux kernel that use it at
> all. So do you have an example for me of where this flag needs to be
> applied?
The idea would be to set this flag on the sendmsg() call for the 48 bytes
of the header, and not set it on the sendmsg() call for the 512 bytes of
the payload.
iscsi_sw_tcp_xmit_segment() already adds MSG_MORE, but
it would be nice to add a new _initial_ flags parameter to
iscsi_sw_tcp_xmit_segment()
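For illustration only, here is a minimal user-space sketch of the header/payload
pattern described above. It is not the kernel iSCSI code: send_all() and
send_pdu() are hypothetical helper names, and the 48-byte/512-byte split is
taken from the pcap in this thread. The header is sent with MSG_MORE so the
stack may hold it back, and the payload is sent without the flag so the
combined data can be pushed.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: send all of buf, retrying on short writes.
 * Returns len on success, -1 on error (errno is set by send()). */
static ssize_t send_all(int fd, const void *buf, size_t len, int flags)
{
    const char *p = buf;
    size_t left = len;

    while (left > 0) {
        ssize_t n = send(fd, p, left, flags);
        if (n < 0)
            return -1;
        p += n;
        left -= (size_t)n;
    }
    return (ssize_t)len;
}

/* Hypothetical helper: header goes out with MSG_MORE ("more data follows"),
 * payload goes out with no flag, which ends the hint. */
static int send_pdu(int fd, const void *hdr, size_t hdr_len,
                    const void *data, size_t data_len)
{
    assert(hdr_len > 0 && data_len > 0);

    if (send_all(fd, hdr, hdr_len, MSG_MORE) < 0)
        return -1;
    if (send_all(fd, data, data_len, 0) < 0)
        return -1;
    return 0;
}
```

On a connected TCP socket, a call such as send_pdu(fd, hdr, 48, data, 512)
would give the stack the chance to emit one 560-byte segment instead of two.
Whether it actually does so depends on MSS, window and queue state.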
^ permalink raw reply
* Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
From: Thomas Glanzmann @ 2014-02-08 15:00 UTC (permalink / raw)
To: Eric Dumazet
Cc: John Ogness, Eric Dumazet, David S. Miller, Nicholas A. Bellinger,
target-devel, Linux Network Development, LKML
In-Reply-To: <1391869805.10160.97.camel@edumazet-glaptop2.roam.corp.google.com>
Hello Eric,
> Idea would be to set this flag when calling sendmsg() of the 48 bytes
> of the header, and not set it on the sendmsg() of the 512 bytes of the
> payload.
I see.
> iscsi_sw_tcp_xmit_segment() already adds MSG_MORE, but
> it would be nice to add a new _initial_ flags parameter to
> iscsi_sw_tcp_xmit_segment()
This is for the iSCSI initiator implementation. I'm interested in the iSCSI
target code; I already found it and experimented a little, but
I need to dig deeper if I want to prepare a patch.
Cheers,
Thomas
^ permalink raw reply
* Re: [PATCH] tcp: disable auto corking by default
From: Eric Dumazet @ 2014-02-08 15:04 UTC (permalink / raw)
To: Thomas Glanzmann
Cc: Eric Dumazet, David S. Miller, Nicholas A. Bellinger,
target-devel, Linux Network Development, LKML
In-Reply-To: <20140208091944.GB16336@glanzmann.de>
On Sat, 2014-02-08 at 10:19 +0100, Thomas Glanzmann wrote:
> When using auto corking with iSCSI, the round trip time increases by at least
> a factor of 25, probably more. Other protocols are very likely also affected.
>
> Signed-off-by: Thomas Glanzmann <thomas@glanzmann.de>
> ---
> net/ipv4/tcp.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
I think there is no hurry.
We should leave auto corking on during the 3.14 development cycle so that we
can fix the bugs, and think of some optimizations.
auto cork gives a strong incentive to applications to use
TCP_CORK/MSG_MORE to avoid overhead of sending multiple small segments.
In the normal case, the extra delay is something like 10 us, so if an
application is really hit by this delay, it's a strong sign it could be
improved, especially if auto corking is off.
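As a rough user-space illustration of the TCP_CORK side of this advice, the
sketch below toggles the socket option around a burst of small writes.
set_cork() and get_cork() are hypothetical helper names; only setsockopt(),
getsockopt() and the TCP_CORK option itself are real interfaces.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: turn TCP_CORK on (1) or off (0) for a TCP socket.
 * Returns 0 on success, -1 on error (errno is set by setsockopt()). */
static int set_cork(int fd, int on)
{
    assert(on == 0 || on == 1);
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/* Hypothetical helper: read the TCP_CORK state back.
 * Returns 0 or 1, or -1 on error. */
static int get_cork(int fd)
{
    int on = 0;
    socklen_t len = sizeof(on);

    if (getsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, &len) < 0)
        return -1;
    return on;
}
```

An application would typically call set_cork(fd, 1), write the header and the
payload, and then call set_cork(fd, 0), which releases any partial segment
still held back.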
Let's wait until the end of the 3.14 dev cycle before considering this patch.
Don't shoot the messenger :)
Thanks !
^ permalink raw reply
* Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
From: Eric Dumazet @ 2014-02-08 15:06 UTC (permalink / raw)
To: Thomas Glanzmann
Cc: John Ogness, Eric Dumazet, David S. Miller, Nicholas A. Bellinger,
target-devel, Linux Network Development, LKML
In-Reply-To: <20140208150001.GI20512@glanzmann.de>
On Sat, 2014-02-08 at 16:00 +0100, Thomas Glanzmann wrote:
> Hello Eric,
>
> > Idea would be to set this flag when calling sendmsg() of the 48 bytes
> > of the header, and not set it on the sendmsg() of the 512 bytes of the
> > payload.
>
> I see.
>
> > iscsi_sw_tcp_xmit_segment() already adds MSG_MORE, but
> > it would be nice to add a new _initial_ flags parameter to
> > iscsi_sw_tcp_xmit_segment()
>
> This is for the iSCSI initiator implementation. I'm interested in the iSCSI
> target code; I already found it and experimented a little, but
> I need to dig deeper if I want to prepare a patch.
Fantastic !
Let me know if you want some help.
Note: We did some patches in the MSG_MORE logic for sendpage(), but in
your case I do not think it's related
(git grep -n MSG_SENDPAGE_NOTLAST if you are curious).
^ permalink raw reply
* Re: IPv6 FIB related crash with MACVLANs in 3.9.11+ kernel.
From: Ben Greear @ 2014-02-08 16:43 UTC (permalink / raw)
To: netdev
In-Reply-To: <52F012FF.9030105@candelatech.com>
On 02/03/2014 02:06 PM, Ben Greear wrote:
> On 02/03/2014 02:03 PM, Hannes Frederic Sowa wrote:
>> Hi Ben,
>>
>> On Mon, Feb 03, 2014 at 12:37:52PM -0800, Ben Greear wrote:
>>> The kernel has some additional patches, but not much to IPv6.
>>>
>>> The bug is that when we have lots of mac-vlans on some ixgbe ports
>>> (500 per interface in this case), and boot up the system with the ports unplugged,
>>> we get this crash almost every time. Boot-up is going to do normal bootup
>>> stuff plus create and configure the 1000 mac-vlans, dump their routing
>>> tables, etc.
>>>
>>> We are using one routing table per network device, and some
>>> ip rules.
>>>
>>> If we plug in the ixgbe ports, we do not ever see a crash.
>>>
>>> We have not yet tried reproducing it on other drivers, but I suspect
>>> the issue is not related to ixgbe.
>>>
>>> Any ideas on this one?
>>
>> Could you bring the machine to a panic again with enabling RT6_DEBUG at the
>> top of ip6_fib.c and send a dump of the trace?
>
> Yes, but it will be a bit until we can create a duplicate machine.
> We ended up delivering the machine with a note to make sure the
> interfaces were plugged in (we found the bug hours before shipping
> the system, of course).
According to my system test guy, it took a lot longer to reproduce
the problem with the debug enabled kernel, but I do not see any extra
debug messages on the serial console logging or in /var/log/messages
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply
* Re: [PATCH] tcp: disable auto corking by default
From: Thomas Glanzmann @ 2014-02-08 16:55 UTC (permalink / raw)
To: Eric Dumazet
Cc: Eric Dumazet, David S. Miller, Nicholas A. Bellinger,
target-devel, Linux Network Development, LKML
In-Reply-To: <1391871850.10160.103.camel@edumazet-glaptop2.roam.corp.google.com>
Hello Eric,
> > Disable auto corking by default
> We should leave auto corking on during the 3.14 development cycle so that we
> can fix the bugs, and think of some optimizations.
I agree that leaving it enabled helps to find bugs, however I'm not
happy with the round trip time degradation.
> auto cork gives a strong incentive to applications to use
> TCP_CORK/MSG_MORE to avoid overhead of sending multiple small
> segments.
I agree. But if it breaks the application, many people won't be happy;
for example, I have already spent 5 hours tracking it down.
> In the normal case, the extra delay is something like 10 us, so if an
> application is really hit by this delay, its a strong sign it could be
> improved, especially if auto corking is off.
Yes, but 230 microseconds for others. :-(
> Lets wait the end of 3.14 dev cycle before considering this patch.
I agree.
Btw. I mixed up the pcaps for autocork on and off, so I moved the files
so that they now show what they should show.
Cheers,
Thomas
^ permalink raw reply
* Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
From: Thomas Glanzmann @ 2014-02-08 16:57 UTC (permalink / raw)
To: Eric Dumazet
Cc: John Ogness, Eric Dumazet, David S. Miller, Nicholas A. Bellinger,
target-devel, Linux Network Development, LKML
In-Reply-To: <1391871986.10160.105.camel@edumazet-glaptop2.roam.corp.google.com>
Hello Eric,
> Note : We did some patches in the MSG_MORE logic for sendpage(), but
> in your case I do not think its related
> (git grep -n MSG_SENDPAGE_NOTLAST ) if you are curious
thank you for the pointer. The iSCSI target code actually uses sendpage
whenever it can.
Cheers,
Thomas
^ permalink raw reply
* Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
From: Eric Dumazet @ 2014-02-08 17:08 UTC (permalink / raw)
To: Thomas Glanzmann
Cc: John Ogness, Eric Dumazet, David S. Miller, Nicholas A. Bellinger,
target-devel, Linux Network Development, LKML
In-Reply-To: <20140208165732.GB22359@glanzmann.de>
On Sat, 2014-02-08 at 17:57 +0100, Thomas Glanzmann wrote:
> Hello Eric,
>
> > Note : We did some patches in the MSG_MORE logic for sendpage(), but
> > in your case I do not think its related
> > (git grep -n MSG_SENDPAGE_NOTLAST ) if you are curious
>
> thank you for the pointer. The iSCSI target code actually uses sendpage
> whenever it can.
Yep, but the problem (at least in your pcap) is about sending the 48-byte
header in a TCP segment of its own, then the 512-byte payload in a
separate segment.
I suspect the sendpage() is only used for the payload. No need for
MSG_MORE here.
MSG_MORE would need to be set on the first part (the 48-byte header),
so that the TCP stack defers the push of the segment until the 512-byte
payload is added.
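A different way to reach the same goal from user space, and not the change
proposed in this thread, is to hand header and payload to the stack in a
single sendmsg() call using two iovec entries. The sketch below assumes a
connected stream socket; send_gathered() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: submit header and payload in one sendmsg() call.
 * Returns the number of bytes queued (may be short), or -1 on error.
 * A real caller would have to handle short writes. */
static ssize_t send_gathered(int fd, void *hdr, size_t hdr_len,
                             void *data, size_t data_len)
{
    struct iovec iov[2] = {
        { .iov_base = hdr,  .iov_len = hdr_len },
        { .iov_base = data, .iov_len = data_len },
    };
    struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 2 };

    assert(hdr_len > 0 && data_len > 0);
    return sendmsg(fd, &msg, 0);
}
```

Because both pieces arrive in one system call, the stack never sees the
48-byte header alone, so no MSG_MORE hint is needed for this particular pair.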
^ permalink raw reply
* [PATCH] 3c59x: Remove unused pointer in vortex_eisa_cleanup()
From: Christian Engelmayer @ 2014-02-08 17:11 UTC (permalink / raw)
To: Steffen Klassert; +Cc: netdev
[-- Attachment #1: Type: text/plain, Size: 919 bytes --]
Remove unused network device private data pointer 'vp' in function
vortex_eisa_cleanup(). Detected by Coverity: CID 139826.
Signed-off-by: Christian Engelmayer <cengelma@gmx.at>
---
drivers/net/ethernet/3com/3c59x.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/drivers/net/ethernet/3com/3c59x.c b/drivers/net/ethernet/3com/3c59x.c
index 0f4241c..238ccea 100644
--- a/drivers/net/ethernet/3com/3c59x.c
+++ b/drivers/net/ethernet/3com/3c59x.c
@@ -3294,7 +3294,6 @@ static int __init vortex_init(void)
static void __exit vortex_eisa_cleanup(void)
{
- struct vortex_private *vp;
void __iomem *ioaddr;
#ifdef CONFIG_EISA
@@ -3303,7 +3302,6 @@ static void __exit vortex_eisa_cleanup(void)
#endif
if (compaq_net_device) {
- vp = netdev_priv(compaq_net_device);
ioaddr = ioport_map(compaq_net_device->base_addr,
VORTEX_TOTAL_SIZE);
--
1.8.3.2
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 836 bytes --]
^ permalink raw reply related
* Re: [PATCH] tcp: disable auto corking by default
From: Eric Dumazet @ 2014-02-08 17:12 UTC (permalink / raw)
To: Thomas Glanzmann
Cc: Eric Dumazet, David S. Miller, Nicholas A. Bellinger,
target-devel, Linux Network Development, LKML
In-Reply-To: <20140208165539.GA22359@glanzmann.de>
On Sat, 2014-02-08 at 17:55 +0100, Thomas Glanzmann wrote:
> Hello Eric,
>
> > > Disable auto corking by default
>
> > We should leave auto corking on during the 3.14 development cycle so that we
> > can fix the bugs, and think of some optimizations.
>
> I agree that leaving it enabled helps to find bugs, however I'm not
> happy with the round trip time degradation.
>
> > auto cork gives a strong incentive to applications to use
> > TCP_CORK/MSG_MORE to avoid overhead of sending multiple small
> > segments.
>
> I agree. But if it breaks the application, many people won't be happy;
> for example, I have already spent 5 hours tracking it down.
Sure, but if we put this flag to zero, nobody will ever use it and find
any bug.
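For readers who want to check the flag on a running system, the sketch below
reads the net.ipv4.tcp_autocorking sysctl through procfs. The path is the
standard sysctl location on kernels that ship auto corking; read_autocorking()
is a hypothetical helper name, and the code only reads the value.

```c
#include <assert.h>
#include <stdio.h>

/* Sysctl file for TCP auto corking (absent on kernels without the feature). */
#define AUTOCORK_SYSCTL "/proc/sys/net/ipv4/tcp_autocorking"

/* Hypothetical helper: returns the integer stored in the file
 * (0 = off, 1 = on), or -1 if it cannot be opened or parsed. */
static int read_autocorking(const char *path)
{
    FILE *f;
    int val;

    assert(path != NULL);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%d", &val) != 1)
        val = -1;
    fclose(f);
    return val;
}
```

Calling read_autocorking(AUTOCORK_SYSCTL) on a kernel without the feature
simply yields -1, since the file does not exist there.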
Thanks for running the latest git tree and being part of Linux improvement.
If we can add MSG_MORE at the right place, your workload might gain
~20% in exec time, and maybe 30% better efficiency, since you'll halve
the total number of network segments.
Just to be clear : No stable kernel has yet any issue, right ?
^ permalink raw reply
* Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
From: Thomas Glanzmann @ 2014-02-08 17:15 UTC (permalink / raw)
To: Eric Dumazet
Cc: John Ogness, Eric Dumazet, David S. Miller, Nicholas A. Bellinger,
target-devel, Linux Network Development, LKML
In-Reply-To: <1391879318.10160.108.camel@edumazet-glaptop2.roam.corp.google.com>
Hello Eric,
> Yep, but the problem (at least on your pcap), is about sending the 48
> bytes headers in TCP segment of its own, then the 512 byte payload in
> a separate segment.
I agree.
> I suspect the sendpage() is only used for the payload. No need for
> MSG_MORE here.
I see.
> The MSG_MORE would need to be set on the first part (48 bytes header),
> so that TCP stack will defer the push of the segment at the time the 512
> bytes payload is added.
The iSCSI target uses one function to send all outbound data. So in
order to do it right, every function that sends data in multiple
chunks needs to mark it correctly. Of course, someone could also do some
wild guessing and say that everything below 512 bytes gets
pushed out. I wonder what Nab has to say about this?
Cheers,
Thomas
^ permalink raw reply
* Re: [PATCH] tcp: disable auto corking by default
From: Thomas Glanzmann @ 2014-02-08 17:20 UTC (permalink / raw)
To: Eric Dumazet
Cc: Eric Dumazet, David S. Miller, Nicholas A. Bellinger,
target-devel, Linux Network Development, LKML
In-Reply-To: <1391879558.10160.112.camel@edumazet-glaptop2.roam.corp.google.com>
Hello Eric,
> Sure, but if we put this flag to zero, nobody will ever use it and
> find any bug.
I agree.
> If we can add the MSG_MORE at the right place, your workload might gain
> ~20% exec time, and maybe 30% better efficiency, since you'll divide by
> 2 the total number of network segments.
That is in fact promising.
> Just to be clear: No stable kernel has yet any issue, right?
Not with TCP auto corking, as it was only recently introduced in the
development branch, but it will become stable at some point.
Cheers,
Thomas
^ permalink raw reply
* Re: IPv6 FIB related crash with MACVLANs in 3.9.11+ kernel.
From: Hannes Frederic Sowa @ 2014-02-08 17:23 UTC (permalink / raw)
To: Ben Greear; +Cc: netdev
In-Reply-To: <52F65EB4.1050306@candelatech.com>
On Sat, Feb 08, 2014 at 08:43:32AM -0800, Ben Greear wrote:
> On 02/03/2014 02:06 PM, Ben Greear wrote:
> > On 02/03/2014 02:03 PM, Hannes Frederic Sowa wrote:
> >> Hi Ben,
> >>
> >> On Mon, Feb 03, 2014 at 12:37:52PM -0800, Ben Greear wrote:
> >>> The kernel has some additional patches, but not much to IPv6.
> >>>
> >>> The bug is that when we have lots of mac-vlans on some ixgbe ports
> >>> (500 per interface in this case), and boot up the system with the ports unplugged,
> >>> we get this crash almost every time. Boot-up is going to do normal bootup
> >>> stuff plus create and configure the 1000 mac-vlans, dump their routing
> >>> tables, etc.
> >>>
> >>> We are using one routing table per network device, and some
> >>> ip rules.
> >>>
> >>> If we plug in the ixgbe ports, we do not ever see a crash.
> >>>
> >>> We have not yet tried reproducing it on other drivers, but I suspect
> >>> the issue is not related to ixgbe.
> >>>
> >>> Any ideas on this one?
> >>
> >> Could you bring the machine to a panic again with enabling RT6_DEBUG at the
> >> top of ip6_fib.c and send a dump of the trace?
> >
> > Yes, but it will be a bit until we can create a duplicate machine.
> > We ended up delivering the machine with a note to make sure the
> > interfaces were plugged in (we found the bug hours before shipping
> > the system, of course).
>
> According to my system test guy, it took a lot longer to reproduce
> the problem with the debug enabled kernel, but I do not see any extra
> debug messages on the serial console logging or in /var/log/messages
Sounds like a race, then, like I thought.
I forgot, those are pr_debugs, I usually enable them with
$ echo file net/ipv6/ip6_fib.c +p > /sys/kernel/debug/dynamic_debug/control
RT6_TRACE is pretty noisy, so you should see output immediately if you do IPv6
traffic. The other way is to specify dyndbg="file net/ipv6/ip6_fib.c +p" on the
kernel command line.
Try playing with that beforehand until you can confirm the output shows up
on the console.
Thanks again,
Hannes
^ permalink raw reply
* [PATCH] bridge: Unbreak netconsole
From: Bart Van Assche @ 2014-02-08 18:41 UTC (permalink / raw)
To: David S. Miller
Cc: Stephen Hemminger, Jiri Pirko, Neil Horman,
netdev@vger.kernel.org
Sending netconsole messages over a bridge network interface doesn't
work anymore since kernel v3.12. Bisecting this led to the patch
"bridge: cleanup netpoll code". Hence revert that patch (commit
93d8bf9fb8f39d6d3e461db60f883d9f81006159).
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: <stable@vger.kernel.org> # 3.12
Reference: https://bugzilla.kernel.org/show_bug.cgi?id=70071
---
net/bridge/br_device.c | 12 ++++++------
net/bridge/br_if.c | 3 +--
net/bridge/br_private.h | 10 ++++++++++
3 files changed, 17 insertions(+), 8 deletions(-)
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index e4401a5..ab69594 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -252,22 +252,22 @@ fail:
int br_netpoll_enable(struct net_bridge_port *p, gfp_t gfp)
{
struct netpoll *np;
- int err;
-
- if (!p->br->dev->npinfo)
- return 0;
+ int err = 0;
np = kzalloc(sizeof(*p->np), gfp);
+ err = -ENOMEM;
if (!np)
- return -ENOMEM;
+ goto out;
err = __netpoll_setup(np, p->dev, gfp);
if (err) {
kfree(np);
- return err;
+ goto out;
}
p->np = np;
+
+out:
return err;
}
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index cffe1d6..639231a 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -366,8 +366,7 @@ int br_add_if(struct net_bridge *br, struct net_device *dev)
if (err)
goto err2;
- err = br_netpoll_enable(p, GFP_KERNEL);
- if (err)
+ if (br_netpoll_info(br) && ((err = br_netpoll_enable(p, GFP_KERNEL))))
goto err3;
err = netdev_master_upper_dev_link(dev, br->dev);
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index fcd1233..52d63bf 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -339,6 +339,11 @@ void br_dev_setup(struct net_device *dev);
void br_dev_delete(struct net_device *dev, struct list_head *list);
netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev);
#ifdef CONFIG_NET_POLL_CONTROLLER
+static inline struct netpoll_info *br_netpoll_info(struct net_bridge *br)
+{
+ return br->dev->npinfo;
+}
+
static inline void br_netpoll_send_skb(const struct net_bridge_port *p,
struct sk_buff *skb)
{
@@ -351,6 +356,11 @@ static inline void br_netpoll_send_skb(const struct net_bridge_port *p,
int br_netpoll_enable(struct net_bridge_port *p, gfp_t gfp);
void br_netpoll_disable(struct net_bridge_port *p);
#else
+static inline struct netpoll_info *br_netpoll_info(struct net_bridge *br)
+{
+ return NULL;
+}
+
static inline void br_netpoll_send_skb(const struct net_bridge_port *p,
struct sk_buff *skb)
{
--
1.8.4.5
^ permalink raw reply related
* [PATCH/RESEND] 3c59x: Remove unused pointer in vortex_eisa_cleanup()
From: Christian Engelmayer @ 2014-02-08 19:10 UTC (permalink / raw)
To: steffen; +Cc: netdev
In-Reply-To: <20140208181117.6776b154@spike>
[-- Attachment #1: Type: text/plain, Size: 1104 bytes --]
Remove unused network device private data pointer 'vp' in function
vortex_eisa_cleanup(). Detected by Coverity: CID 139826.
Signed-off-by: Christian Engelmayer <cengelma@gmx.at>
---
Resend using address steffen@klassert.de as retrieved from the mail delivery
fail notification by tu-chemnitz.de. Information in MAINTAINERS seems to be
outdated on that point.
---
drivers/net/ethernet/3com/3c59x.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/drivers/net/ethernet/3com/3c59x.c b/drivers/net/ethernet/3com/3c59x.c
index 0f4241c..238ccea 100644
--- a/drivers/net/ethernet/3com/3c59x.c
+++ b/drivers/net/ethernet/3com/3c59x.c
@@ -3294,7 +3294,6 @@ static int __init vortex_init(void)
static void __exit vortex_eisa_cleanup(void)
{
- struct vortex_private *vp;
void __iomem *ioaddr;
#ifdef CONFIG_EISA
@@ -3303,7 +3302,6 @@ static void __exit vortex_eisa_cleanup(void)
#endif
if (compaq_net_device) {
- vp = netdev_priv(compaq_net_device);
ioaddr = ioport_map(compaq_net_device->base_addr,
VORTEX_TOTAL_SIZE);
--
1.8.3.2
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 836 bytes --]
^ permalink raw reply related
* [PATCH 0/2] Attention by Linus Torvalds needed to export symbol he wrote
From: Richard Yao @ 2014-02-08 19:11 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Van Hensbergen, Ron Minnich, Latchesar Ionkov,
David S. Miller, V9FS Develooper Mailing List,
Linux Netdev Mailing List, Linux Kernel Mailing List,
Aneesh Kumar K.V, Will Deacon, Christopher Covington,
Brian Behlendorf, Matthew Thode
Dear Linus,
Loading kernel modules off 9p-virtio in a Linux guest causes VM termination
because of a page fault in unmapped memory, so I wrote a patch to fix it. Dave
Miller initially accepted it, but then rejected it because it calls an
unexported symbol from a kernel module, which breaks the build when
CONFIG_NET_9P_VIRTIO=m is set in the kernel config:
https://groups.google.com/forum/#!topic/linux.kernel/eRR7AyLE29Y
From what I can tell, I need the original author of a symbol to sign-off on any
patch exporting it. git blame says that the original author is you, so I am
sending this pull request to you for approval.
Richard Yao (2):
mm/vmalloc: export is_vmalloc_or_module_addr
9p/trans_virtio.c: Fix broken zero-copy on vmalloc() buffers
mm/vmalloc.c | 1 +
net/9p/trans_virtio.c | 5 ++++-
2 files changed, 5 insertions(+), 1 deletion(-)
--
1.8.3.2
^ permalink raw reply
* [PATCH 0/2] Attention by Linus Torvalds needed to export symbol he wrote
From: Richard Yao @ 2014-02-08 19:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Van Hensbergen, Ron Minnich, Latchesar Ionkov,
David S. Miller, V9FS Developer Mailing List,
Linux Netdev Mailing List, Linux Kernel Mailing List,
Aneesh Kumar K.V, Will Deacon, Christopher Covington,
Matthew Thode
Dear Linus,
Loading kernel modules off 9p-virtio in a Linux guest causes VM termination
because of a page fault in unmapped memory, so I wrote a patch to fix it. Dave
Miller initially accepted it, but then rejected it because it calls an
unexported symbol from a kernel module, which breaks the build when
CONFIG_NET_9P_VIRTIO=m is set in the kernel config:
https://groups.google.com/forum/#!topic/linux.kernel/eRR7AyLE29Y
From what I can tell, I need the original author of a symbol to sign off on any
patch exporting it. git blame says that the original author is you, so I am
sending this pull request to you for approval.
Richard Yao (2):
mm/vmalloc: export is_vmalloc_or_module_addr
9p/trans_virtio.c: Fix broken zero-copy on vmalloc() buffers
mm/vmalloc.c | 1 +
net/9p/trans_virtio.c | 5 ++++-
2 files changed, 5 insertions(+), 1 deletion(-)
--
1.8.3.2
^ permalink raw reply
* [PATCH 1/2] mm/vmalloc: export is_vmalloc_or_module_addr
From: Richard Yao @ 2014-02-08 19:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Van Hensbergen, Ron Minnich, Latchesar Ionkov,
David S. Miller, V9FS Developer Mailing List,
Linux Netdev Mailing List, Linux Kernel Mailing List,
Aneesh Kumar K.V, Will Deacon, Christopher Covington,
Matthew Thode
In-Reply-To: <1391886730-19667-1-git-send-email-ryao@gentoo.org>
9p-virtio needs is_vmalloc_or_module_addr exported before a patch can be
merged to prevent the virtio zero-copy routines from triggering a
hypervisor page fault when loading kernel modules:
https://groups.google.com/forum/#!topic/linux.kernel/eRR7AyLE29Y
Without this export, the kernel build breaks with that patch applied and
CONFIG_NET_9P_VIRTIO=m. With this export in place, all is well.
Signed-off-by: Richard Yao <ryao@gentoo.org>
---
mm/vmalloc.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 0fdf968..8a2e54f 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -218,6 +218,7 @@ int is_vmalloc_or_module_addr(const void *x)
#endif
return is_vmalloc_addr(x);
}
+EXPORT_SYMBOL(is_vmalloc_or_module_addr);
/*
* Walk a vmap address to the struct page it maps.
--
1.8.3.2
^ permalink raw reply related
* [PATCH 2/2] 9p/trans_virtio.c: Fix broken zero-copy on vmalloc() buffers
From: Richard Yao @ 2014-02-08 19:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Van Hensbergen, Ron Minnich, Latchesar Ionkov,
David S. Miller, V9FS Developer Mailing List,
Linux Netdev Mailing List, Linux Kernel Mailing List,
Aneesh Kumar K.V, Will Deacon, Christopher Covington,
Matthew Thode
In-Reply-To: <1391886730-19667-1-git-send-email-ryao@gentoo.org>
The 9p-virtio transport does zero copy on buffers larger than 1024 bytes.
It accomplishes this by returning the physical addresses of
pages to the virtio-pci device. At present, the translation is usually a
bit shift.
However, that approach produces an invalid page address when we
read from or write to vmalloc buffers, such as those used for Linux kernel
modules. This causes QEMU to die printing:
qemu-system-x86_64: virtio: trying to map MMIO memory
This patch enables 9p-virtio to correctly handle this case. This not
only enables us to load Linux kernel modules off virtfs, but also
enables ZFS file-based vdevs on virtfs to be used without killing QEMU.
Also, special thanks to both Avi Kivity and Alexander Graf for their
interpretation of QEMU backtraces. Without their guidance, tracking down
this bug would have taken much longer.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Acked-by: Alexander Graf <agraf@suse.de>
Reviewed-by: Will Deacon <will.deacon@arm.com>
---
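To illustrate the failure mode described in the commit message, here is a toy
userspace model. It is not kernel code and not part of the patch. It assumes a
"linear" region where an address maps to a page frame by an offset plus a
shift, and a "vmalloc" region where each virtual page is backed by an
arbitrary frame recorded in a table. Every name, constant, and table entry
below is hypothetical and chosen only for illustration.

```c
/* Toy userspace model of the two translation schemes; NOT kernel code.
 * All names, addresses and table contents are made up for illustration. */
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT    12
#define TOY_PAGE_SIZE     (1UL << TOY_PAGE_SHIFT)
#define TOY_LINEAR_BASE   0x100000UL /* start of the linearly mapped region */
#define TOY_LINEAR_PAGES  16UL       /* number of frames that actually exist */
#define TOY_VMALLOC_BASE  0x900000UL /* start of the table-mapped region */
#define TOY_VMALLOC_PAGES 4UL

/* "vmalloc" region: virtual page i is backed by an arbitrary frame. */
static const unsigned long toy_vmalloc_table[TOY_VMALLOC_PAGES] = { 9, 3, 14, 6 };

int toy_is_vmalloc_addr(uintptr_t addr)
{
	return addr >= TOY_VMALLOC_BASE &&
	       addr < TOY_VMALLOC_BASE + TOY_VMALLOC_PAGES * TOY_PAGE_SIZE;
}

/* A frame number is only meaningful if it refers to an existing frame. */
int toy_frame_is_valid(unsigned long frame)
{
	return frame < TOY_LINEAR_PAGES;
}

/* Linear region: the frame number is an offset followed by a shift. */
unsigned long toy_linear_to_frame(uintptr_t addr)
{
	return (addr - TOY_LINEAR_BASE) >> TOY_PAGE_SHIFT;
}

/* vmalloc region: the frame must be looked up, it cannot be computed. */
unsigned long toy_vmalloc_to_frame(uintptr_t addr)
{
	assert(toy_is_vmalloc_addr(addr));
	return toy_vmalloc_table[(addr - TOY_VMALLOC_BASE) >> TOY_PAGE_SHIFT];
}

/* Same shape as the patched loop body: pick the translation by region. */
unsigned long toy_addr_to_frame(uintptr_t addr)
{
	if (toy_is_vmalloc_addr(addr))
		return toy_vmalloc_to_frame(addr);
	return toy_linear_to_frame(addr);
}
```

In this model, applying toy_linear_to_frame() to an address in the vmalloc
region yields a frame number that toy_frame_is_valid() rejects, which is
loosely analogous to the invalid page address that made QEMU abort.
toy_addr_to_frame() avoids this by checking the region first, as the patch
does with is_vmalloc_or_module_addr().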
net/9p/trans_virtio.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index cd1e1ed..b2009bc 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -340,7 +340,10 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
 		int count = nr_pages;
 		while (nr_pages) {
 			s = rest_of_page(data);
-			pages[index++] = kmap_to_page(data);
+			if (is_vmalloc_or_module_addr(data))
+				pages[index++] = vmalloc_to_page(data);
+			else
+				pages[index++] = kmap_to_page(data);
 			data += s;
 			nr_pages--;
 		}
--
1.8.3.2
^ permalink raw reply related