Netdev List
 help / color / mirror / Atom feed
* [PATCH] Fix broken zero-copy on vmalloc() buffers (4th and hopefully final submission)
From: Richard Yao @ 2014-02-09  0:32 UTC (permalink / raw)
  To: David S. Miller
  Cc: Eric Van Hensbergen, Ron Minnich, Latchesar Ionkov,
	V9FS Develooper Mailing List, Linux Netdev Mailing List,
	Linux Kernel Mailing List, Aneesh Kumar K.V, Will Deacon,
	Christopher Covington, Matthew Thode

This patch has been submitted for a few times.

The first time was my first time doing any sort of Linux patch
submission. At the time, I was unaware of ./scripts/get_maintainer.pl
and sent the patch to only a subset of the correct people. Consequently,
it was not submitted properly for acceptance by the subsystem maintainer.

The second time was a week ago. I had taken advice from Greg Koah-Hartman to
use ./scripts/get_maintainer.pl to determine the correct recipients. It was
initially accepted by the subsystem maintainer and then rejected. This patch
uses is_vmalloc_or_module_addr(), which is not exported for use in kernel
modules. Using it causes a build failure when CONFIG_NET_9P_VIRTIO=m is set in
.config.

The third time was earlier today, when I sent it straight to Linus Torvalds
because merging it required exporting is_vmalloc_or_module_addr(), which he
wrote. A brief correspondence with Linus revealed that my earlier belief that
it would be better to use is_vmalloc_or_module_addr() instead of
is_vmalloc_addr() was incorrect.

I expect this submission to be the last. I have changed the patch to use
is_vmalloc_addr() as Linus Torvalds suggested. This resolves the build
regression the problem David S. Miller found when CONFIG_NET_9P_VIRTIO=m was
set and should resolve all criticism.

Richard Yao (1):
  9p/trans_virtio.c: Fix broken zero-copy on vmalloc() buffers

 net/9p/trans_virtio.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

-- 
1.8.3.2

^ permalink raw reply

* [PATCH] 9p/trans_virtio.c: Fix broken zero-copy on vmalloc() buffers
From: Richard Yao @ 2014-02-09  0:32 UTC (permalink / raw)
  To: David S. Miller
  Cc: Eric Van Hensbergen, Ron Minnich, Latchesar Ionkov,
	V9FS Develooper Mailing List, Linux Netdev Mailing List,
	Linux Kernel Mailing List, Aneesh Kumar K.V, Will Deacon,
	Christopher Covington, Matthew Thode
In-Reply-To: <1391905921-28378-1-git-send-email-ryao@gentoo.org>

The 9p-virtio transport does zero copy on things larger than 1024 bytes
in size. It accomplishes this by returning the physical addresses of
pages to the virtio-pci device. At present, the translation is usually a
bit shift.

That approach produces an invalid page address when we read/write to
vmalloc buffers, such as those used for Linux kernel modules. Any
attempt to load a Linux kernel module from 9p-virtio produces the
following stack.

[<ffffffff814878ce>] p9_virtio_zc_request+0x45e/0x510
[<ffffffff814814ed>] p9_client_zc_rpc.constprop.16+0xfd/0x4f0
[<ffffffff814839dd>] p9_client_read+0x15d/0x240
[<ffffffff811c8440>] v9fs_fid_readn+0x50/0xa0
[<ffffffff811c84a0>] v9fs_file_readn+0x10/0x20
[<ffffffff811c84e7>] v9fs_file_read+0x37/0x70
[<ffffffff8114e3fb>] vfs_read+0x9b/0x160
[<ffffffff81153571>] kernel_read+0x41/0x60
[<ffffffff810c83ab>] copy_module_from_fd.isra.34+0xfb/0x180

Subsequently, QEMU will die printing:

qemu-system-x86_64: virtio: trying to map MMIO memory

This patch enables 9p-virtio to correctly handle this case. This not
only enables us to load Linux kernel modules off virtfs, but also
enables ZFS file-based vdevs on virtfs to be used without killing QEMU.

Special thanks to both Avi Kivity and Alexander Graf for their
interpretation of QEMU backtraces. Without their guidence, tracking down
this bug would have taken much longer. Also, special thanks to Linus
Torvalds for his insightful explanation of why this should use
is_vmalloc_addr() instead of is_vmalloc_or_module_addr():

https://lkml.org/lkml/2014/2/8/272

Signed-off-by: Richard Yao <ryao@gentoo.org>
---
 net/9p/trans_virtio.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index cd1e1ed..ac2666c 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -340,7 +340,10 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
 		int count = nr_pages;
 		while (nr_pages) {
 			s = rest_of_page(data);
-			pages[index++] = kmap_to_page(data);
+			if (is_vmalloc_addr(data))
+				pages[index++] = vmalloc_to_page(data);
+			else
+				pages[index++] = kmap_to_page(data);
 			data += s;
 			nr_pages--;
 		}
-- 
1.8.3.2

^ permalink raw reply related

* Re: [PATCH] bridge: Unbreak netconsole
From: Stephen Hemminger @ 2014-02-09  1:05 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: David S. Miller, Jiri Pirko, Neil Horman, netdev@vger.kernel.org
In-Reply-To: <52F67A4B.9000504@acm.org>

On Sat, 08 Feb 2014 19:41:15 +0100
Bart Van Assche <bvanassche@acm.org> wrote:

> Sending netconsole messages over a bridge network interface doesn't
> work anymore since kernel v3.12. Bisecting this led to the patch
> "bridge: cleanup netpoll code". Hence revert that patch (commit
> 93d8bf9fb8f39d6d3e461db60f883d9f81006159).
> 
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> Cc: Stephen Hemminger <stephen@networkplumber.org>
> Cc: Jiri Pirko <jiri@resnulli.us>
> Cc: Neil Horman <nhorman@tuxdriver.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: <stable@vger.kernel.org> # 3.12
> Reference: https://bugzilla.kernel.org/show_bug.cgi?id=70071

Cong already send an alternative fix for this.

^ permalink raw reply

* Re: Question about skb_segment()
From: Herbert Xu @ 2014-02-09  2:42 UTC (permalink / raw)
  To: Arnaud Ebalard; +Cc: David Miller, Eric Dumazet, Willy Tarreau, netdev
In-Reply-To: <87fvo71a0m.fsf@natisbad.org>

On Wed, Jan 29, 2014 at 09:12:57AM +0100, Arnaud Ebalard wrote:
> Hi Herbert,
> 
> I wonder if you could share some knowledge on the behaviour of
> skb_segment() as it is implemented in 3.13.0: when passed a GSO
> packet to be segmented, can the skb result have skb->next == NULL?
> 
> One would expect the number of segments of the result to usually match
> tcp_skb_pcount() of passed packet and hence having skb->next != NULL: 
> AFAICT, this what usually happens when skb_segment() is called in
> tcp_gso_segment() but I noticed some cases where the number of segments
> (length of chained sk_buff) is lower than the expected value and also
> have two backtraces resulting from a skb delivered with a NULL skb->next. 

Hi Arnaud:

This is definitely not expected.  You said that the crash is not
easily reproducible.  Are you able to easily reproduce the problem
where the number of segments produced is less than what is expected?

Can you print out the size of the segments produced in that case?

Thanks!
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH] bridge: Unbreak netconsole
From: Cong Wang @ 2014-02-09  2:44 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Bart Van Assche, David S. Miller, Jiri Pirko, Neil Horman,
	netdev@vger.kernel.org
In-Reply-To: <20140208180511.533e1d98@samsung-9>

On Sat, Feb 8, 2014 at 5:05 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Sat, 08 Feb 2014 19:41:15 +0100
> Bart Van Assche <bvanassche@acm.org> wrote:
>
>> Sending netconsole messages over a bridge network interface doesn't
>> work anymore since kernel v3.12. Bisecting this led to the patch
>> "bridge: cleanup netpoll code". Hence revert that patch (commit
>> 93d8bf9fb8f39d6d3e461db60f883d9f81006159).
>>
>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>> Cc: Stephen Hemminger <stephen@networkplumber.org>
>> Cc: Jiri Pirko <jiri@resnulli.us>
>> Cc: Neil Horman <nhorman@tuxdriver.com>
>> Cc: David S. Miller <davem@davemloft.net>
>> Cc: <stable@vger.kernel.org> # 3.12
>> Reference: https://bugzilla.kernel.org/show_bug.cgi?id=70071
>
> Cong already send an alternative fix for this.

And it is already in -net:

commit dbe173079ab58a444e12dbebe96f5aec1e0bed1a
Author: Cong Wang <cwang@twopensource.com>
Date:   Thu Feb 6 15:00:52 2014 -0800

    bridge: fix netconsole setup over bridge

^ permalink raw reply

* Re: [PATCH 2/2] net: ip, ipv6: handle gso skbs in forwarding path
From: Herbert Xu @ 2014-02-09  2:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Florian Westphal, netdev
In-Reply-To: <1390926883.28432.12.camel@edumazet-glaptop2.roam.corp.google.com>

On Tue, Jan 28, 2014 at 08:34:43AM -0800, Eric Dumazet wrote:
> On Tue, 2014-01-28 at 09:57 +0100, Florian Westphal wrote:
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > > +	do {
> > > > +		struct sk_buff *nskb = segs->next;
> > > > +		int err;
> > > > +
> > > > +		segs->next = NULL;
> > > > +		err = dst_output(segs);
> > > > +
> > > > +		if (err && ret == 0)
> > > > +			ret = err;
> > > > +		segs = nskb;
> > > > +	} while (segs);
> > > > +
> > > > +	return ret;
> > > > +}
> > > > +
> > > 
> > > Its still unclear if this is the best strategy.
> > > 
> > > TCP stream not using DF flag are very unlikely to care if we adjust
> > > their MTU (lowering gso_size) at this point ?
> > 
> > Thanks for this suggestion.  It would indeed be nice to avoid sw
> > segmentation.  I tried:
> > 
> > static void ip_gso_adjust_seglen(struct sk_buff *skb)
> > {
> >         unsigned int mtu;
> > 
> >         if (!skb_is_gso(skb))
> >                 return;
> > 
> >         mtu = ip_dst_mtu_maybe_forward(skb_dst(skb), true);
> >         skb_shinfo(skb)->gso_size = mtu - sizeof(struct iphdr);
> > }
> > 
> > But this yields
> > 
> > [   28.644776] kernel BUG at net/net/core/skbuff.c:2984!
> 
> Yep, lets CC Herbert Xu, as he 'owns' skb_segment()

IMHO we should just stop merging ~DF packets altogether, at least
for TCP.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* [PATCH v2] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
From: Matija Glavinic Pecotic @ 2014-02-09  7:15 UTC (permalink / raw)
  To: linux-sctp@vger.kernel.org; +Cc: netdev@vger.kernel.org, Alexander Sverdlin

Implementation of (a)rwnd calculation might lead to severe performance issues
and associations completely stalling. These problems are described and solution
is proposed which improves lksctp's robustness in congestion state.

This v2 of the patch addresses comments from Vlad. Name of the function is set
to be more descriptive, and two parts of code are changed, in one removing the
superfluous call to sctp_assoc_rwnd_update since call would not result in update
of rwnd, and the other being reordering of the code in a way that call to
sctp_assoc_rwnd_update updates rwnd.

1) Sudden drop of a_rwnd and incomplete window recovery afterwards

Data accounted in sctp_assoc_rwnd_decrease takes only payload size (sctp data),
but size of sk_buff, which is blamed against receiver buffer, is not accounted
in rwnd. Theoretically, this should not be the problem as actual size of buffer
is double the amount requested on the socket (SO_RECVBUF). Problem here is
that this will have bad scaling for data which is less then sizeof sk_buff.
E.g. in 4G (LTE) networks, link interfacing radio side will have a large portion
of traffic of this size (less then 100B).

An example of sudden drop and incomplete window recovery is given below. Node B
exhibits problematic behavior. Node A initiates association and B is configured
to advertise rwnd of 10000. A sends messages of size 43B (size of typical sctp
message in 4G (LTE) network). On B data is left in buffer by not reading socket
in userspace.

Lets examine when we will hit pressure state and declare rwnd to be 0 for
scenario with above stated parameters (rwnd == 10000, chunk size == 43, each
chunk is sent in separate sctp packet)

Logic is implemented in sctp_assoc_rwnd_decrease:

socket_buffer (see below) is maximum size which can be held in socket buffer
(sk_rcvbuf). current_alloced is amount of data currently allocated (rx_count)

A simple expression is given for which it will be examined after how many
packets for above stated parameters we enter pressure state:

We start by condition which has to be met in order to enter pressure state:

	socket_buffer < currently_alloced;

currently_alloced is represented as size of sctp packets received so far and not
yet delivered to userspace. x is the number of chunks/packets (since there is no
bundling, and each chunk is delivered in separate packet, we can observe each
chunk also as sctp packet, and what is important here, having its own sk_buff):

	socket_buffer < x*each_sctp_packet;

each_sctp_packet is sctp chunk size + sizeof(struct sk_buff). socket_buffer is
twice the amount of initially requested size of socket buffer, which is in case
of sctp, twice the a_rwnd requested:

	2*rwnd < x*(payload+sizeof(struc sk_buff));

sizeof(struct sk_buff) is 190 (3.13.0-rc4+). Above is stated that rwnd is 10000
and each payload size is 43

	20000 < x(43+190);

	x > 20000/233;

	x ~> 84;

After ~84 messages, pressure state is entered and 0 rwnd is advertised while 
received 84*43B ~= 3612B sctp data. This is why external observer notices sudden
drop from 6474 to 0, as it will be now shown in example:

IP A.34340 > B.12345: sctp (1) [INIT] [init tag: 1875509148] [rwnd: 81920] [OS: 10] [MIS: 65535] [init TSN: 1096057017]
IP B.12345 > A.34340: sctp (1) [INIT ACK] [init tag: 3198966556] [rwnd: 10000] [OS: 10] [MIS: 10] [init TSN: 902132839]
IP A.34340 > B.12345: sctp (1) [COOKIE ECHO]
IP B.12345 > A.34340: sctp (1) [COOKIE ACK]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057017] [SID: 0] [SSEQ 0] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057017] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057018] [SID: 0] [SSEQ 1] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057018] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057019] [SID: 0] [SSEQ 2] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057019] [a_rwnd 9914] [#gap acks 0] [#dup tsns 0]
<...>
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057098] [SID: 0] [SSEQ 81] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057098] [a_rwnd 6517] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057099] [SID: 0] [SSEQ 82] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057099] [a_rwnd 6474] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057100] [SID: 0] [SSEQ 83] [PPID 0x18]

--> Sudden drop

IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

At this point, rwnd_press stores current rwnd value so it can be later restored
in sctp_assoc_rwnd_increase. This however doesn't happen as condition to start
slowly increasing rwnd until rwnd_press is returned to rwnd is never met. This
condition is not met since rwnd, after it hit 0, must first reach rwnd_press by
adding amount which is read from userspace. Let us observe values in above
example. Initial a_rwnd is 10000, pressure was hit when rwnd was ~6500 and the
amount of actual sctp data currently waiting to be delivered to userspace
is ~3500. When userspace starts to read, sctp_assoc_rwnd_increase will be blamed
only for sctp data, which is ~3500. Condition is never met, and when userspace
reads all data, rwnd stays on 3569.

IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 1505] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 3010] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057101] [SID: 0] [SSEQ 84] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057101] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]

--> At this point userspace read everything, rwnd recovered only to 3569

IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057102] [SID: 0] [SSEQ 85] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057102] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]

Reproduction is straight forward, it is enough for sender to send packets of
size less then sizeof(struct sk_buff) and receiver keeping them in its buffers.

2) Minute size window for associations sharing the same socket buffer

In case multiple associations share the same socket, and same socket buffer
(sctp.rcvbuf_policy == 0), different scenarios exist in which congestion on one
of the associations can permanently drop rwnd of other association(s).

Situation will be typically observed as one association suddenly having rwnd
dropped to size of last packet received and never recovering beyond that point.
Different scenarios will lead to it, but all have in common that one of the
associations (let it be association from 1)) nearly depleted socket buffer, and
the other association blames socket buffer just for the amount enough to start
the pressure. This association will enter pressure state, set rwnd_press and 
announce 0 rwnd.
When data is read by userspace, similar situation as in 1) will occur, rwnd will
increase just for the size read by userspace but rwnd_press will be high enough
so that association doesn't have enough credit to reach rwnd_press and restore
to previous state. This case is special case of 1), being worse as there is, in
the worst case, only one packet in buffer for which size rwnd will be increased.
Consequence is association which has very low maximum rwnd ('minute size', in
our case down to 43B - size of packet which caused pressure) and as such
unusable.

Scenario happened in the field and labs frequently after congestion state (link
breaks, different probabilities of packet drop, packet reordering) and with 
scenario 1) preceding. Here is given a deterministic scenario for reproduction:

>From node A establish two associations on the same socket, with rcvbuf_policy
being set to share one common buffer (sctp.rcvbuf_policy == 0). On association 1
repeat scenario from 1), that is, bring it down to 0 and restore up. Observe
scenario 1). Use small payload size (here we use 43). Once rwnd is 'recovered',
bring it down close to 0, as in just one more packet would close it. This has as
a consequence that association number 2 is able to receive (at least) one more
packet which will bring it in pressure state. E.g. if association 2 had rwnd of
10000, packet received was 43, and we enter at this point into pressure,
rwnd_press will have 9957. Once payload is delivered to userspace, rwnd will
increase for 43, but conditions to restore rwnd to original state, just as in
1), will never be satisfied.

--> Association 1, between A.y and B.12345

IP A.55915 > B.12345: sctp (1) [INIT] [init tag: 836880897] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 4032536569]
IP B.12345 > A.55915: sctp (1) [INIT ACK] [init tag: 2873310749] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3799315613]
IP A.55915 > B.12345: sctp (1) [COOKIE ECHO]
IP B.12345 > A.55915: sctp (1) [COOKIE ACK]

--> Association 2, between A.z and B.12346

IP A.55915 > B.12346: sctp (1) [INIT] [init tag: 534798321] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 2099285173]
IP B.12346 > A.55915: sctp (1) [INIT ACK] [init tag: 516668823] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3676403240]
IP A.55915 > B.12346: sctp (1) [COOKIE ECHO]
IP B.12346 > A.55915: sctp (1) [COOKIE ACK]

--> Deplete socket buffer by sending messages of size 43B over association 1

IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315613] [SID: 0] [SSEQ 0] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315613] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]

<...>

IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315696] [a_rwnd 6388] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315697] [SID: 0] [SSEQ 84] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315697] [a_rwnd 6345] [#gap acks 0] [#dup tsns 0]

--> Sudden drop on 1
 
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315698] [SID: 0] [SSEQ 85] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315698] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

--> Here userspace read, rwnd 'recovered' to 3698, now deplete again using
    association 1 so there is place in buffer for only one more packet

IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315799] [SID: 0] [SSEQ 186] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315799] [a_rwnd 86] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315800] [SID: 0] [SSEQ 187] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]

--> Socket buffer is almost depleted, but there is space for one more packet,
    send them over association 2, size 43B

IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403240] [SID: 0] [SSEQ 0] [PPID 0x18]
IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403240] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

--> Immediate drop

IP A.60995 > B.12346: sctp (1) [SACK] [cum ack 387491510] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

--> Read everything from the socket, both association recover up to maximum rwnd
    they are capable of reaching, note that association 1 recovered up to 3698,
    and association 2 recovered only to 43

IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 1548] [#gap acks 0] [#dup tsns 0]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 3053] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315801] [SID: 0] [SSEQ 188] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315801] [a_rwnd 3698] [#gap acks 0] [#dup tsns 0]
IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403241] [SID: 0] [SSEQ 1] [PPID 0x18]
IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403241] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]

A careful reader might wonder why it is necessary to reproduce 1) prior
reproduction of 2). It is simply easier to observe when to send packet over
association 2 which will push association into the pressure state.

Proposed solution:

Both problems share the same root cause, and that is improper scaling of socket
buffer with rwnd. Solution in which sizeof(sk_buff) is taken into concern while
calculating rwnd is not possible due to fact that there is no linear
relationship between amount of data blamed in increase/decrease with IP packet
in which payload arrived. Even in case such solution would be followed,
complexity of the code would increase. Due to nature of current rwnd handling,
slow increase (in sctp_assoc_rwnd_increase) of rwnd after pressure state is
entered is rationale, but it gives false representation to the sender of current
buffer space. Furthermore, it implements additional congestion control mechanism
which is defined on implementation, and not on standard basis.

Proposed solution simplifies whole algorithm having on mind definition from rfc:

o  Receiver Window (rwnd): This gives the sender an indication of the space
   available in the receiver's inbound buffer.

Core of the proposed solution is given with these lines:

sctp_assoc_rwnd_update:
	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
	else
		asoc->rwnd = 0;

We advertise to sender (half of) actual space we have. Half is in the braces
depending whether you would like to observe size of socket buffer as SO_RECVBUF
or twice the amount, i.e. size is the one visible from userspace, that is,
from kernelspace.
In this way sender is given with good approximation of our buffer space,
regardless of the buffer policy - we always advertise what we have. Proposed
solution fixes described problems and removes necessity for rwnd restoration
algorithm. Finally, as proposed solution is simplification, some lines of code,
along with some bytes in struct sctp_association are saved.

Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>

--- net-next.orig/net/sctp/associola.c
+++ net-next/net/sctp/associola.c
@@ -1367,44 +1367,35 @@ static inline bool sctp_peer_needs_updat
 	return false;
 }
 
-/* Increase asoc's rwnd by len and send any window update SACK if needed. */
-void sctp_assoc_rwnd_increase(struct sctp_association *asoc, unsigned int len)
+/* Update asoc's rwnd for the approximated state in the buffer,
+ * and check whether SACK needs to be sent.
+ */
+void sctp_assoc_rwnd_update(struct sctp_association *asoc, bool update_peer)
 {
+	int rx_count;
 	struct sctp_chunk *sack;
 	struct timer_list *timer;
 
-	if (asoc->rwnd_over) {
-		if (asoc->rwnd_over >= len) {
-			asoc->rwnd_over -= len;
-		} else {
-			asoc->rwnd += (len - asoc->rwnd_over);
-			asoc->rwnd_over = 0;
-		}
-	} else {
-		asoc->rwnd += len;
-	}
+	if (asoc->ep->rcvbuf_policy)
+		rx_count = atomic_read(&asoc->rmem_alloc);
+	else
+		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
 
-	/* If we had window pressure, start recovering it
-	 * once our rwnd had reached the accumulated pressure
-	 * threshold.  The idea is to recover slowly, but up
-	 * to the initial advertised window.
-	 */
-	if (asoc->rwnd_press && asoc->rwnd >= asoc->rwnd_press) {
-		int change = min(asoc->pathmtu, asoc->rwnd_press);
-		asoc->rwnd += change;
-		asoc->rwnd_press -= change;
-	}
+	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
+		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
+	else
+		asoc->rwnd = 0;
 
-	pr_debug("%s: asoc:%p rwnd increased by %d to (%u, %u) - %u\n",
-		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
-		 asoc->a_rwnd);
+	pr_debug("%s: asoc:%p rwnd=%u, rx_count=%d, sk_rcvbuf=%d\n",
+		 __func__, asoc, asoc->rwnd, rx_count,
+		 asoc->base.sk->sk_rcvbuf);
 
 	/* Send a window update SACK if the rwnd has increased by at least the
 	 * minimum of the association's PMTU and half of the receive buffer.
 	 * The algorithm used is similar to the one described in
 	 * Section 4.2.3.3 of RFC 1122.
 	 */
-	if (sctp_peer_needs_update(asoc)) {
+	if (update_peer && sctp_peer_needs_update(asoc)) {
 		asoc->a_rwnd = asoc->rwnd;
 
 		pr_debug("%s: sending window update SACK- asoc:%p rwnd:%u "
@@ -1426,45 +1417,6 @@ void sctp_assoc_rwnd_increase(struct sct
 	}
 }
 
-/* Decrease asoc's rwnd by len. */
-void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned int len)
-{
-	int rx_count;
-	int over = 0;
-
-	if (unlikely(!asoc->rwnd || asoc->rwnd_over))
-		pr_debug("%s: association:%p has asoc->rwnd:%u, "
-			 "asoc->rwnd_over:%u!\n", __func__, asoc,
-			 asoc->rwnd, asoc->rwnd_over);
-
-	if (asoc->ep->rcvbuf_policy)
-		rx_count = atomic_read(&asoc->rmem_alloc);
-	else
-		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
-
-	/* If we've reached or overflowed our receive buffer, announce
-	 * a 0 rwnd if rwnd would still be positive.  Store the
-	 * the potential pressure overflow so that the window can be restored
-	 * back to original value.
-	 */
-	if (rx_count >= asoc->base.sk->sk_rcvbuf)
-		over = 1;
-
-	if (asoc->rwnd >= len) {
-		asoc->rwnd -= len;
-		if (over) {
-			asoc->rwnd_press += asoc->rwnd;
-			asoc->rwnd = 0;
-		}
-	} else {
-		asoc->rwnd_over = len - asoc->rwnd;
-		asoc->rwnd = 0;
-	}
-
-	pr_debug("%s: asoc:%p rwnd decreased by %d to (%u, %u, %u)\n",
-		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
-		 asoc->rwnd_press);
-}
 
 /* Build the bind address list for the association based on info from the
  * local endpoint and the remote peer.
--- net-next.orig/include/net/sctp/structs.h
+++ net-next/include/net/sctp/structs.h
@@ -1653,17 +1653,6 @@ struct sctp_association {
 	/* This is the last advertised value of rwnd over a SACK chunk. */
 	__u32 a_rwnd;
 
-	/* Number of bytes by which the rwnd has slopped.  The rwnd is allowed
-	 * to slop over a maximum of the association's frag_point.
-	 */
-	__u32 rwnd_over;
-
-	/* Keeps treack of rwnd pressure.  This happens when we have
-	 * a window, but not recevie buffer (i.e small packets).  This one
-	 * is releases slowly (1 PMTU at a time ).
-	 */
-	__u32 rwnd_press;
-
 	/* This is the sndbuf size in use for the association.
 	 * This corresponds to the sndbuf size for the association,
 	 * as specified in the sk->sndbuf.
@@ -1892,8 +1881,7 @@ void sctp_assoc_update(struct sctp_assoc
 __u32 sctp_association_get_next_tsn(struct sctp_association *);
 
 void sctp_assoc_sync_pmtu(struct sock *, struct sctp_association *);
-void sctp_assoc_rwnd_increase(struct sctp_association *, unsigned int);
-void sctp_assoc_rwnd_decrease(struct sctp_association *, unsigned int);
+void sctp_assoc_rwnd_update(struct sctp_association *, bool);
 void sctp_assoc_set_primary(struct sctp_association *,
 			    struct sctp_transport *);
 void sctp_assoc_del_nonprimary_peers(struct sctp_association *,
--- net-next.orig/net/sctp/sm_statefuns.c
+++ net-next/net/sctp/sm_statefuns.c
@@ -6176,7 +6176,7 @@ static int sctp_eat_data(const struct sc
 	 * PMTU.  In cases, such as loopback, this might be a rather
 	 * large spill over.
 	 */
-	if ((!chunk->data_accepted) && (!asoc->rwnd || asoc->rwnd_over ||
+	if ((!chunk->data_accepted) && (!asoc->rwnd ||
 	    (datalen > asoc->rwnd + asoc->frag_point))) {
 
 		/* If this is the next TSN, consider reneging to make
--- net-next.orig/net/sctp/socket.c
+++ net-next/net/sctp/socket.c
@@ -2092,12 +2092,6 @@ static int sctp_recvmsg(struct kiocb *io
 		sctp_skb_pull(skb, copied);
 		skb_queue_head(&sk->sk_receive_queue, skb);
 
-		/* When only partial message is copied to the user, increase
-		 * rwnd by that amount. If all the data in the skb is read,
-		 * rwnd is updated when the event is freed.
-		 */
-		if (!sctp_ulpevent_is_notification(event))
-			sctp_assoc_rwnd_increase(event->asoc, copied);
 		goto out;
 	} else if ((event->msg_flags & MSG_NOTIFICATION) ||
 		   (event->msg_flags & MSG_EOR))
--- net-next.orig/net/sctp/ulpevent.c
+++ net-next/net/sctp/ulpevent.c
@@ -989,7 +989,7 @@ static void sctp_ulpevent_receive_data(s
 	skb = sctp_event2skb(event);
 	/* Set the owner and charge rwnd for bytes received.  */
 	sctp_ulpevent_set_owner(event, asoc);
-	sctp_assoc_rwnd_decrease(asoc, skb_headlen(skb));
+	sctp_assoc_rwnd_update(asoc, false);
 
 	if (!skb->data_len)
 		return;
@@ -1035,8 +1035,9 @@ static void sctp_ulpevent_release_data(s
 	}
 
 done:
-	sctp_assoc_rwnd_increase(event->asoc, len);
-	sctp_ulpevent_release_owner(event);
+	atomic_sub(event->rmem_len, &event->asoc->rmem_alloc);
+	sctp_assoc_rwnd_update(event->asoc, true);
+	sctp_association_put(event->asoc);
 }
 
 static void sctp_ulpevent_release_frag_data(struct sctp_ulpevent *event)

^ permalink raw reply

* RFC: Set MSG_MORE in iscsit_fe_sendpage_sg to avoid sending multiple TCP packets instead of one
From: Thomas Glanzmann @ 2014-02-09  7:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: John Ogness, Eric Dumazet, David S. Miller, Nicholas A. Bellinger,
	target-devel, Linux Network Development, LKML
In-Reply-To: <1391886759.10160.114.camel@edumazet-glaptop2.roam.corp.google.com>

Hello Eric,
I took the liberty to test and make your patch compile by adding the
following changeset:

--- a/drivers/target/iscsi/iscsi_target_parameters.c
+++ b/drivers/target/iscsi/iscsi_target_parameters.c
@@ -79,7 +79,7 @@ int iscsi_login_tx_data(
         */
        conn->if_marker += length;

-       tx_sent = tx_data(conn, &iov[0], iov_cnt, length);
+       tx_sent = tx_data(conn, &iov[0], iov_cnt, length, 0);
        if (tx_sent != length) {
                pr_err("tx_data returned %d, expecting %d.\n",
                                tx_sent, length);

Hello Nab,
please consider this for upstream. If I should clean it up in two patches:

        - One patch to change the interface iscsit_do_tx_data and tx_data

        - Another patch who uses the newly introduced interface

Let me know. This is how I would have normally done it. I set Eric as author
but singed it off and tested it. My testbed was: Using ESXi 5.5.0 GA to create
a 500 GB VMFS filesystem via iSCSI.

Cheers,
        Thomas

^ permalink raw reply

* [PATCH] This extends tx_data and and iscsit_do_tx_data with the additional parameter flags and avoids sending multiple TCP packets in iscsit_fe_sendpage_sg
From: Eric Dumazet @ 2014-02-09  7:42 UTC (permalink / raw)
  To: Eric Dumazet, John Ogness, Eric Dumazet, David S. Miller,
	Nicholas A. Bellinger, target-devel, Linux Network Development,
	LKML, thomas
In-Reply-To: <20140209074027.GA8105@glanzmann.de>

The new infrastructure is used in iscsit_fe_sendpage_sg to avoid sending three
TCP packets instead of one by settings the MSG_MORE when calling kernel_sendmsg
via the wrapper functions tx_data and iscsit_do_tx_data. This reduces the TCP
overhead by sending the same data in less TCP packets and minimized the TCP RTP
when TCP auto corking is enabled. When creating a 500 GB VMFS filesystem the
filesystem is created in 3 seconds instead of 4 seconds.

Signed-off-by: Thomas Glanzmann <thomas@glanzmann.de>
X-tested-by: Thomas Glanzmann <thomas@glanzmann.de>
---
 drivers/target/iscsi/iscsi_target_parameters.c |    2 +-
 drivers/target/iscsi/iscsi_target_util.c       |   23 ++++++++++++-----------
 drivers/target/iscsi/iscsi_target_util.h       |    2 +-
 3 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/drivers/target/iscsi/iscsi_target_parameters.c b/drivers/target/iscsi/iscsi_target_parameters.c
index 4d2e23f..b802392 100644
--- a/drivers/target/iscsi/iscsi_target_parameters.c
+++ b/drivers/target/iscsi/iscsi_target_parameters.c
@@ -79,7 +79,7 @@ int iscsi_login_tx_data(
 	 */
 	conn->if_marker += length;
 
-	tx_sent = tx_data(conn, &iov[0], iov_cnt, length);
+	tx_sent = tx_data(conn, &iov[0], iov_cnt, length, 0);
 	if (tx_sent != length) {
 		pr_err("tx_data returned %d, expecting %d.\n",
 				tx_sent, length);
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index e655b04..887fabb 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -1168,7 +1168,7 @@ send_data:
 		iov_count = cmd->iov_misc_count;
 	}
 
-	tx_sent = tx_data(conn, &iov[0], iov_count, tx_size);
+	tx_sent = tx_data(conn, &iov[0], iov_count, tx_size, 0);
 	if (tx_size != tx_sent) {
 		if (tx_sent == -EAGAIN) {
 			pr_err("tx_data() returned -EAGAIN\n");
@@ -1199,7 +1199,8 @@ send_hdr:
 	iov.iov_base = cmd->pdu;
 	iov.iov_len = tx_hdr_size;
 
-	tx_sent = tx_data(conn, &iov, 1, tx_hdr_size);
+	data_len = cmd->tx_size - tx_hdr_size - cmd->padding;
+	tx_sent = tx_data(conn, &iov, 1, tx_hdr_size, data_len ? MSG_MORE : 0);
 	if (tx_hdr_size != tx_sent) {
 		if (tx_sent == -EAGAIN) {
 			pr_err("tx_data() returned -EAGAIN\n");
@@ -1208,7 +1209,6 @@ send_hdr:
 		return -1;
 	}
 
-	data_len = cmd->tx_size - tx_hdr_size - cmd->padding;
 	/*
 	 * Set iov_off used by padding and data digest tx_data() calls below
 	 * in order to determine proper offset into cmd->iov_data[]
@@ -1252,7 +1252,8 @@ send_padding:
 	if (cmd->padding) {
 		struct kvec *iov_p = &cmd->iov_data[iov_off++];
 
-		tx_sent = tx_data(conn, iov_p, 1, cmd->padding);
+		tx_sent = tx_data(conn, iov_p, 1, cmd->padding,
+				  conn->conn_ops->DataDigest ? MSG_MORE : 0);
 		if (cmd->padding != tx_sent) {
 			if (tx_sent == -EAGAIN) {
 				pr_err("tx_data() returned -EAGAIN\n");
@@ -1266,7 +1267,7 @@ send_datacrc:
 	if (conn->conn_ops->DataDigest) {
 		struct kvec *iov_d = &cmd->iov_data[iov_off];
 
-		tx_sent = tx_data(conn, iov_d, 1, ISCSI_CRC_LEN);
+		tx_sent = tx_data(conn, iov_d, 1, ISCSI_CRC_LEN, 0);
 		if (ISCSI_CRC_LEN != tx_sent) {
 			if (tx_sent == -EAGAIN) {
 				pr_err("tx_data() returned -EAGAIN\n");
@@ -1352,11 +1353,12 @@ static int iscsit_do_rx_data(
 
 static int iscsit_do_tx_data(
 	struct iscsi_conn *conn,
-	struct iscsi_data_count *count)
+	struct iscsi_data_count *count,
+	int flags)
 {
 	int data = count->data_length, total_tx = 0, tx_loop = 0, iov_len;
 	struct kvec *iov_p;
-	struct msghdr msg;
+        struct msghdr msg = { .msg_flags = flags };
 
 	if (!conn || !conn->sock || !conn->conn_ops)
 		return -1;
@@ -1366,8 +1368,6 @@ static int iscsit_do_tx_data(
 		return -1;
 	}
 
-	memset(&msg, 0, sizeof(struct msghdr));
-
 	iov_p = count->iov;
 	iov_len = count->iov_count;
 
@@ -1411,7 +1411,8 @@ int tx_data(
 	struct iscsi_conn *conn,
 	struct kvec *iov,
 	int iov_count,
-	int data)
+	int data,
+	int flags)
 {
 	struct iscsi_data_count c;
 
@@ -1424,7 +1425,7 @@ int tx_data(
 	c.data_length = data;
 	c.type = ISCSI_TX_DATA;
 
-	return iscsit_do_tx_data(conn, &c);
+	return iscsit_do_tx_data(conn, &c, flags);
 }
 
 void iscsit_collect_login_stats(
diff --git a/drivers/target/iscsi/iscsi_target_util.h b/drivers/target/iscsi/iscsi_target_util.h
index 561a424..1b6b336 100644
--- a/drivers/target/iscsi/iscsi_target_util.h
+++ b/drivers/target/iscsi/iscsi_target_util.h
@@ -54,7 +54,7 @@ extern int iscsit_print_dev_to_proc(char *, char **, off_t, int);
 extern int iscsit_print_sessions_to_proc(char *, char **, off_t, int);
 extern int iscsit_print_tpg_to_proc(char *, char **, off_t, int);
 extern int rx_data(struct iscsi_conn *, struct kvec *, int, int);
-extern int tx_data(struct iscsi_conn *, struct kvec *, int, int);
+extern int tx_data(struct iscsi_conn *, struct kvec *, int, int, int);
 extern void iscsit_collect_login_stats(struct iscsi_conn *, u8, u8);
 extern struct iscsi_tiqn *iscsit_snmp_get_tiqn(struct iscsi_conn *);
 
-- 
1.7.10.4

^ permalink raw reply related

* Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
From: Thomas Glanzmann @ 2014-02-09  7:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: John Ogness, Eric Dumazet, David S. Miller, Nicholas A. Bellinger,
	target-devel, Linux Network Development, LKML
In-Reply-To: <1391904928.10160.118.camel@edumazet-glaptop2.roam.corp.google.com>

Hello Eric,

> Yes, this is much better : 2 frames per request/response, instead of 4.

perfect. I send out the page to the iscsi target list in your name since
you did the work and I added me as signed off I hope that is how it is
handled or should I have added my name to the from line and mentioned in
the description of the patch that you did the heavy lifting?

Cheers,
        Thomas

^ permalink raw reply

* Re: ax88179 regression
From: renevant @ 2014-02-09  7:51 UTC (permalink / raw)
  To: renevant; +Cc: netdev
In-Reply-To: <1519315.vRJdGOYKiP@athas>

Absolute last post from me. 

I migrated to a Intel Sandy Bridge based system to test. Definitely something 
going on with the AMD based system, could be power supply/motherboard bios 
related.

On the Intel system the ax88179 still locked up one time.  I rebuilt with for-
usb-linus tree merged and I reverted:

2a100047481a5c7430c72883b586eb6f2df34812 

xhci: remove conversion from generic to pci device in xhci_mem.c

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2a100047481a5c7430c72883b586eb6f2df34812


I can't say for sure if this commit is a problem but this kernel build has 
been up now for several hours without an issue and i've put it through paces 
that usually lock up the nic.


Regards,

Will Trives


On Saturday 08 February 2014 14:56:17 renevant@internode.on.net wrote:
> Everything is still working at commit
> 63a67a72d63dd077c2313cf19eb29d8e4bfa6963
> 
> At this point i'm beginning to think all the issues i've been having are
> motherboard bios and compiler flag issues.
> 
> Regards,
> 
> Will Trives
> 
> On Saturday 08 February 2014 12:56:10 renevant@internode.on.net wrote:
> > Hello,
> > 
> > I have finally nailed down my other issues and i'm at a point where I can
> > bisect from a point that is 100% stable and working.
> > 
> > 
> > I am currently running a kernel checked out at commit
> > d194c031994d3fc1038fa09e9e92d9be24a21921
> > 
> > A point in 3.12rc4
> > 
> > At this point the ax88179 works without issue even with scatter gather
> > turned on. So somewhere from this point something goes wrong and there is
> > some condition that exists that can lock up the nic.
> > 
> > 
> > I will keep bisecting at report my findings.
> > 
> > 
> > Regards,
> > 
> > Will Trives

^ permalink raw reply

* Re: [PATCH] bridge: Unbreak netconsole
From: Bart Van Assche @ 2014-02-09  8:46 UTC (permalink / raw)
  To: Cong Wang, Stephen Hemminger
  Cc: David S. Miller, Jiri Pirko, Neil Horman, netdev@vger.kernel.org
In-Reply-To: <CAHA+R7OavspZhWH=MEE2dwnj+iyekTttRyQiYHbVNgB_52t7WQ@mail.gmail.com>

On 02/09/14 03:44, Cong Wang wrote:
> On Sat, Feb 8, 2014 at 5:05 PM, Stephen Hemminger
> <stephen@networkplumber.org> wrote:
>> On Sat, 08 Feb 2014 19:41:15 +0100
>> Bart Van Assche <bvanassche@acm.org> wrote:
>>
>>> Sending netconsole messages over a bridge network interface doesn't
>>> work anymore since kernel v3.12. Bisecting this led to the patch
>>> "bridge: cleanup netpoll code". Hence revert that patch (commit
>>> 93d8bf9fb8f39d6d3e461db60f883d9f81006159).
>>>
>>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>>> Cc: Stephen Hemminger <stephen@networkplumber.org>
>>> Cc: Jiri Pirko <jiri@resnulli.us>
>>> Cc: Neil Horman <nhorman@tuxdriver.com>
>>> Cc: David S. Miller <davem@davemloft.net>
>>> Cc: <stable@vger.kernel.org> # 3.12
>>> Reference: https://bugzilla.kernel.org/show_bug.cgi?id=70071
>>
>> Cong already send an alternative fix for this.
> 
> And it is already in -net:
> 
> commit dbe173079ab58a444e12dbebe96f5aec1e0bed1a
> Author: Cong Wang <cwang@twopensource.com>
> Date:   Thu Feb 6 15:00:52 2014 -0800
> 
>     bridge: fix netconsole setup over bridge

Thanks Cong and Stephen for working on a fix, and sorry that I had not
noticed that a fix was already available before I sent the patch at the
start of this e-mail thread.

Bart.

^ permalink raw reply

* 6lowpan: lockless tx queue of routing netlink device
From: Alexander Aring @ 2014-02-09 10:20 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-zigbee-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Hi,

I got some locking issues with CONFIG_PROVE_LOCKING enabled and need help.

Full output:

=============================================
[ INFO: possible recursive locking detected ]
3.13.0-08605-g8f2b630-dirty #105 Not tainted
---------------------------------------------
agetty/841 is trying to acquire lock:
 (_xmit_IEEE802154#2){+.-...}, at: [<c0356b39>] sch_direct_xmit+0x34/0x122

but task is already holding lock:
 (_xmit_IEEE802154#2){+.-...}, at: [<c0346926>] __dev_queue_xmit+0x26e/0x329

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(_xmit_IEEE802154#2);
  lock(_xmit_IEEE802154#2);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

6 locks held by agetty/841:
 #0:  (((&idev->mc_ifc_timer))){+.-...}, at: [<c012b6f2>] call_timer_fn+0x0/0xb3
 #1:  (rcu_read_lock){.+.+..}, at: [<c03b4335>] rcu_read_lock+0x0/0x21
 #2:  (rcu_read_lock_bh){.+....}, at: [<c039d39d>] rcu_lock_acquire+0x0/0x1c
 #3:  (rcu_read_lock_bh){.+....}, at: [<c03426e5>] rcu_lock_acquire+0x0/0x1c
 #4:  (_xmit_IEEE802154#2){+.-...}, at: [<c0346926>] __dev_queue_xmit+0x26e/0x329
 #5:  (rcu_read_lock_bh){.+....}, at: [<c03426e5>] rcu_lock_acquire+0x0/0x1c



The solution was for me to change
dev->type = ARPHRD_IEEE802154
to
dev->type = ARPHRD_6LOWPAN
of the rtnl device. What we really shall do in the near future. (I have
a patch for this).


Another solution was to add:
dev->features           |= NETIF_F_LLTX;
in setup callback of rtnl device.


This enables a lockless tx queue.
I am not sure if we can do a lockless tx queue here and the comment of
NETIF_F_LLTX says it's deprecated "/* do not use LLTX in new drivers */".
Exists there some alternative for this?



So a little bit more information about the current architecture which is
a little bit complex for tx. Maybe then it's more clear how to fix this
issue correctly.

To setup a lowpan interface you need to run:
"ip link add link $WPAN_INTERFACE name $LOWPAN_INTERFACE type lowpan"

This setups a lowpan interface which "sitting" on top of the
$WPAN_INTERFACE.

The lowpan rtnl implementation saves pointers from both interfaces we
name it:

"real_dev" <-- WPAN_INTERFACE
"dev" <-- LOWPAN_INTERFACE


If we get some "usually" ipv6 packets from LOWPAN_INTERFACE which calls
header_create function, then we doing some manipulating of sk_buff there.

After this we calling dev_hard_header with the callback of
WPAN_INTERFACE to generate the mac header.

Then we are in the xmit callback of LOWPAN_INTERFACE and doing a
skb->dev pointer change from LOWPAN_INTERFACE to the WPAN_INTERFACE and
calling dev_queue_xmit to send it via the WPAN_INTERFACE.
The skb->dev switch is necessary because we call then the xmit callback
of the WPAN_INTERFACE, the LOWPAN_INTERFACE is more a "virtual" interface.

I think that's the problem. We have two dev_queue_xmit calls first from 
LOWPAN_INTERFACE then the WPAN_INTERFACE, so we locking something twice.


That's very much complicated and I think we doing some hacked stuff
there but currently it works so (except the locking issue). :-)

- Alex

------------------------------------------------------------------------------
Managing the Performance of Cloud-Based Applications
Take advantage of what the Cloud has to offer - Avoid Common Pitfalls.
Read the Whitepaper.
http://pubads.g.doubleclick.net/gampad/clk?id=121051231&iu=/4140/ostg.clktrk

^ permalink raw reply

* Re: [PATCH] This extends tx_data and and iscsit_do_tx_data with the additional parameter flags and avoids sending multiple TCP packets in iscsit_fe_sendpage_sg
From: Eric Dumazet @ 2014-02-09 12:30 UTC (permalink / raw)
  To: John Ogness
  Cc: Eric Dumazet, David S. Miller, Nicholas A. Bellinger,
	target-devel, Linux Network Development, LKML, thomas
In-Reply-To: <20140209074227.GA8219@glanzmann.de>

On Sun, 2014-02-09 at 08:42 +0100, Eric Dumazet wrote:
> The new infrastructure is used in iscsit_fe_sendpage_sg to avoid sending three
> TCP packets instead of one by settings the MSG_MORE when calling kernel_sendmsg
> via the wrapper functions tx_data and iscsit_do_tx_data. This reduces the TCP
> overhead by sending the same data in less TCP packets and minimized the TCP RTP
> when TCP auto corking is enabled. When creating a 500 GB VMFS filesystem the
> filesystem is created in 3 seconds instead of 4 seconds.
> 
> Signed-off-by: Thomas Glanzmann <thomas@glanzmann.de>
> X-tested-by: Thomas Glanzmann <thomas@glanzmann.de>
> ---

Hmm, thanks but this is not how to do this.

When you submit a patch written by someone else, you should :

1) Use your own identity as the sender, not impersonate me.
( thats standard convention )

2) Put following line as first line of the mail
( Documentation/SubmittingPatches lines ~565)

From: Eric Dumazet <edumazet@google.com>

Then I'll add my :
Signed-off-by: Eric Dumazet <edumazet@google.com>

Anyway, patch is not yet complete : We also want to set
MSG_MORE/MSG_SENDPAGE_NOTLAST for all pages but last one in a sg list.


This will fix suboptimal traffic :

13:32:04.976923 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [.], seq 289953:292849, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2896
13:32:04.976936 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [.], seq 292849:295745, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2896
13:32:04.976944 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 295745:298193, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2448
13:32:04.976952 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [.], seq 298193:301089, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2896
13:32:04.976960 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [.], seq 301089:303985, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2896
13:32:04.976998 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 303985:306385, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2400

Please try following updated patch, thanks !

Once tested, we'll submit it formally.

 drivers/target/iscsi/iscsi_target_parameters.c |    2 
 drivers/target/iscsi/iscsi_target_util.c       |   38 +++++++++------
 drivers/target/iscsi/iscsi_target_util.h       |    2 
 3 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/drivers/target/iscsi/iscsi_target_parameters.c b/drivers/target/iscsi/iscsi_target_parameters.c
index 4d2e23fc76fd..b80239250a1c 100644
--- a/drivers/target/iscsi/iscsi_target_parameters.c
+++ b/drivers/target/iscsi/iscsi_target_parameters.c
@@ -79,7 +79,7 @@ int iscsi_login_tx_data(
 	 */
 	conn->if_marker += length;
 
-	tx_sent = tx_data(conn, &iov[0], iov_cnt, length);
+	tx_sent = tx_data(conn, &iov[0], iov_cnt, length, 0);
 	if (tx_sent != length) {
 		pr_err("tx_data returned %d, expecting %d.\n",
 				tx_sent, length);
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 0819e688a398..3c529f7c61ce 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -1165,7 +1165,7 @@ send_data:
 		iov_count = cmd->iov_misc_count;
 	}
 
-	tx_sent = tx_data(conn, &iov[0], iov_count, tx_size);
+	tx_sent = tx_data(conn, &iov[0], iov_count, tx_size, 0);
 	if (tx_size != tx_sent) {
 		if (tx_sent == -EAGAIN) {
 			pr_err("tx_data() returned -EAGAIN\n");
@@ -1196,7 +1196,8 @@ send_hdr:
 	iov.iov_base = cmd->pdu;
 	iov.iov_len = tx_hdr_size;
 
-	tx_sent = tx_data(conn, &iov, 1, tx_hdr_size);
+	tx_sent = tx_data(conn, &iov, 1, tx_hdr_size,
+			  cmd->tx_size != tx_hdr_size ? MSG_MORE : 0);
 	if (tx_hdr_size != tx_sent) {
 		if (tx_sent == -EAGAIN) {
 			pr_err("tx_data() returned -EAGAIN\n");
@@ -1225,18 +1226,24 @@ send_hdr:
 	while (data_len) {
 		u32 space = (sg->length - offset);
 		u32 sub_len = min_t(u32, data_len, space);
+		int flags = 0;
+	
+		if ((data_len != sub_len) || cmd->padding ||
+		    conn->conn_ops->DataDigest)
+			flags = MSG_SENDPAGE_NOTLAST | MSG_MORE;
+
 send_pg:
 		tx_sent = conn->sock->ops->sendpage(conn->sock,
-					sg_page(sg), sg->offset + offset, sub_len, 0);
+						    sg_page(sg),
+						    sg->offset + offset,
+						    sub_len, flags);
 		if (tx_sent != sub_len) {
 			if (tx_sent == -EAGAIN) {
-				pr_err("tcp_sendpage() returned"
-						" -EAGAIN\n");
+				pr_err("tcp_sendpage() returned -EAGAIN\n");
 				goto send_pg;
 			}
 
-			pr_err("tcp_sendpage() failure: %d\n",
-					tx_sent);
+			pr_err("tcp_sendpage() failure: %d\n", tx_sent);
 			return -1;
 		}
 
@@ -1249,7 +1256,8 @@ send_padding:
 	if (cmd->padding) {
 		struct kvec *iov_p = &cmd->iov_data[iov_off++];
 
-		tx_sent = tx_data(conn, iov_p, 1, cmd->padding);
+		tx_sent = tx_data(conn, iov_p, 1, cmd->padding,
+				  conn->conn_ops->DataDigest ? MSG_MORE : 0);
 		if (cmd->padding != tx_sent) {
 			if (tx_sent == -EAGAIN) {
 				pr_err("tx_data() returned -EAGAIN\n");
@@ -1263,7 +1271,7 @@ send_datacrc:
 	if (conn->conn_ops->DataDigest) {
 		struct kvec *iov_d = &cmd->iov_data[iov_off];
 
-		tx_sent = tx_data(conn, iov_d, 1, ISCSI_CRC_LEN);
+		tx_sent = tx_data(conn, iov_d, 1, ISCSI_CRC_LEN, 0);
 		if (ISCSI_CRC_LEN != tx_sent) {
 			if (tx_sent == -EAGAIN) {
 				pr_err("tx_data() returned -EAGAIN\n");
@@ -1349,11 +1357,12 @@ static int iscsit_do_rx_data(
 
 static int iscsit_do_tx_data(
 	struct iscsi_conn *conn,
-	struct iscsi_data_count *count)
+	struct iscsi_data_count *count,
+	int flags)
 {
 	int data = count->data_length, total_tx = 0, tx_loop = 0, iov_len;
 	struct kvec *iov_p;
-	struct msghdr msg;
+	struct msghdr msg = { .msg_flags = flags };
 
 	if (!conn || !conn->sock || !conn->conn_ops)
 		return -1;
@@ -1363,8 +1372,6 @@ static int iscsit_do_tx_data(
 		return -1;
 	}
 
-	memset(&msg, 0, sizeof(struct msghdr));
-
 	iov_p = count->iov;
 	iov_len = count->iov_count;
 
@@ -1408,7 +1415,8 @@ int tx_data(
 	struct iscsi_conn *conn,
 	struct kvec *iov,
 	int iov_count,
-	int data)
+	int data,
+	int flags)
 {
 	struct iscsi_data_count c;
 
@@ -1421,7 +1429,7 @@ int tx_data(
 	c.data_length = data;
 	c.type = ISCSI_TX_DATA;
 
-	return iscsit_do_tx_data(conn, &c);
+	return iscsit_do_tx_data(conn, &c, flags);
 }
 
 void iscsit_collect_login_stats(
diff --git a/drivers/target/iscsi/iscsi_target_util.h b/drivers/target/iscsi/iscsi_target_util.h
index e4fc34a02f57..1b4f06801adc 100644
--- a/drivers/target/iscsi/iscsi_target_util.h
+++ b/drivers/target/iscsi/iscsi_target_util.h
@@ -54,7 +54,7 @@ extern int iscsit_print_dev_to_proc(char *, char **, off_t, int);
 extern int iscsit_print_sessions_to_proc(char *, char **, off_t, int);
 extern int iscsit_print_tpg_to_proc(char *, char **, off_t, int);
 extern int rx_data(struct iscsi_conn *, struct kvec *, int, int);
-extern int tx_data(struct iscsi_conn *, struct kvec *, int, int);
+extern int tx_data(struct iscsi_conn *, struct kvec *, int, int, int);
 extern void iscsit_collect_login_stats(struct iscsi_conn *, u8, u8);
 extern struct iscsi_tiqn *iscsit_snmp_get_tiqn(struct iscsi_conn *);
 

^ permalink raw reply related

* Re: 6lowpan: lockless tx queue of routing netlink device
From: Eric Dumazet @ 2014-02-09 12:41 UTC (permalink / raw)
  To: Alexander Aring; +Cc: netdev, linux-zigbee-devel
In-Reply-To: <20140209102047.GA14770@omega>

On Sun, 2014-02-09 at 11:20 +0100, Alexander Aring wrote:
> Hi,
> 
> I got some locking issues with CONFIG_PROVE_LOCKING enabled and need help.
> 
> Full output:
> 
> =============================================
> [ INFO: possible recursive locking detected ]
> 3.13.0-08605-g8f2b630-dirty #105 Not tainted
> ---------------------------------------------
> agetty/841 is trying to acquire lock:
>  (_xmit_IEEE802154#2){+.-...}, at: [<c0356b39>] sch_direct_xmit+0x34/0x122
> 
> but task is already holding lock:
>  (_xmit_IEEE802154#2){+.-...}, at: [<c0346926>] __dev_queue_xmit+0x26e/0x329
> 
> other info that might help us debug this:
>  Possible unsafe locking scenario:
> 
>        CPU0
>        ----
>   lock(_xmit_IEEE802154#2);
>   lock(_xmit_IEEE802154#2);
> 
>  *** DEADLOCK ***
> 
>  May be due to missing lock nesting notation
> 
> 6 locks held by agetty/841:
>  #0:  (((&idev->mc_ifc_timer))){+.-...}, at: [<c012b6f2>] call_timer_fn+0x0/0xb3
>  #1:  (rcu_read_lock){.+.+..}, at: [<c03b4335>] rcu_read_lock+0x0/0x21
>  #2:  (rcu_read_lock_bh){.+....}, at: [<c039d39d>] rcu_lock_acquire+0x0/0x1c
>  #3:  (rcu_read_lock_bh){.+....}, at: [<c03426e5>] rcu_lock_acquire+0x0/0x1c
>  #4:  (_xmit_IEEE802154#2){+.-...}, at: [<c0346926>] __dev_queue_xmit+0x26e/0x329
>  #5:  (rcu_read_lock_bh){.+....}, at: [<c03426e5>] rcu_lock_acquire+0x0/0x1c
> 
> 
> 
> The solution was for me to change
> dev->type = ARPHRD_IEEE802154
> to
> dev->type = ARPHRD_6LOWPAN
> of the rtnl device. What we really shall do in the near future. (I have
> a patch for this).
> 
> 
> Another solution was to add:
> dev->features           |= NETIF_F_LLTX;
> in setup callback of rtnl device.
> 
> 
> This enables a lockless tx queue.
> I am not sure if we can do a lockless tx queue here and the comment of
> NETIF_F_LLTX says it's deprecated "/* do not use LLTX in new drivers */".
> Exists there some alternative for this?
> 
> 
> 
> So a little bit more information about the current architecture which is
> a little bit complex for tx. Maybe then it's more clear how to fix this
> issue correctly.
> 
> To setup a lowpan interface you need to run:
> "ip link add link $WPAN_INTERFACE name $LOWPAN_INTERFACE type lowpan"
> 
> This setups a lowpan interface which "sitting" on top of the
> $WPAN_INTERFACE.
> 
> The lowpan rtnl implementation saves pointers from both interfaces we
> name it:
> 
> "real_dev" <-- WPAN_INTERFACE
> "dev" <-- LOWPAN_INTERFACE
> 
> 
> If we get some "usually" ipv6 packets from LOWPAN_INTERFACE which calls
> header_create function, then we doing some manipulating of sk_buff there.
> 
> After this we calling dev_hard_header with the callback of
> WPAN_INTERFACE to generate the mac header.
> 
> Then we are in the xmit callback of LOWPAN_INTERFACE and doing a
> skb->dev pointer change from LOWPAN_INTERFACE to the WPAN_INTERFACE and
> calling dev_queue_xmit to send it via the WPAN_INTERFACE.
> The skb->dev switch is necessary because we call then the xmit callback
> of the WPAN_INTERFACE, the LOWPAN_INTERFACE is more a "virtual" interface.
> 
> I think that's the problem. We have two dev_queue_xmit calls first from 
> LOWPAN_INTERFACE then the WPAN_INTERFACE, so we locking something twice.
> 
> 
> That's very much complicated and I think we doing some hacked stuff
> there but currently it works so (except the locking issue). :-)

Please try the following fix, thanks for this report !

diff --git a/net/ieee802154/6lowpan.c b/net/ieee802154/6lowpan.c
index 48b25c0af4d0..069af33013c4 100644
--- a/net/ieee802154/6lowpan.c
+++ b/net/ieee802154/6lowpan.c
@@ -533,7 +533,27 @@ static struct header_ops lowpan_header_ops = {
 	.create	= lowpan_header_create,
 };
 
+static struct lock_class_key lowpan_tx_busylock;
+static struct lock_class_key lowpan_netdev_xmit_lock_key;
+
+static void lowpan_set_lockdep_class_one(struct net_device *dev,
+					 struct netdev_queue *txq,
+					 void *_unused)
+{
+	lockdep_set_class(&txq->_xmit_lock,
+			  &lowpan_netdev_xmit_lock_key);
+}
+
+
+static int lowpan_dev_init(struct net_device *dev)
+{
+	netdev_for_each_tx_queue(dev, lowpan_set_lockdep_class_one, NULL);
+	dev->qdisc_tx_busylock = &lowpan_tx_busylock;
+	return 0;
+}
+
 static const struct net_device_ops lowpan_netdev_ops = {
+	.ndo_init		= lowpan_dev_init,
 	.ndo_start_xmit		= lowpan_xmit,
 	.ndo_set_mac_address	= lowpan_set_address,
 };

^ permalink raw reply related

* Re: [PATCH] bnx2x: Update to FW 7.8.19
From: Ben Hutchings @ 2014-02-09 13:20 UTC (permalink / raw)
  To: Yuval Mintz; +Cc: dwmw2, netdev, Dmitry Kravkov, Ariel Elior
In-Reply-To: <1390921061-22019-1-git-send-email-yuvalmin@broadcom.com>

[-- Attachment #1: Type: text/plain, Size: 914 bytes --]

On Tue, 2014-01-28 at 16:57 +0200, Yuval Mintz wrote:
> This new firmware fixes several bugs:
>  1. HW attention appears and traffic stops when iSCSI firmware tries to
>     retransmit iSCSI login command when the iSCSI login is carrying data
>     not aligned to 4-bytes.
>  2. FCoE traffic fails to run when running in switch-independent multi-function
>     mode and there's more than one interface supporting FCoE on a given port.
>  3. While two ports are running FCoE with at least one of them has a function
>     number (>1) on the same engine in a 4-port device a zeroed CQE is given,
>     causing FCoE traffic to stop.
> 
> Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
> Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
> Signed-off-by: Ariel Elior <ariele@broadcom.com>
[...]

Applied.

Ben.

-- 
Ben Hutchings
If at first you don't succeed, you're doing about average.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply

* Re: [PATCH] myri10ge: update firmware to 1.4.57
From: Ben Hutchings @ 2014-02-09 13:25 UTC (permalink / raw)
  To: Hyong-Youb Kim; +Cc: David Woodhouse, netdev, Hyong-Youb Kim
In-Reply-To: <1391254899-61702-1-git-send-email-hyong-youb.kim@myricom.com>

[-- Attachment #1: Type: text/plain, Size: 1121 bytes --]

On Sat, 2014-02-01 at 03:41 -0800, Hyong-Youb Kim wrote:
> From: Hyong-Youb Kim <hykim@myri.com>
> 
> Update firmware images for Myri10GE NICs.
> 
> Signed-off-by: Hyong-Youb Kim <hykim@myri.com>
> ---
>  WHENCE                        |    6 +++++-
>  myri10ge_eth_big_z8e.dat      |  Bin 0 -> 378832 bytes
>  myri10ge_eth_z8e.dat          |  Bin 367464 -> 378736 bytes
>  myri10ge_ethp_big_z8e.dat     |  Bin 0 -> 389144 bytes
>  myri10ge_ethp_z8e.dat         |  Bin 377784 -> 389056 bytes
>  myri10ge_rss_eth_big_z8e.dat  |  Bin 0 -> 536192 bytes
>  myri10ge_rss_eth_z8e.dat      |  Bin 553192 -> 536176 bytes
>  myri10ge_rss_ethp_big_z8e.dat |  Bin 0 -> 545936 bytes
>  myri10ge_rss_ethp_z8e.dat     |  Bin 563592 -> 545920 bytes
>  9 files changed, 5 insertions(+), 1 deletions(-)
>  create mode 100644 myri10ge_eth_big_z8e.dat
>  create mode 100644 myri10ge_ethp_big_z8e.dat
>  create mode 100644 myri10ge_rss_eth_big_z8e.dat
>  create mode 100644 myri10ge_rss_ethp_big_z8e.dat
[...]

Applied, thanks.

Ben.

-- 
Ben Hutchings
If at first you don't succeed, you're doing about average.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply

* Re: [PATCH] rtlwifi: Add firmware for rtl8192ee
From: Ben Hutchings @ 2014-02-09 13:26 UTC (permalink / raw)
  To: Larry Finger; +Cc: dwmw2, linux-wireless, netdev
In-Reply-To: <1390944643-7388-1-git-send-email-Larry.Finger@lwfinger.net>

[-- Attachment #1: Type: text/plain, Size: 710 bytes --]

On Tue, 2014-01-28 at 15:30 -0600, Larry Finger wrote:
> This firmware is for a driver for the Realtek RTL8192EE PCI WIFI device.
> It came from a tarball from Realtek named
> rtl_92ce_92se_92de_8723ae_88ee_8723be_92ee_linux_mac80211_0017.1224.2013.tar.gz
> and is allowed to be distributed under the LICENCE.rtlwifi_firmware.txt license.
> 
> Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
> ---
>  WHENCE                  |   9 +++++++++
>  rtlwifi/rtl8192eefw.bin | Bin 0 -> 32754 bytes
>  2 files changed, 9 insertions(+)
>  create mode 100644 rtlwifi/rtl8192eefw.bin
[...]

Applied, thanks.

Ben.

-- 
Ben Hutchings
If at first you don't succeed, you're doing about average.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply

* [PATCH v2 01/13] net: Mark function as static in 9p/client.c
From: Rashika Kheria @ 2014-02-09 14:27 UTC (permalink / raw)
  To: linux-kernel, Eric Van Hensbergen, Ron Minnich, Latchesar Ionkov,
	David S. Miller, v9fs-developer, netdev, josh
In-Reply-To: <6f029c895035908595957fb16ab445c82793c77d.1391888654.git.rashika.kheria@gmail.com>

Mark function as static in net/9p/client.c because it is not used
outside this file.

This eliminates the following warning in net/9p/client.c:
net/9p/client.c:207:18: warning: no previous prototype for ‘p9_fcall_alloc’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 net/9p/client.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/9p/client.c b/net/9p/client.c
index a5e4d2d..9186550 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -204,7 +204,7 @@ free_and_return:
 	return ret;
 }
 
-struct p9_fcall *p9_fcall_alloc(int alloc_msize)
+static struct p9_fcall *p9_fcall_alloc(int alloc_msize)
 {
 	struct p9_fcall *fc;
 	fc = kmalloc(sizeof(struct p9_fcall) + alloc_msize, GFP_NOFS);
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH v2 02/13] net: Include appropriate header file in caif/caif_dev.c
From: Rashika Kheria @ 2014-02-09 14:29 UTC (permalink / raw)
  To: linux-kernel, Dmitry Tarnyagin, David S. Miller, netdev, josh
In-Reply-To: <6f029c895035908595957fb16ab445c82793c77d.1391955924.git.rashika.kheria@gmail.com>

Include appropriate header file net/caif/caif_dev.h in caif/caif_dev.c
because it has prototype declarations of function defined in
caif/caif_dev.c.

This eliminates the following file in caif/caif_dev.c:
net/caif/caif_dev.c:303:6: warning: no previous prototype for ‘caif_enroll_dev’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 net/caif/caif_dev.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/net/caif/caif_dev.c b/net/caif/caif_dev.c
index 4dca159..edbca46 100644
--- a/net/caif/caif_dev.c
+++ b/net/caif/caif_dev.c
@@ -22,6 +22,7 @@
 #include <net/pkt_sched.h>
 #include <net/caif/caif_device.h>
 #include <net/caif/caif_layer.h>
+#include <net/caif/caif_dev.h>
 #include <net/caif/cfpkt.h>
 #include <net/caif/cfcnfg.h>
 #include <net/caif/cfserl.h>
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH v2 03/13] net: Include appropriate header file in caif/cfsrvl.c
From: Rashika Kheria @ 2014-02-09 14:32 UTC (permalink / raw)
  To: linux-kernel, Dmitry Tarnyagin, David S. Miller, netdev, josh
In-Reply-To: <6f029c895035908595957fb16ab445c82793c77d.1391955924.git.rashika.kheria@gmail.com>

Include appropriate header file net/caif/caif_dev.h in caif/cfsrvl.c
because it has prototype declaration of functions defined in
caif/cfsrvl.c.

This eliminates the following warning in caif/cfsrvl.c:
net/caif/cfsrvl.c:198:6: warning: no previous prototype for ‘caif_free_client’ [-Wmissing-prototypes]
net/caif/cfsrvl.c:208:6: warning: no previous prototype for ‘caif_client_register_refcnt’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 net/caif/cfsrvl.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/net/caif/cfsrvl.c b/net/caif/cfsrvl.c
index 353f793..a6e1154 100644
--- a/net/caif/cfsrvl.c
+++ b/net/caif/cfsrvl.c
@@ -15,6 +15,7 @@
 #include <net/caif/caif_layer.h>
 #include <net/caif/cfsrvl.h>
 #include <net/caif/cfpkt.h>
+#include <net/caif/caif_dev.h>
 
 #define SRVL_CTRL_PKT_SIZE 1
 #define SRVL_FLOW_OFF 0x81
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH v2 04/13] net: Mark functions as static in core/dev.c
From: Rashika Kheria @ 2014-02-09 14:56 UTC (permalink / raw)
  To: linux-kernel, josh, David S. Miller, Eric Dumazet,
	Veaceslav Falico, Nicolas Dichtel, Jiri Pirko, Pravin B Shelar,
	Cong Wang, netdev
In-Reply-To: <6f029c895035908595957fb16ab445c82793c77d.1391955924.git.rashika.kheria@gmail.com>

Mark functions as static in core/dev.c because they are not used outside
this file.

This eliminates the following warning in core/dev.c:
net/core/dev.c:2806:5: warning: no previous prototype for ‘__dev_queue_xmit’ [-Wmissing-prototypes]
net/core/dev.c:4640:5: warning: no previous prototype for ‘netdev_adjacent_sysfs_add’ [-Wmissing-prototypes]
net/core/dev.c:4650:6: warning: no previous prototype for ‘netdev_adjacent_sysfs_del’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Veaceslav Falico <vfalico@redhat.com>
---
 net/core/dev.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 3721db7..4ad1b78 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2803,7 +2803,7 @@ EXPORT_SYMBOL(dev_loopback_xmit);
  *      the BH enable code must have IRQs enabled so that it will not deadlock.
  *          --BLG
  */
-int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
+static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
 {
 	struct net_device *dev = skb->dev;
 	struct netdev_queue *txq;
@@ -4637,7 +4637,7 @@ struct net_device *netdev_master_upper_dev_get_rcu(struct net_device *dev)
 }
 EXPORT_SYMBOL(netdev_master_upper_dev_get_rcu);
 
-int netdev_adjacent_sysfs_add(struct net_device *dev,
+static int netdev_adjacent_sysfs_add(struct net_device *dev,
 			      struct net_device *adj_dev,
 			      struct list_head *dev_list)
 {
@@ -4647,7 +4647,7 @@ int netdev_adjacent_sysfs_add(struct net_device *dev,
 	return sysfs_create_link(&(dev->dev.kobj), &(adj_dev->dev.kobj),
 				 linkname);
 }
-void netdev_adjacent_sysfs_del(struct net_device *dev,
+static void netdev_adjacent_sysfs_del(struct net_device *dev,
 			       char *name,
 			       struct list_head *dev_list)
 {
-- 
1.7.9.5

^ permalink raw reply related

* Re: Poor network performance x86_64.. also with 3.13
From: Daniel Exner @ 2014-02-09 15:05 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-kernel, netdev
In-Reply-To: <20140120223752.GA3057@pd.tnic>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

Hi all,

Am Mon, 20 Jan 2014 23:37:52 +0100
schrieb Borislav Petkov <bp@alien8.de>:

> On Mon, Jan 20, 2014 at 11:27:25PM +0100, Daniel Exner wrote:
> > I just did the same procedure with Kernel Version 3.13: same poor
> > rates.
> > 
> > I think I will try to see of 3.12.6 was still ok and bisect from
> > there.
> 
> Or try something more coarse-grained like 3.11 first, then 3.12 and
> then the -rcs in between.
> 

I must apologize for suspecting the kernel for my problems. After some
bisect attempts I finaly noticed the following:

> cat /etc/sysctl.d/net.conf
> net.ipv4.tcp_window_scaling = 1
> net.core.rmem_max = 16777216
> net.ipv4.tcp_rmem = 4096 87380 16777
> net.ipv4.tcp_wmem = 4096       1638

After removing those values I finally had sane iperf values.
No idea how those got there, perhaps they made sense when I first setup
the box, which is some years ago..

Anyway, thanks all for your help :)

Greetings
Daniel Exner
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)

iQIcBAEBCgAGBQJS95knAAoJEPI6v6bI/QkfZYYP/37WBvygR7gLKqFTYfQA2ALE
n6cOLrogoJT8Cf3q1fLqKiSzPToxSuPBTTmQtaNnhxLKTCPFHxLYPbTdtlXGTPB1
vFVJmXg8WAM/kQD/IoHrMsZsHfRWZE+RtQrUfeQ4Ava6abmniBufVe7ERMuF6ddW
02F5COtw74LJuSbxS70Cn3reog/ExoIYOYKQn6+FpoKTME4WnZtA8DJxo1r077RL
mNqo3D4OMrYdPhxyRjLygtCnmXuX/yynV2czBnFkME4f1B4P/1hIzqYCxa2dBQIM
Pfr+b/TtyVZA3DsE1d22f/+34EFWE/06EM5l8KwImmRHGA9Ffx77jKX4sAxN0Hhg
9ZJleeddk4NahXur5WNAV4lrkiLUgGauC0k721KwBFecFy2gYK/OUIyOm/oA3IPT
WAEeGT4nCfCa1vYfoZVBn5UWOZo1eLm5qh6dmGb9FHukmwWTEplRRvYDPyJNfmRg
0j5mvn7ymFIbnmkVSnBFdfJH0I6XhhiHQ9H3cb9It9OLH5eEK1x4AW1okkAwrquQ
oNYkpq54aJS/3oDokyWN/Gkvkmmk+4Q6tpxQge0AQPhrNeft5X7b8VhffstWhTSF
kO1ULQ+zOtRUHF6T5523qVcS3pzFfqQKPYPhhQspGvuPJEr0M94i2JS016Z84Cz6
krmaHvSO/MKFkm7w+x5d
=90En
-----END PGP SIGNATURE-----

^ permalink raw reply

* RFC: bridge get fdb by bridge device
From: Jamal Hadi Salim @ 2014-02-09 15:06 UTC (permalink / raw)
  To: netdev@vger.kernel.org
  Cc: vyasevic, Stephen Hemminger, Scott Feldman, John Fastabend
In-Reply-To: <52F3E357.4040006@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 600 bytes --]


This patch allows something equivalent to
"brctl showmacs <bridge device>" with iproute2
syntax "bridge link br <bridge device>"
Filtering by bridge is done in the kernel.
The current setup doesnt scale when you have many bridges each
with large fdbs (preliminary fix with the kernel patch).

iproute2 allows filtering by bridge port, example:
"bridge link br br1234 dev port1234"
but the filtering is done in user space.
In a future patch i would like to do the port filtering
in the kernel. As well, adding a MAC filter in the kernel
makes sense.

Kernel patch is against net-next.

cheers,
jamal

[-- Attachment #2: bridge-fdb-filter1 --]
[-- Type: text/plain, Size: 2009 bytes --]

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 393b1bc..507ea4e 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2423,26 +2423,50 @@ static int rtnl_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	int idx = 0;
 	struct net *net = sock_net(skb->sk);
+	const struct net_device_ops *ops;
 	struct net_device *dev;
+	struct ndmsg *ndm;
 
-	rcu_read_lock();
-	for_each_netdev_rcu(net, dev) {
-		if (dev->priv_flags & IFF_BRIDGE_PORT) {
-			struct net_device *br_dev;
-			const struct net_device_ops *ops;
-
-			br_dev = netdev_master_upper_dev_get(dev);
-			ops = br_dev->netdev_ops;
-			if (ops->ndo_fdb_dump)
-				idx = ops->ndo_fdb_dump(skb, cb, dev, idx);
+	ndm = nlmsg_data(cb->nlh);
+	if (ndm->ndm_ifindex) {
+		dev = __dev_get_by_index(net, ndm->ndm_ifindex);
+		if (dev == NULL) {
+			pr_info("PF_BRIDGE: RTM_GETNEIGH with unknown ifindex\n");
+			return -ENODEV;
+		}
+	
+		if (!(dev->priv_flags & IFF_EBRIDGE)) {
+			pr_info("PF_BRIDGE: RTM_GETNEIGH %s not a bridge device\n",
+				dev->name);
+			return -EINVAL;
 		}
+		ops = dev->netdev_ops;
+		if (ops->ndo_fdb_dump) {
+			idx = ops->ndo_fdb_dump(skb, cb, dev, idx);
+		} else {
+			pr_info("PF_BRIDGE: RTM_GETNEIGH %s no dumper\n",
+				dev->name);
+			return -EINVAL;
+		}
+	} else {
+		rcu_read_lock();
+		for_each_netdev_rcu(net, dev) {
+			if (dev->priv_flags & IFF_BRIDGE_PORT) {
+				struct net_device *br_dev;
+				br_dev = netdev_master_upper_dev_get(dev);
+				ops = br_dev->netdev_ops;
+				if (ops->ndo_fdb_dump)
+					idx = ops->ndo_fdb_dump(skb, cb, dev, idx);
+			}
 
-		if (dev->netdev_ops->ndo_fdb_dump)
-			idx = dev->netdev_ops->ndo_fdb_dump(skb, cb, dev, idx);
-		else
-			idx = ndo_dflt_fdb_dump(skb, cb, dev, idx);
+			if (dev->netdev_ops->ndo_fdb_dump)
+				idx = dev->netdev_ops->ndo_fdb_dump(skb, cb, dev,
+								    idx);
+			else
+				idx = ndo_dflt_fdb_dump(skb, cb, dev, idx);
+		}
+		rcu_read_unlock();
 	}
-	rcu_read_unlock();
 
 	cb->args[0] = idx;
 	return skb->len;

[-- Attachment #3: iprt-fdb-brfilter1 --]
[-- Type: text/plain, Size: 1896 bytes --]

diff --git a/bridge/fdb.c b/bridge/fdb.c
index e2e53f1..f3073d6 100644
--- a/bridge/fdb.c
+++ b/bridge/fdb.c
@@ -33,7 +33,7 @@ static void usage(void)
 	fprintf(stderr, "Usage: bridge fdb { add | append | del | replace } ADDR dev DEV {self|master} [ temp ]\n"
 		        "              [router] [ dst IPADDR] [ vlan VID ]\n"
 		        "              [ port PORT] [ vni VNI ] [via DEV]\n");
-	fprintf(stderr, "       bridge fdb {show} [ dev DEV ]\n");
+	fprintf(stderr, "       bridge fdb {show} [ br BRDEV ] [ dev DEV ]\n");
 	exit(-1);
 }
 
@@ -152,18 +152,35 @@ int print_fdb(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 
 static int fdb_show(int argc, char **argv)
 {
+	struct ndmsg ndm = { };
 	char *filter_dev = NULL;
+	char *br = NULL;
+
+	ndm.ndm_family = PF_BRIDGE;
+	ndm.ndm_state = NUD_NOARP;
 
 	while (argc > 0) {
-		if (strcmp(*argv, "dev") == 0) {
+		if ((strcmp(*argv, "port") == 0) || strcmp(*argv, "dev") == 0) {
 			NEXT_ARG();
-			if (filter_dev)
-				duparg("dev", *argv);
 			filter_dev = *argv;
+		} else if (strcmp(*argv, "br") == 0) {
+			NEXT_ARG();
+			br = *argv;
+		} else {
+			if (matches(*argv, "help") == 0)
+				usage();
 		}
 		argc--; argv++;
 	}
 
+	if (br) {
+		ndm.ndm_ifindex = ll_name_to_index(br);
+		if (ndm.ndm_ifindex == 0) {
+			fprintf(stderr, "Cannot find bridge device \"%s\"\n", br);
+			return -1;
+		}
+	}
+
 	if (filter_dev) {
 		filter_index = if_nametoindex(filter_dev);
 		if (filter_index == 0) {
@@ -171,13 +188,15 @@ static int fdb_show(int argc, char **argv)
 				filter_dev);
 			return -1;
 		}
+
 	}
 
-	if (rtnl_wilddump_request(&rth, PF_BRIDGE, RTM_GETNEIGH) < 0) {
+	if (rtnl_dump_request(&rth, RTM_GETNEIGH, &ndm, sizeof(struct ndmsg)) < 0) {
 		perror("Cannot send dump request");
 		exit(1);
 	}
 
+
 	if (rtnl_dump_filter(&rth, print_fdb, stdout) < 0) {
 		fprintf(stderr, "Dump terminated\n");
 		exit(1);

^ permalink raw reply related

* Re: [PATCH] This extends tx_data and and iscsit_do_tx_data with the additional parameter flags and avoids sending multiple TCP packets in iscsit_fe_sendpage_sg
From: Thomas Glanzmann @ 2014-02-09 15:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: John Ogness, Eric Dumazet, David S. Miller, Nicholas A. Bellinger,
	target-devel, Linux Network Development, LKML
In-Reply-To: <1391949039.10160.129.camel@edumazet-glaptop2.roam.corp.google.com>

Hello Eric,

> 1) Use your own identity as the sender, not impersonate me.
> ( thats standard convention )

sorry about that, will not happen ever again.

> 2) Put following line as first line of the mail
> ( Documentation/SubmittingPatches lines ~565)

> From: Eric Dumazet <edumazet@google.com>

> Then I'll add my :
> Signed-off-by: Eric Dumazet <edumazet@google.com>

I see. Thank you for the awareness training. I read SubmittingPatches
completly.

> Anyway, patch is not yet complete : We also want to set
> MSG_MORE/MSG_SENDPAGE_NOTLAST for all pages but last one in a sg list.

I see.

> This will fix suboptimal traffic :

> 13:32:04.976923 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [.], seq 289953:292849, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2896
> 13:32:04.976936 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [.], seq 292849:295745, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2896
> 13:32:04.976944 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 295745:298193, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2448
> 13:32:04.976952 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [.], seq 298193:301089, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2896
> 13:32:04.976960 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [.], seq 301089:303985, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2896
> 13:32:04.976998 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 303985:306385, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2400

What is suboptimal about the traffic, could they all go in one packet?
Since my MTU is 1500 I assume that the network card will split this then
in MTU sized packets, is that correct? Should I repeat the test with MTU
9000 as well?

> Please try following updated patch, thanks!

This time it took 2 seconds instead of 4 seconds (3.12) to create the
filesystem. Find pcap here:

https://thomas.glanzmann.de/tmp/tcp_auto_corking_on_patched_tcp_more_notlast.pcap.bz2

> Once tested, we'll submit it formally.

Let me know if you want to submit or I should. If I should do it I would
split it up in two patches, one for the interface change and one for the
packet submission logic. Btw. your last patches did not apply for me
because I cut & pasted them from e-mail instead of saving it in an
editor this one. So your patch was fine but they way I tried to apply it
was flawed.

Cheers,
        Thomas

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox