Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v3 0/8] arm64: rockchip: Initial GeekBox enablement
From: Giuseppe CAVALLARO @ 2016-03-31  7:53 UTC (permalink / raw)
  To: Dinh Nguyen
  Cc: devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Heiko Stübner, netdev-u79uwXL29TY76Z2rM5mHXA,
	Gabriel Fernandez, LKML, Frank Schäfer,
	open list:ARM/Rockchip SoC..., LAKML, Fabrice GASNIER,
	Andreas Färber, Tomeu Vizoso, Alexandre TORGUE
In-Reply-To: <CADhT+wdXXp322vmgFmWTUiiRZTsCWeJSSdU=BMEEbSyb_bnrUw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 3/30/2016 6:44 PM, Dinh Nguyen wrote:
> On Tue, Mar 15, 2016 at 7:36 AM, Giuseppe CAVALLARO
> <peppe.cavallaro-qxv4g6HH51o@public.gmane.org> wrote:
>> Hello Tomeu
>>
>> On 3/15/2016 8:23 AM, Tomeu Vizoso wrote:
>>>
>>> Thanks.
>>>
>>> Btw, I have rebased on top of 4.5 this morning and I have noticed that
>>> 88f8b1bb41c6 ("stmmac: Fix 'eth0: No PHY found' regression") got in
>>> there, so I guess we have now a bunch of boards with broken network on
>>> that release:(
>>
>>
>>
>> This is the status on my side: I am testing on an HW that has the
>> Enhanced descriptors and all works fine.
>>
>> On this HW, if I force the driver to use the normal descriptor
>> layout, I meet problems but using both net.git and net-next.
>> So I suspect I cannot ply with this HW forcing the normal descriptors.
>> But! That is helping me to check if, on net-next, the stmmac is
>> actually  programming fine the normal desc case.
>> I have just found another fix so I kindly ask you to apply the temp
>> patch  attached and let me know.
>> In details, I have noticed that the OWN bit was not set in the right
>> TDES0.
>>
>> I also ask you to give me a log of the kernel where the stmmac was
>> running fine. I would like to see which configuration it is selected
>> at runtime by the driver on your box.
>>  From your previous logs (where the stmmac failed), it seems that
>> the  problem is on normal desc but, to be honest, this is the first
>> case I see a 3.50a with HW capability register and w/o Enhanced
>> descriptors.
>>
>
> Are you still working on a fix for:
>
> [    1.196110] libphy: PHY stmmac-0:ffffffff not found
> [    1.200972] eth0: Could not attach to PHY
> [    1.204991] stmmac_open: Cannot attach to PHY (error: -19)
>
> I see the error still there as of linux-next 20160330.

this could be because the fixes have been not applied on net-next
I will check and resend all asap

peppe

>
> Dinh
>

^ permalink raw reply

* Re: am335x: no multicast reception over VLAN
From: Yegor Yefremov @ 2016-03-31  7:52 UTC (permalink / raw)
  To: Mugunthan V N
  Cc: Peter Korsgaard, Grygorii Strashko, netdev,
	linux-omap@vger.kernel.org, drivshin, ml
In-Reply-To: <56FCC59B.70209@ti.com>

[-- Attachment #1: Type: text/plain, Size: 6225 bytes --]

On Thu, Mar 31, 2016 at 8:37 AM, Mugunthan V N <mugunthanvnm@ti.com> wrote:
> On Thursday 31 March 2016 01:17 AM, Peter Korsgaard wrote:
>>>>>>> "Mugunthan" == Mugunthan V N <mugunthanvnm@ti.com> writes:
>>
>> Hi,
>>
>>  > You had received these packets as tcpdump will enable promiscuous mode
>>  > so that you receive all the packets from the wire.
>>
>> FYI, you can use the -p option to tcpdump to not put the interface into
>> promiscuous mode.
>>
>
> Thanks for the information Peter Korsgaard.
>
> Yegor, can you provide tcpdump using -p as well in Grygorii commands.

Before VLAN configuration:

# switch-config -d
cpsw hw version 1.12 (0)
0   : type: vlan , vid = 1, untag_force = 0x3, reg_mcast = 0x3,
unreg_mcast = 0x0, member_list = 0x3
1   : type: mcast, vid = 1, addr = ff:ff:ff:ff:ff:ff, mcast_state = f,
no super, port_mask = 0x3
2   : type: ucast, vid = 1, addr = 74:6a:8f:00:16:12, ucast_type =
persistant, port_num = 0x0, Secure
3   : type: vlan , vid = 0, untag_force = 0x7, reg_mcast = 0x0,
unreg_mcast = 0x0, member_list = 0x7
4   : type: mcast, vid = 1, addr = 01:00:5e:00:00:01, mcast_state = f,
no super, port_mask = 0x3
5   : type: vlan , vid = 2, untag_force = 0x5, reg_mcast = 0x5,
unreg_mcast = 0x0, member_list = 0x5
6   : type: mcast, vid = 2, addr = ff:ff:ff:ff:ff:ff, mcast_state = f,
no super, port_mask = 0x5
7   : type: ucast, vid = 2, addr = 74:6a:8f:00:16:13, ucast_type =
persistant, port_num = 0x0, Secure
8   : type: mcast, vid = 2, addr = 01:00:5e:00:00:01, mcast_state = f,
no super, port_mask = 0x5

After VLAN configuration:

# switch-config -d
cpsw hw version 1.12 (0)
0   : type: vlan , vid = 1, untag_force = 0x3, reg_mcast = 0x3,
unreg_mcast = 0x0, member_list = 0x3
1   : type: mcast, vid = 1, addr = ff:ff:ff:ff:ff:ff, mcast_state = f,
no super, port_mask = 0x3
2   : type: ucast, vid = 1, addr = 74:6a:8f:00:16:12, ucast_type =
persistant, port_num = 0x0, Secure
3   : type: vlan , vid = 0, untag_force = 0x7, reg_mcast = 0x0,
unreg_mcast = 0x0, member_list = 0x7
4   : type: mcast, vid = 1, addr = 01:00:5e:00:00:01, mcast_state = f,
no super, port_mask = 0x3
5   : type: vlan , vid = 2, untag_force = 0x5, reg_mcast = 0x5,
unreg_mcast = 0x0, member_list = 0x5
6   : type: mcast, vid = 2, addr = ff:ff:ff:ff:ff:ff, mcast_state = f,
no super, port_mask = 0x5
7   : type: ucast, vid = 2, addr = 74:6a:8f:00:16:13, ucast_type =
persistant, port_num = 0x0, Secure
8   : type: mcast, vid = 2, addr = 01:00:5e:00:00:01, mcast_state = f,
no super, port_mask = 0x5
9   : type: vlan , vid = 100, untag_force = 0x0, reg_mcast = 0x5,
unreg_mcast = 0x0, member_list = 0x5
10  : type: ucast, vid = 100, addr = 74:6a:8f:00:16:13, ucast_type =
persistant, port_num = 0x0
11  : type: mcast, vid = 100, addr = ff:ff:ff:ff:ff:ff, mcast_state =
f, no super, port_mask = 0x5
12  : type: mcast, vid = 2, addr = 01:80:c2:00:00:21, mcast_state = f,
no super, port_mask = 0x5

During mulitcast receive:

# switch-config -d
cpsw hw version 1.12 (0)
0   : type: vlan , vid = 1, untag_force = 0x3, reg_mcast = 0x3,
unreg_mcast = 0x0, member_list = 0x3
1   : type: mcast, vid = 1, addr = ff:ff:ff:ff:ff:ff, mcast_state = f,
no super, port_mask = 0x3
2   : type: ucast, vid = 1, addr = 74:6a:8f:00:16:12, ucast_type =
persistant, port_num = 0x0, Secure
3   : type: vlan , vid = 0, untag_force = 0x7, reg_mcast = 0x0,
unreg_mcast = 0x0, member_list = 0x7
4   : type: mcast, vid = 1, addr = 01:00:5e:00:00:01, mcast_state = f,
no super, port_mask = 0x3
5   : type: vlan , vid = 2, untag_force = 0x5, reg_mcast = 0x5,
unreg_mcast = 0x0, member_list = 0x5
6   : type: mcast, vid = 2, addr = ff:ff:ff:ff:ff:ff, mcast_state = f,
no super, port_mask = 0x5
7   : type: ucast, vid = 2, addr = 74:6a:8f:00:16:13, ucast_type =
persistant, port_num = 0x0, Secure
8   : type: mcast, vid = 2, addr = 01:00:5e:00:00:01, mcast_state = f,
no super, port_mask = 0x5
9   : type: vlan , vid = 100, untag_force = 0x0, reg_mcast = 0x5,
unreg_mcast = 0x0, member_list = 0x5
10  : type: ucast, vid = 100, addr = 74:6a:8f:00:16:13, ucast_type =
persistant, port_num = 0x0
11  : type: mcast, vid = 100, addr = ff:ff:ff:ff:ff:ff, mcast_state =
f, no super, port_mask = 0x5
12  : type: mcast, vid = 2, addr = 01:80:c2:00:00:21, mcast_state = f,
no super, port_mask = 0x5
13  : type: mcast, vid = 2, addr = 01:00:5e:03:1d:47, mcast_state = f,
no super, port_mask = 0x5
14  : type: ucast, vid = 100, addr = 66:22:04:bc:90:26, ucast_type =
untouched , port_num = 0x2


After multicast receive use case:

# switch-config -d
cpsw hw version 1.12 (0)
0   : type: vlan , vid = 1, untag_force = 0x3, reg_mcast = 0x3,
unreg_mcast = 0x0, member_list = 0x3
1   : type: mcast, vid = 1, addr = ff:ff:ff:ff:ff:ff, mcast_state = f,
no super, port_mask = 0x3
2   : type: ucast, vid = 1, addr = 74:6a:8f:00:16:12, ucast_type =
persistant, port_num = 0x0, Secure
3   : type: vlan , vid = 0, untag_force = 0x7, reg_mcast = 0x0,
unreg_mcast = 0x0, member_list = 0x7
4   : type: mcast, vid = 1, addr = 01:00:5e:00:00:01, mcast_state = f,
no super, port_mask = 0x3
5   : type: vlan , vid = 2, untag_force = 0x5, reg_mcast = 0x5,
unreg_mcast = 0x0, member_list = 0x5
6   : type: mcast, vid = 2, addr = ff:ff:ff:ff:ff:ff, mcast_state = f,
no super, port_mask = 0x5
7   : type: ucast, vid = 2, addr = 74:6a:8f:00:16:13, ucast_type =
persistant, port_num = 0x0, Secure
8   : type: mcast, vid = 2, addr = 01:00:5e:00:00:01, mcast_state = f,
no super, port_mask = 0x5
9   : type: vlan , vid = 100, untag_force = 0x0, reg_mcast = 0x5,
unreg_mcast = 0x0, member_list = 0x5
10  : type: ucast, vid = 100, addr = 74:6a:8f:00:16:13, ucast_type =
persistant, port_num = 0x0
11  : type: mcast, vid = 100, addr = ff:ff:ff:ff:ff:ff, mcast_state =
f, no super, port_mask = 0x5
12  : type: mcast, vid = 2, addr = 01:80:c2:00:00:21, mcast_state = f,
no super, port_mask = 0x5
15  : type: ucast, vid = 100, addr = 66:22:04:bc:90:26, ucast_type =
untouched , port_num = 0x2

Both tcpdumps with -p option showed no packets. If I execute ping, I
can see related ICMP packets. addr = 66:22:04:bc:90:26 is PandaBoards
MAC.

Btw I've attached my test scripts (mcastr.py - multicast receiver and
mcastt.py - multicast transmitter). Could you reproduce my setup?

Thanks.

Yegor

[-- Attachment #2: mcastr.py --]
[-- Type: text/plain, Size: 821 bytes --]

import socket
import struct
import sys

multicast_group = '224.3.29.71'
server_address = ('', 10000)

# Create the socket
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Bind to the server address
sock.bind(server_address)

# Tell the operating system to add the socket to the multicast group
# on all interfaces.
group = socket.inet_aton(multicast_group)
mreq = struct.pack('4sL', group, socket.INADDR_ANY)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

# Receive/respond loop
while True:
    print >>sys.stderr, '\nwaiting to receive message'
    data, address = sock.recvfrom(1024)
    
    print >>sys.stderr, 'received %s bytes from %s' % (len(data), address)
    print >>sys.stderr, data

    print >>sys.stderr, 'sending acknowledgement to', address
    sock.sendto('ack', address)

[-- Attachment #3: mcastt.py --]
[-- Type: text/plain, Size: 1075 bytes --]

import socket
import struct
import sys

message = 'very important data'
multicast_group = ('224.3.29.71', 10000)

# Create the datagram socket
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Set a timeout so the socket does not block indefinitely when trying
# to receive data.
sock.settimeout(0.2)

# Set the time-to-live for messages to 1 so they do not go past the
# local network segment.
ttl = struct.pack('b', 1)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)

try:

    # Send data to the multicast group
    print >>sys.stderr, 'sending "%s"' % message
    sent = sock.sendto(message, multicast_group)

    # Look for responses from all recipients
    while True:
        print >>sys.stderr, 'waiting to receive'
        try:
            data, server = sock.recvfrom(16)
        except socket.timeout:
            print >>sys.stderr, 'timed out, no more responses'
            break
        else:
            print >>sys.stderr, 'received "%s" from %s' % (data, server)

finally:
    print >>sys.stderr, 'closing socket'
    sock.close()

^ permalink raw reply

* Re: Question on rhashtable in worst-case scenario.
From: Herbert Xu @ 2016-03-31  7:50 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Ben Greear, David Miller, linux-kernel, linux-wireless, netdev,
	tgraf
In-Reply-To: <1459410405.4576.8.camel@sipsolutions.net>

On Thu, Mar 31, 2016 at 09:46:45AM +0200, Johannes Berg wrote:
>
> In this case, I think perhaps you can just patch your local system with
> the many interfaces connecting to the same AP to add the parameter
> Herbert suggested (.insecure_elasticity = true in sta_rht_params). This
> is, after all, very much a case that "normal" operation doesn't even
> get close to.

I think you should just turn it on everywhere for mac80211.  Chain
length checks simply don't make sense when you allow duplicate
keys in the hash table.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: Question on rhashtable in worst-case scenario.
From: Johannes Berg @ 2016-03-31  7:46 UTC (permalink / raw)
  To: Ben Greear, David Miller
  Cc: linux-kernel, herbert, linux-wireless, netdev, tgraf
In-Reply-To: <56FC0445.6010200@candelatech.com>

On Wed, 2016-03-30 at 09:52 -0700, Ben Greear wrote:

> If someone can fix rhashtable, then great.
> I read some earlier comments [1] back when someone else reported
> similar problems, and the comments seemed to indicate that rhashtable
> was broken in this manner on purpose to protect against hashing
> attacks.
> 
> If you are baking in this type of policy to what should be a basic
> data-type, then it is not useful for how it is being used in
> the mac80211 stack.
> 
> [1]  http://lkml.iu.edu/hypermail/linux/kernel/1512.2/01681.html
> 

That's not really saying it's purposely broken, that's more saying that
Herbert didn't see a point in fixing a case that has awful behaviour
already.

However, I'm confused now - we can much more easily live with
*insertion* failures, as the linked email indicates, than *deletion*
failures, which I think you had indicated originally. Are you really
seeing *deletion* failures?

If there are in fact *deletion* failures then I think we really need to
address those in rhashtable, no matter the worst-case behaviour of the
hashing or keys, since we should be able to delete entries in order to
get back to something reasonable. Looking at the code though, I don't
actually see that happening.

If you're seeing only *insertion* failures, which you indicated in the
root of this thread, then I think for the general case in mac80211 we
can live with that - we use a seeded jhash for the hash function, and
since in the general case we cannot accept entries with identical MAC
addresses to start with, it shouldn't be possible to run into this
problem under normal use cases.

In this case, I think perhaps you can just patch your local system with
the many interfaces connecting to the same AP to add the parameter
Herbert suggested (.insecure_elasticity = true in sta_rht_params). This
is, after all, very much a case that "normal" operation doesn't even
get close to.

johannes

^ permalink raw reply

* Re: [PATCH] stmmac: Fix phy without MDIO subnode
From: Giuseppe CAVALLARO @ 2016-03-31  7:40 UTC (permalink / raw)
  To: Robert Gadsdon, John Keeping; +Cc: Gabriel Fernandez, netdev, linux-kernel
In-Reply-To: <56FC1972.9050704@gmail.com>


Hello Robert

On 3/30/2016 8:22 PM, Robert Gadsdon wrote:
> I have applied this to my Rock2 - Kernel 4.6-rc1 - and eth0 is present,
> now, but no network traffic gets through:

there are some patches not yet applied that fix known
problems

pls take a look at :

"stmmac: MDIO fixes"

and

[PATCH (net-next.git)] STMMAC: fix TX Normal descriptor

peppe

>
> To rock2:
> $ ping rgrock2
> PING rgrock2 (192.168.0.xx) 56(84) bytes of data.
> ^C
> --- rgrock2 ping statistics ---
> 9 packets transmitted, 0 received, 100% packet loss, time 7999ms
>
>>From rock2:
> # ping 192.168.0.x
> PING 192.168.0.x (192.168.0.2) 56(84) bytes of data.
>>From 192.168.0.xx icmp_seq=1 Destination Host Unreachable
>>From 192.168.0.xx icmp_seq=2 Destination Host Unreachable
>>From 192.168.0.xx icmp_seq=3 Destination Host Unreachable
>>From 192.168.0.xx icmp_seq=4 Destination Host Unreachable
>>From 192.168.0.xx icmp_seq=5 Destination Host Unreachable
>>From 192.168.0.xx icmp_seq=6 Destination Host Unreachable
>>From 192.168.0.xx icmp_seq=7 Destination Host Unreachable
>>From 192.168.0.xx icmp_seq=8 Destination Host Unreachable
>
> --- 192.168.0.x ping statistics ---
> 9 packets transmitted, 0 received, +8 errors, 100% packet loss, time 8001ms
>
> Everything worked OK with kernel 4.5-rc7, and with 4.5 Final with the
> -rc7 versions of stmmac_platform.c and stmmac_mdio.c substituted..
>
> Robert Gadsdon. March 30, 2016
>

^ permalink raw reply

* Re: [PATCH 0/5] wireless: ti: Convert specialized logging macros to kernel style
From: Kalle Valo @ 2016-03-31  7:39 UTC (permalink / raw)
  To: Joe Perches; +Cc: linux-kernel, linux-wireless, netdev
In-Reply-To: <1459373791.25110.132.camel@perches.com>

Joe Perches <joe@perches.com> writes:

> On Wed, 2016-03-30 at 14:51 +0300, Kalle Valo wrote:
>> Joe Perches <joe@perches.com> writes:
>> > Using the normal kernel logging mechanisms makes this code
>> > a bit more like other wireless drivers.
>> Personally I don't see the point but I don't have any strong opinions. A
>> bigger problem is that TI drivers are not really in active development
>> and that's I'm not thrilled to take big patches like this for dormant
>> drivers.
>
> Not very dormant.
>
> 35 patches in the last year, most of them adding functionality.

Oh, I didn't realise it had that many patches. But the driver is
orphaned and doesn't have a maintainer so could I then have an ack from
one of the active contributors that this ok?

-- 
Kalle Valo

^ permalink raw reply

* Re: [PATCH] net: mvneta: remove useless RX descriptor prefetch
From: Jisheng Zhang @ 2016-03-31  7:15 UTC (permalink / raw)
  To: thomas.petazzoni, gregory.clement, davem
  Cc: netdev, linux-kernel, linux-arm-kernel
In-Reply-To: <20160331144537.20f04a8c@xhacker>

Hi all,

On Thu, 31 Mar 2016 14:45:37 +0800 Jisheng Zhang wrote:

> Hi,
> 
> + linux arm kernel
> 
> On Thu, 31 Mar 2016 14:36:30 +0800 Jisheng Zhang wrote:
> 
> > The rx descriptors are allocated using dma_alloc_coherent, so prefetch
> > doesn't really happen at all.  
> 
> This is for RFC, I'm sorry to send it without changing its title -- s/PATCH/RFC.
> 
> I'm not sure whether there's any benefit to prefetch on space allocated from
> dma_alloc_coherent.

After more consideration, I think my patch is wrong.

As for coherent platforms, the space allocated from dma_alloc_coherent is
cacheable, so prefetch would definitely benefit us.

As for noncoherent platforms, the space allocated from dma_alloc_coherent is
uncacheable, but prefetch on arm/arm64 is implemented via. pld/prfm, the op
would be nop if target address is uncacheable.

So let's drop this patch.

Thanks,
Jisheng


> 
> Thanks
> 
> > 
> > Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
> > ---
> >  drivers/net/ethernet/marvell/mvneta.c | 1 -
> >  1 file changed, 1 deletion(-)
> > 
> > diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> > index 5880871..6c09a27 100644
> > --- a/drivers/net/ethernet/marvell/mvneta.c
> > +++ b/drivers/net/ethernet/marvell/mvneta.c
> > @@ -757,7 +757,6 @@ mvneta_rxq_next_desc_get(struct mvneta_rx_queue *rxq)
> >  	int rx_desc = rxq->next_desc_to_proc;
> >  
> >  	rxq->next_desc_to_proc = MVNETA_QUEUE_NEXT_DESC(rxq, rx_desc);
> > -	prefetch(rxq->descs + rxq->next_desc_to_proc);
> >  	return rxq->descs + rx_desc;
> >  }
> >    
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply

* Reboot Failed was occurred after v4.4-rc1 during IPv6 Ready Logo Conformance Test
From: Yuki Machida @ 2016-03-31  7:09 UTC (permalink / raw)
  To: netdev

Hi all,

Reboot Failed was occurred at Linux Kernel after v4.4-rc1 during IPv6 Ready Logo Conformance Test.
Not Fix a bug in v4.5-rc7 yet.
I will conform that it fix a bug in v4.6-rc1.

Currently, it is under investigation.

IPv6 Ready Logo
https://www.ipv6ready.org/
TAHI Project
http://www.tahi.org/

I ran the IPv6 Ready Logo Core Conformance Test on Intel D510MO (Atom D510).
It is using userland build with yocto project.

Test Environment
Test Specification          : 4.0.6
Tool Version                : REL_3_3_2
Test Program Version        : V6LC_5_0_0
Target Device               : Intel D510MO (Atom D510)

I conform that random testcases are failed.
(e.g. No.5, No.130, No.131, No.134, No.167 and No.168)

Regards,
Yuki Machida

^ permalink raw reply

* Re: [PATCH] net: mvneta: remove useless RX descriptor prefetch
From: Jisheng Zhang @ 2016-03-31  6:45 UTC (permalink / raw)
  To: thomas.petazzoni, gregory.clement, davem
  Cc: netdev, linux-kernel, linux-arm-kernel
In-Reply-To: <1459406190-2334-1-git-send-email-jszhang@marvell.com>

Hi,

+ linux arm kernel

On Thu, 31 Mar 2016 14:36:30 +0800 Jisheng Zhang wrote:

> The rx descriptors are allocated using dma_alloc_coherent, so prefetch
> doesn't really happen at all.

This is for RFC, I'm sorry to send it without changing its title -- s/PATCH/RFC.

I'm not sure whether there's any benefit to prefetch on space allocated from
dma_alloc_coherent.

Thanks

> 
> Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> index 5880871..6c09a27 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -757,7 +757,6 @@ mvneta_rxq_next_desc_get(struct mvneta_rx_queue *rxq)
>  	int rx_desc = rxq->next_desc_to_proc;
>  
>  	rxq->next_desc_to_proc = MVNETA_QUEUE_NEXT_DESC(rxq, rx_desc);
> -	prefetch(rxq->descs + rxq->next_desc_to_proc);
>  	return rxq->descs + rx_desc;
>  }
>  

^ permalink raw reply

* Re: [PATCH] net: mvneta: explicitly disable BM on 64bit platform
From: Marcin Wojtas @ 2016-03-31  6:49 UTC (permalink / raw)
  To: Jisheng Zhang
  Cc: Gregory CLEMENT, David S. Miller, Thomas Petazzoni, netdev,
	linux-kernel, linux-arm-kernel@lists.infradead.org
In-Reply-To: <20160331135322.523a7ceb@xhacker>

Hi Jisheng,

2016-03-31 7:53 GMT+02:00 Jisheng Zhang <jszhang@marvell.com>:
> Hi Gregory,
>
> On Wed, 30 Mar 2016 17:11:41 +0200 Gregory CLEMENT wrote:
>
>> Hi Jisheng,
>>
>>  On mer., mars 30 2016, Jisheng Zhang <jszhang@marvell.com> wrote:
>>
>> > The mvneta BM can't work on 64bit platform, as the BM hardware expects
>> > buf virtual address to be placed in the first four bytes of mapped
>> > buffer, but obviously the virtual address on 64bit platform can't be
>> > stored in 4 bytes. So we have to explicitly disable BM on 64bit
>> > platform.
>>
>> Actually mvneta is used on Armada 3700 which is a 64bits platform.
>> Is it true that the driver needs some change to use BM in 64 bits, but
>> we don't have to disable it.
>>
>> Here is the 64 bits part of the patch we have currently on the hardware
>> prototype. We have more things which are really related to the way the
>> mvneta is connected to the Armada 3700 SoC. This code was not ready for
>
> Thanks for the sharing.
>
> I think we could commit easy parts firstly, for example: the cacheline size
> hardcoding, either piece of your diff or my version:
>
> http://lists.infradead.org/pipermail/linux-arm-kernel/2016-March/418513.html

Since the commit:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/arch/arm64/include/asm/cache.h?id=97303480753e48fb313dc0e15daaf11b0451cdb8
detached L1_CACHE_BYTES from real cache size, I suggest, the macro should be:
#define MVNETA_CPU_D_CACHE_LINE_SIZE     cache_line_size()

Regarding check after dma_alloc_coherent, I agree it's not necessary.

>
>> mainline but I prefer share it now instead of having the HWBM blindly
>
> I have looked through the diff, it is for the driver itself on 64bit platforms,
> and it doesn't touch BM. The BM itself need to be disabled for 64bit, I'm not
> sure the BM could work on 64bit even with your diff. Per my understanding, the BM
> can't work on 64 bit, let's have a look at some piece of the mvneta_bm_construct()
>
> *(u32 *)buf = (u32)buf;

Indeed this particular part is different and unclear, I tried
different options - with no success. I'm checking with design team
now. Anyway, I managed to enable operation for HWBM on A3700 with one
work-around in mvneta_hwbm_rx():
data = phys_to_virt(rx_desc->buf_phys_addr);

Of course mvneta_bm, due to some silicone differences needed also a rework.

Actually I'd wait with updating 64-bit parts of mvneta, until real
support for such machine's controller is introduced. Basing on my
experience with enabling neta on A3700, it turns out to be more
changes.

Best regards,
Marcin

^ permalink raw reply

* [PATCH] net: mvneta: remove useless RX descriptor prefetch
From: Jisheng Zhang @ 2016-03-31  6:36 UTC (permalink / raw)
  To: thomas.petazzoni, gregory.clement, davem
  Cc: netdev, linux-kernel, Jisheng Zhang

The rx descriptors are allocated using dma_alloc_coherent, so prefetch
doesn't really happen at all.

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
---
 drivers/net/ethernet/marvell/mvneta.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 5880871..6c09a27 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -757,7 +757,6 @@ mvneta_rxq_next_desc_get(struct mvneta_rx_queue *rxq)
 	int rx_desc = rxq->next_desc_to_proc;
 
 	rxq->next_desc_to_proc = MVNETA_QUEUE_NEXT_DESC(rxq, rx_desc);
-	prefetch(rxq->descs + rxq->next_desc_to_proc);
 	return rxq->descs + rx_desc;
 }
 
-- 
2.8.0.rc3

^ permalink raw reply related

* Re: am335x: no multicast reception over VLAN
From: Mugunthan V N @ 2016-03-31  6:37 UTC (permalink / raw)
  To: Peter Korsgaard
  Cc: Yegor Yefremov, Grygorii Strashko, netdev,
	linux-omap@vger.kernel.org, drivshin, ml
In-Reply-To: <878u0z9297.fsf@dell.be.48ers.dk>

On Thursday 31 March 2016 01:17 AM, Peter Korsgaard wrote:
>>>>>> "Mugunthan" == Mugunthan V N <mugunthanvnm@ti.com> writes:
> 
> Hi,
> 
>  > You had received these packets as tcpdump will enable promiscuous mode
>  > so that you receive all the packets from the wire.
> 
> FYI, you can use the -p option to tcpdump to not put the interface into
> promiscuous mode.
> 

Thanks for the information Peter Korsgaard.

Yegor, can you provide tcpdump using -p as well in Grygorii commands.

Regards
Mugunthan V N

^ permalink raw reply

* Re: [PATCH] net: mvneta: explicitly disable BM on 64bit platform
From: Jisheng Zhang @ 2016-03-31  5:53 UTC (permalink / raw)
  To: Gregory CLEMENT, davem, mw, thomas.petazzoni
  Cc: netdev, linux-kernel, linux-arm-kernel
In-Reply-To: <8737r8uhjm.fsf@free-electrons.com>

Hi Gregory,

On Wed, 30 Mar 2016 17:11:41 +0200 Gregory CLEMENT wrote:

> Hi Jisheng,
>  
>  On mer., mars 30 2016, Jisheng Zhang <jszhang@marvell.com> wrote:
> 
> > The mvneta BM can't work on 64bit platform, as the BM hardware expects
> > buf virtual address to be placed in the first four bytes of mapped
> > buffer, but obviously the virtual address on 64bit platform can't be
> > stored in 4 bytes. So we have to explicitly disable BM on 64bit
> > platform.  
> 
> Actually mvneta is used on Armada 3700 which is a 64bits platform.
> Is it true that the driver needs some change to use BM in 64 bits, but
> we don't have to disable it.
> 
> Here is the 64 bits part of the patch we have currently on the hardware
> prototype. We have more things which are really related to the way the
> mvneta is connected to the Armada 3700 SoC. This code was not ready for

Thanks for the sharing.

I think we could commit easy parts firstly, for example: the cacheline size
hardcoding, either piece of your diff or my version:

http://lists.infradead.org/pipermail/linux-arm-kernel/2016-March/418513.html

> mainline but I prefer share it now instead of having the HWBM blindly

I have looked through the diff, it is for the driver itself on 64bit platforms,
and it doesn't touch BM. The BM itself need to be disabled for 64bit, I'm not
sure the BM could work on 64bit even with your diff. Per my understanding, the BM
can't work on 64 bit, let's have a look at some piece of the mvneta_bm_construct()

*(u32 *)buf = (u32)buf;

Am I misunderstanding?

Thanks,
Jisheng

> disable for 64 bits platform:
> 
> --- a/drivers/net/ethernet/marvell/Kconfig
> +++ b/drivers/net/ethernet/marvell/Kconfig
> @@ -55,7 +55,7 @@ config MVNETA_BM_ENABLE
>  
>  config MVNETA
>  	tristate "Marvell Armada 370/38x/XP network interface support"
> -	depends on PLAT_ORION
> +	depends on ARCH_MVEBU || COMPILE_TEST
>  	select MVMDIO
>  	select FIXED_PHY
>  	---help---
> diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> index 577f7ca7deba..6929ad112b64 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -260,7 +260,7 @@
>  
>  #define MVNETA_VLAN_TAG_LEN             4
>  
> -#define MVNETA_CPU_D_CACHE_LINE_SIZE    32
> +#define MVNETA_CPU_D_CACHE_LINE_SIZE    cache_line_size()
>  #define MVNETA_TX_CSUM_DEF_SIZE		1600
>  #define MVNETA_TX_CSUM_MAX_SIZE		9800
>  #define MVNETA_ACC_MODE_EXT1		1
> @@ -297,6 +297,12 @@
>  /* descriptor aligned size */
>  #define MVNETA_DESC_ALIGNED_SIZE	32
>  
> +/* Number of bytes to be taken into account by HW when putting incoming data
> + * to the buffers. It is needed in case NET_SKB_PAD exceeds maximum packet
> + * offset supported in MVNETA_RXQ_CONFIG_REG(q) registers.
> + */
> +#define MVNETA_RX_PKT_OFFSET_CORRECTION		64
> +
>  #define MVNETA_RX_PKT_SIZE(mtu) \
>  	ALIGN((mtu) + MVNETA_MH_SIZE + MVNETA_VLAN_TAG_LEN + \
>  	      ETH_HLEN + ETH_FCS_LEN,			     \
> @@ -417,6 +423,10 @@ struct mvneta_port {
>  	u64 ethtool_stats[ARRAY_SIZE(mvneta_statistics)];
>  
>  	u32 indir[MVNETA_RSS_LU_TABLE_SIZE];
> +#ifdef CONFIG_64BIT
> +	u64 data_high;
> +#endif
> +	u16 rx_offset_correction;
>  };
>  
>  /* The mvneta_tx_desc and mvneta_rx_desc structures describe the
> @@ -961,7 +971,9 @@ static int mvneta_bm_port_init(struct platform_device *pdev,
>  			       struct mvneta_port *pp)
>  {
>  	struct device_node *dn = pdev->dev.of_node;
> -	u32 long_pool_id, short_pool_id, wsize;
> +	u32 long_pool_id, short_pool_id;
> +#ifndef CONFIG_64BIT
> +	u32 wsize;
>  	u8 target, attr;
>  	int err;
>  
> @@ -985,7 +997,7 @@ static int mvneta_bm_port_init(struct platform_device *pdev,
>  		netdev_info(pp->dev, "missing long pool id\n");
>  		return -EINVAL;
>  	}
> -
> +#endif
>  	/* Create port's long pool depending on mtu */
>  	pp->pool_long = mvneta_bm_pool_use(pp->bm_priv, long_pool_id,
>  					   MVNETA_BM_LONG, pp->id,
> @@ -1790,6 +1802,10 @@ static int mvneta_rx_refill(struct mvneta_port *pp,
>  	if (!data)
>  		return -ENOMEM;
>  
> +#ifdef CONFIG_64BIT
> +	if (unlikely(pp->data_high != ((u64)data & 0xffffffff00000000)))
> +		return -ENOMEM;
> +#endif
>  	phys_addr = dma_map_single(pp->dev->dev.parent, data,
>  				   MVNETA_RX_BUF_SIZE(pp->pkt_size),
>  				   DMA_FROM_DEVICE);
> @@ -1798,7 +1814,8 @@ static int mvneta_rx_refill(struct mvneta_port *pp,
>  		return -ENOMEM;
>  	}
>  
> -	mvneta_rx_desc_fill(rx_desc, phys_addr, (u32)data);
> +	phys_addr += pp->rx_offset_correction;
> +	mvneta_rx_desc_fill(rx_desc, phys_addr, (uintptr_t)data);
>  	return 0;
>  }
>  
> @@ -1860,8 +1877,16 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp,
>  
>  	for (i = 0; i < rxq->size; i++) {
>  		struct mvneta_rx_desc *rx_desc = rxq->descs + i;
> -		void *data = (void *)rx_desc->buf_cookie;
> -
> +		void *data = (u8 *)(uintptr_t)rx_desc->buf_cookie;
> +#ifdef CONFIG_64BIT
> +		/* In Neta HW only 32 bits data is supported, so in
> +		 * order to obtain whole 64 bits address from RX
> +		 * descriptor, we store the upper 32 bits when
> +		 * allocating buffer, and put it back when using
> +		 * buffer cookie for accessing packet in memory.
> +		 */
> +		data = (u8 *)(pp->data_high | (u64)data);
> +#endif
>  		dma_unmap_single(pp->dev->dev.parent, rx_desc->buf_phys_addr,
>  				 MVNETA_RX_BUF_SIZE(pp->pkt_size), DMA_FROM_DEVICE);
>  		mvneta_frag_free(pp->frag_size, data);
> @@ -1898,7 +1923,17 @@ static int mvneta_rx_swbm(struct mvneta_port *pp, int rx_todo,
>  		rx_done++;
>  		rx_status = rx_desc->status;
>  		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
> +#ifdef CONFIG_64BIT
> +		/* In Neta HW only 32 bits data is supported, so in
> +		 * order to obtain whole 64 bits address from RX
> +		 * descriptor, we store the upper 32 bits when
> +		 * allocating buffer, and put it back when using
> +		 * buffer cookie for accessing packet in memory.
> +		 */
> +		data = (u8 *)(pp->data_high | (u64)rx_desc->buf_cookie);
> +#else
>  		data = (unsigned char *)rx_desc->buf_cookie;
> +#endif
>  		phys_addr = rx_desc->buf_phys_addr;
>  
>  		if (!mvneta_rxq_desc_is_first_last(rx_status) ||
> @@ -2019,7 +2054,17 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
>  		rx_done++;
>  		rx_status = rx_desc->status;
>  		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
> -		data = (unsigned char *)rx_desc->buf_cookie;
> +#ifdef CONFIG_64BIT
> +		/* In Neta HW only 32 bits data is supported, so in
> +		 * order to obtain whole 64 bits address from RX
> +		 * descriptor, we store the upper 32 bits when
> +		 * allocating buffer, and put it back when using
> +		 * buffer cookie for accessing packet in memory.
> +		 */
> +		data = (u8 *)(pp->data_high | (u64)rx_desc->buf_cookie);
> +#else
> +		data = (u8 *)rx_desc->buf_cookie;
> +#endif
>  		phys_addr = rx_desc->buf_phys_addr;
>  		pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc);
>  		bm_pool = &pp->bm_priv->bm_pools[pool_id];
> @@ -2774,7 +2819,7 @@ static int mvneta_rxq_init(struct mvneta_port *pp,
>  	mvreg_write(pp, MVNETA_RXQ_SIZE_REG(rxq->id), rxq->size);
>  
>  	/* Set Offset */
> -	mvneta_rxq_offset_set(pp, rxq, NET_SKB_PAD);
> +	mvneta_rxq_offset_set(pp, rxq, NET_SKB_PAD - pp->rx_offset_correction);
>  
>  	/* Set coalescing pkts and time */
>  	mvneta_rx_pkts_coal_set(pp, rxq, rxq->pkts_coal);
> @@ -2935,6 +2980,22 @@ static void mvneta_cleanup_rxqs(struct mvneta_port *pp)
>  static int mvneta_setup_rxqs(struct mvneta_port *pp)
>  {
>  	int queue;
> +#ifdef CONFIG_64BIT
> +	void *data_tmp;
> +
> +	/* In Neta HW only 32 bits data is supported, so in order to
> +	 * obtain whole 64 bits address from RX descriptor, we store
> +	 * the upper 32 bits when allocating buffer, and put it back
> +	 * when using buffer cookie for accessing packet in memory.
> +	 * Frags should be allocated from single 'memory' region,
> +	 * hence common upper address half should be sufficient.
> +	 */
> +	data_tmp = mvneta_frag_alloc(pp->frag_size);
> +	if (data_tmp) {
> +		pp->data_high = (u64)data_tmp & 0xffffffff00000000;
> +		mvneta_frag_free(pp->frag_size, data_tmp);
> +	}
> +#endif
>  
>  	for (queue = 0; queue < rxq_number; queue++) {
>  		int err = mvneta_rxq_init(pp, &pp->rxqs[queue]);
> @@ -3672,25 +3733,24 @@ static void mvneta_ethtool_update_stats(struct mvneta_port *pp)
>  	const struct mvneta_statistic *s;
>  	void __iomem *base = pp->base;
>  	u32 high, low, val;
> -	u64 val64;
>  	int i;
>  
>  	for (i = 0, s = mvneta_statistics;
>  	     s < mvneta_statistics + ARRAY_SIZE(mvneta_statistics);
>  	     s++, i++) {
> +		val = 0;
>  		switch (s->type) {
>  		case T_REG_32:
>  			val = readl_relaxed(base + s->offset);
> -			pp->ethtool_stats[i] += val;
>  			break;
>  		case T_REG_64:
>  			/* Docs say to read low 32-bit then high */
>  			low = readl_relaxed(base + s->offset);
>  			high = readl_relaxed(base + s->offset + 4);
> -			val64 = (u64)high << 32 | low;
> -			pp->ethtool_stats[i] += val64;
> +			val = (u64)high << 32 | low;
>  			break;
>  		}
> +		pp->ethtool_stats[i] += val;
>  	}
>  }
>  
> @@ -4034,6 +4094,13 @@ static int mvneta_probe(struct platform_device *pdev)
>  
>  	pp->rxq_def = rxq_def;
>  
> +	/* Set RX packet offset correction for platforms, whose
> +	 * NET_SKB_PAD, exceeds 64B. It should be 64B for 64-bit
> +	 * platforms and 0B for 32-bit ones.
> +	 */
> +	pp->rx_offset_correction =
> +		max(0, NET_SKB_PAD - MVNETA_RX_PKT_OFFSET_CORRECTION);
> +
>  	pp->indir[0] = rxq_def;
>  
>  	pp->clk = devm_clk_get(&pdev->dev, "core");
> --
> 
> >
> > Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
> > ---
> >  drivers/net/ethernet/marvell/Kconfig | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/ethernet/marvell/Kconfig b/drivers/net/ethernet/marvell/Kconfig
> > index b5c6d42..53d6572 100644
> > --- a/drivers/net/ethernet/marvell/Kconfig
> > +++ b/drivers/net/ethernet/marvell/Kconfig
> > @@ -42,7 +42,7 @@ config MVMDIO
> >  
> >  config MVNETA_BM_ENABLE
> >  	tristate "Marvell Armada 38x/XP network interface BM support"
> > -	depends on MVNETA
> > +	depends on MVNETA && !64BIT
> >  	---help---
> >  	  This driver supports auxiliary block of the network
> >  	  interface units in the Marvell ARMADA XP and ARMADA 38x SoC
> > -- 
> > 2.8.0.rc3
> >  
> 

^ permalink raw reply

* [PATCH net-next 5/6] net: export napi_by_id()
From: Jason Wang @ 2016-03-31  5:50 UTC (permalink / raw)
  To: davem, mst, netdev, linux-kernel; +Cc: Jason Wang
In-Reply-To: <1459403439-6011-1-git-send-email-jasowang@redhat.com>

This patch exports napi_by_id() which will be used by vhost_net socket
busy polling.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 include/net/busy_poll.h | 1 +
 net/core/dev.c          | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index e765e23..dc9c76d 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -74,6 +74,7 @@ static inline bool busy_loop_timeout(unsigned long end_time)
 
 bool sk_busy_loop(struct sock *sk, int nonblock);
 int sk_busy_loop_once(struct sock *sk, struct napi_struct *napi);
+struct napi_struct *napi_by_id(unsigned int napi_id);
 
 /* used in the NIC receive handler to mark the skb */
 static inline void skb_mark_napi_id(struct sk_buff *skb,
diff --git a/net/core/dev.c b/net/core/dev.c
index a2f0c46..b98d210 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4888,7 +4888,7 @@ void napi_complete_done(struct napi_struct *n, int work_done)
 EXPORT_SYMBOL(napi_complete_done);
 
 /* must be called under rcu_read_lock(), as we dont take a reference */
-static struct napi_struct *napi_by_id(unsigned int napi_id)
+struct napi_struct *napi_by_id(unsigned int napi_id)
 {
 	unsigned int hash = napi_id % HASH_SIZE(napi_hash);
 	struct napi_struct *napi;
@@ -4899,6 +4899,7 @@ static struct napi_struct *napi_by_id(unsigned int napi_id)
 
 	return NULL;
 }
+EXPORT_SYMBOL(napi_by_id);
 
 #if defined(CONFIG_NET_RX_BUSY_POLL)
 #define BUSY_POLL_BUDGET 8
-- 
2.5.0

^ permalink raw reply related

* [PATCH net-next 6/6] vhost_net: net device rx busy polling support
From: Jason Wang @ 2016-03-31  5:50 UTC (permalink / raw)
  To: davem, mst, netdev, linux-kernel; +Cc: Jason Wang
In-Reply-To: <1459403439-6011-1-git-send-email-jasowang@redhat.com>

This patch let vhost_net try rx busy polling of underlying net device
when busy polling is enabled. Test shows some improvement on TCP_RR:

smp=1 queue=1
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
    1/     1/   +4%/   +3%/   +3%/   +3%/  +22%
    1/    50/   +2%/   +2%/   +2%/   +2%/    0%
    1/   100/   +1%/    0%/   +1%/   +1%/   -1%
    1/   200/   +2%/   +1%/   +2%/   +2%/    0%
   64/     1/   +1%/   +3%/   +1%/   +1%/   +1%
   64/    50/    0%/    0%/    0%/    0%/   -1%
   64/   100/   +1%/    0%/   +1%/   +1%/    0%
   64/   200/    0%/    0%/   +2%/   +2%/    0%
  256/     1/   +2%/   +2%/   +2%/   +2%/   +2%
  256/    50/   +3%/   +3%/   +3%/   +3%/    0%
  256/   100/   +1%/   +1%/   +2%/   +2%/    0%
  256/   200/    0%/    0%/   +1%/   +1%/   +1%
 1024/     1/   +2%/   +2%/   +2%/   +2%/   +2%
 1024/    50/   -1%/   -1%/   -1%/   -1%/   -2%
 1024/   100/   +1%/   +1%/    0%/    0%/   -1%
 1024/   200/   +2%/   +1%/   +2%/   +2%/    0%

smp=8 queue=1
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
    1/     1/   +1%/   -5%/   +1%/   +1%/    0%
    1/    50/   +1%/    0%/   +1%/   +1%/   -1%
    1/   100/   -1%/   -1%/   -2%/   -2%/   -4%
    1/   200/    0%/    0%/    0%/    0%/   -1%
   64/     1/   -2%/  -10%/   -2%/   -2%/   -2%
   64/    50/   -1%/   -1%/   -1%/   -1%/   -2%
   64/   100/   -1%/    0%/    0%/    0%/   -1%
   64/   200/   -1%/   -1%/    0%/    0%/    0%
  256/     1/   +7%/  +25%/   +7%/   +7%/   +7%
  256/    50/   +2%/   +2%/   +2%/   +2%/   -1%
  256/   100/   -1%/   -1%/   -1%/   -1%/   -3%
  256/   200/   +1%/    0%/    0%/    0%/    0%
 1024/     1/   +5%/  +15%/   +5%/   +5%/   +4%
 1024/    50/    0%/    0%/   -1%/   -1%/   -1%
 1024/   100/   -1%/   -1%/   -1%/   -1%/   -2%
 1024/   200/   -1%/    0%/   -1%/   -1%/   -1%

smp=8 queue=8
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
    1/     1/   +5%/   +2%/   +5%/   +5%/    0%
    1/    50/   +2%/   +2%/   +3%/   +3%/  -20%
    1/   100/   +5%/   +5%/   +5%/   +5%/  -13%
    1/   200/   +8%/   +8%/   +6%/   +6%/  -12%
   64/     1/    0%/   +4%/    0%/    0%/  +18%
   64/    50/   +6%/   +5%/   +5%/   +5%/   -7%
   64/   100/   +4%/   +4%/   +5%/   +5%/  -12%
   64/   200/   +5%/   +5%/   +5%/   +5%/  -12%
  256/     1/    0%/   -3%/    0%/    0%/   +1%
  256/    50/   +3%/   +3%/   +3%/   +3%/   -2%
  256/   100/   +6%/   +5%/   +5%/   +5%/  -11%
  256/   200/   +4%/   +4%/   +4%/   +4%/  -13%
 1024/     1/    0%/   -3%/    0%/    0%/   -6%
 1024/    50/   +1%/   +1%/   +1%/   +1%/  -10%
 1024/   100/   +4%/   +4%/   +5%/   +5%/  -11%
 1024/   200/   +4%/   +5%/   +4%/   +4%/  -12%

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f744eeb..7350f6c 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -27,6 +27,7 @@
 #include <linux/if_vlan.h>
 
 #include <net/sock.h>
+#include <net/busy_poll.h>
 
 #include "vhost.h"
 
@@ -307,15 +308,24 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 				    unsigned int *out_num, unsigned int *in_num)
 {
 	unsigned long uninitialized_var(endtime);
+	struct socket *sock = vq->private_data;
+	struct sock *sk = sock->sk;
+	struct napi_struct *napi;
 	int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
 				    out_num, in_num, NULL, NULL);
 
 	if (r == vq->num && vq->busyloop_timeout) {
 		preempt_disable();
+		rcu_read_lock();
+		napi = napi_by_id(sk->sk_napi_id);
 		endtime = busy_clock() + vq->busyloop_timeout;
 		while (vhost_can_busy_poll(vq->dev, endtime) &&
-		       vhost_vq_avail_empty(vq->dev, vq))
+		       vhost_vq_avail_empty(vq->dev, vq)) {
+			if (napi)
+				sk_busy_loop_once(sk, napi);
 			cpu_relax_lowlatency();
+		}
+		rcu_read_unlock();
 		preempt_enable();
 		r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
 					out_num, in_num, NULL, NULL);
@@ -476,6 +486,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
 	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
 	struct vhost_virtqueue *vq = &nvq->vq;
 	unsigned long uninitialized_var(endtime);
+	struct napi_struct *napi;
 	int len = peek_head_len(sk);
 
 	if (!len && vq->busyloop_timeout) {
@@ -484,13 +495,20 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
 		vhost_disable_notify(&net->dev, vq);
 
 		preempt_disable();
+		rcu_read_lock();
+
+		napi = napi_by_id(sk->sk_napi_id);
 		endtime = busy_clock() + vq->busyloop_timeout;
 
 		while (vhost_can_busy_poll(&net->dev, endtime) &&
 		       skb_queue_empty(&sk->sk_receive_queue) &&
-		       vhost_vq_avail_empty(&net->dev, vq))
+		       vhost_vq_avail_empty(&net->dev, vq)) {
+			if (napi)
+				sk_busy_loop_once(sk, napi);
 			cpu_relax_lowlatency();
+		}
 
+		rcu_read_unlock();
 		preempt_enable();
 
 		if (vhost_enable_notify(&net->dev, vq))
-- 
2.5.0

^ permalink raw reply related

* [PATCH net-next 4/6] net: core: factor out core busy polling logic to sk_busy_loop_once()
From: Jason Wang @ 2016-03-31  5:50 UTC (permalink / raw)
  To: davem, mst, netdev, linux-kernel; +Cc: Jason Wang
In-Reply-To: <1459403439-6011-1-git-send-email-jasowang@redhat.com>

This patch factors out core logic of busy polling to
sk_busy_loop_once() in order to be reused by other modules.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 include/net/busy_poll.h |  7 ++++++
 net/core/dev.c          | 59 ++++++++++++++++++++++++++++---------------------
 2 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 2fbeb13..e765e23 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -73,6 +73,7 @@ static inline bool busy_loop_timeout(unsigned long end_time)
 }
 
 bool sk_busy_loop(struct sock *sk, int nonblock);
+int sk_busy_loop_once(struct sock *sk, struct napi_struct *napi);
 
 /* used in the NIC receive handler to mark the skb */
 static inline void skb_mark_napi_id(struct sk_buff *skb,
@@ -117,6 +118,12 @@ static inline bool busy_loop_timeout(unsigned long end_time)
 	return true;
 }
 
+static inline int sk_busy_loop_once(struct napi_struct *napi,
+				    int (*busy_poll)(struct napi_struct *dev))
+{
+	return 0;
+}
+
 static inline bool sk_busy_loop(struct sock *sk, int nonblock)
 {
 	return false;
diff --git a/net/core/dev.c b/net/core/dev.c
index b9bcbe7..a2f0c46 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4902,10 +4902,42 @@ static struct napi_struct *napi_by_id(unsigned int napi_id)
 
 #if defined(CONFIG_NET_RX_BUSY_POLL)
 #define BUSY_POLL_BUDGET 8
+int sk_busy_loop_once(struct sock *sk, struct napi_struct *napi)
+{
+	int (*busy_poll)(struct napi_struct *dev);
+	int rc = 0;
+
+	/* Note: ndo_busy_poll method is optional in linux-4.5 */
+	busy_poll = napi->dev->netdev_ops->ndo_busy_poll;
+
+	local_bh_disable();
+	if (busy_poll) {
+		rc = busy_poll(napi);
+	} else if (napi_schedule_prep(napi)) {
+		void *have = netpoll_poll_lock(napi);
+
+		if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
+			rc = napi->poll(napi, BUSY_POLL_BUDGET);
+			trace_napi_poll(napi);
+			if (rc == BUSY_POLL_BUDGET) {
+				napi_complete_done(napi, rc);
+				napi_schedule(napi);
+			}
+		}
+		netpoll_poll_unlock(have);
+	}
+	if (rc > 0)
+		NET_ADD_STATS_BH(sock_net(sk),
+				 LINUX_MIB_BUSYPOLLRXPACKETS, rc);
+	local_bh_enable();
+
+	return rc;
+}
+EXPORT_SYMBOL(sk_busy_loop_once);
+
 bool sk_busy_loop(struct sock *sk, int nonblock)
 {
 	unsigned long end_time = !nonblock ? sk_busy_loop_end_time(sk) : 0;
-	int (*busy_poll)(struct napi_struct *dev);
 	struct napi_struct *napi;
 	int rc = false;
 
@@ -4915,31 +4947,8 @@ bool sk_busy_loop(struct sock *sk, int nonblock)
 	if (!napi)
 		goto out;
 
-	/* Note: ndo_busy_poll method is optional in linux-4.5 */
-	busy_poll = napi->dev->netdev_ops->ndo_busy_poll;
-
 	do {
-		rc = 0;
-		local_bh_disable();
-		if (busy_poll) {
-			rc = busy_poll(napi);
-		} else if (napi_schedule_prep(napi)) {
-			void *have = netpoll_poll_lock(napi);
-
-			if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
-				rc = napi->poll(napi, BUSY_POLL_BUDGET);
-				trace_napi_poll(napi);
-				if (rc == BUSY_POLL_BUDGET) {
-					napi_complete_done(napi, rc);
-					napi_schedule(napi);
-				}
-			}
-			netpoll_poll_unlock(have);
-		}
-		if (rc > 0)
-			NET_ADD_STATS_BH(sock_net(sk),
-					 LINUX_MIB_BUSYPOLLRXPACKETS, rc);
-		local_bh_enable();
+		rc = sk_busy_loop_once(sk, napi);
 
 		if (rc == LL_FLUSH_FAILED)
 			break; /* permanent failure */
-- 
2.5.0

^ permalink raw reply related

* [PATCH net-next 3/6] macvtap: socket rx busy polling support
From: Jason Wang @ 2016-03-31  5:50 UTC (permalink / raw)
  To: davem, mst, netdev, linux-kernel; +Cc: Jason Wang
In-Reply-To: <1459403439-6011-1-git-send-email-jasowang@redhat.com>

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/macvtap.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 95394ed..1891aff 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -20,6 +20,7 @@
 #include <net/net_namespace.h>
 #include <net/rtnetlink.h>
 #include <net/sock.h>
+#include <net/busy_poll.h>
 #include <linux/virtio_net.h>
 
 /*
@@ -369,6 +370,7 @@ static rx_handler_result_t macvtap_handle_frame(struct sk_buff **pskb)
 			goto drop;
 
 		if (!segs) {
+			sk_mark_napi_id(&q->sk, skb);
 			skb_queue_tail(&q->sk.sk_receive_queue, skb);
 			goto wake_up;
 		}
@@ -378,6 +380,7 @@ static rx_handler_result_t macvtap_handle_frame(struct sk_buff **pskb)
 			struct sk_buff *nskb = segs->next;
 
 			segs->next = NULL;
+			sk_mark_napi_id(&q->sk, segs);
 			skb_queue_tail(&q->sk.sk_receive_queue, segs);
 			segs = nskb;
 		}
@@ -391,6 +394,7 @@ static rx_handler_result_t macvtap_handle_frame(struct sk_buff **pskb)
 		    !(features & NETIF_F_CSUM_MASK) &&
 		    skb_checksum_help(skb))
 			goto drop;
+		sk_mark_napi_id(&q->sk, skb);
 		skb_queue_tail(&q->sk.sk_receive_queue, skb);
 	}
 
-- 
2.5.0

^ permalink raw reply related

* [PATCH net-next 2/6] tuntap: socket rx busy polling support
From: Jason Wang @ 2016-03-31  5:50 UTC (permalink / raw)
  To: davem, mst, netdev, linux-kernel; +Cc: Jason Wang
In-Reply-To: <1459403439-6011-1-git-send-email-jasowang@redhat.com>

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index afdf950..950faf5 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -69,6 +69,7 @@
 #include <net/netns/generic.h>
 #include <net/rtnetlink.h>
 #include <net/sock.h>
+#include <net/busy_poll.h>
 #include <linux/seq_file.h>
 #include <linux/uio.h>
 
@@ -871,6 +872,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	nf_reset(skb);
 
+	sk_mark_napi_id(tfile->socket.sk, skb);
 	/* Enqueue packet */
 	skb_queue_tail(&tfile->socket.sk->sk_receive_queue, skb);
 
@@ -878,7 +880,6 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	if (tfile->flags & TUN_FASYNC)
 		kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
 	tfile->socket.sk->sk_data_ready(tfile->socket.sk);
-
 	rcu_read_unlock();
 	return NETDEV_TX_OK;
 
-- 
2.5.0

^ permalink raw reply related

* [PATCH net-next 1/6] net: skbuff: don't use union for napi_id and sender_cpu
From: Jason Wang @ 2016-03-31  5:50 UTC (permalink / raw)
  To: davem, mst, netdev, linux-kernel; +Cc: Jason Wang
In-Reply-To: <1459403439-6011-1-git-send-email-jasowang@redhat.com>

We use a union for napi_id and send_cpu, this is ok for most of the
cases except when we want to support busy polling for tun which needs
napi_id to be stored and passed to socket during tun_net_xmit(). In
this case, napi_id was overridden with sender_cpu before tun_net_xmit()
was called if XPS was enabled. Fixing by not using union for napi_id
and sender_cpu.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 include/linux/skbuff.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 15d0df9..8aee891 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -743,11 +743,11 @@ struct sk_buff {
 	__u32			hash;
 	__be16			vlan_proto;
 	__u16			vlan_tci;
-#if defined(CONFIG_NET_RX_BUSY_POLL) || defined(CONFIG_XPS)
-	union {
-		unsigned int	napi_id;
-		unsigned int	sender_cpu;
-	};
+#if defined(CONFIG_NET_RX_BUSY_POLL)
+	unsigned int		napi_id;
+#endif
+#if defined(CONFIG_XPS)
+	unsigned int		sender_cpu;
 #endif
 	union {
 #ifdef CONFIG_NETWORK_SECMARK
-- 
2.5.0

^ permalink raw reply related

* [PATCH net-next 0/6] net device rx busy polling support in vhost_net
From: Jason Wang @ 2016-03-31  5:50 UTC (permalink / raw)
  To: davem, mst, netdev, linux-kernel; +Cc: Jason Wang

Hi all:

This series try to add net device rx busy polling support in
vhost_net. This is done through:

- adding socket rx busy polling support for tun/macvtap by marking
  napi_id.
- vhost_net will try to find the net device through napi_id and do
  busy polling if possible.

TCP_RR tests on two mlx4s show some improvements:

smp=1 queue=1
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
    1/     1/   +4%/   +3%/   +3%/   +3%/  +22%
    1/    50/   +2%/   +2%/   +2%/   +2%/    0%
    1/   100/   +1%/    0%/   +1%/   +1%/   -1%
    1/   200/   +2%/   +1%/   +2%/   +2%/    0%
   64/     1/   +1%/   +3%/   +1%/   +1%/   +1%
   64/    50/    0%/    0%/    0%/    0%/   -1%
   64/   100/   +1%/    0%/   +1%/   +1%/    0%
   64/   200/    0%/    0%/   +2%/   +2%/    0%
  256/     1/   +2%/   +2%/   +2%/   +2%/   +2%
  256/    50/   +3%/   +3%/   +3%/   +3%/    0%
  256/   100/   +1%/   +1%/   +2%/   +2%/    0%
  256/   200/    0%/    0%/   +1%/   +1%/   +1%
 1024/     1/   +2%/   +2%/   +2%/   +2%/   +2%
 1024/    50/   -1%/   -1%/   -1%/   -1%/   -2%
 1024/   100/   +1%/   +1%/    0%/    0%/   -1%
 1024/   200/   +2%/   +1%/   +2%/   +2%/    0%

smp=8 queue=1
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
    1/     1/   +1%/   -5%/   +1%/   +1%/    0%
    1/    50/   +1%/    0%/   +1%/   +1%/   -1%
    1/   100/   -1%/   -1%/   -2%/   -2%/   -4%
    1/   200/    0%/    0%/    0%/    0%/   -1%
   64/     1/   -2%/  -10%/   -2%/   -2%/   -2%
   64/    50/   -1%/   -1%/   -1%/   -1%/   -2%
   64/   100/   -1%/    0%/    0%/    0%/   -1%
   64/   200/   -1%/   -1%/    0%/    0%/    0%
  256/     1/   +7%/  +25%/   +7%/   +7%/   +7%
  256/    50/   +2%/   +2%/   +2%/   +2%/   -1%
  256/   100/   -1%/   -1%/   -1%/   -1%/   -3%
  256/   200/   +1%/    0%/    0%/    0%/    0%
 1024/     1/   +5%/  +15%/   +5%/   +5%/   +4%
 1024/    50/    0%/    0%/   -1%/   -1%/   -1%
 1024/   100/   -1%/   -1%/   -1%/   -1%/   -2%
 1024/   200/   -1%/    0%/   -1%/   -1%/   -1%

smp=8 queue=8
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
    1/     1/   +5%/   +2%/   +5%/   +5%/    0%
    1/    50/   +2%/   +2%/   +3%/   +3%/  -20%
    1/   100/   +5%/   +5%/   +5%/   +5%/  -13%
    1/   200/   +8%/   +8%/   +6%/   +6%/  -12%
   64/     1/    0%/   +4%/    0%/    0%/  +18%
   64/    50/   +6%/   +5%/   +5%/   +5%/   -7%
   64/   100/   +4%/   +4%/   +5%/   +5%/  -12%
   64/   200/   +5%/   +5%/   +5%/   +5%/  -12%
  256/     1/    0%/   -3%/    0%/    0%/   +1%
  256/    50/   +3%/   +3%/   +3%/   +3%/   -2%
  256/   100/   +6%/   +5%/   +5%/   +5%/  -11%
  256/   200/   +4%/   +4%/   +4%/   +4%/  -13%
 1024/     1/    0%/   -3%/    0%/    0%/   -6%
 1024/    50/   +1%/   +1%/   +1%/   +1%/  -10%
 1024/   100/   +4%/   +4%/   +5%/   +5%/  -11%
 1024/   200/   +4%/   +5%/   +4%/   +4%/  -12%

Thanks

Jason Wang (6):
  net: skbuff: don't use union for napi_id and sender_cpu
  tuntap: socket rx busy polling support
  macvtap: socket rx busy polling support
  net: core: factor out core busy polling logic to sk_busy_loop_once()
  net: export napi_by_id()
  vhost_net: net device rx busy polling support

 drivers/net/macvtap.c   |  4 ++++
 drivers/net/tun.c       |  3 ++-
 drivers/vhost/net.c     | 22 ++++++++++++++++--
 include/linux/skbuff.h  | 10 ++++----
 include/net/busy_poll.h |  8 +++++++
 net/core/dev.c          | 62 ++++++++++++++++++++++++++++---------------------
 6 files changed, 75 insertions(+), 34 deletions(-)

-- 
2.5.0

^ permalink raw reply

* Re: [PATCH net] tun, bpf: fix suspicious RCU usage in tun_{attach,detach}_filter
From: Alexei Starovoitov @ 2016-03-31  5:43 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: Daniel Borkmann, davem, sasha.levin, jslaby, eric.dumazet, mst,
	netdev
In-Reply-To: <20160331052232.GB21665@unicorn.suse.cz>

On Thu, Mar 31, 2016 at 07:22:32AM +0200, Michal Kubecek wrote:
> On Wed, Mar 30, 2016 at 10:08:10PM -0700, Alexei Starovoitov wrote:
> > On Thu, Mar 31, 2016 at 07:01:15AM +0200, Michal Kubecek wrote:
> > > On Wed, Mar 30, 2016 at 06:18:42PM -0700, Alexei Starovoitov wrote:
> > > > 
> > > > kinda heavy patch to shut up lockdep.
> > > > Can we do
> > > > old_fp = rcu_dereference_protected(sk->sk_filter,
> > > >                                 sock_owned_by_user(sk) || lockdep_rtnl_is_held());
> > > > and it always be correct?
> > > > I think right now tun is the only such user, but if it's correct
> > > > for tun, it's correct for future users too. If not correct then
> > > > not correct for tun either.
> > > > Or I'm missing something?
> > > 
> > > Already discussed here:
> > > 
> > >   http://thread.gmane.org/gmane.linux.kernel/2158069/focus=405853
> > 
> > I saw that. My point above was challenging 'less accurate' part.
> > 
> Daniel's point was that lockdep_rtnl_is_held() does not mean "we hold
> RTNL" but "someone holds RTNL" so that some other task holding RTNL at
> the moment could make the check happy even when called by someone
> supposed to own the socket.

Of course... and that is the case for all rtnl_dereference() calls...
yet we're not paranoid about it.

^ permalink raw reply

* Re: [PATCH net] tun, bpf: fix suspicious RCU usage in tun_{attach,detach}_filter
From: Michal Kubecek @ 2016-03-31  5:22 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, davem, sasha.levin, jslaby, eric.dumazet, mst,
	netdev
In-Reply-To: <20160331050808.GA56499@ast-mbp.thefacebook.com>

On Wed, Mar 30, 2016 at 10:08:10PM -0700, Alexei Starovoitov wrote:
> On Thu, Mar 31, 2016 at 07:01:15AM +0200, Michal Kubecek wrote:
> > On Wed, Mar 30, 2016 at 06:18:42PM -0700, Alexei Starovoitov wrote:
> > > 
> > > kinda heavy patch to shut up lockdep.
> > > Can we do
> > > old_fp = rcu_dereference_protected(sk->sk_filter,
> > >                                 sock_owned_by_user(sk) || lockdep_rtnl_is_held());
> > > and it always be correct?
> > > I think right now tun is the only such user, but if it's correct
> > > for tun, it's correct for future users too. If not correct then
> > > not correct for tun either.
> > > Or I'm missing something?
> > 
> > Already discussed here:
> > 
> >   http://thread.gmane.org/gmane.linux.kernel/2158069/focus=405853
> 
> I saw that. My point above was challenging 'less accurate' part.
> 
Daniel's point was that lockdep_rtnl_is_held() does not mean "we hold
RTNL" but "someone holds RTNL" so that some other task holding RTNL at
the moment could make the check happy even when called by someone
supposed to own the socket.

So I guess the key question is if avoiding this type of false negative
is important enough to justify the extra complexity (taking into account
this race would have to happen every time the check is performed to
really hide a locking issue).

                                                          Michal Kubecek

^ permalink raw reply

* Re: [PATCH net] tun, bpf: fix suspicious RCU usage in tun_{attach,detach}_filter
From: Alexei Starovoitov @ 2016-03-31  5:08 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: Daniel Borkmann, davem, sasha.levin, jslaby, eric.dumazet, mst,
	netdev
In-Reply-To: <20160331050115.GA21665@unicorn.suse.cz>

On Thu, Mar 31, 2016 at 07:01:15AM +0200, Michal Kubecek wrote:
> On Wed, Mar 30, 2016 at 06:18:42PM -0700, Alexei Starovoitov wrote:
> > On Thu, Mar 31, 2016 at 02:13:18AM +0200, Daniel Borkmann wrote:
> > > Sasha Levin reported a suspicious rcu_dereference_protected() warning
> > > found while fuzzing with trinity that is similar to this one:
> > > 
> > >   [   52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage!
> > >   [   52.765688] other info that might help us debug this:
> > >   [   52.765695] rcu_scheduler_active = 1, debug_locks = 1
> > >   [   52.765701] 1 lock held by a.out/1525:
> > >   [   52.765704]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff816a64b7>] rtnl_lock+0x17/0x20
> > >   [   52.765721] stack backtrace:
> > >   [   52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264
> > >   [...]
> > >   [   52.765768] Call Trace:
> > >   [   52.765775]  [<ffffffff813e488d>] dump_stack+0x85/0xc8
> > >   [   52.765784]  [<ffffffff810f2fa5>] lockdep_rcu_suspicious+0xd5/0x110
> > >   [   52.765792]  [<ffffffff816afdc2>] sk_detach_filter+0x82/0x90
> > >   [   52.765801]  [<ffffffffa0883425>] tun_detach_filter+0x35/0x90 [tun]
> > >   [   52.765810]  [<ffffffffa0884ed4>] __tun_chr_ioctl+0x354/0x1130 [tun]
> > >   [   52.765818]  [<ffffffff8136fed0>] ? selinux_file_ioctl+0x130/0x210
> > >   [   52.765827]  [<ffffffffa0885ce3>] tun_chr_ioctl+0x13/0x20 [tun]
> > >   [   52.765834]  [<ffffffff81260ea6>] do_vfs_ioctl+0x96/0x690
> > >   [   52.765843]  [<ffffffff81364af3>] ? security_file_ioctl+0x43/0x60
> > >   [   52.765850]  [<ffffffff81261519>] SyS_ioctl+0x79/0x90
> > >   [   52.765858]  [<ffffffff81003ba2>] do_syscall_64+0x62/0x140
> > >   [   52.765866]  [<ffffffff817d563f>] entry_SYSCALL64_slow_path+0x25/0x25
> > > 
> > > Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled
> > > from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH,
> > > DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices.
> > > 
> > > Since the fix in f91ff5b9ff52 ("net: sk_{detach|attach}_filter() rcu
> > > fixes") sk_attach_filter()/sk_detach_filter() now dereferences the
> > > filter with rcu_dereference_protected(), checking whether socket lock
> > > is held in control path.
> > > 
> > > Since its introduction in 994051625981 ("tun: socket filter support"),
> > > tap filters are managed under RTNL lock from __tun_chr_ioctl(). Thus the
> > > sock_owned_by_user(sk) doesn't apply in this specific case and therefore
> > > triggers the false positive.
> > > 
> > > Extend the BPF API with __sk_attach_filter()/__sk_detach_filter() pair
> > > that is used by tap filters and pass in lockdep_rtnl_is_held() for the
> > > rcu_dereference_protected() checks instead.
> > > 
> > > Reported-by: Sasha Levin <sasha.levin@oracle.com>
> > > Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> > > ---
> > >  drivers/net/tun.c      |  8 +++++---
> > >  include/linux/filter.h |  4 ++++
> > >  net/core/filter.c      | 33 +++++++++++++++++++++------------
> > >  3 files changed, 30 insertions(+), 15 deletions(-)
> > 
> > kinda heavy patch to shut up lockdep.
> > Can we do
> > old_fp = rcu_dereference_protected(sk->sk_filter,
> >                                 sock_owned_by_user(sk) || lockdep_rtnl_is_held());
> > and it always be correct?
> > I think right now tun is the only such user, but if it's correct for tun,
> > it's correct for future users too. If not correct then not correct for tun either.
> > Or I'm missing something?
> 
> Already discussed here:
> 
>   http://thread.gmane.org/gmane.linux.kernel/2158069/focus=405853

I saw that. My point above was challenging 'less accurate' part.

^ permalink raw reply

* Re: [PATCH net] tun, bpf: fix suspicious RCU usage in tun_{attach,detach}_filter
From: Michal Kubecek @ 2016-03-31  5:01 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, davem, sasha.levin, jslaby, eric.dumazet, mst,
	netdev
In-Reply-To: <20160331011840.GA50846@ast-mbp.thefacebook.com>

On Wed, Mar 30, 2016 at 06:18:42PM -0700, Alexei Starovoitov wrote:
> On Thu, Mar 31, 2016 at 02:13:18AM +0200, Daniel Borkmann wrote:
> > Sasha Levin reported a suspicious rcu_dereference_protected() warning
> > found while fuzzing with trinity that is similar to this one:
> > 
> >   [   52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage!
> >   [   52.765688] other info that might help us debug this:
> >   [   52.765695] rcu_scheduler_active = 1, debug_locks = 1
> >   [   52.765701] 1 lock held by a.out/1525:
> >   [   52.765704]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff816a64b7>] rtnl_lock+0x17/0x20
> >   [   52.765721] stack backtrace:
> >   [   52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264
> >   [...]
> >   [   52.765768] Call Trace:
> >   [   52.765775]  [<ffffffff813e488d>] dump_stack+0x85/0xc8
> >   [   52.765784]  [<ffffffff810f2fa5>] lockdep_rcu_suspicious+0xd5/0x110
> >   [   52.765792]  [<ffffffff816afdc2>] sk_detach_filter+0x82/0x90
> >   [   52.765801]  [<ffffffffa0883425>] tun_detach_filter+0x35/0x90 [tun]
> >   [   52.765810]  [<ffffffffa0884ed4>] __tun_chr_ioctl+0x354/0x1130 [tun]
> >   [   52.765818]  [<ffffffff8136fed0>] ? selinux_file_ioctl+0x130/0x210
> >   [   52.765827]  [<ffffffffa0885ce3>] tun_chr_ioctl+0x13/0x20 [tun]
> >   [   52.765834]  [<ffffffff81260ea6>] do_vfs_ioctl+0x96/0x690
> >   [   52.765843]  [<ffffffff81364af3>] ? security_file_ioctl+0x43/0x60
> >   [   52.765850]  [<ffffffff81261519>] SyS_ioctl+0x79/0x90
> >   [   52.765858]  [<ffffffff81003ba2>] do_syscall_64+0x62/0x140
> >   [   52.765866]  [<ffffffff817d563f>] entry_SYSCALL64_slow_path+0x25/0x25
> > 
> > Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled
> > from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH,
> > DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices.
> > 
> > Since the fix in f91ff5b9ff52 ("net: sk_{detach|attach}_filter() rcu
> > fixes") sk_attach_filter()/sk_detach_filter() now dereferences the
> > filter with rcu_dereference_protected(), checking whether socket lock
> > is held in control path.
> > 
> > Since its introduction in 994051625981 ("tun: socket filter support"),
> > tap filters are managed under RTNL lock from __tun_chr_ioctl(). Thus the
> > sock_owned_by_user(sk) doesn't apply in this specific case and therefore
> > triggers the false positive.
> > 
> > Extend the BPF API with __sk_attach_filter()/__sk_detach_filter() pair
> > that is used by tap filters and pass in lockdep_rtnl_is_held() for the
> > rcu_dereference_protected() checks instead.
> > 
> > Reported-by: Sasha Levin <sasha.levin@oracle.com>
> > Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> > ---
> >  drivers/net/tun.c      |  8 +++++---
> >  include/linux/filter.h |  4 ++++
> >  net/core/filter.c      | 33 +++++++++++++++++++++------------
> >  3 files changed, 30 insertions(+), 15 deletions(-)
> 
> kinda heavy patch to shut up lockdep.
> Can we do
> old_fp = rcu_dereference_protected(sk->sk_filter,
>                                 sock_owned_by_user(sk) || lockdep_rtnl_is_held());
> and it always be correct?
> I think right now tun is the only such user, but if it's correct for tun,
> it's correct for future users too. If not correct then not correct for tun either.
> Or I'm missing something?

Already discussed here:

  http://thread.gmane.org/gmane.linux.kernel/2158069/focus=405853

Michal Kubecek

^ permalink raw reply

* Re: [PATCH net-next 0/8] add TX timestamping via cmsg
From: Willem de Bruijn @ 2016-03-31  3:50 UTC (permalink / raw)
  To: Martin KaFai Lau, Soheil Hassas Yeganeh; +Cc: Network Development
In-Reply-To: <CADAM+B1osNNwVpg1R_kAFW8m75mr15_usDa=gmqpBA+xwyA_fg@mail.gmail.com>

>> Nice patches!

This does not yet solve the append issue that your MSG_EOR patch
addresses, of course.

The straightforward jump to new_segment that I proposed as
simplification is more properly a "start-of-record" than
"end-of-record" signal. It is probably preferable to indeed be able to
pass EOR as signal that the last skb must not be appended to in
subsequent calls.

I think that the record bounds issue is best solved independently from
the interface for intermittent timestamps because (a) changing the tcp
bytestream packetization for timestamping introduces subtle
differences between tracked and untracked data that are not always
acceptable and (b) EOR can also be useful outside timestamps. A
zerocopy sendmsg patchset that I sent for RFC last year encountered a
similar requirement, to give one example: each skb with user data must
point to a completion notification structure (ubuf_info), and can only
point to one at a time. Appends that cause a conflict in skb->uarg
pointers had to be blocked, at the cost of possibly different
packetization compared to regular sends.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox