Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH bpf-next 0/4] AF_XDP follow-up patches, cosmetics
From: Daniel Borkmann @ 2018-05-18 14:10 UTC (permalink / raw)
  To: Björn Töpel, magnus.karlsson, magnus.karlsson, ast,
	netdev
  Cc: Björn Töpel
In-Reply-To: <20180518120024.8588-1-bjorn.topel@gmail.com>

On 05/18/2018 02:00 PM, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> This series contain "cosmetics only" follow-up patches for AF_XDP.
> 
> Thanks to Daniel for suggesting them!
> 
> Björn Töpel (4):
>   xsk: clean up SPDX headers
>   xsk: remove newline at end of file
>   xsk: fixed some cases of unnecessary parentheses
>   xsk: proper '=' alignment

Applied to bpf-next, thanks Björn!

^ permalink raw reply

* Re: [PATCH net] tuntap: raise EPOLLOUT on device up
From: Jason Wang @ 2018-05-18 14:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, linux-kernel, Hannes Frederic Sowa, Eric Dumazet
In-Reply-To: <20180518170155-mutt-send-email-mst@kernel.org>



On 2018年05月18日 22:06, Michael S. Tsirkin wrote:
> On Fri, May 18, 2018 at 10:00:31PM +0800, Jason Wang wrote:
>>
>> On 2018年05月18日 21:26, Jason Wang wrote:
>>>
>>> On 2018年05月18日 21:13, Michael S. Tsirkin wrote:
>>>> On Fri, May 18, 2018 at 09:00:43PM +0800, Jason Wang wrote:
>>>>> We return -EIO on device down but can not raise EPOLLOUT after it was
>>>>> up. This may confuse user like vhost which expects tuntap to raise
>>>>> EPOLLOUT to re-enable its TX routine after tuntap is down. This could
>>>>> be easily reproduced by transmitting packets from VM while down and up
>>>>> the tap device. Fixing this by set SOCKWQ_ASYNC_NOSPACE on -EIO.
>>>>>
>>>>> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
>>>>> Cc: Eric Dumazet <edumazet@google.com>
>>>>> Fixes: 1bd4978a88ac2 ("tun: honor IFF_UP in tun_get_user()")
>>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>>>> ---
>>>>>    drivers/net/tun.c | 4 +++-
>>>>>    1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>> index d45ac37..1b29761 100644
>>>>> --- a/drivers/net/tun.c
>>>>> +++ b/drivers/net/tun.c
>>>>> @@ -1734,8 +1734,10 @@ static ssize_t tun_get_user(struct
>>>>> tun_struct *tun, struct tun_file *tfile,
>>>>>        int skb_xdp = 1;
>>>>>        bool frags = tun_napi_frags_enabled(tun);
>>>>>    -    if (!(tun->dev->flags & IFF_UP))
>>>>> +    if (!(tun->dev->flags & IFF_UP)) {
>>>> Isn't this racy?  What if flag is cleared at this point?
>>> I think you mean "set at this point"? Then yes, so we probably need to
>>> set the bit during tun_net_close().
>>>
>>> Thanks
>> Looks no need, vhost will poll socket after it see EIO. So we are ok here?
>>
>> Thanks
> In fact I don't even understand why does this help any longer.
>

We disable tx polling and only enable it on demand for a better rx 
performance. You may want to have a look at :

commit feb8892cb441c742d4220cf7ced001e7fa070731
Author: Jason Wang <jasowang@redhat.com>
Date:   Mon Nov 13 11:45:34 2017 +0800

     vhost_net: conditionally enable tx polling

Thanks

^ permalink raw reply

* Re: [PATCH v3 net-next 00/12] net: stmmac: Clean-up and tune-up
From: Corentin Labbe @ 2018-05-18 14:12 UTC (permalink / raw)
  To: Jose Abreu
  Cc: netdev, David S. Miller, Joao Pinto, Vitor Soares,
	Giuseppe Cavallaro, Alexandre Torgue
In-Reply-To: <cover.1526651009.git.joabreu@synopsys.com>

On Fri, May 18, 2018 at 02:55:57PM +0100, Jose Abreu wrote:
> This targets to uniformize the handling of the different GMAC versions in
> stmmac_main.c file and also tune-up the HW.
> 
> Currently there are some if/else conditions in the main source file which
> calls different callbacks depending on the ID of GMAC.
> 
> With the introducion of a generic HW interface handling which automatically
> selects the GMAC callbacks to be used, it is now unpleasant to see if
> conditions in the main code because this should be completely agnostic of the
> GMAC version.
> 
> This series removes most of these conditions. There are some if conditions
> that remain untouched but the callbacks handling are now uniformized.
> 
> Tested in GMAC5, hope I didn't break any previous versions.
> 
> Please check [1] for performance analisys of patches 3-12.
> 
> ---
> David,
> 
> This will probably generate a merge conflict with [2] (which was not merged
> yet). I'm waiting for Corentin input and then, if this series is merged
> before, I will rebase [2]. Or the other way around if you prefer :D
> 
> Thanks
> ---
> 
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Joao Pinto <jpinto@synopsys.com>
> Cc: Vitor Soares <soares@synopsys.com>
> Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
> Cc: Alexandre Torgue <alexandre.torgue@st.com>
> 
> [1] https://marc.info/?l=linux-netdev&m=152656352607905&w=2
> [2] https://patchwork.ozlabs.org/patch/915286/
> 
> Jose Abreu (12):
>   net: stmmac: Enable OSP for GMAC4
>   net: stmmac: Do not keep rearming the coalesce timer in stmmac_xmit
>   net: stmmac: Let descriptor code set skbuff address
>   net: stmmac: Let descriptor code clear the descriptor
>   net: stmmac: Uniformize the use of dma_{rx/tx}_mode callbacks
>   net: stmmac: Remove uneeded checks for GMAC version
>   net: stmmac: Move PTP and MMC base address calculation to hwif.c
>   net: stmmac: Uniformize the use of dma_init_* callbacks
>   net: stmmac: Remove uneeded check for GMAC version in stmmac_xmit
>   net: stmmac: Uniformize set_rx_owner()
>   net: stmmac: Let descriptor code get skbuff address
>   net: stmmac: Remove if condition by taking advantage of hwif return
>     code
> 
>  drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c  |   82 +++++---
>  .../net/ethernet/stmicro/stmmac/dwmac1000_dma.c    |   92 ++++++----
>  drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c |   35 +++--
>  drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c |   34 +++-
>  drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c   |    7 +-
>  drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.h   |    1 -
>  drivers/net/ethernet/stmicro/stmmac/enh_desc.c     |   20 ++-
>  drivers/net/ethernet/stmicro/stmmac/hwif.c         |   34 ++++
>  drivers/net/ethernet/stmicro/stmmac/hwif.h         |   27 ++-
>  drivers/net/ethernet/stmicro/stmmac/norm_desc.c    |   20 ++-
>  drivers/net/ethernet/stmicro/stmmac/stmmac.h       |    1 +
>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  |  198 +++++++-------------
>  12 files changed, 323 insertions(+), 228 deletions(-)
> 
> 

Hello

You didnt have put in CC linux-kernel@vger.kernel.org as required by get_maintener.pl letting more people to see this series.
Since this series touch dwmac-sun8i.c you should have also added Chen-Yu Tsai/Maxime Ripard (as also asked by get_maintainer).

Regards

^ permalink raw reply

* Re: [PATCH net] net: sched: red: avoid hashing NULL child
From: Jiri Kosina @ 2018-05-18 14:22 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, Jamal Hadi Salim, Cong Wang, Jiri Pirko, David S. Miller,
	Hangbin Liu
In-Reply-To: <286744ab66eaa07d61d5d8ce8a07cb0b1615d35c.1526647813.git.pabeni@redhat.com>

On Fri, 18 May 2018, Paolo Abeni wrote:

> Hangbin reported an Oops triggered by the syzkaller qdisc rules:
> 
>  kasan: GPF could be caused by NULL-ptr deref or user memory access
>  general protection fault: 0000 [#1] SMP KASAN PTI
>  Modules linked in: sch_red
>  CPU: 0 PID: 28699 Comm: syz-executor5 Not tainted 4.17.0-rc4.kcov #1
>  Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
>  RIP: 0010:qdisc_hash_add+0x26/0xa0
>  RSP: 0018:ffff8800589cf470 EFLAGS: 00010203
>  RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff824ad971
>  RDX: 0000000000000007 RSI: ffffc9000ce9f000 RDI: 000000000000003c
>  RBP: 0000000000000001 R08: ffffed000b139ea2 R09: ffff8800589cf4f0
>  R10: ffff8800589cf50f R11: ffffed000b139ea2 R12: ffff880054019fc0
>  R13: ffff880054019fb4 R14: ffff88005c0af600 R15: ffff880054019fb0
>  FS:  00007fa6edcb1700(0000) GS:ffff88005ce00000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 0000000020000740 CR3: 000000000fc16000 CR4: 00000000000006f0
>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>  Call Trace:
>   red_change+0x2d2/0xed0 [sch_red]
>   qdisc_create+0x57e/0xef0
>   tc_modify_qdisc+0x47f/0x14e0
>   rtnetlink_rcv_msg+0x6a8/0x920
>   netlink_rcv_skb+0x2a2/0x3c0
>   netlink_unicast+0x511/0x740
>   netlink_sendmsg+0x825/0xc30
>   sock_sendmsg+0xc5/0x100
>   ___sys_sendmsg+0x778/0x8e0
>   __sys_sendmsg+0xf5/0x1b0
>   do_syscall_64+0xbd/0x3b0
>   entry_SYSCALL_64_after_hwframe+0x44/0xa9
>  RIP: 0033:0x450869
>  RSP: 002b:00007fa6edcb0c48 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
>  RAX: ffffffffffffffda RBX: 00007fa6edcb16b4 RCX: 0000000000450869
>  RDX: 0000000000000000 RSI: 00000000200000c0 RDI: 0000000000000013
>  RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
>  R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
>  R13: 0000000000008778 R14: 0000000000702838 R15: 00007fa6edcb1700
>  Code: e9 0b fe ff ff 0f 1f 44 00 00 55 53 48 89 fb 89 f5 e8 3f 07 f3 fe 48 8d 7b 3c 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 51
>  RIP: qdisc_hash_add+0x26/0xa0 RSP: ffff8800589cf470
> 
> When a red qdisc is updated with a 0 limit, the child qdisc is left
> unmodified, no additional scheduler is created in red_change(),
> the 'child' local variable is rightfully NULL and must not add it
> to the hash table.
> 
> This change addresses the above issue moving qdisc_hash_add() right
> after the child qdisc creation. It additionally removes unneeded checks
> for noop_qdisc.
> 
> Reported-by: Hangbin Liu <liuhangbin@gmail.com>
> Fixes: 49b499718fa1 ("net: sched: make default fifo qdiscs appear in the dump")

Acked-by: Jiri Kosina <jkosina@suse.cz>

Thanks for fixing this Paolo.

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply

* Re: [PATCH v3 net-next 00/12] net: stmmac: Clean-up and tune-up
From: Jose Abreu @ 2018-05-18 14:23 UTC (permalink / raw)
  To: Corentin Labbe, Jose Abreu
  Cc: netdev, David S. Miller, Joao Pinto, Vitor Soares,
	Giuseppe Cavallaro, Alexandre Torgue, Maxime Ripard, Chen-Yu Tsai
In-Reply-To: <20180518141201.GA11284@Red>

Hi Corentin,

On 18-05-2018 15:12, Corentin Labbe wrote:
> On Fri, May 18, 2018 at 02:55:57PM +0100, Jose Abreu wrote:
>> This targets to uniformize the handling of the different GMAC versions in
>> stmmac_main.c file and also tune-up the HW.
>>
>> Currently there are some if/else conditions in the main source file which
>> calls different callbacks depending on the ID of GMAC.
>>
>> With the introducion of a generic HW interface handling which automatically
>> selects the GMAC callbacks to be used, it is now unpleasant to see if
>> conditions in the main code because this should be completely agnostic of the
>> GMAC version.
>>
>> This series removes most of these conditions. There are some if conditions
>> that remain untouched but the callbacks handling are now uniformized.
>>
>> Tested in GMAC5, hope I didn't break any previous versions.
>>
>> Please check [1] for performance analisys of patches 3-12.
>>
>> ---
>> David,
>>
>> This will probably generate a merge conflict with [2] (which was not merged
>> yet). I'm waiting for Corentin input and then, if this series is merged
>> before, I will rebase [2]. Or the other way around if you prefer :D
>>
>> Thanks
>> ---
>>
>> Cc: David S. Miller <davem@davemloft.net>
>> Cc: Joao Pinto <jpinto@synopsys.com>
>> Cc: Vitor Soares <soares@synopsys.com>
>> Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
>> Cc: Alexandre Torgue <alexandre.torgue@st.com>
>>
>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__marc.info_-3Fl-3Dlinux-2Dnetdev-26m-3D152656352607905-26w-3D2&d=DwIBAg&c=DPL6_X_6JkXFx7AXWqB0tg&r=WHDsc6kcWAl4i96Vm5hJ_19IJiuxx_p_Rzo2g-uHDKw&m=a7bgrSQpisaMSa5fT-je94smZ_TM7QTxNFKqkvI5Nns&s=Tr23Xj_UCR_PaJp8AYiy18hfhbILnsaCsKDT5_4m2z4&e=
>> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.ozlabs.org_patch_915286_&d=DwIBAg&c=DPL6_X_6JkXFx7AXWqB0tg&r=WHDsc6kcWAl4i96Vm5hJ_19IJiuxx_p_Rzo2g-uHDKw&m=a7bgrSQpisaMSa5fT-je94smZ_TM7QTxNFKqkvI5Nns&s=Q0SV-ZR35zIJWjiaLNXqlOWchppQ2CsO-Fh-BFCjCB8&e=
>>
>> Jose Abreu (12):
>>   net: stmmac: Enable OSP for GMAC4
>>   net: stmmac: Do not keep rearming the coalesce timer in stmmac_xmit
>>   net: stmmac: Let descriptor code set skbuff address
>>   net: stmmac: Let descriptor code clear the descriptor
>>   net: stmmac: Uniformize the use of dma_{rx/tx}_mode callbacks
>>   net: stmmac: Remove uneeded checks for GMAC version
>>   net: stmmac: Move PTP and MMC base address calculation to hwif.c
>>   net: stmmac: Uniformize the use of dma_init_* callbacks
>>   net: stmmac: Remove uneeded check for GMAC version in stmmac_xmit
>>   net: stmmac: Uniformize set_rx_owner()
>>   net: stmmac: Let descriptor code get skbuff address
>>   net: stmmac: Remove if condition by taking advantage of hwif return
>>     code
>>
>>  drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c  |   82 +++++---
>>  .../net/ethernet/stmicro/stmmac/dwmac1000_dma.c    |   92 ++++++----
>>  drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c |   35 +++--
>>  drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c |   34 +++-
>>  drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c   |    7 +-
>>  drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.h   |    1 -
>>  drivers/net/ethernet/stmicro/stmmac/enh_desc.c     |   20 ++-
>>  drivers/net/ethernet/stmicro/stmmac/hwif.c         |   34 ++++
>>  drivers/net/ethernet/stmicro/stmmac/hwif.h         |   27 ++-
>>  drivers/net/ethernet/stmicro/stmmac/norm_desc.c    |   20 ++-
>>  drivers/net/ethernet/stmicro/stmmac/stmmac.h       |    1 +
>>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  |  198 +++++++-------------
>>  12 files changed, 323 insertions(+), 228 deletions(-)
>>
>>
> Hello
>
> You didnt have put in CC linux-kernel@vger.kernel.org as required by get_maintener.pl letting more people to see this series.
> Since this series touch dwmac-sun8i.c you should have also added Chen-Yu Tsai/Maxime Ripard (as also asked by get_maintainer).

Usually I just cc according to MAINTAINERS file but thanks for
noticing. Added in cc now.

Thanks and Best Regards,
Jose Miguel Abreu

>
> Regards

^ permalink raw reply

* Re: [PATCH net-next] net: stmmac: Populate missing callbacks in HWIF initialization
From: Corentin Labbe @ 2018-05-18 14:25 UTC (permalink / raw)
  To: Jose Abreu
  Cc: netdev, David S. Miller, Joao Pinto, Giuseppe Cavallaro,
	Alexandre Torgue
In-Reply-To: <a08345ae8576d5ebddcd31ee73beea18484a671d.1526550924.git.joabreu@synopsys.com>

On Thu, May 17, 2018 at 10:57:28AM +0100, Jose Abreu wrote:
> Some HW specific setusp, like sun8i, do not populate all the necessary
> callbacks, which is what HWIF helpers were expecting.
> 
> Fix this by always trying to get the generic helpers and populate them
> if they were not previously populated by HW specific setup.
> 
> Signed-off-by: Jose Abreu <joabreu@synopsys.com>
> Fixes: 5f0456b43140 ("net: stmmac: Implement logic to automatically
> select HW Interface")
> Reported-by: Corentin Labbe <clabbe.montjoie@gmail.com>
> Cc: Corentin Labbe <clabbe.montjoie@gmail.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Joao Pinto <jpinto@synopsys.com>
> Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
> Cc: Alexandre Torgue <alexandre.torgue@st.com>
> ---
> Hi Corentin,
> 
> Please check if this patch makes sun8i work again.
> 
> Thanks and Best Regards,
> Jose Miguel Abreu
> ---

Hello

Tested-by: Corentin Labbe <clabbe.montjoie@gmail.com>

Thanks for the quick fix.

Note that this patch conflict with your next v3 serie

Regards

^ permalink raw reply

* Re: [PATCH v3 net-next 00/12] net: stmmac: Clean-up and tune-up
From: Corentin Labbe @ 2018-05-18 14:28 UTC (permalink / raw)
  To: Jose Abreu
  Cc: netdev, David S. Miller, Joao Pinto, Vitor Soares,
	Giuseppe Cavallaro, Alexandre Torgue, Maxime Ripard, Chen-Yu Tsai
In-Reply-To: <a42375f2-6521-f959-75ea-2a5be1680be9@synopsys.com>

On Fri, May 18, 2018 at 03:23:44PM +0100, Jose Abreu wrote:
> Hi Corentin,
> 
> On 18-05-2018 15:12, Corentin Labbe wrote:
> > On Fri, May 18, 2018 at 02:55:57PM +0100, Jose Abreu wrote:
> >> This targets to uniformize the handling of the different GMAC versions in
> >> stmmac_main.c file and also tune-up the HW.
> >>
> >> Currently there are some if/else conditions in the main source file which
> >> calls different callbacks depending on the ID of GMAC.
> >>
> >> With the introducion of a generic HW interface handling which automatically
> >> selects the GMAC callbacks to be used, it is now unpleasant to see if
> >> conditions in the main code because this should be completely agnostic of the
> >> GMAC version.
> >>
> >> This series removes most of these conditions. There are some if conditions
> >> that remain untouched but the callbacks handling are now uniformized.
> >>
> >> Tested in GMAC5, hope I didn't break any previous versions.
> >>
> >> Please check [1] for performance analisys of patches 3-12.
> >>
> >> ---
> >> David,
> >>
> >> This will probably generate a merge conflict with [2] (which was not merged
> >> yet). I'm waiting for Corentin input and then, if this series is merged
> >> before, I will rebase [2]. Or the other way around if you prefer :D
> >>
> >> Thanks
> >> ---
> >>
> >> Cc: David S. Miller <davem@davemloft.net>
> >> Cc: Joao Pinto <jpinto@synopsys.com>
> >> Cc: Vitor Soares <soares@synopsys.com>
> >> Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
> >> Cc: Alexandre Torgue <alexandre.torgue@st.com>
> >>
> >> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__marc.info_-3Fl-3Dlinux-2Dnetdev-26m-3D152656352607905-26w-3D2&d=DwIBAg&c=DPL6_X_6JkXFx7AXWqB0tg&r=WHDsc6kcWAl4i96Vm5hJ_19IJiuxx_p_Rzo2g-uHDKw&m=a7bgrSQpisaMSa5fT-je94smZ_TM7QTxNFKqkvI5Nns&s=Tr23Xj_UCR_PaJp8AYiy18hfhbILnsaCsKDT5_4m2z4&e=
> >> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.ozlabs.org_patch_915286_&d=DwIBAg&c=DPL6_X_6JkXFx7AXWqB0tg&r=WHDsc6kcWAl4i96Vm5hJ_19IJiuxx_p_Rzo2g-uHDKw&m=a7bgrSQpisaMSa5fT-je94smZ_TM7QTxNFKqkvI5Nns&s=Q0SV-ZR35zIJWjiaLNXqlOWchppQ2CsO-Fh-BFCjCB8&e=
> >>
> >> Jose Abreu (12):
> >>   net: stmmac: Enable OSP for GMAC4
> >>   net: stmmac: Do not keep rearming the coalesce timer in stmmac_xmit
> >>   net: stmmac: Let descriptor code set skbuff address
> >>   net: stmmac: Let descriptor code clear the descriptor
> >>   net: stmmac: Uniformize the use of dma_{rx/tx}_mode callbacks
> >>   net: stmmac: Remove uneeded checks for GMAC version
> >>   net: stmmac: Move PTP and MMC base address calculation to hwif.c
> >>   net: stmmac: Uniformize the use of dma_init_* callbacks
> >>   net: stmmac: Remove uneeded check for GMAC version in stmmac_xmit
> >>   net: stmmac: Uniformize set_rx_owner()
> >>   net: stmmac: Let descriptor code get skbuff address
> >>   net: stmmac: Remove if condition by taking advantage of hwif return
> >>     code
> >>
> >>  drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c  |   82 +++++---
> >>  .../net/ethernet/stmicro/stmmac/dwmac1000_dma.c    |   92 ++++++----
> >>  drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c |   35 +++--
> >>  drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c |   34 +++-
> >>  drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c   |    7 +-
> >>  drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.h   |    1 -
> >>  drivers/net/ethernet/stmicro/stmmac/enh_desc.c     |   20 ++-
> >>  drivers/net/ethernet/stmicro/stmmac/hwif.c         |   34 ++++
> >>  drivers/net/ethernet/stmicro/stmmac/hwif.h         |   27 ++-
> >>  drivers/net/ethernet/stmicro/stmmac/norm_desc.c    |   20 ++-
> >>  drivers/net/ethernet/stmicro/stmmac/stmmac.h       |    1 +
> >>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  |  198 +++++++-------------
> >>  12 files changed, 323 insertions(+), 228 deletions(-)
> >>
> >>
> > Hello
> >
> > You didnt have put in CC linux-kernel@vger.kernel.org as required by get_maintener.pl letting more people to see this series.
> > Since this series touch dwmac-sun8i.c you should have also added Chen-Yu Tsai/Maxime Ripard (as also asked by get_maintainer).
> 
> Usually I just cc according to MAINTAINERS file but thanks for
> noticing. Added in cc now.
> 

./scripts/get_maintainer.pl does this for you (and it use MAINTAINERS).
You have to use it at least, since it handle regex that could be easily unseen like for dwmac-sun8i.

^ permalink raw reply

* Re: [RFC v4 3/5] virtio_ring: add packed ring support
From: Tiwei Bie @ 2018-05-18 14:33 UTC (permalink / raw)
  To: Jason Wang; +Cc: mst, virtualization, linux-kernel, netdev, wexu, jfreimann
In-Reply-To: <e247b46a-183f-3959-859b-aebb0bf81c07@redhat.com>

On Fri, May 18, 2018 at 09:17:05PM +0800, Jason Wang wrote:
> On 2018年05月18日 19:29, Tiwei Bie wrote:
> > On Thu, May 17, 2018 at 08:01:52PM +0800, Jason Wang wrote:
> > > On 2018年05月16日 22:33, Tiwei Bie wrote:
> > > > On Wed, May 16, 2018 at 10:05:44PM +0800, Jason Wang wrote:
> > > > > On 2018年05月16日 21:45, Tiwei Bie wrote:
> > > > > > On Wed, May 16, 2018 at 08:51:43PM +0800, Jason Wang wrote:
> > > > > > > On 2018年05月16日 20:39, Tiwei Bie wrote:
> > > > > > > > On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
> > > > > > > > > On 2018年05月16日 16:37, Tiwei Bie wrote:
> > > > [...]
> > > > > > > > > > +static void detach_buf_packed(struct vring_virtqueue *vq, unsigned int head,
> > > > > > > > > > +			      unsigned int id, void **ctx)
> > > > > > > > > > +{
> > > > > > > > > > +	struct vring_packed_desc *desc;
> > > > > > > > > > +	unsigned int i, j;
> > > > > > > > > > +
> > > > > > > > > > +	/* Clear data ptr. */
> > > > > > > > > > +	vq->desc_state[id].data = NULL;
> > > > > > > > > > +
> > > > > > > > > > +	i = head;
> > > > > > > > > > +
> > > > > > > > > > +	for (j = 0; j < vq->desc_state[id].num; j++) {
> > > > > > > > > > +		desc = &vq->vring_packed.desc[i];
> > > > > > > > > > +		vring_unmap_one_packed(vq, desc);
> > > > > > > > > As mentioned in previous discussion, this probably won't work for the case
> > > > > > > > > of out of order completion since it depends on the information in the
> > > > > > > > > descriptor ring. We probably need to extend ctx to record such information.
> > > > > > > > Above code doesn't depend on the information in the descriptor
> > > > > > > > ring. The vq->desc_state[] is the extended ctx.
> > > > > > > > 
> > > > > > > > Best regards,
> > > > > > > > Tiwei Bie
> > > > > > > Yes, but desc is a pointer to descriptor ring I think so
> > > > > > > vring_unmap_one_packed() still depends on the content of descriptor ring?
> > > > > > > 
> > > > > > I got your point now. I think it makes sense to reserve
> > > > > > the bits of the addr field. Driver shouldn't try to get
> > > > > > addrs from the descriptors when cleanup the descriptors
> > > > > > no matter whether we support out-of-order or not.
> > > > > Maybe I was wrong, but I remember spec mentioned something like this.
> > > > You're right. Spec mentioned this. I was just repeating
> > > > the spec to emphasize that it does make sense. :)
> > > > 
> > > > > > But combining it with the out-of-order support, it will
> > > > > > mean that the driver still needs to maintain a desc/ctx
> > > > > > list that is very similar to the desc ring in the split
> > > > > > ring. I'm not quite sure whether it's something we want.
> > > > > > If it is true, I'll do it. So do you think we also want
> > > > > > to maintain such a desc/ctx list for packed ring?
> > > > > To make it work for OOO backends I think we need something like this
> > > > > (hardware NIC drivers are usually have something like this).
> > > > Which hardware NIC drivers have this?
> > > It's quite common I think, e.g driver track e.g dma addr and page frag
> > > somewhere. e.g the ring->rx_info in mlx4 driver.
> > It seems that I had a misunderstanding on your
> > previous comments. I know it's quite common for
> > drivers to track e.g. DMA addrs somewhere (and
> > I think one reason behind this is that they want
> > to reuse the bits of addr field).
> 
> Yes, we may want this for virtio-net as well in the future.
> 
> >   But tracking
> > addrs somewhere doesn't means supporting OOO.
> > I thought you were saying it's quite common for
> > hardware NIC drivers to support OOO (i.e. NICs
> > will return the descriptors OOO):
> > 
> > I'm not familiar with mlx4, maybe I'm wrong.
> > I just had a quick glance. And I found below
> > comments in mlx4_en_process_rx_cq():
> > 
> > ```
> > /* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
> >   * descriptor offset can be deduced from the CQE index instead of
> >   * reading 'cqe->index' */
> > index = cq->mcq.cons_index & ring->size_mask;
> > cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
> > ```
> > 
> > It seems that although they have a completion
> > queue, they are still using the ring in order.
> 
> I guess so (at least from the above bits). Git grep -i "out of order" in
> drivers/net gives some hints. Looks like there're few deivces do this.
> 
> > I guess maybe storage device may want OOO.
> 
> Right, some iSCSI did.
> 
> But tracking them elsewhere is not only for OOO.
> 
> Spec said:
> 
> for element address
> 
> "
> In a used descriptor, Element Address is unused.
> "
> 
> for Next flag:
> 
> "
> For example, if descriptors are used in the same order in which they are
> made available, this will result in
> the used descriptor overwriting the first available descriptor in the list,
> the used descriptor for the next list
> overwriting the first available descriptor in the next list, etc.
> "
> 
> for in order completion:
> 
> "
> This will result in the used descriptor overwriting the first available
> descriptor in the batch, the used descriptor
> for the next batch overwriting the first available descriptor in the next
> batch, etc.
> "
> 
> So:
> 
> - It's an alignment to the spec
> - device may (or should) overwrite the descriptor make also make address
> field useless.

You didn't get my point...

I agreed driver should track the DMA addrs or some
other necessary things from the very beginning. And
I also repeated the spec to emphasize that it does
make sense. And I'd like to do that.

What I was saying is that, to support OOO, we may
need to manage these context (which saves DMA addrs
etc) via a list which is similar to the desc list
maintained via `next` in split ring instead of an
array whose elements always can be indexed directly.

The desc ring in split ring is an array, but its
free entries are managed as list via next. I was
just wondering, do we want to manage such a list
because of OOO. It's just a very simple question
that I want to hear your opinion... (It doesn't
means anything, e.g. It doesn't mean I don't want
to support OOO. It's just a simple question...)

Best regards,
Tiwei Bie

> 
> Thanks
> 
> > 
> > Best regards,
> > Tiwei Bie
> > 
> > > Thanks
> > > 
> > > > > Not for the patch, but it looks like having a OUT_OF_ORDER feature bit is
> > > > > much more simpler to be started with.
> > > > +1
> > > > 
> > > > Best regards,
> > > > Tiwei Bie
> 

^ permalink raw reply

* Re: [PATCH] hippi: fix spelling mistake: "Framming" -> "Framing"
From: Jes Sorensen @ 2018-05-18 14:33 UTC (permalink / raw)
  To: Colin King, David S . Miller, netdev; +Cc: kernel-janitors, linux-kernel
In-Reply-To: <20180518100922.28790-1-colin.king@canonical.com>

On 05/18/2018 06:09 AM, Colin King wrote:
> From: Colin Ian King <colin.king@canonical.com>
> 
> Trivial fix to spelling mistake in printk message text
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> ---
>  drivers/net/hippi/rrunner.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/hippi/rrunner.c b/drivers/net/hippi/rrunner.c
> index 1ab97d99b9ba..f41116488079 100644
> --- a/drivers/net/hippi/rrunner.c
> +++ b/drivers/net/hippi/rrunner.c
> @@ -867,7 +867,7 @@ static u32 rr_handle_event(struct net_device *dev, u32 prodidx, u32 eidx)
>  			       dev->name);
>  			goto drop;
>  		case E_FRM_ERR:
> -			printk(KERN_WARNING "%s: Framming Error\n",
> +			printk(KERN_WARNING "%s: Framing Error\n",
>  			       dev->name);
>  			goto drop;
>  		case E_FLG_SYN_ERR:
> 

Fine with me! I do wonder if it's time to retire this driver and the
HIPPI code. I haven't had access to hardware for over a decade so I have
no idea if it even works anymore.

Jes

^ permalink raw reply

* Re: [PATCH 1/2] bpf: sockmap, double free in __sock_map_ctx_update_elem()
From: Dan Carpenter @ 2018-05-18 14:39 UTC (permalink / raw)
  To: Daniel Borkmann, Gustavo A. R. Silva
  Cc: Alexei Starovoitov, John Fastabend, netdev, kernel-janitors
In-Reply-To: <e9dc40fb-ffa6-58af-d9df-eaab2d0a9259@iogearbox.net>

On Fri, May 18, 2018 at 10:27:18AM +0200, Daniel Borkmann wrote:
> 
> Thanks for the two fixes, appreciate it! There were two similar ones that
> fix the same issues which were already applied yesterday to bpf-next:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=0e4364560361d57e8cd873a8990327f3471d7d8a
> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=a78622932c27e8ec33e5ba180f3d2e87fb806b28

Hey Gustavo,

We're sort of duplicating each other's work.  Could you CC
kernel-janitors@vger.kernel.org for static checker fixes so that I can
see what you're working on?

We'll probably still send the occasional duplicate which is fine...

regards,
dan carpenter

^ permalink raw reply

* Re: [PATCH net] tuntap: raise EPOLLOUT on device up
From: Michael S. Tsirkin @ 2018-05-18 14:46 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, Hannes Frederic Sowa, Eric Dumazet
In-Reply-To: <ac404f5e-22ea-00e7-8415-63116ce4881e@redhat.com>

On Fri, May 18, 2018 at 10:11:54PM +0800, Jason Wang wrote:
> 
> 
> On 2018年05月18日 22:06, Michael S. Tsirkin wrote:
> > On Fri, May 18, 2018 at 10:00:31PM +0800, Jason Wang wrote:
> > > 
> > > On 2018年05月18日 21:26, Jason Wang wrote:
> > > > 
> > > > On 2018年05月18日 21:13, Michael S. Tsirkin wrote:
> > > > > On Fri, May 18, 2018 at 09:00:43PM +0800, Jason Wang wrote:
> > > > > > We return -EIO on device down but can not raise EPOLLOUT after it was
> > > > > > up. This may confuse user like vhost which expects tuntap to raise
> > > > > > EPOLLOUT to re-enable its TX routine after tuntap is down. This could
> > > > > > be easily reproduced by transmitting packets from VM while down and up
> > > > > > the tap device. Fixing this by set SOCKWQ_ASYNC_NOSPACE on -EIO.
> > > > > > 
> > > > > > Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
> > > > > > Cc: Eric Dumazet <edumazet@google.com>
> > > > > > Fixes: 1bd4978a88ac2 ("tun: honor IFF_UP in tun_get_user()")
> > > > > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > > > > ---
> > > > > >    drivers/net/tun.c | 4 +++-
> > > > > >    1 file changed, 3 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > > > > > index d45ac37..1b29761 100644
> > > > > > --- a/drivers/net/tun.c
> > > > > > +++ b/drivers/net/tun.c
> > > > > > @@ -1734,8 +1734,10 @@ static ssize_t tun_get_user(struct
> > > > > > tun_struct *tun, struct tun_file *tfile,
> > > > > >        int skb_xdp = 1;
> > > > > >        bool frags = tun_napi_frags_enabled(tun);
> > > > > >    -    if (!(tun->dev->flags & IFF_UP))
> > > > > > +    if (!(tun->dev->flags & IFF_UP)) {
> > > > > Isn't this racy?  What if flag is cleared at this point?
> > > > I think you mean "set at this point"? Then yes, so we probably need to
> > > > set the bit during tun_net_close().
> > > > 
> > > > Thanks
> > > Looks no need, vhost will poll socket after it see EIO. So we are ok here?
> > > 
> > > Thanks
> > In fact I don't even understand why does this help any longer.
> > 
> 
> We disable tx polling and only enable it on demand for a better rx
> performance. You may want to have a look at :
> 
> commit feb8892cb441c742d4220cf7ced001e7fa070731
> Author: Jason Wang <jasowang@redhat.com>
> Date:   Mon Nov 13 11:45:34 2017 +0800
> 
>     vhost_net: conditionally enable tx polling
> 
> Thanks


Question is, what looks at SOCKWQ_ASYNC_NOSPACE.
I think it's tested when packet is transmitted,
but there is no guarantee here any packet will
ever be transmitted.

-- 
MST

^ permalink raw reply

* Re: [PATCH v3 net-next 00/12] net: stmmac: Clean-up and tune-up
From: David Miller @ 2018-05-18 15:09 UTC (permalink / raw)
  To: Jose.Abreu
  Cc: netdev, Joao.Pinto, Vitor.Soares, peppe.cavallaro,
	alexandre.torgue
In-Reply-To: <cover.1526651009.git.joabreu@synopsys.com>

From: Jose Abreu <Jose.Abreu@synopsys.com>
Date: Fri, 18 May 2018 14:55:57 +0100

> This targets to uniformize the handling of the different GMAC versions in
> stmmac_main.c file and also tune-up the HW.
> 
> Currently there are some if/else conditions in the main source file which
> calls different callbacks depending on the ID of GMAC.
> 
> With the introducion of a generic HW interface handling which automatically
> selects the GMAC callbacks to be used, it is now unpleasant to see if
> conditions in the main code because this should be completely agnostic of the
> GMAC version.
> 
> This series removes most of these conditions. There are some if conditions
> that remain untouched but the callbacks handling are now uniformized.
> 
> Tested in GMAC5, hope I didn't break any previous versions.
> 
> Please check [1] for performance analisys of patches 3-12.

This looks a lot better, series applied, thanks.


> This will probably generate a merge conflict with [2] (which was not merged
> yet). I'm waiting for Corentin input and then, if this series is merged
> before, I will rebase [2]. Or the other way around if you prefer :D

Since I just merged this, please rebase 2.

Thank you.

^ permalink raw reply

* Re: VLAN performance issues with Open vSwitch when 8021q not loaded
From: Eric Dumazet @ 2018-05-18 15:12 UTC (permalink / raw)
  To: Mikhail Sennikovsky, netdev
In-Reply-To: <CALHVEJaaY=DeavKjBkcuDT19EpLhmCyJunP1V7hyNJsCkqvrdw@mail.gmail.com>



On 05/18/2018 06:24 AM, Mikhail Sennikovsky wrote:
> Hi all,
> 
> We were recently experiencing network performance issues with VLAN
> networking setup with Open vSwitch, for the ingress traffic coming to
> VLAN trunk port of the ovs.
> As we discovered, the issue was caused by gro not working for it,
> which in turn was because the gro receive callbacks for 802.1Q payload
> type are defined in the 8021q module (see
> https://elixir.bootlin.com/linux/v4.16.9/source/net/8021q/vlan.c#L762
> ), which was not loaded.
> This resulted in a significant bandwidth performance drop, having
> ~3Gbps instead of the expected ~7Gbps for a simple iperf3 test in our
> case.
> 
> The obvious work-around would be to load the 8021q module, which
> indeed makes bandwidth performance back to the expected numbers.
> This seems like a hidden and not obvious magic however.
> 
> So I'm questioning, whether it makes sense to have the gro receive
> callbacks for 802.1Q and 802.1ad moved to some common place, that
> would be used/enabled by both 8021q and openvswitch modules.


Sure, GRO should do whatever it can.

TCP IPv6 traffic can be aggregated by GRO even if IPv6 module is not loaded.

net/ipv6/Makefile :

ipv6-offload := ip6_offload.o tcpv6_offload.o exthdrs_offload.o
...
obj-$(CONFIG_INET) += output_core.o protocol.o $(ipv6-offload)

^ permalink raw reply

* Re: [PATCH bpf v2 1/6] bpf: support 64-bit offsets for bpf function calls
From: Daniel Borkmann @ 2018-05-18 15:15 UTC (permalink / raw)
  To: Sandipan Das, ast; +Cc: netdev, linuxppc-dev, naveen.n.rao, mpe, jakub.kicinski
In-Reply-To: <20180518125039.6500-2-sandipan@linux.vnet.ibm.com>

On 05/18/2018 02:50 PM, Sandipan Das wrote:
> The imm field of a bpf instruction is a signed 32-bit integer.
> For JIT bpf-to-bpf function calls, it stores the offset of the
> start address of the callee's JITed image from __bpf_call_base.
> 
> For some architectures, such as powerpc64, this offset may be
> as large as 64 bits and cannot be accomodated in the imm field
> without truncation.
> 
> We resolve this by:
> 
> [1] Additionally using the auxillary data of each function to
>     keep a list of start addresses of the JITed images for all
>     functions determined by the verifier.
> 
> [2] Retaining the subprog id inside the off field of the call
>     instructions and using it to index into the list mentioned
>     above and lookup the callee's address.
> 
> To make sure that the existing JIT compilers continue to work
> without requiring changes, we keep the imm field as it is.
> 
> Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
> ---
>  kernel/bpf/verifier.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index a9e4b1372da6..6c56cce9c4e3 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -5383,11 +5383,24 @@ static int jit_subprogs(struct bpf_verifier_env *env)
>  			    insn->src_reg != BPF_PSEUDO_CALL)
>  				continue;
>  			subprog = insn->off;
> -			insn->off = 0;
>  			insn->imm = (u64 (*)(u64, u64, u64, u64, u64))
>  				func[subprog]->bpf_func -
>  				__bpf_call_base;
>  		}
> +
> +		/* we use the aux data to keep a list of the start addresses
> +		 * of the JITed images for each function in the program
> +		 *
> +		 * for some architectures, such as powerpc64, the imm field
> +		 * might not be large enough to hold the offset of the start
> +		 * address of the callee's JITed image from __bpf_call_base
> +		 *
> +		 * in such cases, we can lookup the start address of a callee
> +		 * by using its subprog id, available from the off field of
> +		 * the call instruction, as an index for this list
> +		 */
> +		func[i]->aux->func = func;
> +		func[i]->aux->func_cnt = env->subprog_cnt + 1;

The target tree you have here is infact bpf, since in bpf-next there was a
cleanup where the + 1 is removed. Just for the record that we need to keep
this in mind for bpf into bpf-next merge since this would otherwise subtly
break.

>  	}
>  	for (i = 0; i < env->subprog_cnt; i++) {
>  		old_bpf_func = func[i]->bpf_func;
> 

^ permalink raw reply

* Re: [PATCH bpf-next v3 00/15] Introducing AF_XDP support
From: Björn Töpel @ 2018-05-18 15:18 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Alexei Starovoitov, Karlsson, Magnus,
	Duyck, Alexander H, Alexander Duyck, John Fastabend,
	Jesper Dangaard Brouer, Willem de Bruijn, Michael S. Tsirkin,
	Netdev, Björn Töpel, michael.lundkvist,
	Brandeburg, Jesse, Singhai, Anjali, Zhang, Qi Z
In-Reply-To: <3cc841af-d950-c961-7865-a5694eb5c618@iogearbox.net>

2018-05-18 15:43 GMT+02:00 Daniel Borkmann <daniel@iogearbox.net>:
> On 05/18/2018 05:38 AM, Alexei Starovoitov wrote:
>> On 5/16/18 11:46 PM, Björn Töpel wrote:
>>> 2018-05-04 1:38 GMT+02:00 Alexei Starovoitov <alexei.starovoitov@gmail.com>:
>>>> On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
>>>>> On 05/02/2018 01:01 PM, Björn Töpel wrote:
>>>>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>>>>
>>>>>> This patch set introduces a new address family called AF_XDP that is
>>>>>> optimized for high performance packet processing and, in upcoming
>>>>>> patch sets, zero-copy semantics. In this patch set, we have removed
>>>>>> all zero-copy related code in order to make it smaller, simpler and
>>>>>> hopefully more review friendly. This patch set only supports copy-mode
>>>>>> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
>>>>>> for RX using the XDP_DRV path. Zero-copy support requires XDP and
>>>>>> driver changes that Jesper Dangaard Brouer is working on. Some of his
>>>>>> work has already been accepted. We will publish our zero-copy support
>>>>>> for RX and TX on top of his patch sets at a later point in time.
>>>>>
>>>>> +1, would be great to see it land this cycle. Saw few minor nits here
>>>>> and there but nothing to hold it up, for the series:
>>>>>
>>>>> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>>>>>
>>>>> Thanks everyone!
>>>>
>>>> Great stuff!
>>>>
>>>> Applied to bpf-next, with one condition.
>>>> Upcoming zero-copy patches for both RX and TX need to be posted
>>>> and reviewed within this release window.
>>>> If netdev community as a whole won't be able to agree on the zero-copy
>>>> bits we'd need to revert this feature before the next merge window.
>>>>
>>>> Few other minor nits:
>>>> patch 3:
>>>> +struct xdp_ring {
>>>> +       __u32 producer __attribute__((aligned(64)));
>>>> +       __u32 consumer __attribute__((aligned(64)));
>>>> +};
>>>> It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi headers.
>>>
>>> Hmm, I need some guidance on what a sane uapi variant would be. We
>>> can't have the uapi depend on the kernel build. ARM64, e.g., can have
>>> both 64B and 128B according to the specs. Contemporary IA processors
>>> have 64B.
>>>
>>> The simplest, and maybe most future-proof, would be 128B aligned for
>>> all. Another is having 128B for ARM and 64B for all IA. A third option
>>> is having a hand-shaking API (I think virtio has that) for determine
>>> the cache line size, but I'd rather not go down that route.
>>>
>>> Thoughts/ideas on how a uapi ____cacheline_aligned_in_smp version
>>> would look like?
>>
>> I suspect i40e+arm combination wasn't tested anyway.
>> The api may have endianness issues too on something like sparc.
>> I think the way to be backwards compatible in this area
>> is to make the api usable on x86 only by adding
>> to include/uapi/linux/if_xdp.h
>> #if defined(__x86_64__)
>> #define AF_XDP_CACHE_BYTES 64
>> #else
>> #error "AF_XDP support is not yet available for this architecture"
>> #endif
>> and doing:
>>     __u32 producer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
>>     __u32 consumer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
>>
>> And progressively add to this for arm64 and few other archs.
>> Eventually removing #error and adding some generic define
>> that's good enough for long tail of architectures that
>> we really cannot test.
>
> Been looking into this yesterday as well a bit, and it's a bit of a mess what
> uapi headers do on this regard (though there are just a handful of such headers).
> Some of the kernel uapi headers hard-code generally 64 bytes regardless of the
> underlying arch. In general, the kernel does expose it to user space via sysfs
> (coherency_line_size). Here's what perf does to retrieve it:
>
> #ifdef _SC_LEVEL1_DCACHE_LINESIZE
> #define cache_line_size(cacheline_sizep) *cacheline_sizep = sysconf(_SC_LEVEL1_DCACHE_LINESIZE)
> #else
> static void cache_line_size(int *cacheline_sizep)
> {
>         if (sysfs__read_int("devices/system/cpu/cpu0/cache/index0/coherency_line_size", cacheline_sizep))
>                 pr_debug("cannot determine cache line size");
> }
> #endif
>
> The sysconf() implementation for _SC_LEVEL1_DCACHE_LINESIZE seems also only
> available for x86, arm64, s390 and ppc on a cursory glance in the glibc code.
> In the x86 case it retrieves the info from cpuid insn. In order to generically
> use it in combination with the header you'd have some probe which would then
> set this as a define before including the header.
>

But as a uapi we cannot depend on the L1 cache line size for what's
currently running on the system, right? So, either a "one cache line
size for all flavors of an arch", e.g. for ARMv8 that would be 128,
even though there can be 64 flavors out there.

Another way would be to remove the ring structure completely, and
leave that to the user-space application to figure out. So there's a
runtime interface (getsockopt) to probe the offsets the head and tail
pointer is after/before the mmap call, and only expose the descriptor
format explicitly in if_xdp.h. Don't know if that is too unorthodox or
not...

> Then projects like urcu, they do ...
>
> #define ____cacheline_internodealigned_in_smp \
>         __attribute__((__aligned__(CAA_CACHE_LINE_SIZE)))
>
> ... and then hard code CAA_CACHE_LINE_SIZE for x86 (== 128), s390 (== 128),
> ppc (== 256) and sparc64 (== 256) with a generic fallback to 64.
>
> Hmm, perhaps a combination of the two would make sense where in case of known
> cacheline size it can still be used and we only have the fallback in such way.
> Like:
>
> #ifndef XDP_CACHE_BYTES
> # if defined(__x86_64__)
> #  define XDP_CACHE_BYTES       64
> # else
> #  error "Please define XDP_CACHE_BYTES for this architecture!"
> # endif
> #endif
>
> Too bad there's no asm uapi header at least for the archs where it's fixed
> anyway such that not every project out there has to redefine all of it from
> scratch and we could just include it (and the generic-asm one would throw
> a compile error if it's not externally defined or such).
>

I started out with adding a cache.h to the arch/XXX/include/uapi, but
then realized that this didn't really work. :-)

> Cheers,
> Daniel

^ permalink raw reply

* [PATCH] powerpc/32: Implement csum_ipv6_magic in assembly
From: Christophe Leroy @ 2018-05-18 15:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linux-kernel, linuxppc-dev, netdev

The generic csum_ipv6_magic() generates a pretty bad result

00000000 <csum_ipv6_magic>:
   0:	81 23 00 00 	lwz     r9,0(r3)
   4:	81 03 00 04 	lwz     r8,4(r3)
   8:	7c e7 4a 14 	add     r7,r7,r9
   c:	7d 29 38 10 	subfc   r9,r9,r7
  10:	7d 4a 51 10 	subfe   r10,r10,r10
  14:	7d 27 42 14 	add     r9,r7,r8
  18:	7d 2a 48 50 	subf    r9,r10,r9
  1c:	80 e3 00 08 	lwz     r7,8(r3)
  20:	7d 08 48 10 	subfc   r8,r8,r9
  24:	7d 4a 51 10 	subfe   r10,r10,r10
  28:	7d 29 3a 14 	add     r9,r9,r7
  2c:	81 03 00 0c 	lwz     r8,12(r3)
  30:	7d 2a 48 50 	subf    r9,r10,r9
  34:	7c e7 48 10 	subfc   r7,r7,r9
  38:	7d 4a 51 10 	subfe   r10,r10,r10
  3c:	7d 29 42 14 	add     r9,r9,r8
  40:	7d 2a 48 50 	subf    r9,r10,r9
  44:	80 e4 00 00 	lwz     r7,0(r4)
  48:	7d 08 48 10 	subfc   r8,r8,r9
  4c:	7d 4a 51 10 	subfe   r10,r10,r10
  50:	7d 29 3a 14 	add     r9,r9,r7
  54:	7d 2a 48 50 	subf    r9,r10,r9
  58:	81 04 00 04 	lwz     r8,4(r4)
  5c:	7c e7 48 10 	subfc   r7,r7,r9
  60:	7d 4a 51 10 	subfe   r10,r10,r10
  64:	7d 29 42 14 	add     r9,r9,r8
  68:	7d 2a 48 50 	subf    r9,r10,r9
  6c:	80 e4 00 08 	lwz     r7,8(r4)
  70:	7d 08 48 10 	subfc   r8,r8,r9
  74:	7d 4a 51 10 	subfe   r10,r10,r10
  78:	7d 29 3a 14 	add     r9,r9,r7
  7c:	7d 2a 48 50 	subf    r9,r10,r9
  80:	81 04 00 0c 	lwz     r8,12(r4)
  84:	7c e7 48 10 	subfc   r7,r7,r9
  88:	7d 4a 51 10 	subfe   r10,r10,r10
  8c:	7d 29 42 14 	add     r9,r9,r8
  90:	7d 2a 48 50 	subf    r9,r10,r9
  94:	7d 08 48 10 	subfc   r8,r8,r9
  98:	7d 4a 51 10 	subfe   r10,r10,r10
  9c:	7d 29 2a 14 	add     r9,r9,r5
  a0:	7d 2a 48 50 	subf    r9,r10,r9
  a4:	7c a5 48 10 	subfc   r5,r5,r9
  a8:	7c 63 19 10 	subfe   r3,r3,r3
  ac:	7d 29 32 14 	add     r9,r9,r6
  b0:	7d 23 48 50 	subf    r9,r3,r9
  b4:	7c c6 48 10 	subfc   r6,r6,r9
  b8:	7c 63 19 10 	subfe   r3,r3,r3
  bc:	7c 63 48 50 	subf    r3,r3,r9
  c0:	54 6a 80 3e 	rotlwi  r10,r3,16
  c4:	7c 63 52 14 	add     r3,r3,r10
  c8:	7c 63 18 f8 	not     r3,r3
  cc:	54 63 84 3e 	rlwinm  r3,r3,16,16,31
  d0:	4e 80 00 20 	blr

This patch implements it in assembly for PPC32

Link: https://github.com/linuxppc/linux/issues/9
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
 arch/powerpc/include/asm/checksum.h |  8 ++++++++
 arch/powerpc/lib/checksum_32.S      | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/arch/powerpc/include/asm/checksum.h b/arch/powerpc/include/asm/checksum.h
index 54065caa40b3..c41c280c252f 100644
--- a/arch/powerpc/include/asm/checksum.h
+++ b/arch/powerpc/include/asm/checksum.h
@@ -13,6 +13,7 @@
 #include <asm-generic/checksum.h>
 #else
 #include <linux/bitops.h>
+#include <linux/in6.h>
 /*
  * Computes the checksum of a memory block at src, length len,
  * and adds in "sum" (32-bit), while copying the block to dst.
@@ -211,6 +212,13 @@ static inline __sum16 ip_compute_csum(const void *buff, int len)
 	return csum_fold(csum_partial(buff, len, 0));
 }
 
+#ifdef CONFIG_PPC32
+#define _HAVE_ARCH_IPV6_CSUM
+__sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+			const struct in6_addr *daddr,
+			__u32 len, __u8 proto, __wsum sum);
+#endif
+
 #endif
 #endif /* __KERNEL__ */
 #endif
diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
index 9a671c774b22..f9c047145696 100644
--- a/arch/powerpc/lib/checksum_32.S
+++ b/arch/powerpc/lib/checksum_32.S
@@ -293,3 +293,36 @@ dst_error:
 	EX_TABLE(51b, dst_error);
 
 EXPORT_SYMBOL(csum_partial_copy_generic)
+
+/*
+ * static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+ *				      const struct in6_addr *daddr,
+ *				      __u32 len, __u8 proto, __wsum sum)
+ */
+
+_GLOBAL(csum_ipv6_magic)
+	lwz	r8, 0(r3)
+	lwz	r9, 4(r3)
+	lwz	r10, 8(r3)
+	lwz	r11, 12(r3)
+	addc	r0, r5, r6
+	adde	r0, r0, r7
+	adde	r0, r0, r8
+	adde	r0, r0, r9
+	adde	r0, r0, r10
+	adde	r0, r0, r11
+	lwz	r8, 0(r4)
+	lwz	r9, 4(r4)
+	lwz	r10, 8(r4)
+	lwz	r11, 12(r4)
+	adde	r0, r0, r8
+	adde	r0, r0, r9
+	adde	r0, r0, r10
+	adde	r0, r0, r11
+	addze	r0
+	rotlwi	r3, r0, 16
+	add	r3, r0, r3
+	not	r3, r3
+	rlwinm	r3, r3, 16, 16, 31
+	blr
+EXPORT_SYMBOL(csum_ipv6_magic)
-- 
2.13.3

^ permalink raw reply related

* Re: [Cake] [PATCH net-next v12 3/7] sch_cake: Add optional ACK filter
From: Eric Dumazet @ 2018-05-18 15:20 UTC (permalink / raw)
  To: Kevin Darbyshire-Bryant, Cong Wang
  Cc: Toke Høiland-Jørgensen, Cake List,
	Linux Kernel Network Developers, Eric Dumazet
In-Reply-To: <05E1D675-B73B-4409-8991-EB89D5538EAB@darbyshire-bryant.me.uk>

On 05/18/2018 04:18 AM, Kevin Darbyshire-Bryant wrote:
> 

> Speaking as a user of cake’s ack filtering, although it may be an odd place, it is incredibly useful in my linux based home router middle box that usefully extracts extra usable bandwidth from my asymmetric link.  And whilst ack compression/reduction/filtering call it what you will, will come to the linux TCP stack, as yet other OS stacks are less enlightened and benefit from the router’s tweaking/meddling/interference.
> 
> 

Kevin, there is no doubt the feature might be useful.

But it has to be properly implemented, or we risk to block future TCP improvements.

An ACK filter MUST parse the TCP options.

1) If some options can not be understood, ACK MUST not be dropped.

2) SACK options MUST be carefully analyzed

3) Timestamps also need some care.

Too many middle-boxes are breaking TCP, we do not want to carry another TCP blackhole in upstream linux,
that would be a shame.

conntrack had bugs that seriously impacted TCP Fastopen deployment for example.

^ permalink raw reply

* Re: [RFC PATCH ghak32 V2 01/13] audit: add container id
From: Richard Guy Briggs @ 2018-05-18 15:21 UTC (permalink / raw)
  To: Steve Grubb
  Cc: cgroups, containers, linux-api, Linux-Audit Mailing List,
	linux-fsdevel, LKML, netdev, ebiederm, luto, jlayton, carlos,
	dhowells, viro, simo, eparis, serge
In-Reply-To: <20180518095636.56ff322d@ivy-bridge>

On 2018-05-18 09:56, Steve Grubb wrote:
> On Thu, 17 May 2018 17:56:00 -0400
> Richard Guy Briggs <rgb@redhat.com> wrote:
> 
> > > During syscall events, the path info is returned in a a record
> > > simply called AUDIT_PATH, cwd info is returned in AUDIT_CWD. So,
> > > rather than calling the record that gets attached to everything
> > > AUDIT_CONTAINER_INFO, how about simply AUDIT_CONTAINER.  
> > 
> > Considering the container initiation record is different than the
> > record to document the container involved in an otherwise normal
> > syscall, we need two names.  I don't have a strong opinion what they
> > are.
> > 
> > I'd prefer AUDIT_CONTAIN and AUDIT_CONTAINER_INFO so that the two
> > are different enough to be visually distinct while leaving
> > AUDIT_CONTAINERID for the field type in patch 4 ("audit: add
> > containerid filtering")

(Sorry, I had intended AUDIT_CONTAINER for the first in that paragraph
above.)

> How about AUDIT_CONTAINER for the auxiliary record? The one that starts
> the container, I don't have a strong opinion on. Could be
> AUDIT_CONTAINER_INIT, AUDIT_CONTAINER_START, AUDIT_CONTAINERID,
> AUDIT_CONTAINER_ID, or something else. The API call that sets the ID
> for filtering could be AUDIT_CID or AUDIT_CONTID if that helps decide
> what the initial event might be. Normally, it should match the field
> being filtered.

Ok, I had shortened the record field name to "contid=" to be unique
enough while not using too much netlink bandwidth.  I could have used
"cid=" but that could be unobvious or ambiguous.  I didn't want to use
the full "containerid=" due to that.  I suppose I could change the
field name macro to AUDIT_CONTID.

For the one that starts the container, I'd prefer to leave the name a
bit more general than "_INIT", "_START", so maybe I'll swap them around
and use AUDIT_CONTAINER_INFO for the startup record, and use
AUDIT_CONTAINER for the syscall auxiliary record.

Does that work?

> -Steve

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

^ permalink raw reply

* Re: [PATCH] hippi: fix spelling mistake: "Framming" -> "Framing"
From: David Miller @ 2018-05-18 15:26 UTC (permalink / raw)
  To: jes.sorensen; +Cc: colin.king, netdev, kernel-janitors, linux-kernel
In-Reply-To: <bdb779e6-1705-3607-7ad8-3b57dfcba31a@gmail.com>

From: Jes Sorensen <jes.sorensen@gmail.com>
Date: Fri, 18 May 2018 10:33:50 -0400

> I do wonder if it's time to retire this driver and the HIPPI code. I
> haven't had access to hardware for over a decade so I have no idea
> if it even works anymore.

That's a good question.

If you really think nobody is using this stuff, yes we can remove it.

Then if someone brings it to our attention that there are users, we
can very easily bring it back.

But as the HIPPI maintiner it is really up to you.

^ permalink raw reply

* Re: [PATCH bpf v2 2/6] bpf: powerpc64: add JIT support for multi-function programs
From: Daniel Borkmann @ 2018-05-18 15:30 UTC (permalink / raw)
  To: Sandipan Das, ast; +Cc: netdev, linuxppc-dev, naveen.n.rao, mpe, jakub.kicinski
In-Reply-To: <20180518125039.6500-3-sandipan@linux.vnet.ibm.com>

On 05/18/2018 02:50 PM, Sandipan Das wrote:
> This adds support for bpf-to-bpf function calls in the powerpc64
> JIT compiler. The JIT compiler converts the bpf call instructions
> to native branch instructions. After a round of the usual passes,
> the start addresses of the JITed images for the callee functions
> are known. Finally, to fixup the branch target addresses, we need
> to perform an extra pass.
> 
> Because of the address range in which JITed images are allocated
> on powerpc64, the offsets of the start addresses of these images
> from __bpf_call_base are as large as 64 bits. So, for a function
> call, we cannot use the imm field of the instruction to determine
> the callee's address. Instead, we use the alternative method of
> getting it from the list of function addresses in the auxillary
> data of the caller by using the off field as an index.
> 
> Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
> ---
>  arch/powerpc/net/bpf_jit_comp64.c | 79 ++++++++++++++++++++++++++++++++++-----
>  1 file changed, 69 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c
> index 1bdb1aff0619..25939892d8f7 100644
> --- a/arch/powerpc/net/bpf_jit_comp64.c
> +++ b/arch/powerpc/net/bpf_jit_comp64.c
> @@ -256,7 +256,7 @@ static void bpf_jit_emit_tail_call(u32 *image, struct codegen_context *ctx, u32
>  /* Assemble the body code between the prologue & epilogue */
>  static int bpf_jit_build_body(struct bpf_prog *fp, u32 *image,
>  			      struct codegen_context *ctx,
> -			      u32 *addrs)
> +			      u32 *addrs, bool extra_pass)
>  {
>  	const struct bpf_insn *insn = fp->insnsi;
>  	int flen = fp->len;
> @@ -712,11 +712,23 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 *image,
>  			break;
>  
>  		/*
> -		 * Call kernel helper
> +		 * Call kernel helper or bpf function
>  		 */
>  		case BPF_JMP | BPF_CALL:
>  			ctx->seen |= SEEN_FUNC;
> -			func = (u8 *) __bpf_call_base + imm;
> +
> +			/* bpf function call */
> +			if (insn[i].src_reg == BPF_PSEUDO_CALL && extra_pass)

Perhaps it might make sense here for !extra_pass to set func to some dummy
address as otherwise the 'kernel helper call' branch used for this is a bit
misleading in that sense. The PPC_LI64() used in bpf_jit_emit_func_call()
optimizes the immediate addr, I presume the JIT can handle situations where
in the final extra_pass the image needs to grow/shrink again (due to different
final address for the call)?

> +				if (fp->aux->func && off < fp->aux->func_cnt)
> +					/* use the subprog id from the off
> +					 * field to lookup the callee address
> +					 */
> +					func = (u8 *) fp->aux->func[off]->bpf_func;
> +				else
> +					return -EINVAL;
> +			/* kernel helper call */
> +			else
> +				func = (u8 *) __bpf_call_base + imm;
>  
>  			bpf_jit_emit_func_call(image, ctx, (u64)func);
>  
> @@ -864,6 +876,14 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 *image,
>  	return 0;
>  }
>  
> +struct powerpc64_jit_data {
> +	struct bpf_binary_header *header;
> +	u32 *addrs;
> +	u8 *image;
> +	u32 proglen;
> +	struct codegen_context ctx;
> +};
> +
>  struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
>  {
>  	u32 proglen;
> @@ -871,6 +891,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
>  	u8 *image = NULL;
>  	u32 *code_base;
>  	u32 *addrs;
> +	struct powerpc64_jit_data *jit_data;
>  	struct codegen_context cgctx;
>  	int pass;
>  	int flen;
> @@ -878,6 +899,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
>  	struct bpf_prog *org_fp = fp;
>  	struct bpf_prog *tmp_fp;
>  	bool bpf_blinded = false;
> +	bool extra_pass = false;
>  
>  	if (!fp->jit_requested)
>  		return org_fp;
> @@ -891,7 +913,28 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
>  		fp = tmp_fp;
>  	}
>  
> +	jit_data = fp->aux->jit_data;
> +	if (!jit_data) {
> +		jit_data = kzalloc(sizeof(*jit_data), GFP_KERNEL);
> +		if (!jit_data) {
> +			fp = org_fp;
> +			goto out;
> +		}
> +		fp->aux->jit_data = jit_data;
> +	}
> +
>  	flen = fp->len;
> +	addrs = jit_data->addrs;
> +	if (addrs) {
> +		cgctx = jit_data->ctx;
> +		image = jit_data->image;
> +		bpf_hdr = jit_data->header;
> +		proglen = jit_data->proglen;
> +		alloclen = proglen + FUNCTION_DESCR_SIZE;
> +		extra_pass = true;
> +		goto skip_init_ctx;
> +	}
> +
>  	addrs = kzalloc((flen+1) * sizeof(*addrs), GFP_KERNEL);
>  	if (addrs == NULL) {
>  		fp = org_fp;

In this case of !addrs, we leak the just allocated jit_data here!

> @@ -904,10 +947,10 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
>  	cgctx.stack_size = round_up(fp->aux->stack_depth, 16);
>  
>  	/* Scouting faux-generate pass 0 */
> -	if (bpf_jit_build_body(fp, 0, &cgctx, addrs)) {
> +	if (bpf_jit_build_body(fp, 0, &cgctx, addrs, false)) {
>  		/* We hit something illegal or unsupported. */
>  		fp = org_fp;
> -		goto out;
> +		goto out_addrs;
>  	}
>  
>  	/*
> @@ -925,9 +968,10 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
>  			bpf_jit_fill_ill_insns);
>  	if (!bpf_hdr) {
>  		fp = org_fp;
> -		goto out;
> +		goto out_addrs;
>  	}
>  
> +skip_init_ctx:
>  	code_base = (u32 *)(image + FUNCTION_DESCR_SIZE);
>  
>  	/* Code generation passes 1-2 */
> @@ -935,7 +979,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
>  		/* Now build the prologue, body code & epilogue for real. */
>  		cgctx.idx = 0;
>  		bpf_jit_build_prologue(code_base, &cgctx);
> -		bpf_jit_build_body(fp, code_base, &cgctx, addrs);
> +		bpf_jit_build_body(fp, code_base, &cgctx, addrs, extra_pass);
>  		bpf_jit_build_epilogue(code_base, &cgctx);
>  
>  		if (bpf_jit_enable > 1)
> @@ -956,15 +1000,30 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
>  	((u64 *)image)[1] = local_paca->kernel_toc;
>  #endif
>  
> +	bpf_flush_icache(bpf_hdr, (u8 *)bpf_hdr + (bpf_hdr->pages * PAGE_SIZE));
> +
> +	if (!fp->is_func || extra_pass) {
> +		bpf_jit_binary_lock_ro(bpf_hdr);

powerpc doesn't implement set_memory_ro(). Generally this is not a problem since
set_memory_ro() defaults to 'return 0' in this case, but since the bpf_jit_free()
destructor is overridden here, there's no bpf_jit_binary_unlock_ro() and in case
powerpc would get set_memory_*() support one day this will then crash in random
places once the mem gets back to the allocator, thus hard to debug. Two options:
either you remove the bpf_jit_free() override or you remove the bpf_jit_binary_lock_ro().

> +	} else {
> +		jit_data->addrs = addrs;
> +		jit_data->ctx = cgctx;
> +		jit_data->proglen = proglen;
> +		jit_data->image = image;
> +		jit_data->header = bpf_hdr;
> +	}
> +
>  	fp->bpf_func = (void *)image;
>  	fp->jited = 1;
>  	fp->jited_len = alloclen;
>  
> -	bpf_flush_icache(bpf_hdr, (u8 *)bpf_hdr + (bpf_hdr->pages * PAGE_SIZE));
> +	if (!fp->is_func || extra_pass) {
> +out_addrs:
> +		kfree(addrs);
> +		kfree(jit_data);
> +		fp->aux->jit_data = NULL;
> +	}
>  
>  out:
> -	kfree(addrs);
> -
>  	if (bpf_blinded)
>  		bpf_jit_prog_release_other(fp, fp == org_fp ? tmp_fp : org_fp);
>  
> 

^ permalink raw reply

* Re: WARNING in ip_recv_error
From: Eric Dumazet @ 2018-05-18 15:30 UTC (permalink / raw)
  To: DaeRyong Jeong, davem, kuznet, yoshfuji
  Cc: netdev, linux-kernel, byoungyoung, kt0755, bammanag,
	Willem de Bruijn
In-Reply-To: <20180518120826.GA19515@dragonet.kaist.ac.kr>



On 05/18/2018 05:08 AM, DaeRyong Jeong wrote:
> We report the crash: WARNING in ip_recv_error
> (I resend the email since I mistakenly missed the subject in my previous
> email. I'm sorry.)
> 
> 
> This crash has been found in v4.17-rc1 using RaceFuzzer (a modified
> version of Syzkaller), which we describe more at the end of this
> report. Our analysis shows that the race occurs when invoking two
> syscalls concurrently, setsockopt$inet6_IPV6_ADDRFORM and recvmsg.
> 
> 
> Diagnosis:
> We think the concurrent execution of do_ipv6_setsockopt() with optname
> IPV6_ADDRFORM and inet_recvmsg() causes the crash. do_ipv6_setsockopt()
> can update sk->prot to &udp_prot and sk->sk_family to PF_INET. But
> inet_recvmsg() can execute sk->sk_prot->recvmsg() right after that
> sk->prot is updated and sk->sk_family is not updated by
> do_ipv6_setsockopt(). This will lead WARN_ON in ip_recv_error().
> 
> 
> Thread interleaving:
> CPU0 (do_ipv6_setsockopt)			CPU1 (inet_recvmsg)
> =====						=====
> struct proto *prot = &udp_prot;
> ...
> sk->sk_prot = prot;
> sk->sk_socket->ops = &inet_dgram_ops;
> 						err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
> 									flags & ~MSG_DONTWAIT, &addr_len);
> 
> 						(in udp_recvmsg)
> 						if (flags & MSG_ERRQUEUE)
> 							return ip_recv_error(sk, msg, len, addr_len);
> 	
> 						(in ip_recv_error)
> 						WARN_ON_ONCE(sk->sk_family == AF_INET6);
> sk->sk_family = PF_INET;
> 
> 
> Call Sequence:
> CPU0
> =====
> udpv6_setsockopt
> 	ipv6_setsockopt
> 		do_ipv6_setsockopt
> 
> CPU1
> =====
> sock_recvmsg
> 	sock_recvmsg_nosec
> 		inet_recvmsg
> 			udp_recvmsg
> 
> 
> ==================================================================
> WARNING: CPU: 1 PID: 32600 at /home/daeryong/workspace/new-race-fuzzer/kernels_repo/kernel_v4.17-rc1/net/ipv4/ip_sockglue.c:508 ip_recv_error+0x6f2/0x720 net/ipv4/ip_sockglue.c:508
> Kernel panic - not syncing: panic_on_warn set ...
> 
> CPU: 1 PID: 32600 Comm: syz-executor0 Not tainted 4.17.0-rc1 #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x166/0x21c lib/dump_stack.c:113
>  panic+0x1a0/0x3a7 kernel/panic.c:184
>  __warn+0x191/0x1a0 kernel/panic.c:536
>  report_bug+0x132/0x1b0 lib/bug.c:186
>  fixup_bug.part.11+0x28/0x50 arch/x86/kernel/traps.c:178
>  fixup_bug arch/x86/kernel/traps.c:247 [inline]
>  do_error_trap+0x28b/0x2d0 arch/x86/kernel/traps.c:296
>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
>  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
> RIP: 0010:ip_recv_error+0x6f2/0x720 net/ipv4/ip_sockglue.c:508
> RSP: 0018:ffff8801dadff630 EFLAGS: 00010212
> RAX: 0000000000040000 RBX: 0000000000002002 RCX: ffffffff8327de12
> RDX: 000000000000008a RSI: ffffc90001a0c000 RDI: ffff8801be615010
> RBP: ffff8801dadff720 R08: 0000000000002002 R09: ffff8801dadff918
> R10: ffff8801dadff738 R11: ffff8801dadffaff R12: ffff8801be615000
> R13: ffff8801dadffd50 R14: 1ffff1003b5bfece R15: ffff8801dadffb90
>  udp_recvmsg+0x834/0xa10 net/ipv4/udp.c:1571
>  inet_recvmsg+0x121/0x420 net/ipv4/af_inet.c:830
>  sock_recvmsg_nosec net/socket.c:802 [inline]
>  sock_recvmsg+0x7f/0xa0 net/socket.c:809
>  ___sys_recvmsg+0x1f0/0x430 net/socket.c:2279
>  __sys_recvmsg+0xfc/0x1c0 net/socket.c:2328
>  __do_sys_recvmsg net/socket.c:2338 [inline]
>  __se_sys_recvmsg net/socket.c:2335 [inline]
>  __x64_sys_recvmsg+0x48/0x50 net/socket.c:2335
>  do_syscall_64+0x15f/0x4a0 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x4563f9
> RSP: 002b:00007f24f6927b28 EFLAGS: 00000246 ORIG_RAX: 000000000000002f
> RAX: ffffffffffffffda RBX: 000000000072bfa0 RCX: 00000000004563f9
> RDX: 0000000000002002 RSI: 0000000020000240 RDI: 0000000000000016
> RBP: 00000000000004e4 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 00007f24f69286d4
> R13: 00000000ffffffff R14: 00000000006fc600 R15: 0000000000000000
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
> ==================================================================
> 
> 
> = About RaceFuzzer
> 
> RaceFuzzer is a customized version of Syzkaller, specifically tailored
> to find race condition bugs in the Linux kernel. While we leverage
> many different technique, the notable feature of RaceFuzzer is in
> leveraging a custom hypervisor (QEMU/KVM) to interleave the
> scheduling. In particular, we modified the hypervisor to intentionally
> stall a per-core execution, which is similar to supporting per-core
> breakpoint functionality. This allows RaceFuzzer to force the kernel
> to deterministically trigger racy condition (which may rarely happen
> in practice due to randomness in scheduling).
> 
> RaceFuzzer's C repro always pinpoints two racy syscalls. Since C
> repro's scheduling synchronization should be performed at the user
> space, its reproducibility is limited (reproduction may take from 1
> second to 10 minutes (or even more), depending on a bug). This is
> because, while RaceFuzzer precisely interleaves the scheduling at the
> kernel's instruction level when finding this bug, C repro cannot fully
> utilize such a feature. Please disregard all code related to
> "should_hypercall" in the C repro, as this is only for our debugging
> purposes using our own hypervisor.
> 


We probably need to revert Willem patch (7ce875e5ecb8562fd44040f69bda96c999e38bbc)

^ permalink raw reply

* Re: [PATCH net-next ] net: mscc: Add SPDX identifier
From: David Miller @ 2018-05-18 15:30 UTC (permalink / raw)
  To: alexandre.belloni
  Cc: Allan.Nielsen, razvan.stefanescu, po.liu, thomas.petazzoni,
	andrew, f.fainelli, netdev, linux-kernel
In-Reply-To: <20180517192305.17517-1-alexandre.belloni@bootlin.com>

From: Alexandre Belloni <alexandre.belloni@bootlin.com>
Date: Thu, 17 May 2018 21:23:05 +0200

> ocelot_qsys.h is missing the SPDX identfier, fix that.
> 
> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>

Applied, thank you.

^ permalink raw reply

* Re: [RFC PATCH ghak32 V2 01/13] audit: add container id
From: Steve Grubb @ 2018-05-18 15:38 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: cgroups, containers, linux-api, Linux-Audit Mailing List,
	linux-fsdevel, LKML, netdev, ebiederm, luto, jlayton, carlos,
	dhowells, viro, simo, eparis, serge
In-Reply-To: <20180518152106.do5b3mu6e6eyvo7q@madcap2.tricolour.ca>

On Fri, 18 May 2018 11:21:06 -0400
Richard Guy Briggs <rgb@redhat.com> wrote:

> On 2018-05-18 09:56, Steve Grubb wrote:
> > On Thu, 17 May 2018 17:56:00 -0400
> > Richard Guy Briggs <rgb@redhat.com> wrote:
> >   
> > > > During syscall events, the path info is returned in a a record
> > > > simply called AUDIT_PATH, cwd info is returned in AUDIT_CWD. So,
> > > > rather than calling the record that gets attached to everything
> > > > AUDIT_CONTAINER_INFO, how about simply AUDIT_CONTAINER.    
> > > 
> > > Considering the container initiation record is different than the
> > > record to document the container involved in an otherwise normal
> > > syscall, we need two names.  I don't have a strong opinion what
> > > they are.
> > > 
> > > I'd prefer AUDIT_CONTAIN and AUDIT_CONTAINER_INFO so that the two
> > > are different enough to be visually distinct while leaving
> > > AUDIT_CONTAINERID for the field type in patch 4 ("audit: add
> > > containerid filtering")  
> 
> (Sorry, I had intended AUDIT_CONTAINER for the first in that paragraph
> above.)
> 
> > How about AUDIT_CONTAINER for the auxiliary record? The one that
> > starts the container, I don't have a strong opinion on. Could be
> > AUDIT_CONTAINER_INIT, AUDIT_CONTAINER_START, AUDIT_CONTAINERID,
> > AUDIT_CONTAINER_ID, or something else. The API call that sets the ID
> > for filtering could be AUDIT_CID or AUDIT_CONTID if that helps
> > decide what the initial event might be. Normally, it should match
> > the field being filtered.  
> 
> Ok, I had shortened the record field name to "contid=" to be unique
> enough while not using too much netlink bandwidth.  I could have used
> "cid=" but that could be unobvious or ambiguous.  I didn't want to use
> the full "containerid=" due to that.  I suppose I could change the
> field name macro to AUDIT_CONTID.
> 
> For the one that starts the container, I'd prefer to leave the name a
> bit more general than "_INIT", "_START", so maybe I'll swap them
> around and use AUDIT_CONTAINER_INFO for the startup record, and use
> AUDIT_CONTAINER for the syscall auxiliary record.
> 
> Does that work?

I'll go along with that. Thanks. But making that swap frees up
AUDIT_CONTAINER_ID which could be the first event. But
AUDIT_CONTAINER_INFO is also fine with me.

Best Regards,
-Steve

^ permalink raw reply

* Re: [PATCH v3 net-next 0/6] tcp: implement SACK compression
From: David Miller @ 2018-05-18 15:42 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, toke, ncardwell, ycheng, soheil, eric.dumazet
In-Reply-To: <20180517214729.186094-1-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Thu, 17 May 2018 14:47:23 -0700

> When TCP receives an out-of-order packet, it immediately sends
> a SACK packet, generating network load but also forcing the
> receiver to send 1-MSS pathological packets, increasing its
> RTX queue length/depth, and thus processing time.
> 
> Wifi networks suffer from this aggressive behavior, but generally
> speaking, all these SACK packets add fuel to the fire when networks
> are under congestion.
> 
> This patch series adds SACK compression, but the infrastructure
> could be leveraged to also compress ACK in the future.
> 
> v2: Addressed Neal feedback.
>     Added two sysctls to allow fine tuning, or even disabling the feature.
> 
> v3: take rtt = min(srtt, rcv_rtt) as Yuchung suggested, because rcv_rtt
>     can be over estimated for RPC (or sender limited)

This looks great, series applied, thanks Eric.

So now we handle locally terminated traffic well, however we still need
something that handles traffic flowing through the system.  And for that
reason the CAKE ACK filter still has merit.

I completely agree with you that it has to be implemented properly,
parse all TCP options and the SACK blocks, and pass the ACKs through
when there are options it doesn't understand or the SACK/timestampe/etc.
update contains new information that must be passed along and not dropper.

Thanks again.

^ permalink raw reply

* Re: [PATCH bpf v2 3/6] bpf: get kernel symbol addresses via syscall
From: Daniel Borkmann @ 2018-05-18 15:43 UTC (permalink / raw)
  To: Sandipan Das, ast; +Cc: netdev, linuxppc-dev, naveen.n.rao, mpe, jakub.kicinski
In-Reply-To: <20180518125039.6500-4-sandipan@linux.vnet.ibm.com>

On 05/18/2018 02:50 PM, Sandipan Das wrote:
> This adds new two new fields to struct bpf_prog_info. For
> multi-function programs, these fields can be used to pass
> a list of kernel symbol addresses for all functions in a
> given program and to userspace using the bpf system call
> with the BPF_OBJ_GET_INFO_BY_FD command.
> 
> When bpf_jit_kallsyms is enabled, we can get the address
> of the corresponding kernel symbol for a callee function
> and resolve the symbol's name. The address is determined
> by adding the value of the call instruction's imm field
> to __bpf_call_base. This offset gets assigned to the imm
> field by the verifier.
> 
> For some architectures, such as powerpc64, the imm field
> is not large enough to hold this offset.
> 
> We resolve this by:
> 
> [1] Assigning the subprog id to the imm field of a call
>     instruction in the verifier instead of the offset of
>     the callee's symbol's address from __bpf_call_base.
> 
> [2] Determining the address of a callee's corresponding
>     symbol by using the imm field as an index for the
>     list of kernel symbol addresses now available from
>     the program info.
> 
> Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
> ---
>  include/uapi/linux/bpf.h |  2 ++
>  kernel/bpf/syscall.c     | 20 ++++++++++++++++++++
>  kernel/bpf/verifier.c    |  7 +------
>  3 files changed, 23 insertions(+), 6 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index d94d333a8225..040c9cac7303 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2188,6 +2188,8 @@ struct bpf_prog_info {
>  	__u32 xlated_prog_len;
>  	__aligned_u64 jited_prog_insns;
>  	__aligned_u64 xlated_prog_insns;
> +	__aligned_u64 jited_ksyms;
> +	__u32 nr_jited_ksyms;
>  	__u64 load_time;	/* ns since boottime */
>  	__u32 created_by_uid;
>  	__u32 nr_map_ids;
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index bfcde949c7f8..54a72fafe57c 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1933,6 +1933,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
>  	if (!capable(CAP_SYS_ADMIN)) {
>  		info.jited_prog_len = 0;
>  		info.xlated_prog_len = 0;
> +		info.nr_jited_ksyms = 0;
>  		goto done;
>  	}
>  
> @@ -1981,6 +1982,25 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
>  		}
>  	}
>  
> +	ulen = info.nr_jited_ksyms;
> +	info.nr_jited_ksyms = prog->aux->func_cnt;
> +	if (info.nr_jited_ksyms && ulen) {

Since this exposes addresses (though masked one which is correct), this
definitely needs to be guarded with bpf_dump_raw_ok() like we do in other
places here (see JIT dump for example).

> +		u64 __user *user_jited_ksyms = u64_to_user_ptr(info.jited_ksyms);
> +		ulong ksym_addr;
> +		u32 i;
> +
> +		/* copy the address of the kernel symbol corresponding to
> +		 * each function
> +		 */
> +		ulen = min_t(u32, info.nr_jited_ksyms, ulen);
> +		for (i = 0; i < ulen; i++) {
> +			ksym_addr = (ulong) prog->aux->func[i]->bpf_func;
> +			ksym_addr &= PAGE_MASK;
> +			if (put_user((u64) ksym_addr, &user_jited_ksyms[i]))
> +				return -EFAULT;
> +		}
> +	}
> +
>  done:
>  	if (copy_to_user(uinfo, &info, info_len) ||
>  	    put_user(info_len, &uattr->info.info_len))
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 6c56cce9c4e3..e826c396aba2 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -5426,17 +5426,12 @@ static int jit_subprogs(struct bpf_verifier_env *env)
>  	 * later look the same as if they were interpreted only.
>  	 */
>  	for (i = 0, insn = prog->insnsi; i < prog->len; i++, insn++) {
> -		unsigned long addr;
> -
>  		if (insn->code != (BPF_JMP | BPF_CALL) ||
>  		    insn->src_reg != BPF_PSEUDO_CALL)
>  			continue;
>  		insn->off = env->insn_aux_data[i].call_imm;
>  		subprog = find_subprog(env, i + insn->off + 1);
> -		addr  = (unsigned long)func[subprog]->bpf_func;

Hmm, in current bpf tree this says 'subprog + 1' here, so this is not
rebased against bpf tree but bpf-next (unlike what subject says)?

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/kernel/bpf/verifier.c#n5351

> -		addr &= PAGE_MASK;
> -		insn->imm = (u64 (*)(u64, u64, u64, u64, u64))
> -			    addr - __bpf_call_base;
> +		insn->imm = subprog;
>  	}
>  
>  	prog->jited = 1;
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox