Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] [v4] net: phy: phy drivers should not set SUPPORTED_[Asym_]Pause
From: David Miller @ 2016-12-11  4:31 UTC (permalink / raw)
  To: timur; +Cc: f.fainelli, netdev, jon.mason, nks.gnu
In-Reply-To: <1481138451-28144-1-git-send-email-timur@codeaurora.org>

From: Timur Tabi <timur@codeaurora.org>
Date: Wed,  7 Dec 2016 13:20:51 -0600

> Instead of having individual PHY drivers set the SUPPORTED_Pause and
> SUPPORTED_Asym_Pause flags, phylib itself should set those flags,
> unless there is a hardware erratum or other special case.  During
> autonegotiation, the PHYs will determine whether to enable pause
> frame support.
> 
> Pause frames are a feature that is supported by the MAC.  It is the MAC
> that generates the frames and that processes them.  The PHY can only be
> configured to allow them to pass through.
> 
> This commit also effectively reverts the recently applied c7a61319
> ("net: phy: dp83848: Support ethernet pause frames").
> 
> So the new process is:
> 
> 1) Unless the PHY driver overrides it, phylib sets the SUPPORTED_Pause
> and SUPPORTED_AsymPause bits in phydev->supported.  This indicates that
> the PHY supports pause frames.
> 
> 2) The MAC driver checks phydev->supported before it calls phy_start().
> If (SUPPORTED_Pause | SUPPORTED_AsymPause) is set, then the MAC driver
> sets those bits in phydev->advertising, if it wants to enable pause
> frame support.
> 
> 3) When the link state changes, the MAC driver checks phydev->pause and
> phydev->asym_pause,  If the bits are set, then it enables the corresponding
> features in the MAC.  The algorithm is:
> 
> 	if (phydev->pause)
> 		The MAC should be programmed to receive and honor
>                 pause frames it receives, i.e. enable receive flow control.
> 
> 	if (phydev->pause != phydev->asym_pause)
> 		The MAC should be programmed to transmit pause
> 		frames when needed, i.e. enable transmit flow control.
> 
> Signed-off-by: Timur Tabi <timur@codeaurora.org>

Applied.

^ permalink raw reply

* Re: [PATCH v2] net: ethernet: ti: netcp: add support of cpts
From: David Miller @ 2016-12-11  4:32 UTC (permalink / raw)
  To: grygorii.strashko
  Cc: netdev, richardcochran, m-karicheri2, w-kwok2, nsekhar,
	mugunthanvnm, linux-kernel, linux-omap
In-Reply-To: <20161208222156.23625-1-grygorii.strashko@ti.com>

From: Grygorii Strashko <grygorii.strashko@ti.com>
Date: Thu, 8 Dec 2016 16:21:56 -0600

> From: WingMan Kwok <w-kwok2@ti.com>
> 
> This patch adds support of the cpts device found in the
> gbe and 10gbe ethernet switches on the keystone 2 SoCs
> (66AK2E/L/Hx, 66AK2Gx).
> 
> Cc: Richard Cochran <richardcochran@gmail.com>
> Signed-off-by: WingMan Kwok <w-kwok2@ti.com>
> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
> ---
> changes in v2:
>  - dropped bindings changes
> 
> link on v1:
>  https://lkml.org/lkml/2016/11/28/781

Applied.

^ permalink raw reply

* Re: [PATCH] i40e: don't truncate match_method assignment
From: David Miller @ 2016-12-11  4:32 UTC (permalink / raw)
  To: jacob.e.keller
  Cc: intel-wired-lan, jeffrey.t.kirsher, netdev, sfr, bimmy.pujari
In-Reply-To: <20161209213921.26451-1-jacob.e.keller@intel.com>

From: Jacob Keller <jacob.e.keller@intel.com>
Date: Fri,  9 Dec 2016 13:39:21 -0800

> The .match_method field is a u8, so we shouldn't be casting to a u16,
> and because it is only one byte, we do not need to byte swap anything.
> Just assign the value directly. This avoids issues on Big Endian
> architectures which would have byte swapped and then incorrectly
> truncated the value.
> 
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>

Applied.

^ permalink raw reply

* Re: [Patch net] e1000: use disable_hardirq() for e1000_netpoll()
From: David Miller @ 2016-12-11  4:32 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev, sd, davej, peterz, jeffrey.t.kirsher
In-Reply-To: <1481408562-6529-1-git-send-email-xiyou.wangcong@gmail.com>

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Sat, 10 Dec 2016 14:22:42 -0800

> In commit 02cea3958664 ("genirq: Provide disable_hardirq()")
> Peter introduced disable_hardirq() for netpoll, but it is forgotten
> to use it for e1000.
> 
> This patch changes disable_irq() to disable_hardirq() for e1000.
> 
> Reported-by: Dave Jones <davej@codemonkey.org.uk>
> Suggested-by: Sabrina Dubroca <sd@queasysnail.net>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Applied.

^ permalink raw reply

* Re: Misalignment, MIPS, and ip_hdr(skb)->version
From: Greg KH @ 2016-12-11  7:15 UTC (permalink / raw)
  To: Dan Lüdtke
  Cc: Daniel Kahn Gillmor, linux-mips, Netdev, LKML,
	Hannes Frederic Sowa, WireGuard mailing list
In-Reply-To: <CE942916-BF45-44CC-A5F5-3838CF9C93BC@danrl.com>

On Sat, Dec 10, 2016 at 11:18:14PM +0100, Dan Lüdtke wrote:
> 
> > On 8 Dec 2016, at 05:34, Daniel Kahn Gillmor <dkg@fifthhorseman.net> wrote:
> > 
> > On Wed 2016-12-07 19:30:34 -0500, Hannes Frederic Sowa wrote:
> >> Your custom protocol should be designed in a way you get an aligned ip
> >> header. Most protocols of the IETF follow this mantra and it is always
> >> possible to e.g. pad options so you end up on aligned boundaries for the
> >> next header.
> > 
> > fwiw, i'm not convinced that "most protocols of the IETF follow this
> > mantra".  we've had multiple discussions in different protocol groups
> > about shaving or bloating by a few bytes here or there in different
> > protocols, and i don't think anyone has brought up memory alignment as
> > an argument in any of the discussions i've followed.
> > 
> 
> If the trade-off is between 1 padding byte and 2 byte alignment versus
> 3 padding bytes and 4 byte alignment I would definitely opt for 3
> padding bytes. I know how that waste feels like to a protocol
> designer, but I think it is worth it. Maybe the padding/reserved will
> be useful some day for an additional feature.

Note, if you do do this (hint, I think it is a good idea), require that
these reserved/pad fields always set to 0 for now, so that no one puts
garbage in them and then if you later want to use them, it will be a
mess.

thanks,

greg k-h

^ permalink raw reply

* Re: Misalignment, MIPS, and ip_hdr(skb)->version
From: Willy Tarreau @ 2016-12-11  8:07 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: linux-mips, Netdev, LKML, David Miller, WireGuard mailing list
In-Reply-To: <CAHmME9qGoPGEFyqe0jBaZD5R51wHTEgAYb9edj+nu9nNPWSYCA@mail.gmail.com>

Hi Jason,

On Thu, Dec 08, 2016 at 11:20:04PM +0100, Jason A. Donenfeld wrote:
> Hi David,
> 
> On Thu, Dec 8, 2016 at 1:37 AM, David Miller <davem@davemloft.net> wrote:
> > You really have to land the IP header on a proper 4 byte boundary.
> >
> > I would suggest pushing 3 dummy garbage bytes of padding at the front
> > or the end of your header.
> 
> Are you sure 3 bytes to get 4 byte alignment is really the best?

It's always the best. However there's another option which should be
considered : maybe it's difficult but not impossible to move some bits
from the current protocol to remove one byte. That's not always easy,
and sometimes you cannot do it just for one bit. However after you run
through this exercise, if you notice there's really no way to shave
this extra byte, you'll realize there's no room left for future
extensions and you'll more easily accept to add 3 empty bytes for
this, typically protocol version, tags, qos or flagss that you'll be
happy to rely on for future versions of your protocol.

Also while it can feel like you're making your protocol less efficient,
keep in mind that these 3 bytes only matter for small packets, and
Ethernet already pads small frames to 64 bytes, so in practice any
IP packet smaller than 46 bytes will not bring any extra savings.

Best regards,
Willy

^ permalink raw reply

* Re: net: deadlock on genl_mutex
From: Dmitry Vyukov @ 2016-12-11  9:40 UTC (permalink / raw)
  To: Cong Wang
  Cc: syzkaller, Eric Dumazet, David Miller, Matti Vaittinen,
	Tycho Andersen, Florian Westphal, stephen hemminger, Tom Herbert,
	netdev, LKML, Richard Guy Briggs, netdev-owner
In-Reply-To: <CAM_iQpW2x5xeBpxnZx+k+vnKkjxWPycCBpiKZE41Kcp5gBsN4w@mail.gmail.com>

On Fri, Dec 9, 2016 at 6:08 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Thu, Dec 8, 2016 at 4:32 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> On Thu, Dec 8, 2016 at 9:16 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
>>> Chain exists of:
>>>  Possible unsafe locking scenario:
>>>
>>>        CPU0                    CPU1
>>>        ----                    ----
>>>   lock(genl_mutex);
>>>                                lock(nlk->cb_mutex);
>>>                                lock(genl_mutex);
>>>   lock(rtnl_mutex);
>>>
>>>  *** DEADLOCK ***
>>
>> This one looks legitimate, because nlk->cb_mutex could be rtnl_mutex.
>> Let me think about it.
>
> Never mind. Actually both reports in this thread are legitimate.
>
> I know what happened now, the lock chain is so long, 4 locks are involved
> to form a chain!!!
>
> Let me think about how to break the chain.



Seems to be a related one, now on nfnl_lock :



[ INFO: possible circular locking dependency detected ]
4.9.0-rc8+ #82 Not tainted
-------------------------------------------------------
syz-executor3/10151 is trying to acquire lock:
 (&table[i].mutex){+.+.+.}, at: [<ffffffff86c96f1d>]
nfnl_lock+0x2d/0x30 net/netfilter/nfnetlink.c:61
but task is already holding lock:
 (rtnl_mutex){+.+.+.}, at: [<ffffffff86b0cf0c>] rtnl_lock+0x1c/0x20
net/core/rtnetlink.c:70
which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

       [  231.942041] [<     inline     >] validate_chain
kernel/locking/lockdep.c:2265
       [  231.942041] [<ffffffff81569576>]
__lock_acquire+0x2156/0x3380 kernel/locking/lockdep.c:3338
floppy0: disk absent or changed during operation
floppy0: disk absent or changed during operation
       [  231.950342] [<ffffffff8156b672>] lock_acquire+0x2a2/0x790
kernel/locking/lockdep.c:3749
       [  231.950342] [<     inline     >] __mutex_lock_common
kernel/locking/mutex.c:521
       [  231.950342] [<ffffffff8815c2bf>]
mutex_lock_nested+0x23f/0xf20 kernel/locking/mutex.c:621
       [  231.950342] [<ffffffff86b0cf0c>] rtnl_lock+0x1c/0x20
net/core/rtnetlink.c:70
       [  231.950342] [<ffffffff87b234e9>]
nl80211_pre_doit+0x309/0x5b0 net/wireless/nl80211.c:11750
       [  231.950342] [<ffffffff86c883b0>]
genl_family_rcv_msg+0x780/0x1070 net/netlink/genetlink.c:631
       [  231.950342] [<ffffffff86c88e50>] genl_rcv_msg+0x1b0/0x260
net/netlink/genetlink.c:660
       [  231.950342] [<ffffffff86c86a2c>] netlink_rcv_skb+0x2bc/0x3a0
net/netlink/af_netlink.c:2298
       [  231.950342] [<ffffffff86c87c1d>] genl_rcv+0x2d/0x40
net/netlink/genetlink.c:671
       [  231.950342] [<     inline     >] netlink_unicast_kernel
net/netlink/af_netlink.c:1231
       [  231.950342] [<ffffffff86c8524a>] netlink_unicast+0x51a/0x740
net/netlink/af_netlink.c:1257
       [  231.950342] [<ffffffff86c85f14>] netlink_sendmsg+0xaa4/0xe50
net/netlink/af_netlink.c:1803
       [  231.950342] [<     inline     >] sock_sendmsg_nosec net/socket.c:621
       [  231.950342] [<ffffffff86a3c86f>] sock_sendmsg+0xcf/0x110
net/socket.c:631
       [  231.950342] [<ffffffff86a3cbdb>] sock_write_iter+0x32b/0x620
net/socket.c:829
       [  231.950342] [<     inline     >] new_sync_write fs/read_write.c:499
       [  231.950342] [<ffffffff81a7021e>] __vfs_write+0x4fe/0x830
fs/read_write.c:512
       [  231.950342] [<ffffffff81a71cc5>] vfs_write+0x175/0x4e0
fs/read_write.c:560
       [  231.950342] [<     inline     >] SYSC_write fs/read_write.c:607
       [  231.950342] [<ffffffff81a76150>] SyS_write+0x100/0x240
fs/read_write.c:599
       [  231.950342] [<ffffffff8816c685>] entry_SYSCALL_64_fastpath+0x23/0xc6

       [  231.950342] [<     inline     >] validate_chain
kernel/locking/lockdep.c:2265
       [  231.950342] [<ffffffff81569576>]
__lock_acquire+0x2156/0x3380 kernel/locking/lockdep.c:3338
       [  231.950342] [<ffffffff8156b672>] lock_acquire+0x2a2/0x790
kernel/locking/lockdep.c:3749
       [  231.950342] [<     inline     >] __mutex_lock_common
kernel/locking/mutex.c:521
       [  231.950342] [<ffffffff8815c2bf>]
mutex_lock_nested+0x23f/0xf20 kernel/locking/mutex.c:621
       [  231.950342] [<     inline     >] genl_lock net/netlink/genetlink.c:31
       [  231.950342] [<ffffffff86c87306>] genl_lock_dumpit+0x46/0xa0
net/netlink/genetlink.c:518
       [  231.950342] [<ffffffff86c79a8c>] netlink_dump+0x57c/0xd70
net/netlink/af_netlink.c:2127
       [  231.950342] [<ffffffff86c7e24a>]
__netlink_dump_start+0x4ea/0x760 net/netlink/af_netlink.c:2217
       [  231.950342] [<ffffffff86c889f9>]
genl_family_rcv_msg+0xdc9/0x1070 net/netlink/genetlink.c:586
       [  231.950342] [<ffffffff86c88e50>] genl_rcv_msg+0x1b0/0x260
net/netlink/genetlink.c:660
       [  231.950342] [<ffffffff86c86a2c>] netlink_rcv_skb+0x2bc/0x3a0
net/netlink/af_netlink.c:2298
       [  231.950342] [<ffffffff86c87c1d>] genl_rcv+0x2d/0x40
net/netlink/genetlink.c:671
       [  231.950342] [<     inline     >] netlink_unicast_kernel
net/netlink/af_netlink.c:1231
       [  231.950342] [<ffffffff86c8524a>] netlink_unicast+0x51a/0x740
net/netlink/af_netlink.c:1257
       [  231.950342] [<ffffffff86c85f14>] netlink_sendmsg+0xaa4/0xe50
net/netlink/af_netlink.c:1803
       [  231.950342] [<     inline     >] sock_sendmsg_nosec net/socket.c:621
       [  231.950342] [<ffffffff86a3c86f>] sock_sendmsg+0xcf/0x110
net/socket.c:631
       [  231.950342] [<ffffffff86a3cbdb>] sock_write_iter+0x32b/0x620
net/socket.c:829
       [  231.950342] [<ffffffff81a6fa13>]
do_iter_readv_writev+0x363/0x670 fs/read_write.c:695
       [  231.950342] [<ffffffff81a72461>] do_readv_writev+0x431/0x9b0
fs/read_write.c:872
       [  231.950342] [<ffffffff81a72f9c>] vfs_writev+0x8c/0xc0
fs/read_write.c:911
       [  231.950342] [<ffffffff81a730e5>] do_writev+0x115/0x2d0
fs/read_write.c:944
       [  231.950342] [<     inline     >] SYSC_writev fs/read_write.c:1017
       [  231.950342] [<ffffffff81a7689c>] SyS_writev+0x2c/0x40
fs/read_write.c:1014
       [  231.950342] [<ffffffff8816c685>] entry_SYSCALL_64_fastpath+0x23/0xc6

       [  231.950342] [<     inline     >] validate_chain
kernel/locking/lockdep.c:2265
       [  231.950342] [<ffffffff81569576>]
__lock_acquire+0x2156/0x3380 kernel/locking/lockdep.c:3338
       [  231.950342] [<ffffffff8156b672>] lock_acquire+0x2a2/0x790
kernel/locking/lockdep.c:3749
       [  231.950342] [<     inline     >] __mutex_lock_common
kernel/locking/mutex.c:521
       [  231.950342] [<ffffffff8815c2bf>]
mutex_lock_nested+0x23f/0xf20 kernel/locking/mutex.c:621
       [  231.950342] [<ffffffff86c7de59>]
__netlink_dump_start+0xf9/0x760 net/netlink/af_netlink.c:2187
       [  231.950342] [<     inline     >] netlink_dump_start
include/linux/netlink.h:165
       [  231.950342] [<ffffffff86d9d964>] ip_set_dump+0x204/0x2b0
net/netfilter/ipset/ip_set_core.c:1447
       [  231.950342] [<ffffffff86c9981e>]
nfnetlink_rcv_msg+0x9be/0xd60 net/netfilter/nfnetlink.c:212
       [  231.950342] [<ffffffff86c86a2c>] netlink_rcv_skb+0x2bc/0x3a0
net/netlink/af_netlink.c:2298
       [  231.950342] [<ffffffff86c98251>] nfnetlink_rcv+0x7e1/0x10d0
net/netfilter/nfnetlink.c:474
       [  231.950342] [<     inline     >] netlink_unicast_kernel
net/netlink/af_netlink.c:1231
       [  231.950342] [<ffffffff86c8524a>] netlink_unicast+0x51a/0x740
net/netlink/af_netlink.c:1257
       [  231.950342] [<ffffffff86c85f14>] netlink_sendmsg+0xaa4/0xe50
net/netlink/af_netlink.c:1803
       [  231.950342] [<     inline     >] sock_sendmsg_nosec net/socket.c:621
       [  231.950342] [<ffffffff86a3c86f>] sock_sendmsg+0xcf/0x110
net/socket.c:631
       [  231.950342] [<ffffffff86a3cbdb>] sock_write_iter+0x32b/0x620
net/socket.c:829
       [  231.950342] [<     inline     >] new_sync_write fs/read_write.c:499
       [  231.950342] [<ffffffff81a7021e>] __vfs_write+0x4fe/0x830
fs/read_write.c:512
       [  231.950342] [<ffffffff81a71cc5>] vfs_write+0x175/0x4e0
fs/read_write.c:560
       [  231.950342] [<     inline     >] SYSC_write fs/read_write.c:607
       [  231.950342] [<ffffffff81a76150>] SyS_write+0x100/0x240
fs/read_write.c:599
       [  231.950342] [<ffffffff8816c685>] entry_SYSCALL_64_fastpath+0x23/0xc6

       [  231.950342] [<     inline     >] check_prev_add
kernel/locking/lockdep.c:1828
       [  231.950342] [<ffffffff8156309b>]
check_prevs_add+0xaab/0x1c20 kernel/locking/lockdep.c:1938
       [  231.950342] [<     inline     >] validate_chain
kernel/locking/lockdep.c:2265
       [  231.950342] [<ffffffff81569576>]
__lock_acquire+0x2156/0x3380 kernel/locking/lockdep.c:3338
       [  231.950342] [<ffffffff8156b672>] lock_acquire+0x2a2/0x790
kernel/locking/lockdep.c:3749
       [  231.950342] [<     inline     >] __mutex_lock_common
kernel/locking/mutex.c:521
       [  231.950342] [<ffffffff8815c2bf>]
mutex_lock_nested+0x23f/0xf20 kernel/locking/mutex.c:621
       [  231.950342] [<ffffffff86c96f1d>] nfnl_lock+0x2d/0x30
net/netfilter/nfnetlink.c:61
       [  231.950342] [<ffffffff86d42c91>]
nf_tables_netdev_event+0x1f1/0x720
net/netfilter/nf_tables_netdev.c:122
       [  231.950342] [<ffffffff8149095a>]
notifier_call_chain+0x14a/0x2f0 kernel/notifier.c:93
       [  231.950342] [<     inline     >] __raw_notifier_call_chain
kernel/notifier.c:394
       [  231.950342] [<ffffffff81490b82>]
raw_notifier_call_chain+0x32/0x40 kernel/notifier.c:401
       [  231.950342] [<ffffffff86aab1d6>]
call_netdevice_notifiers_info+0x56/0x90 net/core/dev.c:1645
       [  231.950342] [<     inline     >] call_netdevice_notifiers
net/core/dev.c:1661
       [  231.950342] [<ffffffff86abf06d>]
rollback_registered_many+0x73d/0xba0 net/core/dev.c:6759
       [  231.950342] [<ffffffff86abf57e>]
rollback_registered+0xae/0x100 net/core/dev.c:6800
       [  231.950342] [<ffffffff86abf656>]
unregister_netdevice_queue+0x86/0x140 net/core/dev.c:7787
       [  231.950342] [<     inline     >] unregister_netdevice
include/linux/netdevice.h:2455
       [  231.950342] [<ffffffff848d9296>] __tun_detach+0xc66/0xea0
drivers/net/tun.c:567
       [  231.950342] [<     inline     >] tun_detach drivers/net/tun.c:578
       [  231.950342] [<ffffffff848d9519>] tun_chr_close+0x49/0x60
drivers/net/tun.c:2350
       [  231.950342] [<ffffffff81a77fee>] __fput+0x34e/0x910
fs/file_table.c:208
       [  231.950342] [<ffffffff81a7863a>] ____fput+0x1a/0x20
fs/file_table.c:244
       [  231.950342] [<ffffffff81483c20>] task_work_run+0x1a0/0x280
kernel/task_work.c:116
       [  231.950342] [<     inline     >] exit_task_work
include/linux/task_work.h:21
       [  231.950342] [<ffffffff814129e2>] do_exit+0x1842/0x2650
kernel/exit.c:828
       [  231.950342] [<ffffffff814139ae>] do_group_exit+0x14e/0x420
kernel/exit.c:932
       [  231.950342] [<ffffffff81442b43>] get_signal+0x663/0x1880
kernel/signal.c:2307
       [  231.950342] [<ffffffff81239b45>] do_signal+0xc5/0x2190
arch/x86/kernel/signal.c:807
       [  231.950342] [<ffffffff8100666a>]
exit_to_usermode_loop+0x1ea/0x2d0 arch/x86/entry/common.c:156
       [  231.950342] [<     inline     >] prepare_exit_to_usermode
arch/x86/entry/common.c:190
       [  231.950342] [<ffffffff81009693>]
syscall_return_slowpath+0x4d3/0x570 arch/x86/entry/common.c:259
       [  231.950342] [<ffffffff8816c726>] entry_SYSCALL_64_fastpath+0xc4/0xc6

other info that might help us debug this:

Chain exists of:
 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(rtnl_mutex);
                               lock(genl_mutex);
                               lock(rtnl_mutex);
  lock(&table[i].mutex);

 *** DEADLOCK ***

1 lock held by syz-executor3/10151:
 #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff86b0cf0c>]
rtnl_lock+0x1c/0x20 net/core/rtnetlink.c:70

stack backtrace:
CPU: 2 PID: 10151 Comm: syz-executor3 Not tainted 4.9.0-rc8+ #82
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 ffff8800311057f8 ffffffff8348fc59 ffffffff00000002 1ffff10006220a92
 ffffed0006220a8a 0000000041b58ab3 ffffffff8957cf18 ffffffff8348f96b
 ffffffff894eb258 ffffffff81564970 ffffffff8b565c30 ffffffff8b8e5020
Call Trace:
 [<     inline     >] __dump_stack lib/dump_stack.c:15
 [<ffffffff8348fc59>] dump_stack+0x2ee/0x3f5 lib/dump_stack.c:51
 [<ffffffff81560cb0>] print_circular_bug+0x310/0x3c0
kernel/locking/lockdep.c:1202
 [<     inline     >] check_prev_add kernel/locking/lockdep.c:1828
 [<ffffffff8156309b>] check_prevs_add+0xaab/0x1c20 kernel/locking/lockdep.c:1938
 [<     inline     >] validate_chain kernel/locking/lockdep.c:2265
 [<ffffffff81569576>] __lock_acquire+0x2156/0x3380 kernel/locking/lockdep.c:3338
 [<ffffffff8156b672>] lock_acquire+0x2a2/0x790 kernel/locking/lockdep.c:3749
 [<     inline     >] __mutex_lock_common kernel/locking/mutex.c:521
 [<ffffffff8815c2bf>] mutex_lock_nested+0x23f/0xf20 kernel/locking/mutex.c:621
 [<ffffffff86c96f1d>] nfnl_lock+0x2d/0x30 net/netfilter/nfnetlink.c:61
 [<ffffffff86d42c91>] nf_tables_netdev_event+0x1f1/0x720
net/netfilter/nf_tables_netdev.c:122
 [<ffffffff8149095a>] notifier_call_chain+0x14a/0x2f0 kernel/notifier.c:93
 [<     inline     >] __raw_notifier_call_chain kernel/notifier.c:394
 [<ffffffff81490b82>] raw_notifier_call_chain+0x32/0x40 kernel/notifier.c:401
 [<ffffffff86aab1d6>] call_netdevice_notifiers_info+0x56/0x90
net/core/dev.c:1645
 [<     inline     >] call_netdevice_notifiers net/core/dev.c:1661
 [<ffffffff86abf06d>] rollback_registered_many+0x73d/0xba0 net/core/dev.c:6759
 [<ffffffff86abf57e>] rollback_registered+0xae/0x100 net/core/dev.c:6800
 [<ffffffff86abf656>] unregister_netdevice_queue+0x86/0x140 net/core/dev.c:7787
 [<     inline     >] unregister_netdevice include/linux/netdevice.h:2455
 [<ffffffff848d9296>] __tun_detach+0xc66/0xea0 drivers/net/tun.c:567
 [<     inline     >] tun_detach drivers/net/tun.c:578
 [<ffffffff848d9519>] tun_chr_close+0x49/0x60 drivers/net/tun.c:2350
 [<ffffffff81a77fee>] __fput+0x34e/0x910 fs/file_table.c:208
 [<ffffffff81a7863a>] ____fput+0x1a/0x20 fs/file_table.c:244
 [<ffffffff81483c20>] task_work_run+0x1a0/0x280 kernel/task_work.c:116
 [<     inline     >] exit_task_work include/linux/task_work.h:21
 [<ffffffff814129e2>] do_exit+0x1842/0x2650 kernel/exit.c:828
 [<ffffffff814139ae>] do_group_exit+0x14e/0x420 kernel/exit.c:932
 [<ffffffff81442b43>] get_signal+0x663/0x1880 kernel/signal.c:2307
 [<ffffffff81239b45>] do_signal+0xc5/0x2190 arch/x86/kernel/signal.c:807
 [<ffffffff8100666a>] exit_to_usermode_loop+0x1ea/0x2d0
arch/x86/entry/common.c:156
 [<     inline     >] prepare_exit_to_usermode arch/x86/entry/common.c:190
 [<ffffffff81009693>] syscall_return_slowpath+0x4d3/0x570
arch/x86/entry/common.c:259
 [<ffffffff8816c726>] entry_SYSCALL_64_fastpath+0xc4/0xc6

^ permalink raw reply

* Re: [PATCH net-next] netfilter: nft_counter: rework atomic dump and reset
From: Pablo Neira Ayuso @ 2016-12-11 10:41 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netfilter-devel, arnd, netdev
In-Reply-To: <1481384408.4930.210.camel@edumazet-glaptop3.roam.corp.google.com>

On Sat, Dec 10, 2016 at 07:40:08AM -0800, Eric Dumazet wrote:
> On Sat, 2016-12-10 at 15:25 +0100, Pablo Neira Ayuso wrote:
> > On Sat, Dec 10, 2016 at 03:16:55PM +0100, Pablo Neira Ayuso wrote:
> =
> >  
> > -       nft_counter_fetch(priv, &total, reset);
> > +       nft_counter_fetch(priv, &total);
> > +       if (reset)
> > +               nft_counter_reset(priv, &total);
> >  
> >         if (nla_put_be64(skb, NFTA_COUNTER_BYTES,
> > cpu_to_be64(total.bytes),
> >                          NFTA_COUNTER_PAD) ||
> 
> Night be nitpicking, but you might reset the stats only if the
> nla_put_be64() succeeded.

Right, I've fixed this in the v2.

> But regardless of this detail, patch looks good and is very close to the
> one I cooked and was about to send this morning.

Thanks a lot!

^ permalink raw reply

* [PATCH net-next] netfilter: nft_counter: rework atomic dump and reset
From: Pablo Neira Ayuso @ 2016-12-11 10:43 UTC (permalink / raw)
  To: netfilter-devel; +Cc: eric.dumazet, netdev, davem, arnd

Dump and reset doesn't work unless cmpxchg64() is used both from packet
and control plane paths. This approach is going to be slow though.
Instead, use a percpu seqcount to fetch counters consistently, then
subtract bytes and packets in case a reset was requested.

The cpu that running over the reset code is guaranteed to own this stats
exclusively, we have to turn counters into signed 64bit though so stats
update on reset don't get wrong on underflow.

This patch is based on original sketch from Eric Dumazet.

Fixes: 43da04a593d8 ("netfilter: nf_tables: atomic dump and reset for stateful objects")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
v2: adjust stats on reset on the current cpu, turn 64bit counters into signed.

@David: Please, take this into net-next to help speed up thing, thanks!

 net/netfilter/nft_counter.c | 127 +++++++++++++++++++-------------------------
 1 file changed, 55 insertions(+), 72 deletions(-)

diff --git a/net/netfilter/nft_counter.c b/net/netfilter/nft_counter.c
index f6a02c5071c2..7f8422213341 100644
--- a/net/netfilter/nft_counter.c
+++ b/net/netfilter/nft_counter.c
@@ -18,31 +18,33 @@
 #include <net/netfilter/nf_tables.h>
 
 struct nft_counter {
-	u64		bytes;
-	u64		packets;
-};
-
-struct nft_counter_percpu {
-	struct nft_counter	counter;
-	struct u64_stats_sync	syncp;
+	s64		bytes;
+	s64		packets;
 };
 
 struct nft_counter_percpu_priv {
-	struct nft_counter_percpu __percpu *counter;
+	struct nft_counter __percpu *counter;
 };
 
+static DEFINE_PER_CPU(seqcount_t, nft_counter_seq);
+
 static inline void nft_counter_do_eval(struct nft_counter_percpu_priv *priv,
 				       struct nft_regs *regs,
 				       const struct nft_pktinfo *pkt)
 {
-	struct nft_counter_percpu *this_cpu;
+	struct nft_counter *this_cpu;
+	seqcount_t *myseq;
 
 	local_bh_disable();
 	this_cpu = this_cpu_ptr(priv->counter);
-	u64_stats_update_begin(&this_cpu->syncp);
-	this_cpu->counter.bytes += pkt->skb->len;
-	this_cpu->counter.packets++;
-	u64_stats_update_end(&this_cpu->syncp);
+	myseq = this_cpu_ptr(&nft_counter_seq);
+
+	write_seqcount_begin(myseq);
+
+	this_cpu->bytes += pkt->skb->len;
+	this_cpu->packets++;
+
+	write_seqcount_end(myseq);
 	local_bh_enable();
 }
 
@@ -58,21 +60,21 @@ static inline void nft_counter_obj_eval(struct nft_object *obj,
 static int nft_counter_do_init(const struct nlattr * const tb[],
 			       struct nft_counter_percpu_priv *priv)
 {
-	struct nft_counter_percpu __percpu *cpu_stats;
-	struct nft_counter_percpu *this_cpu;
+	struct nft_counter __percpu *cpu_stats;
+	struct nft_counter *this_cpu;
 
-	cpu_stats = netdev_alloc_pcpu_stats(struct nft_counter_percpu);
+	cpu_stats = alloc_percpu(struct nft_counter);
 	if (cpu_stats == NULL)
 		return -ENOMEM;
 
 	preempt_disable();
 	this_cpu = this_cpu_ptr(cpu_stats);
 	if (tb[NFTA_COUNTER_PACKETS]) {
-	        this_cpu->counter.packets =
+	        this_cpu->packets =
 			be64_to_cpu(nla_get_be64(tb[NFTA_COUNTER_PACKETS]));
 	}
 	if (tb[NFTA_COUNTER_BYTES]) {
-		this_cpu->counter.bytes =
+		this_cpu->bytes =
 			be64_to_cpu(nla_get_be64(tb[NFTA_COUNTER_BYTES]));
 	}
 	preempt_enable();
@@ -100,80 +102,59 @@ static void nft_counter_obj_destroy(struct nft_object *obj)
 	nft_counter_do_destroy(priv);
 }
 
-static void nft_counter_fetch(struct nft_counter_percpu __percpu *counter,
+static void nft_counter_reset(struct nft_counter_percpu_priv __percpu *priv,
 			      struct nft_counter *total)
 {
-	struct nft_counter_percpu *cpu_stats;
-	u64 bytes, packets;
-	unsigned int seq;
-	int cpu;
+	struct nft_counter *this_cpu;
 
-	memset(total, 0, sizeof(*total));
-	for_each_possible_cpu(cpu) {
-		cpu_stats = per_cpu_ptr(counter, cpu);
-		do {
-			seq	= u64_stats_fetch_begin_irq(&cpu_stats->syncp);
-			bytes	= cpu_stats->counter.bytes;
-			packets	= cpu_stats->counter.packets;
-		} while (u64_stats_fetch_retry_irq(&cpu_stats->syncp, seq));
-
-		total->packets += packets;
-		total->bytes += bytes;
-	}
-}
-
-static u64 __nft_counter_reset(u64 *counter)
-{
-	u64 ret, old;
-
-	do {
-		old = *counter;
-		ret = cmpxchg64(counter, old, 0);
-	} while (ret != old);
-
-	return ret;
+	local_bh_disable();
+	this_cpu = this_cpu_ptr(priv->counter);
+	this_cpu->packets -= total->packets;
+	this_cpu->bytes -= total->bytes;
+	local_bh_enable();
 }
 
-static void nft_counter_reset(struct nft_counter_percpu __percpu *counter,
+static void nft_counter_fetch(struct nft_counter_percpu_priv *priv,
 			      struct nft_counter *total)
 {
-	struct nft_counter_percpu *cpu_stats;
+	struct nft_counter *this_cpu;
+	const seqcount_t *myseq;
 	u64 bytes, packets;
 	unsigned int seq;
 	int cpu;
 
 	memset(total, 0, sizeof(*total));
 	for_each_possible_cpu(cpu) {
-		bytes = packets = 0;
-
-		cpu_stats = per_cpu_ptr(counter, cpu);
+		myseq = per_cpu_ptr(&nft_counter_seq, cpu);
+		this_cpu = per_cpu_ptr(priv->counter, cpu);
 		do {
-			seq	= u64_stats_fetch_begin_irq(&cpu_stats->syncp);
-			packets	+= __nft_counter_reset(&cpu_stats->counter.packets);
-			bytes	+= __nft_counter_reset(&cpu_stats->counter.bytes);
-		} while (u64_stats_fetch_retry_irq(&cpu_stats->syncp, seq));
+			seq	= read_seqcount_begin(myseq);
+			bytes	= this_cpu->bytes;
+			packets	= this_cpu->packets;
+		} while (read_seqcount_retry(myseq, seq));
 
-		total->packets += packets;
-		total->bytes += bytes;
+		total->bytes	+= bytes;
+		total->packets	+= packets;
 	}
 }
 
 static int nft_counter_do_dump(struct sk_buff *skb,
-			       const struct nft_counter_percpu_priv *priv,
+			       struct nft_counter_percpu_priv *priv,
 			       bool reset)
 {
 	struct nft_counter total;
 
-	if (reset)
-		nft_counter_reset(priv->counter, &total);
-	else
-		nft_counter_fetch(priv->counter, &total);
+	nft_counter_fetch(priv, &total);
 
 	if (nla_put_be64(skb, NFTA_COUNTER_BYTES, cpu_to_be64(total.bytes),
 			 NFTA_COUNTER_PAD) ||
 	    nla_put_be64(skb, NFTA_COUNTER_PACKETS, cpu_to_be64(total.packets),
 			 NFTA_COUNTER_PAD))
 		goto nla_put_failure;
+
+	if (reset)
+		nft_counter_reset(priv, &total);
+
 	return 0;
 
 nla_put_failure:
@@ -216,7 +197,7 @@ static void nft_counter_eval(const struct nft_expr *expr,
 
 static int nft_counter_dump(struct sk_buff *skb, const struct nft_expr *expr)
 {
-	const struct nft_counter_percpu_priv *priv = nft_expr_priv(expr);
+	struct nft_counter_percpu_priv *priv = nft_expr_priv(expr);
 
 	return nft_counter_do_dump(skb, priv, false);
 }
@@ -242,21 +223,20 @@ static int nft_counter_clone(struct nft_expr *dst, const struct nft_expr *src)
 {
 	struct nft_counter_percpu_priv *priv = nft_expr_priv(src);
 	struct nft_counter_percpu_priv *priv_clone = nft_expr_priv(dst);
-	struct nft_counter_percpu __percpu *cpu_stats;
-	struct nft_counter_percpu *this_cpu;
+	struct nft_counter __percpu *cpu_stats;
+	struct nft_counter *this_cpu;
 	struct nft_counter total;
 
-	nft_counter_fetch(priv->counter, &total);
+	nft_counter_fetch(priv, &total);
 
-	cpu_stats = __netdev_alloc_pcpu_stats(struct nft_counter_percpu,
-					      GFP_ATOMIC);
+	cpu_stats = alloc_percpu_gfp(struct nft_counter, GFP_ATOMIC);
 	if (cpu_stats == NULL)
 		return -ENOMEM;
 
 	preempt_disable();
 	this_cpu = this_cpu_ptr(cpu_stats);
-	this_cpu->counter.packets = total.packets;
-	this_cpu->counter.bytes = total.bytes;
+	this_cpu->packets = total.packets;
+	this_cpu->bytes = total.bytes;
 	preempt_enable();
 
 	priv_clone->counter = cpu_stats;
@@ -285,7 +265,10 @@ static struct nft_expr_type nft_counter_type __read_mostly = {
 
 static int __init nft_counter_module_init(void)
 {
-	int err;
+	int cpu, err;
+
+	for_each_possible_cpu(cpu)
+		seqcount_init(per_cpu_ptr(&nft_counter_seq, cpu));
 
 	err = nft_register_obj(&nft_counter_obj);
 	if (err < 0)
-- 
2.1.4

^ permalink raw reply related

* Re: Misalignment, MIPS, and ip_hdr(skb)->version
From: Måns Rullgård @ 2016-12-11 10:47 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Jason A. Donenfeld, linux-mips, Netdev, LKML, David Miller,
	WireGuard mailing list
In-Reply-To: <20161211080718.GA7253@1wt.eu>

Willy Tarreau <w@1wt.eu> writes:

> Hi Jason,
>
> On Thu, Dec 08, 2016 at 11:20:04PM +0100, Jason A. Donenfeld wrote:
>> Hi David,
>> 
>> On Thu, Dec 8, 2016 at 1:37 AM, David Miller <davem@davemloft.net> wrote:
>> > You really have to land the IP header on a proper 4 byte boundary.
>> >
>> > I would suggest pushing 3 dummy garbage bytes of padding at the front
>> > or the end of your header.
>> 
>> Are you sure 3 bytes to get 4 byte alignment is really the best?
>
> It's always the best. However there's another option which should be
> considered : maybe it's difficult but not impossible to move some bits
> from the current protocol to remove one byte. That's not always easy,
> and sometimes you cannot do it just for one bit. However after you run
> through this exercise, if you notice there's really no way to shave
> this extra byte, you'll realize there's no room left for future
> extensions and you'll more easily accept to add 3 empty bytes for
> this, typically protocol version, tags, qos or flagss that you'll be
> happy to rely on for future versions of your protocol.

Always include some way of extending the protocol in the future.  A
single bit is often enough.  Require a value of zero initially, then if
you ever want to change anything, setting it to one can indicate
whatever you want, including a complete redesign of the header.
Alternatively, a one-bit field can indicate the presence of an extended
header yet to be defined.  Then old software can still make sense of the
basic header.

-- 
Måns Rullgård

^ permalink raw reply

* Re: [PATCH net-next 0/2] Add ethtool set regs support
From: Saeed Mahameed @ 2016-12-11 11:56 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Saeed Mahameed, David S. Miller, Linux Netdev List,
	John W . Linville
In-Reply-To: <20161206144509.5365a623@xeon-e3>

On Wed, Dec 7, 2016 at 12:45 AM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Wed,  7 Dec 2016 00:33:08 +0200
> Saeed Mahameed <saeedm@mellanox.com> wrote:
>
>> This simple ethool change will give HW vendors the flexibility to set
>> pure HW configurations (not directly related to netdev resources states
>> and rings), without the need of vendor proprietary tools and hacks.
>
>
> The danger is you need to restrict the kernel to only allow setting
> safe registers (and this is HW dependent).  There are cases like secure
> boot where it is expected that even root is not allowed to modify
> all memory.
>
> Also supporting closed format of device registers is not in the interest
> of promoting open source.
>

This is not totally close format, it is only a generic API to support
setting some of the HW registers.
the format is open in
1. http://www.mellanox.com/related-docs/user_manuals/Ethernet_Adapters_Programming_Manual.pdf
2. http://lxr.free-electrons.com/source/include/linux/mlx5/mlx5_ifc.h

We just want to let the user to configure HW dependent registers
without the need to add new defines/callback each time to the kernel
ABI/API.

> I am not saying I fundamentally disagree with supporting this, but it
> is a bigger step than you make it out to be.

I have to disagree, it is not as big as it seems, there are already at
least two places in ethtool API that do the same,

http://lxr.free-electrons.com/source/include/linux/ethtool.h
192  * @set_eeprom: Write data to the device EEPROM.
193  *      Should validate the magic field.  Don't need to check len for zero
194  *      or wraparound.  Update len to the amount written.  Returns an error
195  *      or zero.
http://lxr.free-electrons.com/source/net/core/ethtool.c#L1582

234  * @flash_device: Write a firmware image to device's flash memory.
235  *      Returns a negative error code or zero.

Both accept a binary array from user and forward to HW/FW.
Simply we want to to the same for HW registers, you already have a way
to get them (even parse them in user ethtool), but you can't set them.

^ permalink raw reply

* Re: [PATCH net-next 0/2] Add ethtool set regs support
From: Saeed Mahameed @ 2016-12-11 12:18 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Saeed Mahameed, David S. Miller, Linux Netdev List,
	John W . Linville
In-Reply-To: <20161207024143.GA655@lunn.ch>

On Wed, Dec 7, 2016 at 4:41 AM, Andrew Lunn <andrew@lunn.ch> wrote:
> On Wed, Dec 07, 2016 at 12:33:08AM +0200, Saeed Mahameed wrote:
>> Hi Dave,
>>
>> This series adds the support for setting device registers from user
>> space ethtool.
>
> Is this not the start of allowing binary only drivers in user space?
>

It is not, we want to do same as set_eeprom already do,
Just set some HW registers, for analysis/debug/tweak/configure HW
dependent register for the NIC netdev sake.

> Do we want this?
>

Apparently yes, Our performance team will be happy to have such
feature upstream, so they can just jump in and start their performance
analysis/debugging
with ethtool directly without the need of extra proprietary FW tools
(which sometimes demand reboot and driver restart for each
configuration).

This is one usage and i believe other NIC vendors will be happy to have this.

>
>> mlx5 driver have registers allowed access list and will check the user
>> Request validity before forwarding it to HW registers. Mlx5 will allow only mlx5 specific
>> configurations to be set (e.g. Device Diag Counters for HW performance debugging and analysis)
>> which has no standard API to access it.
>
> Would it not be better to define an flexible API to do this? We have
> lots of HW performance counters for CPUs. Why is it not possible to do
> this for a network device?
>

What is flexible in this context ? Register Types/layout and purposes
differs from vendor to vendor and changes cross HW updates and new HW
generations.
We simply don't want to inform/update the stack and ethtool API about
each register layout, this is just not scalable, and those registers
are not precisely meant for performance/debug counters, sometimes they
have some other purposes. For example configuring an internal HW unit
to work in different ways, so we can have some taste of  "on the fly"
control for testing purposes.

where can i find some details about "HW performance counters for CPUs" ?
In light of my response, if you have any suggestion for more flexible
(strongly typed) API, I will be glad to hear about it.

Thanks for your response.
Saeed.

^ permalink raw reply

* Re: [PATCH net-next 0/2] Add ethtool set regs support
From: Saeed Mahameed @ 2016-12-11 12:41 UTC (permalink / raw)
  To: David Miller; +Cc: andrew, Saeed Mahameed, Linux Netdev List, John W. Linville
In-Reply-To: <20161206.215732.1535653926791462641.davem@davemloft.net>

On Wed, Dec 7, 2016 at 4:57 AM, David Miller <davem@davemloft.net> wrote:
> From: Andrew Lunn <andrew@lunn.ch>
> Date: Wed, 7 Dec 2016 03:41:43 +0100
>
>> On Wed, Dec 07, 2016 at 12:33:08AM +0200, Saeed Mahameed wrote:
>>> Hi Dave,
>>>
>>> This series adds the support for setting device registers from user
>>> space ethtool.
>>
>> Is this not the start of allowing binary only drivers in user space?
>>
>> Do we want this?
>
> I don't think we do.
>

Dave, i think we do have a case here, can you please check my response
to Stephen and Andrew,
I mention some examples and I think I made the purpose of this patches
more clear.

>>
>>> mlx5 driver have registers allowed access list and will check the user
>>> Request validity before forwarding it to HW registers. Mlx5 will allow only mlx5 specific
>>> configurations to be set (e.g. Device Diag Counters for HW performance debugging and analysis)
>>> which has no standard API to access it.
>>
>> Would it not be better to define an flexible API to do this? We have
>> lots of HW performance counters for CPUs. Why is it not possible to do
>> this for a network device?
>
> So if this isn't for raw PIO register access, then we should define
> an appropriate interface for it.
>

It is not, I hope i made it more clear in my responses to Stephen and Andrew.

>
> The ethtool regs stuff is untyped, and arbitrary.
>
> Please create something properly structured, and typed, which would
> allow accessing the information you want the user to be able to
> access.
>
> That way the kernel can tell what the user is reading or writing,
> and thus properly control access.

Thanks for your input,

As i explained to Stephen and Andrew, Those registers differs from
vendor to vendor and cross HW updates. Making those registers
strongly typed is wrong in my opinion, it is not scalable and as
flexible as you think it would be.  There are many HW vendors and each
has lots of registers.

Having a set_eeprom like API will do the job. you can think about this
as ethtool_flash_device kinda tool but more lightweight.

We simply would like to configure/tweak/monitor/debug HW internal
units activities for NIC netdev sake, like the usage we currently
have/suggest,
to have flexibility to enable/disable/query some HW internal
performance registers (should be off on production), both set and
query
will use ethtool_set/get_regs, obviously.

We also took into account future usages for other registers/formats
and to have flexibility across different NICs.

I will be glad if you can give this some thought and provide me with
more concrete direction/suggestion on how to make this less untyped.

The only thing i can think of right now is providing a list of write
allowed registers (address + type + layout length) to ethtool, but i
am not sure how
to handle the "type" information in the kernel, it will create more
restrictions than needed, and what types of register will we need to
add support for at first ?

^ permalink raw reply

* Re: Misalignment, MIPS, and ip_hdr(skb)->version
From: Jason A. Donenfeld @ 2016-12-11 14:50 UTC (permalink / raw)
  To: linux-mips, Netdev, LKML
  Cc: Dan Lüdtke, Willy Tarreau, Måns Rullgård,
	Hannes Frederic Sowa, WireGuard mailing list, Greg KH,
	Felix Fietkau, Jiri Benc, David Miller
In-Reply-To: <20161211071501.GA32621@kroah.com>

Hey guys,

Thanks for the extremely detailed answers. The main take-away from
this is that passing unaligned packets to the networking stack kills
kittens. So now it's a question of mitigation. I have three options:

1. Copy the plaintext to three bytes before the start of the cipher
text, overwriting parts of the header that aren't actually required.
Pros: no changes required, MTU stays small.
Cons: scatterwalk's fast paths aren't hit, which means two page table
mappings are taken instead of one. I have no idea if this actually
matters or will slow down anything relavent.

2. Add 3 bytes to the plaintext header, set to zero, marked for future use.
Pros: satisfies IETF mantras and makes unaligned in-place decryption
straightforward.
Cons: lowers MTU, additional unauthenticated cleartext bits in the
header are of limited utility in protocol.

3. Add 3 bytes of padding, set to zero, to the encrypted section just
before the IP header, marked for future use.
Pros: satisfies IETF mantras, can use those extra bits in the future
for interesting protocol extensions for authenticated peers.
Cons: lowers MTU, marginally more difficult to implement but still
probably just one or two lines of code.

Of these, I'm leaning toward (3).

Anyway, thanks a lot for the input. "Doing nothing" is no longer under
serious consideration, thanks to your messages.

Thanks,
Jason

^ permalink raw reply

* Re: [PATCH net-next] netfilter: nft_counter: rework atomic dump and reset
From: David Miller @ 2016-12-11 15:02 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, eric.dumazet, netdev, arnd
In-Reply-To: <1481453040-2675-1-git-send-email-pablo@netfilter.org>

From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Sun, 11 Dec 2016 11:43:59 +0100

> Dump and reset doesn't work unless cmpxchg64() is used both from packet
> and control plane paths. This approach is going to be slow though.
> Instead, use a percpu seqcount to fetch counters consistently, then
> subtract bytes and packets in case a reset was requested.
> 
> The cpu that running over the reset code is guaranteed to own this stats
> exclusively, we have to turn counters into signed 64bit though so stats
> update on reset don't get wrong on underflow.
> 
> This patch is based on original sketch from Eric Dumazet.
> 
> Fixes: 43da04a593d8 ("netfilter: nf_tables: atomic dump and reset for stateful objects")
> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> ---
> v2: adjust stats on reset on the current cpu, turn 64bit counters into signed.
> 
> @David: Please, take this into net-next to help speed up thing, thanks!

Applied.

^ permalink raw reply

* Re: [PATCH net-next] netfilter: nft_counter: rework atomic dump and reset
From: Arnd Bergmann @ 2016-12-11 15:13 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, eric.dumazet, netdev, davem
In-Reply-To: <1481453040-2675-1-git-send-email-pablo@netfilter.org>

On Sunday, December 11, 2016 11:43:59 AM CET Pablo Neira Ayuso wrote:
> Dump and reset doesn't work unless cmpxchg64() is used both from packet
> and control plane paths. This approach is going to be slow though.
> Instead, use a percpu seqcount to fetch counters consistently, then
> subtract bytes and packets in case a reset was requested.
> 
> The cpu that running over the reset code is guaranteed to own this stats
> exclusively, we have to turn counters into signed 64bit though so stats
> update on reset don't get wrong on underflow.
> 
> This patch is based on original sketch from Eric Dumazet.
> 
> Fixes: 43da04a593d8 ("netfilter: nf_tables: atomic dump and reset for stateful objects")
> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> 

Looks great, I had a similar idea that I was going to suggest the
other day, but yours is better anyway.

	Arnd

^ permalink raw reply

* Re: [PATCH net-next 0/2] Add ethtool set regs support
From: Andrew Lunn @ 2016-12-11 15:22 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Saeed Mahameed, David S. Miller, Linux Netdev List,
	John W . Linville
In-Reply-To: <CALzJLG8nGU-29VoKDtdaoFuG7zfTvTTu14T91+PMxcJH-GG2bw@mail.gmail.com>

On Sun, Dec 11, 2016 at 02:18:00PM +0200, Saeed Mahameed wrote:
> On Wed, Dec 7, 2016 at 4:41 AM, Andrew Lunn <andrew@lunn.ch> wrote:
> > On Wed, Dec 07, 2016 at 12:33:08AM +0200, Saeed Mahameed wrote:
> >> Hi Dave,
> >>
> >> This series adds the support for setting device registers from user
> >> space ethtool.
> >
> > Is this not the start of allowing binary only drivers in user space?
> >
> 
> It is not, we want to do same as set_eeprom already do,
> Just set some HW registers, for analysis/debug/tweak/configure HW
> dependent register for the NIC netdev sake.

Mellanox has a good reputation of open drivers. However, this API
sounds like it would be the first step towards user space
drivers. This is an API which can peek and poke registers, so it
probably could be mis-used to put part of a driver in user
space. Hence we should avoid this sort of API to start with.

> where can i find some details about "HW performance counters for CPUs" ?

https://perf.wiki.kernel.org/index.php/Tutorial

	Andrew

^ permalink raw reply

* Re: Misalignment, MIPS, and ip_hdr(skb)->version
From: Andrew Lunn @ 2016-12-11 15:30 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: linux-mips, Netdev, LKML, Dan Lüdtke, Willy Tarreau,
	Måns Rullgård, Hannes Frederic Sowa,
	WireGuard mailing list, Greg KH, Felix Fietkau, Jiri Benc,
	David Miller
In-Reply-To: <CAHmME9q5ifwwishXjXYE3J=sVeR4jYY9fLUgs_FHCP594EZr6g@mail.gmail.com>

> 3. Add 3 bytes of padding, set to zero, to the encrypted section just
> before the IP header, marked for future use.
> Pros: satisfies IETF mantras, can use those extra bits in the future
> for interesting protocol extensions for authenticated peers.
> Cons: lowers MTU, marginally more difficult to implement but still
> probably just one or two lines of code.

I'm not a crypto expert, but does this not give you a helping hand in
breaking the crypto? You know the plain text value of these bytes, and
where they are in the encrypted text.

      Andrew

^ permalink raw reply

* [PATCH 1/1] Signed-off-by: Jan Wang <bbw725@163.com>
From: Jan Wang @ 2016-12-11 15:31 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Jan Wang

[net]: Missing init queue tail

Accept queue tail doesn't initialize NULL.
Though looks like no harm, unify it with reqsk_queue_remove(
it will initialize tail to NULL when empty).

Signed-off-by: Jan Wang <bbw725@163.com>
---
 net/core/request_sock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/request_sock.c b/net/core/request_sock.c
index 5d26056..1392bab 100644
--- a/net/core/request_sock.c
+++ b/net/core/request_sock.c
@@ -47,6 +47,7 @@ void reqsk_queue_alloc(struct request_sock_queue *queue)
 	queue->fastopenq.qlen = 0;
 
 	queue->rskq_accept_head = NULL;
+	queue->rskq_accept_tail = NULL;
 }
 
 /*
-- 
2.1.4

^ permalink raw reply related

* Re: Misalignment, MIPS, and ip_hdr(skb)->version
From: Jason A. Donenfeld @ 2016-12-11 15:37 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: linux-mips, Netdev, LKML, Dan Lüdtke, Willy Tarreau,
	Måns Rullgård, Hannes Frederic Sowa,
	WireGuard mailing list, Greg KH, Felix Fietkau, Jiri Benc,
	David Miller
In-Reply-To: <20161211153027.GD29761@lunn.ch>

On Sun, Dec 11, 2016 at 4:30 PM, Andrew Lunn <andrew@lunn.ch> wrote:
> I'm not a crypto expert, but does this not give you a helping hand in
> breaking the crypto? You know the plain text value of these bytes, and
> where they are in the encrypted text.

You also know with some probability that there's going to be an IP
header and a TCP header, each with predictable fields. Maybe you're
reasonably certain there's an HTTP header in there too. Gasp! But fear
not...

Symmetric ciphers are generally not considered secure if they fall to
what's called a "known plaintext attack". Fortunately, modern ciphers
like AES and ChaCha20 and most others that you're aware of are
generally believed to be secure against KPA.

^ permalink raw reply

* Re: Misalignment, MIPS, and ip_hdr(skb)->version
From: Willy Tarreau @ 2016-12-11 16:44 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: linux-mips, Måns Rullgård, LKML, Jiri Benc,
	Hannes Frederic Sowa, Netdev, David Miller,
	WireGuard mailing list, Felix Fietkau
In-Reply-To: <CAHmME9q5ifwwishXjXYE3J=sVeR4jYY9fLUgs_FHCP594EZr6g@mail.gmail.com>

On Sun, Dec 11, 2016 at 03:50:31PM +0100, Jason A. Donenfeld wrote:
> 3. Add 3 bytes of padding, set to zero, to the encrypted section just
> before the IP header, marked for future use.
> Pros: satisfies IETF mantras, can use those extra bits in the future
> for interesting protocol extensions for authenticated peers.
> Cons: lowers MTU, marginally more difficult to implement but still
> probably just one or two lines of code.
> 
> Of these, I'm leaning toward (3).

Or 4) add one byte to the cleartext header for future use (mostly flags
maybe) and 2 bytes of padding to the encrypted header. This way you get
the following benefits :
  1) your encrypted text is at least 16-bit aligned, maybe it matters
     in your checksum computations on during decryption
  2) your MTU remains even, this is better for both ends
  3) you're free to add some bits either to the encrypted or the clear
     parts.

Just a suggestion :-)

Willy

^ permalink raw reply

* net/3com/3c515: Fix timer handling, prevent leaks and crashes
From: Thomas Gleixner @ 2016-12-11 17:31 UTC (permalink / raw)
  To: LKML; +Cc: netdev, Matthew Whitehead, Andy Lutomirski, David Miller

The timer handling in this driver is broken in several ways:

- corkscrew_open() initializes and arms a timer before requesting the
  device interrupt. If the request fails the timer stays armed.

  A second call to corkscrew_open will unconditionally reinitialize the
  quued timer and arm it again. Also a immediate device removal will leave
  the timer queued because close() is not called (open() failed) and
  therefore nothing issues del_timer().

  The reinitialization corrupts the link chain in the timer wheel hash
  bucket and causes a NULL pointer dereference when the timer wheel tries
  to operate on that hash bucket. Immediate device removal lets the link
  chain poke into freed and possibly reused memory.

  Solution: Arm the timer after the successful irq request.

- corkscrew_close() uses del_timer()

  On close the timer is disarmed with del_timer() which lets the following
  code race against a concurrent timer expiry function.

  Solution: Use del_timer_sync() instead

- corkscrew_close() calls del_timer() unconditionally

  del_timer() is invoked even if the timer was never initialized. This
  works by chance because the struct containing the timer is zeroed at
  allocation time.

  Solution: Move the setup of the timer into corkscrew_setup().

Reported-by: Matthew Whitehead <tedheadster@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: netdev@vger.kernel.org
---
 drivers/net/ethernet/3com/3c515.c |   15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

--- a/drivers/net/ethernet/3com/3c515.c
+++ b/drivers/net/ethernet/3com/3c515.c
@@ -628,6 +628,8 @@ static int corkscrew_setup(struct net_de

 	spin_lock_init(&vp->lock);

+	setup_timer(&vp->timer, corkscrew_timer, (unsigned long) dev);
+
 	/* Read the station address from the EEPROM. */
 	EL3WINDOW(0);
 	for (i = 0; i < 0x18; i++) {
@@ -708,6 +710,7 @@ static int corkscrew_open(struct net_dev
 {
 	int ioaddr = dev->base_addr;
 	struct corkscrew_private *vp = netdev_priv(dev);
+	bool armtimer = false;
 	__u32 config;
 	int i;

@@ -732,12 +735,7 @@ static int corkscrew_open(struct net_dev
 		if (corkscrew_debug > 1)
 			pr_debug("%s: Initial media type %s.\n",
 			       dev->name, media_tbl[dev->if_port].name);
-
-		init_timer(&vp->timer);
-		vp->timer.expires = jiffies + media_tbl[dev->if_port].wait;
-		vp->timer.data = (unsigned long) dev;
-		vp->timer.function = corkscrew_timer;	/* timer handler */
-		add_timer(&vp->timer);
+		armtimer = true;
 	} else
 		dev->if_port = vp->default_media;

@@ -777,6 +775,9 @@ static int corkscrew_open(struct net_dev
 		return -EAGAIN;
 	}

+	if (armtimer)
+		mod_timer(&vp->timer, jiffies + media_tbl[dev->if_port].wait);
+
 	if (corkscrew_debug > 1) {
 		EL3WINDOW(4);
 		pr_debug("%s: corkscrew_open() irq %d media status %4.4x.\n",
@@ -1427,7 +1428,7 @@ static int corkscrew_close(struct net_de
 			dev->name, rx_nocopy, rx_copy, queued_packet);
 	}

-	del_timer(&vp->timer);
+	del_timer_sync(&vp->timer);

 	/* Turn off statistics ASAP.  We update lp->stats below. */
 	outw(StatsDisable, ioaddr + EL3_CMD);

^ permalink raw reply

* Re: [PATCH] stmmac: disable tx coalescing
From: Pavel Machek @ 2016-12-11 19:07 UTC (permalink / raw)
  To: David Miller
  Cc: peppe.cavallaro, netdev, linux-kernel, alexandre.torgue,
	LinoSanfilippo
In-Reply-To: <20161205122711.GA30774@amd>

[-- Attachment #1: Type: text/plain, Size: 2325 bytes --]

Hi!

David, ping? Can I get you to apply this one?

As you noticed, tx coalescing is completely broken in that driver, and
not easy to repair. This is simplest way to disable it. It can still
be re-enabled from userspace, so code can be fixed in future.

Best regards,
								Pavel

> 
> Tx coalescing in stmmac is broken in more than one way, so disable it
> for now.
>     
> First, low-res timers have resolution down to one per second. It is
> not acceptable to delay transmits for 40msec, and certainly not
> acceptable to delay them for 1000msec.
>     
> Second, the logic is wrong:
>     
>     if (likely(priv->tx_coal_frames > priv->tx_count_frames))
>      	mod_timer(&priv->txtimer,
>      	          STMMAC_COAL_TIMER(priv->tx_coal_timer));
>     ...
>     
> doing tx_clean() after set number of packets, or set time after the
> first packet is transmitted would make sense. But that's not what the
> code does. As long as packets are being transmitted, you move the
> timer into the future.. so that finally you run out of the place, then
> wait for timer (!) and only then you do the cleaning.
> 
> Third, tx_cleanup is not safe to call from interrupt (tx_lock is not
> irqsave), but that's exactly what coalescing code does.
> 
> Signed-off-by: Pavel Machek <pavel@denx.de>
> 
> ---
> 
> This is stable candidate, afaict. It is broken in 4.4, too.
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
> index 3ced2e1..32ce148 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/common.h
> +++ b/drivers/net/ethernet/stmicro/stmmac/common.h
> @@ -244,11 +244,11 @@ struct stmmac_extra_stats {
>  /* Max/Min RI Watchdog Timer count value */
>  #define MAX_DMA_RIWT		0xff
>  #define MIN_DMA_RIWT		0x20
> -/* Tx coalesce parameters */
> +/* Tx coalesce parameters. Set STMMAC_TX_FRAMES to 0 to disable coalescing. */
>  #define STMMAC_COAL_TX_TIMER	40000
>  #define STMMAC_MAX_COAL_TX_TICK	100000
>  #define STMMAC_TX_MAX_FRAMES	256
> -#define STMMAC_TX_FRAMES	64
> +#define STMMAC_TX_FRAMES	0
>  
>  /* Rx IPC status */
>  enum rx_frame_status {
> 



-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply

* Re: [PATCH] stmmac: disable tx coalescing
From: David Miller @ 2016-12-11 19:31 UTC (permalink / raw)
  To: pavel
  Cc: peppe.cavallaro, netdev, linux-kernel, alexandre.torgue,
	LinoSanfilippo
In-Reply-To: <20161211190750.GA14287@amd>

From: Pavel Machek <pavel@ucw.cz>
Date: Sun, 11 Dec 2016 20:07:50 +0100

> David, ping? Can I get you to apply this one?
> 
> As you noticed, tx coalescing is completely broken in that driver, and
> not easy to repair. This is simplest way to disable it. It can still
> be re-enabled from userspace, so code can be fixed in future.

Sorry I'm not applying this, I want you and the driver maintainer
to reach a real solution.

^ permalink raw reply

* Re: [PATCH] [v4] net: phy: phy drivers should not set SUPPORTED_[Asym_]Pause
From: Florian Fainelli @ 2016-12-11 19:45 UTC (permalink / raw)
  To: Timur Tabi, David Miller, netdev, jon.mason, nks.gnu
In-Reply-To: <1481138451-28144-1-git-send-email-timur@codeaurora.org>

Le 12/07/16 à 11:20, Timur Tabi a écrit :
> Instead of having individual PHY drivers set the SUPPORTED_Pause and
> SUPPORTED_Asym_Pause flags, phylib itself should set those flags,
> unless there is a hardware erratum or other special case.  During
> autonegotiation, the PHYs will determine whether to enable pause
> frame support.
> 
> Pause frames are a feature that is supported by the MAC.  It is the MAC
> that generates the frames and that processes them.  The PHY can only be
> configured to allow them to pass through.
> 
> This commit also effectively reverts the recently applied c7a61319
> ("net: phy: dp83848: Support ethernet pause frames").
> 
> So the new process is:
> 
> 1) Unless the PHY driver overrides it, phylib sets the SUPPORTED_Pause
> and SUPPORTED_AsymPause bits in phydev->supported.  This indicates that
> the PHY supports pause frames.
> 
> 2) The MAC driver checks phydev->supported before it calls phy_start().
> If (SUPPORTED_Pause | SUPPORTED_AsymPause) is set, then the MAC driver
> sets those bits in phydev->advertising, if it wants to enable pause
> frame support.
> 
> 3) When the link state changes, the MAC driver checks phydev->pause and
> phydev->asym_pause,  If the bits are set, then it enables the corresponding
> features in the MAC.  The algorithm is:
> 
> 	if (phydev->pause)
> 		The MAC should be programmed to receive and honor
>                 pause frames it receives, i.e. enable receive flow control.
> 
> 	if (phydev->pause != phydev->asym_pause)
> 		The MAC should be programmed to transmit pause
> 		frames when needed, i.e. enable transmit flow control.
> 
> Signed-off-by: Timur Tabi <timur@codeaurora.org>

This is already applied but:

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

Thanks!
-- 
Florian

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox