Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next 3/3] net: phy: realtek: add support for the 2.5Gbps PHY in RTL8125
From: Andrew Lunn @ 2019-08-08 20:20 UTC (permalink / raw)
  To: Heiner Kallweit; +Cc: Florian Fainelli, David Miller, netdev@vger.kernel.org
In-Reply-To: <f34d1117-510f-861f-59f0-51e0e87ead1e@gmail.com>

> I have a contact in Realtek who provided the information about
> the vendor-specific registers used in the patch. I also asked for
> a method to auto-detect 2.5Gbps support but have no feedback so far.
> What may contribute to the problem is that also the integrated 1Gbps
> PHY's (all with the same PHY ID) differ significantly from each other,
> depending on the network chip version.

Hi Heiner

Some of the PHYs embedded in Marvell switches have an OUI, but no
product ID. We work around this brokenness by trapping the reads to
the ID registers in the MDIO bus controller driver and inserting the
switch product ID. The Marvell PHY driver then recognises these IDs
and does the right thing.

Maybe you can do something similar here?

      Andrew

^ permalink raw reply

* Re: [PATCH net-next v1 1/8] netfilter: inlined four headers files into another one.
From: Jeremy Sowden @ 2019-08-08 20:10 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Netfilter Devel, Net Dev, Masahiro Yamada, Jozsef Kadlecsik
In-Reply-To: <20190808112355.w3ax3twuf6b7pwc7@salvia>

[-- Attachment #1: Type: text/plain, Size: 768 bytes --]

On 2019-08-08, at 13:23:55 +0200, Pablo Neira Ayuso wrote:
> On Wed, Aug 07, 2019 at 03:16:58PM +0100, Jeremy Sowden wrote:
> [...]
> > +/* Called from uadd only, protected by the set spinlock.
> > + * The kadt functions don't use the comment extensions in any way.
> > + */
> > +static inline void
> > +ip_set_init_comment(struct ip_set *set, struct ip_set_comment *comment,
> > +		    const struct ip_set_ext *ext)
>
> Not related to this patch, but I think the number of inline functions
> could be reduced a bit by exporting symbols? Specifically for
> functions that are called from the netlink control plane, ie. _uadd()
> functions. I think forcing the compiler to inline this is not useful.
> This could be done in a follow up patchset.

I'll take a look.

J.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 1/3] net: phy: prepare phylib to deal with PHY's extending Clause 22
From: Heiner Kallweit @ 2019-08-08 20:09 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Florian Fainelli, David Miller, netdev@vger.kernel.org
In-Reply-To: <20190808193243.GK27917@lunn.ch>

On 08.08.2019 21:32, Andrew Lunn wrote:
> On Thu, Aug 08, 2019 at 09:03:42PM +0200, Heiner Kallweit wrote:
>> The integrated PHY in 2.5Gbps chip RTL8125 is the first (known to me)
>> PHY that uses standard Clause 22 for all modes up to 1Gbps and adds
>> 2.5Gbps control using vendor-specific registers. To use phylib for
>> the standard part little extensions are needed:
>> - Move most of genphy_config_aneg to a new function
>>   __genphy_config_aneg that takes a parameter whether restarting
>>   auto-negotiation is needed (depending on whether content of
>>   vendor-specific advertisement register changed).
>> - Don't clear phydev->lp_advertising in genphy_read_status so that
>>   we can set non-C22 mode flags before.
>>
>> Basically both changes mimic the behavior of the equivalent Clause 45
>> functions.
>>
>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
>> ---
>>  drivers/net/phy/phy_device.c | 35 +++++++++++++++--------------------
>>  include/linux/phy.h          |  8 +++++++-
>>  2 files changed, 22 insertions(+), 21 deletions(-)
>>
>> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
>> index 7ddd91df9..bd7e7db8c 100644
>> --- a/drivers/net/phy/phy_device.c
>> +++ b/drivers/net/phy/phy_device.c
>> @@ -1571,11 +1571,9 @@ static int genphy_config_advert(struct phy_device *phydev)
>>  	/* Only allow advertising what this PHY supports */
>>  	linkmode_and(phydev->advertising, phydev->advertising,
>>  		     phydev->supported);
>> -	if (!ethtool_convert_link_mode_to_legacy_u32(&advertise,
>> -						     phydev->advertising))
>> -		phydev_warn(phydev, "PHY advertising (%*pb) more modes than genphy supports, some modes not advertised.\n",
>> -			    __ETHTOOL_LINK_MODE_MASK_NBITS,
>> -			    phydev->advertising);
>> +
>> +	ethtool_convert_link_mode_to_legacy_u32(&advertise,
>> +						phydev->advertising);
> 
> Hi Heiner
> 
> linkmode_adv_to_mii_adv_t() would remove the need to use
> ethtool_convert_link_mode_to_legacy_u32(), and this warning would also
> go away. 
> 
Thanks for the hint! I'll check and submit a v2.

>    Andrew
> 
Heiner


^ permalink raw reply

* Re: [PATCH net-next] r8169: make use of xmit_more
From: Heiner Kallweit @ 2019-08-08 20:08 UTC (permalink / raw)
  To: Holger Hoffstätte, Realtek linux nic maintainers,
	David Miller
  Cc: netdev@vger.kernel.org, Sander Eikelenboom, Eric Dumazet
In-Reply-To: <868a1f4c-5fba-c64b-ea31-30a3770e6a2f@applied-asynchrony.com>

On 08.08.2019 21:52, Holger Hoffstätte wrote:
> On 8/8/19 8:17 PM, Heiner Kallweit wrote:
>> On 08.08.2019 17:53, Holger Hoffstätte wrote:
>>> On 8/8/19 4:37 PM, Holger Hoffstätte wrote:
>>>>
>>>> Hello Heiner -
>>>>
>>>> On 7/28/19 11:25 AM, Heiner Kallweit wrote:
>>>>> There was a previous attempt to use xmit_more, but the change had to be
>>>>> reverted because under load sometimes a transmit timeout occurred [0].
>>>>> Maybe this was caused by a missing memory barrier, the new attempt
>>>>> keeps the memory barrier before the call to netif_stop_queue like it
>>>>> is used by the driver as of today. The new attempt also changes the
>>>>> order of some calls as suggested by Eric.
>>>>>
>>>>> [0] https://lkml.org/lkml/2019/2/10/39
>>>>>
>>>>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
>>>>
>>>> I decided to take one for the team and merged this into my 5.2.x tree (just
>>>> fixing up the path) and it has been working fine for the last 2 weeks in two
>>>> machines..until today, when for the first time in forever some random NFS traffic
>>>> made this old friend come out from under the couch:
>>>>
>>>> [Aug 8 14:13] ------------[ cut here ]------------
>>>> [  +0.000006] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
>>>> [  +0.000021] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:442 dev_watchdog+0x21f/0x230
>>>> [  +0.000001] Modules linked in: lz4 lz4_compress lz4_decompress nfsd auth_rpcgss oid_registry lockd grace sunrpc sch_fq_codel btrfs xor zstd_compress raid6_pq zstd_decompress bfq jitterentropy_rng nct6775 hwmon_vid coretemp hwmon x86_pkg_temp_thermal aesni_intel aes_x86_64 i915 glue_helper crypto_simd cryptd i2c_i801 intel_gtt i2c_algo_bit iosf_mbi drm_kms_helper syscopyarea usbhid sysfillrect r8169 sysimgblt fb_sys_fops realtek drm libphy drm_panel_orientation_quirks i2c_core video backlight mq_deadline
>>>> [  +0.000026] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.2.7 #1
>>>> [  +0.000001] Hardware name: System manufacturer System Product Name/P8Z68-V LX, BIOS 4105 07/01/2013
>>>> [  +0.000004] RIP: 0010:dev_watchdog+0x21f/0x230
>>>> [  +0.000002] Code: 3b 00 75 ea eb ad 4c 89 ef c6 05 1c 45 bd 00 01 e8 66 35 fc ff 44 89 e1 4c 89 ee 48 c7 c7 e8 5e fc 81 48 89 c2 e8 90 df 92 ff <0f> 0b eb 8e 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 66 66 66 66 90
>>>> [  +0.000002] RSP: 0018:ffffc90000118e68 EFLAGS: 00010286
>>>> [  +0.000002] RAX: 0000000000000000 RBX: ffff8887f7837600 RCX: 0000000000000303
>>>> [  +0.000001] RDX: 0000000000000001 RSI: 0000000000000092 RDI: ffffffff827a488c
>>>> [  +0.000001] RBP: ffff8887f9fbc440 R08: 0000000000000303 R09: 0000000000000003
>>>> [  +0.000001] R10: 000000000001004c R11: 0000000000000001 R12: 0000000000000000
>>>> [  +0.000009] R13: ffff8887f9fbc000 R14: ffffffff8173aa20 R15: dead000000000200
>>>> [  +0.000001] FS:  0000000000000000(0000) GS:ffff8887ff580000(0000) knlGS:0000000000000000
>>>> [  +0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [  +0.000001] CR2: 00007f8d1c04d000 CR3: 0000000002209001 CR4: 00000000000606e0
>>>> [  +0.000000] Call Trace:
>>>> [  +0.000002]  <IRQ>
>>>> [  +0.000005]  call_timer_fn+0x2b/0x120
>>>> [  +0.000002]  expire_timers+0xa4/0x100
>>>> [  +0.000001]  run_timer_softirq+0x8c/0x170
>>>> [  +0.000002]  ? __hrtimer_run_queues+0x13a/0x290
>>>> [  +0.000003]  ? sched_clock_cpu+0xe/0x130
>>>> [  +0.000003]  __do_softirq+0xeb/0x2de
>>>> [  +0.000003]  irq_exit+0x9d/0xe0
>>>> [  +0.000002]  smp_apic_timer_interrupt+0x60/0x110
>>>> [  +0.000003]  apic_timer_interrupt+0xf/0x20
>>>> [  +0.000001]  </IRQ>
>>>> [  +0.000003] RIP: 0010:cpuidle_enter_state+0xad/0x930
>>>> [  +0.000001] Code: c5 66 66 66 66 90 31 ff e8 90 99 9e ff 80 7c 24 0b 00 74 12 9c 58 f6 c4 02 0f 85 39 08 00 00 31 ff e8 e7 26 a2 ff fb 45 85 e4 <0f> 88 34 02 00 00 49 63 cc 4c 2b 2c 24 48 8d 04 49 48 c1 e0 05 8b
>>>> [  +0.000000] RSP: 0018:ffffc9000008be50 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
>>>> [  +0.000001] RAX: ffff8887ff5a9180 RBX: ffffffff822b6c40 RCX: 000000000000001f
>>>> [  +0.000001] RDX: 0000000000000000 RSI: 0000000033087154 RDI: 0000000000000000
>>>> [  +0.000001] RBP: ffff8887ff5b1310 R08: 000030d021fae397 R09: ffff8887ff59c8c0
>>>> [  +0.000000] R10: ffff8887ff59c8c0 R11: 0000000000000006 R12: 0000000000000004
>>>> [  +0.000001] R13: 000030d021fae397 R14: 0000000000000004 R15: ffff8887fc281600
>>>> [  +0.000001]  cpuidle_enter+0x29/0x40
>>>> [  +0.000002]  do_idle+0x1e5/0x280
>>>> [  +0.000001]  cpu_startup_entry+0x19/0x20
>>>> [  +0.000002]  start_secondary+0x186/0x1c0
>>>> [  +0.000001]  secondary_startup_64+0xa4/0xb0
>>>> [  +0.000001] ---[ end trace 99493c768580f4fd ]---
>>>>
>>>> The device is:
>>>>
>>>> Aug  7 23:19:09 tux kernel: libphy: r8169: probed
>>>> Aug  7 23:19:09 tux kernel: r8169 0000:04:00.0 eth0: RTL8168evl/8111evl, c8:60:00:68:33:cc, XID 2c9, IRQ 36
>>>> Aug  7 23:19:09 tux kernel: r8169 0000:04:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
>>>> Aug  7 23:19:12 tux kernel: RTL8211E Gigabit Ethernet r8169-400:00: attached PHY driver [RTL8211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-400:00, irq=IGNORE)
>>>> Aug  7 23:19:13 tux kernel: r8169 0000:04:00.0 eth0: No native access to PCI extended config space, falling back to CSI
>>>>
>>>> and using fq_codel, of course.
>>>>
>>>> This cpuidle hiccup used to be completely gone without xmit_more and this was
>>>> the first (and so far only) time since merging it (regardless of load).
>>>> Also, while I'm using BMQ as CPU scheduler, that hasn't made a difference for
>>>> this particular problem in the past (with MuQSS/PDS) either; way back when I had
>>>> Eric's previous attempt(s) it also hiccupped with CFS.
>>>>
>>>> Revert or wait for more reports when -next is merged in 5.4?
>>>
>>> Another question/data point: I've had the whole basket of offloads activated:
>>>
>>>    ethtool --offload eth0 rx on tx on gro on gso on sg on tso on
>>>
>>> and this caused zero problems without the xmit_more patch. However I just saw
>>> that net-next has a patch where TSO is disabled due to a known HW defect in
>>> RTL8168evl, which is of course what I have. Could this be the reason for the
>>> stall/hiccup when xmit_more has its fingers in the pie? I kind of know what
>>> xmit_more does, just not how it could interact with a possibly broken TSO that
>>> nevertheless seems to work fine otherwise..
>>>
>>
>> I was about to ask exactly that, whether you have TSO enabled. I don't know what
>> can trigger the HW issue, it was just confirmed by Realtek that this chip version
>> has a problem with TSO. So the logical conclusion is: test w/o TSO, ideally the
>> linux-next version.
> 
> So disabling TSO alone didn't work - it leads to reduced throughout (~70 MB/s in iperf).
> Instead I decided to backport 93681cd7d94f ("r8169: enable HW csum and TSO"), which
> wasn't easy due to cleanups/renamings of dependencies, but I managed to backport
> it and .. got the same problem of reduced throughout. wat?!
> 
> After lots of trial & error I started disabling all offloads and finally found
> that sg (Scatter-Gather) enabled alone - without TSO - will lead to the throughput
> drop. So the culprit seems 93681cd7d94f, which disabled TSO on my NIC, but left
> sg on by default. This weas repeatable - switch on sg, throughput drop; turn it
> off - smooth sailing, now with reduced buffers.
> 
> I modified the relevant bits to disable tso & sg like this:
> 
>     /* RTL8168e-vl has a HW issue with TSO */
>     if (tp->mac_version == RTL_GIGA_MAC_VER_34) {
> +        dev->vlan_features &= ~(NETIF_F_ALL_TSO|NETIF_F_SG);
> +        dev->hw_features &= ~(NETIF_F_ALL_TSO|NETIF_F_SG);
> +        dev->features &= ~(NETIF_F_ALL_TSO|NETIF_F_SG);
>     }
> 
> This seems to work since it restores performance without sg/tso by default
> and without any additional offloads, yet with xmit_more in the mix.
> We'll see whether that is stable over the next few days, but I strongly
> suspect it will be good and that the hiccups were due to xmit_more/TSO
> interaction.
> 
Thanks a lot for the analysis and testing. Then I'll submit the disabling
of SG on RTL8168evl (on your behalf), independent of whether it fixes
the timeout issue.

> thanks,
> Holger
> 
Heiner


^ permalink raw reply

* Re: BUG: soft lockup in tcp_delack_timer
From: Thomas Gleixner @ 2019-08-08 20:07 UTC (permalink / raw)
  To: syzbot
  Cc: Frederic Weisbecker, LKML, Ingo Molnar, syzkaller-bugs,
	Eric Dumazet, netdev
In-Reply-To: <00000000000094d2b5058f9df271@google.com>

On Thu, 8 Aug 2019, syzbot wrote:

Cc+ Eric, net-dev

> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:    0d8b3265 Add linux-next specific files for 20190729
> git tree:       linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=1101fdc8600000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=ae96f3b8a7e885f7
> dashboard link: https://syzkaller.appspot.com/bug?extid=2d55fb97f42947bbcddd
> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> 
> Unfortunately, I don't have any reproducer for this crash yet.
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+2d55fb97f42947bbcddd@syzkaller.appspotmail.com
> 
> net_ratelimit: 2 callbacks suppressed
> TCP: request_sock_TCPv6: Possible SYN flooding on port 20002. Sending cookies.
> Check SNMP counters.
> watchdog: BUG: soft lockup - CPU#0 stuck for 122s! [swapper/0:0]
> Modules linked in:
> irq event stamp: 92022
> hardirqs last  enabled at (92021): [<ffffffff81660331>]
> tick_nohz_idle_exit+0x181/0x2e0 kernel/time/tick-sched.c:1180
> hardirqs last disabled at (92022): [<ffffffff873d5d7d>]
> __schedule+0x1dd/0x15b0 kernel/sched/core.c:3862
> softirqs last  enabled at (90810): [<ffffffff876006cd>]
> __do_softirq+0x6cd/0x98c kernel/softirq.c:319
> softirqs last disabled at (90703): [<ffffffff8144fc1b>] invoke_softirq
> kernel/softirq.c:373 [inline]
> softirqs last disabled at (90703): [<ffffffff8144fc1b>] irq_exit+0x19b/0x1e0
> kernel/softirq.c:413
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-rc2-next-20190729 #54
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
> 01/01/2011
> RIP: 0010:cpu_relax arch/x86/include/asm/processor.h:656 [inline]
> RIP: 0010:virt_spin_lock arch/x86/include/asm/qspinlock.h:84 [inline]
> RIP: 0010:native_queued_spin_lock_slowpath+0x132/0x9f0
> kernel/locking/qspinlock.c:325
> Code: 00 00 00 48 8b 45 d0 65 48 33 04 25 28 00 00 00 0f 85 37 07 00 00 48 81
> c4 98 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 f3 90 <e9> 73 ff ff ff 8b 45
> 98 4c 8d 65 d8 3d 00 01 00 00 0f 84 e5 00 00
> RSP: 0018:ffff8880ae809b48 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
> RAX: 0000000000000000 RBX: ffff8880621cd088 RCX: ffffffff8158f117
> RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff8880621cd088
> RBP: ffff8880ae809c08 R08: 1ffff1100c439a11 R09: ffffed100c439a12
> R10: ffffed100c439a11 R11: ffff8880621cd08b R12: 0000000000000001
> R13: 0000000000000003 R14: ffffed100c439a11 R15: 0000000000000001
> FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000001541e88 CR3: 0000000068089000 CR4: 00000000001406f0
> Call Trace:
> <IRQ>
> pv_queued_spin_lock_slowpath arch/x86/include/asm/paravirt.h:642 [inline]
> queued_spin_lock_slowpath arch/x86/include/asm/qspinlock.h:50 [inline]
> queued_spin_lock include/asm-generic/qspinlock.h:81 [inline]
> do_raw_spin_lock+0x20e/0x2e0 kernel/locking/spinlock_debug.c:113
> __raw_spin_lock include/linux/spinlock_api_smp.h:143 [inline]
> _raw_spin_lock+0x37/0x40 kernel/locking/spinlock.c:151
> spin_lock include/linux/spinlock.h:338 [inline]
> tcp_delack_timer+0x2b/0x2a0 net/ipv4/tcp_timer.c:318
> call_timer_fn+0x1ac/0x780 kernel/time/timer.c:1322
> expire_timers kernel/time/timer.c:1366 [inline]
> __run_timers kernel/time/timer.c:1685 [inline]
> __run_timers kernel/time/timer.c:1653 [inline]
> run_timer_softirq+0x697/0x17a0 kernel/time/timer.c:1698
> __do_softirq+0x262/0x98c kernel/softirq.c:292
> invoke_softirq kernel/softirq.c:373 [inline]
> irq_exit+0x19b/0x1e0 kernel/softirq.c:413
> exiting_irq arch/x86/include/asm/apic.h:536 [inline]
> smp_apic_timer_interrupt+0x1a3/0x610 arch/x86/kernel/apic/apic.c:1095
> apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:828
> </IRQ>
> RIP: 0010:native_safe_halt+0xe/0x10 arch/x86/include/asm/irqflags.h:61
> Code: c8 75 6e fa eb 8a 90 90 90 90 90 90 e9 07 00 00 00 0f 00 2d c4 b2 49 00
> f4 c3 66 90 e9 07 00 00 00 0f 00 2d b4 b2 49 00 fb f4 <c3> 90 55 48 89 e5 41
> 57 41 56 41 55 41 54 53 e8 8e 56 21 fa e8 29
> RSP: 0018:ffffffff88c07ce8 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
> RAX: 1ffffffff11a5e87 RBX: ffffffff88c7a1c0 RCX: 1ffffffff134bca6
> RDX: dffffc0000000000 RSI: ffffffff81779dee RDI: ffffffff873e794c
> RBP: ffffffff88c07d18 R08: ffffffff88c7a1c0 R09: fffffbfff118f439
> R10: fffffbfff118f438 R11: ffffffff88c7a1c7 R12: dffffc0000000000
> R13: ffffffff89a5b340 R14: 0000000000000000 R15: 0000000000000000
> arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:571
> default_idle_call+0x84/0xb0 kernel/sched/idle.c:94
> cpuidle_idle_call kernel/sched/idle.c:154 [inline]
> do_idle+0x413/0x760 kernel/sched/idle.c:263
> cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:354
> rest_init+0x245/0x37b init/main.c:451
> arch_call_rest_init+0xe/0x1b
> start_kernel+0x912/0x951 init/main.c:785
> x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:472
> x86_64_start_kernel+0x77/0x7b arch/x86/kernel/head64.c:453
> secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:241
> Sending NMI from CPU 0 to CPUs 1:
> INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.339
> msecs
> NMI backtrace for cpu 1
> CPU: 1 PID: 7447 Comm: syz-executor.5 Not tainted 5.3.0-rc2-next-20190729 #54
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
> 01/01/2011
> RIP: 0010:cpu_relax arch/x86/include/asm/processor.h:656 [inline]
> RIP: 0010:virt_spin_lock arch/x86/include/asm/qspinlock.h:84 [inline]
> RIP: 0010:native_queued_spin_lock_slowpath+0x132/0x9f0
> kernel/locking/qspinlock.c:325
> Code: 00 00 00 48 8b 45 d0 65 48 33 04 25 28 00 00 00 0f 85 37 07 00 00 48 81
> c4 98 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 f3 90 <e9> 73 ff ff ff 8b 45
> 98 4c 8d 65 d8 3d 00 01 00 00 0f 84 e5 00 00
> RSP: 0018:ffff8880ae909210 EFLAGS: 00000202
> RAX: 0000000000000000 RBX: ffff8880621cd088 RCX: ffffffff8158f117
> RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff8880621cd088
> RBP: ffff8880ae9092d0 R08: 1ffff1100c439a11 R09: ffffed100c439a12
> R10: ffffed100c439a11 R11: ffff8880621cd08b R12: 0000000000000001
> R13: 0000000000000003 R14: ffffed100c439a11 R15: 0000000000000001
> FS:  0000555557246940(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000001b32b29000 CR3: 00000000a4e3a000 CR4: 00000000001406e0
> Call Trace:
> <IRQ>
> pv_queued_spin_lock_slowpath arch/x86/include/asm/paravirt.h:642 [inline]
> queued_spin_lock_slowpath arch/x86/include/asm/qspinlock.h:50 [inline]
> queued_spin_lock include/asm-generic/qspinlock.h:81 [inline]
> do_raw_spin_lock+0x20e/0x2e0 kernel/locking/spinlock_debug.c:113
> __raw_spin_lock_bh include/linux/spinlock_api_smp.h:136 [inline]
> _raw_spin_lock_bh+0x3b/0x50 kernel/locking/spinlock.c:175
> spin_lock_bh include/linux/spinlock.h:343 [inline]
> release_sock+0x20/0x1c0 net/core/sock.c:2932
> wait_on_pending_writer+0x20f/0x420 net/tls/tls_main.c:91
> tls_sk_proto_cleanup+0x2c5/0x3e0 net/tls/tls_main.c:295
> tls_sk_proto_unhash+0x90/0x3f0 net/tls/tls_main.c:330
> tcp_set_state+0x5b9/0x7d0 net/ipv4/tcp.c:2235
> tcp_done+0xe2/0x320 net/ipv4/tcp.c:3824
> tcp_reset+0x132/0x500 net/ipv4/tcp_input.c:4080
> tcp_validate_incoming+0xa2d/0x1660 net/ipv4/tcp_input.c:5440
> tcp_rcv_established+0x6b5/0x1e70 net/ipv4/tcp_input.c:5648
> tcp_v6_do_rcv+0x41e/0x12c0 net/ipv6/tcp_ipv6.c:1356
> tcp_v6_rcv+0x31f1/0x3500 net/ipv6/tcp_ipv6.c:1588
> ip6_protocol_deliver_rcu+0x2fe/0x1660 net/ipv6/ip6_input.c:397
> ip6_input_finish+0x84/0x170 net/ipv6/ip6_input.c:438
> NF_HOOK include/linux/netfilter.h:305 [inline]
> NF_HOOK include/linux/netfilter.h:299 [inline]
> ip6_input+0xe4/0x3f0 net/ipv6/ip6_input.c:447
> dst_input include/net/dst.h:442 [inline]
> ip6_rcv_finish+0x1de/0x2f0 net/ipv6/ip6_input.c:76
> NF_HOOK include/linux/netfilter.h:305 [inline]
> NF_HOOK include/linux/netfilter.h:299 [inline]
> ipv6_rcv+0x10e/0x420 net/ipv6/ip6_input.c:272
> __netif_receive_skb_one_core+0x113/0x1a0 net/core/dev.c:4999
> __netif_receive_skb+0x2c/0x1d0 net/core/dev.c:5113
> process_backlog+0x206/0x750 net/core/dev.c:5924
> napi_poll net/core/dev.c:6347 [inline]
> net_rx_action+0x508/0x10c0 net/core/dev.c:6413
> __do_softirq+0x262/0x98c kernel/softirq.c:292
> do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1080
> </IRQ>
> do_softirq.part.0+0x11a/0x170 kernel/softirq.c:337
> do_softirq kernel/softirq.c:329 [inline]
> __local_bh_enable_ip+0x211/0x270 kernel/softirq.c:189
> local_bh_enable include/linux/bottom_half.h:32 [inline]
> inet_csk_listen_stop+0x1e0/0x850 net/ipv4/inet_connection_sock.c:993
> tcp_close+0xd5b/0x10e0 net/ipv4/tcp.c:2338
> inet_release+0xed/0x200 net/ipv4/af_inet.c:427
> inet6_release+0x53/0x80 net/ipv6/af_inet6.c:470
> __sock_release+0xce/0x280 net/socket.c:590
> sock_close+0x1e/0x30 net/socket.c:1268
> __fput+0x2ff/0x890 fs/file_table.c:280
> ____fput+0x16/0x20 fs/file_table.c:313
> task_work_run+0x145/0x1c0 kernel/task_work.c:113
> tracehook_notify_resume include/linux/tracehook.h:188 [inline]
> exit_to_usermode_loop+0x316/0x380 arch/x86/entry/common.c:163
> prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline]
> syscall_return_slowpath arch/x86/entry/common.c:274 [inline]
> do_syscall_64+0x65f/0x760 arch/x86/entry/common.c:300
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x413511
> Code: 75 14 b8 03 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 04 1b 00 00 c3 48 83
> ec 08 e8 0a fc ff ff 48 89 04 24 b8 03 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2
> e8 53 fc ff ff 48 89 d0 48 83 c4 08 48 3d 01
> RSP: 002b:00007ffebfc402f0 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
> RAX: 0000000000000000 RBX: 0000000000000005 RCX: 0000000000413511
> RDX: 0000000000000000 RSI: 000000000000183f RDI: 0000000000000004
> RBP: 0000000000000001 R08: 00000000c44ab83f R09: 00000000c44ab843
> R10: 00007ffebfc403d0 R11: 0000000000000293 R12: 000000000075c9a0
> R13: 000000000075c9a0 R14: 0000000000760750 R15: ffffffffffffffff
> 
> 
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
> 
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

^ permalink raw reply

* Re: [PATCH bpf-next v5 0/6] xdp: Add devmap_hash map type
From: Jesper Dangaard Brouer @ 2019-08-08 20:05 UTC (permalink / raw)
  To: Y Song
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, Alexei Starovoitov, Network Development,
	David Miller, Jakub Kicinski, Björn Töpel,
	Yonghong Song, brouer
In-Reply-To: <CAH3MdRWk_bZVpBUZ8=xsMNw2hUwnQ3Yv-otu9M+7f1Cwr-t1UA@mail.gmail.com>

On Thu, 8 Aug 2019 12:57:05 -0700
Y Song <ys114321@gmail.com> wrote:

> On Thu, Aug 8, 2019 at 12:43 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> >  
> > > On Fri, Jul 26, 2019 at 9:06 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:  
> > >>
> > >> This series adds a new map type, devmap_hash, that works like the existing
> > >> devmap type, but using a hash-based indexing scheme. This is useful for the use
> > >> case where a devmap is indexed by ifindex (for instance for use with the routing
> > >> table lookup helper). For this use case, the regular devmap needs to be sized
> > >> after the maximum ifindex number, not the number of devices in it. A hash-based
> > >> indexing scheme makes it possible to size the map after the number of devices it
> > >> should contain instead.
> > >>
> > >> This was previously part of my patch series that also turned the regular
> > >> bpf_redirect() helper into a map-based one; for this series I just pulled out
> > >> the patches that introduced the new map type.
> > >>
> > >> Changelog:
> > >>
> > >> v5:
> > >>
> > >> - Dynamically set the number of hash buckets by rounding up max_entries to the
> > >>   nearest power of two (mirroring the regular hashmap), as suggested by Jesper.  
> > >
> > > fyi I'm waiting for Jesper to review this new version.  
> >
> > Ping Jesper? :)  
> 
> Toke, the patch set has been merged to net-next.
> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=d3406913561c322323ec2898cc58f55e79786be7
> 

Yes, and I did review this... :-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH net] net: phy: rtl8211f: do a double read to get real time link status
From: Heiner Kallweit @ 2019-08-08 20:01 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Yonglong Liu, davem, netdev, linux-kernel, linuxarm, salil.mehta,
	yisen.zhuang, shiju.jose
In-Reply-To: <20190808194049.GM27917@lunn.ch>

On 08.08.2019 21:40, Andrew Lunn wrote:
>> @@ -568,6 +568,11 @@ int phy_start_aneg(struct phy_device *phydev)
>>  	if (err < 0)
>>  		goto out_unlock;
>>  
>> +	/* The PHY may not yet have cleared aneg-completed and link-up bit
>> +	 * w/o this delay when the following read is done.
>> +	 */
>> +	usleep_range(1000, 2000);
>> +
> 
> Hi Heiner
> 
> Does 802.3 C22 say anything about this?
> 
C22 says:
"The Auto-Negotiation process shall be restarted by setting bit 0.9 to a logic one. This bit is self-
clearing, and a PHY shall return a value of one in bit 0.9 until the Auto-Negotiation process has been
initiated."

Maybe we should read bit 0.9 in genphy_update_link() after having read BMSR and report
aneg-complete and link-up as false (no matter of their current value) if 0.9 is set.

> If this PHY is broken with respect to the standard, i would prefer the
> workaround is in the PHY specific driver code, not generic core code.
> 
Based on the C22 statement above the PHY may not be broken and the typical time between
two MDIO accesses is sufficient for the PHY to clear the bits. I think of MDIO bus access
functions in network chips that have a 10us-20us delay after each MDIO access.
On HNS3 this may not be the case.

> 	   Andrew
> 
Heiner

^ permalink raw reply

* Re: [PATCH bpf-next v5 0/6] xdp: Add devmap_hash map type
From: Y Song @ 2019-08-08 19:57 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Daniel Borkmann, Alexei Starovoitov,
	Network Development, David Miller, Jesper Dangaard Brouer,
	Jakub Kicinski, Björn Töpel, Yonghong Song
In-Reply-To: <87k1bnsbds.fsf@toke.dk>

On Thu, Aug 8, 2019 at 12:43 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Fri, Jul 26, 2019 at 9:06 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> This series adds a new map type, devmap_hash, that works like the existing
> >> devmap type, but using a hash-based indexing scheme. This is useful for the use
> >> case where a devmap is indexed by ifindex (for instance for use with the routing
> >> table lookup helper). For this use case, the regular devmap needs to be sized
> >> after the maximum ifindex number, not the number of devices in it. A hash-based
> >> indexing scheme makes it possible to size the map after the number of devices it
> >> should contain instead.
> >>
> >> This was previously part of my patch series that also turned the regular
> >> bpf_redirect() helper into a map-based one; for this series I just pulled out
> >> the patches that introduced the new map type.
> >>
> >> Changelog:
> >>
> >> v5:
> >>
> >> - Dynamically set the number of hash buckets by rounding up max_entries to the
> >>   nearest power of two (mirroring the regular hashmap), as suggested by Jesper.
> >
> > fyi I'm waiting for Jesper to review this new version.
>
> Ping Jesper? :)

Toke, the patch set has been merged to net-next.
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=d3406913561c322323ec2898cc58f55e79786be7

>
> -Toke

^ permalink raw reply

* Re: [v3,2/4] tools: bpftool: add net detach command to detach XDP on interface
From: Y Song @ 2019-08-08 19:52 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Daniel T. Lee, Daniel Borkmann, Alexei Starovoitov, netdev,
	Jakub Kicinski
In-Reply-To: <20190807223807.00002740@gmail.com>

On Wed, Aug 7, 2019 at 1:40 PM Maciej Fijalkowski
<maciejromanfijalkowski@gmail.com> wrote:
>
> On Wed, 7 Aug 2019 13:12:17 -0700
> Y Song <ys114321@gmail.com> wrote:
>
> > On Wed, Aug 7, 2019 at 11:30 AM Maciej Fijalkowski
> > <maciejromanfijalkowski@gmail.com> wrote:
> > >
> > > On Wed, 7 Aug 2019 10:02:04 -0700
> > > Y Song <ys114321@gmail.com> wrote:
> > >
> > > > On Tue, Aug 6, 2019 at 7:25 PM Daniel T. Lee <danieltimlee@gmail.com> wrote:
> > > > >
> > > > > By this commit, using `bpftool net detach`, the attached XDP prog can
> > > > > be detached. Detaching the BPF prog will be done through libbpf
> > > > > 'bpf_set_link_xdp_fd' with the progfd set to -1.
> > > > >
> > > > > Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
> > > > > ---
> > > > >  tools/bpf/bpftool/net.c | 42 ++++++++++++++++++++++++++++++++++++++++-
> > > > >  1 file changed, 41 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c
> > > > > index c05a3fac5cac..7be96acb08e0 100644
> > > > > --- a/tools/bpf/bpftool/net.c
> > > > > +++ b/tools/bpf/bpftool/net.c
> > > > > @@ -343,6 +343,43 @@ static int do_attach(int argc, char **argv)
> > > > >         return 0;
> > > > >  }
> > > > >
> > > > > +static int do_detach(int argc, char **argv)
> > > > > +{
> > > > > +       enum net_attach_type attach_type;
> > > > > +       int progfd, ifindex, err = 0;
> > > > > +
> > > > > +       /* parse detach args */
> > > > > +       if (!REQ_ARGS(3))
> > > > > +               return -EINVAL;
> > > > > +
> > > > > +       attach_type = parse_attach_type(*argv);
> > > > > +       if (attach_type == max_net_attach_type) {
> > > > > +               p_err("invalid net attach/detach type");
> > > > > +               return -EINVAL;
> > > > > +       }
> > > > > +
> > > > > +       NEXT_ARG();
> > > > > +       ifindex = net_parse_dev(&argc, &argv);
> > > > > +       if (ifindex < 1)
> > > > > +               return -EINVAL;
> > > > > +
> > > > > +       /* detach xdp prog */
> > > > > +       progfd = -1;
> > > > > +       if (is_prefix("xdp", attach_type_strings[attach_type]))
> > > > > +               err = do_attach_detach_xdp(progfd, attach_type, ifindex, NULL);
> > > >
> > > > I found an issue here. This is probably related to do_attach_detach_xdp.
> > > >
> > > > -bash-4.4$ sudo ./bpftool net attach x pinned /sys/fs/bpf/xdp_example dev v1
> > > > -bash-4.4$ sudo ./bpftool net
> > > > xdp:
> > > > v1(4) driver id 1172
> > > >
> > > > tc:
> > > > eth0(2) clsact/ingress fbflow_icmp id 29 act []
> > > > eth0(2) clsact/egress cls_fg_dscp_section id 27 act []
> > > > eth0(2) clsact/egress fbflow_egress id 28
> > > > eth0(2) clsact/egress fbflow_sslwall_egress id 35
> > > >
> > > > flow_dissector:
> > > >
> > > > -bash-4.4$ sudo ./bpftool net detach x dev v2
> > >
> > > Shouldn't this be v1 as dev?
> >
> > I am testing a scenario where with wrong devname
> > we did not return an error.
>
> Ah ok. In this scenario if driver has a native xdp support we would be invoking
> its ndo_bpf even if there's no prog currently attached and it wouldn't return
> error value.
>
> Looking at dev_xdp_uninstall, setting driver's prog to NULL is being done only
> when prog is attached. Maybe we should consider querying the driver in
> dev_change_xdp_fd regardless of passed fd value? E.g. don't query only when
> prog >= 0.
>
> I don't recall whether this was brought up previously.

Thanks for explanation. I think return an error is better in
such error cases. Otherwise, people mistakenly write wrong
device name and they may think xdp is detached and it is
actually not.

But this probably should not prevent
this patch as it is more like a kernel issue.

>
> CCing Jakub so we have one thread.
>
> Maciej
>
> >
> > Yes, if dev "v1", it works as expected.
> >
> > >
> > > > -bash-4.4$ sudo ./bpftool net
> > > > xdp:
> > > > v1(4) driver id 1172
> > > >
> > > > tc:
> > > > eth0(2) clsact/ingress fbflow_icmp id 29 act []
> > > > eth0(2) clsact/egress cls_fg_dscp_section id 27 act []
> > > > eth0(2) clsact/egress fbflow_egress id 28
> > > > eth0(2) clsact/egress fbflow_sslwall_egress id 35
> > > >
> > > > flow_dissector:
> > > >
> > > > -bash-4.4$
> > > >
> > > > Basically detaching may fail due to wrong dev name or wrong type, etc.
> > > > But the tool did not return an error. Is this expected?
> > > > This may be related to this funciton "bpf_set_link_xdp_fd()".
> > > > So this patch itself should be okay.
> > > >
> > > > > +
> > > > > +       if (err < 0) {
> > > > > +               p_err("interface %s detach failed",
> > > > > +                     attach_type_strings[attach_type]);
> > > > > +               return err;
> > > > > +       }
> > > > > +
> > > > > +       if (json_output)
> > > > > +               jsonw_null(json_wtr);
> > > > > +
> > > > > +       return 0;
> > > > > +}
> > > > > +
> > > > >  static int do_show(int argc, char **argv)
> > > > >  {
> > > > >         struct bpf_attach_info attach_info = {};
> > > > > @@ -419,6 +456,7 @@ static int do_help(int argc, char **argv)
> > > > >         fprintf(stderr,
> > > > >                 "Usage: %s %s { show | list } [dev <devname>]\n"
> > > > >                 "       %s %s attach ATTACH_TYPE PROG dev <devname> [ overwrite ]\n"
> > > > > +               "       %s %s detach ATTACH_TYPE dev <devname>\n"
> > > > >                 "       %s %s help\n"
> > > > >                 "\n"
> > > > >                 "       " HELP_SPEC_PROGRAM "\n"
> > > > > @@ -429,7 +467,8 @@ static int do_help(int argc, char **argv)
> > > > >                 "      to dump program attachments. For program types\n"
> > > > >                 "      sk_{filter,skb,msg,reuseport} and lwt/seg6, please\n"
> > > > >                 "      consult iproute2.\n",
> > > > > -               bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2]);
> > > > > +               bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
> > > > > +               bin_name, argv[-2]);
> > > > >
> > > > >         return 0;
> > > > >  }
> > > > > @@ -438,6 +477,7 @@ static const struct cmd cmds[] = {
> > > > >         { "show",       do_show },
> > > > >         { "list",       do_show },
> > > > >         { "attach",     do_attach },
> > > > > +       { "detach",     do_detach },
> > > > >         { "help",       do_help },
> > > > >         { 0 }
> > > > >  };
> > > > > --
> > > > > 2.20.1
> > > > >
> > >
>

^ permalink raw reply

* Re: [PATCH] pcan_usb_fd: zero out the common command buffer
From: Marc Kleine-Budde @ 2019-08-08 19:52 UTC (permalink / raw)
  To: David Miller, oneukum; +Cc: netdev, wg, linux-can
In-Reply-To: <20190808.110330.1163430543960132818.davem@davemloft.net>


[-- Attachment #1.1: Type: text/plain, Size: 863 bytes --]

On 8/8/19 8:03 PM, David Miller wrote:
> From: Oliver Neukum <oneukum@suse.com>
> Date: Thu,  8 Aug 2019 11:28:25 +0200
> 
>> Lest we leak kernel memory to a device we better zero out buffers.
>>
>> Reported-by: syzbot+513e4d0985298538bf9b@syzkaller.appspotmail.com
>> Signed-off-by: Oliver Neukum <oneukum@suse.com>
> 
> Please CC: the CAN subsystem maintainers, as this is clearly listed in the
> MAINTAINERS file.

The issue is already fixed and already mainline (stable@v.k.o is on Cc):

30a8beeb3042 can: peak_usb: pcan_usb_fd: Fix info-leaks to USB devices

Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH net-next] r8169: make use of xmit_more
From: Holger Hoffstätte @ 2019-08-08 19:52 UTC (permalink / raw)
  To: Heiner Kallweit, Realtek linux nic maintainers, David Miller
  Cc: netdev@vger.kernel.org, Sander Eikelenboom, Eric Dumazet
In-Reply-To: <a19bb3de-a866-d342-7cca-020fef219d03@gmail.com>

On 8/8/19 8:17 PM, Heiner Kallweit wrote:
> On 08.08.2019 17:53, Holger Hoffstätte wrote:
>> On 8/8/19 4:37 PM, Holger Hoffstätte wrote:
>>>
>>> Hello Heiner -
>>>
>>> On 7/28/19 11:25 AM, Heiner Kallweit wrote:
>>>> There was a previous attempt to use xmit_more, but the change had to be
>>>> reverted because under load sometimes a transmit timeout occurred [0].
>>>> Maybe this was caused by a missing memory barrier, the new attempt
>>>> keeps the memory barrier before the call to netif_stop_queue like it
>>>> is used by the driver as of today. The new attempt also changes the
>>>> order of some calls as suggested by Eric.
>>>>
>>>> [0] https://lkml.org/lkml/2019/2/10/39
>>>>
>>>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
>>>
>>> I decided to take one for the team and merged this into my 5.2.x tree (just
>>> fixing up the path) and it has been working fine for the last 2 weeks in two
>>> machines..until today, when for the first time in forever some random NFS traffic
>>> made this old friend come out from under the couch:
>>>
>>> [Aug 8 14:13] ------------[ cut here ]------------
>>> [  +0.000006] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
>>> [  +0.000021] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:442 dev_watchdog+0x21f/0x230
>>> [  +0.000001] Modules linked in: lz4 lz4_compress lz4_decompress nfsd auth_rpcgss oid_registry lockd grace sunrpc sch_fq_codel btrfs xor zstd_compress raid6_pq zstd_decompress bfq jitterentropy_rng nct6775 hwmon_vid coretemp hwmon x86_pkg_temp_thermal aesni_intel aes_x86_64 i915 glue_helper crypto_simd cryptd i2c_i801 intel_gtt i2c_algo_bit iosf_mbi drm_kms_helper syscopyarea usbhid sysfillrect r8169 sysimgblt fb_sys_fops realtek drm libphy drm_panel_orientation_quirks i2c_core video backlight mq_deadline
>>> [  +0.000026] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.2.7 #1
>>> [  +0.000001] Hardware name: System manufacturer System Product Name/P8Z68-V LX, BIOS 4105 07/01/2013
>>> [  +0.000004] RIP: 0010:dev_watchdog+0x21f/0x230
>>> [  +0.000002] Code: 3b 00 75 ea eb ad 4c 89 ef c6 05 1c 45 bd 00 01 e8 66 35 fc ff 44 89 e1 4c 89 ee 48 c7 c7 e8 5e fc 81 48 89 c2 e8 90 df 92 ff <0f> 0b eb 8e 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 66 66 66 66 90
>>> [  +0.000002] RSP: 0018:ffffc90000118e68 EFLAGS: 00010286
>>> [  +0.000002] RAX: 0000000000000000 RBX: ffff8887f7837600 RCX: 0000000000000303
>>> [  +0.000001] RDX: 0000000000000001 RSI: 0000000000000092 RDI: ffffffff827a488c
>>> [  +0.000001] RBP: ffff8887f9fbc440 R08: 0000000000000303 R09: 0000000000000003
>>> [  +0.000001] R10: 000000000001004c R11: 0000000000000001 R12: 0000000000000000
>>> [  +0.000009] R13: ffff8887f9fbc000 R14: ffffffff8173aa20 R15: dead000000000200
>>> [  +0.000001] FS:  0000000000000000(0000) GS:ffff8887ff580000(0000) knlGS:0000000000000000
>>> [  +0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  +0.000001] CR2: 00007f8d1c04d000 CR3: 0000000002209001 CR4: 00000000000606e0
>>> [  +0.000000] Call Trace:
>>> [  +0.000002]  <IRQ>
>>> [  +0.000005]  call_timer_fn+0x2b/0x120
>>> [  +0.000002]  expire_timers+0xa4/0x100
>>> [  +0.000001]  run_timer_softirq+0x8c/0x170
>>> [  +0.000002]  ? __hrtimer_run_queues+0x13a/0x290
>>> [  +0.000003]  ? sched_clock_cpu+0xe/0x130
>>> [  +0.000003]  __do_softirq+0xeb/0x2de
>>> [  +0.000003]  irq_exit+0x9d/0xe0
>>> [  +0.000002]  smp_apic_timer_interrupt+0x60/0x110
>>> [  +0.000003]  apic_timer_interrupt+0xf/0x20
>>> [  +0.000001]  </IRQ>
>>> [  +0.000003] RIP: 0010:cpuidle_enter_state+0xad/0x930
>>> [  +0.000001] Code: c5 66 66 66 66 90 31 ff e8 90 99 9e ff 80 7c 24 0b 00 74 12 9c 58 f6 c4 02 0f 85 39 08 00 00 31 ff e8 e7 26 a2 ff fb 45 85 e4 <0f> 88 34 02 00 00 49 63 cc 4c 2b 2c 24 48 8d 04 49 48 c1 e0 05 8b
>>> [  +0.000000] RSP: 0018:ffffc9000008be50 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
>>> [  +0.000001] RAX: ffff8887ff5a9180 RBX: ffffffff822b6c40 RCX: 000000000000001f
>>> [  +0.000001] RDX: 0000000000000000 RSI: 0000000033087154 RDI: 0000000000000000
>>> [  +0.000001] RBP: ffff8887ff5b1310 R08: 000030d021fae397 R09: ffff8887ff59c8c0
>>> [  +0.000000] R10: ffff8887ff59c8c0 R11: 0000000000000006 R12: 0000000000000004
>>> [  +0.000001] R13: 000030d021fae397 R14: 0000000000000004 R15: ffff8887fc281600
>>> [  +0.000001]  cpuidle_enter+0x29/0x40
>>> [  +0.000002]  do_idle+0x1e5/0x280
>>> [  +0.000001]  cpu_startup_entry+0x19/0x20
>>> [  +0.000002]  start_secondary+0x186/0x1c0
>>> [  +0.000001]  secondary_startup_64+0xa4/0xb0
>>> [  +0.000001] ---[ end trace 99493c768580f4fd ]---
>>>
>>> The device is:
>>>
>>> Aug  7 23:19:09 tux kernel: libphy: r8169: probed
>>> Aug  7 23:19:09 tux kernel: r8169 0000:04:00.0 eth0: RTL8168evl/8111evl, c8:60:00:68:33:cc, XID 2c9, IRQ 36
>>> Aug  7 23:19:09 tux kernel: r8169 0000:04:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
>>> Aug  7 23:19:12 tux kernel: RTL8211E Gigabit Ethernet r8169-400:00: attached PHY driver [RTL8211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-400:00, irq=IGNORE)
>>> Aug  7 23:19:13 tux kernel: r8169 0000:04:00.0 eth0: No native access to PCI extended config space, falling back to CSI
>>>
>>> and using fq_codel, of course.
>>>
>>> This cpuidle hiccup used to be completely gone without xmit_more and this was
>>> the first (and so far only) time since merging it (regardless of load).
>>> Also, while I'm using BMQ as CPU scheduler, that hasn't made a difference for
>>> this particular problem in the past (with MuQSS/PDS) either; way back when I had
>>> Eric's previous attempt(s) it also hiccupped with CFS.
>>>
>>> Revert or wait for more reports when -next is merged in 5.4?
>>
>> Another question/data point: I've had the whole basket of offloads activated:
>>
>>    ethtool --offload eth0 rx on tx on gro on gso on sg on tso on
>>
>> and this caused zero problems without the xmit_more patch. However I just saw
>> that net-next has a patch where TSO is disabled due to a known HW defect in
>> RTL8168evl, which is of course what I have. Could this be the reason for the
>> stall/hiccup when xmit_more has its fingers in the pie? I kind of know what
>> xmit_more does, just not how it could interact with a possibly broken TSO that
>> nevertheless seems to work fine otherwise..
>>
> 
> I was about to ask exactly that, whether you have TSO enabled. I don't know what
> can trigger the HW issue, it was just confirmed by Realtek that this chip version
> has a problem with TSO. So the logical conclusion is: test w/o TSO, ideally the
> linux-next version.

So disabling TSO alone didn't work - it leads to reduced throughout (~70 MB/s in iperf).
Instead I decided to backport 93681cd7d94f ("r8169: enable HW csum and TSO"), which
wasn't easy due to cleanups/renamings of dependencies, but I managed to backport
it and .. got the same problem of reduced throughout. wat?!

After lots of trial & error I started disabling all offloads and finally found
that sg (Scatter-Gather) enabled alone - without TSO - will lead to the throughput
drop. So the culprit seems 93681cd7d94f, which disabled TSO on my NIC, but left
sg on by default. This weas repeatable - switch on sg, throughput drop; turn it
off - smooth sailing, now with reduced buffers.

I modified the relevant bits to disable tso & sg like this:

	/* RTL8168e-vl has a HW issue with TSO */
	if (tp->mac_version == RTL_GIGA_MAC_VER_34) {
+		dev->vlan_features &= ~(NETIF_F_ALL_TSO|NETIF_F_SG);
+		dev->hw_features &= ~(NETIF_F_ALL_TSO|NETIF_F_SG);
+		dev->features &= ~(NETIF_F_ALL_TSO|NETIF_F_SG);
	}

This seems to work since it restores performance without sg/tso by default
and without any additional offloads, yet with xmit_more in the mix.
We'll see whether that is stable over the next few days, but I strongly
suspect it will be good and that the hiccups were due to xmit_more/TSO
interaction.

thanks,
Holger

^ permalink raw reply

* Re: [PATCH net-next 3/3] net: phy: realtek: add support for the 2.5Gbps PHY in RTL8125
From: Heiner Kallweit @ 2019-08-08 19:46 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Florian Fainelli, David Miller, netdev@vger.kernel.org
In-Reply-To: <20190808193743.GL27917@lunn.ch>

On 08.08.2019 21:37, Andrew Lunn wrote:
> On Thu, Aug 08, 2019 at 09:05:54PM +0200, Heiner Kallweit wrote:
>> This adds support for the integrated 2.5Gbps PHY in Realtek RTL8125.
>> Advertisement of 2.5Gbps mode is done via a vendor-specific register.
>> Same applies to reading NBase-T link partner advertisement.
>> Unfortunately this 2.5Gbps PHY shares the PHY ID with the integrated
>> 1Gbps PHY's in other Realtek network chips and so far no method is
>> known to differentiate them.
> 
> That is not nice.
> 
Indeed.

> Do you have any contacts in Realtek who can provide us with
> information? Maybe there is another undocumented vendor specific
> register?
> 
I have a contact in Realtek who provided the information about
the vendor-specific registers used in the patch. I also asked for
a method to auto-detect 2.5Gbps support but have no feedback so far.
What may contribute to the problem is that also the integrated 1Gbps
PHY's (all with the same PHY ID) differ significantly from each other,
depending on the network chip version.

> 	Andrew
> 
Heiner

^ permalink raw reply

* Re: [bpf-next v3 PATCH 2/3] samples/bpf: make xdp_fwd more practically usable via devmap lookup
From: Y Song @ 2019-08-08 19:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, Daniel Borkmann, Alexei Starovoitov, a.s.protopopov,
	David Ahern, Toke Høiland-Jørgensen
In-Reply-To: <156528106270.22124.2563148023961869582.stgit@firesoul>

On Thu, Aug 8, 2019 at 9:17 AM Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
> This address the TODO in samples/bpf/xdp_fwd_kern.c, which points out
> that the chosen egress index should be checked for existence in the
> devmap. This can now be done via taking advantage of Toke's work in
> commit 0cdbb4b09a06 ("devmap: Allow map lookups from eBPF").
>
> This change makes xdp_fwd more practically usable, as this allows for
> a mixed environment, where IP-forwarding fallback to network stack, if
> the egress device isn't configured to use XDP.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Reviewed-by: David Ahern <dsahern@gmail.com>

Acked-by: Yonghong Song <yhs@fb.com>

> ---
>  samples/bpf/xdp_fwd_kern.c |   17 +++++++++++------
>  samples/bpf/xdp_fwd_user.c |   33 ++++++++++++++++++++++-----------
>  2 files changed, 33 insertions(+), 17 deletions(-)
>
> diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c
> index e6ffc4ea06f4..a43d6953c054 100644
> --- a/samples/bpf/xdp_fwd_kern.c
> +++ b/samples/bpf/xdp_fwd_kern.c
> @@ -104,13 +104,18 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
>
>         rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags);
>
> -       /* verify egress index has xdp support
> -        * TO-DO bpf_map_lookup_elem(&tx_port, &key) fails with
> -        *       cannot pass map_type 14 into func bpf_map_lookup_elem#1:
> -        * NOTE: without verification that egress index supports XDP
> -        *       forwarding packets are dropped.
> -        */
>         if (rc == 0) {
> +               /* Verify egress index has been configured as TX-port.
> +                * (Note: User can still have inserted an egress ifindex that
> +                * doesn't support XDP xmit, which will result in packet drops).
> +                *
> +                * Note: lookup in devmap supported since 0cdbb4b09a0.
> +                * If not supported will fail with:
> +                *  cannot pass map_type 14 into func bpf_map_lookup_elem#1:
> +                */
> +               if (!bpf_map_lookup_elem(&xdp_tx_ports, &fib_params.ifindex))
> +                       return XDP_PASS;
> +
>                 if (h_proto == htons(ETH_P_IP))
>                         ip_decrease_ttl(iph);
>                 else if (h_proto == htons(ETH_P_IPV6))
> diff --git a/samples/bpf/xdp_fwd_user.c b/samples/bpf/xdp_fwd_user.c
> index ba012d9f93dd..97ff1dad7669 100644
> --- a/samples/bpf/xdp_fwd_user.c
> +++ b/samples/bpf/xdp_fwd_user.c
> @@ -27,14 +27,20 @@
>  #include "libbpf.h"
>  #include <bpf/bpf.h>
>
> -
> -static int do_attach(int idx, int fd, const char *name)
> +static int do_attach(int idx, int prog_fd, int map_fd, const char *name)
>  {
>         int err;
>
> -       err = bpf_set_link_xdp_fd(idx, fd, 0);
> -       if (err < 0)
> +       err = bpf_set_link_xdp_fd(idx, prog_fd, 0);
> +       if (err < 0) {
>                 printf("ERROR: failed to attach program to %s\n", name);
> +               return err;
> +       }
> +
> +       /* Adding ifindex as a possible egress TX port */
> +       err = bpf_map_update_elem(map_fd, &idx, &idx, 0);
> +       if (err)
> +               printf("ERROR: failed using device %s as TX-port\n", name);
>
>         return err;
>  }
> @@ -47,6 +53,9 @@ static int do_detach(int idx, const char *name)
>         if (err < 0)
>                 printf("ERROR: failed to detach program from %s\n", name);
>
> +       /* TODO: Remember to cleanup map, when adding use of shared map
> +        *  bpf_map_delete_elem((map_fd, &idx);
> +        */
>         return err;
>  }
>
> @@ -67,10 +76,10 @@ int main(int argc, char **argv)
>         };
>         const char *prog_name = "xdp_fwd";
>         struct bpf_program *prog;
> +       int prog_fd, map_fd = -1;
>         char filename[PATH_MAX];
>         struct bpf_object *obj;
>         int opt, i, idx, err;
> -       int prog_fd, map_fd;
>         int attach = 1;
>         int ret = 0;
>
> @@ -103,8 +112,14 @@ int main(int argc, char **argv)
>                         return 1;
>                 }
>
> -               if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
> +               err = bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd);
> +               if (err) {
> +                       printf("Does kernel support devmap lookup?\n");
> +                       /* If not, the error message will be:
> +                        *  "cannot pass map_type 14 into func bpf_map_lookup_elem#1"
> +                        */
>                         return 1;
> +               }
>
>                 prog = bpf_object__find_program_by_title(obj, prog_name);
>                 prog_fd = bpf_program__fd(prog);
> @@ -119,10 +134,6 @@ int main(int argc, char **argv)
>                         return 1;
>                 }
>         }
> -       if (attach) {
> -               for (i = 1; i < 64; ++i)
> -                       bpf_map_update_elem(map_fd, &i, &i, 0);
> -       }
>
>         for (i = optind; i < argc; ++i) {
>                 idx = if_nametoindex(argv[i]);
> @@ -138,7 +149,7 @@ int main(int argc, char **argv)
>                         if (err)
>                                 ret = err;
>                 } else {
> -                       err = do_attach(idx, prog_fd, argv[i]);
> +                       err = do_attach(idx, prog_fd, map_fd, argv[i]);
>                         if (err)
>                                 ret = err;
>                 }
>

^ permalink raw reply

* Re: [PATCH v5 bpf-next] BPF: helpers: New helper to obtain namespace data from current task
From: Y Song @ 2019-08-08 19:44 UTC (permalink / raw)
  To: Carlos Antonio Neira Bustos
  Cc: Yonghong Song, netdev@vger.kernel.org, ebiederm@xmission.com,
	brouer@redhat.com, quentin.monnet@netronome.com
In-Reply-To: <20190808174848.poybtaagg5ctle7t@dev00>

On Thu, Aug 8, 2019 at 10:52 AM Carlos Antonio Neira Bustos
<cneirabustos@gmail.com> wrote:
>
> Yonghong,
>
> I have modified the patch following your feedback.
> Let me know if I'm missing something.

Yes, I have some other requests about formating.
https://lore.kernel.org/netdev/20190808174848.poybtaagg5ctle7t@dev00/T/#t
Could you address it as well?

>
> Bests
>
> From 70f8d5584700c9cfc82c006901d8ee9595c53f15 Mon Sep 17 00:00:00 2001
> From: Carlos <cneirabustos@gmail.com>
> Date: Wed, 7 Aug 2019 20:04:30 -0400
> Subject: [PATCH] [PATCH v6 bpf-next] BPF: New helper to obtain namespace data
>  from current task
>
> This helper obtains the active namespace from current and returns pid, tgid,
> device and namespace id as seen from that namespace, allowing to instrument
> a process inside a container.
> Device is read from /proc/self/ns/pid, as in the future it's possible that
> different pid_ns files may belong to different devices, according
> to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
> conference.
> Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
> scripts but this helper returns the pid as seen by the root namespace which is
> fine when a bcc script is not executed inside a container.
> When the process of interest is inside a container, pid filtering will not work
> if bpf_get_current_pid_tgid() is used. This helper addresses this limitation
> returning the pid as it's seen by the current namespace where the script is
> executing.
>
> This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
> used to do pid filtering even inside a container.
>
> For example a bcc script using bpf_get_current_pid_tgid() (tools/funccount.py):
>
>         u32 pid = bpf_get_current_pid_tgid() >> 32;
>         if (pid != <pid_arg_passed_in>)
>                 return 0;
> Could be modified to use bpf_get_current_pidns_info() as follows:
>
>         struct bpf_pidns pidns;
>         bpf_get_current_pidns_info(&pidns, sizeof(struct bpf_pidns));
>         u32 pid = pidns.tgid;
>         u32 nsid = pidns.nsid;
>         if ((pid != <pid_arg_passed_in>) && (nsid != <nsid_arg_passed_in>))
>                 return 0;
>
> To find out the name PID namespace id of a process, you could use this command:
>
> $ ps -h -o pidns -p <pid_of_interest>
>
> Or this other command:
>
> $ ls -Li /proc/<pid_of_interest>/ns/pid
>
> Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
> ---
>  fs/internal.h                                      |   2 -
>  fs/namei.c                                         |   1 -
>  include/linux/bpf.h                                |   1 +
>  include/linux/namei.h                              |   4 +
>  include/uapi/linux/bpf.h                           |  27 +++-
>  kernel/bpf/core.c                                  |   1 +
>  kernel/bpf/helpers.c                               |  64 ++++++++++
>  kernel/trace/bpf_trace.c                           |   2 +
>  samples/bpf/Makefile                               |   3 +
>  samples/bpf/trace_ns_info_user.c                   |  35 ++++++
>  samples/bpf/trace_ns_info_user_kern.c              |  44 +++++++
>  tools/include/uapi/linux/bpf.h                     |  27 +++-
>  tools/testing/selftests/bpf/Makefile               |   2 +-
>  tools/testing/selftests/bpf/bpf_helpers.h          |   3 +
>  .../testing/selftests/bpf/progs/test_pidns_kern.c  |  51 ++++++++
>  tools/testing/selftests/bpf/test_pidns.c           | 138 +++++++++++++++++++++
>  16 files changed, 399 insertions(+), 6 deletions(-)
>  create mode 100644 samples/bpf/trace_ns_info_user.c
>  create mode 100644 samples/bpf/trace_ns_info_user_kern.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_pidns_kern.c
>  create mode 100644 tools/testing/selftests/bpf/test_pidns.c
>
> diff --git a/fs/internal.h b/fs/internal.h
> index 315fcd8d237c..6647e15dd419 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -59,8 +59,6 @@ extern int finish_clean_context(struct fs_context *fc);
>  /*
>   * namei.c
>   */
> -extern int filename_lookup(int dfd, struct filename *name, unsigned flags,
> -                          struct path *path, struct path *root);
>  extern int user_path_mountpoint_at(int, const char __user *, unsigned int, struct path *);
>  extern int vfs_path_lookup(struct dentry *, struct vfsmount *,
>                            const char *, unsigned int, struct path *);
> diff --git a/fs/namei.c b/fs/namei.c
> index 209c51a5226c..a89fc72a4a10 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -19,7 +19,6 @@
>  #include <linux/export.h>
>  #include <linux/kernel.h>
>  #include <linux/slab.h>
> -#include <linux/fs.h>
>  #include <linux/namei.h>
>  #include <linux/pagemap.h>
>  #include <linux/fsnotify.h>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index f9a506147c8a..e4adf5e05afd 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1050,6 +1050,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
>  extern const struct bpf_func_proto bpf_strtol_proto;
>  extern const struct bpf_func_proto bpf_strtoul_proto;
>  extern const struct bpf_func_proto bpf_tcp_sock_proto;
> +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
>
>  /* Shared helpers among cBPF and eBPF. */
>  void bpf_user_rnd_init_once(void);
> diff --git a/include/linux/namei.h b/include/linux/namei.h
> index 9138b4471dbf..b45c8b6f7cb4 100644
> --- a/include/linux/namei.h
> +++ b/include/linux/namei.h
> @@ -6,6 +6,7 @@
>  #include <linux/path.h>
>  #include <linux/fcntl.h>
>  #include <linux/errno.h>
> +#include <linux/fs.h>
>
>  enum { MAX_NESTED_LINKS = 8 };
>
> @@ -97,6 +98,9 @@ extern void unlock_rename(struct dentry *, struct dentry *);
>
>  extern void nd_jump_link(struct path *path);
>
> +extern int filename_lookup(int dfd, struct filename *name, unsigned flags,
> +                          struct path *path, struct path *root);
> +
>  static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
>  {
>         ((char *) name)[min(len, maxlen)] = '\0';
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 4393bd4b2419..b0d4869fb860 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2741,6 +2741,24 @@ union bpf_attr {
>   *             **-EOPNOTSUPP** kernel configuration does not enable SYN cookies
>   *
>   *             **-EPROTONOSUPPORT** IP packet version is not 4 or 6
> + *
> + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
> + *     Description
> + *             Copies into *pidns* pid, namespace id and tgid as seen by the
> + *             current namespace and also device from /proc/self/ns/pid.
> + *             *size_of_pidns* must be the size of *pidns*
> + *
> + *             This helper is used when pid filtering is needed inside a
> + *             container as bpf_get_current_tgid() helper returns always the
> + *             pid id as seen by the root namespace.
> + *     Return
> + *             0 on success
> + *
> + *             **-EINVAL** if *size_of_pidns* is not valid or unable to get ns, pid
> + *             or tgid of the current task.
> + *
> + *             **-ENOMEM**  if allocation fails.
> + *
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -2853,7 +2871,8 @@ union bpf_attr {
>         FN(sk_storage_get),             \
>         FN(sk_storage_delete),          \
>         FN(send_signal),                \
> -       FN(tcp_gen_syncookie),
> +       FN(tcp_gen_syncookie),          \
> +       FN(get_current_pidns_info),
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -3604,4 +3623,10 @@ struct bpf_sockopt {
>         __s32   retval;
>  };
>
> +struct bpf_pidns_info {
> +       __u32 dev;
> +       __u32 nsid;
> +       __u32 tgid;
> +       __u32 pid;
> +};
>  #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 8191a7db2777..3159f2a0188c 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -2038,6 +2038,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
>  const struct bpf_func_proto bpf_get_current_comm_proto __weak;
>  const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
>  const struct bpf_func_proto bpf_get_local_storage_proto __weak;
> +const struct bpf_func_proto bpf_get_current_pidns_info __weak;
>
>  const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
>  {
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 5e28718928ca..41fbf1f28a48 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -11,6 +11,12 @@
>  #include <linux/uidgid.h>
>  #include <linux/filter.h>
>  #include <linux/ctype.h>
> +#include <linux/pid_namespace.h>
> +#include <linux/major.h>
> +#include <linux/stat.h>
> +#include <linux/namei.h>
> +#include <linux/version.h>
> +
>
>  #include "../../lib/kstrtox.h"
>
> @@ -312,6 +318,64 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
>         preempt_enable();
>  }
>
> +BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
> +        size)
> +{
> +       const char *pidns_path = "/proc/self/ns/pid";
> +       struct pid_namespace *pidns = NULL;
> +       struct filename *tmp = NULL;
> +       struct inode *inode;
> +       struct path kp;
> +       pid_t tgid = 0;
> +       pid_t pid = 0;
> +       int ret;
> +       int len;
> +
> +       if (unlikely(size != sizeof(struct bpf_pidns_info)))
> +               return -EINVAL;
> +       pidns = task_active_pid_ns(current);
> +       if (unlikely(!pidns))
> +               goto clear;
> +       pidns_info->nsid =  pidns->ns.inum;
> +       pid = task_pid_nr_ns(current, pidns);
> +       if (unlikely(!pid))
> +               goto clear;
> +       tgid = task_tgid_nr_ns(current, pidns);
> +       if (unlikely(!tgid))
> +               goto clear;
> +       pidns_info->tgid = (u32) tgid;
> +       pidns_info->pid = (u32) pid;
> +       tmp = kmem_cache_alloc(names_cachep, GFP_ATOMIC);
> +       if (unlikely(!tmp)) {
> +               memset((void *)pidns_info, 0, (size_t) size);
> +               return -ENOMEM;
> +       }
> +       len = strlen(pidns_path) + 1;
> +       memcpy((char *)tmp->name, pidns_path, len);
> +       tmp->uptr = NULL;
> +       tmp->aname = NULL;
> +       tmp->refcnt = 1;
> +       ret = filename_lookup(AT_FDCWD, tmp, 0, &kp, NULL);
> +       if (ret) {
> +               memset((void *)pidns_info, 0, (size_t) size);
> +               return ret;
> +       }
> +       inode = d_backing_inode(kp.dentry);
> +       pidns_info->dev = inode->i_sb->s_dev;
> +       return 0;
> +clear:
> +       memset((void *)pidns_info, 0, (size_t) size);
> +       return -EINVAL;
> +}
> +
> +const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
> +       .func           = bpf_get_current_pidns_info,
> +       .gpl_only       = false,
> +       .ret_type       = RET_INTEGER,
> +       .arg1_type      = ARG_PTR_TO_UNINIT_MEM,
> +       .arg2_type      = ARG_CONST_SIZE,
> +};
> +
>  #ifdef CONFIG_CGROUPS
>  BPF_CALL_0(bpf_get_current_cgroup_id)
>  {
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index ca1255d14576..5e1dc22765a5 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -709,6 +709,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  #endif
>         case BPF_FUNC_send_signal:
>                 return &bpf_send_signal_proto;
> +       case BPF_FUNC_get_current_pidns_info:
> +               return &bpf_get_current_pidns_info_proto;
>         default:
>                 return NULL;
>         }
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index 1d9be26b4edd..238453ff27d2 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -53,6 +53,7 @@ hostprogs-y += task_fd_query
>  hostprogs-y += xdp_sample_pkts
>  hostprogs-y += ibumad
>  hostprogs-y += hbm
> +hostprogs-y += trace_ns_info
>
>  # Libbpf dependencies
>  LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
> @@ -109,6 +110,7 @@ task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
>  xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
>  ibumad-objs := bpf_load.o ibumad_user.o $(TRACE_HELPERS)
>  hbm-objs := bpf_load.o hbm.o $(CGROUP_HELPERS)
> +trace_ns_info-objs := bpf_load.o trace_ns_info_user.o
>
>  # Tell kbuild to always build the programs
>  always := $(hostprogs-y)
> @@ -170,6 +172,7 @@ always += xdp_sample_pkts_kern.o
>  always += ibumad_kern.o
>  always += hbm_out_kern.o
>  always += hbm_edt_kern.o
> +always += trace_ns_info_user_kern.o
>
>  KBUILD_HOSTCFLAGS += -I$(objtree)/usr/include
>  KBUILD_HOSTCFLAGS += -I$(srctree)/tools/lib/bpf/
> diff --git a/samples/bpf/trace_ns_info_user.c b/samples/bpf/trace_ns_info_user.c
> new file mode 100644
> index 000000000000..e06d08db6f30
> --- /dev/null
> +++ b/samples/bpf/trace_ns_info_user.c
> @@ -0,0 +1,35 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +
> +#include <stdio.h>
> +#include <linux/bpf.h>
> +#include <unistd.h>
> +#include "bpf/libbpf.h"
> +#include "bpf_load.h"
> +
> +/* This code was taken verbatim from tracex1_user.c, it's used
> + * to exercize bpf_get_current_pidns_info() helper call.
> + */
> +int main(int ac, char **argv)
> +{
> +       FILE *f;
> +       char filename[256];
> +
> +       snprintf(filename, sizeof(filename), "%s_user_kern.o", argv[0]);
> +       printf("loading %s\n", filename);
> +
> +       if (load_bpf_file(filename)) {
> +               printf("%s", bpf_log_buf);
> +               return 1;
> +       }
> +
> +       f = popen("taskset 1 ping  localhost", "r");
> +       (void) f;
> +       read_trace_pipe();
> +       return 0;
> +}
> diff --git a/samples/bpf/trace_ns_info_user_kern.c b/samples/bpf/trace_ns_info_user_kern.c
> new file mode 100644
> index 000000000000..96675e02b707
> --- /dev/null
> +++ b/samples/bpf/trace_ns_info_user_kern.c
> @@ -0,0 +1,44 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +#include <linux/skbuff.h>
> +#include <linux/netdevice.h>
> +#include <linux/version.h>
> +#include <uapi/linux/bpf.h>
> +#include "bpf_helpers.h"
> +
> +typedef __u64 u64;
> +typedef __u32 u32;
> +
> +
> +/* kprobe is NOT a stable ABI
> + * kernel functions can be removed, renamed or completely change semantics.
> + * Number of arguments and their positions can change, etc.
> + * In such case this bpf+kprobe example will no longer be meaningful
> + */
> +
> +/* This will call bpf_get_current_pidns_info() to display pid and ns values
> + * as seen by the current namespace, on the far left you will see the pid as
> + * seen as by the root namespace.
> + */
> +
> +SEC("kprobe/__netif_receive_skb_core")
> +int bpf_prog1(struct pt_regs *ctx)
> +{
> +       char fmt[] = "nsid:%u, dev: %u,  pid:%u\n";
> +       struct bpf_pidns_info nsinfo;
> +       int ok = 0;
> +
> +       ok = bpf_get_current_pidns_info(&nsinfo, sizeof(nsinfo));
> +       if (ok == 0)
> +               bpf_trace_printk(fmt, sizeof(fmt), (u32)nsinfo.nsid,
> +                                (u32) nsinfo.dev, (u32)nsinfo.pid);
> +
> +       return 0;
> +}
> +char _license[] SEC("license") = "GPL";
> +u32 _version SEC("version") = LINUX_VERSION_CODE;
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 4393bd4b2419..b0d4869fb860 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -2741,6 +2741,24 @@ union bpf_attr {
>   *             **-EOPNOTSUPP** kernel configuration does not enable SYN cookies
>   *
>   *             **-EPROTONOSUPPORT** IP packet version is not 4 or 6
> + *
> + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
> + *     Description
> + *             Copies into *pidns* pid, namespace id and tgid as seen by the
> + *             current namespace and also device from /proc/self/ns/pid.
> + *             *size_of_pidns* must be the size of *pidns*
> + *
> + *             This helper is used when pid filtering is needed inside a
> + *             container as bpf_get_current_tgid() helper returns always the
> + *             pid id as seen by the root namespace.
> + *     Return
> + *             0 on success
> + *
> + *             **-EINVAL** if *size_of_pidns* is not valid or unable to get ns, pid
> + *             or tgid of the current task.
> + *
> + *             **-ENOMEM**  if allocation fails.
> + *
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -2853,7 +2871,8 @@ union bpf_attr {
>         FN(sk_storage_get),             \
>         FN(sk_storage_delete),          \
>         FN(send_signal),                \
> -       FN(tcp_gen_syncookie),
> +       FN(tcp_gen_syncookie),          \
> +       FN(get_current_pidns_info),
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -3604,4 +3623,10 @@ struct bpf_sockopt {
>         __s32   retval;
>  };
>
> +struct bpf_pidns_info {
> +       __u32 dev;
> +       __u32 nsid;
> +       __u32 tgid;
> +       __u32 pid;
> +};
>  #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 3bd0f4a0336a..1f97b571b581 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -29,7 +29,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
>         test_cgroup_storage test_select_reuseport test_section_names \
>         test_netcnt test_tcpnotify_user test_sock_fields test_sysctl test_hashmap \
>         test_btf_dump test_cgroup_attach xdping test_sockopt test_sockopt_sk \
> -       test_sockopt_multi test_tcp_rtt
> +       test_sockopt_multi test_tcp_rtt test_pidns
>
>  BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
>  TEST_GEN_FILES = $(BPF_OBJ_FILES)
> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
> index 120aa86c58d3..c96795a9d983 100644
> --- a/tools/testing/selftests/bpf/bpf_helpers.h
> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
> @@ -231,6 +231,9 @@ static int (*bpf_send_signal)(unsigned sig) = (void *)BPF_FUNC_send_signal;
>  static long long (*bpf_tcp_gen_syncookie)(struct bpf_sock *sk, void *ip,
>                                           int ip_len, void *tcp, int tcp_len) =
>         (void *) BPF_FUNC_tcp_gen_syncookie;
> +static int (*bpf_get_current_pidns_info)(struct bpf_pidns_info *buf,
> +                                        unsigned int buf_size) =
> +       (void *) BPF_FUNC_get_current_pidns_info;
>
>  /* llvm builtin functions that eBPF C program may use to
>   * emit BPF_LD_ABS and BPF_LD_IND instructions
> diff --git a/tools/testing/selftests/bpf/progs/test_pidns_kern.c b/tools/testing/selftests/bpf/progs/test_pidns_kern.c
> new file mode 100644
> index 000000000000..e1d2facfa762
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_pidns_kern.c
> @@ -0,0 +1,51 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +
> +#include <linux/bpf.h>
> +#include <errno.h>
> +#include "bpf_helpers.h"
> +
> +struct bpf_map_def SEC("maps") nsidmap = {
> +       .type = BPF_MAP_TYPE_ARRAY,
> +       .key_size = sizeof(__u32),
> +       .value_size = sizeof(__u32),
> +       .max_entries = 1,
> +};
> +
> +struct bpf_map_def SEC("maps") pidmap = {
> +       .type = BPF_MAP_TYPE_ARRAY,
> +       .key_size = sizeof(__u32),
> +       .value_size = sizeof(__u32),
> +       .max_entries = 1,
> +};
> +
> +SEC("tracepoint/syscalls/sys_enter_nanosleep")
> +int trace(void *ctx)
> +{
> +       struct bpf_pidns_info nsinfo;
> +       __u32 key = 0, *expected_pid, *val;
> +       char fmt[] = "ERROR nspid:%d\n";
> +
> +       if (bpf_get_current_pidns_info(&nsinfo, sizeof(nsinfo)))
> +               return -EINVAL;
> +
> +       expected_pid = bpf_map_lookup_elem(&pidmap, &key);
> +
> +
> +       if (!expected_pid || *expected_pid != nsinfo.pid)
> +               return 0;
> +
> +       val = bpf_map_lookup_elem(&nsidmap, &key);
> +       if (val)
> +               *val = nsinfo.nsid;
> +
> +       return 0;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> +__u32 _version SEC("version") = 1;
> diff --git a/tools/testing/selftests/bpf/test_pidns.c b/tools/testing/selftests/bpf/test_pidns.c
> new file mode 100644
> index 000000000000..a7254055f294
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/test_pidns.c
> @@ -0,0 +1,138 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <syscall.h>
> +#include <unistd.h>
> +#include <linux/perf_event.h>
> +#include <sys/ioctl.h>
> +#include <sys/time.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +
> +#include <linux/bpf.h>
> +#include <bpf/bpf.h>
> +#include <bpf/libbpf.h>
> +
> +#include "cgroup_helpers.h"
> +#include "bpf_rlimit.h"
> +
> +#define CHECK(condition, tag, format...) ({            \
> +       int __ret = !!(condition);                      \
> +       if (__ret) {                                    \
> +               printf("%s:FAIL:%s ", __func__, tag);   \
> +               printf(format);                         \
> +       } else {                                        \
> +               printf("%s:PASS:%s\n", __func__, tag);  \
> +       }                                               \
> +       __ret;                                          \
> +})
> +
> +static int bpf_find_map(const char *test, struct bpf_object *obj,
> +                       const char *name)
> +{
> +       struct bpf_map *map;
> +
> +       map = bpf_object__find_map_by_name(obj, name);
> +       if (!map)
> +               return -1;
> +       return bpf_map__fd(map);
> +}
> +
> +
> +int main(int argc, char **argv)
> +{
> +       const char *probe_name = "syscalls/sys_enter_nanosleep";
> +       const char *file = "test_pidns_kern.o";
> +       int err, bytes, efd, prog_fd, pmu_fd;
> +       int pidmap_fd, nsidmap_fd;
> +       struct perf_event_attr attr = {};
> +       struct bpf_object *obj;
> +       __u32 knsid = 0;
> +       __u32 key = 0, pid;
> +       int exit_code = 1;
> +       struct stat st;
> +       char buf[256];
> +
> +       err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj, &prog_fd);
> +       if (CHECK(err, "bpf_prog_load", "err %d errno %d\n", err, errno))
> +               goto cleanup_cgroup_env;
> +
> +       nsidmap_fd = bpf_find_map(__func__, obj, "nsidmap");
> +       if (CHECK(nsidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
> +                 nsidmap_fd, errno))
> +               goto close_prog;
> +
> +       pidmap_fd = bpf_find_map(__func__, obj, "pidmap");
> +       if (CHECK(pidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
> +                 pidmap_fd, errno))
> +               goto close_prog;
> +
> +       pid = getpid();
> +       bpf_map_update_elem(pidmap_fd, &key, &pid, 0);
> +
> +       snprintf(buf, sizeof(buf),
> +                "/sys/kernel/debug/tracing/events/%s/id", probe_name);
> +       efd = open(buf, O_RDONLY, 0);
> +       if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))
> +               goto close_prog;
> +       bytes = read(efd, buf, sizeof(buf));
> +       close(efd);
> +       if (CHECK(bytes <= 0 || bytes >= sizeof(buf), "read",
> +                 "bytes %d errno %d\n", bytes, errno))
> +               goto close_prog;
> +
> +       attr.config = strtol(buf, NULL, 0);
> +       attr.type = PERF_TYPE_TRACEPOINT;
> +       attr.sample_type = PERF_SAMPLE_RAW;
> +       attr.sample_period = 1;
> +       attr.wakeup_events = 1;
> +
> +       pmu_fd = syscall(__NR_perf_event_open, &attr, getpid(), -1, -1, 0);
> +       if (CHECK(pmu_fd < 0, "perf_event_open", "err %d errno %d\n", pmu_fd,
> +                 errno))
> +               goto close_prog;
> +
> +       err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
> +       if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n", err,
> +                 errno))
> +               goto close_pmu;
> +
> +       err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
> +       if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n", err,
> +                 errno))
> +               goto close_pmu;
> +
> +       /* trigger some syscalls */
> +       sleep(1);
> +
> +       err = bpf_map_lookup_elem(nsidmap_fd, &key, &knsid);
> +       if (CHECK(err, "bpf_map_lookup_elem", "err %d errno %d\n", err, errno))
> +               goto close_pmu;
> +
> +       if (stat("/proc/self/ns/pid", &st))
> +               goto close_pmu;
> +
> +       if (CHECK(knsid != (__u32) st.st_ino, "compare_namespace_id",
> +                 "kern knsid %u user unsid %u\n", knsid, (__u32) st.st_ino))
> +               goto close_pmu;
> +
> +       exit_code = 0;
> +       printf("%s:PASS\n", argv[0]);
> +
> +close_pmu:
> +       close(pmu_fd);
> +close_prog:
> +       bpf_object__close(obj);
> +cleanup_cgroup_env:
> +       return exit_code;
> +}
> --
> 2.11.0
>
>
>
>
>
>
> On Thu, Aug 08, 2019 at 05:09:51AM +0000, Yonghong Song wrote:
> >
> >
> > On 8/7/19 6:22 PM, Carlos Antonio Neira Bustos wrote:
> > > The code has been modified to avoid syscalls that could sleep.
> > > Please let me know if any other modification is needed.
> > >
> > >  From be0384c0fa209a78c1567936e8db4e35b9a7c0f8 Mon Sep 17 00:00:00 2001
> > > From: Carlos <cneirabustos@gmail.com>
> > > Date: Wed, 7 Aug 2019 20:04:30 -0400
> > > Subject: [PATCH] [PATCH v5 bpf-next] BPF: New helper to obtain namespace data
> > >   from current task
> > >
> > > This helper obtains the active namespace from current and returns pid, tgid,
> > > device and namespace id as seen from that namespace, allowing to instrument
> > > a process inside a container.
> > > Device is read from /proc/self/ns/pid, as in the future it's possible that
> > > different pid_ns files may belong to different devices, according
> > > to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
> > > conference.
> > > Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
> > > scripts but this helper returns the pid as seen by the root namespace which is
> > > fine when a bcc script is not executed inside a container.
> > > When the process of interest is inside a container, pid filtering will not work
> > > if bpf_get_current_pid_tgid() is used. This helper addresses this limitation
> > > returning the pid as it's seen by the current namespace where the script is
> > > executing.
> > >
> > > This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
> > > used to do pid filtering even inside a container.
> > >
> > > For example a bcc script using bpf_get_current_pid_tgid() (tools/funccount.py):
> > >
> > >          u32 pid = bpf_get_current_pid_tgid() >> 32;
> > >          if (pid != <pid_arg_passed_in>)
> > >                  return 0;
> > > Could be modified to use bpf_get_current_pidns_info() as follows:
> > >
> > >          struct bpf_pidns pidns;
> > >          bpf_get_current_pidns_info(&pidns, sizeof(struct bpf_pidns));
> > >          u32 pid = pidns.tgid;
> > >          u32 nsid = pidns.nsid;
> > >          if ((pid != <pid_arg_passed_in>) && (nsid != <nsid_arg_passed_in>))
> > >                  return 0;
> > >
> > > To find out the name PID namespace id of a process, you could use this command:
> > >
> > > $ ps -h -o pidns -p <pid_of_interest>
> > >
> > > Or this other command:
> > >
> > > $ ls -Li /proc/<pid_of_interest>/ns/pid
> > >
> > > Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
> > > ---
> > >   fs/namei.c                                         |   2 +-
> > >   include/linux/bpf.h                                |   1 +
> > >   include/linux/namei.h                              |   4 +
> > >   include/uapi/linux/bpf.h                           |  29 ++++-
> > >   kernel/bpf/core.c                                  |   1 +
> > >   kernel/bpf/helpers.c                               |  78 ++++++++++++
> > >   kernel/trace/bpf_trace.c                           |   2 +
> > >   samples/bpf/Makefile                               |   3 +
> > >   samples/bpf/trace_ns_info_user.c                   |  35 ++++++
> > >   samples/bpf/trace_ns_info_user_kern.c              |  44 +++++++
> > >   tools/include/uapi/linux/bpf.h                     |  29 ++++-
> > >   tools/testing/selftests/bpf/Makefile               |   2 +-
> > >   tools/testing/selftests/bpf/bpf_helpers.h          |   3 +
> > >   .../testing/selftests/bpf/progs/test_pidns_kern.c  |  51 ++++++++
> > >   tools/testing/selftests/bpf/test_pidns.c           | 138 +++++++++++++++++++++
> > >   15 files changed, 418 insertions(+), 4 deletions(-)
> > >   create mode 100644 samples/bpf/trace_ns_info_user.c
> > >   create mode 100644 samples/bpf/trace_ns_info_user_kern.c
> > >   create mode 100644 tools/testing/selftests/bpf/progs/test_pidns_kern.c
> > >   create mode 100644 tools/testing/selftests/bpf/test_pidns.c
> > >
> > > diff --git a/fs/namei.c b/fs/namei.c
> > > index 209c51a5226c..d1eca36972d2 100644
> > > --- a/fs/namei.c
> > > +++ b/fs/namei.c
> > > @@ -19,7 +19,6 @@
> > >   #include <linux/export.h>
> > >   #include <linux/kernel.h>
> > >   #include <linux/slab.h>
> > > -#include <linux/fs.h>
> > >   #include <linux/namei.h>
> > >   #include <linux/pagemap.h>
> > >   #include <linux/fsnotify.h>
> > > @@ -2355,6 +2354,7 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
> > >     putname(name);
> > >     return retval;
> > >   }
> > > +EXPORT_SYMBOL(filename_lookup);
> >
> > No need to export symbols. bpf uses it and bpf is in the core, not in
> > modules.
> >
> > >
> > >   /* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
> > >   static int path_parentat(struct nameidata *nd, unsigned flags,
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index f9a506147c8a..e4adf5e05afd 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -1050,6 +1050,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
> > >   extern const struct bpf_func_proto bpf_strtol_proto;
> > >   extern const struct bpf_func_proto bpf_strtoul_proto;
> > >   extern const struct bpf_func_proto bpf_tcp_sock_proto;
> > > +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
> > >
> > >   /* Shared helpers among cBPF and eBPF. */
> > >   void bpf_user_rnd_init_once(void);
> > > diff --git a/include/linux/namei.h b/include/linux/namei.h
> > > index 9138b4471dbf..2c24e8c71d46 100644
> > > --- a/include/linux/namei.h
> > > +++ b/include/linux/namei.h
> > > @@ -6,6 +6,7 @@
> > >   #include <linux/path.h>
> > >   #include <linux/fcntl.h>
> > >   #include <linux/errno.h>
> > > +#include <linux/fs.h>
> > >
> > >   enum { MAX_NESTED_LINKS = 8 };
> > >
> > > @@ -97,6 +98,9 @@ extern void unlock_rename(struct dentry *, struct dentry *);
> > >
> > >   extern void nd_jump_link(struct path *path);
> > >
> > > +extern int filename_lookup(int dfd, struct filename *name, unsigned int flags,
> > > +               struct path *path, struct path *root);
> >
> > The previous definition in fs/internal.h should be removed.
> >
> > > +
> > >   static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
> > >   {
> > >     ((char *) name)[min(len, maxlen)] = '\0';
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index 4393bd4b2419..6f601f7106e2 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -2741,6 +2741,26 @@ union bpf_attr {
> > >    *                **-EOPNOTSUPP** kernel configuration does not enable SYN cookies
> > >    *
> > >    *                **-EPROTONOSUPPORT** IP packet version is not 4 or 6
> > > + *
> > > + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
> > > + * Description
> > > + *         Copies into *pidns* pid, namespace id and tgid as seen by the
> > > + *         current namespace and also device from /proc/self/ns/pid.
> > > + *         *size_of_pidns* must be the size of *pidns*
> > > + *
> > > + *         This helper is used when pid filtering is needed inside a
> > > + *         container as bpf_get_current_tgid() helper returns always the
> > > + *         pid id as seen by the root namespace.
> > > + * Return
> > > + *         0 on success
> > > + *
> > > + *         **-EINVAL**  if unable to get ns, pid or tgid of current task.
> > > + *         Or if size_of_pidns is not valid.
> >
> > Maybe reword by following the code sequence.
> >     if *size_of_pidns* is not valid or unable to get ns, pid or tgid of
> >     the current task.
> >
> > > + *
> > > + *         **-ENOMEM**  if allocation fails.
> >
> > Maybe some other error codes in filename_lookup() function?
> >
> > > + *
> > > + *         If unable to get the inode from /proc/self/ns/pid an error code
> > > + *         will be returned.
> >
> > You do not need this. The description of error code cases should cover this.
> >
> > >    */
> > >   #define __BPF_FUNC_MAPPER(FN)             \
> > >     FN(unspec),                     \
> > > @@ -2853,7 +2873,8 @@ union bpf_attr {
> > >     FN(sk_storage_get),             \
> > >     FN(sk_storage_delete),          \
> > >     FN(send_signal),                \
> > > -   FN(tcp_gen_syncookie),
> > > +   FN(tcp_gen_syncookie),          \
> > > +   FN(get_current_pidns_info),
> > >
> > >   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > >    * function eBPF program intends to call
> > > @@ -3604,4 +3625,10 @@ struct bpf_sockopt {
> > >     __s32   retval;
> > >   };
> > >
> > > +struct bpf_pidns_info {
> > > +   __u32 dev;
> > > +   __u32 nsid;
> > > +   __u32 tgid;
> > > +   __u32 pid;
> > > +};
> > >   #endif /* _UAPI__LINUX_BPF_H__ */
> > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > > index 8191a7db2777..3159f2a0188c 100644
> > > --- a/kernel/bpf/core.c
> > > +++ b/kernel/bpf/core.c
> > > @@ -2038,6 +2038,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
> > >   const struct bpf_func_proto bpf_get_current_comm_proto __weak;
> > >   const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
> > >   const struct bpf_func_proto bpf_get_local_storage_proto __weak;
> > > +const struct bpf_func_proto bpf_get_current_pidns_info __weak;
> > >
> > >   const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
> > >   {
> > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > > index 5e28718928ca..571f24077db2 100644
> > > --- a/kernel/bpf/helpers.c
> > > +++ b/kernel/bpf/helpers.c
> > > @@ -11,6 +11,12 @@
> > >   #include <linux/uidgid.h>
> > >   #include <linux/filter.h>
> > >   #include <linux/ctype.h>
> > > +#include <linux/pid_namespace.h>
> > > +#include <linux/major.h>
> > > +#include <linux/stat.h>
> > > +#include <linux/namei.h>
> > > +#include <linux/version.h>
> > > +
> > >
> > >   #include "../../lib/kstrtox.h"
> > >
> > > @@ -312,6 +318,78 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
> > >     preempt_enable();
> > >   }
> > >
> > > +BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
> > > +    size)
> > > +{
> > > +   const char *name = "/proc/self/ns/pid";
> >
> > maybe rename this variable to pidns_path?
> >
> > > +   struct pid_namespace *pidns = NULL;
> > > +   struct filename *tmp = NULL;
> >
> > Maybe rename this variable to name?
> >
> > > +   int len = strlen(name) + 1;
> >
> > We can delay this assignment later until it is needed.
> >
> > > +   struct inode *inode;
> > > +   struct path kp;
> > > +   pid_t tgid = 0;
> > > +   pid_t pid = 0;
> > > +   int ret;
> > > +
> > > +   if (unlikely(size != sizeof(struct bpf_pidns_info)))
> > > +           return -EINVAL;
> > > +
> > > +   pidns = task_active_pid_ns(current);
> > > +
> >
> > we can save an empty line here.
> >
> > > +   if (unlikely(!pidns))
> > > +           goto clear;
> > > +
> > > +   pidns_info->nsid =  pidns->ns.inum;
> > > +   pid = task_pid_nr_ns(current, pidns);
> > > +
> >
> > We can save an empty line here.
> >
> > > +   if (unlikely(!pid))
> > > +           goto clear;
> > > +
> > > +   tgid = task_tgid_nr_ns(current, pidns);
> > > +
> > ditto. save an empty line.
> > > +   if (unlikely(!tgid))
> > > +           goto clear;
> > > +
> > > +   pidns_info->tgid = (u32) tgid;
> > > +   pidns_info->pid = (u32) pid;
> > > +
> > > +   tmp = kmem_cache_alloc(names_cachep, GFP_ATOMIC);
> > > +   if (unlikely(!tmp)) {
> > > +           memset((void *)pidns_info, 0, (size_t) size);
> > > +           return -ENOMEM;
> > > +   }
> > > +
> > > +   memcpy((char *)tmp->name, name, len);
> > > +   tmp->uptr = NULL;
> > > +   tmp->aname = NULL;
> > > +   tmp->refcnt = 1;
> > > +
> > ditto. save an empty line.
> > > +   ret = filename_lookup(AT_FDCWD, tmp, 0, &kp, NULL);
> > > +
> > ditto. save an empty line.
> > > +   if (ret) {
> > > +           memset((void *)pidns_info, 0, (size_t) size);
> > > +           return ret;
> > > +   }
> > > +
> > > +   inode = d_backing_inode(kp.dentry);
> > > +   pidns_info->dev = inode->i_sb->s_dev;
> > > +
> > > +   return 0;
> > > +
> > > +clear:
> > > +   memset((void *)pidns_info, 0, (size_t) size);
> > > +
> > save an empty line.
> > > +   return -EINVAL;
> > > +}
> > > +
> > > +const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
> > > +   .func   = bpf_get_current_pidns_info,
> > make the "= " aligned with others?
> > > +   .gpl_only       = false,
> > > +   .ret_type       = RET_INTEGER,
> > > +   .arg1_type      = ARG_PTR_TO_UNINIT_MEM,
> > > +   .arg2_type      = ARG_CONST_SIZE,
> > > +};
> > > +
> > >   #ifdef CONFIG_CGROUPS
> > >   BPF_CALL_0(bpf_get_current_cgroup_id)
> > >   {
> > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > > index ca1255d14576..5e1dc22765a5 100644
> > > --- a/kernel/trace/bpf_trace.c
> > > +++ b/kernel/trace/bpf_trace.c
> > > @@ -709,6 +709,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > >   #endif
> > >     case BPF_FUNC_send_signal:
> > >             return &bpf_send_signal_proto;
> > > +   case BPF_FUNC_get_current_pidns_info:
> > > +           return &bpf_get_current_pidns_info_proto;
> > >     default:
> > >             return NULL;
> > >     }
> > > diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> > > index 1d9be26b4edd..238453ff27d2 100644
> > > --- a/samples/bpf/Makefile
> > > +++ b/samples/bpf/Makefile
> > > @@ -53,6 +53,7 @@ hostprogs-y += task_fd_query
> > >   hostprogs-y += xdp_sample_pkts
> > >   hostprogs-y += ibumad
> > >   hostprogs-y += hbm
> > > +hostprogs-y += trace_ns_info
> > [...]

^ permalink raw reply

* Re: [PATCH net] net: phy: rtl8211f: do a double read to get real time link status
From: Andrew Lunn @ 2019-08-08 19:40 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: Yonglong Liu, davem, netdev, linux-kernel, linuxarm, salil.mehta,
	yisen.zhuang, shiju.jose
In-Reply-To: <b1140603-f05b-2373-445f-c1d7a43ff012@gmail.com>

> @@ -568,6 +568,11 @@ int phy_start_aneg(struct phy_device *phydev)
>  	if (err < 0)
>  		goto out_unlock;
>  
> +	/* The PHY may not yet have cleared aneg-completed and link-up bit
> +	 * w/o this delay when the following read is done.
> +	 */
> +	usleep_range(1000, 2000);
> +

Hi Heiner

Does 802.3 C22 say anything about this?

If this PHY is broken with respect to the standard, i would prefer the
workaround is in the PHY specific driver code, not generic core code.

	   Andrew

^ permalink raw reply

* Re: [PATCH v2 02/15] net: phy: adin: hook genphy_read_abilities() to get_features
From: Heiner Kallweit @ 2019-08-08 19:32 UTC (permalink / raw)
  To: Andrew Lunn, Alexandru Ardelean
  Cc: netdev, devicetree, linux-kernel, davem, robh+dt, mark.rutland,
	f.fainelli
In-Reply-To: <20190808152403.GB27917@lunn.ch>

On 08.08.2019 17:24, Andrew Lunn wrote:
> On Thu, Aug 08, 2019 at 03:30:13PM +0300, Alexandru Ardelean wrote:
>> The ADIN PHYs can operate with Clause 45, however they are not typical for
>> how phylib considers Clause 45 PHYs.
>>
>> If the `features` field & the `get_features` hook are unspecified, and the
>> device wants to operate via Clause 45, it would also try to read features
>> via the `genphy_c45_pma_read_abilities()`, which will try to read PMA regs
>> that are unsupported.
>>
>> Hooking the `genphy_read_abilities()` function to the `get_features` hook
>> will ensure that this does not happen and the PHY features are read
>> correctly regardless of Clause 22 or Clause 45 operation.
> 
> I think we need to stop and think about a PHY which supports both C22
> and C45.
> 
> How does bus enumeration work? Is it discovered twice?  I've always
> considered phydev->is_c45 means everything is c45, not that some
> registers can be accessed via c45. But the driver is mixing c22 and
> c45. Does the driver actually require c45? Are some features which are
> only accessibly via C45? What does C45 actually bring us for this
> device?
> 
genphy_c45_pma_read_abilities() is only called if phydev->is_c45 is set.
And this flag means that the PHY complies with Clause 45 incl. all the
standard devices like PMA. In the case here only some vendor-specific
registers can be accessed via Clause 45 and therefore is_c45 shouldn't
bet set. As a consequence this patch isn't needed.

>      Andrew
> 
Heiner


^ permalink raw reply

* Re: [PATCH v2 13/15] net: phy: adin: configure downshift on config_init
From: Heiner Kallweit @ 2019-08-08 19:38 UTC (permalink / raw)
  To: Alexandru Ardelean, netdev, devicetree, linux-kernel
  Cc: davem, robh+dt, mark.rutland, f.fainelli, andrew
In-Reply-To: <20190808123026.17382-14-alexandru.ardelean@analog.com>

On 08.08.2019 14:30, Alexandru Ardelean wrote:
> Down-speed auto-negotiation may not always be enabled, in which case the
> PHY won't down-shift to 100 or 10 during auto-negotiation.
> 
> This change enables downshift and configures the number of retries to
> default 8 (maximum supported value).
> 
> The change has been adapted from the Marvell PHY driver.
> 
Instead of a fixed downshift setting (like in the Marvell driver) you
may consider to implement the ethtool phy-tunable ETHTOOL_PHY_DOWNSHIFT.
See the Aquantia PHY driver for an example.
Then the user can configure whether he wants downshift and if yes after
how many retries.

> Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
> ---
>  drivers/net/phy/adin.c | 39 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
[...]

Heiner

^ permalink raw reply

* Re: [PATCH net-next 3/3] net: phy: realtek: add support for the 2.5Gbps PHY in RTL8125
From: Andrew Lunn @ 2019-08-08 19:37 UTC (permalink / raw)
  To: Heiner Kallweit; +Cc: Florian Fainelli, David Miller, netdev@vger.kernel.org
In-Reply-To: <64769c3d-42b6-8eb8-26e4-722869408986@gmail.com>

On Thu, Aug 08, 2019 at 09:05:54PM +0200, Heiner Kallweit wrote:
> This adds support for the integrated 2.5Gbps PHY in Realtek RTL8125.
> Advertisement of 2.5Gbps mode is done via a vendor-specific register.
> Same applies to reading NBase-T link partner advertisement.
> Unfortunately this 2.5Gbps PHY shares the PHY ID with the integrated
> 1Gbps PHY's in other Realtek network chips and so far no method is
> known to differentiate them.

That is not nice.

Do you have any contacts in Realtek who can provide us with
information? Maybe there is another undocumented vendor specific
register?

	Andrew

^ permalink raw reply

* Re: [PATCH net-next 1/3] net: phy: prepare phylib to deal with PHY's extending Clause 22
From: Andrew Lunn @ 2019-08-08 19:32 UTC (permalink / raw)
  To: Heiner Kallweit; +Cc: Florian Fainelli, David Miller, netdev@vger.kernel.org
In-Reply-To: <214bedc0-4ae0-b15f-e03c-173f17527417@gmail.com>

On Thu, Aug 08, 2019 at 09:03:42PM +0200, Heiner Kallweit wrote:
> The integrated PHY in 2.5Gbps chip RTL8125 is the first (known to me)
> PHY that uses standard Clause 22 for all modes up to 1Gbps and adds
> 2.5Gbps control using vendor-specific registers. To use phylib for
> the standard part little extensions are needed:
> - Move most of genphy_config_aneg to a new function
>   __genphy_config_aneg that takes a parameter whether restarting
>   auto-negotiation is needed (depending on whether content of
>   vendor-specific advertisement register changed).
> - Don't clear phydev->lp_advertising in genphy_read_status so that
>   we can set non-C22 mode flags before.
> 
> Basically both changes mimic the behavior of the equivalent Clause 45
> functions.
> 
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
> ---
>  drivers/net/phy/phy_device.c | 35 +++++++++++++++--------------------
>  include/linux/phy.h          |  8 +++++++-
>  2 files changed, 22 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index 7ddd91df9..bd7e7db8c 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -1571,11 +1571,9 @@ static int genphy_config_advert(struct phy_device *phydev)
>  	/* Only allow advertising what this PHY supports */
>  	linkmode_and(phydev->advertising, phydev->advertising,
>  		     phydev->supported);
> -	if (!ethtool_convert_link_mode_to_legacy_u32(&advertise,
> -						     phydev->advertising))
> -		phydev_warn(phydev, "PHY advertising (%*pb) more modes than genphy supports, some modes not advertised.\n",
> -			    __ETHTOOL_LINK_MODE_MASK_NBITS,
> -			    phydev->advertising);
> +
> +	ethtool_convert_link_mode_to_legacy_u32(&advertise,
> +						phydev->advertising);

Hi Heiner

linkmode_adv_to_mii_adv_t() would remove the need to use
ethtool_convert_link_mode_to_legacy_u32(), and this warning would also
go away. 

   Andrew

^ permalink raw reply

* Re: [PATCH bpf-next v5 0/6] xdp: Add devmap_hash map type
From: Toke Høiland-Jørgensen @ 2019-08-08 19:30 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Alexei Starovoitov, Network Development,
	David Miller, Jesper Dangaard Brouer, Jakub Kicinski,
	Björn Töpel, Yonghong Song
In-Reply-To: <CAADnVQJpYeQ68V5BE2r3BhbraBh7G8dSd8zknFUJxtW4GwNkuA@mail.gmail.com>

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Fri, Jul 26, 2019 at 9:06 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> This series adds a new map type, devmap_hash, that works like the existing
>> devmap type, but using a hash-based indexing scheme. This is useful for the use
>> case where a devmap is indexed by ifindex (for instance for use with the routing
>> table lookup helper). For this use case, the regular devmap needs to be sized
>> after the maximum ifindex number, not the number of devices in it. A hash-based
>> indexing scheme makes it possible to size the map after the number of devices it
>> should contain instead.
>>
>> This was previously part of my patch series that also turned the regular
>> bpf_redirect() helper into a map-based one; for this series I just pulled out
>> the patches that introduced the new map type.
>>
>> Changelog:
>>
>> v5:
>>
>> - Dynamically set the number of hash buckets by rounding up max_entries to the
>>   nearest power of two (mirroring the regular hashmap), as suggested by Jesper.
>
> fyi I'm waiting for Jesper to review this new version.

Ping Jesper? :)

-Toke

^ permalink raw reply

* Re: [v3,3/4] tools: bpftool: add bash-completion for net attach/detach
From: Daniel T. Lee @ 2019-08-08 19:28 UTC (permalink / raw)
  To: Quentin Monnet; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <5cd88036-8682-8d26-b795-caf96e1e883f@netronome.com>

On Fri, Aug 9, 2019 at 1:48 AM Quentin Monnet
<quentin.monnet@netronome.com> wrote:
>
> 2019-08-07 11:25 UTC+0900 ~ Daniel T. Lee <danieltimlee@gmail.com>
> > This commit adds bash-completion for new "net attach/detach"
> > subcommand for attaching XDP program on interface.
> >
> > Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
> > ---
> >  tools/bpf/bpftool/bash-completion/bpftool | 64 +++++++++++++++++++----
> >  1 file changed, 55 insertions(+), 9 deletions(-)
> >
> > diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
> > index c8f42e1fcbc9..1d81cb09d478 100644
> > --- a/tools/bpf/bpftool/bash-completion/bpftool
> > +++ b/tools/bpf/bpftool/bash-completion/bpftool
> > @@ -201,6 +201,10 @@ _bpftool()
> >              _bpftool_get_prog_tags
> >              return 0
> >              ;;
> > +        dev)
> > +            _sysfs_get_netdevs
> > +            return 0
> > +            ;;
>
> Makes sense to have this for "dev", thanks! But it seems you missed one
> place where it was used, for "bpftool feature probe" (We have "[[ $prev
> == "dev" ]] && _sysfs_get_netdevs && return 0"). Could you also remove
> that one please?
>
> Other than this looks good, thanks:
>
> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>

My bad. Thanks for letting me know.
I'll update it with the next version of patch.

Thank you for your review.
I really appreciate it.

^ permalink raw reply

* Re: [PATCH net] net: phy: rtl8211f: do a double read to get real time link status
From: Heiner Kallweit @ 2019-08-08 19:26 UTC (permalink / raw)
  To: Yonglong Liu, davem, andrew
  Cc: netdev, linux-kernel, linuxarm, salil.mehta, yisen.zhuang,
	shiju.jose
In-Reply-To: <c15f820b-cc80-9a93-4c48-1b60bc14f73a@huawei.com>

On 08.08.2019 08:21, Yonglong Liu wrote:
> 
> 
> On 2019/8/8 14:11, Heiner Kallweit wrote:
>> On 08.08.2019 03:15, Yonglong Liu wrote:
>>>
>>>
>>> On 2019/8/8 0:47, Heiner Kallweit wrote:
>>>> On 07.08.2019 15:16, Yonglong Liu wrote:
>>>>> [   27.232781] hns3 0000:bd:00.3 eth7: net open
>>>>> [   27.237303] 8021q: adding VLAN 0 to HW filter on device eth7
>>>>> [   27.242972] IPv6: ADDRCONF(NETDEV_CHANGE): eth7: link becomes ready
>>>>> [   27.244449] hns3 0000:bd:00.3: invalid speed (-1)
>>>>> [   27.253904] hns3 0000:bd:00.3 eth7: failed to adjust link.
>>>>> [   27.259379] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change UP -> RUNNING
>>>>> [   27.924903] hns3 0000:bd:00.3 eth7: link up
>>>>> [   28.280479] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change RUNNING -> NOLINK
>>>>> [   29.208452] hns3 0000:bd:00.3 eth7: link down
>>>>> [   32.376745] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change NOLINK -> RUNNING
>>>>> [   33.208448] hns3 0000:bd:00.3 eth7: link up
>>>>> [   35.253821] hns3 0000:bd:00.3 eth7: net stop
>>>>> [   35.258270] hns3 0000:bd:00.3 eth7: link down
>>>>>
>>>>> When using rtl8211f in polling mode, may get a invalid speed,
>>>>> because of reading a fake link up and autoneg complete status
>>>>> immediately after starting autoneg:
>>>>>
>>>>>         ifconfig-1176  [007] ....    27.232763: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x00 val:0x1040
>>>>>   kworker/u257:1-670   [015] ....    27.232805: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x04 val:0x01e1
>>>>>   kworker/u257:1-670   [015] ....    27.232815: mdio_access: mii-0000:bd:00.3 write phy:0x07 reg:0x04 val:0x05e1
>>>>>   kworker/u257:1-670   [015] ....    27.232869: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x79ad
>>>>>   kworker/u257:1-670   [015] ....    27.232904: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x09 val:0x0200
>>>>>   kworker/u257:1-670   [015] ....    27.232940: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x00 val:0x1040
>>>>>   kworker/u257:1-670   [015] ....    27.232949: mdio_access: mii-0000:bd:00.3 write phy:0x07 reg:0x00 val:0x1240
>>>>>   kworker/u257:1-670   [015] ....    27.233003: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x79ad
>>>>>   kworker/u257:1-670   [015] ....    27.233039: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x0a val:0x3002
>>>>>   kworker/u257:1-670   [015] ....    27.233074: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x09 val:0x0200
>>>>>   kworker/u257:1-670   [015] ....    27.233110: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x05 val:0x0000
>>>>>   kworker/u257:1-670   [000] ....    28.280475: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x7989
>>>>>   kworker/u257:1-670   [000] ....    29.304471: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x7989
>>>>>
>>>>> According to the datasheet of rtl8211f, to get the real time
>>>>> link status, need to read MII_BMSR twice.
>>>>>
>>>>> This patch add a read_status hook for rtl8211f, and do a fake
>>>>> phy_read before genphy_read_status(), so that can get real link
>>>>> status in genphy_read_status().
>>>>>
>>>>> Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
>>>>> ---
>>>>>  drivers/net/phy/realtek.c | 13 +++++++++++++
>>>>>  1 file changed, 13 insertions(+)
>>>>>
>>>> Is this an accidental resubmit? Because we discussed this in
>>>> https://marc.info/?t=156413509900003&r=1&w=2 and a fix has
>>>> been applied already.
>>>>
>>>> Heiner
>>>>
>>>> .
>>>>
>>>
>>> In https://marc.info/?t=156413509900003&r=1&w=2 , the invalid speed
>>> recurrence rate is almost 100%, and I had test the solution about
>>> 5 times and it works. But yesterday it happen again suddenly, and than
>>> I fount that the recurrence rate reduce to 10%. This time we get 0x79ad
>>> after autoneg started which is not 0x798d from last discussion.
>>>
>>>
>>>
>> OK, I'll have a look.
>> However the approach is wrong. The double read is related to the latching
>> of link-down events. This is done by all PHY's and not specific to RT8211F.
>> Also it's not related to the problem. I assume any sufficient delay would
>> do instead of the read.
>>
>> .
>>
> 
> So you will send a new patch to fix this problem? I am waiting for it,
> and can do a full test this time.
> 
> 
Can you try the following? This delay should give thy PHY enough time
to clear both bits before the following read is done.

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index ef7aa738e..32f327a44 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -568,6 +568,11 @@ int phy_start_aneg(struct phy_device *phydev)
 	if (err < 0)
 		goto out_unlock;
 
+	/* The PHY may not yet have cleared aneg-completed and link-up bit
+	 * w/o this delay when the following read is done.
+	 */
+	usleep_range(1000, 2000);
+
 	if (phy_is_started(phydev))
 		err = phy_check_link_status(phydev);
 out_unlock:
-- 
2.22.0



^ permalink raw reply related

* Re: [v3,1/4] tools: bpftool: add net attach command to attach XDP on interface
From: Daniel T. Lee @ 2019-08-08 19:22 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <20190808104948.7e72dea0@cakuba.netronome.com>

On Fri, Aug 9, 2019 at 2:50 AM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> On Thu, 8 Aug 2019 07:15:22 +0900, Daniel T. Lee wrote:
> > > > +             return -EINVAL;
> > > > +     }
> > > > +
> > > > +     NEXT_ARG();
> > >
> > > nit: the new line should be before NEXT_ARG(), IOV NEXT_ARG() belongs
> > > to the code which consumed the argument
> > >
> >
> > I'm not sure I'm following.
> > Are you saying that, at here the newline shouldn't be necessary?
>
> I mean this is better:
>
>         if (!is_prefix(*argv, "bla-bla"))
>                 return -EINVAL;
>         NEXT_ARG();
>
>         if (!is_prefix(*argv, "bla-bla"))
>                 return -EINVAL;
>         NEXT_ARG();
>
> Than this:
>
>         if (!is_prefix(*argv, "bla-bla"))
>                 return -EINVAL;
>
>         NEXT_ARG();
>         if (!is_prefix(*argv, "bla-bla"))
>                 return -EINVAL;
>
>         NEXT_ARG();
>
> Because the NEXT_ARG() "belongs" to the code that "consumed" the option.
>
> So instead of this:
>
>      attach_type = parse_attach_type(*argv);
>      if (attach_type == max_net_attach_type) {
>              p_err("invalid net attach/detach type");
>              return -EINVAL;
>      }
>
>      NEXT_ARG();
>      progfd = prog_parse_fd(&argc, &argv);
>      if (progfd < 0)
>              return -EINVAL;
>
> This seems more logical to me:
>
>      attach_type = parse_attach_type(*argv);
>      if (attach_type == max_net_attach_type) {
>              p_err("invalid net attach/detach type");
>              return -EINVAL;
>      }
>      NEXT_ARG();
>
>      progfd = prog_parse_fd(&argc, &argv);
>      if (progfd < 0)
>              return -EINVAL;

Oh. I see.
I'll update NEXT_ARG line stick to the code which "consumes" the option.

Thanks for the review! :)

^ permalink raw reply

* Re: [PATCH net v3] net/tls: prevent skb_orphan() from leaking TLS plain text with offload
From: Boris Pismenny @ 2019-08-08 19:18 UTC (permalink / raw)
  To: Jakub Kicinski, davem@davemloft.net
  Cc: netdev@vger.kernel.org, davejwatson@fb.com, Aviad Yehezkel,
	john.fastabend@gmail.com, daniel@iogearbox.net,
	willemb@google.com, edumazet@google.com,
	alexei.starovoitov@gmail.com, oss-drivers@netronome.com
In-Reply-To: <20190808000359.20785-1-jakub.kicinski@netronome.com>



On 08/08/2019 3:03, Jakub Kicinski wrote:
> sk_validate_xmit_skb() and drivers depend on the sk member of
> struct sk_buff to identify segments requiring encryption.
> Any operation which removes or does not preserve the original TLS
> socket such as skb_orphan() or skb_clone() will cause clear text
> leaks.
> 
> Make the TCP socket underlying an offloaded TLS connection
> mark all skbs as decrypted, if TLS TX is in offload mode.
> Then in sk_validate_xmit_skb() catch skbs which have no socket
> (or a socket with no validation) and decrypted flag set.
> 
> Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
> sk->sk_validate_xmit_skb are slightly interchangeable right now,
> they all imply TLS offload. The new checks are guarded by
> CONFIG_TLS_DEVICE because that's the option guarding the
> sk_buff->decrypted member.
> 
> Second, smaller issue with orphaning is that it breaks
> the guarantee that packets will be delivered to device
> queues in-order. All TLS offload drivers depend on that
> scheduling property. This means skb_orphan_partial()'s
> trick of preserving partial socket references will cause
> issues in the drivers. We need a full orphan, and as a
> result netem delay/throttling will cause all TLS offload
> skbs to be dropped.
> 
> Reusing the sk_buff->decrypted flag also protects from
> leaking clear text when incoming, decrypted skb is redirected
> (e.g. by TC).
> 
> See commit 0608c69c9a80 ("bpf: sk_msg, sock{map|hash} redirect
> through ULP") for justification why the internal flag is safe.
> The only location which could leak the flag in is tcp_bpf_sendmsg(),
> which is taken care of by clearing the previously unused bit.
> 
> v2:
>  - remove superfluous decrypted mark copy (Willem);
>  - remove the stale doc entry (Boris);
>  - rely entirely on EOR marking to prevent coalescing (Boris);
>  - use an internal sendpages flag instead of marking the socket
>    (Boris).
> v3 (Willem):
>  - reorganize the can_skb_orphan_partial() condition;
>  - fix the flag leak-in through tcp_bpf_sendmsg.
> 
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

LGTM

Reviewed-by: Boris Pismenny <borisp@mellanox.com>

^ permalink raw reply

* Re: [bpf-next v3 PATCH 3/3] samples/bpf: xdp_fwd explain bpf_fib_lookup return codes
From: Toke Høiland-Jørgensen @ 2019-08-08 19:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov
  Cc: a.s.protopopov, dsahern, ys114321, Jesper Dangaard Brouer
In-Reply-To: <156528106777.22124.12162740342925045912.stgit@firesoul>

Jesper Dangaard Brouer <brouer@redhat.com> writes:

> Make it clear that this XDP program depend on the network
> stack to do the ARP resolution.  This is connected with the
> BPF_FIB_LKUP_RET_NO_NEIGH return code from bpf_fib_lookup().
>
> Another common mistake (seen via XDP-tutorial) is that users
> don't realize that sysctl net.ipv{4,6}.conf.all.forwarding
> setting is honored by bpf_fib_lookup.
>
> Reported-by: Anton Protopopov <a.s.protopopov@gmail.com>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Reviewed-by: David Ahern <dsahern@gmail.com>
> Acked-by: Yonghong Song <yhs@fb.com>

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox